back to index

Implementing Filters in the New Haystack Doc Store


Chapters

0:0 Intro
2:41 Filtering
5:36 Testing Existing Filter Utils
7:57 Making Sense of Filter Utils
10:35 Writing the First Filter
16:26 First Working Filter
18:24 Testing New Filters
21:27 Implementing in the Doc Store
24:2 Testing Pipeline Filters
27:11 Final Issue and Outro

Whisper Transcript | Transcript Only Page

00:00:00.000 | Today we're going to have a look at the current state of the PR for the Haystack Pinecone
00:00:07.680 | document store and we're going to try and work through and maybe solve all of the issues
00:00:15.120 | although I tend to be quite optimistic so maybe that's not going to be realistic but we're going
00:00:21.680 | to give it a go and we're going to go through and try and figure out what we need to do to actually
00:00:27.280 | merge this into the Haystack library. So we'll just scroll down and
00:00:35.280 | so I have Bogdan who has given a lot of feedback on this and the first thing or okay not the first
00:00:46.880 | thing because I need to figure this out as well is make sure typing is compliant with mypy.
00:00:54.480 | I mean for most of it I was just taking methods from other parts of the framework
00:01:00.560 | so I'm not sure if there's been some updates since I started doing that so my types are just
00:01:07.120 | kind of out of date or if it's some other types I've kind of put in there not realized they were
00:01:11.600 | different I'm not sure so we'll figure that out later it's fine. So one big one is to make use
00:01:23.520 | of the filter utils for converting the filters at the moment it's just a method inside the
00:01:30.320 | document store called I think build filter clause because this handle is converting from a
00:01:37.520 | from Haystack syntax of your so Haystack filter syntax to Pinecone filter syntax
00:01:47.760 | so we need to just solve that and what do you think so it's just a load of these like little
00:01:54.480 | ones that we're going to go through so for the so vector dim this is what makes me think I was
00:02:01.600 | looking at some older version this has been deprecated now and now we're just using vector
00:02:08.640 | or embedding dim so let's fix that so anywhere it says vector dim I just need to remove that
00:02:18.960 | first let me just check okay
00:02:26.960 | and just replace all of these so let's replace all of those so embedding dim
00:02:37.840 | okay it's number one easy so this is the more difficult thing I think we need to to figure out
00:02:46.720 | so there's this filter utils script and when I looked it was kind of confusing before so I'm
00:02:56.960 | going to let's take a look now and see if it see if we can figure out so filter utils I think it's
00:03:05.120 | here so it's still in the document store directory and we have this logical filter clause come down
00:03:15.200 | here and ah okay okay this yeah this is why it's so confusing to me so we have convert to elastic
00:03:25.760 | search convert to sql and they're all just empty so maybe I just need to I'm just going to try and
00:03:34.320 | add the convert to pinecone method in here like pretty much as it is already
00:03:41.040 | and hopefully I don't know figure that out but at the same time we have this logical filter clause
00:03:51.440 | object and I'm not sure what that looks like exactly so if we come here class that is able
00:04:00.640 | to pass a filter and convert it into the format that the underlying databases of the document
00:04:05.840 | stores require filters are defined as nested dictionaries keys of dictionaries can be logical
00:04:13.840 | operators comparison operator and so on okay so the logical filter clause is here
00:04:28.400 | then we have the comparison operators as well okay
00:04:31.280 | logical operator keys might take a dictionary metadata field names
00:04:38.080 | so at the moment we are converting these correctly as far as I know
00:04:47.760 | but I'm just not sure how to integrate that with this filter utils so let's have a look at
00:04:58.080 | another document store let's have a look at we aviate and see how they use use that so it's
00:05:07.360 | filter utils logical filter clause so they only use a logical filter clause here they don't use
00:05:14.880 | the other one as order comparison so let's have a look logical filter clause see where it's being
00:05:22.080 | used so if there's filters this is just being called so logical filter clause pass filters
00:05:29.280 | convert to okay kind of need to see what this is so if we if we take
00:05:39.760 | okay let's take this and we'll put it into a document into another file
00:05:50.800 | okay let me create a file here and host like test
00:05:54.160 | and so I'll just call it filter utils test
00:05:58.560 | okay so I'm going to test this but we need to import it so where is
00:06:07.680 | whereas we aviate importing this from here so okay let's take that
00:06:18.240 | import it here should work I'll select the kernel haystack
00:06:23.120 | let's see so filters I'm going to
00:06:31.760 | take a filter from the filter test I created before
00:06:43.360 | yeah anything there we go
00:06:45.680 | let's see what that gives us a filter
00:06:51.760 | or operation wow super weird um convert to elastic convert to elastic search
00:07:10.000 | right so I need to figure this out a little bit
00:07:16.480 | see what we have into
00:07:24.960 | so parse is creating one of these objects
00:07:31.040 | what object is it filter utils or operation
00:07:39.920 | seems really complicated to me
00:07:41.520 | so parse here okay not operation parse value okay not and
00:07:55.280 | so it's like we need to modify the current filter so so we need to modify this to not
00:08:08.240 | take a look directly at the keys but instead look at the objects
00:08:15.520 | created over here so this and operation or or so on we also need to figure out what those are so
00:08:26.080 | they in here they're actually in this script and operation
00:08:33.040 | ah okay
00:08:42.560 | ah ah cool okay so this is where that is happening
00:08:54.080 | so condition convert to elastic search where's condition coming from here um condition
00:09:03.200 | self-lock okay it's just it's stored in there from the logical filter clause
00:09:11.680 | okay so you initialize this logical filter clause which consumes
00:09:23.840 | which consumes where is it um up here somewhere conditions here okay
00:09:32.640 | self-conditions are added to the logical uh filter what's it called again logical filter clause
00:09:42.800 | and then later on when we actually use this and operation
00:09:50.640 | that is inheriting that value from the logical filter clause so
00:09:55.360 | the condition self.conditions is already in here
00:09:58.320 | okay and then in here we need to specify how to deal with each of those items
00:10:14.660 | but then it's a condition convert to we aviate it's a condition in the
00:10:25.360 | operator return
00:10:30.720 | okay so if we do something like
00:10:40.320 | let's just take let's take the we aviate one if we do something
00:10:46.240 | like convert to convert to pine cone
00:10:51.200 | convert to pine cone here
00:11:00.240 | and the operator is not and is that what it is okay so the operator in this case is
00:11:09.120 | and is that right can't remember now so or um let me filter this
00:11:18.480 | where is um okay yeah it's and so and then the operands what is the operands here
00:11:29.440 | um is that relevant
00:11:30.640 | I don't use it here
00:11:34.320 | all right let's see what that returns to us so first I think we also need to pass convert to
00:11:51.920 | convert off in the original or in the first um class so logical filter clause
00:12:00.800 | fine convert to pine cone
00:12:17.280 | and we just pass I'm not going to write the uh description in there yet
00:12:21.680 | so let's go ahead and I'm just going to open store that maybe I can also
00:12:29.920 | let me add it for the or as well so or operation okay oh um no no not that one let me copy that
00:12:43.120 | out okay where's your operation oh here so convert to pine cone again here this time we are using or
00:12:59.280 | I imagine in here maybe I need to create a list
00:13:06.000 | so let's let's just test it first see what we get
00:13:10.960 | so install now let's go back and actually test that so um restart
00:13:21.280 | yeah run this should we should now have so we get the filter date um don't need to do that again
00:13:36.880 | and then from that we should be able to so if we just convert to we aviate for example
00:13:43.120 | we should see something uh okay what happened there did I interrupt
00:13:49.120 | okay let's run again okay fine
00:13:55.120 | pass we get this filter utils or operation convert to we aviate let's see what happens
00:14:06.160 | okay cool and then if we do same again but convert to pine cone attribute error
00:14:17.360 | um okay so I need to add convert pine cone to a lot of things here I think
00:14:30.000 | okay um let's go back and do that then now you just leave it as we aviate we're going to reform
00:14:37.360 | everything I just want to make sure that everything is in place and I can just modify things in there
00:14:42.720 | the thing I think realistically the conversion from to hate to pine cone should be super
00:14:51.680 | straightforward because we don't have all these different syntaxes ours is very aligned to
00:15:00.240 | pine cone so it shouldn't really be an issue or pine cones is very aligned to haystacks
00:15:07.200 | and so on I'm just going to skip forward to this in this bit okay so that should be enough
00:15:15.280 | for that and we can first need to put console again
00:15:24.880 | and then we'll just rerun this and make sure that we're actually returning some sort of
00:15:31.360 | dictionary or filter let's make sure it's everything is in place so restart
00:15:40.160 | okay and then run this see what happens
00:15:48.720 | okay so some issues in there let's try and figure that out okay so we've got this now so we have like
00:15:57.760 | a really basic dictionary coming through so all you really need to do now is modify the code that
00:16:05.840 | we've added to filter utils to align to the logic in here so I think what we'll do is maybe have a
00:16:15.600 | few example filters and kind of figure out how we're going to write that in there so let's start
00:16:24.000 | with that okay so we've got this working now at least for this one filter query which is pretty
00:16:34.000 | cool a lot easier than I was expecting so that's really good and this sort of logic of putting
00:16:43.040 | things or separating everything out I think makes a lot more sense particularly if you consider
00:16:48.960 | how complex some of these other querying syntax must get so before I was thinking oh this seems
00:16:58.080 | a really overly complicated way of doing things but now I think I'm convinced this is probably
00:17:04.000 | one of the best ways to do it so the only one that we don't have is this not operation so I
00:17:15.680 | don't really know what to do with that yet so I'm going to for now I'm just going to leave it and
00:17:20.800 | come back to it later maybe just add like a not implemented error or something like that the only
00:17:32.160 | reason I say that is it would be great to say if you say not this maybe you could invert it by just
00:17:41.200 | looking for all of the opposite values but to do that you'd have to pull out all of the metadata
00:17:46.480 | from a pine cone and I don't think it's possible to do that but something I need to check so for
00:17:55.440 | now I'm just going to ignore this one we'll deal with it later so yeah let's that looks good I
00:18:05.360 | think the only thing now I want to test it on some other filters this is just one filter maybe
00:18:11.680 | it doesn't work on the others so let me find some of those so I think which notebook were those
00:18:21.120 | testing filters and yeah we come down here we have a few other filters as well so let's try those
00:18:29.600 | hopefully it works so filter down here we pass them and then we convert them to pine cone
00:18:45.680 | so convert to pine cone
00:18:48.880 | let's see
00:18:54.400 | let me just remove that so we see it straight away
00:18:59.760 | okay so we have um we have a lot of stuff in here let me just remove this
00:19:08.320 | my Zelda
00:19:09.040 | okay um yeah I mean that looks looks right to me let's try another
00:19:23.360 | there's a testing okay yeah uh we we just did that one so let's try this one
00:19:44.240 | let's see what we get so convert to pine cone just this again
00:19:53.200 | okay um
00:19:58.240 | yeah and this looks good as well
00:20:06.800 | so I think I think we're probably ready to actually go ahead and test that in here so
00:20:15.040 | let me take one of these um I'll take some of these filters I'm just going to apply directly
00:20:23.040 | to pine cone to make sure it's actually working so to do this I have to create a load of vectors
00:20:30.400 | and I don't want me to do that I just want to create like a couple of vectors so if I go to
00:20:40.640 | pine cone demo come here it's okay so this one this one's easier
00:20:46.880 | yeah so this one only does it with like six um so I'm gonna run through this and see if it works
00:20:56.160 | so initialize I think we've already created this dictionary this document store um oh the other
00:21:04.960 | thing we need to do is actually implement it in pine cone so that's important so in here at the
00:21:13.360 | moment I'm going to go to filter I'm using this uh build filter clause I don't want that anymore
00:21:21.360 | so I actually want to just remove this
00:21:23.920 | and then let's find where we're applying filters
00:21:30.800 | okay let's have a look at right document see what I do
00:21:34.800 | okay so here we're using self build filter clause so we need to change this to use
00:21:44.320 | the same um logic as the other document source so so I look at um we aviate and I think
00:21:52.480 | okay uh ignore that we aviate filters see where they okay here this is what we want to do
00:22:05.040 | so the filter this just use this nothing more than that so it's pretty simple
00:22:13.440 | so the filters are just going to be converted into that so but they we use filter dict here but
00:22:19.760 | we just use filters
00:22:23.440 | convert to pine cone
00:22:26.960 | and I'm just going to search for this that will tell us where else we are
00:22:36.160 | using that so search for this oh it's only the only here
00:22:41.760 | oh really okay so that's that's it it's the only bit we need to change we also need to import this
00:22:48.880 | so it's what is it logical filter okay import this
00:22:55.440 | and we should be good to go
00:23:04.400 | okay now let's go ahead and pip install that
00:23:09.840 | and then we can test it and hopefully it will work okay so come over here pine cone demo
00:23:21.920 | let's restart that rerun it
00:23:26.320 | and so we've already got a document store here so I don't think we need to
00:23:33.280 | write these documents again yeah can ignore that dense passage retriever we do need
00:23:42.640 | we don't need to update the embedding so let me just move this to a new cell
00:23:48.800 | run it we do need this
00:23:56.960 | and then okay let's make sure it's working and there's answers and then we will try and apply
00:24:05.360 | the metadata filter so filters just something really simple here and then maybe we can like
00:24:15.760 | modify them a little bit to see what else you can get okay it looks good just going to run that
00:24:26.880 | just so I can see maybe some of the metadata we have here it's just this one again I want to
00:24:34.320 | return another name so I can add it like a crane or statement and see what that looks like but
00:24:41.680 | for now it's fine okay let's run it okay cool so it looks like filtering actually worked
00:24:55.280 | we're only returning two here because based on that filter there's only
00:24:59.920 | two items in the out of the six so that that's really cool let's get all the documents so
00:25:08.640 | and let's get some other names so we have this one
00:25:17.360 | let's add that into our into our filter so we're going to go we're going to use all here so
00:25:27.840 | want or
00:25:30.880 | this or this
00:25:43.920 | okay let's run that I think that's in the right syntax
00:25:48.240 | there we go so now we're returning more samples we're returning all of them now because they're
00:25:55.120 | either from from this one or from this one which is really cool and we can we can double check that
00:26:04.800 | is the case so I can't see anything there I return all documents and we'll go forward for duck in
00:26:15.680 | all ducks
00:26:17.440 | is it duck can I do let's just check that this works no okay so
00:26:29.760 | and see what's in there it's a document
00:26:32.640 | meta cool document meta and then name
00:26:38.400 | so let's just print all those out to make sure that is the case and just print
00:26:46.240 | cool so it looks like it's working
00:26:58.080 | so that is I think most
00:27:00.560 | of the issues on the pull request I think there's one other maybe one other big one at least
00:27:09.120 | okay so the typing with mypy so we can see that if we I'm not going to go through that now
00:27:19.040 | but I can at least show you what that error is so
00:27:24.800 | so we come to type check here
00:27:29.280 | and we can see in the test with mypy there's a load of
00:27:35.280 | typing issues like here all of these all typing issues so that's something to fix as well but
00:27:44.560 | we're not going to go through that now I think it's not that interesting yeah I think that's it
00:27:50.800 | for this video I hope it's been useful to just kind of go through that and like interesting to
00:27:57.760 | see how haystack actually works and particularly haystack filtering works in the in the library
00:28:05.680 | so yeah thank you very much for watching I hope it's been useful and I will see you in the next