
Testing the New Haystack Doc Store


Chapters

0:00 Intro
1:19 Demo Start and Install
3:25 Initialization
6:30 Download and Write Documents
10:55 Extractive QA Pipeline
14:24 Fetch by ID
19:01 Metadata Filtering
22:24 Get All Documents


00:00:00.000 | Today, we're going to continue going through the Haystack and Pinecone integration that
00:00:07.000 | I've been building.
00:00:08.000 | We're just going to go through a demo notebook that I've put together and see how everything
00:00:14.500 | works.
00:00:15.500 | This is essentially how I'm testing the document store as I build it.
00:00:22.500 | And then in, or maybe not this video, I think the next video, we're going to actually go
00:00:28.120 | through the final steps to actually integrate this into the Haystack library.
00:00:34.880 | So you can kind of see that in progress at the moment.
00:00:40.320 | We have this pull request and the document store works, but it's not perfect.
00:00:46.920 | There's a few Haystack specific things that I've missed out.
00:00:52.400 | So you can see the review comments down here.
00:00:55.780 | So the filtering in particular, I've implemented it in one way using, I think, a slightly old
00:01:04.840 | document store and that's not how they do filtering now, at least, or it is how they
00:01:10.320 | do it, but it's not implemented in the right way.
00:01:13.600 | So we'll go through that.
00:01:15.280 | And then a few other little things.
00:01:19.520 | So yeah, let's go through the demo and then we'll leave that for later.
00:01:24.620 | So for now, if you wanted to test this at the moment, it's not in Haystack, the pull
00:01:31.200 | request hasn't been accepted.
00:01:33.440 | So if you want to clone this, which I'm going to be going through, I'm going to git clone
00:01:40.240 | and you're cloning from pinecone-io/haystack.
00:01:45.280 | And then from there, we want to activate the pinecone.store branch.
00:01:50.700 | So I'm going to actually do that here.
00:01:55.520 | So I'm going to cd into the correct Haystack directory.
00:02:02.360 | So documents, projects, Haystack.
00:02:08.680 | And then in here, I can check this.
00:02:11.460 | So I am already on this branch.
00:02:14.240 | So it's not really an issue for me.
00:02:17.360 | I have a Haystack environment that I've set up.
00:02:21.320 | So I'm going to come back to the Haystack.
00:02:27.360 | Okay.
00:02:31.560 | What was the last thing?
00:02:32.560 | Okay.
00:02:33.560 | And then pip install.
00:02:34.760 | So pip install.
00:02:37.480 | So I think this should already be installed, the most recent one for me.
00:02:41.360 | So it should be pretty quick.
00:02:44.320 | Okay.
00:02:46.080 | So that has installed and then we can go ahead and actually use this.
00:02:51.120 | So we set this to make sure we're using the Haystack environment.
00:02:57.880 | And then let's go through everything.
00:03:00.040 | So this is not, so I have the Haystack version that I've been working on in this directory.
00:03:08.680 | Now we're in this Haystack test directory.
00:03:10.160 | I've just been making a load of notebooks here to test how things work and figure everything out.
00:03:17.440 | And this is a notebook I'm using to test the Pinecone document store.
00:03:23.680 | So I initialize it first with this: from haystack.document_stores import
00:03:30.640 | PineconeDocumentStore.
00:03:31.640 | So I can run that.
00:03:36.400 | And this is just using, so we have the environment.
00:03:38.440 | So these are Pinecone specific arguments here.
00:03:42.600 | So the environment I want to run it on.
00:03:44.520 | So I'm using the default one, which is us-west1-gcp.
00:03:50.120 | And then my key, so the key I am getting from in here.
00:03:57.600 | So I have my, it says app.pinecone.io.
00:04:02.640 | This is my Haystack project.
00:04:06.240 | If you, if you like have just signed up for this, it will come up with like your name
00:04:11.160 | default project.
00:04:13.120 | So you just click on that and then you go to API keys here.
00:04:19.120 | I said, I'm just using my default one and I can just copy over here.
00:04:23.040 | And then I've placed it, it's not the best way to do it, but I just find it easy.
00:04:28.840 | I've just placed it in a, in a file called secret and I'm reading from that.
00:04:34.360 | You can also put it in, you know, slightly worse way of doing this.
00:04:37.480 | You can put it in directly in the notebook as well.
00:04:41.240 | I would, I wouldn't even recommend doing this, I would recommend doing that even less.
00:04:46.800 | Otherwise you want to enter the API key into your environment variables and then you can
00:04:53.800 | just load it with os.environ, something like this, like API key.
00:05:02.800 | Okay.
00:05:04.720 | I'm just loading it from file.
00:05:06.640 | I'll find it easier.
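A minimal sketch of the two key-loading options described above, assuming an environment variable named `PINECONE_API_KEY` (a hypothetical name, not given in the video) with the `secret` file as a fallback. The helper `load_api_key` is illustrative only.

```python
import os
from pathlib import Path


def load_api_key(env_var: str = "PINECONE_API_KEY", fallback_file: str = "secret") -> str:
    """Return the API key from the environment, else from a local file."""
    # Preferred: an environment variable, so the key never lives in the notebook.
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    # Fallback used in the video: a plain-text file next to the notebook.
    path = Path(fallback_file)
    if path.is_file():
        return path.read_text().strip()
    raise RuntimeError(f"Set {env_var} or create a file named {fallback_file!r}")


# Demo with a placeholder value (not a real key).
os.environ["PINECONE_API_KEY"] = "demo-key-123"
api_key = load_api_key()
print(api_key)
```

If the file fallback is used, it is worth keeping that file out of version control.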
00:05:09.080 | So okay, so we've initialized the document store here.
00:05:13.540 | We can see, so we have the embedding dimensions by default, this is 768.
00:05:18.960 | So if you're using a model that doesn't use that dimensionality, which most models do,
00:05:25.680 | but maybe you're using, for example, like the mini LM models, they use dimensionality
00:05:31.200 | of 368 maybe [384].
00:05:34.720 | So I think the embedding, what is it that, no, okay.
00:05:43.000 | Let me have a look at the document source so I can see which argument it is.
00:06:01.360 | Okay, so we come down to here and the, oh yeah, okay.
00:06:05.400 | It's vector_dim.
00:06:06.400 | I was expecting it to come up, weird.
00:06:10.320 | So vector_dim would be like 368 [384], for example, but in this case it's fine.
00:06:16.040 | The default value of 768 is, it's not a problem.
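The initialization arguments mentioned so far could be collected roughly like this. This is a sketch under assumptions: `vector_dim` is the argument name heard in the video for the in-progress branch and may be renamed, and `build_store_kwargs` and `make_store` are hypothetical helpers. The store itself is not created here because that needs the branch installed and a real key.

```python
def build_store_kwargs(api_key: str, vector_dim: int = 768) -> dict:
    """Collect the arguments mentioned in the walkthrough."""
    return {
        "api_key": api_key,
        "environment": "us-west1-gcp",  # default environment named in the video
        "vector_dim": vector_dim,       # name used on the in-progress branch; may change
    }


def make_store(**kwargs):
    # Imported lazily so this sketch can be read without the branch installed.
    from haystack.document_stores import PineconeDocumentStore
    return PineconeDocumentStore(**kwargs)


# A 384-dimensional model needs vector_dim overridden; 768 is the default.
kwargs = build_store_kwargs(api_key="demo-key-123", vector_dim=384)
print(kwargs)
# document_store = make_store(**kwargs)  # needs the branch installed and a real key
```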
00:06:21.880 | And then we see record count is zero because we haven't inserted anything into our document
00:06:26.800 | store yet.
00:06:27.800 | So we're going to go ahead and actually do that.
00:06:31.120 | But to do that, we do need to download some documents.
00:06:35.720 | So I'm importing a few things here.
00:06:39.440 | This is just using one of Haystack's like demo notebooks.
00:06:46.040 | So I'm just importing these, oops, these things here, clean_wiki_text, all these things.
00:06:53.160 | And then we convert these to, these files to dictionaries.
00:06:57.200 | I run that and then we can see those dictionaries in here.
00:07:03.040 | So we have this content and then, yeah, we get all of our documents here.
00:07:09.040 | We can see we, the documents have this particular format where we have content, which is like
00:07:14.800 | the length of text or the big chunk of text.
00:07:18.200 | Let me go up to this one actually.
00:07:21.240 | And then we also have metadata and we have the name of the file in here and we could
00:07:25.880 | put all the metadata in here as well.
00:07:27.440 | So it aligns quite well with the format that Pinecone consumes.
00:07:35.360 | So that's useful at least.
00:07:38.780 | Now one thing that we have here at the moment that I don't want is too much data because
00:07:45.680 | at the moment I'm working from my Mac and it's not particularly powerful.
00:07:51.720 | So what I'm going to do is, we have a lot of bits here.
00:07:56.600 | So if I just have a look, we have, yeah, 2.4 or almost 2.5 thousand examples.
00:08:05.320 | It's going to take a long time to embed all of those, all the contents into vectors on
00:08:13.280 | my Mac.
00:08:14.680 | So I am going to just take the first maybe 10 or just 6 examples because it will be quicker.
00:08:26.860 | And I can still test everything with that.
00:08:31.040 | So I'll do that.
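As a rough illustration of the document format and the truncation step, assuming the dictionaries carry text under `content` and extra fields under `meta` (the key names and the stand-in data below are assumptions, not the notebook's actual output):

```python
# Stand-in documents in the shape described above: text under "content",
# extra fields under "meta". The real ones come from Haystack's helper functions.
docs = [
    {"content": f"Text of article {i}", "meta": {"name": f"article_{i}.txt"}}
    for i in range(2400)  # placeholder count, roughly the size mentioned in the video
]

# Keep only the first six so embedding stays fast on a laptop.
small_docs = docs[:6]

print(len(docs), len(small_docs))
print(small_docs[0])
```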
00:08:33.280 | And then here we come to writing the dictionaries containing documents to our database.
00:08:41.440 | Now this doesn't do any, this doesn't upsert them into Pinecone because we, the current
00:08:47.800 | implementation uses both a SQL database to store the long contents and also Pinecone
00:08:55.120 | to store the vectors and the metadata.
00:08:58.080 | At the moment we don't have a retriever set.
00:09:03.960 | We do that next.
00:09:05.360 | So this is just going to write everything to the local SQL database.
00:09:08.840 | So I run that, we've written those documents very quick, obviously we only have 6.
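The split between the two stores can be pictured with a toy model. This is not the branch's implementation: an in-memory SQLite table stands in for the SQL database, a plain dict stands in for the Pinecone index, metadata is left out, and the function names only mirror the steps described above.

```python
import sqlite3

# Toy stand-ins: SQLite holds the long text, a dict plays the vector index.
sql = sqlite3.connect(":memory:")
sql.execute("CREATE TABLE document (id TEXT PRIMARY KEY, content TEXT)")
vector_index = {}  # id -> vector


def write_documents(docs):
    """Store text only; nothing reaches the vector index yet."""
    for i, doc in enumerate(docs):
        sql.execute("INSERT INTO document VALUES (?, ?)", (str(i), doc["content"]))
    sql.commit()


def update_embeddings(embed):
    """Embed every stored text and upsert the vector under the same id."""
    for doc_id, content in sql.execute("SELECT id, content FROM document").fetchall():
        vector_index[doc_id] = embed(content)


docs = [{"content": f"passage {i}"} for i in range(6)]
write_documents(docs)
rows_after_write = sql.execute("SELECT COUNT(*) FROM document").fetchone()[0]
vectors_after_write = len(vector_index)

# A fake embedding function: the text length as a one-dimensional vector.
update_embeddings(lambda text: [float(len(text))])
vectors_after_update = len(vector_index)

print(rows_after_write, vectors_after_write, vectors_after_update)  # 6 0 6
```

The vector count stays at zero until the embedding step runs, which matches the record count of zero seen right after writing.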
00:09:15.960 | And then we come here and I can initialize the dense retriever.
00:09:20.880 | So in this example, we're using this Facebook DPR model.
00:09:27.840 | And we'll stick with that, obviously you can swap the other models as well.
00:09:34.000 | So we'll leave that there.
00:09:36.600 | And I'm going to update, so once I've initialized the retriever model, I can update the embeddings
00:09:43.720 | using that retriever model in a small batch size.
00:09:46.760 | Now 16 is pretty big, we don't actually need to do that many, so 2, just doing batches
00:09:53.040 | of 2.
00:09:55.680 | So yeah, let's run that, shouldn't take too long.
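To make the batch size concrete, here is a small sketch of how six documents split into batches of two. The `batches` helper is hypothetical and only illustrates the idea; it is not how the document store itself is written.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


doc_ids = ["d0", "d1", "d2", "d3", "d4", "d5"]
for batch in batches(doc_ids, batch_size=2):
    print(batch)
```

With a batch size of 16, all six documents would go through in a single batch.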
00:09:59.920 | Okay, so we've just processed those, that's, yeah, it's finished running.
00:10:05.560 | So now we have those vectors and the metadata and everything in Pinecone and also in the
00:10:13.000 | SQL database.
00:10:14.000 | And we can see that as well, so if we go to our Pinecone console again, so if I, okay,
00:10:22.900 | so it's loading and we should see in a moment, okay, we have this index, document, here.
00:10:30.720 | So if I click on that, it'll give me index information, zoom out a little bit.
00:10:40.640 | And we'll see the number of vectors in there at the moment is 6, okay.
00:10:44.080 | So we have those 6 items or documents in Pinecone as expected.
00:10:54.760 | Now from there, we can set up our QA pipeline, so we're performing extractive QA or
00:11:01.880 | open-domain question answering.
00:11:04.720 | So we run that, this might take a moment just to initialize everything in Haystack, okay.
00:11:12.780 | So that has run and now we can actually start asking questions.
00:11:18.040 | Now at the moment, this is going to just return a load of rubbish because we only have six
00:11:24.400 | documents in there, none of those documents talk about any of this.
00:11:31.700 | So we're not going to return anything relevant here, but we just want to, or at least during
00:11:37.380 | testing, all I'm doing is making sure that everything is pointing to the right place
00:11:42.900 | and actually processing in a way that I would expect.
00:11:46.200 | So when I run this, I would expect a prediction to be returned, I would expect five answers
00:11:54.040 | from that prediction and I would expect it to run.
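That kind of smoke test can be written down as a small check. A sketch, assuming the prediction is a dictionary with an `answers` list; the `check_prediction` helper and the placeholder result are hypothetical.

```python
def check_prediction(prediction: dict, expected_answers: int) -> None:
    """Raise AssertionError if the result does not have the expected shape."""
    assert "answers" in prediction, "no 'answers' key in prediction"
    assert len(prediction["answers"]) == expected_answers, (
        f"expected {expected_answers} answers, got {len(prediction['answers'])}"
    )


# Placeholder result; a real one would come from the pipeline's run call.
fake_prediction = {"query": "placeholder question", "answers": ["a1", "a2", "a3", "a4", "a5"]}
check_prediction(fake_prediction, expected_answers=5)
print("shape ok")
```

The check says nothing about answer quality, which is the point here: with six unrelated documents only the plumbing can be verified.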
00:11:58.000 | So here we can see it's Inferencing Samples, six examples here, so if we come up here, okay, so we have,
00:12:09.640 | so because we are retrieving 10 here, we retrieved those 10 examples, but in our
00:12:17.760 | case there's only six examples, so if I, let me reduce this to five and we'll do three
00:12:25.520 | here, but let me explain it quickly.
00:12:28.040 | So we're returning all six examples and then this Inferencing Samples, where it has a loading
00:12:33.840 | bar, that's referring to the reader model, taking a look at that single example and scoring
00:12:40.560 | it and pulling out a specific answer from that.
00:12:44.560 | So we only have six examples here, so we only see Inferencing Samples six times.
00:12:51.200 | So now if I reduce top_k in the retriever to five and reader, I'm just reducing a little
00:12:56.960 | bit because typically that is a lower value than your retriever top_k, we should see now
00:13:05.840 | that there's Inferencing Samples five times rather than six.
00:13:10.680 | So with one, two, three, four, and five, okay?
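The relationship between the two top_k values and the number of reader passes can be sketched with a toy model. Everything here is a simplification: `simulate_query` is hypothetical, it assumes one reader pass per retrieved document and at most one answer per document, and the `run_params` dictionary only shows the general shape in which such values are usually handed to a Haystack pipeline.

```python
def simulate_query(num_docs: int, retriever_top_k: int, reader_top_k: int):
    """Return (reader inference passes, answers returned) for a toy pipeline."""
    # The retriever cannot return more documents than the store holds.
    retrieved = min(retriever_top_k, num_docs)
    # The reader scores each retrieved document once.
    inference_passes = retrieved
    # Assumes at most one extracted answer per document, for simplicity.
    answers = min(reader_top_k, retrieved)
    return inference_passes, answers


# Shape of the per-node parameters; node names depend on how the pipeline was built.
run_params = {"Retriever": {"top_k": 5}, "Reader": {"top_k": 3}}

print(simulate_query(num_docs=6, retriever_top_k=10, reader_top_k=5))  # (6, 5)
print(simulate_query(
    num_docs=6,
    retriever_top_k=run_params["Retriever"]["top_k"],
    reader_top_k=run_params["Reader"]["top_k"],
))  # (5, 3)
```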
00:13:19.480 | So now we're running or looping through each of those five returned examples, inferencing
00:13:25.200 | the reader model, extracting the answers.
00:13:28.440 | And then we should be able to use this print_answers and we should get something from them.
00:13:35.120 | Now we can see that although it's not returning the correct answer because the correct answer
00:13:41.400 | is not in there, we only have those six contents, it is at least returning something that would
00:13:48.680 | make sense from a syntax point of view.
00:13:51.520 | So we're saying who, specifying who created this vocabulary and it's returning the name
00:13:57.960 | of a person.
00:13:58.960 | Okay, so in one of these, in one of these contents, there's the name of a person.
00:14:05.440 | So it's pulling out the name of a person because it knows like we're asking a question about
00:14:08.800 | who created something.
00:14:10.400 | So the answer that would naturally be the name of a person.
00:14:14.240 | It's not the right answer, but at least it's, you know, almost being logical in the answer
00:14:20.000 | that it is returning, okay?
00:14:24.100 | So as well as that, I want to, I wanted to test the other functions, not just the querying
00:14:32.400 | in Haystack, but also getting your documents by IDs.
00:14:36.040 | So if we run this, these are just the IDs that have been assigned to the different documents.
00:14:45.140 | We should get two sets of answers.
00:14:47.480 | So we have a document, we have the content here, and we also have, we have the embedding
00:14:54.460 | as well, if you need that, the metadata and it goes on for a little while.
00:15:03.160 | So if I open in the text editor so we can see everything.
00:15:10.060 | So we have all of these and then we have the second.
00:15:12.820 | So just here you see on the right, we have the second document, so it's returning two
00:15:17.180 | of those.
00:15:18.780 | So that's good.
00:15:19.780 | That seems to be working as we expect.
00:15:23.740 | Okay, so let's minimize that.
00:15:30.720 | And the last thing that I want to check is, okay, is metadata filtering working?
00:15:39.500 | So I actually tested this a lot more than what we have here.
00:15:45.560 | So we can, maybe we can switch over to that notebook because then we can at least see,
00:15:51.060 | you know, what level of filtering we can actually do here.
00:15:58.580 | Now again, this, this will change slightly as well.
00:16:03.580 | So let's make sure this is the right notebook.
00:16:10.140 | So what I will do rather than running all of this again, I'm just going to take this.
00:16:16.900 | This is using a different, a different index, however.
00:16:23.720 | So I'll just show you this rather than running through it.
00:16:27.700 | Okay.
00:16:28.700 | So in this notebook, I had more time to run this.
00:16:33.460 | I think I didn't run it on my Mac, I ran it on my, my other computer, which is a lot faster.
00:16:43.380 | So in this notebook, what we're doing is using the SQuAD dataset, only 4,000 examples.
00:16:50.900 | I wanted, I still wanted it to be quick. We're specifying that we want to create a squad
00:16:57.020 | index in the Pinecone document store, rather than the default document index.
00:17:02.580 | We're using a different model as well.
00:17:04.060 | So vector_dim is 384 in this example.
00:17:08.820 | And then here, I'm just getting the, all of the contexts from SQuAD.
00:17:15.020 | So I'm getting all the contexts and all the titles.
00:17:16.860 | The titles are just for my reference, so I can modify the metadata filters later.
00:17:21.900 | The contexts are what we actually encode and store.
00:17:28.180 | See I'm writing them there and then here, I'm initializing dense passage retriever.
00:17:35.460 | This time, rather than, so we're using dense passage retriever here, DPR, but we can replace
00:17:42.620 | this with other sentence-transformers models as well, and it still works.
00:17:49.140 | I don't know if there's any particular reason not to do this, but it works.
00:17:54.260 | So you can do it, or at least in this case, it works.
00:17:58.160 | So I use sentence transformers here.
00:18:01.340 | It's a smaller model, so I thought it's easier, a little bit faster.
00:18:08.340 | And then I updated the embeddings. For the reader model, I just use the default stuff here again.
00:18:15.980 | And then I run this, so: which college did Notre Dame add in 1921?
00:18:21.540 | We have all the top_k stuff here, and in this case, it should return a relevant answer.
00:18:28.140 | And we see, so it's the way that Haystack returns, I was a little confused at first,
00:18:34.900 | Haystack returns like a small segment of your context or contents, not the full contents.
00:18:41.260 | So we get the answer, which is College of Commerce, and then we get the context that
00:18:46.740 | that is pulled from, right?
00:18:50.420 | So that's quite cool.
00:18:53.300 | So by 1921, the addition of the College of Commerce, so it is returning the right answer
00:18:58.660 | there.
00:18:59.660 | And that's, I think, really cool to see.
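The answer-plus-context shape described above can be imitated with plain string slicing. This is only an illustration of the idea, not how the reader model works internally: `answer_with_context` is a hypothetical helper, the passage is a made-up placeholder sentence, and the window width is arbitrary.

```python
def answer_with_context(text: str, answer: str, window: int = 20) -> dict:
    """Return the answer plus a slice of text around it, not the full passage."""
    start = text.find(answer)
    if start == -1:
        raise ValueError("answer not found in text")
    end = start + len(answer)
    # Clamp the window so it never runs past either end of the passage.
    lo = max(0, start - window)
    hi = min(len(text), end + window)
    return {"answer": answer, "context": text[lo:hi]}


# Placeholder passage, not the actual dataset text.
passage = (
    "By 1921, with the addition of the College of Commerce, "
    "the university had grown from a small college."
)
result = answer_with_context(passage, "College of Commerce")
print(result["answer"])
print(result["context"])
```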
00:19:01.940 | And then I wanted to obviously begin adding filters using, specifically with Haystack's
00:19:07.700 | filtering syntax.
00:19:08.700 | So there's a lot of testing here to make sure different filters work with the code that
00:19:14.140 | I wrote.
00:19:16.780 | So this is Haystack filtering context or syntax.
00:19:25.540 | The only difference between this and, for example, Pinecone's is that we need to have
00:19:30.140 | this.
00:19:31.140 | Okay, so we would have that.
00:19:33.740 | And I also think that this here would be like a single dictionary, and this would also be
00:19:43.500 | a single dictionary, right?
00:19:45.780 | And then we would remove that dictionary.
00:19:48.300 | I think that's Pinecone's version of it.
00:19:51.840 | But the Haystack syntax is slightly different, obviously they're inspired by the same original
00:19:58.680 | like syntax or filtering, and they just use this dictionary, okay?
00:20:07.200 | So there's a method in the document store at the moment, which I can open to show you.
00:20:14.880 | This is going to change though.
00:20:16.560 | So if I come down here, we go to filter, build filter clause.
00:20:24.400 | So this is what is handling the translation from Haystack syntax to Pinecone's syntax.
00:20:34.180 | It's relatively messy.
00:20:37.580 | So there'll probably be some iterations to make it a bit cleaner, possibly, if possible.
00:20:43.520 | But there's a lot of, you know, if and elif stuff in here, depending on what syntax we're
00:20:49.280 | looking at.
00:20:50.280 | It's kind of hard to make this particularly clean.
00:20:56.400 | But that's what's handling that, and that is called whenever you have a filter specified
00:21:01.740 | in your query.
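To give a feel for what such a translation involves, here is a deliberately small sketch. It is not the method on the branch: `to_pinecone_filter` is hypothetical and only handles a flat dictionary of field to value or list of values, mapping lists to `$in`, single values to `$eq`, and joining several fields with `$and`. The field names are placeholders.

```python
def to_pinecone_filter(simple_filter: dict) -> dict:
    """Translate {"field": [values]} into Pinecone-style operator syntax."""
    clauses = []
    for field, values in simple_filter.items():
        if isinstance(values, (list, tuple)):
            # A list of allowed values becomes an $in clause.
            clauses.append({field: {"$in": list(values)}})
        else:
            # A single value becomes an equality clause.
            clauses.append({field: {"$eq": values}})
    if not clauses:
        return {}
    if len(clauses) == 1:
        return clauses[0]
    # Several fields must all match, so they are combined with $and.
    return {"$and": clauses}


print(to_pinecone_filter({"title": ["Age_of_Enlightenment"]}))
print(to_pinecone_filter({"title": ["A", "B"], "year": 1921}))
```

Nested or layered conditions would need recursion on top of this, which is where the if and elif branching tends to pile up.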
00:21:03.600 | So let's go back to this.
00:21:10.000 | So this, yep, we have a very simple filter in Haystack syntax.
00:21:17.000 | Does it make sense?
00:21:18.000 | Probably not.
00:21:19.000 | But I wanted to make sure it worked.
00:21:26.280 | And this will just be translated into Pinecone's syntax and actually work.
00:21:32.280 | And you can see that when you actually run this, you have, I think, one -- oh, you have
00:21:40.160 | this or, so you have only items from the Age of Enlightenment, or you have this single
00:21:46.840 | context that are returned.
00:21:49.840 | So you can see those, which is pretty useful.
00:21:53.360 | And then I'm just modifying those filters, testing some different things, just going
00:21:59.440 | through those.
00:22:00.840 | So it's relatively simple testing.
00:22:05.320 | There's not a lot going on here.
00:22:06.680 | A little more complex here.
00:22:08.800 | We have these almost like layered statements.
00:22:14.320 | But it's still pretty straightforward.
00:22:17.520 | So that's the filtering testing, and then I think from there, that's pretty much it.
00:22:24.880 | The only other thing was get_all_documents.
00:22:27.060 | So if I can run this, and if I run that, we see that we actually do return all those documents.
00:22:34.120 | So that's another method I wanted to make sure was working.
00:22:37.800 | So I think that's it for like going through how you would actually use this document store.
00:22:44.080 | Again, there's still things that will be changed, but none of these methods, as far as I know,
00:22:51.040 | should be.
00:22:52.040 | Possibly the -- I think I did see in a comment that maybe the vector dimension.
00:22:59.800 | So here, this might change.
00:23:03.720 | But otherwise, everything else should stay the same.
00:23:07.360 | So yeah, that's it for this walkthrough.
00:23:09.840 | I hope it's been useful or interesting.
00:23:14.640 | So thank you very much for watching, and I will see you in the next one.