Testing the New Haystack Doc Store
Chapters
0:00 Intro
1:19 Demo Start and Install
3:25 Initialization
6:30 Download and Write Documents
10:55 Extractive QA Pipeline
11:23 Fetch by ID
19:01 Metadata Filtering
22:24 Get All Documents
00:00:00.000 |
Today, we're going to continue going through the Haystack and Pinecone integration that 00:00:08.000 |
We're just going to go through a demo notebook that I've put together and see how everything 00:00:15.500 |
This is essentially how I'm testing the document store as I build it. 00:00:22.500 |
And then in, or maybe not this video, I think the next video, we're going to actually go 00:00:28.120 |
through the final steps to actually integrate this into the Haystack library. 00:00:34.880 |
So you can kind of see that in progress at the moment. 00:00:40.320 |
We have this pull request and the document store works, but it's not perfect. 00:00:46.920 |
There's a few Haystack specific things that I've missed out. 00:00:52.400 |
So you can see the review comments down here. 00:00:55.780 |
So the filtering in particular, I've implemented it in one way using, I think, a slightly old 00:01:04.840 |
document store and that's not how they do filtering now, at least, or it is how they 00:01:10.320 |
do it, but it's not implemented in the right way. 00:01:19.520 |
So yeah, let's go through the demo and then we'll leave that for later. 00:01:24.620 |
So for now, if you wanted to test this at the moment, it's not in Haystack, the pull 00:01:33.440 |
So if you want to clone this, which I'm going to be going through, I'm going to git clone 00:01:45.280 |
And then from there, we want to activate the pinecone.store branch. 00:01:55.520 |
So I'm going to cd into the correct Haystack directory. 00:02:17.360 |
I have a Haystack environment that I've set up. 00:02:37.480 |
So I think this should already be installed, the most recent one for me. 00:02:46.080 |
So that has installed and then we can go ahead and actually use this. 00:02:51.120 |
So we set this to make sure we're using the Haystack environment. 00:03:00.040 |
So this is not, so I have the Haystack version that I've been working on in this directory. 00:03:10.160 |
I've just been making a load of notebooks here to test how things work and figure everything 00:03:17.440 |
And this is a notebook I'm using to test the Pinecone document store. 00:03:23.680 |
So I initialize it first with this Pinecone from Haystack document stores, import Pinecone 00:03:36.400 |
And this is just using, so we have the environment. 00:03:38.440 |
So these are Pinecone specific arguments here. 00:03:44.520 |
So I'm using the default one, which is us-west1-gcp. 00:03:50.120 |
And then my key, so the key I am getting from in here. 00:04:06.240 |
If you, if you like have just signed up for this, it will come up with like your name 00:04:13.120 |
So you just click on that and then you go to API keys here. 00:04:19.120 |
I said, I'm just using my default one and I can just copy over here. 00:04:23.040 |
And then I've placed it, it's not the best way to do it, but I just find it easy. 00:04:28.840 |
I've just placed it in a, in a file called secret and I'm reading from that. 00:04:34.360 |
You can also put it in, you know, slightly worse way of doing this. 00:04:37.480 |
You can put it in directly in the notebook as well. 00:04:41.240 |
I would, I wouldn't even recommend doing this, I would recommend doing that even less. 00:04:46.800 |
Otherwise you want to enter the API key into your environment variables and then you can 00:04:53.800 |
just load it with os.environ, something like this, like API key. 00:05:09.080 |
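The options just described could be sketched roughly as follows. This is only a minimal sketch: the environment variable name `PINECONE_API_KEY` and the helper `load_api_key` are assumptions made for illustration, and the fallback file name `secret` mirrors the file mentioned above.

```python
import os
from pathlib import Path


def load_api_key(env_var: str = "PINECONE_API_KEY", fallback_file: str = "secret") -> str:
    """Return the API key from the environment, falling back to a local file.

    Both the variable name and this helper are hypothetical, not part of any library.
    """
    # Preferred option: read the key from an environment variable.
    key = os.environ.get(env_var)
    if key and key.strip():
        return key.strip()
    # Fallback: read it from a local file that is kept out of version control.
    path = Path(fallback_file)
    if path.is_file():
        return path.read_text().strip()
    raise RuntimeError(f"No API key found in {env_var} or in file '{fallback_file}'")
```

Pasting the key directly into the notebook stays the least safe option, since notebooks tend to get shared or committed.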
So okay, so we've initialized the document store here. 00:05:13.540 |
We can see, so we have the embedding dimensions by default, this is 768. 00:05:18.960 |
So if you're using a model that doesn't use that dimensionality, which most models do, 00:05:25.680 |
but maybe you're using, for example, like the MiniLM models, they use dimensionality 00:05:34.720 |
So I think the embedding, what is it that, no, okay. 00:05:43.000 |
Let me have a look at the document source so I can see which argument it is. 00:06:01.360 |
Okay, so we come down to here and the, oh yeah, okay. 00:06:10.320 |
So vector dim would be like 384, for example, but in this case it's fine. 00:06:16.040 |
The default value of 768 is, it's not a problem. 00:06:21.880 |
And then we see record count is zero because we haven't inserted anything into our document 00:06:27.800 |
So we're going to go ahead and actually do that. 00:06:31.120 |
But to do that, we do need to download some documents. 00:06:39.440 |
This is just using one of Haystack's like demo notebooks. 00:06:46.040 |
So I'm just importing these, oops, these things here, clean_wiki_text, all these things. 00:06:53.160 |
And then we convert these to, these files to dictionaries. 00:06:57.200 |
I run that and then we can see those dictionaries in here. 00:07:03.040 |
So we have this content and then, yeah, we get all of our documents here. 00:07:09.040 |
We can see we, the documents have this particular format where we have content, which is like 00:07:21.240 |
And then we also have metadata and we have the name of the file in here and we could 00:07:27.440 |
So it aligns quite well with the format that Pinecone consumes. 00:07:38.780 |
Now one thing that we have here at the moment that I don't want is too much data because 00:07:45.680 |
at the moment I'm working from my Mac and it's not particularly powerful. 00:07:51.720 |
So what I'm going to do is, we have a lot of bits here. 00:07:56.600 |
So if I just have a look, we have, yeah, 2.4 or almost 2.5 thousand examples. 00:08:05.320 |
It's going to take a long time to embed all of those, all the contents into vectors on 00:08:14.680 |
So I am going to just take the first maybe 10 or just 6 examples because it will be quicker. 00:08:33.280 |
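As a rough illustration of the document format and of trimming the list down, here is a minimal sketch. The documents are made-up stand-ins, and the exact key names (`content`, `meta`, `name`) are an assumption about what the conversion step produces.

```python
# Made-up stand-ins for the dictionaries created from the downloaded files.
# Each one has the text under "content" and the file name in its metadata.
docs = [
    {"content": f"Text of article {i}", "meta": {"name": f"article_{i}.txt"}}
    for i in range(20)
]

# Embedding everything is slow on a small machine, so keep only the first six.
small_docs = docs[:6]

print(len(docs), len(small_docs))
```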
And then here we come to writing the dictionaries containing documents to our database. 00:08:41.440 |
Now this doesn't do any, this doesn't upsert them into Pinecone because we, the current 00:08:47.800 |
implementation uses both a SQL database to store the long contents and also Pinecone 00:09:05.360 |
So this is just going to write everything to the local SQL database. 00:09:08.840 |
So I run that, we've written those documents very quick, obviously we only have 6. 00:09:15.960 |
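The split between the two stores can be pictured with a small sketch. This is not the document store's actual code: it uses an in-memory SQLite table and a plain dictionary as a stand-in for the Pinecone index, and the table layout and the `write_documents` helper are assumptions for illustration only.

```python
import sqlite3
from typing import Dict, List


def write_documents(conn: sqlite3.Connection, docs: List[dict]) -> None:
    """Store ids, text and file names in SQL. No vectors are created here."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS document (id TEXT PRIMARY KEY, content TEXT, name TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO document VALUES (?, ?, ?)",
        [(d["id"], d["content"], d["meta"]["name"]) for d in docs],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
vector_index: Dict[str, List[float]] = {}  # stand-in for the vector index

docs = [
    {"id": str(i), "content": f"Text of article {i}", "meta": {"name": f"article_{i}.txt"}}
    for i in range(6)
]
write_documents(conn, docs)

n_rows = conn.execute("SELECT COUNT(*) FROM document").fetchone()[0]
print(n_rows, len(vector_index))  # the SQL side is filled, the vector side is still empty
```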
And then we come here and I can initialize the dense retriever. 00:09:20.880 |
So in this example, we're using this Facebook DPR model. 00:09:27.840 |
And we'll stick with that, obviously you can swap the other models as well. 00:09:36.600 |
And I'm going to update, so once I've initialized the retriever model, I can update the embeddings 00:09:43.720 |
using that retriever model in a small batch size. 00:09:46.760 |
Now 16 is pretty big, we don't actually need to do that many, so 2, just doing batches 00:09:55.680 |
So yeah, let's run that, shouldn't take too long. 00:09:59.920 |
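The batched embedding update can be sketched like this. It is a simplified illustration, not the library's implementation: `toy_embed` stands in for the real retriever model, and a dictionary stands in for the vector index.

```python
from typing import Callable, Dict, Iterator, List


def batched(items: List[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def update_embeddings(
    docs: List[dict],
    embed: Callable[[List[str]], List[List[float]]],
    index: Dict[str, List[float]],
    batch_size: int = 2,
) -> int:
    """Embed documents batch by batch, upsert the vectors, return the batch count."""
    n_batches = 0
    for batch in batched(docs, batch_size):
        vectors = embed([d["content"] for d in batch])
        for doc, vec in zip(batch, vectors):
            index[doc["id"]] = vec  # upsert: id -> embedding
        n_batches += 1
    return n_batches


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Hypothetical two-dimensional 'embedding', used only to make the sketch runnable."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


docs = [{"id": str(i), "content": f"Text of article {i}"} for i in range(6)]
index: Dict[str, List[float]] = {}
n_batches = update_embeddings(docs, toy_embed, index, batch_size=2)
print(n_batches, len(index))
```

A smaller batch size means more round trips but less memory per step, which is the trade-off being made here on a slower machine.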
Okay, so we've just processed those, that's, yeah, it's finished running. 00:10:05.560 |
So now we have those vectors and the metadata and everything in Pinecone and also in the 00:10:14.000 |
And we can see that as well, so if we go to our Pinecone console again, so if I, okay, 00:10:22.900 |
so it's loading and we should see in a moment, okay, we have this index document here. 00:10:30.720 |
So if I click on that, it'll give me index information, zoom out a little bit. 00:10:40.640 |
And we'll see the number of vectors in there at the moment is 6, okay. 00:10:44.080 |
So we have those 6 items or documents in Pinecone as expected. 00:10:54.760 |
Now from there, we can set up our QA pipeline, so we're performing extractive QA or open 00:11:04.720 |
So we run that, this might take a moment just to initialize everything in Haystack, okay. 00:11:12.780 |
So that has run and now we can actually start asking questions. 00:11:18.040 |
Now at the moment, this is going to just return a load of rubbish because we only have six 00:11:24.400 |
documents in there, none of those documents talk about any of this. 00:11:31.700 |
So we're not going to return anything relevant here, but we just want to, or at least during 00:11:37.380 |
testing, all I'm doing is making sure that everything is pointing to the right place 00:11:42.900 |
and actually processing in a way that I would expect. 00:11:46.200 |
So when I run this, I would expect a prediction to be returned, I would expect five answers 00:11:54.040 |
from that prediction and I would expect it to run. 00:11:58.000 |
So here we can see it's inference, six examples here, so if we come up here, okay, so we have, 00:12:09.640 |
so because we are retrieving 10 here, where we retrieved those 10 examples, but in our 00:12:17.760 |
case there's only six examples, so if I, let me reduce this to five and we'll do three 00:12:28.040 |
So we're returning all six examples and then this inferencing samples, where it has a loading 00:12:33.840 |
bar, that's referring to the reader model, taking a look at that single example and scoring 00:12:40.560 |
it and pulling out a specific answer from that. 00:12:44.560 |
So we only have six examples here, so we only see inferencing examples six times. 00:12:51.200 |
So now if I reduce top K in the retriever to five and reader, I'm just reducing a little 00:12:56.960 |
bit because typically that is a lower value than your retriever top K, we should see now 00:13:05.840 |
that there's inferencing samples five times rather than six. 00:13:10.680 |
So with one, two, three, four, and five, okay? 00:13:19.480 |
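The relationship between the two top K values can be shown with a toy pipeline. This is a sketch under stated assumptions: `toy_reader_score` is a made-up scoring function standing in for the reader model, and the passages are placeholders.

```python
from typing import List, Tuple


def toy_reader_score(question: str, passage: str) -> float:
    """Made-up stand-in for a reader: share of question words found in the passage."""
    words = question.lower().split()
    return sum(w in passage.lower() for w in words) / len(words)


def run_pipeline(
    question: str, passages: List[str], retriever_top_k: int, reader_top_k: int
) -> Tuple[int, List[Tuple[float, str]]]:
    # The retriever cannot return more passages than the store holds.
    candidates = passages[:retriever_top_k]
    # The reader runs once per retrieved passage (one progress-bar step each).
    scored = [(toy_reader_score(question, p), p) for p in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Only the best reader_top_k answers are returned.
    return len(candidates), scored[:reader_top_k]


passages = [f"passage number {i}" for i in range(6)]
passes_a, _ = run_pipeline("who created this", passages, retriever_top_k=10, reader_top_k=5)
passes_b, answers = run_pipeline("who created this", passages, retriever_top_k=5, reader_top_k=3)
print(passes_a, passes_b, len(answers))
```

With six stored passages, asking the retriever for ten still gives six reader passes, while asking for five gives five.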
So now we're running or looping through each of those five return examples, inferencing 00:13:28.440 |
And then we should be able to use this print answers and we should get something from them. 00:13:35.120 |
Now we can see that although it's not returning the correct answer because the correct answer 00:13:41.400 |
is not in there, we only have those six contents, it is at least returning something that would 00:13:51.520 |
So we're saying who, specifying who created this vocabulary and it's returning the name 00:13:58.960 |
Okay, so in one of these, in one of these contents, there's a name of person. 00:14:05.440 |
So it's pulling out the name of person because it knows like we're asking a question about 00:14:10.400 |
So the answer that would naturally be the name of a person. 00:14:14.240 |
It's not the right answer, but at least it's, you know, almost being logical in the answer 00:14:24.100 |
So as well as that, I want to, I wanted to test the other functions, not just the querying 00:14:32.400 |
in Haystack, but also getting your documents by IDs. 00:14:36.040 |
So if we run this, these are just the IDs that have been assigned to the different documents. 00:14:47.480 |
So we have a document, we have the content here, and we also have, we have the embedding 00:14:54.460 |
as well, if you need that, the metadata and it goes on for a little while. 00:15:03.160 |
So if I open in the text editor so we can see everything. 00:15:10.060 |
So we have all of these and then we have the second. 00:15:12.820 |
So just here you see on the right, we have the second document, so it's returning two 00:15:30.720 |
And the last thing that I want to check is, okay, is the metadata filtering working? 00:15:39.500 |
So I actually tested this a lot more than what we have here. 00:15:45.560 |
So we can, maybe you can switch over to that notebook because then we can at least see, 00:15:51.060 |
you know, what level of filtering we can actually do here. 00:15:58.580 |
Now again, this, this will change slightly as well. 00:16:03.580 |
So let's make sure this is the right notebook. 00:16:10.140 |
So what I will do rather than running all of this again, I'm just going to take this. 00:16:16.900 |
This is using a different, a different index, however. 00:16:23.720 |
So I'll just show you this rather than running through it. 00:16:28.700 |
So in this notebook, I had more time to run this. 00:16:33.460 |
I think I didn't run it on my Mac, I ran it on my, my other computer, which is a lot faster. 00:16:43.380 |
So in this notebook, what we're doing is using the SQuAD dataset, only 4,000 examples. 00:16:50.900 |
I wanted, I still wanted it to be quick, we're specifying that we want to create a squad 00:16:57.020 |
index in the Pinecone document store, rather than default document index. 00:17:08.820 |
And then here, I'm just getting all of the contexts from SQuAD. 00:17:15.020 |
So I'm getting all the contexts and all the titles. 00:17:16.860 |
The titles are just for my reference, so I can modify the metadata filters later. 00:17:21.900 |
The contexts are what we actually encode and store. 00:17:28.180 |
See I'm writing them there and then here, I'm initializing dense passage retriever. 00:17:35.460 |
This time, rather than, so we're using dense passage retriever here, DPR, but we can replace 00:17:42.620 |
this with other sentence-transformers models as well, and it still works. 00:17:49.140 |
I don't know if there's any particular reason not to do this, but it works. 00:17:54.260 |
So you can do it, or at least in this case, it works. 00:18:01.340 |
It's a smaller model, so I thought it's easier, a little bit faster. 00:18:08.340 |
And then I updated the embeddings. For the reader model, I just use the default stuff here again. 00:18:15.980 |
And then I run this, so, which college did Notre Dame add in 1921. 00:18:21.540 |
We have all the top_k stuff here, and in this case, it should return a relevant answer. 00:18:28.140 |
And we see, so it's the way that Haystack returns, I was a little confused at first, 00:18:34.900 |
Haystack returns like a small segment of your context or contents, not the full contents. 00:18:41.260 |
So we get the answer, which is College of Commerce, and then we get the context that 00:18:53.300 |
So by 1921, the addition of the College of Commerce, so it is returning the right answer 00:19:01.940 |
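The idea of returning only a small segment of the contents around the answer can be sketched as a simple character window. This is an illustration of the concept, not how Haystack computes it: the `context_window` helper, the window size, and the sample passage are all made up for the example.

```python
def context_window(text: str, start: int, end: int, window: int = 30) -> str:
    """Return the answer span plus up to `window` characters on each side."""
    left = max(0, start - window)
    right = min(len(text), end + window)
    return text[left:right]


# Made-up passage, used only to demonstrate the windowing.
passage = (
    "Some earlier text about the university. By 1921, the addition of the "
    "College of Commerce changed things. Some later text follows here."
)
answer = "College of Commerce"
start = passage.index(answer)
snippet = context_window(passage, start, start + len(answer), window=10)
print(snippet)
```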
And then I wanted to obviously begin adding filters using, specifically with Haystack's 00:19:08.700 |
So there's a lot of testing here to make sure different filters work with the code that 00:19:16.780 |
So this is Haystack filtering context or syntax. 00:19:25.540 |
The only difference between this and, for example, Pinecone's is that we need to have 00:19:33.740 |
And I also think that this here would be like a single dictionary, and this would also be 00:19:51.840 |
But the Haystack syntax is slightly different, obviously they're inspired by the same original 00:19:58.680 |
like syntax or filtering, and they just use this dictionary, okay? 00:20:07.200 |
So there's a method in the document store at the moment, which I can open to show you. 00:20:16.560 |
So if I come down here, we go to filter, build filter clause. 00:20:24.400 |
So this is what is handling the translation from Haystack syntax to Pinecone's syntax. 00:20:37.580 |
So there'll probably be some iterations to make it a bit cleaner, possibly, if possible. 00:20:43.520 |
But there's a lot of, you know, if and elif stuff in here, depending on what syntax we're 00:20:50.280 |
It's kind of hard to make this particularly clean. 00:20:56.400 |
But that's what's handling that, and that is called whenever you have a filter specified 00:21:10.000 |
So this, yep, we have a very simple filter in Haystack syntax. 00:21:26.280 |
And this will just be translated into Pinecone's syntax and actually work. 00:21:32.280 |
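To give a feel for what such a translation involves, here is a simplified sketch. It is not the method in the document store. It assumes a Haystack-style filter where logical operators (`$and`, `$or`) map to a dictionary of conditions, a list value means "any of these", and a bare value means equality, and it assumes a Pinecone-style filter where logical operators take a list of clauses. The function names and the `title` and `year` fields are hypothetical.

```python
from typing import Any, Dict, List

LOGICAL_OPS = {"$and", "$or"}


def _clauses(filters: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one dictionary of conditions into a list of single-condition clauses."""
    out: List[Dict[str, Any]] = []
    for key, value in filters.items():
        if key in LOGICAL_OPS:
            # Nested dictionary of conditions becomes a list of clauses.
            out.append({key: _clauses(value)})
        elif isinstance(value, dict):
            # Explicit comparison operators: one clause per operator.
            out.extend({key: {op: operand}} for op, operand in value.items())
        elif isinstance(value, list):
            # A list of values means "field is any of these".
            out.append({key: {"$in": value}})
        else:
            # A bare value means equality.
            out.append({key: {"$eq": value}})
    return out


def to_pinecone_filter(filters: Dict[str, Any]) -> Dict[str, Any]:
    """Translate the assumed Haystack-style filter into the assumed Pinecone-style one."""
    clauses = _clauses(filters)
    # Several top-level conditions are combined with an implicit AND.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


simple = to_pinecone_filter({"title": ["Age_of_Enlightenment"]})
layered = to_pinecone_filter({"$or": {"title": {"$eq": "A"}, "year": {"$gte": 1900}}})
print(simple)
print(layered)
```

Even in this reduced form the branching on value types is visible, which is why the real method ends up with a lot of if and elif handling.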
And you can see that when you actually run this, you have, I think, one -- oh, you have 00:21:40.160 |
this or, so you have only items from the Age of Enlightenment, or you have this single 00:21:49.840 |
So you can see those, which is pretty useful. 00:21:53.360 |
And then I'm just modifying those filters, testing some different things, just going 00:22:08.800 |
We have these almost like layered statements. 00:22:17.520 |
So that's the filtering, testing, and then I think from there, that's pretty much it. 00:22:27.060 |
So if I can run this, and if I run that, we see that we actually do return all those documents. 00:22:34.120 |
So that's another method I wanted to make sure was working. 00:22:37.800 |
So I think that's it for like going through how you would actually use this document store. 00:22:44.080 |
Again, there's still things that will be changed, but none of these methods, as far as I know, 00:22:52.040 |
Possibly the -- I think I did see in a comment that maybe the vector dimension. 00:23:03.720 |
But otherwise, everything else should stay the same. 00:23:14.640 |
So thank you very much for watching, and I will see you in the next one.