
Testing the New Haystack Doc Store


Chapters

0:00 Intro
1:19 Demo Start and Install
3:25 Initialization
6:30 Download and Write Documents
10:55 Extractive QA Pipeline
11:23 Fetch by ID
19:01 Metadata Filtering
22:24 Get All Documents

Transcript

Today, we're going to continue going through the Haystack and Pinecone integration that I've been building. We're just going to go through a demo notebook that I've put together and see how everything works. This is essentially how I'm testing the document store as I build it. And then, probably in the next video, we'll go through the final steps to actually integrate this into the Haystack library.

You can see that in progress at the moment: we have this pull request, and the document store works, but it's not perfect. There are a few Haystack-specific things I've missed, and you can see the review comments down here. The filtering in particular I implemented by following a slightly older document store, and that's not quite how they structure filtering now; or rather, the behavior is right, but it isn't implemented in the right way.

So we'll go through that, plus a few other little things, but we'll leave those for later. For now, let's go through the demo. At the moment this isn't in Haystack, since the pull request hasn't been accepted yet, so if you want to test it you need to clone the repo, which I'm going to do now, coming from pinecone.io/haystack.

And then from there, we want to check out the pinecone.store branch. So I'm going to do that here: I cd into the correct Haystack directory, Documents/projects/haystack, and then I can check the branch. I'm already on it, so it's not really an issue for me.

I have a Haystack environment that I've set up, so I'm going to switch back into that. And then pip install from the cloned repo. I think the most recent version should already be installed for me, so it should be pretty quick.

Okay. So that has installed, and then we can go ahead and actually use this. I make sure the notebook is using the Haystack environment, and then let's go through everything. So I have the Haystack version that I've been working on in this directory.

Now we're in this Haystack test directory. I've just been making a load of notebooks here to test how things work and figure everything out, and this is the notebook I'm using to test the Pinecone document store. So I initialize it first with "from haystack.document_stores import PineconeDocumentStore".

So I can run that. These are Pinecone-specific arguments here: the environment I want to run it in, where I'm using the default, us-west1-gcp, and then my API key, which I'm getting from the Pinecone console.

So at app.pinecone.io, this is my Haystack project. If you've just signed up, it will come up as "your name default project". You just click on that and then go to API Keys here. As I said, I'm just using my default one, and I can copy it from here.

Then I've placed it in a file called secret and I'm reading from that. It's not the best way to do it, but I just find it easy. You can also, in a slightly worse way of doing this, put it directly in the notebook.

I wouldn't really recommend my approach, and I'd recommend hardcoding it in the notebook even less. Ideally, you'd put the API key into your environment variables and then load it with something like os.environ["API_KEY"]. But I'm just loading it from file; I find it easier.
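As a rough sketch, those options look something like this (the file name secret matches the video; the environment variable name PINECONE_API_KEY is just an assumption):

```python
import os
from pathlib import Path

# option 1: read the key from a local file called "secret", as in the video
api_key = Path("secret").read_text().strip()

# option 2 (generally preferable): read it from an environment variable,
# falling back to the file-based key if the variable isn't set
api_key = os.getenv("PINECONE_API_KEY", api_key)
```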

Okay, so we've initialized the document store here. We can see the embedding dimension, which by default is 768. Most models use that dimensionality, but if you're using one that doesn't, for example the MiniLM models, which use a dimensionality of 384, you'll need to change it.

Let me have a look at the document store source so I can see which argument that is. Okay, so we come down to here, and there it is: it's vector_dim.

So you would set vector_dim to 384, for example. In this case, though, the default value of 768 is not a problem. And then we see the record count is zero, because we haven't inserted anything into our document store yet. Before we do, here's roughly what we've set up so far.
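This is a sketch of the initialization, with the argument names matching what's shown in the video; treat the exact signature as subject to change, since the pull request was still under review:

```python
from haystack.document_stores import PineconeDocumentStore

document_store = PineconeDocumentStore(
    api_key=api_key,             # the Pinecone API key loaded above
    environment="us-west1-gcp",  # the default Pinecone environment
    vector_dim=768,              # drop to 384 for e.g. MiniLM-based models
)
```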

Now, to insert documents, we need to download some first. So I'm importing a few things here; this is just borrowing from one of Haystack's demo notebooks. I'm importing these utilities, clean_wiki_text and so on, and then we convert the downloaded files to dictionaries.

I run that, and then we can see those dictionaries here. We get all of our documents, and they have this particular format where we have content, which is the big chunk of text.

Let me go up to this one, actually. Then we also have a meta field, which holds the name of the file, and we could put any other metadata in there as well. So it aligns quite well with the format that Pinecone consumes, which is useful.
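A minimal sketch of that step, assuming the demo utilities from Haystack's tutorials at the time (the archive URL is the one used in those tutorials, an assumption here, and the exact helper names may differ between versions):

```python
from haystack.utils import clean_wiki_text, convert_files_to_dicts, fetch_archive_from_http

# download and unpack a demo dataset of Wikipedia text files
doc_dir = "data/wiki"
url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=url, output_dir=doc_dir)

# convert the text files into Haystack's dictionary format, e.g.
# {"content": "...the big chunk of text...", "meta": {"name": "some_file.txt"}}
docs = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)
```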

Now, one thing I don't want here at the moment is too much data, because I'm working from my Mac and it's not particularly powerful. If I have a look at what we've got, we have almost 2,500 examples.

It's going to take a long time to embed all of those contents into vectors on my Mac. So I'm just going to take the first six examples, because it will be quicker and I can still test everything with that. So I'll do that.

Then here we come to writing the dictionaries containing our documents to the database. This doesn't upsert them into Pinecone yet, because the current implementation uses both a local SQL database to store the long contents and Pinecone to store the vectors and the metadata, and at the moment we don't have a retriever set up.
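The write step itself is minimal; a quick sketch assuming the standard document store API:

```python
docs = docs[:6]  # keep just the first six documents so embedding stays quick
document_store.write_documents(docs)  # written to the local SQL database for now
```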

We'll set that up next; for now, this just writes everything to the local SQL database. I run that and we've written those documents very quickly, since obviously we only have six. Then we come here and I can initialize the dense retriever. In this example, we're using Facebook's DPR models.

We'll stick with that, though obviously you can swap in other models as well. Once I've initialized the retriever model, I can update the embeddings using it with a small batch size. A batch size of 16 is more than we need here, so I'm just doing batches of 2.
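A sketch of those two steps, assuming Haystack's DensePassageRetriever with its usual default DPR checkpoints (the video doesn't show the exact model names):

```python
from haystack.nodes import DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)

# embed the six documents and upsert vectors + metadata into Pinecone
document_store.update_embeddings(retriever, batch_size=2)
```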

So let's run that; it shouldn't take too long. Okay, it's finished running, so now we have those vectors and the metadata in Pinecone, and everything in the SQL database as well. We can see that if we go back to our Pinecone console: once it loads, we have this "document" index here.

If I click on that, it gives me the index information; let me zoom out a little. We see the number of vectors in there at the moment is six, so we have those six documents in Pinecone as expected. Now from there, we can set up our QA pipeline: we're performing extractive, open domain question answering.
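A sketch of that setup, using Haystack's FARMReader and ExtractiveQAPipeline (the reader checkpoint here is a typical choice, not necessarily the one used in the video):

```python
from haystack.nodes import FARMReader
from haystack.pipelines import ExtractiveQAPipeline

# reader model that extracts answer spans from retrieved documents
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
pipe = ExtractiveQAPipeline(reader=reader, retriever=retriever)
```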

So we run that; it might take a moment to initialize everything in Haystack. Okay, that has run, and now we can actually start asking questions. At the moment, this is going to return a load of rubbish, because we only have six documents in there and none of them cover the topic of the question.

So we're not going to get anything relevant here, but during testing all I'm doing is making sure that everything is pointing to the right place and processing in the way I'd expect. So when I run this, I expect a prediction to be returned, I expect five answers from that prediction, and I expect it to run without errors.

Here we can see it's inferencing six examples. That's because we set the retriever to retrieve ten, but in our case there are only six documents in the index. Let me reduce the retriever value to five, and we'll do three for the reader, but first let me explain it quickly.

So we're returning all six examples, and then this "Inferencing samples" loading bar refers to the reader model taking a single example, scoring it, and pulling out a specific answer. We only have six examples here, so we see "Inferencing samples" six times.

Now if I reduce top_k in the retriever to five, and reduce the reader's top_k a little too, since that's typically a lower value than your retriever top_k, we should see "Inferencing samples" five times rather than six. And we do: one, two, three, four, five, okay?
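In code, that query looks something like this, assuming Haystack's standard pipeline params syntax (the question is an approximation of the one asked in the video):

```python
from haystack.utils import print_answers

prediction = pipe.run(
    query="Who created this vocabulary?",  # approximated placeholder question
    params={
        "Retriever": {"top_k": 5},  # how many documents to fetch from Pinecone
        "Reader": {"top_k": 3},     # how many answers to extract from those
    },
)
print_answers(prediction, details="minimum")
```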

So now we're looping through each of those five returned examples, running the reader model and extracting the answers. Then we can use print_answers and we should get something from them. We can see that although it's not returning the correct answer, because the correct answer isn't in those six contents, it is at least returning something that makes sense grammatically.

We're asking who created this vocabulary, and it's returning the name of a person. In one of those contents there's a person's name, so it's pulling that out because it knows we're asking a question about who created something.

The answer would naturally be the name of a person. It's not the right answer, but at least it's being almost logical in what it returns, okay? As well as that, I wanted to test the other functions, not just querying in Haystack, but also getting documents by their IDs.

If we run this, these are just the IDs that have been assigned to the different documents, and we should get two documents back. So we have a document with the content here, and we also have the embedding as well, if you need that, plus the metadata, and it goes on for a little while.
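A sketch of the call, using the standard document store method (the ID strings are hypothetical placeholders; real IDs are assigned when the documents are written):

```python
# fetch two documents back by their assigned IDs (placeholder values)
docs = document_store.get_documents_by_id(ids=["doc-id-1", "doc-id-2"])
for doc in docs:
    # each result carries its content, metadata, and (optionally) embedding
    print(doc.id, doc.meta, doc.content[:80])
```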

If I open it in the text editor so we can see everything, we have all of this, and then, just here on the right, we have the second document. So it's returning both of them. That's good; it seems to be working as we expect.

Okay, so let's minimize that. The last thing I want to check is whether the metadata filtering is working. I actually tested this a lot more than what we have here, so maybe we can switch over to that notebook, because then we can at least see what level of filtering we can actually do.

Again, this will change slightly as well. Let me make sure this is the right notebook. Rather than running all of this again, and note that it uses a different index, I'll just show you the results rather than running through it.

Okay. In this notebook, I had more time; I think I ran it on my other computer, which is a lot faster, rather than on my Mac. What we're doing here is using the SQuAD dataset, only 4,000 examples, since I still wanted it to be quick, and we're specifying that we want to create a "squad" index in the Pinecone document store, rather than the default "document" index.

We're using a different model as well, so vector_dim is 384 in this example. Then here I'm just getting all of the contexts from SQuAD, along with all the titles. The titles are just for my reference, so I can build the metadata filters later.

The contexts are what we actually encode and store. I'm writing them there, and then here I'm initializing the retriever. We used dense passage retrieval, DPR, before, but we can replace it with other sentence transformer models as well, and it still works.

I don't know if there's any particular reason not to do this, but it works, at least in this case. So I use a sentence transformer here; it's a smaller model, so it's easier and a little bit faster. Then I updated the embeddings, and for the reader model I just used the defaults again.
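One way to swap in a sentence transformer, using Haystack's EmbeddingRetriever (the specific MiniLM checkpoint is an assumption; it matches the 384-dimension vector_dim above):

```python
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-MiniLM-L6-cos-v1",  # 384-dim; assumed checkpoint
    model_format="sentence_transformers",
)
document_store.update_embeddings(retriever)
```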

Then I run this: "Which college at Notre Dame was added in 1921?" We have all the top_k settings here again, and in this case it should return a relevant answer. One thing that confused me at first is the way Haystack returns results: it returns a small segment of your context, not the full contents.

So we get the answer, which is College of Commerce, and then we get the context it was pulled from, which is quite cool: "...by 1921, the addition of the College of Commerce...". So it is returning the right answer there, and that's really cool to see.

Then, obviously, I wanted to begin adding filters, specifically with Haystack's filtering syntax. So there's a lot of testing here to make sure different filters work with the code that I wrote. This is Haystack's filtering syntax, and the only difference between it and, for example, Pinecone's is in how the conditions are nested.

In Pinecone's version, I think each of these conditions would be its own single dictionary, and we would remove the outer dictionary. The Haystack syntax is slightly different, though obviously both are inspired by the same original filtering syntax, and Haystack just uses this one dictionary, okay?
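To make that concrete, here's a rough illustration; the field names and values are hypothetical (article_title echoes the SQuAD titles used in this notebook), and the shapes follow the two libraries' documented MongoDB-style filter syntax:

```python
# Haystack-style filter: conditions under a logical operator live in one dictionary
haystack_filter = {
    "$and": {
        "article_title": {"$eq": "Age of Enlightenment"},
        "year": {"$gte": 1000},  # hypothetical second field
    }
}

# roughly the Pinecone equivalent after translation: each condition becomes
# its own single dictionary inside a list, and the outer dictionary goes away
pinecone_filter = {
    "$and": [
        {"article_title": {"$eq": "Age of Enlightenment"}},
        {"year": {"$gte": 1000}},
    ]
}
```

In a pipeline query, the Haystack-style filter would be passed through params, e.g. params={"Retriever": {"top_k": 5, "filters": haystack_filter}}.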

There's a method in the document store at the moment, which I can open to show you, though this is going to change. If I come down here, we get to the build filter clause method. This is what handles the translation from Haystack syntax to Pinecone's syntax.

It's relatively messy, so there will probably be some iterations to make it a bit cleaner, if possible. There's a lot of if/elif logic in here, depending on which syntax we're looking at, and it's hard to make that particularly clean. But that's what handles the translation, and it's called whenever you have a filter specified in your query.

So let's go back to the notebook. Here we have a very simple filter in Haystack syntax. Does it make sense as a query? Probably not, but I wanted to make sure it worked. It gets translated into Pinecone's syntax and actually works: when you run it, because of this or condition, you only get items from the Age of Enlightenment article, or this one other context that's returned.

You can see those, which is pretty useful. Then I'm just modifying those filters, testing some different things, and going through those. It's relatively simple testing; there's not a lot going on. It gets a little more complex here, where we have these layered, nested conditions, but it's still pretty straightforward.

So that's the filter testing, and from there, that's pretty much it. The only other thing was get_all_documents. If I run that, we see that we do return all of those documents, so that's another method I wanted to make sure was working.
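That check is essentially a one-liner, using the standard document store API:

```python
all_docs = document_store.get_all_documents(return_embedding=False)
print(len(all_docs))  # should match the number of documents written
```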

I think that's it for going through how you would actually use this document store. Again, there are still things that will be changed, but as far as I know, none of these methods should be. Possibly the vector dimension argument; I think I did see a review comment about that.

So that might change, but otherwise everything else should stay the same. That's it for this walkthrough; I hope it's been useful or interesting. Thank you very much for watching, and I will see you in the next one.