Adding New Doc Stores to Haystack

Today we're going to do something a little bit different: we're going to have a look at the project that I'm currently working on. At a very high level, and if you know the Haystack framework, the project is to implement a new document store within that framework. If you don't know it, no worries, I'm going to explain it.

So I left a couple of notes, which you can see here, to remind myself what I actually want to talk about in this video. To start, I want to give you an overview of what ODQA, or open domain question answering, is and the three components in it.

If you've watched a couple of my videos you've probably seen me talk about this before, but for those of you that haven't, I'm going to go over it very quickly and explain how Haystack fits into it. So if you already know what Haystack is and you already know open domain question answering, you'll probably want to skip ahead a little bit; it's up to you.

I'll leave some chapter markers in the timeline so you can skip ahead if you want. Then we're going to have a look at the current document stores within the Haystack framework and why I'm going to implement a Pinecone document store. After that we're going to look at something that's probably relevant, especially if you want to contribute to open source.

I'm going to set up my own git repo for this, and set it up so that it works with the official Haystack git repo as well. So I'll show you how you can set up git for open source projects when you're planning on contributing, and then we'll have a look at the document stores, or the specific document store code, in the Haystack library.

So yeah, it's a different video, but it should be interesting. Oh, and one last thing: if you do want to test the document store, you can git clone from here to your own local machine. So you git clone the above, and let's say that downloads it to your Documents/haystack folder.

Navigate to that folder in your terminal window or command line, and then you just need to run pip install . (with the dot). That will install this version of Haystack, and then you can test the Pinecone document store. So that's how you can start doing that if you want to test it. And if you want to contribute, and you see some terrible code that I've written and want to make it more efficient, please do go ahead and do that.

So let's start with a quick overview of open domain question answering, or ODQA, and the three components that we see in there. I'm going to assume you've already seen some of my videos on this, or that you're aware of Haystack and open domain question answering, so I'm going to be really quick.

Basically, you have a load of text data, or some other data, in a database somewhere. Wikipedia is a good example: we have our Wikipedia articles over here, which we'll call wiki. We also have something like a search bar, like Google, over here, and we ask a question like "Who are the Normans?" and then we press search.

Wikipedia has the answer to that: the Normans are people from Normandy in the north of France. But we need a way of translating that question into that specific little piece of information and pulling it from our larger database, which can contain millions of paragraphs or more. So how do we pull out the right bit of information?

The query gets converted into a vector using what is called a retriever model. All of these Wikipedia snippets are also vectors, so we go to our database, which is a vector database over here, and we compare our query vector, which we'll call xq, with all of our context vectors. That returns a set of relevant context vectors, which we'll call xc. These contexts are quite big; they're like paragraphs.
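
To make the retrieval step above a bit more concrete, here is a minimal sketch in plain Python. It is not how Haystack or a real retriever model works internally: the `embed` function is a hypothetical bag-of-words stand-in for a retriever model, and `retrieve`, `cosine`, and the sample contexts are made up for illustration. The only point is to show a query vector (xq) being compared against every context vector, with the top-scoring contexts (xc) returned.

```python
import math

def embed(text, vocab):
    # Hypothetical stand-in for a retriever model: a bag-of-words count vector.
    words = text.lower().split()
    return [float(words.count(term)) for term in vocab]

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, contexts, vocab, top_k=2):
    # Compare the query vector (xq) with every context vector and keep the best.
    xq = embed(query, vocab)
    scored = [(cosine(xq, embed(context, vocab)), context) for context in contexts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

contexts = [
    "the normans are people from normandy in the north of france",
    "paris is the capital of france",
    "the dothraki are a nomadic people",
]
vocab = sorted({word for context in contexts for word in context.split()})

results = retrieve("who are the normans", contexts, vocab, top_k=2)
for score, context in results:
    print(round(score, 3), context)
```

A real retriever replaces `embed` with a trained model, and a vector database replaces the loop over every context with an index built for fast similarity search.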

Now, we don't need a paragraph of text to tell us where the Normans are from; we just need a small snippet. So we pass the contexts into what's called a reader model over here, also passing in our question. The reader model takes a context, let's say this one, which is a really long piece of text, and it says the specific answer is here: "north of France" or "Normandy". And that's open domain question answering. At a very high level, that's how it works.

Now, how does Haystack fit into that? It makes most sense to just go to their repo, where we have this little "About" section: it's an open source NLP framework that uses transformer models, which we all know, and allows us to implement production-ready search, question answering, semantic document search, and summarization. So it's quite a lot. The main bit we're focusing on, at least here, is search and question answering, which is exactly what I just showed you. We enter our question, which produces a query vector, we press search, and we get the contexts, which are the long chunks of text, and then the answers, which are the smaller snippets that you saw highlighted. So the retriever, querying the database, gets those big paragraphs, and the reader model gets the little answer.

Haystack is basically a way to do that, and it's incredibly easy and straightforward. You don't really need to know much about open domain question answering; you probably never even need to know that it's called open domain question answering, and you can still implement this pipeline and get something that works really well. It's honestly a very good framework for this sort of thing.

Earlier on I mentioned the document store. The document store is basically where you're storing the information. Earlier on I said "vector database", and a
document store is basically the same thing; you can see them as different names for the same thing.

Now, we have this overview here. We're not going to worry about indexing pipelines so much; that's basically just where you get your data and convert it into a readable format for your document store. But we do have our document store, and then we have the search pipeline, with the retriever and then the reader over here. There's also a generator, which I haven't mentioned and we're not going to talk about here, and then we get the answer.

Let me show you an example. We'll go to the "Better Retrieval via DPR" notebook and scroll down a little bit. Here we're using FAISS as a document store, so this is where we're going to store all of our information. Scroll down a little more, and over here we're pulling in these files, which I think are the wiki for Game of Thrones, like the fan wiki or something like that. It's just all the text from those pages, so you have loads of paragraphs talking about Game of Thrones and random trivia about it.

So we've got the data, and we write it into our document store. At this point we've written the text to our document store, but not the actual vectors, because like I said there's a vector database behind this as well. We're storing the text of the contexts from that Game of Thrones wiki data, but we've not converted them into vectors yet, because we need a retriever model to do that. And we can't search until we have those vectors, because we're searching using a vector search.

So what we do is initialize a retriever model. Here they're using DPR. You can use DPR if you want; personally I would look into the retriever models in sentence-transformers, which are probably more efficient, and generally I think the performance is better as well. But in this example it's just DPR. So we initialize
that DPR retriever model and then update the embeddings, so we create the vectors from the text that we've already pushed to our document store. This step creates the vectors and stores them in the vector database, which is FAISS in this case. FAISS is really just an index; it doesn't have all of the other stuff that would make it a vector database, but I'm going to call it that for the sake of simplicity.

Then we have a reader. The reader model is the final part that you saw: after we've retrieved a set of relevant contexts, we pass them through our reader model, and that extracts a specific snippet, like a few words, that answers our question.

Haystack allows us to do all this super easily. I don't know how many lines of code that is: we have our imports, then something like four lines of code to initialize all those components, and then we initialize an extractive QA pipeline. In open domain question answering we're extracting the answers from a set of contexts, and that's why it's extractive QA here. In that pipeline we just pass our reader and retriever. We don't need to pass the document store, or vector database, because if we look up here, the retriever already contains the document store.

So we've initialized the pipeline, and then we can ask questions. We have this pipeline run call with the query "Who created the Dothraki vocabulary?", and we have these top-k values, which we won't focus on right here; it's basically just how many answers to return. Then we print the answers, although in this notebook we don't actually see the printed answers.

I have an example, so I'll show you quickly. This is with what I've built so far, the Pinecone document store. It's not finished yet, but it is kind of working. It's the same thing again, an extractive QA pipeline; I've just copied this
code from that notebook you saw. Come down here, and this is the bit that we were missing from the other one: print answers with the prediction, and we get this. In this example it's a really bad answer, because I've only uploaded something like six contexts while I was testing, but you'll basically get something like this. The question is "Who created the Dothraki vocabulary?", and it's just pulling out someone's name, because it knows you're asking who created something, but there aren't any good answers here. So it's like, okay, that's all I can give you; again, Elio Garcia from here.

So that's the sort of format and workflow we're expecting when you're using Haystack, which is really good; it makes this very easy. Haystack is all about making open domain question answering really simple, and that's why I want to bring Pinecone in as a document store as well, because Pinecone makes vector search incredibly simple; from what I can see, simpler than any other option out there. That's why I think they go together very well. You get good performance and everything else with Pinecone as well, so there are a lot of benefits to including it.

Now, what I want to show you quickly is how you get started with setting things up in git when you're hoping to contribute to an open source project. You can see here I made this little chart, which explains why the setup is different. You have your local machine, where you're going to do your development work. Typically, when you're working on a project, you wouldn't have this Haystack repo "upstream" part; you would just have this: your remote repository, your origin, and you pull and push to your origin. It's pretty simple. If you're planning on contributing to a project, it's
a little bit different, because first you need to fork the repository. So you have the official repository, and then you have your forked repository, which becomes your origin, and you pull and push to that forked origin. But if you're working on this for a long time, you also want your repository to stay up to date with the upstream, or official, repository. So you need to set up an upstream remote as well, so that you can pull and merge from that upstream into your local repo and your forked repo to keep them up to date.

Let me show you quickly how you might do that. On my right I have a terminal window, and on the left I have the Haystack repository, still on that notebook from before, so let's go to the top level. The first thing you need to do is fork. Not the green Code button; we're going to Fork, here. I already have a fork, so it's offering to fork again to a different account. I'm not going to do that; I'm just going to use the one I've already created. But anyway, when you fork, it creates a fork on your account, and then you go to that fork. I can go up here and just replace deepset-ai with my username. You can see there's also a "Fetch upstream" button here, which you can use too, but we're going to set this up in git so you can see how it works.

So this is my personal version of Haystack that I'm modifying. What I want to do is go to Code, copy this, and come over to the terminal. I'm going to cd into Documents, I think that's fine, and git clone that, which would clone Haystack onto my local machine. It's actually going to take a really long time, so instead let me use a stand-in: I'll make a haystack directory, cd into it, and pretend it's a repo by running git init.
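
The fork-plus-upstream layout described above comes down to a handful of git commands. As a rough sketch, the small Python helper below only assembles and prints those commands; it does not run them. The fork URL and the `my-feature` branch name are placeholders I've assumed for illustration, and the function names are hypothetical.

```python
# Placeholder for your own fork; the username part is an assumption.
FORK_URL = "https://github.com/<your-username>/haystack.git"
# The official repository that the fork tracks as "upstream".
UPSTREAM_URL = "https://github.com/deepset-ai/haystack.git"

def remote_setup_commands(fork_url, upstream_url):
    # Note: after a real `git clone` of your fork, origin already exists,
    # so only the upstream line is needed. Adding origin by hand is only
    # required for a repo created with `git init`.
    return [
        ["git", "remote", "add", "origin", fork_url],
        ["git", "remote", "add", "upstream", upstream_url],
        ["git", "remote"],  # lists the configured remotes
    ]

def sync_commands(upstream_branch="master"):
    # Pull the official repo's changes into the local checkout.
    return [["git", "pull", "upstream", upstream_branch]]

def publish_commands(feature_branch):
    # Push your own work to the fork (origin), never to upstream.
    return [["git", "push", "-u", "origin", feature_branch]]

commands = (
    remote_setup_commands(FORK_URL, UPSTREAM_URL)
    + sync_commands("master")
    + publish_commands("my-feature")  # hypothetical branch name
)
for command in commands:
    print(" ".join(command))
```

To actually execute the commands, each list could be passed to `subprocess.run(command, check=True)` from inside the repository directory.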
Okay, let's just pretend this is now our Haystack repository that we've just git cloned. The first thing we need to do is add our remotes. So git remote add origin: the origin remote is going to be our personal repo, this one here, my haystack fork, so let me copy this and paste it in. Then you need to set up an upstream remote, so git remote add upstream, and this is going to be the original, deepset-ai/haystack. I copy that and paste it in here. Now, I think it's git remote, and yeah, you can see origin and upstream, and that's what we want.

Now, whenever there's an update to the upstream that we want to merge with the version of the project we're working on, we would just write git pull upstream. It'll pull any updates, and we'll have to merge and commit everything into our own repo at the same time.

Let me move out of this one and into the real Haystack repo that I'm actually working on, in projects/haystack. There's nothing I want to change at the moment, but I can git pull upstream and maybe we'll see some changes. I'm not sure; I think I did it yesterday, so maybe not. Oh, maybe there is. Okay, so I actually need to run git pull upstream master, which I think is the branch they use; not just git pull upstream, because you need to specify the branch as well. And this is up to date, so there are no issues there.

So that's pretty good. But obviously when we're committing things, so git commit for everything we're doing, and then we do a git push, we don't push to upstream; we push to origin. We should also have a branch checked out, so we would git checkout a branch. You can see mine here: if I do git status, you can see I'm on my Pinecone document store branch. When I've committed some changes and I want to push them to my
repository, I do git push -u origin followed by that branch name, and that pushes the changes to my own repository. So that's really it: when you're working with open source, you're able to pull from the original repository, which is deepset-ai/haystack, and you're also pulling from and pushing to your personal version of that repo. I thought it might be useful for someone out there, and that's really how you get started with your project.

Once you have that set up, you have your Haystack repository, which I'll open in VS Code over here. This is the local version of the Haystack repository, and in here we have the document stores. There are a few, so let's go through those now. I'm not saying you have to use Pinecone or anything like that; I'm just saying why I want to include it.

We have deepset Cloud, which I think is very new. I'm not even sure it's fully implemented yet, so I can't say anything about it. I know that's deepset's offering, and I'm pretty excited to see what it is, to be honest; I'm sure it's very cool.

Then Elasticsearch. With Elasticsearch you're not really doing a vector search; you're doing a sparse retrieval followed by a dense vector re-ranking. So you can use this, but you're not doing an actual semantic search, or full open domain question answering, with it.

FAISS is great, but you have to handle the infrastructure yourself, so I'm not too keen on that. FAISS is good, but you have to understand what you're doing, otherwise it's a nightmare, especially if you have a lot of vectors. If you get to a million plus, FAISS can be difficult.

You have Milvus. I think they can host stuff for you, and you can definitely use it. I found it difficult to set up, and for me, I want everything to
be super simple and easy and just good, and I struggled with Milvus a bit. You can use it, of course, and I had the same experience with Weaviate. Again, they're good things, but you're going to struggle a little bit.

Another thing on these two: I think Milvus and Weaviate are the closest you're going to get to Pinecone at the moment in Haystack, but they don't have the full metadata filtering capabilities of Pinecone. Metadata filtering in Pinecone is very good because they have something called single-stage filtering, not pre- or post-filtering, and I think it outperforms the others. So that's another thing to consider.

So there are a few document stores that you can already use. There are others as well, like in-memory and GraphDB; I haven't used these, so I can't comment on them. But those are the current offerings, and then we have Pinecone, which I'm currently implementing.

Now I want to have a quick look at the core functionality. We've already had a look at it, really, but I just want to cover it very quickly again. Basically, this is what I need to work on to actually get this working. First is initializing your document store. You can ignore this part, it's not relevant. In this bit here I've got a Pinecone document store, and I'm initializing it with an API key and environment, like you would with Pinecone normally. That's pretty simple; it's just the init over here, where we connect to Pinecone. Nothing special there.

Then the first thing that you saw before is write documents. This is where we're writing text, and here we have to do something extra. We can't just use Pinecone for this, because we can have very long chunks of text, and Pinecone
doesn't accept more than five kilobytes of metadata at the moment, so we can't store long pieces of text in that metadata. We actually have to use a local SQL instance to store those big contexts, and then the smaller bits of metadata we just throw into Pinecone. So write documents is basically populating our SQL database.

After that we're just initializing a retriever model; I haven't had to do anything there. Then we update embeddings, which is where we look at those contexts, convert them into vectors, and store them in Pinecone. And then there's querying, where we ask a question. You can't see it here, but this is actually calling a method called query, and that's a standard method. Everything we implement in our document store has to fit with the standard Haystack ways of doing things. We also have to be able to delete documents, get documents by IDs, and so on, and also include metadata, which is important.

I'm not going to dive into the details of that in this video; I want to do that in maybe the next video, where we'll cover a lot of that stuff. For now it's just an overview of what there is in there, how all this works, why we'd even want to do it in the first place, and so on.

I don't think there's anything else to cover in this first video. Let me check... yeah, that looks like everything. We'll explore this in a lot more depth in the next video, so that should be pretty interesting. If you do want to contribute and help implement this, please feel free, go ahead and do it. Over here, this is the repo at the moment; I'll update the description if that changes. And you can also just go ahead and test it; that really helps too.

So that's everything for this video. Thank you very much for watching. I know it's a bit different; it's
not really a tutorial, it's more just walking through what we're doing, but I hope it's useful to see how this sort of stuff might work. Maybe it's something that you want to do as well, which would be cool. Anyway, thank you very much for watching, I hope it's been useful, and I'll see you in the next one. Bye.
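
As a companion to the walkthrough of write documents, update embeddings, and query, here is a minimal sketch of the split-storage idea: long text in a local SQL database, vectors and small metadata in a separate index. This is not the actual Haystack or Pinecone code. `ToyHybridStore` is a hypothetical class, a plain dict stands in for the Pinecone index, and `toy_embed` stands in for a retriever model.

```python
import sqlite3

class ToyHybridStore:
    """Hypothetical illustration only; not the Haystack or Pinecone API."""

    def __init__(self):
        # Local SQL holds the long context text.
        self.sql = sqlite3.connect(":memory:")
        self.sql.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, content TEXT)")
        # Dict standing in for the vector index: id -> vector and small metadata.
        self.index = {}

    def write_documents(self, docs):
        # Store the text now; vectors are created later by update_embeddings.
        rows = [(doc["id"], doc["content"]) for doc in docs]
        self.sql.executemany(
            "INSERT OR REPLACE INTO documents (id, content) VALUES (?, ?)", rows
        )
        self.sql.commit()
        for doc in docs:
            self.index[doc["id"]] = {"vector": None, "meta": doc.get("meta", {})}

    def update_embeddings(self, embed_fn):
        # Read every stored context, embed it, and attach the vector to the index.
        rows = self.sql.execute("SELECT id, content FROM documents").fetchall()
        for doc_id, content in rows:
            self.index[doc_id]["vector"] = embed_fn(content)

    def query(self, query_vector, top_k=1):
        # Score by dot product in the index, then fetch the full text from SQL.
        scored = []
        for doc_id, entry in self.index.items():
            if entry["vector"] is None:
                continue  # not embedded yet, so not searchable
            score = sum(q * x for q, x in zip(query_vector, entry["vector"]))
            scored.append((score, doc_id))
        scored.sort(reverse=True)
        results = []
        for score, doc_id in scored[:top_k]:
            (content,) = self.sql.execute(
                "SELECT content FROM documents WHERE id = ?", (doc_id,)
            ).fetchone()
            meta = self.index[doc_id]["meta"]
            results.append({"id": doc_id, "content": content, "meta": meta, "score": score})
        return results

def toy_embed(text):
    # Stand-in for a retriever model: counts of two keywords.
    lowered = text.lower()
    return [float(lowered.count("normans")), float(lowered.count("dothraki"))]

store = ToyHybridStore()
store.write_documents([
    {"id": "a", "content": "The Normans are people from Normandy.", "meta": {"source": "wiki"}},
    {"id": "b", "content": "The Dothraki are a nomadic people.", "meta": {"source": "fan-wiki"}},
])
before = store.query([1.0, 0.0])  # nothing is searchable before embedding
store.update_embeddings(toy_embed)
hits = store.query([1.0, 0.0], top_k=1)
print(hits)
```

The ordering matters in the same way as in the notebook: documents can be written before any vectors exist, but nothing is searchable until the embeddings have been updated.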
