
Adding New Doc Stores to Haystack


Chapters

0:00 Intro
2:15 Contributing or Testing
3:31 ODQA
6:20 What is Haystack?
8:13 Haystack QA Workflow
14:52 Contributing to Open Source
22:54 Haystack Doc Stores
26:09 Doc Store Core Methods
29:31 Final Notes, Contribute/Test

Whisper Transcript

00:00:00.000 | Today we're going to do something a little bit different and we're going to have a look at the
00:00:05.200 | project that I'm currently working on. Now this project is at a fairly high level and, if you know the
00:00:14.400 | Haystack framework, it is to implement a new document store within that framework. If you
00:00:21.120 | don't know it no worries I'm going to explain it. So I left a couple of notes which you can see here
00:00:28.480 | to remind myself what I actually want to talk about in this video. So to start I want to just
00:00:35.360 | give you an overview of what is ODQA or Open Domain Question Answering and the three components in
00:00:42.720 | that. You've probably if you've watched a couple of my videos you've definitely seen me talk about
00:00:46.560 | this before but for those of you that don't know it I'm just going to very quickly go over that
00:00:53.600 | and I'm going to explain how Haystack fits into that. So if you already know what Haystack is
00:01:00.240 | and you already know Open Domain Question Answering you probably just want to skip ahead a
00:01:05.600 | little bit it's up to you. I'll leave some chapter markers in the timeline so you can skip ahead if
00:01:12.800 | you want, and then we're going to have a look at the current document stores within the Haystack
00:01:20.960 | framework and why I'm going to implement a Pinecone document store. Then we're going to have
00:01:27.680 | a look, so this is probably quite relevant, especially if you want to contribute to
00:01:32.880 | open source. I'm going to set up my own git repo for this and set it up so that it
00:01:44.160 | works with the official Haystack git repo as well. So I'll show you how you can do that,
00:01:50.880 | how you can set up your git for open source projects when you're planning on contributing,
00:01:56.480 | and then we'll have a look at the document stores, or the specific document store code,
00:02:06.240 | in the Haystack library. So yeah, I think it's a different kind of video but it should be
00:02:13.840 | interesting. Oh, and one last thing: if you do want to test the document store, go ahead and just
00:02:24.720 | git clone from here to your own local machine. So I'm going to git clone the above,
00:02:35.440 | okay, and then, within that same repository, so let's say that downloads it to
00:02:44.800 | your Documents/haystack folder, navigate to that folder in your terminal window or command line,
00:02:56.640 | and then you just need to write pip install . (dot), and that will install this version of
00:03:06.960 | Haystack, and then you can test the Pinecone document store.
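As a quick reference, those install-from-source steps look roughly like this. This is only a sketch: the fork URL is an assumption based on the repo shown in the video, so substitute whichever repo and branch you actually want to test.

```bash
# Install the work-in-progress version of Haystack from source so you can test it.
# NOTE: the fork URL below is an assumption; use the repo linked in the video/description.
git clone https://github.com/jamescalam/haystack.git
cd haystack        # or wherever the clone landed, e.g. ~/Documents/haystack
pip install .      # installs this local version of Haystack
```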
00:03:12.880 | So now I'll show you how you can start doing that. So if you do want to test that, and maybe even if you want to contribute, if
00:03:18.880 | you see some terrible code that I've written and you want to make it more efficient, please do
00:03:25.680 | go ahead and do that. So let's start with a quick overview of open domain question
00:03:34.160 | answering and the three components that we see in there. So I'm going to assume you've already seen
00:03:40.640 | some of my videos on this or that you're aware of Haystack and open domain question answering.
00:03:44.800 | So I'm going to be really quick so open domain question answering or ODQA. It's basically you
00:03:52.160 | have a load of text data or some other data in a database somewhere and let's say wikipedia
00:04:00.400 | that's a good example so we have our wikipedia articles over here called wiki and we ask a
00:04:09.440 | question so we have let's say like a search bar like google over here and we ask a question like
00:04:17.120 | who are the Normans question mark and then we press search right.
00:04:22.240 | Within Wikipedia we have the answer to that, okay: the Normans are people from Normandy in
00:04:31.760 | the north of France but we need a way of translating that question into that specific
00:04:41.440 | little part of information and pulling that from our larger database. This can contain you know
00:04:47.600 | millions or more of paragraphs of information so how do we pull the right bit of information.
00:04:55.520 | So this query gets converted into a vector using what is called a retriever model. So we go into
00:05:04.240 | our retriever model it's converted into a vector all of these wikipedia snippets are also vectors
00:05:11.120 | so we go to our database this is a vector database over here and we compare our query vector
00:05:18.480 | which we'll call xq with all of our context vectors and that will return a set of relevant
00:05:24.880 | context vectors so we'll call them xc and these context vectors are quite big they're like
00:05:35.760 | paragraphs. Now we don't need a paragraph of text to tell us where the Normans are from,
00:05:43.040 | we just need a small snippet, so we pass this into what's
00:05:49.120 | called a reader model over here also passing in our question and that reader model is going to
00:05:57.600 | take our context so let's say this is one of them that's a really long piece of text
00:06:04.160 | and it's going to say okay the specific answer is here okay so it's going to say
00:06:08.240 | north of France, or Normandy, okay? And that's open domain question answering, a super
00:06:16.640 | high-level view, but that's how it works.
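To make that flow concrete, here is a toy sketch of the retriever plus reader idea outside of any framework. The model names are just examples of publicly available checkpoints and the two contexts are invented for illustration; this is not the Haystack implementation.

```python
# Toy open domain QA: encode the query and contexts, retrieve the closest context,
# then let a reader model extract the short answer span from it.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

contexts = [
    "The Normans were a people arising in Normandy, a region in the north of France.",
    "Winterfell is the seat of House Stark in the fictional continent of Westeros.",
]

retriever = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")   # text -> vectors
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")

query = "Who are the Normans?"
xq = retriever.encode(query, convert_to_tensor=True)           # query vector
xc = retriever.encode(contexts, convert_to_tensor=True)        # context vectors

best = util.cos_sim(xq, xc).argmax().item()                    # retrieval step
print(reader(question=query, context=contexts[best]))          # reading step
```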
00:06:22.960 | Now, how does Haystack fit into that? I will show you; it makes most sense to just go to their repo, okay,
00:06:31.840 | and we actually have this little about here it's nice open source nlp framework
00:06:37.840 | uses transformer models we all know what they are and allows us to implement production ready
00:06:47.200 | search question answering semantic document search and summarization okay so it's quite a lot the
00:06:53.040 | main bit we're focusing on at least here is this kind of like search and question answering so this
00:07:01.440 | is exactly what i just showed you all right so we're entering our question this produces a query
00:07:07.200 | vector we press search and then we'll get these contexts which are the long chunks of text and
00:07:12.480 | then the answers which are those little smaller snippets so you saw that highlighted all right
00:07:19.760 | so the retriever retrieves from the database and gets those big paragraphs, the reader model gets the
00:07:26.880 | little answer okay and haystack is basically a way to do that and it's incredibly easy
00:07:35.920 | and straightforward; like, you don't really need to know too much about open domain question
00:07:41.200 | answering, you probably never even need to know that it's called open domain question answering,
00:07:44.880 | and you can implement this pipeline and get something that works really well it's honestly
00:07:51.120 | a very good framework for this sort of thing. So earlier on I mentioned there's this
00:07:58.080 | document store; a document store is basically where you're storing the information. Earlier
00:08:05.040 | on I said vector database; document store and vector database, you can see them as basically the
00:08:10.240 | same thing, just different names for the same thing. Now we kind of have this overview here,
00:08:19.200 | we're not going to worry about indexing pipelines so much and that's basically just you know you get
00:08:23.840 | your data and you convert it into a readable format for your document store but we do have
00:08:29.440 | our document store and then we have the search pipeline so we have the retriever and then the
00:08:33.120 | reader over here we have a generator i haven't mentioned that we're not going to talk about it
00:08:37.200 | here we got answer but let me show you an example so we'll go to here better retrieval via dpr notebook
00:08:50.000 | let's scroll down a little bit
00:08:53.120 | okay, so here we're using FAISS as the document store,
00:08:59.200 | okay so this is what we're going to store all of our information scroll down a little bit more
00:09:05.840 | and over here we're pulling in these files, which are from a wiki, I think it's the wiki for Game
00:09:14.960 | of Thrones, like the fan wiki or something like that; it's just all the text from those, so you
00:09:21.840 | have loads of paragraphs talking about Game of Thrones and random trivia
00:09:29.280 | about it. So we've got the data and we write that into our document store.
00:09:39.360 | now we've written the text to our document store at this point but not the actual vectors
00:09:46.160 | because like i said there's a vector database behind this as well and
00:09:54.560 | what has happened here is we're storing the text of the contexts, of that Game of Thrones wiki data,
00:10:01.840 | in our database, but we've not converted them into vectors yet, because we need a retriever
00:10:08.320 | model to do that, and we can't search until we have those vectors, because we're searching
00:10:15.360 | using a vector search. So what we do is initialize a retriever model; here they're using DPR.
00:10:24.560 | You can use DPR if you want; personally I would probably look into retriever models
00:10:30.720 | in sentence-transformers, it's probably more efficient and generally
00:10:37.120 | I think the performance is better as well, but in this example it's just DPR,
00:10:42.000 | and then, so, we initialize that DPR retriever model and then update the embeddings, so we
00:10:52.960 | create the vectors from the text that we've already pushed to our document store.
00:11:00.000 | In this case that stores them in the vector database, which is FAISS,
00:11:06.720 | although FAISS is just an index; it doesn't have all of the other stuff that would make it a vector
00:11:11.680 | database, but I'm going to just call it that for the sake of simplicity. And then we have a reader,
00:11:20.160 | so the reader model is that final part that you saw: after we have retrieved
00:11:29.440 | a set of relevant context vectors, we pass them through our reader model and that extracts a
00:11:36.560 | specific snippet of an answer, like a few words that answer our question. And then,
00:11:43.280 | so, Haystack allows us to do all this super easily; I don't know how many lines of
00:11:49.200 | code that is, but we have our imports and then there are something like four lines of code
00:11:54.960 | to initialize all those components, and then we are initializing an extractive QA pipeline.
00:12:01.680 | so open domain question answering we are extracting
00:12:05.200 | the answers from a set of contexts that's why it's extractive qa here
00:12:09.920 | And in that pipeline we just pass our reader and retriever; we don't need to pass a document store
00:12:17.040 | or the vector database because if we have a look up here the retriever
00:12:22.080 | already contains the document store okay so it's already in there
00:12:26.800 | Okay, so we've initialized the pipeline and then we can ask questions. So we have this pipeline
00:12:36.400 | run with the query "Who created the Dothraki vocabulary?"; we have these top_k values, we won't focus on those
00:12:43.680 | right here, it's basically just how many answers to return, and then we print the answers, although
00:12:51.920 | in this notebook we don't actually see the answers printed.
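Condensed, the notebook's flow looks something like the sketch below. The imports follow the Haystack v1.x layout and the model names are the ones the DPR tutorial uses; `docs` is assumed to already hold the Game of Thrones wiki snippets as Haystack documents, so treat this as an outline rather than a drop-in script.

```python
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import DensePassageRetriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline
from haystack.utils import print_answers

# 1) document store: holds the text (and later the vectors)
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")
document_store.write_documents(docs)            # text only, no vectors yet

# 2) retriever: DPR in the tutorial, then create the embeddings
retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
    passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
)
document_store.update_embeddings(retriever)     # now the vectors exist

# 3) reader: extracts the short answer span from retrieved contexts
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")

# 4) pipeline: the document store comes in via the retriever
pipe = ExtractiveQAPipeline(reader, retriever)
prediction = pipe.run(
    query="Who created the Dothraki vocabulary?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}},
)
print_answers(prediction, details="minimum")
```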
00:12:57.440 | I have an example, so I'll show you quickly. Okay, so I'm just going to come down here; this example is with
00:13:02.640 | what I've built so far, the Pinecone document store. It's not finished yet, but
00:13:09.920 | it is kind of working. So, same thing again, an extractive QA pipeline; I've just copied this code
00:13:17.600 | from that notebook you saw, and come down here, and this is the bit that we were missing
00:13:25.040 | from the other one: print_answers on the prediction, and we get this. Now in this example it's a really
00:13:30.640 | bad answer, because I've only uploaded like six contexts, because I was just testing it,
00:13:38.160 | but basically you'll get something like this: you'll get an answer. It's like, who
00:13:43.440 | created the dothraki vocabulary and it's just pulling out someone's name because it's like
00:13:47.680 | it knows that you're asking who created something uh but there aren't any good answers here so it's
00:13:53.440 | like, okay, that's all I can give you; so again, Elio Garcia from here. Okay, so that's the sort of
00:14:01.520 | format that we're expecting, the sort of workflow when you're using Haystack, which is really
00:14:07.680 | good; it makes this very easy. Haystack is all about making open domain question answering
00:14:15.280 | really simple, and that's kind of why I want to bring Pinecone as a document store into that
00:14:22.960 | as well, because Pinecone makes vector search incredibly simple, from what I
00:14:29.120 | can see simpler than any other option out there, so that's why I think they go together
00:14:36.880 | very well, and you get good performance and everything else with Pinecone as well, so
00:14:45.040 | there are a lot of benefits to including it. Now, what I want to just show you quickly is
00:14:52.000 | how you get started when you're setting up, when you're hoping to contribute to an
00:14:59.120 | open source project: how do you set it up in git? So,
00:15:02.880 | I mean, you can see here I made this little chart which kind of explains why
00:15:14.400 | the setup is different. So you have your local machine where you're going to do
00:15:18.000 | your development work. Typically you wouldn't have this
00:15:26.320 | Haystack repo upstream thing, you would just have this when you're working on a project:
00:15:32.320 | you have your remote repository, your origin, and you pull and push to your origin.
00:15:37.840 | um it's pretty simple so if you're planning on contributing to a project it's a little bit
00:15:44.480 | different, because first you need to fork the repository. So you have the official
00:15:50.800 | repository and then you have your forked repository that becomes your origin and then you pull and
00:15:56.640 | push to that origin that's been forked. But also, if you're working on this for a long time,
00:16:04.800 | you want your repository to stay up to date with the upstream, the official repository,
00:16:13.280 | so you need to set up an upstream remote as well, so that you can pull and merge from that
00:16:20.880 | upstream into your local and your forked repo to keep them up to date. So let me show you
00:16:27.920 | quickly how you might do that. Okay, so on my right I just have a terminal window and on the left I have
00:16:35.840 | the Haystack repository, on that notebook we were on before. We go to the top level; the first
00:16:44.720 | thing you need to do is, so we come to here, let me make it bigger, here we go,
00:16:55.680 | we're not going to use this green Code button, we're going to Fork here.
00:17:04.640 | Okay, I already have a fork, so it's asking me to fork again to a different account. I'm not
00:17:11.680 | going to do that, I'm just going to use the one I've already created, but anyway, you fork and it will create
00:17:15.840 | a fork on your account, and then you go to that fork. So actually I can go up here and
00:17:24.880 | just replace deepset-ai with my username.
00:17:27.520 | Okay, and you can see there's also a Fetch upstream button here; you can use that too, but we're
00:17:40.880 | going to set it up in git so you can see how it works. So this is my personal version of Haystack
00:17:49.280 | that I am modifying. So what I want to do is click Code, and I'm going to copy this,
00:18:01.520 | and I'm going to come over here, I'm going to cd Documents, I think that's fine, and I'm going to git
00:18:11.920 | clone that. Okay, so I'm going to clone Haystack onto my local machine. It's actually going
00:18:22.320 | to take a really long time, so let me just simulate it with another random project instead.
00:18:33.520 | So, can I cd haystack?
00:18:35.760 | Okay, I'm just going to make that haystack folder, no problem,
00:18:42.240 | then cd into it. I'm going to pretend it's a repo, so I'm going to git init. Okay, and let's just
00:18:52.720 | pretend this is now our Haystack repository that we've just done a git clone for.
00:19:01.520 | So the first thing we need to do is add our remote, so git remote add origin. The origin
00:19:12.480 | remote is going to be our personal repo, so it's this one here, jamescalam/haystack,
00:19:22.480 | so let me copy this
00:19:24.960 | and put it in here. And then what you need to do is set up an upstream remote, so git remote
00:19:37.200 | add upstream, and this is going to be the original. So if I go to here, deepset-ai/haystack, this
00:19:46.000 | is the original repo; I'm going to copy this and paste it in here.
00:19:52.480 | Okay, so now, I think it's git remote, yeah, and you can see the origin and the upstream remotes,
00:20:02.160 | and that's what we want. So now, whenever there's an update to the upstream
00:20:09.200 | that we want to merge into the version of the project we're working on, we would just
00:20:16.560 | write git pull upstream, okay, and it'll pull any updates and we'll have to merge and commit
00:20:24.720 | everything into our own repo at the same time. Let me move out of this one and move into the
00:20:34.960 | real haystack that i'm actually working on so projects haystack okay
00:20:42.080 | there's nothing i want to change at the moment but what i can do is i can git pull upstream
00:20:49.760 | and maybe we'll see some changes, I'm not sure,
00:20:57.520 | I think I did it yesterday so maybe not. Oh, maybe, yeah, there is.
00:21:02.080 | Okay, so this is actually going to update everything.
00:21:10.320 | Okay, so I need to actually do git pull upstream master, I think master is the branch they use,
00:21:22.400 | so not just git pull upstream, you need to specify the branch as well. Okay, and this is up to date, so
00:21:30.240 | there are no issues there. Okay, so that's pretty good, but obviously when we are committing things,
00:21:36.000 | so, you know, git commit for everything we're doing, and we do a git push, we don't go to upstream,
00:21:43.680 | we go to origin. And we should also have a branch checked out as well, so we would git checkout
00:21:51.680 | a branch; you can actually see mine here, if I do git status you can see I'm on the Pinecone doc store branch.
00:21:59.600 | So when I've committed some changes and I want to push them to my
00:22:06.000 | repository, I'm going to do git push -u origin and then the name of the Pinecone doc store branch, okay,
00:22:18.160 | and that'll put the changes in my own repository. So yeah, that's really it: when you're
00:22:24.800 | working with open source you have that setup, so you're able to pull from the original repository,
00:22:30.960 | which is this deepset-ai/haystack, and you're also working on, pulling from, and pushing to your personal
00:22:40.640 | version of that repo. So I thought it might be useful for someone out there.
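Pulling the whole fork-based workflow together, the commands from this section look roughly like this; `<your-username>` and the branch name are placeholders, and the default branch may differ from `master` depending on when you read this.

```bash
# 1) Fork deepset-ai/haystack on GitHub, then clone YOUR fork (it becomes "origin")
git clone https://github.com/<your-username>/haystack.git
cd haystack

# 2) Add the official repo as "upstream" so you can stay in sync with it
git remote add upstream https://github.com/deepset-ai/haystack.git
git remote -v                          # should list both origin and upstream

# 3) Work on a feature branch, pulling updates from upstream as needed
git checkout -b my-doc-store-branch    # placeholder branch name
git pull upstream master               # specify the branch, not just "git pull upstream"

# 4) Commit and push to your fork, then open a pull request from there
git push -u origin my-doc-store-branch
```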
00:22:47.920 | And that's really how you get started with your project. So once you have that setup,
00:22:55.600 | you have your Haystack repository, which I'll go and open in VS Code over here. Okay, so this
00:23:05.920 | is the Haystack repository, or the local version of it, and in here we have the document stores.
00:23:15.520 | Okay, and there are a few, so let's go through those now. I'm not saying you have to
00:23:23.120 | use Pinecone or anything like that, I'm just saying why I want to include it. So we have
00:23:29.200 | deepset Cloud, which I think is very new; I'm not even sure it's fully
00:23:33.920 | implemented yet, I'm not sure, so I can't say anything about that. I know that's
00:23:41.040 | deepset's offering, and I'm pretty excited to see what that is, to be honest; I'm sure it's very cool.
00:23:47.040 | Then Elasticsearch; with Elasticsearch you're not really doing a vector search,
00:23:52.880 | you're doing like a sparse retrieval followed by a dense vector re-ranking,
00:24:00.640 | so yeah, you can use this, but you're not doing an actual semantic search or full
00:24:08.640 | open domain question answering with this. Then there's FAISS: you have to handle the
00:24:15.600 | infrastructure and stuff yourself, so I'm not too keen on that. So yeah, FAISS is good, but you
00:24:24.320 | have to understand what you're doing, otherwise it's a nightmare, especially if you have a
00:24:28.480 | lot of vectors; if you get to like a million plus, FAISS can be difficult. You have Milvus; Milvus is,
00:24:36.720 | you know, I think they can host stuff for you; you can definitely use that. I found it
00:24:44.960 | difficult to set up, and for me I want everything to be super simple and easy and just
00:24:53.840 | good; I struggled with Milvus a bit. So you can use it, of course, I did find it, like, great, and I
00:25:03.920 | had the same thing with Weaviate (again, I hate good things), but you're going to struggle
00:25:12.320 | a little bit. Another thing on these two: I think Milvus and Weaviate are the closest you're
00:25:18.480 | going to get to Pinecone at the moment in Haystack, but they don't have the
00:25:25.600 | full metadata filtering capabilities of Pinecone. Metadata filtering in Pinecone is very good
00:25:33.280 | because they have something called single-stage filtering, not pre- or post-filtering, and it
00:25:41.280 | definitely, I think, outperforms the others, so that's another thing to consider as well.
00:25:48.400 | So there are a few document stores that you can already use.
00:25:51.440 | There are others as well, there's the in-memory one and GraphDB; I haven't used these so I can't comment on them,
00:25:57.520 | but yeah, they're the current offerings that you have.
00:26:03.600 | And then we have Pinecone, which I'm currently implementing. Okay, now
00:26:12.720 | I want to have a quick look at the core functionalities. We already really had a look
00:26:18.800 | at them, but I just want to cover them very quickly again. So basically, what I'm going to need to work
00:26:25.520 | on to actually get this working is write_documents, or actually, first, initializing your document
00:26:36.880 | store. So if I open this up more, okay, you can ignore this, it's not relevant; so this bit here,
00:26:47.680 | I've got a document store, a Pinecone document store, and I'm initializing that with an API key and environment,
00:26:54.800 | like you would with Pinecone normally. Okay, so that's just initializing your document store,
00:27:00.080 | like, how does that work? And that's pretty simple, it's just the init over here, and we're
00:27:05.920 | connecting to it, nothing special there. And then the first thing that you saw before is
00:27:14.160 | write_documents. So this is where we're writing text, and here we kind of have to do something extra;
00:27:21.200 | we can't just use Pinecone for this, because we can have very long chunks of text and Pinecone
00:27:28.160 | doesn't accept more than five kilobytes of metadata at the moment, okay, so we can't store
00:27:35.280 | long pieces of text in that metadata. So we actually have to use a local SQL instance
00:27:42.000 | to store those big contexts, and then the smaller bits of metadata we just throw into Pinecone.
00:27:50.560 | So that's basically populating our SQL database.
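To illustrate the split being described (this is my own rough illustration of the idea, not the actual implementation): the full text goes into the local SQL store, while only the vector and small metadata get upserted to Pinecone. The index name, helper names and the pre-v3 `pinecone-client` calls are assumptions.

```python
import pinecone  # pre-v3 pinecone-client style, as used at the time

pinecone.init(api_key="<PINECONE_API_KEY>", environment="us-west1-gcp")
index = pinecone.Index("haystack-docs")          # hypothetical index name

def write_documents(docs, sql_store):
    # Long contexts exceed Pinecone's ~5 KB metadata limit, so the text
    # itself is written to a local SQL document store instead.
    sql_store.write_documents(docs)

def update_embeddings(docs, retriever):
    # Vectors are created by the retriever; only the ID, vector and small
    # metadata fields are upserted to Pinecone.
    embeddings = retriever.embed_documents(docs)
    index.upsert(vectors=[
        (doc.id, emb.tolist(), {"name": doc.meta.get("name", "")})
        for doc, emb in zip(docs, embeddings)
    ])
```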
00:27:54.560 | Alright, so like before, we're just initializing a retriever model; I haven't had to do
00:28:02.640 | anything there. And then we update embeddings, so this is where we're taking those contexts,
00:28:09.440 | converting them into vectors and storing them in Pinecone. And then there's the querying, so
00:28:19.200 | we're asking a question here. Now, you can't see it here, but this is actually
00:28:22.080 | calling a method called query, and that's a standard method, so the way we
00:28:28.160 | implement everything in our document store has to fit with the standard Haystack
00:28:34.160 | ways of doing things. And then we have to be able to delete things (delete_documents),
00:28:42.160 | get documents by IDs, and so on, and also handle metadata, which is important.
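From the user's side, the core methods just listed would line up roughly like this. This is only a sketch: the Pinecone document store was still work in progress at the time, so the import path, argument names and the embedding model here are assumptions modeled on the other Haystack stores, and `docs` is assumed to be the list of documents from earlier.

```python
from haystack.document_stores import PineconeDocumentStore  # assumed import path
from haystack.nodes import EmbeddingRetriever

document_store = PineconeDocumentStore(
    api_key="<PINECONE_API_KEY>",
    environment="us-west1-gcp",      # example environment
    embedding_dim=768,
)

# write_documents: long text goes to the local SQL store, small metadata to Pinecone
document_store.write_documents(docs)

# update_embeddings: create the vectors and store them in Pinecone
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
)
document_store.update_embeddings(retriever)

# query: retrieval goes through the standard Haystack query interface
results = retriever.retrieve(query="Who created the Dothraki vocabulary?", top_k=5)

# housekeeping methods every Haystack document store has to provide
fetched = document_store.get_documents_by_id(ids=[doc.id for doc in results])
document_store.delete_documents(ids=[doc.id for doc in results])
```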
00:28:49.600 | Now, I'm not going to actually dive into the details of that in this video; I want to do that in maybe the next video,
00:28:54.880 | where we'll cover a lot of that stuff, but for now it's just an overview of what there is in there, how all this
00:29:02.160 | works, why we'd even want to do it in the first place, and so on. So yeah, I don't think there's
00:29:08.720 | anything else I'm going to cover in this first video. Well, let me check.
00:29:14.000 | yeah that looks like everything so yeah that's everything uh for this video we'll
00:29:26.000 | i think explore this in a lot more depth in the next video so that should be pretty interesting
00:29:32.160 | If you do want to contribute and help implement this, please feel free, go ahead and
00:29:38.480 | do it. So over here, this is the repo at the moment; I'll update the description if that
00:29:45.440 | changes, but yeah, that would be really cool. And as well, you can just go ahead and test it,
00:29:53.280 | that also really helps. So yeah, that's everything for this video. Thank you very
00:30:00.720 | much for watching. I know it's a bit different, it's not really a tutorial, it's more
00:30:04.960 | just walking through what we're doing, but I hope it's useful to see how this sort of
00:30:10.320 | stuff might work um maybe you know maybe it's something that you want to do as well um so
00:30:16.560 | that would be cool anyway thank you very much for watching i hope it's been useful
00:30:21.760 | and i'll see you in the next one bye