Adding New Doc Stores to Haystack

00:00:00.000 | Today we're going to do something a little bit different and we're going to have a look at the

00:00:05.200 | project that I'm currently working on. Now this project is a very high level and if you know the

00:00:14.400 | Haystack framework it's to implement a new document store within that framework. If you

00:00:21.120 | don't know it no worries I'm going to explain it. So I left a couple of notes which you can see here

00:00:28.480 | to remind myself what I actually want to talk about in this video. So to start I want to just

00:00:35.360 | give you an overview of what is ODQA or Open Domain Question Answering and the three components in

00:00:42.720 | that. You've probably if you've watched a couple of my videos you've definitely seen me talk about

00:00:46.560 | this before but for those of you that don't know it I'm just going to very quickly go over that

00:00:53.600 | and I'm going to explain how Haystack fits into that. So if you already know what Haystack is

00:01:00.240 | and you already know Open Domain Question Answering you probably just want to skip ahead a

00:01:05.600 | little bit it's up to you. I'll leave some chapter markers in the timeline so you can skip ahead if

00:01:12.800 | you want and then we're going to have a look at the current document stores within the Haystack

00:01:20.960 | framework and why I'm going to implement a Pinecone document store. Then we're going to have

00:01:27.680 | a look so this is probably kind of relevant especially if you want to contribute to

00:01:32.880 | open source. I'm going to just go through set up my own git repo for this and set it up so that it

00:01:44.160 | works with the other official Haystack git repo as well. So I'll show you how you can do that

00:01:50.880 | how you can set up your git for open source projects when you're planning on contributing

00:01:56.480 | and then we'll have a look at the document stores or the specific code in document stores

00:02:06.240 | in the Haystack library. So yeah I think it's a different video but it should be

00:02:13.840 | interesting. Oh and one last thing actually if you do want to test the document store go ahead and just

00:02:24.720 | you can git clone from here to your own you know local machine. So I'm going to git clone the above

00:02:35.440 | okay and then you want to within that same repository so let's say that downloads it to

00:02:44.800 | your documents Haystack folder. Navigate to that folder in your terminal window or command line

00:02:56.640 | and then you just need to write a pip install dot right and that will install this version of

00:03:06.960 | Haystack and then you can test the Pinecone document store. So now I'll show you how you can start

00:03:12.880 | doing that so if you if you do want to test that and maybe even if you want to contribute and

00:03:18.880 | you see some like terrible code that I've written you want to make it more efficient please do go

00:03:25.680 | you know go ahead and do that. So let's start with a quick overview of open domain question

00:03:34.160 | answering and the three components that we see in there. So I'm going to assume you've already seen

00:03:40.640 | some of my videos on this or that you're aware of Haystack and open domain question answering.

00:03:44.800 | So I'm going to be really quick so open domain question answering or ODQA. It's basically you

00:03:52.160 | have a load of text data or some other data in a database somewhere and let's say wikipedia

00:04:00.400 | that's a good example so we have our wikipedia articles over here called wiki and we ask a

00:04:09.440 | question so we have let's say like a search bar like google over here and we ask a question like

00:04:17.120 | who are the Normans question mark and then we press search right.

00:04:22.240 | We have in wikipedia we have the answer to that okay the Normans are people from Normandy in

00:04:31.760 | the north of France but we need a way of translating that question into that specific

00:04:41.440 | little part of information and pulling that from our larger database. This can contain you know

00:04:47.600 | millions or more of paragraphs of information so how do we pull the right bit of information.

00:04:55.520 | So this query gets converted into a vector using what is called a retriever model. So we go into

00:05:04.240 | our retriever model it's converted into a vector all of these wikipedia snippets are also vectors

00:05:11.120 | so we go to our database this is a vector database over here and we compare our query vector

00:05:18.480 | which we'll call xq with all of our context vectors and that will return a set of relevant

00:05:24.880 | context vectors so we'll call them xc and these context vectors are quite big they're like

00:05:35.760 | paragraphs. Now we don't need a paragraph of text to tell us where the Normans are from or

00:05:43.040 | yeah where the Normans are from we just need a small snippet so we pass this into what's

00:05:49.120 | called a reader model over here also passing in our question and that reader model is going to

00:05:57.600 | take our context so let's say this is one of them that's a really long piece of text

00:06:04.160 | and it's going to say okay the specific answer is here okay so it's going to say

00:06:08.240 | north of France or Normandy okay and that's open domain question answering super

00:06:16.640 | you know very high level that's how it works now how does haystack fit into that
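
As a rough sketch of the retrieval step just described (not Haystack's or Pinecone's actual implementation): the query vector xq is compared with every context vector xc, and the closest contexts come back. The three-dimensional vectors, the context strings, and the function names `cosine` and `retrieve` are all made up for illustration; in practice a retriever model produces the vectors and a vector database does the comparison.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(xq, contexts, top_k=2):
    """Return the top_k (score, text) pairs whose vectors are most similar to xq."""
    scored = [(cosine(xq, xc), text) for text, xc in contexts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical context vectors (xc); a real retriever model would encode the text.
contexts = [
    ("The Normans were people from Normandy, in the north of France.", [0.9, 0.1, 0.0]),
    ("Dothraki is a constructed language.", [0.0, 0.2, 0.9]),
    ("Normandy is a region of France.", [0.7, 0.6, 0.1]),
]

# Hypothetical query vector (xq) standing in for "Who are the Normans?"
xq = [1.0, 0.2, 0.0]

results = retrieve(xq, contexts, top_k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The reader model would then take each returned context plus the question and pick out the short answer span; that part is not sketched here.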

00:06:22.960 | so i will show you it makes most sense to just go to their repo okay

00:06:31.840 | and we actually have this little about here it's nice open source nlp framework

00:06:37.840 | uses transformer models we all know what they are and allows us to implement production ready

00:06:47.200 | search question answering semantic document search and summarization okay so it's quite a lot the

00:06:53.040 | main bit we're focusing on at least here is this kind of like search and question answering so this

00:07:01.440 | is exactly what i just showed you all right so we're entering our question this produces a query

00:07:07.200 | vector we press search and then we'll get these contexts which are the long chunks of text and

00:07:12.480 | then the answers which are those little smaller snippets so you saw that highlighted all right

00:07:19.760 | so the retriever, retrieving from the database, gets those big paragraphs the reader model gets a

00:07:26.880 | little answer okay and haystack is basically a way to do that and it's incredibly easy

00:07:35.920 | and straightforward like you don't really need to know too much about open domain question

00:07:41.200 | answering you probably never even need to know that it's open domain question answering

00:07:44.880 | and you can implement this pipeline and get something that works really well it's honestly

00:07:51.120 | a very good framework for this sort of thing so earlier on i mentioned there's this

00:07:58.080 | document store thing a document store is basically where you're storing the information so earlier

00:08:05.040 | on i said vector database; document store and vector database, you can basically see them as being the

00:08:10.240 | same thing just different names for the same thing now we kind of have this overview here

00:08:19.200 | we're not going to worry about indexing pipelines so much and that's basically just you know you get

00:08:23.840 | your data and you convert it into a readable format for your document store but we do have

00:08:29.440 | our document store and then we have the search pipeline so we have the retriever and then the

00:08:33.120 | reader over here we have a generator i haven't mentioned that we're not going to talk about it

00:08:37.200 | here we got answer but let me show you an example so we'll go to here better retrieval via dpr notebook

00:08:50.000 | let's scroll down a little bit

00:08:53.120 | okay so here we're using FAISS as a document store

00:08:59.200 | okay so this is where we're going to store all of our information scroll down a little bit more

00:09:05.840 | and over here we're pulling in these files which are a wiki i think it's the wiki for Game

00:09:14.960 | of Thrones like the fan wiki or something like that it's just all the text from those so you

00:09:21.840 | have loads of paragraphs talking about Game of Thrones and this random like trivia about it

00:09:29.280 | so we've got the data and we write that into our document store

00:09:39.360 | now we've written the text to our document store at this point but not the actual vectors

00:09:46.160 | because like i said there's a vector database behind this as well and

00:09:54.560 | what has happened here is we're storing text of the context or of that game of thrones wiki data

00:10:01.840 | that you know our database but we've not converted them into vectors yet because we need a retriever

00:10:08.320 | model to do that and we can't search until we have that those vectors okay because we're searching

00:10:15.360 | using a vector search so what we do is initialize a retriever model here here they're using dpr

00:10:24.560 | um you can use dpr if you want probably i would look into retriever models on

00:10:30.720 | in sentence transformers personally it's probably more efficient and and generally

00:10:37.120 | i think the performance is better as well but in this example it's just dpr

00:10:42.000 | and then so we initialize that dpr retriever model and then update the embedding so we

00:10:52.960 | create the vectors from the text that we've already pushed to our document store

00:11:00.000 | okay and that stores them in the vector database which is FAISS in this case

00:11:06.720 | although FAISS is just an index it doesn't have all of the other stuff that would make it a vector

00:11:11.680 | database but i'm going to just call it that for the sake of simplicity and then we have a reader

00:11:20.160 | so the reader model is that that the final part that you saw so um after we after we have retrieved

00:11:29.440 | a set of relevant context vectors we pass them through our reader model and that extracts a

00:11:36.560 | specific snippet of an answer okay like a few words that answers our question um and then

00:11:43.280 | so haystack allows us to do all this super easy you know we've i don't know how many lines of

00:11:49.200 | code that is it's you know we have our imports and then there is something like four lines of code

00:11:54.960 | to initialize all those components and then we are initializing a extractive qa pipeline

00:12:01.680 | so open domain question answering we are extracting

00:12:05.200 | the answers from a set of contexts that's why it's extractive qa here

00:12:09.920 | and in that pipeline we just pass our reader and retriever we don't need to pass a document store

00:12:17.040 | or the vector database because if we have a look up here the retriever

00:12:22.080 | already contains the document store okay so it's already in there

00:12:26.800 | okay so we've initialized the pipeline and then we can ask questions so we have this pipeline

00:12:36.400 | run, the query "who created the Dothraki vocabulary", we have this top k we won't focus on those

00:12:43.680 | right here it's just how many answers to return basically um and then we print the answers although

00:12:51.920 | in this notebook we don't actually see them print the answers i have an example so i'll show you

00:12:57.440 | quickly okay so i'm just going to come down here and you can i have an example this is with the

00:13:02.640 | you know what i've built so far the Pinecone document store so it's not finished yet but

00:13:09.920 | it is kind of working so same thing again extractive qa pipeline i've just copied this code

00:13:17.600 | from the from that notebook you saw um and come down here and this is the bit that we were missing

00:13:25.040 | from the other one so print answers prediction and we get this now in this example it's a really

00:13:30.640 | bad answer because i haven't i've only uploaded like six contexts because i was just testing it

00:13:38.160 | but we'll basically you'll get something like this so you'll get an answer um it's like who

00:13:43.440 | created the dothraki vocabulary and it's just pulling out someone's name because it's like

00:13:47.680 | it knows that you're asking who created something uh but there aren't any good answers here so it's

00:13:53.440 | like okay that's all i can give you so again Elio Garcia from here okay so that's the sort of

00:14:01.520 | format that we're expecting the sort of workflow uh when you're using Haystack which is it's really

00:14:07.680 | good it makes this very easy and Haystack is it's all about making open domain question answering

00:14:15.280 | really simple and that's kind of why i want to bring pinecone as a document store into that

00:14:22.960 | as well because pinecone makes vector search incredibly simple um as you know from what i

00:14:29.120 | can see um simpler than any other option out there so that's why i think they go together

00:14:36.880 | very well um and you yeah you get good performance and everything else with pinecone as well so

00:14:45.040 | there's a there's a lot of benefits to including it now what i want to just show you quickly is

00:14:52.000 | okay how do you get started when you're setting up um when you're hoping to contribute

00:14:59.120 | to an open source project how do you set up in git so

00:15:02.880 | i mean you can see here i made this little little chart um which kind of explains why

00:15:14.400 | you know why why the setup is different so you have your local machine where you're going to do

00:15:18.000 | your development work um typically you would only have what you you wouldn't have this

00:15:26.320 | like haystack repo upstream thing you would just have this when you're working on a project

00:15:32.320 | okay you have your remote repository your origin and you pull and push to your origin

00:15:37.840 | um it's pretty simple so if you're planning on contributing to a project it's a little bit

00:15:44.480 | different because first you need to fork the repository so you have the official

00:15:50.800 | repository and then you have your forked repository that becomes your origin and then you pull and

00:15:56.640 | push to that origin that's been forked but you also if you're working on this for a long time

00:16:04.800 | you want your repository to stay up to date with the upstream or the you know official repository

00:16:13.280 | so you need to also set up a an upstream remote as well so that you can pull and merge from that

00:16:20.880 | upstream to your to your local and your your forked repo to keep it up to date so let me show you

00:16:27.920 | quickly how you might do that okay so on my right i just have terminal window and on the left i have

00:16:35.840 | haystack repository we're on that notebook we're on before let me so we go to the top level first

00:16:44.720 | thing you need to do is where are we so we come to here view code no make it bigger here we go

00:16:55.680 | so we're on this green button and we are going to no we're not we're going to fork here

00:17:04.640 | okay i already have a fork so i'm going to okay fork again to a different account okay i'm not

00:17:11.680 | going to do i'm just going to use this one i've already created but anyway you fork it will create

00:17:15.840 | a fork on your account okay and then you go to that to that fork so actually i can go up here

00:17:24.880 | just replace deepset-ai with my username

00:17:27.520 | okay and you can see i also fetch upstream here you can also use that i'm going to we're

00:17:40.880 | going to set up on on git so you can see how it works so this is my my personal version of haystack

00:17:49.280 | that i'm that i am modifying right so what i want to do is code i'm going to copy this

00:18:01.520 | and i'm going to come over here i'm going to cd documents i think that's fine i'm going to git

00:18:11.920 | clone that okay so i'm going to clone haystack into onto my local machine it's actually going

00:18:22.320 | to take a really long time so let me just set up another random project

00:18:33.520 | so can i cd haystack

00:18:35.760 | okay i'm just going to make the haystack folder no problem um

00:18:42.240 | then cd into it i'm going to pretend it's a repo so i'm going to git init okay and let's just

00:18:52.720 | pretend this is now our haystack repository that we've just done a git clone for okay

00:19:01.520 | so first thing we need to do is we need to add our remote so git remote add origin so the origin

00:19:12.480 | remote is going to be our personal um like repo so it's here the this one james callum haystack

00:19:22.480 | so let me copy this

00:19:24.960 | pull it into here and then what you need to do is set up a upstream remote so git remote

00:19:37.200 | add upstream and this is going to be the original so if i go to here so deepset-ai/haystack so this

00:19:46.000 | is the original repo i'm going to copy this and paste it in here

00:19:52.480 | okay so now i think it's git remote yeah so you can see the origin the upstreams

00:20:02.160 | and that's that's what we that's what we want so now whenever there's an update to the upstream

00:20:09.200 | that we want to merge with the version of the project we're working on we would just

00:20:16.560 | write git pull upstream okay and it'll pull any updates and we'll have to merge and commit

00:20:24.720 | everything into our own repo at the same time if i let me move out of this one and move into the

00:20:34.960 | real haystack that i'm actually working on so projects haystack okay

00:20:42.080 | there's nothing i want to change at the moment but what i can do is i can git pull upstream

00:20:49.760 | and maybe maybe we'll see some changes i'm not sure

00:20:57.520 | i think i did it yesterday so maybe not oh maybe yeah there is

00:21:02.080 | okay so this is actually going to update everything

00:21:10.320 | okay so i need to actually git pull upstream master i think they use

00:21:22.400 | so not just git pull upstream need to specify the branch as well okay and this is up to date so

00:21:30.240 | there's no no issues there okay so that's pretty good but obviously when we are committing things

00:21:36.000 | so you know git commit to do everything we're doing and we do a git push we don't go to upstream

00:21:43.680 | we go to origin okay and we should also have a branch checked out as well so we would git check

00:21:51.680 | out you can actually see mine here if i do git status you can see i'm on the pinecone doc store branch

00:21:59.600 | so when i've committed some changes and i want to push that to my

00:22:06.000 | repository i'm going to do git push -u origin and then i'm going to do pinecone doc store okay

00:22:18.160 | and i'll make changes in my own own repository so yeah that's really it for the like when you're

00:22:24.800 | working with open source you you have that so you're able to pull from the original repository

00:22:30.960 | which is this deepset haystack and you're also working and pulling and pushing to your personal

00:22:40.640 | version of that repo so i thought it's useful it might be useful for someone out there
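
To pull the git steps above together, here is a small sketch that only builds the commands as lists and prints them; it does not run git. The function name `fork_workflow_commands`, the `YOUR-USERNAME` placeholder and the branch name `my-feature-branch` are hypothetical, and `master` as the upstream branch is taken from what was said in the video, so check what the upstream repository uses today. Note that a normal `git clone` of your fork already sets up `origin`, so `git remote add origin` is only needed in the `git init` stand-in shown on screen.

```python
import shlex


def fork_workflow_commands(fork_url, upstream_url, branch, upstream_branch="master"):
    """Build, in order, the git commands for working on a fork with an upstream remote."""
    return [
        ["git", "clone", fork_url],                          # your fork becomes "origin"
        ["git", "remote", "add", "upstream", upstream_url],  # the official repository
        ["git", "remote", "-v"],                             # confirm origin and upstream exist
        ["git", "checkout", "-b", branch],                   # do your work on a branch
        ["git", "pull", "upstream", upstream_branch],        # bring in upstream changes
        ["git", "push", "-u", "origin", branch],             # push to your fork, not upstream
    ]


commands = fork_workflow_commands(
    fork_url="https://github.com/YOUR-USERNAME/haystack.git",  # placeholder fork URL
    upstream_url="https://github.com/deepset-ai/haystack.git",
    branch="my-feature-branch",                                # hypothetical branch name
)

for argv in commands:
    print(shlex.join(argv))
```

Everything after the clone is meant to be run from inside the cloned folder.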

00:22:47.920 | and that's really you know how you get started with your project so once you have that setup

00:22:55.600 | you have your haystack repository which i'll go and open in vs code over here okay so this

00:23:05.920 | is the haystack repository or the local version of it and in here we have the document stores

00:23:15.520 | okay and there are there are a few so let's go through those now i'm not saying you you have to

00:23:23.120 | use pinecone or anything like that but i'm just saying why i why i want to include it so we have

00:23:29.200 | deepset Cloud which i think is very new i'm not even sure it's fully

00:23:33.920 | implemented yet i'm not sure so i can't say anything about that i have no idea i know that's

00:23:41.040 | deepset's offering and i'm pretty excited to see what that is to be honest i'm sure it's very cool

00:23:47.040 | Elasticsearch and now with Elasticsearch you're not really doing a vector search

00:23:52.880 | you're doing like a sparse retrieval followed by a dense vector re-ranking

00:24:00.640 | so yeah you can use this but you're not doing a an actual semantic search or or full-on or full

00:24:08.640 | open domain question answering with this so then FAISS is great but you have to handle the

00:24:15.600 | infrastructure and stuff yourself so i'm not too keen on that so yeah FAISS it's good but you

00:24:24.320 | have to you have to understand what you're doing otherwise it's a nightmare especially you have a

00:24:28.480 | lot of vectors if you get to like a million plus FAISS can be difficult you have Milvus Milvus is

00:24:36.720 | you know i think they can host stuff for you um you can definitely use that i found it

00:24:44.960 | difficult to set up and for me i want everything to be super simple and easy and just

00:24:53.840 | good um i struggled with milvus a bit so you can use it of course i did find it like great and i

00:25:03.920 | had the same thing with Weaviate um again i hear good things um but you're going to struggle

00:25:12.320 | a little bit um another thing on these two so i think Milvus and Weaviate are the closest you're

00:25:18.480 | going to get to Pinecone at the moment in Haystack but they also don't have the

00:25:25.600 | full metadata filtering capabilities of Pinecone as well metadata filtering in Pinecone is very good

00:25:33.280 | because they have something called single stage filtering not pre or post filtering and it is

00:25:41.280 | definitely i think it outperforms the others so that's another thing to to consider as well
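
To make the pre versus post filtering distinction a bit more concrete, here is a toy sketch under stated assumptions. It does not show Pinecone's single-stage filtering, whose internals are not covered in this video. It only shows the general trade-off: post-filtering takes the top k first and can come back with fewer than k results, while pre-filtering applies the metadata filter first and then has to rank whatever is left. The documents, scores and function names are made up for illustration.

```python
def post_filter_search(scored_docs, top_k, predicate):
    """Take the top_k by score first, then drop documents that fail the filter."""
    top = sorted(scored_docs, key=lambda d: d["score"], reverse=True)[:top_k]
    return [d for d in top if predicate(d["meta"])]


def pre_filter_search(scored_docs, top_k, predicate):
    """Apply the metadata filter first, then take the top_k of what remains."""
    matching = [d for d in scored_docs if predicate(d["meta"])]
    return sorted(matching, key=lambda d: d["score"], reverse=True)[:top_k]


# Hypothetical documents with precomputed similarity scores.
docs = [
    {"id": "d1", "score": 0.95, "meta": {"type": "news"}},
    {"id": "d2", "score": 0.90, "meta": {"type": "news"}},
    {"id": "d3", "score": 0.80, "meta": {"type": "wiki"}},
    {"id": "d4", "score": 0.70, "meta": {"type": "wiki"}},
    {"id": "d5", "score": 0.60, "meta": {"type": "news"}},
]


def is_wiki(meta):
    return meta["type"] == "wiki"


post_ids = [d["id"] for d in post_filter_search(docs, top_k=2, predicate=is_wiki)]
pre_ids = [d["id"] for d in pre_filter_search(docs, top_k=2, predicate=is_wiki)]
print("post-filter:", post_ids)
print("pre-filter: ", pre_ids)
```

In this toy data the two best-scoring documents are both "news", so post-filtering for "wiki" returns nothing even though matching documents exist. Pre-filtering finds them, but in a real system it typically means searching the filtered subset without the help of the main index, which gets slow at scale.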

00:25:48.400 | so there are a few there are a few document stores that you can already use

00:25:51.440 | um there are others as well there's in-memory, GraphDB i haven't used these so i can't comment on them

00:25:57.520 | um but yeah they're the current offerings that you have

00:26:03.600 | and then we have pinecone which i'm currently implementing okay now let's have a look at

00:26:12.720 | i want to have a quick look at the sort of core functionalities um we already really had a look

00:26:18.800 | at them but i just want to cover them very quickly again so basically what i'm going to need to work

00:26:25.520 | on to actually get this working is write documents or initialize actually initializing your document

00:26:36.880 | store so if i open this more okay so you can ignore this it's not relevant so this bit here

00:26:47.680 | i've got a document store a Pinecone document store initializing that with an API key and environment

00:26:54.800 | like you would with pinecone normally okay so that's just initializing your document store

00:27:00.080 | like how how does that work and that's pretty simple it's just the init over here um and we're

00:27:05.920 | connecting to it nothing nothing special there and then the first thing that you saw before is

00:27:14.160 | write documents so this is where we're writing text and this is we kind of have to do something

00:27:21.200 | we can't just use pinecone for this because we can have very long chunks of text and pinecone

00:27:28.160 | doesn't accept more than five kilobytes of metadata at the moment okay so we can't store

00:27:35.280 | long pieces of text in that metadata so we actually have to use SQL, a local SQL instance,

00:27:42.000 | to store those big contexts and then the smaller bits of metadata we just throw into pinecone

00:27:50.560 | so that's basically populating our sql database
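
The split described here can be sketched roughly like this, assuming an in-memory SQLite database for the long text and a plain dict standing in for the vector index. This is not the actual Pinecone document store code: the class name `ToySplitStore`, its method bodies, the table layout and the way metadata size is measured (JSON-encoded bytes) are all assumptions for illustration. The 5 KB figure is the limit mentioned in the video and may have changed since.

```python
import json
import sqlite3

MAX_METADATA_BYTES = 5 * 1024  # limit quoted in the video; check current docs


class ToySplitStore:
    """Hypothetical stand-in: full text goes to SQL, small metadata to a vector index."""

    def __init__(self):
        self.sql = sqlite3.connect(":memory:")
        self.sql.execute("CREATE TABLE document (id TEXT PRIMARY KEY, content TEXT)")
        self.index = {}  # id -> {"vector", "metadata"}; stands in for the vector index

    def write_documents(self, docs):
        """Store text in SQL and metadata in the index; no vectors exist yet."""
        for doc in docs:
            meta_size = len(json.dumps(doc["meta"]).encode("utf-8"))
            if meta_size > MAX_METADATA_BYTES:
                raise ValueError(f"metadata for {doc['id']} is {meta_size} bytes")
            self.sql.execute(
                "INSERT INTO document VALUES (?, ?)", (doc["id"], doc["content"])
            )
            self.index[doc["id"]] = {"vector": None, "metadata": doc["meta"]}
        self.sql.commit()

    def update_embeddings(self, embed):
        """Read each stored text back from SQL, embed it, and attach the vector."""
        for doc_id, content in self.sql.execute("SELECT id, content FROM document"):
            self.index[doc_id]["vector"] = embed(content)

    def get_document_by_id(self, doc_id):
        """Rebuild a document from its SQL text and its index metadata."""
        row = self.sql.execute(
            "SELECT content FROM document WHERE id = ?", (doc_id,)
        ).fetchone()
        if row is None:
            return None
        return {"id": doc_id, "content": row[0], "meta": self.index[doc_id]["metadata"]}


text = "A long paragraph of wiki text that would not fit in metadata."
store = ToySplitStore()
store.write_documents([{"id": "doc-1", "content": text, "meta": {"name": "example.txt"}}])
store.update_embeddings(lambda t: [float(len(t))])  # toy embedding, not a real model
doc = store.get_document_by_id("doc-1")
print(doc["meta"], store.index["doc-1"]["vector"])
```

The two-step shape (write the text first, create vectors later with a retriever) mirrors the write documents and update embeddings steps shown earlier in the notebook.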

00:27:54.560 | all right so like before we're just initializing a retriever model i haven't had to do

00:28:02.640 | anything there and then we update embeddings so this is where we're looking at those contexts

00:28:09.440 | converting them into vectors and storing them in pinecone and then there's just a querying so

00:28:19.200 | we're asking a question here now you can't see here but this is actually

00:28:22.080 | calling a method called query and that's a standard method so you have to the way we

00:28:28.160 | have to implement everything in our document store has to fit with the standard haystack

00:28:34.160 | ways of doing things and then we have to be able to delete things delete documents

00:28:42.160 | get documents by ids and so on and also include metadata which is important now i'm not going to

00:28:49.600 | actually dive into details of that in this video i want to do that in the in maybe the next video

00:28:54.880 | we'll cover a lot of that stuff but for now just an overview of what there is in there how all this

00:29:02.160 | works why we'd even want to do it in the first place and so on so yeah i don't think there's

00:29:08.720 | anything else i want to cover in this first video um well let me check

00:29:14.000 | yeah that looks like everything so yeah that's everything uh for this video we'll

00:29:26.000 | i think explore this in a lot more depth in the next video so that should be pretty interesting

00:29:32.160 | if you if you do want to contribute and help implement this please feel free go ahead and

00:29:38.480 | do it um so over here this is the repo at the moment um i'll update in the description if that

00:29:45.440 | changes but yeah that would be really cool so and as well you can just go ahead and test it

00:29:53.280 | that's also also really helps so yeah that's everything for this video um thank you very

00:30:00.720 | much for watching i hope it's i know it's a bit different it's not really a tutorial it's more

00:30:04.960 | just like walking through what we're doing but i hope it's useful to see you know how this sort of

00:30:10.320 | stuff might work um maybe you know maybe it's something that you want to do as well um so

00:30:16.560 | that would be cool anyway thank you very much for watching i hope it's been useful

00:30:21.760 | and i'll see you in the next one bye
