back to indexAdding New Doc Stores to Haystack
Chapters
0:0 Intro
2:15 Contributing or Testing
3:31 ODQA
6:20 What is Haystack?
8:13 Haystack QA Workflow
14:52 Contributing to Open Source
22:54 Haystack Doc Stores
26:9 Doc Store Core Methods
29:31 Final Notes, Contribute/Test
00:00:00.000 |
Today we're going to do something a little bit different and we're going to have a look at the 00:00:05.200 |
project that I'm currently working on. Now this project is a very high level and if you know the 00:00:14.400 |
Haystack framework it's to implement a new document store within that framework. If you 00:00:21.120 |
don't know it no worries I'm going to explain it. So I left a couple of notes which you can see here 00:00:28.480 |
to remind myself what I actually want to talk about in this video. So to start I want to just 00:00:35.360 |
give you an overview of what is ODQA or Open Domain Question Answering and the three components in 00:00:42.720 |
that. You've probably if you've watched a couple of my videos you've definitely seen me talk about 00:00:46.560 |
this before but for those of you that don't know it I'm just going to very quickly go over that 00:00:53.600 |
and I'm going to explain how Haystack fits into that. So if you already know what Haystack is 00:01:00.240 |
and you already know Open Domain Question Answering you probably just want to skip ahead a 00:01:05.600 |
little bit it's up to you. I'll leave some chapter markers in the timeline so you can skip ahead if 00:01:12.800 |
you want and then we're going to have a look at the current document source within the Haystack 00:01:20.960 |
framework and why I'm going to implement a Pinecone document. Then we're going to have 00:01:27.680 |
a look so this is probably kind of relevant especially if you want to contribute to 00:01:32.880 |
open source. I'm going to just go through set up my own git repo for this and set it up so that it 00:01:44.160 |
works with the other official Haystack git repo as well. So I'll show you how you can do that 00:01:50.880 |
how you can set up your git for open source projects when you're planning on contributing 00:01:56.480 |
and then we'll have a look at the document source or the specific code in document source 00:02:06.240 |
in the Haystack library. So yeah I think it's a different video but it should be should be 00:02:13.840 |
interesting. Oh and one last thing actually if you do want to test document source go ahead and just 00:02:24.720 |
you can git clone from here to your own you know local machine. So I'm going to git clone the above 00:02:35.440 |
okay and then you want to within that same repository so let's say that downloads it to 00:02:44.800 |
your documents Haystack folder. Navigate to that folder in your terminal window or command line 00:02:56.640 |
and then you just need to write a pip install dot right and that will install this version of 00:03:06.960 |
Haystack and then you can test the pinecone document. So now I'll show you how you can start 00:03:12.880 |
doing that so if you if you do want to test that and maybe even if you want to contribute and 00:03:18.880 |
you see some like terrible code that I've written you want to make it more efficient please do go 00:03:25.680 |
you know go ahead and do that. So let's start with a quick overview of open domain question 00:03:34.160 |
answering and the three components that we see in there. So I'm going to assume you've already seen 00:03:40.640 |
some of my videos on this or that you're aware of Haystack and open domain question answering. 00:03:44.800 |
So I'm going to be really quick so open domain question answering or ODQA. It's basically you 00:03:52.160 |
have a load of text data or some other data in a database somewhere and let's say wikipedia 00:04:00.400 |
that's a good example so we have our wikipedia articles over here called wiki and we ask a 00:04:09.440 |
question so we have let's say like a search bar like google over here and we ask a question like 00:04:17.120 |
who are the Normans question mark and then we press search right. 00:04:22.240 |
We have in wikipedia we have the answer to that okay the Normans are people from Normandy in 00:04:31.760 |
the north of France but we need a way of translating that question into that specific 00:04:41.440 |
little part of information and pulling that from our larger database. This can contain you know 00:04:47.600 |
millions or more of paragraphs of information so how do we pull the right bit of information. 00:04:55.520 |
So this query gets converted into a vector using what is called a retriever model. So we go into 00:05:04.240 |
our retriever model it's converted into a vector all of these wikipedia snippets are also vectors 00:05:11.120 |
so we go to our database this is a vector database over here and we compare our query vector 00:05:18.480 |
which we'll call xq with all of our context vectors and that will return a set of relevant 00:05:24.880 |
context vectors so we'll call them xc and these context vectors are quite big they're like 00:05:35.760 |
paragraphs. Now we don't need a paragraph of text to tell us where the Normans are from or 00:05:43.040 |
yeah where the Normans are from we just need a small snippet so we pass this into what's 00:05:49.120 |
called a reader model over here also passing in our question and that reader model is going to 00:05:57.600 |
take our context so let's say this is one of them that's a really long piece of text 00:06:04.160 |
and it's going to say okay the specific answer is here okay so it's going to say 00:06:08.240 |
north of france or normandy okay and that's open the main question answering super 00:06:16.640 |
you know very high level that's how it works now how does haystack fit into that 00:06:22.960 |
so i will show you it makes most sense to just go to their repo okay 00:06:31.840 |
and we actually have this little about here it's nice open source nlp framework 00:06:37.840 |
uses transformer models we all know what they are and allows us to implement production ready 00:06:47.200 |
search question answering semantic document search and summarization okay so it's quite a lot the 00:06:53.040 |
main bit we're focusing on at least here is this kind of like search and question answering so this 00:07:01.440 |
is exactly what i just showed you all right so we're entering our question this produces a query 00:07:07.200 |
vector we press search and then we'll get these contexts which are the long chunks of text and 00:07:12.480 |
then the answers which are those little smaller snippets so you saw that highlighted all right 00:07:19.760 |
so the retriever gets retrieving database gets those big paragraphs the reader model gets a 00:07:26.880 |
little answer okay and haystack is basically a way to do that and it's incredibly easy 00:07:35.920 |
and and straightforward like you don't really need to know too much about open the main question 00:07:41.200 |
answer you probably never even need to know that it's open the main question answering 00:07:44.880 |
and you can implement this pipeline and get something that works really well it's honestly 00:07:51.120 |
a very good framework for this sort of thing so i've been earlier on i mentioned there's this 00:07:58.080 |
document sorting document store is basically where you're storing the the information so i'm earlier 00:08:05.040 |
on i said vector database document store is basically you can see them as being both the 00:08:10.240 |
same thing just different names for the same thing now we kind of have this overview here 00:08:19.200 |
we're not going to worry about indexing pipelines so much and that's basically just you know you get 00:08:23.840 |
your data and you convert it into a readable format for your document store but we do have 00:08:29.440 |
our document store and then we have the search pipeline so we have the retriever and then the 00:08:33.120 |
reader over here we have a generator i haven't mentioned that we're not going to talk about it 00:08:37.200 |
here we got answer but let me show you an example so we'll go to here better retrieval via dpr notebook 00:08:53.120 |
okay so here we're using vice as a document store 00:08:59.200 |
okay so this is what we're going to store all of our information scroll down a little bit more 00:09:05.840 |
and over here we're pulling in these these files which are is a wiki i think it's the wiki for game 00:09:14.960 |
of thrones like the fan wiki or something like that it's just all the text from those so you 00:09:21.840 |
have loads of paragraphs talking about game of thrones and this random like trivia about it 00:09:29.280 |
about it so once we so we've got the data and we write that into our document store 00:09:39.360 |
now we've written the text to our document store at this point but not the actual vectors 00:09:46.160 |
because like i said there's a vector database behind this as well and 00:09:54.560 |
what has happened here is we're storing text of the context or of that game of thrones wiki data 00:10:01.840 |
that you know our database but we've not converted them into vectors yet because we need a retriever 00:10:08.320 |
model to do that and we can't search until we have that those vectors okay because we're searching 00:10:15.360 |
using a vector search so what we do is initialize a retriever model here here they're using dpr 00:10:24.560 |
um you can use dpr if you want probably i would look into retriever models on 00:10:30.720 |
in sentence transformers personally it's probably more efficient and and generally 00:10:37.120 |
i think the performance is better as well but in this example it's just dpr 00:10:42.000 |
and then so we initialize that dpr retriever model and then update the embedding so we 00:10:52.960 |
create the vectors from the text that we've already pushed to our document store 00:11:00.000 |
okay in this case vectors stores them in the vector database which is by uh vice in this case 00:11:06.720 |
although vice is just an index it doesn't have all of the other stuff that would make it a vector 00:11:11.680 |
database but i'm going to just call it that for the sake of simplicity and then we have a reader 00:11:20.160 |
so the reader model is that that the final part that you saw so um after we after we have retrieved 00:11:29.440 |
a set of relevant context vectors we pass them through our reader model and that extracts a 00:11:36.560 |
specific snippet of an answer okay like a few words that answers our question um and then 00:11:43.280 |
so haystack allows us to do all this super easy you know we've i don't know how many lines of 00:11:49.200 |
code that is it's you know we have our imports and then there is something like four lines of code 00:11:54.960 |
to initialize all those components and then we are initializing a extractive qa pipeline 00:12:01.680 |
so open domain question answering we are extracting 00:12:05.200 |
the answers from a set of contexts that's why it's extractive qa here 00:12:09.920 |
and in that pipeline we just pass our read and retrieve we don't need to pass a document store 00:12:17.040 |
or the vector database because if we have a look up here the retriever 00:12:22.080 |
already contains the document store okay so it's already in there 00:12:26.800 |
okay so we've initialized the pipeline and then we can ask questions so we have this pipeline 00:12:36.400 |
run query you created dothraki vocabulary we have this top k we won't focus on those 00:12:43.680 |
right here it's just how many answers to return basically um and then we print the answers although 00:12:51.920 |
in this notebook we don't actually see them print the answers i have an example so i'll show you 00:12:57.440 |
quickly okay so i'm just going to come down here and you can i have an example this is with the 00:13:02.640 |
my so you know what i've built so far the pinecone document so it's not finished yet but 00:13:09.920 |
it is kind of working so same thing again extractive qa pipeline i've just copied this code 00:13:17.600 |
from the from that notebook you saw um and come down here and this is the bit that we were missing 00:13:25.040 |
from the other one so print answers prediction and we get this now in this example it's a really 00:13:30.640 |
bad answer because i haven't i've only uploaded like six contexts because i was just testing it 00:13:38.160 |
but we'll basically you'll get something like this so you'll get an answer um it's like who 00:13:43.440 |
created the dothraki vocabulary and it's just pulling out someone's name because it's like 00:13:47.680 |
it knows that you're asking who created something uh but there aren't any good answers here so it's 00:13:53.440 |
like okay that's all i can give you so again elio garcia from here okay so that's the sort of 00:14:01.520 |
format that we're expecting the sort of workflow uh when you're using hsi which is it's really 00:14:07.680 |
good it makes this very easy and and haystack is it's all about making open main question answering 00:14:15.280 |
really simple and that's kind of why i want to bring pinecone as a document store into that 00:14:22.960 |
as well because pinecone makes vector search incredibly simple um as you know from what i 00:14:29.120 |
can see um simpler than any other option out there so that's why i think they go together 00:14:36.880 |
very well um and you yeah you get good performance and everything else with pinecone as well so 00:14:45.040 |
there's a there's a lot of benefits to including it now what i want to just show you quickly is 00:14:52.000 |
okay how how do you get started when you're setting up um when you're hoping to contribute 00:14:59.120 |
open source project how do you set up in in git so 00:15:02.880 |
i mean you can see here i made this little little chart um which kind of explains why 00:15:14.400 |
you know why why the setup is different so you have your local machine where you're going to do 00:15:18.000 |
your development work um typically you would only have what you you wouldn't have this 00:15:26.320 |
like haystack repo upstream thing you would just have this when you're working on a project 00:15:32.320 |
okay you have your remote repository your origin and you pull and push to your origin 00:15:37.840 |
um it's pretty simple so if you're planning on contributing to a project it's a little bit 00:15:44.480 |
different because you need to first you need to fork your new repository so you have the official 00:15:50.800 |
repository and then you have your forked repository that becomes your origin and then you pull and 00:15:56.640 |
push to that origin that's been forked but you also if you're working on this for a long time 00:16:04.800 |
you want your repository to stay up to date with the upstream or the you know official repository 00:16:13.280 |
so you need to also set up a an upstream remote as well so that you can pull and merge from that 00:16:20.880 |
upstream to your to your local and your your forked repo to keep it up to date so let me show you 00:16:27.920 |
quickly how you might do that okay so on my right i just have terminal window and on the left i have 00:16:35.840 |
haystack repository we're on that notebook we're on before let me so we go to the top level first 00:16:44.720 |
thing you need to do is where are we so we come to here view code no make it bigger here we go 00:16:55.680 |
so we're on this green button and we are going to no we're not we're going to fork here 00:17:04.640 |
okay i already have a fork so i'm going to okay fork again to a different account okay i'm not 00:17:11.680 |
going to do i'm just going to use this one i've already created but anyway you fork it will create 00:17:15.840 |
a fork on your account okay and then you go to that to that fork so actually i can go up here 00:17:27.520 |
okay and you can see i also fetch upstream here you can also use that i'm going to we're 00:17:40.880 |
going to set up on on git so you can see how it works so this is my my personal version of haystack 00:17:49.280 |
that i'm that i am modifying right so what i want to do is code i'm going to copy this 00:18:01.520 |
and i'm going to come over here i'm going to cd documents i think that's fine i'm going to git 00:18:11.920 |
clone that okay so i'm going to clone haystack into onto my local machine it's actually going 00:18:22.320 |
to take a really long time so let me just sail for another random project 00:18:35.760 |
okay i'm just going to make the haystack no problem um 00:18:42.240 |
the cd into it i'm going to pretend it's a a repo so i'm going to get in it okay and let's just 00:18:52.720 |
pretend this is now our haystack repository that we've just done a git clone for okay 00:19:01.520 |
so first thing we need to do is we need to add our remote so git remote add origin so the origin 00:19:12.480 |
remote is going to be our personal um like repo so it's here the this one james callum haystack 00:19:24.960 |
pull it into here and then what you need to do is set up a upstream remote so git remote 00:19:37.200 |
add upstream and this is going to be the original so if i go to here so deep set ai haystack so this 00:19:46.000 |
is the original repo i'm going to copy this and paste it in here 00:19:52.480 |
okay so now i think it's git remote yeah so you can see the origin the upstreams 00:20:02.160 |
and that's that's what we that's what we want so now whenever there's an update to the upstream 00:20:09.200 |
that we want to merge without without the version of the project we're working on we would just 00:20:16.560 |
write git pull upstream okay and it'll pull any updates and we'll have to merge and commit 00:20:24.720 |
everything into our own repo at the same time if i let me move out of this one and move into the 00:20:34.960 |
real haystack that i'm actually working on so projects haystack okay 00:20:42.080 |
there's nothing i want to change at the moment but what i can do is i can git pull upstream 00:20:49.760 |
and maybe maybe we'll see some changes i'm not sure 00:20:57.520 |
i think i did it yesterday so maybe not oh maybe yeah there is 00:21:02.080 |
okay so this is actually going to update everything 00:21:10.320 |
okay so i need to actually git pull upstream master i think they use 00:21:22.400 |
so not just git pull upstream need to specify the branch as well okay and this is up to date so 00:21:30.240 |
there's no no issues there okay so that's pretty good but obviously when we are committing things 00:21:36.000 |
so you know git commit to do everything we're doing and we do a git push we don't go to upstream 00:21:43.680 |
we go to origin okay and we should also have a branch checked out as well so we would get check 00:21:51.680 |
out you can actually see mine here if i do git status you can see i'm on the pinecone.sore branch 00:21:59.600 |
so when i'm when i've committed some changes and i want to push that to my 00:22:06.000 |
repository i'm going to do git push u origin and then i'm going to do pinecone.sore okay 00:22:18.160 |
and i'll make changes in my own own repository so yeah that's really it for the like when you're 00:22:24.800 |
working with open source you you have that so you're able to pull from the original repository 00:22:30.960 |
which is this deep set haystack and you're also working and pulling and pushing to your personal 00:22:40.640 |
version of that repo so i thought it's useful it might be useful for someone out there 00:22:47.920 |
and that's really you know how you get started with your project so once you have that setup 00:22:55.600 |
you have your haystack repository which i'll go and open in vs code over here okay so this 00:23:05.920 |
is the haystack repository or the local version of it and in here we have the document source 00:23:15.520 |
okay and there are there are a few so let's go through those now i'm not saying you you have to 00:23:23.120 |
use pinecone or anything like that but i'm just saying why i why i want to include it so we have 00:23:29.200 |
a deep set cloud which i haven't i'm i think is very new i'm not sure even sure it's fully 00:23:33.920 |
implemented yet i'm not i'm not sure so i can't say anything about that i have no idea i know that's 00:23:41.040 |
deep sets offering and i'm pretty excited to see what that is to be honest i'm sure it's very cool 00:23:47.040 |
elastic search and now with elastic search you're not really doing a vector search 00:23:52.880 |
you're doing like a sparse retrieval followed by a dense vector re-ranking 00:24:00.640 |
so yeah you can use this but you're not doing a an actual semantic search or or full-on or full 00:24:08.640 |
open the main question answering with this so some great advice you have to handle the 00:24:15.600 |
infrastructure and stuff yourself so i'm not too keen on that so yeah it's five it's good but you 00:24:24.320 |
have to you have to understand what you're doing otherwise it's a nightmare especially you have a 00:24:28.480 |
lot of vectors if you get to like a million plus buys can be difficult you have milvus milvus is 00:24:36.720 |
you know i think they can host stuff for you um you can definitely use that i found it 00:24:44.960 |
difficult to set up and for me i want everything to be super simple and easy and just 00:24:53.840 |
good um i struggled with milvus a bit so you can use it of course i did find it like great and i 00:25:03.920 |
had the same thing with we aviate um again i hate i hate good things um but you're going to struggle 00:25:12.320 |
a little bit um another thing on these two so i think milvus and we aviate are the closest you're 00:25:18.480 |
going to get to pinecone at the moment in haystack but they also doesn't we don't they don't have the 00:25:25.600 |
full metadata filtering capabilities of pinecone as well metadata filtering pinecone is very good 00:25:33.280 |
because they have something called single stage filtering not pre or post filtering and it is 00:25:41.280 |
definitely i think it outperforms the others so that's another thing to to consider as well 00:25:48.400 |
so there are a few there are a few document stores that you can already use 00:25:51.440 |
um there are others as well there's memory graph db i haven't used these so i can't comment on them 00:25:57.520 |
um but yeah they're the current offerings that you have 00:26:03.600 |
and then we have pinecone which i'm currently implementing okay now let's have a look at 00:26:12.720 |
i want to have a quick look at the sort of core functionalities um we already really had a look 00:26:18.800 |
at them but i just want to cover them very quickly again so basically what i'm going to need to work 00:26:25.520 |
on to actually get this working is write documents or initialize actually initializing your document 00:26:36.880 |
store so if i open this more okay so you can ignore this it's not relevant so this bit here 00:26:47.680 |
i've got a document store a pinecone document store initializing that with a api key environment 00:26:54.800 |
like you would with pinecone normally okay so that's just initializing your document store 00:27:00.080 |
like how how does that work and that's pretty simple it's just the init over here um and we're 00:27:05.920 |
connecting to it nothing nothing special there and then the first thing that you saw before is 00:27:14.160 |
write documents so this is where we're writing text and this is we kind of have to do something 00:27:21.200 |
we can't just use pinecone for this because we can have very long chunks of text and pinecone 00:27:28.160 |
doesn't accept more than five kilobytes of metadata at the moment okay so we can't store 00:27:35.280 |
long pieces of text in that metadata so we actually have to use a sql a local sql instance 00:27:42.000 |
to store those big contexts and then the smaller bits of metadata we just throw into pinecone 00:27:50.560 |
so that's basically populating our sql database 00:27:54.560 |
all right so before we just initializing a retriever model it's not i haven't had to do 00:28:02.640 |
anything there and then we update embeddings so this is where we're looking at those contexts 00:28:09.440 |
converting them into vectors and storing them in pinecone and then there's just a querying so 00:28:19.200 |
we're asking a question here now you can't see here but this is actually 00:28:22.080 |
calling a method called query and that's a standard method so you have to the way we 00:28:28.160 |
have to implement everything in our document store has to fit with the standard haystack 00:28:34.160 |
ways of doing things and then we have to be able to delete things delete documents 00:28:42.160 |
get documents by ids and so on and also include metadata which is important now i'm not going to 00:28:49.600 |
actually dive into details of that in this video i want to do that in the in maybe the next video 00:28:54.880 |
we'll cover a lot of that stuff but for now just an overview of what there is in there how all this 00:29:02.160 |
works why we'd even want to do it in the first place and so on so yeah i don't think there's 00:29:08.720 |
anything else i'm to cover in this first video um well let me let me check 00:29:14.000 |
yeah that looks like everything so yeah that's everything uh for this video we'll 00:29:26.000 |
i think explore this in a lot more depth in the next video so that should be pretty interesting 00:29:32.160 |
if you if you do want to contribute and help implement this please feel free go ahead and 00:29:38.480 |
do it um so over here this is the repo at the moment um i'll update in the description if that 00:29:45.440 |
changes but yeah that would be really cool so and as well you can just go ahead and test it 00:29:53.280 |
that's also also really helps so yeah that's everything for this video um thank you very 00:30:00.720 |
much for watching i hope it's i know it's a bit different it's not really a tutorial it's more 00:30:04.960 |
just like walking through what we're doing but i hope it's useful to see you know how this sort of 00:30:10.320 |
stuff might work um maybe you know maybe it's something that you want to do as well um so 00:30:16.560 |
that would be cool anyway thank you very much for watching i hope it's been useful