
Q&A Document Retrieval With DPR


Transcript

Okay, so in the previous video what we did was set up our Elasticsearch document store to contain all of our paragraphs from Meditations. We did that in this script here, and altogether it's not that much data: 508 paragraphs, or documents, within our document store. What we now want to do is set up the next part of our retriever-reader stack, which is the retriever. Given a query, the retriever will communicate with our Elasticsearch document store and return a certain number of contexts, which in our case are the paragraphs that it thinks are most relevant to our query.

So that's what we are going to be doing here. The first thing we need to do is initialize our document store again, so I'm just going to copy these and paste them here. This will just initialize it from what we've already built, so it's using the same index that already exists.

So just initialize that, and once we have our document store, okay, cool, we have that now. Now what we want to do is set up our DPR, which is a Dense Passage Retriever. This essentially embeds the documents in our index as dense vectors, and then, when it comes to actually searching for the most relevant documents later on, it uses a type of efficient similarity search over those dense vectors to find the most similar ones.
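To make the idea of searching over dense vectors concrete, here is a minimal sketch in plain Python. It is not how Haystack or Elasticsearch implement it: the vectors are made-up three-dimensional toys (the DPR base models produce 768-dimensional embeddings), the names dot and top_k are hypothetical, and it scores every passage by brute force rather than using an efficient index.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(query_vec, passage_vecs, k=2):
    """Return the ids of the k passages whose vectors score highest against the query."""
    scored = [(dot(query_vec, vec), pid) for pid, vec in passage_vecs.items()]
    scored.sort(reverse=True)  # highest dot product first
    return [pid for _, pid in scored[:k]]


# Toy embeddings, invented purely for illustration.
passages = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.8, 0.3],
    "p3": [0.7, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

ranked = top_k(query, passages, k=2)
print(ranked)  # ['p1', 'p3']
```

The passage whose vector points most nearly the same way as the query vector gets the largest dot product, so it is returned first.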

So I'll explain that a little bit better in a moment. First, what we want to do is actually initialize it, so we do from haystack.retriever.dense import DensePassageRetriever, and then we'll put it into a variable called retriever, which uses the DensePassageRetriever from up here. In here we need to pass a few parameters. The first is document_store, which is just what we've already initialized, doc_store. Then we need to initialize two different models: the query embedding model and the passage embedding model. Behind the scenes Haystack is using the Hugging Face Transformers library, so we'll head over to the models over there and see which embedding models we can use for DPR.

Okay, so here let's just search for DPR, and you'll find we have all these models from Facebook AI. Now, the reason DPR is so useful for question answering is that we have two different models that encode the text we pass into it. We have this sort of setup during training, and what we see down here are these two models: we have this EP BERT encoder and we also have this EQ BERT encoder. The EP BERT encoder encodes the passages, or the contexts, so essentially the paragraphs that we have fed into our Elasticsearch document store; this is what will be encoding them into these vectors here. Now, this whole graph is during training. All we will actually see when we're encoding these vectors is the EP encoder, which will create the EP vectors, and all we're going to do is feed all of the documents from Elasticsearch into it. Once all of these have been encoded we have a new set of dense vectors, and all of those will be fed back into our document store, so back into Elasticsearch.

When it comes to performing similarity search later on, we're going to ask a question, and that question will be processed by the EQ encoder. So here we have our EQ encoder and we have our question; that will go into here, which will encode our question and then send it over to Elasticsearch and say, okay, what are the most similar vectors to this vector that we created from the question? The reason DPR is so good is that, if you look at the training down here, we are creating these EP vectors and these EQ vectors so that they match. Where we have a question with its matching context, we are training them to maximize the dot product, because the dot product measures the alignment between those two vectors. What happens is that a relevant passage and a relevant question will come out with very similar vectors.

One example that I like to use: if our question was "What is the capital of France?", the embedding created from that will look like the embedding of a context that says something like "The capital of France is" and then something here. We don't know what it will put there, because it doesn't actually know what the capital of France is; it's just doing linguistic transformations to try and figure out what sort of context the answer would come from. Then of course when you feed this vector into Elasticsearch, the most similar vector will be the one that contains the answer to our question, which is something like "The capital of France is Paris." We don't have Paris here, but it will be able to figure that out, because that will be the most similar sequence to the context that DPR has produced.

Now, back to Hugging Face. Here you can see we have these multiple DPR models, and what we want is a pair: a question encoder and a ctx, which is context, encoder. We'll be using the single-nq-base pair, so what I'll do is just copy this and in here we just add in our model. Okay, so that's the question encoder. What we also need is the context encoder, where instead of question here we just put ctx. We have two other parameters that we need to add in here. One is use_gpu: if you're using a GPU, obviously you set this to True; if not, you go with False. It will take a little bit of time to process this if you're not using a GPU, though. Then we also add embed_title=True as well. Now what we should see is this will execute without error, hopefully. Okay, great.

Then what we need to do is update the embeddings within Elasticsearch. What we've done here is kind of set up the process, and now we need to update the documents that we have in Elasticsearch to have DPR embeddings. To do that we go doc_store.update_embeddings, and in here we pass our retriever. Now this may take, well, this will be really quick for me. We don't have that many documents, and even on CPU, with how few documents we have, it should be pretty quick. What we see here is that we created these embeddings and then posted them again to our index, so that is pretty cool.

Now what we need to do is just test that it actually works. Let's go with retriever, and this is how we get contexts from our Elasticsearch document store: we write retrieve and then pass in a query. Let me just find something here, like "What did you learn from your great-grandfather?" maybe, or from Verus. Yeah, let's go with grandfather: "What did your grandfather teach you?" I don't know if this is going to work, but let's see. Okay, so you see that we return quite a few contexts here. We haven't set up the full thing, so we're just returning what it sees as being relevant contexts; we are not actually extracting an answer yet, because that will be the job of our reader model. What we have is "From my great-grandfather", so we have that one, which is okay. Some other ones here; let's type in grandfather. Okay, so it's just returning that one, which is fine. It's not perfect, but what we would expect to do in reality is return more.

Let's try another one as well, and let's say "Who taught you about freedom of will?", so "Who taught the freedom of will?" We see here, okay, in the first one we don't get the correct context that we want, and we go down and, ah, there it is. So here is the context that we wanted to return. It returns that as the fourth best context, which is fine, because when we build our reader model later on we kind of expect it to sort those a little bit better than our retriever model. This is pretty cool and, I think, definitely a good start.

So now we have retrieved Meditations, set up our document store, and now we have also set up our retriever, so we can cross that off too. The next thing is our reader model. I think that's it for this video. In the next one, of course, we'll move on to that reader model, and let's just see how that goes, but so far I'm pretty happy with that. So thank you for watching, and I'll see you again in the next one.
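For reference, the retriever steps covered in this video can be sketched roughly as below. This is a sketch under a few assumptions rather than the exact notebook code: the import path is the one from older Haystack releases (it has moved in later versions), the model names are the facebook single-nq-base pair on the Hugging Face hub, doc_store is assumed to be the Elasticsearch document store that was already initialized, and the helper names build_retriever and embed_and_retrieve are hypothetical.

```python
# The question and context encoders are used as a matching pair.
QUERY_MODEL = "facebook/dpr-question_encoder-single-nq-base"
PASSAGE_MODEL = "facebook/dpr-ctx_encoder-single-nq-base"


def build_retriever(doc_store, use_gpu=False):
    """Create a DPR retriever on top of an existing document store."""
    # Imported inside the function so this sketch loads without Haystack installed.
    # Assumption: import path as in older Haystack releases.
    from haystack.retriever.dense import DensePassageRetriever

    return DensePassageRetriever(
        document_store=doc_store,
        query_embedding_model=QUERY_MODEL,
        passage_embedding_model=PASSAGE_MODEL,
        use_gpu=use_gpu,   # set to True if a GPU is available
        embed_title=True,
    )


def embed_and_retrieve(doc_store, retriever, query):
    """Write DPR embeddings to the document store, then fetch contexts for a query."""
    # Updating embeddings only needs to happen once per index;
    # later queries can call retriever.retrieve directly.
    doc_store.update_embeddings(retriever)
    return retriever.retrieve(query)
```

With Haystack installed and Elasticsearch running, something like embed_and_retrieve(doc_store, build_retriever(doc_store), "What did your grandfather teach you?") would return the candidate contexts; extracting the actual answer from them is left to the reader model.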