Back to Index

How to build a Q&A AI in Python (Open-domain Question-Answering)


Chapters

0:0 Why QA
4:5 Open Domain QA
8:24 Do we need to fine-tune?
11:44 How Retriever Training Works
12:59 SQuAD Training Data
16:29 Retriever Fine-tuning
19:32 IR Evaluation
25:58 Vector Database Setup
33:42 Querying
37:41 Final Notes

Transcript

Today we're going to have a look at open domain question answering and How can fine-tune our own retriever model to use for open main question answering? we're going to start with a few examples over here, we have Google and we can ask Google questions like we would a normal person so we can say How do I tie my shoelaces?

So what we have right here is is three components to the the question and answer and I want you to remember these these are Relevant for what we are going to be building. There's a query at the top We have what we can refer to as a context which is a video which is where we're getting this small more specific answer from and we can ask another question is Google Skynet So we have a question at the top.

We have this paragraph which is our context and then we have the answer which is yes Which is highlighted here. So it's slightly different to the the previous one. We had the video this time We have actual text which is our context And this is more aligned with what we will see Throughout this video as well.

Now what we really want to be asking here is one how does Google do that and More importantly is why should we care now? We can imagine this problem as Us being in a really big warehouse. Now. We don't know anything about this warehouse. We just know that we want a Certain object now, we don't know the specific name for the object We just kind of know what it looks like now in this warehouse.

Everything we assume is probably going to be organized in some way so the first thing we're going to do is have a look at the products around us and try and figure out there's some sort of order and Once we have figured out how this warehouse is structured and how we can search through the warehouse We need to figure out.

Okay Everything is maybe organized based on the name of the product. We don't know the name of the product We're kind of going to struggle along We're going to spend a lot of time searching through a lot of items in order to find what we actually want to find that is like a normal traditional search experience where you have to know the exact keywords that You will find with whatever it is.

You're looking for you have to know the product name now This sort of natural language search that we just saw with Google is not like that warehouse where you're just alone Trying to figure out how it's structured and how to search through it instead It's almost like you have a guide with you someone who knows this warehouse.

They basically live in this warehouse They know where everything is and if they don't know where a specific object is they can point you in a pretty good direction and really help you find everything probably a lot faster because As well as knowing where everything is this person also speaks the same way that we do we can ask them a question like okay guide where are those marble like things and With that they will hopefully be able to guide you in the right direction And maybe they'll be able to guide you right to that product or at least in this area that they are founded so that's the difference between traditional search and a Question and answering search now don't get me wrong that is places where you do want to keep traditional search But particularly for unstructured like text data or is we saw earlier video and audio data This sort of Q&A approach can be really powerful so that leads us on to the second question or first question I asked which is How does Google do that now?

Google is pretty complex, but at the core of what we saw there Was something called open domain question and answering or ODQ a now ODQ a is a set of language models and technologies All piled together into a open domain question answering pipeline now that pipeline At its simplest will look something like this So we have our let's say at the top we have our question that question is going to Come down here, and it's going to hit What is a retrieval model which is with what we will train or fine-tune in this video now?

this retrieval model that will handle Taking our query and converting it into a vector now we convert it into a vector because then we compare it to other chunks of text and We can encode the semantics and meaning Behind that text rather than just the the keywords So that's why we can search for concepts and meaning rather than just keywords as I said mentioned earlier So we have this retrieval model now the retrieval model creates a vector, but then we need to We need something that allows us to search So we need other vectors to compare to now where do we store those vectors?

Well, we have a vector database now a vector database Let's bring that over here This is going to contain Context vectors so you remember earlier on with those Google searches read the question we have the context and the answer This is where that is relevant so in here. We have loads of context so just but see But they've all been converted into vectors using the same model.

We have appeared a retrieval model We just did it before we started searching so we index those contexts into our vector database now at search time we Convert our question into a vector and that comes down into the vector database and the vector database will compare That question vector to all of these context vectors, and it will return the ones that are most similar So maybe we want to return the top five most simple context at this point if we are just using a Retriever model and the vector database we can return those We had our question and we can return these context to the user So that would be like in our earlier examples we ask a question and Google returns like the page or it just returns a paragraph to you rather than the highlighting this specific answer and this is these are the components that we're going to cover today and In a future video and an article what we are also going to include is a reader model So a reader model is added on to this open the main Q&A sack or it's the last component of the of the sack and this What this does is it takes each of your?

Context vectors, and it reads them so it has a look at the context vectors and we have this long text here and it says okay given the question which we also feed into our reader model I Think the answer to that question is Right here, okay, so it allows us to Extract a very specific answer from a longer chunk of text the context so that's the open domain Q&A Structure or pipeline and now I think we should move on to actually fine-tuning the retriever component of that pipeline the first thing we need to think about when fine-tuning our retriever model is Whether or not actually needs to be fine-tuned because there are retrieved models out there already that we can just download and use and There's this really good concept I saw from in one of Nils Reimer's YouTube videos where he's going to talk and he talks about the long tail of semantic relatedness and The basic gist of it is that you have common knowledge that everyone pretty much everyone knows about and You have a lot of data and benchmarks in that area So that would be our cat versus dog example up here.

So imagine you're on the street. You walk up to a stranger You ask them a question. What was the difference between cat and dog? They're probably going to know the answer. Everyone knows that Then you get more specific. So you ask them. Okay, what's the difference between C and Java?

Some people know some people will not and then we get even more specific Hightorch tense flow and then we get even more specific Roberta versus D Burton and TSA versus mirror Bert as we get more specific less and less people know what you're talking about and With that there are less datasets and there are less benchmarks, but at the same time that's where most of the interesting use cases exist now whether your use case exists within the common knowledge area or within the The long tail is Really how you can?

hazard a guess at whether you need to fine-tune a retriever model or not So if we just modify that chart a little bit and we get this so we have same thing common knowledge on the y-axis and What we have on the right. I've just renamed it. So the ease of finding a model and/or data Okay, so based on how niche your use case is the harder is going to be to find data For that niche and the less data there is out there the less likely someone else is already trained or pre-trained a model so most of the pre-trained models out there are trained on a very generic broad range of Concepts like that a train on Wikipedia pages or something like that So that's fine You you know If you're comparing cats versus dogs or even C versus Java The model is probably being pre-trained on something similar to that and it might be able to figure out But if your use case is more specific like you have I don't know like some very specific financial documents or technical documentation something along those lines where not many people understand the content that documentation and There's not no it's not general knowledge or it's not easily accessible on the internet in that case You will probably need to train or fine-tune your own model.

So If that is the case, how how do we do that? Well to train a retrieved model we need pairs of texts. So we need questions and Relevant context. So that's where you can see here. We have question a context a they're both related and they both end up In the same sort of area.

That's what we need to teach our retrieving model to do We tell our truth model. Okay question a and context a Process them and output a vector for each one of those and then we want to look at those two vectors and say, okay Are they similar or not if they're not similar we?

Tell the model look you need to figure this out and make them more similar if they are similar thing then great good job That's what we're optimizing on. We're optimizing on minimizing that That difference between similar pairs and maximizing the difference between dissimilar pairs Now our data is going to just contain all of these rows where we have the question and context pairs We don't need labels Because we are going to be using multiple negative negatives of ranking loss Which we'll discuss in a minute is also a video on that if you do want to go into a little more depth so To train our model.

We're going to be using the squad to data set From over here and in there we have our questions and we have context. Okay That's what we're going to be training on those two pairs So let's take a look at what that process looks like So as I said, we're going to be using squad v2 data set We're going to pulling that from hooker base data sets you may need to pip install data sets if you do not have that already and This is how we load the data set.

So we've got squad v2 and we're getting the training split of that because there's also validation split that we will use later and From now What we will see is we'll get an ID title context the question So I only do question and context is really that important for us so we can go down Just a few examples here Samples of rows from the data set and Like I said before we we need to take the question and context pairs from the data set.

So to do that We what we're taking them here. We're also creating this inputting example or list of input example objects now Input example is just the data format that we use when we are training with the sentence transformers library And which you can see here now again, if you if you do need to install that it's just pipping salt sense and transformers and TQDM is just a progress bar that you see down here.

That's all so what we're doing is Just appending a load of input examples where we have to question and context We don't have any label here because we are going to be training with multiple negative ranking loss where we don't need labels now Also because we're using that type of loss M&R loss We need to make sure that each batch does not include duplicates now the reason for this is When we train with M&R loss we're going to be putting everything in in or training in batches and The model is going to be looking at two pairs like this It's going to take the question and it's going to say okay for that question this context here Needs to be the most similar and all of these contexts need to be What as dissimilar as possible?

now the problem with if you have duplicates in your Batches is that you have let's say this exact same maybe not the exact same question But you have to use that same context down here now your model is going to optimizing To make all of these as dissimilar as possible But also this one here, even though it's exactly the same as the one that optimizing to be more similar So you really need to try and avoid this.

It's okay if it happens occasionally in the odd batch But you need to avoid it as much as possible. So that's why we use this no duplicates data loader This will make sure we don't have any duplicates within these batch So go down we need to initialize a sentence transformer again.

This is Same as what we usually do a rim here, but we're actually using the Microsoft MP net model Now MP net is really good for the sentence transformers in general If I if I did this with a better model, I think the performance is Two percentage points less than if I use a MP net model, it's not huge, but it does make a difference So it's good to try both if you want, but this one is the one I'd go with So here we've initialized the model we also have this pooling layer now the pooling layer is important That is what makes a sentence transformer on just a normal transformer and it works Like this so we have our sentence It will get tokenized and split into many tokens or all down here and put into birds and peanut or some of the transformer and On the output of that we get all these token vectors now We all these token vectors represent our sentence, but there are loads of them We want a single vector to represent a single sentence.

So we use this pooling light and this pooling layer takes all of those vectors and Takes the average in every single dimension So we take the average and we get our sentence vector. That's based how and why we use the pooling layer So we have that we can come down we see our transform model.

We have the pooling layer. It's good like I said earlier, we are going to be using MNR loss for training and That is a batching thing where we want to get the pairs together and then we rank all the other contexts as dissimilar So we initialize it like this using the sentence transformers library and then we're ready to train our model So it's really not that Complicated particularly if you're using sentence transformers libraries.

It's very easy so we warm up for 10% of the Training steps. That's a pretty standard number for sentence transformers You can modify a little bit but 10% is usually pretty good. It just helps make sure we don't overfit and the same for Almost always set that to one when you're fine-tuning central sentence transformers.

Otherwise, it really does tend to overfit So it's usually a good idea It shouldn't take too long to train. It was 30 minutes for me here might take a bit longer But really it's not not training on too much here I Think we have like a hundred thousand. Yeah, I think we have a hundred and 30,000 training samples in the spot data now Another really important thing is evaluation.

So once we've trained our model, how do we know that actually works? Like is it performing? Well, you know, we don't know so we need to measure how Accurately our model is retrieving the correct context for a particular question, which is a slightly different other operations or evaluation metrics that we use with other language models and To evaluate the retrieval performance.

We use this information retrieval evaluator now from this We're going to be using the map at K metric which is in short an average precision value or Fraction of returned Context are relevant to the question we asked and the at K component of that is just saying We are going to consider the top K return result So if it's at 10 is in case you turn we're going to return 10 context and then we're going to calculate that metric from those 10 return context now by default This evaluator is going to using that map K metric So we initialize it like this so information retrieval evaluator and It needs data.

We're going to be using validation set of the same data. We use before so the squad V2 data set And it looks the same. Okay, we have ID title context question answers We at the moment we're going to need ID Context and question now this evaluator needs us to map relevant questions and context Using those IDs.

So what we're first going to do is we're going to convert this into a pandas a frame So I find it a little easy to work with We're going to be doing here. So I'm writing it to a date frame here and see we have context ID question, which is all we need Now we need to assign a specific or a new ID to the context because at the moment if you look here, we have an ID and it's shared by the context and also the question and Another thing is the ID for the context like here, you know context is saying the ID is different So what we're going to do is use these IDs for the questions because they all the IDs need to be unique I'm going to create a new ID for each context.

So we're going to deduplicate that data frame so we have question and ID nothing else and Then we're just going to append this con onto the end of our ID. So now we have unique IDs for all of our context as well as our questions and What we can now do is merge or form in a join without no dupes Data frame and the original date frame.

So Do that and now we have a unique ID. So this ID why? For each of our context so you can see ID why is not changing where we have these duplicate context and then we have IDX Questions. So this is what we need for value vector So we need to reformat this into three different Dictionaries, so we have our queries, which is a mapping of question ID to Question like the text we have our corpus which is a mapping of the context ID to the actual context and we also have our relevant docs, which is a mapping of the question ID to relevant Context ID so you can have multiple context IDs for a particular question But in this case, it is actually just a one-to-one mapping So we first create the IR queries Dictionary, so it's literally just the the ID as a key and the question as a value So great all of those key value pairs same again for IR corpus exactly the same but this time for the context and Then if we come down here, so this one's slightly different So we this is mapping the question ID to a set of relevant context IDs now We could map these directly But what I've done here is in this if you use this same script and you have an example where maybe you have multiple Contexts for each question this will handle those and so you'll get a list or a set of multiple context IDs in here on just one if That is relevant and we see that we have multiple In our case, we have multiple Questions that map to a single context ID So it's actually many to one rather than one to one like I said before You Now with each of those three dictionaries we can initialize our Evaluator which is just passing like this and we evaluate and that's really simple.

There's nothing Nothing else to it and the so the map at K performance for that is 0.74 which is is good if you we compare it to like some of the state of the art retrieval models So this is also using MP net and it's been trained on a lot more data, by the way as well This is more more general This is getting 76 percent.

So a little bit better than ours like two percentage points better But this is the squad data set this other model here has I'm pretty confident It's been trained also on the squad data set So it has already seen that day if you do the same thing, but for your own more niche data set I Think most likely you will get the model that outperforms any other pre-trained models But of course, you know, definitely just test it evaluates you what you get Okay, so we have fine-tuned our retrieval model now But we can't really use it unless we saw the vectors that we create with our retrieval model somewhere So we need a vector database now To do that.

We need to take a few Steps, so we have already done a few. Okay, so we have trained Specifically we trained our retrieval model so we can cross that off We've also downloaded a load of context so that the squad context will use a validation set for that. Okay now the next step in on this side is to encode those contexts which we will do in a moment and Then over here.

We also need to initialize our index for our vector database. So so we're going to do that and then we're going to take those encoded context and now initialize index and we're going to Populate that index with our encoded context vectors. So let's get started with that Okay, so in a new notebook now So if you if you're following along and maybe your model is still training or fine-tuning That's fine.

You can actually just download the model that we trained here So it's just pine cone and peanut retriever spot - that that's a model that you saw a moment ago If you want to try to bet one as well, you can replace this with But and that's also there so we have that model we can see it looks same as what we had before and I'm also just going to reload the validation data set now I'm going to here I'm going to extract all of the unique context because we have like we saw before the score date has many copies of the same context and we don't want to have duplicates in our index of the same vectors because well if we're searching and we Compare the distance between our query vector and all these contexts and we have like five that are in the exact same position Can it return all of those so we don't want to do that.

So we need to remove Duplicates. Okay, so we're looping through the whole data set We're checking if context is already within this unique context lists I've initialized if it isn't we add it to that list and we also add the ID of So that remember the question ID context ID was shared.

All right, we're just using that ID this time We're not using the con unique context ideas that we created before So We loop through and then obviously when it sees that context again It's not going to add that different ID to the unique IDs list and then we can filter using the face datasets library here to only keep the IDs or the rows which have an ID from the unique idealist So with that we end up with just one of each context and you can see here We only get 1200 rows of that which is much less than the the full squad validation set Now what we want to do is encode Those context in the context vectors.

So we use our model that we've initialized and we use the encode method now We also convert that vector. So this I think outputs a pytorch tensor We convert into a list because we are going to be sending all this to our pinecone vector database index Through an API and for that we want to have them in a list format So we encode the context and now we can see okay We have all the other features that we had before but now we also have the encoding feature So we move on to the vector database.

Now, how do we use this vector database? We first need to install the pinecone client here So again for using the pinecone vector database so we come down and we can Import pinecone. So this will just import the way that we interact with our pinecone vector database and We then initialize a connection.

So for this you do need a free API key. So it's free I think you don't need to don't need to pay anything for that. So you just go to app.pinecone.io and You create an account and then with the API key that you're given you enter it in here So just in there pinecone in its API key and you also set your cloud environment so us-west1 GCP and Then what we can do once we've initialized that connection is we create a new index So to create that index we only need this little chunk code we need this dot create index method and We specify the name of the index.

I'm going to just call it squad index. You call it whatever you want and The dimensionality of our vectors as well so you can see here It's seven hundred and sixty eight and you can also define the metric now you could also use Euclidean distance, but we are going to stick with the default which is cosine and Yeah, that's all you need to you only need to run this but at the same time because I have multiple Indexes and I'm rerunning this code and testing it.

I Added this in so I'm just saying if squad index is not already existing And so I'm checking that it is not already running that and then creating it And then we connect so specifically to our index run just pinecone database as a whole We connect so we use pinecone index and specify the index that we'd like to connect to And then from now we just use this index object for everything we're doing so here we are just preparing our data to Upset into pinecone, which is just upload into into our pinecone index, so when we are uploading data into pinecone we three Components we don't need all of them, but I'll go through them So we we have a list and within that list We have these tuples in there.

We have the ID of the Vector or the entry record whatever and then you have your encoding or the vector itself and then Optionally, you don't need to include this so you can remove it if you want. You have a dictionary of metadata so this is just key value pairs that you can use but things like metadata filtering or if you want to return a Particular piece of data with your search you can include that in there as well Which is why I'm using it for here and you don't need to do this you could saw the text locally and but I'm just doing it like this because it's a little bit easier for this little script and With that we we just upset so we just run this index insert.

I'm doing in batches you can You could probably increase the batch size here, but I'm just sticking that and it is reasonably. It's pretty quick. Anyway, so 30 seconds So with that our Index so our vector database and our retriever ready so we can actually begin querying and you know asking a question and returning a relevant context Okay, so we're on to Another notebook for this.

So again, I'm going to initialize a retriever model and I'm going to import pine cone initialize connection To the index to the squad index again now This is pretty straightforward Fortunately now we've worked at the end all we do is we take a question, so I'm gonna say when were the Normans in Normandy and We just write so model and code include our query within a list here because for example if you had multiple queries you might have that in a list something else up here and In that case, you would not have to add these square brackets but in this case, we just have a single query and We create a pytorch tension tensor Which is our query vector and we convert it into a list because we're going to be sending it through the pine cone API again And what I'm doing here is I'm querying so this is going to the pine cone vector database and currying I'm saying I want to find the most similar vectors or context vectors to this query vector or question vector I want to return just the top two contexts that you you find and I want to include a metadata in that response because in the metadata I included the text so we can come down here and We see okay.

This first example isn't actually what we want. So in here with what we ask When were the Normans in Normandy? This doesn't actually answer that question but then the second one so if I Open this up in the text editor. So come down here We have the second question So the Normans were the people who in the 10th and 11th centuries gave their names Normandy So you can assume that they were probably in Normandy in that time.

So that's when they were there So we get the correct context in second position for this one. We also get the score 70s Read us a it's a good score. It's high And Then we have another question here. So we've got three questions we'll go through each one is very quickly how many outputs are expected for each input in a function problem and We do actually return the correct answer for that one straight away.

So a function problems a computational problem Where a single output is expected for every input. So that's how I saw specific answer there Okay, and get a really high score for that So it's pretty confident that this is correct and then a lower score for the other ones where the actual question is not answered so One final one here.

So I've changed the wording a little bit for this one because I don't want to be You know just doing a keyword search. So I put who use Islamic Lombard, etc construction techniques in the Mediterranean and What you know, I modified that a little bit and we do actually return the correct answer straight away Confidence 0.6.

I should compare it to the other ones which are much lower. That's pretty good. There's a good separation there Which is what we want to be looking for so That is a I know it's pretty long. There's a lot of components and moving parts there, but that's the open domain Question and answering pipeline or at least the vector database of retriever components of that We've also have a look at how we can fine-tune our own retreat.

So we've covered quite a lot so I mean with all that you're Ready to just go ahead and implement what I think are probably two Most crucial components in open domain question answering if you if you don't have a good Vector database and good retriever model you are going to be returning Not good context your reader model And if you don't have good context for your reader model your reader model isn't going to give you anything good So these two are probably the most important parts, right if your reader model is rubbish Maybe it gives you a kind of weird Span or answer but at least you've got a good context to to go off of so you're your users Are at least getting some relevant information now, I think one of the the coolest Things about open domain question answering is just how widely applicable is Basically any company across so many industries across the world Can use this if they have?

unstructured data that they need to essentially open the doors to for their Staff or their users right if you want to get data or information to someone which is a big part of most jobs If you want to get data information someone more effectively and using a more natural Form of search and question answering this is probably applicable to you and maybe your company Anyway, that's it for this video Thank you very much for watching.

I hope it's been useful and I'll see you again in the next one