
How to build a Q&A AI in Python (Open-domain Question-Answering)


Chapters

0:0 Why QA
4:5 Open Domain QA
8:24 Do we need to fine-tune?
11:44 How Retriever Training Works
12:59 SQuAD Training Data
16:29 Retriever Fine-tuning
19:32 IR Evaluation
25:58 Vector Database Setup
33:42 Querying
37:41 Final Notes


00:00:00.000 | Today we're going to have a look at open domain question answering and
00:00:03.520 | how we can fine-tune our own retriever model to use for open-domain question answering.
00:00:10.980 | we're going to start with a few examples over here, we have Google and
00:00:15.500 | we can ask Google questions like we would a
00:00:19.920 | normal person so we can say
00:00:22.940 | How do I tie my shoelaces?
00:00:29.020 | So what we have right here is three
00:00:32.700 | components to the question and answer, and I want you to remember these, they are
00:00:39.420 | Relevant for what we are going to be building. There's a query at the top
00:00:44.080 | We have what we can refer to as a context, which is a video, which is where we're getting this smaller,
00:00:51.720 | more specific answer from, and we can ask another question:
00:00:57.560 | is Google Skynet?
00:00:59.480 | So we have a question at the top. We have this paragraph which is our context and then we have the answer which is yes
00:01:06.200 | which is highlighted here. So it's slightly different to the previous one. We had the video; this time
00:01:12.340 | We have actual text which is our context
00:01:14.560 | And this is more aligned with what we will see
00:01:19.500 | Throughout this video as well. Now what we really want to be asking here is
00:01:26.440 | one, how does Google do that, and
00:01:28.960 | more importantly, why should we care? Now, we can imagine this problem as
00:01:35.920 | Us being in a really big warehouse. Now. We don't know anything about this warehouse. We just know that we want a
00:01:44.760 | Certain object now, we don't know the specific name for the object
00:01:50.880 | We just kind of know what it looks like now in this warehouse. Everything we assume is probably going to be organized in some way
00:01:58.720 | so the first thing we're going to do is have a look at the
00:02:02.500 | products around us and try and figure out there's some sort of order and
00:02:06.640 | Once we have figured out how this warehouse is structured and how we can search through the warehouse
00:02:14.440 | We need to figure out. Okay
00:02:18.480 | Everything is maybe organized based on the name of the product. We don't know the name of the product
00:02:24.200 | We're kind of going to struggle along
00:02:26.240 | We're going to spend a lot of time searching through a lot of items in order to find what we actually want to find
00:02:32.440 | That is like a normal, traditional search experience where you have to know the exact keywords that
00:02:38.840 | will be found alongside whatever it is you're looking for; you have to know the product name. Now,
00:02:45.600 | This sort of natural language search that we just saw with Google is not like that warehouse where you're just alone
00:02:53.080 | Trying to figure out how it's structured and how to search through it instead
00:02:57.280 | It's almost like you have a guide with you someone who knows this warehouse. They basically live in this warehouse
00:03:03.920 | They know where everything is and if they don't know where a specific object is
00:03:07.720 | they can point you in a pretty good direction and
00:03:12.200 | really help you find everything probably a lot faster because
00:03:17.480 | As well as knowing where everything is this person also speaks the same way that we do we can ask them a question like okay
00:03:25.320 | guide where are those
00:03:27.440 | marble like things and
00:03:30.200 | With that they will hopefully be able to guide you in the right direction
00:03:34.520 | And maybe they'll be able to guide you right to that product, or at least to the area where it can be
00:03:41.400 | found.
00:03:43.760 | That's the difference between traditional search and a
00:03:47.360 | question-answering search. Now don't get me wrong, there are places where you do want to keep traditional search,
00:03:54.560 | but particularly for unstructured data like text or, as we saw earlier, video and audio data,
00:04:00.640 | This sort of Q&A approach can be really powerful
00:04:04.880 | so that leads us on to the second question or first question I asked which is
00:04:11.000 | How does Google do that now?
00:04:13.400 | Google is pretty complex, but at the core of what we saw there
00:04:18.640 | was something called open-domain question answering, or ODQA. Now ODQA is
00:04:25.640 | a set of
00:04:28.760 | language models and
00:04:31.400 | technologies
00:04:33.040 | all piled together into an open-domain question answering pipeline. Now that pipeline
00:04:40.720 | At its simplest will look something like this
00:04:43.880 | So we have our let's say at the top we have our question
00:04:49.440 | that question is going to
00:04:52.200 | Come down here, and it's going to hit
00:04:54.640 | what is a retriever model, which is what we will train or fine-tune in this video. Now,
00:05:02.600 | this retrieval model
00:05:05.240 | that will
00:05:07.520 | handle
00:05:08.840 | Taking our query and converting it into a vector now
00:05:13.200 | We convert it into a vector because then we can compare it to other chunks of text, and
00:05:18.200 | We can encode
00:05:21.120 | the semantics and meaning
00:05:23.560 | behind that text rather than just the keywords.
00:05:27.040 | So that's why we can search for concepts and meaning rather than just keywords, as I mentioned earlier.
00:05:33.040 | So we have this retrieval model
00:05:36.400 | now the retrieval model creates a vector, but then we need to
00:05:40.880 | We need something that allows us to search
00:05:44.080 | So we need other vectors to compare to now where do we store those vectors?
00:05:49.960 | Well, we have a vector database now a vector database
00:05:53.940 | Let's bring that over here
00:05:57.120 | This is going to contain
00:06:02.800 | context vectors. So, you remember earlier on with those Google searches we had the question, the context and the answer;
00:06:08.880 | this is where that is relevant. So in here we have loads of contexts, just chunks of text,
00:06:16.240 | but they've all been converted into vectors using the same retriever model we have up here.
00:06:23.520 | We just did it before we started searching, so we index those contexts into our vector database.
00:06:32.160 | now at search time we
00:06:34.160 | Convert our question into a vector and that comes down into the vector database and the vector database will compare
00:06:42.080 | That question vector to all of these context vectors, and it will return the ones that are most similar
00:06:48.400 | So maybe we want to return the top five most similar contexts. At this point, if we are just using a
00:06:56.360 | retriever model and the vector database, we can return those;
00:06:59.740 | we had our question, and we can return these contexts to the user.
00:07:04.240 | So that would be like in our earlier examples we ask a question and Google returns
00:07:08.920 | like the page, or it just returns a paragraph to you, rather than
00:07:14.720 | highlighting the specific answer, and
00:07:17.560 | these are the components that we're going to cover today, and
00:07:22.680 | In a future video and an article what we are also going to include is a reader model
00:07:29.240 | So a reader model is added on to this open-domain
00:07:32.920 | Q&A stack, or it's the last component of the stack, and
00:07:39.560 | what this does is it takes each of your
00:07:42.200 | context vectors and it reads them. So it has a look at the context vectors,
00:07:48.560 | and we have this long text here and it says okay given the
00:07:54.680 | question which we also
00:07:57.240 | feed into our reader model I
00:08:00.000 | Think the answer to that question is
00:08:05.280 | Right here, okay, so it allows us to
00:08:08.960 | Extract a very specific answer from a longer chunk of text the context
00:08:16.760 | So that's
00:08:18.240 | the open-domain Q&A
00:08:21.320 | structure or pipeline, and now I think we should move on to actually fine-tuning the retriever component of that pipeline.
00:08:29.080 | the first thing we need to think about when fine-tuning our retriever model is
00:08:33.320 | whether or not it actually needs to be fine-tuned, because there are
00:08:38.080 | retriever models out there already that we can just download and use, and
00:08:43.640 | there's this really good concept I saw in one of Nils Reimers' YouTube videos,
00:08:49.000 | where he gives a talk about the long tail of semantic relatedness, and
00:08:55.800 | The basic gist of it is that you have
00:08:59.880 | common knowledge that pretty much everyone knows about, and
00:09:05.160 | You have a lot of data and benchmarks in that area
00:09:10.000 | So that would be our cat versus dog example up here. So imagine you're on the street, you walk up to a stranger,
00:09:16.040 | you ask them a question: what's the difference between a cat and a dog? They're probably going to know the answer; everyone knows that.
00:09:21.600 | Then you get more specific. So you ask them. Okay, what's the difference between C and Java?
00:09:27.880 | Some people will know, some people will not, and then we get even more specific:
00:09:33.320 | PyTorch versus TensorFlow, and then we get even more specific: RoBERTa versus DeBERTa, and TSDAE versus Mirror-BERT. As
00:09:40.920 | we get more specific, fewer and fewer people know what you're talking about, and
00:09:46.760 | with that there are fewer datasets and fewer benchmarks, but at the same time that's where most of the interesting use cases
00:09:55.640 | exist
00:09:57.280 | Now, whether your use case exists within the common knowledge area or within
00:10:03.600 | the long tail is
00:10:05.600 | really how you can
00:10:08.440 | hazard a guess at whether you need to fine-tune a retriever model or not.
00:10:12.520 | So if we just modify that chart a little bit
00:10:15.840 | and we get this so we have same thing common knowledge on the y-axis and
00:10:20.800 | what we have on the right I've just renamed, so it's the ease of finding a model and/or data.
00:10:28.840 | Okay, so based on how niche your use case is, the harder it's going to be to find data
00:10:36.720 | for that niche, and the less data there is out there, the less likely it is someone else has already trained or pre-trained a model.
00:10:45.960 | most of the pre-trained models out there are
00:10:48.600 | trained on a very generic broad range of
00:10:54.200 | concepts, like they're trained on Wikipedia pages or something like that.
00:10:57.560 | So that's fine
00:10:59.680 | You know,
00:11:00.520 | if you're comparing cats versus dogs, or even C versus Java,
00:11:03.520 | the model has probably been pre-trained on something similar to that and it might be able to figure it out.
00:11:08.920 | But if your use case is more specific like you have I don't know like some very specific
00:11:15.920 | financial documents or
00:11:19.080 | technical documentation, something along those lines where not many people understand the content of that documentation, and
00:11:26.160 | it's not general knowledge, or it's not easily accessible on the internet, in that case
00:11:33.860 | you will probably need to train or fine-tune your own model. So,
00:11:39.120 | if that is the case, how do we do that?
00:11:44.320 | Well, to train a retriever model we need pairs of text. So we need questions and
00:11:50.760 | relevant contexts. That's what you can see here: we have question A and context A, they're both related and they both end up
00:11:58.200 | in the same sort of area. That's what we need to teach our retriever model to do.
00:12:03.040 | We tell our retriever model, okay, here are question A and context A:
00:12:08.640 | Process them and output a vector for each one of those and then we want to look at those two vectors and say, okay
00:12:15.200 | are they similar or not? If they're not similar, we
00:12:17.960 | tell the model, look, you need to figure this out and make them more similar; if they are similar, then great, good job.
00:12:24.460 | That's what we're optimizing on. We're optimizing on minimizing
00:12:28.780 | That difference between similar pairs and
00:12:33.260 | maximizing the difference between dissimilar pairs
00:12:37.840 | Now our data is going to just contain all of these rows where we have the question and context pairs
00:12:45.320 | We don't need labels
00:12:47.040 | because we are going to be using multiple
00:12:49.040 | negatives ranking loss,
00:12:51.160 | which we'll discuss in a minute; there's also a video on that if you do want to go into a little more depth.
00:12:59.280 | To train our model we're going to be using the SQuAD v2 dataset
00:13:04.040 | from over here, and in there we have our questions and we have contexts. Okay,
00:13:10.380 | that's what we're going to be training on, those two pairs.
00:13:13.560 | So let's take a look at what that process looks like
00:13:17.200 | So as I said, we're going to be using the SQuAD v2 dataset;
00:13:21.000 | we're going to be pulling that from Hugging Face Datasets.
00:13:23.280 | You may need to pip install datasets if you do not have that already, and
00:13:28.200 | This is how we load the data set. So we've got squad v2 and we're getting the training split of that because there's also
00:13:34.840 | a validation split that we will use later, and
00:13:37.560 | from that,
00:13:40.960 | what we will see is we'll get an ID, title, context and the question.
00:13:46.060 | Only the question and context are really that important for us, so we can go down
00:13:52.040 | and look at just a few examples here,
00:13:57.840 | samples of rows from the dataset.
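As a rough sketch of that loading step (assuming the Hugging Face datasets library; the variable names are just illustrative):

```python
from datasets import load_dataset

# load the SQuAD v2 training split from Hugging Face Datasets
squad = load_dataset("squad_v2", split="train")

# each row has id, title, context, question and answers fields;
# we only really need the question and context pairs
print(squad[0]["question"])
print(squad[0]["context"])
```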
00:14:02.080 | Like I said before, we need to take the question and context pairs from the dataset. So to do that,
00:14:11.020 | we're taking them here and we're also creating this InputExample, or a list of InputExample objects. Now,
00:14:19.860 | InputExample is just the data format that we use when we are training with the sentence-transformers library,
00:14:27.360 | which you can see here.
00:14:29.940 | Now again, if you do need to install that, it's just pip install sentence-transformers, and
00:14:36.200 | tqdm is just the progress bar that you see down here, that's all. So what we're doing is
00:14:42.920 | just appending a load of InputExamples where we have the question and context.
00:14:47.880 | We don't have any label here because we are going to be training with multiple negatives ranking loss, where we don't need labels.
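A minimal sketch of that step (again, the variable names are illustrative):

```python
from sentence_transformers import InputExample
from tqdm.auto import tqdm

train_samples = []
for row in tqdm(squad):
    # each training pair is just (question, context); no label is needed,
    # because multiple negatives ranking loss only uses positive pairs
    train_samples.append(
        InputExample(texts=[row["question"], row["context"]])
    )
```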
00:14:58.320 | Also, because we're using that type of loss, MNR loss,
00:15:01.960 | we need to make sure that each batch does not include duplicates. Now,
00:15:08.340 | the reason for this is
00:15:11.000 | when we train with MNR loss
00:15:13.440 | we're going to be
00:15:15.360 | putting everything in, or training, in batches, and
00:15:19.520 | the model is going to be looking at pairs like this.
00:15:24.880 | It's going to take the question and it's going to say, okay, for that question this context here
00:15:30.400 | needs to be the most similar, and all of these other
00:15:35.080 | contexts need to be
00:15:37.920 | as dissimilar as possible.
00:15:43.040 | The problem if you have duplicates in your
00:15:49.560 | batches is that you have, let's say, the exact same, maybe not the exact same question,
00:15:54.400 | but the same context down here. Now, your model is going to be optimizing
00:15:59.220 | to make all of these as dissimilar as possible,
00:16:03.160 | but also this one here, even though it's exactly the same as the one it's optimizing to be more similar.
00:16:09.960 | So you really need to try and avoid this. It's okay if it happens occasionally in the odd batch,
00:16:16.240 | but you need to avoid it as much as possible. So that's why we use this NoDuplicatesDataLoader;
00:16:23.320 | this will make sure we don't have any duplicates within each batch.
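A short sketch of that data loader setup (the batch size is just an illustrative assumption):

```python
from sentence_transformers import datasets

batch_size = 24  # illustrative value; tune to fit your GPU memory

# NoDuplicatesDataLoader ensures no duplicate texts appear in the same batch,
# which would otherwise corrupt the in-batch negatives used by MNR loss
loader = datasets.NoDuplicatesDataLoader(train_samples, batch_size=batch_size)
```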
00:16:27.480 | So going down, we need to initialize a sentence transformer. Again, this is
00:16:35.160 | the same as what we usually do around here, but we're actually using the Microsoft MPNet model.
00:16:43.320 | Now, MPNet is really good for sentence transformers in general;
00:16:47.880 | if I did this with a BERT model, I think the performance is
00:16:52.640 | two percentage points less than if I use an MPNet model. It's not huge, but it does make a difference,
00:17:00.760 | so it's good to try both if you want, but this one is the one I'd go with.
00:17:08.440 | So here we've initialized the model; we also have this pooling layer. Now, the pooling layer is important:
00:17:15.480 | that is what makes a sentence transformer rather than just a normal transformer, and it works
00:17:21.100 | like this. So we have our sentence;
00:17:24.020 | it will get tokenized and split into
00:17:27.360 | many tokens, all down here, and put into BERT or MPNet or some other transformer, and
00:17:35.360 | on the output of that we get all these token vectors. Now,
00:17:39.400 | all these token vectors represent our sentence, but there are loads of them.
00:17:45.680 | We want a single vector to represent a single sentence, so we use this pooling layer, and this pooling layer
00:17:53.120 | takes all of those vectors and
00:17:56.080 | takes the average in every single dimension.
00:18:00.000 | So we take the average and we get our sentence vector. That's basically how and why we use
00:18:05.920 | the pooling layer.
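A minimal sketch of that model setup with mean pooling (assuming the microsoft/mpnet-base checkpoint as the backbone):

```python
from sentence_transformers import models, SentenceTransformer

# transformer backbone (Microsoft's MPNet base model)
mpnet = models.Transformer("microsoft/mpnet-base")

# mean pooling over the token embeddings gives one vector per sentence
pooler = models.Pooling(
    mpnet.get_word_embedding_dimension(),
    pooling_mode_mean_tokens=True,
)

model = SentenceTransformer(modules=[mpnet, pooler])
```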
00:18:08.720 | So we have that; we can come down, we see our transformer model, we have the pooling layer, it's good.
00:18:14.760 | Like I said earlier, we are going to be using MNR loss for training, and
00:18:19.760 | that is a batching thing where we want to get the pairs together and then we rank all the other
00:18:25.800 | contexts as dissimilar.
00:18:29.000 | So we initialize it like this using the sentence-transformers library, and then we're ready to train our model.
00:18:35.160 | So it's really not that
00:18:37.480 | complicated, particularly if you're using the sentence-transformers library; it's very easy.
00:18:42.400 | So we warm up for 10% of the
00:18:46.840 | training steps; that's a pretty standard number for sentence transformers.
00:18:52.160 | You can modify it a little bit, but 10% is usually pretty good. It just helps make sure we don't overfit, and the same for epochs:
00:19:00.280 | almost always set that to one when you're fine-tuning sentence transformers, otherwise it really does tend to overfit,
00:19:08.320 | so it's usually a good idea.
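A hedged sketch of that training step, assuming the loader and model from the sketches above:

```python
from sentence_transformers import losses

# in-batch negatives: each question's paired context is the positive,
# every other context in the batch acts as a negative
loss = losses.MultipleNegativesRankingLoss(model)

epochs = 1  # one epoch is usually enough; more tends to overfit
warmup_steps = int(len(loader) * epochs * 0.1)  # warm up for 10% of steps

model.fit(
    train_objectives=[(loader, loss)],
    epochs=epochs,
    warmup_steps=warmup_steps,
    output_path="mpnet-retriever-squad2",  # illustrative save path
)
```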
00:19:11.560 | It shouldn't take too long to train; it was 30 minutes for me here, it might take a bit longer for you,
00:19:16.760 | but really it's not training on too much here.
00:19:22.760 | I think we have like a hundred thousand... yeah, I think we have about a hundred and
00:19:27.440 | thirty thousand training samples in the SQuAD data.
00:19:33.560 | Another really important thing is evaluation. So once we've trained our model, how do we know that it actually works?
00:19:40.080 | Like, is it performing well? You know, we don't know, so we need to measure how
00:19:47.400 | accurately our model is retrieving the correct context for a particular question, which is
00:19:54.400 | slightly different from the
00:19:56.920 | evaluation metrics that we use with other language models.
00:20:01.920 | To evaluate the retrieval performance, we use this InformationRetrievalEvaluator. Now, from this
00:20:08.800 | we're going to be using the mAP@K metric,
00:20:13.680 | which is, in short, an average precision value, or
00:20:18.320 | what fraction of the returned
00:20:21.720 | contexts are relevant to the question we asked, and the @K component of that is just saying
00:20:28.320 | we are going to consider the top K returned results.
00:20:33.520 | So if it's @10, K is 10: we're going to return
00:20:38.400 | 10 contexts and then we're going to calculate that metric from those 10 returned contexts. Now, by default
00:20:45.480 | this evaluator is going to be using that mAP@K metric.
00:20:50.120 | So we initialize it like this, the InformationRetrievalEvaluator, and
00:20:57.480 | it needs data; we're going to be using the validation set of the same data we used before, so the SQuAD v2 dataset.
00:21:05.200 | And it looks the same. Okay, we have ID title context
00:21:08.880 | question answers
00:21:11.160 | At the moment we're going to need the ID,
00:21:13.520 | context and question. Now, this evaluator needs us to map relevant questions and contexts
00:21:20.800 | using those IDs. So what we're first going to do is convert this into a pandas DataFrame,
00:21:27.680 | as I find it a little easier to work with
00:21:29.680 | for what we're going to be doing here. So,
00:21:32.520 | I'm writing it to a DataFrame here, and you can see we have context, ID and question, which is all we need.
00:21:37.440 | Now we need to assign a specific, or a new,
00:21:41.880 | ID to the context,
00:21:44.600 | because at the moment, if you look here, we have an ID and it's shared by the context and also the question, and
00:21:51.080 | another thing is the ID for the context: like here, you know, the context is the same but the ID is different.
00:21:58.200 | So what we're going to do is use these IDs for the questions, because all the IDs need to be unique, and
00:22:03.240 | I'm going to create a new ID for each context. So we're going to deduplicate that DataFrame,
00:22:09.520 | so we have context and ID, nothing else, and
00:22:13.040 | then we're just going to append 'con' onto the end of our ID. So now we have
00:22:19.320 | unique IDs for all of our contexts as well as our questions, and
00:22:24.520 | what we can now do is merge, or perform an inner join, with our no-dupes
00:22:29.680 | DataFrame and the original DataFrame. So
00:22:32.900 | we do that, and now we have a unique ID, this 'id_y',
00:22:40.000 | for each of our contexts. So you can see 'id_y' is not changing where we have these duplicate contexts, and then we have 'id_x' for the
00:22:48.300 | questions. So this is what we need for
00:22:52.680 | our evaluator.
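A hedged sketch of that reshuffling (the column names and the 'con' suffix follow the description above, but the exact notebook code may differ):

```python
import pandas as pd
from datasets import load_dataset

squad_dev = load_dataset("squad_v2", split="validation")

# flatten the validation split into a DataFrame of id, context, question
df = pd.DataFrame({
    "id": squad_dev["id"],
    "context": squad_dev["context"],
    "question": squad_dev["question"],
})

# keep one row per distinct context and give each context its own unique ID
no_dupe = df.drop_duplicates(subset="context", keep="first")[["id", "context"]].copy()
no_dupe["id"] = no_dupe["id"] + "con"

# inner join back, so each row carries a question ID (id_x) and a context ID (id_y)
df = df.merge(no_dupe, on="context", how="inner")
```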
00:22:54.240 | So we need to reformat this into
00:22:57.160 | three different
00:22:59.960 | Dictionaries, so we have our queries, which is a mapping of question ID to
00:23:07.200 | the question text; we have our corpus,
00:23:11.840 | which is a mapping of the context ID to the actual context and we also have our relevant docs, which is a
00:23:18.280 | mapping of the question ID to relevant
00:23:23.240 | Context ID so you can have multiple context IDs for a particular question
00:23:27.800 | But in this case, it is actually just a one-to-one mapping
00:23:31.200 | So we first create the IR queries
00:23:35.880 | dictionary, so it's literally just the ID as a key and the question as a value;
00:23:42.800 | we create all of those key-value pairs.
00:23:46.960 | Same again for the IR corpus, exactly the same but this time for the contexts, and
00:23:52.440 | then if we come down here, this one's slightly different.
00:23:57.840 | So this is mapping the question ID to a set of relevant context IDs. Now,
00:24:04.400 | we could map these directly,
00:24:06.440 | but what I've done here is, if you use this same script and you have an example where maybe you have multiple
00:24:15.320 | contexts for each question, this will handle those, and so you'll get a list or a set of multiple
00:24:22.680 | context IDs in here rather than just one, if
00:24:26.240 | that is relevant. And we see that we have multiple...
00:24:29.520 | in our case, we have multiple
00:24:32.560 | questions that map to a single context ID,
00:24:37.880 | so it's actually many-to-one rather than one-to-one, like I said before.
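A rough sketch of building those three dictionaries from the merged DataFrame above:

```python
ir_queries = {}        # question ID -> question text
ir_corpus = {}         # context ID  -> context text
ir_relevant_docs = {}  # question ID -> set of relevant context IDs

for row in df.itertuples():
    ir_queries[row.id_x] = row.question
    ir_corpus[row.id_y] = row.context
    # use a set so multiple relevant contexts per question are also handled
    ir_relevant_docs.setdefault(row.id_x, set()).add(row.id_y)
```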
00:24:45.800 | Now, with each of those three dictionaries we can initialize our
00:24:50.480 | evaluator, which is just passing them in like this, and we evaluate. It's really simple; there's
00:24:57.400 | nothing else to it.
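A sketch of that evaluation call, using the sentence-transformers evaluator named above:

```python
from sentence_transformers.evaluation import InformationRetrievalEvaluator

evaluator = InformationRetrievalEvaluator(ir_queries, ir_corpus, ir_relevant_docs)

# returns the evaluator's primary metric (mAP@K by default) for the model
score = evaluator(model)
print(score)
```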
00:25:04.080 | And the mAP@K performance for that is 0.74, which is good if we compare it to some of the state-of-the-art retriever models.
00:25:12.600 | So this is also using MPNet, and it's been trained on a lot more data, by the way, as well;
00:25:17.320 | this is more general.
00:25:19.320 | This is getting 76 percent, so a little bit better than ours, like two percentage points better.
00:25:26.600 | But this is the SQuAD dataset, and this other model here, I'm pretty confident,
00:25:33.760 | has been trained also on the SQuAD dataset,
00:25:35.760 | so it has already seen that data. If you do the same thing but for your own,
00:25:40.800 | more niche dataset, I
00:25:43.000 | think most likely you will get a model that outperforms any other pre-trained models.
00:25:51.440 | But of course, you know, definitely just test it, evaluate, and see what you get.
00:25:56.480 | Okay, so we have fine-tuned our retriever model now,
00:26:01.320 | but we can't really use it unless we store the vectors that we create with our retriever model somewhere.
00:26:10.480 | So we need a vector database. Now,
00:26:12.520 | to do that we need to take a few
00:26:16.160 | steps, and we have already done a few. Okay, so we have
00:26:21.680 | trained,
00:26:24.560 | specifically we trained our retriever model, so we can cross that off.
00:26:28.840 | We've also downloaded a load of contexts, so the SQuAD contexts; we'll use the validation set for that. Okay,
00:26:37.040 | now the next step on this side is to encode those contexts, which we will do in a moment, and
00:26:44.400 | then over here we also need to initialize our index for our vector database. So we're going to do that, and
00:26:51.840 | then we're going to take those encoded contexts and the now-initialized index, and we're going to
00:27:00.120 | populate that index with our encoded context vectors. So let's get started with that.
00:27:07.160 | Okay, so in a new notebook now
00:27:09.840 | So if you're following along and maybe your model is still training or fine-tuning,
00:27:16.000 | that's fine, you can actually just download the model that we trained here;
00:27:20.680 | it's just the pinecone MPNet retriever SQuAD-2 model, that's the model that you saw a moment ago.
00:27:27.240 | If you want to try the BERT one as well, you can replace this with that,
00:27:29.760 | and that's also there.
00:27:34.760 | We have that model; we can see it looks the same as what we had before, and
00:27:38.920 | I'm also just going to reload the validation data set
00:27:43.720 | Now, here I'm going to
00:27:48.000 | extract all of the unique contexts, because, like we saw before, the SQuAD data has many copies of the same context, and
00:27:56.400 | we don't want to have duplicates of the same vectors in our index, because,
00:28:02.640 | well, if we're searching and we
00:28:05.240 | compare the distance between our query vector and all these contexts, and we have like five that are in the exact same position,
00:28:11.360 | it can return all of those, so we don't want to do that. So we need to remove
00:28:17.600 | duplicates. Okay, so we're looping through the whole dataset;
00:28:22.640 | we're checking if the context is already within this unique contexts list
00:28:26.040 | I've initialized; if it isn't, we add it to that list and we also add the ID of that row.
00:28:32.360 | So remember, the question ID and context ID were shared; we're just using that ID this time,
00:28:37.600 | we're not using the 'con' unique context IDs that we created before.
00:28:44.420 | We loop through, and then obviously when it sees that context again
00:28:47.460 | it's not going to add that different ID to the unique IDs list, and then we can filter, using the Hugging Face datasets
00:28:55.600 | library here, to
00:28:58.240 | only keep the
00:29:01.080 | IDs, or the rows, which have an ID from the unique IDs list.
00:29:04.640 | So with that we end up with just one of each context, and you can see here
00:29:10.000 | we only get about 1,200 rows of that, which is much less than the full
00:29:15.080 | SQuAD validation set.
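A hedged sketch of that deduplication and filtering step, again assuming the validation split is loaded as squad_dev:

```python
from datasets import load_dataset

squad_dev = load_dataset("squad_v2", split="validation")

unique_contexts = []
unique_ids = []

# keep only the first row seen for each distinct context
for row in squad_dev:
    if row["context"] not in unique_contexts:
        unique_contexts.append(row["context"])
        unique_ids.append(row["id"])

# filter the dataset down to just those rows
squad_dev = squad_dev.filter(lambda row: row["id"] in unique_ids)
```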
00:29:17.680 | Now, what we want to do is encode
00:29:20.720 | those contexts into context vectors. So we use the model that we've initialized and we use the encode method. Now,
00:29:28.360 | we also convert that vector, so this, I think, outputs a PyTorch tensor;
00:29:35.040 | we convert it into a list because we are going to be sending all of this to our Pinecone vector database index
00:29:41.440 | through an API, and for that we want to have them in a list format.
00:29:45.340 | So we encode the contexts, and now we can see, okay,
00:29:50.400 | we have all the other features that we had before, but now we also have the encoding feature.
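A minimal sketch of that encoding step (using the datasets map method is an assumption; any loop that adds an 'encoding' field works, and model is the retriever loaded above):

```python
# encode each context with the fine-tuned retriever and store it as a plain
# list of floats so it can be sent through the Pinecone API later
squad_dev = squad_dev.map(
    lambda row: {"encoding": model.encode(row["context"]).tolist()}
)
```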
00:29:54.640 | So we move on to the vector database. Now, how do we use this vector database?
00:30:00.120 | We first need to install the Pinecone client here,
00:30:04.680 | so again, that's for using the Pinecone vector database.
00:30:07.760 | So we come down and we can
00:30:11.080 | import pinecone. This will just import the way that we interact with our Pinecone
00:30:17.400 | vector database, and
00:30:20.160 | we then initialize a connection. For this you do need a free API key; it's free,
00:30:26.520 | you don't need to pay anything for that. So you just go to
00:30:31.040 | app.pinecone.io and
00:30:33.920 | you create an account, and then with the API key that you're given, you enter it in here.
00:30:41.840 | So just in there, pinecone.init with its API key, and you also set your cloud environment,
00:30:49.160 | so us-west1-
00:30:51.160 | gcp, and
00:30:53.200 | then what we can do, once we've initialized that connection, is create a new index.
00:30:58.960 | To create that index we only need this little chunk of code;
00:31:05.000 | we need this create_index method, and
00:31:08.360 | we specify the name of the index. I'm going to just call it squad-index, you can call it whatever you want, and
00:31:16.160 | the dimensionality of our vectors as well, so you can see here
00:31:21.120 | it's seven hundred and sixty-eight, and you can also define the metric. Now,
00:31:26.080 | you could also use Euclidean distance, but we are going to stick with the default, which is cosine, and
00:31:32.920 | yeah, that's all you need; you only need to run this. But at the same time, because I have multiple
00:31:40.320 | indexes and I'm rerunning this code and testing it, I
00:31:43.160 | added this in, so I'm just saying if the squad index does not already exist,
00:31:48.880 | so I'm checking that it is not already there, and then creating it.
00:31:54.300 | And then we connect, so specifically to our index rather than just the
00:32:00.000 | Pinecone database as a whole;
00:32:03.120 | we connect, so we use pinecone.Index and specify the index that we'd like to connect to.
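A rough sketch of that setup, using the Pinecone client API as it existed at the time of this walkthrough (the index name and environment follow the video; the API key is your own):

```python
import pinecone

# initialize the connection with your API key from app.pinecone.io
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")

# create the index once, specifying vector dimensionality and metric
if "squad-index" not in pinecone.list_indexes():
    pinecone.create_index("squad-index", dimension=768, metric="cosine")

# connect to the index itself
index = pinecone.Index("squad-index")
```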
00:32:10.020 | And from now on we just use this index object for everything we're doing.
00:32:17.420 | Here we are just preparing our data to
00:32:20.260 | upsert into Pinecone, which is just uploading it into our Pinecone
00:32:25.060 | index. So when we are
00:32:27.860 | uploading data into Pinecone we have
00:32:30.900 | three
00:32:33.860 | components; we don't need all of them, but I'll go through them.
00:32:37.320 | So we have a list, and within that list
00:32:40.500 | we have these tuples. In there we have the ID of the
00:32:45.440 | vector, or the entry, record, whatever, and then you have your encoding, or the vector itself, and then,
00:32:53.080 | optionally, you don't need to include this so you can remove it if you want, you have a dictionary of metadata.
00:32:59.360 | So this is just key-value pairs that you can use for things like metadata filtering, or if you want to return a
00:33:05.880 | particular piece of data with your search, you can include that in there as well,
00:33:11.980 | which is why I'm using it here. And you don't need to do this;
00:33:15.020 | you could store the text locally, but I'm just doing it like this because it's a little bit easier for this
00:33:21.800 | little script, and
00:33:24.560 | with that we just upsert, so we just run this index upsert. I'm doing it in batches; you
00:33:32.240 | could probably increase the batch size here, but I'm just sticking with that, and it is reasonably quick anyway, so about 30 seconds.
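A short sketch of that upsert, assuming the deduplicated dataset from earlier with id, encoding and context fields (the batch size is illustrative):

```python
batch_size = 64  # illustrative; larger batches mean fewer API calls

for i in range(0, len(squad_dev), batch_size):
    batch = squad_dev[i:i + batch_size]
    # each record is (id, vector, metadata); the metadata carries the raw text
    records = list(zip(
        batch["id"],
        batch["encoding"],
        [{"text": t} for t in batch["context"]],
    ))
    index.upsert(vectors=records)
```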
00:33:39.480 | So with that, our
00:33:43.760 | index, so our vector database, and our retriever are ready, so we can actually begin querying, you know, asking a question and
00:33:52.240 | returning a
00:33:54.360 | relevant context
00:33:56.280 | Okay, so we're on to
00:33:58.280 | another notebook for this. So again, I'm going to initialize a retriever model, and I'm going to import pinecone and initialize
00:34:06.160 | a connection
00:34:08.120 | to the index, to the squad index, again. Now,
00:34:11.120 | this is pretty straightforward;
00:34:14.160 | fortunately, now that we've done all the groundwork,
00:34:16.440 | all we do is take a question, so I'm going to say 'When were the Normans in Normandy?', and
00:34:25.040 | we just write model.encode and include our query within a list here, because, for example, if you had
00:34:32.480 | multiple queries you might have them in a list somewhere else up here, and
00:34:38.800 | in that case you would not have to add these square brackets;
00:34:44.400 | but in this case we just have a single query, and
00:34:47.940 | we create a PyTorch tensor,
00:34:51.200 | which is our query vector, and we convert it into a list because we're going to be sending it through the Pinecone API again.
00:34:56.940 | And what I'm doing here is querying, so this is going to the Pinecone vector database and querying;
00:35:05.700 | I'm saying I want to find the most similar vectors, or context vectors, to this query vector, or question vector.
00:35:12.220 | I want to return just the top two
00:35:16.260 | contexts that you find, and
00:35:18.260 | I want to include the metadata in that response, because in the metadata I included the text.
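A rough sketch of that query step, again using the Pinecone client API from the time of this walkthrough:

```python
query = "When were the Normans in Normandy?"

# encode the question with the same retriever used for the contexts,
# then convert to a plain list so it can be sent through the API
xq = model.encode(query).tolist()

# return the two most similar context vectors, including their metadata
result = index.query(vector=xq, top_k=2, include_metadata=True)
print(result)
```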
00:35:27.580 | So we can come down here and we see, okay, this first example isn't actually what we want. So in here, what we asked was
00:35:35.180 | 'When were the Normans in Normandy?', and this doesn't actually answer that question.
00:35:40.380 | But then the second one, so if I
00:35:45.460 | open this up in the text editor, so come down here,
00:35:49.020 | we have the second result:
00:35:52.060 | so the Normans were the people who in the 10th and 11th centuries gave their name to Normandy,
00:35:58.060 | so you can assume that they were probably in Normandy at that time. So that's when they were there.
00:36:03.660 | So we get the correct context in second position for this one. We also get the score, around 0.7;
00:36:11.540 | it's a good score, it's high.
00:36:16.020 | Then we have another question here. So we've got three questions;
00:36:19.940 | we'll go through each one very quickly. How many outputs are expected for each input in a function problem? And
00:36:25.140 | we do actually return the correct answer for that one straight away. So a function problem is a computational problem
00:36:30.960 | where a single output is expected for every input. So that's our specific answer there.
00:36:37.780 | Okay, and we get a really high score for that,
00:36:40.220 | so it's pretty confident that this is correct, and then a lower score for the other ones where the actual question is not answered.
00:36:49.780 | One final one here. So I've changed the wording a little bit for this one because I don't want to be,
00:36:55.180 | you know, just doing a keyword search. So I put 'Who used Islamic, Lombard, etc.
00:36:59.540 | construction techniques in the Mediterranean?', and,
00:37:05.220 | you know, I modified that a little bit and we do actually return the correct answer straight away,
00:37:10.180 | with a score of 0.6. If you compare it to the other ones, which are much lower, that's pretty good; there's a good separation there,
00:37:18.060 | which is what we want to be looking for.
00:37:22.060 | That is, I know, it's pretty long, there's a lot of components and moving parts there, but that's the open-domain
00:37:29.620 | question answering pipeline, or at least the vector database and retriever components of that.
00:37:35.940 | We've also had a look at how we can fine-tune our own retriever, so we've covered quite a lot.
00:37:40.940 | So I mean, with all that, you're
00:37:43.540 | ready to just go ahead and implement what I think are probably the two
00:37:48.460 | most crucial components in open-domain question answering. If you don't have a good
00:37:55.060 | vector database and a good retriever model, you are going to be returning
00:37:59.540 | poor contexts to your reader model,
00:38:02.020 | and if you don't have good contexts for your reader model, your reader model isn't going to give you anything good.
00:38:07.740 | So these two are probably the most important parts, right? If your reader model is rubbish,
00:38:13.100 | maybe it gives you a kind of weird
00:38:15.860 | span or answer, but at least you've got a good context to go off of, so your users
00:38:22.740 | are at least getting some relevant information. Now, I think one of the coolest
00:38:28.220 | things about open-domain question answering is just how widely applicable it is.
00:38:33.060 | Basically any company, across so many industries across the world,
00:38:38.180 | can use this if they have
00:38:41.680 | unstructured data that they need to
00:38:45.000 | essentially open the doors to for their
00:38:49.900 | staff or their users. Right, if you want to get data or information to someone, which is a big part of most jobs,
00:38:57.660 | if you want to get data or information to someone more
00:39:01.620 | effectively,
00:39:04.060 | using a more natural
00:39:06.060 | form of search and question answering, this is probably applicable to you and maybe your company.
00:39:15.380 | Anyway, that's it for this video
00:39:18.420 | Thank you very much for watching. I hope it's been useful and I'll see you again in the next one