Okay, in the last video we had a look at how to build what you can see on the screen right now, a very simple interface using Streamlit. Now what we want to do in this video is go through how we actually build the smart part behind the open-domain Q&A system that we're going to put together here.
So as I said before, there are a few components to open-domain Q&A. We're going to stick to the first two for now: the vector database, which we're going to use Pinecone for, and the retriever model, which we're going to download from the Hugging Face model hub and implement using the sentence-transformers library. Now the first thing we want to do is create our vector database, or our index.
To do that, there are three steps we need to take. First, we need to download our data; we're going to be using the SQuAD dataset from Hugging Face Datasets. Then we want to encode those paragraphs, or what we call contexts, into context vectors, and we use sentence-transformers and a retriever model for that.
And then the final part is uploading, or pushing, all of those vectors into our Pinecone vector database. To do all of that, we're just going to run through the code very quickly, because there is a lot of it and I don't want to focus on it too much. So here we have the script; I'm going to zoom out a little bit so you can see. The first thing we do is import everything. You don't need tqdm here, but you can pip install tqdm if you do want to use it.
So we're importing from datasets, which is Hugging Face Datasets; you will need to install this, and that's just a pip install datasets. We're going to first initialize our retriever model. We're using the Pinecone MPNet retriever, which is a retriever model based on the MPNet model from Microsoft that has been trained on the SQuAD 2 dataset. Then the first thing we need to do is initialize our connection to Pinecone.
So this is where we're going to store all of our vectors. To do that, you do need an API key. Now, I wouldn't write it directly in your code, but I'm going to do that here for the sake of simplicity. So I'm going to go to app.pinecone.io, and this is free, by the way; you don't have to pay anything. We just go to app.pinecone.io and then you will have to sign up.
So you create an account. I already have one, so I don't need to worry about that, and I have this default API key over here that I could use. Yeah, I'm just going to use that. We can see the key if we want; let me zoom in a little bit and make it a bit bigger.
So we can see the value there. We just press over here and copy that across, and then I'm just going to paste it in here. Now, I will leave a link to this script in the description, so you can just download it instead of writing it all out, because this isn't essential to our app.
It's just how we encode all of our contexts and store them in our vector database. So we have that, and we have the cloud environment that we're using there. Switching back to the app: we want to check if the index already exists, and if it doesn't, create a new index. Now, you can see that in mine it already exists, because I've run this code already, so the QA index already exists and it's not going to create a new one; instead it's just going to connect to that index here. Right, so we've just connected to, or created, our index, our vector database index. Now what I want to do is switch back to our data, and I'm going to run through that.
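The create-or-connect logic just described can be sketched like this. The index name "qa-index" and the dimension 768 are assumptions (the video doesn't show the exact values), and the Pinecone calls are replaced by a plain list stand-in, with the real client calls left as comments:

```python
# Sketch of the "create the index only if it doesn't exist" pattern.
# "qa-index" and dimension=768 are assumptions; the real script would call
# pinecone.list_indexes(), pinecone.create_index(...) and pinecone.Index(...)
# where the comments indicate.

existing = ["some-other-index"]   # stand-in for pinecone.list_indexes()
index_name = "qa-index"

if index_name not in existing:
    # real code: pinecone.create_index(index_name, dimension=768)
    existing.append(index_name)

# real code: index = pinecone.Index(index_name)
connected = index_name in existing
```

Running this a second time would skip the create step and go straight to connecting, which is exactly the behavior described above.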
So I'm going to load the data, the SQuAD dataset, from Hugging Face. Now, I'm going to use the validation split, because the model has been trained on the SQuAD training data, and I want to make this at least a little bit hard, so we're going to use the validation split, which it hasn't seen before. I'm also removing any duplicate contexts in there. So, zooming out a little bit here, on the validation set we're using this filter; this is all Hugging Face Datasets syntax.
And then we're encoding it, so this is model.encode, where the model is our sentence transformer. We're encoding the contexts to create a load of sentence vectors, and we're converting those to lists, because we're going to be pushing them through an API request to Pinecone.
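The deduplication step can be sketched with plain Python dicts instead of a Hugging Face Dataset; the real script applies the same kind of check inside the dataset's .filter(...) method and then encodes the surviving contexts with model.encode:

```python
# Remove rows whose context paragraph has already been seen.
# Plain-dict stand-in for the Hugging Face Dataset used in the video.

rows = [
    {"id": "a1", "context": "The Normans were a people..."},
    {"id": "a2", "context": "The Normans were a people..."},  # duplicate context
    {"id": "b1", "context": "A different paragraph."},
]

seen = set()
unique_rows = []
for row in rows:
    if row["context"] not in seen:
        seen.add(row["context"])
        unique_rows.append(row)

# the real script would then do something like:
#   vectors = model.encode([r["context"] for r in unique_rows]).tolist()
```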
Pinecone needs a list, not a NumPy array, otherwise we're going to get an error. Okay, then back to the Pinecone side of things: we want to create a list of tuples, and those tuples include the ID of each context, so there's a unique ID for each context.
We want the vector, the encoded context vector, and then we also have this dictionary here, which is metadata. Metadata in Pinecone is any other information about your vectors that you want to include, and it's really useful if you want to use metadata filtering, which is super powerful in Pinecone, and I definitely want to leave that option open for later on.
I'm not sure if we'll use it or not; we'll probably put something in there just so we can play around with it. Now, that creates the format we need to upsert everything, which just means push or upload everything to Pinecone. I do that in chunks of 50 at a time; it makes things a little bit easier on the API requests, rather than sending everything at once.
Okay, so that's how we create the index. Now what we're going to do is actually integrate that into our app, so let's switch back to our app here and view it. First, let's just remove this; we don't need that. Okay, it will automatically reload. The first thing we want to do here is initialize the Pinecone connection, so let's just take this part of the code, copy it, and then we'll remove what we don't need in a minute. We do need sentence-transformers; we don't need datasets; we do need pinecone.
So here we're initializing our retriever model; it's the same as what we did before, so we do want to keep that in there. Then there's the API key again. Ideally you would store this somewhere else, or, if you're using Streamlit Cloud, they have a secrets management system, and that's something we'll look at in the future for sure.
But for now, I'm just putting it in here. So we have our API key and environment, and we're doing the same thing we did before, except we don't want to create an index. In our app we're assuming the index has already been created, so we're just going to connect to it. Okay, so with that we've kind of set up the back-end part of our app,
the smart part that's going to handle the open-domain Q&A. It's going to be a little bit slow, and we will have a look at how to solve that pretty soon, but for now what we're going to do is just implement this, and then we'll actually query it and see what we get back. So I'm going to save this. We won't see anything change in our app now, other than the fact that it takes longer to load, because it's downloading the retriever model; that's the main source of slowness here, and then obviously connecting to Pinecone also takes a second as well. For now we'll have to put up with how slow it is, but we will fix that pretty soon. Now what I actually want to do is say: okay, only run the search if the query is not empty, because by default it is empty; that's why we add that check in there.
So I'm going to remove this, and we only enter the block if the query is not empty. So if query is not equal to an empty string, we're going to query Pinecone with whatever is in that query. The first thing we need to do is create our query vector, so I'm going to write xq, just shorthand for the query vector. That's pretty standard; if you've used FAISS before, you'll have seen this naming. And where I said context vector just now, I meant query vector.
I mean query vector So we're going to do model encode and We need to put this in square brackets and we have query. Okay, and then we're going to convert that to a list Okay, so this is going to create our Query vector. Let's write it down create query vector And then the next thing we want to do is Query pinecone with this query vector so To do that We want to write First let's get relevant Context and we're going to solve these in XC.
So like query like context vector similar thing to the Query vector that we use for with XQ But this time we're gonna write index dot query And we're going to pass XQ. So our query vector and we're gonna say how many results we want to return now Later on we're going to use Streamlet a little like a slider bar to decide how many we would like to return but for now we will hard code it and another thing that we want to include here is we want to tell pinecone to Return the metadata because by default it will not return metadata so Return metadata equals true.
So these are like the extra little bits. I mentioned before so included our title so Like the topic Wikipedia topic that the context is coming from and also the text itself okay, so we're going to return the relevant context and Then we're gonna loop through each of those now When we do this, there's a particular format that we need to follow so our context are actually going to be sword so for context in XC results and Results is going to return a list and we just want the the first item in that list The reason it returns a list is because if you are a querying pinecone with multiple queries It will turn a list of you know, your answers for each query But in this case, we are only ever going to query with one query vector so we always enter a position zero here and then in there we will have all of our returned matches inside this matches Key value value so For context in there.
So for each context in there, all we're going to do is write st.write, and then we go into the metadata that we're returning. We have title and text in there; we don't want the title, we want the text. Okay, so let's save that and check that it actually works. Again, this is going to take a while to load, because we're initializing the full pipeline of our vector database and the retriever model, and every time we rerun this it downloads the full retriever model, which takes quite a bit of time. Okay, so our app has just rerun, and now I can ask: who are the Normans?
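Putting the query and the display loop together, here's a runnable sketch with a fake index standing in for Pinecone. The response shape ({"results": [{"matches": [...]}]}) mirrors what the transcript describes for a batched query, the metadata fields are the title and text mentioned above, and st.write is replaced by collecting the texts into a list:

```python
# Query-and-display loop, with a fake index in place of the Pinecone client.
class FakeIndex:
    def query(self, xq, top_k=5, include_metadata=True):
        # fabricate top_k matches shaped like the response described in the video
        matches = [
            {"id": str(i), "score": 0.9 - i * 0.1,
             "metadata": {"title": "Normans", "text": f"context {i}"}}
            for i in range(top_k)
        ]
        return {"results": [{"matches": matches}]}

index = FakeIndex()
xq = [[0.0] * 768]   # stand-in for the encoded query vector
xc = index.query(xq, top_k=5, include_metadata=True)

displayed = []
for context in xc["results"][0]["matches"]:   # [0]: we only sent one query
    displayed.append(context["metadata"]["text"])   # real app: st.write(...)
```

The [0] index and the "matches" key correspond directly to the single-query case explained above; with multiple query vectors there would be one entry in "results" per query.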
Okay, again it's trying to reload everything, so it's going to take a while; we're going to fix this in the next video. We should be returning five contexts, and if we scroll down we can see we have these five paragraphs. Each one is a single context. We can maybe inspect the element here. Okay, so we can see down here, and it's pretty horrific to look at, but if I zoom in we can see that each one of these is a single one of our contexts, right?
Cool. So I think that's it for this video. We now have the back end working, and in the next one what we'll do is fix this issue with it taking forever to reload everything every time, which is actually super easy to do but will make a big difference to our app. So thank you very much for watching, and I will see you in the next one.
Bye