
Open Source Generative AI in Question-Answering (NLP) using Python


Chapters

0:00 What is generative AI and Q&A?
1:02 Generative question-answering architecture
4:36 Getting code and prerequisites
5:06 Data preprocessing
7:41 Embedding and indexing text
13:50 BART text generation model
14:52 Querying with generative question-answering
17:45 Asking questions and getting results
21:29 Final notes

Transcript

Today, we're going to talk about abstractive or generative question answering. And we're going to focus on actually building or implementing something like this using a few different components. In the end, what we'll essentially be able to do is ask a question in natural language.

And we're going to be able to return documents or web pages or so on that are related to our particular question. And we're also going to be able to use something called a generator model to generate a human natural language answer to our question based on these documents that we've retrieved from an external source.

So we can think of it as a GPT model that is answering our questions, but as if GPT were also giving us the sources of the information it was answering questions based on. So let's jump straight into understanding what it is exactly that we're going to be building. So we're going to start with all of our documents, our text or whatever it is we're going to be using.

In our case, we're going to be using text from Wikipedia. So we're going to take all of this and we're going to encode it using what's called a retriever model. So it's called a retriever. And what that will give us is a ton of these vectors. Where each vector represents like a segment of our text.

So for example, maybe we might have this little segment here followed by this segment and so on. We're going to take all of those vector embeddings and we're going to put them into a vector database over here. Now we're going to be using Pinecone for this. So what we'll do is just put everything in Pinecone and at that point, we've actually built the retrieval pipeline.

We don't have the generative part of it yet, but we do have the retrieval pipeline. So then what we can do is ask a question. So we'll ask a question over here, it'll be in natural language. And what we'll do with that is also feed it into the retriever model.

Maybe we'll put the output over here: that will be a single query vector, or question vector. That will then be passed into Pinecone here, which will compare that query vector to all of the previously encoded vectors. And it will return a few of those that are the most relevant to our particular query vector.

So it will bring these out and it will say, okay, these three items are the most relevant to your particular query. And it's basing those on the concept or the idea behind the language being used. It's not basing them on matching particular terms, like keyword matching or anything like that.

It's actually basing it on the semantic understanding of the question and of the answers and of the relevant documents. So we'll take these and we'll bring them over here. Now over here, we're going to have what's called a generator model. So the generator model, it can be a lot of different things.

One example that I kind of briefly mentioned is it could actually be something like GPT-3. So you could have GPT-3 here. We're going to be using another one or another model called BART that will generate everything for us. Just because this is open source and we can just run it in our Colab notebook.

But you can use GPT-3, you can use Cohere, you can use all these different types of models, depending on what it is you're wanting to do. So we'd pass those relevant contexts or documents, whatever you like to call them, into our generator model. Alongside that, we also want to pass in the question, the original question.

One thing that I missed here: at this point we would be converting these back into their original text format, which we've stored in Pinecone alongside the vectors. So that will actually be the text, and the same with the query. So we're going to have the query and the contexts, and we're going to feed them into the generator.

And that will then output us an answer in natural language format. So let's actually jump straight into the code for building all of this. So we're going to be working from this example over on the Pinecone docs. So it's pinecone.io/docs/abstractive-question-answering. There'll be a link in the video as well.

And what we want to do is just open Colab over here. That will open this notebook. So let's get started. We need to install our dependencies. So in here we have datasets, Pinecone, sentence transformers, and PyTorch. And we'll jump into what each one of those does pretty soon. Okay. Once that is installed, we come down here and we're going to just load and prepare our dataset.
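Just as a rough sketch, that install cell looks something like this (the package names are my best guess at what the notebook uses; exact versions may differ):

!pip install -qU datasets pinecone-client sentence-transformers torch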

So we'll be taking this Wikipedia snippets dataset. This is coming from the Hugging Face datasets hub. So we're loading it like this, and it's a pretty big dataset, so we're actually streaming that data by setting streaming equals true. I think it's nine gigabytes. So this will just allow us to load what we're using right now, rather than loading the full thing into memory at once.

And then we shuffle that dataset randomly. So we're using a seed here just so you can replicate what I'm doing. Let me run this. And then we'll come down here and we can just show the first item or the first document from the dataset. So we're just iterating through it.
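A sketch of that loading step (the dataset ID here is an assumption based on the Pinecone example):

from datasets import load_dataset

# Stream the Wikipedia snippets dataset rather than downloading all ~9 GB up front
wiki_data = load_dataset(
    "vblagoje/wikipedia_snippets_streamed",  # assumed dataset ID
    split="train",
    streaming=True,
)

# Shuffle with a fixed seed so the run is reproducible (any seed works)
wiki_data = wiki_data.shuffle(seed=960)

# Show the first document from the streamed dataset
next(iter(wiki_data))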

We take the next item and we can see we have the ID, and start and end, which tell us where the text is actually being pulled from. And we have article title, section title, and then we have the passage text. So this is the document or the context.

And this is what we're going to be encoding and storing in our vector database. So what I'm going to do here is actually filter for only the documents that have history in the section title. So basically we just want history-related documents.

So we do that. Now, we can't check how many items we have there because we're using the streaming feature. So that will just essentially stream everything through the filter: if it sees history in the section title, it will let the document through, and if not, it won't. But there are quite a few passages in there.

So we're just going to take the first 50,000 of those, which is quite a bit. Now, one thing I should make you aware of here is your runtime: it should be set to GPU anyway, but in case it's not, you can set your hardware accelerator to GPU here.

If it's on none, it means you're using CPU and it will be a lot slower when we're embedding everything later on. So we do want to make sure that we're using GPU. Okay. So after that has completed, we have our 50,000 documents all with history in the section title.
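A sketch of that filter-and-collect step, assuming the field names we saw in the first document above:

from tqdm.auto import tqdm
import pandas as pd

# Keep only passages whose section title mentions history
history = wiki_data.filter(
    lambda d: "History" in d["section_title"]
)

# Stream through and keep the first 50,000 matching passages
total_doc_count = 50000
docs = []
for d in tqdm(history, total=total_doc_count):
    docs.append({
        "article_title": d["article_title"],
        "section_title": d["section_title"],
        "passage_text": d["passage_text"],
    })
    if len(docs) == total_doc_count:
        break

# Put them in a DataFrame so they're easy to batch later
df = pd.DataFrame(docs)
df.head()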

So if we take a look at the head here, we can see that the section titles don't all say just history specifically, but they at least have history somewhere in the title, like here. Okay. So what we're going to do now is embed and index all of these passages, or embed and store all of them.

So to do that, we'll need to initialize the Pinecone index, but I'm going to do that after initializing the retriever model. So I'm going to scroll down to here, come to the retriever model, and we're going to be using this flax-sentence-embeddings all_datasets_v3 MPNet base model.

So this is basically one of the best sentence transformer models you can use for pretty much anything. That's why it has "all datasets" in the name: it's been trained on, I think, a billion sentence pairs. So it's a pretty good model to try whenever you're not sure which model to use.

So we initialize that, okay. It might take a moment to download. Okay. And then one thing we will want to do is make sure we move this to a GPU. So actually what we need to do is import torch. Then I want to say device equals CUDA if a CUDA GPU is available.

So this is saying, if there's a CUDA enabled GPU, set the device to that. Otherwise we're going to use CPU. Okay. And then we can see what the device is. And actually rather than moving the retriever to that device, I'm going to come back up to the initialization here, and I'm going to initialize it on that device to start with.
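A sketch of that retriever setup (the model ID is assumed from the Pinecone example):

import torch
from sentence_transformers import SentenceTransformer

# Use a CUDA GPU if one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Retriever: MPNet-based sentence transformer trained on ~1B sentence pairs
retriever = SentenceTransformer(
    "flax-sentence-embeddings/all_datasets_v3_mpnet-base",  # assumed model ID
    device=device,
)
retriever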

So like this. Okay. Now, an important thing to note here is that we have the word embedding dimension 768. So remember that and we'll come up here and we will initialize our Pinecone index. So the first thing we need to do is connect to our Pinecone environment. So we need an API key for that, which is free.

So to get that, we need to go to app.pinecone.io. Once here, we will either need to sign up or log in. So I'm going to log in. And once we've done that, we'll just get a little loading screen here, and then we should find something like this. So on the top left up here, you have your organization and then you have projects.

So one of those should say like your name and default project. So I'm going to go over to that. And then here, I just have a list of the indexes that I currently have running. Now I think abstractive question answering is not in there. So what I'm going to do is we're going to have to create it.

So we come over to API keys on the left here. We copy the API key value, come over to here, and then we will just paste it into here. I'm going to go and paste mine into a new variable. So mine is stored in a new variable called API key.

So I initialize with that. And what we're going to do is create a new index. We're going to call it abstractive-question-answering. And we are going to say, if that index name does not exist, then we create it. Now, remember I said to remember that dimensionality, the number 768, from before.

This is why: we need this number here to align with the embedding dimensionality of our retriever model. We can also check that using retriever.get_sentence_embedding_dimension(), like so, and we get 768. So we can actually take this and place it in here rather than hard-coding it.

For the metric, because the embedding vectors are normalized, as we can see here, we can actually use either dot product or cosine similarity; we're going to just stick with cosine similarity. And then it will just take a moment for the index to be created. Okay. Once we have created it, we will move on to this, which is just connecting to our new index.
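Roughly, that index setup looks like this sketch; it follows the pinecone-client API as it was at the time of this example (newer client versions use a Pinecone class instead), and the environment string is an assumption:

import pinecone

# Connect to Pinecone; api_key is the variable holding the key copied from the console
pinecone.init(api_key=api_key, environment="us-west1-gcp")  # environment is an assumption

index_name = "abstractive-question-answering"

# Create the index only if it doesn't already exist;
# the dimension must match the retriever's embedding size (768)
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=retriever.get_sentence_embedding_dimension(),
        metric="cosine",
    )

# Connect to the new index
index = pinecone.Index(index_name)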

So let's scroll down and we will come down to generating embeddings and upserting. So what we're going to do here is, in batches of 64, extract our passage text. So we'll have 64 of these passages at one time, and we're going to encode them all using our retriever model.

Then what we're going to do is get the metadata. So that is simply the text that we have in here. So if I show you an example: I'll take the DataFrame df and look at the first few items, so basically we get this.

We're going to take all of that data that we have in our data frame, and for each one of our vectors, so the first one would be this, we're going to attach that metadata to the vector. And then here we'd create some unique IDs, just a running count; we could actually use the IDs themselves, but this is just easier. And we're going to add all those to an upsert list, which is just a list of tuples, each containing an ID, the vector embedding, and the metadata related to that embedding.
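Put together, that embed-and-upsert loop looks something like this sketch (batch size 64, metadata taken from the DataFrame fields above):

from tqdm.auto import tqdm

batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # End index of this batch
    i_end = min(i + batch_size, len(df))
    batch = df.iloc[i:i_end]
    # Encode the passage text for the whole batch with the retriever
    emb = retriever.encode(batch["passage_text"].tolist()).tolist()
    # Metadata: the original fields, so we can recover the readable text later
    meta = batch.to_dict(orient="records")
    # Simple unique string IDs based on a running count
    ids = [str(x) for x in range(i, i_end)]
    # Upsert (id, vector, metadata) tuples into Pinecone
    index.upsert(vectors=list(zip(ids, emb, meta)))

# Check how many vectors the index now holds
index.describe_index_stats()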

And then we upsert all of that. So basically insert it all into the Pinecone vector database. Then at the end here, we're just going to check that we have all those vectors in the index. And you can see here that it actually brought through 50,001. So maybe there was a duplicate in there.

I'm not too sure. But we have all of those in there. So I can try running this, but it's basically just going to start from the start again. So see here, I'm not going to wait until the end of that because it will take a little bit of time, even when we're using a GPU on Colab, although actually not too long.

Anyway, I'm going to stop that and we'll just move straight on to the generator, and we can at least see from past runs what it would be doing. So the first thing we would do here is initialize the tokenizer and the model for our generator model. And we're using this BART LFQA model, which is a long-form question answering model.
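A sketch of that generator setup (the model ID is the BART LFQA checkpoint the Pinecone example appears to use, so treat it as an assumption):

from transformers import BartTokenizer, BartForConditionalGeneration

# Tokenizer and seq2seq model for long-form question answering
tokenizer = BartTokenizer.from_pretrained("vblagoje/bart_lfqa")  # assumed checkpoint
generator = BartForConditionalGeneration.from_pretrained("vblagoje/bart_lfqa").to(device)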

Okay. So if we come up here, we'll explain a little bit about what this model is. So we're using the ELI5 (Explain Like I'm 5) BART model, which is just a sequence-to-sequence model that has been trained on the ELI5 dataset, which comes from Reddit. And if we come down here, we can see the format that we're going to use to feed all of our text into this model.

So we're going to have our question, which is going to be what we type. We'll say something like, what is a sonic boom? And then that's followed by the context. And each passage is preceded by a <P> token like this. So we have a passage, then another <P> token, then another passage.

And basically the model has been trained to read this sort of format and then generate a natural language answer based on this question and based on this information that we have provided it with. So we come down here, we would initialize it like that. And then we're just going to create these two helper functions.

So this is just to help us query Pinecone. So given a particular query, we encode it, going from text to a vector embedding, or the query embedding as we'd usually call it. We query Pinecone like this; this will return the top K passages, which we call the contexts.

One thing that is pretty important here is that we include the metadata because that includes the human readable text of those passages that we're going to be feeding in and why do we need that? Because we are going to be formatting them in this string, which is like what I showed you before.

We have the contexts here, each of which is going to be the <P> token followed by the passage, and then we concatenate all of those together. And then what we would do is create that format that you saw before: "question:" followed by the question, and "context:" followed by those contexts, with a <P> token preceding each one.
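As a sketch, those two helper functions look something like this, following the pattern described above:

def query_pinecone(query, top_k):
    # Encode the question into a single query vector
    xq = retriever.encode([query]).tolist()
    # Retrieve the top_k most similar passages, including their metadata
    return index.query(xq, top_k=top_k, include_metadata=True)

def format_query(query, context):
    # Prepend each retrieved passage with a <P> token
    context = [f"<P> {m['metadata']['passage_text']}" for m in context]
    context = " ".join(context)
    # Build the input the generator was trained on: question then contexts
    return f"question: {query} context: {context}"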

So with those helper functions, we then move on to our query. So we have our query: when was the first electric power system built? We can query Pinecone and that will return these matches here. So this is the response directly from Pinecone, and we can see that we have the passage text, and I think there are some relevant passages in there.

So this is just returning one passage here. We use pretty print here so that we can visualize everything more nicely. And then what we want to do is format our query. So we have our query, which is the question we just asked up here.

When was the first electric power system built? And then we also have what we returned from Pinecone. And then we print what we will be feeding into the generator. So we have the question, and you can see that same format that you saw before. And then you have the context, and you have the <P> token followed by the passages.

So we write another function, generate_answer. This is going to take the formatted query here, tokenize it using our BART tokenizer, and then use the generator model to generate an answer. Okay. So that will output a load of token IDs, which we obviously can't read.
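A sketch of that helper, including the decoding step described next (the beam settings here are assumptions; max length 40 matches the walkthrough):

def generate_answer(formatted_query):
    # Tokenize the "question: ... context: ..." string
    inputs = tokenizer(
        [formatted_query], max_length=1024, truncation=True, return_tensors="pt"
    ).to(device)
    # Generate an answer of up to 40 tokens
    ids = generator.generate(
        inputs["input_ids"],
        num_beams=2,    # assumed beam setting
        min_length=20,  # assumed minimum length
        max_length=40,
    )
    # Decode the generated token IDs back into human-readable text
    return tokenizer.batch_decode(
        ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

Used end to end, that looks roughly like:

result = query_pinecone("when was the first electric power system built?", top_k=1)
formatted = format_query("when was the first electric power system built?", result["matches"])
print(generate_answer(formatted))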

So then we use this batch decode or the tokenizer decode to decode them into human readable text. Like that. So if we then go ahead and actually run that, we will see that we want to focus on this bit here. The first electric power system was built in 1881 at Godalming in England.

It was powered by two water wheels, and so on. And if we look at the context we retrieved here, we can see that the model is basically reformulating that information into a more concise answer. So we see "in 1881 at Godalming in England" and so on.

So that's pretty cool. Now, what if we go a little further? We ask another question: how was the first wireless message sent? And this time we're going to return five of these contexts, so we're going to return more information. And ideally this should give the BART generation model more information to produce an answer from.

So, generally speaking, it should be able to produce a better answer if we give it more of that information, but not all the time. In this case, we see the first wireless message was sent in 1866, and so on. A nice short answer, which is good; we set that by setting the max length up here to 40.

And, you know, I don't know the answer to this question. So what we can do is not just rely on the model to give us the answer, which is a problem that we see a lot with GPT-3, ChatGPT and similar models; we can actually have a look at where this information is actually coming from.

So we can see here, I think this is probably the most relevant part. So this person is claimed to have transmitted an electrical signal through the atmosphere at this point, right? And I don't think any of the other contexts really give us any more information on that. So according to this context, and if we want to, we can provide a link back to where it actually came from, that answer does at least seem to be true.

Now, this is probably a good example of when this is useful. So if we ask a question like, where did COVID-19 originate? And we get this like random answer. And I think most of us probably know that this is kind of nonsense, right? So it's a zoonotic disease transmitted from one animal to another.

Okay. Let's have a look at where this is coming from. And we can see that all of these contexts don't actually contain anything about COVID-19. And so we can pretty confidently say that this is nonsense. And simply the reason is that this model has never seen anything about COVID-19.

The BART generation model hasn't seen anything about that because the training data it was trained on was from before that time. And as well, none of the contexts that we have indexed yet contain anything about it either. So it can be pretty useful to include that, particularly when it comes to fact-checking things like that.

And then let's finish with a final few questions. What was the war of the currents? I'm not going to check these, but I'm pretty sure this one is true. Who was the first person on the moon? Neil Armstrong; I think we all know that's true. And what is NASA's most expensive project?

I think this one is possibly true, possibly not; I can't remember, but nonetheless, we get some pretty cool answers there. So that's it for this video and this example walkthrough of abstractive or generative question answering. I hope this has been useful and interesting. So thank you very much for watching and I will see you again in the next one.

Bye.