Open Source Generative AI in Question-Answering (NLP) using Python
Chapters
0:00 What is generative AI and Q&A?
1:02 Generative question-answering architecture
4:36 Getting code and prerequisites
5:06 Data preprocessing
7:41 Embedding and indexing text
13:50 BART text generation model
14:52 Querying with generative question-answering
17:45 Asking questions and getting results
21:29 Final notes
00:00:00.000 |
Today, we're going to talk about abstractive or generative question and answering. 00:00:04.960 |
And we're going to focus on actually building or implementing something like this using a few different components. 00:00:12.760 |
But in the end, what we're going to essentially be able to get is we're going to be able to ask a question in natural language. 00:00:20.540 |
And we're going to be able to return documents or web pages or so on that are related to our particular question. 00:00:29.600 |
And we're also going to be able to use something called a generator model to generate a human natural language answer to our question 00:00:39.680 |
based on these documents that we've retrieved from an external source. 00:00:44.600 |
So we can think of it as a GPT model that is answering our questions. 00:00:50.160 |
But if GPT was also giving us the sources of the information that it was answering questions based on. 00:00:57.120 |
So let's jump straight into just understanding what it is exactly that we're going to be building. 00:01:02.120 |
So we're going to start with all of our documents, our text or whatever it is we're going to be using. 00:01:08.200 |
In our case, we're going to be using text from Wikipedia. 00:01:12.960 |
So we're going to take all of this and we're going to encode it using what's called a retriever model. 00:01:21.840 |
And what that will give us is a ton of these vectors. 00:01:27.120 |
Where each vector represents like a segment of our text. 00:01:31.520 |
So for example, maybe we might have this little segment here followed by this segment and so on. 00:01:39.440 |
We're going to take all of those vector embeddings and we're going to put them into a vector database over here. 00:01:47.760 |
Now we're going to be using Pinecone for this. 00:01:51.720 |
So what we'll do is just put everything in Pinecone and at that point, we've actually built the retrieval pipeline. 00:01:59.880 |
We don't have the generative part of it yet, but we do have the retrieval pipeline. 00:02:07.200 |
So we'll ask a question over here, it'll be in natural language. 00:02:10.680 |
And what we'll do with that is actually also take that into the retriever model and that will output. 00:02:18.600 |
Maybe we'll output it over here, that will output a single query vector, or question vector. 00:02:26.000 |
That will then be passed into Pinecone here, which will compare that query vector to all of the previously encoded vectors. 00:02:33.920 |
And it will return a few of those that are the most relevant to our particular query vector. 00:02:38.600 |
So it will bring these out and it will say, okay, these three items are the most relevant to your particular query. 00:02:45.560 |
And it's basing those on the concept or the idea behind the language being used. 00:02:50.520 |
It's not basing them on matching particular terms, like keyword matching or anything like that. 00:02:55.400 |
It's actually basing it on the semantic understanding of the question and of the answers and of the relevant documents. 00:03:04.480 |
So we'll take these and we'll bring them over here. 00:03:08.400 |
Now over here, we're going to have what's called a generator model. 00:03:12.480 |
So the generator model, it can be a lot of different things. 00:03:16.000 |
One example that I kind of briefly mentioned is it could actually be something like GPT-3. 00:03:24.840 |
We're going to be using another model called BART that will generate everything for us. 00:03:31.600 |
Just because this is open source and we can just run it in our Colab notebook. 00:03:36.840 |
But you can use GPT-3, you can use Cohere, you can use all these different types of models. 00:03:42.440 |
Depending on what it is you're wanting to do. 00:03:44.680 |
So we'd pass those relevant contexts or documents, whatever you like to call them, into the generator. 00:03:53.480 |
Alongside that, we also want to pass in the question, the original question. 00:03:58.320 |
One thing that I missed is actually this step here. 00:04:10.040 |
We would be converting these back into their original text format. 00:04:17.080 |
So that will actually be the text and the same with the query. 00:04:21.680 |
So we're going to have the query and the context and we're going to feed them into the generator. 00:04:25.560 |
And that will then output us an answer in natural language format. 00:04:31.400 |
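Before jumping into the notebook, here is a minimal sketch of that whole pipeline in Python-like pseudocode. The names (`retriever`, `index`, `generator`) are placeholders for the components we set up later in the walkthrough, so treat this as a bird's-eye view rather than runnable code.

```python
# Hypothetical end-to-end flow; the components are defined later in the notebook.

# 1) Index: encode every passage and store the vectors in Pinecone
vectors = retriever.encode(passages)            # one dense vector per passage
index.upsert(zip(ids, vectors, metadata))       # metadata keeps the original text

# 2) Query: encode the question and retrieve the most similar passages
xq = retriever.encode(question)
matches = index.query(vector=xq, top_k=3, include_metadata=True)

# 3) Generate: feed question + retrieved text into the generator (e.g. BART)
contexts = [m["metadata"]["passage_text"] for m in matches["matches"]]
answer = generator(question, contexts)          # natural-language answer, with sources
```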
So let's actually jump straight into the code for building all of this. 00:04:36.720 |
So we're going to be working from this example over on the Pinecone docs. 00:04:41.480 |
So it's pinecone.io/docs/abstractive-question-answering. 00:04:48.600 |
And what we want to do is just open this up in Colab over here. 00:04:57.360 |
So in here we have datasets, Pinecone, sentence transformers, and PyTorch. 00:05:02.760 |
And we'll jump into what each one of those does pretty soon. 00:05:07.000 |
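For reference, the install cell looks roughly like this. The package names follow the Pinecone example from around that time; in particular the Pinecone client package has since been renamed, so treat the exact names as an assumption:

```python
# install the libraries used in the walkthrough (Colab cell)
!pip install -qU datasets pinecone-client sentence-transformers torch
```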
Once that is installed, we come down here and we're going to just load and prepare our dataset. 00:05:12.640 |
So we'll be taking this Wikipedia snippets dataset. 00:05:15.560 |
This is coming from the HuggingFace datasets hub. 00:05:19.240 |
So we're loading it like this and it's a pretty big dataset. 00:05:24.120 |
So we're actually streaming that data by saying streaming equals true. 00:05:30.640 |
So this will just allow us to load what we're using right now, rather than loading the full thing into memory. 00:05:39.720 |
So we're using a seed here just so you can replicate what I'm doing here. 00:05:45.480 |
And then we'll come down here and we can just show the first item or the first document from the dataset. 00:05:54.400 |
We take the next item and we can see we have the ID, and the start and end markers for where the passage sits in the article. 00:06:04.280 |
And we have article title, section title, and then we have the passage text. 00:06:14.320 |
So this is what we're going to be encoding and storing in our vector database. 00:06:18.440 |
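As a sketch, loading and peeking at the streamed dataset looks something like this. The dataset name and shuffle seed are taken from the Pinecone example this walkthrough follows, so double-check them against your copy of the notebook:

```python
from datasets import load_dataset

# stream the Wikipedia snippets dataset so we don't download the whole thing
wiki_data = load_dataset(
    "vblagoje/wikipedia_snippets_streamed",
    split="train",
    streaming=True,
).shuffle(seed=960)  # fixed seed so the run is reproducible

# look at the first document: an ID, start/end markers,
# article_title, section_title, and the passage_text we will encode
next(iter(wiki_data))
```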
So what I'm going to do here is actually filter for only the documents that have history in the section title. 00:06:27.760 |
So basically we just want history related documents. 00:06:33.520 |
Now we can't check how many items we have there because we're using the streaming feature. 00:06:38.280 |
So that will just essentially stream everything. 00:06:40.880 |
And if it sees history, it will let it through. 00:06:48.640 |
So we're just going to take the first 50,000 of those, which is plenty for this example. 00:06:57.720 |
Now, one thing I should make you aware of here is in your runtime, it should be GPU anyway, but in 00:07:04.560 |
case it's not here, you can set your hardware to use GPU. 00:07:10.320 |
If it's on None, it means you're using CPU and it will be a lot slower when we're embedding the passages. 00:07:14.960 |
So we do want to make sure that we're using GPU. 00:07:21.320 |
So after that has completed, we have our 50,000 documents all with history in the section title. 00:07:30.880 |
So if we take a look at the head here, we can see that all of those, they don't all say history 00:07:36.560 |
specifically, but they have history at least in the title like here. 00:07:42.200 |
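A minimal sketch of this filtering step, assuming the streamed records carry the `section_title` field shown above: we keep anything mentioning "History" and collect the first 50,000 matches into a pandas DataFrame.

```python
from tqdm.auto import tqdm
import pandas as pd

# keep only passages whose section title mentions "History"
history = wiki_data.filter(lambda d: "History" in d["section_title"])

total_docs = 50_000
docs = []
for d in tqdm(history, total=total_docs):
    docs.append({
        "article_title": d["article_title"],
        "section_title": d["section_title"],
        "passage_text": d["passage_text"],
    })
    if len(docs) == total_docs:
        break

df = pd.DataFrame(docs)
df.head()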
So what we're going to do now is we'll need to embed and index all of these passages or documents. 00:07:51.760 |
So to do that, we'll need to initialize the Pinecone index, but I'm going to do that in a moment. 00:07:58.840 |
So I'm going to scroll down to here, come to the retriever model, and we're going to be using this 00:08:03.560 |
flax-sentence-embeddings/all_datasets_v3_mpnet-base model. 00:08:08.400 |
So this is basically one of the best sentence transformer models you can use for basically any general-purpose task. 00:08:16.240 |
So that's why it has "all" in the name; it has been trained on, I think, a billion sentence pairs. 00:08:21.520 |
So it's a pretty good model to try and use whenever you're not sure which model to use. 00:08:33.160 |
And then one thing we will want to do is make sure we move this to a GPU. 00:08:40.200 |
So actually what we need to do is import Torch. 00:08:44.400 |
Now I want to say device equals CUDA, if Torch CUDA is available. 00:08:53.120 |
So this is saying, if there's a CUDA enabled GPU, set the device to that. 00:09:03.640 |
And actually rather than moving the retriever to that device, I'm going to come back up to the 00:09:09.800 |
initialization here, and I'm going to initialize it on that device to start with. 00:09:16.720 |
Now, an important thing to note here is that we have the word embedding dimension 768. 00:09:23.920 |
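Initializing the retriever might look like this, using the model name mentioned above and loading it straight onto the GPU as described. The exact model ID is taken from the Pinecone example, so treat it as an assumption:

```python
import torch
from sentence_transformers import SentenceTransformer

# use a CUDA GPU if one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the retriever directly onto the chosen device
retriever = SentenceTransformer(
    "flax-sentence-embeddings/all_datasets_v3_mpnet-base",
    device=device,
)

retriever.get_sentence_embedding_dimension()  # -> 768
```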
So remember that and we'll come up here and we will initialize our Pinecone index. 00:09:30.200 |
So the first thing we need to do is connect to our Pinecone environment. 00:09:34.280 |
So we need an API key for that, which is free. 00:09:37.000 |
So to get that, we need to go to app.pinecone.io. 00:09:44.000 |
Once here, we will either need to sign up or log in. 00:09:50.120 |
And once we've done that, we'll just get a little loading screen here, and then we should find ourselves in the Pinecone console. 00:09:57.040 |
So on the top left up here, you have your organization and then you have projects. 00:10:04.280 |
So one of those should say like your name and default project. 00:10:10.720 |
And then here, I just have a list of the indexes that I currently have running. 00:10:15.520 |
Now I think abstractive question answering is not in there. 00:10:19.200 |
So what I'm going to do is we're going to have to create it. 00:10:22.040 |
So we come over to API keys on the left here. 00:10:24.840 |
We copy the API key value, come over to here, and then we will just paste it into here. 00:10:31.920 |
I'm going to go and paste mine into a new variable. 00:10:34.920 |
So mine is stored in a new variable called API key. 00:10:41.240 |
And what we're going to do is create a new index. 00:10:47.120 |
We're going to call it abstractive question answering. 00:10:49.480 |
And we are going to say, if that index name does not exist, then we create it. 00:10:55.080 |
Now, remember I said to remember that dimensionality number, 768, from before. 00:11:03.040 |
We need this number here to align with the embedding dimension of the retriever model. 00:11:14.000 |
So retriever.get_sentence_embedding_dimension(), like so, and we get 768. 00:11:21.440 |
So we can actually take this and place it in here rather than hard-coding it. 00:11:26.040 |
For the metric, because the embedding vectors are normalized, as we can see here, we can 00:11:34.880 |
actually use either dot product or cosine similarity here; we're going to just stick with 00:11:38.880 |
cosine similarity, and it will just take a moment for the index to be created. 00:11:44.360 |
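With the older pinecone-client API that the original notebook uses (the client has since changed, so take the exact calls as an assumption), creating and connecting to the index looks roughly like this:

```python
import pinecone

# API key from app.pinecone.io; the environment name depends on your project
pinecone.init(api_key=API_KEY, environment="us-west1-gcp")

index_name = "abstractive-question-answering"

# create the index only if it doesn't already exist, matching the
# retriever's 768-d embeddings and using cosine similarity
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=retriever.get_sentence_embedding_dimension(),
        metric="cosine",
    )

# connect to the index
index = pinecone.Index(index_name)
```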
Once we have created it, we will move on to the next step, which is generating embeddings and upserting them. 00:11:49.480 |
So let's scroll down and we will come down to the generating embeddings and upserting section. 00:11:55.800 |
So what we're going to do here is, in batches of 64, we're going to encode and upsert everything. 00:12:03.000 |
So we'll have 64 of these passages all at one time, and we're going to encode them with the retriever. 00:12:09.760 |
Then what we're going to do is get the metadata. 00:12:11.840 |
So that is simply the text that we have in here. 00:12:16.520 |
So if I show you an example: I'm going to take df, take the 00:12:24.960 |
first few items, and paste that, so basically we're going to do this. 00:12:30.760 |
We're going to take all of that data that we have in our data frame. 00:12:34.680 |
And for each one of our vectors, so the first one would be this, we're going to attach that row as metadata. 00:12:41.760 |
And then here we'd create some unique IDs; it's just a count. We could actually 00:12:46.440 |
use the IDs themselves, but this is just easier, and we're going to add 00:12:51.040 |
all those to an upsert list, which is just a list that contains tuples, 00:12:55.720 |
each containing an ID, the vector embedding, and the metadata related to that embedding. 00:13:02.760 |
So basically insert it all into the Pinecone vector database. 00:13:07.000 |
Then at the end here, we're just going to check that we have all of our vectors in the index. 00:13:11.920 |
And you can see here that it actually brought through 50,001. 00:13:22.440 |
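The embed-and-upsert loop described here, sketched out on the objects defined above; the batch size and the final stats check follow the description, but the exact code in the notebook may differ slightly:

```python
from tqdm.auto import tqdm

batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    batch = df.iloc[i:i + batch_size]
    # encode the 64 passages in this batch into vectors
    emb = retriever.encode(batch["passage_text"].tolist()).tolist()
    # keep the original fields as metadata so we can read the text back later
    meta = batch.to_dict(orient="records")
    # simple running-count IDs
    ids = [str(x) for x in range(i, i + len(batch))]
    # upsert (id, vector, metadata) tuples into Pinecone
    index.upsert(vectors=list(zip(ids, emb, meta)))

index.describe_index_stats()  # check the vector count matches what we inserted
```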
So I can try running this, but it's basically just going to work through all of those batches. 00:13:28.200 |
So see here, I'm not going to wait until the end of that because it will take a 00:13:33.920 |
little bit of time, even when we're using a GPU on Colab, although actually not too long. 00:13:39.360 |
Anyway, I'm going to stop that and we'll just move straight onto the generator 00:13:44.400 |
and we can at least just see from the past runs what it would be doing. 00:13:50.000 |
So the first thing we would do here is initialize the tokenizer and the model. 00:13:56.120 |
And we're using this BART LFQA, which is a Long-Form Question Answering model. 00:14:01.480 |
So if we come up here, we'll explain a little bit of what this model is. 00:14:05.600 |
So we're using the ELI5 ("Explain Like I'm 5") BART model, which is just a sequence-to-sequence 00:14:11.080 |
model, which has been trained using the ELI5 dataset, which is from Reddit. 00:14:16.080 |
And if we come down here, we can see the format that we're going to be feeding into the model. 00:14:22.080 |
So we're going to have our question, which is going to be what we type. 00:14:30.080 |
And then with each passage, we precede it with a P token like this. 00:14:35.960 |
And then we have the passage and then P token, another passage. 00:14:39.160 |
And basically the model has been trained to read this sort of format and then 00:14:43.640 |
generate a natural language answer based on this question and based on these contexts. 00:14:52.280 |
So we come down here, we would initialize it like that. 00:14:56.880 |
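Initializing the generator with Hugging Face transformers might look like this; the model ID is the BART LFQA checkpoint used in the Pinecone example (vblagoje/bart_lfqa), which I'm assuming here:

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# BART model fine-tuned for long-form question answering (ELI5)
tokenizer = BartTokenizer.from_pretrained("vblagoje/bart_lfqa")
generator = BartForConditionalGeneration.from_pretrained("vblagoje/bart_lfqa").to(device)
```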
And then we're just going to create these two helper functions. 00:15:07.600 |
The first will take us from text to a vector embedding, or the query embedding as we'd usually call it. 00:15:14.040 |
Then we query Pinecone like this; this will return the top k passages, and it will 00:15:22.600 |
return these, what we call the contexts or the passages or something along those lines. 00:15:29.360 |
One thing that is pretty important here is that we include the metadata because 00:15:33.400 |
that includes the human readable text of those passages that we're going to be using. 00:15:39.280 |
Because we are going to be formatting them into this string here. 00:15:44.240 |
So we have the context here, which is going to be the P token 00:15:49.840 |
followed by the passage, and then we concatenate all those together. 00:15:54.680 |
And then what we would do is create that format that you saw before, with the 00:15:59.200 |
word "question" followed by the question, and "context" followed by those contexts, 00:16:03.040 |
with the P tokens preceding each one. 00:16:06.880 |
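Here is a sketch of those two helper functions; the metadata field name matches what we stored earlier, and the old-style Pinecone query call is an assumption based on the client version the notebook used:

```python
def query_pinecone(query, top_k):
    # encode the question into a query vector and retrieve the top_k passages
    xq = retriever.encode(query).tolist()
    return index.query(vector=xq, top_k=top_k, include_metadata=True)

def format_query(query, context):
    # prefix each retrieved passage with a <P> token, then build the
    # "question: ... context: ..." string that the BART LFQA model expects
    context = [f"<P> {m['metadata']['passage_text']}" for m in context]
    context = " ".join(context)
    return f"question: {query} context: {context}"
```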
So with those help functions, we then move on to our query. 00:16:10.840 |
So we have our query, when was the first electric power system built? 00:16:15.440 |
We can query Pinecone and that will return these matches here. 00:16:18.640 |
So this is the response directly from Pinecone. 00:16:21.720 |
And we see that we have the passage text and we have some other metadata in there as well. 00:16:28.080 |
So this is just returning one match here. 00:16:32.280 |
We use pretty print here so that we can more nicely visualize the response. 00:16:38.240 |
And then what we want to do is query or format our query. 00:16:42.200 |
So we have our query, which is the question we just asked up here. 00:16:46.440 |
When was the first electric power system built? 00:16:48.280 |
And then we also have what we returned from Pinecone. 00:16:51.840 |
And then we print what we get from there, or what we will be producing. 00:16:57.040 |
So we have the question and you can see that same format that you saw before. 00:17:01.760 |
And then you have context and you have the P token followed by the passages. 00:17:05.840 |
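Putting those helpers together for the first query, the cell might look like this (pretty-printing is just for readability):

```python
from pprint import pprint

query = "when was the first electric power system built?"
result = query_pinecone(query, top_k=1)
pprint(result)  # raw Pinecone response, including the passage_text metadata

# build the "question: ... context: <P> ..." string for the generator
formatted = format_query(query, result["matches"])
print(formatted)
```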
So we write another function, generate answer. 00:17:09.240 |
This is going to take the formatted query here. 00:17:13.840 |
It's going to tokenize it using our Bart tokenizer. 00:17:18.280 |
And then it's going to use the generator to generate a prediction. 00:17:26.960 |
So from there, that will output a load of token IDs. 00:17:34.880 |
So then we use this batch decode, or the tokenizer decode, to decode those token IDs back into human-readable text. 00:17:45.680 |
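A sketch of that generate_answer function; the generation parameters (beam search, min length, and the max length of 40 mentioned later) follow the walkthrough, but treat the exact values as assumptions:

```python
def generate_answer(query):
    # tokenize the formatted "question: ... context: ..." string
    inputs = tokenizer([query], truncation=True, max_length=1024, return_tensors="pt").to(device)
    # generate token IDs for the answer
    ids = generator.generate(
        inputs["input_ids"],
        num_beams=2,
        min_length=20,
        max_length=40,
    )
    # decode the token IDs back into human-readable text
    return tokenizer.batch_decode(ids, skip_special_tokens=True)[0]
```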
So if we then go ahead and actually run that, we will see that we get our answer. 00:17:52.440 |
The first electric power system was built in 1881 at Godalming in England. 00:17:58.920 |
It was powered by two water wheels, and so on. And so if we look at that answer 00:18:05.960 |
or what we looked at here, we can see that it is basically reformulating 00:18:12.720 |
that information there into a more concise answer. 00:18:15.880 |
So we see in 1881 at Godalming in England and so on. 00:18:26.160 |
If we ask some more questions, say, how was the first wireless message sent? 00:18:31.280 |
And this time we're going to return five of these contexts. 00:18:36.720 |
And ideally this should give the BART generation model more information to work with. 00:18:45.200 |
So it should, generally speaking, be able to produce a better answer if it has more context. 00:18:52.240 |
In this case, we see the first wireless message was sent in 1866, and so on and so on. 00:19:02.400 |
We set that by setting the max length up here at 40. 00:19:06.360 |
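Asking that follow-up question with more retrieved contexts, and then printing those contexts so we can check where the answer came from, might look like this (the question text is the one discussed above):

```python
query = "how was the first wireless message sent?"
context = query_pinecone(query, top_k=5)  # retrieve five passages this time
print(generate_answer(format_query(query, context["matches"])))

# inspect the retrieved passages to see where the answer is coming from
for m in context["matches"]:
    print(m["metadata"]["passage_text"], end="\n---\n")
```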
And, you know, I don't know the answer to this question. 00:19:09.560 |
So what we can do is, you know, not just rely on the model to actually 00:19:13.800 |
give us the answer, which is a problem that we see a lot with the GPT-3, 00:19:19.680 |
ChatGPT, and so on models; we can actually have a look at where the answer is coming from. 00:19:26.520 |
So we can see here, I think this is probably the most relevant part. 00:19:35.400 |
So this guy is claimed to have transmitted an electrical signal through the atmosphere. 00:19:45.560 |
And I don't think any of the other contexts really give us anything more relevant. 00:19:50.280 |
So we can see that according to this context, and if we want to provide a 00:19:54.880 |
link back to where that was actually from, that does at least seem to be true. 00:19:59.800 |
Now, this is probably a good example of when this is useful. 00:20:04.280 |
So if we ask a question like, where did COVID-19 originate? 00:20:12.280 |
And I think most of us probably know that this is kind of nonsense, right? 00:20:17.320 |
So it's a zoonotic disease transmitted from one animal to another. 00:20:23.520 |
Let's have a look at where this is coming from. 00:20:25.880 |
And we can see that all of these contexts don't actually mention COVID-19 at all. 00:20:33.640 |
And so we can pretty confidently say that this is nonsense. 00:20:38.400 |
And simply the reason is that this model has never seen anything about COVID-19. 00:20:44.080 |
The BART generation model hasn't seen anything about that because 00:20:46.960 |
the training data it was trained on was from before that time. 00:20:50.720 |
And as well, none of the contexts that we have indexed mention it either. 00:20:57.360 |
So it can be pretty useful to include those sources, particularly when it comes to checking answers like this. 00:21:03.480 |
And then let's finish with a final few questions. 00:21:07.640 |
I'm not going to check these, but I'm pretty sure. 00:21:21.840 |
I think this one is possibly true, possibly not. 00:21:27.080 |
I can't remember, but nonetheless, we get some pretty cool answers there. 00:21:33.000 |
So that's it for this video and this example walkthrough of abstractive question answering. 00:21:44.080 |
So thank you very much for watching and I will see you again in the next one.