
Open Source Generative AI in Question-Answering (NLP) using Python


Chapters

0:00 What is generative AI and Q&A?
1:02 Generative question-answering architecture
4:36 Getting code and prerequisites
5:06 Data preprocessing
7:41 Embedding and indexing text
13:50 BART text generation model
14:52 Querying with generative question-answering
17:45 Asking questions and getting results
21:29 Final notes

Whisper Transcript

00:00:00.000 | Today, we're going to talk about abstractive or generative question and answering.
00:00:04.960 | And we're going to focus on actually building or implementing something like this using a few different components.
00:00:12.760 | But in the end, what we're going to essentially be able to get is we're going to be able to ask a question in natural language.
00:00:20.540 | And we're going to be able to return documents or web pages or so on that are related to our particular question.
00:00:29.600 | And we're also going to be able to use something called a generator model to generate a human natural language answer to our question
00:00:39.680 | based on these documents that we've retrieved from an external source.
00:00:44.600 | So we can think of it as a GPT model that is answering our questions.
00:00:50.160 | But if GPT was also giving us the sources of the information that it was answering questions based on.
00:00:57.120 | So let's jump straight into just understanding what is exactly that we're going to be building.
00:01:02.120 | So we're going to start with all of our documents, our text or whatever it is we're going to be using.
00:01:08.200 | In our case, we're going to be using text from Wikipedia.
00:01:12.960 | So we're going to take all of this and we're going to encode it using what's called a retriever model.
00:01:19.400 | So it's called a retriever.
00:01:21.840 | And what that will give us is a ton of these vectors.
00:01:27.120 | Where each vector represents like a segment of our text.
00:01:31.520 | So for example, maybe we might have this little segment here followed by this segment and so on.
00:01:39.440 | We're going to take all of those vector embeddings and we're going to put them into a vector database over here.
00:01:47.760 | Now we're going to be using Pinecone for this.
00:01:51.720 | So what we'll do is just put everything in Pinecone and at that point, we've actually built the retrieval pipeline.
00:01:59.880 | We don't have the generative part of it yet, but we do have the retrieval pipeline.
00:02:04.240 | So then what we can do is ask a question.
00:02:07.200 | So we'll ask a question over here, it'll be in natural language.
00:02:10.680 | And what we'll do with that is actually also take that into the retriever model and that will output.
00:02:18.600 | Maybe we'll output it over here, that will output a single query vector, or question vector.
00:02:26.000 | That will then be passed into Pinecone here, which will compare that query vector to all of the previously encoded vectors.
00:02:33.920 | And it will return a few of those that are the most relevant to our particular query vector.
00:02:38.600 | So it will bring these out and it will say, okay, these three items are the most relevant to your particular query.
00:02:45.560 | And it's basing those on the concept or the idea behind the language being used.
00:02:50.520 | It's not basing them on matching particular terms, like keyword matching or anything like that.
00:02:55.400 | It's actually basing it on the semantic understanding of the question and of the answers and of the relevant documents.
00:03:04.480 | So we'll take these and we'll bring them over here.
00:03:08.400 | Now over here, we're going to have what's called a generator model.
00:03:12.480 | So the generator model, it can be a lot of different things.
00:03:16.000 | One example that I kind of briefly mentioned is it could actually be something like GPT-3.
00:03:22.120 | So you could have GPT-3 here.
00:03:24.840 | We're going to be using another one or another model called BART that will generate everything for us.
00:03:31.600 | Just because this is open source and we can just run it in our Colab notebook.
00:03:36.840 | But you can use GPT-3, you can use Cohere, you can use all these different types of models.
00:03:42.440 | Depending on what it is you're wanting to do.
00:03:44.680 | So we'd pass those relevant context or documents, whatever you like to call them.
00:03:50.800 | We pass those into our generator model.
00:03:53.480 | Alongside that, we also want to pass in the question, the original question.
00:03:58.320 | One thing that I missed is actually this step here.
00:04:10.040 | We would be converting these back into their original text format,
00:04:13.440 | which we've stored in Pinecone.
00:04:17.080 | So that will actually be the text and the same with the query.
00:04:21.680 | So we're going to have the query and the context and we're going to feed them into the generator.
00:04:25.560 | And that will then output us an answer in natural language format.
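To make that flow concrete before we get to the notebook, here is a minimal end-to-end sketch in Python. Every name in it (the retriever, index, generator, and tokenizer objects, plus the question/context prompt format) is a placeholder for things we build step by step below, so treat it as an outline rather than the exact notebook code.

```python
# Minimal outline of the retriever -> vector DB -> generator flow described above.
# All objects (retriever, index, generator, tokenizer) are assumed to already exist;
# the walkthrough below shows how each one is actually set up.
def answer_question(question, retriever, index, generator, tokenizer, top_k=3):
    # 1. encode the natural language question into a query vector
    xq = retriever.encode([question]).tolist()
    # 2. retrieve the most semantically similar passages from the vector database
    res = index.query(xq, top_k=top_k, include_metadata=True)
    contexts = [m["metadata"]["passage_text"] for m in res["matches"]]
    # 3. hand the question plus retrieved passages to the generator model
    prompt = f"question: {question} context: " + " ".join(f"<P> {c}" for c in contexts)
    inputs = tokenizer([prompt], return_tensors="pt", truncation=True)
    ids = generator.generate(inputs["input_ids"], num_beams=2, max_length=40)
    return tokenizer.batch_decode(ids, skip_special_tokens=True)[0]
```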
00:04:31.400 | So let's actually jump straight into the code for building all of this.
00:04:36.720 | So we're going to be working from this example over on the Pinecone docs.
00:04:41.480 | So it's pinecone.io/docs/abstractive-question-answering.
00:04:45.760 | There'll be a link in the video as well.
00:04:48.600 | And what we want to do is just opening Colab over here.
00:04:51.760 | That will open this.
00:04:53.120 | So let's get started.
00:04:54.840 | We need to install any dependencies.
00:04:57.360 | So in here we have datasets, Pinecone, sentence transformers, and PyTorch.
00:05:02.760 | And we'll jump into what each one of those does pretty soon.
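For reference, the install cell looks roughly like this; the package names are approximate, so check the linked notebook for the exact versions it pins.

```python
# Install the libraries used in this walkthrough (names/versions approximate;
# see the linked Pinecone example notebook for the exact pins)
!pip install -qU datasets pinecone-client sentence-transformers torch
```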
00:05:06.720 | Okay.
00:05:07.000 | Once that is installed, we come down here and we're going to just load and prepare our dataset.
00:05:12.640 | So we'll be taking these Wikipedia snippets dataset.
00:05:15.560 | This is coming from the HuggingFace datasets hub.
00:05:19.240 | So we're loading it like this and it's a pretty big dataset.
00:05:24.120 | So we're actually streaming that data by saying streaming equals true.
00:05:29.280 | I think it's nine gigabytes.
00:05:30.640 | So this will just allow us to load what we're using right now, rather than loading the full thing in
00:05:36.080 | memory at once.
00:05:36.840 | And then we shuffle that dataset randomly.
00:05:39.720 | So we're using a seed here just so you can replicate what I'm doing here.
00:05:44.200 | Let me run this.
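The loading step looks roughly like the sketch below; the dataset id and seed are what I believe the notebook uses, but double-check them against the linked example.

```python
from datasets import load_dataset

# Stream the Wikipedia snippets dataset from the Hugging Face hub so we don't
# have to pull the whole (~9GB) dataset into memory at once
wiki_data = load_dataset(
    "vblagoje/wikipedia_snippets_streamed",  # dataset id as used in the Pinecone example, to the best of my recollection
    split="train",
    streaming=True,
)
# shuffle with a fixed seed so the run is reproducible
wiki_data = wiki_data.shuffle(seed=960)
```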
00:05:45.480 | And then we'll come down here and we can just show the first item or the first document from the
00:05:52.200 | dataset.
00:05:52.960 | So we're just iterating through it.
00:05:54.400 | We take the next item and we can see we have the ID, and start and end, which show where
00:06:01.280 | the text is actually being pulled from.
00:06:04.280 | And we have article title, section title, and then we have the passage text.
00:06:09.240 | So this is the document or the context.
00:06:11.920 | And this is what we're going to be encoding.
00:06:14.320 | So this is what we're going to be encoding and storing in our vector database.
00:06:18.440 | So what I'm going to do here is actually filter for only the documents that have history in the
00:06:26.600 | section title here.
00:06:27.760 | So basically we just want history related documents.
00:06:31.640 | So we do that.
00:06:33.520 | Now we can't check how many items we have there because we're using the streaming feature.
00:06:38.280 | So that will just essentially stream everything.
00:06:40.880 | And if it sees history, it will let it through.
00:06:42.760 | If not, it will not let it through.
00:06:44.760 | But there are quite a few passages in there.
00:06:48.640 | So we're just going to filter out or we're going to choose the first 50,000 of those, which is
00:06:55.360 | quite a bit.
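In code, the filter-and-collect step looks something like this sketch; the field names follow the document structure we just looked at.

```python
from tqdm.auto import tqdm
import pandas as pd

# keep only passages whose section title mentions history
history = wiki_data.filter(lambda d: d["section_title"].startswith("History"))

total_doc_count = 50_000  # number of passages we want to keep

docs = []
for d in tqdm(history, total=total_doc_count):
    # keep just the fields we need as metadata later
    docs.append({
        "article_title": d["article_title"],
        "section_title": d["section_title"],
        "passage_text": d["passage_text"],
    })
    if len(docs) == total_doc_count:
        break

# put everything into a dataframe so it's easy to inspect and batch later
df = pd.DataFrame(docs)
df.head()
```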
00:06:57.720 | Now, one thing I should make you aware of here is in your runtime, it should be GPU anyway, but in
00:07:04.560 | case it's not here, you can set your hardware to use GPU.
00:07:10.320 | If it's on none, it means you're using CPU and it will be a lot slower when we're embedding
00:07:14.000 | everything later on.
00:07:14.960 | So we do want to make sure that we're using GPU.
00:07:17.920 | Okay.
00:07:21.320 | So after that has completed, we have our 50,000 documents all with history in the section title.
00:07:30.880 | So if we take a look at the head here, we can see that all of those, they don't all say history
00:07:36.560 | specifically, but they have history at least in the title like here.
00:07:41.240 | Okay.
00:07:42.200 | So what we're going to do now is we'll need to embed and index all of these passages here or
00:07:50.200 | embed and store all of them.
00:07:51.760 | So to do that, we'll need to initialize the Pinecone index, but I'm going to do
00:07:56.000 | that after initializing the retriever model.
00:07:58.840 | So I'm going to scroll down to here, come to the retriever model, and we're going to be using this
00:08:03.560 | flax-sentence-embeddings all_datasets_v3_mpnet-base model.
00:08:08.400 | So this is basically one of the best sentence transformer models you can use for basically
00:08:15.480 | anything.
00:08:16.240 | That's why it has "all" in its name; it has been trained on, I think, a billion sentence pairs.
00:08:21.520 | So it's a pretty good model to try and use whenever you're not sure which model to use.
00:08:27.720 | So we initialize that, okay.
00:08:29.920 | It might take a moment to download.
00:08:31.600 | Okay.
00:08:33.160 | And then one thing we will want to do is make sure we move this to a GPU.
00:08:40.200 | So actually what we need to do is import Torch.
00:08:44.400 | Now I want to say device equals CUDA, if Torch CUDA is available.
00:08:53.120 | So this is saying, if there's a CUDA enabled GPU, set the device to that.
00:08:57.840 | Otherwise we're going to use CPU.
00:09:00.960 | Okay.
00:09:02.000 | And then we can see what the device is.
00:09:03.640 | And actually rather than moving the retriever to that device, I'm going to come back up to the
00:09:09.800 | initialization here, and I'm going to initialize it on that device to start with.
00:09:14.120 | So like this.
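Putting those two steps together, the retriever initialization looks roughly like this:

```python
import torch
from sentence_transformers import SentenceTransformer

# use a CUDA-enabled GPU if one is available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the retriever and place it straight onto that device
retriever = SentenceTransformer(
    "flax-sentence-embeddings/all_datasets_v3_mpnet-base",
    device=device,
)
retriever
```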
00:09:15.720 | Okay.
00:09:16.720 | Now, an important thing to note here is that we have the word embedding dimension 768.
00:09:23.920 | So remember that and we'll come up here and we will initialize our Pinecone index.
00:09:30.200 | So the first thing we need to do is connect to our Pinecone environment.
00:09:34.280 | So we need an API key for that, which is free.
00:09:37.000 | So to get that, we need to go to app.pinecone.io.
00:09:44.000 | Once here, we will either need to sign up or log in.
00:09:48.320 | So I'm going to log in.
00:09:50.120 | And once we've done that, we'll just get a little loading screen here, and then we should find
00:09:56.280 | something like this.
00:09:57.040 | So on the top left up here, you have your organization and then you have projects.
00:10:04.280 | So one of those should say like your name and default project.
00:10:08.800 | So I'm going to go over to that.
00:10:10.720 | And then here, I just have a list of the indexes that I currently have running.
00:10:15.520 | Now I think abstractive question answering is not in there.
00:10:19.200 | So what I'm going to do is we're going to have to create it.
00:10:22.040 | So we come over to API keys on the left here.
00:10:24.840 | We copy the API key value, come over to here, and then we will just paste it into here.
00:10:31.920 | I'm going to go and paste mine into a new variable.
00:10:34.920 | So mine is stored in a new variable called API key.
00:10:38.960 | So I initialize with that.
00:10:41.240 | And what we're going to do is create a new index.
00:10:47.120 | We're going to call it abstractive question answering.
00:10:49.480 | And we are going to say, if that index name does not exist, then we create it.
00:10:55.080 | Now, remember I said to remember that dimensionality number, 768, before.
00:11:01.560 | This is why, because it's here.
00:11:03.040 | We need this number here to align with the embedding
00:11:08.960 | dimensionality of our retriever model.
00:11:11.280 | We can also check that using this.
00:11:14.000 | So retriever get sentence embedding dimension, like so, and we get 768.
00:11:21.440 | So we can actually take this and place it in here rather than hard-coding it.
00:11:26.040 | For the metric, because the embedding vectors are normalized, as we can see here, we can
00:11:34.880 | actually use either dot product or cosine similarity; we're going to just stick with
00:11:38.880 | cosine similarity, and then it will just take a moment for the index to be created.
00:11:43.560 | Okay.
00:11:44.360 | Once we have created it, we will move on to this, which is
00:11:47.240 | just connecting to our new index.
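A sketch of that index setup is below, using the pinecone-client API as it was around the time of this video (newer client versions replace pinecone.init with a Pinecone class). The environment value is just a placeholder for whatever your own project shows, and API_KEY is the variable holding the key copied from the console.

```python
import pinecone

# connect to Pinecone; "us-west1-gcp" is a placeholder environment,
# use whatever environment your own project is in
pinecone.init(api_key=API_KEY, environment="us-west1-gcp")

index_name = "abstractive-question-answering"

# create the index only if it doesn't already exist; the dimension must match
# the retriever's embedding size (768) and we use cosine similarity as the metric
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=retriever.get_sentence_embedding_dimension(),
        metric="cosine",
    )

# connect to the new index
index = pinecone.Index(index_name)
```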
00:11:49.480 | So let's scroll down and we will come down to generating embeddings and upserting.
00:11:55.800 | So what we're going to do here is in batches of 64, we're going
00:11:59.680 | to extract our passage text.
00:12:03.000 | So we'll have 64 of these passages all at one time, and we're going to encode
00:12:08.280 | them all using our retriever model.
00:12:09.760 | Then what we're going to do is get the metadata.
00:12:11.840 | So that is simply the text that we have in here.
00:12:16.520 | So if I show you an example, I'm going to take the dataframe df and we're
00:12:24.960 | going to take the first few items, so basically we're going to do this.
00:12:30.760 | We're going to take all of that data that we have in our data frame.
00:12:34.680 | And for each one of our vectors, so first one would be this, we're going
00:12:38.960 | to attach that metadata to the vector.
00:12:41.760 | And then here we'd create some unique IDs, just a count, we could actually
00:12:46.440 | use the IDs themselves, but this is just easier, and we're going to add
00:12:51.040 | all those to an upsert list, which is just a list of tuples,
00:12:55.720 | each containing an ID, the vector embedding, and the metadata related to that embedding.
00:13:00.800 | And then we upsert all of that.
00:13:02.760 | So basically insert it all into the Pinecone vector database.
00:13:07.000 | Then at the end here, we're just going to check that we have all
00:13:10.680 | those vectors in the index.
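The embed-and-upsert loop looks roughly like this:

```python
from tqdm.auto import tqdm

batch_size = 64  # how many passages to encode and upsert at a time

for i in tqdm(range(0, len(df), batch_size)):
    i_end = min(i + batch_size, len(df))
    batch = df.iloc[i:i_end]
    # encode this batch of passage text into vector embeddings
    emb = retriever.encode(batch["passage_text"].tolist()).tolist()
    # the metadata is simply the original row data, so we can recover the text later
    meta = batch.to_dict(orient="records")
    # simple incremental IDs (we could also reuse the dataset's own IDs)
    ids = [str(x) for x in range(i, i_end)]
    # build (id, vector, metadata) tuples and upsert them into Pinecone
    to_upsert = list(zip(ids, emb, meta))
    index.upsert(vectors=to_upsert)

# check that everything made it into the index
index.describe_index_stats()
```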
00:13:11.920 | And you can see here that it actually brought through 50,001.
00:13:17.560 | So maybe there was a duplicate in there.
00:13:19.320 | I'm not too sure.
00:13:20.880 | But we have all of those in there.
00:13:22.440 | So I can try running this, but it's basically just going
00:13:26.040 | to start from the start again.
00:13:28.200 | So see here, I'm not going to wait until the end of that because it will take a
00:13:33.920 | little bit of time, even when we're using a GPU on Colab, although actually not too long.
00:13:39.360 | Anyway, I'm going to stop that and we'll just move straight onto the generator
00:13:44.400 | and we can at least just see from the past runs what it would be doing.
00:13:50.000 | So the first thing we would do here is initialize the tokenizer and
00:13:53.840 | the model for our generator model.
00:13:56.120 | And we're using this BART LFQA, which is a Long-Form Question Answering model.
00:14:00.840 | Okay.
00:14:01.480 | So if we come up here, we'll explain a little bit of what this model is.
00:14:05.600 | So we're using the ELI5 BART model, which is just a sequence-to-sequence
00:14:11.080 | model which has been trained using the ELI5 (Explain Like I'm Five) dataset, which is from Reddit.
00:14:16.080 | And if we come down here, we can see the format that we're going to be putting
00:14:19.840 | all of our text into this model.
00:14:22.080 | So we're going to have our question, which is going to be what we type.
00:14:25.120 | We'll say like, what is a sonic boom?
00:14:27.360 | And then that's followed by context.
00:14:30.080 | And then each passage, we precede it with a P token like this.
00:14:35.960 | And then we have the passage and then P token, another passage.
00:14:39.160 | And basically the model has been trained to read this sort of format and then
00:14:43.640 | generate a natural language answer based on this question and based on this
00:14:50.000 | information that we have provided it with.
00:14:52.280 | So we come down here, we would initialize it like that.
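The initialization itself is just a couple of lines; vblagoje/bart_lfqa is the long-form QA BART checkpoint used here, loaded onto the same device as the retriever.

```python
from transformers import BartTokenizer, BartForConditionalGeneration

# load the long-form question answering BART model (trained on the ELI5 dataset)
tokenizer = BartTokenizer.from_pretrained("vblagoje/bart_lfqa")
generator = BartForConditionalGeneration.from_pretrained("vblagoje/bart_lfqa").to(device)
```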
00:14:56.880 | And then we're just going to create these two helper functions.
00:15:01.560 | So this is just to help us query Pinecone.
00:15:04.360 | So given a particular query, we encode it.
00:15:07.600 | So from text to a vector embedding or the query embedding is what we'd usually call it.
00:15:14.040 | We query Pinecone like this, this will return K many passages, and it would
00:15:22.600 | return these, what we call the context or the passages or something along those lines.
00:15:29.360 | One thing that is pretty important here is that we include the metadata because
00:15:33.400 | that includes the human readable text of those passages that we're going to be
00:15:37.080 | feeding in and why do we need that?
00:15:39.280 | Because we are going to be formatting them in this string, which
00:15:42.360 | is like what I showed you before.
00:15:44.240 | We have the context here, which is going to be the P token
00:15:49.840 | followed by the passage, and then we concatenate all those together.
00:15:54.680 | And then what we would do is create that format that you saw before with the
00:15:59.200 | question followed by the question and the context followed by those context
00:16:03.040 | with the P tokens in the, in the middle or preceding each one.
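Those two helper functions look roughly like this:

```python
def query_pinecone(query, top_k):
    # encode the question into a query embedding
    xq = retriever.encode([query]).tolist()
    # return the top_k most relevant passages; include_metadata gives us
    # the human-readable passage text we need for the generator
    return index.query(xq, top_k=top_k, include_metadata=True)


def format_query(query, context):
    # prepend each retrieved passage with a <P> token and join them together
    context = " ".join([f"<P> {m['metadata']['passage_text']}" for m in context])
    # build the "question: ... context: ..." string the model was trained on
    return f"question: {query} context: {context}"
```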
00:16:06.880 | So with those help functions, we then move on to our query.
00:16:10.840 | So we have our query, when was the first electric power system built?
00:16:15.440 | We can query Pinecone and that will return these matches here.
00:16:18.640 | So this is the response directly from Pinecone.
00:16:21.720 | And we see that we have the passage text and we have some, I
00:16:25.960 | think, relevant passages in there.
00:16:28.080 | So this is just returning one here.
00:16:32.280 | We use pretty print here so that we can more nicely visualize
00:16:36.880 | everything or print everything.
00:16:38.240 | And then what we want to do is query or format our query.
00:16:42.200 | So we have our query, which is the question we just asked up here.
00:16:46.440 | When was the first electric power system built?
00:16:48.280 | And then we also have what we returned from Pinecone.
00:16:51.120 | Okay.
00:16:51.840 | And then we print what we get from there, or what we will be producing.
00:16:57.040 | So we have the question and you can see that same format that you saw before.
00:17:01.760 | And then you have context and you have the P token followed by the passages.
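Running those two helpers together looks like this:

```python
query = "when was the first electric power system built?"
# retrieve the single most relevant passage from Pinecone
result = query_pinecone(query, top_k=1)
# build the "question: ... context: <P> ..." string for the generator
query = format_query(query, result["matches"])
print(query)
```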
00:17:05.840 | So we write another function, generate answer.
00:17:09.240 | This is going to take the formatted query here.
00:17:13.840 | It's going to tokenize it using our Bart tokenizer.
00:17:18.280 | And then it's going to use a generator to generate a prediction
00:17:24.160 | or generate an answer.
00:17:25.440 | Okay.
00:17:26.960 | So from there, that will output a load of token IDs,
00:17:33.280 | which we obviously can't read.
00:17:34.880 | So then we use this batch decode or the tokenizer decode to decode
00:17:40.680 | them into human readable text.
00:17:43.160 | Like that.
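The generate_answer helper looks roughly like this; the generation parameters (num_beams, min_length, max_length) are the knobs mentioned later for controlling answer length.

```python
def generate_answer(query):
    # tokenize the formatted "question: ... context: ..." string
    inputs = tokenizer([query], max_length=1024, truncation=True, return_tensors="pt").to(device)
    # generate token IDs for the answer
    ids = generator.generate(
        inputs["input_ids"],
        num_beams=2,
        min_length=20,
        max_length=40,
    )
    # decode the token IDs back into human-readable text
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True)[0]
    return answer

generate_answer(query)
```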
00:17:45.680 | So if we then go ahead and actually run that, we will see that we
00:17:50.280 | want to focus on this bit here.
00:17:52.440 | The first electric power system was built in 1881 at Godalming in England.
00:17:58.920 | It was powered by two water wheels, and so if we look at that answer
00:18:05.960 | or what we looked at here, we can see that it is basically reformulating
00:18:12.720 | that information there into a more concise answer.
00:18:15.880 | So we see in 1881 at Godalming in England and so on.
00:18:21.240 | So that's pretty cool.
00:18:22.880 | Now, what if we go a little further?
00:18:26.160 | If we ask some more questions, say, how was the
00:18:29.280 | first wireless message sent?
00:18:31.280 | And this time we're going to return five of these contexts.
00:18:34.680 | So we're going to return more information.
00:18:36.720 | And ideally this should give us, give the Bart generation model more
00:18:43.280 | information to produce an answer from.
00:18:45.200 | So it should generally speaking, be able to produce a better answer if
00:18:49.520 | we give it more of that information.
00:18:51.040 | But not all the time.
00:18:52.240 | In this case, we see the first wireless message was sent in 1866, and so on.
00:18:58.560 | Okay.
00:18:59.840 | Nice short answer, which is good.
00:19:02.400 | We set that by setting the max length up here at 40.
00:19:06.360 | And, you know, I don't know the answer to this question.
00:19:09.560 | So what we can do is, you know, not just rely on the model to actually
00:19:13.800 | give us the answer, which is a problem that we see a lot with the GPT-3,
00:19:19.680 | ChatGPT and so on models, we can actually have a look at where
00:19:24.560 | this information is actually coming from.
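Checking the sources is just a matter of printing the passage text from the retrieved matches, for example:

```python
# print the retrieved passages the answer was generated from,
# so we can verify the claim against its sources
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end="\n---\n")
```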
00:19:26.520 | So we can see here, I think this is probably the most relevant part.
00:19:35.400 | So this guy is claimed to have transmitted an electrical signal through
00:19:41.800 | the atmosphere at this point, right?
00:19:45.560 | And I don't think any of the other contexts really give us
00:19:48.960 | any more information on that.
00:19:50.280 | So we can see that according to this context, and if we want to provide a
00:19:54.880 | link back to where that was actually from, that does at least seem to be true.
00:19:59.800 | Now, this is probably a good example of when this is useful.
00:20:04.280 | So if we ask a question like, where did COVID-19 originate?
00:20:08.000 | And we get this like random answer.
00:20:12.280 | And I think most of us probably know that this is kind of nonsense, right?
00:20:17.320 | So it's a zoonotic disease transmitted from one animal to another.
00:20:22.840 | Okay.
00:20:23.520 | Let's have a look at where this is coming from.
00:20:25.880 | And we can see that all of these contexts don't actually
00:20:30.720 | contain anything about COVID-19.
00:20:33.640 | And so we can pretty confidently say that this is nonsense.
00:20:38.400 | And simply the reason is that this model has never seen
00:20:42.800 | anything about COVID-19.
00:20:44.080 | The BART generation model hasn't seen anything about that because
00:20:46.960 | the training data it was trained on was from before that time.
00:20:50.720 | And as well, none of the contexts that we have indexed yet
00:20:55.560 | contain anything about it either.
00:20:57.360 | So it can be pretty useful to include that, particularly when it comes
00:21:01.200 | to fact-checking things like that.
00:21:03.480 | And then let's finish with a final few questions.
00:21:06.320 | What was the War of Currents?
00:21:07.640 | I'm not going to check these, but I'm pretty sure.
00:21:11.160 | So this one is true.
00:21:12.440 | Who was the first person on the moon?
00:21:15.560 | Neil Armstrong.
00:21:16.480 | We, I think, all know that that's true.
00:21:18.720 | And what is NASA's most expensive project?
00:21:21.840 | I think this one is possibly, possibly true, possibly not.
00:21:27.080 | I can't remember, but nonetheless, we get some pretty cool answers there.
00:21:33.000 | So that's it for this video and this example walkthrough of abstractive
00:21:38.960 | or generative question answering.
00:21:40.920 | I hope this has been useful and interesting.
00:21:44.080 | So thank you very much for watching and I will see you again in the next one.