
Generative Question-Answering with OpenAI's GPT-3.5 and Davinci


Chapters

0:00 Generative Question Answering with OpenAI
0:56 Example App for Generative QA
5:02 OpenAI Pinecone Stack Architecture
7:18 Dealing with LLM Hallucination
9:28 Indexing all data
13:47 Querying
16:37 Generation of answers
21:25 Testing some generative question answers
22:55 Final notes

Transcript

Today we're gonna take a look at how to build a generative question answering app using OpenAI and Pinecone. So what I mean by generative question answering is: imagine you go down to a grocery store, you go to one of the attendants, and you ask them, where is this particular item?

Or maybe you're not entirely sure what the item is and you say, okay, where are those things that taste kind of like cherries but look like strawberries? And hopefully the attendant will be able to say, ah, you mean cherry strawberries. And they will say, okay, you just need to go down to aisle two and they'll be on the left or they'll take you there.

That is basically what generative question answering is. You're asking something, a question in natural language, and that something is generating natural language back to you that answers your question. Now, I think one of the best ways to explain this idea of generative question answering is to show you. So we don't have that much data behind this, there's a little bit over 6,000 examples from the Hugging Face, PyTorch, TensorFlow and Streamlit forums, but it is enough for us to get some interesting answers from this.

So what I'm gonna do here is say that I want to get a paragraph answering a question. And I'm gonna say, what is the difference, or what are the differences, between TensorFlow and PyTorch? And I'm going to ask it this question. We can limit the amount of data that it's going to pull from by adjusting top k; at the moment, it's going to return five items of information.

But I think that should be enough for this. So let's go and get this. So they are two of the most popular deep learning frameworks: PyTorch is a Python-based library developed by Facebook, while TensorFlow is a library developed by Google. Both frameworks are open source and have large communities. The main difference is that PyTorch is more intuitive and easy to use, while TensorFlow is more powerful and has more features.

PyTorch is better suited for rapid programming and research, while TensorFlow is better for production and deployment. PyTorch is more Pythonic, whereas TensorFlow is more convenient in industry for prototyping, deployment and scalability. And then it even encourages us to learn both frameworks to get the most out of our deep learning projects.

If we have a look at where this information is coming from, it's mostly TensorFlow. So maybe there's some bias there, and we can actually click on these and it will take us to those original discussions. So we can actually see where this model is getting this information from. So we're not just relying on this actually being the case.

So another thing we can ask here, something that is perhaps more factual rather than opinionated, is: how do I use gradient tape in TensorFlow? And let's see what it says. Again, we get this. So it's a powerful tool in TensorFlow that allows us to calculate the gradients of a computation with respect to its input variables.

And then to use gradient tape, you first need to create a tape object and then record the forward pass of the model. So after the forward pass is recorded, you can then calculate gradients with respect to those variables. This can be done by calling the tape.gradient method. And then we continue with some more information, like how it can be used to calculate higher-order derivatives, and so on.
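Just to ground what that answer is describing, here is a minimal, standalone GradientTape example; this is a standard TensorFlow pattern rather than code from the video:

import tensorflow as tf

x = tf.Variable(3.0)

# Record the forward pass on the tape
with tf.GradientTape() as tape:
    y = x ** 2

# Calculate the gradient of y with respect to x (dy/dx = 2x = 6.0)
dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())  # 6.0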

So that's pretty cool. But let's have a look at how we would actually build this. Now, when we're building generative question answering systems with AI, naturally we just replace the person with some sort of AI pipeline. Now, our generative question answering pipeline using AI is going to contain two primary items, a knowledge base and the AI models themselves.

So the knowledge base, we can think of it as a model's long-term memory. Okay, so in the shop assistant example, the shop assistant has been working in that store for a while. They have a long-term memory of roughly where different items are, maybe they're actually very good and know exactly where different items are.

And they just have this large knowledge base of everything in that store. And when we ask that question, we are producing our natural language prompt or query to that person. Something in their brain is translating that language into some sort of concept or idea. And they're looking at their knowledge base based on their past experience, attaching that query to some nearby points within their knowledge base, and then returning that information back to us by generating some natural language that explains what we are looking for.

So if we take a look at how this will look for our AI pipeline, we are going to have an embedding model here, and we have our question here; we feed that into the embedding model. We're going to be using the text-embedding-ada-002 model, which is a GPT-3.5 embedding model.

That's going to translate that query into some sort of numerical representation that essentially represents the semantic meaning behind our query. So we get a vector in vector space. Now we then pass that to the knowledge base. The knowledge base is going to be the Pinecone vector database. And from there, within that Pinecone vector database, just like the shop assistant, we're going to have essentially what we can think of as past experiences, or just information that has already been indexed.

It's already been encoded by the GPT-3.5 embedding model and stored within Pinecone. So we essentially have a ton of indexed items like this. And then we introduce this query vector into here. So maybe it comes here, and we're going to say, just return, let's say in this example we have a few more points around here, just return the top three items.

So in this case, we're going to return this, this, and this, the top three most similar items. And then we pass all of those to another model. So this is going to be another OpenAI model, a Davinci text generation model. And these data points, right here, we're going to translate them back into text.

So there'll be like natural language texts that contain relevant information to our particular query. We are going to take our query as well. We're going to bring that over here. We're going to feed that in there. And we're going to format them in a certain way. And we'll dive into that a little more later on.

Davinci will be able to complete our prompt. So we're going to ask a question. We have our original query, and we're going to say, answer that based on the following information. And then that's where we pass in the information we've returned from Pinecone. And it will be able to answer our questions with incredible accuracy.

Now, one question we might have here is why not just generate the answer from the query directly? And that can work in some cases. And it will actually work in a lot of general knowledge use cases. Like if we ask something like, who was the first man on the moon?

It's probably going to say Neil Armstrong without struggling too much. That's a very obvious fact. But if we ask it more specific questions, then it can struggle, or we can suffer from something called hallucination. Now, hallucination is where a text generation model is basically producing text that seems like it's true, but is actually not true.

And that is just a natural consequence of what these models do. They generate human language. They do it very well, but they aren't necessarily tied to being truthful. They can just make something up. And that can be quite problematic. Imagine, in the medical domain, a model that begins generating text about a patient's diagnosis which seems very plausible, using scientific terminology and statistics and so on.

But in reality, all of that may be completely false. It's just making it up. So by adding in this component of a knowledge base and forcing the model to pull information from the knowledge base and then answer the question based on the information we pulled from that knowledge base, we are forcing the model to be almost truthful and base its answers on actual facts, as long as our knowledge base actually does contain facts.

And at the same time, we can also use it for domain adaptation. So whereas a typical language generation model may not have a very good understanding of a particular domain, we can actually help it out by giving it this knowledge from a particular domain and then asking it to answer a question based on that knowledge that we've extracted.

So there are a few good reasons for doing this. Now to do this, we are going to essentially do three things. We first need to index all of our data. So we're going to take all of our text, we're going to feed it into the embedding model, and that will create our vectors.

Let's just say they're here and we pass them into Pinecone. And this is the indexing stage. Now to perform this indexing stage, we're going to come over to this repo here, the pinecone-io examples repo. And at the moment, these notebooks are stored here, but I will leave a link on this video, which I will keep updated, to these notebooks, just in case they do change in the future.

We come to here and this is our notebook. Now we can also open it in Colab. So I'm going to do that over here. Again, I'm going to make sure there's a link in this video right now so that you can click that link and just go straight to this.

So the first thing we need to do is install our prerequisites: openai, pinecone-client, and datasets. Now I have a few videos that go through basically this process. So if you'd like to understand this in more detail, I will link to one of those videos. But right now, I'm just going to go through it pretty quickly.
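For reference, assuming those are the openai, pinecone-client, and datasets packages, the install cell in Colab would look something like this:

!pip install -qU openai pinecone-client datasets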

So first is data prep. We do this all the time. Of course, we're going to be loading this dataset here, ML Q&A, which is just a set of question-answering data based on ML questions across a few different forums, like Hugging Face, PyTorch, and a few others. I'm not going to go through this, but all we're doing essentially is removing excessively long items from there.

Now, this bit is probably more important. So when we're creating these embeddings, what we're going to do is include quite a lot of information. So we're going to include the thread title or topic, the question that was asked and the answer that we've been given. So we do that here.

And you can kind of see the thread title is this, then you have the question that was asked, and then a little further on you have the answer, which is here. So we're going to throw all of that into our ada-002 embedding model, and it can embed a ton of text.
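As a rough sketch of that step, and only a sketch, the combined text per record might be built like this; the field names here (thread, question, answer) are assumptions, not necessarily the ones in the actual dataset:

# Hypothetical record layout; the real dataset's field names may differ.
records = [
    {
        "thread": "PyTorch vs TensorFlow",
        "question": "Which framework should I learn first?",
        "answer": "Both are fine; PyTorch is more Pythonic, TensorFlow is common in production.",
    },
]

def build_text(record: dict) -> str:
    # Combine thread title, question, and answer so the embedding captures
    # the full context of the forum discussion.
    return (
        f"Thread: {record['thread']}\n\n"
        f"Question: {record['question']}\n\n"
        f"Answer: {record['answer']}"
    )

texts = [build_text(r) for r in records]
print(texts[0])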

I think it's up to around 10 pages of text. So this is actually not that much for this model. So to actually embed these things, we need an OpenAI API key. I will show you how to get one, just in case you don't know. So we go to beta.openai.com/signup, and you create an account if you don't already have one.

I do, so I'm going to log in. And then you need to go up to your profile on the top right, click View API Keys. And you may need to create a new secret key if you don't have any of these saved elsewhere. Now, once you do have those, you just paste it into here and then continue running.

If you run this, you should basically get a list of these if you have authenticated. Yeah, let's go through these a bit quicker. So we have the embedding model here, ada-002. We create the embeddings here, and so on. Here, we're creating our Pinecone index. So we're going to call it openai-ml-qa.
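As a minimal sketch of the embedding call, using the older openai Python client (pre-1.0) that was current when this was recorded, it looks something like this; the input strings are just placeholders:

import openai

openai.api_key = "OPENAI_API_KEY"  # paste your own key here

embed_model = "text-embedding-ada-002"

# Embed a small batch of texts; each input string comes back as a
# 1536-dimensional vector.
res = openai.Embedding.create(
    input=["Some forum text to embed", "Another forum text"],
    engine=embed_model,
)
embeds = [record["embedding"] for record in res["data"]]
print(len(embeds), len(embeds[0]))  # 2 1536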

We need to get an API key again for that; we find that at pinecone.io. And then you just go into the left here after signing up and copy your API key. And then, of course, put it in here. We create the index and connect to the index.
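And a sketch of the index creation step, written against the pinecone-client version from around the same time; the environment name and the cosine metric are assumptions:

import pinecone

pinecone.init(
    api_key="PINECONE_API_KEY",   # from the console at pinecone.io
    environment="us-east1-gcp",   # your environment, shown next to the key
)

index_name = "openai-ml-qa"

# text-embedding-ada-002 produces 1536-dimensional vectors
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536, metric="cosine")

index = pinecone.Index(index_name)
print(index.describe_index_stats())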

Okay, so that's initializing everything. After that, we just need to index everything. So we're going through our text here in batches of 128, encoding everything in those batches, and getting the relevant metadata, which is going to be the plain text associated with each of what will become our vector embeddings.

And we also have the embeddings. So we just clean them up from the response format here and pull those together. So we have the IDs, which are just unique IDs (actually just a count in this case), the embeddings themselves, and the metadata. Then we add all of those to Pinecone. Okay, pretty straightforward.
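Continuing the sketches above (so texts, embed_model, openai, and index are already defined), the indexing loop looks roughly like this:

from tqdm.auto import tqdm

batch_size = 128  # embed and upsert 128 records at a time

for i in tqdm(range(0, len(texts), batch_size)):
    batch = texts[i:i + batch_size]
    # create the embeddings for this batch
    res = openai.Embedding.create(input=batch, engine=embed_model)
    embeds = [record["embedding"] for record in res["data"]]
    # the running count doubles as a unique ID; keep the plain text as metadata
    ids = [str(n) for n in range(i, i + len(batch))]
    metadata = [{"text": text} for text in batch]
    # upsert (id, vector, metadata) tuples into Pinecone
    index.upsert(vectors=list(zip(ids, embeds, metadata)))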

And then we have these 6,000 vectors in Pinecone that we can then use as our knowledge base. Now, usually you probably want way more items than this, but this is, I think, good enough for this example. Now, after everything is indexed, we move on to querying. So this is the next step, not the final step.

So we have a user query, and what we're going to do is going to look very similar to what we already did. We're going to pass that into the text-embedding-ada-002 model. So this is the GPT-3.5 embedding model. That gives us our query vector, which we'll call xq.

And then we pass that to Pinecone. Now, Pinecone has metadata attached to each of our vectors. So let's say we return three of these vectors; we also have the metadata attached to them. Okay, and we return that to ourselves. So this is the querying step. But then later on, we're also going to pass that on to the generation model, which produces a more intelligent answer using all of the contexts that we've returned.

So to begin making queries, we're going to be using this making queries notebook here. And again, I'm going to open it up in Colab. Now I'm actually going to run through this one because we're going to be moving on to the generation stuff, which I haven't covered much before.

And I want to go through it in a little more detail. So I'm going to run this, which is just the prerequisites. And then we'll move on to initializing everything. So we have the Pinecone vector database, the embedding and the generation models. First, we start with Pinecone. I'm going to run it again because I need to connect to the index.

So this is saved in a variable called pinecone_key. And here, we're just connecting to the index and we're going to describe the index stats. So because we already populated it, we already did the indexing, we should see that there are already values, or vectors, in there. And we can see that here.

So we have the 6,000 in there. And here, we're going to need to initialize the OpenAI models. For that, we need an OpenAI key. And again, I've saved that in a variable called openai_key. Okay. Great. So the embedding model, let's see how we actually query with that. So we initialize the name of it there.

So text-embedding-ada-002. This needs to be the same as the model that we used during indexing. And I'm going to ask, like we saw earlier, what are the differences between PyTorch and TensorFlow? Let's ask that question. So here, we're creating our query vector. Here, we're just extracting it from the response.

And then we use that query vector to query Pinecone. So index.query, I'm going to return top three items. Let's see what it returns. And we can see here, we have this general discussion. I think this post might help you. Come down, it's PyTorch versus TensorFlow. So yeah, probably pretty relevant.
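As a sketch of that query step, assuming the same older openai and pinecone-client interfaces as the indexing sketches:

query = "What are the differences between PyTorch and TensorFlow?"

# Embed the query with the same model used at indexing time
res = openai.Embedding.create(input=[query], engine="text-embedding-ada-002")
xq = res["data"][0]["embedding"]

# Retrieve the three most similar records, including their plain-text metadata
res = index.query(xq, top_k=3, include_metadata=True)
for match in res["matches"]:
    print(round(match["score"], 3), match["metadata"]["text"][:80])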

And we have three of those. So we come down a little more. We have another one here; this one is about TensorFlow.js. And so on. So within that, we should have a few relevant contexts that we can then feed into the generation model as basically sources of truth for whatever it is saying.

Now, the generation stage is actually very simple. We have our contexts down here. We are feeding them up into here. We're taking our query, so we're going to have our query question, and then we're going to have these contexts appended onto the end of it.

Now, actually, before this, we're also going to have a little statement, which is basically an instruction telling the generation model what to do. And the instruction is going to be something like, answer the following question given the following context. And actually, the question is going to be at the end of this here.

Okay, so we're going to have the instruction followed by the context. So the ground truth of information that we returned from our vector database, and then we're going to have the question. Now let's see what that actually looks like. So come down to here, we have a limit on how much we're going to be feeding into the generation model.

We're just setting that there. We have the contexts, so this is what we returned from Pinecone up here, this text that you can just about see here. It's within the metadata; it's the context from each of the matches that we got there. And then you can see the prompt that we're building.

Okay, so we're going to say, answer the question based on the context below. So then we have context. That's where we feed in our data that we retrieved. Now that data retrieved is obviously a few items. So we join all those with basically these separators here. And then after we have put that data in there, we have the question, that's our query.

And then, because this generation model is basically just taking the text that we have so far and continuing it, to keep with the format of context, question, answer, we end the prompt with 'Answer:'. And then GPT is just going to complete this prompt. It's going to continue; it's going to generate text.
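Putting that together, a sketch of the prompt construction might look like this, reusing query and res from the query sketch above; the '---' separator and the exact instruction wording are approximations:

# Pull the plain-text contexts out of the Pinecone query response
contexts = [match["metadata"]["text"] for match in res["matches"]]

prompt = (
    "Answer the question based on the context below.\n\n"
    "Context:\n"
    + "\n\n---\n\n".join(contexts)
    + f"\n\nQuestion: {query}\nAnswer:"
)
print(prompt)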

Now this is going to be basically the final structure, but this is not exactly how we built it. We actually need to remove this bit here because we're going to, in the middle here, we're going to use a bit of logic so that we're not putting too many of these contexts in here.

So here, what we're going to do is go through all of our contexts one at a time. So essentially we're going to start with context one, then we're going to try context two, then context three, one at a time. So we start with just the first context, then in the second iteration we use the first two contexts, then the first three contexts, and so on, until we have all of those contexts in there.

The exception here is if the context that we've gone up to exceeds the limit that we've set here. At that point, we would say, okay, the prompt is that number minus one. So let's say we get to the third context here, but it's too big. Then we just take the first two contexts, right?

And then actually after that, we should have a break, and that produces our prompt. Otherwise, if we get through all of the contexts, we just join them all together like this. Okay, at that point, we move on to the completion part, so the text generation. Now for text generation, we use OpenAI's Completion create endpoint.
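A sketch of that truncation logic, continuing from the same variables; the limit value here is illustrative rather than the one used in the notebook:

limit = 3750  # rough size budget for the prompt; an assumed value

prompt_start = "Answer the question based on the context below.\n\nContext:\n"
prompt_end = f"\n\nQuestion: {query}\nAnswer:"

# Add contexts one at a time until the next one would exceed the limit
for i in range(1, len(contexts) + 1):
    if len("\n\n---\n\n".join(contexts[:i])) >= limit:
        # too long: fall back to the previous i-1 contexts and stop
        prompt = prompt_start + "\n\n---\n\n".join(contexts[:i - 1]) + prompt_end
        break
    elif i == len(contexts):
        # all of the contexts fit, so use every one of them
        prompt = prompt_start + "\n\n---\n\n".join(contexts) + prompt_end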

We're using the text-davinci-003 model. We're passing our prompt and the temperature, so the randomness in the generation. We set it to zero because we want it to be pretty accurate; if we want more interesting answers, then we can increase the temperature. We have the maximum tokens there, so how many tokens the model should generate.

And then a few other generation variables. You can read up on these in the OpenAI docs. Okay, one thing here, if you did want to stop on a particular character, like maybe you wanted to stop on a full stop, you can do so by specifying that there. So we run this and now we can check what has been generated.
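And the completion call itself, again with the older openai client; the max_tokens value and the other generation parameters shown are illustrative defaults, not necessarily the notebook's:

res = openai.Completion.create(
    model="text-davinci-003",  # the Davinci generation model
    prompt=prompt,
    temperature=0,             # low randomness for more factual answers
    max_tokens=400,            # cap on how many tokens the model may generate
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None,                 # e.g. stop="." to stop at a full stop
)
answer = res["choices"][0]["text"].strip()
print(answer)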

So we have here: we go response, choices, zero, text, and then we just strip that text, in case there's a space on either end. And it says, okay, if you ask me for my personal opinion, I find TensorFlow more convenient in industry; prototyping, deployment, and scalability are easier; and PyTorch more handy in research, more Pythonic and easier to implement complex stuff.

So we see it's pretty aligned to the answer we got earlier on. Now let's try another one. So again, what are the differences between PyTorch and TensorFlow? And what we're going to do is actually modify the prompt. So I'm going to say, give an exhaustive summary and answer based on the question using the context below.

So actually give an exhaustive summary and answer based on the question. So we have a slightly different prompt. Let's see what we get. So before it was a pretty short answer. This, I'm saying exhaustive summary and then answer. So it should be longer, I would expect. So let's come down here and yeah, we can see it's longer.

We see PyTorch and TensorFlow are two of the most popular major machine learning libraries. TensorFlow is maintained and released by Google, while PyTorch is maintained and released by Facebook. TensorFlow is more convenient in industry for prototyping, deployment, and scalability. So we're still getting the same information in there, but it's being generated in a different way.

And it also even says PyTorch is more handy in research, it's more Pythonic, and so on and so on. It also includes this here, TensorFlow.js has several unique advantages over Python equivalent as it can run on the client side too. So I suppose it's saying TensorFlow.js versus TensorFlow in Python, which is not directly related to PyTorch.

But as far as I know, there isn't a JavaScript version of PyTorch. So in reality, the fact that you do have TensorFlow.js is in itself an advantage. So I can see why that's been pulled in, but maybe it's not explained very well in this generation. Okay, so that's it for this example on generative question answering using Pinecone and OpenAI's embedding models and also generation models.

Now, as you can see, this is incredibly easy to put together and incredibly powerful. Like we can build some insane things with this and it can go far beyond what I've shown here with a few different prompts, like asking for a bullet point list that explains what steps you need to take in order to do something is one really cool example that I like.

And this is something that we will definitely explore more in the future, but for now, that's it for this video. I hope all this has been interesting and useful. So thank you very much for watching and I will see you again in the next one. Bye.