
Fixing LLM Hallucinations with Retrieval Augmentation in LangChain #6


Chapters

0:00 Hallucination in LLMs
1:32 Types of LLM Knowledge
3:08 Data Preprocessing with LangChain
9:54 Creating Embeddings with OpenAI's Ada 002
13:14 Creating the Pinecone Vector Database
16:57 Indexing Data into Our Database
20:27 Querying with LangChain
23:07 Generative Question-Answering with LangChain
25:27 Adding Citations to Generated Answers
28:42 Summary of Retrieval Augmentation in LangChain

Transcript

Large language models have a little bit of an issue with data freshness, that is, the ability to use data that is actually up-to-date. Now, that is because the world, according to a large language model, is essentially frozen in time. The large language model understands the world as it was in its training data set.

And the training data set, it's huge, it contains a ton of information, but you're not going to retrain a large language model on new training data sets very often because it's expensive, it takes a ton of time, and it's just not very easy to do. So how do we handle that problem?

Well, for that, for keeping data up-to-date in a large language model, we can use retrieval augmentation. The idea behind this technique is that we retrieve relevant information from what we call a knowledge base, and we will actually pass that into our large language model, but not through training, but through the prompt that we're feeding into the model.

That makes this external knowledge base our window into the world, or into the specific subset of the world that we would like our large language model to have access to. In this video, that's what we're going to talk about. We're going to have a look at how we can implement a retrieval augmentation pipeline using LangChain.

So before we jump into it, it's probably best we understand that there are different types of knowledge that we can feed into a large language model. We're going to be talking about parametric knowledge and source knowledge. Now, the parametric knowledge is actually gained by the LLM during its training, okay?

So we have a big training process and within that training process, the LLM creates like an internal representation of the world according to that training data set. And that all gets stored within the parameters of the large language model. Okay, and it can store a ton of information because these models are super big.

But of course, this is pretty static. After training, the parametric knowledge is set and it doesn't change. And that's where source knowledge comes in. So when we are feeding a query or a prompt into our LLM, okay, so we have some prompt and it was a question, we feed that into the LLM and then it's going to return us like an answer based on that prompt.

But this input here, this is what we would call the source knowledge, okay? So source knowledge, and then up here, we have the parametric knowledge. Now, when we're talking about retrieval augmentation, naturally, what we're going to be doing is adding more knowledge via the source knowledge to the LLM.

We're not touching the parametric knowledge. Okay, so we're going to start with this notebook. There'll be a link to this somewhere in the top of the video right now. And the first thing we're going to do is actually build our knowledge base. So this is going to be the location where we store all of that source knowledge that we will be feeding or potentially feeding into our large language model at inference time.

So when we're making the predictions or generating text. So we're going to be using the Wikipedia dataset from here. So this is from Hugging Face datasets. And we'll just have a quick look at one of those examples. Okay, so if we go across, we have all this text here.
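As a rough sketch of that loading step (the exact dataset config and split below are assumptions rather than anything confirmed here):

```python
from datasets import load_dataset

# Pull a slice of a Wikipedia dump from Hugging Face datasets.
# The "20220301.simple" config and the 10k-record split are placeholder choices.
data = load_dataset("wikipedia", "20220301.simple", split="train[:10000]")

# Peek at a single record: it has 'id', 'url', 'title' and a long 'text' field.
print(data[6])
```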

This is what we're going to be putting into our knowledge base. And you can see that it's pretty long, right? Yeah, it goes on for quite a while. So there's a bit of an issue here. A large language model, and also the embedding model that we're going to be using, they have a limited amount of text that they can efficiently process.

And they also have like a ceiling where they can't process any more and will return an error. But more importantly, we have that sort of efficiency threshold. We don't want to be feeding too much text into an embedding model because usually the embeddings are of lesser quality when we do that.

And we also don't want to be feeding too much text into our completion model, so the model that's generating an answer, because the performance of that degrades too. So if you, for example, give it some instructions and you feed in a small amount of extra text after those instructions, there's a good chance it's gonna follow those instructions.

If we put in the instructions and then loads of text, there's actually an increased chance that the model will forget to follow those instructions. So both the embedding quality and the completion quality degrade the more text we feed into those models. So what we need to do here is actually cut down this long chunk of text into smaller chunks.

So to create these chunks, we first need a way of actually measuring the length of our text. Now, we can't just count the number of words or count the number of characters, because that's not how a language model counts the length of text. They count the length of text using something called a token.

Now, a token is typically a word or sub-word length chunk of a string. And it actually varies by language model, or rather by the tokenizer that the language model uses. Now, for us, we're going to be using the GPT 3.5 Turbo model, and the tokenizer encoding for that is actually this one here.

Okay, so maybe I can show you how we can check for that. So we import tiktoken. So tiktoken is just the tokenizer, or the family of tokenizers, that OpenAI uses for a lot of their large language models, all of the GPT models. So what we are going to do is we're going to say tiktoken dot encoding_for_model.

And then we just pass in the name of the model that we're going to be using. So GPT 3.5 Turbo. Okay, and actually the encoding that we should be using is this one. Okay, cool. So, lucky we checked. Let's run that. In reality, there is very little difference between this tokenizer and the p50k_base tokenizer that we saw before.

So in reality, the difference is pretty minor. But anyway, so we can take a look here and we see that the tokenizer split this into 26 tokens. And if we, let me take this and what we'll do is we'll just split it by spaces. And I just want to actually get the length of that list as well.

Right, so this is the number of words. Right, and I just want to show you that there's not a direct mapping between the number of tokens and the number of words and obviously not for the number of characters either. Okay, so the number of tokens is not exactly the number of words.
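As a minimal sketch of that check and the token-counting function (assuming the tiktoken package, with the sample string here just an illustrative placeholder):

```python
import tiktoken

# Find the encoding that matches gpt-3.5-turbo (this resolves to cl100k_base).
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")

def tiktoken_len(text: str) -> int:
    """Count the number of tokens in a piece of text, the same way the model does."""
    return len(tokenizer.encode(text, disallowed_special=()))

sample = "hello I am a chunk of text and using the tiktoken_len function we can find the length of this chunk of text in tokens"
print(tiktoken_len(sample))   # number of tokens
print(len(sample.split()))    # number of words, for comparison -- not the same number
```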

Cool, so we'll move on. And now that we have this function here, which is just counting the number of tokens within some text that we pass to it, we can actually initialize what we call a text splitter. Now, text splitter just allows us to take, you know, a long chunk of text like this and split it into chunks and we can specify that chunk size.

So we're going to say we don't want anything longer than 400 tokens. We're going to also add an overlap between chunks. So you imagine, right, we're going to split into 400, roughly 400 token length chunks. At the end of one of those chunks and the beginning of the next chunk, we might actually be splitting it in the middle of a sentence or in between sentences that are related to each other.

So that means that we might cut out some important information, like connecting information between two chunks. So what we do to somewhat avoid this is we add a chunk overlap, which says, okay, for chunk zero and chunk one, between them, there's actually an overlap of about 20 tokens that exist within both of those.

This just reduces the chance of us cutting out something, or a connection between two chunks, that is actually important information. Okay, and the length function. So this is what we created before up here. And then we also have separators. So for the separators, what we're using here is a recursive character text splitter.

It's going to say, try and split on double newline characters first. If you can't, split on newline character. If not, split on the space. If not, split on anything. Okay, that's all that is. And yeah, so we can run that and we'll get these smaller chunks. They're still pretty long, but as we can see here, they are now all under 400 tokens.
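A rough sketch of that splitter setup, assuming LangChain's RecursiveCharacterTextSplitter, the tiktoken_len function from above, and picking an arbitrary record from the dataset loaded earlier:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,            # no chunk longer than ~400 tokens
    chunk_overlap=20,          # ~20-token overlap so we don't cut related sentences apart
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""],  # try paragraph breaks first, then newlines, spaces, anything
)

chunks = text_splitter.split_text(data[6]["text"])
print([tiktoken_len(chunk) for chunk in chunks[:3]])  # each should come in under 400 tokens
```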

So that's pretty useful. Now, what we want to do is actually move on to creating the embeddings for this whole thing. So the embeddings or vector embeddings are a very key component of this whole retrieval thing that we're about to do. And essentially they will allow us to just retrieve relevant information that we can then pass to our large language model based on like a user's query.

So what we're going to do is take each of the chunks that we're going to be creating and embedding them into essentially what are just vectors, okay? But these vectors are not just normal vectors, they're actually, you can think of them as being numerical representations of the meaning behind whatever text is within that chunk.

And the reason that we can do that is because we're using a specially trained embedding model that essentially translates human-readable text into machine-readable embedding vectors. So once we have those embeddings, we then go and store those in our vector database, which we'll be talking about pretty soon.

And then when we have a user's query, we encode that using the same embedding model and then just compare those vectors within that vector space and find the items that are the most similar in terms of like their, basically their angular similarity, if that makes sense. Or you can, another alternative way that you could think of it is their distance within the vector space.

Although that's not exactly right, because it's actually the angular similarity between them, but it's pretty similar. So we're gonna come down to here and I'm just going to first add my OpenAI API key. And one thing I should note is, obviously, you would be paying for this.

And also, if you don't have your API key yet, it's at platform.openai.com, okay? And then what we're going to need to do is initialize this text-embedding-ada-002 model. So this is basically OpenAI's best embedding model at the time of recording this. So we'd go ahead and we would initialize that via LangChain using the OpenAIEmbeddings object, okay?

Then with that, we can just encode text like this. So we have this list of chunks of text and then we just do embed.embed_documents, so the embedding model's embed_documents method, and pass in a list of our text chunks there, okay? Then we can see the response we get from this, okay?

So what we're returning is we get two vector embeddings, and that's because we have two chunks of text here. And each one of those has this dimensionality of 1,536. This is just the embedding dimensionality of the text-embedding-ada-002 model. Each embedding model is going to vary. This exact number is not typical, but it's within the range of what would be a typical dimensionality for these embedding models.
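Roughly, that embedding step looks like the sketch below, assuming the API key is read from an OPENAI_API_KEY environment variable and the two example strings are just placeholders:

```python
import os
from langchain.embeddings.openai import OpenAIEmbeddings

# text-embedding-ada-002 produces 1536-dimensional embeddings.
embed = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    openai_api_key=os.environ["OPENAI_API_KEY"],
)

texts = [
    "this is the first chunk of text",
    "then another second chunk of text is here",
]
res = embed.embed_documents(texts)
print(len(res), len(res[0]))  # 2 vectors, each of dimensionality 1536
```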

Okay, cool. So with that, we can actually move on to the vector database part of things. So a vector database is a specific type of knowledge base that allows us to search using these embedding vectors that we've created and actually scale that to billions of records. So we could literally have, well, billions of these text chunks in there that we encode into vectors.

And we can search through those and actually return them very, very quickly. We're talking like, I think at a billion scale, maybe you're looking at a hundred milliseconds, maybe even less if you're optimizing it. Now, because it's a database that we're gonna be using, we can also manage our records.

So we can add, update, delete records, which is super important for that data freshness thing that I mentioned earlier. And we can even do some things like what we'd call metadata filtering. So if I use the example of internal company documents, let's say you have company documents that belong to engineering and company documents that belong to HR, you could use this metadata filtering to filter purely for HR documents or engineering documents.

So that's where you would start using that. You can also filter based on dates and all these other things as well. So let's take a look at how we would initialize that. So to create the vector database, we're gonna be using Pinecone for this. You do need a free API key from Pinecone.

There may be a wait list at the moment for that, but at least at the time of recording, I think that wait list has been processed pretty quickly. So hopefully we'll not be waiting too long. So I'm going to just, oh, first I'll get my API key and I'll get my environment.

So I've gone to app.pinecone.io. You'll end up in your default project by default, and then you go to API keys. You click copy, right? And also just note your environment here. Okay, so I have us-west1-gcp. So I'm gonna remember that and I'll type that in. Okay, so let me run this, enter my API key, and now I'm gonna enter my environment, which is us-west1-gcp.

Okay, so I'm just getting an error because I've already created the index here. So let me, what I can do is just add another line here. I don't want to delete the index, so: if the index name is not in the index list, we're going to create it.

Otherwise I don't need to create it 'cause it's already there. So I'm not going to create it 'cause I don't need to. But of course, when you run this, if this is your first time running this notebook, it will create that index. Then after that, we need to connect to the index.

We're using this gRPC index, which is just an alternative to using index. gRPC is just a little more reliable, can be faster. So I like to use that one instead, but you can use either. Honestly, it doesn't make a huge difference. Okay, and again, if you're running this for the first time, this is going to say zero, okay?
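A minimal sketch of those setup cells, assuming the pinecone-client API from that era (pinecone.init and GRPCIndex), and with the API key, environment, and index name all placeholders:

```python
import pinecone

# Connect to Pinecone -- the key and environment come from the Pinecone console.
pinecone.init(
    api_key="YOUR_PINECONE_API_KEY",  # placeholder
    environment="us-west1-gcp",       # use whatever environment your project shows
)

index_name = "langchain-retrieval-augmentation"  # placeholder name

# Only create the index if it doesn't already exist.
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        name=index_name,
        dimension=1536,   # must match text-embedding-ada-002's embedding dimensionality
        metric="cosine",
    )

# Connect via the gRPC index; a plain pinecone.Index(index_name) works too.
index = pinecone.GRPCIndex(index_name)
print(index.describe_index_stats())  # shows a zero vector count on a fresh index
```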

Because it will be an empty index. So me, obviously there's already vectors in there. So yeah, they're already there. And then what we would do is we'd start populating the index. I'm not going to run this again because I've already run it, but let me take you through what is actually happening.

So we first set this batch limit. So this is saying, okay, I don't want to upsert or add any more than 100 records at any one time. Now that's important for two reasons more than anything else. First, the API request to OpenAI: you can only send and receive so much data.

And then the API request to Pinecone, for the exact same reason: you can only send so much data. So we limit that so we don't go beyond where we would likely hit a data limit. And then we initialize this texts list and also this metadatas list. And then we're going to go through, we're going to create our metadata.

We're going to get our text and we're using the split text method there. And then we just create our metadata. So the metadata is just the metadata we created up here, plus the chunk that we, so this is like the chunk number. So imagine for each record, like the Alan Turing example earlier on, we had three chunks from that single record.

So in that case, we would have chunk zero, chunk one, chunk two, and then we would have the corresponding text for each one of those chunks. And then the metadata is actually on the article level. So that wouldn't vary for each chunk, okay? So it's just the chunk number and the text that will actually vary there.

Okay, we append those to our current batches, which is up here. And then we say, once we reach our batch limit, then we would add everything, okay? So that's what we're doing there. And then actually, so here, so we might actually get to the end here and we'll probably have a few left over.

So we should also catch those as well. So we would say, if the length of texts is greater than zero, then we would do this. Okay, so that's just to catch those final items. Let's say we have like three items at the end there; with the initial code, they would have been missed.

Okay, and we don't want to miss anything. So yeah, we create our IDs. We're using UUID4 for that. And then we create our embeddings with embed.embed_documents. This is just what we did before. We then add everything to our Pinecone index. So basically the way that we do that is we'll create a list or an iterable object that contains tuples of IDs, embeddings, and metadatas.

And yeah, that's it. Okay, so after that, we will have indexed everything. Of course, I already had everything in there, so this number doesn't change for me, but for you, it should say something like 27.4-ish thousand. And yeah, so that is our indexing process. So we just added everything to our knowledge base, or added all the source knowledge to our knowledge base.
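Putting that loop together as a sketch (the metadata field names here are just illustrative, and it assumes the data, text_splitter, embed, and index objects from the earlier steps):

```python
from uuid import uuid4
from tqdm.auto import tqdm

batch_limit = 100  # stay well under the OpenAI and Pinecone request size limits
texts = []
metadatas = []

for record in tqdm(data):
    # Article-level metadata, shared by every chunk from this record.
    metadata = {
        "wiki-id": str(record["id"]),
        "source": record["url"],
        "title": record["title"],
    }
    # Chunk the article, then attach the chunk number and raw text to each chunk.
    record_texts = text_splitter.split_text(record["text"])
    record_metadatas = [{"chunk": j, "text": t, **metadata} for j, t in enumerate(record_texts)]
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # Once we reach the batch limit, embed and upsert the batch, then reset.
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=list(zip(ids, embeds, metadatas)))
        texts = []
        metadatas = []

# Catch any leftover items that never filled a full batch.
if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=list(zip(ids, embeds, metadatas)))
```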

And then what we want to do is actually back in LangChain, we're going to initialize a new Pinecone instance. So the Pinecone instance that we just created was not within LangChain. The reason that I did that is because creating the index and populating it in LangChain is a fair bit slower than just doing it directly with the Pinecone client.

So I tend to avoid doing that. Maybe at some point in the future, that'll be optimized a little better than it is now, but for now, yeah, it isn't. So I avoid doing that part within LangChain. But we are going to be using LangChain for the next part, so for the querying and for the retrieval augmentation with a large language model, LangChain makes this much easier, okay?

So I'm going to reinitialize Pinecone, but in LangChain. Now, this might change, but the gRPC index wasn't recognized by LangChain last time I tried. So we just use a normal index here. And yeah, we just initialize our vector store, okay? So this is a vector database connection, essentially the same as what we had up here, the index, okay?

And the only extra thing we need to do here is we need to tell LangChain where the text within our metadata is stored. So we're saying the text field is text, and we can see that because we created it here, okay? Cool. So we run that.
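That re-initialization in LangChain is roughly the following, assuming the langchain.vectorstores.Pinecone wrapper from that era:

```python
from langchain.vectorstores import Pinecone

text_field = "text"  # the metadata field where the raw chunk text was stored

# The LangChain wrapper takes a standard (non-gRPC) Pinecone index.
index = pinecone.Index(index_name)

vectorstore = Pinecone(index, embed.embed_query, text_field)
```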

And then what we can do is we do a similarity search across that vector store, okay? So we pass in our query. We're going to say who was Benito Mussolini, okay? And we're going to return the top three most relevant docs to that query. And we see, okay, page content, Benito Mussolini, so on and so on, Italian politician and journalist, prime minister of Italy, so on and so on, leader of the National Fascist Party.
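The query itself is then just a similarity search against that vector store, something like:

```python
query = "who was Benito Mussolini?"

# Return the top 3 most relevant chunks for the query.
docs = vectorstore.similarity_search(query, k=3)
for doc in docs:
    print(doc.page_content[:200], "\n---")
```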

Okay, obviously relevant. And then this one, again, you know, it's clearly, I think clearly relevant and obviously relevant again. So we're getting three relevant documents there, okay? Now, what can we do with that? It's a lot of information, right? If we scroll across, that's a ton of text, and we don't really, or at least I don't want to feed all that information to our users.

So what we want to do is actually come down to here, and we're going to layer a large language model on top of what we just did. So that retrieval thing we just did, we're actually going to add a large language model onto the end of that. And it's essentially going to take the query, it's going to take these contexts, these documents that we returned, and it is going to, we're going to put them back together into the prompt, and then ask the large language model to answer the query based on those returned documents or contexts.

Okay, and we would call this generative question answering, and I mean, let's just see how it works, right? So we're going to initialize our LLM, we're using the GPT 3.5 turbo model. Temperature we set to zero, so we basically decrease the randomness in the model generation as much as possible.

That's important when we're trying to do like factual question answering, because we don't really want the model to make anything up. It doesn't protect us 100% from it making things up, but it will limit it a bit more than if we set a high temperature. And then we actually use this retrieval QA chain.

So the retrieval QA chain is just going to wrap everything up into a single function, okay? So it's going to take a query, it's going to send it to our vector database, retrieve everything, and then pass the query and the retrieved documents into the large language model and get it to answer the question for us.
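A sketch of that chain setup, assuming ChatOpenAI and RetrievalQA from the LangChain version used here, with the API key again coming from the environment and reusing the query from above:

```python
import os
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# Temperature 0 keeps the generation as deterministic as possible.
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0.0,
    openai_api_key=os.environ["OPENAI_API_KEY"],
)

# "stuff" simply stuffs the retrieved chunks into the prompt alongside the query.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
)

print(qa.run(query))
```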

Okay, I should run this. And then I run this. And this can take a little bit of time. That's partly my bad internet connection, and also just the slowness of interacting with OpenAI at the moment. So we get Benito Mussolini, who was an Italian politician and journalist who served as prime minister of Italy.

He was leader of the National Fascist Party and invented the ideology of fascism. He was a dictator of Italy by the end of 1927, and his form of fascism, Italian fascism, so on and so on, right? There's a ton of text in there, okay? And I mean, it looks pretty accurate, right?

But you know, large language models, they're very good at saying things that are completely wrong in a very convincing way. And that's actually one of the biggest problems with these. Like you don't necessarily know that what it's telling you is true. And of course, for people that use these things a lot, they are pretty aware of this.

And they're probably going to cross-check things. But you know, even for me, I use these all the time. Sometimes a large language model will say something and I'm kind of unsure, like, oh, is that true? Is it not? I don't know. And then I have to check. And it turns out that it's just completely false.

So that is problematic, especially when you start deploying this to users that are not necessarily using these sort of models all the time. So there's not a 100% full solution for that problem, for the issue of hallucinations. But we can do things to limit it. On one end, we can use prompt engineering to reduce the likelihood of the model making things up.

We can set the temperature to zero to reduce the likelihood of the model making things up. Another thing we can do, which is not really, you know, modifying the model at all, but it's actually just giving the user citations so they can actually check where this information is coming from.

So to do that in LangChain, it's actually really easy. We just use a slightly different version of the RetrievalQA chain called the RetrievalQAWithSourcesChain, okay? And then we use it in pretty much the same way. So we're just gonna pass the same query about Benito Mussolini, you can see here, actually.
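And the sources variant is almost identical, something like:

```python
from langchain.chains import RetrievalQAWithSourcesChain

# Same setup as before, but this chain also returns which sources were used.
qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
)

result = qa_with_sources(query)
print(result["answer"])
print(result["sources"])  # e.g. the Wikipedia URLs stored in the 'source' metadata field
```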

And we're just gonna run that. Okay, so let's wait a moment. Okay, and yeah, I mean, we're getting the, pretty much the same, in fact, I think it's the same answer. And what we can see is we actually get the sources of this information as well. So we can actually, I think we can click on this.

And yeah, it's gonna take us through and we can see, ah, okay, so this looks like a pretty good source. Maybe this is a bit more trustworthy, and we can also just use this as essentially a check. We can go through what we're reading and if something seems a bit weird, we can check on here to actually see whether it's true, like it's actually there, or it's not.

So yeah, that can be really useful. Simply just adding the source of our information, it can make a big difference. And really, I think help users trust the system that we're building, and even just as developers and also people like managers wanting to integrate these systems into their operations, having those sources can, I think, make a big difference in trustworthiness.

So we've learned how to ground our large language models using source knowledge. So source knowledge, again, is the knowledge that we're feeding into the large language model via the input prompt. And naturally by doing this, we're kind of, you know, we're encouraging accuracy in our large language model outputs and just reducing the likelihood of hallucinations or inaccurate information in there.

Now, as well, we can obviously keep information super up to date with this approach. And as we saw at the end there with sources, we can actually cite everything, which can be super helpful in trusting the output of these models. Now, we're already seeing large language models being used with external knowledge bases in a lot of really big products like Bing AI and Google's Bard, and we see ChatGPT plugins are, you know, starting to use this sort of thing as well.

So I think for the future of large language models, these knowledge bases are going to be incredibly important. They are essentially an efficient form of memory for these models that we can update and manage, which we just can't do if we just rely on parametric knowledge. So yeah, I really think that this kind of long-term memory for large language models is super important.

It's here to stay, and it's definitely worth looking at whatever you're building at the moment and thinking, okay, does it make sense to integrate something like this? Will it help, right? But for now, that's it for this video. So I hope this has been useful and interesting. Thank you very much for watching and I will see you again in the next one.

Bye.