Fixing LLM Hallucinations with Retrieval Augmentation in LangChain #6
Chapters
0:00 Hallucination in LLMs
1:32 Types of LLM Knowledge
3:08 Data Preprocessing with LangChain
9:54 Creating Embeddings with OpenAI's Ada 002
13:14 Creating the Pinecone Vector Database
16:57 Indexing Data into Our Database
20:27 Querying with LangChain
23:07 Generative Question-Answering with LangChain
25:27 Adding Citations to Generated Answers
28:42 Summary of Retrieval Augmentation in LangChain
00:00:00.000 |
Large language models have a little bit of an issue 00:00:05.320 |
That is the ability to use data that is actually up-to-date. 00:00:19.480 |
The large language model understands the world 00:00:29.280 |
but you're not going to retrain a large language model 00:00:36.160 |
because it's expensive, it takes a ton of time, 00:01:06.800 |
but through the prompt that we're feeding into the model. 00:01:23.080 |
In this video, that's what we're going to talk about. 00:01:25.240 |
We're going to have a look at how we can implement 00:01:27.920 |
a retrieval augmentation pipeline using LangChain. 00:01:39.560 |
that we can feed into a large language model. 00:01:41.560 |
We're going to be talking about parametric knowledge 00:02:00.240 |
the LLM creates like an internal representation 00:02:04.160 |
of the world according to that training data set. 00:02:06.920 |
And that all gets stored within the parameters 00:02:20.720 |
After training, the parametric knowledge is set 00:02:27.480 |
So when we are feeding a query or a prompt into our LLM, 00:02:32.480 |
okay, so we have some prompt and it was a question, 00:02:44.920 |
this is what we would call the source knowledge, okay? 00:02:49.800 |
and then up here, we have the parametric knowledge. 00:02:55.480 |
Now, when we're talking about retrieval augmentation, 00:03:00.480 |
what we're doing is adding more knowledge via the source knowledge to the LLM. 00:03:08.400 |
Okay, so we're going to start with this notebook. 00:03:25.360 |
or potentially feeding into our large language model 00:03:30.160 |
So when we're making the predictions or generating text. 00:03:33.760 |
So we're going to be using the Wikipedia dataset from here. 00:03:41.640 |
And we'll just have a quick look at one of those examples. 00:03:47.240 |
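For reference, the data-loading step might look roughly like this. This is just a sketch assuming the Hugging Face datasets library and a Simple English Wikipedia dump; the exact dataset name and split are assumptions rather than something confirmed in the transcript.

```python
from datasets import load_dataset

# Assumed dataset and split: a Simple English Wikipedia dump from Hugging Face.
data = load_dataset("wikipedia", "20220301.simple", split="train[:10000]")

# Peek at a single record: it has an id, url, title, and the full article text.
print(data[6]["title"])
print(data[6]["text"][:500])
```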
Okay, so if we go across, we have all this text here. 00:03:53.040 |
And you can see that it's pretty long, right? 00:04:02.600 |
A large language model and also the encoding model have a limit on how much text they can handle, 00:04:15.640 |
a point where they can't process any more and will return an error. 00:04:28.360 |
because usually the embeddings are of lesser quality 00:04:32.360 |
And we also don't want to be feeding too much text 00:04:36.120 |
So this is a model that's generating an answer 00:04:42.080 |
So if you, for example, give it some instructions 00:04:44.600 |
and you feed in a small amount of extra text, 00:04:48.920 |
there's a good chance it's gonna follow those instructions. 00:04:51.160 |
If we put in the instructions and then loads of text, 00:04:57.800 |
that the model will forget to follow those instructions. 00:05:01.280 |
So both the embedding quality and the completion quality degrade 00:05:06.280 |
the more text that we feed into those models. 00:05:32.240 |
that's not how they count the length of texts. 00:05:38.840 |
Now, a token is typically a word or sub-word in length, 00:05:51.040 |
but it depends on the language model and the tokenizer that they use. 00:05:55.240 |
Now, for us, we're going to be using the GPT-3.5 Turbo model 00:05:59.560 |
and the encoding model for that is actually this one here. 00:06:07.160 |
maybe I can show you how we can check for that. 00:06:23.640 |
So what we are going to do is we're going to say 00:06:32.800 |
And then we just pass in the name of the model 00:06:58.480 |
So in reality, the difference is pretty minor. 00:07:06.920 |
and we see that the tokenizer split this into 26 tokens. 00:07:17.880 |
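A minimal sketch of that token-counting check using the tiktoken library; the example string below is just an illustration, not the exact text from the video.

```python
import tiktoken

# gpt-3.5-turbo maps to the "cl100k_base" encoding.
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")

def tiktoken_len(text: str) -> int:
    """Count how many tokens the model's tokenizer produces for `text`."""
    return len(tokenizer.encode(text))

tiktoken_len("hello I am a chunk of text and using the tiktoken_len function "
             "we can find the length of this chunk of text in tokens")
```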
and what we'll do is we'll just split it by spaces. 00:07:36.120 |
between the number of tokens and the number of words 00:07:38.880 |
and obviously not for the number of characters either. 00:07:58.720 |
we can actually initialize what we call a text splitter. 00:08:12.760 |
So we're going to say we don't want anything longer 00:08:18.000 |
We're going to also add an overlap between chunks. 00:08:36.920 |
or in between sentences that are related to each other. 00:08:47.560 |
like connecting information between two chunks. 00:08:55.280 |
which says, okay, for chunk zero and chunk one, 00:09:02.080 |
of about 20 tokens that exist within both of those. 00:09:06.160 |
This just reduces the chance of us cutting out something 00:09:11.880 |
that is actually, like, important information. 00:09:29.040 |
try and split on double newline characters first. 00:09:56.360 |
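Putting that together, the splitter setup might look something like this, reusing the tiktoken_len helper from the previous sketch. The 400-token chunk size is an assumed value; the 20-token overlap matches what was just described.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,                        # assumed maximum chunk length in tokens
    chunk_overlap=20,                      # ~20-token overlap between neighboring chunks
    length_function=tiktoken_len,          # measure chunk length in tokens, not characters
    separators=["\n\n", "\n", " ", ""],    # try double newlines first, then back off
)

chunks = text_splitter.split_text(data[6]["text"])
```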
to creating the embeddings for this whole thing. 00:10:03.560 |
are a very key component of this whole retrieval thing 00:10:15.000 |
that we can then pass to our large language model 00:10:31.120 |
But these vectors are not just normal vectors, 00:10:40.840 |
of the meaning behind whatever text is within that chunk. 00:10:47.320 |
is because we're using a specially trained embedding model 00:10:51.560 |
that essentially translates human readable text 00:11:02.520 |
we then go and store those in our vector database, 00:11:09.720 |
we encode that using the same embedding model 00:11:21.480 |
basically their angular similarity, if that makes sense. 00:11:34.440 |
because it's actually the angular similarity between them, 00:11:40.720 |
and I'm just going to first add my OpenAI API key. 00:11:45.720 |
And one thing I should note is obviously you're gonna, 00:11:51.520 |
And also actually, if you don't have your API key, 00:12:05.000 |
is initialize this text-embedding-ada-002 model. 00:12:09.120 |
So this is basically OpenAI's best embedding model 00:12:14.320 |
So we'd go ahead and we would initialize that via LangChain 00:12:22.200 |
Then with that, we can just encode text like this. 00:12:29.440 |
and then we just do embed, so the embedding model, 00:12:38.320 |
Then we can see, so the response we get from this, okay? 00:12:41.680 |
So what we're returning is we get two vector embeddings 00:12:45.960 |
and that's because we have two chunks of text here. 00:12:48.960 |
And each one of those has this dimensionality of 1,536. 00:13:20.280 |
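In LangChain, that embedding step might look roughly like this; the placeholder API key and the two example strings are assumptions.

```python
from langchain.embeddings.openai import OpenAIEmbeddings

# Placeholder key -- set your own OpenAI API key here, or via the
# OPENAI_API_KEY environment variable.
embed = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    openai_api_key="YOUR_OPENAI_API_KEY",
)

texts = [
    "this is the first chunk of text",
    "then another second chunk of text is here",
]
res = embed.embed_documents(texts)
len(res), len(res[0])   # -> (2, 1536): two vectors, each with 1,536 dimensions
```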
So a vector database is a specific type of knowledge base 00:13:25.280 |
that allows us to search using these embedding vectors 00:13:39.600 |
of these text chunks in there that we encode into vectors. 00:13:49.280 |
We're talking like, I think at a billion scale, 00:13:52.880 |
maybe you're looking at a hundred milliseconds, 00:13:58.080 |
Now, because it's a database that we're gonna be using, 00:14:06.400 |
which is super important for that data freshness thing 00:14:16.240 |
So if I use the example of internal company documents, 00:14:28.720 |
to filter purely for HR documents or engineering documents. 00:14:41.360 |
So let's take a look at how we would initialize that. 00:14:52.880 |
There may be a wait list at the moment for that, 00:15:01.480 |
I think that wait list has been processed pretty quickly. 00:15:09.600 |
oh, first I'll get my API key and I'll get my environment. 00:15:16.000 |
You'll end up in your default project by default, 00:15:29.600 |
So I'm gonna remember that and I'll type that in. 00:15:49.760 |
So let me, what I can do is just add another line here. 00:15:54.760 |
So if index name, kind of want to do that, not quite. 00:16:12.280 |
So I'm not going to create it 'cause I don't need to. 00:16:18.000 |
if this is your first time running this notebook, 00:16:22.720 |
Then after that, we need to connect to the index. 00:16:31.520 |
gRPC is just a little more reliable and can be faster. 00:16:31.520 |
Okay, and again, if you're running this for the first time, 00:16:51.320 |
So me, obviously there's already vectors in there. 00:17:06.440 |
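A sketch of that initialization, assuming the older pinecone-client (v2) API that was current at the time; the index name and credentials are placeholders.

```python
import pinecone

# Placeholder credentials -- use your own API key and environment
# from the Pinecone console.
pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="YOUR_PINECONE_ENV")

index_name = "langchain-retrieval-augmentation"  # assumed index name

# Only create the index on the first run; 1536 dimensions matches ada-002,
# and cosine is the (angular) similarity metric discussed above.
if index_name not in pinecone.list_indexes():
    pinecone.create_index(name=index_name, metric="cosine", dimension=1536)

# Connect via the gRPC index client (a little more reliable, can be faster).
index = pinecone.GRPCIndex(index_name)

index.describe_index_stats()  # vector count will be 0 on a brand new index
```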
but let me take you through what is actually happening. 00:17:28.960 |
and you can only send and receive so much data. 00:18:04.480 |
So the metadata is just the metadata we created up here, 00:18:25.560 |
and then we would have the corresponding text 00:18:29.800 |
And then the metadata is actually on the article level. 00:18:43.720 |
Okay, we append those to our current batches, 00:18:50.720 |
And then we say, once we reach our batch limit, 00:19:10.720 |
So we would say, if the length of texts is greater than the batch limit, 00:19:10.720 |
Let's say we have like three items at the end there 00:19:25.320 |
with the initial code, they would have been missed. 00:19:44.400 |
We then add everything to our Pinecone index. 00:19:48.240 |
So that includes, basically the way that we do that 00:19:53.680 |
that contains tuples of IDs, embeddings, and metadatas. 00:20:02.360 |
Okay, so after that, we will have indexed everything. 00:20:06.160 |
Of course, I already had everything in there. 00:20:22.120 |
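The batched indexing loop might look roughly like this, reusing the data, text_splitter, embed, and index objects from the earlier sketches. The batch size of 100 and the exact metadata field names are assumptions.

```python
from uuid import uuid4
from tqdm.auto import tqdm

batch_limit = 100  # assumed batch size; keeps each request within payload limits

texts, metadatas = [], []

for record in tqdm(data):
    # Article-level metadata (assumed field names).
    metadata = {
        "wiki-id": str(record["id"]),
        "source": record["url"],
        "title": record["title"],
    }
    # Split the article into chunks and attach the chunk text to the metadata.
    record_texts = text_splitter.split_text(record["text"])
    record_metadatas = [
        {"chunk": j, "text": text, **metadata}
        for j, text in enumerate(record_texts)
    ]
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # Once we hit the batch limit, embed and upsert, then reset the buffers.
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts, metadatas = [], []

# Don't forget any leftovers smaller than a full batch.
if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))
```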
So we just added everything to our knowledge base 00:20:24.840 |
or added all the source knowledge to our knowledge base. 00:20:27.720 |
And then what we want to do is actually back in LangChain, 00:20:27.720 |
we're going to initialize a new Pinecone instance. 00:20:35.000 |
So the Pinecone instance that we just created 00:20:40.040 |
The reason that I did that is because creating the index 00:20:44.840 |
and populating it in LangChain is a fair bit slower 00:20:44.840 |
than just doing it directly with the Pinecone client. 00:20:56.520 |
that'll be optimized a little better than it is now, 00:21:09.040 |
and actually for the next, so for the querying 00:21:21.720 |
So I'm going to reinitialize Pinecone, but in LangChain. 00:21:21.720 |
but the gRPC index wasn't recognized by LangChain 00:21:31.040 |
And yeah, we just initialize our vector store, okay? 00:21:40.240 |
Essentially the same as what we had up here, the index, okay? 00:21:51.280 |
And the only thing, the only extra thing we need to do here 00:22:03.360 |
and we can see that because we create it here, okay? 00:22:11.000 |
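A sketch of that reinitialization in LangChain, reusing the embed object and index name from above; the Pinecone vector store constructor shown here follows the LangChain API of that era, so treat it as an assumption rather than a guaranteed current signature.

```python
from langchain.vectorstores import Pinecone

text_field = "text"  # the metadata field that holds each chunk's raw text

# LangChain works with the standard index client rather than the gRPC one.
index = pinecone.Index(index_name)

vectorstore = Pinecone(index, embed.embed_query, text_field)
```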
And then what we can do is we do a similarity search 00:22:18.440 |
We're going to say who was Benito Mussolini, okay? 00:22:22.320 |
And we're going to return the top three most relevant docs 00:22:26.960 |
And we see, okay, page content, Benito Mussolini, 00:22:31.760 |
so on and so on, Italian politician and journalist, 00:22:43.160 |
And then this one, again, you know, it's clearly, 00:22:47.200 |
I think clearly relevant and obviously relevant again. 00:22:50.600 |
So we're getting three relevant documents there, okay? 00:23:01.120 |
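The query itself is then a one-liner; the commented-out metadata filter is just an illustration of the filtering idea mentioned earlier, with a hypothetical department field.

```python
query = "who was Benito Mussolini?"

# Return the three most relevant chunks for the query.
vectorstore.similarity_search(query, k=3)

# If we had stored department tags in the metadata (HR, engineering, ...),
# we could also pass a metadata filter, e.g.:
# vectorstore.similarity_search(query, k=3, filter={"department": "HR"})
```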
and we don't really, or at least I don't want to feed 00:23:06.360 |
So what we want to do is actually come down to here, 00:23:10.960 |
and we're going to layer a large language model 00:23:18.480 |
we're actually going to add a large language model 00:23:22.920 |
And it's essentially going to take the query, 00:23:33.960 |
we're going to put them back together into the prompt, 00:23:37.040 |
and then ask the large language model to answer the query 00:23:39.400 |
based on those returned documents or contexts. 00:23:44.840 |
Okay, and we would call this generative question answering, 00:23:49.160 |
and I mean, let's just see how it works, right? 00:24:10.560 |
because we don't really want the model to make anything up. 00:24:13.760 |
It doesn't protect us 100% from it making things up, 00:24:21.600 |
And then we actually use this retrieval QA chain. 00:24:26.560 |
So the retrieval QA chain is just going to wrap everything up 00:24:41.080 |
and the retrieved documents into the large language model 00:25:05.680 |
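A sketch of that setup, assuming gpt-3.5-turbo as the chat model and LangChain's RetrievalQA chain with the "stuff" chain type; the API key is a placeholder.

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# temperature=0 so the model sticks as closely as possible to the retrieved
# context rather than getting creative.
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0.0,
    openai_api_key="YOUR_OPENAI_API_KEY",
)

# "stuff" simply stuffs all retrieved chunks into a single prompt.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
)

qa.run("who was Benito Mussolini?")
```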
So we get Benito Mussolini who was an Italian politician 00:25:08.760 |
journalist who served as prime minister of Italy. 00:25:16.440 |
He was a dictator of Italy by the end of 1927 00:25:34.960 |
they're very good at saying things that are completely wrong 00:25:44.040 |
And that's actually one of the biggest problems with these. 00:25:52.520 |
And of course, like for people that use these things a lot, 00:25:59.840 |
And they're probably going to cross-check things. 00:26:02.280 |
But you know, even for me, I use these all the time. 00:26:04.880 |
Sometimes a large language model will say something 00:26:07.920 |
and I'm kind of unsure, like, oh, is that true? 00:26:13.960 |
And it turns out that it's just completely false. 00:26:20.680 |
especially when you start deploying this to users 00:26:27.320 |
So there's not a 100% complete solution to that problem, 00:26:44.160 |
to reduce the likelihood of the model making things up. 00:26:51.400 |
which is not really, you know, modifying the model at all, 00:27:00.360 |
but it's actually just giving the user citations 00:27:09.640 |
So to do that in LangChain, it's actually really easy. 00:27:09.640 |
called the Retrieval QA with Sources chain, okay? 00:27:21.440 |
And then we use it in pretty much the same way. 00:27:26.040 |
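A sketch of the sources variant, reusing the llm and vectorstore from the sketches above.

```python
from langchain.chains import RetrievalQAWithSourcesChain

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
)

# Returns a dict with the generated "answer" plus the "sources" -- here the
# article URLs we stored under the "source" metadata field during indexing.
qa_with_sources("who was Benito Mussolini?")
```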
about Benito Mussolini, you can see here, actually. 00:27:38.680 |
pretty much the same, in fact, I think it's the same answer. 00:27:42.200 |
And what we can see is we actually get the sources 00:27:46.760 |
So we can actually, I think we can click on this. 00:27:49.440 |
And yeah, it's gonna take us through and we can see, 00:27:51.880 |
ah, okay, so this looks like a pretty good source. 00:27:59.400 |
and we can also just use this as essentially a check. 00:28:05.920 |
and if something seems weird, we can check on here 00:28:16.520 |
Simply just adding the source of our information 00:28:21.480 |
can really, I think, help users trust the system 00:28:25.960 |
that we're building, and even just as developers 00:28:29.200 |
and also people like managers wanting to integrate 00:28:42.480 |
So we've learned how to ground our large language models 00:28:52.400 |
that we're feeding into the large language model 00:29:07.640 |
and just reducing the likelihood of hallucinations 00:29:14.360 |
Now, as well, we can obviously keep information 00:29:30.400 |
Now, we're already seeing large language models 00:29:37.440 |
in a lot of really big products like Bing AI, 00:29:42.440 |
Google's Bard, and we see ChatGPT plugins are, you know, 00:29:50.440 |
So I think the future of large language models, 00:29:55.080 |
these knowledge bases are going to be incredibly important. 00:29:59.520 |
They are essentially an efficient form of memory 00:30:03.640 |
for these models that we can update and manage, 00:30:12.520 |
So yeah, I really think that this like long-term memory 00:30:16.800 |
for large language models is super important. 00:30:20.220 |
It's here to stay, and it's definitely worth looking 00:30:25.220 |
at whatever you're building at the moment and thinking, 00:30:27.280 |
okay, does it make sense to integrate something like this? 00:30:35.960 |
So I hope this has been useful and interesting.