Fixing LLM Hallucinations with Retrieval Augmentation in LangChain #6
Chapters
0:00 Hallucination in LLMs
1:32 Types of LLM Knowledge
3:08 Data Preprocessing with LangChain
9:54 Creating Embeddings with OpenAI's Ada 002
13:14 Creating the Pinecone Vector Database
16:57 Indexing Data into Our Database
20:27 Querying with LangChain
23:07 Generative Question-Answering with LangChain
25:27 Adding Citations to Generated Answers
28:42 Summary of Retrieval Augmentation in LangChain
00:00:00.000 |
Large language models have a little bit of an issue 00:00:05.320 |
That is the ability to use data that is actually up-to-date. 00:00:19.480 |
The large language model understands the world 00:00:29.280 |
but you're not going to retrain a large language model 00:00:36.160 |
because it's expensive, it takes a ton of time, 00:01:06.800 |
but through the prompt that we're feeding into the model. 00:01:23.080 |
In this video, that's what we're going to talk about. 00:01:25.240 |
We're going to have a look at how we can implement 00:01:27.920 |
a retrieval augmentation pipeline using LangChain. 00:01:39.560 |
that we can feed into a large language model. 00:01:41.560 |
We're going to be talking about parametric knowledge 00:02:00.240 |
the LLM creates like an internal representation 00:02:04.160 |
of the world according to that training data set. 00:02:06.920 |
And that all gets stored within the parameters 00:02:20.720 |
After training, the parametric knowledge is set 00:02:27.480 |
So when we are feeding a query or a prompt into our LLM, 00:02:32.480 |
okay, so we have some prompt and it was a question, 00:02:44.920 |
this is what we would call the source knowledge, okay? 00:02:49.800 |
and then up here, we have the parametric knowledge. 00:02:55.480 |
Now, when we're talking about retrieval augmentation, 00:03:00.480 |
what we're doing is adding more knowledge via the source knowledge to the LLM. 00:03:08.400 |
Okay, so we're going to start with this notebook. 00:03:25.360 |
or potentially feeding into our large language model 00:03:30.160 |
So when we're making the predictions or generating text. 00:03:33.760 |
So we're going to be using the Wikipedia dataset from here. 00:03:41.640 |
And we'll just have a quick look at one of those examples. 00:03:47.240 |
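For reference, the data-loading step might look roughly like this. This is just a sketch assuming the Hugging Face datasets library and a Simple English Wikipedia dump; the exact dataset name and split are assumptions rather than something confirmed in the transcript.

```python
from datasets import load_dataset

# Assumed dataset and split: a Simple English Wikipedia dump from Hugging Face.
data = load_dataset("wikipedia", "20220301.simple", split="train[:10000]")

# Peek at a single record: it has an id, url, title, and the full article text.
print(data[6]["title"])
print(data[6]["text"][:500])
```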
Okay, so if we go across, we have all this text here. 00:03:53.040 |
And you can see that it's pretty long, right? 00:04:02.600 |
A large language model and also the encoding model have a limit on how much text they can handle, 00:04:15.640 |
a point where they can't process any more and will return an error. 00:04:28.360 |
because usually the embeddings are of lesser quality 00:04:32.360 |
And we also don't want to be feeding too much text 00:04:36.120 |
So this is a model that's generating an answer 00:04:42.080 |
So if you, for example, give it some instructions 00:04:44.600 |
and you feed in a small amount of extra text, 00:04:48.920 |
there's a good chance it's gonna follow those instructions. 00:04:51.160 |
If we put in the instructions and then loads of text, 00:04:57.800 |
that the model will forget to follow those instructions. 00:05:01.280 |
So both the embedding quality and the completion quality degrade 00:05:06.280 |
the more text that we feed into those models. 00:05:32.240 |
that's not how they count the length of texts. 00:05:38.840 |
Now, a token is typically a word or sub-word in length, 00:05:51.040 |
but it depends on the language model and the tokenizer that they use. 00:05:55.240 |
Now, for us, we're going to be using the GPT-3.5 Turbo model 00:05:59.560 |
and the encoding model for that is actually this one here. 00:06:07.160 |
maybe I can show you how we can check for that. 00:06:23.640 |
So what we are going to do is we're going to say 00:06:32.800 |
And then we just pass in the name of the model 00:06:58.480 |
So in reality, the difference is pretty minor. 00:07:06.920 |
and we see that the tokenizer split this into 26 tokens. 00:07:17.880 |
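A minimal sketch of that token-counting check using the tiktoken library; the example string below is just an illustration, not the exact text from the video.

```python
import tiktoken

# gpt-3.5-turbo maps to the "cl100k_base" encoding.
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")

def tiktoken_len(text: str) -> int:
    """Count how many tokens the model's tokenizer produces for `text`."""
    return len(tokenizer.encode(text))

tiktoken_len("hello I am a chunk of text and using the tiktoken_len function "
             "we can find the length of this chunk of text in tokens")
```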
and what we'll do is we'll just split it by spaces. 00:07:36.120 |
between the number of tokens and the number of words 00:07:38.880 |
and obviously not for the number of characters either. 00:07:58.720 |
we can actually initialize what we call a text splitter. 00:08:12.760 |
So we're going to say we don't want anything longer 00:08:18.000 |
We're going to also add an overlap between chunks. 00:08:36.920 |
or in between sentences that are related to each other. 00:08:47.560 |
like connecting information between two chunks. 00:08:55.280 |
which says, okay, for chunk zero and chunk one, 00:09:02.080 |
of about 20 tokens that exist within both of those. 00:09:06.160 |
This just reduces the chance of us cutting out something 00:09:11.880 |
that is actually, like, important information. 00:09:29.040 |
try and split on double newline characters first. 00:09:56.360 |
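Putting that together, the splitter setup might look something like this, reusing the tiktoken_len helper from the previous sketch. The 400-token chunk size is an assumed value; the 20-token overlap matches what was just described.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,                        # assumed maximum chunk length in tokens
    chunk_overlap=20,                      # ~20-token overlap between neighboring chunks
    length_function=tiktoken_len,          # measure chunk length in tokens, not characters
    separators=["\n\n", "\n", " ", ""],    # try double newlines first, then back off
)

chunks = text_splitter.split_text(data[6]["text"])
```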
to creating the embeddings for this whole thing. 00:10:03.560 |
are a very key component of this whole retrieval thing 00:10:15.000 |
that we can then pass to our large language model 00:10:31.120 |
But these vectors are not just normal vectors, 00:10:40.840 |
of the meaning behind whatever text is within that chunk. 00:10:47.320 |
is because we're using a specially trained embedding model 00:10:51.560 |
that essentially translates human readable text 00:11:02.520 |
we then go and store those in our vector database, 00:11:09.720 |
we encode that using the same embedding model 00:11:21.480 |
basically their angular similarity, if that makes sense. 00:11:34.440 |
because it's actually the angular similarity between them, 00:11:40.720 |
and I'm just going to first add my OpenAI API key. 00:11:45.720 |
And one thing I should note is obviously you're gonna, 00:11:51.520 |
And also actually, if you don't have your API key, 00:12:05.000 |
is initialize this text-embedding-ada-002 model. 00:12:09.120 |
So this is basically OpenAI's best embedding model 00:12:14.320 |
So we'd go ahead and we would initialize that via LangChain 00:12:22.200 |
Then with that, we can just encode text like this. 00:12:29.440 |
and then we just do embed, so the embedding model, 00:12:38.320 |
Then we can see, so the response we get from this, okay? 00:12:41.680 |
So what we're returning is we get two vector embeddings 00:12:45.960 |
and that's because we have two chunks of text here. 00:12:48.960 |
And each one of those has this dimensionality of 1,536. 00:13:20.280 |
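In LangChain, that embedding step might look roughly like this; the placeholder API key and the two example strings are assumptions.

```python
from langchain.embeddings.openai import OpenAIEmbeddings

# Placeholder key -- set your own OpenAI API key here, or via the
# OPENAI_API_KEY environment variable.
embed = OpenAIEmbeddings(
    model="text-embedding-ada-002",
    openai_api_key="YOUR_OPENAI_API_KEY",
)

texts = [
    "this is the first chunk of text",
    "then another second chunk of text is here",
]
res = embed.embed_documents(texts)
len(res), len(res[0])   # -> (2, 1536): two vectors, each with 1,536 dimensions
```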
So a vector database is a specific type of knowledge base 00:13:25.280 |
that allows us to search using these embedding vectors 00:13:39.600 |
of these text chunks in there that we encode into vectors. 00:13:49.280 |
We're talking like, I think at a billion scale, 00:13:52.880 |
maybe you're looking at a hundred milliseconds, 00:13:58.080 |
Now, because it's a database that we're gonna be using, 00:14:06.400 |
which is super important for that data freshness thing 00:14:16.240 |
So if I use the example of internal company documents, 00:14:28.720 |
to filter purely for HR documents or engineering documents. 00:14:41.360 |
So let's take a look at how we would initialize that. 00:14:52.880 |
There may be a wait list at the moment for that, 00:15:01.480 |
I think that wait list has been processed pretty quickly. 00:15:09.600 |
oh, first I'll get my API key and I'll get my environment. 00:15:16.000 |
You'll end up in your default project by default, 00:15:29.600 |
So I'm gonna remember that and I'll type that in. 00:15:49.760 |
So let me, what I can do is just add another line here. 00:15:54.760 |
So if index name, kind of want to do that, not quite. 00:16:12.280 |
So I'm not going to create it 'cause I don't need to. 00:16:18.000 |
if this is your first time running this notebook, 00:16:22.720 |
Then after that, we need to connect to the index. 00:16:31.520 |
gRPC is just a little more reliable and can be faster. 00:16:31.520 |
Okay, and again, if you're running this for the first time, 00:16:51.320 |
So me, obviously there's already vectors in there. 00:17:06.440 |
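A sketch of that initialization, assuming the older pinecone-client (v2) API that was current at the time; the index name and credentials are placeholders.

```python
import pinecone

# Placeholder credentials -- use your own API key and environment
# from the Pinecone console.
pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="YOUR_PINECONE_ENV")

index_name = "langchain-retrieval-augmentation"  # assumed index name

# Only create the index on the first run; 1536 dimensions matches ada-002,
# and cosine is the (angular) similarity metric discussed above.
if index_name not in pinecone.list_indexes():
    pinecone.create_index(name=index_name, metric="cosine", dimension=1536)

# Connect via the gRPC index client (a little more reliable, can be faster).
index = pinecone.GRPCIndex(index_name)

index.describe_index_stats()  # vector count will be 0 on a brand new index
```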
but let me take you through what is actually happening. 00:17:28.960 |
and you can only send and receive so much data. 00:18:04.480 |
So the metadata is just the metadata we created up here, 00:18:25.560 |
and then we would have the corresponding text 00:18:29.800 |
And then the metadata is actually on the article level. 00:18:43.720 |
Okay, we append those to our current batches, 00:18:50.720 |
And then we say, once we reach our batch limit, 00:19:10.720 |
So we would say, if the length of texts is greater than the batch limit, 00:19:10.720 |
Let's say we have like three items at the end there 00:19:25.320 |
with the initial code, they would have been missed. 00:19:44.400 |
We then add everything to our Pinecone index. 00:19:48.240 |
So that includes, basically the way that we do that 00:19:53.680 |
that contains tuples of IDs, embeddings, and metadatas. 00:20:02.360 |
Okay, so after that, we will have indexed everything. 00:20:06.160 |
Of course, I already had everything in there. 00:20:22.120 |
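The batched indexing loop might look roughly like this, reusing the data, text_splitter, embed, and index objects from the earlier sketches. The batch size of 100 and the exact metadata field names are assumptions.

```python
from uuid import uuid4
from tqdm.auto import tqdm

batch_limit = 100  # assumed batch size; keeps each request within payload limits

texts, metadatas = [], []

for record in tqdm(data):
    # Article-level metadata (assumed field names).
    metadata = {
        "wiki-id": str(record["id"]),
        "source": record["url"],
        "title": record["title"],
    }
    # Split the article into chunks and attach the chunk text to the metadata.
    record_texts = text_splitter.split_text(record["text"])
    record_metadatas = [
        {"chunk": j, "text": text, **metadata}
        for j, text in enumerate(record_texts)
    ]
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # Once we hit the batch limit, embed and upsert, then reset the buffers.
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts, metadatas = [], []

# Don't forget any leftovers smaller than a full batch.
if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))
```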
So we just added everything to our knowledge base 00:20:24.840 |
or added all the source knowledge to our knowledge base. 00:20:27.720 |
And then what we want to do is actually back in LangChain, 00:20:27.720 |
we're going to initialize a new Pinecone instance. 00:20:35.000 |
So the Pinecone instance that we just created 00:20:40.040 |
The reason that I did that is because creating the index 00:20:44.840 |
and populating it in LangChain is a fair bit slower 00:20:44.840 |
than just doing it directly with the Pinecone client. 00:20:56.520 |
that'll be optimized a little better than it is now, 00:21:09.040 |
and actually for the next, so for the querying 00:21:21.720 |
So I'm going to reinitialize Pinecone, but in LangChain. 00:21:21.720 |
but the gRPC index wasn't recognized by LangChain 00:21:31.040 |
And yeah, we just initialize our vector store, okay? 00:21:40.240 |
Essentially the same as what we had up here, the index, okay? 00:21:51.280 |
And the only thing, the only extra thing we need to do here 00:22:03.360 |
and we can see that because we create it here, okay? 00:22:11.000 |
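A sketch of that reinitialization in LangChain, reusing the embed object and index name from above; the Pinecone vector store constructor shown here follows the LangChain API of that era, so treat it as an assumption rather than a guaranteed current signature.

```python
from langchain.vectorstores import Pinecone

text_field = "text"  # the metadata field that holds each chunk's raw text

# LangChain works with the standard index client rather than the gRPC one.
index = pinecone.Index(index_name)

vectorstore = Pinecone(index, embed.embed_query, text_field)
```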
And then what we can do is we do a similarity search 00:22:18.440 |
We're going to say who was Benito Mussolini, okay? 00:22:22.320 |
And we're going to return the top three most relevant docs 00:22:26.960 |
And we see, okay, page content, Benito Mussolini, 00:22:31.760 |
so on and so on, Italian politician and journalist, 00:22:43.160 |
And then this one, again, you know, it's clearly, 00:22:47.200 |
I think clearly relevant and obviously relevant again. 00:22:50.600 |
So we're getting three relevant documents there, okay? 00:23:01.120 |
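The query itself is then a one-liner; the commented-out metadata filter is just an illustration of the filtering idea mentioned earlier, with a hypothetical department field.

```python
query = "who was Benito Mussolini?"

# Return the three most relevant chunks for the query.
vectorstore.similarity_search(query, k=3)

# If we had stored department tags in the metadata (HR, engineering, ...),
# we could also pass a metadata filter, e.g.:
# vectorstore.similarity_search(query, k=3, filter={"department": "HR"})
```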
and we don't really, or at least I don't want to feed 00:23:06.360 |
So what we want to do is actually come down to here, 00:23:10.960 |
and we're going to layer a large language model 00:23:18.480 |
we're actually going to add a large language model 00:23:22.920 |
And it's essentially going to take the query, 00:23:33.960 |
we're going to put them back together into the prompt, 00:23:37.040 |
and then ask the large language model to answer the query 00:23:39.400 |
based on those returned documents or contexts. 00:23:44.840 |
Okay, and we would call this generative question answering, 00:23:49.160 |
and I mean, let's just see how it works, right? 00:24:10.560 |
because we don't really want the model to make anything up. 00:24:13.760 |
It doesn't protect us 100% from it making things up, 00:24:21.600 |
And then we actually use this retrieval QA chain. 00:24:26.560 |
So the retrieval QA chain is just going to wrap everything up 00:24:41.080 |
and the retrieved documents into the large language model 00:25:05.680 |
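A sketch of that setup, assuming gpt-3.5-turbo as the chat model and LangChain's RetrievalQA chain with the "stuff" chain type; the API key is a placeholder.

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# temperature=0 so the model sticks as closely as possible to the retrieved
# context rather than getting creative.
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0.0,
    openai_api_key="YOUR_OPENAI_API_KEY",
)

# "stuff" simply stuffs all retrieved chunks into a single prompt.
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
)

qa.run("who was Benito Mussolini?")
```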
So we get Benito Mussolini who was an Italian politician 00:25:08.760 |
journalist who served as prime minister of Italy. 00:25:16.440 |
He was a dictator of Italy by the end of 1927 00:25:34.960 |
they're very good at saying things that are completely wrong 00:25:44.040 |
And that's actually one of the biggest problems with these. 00:25:52.520 |
And of course, like for people that use these things a lot, 00:25:59.840 |
And they're probably going to cross-check things. 00:26:02.280 |
But you know, even for me, I use these all the time. 00:26:04.880 |
Sometimes a large language model will say something 00:26:07.920 |
and I'm kind of unsure, like, oh, is that true? 00:26:13.960 |
And it turns out that it's just completely false. 00:26:20.680 |
especially when you start deploying this to users 00:26:27.320 |
So there's not a 100% complete solution to that problem, 00:26:44.160 |
to reduce the likelihood of the model making things up. 00:26:51.400 |
which is not really, you know, modifying the model at all, 00:27:00.360 |
but it's actually just giving the user citations 00:27:09.640 |
So to do that in LangChain, it's actually really easy. 00:27:09.640 |
called the Retrieval QA with Sources chain, okay? 00:27:21.440 |
And then we use it in pretty much the same way. 00:27:26.040 |
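A sketch of the sources variant, reusing the llm and vectorstore from the sketches above.

```python
from langchain.chains import RetrievalQAWithSourcesChain

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
)

# Returns a dict with the generated "answer" plus the "sources" -- here the
# article URLs we stored under the "source" metadata field during indexing.
qa_with_sources("who was Benito Mussolini?")
```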
about Benito Mussolini, you can see here, actually. 00:27:38.680 |
pretty much the same, in fact, I think it's the same answer. 00:27:42.200 |
And what we can see is we actually get the sources 00:27:46.760 |
So we can actually, I think we can click on this. 00:27:49.440 |
And yeah, it's gonna take us through and we can see, 00:27:51.880 |
ah, okay, so this looks like a pretty good source. 00:27:59.400 |
and we can also just use this as essentially a check. 00:28:05.920 |
and if something seems weird, we can check on here 00:28:16.520 |
Simply just adding the source of our information 00:28:21.480 |
can really, I think, help users trust the system 00:28:25.960 |
that we're building, and even just as developers 00:28:29.200 |
and also people like managers wanting to integrate 00:28:42.480 |
So we've learned how to ground our large language models 00:28:52.400 |
that we're feeding into the large language model 00:29:07.640 |
and just reducing the likelihood of hallucinations 00:29:14.360 |
Now, as well, we can obviously keep information 00:29:30.400 |
Now, we're already seeing large language models 00:29:37.440 |
in a lot of really big products like Bing AI, 00:29:42.440 |
Google's Bard, and we see ChatGPT plugins are, you know, 00:29:50.440 |
So I think the future of large language models, 00:29:55.080 |
these knowledge bases are going to be incredibly important. 00:29:59.520 |
They are essentially an efficient form of memory 00:30:03.640 |
for these models that we can update and manage, 00:30:12.520 |
So yeah, I really think that this like long-term memory 00:30:16.800 |
for large language models is super important. 00:30:20.220 |
It's here to stay, and it's definitely worth looking 00:30:25.220 |
at whatever you're building at the moment and thinking, 00:30:27.280 |
okay, does it make sense to integrate something like this? 00:30:35.960 |
So I hope this has been useful and interesting.