
LangChain Multi-Query Retriever for RAG


Chapters

0:00 LangChain Multi-Query
0:31 What is Multi-Query in RAG?
1:50 RAG Index Code
2:56 Creating a LangChain MultiQueryRetriever
7:16 Adding Generation to Multi-Query
8:51 RAG in LangChain using Sequential Chain
11:18 Customizing LangChain Multi Query
13:41 Reducing Multi Query Hallucination
16:56 Multi Query in a Larger RAG Pipeline

Whisper Transcript

00:00:00.000 | Today, we're going to be talking about another method
00:00:02.880 | that we can use to make retrieval for LLMs better.
00:00:06.780 | We're going to be taking a look at how to do multi-query
00:00:09.920 | within LangChain.
00:00:11.260 | This is a very hands-on video.
00:00:12.740 | I'm going to almost jump straight into the code,
00:00:15.140 | but in a future video, I will talk a little bit more
00:00:17.980 | about multi-query and maybe a more fully-fledged
00:00:21.580 | retrieval system that uses multi-query,
00:00:23.620 | but other components as well.
00:00:25.300 | But for now in this video,
00:00:26.580 | I just want to introduce you to multi-query.
00:00:29.000 | So let's jump straight into it.
00:00:31.220 | So let's have a very quick look
00:00:32.980 | at what multi-query actually is.
00:00:35.340 | So typically in retrieval,
00:00:38.060 | what we're going to do is we're going to take a single query.
00:00:40.100 | We're going to throw that into our RAG pipeline.
00:00:44.180 | It's going to like go to our vector database
00:00:47.340 | and return a few items, right?
00:00:48.980 | So this single query gets turned into this query vector
00:00:52.320 | and that is mapped to some other vectors
00:00:55.900 | and we return them.
00:00:57.100 | The idea behind multi-query is that,
00:00:59.760 | rather than just having this single query,
00:01:02.720 | we actually pass this into a LLM
00:01:05.520 | and that LLM will generate multiple queries for us.
00:01:09.240 | So let's say it will generate three.
00:01:11.700 | We then have these multiple queries
00:01:14.680 | that get translated into query vectors.
00:01:17.600 | And the idea is that there is some variety between them.
00:01:20.800 | So rather than just identifying, you know,
00:01:22.880 | a single point in vector space that is relevant to us,
00:01:25.600 | we might identify three points within the vector space
00:01:29.220 | that are relevant to us.
00:01:30.260 | And we naturally pull in a higher variety of records
00:01:34.220 | using this technique.
00:01:35.860 | So "what is LLAMA?" may become three different questions
00:01:39.260 | and we'll see some examples of that later on
00:01:41.580 | in this video as well.
00:01:42.700 | But that is the core idea.
00:01:44.820 | We're searching a wider or broader vector space
00:01:48.820 | for some answers to our query.
00:01:50.900 | So as usual, here is the code.
00:01:53.780 | We're going to skip the first part of it.
00:01:56.120 | I will just point out libraries we need installed here.
00:01:59.160 | The dataset we're using here is this AI arXiv chunks dataset.
00:02:02.640 | You probably have seen it before
00:02:04.240 | if you've been watching recent videos.
00:02:06.160 | And all I'm doing here is setting everything up.
00:02:09.200 | So I'm setting up my OpenAI API key,
00:02:12.400 | OpenAI embeddings.
00:02:14.080 | This is all done via LangChain.
00:02:17.080 | I'm creating my Pinecone index here.
00:02:20.680 | Again, instructions, if you need them, are there.
00:02:23.360 | Again, I'm not going to go through them.
00:02:25.260 | And then we would populate our index.
00:02:26.700 | Now the full length of the documents is, you know,
00:02:29.840 | it's not huge, but it can take a little bit of time,
00:02:32.300 | especially depending on your internet connection.
00:02:35.300 | So if needed, you can just speed things up
00:02:40.020 | by taking like the first 5,000 documents.
00:02:42.420 | Results won't be as good because that means
00:02:44.340 | we have less data to retrieve.
00:02:47.400 | But if you just want to follow along,
00:02:49.500 | I would recommend doing that.
00:02:51.020 | Basically, you will get your indexing done
00:02:53.380 | in like a minute or so if you do that.
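For reference, a minimal sketch of the indexing setup described above might look something like this. The dataset identifier, record field names, and index name are assumptions here, so check the notebook linked with the video for the exact values:

```python
import os
import pinecone  # pinecone-client v2-style API
from datasets import load_dataset
from langchain.embeddings.openai import OpenAIEmbeddings

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
embed = OpenAIEmbeddings(model="text-embedding-ada-002")

pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="YOUR_PINECONE_ENV")
index_name = "multi-query-demo"  # assumed index name
if index_name not in pinecone.list_indexes():
    # text-embedding-ada-002 produces 1536-dimensional vectors
    pinecone.create_index(index_name, dimension=1536, metric="cosine")
index = pinecone.Index(index_name)

# assumed dataset identifier; slice to the first 5,000 chunks to speed things up
data = load_dataset("jamescalam/ai-arxiv-chunked", split="train[:5000]")

batch_size = 100
for i in range(0, len(data), batch_size):
    batch = data[i:i + batch_size]   # a dict of lists for this slice
    texts = batch["chunk"]           # assumed text field name
    ids = [str(n) for n in range(i, i + len(texts))]
    embeds = embed.embed_documents(texts)
    metadata = [{"text": t} for t in texts]
    index.upsert(vectors=list(zip(ids, embeds, metadata)))
```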
00:02:56.500 | And here is where we actually want to
00:02:58.480 | sort of dive into the notebook.
00:03:00.620 | So we are going to be doing multi-query in Langchain,
00:03:04.260 | as it says.
00:03:05.180 | And the first thing we need to do for this
00:03:07.060 | is to initialize a VectorStore object in Langchain.
00:03:10.820 | All right, so here we're going to be using
00:03:11.900 | the Pinecone VectorStore.
00:03:13.220 | That's what we've initialized above.
00:03:15.100 | Of course, if needed, if you're using something else,
00:03:17.860 | you just swap that in there.
00:03:19.100 | All right, so we have our VectorStore.
00:03:20.420 | We also need an LLM.
00:03:22.340 | So this LLM is going to be the thing
00:03:24.900 | that both generates our queries
00:03:27.420 | and also generates the answer to our query
00:03:31.560 | at the end of the RAG pipeline.
00:03:33.400 | So we initialize that as well.
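A rough sketch of those two steps, assuming the Pinecone index and embeddings from the setup above:

```python
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI

# wrap the existing Pinecone index as a LangChain VectorStore;
# "text" is the metadata field that holds the chunk text
vectorstore = Pinecone(index, embed.embed_query, "text")

# the LLM that will both generate the extra queries and answer at the end
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
```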
00:03:36.280 | Then what we can do here is we want to initialize
00:03:41.620 | this multi-query retriever.
00:03:43.100 | So as usual, Langchain kind of has everything in there.
00:03:46.980 | So you already have a specific retriever
00:03:49.040 | that is used for multi-query.
00:03:51.020 | We'll see how we can customize that
00:03:53.100 | towards the end of the video,
00:03:54.260 | but this is what we're starting with.
00:03:56.300 | So the multi-query retriever, like most retrievers,
00:04:00.720 | requires a VectorStore as its retriever.
00:04:02.860 | And in this case, because we're generating
00:04:05.020 | the multiple queries, we also need the LLM in there as well.
00:04:09.500 | Here, we're just setting the logging level.
00:04:11.140 | So this is so that we can see the queries
00:04:15.220 | that we're generating from the multi-query retriever.
00:04:18.260 | You don't need this, but if you would like to see
00:04:20.740 | what is actually going on with generating queries,
00:04:24.020 | you probably should.
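That setup, roughly (the retriever and logger names follow the LangChain docs of the time):

```python
import logging
from langchain.retrievers.multi_query import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(), llm=llm
)

# optional: log the generated queries so we can see what the LLM produces
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)
```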
00:04:25.780 | So our question is going to be, tell me about LLAMA2.
00:04:29.740 | Okay, let's use our multi-query retriever
00:04:32.900 | and see what happens.
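Something like:

```python
question = "tell me about llama 2"

# returns the de-duplicated union of documents across all generated queries
docs = retriever.get_relevant_documents(query=question)
len(docs)
```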
00:04:34.420 | Okay, so we get our logging here.
00:04:36.820 | We can see the generated queries that we have.
00:04:38.940 | So query number one is,
00:04:40.780 | what information can you provide about LLAMA2?
00:04:43.180 | Okay, so it's taken our initial query here.
00:04:46.940 | That is the first question that it will search with.
00:04:50.880 | Then we have, could you give me some details about LLAMA2?
00:04:55.540 | And three, I would like to learn more about LLAMA2.
00:04:58.580 | Can you help me with that?
00:05:00.100 | So that's, you know, I think that's kind of cool,
00:05:03.680 | but at the same time, hey, you know,
00:05:06.260 | there's not much variety between these questions.
00:05:09.660 | Like you're going to get slightly different results,
00:05:12.160 | but not significantly, because the semantic meaning
00:05:14.900 | between these is not that different, right?
00:05:19.020 | And you can kind of see that here.
00:05:20.580 | So this is the number of unique documents
00:05:23.100 | that we're returning.
00:05:24.460 | The default number of documents that is returned
00:05:28.220 | for each query is three, right?
00:05:30.580 | So in reality, we are returning nine documents here,
00:05:34.120 | but only five of those are actually unique.
00:05:35.940 | So there's a lot of overlap between these queries,
00:05:39.340 | but nonetheless, we still have brought in
00:05:42.100 | an extra two documents compared to
00:05:44.500 | if we just did a single query.
00:05:45.980 | So, you know, we are at least expanding the scope
00:05:49.820 | of our search a little bit here,
00:05:51.660 | but we can modify that,
00:05:53.340 | and I will show you how to do that pretty soon
00:05:56.180 | to broaden the scope further.
00:05:58.340 | But yes, here we can see the documents that were returned.
00:06:02.140 | They're not formatted too nicely,
00:06:05.660 | but we can see that they're relevant.
00:06:06.860 | So we know that this one here is actually coming
00:06:08.780 | from the LLAMA2 paper, and it is talking about,
00:06:13.220 | we develop and release LLAMA2,
00:06:14.740 | a 7 to 70 billion parameter LLM, right?
00:06:19.740 | So it's giving us some information there.
00:06:22.460 | The next one, which is here, is actually talking about,
00:06:26.780 | I think it's talking about LLAMAs, like the, yeah, here.
00:06:30.680 | It's talking about alpacas and LLAMAs and so on.
00:06:33.480 | I'm not sure what this one is.
00:06:36.180 | I mean, it's talking about LLMs,
00:06:37.580 | but it's just not in,
00:06:39.260 | it's not talking about LLAMA in the context that we want.
00:06:41.840 | We have another one here, LLAMA2 paper again.
00:06:46.340 | So we're getting something that is relevant, hopefully.
00:06:49.620 | We develop and release LLAMA2, and then there.
00:06:52.860 | Generally perform better than existing open source models.
00:06:55.900 | Okay, so we're getting more information there.
00:06:58.620 | Chain of thought prompting.
00:07:00.620 | Here we get, again, it's talking about the animals.
00:07:03.580 | And then this final one here is the base paper.
00:07:08.420 | And this one is talking about Stanford Alpaca,
00:07:13.420 | an instruction-following LLAMA model, right?
00:07:15.460 | So that one is relevant.
00:07:16.940 | So we have a few results here, not all of them relevant,
00:07:19.940 | but for the most part, we can work with that.
00:07:23.180 | So let's come down to here and see
00:07:25.300 | how we actually implement multi-query
00:07:27.340 | into like a full RAG pipeline.
00:07:29.880 | And we're gonna do this to start with,
00:07:32.900 | and we're gonna do this in this video, at least,
00:07:35.260 | using LangChain and the sort of standard way
00:07:38.020 | of doing it in LangChain.
00:07:39.520 | In another future video, we'll look at doing it
00:07:42.980 | sort of outside LangChain as well, just so we can compare.
00:07:46.220 | Okay, so to do that, within RAG here,
00:07:49.460 | we've already built a retrieval part.
00:07:51.660 | All right, so that's what I just showed you,
00:07:53.020 | the multi-query retriever.
00:07:54.540 | We need the augmentation
00:07:56.340 | and generation components to answer our queries.
00:07:58.820 | So to do that, we set up this QA prompt,
00:08:02.260 | so question answering prompt, just has some instructions,
00:08:05.140 | and then we feed in some context
00:08:06.860 | that we get from the retrieval part,
00:08:09.100 | and then we add in our question.
00:08:11.340 | So I'm gonna run that.
00:08:12.580 | And this is how we can feed our,
00:08:16.260 | the documents that we've gotten from before,
00:08:18.720 | the ones I just showed you, into that QA chain directly.
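The QA prompt and chain might look roughly like this; the exact prompt wording is an approximation of what's in the notebook:

```python
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

QA_PROMPT = PromptTemplate(
    input_variables=["query", "contexts"],
    template="""You are a helpful assistant who answers user queries using the
contexts provided. If the question cannot be answered using the information
provided, say "I don't know".

Contexts:
{contexts}

Question: {query}""",
)

qa_chain = LLMChain(llm=llm, prompt=QA_PROMPT)

# feed the documents retrieved earlier straight into the QA chain
out = qa_chain(
    inputs={
        "query": question,
        "contexts": "\n---\n".join(d.page_content for d in docs),
    }
)
print(out["text"])
```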
00:08:23.400 | Okay, and we get this answer here.
00:08:25.100 | So LLAMA2 is a collection of pretrained and
00:08:26.740 | fine-tuned language models,
00:08:28.020 | ranging in scale from 7 to 70 billion parameters.
00:08:32.220 | There's some weird formatting here,
00:08:34.780 | that's from the source data.
00:08:37.580 | They're optimized for dialogue use cases,
00:08:39.420 | developed and released by so-on and so-on.
00:08:42.460 | All right, so there's quite a bit of information in there,
00:08:46.000 | which is useful.
00:08:47.280 | So it does work.
00:08:48.620 | Let's see if we can, you know,
00:08:50.020 | let's see what else we can do.
00:08:51.780 | So I'm gonna put all this together
00:08:53.180 | into a single sequential chain.
00:08:55.100 | So this is like LangChain's way
00:08:57.100 | of just putting things together.
00:08:58.980 | So rather than me kind of like writing some code
00:09:02.160 | to handle this stuff,
00:09:03.940 | I'm kind of chaining things together
00:09:06.060 | with LangChain's approach of doing it.
00:09:09.740 | Honestly, whichever approach you go with,
00:09:11.960 | it's up to you.
00:09:14.040 | Depending on what you're doing,
00:09:15.180 | it might be easier just to write a function
00:09:18.260 | that handles all this stuff.
00:09:19.700 | But again, it's, you know, it's up to you.
00:09:21.900 | This is a LangChain way of doing it,
00:09:23.340 | if you'd like to do so.
00:09:24.740 | So for the retrieval part,
00:09:26.240 | we can't connect the retriever directly
00:09:29.020 | to the generation part
00:09:30.640 | because we need to format our context
00:09:32.700 | that come out of that.
00:09:34.020 | So what I have done here
00:09:36.500 | is I've defined this function,
00:09:38.300 | which does the retrieval,
00:09:40.300 | and then also does the formatting for us
00:09:42.500 | and then returns it.
00:09:43.820 | And then I'm wrapping this retrieval transform function
00:09:48.500 | into what's called transform chain.
00:09:50.740 | Okay, so it's basically,
00:09:52.420 | it's like a custom chain in LangChain.
00:09:54.900 | That's the way I would view it.
00:09:56.800 | So the input into this is going to be a question,
00:10:00.680 | which we set up here,
00:10:02.040 | and the output is going to be query and context,
00:10:04.640 | which we have set up here.
00:10:06.440 | Now, one thing that you can't do with this,
00:10:10.240 | or at least in the next part here,
00:10:12.200 | is that you cannot have the same input variable
00:10:15.480 | and the same output variable.
00:10:17.200 | All right, so that's why I'm calling this question
00:10:19.280 | and why this is the query.
00:10:20.560 | If I put question here, I'm going to get an error.
00:10:23.640 | So we just need to be wary of that.
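A sketch of that retrieval function and TransformChain wrapper (the variable names are assumptions):

```python
from langchain.chains import TransformChain

def retrieval_transform(inputs: dict) -> dict:
    # run the multi-query retriever, then format the documents into one string
    docs = retriever.get_relevant_documents(query=inputs["question"])
    return {
        "query": inputs["question"],
        "contexts": "\n---\n".join(d.page_content for d in docs),
    }

retrieval_chain = TransformChain(
    input_variables=["question"],            # input key is "question"...
    output_variables=["query", "contexts"],  # ...so it doesn't clash with "query"
    transform=retrieval_transform,
)
```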
00:10:26.700 | Now that we have our transform chain for retrieval
00:10:29.260 | and we have our QA chain from before,
00:10:31.900 | we wrap all of this into a single sequential chain
00:10:34.780 | and that gives us our RAG pipeline in LangChain.
00:10:38.620 | So let me run this and this.
00:10:43.120 | With that, we can just perform the full RAG pipeline
00:10:47.540 | by calling this method here.
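Putting it together, roughly:

```python
from langchain.chains import SequentialChain

rag_chain = SequentialChain(
    chains=[retrieval_chain, qa_chain],
    input_variables=["question"],
    output_variables=["query", "contexts", "text"],
)

# run the full multi-query RAG pipeline in one call
out = rag_chain({"question": "tell me about llama 2"})
print(out["text"])
```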
00:10:51.580 | Okay, so we input our question.
00:10:54.620 | You can see, I still have the logging on,
00:10:56.860 | so you can see the output there at the top.
00:10:59.900 | Don't know why it's in this weird color, but okay.
00:11:03.620 | So at the top, we have the same things we saw before,
00:11:06.460 | those three questions, and then this is the output.
00:11:09.180 | All right, it's the same as what we have before
00:11:10.540 | 'cause we're actually just doing the same thing.
00:11:12.100 | We just wrapped it into this sequential chain
00:11:14.280 | from LangChain.
00:11:15.240 | Okay, cool.
00:11:16.140 | So that's the full RAG pipeline.
00:11:18.980 | Now let's take a look at modifying our prompt
00:11:22.460 | in order to change the behavior
00:11:24.380 | of how we're generating these queries.
00:11:26.380 | And I think this is very important
00:11:28.780 | and probably the most important part of this video,
00:11:33.000 | which is, okay, how does it behave with different queries?
00:11:36.740 | So we're gonna start with this prompt A.
00:11:38.300 | So we can look at this.
00:11:39.460 | I'm just saying, okay,
00:11:40.380 | generate three different search queries
00:11:42.840 | that aim to answer the user question
00:11:44.860 | from multiple perspectives.
00:11:46.780 | Each query must tackle the question
00:11:48.140 | from a different viewpoint.
00:11:49.980 | We want to get a variety of relevant search results.
00:11:52.940 | Okay, so what I'm trying to do with this prompt
00:11:56.300 | is to add more variety
00:11:59.340 | to the queries that are being created.
00:12:01.300 | So that is the idea here.
00:12:04.020 | Now we can see how that performs.
00:12:05.900 | We come down to here.
00:12:08.260 | I'm going to put it into here.
00:12:09.760 | So this is kind of like our custom approach to doing this.
00:12:14.980 | We have this LineList object here, an output parser.
00:12:18.620 | Essentially what this is going to do
00:12:20.180 | is that our LLM here is going to generate the questions
00:12:25.180 | separated by new line characters.
00:12:28.060 | This output parser here is going to look for new lines
00:12:31.660 | and it's gonna separate out the queries based on that.
00:12:35.180 | So it's just parsing the output we generate here.
00:12:38.220 | Okay, cool.
00:12:39.260 | So we can run that.
00:12:42.180 | And what I'm gonna do is reinitialize the retriever
00:12:46.140 | with our new LLM chain here.
00:12:47.720 | And yeah, we run this.
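A sketch of that custom setup, following the pattern in the LangChain docs of the time (the prompt wording is an approximation of prompt A):

```python
from typing import List
from pydantic import BaseModel, Field
from langchain.chains import LLMChain
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from langchain.retrievers.multi_query import MultiQueryRetriever

class LineList(BaseModel):
    # holds the generated queries, one per line of LLM output
    lines: List[str] = Field(description="Lines of text")

class LineListOutputParser(PydanticOutputParser):
    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        # split the LLM output on newlines to recover the individual queries
        lines = [line for line in text.strip().split("\n") if line.strip()]
        return LineList(lines=lines)

output_parser = LineListOutputParser()

template_a = """Your task is to generate 3 different search queries that aim to
answer the user question from multiple perspectives.
Each query MUST tackle the question from a different viewpoint;
we want to get a variety of RELEVANT search results.
Provide these alternative questions separated by newlines.
Original question: {question}"""

prompt_a = PromptTemplate(input_variables=["question"], template=template_a)
llm_chain_a = LLMChain(llm=llm, prompt=prompt_a, output_parser=output_parser)

retriever_a = MultiQueryRetriever(
    retriever=vectorstore.as_retriever(),
    llm_chain=llm_chain_a,
    parser_key="lines",  # the LineList attribute holding the parsed queries
)
docs = retriever_a.get_relevant_documents(query=question)
```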
00:12:51.820 | And we'll just see the sort of queries that we get, okay?
00:12:55.480 | So we get what are the characteristics
00:12:57.700 | and behavior of llamas?
00:12:59.420 | How are llamas used in agriculture and farming?
00:13:02.800 | What are the different breeds of llamas
00:13:04.040 | and their unique traits?
00:13:05.220 | So yes, we've definitely got more diverse questions here,
00:13:10.100 | but now, you know, it sees "LLAMA2",
00:13:13.060 | And it's like, okay, you want me to ask some unique,
00:13:16.420 | diverse questions about llamas, perfect.
00:13:19.260 | So there's kind of like pros and cons to doing this.
00:13:23.220 | Obviously the results we get here
00:13:25.220 | are not going to be as relevant to our query.
00:13:28.460 | Although we actually still do get the llama paper
00:13:30.820 | because honestly, I don't think there's much in there
00:13:33.660 | that talks about llamas in agriculture.
00:13:36.860 | So yeah, that doesn't really work.
00:13:41.860 | So let's try another prompt.
00:13:43.900 | So what I want to kind of point out here
00:13:46.040 | is that when you're trying to increase the variety
00:13:49.900 | of the prompts that are generated
00:13:51.100 | by your multi-query system here,
00:13:54.260 | the more you increase that variety,
00:13:56.100 | the more likely it is to hallucinate
00:13:58.000 | or just kind of go down the wrong path.
00:14:00.980 | And that's exactly what we just saw there.
00:14:03.900 | So now what I'm going to do in a second prompt
00:14:06.660 | is be more specific.
00:14:07.780 | I'm going to say, okay, I'm basically saying the same
00:14:10.180 | as what I said in that first prompt, but I just added this.
00:14:13.940 | The user questions are focused on LLMs,
00:14:16.340 | machine learning, and related disciplines, right?
00:14:18.880 | So I'm just giving the LLM some context
00:14:21.860 | as to what it should be generating queries for.
00:14:25.220 | And well, let's see, let's see if this helps our LLM.
00:14:29.800 | So we put this in, let's run this,
00:14:32.860 | and then run our retriever again.
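Prompt B only adds that one extra line of context; something like:

```python
template_b = """Your task is to generate 3 different search queries that aim to
answer the user question from multiple perspectives.
The user questions are focused on Large Language Models, Machine Learning,
and related disciplines.
Each query MUST tackle the question from a different viewpoint;
we want to get a variety of RELEVANT search results.
Provide these alternative questions separated by newlines.
Original question: {question}"""

prompt_b = PromptTemplate(input_variables=["question"], template=template_b)
llm_chain_b = LLMChain(llm=llm, prompt=prompt_b, output_parser=output_parser)

retriever_b = MultiQueryRetriever(
    retriever=vectorstore.as_retriever(),
    llm_chain=llm_chain_b,
    parser_key="lines",
)
docs = retriever_b.get_relevant_documents(query=question)
```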
00:14:36.540 | Okay, so we have more variety here, seven,
00:14:40.060 | which is more than the five we had for the first one.
00:14:42.780 | And now we can see, okay, what are the key features
00:14:45.700 | and capabilities of large language model LLAMA2?
00:14:49.300 | Okay, so that's cool.
00:14:51.540 | How does LLAMA2 compare to other large language models
00:14:54.460 | in terms of performance and efficiency?
00:14:57.860 | Okay, what are the applications and use cases of LLAMA2
00:15:02.060 | in the field of machine learning and NLP?
00:15:05.620 | Right, so I personally think those results are way better
00:15:10.100 | than what we were getting before.
00:15:11.740 | And we can see the docs that are being returned here.
00:15:14.580 | It's not a big dataset,
00:15:15.580 | so I don't expect anything outstanding here,
00:15:18.860 | but we should at least maybe see less
00:15:21.860 | of the agriculture documents in here.
00:15:25.940 | So this one is definitely talking about LLAMA2.
00:15:30.140 | We can go on to the next one, which is here.
00:15:33.180 | Even large language models are brittle, social bias.
00:15:36.540 | So this one, I don't see anything relevant for LLAMA2 here,
00:15:43.060 | unless I'm missing.
00:15:44.420 | Yeah, I don't think so.
00:15:46.020 | So that one isn't so relevant.
00:15:48.100 | Let's see this one.
00:15:50.180 | Okay, so it's talking about LLMs.
00:15:51.740 | You have GPT-3 here, LaMDA, Gopher.
00:15:55.060 | Okay, all sort of comparable LLMs,
00:15:57.740 | comparable to some degree.
00:15:59.820 | So, okay, it doesn't talk about LLAMA,
00:16:01.820 | but at least we have LLMs in there.
00:16:04.100 | That's good.
00:16:05.220 | Here, it's coming from the LLAMA2 paper.
00:16:07.900 | So, "These closed product LLMs are heavily fine-tuned
00:16:11.420 | "to align with human preferences.
00:16:13.660 | "Greatly enhances their usability and safety."
00:16:16.580 | Okay, "In this work, we develop and release LLAMA2,"
00:16:19.260 | and then, okay.
00:16:20.700 | All right, so it's talking about LLAMA2.
00:16:23.780 | Here, we are talking about the original LLAMA model,
00:16:27.780 | okay, which I think is still relevant here.
00:16:30.620 | Okay, cool.
00:16:31.940 | And here, we have another paper.
00:16:35.260 | It's not specific to LLAMA1 or LLAMA2,
00:16:38.860 | and it is an older one.
00:16:40.060 | It's just talking about ML and NLP in general.
00:16:42.420 | Okay, and this one's talking, again, generally about LLMs.
00:16:45.540 | So, we have sort of a mix of LLMs in there, some LLAMA.
00:16:50.060 | So, I think we're kind of getting tighter
00:16:53.580 | into where we need to be,
00:16:54.900 | but for sure, it could be better.
00:16:57.100 | Now, we can see from the results here
00:16:59.540 | that we've broadened the scope of what we're searching for,
00:17:01.940 | which is, that's what we want to do with multi-query,
00:17:05.060 | but it still doesn't make a good retrieval system,
00:17:09.180 | at least by itself.
00:17:10.580 | Multi-query needs to be part
00:17:12.100 | of a larger retrieval pipeline because, yes,
00:17:16.460 | it broadens the scope of what we're searching for,
00:17:18.300 | but then we need to tighten up that scope,
00:17:20.220 | and we need to actually filter down
00:17:22.940 | so that we don't have so much irrelevant or noisy results
00:17:26.980 | within what we're returning.
00:17:28.820 | So, yes, we have that broader scope.
00:17:31.340 | We can probably tighten it up,
00:17:33.300 | especially in this use case
00:17:34.820 | where we're searching for a particular keyword,
00:17:36.700 | which is LLAMA2, by using something like hybrid search,
00:17:39.660 | and then following that retrieval step,
00:17:43.080 | returning, I don't know, more records,
00:17:46.340 | let's say, like, five records per query
00:17:48.860 | or 20 records per query,
00:17:50.740 | and we'll end up returning like 50 or so documents.
00:17:53.820 | Then what we'd want to do with that
00:17:55.820 | is look at the original query,
00:17:57.900 | put that into a re-ranking model
00:17:59.940 | alongside those, like, 50 documents,
00:18:02.620 | and re-ranking up to, like,
00:18:04.100 | the top three or top five documents,
00:18:06.900 | and then returning that to our LLM.
00:18:08.940 | And within that sort of pipeline,
00:18:11.700 | that's where something like multi-query
00:18:13.800 | can be really helpful in just helping us
00:18:16.580 | pull in a wider variety of results
00:18:18.880 | that can be useful for us.
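To make the shape of that larger pipeline concrete, here is a rough, hypothetical sketch. `generate_queries`, `hybrid_retrieve`, and `rerank` are placeholders for components you would plug in (e.g. the query-generation chain above, a dense-plus-keyword retriever, and a reranking model), not real library calls:

```python
from typing import Callable, List
from langchain.schema import Document

def multi_query_retrieve(
    question: str,
    generate_queries: Callable[[str], List[str]],            # e.g. the LLM chain above
    hybrid_retrieve: Callable[[str, int], List[Document]],   # dense + keyword search
    rerank: Callable[[str, List[Document], int], List[Document]],  # reranking model
    per_query_k: int = 20,
    top_n: int = 5,
) -> List[Document]:
    # 1) broaden: turn one question into several queries
    queries = generate_queries(question)
    # 2) retrieve many candidates per query via hybrid search
    candidates: List[Document] = []
    for q in queries:
        candidates.extend(hybrid_retrieve(q, per_query_k))
    # 3) de-duplicate the pooled candidates
    unique = list({doc.page_content: doc for doc in candidates}.values())
    # 4) tighten: re-rank against the original question and keep only the top few
    return rerank(question, unique, top_n)
```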
00:18:20.540 | So, that's it for this video.
00:18:23.300 | I hope this has been useful and interesting.
00:18:25.660 | So, thank you very much for watching,
00:18:27.500 | and I will see you again in the next one.
00:18:30.280 | (gentle music)