
LangChain Multi-Query Retriever for RAG


Chapters

0:00 LangChain Multi-Query
0:31 What is Multi-Query in RAG?
1:50 RAG Index Code
2:56 Creating a LangChain MultiQueryRetriever
7:16 Adding Generation to Multi-Query
8:51 RAG in LangChain using Sequential Chain
11:18 Customizing LangChain Multi Query
13:41 Reducing Multi Query Hallucination
16:56 Multi Query in a Larger RAG Pipeline

Whisper Transcript

00:00:00.000 | Today, we're going to be talking about another method
00:00:02.880 | that we can use to make retrieval for LLMs better.
00:00:06.780 | We're going to be taking a look at how to do multi-query
00:00:09.920 | within LangChain.
00:00:11.260 | This is a very hands-on video.
00:00:12.740 | I'm going to almost jump straight into the code,
00:00:15.140 | but in a future video, I will talk a little bit more
00:00:17.980 | about multi-query and maybe a more fully-fledged
00:00:21.580 | retrieval system that uses multi-query,
00:00:23.620 | but other components as well.
00:00:25.300 | But for now in this video,
00:00:26.580 | I just want to introduce you to multi-query.
00:00:29.000 | So let's jump straight into it.
00:00:31.220 | So let's have a very quick look
00:00:32.980 | at what multi-query actually is.
00:00:35.340 | So typically in retrieval,
00:00:38.060 | what we're going to do is we're going to take a single query.
00:00:40.100 | We're going to throw that into our RAG pipeline.
00:00:44.180 | It's going to like go to our vector database
00:00:47.340 | and return a few items, right?
00:00:48.980 | So this single query gets turned into this query vector
00:00:52.320 | and that is mapped to some other vectors
00:00:55.900 | and we return them.
00:00:57.100 | The idea behind multi-query is that,
00:00:59.760 | rather than just having this single query,
00:01:02.720 | we actually pass this into a LLM
00:01:05.520 | and that LLM will generate multiple queries for us.
00:01:09.240 | So let's say it will generate three.
00:01:11.700 | We then have these multiple queries
00:01:14.680 | that get translated into query vectors.
00:01:17.600 | And the idea is that there is some variety between them.
00:01:20.800 | So rather than just identifying, you know,
00:01:22.880 | a single point in vector space that is relevant to us,
00:01:25.600 | we might identify three points within the vector space
00:01:29.220 | that are relevant to us.
00:01:30.260 | And we naturally pull in a higher variety of records
00:01:34.220 | using this technique.
00:01:35.860 | So "what is LLAMA?" may become three different questions
00:01:39.260 | and we'll see some examples of that later on
00:01:41.580 | in this video as well.
00:01:42.700 | But that is the core idea.
00:01:44.820 | We're searching a wider or broader vector space
00:01:48.820 | for some answers to our query.
00:01:50.900 | So as usual, here is the code.
00:01:53.780 | We're going to skip the first part of it.
00:01:56.120 | I will just point out libraries we need installed here.
00:01:59.160 | The dataset we're using here is this AI arXiv chunks dataset.
00:02:02.640 | You probably have seen it before
00:02:04.240 | if you've been watching recent videos.
00:02:06.160 | And all I'm doing here is setting everything up.
00:02:09.200 | So I'm setting up my OpenAI API key,
00:02:12.400 | OpenAI embeddings.
00:02:14.080 | This is all done via LangChain.
00:02:17.080 | I'm creating my Pinecone index here.
00:02:20.680 | Again, instructions, if you need them, are there.
00:02:23.360 | Again, I'm not going to go through them.
00:02:25.260 | And then we would populate our index.
00:02:26.700 | Now the full length of the documents is, you know,
00:02:29.840 | it's not huge, but it can take a little bit of time,
00:02:32.300 | especially depending on your internet connection.
00:02:35.300 | So if needed, you can just speed things up
00:02:40.020 | by taking like the first 5,000 documents.
00:02:42.420 | Results won't be as good because that means
00:02:44.340 | we have less data to retrieve.
00:02:47.400 | But if you just want to follow along,
00:02:49.500 | I would recommend doing that.
00:02:51.020 | Basically, you will get your indexing done
00:02:53.380 | in like a minute or so if you do that.
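For reference, a minimal sketch of the indexing setup described above might look something like this. The dataset identifier, record field names, and index name are assumptions here, so check the notebook linked with the video for the exact values:

```python
import os
import pinecone  # pinecone-client v2-style API
from datasets import load_dataset
from langchain.embeddings.openai import OpenAIEmbeddings

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
embed = OpenAIEmbeddings(model="text-embedding-ada-002")

pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="YOUR_PINECONE_ENV")
index_name = "multi-query-demo"  # assumed index name
if index_name not in pinecone.list_indexes():
    # text-embedding-ada-002 produces 1536-dimensional vectors
    pinecone.create_index(index_name, dimension=1536, metric="cosine")
index = pinecone.Index(index_name)

# assumed dataset identifier; slice to the first 5,000 chunks to speed things up
data = load_dataset("jamescalam/ai-arxiv-chunked", split="train[:5000]")

batch_size = 100
for i in range(0, len(data), batch_size):
    batch = data[i:i + batch_size]   # a dict of lists for this slice
    texts = batch["chunk"]           # assumed text field name
    ids = [str(n) for n in range(i, i + len(texts))]
    embeds = embed.embed_documents(texts)
    metadata = [{"text": t} for t in texts]
    index.upsert(vectors=list(zip(ids, embeds, metadata)))
```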
00:02:56.500 | And here is where we actually want to
00:02:58.480 | sort of dive into the notebook.
00:03:00.620 | So we are going to be doing multi-query in Langchain,
00:03:04.260 | as it says.
00:03:05.180 | And the first thing we need to do for this
00:03:07.060 | is to initialize a VectorStore object in Langchain.
00:03:10.820 | All right, so here we're going to be using
00:03:11.900 | the Pinecone VectorStore.
00:03:13.220 | That's what we've initialized above.
00:03:15.100 | Of course, if needed, if you're using something else,
00:03:17.860 | you just swap that in there.
00:03:19.100 | All right, so we have our VectorStore.
00:03:20.420 | We also need an LLM.
00:03:22.340 | So this LLM is going to be the thing
00:03:24.900 | that both generates our queries
00:03:27.420 | and also generates the answer to our query
00:03:31.560 | at the end of the RAG pipeline.
00:03:33.400 | So we initialize that as well.
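A rough sketch of those two steps, assuming the Pinecone index and embeddings from the setup above:

```python
from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI

# wrap the existing Pinecone index as a LangChain VectorStore;
# "text" is the metadata field that holds the chunk text
vectorstore = Pinecone(index, embed.embed_query, "text")

# the LLM that will both generate the extra queries and answer at the end
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
```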
00:03:36.280 | Then what we can do here is we want to initialize
00:03:41.620 | this multi-query retriever.
00:03:43.100 | So as usual, Langchain kind of has everything in there.
00:03:46.980 | So you already have a specific retriever
00:03:49.040 | that is used for multi-query.
00:03:51.020 | We'll see how we can customize that
00:03:53.100 | towards the end of the video,
00:03:54.260 | but this is what we're starting with.
00:03:56.300 | So the multi-query retriever, like most retrievers,
00:04:00.720 | requires a VectorStore as its retriever.
00:04:02.860 | And in this case, because we're generating
00:04:05.020 | the multiple queries, we also need the LLM in there as well.
00:04:09.500 | Here, we're just setting the logging level.
00:04:11.140 | So this is so that we can see the queries
00:04:15.220 | that we're generating from the multi-query retriever.
00:04:18.260 | You don't need this, but if you would like to see
00:04:20.740 | what is actually going on with generating queries,
00:04:24.020 | you probably should.
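That setup, roughly (the retriever and logger names follow the LangChain docs of the time):

```python
import logging
from langchain.retrievers.multi_query import MultiQueryRetriever

retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(), llm=llm
)

# optional: log the generated queries so we can see what the LLM produces
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)
```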
00:04:25.780 | So our question is going to be, tell me about LLAMA2.
00:04:29.740 | Okay, let's use our multi-query retriever
00:04:32.900 | and see what happens.
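Something like:

```python
question = "tell me about llama 2"

# returns the de-duplicated union of documents across all generated queries
docs = retriever.get_relevant_documents(query=question)
len(docs)
```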
00:04:34.420 | Okay, so we get our logging here.
00:04:36.820 | We can see the generated queries that we have.
00:04:38.940 | So query number one is,
00:04:40.780 | what information can you provide about LLAMA2?
00:04:43.180 | Okay, so it's taken our initial query here.
00:04:46.940 | That is the first question that it will search with.
00:04:50.880 | Then we have, could you give me some details about LLAMA2?
00:04:55.540 | And three, I would like to learn more about LLAMA2.
00:04:58.580 | Can you help me with that?
00:05:00.100 | So that's, you know, I think that's kind of cool,
00:05:03.680 | but at the same time, hey, you know,
00:05:06.260 | there's not much variety between these questions.
00:05:09.660 | Like you're going to get slightly different results,
00:05:12.160 | but not significantly, because the semantic meaning
00:05:14.900 | between these is not that different, right?
00:05:19.020 | And you can kind of see that here.
00:05:20.580 | So this is the number of unique documents
00:05:23.100 | that we're returning.
00:05:24.460 | The default number of documents that is returned
00:05:28.220 | for each query is three, right?
00:05:30.580 | So in reality, we are returning nine documents here,
00:05:34.120 | but only five of those are actually unique.
00:05:35.940 | So there's a lot of overlap between these queries,
00:05:39.340 | but nonetheless, we still have brought in
00:05:42.100 | an extra two documents compared to
00:05:44.500 | if we just did a single query.
00:05:45.980 | So, you know, we are at least expanding the scope
00:05:49.820 | of our search a little bit here,
00:05:51.660 | but we can modify that,
00:05:53.340 | and I will show you how to do that pretty soon
00:05:56.180 | to broaden the scope further.
00:05:58.340 | But yes, here we can see the documents that were returned.
00:06:02.140 | They're not formatted too nicely,
00:06:05.660 | but we can see that they're relevant.
00:06:06.860 | So we know that this one here is actually coming
00:06:08.780 | from the LLAMA2 paper, and it is talking about,
00:06:13.220 | we develop and release LLAMA2,
00:06:14.740 | a 7 to 70 billion parameter LLM, right?
00:06:19.740 | So it's giving us some information there.
00:06:22.460 | The next one, which is here, is actually talking about,
00:06:26.780 | I think it's talking about LLAMAs, like the, yeah, here.
00:06:30.680 | It's talking about alpacas and LLAMAs and so on.
00:06:33.480 | I'm not sure what this one is.
00:06:36.180 | I mean, it's talking about LLMs,
00:06:37.580 | but it's just not in,
00:06:39.260 | it's not talking about LLAMA in the context that we want.
00:06:41.840 | We have another one here, LLAMA2 paper again.
00:06:46.340 | So we're getting something that is relevant, hopefully.
00:06:49.620 | We develop and release LLAMA2, and then there.
00:06:52.860 | Generally perform better than existing open source models.
00:06:55.900 | Okay, so we're getting more information there.
00:06:58.620 | Chain of thought prompting.
00:07:00.620 | Here we get, again, it's talking about the animals.
00:07:03.580 | And then this final one here is the base paper.
00:07:08.420 | And this one is talking about Stanford Alpaca,
00:07:13.420 | an instruction-following LLAMA model, right?
00:07:15.460 | So that one is relevant.
00:07:16.940 | So we have a few results here, not all of them relevant,
00:07:19.940 | but for the most part, we can work with that.
00:07:23.180 | So let's come down to here and see
00:07:25.300 | how we actually implement multi-query
00:07:27.340 | into like a full RAG pipeline.
00:07:29.880 | And we're gonna do this to start with,
00:07:32.900 | and we're gonna do this in this video, at least,
00:07:35.260 | using LangChain and the sort of standard way
00:07:38.020 | of doing it in LangChain.
00:07:39.520 | In another future video, we'll look at doing it
00:07:42.980 | sort of outside LangChain as well, just so we can compare.
00:07:46.220 | Okay, so to do that, within RAG here,
00:07:49.460 | we've already built a retrieval part.
00:07:51.660 | All right, so that's what I just showed you,
00:07:53.020 | the multi-query retriever.
00:07:54.540 | We need the augmentation
00:07:56.340 | and generation components to answer our queries.
00:07:58.820 | So to do that, we set up this QA prompt,
00:08:02.260 | so question answering prompt, just has some instructions,
00:08:05.140 | and then we feed in some context
00:08:06.860 | that we get from the retrieval part,
00:08:09.100 | and then we add in our question.
00:08:11.340 | So I'm gonna run that.
00:08:12.580 | And this is how we can feed our,
00:08:16.260 | the documents that we've gotten from before,
00:08:18.720 | the ones I just showed you, into that QA chain directly.
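The QA prompt and chain might look roughly like this; the exact prompt wording is an approximation of what's in the notebook:

```python
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

QA_PROMPT = PromptTemplate(
    input_variables=["query", "contexts"],
    template="""You are a helpful assistant who answers user queries using the
contexts provided. If the question cannot be answered using the information
provided, say "I don't know".

Contexts:
{contexts}

Question: {query}""",
)

qa_chain = LLMChain(llm=llm, prompt=QA_PROMPT)

# feed the documents retrieved earlier straight into the QA chain
out = qa_chain(
    inputs={
        "query": question,
        "contexts": "\n---\n".join(d.page_content for d in docs),
    }
)
print(out["text"])
```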
00:08:23.400 | Okay, and we get this answer here.
00:08:25.100 | So LLAMA2 is a collection of pretrained and
00:08:26.740 | fine-tuned language models,
00:08:28.020 | ranging in scale from 7 to 70 billion parameters.
00:08:32.220 | There's some weird formatting here,
00:08:34.780 | that's from the source data.
00:08:37.580 | They're optimized for dialogue use cases,
00:08:39.420 | developed and released by so-on and so-on.
00:08:42.460 | All right, so there's quite a bit of information in there,
00:08:46.000 | which is useful.
00:08:47.280 | So it does work.
00:08:48.620 | Let's see if we can, you know,
00:08:50.020 | let's see what else we can do.
00:08:51.780 | So I'm gonna put all this together
00:08:53.180 | into a single sequential chain.
00:08:55.100 | So this is like LangChain's way
00:08:57.100 | of just putting things together.
00:08:58.980 | So rather than me kind of like writing some code
00:09:02.160 | to handle this stuff,
00:09:03.940 | I'm kind of chaining things together
00:09:06.060 | with LangChain's approach of doing it.
00:09:09.740 | Honestly, whichever approach you go with,
00:09:11.960 | it's up to you.
00:09:14.040 | Depending on what you're doing,
00:09:15.180 | it might be easier just to write a function
00:09:18.260 | that handles all this stuff.
00:09:19.700 | But again, it's, you know, it's up to you.
00:09:21.900 | This is a LangChain way of doing it,
00:09:23.340 | if you'd like to do so.
00:09:24.740 | So for the retrieval part,
00:09:26.240 | we can't connect the retriever directly
00:09:29.020 | to the generation part
00:09:30.640 | because we need to format our context
00:09:32.700 | that come out of that.
00:09:34.020 | So what I have done here
00:09:36.500 | is I've defined this function,
00:09:38.300 | which does the retrieval,
00:09:40.300 | and then also does the formatting for us
00:09:42.500 | and then returns it.
00:09:43.820 | And then I'm wrapping this retrieval transform function
00:09:48.500 | into what's called transform chain.
00:09:50.740 | Okay, so it's basically,
00:09:52.420 | it's like a custom chain in LangChain.
00:09:54.900 | That's the way I would view it.
00:09:56.800 | So the input into this is going to be a question,
00:10:00.680 | which we set up here,
00:10:02.040 | and the output is going to be query and context,
00:10:04.640 | which we have set up here.
00:10:06.440 | Now, one thing that you can't do with this,
00:10:10.240 | or at least in the next part here,
00:10:12.200 | is that you cannot have the same input variable
00:10:15.480 | and the same output variable.
00:10:17.200 | All right, so that's why I'm calling this question
00:10:19.280 | and why this is the query.
00:10:20.560 | If I put question here, I'm going to get an error.
00:10:23.640 | So we just need to be wary of that.
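A sketch of that retrieval function and TransformChain wrapper (the variable names are assumptions):

```python
from langchain.chains import TransformChain

def retrieval_transform(inputs: dict) -> dict:
    # run the multi-query retriever, then format the documents into one string
    docs = retriever.get_relevant_documents(query=inputs["question"])
    return {
        "query": inputs["question"],
        "contexts": "\n---\n".join(d.page_content for d in docs),
    }

retrieval_chain = TransformChain(
    input_variables=["question"],            # input key is "question"...
    output_variables=["query", "contexts"],  # ...so it doesn't clash with "query"
    transform=retrieval_transform,
)
```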
00:10:26.700 | Now that we have our transform chain for retrieval
00:10:29.260 | and we have our QA chain from before,
00:10:31.900 | we wrap all of this into a single sequential chain
00:10:34.780 | and that gives us our RAG pipeline in LangChain.
00:10:38.620 | So let me run this and this.
00:10:43.120 | With that, we can just perform the full RAG pipeline
00:10:47.540 | by calling this method here.
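Putting it together, roughly:

```python
from langchain.chains import SequentialChain

rag_chain = SequentialChain(
    chains=[retrieval_chain, qa_chain],
    input_variables=["question"],
    output_variables=["query", "contexts", "text"],
)

# run the full multi-query RAG pipeline in one call
out = rag_chain({"question": "tell me about llama 2"})
print(out["text"])
```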
00:10:51.580 | Okay, so we input our question.
00:10:54.620 | You can see, I still have the logging on,
00:10:56.860 | so you can see the output there at the top.
00:10:59.900 | Don't know why it's in this weird color, but okay.
00:11:03.620 | So at the top, we have the same things we saw before,
00:11:06.460 | those three questions, and then this is the output.
00:11:09.180 | All right, it's the same as what we have before
00:11:10.540 | 'cause we're actually just doing the same thing.
00:11:12.100 | We just wrapped it into this sequential chain
00:11:14.280 | from LangChain.
00:11:15.240 | Okay, cool.
00:11:16.140 | So that's the full RAG pipeline.
00:11:18.980 | Now let's take a look at modifying our prompt
00:11:22.460 | in order to change the behavior
00:11:24.380 | of how we're generating these queries.
00:11:26.380 | And I think this is very important
00:11:28.780 | and probably the most important part of this video,
00:11:33.000 | which is, okay, how does it behave with different queries?
00:11:36.740 | So we're gonna start with this prompt A.
00:11:38.300 | So we can look at this.
00:11:39.460 | I'm just saying, okay,
00:11:40.380 | generate three different search queries
00:11:42.840 | that aim to answer the user question
00:11:44.860 | from multiple perspectives.
00:11:46.780 | Each query must tackle the question
00:11:48.140 | from a different viewpoint.
00:11:49.980 | We want to get a variety of relevant search results.
00:11:52.940 | Okay, so what I'm trying to do with this prompt
00:11:56.300 | is to add more variety
00:11:59.340 | to the queries that are being created.
00:12:01.300 | So that is the idea here.
00:12:04.020 | Now we can see how that performs.
00:12:05.900 | We come down to here.
00:12:08.260 | I'm going to put it into here.
00:12:09.760 | So this is kind of like our custom approach to doing this.
00:12:14.980 | We have this LineList object here, an output parser.
00:12:18.620 | Essentially what this is going to do
00:12:20.180 | is that our LLM here is going to generate the questions
00:12:25.180 | separated by new line characters.
00:12:28.060 | This output parser here is going to look for new lines
00:12:31.660 | and it's gonna separate out the queries based on that.
00:12:35.180 | So it's just parsing the output we generate here.
00:12:38.220 | Okay, cool.
00:12:39.260 | So we can run that.
00:12:42.180 | And what I'm gonna do is reinitialize the retriever
00:12:46.140 | with our new LLM chain here.
00:12:47.720 | And yeah, we run this.
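A sketch of that custom setup, following the pattern in the LangChain docs of the time (the prompt wording is an approximation of prompt A):

```python
from typing import List
from pydantic import BaseModel, Field
from langchain.chains import LLMChain
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from langchain.retrievers.multi_query import MultiQueryRetriever

class LineList(BaseModel):
    # holds the generated queries, one per line of LLM output
    lines: List[str] = Field(description="Lines of text")

class LineListOutputParser(PydanticOutputParser):
    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        # split the LLM output on newlines to recover the individual queries
        lines = [line for line in text.strip().split("\n") if line.strip()]
        return LineList(lines=lines)

output_parser = LineListOutputParser()

template_a = """Your task is to generate 3 different search queries that aim to
answer the user question from multiple perspectives.
Each query MUST tackle the question from a different viewpoint;
we want to get a variety of RELEVANT search results.
Provide these alternative questions separated by newlines.
Original question: {question}"""

prompt_a = PromptTemplate(input_variables=["question"], template=template_a)
llm_chain_a = LLMChain(llm=llm, prompt=prompt_a, output_parser=output_parser)

retriever_a = MultiQueryRetriever(
    retriever=vectorstore.as_retriever(),
    llm_chain=llm_chain_a,
    parser_key="lines",  # the LineList attribute holding the parsed queries
)
docs = retriever_a.get_relevant_documents(query=question)
```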
00:12:51.820 | And we'll just see the sort of queries that we get, okay?
00:12:55.480 | So we get what are the characteristics
00:12:57.700 | and behavior of llamas?
00:12:59.420 | How are llamas used in agriculture and farming?
00:13:02.800 | What are the different breeds of llamas
00:13:04.040 | and their unique traits?
00:13:05.220 | So yes, we've definitely got more diverse questions here,
00:13:10.100 | but now, you know, it sees "LLAMA2",
00:13:13.060 | And it's like, okay, you want me to ask some unique,
00:13:16.420 | diverse questions about llamas, perfect.
00:13:19.260 | So there's kind of like pros and cons to doing this.
00:13:23.220 | Obviously the results we get here
00:13:25.220 | are not going to be as relevant to our query.
00:13:28.460 | Although we actually still do get the llama paper
00:13:30.820 | because honestly, I don't think there's much in there
00:13:33.660 | that talks about llamas in agriculture.
00:13:36.860 | So yeah, that doesn't really work.
00:13:41.860 | So let's try another prompt.
00:13:43.900 | So what I want to kind of point out here
00:13:46.040 | is that when you're trying to increase the variety
00:13:49.900 | of the prompts that are generated
00:13:51.100 | by your multi-query system here,
00:13:54.260 | the more you increase that variety,
00:13:56.100 | the more likely it is to hallucinate
00:13:58.000 | or just kind of go down the wrong path.
00:14:00.980 | And that's exactly what we just saw there.
00:14:03.900 | So now what I'm going to do in a second prompt
00:14:06.660 | is be more specific.
00:14:07.780 | I'm going to say, okay, I'm basically saying the same
00:14:10.180 | as what I said in that first prompt, but I just added this.
00:14:13.940 | The user questions are focused on LLMs,
00:14:16.340 | machine learning, and related disciplines, right?
00:14:18.880 | So I'm just giving the LLM some context
00:14:21.860 | as to what it should be generating queries for.
00:14:25.220 | And well, let's see, let's see if this helps our LLM.
00:14:29.800 | So we put this in, let's run this,
00:14:32.860 | and then run our retriever again.
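Prompt B only adds that one extra line of context; something like:

```python
template_b = """Your task is to generate 3 different search queries that aim to
answer the user question from multiple perspectives.
The user questions are focused on Large Language Models, Machine Learning,
and related disciplines.
Each query MUST tackle the question from a different viewpoint;
we want to get a variety of RELEVANT search results.
Provide these alternative questions separated by newlines.
Original question: {question}"""

prompt_b = PromptTemplate(input_variables=["question"], template=template_b)
llm_chain_b = LLMChain(llm=llm, prompt=prompt_b, output_parser=output_parser)

retriever_b = MultiQueryRetriever(
    retriever=vectorstore.as_retriever(),
    llm_chain=llm_chain_b,
    parser_key="lines",
)
docs = retriever_b.get_relevant_documents(query=question)
```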
00:14:36.540 | Okay, so we have more variety here, seven,
00:14:40.060 | which is more than the five we had for the first one.
00:14:42.780 | And now we can see, okay, what are the key features
00:14:45.700 | and capabilities of large language model LLAMA2?
00:14:49.300 | Okay, so that's cool.
00:14:51.540 | How does LLAMA2 compare to other large language models
00:14:54.460 | in terms of performance and efficiency?
00:14:57.860 | Okay, what are the applications and use cases of LLAMA2
00:15:02.060 | in the field of machine learning and NLP?
00:15:05.620 | Right, so I personally think those results are way better
00:15:10.100 | than what we were getting before.
00:15:11.740 | And we can see the docs that are being returned here.
00:15:14.580 | It's not a big dataset,
00:15:15.580 | so I don't expect anything outstanding here,
00:15:18.860 | but we should at least maybe see less
00:15:21.860 | of the agriculture documents in here.
00:15:25.940 | So this one is definitely talking about LLAMA2.
00:15:30.140 | We can go on to the next one, which is here.
00:15:33.180 | Even large language models are brittle, social bias.
00:15:36.540 | So this one, I don't see anything relevant for LLAMA2 here,
00:15:43.060 | unless I'm missing.
00:15:44.420 | Yeah, I don't think so.
00:15:46.020 | So that one isn't so relevant.
00:15:48.100 | Let's see this one.
00:15:50.180 | Okay, so it's talking about LLMs.
00:15:51.740 | You have GPT-3 here, LaMDA, Gopher.
00:15:55.060 | Okay, all sort of comparable LLMs,
00:15:57.740 | comparable to some degree.
00:15:59.820 | So, okay, it doesn't talk about LLAMA,
00:16:01.820 | but at least we have LLMs in there.
00:16:04.100 | That's good.
00:16:05.220 | Here, it's coming from the LLAMA2 paper.
00:16:07.900 | So, "These closed product LLMs are heavily fine-tuned
00:16:11.420 | "to align with human preferences.
00:16:13.660 | "Greatly enhances their usability and safety."
00:16:16.580 | Okay, "In this work, we develop and release LLAMA2,"
00:16:19.260 | and then, okay.
00:16:20.700 | All right, so it's talking about LLAMA2.
00:16:23.780 | Here, we are talking about the original LLAMA model,
00:16:27.780 | okay, which I think is still relevant here.
00:16:30.620 | Okay, cool.
00:16:31.940 | And here, we have another paper.
00:16:35.260 | It's not specific to LLAMA1 or LLAMA2,
00:16:38.860 | and it is an older one.
00:16:40.060 | It's just talking about ML and NLP in general.
00:16:42.420 | Okay, and this one's talking, again, generally about LLMs.
00:16:45.540 | So, we have sort of a mix of LLMs in there, some LLAMA.
00:16:50.060 | So, I think we're kind of getting tighter
00:16:53.580 | into where we need to be,
00:16:54.900 | but for sure, it could be better.
00:16:57.100 | Now, we can see from the results here
00:16:59.540 | that we've broadened the scope of what we're searching for,
00:17:01.940 | which is, that's what we want to do with multi-query,
00:17:05.060 | but it still doesn't make a good retrieval system,
00:17:09.180 | at least by itself.
00:17:10.580 | Multi-query needs to be part
00:17:12.100 | of a larger retrieval pipeline because, yes,
00:17:16.460 | it broadens the scope of what we're searching for,
00:17:18.300 | but then we need to tighten up that scope,
00:17:20.220 | and we need to actually filter down
00:17:22.940 | so that we don't have so much irrelevant or noisy results
00:17:26.980 | within what we're returning.
00:17:28.820 | So, yes, we have that broader scope.
00:17:31.340 | We can probably tighten it up,
00:17:33.300 | especially in this use case
00:17:34.820 | where we're searching for a particular keyword,
00:17:36.700 | which is LLAMA2, by using something like hybrid search,
00:17:39.660 | and then following that retrieval step,
00:17:43.080 | returning, I don't know, more records,
00:17:46.340 | let's say, like, five records per query
00:17:48.860 | or 20 records per query,
00:17:50.740 | and we'll end up returning like 50 or so documents.
00:17:53.820 | Then what we'd want to do with that
00:17:55.820 | is look at the original query,
00:17:57.900 | put that into a re-ranking model
00:17:59.940 | alongside those, like, 50 documents,
00:18:02.620 | and re-ranking up to, like,
00:18:04.100 | the top three or top five documents,
00:18:06.900 | and then returning that to our LLM.
00:18:08.940 | And within that sort of pipeline,
00:18:11.700 | that's where something like multi-query
00:18:13.800 | can be really helpful in just helping us
00:18:16.580 | pull in a wider variety of results
00:18:18.880 | that can be useful for us.
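To make the shape of that larger pipeline concrete, here is a rough, hypothetical sketch. `generate_queries`, `hybrid_retrieve`, and `rerank` are placeholders for components you would plug in (e.g. the query-generation chain above, a dense-plus-keyword retriever, and a reranking model), not real library calls:

```python
from typing import Callable, List
from langchain.schema import Document

def multi_query_retrieve(
    question: str,
    generate_queries: Callable[[str], List[str]],            # e.g. the LLM chain above
    hybrid_retrieve: Callable[[str, int], List[Document]],   # dense + keyword search
    rerank: Callable[[str, List[Document], int], List[Document]],  # reranking model
    per_query_k: int = 20,
    top_n: int = 5,
) -> List[Document]:
    # 1) broaden: turn one question into several queries
    queries = generate_queries(question)
    # 2) retrieve many candidates per query via hybrid search
    candidates: List[Document] = []
    for q in queries:
        candidates.extend(hybrid_retrieve(q, per_query_k))
    # 3) de-duplicate the pooled candidates
    unique = list({doc.page_content: doc for doc in candidates}.values())
    # 4) tighten: re-rank against the original question and keep only the top few
    return rerank(question, unique, top_n)
```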
00:18:20.540 | So, that's it for this video.
00:18:23.300 | I hope this has been useful and interesting.
00:18:25.660 | So, thank you very much for watching,
00:18:27.500 | and I will see you again in the next one.
00:18:30.280 | (gentle music)