
LangChain Multi-Query Retriever for RAG


Chapters

0:00 LangChain Multi-Query
0:31 What is Multi-Query in RAG?
1:50 RAG Index Code
2:56 Creating a LangChain MultiQueryRetriever
7:16 Adding Generation to Multi-Query
8:51 RAG in LangChain using Sequential Chain
11:18 Customizing LangChain Multi Query
13:41 Reducing Multi Query Hallucination
16:56 Multi Query in a Larger RAG Pipeline

Transcript

Today, we're going to be talking about another method that we can use to make retrieval for LLMs better. We're going to be taking a look at how to do multi-query within LangChain. This is a very hands-on video. I'm going to almost jump straight into the code, but in a future video, I will talk a little bit more about multi-query and maybe a more fully-fledged retrieval system that uses multi-query alongside other components as well.

But for now in this video, I just want to introduce you to multi-query, so let's jump straight into it. Let's have a very quick look at what multi-query actually is. So typically in retrieval, what we're going to do is we're going to take a single query. We're going to throw that into our RAG pipeline.

It's going to go to our vector database and return a few items, right? So this single query gets turned into a query vector, that is mapped to some other vectors, and we return them. The idea behind multi-query is that, rather than just having this single query, we actually pass it into an LLM, and that LLM will generate multiple queries for us.

So let's say it will generate three. We then have these multiple queries that get translated into query vectors. And the idea is that there is some variety between them. So rather than just identifying, you know, a single point in vector space that is relevant to us, we might identify three points within the vector space that are relevant to us.

And we naturally pull in a higher variety of records using this technique. So "what is LLaMA?" may become three different questions, and we'll see some examples of that later on in this video as well. But that is the core idea. We're searching a wider or broader vector space for some answers to our query.

So as usual, here is the code. We're going to skip the first part of it. I will just point out the libraries we need installed here. The dataset we're using here is this AI arXiv chunks dataset. You probably have seen it before if you've been watching recent videos. And all I'm doing here is setting everything up.

So I'm setting up my OpenAI API key and OpenAI embeddings. This is all done via LangChain. I'm creating my Pinecone index here. Again, instructions, if you need them, are there; I'm not going to go through them. And then we would populate our index. Now the full length of the documents is, you know, it's not huge, but it can take a little bit of time, especially depending on your internet connection.

So if needed, you can just speed things up by taking like the first 5,000 documents. Results won't be as good because that means we have less data to retrieve. But if you just want to follow along, I would recommend doing that. Basically, you will get your indexing done in like a minute or so if you do that.
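
For reference, a rough sketch of that setup might look like the following. The dataset ID, field names, and index name here are assumptions rather than taken directly from the notebook, and the Pinecone client shown is the older pre-v3 style that was current when this was recorded:

import os
import pinecone
from datasets import load_dataset
from langchain.embeddings.openai import OpenAIEmbeddings

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"
embed = OpenAIEmbeddings(model="text-embedding-ada-002")

# Pre-v3 Pinecone client style
pinecone.init(api_key="YOUR_PINECONE_API_KEY", environment="YOUR_PINECONE_ENV")
index_name = "langchain-multi-query-demo"  # hypothetical index name
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536, metric="cosine")
index = pinecone.Index(index_name)

# Load the chunked AI arXiv dataset (dataset ID and field names are assumptions)
# and keep only the first 5,000 chunks so indexing finishes quickly.
data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data = data.select(range(5000))

batch_size = 100
for i in range(0, len(data), batch_size):
    batch = data[i:i + batch_size]
    texts = batch["chunk"]  # assumed name of the text field
    ids = [f"{doi}-{cid}" for doi, cid in zip(batch["doi"], batch["chunk-id"])]
    embeds = embed.embed_documents(texts)
    metadata = [{"text": t} for t in texts]
    index.upsert(vectors=list(zip(ids, embeds, metadata)))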

And here is where we actually want to sort of dive into the notebook. So we are going to be doing multi-query in LangChain, as it says. And the first thing we need to do for this is to initialize a vector store object in LangChain. All right, so here we're going to be using the Pinecone vector store.

That's what we've initialized above. Of course, if needed, if you're using something else, you just swap that in there. All right, so we have our vector store. We also need an LLM. So this LLM is going to be the thing that both generates our queries and also generates the answer to our query at the end of the RAG pipeline.

So we initialize that as well. Then what we can do here is initialize this multi-query retriever. So as usual, LangChain kind of has everything in there. So you already have a specific retriever that is used for multi-query. We'll see how we can customize that towards the end of the video, but this is what we're starting with.

So the multi-query retriever, like most retrievers, requires a vector store to act as its retriever. And in this case, because we're generating the multiple queries, we also need the LLM in there as well. Here, we're just setting the logging level. So this is so that we can see the queries that we're generating from the multi-query retriever.
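
Pieced together, that setup is only a few lines. Here's a sketch, assuming the LangChain and Pinecone versions that were current when this was recorded, and reusing the index and embed objects from the setup above:

import logging

from langchain.vectorstores import Pinecone
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

# Wrap the existing Pinecone index as a LangChain vector store; "text" is
# the metadata field holding each chunk's content.
vectorstore = Pinecone(index, embed.embed_query, "text")

# The LLM that generates the query variations (and, later, the final answer).
llm = ChatOpenAI(temperature=0)

# The built-in multi-query retriever takes the vector store (as a retriever)
# plus the LLM that will write the alternative queries.
retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(), llm=llm
)

# INFO-level logging prints the generated queries when the retriever runs.
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)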

You don't need this, but if you would like to see what is actually going on with generating queries, you probably should. So our question is going to be, tell me about LLAMA2. Okay, let's use our multi-query retriever and see what happens. Okay, so we get our logging here. We can see the generated queries that we have.
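
Running it comes down to something like this sketch, reusing the retriever from above:

question = "tell me about llama 2"

# One call generates the alternative queries, retrieves documents for each
# of them, and deduplicates the combined results.
docs = retriever.get_relevant_documents(query=question)
len(docs)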

So query number one is, what information can you provide about LLAMA2? Okay, so it's taken our initial query here. That is the first question that it will search with. Then we have, could you give me some details about LLAMA2? And three, I would like to learn more about LLAMA2.

Can you help me with that? So that's, you know, I think that's kind of cool, but at the same time, hey, you know, there's not much variety between these questions. Like you're going to get slightly different results, but not significantly, because the semantic meaning between these is not that different, right?

And you can kind of see that here. So this is the number of unique documents that we're returning. The default number of documents that is returned for each query is three, right? So in reality, we are returning nine documents here, but only five of those are actually unique. So there's a lot of overlap between these queries, but nonetheless, we still have brought in an extra two queries compared to if we just did a single query.

So, you know, we are at least expanding the scope of our search a little bit here, but we can modify that, and I will show you how to do that pretty soon to broaden the scope further. But yes, here we can see the documents that were returned. They're not formatted too nicely, but we can see that they're relevant.

So we know that this one here is actually coming from the LLAMA2 paper, and it is talking about, we develop and release LLAMA2, a 7 to 70 billion parameter LLM, right? So it's giving us some information there. The next one, which is here, is, I think, actually talking about llamas, the animal, yeah, here.

It's talking about alpacas and llamas and so on. I'm not sure what this one is. I mean, it's talking about LLMs, but it's just not talking about LLAMA in the context that we want. We have another one here, the LLAMA2 paper again. So we're getting something that is relevant, hopefully.

We develop and release LLAMA2, and then there, generally perform better than existing open-source models. Okay, so we're getting more information there. Then chain-of-thought prompting. Here we get, again, something talking about the animals. And then this final one here is the base paper. And this one is talking about Stanford Alpaca, an instruction-following LLaMA model, right?

So that one is relevant. So we have a few results here, not all of them relevant, but for the most part, we can work with that. So let's come down to here and see how we actually implement multi-query into a full RAG pipeline. We're gonna do this, in this video at least, using LangChain and its sort of standard way of doing things.

In another future video, we'll look at doing it sort of outside LangChain as well, just so we can compare. Okay, so to do that, within RAG here, we've already built the retrieval part. All right, so that's what I just showed you, the multi-query retriever. We still need the augmentation and generation parts.

So to do that, we set up this QA prompt, a question-answering prompt, which just has some instructions; then we feed in the contexts that we get from the retrieval part, and then we add in our question. So I'm gonna run that. And this is how we can feed the documents that we've got from before, the ones I just showed you, into that QA chain directly.
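
As a sketch, that augmentation-plus-generation step might look like this; the prompt wording is an approximation rather than the exact text from the notebook:

from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Approximate QA prompt: instructions, retrieved contexts, then the question.
QA_PROMPT = PromptTemplate(
    input_variables=["query", "contexts"],
    template="""You are a helpful assistant that answers user questions using
the contexts provided. If the question cannot be answered using the
information provided, say "I don't know".

Contexts:
{contexts}

Question: {query}""",
)
qa_chain = LLMChain(llm=llm, prompt=QA_PROMPT)

# Feed the documents retrieved earlier straight into the QA chain.
out = qa_chain(
    inputs={
        "query": question,
        "contexts": "\n---\n".join(d.page_content for d in docs),
    }
)
print(out["text"])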

Okay, and we get this answer here. So LLAMA2 is a collection of pretrained and fine-tuned language models, ranging in scale from 7 to 70 billion parameters. There's some weird formatting here, that's from the source data. They're optimized for dialogue use cases, developed and released by so-on and so-on. All right, so there's quite a bit of information in there, which is useful.

So it does work. Let's see if we can, you know, let's see what else we can do. So I'm gonna put all this together into a single sequential chain. So this is like LangChain's way of just putting things together. So rather than me kind of like writing some code to handle this stuff, I'm kind of chaining things together with LangChain's approach of doing it.

Honestly, whichever approach you go with, it's up to you. Depending on what you're doing, it might be easier just to write a function that handles all this stuff. But again, it's, you know, it's up to you. This is a LangChain way of doing it, if you'd like to do so.

So for the retrieval part, we can't connect the retriever directly to the generation part because we need to format the contexts that come out of it. So what I have done here is I've defined this function, which does the retrieval, and then also does the formatting for us and then returns it.

And then I'm wrapping this retrieval transform function into what's called a TransformChain. Okay, it's basically like a custom chain in LangChain; that's the way I would view it. So the input into this is going to be a question, which we set up here, and the outputs are going to be query and contexts, which we have set up here.

Now, one thing that you can't do with this, or at least in the next part here, is have an input variable and an output variable with the same name. All right, so that's why I'm calling this one question and this one query. If I put question here, I'm going to get an error.

So we just need to be wary of that. Now that we have our transform chain for retrieval and we have our QA chain from before, we wrap all of this into a single sequential chain and that gives us our RAG pipeline in LangChain. So let me run this and this.
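
Put together, the whole thing looks roughly like the sketch below, reusing the retriever, llm, and qa_chain objects from the earlier snippets:

from langchain.chains import TransformChain, SequentialChain

def retrieval_transform(inputs: dict) -> dict:
    # Run the multi-query retriever, then format the documents into a single
    # string that the QA prompt can consume.
    docs = retriever.get_relevant_documents(query=inputs["question"])
    return {
        "query": inputs["question"],
        "contexts": "\n---\n".join(d.page_content for d in docs),
    }

# Wrap the function as a chain so it can be composed with the QA chain.
# Note the input is "question" while the output is "query": the same name
# can't be used for both an input and an output variable here.
retrieval_chain = TransformChain(
    input_variables=["question"],
    output_variables=["query", "contexts"],
    transform=retrieval_transform,
)

# Chain retrieval and generation into a single RAG pipeline.
rag_chain = SequentialChain(
    chains=[retrieval_chain, qa_chain],
    input_variables=["question"],
    output_variables=["query", "contexts", "text"],
)

out = rag_chain({"question": "tell me about llama 2"})
print(out["text"])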

With that, we can just perform the full RAG pipeline by calling this method here. Okay, so we input our question. You can see, I still have the logging on, so you can see the output there at the top. Don't know why it's in this weird color, but okay. So at the top, we have the same things we saw before, those three questions, and then this is the output.

All right, it's the same as what we have before 'cause we're actually just doing the same thing. We just wrapped it into this sequential chain from LangChain. Okay, cool. So that's the full RAG pipeline. Now let's take a look at modifying our prompt in order to change the behavior of how we're generating these queries.

And I think this is very important and probably the most important part of this video, which is, okay, how does it behave with different queries? So we're gonna start with this prompt A. So we can look at this. I'm just saying, okay, generate three different search queries that aim to answer the user question from multiple perspectives.

Each query must tackle the question from a different viewpoint. We want to get a variety of relevant search results. Okay, so what I'm trying to do with this prompt is add more variety to the queries that are being created. So that is the idea here. Now we can see how that performs.

We come down to here. I'm going to put it into here. So this is kind of like our custom approach to doing this. We have this LineList object here and an output parser. Essentially, what this is going to do is: our LLM here is going to generate the questions separated by newline characters.

This output parser here is going to look for new lines and it's gonna separate out the queries based on that. So it's just parsing the output we generate here. Okay, cool. So we can run that. And what I'm gonna do is reinitialize the retriever with our new LLM chain here.
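
A sketch of that custom setup, following the pattern from LangChain's MultiQueryRetriever documentation and reusing the llm and vectorstore from above; the prompt wording is approximated from the video:

from typing import List

from pydantic import BaseModel, Field
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.output_parsers import PydanticOutputParser
from langchain.retrievers.multi_query import MultiQueryRetriever

# Parse the LLM output into a list of queries by splitting on newlines.
class LineList(BaseModel):
    lines: List[str] = Field(description="Lines of text")

class LineListOutputParser(PydanticOutputParser):
    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        return LineList(lines=text.strip().split("\n"))

output_parser = LineListOutputParser()

# Prompt A: push for more variety between the generated queries.
template = """Your task is to generate 3 different search queries that aim to
answer the user question from multiple perspectives. Each query MUST tackle
the question from a different viewpoint; we want to get a variety of RELEVANT
search results. Provide these alternative questions separated by newlines.
Original question: {question}"""

llm_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(input_variables=["question"], template=template),
    output_parser=output_parser,
)

# Reinitialize the multi-query retriever with the custom query-generation
# chain; parser_key points at the field of the parsed output holding the queries.
retriever = MultiQueryRetriever(
    retriever=vectorstore.as_retriever(),
    llm_chain=llm_chain,
    parser_key="lines",
)
docs = retriever.get_relevant_documents(query="tell me about llama 2")
len(docs)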

And yeah, we run this. And we'll just see the sort of queries that we get, okay? So we get: what are the characteristics and behavior of llamas? How are llamas used in agriculture and farming? What are the different breeds of llamas and their unique traits? So yes, we've definitely got more diverse questions here, but now, you know, it sees 'llama 2'.

And it's like, okay, you want me to ask some unique, diverse questions about llamas, perfect. So there's kind of like pros and cons to doing this. Obviously the results we get here are not going to be as relevant to our query. Although we actually still do get the llama paper because honestly, I don't think there's much in there that talks about llamas in agriculture.

So yeah, that doesn't really work. So let's try another prompt. So what I want to point out here is that when you're trying to increase the variety of the queries that are generated by your multi-query system, the more you increase that variety, the more likely it is to hallucinate or just kind of go down the wrong path.

And that's exactly what we just saw there. So now what I'm going to do in a second prompt is be more specific. I'm basically saying the same as what I said in that first prompt, but I just added this: the user questions are focused on LLMs, machine learning, and related disciplines, right?
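
The only thing that changes is the prompt itself. A sketch, reusing the output parser and the other pieces from the previous snippet:

# Prompt B: same instructions as before, plus one extra line of domain context
# (wording approximated from the video).
template_b = """Your task is to generate 3 different search queries that aim to
answer the user question from multiple perspectives. The user questions are
focused on Large Language Models, Machine Learning, and related disciplines.
Each query MUST tackle the question from a different viewpoint; we want to get
a variety of RELEVANT search results. Provide these alternative questions
separated by newlines.
Original question: {question}"""

llm_chain = LLMChain(
    llm=llm,
    prompt=PromptTemplate(input_variables=["question"], template=template_b),
    output_parser=output_parser,
)
retriever = MultiQueryRetriever(
    retriever=vectorstore.as_retriever(),
    llm_chain=llm_chain,
    parser_key="lines",
)
docs = retriever.get_relevant_documents(query="tell me about llama 2")
len(docs)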

So I'm just giving the LLM some context as to what it should be generating queries for. And well, let's see, let's see if this helps our LLM. So we put this in, let's run this, and then run our retriever again. Okay, so we have more variety here, seven, which is more than the five we had for the first one.

And now we can see, okay, what are the key features and capabilities of large language model LLAMA2? Okay, so that's cool. How does LLAMA2 compare to other large language models in terms of performance and efficiency? Okay, what are the applications and use cases of LLAMA2 in the field of machine learning and NLP?

Right, so I personally think those results are way better than what we were getting before. And we can see the docs that are being returned here. It's not a big dataset, so I don't expect anything outstanding here, but we should at least maybe see less of the agriculture documents in here.

So this one is definitely talking about LLAMA2. We can go on to the next one, which is here. Even large language models are brittle, social bias. So this one, I don't see anything relevant for LLAMA2 here, unless I'm missing something. Yeah, I don't think so. So that one isn't so relevant.

Let's see this one. Okay, so it's talking about LLMs. You have GPT-3 here, LaMDA, Gopher. Okay, all sort of comparable LLMs, comparable to some degree. So, okay, it doesn't talk about LLAMA, but at least we have LLMs in there. That's good. Here, it's coming from the LLAMA2 paper. So, "These closed product LLMs are heavily fine-tuned to align with human preferences," which "greatly enhances their usability and safety." Okay, "In this work, we develop and release LLAMA2," and then, okay.

All right, so it's talking about LLAMA2. Here, we are talking about the original LLAMA model, okay, which I think is still relevant here. Okay, cool. And here, we have another paper. It's not specific to LLAMA1 or LLAMA2, and it is an older one.

It's just talking about ML and NLP in general. Okay, and this one's talking, again, generally about LLMs. So, we have sort of a mix of LLMs in there, some LLAMA. So, I think we're getting closer to where we need to be, but for sure, it could be better.

Now, we can see from the results here that we've broadened the scope of what we're searching for, which is what we want to do with multi-query, but it still doesn't make a good retrieval system, at least by itself. Multi-query needs to be part of a larger retrieval pipeline because, yes, it broadens the scope of what we're searching for, but then we need to tighten up that scope, and we need to actually filter down so that we don't have so many irrelevant or noisy results within what we're returning.

So, yes, we have that broader scope. We can probably tighten it up, especially in this use case where we're searching for a particular keyword, which is LLAMA2, by using something like hybrid search, and then following that retrieval step, returning, I don't know, more records, let's say, like, five records per query or 20 records per query, and we'll end up returning like 50 or so documents.

Then what we'd want to do with that is look at the original query, put that into a re-ranking model alongside those, like, 50 documents, re-rank down to, like, the top three or top five documents, and then return those to our LLM. And within that sort of pipeline, that's where something like multi-query can be really helpful in just helping us pull in a wider variety of results that can be useful for us.
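
As pseudocode, that larger pipeline might look something like the sketch below. The rerank function here is a placeholder for whatever re-ranking model you plug in (for example a cross-encoder), not a real API:

# Pseudocode sketch of the wider retrieve-then-rerank pipeline described above.
def retrieve_and_rerank(question: str, per_query_k: int = 20, final_k: int = 5) -> list:
    # 1. Broaden the scope: multi-query (optionally combined with hybrid
    #    search) returns per_query_k records for each generated query.
    wide_retriever = MultiQueryRetriever.from_llm(
        retriever=vectorstore.as_retriever(search_kwargs={"k": per_query_k}),
        llm=llm,
    )
    candidates = wide_retriever.get_relevant_documents(query=question)

    # 2. Tighten the scope: score every candidate against the *original*
    #    question with a re-ranking model and keep only the best few.
    reranked = rerank(query=question, documents=candidates)  # hypothetical helper

    # 3. Pass just those top documents on to the LLM for generation.
    return reranked[:final_k]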

So, that's it for this video. I hope this has been useful and interesting. So, thank you very much for watching, and I will see you again in the next one. Bye.