
RAG But Better: Rerankers with Cohere AI


Chapters

0:00 RAG and Rerankers
1:25 Problems of Retrieval Only
4:32 How Embedding Models Work
6:34 How Rerankers Work
8:20 Implementing Reranking in Python
13:11 Testing Retrieval without Reranking
15:21 Retrieval with Cohere Reranking
21:54 Tips for Reranking

Transcript

Retrieval Augmented Generation, or RAG, has become a bit of an overloaded term. It promises quite a lot, but when we actually start implementing it, especially when we're new to doing this stuff, the results are sometimes amazing, but more often than not they're not as good as what we were expecting.

And that is because RAG, as with most tools, is very easy to get started with, but then it's very hard to actually get good at implementing. The truth is that there is a lot more to RAG than just putting documents into a vector database and then retrieving documents from that vector database and putting them into an LLM.

In order to make the most out of RAG, you have to do a lot of other things as well. So that's why we're starting this series on how to do RAG better. In this first video, we're going to be looking at how to do re-ranking, which is probably the easiest and fastest way to make a RAG pipeline better.

Now I'm going to be talking throughout this entire series within the context of RAG and LLMs, but in reality, this can be applied to retrieval as a whole. If you have a semantic search application, or maybe even recommendation systems, you can actually apply not all, but a lot of what we're going to be talking about throughout the series, including re-ranking, which we'll go through today.

So before jumping into the solution of re-ranking, I'm going to talk a little bit about the problem that we face with just retrieval as a whole, and then specific to LLMs. So to begin with retrieval, to ensure fast search times, we use something called vector search. That is, we transform our text into vectors, place them all into a vector space, and then compare their proximity to what we call a query vector, which is just a vector version of some sort of query, and see which ones are the closest together, and we return them.

Now for vector search to work, we need vectors, which are essentially just compressed representations of semantic meaning behind that text. Because we're compressing that information into a single vector, we will naturally lose some information, but that is the cost of vector search, and for the most part, it's definitely worth paying.
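As a rough illustration of that proximity comparison, here is a minimal sketch of cosine similarity over toy vectors (the numbers are made up; in practice the vectors come from an embedding model):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: dot product divided by the product of magnitudes.
    # 1.0 means the vectors point in the same direction, 0.0 means unrelated.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" (real ones have hundreds or thousands of dimensions).
query_vec = np.array([0.1, 0.9, 0.2, 0.4])
doc_a = np.array([0.2, 0.8, 0.1, 0.5])  # close to the query in vector space
doc_b = np.array([0.9, 0.1, 0.7, 0.0])  # far from the query in vector space

print(cosine_similarity(query_vec, doc_a))  # higher score -> more similar
print(cosine_similarity(query_vec, doc_b))  # lower score -> less similar
```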

Vector search can give us very good results. But what I tend to find with vector search and RAG with LLMs is that, okay, I get some good results at the top, but there's actually another result in, let's say, position 17, for example, that actually provides some very relevant context for the question that I have asked.

So in this example, let's say this is position 17 down here. We have that relevant item, but what we would typically do when we're doing RAG with LLMs is we're returning the top three items. So we're missing out on these other relevant records down here. So what can we do?

The simplest option is to just return everything and send all of it into our LLM. So over here, we have our LLM. Now that's okay, but LLMs have limited context windows, so we're going to fill that context window very quickly if we just return everything. So we want to return a lot of records from retrieval, so that we have high retrieval recall, but then limit the number of records we actually send to our LLM.

And that's where re-ranking would come in. So by adding a re-ranker, we can still use all of those records, we still get to return all of these from our retrieval component, but then the records that we actually send to our LLM are just these here, these top three. And the re-ranker has gone ahead and handled the reordering of our records to get the most relevant items at the top, so we can then send all of that to our LLM.

Now the question here is, is a re-ranker really going to help us? Can we not just use a better retrieval model? And yes, we can use a better retrieval model, and that's something we'll be talking about in a future video. But there is a very good reason why a re-ranker can generally perform better than an encoder or retrieval model.

So let's talk about that very quickly. This is what an encoder model is doing. So this is an encoder/retriever, like OpenAI's text-embedding-ada-002. What it's doing is we have a transformer model, and these are the same transformer model. The reason I've got two of them on the screen right now is that you use your first iteration or inference step of the transformer model to create your embedding for document A.

And from that you get your vector A. So that is the compressed information that we can then take across to our vector database, which would kind of be like this point here. That's in our vector space. And then in another inference step, we're going to do the same for document B.

We get vector B, and there we go. We have that in our vector search, and we can then compare the proximity of those two records to get the similarity. The metric we'd be using here, the computation, would be either dot product or cosine similarity in the case of ada-002.

Now you have to consider that the computational cost of something like cosine similarity is much lower than that of one of these transformer inference steps. So the reason that we use this encoder architecture is that we can do all of the transformer inferences at the start, when we're building our index. That takes a long time, because transformers are big, heavy things.

They take a lot of computation. Whereas the cosine similarity step at the end, which we can run at the time when our user is making a query, is very fast. So it's kind of like we're doing the heavy part of the computation to compare documents at the very start of building the index.

And that means we can do very quick, simple computations at user query time. And that is different to what we do with re-ranking. So here, this transformer is our re-ranker, and it runs at query time. So let's say document A here, maybe that's our query, and document B is one of the documents in the database.

We're saying to the transformer, okay, how similar are these two items? So to compare the similarity in this case, we're running an entire transformer inference step. And notice, because we're doing everything in a single transformer step, we're not losing as much information as we are with this one, where we're compressing everything into vectors.

That means that theoretically, we lose less information, so we can get a more accurate similarity score here. But at the same time, it's way slower. So it's kind of like, you know, on one side, you have fast and, you know, relatively accurate. And then on this side, you have slow, but super accurate.

So the idea with the sort of re-ranking approach to retrieval is that we use our retrieval encoder step to basically filter down the total number of documents to just, you know, in this example, let's say there's like 25 documents there. 25 documents is not too much. So feeding them into our re-ranker is actually going to be very fast.

Whereas if we fed all documents into our re-ranker, we'd be waiting, I don't know, like a really long time, which we don't want to do. So instead, we filter down the encoder, feed them into the re-ranker, and then we'll get like three amazing results super quickly. So that is how the re-ranking approach works.
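To make that two-stage pattern concrete, here is a hedged sketch using open-source models from the sentence-transformers library. This is just an illustration of retrieve-then-rerank, not the OpenAI/Cohere stack used in the rest of the video, and the model names are simply common defaults:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Stage 1: bi-encoder (the retriever). Documents are embedded once, up front.
retriever = SentenceTransformer("all-MiniLM-L6-v2")
docs = [
    "RLHF aligns model outputs with human preferences.",
    "The Llama 2 paper describes pretraining and fine-tuning.",
    "Cosine similarity compares the angle between two vectors.",
]
doc_embeddings = retriever.encode(docs, convert_to_tensor=True)

query = "Why would we want to use RLHF?"
query_embedding = retriever.encode(query, convert_to_tensor=True)

# Cheap vector comparison at query time: shortlist the top candidates.
hits = util.semantic_search(query_embedding, doc_embeddings, top_k=3)[0]
candidates = [docs[hit["corpus_id"]] for hit in hits]

# Stage 2: cross-encoder (the re-ranker). One full transformer pass per
# (query, document) pair, so we only run it on the shortlist.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
print(reranked)
```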

Let's see how we'd actually implement that in Python. Okay. So we're going to be working through this notebook here. We need Hugging Face datasets, which is where we get our dataset from, OpenAI for creating our embeddings, Pinecone for storing those embeddings, and Cohere for our re-ranker. We're going to start by downloading our dataset, which is this AI arXiv dataset.
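The setup is roughly this; the dataset name below is my best guess at the pre-chunked AI arXiv dataset described next, so treat it as an assumption:

```python
# !pip install datasets openai pinecone-client cohere

from datasets import load_dataset

# Pre-chunked arXiv papers about LLMs (dataset name assumed).
data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
print(data)
```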

It's pre-chunked, so I've already chunked it into chunks of roughly 300 tokens, something like that. And it's basically just a dataset of arXiv papers. You can see a few of them here that are related to LLMs. Essentially, I gathered it by taking some recent papers that are well-known, like the Llama 2 paper, the GPT-4 paper, GPTQ, and so on, extracting the papers they reference, extracting those papers as well, and just going around in a loop like that.

So yeah, we have a fair few records in there. It's not huge, but it's not small either: roughly 41,500 chunks, where each chunk is roughly this size. So I'm just going to reformat the data into the format we need. This is basically the Pinecone format: you have an ID, the text, which we're going to convert into embeddings, and metadata.
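The reformatting looks something like this; the source field names are assumptions based on the dataset description above:

```python
# Map each chunk into the {id, text, metadata} structure we'll send to Pinecone.
data = data.map(lambda x: {
    "id": f'{x["doi"]}-{x["chunk-id"]}',  # assumed field names
    "text": x["chunk"],                   # the chunk text we will embed
    "metadata": {
        "title": x["title"],
        "text": x["chunk"],               # keep the text so queries return it
        "url": x["source"],
    },
})
# Drop everything except the three fields we need.
data = data.remove_columns(
    [c for c in data.column_names if c not in ("id", "text", "metadata")]
)
```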

We're not going to use metadata in this example, but it can be useful, and maybe it's something that we'll look at in a future video in this series as well. Next we need to define our embedding function, that is, the encoder model that we're going to be using.

For that, I'm going to be using OpenAI's text-embedding-ada-002. It's easy to use and has fairly good performance, although there are better models, and that's something we will also be talking about in the future. So I'm going to just run that, and I will need to enter my OpenAI API key. To get that, you need to head on over to platform.openai.com and get your API key.

I'm going to enter mine in here, and yeah. So with that, we should be able to initialize our embedding model, which we are doing here. I'm not going to go through all these functions, because I've done it a million times before. I think people are probably getting bored of that part of these videos.
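For reference, a minimal sketch of that embedding setup, assuming the pre-1.0 `openai` Python SDK (the 1.x client exposes the same endpoint through `client.embeddings.create` instead):

```python
import os
from getpass import getpass

import openai

openai.api_key = os.getenv("OPENAI_API_KEY") or getpass("OpenAI API key: ")

embed_model = "text-embedding-ada-002"

def embed(texts: list[str]) -> list[list[float]]:
    # One API call embeds a whole batch; ada-002 returns 1536-dimensional vectors.
    res = openai.Embedding.create(input=texts, model=embed_model)
    return [record["embedding"] for record in res["data"]]
```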

So I'm just going to run through those bits very quickly. I'm going to get my Pinecone credentials, again from app.pinecone.io, and I will run that, enter my API key first, and then I want my Pinecone environment, which I find next to my API key in the console. So mine was this.

Yours will probably be something like gcp-starter or along those lines. OK, cool. So here, I'm going to create an index if it doesn't already exist. My index does actually already exist, and I'm not going to recreate it, because it takes a little bit of time, or at least it did the other day when I was creating this.

So you can see that I already have like the 41,000 records in there. If you're looking at that, you should probably see nothing in yours, unless you've just run this or you're connecting to an existing index. OK. So this is the code I use to create my index, right?
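The index-creation step looks roughly like the following, assuming the older `pinecone-client` that uses `pinecone.init` with an environment (newer client versions use a `Pinecone` class and serverless specs instead); the index name here is just an example:

```python
import os
from getpass import getpass

import pinecone

pinecone.init(
    api_key=os.getenv("PINECONE_API_KEY") or getpass("Pinecone API key: "),
    environment="gcp-starter",  # your environment, shown next to the API key
)

index_name = "rerankers-demo"  # example name

if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=1536,    # must match ada-002's embedding dimensionality
        metric="cosine",
    )

index = pinecone.Index(index_name)
print(index.describe_index_stats())
```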

It's pretty straightforward. The one thing that is maybe a little more complicated, but not that complicated, is that we're actually creating the embeddings here. I think I defined an embedding function up here and ended up not using it for some reason, so just ignore that. So in here, this is where we're doing our embeddings, but we're wrapping it within an exponential backoff function to avoid rate limit errors, which I was hitting a lot the other day.

So essentially, it's going to try and embed. If it gets a rate limit error, it's going to wait. And it's going to keep doing that for a maximum of five retries. Hopefully, you shouldn't be hitting five retries. If so, there's probably something wrong. So yeah, you should be OK there.
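Put together, the embed-with-backoff and upsert loop is roughly this (same SDK assumptions as above; `openai`, `data`, `embed_model`, and `index` come from the earlier sketches):

```python
import time

from tqdm.auto import tqdm  # progress bar

def embed_with_backoff(texts, retries=5):
    # Retry with exponential backoff if we hit an OpenAI rate limit.
    for attempt in range(retries):
        try:
            res = openai.Embedding.create(input=texts, model=embed_model)
            return [record["embedding"] for record in res["data"]]
        except openai.error.RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError("Exceeded maximum retries for embedding request")

batch_size = 100
for i in tqdm(range(0, len(data), batch_size)):
    batch = data[i:i + batch_size]  # slicing a datasets.Dataset gives a dict of lists
    embeds = embed_with_backoff(batch["text"])
    # Upsert (id, vector, metadata) tuples into the Pinecone index.
    index.upsert(vectors=list(zip(batch["id"], embeds, batch["metadata"])))
```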

But if you are hitting those rate limit errors, you might be waiting a little bit of time for this to finish. If not, it should finish quite quickly. I was hitting tons of rate limit errors the other day, and it ended up taking something like 40 minutes, I think.

So yeah, just be aware of that. It's going to depend on the rate limits you have set on your OpenAI account. Now we want to test retrieval without Cohere's re-ranking model first. So I'm going to ask this question using get_docs. Yeah, I'm just querying. Again, I'm not going to go through everything.
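A sketch of that get_docs helper, building on the `embed` function and `index` from the earlier sketches; the dictionary-of-positions return value is just a convenient way to compare orderings later:

```python
def get_docs(query: str, top_k: int) -> dict:
    # Embed the query with the same model we used for the documents.
    xq = embed([query])[0]
    # Retrieve the nearest chunks from Pinecone, with their metadata.
    res = index.query(vector=xq, top_k=top_k, include_metadata=True)
    # Map each chunk's text to its retrieval position (0 = closest match).
    return {match["metadata"]["text"]: i for i, match in enumerate(res["matches"])}

query = (
    "can you explain why we would want to do "
    "reinforcement learning with human feedback?"
)
docs = get_docs(query, top_k=3)
print("\n---\n".join(docs.keys()))
```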

I'm just going to return, for now, the top three records. So my question is: can you explain why we would want to do reinforcement learning with human feedback? That's what this is here. RLHF is a training method that is a big part of why ChatGPT was so good when it was released.

So, OK, why would I want to do that? I think the first answer here-- and the scraping that I did is not perfect, so I apologize for that, but for the most part I think we can read it. So: it's a powerful strategy for fine-tuning large language models, enabling significant improvements in their performance, iteratively aligning the model's responses more closely with human expectations and preferences.

It can help fix issues of factuality, toxicity, and helpfulness that cannot be remedied by simply scaling up LMs. OK, so I think that's a good answer, like number one there. And then let's have a look at the second one-- increasingly popular technique for reducing harmful behaviors, OK, can significantly change metrics-- doesn't necessarily tell me any benefits there, OK?

So the only relevant bit of information in this second sentence is increasingly popular technique for reducing harmful behaviors, OK? So just one little bit there. And then number three, I think-- like, I don't see anything in this that tells me why I should use RLHF. It's telling me about RLHF, but isn't telling me why I'd actually want to use it.

So these results could be better, all right? So number one, good. Number two, it's kind of relevant. Number three, not so much. So can we get better than that? Yes, we can. We just need to use reranking. So I'm going to come down to here, and we're going to initialize our reranking model.

So for that, we need another API key, which is Cohere's API key. This should be free; the Pinecone and Cohere ones will be free, while the OpenAI one, I think, you need to pay a little bit for. So yeah, just be aware of that. But like I said, later on in this series we'll be talking about other alternatives to OpenAI for embedding models, which may actually be a fair bit better.

So I'm going to go to this website here, dashboard.cohere.com/api-keys. You will probably need to sign up, make an account, and all of that. Then you will get to your Cohere dashboard and a new trial key. I'm going to call it something-- I don't know, 'demo'-- and generate the trial key. OK, and I'm going to put it into here.

Cool. So we now want to rerank stuff. Let's try. So I'm just going to rerun the last results, because I only got three here. I'm going to rerun it with 25. So yeah, we have many more now. And I'm just going to re-rank those 25. And I want to see what was re-ranked.

I just want to compare those results. So when we re-rank, we get back Cohere's rerank response object, and we can access the text from each result like this. OK, so you can see we get this output there. And the way that I've set up the docs object returned from the last step, you can see it's a dictionary where the text maps to the position.
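Roughly what that looks like in code, assuming the 2023-era (v4) `cohere` SDK, where the rerank response is iterable and each result carries the reranked document (newer versions of the client return results a little differently); `get_docs` is the helper sketched earlier:

```python
import os
from getpass import getpass

import cohere

co = cohere.Client(os.getenv("COHERE_API_KEY") or getpass("Cohere API key: "))

query = (
    "can you explain why we would want to do "
    "reinforcement learning with human feedback?"
)
docs = get_docs(query, top_k=25)  # retrieve 25 candidates this time

rerank_results = co.rerank(
    query=query,
    documents=list(docs.keys()),   # just the chunk texts
    top_n=25,
    model="rerank-english-v2.0",
)

# Compare the re-ranked order against the original retrieval positions.
for new_position, result in enumerate(rerank_results):
    text = result.document["text"]
    print(f"reranked {new_position} <- originally at position {docs[text]}")
```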

The reason I've done that is so that I can just very quickly see what the reordered position after re-ranking is. So you can see that, OK, it's kept the zero position, like the top result. But then it's swapped out one and two for these two items here, OK? So I'm going to define this function here.

It's basically just going to do everything we've just gone through: it's going to query, get those results, re-rank everything, and then compare the results for us. I'm going to set a top_k of 25, so we return 25 records from our retrieval step, and then keep just the top three from the re-ranking step; a sketch of that helper is below.
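Here's a hedged sketch of that comparison helper, built from the pieces above:

```python
def compare(query: str, top_k: int, top_n: int):
    # Stage 1: retrieve top_k candidate chunks from Pinecone.
    docs = get_docs(query, top_k=top_k)
    # Stage 2: re-rank them with Cohere and keep only the top_n.
    reranked = co.rerank(
        query=query,
        documents=list(docs.keys()),
        top_n=top_n,
        model="rerank-english-v2.0",
    )
    # Show how each kept result moved, and print the ones that changed.
    for i, result in enumerate(reranked):
        text = result.document["text"]
        original_position = docs[text]
        print(f"reranked position {i} <- original position {original_position}")
        if original_position != i:
            print(text, "\n")

compare(
    "can you explain why we would want to do "
    "reinforcement learning with human feedback?",
    top_k=25,
    top_n=3,
)
```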

So I'm going to run that comparison for the RLHF query. OK, so zero has remained the same, one has been swapped for 23, and two has been swapped for 14. This won't show us the first results here, because they haven't changed.

So we're looking at these results first. So the original is what we went through before, where it has the one kind of useful bit of information, increasingly popular technique for reducing harmful behaviors in large language models. And then the rest wasn't really that relevant to our specific question, which is basically why would I want to use RLHF?

Now having a look at 23: we've shown it's possible to use RLHF to train LLMs that act as helpful and harmless assistants, OK? So that's useful, OK? That's why we might want to use it. RLHF training also improves honesty, OK? That's another reason to use it. In other words, associated with aligning LLMs, RLHF improves helpfulness and harmlessness by a huge margin, OK?

Another reason why we might want to use it. So OK, three good reasons already. Our alignment interventions actually enhance the capabilities of large models. And yes, I think that's another reason, combined with training for specialized skills without degradation in alignment or performance. Another reason why we should use it, right?

So this here is talking about RLHF, like the previous number-two-ranked context did, but it's way more relevant to our specific question, and that is exactly why we use re-ranking models. Now let's have another look. So this is-- yeah, this one, there was nothing relevant, right? So this is the original.

For our specific question, there wasn't anything relevant in here. The re-ranked one has this. One thing here is that the LLMs are actually reading all of this text, which is kind of impressive; I really struggle to, but anyway. So, "the model outputs safe responses"-- assuming that's talking about RLHF, that's helpful.

We switch entirely to RLHF to teach the model how to write more nuanced responses. So that's a good reason. Comprehensive tuning with RLHF has the added benefit that it may make the model more robust to jailbreak attempts. Another benefit. We do RLHF by first collecting human preference data-- that bit's not so relevant-- annotators write a prompt they believe can elicit unsafe behavior, and then compare multiple model responses to the prompts, selecting the response that is safest according to a set of guidelines.

We use the human preference data to train a safety reward model, and-- OK. So I think the relevant bits here are "make the model more robust to jailbreak attempts" and "teach the model how to write more nuanced responses". So those two are good. The rest of it isn't as relevant, but it's far more relevant than the original, which didn't tell us any benefits of using RLHF.

Cool. Now let's try one more. So what is red teaming? It's like a safety or security testing thing that they apply to LLMs now. It's like stress testing for LLMs. You can see that it hasn't changed the top one again. And I think the responses here were generally not quite as obviously better with re-ranking, but still slightly better.

What I will do is just kind of let you read those. So you have this one here. You can pause and read through if you want. And also this one as well. So again, you can pause and read through if you like. I'm not going to go through all those again.

So that is re-ranking. I think it's pretty clear it can help a lot. I don't have any specific metrics on how much it helps, but just from using it in actual use cases, it helps quite a bit. So I hope this is something that you can also use to improve your retrieval pipelines, particularly when you're using RAG and sending everything to LLMs.

But you should also test it and make sure it is actually helping. For example, if you're using an older re-ranking model, chances are it won't actually be as good as some of the more recent and better encoder models, so you could actually degrade performance if you do that.

So you always want to make sure that you're using kind of like state-of-the-art re-rankers alongside state-of-the-art encoders. And you should see an impact kind of similar to what we saw here with the RLHF question. But anyway, as I mentioned, this is like the first method I would use when trying to optimize an existing retrieval pipeline.

And as you can see, it's super easy to implement: you don't really need to modify other parts of the pipeline, you just need to slot this into the middle. So I'll leave it there for now. I hope this walkthrough has been useful and interesting. Thank you very much for watching, and I will see you again in the next one.

Bye.