
Navigating RAG Optimization with an Evaluation Driven Compass: Atita Arora and Deanna Emery



00:00:00.000 | Hello, everyone. Welcome to our talk. My name is Atita, and I work as a solution architect
00:00:18.320 | at Qdrant, and together with me, I have Deanna. I'm Deanna Emery, and I'm a founding
00:00:24.540 | AI researcher at Quotient AI. Cool. So we would be talking about navigating
00:00:29.160 | RAG optimization with an evaluation-driven compass. So I think the track is about RAG,
00:00:35.020 | so let's talk and extend what we already know and what we've already seen so far.
00:00:39.980 | So in this talk today, we will be discussing some of the key essential topics for anyone
00:00:44.420 | interested in building or productionizing the most popular implementation of generative AI,
00:00:50.160 | that is RAG, Retrieval Augmented Generation. In simple terms, RAG combines the capabilities
00:00:55.460 | of searching and retrieving through the vast amount of information stored in a knowledge
00:01:01.320 | source, usually a vector database. And then we use this information to generate relevant
00:01:07.700 | and coherent responses, leveraging the capabilities of a large language model. We will be talking
00:01:13.680 | through the known challenges of this approach and how you can combat them by adopting evaluation-based
00:01:20.280 | optimization techniques to get the desired results. So without further ado, let's get started.
00:01:27.140 | So yeah, as simple as it's defined, RAG can be implemented in many different ways, as we
00:01:32.960 | can see on the slide. Apologies for the busy slide, as I wanted to cover all the aspects of how
00:01:39.140 | RAG can be implemented. And we will be starting with the simplest one, the naive RAG. This involves
00:01:45.000 | three key steps. First, we split the documents using a specific chunking strategy. Next, we compute
00:01:51.860 | the document embeddings and store them in a vector database. For each user query, we then retrieve
00:01:57.860 | the most relevant document chunks based on our retrieval strategy. Finally, we add these retrieved
00:02:03.640 | chunks to our prompt to generate an answer using a chosen LLM. Advanced versions involve
00:02:10.180 | enhancement to user queries, such as query expansion or rewriting, and post-retrieval treatments like
00:02:17.180 | result re-ranking or fusion. The further advanced version of RAG includes query routing to task-based agents,
00:02:27.380 | and self-improvement modules like DSPy. However, there's one tool that is common in all these
00:02:33.240 | implementations if you notice, no matter if you want to build a naive or agentic RAG, and that is a vector database.
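As a minimal sketch of the naive pipeline just described, assuming the qdrant-client and sentence-transformers libraries; the collection name, chunk size, embedding model, and retrieval limit are illustrative placeholders, not the exact settings used later in the demo notebook.

```python
# Naive RAG sketch: 1) chunk, 2) embed + store in Qdrant, 3) retrieve + build the prompt.
# All parameters here are illustrative.
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # a reasonable general-purpose model (384-dim)
client = QdrantClient(":memory:")                   # swap for a running Qdrant instance in practice

def chunk(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Naive fixed-size character chunking with overlap."""
    return [text[i : i + size] for i in range(0, len(text), size - overlap)]

# 1) Split documents and 2) embed and store the chunks in a vector database.
docs = ["...documentation page one...", "...documentation page two..."]
chunks = [c for d in docs for c in chunk(d)]
client.create_collection(
    collection_name="docs",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
)
client.upsert(
    collection_name="docs",
    points=[
        models.PointStruct(id=i, vector=embedder.encode(c).tolist(), payload={"text": c})
        for i, c in enumerate(chunks)
    ],
)

# 3) Retrieve the most relevant chunks for a query and add them to the prompt.
query = "How do I create a collection?"
hits = client.search(collection_name="docs", query_vector=embedder.encode(query).tolist(), limit=3)
context = "\n\n".join(h.payload["text"] for h in hits)
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
# `prompt` is then sent to the chosen LLM to generate the final answer.
```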
00:02:40.240 | One of the most common use cases for building RAG is knowledge management. There, aspects like retrieval performance,
00:02:47.080 | scalability to support large data volumes, and resource optimization are paramount.
00:02:52.240 | Speaking of vector databases, Qdrant is an open-source vector search database built in Rust, and is purpose-built to support your
00:03:02.500 | generative AI applications built on data at large scale. If you haven't already, please do check it out.
00:03:10.740 | So coming back to our topic, it is worth understanding and acknowledging the challenges that come along with all the goodness of RAG.
00:03:17.100 | I'm repurposing our naive RAG architecture to highlight common possible issues on each level. After all, the first step to solving any problem is to recognize there is one.
00:03:27.360 | To begin with, during the data processing stage, we could have issues with information missing from our data set, or information that fails to get extracted from our source of information.
00:03:38.360 | This would result in incorrect and incomplete responses. On data ingestion, there is a constant battle to determine the optimum chunking
00:03:46.620 | strategy, along with determining a suitable embedding model that basically understands the specifics and jargon used in your data set.
00:03:55.120 | Information retrieval itself is quite interesting and an ever-evolving field. Having spent 17 years in the space myself, I can probably say that
00:04:04.620 | relevancy is an unsolved problem. So the challenge of determining the relevant documents, the retrieval size, or retrieval window, as you may call it, and the order of
00:04:14.880 | documents is unavoidable.
00:04:16.880 | Response generation can face challenges such as incorrect and incomplete answers due to all the previously mentioned issues, and the issue of straying from
00:04:26.880 | the provided context. In addition, our query is also vulnerable to ambiguous or vague questions. There are certainly other challenges,
00:04:37.140 | like generating coherent responses, maintaining user conversations, scaling a RAG for hundreds or thousands of concurrent users,
00:04:44.740 | along with data security and compliance issues. So looks like RAG isn't really a piece of cake after all.
00:04:51.140 | Fortunately, just as with the challenges, we have plenty of improvement techniques for RAG as well. Let's look at them next.
00:04:58.400 | So we saw challenges zooming into data quality, such as data that is missing or fails to get extracted. After all, the
00:05:03.400 | foundation of great responses lies in the richness and accuracy of its information, or context in the case of RAG, which can be
00:05:12.400 | controlled by adopting data cleaning and advanced data extraction methodologies. It is not a bad idea to use a general-purpose embedding model to begin with,
00:05:20.400 | but for added improvement, it would be a good idea to use an embedding model that comprehends the terminology of your domain.
00:05:27.660 | Metadata is a very versatile feature that can add to your understanding of your documents and improve your retrieval;
00:05:37.660 | plus, leveraging metadata filtering during retrieval can also help you filter out irrelevant documents.
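Building on the earlier sketch (reusing its `client` and `embedder`), here is one way metadata filtering at query time can look with qdrant-client; the `section` payload field and its value are hypothetical examples, assuming such metadata was stored alongside each chunk at ingestion time.

```python
# Sketch: restrict retrieval to chunks whose payload matches a metadata filter.
# The "section" field and its value are hypothetical; store them in the payload when upserting.
from qdrant_client import models

hits = client.search(
    collection_name="docs",
    query_vector=embedder.encode("How do I configure snapshots?").tolist(),
    query_filter=models.Filter(
        must=[models.FieldCondition(key="section", match=models.MatchValue(value="administration"))]
    ),
    limit=3,
)
```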
00:05:43.660 | So it is also probably a good idea to invest your time into determining and improving your chunking strategy, which we spoke about earlier.
00:05:52.920 | Sometimes just reducing the chunk size or adding semantic chunking can work wonders. We will also be seeing that when Deanna walks us through all the experimentation.
00:06:00.920 | So, we have been aware of the lost-in-the-middle problem, and this is where it may be a good idea to determine an apt
00:06:08.180 | context size needed for your RAG to generate a helpful response.
00:06:11.440 | Also, suitable indexing algorithms like HNSW, BM25, or even graphs can do wonders.
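For the HNSW point, a small hedged example of setting index parameters explicitly when creating a Qdrant collection; the `m` and `ef_construct` values below are illustrative, not tuned recommendations.

```python
# Sketch: tune the HNSW index when creating a collection.
# Higher m / ef_construct generally improve recall at the cost of memory and indexing time.
from qdrant_client import models

client.create_collection(
    collection_name="docs_hnsw",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
    hnsw_config=models.HnswConfigDiff(m=32, ef_construct=256),
)
```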
00:06:20.440 | Talking about a suitable context size also prioritizes document re-ranking as one of the key improvement parameters, to ensure that the most relevant documents are provided in the context for the LLM to generate a helpful response.
00:06:31.180 | With LLMs in the picture, we cannot overlook the difference a good prompt, or thinking about questions as a chain of thought, can make to the process of response generation.
00:06:47.440 | Semantic understanding is clearly desired. We know that text search alone doesn't work, but if the dataset has a requirement for exact matches, it may be worth exploring dense as well as sparse vectors, and you can do that on one or many fields.
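As a hedged sketch of that idea, a collection can carry both a dense and a sparse named vector so term-level matches can complement semantic search; the vector names and size below are arbitrary choices, not settings from the demo.

```python
# Sketch: one collection with a dense (semantic) and a sparse (term-based) named vector.
from qdrant_client import models

client.create_collection(
    collection_name="docs_hybrid",
    vectors_config={"dense": models.VectorParams(size=384, distance=models.Distance.COSINE)},
    sparse_vectors_config={"sparse": models.SparseVectorParams()},
)
```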
00:07:00.440 | Similar to the embedding model, it may be worth switching and experimenting with different LLMs as well, to ensure that the desired response is generated, and lastly, for better handling of task-driven user queries, such as fetching specific information or using custom data, it is beneficial to address these challenges using agents, which are well suited for the job.
00:07:22.040 | So with so many different levers to tweak in a RAG pipeline, it's hard to know what's going wrong, what to change, and where to start.
00:07:30.840 | This is why evaluation is very important. Without an evaluation-based guided flow, it is difficult to accurately measure progress and ensure optimal performance.
00:07:40.680 | Evaluation also helps you iteratively refine your application, make informed decisions, and ultimately achieve your goals more effectively.
00:07:47.480 | So on that topic, I would like to welcome Deanna, who is going to basically walk us through how we did this experimentation.
00:07:55.320 | Thanks, Atita, for the perfect setup.
00:08:00.520 | So this is where Quotient comes in. I might steal that from you. Thank you.
00:08:04.920 | So because the quality of RAG is so dependent on the underlying documents, a thorough evaluation of a RAG system has to be customized to suit that specific domain and data set.
00:08:19.320 | So Quotient's evaluation solution fills this need by enabling developers to measure the effectiveness of their LLM products accurately.
00:08:28.520 | Quotient's platform accelerates the experimentation process. With an evaluation data set that contains realistic examples of inputs and expected outputs for your AI solution, you can quickly experiment and iterate to optimize your RAG solutions.
00:08:45.720 | And if you don't have an evaluation data set, don't worry, Quotient can help you get started by generating one for you.
00:08:53.720 | And you can hear more from us on this at the AI Sizzle and Waves meetup on Friday.
00:08:58.200 | So how does this work in practice?
00:09:02.280 | Once you have your Qdrant vector database set up, you can populate your evaluation data set by submitting queries to return the context for the LLM.
00:09:11.320 | You can then submit your evaluation data set to Quotient, which handles the full orchestration, including the prompt formatting, execution of LLMs, and the metric computations.
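As a rough sketch of that first step, reusing the Qdrant `client` and `embedder` from earlier: each evaluation record pairs a query and expected answer with the retrieved context. The record layout below is only an assumption; refer to Quotient's documentation for the exact schema, and the submission API is not shown here.

```python
# Sketch: build evaluation records by retrieving context for each query from Qdrant.
# The record layout is an assumption, not Quotient's actual schema.
eval_questions = [
    {"query": "How do I create a collection?", "expected": "..."},
    # ... more realistic question / expected-answer pairs ...
]

eval_dataset = []
for item in eval_questions:
    hits = client.search(
        collection_name="docs",
        query_vector=embedder.encode(item["query"]).tolist(),
        limit=5,
    )
    eval_dataset.append({
        "input": item["query"],
        "context": [h.payload["text"] for h in hits],
        "expected_output": item["expected"],
    })

# eval_dataset is then submitted to Quotient, which handles prompt formatting,
# LLM execution, and metric computation (submission call not shown here).
```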
00:09:23.560 | So to see it in action, we've put together a demo walkthrough where we'll show you a workflow for making evaluation-informed changes to optimize your RAG system using Quotient and Qdrant.
00:09:35.240 | For the sake of time, we've executed the notebook ahead of time and we'll be walking you through the code outputs.
00:09:41.400 | And if you scan the QR code here, you can find the notebook on GitHub.
00:09:46.600 | In this demo, we are building a RAG solution for question answering on Qdrant's documentation.
00:09:53.480 | And this will help enable Qdrant users to get help quickly.
00:09:57.240 | So before we begin evaluation, it's important to take a step back and consider what we're optimizing for.
00:10:04.200 | Given this use case, it's generally important to get helpful answers to the questions.
00:10:09.720 | But it's perhaps more important that the answers do not contain any inaccurate information that could misguide users.
00:10:16.440 | And so in other words, we want to minimize hallucinations.
00:10:21.400 | And with that in mind, we will be looking at the following metrics shown here with a focus on faithfulness.
00:10:27.000 | The first two metrics are both focused on measuring the quality of the retrieval side of RAG.
00:10:33.240 | Context relevance tells us whether the necessary information to answer the question is in the retrieved documents.
00:10:39.080 | Chunk relevance tells us how much of the information retrieved is actually useful for answering the question versus just noise.
00:10:46.360 | Faithfulness is our hallucination metric.
00:10:49.560 | And then because the focus of this talk is going to be optimizing the retrieval side of RAG,
00:10:55.960 | we're sticking to some of the more general text quality metrics here.
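To make the two retrieval metrics concrete, here is a toy illustration of how they could be computed from per-chunk relevance judgments; this is a deliberately simplified reading of the definitions above, not Quotient's actual implementation.

```python
# Toy illustration only: given binary relevance judgments for each retrieved chunk,
# chunk relevance is the fraction of retrieved chunks that are useful, while context
# relevance asks whether the needed information appears anywhere in the retrieval.
def chunk_relevance(judgments: list[bool]) -> float:
    return sum(judgments) / len(judgments) if judgments else 0.0

def context_relevance(judgments: list[bool]) -> float:
    return 1.0 if any(judgments) else 0.0

judgments = [True, False, False, True, False]  # e.g. 2 of 5 retrieved chunks are useful
print(chunk_relevance(judgments))    # 0.4 -> a lot of noise in the context
print(context_relevance(judgments))  # 1.0 -> the needed information was retrieved
```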
00:11:01.800 | So when we're first getting started, we want to consider a simple naive RAG implementation to help us
00:11:07.960 | better optimize the data processing and vector database setup.
00:11:12.040 | So we start off by choosing a reasonable embedding model and chunking parameters,
00:11:17.720 | retrieval window, and the Mistral Instruct model.
00:11:20.680 | And then to see if we require additional context to answer the questions,
00:11:26.040 | we set up a second experiment where we increase the chunk parameters.
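For orientation, a hedged sketch of how the first two configurations might be parameterized; all concrete values are placeholders, and the linked notebook has the real settings.

```python
# Sketch: the levers varied across the first two experiments, expressed as a config grid.
# Values are placeholders, not the notebook's actual settings.
experiments = [
    {   # Experiment 1: a reasonable starting point
        "embedding_model": "all-MiniLM-L6-v2",
        "chunk_size": 512,
        "chunk_overlap": 64,
        "retrieval_limit": 3,
        "llm": "mistral-7b-instruct",
    },
    {   # Experiment 2: same setup, larger chunks to pull in more context per hit
        "embedding_model": "all-MiniLM-L6-v2",
        "chunk_size": 1024,
        "chunk_overlap": 128,
        "retrieval_limit": 3,
        "llm": "mistral-7b-instruct",
    },
]
```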
00:11:31.160 | And so here are the results of those first two experiments.
00:11:34.040 | You can see that by increasing the chunk size in experiment two, we had some minor improvements
00:11:39.720 | in our text quality metrics.
00:11:41.240 | That said, we had a considerable drop in our faithfulness, which is the metric we're optimizing for.
00:11:47.960 | Of note, you can see that the context relevance increased, meaning that we retrieved more of the necessary information,
00:11:54.680 | and so what this implies is that if we simply retrieve more documents but use the smaller chunk size from
00:12:12.680 | before, we might get better results.
00:12:14.280 | As expected, the smaller chunk size with the larger retrieval window achieved the highest relevance
00:12:24.200 | scores and the best faithfulness score, meaning that we have a lower occurrence of hallucinations.
00:12:29.400 | In our next two iterations, we test out a new embedding model as well as a different LLM switching to GPT 3.5.
00:12:38.200 | And while the embedding model experiment in the light grey didn't quite work out,
00:12:42.840 | for experiment five in the dark grey, we found that using the same RAG configuration,
00:12:47.480 | but changing the LLM improved performance across all our metrics.
00:12:54.440 | So here we're looking at the aggregated metrics for our top performing RAG configuration.
00:12:58.760 | And the large variance in the context relevance, which is highlighted in red,
00:13:03.000 | implies that some of the questions are having a harder time retrieving the right documents.
00:13:07.000 | So to better understand what's going on, we have to look into the data.
00:13:11.000 | And here we've shown the two worst performing data points in terms of hallucination.
00:13:16.520 | This third column here shows us the retrieved documents, and it's a lot of text,
00:13:21.400 | so I'll just summarize. So it seems like we're returning a lot of unrelated documents
00:13:27.160 | that are likely over-indexing on specific words in the query, like Qdrant, support, and search,
00:13:34.520 | and we're returning documents that repeat these terms many times.
00:13:37.720 | So to address this issue, one possible solution could be expanding our retrieval window to capture
00:13:44.520 | more documents and thereby making it more likely we get the right information. But in doing so,
00:13:49.960 | we'd also be adding a lot of noise to our context. So by now we likely need to expand out of naive RAG
00:13:56.840 | and start to add in some more advanced techniques. And this is where re-ranking comes in. So we could
00:14:03.080 | try using a different embedding model to re-rank the retrieved documents and then return only the top
00:14:08.120 | few, thus weeding out some of the more unrelated ones. So we try this out. We perform three different
00:14:14.120 | re-ranking experiments, trying re-rankers from mixedbread, Cohere, and Jina's ColBERT model,
00:14:19.240 | and we plot them here against our top naive RAG approach in blue. And you can see that in general,
00:14:25.800 | our context relevance and with it our faithfulness have improved with this strategy, with Cohere achieving
00:14:32.120 | the best relevance and faithfulness scores. If we return to those same two worst-performing data
00:14:38.760 | points from experiment five now, and look at how our re-ranking implementation did, you can see that
00:14:44.680 | in the second example, we're now retrieving documents that contain the desired answer. And with it, our context
00:14:51.240 | relevance and faithfulness are close to one. That said, in the first example, we're still getting poor results.
00:14:57.000 | And this suggests that even after expanding our retrieval windows, we're still unable to identify the relevant documents.
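For reference, a hedged sketch of the re-ranking step with Cohere's client, reusing the Qdrant `client` and `embedder` from earlier; the rerank model name, retrieval limit, and top_n are assumptions rather than the exact demo settings.

```python
# Sketch: over-retrieve from Qdrant, re-rank with Cohere, keep only the top few chunks.
import cohere

co = cohere.Client("YOUR_COHERE_API_KEY")
query = "Does Qdrant support filtering during search?"

candidates = client.search(
    collection_name="docs",
    query_vector=embedder.encode(query).tolist(),
    limit=20,  # cast a wide net first
)
texts = [h.payload["text"] for h in candidates]

reranked = co.rerank(model="rerank-english-v3.0", query=query, documents=texts, top_n=3)
top_chunks = [texts[r.index] for r in reranked.results]  # weed out the unrelated candidates
```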
00:15:03.240 | So if we think about the Qdrant documentation and the text within it, it contains a lot of special terminology,
00:15:10.520 | jargon, acronyms, and so it's unsurprising that these generally trained embedding models are going to be limited in performance.
00:15:17.320 | So we could try fine-tuning or training our own model to address this issue, but this could be time-consuming and costly.
00:15:26.680 | So another option to try is hybrid search, which combines sparse and dense vectors,
00:15:32.200 | and these sparse vectors help us capture documents that share similar terminology.
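One way such a hybrid query can be expressed with qdrant-client's Query API is sketched below, assuming a collection with named dense and sparse vectors like the one sketched earlier; the sparse encoding step is only indicated, and the fusion choice and limits are illustrative.

```python
# Sketch: hybrid retrieval fusing a sparse (term-based) and a dense (semantic) search
# with reciprocal rank fusion. Requires a recent qdrant-client with the Query API and a
# collection that has named "dense" and "sparse" vectors.
from qdrant_client import models

query = "How does Qdrant handle payload indexing?"
dense_vec = embedder.encode(query).tolist()
# A real setup would produce these with a sparse encoder (e.g. SPLADE or BM25-style term
# weights); the indices and values below are placeholders.
sparse_vec = models.SparseVector(indices=[102, 4051, 7739], values=[0.9, 0.4, 0.7])

results = client.query_points(
    collection_name="docs_hybrid",
    prefetch=[
        models.Prefetch(query=sparse_vec, using="sparse", limit=20),
        models.Prefetch(query=dense_vec, using="dense", limit=20),
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),  # reciprocal rank fusion
    limit=5,
)
```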
00:15:36.200 | So we tried two implementations of this, one where we incorporate hybrid search with a re-ranker and
00:15:42.680 | one without, and you can see that using the Cohere re-ranker with hybrid search gives us the best
00:15:48.760 | performance across all metrics except for chunk relevance. So looking at this hybrid search re-ranking
00:15:55.560 | experiment now on these same two data points, you can see that the context relevance and faithfulness
00:16:00.520 | scores are both close to one now, a significant improvement over our prior ones. And we're also
00:16:06.440 | better able to retrieve the information necessary to answer these questions. And so this suggests that
00:16:13.000 | domain-specific terminology has a big effect on our overall performance.
00:16:17.080 | So to summarize, the table here shows what gains in performance we were able to make over just 10
00:16:23.800 | experiments. Starting from a faithfulness score of 0.76, we worked our way up to a score of just under
00:16:29.800 | 0.85. And notably, all of these gains were made without changing from a generic question-answering
00:16:35.640 | prompt. So there's certainly many more experiments to run, and we have plenty of room for improvement,
00:16:40.680 | but you can see how, starting from scratch, you can improve your RAG system by making incremental
00:16:46.360 | changes, evaluating using a combination of metrics that together can help you identify underlying issues,
00:16:53.160 | then observing patterns in your data, forming a hypothesis, and repeating the process.
00:16:59.640 | So to summarize this talk and the experimentation, we covered several key aspects of
00:17:09.000 | improving RAG. The baseline takeaway is that there is no substitute for evaluation-based or data-driven
00:17:15.960 | improvements. We emphasized leveraging domain understanding to achieve significant wins and
00:17:21.960 | outlined various techniques for enhancement in our experiments. To ensure continuous improvement,
00:17:28.120 | it is crucial to keep your evaluation data set up to date. Lastly, avoid over-engineering your RAG
00:17:34.200 | application without considering a combination of carefully chosen metrics. If this talk piqued your
00:17:40.120 | interest and you would like to get in touch with us, here are some key references and the QR codes;
00:17:47.160 | please feel free to scan them and get in touch. We're going to be around and looking forward to all your
00:17:52.360 | questions.
00:18:00.360 | We'll see you next time.