Building Production-Ready RAG Applications: Jerry Liu

Hey everyone, my name is Jerry, co-founder and CEO of LlamaIndex, and today we'll be talking about how to build production-ready RAG applications. I think there's still time for the bucket hat raffle, so if you stop by our booth, please fill out the Google form. Okay, let's get started. Everybody knows there have been a ton of amazing use cases in Gen AI recently: knowledge search and QA, conversational agents, workflow automation, document processing. These are all things you can build over your own data, especially using the reasoning capabilities of LLMs.
As a quick refresher, in terms of paradigms for getting language models to understand data they haven't been trained over, there are really two main ones. The first is retrieval augmentation: you fix the model and build a data pipeline that pulls context from some data source, like a vector database, unstructured text, or a SQL database, into the input prompt of the language model. The second paradigm is fine-tuning: baking knowledge into the weights of the network by updating the weights of the model itself, or some adapter on top of the model; basically, running some training process over new data to incorporate that knowledge. We'll mostly talk about retrieval augmentation, but this framing should help you get started and also explains the mission of the company. Okay, let's talk about RAG: retrieval-augmented generation.
It's become a bit of a buzzword recently, but let's first walk through the current RAG stack for building a QA system. It really consists of two main components: data ingestion, and data querying, which in turn contains retrieval and synthesis. If you're just getting started with LlamaIndex, you can do this in around five lines of code, so you don't really need to think about the internals. But if you do want to learn the lower-level components, and I'd encourage every AI engineer to learn how these pieces work under the hood, check out our docs to understand how data ingestion and data querying actually work: how you retrieve from a vector database, and how you synthesize that with an LLM. That's the key stack that's emerging these days. Every "chat over your PDF" or chat-over-your-unstructured-data app is basically using the same principles: load data from some data source, then retrieve and query over it.
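As a reference point, a minimal quickstart, assuming a 2023-era llama_index install, a data folder of documents, and an OpenAI API key in the environment, looks roughly like this (exact imports can vary by version):

```python
# Minimal LlamaIndex quickstart (sketch; assumes a 2023-era llama_index
# release, a ./data folder of documents, and OPENAI_API_KEY set).
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()   # data ingestion
index = VectorStoreIndex.from_documents(documents)      # chunk + embed + store
query_engine = index.as_query_engine()                  # retrieval + synthesis
response = query_engine.query("Summarize the key points of these documents.")
print(response)
```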
But as developers actually build these applications, they're realizing this isn't quite enough. There are certain issues that become blockers to productionizing these apps. So what are the challenges with naive RAG? One aspect, and this is the key thing we're focused on, is that response quality is often not very good. You run into bad retrieval issues: during the retrieval stage, if you're not returning the relevant chunks from your vector database, the correct context never makes it into the LLM. That includes low precision, where not all chunks in the retrieved set are relevant, which leads to hallucination and lost-in-the-middle problems, with a lot of fluff in the returned response. It can also mean low recall: your top-k isn't high enough, or the information you need to answer the question simply isn't there. And of course there are other issues too, like outdated information. Many of you building apps these days will be familiar with the reasons an LLM isn't always guaranteed to give you a correct answer: hallucination, irrelevance, toxicity, bias. There are a lot of issues on the LLM side as well.
So what can we actually do to improve the performance of a retrieval-augmented generation application? Many of you might be running into specific issues, and the fixes really run the gamut across the entire pipeline. There's work you can do on the data: can you store additional information beyond the raw text chunks you're putting in the vector database? Can you optimize the data pipeline somehow, for instance by playing with chunk sizes? Can you optimize the embedding representation itself? A lot of the time a pre-trained embedding model isn't optimal for your data. There's the retrieval algorithm: the default is to look up the top-k most similar elements from your vector database and return them to the LLM, and many times that's not enough, so what are the simple things as well as the harder things you can do there? And there's synthesis: can we use LLMs for more than generation? You can use the LLM to help with reasoning, not just pure generation: given a question, break it down into simpler questions, route them to different data sources, and generally query your data in a more sophisticated way.
Of course, if you've been to some of my recent talks, I always say that before you try any of these techniques, you need to be pretty task-specific and make sure you actually have a way to measure performance. So I'll spend about two minutes on evaluation. Simon, my co-founder, ran a workshop yesterday on how to build a dataset, evaluate RAG systems, and iterate on them. If you missed it, don't worry, we'll have the slides and materials available online. At a very high level, evaluation matters because you need to define a benchmark for your system in order to understand how to iterate on and improve it. And there are a few different ways to do evaluation. I think Anton from Chroma was just talking about some of this, but you basically need a way to evaluate the end-to-end solution, taking the input query and the output response, and you also want to be able to evaluate specific components. If you've diagnosed that retrieval is the part that needs improving, you need retrieval metrics to understand how to improve your retrieval system. So there's retrieval and there's synthesis; let's spend about thirty seconds on each.
Evaluation of retrieval: what does that look like? You want to make sure that the stuff returned actually answers the query, that you're not returning a bunch of fluff, and that what you return is relevant to the question. First you need an evaluation dataset. A lot of people have human-labeled datasets; if you're building in prod, you might have user feedback as well; if not, you can synthetically generate one. The dataset maps an input query to the IDs of the returned documents that are relevant to that query, so you need that somehow. Once you have it, you can measure things with ranking metrics: success rate, hit rate, MRR, NDCG, a variety of these. And note that this really isn't an LLM problem; it's an information retrieval problem, and it's been around for at least a decade or two. But it's still very relevant when you're actually building these LLM apps.
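As a rough sketch of what those ranking metrics look like in practice, assuming a labeled dataset that maps each query to its relevant node IDs and some retriever callable (the names here are illustrative, not a specific library API):

```python
# Toy retrieval-eval sketch: hit rate and MRR over a labeled dataset.
# `retrieve` is a stand-in for your actual retriever (e.g. a vector index).
from typing import Callable, Dict, List

def hit_rate_and_mrr(
    dataset: Dict[str, List[str]],          # query -> relevant doc/node IDs
    retrieve: Callable[[str], List[str]],   # query -> ranked list of retrieved IDs
    k: int = 5,
) -> Dict[str, float]:
    hits, rr_total = 0, 0.0
    for query, relevant_ids in dataset.items():
        retrieved = retrieve(query)[:k]
        # Hit rate: did any relevant ID show up in the top-k?
        if any(doc_id in relevant_ids for doc_id in retrieved):
            hits += 1
        # MRR: reciprocal rank of the first relevant ID (0 if none found).
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant_ids:
                rr_total += 1.0 / rank
                break
    n = len(dataset)
    return {"hit_rate": hits / n, "mrr": rr_total / n}

# Example with a dummy retriever:
dataset = {"What were the 2021 risk factors?": ["chunk_12"]}
print(hit_rate_and_mrr(dataset, retrieve=lambda q: ["chunk_03", "chunk_12"]))
```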
The next piece: there's the retrieval portion, but then you generate a response from it, so how do you evaluate the whole thing end to end? For evaluation of the final response given the input, you still want to build some sort of dataset. You could do that through human annotation or user feedback; you could have ground-truth reference answers for each query that indicate "this is the proper answer to this question"; or you can synthetically generate them with something like GPT-4. You run the queries through the full RAG pipeline you built, retrieval plus synthesis, and then you can run LLM-based evals, both label-free evals and with-label evals. There are a lot of projects these days on how to properly evaluate the predicted outputs of a language model.
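One way such an end-to-end harness might be wired up, as a sketch, with the judge passed in as a callable so you can plug in a label-free LLM evaluator or one that compares against a reference answer (the function names are illustrative, not any particular library's API):

```python
# Sketch of an end-to-end RAG eval loop with a pluggable LLM judge.
from typing import Callable, Dict, List, Optional

def evaluate_rag(
    questions: List[str],
    query_engine: Callable[[str], str],                 # full RAG pipeline: question -> answer
    judge: Callable[[str, str, Optional[str]], float],  # (question, answer, reference) -> score in [0, 1]
    references: Optional[Dict[str, str]] = None,        # optional ground-truth answers
) -> float:
    scores = []
    for question in questions:
        answer = query_engine(question)
        reference = references.get(question) if references else None
        scores.append(judge(question, answer, reference))
    return sum(scores) / len(scores)

# Dummy judge just to show the shape; in practice this would prompt an LLM
# to grade faithfulness/relevancy (label-free) or agreement with the reference.
dummy_judge = lambda q, a, ref: 1.0 if (ref is None or ref.lower() in a.lower()) else 0.0
print(evaluate_rag(["What is RAG?"], lambda q: "Retrieval-augmented generation.",
                   dummy_judge, {"What is RAG?": "retrieval-augmented generation"}))
```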
Once you've defined your eval benchmark, you can think about how to actually optimize your RAG system. I put a teaser on a slide yesterday, but the way I think about it is: when do you want to improve your system, and with what? There are a million things you could do, and you probably don't want to start with the hard stuff first, because part of the value of language models is that they've democratized access for every developer and made it easy to get up and running. So if you're running into performance issues with RAG, I'd start with the basics, what I call table-stakes RAG techniques: better parsing so you don't just split into even chunks, adjusting your chunk sizes, and trying things that are already integrated with your vector database, like hybrid search and metadata filters. Then there are more advanced retrieval methods you could try; some pull from traditional IR, and some are newer and specific to this age of LLM-based apps. Reranking is a traditional concept, and there are also concepts in LlamaIndex like recursive retrieval, dealing with embedded tables, small-to-big retrieval, and a lot of other things that can help improve the performance of your application. The last bucket gets into more expressive stuff that might be harder to implement and may incur higher latency and cost, but is potentially more powerful: agents, and how you incorporate them into RAG pipelines to better answer different types of questions and synthesize information, and fine-tuning.
Let's talk about the table stakes first. Chunk sizes: tuning your chunk size can have an outsized impact on performance. If you've played around with RAG systems, this may or may not be obvious to you. What's interesting is that more retrieved tokens does not always equate to higher performance, and reranking your retrieved chunks doesn't necessarily mean your final generated response will be better. This is again due to things like lost-in-the-middle problems, where content in the middle of the LLM context window tends to get lost, while content at the ends tends to be remembered better. We did a workshop with Arize about a week ago where we showed that there's an optimal chunk size for a given dataset, and that a lot of the time when you try something like reranking, it actually increases your error metrics.
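As a sketch of what that tuning loop can look like, assuming a 2023-era llama_index API where chunk size is set via ServiceContext and documents live in a data folder (in practice you'd score each setting against a labeled eval set rather than eyeballing outputs):

```python
# Sketch: sweep chunk sizes and compare responses/metrics for each setting.
# Assumes a 2023-era llama_index API (ServiceContext) and docs in ./data.
from llama_index import SimpleDirectoryReader, VectorStoreIndex, ServiceContext

documents = SimpleDirectoryReader("data").load_data()
eval_question = "What were the risk factors in 2021?"  # illustrative question

for chunk_size in (128, 256, 512, 1024):
    service_context = ServiceContext.from_defaults(chunk_size=chunk_size)
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    query_engine = index.as_query_engine(similarity_top_k=2)
    response = query_engine.query(eval_question)
    # In practice you'd score this against a labeled eval set (hit rate, MRR,
    # LLM-judged response quality) instead of just inspecting the output.
    print(chunk_size, str(response)[:200])
```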
Metadata filtering: this is another very table-stakes thing I think everybody should look into, and vector databases like Chroma, Pinecone, and Weaviate all implement these capabilities under the hood. Metadata filtering is basically about adding structured context to your text chunks. You can use it for both embeddings and synthesis, and it also integrates with the metadata filter capabilities of a vector database. Metadata is just a structured JSON dictionary: it could be the page number, the document title, or a summary of adjacent chunks. You can get creative with it too; you could even generate questions that the chunk answers. It can help retrieval, it can help augment your response quality, and it integrates with vector database filters. As an example, say the question is over an SEC 10-Q document: "Can you tell me the risk factors in 2021?" If you just do raw semantic search, it's typically very low precision. You'll return a bunch of stuff that may or may not match, maybe even chunks from other years if you have documents from different years in the same vector collection, so you're rolling the dice a little. But if you have access to the metadata of the documents, one idea is to combine structured query capabilities with semantic search by inferring the metadata filters from the question, like a WHERE clause in a SQL statement (year = 2021), and then use semantic search to return the most relevant candidates given your query. This improves the precision of your results.
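Here's a small sketch of that pattern using Chroma as the vector store, just as an example; Pinecone and Weaviate expose equivalent filter syntax. The collection name, metadata fields, and the hard-coded year filter are illustrative, and in the auto-inferred version an LLM would produce the filter from the question:

```python
# Sketch: metadata filtering + semantic search with Chroma.
# The documents, metadata fields, and filter here are illustrative.
import chromadb

client = chromadb.Client()
collection = client.create_collection("sec_filings")

collection.add(
    ids=["chunk_2020_1", "chunk_2021_1"],
    documents=[
        "Risk factors for fiscal year 2020 include supply chain disruption...",
        "Risk factors for fiscal year 2021 include rising interest rates...",
    ],
    metadatas=[{"year": 2020, "doc": "10-Q"}, {"year": 2021, "doc": "10-Q"}],
)

# Structured filter (like a SQL WHERE clause) combined with semantic search.
# In an auto-retrieval setup, an LLM would infer {"year": 2021} from the query.
results = collection.query(
    query_texts=["What are the risk factors?"],
    n_results=2,
    where={"year": 2021},
)
print(results["documents"])
```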
Moving on to stuff that's maybe a bit more advanced. One advanced retrieval technique we've found generally helps is this idea of small-to-big retrieval. What does that mean? Right now, when you embed a big text chunk, you also synthesize over that same text chunk. That's a little suboptimal, because the embedding representation can be biased if there's a bunch of fluff and irrelevant information in the chunk, so you're not actually optimizing retrieval quality. Embedding a big text chunk sometimes just isn't ideal. One thing you can do instead is embed text at the sentence level, or at some smaller granularity, and then expand that window at synthesis time. This is built into a variety of LlamaIndex abstractions. The idea is that you retrieve over more granular pieces of information, smaller chunks, which makes those chunks more likely to be retrieved when you ask a query about those specific pieces of context, but then you make sure the LLM has access to more surrounding information so it can synthesize a proper result. This leads to more precise retrieval. We tried it out, and it helps avoid some lost-in-the-middle problems: you can set a smaller top-k, like k = 2, whereas on the same dataset, if you set something like k = 5 for naive retrieval over big text chunks, you start returning a lot of context, and you run into issues where the relevant context sits in the middle but doesn't get picked up, or the LLM isn't able to synthesize over that information.
A very related idea is to embed a reference to the parent chunk rather than the text chunk itself. For instance, instead of embedding the raw text chunk, you embed a smaller chunk, a summary, or questions that the chunk answers; we've found this improves retrieval performance a decent amount. It follows the same principle: a lot of the time you want to embed something that's more amenable to embedding-based retrieval, but then return enough context that the LLM can actually synthesize over that information.
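A toy sketch of the small-to-big pattern, where word-overlap scoring stands in for a real embedding model and the window size is an arbitrary illustrative choice: retrieve at the sentence level, then hand the LLM the surrounding parent window.

```python
# Toy small-to-big retrieval sketch: index small units (sentences), but
# return the larger parent window around the best match for synthesis.
# Word-overlap scoring is a stand-in for embedding similarity.
import re

document = (
    "The company reported strong revenue growth in 2021. "
    "Risk factors in 2021 include rising interest rates and supply chain issues. "
    "The board approved a new share buyback program. "
    "Management expects margins to stabilize next year."
)

sentences = re.split(r"(?<=\.)\s+", document)

def score(query: str, text: str) -> float:
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)

def small_to_big_retrieve(query: str, window: int = 1) -> str:
    # Retrieve on the small unit (sentence)...
    best = max(range(len(sentences)), key=lambda i: score(query, sentences[i]))
    # ...but return the expanded parent window for the LLM to synthesize over.
    lo, hi = max(0, best - window), min(len(sentences), best + window + 1)
    return " ".join(sentences[lo:hi])

print(small_to_big_retrieve("What were the 2021 risk factors?"))
```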
The next bit is even more advanced. This gets into agents, and into that last pillar I mentioned: how can you use LLMs for reasoning, as opposed to just synthesis? The intuition is that in a lot of RAG setups, if you only use the LLM at the end, you're constrained by the quality of your retriever, and you can really only do question answering. There are certain types of questions and more advanced analyses you might want to run that top-k RAG just can't answer. It's not necessarily a one-off question: you might need an entire sequence of reasoning steps to pull a piece of information together, or you might want to summarize a document and compare it with other documents. So one architecture we're exploring right now is multi-document agents. Instead of plain RAG, we move a little further into agent territory and model each document not just as a sequence of text chunks, but as a set of tools: one to summarize that document, and one to do QA over specific facts in that document. And of course, if you want to scale to hundreds, thousands, or millions of documents, an agent can typically only handle a limited window of tools, so you probably want to do some sort of retrieval over the tools themselves, similar to how you retrieve text chunks from a document. The main difference is that because these are tools, you actually want to act on them and use them, rather than just taking raw text and plugging it into the context window. Blending embedding-based retrieval (or any sort of retrieval) with agent tool use is a very interesting paradigm that I think is really only possible in this age of LLMs; it hasn't really existed before.
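A minimal sketch of that multi-document agent pattern, where the documents, tool names, and the trivial summarize/QA implementations are all illustrative, and word overlap again stands in for both embedding-based tool retrieval and the agent's tool-selection step:

```python
# Toy multi-document agent sketch: each document becomes a set of tools
# (summary + QA); the agent retrieves the most relevant tool and calls it.
from dataclasses import dataclass
from typing import Callable, List

documents = {
    "uber_10q": "Uber's 2021 10-Q discusses driver supply, rising costs, and rideshare demand recovery.",
    "lyft_10q": "Lyft's 2021 10-Q discusses insurance reserves, rider incentives, and market competition.",
}

@dataclass
class Tool:
    name: str
    description: str
    fn: Callable[[str], str]

def make_tools(doc_id: str, text: str) -> List[Tool]:
    # Trivial stand-ins: a real system would back these with a summary index
    # and a vector index over the document's chunks.
    summarize = lambda q, t=text: f"Summary of {doc_id}: {t[:80]}..."
    qa = lambda q, t=text: f"Answer from {doc_id} for '{q}': {t}"
    return [
        Tool(f"{doc_id}_summary", f"Summarize the {doc_id} document", summarize),
        Tool(f"{doc_id}_qa", f"Answer factual questions over the {doc_id} document", qa),
    ]

tools = [t for doc_id, text in documents.items() for t in make_tools(doc_id, text)]

def pick_tool(query: str) -> Tool:
    # Stand-in for retrieving tools by embedding their descriptions.
    overlap = lambda t: len(set(query.lower().split()) & set(t.description.lower().split()))
    return max(tools, key=overlap)

query = "Summarize the uber_10q document"
tool = pick_tool(query)
print(tool.name, "->", tool.fn(query))
```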
Another advanced concept is fine-tuning. Some other presenters have talked about this as well, but the idea of fine-tuning in a RAG system is to optimize specific pieces of the pipeline to improve either the retrieval or the synthesis capabilities. One thing you can do is fine-tune your embeddings. I think Anton was talking about this too: if you just use a pre-trained model, the embedding representations aren't optimized for your specific data, so sometimes you'll simply retrieve the wrong information. If you can tune the embeddings so that for any relevant question a user might ask, you actually return the relevant chunks, you'll get better performance. One idea here is to generate a synthetic query dataset from your raw text chunks using an LLM, and use that to fine-tune an embedding model. You can do this by fine-tuning the base embedding model itself, or by fine-tuning an adapter on top of it. Fine-tuning an adapter has a couple of advantages: you don't need the base model's weights to do the fine-tuning, and if you only fine-tune the query side, you don't have to re-index your entire document corpus.
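A rough sketch of the query-side adapter idea, where the embedding dimension, the in-batch contrastive loss, and the random stand-in embeddings are all illustrative; in practice the (query, chunk) pairs come from an LLM generating questions per chunk, and the embeddings from your actual embedding model:

```python
# Sketch: train a linear adapter on query embeddings only, using synthetic
# (query, positive chunk) pairs. Document embeddings are left untouched,
# so the existing index does not need to be rebuilt.
import torch
import torch.nn.functional as F

dim, num_pairs = 384, 256                      # illustrative embedding size / dataset size
query_embs = torch.randn(num_pairs, dim)       # stand-in: embed(LLM-generated synthetic query)
chunk_embs = torch.randn(num_pairs, dim)       # stand-in: embed(source text chunk)

adapter = torch.nn.Linear(dim, dim, bias=False)
with torch.no_grad():
    adapter.weight.copy_(torch.eye(dim))       # start as identity so training only nudges queries
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)

for step in range(200):
    adapted = F.normalize(adapter(query_embs), dim=-1)
    docs = F.normalize(chunk_embs, dim=-1)
    # In-batch contrastive loss: each query should match its own chunk
    # more closely than the other chunks in the batch.
    logits = adapted @ docs.T / 0.05
    loss = F.cross_entropy(logits, torch.arange(num_pairs))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# At query time: embed the query, apply the adapter, then search the unchanged index.
```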
There's also fine-tuning LLMs, which of course a lot of people are interested in doing these days. The intuition, specifically for RAG, is that weaker LLMs like GPT-3.5 Turbo or Llama 2 7B tend to be a bit worse at response synthesis, reasoning, structured outputs, and so on. So the idea, and it's something we're exploring, is to generate a synthetic dataset with a bigger model like GPT-4 and distill that into 3.5 Turbo, so it gets better at chain-of-thought, longer response quality, structured outputs, and a lot of other things as well.
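A sketch of that distillation setup: write out (question, retrieved context, teacher answer) triples in the chat-format JSONL that OpenAI's fine-tuning endpoint for gpt-3.5-turbo expects. The teacher answer is stubbed here; in practice it would come from an actual GPT-4 call over your RAG pipeline's retrieved context:

```python
# Sketch: build a distillation dataset for fine-tuning a weaker LLM.
# The teacher answer is stubbed; in practice it comes from GPT-4 run over
# the question plus the context retrieved by your RAG pipeline.
import json

examples = [
    {
        "question": "What were the 2021 risk factors?",
        "context": "Risk factors in 2021 include rising interest rates and supply chain issues.",
        "teacher_answer": "The 2021 risk factors were rising interest rates and supply chain issues.",
    },
]

with open("distill_train.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": "Answer using only the provided context."},
                {"role": "user", "content": f"Context:\n{ex['context']}\n\nQuestion: {ex['question']}"},
                {"role": "assistant", "content": ex["teacher_answer"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
# The resulting JSONL can then be uploaded to a fine-tuning job for the smaller model.
```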
So all these things are in our docs: there's production RAG, there's fine-tuning, and I have two seconds left.