
Retrieval Augmented Generation in the Wild: Anton Troynikov


Chapters

0:00 Intro
0:27 Retrieval Loop
1:04 Purpose
1:22 Human Feedback
2:37 Applications
3:37 Challenges
5:06 The bad news
5:26 Which embedding model to use
6:57 How to chunk
8:50 Retrieval relevance
10:44 What we are building
12:07 Outro


00:00:00.600 | Hi everybody, as Dave said as I walked up, I'm Anton, I'm the co-founder of Chroma.
00:00:18.280 | I'm here to talk to you about retrieval augmented generation in the wild and what it is that
00:00:23.720 | Chroma is building for beyond just vector search.
00:00:28.320 | By now, you've all seen versions of this probably a half dozen times throughout this conference.
00:00:33.420 | This is the basic retrieval loop that one would use in a RAG application.
00:00:37.820 | You have some corpus of documents, you embed them in your favorite vector store, which is
00:00:42.380 | Chroma.
00:00:43.380 | I mean, check the lanyards, man.
00:00:48.980 | You embed your corpus of documents, you have an embedding model for your queries, you find
00:00:54.300 | the nearest neighbor vectors for those embeddings and you return the associated documents which,
00:00:57.840 | along with the query, you then put into the LLM's context window and return some results.
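
For illustration, a minimal sketch of that loop using Chroma's Python client. The example documents, the prompt format, and the `call_llm` helper are placeholders standing in for whichever LLM API the application uses:

```python
# A minimal sketch of the open-loop RAG cycle described above, using
# Chroma's Python client. `call_llm` is a hypothetical helper standing
# in for whatever LLM API the application uses.
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")

# Embed and store the corpus; Chroma applies its default embedding
# function when none is specified.
collection.add(
    ids=["doc1", "doc2"],
    documents=["Herons are wading birds.", "Sparrows are small songbirds."],
)

def answer(query: str) -> str:
    # Embed the query and find its nearest-neighbor documents...
    results = collection.query(query_texts=[query], n_results=2)
    context = "\n".join(results["documents"][0])
    # ...then place them in the model's context window alongside the query.
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)  # hypothetical LLM call
```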
00:01:02.940 | Now, this is the basic RAG loop, but I think of this as more like the open loop retrieval
00:01:07.860 | augmented generation application.
00:01:10.560 | And my purpose in showing you all this is to show you that you need a lot more than simple
00:01:14.940 | vector search to build some of the more powerful, more promising applications that take RAG into
00:01:19.840 | the future.
00:01:20.840 | So, let's look at what some of those might be.
00:01:23.020 | The first piece to this, of course, is incorporating human feedback into this loop.
00:01:26.940 | Without human feedback, it isn't possible to adapt the data, or the embedding model
00:01:32.700 | itself, to the specific task, to the model, and to the user.
00:01:36.940 | Human feedback is required to actually return better results for particular queries on your
00:01:42.240 | specific data, on the specific tasks that you want to perform.
00:01:44.940 | So, generally, embedding models are trained in a general context and you actually want
00:01:48.200 | to update them for your specific tasks.
00:01:50.040 | So, basically, the memory that you're using for your RAG application needs to be able to
00:01:54.260 | support this sort of human feedback.
00:01:56.840 | Now, the other piece that we've seen is agents. These are currently in the early stages, but
00:02:01.500 | they're emerging as something like capable machines, and I think that one of the ways to make
00:02:05.460 | agents actually capable is a better RAG system, a better memory for AI.
00:02:09.920 | And that means that your retrieval system, your memory, needs to support self-updates from
00:02:14.940 | the agent itself out of the box.
00:02:17.760 | All in all, what this means is you have a constantly dynamically updating data set.
00:02:21.340 | Something that's built as a search index out of the box is not going to be able to support
00:02:24.560 | these types of capabilities.
00:02:27.520 | Next, of course, we're talking about agents with world models.
00:02:29.560 | So, in other words, the agent needs to be able to store its interaction with the world and
00:02:33.620 | update the data that it's working with based on that interaction.
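
As a rough sketch of what such a self-updating memory could look like: the agent writes records of its interactions back into the same collection it retrieves from. The record content and metadata fields below are hypothetical, and `collection` is a Chroma collection like the one in the earlier sketch:

```python
# Hypothetical example of an agent storing an interaction with the
# world back into its retrieval memory. Chroma's upsert makes the
# write idempotent, so the record can later be revised in place.
collection.upsert(
    ids=["episode-42"],
    documents=["Tried to smelt iron ore; failed because no furnace was placed."],
    metadatas=[{"kind": "episode", "success": False}],
)
```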
00:02:37.780 | And finally, you need to be able to tie all of these together.
00:02:40.600 | Now, this sounds like a very complex system that's frontier research, and it is currently
00:02:45.200 | research-grade, but we're seeing some of the first applications of this in the wild already
00:02:50.440 | today.
00:02:51.440 | I'm sure some of you are familiar with this paper.
00:02:54.400 | This is the famous Voyager paper out of Nvidia, where they trained an agent to play Minecraft,
00:03:00.500 | to learn how to play it, by learning skills in a particular environment and then recognizing
00:03:04.900 | when it's in the same context and recalling that skill.
00:03:07.400 | Now, the other interesting piece to this is several of the more complex skills were learned
00:03:11.640 | through human demonstration and then retained in the retrieval system, which of course was
00:03:15.360 | Chroma.
00:03:17.360 | My point in showing this to you is that the simple RAG loop might be the bread and butter
00:03:23.360 | of most of the applications being developed today.
00:03:25.320 | But the most powerful things that you'll be able to build with AI in the future require
00:03:29.500 | a much more capable retrieval system than one that only supports a search index.
00:03:36.420 | Now, of course, in retrieval itself there are plenty of challenges.
00:03:41.320 | Information retrieval is kind of a classic task, and the setting in which it's been found previously
00:03:46.040 | has been in recommender systems and in search systems.
00:03:49.940 | Now that we're all using this in production for AI applications in completely different
00:03:53.360 | ways, there's a lot of open questions that haven't really been asked quite in the same
00:03:57.060 | way or with quite the same intensity.
00:04:00.020 | A key piece of how retrieval needs to function for AI, and anyone who's built one of these is
00:04:04.060 | aware of this, is you need to be able to return not just all relevant information, but also
00:04:08.160 | no irrelevant information.
00:04:10.680 | It's common knowledge by now, and this is supported by empirical research, that distractors
00:04:14.840 | in the model context cause the performance of the entire AI-based application to fall off
00:04:19.240 | a cliff.
00:04:21.920 | So what does it mean to actually retrieve relevant info and no irrelevant info?
00:04:25.840 | You need to know which embedding model you need to be using at all in the first place,
00:04:28.860 | and we've all seen the claims from the different API and embedding model providers.
00:04:32.640 | This one is best for code.
00:04:34.040 | This one is best for English language.
00:04:35.320 | This one is best for multilingual data sets.
00:04:37.480 | But the reality is, the only way to find out which is best for your data set is to have
00:04:41.040 | an effective way to evaluate them on your own data.
00:04:43.640 | The next question, of course, is how do I chunk up the data?
00:04:46.360 | Chunking determines what results are available to the model at all.
00:04:52.040 | And it's obvious that different types of chunking produce different relevancy in the return results.
00:04:57.560 | And finally, how do we even determine whether a given retrieved result is actually relevant
00:05:01.200 | to the task or to the user?
00:05:03.040 | So let's dive into some of these in a little bit more depth.
00:05:06.760 | So the bad news is, again, nobody really has the answers.
00:05:09.120 | Despite the fact that information retrieval is a long-studied problem, there aren't great solutions
00:05:13.560 | to these problems today.
00:05:14.560 | But the good news is that these are important problems and increasingly important problems.
00:05:18.620 | And we now see much more production data, rather than just academic benchmarks, that we can
00:05:23.820 | work from to solve some of these for the first time.
00:05:27.380 | So first, the question of which embedding model we should be using.
00:05:30.100 | Of course, there are existing academic benchmarks.
00:05:32.100 | And for now, these appear to be mostly saturated.
00:05:35.540 | The reason for that is these are synthetic benchmarks designed specifically for the information retrieval
00:05:39.620 | problem and don't necessarily reflect how retrieval systems are used in AI use cases.
00:05:45.100 | So what can you do about that?
00:05:46.820 | You can take some of the open source tooling built to build these benchmarks in the first
00:05:50.860 | place and apply it to your data sets and your use cases.
00:05:55.200 | You can use human feedback on relevance by adding a simple relevance feedback endpoint.
00:05:59.380 | And this is something that Chroma is building to support in the very near future.
00:06:02.880 | You can construct your own data sets because you're viewing your data in production.
00:06:06.260 | You know what actually matters to you.
00:06:08.260 | And then you need a way to effectively evaluate the performance of particular embedding models.
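
One simple way to do that on your own data is to hand-label which documents are actually relevant to a set of real queries, then measure recall@k for each candidate model. A minimal sketch, assuming `collection` is a Chroma collection built with the model under test and the labeled query set is your own:

```python
# Recall@k over a hand-labeled evaluation set: for each query we know
# the ids of the truly relevant documents and check how many the
# embedding model actually retrieves in its top k.
def recall_at_k(collection, labeled_queries, k=5):
    hits, total = 0, 0
    for query, relevant_ids in labeled_queries:
        results = collection.query(query_texts=[query], n_results=k)
        retrieved = set(results["ids"][0])
        hits += len(retrieved & set(relevant_ids))
        total += len(relevant_ids)
    return hits / total

# e.g. labeled_queries = [("wading birds of North America", ["doc1"]), ...]
```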
00:06:14.580 | Of course, there are great evaluation tools coming onto the market now from several vendors.
00:06:18.760 | Which of these is best, we don't know, but we intend to support all of these with Chroma.
00:06:23.540 | One interesting part about embedding models, and again, this is something that's been well
00:06:28.300 | known in the research community for a while but has been empirically tested recently.
00:06:32.360 | Embedding models with the same training objective, with roughly the same data, tend to learn very
00:06:36.580 | similar representations up to an affine linear transform, which suggests that it's possible to project
00:06:42.140 | one model's embedding space into another model's embedding space by using a simple linear transform.
00:06:46.320 | So the choice of which embedding model you actually want to use might not end up being so important
00:06:50.500 | if you're actually able to apply and figure out that transform from your own data set.
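
As a sketch of that idea: given embeddings of the same texts under two models, the linear map can be fit by ordinary least squares and then used to project between the two spaces. This assumes you have paired embeddings available, and a full affine map would also fit a bias term:

```python
import numpy as np

def fit_projection(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    # emb_a: (n, d_a) and emb_b: (n, d_b) are embeddings of the same
    # n texts under models A and B. Solve min_W ||emb_a @ W - emb_b||.
    W, *_ = np.linalg.lstsq(emb_a, emb_b, rcond=None)
    return W  # shape (d_a, d_b)

def project(emb_a: np.ndarray, W: np.ndarray) -> np.ndarray:
    # Map model A's embeddings into model B's space.
    return emb_a @ W
```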
00:06:57.680 | So the question is how to chunk.
00:07:00.140 | Of course, there's a few things to consider.
00:07:01.680 | Chunking, in part, exists because we have bounded context lengths for our LLMs.
00:07:06.680 | So we want to make sure that the retrieved results can actually fit in that context.
00:07:10.360 | We want to make sure that we retain the semantic content of the data we're aiming to retrieve.
00:07:19.540 | We want to make sure that we retain the relevant semantic content of that data rather than just
00:07:27.040 | semantic content in general.
00:07:28.540 | We also want to make sure that we're respecting the natural structure of the data, because the data,
00:07:32.540 | especially textual data, was often generated for humans to read and understand in the first place.
00:07:36.620 | So the inherent structure of that data provides cues about where the semantic boundaries might be.
00:07:41.560 | Of course, there are tools for chunking.
00:07:43.400 | There's NLTK.
00:07:44.400 | There's LangChain.
00:07:45.400 | LlamaIndex also supports many forms of chunking.
00:07:48.320 | But there are experimental ideas here which we're particularly interested in trying.
00:07:52.840 | One interesting thought that we've had, and one we're experimenting with lightweight open-source
00:07:56.680 | language models to achieve, is using the model's prediction perplexity for the next actual
00:08:01.360 | token in the document, based on a sliding window of previous tokens.
00:08:05.760 | In other words, you can see when the model mispredicts or has a very low probability for the next
00:08:11.500 | actual piece of text, as an indicator of where a semantic boundary in the text might be.
00:08:16.520 | And that might be a natural place to chunk.
00:08:17.880 | And what that also means is because you have a model actually predicting chunk boundaries,
00:08:23.140 | you can then fine tune that model to make sure that chunk boundaries are relevant to your
00:08:26.280 | application.
00:08:27.280 | So this is something that we're actively exploring.
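
A minimal sketch of that perplexity-based idea, assuming a small open model like GPT-2 via Hugging Face transformers. The window size and surprisal threshold are illustrative, and a real implementation would batch the forward passes rather than running one per token:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def surprisal_boundaries(text, window=64, threshold=8.0):
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    boundaries = []
    for i in range(window, len(ids)):
        # Predict the next token from a sliding window of context.
        context = ids[i - window : i].unsqueeze(0)
        with torch.no_grad():
            logits = model(context).logits[0, -1]
        logprobs = torch.log_softmax(logits, dim=-1)
        # High surprisal (-log p) for the token that actually comes
        # next suggests a possible semantic boundary.
        if -logprobs[ids[i]].item() > threshold:
            boundaries.append(i)  # token index of a candidate split
    return boundaries
```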
00:08:29.100 | We can use information hierarchies.
00:08:30.480 | Again, tools like LlamaIndex support information hierarchies out of the box, as well as multiple data sources and
00:08:34.580 | signals for re-ranking.
00:08:36.140 | And we can also try to use embedding continuity.
00:08:38.320 | This is something that we're experimenting with as well, where essentially you take a sliding
00:08:42.120 | window across your documents, embed that sliding window, and look for discontinuities in the
00:08:47.560 | resulting time series.
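
A sketch of that embedding-continuity idea, assuming a sentence-transformers model; the window size and similarity threshold are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def continuity_boundaries(sentences, window=3, threshold=0.6):
    # Embed a sliding window of sentences across the document.
    windows = [" ".join(sentences[i : i + window])
               for i in range(len(sentences) - window + 1)]
    embs = model.encode(windows, normalize_embeddings=True)
    boundaries = []
    for i in range(1, len(embs)):
        # With normalized embeddings, the dot product is cosine
        # similarity; a sharp drop between consecutive windows marks
        # a discontinuity in the resulting "time series".
        if float(np.dot(embs[i - 1], embs[i])) < threshold:
            boundaries.append(i)
    return boundaries
```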
00:08:51.580 | So this is an important question.
00:08:52.320 | I'll give you a demonstration of why being able to compute retrieval result
00:08:58.660 | relevance is actually very important in your application.
00:09:01.940 | Imagine in your application you've gone and you've embedded every English-language Wikipedia
00:09:05.740 | page about birds, and that's what's in your corpus.
00:09:09.060 | And in your traditional retrieval augmented generation system, what you're doing for each
00:09:12.140 | query is just returning the five nearest neighbors and then stuffing them into the model's context
00:09:15.940 | window.
00:09:16.940 | One day, a user's query comes along, and that query is about fish and not birds.
00:09:20.680 | You're guaranteed to return some five nearest neighbors, but you're also guaranteed to not
00:09:25.680 | have a single relevant result among them.
00:09:28.400 | How can you, as an application developer, make that determination?
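
To make that concrete, here is a sketch of the scenario, reusing a Chroma collection like the one in the earlier sketch, now imagined to hold the bird pages. The query always comes back with five neighbors, and the raw distances are the only hint that none of them are relevant:

```python
# The corpus contains only bird pages, but a nearest-neighbor query
# about fish still returns five results; no built-in notion of
# relevance stops it.
results = collection.query(
    query_texts=["What do salmon eat?"],
    n_results=5,
    include=["documents", "distances"],
)
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    # Large distances are at best a weak, model-dependent signal that
    # a result is irrelevant.
    print(f"{dist:.3f}  {doc[:60]}")
```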
00:09:31.500 | So, there's a few possibilities here.
00:09:34.200 | The first, of course, is human feedback around relevancy signal.
00:09:38.340 | The traditional approach in information retrieval is using an auxiliary re-ranking model.
00:09:42.480 | In other words, you take other signals in sort of the query chain: what else was the
00:09:47.560 | user looking at at the time?
00:09:49.360 | What things has the user found to be useful in the past? And you use those as additional signals
00:09:53.500 | around relevancy.
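
One common form of auxiliary re-ranking model, to pick a concrete example not named in the talk, is a cross-encoder that scores each query-candidate pair directly. A minimal sketch using a public sentence-transformers checkpoint:

```python
from sentence_transformers import CrossEncoder

# A small public cross-encoder trained for passage re-ranking; the
# choice of checkpoint is illustrative, not a recommendation.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=3):
    # Score each (query, document) pair and keep the best candidates.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```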
00:09:56.380 | And we can also, of course, do augmented retrieval, which Chroma does out of the box.
00:09:59.560 | We have keyword-based search, and we have metadata-based filtering, so you can scope the search if you
00:10:05.540 | have those additional signals beforehand.
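
A sketch of that scoping using the keyword and metadata filters Chroma supports; the field names and values here are hypothetical:

```python
# Scope the vector search with a metadata filter and a keyword filter
# before nearest-neighbor ranking is applied.
results = collection.query(
    query_texts=["nesting habits"],
    n_results=5,
    where={"family": "Ardeidae"},            # metadata-based filtering
    where_document={"$contains": "heron"},   # keyword-based search
)
```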
00:10:07.720 | Now, to me, the most interesting approach here is actually an algorithmic one.
00:10:12.520 | So what I mean by that is conditional on the data set that you have available and conditional
00:10:17.280 | on what we know about the task that the user is trying to perform, it should be possible
00:10:22.560 | to generate a conditional relevancy signal per user, per task, per model, and per instance
00:10:27.720 | of that task, but this requires a model which can understand the semantics of the query
00:10:32.780 | as well as the content of the data set very well.
00:10:35.620 | This is something that we're experimenting with, and this is another place where we think open-source,
00:10:40.140 | lightweight language models have actually a lot to offer, even at the data layer.
00:10:45.360 | So to talk a little bit about what we're building, this is the advertising portion of my talk.
00:10:51.140 | In core engineering, we're, of course, building our horizontally scalable cluster version.
00:10:54.640 | Single-node Chroma works great, many of you have probably already tried it by now.
00:10:57.700 | It's time to actually make it work across multiple nodes.
00:10:59.700 | By December, we'll have our database-as-a-service technical preview up and ready so you guys can
00:11:03.480 | try Chroma Cloud.
00:11:05.460 | In January, we'll have our hybrid deployments available if you want to run Chroma in your
00:11:08.700 | enterprise cluster.
00:11:09.700 | And along the way, we're building to support multimodal data.
00:11:12.700 | We know that GPT Vision's API is coming very soon, probably at OpenAI's developer day.
00:11:20.760 | Gemini will also have image understanding and voice.
00:11:24.300 | That means that you'll be able to use multimodal data in your retrieval applications for the first
00:11:28.760 | time.
00:11:29.760 | We're no longer just talking about text.
00:11:31.660 | So these questions about relevancy for other types of data become even more important, right?
00:11:35.780 | Because now you start having questions about relevancy, aesthetic quality, all of these other
00:11:39.640 | pieces which you need to make these multimodal retrieval augmented systems work.
00:11:44.660 | And finally, we're working on model selection.
00:11:46.820 | Basically, Chroma wants to do everything in the data layer for you so that, just like with
00:11:53.420 | a modern DBMS, just like you use Postgres in a web application, everything in the data
00:11:58.560 | layer should just work for you as an application developer.
00:12:01.300 | Your focus should be on the application logic and making your application actually run correctly,
00:12:05.320 | and that's what Chroma is building for in AI.
00:12:08.020 | And that's it.
00:12:09.020 | Thank you very much.