
Retrieval Augmented Generation in the Wild: Anton Troynikov


Chapters

0:00 Intro
0:27 Retrieval Loop
1:04 Purpose
1:22 Human Feedback
2:37 Applications
3:37 Challenges
5:06 The bad news
5:26 Which embedding model to use
6:57 How to chunk
8:50 Retrieval relevance
10:44 What we are building
12:07 Outro


00:00:00.600 | Hi everybody, as Dave said as I walked up, I'm Anton, I'm the co-founder of Chroma.
00:00:18.280 | I'm here to talk to you about retrieval augmented generation in the wild and what it is that
00:00:23.720 | Chroma is building for beyond just vector search.
00:00:28.320 | By now, you've all seen versions of this probably a half dozen times throughout this conference.
00:00:33.420 | This is the basic retrieval loop that one would use in a RAG application.
00:00:37.820 | You have some corpus of documents, you embed them in your favorite vector store, which is
00:00:42.380 | Chroma.
00:00:43.380 | I mean, check the lanyards, man.
00:00:48.980 | You embed your corpus of documents, you have an embedding model for your queries, you find
00:00:54.300 | the nearest neighbor vectors for those embeddings and you return the associated documents which,
00:00:57.840 | along with the query, you then put into the LLM's context window and return some results.
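
For illustration, a minimal sketch of that loop using Chroma's Python client. The example documents, the prompt format, and the `call_llm` helper are placeholders standing in for whichever LLM API the application uses:

```python
# A minimal sketch of the open-loop RAG cycle described above, using
# Chroma's Python client. `call_llm` is a hypothetical helper standing
# in for whatever LLM API the application uses.
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")

# Embed and store the corpus; Chroma applies its default embedding
# function when none is specified.
collection.add(
    ids=["doc1", "doc2"],
    documents=["Herons are wading birds.", "Sparrows are small songbirds."],
)

def answer(query: str) -> str:
    # Embed the query and find its nearest-neighbor documents...
    results = collection.query(query_texts=[query], n_results=2)
    context = "\n".join(results["documents"][0])
    # ...then place them in the model's context window alongside the query.
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)  # hypothetical LLM call
```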
00:01:02.940 | Now, this is the basic RAG loop, but I think of this as more like the open loop retrieval
00:01:07.860 | augmented generation application.
00:01:10.560 | And my purpose in showing you all this is to show you that you need a lot more than simple
00:01:14.940 | vector search to build some of the more powerful, more promising applications that take RAG into
00:01:19.840 | the future.
00:01:20.840 | So, let's look at what some of those might be.
00:01:23.020 | The first piece to this, of course, is incorporating human feedback into this loop.
00:01:26.940 | Without human feedback, it isn't possible to adapt the data, or the embedding model
00:01:32.700 | itself, to the specific task, to the model, and to the user.
00:01:36.940 | Human feedback is required to actually return better results for particular queries on your
00:01:42.240 | specific data, on the specific tasks that you want to perform.
00:01:44.940 | So, generally, embedding models are trained in a general context and you actually want
00:01:48.200 | to update them for your specific tasks.
00:01:50.040 | So, basically, the memory that you're using for your RAG application needs to be able to
00:01:54.260 | support this sort of human feedback.
00:01:56.840 | Now, the other piece that we've seen is agents. These are currently in the early stages, but
00:02:01.500 | they're emerging as something like capable machines, and I think that one of the ways to make
00:02:05.460 | agents actually capable is a better RAG system, a better memory for AI.
00:02:09.920 | And that means that your retrieval system, your memory, needs to support self-updates from
00:02:14.940 | the agent itself out of the box.
00:02:17.760 | All in all, what this means is you have a constantly dynamically updating data set.
00:02:21.340 | Something that's built as a search index out of the box is not going to be able to support
00:02:24.560 | these types of capabilities.
00:02:27.520 | Next, of course, we're talking about agents with world models.
00:02:29.560 | So, in other words, the agent needs to be able to store its interaction with the world and
00:02:33.620 | update the data that it's working with based on that interaction.
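
As a rough sketch of what such a self-updating memory could look like: the agent writes records of its interactions back into the same collection it retrieves from. The record content and metadata fields below are hypothetical, and `collection` is a Chroma collection like the one in the earlier sketch:

```python
# Hypothetical example of an agent storing an interaction with the
# world back into its retrieval memory. Chroma's upsert makes the
# write idempotent, so the record can later be revised in place.
collection.upsert(
    ids=["episode-42"],
    documents=["Tried to smelt iron ore; failed because no furnace was placed."],
    metadatas=[{"kind": "episode", "success": False}],
)
```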
00:02:37.780 | And finally, you need to be able to tie all of these together.
00:02:40.600 | Now, this sounds like a very complex system that's frontier research, and it is currently
00:02:45.200 | research-grade, but we're seeing some of the first applications of this in the wild already
00:02:50.440 | today.
00:02:51.440 | I'm sure some of you are familiar with this paper.
00:02:54.400 | This is the famous Voyager paper out of Nvidia, where they trained an agent to play Minecraft,
00:03:00.500 | to learn how to play it, by learning skills in a particular environment and then recognizing
00:03:04.900 | when it's in the same context and recalling that skill.
00:03:07.400 | Now, the other interesting piece to this is several of the more complex skills were learned
00:03:11.640 | through human demonstration and then retained in the retrieval system, which of course was
00:03:15.360 | Chroma.
00:03:17.360 | My point in showing this to you is that the simple RAG loop might be the bread and butter
00:03:23.360 | of most of the applications being developed today.
00:03:25.320 | But the most powerful things that you'll be able to build with AI in the future require
00:03:29.500 | a much more capable retrieval system than one that only supports a search index.
00:03:36.420 | Now, of course, in retrieval itself there are plenty of challenges.
00:03:41.320 | Information retrieval is kind of a classic task, and the setting in which it's been found previously
00:03:46.040 | has been in recommender systems and in search systems.
00:03:49.940 | Now that we're all using this in production for AI applications in completely different
00:03:53.360 | ways, there's a lot of open questions that haven't really been asked quite in the same
00:03:57.060 | way or with quite the same intensity.
00:04:00.020 | A key piece of how retrieval needs to function for AI, and anyone who's built one of these is
00:04:04.060 | aware of this, is you need to be able to return not just all relevant information, but also
00:04:08.160 | no irrelevant information.
00:04:10.680 | It's common knowledge by now, and this is supported by empirical research, that distractors
00:04:14.840 | in the model context cause the performance of the entire AI-based application to fall off
00:04:19.240 | a cliff.
00:04:21.920 | So what does it mean to actually retrieve relevant info and no irrelevant info?
00:04:25.840 | You need to know which embedding model you need to be using at all in the first place,
00:04:28.860 | and we've all seen the claims from the different API and embedding model providers.
00:04:32.640 | This one is best for code.
00:04:34.040 | This one is best for English language.
00:04:35.320 | This one is best for multilingual data sets.
00:04:37.480 | But the reality is, the only way to find out which is best for your data set is to have
00:04:41.040 | an effective way to evaluate them on your own data.
00:04:43.640 | The next question, of course, is how do I chunk up the data?
00:04:46.360 | Chunking determines what results are available to the model at all.
00:04:52.040 | And it's obvious that different types of chunking produce different relevancy in the return results.
00:04:57.560 | And finally, how do we even determine whether a given retrieved result is actually relevant
00:05:01.200 | to the task or to the user?
00:05:03.040 | So let's dive into some of these in a little bit more depth.
00:05:06.760 | So the bad news is, again, nobody really has the answers.
00:05:09.120 | Despite the fact that information retrieval is a long-studied problem, there aren't great solutions
00:05:13.560 | to these problems today.
00:05:14.560 | But the good news is that these are important problems and increasingly important problems.
00:05:18.620 | And we now see much more production data, rather than just academic benchmarks, that we can
00:05:23.820 | work from to solve some of these for the first time.
00:05:27.380 | So first, the question of which embedding model we should be using.
00:05:30.100 | Of course, there are existing academic benchmarks.
00:05:32.100 | And for now, these appear to be mostly saturated.
00:05:35.540 | The reason for that is these are synthetic benchmarks designed specifically for the information retrieval
00:05:39.620 | problem and don't necessarily reflect how retrieval systems are used in AI use cases.
00:05:45.100 | So what can you do about that?
00:05:46.820 | You can take some of the open source tooling built to build these benchmarks in the first
00:05:50.860 | place and apply it to your data sets and your use cases.
00:05:55.200 | You can use human feedback on relevance by adding a simple relevance feedback endpoint.
00:05:59.380 | And this is something that Chroma is building to support in the very near future.
00:06:02.880 | You can construct your own data sets because you're viewing your data in production.
00:06:06.260 | You know what actually matters to you.
00:06:08.260 | And then you need a way to effectively evaluate the performance of particular embedding models.
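
One simple way to do that on your own data is to hand-label which documents are actually relevant to a set of real queries, then measure recall@k for each candidate model. A minimal sketch, assuming `collection` is a Chroma collection built with the model under test and the labeled query set is your own:

```python
# Recall@k over a hand-labeled evaluation set: for each query we know
# the ids of the truly relevant documents and check how many the
# embedding model actually retrieves in its top k.
def recall_at_k(collection, labeled_queries, k=5):
    hits, total = 0, 0
    for query, relevant_ids in labeled_queries:
        results = collection.query(query_texts=[query], n_results=k)
        retrieved = set(results["ids"][0])
        hits += len(retrieved & set(relevant_ids))
        total += len(relevant_ids)
    return hits / total

# e.g. labeled_queries = [("wading birds of North America", ["doc1"]), ...]
```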
00:06:14.580 | Of course, there are great evaluation tools coming onto the market now from several vendors.
00:06:18.760 | Which of these is best, we don't know, but we intend to support all of these with Chroma.
00:06:23.540 | One interesting part about embedding models, and again, this is something that's been well
00:06:28.300 | known in the research community for a while but has been empirically tested recently.
00:06:32.360 | Embedding models with the same training objective, with roughly the same data, tend to learn very
00:06:36.580 | similar representations up to an affine linear transform, which suggests that it's possible to project
00:06:42.140 | one model's embedding space into another model's embedding space by using a simple linear transform.
00:06:46.320 | So the choice of which embedding model you actually want to use might not end up being so important
00:06:50.500 | if you're actually able to apply and figure out that transform from your own data set.
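
As a sketch of that idea: given embeddings of the same texts under two models, the linear map can be fit by ordinary least squares and then used to project between the two spaces. This assumes you have paired embeddings available, and a full affine map would also fit a bias term:

```python
import numpy as np

def fit_projection(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    # emb_a: (n, d_a) and emb_b: (n, d_b) are embeddings of the same
    # n texts under models A and B. Solve min_W ||emb_a @ W - emb_b||.
    W, *_ = np.linalg.lstsq(emb_a, emb_b, rcond=None)
    return W  # shape (d_a, d_b)

def project(emb_a: np.ndarray, W: np.ndarray) -> np.ndarray:
    # Map model A's embeddings into model B's space.
    return emb_a @ W
```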
00:06:57.680 | So the question is how to chunk.
00:07:00.140 | Of course, there's a few things to consider.
00:07:01.680 | Chunking, in part, exists because we have bounded context lengths for our LLMs.
00:07:06.680 | So we want to make sure that the retrieved results can actually fit in that context.
00:07:10.360 | We want to make sure that we retain the semantic content of the data we're aiming to retrieve.
00:07:19.540 | We want to make sure that we retain the relevant semantic content of that data rather than just
00:07:27.040 | semantic content in general.
00:07:28.540 | We also want to make sure that we're respecting the natural structure of the data, because the data,
00:07:32.540 | especially textual data, was often generated for humans to read and understand in the first place.
00:07:36.620 | So the inherent structure of that data provides cues about where the semantic boundaries might be.
00:07:41.560 | Of course, there are tools for chunking.
00:07:43.400 | There's NLTK.
00:07:44.400 | There's LangChain.
00:07:45.400 | LlamaIndex also supports many forms of chunking.
00:07:48.320 | But there are experimental ideas here which we're particularly interested in trying.
00:07:52.840 | One interesting thought that we've had, and one we're experimenting with lightweight open-source
00:07:56.680 | language models to achieve, is using the model's prediction perplexity for the next actual
00:08:01.360 | token in the document, based on a sliding window of previous tokens.
00:08:05.760 | In other words, you can see when the model mispredicts or has a very low probability for the next
00:08:11.500 | actual piece of text, as an indicator of where a semantic boundary in the text might be.
00:08:16.520 | And that might be a natural place to chunk.
00:08:17.880 | And what that also means is because you have a model actually predicting chunk boundaries,
00:08:23.140 | you can then fine tune that model to make sure that chunk boundaries are relevant to your
00:08:26.280 | application.
00:08:27.280 | So this is something that we're actively exploring.
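
A minimal sketch of that perplexity-based idea, assuming a small open model like GPT-2 via Hugging Face transformers. The window size and surprisal threshold are illustrative, and a real implementation would batch the forward passes rather than running one per token:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def surprisal_boundaries(text, window=64, threshold=8.0):
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    boundaries = []
    for i in range(window, len(ids)):
        # Predict the next token from a sliding window of context.
        context = ids[i - window : i].unsqueeze(0)
        with torch.no_grad():
            logits = model(context).logits[0, -1]
        logprobs = torch.log_softmax(logits, dim=-1)
        # High surprisal (-log p) for the token that actually comes
        # next suggests a possible semantic boundary.
        if -logprobs[ids[i]].item() > threshold:
            boundaries.append(i)  # token index of a candidate split
    return boundaries
```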
00:08:29.100 | We can use information hierarchies.
00:08:30.480 | Again, tools like LlamaIndex support information hierarchies out of the box, as well as multiple data sources and
00:08:34.580 | signals for re-ranking.
00:08:36.140 | And we can also try to use embedding continuity.
00:08:38.320 | This is something that we're experimenting with as well, where essentially you take a sliding
00:08:42.120 | window across your documents, embed that sliding window, and look for discontinuities in the
00:08:47.560 | resulting time series.
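
A sketch of that embedding-continuity idea, assuming a sentence-transformers model; the window size and similarity threshold are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def continuity_boundaries(sentences, window=3, threshold=0.6):
    # Embed a sliding window of sentences across the document.
    windows = [" ".join(sentences[i : i + window])
               for i in range(len(sentences) - window + 1)]
    embs = model.encode(windows, normalize_embeddings=True)
    boundaries = []
    for i in range(1, len(embs)):
        # With normalized embeddings, the dot product is cosine
        # similarity; a sharp drop between consecutive windows marks
        # a discontinuity in the resulting "time series".
        if float(np.dot(embs[i - 1], embs[i])) < threshold:
            boundaries.append(i)
    return boundaries
```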
00:08:51.580 | So this is an important question.
00:08:52.320 | I'll give you a demonstration of why being able to compute retrieval result
00:08:58.660 | relevance is actually very important in your application.
00:09:01.940 | Imagine in your application you've gone and you've embedded every English-language Wikipedia
00:09:05.740 | page about birds, and that's what's in your corpus.
00:09:09.060 | And in your traditional retrieval augmented generation system, what you're doing for each
00:09:12.140 | query is just returning the five nearest neighbors and then stuffing them into the model's context
00:09:15.940 | window.
00:09:16.940 | One day, a user's query comes along, and that query is about fish and not birds.
00:09:20.680 | You're guaranteed to return some five nearest neighbors, but you're also guaranteed to not
00:09:25.680 | have a single relevant result among them.
00:09:28.400 | How can you, as an application developer, make that determination?
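
To make that concrete, here is a sketch of the scenario, reusing a Chroma collection like the one in the earlier sketch, now imagined to hold the bird pages. The query always comes back with five neighbors, and the raw distances are the only hint that none of them are relevant:

```python
# The corpus contains only bird pages, but a nearest-neighbor query
# about fish still returns five results; no built-in notion of
# relevance stops it.
results = collection.query(
    query_texts=["What do salmon eat?"],
    n_results=5,
    include=["documents", "distances"],
)
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    # Large distances are at best a weak, model-dependent signal that
    # a result is irrelevant.
    print(f"{dist:.3f}  {doc[:60]}")
```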
00:09:31.500 | So, there's a few possibilities here.
00:09:34.200 | The first, of course, is human feedback around relevancy signal.
00:09:38.340 | The traditional approach in information retrieval is using an auxiliary re-ranking model.
00:09:42.480 | In other words, you take other signals in sort of the query chain: what else was the
00:09:47.560 | user looking at at the time?
00:09:49.360 | What things has the user found to be useful in the past? And you use those as additional signals
00:09:53.500 | around relevancy.
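
One common form of auxiliary re-ranking model, to pick a concrete example not named in the talk, is a cross-encoder that scores each query-candidate pair directly. A minimal sketch using a public sentence-transformers checkpoint:

```python
from sentence_transformers import CrossEncoder

# A small public cross-encoder trained for passage re-ranking; the
# choice of checkpoint is illustrative, not a recommendation.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_k=3):
    # Score each (query, document) pair and keep the best candidates.
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```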
00:09:56.380 | And we can also, of course, do augmented retrieval, which Chroma does out of the box.
00:09:59.560 | We have keyword-based search, and we have metadata-based filtering, so you can scope the search if you
00:10:05.540 | have those additional signals beforehand.
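
A sketch of that scoping using the keyword and metadata filters Chroma supports; the field names and values here are hypothetical:

```python
# Scope the vector search with a metadata filter and a keyword filter
# before nearest-neighbor ranking is applied.
results = collection.query(
    query_texts=["nesting habits"],
    n_results=5,
    where={"family": "Ardeidae"},            # metadata-based filtering
    where_document={"$contains": "heron"},   # keyword-based search
)
```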
00:10:07.720 | Now, to me, the most interesting approach here is actually an algorithmic one.
00:10:12.520 | So what I mean by that is conditional on the data set that you have available and conditional
00:10:17.280 | on what we know about the task that the user is trying to perform, it should be possible
00:10:22.560 | to generate a conditional relevancy signal per user, per task, per model, and per instance
00:10:27.720 | of that task, but this requires a model which can understand the semantics of the query
00:10:32.780 | as well as the content of the data set very well.
00:10:35.620 | This is something that we're experimenting with, and this is another place where we think open-source,
00:10:40.140 | lightweight language models have actually a lot to offer, even at the data layer.
00:10:45.360 | So to talk a little bit about what we're building, this is the advertising portion of my talk.
00:10:51.140 | In core engineering, we're, of course, building our horizontally scalable cluster version.
00:10:54.640 | Single-node Chroma works great, many of you have probably already tried it by now.
00:10:57.700 | It's time to actually make it work across multiple nodes.
00:10:59.700 | By December, we'll have our database-as-a-service technical preview up and ready so you guys can
00:11:03.480 | try Chroma Cloud.
00:11:05.460 | In January, we'll have our hybrid deployments available if you want to run Chroma in your
00:11:08.700 | enterprise cluster.
00:11:09.700 | And along the way, we're building to support multimodal data.
00:11:12.700 | We know that GPT Vision's API is coming very soon, probably at OpenAI's developer day.
00:11:20.760 | Gemini will also have image understanding and voice.
00:11:24.300 | That means that you'll be able to use multimodal data in your retrieval applications for the first
00:11:28.760 | time.
00:11:29.760 | We're no longer just talking about text.
00:11:31.660 | So these questions about relevancy for other types of data become even more important, right?
00:11:35.780 | Because now you start having questions about relevancy, aesthetic quality, all of these other
00:11:39.640 | pieces which you need to make these multimodal retrieval augmented systems work.
00:11:44.660 | And finally, we're working on model selection.
00:11:46.820 | Basically, Chroma wants to do everything in the data layer for you so that, just like with
00:11:53.420 | a modern DBMS, just like you use Postgres in a web application, everything in the data
00:11:58.560 | layer should just work for you as an application developer.
00:12:01.300 | Your focus should be on the application logic and making your application actually run correctly,
00:12:05.320 | and that's what Chroma is building for in AI.
00:12:08.020 | And that's it.
00:12:09.020 | Thank you very much.