
Retrieval Augmented Generation in the Wild: Anton Troynikov


Chapters

0:00 Intro
0:27 Retrieval Loop
1:04 Purpose
1:22 Human Feedback
2:37 Applications
3:37 Challenges
5:06 The bad news
5:26 Which embedding model to use
6:57 How to chunk
8:50 Retrieval relevance
10:44 What we are building
12:07 Outro

Transcript

Hi everybody, as Dave said as I walked up, I'm Anton, I'm the co-founder of Chroma. I'm here to talk to you about retrieval augmented generation in the wild and what it is that Chroma is building for beyond just vector search. By now, you've all seen versions of this probably a half dozen times throughout this conference.

This is the basic retrieval loop that one would use in a RAG application. You have some corpus of documents, you embed them in your favorite vector store, which is Chroma. I mean, check the lanyards, man. You embed your corpus of documents, you have an embedding model for your queries, you find the nearest neighbor vectors for those embeddings and you return the associated documents which, along with the query, you then put into the LLM's context window and return some results.
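To make that loop concrete, here is a minimal sketch using the chromadb Python client. The `call_llm` function is a placeholder for whatever model provider you actually use; it is not part of Chroma's API, and the sample documents are just illustrations.

```python
# A minimal sketch of the open-loop RAG cycle just described, using the
# chromadb Python client. `call_llm` is a placeholder for whatever LLM
# client you actually use; it is not part of Chroma's API.
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="corpus")

# 1. Embed and store the corpus (Chroma applies its default embedding model).
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "Chroma is an open-source embedding database.",
        "Retrieval augmented generation grounds LLM answers in your own data.",
        "Distractors in the context window hurt answer quality.",
    ],
)

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model provider here")

def answer(query: str, n_results: int = 3) -> str:
    # 2. Embed the query and fetch its nearest-neighbor documents.
    hits = collection.query(query_texts=[query], n_results=n_results)
    context = "\n".join(hits["documents"][0])
    # 3. Put query + retrieved documents into the model's context window.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)
```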

Now, this is the basic RAG loop, but I think of this as more like the open-loop retrieval augmented generation application. And my purpose here is to show you that you need a lot more than simple vector search to build some of the more powerful, more promising applications that take RAG into the future.

So, let's look at what some of those might be. The first piece to this, of course, is incorporating human feedback into this loop. Without human feedback, it isn't possible to adapt the data and the embedding model itself to the specific task, to the model, and to the user. Human feedback is required to actually return better results for particular queries on your specific data, on the specific tasks that you want to perform.

So, generally, embedding models are trained in a general context and you actually want to update them for your specific tasks. So, basically, the memory that you're using for your RAG application needs to be able to support this sort of human feedback. Now, the other piece that we've seen is agents. These are currently in the early stages, but they're emerging as something like capable machines, and I think that one of the ways to make agents actually capable is a better RAG system, a better memory for AI.

And that means that your retrieval system, your memory, needs to support self-updates from the agent itself out of the box. All in all, what this means is you have a constantly dynamically updating data set. Something that's built as a search index out of the box is not going to be able to support these types of capabilities.

Next of course, we're talking about agents with world models. So, in other words, the agent needs to be able to store its interaction with the world and update the data that it's working with based on that interaction. And finally, you need to be able to tie all of these together.

Now, this sounds like a very complex system that's frontier research, and it is currently research-grade, but we're seeing some of the first applications of this in the wild already today. I'm sure some of you are familiar with this paper. This is the famous Voyager paper out of Nvidia, where they trained an agent to play Minecraft, to learn how to play it, by learning skills in a particular environment and then recognizing when it's in the same context and recalling that skill.

Now, the other interesting piece to this is that several of the more complex skills were learned through human demonstration and then retained in the retrieval system, which of course was Chroma. My point in showing this to you is that the simple RAG loop might be the bread and butter of most of the applications being developed today.

But the most powerful things that you'll be able to build with AI in the future require a much more capable retrieval system than one that only supports a search index. Now, of course, in retrieval itself there are plenty of challenges. Information retrieval is kind of a classic task, and the setting in which it's been found previously has been in recommender systems and in search systems.

Now that we're all using this in production for AI applications in completely different ways, there's a lot of open questions that haven't really been asked quite in the same way or with quite the same intensity. A key piece of how retrieval needs to function for AI, and anyone who's built one of these is aware of this, is you need to be able to return not just all relevant information, but also no irrelevant information.

It's common knowledge by now, and this is supported by empirical research, that distractors in the model's context cause the performance of the entire AI-based application to fall off a cliff. So what does it mean to actually retrieve relevant information and no irrelevant information? You need to know which embedding model you should be using in the first place, and we've all seen the claims from the different API and embedding model providers.

This one is best for code. This one is best for English language. This one is best for multilingual data sets. But the reality is, the only way to find out which is best for your data set is to have an effective way to figure that out. The next question, of course, is how do I chunk up the data?

Chunking determines what results are available to the model at all. And it's obvious that different types of chunking produce different relevancy in the returned results. And finally, how do we even determine whether a given retrieved result is actually relevant to the task or to the user? So let's dive into some of these in a little bit more depth.

So the bad news is, again, nobody really has the answers. Despite the fact that information retrieval is a long-studied problem, there aren't great solutions to these problems today. But the good news is that these are important, and increasingly important, problems. And we now see much more production data, rather than just academic benchmarks, that we can work from to solve some of these for the first time.

So first, the question of which embedding model we should be using. Of course, there are existing academic benchmarks, and for now, these appear to be mostly saturated. The reason for that is these are synthetic benchmarks designed specifically for the information retrieval problem and don't necessarily reflect how retrieval systems are used in AI use cases.

So what can you do about that? You can take some of the open source tooling built to build these benchmarks in the first place and apply it to your data sets and your use cases. You can use human feedback on relevance by adding a simple relevance feedback endpoint. And this is something that Chroma is building to support in the very near future.
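Since that relevance feedback endpoint isn't available yet, here is only a hypothetical sketch of the kind of feedback you could already log yourself, plus a simple recall@k you could compute from it. The file name and record fields are illustrative assumptions, not an existing Chroma feature.

```python
# Hypothetical sketch only: Chroma's relevance feedback endpoint is still
# forthcoming, so this just shows the shape of feedback you could log today
# and a simple recall@k you could compute from it. File name and fields are
# illustrative assumptions.
import json
import time

FEEDBACK_LOG = "relevance_feedback.jsonl"  # assumed local log

def log_feedback(query: str, returned_ids: list[str], relevant_ids: list[str]) -> None:
    # Record which of the returned documents the user actually found relevant.
    record = {
        "ts": time.time(),
        "query": query,
        "returned_ids": returned_ids,
        "relevant_ids": relevant_ids,
    }
    with open(FEEDBACK_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

def recall_at_k(records: list[dict], k: int = 5) -> float:
    # Fraction of logged queries with at least one relevant doc in the top k.
    hits = sum(1 for r in records if set(r["returned_ids"][:k]) & set(r["relevant_ids"]))
    return hits / max(len(records), 1)
```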

You can construct your own data sets, because you're seeing your data in production and you know what actually matters to you. And then you need a way to effectively evaluate the performance of particular embedding models. Of course, there are great evaluation tools coming onto the market now from several vendors.

Which of these is best, we don't know, but we intend to support all of them with Chroma. One interesting fact about embedding models, and again, this is something that's been well known in the research community for a while but has been empirically tested recently, is that embedding models with the same training objective, trained on roughly the same data, tend to learn very similar representations up to an affine linear transform. That suggests it's possible to project one model's embedding space into another model's embedding space by using a simple linear transform.
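As a rough illustration of that claim, here is a sketch of fitting such an affine map with ordinary least squares, assuming you have already embedded the same sample of documents with two different models. The embedding arrays are assumed inputs; nothing here is specific to any particular provider.

```python
# A sketch of the "similar up to an affine transform" observation: fit a map
# from model A's embedding space into model B's with least squares, given the
# same documents embedded by both models.
import numpy as np

def fit_affine_projection(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Return W such that [emb_a, 1] @ W ~= emb_b (shapes [n, d_a] and [n, d_b])."""
    X = np.hstack([emb_a, np.ones((len(emb_a), 1))])  # bias column makes it affine
    W, *_ = np.linalg.lstsq(X, emb_b, rcond=None)
    return W

def project(emb_a: np.ndarray, W: np.ndarray) -> np.ndarray:
    # Map embeddings from model A's space into model B's space.
    X = np.hstack([emb_a, np.ones((len(emb_a), 1))])
    return X @ W
```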

So the choice of which embedding model you actually want to use might not end up being so important if you're able to figure out and apply that transform from your own data set. So the next question is how to chunk. Of course, there are a few things to consider.

Chunking, in part, exists because we have bounded context lengths for our LLMs. So we want to make sure that the retrieve results can actually fit in that context. We want to make sure that we retain the semantic content of the data we're aiming to retrieve. We want to make sure that we retain the relevant semantic content of that data rather than just semantic content in general.

We also want to make sure that we're respecting the natural structure of the data, because often that data, especially textual data, was generated for humans to read and understand in the first place. So the inherent structure of the data provides cues about where the semantic boundaries might be. Of course, there are tools for chunking.

There's NLTK. There's LangChain. LlamaIndex also supports many forms of chunking. But there are experimental ideas here which we're particularly interested in trying. One interesting idea that we've had, and we're experimenting with lightweight open-source language models to achieve this, is using the model's prediction perplexity for the next actual token in the document, based on a sliding window of previous tokens.

In other words, you can use the points where the model mispredicts, or assigns a very low probability to the next actual piece of text, as an indicator of where a semantic boundary in the text might be, and that might be a natural place to chunk. And because you have a model actually predicting chunk boundaries, you can then fine-tune that model to make sure that chunk boundaries are relevant to your application.
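Here is a sketch of that perplexity-based boundary idea using a small open-source causal language model from Hugging Face. The choice of gpt2 and the surprisal threshold are illustrative assumptions, and running one forward pass per token is deliberately unoptimized.

```python
# A sketch of perplexity-based chunking: score each token's surprisal under a
# lightweight open-source LM given a sliding window of preceding tokens, and
# treat surprisal spikes as candidate semantic boundaries.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def boundary_candidates(text: str, window: int = 64, threshold: float = 8.0) -> list[int]:
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    boundaries = []
    for i in range(1, len(ids)):
        context = ids[max(0, i - window):i].unsqueeze(0)
        with torch.no_grad():
            logits = model(context).logits[0, -1]
        logprobs = torch.log_softmax(logits, dim=-1)
        surprisal = -logprobs[ids[i]].item()  # high = model "surprised" by this token
        if surprisal > threshold:
            boundaries.append(i)  # token index of a candidate chunk boundary
    return boundaries
```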

So this is something that we're actively exploring. We can use information hierarchies; again, tools like LlamaIndex support information hierarchies out of the box, along with multiple data sources and signals for re-ranking. And we can also try to use embedding continuity. This is something that we're experimenting with as well, where essentially you take a sliding window across your documents, embed that sliding window, and look for discontinuities in the resulting time series.
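And here is a sketch of that embedding-continuity idea, assuming sentence-level splitting and an off-the-shelf sentence-transformers model; the window size and similarity threshold are assumptions to tune on your own data.

```python
# A sketch of embedding-continuity chunking: embed a sliding window of
# sentences and look for drops in cosine similarity between consecutive
# windows as candidate boundaries.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def discontinuities(sentences: list[str], window: int = 3, min_sim: float = 0.5) -> list[int]:
    windows = [" ".join(sentences[i:i + window]) for i in range(len(sentences) - window + 1)]
    emb = model.encode(windows, normalize_embeddings=True)
    sims = np.sum(emb[:-1] * emb[1:], axis=1)  # cosine similarity of adjacent windows
    # Positions where similarity drops below the threshold: likely topic shifts.
    return [i + window for i, s in enumerate(sims) if s < min_sim]
```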

So this is an important question. I'll give you a demonstration of why being able to compute retrieval result relevance is actually very important in your application. Imagine in your application you've gone and embedded every English-language Wikipedia page about birds, and that's what's in your corpus.

And in your traditional retrieval augmented generation system, what you're doing for each query is just returning the five nearest neighbors and then stuffing them into the model's context window. One day, a user's query comes along, and that query is about fish and not birds. You're guaranteed to return some five nearest neighbors, but you're also guaranteed to not have a single relevant result among them.

How can you, as an application developer, make that determination? So, there are a few possibilities here. The first, of course, is human feedback as a relevancy signal. The traditional approach in information retrieval is using an auxiliary re-ranking model. In other words, you take other signals in the query chain: what else was the user looking at at the time?

What has the user found to be useful in the past? And you use those as additional signals around relevancy. We can also, of course, do augmented retrieval, which Chroma does out of the box. We have keyword-based search and we have metadata-based filtering, so you can scope the search if you have those additional signals beforehand.
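For the birds-and-fish example above, a crude but practical guard combines the distances Chroma returns with those out-of-the-box filters. The `collection` is assumed to be like the one in the earlier sketch, and the distance cutoff is an assumption you would calibrate against labeled feedback.

```python
# For the birds-and-fish case, a crude guard: use the distances Chroma returns
# as a relevance cutoff, and optionally scope the search with the built-in
# metadata and keyword filters.
results = collection.query(
    query_texts=["What do salmon eat?"],
    n_results=5,
    # Optional scoping, if you have these signals beforehand:
    # where={"source": "wikipedia"},          # metadata-based filtering
    # where_document={"$contains": "diet"},   # keyword-based search
    include=["documents", "distances"],
)

MAX_DISTANCE = 0.8  # assumed cutoff, not a universal constant
docs = results["documents"][0]
dists = results["distances"][0]
relevant = [doc for doc, dist in zip(docs, dists) if dist <= MAX_DISTANCE]

if not relevant:
    # Every neighbor is too far away: likely nothing relevant in the corpus,
    # so don't stuff distractors into the model's context window.
    print("No relevant documents found for this query.")
```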

Now, to me, the most interesting approach here is actually an algorithmic one. What I mean by that is: conditional on the data set that you have available, and conditional on what you know about the task that the user is trying to perform, it should be possible to generate a conditional relevancy signal per user, per task, per model, and per instance of that task. But this requires a model which understands the semantics of the query, as well as the content of the data set, very well.

This is something that we're experimenting with, and this is another place where we think open-source, lightweight language models have actually a lot to offer, even at the data layer. So to talk a little bit about what we're building, this is the advertising portion of my talk. In core engineering, we're, of course, building our horizontally scalable cluster version.

Single-node Chroma works great, and many of you have probably already tried it by now. It's time to actually make it work across multiple nodes. By December, we'll have our database-as-a-service technical preview up and ready so you can try Chroma Cloud. In January, we'll have our hybrid deployments available if you want to run Chroma in your enterprise cluster.

And along the way, we're building to support multimodal data. We know that GPT Vision's API is coming very soon, probably at OpenAI's developer day. Gemini will also have image understanding and voice. That means that you'll be able to use multimodal data in your retrieval applications for the first time.

We're no longer just talking about text. So these questions about relevancy and other types of data become even more important, right? Because now you start having questions about relevancy, aesthetic quality, all of these other pieces which you need to make these multimodal retrieval augmented systems work. And finally, we're working on model selection.

Basically, Chroma wants to do everything in the data layer for you so that, just like with a modern DBMS, just like when you use Postgres in a web application, everything in the data layer should just work for you as an application developer. Your focus should be on the application logic and making your application actually run correctly, and that's what Chroma is building for in AI.

And that's it. Thank you very much.