Retrieval Augmented Generation in the Wild: Anton Troynikov

Chapters
0:00 Intro
0:27 Retrieval Loop
1:04 Purpose
1:22 Human Feedback
2:37 Applications
3:37 Challenges
5:06 The bad news
5:26 Which embedding model to use
6:57 How to chunk
8:50 Retrieval relevance
10:44 What we are building
12:07 Outro
00:00:00.600 |
Hi everybody, as Dave said as I walked up, I'm Anton, I'm the co-founder of Chroma. 00:00:18.280 |
I'm here to talk to you about retrieval augmented generation in the wild and what it is that 00:00:23.720 |
Chroma is building for beyond just vector search. 00:00:28.320 |
By now, you've all seen versions of this probably a half dozen times throughout this conference. 00:00:33.420 |
This is the basic retrieval loop that one would use in a RAG application. 00:00:37.820 |
You have some corpus of documents, and you embed them in your favorite vector store. 00:00:48.980 |
You have an embedding model for your queries, you find 00:00:54.300 |
the nearest neighbor vectors for those embeddings, and you return the associated documents which, 00:00:57.840 |
along with the query, you then put into the LLM's context window and return some results. 00:01:02.940 |
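To make that concrete, here is a minimal sketch of the loop using Chroma's Python client; the collection name, documents, and prompt template are hypothetical stand-ins, not anything from the talk.

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")  # hypothetical collection name

# Embed the corpus; Chroma applies a default embedding function on add.
collection.add(
    ids=["doc1", "doc2"],
    documents=["Penguins are flightless birds.", "Hawks hunt during the day."],
)

# Embed the query, find its nearest neighbors, and get the associated documents back.
query = "Which birds cannot fly?"
results = collection.query(query_texts=[query], n_results=2)
context = "\n".join(results["documents"][0])

# Put the retrieved documents, along with the query, into the LLM's context window.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# response = llm(prompt)  # call your LLM of choice here
```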
Now, this is the basic RAG loop, but I think of this as more like open-loop retrieval. 00:01:10.560 |
And my purpose in showing you all this is to show you that you need a lot more than simple 00:01:14.940 |
vector search to build some of the more powerful, more promising applications that take RAG in new directions. 00:01:23.020 |
The first piece to this, of course, is incorporating human feedback into this loop. 00:01:26.940 |
Without human feedback, it isn't possible to adapt the data, or the embedding model 00:01:32.700 |
itself, to the specific task, to the model, and to the user. 00:01:36.940 |
Human feedback is required to actually return better results for particular queries on your 00:01:42.240 |
specific data, on the specific tasks that you want to perform. 00:01:44.940 |
So, generally, embedding models are trained in a general context, and you actually want 00:01:50.040 |
them adapted to your specific data and task. So, basically, the memory that you're using for your RAG application needs to be able to learn from that feedback. 00:01:56.840 |
Now, the other piece that we've seen, and these are currently in the early stages, is agents; they're 00:02:01.500 |
emerging as something like capable machines, and I think that one of the ways to make agents 00:02:05.460 |
actually capable is a better RAG system, a better memory for AI. 00:02:09.920 |
And that means that your retrieval system, your memory, needs to support self-updates from 00:02:17.760 |
the agent itself. All in all, what this means is you have a constantly, dynamically updating data set. 00:02:21.340 |
Something that's built as a search index out of the box is not going to be able to support that. 00:02:27.520 |
Next, of course, we're talking about agents with world models. 00:02:29.560 |
So, in other words, the agent needs to be able to store its interaction with the world and 00:02:33.620 |
update the data that it's working with based on that interaction. 00:02:37.780 |
And finally, you need to be able to tie all of these together. 00:02:40.600 |
Now, this sounds like a very complex system that's frontier research, and it is currently 00:02:45.200 |
research-grade, but we're seeing some of the first applications of this in the wild already. 00:02:51.440 |
I'm sure some of you are familiar with this paper. 00:02:54.400 |
This is the famous Voyager paper out of Nvidia, where they trained an agent to play Minecraft, 00:03:00.500 |
to learn how to play it, by learning skills in a particular environment and then recognizing 00:03:04.900 |
when it's in the same context and recalling that skill. 00:03:07.400 |
Now, the other interesting piece to this is several of the more complex skills were learned 00:03:11.640 |
through human demonstration and then retained in the retrieval system, which of course was a vector store. 00:03:17.360 |
My point in showing this to you is that the simple RAG loop might be the bread and butter 00:03:23.360 |
of most of the applications being developed today. 00:03:25.320 |
But the most powerful things that you'll be able to build with AI in the future require 00:03:29.500 |
a much more capable retrieval system than one that only supports a search index. 00:03:36.420 |
Now, of course, in retrieval itself there are plenty of challenges. 00:03:41.320 |
Information retrieval is kind of a classic task, and the settings in which it's been found previously 00:03:46.040 |
have been recommender systems and search systems. 00:03:49.940 |
Now that we're all using this in production for AI applications in completely different 00:03:53.360 |
ways, there are a lot of open questions that haven't really been asked quite in the same way before. 00:04:00.020 |
A key piece of how retrieval needs to function for AI, and anyone who's built one of these is 00:04:04.060 |
aware of this, is you need to be able to return not just all relevant information, but also no irrelevant information. 00:04:10.680 |
It's common knowledge by now, and this is supported by empirical research, that distractors 00:04:14.840 |
in the model context cause the performance of the entire AI-based application to fall off. 00:04:21.920 |
So what does it mean to actually retrieve relevant info and no irrelevant info? 00:04:25.840 |
You need to know which embedding model to use in the first place, 00:04:28.860 |
and we've all seen the claims from the different API and embedding model providers. 00:04:37.480 |
But the reality is, the only way to find out which is best for your data set is to evaluate them on your own data. 00:04:43.640 |
The next question, of course, is how do I chunk up the data? 00:04:46.360 |
Chunking determines what results are available to the model at all. 00:04:52.040 |
And it's obvious that different types of chunking produce different relevancy in the returned results. 00:04:57.560 |
And finally, how do we even determine whether a given retrieved result is actually relevant to the query? 00:05:03.040 |
So let's dive into some of these in a little bit more depth. 00:05:06.760 |
So the bad news is, again, nobody really has the answers. 00:05:09.120 |
Despite the fact that information retrieval is a long-studied problem, there isn't a great solution to these questions yet. 00:05:14.560 |
But the good news is that these are important, and increasingly important, problems. 00:05:18.620 |
And we see much more production data, rather than just academic benchmarks, that we can 00:05:23.820 |
work from to solve some of these for the first time. 00:05:27.380 |
So first, the question of which embedding model we should be using. 00:05:30.100 |
Of course, there are existing academic benchmarks. 00:05:32.100 |
And for now, these appear to be mostly saturated. 00:05:35.540 |
The reason for that is these are synthetic benchmarks designed specifically for the information retrieval 00:05:39.620 |
problem and don't necessarily reflect how retrieval systems are used in AI use cases. 00:05:46.820 |
You can take some of the open source tooling used to build these benchmarks in the first 00:05:50.860 |
place and apply it to your data sets and your use cases. 00:05:55.200 |
You can use human feedback on relevance by adding a simple relevance feedback endpoint. 00:05:59.380 |
And this is something that Chroma is building to support in the very near future. 00:06:02.880 |
You can construct your own data sets because you're viewing your data in production. 00:06:08.260 |
And then you need a way to effectively evaluate the performance of particular embedding models. 00:06:14.580 |
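As a sketch of what that evaluation could look like (the labeled queries below are hypothetical, built from production traffic plus relevance feedback), you can compare candidate embedding models on recall@k over your own data:

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the truly relevant documents that appear in the top-k results."""
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

# Hypothetical labeled set built from production queries plus relevance feedback.
labeled_queries = [
    {"query": "which birds cannot fly", "relevant": {"penguin", "ostrich"}},
]

def evaluate(retrieve, k: int = 5) -> float:
    """Mean recall@k for `retrieve(query, k) -> list of document ids`.

    Run this once per candidate embedding model and compare the scores.
    """
    scores = [
        recall_at_k(retrieve(item["query"], k), item["relevant"], k)
        for item in labeled_queries
    ]
    return sum(scores) / len(scores)
```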
Of course, there are great evaluation tools coming on to the market now from several vendors. 00:06:18.760 |
Which of these is best, we don't know, but we intend to support all of these with Chroma. 00:06:23.540 |
One interesting part about embedding models, and again, this is something that's been well 00:06:28.300 |
known in the research community for a while but has been empirically tested recently. 00:06:32.360 |
Embedding models with the same training objective, with roughly the same data, tend to learn very 00:06:36.580 |
similar representations up to an affine linear transform, which suggests that it's possible to project 00:06:42.140 |
one model's embedding space into another model's embedding space by using a simple linear transform. 00:06:46.320 |
So the choice of which embedding model you actually want to use might not end up being so important 00:06:50.500 |
if you're actually able to figure out and apply that transform from your own data set. 00:07:01.680 |
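A minimal sketch of that idea: given embeddings of the same documents from two models (the arrays below are random stand-ins for real model outputs), fit the linear map with least squares and use it to project one space into the other.

```python
import numpy as np

# Embeddings of the same N documents under two different embedding models.
emb_a = np.random.randn(1000, 384)   # stand-in for model A's outputs, shape (N, d_a)
emb_b = np.random.randn(1000, 768)   # stand-in for model B's outputs, shape (N, d_b)

# Least-squares fit of W so that emb_a @ W approximates emb_b.
W, *_ = np.linalg.lstsq(emb_a, emb_b, rcond=None)

def project_a_to_b(vectors: np.ndarray) -> np.ndarray:
    """Project embeddings from model A's space into model B's space."""
    return vectors @ W
```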
Chunking, in part, exists because we have bounded context lengths for our LLMs. 00:07:06.680 |
So we want to make sure that the retrieved results can actually fit in that context. 00:07:10.360 |
We want to make sure that we retain the semantic content of the data we're aiming to retrieve. 00:07:19.540 |
We want to make sure that we retain the relevant semantic content of that data, rather than just splitting at arbitrary points. 00:07:28.540 |
We also want to make sure that we're respecting the natural structure of the data, because often 00:07:32.540 |
the data, especially textual data, was generated for humans to read and understand in the first place. 00:07:36.620 |
So the inherent structure of that data provides cues about where the semantic boundaries might be. 00:07:45.400 |
LlamaIndex also supports many forms of chunking. 00:07:48.320 |
But there are experimental ideas here which we're particularly interested in trying. 00:07:52.840 |
One interesting thought that we've had, and we're experimenting with lightweight open source 00:07:56.680 |
language models to achieve this, is using the model's prediction perplexity for the next actual 00:08:01.360 |
token in the document, based on a sliding window of previous tokens. 00:08:05.760 |
In other words, you can use the points where the model mispredicts, or assigns a very low probability to the next 00:08:11.500 |
actual piece of text, as an indicator of where a semantic boundary in the text might be. 00:08:17.880 |
And what that also means is because you have a model actually predicting chunk boundaries, 00:08:23.140 |
you can then fine-tune that model to make sure that chunk boundaries are relevant to your own data. 00:08:27.280 |
So this is something that we're actively exploring. 00:08:30.480 |
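A rough sketch of that surprisal idea, using a small open model via Hugging Face transformers; the window size and threshold are arbitrary choices for illustration, and a real implementation would batch the forward passes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any small open causal LM works
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def boundary_candidates(text: str, window: int = 64, threshold: float = 8.0) -> list[int]:
    """Token indices where the model is 'surprised' by the next actual token.

    Runs one forward pass per position for clarity; batch these in practice.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    boundaries = []
    for i in range(window, len(ids)):
        context = ids[i - window : i].unsqueeze(0)  # sliding window of previous tokens
        with torch.no_grad():
            logits = model(context).logits[0, -1]   # distribution over the next token
        # Surprisal (negative log-probability) of the token that actually comes next.
        surprisal = -torch.log_softmax(logits, dim=-1)[ids[i]].item()
        if surprisal > threshold:
            boundaries.append(i)  # candidate semantic boundary before token i
    return boundaries
```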
Again, tools like LlamaIndex support information hierarchies out of the box, multiple data sources, and more. 00:08:36.140 |
And we can also try to use embedding continuity. 00:08:38.320 |
This is something that we're experimenting with as well, where essentially you take a sliding 00:08:42.120 |
window across your documents, embed that sliding window, and look for discontinuities in the embeddings between adjacent windows. 00:08:52.320 |
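And a sketch of that embedding-continuity idea, where `embed` stands in for any sentence-embedding function and the threshold is an arbitrary placeholder:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def split_points(sentences: list[str], embed, window: int = 3, threshold: float = 0.7) -> list[int]:
    """Sentence indices where one window's embedding is dissimilar from the next window's."""
    windows = [" ".join(sentences[i : i + window]) for i in range(len(sentences) - window + 1)]
    vectors = [embed(w) for w in windows]  # `embed` maps text -> np.ndarray
    return [
        i + window  # split after the window where embedding continuity breaks
        for i, (u, v) in enumerate(zip(vectors, vectors[1:]))
        if cosine(u, v) < threshold
    ]
```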
I'll give you a demonstration of why being able to compute retrieval result 00:08:58.660 |
relevance is actually very important in your application. 00:09:01.940 |
Imagine in your application you've gone and you've embedded every English-language Wikipedia 00:09:05.740 |
page about birds, and that's what's in your corpus. 00:09:09.060 |
And in your traditional retrieval augmented generation system, what you're doing for each 00:09:12.140 |
query is just returning the five nearest neighbors and then stuffing them into the model's context window. 00:09:16.940 |
One day, a user's query comes along, and that query is about fish and not birds. 00:09:20.680 |
You're guaranteed to return some five nearest neighbors, but you're also guaranteed to not return anything relevant. 00:09:28.400 |
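You can see this concretely with Chroma, which can return distances alongside documents (reusing a `collection` like the one in the first sketch); thresholding those distances is one crude way to flag the fish-among-birds case:

```python
# Query a birds-only collection with an off-topic question about fish.
results = collection.query(
    query_texts=["How do salmon swim upstream?"],
    n_results=5,
    include=["documents", "distances"],
)

# Five nearest neighbors always come back; the distances say how near they really are.
MAX_DISTANCE = 0.5  # arbitrary cutoff; tune it against your own relevance feedback
relevant = [
    doc
    for doc, dist in zip(results["documents"][0], results["distances"][0])
    if dist < MAX_DISTANCE
]
```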
How can you, as an application developer, make that determination? 00:09:34.200 |
The first, of course, is human feedback as a relevancy signal. 00:09:38.340 |
The traditional approach in information retrieval is using an auxiliary re-ranking model. 00:09:42.480 |
In other words, you take other signals in the query chain: what else was the 00:09:49.360 |
user searching for, what has the user found to be useful in the past, and use those as additional signals for ranking. 00:09:56.380 |
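As a sketch of the re-ranking step, here is an off-the-shelf cross-encoder from sentence-transformers; the model name is one common public choice, not a recommendation from the talk:

```python
from sentence_transformers import CrossEncoder

# A small cross-encoder trained for passage re-ranking.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list[str], top_k: int = 5) -> list[str]:
    """Score each (query, document) pair and return documents by descending relevance."""
    scores = reranker.predict([(query, doc) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```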
And we can also, of course, do augmented retrieval, which Chroma does out of the box. 00:09:59.560 |
We have keyword-based search, and we have metadata-based filtering, so you can scope the search if you need to. 00:10:07.720 |
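For example, Chroma's query API can combine vector search with metadata filters and document-content filters (the metadata field below is hypothetical):

```python
results = collection.query(
    query_texts=["nocturnal hunters"],
    n_results=5,
    where={"family": "owls"},                  # metadata-based filtering
    where_document={"$contains": "nocturnal"}, # keyword-style scoping on document text
)
```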
Now, to me, the most interesting approach here is actually an algorithmic one. 00:10:12.520 |
So what I mean by that is conditional on the data set that you have available and conditional 00:10:17.280 |
on what we know about the task that the user is trying to perform, it should be possible 00:10:22.560 |
to generate a conditional relevancy signal per user, per task, per model, and per instance 00:10:27.720 |
of that task, but this requires a model which can understand the semantics of the query 00:10:32.780 |
as well as the content of the data set very well. 00:10:35.620 |
This is something that we're experimenting with, and this is another place where we think open-source, 00:10:40.140 |
lightweight language models have actually a lot to offer, even at the data layer. 00:10:45.360 |
So to talk a little bit about what we're building, this is the advertising portion of my talk. 00:10:51.140 |
In core engineering, we're, of course, building our horizontally scalable cluster version. 00:10:54.640 |
Single-node Chroma works great, many of you have probably already tried it by now. 00:10:57.700 |
It's time to actually make it work across multiple nodes. 00:10:59.700 |
By December, we'll have our database-as-a-service technical preview up and ready so you guys can try it out. 00:11:05.460 |
In January, we'll have our hybrid deployments available if you want to run Chroma in your own environment. 00:11:09.700 |
And along the way, we're building to support multimodal data. 00:11:12.700 |
We know that GPT-4 Vision's API is coming very soon, probably at OpenAI's developer day. 00:11:20.760 |
Gemini will also have image understanding and voice. 00:11:24.300 |
That means that you'll be able to use multimodal data in your retrieval applications for the first time. 00:11:31.660 |
So these questions about relevancy and other types of data become even more important, right? 00:11:35.780 |
Because now you start having questions about relevancy, aesthetic quality, all of these other 00:11:39.640 |
pieces which you need to make these multimodal retrieval augmented systems work. 00:11:44.660 |
And finally, we're working on model selection. 00:11:46.820 |
Basically, Chroma wants to do everything in the data layer for you so that, just like with 00:11:53.420 |
a modern DBMS, just like you use Postgres in a web application, everything in the data 00:11:58.560 |
layer should just work for you as an application developer. 00:12:01.300 |
Your focus should be on the application logic and making your application actually run correctly, 00:12:05.320 |
and that's what Chroma is building for in AI.