
RAG in 2025: State of the Art and the Road Forward — Tengyu Ma, MongoDB (acq. Voyage AI)


Transcript

Thanks for coming, and thanks for having me here. I'm Tengyu Ma. I was the CEO and co-founder of Voyage AI; we recently got acquired by MongoDB. I'm also teaching at Stanford. This talk is about RAG, which is the main focus of Voyage AI, the startup focused on making retrieval better.

But I will talk about RAG in general, and we'll also touch very quickly on some of the products we make. So why are we doing RAG in the first place? The main reason is that large language models, and these days agents, which are built on large language models as well, don't know your company's data.

Out of the box, they cannot have proprietary information from any company, because if they knew anything about what MongoDB, for example, has internally, then that data was leaked. So that means if you want to apply any of this in the enterprise, you need to ingest a lot of proprietary data.

So I'm going to discuss which kinds of technologies enable us to ingest that data. There are a few options: RAG, fine-tuning, and long context, which are all ways to ingest data, and I'll focus on RAG for the rest of the talk. For this audience, probably most people know these technologies, and they are all very simple at a high level.

Long context is the simplest. You just dump all your documents into a large language model's context, maybe 1 million tokens, maybe 1 billion tokens. Then you have a query, and you just get a response. Fine-tuning means you first fine-tune a large language model.

You update the parameters, and then you say, "I'm not going to look at the documents anymore; when the query comes, I just use the updated parameters to generate the response." And RAG is also pretty simple. Basically, what happens is that, on the fly, you use the query to retrieve some subset of the documents.

You use a retrieval or search method, and you get some relevant documents. You give this small set of relevant documents to the large language model, and then you generate the response based on that context. So this is my one-slide summary of how I think about the differences between these technologies.

Some of this is inspired by research at Stanford. When we started to build Voyage, we believed in RAG, and one of the reasons is that we didn't believe fine-tuning can work. And for long context, I also don't really believe it can be cost-efficient in the long run.

The way I think about this is to make an analogy to how humans learn from or use additional proprietary information. In some sense, long context is like skimming an entire library to answer any single question, right?

Every time you answer a question, you need to go through the entire library, which probably has a billion tokens. And fine-tuning is like reading the library in advance: you memorize the books, you try to internalize them in your brain, in your neurons, in your synapses, and you basically rewire your brain so that you really know all of it deeply.

The challenge is that this is very difficult and somewhat unnecessary, because you cannot really memorize all the books in the world, and if you only memorize a subset of them, deciding which subset to memorize is tricky as well.

Another thing is that it makes forgetting knowledge tricky, because you don't know which part of the knowledge you should forget or how to cleanly forget all of it. It also makes access control and data governance tricky, because there are so many books in the library, not everyone can access everything, and it's unclear how to organize all of that.

On the other hand, RAG is very simple and modularized, as I've shown, and very reliable, and also fast and cheap. And it's similar to how humans actually use libraries, right? You retrieve the most relevant books or book chapters and then answer the question.

It's a hierarchical way to store information, right? You don't put all of the information in your brain; you put it in a library and use it when you need it. So that's why I believe in RAG, and this is how you implement the retrieval part.

Basically, it breaks down into two components, or actually three if you are advanced. There are embedding models, which vectorize the documents and queries into vectors, and the vectors are representations of the content or meaning of the documents and queries. Then you use a vector database to store the data and also to search, with k-nearest-neighbor search in the vector space.
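As a concrete illustration, here is a minimal sketch of that retrieval step in Python. The `embed()` function is a toy stand-in for a real embedding API (for example a Voyage or OpenAI endpoint), and the vector database is just a NumPy matrix with brute-force k-nearest-neighbor search.

```python
import numpy as np

DIM = 256

def embed(text: str) -> np.ndarray:
    """Toy stand-in for a real embedding model (e.g. a Voyage or OpenAI
    endpoint): a hashed bag-of-words vector, normalized to unit length."""
    v = np.zeros(DIM)
    for tok in text.lower().split():
        v[hash(tok) % DIM] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

documents = [
    "MongoDB acquired Voyage AI, a company building embedding models.",
    "RAG retrieves relevant chunks and feeds them to the language model.",
    "Fine-tuning updates model parameters on your own data.",
]

# Index: embed every document once and stack into a matrix
# (in production this would live in a vector database).
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 2) -> list[str]:
    """Brute-force k-nearest-neighbor search by cosine similarity."""
    scores = doc_vectors @ embed(query)   # cosine, since vectors are unit-norm
    top = np.argsort(-scores)[:k]
    return [documents[int(i)] for i in top]

# The retrieved chunks become the context for the generation step.
context = "\n\n".join(retrieve("How does RAG use retrieval?"))
```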

Then you get the relevant documents, and you can use a large language model to generate answers. We have seen significant improvements in retrieval accuracy in the last two years. When we started Voyage, I think OpenAI's v3 embeddings were not yet launched; I think they launched about 1.5 years ago.

In the last 1.5 years, Voyage has made significant progress, and Cohere has also made some progress. We can see that the new models have much better accuracy at lower cost, and generally we have a much better scaling law: for the same number of parameters, the quality gets better.

Or for the same quality, the models become smaller and cheaper. All of this comes from optimizing the research and training stack as much as possible, all the way from data curation and data selection to architecture, loss functions, evaluation, and so on.

And we still believe there's a lot of headroom here, because right now, in this plot, we are averaging over about 100 datasets and the accuracy is about 80%. That means there is still probably 20% of headroom for improvement.

That said, just to be clear, it's not that every dataset only gets 80% accuracy. For probably half of the datasets, the accuracy is 90% or even 95%, and for some of the others it's around 60, sometimes 20 or 30.

That's why the average is 80%. Basically, I'm saying that for some of the common tasks, you can already get very high accuracy in the retrieval step. Another thing that Voyage and other companies have offered is so-called Matryoshka learning and also quantization-aware training.

These are two approaches to reduce the storage cost of the vectors. Matryoshka learning means that even though you have a high-dimensional embedding, you can use a subset of the coordinates, usually the first ones. Say you have a 2048-dimensional vector; the first 256 dimensions still form a reasonable embedding.

The accuracy won't be as high as with all 2048 dimensions, but it will be almost the same, maybe with a 1% or 2% loss. Quantization is in a similar vein: even if you lower the precision of the vectors, you still get pretty high performance. And you can see the trade-off on the right of the figure here.
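As a concrete sketch of both ideas, here is what truncation and quantization look like on the consumer side, assuming the embedding model was trained Matryoshka-style so the leading dimensions carry most of the signal; the dimensions and the symmetric int8 scheme are illustrative, not any product's exact format.

```python
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep only the first `dims` coordinates of a Matryoshka-style
    embedding and re-normalize, e.g. 2048 -> 256 (8x less storage)."""
    sub = vec[:dims]
    return sub / np.linalg.norm(sub)

def quantize_int8(vec: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric int8 quantization: 4x less storage than float32.
    Returns the quantized vector plus the scale needed to score with it."""
    scale = max(float(np.abs(vec).max()), 1e-12) / 127.0
    return np.round(vec / scale).astype(np.int8), scale

# Two full-precision 2048-dim embeddings (e.g. a query and a document).
query = np.random.randn(2048).astype(np.float32); query /= np.linalg.norm(query)
doc = np.random.randn(2048).astype(np.float32); doc /= np.linalg.norm(doc)

q_small, q_scale = quantize_int8(truncate_matryoshka(query))   # ~32x smaller overall
d_small, d_scale = quantize_int8(truncate_matryoshka(doc))

# Similarity is computed directly on the compressed vectors, with only a
# small loss in retrieval accuracy.
score = float(q_small.astype(np.int32) @ d_small.astype(np.int32)) * q_scale * d_scale
```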

Basically, you can save at least 10x without losing much, and you can go to 100x. If you save 100x in storage cost, you start to lose probably 5 to 10%. But Voyage is doing a great job here, because you can save 100x and still do better than OpenAI.

That's just because the Pareto frontier here is different. And you can actually see a better trade-off for domain-specific models, which I'm going to discuss in a moment. I have nine minutes here, so I will just quickly go through some of the techniques you can use.

The next question is how you do better RAG besides using better embedding models, which is probably the simplest way. I'm just going to go through these quickly. One technique is to use hybrid search and re-rankers: you can use lexical search and other kinds of search and then combine them with a re-ranker.
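Here is a minimal sketch of that combination. It reuses the `embed()` toy from the earlier sketch for the dense side, uses term overlap as a stand-in for a real lexical scorer like BM25, fuses the two rankings with reciprocal rank fusion, and ends with a hypothetical `rerank()` call that stands in for a re-ranker API; the fusion rule and names are illustrative choices, not a specific product's pipeline.

```python
import numpy as np

def lexical_scores(query: str, docs: list[str]) -> np.ndarray:
    """Toy lexical search: query-term overlap (use BM25 in practice)."""
    q_terms = set(query.lower().split())
    return np.array([len(q_terms & set(d.lower().split())) for d in docs], dtype=float)

def dense_scores(query: str, docs: list[str]) -> np.ndarray:
    """Dense search: cosine similarity via the embed() toy defined earlier."""
    q = embed(query)
    return np.array([float(embed(d) @ q) for d in docs])

def rerank(query: str, docs: list[str]) -> list[str]:
    """Hypothetical cross-encoder re-ranker; replace with a real
    re-ranker endpoint (e.g. from Voyage or Cohere)."""
    return docs  # placeholder: keep the fused order

def hybrid_search(query: str, docs: list[str], k: int = 5) -> list[str]:
    # Reciprocal rank fusion combines the two rankings without having to
    # reconcile their incomparable score scales.
    rrf = np.zeros(len(docs))
    for ranking in (np.argsort(-lexical_scores(query, docs)),
                    np.argsort(-dense_scores(query, docs))):
        for rank, idx in enumerate(ranking):
            rrf[idx] += 1.0 / (60 + rank)   # 60 is the conventional RRF constant
    candidates = [docs[int(i)] for i in np.argsort(-rrf)[:k]]
    return rerank(query, candidates)
```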

Voyage provides a re-ranker as well. Another technique is to enhance the queries and documents through so-called query decomposition and document enrichment. This is probably the most common one, so let me spend a minute on it. It's actually very simple: if you have a RAG query, you try to improve the query by making it longer using a large language model.

You can also decompose a longer query into small sub-queries, so that you have a few different queries and search for different subsets of documents. And you can enrich the documents by adding additional metadata: titles, headers, categories, authors, dates.

Sometimes you chunk the document so that a chunk doesn't even contain this information anymore, which is why you have to add the global information back into each chunk. Some of this global information can be added by large language models: Anthropic wrote a blog post on this which achieves pretty good results. Basically, they use large language models to generate additional context that you add to the chunks, so the chunks become more informative and easier to search through.
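A rough sketch of both ideas, query decomposition and LLM-based chunk enrichment, assuming a hypothetical `llm()` helper that wraps whichever chat model you use; the prompts and function names are illustrative, not Anthropic's or anyone else's exact recipe.

```python
def llm(prompt: str) -> str:
    """Hypothetical wrapper around any chat-completion API."""
    raise NotImplementedError("plug in your chat model here")

def decompose_query(query: str) -> list[str]:
    """Query decomposition: split a complex question into sub-queries,
    each of which is searched separately."""
    answer = llm("Break this question into independent sub-questions, "
                 "one per line:\n" + query)
    return [line.strip() for line in answer.splitlines() if line.strip()]

def enrich_chunk(chunk: str, full_document: str) -> str:
    """Contextual enrichment: prepend LLM-generated global context to a
    chunk before embedding it, so the chunk can stand on its own."""
    context = llm(
        "Here is a document:\n" + full_document[:8000] +
        "\n\nIn one or two sentences, situate the following chunk within "
        "the document:\n" + chunk
    )
    return context + "\n\n" + chunk
```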

So another technique is domain-specific embeddings, where you customize embeddings for certain kinds of domains.

At MongoDB and Voyage, we customize them for code, for example. You can see that you get much better performance, and also a better trade-off between storage cost and accuracy, so you don't lose as much if you compress the vectors even further.

Here we lose probably 5% by compressing about 100x, whereas before we would lose probably 10% or 15%. Fine-tuning is another technique: you can fine-tune embedding models with your own data. And you can also use what I sometimes call tricks on top of the embeddings: different ways to retrieve using additional structure, like graph RAG or iterative retrieval.
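For instance, iterative retrieval can be sketched as a small loop on top of the embeddings, reusing the hypothetical `retrieve()` and `llm()` helpers from the earlier sketches; this is an illustration of the pattern, not a specific product's pipeline.

```python
def iterative_retrieve(question: str, max_rounds: int = 3) -> list[str]:
    """Retrieve, ask the LLM whether the context answers the question,
    and if not, let it propose a refined follow-up query."""
    context: list[str] = []
    query = question
    for _ in range(max_rounds):
        context.extend(retrieve(query, k=3))
        verdict = llm(
            "Question: " + question +
            "\n\nContext so far:\n" + "\n".join(context) +
            "\n\nIf the context is enough to answer, reply DONE. "
            "Otherwise reply with a better search query."
        )
        if verdict.strip().upper().startswith("DONE"):
            break
        query = verdict.strip()
    return context
```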

These are all based on embeddings, but you can use the embeddings in many different ways as an additional layer. I'll use the next five minutes or so to discuss some of my vision for where RAG will go in the future. I do believe RAG will be there forever, because, as I argued in the first set of slides, it's very similar to how humans use large amounts of additional data.

You retrieve, you hierarchically select some subset, and then you use that subset to answer questions or take actions. This is very efficient because you only use a small subset of the data. Regarding how RAG will evolve from a technical point of view, I'd like to draw an analogy to how AI in general is evolving.

I was reflecting on when I started teaching at Stanford about seven years ago. I started to teach a machine learning course with Chris Ré, and one of the slides literally has these seven steps for how to build ML systems in enterprises.

This slide is actually still in the lecture notes; we still teach it, just with more asterisks around it. You can see that you need to go through many steps: collecting data, defining your loss functions, building models, iterating and repeating.

Then, in the large language model world, you don't need to do any of this. You just take a large language model out of the box, and in most cases you can deploy it in the enterprise. Of course, it's not going to be perfect, but it's already better than in the old days.

Back then, you did all of these steps in the enterprise, using all the enterprise data; now, just out of the box, you are already doing very well. Of course, you still have the issue that, out of the box, a large language model cannot access proprietary information, and that's where you use RAG.

But the point here is that before, all of these steps had to be done by the users, the enterprises, the customers in some sense. Now, largely speaking, you can just take off-the-shelf components, connect them, and build your AI applications very fast without going through these training steps.

The training still has to be done, right? All of these steps are still done, but they are done by OpenAI, Anthropic, or Voyage and MongoDB, the providers of the models, not the end users. And for RAG, I would say probably the same kind of evolution will happen.

Right now, we have several different layers. There's the computing infrastructure layer with the GPUs, or the k-nearest-neighbor search on CPUs, and there's a model layer with the embedding models, the re-rankers, and the large language models.

On top of all of this, people use a lot of what I call tricks to make RAG accuracy much better: all kinds of parsing strategies, all kinds of chunking strategies, recursive search, contextual chunks, graph RAG, and so on.

That's what happens right now, and these tricks are somewhat necessary because the embeddings, the re-rankers, and the large language models are none of them perfect yet. But I do believe that in the future this model layer will grow, and the tricks layer will shrink.

There will be fewer and fewer tricks, and the models will capture much of the performance currently gained by the tricks. I think we have seen this in the large language model space as well: two years ago you needed to do a lot of things on top of GPT-3 to make your application work.

Now, even out of the box, you can get the same performance as before with all the tricks. Of course, you will still probably need some tricks, because there is some information the embedding models and re-rankers just don't have: the general-purpose, off-the-shelf models don't have certain information, and you can incorporate that into your tricks.

For example, the definition of the similarity metric could be something you should customize in your prompt. And there are several things we are developing towards this vision. One of them is multimodal embedding.

This is meant to dramatically simplify the workflow so that you don't have to do as many things. These days, the multimodal embedding provided by Voyage can just take in screenshots. Before, with a PDF, you had to do data extraction to turn it into images and text, and then probably embed the images and the text separately.

Parsing a PDF is actually complex, and for videos you had to turn them into transcripts and then use text embeddings, and so on. Now we have the multimodal embedding, which just takes in screenshots. You can deal with PDFs, PowerPoint, or any other kind of slide deck in the same way.

Just take a screenshot and then use the multimodal embedding. We can even do the same thing for video, not necessarily in the perfect way, but you just take screenshots of consecutive frames, give them to the multimodal embedding, turn them into vectors, and then you can search over those documents, videos, or slide decks.
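A minimal sketch of that workflow, assuming PyMuPDF for rendering pages and a hypothetical `embed_image()` wrapper that stands in for whichever multimodal embedding endpoint you call; the function name and rendering settings are illustrative.

```python
import numpy as np
import fitz  # PyMuPDF, used here to render PDF pages as images

def embed_image(png_bytes: bytes) -> np.ndarray:
    """Hypothetical wrapper around a multimodal embedding API
    (e.g. a Voyage multimodal model); returns one vector per image."""
    raise NotImplementedError("call your multimodal embedding provider here")

def embed_pdf_pages(path: str) -> np.ndarray:
    """Render each page as a screenshot and embed it directly, instead of
    extracting and embedding text, tables, and figures separately."""
    doc = fitz.open(path)
    vectors = []
    for page in doc:
        pix = page.get_pixmap(dpi=150)          # page screenshot
        vectors.append(embed_image(pix.tobytes("png")))
    return np.stack(vectors)

# The per-page vectors go into the same vector index as text embeddings,
# so slides, tables, and figures are searched the same way as plain text.
```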

These are some performance metrics we have evaluated. Oh, by the way, another application is tables: now you can just take a screenshot of a table, and you don't have to think too much about what is the header, what is a row, and so on.

We have done evaluations on many of these: document screenshots, tables, figures, and also text only, and you can see that it's improving across the board. The final thing I would like to mention, which is something we're going to launch soon, is context-aware, auto-chunking embeddings.

Right now, when you have a long document, you do have to chunk the data. One of the reasons is that the context length of the embedding models is limited, so if you have 100K tokens, you have to chunk it into three or four chunks.

Even though Voyage AI has, I think, probably the longest context window, it's still around 32K. So that's one reason to chunk. Another reason is that even if you don't chunk, suppose you had a way to put the whole long document into a context window.

Still, when you retrieve, you're going to retrieve at the document level, so you retrieve a very long document, and then you have to give this long document to a large language model. That's going to be very expensive: if you give 100K tokens to a large language model every time you answer any question and you do some cost analysis, you'll find that each call is very expensive.

That's why you have to work at a smaller unit, so you can cut the cost and also be more focused. Sometimes when you give a long document to a large language model, it misses some of the context in the middle, and you have to use retrieval to focus on a paragraph, a page, and so on.

That's what happens right now with chunking, but all of it is done by the users. Our vision is that we're going to do this for you, and we're also going to pull in the meta information from the other chunks. In a nutshell, the interface will be that you give us a long document and we chunk it for you.

Then we return the chunks and also a vector for each chunk, and each of these vectors represents not only that chunk but also some of the global meta information from the other chunks. So it has all the details of the corresponding chunk plus some coarse-grained information from the other chunks, and you get the best of both worlds.
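Until that interface exists, here is a rough DIY approximation of the idea with today's tools, reusing the hypothetical `embed()` and `llm()` helpers from the earlier sketches; the chunk sizes and the summarization prompt are just illustrative.

```python
import numpy as np

def chunk(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    """Naive fixed-size chunking with overlap (character-based for brevity;
    token-based splitting is more common in practice)."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def embed_document(text: str) -> list[tuple[str, np.ndarray]]:
    """Chunk a long document, prepend a document-level summary to every
    chunk, and embed the combination, so each vector carries the chunk's
    details plus some coarse global context."""
    summary = llm("Summarize this document in two sentences:\n" + text[:8000])
    return [(c, embed(summary + "\n\n" + c)) for c in chunk(text)]
```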

That context-aware, auto-chunking embedding is what we're going to launch soon. Another thing is that we're going to have a fine-tuning API at some point, so that you can fine-tune with your own data. I guess it's exactly time. Thanks very much.

We'll see you next time.