Great to see everyone. That was a great talk; I'm very interested in this rap group, and I have some concerns. I'm here from MongoDB, and I'm going to be talking about RAG, and specifically what's unique about doing RAG with the MongoDB document model and MongoDB Atlas, the platform. I'm going to start by talking a bit about retrieval augmented generation in general.
I'm sure a lot of us are familiar with it already, but I think it will be good to cover some of the basic concepts. Then I'm going to talk about the document model. For those of you who are not so familiar with MongoDB, this will be a nice little brief intro to what it means to use MongoDB and why we're a unique database.
Then I'm going to talk about vector search, a capability that exists inside of MongoDB now. Then I'll talk about some of our AI integrations, and then some use cases to help stimulate some ideas for all of you. I'm going to do all this in a quick 15 minutes. Obviously, LLMs are super exciting.
It's been crazy over the past year and a half, but there has been a question around what they can actually do, and when you need to use RAG and when you don't. If you took a vanilla LLM connected to nothing and asked it how much money is in your bank account, it wouldn't know.
I think we can all understand why that's the case, and hopefully for the foreseeable future it continues to be the case. All that said, if we want to make useful applications with these LLMs, then the reality is that without context, there's only so much you can do with the LLM.
That's where RAG comes in. RAG stands for retrieval augmented generation. I'm sure this is old hat to most of you, but we're just going to go through quickly. What this means is that you take a generic AI or ML model that, you know, today we're generally talking about LLMs, but it has a training cutoff, it's missing your private data, maybe it hallucinates, maybe it doesn't, but overall it's not personalized.
And you take your data, right, and you augment it at the time of prompting to give it the context that it needs to answer the questions that you want it to for the use cases that you're bringing it to bear for. And so that could be company-specific data, it could be product info, it could be order history, anything that you're storing inside of your application database that's already powering kind of your in-app experiences.
And with that, you get a transformative AI-powered application that's going to be refined, consistent, and accurate in the responses it gives when you're prompting the models. So the typical RAG setup that you've all probably seen, and in most cases probably implemented, looks something like this: a user enters a prompt; the question they enter gets sent to an embedding model and is embedded; that embedding then drives a semantic search against a vector database, in this case MongoDB Atlas Vector Search, obviously, which pulls back similar documents; those documents, along with the original prompt in most cases, go into the large language model; and that produces an answer which goes back to the user.
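To make that flow concrete, here's a rough sketch in Python. The connection string, database and collection names, index name, and model choices are illustrative assumptions, not a prescribed stack:

    # Illustrative RAG flow: embed the question, run $vectorSearch, prompt the LLM.
    # Assumes an Atlas cluster with a vector index named "vector_index" on a
    # "content_embedding" field; swap in your own embedding model and LLM.
    from pymongo import MongoClient
    from openai import OpenAI

    collection = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")["app_db"]["documents"]
    ai = OpenAI()

    def answer(question: str) -> str:
        # 1. Embed the user's question.
        q_vector = ai.embeddings.create(model="text-embedding-3-small",
                                        input=question).data[0].embedding
        # 2. Semantic search over documents already stored in MongoDB Atlas.
        docs = collection.aggregate([{
            "$vectorSearch": {
                "index": "vector_index",
                "path": "content_embedding",
                "queryVector": q_vector,
                "numCandidates": 100,
                "limit": 5,
            }
        }])
        context = "\n".join(d["content"] for d in docs)
        # 3. Augment the prompt with the retrieved documents and ask the LLM.
        reply = ai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "system", "content": f"Answer using this context:\n{context}"},
                      {"role": "user", "content": question}])
        return reply.choices[0].message.content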
And this is kind of what, you know, most people are doing for all, you know, Chatbot and Copilot and other types of use cases, right? But what's really interesting is that when you use MongoDB, you can go quite a bit farther than this and do things that are, you know, in many cases a bit different.
So standard RAG is really not going to be enough. The applications of tomorrow are going to need more context, and that's where the MongoDB document model comes in. The document model is really just JSON, and it gets stored inside of MongoDB in something called BSON, which stands for binary JSON. You have things like a name, a profile; you can include whatever you want as long as it's JSON, and that is actually what you store inside of your database and what you fetch from the database.
Compare that to what you would do in a relational system, where your application is interfacing with objects like a customer object or a contact object, and you're stitching together different tables inside of a relational database. Instead of going through all of that pain and hassle, you get to store the objects that your application is using directly inside of the database, and there's none of that constant reassembling and remapping.
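As a small, hypothetical illustration (the field names, values, and connection details are made up), a customer object that might otherwise span several joined tables can be stored and read back in exactly the shape the application works with:

    # One document holds what might otherwise be customers, contacts, and orders tables.
    from pymongo import MongoClient

    customers = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")["app_db"]["customers"]

    customers.insert_one({
        "name": "Ada Lovelace",
        "profile": {"tier": "enterprise", "region": "EMEA"},
        "contacts": [{"type": "email", "value": "ada@example.com"}],
        "orders": [{"sku": "SKU-42", "quantity": 3, "status": "shipped"}],
    })

    # The application reads back the same shape it works with in code, no joins required.
    ada = customers.find_one({"name": "Ada Lovelace"})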
The way we look at this is that documents are universal; in many cases, they're a superset of all the data types you might want to model. So you can have JSON, tabular data, key-value, geospatial, graph, and it goes on.
And what this translates to is, you know, it's more efficient in many places. It is more productive for developers who are building systems, and in many cases it can be more scalable, since MongoDB is just naturally very horizontally scalable through sharding. So that's documents, and that's kind of just the core benefit of MongoDB.
But now, when we add on vectors is where things get, you know, really interesting, right? So what we've done is we've added in HNSW indexes into MongoDB Atlas, which allows you to do approximate nearest neighbor vector search over data that's stored in your database. And so what you do is you take your embeddings and you add them directly into the documents that you're already storing in your database.
And so if you had a JSON document with symbol, quarter, and content fields, you could add a content_embedding field, which would just be the vectorization of either your entire document, some piece of data in your document, or some piece of data that's living elsewhere that you're going to map back to.
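As a rough sketch (the values, field names, and connection details are placeholders), that document might be written like this:

    from pymongo import MongoClient

    collection = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")["app_db"]["earnings"]
    collection.insert_one({
        "symbol": "MDB",
        "quarter": "Q2 2024",
        "content": "Atlas revenue grew year over year...",
        # Vectorization of the "content" field; a real embedding has hundreds or
        # thousands of float values (up to 4096 dimensions), truncated here.
        "content_embedding": [0.0123, -0.0456, 0.0789],
    })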
And you can store all of that inside of your documents. And you can store vectors that are up to 4096 dimensions. Once that's done, you add in an index definition. In this case, you know, the type of index is a vector search index. You would say the type of field that you're indexing is a vector.
You would say where the path is, where it's located, the number of dimensions, and the similarity function. So how do you want to determine the distance between the vectors that you're searching for and the ones that you're going to find? So once that's done, behind the scenes, the vector index is immediately built and kept in sync with data as it's updated inside of the database.
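That index definition ends up looking roughly like the sketch below. The index name, dimension count, and driver call are illustrative; depending on your driver version you might create the same definition through the Atlas UI instead:

    # Define a vector search index over the embedding field (names and numbers are placeholders).
    from pymongo import MongoClient
    from pymongo.operations import SearchIndexModel

    collection = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")["app_db"]["earnings"]
    collection.create_search_index(SearchIndexModel(
        name="vector_index",
        type="vectorSearch",                 # the type of index
        definition={
            "fields": [{
                "type": "vector",            # the type of field being indexed
                "path": "content_embedding", # where the vector lives in the document
                "numDimensions": 1536,       # must match your embedding model's output size (max 4096)
                "similarity": "cosine",      # distance function: cosine, euclidean, or dotProduct
            }]
        },
    ))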
And then you can use our $vectorSearch aggregation stage to go ahead and compute an approximate nearest neighbor search. And so you have your index; you have the query vector, which is the vectorization of the data that you're searching for; and you have the path where the data lives inside of your documents.
And then you have numCandidates and limit. The limit is how many results you want to get back from this stage, and numCandidates is how many candidates you want to consider as you traverse the HNSW graph, which allows you to tune the accuracy of your results.
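Put together, a minimal $vectorSearch query might look like the sketch below, with the connection details, index name, and numbers as placeholders:

    from pymongo import MongoClient

    collection = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")["app_db"]["earnings"]
    query_embedding = [0.02, -0.11, 0.07]  # in practice, the embedding of the user's query

    results = list(collection.aggregate([{
        "$vectorSearch": {
            "index": "vector_index",        # the vector index defined earlier
            "queryVector": query_embedding, # vectorization of the data you're searching for
            "path": "content_embedding",    # where the vectors live inside your documents
            "numCandidates": 150,           # candidates considered while traversing the HNSW graph; raise for better recall
            "limit": 10,                    # how many results this stage returns
        }
    }]))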
And then finally, you can use a filter, and this filter is basically a pre-filter: as we traverse the graph, we fetch the documents and filter out the ones that don't match the criteria for your specific query. So that is the vector search capability. But there's one other core thing that's really important to call out that we've also introduced alongside vector search, which is something called search nodes.
And this allows you to decouple your approach to scaling. So with a transactional database, right, you have a primary and two secondaries. And this allows you to have, you know, durability, high availability, and all of these guarantees that you would want for a transactional database. But when you're adding search to it, the profile of resource usage may be a bit different.
And so what we've done is we've added in a new type of node into the platform that allows you to store your vector indexes on those nodes and scale them independently from the infrastructure that's storing your transactional data. And this allows you to really tune the amount of resources that you bring to bear to perfectly serve your workload.
And so with that, we've really kind of transformed how Atlas can serve these vector search workloads by both giving you kind of a unified interface and a consistent use of the document model, yet at the same time kind of decoupling how you go about scaling for your workloads. And that's really kind of the true power of what we've done with vector search.
But along with this, we've also built several different AI integrations. We're integrated into some of the most popular AI frameworks: we have integrations inside of LlamaIndex, LangChain, Microsoft Semantic Kernel, AWS Bedrock, and Haystack. And in each of them, we support quite a few different primitives.
Just to name a few: inside of LangChain, we have the vector store, but you also have a chat message history abstraction. We have quite a few in LlamaIndex, and the same goes for Haystack and AWS Bedrock.
And all of these allow you to do that next level of RAG that I was talking about at the very beginning, where you not only get to combine your typical vector search with RAG, but you also get to use transactional data inside of your database to augment your prompts.
To give you a couple of examples of what that ends up looking like: when you think about broader usage of memory for large language models, you might think about semantic caching. This is a capability inside of LangChain, and you can use MongoDB as the backend of that semantic cache.
Now, when a user comes in and asks a question, we'll first send it over to the retriever and figure out what the full question should look like, meaning the prompt plus the additional augmented data. And then we'll check the semantic cache.
And if it's a hit in the semantic cache, based on semantic similarity, then we'll just fetch the cached answer instead of having to hit the LLM again. If it's not a hit, we'll send the prompt to the LLM and get the answer back to the user.
And so in this way, you can use caching to reduce the number of calls that are being sent to your large language model. And this is hugely powerful, just in terms of reducing the amount of resources that you're using. And again, it can all be done using one database, with LangChain in this case.
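A minimal sketch of what that wiring might look like with the langchain-mongodb package; the import paths, constructor arguments, and names here are assumptions and may differ across versions, and the cache collection needs its own vector index:

    # Use MongoDB Atlas as the backend for LangChain's semantic cache (illustrative names).
    from langchain_core.globals import set_llm_cache
    from langchain_mongodb.cache import MongoDBAtlasSemanticCache
    from langchain_openai import OpenAIEmbeddings, ChatOpenAI

    set_llm_cache(MongoDBAtlasSemanticCache(
        connection_string="mongodb+srv://<user>:<password>@cluster0.example.mongodb.net",
        embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
        database_name="app_db",
        collection_name="semantic_cache",
        index_name="vector_index",
    ))

    llm = ChatOpenAI(model="gpt-4o-mini")
    llm.invoke("What is Atlas Vector Search?")  # cache miss: calls the LLM and stores the answer
    llm.invoke("Explain Atlas Vector Search")   # if semantically similar enough, served from the MongoDB-backed cache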
Separately, though, we also now have chat history. With LangChain, you might want to build on top of MongoDB an experience that's maybe similar to ChatGPT, where you have the chat history, and it's continuously fetching that data and putting it back into the prompt so that you can have continuity in the conversation that's happening with the large language model.
Well, you could use the chat message history abstraction inside of LangChain, and you could basically store the history of chats going through the platform. And each time a prompt is sent to the large language model, you could take the chat history, send it back through, include the vector search, and then send the prompt to the LLM and send the answer back.
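As a rough sketch with the langchain-mongodb chat message history abstraction (the import path, connection string, database, collection, session id, and messages are placeholders):

    # Store per-session chat history in MongoDB and replay it into the next prompt.
    from langchain_mongodb.chat_message_histories import MongoDBChatMessageHistory

    history = MongoDBChatMessageHistory(
        connection_string="mongodb+srv://<user>:<password>@cluster0.example.mongodb.net",
        session_id="user-123",
        database_name="app_db",
        collection_name="chat_histories",
    )

    history.add_user_message("What did we decide about the Q3 launch?")
    history.add_ai_message("You agreed to move the launch to October.")

    # On the next turn, fetch the stored messages and fold them back into the prompt.
    previous_turns = history.messages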
And so that's just another way you can really evolve this. A cool startup that's using us right now to do a lot of these different things, taking advantage of all the flexibility of having a transactional database built in with your vector search capability, is a company called 4149.
I would, you know, recommend checking them out. Basically, they're building an AI teammate and not like a coding teammate, but instead one that kind of, you know, listens to your meeting, tracks what you're doing, fetches additional information and kind of prompts you, the user, with that information that you may need to kind of complete a task, you know, write an email or kind of schedule a project.
And they're using MongoDB not just to store their vector data and do semantic similarity search, but also to store data about their users, data about specific meetings, chat histories, all of this information that's not necessarily your typical semantic search data use case, but that really benefits from having a single operational, transactional database that also has vector search attached.
And so that's where we're seeing like a lot of the excitement as we move into this, you know, world of agents and doing kind of complex differentiated rag. Having a full transactional database really kind of opens up a new world of kind of storing and giving, you know, these agents more affordances to interact with the data.
And, you know, just one more thing to mention is that, you know, at the end of the day, all of this is built inside of MongoDB Atlas, which gives you comprehensive security controls and privacy. It, you know, gives you kind of total uptime and automation to ensure that you have kind of optimal performance to serve your application.
And finally, it's deployable in over 100 regions across all of the major cloud providers, including our search node offering that I mentioned earlier, which really allows you to optimize how you deploy these resources. And so we're really thrilled to have this. Just a quick call out.
Thanks, all, for coming to check out this talk. If you want to try MongoDB Atlas for free, we have a forever free tier where vector search is available. And you can also learn more about our AI capabilities using this other QR code as well. And with that, I'm done.
Thank you.