
RAG and the MongoDB Document Model: Ben Flast



00:00:00.040 | Great to see everyone. That was a great talk. I'm very interested in this rap group. I have
00:00:19.120 | some concerns. I'm here for MongoDB. I'm going to be talking about RAG and specifically what's
00:00:25.200 | unique about doing RAG with the MongoDB document model and MongoDB Atlas, the platform. I'm
00:00:31.600 | going to start by talking a bit about retrieval augmented generation in general. I'm sure a
00:00:36.440 | lot of us are familiar with it already, but I think it will be good to cover some of the
00:00:40.200 | basic concepts. Then I'm going to talk about the document model. For those of you who are
00:00:44.440 | not so familiar with MongoDB, this will be a nice little brief intro to what it means
00:00:49.700 | to use MongoDB and why we're a unique database. Then I'm going to talk about vector search, a
00:00:54.920 | capability that exists inside of MongoDB now. Then I'll talk about some of our AI integrations
00:01:00.120 | and then some use cases to help stimulate some ideas for all of you. I'm going to do all this
00:01:05.260 | in a quick 15 minutes. Obviously, LLMs are super exciting. It's been crazy over the past year and a half,
00:01:15.440 | but there has been a question around what all can they do and when do you need to use RAG
00:01:20.780 | and when do you not? If you took a vanilla LLM connected to nothing and asked it how much
00:01:25.980 | money is in your bank account, it wouldn't know. I think we can all understand why that's
00:01:31.180 | the case, and hopefully for the foreseeable future it continues to be the case. All that said, if
00:01:38.320 | we want to make useful applications with these LLMs, then the reality is that without context,
00:01:44.180 | there's only so much you can do with the LLM. That's where RAG comes in. RAG stands for retrieval
00:01:50.740 | augmented generation. I'm sure this is old hat to most of you, but we're just going to go through
00:01:55.300 | quickly. What this means is that you take a generic AI or ML model, and, you know, today we're generally
00:02:03.700 | talking about LLMs, but it has a training cutoff, it's missing your private data, maybe it hallucinates,
00:02:11.060 | maybe it doesn't, but overall it's not personalized. And you take your data, right, and you augment it
00:02:17.300 | at the time of prompting to give it the context that it needs to answer the questions that you want it to
00:02:22.980 | for the use cases that you're bringing it to bear for. And so that could be company-specific data,
00:02:27.780 | it could be product info, it could be order history, anything that you're storing inside of your
00:02:32.260 | application database that's already powering kind of your in-app experiences. And with that,
00:02:37.060 | you get a transformative AI-powered application, right, that's going to be refined and consistent
00:02:41.700 | and accurate in the responses that it gives when you're prompting the models. So the typical RAG that
00:02:50.580 | you've all probably seen and in most cases probably implemented will look something like this, right,
00:02:55.220 | so you have a user that enters a prompt, the question that they enter will get sent to an
00:03:02.180 | embedding model, it will be embedded, it will then do a search, a semantic search on a vector database,
00:03:07.140 | in this case, MongoDB Atlas Vector Search, obviously, which will pull back similar documents,
00:03:11.380 | so then those documents, along with the original prompt in most cases, will go into the large
00:03:15.780 | language model, and that will give an answer which goes back to the user. And this is kind of what,
00:03:20.260 | you know, most people are doing for, you know, chatbot and copilot and other types of use cases,
00:03:24.820 | right? But what's really interesting is that when you use MongoDB, you can go quite a bit farther than this and do
00:03:32.020 | things that are, you know, in many cases a bit different. Because, you know, standard
00:03:42.500 | RAG is really not going to be enough. The applications of tomorrow are going to need more context, right?
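As a reference point, here is a minimal sketch of that standard flow in Python, assuming the OpenAI client and pymongo, a collection of documents carrying a content field and a content_embedding field, and an Atlas vector search index named vector_index; the database, collection, model, and index names are illustrative, not part of the talk.

```python
# Minimal RAG sketch: embed the question, retrieve similar documents from
# MongoDB Atlas Vector Search, and pass them to an LLM with the original prompt.
# Names like "rag_db", "docs", and "vector_index" are illustrative.
from openai import OpenAI
from pymongo import MongoClient

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set
collection = MongoClient("<ATLAS_CONNECTION_STRING>")["rag_db"]["docs"]

def answer(question: str) -> str:
    # 1. Embed the user's question.
    query_vector = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    # 2. Semantic search over documents stored in Atlas.
    hits = collection.aggregate([
        {"$vectorSearch": {
            "index": "vector_index",
            "path": "content_embedding",
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": 5,
        }}
    ])
    context = "\n\n".join(doc["content"] for doc in hits)

    # 3. Augment the prompt with the retrieved context and ask the LLM.
    completion = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```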
00:03:47.300 | And that's where the MongoDB document model comes in. So the document model is really just JSON,
00:03:52.820 | and it gets stored inside of MongoDB in something called BSON, which stands for binary JSON, but you
00:03:57.700 | have things like a name, a profile, you know, you can include whatever you want as long as it's JSON,
00:04:04.820 | and that is actually what you store inside of your database and what you fetch from the database.
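For example, with pymongo you store and fetch that JSON more or less as-is; the database, collection, and field values below are just for illustration.

```python
# Store and fetch a JSON-like document directly, with no table mapping.
from pymongo import MongoClient

db = MongoClient("<ATLAS_CONNECTION_STRING>")["app"]

customer = {
    "name": "Ada Lovelace",
    "profile": {"tier": "enterprise", "interests": ["analytics", "search"]},
    "orders": [{"sku": "A-100", "qty": 2}],
}
db["customers"].insert_one(customer)  # persisted as BSON under the hood

doc = db["customers"].find_one({"name": "Ada Lovelace"})  # comes back as a dict
```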
00:04:09.060 | So with the document model, if you're comparing it to something that you would do in kind of a relational
00:04:13.700 | system where you have objects that your applications are interfacing with, right, like a customer
00:04:18.420 | object or a contact object, and you're, you know, stitching together different tables inside of a
00:04:23.140 | relational database, instead of having to kind of go through all of this pain and hassle, you get to
00:04:27.940 | go to something like this, right, where you just store the objects that your application is using
00:04:32.020 | directly inside of the database, and there's not all of this kind of reconfiguring and reconnoitering.
00:04:36.740 | The way we look at this is that, you know, documents are universal, right? In many cases, they're kind of the
00:04:42.100 | superset of all, you know, data types that you might want to model. And so you can have JSON, you can have
00:04:48.340 | tabular data, key-value stores, geospatial, graph, and it goes on. And what this translates to is, you know,
00:04:54.820 | it's more efficient in many places. It is more productive for developers who are building systems,
00:04:59.860 | and in many cases it can be more scalable, since MongoDB is just naturally very horizontally scalable
00:05:04.180 | through sharding. So that's documents, and that's kind of just the core benefit of MongoDB. But now,
00:05:12.100 | when we add on vectors is where things get, you know, really interesting, right? So what we've done is
00:05:17.380 | we've added in HNSW indexes into MongoDB Atlas, which allows you to do approximate nearest neighbor vector
00:05:24.660 | search over data that's stored in your database. And so what you do is you take your embeddings and you
00:05:29.860 | add them directly into the documents that you're already storing in your database. And so if you
00:05:34.980 | had this JSON that had symbol, quarter, and content fields, you could add a content_embedding
00:05:40.740 | field, which would just be the vectorization of, you know, either your entire document, some piece of
00:05:46.500 | data in your document, or some piece of data that's living elsewhere that you're going to map back to.
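A sketch of that enrichment step, using the symbol/quarter/content example from the talk; the embedding model and the update loop are illustrative assumptions, and any model producing up to 4096 dimensions would work.

```python
# Add a content_embedding field to documents that already hold the source text.
from openai import OpenAI
from pymongo import MongoClient

openai_client = OpenAI()
collection = MongoClient("<ATLAS_CONNECTION_STRING>")["rag_db"]["docs"]

def embed(text: str) -> list[float]:
    # One possible embedding model; swap in whichever model you use.
    return openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

for doc in collection.find({"content_embedding": {"$exists": False}}):
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {"content_embedding": embed(doc["content"])}},
    )
```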
00:05:50.500 | And you can store all of that inside of your documents. And you can store vectors that are up to
00:05:55.060 | 4096 dimensions. Once that's done, you add in an index definition. In this case, you know,
00:06:02.900 | the type of index is a vector search index. You would say the type of field that you're indexing
00:06:07.300 | is a vector. You would say where the path is, where it's located, the number of dimensions, and the
00:06:12.580 | similarity function. So how do you want to determine the distance between the vectors that you're searching
00:06:16.580 | for and the ones that you're going to find? So once that's done, behind the scenes, the vector index is
00:06:21.940 | immediately built and kept in sync with data as it's updated inside of the database. And then you can
00:06:26.260 | use our $vectorSearch aggregation stage to go ahead and compute an approximate nearest neighbor
00:06:30.980 | search. And so you have your index. You have the query vector, which is the vectorization of the data
00:06:35.940 | that you're searching for. You have the path where the data lives inside of your documents. And then you
00:06:41.540 | have numCandidates and limit. And so the limit is how many results you want to get back from this
00:06:46.340 | stage. And the numCandidates is how many entry points into your HNSW graph you want to make,
00:06:52.020 | which allows you to kind of tune the accuracy of your results. And then finally, you can use a
00:06:58.900 | filter. And this filter is basically a pre-filter. So as we traverse this graph, we'll allow you to kind
00:07:03.620 | of fetch the documents and filter out the ones that are less relevant for your specific query.
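Put together, the index definition and the query stage described above look roughly like this. The dimension count, index name, and the symbol pre-filter are illustrative; the index can also be created through the Atlas UI, and the create_search_index call with type="vectorSearch" assumes a recent pymongo.

```python
# Vector search index definition plus a $vectorSearch query with a pre-filter.
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

collection = MongoClient("<ATLAS_CONNECTION_STRING>")["rag_db"]["docs"]

# Index definition: field type, path, number of dimensions, and similarity
# function, plus a filter field so queries can pre-filter on "symbol".
collection.create_search_index(SearchIndexModel(
    name="vector_index",
    type="vectorSearch",
    definition={"fields": [
        {"type": "vector", "path": "content_embedding",
         "numDimensions": 1536, "similarity": "cosine"},
        {"type": "filter", "path": "symbol"},
    ]},
))

# Approximate nearest neighbor query: numCandidates controls how much of the
# HNSW graph is explored, limit is how many results the stage returns, and
# filter narrows the candidates for this specific query.
query_vector = [0.0] * 1536  # placeholder; in practice, embed the search text
results = collection.aggregate([
    {"$vectorSearch": {
        "index": "vector_index",
        "path": "content_embedding",
        "queryVector": query_vector,
        "numCandidates": 200,
        "limit": 10,
        "filter": {"symbol": "MDB"},
    }}
])
```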
00:07:10.100 | So that is vector search capability. But there's one other kind of core thing that's really important
00:07:15.540 | to just call out that we've also introduced alongside vector search, which is something
00:07:18.980 | called search nodes. And this allows you to decouple your approach to scaling. So with a transactional
00:07:24.900 | database, right, you have a primary and two secondaries. And this allows you to have, you know,
00:07:28.740 | durability, high availability, and all of these guarantees that you would want for a transactional
00:07:32.420 | database. But when you're adding search to it, the profile of resource usage may be a bit different.
00:07:38.180 | And so what we've done is we've added in a new type of node into the platform that allows you to
00:07:42.660 | store your vector indexes on those nodes and scale them independently from the infrastructure that's
00:07:48.740 | storing your transactional data. And this allows you to really tune the amount of resources that you
00:07:53.540 | bring to bear to perfectly serve your workload. And so with that, we've really kind of transformed how
00:08:01.700 | Atlas can serve these vector search workloads by both giving you kind of a unified interface
00:08:05.940 | and a consistent use of the document model, yet at the same time kind of decoupling how you go about
00:08:11.700 | scaling for your workloads. And that's really kind of the true power of what we've done with vector
00:08:15.540 | search. But along with this, we've also built several different AI integrations. And so we're
00:08:20.180 | integrated into some of the most popular AI frameworks, right? We have integrations inside of
00:08:24.340 | LlamaIndex, LangChain, Microsoft Semantic Kernel, AWS Bedrock, and Haystack. And in each of them,
00:08:31.620 | we support quite a few different primitives. And so we have, you know,
00:08:37.380 | just to name a few, inside of LangChain, we have a vector store, but you can also have a chat message
00:08:41.540 | history, you know, abstraction inside of LangChain. We have quite a few in LlamaIndex, and then, you know,
00:08:47.780 | same for Haystack and AWS Bedrock. And so all of these allow you to do that next level of RAG that I was
00:08:54.580 | talking about at the very beginning, where you not only get to combine kind of just your typical vector search
00:08:58.820 | with RAG, but you also get to now use kind of transactional data inside of your database to
00:09:04.420 | augment your prompts. And so to give you just like a couple examples of what that ends up looking like,
00:09:09.860 | right, when you think about kind of more broad usage of memory for large language models, you might think
00:09:16.260 | about semantic caching. So this is a capability inside of LangChain, and you can use MongoDB as the
00:09:20.740 | backend of that semantic cache. And now, right, when a user comes in and asks a question, we'll first
00:09:26.340 | kind of send it over to the retriever and figure out kind of what the question should look like,
00:09:31.300 | right, find the prompt plus the additional kind of augmented data. And then we'll send it to a semantic
00:09:36.420 | cache. And if it's a hit in that semantic cache based on semantic similarity, then we'll just fetch the
00:09:42.340 | cached answer instead of having to hit the LLM again. Or if it's not a hit, we'll send it to the LLM and do the
00:09:47.860 | prompt and get the answer back to the user. And so in this way, you can use caching to kind of reduce
00:09:51.860 | the amount of calls that are being sent to your large language model. And this is, you know, hugely
00:09:55.940 | powerful, just kind of reducing the amount of resources that you're using. And again, it can all be done
00:10:00.420 | using one database, with LangChain in this case.
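A minimal sketch of wiring that up, assuming the langchain-mongodb integration package and an OpenAI model; the collection, database, and index names are illustrative, and exact constructor arguments may differ slightly between versions.

```python
# Use MongoDB Atlas as the backend for LangChain's semantic cache, so that
# semantically similar prompts are answered from the cache instead of the LLM.
from langchain_core.globals import set_llm_cache
from langchain_mongodb.cache import MongoDBAtlasSemanticCache
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

set_llm_cache(MongoDBAtlasSemanticCache(
    connection_string="<ATLAS_CONNECTION_STRING>",
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    collection_name="semantic_cache",
    database_name="rag_db",
    index_name="vector_index",
))

llm = ChatOpenAI(model="gpt-4o")
llm.invoke("What did MongoDB add for vector search?")   # cache miss: calls the LLM
llm.invoke("What has MongoDB added for vector search?")  # likely served from the cache
```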
00:10:09.140 | Separately, though, we also now have chat history. And so with LangChain, if you wanted to build on top of MongoDB an experience that was maybe similar to,
00:10:16.820 | you know, ChatGPT, right, where you have the chat history, and it's continuously fetching that data
00:10:21.540 | and putting it back into the prompt so that you can kind of have continuity in the conversation that's
00:10:26.420 | happening with the large language model. Well, you could use the chat message history abstraction
00:10:29.780 | inside of LangChain, and you could basically store the history of chats that are going through the
00:10:34.100 | platform. And each time a prompt is sent back into the large language model, you could use the chat
00:10:39.540 | history, send it back through, include the vector search, and then, you know, send the prompt to the LLM
00:10:44.580 | and send the answer back. And so just another way where you can really kind of evolve this.
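A sketch of that pattern with the chat message history abstraction, again assuming the langchain-mongodb package; the session ID, database, and collection names are illustrative.

```python
# Persist per-session chat history in MongoDB and replay it on each prompt so
# the model keeps conversational context across turns.
from langchain_mongodb import MongoDBChatMessageHistory
from langchain_openai import ChatOpenAI

history = MongoDBChatMessageHistory(
    connection_string="<ATLAS_CONNECTION_STRING>",
    session_id="user-42",
    database_name="rag_db",
    collection_name="chat_histories",
)

llm = ChatOpenAI(model="gpt-4o")

def chat(user_message: str) -> str:
    history.add_user_message(user_message)
    # Send the stored conversation (plus any retrieved context) back to the LLM.
    reply = llm.invoke(history.messages).content
    history.add_ai_message(reply)
    return reply
```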
00:10:48.260 | A cool startup that's using us right now to do a lot of these different things where they're taking
00:10:53.380 | advantage of kind of all of the flexibility of having a transactional database kind of built in
00:10:58.420 | with your vector search capability is a company called 4149. I would, you know, recommend checking
00:11:03.300 | them out. Basically, they're building an AI teammate and not like a coding teammate, but instead one that kind
00:11:08.980 | of, you know, listens to your meeting, tracks what you're doing, fetches additional information and kind
00:11:14.660 | of prompts you, the user, with that information that you may need to kind of complete a task, you know,
00:11:20.660 | write an email or kind of schedule a project. And they're using MongoDB not just to store their vector
00:11:26.420 | data and do, you know, semantic similarity search, but also to store data about their users, data about,
00:11:31.780 | you know, specific meetings, chat history, all of this information that's not necessarily kind
00:11:37.060 | of your typical semantic search type data use case, but instead it really benefits from having a single
00:11:42.740 | operational transactional database that also has vector search attached. And so that's where we're
00:11:47.460 | seeing like a lot of the excitement as we move into this, you know, world of agents and doing kind of
00:11:51.940 | complex, differentiated RAG. Having a full transactional database really kind of opens up a new world of
00:11:58.740 | kind of storing and giving, you know, these agents more affordances to interact with the data. And, you
00:12:04.980 | know, just one more thing to mention is that, you know, at the end of the day, all of this is built
00:12:09.780 | inside of MongoDB Atlas, which gives you comprehensive security controls and privacy. It, you know, gives you
00:12:16.420 | kind of total uptime and automation to ensure that you have kind of optimal performance to serve your
00:12:21.860 | application. And finally, it's deployable in over 100 regions across all of the major cloud
00:12:26.900 | providers, including our search node offering that I mentioned earlier, that really allows you to
00:12:30.900 | optimize how you deploy these resources. And so we're really thrilled to have this. Just kind of a quick
00:12:36.500 | call out. Thanks all for coming to check out this talk. If you want to try MongoDB Atlas for free,
00:12:43.300 | we have a forever-free tier where vector search is available. And you can also learn more about our AI
00:12:48.340 | capabilities using this other QR code as well. And with that, I'm done.
00:12:53.780 | Thank you.