
RAG and the MongoDB Document Model: Ben Flast



00:00:00.040 | Great to see everyone. That was a great talk. I'm very interested in this rap group. I have
00:00:19.120 | some concerns. I'm here for MongoDB. I'm going to be talking about RAG and specifically what's
00:00:25.200 | unique about doing RAG with the MongoDB document model and MongoDB Atlas, the platform. I'm
00:00:31.600 | going to start by talking a bit about retrieval augmented generation in general. I'm sure a
00:00:36.440 | lot of us are familiar with it already, but I think it will be good to cover some of the
00:00:40.200 | basic concepts. Then I'm going to talk about the document model. For those of you who are
00:00:44.440 | not so familiar with MongoDB, this will be a nice little brief intro to what it means
00:00:49.700 | to use MongoDB and why we're a unique database. Then I'm going to talk about vector search, a
00:00:54.920 | capability that exists inside of MongoDB now. Then I'll talk about some of our AI integrations
00:01:00.120 | and then some use cases to help stimulate some ideas for all of you. I'm going to do all this
00:01:05.260 | in a quick 15 minutes. Obviously, LLMs are super exciting. It's been crazy over the past year and a half,
00:01:15.440 | but there has been a question around what all can they do and when do you need to use RAG
00:01:20.780 | and when do you not? If you took a vanilla LLM connected to nothing and asked it how much
00:01:25.980 | money is in your bank account, it wouldn't know. I think we can all understand why that's
00:01:31.180 | the case, and hopefully for the foreseeable future it continues to be the case. All that said, if
00:01:38.320 | we want to make useful applications with these LLMs, then the reality is that without context,
00:01:44.180 | there's only so much you can do with the LLM. That's where RAG comes in. RAG stands for retrieval
00:01:50.740 | augmented generation. I'm sure this is old hat to most of you, but we're just going to go through
00:01:55.300 | quickly. What this means is that you take a generic AI or ML model, and, you know, today we're generally
00:02:03.700 | talking about LLMs, but it has a training cutoff, it's missing your private data, maybe it hallucinates,
00:02:11.060 | maybe it doesn't, but overall it's not personalized. And you take your data, right, and you augment it
00:02:17.300 | at the time of prompting to give it the context that it needs to answer the questions that you want it to
00:02:22.980 | for the use cases that you're bringing it to bear for. And so that could be company-specific data,
00:02:27.780 | it could be product info, it could be order history, anything that you're storing inside of your
00:02:32.260 | application database that's already powering kind of your in-app experiences. And with that,
00:02:37.060 | you get a transformative AI-powered application, right, that's going to be refined and consistent
00:02:41.700 | and accurate in the responses that it gives when you're prompting the models. So the typical RAG that
00:02:50.580 | you've all probably seen and in most cases probably implemented will look something like this, right,
00:02:55.220 | so you have a user that enters a prompt, the question that they enter will get sent to an
00:03:02.180 | embedding model, it will be embedded, it will then do a search, a semantic search on a vector database,
00:03:07.140 | in this case, MongoDB Atlas Vector Search, obviously, which will pull back similar documents,
00:03:11.380 | so then those documents, along with the original prompt in most cases, will go into the large
00:03:15.780 | language model, and that will give an answer which goes back to the user. And this is kind of what,
00:03:20.260 | you know, most people are doing for, you know, chatbot and copilot and other types of use cases,
00:03:24.820 | right? But what's really interesting is that when you use MongoDB, you can go quite a bit farther than this and do
00:03:32.020 | things that are, you know, in many cases a bit different. Because, you know, standard
00:03:42.500 | RAG is really not going to be enough. The applications of tomorrow are going to need more context, right?
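As a reference point, here is a minimal sketch of that standard flow in Python, assuming the OpenAI client and pymongo, a collection of documents carrying a content field and a content_embedding field, and an Atlas vector search index named vector_index; the database, collection, model, and index names are illustrative, not part of the talk.

```python
# Minimal RAG sketch: embed the question, retrieve similar documents from
# MongoDB Atlas Vector Search, and pass them to an LLM with the original prompt.
# Names like "rag_db", "docs", and "vector_index" are illustrative.
from openai import OpenAI
from pymongo import MongoClient

openai_client = OpenAI()  # assumes OPENAI_API_KEY is set
collection = MongoClient("<ATLAS_CONNECTION_STRING>")["rag_db"]["docs"]

def answer(question: str) -> str:
    # 1. Embed the user's question.
    query_vector = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    # 2. Semantic search over documents stored in Atlas.
    hits = collection.aggregate([
        {"$vectorSearch": {
            "index": "vector_index",
            "path": "content_embedding",
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": 5,
        }}
    ])
    context = "\n\n".join(doc["content"] for doc in hits)

    # 3. Augment the prompt with the retrieved context and ask the LLM.
    completion = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```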
00:03:47.300 | And that's where the MongoDB document model comes in. So the document model is really just JSON,
00:03:52.820 | and it gets stored inside of MongoDB in something called BSON, which stands for binary JSON, but you
00:03:57.700 | have things like a name, a profile, you know, you can include whatever you want as long as it's JSON,
00:04:04.820 | and that is actually what you store inside of your database and what you fetch from the database.
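For example, with pymongo you store and fetch that JSON more or less as-is; the database, collection, and field values below are just for illustration.

```python
# Store and fetch a JSON-like document directly, with no table mapping.
from pymongo import MongoClient

db = MongoClient("<ATLAS_CONNECTION_STRING>")["app"]

customer = {
    "name": "Ada Lovelace",
    "profile": {"tier": "enterprise", "interests": ["analytics", "search"]},
    "orders": [{"sku": "A-100", "qty": 2}],
}
db["customers"].insert_one(customer)  # persisted as BSON under the hood

doc = db["customers"].find_one({"name": "Ada Lovelace"})  # comes back as a dict
```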
00:04:09.060 | So with the document model, if you're comparing it to something that you would do in kind of a relational
00:04:13.700 | system where you have objects that your applications are interfacing with, right, like a customer
00:04:18.420 | object or a contact object, and you're, you know, stitching together different tables inside of a
00:04:23.140 | relational database, instead of having to kind of go through all of this pain and hassle, you get to
00:04:27.940 | go to something like this, right, where you just store the objects that your application is using
00:04:32.020 | directly inside of the database, and there's not all of this kind of reconfiguring and reconnoitering.
00:04:36.740 | The way we look at this is that, you know, documents are universal, right? In many cases, they're kind of the
00:04:42.100 | superset of all, you know, data types that you might want to model. And so you can have JSON, you can have
00:04:48.340 | tabular data, key-value stores, geospatial, graph, and it goes on. And what this translates to is, you know,
00:04:54.820 | it's more efficient in many places. It is more productive for developers who are building systems,
00:04:59.860 | and in many cases it can be more scalable, since MongoDB is just naturally very horizontally scalable
00:05:04.180 | through sharding. So that's documents, and that's kind of just the core benefit of MongoDB. But now,
00:05:12.100 | when we add on vectors is where things get, you know, really interesting, right? So what we've done is
00:05:17.380 | we've added in HNSW indexes into MongoDB Atlas, which allows you to do approximate nearest neighbor vector
00:05:24.660 | search over data that's stored in your database. And so what you do is you take your embeddings and you
00:05:29.860 | add them directly into the documents that you're already storing in your database. And so if you
00:05:34.980 | had this JSON that had symbol, quarter, and content fields, you could add a content_embedding
00:05:40.740 | field, which would just be the vectorization of, you know, either your entire document, some piece of
00:05:46.500 | data in your document, or some piece of data that's living elsewhere that you're going to map back to.
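A sketch of that enrichment step, using the symbol/quarter/content example from the talk; the embedding model and the update loop are illustrative assumptions, and any model producing up to 4096 dimensions would work.

```python
# Add a content_embedding field to documents that already hold the source text.
from openai import OpenAI
from pymongo import MongoClient

openai_client = OpenAI()
collection = MongoClient("<ATLAS_CONNECTION_STRING>")["rag_db"]["docs"]

def embed(text: str) -> list[float]:
    # One possible embedding model; swap in whichever model you use.
    return openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

for doc in collection.find({"content_embedding": {"$exists": False}}):
    collection.update_one(
        {"_id": doc["_id"]},
        {"$set": {"content_embedding": embed(doc["content"])}},
    )
```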
00:05:50.500 | And you can store all of that inside of your documents. And you can store vectors that are up to
00:05:55.060 | 4096 dimensions. Once that's done, you add in an index definition. In this case, you know,
00:06:02.900 | the type of index is a vector search index. You would say the type of field that you're indexing
00:06:07.300 | is a vector. You would say where the path is, where it's located, the number of dimensions, and the
00:06:12.580 | similarity function. So how do you want to determine the distance between the vectors that you're searching
00:06:16.580 | for and the ones that you're going to find? So once that's done, behind the scenes, the vector index is
00:06:21.940 | immediately built and kept in sync with data as it's updated inside of the database. And then you can
00:06:26.260 | use our $vectorSearch aggregation stage to go ahead and compute an approximate nearest neighbor
00:06:30.980 | search. And so you have your index. You have the query vector, which is the vectorization of the data
00:06:35.940 | that you're searching for. You have the path where the data lives inside of your documents. And then you
00:06:41.540 | have numCandidates and limit. And so the limit is how many results you want to get back from this
00:06:46.340 | stage. And the numCandidates is how many entry points into your HNSW graph you want to make,
00:06:52.020 | which allows you to kind of tune the accuracy of your results. And then finally, you can use a
00:06:58.900 | filter. And this filter is basically a pre-filter. So as we traverse this graph, we'll allow you to kind
00:07:03.620 | of fetch the documents and filter out the ones that are less relevant for your specific query.
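Put together, the index definition and the query stage described above look roughly like this. The dimension count, index name, and the symbol pre-filter are illustrative; the index can also be created through the Atlas UI, and the create_search_index call with type="vectorSearch" assumes a recent pymongo.

```python
# Vector search index definition plus a $vectorSearch query with a pre-filter.
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

collection = MongoClient("<ATLAS_CONNECTION_STRING>")["rag_db"]["docs"]

# Index definition: field type, path, number of dimensions, and similarity
# function, plus a filter field so queries can pre-filter on "symbol".
collection.create_search_index(SearchIndexModel(
    name="vector_index",
    type="vectorSearch",
    definition={"fields": [
        {"type": "vector", "path": "content_embedding",
         "numDimensions": 1536, "similarity": "cosine"},
        {"type": "filter", "path": "symbol"},
    ]},
))

# Approximate nearest neighbor query: numCandidates controls how much of the
# HNSW graph is explored, limit is how many results the stage returns, and
# filter narrows the candidates for this specific query.
query_vector = [0.0] * 1536  # placeholder; in practice, embed the search text
results = collection.aggregate([
    {"$vectorSearch": {
        "index": "vector_index",
        "path": "content_embedding",
        "queryVector": query_vector,
        "numCandidates": 200,
        "limit": 10,
        "filter": {"symbol": "MDB"},
    }}
])
```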
00:07:10.100 | So that is vector search capability. But there's one other kind of core thing that's really important
00:07:15.540 | to just call out that we've also introduced alongside vector search, which is something
00:07:18.980 | called search nodes. And this allows you to decouple your approach to scaling. So with a transactional
00:07:24.900 | database, right, you have a primary and two secondaries. And this allows you to have, you know,
00:07:28.740 | durability, high availability, and all of these guarantees that you would want for a transactional
00:07:32.420 | database. But when you're adding search to it, the profile of resource usage may be a bit different.
00:07:38.180 | And so what we've done is we've added in a new type of node into the platform that allows you to
00:07:42.660 | store your vector indexes on those nodes and scale them independently from the infrastructure that's
00:07:48.740 | storing your transactional data. And this allows you to really tune the amount of resources that you
00:07:53.540 | bring to bear to perfectly serve your workload. And so with that, we've really kind of transformed how
00:08:01.700 | Atlas can serve these vector search workloads by both giving you kind of a unified interface
00:08:05.940 | and a consistent use of the document model, yet at the same time kind of decoupling how you go about
00:08:11.700 | scaling for your workloads. And that's really kind of the true power of what we've done with vector
00:08:15.540 | search. But along with this, we've also built several different AI integrations. And so we're
00:08:20.180 | integrated into some of the most popular AI frameworks, right? We have integrations inside of
00:08:24.340 | LlamaIndex, LangChain, Microsoft Semantic Kernel, AWS Bedrock, and Haystack. And in each of them,
00:08:31.620 | we support quite a few different primitives. And so we have, you know,
00:08:37.380 | just to name a few, inside of LangChain, we have a vector store, but you can also have a chat message
00:08:41.540 | history, you know, abstraction inside of LangChain. We have quite a few in LlamaIndex, and then, you know,
00:08:47.780 | same for Haystack and AWS Bedrock. And so all of these allow you to do that next level of RAG that I was
00:08:54.580 | talking about at the very beginning, where you not only get to combine kind of just your typical vector search
00:08:58.820 | with RAG, but you also get to now use kind of transactional data inside of your database to
00:09:04.420 | augment your prompts. And so to give you just like a couple examples of what that ends up looking like,
00:09:09.860 | right, when you think about kind of more broad usage of memory for large language models, you might think
00:09:16.260 | about semantic caching. So this is a capability inside of LangChain, and you can use MongoDB as the
00:09:20.740 | backend of that semantic cache. And now, right, when a user comes in and asks a question, we'll first
00:09:26.340 | kind of send it over to the retriever and figure out kind of what the question should look like,
00:09:31.300 | right, find the prompt plus the additional kind of augmented data. And then we'll send it to a semantic
00:09:36.420 | cache. And if it's a hit in that semantic cache based on semantic similarity, then we'll just fetch the
00:09:42.340 | cached answer instead of having to hit the LLM again. Or if it's not a hit, we'll send it to the LLM and do the
00:09:47.860 | prompt and get the answer back to the user. And so in this way, you can use caching to kind of reduce
00:09:51.860 | the amount of calls that are being sent to your large language model. And this is, you know, hugely
00:09:55.940 | powerful, just kind of reducing the amount of resources that you're using. And again, it can all be done
00:10:00.420 | using one database, with LangChain in this case.
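A minimal sketch of wiring that up, assuming the langchain-mongodb integration package and an OpenAI model; the collection, database, and index names are illustrative, and exact constructor arguments may differ slightly between versions.

```python
# Use MongoDB Atlas as the backend for LangChain's semantic cache, so that
# semantically similar prompts are answered from the cache instead of the LLM.
from langchain_core.globals import set_llm_cache
from langchain_mongodb.cache import MongoDBAtlasSemanticCache
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

set_llm_cache(MongoDBAtlasSemanticCache(
    connection_string="<ATLAS_CONNECTION_STRING>",
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    collection_name="semantic_cache",
    database_name="rag_db",
    index_name="vector_index",
))

llm = ChatOpenAI(model="gpt-4o")
llm.invoke("What did MongoDB add for vector search?")   # cache miss: calls the LLM
llm.invoke("What has MongoDB added for vector search?")  # likely served from the cache
```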
00:10:09.140 | Separately, though, we also now have chat history. And so with LangChain, if you wanted to build on top of MongoDB an experience that was maybe similar to,
00:10:16.820 | you know, ChatGPT, right, where you have the chat history, and it's continuously fetching that data
00:10:21.540 | and putting it back into the prompt so that you can kind of have continuity in the conversation that's
00:10:26.420 | happening with the large language model. Well, you could use the chat message history abstraction
00:10:29.780 | inside of LangChain, and you could basically store the history of chats that are going through the
00:10:34.100 | platform. And each time a prompt is sent back into the large language model, you could use the chat
00:10:39.540 | history, send it back through, include the vector search, and then, you know, send the prompt to the LLM
00:10:44.580 | and send the answer back. And so just another way where you can really kind of evolve this.
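A sketch of that pattern with the chat message history abstraction, again assuming the langchain-mongodb package; the session ID, database, and collection names are illustrative.

```python
# Persist per-session chat history in MongoDB and replay it on each prompt so
# the model keeps conversational context across turns.
from langchain_mongodb import MongoDBChatMessageHistory
from langchain_openai import ChatOpenAI

history = MongoDBChatMessageHistory(
    connection_string="<ATLAS_CONNECTION_STRING>",
    session_id="user-42",
    database_name="rag_db",
    collection_name="chat_histories",
)

llm = ChatOpenAI(model="gpt-4o")

def chat(user_message: str) -> str:
    history.add_user_message(user_message)
    # Send the stored conversation (plus any retrieved context) back to the LLM.
    reply = llm.invoke(history.messages).content
    history.add_ai_message(reply)
    return reply
```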
00:10:48.260 | A cool startup that's using us right now to do a lot of these different things where they're taking
00:10:53.380 | advantage of kind of all of the flexibility of having a transactional database kind of built in
00:10:58.420 | with your vector search capability is a company called 4149. I would, you know, recommend checking
00:11:03.300 | them out. Basically, they're building an AI teammate and not like a coding teammate, but instead one that kind
00:11:08.980 | of, you know, listens to your meeting, tracks what you're doing, fetches additional information and kind
00:11:14.660 | of prompts you, the user, with that information that you may need to kind of complete a task, you know,
00:11:20.660 | write an email or kind of schedule a project. And they're using MongoDB not just to store their vector
00:11:26.420 | data and do, you know, semantic similarity search, but also to store data about their users, data about,
00:11:31.780 | you know, specific meetings, chat history, all of this information that's not necessarily kind
00:11:37.060 | of your typical semantic search type data use case, but instead it really benefits from having a single
00:11:42.740 | operational transactional database that also has vector search attached. And so that's where we're
00:11:47.460 | seeing like a lot of the excitement as we move into this, you know, world of agents and doing kind of
00:11:51.940 | complex, differentiated RAG. Having a full transactional database really kind of opens up a new world of
00:11:58.740 | kind of storing and giving, you know, these agents more affordances to interact with the data. And, you
00:12:04.980 | know, just one more thing to mention is that, you know, at the end of the day, all of this is built
00:12:09.780 | inside of MongoDB Atlas, which gives you comprehensive security controls and privacy. It, you know, gives you
00:12:16.420 | kind of total uptime and automation to ensure that you have kind of optimal performance to serve your
00:12:21.860 | application. And finally, it's deployable in over 100 regions across all of the major cloud
00:12:26.900 | providers, including our search node offering that I mentioned earlier, that really allows you to
00:12:30.900 | optimize how you deploy these resources. And so we're really thrilled to have this. Just kind of a quick
00:12:36.500 | call out. Thanks all for coming to check out this talk. If you want to try MongoDB Atlas for free,
00:12:43.300 | we have a forever-free tier where vector search is available. And you can also learn more about our AI
00:12:48.340 | capabilities using this other QR code as well. And with that, I'm done.
00:12:53.780 | Thank you.