RAG and the MongoDB Document Model: Ben Flast

Great to see everyone. That was a great talk; I'm very interested in this rap group, and I have some concerns. I'm here from MongoDB, and I'm going to be talking about RAG, and specifically what's unique about doing RAG with the MongoDB document model and MongoDB Atlas, the platform. I'm going to start by talking a bit about retrieval-augmented generation in general. I'm sure a lot of us are familiar with it already, but it will be good to cover some of the basic concepts. Then I'm going to talk about the document model; for those of you who are not so familiar with MongoDB, this will be a brief intro to what it means to use MongoDB and why it's a unique database. Then I'll talk about vector search, a capability that now exists inside of MongoDB, then some of our AI integrations, and finally some use cases to help stimulate ideas for all of you. I'm going to do all this in a quick 15 minutes.

Obviously, LLMs are super exciting. It's been crazy over the past year and a half, but there has been a question around what they can do, and when you need to use RAG and when you don't. If you took a vanilla LLM connected to nothing and asked it how much money is in your bank account, it wouldn't know. I think we can all understand why that's the case, and hope that for the foreseeable future it continues to be the case. All that said, if we want to make useful applications with these LLMs, the reality is that without context, there's only so much you can do. That's where RAG comes in. RAG stands for retrieval-augmented generation. I'm sure this is old hat to most of you, but we'll go through it quickly. What this means is that you take a generic AI or ML model (today we're generally talking about LLMs) that has a training cutoff, is missing your private data, maybe hallucinates, maybe doesn't, but overall isn't personalized. Then you take your data and augment the model at prompting time, giving it the context it needs to answer the questions you want answered for the use cases you're bringing it to bear on. That could be company-specific data, product info, order history, anything you're storing inside the application database that's already powering your in-app experiences. With that, you get a transformative AI-powered application that is refined, consistent, and accurate in the responses it gives when you prompt the model.

The typical RAG setup that you've all probably seen, and in most cases probably implemented, looks something like this: a user enters a prompt; the question they enter gets sent to an embedding model and embedded; that drives a semantic search on a vector database, in this case MongoDB Atlas Vector Search, obviously, which pulls back similar documents; those documents, along with the original prompt in most cases, go into the large language model, which produces an answer that goes back to the user. This is what most people are doing for chatbot, copilot, and other types of use cases.
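As a rough illustration of that flow, here is a minimal sketch in Python. It assumes hypothetical embed() and ask_llm() helpers standing in for whatever embedding model and LLM you use; the $vectorSearch stage itself is covered in more detail later in the talk.

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # your Atlas URI
collection = client["rag_demo"]["documents"]

def answer(question: str) -> str:
    # 1. Embed the user's question (embed() is a placeholder for your embedding model).
    query_vector = embed(question)

    # 2. Semantic search over documents stored in MongoDB Atlas Vector Search.
    hits = collection.aggregate([
        {"$vectorSearch": {
            "index": "vector_index",
            "path": "content_embedding",
            "queryVector": query_vector,
            "numCandidates": 100,
            "limit": 5,
        }}
    ])
    context = "\n".join(doc["content"] for doc in hits)

    # 3. Send the retrieved context plus the original prompt to the LLM
    #    (ask_llm() is a placeholder for your model call).
    return ask_llm(f"Answer using this context:\n{context}\n\nQuestion: {question}")
```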
right? But what's really interesting is that when you use MongoDB, you can go quite a bit farther than this and do 00:03:32.020 |
things that are, you know, in many cases a bit different. So with, you know, RAG, the standard 00:03:42.500 |
RAG is really not going to be enough. The applications of tomorrow are going to need more context, right? 00:03:47.300 |
And that's where the MongoDB document model comes in. So the document model is really just JSON, 00:03:52.820 |
and it gets stored inside of MongoDB in something called BSON, which stands for binary JSON, but you 00:03:57.700 |
have things like a name, a profile, you know, you can include whatever you want as long as it's JSON, 00:04:04.820 |
and that is actually what you store inside of your database and what you fetch from the database. 00:04:09.060 |
So with the document model, if you're comparing it to something that you would do in kind of a relational 00:04:13.700 |
system where you have objects that your applications are interfacing with, right, like a customer 00:04:18.420 |
object or a contact object, and you're, you know, stitching together different tables inside of a 00:04:23.140 |
relational database, instead of having to kind of go through all of this pain and hassle, you get to 00:04:27.940 |
go to something like this, right, where you just store the objects that your application is using 00:04:32.020 |
directly inside of the database, and there's not all of this kind of reconfiguring and reconnoitering. 00:04:36.740 |
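For example, a customer and their contacts can live in one document rather than being joined across tables. A minimal sketch with PyMongo (the database, collection, and field names here are just illustrative):

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")
customers = client["appdb"]["customers"]

# The same object your application works with is what gets stored and fetched.
customers.insert_one({
    "name": "Ada Lovelace",
    "profile": {"tier": "enterprise", "signup_date": "2024-01-15"},
    "contacts": [
        {"type": "email", "value": "ada@example.com"},
        {"type": "phone", "value": "+1-555-0100"},
    ],
})

# No joins needed to read it back as one object.
doc = customers.find_one({"name": "Ada Lovelace"})
```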
The way we look at this is that, you know, documents are universal, right? In many cases, they're kind of the 00:04:42.100 |
superset of all, you know, data types that you might want to model. And so you can have JSON, you can have 00:04:48.340 |
tabular data, key value stores, geospatial graph, it goes on. And what this translates to is, you know, 00:04:54.820 |
it's more efficient in many places. It is more productive for developers who are building systems, 00:04:59.860 |
and in many cases it can be more scalable, since MongoDB is just naturally very horizontally scalable 00:05:04.180 |
through sharding. So that's documents, and that's kind of just the core benefit of MongoDB. But now, 00:05:12.100 |
when we add on vectors is where things get, you know, really interesting, right? So what we've done is 00:05:17.380 |
we've added in HNSW indexes into MongoDB Atlas, which allows you to do approximate nearest neighbor vector 00:05:24.660 |
search over data that's stored in your database. And so what you do is you take your embeddings and you 00:05:29.860 |
add them directly into the documents that you're already storing in your database. And so if you 00:05:34.980 |
had this JSON that had symbol, quarter, and content fields, you could add a content underscore embedding 00:05:40.740 |
field, which would just be the vectorization of, you know, either your entire document, some piece of 00:05:46.500 |
data in your document, or some piece of data that's living elsewhere that you're going to map back to. 00:05:50.500 |
And you can store all of that inside of your documents. And you can store vectors that are up to 00:05:55.060 |
4096 dimensions. Once that's done, you add in an index definition. In this case, you know, 00:06:02.900 |
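Continuing the sketch above, the embedding simply becomes another field on the document (the embed() helper, collection name, and example content are illustrative):

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")
filings = client["appdb"]["filings"]

content = "Q3 revenue grew 12% year over year, driven by cloud services."

filings.insert_one({
    "symbol": "MDB",
    "quarter": "Q3-2024",
    "content": content,
    # The vectorization of the content field; embed() stands in for your embedding model.
    "content_embedding": embed(content),  # a list of floats, up to 4096 dimensions
})
```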
Once that's done, you add an index definition. In this case, the type of index is a vector search index; you say that the type of field you're indexing is a vector, you give the path where it's located, the number of dimensions, and the similarity function, that is, how you want to determine the distance between the vector you're searching with and the ones you're going to find.
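An index definition along those lines might look like the following. This sketch assumes a recent PyMongo with search-index helpers; the same definition can also be created through the Atlas UI or CLI, and the index name, dimensions, and similarity here are just example choices:

```python
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")
filings = client["appdb"]["filings"]

index_model = SearchIndexModel(
    name="vector_index",
    type="vectorSearch",
    definition={
        "fields": [
            {
                "type": "vector",              # the field being indexed is a vector
                "path": "content_embedding",   # where it lives in the document
                "numDimensions": 1536,         # must match your embedding model's output size
                "similarity": "cosine",        # e.g. cosine, euclidean, or dotProduct
            },
            {"type": "filter", "path": "symbol"},  # optional: fields you want to pre-filter on
        ]
    },
)

filings.create_search_index(model=index_model)
```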
Once that's done, behind the scenes the vector index is built immediately and kept in sync with the data as it's updated inside the database. You can then use our $vectorSearch aggregation stage to compute an approximate nearest neighbor search. You give it the index; the query vector, which is the vectorization of the data you're searching for; and the path where the data lives inside your documents. Then you have numCandidates and limit: the limit is how many results you want back from this stage, and numCandidates is how many entry points into your HNSW graph you want to make, which lets you tune the accuracy of your results. Finally, you can use a filter, which is essentially a pre-filter: as we traverse the graph, it lets you fetch the documents and filter out the ones that are less relevant for your specific query.
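Putting those pieces together, a query against the index sketched above might look like this (embed() is again a placeholder for your embedding call, and the filter on symbol assumes that field was declared as a filter field in the index definition):

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")
filings = client["appdb"]["filings"]

query_vector = embed("How did cloud revenue trend this quarter?")

pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",          # the vector search index to use
            "path": "content_embedding",      # where the vectors live in the documents
            "queryVector": query_vector,      # vectorization of what you're searching for
            "numCandidates": 200,             # candidates considered; higher = more accurate, slower
            "limit": 5,                       # how many results this stage returns
            "filter": {"symbol": "MDB"},      # pre-filter applied during graph traversal
        }
    },
    # Optionally project the similarity score alongside the document fields.
    {"$project": {"content": 1, "score": {"$meta": "vectorSearchScore"}}},
]

for doc in filings.aggregate(pipeline):
    print(doc["score"], doc["content"])
```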
So that is the vector search capability. But there's one other core thing that's important to call out, which we've also introduced alongside vector search: something called search nodes. Search nodes let you decouple your approach to scaling. With a transactional database, you have a primary and two secondaries, which gives you durability, high availability, and all of the guarantees you'd want for a transactional database. But when you add search, the resource usage profile may be a bit different. So we've added a new type of node to the platform that lets you store your vector indexes on those nodes and scale them independently from the infrastructure storing your transactional data. That lets you tune the resources you bring to bear to serve your workload precisely. With that, we've really transformed how Atlas can serve these vector search workloads, by giving you a unified interface and a consistent use of the document model while at the same time decoupling how you scale for your workloads. That's the true power of what we've done with vector search.

Along with this, we've also built several different AI integrations. We're integrated into some of the most popular AI frameworks: LlamaIndex, LangChain, Microsoft Semantic Kernel, AWS Bedrock, and Haystack. In each of them we support quite a few different primitives. Just to name a few: inside LangChain we have a vector store, but you can also use a chat message history abstraction; we have quite a few primitives in LlamaIndex; and the same goes for Haystack and AWS Bedrock.
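As one example of those primitives, the LangChain integration exposes Atlas as a vector store. The sketch below follows the langchain-mongodb package naming as I understand it; the class names, package layout, and the OpenAIEmbeddings choice are assumptions to check against the current integration docs:

```python
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings

vector_store = MongoDBAtlasVectorSearch.from_connection_string(
    "mongodb+srv://<user>:<password>@<cluster>/",
    namespace="appdb.filings",        # database.collection holding the documents
    embedding=OpenAIEmbeddings(),     # any LangChain embedding model works here
    index_name="vector_index",
)

# Adds documents (embedding them on the way in), then retrieves by semantic similarity.
vector_store.add_texts(["Q3 revenue grew 12% year over year."], metadatas=[{"symbol": "MDB"}])
results = vector_store.similarity_search("How is cloud revenue trending?", k=3)
```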
All of these let you do that next level of RAG I was talking about at the very beginning, where you not only combine your typical vector search with RAG, but also get to use the transactional data inside your database to augment your prompts. To give you a couple of examples of what that ends up looking like: when you think about broader use of memory for large language models, you might think about semantic caching. This is a capability inside LangChain, and you can use MongoDB as the backend of that semantic cache. Now, when a user comes in and asks a question, we first send it over to the retriever and figure out what the question should look like, that is, the prompt plus the additional augmented data. Then we send it to the semantic cache. If it's a hit based on semantic similarity, we just fetch the cached answer instead of hitting the LLM again; if it's not a hit, we send the prompt to the LLM and get the answer back to the user. In this way you can use caching to reduce the number of calls being sent to your large language model, which is hugely powerful just in terms of reducing the resources you're using. And again, it can all be done using one database, with LangChain in this case.
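A minimal sketch of wiring that up, again assuming the langchain-mongodb class names and the global LLM cache hook in langchain-core (treat the exact imports and parameters as assumptions to verify against the current docs):

```python
from langchain_core.globals import set_llm_cache
from langchain_mongodb import MongoDBAtlasSemanticCache
from langchain_openai import OpenAIEmbeddings

# Cached prompt/response pairs are stored, with their embeddings, in a MongoDB collection.
set_llm_cache(
    MongoDBAtlasSemanticCache(
        connection_string="mongodb+srv://<user>:<password>@<cluster>/",
        database_name="appdb",
        collection_name="semantic_cache",
        embedding=OpenAIEmbeddings(),
        index_name="cache_vector_index",
    )
)

# From here on, LangChain LLM calls check the cache first: a semantically similar
# prompt returns the cached answer instead of making another model call.
```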
Separately, though, we also now have chat history. With LangChain, suppose you wanted to build on top of MongoDB an experience similar to ChatGPT, where you have the chat history and you're continuously fetching that data and putting it back into the prompt so there's continuity in the conversation with the large language model. You could use the chat message history abstraction inside LangChain and store the history of chats going through the platform. Each time a prompt is sent to the large language model, you take the chat history, send it back through, include the vector search, then send the prompt to the LLM and return the answer. That's just another way you can really evolve this.
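A sketch of that abstraction, with the same caveat that the class name and parameters are my assumptions about the langchain-mongodb package, and the session id is just an example:

```python
from langchain_mongodb import MongoDBChatMessageHistory

history = MongoDBChatMessageHistory(
    connection_string="mongodb+srv://<user>:<password>@<cluster>/",
    session_id="user-42-session-1",   # one message stream per conversation
    database_name="appdb",
    collection_name="chat_histories",
)

# Messages persist in MongoDB, so the full conversation can be pulled back
# into the prompt on every turn.
history.add_user_message("What did we decide about the Q3 launch?")
history.add_ai_message("You agreed to move the launch to October 15th.")
print(history.messages)
```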
A cool startup that's using us right now to do a lot of these different things, taking advantage of all the flexibility of having a transactional database built in alongside vector search capability, is a company called 4149. I'd recommend checking them out. They're building an AI teammate, not a coding teammate, but one that listens to your meetings, tracks what you're doing, fetches additional information, and prompts you, the user, with the information you may need to complete a task, write an email, or schedule a project. They're using MongoDB not just to store their vector data and do semantic similarity search, but also to store data about their users, data about specific meetings, chat history, all of the information that isn't your typical semantic search use case but really benefits from having a single operational, transactional database that also has vector search attached. That's where we're seeing a lot of the excitement as we move into this world of agents and complex, differentiated RAG: having a full transactional database opens up a new world of storing data and giving these agents more affordances to interact with it.

Just one more thing to mention: at the end of the day, all of this is built inside MongoDB Atlas, which gives you comprehensive security controls and privacy. It gives you uptime and the automation to ensure you have optimal performance to serve your application. And finally, it's deployable in more than 100 regions across all of the major cloud providers, including the search node offering I mentioned earlier, which really lets you optimize how you deploy these resources. So we're really thrilled to have this. Just a quick call-out: thanks, all, for coming to check out this talk. If you want to try MongoDB Atlas for free, we have a forever-free tier where vector search is available, and you can also learn more about our AI capabilities using this other QR code as well. And with that, I'm done.