Welcome! So glad to see you all here. Welcome to When Vectors Break Down: Graph-based RAG for dense enterprise knowledge. And a big thank you to swyx and Ben for putting on yet another amazing event. So it's a pretty interesting signal that we have an entire track dedicated to graph-based RAG.
And I think in addition to all of the agentic promise of graph-based RAG, we're also seeing that the market is starting to catch up, that vector search is just not enough for RAG at scale. You may have seen this really interesting article by Jo Kristian Bergum, who is around here somewhere, on the rise and fall of the vector database infrastructure category, and his subsequent interview on Latent Space, where he talked about how vector databases experienced this gold rush after ChatGPT's launch, but the industry is starting to recognize that vector search alone is insufficient for sophisticated retrieval and that we're going to need multiple strategies beyond simple vector similarity.
This is music to our ears at Writer, because we've actually been talking about this for a long time. We've been talking about the benefits of graph-based RAG for a couple of years now. In fact, if you look at this article from November 2023, which in AI time is like prehistoric times, we actually talk about the benefits of knowledge graphs and the shortcomings of vector databases and simple similarity search for enterprise RAG at scale.
And if you're not familiar with Writer, we're this end-to-end agentic platform for enterprises where we build our own models, we build our own graph-based RAG system, and we have this suite of software tools on top of that for enterprises to be able to build agents and AI applications. And so, as we've been building Knowledge Graph over the years, it's been an interesting journey working with these Fortune 500 and Global 2000 companies at scale.
Most of them, or many of them, are in highly regulated industries like healthcare and finance, where accuracy and low hallucination rates are super important. And so our team has been putting this system together over the years out of different components and different techniques, so that we could really drive our accuracy up and reduce our hallucinations.
And so, what I wanted to share in this talk was kind of the journey of how we got there. And the main takeaway being, as you're seeing in several of these talks, like the first talk about hybrid search, there are many different ways that you can get the benefits of knowledge graphs in RAG.
And also, how you get there and what you learn along the way is actually often very valuable as you're building out your retrieval system, almost just as valuable as the end result itself. So, I'm going to weave together these two stories of our journey to graph-based RAG and sort of the first principles thinking that I think has made our team successful in putting together this system as we continue to iterate and improve on it.
So, I'm Sam Julien. I'm the director of developer relations at Writer, and you can find most of my writing and books and newsletters and all of those things at samjulien.com. So, I talked about this system composed of multiple pieces put together over a couple of different years. And I want to talk about sort of how we got to this point and where we are now.
And I'm just going to put a blanket caveat on here that please consider this a sketch and not a blueprint of what is currently in production. Of course, there are like many moving pieces and many layers to this, but I want to abstract it enough to make it something that is practical and usable for people.
So, our research team, we have a cracked research team at Writer, and they have four main areas of focus. Enterprise models, like our Palmyra X5 model, that's the one powering the chat on the AI Engineer website right now. Practical evaluations, like our finance benchmark called FailSafeQA. Domain-specific specialization, these are our domain-specific models like Palmyra Med and Palmyra Fin.
And then what our focus is here, retrieval and knowledge integration. So, bringing enterprise data to work with our models in a secure, reliable way. And I think what's really cool about the way our research team works is that they're very focused on solving practical problems for our customers. They're not just sort of like working in isolation, working on theoretical things.
They're actually driven by customer insights. And that's really what I would consider sort of the first meta lesson of why I think this is working so well for Writer right now. We're really focused on solving the customer problems rather than implementing specific solutions. So, the problem that we are trying to solve kind of constantly, as most of us are here, is that enterprise data is really dense, specialized, and massive.
So, we're often dealing with terabytes of data, and it uses very specific language, and it's often very clustered together. There's not a lot of diversity in the language used in these documents. And that's what our research and engineering teams have been focused on these last few years. So, like most, we kind of started out with a regular search of querying a knowledge base, using an algorithm, and passing that to the LLM.
But that quickly ran out of steam: it was good for basic keyword searches, but not really great for that advanced similarity matching that we needed. So, then, again, like most, we went to vector embeddings, doing chunking and embeddings, putting them in a database, and then doing similarity search and passing the results to the LLM for the end user to query.
But we ran into two major problems with this. The first is that with vector retrieval, chunking and nearest neighbors can give inaccurate answers. So, if you look at this example of kind of this text about the founding of Apple and the timeline, it's very easy for us as humans to look at these text chunks and pick out the fact that the Macintosh was created in 1984.
But when you chunk this text naively and you just give it to a nearest neighbor search, it can get confused. And it thinks that it was actually in 1983 instead of 1984 because it's in the same chunk as the introduction of the Lisa. Side note, I'm a huge vintage Apple nerd, and so I liked this example.
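To make that failure mode concrete, here's a rough sketch of naive fixed-size chunking plus nearest-neighbor retrieval. This is my own illustration, not the slide's code; the snippet of Apple-timeline text, the chunk size, and the all-MiniLM-L6-v2 embedding model are all stand-ins.

```python
# Illustrative sketch of naive chunking + nearest-neighbor retrieval.
# The text, chunk size, and embedding model are assumptions for the example.
from sentence_transformers import SentenceTransformer, util

text = (
    "In 1983 Apple launched the Lisa, its first computer with a graphical user "
    "interface, but it was a commercial failure. The following year the company "
    "introduced the Macintosh, which brought the GUI to a mass audience."
)

# Naive fixed-size chunking ignores sentence and topic boundaries.
chunk_size = 120
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

model = SentenceTransformer("all-MiniLM-L6-v2")
query = "What year was the Macintosh introduced?"
scores = util.cos_sim(model.encode(query), model.encode(chunks))[0]

print(chunks[int(scores.argmax())])
# The top-scoring chunk can easily be the one that pairs "1983" and "Lisa" with the
# Macintosh mention, so a generator grounded only on that chunk may answer 1983.
```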
The other big problem that we ran into with vector retrieval was that it was failing with really concentrated data. So, if you think about a lot of large enterprises, it's not like they're dealing with documents where, like, some of them are talking about animals and some of them are talking about fruit, right?
So, if you have a mobile phone company, for example, and they have thousands and thousands of documents that all talk about megapixels and cameras and battery life and things like that, and you ask the RAG system and the LLM to compare two different phone models, it's going to really struggle, because it's going to retrieve all these near-identical passages and have no idea how to make sense of them. There's a tiny sketch of this just below.
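To give a feel for how little signal similarity scores carry in that kind of corpus, here's a hedged toy example; the spec-sheet sentences and the embedding model are made up for illustration.

```python
# Hypothetical spec-sheet sentences: dense, low-diversity enterprise language.
from sentence_transformers import SentenceTransformer, util

docs = [
    "The X100 has a 48 MP camera, a 6.1-inch display, and a 4,000 mAh battery.",
    "The X200 has a 50 MP camera, a 6.1-inch display, and a 4,200 mAh battery.",
    "The X300 has a 48 MP camera, a 6.4-inch display, and a 4,500 mAh battery.",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
query = "Compare the battery life of the X100 and the X200."
scores = util.cos_sim(model.encode(query), model.encode(docs))[0]
print([round(float(s), 3) for s in scores])
# The scores tend to bunch tightly together because every document uses the same
# vocabulary, so similarity alone gives the generator little help deciding which
# passages actually answer the comparison.
```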
And so that's what took us to graph-based RAG, where instead we would query a graph database and get back the relevant documents using keys and generate an answer. And especially powerful if you combine that with, like, full text and similarity search and things like that. And so this really helped us with our accuracy because we were able to preserve the relationships with the text and provide more context to the model.
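Here's a minimal sketch of what that kind of hybrid retrieval can look like. This is my own toy illustration with made-up document keys and scores, not Writer's actual pipeline.

```python
# Toy illustration (not Writer's pipeline): blend keyword, vector, and graph signals
# over in-memory indexes keyed by document id. All names and scores are made up.
from collections import defaultdict

keyword_hits = {"spec_sheet_x200": 2.1, "press_release_x200": 1.4}  # e.g. BM25 scores
vector_hits = {"spec_sheet_x100": 0.82, "spec_sheet_x200": 0.80}    # cosine similarities
entity_to_docs = {                                                   # graph: entity -> docs
    "PhoneX100": ["spec_sheet_x100", "comparison_x100_x200"],
    "PhoneX200": ["spec_sheet_x200", "comparison_x100_x200"],
}

def hybrid_rank(query_entities, k=5, graph_boost=1.0):
    scored = defaultdict(float)
    for doc, s in keyword_hits.items():        # full-text signal
        scored[doc] += s
    for doc, s in vector_hits.items():         # similarity signal
        scored[doc] += s
    for ent in query_entities:                 # relationship signal from the graph
        for doc in entity_to_docs.get(ent, []):
            scored[doc] += graph_boost
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:k]

print(hybrid_rank(["PhoneX100", "PhoneX200"]))
# The comparison document surfaces because both query entities link to it in the graph,
# even though neither keyword nor vector search ranked it on its own.
```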
And this was really interesting because, at the time, there actually weren't that many people doing graph-based RAG over the last couple of years. And that's why I think the focus of the team on really trying to solve the problem of the customer rather than chase whatever was being hyped up at the time was really important.
So that was really great. But we did run into some challenges back then with using graph databases. Now this is not an indictment of any graph database technology. It's just that we were running into these issues at the time a couple of years ago. And so there were four things that we ran into.
First, that converting the data into the structured graph was getting really challenging and costly at scale. As the graph database scaled, we were hitting the limits of our team's expertise as well as hitting some cost issues. And then we were running into some problems where Cypher was struggling with the advanced similarity matching that we needed.
And we were noticing that LLMs were doing better with text-based queries rather than complex graph structures. Now, again, if you were to do this now, you might not run into those problems, but this is what we ran into historically. And so I think the way that the team approached this is also very interesting, where they decided to stay flexible based on their expertise.
So they were running into these problems that I think were not necessarily fundamental to the technology itself, but more like, okay, how can we solve the problems for our customers using the expertise that we have on the team? And so they came up with a few really interesting solutions to these problems.
So first, when it came to converting the data into the graph structure, the team went back to their expertise and said, what do we know how to do? We know how to build models. So let's build a specialized model that can scale and run on CPUs or smaller GPUs, which I think is a really clever solution.
Now, if you were to do this now, there's probably enough fast, small models out there that you could fine tune something like that. You wouldn't have to build it yourself. But at the time, we didn't really have any options like that. So the team built it themselves and fine tuned a model that was trained to map this data into graph structures of nodes and edges.
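As a rough sketch of that text-to-graph step, this is roughly the shape of prompt and nodes-and-edges output such a model might produce. The prompt template, the schema, and the hard-coded output are assumptions for illustration, since the actual model is Writer's own.

```python
# Hedged sketch of the text-to-graph step. "extract_graph" stands in for the specialized
# model (Writer trained their own); the prompt template, node/edge schema, and hard-coded
# output are assumptions here so the example runs on its own.
import json

PROMPT_TEMPLATE = """Extract a knowledge graph from the text.
Return JSON with "nodes" (id, type) and "edges" (source, relation, target).

Text: {text}
JSON:"""

def extract_graph(text: str) -> dict:
    # In practice: send PROMPT_TEMPLATE.format(text=text) to a small fine-tuned model
    # and json.loads the completion. Hard-coded here for illustration.
    return {
        "nodes": [
            {"id": "Apple", "type": "Company"},
            {"id": "Macintosh", "type": "Product"},
            {"id": "1984", "type": "Year"},
        ],
        "edges": [
            {"source": "Apple", "relation": "INTRODUCED", "target": "Macintosh"},
            {"source": "Macintosh", "relation": "INTRODUCED_IN", "target": "1984"},
        ],
    }

print(json.dumps(extract_graph("Apple introduced the Macintosh in 1984."), indent=2))
```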
And we did some better context-aware splitting and chunking to preserve the context and the semantic relationships, and this really helped preserve reliability. Okay. So then there were the issues with scaling the graph databases, the limits of the team's expertise, and the cost at scale.
So again, we went back and thought about what is our team's expertise in and what can we do? And so what we did was instead, we stored the data points as JSON in a Lucene-based search engine. So we take the graph structure, we convert it into JSON, we put it in the search engine.
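A minimal sketch of that idea, assuming an elasticsearch-py 8.x client and a local cluster as a stand-in for whichever Lucene-based engine is actually used; the index name, field layout, and sample edge are made up.

```python
# Minimal sketch: flatten graph edges into JSON documents and index them in a
# Lucene-based engine (Elasticsearch here is just a stand-in).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

edges = [
    {
        "source": "Macintosh",
        "relation": "INTRODUCED_IN",
        "target": "1984",
        "source_doc": "apple_history.pdf",
        "snippet": "The following year the company introduced the Macintosh.",
    }
]

# One JSON document per edge keeps each relationship and its provenance searchable.
for i, edge in enumerate(edges):
    es.index(index="kg", id=f"edge-{i}", document=edge)

# Retrieval then becomes ordinary full-text search over relationship documents.
hits = es.search(index="kg", query={"match": {"snippet": "Macintosh introduced"}})
for h in hits["hits"]["hits"]:
    src = h["_source"]
    print(src["source"], src["relation"], src["target"], "<-", src["source_doc"])
```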
And this allowed us to easily handle the large amounts of data without any performance or speed degradation at scale, while still being something that the team was really good at. And so the team had started to assemble this concept of what our RAG system was looking like. And again, this is kind of more of a historical snapshot and a sketch over time.
But where we do the context-aware splitting and text-to-graph with this specialized model and then pass it to a search engine. And we were really starting to drive up our accuracy. But we still had those problems with similarity matching, and with text-based queries doing better than complex graph structures.
And so again, the team sort of went back to first principles and thought, okay, what is it that we're trying to solve here? And let's go back to the research and figure out what we can build on to build a solution that's best for our customers and our specific needs.
And I think this is kind of the final meta point of letting research challenge your assumptions. So rather than staying focused on the solution, step back, look at the research, and figure out what you can do to solve the challenges for your customers. So they went back to the original RAG paper.
And if you go back to the original RAG paper, it doesn't actually ever talk about using a prompt with context and a question, which is super interesting. That's sort of the de facto way of doing RAG now. But the original RAG paper actually proposed this two-component architecture with a retriever and a generator built on a pre-trained sequence-to-sequence model.
It never actually talks about prompt, context, and question. And so that's where they came across Fusion-in-Decoder, which I kind of think of as an alternate timeline for RAG, like if we hadn't gone down the road of prompt, context, and question. And so Fusion-in-Decoder is this technique that builds upon the proposal of the original RAG paper, where it processes the retrieved passages independently in the encoder to get linear scaling instead of quadratic scaling, but then jointly in the decoder for better evidence aggregation.
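To show the mechanics, here's a hedged sketch of Fusion-in-Decoder-style inference using a plain t5-small from Hugging Face. A base checkpoint isn't actually trained for FiD, so this only illustrates the independent encoding and joint decoding, not real answer quality.

```python
# Hedged sketch of Fusion-in-Decoder-style inference with a base t5-small checkpoint.
# The point is the mechanics: encode each passage independently, decode over all of
# them jointly. Output quality from an untuned base model is not the point.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tok = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "When was the Macintosh introduced?"
passages = [
    "Apple introduced the Macintosh in 1984.",
    "The Lisa, released in 1983, was a commercial failure.",
    "The Macintosh popularized the graphical user interface.",
]

# Encoder runs once per (question, passage) pair -> cost grows linearly with passages.
encoder = model.get_encoder()
enc_inputs = [tok(f"question: {question} context: {p}", return_tensors="pt") for p in passages]
enc_states = [encoder(**inp).last_hidden_state for inp in enc_inputs]

# Concatenate along the sequence axis so the decoder attends over all passages at once.
fused = torch.cat(enc_states, dim=1)
mask = torch.cat([inp["attention_mask"] for inp in enc_inputs], dim=1)

out = model.generate(
    encoder_outputs=BaseModelOutput(last_hidden_state=fused),
    attention_mask=mask,
    max_new_tokens=16,
)
print(tok.decode(out[0], skip_special_tokens=True))
```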
So, a big efficiency breakthrough and lots of state-of-the-art performance. I know this is super abstract. Facebook actually has a Fusion-in-Decoder library that you can play around with to actually do the steps of Fusion-in-Decoder. I also know that at this point you're going, "What the heck is this guy talking about in a graph RAG track?
Why are we talking about Fusion-in-Decoder?" Well, I'm glad you asked, because the next big breakthrough was knowledge graphs with Fusion-in-Decoder. So you can use knowledge graphs with Fusion-in-Decoder as a technique. And this improves upon the Fusion-in-Decoder paper by using knowledge graphs to understand the relationships between the retrieved passages.
And so it helps with this efficiency bottleneck and improves the process. I'm not going to walk through this diagram step by step, but this is the diagram in the paper of the architecture where it uses the graph and then does this kind of two-stage re-ranking of the passages. And it helps with improving the efficiency while also lowering the cost.
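As a very loose illustration of the idea: the actual paper uses graph-neural-network-based re-ranking in two stages, so this toy version only keeps the core notion that graph links between passages can influence the ordering before the expensive decoding step.

```python
# Very loose toy illustration of graph-aware passage re-ranking (not the paper's method).
def rerank_with_graph(passages, retriever_scores, edges, seed_k=2, alpha=0.5):
    """passages: list of ids; retriever_scores: id -> score;
    edges: pairs of passage ids that share a knowledge-graph entity."""
    neighbors = {p: set() for p in passages}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Stage 1: trust the retriever for a handful of seed passages.
    seeds = sorted(passages, key=retriever_scores.get, reverse=True)[:seed_k]

    # Stage 2: boost passages connected to those seeds in the knowledge graph.
    def score(p):
        graph_bonus = sum(1 for s in seeds if p in neighbors[s])
        return retriever_scores[p] + alpha * graph_bonus

    return sorted(passages, key=score, reverse=True)

passages = ["p1", "p2", "p3", "p4"]
retriever_scores = {"p1": 0.9, "p2": 0.5, "p3": 0.45, "p4": 0.4}
edges = [("p1", "p4")]  # p4 shares an entity with the top retrieved passage
print(rerank_with_graph(passages, retriever_scores, edges))
# p4 jumps from last to second because it is linked to the top passage in the graph.
```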
And so the team took all this research and came together to build their own implementation of Fusion-in-Decoder, since we actually build our own models, to make that kind of the final piece of the puzzle. And it really helped our hallucination rate. It really drove it down. And then we published a white paper with our own findings.
And so then we kind of had that piece of the puzzle. And there's a few other techniques that we don't have time to go over. But point being, we're assembling together multiple techniques based on research to get the best results we can for our customers. So that's all well and good, but does it actually work?
That's the important part, right? So we did some benchmarking last year. We used Amazon's RobustQA dataset and compared our retrieval system, with Knowledge Graph and Fusion-in-Decoder and everything, against seven different vector search systems. And we found that we had the best accuracy and the fastest response time.
So I encourage you to check that out and kind of check out this process. Benchmarks are really cool. But what's even cooler is what it unlocks for our customers, which are various features in the product. For one, like most graph structures, we can actually expose the thought process, because we have those relationships and the additional context, where you can show the snippets and the sub-queries and the sources for how the RAG system is actually getting its answers.
And we can expose this in the API to developers as well as in the product. And then we're also able to have Knowledge Graph excel at multi-hop questions, where we can reason across multiple documents and multiple topics without any struggles. And then lastly, it can handle complex data formats where vector retrieval struggles, where an answer might be split across multiple pages or maybe there's a similar term that doesn't quite match what the user is looking for.
But because we have that graph structure and Fusion-in-Decoder with the additional context and relationships, we're able to formulate correct answers. So again, my main takeaway here is that there are many ways that you can get the benefits of knowledge graphs in RAG. That could be through a graph database.
It could be through doing something creative with Postgres. It could be through a search engine. But you can take advantage of the relationships that you can build with knowledge graphs in your RAG system. And as you get there, you can challenge your assumptions and focus on the customers to be able to get to the end result to make the team successful.
And so for our team, it was focusing on the customer needs instead of what was hyped, staying flexible based on the expertise of the team, and letting research challenge their assumptions. So if you want to join this amazing team, we're hiring across research, engineering, and product. We would love to talk to you about any of our open roles.
And I'm available for questions. You can come find me in the hallway or reach out to me on Twitter or LinkedIn. And that's all I've got for you. Thank you so much.