
GraphRAG methods to create optimized LLM context windows for Retrieval — Jonathan Larson, Microsoft


Transcript

Hello everyone. Thanks for coming to the session here today. My name is Jonathan Larson and I run the GraphRAG team at Microsoft Research. Some of you may have seen the GraphRAG paper we released last year, or you might have seen our GitHub repo, which has a lot of stars on it.

When it was released last year, to our surprise, it got a lot of attention, and it also inspired many other offerings out there. My favorite, of course, was the card game that Neo4j did, because I'm a huge board game fanatic.

But there's been lots of derivative work from it. Today, though, I'm not here to talk about what's already happened; I'm here to talk about some of the new horizons. There are two things I really want to leave you with here today. The first is that LLM memory with structure is just an absolutely key enabler for building effective AI applications.

The second, as we'll talk about a little later in the talk, is that agents paired with these structures can provide something even more powerful. Let me go ahead and go back to the slide here. So today, I'm going to tell you about a couple of things.

The first is actually showing you GraphRAG as applied to a specific domain. In this case, we're going to apply it to the coding domain to look at enterprise productivity in the coding space. The second is a new release that we are announcing today.

We're actually publishing a blog post on this tomorrow: BenchmarkQED just went open source as well. And then I'll talk about some new results on a GraphRAG evolution, a new technology we've been working on called LazyGraphRAG. I won't be going into the specifics of how it works.

I'll just be showing some benchmarks and talking about some of the ways you'll be able to access it soon. So with that, let me jump into GraphRAG for code and tell you how we've been using GraphRAG to drive repository-level understanding. And I want to start first with a little demonstration.

Everyone likes to watch little videos, so I'm going to go ahead and hit play. And yes, it's showing on the screen there. So this is a little terminal-based video game that one of my engineers put together. It's one where you jump over obstacles; you get points when you jump over them, and if you run into them, you lose points.

And two important things, though. First, the LLM has never seen this code before, and it's small enough for a human to know the ground truth holistically, because it's only about 200 lines of code across seven files. But it's complex enough that LLMs have had a lot of trouble understanding it if you just provide all the code directly in the context window.
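For a sense of scale, a game of this shape can be sketched in a few dozen lines using Python's standard curses module. This is not the team's actual code, just a minimal illustration of the kind of terminal side-scroller being described: press space to jump, score a point for clearing an obstacle, lose a point for hitting one.

```python
import curses
import random

def main(stdscr):
    curses.curs_set(0)       # hide the cursor
    stdscr.timeout(80)       # getch() waits at most 80 ms, pacing the loop
    height, width = stdscr.getmaxyx()
    ground = height - 2
    player_x, player_y = 5, ground
    velocity, score = 0, 0
    obstacles = []           # x positions of obstacles on the ground row

    while True:
        key = stdscr.getch()
        if key == ord('q'):
            break
        if key == ord(' ') and player_y == ground:
            velocity = -3    # start a jump only when standing on the ground

        # simple vertical motion: rise, then fall back to the ground row
        player_y = max(min(player_y + velocity, ground), ground - 6)
        velocity = min(velocity + 1, 2)

        # spawn and scroll obstacles; score when one reaches the player column
        if random.random() < 0.08:
            obstacles.append(width - 2)
        obstacles = [x - 1 for x in obstacles if x > 1]
        for x in obstacles:
            if x == player_x:
                score += 1 if player_y < ground else -1

        stdscr.erase()
        stdscr.addstr(0, 2, f"Score: {score}")
        stdscr.addstr(player_y, player_x, "@")
        for x in obstacles:
            stdscr.addstr(ground, x, "#")
        stdscr.refresh()

curses.wrapper(main)
```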

So with that as background, let's take a look at what happens if you use typical regular RAG over the top of this. If you're using one of the tools that helps analyze your code, this is the type of answer you might get back. We ran this through a regular RAG system.

And we asked it, describe what this application is and how it works. And just to read you what it says, because it's too small to read, "The application is designed as a game that is configured and initiated through a main function. The game leverages a configuration file and has its main components, such as the game screen, game logic, encapsulated in separate classes and functions." Totally useless.

It just says it's a game; there's nothing more to it than that, and it put a bunch of other cruft in there. If you use GraphRAG for code over the exact same code base with the exact same question, you get a much, much better description. It's a very stark contrast between the two.

So with GraphRAG for code, it says: "The application is a terminal-based interactive game designed to run in a curses-based terminal environment." So far, not much better. But this next part just kills it: "It features a player character that can jump vertically, obstacles that move horizontally across the screen, and a static background layer."

"The game controls via keyboard inputs, specifically using the space bar to trigger the player's jump action." What this is showing you is semantic understanding. If you've read our paper or used any of the GraphRAG code base that we've put out there, you'll know about the concept of what we call local and global queries.

And so this is what we would call a global query over the top of the repository: it requires understanding the whole repository to answer the question correctly, and we can see that it excels at that. So one of the next things we decided to do was, well, if it can answer questions pretty well, can it maybe do code translation?
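GraphRAG's real query pipeline is more involved than this, but as a rough illustrative sketch (not the library's actual API, and `llm` here is just a stand-in callable), the distinction between the two query styles looks something like the following: a local query expands outward from the entities the question mentions, while a global query map-reduces over community summaries that cover the whole index.

```python
from dataclasses import dataclass, field

@dataclass
class GraphIndex:
    entities: dict[str, str]                   # entity name -> description
    edges: dict[str, set[str]] = field(default_factory=dict)      # adjacency
    community_summaries: list[str] = field(default_factory=list)  # per-community reports

def local_query(index: GraphIndex, question: str, llm) -> str:
    # Match question terms to entities, pull in their graph neighborhood,
    # and answer from that focused context.
    seeds = [name for name in index.entities if name.lower() in question.lower()]
    neighborhood = set(seeds)
    for seed in seeds:
        neighborhood |= index.edges.get(seed, set())
    context = "\n".join(index.entities[name] for name in neighborhood)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")

def global_query(index: GraphIndex, question: str, llm) -> str:
    # Map step: ask for a partial answer from each community summary.
    partials = [
        llm(f"Summary:\n{summary}\n\nPartial answer to: {question}")
        for summary in index.community_summaries
    ]
    # Reduce step: combine the partial answers into one holistic answer.
    return llm("Combine these partial answers into one:\n" + "\n---\n".join(partials))
```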

And so that was the next thing we aimed it at. Taking the same code base, we took on the challenge of taking this Python code and asking the LLM to translate it directly into Rust. Let me show you how that actually works. I'm going to play the video here.

So what we're going to do is start off in VS Code here. We're going to look at the source code files; there are four main ones, and seven files total for this game. These are the Python source files we're working with. We're going to go ahead and run those just to show you again.

This is the same game, the same side-scroller. And then what we're going to do is take those source code files and put them straight into an LLM. We're first not going to use GraphRAG for code; we're just going to take all that source code, because it's only 200 lines, and put in some nice prompts.

We tried a wide variety of these prompts to see whether the LLM could translate that code holistically, in all of its pieces, into Rust and have it just work out of the box. So this is the translation happening there. We're going to go ahead and copy the Rust code that it generated.
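The exact prompts aren't shown in the video, but the naive baseline amounts to something like the sketch below: concatenate every source file into one prompt and ask a model to translate the whole thing to Rust. The model name, file paths, and prompt wording here are placeholders for illustration, not the ones the team used.

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Bundle all Python source files into a single prompt (placeholder directory).
sources = sorted(Path("game/").glob("*.py"))
bundle = "\n\n".join(f"# file: {p.name}\n{p.read_text()}" for p in sources)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "Translate this Python project to idiomatic Rust. "
                    "Preserve behavior and keep the same file structure."},
        {"role": "user", "content": bundle},
    ],
)
print(response.choices[0].message.content)
```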

So it did generate Rust code; I have to give it credit for that. But after it generated the code and we reviewed it, we tried to compile it, and there were problems everywhere. It doesn't work out of the box. So then we used GraphRAG for code, again bringing structure, these graph structures, to the code base.

We're going to go ahead and run the translate function in GraphRAG for code, and it generates all these new Rust files that you see there on the side. I'm going to skip past here because I don't want to bore you with clicking through a bunch of Rust files.

But once we generate all these Rust files, we can go ahead and run them on the GraphRAG for code side of things, and presto, we have a full video game completely translated from Python, now working natively in Rust. But we didn't want to stop there. The next thing we wanted to do was, well, that game is kind of a toy example.

It's like 200 lines of code. Let's go to a code base with 100,000 lines of code. And that's what we did next. So we went to the Doom code base. It's about 30 years old. Now, we did run a bunch of tests over the top of this first, because I figured, well, all the LLMs are trained on the Doom code base.

I mean, it's going to know these things natively and automatically. So we did a bunch of tests with that, and we found that, while it knew the Doom code base, it didn't know any of the specifics. So if you ask it to actually modify the Doom code base in any sort of meaningful way, the LLMs just fail completely.

So then what we did next was, well, if we can reason and understand over the top of this code base, 100,000 lines of code across 231 files, maybe we can generate some new outputs that you otherwise couldn't generate. So with that, we immediately used it to start generating documentation.

So this is actually showing high-level, repository-level documentation. Again, going back to the concepts of GraphRAG, we have local and we have global queries, and this is showing you the global query results. It's not looking at the understanding of a single file; it's looking at modules in their entirety, across 20 or 30 files, and being able to give you a sense for, say, what the sound system inside the video game is and how it works.

But then you can still drill down into what we would call the local-style queries and actually see the individual files as well. So then we thought, well, this is kind of neat: we can generate the documentation, we can do Q&A, and we can do code translation.

Can we maybe take this one step further and do feature development? So that's the next thing I'm going to show you. We had a lot of video game players in our office, and we were thinking about what would be a cool thing to put over top of this Doom video game.

And somebody said, well, you can't jump in the original game. What if we added the ability for the player to jump? That's a complex thing to add to Doom, because it requires multi-file modification. And if you try to use AI systems to do multi-file modification, they'll often do a great job editing one of the files and then completely break a bunch of the other files in the process.

That just rinses and repeats, and you end up with something that doesn't work, which is what you oftentimes run into. And it's because of this lack of understanding of how everything fits together. But again, this is where GraphRAG can really help out. So with that, we went ahead. If you haven't seen the Doom video game, this is a little video of the original game being played, just to give you some context for what it was.

Then we used the new GitHub Copilot coding agent that was announced at Build, and we wired it up directly to GraphRAG for code. So we're going to go ahead and create a new issue. This is pointing to the Doom code base, and we're going to tell it to add jump capability to the player.

We're just going to add a couple of sentences of description here and tell it to go, and then we're going to assign it to that GitHub Copilot coding agent. The next thing we're going to do is look at what's happening underneath the hood.

So this is the GitHub Copilot coding agent reaching out to GraphRAG for code on the back end, coming up with a plan that's holistic and approaching it from the top down. And this was really the aha moment for us: it changed a whole lot of files and it worked out of the box, where all the other agents we had tried completely failed on this task.

And it worked because of those GraphRAG structures we were bringing to bear. The end result was that we were elated, with jumping now working in Doom. So that's the story of how we got there. With that, I have another part of the talk to cover today.

I've got about five minutes to talk about this one, and it's BenchmarkQED. Shifting topics a little bit: I just showed you GraphRAG applied to one vertical. Next, I want to talk specifically about how we measure and evaluate systems like GraphRAG. How do we build systems to evaluate those local and global quality metrics?

And so for that, today I'm announcing BenchmarkQED. It's available now, open source on GitHub, so you can go ahead and check it out at the link; we just got it live last night, I think. There are three components to this. The first is what we're calling AutoQ.

AutoQ is focused on query generation for target data sets: it allows you to take a data set and generate queries for it. AutoE is the evaluation component, using an LLM as a judge to evaluate how those queries performed on that data set. AutoD is the third component.

And that one focuses on data set summarization and data set sampling. Let's jump into a couple of these in a little more detail. If you take a look at AutoQ, it considers the local and global aspects; you can see that there on the x-axis.

On the y-axis, it combines that with the question source: data-driven, basically questions generated from the data itself, or persona- and activity-driven, which is the second type we have there. Those are more complex: you take on the role of a person in that domain field and use that to generate the questions.
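As a rough sketch of that two-by-two, the four query classes can be written down as plain data along with an illustrative generation prompt. This is not BenchmarkQED's actual API; the class descriptions and the prompt wording are assumptions made here for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryClass:
    scope: str     # "local" = answerable from specific passages; "global" = needs the whole corpus
    source: str    # "data" = derived from the text itself; "activity" = derived from a persona/task
    description: str

AUTOQ_CLASSES = [
    QueryClass("local", "data", "asks about specific facts found in the source text"),
    QueryClass("global", "data", "asks about themes that span the whole data set"),
    QueryClass("local", "activity", "a persona asks about a specific item relevant to their task"),
    QueryClass("global", "activity", "a persona asks a question requiring corpus-wide synthesis"),
]

def generation_prompt(qc: QueryClass, dataset_summary: str, n: int = 5) -> str:
    """Illustrative prompt asking an LLM to synthesize n questions of one class."""
    return (
        f"Data set summary:\n{dataset_summary}\n\n"
        f"Write {n} {qc.scope} questions that fit this description: {qc.description}. "
        "Write them as questions a reader of the underlying documents might ask."
    )
```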

To show you a few of these sample questions, I'm just going to choose a couple here. This was built on an AP News data set focused on medical-related events. A data-local question might be: why are junior doctors in South Korea striking in February 2024?

There's a lot of highly specific information in that, and I would expect regular RAG to perform very well on this type of question. In contrast, take an activity-global question: what are the main public health initiatives mentioned that target underserved communities? There is nothing in that question that you can really pivot on in terms of embedding or indexing.

You really have to holistically know the entire data set. And so AutoQ will help generate questions across that whole spectrum, from local to global, from data-driven to activity-driven, and give you these categories of questions that you can use. Now, once you generate those questions, we can start measuring them using AutoE, which is the evaluation component we released along with this.

Just to show you how this works in practice, it gives you a composite score across metrics: comprehensiveness, diversity, and empowerment, which, if you're familiar with our paper, are the three original metrics we used there, plus a new one that we call relevance.
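As a minimal sketch of how an LLM-as-a-judge evaluation like this can produce the win rates on the next slide: for each question, a judge model compares the two systems' answers on one metric, and the win rate is the fraction of questions one system wins. The judge prompt and the `judge` callable below are assumptions for illustration, not AutoE's actual implementation; in practice one would also swap the A/B order across trials to control for position bias.

```python
METRICS = ["comprehensiveness", "diversity", "empowerment", "relevance"]

def win_rate(questions, answers_a, answers_b, judge, metric: str) -> float:
    """Fraction of questions on which system A's answer beats system B's, per the judge."""
    wins = 0
    for question, a, b in zip(questions, answers_a, answers_b):
        verdict = judge(
            f"Question: {question}\n\n"
            f"Answer A:\n{a}\n\n"
            f"Answer B:\n{b}\n\n"
            f"Which answer is better on {metric}? Reply with exactly 'A' or 'B'."
        )
        wins += verdict.strip().upper().startswith("A")
    return wins / len(questions)

# A value above 0.5 means system A (say, LazyGraphRAG) beats system B
# (say, vector RAG with a given context window) more often than not on that metric.
```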

And so we did some comparisons to show you how this works on LazyGraphRAG, which is one of the newer technologies we've been working on. We compared it to vector RAG with 8K, 120K, and one-million-token context windows, and we had a few takeaways from that.

The first is that if you take a look at these charts, any bar above the 50% mark means LazyGraphRAG is winning that benchmark against the specific vector RAG configuration. So vector RAG, in blue here, runs with 8K, 120K, and one-million-token context windows, and LazyGraphRAG is winning correspondingly 92%, 90%, and 91% of the time on data-local questions, which is kind of a surprise.

Because one of the first things we started seeing is that LazyGraphRAG was providing dominant performance across the entire span of questions, whether they were local or global. We do expect vector RAG to perform better on local questions, and you can see that in the small lift between the data-local and data-global bars.

The second thing we noticed is that the long context window didn't really make much of a difference. We were expecting that the long context windows might give a bit of a boost, because, again, they provide a better view of the data for those global types of questions. But it turns out that LazyGraphRAG in this case was still able to dominate those metrics.

In fact, when we ran the test, and it's not on the slide here, we found that LazyGraphRAG was a tenth of the cost of the one-million-token context window runs as well. So those are a couple of things to show there. Now, with the last minute that I have, I did want to address one more thing with LazyGraphRAG.

You may have read the blog post about it in November last year when we first discussed it. It is now officially being lined up for launch in a couple of products. The first is Azure Local: they just announced at Build that LazyGraphRAG will soon be incorporated into their platform.

So you can actually try it out for yourselves there. It's also being incorporated into the Microsoft Discovery platform, which was recently announced at Build. If you're not familiar with Microsoft Discovery, it does graph-based scientific co-reasoning. Just to give you a quick summary, it goes from hypothesis to experiment to learning to knowledge.

I'll just play a quick, five-second video here. The answers to the questions you get back from this are powered underneath the hood by GraphRAG and LazyGraphRAG. So as you can see, the copilot here is generating deep reasoning over graph-based scientific knowledge.

Those graphs are powered by GraphRAG and LazyGraphRAG. So with that, I just want to leave you with a couple of takeaways. The first is that LLM memory with structure is just a really, really powerful tool to keep in your tool belt.

And the second is that agents, of course, can massively amplify this power. If you have any questions, I'll be outside if you want to talk about any of this further. Thank you so much for your time.