GraphRAG methods to create optimized LLM context windows for Retrieval — Jonathan Larson, Microsoft

00:00:00.000 |
Hello everyone. Thanks for coming to the session here today. My name is Jonathan Larson and I run 00:00:21.120 |
the GraphRAG team at Microsoft Research. Some of you may have seen our paper that we actually 00:00:26.680 |
released last year, the GraphRAG paper, or you might have seen our GitHub repo that was 00:00:32.740 |
out there and has a lot of stars on it. When it released last year, to our surprise, 00:00:37.800 |
it got a lot of attention, and it also inspired many other offerings that we saw out 00:00:42.280 |
there as well. My favorites, of course, were the ones that Neo4j did with the card 00:00:46.300 |
game, because I'm a huge board game fanatic. But there's been lots of derivative 00:00:51.100 |
work from it. And today, though, I'm not here to talk to you about some of the things that 00:00:55.360 |
happened. I'm here to talk to you about some of the new horizons. And the two things I really 00:00:59.980 |
want to leave you with here today are that LLM memory with structure is just an absolutely key 00:01:05.440 |
enabler for building effective AI applications. The second, as we'll talk about a little bit later in 00:01:11.500 |
the talk, is that agents, paired with these structures, can provide something 00:01:16.420 |
even more powerful. And let me go ahead and go back to the slide here. So today, I'm going to tell you 00:01:20.360 |
about a couple things. The first is actually showing you GraphRAG as applied to a specific 00:01:24.980 |
domain. In this case, we're going to apply it to the coding domain to look at enterprise productivity 00:01:29.180 |
in the coding space. The second is we're going to take a look at a new release that we are announcing 00:01:34.700 |
today. We're actually having a blog post on this tomorrow. BenchmarkQED just went open source 00:01:39.600 |
as well. And then I'll talk about some new results that we've had on some GraphRAG evolutions, on a new 00:01:45.720 |
technology we've been working on called LazyGraphRAG. I won't be going into the specifics of how it 00:01:49.740 |
works. I'll just be showing some benchmarks and talking about some of the ways you'll be able to access it 00:01:54.360 |
soon. So with that, let me go ahead and jump into GraphRAG for Code and tell you about how we've been 00:01:59.660 |
using GraphRAG to actually help drive repository-level understanding. And I want to start first with a little 00:02:05.540 |
demonstration. Everyone likes to watch little videos, so I'm going to go ahead and hit play. And yes, 00:02:09.800 |
it's showing on the screen there. So this is a little terminal-based video game that one of my 00:02:14.100 |
engineers put together. It's one where you jump over obstacles, and then you get points when you jump 00:02:19.220 |
over them, and if you run into them, you lose points. And two important things, though. First, the LLM has never 00:02:23.660 |
seen this code before, and it's small enough for the human to know the ground truth 00:02:27.720 |
holistically, because it's only about 200 lines of code across seven files. But it's complex enough that LLMs have had 00:02:33.660 |
a lot of trouble understanding it if you just provide all the code directly into the context window. 00:02:37.920 |
So with that as a background, let's take a look at what happens if you use typical regular RAG over 00:02:42.660 |
the top of this. So if you're using one of the tools that help analyze your code, this is the type of 00:02:47.640 |
answer you might get back. We ran this through a regular RAG system, and we asked 00:02:52.140 |
it to describe what this application is and how it works. And just to read you what it says, because it's too 00:02:58.500 |
small to read, "The application is designed as a game that is configured and initiated through a main 00:03:02.820 |
function. The game leverages a configuration file and has its main components, such as the game screen, 00:03:08.040 |
game logic, encapsulated in separate classes and functions." Totally useless. It just says, 00:03:13.080 |
it's a game. There's nothing more to it than that. And it just put a bunch of other cruft in there. 00:03:18.300 |
If you use GraphRAG for Code over the exact same code base with the exact same question, 00:03:24.000 |
you get a much, much better description. It's a very stark contrast between the two. 00:03:29.100 |
So with GraphRAG for Code, it says, "The application is a terminal-based interactive game 00:03:34.320 |
designed to run in a curses-based terminal environment." So far, not much better. But this 00:03:39.000 |
next line just kills it: "It features a player character that can jump vertically, obstacles that 00:03:44.280 |
move horizontally across the screen, and a static background layer. The game is controlled via keyboard 00:03:49.080 |
inputs, specifically using the space bar to trigger the player's jump action." So what this is showing you 00:03:54.480 |
is semantic understanding. So if you've read our paper or used any of the GraphRAG code base that 00:04:00.960 |
we've put out there, you'll know about the concept of what we call local and global queries. And so 00:04:05.760 |
this is what we would call a global query over the top of the repository. It requires understanding the whole 00:04:10.380 |
repository to answer that question correctly. And we can see that it excels at that. 00:04:16.620 |
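To make that distinction concrete, here is a minimal sketch of the two query modes as our paper describes them: global search map-reduces over community summaries of the whole graph, while local search retrieves the neighborhood of the entities a question mentions. This is illustrative only; `community_summaries`, `entity_index`, and `llm` are hypothetical stand-ins, not the library's actual API.

```python
# Minimal sketch of GraphRAG's two query modes (illustrative, not the real API).

def global_query(question, community_summaries, llm):
    """Global search: map-reduce over community summaries of the whole graph."""
    # Map: answer the question independently from each community summary.
    partials = [llm(f"Answer from this summary only: {question}\n\n{s}")
                for s in community_summaries]
    # Reduce: fold the partial answers into one holistic response.
    return llm(f"Combine these partial answers to: {question}\n\n"
               + "\n---\n".join(partials))

def local_query(question, entity_index, llm):
    """Local search: retrieve the graph neighborhood of the entities mentioned."""
    entities = entity_index.match(question)  # e.g., embedding similarity
    nodes = "\n".join(e.description for e in entities)
    edges = "\n".join(r.description for e in entities for r in e.relationships)
    return llm(f"Answer from this context: {question}\n\n{nodes}\n{edges}")
```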
So one of the next things we decided to do was, well, if it can answer questions pretty well, can it maybe do code 00:04:21.540 |
translation? And so that was the next thing that we aimed it at. So taking the same code base here, 00:04:27.280 |
we took on the challenge of taking this Python code and asking for it to be translated directly into Rust. 00:04:32.340 |
And let me show you just how that actually works. I'm going to play the video here. So what we're going to do 00:04:36.720 |
is start off in VS Code here. We're going to look at the four source code files. Well, there are four main ones. 00:04:41.520 |
There's seven files total for this game. These are the source Python code files that we're working with here. 00:04:46.920 |
We're going to go ahead and run those just to show you again. This is the same game. So this is again the same 00:04:51.780 |
side scroller game here. And then what we're going to do is we're going to actually take those source code files and just 00:04:56.580 |
put them straight into an LLM. We're first not going to use GraphRAG for Code. We're just going to take all that 00:05:00.780 |
source code, because it's only 200 lines of code, put in some nice prompts (we tried a wide variety of these prompts), and 00:05:05.700 |
then try and see if the LLM could actually translate that code holistically, in all of its pieces, into 00:05:11.280 |
Rust and actually just work out of the box. And so this is actually the translation happening there. 00:05:15.300 |
We're going to go ahead and copy the Rust code that it generated. So it did generate Rust code. I have to 00:05:19.620 |
give it credit for that. But after it generated the code and we reviewed it there, we went and we tried 00:05:25.020 |
to compile it, and there were problems everywhere. It doesn't work out of the box. So then we used GraphRAG for Code, 00:05:31.440 |
again bringing structure, these graph structures, to bear on the code base there. 00:05:37.020 |
We're going to just go ahead and run the translate function there in GraphRAG for Code. And it generates 00:05:40.740 |
on its side all these new Rust files that you see on the side there. I'm just going to skip past 00:05:46.200 |
here because I don't want to bore you with clicking through a bunch of Rust files. But once we generate 00:05:50.160 |
all these Rust files, we can go ahead and run them on the GraphRAG for Code side of things, and 00:05:55.740 |
presto, we have a full video game, completely translated from Python, now working natively in Rust. 00:06:03.600 |
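The talk doesn't reveal how GraphRAG for Code structures the translation, but one plausible reading of "bringing graph structures to the code base" is translating files in dependency order, so each module is converted with its already-translated dependencies in context. A hedged sketch, with `dep_graph`, `sources`, and `llm` as assumed inputs:

```python
# Hypothetical sketch of graph-guided translation: convert modules in dependency
# order so every file is translated with its dependencies' Rust interfaces in
# context. Not GraphRAG for Code's actual internals.
import graphlib  # stdlib topological sorter

def translate_repo(dep_graph, sources, llm):
    """dep_graph: {module: set of modules it imports}; sources: {module: src}."""
    translated = {}
    # Visit leaves first so every import a module needs already exists in Rust.
    for module in graphlib.TopologicalSorter(dep_graph).static_order():
        dep_context = "\n\n".join(translated[d] for d in dep_graph.get(module, ()))
        translated[module] = llm(
            "Translate this Python module to Rust. These already-translated "
            f"dependencies define the interfaces you must call:\n{dep_context}\n\n"
            f"Module `{module}`:\n{sources[module]}"
        )
    return translated
```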
But we didn't want to stop there. The next thing that we wanted to do was, well, that game is kind of a toy 00:06:09.060 |
example. It's like 200 lines of code. Let's go to a code base with 100,000 lines of code. And that's what we did next. 00:06:14.280 |
So we went to the Doom code base. It's about 30 years old. Now, we did run a bunch of tests over 00:06:19.680 |
the top of this here first because I figured, well, all the LLMs are trained on the Doom code base. I mean, 00:06:24.840 |
it's going to just know these things natively and automatically. So we did a bunch of tests with 00:06:29.460 |
that. We actually figured out that, well, it knew of the Doom code base, but it didn't know any of the specifics. So if 00:06:34.440 |
you ask it to actually modify the Doom code base in any sort of meaningful way, the LLMs just fail 00:06:39.360 |
completely. So then what we did next was, well, if we can reason and understand over the top of this 00:06:46.020 |
code base, 100,000 lines of code, 231 files, maybe we can generate some new outputs that you otherwise 00:06:51.820 |
couldn't generate. So with that, we then immediately used it to start generating documentation. 00:06:56.880 |
So this is actually showing high-level, repository-level documentation. Again, going back to the 00:07:02.100 |
concepts of GraphRAG, we have local and we have global queries. This is showing you the global query 00:07:06.760 |
results. And so it's not looking at the understanding of a single file. This is looking at modules in their 00:07:11.820 |
entirety, like across 20 or 30 files, and being able to actually give you a sense for, like, what's the sound 00:07:16.740 |
system inside of the video game and how does that work? But then you can still drill down into what we 00:07:22.020 |
would call the local-style queries and actually see the individual files as well. 00:07:25.860 |
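As a rough sketch of that two-level pattern: module-level (global) docs generated from a community of related files, with per-file (local) docs underneath. The `graph.communities()`, `community.files`, and `llm` names are hypothetical, not GraphRAG for Code's real interface.

```python
# Sketch of two levels of generated docs, mirroring global vs. local queries.

def document_repo(graph, llm):
    docs = {}
    for community in graph.communities():      # a module: 20-30 related files
        srcs = "\n\n".join(f.source for f in community.files)
        docs[community.name] = llm(
            f"Describe this module as a whole and how it works:\n{srcs}")
        for f in community.files:              # local drill-down, per file
            docs[f.path] = llm(f"Document this single file:\n{f.source}")
    return docs
```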
So then we thought, well, this is kind of neat. We can look at the documentation, 00:07:32.180 |
we can do Q&A, and we can do code translation. Can we maybe take this one step 00:07:37.080 |
further and do feature development? So that's the next thing I'm going to show you. So we had a lot of video 00:07:42.840 |
game players in our office, and we were kind of thinking about, like, what would be a cool thing we could 00:07:46.380 |
put over the top of this Doom video game? And somebody said, well, you can't jump in the original video game. 00:07:50.940 |
What if we added the ability for a player to jump? And that's a complex thing to add into Doom, 00:07:55.800 |
because it requires multi-file modification. And if you tried to use AI systems to do multi-file 00:08:02.120 |
modification, what they'll often do is a great job editing one of the files, and then completely 00:08:07.240 |
break a bunch of the other files in the process of doing so. And then that just rinses and repeats, 00:08:11.640 |
and you oftentimes end up with something that doesn't work. 00:08:16.040 |
And it's because of this lack of understanding of how everything fits together. But again, this is where 00:08:21.480 |
GraphRAG can really help out. So with that, we went ahead, and, if you haven't seen the Doom video 00:08:26.440 |
game, this is actually a little video of the original Doom video game being played, just to give you some 00:08:31.480 |
context for what it was. Then we used the new GitHub Copilot coding agent that was announced at Build. 00:08:38.440 |
And we wired it up directly to GraphRAG for Code. So we're going to go ahead and create a new issue. 00:08:43.480 |
This is pointing to the Doom code base. And we're going to tell it to add jump capability to the player. 00:08:48.280 |
We're just going to add a couple sentences of description here and then tell it to go. And 00:08:52.840 |
then we're going to assign it to that GitHub Copilot coding agent. And then the next thing we're going 00:08:59.640 |
to do is go ahead and look at what's happening underneath the hood. So this is 00:09:03.960 |
actually the GitHub Copilot coding agent reaching out to GraphRAG for Code on the back end, coming up 00:09:08.360 |
with a plan that's holistic, approaching it from the top down. And then this is really the moment, 00:09:13.560 |
the aha moment, for us. It changed a whole lot of files and it worked out of the box, where all 00:09:19.160 |
the other agents that we had tried completely failed on this task. And it worked because of those 00:09:23.960 |
GraphRAG structures that we were bringing to bear and actually using. 00:09:29.080 |
So then the end result was that we were elated: we had jumping in Doom, specifically. So that's kind of the story of how 00:09:37.000 |
we got there. 00:09:42.840 |
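The actual wiring between the Copilot coding agent and GraphRAG for Code isn't public, but the pattern described, plan holistically first and only then edit, might look roughly like this sketch; `graph_index` and `agent` are illustrative assumptions.

```python
# Sketch of the pattern described here: a coding agent consults a graph index
# for a repository-wide plan before touching any file. Hypothetical interfaces.

def plan_multi_file_change(graph_index, agent, issue):
    # Global query: which modules participate in the behavior we're changing?
    modules = graph_index.global_query(
        f"Which modules and files are involved in: {issue}?")
    plan = []
    for module in modules:
        # Local query: how does this module connect to its neighbors, so edits
        # here don't silently break callers elsewhere?
        neighbors = graph_index.local_query(f"What depends on {module}?")
        plan.append({"file": module, "constraints": neighbors})
    # The agent edits every file against the full plan, not one file in isolation.
    return agent.apply(issue, plan)
```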
So with that, I do have another part of the talk today, and I have about five minutes to talk about that one. And that's BenchmarkQED. Just shifting topics a 00:09:48.200 |
little bit. I just showed you GraphRAG as applied to one vertical. Next, I want to specifically talk 00:09:53.080 |
about how we measure and evaluate systems like GraphRAG. How do we build systems to evaluate those 00:09:58.680 |
local and global quality metrics? And so for that, today I'm announcing BenchmarkQED. This is available 00:10:04.680 |
now open source on GitHub. So you can go ahead and check it out at the link. We just, I think, 00:10:09.800 |
got it live last night. And there's three components to this. The first is what we're calling 00:10:15.160 |
AutoQ. And AutoQ is focused on doing query generation for target data sets. So it allows you to take a 00:10:22.040 |
data set and then generate queries for it. AutoE is the evaluation, using LLM-as-a-judge to actually then 00:10:28.280 |
evaluate how those queries performed on that data set. AutoD is the third component. And that really focuses on 00:10:35.160 |
data set summarization and data set sampling. Let's jump into just a couple of these in a little more 00:10:39.960 |
detail. So if you take a look at AutoQ, AutoQ is looking at the local and global aspects. You 00:10:46.200 |
can see that there on the x-axis. On the y-axis, it's then combining that against data-driven questions, basically 00:10:52.600 |
questions that are generated based on the data itself, or persona- or activity-driven questions, which is the second 00:10:58.280 |
type that we have there. Those are more complex: you take on the role of a person in that domain, 00:11:02.520 |
and then use that to generate questions. So to show you a few of these sample questions, 00:11:07.400 |
I'm just going to choose a couple here. This was built on an AP news data set that was focused on 00:11:12.120 |
medical types of events. And so a data local question might be, why are junior doctors in South 00:11:17.880 |
Korea striking in February 2024? There's a lot of highly specific information in this, and I would 00:11:23.080 |
expect regular RAG to perform very well on this type of question. In contrast, take an activity global 00:11:29.160 |
question here. It might be, what are the main public health initiatives mentioned that target underserved 00:11:34.920 |
communities? There is nothing in that question that you can really pivot on in terms of like embedding 00:11:39.480 |
or indexing that would really work on that. You really have to holistically know the entire data 00:11:44.280 |
set. And so AutoQ will help generate questions across that whole spectrum, from local to global, 00:11:51.080 |
from data-driven to activity-driven, and give you these categories of questions that you can use. 00:11:55.000 |
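A minimal sketch of that two-by-two query taxonomy (local/global crossed with data-driven/activity-driven) as prompt scaffolding. The real AutoQ lives in the BenchmarkQED repo; these prompts are illustrative assumptions, not its actual templates.

```python
# Sketch of AutoQ's four query classes as described in the talk. Hypothetical prompts.

QUERY_CLASSES = {
    ("data", "local"):      "Ask a question answerable from one specific passage.",
    ("data", "global"):     "Ask a question requiring a holistic view of the corpus.",
    ("activity", "local"):  "As {persona} doing {activity}, ask about one detail.",
    ("activity", "global"): "As {persona} doing {activity}, ask a corpus-wide question.",
}

def generate_queries(llm, corpus_summary, persona, activity, n=5):
    queries = {}
    for (source, scope), instruction in QUERY_CLASSES.items():
        prompt = (instruction.format(persona=persona, activity=activity)
                  + f"\nCorpus summary:\n{corpus_summary}\nGenerate {n} questions.")
        queries[(source, scope)] = llm(prompt)
    return queries
```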
Now, once you generate those questions, you can then start measuring them using AutoE, 00:11:59.640 |
which is the evaluation platform that we released along with this as well. So just to show you 00:12:04.840 |
how this works in practice, it gives you a composite score across four metrics: 00:12:10.280 |
comprehensiveness, diversity, and empowerment, which, if you're familiar with our paper, are the three 00:12:15.480 |
original ones that we used back in that one, and a new one that we call relevance. 00:12:20.040 |
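A sketch of the pairwise LLM-as-a-judge protocol this kind of scoring implies: show two systems' answers to the same question, ask the judge which is better on each metric, and report win rates. AutoE's real prompts and protocol may differ; everything below is an assumption for illustration.

```python
# Sketch of pairwise LLM-as-a-judge win rates in the style the talk describes.
import random

METRICS = ["comprehensiveness", "diversity", "empowerment", "relevance"]

def judge(llm, question, answer_a, answer_b, metric):
    # Present the two answers in random order to reduce the judge's position bias;
    # the A/B labels stay attached to their answers.
    first, second = random.sample([("A", answer_a), ("B", answer_b)], k=2)
    verdict = llm(
        f"Question: {question}\n"
        f"Answer {first[0]}: {first[1]}\nAnswer {second[0]}: {second[1]}\n"
        f"Which answer is better on {metric}? Reply A or B.")
    return verdict.strip()

def win_rate(llm, questions, system_a, system_b, metric):
    wins = sum(judge(llm, q, system_a(q), system_b(q), metric) == "A"
               for q in questions)
    return wins / len(questions)  # above 0.5 means system_a wins on this metric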
And so we actually did some comparisons, just to show you how this works, on LazyGraphRAG, which is one of the newer technologies 00:12:24.600 |
that we've been working on. And we compared it to vector RAG with 8K, 120K, and one-million-token context 00:12:31.240 |
windows. And we had a few takeaways from that. The first is if you take a look at these charts, 00:12:35.560 |
any bar that is above the 50% mark means that LazyGraphRAG is winning in that benchmark against that 00:12:41.960 |
specific vector RAG. So the vector RAG blue here is 8K, 120K, and one-million-token context windows. And it's 00:12:47.880 |
winning correspondingly 92%, 90%, and 91% of the time on data local questions, which is kind 00:12:54.040 |
of a surprise, because one of the first things that we started seeing is that LazyGraphRAG was actually 00:12:58.040 |
providing dominant performance across the entire span of questions, whether they were local or global. 00:13:04.120 |
And we do expect RAG to perform better on local questions. And you can see that from the fact that, 00:13:09.160 |
between data local and data global, there's a little bit of a lift between the bars for the global and the 00:13:14.360 |
local ones. The second thing we noticed is that the long context windows didn't really make much of a 00:13:19.240 |
difference. We were expecting that maybe the long context windows might give you a little bit of a lift, 00:13:22.520 |
because, again, of a better understanding of those global types of questions. But it actually turns out 00:13:27.560 |
that LazyGraphRAG in this case was still able to dominate those metrics. And in fact, when we ran the 00:13:32.600 |
test, and it's not on the slide here, we actually found that LazyGraphRAG was a tenth of the cost of what we saw with 00:13:38.040 |
the one-million-token context windows as well. So those are a couple of things to show there. 00:13:43.000 |
Now, with the last minute that I have, I did want to address another thing, too, with LazyGraphRAG. 00:13:48.120 |
You may have read the blog post about it in November of last year, when we first discussed it. 00:13:52.440 |
It is now officially being lined up for launch in a couple of products. The first is Azure Local. And so 00:13:58.440 |
they just announced at Build that LazyGraphRAG will soon be incorporated into their platform for 00:14:04.520 |
that. So you can actually try it out for yourselves there. And it's also being incorporated into the 00:14:08.600 |
Microsoft Discovery platform tool, which was also recently announced at Build. So if you're not familiar 00:14:13.720 |
with Microsoft Discovery, it does graph-based scientific co-reasoning. And just to give you a 00:14:17.960 |
quick summary of it, it goes from hypothesis to experiment to learning to knowledge. I'll just play a 00:14:23.400 |
quick, like, five-second video here. And the answers to the questions that you start 00:14:27.800 |
getting back on this are powered underneath the hood by GraphRAG and LazyGraphRAG as 00:14:32.600 |
well. So as you can see, the Copilot here is generating the deep reasoning over graph-based 00:14:36.680 |
scientific knowledge. Those graphs are being powered by GraphRAG and LazyGraphRAG. 00:14:42.520 |
So with that, I just want to leave you with a couple of takeaways. And the first is just that 00:14:47.160 |
LLM memory with structure is just a really, really powerful tool to keep in your tool 00:14:53.240 |
belt. And agents, of course, can massively amplify this power. And if you have any questions, 00:14:58.600 |
I'll be outside if you want to talk about any of this further. Thank you so much for your time.