GraphRAG methods to create optimized LLM context windows for Retrieval — Jonathan Larson, Microsoft

00:00:00.000 |
Hello everyone. Thanks for coming to the session here today. My name is Jonathan Larson and I run 00:00:21.120 |
the GraphRAG team at Microsoft Research. Some of you may have seen our paper that we actually 00:00:26.680 |
released last year, the GraphRAG paper, or you might have seen our GitHub repo that was 00:00:32.740 |
out there and has a lot of stars on it. When it released last year, to our surprise, 00:00:37.800 |
it got a lot of attention, and it also inspired many other offerings that we saw out 00:00:42.280 |
there as well. My favorites, of course, were the ones that Neo4j did with the card 00:00:46.300 |
game, because I'm a huge board game fanatic. But there's been lots of derivative 00:00:51.100 |
work from it. And today, though, I'm not here to talk to you about some of the things that 00:00:55.360 |
happened. I'm here to talk to you about some of the new horizons. And the two things I really 00:00:59.980 |
want to leave you with here today are that LLM memory with structure is just an absolutely key 00:01:05.440 |
enabler for building effective AI applications. The second, as we'll talk about a little bit later in 00:01:11.500 |
the talk, is that agents, paired with these structures, can provide something 00:01:16.420 |
even more powerful. And let me go ahead and go back to the slide here. So today, I'm going to tell you 00:01:20.360 |
about a couple things. The first is actually showing you GraphRAG as applied to a specific 00:01:24.980 |
domain. In this case, we're going to apply it to the coding domain to look at enterprise productivity 00:01:29.180 |
in the coding space. The second is we're going to take a look at a new release that we are announcing 00:01:34.700 |
today. We're actually having a blog post on this tomorrow. BenchmarkQED just went open source 00:01:39.600 |
as well. And then I'll talk about some new results that we've had on some GraphRAG evolutions, on a new 00:01:45.720 |
technology we've been working on called LazyGraphRAG. I won't be going into the specifics of how it 00:01:49.740 |
works. I'll just be showing some benchmarks and talking about some of the ways you'll be able to access it 00:01:54.360 |
soon. So with that, let me go ahead and jump into GraphRAG for Code and tell you about how we've been 00:01:59.660 |
using GraphRAG to actually help drive repository-level understanding. And I want to start first with a little 00:02:05.540 |
demonstration. Everyone likes to watch little videos, so I'm going to go ahead and hit play. And yes, 00:02:09.800 |
it's showing on the screen there. So this is a little terminal-based video game that one of my 00:02:14.100 |
engineers put together. It's one where you jump over obstacles, and then you get points when you jump 00:02:19.220 |
over them, and if you run into them, you lose points. And two important things, though. First, the LLM has never 00:02:23.660 |
seen this code before, and it's small enough for the human to know the ground truth 00:02:27.720 |
holistically, because it's only about 200 lines of code across seven files. But it's complex enough that LLMs have had 00:02:33.660 |
a lot of trouble understanding it if you just provide all the code directly into the context window. 00:02:37.920 |
So with that as a background, let's take a look at what happens if you use typical regular RAG over 00:02:42.660 |
the top of this. So if you're using one of the tools that help analyze your code, this is the type of 00:02:47.640 |
answer you might get back. We ran this through a regular RAG system, and we asked 00:02:52.140 |
it to describe what this application is and how it works. And just to read you what it says, because it's too 00:02:58.500 |
small to read, "The application is designed as a game that is configured and initiated through a main 00:03:02.820 |
function. The game leverages a configuration file and has its main components, such as the game screen, 00:03:08.040 |
game logic, encapsulated in separate classes and functions." Totally useless. It just says, 00:03:13.080 |
it's a game. There's nothing more to it than that. And it just put a bunch of other cruft in there. 00:03:18.300 |
If you use GraphRAG for Code over the exact same code base with the exact same question, 00:03:24.000 |
you get a much, much better description. It's a very stark contrast between the two. 00:03:29.100 |
So with GraphRAG for Code, it says, "The application is a terminal-based interactive game 00:03:34.320 |
designed to run in a curses-based terminal environment." So far, not much better. But this 00:03:39.000 |
next line just kills it: "It features a player character that can jump vertically, obstacles that 00:03:44.280 |
move horizontally across the screen, and a static background layer. The game is controlled via keyboard 00:03:49.080 |
inputs, specifically using the space bar to trigger the player's jump action." So what this is showing you 00:03:54.480 |
is semantic understanding. So if you've read our paper or used any of the GraphRAG code base that 00:04:00.960 |
we've put out there, you'll know about the concept of what we call local and global queries. And so 00:04:05.760 |
this is what we would call a global query over the top of the repository. It requires understanding the whole 00:04:10.380 |
repository to answer that question correctly. And we can see that it excels at that. 00:04:16.620 |
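To make that distinction concrete, here is a minimal sketch of the two query modes as our paper describes them: global search map-reduces over community summaries of the whole graph, while local search retrieves the neighborhood of the entities a question mentions. This is illustrative only; `community_summaries`, `entity_index`, and `llm` are hypothetical stand-ins, not the library's actual API.

```python
# Minimal sketch of GraphRAG's two query modes (illustrative, not the real API).

def global_query(question, community_summaries, llm):
    """Global search: map-reduce over community summaries of the whole graph."""
    # Map: answer the question independently from each community summary.
    partials = [llm(f"Answer from this summary only: {question}\n\n{s}")
                for s in community_summaries]
    # Reduce: fold the partial answers into one holistic response.
    return llm(f"Combine these partial answers to: {question}\n\n"
               + "\n---\n".join(partials))

def local_query(question, entity_index, llm):
    """Local search: retrieve the graph neighborhood of the entities mentioned."""
    entities = entity_index.match(question)  # e.g., embedding similarity
    nodes = "\n".join(e.description for e in entities)
    edges = "\n".join(r.description for e in entities for r in e.relationships)
    return llm(f"Answer from this context: {question}\n\n{nodes}\n{edges}")
```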
So one of the next things we decided to do was, well, if it can answer questions pretty well, can it maybe do code 00:04:21.540 |
translation? And so that was the next thing that we aimed it at. So taking the same code base here, 00:04:27.280 |
we took on the challenge of taking this Python code and asking for it to be translated directly into Rust. 00:04:32.340 |
And let me show you just how that actually works. I'm going to play the video here. So what we're going to do 00:04:36.720 |
is start off in VS Code here. We're going to look at the four source code files. Well, there are four main ones. 00:04:41.520 |
There's seven files total for this game. These are the source Python code files that we're working with here. 00:04:46.920 |
We're going to go ahead and run those just to show you again. This is the same game. So this is again the same 00:04:51.780 |
side scroller game here. And then what we're going to do is we're going to actually take those source code files and just 00:04:56.580 |
put them straight into an LLM. We're first not going to use GraphRAG for Code. We're just going to take all that 00:05:00.780 |
source code, because it's only 200 lines of code, put in some nice prompts (we tried a wide variety of these prompts), and 00:05:05.700 |
then try and see if the LLM could actually translate that code holistically, in all of its pieces, into 00:05:11.280 |
Rust and actually just work out of the box. And so this is actually the translation happening there. 00:05:15.300 |
We're going to go ahead and copy the Rust code that it generated. So it did generate Rust code. I have to 00:05:19.620 |
give it credit for that. But after it generated the code and we reviewed it there, we went and we tried 00:05:25.020 |
to compile it, and there were problems everywhere. It doesn't work out of the box. So then we used GraphRAG for Code, 00:05:31.440 |
again bringing structure, these graph structures, to bear on the code base there. 00:05:37.020 |
We're going to just go ahead and run the translate function there in GraphRAG for Code. And it generates 00:05:40.740 |
on its side all these new Rust files that you see on the side there. I'm just going to skip past 00:05:46.200 |
here because I don't want to bore you with clicking through a bunch of Rust files. But once we generate 00:05:50.160 |
all these Rust files, we can go ahead and run them on the GraphRAG for Code side of things, and 00:05:55.740 |
presto, we have a full video game, completely translated from Python, now working natively in Rust. 00:06:03.600 |
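The talk doesn't reveal how GraphRAG for Code structures the translation, but one plausible reading of "bringing graph structures to the code base" is translating files in dependency order, so each module is converted with its already-translated dependencies in context. A hedged sketch, with `dep_graph`, `sources`, and `llm` as assumed inputs:

```python
# Hypothetical sketch of graph-guided translation: convert modules in dependency
# order so every file is translated with its dependencies' Rust interfaces in
# context. Not GraphRAG for Code's actual internals.
import graphlib  # stdlib topological sorter

def translate_repo(dep_graph, sources, llm):
    """dep_graph: {module: set of modules it imports}; sources: {module: src}."""
    translated = {}
    # Visit leaves first so every import a module needs already exists in Rust.
    for module in graphlib.TopologicalSorter(dep_graph).static_order():
        dep_context = "\n\n".join(translated[d] for d in dep_graph.get(module, ()))
        translated[module] = llm(
            "Translate this Python module to Rust. These already-translated "
            f"dependencies define the interfaces you must call:\n{dep_context}\n\n"
            f"Module `{module}`:\n{sources[module]}"
        )
    return translated
```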
But we didn't want to stop there. The next thing that we wanted to do was, well, that game is kind of a toy 00:06:09.060 |
example. It's like 200 lines of code. Let's go to a code base with 100,000 lines of code. And that's what we did next. 00:06:14.280 |
So we went to the Doom code base. It's about 30 years old. Now, we did run a bunch of tests over 00:06:19.680 |
the top of this here first because I figured, well, all the LLMs are trained on the Doom code base. I mean, 00:06:24.840 |
it's going to just know these things natively and automatically. So we did a bunch of tests with 00:06:29.460 |
that. We actually figured out that, well, it knew of the Doom code base, but it didn't know any of the specifics. So if 00:06:34.440 |
you ask it to actually modify the Doom code base in any sort of meaningful way, the LLMs just fail 00:06:39.360 |
completely. So then what we did next was, well, if we can reason and understand over the top of this 00:06:46.020 |
code base, 100,000 lines of code, 231 files, maybe we can generate some new outputs that you otherwise 00:06:51.820 |
couldn't generate. So with that, we then immediately used it to start generating documentation. 00:06:56.880 |
So this is actually showing high-level, repository-level documentation. Again, going back to the 00:07:02.100 |
concepts of GraphRAG, we have local and we have global queries. This is showing you the global query 00:07:06.760 |
results. And so it's not looking at the understanding of a single file. This is looking at modules in their 00:07:11.820 |
entirety, like across 20 or 30 files, and being able to actually give you a sense for, like, what's the sound 00:07:16.740 |
system inside of the video game and how does that work? But then you can still drill down into what we 00:07:22.020 |
would call the local-style queries and actually see the individual files as well. 00:07:25.860 |
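As a rough sketch of that two-level pattern: module-level (global) docs generated from a community of related files, with per-file (local) docs underneath. The `graph.communities()`, `community.files`, and `llm` names are hypothetical, not GraphRAG for Code's real interface.

```python
# Sketch of two levels of generated docs, mirroring global vs. local queries.

def document_repo(graph, llm):
    docs = {}
    for community in graph.communities():      # a module: 20-30 related files
        srcs = "\n\n".join(f.source for f in community.files)
        docs[community.name] = llm(
            f"Describe this module as a whole and how it works:\n{srcs}")
        for f in community.files:              # local drill-down, per file
            docs[f.path] = llm(f"Document this single file:\n{f.source}")
    return docs
```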
So then we thought, well, this is kind of neat. We can look at the documentation, 00:07:32.180 |
we can do Q&A, and we can do code translation. Can we maybe take this one step 00:07:37.080 |
further and do feature development? So that's the next thing I'm going to show you. So we had a lot of video 00:07:42.840 |
game players in our office, and we were kind of thinking about, like, what would be a cool thing we could 00:07:46.380 |
put over the top of this Doom video game? And somebody said, well, you can't jump in the original video game. 00:07:50.940 |
What if we added the ability for a player to jump? And that's a complex thing to add into Doom, 00:07:55.800 |
because it requires multi-file modification. And if you tried to use AI systems to do multi-file 00:08:02.120 |
modification, what they'll often do is a great job editing one of the files, and then completely 00:08:07.240 |
break a bunch of the other files in the process of doing so. And then that just rinses and repeats, 00:08:11.640 |
and you oftentimes end up with something that doesn't work. 00:08:16.040 |
And it's because of this lack of understanding of how everything fits together. But again, this is where 00:08:21.480 |
GraphRAG can really help out. So with that, we went ahead, and, if you haven't seen the Doom video 00:08:26.440 |
game, this is actually a little video of the original Doom video game being played, just to give you some 00:08:31.480 |
context for what it was. Then we used the new GitHub Copilot coding agent that was announced at Build. 00:08:38.440 |
And we wired it up directly to GraphRAG for Code. So we're going to go ahead and create a new issue. 00:08:43.480 |
This is pointing to the Doom code base. And we're going to tell it to add jump capability to the player. 00:08:48.280 |
We're just going to add a couple sentences of description here and then tell it to go. And 00:08:52.840 |
then we're going to assign it to that GitHub Copilot coding agent. And then the next thing we're going 00:08:59.640 |
to do is go ahead and look at what's happening underneath the hood. So this is 00:09:03.960 |
actually the GitHub Copilot coding agent reaching out to GraphRAG for Code on the back end, coming up 00:09:08.360 |
with a plan that's holistic, approaching it from the top down. And then this is really the moment, 00:09:13.560 |
the aha moment, for us. It changed a whole lot of files and it worked out of the box, where all 00:09:19.160 |
the other agents that we had tried completely failed on this task. And it worked because of those 00:09:23.960 |
GraphRAG structures that we were bringing to bear and actually using. 00:09:29.080 |
So then the end result was that we were elated: we had jumping in Doom, specifically. So that's kind of the story of how 00:09:37.000 |
we got there. 00:09:42.840 |
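The actual wiring between the Copilot coding agent and GraphRAG for Code isn't public, but the pattern described, plan holistically first and only then edit, might look roughly like this sketch; `graph_index` and `agent` are illustrative assumptions.

```python
# Sketch of the pattern described here: a coding agent consults a graph index
# for a repository-wide plan before touching any file. Hypothetical interfaces.

def plan_multi_file_change(graph_index, agent, issue):
    # Global query: which modules participate in the behavior we're changing?
    modules = graph_index.global_query(
        f"Which modules and files are involved in: {issue}?")
    plan = []
    for module in modules:
        # Local query: how does this module connect to its neighbors, so edits
        # here don't silently break callers elsewhere?
        neighbors = graph_index.local_query(f"What depends on {module}?")
        plan.append({"file": module, "constraints": neighbors})
    # The agent edits every file against the full plan, not one file in isolation.
    return agent.apply(issue, plan)
```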
So with that, I do have another part of the talk today, and I have about five minutes to talk about that one. And that's BenchmarkQED. Just shifting topics a 00:09:48.200 |
little bit. I just showed you GraphRAG as applied to one vertical. Next, I want to specifically talk 00:09:53.080 |
about how we measure and evaluate systems like GraphRAG. How do we build systems to evaluate those 00:09:58.680 |
local and global quality metrics? And so for that, today I'm announcing BenchmarkQED. This is available 00:10:04.680 |
now open source on GitHub. So you can go ahead and check it out at the link. We just, I think, 00:10:09.800 |
got it live last night. And there's three components to this. The first is what we're calling 00:10:15.160 |
AutoQ. And AutoQ is focused on doing query generation for target data sets. So it allows you to take a 00:10:22.040 |
data set and then generate queries for it. AutoE is the evaluation, using LLM-as-a-judge to actually then 00:10:28.280 |
evaluate how those queries performed on that data set. AutoD is the third component. And that really focuses on 00:10:35.160 |
data set summarization and data set sampling. Let's jump into just a couple of these in a little more 00:10:39.960 |
detail. So if you take a look at AutoQ, AutoQ is looking at the local and global aspects. You 00:10:46.200 |
can see that there on the x-axis. On the y-axis, it's then combining that against data-driven questions, basically 00:10:52.600 |
questions that are generated based on the data itself, or persona- or activity-driven questions, which is the second 00:10:58.280 |
type that we have there. Those are more complex: you take on the role of a person in that domain, 00:11:02.520 |
and then use that to generate questions. So to show you a few of these sample questions, 00:11:07.400 |
I'm just going to choose a couple here. This was built on an AP news data set that was focused on 00:11:12.120 |
medical types of events. And so a data local question might be, why are junior doctors in South 00:11:17.880 |
Korea striking in February 2024? There's a lot of highly specific information in this, and I would 00:11:23.080 |
expect regular RAG to perform very well on this type of question. In contrast, take an activity global 00:11:29.160 |
question here. It might be, what are the main public health initiatives mentioned that target underserved 00:11:34.920 |
communities? There is nothing in that question that you can really pivot on in terms of like embedding 00:11:39.480 |
or indexing that would really work on that. You really have to holistically know the entire data 00:11:44.280 |
set. And so AutoQ will help generate questions across that whole spectrum, from local to global, 00:11:51.080 |
from data-driven to activity-driven, and give you these categories of questions that you can use. 00:11:55.000 |
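A minimal sketch of that two-by-two query taxonomy (local/global crossed with data-driven/activity-driven) as prompt scaffolding. The real AutoQ lives in the BenchmarkQED repo; these prompts are illustrative assumptions, not its actual templates.

```python
# Sketch of AutoQ's four query classes as described in the talk. Hypothetical prompts.

QUERY_CLASSES = {
    ("data", "local"):      "Ask a question answerable from one specific passage.",
    ("data", "global"):     "Ask a question requiring a holistic view of the corpus.",
    ("activity", "local"):  "As {persona} doing {activity}, ask about one detail.",
    ("activity", "global"): "As {persona} doing {activity}, ask a corpus-wide question.",
}

def generate_queries(llm, corpus_summary, persona, activity, n=5):
    queries = {}
    for (source, scope), instruction in QUERY_CLASSES.items():
        prompt = (instruction.format(persona=persona, activity=activity)
                  + f"\nCorpus summary:\n{corpus_summary}\nGenerate {n} questions.")
        queries[(source, scope)] = llm(prompt)
    return queries
```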
Now, once you generate those questions, you can then start measuring them using AutoE, 00:11:59.640 |
which is the evaluation platform that we released along with this as well. So just to show you 00:12:04.840 |
how this works in practice, it gives you a composite score across four metrics: 00:12:10.280 |
comprehensiveness, diversity, and empowerment, which, if you're familiar with our paper, are the three 00:12:15.480 |
original ones that we used back in that one, and a new one that we call relevance. 00:12:20.040 |
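A sketch of the pairwise LLM-as-a-judge protocol this kind of scoring implies: show two systems' answers to the same question, ask the judge which is better on each metric, and report win rates. AutoE's real prompts and protocol may differ; everything below is an assumption for illustration.

```python
# Sketch of pairwise LLM-as-a-judge win rates in the style the talk describes.
import random

METRICS = ["comprehensiveness", "diversity", "empowerment", "relevance"]

def judge(llm, question, answer_a, answer_b, metric):
    # Present the two answers in random order to reduce the judge's position bias;
    # the A/B labels stay attached to their answers.
    first, second = random.sample([("A", answer_a), ("B", answer_b)], k=2)
    verdict = llm(
        f"Question: {question}\n"
        f"Answer {first[0]}: {first[1]}\nAnswer {second[0]}: {second[1]}\n"
        f"Which answer is better on {metric}? Reply A or B.")
    return verdict.strip()

def win_rate(llm, questions, system_a, system_b, metric):
    wins = sum(judge(llm, q, system_a(q), system_b(q), metric) == "A"
               for q in questions)
    return wins / len(questions)  # above 0.5 means system_a wins on this metric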
And so we actually did some comparisons, just to show you how this works, on LazyGraphRAG, which is one of the newer technologies 00:12:24.600 |
that we've been working on. And we compared it to vector RAG with 8K, 120K, and one-million-token context 00:12:31.240 |
windows. And we had a few takeaways from that. The first is if you take a look at these charts, 00:12:35.560 |
any bar that is above the 50% mark means that LazyGraphRAG is winning in that benchmark against that 00:12:41.960 |
specific vector RAG. So the vector RAG blue here is 8K, 120K, and one-million-token context windows. And it's 00:12:47.880 |
winning correspondingly 92%, 90%, and 91% of the time on data local questions, which is kind 00:12:54.040 |
of a surprise, because one of the first things that we started seeing is that LazyGraphRAG was actually 00:12:58.040 |
providing dominant performance across the entire span of questions, whether they were local or global. 00:13:04.120 |
And we do expect RAG to perform better on local questions. And you can see that from the fact that, 00:13:09.160 |
between data local and data global, there's a little bit of a lift between the bars for the global and the 00:13:14.360 |
local ones. The second thing we noticed is that the long context windows didn't really make much of a 00:13:19.240 |
difference. We were expecting that maybe the long context windows might give you a little bit of a lift, 00:13:22.520 |
because, again, of a better understanding of those global types of questions. But it actually turns out 00:13:27.560 |
that LazyGraphRAG in this case was still able to dominate those metrics. And in fact, when we ran the 00:13:32.600 |
test, and it's not on the slide here, we actually found that LazyGraphRAG was a tenth of the cost of what we saw with 00:13:38.040 |
the one-million-token context windows as well. So those are a couple of things to show there. 00:13:43.000 |
Now, with the last minute that I have, I did want to address another thing, too, with LazyGraphRAG. 00:13:48.120 |
You may have read the blog post about it in November of last year, when we first discussed it. 00:13:52.440 |
It is now officially being lined up for launch in a couple of products. The first is Azure Local. And so 00:13:58.440 |
they just announced at Build that LazyGraphRAG will soon be incorporated into their platform for 00:14:04.520 |
that. So you can actually try it out for yourselves there. And it's also being incorporated into the 00:14:08.600 |
Microsoft Discovery platform tool, which was also recently announced at Build. So if you're not familiar 00:14:13.720 |
with Microsoft Discovery, it does graph-based scientific co-reasoning. And just to give you a 00:14:17.960 |
quick summary of it, it goes from hypothesis to experiment to learning to knowledge. I'll just play a 00:14:23.400 |
quick, like, five-second video here. And the answers to the questions that you start 00:14:27.800 |
getting back on this are powered underneath the hood by GraphRAG and LazyGraphRAG as 00:14:32.600 |
well. So as you can see, the Copilot here is generating the deep reasoning over graph-based 00:14:36.680 |
scientific knowledge. Those graphs are being powered by GraphRAG and LazyGraphRAG. 00:14:42.520 |
So with that, I just want to leave you with a couple of takeaways. And the first is just that 00:14:47.160 |
LLM memory with structure is just a really, really powerful tool to keep in your tool 00:14:53.240 |
belt. And agents, of course, can massively amplify this power. And if you have any questions, 00:14:58.600 |
I'll be outside if you want to talk about any of this further. Thank you so much for your time.