back to indexAgentic GraphRAG: Simplifying Retrieval Across Structured & Unstructured Data — Zach Blumenfeld

00:00:00.000 |
I'm going to go over today GraphRag, particularly dealing with multiple data sources, 00:00:19.280 |
so both unstructured and structured data sources, and kind of why you would want to ever do that 00:00:24.100 |
in the first place even. So I prepared some notebooks here. I was going to make slides, 00:00:29.060 |
but then I thought it would just be easier to walk through some of what this looks like in practice. 00:00:33.600 |
So there's a link here, and I can share it with you at the booth later if you have follow-up questions. 00:00:39.060 |
But basically, what I wanted to show you first is just what a general GraphRag architecture looks like. 00:00:45.540 |
And so basically, what we do is we have our agents, we have our tools, just like you normally would. 00:00:52.860 |
But then in the middle here, you see off to the side, we have this knowledge graph. 00:00:56.900 |
And this knowledge graph, you can both extract data from documents in unstructured places, 00:01:01.440 |
and also have standard ETLs for structured data. 00:01:04.320 |
And so there's a big question of why in the hell you would want this knowledge graph thing sticking in the middle there. 00:01:10.380 |
And I could talk about accuracy and explainability, but I think what's really valuable to talk about 00:01:15.440 |
about is kind of what this means going forward with agents and how it's valuable for agentic workflows. 00:01:19.980 |
And as we think about some of what agents can do with reasoning and decomposing questions, 00:01:25.660 |
a lot of the retrieval that we're seeing is not so much just a straight shot vector search anymore. 00:01:31.440 |
A lot of what we're seeing is a question being broken down and being handed multiple queries, right, 00:01:39.240 |
And the great thing about having a knowledge graph is that you can express a very simple data model 00:01:43.780 |
to get started to your agent, which can help it do that decomposition, pull information accurately, 00:01:49.220 |
and then as you sort of expand, you can keep adding more and more data. 00:01:53.280 |
The example that I'm going to show you here inside of these notebooks is going to be for a employee graph. 00:02:00.620 |
So basically think about a knowledge assistant that's responsible for helping pull information 00:02:04.900 |
around skills analysis, look for similarities or substitutions in a team, 00:02:09.920 |
and try to figure out who's collaborating, where skill gaps are, all that sort of stuff. 00:02:13.800 |
And the data that we're going to start with today is just going to be in resumes. 00:02:17.040 |
I just have these PDFs, just like a folder of resumes, that are pretty standard. 00:02:22.080 |
It just lists people's professional experience and descriptions. 00:02:24.920 |
It's for a company called, what did we have here? 00:02:30.360 |
Cyberdyne systems, if anyone's familiar with them. 00:02:37.280 |
So anyway, the first thing that I'm going to do to just show you what this looks like 00:02:42.320 |
is I'm going to load documents into the Neo4j graph database, 00:02:45.840 |
but just sort of as like basically documents. 00:02:49.660 |
Basically take every resume, put an embedding on it. 00:02:53.480 |
What it looks like here, if I was to scroll here, is kind of like this. 00:02:58.260 |
So basically, I have these different nodes, but you'll see the nodes are basically just 00:03:04.160 |
going to have some metadata, some text, which is the resume, and then an embedding. 00:03:07.880 |
And basically what I'm going to do is I'm just going to create an agent inside of ADK, 00:03:12.600 |
so Google's framework, and I'm going to start asking it some of those questions. 00:03:17.320 |
And so basically what you're going to see here, and I don't have time, unfortunately, 00:03:21.180 |
to walk through all of the code, but basically if you look at this agent that's been constructed, 00:03:26.720 |
right, I have my agent, I have some instructions to pull data, and then I give it one tool, 00:03:31.900 |
which is a tool to go search documents, right? 00:03:34.680 |
And so I'm going to ask it a question, how many Python developers do I have? 00:03:38.260 |
You can imagine, right, this is probably not going to work out very well if all I have is 00:03:43.140 |
just documents, because basically it's going to tell me I have five Python developers. 00:03:48.000 |
And that's because I set K equal to five, right, when I went to go pull my documents. 00:03:52.540 |
So obviously that's going to be wrong, and I'm telling you that that answer is wrong. 00:03:56.380 |
So you could probably solve that with doing some entity extraction and putting more metadata 00:04:02.400 |
So then I'm going to ask, who is most similar to a particular person in terms of just their 00:04:08.800 |
And again, here, what you'll see, and I've told it in the bottom to kind of explain what 00:04:14.440 |
it's doing, if I go down to the display here, it'll tell me basically here that what it's 00:04:22.300 |
doing is it's just going to be using search terms to go pull information. 00:04:25.980 |
So this might help you find similarity to a certain extent, like it knows, you know, 00:04:31.020 |
Lucas is a full-stack AI engineer, and he does, you know, Python, JavaScript, and some machine 00:04:37.200 |
So you can search for that, and you can find some similar people, but the logic is still 00:04:42.860 |
It's just sort of, you know, plain semantic similarity search. 00:04:46.140 |
And as I start to go down, I can ask questions like, summarize my technical talent and skills 00:04:52.540 |
It's not going to be able to answer that, right, because it needs to be able to do an 00:04:57.820 |
So if I was to go up and look at the logic again, or I'll go down here, it could say I 00:05:02.720 |
search, you know, employees' resumes using certain search terms and stuff, right? 00:05:06.060 |
And so it's basically, if I go down and ask these questions, it's just going to be using 00:05:11.600 |
And so that's not really good enough for our use case, because we want to do analytics, 00:05:15.760 |
we want to do aggregations, and we want to try to find relationships between people. 00:05:20.480 |
And again, like in the last question, I basically asked it to find who's collaborating on lots 00:05:25.100 |
of projects, and all it can do is search for collaborators, which is not really what I want. 00:05:29.360 |
Like, what I want is to find who's been collaborating with who on different projects, right? 00:05:34.500 |
And the resume data, you know, might have that, but it's all, you know, sunk inside of the text 00:05:40.160 |
So the question is now, well, how do I think about basically explaining my data to my agent 00:05:45.460 |
and then also making sure, right, that I have a data model that makes sense? 00:05:51.100 |
So if you think about it, you can do this at the data layer. 00:05:53.940 |
You can think about how do I want to model my data just to start for some of these beginning 00:05:59.760 |
And here it's basically like I want to know what a person is, so I need that. 00:06:05.240 |
I want some concept of a person knowing skills. 00:06:09.280 |
And then the only other thing that I really care about is what do people do, like what things 00:06:16.480 |
That's just the data model that I want to express. 00:06:19.360 |
So I'm going to do entity extraction of these documents to basically create this graph. 00:06:23.840 |
And really, it's going to be just slightly more complicated than that, because basically 00:06:28.860 |
what I need to do here is I actually need to create a graph where, right, I have, instead 00:06:35.460 |
of just doing things, I have publish, build, one, lead, manage, optimize, shift things. 00:06:42.320 |
And then those things are going to belong to different domains and work types, which is going 00:06:47.100 |
to allow me to kind of connect similar things together inside of the graph. 00:06:50.400 |
So it's a little bit more complicated, but it's the same exact concept. 00:06:55.860 |
And basically, the entity extraction workflows we use are pretty self-explanatory. 00:07:02.620 |
They use, if I was to go here, I use Pydantic classes to do that. 00:07:07.840 |
So I have concept of enumerations on the types of accomplishments and domains that I want and 00:07:13.880 |
I define my things, I define how someone does a thing through an accomplishment, and then 00:07:20.340 |
basically, I put that through another workflow. 00:07:23.940 |
I think I use Langchain in this case to basically decompose those documents and spit out a bunch 00:07:30.320 |
And inside of that JSON, for example, for this person, I have their skills, right, that I get 00:07:35.660 |
in here, and I also have their accomplishments that I get inside of here. 00:07:40.100 |
So I can go ahead and load that as well into my graph. 00:07:44.840 |
And now I have a much more expressive data model. 00:07:48.120 |
So if I go back to here and scroll up and look at this guy, this is now kind of what my data 00:07:58.240 |
I have my people, but now you see how they're connected by all of these different skills 00:08:02.980 |
that they have, as well as the things that they're actually working on, and how those things 00:08:08.020 |
connect to higher level concepts, like whether it's building something for a system, or shipping 00:08:16.520 |
And now, because I have that expressive data model, I'm able to start having a lot more precision 00:08:22.500 |
around the way that I get my questions basically answered. 00:08:27.100 |
So what I do here after all this graph construction, which I already talked through, is I create 00:08:33.640 |
this other agent, give it a similar set of instructions, I tell it a little bit about my data 00:08:38.720 |
model in here, and then I'm actually going to be using this MCP server that allows it to read 00:08:44.460 |
the schema, and also generate cipher statements. 00:08:47.660 |
So this is an MCP tool that we just have, you know, out on GitHub that you can pull. 00:08:53.620 |
Now, when I ask how many developers I have, now I can actually do a query that's going 00:09:00.160 |
And because of that, I get an answer that's much closer to correct, which is 28 developers. 00:09:08.240 |
But now I can ask a similarity question, right, like I did before, who is most similar to Lucas 00:09:17.240 |
And when I went to go calculate that, I got an answer that it's Sarah. 00:09:22.780 |
And it will explain the reasoning that it did. 00:09:24.780 |
It searched for people who knew the same skill sets. 00:09:28.780 |
Sometimes when I ask this, it will also search for people who have similar accomplishments as well. 00:09:33.780 |
And it will explain exactly like, hey, like I did an overlap calculation in the graph to figure out basically, you know, 00:09:42.320 |
given the number of skill sets they had and the number of overlap, this is the person that I think is the closest. 00:09:47.320 |
So the benefit of doing this is that you have much more control and you can go in and filter exactly what skills people have. 00:09:53.700 |
So for example, if I knew someone actually didn't have a certain skill or did have a certain skill, I can audit that. 00:09:59.400 |
I can adjust the graph to be able to make that work. 00:10:02.240 |
And then similarly, as we go down, to summarize a technical talent distribution, again, I can match on those skills 00:10:10.660 |
and I can start to answer these questions and actually get numbers between how many people know different skills. 00:10:15.560 |
And it can also break down different accomplishments and other things of that nature. 00:10:20.260 |
So you just get much more refinement in the types of answers that you get for your questions. 00:10:28.060 |
So instead of just generating Cypher kind of on the fly with a language model, I can also, 00:10:35.300 |
and I'll show you what some of these look like in my bigger screen here, I can go ahead and move this up. 00:10:47.860 |
So this is an example of finding people with similar skills here that I'm about to show you. 00:10:52.760 |
And basically, right, I can do these very flexible queries in a graph database where I can say, hey, go from person ID to this other person ID. 00:11:02.900 |
And as I do that, basically what I'm saying in that top syntax there is go out some, you know, zero to three hops, basically. 00:11:10.860 |
And so what that allows me to do is I can traverse over both the skills, over the common systems, over the common domains they work at, and all of their accomplishments. 00:11:20.360 |
And if I wanted to add something else to that data model, maybe there's a collaboration link or another project link, that would all get picked up inside of that query. 00:11:29.220 |
So there's a lot of flexibility and also higher performance in a graph database when you start wanting to do those types of complex traversals. 00:11:36.820 |
And that allows me then, when I go to find similar people again, so if I was just to scroll down to my question where I define my agent again, I give it more tools, 00:11:49.680 |
and then I can actually look at who is most similar to Lucas Martinez and why, and it will start doing these queries. 00:11:56.180 |
So what you see it get back is, like, all of the results of what I was just showing you earlier. 00:12:00.500 |
And what that will help with, if I scroll down to the response, is now what will help me get all of these specific numbers around the skills and then also the domains now 00:12:10.960 |
between the different AI and analytics and data engineering and things that they were working on. 00:12:15.380 |
So there's more explainability with the way that these questions are being answered. 00:12:19.020 |
The last part that I wanted to show you, and I only have a few minutes left, so I'm going to go very quick, is what happens now when you want to add more data to your graph, right? 00:12:28.200 |
So basically, say that we had this resume data, but now we have this internal data that comes from a human resource intelligence system. 00:12:36.320 |
And this tells me different projects that people are working on together and collaborating on. 00:12:41.080 |
So I have basically the same people that came from resumes earlier, but now I can see different projects that they were working on together. 00:12:48.340 |
And the great thing about graph is that it allows you to add these things very flexibly. 00:12:53.180 |
So if I go back to my notebook here, when you expand a data model with something like, if you're in sort of RDBMS or tables, 00:13:04.720 |
one of the assumptions that I made when I was ingesting the resumes initially was it was one person, an accomplishment only had one person. 00:13:11.820 |
It was sort of this one-to-one relationship, or really it was one-to-many, but an accomplishment only had one person because it was just listed on their resume. 00:13:19.820 |
But now that I am doing this thing with this internal system, I can actually see people who are co-collaborating. 00:13:28.480 |
And what that would mean in a tabular environment is I would have to create another join table, right? 00:13:32.780 |
So I'd have to do some sort of data model refactor. 00:13:34.960 |
But the great thing about graph is that I don't have to do that at all. 00:13:41.700 |
So this is very useful when you're going from one-to-many, to many-to-many, or when you're introducing completely new node and relationship types. 00:13:49.020 |
It's very easy to do that, which is super important as we move very, very fast, right, with our agents. 00:13:54.560 |
And we want to ingest new data quickly and kind of build out our systems and pivot and all that sort of stuff. 00:14:00.380 |
And once I have that information, right, I can start asking questions about who's collaborating with each other. 00:14:06.880 |
So what I'm going to do is I'm going to create, this is sort of what the tool creation looks like. 00:14:12.380 |
This is a lot of the same tools that I had before. 00:14:14.440 |
But the tool that I can create here to find collaborators will basically do this match. 00:14:23.360 |
It will do this match to find people who are working on the same thing within a certain set of domains. 00:14:29.840 |
So that's kind of what the graph looks like that I get returned. 00:14:32.480 |
And when I now want to, since I've added that tool, basically what it means is that when I ask a question, 00:14:38.660 |
like which individuals have collaborated with each other to, and this says, deliver the most AI things, right, 00:14:48.140 |
And then it can now return an answer that's much more, you know, exact and based on my data. 00:14:54.480 |
So now I know that Sarah and Amanda have collaborated on very specific projects, 00:15:02.180 |
and as well as I have other collaborators here with supply chain and such. 00:15:08.560 |
If you have any more questions, I would be happy to meet you at the booth, and we can talk more.