Agentic GraphRAG: Simplifying Retrieval Across Structured & Unstructured Data

Today I'm going to go over GraphRAG, particularly dealing with multiple data sources, so both unstructured and structured data sources, and why you would ever want to do that in the first place. So I prepared some notebooks here. I was going to make slides, but then I thought it would just be easier to walk through some of what this looks like in practice.

So there's a link here, and I can share it with you at the booth later if you have follow-up questions. But basically, what I wanted to show you first is just what a general GraphRAG architecture looks like. And so basically, what we do is we have our agents, we have our tools, just like you normally would.

But then in the middle here, you see off to the side, we have this knowledge graph. And you can populate this knowledge graph both by extracting data from unstructured documents and by running standard ETLs on structured data. And so there's a big question of why in the hell you would want this knowledge graph thing sticking in the middle there.

And I could talk about accuracy and explainability, but I think what's really valuable to talk about is what this means going forward with agents and how it's valuable for agentic workflows. And as we think about some of what agents can do with reasoning and decomposing questions, a lot of the retrieval that we're seeing is not so much just a straight shot vector search anymore.

A lot of what we're seeing is a question being broken down and being handed multiple queries, right, to go and pull the data that you need. And the great thing about having a knowledge graph is that you can express a very simple data model to get started to your agent, which can help it do that decomposition, pull information accurately, and then as you sort of expand, you can keep adding more and more data.

The example that I'm going to show you here inside of these notebooks is going to be for an employee graph. So basically think about a knowledge assistant that's responsible for helping pull information around skills analysis, look for similarities or substitutions in a team, and try to figure out who's collaborating, where skill gaps are, all that sort of stuff.

And the data that we're going to start with today is just going to be in resumes. I just have these PDFs, just like a folder of resumes, that are pretty standard. They just list people's professional experience and descriptions. It's for a company called, what did we have here? Cyberdyne Systems, if anyone's familiar with them.

A little Terminator reference. All right. So anyway, the first thing that I'm going to do to show you what this looks like is load documents into the Neo4j graph database, but just as plain documents. Basically take every resume, put an embedding on it.

What it looks like here, if I was to scroll here, is kind of like this. So basically, I have these different nodes, but you'll see the nodes are basically just going to have some metadata, some text, which is the resume, and then an embedding. And basically what I'm going to do is I'm just going to create an agent inside of ADK, so Google's framework, and I'm going to start asking it some of those questions.

And so basically what you're going to see here, and I don't have time, unfortunately, to walk through all of the code, but basically if you look at this agent that's been constructed, right, I have my agent, I have some instructions to pull data, and then I give it one tool, which is a tool to go search documents, right?

And so I'm going to ask it a question, how many Python developers do I have? You can imagine, right, this is probably not going to work out very well if all I have is just documents, because basically it's going to tell me I have five Python developers. And that's because I set K equal to five, right, when I went to go pull my documents.
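The reason the count comes back as five can be shown with a minimal sketch of plain top-k vector search. This is not the notebook's code: the `search_documents` function, the toy two-dimensional embeddings, and the eight-resume corpus are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_documents(query_vec, docs, k=5):
    """Return the k documents most similar to the query (plain top-k search)."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["embedding"]), reverse=True)
    return ranked[:k]


# Hypothetical corpus: 8 resumes with toy embeddings.
# In this toy data every resume mentions Python, so the true count is 8.
docs = [
    {"name": f"person_{i}", "text": "Python developer ...", "embedding": [1.0, i / 10]}
    for i in range(8)
]

hits = search_documents([1.0, 0.0], docs, k=5)
print(len(hits))  # 5: the result size is capped by k, not by how many resumes match
```

An agent that counts the returned documents can never report more than k, however many matching resumes exist.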

So obviously that's going to be wrong, and I'm telling you that that answer is wrong. So you could probably solve that with doing some entity extraction and putting more metadata right on your node, so that's fine. So then I'm going to ask, who is most similar to a particular person in terms of just their skill set or what they've done?

And again, here, what you'll see, and I've told it in the bottom to kind of explain what it's doing, if I go down to the display here, it'll tell me basically here that what it's doing is it's just going to be using search terms to go pull information. So this might help you find similarity to a certain extent, like it knows, you know, Lucas is a full-stack AI engineer, and he does, you know, Python, JavaScript, and some machine learning stuff, I guess.

So you can search for that, and you can find some similar people, but the logic is still a little bit hard to control. It's just sort of, you know, plain semantic similarity search. And as I start to go down, I can ask questions like, summarize my technical talent and skills distribution.

It's not going to be able to answer that, right, because it needs to be able to do an aggregation to answer a question like that. So if I was to go up and look at the logic again, or I'll go down here, it could say I search, you know, employees' resumes using certain search terms and stuff, right?

And so it's basically, if I go down and ask these questions, it's just going to be using search terms to find things. And so that's not really good enough for our use case, because we want to do analytics, we want to do aggregations, and we want to try to find relationships between people.

And again, like in the last question, I basically asked it to find who's collaborating on lots of projects, and all it can do is search for collaborators, which is not really what I want. Like, what I want is to find who's been collaborating with who on different projects, right?

And the resume data, you know, might have that, but it's all, you know, sunk inside of the text and stuff, so it can't really do that. So the question is now, well, how do I think about basically explaining my data to my agent and then also making sure, right, that I have a data model that makes sense?

So if you think about it, you can do this at the data layer. You can think about how do I want to model my data just to start for some of these beginning questions. And here it's basically like I want to know what a person is, so I need that.

I want some concept of a person knowing skills. And then the only other thing that I really care about is what do people do, like what things do they do, right? Very simple. That's just the data model that I want to express. So I'm going to do entity extraction of these documents to basically create this graph.

And really, it's going to be just slightly more complicated than that, because what I actually need to create here is a graph where, instead of people just doing things, they publish, build, win, lead, manage, optimize, and ship things. And then those things are going to belong to different domains and work types, which is going to allow me to kind of connect similar things together inside of the graph.

So it's a little bit more complicated, but it's the same exact concept. And basically, the entity extraction workflows we use are pretty self-explanatory. If I was to go here, I use Pydantic classes to do that. So I have a concept of enumerations on the types of accomplishments and domains that I want and work types as well.

I define my things, I define how someone does a thing through an accomplishment, and then basically, I put that through another workflow. I think I use LangChain in this case to basically decompose those documents and spit out a bunch of JSON. And inside of that JSON, for example, for this person, I have their skills, right, that I get in here, and I also have their accomplishments that I get inside of here.
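The shape of that extraction schema can be sketched roughly as follows. The talk uses Pydantic classes; this version uses standard-library dataclasses and `Enum` as a stand-in so it runs without dependencies. Every class name, enum member, and field name here is hypothetical, as is the example person.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


# Enumerations constrain what the extraction step is allowed to emit.
class AccomplishmentType(str, Enum):
    BUILT = "BUILT"
    LED = "LED"
    SHIPPED = "SHIPPED"


class Domain(str, Enum):
    AI = "AI"
    ANALYTICS = "ANALYTICS"
    DATA_ENGINEERING = "DATA_ENGINEERING"


class WorkType(str, Enum):
    SYSTEM = "SYSTEM"
    CODE = "CODE"


@dataclass
class Thing:
    name: str
    domain: Domain
    work_type: WorkType


@dataclass
class Accomplishment:
    """How a person did a thing, e.g. BUILT a forecasting service."""
    kind: AccomplishmentType
    thing: Thing


@dataclass
class Person:
    person_id: str
    name: str
    skills: list = field(default_factory=list)
    accomplishments: list = field(default_factory=list)


# Hypothetical extraction result for one resume.
person = Person(
    person_id="p1",
    name="Example Person",
    skills=["Python", "SQL"],
    accomplishments=[
        Accomplishment(
            kind=AccomplishmentType.BUILT,
            thing=Thing("Forecasting Service", Domain.AI, WorkType.SYSTEM),
        )
    ],
)

# Round-trip through JSON, the format the extraction workflow emits.
record = json.loads(json.dumps(asdict(person)))
print(record["accomplishments"][0])
```

Keeping accomplishment types, domains, and work types as closed enumerations is what lets similar things connect to the same higher-level nodes once the JSON is loaded into the graph.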

So I can go ahead and load that as well into my graph. And now I have a much more expressive data model. So if I go back to here and scroll up and look at this guy, this is now kind of what my data model looks like, right? I have my people, but now you see how they're connected by all of these different skills that they have, as well as the things that they're actually working on, and how those things connect to higher level concepts, like whether it's building something for a system, or shipping code, doing all those sorts of things.

And now, because I have that expressive data model, I'm able to start having a lot more precision around the way that I get my questions answered. So what I do here after all this graph construction, which I already talked through, is I create this other agent, give it a similar set of instructions, I tell it a little bit about my data model in here, and then I'm actually going to be using this MCP server that allows it to read the schema, and also generate Cypher statements.

So this is an MCP tool that we just have, you know, out on GitHub that you can pull. Now, when I ask how many developers I have, now I can actually do a query that's going to match on person knows skill, Python. And because of that, I get an answer that's much closer to correct, which is 28 developers.
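A sketch of what that count looks like follows. The Cypher string is an assumption about what the agent might generate: the `Person` and `Skill` labels, the `KNOWS` relationship type, and the `name` property are illustrative, not confirmed by the talk. The Python function computes the same thing over an in-memory list of edges so the logic can be checked without a database.

```python
# Assumed Cypher for the count; labels and property names are illustrative.
COUNT_BY_SKILL = """
MATCH (p:Person)-[:KNOWS]->(s:Skill {name: $skill})
RETURN count(DISTINCT p) AS developers
"""

# In-memory stand-in for the KNOWS relationships: (person, skill) pairs.
knows = [
    ("p1", "Python"),
    ("p2", "Python"),
    ("p2", "Java"),
    ("p3", "SQL"),
    ("p4", "Python"),
]


def count_people_with_skill(knows, skill):
    """Count distinct people connected to the given skill."""
    return len({person for person, s in knows if s == skill})


print(count_people_with_skill(knows, "Python"))  # 3
```

Unlike top-k search, the count here is driven by the relationships in the data rather than by a retrieval limit.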

That's very simple, that's just aggregation. But now I can ask a similarity question, right, like I did before, who is most similar to Lucas Martinez and why? So this is the same exact question. And when I went to go calculate that, I got an answer that it's Sarah. And it will explain the reasoning that it did.

It searched for people who knew the same skill sets. Sometimes when I ask this, it will also search for people who have similar accomplishments as well. And it will explain exactly like, hey, like I did an overlap calculation in the graph to figure out basically, you know, given the number of skill sets they had and the number of overlap, this is the person that I think is the closest.
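One way to picture that overlap calculation is a Jaccard score over skill sets. This is a sketch under assumptions: the talk does not say which overlap measure the agent used, and the skill sets below are invented toy data, not the notebook's dataset.

```python
def skill_overlap(skills, target):
    """Rank everyone else by Jaccard overlap of their skills with the target's."""
    base = skills[target]
    scores = []
    for person, theirs in skills.items():
        if person == target:
            continue
        shared = base & theirs
        union = base | theirs
        jaccard = len(shared) / len(union) if union else 0.0
        scores.append((person, sorted(shared), round(jaccard, 2)))
    return sorted(scores, key=lambda row: row[2], reverse=True)


# Hypothetical skill sets for illustration only.
skills = {
    "Lucas": {"Python", "JavaScript", "Machine Learning"},
    "Sarah": {"Python", "Machine Learning", "SQL"},
    "Amanda": {"Java", "SQL"},
}

ranking = skill_overlap(skills, "Lucas")
print(ranking[0])  # ('Sarah', ['Machine Learning', 'Python'], 0.5)
```

Because the score is built from explicit shared skills, the result comes with its own explanation: the list of overlapping skills is exactly what can be audited and corrected in the graph.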

So the benefit of doing this is that you have much more control and you can go in and filter exactly what skills people have. So for example, if I knew someone actually didn't have a certain skill or did have a certain skill, I can audit that. I can adjust the graph to be able to make that work.

And then similarly, as we go down, to summarize a technical talent distribution, again, I can match on those skills and I can start to answer these questions and actually get numbers between how many people know different skills. And it can also break down different accomplishments and other things of that nature.
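The distribution question is a group-by over the same skill relationships. A minimal sketch, again assuming illustrative `Person`, `Skill`, and `KNOWS` names in the Cypher and using invented edges for the in-memory version:

```python
from collections import Counter

# Assumed Cypher for the distribution; names are illustrative.
SKILL_DISTRIBUTION = """
MATCH (p:Person)-[:KNOWS]->(s:Skill)
RETURN s.name AS skill, count(DISTINCT p) AS people
ORDER BY people DESC
"""

# In-memory stand-in: a set of (person, skill) pairs, so duplicates cannot occur.
knows = {
    ("p1", "Python"),
    ("p2", "Python"),
    ("p2", "SQL"),
    ("p3", "SQL"),
    ("p3", "Python"),
    ("p4", "Java"),
}

# Count how many people know each skill.
distribution = Counter(skill for _person, skill in knows)
print(distribution.most_common())  # [('Python', 3), ('SQL', 2), ('Java', 1)]
```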

So you just get much more refinement in the types of answers that you get for your questions. I can also add additional tools. So instead of just generating Cypher kind of on the fly with a language model, I can also, and I'll show you what some of these look like in my bigger screen here, I can go ahead and move this up.

So this is an example of finding people with similar skills here that I'm about to show you. And basically, right, I can do these very flexible queries in a graph database where I can say, hey, go from person ID to this other person ID. And as I do that, basically what I'm saying in that top syntax there is go out some, you know, zero to three hops, basically.

And so what that allows me to do is I can traverse over both the skills, over the common systems, over the common domains they work in, and all of their accomplishments. And if I wanted to add something else to that data model, maybe there's a collaboration link or another project link, that would all get picked up inside of that query.
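The hop-bounded traversal can be sketched as follows. The Cypher string is an assumed illustration of variable-length pattern syntax, not the query from the notebook, and the node names and edges are invented. The Python function enumerates the same kind of paths over an in-memory undirected graph.

```python
from collections import defaultdict

# Assumed Cypher: any path of zero to three hops between two people,
# regardless of which node or relationship types it passes through.
PATHS_BETWEEN = """
MATCH path = (a:Person {id: $person_a})-[*0..3]-(b:Person {id: $person_b})
RETURN path
"""

# Hypothetical edges: people connect via skills, things, and domains.
edges = [
    ("Lucas", "Python"),
    ("Sarah", "Python"),
    ("Lucas", "Recommender Build"),
    ("Recommender Build", "AI"),
    ("Sarah", "Forecasting Model"),
    ("Forecasting Model", "AI"),
    ("Sarah", "SQL"),
]

adjacency = defaultdict(list)
for a, b in edges:
    adjacency[a].append(b)
    adjacency[b].append(a)  # traversal ignores direction


def paths_between(adjacency, start, end, max_hops=3):
    """Enumerate simple paths from start to end using at most max_hops edges."""
    found = []

    def walk(node, path):
        if node == end:
            found.append(path)
            return
        if len(path) - 1 == max_hops:
            return
        for neighbor in adjacency.get(node, ()):
            if neighbor not in path:
                walk(neighbor, path + [neighbor])

    walk(start, [start])
    return found


print(paths_between(adjacency, "Lucas", "Sarah", max_hops=3))
# [['Lucas', 'Python', 'Sarah']]
```

Raising `max_hops` to 4 in this toy graph also picks up the path through the shared AI domain. A newly added link type, such as a collaboration edge, would be traversed without changing the function, which is the flexibility being described.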

So there's a lot of flexibility and also higher performance in a graph database when you start wanting to do those types of complex traversals. And that allows me then, when I go to find similar people again, so if I was just to scroll down to my question where I define my agent again, I give it more tools, and then I can actually look at who is most similar to Lucas Martinez and why, and it will start doing these queries.

So what you see it get back is all of the results of what I was just showing you earlier. And if I scroll down to the response, that now helps me get all of these specific numbers around the skills, and also the domains, across the different AI, analytics, and data engineering things that they were working on.

So there's more explainability with the way that these questions are being answered. The last part that I wanted to show you, and I only have a few minutes left, so I'm going to go very quick, is what happens now when you want to add more data to your graph, right?

So basically, say that we had this resume data, but now we have this internal data that comes from a human resource intelligence system. And this tells me different projects that people are working on together and collaborating on. So I have basically the same people that came from resumes earlier, but now I can see different projects that they were working on together.

And the great thing about graph is that it allows you to add these things very flexibly. So if I go back to my notebook here, think about what it takes to expand a data model if you're in an RDBMS or tables. One of the assumptions that I made when I was ingesting the resumes initially was that an accomplishment only had one person.

It was sort of this one-to-one relationship, or really it was one-to-many, but an accomplishment only had one person because it was just listed on their resume. But now that I am doing this thing with this internal system, I can actually see people who are co-collaborating. And what that would mean in a tabular environment is I would have to create another join table, right?

So I'd have to do some sort of data model refactor. But the great thing about graph is that I don't have to do that at all. I can just sort of create new relationships. So this is very useful when you're going from one-to-many, to many-to-many, or when you're introducing completely new node and relationship types.
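A minimal sketch of that point, with the graph held as a set of (source, relationship, target) triples. The names, relationship types, and the `add_relationship` helper are hypothetical; the helper only mimics the idempotent behavior of a Cypher `MERGE` and is not a database call.

```python
# Graph stored as a set of (source, relationship, target) triples.
# Resume ingestion assumed one person per accomplishment.
edges = {("Sarah", "BUILT", "Fraud Model")}


def add_relationship(edges, source, rel, target):
    """Idempotently add one relationship, similar in spirit to Cypher's MERGE."""
    edges.add((source, rel, target))


def contributors(edges, thing, rel="BUILT"):
    """Everyone connected to a thing by the given relationship type."""
    return sorted(s for s, r, t in edges if t == thing and r == rel)


# HR data shows a second contributor: one more edge, no schema change.
add_relationship(edges, "Amanda", "BUILT", "Fraud Model")

# A brand-new node and relationship type also need no migration.
add_relationship(edges, "Fraud Model", "PART_OF", "Risk Platform")

print(contributors(edges, "Fraud Model"))  # ['Amanda', 'Sarah']
```

In a tabular model, the same change from one person per accomplishment to many would typically mean introducing a join table and updating the queries that assumed a single owner.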

It's very easy to do that, which is super important as we move very, very fast, right, with our agents. And we want to ingest new data quickly and kind of build out our systems and pivot and all that sort of stuff. And once I have that information, right, I can start asking questions about who's collaborating with each other.

So what I'm going to do is I'm going to create, this is sort of what the tool creation looks like. This is a lot of the same tools that I had before. But the tool that I can create here to find collaborators will basically do this match. It's all the way down here.

It will do this match to find people who are working on the same thing within a certain set of domains. So that's kind of what the graph looks like that I get returned. And when I now want to, since I've added that tool, basically what it means is that when I ask a question, like which individuals have collaborated with each other to, and this says, deliver the most AI things, right, it can go ahead and leverage that tool.
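The logic of such a collaborator tool can be sketched in plain Python. This is an assumption about what the match computes, namely pairs of people who worked on the same thing within a set of domains, ranked by how many things they share. The function name and all sample data are invented.

```python
from collections import Counter
from itertools import combinations


def find_collaborators(worked_on, thing_domain, domains):
    """Count, per pair of people, the things in the given domains both worked on."""
    people_by_thing = {}
    for person, thing in worked_on:
        if thing_domain.get(thing) in domains:
            people_by_thing.setdefault(thing, set()).add(person)

    pairs = Counter()
    for people in people_by_thing.values():
        # Sort so each pair has one canonical ordering.
        for a, b in combinations(sorted(people), 2):
            pairs[(a, b)] += 1
    return pairs.most_common()


# Hypothetical project data for illustration only.
worked_on = [
    ("Sarah", "Fraud Model"),
    ("Amanda", "Fraud Model"),
    ("Sarah", "Chatbot"),
    ("Amanda", "Chatbot"),
    ("Lucas", "Chatbot"),
    ("Lucas", "Warehouse Migration"),
    ("Amanda", "Warehouse Migration"),
]
thing_domain = {
    "Fraud Model": "AI",
    "Chatbot": "AI",
    "Warehouse Migration": "DATA_ENGINEERING",
}

top = find_collaborators(worked_on, thing_domain, {"AI"})
print(top[0])  # (('Amanda', 'Sarah'), 2)
```

Exposing this as a dedicated tool gives the agent a fixed, auditable query for collaboration questions instead of relying on generated Cypher each time.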

And then it can now return an answer that's much more, you know, exact and based on my data. So now I know that Sarah and Amanda have collaborated on very specific projects, and I also have other collaborators here with supply chain and such. So that was my short presentation.

I hope it was helpful. If you have any more questions, I would be happy to meet you at the booth, and we can talk more. But that's it for me today. Thank you, everyone.

Agentic GraphRAG: Simplifying Retrieval Across Structured & Unstructured Data — Zach Blumenfeld
