Agentic GraphRAG: Simplifying Retrieval Across Structured & Unstructured Data

00:00:00.000 | Today I'm going to go over GraphRAG, particularly dealing with multiple data sources,

00:00:19.280 | so both unstructured and structured data sources, and kind of why you would want to ever do that

00:00:24.100 | in the first place even. So I prepared some notebooks here. I was going to make slides,

00:00:29.060 | but then I thought it would just be easier to walk through some of what this looks like in practice.

00:00:33.600 | So there's a link here, and I can share it with you at the booth later if you have follow-up questions.

00:00:39.060 | But basically, what I wanted to show you first is just what a general GraphRAG architecture looks like.

00:00:45.540 | And so basically, what we do is we have our agents, we have our tools, just like you normally would.

00:00:52.860 | But then in the middle here, you see off to the side, we have this knowledge graph.

00:00:56.900 | And this knowledge graph, you can both extract data from documents in unstructured places,

00:01:01.440 | and also have standard ETLs for structured data.

00:01:04.320 | And so there's a big question of why in the hell you would want this knowledge graph thing sticking in the middle there.

00:01:10.380 | And I could talk about accuracy and explainability, but I think what's really valuable to talk about

00:01:15.440 | is kind of what this means going forward with agents and how it's valuable for agentic workflows.

00:01:19.980 | And as we think about some of what agents can do with reasoning and decomposing questions,

00:01:25.660 | a lot of the retrieval that we're seeing is not so much just a straight shot vector search anymore.

00:01:31.440 | A lot of what we're seeing is a question being broken down and being handed multiple queries, right,

00:01:37.040 | to go and pull the data that you need.

00:01:39.240 | And the great thing about having a knowledge graph is that you can express a very simple data model

00:01:43.780 | to get started to your agent, which can help it do that decomposition, pull information accurately,

00:01:49.220 | and then as you sort of expand, you can keep adding more and more data.

00:01:53.280 | The example that I'm going to show you here inside of these notebooks is going to be for an employee graph.

00:02:00.620 | So basically think about a knowledge assistant that's responsible for helping pull information

00:02:04.900 | around skills analysis, look for similarities or substitutions in a team,

00:02:09.920 | and try to figure out who's collaborating, where skill gaps are, all that sort of stuff.

00:02:13.800 | And the data that we're going to start with today is just going to be in resumes.

00:02:17.040 | I just have these PDFs, just like a folder of resumes, that are pretty standard.

00:02:22.080 | It just lists people's professional experience and descriptions.

00:02:24.920 | It's for a company called, what did we have here?

00:02:30.360 | Cyberdyne Systems, if anyone's familiar with them.

00:02:34.260 | A little Terminator reference.

00:02:36.660 | All right.

00:02:37.280 | So anyway, the first thing that I'm going to do to just show you what this looks like

00:02:42.320 | is I'm going to load documents into the Neo4j graph database,

00:02:45.840 | but just sort of as like basically documents.

00:02:49.660 | Basically take every resume, put an embedding on it.

00:02:53.480 | What it looks like here, if I was to scroll here, is kind of like this.

00:02:58.260 | So basically, I have these different nodes, but you'll see the nodes are basically just

00:03:04.160 | going to have some metadata, some text, which is the resume, and then an embedding.

00:03:07.880 | And basically what I'm going to do is I'm just going to create an agent inside of ADK,

00:03:12.600 | so Google's framework, and I'm going to start asking it some of those questions.

00:03:17.320 | And so basically what you're going to see here, and I don't have time, unfortunately,

00:03:21.180 | to walk through all of the code, but basically if you look at this agent that's been constructed,

00:03:26.720 | right, I have my agent, I have some instructions to pull data, and then I give it one tool,

00:03:31.900 | which is a tool to go search documents, right?

00:03:34.680 | And so I'm going to ask it a question, how many Python developers do I have?

00:03:38.260 | You can imagine, right, this is probably not going to work out very well if all I have is

00:03:43.140 | just documents, because basically it's going to tell me I have five Python developers.

00:03:48.000 | And that's because I set K equal to five, right, when I went to go pull my documents.

00:03:52.540 | So obviously that's going to be wrong, and I'm telling you that that answer is wrong.

00:03:56.380 | So you could probably solve that with doing some entity extraction and putting more metadata

00:04:00.040 | right on your node, so that's fine.
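
To make the "five Python developers" failure concrete, here is a minimal sketch of top-k retrieval over document embeddings. It is not the notebook's code: the `cosine` and `top_k` helpers, the toy 2-d embeddings, and the eight made-up resumes are all assumptions for illustration. The point is only that counting retrieved documents can never exceed k.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, docs, k=5):
    """Return the k documents whose embeddings are closest to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["embedding"]), reverse=True)
    return ranked[:k]

# Toy data (hypothetical): 8 resumes that all mention Python, with made-up embeddings.
resumes = [{"name": f"person_{i}", "embedding": [1.0, 0.1 * i]} for i in range(8)]

hits = top_k([1.0, 0.0], resumes, k=5)
# An agent that counts the retrieved documents sees at most k of them,
# no matter how many resumes actually match.
python_developers_seen = len(hits)
```

With eight matching resumes and k set to five, the count the agent sees is five, which is the same shape of error described in the talk.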

00:04:02.400 | So then I'm going to ask, who is most similar to a particular person in terms of just their

00:04:07.580 | skill set or what they've done?

00:04:08.800 | And again, here, what you'll see, and I've told it in the bottom to kind of explain what

00:04:14.440 | it's doing, if I go down to the display here, it'll tell me basically here that what it's

00:04:22.300 | doing is it's just going to be using search terms to go pull information.

00:04:25.980 | So this might help you find similarity to a certain extent, like it knows, you know,

00:04:31.020 | Lucas is a full-stack AI engineer, and he does, you know, Python, JavaScript, and some machine

00:04:36.320 | learning stuff, I guess.

00:04:37.200 | So you can search for that, and you can find some similar people, but the logic is still

00:04:41.720 | a little bit hard to control.

00:04:42.860 | It's just sort of, you know, plain semantic similarity search.

00:04:46.140 | And as I start to go down, I can ask questions like, summarize my technical talent and skills

00:04:51.980 | distribution.

00:04:52.540 | It's not going to be able to answer that, right, because it needs to be able to do an

00:04:55.900 | aggregation to answer a question like that.

00:04:57.820 | So if I was to go up and look at the logic again, or I'll go down here, it could say I

00:05:02.720 | search, you know, employees' resumes using certain search terms and stuff, right?

00:05:06.060 | And so it's basically, if I go down and ask these questions, it's just going to be using

00:05:10.240 | search terms to find things.

00:05:11.600 | And so that's not really good enough for our use case, because we want to do analytics,

00:05:15.760 | we want to do aggregations, and we want to try to find relationships between people.

00:05:20.480 | And again, like in the last question, I basically asked it to find who's collaborating on lots

00:05:25.100 | of projects, and all it can do is search for collaborators, which is not really what I want.

00:05:29.360 | Like, what I want is to find who's been collaborating with who on different projects, right?

00:05:34.500 | And the resume data, you know, might have that, but it's all, you know, sunk inside of the text

00:05:38.700 | and stuff, so it can't really do that.

00:05:40.160 | So the question is now, well, how do I think about basically explaining my data to my agent

00:05:45.460 | and then also making sure, right, that I have a data model that makes sense?

00:05:51.100 | So if you think about it, you can do this at the data layer.

00:05:53.940 | You can think about how do I want to model my data just to start for some of these beginning

00:05:59.260 | questions.

00:05:59.760 | And here it's basically like I want to know what a person is, so I need that.

00:06:05.240 | I want some concept of a person knowing skills.

00:06:09.280 | And then the only other thing that I really care about is what do people do, like what things

00:06:13.840 | do they do, right?

00:06:15.080 | Very simple.

00:06:16.480 | That's just the data model that I want to express.

00:06:19.360 | So I'm going to do entity extraction of these documents to basically create this graph.

00:06:23.840 | And really, it's going to be just slightly more complicated than that, because basically

00:06:28.860 | what I need to do here is I actually need to create a graph where, right, I have, instead

00:06:35.460 | of just doing things, I have published, built, won, led, managed, optimized, shipped things.

00:06:42.320 | And then those things are going to belong to different domains and work types, which is going

00:06:47.100 | to allow me to kind of connect similar things together inside of the graph.

00:06:50.400 | So it's a little bit more complicated, but it's the same exact concept.

00:06:55.860 | And basically, the entity extraction workflows we use are pretty self-explanatory.

00:07:02.620 | They use, if I was to go here, I use Pydantic classes to do that.

00:07:07.840 | So I have a concept of enumerations on the types of accomplishments and domains that I want and

00:07:12.660 | work types as well.

00:07:13.880 | I define my things, I define how someone does a thing through an accomplishment, and then

00:07:20.340 | basically, I put that through another workflow.

00:07:23.940 | I think I use LangChain in this case to basically decompose those documents and spit out a bunch

00:07:29.900 | of JSON.

00:07:30.320 | And inside of that JSON, for example, for this person, I have their skills, right, that I get

00:07:35.660 | in here, and I also have their accomplishments that I get inside of here.

00:07:40.100 | So I can go ahead and load that as well into my graph.
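
As a rough sketch of what an extraction schema like this could look like: the talk uses Pydantic classes, but the version below uses only standard-library `Enum` and `dataclass` as a stand-in so it runs without extra dependencies. The class names, field names, the `Domain` values, and the sample JSON (including the "recommendation service" accomplishment) are assumptions for illustration, not the notebook's actual schema.

```python
import json
from dataclasses import dataclass, field
from enum import Enum

class AccomplishmentType(str, Enum):
    # Verb categories named in the talk.
    PUBLISHED = "PUBLISHED"
    BUILT = "BUILT"
    WON = "WON"
    LED = "LED"
    MANAGED = "MANAGED"
    OPTIMIZED = "OPTIMIZED"
    SHIPPED = "SHIPPED"

class Domain(str, Enum):
    # Illustrative subset; the real list of domains is an assumption.
    AI = "AI"
    ANALYTICS = "ANALYTICS"
    DATA_ENGINEERING = "DATA_ENGINEERING"

@dataclass
class Accomplishment:
    type: AccomplishmentType
    thing: str
    domain: Domain

@dataclass
class PersonExtraction:
    name: str
    skills: list = field(default_factory=list)
    accomplishments: list = field(default_factory=list)

def parse_extraction(raw_json: str) -> PersonExtraction:
    """Validate one person's extracted JSON against the enumerations."""
    data = json.loads(raw_json)
    accomplishments = [
        Accomplishment(
            type=AccomplishmentType(a["type"]),  # ValueError on an unknown type
            thing=a["thing"],
            domain=Domain(a["domain"]),          # ValueError on an unknown domain
        )
        for a in data.get("accomplishments", [])
    ]
    return PersonExtraction(data["name"], data.get("skills", []), accomplishments)

# Made-up sample of what the extraction step might emit for one resume.
sample = (
    '{"name": "Lucas Martinez", "skills": ["Python", "JavaScript"], '
    '"accomplishments": [{"type": "BUILT", "thing": "recommendation service", "domain": "AI"}]}'
)
person = parse_extraction(sample)
```

Constraining accomplishment types and domains to enumerations is what lets similar things connect to the same higher-level nodes once the JSON is loaded into the graph.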

00:07:44.840 | And now I have a much more expressive data model.

00:07:48.120 | So if I go back to here and scroll up and look at this guy, this is now kind of what my data

00:07:57.280 | model looks like, right?

00:07:58.240 | I have my people, but now you see how they're connected by all of these different skills

00:08:02.980 | that they have, as well as the things that they're actually working on, and how those things

00:08:08.020 | connect to higher level concepts, like whether it's building something for a system, or shipping

00:08:14.020 | code, doing all those sorts of things.

00:08:16.520 | And now, because I have that expressive data model, I'm able to start having a lot more precision

00:08:22.500 | around the way that I get my questions basically answered.

00:08:27.100 | So what I do here after all this graph construction, which I already talked through, is I create

00:08:33.640 | this other agent, give it a similar set of instructions, I tell it a little bit about my data

00:08:38.720 | model in here, and then I'm actually going to be using this MCP server that allows it to read

00:08:44.460 | the schema, and also generate Cypher statements.

00:08:47.660 | So this is an MCP tool that we just have, you know, out on GitHub that you can pull.

00:08:53.620 | Now, when I ask how many Python developers I have, now I can actually do a query that's going

00:08:58.160 | to match on person knows skill, Python.

00:09:00.160 | And because of that, I get an answer that's much closer to correct, which is 28 developers.

00:09:06.700 | That's very simple, that's just aggregation.
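
A query of the kind described here might look like the sketch below. The `Person` and `Skill` labels, the `KNOWS` relationship type, and the `name` property are assumptions about the data model, and the small in-memory edge list is made up so the same aggregation can be shown without a database.

```python
# Cypher an agent might generate for "how many Python developers do I have?"
# Labels, relationship type, and property names are assumptions.
COUNT_PYTHON_DEVS = """
MATCH (p:Person)-[:KNOWS]->(s:Skill {name: $skill})
RETURN count(DISTINCT p) AS developers
"""

# In-memory stand-in for the KNOWS relationships (made-up data).
knows = [
    ("Lucas", "Python"), ("Lucas", "JavaScript"),
    ("Sarah", "Python"), ("Sarah", "SQL"),
    ("Amanda", "SQL"),
]

def count_people_with_skill(edges, skill):
    """Count distinct people connected to the given skill."""
    return len({person for person, s in edges if s == skill})

python_devs = count_people_with_skill(knows, "Python")
```

Unlike the top-k document search, the count here comes from matching every person connected to the skill, so it is not bounded by a retrieval limit.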

00:09:08.240 | But now I can ask a similarity question, right, like I did before, who is most similar to Lucas

00:09:14.240 | Martinez and why?

00:09:15.240 | So this is the same exact question.

00:09:17.240 | And when I went to go calculate that, I got an answer that it's Sarah.

00:09:22.780 | And it will explain the reasoning that it did.

00:09:24.780 | It searched for people who knew the same skill sets.

00:09:28.780 | Sometimes when I ask this, it will also search for people who have similar accomplishments as well.

00:09:33.780 | And it will explain exactly like, hey, like I did an overlap calculation in the graph to figure out basically, you know,

00:09:42.320 | given the number of skill sets they had and the number of overlap, this is the person that I think is the closest.
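
The talk does not say exactly which overlap measure the agent computed, so the following is only one plausible version: count shared skills and, as a tiebreaker, use the Jaccard ratio (shared over combined). The skill sets are made up, and the function names are hypothetical.

```python
def skill_overlap(skills_a, skills_b):
    """Return (number of shared skills, Jaccard ratio) for two skill sets."""
    shared = skills_a & skills_b
    union = skills_a | skills_b
    return len(shared), (len(shared) / len(union) if union else 0.0)

def most_similar(target, skills_by_person):
    """Pick the other person with the largest overlap with the target."""
    scores = {
        other: skill_overlap(skills_by_person[target], skills)
        for other, skills in skills_by_person.items()
        if other != target
    }
    best = max(scores, key=lambda name: scores[name])
    return best, scores[best]

# Made-up skill sets for illustration.
skills_by_person = {
    "Lucas": {"Python", "JavaScript", "Machine Learning"},
    "Sarah": {"Python", "Machine Learning", "SQL"},
    "Amanda": {"SQL", "Java"},
}
best_match, (shared_count, jaccard) = most_similar("Lucas", skills_by_person)
```

Because the score is built from explicit skill nodes, it can be audited: removing or adding a skill for a person changes the result in a way you can trace.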

00:09:47.320 | So the benefit of doing this is that you have much more control and you can go in and filter exactly what skills people have.

00:09:53.700 | So for example, if I knew someone actually didn't have a certain skill or did have a certain skill, I can audit that.

00:09:59.400 | I can adjust the graph to be able to make that work.

00:10:02.240 | And then similarly, as we go down, to summarize a technical talent distribution, again, I can match on those skills

00:10:10.660 | and I can start to answer these questions and actually get numbers between how many people know different skills.

00:10:15.560 | And it can also break down different accomplishments and other things of that nature.

00:10:20.260 | So you just get much more refinement in the types of answers that you get for your questions.

00:10:25.260 | I can also add additional tools.

00:10:28.060 | So instead of just generating Cypher kind of on the fly with a language model, I can also,

00:10:35.300 | and I'll show you what some of these look like in my bigger screen here, I can go ahead and move this up.

00:10:47.860 | So this is an example of finding people with similar skills here that I'm about to show you.

00:10:52.760 | And basically, right, I can do these very flexible queries in a graph database where I can say, hey, go from person ID to this other person ID.

00:11:02.900 | And as I do that, basically what I'm saying in that top syntax there is go out some, you know, zero to three hops, basically.

00:11:10.860 | And so what that allows me to do is I can traverse over both the skills, over the common systems, over the common domains they work at, and all of their accomplishments.

00:11:20.360 | And if I wanted to add something else to that data model, maybe there's a collaboration link or another project link, that would all get picked up inside of that query.

00:11:29.220 | So there's a lot of flexibility and also higher performance in a graph database when you start wanting to do those types of complex traversals.
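
The "zero to three hops" pattern corresponds to Cypher's variable-length relationship syntax. Below is a hedged sketch: the query text is a guess at the shape of the one on screen (labels, the `id` property, and parameter names are assumptions), and the breadth-first search is a simplified in-memory analogue that reports only the shortest hop count, whereas the Cypher pattern would return every matching path.

```python
from collections import deque

# Variable-length pattern: any relationship type, any direction, 0 to 3 hops.
SIMILAR_PATHS = """
MATCH path = (a:Person {id: $person_id})-[*0..3]-(b:Person {id: $other_id})
RETURN path
"""

def hops_between(adjacency, start, goal, max_hops=3):
    """Breadth-first search; hop count if goal is within max_hops, else None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == goal:
            return depth
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return None

# Made-up undirected graph mixing people, skills, things, and domains.
edges = [
    ("Lucas", "Python"), ("Sarah", "Python"),
    ("Lucas", "fraud model"), ("fraud model", "AI"),
    ("chatbot", "AI"), ("Amanda", "chatbot"),
]
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

lucas_to_sarah = hops_between(adjacency, "Lucas", "Sarah")    # via the shared skill
lucas_to_amanda = hops_between(adjacency, "Lucas", "Amanda")  # needs 4 hops
```

Since the pattern does not name relationship types, a newly added link type (a collaboration or project edge, say) is traversed without changing the query, which is the flexibility being described.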

00:11:36.820 | And that allows me then, when I go to find similar people again, so if I was just to scroll down to my question where I define my agent again, I give it more tools,

00:11:49.680 | and then I can actually look at who is most similar to Lucas Martinez and why, and it will start doing these queries.

00:11:56.180 | So what you see it get back is, like, all of the results of what I was just showing you earlier.

00:12:00.500 | And what that will help with, if I scroll down to the response, is getting all of these specific numbers around the skills and then also the domains now

00:12:10.960 | between the different AI and analytics and data engineering and things that they were working on.

00:12:15.380 | So there's more explainability with the way that these questions are being answered.

00:12:19.020 | The last part that I wanted to show you, and I only have a few minutes left, so I'm going to go very quick, is what happens now when you want to add more data to your graph, right?

00:12:28.200 | So basically, say that we had this resume data, but now we have this internal data that comes from a human resource intelligence system.

00:12:36.320 | And this tells me different projects that people are working on together and collaborating on.

00:12:41.080 | So I have basically the same people that came from resumes earlier, but now I can see different projects that they were working on together.

00:12:48.340 | And the great thing about graph is that it allows you to add these things very flexibly.

00:12:53.180 | So if I go back to my notebook here, think about what happens when you expand a data model if you're in sort of an RDBMS or tables.

00:13:04.720 | One of the assumptions that I made when I was ingesting the resumes initially was that an accomplishment only had one person.

00:13:11.820 | It was sort of this one-to-one relationship, or really it was one-to-many, but an accomplishment only had one person because it was just listed on their resume.

00:13:19.820 | But now that I am doing this thing with this internal system, I can actually see people who are co-collaborating.

00:13:28.480 | And what that would mean in a tabular environment is I would have to create another join table, right?

00:13:32.780 | So I'd have to do some sort of data model refactor.

00:13:34.960 | But the great thing about graph is that I don't have to do that at all.

00:13:38.860 | I can just sort of create new relationships.

00:13:41.700 | So this is very useful when you're going from one-to-many, to many-to-many, or when you're introducing completely new node and relationship types.

00:13:49.020 | It's very easy to do that, which is super important as we move very, very fast, right, with our agents.

00:13:54.560 | And we want to ingest new data quickly and kind of build out our systems and pivot and all that sort of stuff.
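
One way to picture "just create new relationships" is a load step like the sketch below. The Cypher uses `UNWIND` and `MERGE`, which are standard, but the `Project` label, the `WORKED_ON` relationship type, the property names, and the HR records are all assumptions made up for illustration.

```python
# Cypher sketch: attach HR project assignments to people already in the graph.
# Labels, relationship type, and property names are assumptions.
LOAD_ASSIGNMENTS = """
UNWIND $rows AS row
MATCH (p:Person {id: row.person_id})
MERGE (proj:Project {name: row.project})
MERGE (p)-[:WORKED_ON]->(proj)
"""

def to_rows(hr_records):
    """Flatten {project: [person_id, ...]} into one row per (person, project)."""
    return [
        {"person_id": person_id, "project": project}
        for project, people in hr_records.items()
        for person_id in people
    ]

# Made-up export from the HR system.
hr_records = {
    "forecasting-platform": ["p1", "p2"],
    "supply-chain-dashboard": ["p2", "p3"],
}
rows = to_rows(hr_records)  # would be passed as the $rows query parameter
```

Several people pointing at the same project node is already a many-to-many relationship; nothing about the existing person or skill nodes has to be restructured to allow it.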

00:14:00.380 | And once I have that information, right, I can start asking questions about who's collaborating with each other.

00:14:06.880 | So what I'm going to do is I'm going to create, this is sort of what the tool creation looks like.

00:14:12.380 | This is a lot of the same tools that I had before.

00:14:14.440 | But the tool that I can create here to find collaborators will basically do this match.

00:14:20.220 | It's all the way down here.

00:14:23.360 | It will do this match to find people who are working on the same thing within a certain set of domains.

00:14:29.840 | So that's kind of what the graph looks like that I get returned.
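
A collaborator lookup of this kind could be sketched as follows. The Cypher is a guess at the general shape (two people pointing at the same project, filtered by domain); the labels, the `WORKED_ON` type, the `domain` property, and the parameter name are assumptions. The Python function does the same pair counting over made-up in-memory data.

```python
from collections import Counter
from itertools import combinations

# Labels, relationship type, and property names are assumptions.
FIND_COLLABORATORS = """
MATCH (a:Person)-[:WORKED_ON]->(proj:Project)<-[:WORKED_ON]-(b:Person)
WHERE a.id < b.id AND proj.domain IN $domains
RETURN a.name AS person_a, b.name AS person_b,
       count(DISTINCT proj) AS shared_projects
ORDER BY shared_projects DESC
"""

def collaborator_counts(assignments, domains):
    """Count shared projects per pair of people, restricted to the given domains."""
    counts = Counter()
    for project in assignments:
        if project["domain"] in domains:
            # sorted() gives each pair a canonical order, like a.id < b.id above.
            for pair in combinations(sorted(project["people"]), 2):
                counts[pair] += 1
    return counts

# Made-up project assignments.
assignments = [
    {"name": "chat assistant", "domain": "AI", "people": ["Sarah", "Amanda"]},
    {"name": "forecasting model", "domain": "AI", "people": ["Amanda", "Sarah", "Lucas"]},
    {"name": "warehouse routing", "domain": "SUPPLY_CHAIN", "people": ["Lucas", "Amanda"]},
]
top_pair, shared = collaborator_counts(assignments, {"AI"}).most_common(1)[0]
```

Exposing a fixed query like this as a tool, rather than relying only on generated Cypher, gives the agent a predictable way to answer "who has collaborated the most" questions.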

00:14:32.480 | And when I now want to, since I've added that tool, basically what it means is that when I ask a question,

00:14:38.660 | like which individuals have collaborated with each other to, and this says, deliver the most AI things, right,

00:14:45.460 | it can go ahead and leverage that tool.

00:14:48.140 | And then it can now return an answer that's much more, you know, exact and based on my data.

00:14:54.480 | So now I know that Sarah and Amanda have collaborated on very specific projects,

00:15:02.180 | and I also have other collaborators here with supply chain and such.

00:15:05.640 | So that was my short presentation.

00:15:07.520 | I hope it was helpful.

00:15:08.560 | If you have any more questions, I would be happy to meet you at the booth, and we can talk more.

00:15:14.000 | But that's it for me today.

00:15:16.020 | Thank you, everyone.

Agentic GraphRAG: Simplifying Retrieval Across Structured & Unstructured Data — Zach Blumenfeld