
Intro to GraphRAG — Zach Blumenfeld


Transcript

So as you come in: we have servers set up with everything you'll need if you want to follow along. You should have gotten a Post-it note; if you don't, just raise your hand and my colleague Alex over here will come find you and provide one. If you have a number 160 or below, you go to this first link (the QR code on top as well), and if you have a number that's 201 or above, you go to the second link or its QR code. From there I'll give you some directions: you're going to clone a repo and then move an environment file over really quick. Also, for everything in this deck I created a workshop-graphrag-intro Slack channel, so if you're part of the AIE Slack group you can go there, grab this deck, and get the links.

We'll get started in just a couple of minutes. Sorry, one correction: your number gives you your username and your password. It's "attendee", all lowercase, then your number, and that will be both your username and your password when you sign in. The other link you should try to open is the browser preview, but I'll walk you through that in a second. I'll give it another minute for everyone to file in and get situated. Again: the username and password are "attendee", all lowercase, then the number that you have; both are the same. The Jupyter servers will be taken down after the session, but you have the GitHub link to go back to, so the code stays available; it's just the environment that won't be.

All righty, I'm going to get started. I'll leave this screen up for a second, so if you want to grab the QR codes, now is the time; you can also pick up the deck in the Slack channel. We're doing an intro to GraphRAG workshop today. I was debating what to actually put in this workshop, since everything is changing so quickly, and some colleagues convinced me not to make it too complicated, so this is very much an introductory-level course. If you want more advanced GraphRAG techniques, or integrations with things like MCP, we have that at our booth, plus other events tomorrow that cover some of it, and I'll have links to all of that as we go.

The plan: we get everything set up (hopefully just a few more minutes), and then we have three modules. First, graph basics: we'll be using Neo4j today, a graph database, and we'll cover how to query it and how to construct your logic to retrieve data. Second, a module on unstructured data and how to do entity extraction. Quick poll: how many people know about Neo4j? How many have used Neo4j, as in written Cypher queries? Okay, some folks in here. And how many have used LangChain before? Okay, a fair number of you; good to know. Then in the third module we'll use LangGraph to build a very simple agent that uses some retrieval tools, and you'll see how that works. We'll wrap up with some resources. Make sure to ask questions straight away: raise your hand and I'll stop intermittently.
We only have 80 minutes, though, so if you do have a question, raise it and we'll get it answered, because we'll be moving through the material a little quickly. As I said before, we have two Jupyter servers set up, so you don't need to pip install anything; just connect to the notebooks. You should have a number (if not, raise your hand), and the username and password are "attendee" followed by your number: 160 and below, first link; 201 and above, second link. Also, please open browser.neo4j.io/preview; I'll show you it in a bit. You'll log into that as well, and it lets you visualize the graph a little better as we start putting data inside of it. (Yes, Alex, we need a number over here.)

All righty. Once you're inside your environment, I want you to run two commands. Open a terminal window in Jupyter (press the little plus sign). First, git clone: the command is in the README, which gives you the link to the repository you need to clone. Once you've done that, copy the workshop environment file over into the genai-workshop talent folder. That file, ws.env, has the connection information for a database that's already set up, plus an OpenAI key that we'll use for the workshop. To show you what this looks like: in my terminal (I've already done this; let me make it a little bigger), you just git clone (the README in the main folder has the GitHub URL), and then copy the workshop file into that genai-workshop talent subdirectory. It contains the credentials you'll need to log into the browser and to connect from your notebooks.

The other link in the deck is browser.neo4j.io/preview, which gives you a way to visualize the graph. If I disconnect and connect to an instance, you should see a screen like this, and the workshop file has the credentials; for you it should be "attendee" plus your number for both username and password. Do set that up, because then you'll be able to visualize the graph much better. Any questions so far? For the Jupyter environment, it's "attendee", all lowercase, plus your number, for both fields. Question: where's the connection URL? It's in ws.env. For the Neo4j browser, again, it's "attendee", all lowercase, plus the number you received, for both username and password. And if you're connected through Slack, the workshop-graphrag-intro channel has the slides, so you have a constant reference as we move on.

While everyone gets set up, I'll talk a little about what GraphRAG is in general, to motivate what we're doing.
This is an architecture that represents what some of our customers do; it's a very common, generalized architecture for GraphRAG users. You have your agent, your AI models, and your UI, all the normal things you'd expect in a knowledge assistant, but then there's this knowledge graph in the middle, and you can ingest both unstructured and structured data into it: unstructured being things like documents and PDFs, structured being tables, CSVs, or data from a relational database.

So the big question is: why do we need this knowledge graph thing in the middle? We have agents, we have tools, we can pick stuff from data sources directly. The idea is that if you have a use case and you roughly know the types of questions you want your agents to answer, then by decomposing your data into even a very simple knowledge graph to start, you expose a lot of the domain logic you'd want to apply through the model of your data. As we'll see when we build a skills graph, we'll create relationships about people knowing skills, and by making that schema available to the agent, along with tools, you get much more control over how data is retrieved, you retrieve more accurately, and you can explain the retrieval logic better. We see this becoming especially important as we move further into the agentic world, because it's no longer a one-shot vector search: prompts handed to an agentic workflow get broken down in various ways, and a knowledge graph lets you offer retrieval logic that complements that in a simpler and, in my opinion, better manner.

Today we'll look at a skills-and-employees graph. The use case: you're building a knowledge assistant to help with searching for talent, aligning and analyzing skills within an organization, and things like staffing, team formation, and substitutions. I'll present a little about these modules first (some slides, hopefully not too long), because I want to talk about Cypher and the things you'll be seeing, and then we'll get hands-on pretty quickly. We'll cover creating a graph (starting with structured data to keep things simple; unstructured data comes a little later), some basic Cypher queries, some algorithms, and then vector search and semantic topics.

A knowledge graph is generally defined as a set of design patterns for organizing and accessing interrelated data. At Neo4j we model the data inside the database as what's called a property graph, which has three primary elements. First, nodes: these are your nouns, your people, places, and things. Second, relationships: these are how things relate (hence the name) and often read like verbs: person knows person, person lives with person, person drives or owns a car.
Third, both nodes and relationships can have properties, which are just attributes: strings, numbers, arrays, and vectors too. We've supported storing vectors inside Neo4j for a long time, and you can search over them.

The query language we'll use to access the database is called Cypher, and since a lot of you raised your hands earlier, you already have some familiarity with it. Cypher patterns look like ASCII art: it has a SQL-ish feel, but you write statements like MATCH (p:Person)-[:KNOWS]->(s:Skill), connecting a Person node to a Skill node through that KNOWS relationship, so it reads very literally. Nodes have labels, roughly the equivalent of a table in a SQL database, saying what type of entity something is; and, as I said, nodes have properties, so you can identify one by a property like name. Variables like p and s refer to the matched entities as you build up the rest of your query. This won't be a course on writing Cypher (we could fill an 80-minute course with that alone), but we'll walk through each query, so if you haven't seen Cypher before, don't expect to leave a super expert; just know this is how it works, and as we go you'll get a feel for these queries and the types of data they return.

Is everyone pretty familiar with vector search at this point? Yeah, I had a feeling this audience would be, so I won't spend long on it. Embeddings are basically a type of data compression you can apply to all sorts of things (text, audio, even graphs): a vector of numbers you can use to find similar things within a domain space, texts that are similar semantically rather than just lexically. Within Neo4j you have search indices, including vectors: range indices, uniqueness constraints, text search, full-text with Lucene, and approximate-nearest-neighbor vector search, which we'll be leveraging in combination with the Cypher queries we just looked at to do graph traversals.

The next thing to know: in addition to querying the database, we also have graph analytics on top of it, for data enrichment and more graph-global analysis: finding which nodes are most central according to different algorithms, community detection (how do you cluster the graph), finding paths between nodes, and various embeddings. We have a lot of those algorithms, and we'll touch on them only briefly in the first module, to show that once you have a knowledge graph you can start enriching the data. In our case we'll use community detection to summarize skills inside our graph, and then pass that to an agent so it can explain parts of the graph for our use case.
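To make the pattern concrete, here is a minimal sketch of running that person-knows-skill match from Python. The URI and credentials are placeholders (the workshop's real values live in ws.env), so treat this as illustrative:

```python
from neo4j import GraphDatabase

# Placeholder connection details; the workshop's real values come from ws.env.
driver = GraphDatabase.driver("neo4j+s://<your-instance>",
                              auth=("attendee42", "attendee42"))

# The pattern reads literally: a Person node, a KNOWS relationship, a Skill node.
records, _, _ = driver.execute_query("""
    MATCH (p:Person)-[:KNOWS]->(s:Skill)
    RETURN p.name AS person, s.name AS skill
    LIMIT 10
""")
for r in records:
    print(r["person"], "knows", r["skill"])
```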
All righty, with that in mind, we'll jump into the first notebook. Any questions before we dive in? Yes, over here. Let me go back to the links (they're in the Slack channel too, and if you don't have a number, my colleague Alex can grab you one): we're in the workshop-graphrag-intro Slack channel, so you can pick up the deck and all the links there. Number 160 or below, first Jupyter server; 201 or above, second. Use "attendee", all lowercase, plus your number as both username and password, for the Jupyter notebook and also for the Neo4j browser if you want to follow along with visualizing the graph.

Question: I know this is an introduction to GraphRAG, but when you're building these graphs (I see you have a small one here), how do you decide whether to build one big graph or several smaller graphs? So the question is about data modeling: one graph versus multiple. It's a good question. In general, for a lot of what we're seeing with agents, it helps to have a smaller data model if possible, especially if you're doing dynamic query generation, so keep that in mind. That said, things keep improving: we can pull back the graph schema and offer it to agents, and as language models keep iterating they're getting better and better at interpreting it. Whenever you want low-latency traversals between two data points, those data points really should live in the same graph; then it becomes a question of what you make a label versus a property in that scenario. We'll go through some of it, and if you want to talk afterward, come by our booth for a more use-case-focused conversation.

Anything else? All righty, into the notebook. For this first notebook you can go ahead and restart the kernel, that's fine, then come down here and start. Remember we're in the talent subfolder: there are two workshops in here, and the one we're doing is called talent. (The other has interesting stuff too, but you won't be able to follow along with it.) It looks like I'm running now. First I grab my environment file and load it; if you don't have the environment file in this subdirectory, just move it over from the root directory (it's the ws.env file). Then we load our skills dataset. It's a table with basically three fields: an email, a name, and a list of skills for each person. As I said before, we'll get into how you might extract this from documents like resumes a bit later, but because we're interested in this skills-mapping, team-formation, staffing kind of use case, we're starting with this very simple dataset.
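Condensed, those first cells amount to something like the following sketch. The environment-variable names and the CSV file name are assumptions on my part; check ws.env and the notebook for the real ones:

```python
import os
import pandas as pd
from dotenv import load_dotenv

# Load the workshop environment file (database URI, credentials, OpenAI key).
load_dotenv("ws.env", override=True)
NEO4J_URI = os.environ["NEO4J_URI"]            # variable names assumed
NEO4J_USERNAME = os.environ["NEO4J_USERNAME"]
NEO4J_PASSWORD = os.environ["NEO4J_PASSWORD"]

# One row per person: email, name, and a list of skills.
skills_df = pd.read_csv("skills.csv")          # hypothetical file name
print(skills_df.head())
```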
There are a couple of steps here that just organize the data to make it easy to load, and then we start creating our graph. A lot of this is basic Neo4j data loading: we create chunks out of our dataframe, and you check to make sure you've got nothing in your database yet. (I do have stuff in mine because I was just running this before; that's on me. Yours should say zero.)

The first thing we do is set a constraint. Inside Neo4j, whenever you create nodes, a node key constraint or a uniqueness constraint says, in this case, that the email has to be unique and non-null across all your people, which makes it very fast to match on people and do merge operations. People often say Neo4j is really slow, and it's frequently because of simple mistakes like not setting a constraint, which forces an expensive scan every time you look up a user instead of a unique index lookup. We do the same for Skill, because our data model is Person and Skill, two node types, so we end up with two constraints in the database, one for Skill and one for Person.

After that, we load our nodes and relationships. The way this query works (I won't rerun it, since it won't change anything in my database) is that we loop through chunks of our dataframe and say: merge a Person on email, set their name, and then, for their list of skills, merge each Skill on its name and merge a KNOWS relationship from the person to the skill. That creates the person-knows-skill pattern in the database.
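A condensed sketch of the constraint and the chunked load; the constraint names, batch size, and exact Cypher are approximations of the notebook's, and this assumes the skills column already holds Python lists:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

# Uniqueness constraints make the MERGE lookups below fast.
driver.execute_query("CREATE CONSTRAINT person_email IF NOT EXISTS "
                     "FOR (p:Person) REQUIRE p.email IS UNIQUE")
driver.execute_query("CREATE CONSTRAINT skill_name IF NOT EXISTS "
                     "FOR (s:Skill) REQUIRE s.name IS UNIQUE")

# Load in chunks: merge the person, set the name, then merge each skill
# and the KNOWS relationship connecting them.
rows = skills_df.to_dict("records")
for i in range(0, len(rows), 100):
    driver.execute_query("""
        UNWIND $rows AS row
        MERGE (p:Person {email: row.email})
        SET p.name = row.name
        WITH p, row
        UNWIND row.skills AS skillName
        MERGE (s:Skill {name: skillName})
        MERGE (p)-[:KNOWS]->(s)
    """, rows=rows[i:i + 100])
```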
Once you've run that, if you have that browser window open from before, you can copy one of these queries over. If I just match people, I get my people back (and it looks like I still have my internet connection, cool), with their names and email addresses. You can do the same for skills, and you can also look for relationships, which gets into the Cypher pattern matching we were talking about: a very simple version of matching a path. I'm saying p, a path, equals a node connected by KNOWS to another node, LIMIT 25, and that returns a graph where I can see all these relationships: my people, the KNOWS relationships (this person knows API design, Tableau, Flask), and different skills popping up in the graph. My internet is still somewhat slow, but it comes back. You can also run these through our driver here, to pull back the different people that are in there and find out what skills they have.

Question: is KNOWS a built-in relationship type? No, we're making it up; it's our domain model. I can actually call the schema up here (mine will show more than yours if you run the same command, because it includes some later stuff from the course), but basically we have person KNOWS skill; that's our data model. You could say person HAS_SKILL skill instead; exactly, that would be another way to put it. It's actually funny, because this is becoming even more important now that we're using LLMs to design queries: the language you use acts like an annotation for the model, so naming starts to matter in interesting ways.

All righty, there are some Cypher queries here that I'll run through quickly (depending on time I may need to speed through this notebook, because I want to make sure we actually reach the agent at the end). In the deck there's the browser.neo4j.io/preview link; you use your username, password, and URI, which you can find in the workshop environment file.

Like I said, we'll go through some of these. For example, we can count in Cypher: match person-knows-skill, return the skill name, and count distinct people per skill. Think of it as: for each of my skills, count the distinct people who know it; it's very simple. When it comes back we see our most popular skills, going down the list, all pretty tech-focused.

We can also ask multi-hop questions, which is where it gets interesting. I'll copy this one over to my browser, because these are fun to see visually. We take a person named Lucy and ask: what people are similar to Lucy in the sense of knowing the same skills? Running that, I get Lucy, all of her skills, and then all the other people who know those skills. And you can build on that iteratively: I can append "and what skills do those other people know" to the query and get a much larger graph back. The point is that once this logic is extracted from the original data source, we can control, at a much more fine-grained level, how we define a similar person or a similar skill, because we can traverse the graph and apply concrete logic; it's like having your information in symbolic form rather than only as a sub-symbolic vector. You get a lot back, because now we're looking at people plus all the other skills they know, and I can go in here and find the most central skills among these people.
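The Lucy queries, roughly as described; the name value comes from the walkthrough, so adjust it to whatever exists in your copy of the data:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

# Paste into the browser to see it visually: who shares skills with Lucy?
visual_query = """
MATCH p = (:Person {name: 'Lucy'})-[:KNOWS]->(:Skill)<-[:KNOWS]-(:Person)
RETURN p LIMIT 25
"""

# The same two-hop pattern, ranked by number of shared skills.
records, _, _ = driver.execute_query("""
    MATCH (lucy:Person {name: 'Lucy'})-[:KNOWS]->(s:Skill)<-[:KNOWS]-(other:Person)
    RETURN other.name AS person, count(DISTINCT s) AS sharedSkills
    ORDER BY sharedSkills DESC
    LIMIT 10
""")
for r in records:
    print(r["person"], r["sharedSkills"])
```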
For example, Scrum is very central among this group, because a lot of these people know that skill, so I'm learning about the local community that knows skills similar to Lucy's. The notebook has examples of running that same logic in code.

Question: how do I get the graph view? You go to that Neo4j browser link; ah, I see: there's a table view and a graph view, and if you're not returning nodes it will only show a table. If you just say RETURN p, it should return the path; when you don't see a graph, it's usually because you're returning just a name or something, in which case you get a list of names. And do I need the DISTINCT here? Actually, I don't think it's completely necessary for this one. There are times, with very complicated multi-hop paths (we'll see a couple of examples), where you can get two identical paths back, and having the DISTINCT lets you filter them.

Question: how good are LLMs at writing Cypher? That's hard to say; they're getting better. A lot depends on the complexity of your schema. For simpler aggregation queries, or with a lot of prompt engineering around specific types of path queries, even a smaller model can do well. We do often recommend having your own expert tools when there's a really complicated traversal you need: you can write your own Python functions, or run your own MCP server with a set of functions for those traversals. You can also restrict the options for the LLM: instead of asking it to write a complete query, tell it there are, say, three general patterns, have it write just that part of the pattern, and slot it into a larger query. We've also released fine-tuned models (back in April, I believe; fine-tuned from Gemma, available on Hugging Face), so you can try those as well; they do a little better. We won't use them here, though, because we're using OpenAI models for this workshop.

A lot of this section is just running these queries. Here the DISTINCT matters: you return a distinct name because you might reach the same person multiple times, and if you're returning names and skills it's important to use DISTINCT so you get each skill that showed up in the graph exactly once.

Another thing that matters for our use case is finding similar people. This reuses the query we were just going over; we used Lucy before, but now we parameterize the person: match person-knows-skill, then match from that skill out to other people, and count the number of shared skills between them.
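Here's a hedged sketch of that counting step, jumping slightly ahead to also materialize the result as a relationship (which is what the next cells do via a local dataframe; collapsing it into one Cypher pass is just an alternative, and the relationship and property names are my guesses):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

# Count skill overlap between each pair of people, then persist it.
driver.execute_query("""
    MATCH (p1:Person)-[:KNOWS]->(s:Skill)<-[:KNOWS]-(p2:Person)
    WHERE p1.email < p2.email              // visit each pair once
    WITH p1, p2, count(DISTINCT s) AS overlap
    MERGE (p1)-[r:SIMILAR_SKILL_SET]->(p2)
    SET r.overlap = overlap
""")
```

Because MERGE is idempotent, rerunning this after new data lands simply refreshes the overlap counts.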
When we do that, we can see the number of shared skills between different individuals, and we do it again here for a different set of people; this one, I think, counts the most skills shared between any two people. So that's another way to measure similarity beyond semantic similarity: measuring exactly what our model tells us about shared skills.

One thing we can do to speed up these queries (this is optional) is, if we know we'll be looking for similar skill sets a lot, create a similar-skill-set relationship inside our graph: match two different people and merge a SIMILAR_SKILL_SET relationship based on the overlap of their skill counts. We have the dataframe locally for that; we just pulled it back when we were looking at those similar skills. What this creates is the SIMILAR_SKILL_SET relationship between people, and if I look at it over in my browser, I see the overlaps: some are one, others are greater, up to three in here, I think. All this does is store the overlap between two people, so we don't have to run the full traversal over and over again.

Question: but if you do this, it's static; if you want to update the overlap, you need to run the query again? You would need to run it again, yes. It depends on how often your data gets updated. And there's nothing actually wrong with doing the multi-hop query over and over; the graph database is designed to handle that. If your graph were constantly being updated, you might not need to create this relationship at all.

The next thing I wanted to show you is how our graph analytics works inside Neo4j, and how to use it to enrich the graph. This creates what we call a GDS (Graph Data Science) client; we create something called a projection and then run an algorithm called Leiden. How many people here have heard of Leiden as an algorithm? Okay, just a few. How many have heard of Louvain as a graph algorithm? Okay, a couple. This algorithm breaks the graph down into a hierarchy: it starts by breaking the graph into a few big communities and descends into smaller ones, optimizing a metric called modularity, which says: I want clusters in my graph where connections within a cluster are dense and connections across clusters are sparse. I'm using that SIMILAR_SKILL_SET relationship for this, which is another important reason to create it (materialized relationships like that help these analytics run better): person connects to person by similar skill set, and running Leiden over that gives us communities of people who know similar skills.
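As a sketch of that projection-plus-Leiden step with the GDS Python client; the graph name and write property are my choices, the relationship is the one materialized above, and note that Leiden requires an undirected projection:

```python
from graphdatascience import GraphDataScience

gds = GraphDataScience(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

# Project people connected by SIMILAR_SKILL_SET into memory, undirected.
G, _ = gds.graph.project(
    "skill-communities",
    "Person",
    {"SIMILAR_SKILL_SET": {"orientation": "UNDIRECTED",
                           "properties": "overlap"}},
)

# Write a community id back onto each Person node.
gds.leiden.write(G, writeProperty="leidenCommunity",
                 relationshipWeightProperty="overlap")

G.drop()  # free the in-memory projection when done
```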
This is all simulated data, but going down (I'll skip some of it; there are checks on how good the communities are, and I encourage you to run everything so that the agent works well at the end), what I wanted to show you is this graphic. We wrote the community-id property back to the graph, and this heat map shows which communities have the most skills in a given area. Since the data is randomly generated, some of these patterns will look a bit funky if you really dig into them, but with more realistic data this can actually show you: your data engineers are here, your front-end folks are over here, your ML people are over there. You can start to see that within the graph and really break it down. Do go ahead and run everything here, through the G.drop(), so that you have that community property.

Question: when do you customize your graph, for example with this community detection algorithm, and when do you just let the agent handle things? Are there heuristics for when it's better to invest time improving the graph? I think it depends on your use case. If you're really interested in understanding the skill communities inside your company, if that's a question that comes up frequently, then graph analytics can be very beneficial: you can do employee segmentation, you can understand performance within different groups, and we often see the same tools used for customer segmentation and recommendation systems. At the same time, if you just want to look for matches of people with similar skills, maybe you don't need community detection, because that's just a pairing exercise. So: use it whenever you want to do some sort of clustering analysis, persist the result, and have visibility (and confidence) into how that clustering was done, rather than leaving it to a model that's just making something up. Follow-up: so you look at what the users are doing and then decide whether to enrich the graph? Yeah, exactly.

Question: what is the heat map showing? How often different skills show up within each community. Aren't the communities based on what skills people have? Yes; the first community, for example, looks like Tableau or Swift. The heat map helps you understand the skill breakdown within each community, and again, this is generated data, so it's a bit random, but you can imagine that in a non-random scenario you'd see groupings emerge: a lot of product managers versus front-end developers versus DevOps folks.
Two connected questions: first, do you have best practices for data modeling so that an agent understands the data model, or just general graph best practices around creating data models? A lot of the agent stuff is evolving very quickly as LLMs keep changing and getting better. We've long had guides on data migration from relational systems to graph and how to think about that; in graph, again, nodes are nouns, relationships are verbs, and you connect them accordingly. For agents, I think it's really nice when the data model reflects natural language: person-knows-skill is a very natural-language way of saying something that translates directly into a data model. And as I was saying before, simpler data models seem to work better when you do dynamic query generation. Beyond that, I know "it depends" is a cop-out answer, but it is true: it depends on the type of retrievers you have, the size of your data, and the cardinality of different categories in your data. You generally want to avoid having hundreds or thousands of node labels, for instance, because it's just a lot; make those properties instead. So there's a lot to consider; I hope that answers part of the question.

For the second part: at the end of module three (which we might not have time to reach, but you'll see it in the code and I'll show it quickly) there are functions to pull the node labels and the relationship types back from the graph schema, so you can create a JSON-ish representation of what the schema looks like and combine it with specific prompts. Another thing that helps even more, if you have a data model that isn't going to change a lot over time, is annotating that schema: for specific properties, node labels, or relationship types, say "this thing means this; when you ingest data, put it here; when you pull data, follow these paths." And putting the actual query patterns, like person-knows-skill, into the schema helps a lot too, because the model can read them and understand how to do the traversal better. All right, anything else? All right, cool.
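As a rough illustration of that schema-pulling idea (the exact helper used in the notebook may differ; these built-in procedures are just one way to get at the same information):

```python
import json
from neo4j import GraphDatabase

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

# Pull node labels and relationship types straight from the database.
labels = [r["label"] for r in driver.execute_query(
    "CALL db.labels() YIELD label RETURN label").records]
rel_types = [r["relationshipType"] for r in driver.execute_query(
    "CALL db.relationshipTypes() YIELD relationshipType "
    "RETURN relationshipType").records]

# A compact, annotated schema description to hand to an LLM prompt.
schema = {
    "labels": labels,
    "relationshipTypes": rel_types,
    # Hand-written query patterns, as discussed above.
    "patterns": ["(:Person)-[:KNOWS]->(:Skill)"],
}
print(json.dumps(schema, indent=2))
```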
We're actually getting close on time, so I'm going to go pretty quickly through the rest of this, but hopefully it stays understandable. Another way to start thinking about skills and the relationships between them is semantic similarity, and for that we can make embeddings of our skills. There's another CSV file that you read into this notebook with skills, descriptions, and an embedding. Which field do you think we embedded, the skill or the description, and why? Right: when you have really short names like R (technically a programming language, though a lot of people don't love it; I love R) or AWS, they're just too short, so embedding a description of the skill gives you a much more informative embedding. That's the whole idea: we give each skill a description and then embed the description. These are all text-embedding-ada vectors, so they're 1536-dimensional. We create a vector property (this is just loading the vectors in chunks), set the description as well, and then create our vector index, which we call the skills embedding. Once that's set up (you'll see the skills-embedding index show up here), we can do vector search on skills inside the graph. If I take "Python" as a skill and use this Cypher command to search the skills embedding for the 10 most relevant skills, it brings back things like Ruby and Java; we got pandas at least, that's good; Django, PyTorch. Some hits are better than others, but the point is that we can apply these vectors and pull information back with vector search.

Another interesting thing: if I have something that isn't in the database, say I'm just looking for "API coding", and I search that as a term, I'm using the OpenAI client (actually, it looks like LangChain up here) to embed the text, then doing a search on the database to pull back relevant skills above a certain similarity threshold, and I get back API design and JavaScript for that "API coding" example. And given this ability to do semantic similarity in the database, I can actually write a SIMILAR_SEMANTIC relationship and attach a score to it.
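A hedged sketch of that index-plus-search step; the index name follows what it's called in the walkthrough, and the embedding model matches the stated 1536 dimensions, but treat both as assumptions:

```python
from neo4j import GraphDatabase
from openai import OpenAI

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

# Create the vector index over Skill.embedding (no-op if it already exists).
driver.execute_query("""
    CREATE VECTOR INDEX `skills-embedding` IF NOT EXISTS
    FOR (s:Skill) ON (s.embedding)
    OPTIONS {indexConfig: {
        `vector.dimensions`: 1536,
        `vector.similarity_function`: 'cosine'
    }}
""")

# Embed an ad-hoc term and find the most similar skills in the graph.
openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
query_vector = openai_client.embeddings.create(
    model="text-embedding-ada-002", input="api coding"
).data[0].embedding

records, _, _ = driver.execute_query("""
    CALL db.index.vector.queryNodes('skills-embedding', 10, $vec)
    YIELD node, score
    WHERE score > 0.9              // similarity threshold, tune to taste
    RETURN node.name AS skill, score
""", vec=query_vector)
for r in records:
    print(r["skill"], round(r["score"], 3))
```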
There are some advantages to materializing that relationship, but a big one is visualization, and also clustering. If I take the command that returns all the semantically similar skills, put it into my graph view (assuming this internet speeds up), and zoom in, I start to see interesting groupings: my cloud skills (Azure, AWS, cloud architecture) all in one place; Flask and Django connected; a data-analytics group with Tableau, Power BI, and data visualization; and a big grouping over here where the JVM languages (Java, Scala, Kotlin) sit together, with my Python stuff and pandas nearby, and, going up in that connected group, all the front-end frameworks. Don't underestimate the power of being able to visualize similarities. I can create communities from these, and I can use them for customized scoring in my retrieval queries, which we'll see. The other really cool thing: if for some reason I don't think Java should be connected to Python, I can control that; I can remove the relationship, and from then on my similarity relationships are curated, filterable things.

Quick question: is semantic similarity only for visualization, or is there another use? We'll see in a bit, because what we can start doing, which is actually pretty cool, is create customized scorings that balance semantic similarity against hard skill matches, weighted however we want (I'll sketch this below). So you can use it in your retrieval patterns as well, to improve things. And you can use both: there are a lot of workflows now, because you can compose things together into multiple steps, pulling similar skills and then looking for people; it just depends on how you break those functions down for the agents. Sometimes, when you know there's a very specific pattern you want to follow (like this one: it looks like a really big, intimidating query, but all it does is weight semantic similarity against a hard overlap of similar skill sets), coupling the logic together makes sense, because there's a precise metric you want for similarity.

Question: but when you're starting from a query, say you have an assistant and a much larger ontology, and the user asks "tell me about people with certain skills": how do you know these are even the entities you're looking for? What's the first step when you go from query to retrieval; how do you break it into entities to know which query you should run? Why don't we revisit that when we get to the third module; the second module should be really quick, and seeing the functions in module three should help me answer that question a little better.

Question from the audience, about using Neo4j versus an existing store like Postgres for this: you could go either way, honestly. A lot of it comes down to cost considerations: how expensive it is in Neo4j versus inside Postgres, which varies a lot depending on the type of infrastructure you have. Having everything in one place means you don't have to sync your data, and the query latency is, at least in theory, lower because you're querying a single database. But if you already have data in Postgres, or you already run a specialized vector database, you don't necessarily have to migrate your data to Neo4j to make this work. So a lot of it is performance, but really cost-per-performance: what each deployment costs you.
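To make that big query less intimidating, here's a deliberately simplified sketch of the blended scoring. The weights, the relationship names, and the score property are assumptions, and a production version would prune the skill-pair comparison instead of crossing every pair:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

# Blend hard skill overlap with semantic closeness between skills.
records, _, _ = driver.execute_query("""
    MATCH (p1:Person {email: $email})-[:KNOWS]->(s1:Skill)
    MATCH (p2:Person)-[:KNOWS]->(s2:Skill)
    WHERE p2 <> p1
    OPTIONAL MATCH (s1)-[sem:SIMILAR_SEMANTIC]-(s2)
    WITH p2,
         sum(CASE WHEN s1 = s2 THEN 1.0 ELSE 0.0 END) AS hardOverlap,
         sum(CASE WHEN sem IS NOT NULL THEN sem.score
                  ELSE 0.0 END)                        AS semanticOverlap
    RETURN p2.name AS person,
           0.6 * hardOverlap + 0.4 * semanticOverlap AS blendedScore
    ORDER BY blendedScore DESC
    LIMIT 10
""", email="john@example.com")  # hypothetical email
```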
Another cool thing about a graph database specifically: I'll take this query here just so you can see what it looks like. I'm looking for similarity between two people, and you'll see this star-dot-dot syntax: with a graph you're allowed to write what are called variable-length queries. I'm saying: go out along SIMILAR_SEMANTIC, anywhere from zero to two hops, between these skill sets before you find a connection between John and Matthew, and then I UNION it against the plain person-knows-same-skill pattern. When we get that back you see Matthew knows React and John knows HTML, and those are similar because both have a semantic similarity to JavaScript; another pair here is similar at just one hop, which is where the variable hop comes in. So you can control how far out either path goes when pulling back similarities between people. That's simply an advantage of a graph database.

I'd take a break here, except we only have 23 more minutes, so what do you say, should we just power through the last 20 minutes? Yeah, let's do that.

We've looked at some of the advantages of using the graph and the semantic similarity inside it; now for our second module (I won't go back to the slides; for this audience I can hop right into the notebook): what if we just have resumes, not a CSV file? This is a simple example that shows how to take data from text and turn it into useful data for the graph. If you're running this live, you connect with the same workshop file you should have from before, test the connection, and make sure you can count; you should now actually get 154 nodes. Come by our booth and we can show you much more exciting examples than the two text blobs here, but here we have two different bios, and the way you can do this (if you've already done some entity extraction, you're probably familiar with the workflow) is to define our domain model in terms of Pydantic classes. I define my Person with a name, an email, and a list of skills; Skill just has a name property; and there's a person-list class around them. (If you added relationship properties, say a more complicated model where KNOWS carries a proficiency property, the person would instead hold a list of knows-skill objects, and a KnowsSkill class would have the skill inside it. This is the very simple version.) Once our Pydantic classes are defined, we create our system message as the prompt for the model (I used 4.1 here), give it the documents to ingest, and it spits out some JSON with those two people: we had two documents, each corresponding to one person, and we got their emails and skills with all the different names.
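A minimal sketch of that extraction step, assuming OpenAI's structured-output parsing; bios_text is a hypothetical variable holding the two bios, and the model string is my reading of the transcript:

```python
from pydantic import BaseModel
from openai import OpenAI

class Skill(BaseModel):
    name: str

class Person(BaseModel):
    name: str
    email: str
    skills: list[Skill]

class PersonList(BaseModel):
    people: list[Person]

bios_text = "...the two short bios go here..."  # hypothetical input

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4.1",  # model string assumed
    messages=[
        {"role": "system",
         "content": "Extract each person, their email, and their skills "
                    "from the text."},
        {"role": "user", "content": bios_text},
    ],
    response_format=PersonList,  # the Pydantic domain model constrains output
)
people = completion.choices[0].message.parsed.people
```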
Once we have that, loading it is pretty trivial and very similar to what we did last time. In fact, if we go down to the graph creation here, this isn't exactly the query we had before, but it's very close: we ingest one person at a time, merging on the email address (which we have indexed), setting the name, and then, for each skill in that person's list, merging the skill name and the KNOWS relationship connecting them. And of course I can go back to the graph (these are Neo4j employees that I loaded), look one of them up, and get them back with the different skills that were extracted and put into the database.

So: very, very simple. We also have our own GraphRAG Python package, which is very good for this, and our Knowledge Graph Builder, a kind of reference UI whose code has more examples around things like document chunking with overlap, and of course multi-threading with async and so on, so you're not just running a for loop over a bunch of bios. We have all of that; stop by the booth and we can give you more. But like I said, especially for this crowd, since you're already familiar with the workflow, this is a very short module.

Question: if we add new people this way, do we have to redo the enrichment? Yes. Ideally I'd have done this in an order where I loaded these first and then ran the clustering and materialized the derived relationships with weights. Follow-up: so the communities are downstream of what you're doing; you're adding extra links between the nodes, with weights on them, whether for semantic distance between skills or for some x-hops-away distance computation that you've materialized? Yeah, and ideally we'd rerun that in a recurring way as we upload data. In this case I created, I think, some new skills in addition to the people, so we'd redo the similar-skill-set relationships (and get a couple more there), and since there are new skills, we'd also create new semantic relationships between the skills. Okay, thank you.

All right, any other questions before we move on? All right, that was a very quick module, so I'll go over a few other related topics very quickly in my slides. What we just saw was an example of what I'd call entity extraction, or named entity recognition, where we take a document and literally break out the people, places, things, and relationships from within it. There are other things we can do: for certain types of documents, like a catalog or, in this case, RFPs, we can break things out by the actual document structure. I'm only going to walk through this so you understand that there are different types of extraction we can do to create graphs.
For example, if you know what the anatomy of a document is, as with an RFP that we know is designed with different sections (an intro, objectives, the proposal) and subsections within those, we can create a graph out of that structure too. I'd call this more like document extraction: taking the metadata of the document and modeling it as a graph. The advantage of doing things this way is that, as you embed the different pieces and put them into a knowledge graph, you can do searches over entities that come from different chunks and traverse up and down the document hierarchy to find things, which can be very helpful when your documents have repeated structure; if you know how entities connect across the structures of those documents, you can incorporate that into your graph retrieval queries. It also gives you a way to do community summaries: we saw Leiden before, but if your documents give you a natural hierarchy, that gives you another way of summarizing information across them.

Question: why do you have entities, documents, and chunks in the same ontology, as opposed to extracting entities into a separate ontology, apart from documents and chunks? Why combine the two? Well, when they're combined, you can do traversals between them. What's an example of a traversal you'd want to do between entities and chunks? Legal contracts are a good example: say you want to search for particular legal clauses, but the expiry date might be somewhere else in the document. Just making sure I understand: so in a legal document, say a data-protection clause across multiple vendors, comparing the language; is that a use case? Yeah, exactly; there might be a perpetuity piece, or dates and different things, and you want to be able to traverse over the document to find those in addition to the entities. Anything else? All right, I just wanted to introduce that as another example of how to approach extraction.
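Picking up that document-structure idea, here's a hedged sketch of loading one chunk of such a hierarchy. Every label and relationship type here (Document, Section, Chunk, HAS_SECTION, HAS_CHUNK, MENTIONS) is an illustrative choice, not a schema from the workshop:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))

# Model document anatomy as a graph: Document -> Section -> Chunk, with
# extracted entities hanging off the chunks they came from.
driver.execute_query("""
    MERGE (d:Document {id: $docId})
    MERGE (sec:Section {id: $docId + '/' + $section})
    SET sec.title = $section
    MERGE (d)-[:HAS_SECTION]->(sec)
    MERGE (c:Chunk {id: $chunkId})
    SET c.text = $text
    MERGE (sec)-[:HAS_CHUNK]->(c)
    WITH c
    UNWIND $entities AS entityName
    MERGE (e:Entity {name: entityName})
    MERGE (c)-[:MENTIONS]->(e)
""", docId="rfp-001", section="Objectives",
     chunkId="rfp-001/objectives/0",
     text="(chunk text)", entities=["data protection", "expiry date"])
```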
A question from the audience: why do you have entities and documents and chunks in the same ontology, as opposed to extracting entities into a separate ontology, separate from the documents and chunks? Well, when they're combined you can just do traversals between them. What's an example of a traversal you'd want to do between entities and chunks? Legal contracts are a good example: you might want to search for particular legal clauses, but then the expiry date might be somewhere else in the document. To make sure I understand, came the follow-up, in a legal document, say a data protection clause across multiple vendors, comparing the language, is that the use case? Yes, exactly: there might be a perpetuity piece, or dates and other things, and you want to be able to traverse over the document to find those in addition to the entities. Anything else? All right, I just wanted to introduce that as another example of how to do things.

For the third module, since we're already at 10:07, let me jump straight into it so you get to see it. I'll go over to module three, and this is going to be very simple; I think we already asked at the beginning how many people have experience building agents. It's a LangGraph agent we're going to make here. The setup is similar: with our environments file we connect to Neo4j and test the connection. There are four tools we want to build: retrieve the skills of a person; retrieve skills similar to other skills; retrieve similar people, say if we wanted to find who else would be a good person to work on something; and retrieve people based on a set of skills.

In this example we're going to do a lot of tools first; at the end of the notebook there's the text-to-Cypher material where you pass in the schema, but what we're going over first is putting these different tools together, and we do that with graph patterns, the same graph patterns we've been going over. For example, if we just want to find the skills someone knows, it's very simple: a person matched to their skills. As you go down the notebook, what you're seeing is all the different patterns. The second one, and I said retrieving people with similar skills at first, but I apologize, this one is actually searching for skills, uses the vector index and the semantic-similarity relationship. If a user puts in some skills, those might not match the skills in the database word for word, so you use vector search to pull out the specific skills plus whatever is semantically similar to them, with scoring thresholds to pull back exactly what we want, and that returns skills. If we had, say, continuous delivery, cloud native, and security, these would be the kinds of skills pulled back.

For person similarity there are a few ways to do it, and we talked about this a lot toward the beginning. We can do it by community: look for people that know various skills, get their names, and then use that Leiden community we created and all the skills those people know; at that point we're looking for people inside the same skills community. The other way is to use the similar-skillset relationships, the hard-coded relationship we made before, which looks at actual skill overlap, who knows what inside the graph. That brings back some answers: we were looking at John Garcia, saying find similar people, and we get a score, a count of overlap, against the other people here. Then we can add in semantic similarity. That's where we get this big query, but what it's actually doing is balancing the similar-skillset score against the semantically-similar-skillset score: it takes both scores and adds them together, and from there we get a floating-point score and a somewhat different answer, one based not just on hard skill connections but also on skills that are close to each other, and we can weight those two terms independently.

We can also recommend people given a set of skills: we do a vector search on those skills, and (here it is) the query was broken out into two parts just because it's a big thing to look at, but the idea is we do a vector search on skills, get semantically similar skills, and then find the people who know those skills, very much like the last ones, get a skill count for each of those people, and return the people.
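A sketch of that skill lookup, assuming a vector index named skill_embeddings, a SIMILAR_SEMANTIC relationship between skills, and OpenAI embeddings; all of these names are placeholders for whatever the notebook actually created:

```python
# Embed the user's skill terms, query the vector index, apply a score
# threshold, then expand one hop over the semantic-similarity relationship.
from langchain_openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings(model="text-embedding-3-small")

similar_skills_query = """
CALL db.index.vector.queryNodes('skill_embeddings', 5, $queryVector)
YIELD node AS skill, score
WHERE score > 0.85                        // threshold to cut off weak matches
OPTIONAL MATCH (skill)-[:SIMILAR_SEMANTIC]->(neighbor:Skill)
RETURN skill.name AS skill, score,
       collect(neighbor.name) AS semanticallySimilar
"""

vector = embedder.embed_query("continuous delivery, cloud native, security")
records, _, _ = driver.execute_query(
    similar_skills_query, queryVector=vector, database_="neo4j")
```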
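And a sketch of the blended person-similarity scoring, combining the overlap score stored on the hard-coded similar-skillset relationship with a semantic term, each weighted independently; the SIMILAR_SKILLSET and SIMILAR_SEMANTIC relationship names and their score properties are assumptions:

```python
# Blend hard skill overlap with semantic closeness of skills; the two weights
# let you tune how much each signal contributes to the final ranking.
person_similarity_query = """
MATCH (p:Person {name: $name})-[r:SIMILAR_SKILLSET]->(other:Person)
WITH p, other, r.overlap AS overlapScore
OPTIONAL MATCH (p)-[:KNOWS]->(:Skill)
      -[s:SIMILAR_SEMANTIC]->(:Skill)<-[:KNOWS]-(other)
WITH other, overlapScore, coalesce(sum(s.score), 0.0) AS semanticScore
RETURN other.name AS person,
       $overlapWeight * overlapScore + $semanticWeight * semanticScore AS score
ORDER BY score DESC
LIMIT 5
"""

records, _, _ = driver.execute_query(
    person_similarity_query, name="John Garcia",
    overlapWeight=1.0, semanticWeight=0.5, database_="neo4j")
```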
When we actually define the functions for our agent, we create a Skills object, which basically just helps us with some of our function arguments and returns. The first tool, retrieve skills of person, is a very simple query. Then down here, for tool two, find similar skills, we use that query again where we do the semantic similarity between skills: a vector search to find skills, then going out one hop on semantic similarity, then collecting everything and returning it. For the third one we do the weighting, because tool three is for person similarity: the weighting between the semantic similarity and the similar skill set, with that larger query. And, I know I'm going through this kind of quickly, for the fourth one, find person based on skills, our entry point is again a vector search on skills, going out to match the semantically similar skills, but at the end of that we add a traversal that attaches the people who know those skills, counts who knows the most, and effectively returns that. Those are the four tools we're going to end up using for this agent.

When we set up the agent, if you're familiar with how LangGraph works: we get our LLM, test that it's alive, and define our list of tools. If we didn't want to do this in an agentic way, we could just bind our tools to our LLM and invoke the LLM with tools; that part of the notebook is just showing invoking the individual tools. But what we're going to do instead is run it with an agent. We use create_react_agent, which comes from LangGraph; it's one of their pre-built agents that uses the ReAct methodology (I don't know if you'd call it a framework) to build an agent. Once we do that, we give it the LLM and the four tools we had, and we test it by saying hi and making sure we get some response back. There's a utility function here just to make it easier to run in the notebook, which handles some of the streaming. So I can say, hey, what skills does Christoph have? If I run that (I don't know if I need to rerun my agent here; it looks like not everything's running, and actually I ran it for the wrong question there, "what skills does the person named Christoph have"), it brings back his skills, and you see it chooses to use retrieve-skills-of-person. Similarly, if I go down and say, what skills are similar to Power BI and data visualization, it goes through and chooses the appropriate tool for the job, in this case find-similar-skills, and pulls those back. And going down further, if I say, what person has similar skills to this other person here, it knows it needs person similarity and uses that specific tool. So what we're doing here is providing a bunch of presumably expert tools to the model, and it knows it has to pull those specific tools to be able to provide a response.
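Putting it together, a condensed sketch of the four tools wired into LangGraph's prebuilt ReAct agent; the tool names and bodies are illustrative (they reuse the driver and query strings from the sketches above), not the notebook's exact code:

```python
# Wrap the four retrieval queries as LangChain tools and hand them to
# LangGraph's prebuilt ReAct agent, which picks the right tool per question.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

def run(query: str, **params) -> list[dict]:
    """Helper: run a Cypher query and return plain dicts."""
    records, _, _ = driver.execute_query(query, database_="neo4j", **params)
    return [r.data() for r in records]

@tool
def retrieve_skills_of_person(name: str) -> list[dict]:
    """Return the skills a person knows."""
    return run("MATCH (:Person {name: $name})-[:KNOWS]->(s:Skill) "
               "RETURN s.name AS skill", name=name)

@tool
def find_similar_skills(skills: str) -> list[dict]:
    """Vector-search the skill index, then expand one hop of semantic similarity."""
    return run(similar_skills_query, queryVector=embedder.embed_query(skills))

@tool
def find_similar_people(name: str) -> list[dict]:
    """Rank people similar to a person by blended overlap and semantic score."""
    return run(person_similarity_query, name=name,
               overlapWeight=1.0, semanticWeight=0.5)

@tool
def find_people_with_skills(skills: str) -> list[dict]:
    """Vector-search skills, then count which people know the most of them."""
    return run("""
        CALL db.index.vector.queryNodes('skill_embeddings', 5, $queryVector)
        YIELD node AS skill
        MATCH (p:Person)-[:KNOWS]->(skill)
        RETURN p.name AS person, count(skill) AS matchedSkills
        ORDER BY matchedSkills DESC LIMIT 5
        """, queryVector=embedder.embed_query(skills))

llm = ChatOpenAI(model="gpt-4o")
agent = create_react_agent(llm, [retrieve_skills_of_person, find_similar_skills,
                                 find_similar_people, find_people_with_skills])

response = agent.invoke(
    {"messages": [("user", "What skills does Christoph have?")]})
print(response["messages"][-1].content)
```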
There's a little app down here as well if you want to run the chatbot, a little Gradio app. If I run that, I can come in here and have a conversation with it. This is very small, but I can ask, what skills are similar to such-and-such, and provided everything's working it chooses the appropriate tool; then I can say, well, who knows those skills, and it should pull the appropriate tool to find out who knows all those different skills. If I go back to my example here, I should see the query logic it used: first it said find similar skills to what I mentioned, because I asked about Power BI, and after that I asked about people who know skills, so it said find persons based on similar skills. Likewise, I can say, who is similar to those people in the graph, and it should understand it needs the find-other-similar-persons tool. You'll see, if I keep going down, I get the calls here; yes, there it is, person similarity: it just called person similarity for each person.

I know we only have a few minutes left. If you want to take this further, I have a text-to-Cypher example. This is where I pass it the annotated schema, which gets at what you were asking about earlier: I've provided these descriptions, sort of annotations for the schema, and then I give that to an aggregation query function; there's also an LLM inside there that creates the Cypher. You can see I asked questions like "describe communities", and it understood that it needed to match person-knows-skill and grab the Leiden community; it knew from the schema that it needed to generate that Cypher. There are a couple more examples of that in the notebook.
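A hedged sketch of that text-to-Cypher step: hand the LLM an annotated schema (labels, relationships, and short descriptions) and ask it to write the query. The schema text and prompt here are illustrative only, not the notebook's aggregation query function:

```python
# Text-to-Cypher sketch: the annotated schema doubles as documentation the
# LLM can lean on when translating a natural-language question into Cypher.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

annotated_schema = """
(:Person {name, email})        // an employee; email is unique
(:Skill {name})                // something a person can know
(:Person)-[:KNOWS]->(:Skill)   // who knows what
Person.leidenCommunity         // community id from Leiden clustering
"""

prompt = ChatPromptTemplate.from_messages([
    ("system", "Write a single Cypher query for the schema below. "
               "Return only Cypher, no prose.\n{schema}"),
    ("user", "{question}"),
])

llm = ChatOpenAI(model="gpt-4o")
chain = prompt | llm

cypher = chain.invoke({"schema": annotated_schema,
                       "question": "Describe the skill communities."}).content
print(cypher)
```

From there you'd run the generated Cypher with the driver and feed the rows back to the model for the final answer.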
Are there any questions? I know I just went over a lot; anything worth answering now while we have a couple of minutes left? Someone asked how long the Jupyter server will be up if you want to play with this: the Jupyter server is going down very quickly, but at the end of the deck I have a link to the code, and the data is all in GitHub too, so if you go there, that's the GitHub repository, and you can play with it. The deck itself is in the Slack channel, if you have access to that; I'll jump to it in a second. With the GitHub repository you can use the Aura console, where we have a free trial, so you can set up a cloud database and load the data into there. Also, before you leave: there's a meetup happening tonight at 5:00, and there's a link there for more information. And we have another workshop at 1:00; where today was very simple, that one will go over more graph-analytics-type material, like the community work I was doing, in much more depth. Other than that, come by our booth if you have more questions; we'll be right over there (I don't think we have a big expo hall). If you want to see Neo4j MCP servers, ADK examples, or more knowledge graph construction, it's a great place to come ask all those types of questions.