Intro to GraphRAG — Zach Blumenfeld

So as you come in, we have servers set up with everything you'll need if you want to follow along. You should have gotten a post-it note; if you don't, just raise your hand and my colleague Alex over here will come find you and provide you with one. Basically, if you have a number that's 160 or below, you go to the first link here (there's a QR code on top as well), and if you have a number that's 201 or above, you go to the second link or QR code. From there I'll give you some directions: you're going to clone a repo and then move an environment file over really quick. Also, for everything in this deck I created a workshop-graphrag-intro Slack channel, so if you're part of the AIE Slack group you can go there, grab this deck, and get the links, or however you'd like to do that. We'll get started here in just a couple of minutes.

All right, so, sorry: your number will give you your username and your password. It's basically "attendee", all lowercase, then your number, and that'll be both. There's also the other link that you should try to open, the browser preview, but I'll walk you through that in a second. I'll give it just another minute here for everyone to file in and get situated. Again, the username and password are going to be "attendee", all lowercase, and then the number that you have; the username and the password are the same.

The servers and notebooks will be down after the session, but you have the GitHub link you can go back to, so the code is there for you to use; it's just that the environment won't be available afterward.
All righty, so I'm going to go ahead and get started. I'll leave this screen up for a second, so if you want to grab the QR codes, now would be the time to do it; obviously you can also go to the Slack channel and pick up this deck.

So we're going to do an intro to GraphRAG workshop today. I was debating what to actually put in this workshop, since everything's changing so quickly, and some of my colleagues convinced me not to make it too complicated, so this is going to be very much an introductory-level course. If you want to look at more advanced GraphRAG techniques, or integrations with things like MCP, we have that at our booth, and we have some other events tomorrow that go over some of that; I'll have links to all of it as we go through. Basically, we're going to get everything set up, which hopefully should only take a few more minutes, and then we have three modules. We'll go over some graph basics: we'll be using Neo4j today, which is a graph database, and we'll cover how to query it and how to construct your logic to retrieve data. We'll also have another module on unstructured data and how to do entity extraction.
All right, by the way, how many people have used Neo4j, like have written Cypher queries? Okay, some folks in here. And how many people have used LangChain before? Okay, a fair number of you; that's good to know. Then in our third module we'll use LangGraph: we'll build a very simple agent that uses some retrieval tools, and you'll get to see how some of that works. We'll wrap up after that with some resources. So make sure to ask questions straight away; raise your hand. I'll stop intermittently, but we only have 80 minutes, so if you do have a question, go ahead and raise it and we'll get it answered, because we're going to be moving through the material a little bit quickly.

As I said before, we have two Jupyter servers set up, so you don't need to pip install anything; you can just connect to the notebooks. Attendees, as I already explained, you should have a number; if you don't, go ahead and raise your hand. The username and password are just going to be "attendee" followed by your number. If your number is 160 or less, you go to the first link; 201 or larger, you go to the second link. Also, if you can, go ahead and open browser.neo4j.io/preview. I'll show you in a little bit; you're going to log into that as well, and it will let you visualize the graph a little bit better as we start putting data into it. Yes, all right, Alex, we can get a number over here.
All righty, once you're inside your environment, what I want you to do is run these two commands. You're going to open a terminal window in Jupyter (you can do that by pressing the little plus sign) and then run git clone. The command should actually be in the README: it says git clone and gives you the link to the repository you need to clone. Once you've done that, copy the workshop environment file over into the genai-workshop talent folder. That environment file, ws.env, has the connection information for a database that's already set up, and it also has an OpenAI key inside it that we'll be able to use for the workshop.

Just to show you what this looks like: if I go over here and look at my terminal (I've already done this; let me make this a little bigger), you just run git clone. If you go to the README that was in the main folder, it has the GitHub URL, so you basically run git clone with that, and after that you copy the workshop file into the genai-workshop talent directory. You'll get that talent folder as a subdirectory in here, and you just have to copy the file in; it has the resources you'll need to log into the browser and to connect through your notebooks.

The other link inside the deck is the browser.neo4j.io/preview link. Basically, that gives you a way to visualize the graph. So if I go ahead and disconnect and then connect to an instance, you should get a screen that looks like this. That workshop file will have the credentials, and it should actually be the same thing for you: "attendee", your number, and the same thing for the password. Go ahead and make sure you do that, because then you'll be able to visualize the graph a little bit better.
So for the Jupyter environment, if you got your number, it's "attendee", all lowercase, and then your number, for both the username and the password. Sorry, to repeat: it's "attendee", all lowercase, and then the number that you received, for both the username and the password. And if you want to come back to this, and you're connected through Slack, the channel is workshop-graphrag-intro; you can go there and pick up the slides as we move on, so that way you have a constant reference back to them.
So while everyone gets set up, I'll talk a little bit about what GraphRAG is in general, to motivate what we're doing here. This is an architecture that represents what some of our customers do; it's a very common, generalized architecture for GraphRAG users. The idea is that you have your agent over there, your AI models, and your UI, all the normal things you might think of if you're putting together a knowledge assistant, but then there's this knowledge graph in the middle. You can ingest both unstructured and structured data into that knowledge graph: unstructured being things like documents and PDFs, and structured being tables like CSVs or data from a relational database, or what have you.

So there's a big question of: why the heck do we need this knowledge graph in the middle? We have agents, we can have tools, and we can go pick stuff from data sources directly. The idea is that if you have a use case and you roughly know the types of questions you want to answer with your agents, then by taking your data and decomposing it into even a very simple knowledge graph to start, you're able to expose a lot of the domain logic you'd want to apply through the model of your data. As we'll see when we build a skills graph, we'll make relationships about people knowing skills, and by making that schema available to the agent, along with tools, you get a lot more control over how data is retrieved, you can retrieve more accurately, and you can explain the retrieval logic better.

We see this as especially important as we move more and more into this agentic world, because it's not a one-shot vector search anymore. When questions or prompts get handed to an agentic workflow, they get broken down in various ways, and when you have a knowledge graph, it lets you offer retrieval logic to complement that in a much simpler and, in my opinion, better manner.

Today we're going to be looking at a skills-and-employee graph. The use case is that you're building a knowledge assistant to help with things like searching for talent, aligning and analyzing skills within an organization, and doing things like staffing, team formation, substitutions, and things of that nature.
So I'm going to present a little bit about what we'll go through in these modules first. I'll do some stuff inside the deck, which hopefully won't take too long; I just want to talk to you about Cypher and some of the things you'll be seeing before we dive in. We're going to talk about creating a graph. We'll start with some structured data here just to keep things simple, and I'll introduce unstructured data a little bit later. We'll cover some basic Cypher queries and some algorithms, and then get into vector search and semantics.

So, a knowledge graph: generally it's defined as a set of design patterns to organize and access interrelated data. At Neo4j we model the data inside the database as what's called a property graph, which consists of three primary elements. First are nodes; these are like your nouns, your people, places, and things. Next are relationships; these are how things are related together, hence the name, and they're often verbs: person KNOWS person, person LIVES_WITH person, person DRIVES or OWNS a car. Both nodes and relationships can have properties, which are just attributes: they can be strings, numbers, arrays of things, and vectors as well. We've been able to store vectors inside Neo4j for a long time, and you can do search over them. The query language we're going to use to access the database is called Cypher.
I know a lot of you raised your hands at the beginning, so you already have some familiarity with this, but Cypher kind of looks like ASCII art. It has a SQL-ish feel to it, but you get to write pattern statements: if you see MATCH (person)-[:KNOWS]->(skill), you're connecting a Person node to a Skill node through that KNOWS relationship, so it reads very literally in the way it's written. Nodes have what are called labels, which are roughly the equivalent of a table type in a SQL database; basically, what type of entity it is. And, as I said before, nodes have properties, so you can identify one by a property like name, and you can bind variables like p and s, which refer to the actual entities as you build up your query.

This is not going to be a course on writing Cypher, because we could make an 80-minute course just on that, but we'll be walking through these queries. If you haven't seen Cypher before, don't expect to be a super expert in the query language when we're done; just know that this is roughly how it works, and as you go through, hopefully you'll get a better understanding and a feel for how these queries work and for the types of data they return.
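To make the shape of these statements concrete, here's a hedged sketch of the kind of pattern just described. The Person and Skill labels and the KNOWS relationship match the workshop's data model, but the exact query text and parameter name are illustrative, not the notebook's code:

```python
# A Cypher pattern reads almost like a diagram: (p:Person) and (s:Skill)
# are nodes bound to variables p and s, and -[:KNOWS]-> is the
# relationship connecting them.
cypher = """
MATCH (p:Person)-[:KNOWS]->(s:Skill)
WHERE s.name = $skill_name
RETURN p.name AS person
"""

# With the official neo4j Python driver (credentials from ws.env),
# running it would look roughly like:
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver(uri, auth=(user, password))
#   records, _, _ = driver.execute_query(cypher, skill_name="Python")
```

The `$skill_name` token is a query parameter, which keeps values out of the query text and lets the database cache the query plan.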
And I'm sure everyone here is pretty familiar with vector search at this point. Yeah, I have a feeling this audience would be, so I won't spend too long on it. I think we all know what embeddings are: basically a type of data compression. You can apply them to all sorts of things, text, audio, even graphs. Often it's just a vector of numbers, and you can use it to find similar things within that domain space: find texts that are similar semantically, not just lexically, actually based on the kinds of things they're talking about. Within Neo4j you have search indices, including vector indices: there are range indices, uniqueness constraints, text search, full-text search with Lucene, and approximate nearest-neighbor vector search as well, which we'll be leveraging as we go through, in combination with the Cypher queries we were just looking at, to do graph traversals.
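For reference, creating and querying a vector index in recent Neo4j versions looks roughly like this. The index name, the embedding property, and the dimension count are assumptions for illustration, not the workshop's actual configuration:

```python
# Create an approximate-nearest-neighbor index over a node property
# that stores an embedding vector.
create_index = """
CREATE VECTOR INDEX skill_embeddings IF NOT EXISTS
FOR (s:Skill) ON (s.embedding)
OPTIONS {indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine'
}}
"""

# Query the index for the 5 nodes nearest to a query embedding.
ann_query = """
CALL db.index.vector.queryNodes('skill_embeddings', 5, $query_vector)
YIELD node, score
RETURN node.name AS skill, score
"""
```

Because the result of `queryNodes` is a set of nodes, you can keep matching outward from them in the same query, which is what combining vector search with graph traversal means in practice.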
The next thing to know about is that in addition to being able to query the database, we also have analytics: graph analytics powered on the database that let you do different types of data enrichment and more graph-global analytics. That means finding which nodes are most central according to different algorithms, doing community detection (how do you cluster the graph?), finding paths between nodes, and doing different types of embeddings. We have a lot of those algorithms, and we'll touch on them very briefly today in the first module, just to show that once you have a knowledge graph, you can start enriching that data. In our case we'll use community detection to summarize skills inside our graph, and then we'll be able to pass that on to an agent to explain some parts of our graph for our use case.

All righty, so with that in mind, we'll go ahead and jump into the first notebook. Are there any questions before we dive in? Is anyone still... okay, yes, over here.
Yes, do you have... let me just go back here. So, this is available in the Slack channel too. If you don't have a number, my colleague Alex over there can grab one for you. We're in the workshop-graphrag-intro Slack channel, and you can go there to grab the deck and all the links. Basically, if your number is 160 or below, you go to the first Jupyter server; if it's 201 or above, you go to the second one. Use "attendee", all lowercase, and then your number as both your username and your password. You'll do that for the Jupyter notebook and also for the Neo4j browser, if you want to follow along with visualizing the graph as we go through.
Any other... yes? [Audience] I know this is an introduction to GraphRAG, but when you're building these graphs, I see you have a small graph here; how do you prioritize whether you should build big graphs, ones that really scale, or make smaller graphs?

So your question is about data modeling, and how you prioritize making one graph versus multiple graphs. It's a good question. In general, for a lot of what we're seeing with agents, I find it's helpful to have a smaller data model if possible, especially if you're doing different types of dynamic query generation, so keep that in mind. But things are getting better: we can pull back the graph schema and offer it to agents, and we're noticing that as language models keep iterating, they're getting better and better at interpreting it. Whenever you want to do traversals in a low-latency way between two data points, those things really should go in the same graph; then it's a question of what you make a label versus a property in that scenario. We'll go through some of it, and if you want to talk after, come by our booth and we can have a more use-case-focused conversation. Anything else? All righty, so I'm going to go ahead and dive into the notebook.
All right, so you're just going to come down here and start. Remember, we're in the talent subfolder; there are two workshops in here, and the one we'll be doing is called talent. If you're in the other one, there's also some interesting stuff in there, but you won't be able to follow along.

All righty, it looks like I'm running now. Basically, what I'm going to do first is get my environment file and load it. If you don't have the environment file here, it's in the root directory; just go ahead and move it into this subdirectory. Then the first thing we do is load our skills dataset, which is a table, and if we look at that table, we have basically three fields: an email field, a name field, and then a list of skills for each person.
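To picture the shape of those rows, here's a hedged sketch: the emails and skills are made up, and the chunking helper stands in for the "organize the data for loading" step the notebook does with a data frame (the chunk size is arbitrary):

```python
# Each row of the skills table: an email, a name, and a list of skills.
rows = [
    {"email": "lucy@example.com", "name": "Lucy",
     "skills": ["Python", "Scrum", "Tableau"]},
    {"email": "sam@example.com", "name": "Sam",
     "skills": ["Python", "Flask"]},
]

def chunks(records, size=1000):
    """Yield fixed-size batches so each load query handles one chunk."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

# With size=1, two rows become two batches of one row each.
batches = list(chunks(rows, size=1))
```

Batching like this keeps each transaction small, which matters once the real dataset has many thousands of rows.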
As I said before, we'll go into a little bit of detail later around how you might extract this from documents like résumés, but for now, because we're interested in this skills-mapping, team-formation, and staffing kind of use case, we're starting with this very simple dataset to get us going. There are a couple of steps here that just organize the data to make it easy to load, and then we start to create our graph.

A lot of this is what we'd call basic Neo4j data loading. We create chunks out of our data frame, and you should check to make sure there's nothing in your database. I do have stuff in my database because I was just running this before, but that's on me; yours should say zero. The first thing we do is set a constraint. Inside Neo4j, whenever you create nodes, if you have what's called a node key constraint or a uniqueness constraint, you're basically saying, in this case, that the email has to be unique and non-null for all your people. That makes it very fast to match on people and do merge operations. A lot of times people will say "Neo4j is really slow," and that's often because of simple mistakes like not setting a constraint, which forces the database to do very expensive scans every time you look up a user, rather than hitting a unique index. You do the same for skill, because our data model is going to be person and skill, so we have two types of nodes, and once we've run that, we'll have two constraints inside the database, one for Skill and one for Person.
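Those two constraints can be written roughly like this. The constraint names are illustrative, and note that node key constraints need Enterprise or Aura; `IS UNIQUE` is the closest Community-edition equivalent (without the non-null guarantee):

```python
# Node key constraint: email must be unique and non-null for Person.
person_constraint = """
CREATE CONSTRAINT person_email IF NOT EXISTS
FOR (p:Person) REQUIRE p.email IS NODE KEY
"""

# Same idea for Skill, keyed on its name.
skill_constraint = """
CREATE CONSTRAINT skill_name IF NOT EXISTS
FOR (s:Skill) REQUIRE s.name IS NODE KEY
"""
```

Each constraint is backed by an index, which is what makes the MERGE operations in the next step fast.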
After that, we go ahead and start loading our nodes and relationships. The way this query works (I guess I won't run it, even though it wouldn't actually change anything in my database) is that we loop through chunks of our data frame and say: merge a person on email, set their name, and then, for that person's list of skills, merge each skill on its name and merge a relationship saying the person KNOWS that skill. So it creates this person-KNOWS-skill pattern in the database.

Once you've run that, if you have the browser window open that we were going over before, you can copy one of these queries over; maybe I'll take this one first. This will show you what's inside the database: if I just match people, I get my people back. And I may have lost... oh cool, I still have my internet connection. All right, so I can see that I have my people with their names and their email addresses. You can do the same thing for matching skills, and you can also look for relationships. This gets into the pattern matching that we were talking about before with Cypher; this is a very simple version of matching a path. I'm saying p, which is a path, equals a node connected through KNOWS to another node, with LIMIT 25, and that returns a graph where I get to see all these different relationships. It looks like my internet connection is still somewhat slow, but I get it back: I see my people, I get the KNOWS relationships (in this case this person knows API design, Tableau, and Flask), and different skills pop up here inside the graph.
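The loading query described above follows this general shape; it's a hedged reconstruction rather than the notebook's exact code, where `$rows` would be one chunk of the data frame:

```python
# For each row: merge the person on email, set their name, then merge
# each skill and the KNOWS relationship connecting them.
load_query = """
UNWIND $rows AS row
MERGE (p:Person {email: row.email})
SET p.name = row.name
WITH p, row
UNWIND row.skills AS skill_name
MERGE (s:Skill {name: skill_name})
MERGE (p)-[:KNOWS]->(s)
"""
```

Because of the constraints set earlier, each MERGE is an index lookup rather than a scan, and re-running the load is idempotent: MERGE only creates what doesn't already exist.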
And you can also run these through our driver here to look at the data, pull back the different people that are in there, and find out what skills they know.

[Audience question about the KNOWS relationship.] KNOWS is a relationship type we are making up; it's our domain model. I can actually call the schema here, and mine will show more than yours if you run the same command, because it has some other stuff from later in the course, but basically you have person KNOWS skill. That's our data model. [Audience: so "person HAS skill" would be another way to put it?] Exactly, exactly. And it's actually funny, because this is becoming even more important now that we're using LLMs to design queries: the language that you use is sort of an annotation for the model, so that starts to become very interesting.
All righty, so there are some Cypher queries here that I'll run through really quick, and depending on time I may need to speed things up through this notebook, because I want to make sure we actually get to the agent at the end. If you go into the deck, there's this link, browser.neo4j.io/preview, and I think it's just your username and your password; but you can also look inside your workshop environment file, and it will have that information: your URI, your username, and your password.
All righty, so like I said, we'll go through some of these. For example, we can count in Cypher: we can match person-KNOWS-skill, return the skill name, and count the distinct people for each skill. You can think of it as: I've got all my skills, and I'm counting the distinct people that know each one; it's very simple. When we get that back, we see what our most popular skills are.
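That counting query can be sketched like this (a hedged reconstruction consistent with the description, not necessarily the notebook's exact text):

```python
# For each skill, count the distinct people who know it, most
# popular first.
popularity_query = """
MATCH (p:Person)-[:KNOWS]->(s:Skill)
RETURN s.name AS skill, count(DISTINCT p) AS people
ORDER BY people DESC
"""
```

In Cypher, aggregation is implicit: any non-aggregated expression in the RETURN clause (here `s.name`) becomes the grouping key, so there's no separate GROUP BY.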
Going down the list, they're all pretty tech-focused. We can also ask different types of multi-hop questions, which is very interesting. For example, I'll take this query and copy it over to my browser, because it's interesting to see these visually. What we're asking here is: take this person named Lucy, and tell me which people are similar to Lucy in the sense of knowing the same skills. If I run that, I get Lucy, all of her skills, and all the other people who know those skills.

And you can build on that iteratively. If I go back here, I can say: now, for all of those people, I also want the skills they know. I basically add "and what skills do these other people know" to the end of the query, and I get a very large graph back. The idea is that once we have this logic extracted from whatever our original data source is, we can control, at a much finer-grained level, how we define what a similar person or a similar skill is, because we have the ability to traverse the graph and apply concrete logic. It's basically like having your information in symbolic form, versus just a sub-symbolic vector. So you'll get a lot back, because now we're looking at people and all the other skills they know, and I can go in here and find the most central skills among these people. For example, Scrum is very central among this group, because a lot of people know that skill. So I'm learning about this local community that knows skills similar to Lucy's.
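The two traversals just described can be sketched like this; the parameter and variable names are illustrative, and the second query simply adds one more hop to the first:

```python
# People similar to a given person, via shared skills: one hop out
# to a skill and back to another person.
similar_people_query = """
MATCH (p1:Person {name: $name})-[:KNOWS]->(s:Skill)<-[:KNOWS]-(p2:Person)
WHERE p1 <> p2
RETURN DISTINCT p2.name AS person
"""

# Extend the pattern by one hop: what other skills do those people know?
second_hop_query = """
MATCH (p1:Person {name: $name})-[:KNOWS]->(s1:Skill)
          <-[:KNOWS]-(p2:Person)-[:KNOWS]->(s2:Skill)
WHERE p1 <> p2
RETURN DISTINCT s2.name AS skill
"""
```

The `WHERE p1 <> p2` guard keeps Lucy out of her own similarity list, and DISTINCT collapses the many paths that reach the same person or skill.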
And in here there are just some examples of running that same logic inside the notebook. Yes?

[Audience question about viewing the graph.] So to get to the graph, you go to that Neo4j browser link. You'll see table and graph views, and sometimes, if you're not returning nodes, it will only return a table. In that case it should return the path; if you don't see a graph, it's usually because you're returning just a name or something, in which case it will just show you a list of names.

[Audience question about DISTINCT.] And I don't have a DISTINCT here... oh yes, I did. I don't know if it's completely necessary for this one, actually; yeah, I don't think it is. There are times when you do very complicated queries (we'll see a couple of other examples where we do multi-hop paths) where there's a chance you'll get basically two paths that are the same, in which case having the DISTINCT there just allows you to deduplicate them.
[Audience question about how well LLMs generate Cypher.] A lot of it depends on the complexity of your schema. Basically, we see that for simpler aggregation queries, or when you have a lot of prompt engineering around very specific types of path queries, smaller models can do well. We do often recommend that you have your own expert tools if there's a really complicated type of traversal you want to do: you can write your own Python functions, or you can have your own MCP server that just exposes a set of functions for your more complicated traversals. We also see that you can restrict the options for the LLM: instead of having it write a complete query, you can say, here are three general patterns, have it write just that part of the pattern, and slot it into another query. You can do things like that to help it. We've also released fine-tuned models, I think back in April; they're fine-tuned from Gemma and they're on Hugging Face, so you can try using those as well; they can do a little bit better. We're not going to use them here, though, because we're using OpenAI models for this workshop. All righty, so a lot of this section, as I was just going over, is basically just running these queries.
Returning DISTINCT names matters here because you might actually get to the same person multiple times; if you're just returning the name of a skill, it's important to use DISTINCT in that case, so we get all the distinct skills that showed up in the graph we were just looking at.
Another thing that might be important for our use case is finding similar people. This uses the query we were just going over: we used Lucy before, but now we can parameterize the name, match through the KNOWS-skill pattern, and then from that skill match out to another person, counting the number of shared skills between people. When we do that we can see the number of shared skills between different individuals, and we do it again here for, I think, a different set of people; this one just counts the most skills shared between any two people. So that's another way to measure similarity: beyond semantic similarity, we measure exactly which skills our model says two people share. 00:33:57.180 |
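The shared-skills traversal described above can be sketched in plain Python against a tiny in-memory stand-in (the names and skills here are made up for illustration):

```python
# Toy stand-in for the (:Person)-[:KNOWS]->(:Skill)<-[:KNOWS]-(:Person) pattern.
people = {
    "Lucy": {"Python", "SQL", "AWS"},
    "Omar": {"Python", "SQL", "Tableau"},
    "Dana": {"Swift"},
}

def shared_skill_counts(name):
    """Count skills shared between `name` and every other person,
    mirroring the parameterized two-hop Cypher traversal."""
    mine = people[name]
    return {
        other: len(mine & skills)  # sets dedupe, like DISTINCT in Cypher
        for other, skills in people.items()
        if other != name and mine & skills
    }

print(shared_skill_counts("Lucy"))  # {'Omar': 2}
```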
As we go through this, one thing that can speed up our queries, and this is optional, is: if we know we're going to look for similar skill sets a lot, we can create a SIMILAR_SKILL_SET relationship inside the graph. We match two different people and MERGE a SIMILAR_SKILL_SET relationship based on their skill-overlap count; we have the data frame for that locally, pulled back when we were looking at those shared skills. That creates this relationship between people, and if I look at it over in my browser, some of these have an overlap of one, others greater; I think in here they go up to three. All this is doing is saying, for two people, what is their overlap, so with that SIMILAR_SKILL_SET relationship we don't have to run the full traversal over and over again if we don't want to. 00:35:18.140 |
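Materializing the overlap as a relationship might look roughly like this; the label, property names, and row shape are assumptions for illustration, and the driver call is left as a comment since it needs a live database:

```python
def similar_skill_set_query():
    """Hypothetical sketch of writing the precomputed overlap back to the
    graph; label and property names are assumptions, not the notebook's."""
    return (
        "UNWIND $rows AS row "
        "MATCH (a:Person {email: row.email1}), (b:Person {email: row.email2}) "
        "MERGE (a)-[r:SIMILAR_SKILL_SET]->(b) "
        "SET r.overlap = row.overlap"
    )

# Rows would come from the data frame of shared-skill counts:
rows = [{"email1": "a@x.com", "email2": "b@x.com", "overlap": 2}]
# driver.execute_query(similar_skill_set_query(), rows=rows)  # needs live Neo4j
print("MERGE" in similar_skill_set_query())  # True
```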
The next thing I wanted to show you... yes? [Audience] But if you do this, it's static: if you want to update the overlap, you need to run this query again? You would need to run that query again, yeah. It depends on how often your data gets updated. There's nothing actually wrong with doing the multi-hop query over and over; the graph database is designed to handle that. So if you had a graph that was constantly getting updated, you might not want to materialize this relationship at all. 00:35:49.580 |
The next thing I wanted to show you is how graph analytics works inside Neo4j, and how to use it to enrich the graph. This creates what we call a GDS, or Graph Data Science, client. What we're doing here is creating something called a projection, and then running an algorithm called Leiden for graph community detection. How many people here have heard of Leiden as an algorithm? Okay, just a few. How many have heard of Louvain? A couple. Basically, this algorithm breaks the graph down into a hierarchy: it starts by splitting the graph into a few big communities, then into smaller ones. It's trying to optimize a metric called modularity, which says: I want clusters in my graph where connections within a cluster are dense and connections across clusters are sparse, so I'm creating these modules. I'm using that SIMILAR_SKILL_SET relationship here, which is another important reason to create it: if you run analytics on your graph, it helps those analytics run a little better, because person connects directly to person with a similar skill set. By running Leiden we get a set of communities where the people within each community know similar skills. This is all simulated data, 00:37:36.940 |
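Modularity, the quantity Leiden and Louvain optimize, can be computed by hand on a toy graph; this is an illustrative sketch of the metric itself, not the GDS implementation:

```python
def modularity(edges, community):
    """Newman modularity Q for an undirected graph.
    edges: list of (u, v) pairs; community: dict node -> community id."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Q = sum over communities of (e_c / m - (d_c / 2m)^2),
    # where e_c = edges inside community c, d_c = total degree in c.
    q = 0.0
    for c in set(community.values()):
        e_c = sum(1 for u, v in edges if community[u] == c and community[v] == c)
        d_c = sum(d for n, d in degree.items() if community[n] == c)
        q += e_c / m - (d_c / (2 * m)) ** 2
    return q

# Two tight pairs split into two communities score well:
edges = [(0, 1), (2, 3)]
print(modularity(edges, {0: "a", 1: "a", 2: "b", 3: "b"}))  # 0.5
```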
but if I go down, and I'll skip over some of this, we do some checks around how good the communities are; I encourage you to run all of it so that the agent works well at the end. What I wanted to show you is this graphic here. We wrote the community-id property back to the graph, so you get these different community ids, and the heat map shows which communities have the most skills in a given area. Since this data is randomly generated, a lot of these patterns will look a little funky if you really dig into them, but the idea is that with more relevant, realistic data this can actually show you: your data engineers are here, your front-end folks are over here, your ML people are over there. You can start to see that within the graph and really break it down. But do go ahead and run everything here, through the g.drop() at the end, so that you have that property. 00:38:45.420 |
Another way we can break down different groups inside the graph is to look at... go ahead. [Audience] Sorry, one question: when do you customize your graph, for example with the community-detection algorithm you're running, and when do you just let the agent handle it? Are there heuristics for when it's better to invest time improving the graph? I think it depends on your use case. If you're very interested in questions like "I want to understand the skill communities inside my company," and that question comes up frequently, then graph analytics can be very beneficial: you can do employee segmentation, understand performance within different groups, and so on. We often see it used for customer segmentation and recommendation systems too. At the same time, if you just want to look for matches between people with similar skills, maybe you don't need community detection, because that's just a pairing exercise. So I'd say you use it whenever you want to do some sort of clustering analysis and persist it, with visibility and confidence in how it was done, rather than leaving it up to a model that's making up how that works. [Audience] So basically you look at what the users are doing and then see whether you need to build it? Yeah, exactly. 00:40:38.300 |
The heat map is showing how often different skills show up within each community. [Audience] Aren't the communities based on what skills they have? Yes. [Audience] So the first community, for example, looks like it's either Tableau or Swift? Right; the point is to understand the skill breakdown within each community. Again, this is generated data, so it's a bit random, but you can imagine that in a non-random scenario you'd see groupings emerge: a lot of product managers versus front-end developers versus DevOps folks, and so on. 00:41:28.540 |
[Audience] Two connected questions: do you have any best practices for data modeling so that an agent understands the data model, or just general GraphRAG best practices for creating data models? A lot of the agent stuff is evolving very quickly as LLMs keep changing and getting better. We've long had guides on migrating data from relational systems to graph and how to think about that; in a graph, nodes are nouns and relationships are verbs, and you connect those together. For agents, it's really nice when the data model reflects natural language: "person KNOWS skill" is a very natural-language statement that translates directly to a data model, and as I said before, simpler data models seem to work better for dynamic query generation. Beyond that, I know "it depends" is a cop-out, but it really does depend on the type of retrievers you have, the size of your data, and the cardinality of different categories of things in your data. For instance, you generally want to avoid having hundreds or thousands of node labels, because that's just a lot; in that case you make them properties instead. So there's a lot to consider; I don't know if that answers your question. We'll see at the end of module three, which we might not have time to get to, but it's in the code and I'll show it quickly: there are functions we provide to pull back the node labels and relationship types from the graph schema, so you can create a JSON representation of what the schema looks like and combine it with specific prompts, so the model follows it. Another thing that helps even more, if you have a graph data model that isn't going to change much over time, is annotating that schema: for specific properties, node labels, or relationship types you can say "this thing does this; when you ingest data, put it here; when you pull data, you can go down these paths." And putting the actual query patterns, like "person KNOWS skill," into the schema helps a lot too, because the model can read that and understand how to do the traversal better. 00:44:11.580 |
All right, anything else? Cool. We're getting close on time, so I'm going to go pretty quickly through the rest of this, but hopefully it'll be understandable. Another way we can think about skills, and the relationships between skills, is how semantically similar they are. So we can make embeddings of our skills: there's another CSV file you read into this notebook that has skills, descriptions, and an embedding. Which field do you think we embedded, the skill name or the description, and why? Right: when you have really short names, like R, which is technically a programming language (a lot of people don't love it; I love R), or AWS, the names carry very little signal, so embedding a description of each skill produces a much more informative embedding. That's the whole idea: we give each skill a description and embed the description. These are all text-embedding-ada embeddings, so they're 1536-dimensional. 00:45:36.060 |
We're going to create a vector property and set the description as well; after that, I think we create our vector index down here, which we call the skills embedding. Once that's set up, you'll see the skills-embedding index show up, and we're now able to do vector search on skills inside the graph. If I have "Python" as a skill, I can use this Cypher command to search the skills embedding and pull back the 10 most relevant skills. You'll see it brings some skills back: Ruby and Java; we got pandas, at least that's good; Django; PyTorch. Some of these are better than others, but the point is we can apply these vectors and pull information back with vector search. Another interesting thing: if I'm looking for something that isn't in the database, say "API coding," and I search that as a term, I'm using the OpenAI client to embed it, or actually it looks like I'm using LangChain up here, and then searching the database to pull back relevant skills above a certain similarity threshold. For that "API coding" example I get back API Design and JavaScript. 00:47:13.580 |
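The threshold-filtered similarity search described above can be sketched in plain Python; the toy 3-d vectors and the 0.8 threshold are assumptions, whereas in the notebook the embeddings come from OpenAI and the search runs through Neo4j's vector index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" standing in for 1536-d ada vectors.
skill_vectors = {
    "API Design": [0.9, 0.1, 0.0],
    "JavaScript": [0.8, 0.2, 0.1],
    "Tableau":    [0.0, 0.1, 0.9],
}

def search(query_vec, threshold=0.8):
    """Return skills above the similarity threshold, best first."""
    scored = {s: cosine(query_vec, v) for s, v in skill_vectors.items()}
    return sorted(
        (s for s, score in scored.items() if score >= threshold),
        key=lambda s: -scored[s],
    )

print(search([1.0, 0.0, 0.0]))  # ['API Design', 'JavaScript']
```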
What I can do in this case is say: I have this ability to do semantic similarity in the database, so I can write a SIMILAR_SEMANTIC relationship and attach a score to it. There are some advantages to doing this, but a big one is visualization, and also clustering. If I take this command, which traverses the SIMILAR_SEMANTIC relationship, go into my graph, return all the skills, and zoom in, I start to see interesting groupings. For example, my cloud skills, Azure, AWS, cloud architecture, are all in one place; Flask and Django are connected; my data-analytics group has Tableau, Power BI, data visualization; and then there's a big grouping over here, with the JVM languages like Java, Scala, and Kotlin, my Python stuff with pandas, and, if I go up in that connected group, Java and all the front-end frameworks. So don't underestimate the power of being able to visualize similarities. It's very important: I can create communities from these, and I can use them for customized scoring in my retrieval queries, which we'll see. The other really cool thing is that if, for some reason, I don't think Java should be connected to Python, I can control that: I can remove that relationship, and every time I compute similarity relationships I retain that control and can filter. So those are some important things to keep in mind about what you can control. 00:49:28.620 |
[Audience] When you do semantic similarity, is it only for visualization, or is there anything else? We'll see in a bit; I'll answer this as I go down. I can pull back this SIMILAR_SEMANTIC relationship, and what I can start to do, which is actually pretty cool, is create customized scorings that balance semantic similarity against hard skill matches, and weight that however I want. So you can use it in your retrieval patterns: you can use both. There are a lot of workflows, because you can compose things together into multiple steps; you can pull similar skills and then look for people, and it just depends on how you break down those functions for the agents. 00:50:43.420 |
Sometimes, if you know there's a very specific pattern you want to follow, like this one here, it looks like a really big query, somewhat intimidating, but it's actually not that complicated: all it's doing is weighting semantic similarity against a hard overlap of similar skill sets. That might be a case where coupling the logic together makes sense, when there's a very specific similarity metric you want. [Audience] But what if you're coming from a query, say you have an assistant and a much larger ontology, and it's "tell me about people with certain skills"? How do you know these are even the entities you're looking at? What's the first step when you go from query to retrieval, breaking it into entities so you know this is the query you should be running? Yeah; why don't we revisit that when we get to the third module. The second module should be really quick, and then you can see the functions in the third module, which might help me answer that question a little better; I think I know where you're going, but seeing it will help. So, as I said before, this does a balance between semantic similarity and a hard overlap of skills, and you can use that to weight how you want to find similar people inside the graph: you can balance vector-search similarity against the similarity that comes from hard matches. Another cool thing about a graph database specifically: 00:52:27.180 |
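The blended scoring idea can be sketched as a simple weighted combination; the alpha weight, the overlap normalization, and the function shape are illustrative assumptions, not the notebook's exact query logic:

```python
def blended_similarity(semantic_score, shared_skills, max_shared, alpha=0.5):
    """Blend a 0-1 semantic-similarity score with a hard skill-overlap
    ratio; alpha=1 means purely semantic, alpha=0 purely hard overlap."""
    overlap_ratio = shared_skills / max_shared if max_shared else 0.0
    return alpha * semantic_score + (1 - alpha) * overlap_ratio

# Semantically close but only 1 of 4 skills shared, at two weightings:
print(blended_similarity(0.9, 1, 4))
print(blended_similarity(0.9, 1, 4, alpha=0.8))
```

In the workshop query the same trade-off is expressed in Cypher, but the knob is identical: how much you trust the embedding versus the explicit KNOWS edges.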
I'll take this query here just so you can see what it looks like. Oh, go ahead. I mean, you could go either way. A lot of this, to be honest, comes down to cost considerations: how expensive is it in Neo4j versus inside Postgres, and that varies a lot with the type of infrastructure you have. Having everything in one place means you don't have to sync your data, and query latency is, at least in theory, lower because you're querying a single database. But if you already have data in Postgres, or a specialized vector database, you don't necessarily have to migrate it to Neo4j to make this work. So a lot of it is performance, but really cost per performance: what does each deployment cost? 00:53:35.020 |
So this query here is looking for similarity between two people, and you see I have this `*0..2` notation: with a graph you're allowed to do what are called variable-length queries. I'm saying: go out on SIMILAR_SEMANTIC, but you're allowed to go anywhere from zero to two hops across these skill sets before finding a connection between John and Matthew, and then I UNION that against the plain "person KNOWS same skill" pattern. When we get the results back, you see Matthew knows React, John knows HTML, and those are similar because both have a semantic similarity to JavaScript. Same thing here: this semantic similarity is only one hop, which is where the variable hop comes in. You can control how far out to go on either of these paths to pull back similarities between people; this is just an advantage of a graph database. 00:54:47.260 |
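What a variable-length pattern like `-[:SIMILAR_SEMANTIC*0..2]-` does can be approximated by a bounded breadth-first search; the toy skill graph below is a stand-in:

```python
from collections import deque

# Toy undirected SIMILAR_SEMANTIC edges between skills.
edges = {("react", "javascript"), ("javascript", "html")}
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

def within_hops(start, max_hops):
    """All skills reachable from `start` in 0..max_hops hops,
    mirroring a `*0..max_hops` variable-length Cypher pattern."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

print(sorted(within_hops("react", 2)))  # ['html', 'javascript', 'react']
```

Zero hops returns just the start node, which is why `*0..2` can match a person's own skill as well as semantically nearby ones.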
Then I think I want to finish off this notebook. I would take a break, except we only have 23 more minutes, so what do you say, should we just power through the last 20? Yeah, let's do that. So we've looked at some of the advantages of using the graph, and semantic similarity inside the graph, and now we'll talk a little about our second module. I won't go over to the slides; for this crowd I can probably hop right into the notebook: what if we just have resumes, not a CSV file? This will be a simple example showing how to take data from text and turn it into useful data for the graph. Again, if you're running this live, you connect to the same workshop file you had before and test the connection; this time you should count 154 nodes. 00:55:51.020 |
If you come by our booth we can show you much more exciting examples than the two text blobs we have here, but here we have two different bios. If you've already done some entity extraction you're probably familiar with this workflow: we define our domain model in terms of Pydantic classes. Here I define my Person with a name, an email, and a list of skills. If you wanted relationship properties, in a more complicated model KNOWS might be its own class with, say, a proficiency property, in which case Person would hold a list of KnowsSkill objects, each wrapping a Skill; but this is a very simple example, so Skill just has a name property, people have a list of skills, and we create a person list. Once the Pydantic classes are defined, we create a system message as the prompt for our model, in this case I used 4.1 here, give it the documents to ingest, and it spits out JSON with those two people: we had two documents, each corresponding to one person, and we got their emails and skills. 00:57:32.060 |
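The domain model can be sketched with stdlib dataclasses as a stand-in for the notebook's Pydantic classes (with Pydantic you would subclass `BaseModel` and get validation and JSON parsing for free); the field names and sample values are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str

@dataclass
class Person:
    name: str
    email: str
    skills: list[Skill] = field(default_factory=list)

# Shape of the JSON the extraction model returns (made-up values):
extracted = [
    {"name": "Lucy", "email": "lucy@example.com",
     "skills": [{"name": "Python"}, {"name": "SQL"}]},
]
people = [
    Person(d["name"], d["email"], [Skill(s["name"]) for s in d["skills"]])
    for d in extracted
]
print(people[0].skills[0].name)  # Python
```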
With all their different names and such. Once we have that, it's pretty trivial to load, and very similar to what we did last time. If we go down to our graph creation here, you'll see this isn't exactly the query we had before, but it's very similar: we ingest one person at a time, MERGE on the email address, which we have indexed, SET the name, and then for each skill in the person's list we MERGE the skill by name and the KNOWS relationship connecting them. 00:58:11.020 |
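The per-person MERGE described here could look roughly like the following; the labels and property names are assumptions, and the actual driver call is left as a comment because it needs a live database:

```python
# Hypothetical sketch of the per-person ingestion query; in the notebook
# this runs through the Neo4j Python driver.
INGEST_PERSON = """
MERGE (p:Person {email: $email})
SET p.name = $name
WITH p
UNWIND $skills AS skill_name
MERGE (s:Skill {name: skill_name})
MERGE (p)-[:KNOWS]->(s)
"""

def person_params(person):
    """Turn one extracted person dict into query parameters."""
    return {
        "email": person["email"],
        "name": person["name"],
        "skills": [s["name"] for s in person["skills"]],
    }

params = person_params(
    {"name": "Lucy", "email": "lucy@example.com", "skills": [{"name": "Python"}]}
)
# driver.execute_query(INGEST_PERSON, **params)  # requires a live Neo4j
print(params["skills"])  # ['Python']
```

MERGE rather than CREATE is what makes the load idempotent: re-running it on the same bios won't duplicate people or skills.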
Then of course I can go back to the graph (these are Neo4j employees that I loaded), look for one of them, and I should get them back with the different skills they picked up, now in the database. So, very simple. We also have our own GraphRAG Python package, which is very good for this, and our Knowledge Graph Builder, which is kind of a reference UI; if you look at the code we used to implement it, there are more examples around things like document chunking with overlap, and of course multi-threading with async, so you're not just running a for loop over a bunch of bios. We have all of that, so stop by the booth and we can give you more. But like I said, especially for this crowd, since you're already familiar with this, it's a very short module. And yes, we would have to do that; 00:59:34.780 |
ideally I would have done this in an order where I did this first, and then went through the clustering and materializing new relationships with weights. We're putting new relationships in here; these don't have weights on them; but if we wanted to reformulate our communities... [Audience] Thank you. So the community detection kind of follows what you're doing: you're adding extra links between the nodes that have weights on them, whether it's for semantic distance between skills or for "you are x hops away from somebody else" under some other distance computation that you've materialized? Yeah, and we would ideally run that in a recurring way as we upload data. In this case I did create, I think, some new skills in addition to the people, so for SIMILAR_SKILL_SET we would redo that and get a couple more relationships; and for the semantic one, if we created new skills we would create new SIMILAR_SEMANTIC relationships between them. Yes. Okay, thank you. 01:00:57.900 |
Before we move on, is everyone ready? All right; that was a very quick module, so I'll go over and cover some other topics around this very quickly. What we saw was an example of what I'd call entity extraction, or named-entity recognition: we took a document and literally broke out people, places, things, and the relationships among them from within that document. There are other things we can do. For example, with certain types of documents, from a catalog, or in this case RFPs, we can break things out by the actual document structure. I'm only walking through this so you understand there are different types of extraction we can use to create graphs. If you know the anatomy of a document, like an RFP designed with different sections, an intro, objective, proposal, and subsections within those, we can create a graph out of that structure too. I'd call this more like document extraction: we take the document's structural metadata and model it as a graph. The advantage of doing things this way is that as you embed these different pieces and put them into a knowledge graph, you can run searches on entities that come from different chunks and traverse up and down the document hierarchies to find things, which is very helpful when your documents have repeated structure: since an entity sometimes connects across the structures of those documents, you can incorporate that into your graph retrieval queries. It also gives you a way to do community summaries; we saw Leiden before, but if your documents give you a natural hierarchy, that's another way of summarizing information across those documents. [Audience] Why do you have entities, documents, and chunks in the same ontology, as opposed to extracting entities into a separate ontology, separate from documents and chunks? Why combine the two? Well, when they're combined you can just do traversals between them. 01:03:38.780 |
[Audience] So what's an example of a traversal you'd want to do between entities and chunks? Legal contracts are a good example: you might want to search for particular legal clauses, but something like the expiry date might be somewhere else in the document. [Audience] Just making sure I understand what you mean by traversing the document: in a legal document, say a data-protection clause across multiple vendors, comparing the language, is that a use case? Yeah, exactly; and there might be a perpetuity piece, or dates and other details, and you can traverse over that document to find them in addition to the entities. 01:04:40.620 |
anything else all right all right so i just wanted to introduce that as as another example of how to do 01:04:48.860 |
things for the third module because we're already at 10:07 so let me just go ahead and jump into the 01:04:56.860 |
thing so you'll get to see it i'll go over to module three and this is going to be very simple 01:05:05.100 |
we probably asked this in the beginning, i think we already asked how many people have experience 01:05:09.180 |
building agents, this is going to be very simple, it's going to be a LangGraph agent 01:05:14.460 |
that we're going to make here basically what we're going to do is again a similar setup with our 01:05:21.100 |
environments file we're going to connect to neo4j test our connection um there's going to be four tools 01:05:29.820 |
that we want to build we want to be able to retrieve the skills of a person we want to be able to retrieve 01:05:35.820 |
similar skills to other skills similar people like if we wanted to find out who's another good person to 01:05:41.820 |
work on a thing and then retrieve people based on a set of skills and in this example we're 01:05:50.380 |
basically going to do a lot of tools first so at the end of this notebook there's going to be that 01:05:56.460 |
text-to-Cypher stuff where you get the schema back but here what we're going to go over first is actually 01:06:02.220 |
going to be putting these different tools together and we do that with graph patterns and it's the same graph patterns that we've been going over 01:06:09.420 |
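That first pattern, matching a person to the skills they know, might look roughly like this in the notebook. The node labels (`Person`, `Skill`) and the `KNOWS` relationship are assumptions based on the schema described in the talk, not the workshop's exact code:

```python
# Hypothetical sketch of the simplest graph pattern: a person matched to
# the skills they know. Labels and relationship names are assumptions.
SKILLS_OF_PERSON_QUERY = """
MATCH (p:Person {name: $name})-[:KNOWS]->(s:Skill)
RETURN s.name AS skill
"""

def retrieve_skills_of_person(session, name):
    """Run the pattern through a (duck-typed) neo4j driver session."""
    result = session.run(SKILLS_OF_PERSON_QUERY, name=name)
    return [record["skill"] for record in result]
```

In the real notebook `session` would come from the official `neo4j` Python driver; here it is only duck-typed so the shape of the tool is visible.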
so for example here, if we just want to find the skills that someone knows, it's very simple, 01:06:15.500 |
just person matching to their skills, and as you go down this notebook basically what you're seeing is 01:06:22.780 |
all the different patterns so the second is retrieving people with similar skills and here we're actually 01:06:28.540 |
going to use the vector index and that semantic similarity relationship so we're basically 01:06:35.980 |
going to pull, actually this one is searching for people with skills, i apologize, so in this one 01:06:42.220 |
you're going to look for skills so for example if a user puts in different skills those might 01:06:50.860 |
not match the skills we have inside of the database exactly word for word so you're going to use vector search to pull out 01:06:57.260 |
the specific skills and what's semantically similar to those skills and we can do some scoring 01:07:03.100 |
thresholds in here to pull back exactly what we want and then that will go ahead and return some 01:07:09.260 |
skills so if we had for example continuous delivery, cloud native, and security, 01:07:14.940 |
these would be the types of skills that we pull back from that 01:07:20.780 |
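The score-threshold idea can be sketched without a database: score each candidate skill against the query embedding and keep only those above a cutoff. The toy vectors below are made up for illustration; in the notebook this lookup runs against Neo4j's vector index instead:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

# Toy embeddings standing in for the vector index (values invented).
skill_vectors = {
    "Continuous Delivery": [0.9, 0.1, 0.0],
    "Cloud Native": [0.8, 0.2, 0.1],
    "Security": [0.1, 0.9, 0.2],
    "Gardening": [0.0, 0.1, 0.9],
}

def similar_skills(query_vec, threshold=0.8):
    """Skills whose similarity to the query clears the threshold, best first."""
    scored = {s: cosine(query_vec, v) for s, v in skill_vectors.items()}
    return sorted((s for s, sc in scored.items() if sc >= threshold),
                  key=lambda s: -scored[s])
```

The threshold plays the same role as the scoring cutoff mentioned above: it is what keeps a user's loosely worded skill from matching everything in the database.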
for person similarity there's a few different ways that we can do that and we've talked about 01:07:26.300 |
that a lot towards the beginning we can do it by community so we can look for people that know different 01:07:34.060 |
skills we can get all of their names and then we can look for that Leiden community that we created 01:07:41.740 |
we can look for all the skills that those people know and basically what we're doing at that point is we're 01:07:45.900 |
looking for people inside of the same skills community but the other way that you can do that 01:07:53.260 |
is you can look for similar skill sets using the similar-skill-set relationship, so the hard-coded 01:08:02.380 |
relationship that we've made from before which basically looks at hey what's the actual 01:08:07.740 |
skill overlap if you just looked at who knows what inside of the graph and that will bring back 01:08:13.820 |
some answers here so we were looking at John Garcia and we're saying hey find similar people 01:08:20.620 |
and then we can get like a score count of overlap to the different people here and then we can 01:08:27.100 |
start adding in that semantic similarity so this is where we get this big query right but what this query is 01:08:33.420 |
actually doing is it's sort of balancing between the similar skill set and the semantically similar skill 01:08:41.260 |
set so it's taking both those scores and adding them together and then from there we get 01:08:47.500 |
a floating-point score and a little bit of a different answer that's not just based on hard skill 01:08:53.260 |
connections but also skills that are kind of close together and we can weight those independently as well 01:08:59.660 |
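Stripped of the Cypher, the balancing act reduces to a weighted sum of the two scores. The weights below are illustrative defaults, not the notebook's actual values:

```python
# Sketch of the blended person-similarity score: a hard skill-overlap
# count combined with a semantic-similarity score, each independently
# weighted. Weights here are made up for illustration.
def blended_similarity(skills_a, skills_b, semantic_score,
                       overlap_weight=1.0, semantic_weight=1.0):
    overlap = len(set(skills_a) & set(skills_b))  # shared skills
    return overlap_weight * overlap + semantic_weight * semantic_score
```

Raising `semantic_weight` relative to `overlap_weight` shifts the ranking from people with literally the same skills toward people whose skills are merely close together in embedding space.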
and we can also recommend people given a set of skills so if we have a set of skills here 01:09:04.220 |
we can just do a vector search on those skills and then, actually this one, 01:09:22.220 |
yeah here it is, so basically the query was broken out into two parts just because this 01:09:30.220 |
is kind of a big thing to look at but the idea with this is we can basically do a 01:09:35.180 |
vector search on skills, get semantically similar skills, and then find people who know those skills so 01:09:41.980 |
very similar to some of the last ones and then we can get a skill count for all of those groups and get 01:09:46.860 |
people back when we actually define the functions for our agent we're going to create here a skills 01:09:56.380 |
object which basically is just going to help us with some of our function arguments and returns 01:10:01.900 |
but basically first tool, retrieve skills of person, very simple query, and then we'll have down here 01:10:11.260 |
for tool two, when we say find similar skills, what we're going to look at here is again 01:10:18.060 |
that query where we're going to do that semantic similarity between skills so we're going to do a 01:10:22.700 |
vector search to find skills and then we're going to go out one hop on semantic similarity um and then we're 01:10:28.380 |
basically going to collect everything and return it and then for the third one we're going to do that 01:10:34.940 |
weighting because this tool three is going to be for person similarity we're going to do that weighting 01:10:39.900 |
between the similar semantic and the similar skill set with that larger query and then i know i'm going 01:10:47.420 |
through this kind of quickly for the fourth one where we say find person based on skills here again our 01:10:53.580 |
entry point is going to be a vector search on skills um going out to match those semantically similar 01:10:59.580 |
skills but then kind of at the end of that uh we'll add on a traversal that will attach the person to 01:11:06.220 |
knowing those skills count who knows the most and then effectively return that so those are the four 01:11:11.980 |
tools that we're going to end up using for this agent when we set up the agent here if you're familiar 01:11:17.740 |
with how LangGraph works basically we get our llm, we test that it's alive, we define our list of 01:11:27.260 |
tools, and if we didn't want to do this in an agentic way we can just bind our tools to our llm 01:11:33.420 |
and we can invoke our llm with tools but what we're going to do instead, this is just showing 01:11:42.060 |
invoking the different tools, is we are going to go and run it with an agent so we're 01:11:49.100 |
going to use create_react_agent which comes from LangGraph, it's one of their pre-built agents that 01:11:55.900 |
uses the ReAct, i don't know if you'd call it a framework but sort of that methodology, to 01:12:02.860 |
build an agent and effectively once we do that we give it the llm, we give it the four tools that we 01:12:11.820 |
had and then um we can see that here we're just testing we're saying hi and we're just making 01:12:19.100 |
sure we get some response back there's a utility function here just to make it easier running in 01:12:24.780 |
the notebook which will basically just do some of the streaming 01:12:31.100 |
methodology so i can just say hey what skills does Christoph have and then if i run that, and i don't 01:12:37.660 |
know if i need to rerun my agent here, looks like not everything's running, so you'll 01:12:45.820 |
see it says, when i ran that, and actually i ran it for the wrong question here, what skills does Christoph 01:12:53.900 |
have, person named Christoph, and then it will bring back his skills so you see there it will choose to use 01:13:00.460 |
retrieve skills of person and similarly if i went down and i said you know what skills are similar 01:13:07.980 |
to Power BI and data visualization it'll go through and choose the appropriate 01:13:14.780 |
tool for the job so in this case find similar skills, and it'll pull those back 01:13:24.460 |
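The tool choice itself is made by the LLM inside `create_react_agent`, driven by each tool's name and docstring. The crude keyword router below only illustrates the question-to-tool mapping being described; it is a stand-in, not the real selection mechanism, and the tool names are taken from the four tools listed earlier:

```python
# Stand-in for the agent's tool selection: map a question to one of the
# four tools. In the workshop an LLM makes this choice; keyword matching
# here is purely illustrative.
def choose_tool(question):
    q = question.lower()
    if "similar" in q and ("person" in q or "people" in q):
        return "person_similarity"
    if "similar" in q:
        return "find_similar_skills"
    if "who" in q:
        return "find_person_based_on_skills"
    return "retrieve_skills_of_person"
```

The ordering matters: "similar" plus a person reference routes to person similarity before the plain skill-similarity check can fire, mirroring how the agent distinguishes the two similarity tools.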
and you'll see going down right if i said well what person has similar skills to you know this other 01:13:30.780 |
person here um then it will know oh i need person similarity so we'll go ahead and use that specific 01:13:36.700 |
tool so in this case what we're doing is we're providing a bunch of tools that are presumably expert 01:13:41.900 |
tools that we can give to the model and then it will know that okay i have to go ahead and pull those 01:13:47.580 |
those specific tools to be able to provide a response and then there's a little app down here 01:13:53.900 |
as well if you wanted to run the chatbot so it's a little Gradio app here 01:14:01.500 |
but basically if i ran that i can go ahead and come in here and then i can have a little conversation with 01:14:09.980 |
it so this is very small but what skills are similar you know i can go ahead and ask it in here 01:14:17.260 |
and then provided everything's working it'll go ahead and choose the appropriate tool 01:14:23.580 |
and then i can say well who knows you know maybe i'll just say those skills and it should 01:14:32.860 |
go ahead and pull the appropriate tool to be able to find out who knows all these different skills right 01:14:41.820 |
and if i go back to my uh to my example here i should see the query logic that it used so first 01:14:48.940 |
you know it said find similar skills to what i just mentioned because i asked about power bi 01:14:54.380 |
and then after that i asked about people who know skills so it said find persons based on similar skills 01:15:01.340 |
and likewise i can say you know who is similar to those people in the graph 01:15:10.140 |
and it will likewise go through and it should you know understand that it needs to use the find other 01:15:16.700 |
similar persons tool to be able to do that 01:15:23.340 |
so you'll see if i was to keep going down i should get calls here to find persons with similar skills, 01:15:32.860 |
we have it here, yeah, person similarity, so it just called the person similarity tool for each person 01:15:38.540 |
and i know we only have a few minutes left, if you wanted to run this further basically 01:15:46.300 |
i have a text-to-Cypher example so this is where 01:15:50.460 |
i'll have an example of passing it the annotated schema so this is getting to kind of what you 01:15:55.020 |
were asking about right where i've provided these descriptions so it's sort of like annotations for 01:16:02.700 |
the schema as well and then i can go ahead and give that to an aggregation query function, 01:16:09.100 |
there's also an llm inside of here that will create the Cypher, but you can see in here i asked 01:16:15.260 |
some questions like describe communities and it was able to understand that it needed to grab 01:16:20.380 |
the match person knows skill pattern and then it needed to grab the Leiden community so it knew from the 01:16:26.460 |
schema right that it needed to generate this Cypher and there's a couple more examples of that in the notebook 01:16:32.460 |
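One way the annotated schema might be spliced into the generation step is as plain text inside the prompt. The schema text, the `leiden_community` property name, and the prompt wording below are all invented for illustration; the notebook's actual helper wraps an LLM call that returns the generated Cypher:

```python
# Illustrative annotated schema: pattern plus per-property descriptions
# (names are hypothetical, modeled on the workshop's person/skill graph).
ANNOTATED_SCHEMA = """
(:Person)-[:KNOWS]->(:Skill)
Person.name: string, the person's full name
Person.leiden_community: integer, community id from Leiden clustering
Skill.name: string, canonical skill name
"""

def text_to_cypher_prompt(question, schema=ANNOTATED_SCHEMA):
    """Build the prompt a text-to-Cypher LLM call might receive."""
    return (
        "You are a Cypher expert. Given this annotated graph schema:\n"
        f"{schema}\n"
        "Write a single Cypher query answering the question.\n"
        f"Question: {question}"
    )
```

The annotations are what let the model connect a vague question like "describe communities" to the right property, which a bare label-and-relationship dump would not convey.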
are there any questions, i know i just went over a lot, are there any questions from that that are worth 01:16:40.540 |
answering now while we have just a couple minutes left how long will the Jupyter server be up if 01:16:52.620 |
you want to play with this the Jupyter server is going to go down very quickly but if you look at the deck 01:16:57.740 |
at the end of the deck i have a link to the code and the data is all in GitHub too 01:17:04.540 |
so basically if you go here that's the GitHub repository so you can play with it 01:17:10.300 |
what's that, the deck is in this? do you have access to the Slack channel? so the deck is in the 01:17:19.020 |
Slack channel and i'll go ahead and jump to that in a second but there's the GitHub repository, you can 01:17:27.500 |
use the Aura console, we have a free trial that you can use, you can just set up a cloud database and you can 01:17:33.660 |
load the data into there also before you guys leave there's a meetup happening tomorrow 01:17:39.820 |
it's tonight, oh sorry, tonight at five, and there's a link there for more information on that 01:17:50.300 |
and then we also have another workshop at one o'clock where today was very simple like we're going 01:17:58.620 |
to go over more graph analytics type of stuff in that workshop so like the community stuff that i 01:18:03.500 |
was doing we're going to dive more into depth on that in that workshop other than that come 01:18:10.860 |
by our booth if you have more questions, we're going to be, wherever, right, there it is, i don't think we have a 01:18:16.460 |
big expo hall but if you want to see Neo4j MCP servers, ADK examples, more knowledge graph construction, 01:18:23.100 |
it's all a great place to come to ask all those types of questions