Graph Intelligence: Enhance Reasoning and Retrieval Using Graph Analytics - Alison & Andreas, Neo4j

Chapters
0:00 Intro
5:47 The Basics
7:57 External Data
13:43 How does it work
15:09 What other elements are there
21:57 Graph Analytics Workshop
25:02 Power of Graphs
27:32 Create Instance
29:17 Database
30:42 Account Creation
31:37 Neo4j
32:17 Agent Neo
33:32 Documentation
34:7 Conversations
37:47 Chunking
43:39 Dual Labels
47:33 Backup and Restore
51:07 Notebook Setup
57:14 Connect Cluster Curate
00:00:14.000 |
Hey y'all, how are you doing? How's the day? I'm pretty excited to be here. I was 00:00:20.960 |
here last year and then we were at the summit, the agent summit, not that long 00:00:24.320 |
ago. I've definitely seen some familiar faces in all of those places. So thank 00:00:29.480 |
you for coming to our conversation today. My name is Alison Cossette. I am with the 00:00:36.320 |
developer relations team. You will likely recognize this face from all of the 00:00:40.360 |
GraphRAG videos on DeepLearning.AI, chillin' with Andrew Ng, like, you know, all the 00:00:46.360 |
guys that they had. Yeah, it's just how you roll. So what I'm gonna be talking to 00:00:52.460 |
you about today, so was anybody in the Neo4j workshop this morning with Zach? Oh, this is great. 00:00:58.960 |
I love it. It's like Neo4j here. So yeah, we're gonna do a few things. So that's good. You may have, well, 00:01:09.840 |
no, I'm gonna just go like you didn't do that. Okay. So this is our repo that we're gonna be working with today. 00:01:15.800 |
It has, yes. So is it? Is it not there? Because I can see your profile, but if I can't, you see, I do. 00:01:26.680 |
You know, that would not surprise me that I made that mistake. One moment, please. 00:01:30.620 |
Let's see my settings. Yeah, it probably, I think it defaults to... 00:01:37.560 |
I know it's in here somewhere. Change visibility. Change to public. Sorry. 00:02:11.640 |
Yes. Confirming access. Okay. Is that better? Excellent. All right. Sorry about that. 00:02:23.480 |
Oh, so you can actually see. Yeah. I'm always in dark mode. Sorry. Can we do, can we do that? I don't 00:02:33.320 |
know. Just these lights along here? Yeah. Just the lights in front of the video screen. 00:02:41.400 |
We'll work on that. I love that people are raising their hand. This makes me very excited. I love talking 00:02:47.480 |
to people. I love being in the room with folks. So this is good. So I'll tell you a little bit about 00:02:53.480 |
what our goals are for our next, you know, 120 minutes together. What we're focusing on, obviously, 00:03:01.320 |
a lot of you went to the intro to GraphRag this morning. How many people have been working on RAG 00:03:06.520 |
applications already? Yes, yes, yes. Excellent. How many people have been working on GraphRag already? 00:03:13.960 |
A good number. Okay. So what I'm going to be doing today is talking specifically about how you can use 00:03:22.280 |
graph data science to improve what you're working on in your applications. The reason that we are doing 00:03:30.840 |
this is because one of the beauties of having a Graph is you now get access to new ways of thinking and 00:03:38.280 |
understanding about the data that you have. And so my goal today is to have everybody feel comfortable 00:03:45.960 |
with some of the basics of the algorithms of Graph. And, you know, hopefully while you're working through 00:03:52.520 |
this, you can see something that's going to be of interest. There might be something that you want to 00:03:56.840 |
try when you get back to your application. I do this a lot, and I will say the most successful workshops 00:04:03.960 |
we have are when people bring their real-world scenarios to bear. So we will be going over the code, 00:04:11.240 |
but a lot of what we're going to be doing is talking through the concepts and around 00:04:16.600 |
Graph as a tool for managing your RAG. And we're really just going to be talking about Graph algorithms. 00:04:25.000 |
Anybody here come from the data science side of things? Oh, all right, we got a few, we got a few. 00:04:30.840 |
One of the things that I think is really interesting about the term AI engineer is, you know, somebody 00:04:39.240 |
said, "Oh, it's just engineering. Like, it's still pipelines." I'm like, "No, it's not. It's different." 00:04:44.120 |
Right? I came initially from the AI side, and I can tell you that none of us, none of us look at things the 00:04:49.880 |
same way that we did, you know, a few years back. It isn't the data scientists on one side and people 00:04:55.160 |
putting things into production on the other. You know, we work really closely. And so what my hope 00:05:01.320 |
now is like the, there's this concept that people say all the time that, you know, developers are just 00:05:06.680 |
building the pipes and that the data scientists are worried about the quality of the water inside the 00:05:11.480 |
pipe. But with AI engineers, we're all worried about the pipes and the water, right? So what I really want 00:05:18.040 |
is to give you as much as I can about the algorithms and about the data science and get us rolling. 00:05:28.200 |
So let me see. Where's my repo? Where's my repo? Okay. Um, yeah. So just clone this repo. You can, 00:05:34.760 |
you'll have it. We'll, we're going to walk through it. It's, um, it's, it's really just like a click 00:05:39.160 |
through of a number of Jupyter notebooks. Um, but really like, it's all mostly about the conversation 00:05:45.080 |
that we're going to have today is where we're going. So obviously everybody here, like we always start 00:05:51.160 |
with the basics, right? I call them the, the GPS moments that let's review all the things everybody knows 00:05:55.560 |
first. Um, we know that in our rag applications, what we're really trying to do is we're really, 00:06:01.240 |
we're getting better answers, right? That's what everybody here is struggling with. Um, 00:06:06.200 |
like some of the common things I hear people struggle with are, um, am I getting the right data? Do I have 00:06:12.360 |
the right chunking strategy? Um, how do I handle temporal data? How do I handle data quality? Um, 00:06:19.880 |
what are, what are some of your biggest like rag challenges? Anybody want to share? 00:06:24.920 |
That what just irks you about what you've been working on? Everybody's so quiet. 00:06:30.120 |
I should have set up a Slack channel. Yeah. 00:06:31.640 |
So we're definitely going to talk about, we're going to talk about that today. Um, 00:06:47.560 |
I think one of the things that we found is that the management is, um, what? Oh, sorry. Um, the, 00:06:54.520 |
the comment was, um, how to handle the volume of the data was a really big one. And the other element 00:07:01.240 |
was understanding the relationships among the data. So did you go to this morning's workshop? 00:07:07.240 |
No. Okay. Well, you'll, we have lots of video on it, so we can, we can definitely get that to you. Yes. 00:07:12.840 |
I would add like temporal relationships. Yes. Yeah. Temporal is a really interesting thing. Um, 00:07:19.720 |
I don't, I didn't, I don't have temporal in the slides, but we can definitely talk about temporal 00:07:23.960 |
because it's really important for sure. Um, one of the things that we find is, you know, I, I will, 00:07:29.720 |
I will give you my apologies up front. I get super excited about graphs. So I apologize in advance if I am 00:07:36.760 |
overly enthusiastic about what I do because I love graphs. And there's so many times where something 00:07:43.000 |
comes up and you're like, Oh, well, if you use graph, you could address it this way. So I just want to, 00:07:47.880 |
you know, keep bringing these things up and let's talk through all of them. But what we're really 00:07:52.920 |
looking at is how do we give this complete and curated response, right? What are we using for 00:07:59.560 |
the external data? I mean, initially everybody drops their stuff into Pinecone or Weaviate, you hit a vector 00:08:05.480 |
store. Um, is anybody now working with multiple databases feeding into their RAG system? A little 00:08:12.840 |
bit. Yeah. Still primarily vectors, primarily vectors. Yeah. Yeah. I mean, everybody uses vectors 00:08:19.880 |
because they work right. Um, and that's the starting point. But the thing about the vector 00:08:24.600 |
is the vector really is, the analogy I always say is it's like you have a bunch of little index 00:08:29.960 |
cards. And so imagine it's my first day on the job and everything there is to know about 00:08:34.760 |
what I have to do is on an index card. And I say, Oh, well, what do I know about this? 00:08:39.400 |
And then somebody finds the index card with this little blurb. And so like, I've just got these little 00:08:43.800 |
things, but I don't understand how it all puts together. Right. We don't understand what is it connected 00:08:48.600 |
to, right? What is it connected to? What context do we need? And so what I want you to start thinking 00:08:55.320 |
about is what are the projects that you've already been working on? When you're getting those answers 00:09:00.760 |
back, what are some of the pieces of information that you wish your vector had? Right. Anybody have 00:09:08.200 |
an idea of something they, you don't have to share, but like you've thought about like, Oh, if only it 00:09:12.200 |
could do this, right? If only it knew about that, I would get a much better answer. If you haven't, 00:09:17.720 |
then I'm sure like the, your business folks that you're responding to have because they always have 00:09:22.200 |
an opinion. Right. So one of the things about the knowledge graph is that it's a way for us to bring 00:09:29.720 |
our structured data and our unstructured data together. The other thing that it will do is it will 00:09:36.280 |
actually find knowledge inside of your unstructured documents, right? So one of the things you brought 00:09:42.520 |
up was how do we create the relationships among these documents or among these pieces of information? 00:09:48.600 |
And there's a couple of ways that happens. One is it's really explicit, right? There's a table of 00:09:55.720 |
information inside this document and you can pull that out, right? Old school regex, everybody's friend, 00:10:01.480 |
or it's something that is less explicit and more implicit. And that's where the LLMs come in, 00:10:09.160 |
finding out what is in there. So in this morning's one, and if you want to hit us up, we can definitely 00:10:15.640 |
show you how we actually extract and build the knowledge graph. What we're working with today 00:10:20.760 |
is I have a database already set for you folks that is a data dump that we're going to load into 00:10:26.840 |
your pro trials. But the thing is this data comes from lots of different places. 00:10:31.640 |
So I guess most people here are familiar with the graph, but just in case I do like my little GPS 00:10:39.000 |
moments, let's talk about all the things we already know. What is a graph, right? Basically we start 00:10:44.760 |
with nodes. Nodes are entities, they're your nouns, they're your items. Some people relate them to the 00:10:51.160 |
tables in your relational database. These are your table names, right? But the important thing is the 00:10:57.320 |
relationships among those entities. So a customer and a product, a product and a part, a supplier and a buyer. 00:11:05.000 |
It's the relationships in and among these pieces of information that's where our business and where our 00:11:10.520 |
information lives. In this case, we've got two people. One of them owns a car, one of them drives the car. 00:11:15.720 |
They do know each other. I'm not sure why person one lives with person two, but not the other way around. 00:11:20.440 |
Who knows? Life is interesting, right? So we have these relationships and this tells us what is the 00:11:27.160 |
the story of the interaction between and among these. And properties. This is where we get into the nitty 00:11:36.040 |
gritty. We get into the details. These are those extra columns within the table, right? Our relationships are 00:11:42.200 |
our foreign keys. Our properties are the columns. The entities are the tables, like roughly, right? 00:11:48.760 |
You know, don't quote me on it, but pretty close, right? As from a conceptual point. These properties 00:11:55.160 |
can live on the entities. They can also live on the relationships, right? So we can see that this person 00:12:02.040 |
has been driving the car since this date. It's a very old car now. But one of the other things that we're 00:12:07.960 |
going to see is we see underneath this car, we see a couple of things. We've got the brand, 00:12:14.520 |
the model. We have a description of the car. And what you'll see underneath is the description embedding. 00:12:20.200 |
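(As a rough sketch of that little example in code: the labels, relationship types, property names, and the placeholder embedding below are all hypothetical and just mirror the picture on the slide, and the connection assumes NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD are set in your environment.)

```python
import os
from neo4j import GraphDatabase

# Assumed environment variable names; adjust to your own setup.
driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
)

# Hypothetical version of the slide: two people, a car, properties on a relationship,
# and a description embedding stored right on the Car node.
driver.execute_query("""
    MERGE (p1:Person {name: 'Person One'})
    MERGE (p2:Person {name: 'Person Two'})
    MERGE (c:Car {brand: 'Volvo', model: 'V70', description: 'A sturdy old station wagon'})
    MERGE (p1)-[:KNOWS]->(p2)
    MERGE (p1)-[:LIVES_WITH]->(p2)
    MERGE (p1)-[:OWNS]->(c)
    MERGE (p2)-[d:DRIVES]->(c)
    SET d.since = date('1999-01-01'),
        c.description_embedding = $embedding
""", embedding=[0.0] * 8)  # placeholder vector; in practice this comes from an embedding model

driver.close()
```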
So we're all familiar with the embeddings, right? The embedding is a way of taking some piece of text, 00:12:25.720 |
putting it into, you know, many, many dimensions, and creating a vector in space that represents the 00:12:32.600 |
semantics of whatever that text is. One of the things that's really great about a graph 00:12:40.200 |
is that you have immediately that connection from the embedding to all of the other information. 00:12:46.600 |
We see it in this microcosm here of the car, the embedding of the car. And you can pretty quickly see, 00:12:52.760 |
oh, if I did a vector search on, you know, this car, right? And I wanted to know who are the people 00:13:00.280 |
that own cars described like X, it's a pretty quick traversal. You're going to do a vector search. It's 00:13:06.200 |
going to take you to that node, that entity, right? Similar to your chunks that you've already been 00:13:11.640 |
working with. But immediately, you already have all the relationships built in because they're already in 00:13:16.200 |
the database, right? We don't have to, we don't have to like create new tables and views. Everything's 00:13:22.440 |
already there and it's connected. So when you go into graph rag and you're working with your retrievers, 00:13:28.920 |
the beauty of the retriever is in this ability to traverse and traverse very quickly and get that 00:13:35.880 |
immediate context, right? So these are the basic components of what we're going to be working with 00:13:42.360 |
within the database. So we're going to start at the very beginning, right? How does this, 00:13:47.960 |
how does this work? We start with the entity. The first entity is the document and the document has a 00:13:52.680 |
chunk, right? Pretty straightforward. It's a row in your vector store, right? But this is the interesting 00:14:00.920 |
thing. You also have like, anybody here working with agents already? MCP servers? Yeah. The other thing that 00:14:09.800 |
we get and what we're going to be talking a lot about today is the application. I call it the application 00:14:14.600 |
graph. Like they call it a memory graph. For me, it's what is the actual activity that's happening 00:14:19.720 |
in the system? What is, what is it doing? What are the messages going in and out? What are the context 00:14:26.760 |
documents that are being used, right? Everything that we're building is moving quickly and it's moving at 00:14:33.640 |
volume. There's massive amounts of documents and data. There's so much volume. How do we actually 00:14:40.680 |
manage it? What do we do? This is going to be a key part of what we talk about today, which is the 00:14:46.920 |
actual understanding of what's happening in the application and how by monitoring and looking at 00:14:52.920 |
what's going on in the application, you can do things like manage your documents at scale. You can 00:14:59.080 |
understand what are the actual important documents, what's influencing outcomes, right? I've got some 00:15:06.680 |
good stories to share with you about what people have been working on. One of the things you mentioned 00:15:11.560 |
is what other elements are there. So when we look at the entirety of the graph, we have like the memory 00:15:16.120 |
graph for the application. It's connected to the chunks and then those chunks are connected to the 00:15:20.600 |
domain, the products, the people, and basically all the other structured pieces of your business or the 00:15:26.120 |
structured elements that you've extracted from those unstructured documents. So what you end up with is 00:15:32.120 |
you end up with a fully connected network of your data and your system together. So much of the way that 00:15:42.840 |
we have approached development prior is you had your data people over here and you're building people over 00:15:47.720 |
there. We're all together now, right? We all work together and the way that we have to build is we build from 00:15:55.160 |
that cohesive point. We build from the data and the application at the same time. So I'm just going 00:16:00.120 |
to show you what becomes possible as you start taking this kind of an approach. Everything that we're 00:16:06.440 |
looking at today is just this little bit of the graph, right? So we're going to be looking at prompts and 00:16:11.400 |
responses and we're going to be looking at the context documents that they're connected to 00:16:16.280 |
and then from there we're going to say, okay, let's figure out where we go now and what we can start to 00:16:21.960 |
understand. Anybody have any questions so far? All sounds really obvious, like why are you still talking about 00:16:29.560 |
this? Too slow, too fast, are we good? You're good, okay, all right, good pacing. If I go too fast because I'm from New York, I'm a fast talker, let me know. If you're getting bored, just kind of give me one of these and be like, okay, let's go. 00:16:44.440 |
So that's fine, too. But the best thing that you can do is like when you have a question, 00:16:50.520 |
just raise your hand because everybody has the same kinds of questions. Everybody's struggling with 00:16:54.920 |
things. And what I love about AI engineer is I get to be with the people who are doing the things that 00:17:01.400 |
are like you are the people that are making everything happen right now, right? I mean, our entire world is in 00:17:06.680 |
like just this incredible transformation and you all are the people that are building it. And so I want to be 00:17:12.280 |
here to support you in this moment because it's a pretty exciting time. So please don't be shy. Questions, comments? Yes. 00:17:23.480 |
Oh, I will drop it in. It's not in right now, but I will drop it in before we leave. Thanks. 00:17:29.560 |
Yeah. All right. Um, yeah, so they're all the same system. They're really just sort of like neighborhoods, 00:17:45.640 |
right? They're kind of neighborhoods in the system. Um, the reason that we draw it out is because 00:17:51.640 |
oftentimes people think of them as separate, right? So, um, this one doesn't have everything, 00:17:58.280 |
but what we see here is we've got, this has a, this one has the source. So the yellow nodes are things 00:18:03.160 |
that are coming from your application. The blues came from unstructured data. Um, we're not doing it today, 00:18:08.920 |
but if you had structured data that you were also pulling in, like, uh, bringing in a table, bringing in 00:18:14.040 |
some of, or whether it's, you know, already structured or has been brought out of your unstructured data, 00:18:20.120 |
that would be just a different color. It's all one system. But the reason that we do this is just 00:18:25.720 |
so that people conceptually understand where it's coming from and like, what does it do? Because the 00:18:31.000 |
big thing really is understanding the fact that once it's all connected, your application is talking to 00:18:38.920 |
all of the data and then you're structured and you're unstructured are all working together. And at any time 00:18:44.520 |
you have access to understanding across that system. It's like, you know, complex systems and 00:18:49.960 |
system thinking and all that good juiciness. So they're not separate. They're just neighborhoods. Yes. 00:18:58.840 |
Are you proposing that you have like one massive graph? 00:19:05.480 |
I mean, don't start with a massive graph, right? Like, let's be honest. Let's start with something small. 00:19:10.680 |
I mean, my first piece of advice is always start small. Um, the other thing to really, as always, we always want to look at optimization. We want to make sure that we're 00:19:19.800 |
we're building for the latency that we need. There's always going to be context. Yes. 00:19:23.720 |
Can you elaborate on that? They mentioned it this morning as well. 00:19:27.240 |
But when you're trying to map, for example, the ontology of, of an organization and be able to have an 00:19:35.240 |
How do you, how do you, how do you do small? Just break it apart? 00:19:39.240 |
Yeah. Um, well, I mean, that's probably an ABK question. 00:19:44.200 |
Yeah. So some of the complexity like for, for really big graphs, part of the bigness of it is 00:19:50.280 |
not just the number of nodes and relationships you have, but the number of labels and relationship 00:19:54.360 |
types that you have. And for like a large organization, don't over specify. So you start 00:19:59.960 |
up with like kind of generic terms, see if that works. And if you've got too many, too much volume in the 00:20:05.240 |
generic terms to segment, then you get more specific. And so, and the other part about that, 00:20:10.600 |
I suppose, is that you start with not all of the data, but like some of the data, work with it, 00:20:15.160 |
do some eval, figure out if it's doing what you want to, and then refine before you suck in 00:20:19.720 |
everything and create the entire database. So you kind of iterate on the schema with a subset of the 00:20:25.240 |
data. It's normal data engineering, I guess, right? 00:20:27.720 |
Do just some of the data, start generic, and then get more specific. Yep. 00:20:31.720 |
When you say big versus small, you're talking ontology, not knowledge graph, right? 00:20:42.280 |
Correct. Yeah, that's right. So on, so talking about the ontology or the schema, 00:20:46.200 |
like the, how the graph is actually laid out, rather than the number of each kinds of things, 00:20:50.920 |
right? So starting there, iterating on that first, and then bringing all the data. 00:20:56.200 |
What are your thoughts on the raptor indexes? Are you presenting indexes in a tree-based 00:21:00.360 |
structure based on some hierarchy versus knowledge graphs? And what performs better in what scenarios? 00:21:06.360 |
The raptor-based indexes, like if you represent content in more hierarchical tree-based structures, 00:21:15.560 |
and then you are having summaries in hierarchical nodes, searching through those versus knowledge 00:21:22.040 |
graphs, which scenarios plays out better? Like we should go with raptor indexes versus knowledge graphs? 00:21:27.800 |
The right answer is the annoying answer of it depends. And so you should have your eval drive the 00:21:35.400 |
structure, right? So come up with some eval. What are the, what are you actually trying to do? What 00:21:38.600 |
are the questions people are going to ask? And it's the usual thing, the questions people that you're 00:21:42.520 |
going to ask or people are going to ask should drive the data model here, the ontology. But that also 00:21:47.480 |
ends up meaning, do you put things into summarization over communities? Or do you have the knowledge graph 00:21:51.640 |
representation of it? This can be driven by the questions. Yeah, makes sense. Yeah. So if you can find 00:21:59.880 |
your way to that workshop-neo4j channel, I've just pasted the OpenAI key in there. You're going to end up 00:22:04.440 |
needing that at some point while we get this repo updated. And I guess in the meantime as well, 00:22:10.440 |
I'll just answer any arbitrary questions you have. Yeah. 00:22:14.200 |
Yeah, it's neither of those. Really? It's the graph analytics workshop. What? 00:22:23.640 |
No, I know. I've done so many different workshops lately. Like it's, yeah. I'll clean up the repos. Sorry. 00:22:29.640 |
Slack channel, yeah. Oh. No, so. Oh, you have a Slack channel. There's a Slack channel in the 00:22:33.960 |
AI engineer's workspace. Oh. Oh, sorry. There's two. Yeah. Okay. So it should be the workshop 00:22:38.120 |
-neo4j1, not the workshop-neo4j2025. I didn't realize there was yet another one. Okay. All right. 00:22:45.800 |
All right. Yeah. Thank you, Devon, for putting in the GitHub link. Where are you? Oh, I know. Hold on. Hold the phone. Because... 00:22:57.640 |
So one of the things that we're going to end up exploring here, and we touched on this a little bit this morning, 00:23:08.600 |
some of the motivation behind both GraphRag and then also doing graph analytics on top of the graph. 00:23:12.840 |
One of the aspects I always like to call out, and it isn't as obvious, I think, until you get pointed out, 00:23:19.960 |
and then it's super obvious, is that part of what we're doing is for the core data you start with, 00:23:25.240 |
particularly with unstructured data, if you're doing chunking, of course, that by using GraphRag and graph 00:23:31.320 |
analytics, you're expanding the number of answerable questions for the same base data set. 00:23:36.760 |
That if you would say roughly, for every chunk you've got, you can answer one question, right? 00:23:42.280 |
And that's linear. Number of chunks equals number of questions you can answer. 00:23:45.240 |
If you think for every chunk you've got, you do the, and this is the upper bounds, take every pair of 00:23:51.320 |
chunks, do the Cartesian product of all the chunks, say, "What do these two chunks have together?" 00:23:55.320 |
Have in common, or "How are they different?" Create relationships or don't create relationships in them. 00:24:00.440 |
In the upper bounds, that is now, you know, n squared number of possible connections. 00:24:06.120 |
Every time you do that, that's new information that's possible. 00:24:09.480 |
So now you've got an order of magnitude more, you know, questions that can be answered. 00:24:13.720 |
Once you've got everything connected, you can then also do the community detection stuff that they touched on this morning, 00:24:18.520 |
and we'll get into maybe a little bit today as well, that now that you've got all these chunks connected, 00:24:23.720 |
it's not just the pairwise connections that are new information, it's the subsets that are through 00:24:29.400 |
those connections that also can answer new bits of information. So that if you have, if I'm in the graph, 00:24:35.160 |
and you're in the graph, and you're in the graph, you know, or you over there, whoever, maybe Allison, 00:24:40.120 |
that it's not just us individually, but like how we're related and who we know, the connections between 00:24:44.600 |
us and the connections across everybody else, all that ends up being slices of the graph that are particular 00:24:50.280 |
to the individual places that you've started. That ends up forming all this, it's not the full power 00:24:55.880 |
set of all the subsets within the set, but it's a huge amount, like if you started with like 10 00:25:01.800 |
chunks, by just going to the pairwise connections, you end up with 100 answerable questions. And once you 00:25:07.800 |
start doing the sub communities, so the subset of all the sets, at the upper end, it ends up being 00:25:14.040 |
exponential. This is the power of graphs. Yes. It's two to the n instead of n squared, right? 00:25:19.480 |
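(A quick sanity check of that counting argument. These are upper bounds, not things you would actually materialize, and the 100 in the example is the looser n-squared bound:)

```python
from math import comb

n = 10  # number of chunks in the example

individual = n                 # one answerable question per chunk
pairwise = comb(n, 2)          # unordered pairs of chunks: n*(n-1)/2, so O(n^2); 45 here, ~100 as a loose bound
communities = 2**n - n - 1     # every subset of two or more chunks: the exponential upper bound

print(individual, pairwise, communities)   # 10 45 1013
```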
Would you love math? Thank you, ABK. Okay, so we have pushed, if you have pulled, you will want to 00:25:25.560 |
refresh this pull, this fork. So what you're going to see is a few different things. One is, I do have a 00:25:32.440 |
link actually to the slides. And then, actually we're not doing Docker, you can ignore that. But when we go 00:25:40.040 |
down to the bottom in connecting to Neo4j AuraDB, you're going to see the link to, you're going to see 00:25:46.600 |
the link to console.neo4j.io. So we're going to have everybody go there. When you get in there, you're going to 00:25:53.960 |
create a new instance. Oops, sorry. Yeah, it's console.neo4j.io. This is in the readme in the 00:26:07.240 |
repository. Do you not? Let me see. Is it in the Slack channel? Which Slack are we? Are we 2025? 00:26:18.680 |
Not the 2025. We are not 2025. We're old school. We're not 2025. We're either in the future or deep 00:26:26.520 |
in the past. But we are not right now. All right. We're the other one. So when you go in there, 00:26:33.000 |
you're going to have a couple of different options for free tiers. One is the Aura Free, which is 00:26:39.400 |
smaller. It doesn't have as much of the optimization as the Pro Trial. So we're going to be using the Pro 00:26:45.960 |
Pro Trial today. I believe it's 14 days right now on the Pro Trial. But you can, if you want to work 00:26:51.960 |
on your own project later, go in and start an Aura Free instance as well. And that, as long as you're 00:26:58.280 |
querying it every so often, it'll stay open and it will always stay free. Yes. 00:27:04.360 |
No. It's your choice. Your choice. Oh, sorry. The question was, is there a preference for 00:27:16.040 |
cloud provider? And I said, no, there is not. The more you spread it out, the easier it is. We move 00:27:22.920 |
along. Yeah. So you're going to get the free trial up and running. And so what that is going to look 00:27:30.360 |
like when you create the instance, you're going to see a couple of options. You're going to go with 00:27:35.560 |
the AuraDB professional. We're not taking your credit card. We're not doing anything. Eventually, 00:27:39.160 |
it'll turn off. But it just is the most robust version that we have. So I guess it's seven days, 00:27:44.360 |
seven days. So you're just going to try that for free, and you're going to get that up and running. 00:27:48.920 |
Why do we still have an AuraDS button? No? Okay. All right. Don't worry about that. Yes. 00:27:54.440 |
Oh, yeah. Onboarding. Sorry. Yeah. It just passed. I mean, it's going to ask you some questions. 00:28:08.840 |
You can just like, it does that thing where it makes you make the graph. Oh, yeah. Sorry. 00:28:15.880 |
Yeah. It doesn't. I mean, anything is fine. Like, don't tell anybody. It's a marketing thing. Like, 00:28:22.040 |
just pass through it. I want to just get you enabled. And then when you go through, 00:28:27.880 |
you can just use the default. It's going to default to four gigs. You know, if you want to name your 00:28:32.920 |
instance, you may. One of the things that's going to -- you can definitely turn on Graph Analytics and 00:28:38.520 |
Vector Optimization if you want. And then just click Accept, and it'll start up your instance, 00:28:44.840 |
and it'll take a couple of minutes. Once that instance is up and running, we'll get you to the 00:28:49.400 |
dump and we'll get that loaded in. So I'm just going to give everybody a couple of minutes to work on this. 00:28:53.640 |
If you get stuck on this process, just raise your hand because I've got lots of my people, 00:28:59.560 |
my Neo4j people, wave. Alexi, that's you too. 00:29:03.880 |
Alexi, yes, he's being called on in class, right? He wasn't asleep, so that's good. He was just 00:29:10.760 |
working. Yeah, so if you get stuck on anything, just let us know, okay? But what I can do, 00:29:17.560 |
while we're -- while y'all are getting that up and running, is I can show you a little bit about what is 00:29:22.280 |
inside of our database. The console, it makes it really easy. Not only do you have access to your 00:29:29.800 |
instances, you have access to Graph Analytics, but, you know, the basic IDE, which is the query, 00:29:36.280 |
will be connected as well. So got one of each. Let's see what's in the database. 00:29:44.680 |
Oh, I know. That's right. I emptied it because I wanted you guys to be able to, like, do the dump 00:29:51.000 |
with me. Okay. Oh, no. There it is. Okay. All right. Okay. So we've got the query. And in the query, 00:29:56.680 |
for those of you who love your Cypher, you know, we can say return n limit 10. 00:30:05.240 |
All the Cypher that we're using today, like, I've already got in the notebooks. But what you'll see 00:30:11.880 |
is when you do a query, querying in Cypher is really all about the pattern. So in this case, 00:30:17.480 |
I'm just doing a match n. You probably can't see. Yeah, I was going to ask. Can you see that okay now? 00:30:22.520 |
Yeah, I'll fix it. Hold on. I got to make this so people can see all the things. See all that? No, 00:30:28.280 |
that's not going to work. I'm going to zoom in. Sorry. Yes? 00:30:34.920 |
Oh, okay. Okay. If you already have an account. Oh, you know what you can do? If you already have an 00:30:47.640 |
account, one of the things you can do is you can log back in. And if you change your email to your email 00:30:52.120 |
plus, like, workshop before the @ symbol, it'll give you a fresh account. 00:30:56.840 |
So, like, if, like, mine, if I'm, like, Allison at, you know, Google, if I do Allison plus workshop at 00:31:05.400 |
Google, it'll still be your email address. It'll still go to you, but it will be seen as a new 00:31:11.240 |
account. So that's another option. Yeah, just create a new account. It'll make your life easy. 00:31:16.840 |
Has anybody been able to get theirs up and running? Ish? No? Yeah, maybe. It's so interesting, 00:31:27.720 |
because lately I've been noticing, like, people are so interested in engaging in conversation. 00:31:31.960 |
There's less of that. But we definitely don't want to leave anybody behind. So we'll let you all do that. 00:31:36.280 |
So, but just to understand what is actually happening inside Neo4j when you're working with 00:31:41.480 |
it, we run these patterns, right? So it's similar to, you know, SQL. Like, the good news is it's very 00:31:49.160 |
user-friendly, Cypher. And we live in the age of, like, vibe coding and LLM. So, like, there's plenty of help 00:31:56.760 |
for all of this within our co-pilots. But depending on what you get back, in this case, you can look at it 00:32:03.080 |
as a graph. You can see the table. Or you can see, you know, just the raw values, but the graph itself. 00:32:09.640 |
So in this case, we're looking at a particular document, which is actually, you know, we call it 00:32:14.520 |
a document. It's really the chunk. I will tell you that what we have in our, what we, the data that we're 00:32:21.160 |
working with is from a project that we actually have live. And it's called Agent Neo. And it was built a while ago 00:32:29.800 |
now with, by some of the guys who were helping build the first RAG application on our graph data 00:32:35.320 |
science platform. And our knowledge share there. So this has been in play for a while. But when we look 00:32:41.480 |
at the data model itself, right, what you're going to see is what I talked about before, right? That 00:32:45.400 |
application understanding. So we've got a session. So somebody logs in to the application. 00:32:50.360 |
They start a conversation. They have their first message, right? This is the user prompt. 00:32:55.320 |
Then that prompt is going to get an answer, right? There's a message that comes back from the assistant. 00:33:01.560 |
And then that assistant is going to be going out and pulling in context documents, right? So it's going 00:33:08.360 |
to do your cosine similarity. Who does Euclidean? I don't know. It's a thing, I guess. It does your cosine similarity, 00:33:15.240 |
and it's going to bring those documents back. And we're going to start to see the chain of what that 00:33:20.280 |
conversation looks like, right? So this is that application base. And then these are the chunks. 00:33:26.120 |
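(A rough sketch of how you could walk that shape from Python. The label and relationship-type names below are illustrative guesses, so check the schema of your loaded dump before relying on them; the environment variables are the same assumed ones as before.)

```python
import os
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
)

# Hypothetical pattern: session -> conversation -> user prompt -> assistant reply -> context chunks.
records, _, _ = driver.execute_query("""
    MATCH (s:Session)-[:HAS_CONVERSATION]->(conv:Conversation)
          -[:HAS_MESSAGE]->(prompt:Message:User)
          -[:HAS_RESPONSE]->(answer:Message:Assistant)
          -[:HAS_CONTEXT]->(chunk:Document)
    RETURN prompt.content AS question,
           answer.content AS response,
           collect(chunk.url) AS context_sources
    LIMIT 5
""")
for r in records:
    print(r["question"], "->", len(r["context_sources"]), "context chunks")

driver.close()
```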
So it's really just what we looked at in that previous view, but just to show you what it actually 00:33:30.840 |
looks like. So what's in this? You know, lots of documents from our documentation, developer blogs, 00:33:38.520 |
support. This was early on, our earliest chunking. We had a 512 chunk size. We used a Langchain 00:33:46.360 |
recursive splitter. So it was pretty early on. Let me see. Hold on a second. I want to show you. 00:33:51.800 |
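(For reference, that early chunking setup looks roughly like this with the current LangChain text-splitter package. Only the 512 chunk size and the recursive splitter come from the talk; the overlap value and the file name are assumptions.)

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # the chunk size mentioned in the talk
    chunk_overlap=50,   # assumed; the talk doesn't say what overlap was used
)

with open("neo4j_docs_page.txt") as f:   # hypothetical source document
    chunks = splitter.split_text(f.read())

print(f"{len(chunks)} chunks, first one starts: {chunks[0][:80]}...")
```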
Where am I? Oh, that's not what I want to show you. Where did it go? 00:33:58.920 |
Here we go. All right. So once the system actually starts running, what you see is this is actually one 00:34:12.600 |
user having two different conversations. This is their first conversation. This is their second 00:34:17.240 |
conversation. The first question is, can you build graphs for me? And then the next one was, what is GDS? 00:34:23.880 |
And then you see the beginning of what that response was. Yes, I can build for you. GDS stands for. 00:34:29.640 |
And so you can just to give you an idea of how this starts to build out. 00:34:34.760 |
Now, this might look a little bit frightening, but I'll walk you through what we're actually looking at. 00:34:39.400 |
So what does it mean when we start looking in practice? So this is the same user, right, with those 00:34:44.120 |
initial two conversations. What does graph data science? Yeah, what is GDS? This is also what is GDS? 00:34:52.920 |
And then can you build graphs for me? And what we're looking at here is these are the conversations 00:34:58.200 |
and these are the context documents that are being brought in. And what's interesting or plausible, 00:35:03.960 |
it makes sense, this person asked the same opening question. So it stands to reason that these same 00:35:09.080 |
documents are going to be used in both of these first responses, right? Makes sense. Then what we find is 00:35:16.040 |
this person's next conversation says, how can I have, how can I learn more about GDS is their follow-up 00:35:24.680 |
question, right? And that makes sense that it's going to now share things with, can you build, 00:35:31.640 |
because this person is then having their next prompt, which is, oh no, that one just stopped after one. 00:35:38.360 |
But what I wanted to show you is that what we see here is that in this middle conversation, 00:35:43.160 |
the conversation started out in this area of information, right? It started out in what is GDS, 00:35:49.800 |
what are some of the basics, but then the conversation went into something similar to someone else. 00:35:55.640 |
And the reason that we want you to see this is you want to be able to see how are the conversations 00:36:02.680 |
connecting to the data itself. Because what we're going to start to see is we're going to start to 00:36:07.320 |
see that it's not just about a single row, it's not just about a vector, because the way the documents 00:36:12.440 |
are used and the way people travel through your data gives you information about what they're using. 00:36:17.480 |
The reason that becomes helpful is when you're trying to manage at scale, how do I know with all these 00:36:23.320 |
documents, what are the ones that I should even be looking at? What is it that people are constantly 00:36:27.640 |
going back to? What are the areas of knowledge where things are happening? And so what we're seeing 00:36:33.480 |
here is that these elements right here were in both answers, right? So you see this first question had, 00:36:40.040 |
this first response had context, and this had context. So it was in both of those. These are 00:36:45.080 |
individuals, these over here were only in this one, some of those were only in that one. You can imagine 00:36:50.600 |
like how quickly this gets very, very hairy. Oh, where did it go? Nope. I guess that's it. I didn't 00:36:56.920 |
have the extra one. So the point is we can start to understand the movement of what's happening. Is there 00:37:04.760 |
a question? Someone? No? Okay. Anybody have any questions so far? Ponderings. Waiting till we get to juicy things. Yes? 00:37:13.960 |
Yes. Yes. The pink ones are the chunks. The orange is the user prompt. The tan is the rag application's response. 00:37:30.280 |
So you can already see how it's starting to connect to the system itself, right? So let me get back 00:37:36.680 |
on point here. We got things. We got things. Things, things, things, things, things. Okay. 00:37:41.960 |
I'm sorry. Are you storing the entire chunk into the graph? And how performant is that? And when 00:37:47.320 |
does performance become a concern? Yeah. So what we're going to see is we are actually, 00:37:52.440 |
we do actually have the chunk. Our chunks are pretty small. The size of the text doesn't really change the 00:38:01.720 |
performance, does it? Like as far as like once it's stored in the property. It's really the traversal 00:38:08.200 |
that is where things get. For most chunk sizes, it's fine. Because most chunks are like in the K 00:38:12.920 |
at most. It's not megabytes of chunk. Right. If there's megabytes, then it's like, okay, 00:38:16.360 |
that becomes problematic. Yeah. I mean, but it would be that for any database I would imagine. 00:38:20.840 |
Yeah. But what we see is, is like, you know, we said we had these properties. So for this particular 00:38:28.360 |
document, I've run some of these things already, but we've got the document, we've got the text, 00:38:33.560 |
and we have the embedding. The embedding is actually a property of the node itself. And that's one of the 00:38:39.720 |
things that makes it really easy is that you don't have to have a separate vector database, right? The 00:38:46.040 |
vector is actually stored, and then the vector index is built on top. So that once you get all those 00:38:50.520 |
vectors in, you can run the vector index. And people do really interesting things with vector indexes, 00:38:54.840 |
for sure. So this is where, well, that's not what I want. This is where we start. Yes? 00:39:01.880 |
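(In current Neo4j, that "vector index built on top" looks something like the sketch below. The index name, node label, property names, and the 1536 dimensions are assumptions, picked to match a typical OpenAI-style embedding.)

```python
import os
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
)

# Build a vector index over the embedding property that already lives on the chunk nodes.
driver.execute_query("""
    CREATE VECTOR INDEX chunk_embeddings IF NOT EXISTS
    FOR (d:Document) ON (d.embedding)
    OPTIONS {indexConfig: {
        `vector.dimensions`: 1536,
        `vector.similarity_function`: 'cosine'
    }}
""")

# Query it: the k nearest chunks to a query vector, then keep traversing from `node`.
records, _, _ = driver.execute_query("""
    CALL db.index.vector.queryNodes('chunk_embeddings', 5, $query_vector)
    YIELD node, score
    RETURN node.url AS source, score
""", query_vector=[0.0] * 1536)  # placeholder; embed the user's actual question for real use

driver.close()
```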
Question in the back. I think, I think I have a question, so. 00:39:05.640 |
You definitely have a question. Whether I have an answer is the only thing that's up for debate. 00:39:10.200 |
So let's assume, right, I have very large, let's say, legal documents, right, which I'm trying to do 00:39:16.200 |
something very similar. Yeah. And I do these chunkings, you know. Yeah. 00:39:20.840 |
Are you saying I would store the chunks in Neo4j? I would not store the vectors in a vector database, 00:39:27.160 |
but I would also store that in Neo4j. Yeah. Now, how would that be from a performance standpoint over time? 00:39:32.520 |
Like, assuming your system continues to grow, wouldn't there be a performance impact? 00:39:39.560 |
Um, I don't know. Would you know what the largest RAG application is that we have in production in clients right now? 00:39:45.960 |
That's a good question. I don't know. I know that's like, you know, tens of millions, you know, seems fine. 00:39:49.800 |
Once you start to get into hundreds of millions, you might start to have to worry about, like, how to scale past that. 00:39:54.040 |
Yes. And does it still look for similarity across all embeddings? 00:40:00.520 |
Yeah. I mean, so it depends. So what it will do is it will look across whatever the vector index is that you have. 00:40:07.240 |
And there are people who, I said, they do some really interesting things with vector indices. 00:40:11.000 |
So I've seen people who, like, have multiple vector indices depending on what the use case is. 00:40:18.200 |
So when you create the vector index, you actually just do a query and you say, "Okay, for all of these, 00:40:24.600 |
create an index on all of these nodes." And so that's one way that I definitely know people have been doing it. 00:40:30.040 |
So they'll have multiple indices that already sort of pre-sort or pre-filter what's going to be part of 00:40:36.600 |
what's coming back from the vector index. So I know one of the ways I've seen people do it is from a governance perspective. 00:40:42.360 |
So there's certain data that they want available to everybody, certain data they don't, 00:40:48.040 |
Yes, please. I'll let you go for a third, really. 00:40:53.880 |
The other question is, if we had documents that actually contained images, 00:41:00.360 |
and we wanted to somewhat have a relationship between the text and images, how would you do it here? 00:41:06.200 |
Yeah. I mean, normally, multimodal, I think most of the time they keep the images separate, 00:41:11.080 |
and then there's an embedding of the image on the node. 00:41:15.720 |
Yeah, so you put the images, like, large media files, you would put that in an S3 bucket or something, 00:41:19.800 |
you still wouldn't, you wouldn't put that into the database. So you'd have a URL to the S3 bucket, 00:41:24.200 |
but you can put the embedding into Neo4j, so on the node. Yeah. 00:41:27.160 |
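(A minimal sketch of that multimodal pattern: the image bytes stay in object storage, and only a URL plus a precomputed image embedding go on the node. The labels, property names, relationship type, and identifiers here are all made up for illustration.)

```python
import os
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
)

driver.execute_query("""
    MATCH (c:Document {id: $chunk_id})
    MERGE (img:Image {url: $image_url})      // pointer to the file in S3, not the bytes
    SET img.embedding = $image_embedding     // embedding from your multimodal model lives on the node
    MERGE (c)-[:HAS_IMAGE]->(img)
""",
    chunk_id="chunk-123",
    image_url="s3://my-bucket/figures/fig-7.png",
    image_embedding=[0.0] * 512,             # placeholder vector
)

driver.close()
```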
Yeah. Yeah. We actually have an example of that. Yeah? 00:41:32.760 |
For, like, for, like, you have, like, multi-layered architecture to kind of query faster. Yeah. 00:41:41.800 |
Is there something similar to that happening here, where you can have, like, different subsets, you know, 00:41:47.800 |
different layers to the... Yeah, I mean, it's not... I mean, it's mostly just the way that the database is actually built. 00:41:54.680 |
Because of the way the queries run, because you're running on a pattern, you actually... 00:41:59.720 |
I kind of call it a... It's like GPS search, where it's like, start here, and then go find what you need to find. 00:42:05.640 |
Right? So, I don't know if that's the best way to explain that, but when you're running the query, 00:42:12.680 |
it's going to be very specific. So when you're thinking about, like, how you're running the vector 00:42:18.680 |
retriever itself, are you talking about, like, when you're actually doing the retrieval from the vectors? 00:42:22.200 |
Yeah. So there's a whole... I don't even know how many retrievers there are right now. 00:42:27.160 |
Like eight different retrievers or something in the GraphRag package. It's a Python-based package. 00:42:33.240 |
So there are some... If you go to... Actually, a great place to go is GraphRag.com. It's like an open, 00:42:39.240 |
like, reference point. And it talks about some of the different types of retrievers. So you can do 00:42:44.040 |
pre-filtering. You can do post-filtering. You can manage it based on the index, which is another, like, 00:42:49.240 |
easy way to do it. You know, sometimes folks will have them where they have different hierarchies. 00:42:55.080 |
So, like, when you think about these large legal documents, right, there are ways that you can 00:43:01.080 |
actually, like, create the chunks, but use the natural structure of the element as well. 00:43:07.080 |
So, like, we have an example we use with SEC data. Every time somebody files something with the SEC, 00:43:12.840 |
you know, paragraph one is always this, paragraph seven is that, paragraph nine is that. And so you'll 00:43:17.240 |
take whatever that natural architecture is within the documents themselves, and then you can leverage that 00:43:23.240 |
as well. So, one of the other great things about these entities is that you can actually have more 00:43:30.280 |
than one label. Let me see. Oh, I gotta plug in. So, one of the things that you'll see is, so for this 00:43:43.320 |
particular... This particular one, we see that it's a message, but it's also noted that it's the assistant. 00:43:50.280 |
So, you have some nodes that are, that have more than one label sometimes. So, it makes it so that if 00:43:57.720 |
you wanted to go across all messages, but I know if I'm trying to understand just the prompts from, 00:44:04.440 |
they're just the assistant answers, or just the prompts, I don't need all of those messages. So, by adding on 00:44:09.720 |
dual labels, you can actually say, "Oh, I just want to see these, or I just want to see those." So, it allows you to 00:44:16.680 |
get some of that nuance out of that complexity by leveraging multiple labels on a single node. 00:44:22.040 |
One of my favorite tricks, personally, but that's just me. 00:44:25.560 |
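(In query terms the trick is just stacking labels in the pattern. The counts below run against the Message nodes we were just looking at, where Assistant is the second label; the connection details are the same assumed environment variables as before.)

```python
import os
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
)

# Every message, no matter who sent it.
all_msgs, _, _ = driver.execute_query("MATCH (m:Message) RETURN count(m) AS n")

# Only the assistant's answers: the same nodes, narrowed by the second label.
assistant, _, _ = driver.execute_query("MATCH (m:Message:Assistant) RETURN count(m) AS n")

print(all_msgs[0]["n"], "messages total,", assistant[0]["n"], "from the assistant")

driver.close()
```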
Yeah, I have a question. So, now, like, if you have one vector index, but don't want to actually 00:44:36.040 |
search the entire index, right? So, since that, like, let's take this text, for example, or like, content, 00:44:41.400 |
for example. You want to only search the content of few nodes while vector embedding. Is that possible? 00:44:50.040 |
I mean, there's a number of ways that you can do that. You know, you certainly could. I mean, I, 00:44:56.680 |
personally, I'm a big fan of multiple indices, just because I find them to be very efficient. 00:45:01.080 |
That's one way to do it. There was a question earlier about temporal issues. You can also use, like, 00:45:11.000 |
multiple vector indices to address that temporal portion as well, right? So, you can create an index 00:45:17.480 |
from, why is this? All right, it's starting up. You can create a vector index. So, like, 00:45:23.320 |
like, if we wanted to, like, if we wanted to have a vector index just on, you know, version 1.9 or version 00:45:31.640 |
2.6 or whatever, you could have that particular vector. So, let's just say I'm building, 00:45:36.520 |
building a chat bot, and I know that the customer is currently running whatever version of something, 00:45:42.680 |
you can use that as the actual index that you're searching for. So, that's another way that you can, 00:45:49.880 |
that you can address those kinds of things. Did that answer your question? 00:45:53.320 |
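(One concrete way to get a "vector index just for version X" is to combine this with the dual-label trick from a minute ago: tag the subset you care about with an extra label and build a separate index over that label. The version property, the extra label, the index name, and the dimensions here are invented for the example.)

```python
import os
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
)

# Tag the chunks for the version this customer is running...
driver.execute_query("""
    MATCH (d:Document)
    WHERE d.version = $version
    SET d:CurrentVersionDoc
""", version="2.6")

# ...and build a vector index that only ever sees those nodes.
driver.execute_query("""
    CREATE VECTOR INDEX current_version_embeddings IF NOT EXISTS
    FOR (d:CurrentVersionDoc) ON (d.embedding)
    OPTIONS {indexConfig: {
        `vector.dimensions`: 1536,
        `vector.similarity_function`: 'cosine'
    }}
""")

driver.close()
```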
So, it's just like multiple embeddings, I mean, multiple indices. So, not one index. 00:45:58.280 |
I mean, I don't know. I don't know if engineering would agree with me, but from, like, a purely, like, 00:46:03.240 |
practical, like, I got to hack this and get it done, that's my go-to. I mean, is there? 00:46:10.920 |
Yeah, I think that's the right approach right now with Neo4j. I guess you're asking about metadata 00:46:15.560 |
filtering within the vector index, right? Yeah, and so, for us, the vector index is just an index. 00:46:20.440 |
It's not a vector search. Right. And so, we don't do metadata filtering on the vector itself. 00:46:25.960 |
We use the property graph itself for doing the filtering. Right. So, you do it before you create 00:46:31.240 |
the index, because the index is only going to go to what's connected to it, right? So, it's just... 00:46:37.160 |
You'll still have the index, but then you can do predicates on top of the nodes you found, 00:46:40.360 |
and then you filter after that. Yeah. So, even within, once you get that, 00:46:44.200 |
you can certainly filter in and around that as well. And if you look at any of our, 00:46:49.240 |
the vector retriever materials we have, like, we walk you through that. We've got a new ebook coming out. 00:46:54.840 |
It's gone to print. It's happening. That specifically talks about those retrievers. 00:47:01.000 |
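(That post-filtering pattern, vector hits first and then plain graph predicates on whatever the hits are connected to, looks roughly like this. The HAS_CHUNK relationship, the Source label, and the governance properties are invented for the example.)

```python
import os
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    os.environ["NEO4J_URI"],
    auth=(os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"]),
)

records, _, _ = driver.execute_query("""
    // 1. Vector search: the nearest chunks to the question.
    CALL db.index.vector.queryNodes('chunk_embeddings', 20, $query_vector)
    YIELD node AS chunk, score
    // 2. Predicates on top of the hits: keep only what this user is allowed to see,
    //    pulling in the source document along the way.
    MATCH (chunk)<-[:HAS_CHUNK]-(doc:Source)
    WHERE doc.audience = 'public' OR $user_group IN doc.allowed_groups
    RETURN chunk.text AS text, doc.url AS source, score
    ORDER BY score DESC
    LIMIT 5
""", query_vector=[0.0] * 1536, user_group="support-engineers")

driver.close()
```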
All right. Multiple labels, multiple indices, all the good things. So, let's do this. So, 00:47:06.280 |
for those of you who have... Let's go back to here. The other thing you're going to see on 00:47:12.200 |
the readme is you're going to see this Neo4j dump file. This is the actual dump file from our database, 00:47:20.200 |
from last year, I think. Then, so, we're going to give that to you. So, you can download that file, 00:47:25.080 |
and I'll show you how you can drop that into your... Let me see. Sorry. 00:47:30.680 |
All right. So, when we go back to our instances, so, you should see instance 01, most likely. If you go on the 00:47:42.760 |
right-hand side of there, there are three little dots. What you're going to see is backup and restore. 00:47:47.800 |
So, you're going to click on backup and restore, and you're going to take that dump file and either 00:47:52.920 |
browse or just drag and drop it. And it's going to drop in our version of the Neo4j agent... Agent Neo 00:48:03.720 |
GDS application data, I guess. Thanks. So, yeah. So, you just drag and drop it. It runs pretty quick. 00:48:12.280 |
It'll load pretty quick. And then what you'll see is when you go to query, you'll see this will start 00:48:20.040 |
to populate. The database information will populate. And it'll show you which nodes you have... Which nodes 00:48:24.920 |
and entities you have. It'll bring in some of the indices. There is... We did run some of the algorithms 00:48:32.040 |
already. So, I'm going to show you how to run them so you can run them on your own. But in the interest 00:48:37.880 |
of time, we have run those already. So, you'll see some of that. Some of them will do live here. They 00:48:43.800 |
won't be there. Is anybody able to load? Yes. Oh, yeah. So, on the readme, on the readme in the repository, 00:48:54.520 |
if you go all the way down to the bottom where it says connect to Neo4j AuraDB, there's a Neo4j dump 00:49:05.560 |
Sorry. Yeah. And then you'll just download it from there. What's that? Oh, sorry. So, 00:49:11.640 |
when you're in... When you go to console, right, you should be seeing something like this. 00:49:16.680 |
Right? And then on the right-hand side, for your instance, you'll see three little dots. 00:49:22.920 |
We're going to click on those dots, and you're going to go to backup and restore. 00:49:26.760 |
And my data is yours. I give it to you. I'm just dumping it right on you. All right. 00:49:35.640 |
Yeah. It's just easier. Nobody wants to load CSVs today. We have no time for these things. 00:49:43.400 |
Yeah. And then from there, you can go straight into the query. So, 00:49:48.360 |
when you go to query, you should see that it's populated. Is there anybody having trouble loading 00:49:53.720 |
the data that wants to load the data? It's still uploading. Okay. What's that? 00:49:58.520 |
If you... Did you... You might have had to re-pull because I... If you re-pull the repo... 00:50:07.560 |
Yeah. Yeah. I had to update it because, you know, I'm well-intentioned, but sorry. Yes. 00:50:14.280 |
When I've restored the dump file, should I expect that it will work in, like, a few minutes? 00:50:22.120 |
Yeah. Let me see what you got. It keeps doing the loading. It's still loading. 00:50:26.840 |
So, that's normal. There's, like, a latent... It shouldn't be too slow. Oh, what is... Oh. 00:50:31.960 |
You're probably having network issues. The Wi-Fi is pretty slow. Oh, sorry. Yeah. That's always the 00:50:39.320 |
the challenge with these things. Yeah. Wi-Fi. See, this is why we always have it loaded. I also have 00:50:47.960 |
it loaded on my desktop because that happened to me recently. I was somewhere and I'm like, 00:50:52.200 |
there's no Wi-Fi at all, and I have to present. So, I also have Neo4j desktop. So, if anybody ever, 00:50:57.240 |
you know, you want to go local, you want to stay on-prem, you don't want any up in your business, 00:51:00.920 |
you can also, like, stay off the cloud, go local. Also an option. Let's see. Where are we? So, 00:51:08.040 |
what are we going to do here? All right. So, I'm going to go into... So, within our notebooks, 00:51:16.360 |
we have three different notebooks. One is the get to know your graph, and we'll start there. I don't 00:51:22.360 |
need this. And really, it's, like, all this is going to do is you'll run it, you'll do your pip install of 00:51:29.960 |
all your requirements, and then what you'll... Oh, wait. I need to give you... Is the OpenAI 00:51:36.280 |
key on here, too? Hold on. Oh, I put it in the Slack channel. 00:51:38.280 |
Oh, it's in the Slack channel. Okay. So, anybody can find it anywhere in the conference. 00:51:50.440 |
they should have given you the opportunity to download your credentials? 00:51:56.200 |
It's getting... Probably not found or something. 00:51:59.080 |
Um... What is this not? Not supported... Oh, did you drop the... So, you have to drop... 00:52:05.720 |
I... Sorry, I skipped a spot. You do need to drop those credentials that you downloaded 00:52:10.040 |
into your ENV file. So, you did that. This is your ENV. Okay. 00:52:14.600 |
And then... Just pasted it. Yeah. That should be... Let me see. 00:52:21.240 |
Yeah, there. It's right here. Yeah, you got that. Um... What is it actually? What's the actual error? 00:52:36.440 |
What is it in here? Schema Bolt. Oh, yeah. Yeah, the URI should be in your 00:52:44.200 |
credentials. Exactly. I was just thinking I was just... Yeah. 00:52:47.320 |
It should be... It should be... I would take a peek at it. ABK, can you take a look and see 00:53:00.760 |
Um... Yeah. So, again, the API key is there. We're not using it, actually, that much today. 00:53:11.320 |
But one of the things that we have, obviously, is we need to have... Oh, God, you can't see anything, 00:53:15.960 |
this thing. You can't see anything. Oh, goodness. Well, that's overwhelming. 00:53:21.080 |
That's more reasonable. Okay. Is that good? You guys can see? We're good? Okay. 00:53:25.480 |
Um... Yeah. So, the basics. Obviously, we're importing OS. We're bringing in our environment variables. 00:53:32.040 |
Um... You see the Python driver. Right? This is our classic Python driver. 00:53:37.160 |
And then what I like to do is I like to just do a run query, um, uh, function just because it just makes 00:53:44.920 |
it easier. Like, I just like it. Um, you don't have to do it this way. You could just do driver.session, 00:53:50.680 |
but, um, that's just how I like to do it. And then when you run this, it'll see that, yes, 00:53:55.400 |
you are actually connected. Oh, let's see. Maybe I should actually run everything. 00:53:59.240 |
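For reference, a minimal sketch of what that setup cell roughly amounts to. The environment-variable names and the exact shape of the run_query helper here are assumptions, not the notebook's literal code:

```python
import os
from dotenv import load_dotenv        # assumes python-dotenv; loads the credentials dropped into .env
from neo4j import GraphDatabase

load_dotenv()

# Assumed variable names -- match them to whatever your downloaded credentials file uses.
URI = os.environ["NEO4J_URI"]
AUTH = (os.environ["NEO4J_USERNAME"], os.environ["NEO4J_PASSWORD"])

driver = GraphDatabase.driver(URI, auth=AUTH)

def run_query(cypher, params=None):
    """Small convenience wrapper: run one Cypher statement and return the records."""
    with driver.session() as session:
        return list(session.run(cypher, params or {}))

# Quick connectivity check
driver.verify_connectivity()
print(run_query("RETURN 1 AS ok"))
```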
Now we're connected. Okay. Um, the first thing that we're gonna do is we're just gonna take a look at 00:54:08.600 |
what is actually in the database. So, in this, um, APOC actually stands for awesome procedures on Cypher, 00:54:16.920 |
which I just think is hilarious. I think there's a double meaning on APOC too, 00:54:20.760 |
but we'll have to ask, um, isn't there another APOC meaning? Something? I don't know. 00:54:25.560 |
Yeah. Some, some matrixy thing. Um, but what this is gonna show you is we start out with 17,000 nodes, 00:54:34.440 |
right? We have 774,000 relationships. Um, seven different labels and 27 different relationship types. 00:54:44.440 |
What does this tell us? What it tells us is obviously we got a lot of data. It's actually 00:54:48.360 |
not that much really, um, you know, for something small. Um, but you know, like almost a million 00:54:53.960 |
relationships. And so really what you, we're running this just to make sure that everything is loaded. So 00:55:00.120 |
everybody who has loaded their data, are you seeing these numbers? Anyone? Anyone? Maybe? Maybe not? No? 00:55:10.680 |
Okay. This is, I will take you back one moment, please. Hold the phone. There are three notebooks 00:55:17.720 |
inside the notebooks folder inside the repository. So, you may have to re-pull it. 00:55:29.560 |
And then this is the only one. Huh. And then this is the website. Yeah. Yeah. This is not, 00:55:37.800 |
can you reload this? Because it was updated. Yeah, um, I did a report. That's why I-- That's so weird. 00:55:44.360 |
Are you having trouble getting the connection string for JHAB there? It could be that it's not loading 00:55:48.440 |
.env file? Yeah. It's easier to just copy the values into the cell where it's loading from the environment. 00:55:54.120 |
Yeah, you can do that too. Yeah. Yeah. That's my advice. Yep. That's actually really good advice. Um... 00:55:58.360 |
I'm uploading it to the Slack channel right now. Thank you. Thank you. What's your name? Caleb. 00:56:06.840 |
Caleb to the rescue. Thank you, Caleb. Everybody, round of applause. Round of applause. You're that 00:56:13.000 |
helpful guy at work, aren't you? You're like, oh, I can just do this thing. It'll make it easy for everybody. 00:56:17.400 |
Thank you, Caleb. Naya, also. Naya, round of applause. We love Naya. You'll probably see if you're from San 00:56:24.520 |
Francisco, Naya is all places, as is Alexi. They're around all the time. So hopefully they're familiar 00:56:30.280 |
faces for you all. Um, so yeah. So the first one is just really meant to get you to connect to the graph 00:56:37.640 |
and make sure that your connection is working. Um, and just showing you how you can get some basic summary 00:56:42.440 |
statistics. Um, you know, one of the things that we can do is we can preview some of these documents. 00:56:49.320 |
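As a concrete example of those summary statistics, the node/relationship/label counts quoted earlier can come from a single APOC call; a sketch, assuming the run_query helper from the setup cell:

```python
# Sketch: high-level counts for the whole database via APOC (assumes run_query from the setup cell).
stats = run_query("""
CALL apoc.meta.stats()
YIELD nodeCount, relCount, labelCount, relTypeCount
RETURN nodeCount, relCount, labelCount, relTypeCount
""")[0]

print(f"{stats['nodeCount']:,} nodes, {stats['relCount']:,} relationships")
print(f"{stats['labelCount']} labels, {stats['relTypeCount']} relationship types")
```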
Um, so again, this is our data model just to get you familiar with what we're looking at. 00:56:53.960 |
And then let's move on to our notebook number two, which is where things get more exciting. Um, 00:57:03.400 |
in notebook number two, what we're looking to do is we're looking to actually start 00:57:09.560 |
leveraging these algorithms for, uh, where am I? Okay. So when we talk about graphs, like ultimately 00:57:18.760 |
what we're trying to do is we're trying to help you do these things at scale. We're trying to help 00:57:21.800 |
you be successful in production and we're trying to help you get the best possible outcome from the 00:57:26.520 |
applications that you're building. And the way that I, that I suggest is my little moniker, which is 00:57:32.120 |
connect, cluster, curate, to your point of how do we actually manage this at scale? What does that look like? 00:57:37.800 |
What do we do? Um, what we have in these notebooks walks you through this very, this process. The way 00:57:43.480 |
that it works is we start out by running KNN on similarity. So we take the embedding for each of 00:57:48.520 |
those documents and we create, uh, we leverage a similarity score. So, um, basically we take a K of 25, 00:57:57.720 |
we run KNN and we understand, we start making connections among very similar documents, right? 00:58:04.360 |
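Roughly, those two steps (project just the document chunks with their embeddings, then run k-nearest neighbors over them) look something like the sketch below; the Document label, embedding property, and SIMILAR relationship name are assumptions, and topK: 25 mirrors the K of 25 described above. The actual query is walked through a bit later in the session.

```python
# Sketch: build an in-memory projection of just the document chunks, then run KNN on their embeddings.
# Node label ("Document"), property ("embedding"), and output relationship ("SIMILAR") are assumed names.
# The notebook uses an empty relationship type for the projection; '*' works for this sketch because the
# documents have no relationships among themselves yet.
run_query("""
CALL gds.graph.project(
  'docs',
  { Document: { properties: ['embedding'] } },
  '*'
)
""")

run_query("""
CALL gds.knn.write('docs', {
  nodeProperties: ['embedding'],
  topK: 25,
  writeRelationshipType: 'SIMILAR',
  writeProperty: 'score'
})
YIELD relationshipsWritten
RETURN relationshipsWritten
""")
```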
You'll see why we do that in a minute. But what this does is it sets us up for the next step, which is 00:58:11.240 |
community detection, also known as clustering. Um, and what it allows you to do is it allows you to group 00:58:18.040 |
these like items so that you can manage them together, right? It's very hard to look at any one and know 00:58:26.040 |
what's happening, but we can certainly look at all of the things. And then finally what it leads us to 00:58:31.320 |
is we curate the grounding data set via these techniques that work at scale. So this is a similarity 00:58:37.240 |
graph of context documents, um, you know, connected to their source URLs, right? So it makes sense that 00:58:42.840 |
there's a lot of overlap and similarity because they're coming from the same source. And you may be saying, 00:58:47.720 |
you may be saying, well, if they're all the same, like, why do I need to know that they're similar? 00:58:51.640 |
How does that help me? Well, it helps you in a few different ways. Uh, what the, oh, this thing's 00:58:56.680 |
going crazy. Um, what I first want to talk about is I want to talk, talk a little bit about the math 00:59:01.800 |
and the understanding of how community detection works, right? Um, I promise you there's no calculus 00:59:07.560 |
and there's no quiz. Um, but when, so like, if we're looking at this doc, you know, we're looking 00:59:14.120 |
at a few different friends, a couple of friend groups, like it's pretty clear. Like there's one 00:59:17.960 |
clique, there's another clique. It's pretty obvious, right? Like that seems reasonable. And then when we 00:59:23.960 |
start getting bigger, like these are, these are similar. And what we're seeing here, this is actually, 00:59:28.840 |
these are actually sized by page rank, which is an important score. Um, and they're still like, 00:59:33.320 |
it still seems pretty obvious. Like, yeah, they're clustering together. Like it's still pretty 00:59:37.240 |
logical. Like I get it. Um, but anybody who has worked with graph knows that it never looks like 00:59:44.520 |
this. Has anybody ever actually seen anything that looks as tidy? No, never. Cause it looks like this, 00:59:50.600 |
right? This is what we're actually looking at, right? In this case, what we've got is we've got 00:59:55.720 |
all these different nodes. They're sized by page rank. And then the edges are weight. They're weighted edges, 01:00:03.240 |
right? It's very like the hairball people call it. Um, the bowl of spaghetti, whatever it is. 01:00:08.280 |
And so like, it's very complex. So how do we actually find things that work? How do we actually 01:00:14.600 |
run community? Oh my God, it's so sensitive. My goodness. Um, this is the boring math part, 01:00:22.200 |
but just so that you understand like what is happening under the hood. So when you're doing clustering in graph, 01:00:28.280 |
it's based on something called modularity optimization, and a community has high modularity, 01:00:34.760 |
right? It's modular, which makes sense, right? When the items or the connections within it are highly 01:00:40.760 |
interconnected with each other and have a very low connection to things outside, right? That's like 01:00:46.600 |
the basics of the math. So what it's going to do is the algorithm is going to go through and it's going to 01:00:51.880 |
say, okay, for every node, is the modularity score going to go up or is it going to go down if I give it 01:00:57.480 |
the same label as my neighbor, right? So does it belong with this group or does it belong with that group? 01:01:02.120 |
What does it do to the actual numbers, right? Louvain is a particular type of algorithm and this is where 01:01:09.800 |
it says, yeah, that's exactly what we just talked about. So how does it actually work? So what it will do 01:01:15.240 |
is it will start by giving a single node a label and then it does the calculation and says, okay, 01:01:20.520 |
like let me figure out, does it go up or does it go down? And then it will do the first pass and it'll 01:01:25.800 |
say, okay, these are the ones that are most connected to each other. And then it'll say, okay, well, let me 01:01:31.880 |
group and aggregate those and let me look at how many, what you see in step two is you see the number of 01:01:38.840 |
connections between each of those clusters, right? So when we look at it, so, you know, it's pretty 01:01:44.680 |
straightforward. The pinks and the yellows, there's three edges between them, but there's 14 within 01:01:50.440 |
just pink and there's two within yellow and four within green and 14 within blue, right? So what it 01:01:55.960 |
shows you is it's showing you that like there's a lot connected within and not so many on the outside, 01:02:01.160 |
right? Pretty straightforward. That's pretty easy math, right? And then it just keeps going through and then 01:02:07.960 |
eventually it goes down. One of the things to know, um, there's another algorithm called Leiden. 01:02:12.840 |
It is similar, but it, um, but if you need to work with unconnected graphs, so depending on whether your 01:02:18.760 |
graph is completely connected or not, there are different ones. But ultimately what it's doing is 01:02:23.080 |
saying, you know, are, are these things, which of these things is not like the other, right? So label 01:02:29.160 |
propagation is, uh, is another way of doing clustering. And what label propagation does is it says, 01:02:35.160 |
okay, I'm going to assign a random label to something or I, I know these labels and then I'm 01:02:41.400 |
going to go to the neighbors and I'm going to say, I'm going to assign what this one is to its neighbors 01:02:46.280 |
and then eventually we have these iterations where they converge. That's the label propagation algorithm. Now the nice 01:02:51.960 |
thing about the label propagation algorithm is it's actually one of the fastest clustering algorithms. So 01:02:57.240 |
especially if you've got lots and lots of values and you want to run something very big, it's a very fast way to do it. Um, if you need 01:03:04.680 |
something that's like highly nuanced, it's worth spending the time and the computation to use one 01:03:09.880 |
of the other algorithms. But what I wanted to do is I just wanted to, to show you what this looks like 01:03:15.080 |
because label propagation, it's, and modularity does the same thing. It's really looking for density. 01:03:21.800 |
It's looking for density in those graphs. So when we think about, you know, the craziness that we see here, 01:03:27.880 |
like even in this, like you can see like little pockets of density, right? We see these pockets of density. 01:03:32.840 |
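In GDS terms, running community detection over that similarity graph might look roughly like this; the projection, relationship, and property names are carried over from the earlier sketch as assumptions, with label propagation shown as the faster alternative just described:

```python
# Sketch: project the documents plus the KNN similarity relationships, then write a community id per node.
run_query("""
CALL gds.graph.project(
  'doc_similarity',
  { Document: { properties: ['embedding'] } },
  { SIMILAR: { orientation: 'UNDIRECTED', properties: ['score'] } }
)
""")

# Louvain: modularity-based clustering, weighted by the KNN similarity score.
run_query("""
CALL gds.louvain.write('doc_similarity', {
  relationshipWeightProperty: 'score',
  writeProperty: 'community'
})
YIELD communityCount, modularity
RETURN communityCount, modularity
""")

# Faster, less nuanced alternative mentioned above: label propagation.
# run_query("CALL gds.labelPropagation.write('doc_similarity', { writeProperty: 'community' })")
```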
I always like to take the soft eye when we look at graphs. Um, and so that's what, that's what's happening 01:03:38.440 |
with the math under the hood, right? So what does this actually mean in our, in our code? So what we've got 01:03:47.960 |
is, let me close this out. So what we're going to do is the first thing you do is you're going to run the KNN. 01:03:55.160 |
So you run the KNN on similarity. Obviously we connect as always. Um, there is a weird little thing you have to do. 01:04:01.960 |
You have to create an empty relationship, but don't worry about that. Like I gave it to you. Um, but that's a little, 01:04:08.120 |
little pain point. Um, that's probably 'cause we don't need it. Um, but then we run the query and I just want 01:04:13.800 |
to walk through how the query is actually built and what it looks like. So the first thing that we have 01:04:19.560 |
is, do I have my, I want to, I want to see this. Sorry. Somebody's got a question over here. Yes. 01:04:25.160 |
Then, then, then over there, Caleb. There we go. Here we go. Did you have a question? 01:04:39.160 |
- Mmm. Graph.project. So just to go over like one, what, what the process is when you're, 01:04:48.600 |
when you're actually, um, running the algorithms. The first thing you need to do is you need to 01:04:53.480 |
project a graph because most of the time when you're running the algorithm, you're not running 01:04:57.960 |
it on the entire graph, right? We're running it on something small. So in our case, we want to run 01:05:02.440 |
k-nearest neighbors just on the documents, right? So what we need to do is we create this, uh, this 01:05:09.320 |
projection. So here we've got, we're calling the projection. We're calling it docs. So the first thing 01:05:15.240 |
you'll see is the name of the projection, which is docs. What, then what we look at is we look at which 01:05:20.520 |
nodes. The next thing you're going to see is nodes. So for us, we're using documents, right? So we want 01:05:24.760 |
documents. I want to include the property embedding. So you really want to make sure you're only bringing in 01:05:30.280 |
what you need, right? Because it's just going to make things run faster. Um, and then because it's 01:05:35.240 |
just, the documents aren't related to each other yet, right? We know that from the data model those 01:05:39.720 |
documents are related to the responses, but they're not related to each other. We're building this 01:05:44.200 |
similarity among them. And so it happens to be an empty relationship. And then what it will do is it will 01:05:50.520 |
give us the graph name, the node count, and it'll, and it'll let us know that it ran. And what that ends up looking 01:05:56.120 |
like is, let me go to my aura. Where's my aura? I'm so not aura, as the middle school would say right 01:06:04.760 |
now. Okay. Um, so let's go into explore. Anybody here have middle schoolers or like, no, the whole like 01:06:11.640 |
brain rot, like riz, aura, skibbity, like it's insanity. All right. Um, so what we're going to be looking at 01:06:19.560 |
is when we look at documents, I've run it already. Um, so we're going to see that there's similarity 01:06:27.160 |
among them, but right now we're going to see these documents and what I want to look at is, oh yeah. 01:06:33.400 |
So I'm going to look at documents that are connected to, 01:06:43.480 |
come on. Similar to documents. Oh, actually let me, yeah, that'll work. It takes a second to run. Um, 01:06:59.320 |
one of the, one of the things that, um, I did is I actually changed the way that we look at it in 01:07:05.720 |
the graph. Just so you know, I changed it so that the colors of the nodes are by community. 01:07:11.800 |
Because what I wanted to see is I wanted to see like, do we start to see, let me run this one 01:07:16.920 |
more time. Let me clear the scene. Do this. Oh, dear sweet Wi-Fi. Maybe I should open up my Neo4j, 01:07:30.280 |
like desktop. Um, yeah. So what we're going to see is we're going to see the connections in and among 01:07:37.480 |
the, hold on a second. I don't like this. I don't like this at all. What it's eventually 01:07:45.000 |
going to show you is it's going to show you the similarities in and among, and it's going to be like 01:07:48.680 |
the greatest ever hairball. So in this particular case, like this is what it might look like. 01:07:56.120 |
Um, and what we see is like we saw with the big one, like you're going to see like clusters of documents 01:08:02.120 |
that are very similar. The other thing that we're going to see is we're going to see what I call the 01:08:07.720 |
bridge documents. The ones that will often be the, the middleman. We call it, um, betweenness centrality. 01:08:15.400 |
And so what you want to understand, like I said, it's the soft eye. You want to understand what is 01:08:21.160 |
the clustering of your documents. And again, you might be asking yourself like, why is that helpful? 01:08:25.880 |
And I promise you the reason that it's helpful is because 01:08:29.720 |
what we end up being able to do is understand what's happening within the data. So this is a 2d version of, 01:08:39.160 |
of these embeddings and these clusters. And what we see is we see, you know, it looks like a bowl of 01:08:45.320 |
fruit loops. Like how, again, Alison, how is this helpful? I'm not a data scientist. Like what is 01:08:49.880 |
happening here? Um, but when we zoom in, we see a couple of things. In one case, we have a single 01:08:55.080 |
community document cluster. So what we're seeing is there's a lot of similarity in the embedding, 01:08:59.720 |
right? They're all in that same community. They're all very similar. And then in this other side, 01:09:04.760 |
we've got some, they're kind of spread out, like nothing really clear on a perimeter, right? Like 01:09:09.640 |
you're like, is that good clustering or not? Um, and so my question to you is, which is better? 01:09:14.520 |
Which of these do you think might say, oh yeah, like we've got it. We've got what we need. We're good. 01:09:21.800 |
Pros and cons. Somebody say something. Single community. What's the, what's the thought on a single 01:09:29.800 |
community? You can figure out there is some relationship between those nodes. Yeah, they're very similar. 01:09:36.360 |
Mm-hmm. That's awesome. And in traditional clustering, that would be great, right? If we're 01:09:41.480 |
talking about clustering customers, we're doing customer 360, we're like, oh, this is a group, 01:09:45.160 |
this is a thing. But one of the things that we have, though, is that means we have a lot of very similar 01:09:50.680 |
documents. So the question is, do we really want it to look like that? Customers, yes. Documents, no. 01:09:57.480 |
No. Right? Because one of the things that we need to understand is we need to really be efficient. 01:10:03.160 |
So when we talk about what is a high quality grounding data set in your retrieval augmented 01:10:07.960 |
generation, these are our five basic principles that we, that we go by. Is it relevant? Is it actually 01:10:14.680 |
giving an augmenting answer? Is it reliable? Right? Like, do you have, do you have like a lot of variety 01:10:21.080 |
that's coming back on something? Like, is it really disparate from these communities? Because there's a hole in 01:10:26.120 |
your data. Because we don't know. Right? There's a gap in all this high dimensional space. There's 01:10:30.920 |
this like big blind spot that you don't know about. And again, is it efficient? How many people here are 01:10:36.760 |
really worried about their applications being efficient? Everybody. I know, right? It makes you crazy. So, 01:10:42.840 |
again, how do we do this at scale? What does that look like? So once we've done this clustering and we've 01:10:48.840 |
assigned these communities to these nodes, then we can start looking at the data itself. Right? In this particular 01:10:54.760 |
instance, one of the first things we did, and you'll, you'll see it, I left all the, the dump that you 01:10:58.920 |
have has all the errors in it. So you can play with them. We can see that the median average word length 01:11:04.360 |
is 512 characters. Well, our chunks are a length of 512. So that's probably not good data for us. 01:11:11.160 |
Right? That's not helpful. And it's clearly very different from everybody else. So what does that look 01:11:16.920 |
like? Median word count 1. So all of these are summary statistics at the, at that cluster level, 01:11:22.120 |
at that community level. And so when we looked into it, we found this is very, you know, like, again, 01:11:29.080 |
we brought our data in from, you know, web pages and all these different places. So clearly this is not 01:11:34.520 |
going to be helpful to an answer. Right? At least not to a human. Right? So this is a way that we can start 01:11:40.520 |
to clean something out. The basics. Right? Like, just the basics. The other thing that's going to come 01:11:47.000 |
up in that high density, like that high similarity we saw is this. Highly similar text chunks. In this 01:11:53.160 |
particular one here, community 4702, our average similarity is 0.98. We have 49 documents. 49 chunks 01:12:01.480 |
that are almost completely identical. If you're doing a retriever that gives, you know, the top 10, that's not going to 01:12:08.440 |
give you any context at all. It's just going to be really confident about this one thing. Right? 01:12:14.120 |
How do you know that it's even the right thing? Right? So one of the things that you want to start 01:12:18.280 |
looking at is ways that you can sometimes increase the diversity. Right? Like, I have a talk that I've 01:12:26.840 |
submitted many, many times that has never been accepted. And it's called "The Dark Side of Distance 01:12:32.440 |
Metrics: How Cosine Similarity Broke America." Right? I don't know why it doesn't get accepted. Why is that? 01:12:38.360 |
Right? Because cosine similarity gives you exactly what you want. 01:12:43.080 |
Gives you exactly what you want. Right? Like, I call it the chicken nuggets and Twinkies. If I had a robot 01:12:49.480 |
in my house and I went to San Francisco and it was responsible for cooking for my kids, when I came back, 01:12:54.440 |
all I'd be making is chicken nuggets and Twinkies. Right? That's the same thing that's happening in our 01:12:59.480 |
algorithms. And so here's where this becomes really, really important, where everybody here is working with agents: 01:13:09.000 |
you are refining signal at every turn. So if you get a signal, it's going to pick up on it and it's 01:13:17.000 |
going to run with it. And it's going to go down to the next agent and the next agent and the next agent. 01:13:21.240 |
And whatever that, whatever that path that it got on, it's going to go. Right? Because that's what we 01:13:27.160 |
programmed it to do. It's doing its job. It's doing exactly what we told it to do. But is that what we want 01:13:32.440 |
it to do? So one of the ways that, like, this community approach to documentation is really 01:13:38.440 |
interesting is you can look at what are the variety of documents. You can actually use re-ranking for 01:13:46.360 |
diversity in your responses. So let's say we do a vector retriever, right? We take the vector retriever 01:13:52.440 |
and it comes back with this scoring. These things are all really similar. 01:13:55.160 |
Okay? Well, most of them are from this community. But then the ones a little further down are from 01:14:01.640 |
another community. So you can actually do different ways of re-ranking. So you can weight for diversity. 01:14:09.880 |
You can weight for page rank, right? Like one of the examples we have in here is using page rank. 01:14:15.560 |
Of these vectors, which ones are the most important? Which are the ones that people go to the most? 01:14:20.520 |
We want to be feeding that back, right? It's like, it's why Google made billions of dollars, right? Page rank. 01:14:26.360 |
Changed the way everybody searched. So what I want you to be thinking about is you need to be thinking about 01:14:33.480 |
what I call intelligent outcome management. It's not just about like, you know, AI, artificial intelligence. 01:14:39.960 |
But what are the outcomes that we're getting? And how do we give really good ones? 01:14:44.840 |
And I'm here to show you how you can use these algorithms to help you like diversify that a little bit. 01:14:50.760 |
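As a rough illustration of weighting for diversity (not the notebook's exact code): once each retrieved chunk carries a community id, you can re-rank the vector retriever's hits so no single community dominates the top-k.

```python
from collections import defaultdict

def rerank_for_diversity(hits, k=10):
    """hits: list of dicts like {'id': ..., 'score': ..., 'community': ...},
    already sorted by vector similarity. Round-robin across communities so the
    final top-k isn't ten near-identical chunks from one dense cluster."""
    by_community = defaultdict(list)
    for h in hits:                      # preserve similarity order within each community
        by_community[h["community"]].append(h)

    reranked = []
    while len(reranked) < k and any(by_community.values()):
        for community in list(by_community):
            if by_community[community]:
                reranked.append(by_community[community].pop(0))
            if len(reranked) == k:
                break
    return reranked
```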
So when we looked at these highly similar text chunks, 01:14:54.440 |
they're highly similar. It's because it comes from our documentation, right? We've got multiple versions, right? 01:15:01.320 |
Like we've got a version for 1.5 and a version for 1.6. Well, the algorithm doesn't change, right? 01:15:06.200 |
The algorithm doesn't change. So that's why we have all these repeats. 01:15:08.440 |
So the other great thing about graphs, because, you know, every chance I get to love on a graph, I will, 01:15:13.880 |
is you actually can, very easily, in this one moment, run APOC nodes collapse. 01:15:20.040 |
Awesome procedures on Cypher. So what it's going to do is it's going to do two things. One is going to make your 01:15:27.240 |
retrieval much more efficient because you now have one version of it instead of many. 01:15:31.560 |
But what it's also going to do is it's going to maintain the lineage. 01:15:35.960 |
It's going to maintain the connection. So now we've got a single document, but we can see all of those 01:15:42.200 |
sources. We can see where it came from. So you don't lose the institutional understanding, right? 01:15:49.320 |
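A sketch of that collapse step. The session uses what's described as APOC "nodes collapse"; shown here with apoc.refactor.mergeNodes, which merges a group of nodes into one while keeping their relationships, so the lineage back to the sources survives. The label and property names are assumptions.

```python
# Sketch: for each community of near-identical chunks, physically merge them into one node,
# keeping every incoming/outgoing relationship (so source lineage is preserved).
# Assumes 'community' was written by the clustering step and 'text' holds the chunk text.
run_query("""
MATCH (d:Document)
WITH d.community AS community, d.text AS text, collect(d) AS dupes
WHERE size(dupes) > 1
CALL apoc.refactor.mergeNodes(dupes, {
  properties: 'discard',   // keep the first node's properties
  mergeRels: true          // carry every relationship over to the surviving node
}) YIELD node
RETURN community, count(node) AS mergedGroups
""")
```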
Nobody likes-- and who likes dropping rows? Everybody gets worried when you drop a row, 01:15:52.920 |
right? Like, maybe it's just me. I always get worried. I'm like, oh, I'm dropping rows. 01:15:56.360 |
Is this what I want? So in this case, you don't have to. And that's what's so great about and flexible 01:16:01.240 |
about the graph is that you can do these large-scale management moments, and you're-- you get more 01:16:07.560 |
efficient. You get really clear. You don't-- or you're not over-bundling something or overpowering anything. 01:16:15.000 |
Oh, so basically what we're doing here is we're saying these four are all the-- basically all the 01:16:19.160 |
same. The text and the embedding are the same. So we're going to collapse them into a single node, 01:16:23.400 |
and we're going to maintain the original relationships. So in this case, when-- then 01:16:27.960 |
when you hit that embedding, you're only hitting it once instead of 19 or 49 times in our case. Yes? 01:16:32.840 |
So on the documents that are pretty much very similar, if they have various properties 01:16:38.360 |
that are slightly different, when you do the collapse, does it join all the properties together? 01:16:41.960 |
It depends on how you want to do it. Like, you certainly could. I mean, it may be that you 01:16:46.680 |
don't-- you may only want to, like, adjust what's in the index, right? Like, it could be a different 01:16:51.400 |
way that you want to manage it, like, depending on what you need to maintain, right? If it's legal, 01:16:55.640 |
then you probably don't want to lose anything. Um, you know, there's-- there's options. Yeah? 01:17:03.160 |
There's a microphone for you. Because it's a good question. Everybody needs to hear it. 01:17:12.680 |
So, um, the documents will be exactly the same for each one of the URL pages. The chunking was 01:17:20.840 |
similar and the similar kind of things got picked up. So does chunking impact when-- in these kinds of 01:17:28.360 |
mechanisms, how the similarity works? Yeah, I mean, because it could be, right? Because if it's the 01:17:32.440 |
same page, but one has, like, an extra paragraph for that version, then the chunking might split 01:17:36.440 |
differently, right? So you're not gonna-- it's not gonna-- it's not like a magic wand, right? It's no magic 01:17:41.720 |
eraser. But it's definitely an easy way to increase your efficiency and increase the quality of the output in the data 01:17:48.600 |
itself. Yes? So, you did the clustering on documents, or did you do the clustering on all different kinds of nodes that you have? 01:17:57.880 |
I did the clustering just on the document nodes, which in this case are actually chunks, right? 01:18:03.080 |
So it's just on those. So it takes the embedding, right, for that chunk, and then we run that KNN 01:18:09.080 |
so that we understand, like, do we have these really high-density places? The other thing it allows you to do is 01:18:15.480 |
understand, like, do I have the right information, right? So, um, where are we? 01:18:21.960 |
Right, so again, like, what are the most frequently used communities? We've seen this already, right? 01:18:28.760 |
This is a conversation, you know, these are the original documents. This is the next set, some of it, right? 01:18:34.760 |
Like, I've been doing more research into understanding, like, how people travel around the communities in different conversations. 01:18:44.040 |
Do they all end in one place, or do they start in one place? Or how long does it take someone to get to the right area? 01:18:50.840 |
Right? I mean, it's, you know, we're basically tracking human cognition by watching the way people move around their thought process, right? 01:18:59.160 |
You know, this is, like, one particular visualization of a single conversation, and for each of these, 01:19:07.640 |
we've got, you know, 10 pieces, 10 context documents that come back. You know, we have a K of 10. 01:19:13.240 |
And so it looked at which communities are they coming from across this conversation? 01:19:18.280 |
So this person originally was starting out in this area, and then clearly ended up over here. 01:19:23.240 |
And then when they got to whatever community group number, or whatever it is, um, they got to their answer, and then they moved on. 01:19:29.000 |
So whether your conversations are short or long is going to depend. What route you want people to take, right? 01:19:37.240 |
I think about, you know, I always go to product analytics, right? So what is it that someone might need to know? 01:19:43.880 |
So, so much becomes available. And, you know, again, on the developer side, like, maybe these things are important to us, maybe they're not. 01:19:50.520 |
But as someone who's building agents, it has to be important to you, right? 01:19:56.520 |
We have to really take accountability for the signals that we're amplifying in our systems and be aware. 01:20:03.640 |
It's why the data scientists and the developers aren't on different sides anymore. 01:20:07.560 |
We're all AI engineers. Question? Someone? Someone? No? Okay. All right. 01:20:12.600 |
My soapbox. Thank you. I appreciate that. Um, document usage. We talked about this. We talked about all these things. 01:20:19.480 |
I don't think there's... Yeah, I think we're good. Yeah. Oh, this is just another, uh, you know, 01:20:25.800 |
the betweenness centrality, right? Like, um, to understand why, why it might be important. 01:20:32.040 |
Um, again, if you think about what is it that we're delivering, you know, like somebody said, 01:20:39.320 |
like, who's the most important person at the company? Is it the person that's connected to any, everybody? 01:20:43.640 |
Or is it, you know, the one person that talks to both sides of the, of the business? Right? 01:20:49.000 |
Betweenness centrality says it's this person. Page rank says it's somebody else. So that's the other thing, too, 01:20:54.840 |
is when we're trying to figure out what's important, we have a variety of algorithms that do different things 01:20:59.640 |
that make that available to you. You know, conversation lengths, right? I mean, this is just some of the basics. 01:21:05.880 |
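To make that concrete, both scores come from the same style of call, and the "most important" lists they produce usually differ; a sketch against the earlier similarity projection (projection, label, and property names are assumptions):

```python
# Sketch: write PageRank and betweenness scores onto the document nodes, then compare the tops of each list.
run_query("CALL gds.pageRank.write('doc_similarity', { writeProperty: 'pagerank' })")
run_query("CALL gds.betweenness.write('doc_similarity', { writeProperty: 'betweenness' })")

# 'text' is an assumed chunk property; swap in whatever identifies a chunk in your model.
top_pagerank = run_query(
    "MATCH (d:Document) RETURN d.pagerank AS score, d.text AS text ORDER BY score DESC LIMIT 5")
top_betweenness = run_query(
    "MATCH (d:Document) RETURN d.betweenness AS score, d.text AS text ORDER BY score DESC LIMIT 5")
```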
Um, but this is just another really interesting one is we are looking at, um, good ratings and bad, right? 01:21:11.080 |
So this is just an analysis based on the community, right? How many are in the community? Um, and then 01:21:22.520 |
are the ratings good or bad by community, right? If you find that there's a, that there's a cluster 01:21:27.800 |
of documents where the ratings are significantly worse than the others, let me go look at that. 01:21:34.920 |
Let me go look at those documents. Maybe it's, you know, outdated. I had somebody, this is my interesting 01:21:41.160 |
story. So, um, I do some work with some folks who do like DOD and Fed work and, um, former Navy SEAL, 01:21:49.160 |
and he's just like, yeah, I couldn't believe it. And I'm like, what? He's like, somebody was doing 01:21:53.400 |
this proposal and they have like this one thing that they have to put in, like some risk assessment 01:21:58.360 |
thing. And like, it's just like they copy and paste it over and over and over again. And so one day he got 01:22:03.720 |
really curious and he wanted to actually track where it came from. It was almost word for word something 01:22:09.480 |
from the Vietnam War that people are still putting into these documents, right? Like it's got, like if we 01:22:17.160 |
just threw it into a RAG application, it would have really high page rank. Everybody's using it. But 01:22:22.440 |
where did it come from? Is that what we want? So looking at the things that are influential, 01:22:26.680 |
not because they're an anomaly, but because we're amplifying that signal, is that the signal we want 01:22:33.400 |
to amplify? That's a whole other like AI observability thing. But the reason I bring it up is that our job is 01:22:40.200 |
to build good, smart applications. We are responsible for the outcomes of these processes. 01:22:46.680 |
And these are some ways that you can take a new approach to what you're doing, right? 01:22:51.160 |
Let me go back. Questions? Comments? Concerns? Yes? 01:22:57.960 |
Yeah. I don't know that there's-- I mean, you have to do the projection and run the algorithm. 01:23:14.280 |
But we've got it in here, community detection and assignment. I believe betweenness is in here. 01:23:23.240 |
Creating secondary relationships from that. We're not even going to get to that. Yeah. So there's a 01:23:29.800 |
betweenness centrality. So when you look at the algorithms, they're all going to be pretty much the 01:23:35.320 |
same. It's going to be GDS and then the name of the algorithm. And then there's a couple of different 01:23:39.640 |
options. Stream means run the calculations and let me look at them. Just put them here. Mutate will 01:23:46.360 |
actually change the projection. You can also write. So if you do the algorithm dot write, it's going to 01:23:53.720 |
write back to the node and we'll write that assignment back. So sometimes you just want to play. You just 01:23:57.960 |
want to look at it, right? Sometimes you may want to just change it within the projection and just play 01:24:02.360 |
around. So you've got a few different ways that you can apply the algorithm once those numbers come out. 01:24:07.720 |
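Side by side, the three execution modes just described, using betweenness centrality as the example and the earlier projection name as an assumption:

```python
# stream: compute the scores and return them to the client; nothing is stored.
rows = run_query("""
CALL gds.betweenness.stream('doc_similarity')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId) AS doc, score
ORDER BY score DESC LIMIT 10
""")

# mutate: store the score on the in-memory projection only (useful as input to a later algorithm).
run_query("CALL gds.betweenness.mutate('doc_similarity', { mutateProperty: 'betweenness' })")

# write: persist the score back onto the database nodes themselves.
run_query("CALL gds.betweenness.write('doc_similarity', { writeProperty: 'betweenness' })")
```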
And the basic is you're always going to do a call. You're going to call the algorithm. 01:24:11.400 |
And then the first thing you're going to do is say, which projection am I running this on? 01:24:14.440 |
Right? In this projection, I actually had, like, I created a new relationship called co-occurrence. 01:24:28.520 |
Where is it? Yeah. So what this did is it says, okay, I'm going to look at the message and I want to 01:24:36.040 |
see the different documents that are in the same message. And for each of those where the elements aren't the same, 01:24:43.160 |
right? So I don't want to create a connection to itself. I want to merge. Merge is, you can either 01:24:48.920 |
create or merge a relationship. Merge just means if it doesn't exist, make it. I'm going to merge the 01:24:57.720 |
chunk one or the context one with context two. And I'm going to create this new relationship co-occurs 01:25:03.800 |
with. And the reason I wanted to do this is I wanted to see, like, which things come up together a lot. 01:25:09.800 |
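A sketch of that co-occurrence query. The Message label and the RETRIEVED relationship are assumed names for "the message and the documents that came back for it"; comparing element ids also keeps each pair from being counted twice.

```python
# Sketch: for every pair of chunks retrieved for the same message, create (or reuse) a
# CO_OCCURS_WITH relationship and count how often the pair shows up together.
# Message, RETRIEVED, and Document are assumed names for this data model.
run_query("""
MATCH (m:Message)-[:RETRIEVED]->(c1:Document),
      (m)-[:RETRIEVED]->(c2:Document)
WHERE elementId(c1) < elementId(c2)          // no self-pairs, each pair counted once per message
MERGE (c1)-[r:CO_OCCURS_WITH]-(c2)           // undirected merge: reuse the relationship if it already exists
ON CREATE SET r.count = 1
ON MATCH  SET r.count = r.count + 1
""")
```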
Because sometimes it's not just about the similarity. If I've done a really good job of curating my data 01:25:14.920 |
set, and I don't have a lot of that redundancy, I want to see, like, what are the concepts that are 01:25:20.680 |
coming together? Like, what are the things that come together? From there, you can then look at the way 01:25:25.480 |
they travel around each piece from one to another. So what I did is I created this co-occurrence, 01:25:30.840 |
right? So yes. So like, if we went back to our-- so basically, co-occurrence just says all of these 01:25:39.640 |
documents that were pulled in by this question will now have a relationship among them, right? And then 01:25:46.760 |
we weight that by how often do they come up together, right? You can do the same thing with traversal. So 01:25:53.080 |
these chunks come first, these chunks come second. So if you know that the conversations usually go in 01:26:00.520 |
that direction, do you want to figure-- what's the probability? Can I predict the next thing? Can I 01:26:05.720 |
figure out where the conversation is going to go? So all of this to say that, you know, I love that there's 01:26:11.960 |
a graph where I track. I love that it's just sort of like the way many people are doing things. And what I 01:26:17.480 |
want you to know is you've done all the hard work of building the graph. Now let's use it even more. 01:26:23.240 |
Let's get even more out of it. There was a question somewhere. Yeah, yeah. 01:26:27.240 |
Co-occurrence is when, like, the-- you ask a question and then the retrieved answer, like, 01:26:34.440 |
you have, like, two answers that are commonly retrieved together, like, to that question. Is that 01:26:38.680 |
co-occurrence? Am I understanding it correctly? The co-occurrence is all-- everything that came 01:26:45.320 |
into this one, like, if you're doing a K of 10, all of those 10 will co-occur, meaning, like, 01:26:51.400 |
they all were answered together. They all were fed back at the same time, right? So it's almost like 01:26:58.040 |
an affinity of, like, these things often come together. Then you can, like I said, you can do 01:27:02.520 |
the same thing. You can expand it out into, like, follow-on. Like, these things usually follow those 01:27:07.560 |
things, right? So we can start to see. And that's where, you know, things get juicy. 01:27:12.920 |
Thank you. My nerdy, like, data science brain gets super excited. Like, oh my god. Then we can 01:27:17.800 |
check this. Then we can check that. And maybe somebody cares. Yeah. 01:27:21.080 |
You said what? Oh, sorry, sorry, sorry. I'll be back. 01:27:24.760 |
I got a question more-- more not a technical question, more general. So you mentioned the 01:27:30.600 |
DOD and FED space. So we operate in the DOD space. Oh, nice. 01:27:36.360 |
So what is the, you know, what are your thoughts on the adoption of this technology, 01:27:41.800 |
so within the DOD, both on the government side and also on the U.S. side? 01:27:46.040 |
Yeah. I have a lot of opinions on that. Naya will tell you. I think when it-- the most interesting 01:27:53.480 |
conversations that I have are around accountability and traceability. There was-- I was talking to 01:28:00.040 |
someone literally this weekend and said there was an unnamed person who's currently running for 01:28:05.160 |
Congress and they said about the military, oh, they're just going to need to learn to trust the AI. 01:28:09.640 |
I've never been so alarmed in my life. I was just like, no, no, trust but verify. Trust but verify, 01:28:17.400 |
please, right? I mean, like, as someone who's been a data scientist for a long time, like, we do our 01:28:23.560 |
best, but there are unintended consequences. Like, we didn't know we were building an echo chamber. 01:28:28.280 |
We just thought we were giving you chicken nuggets and Twinkies, right? Like, we didn't know it was 01:28:31.320 |
going to break America, but we did it, right? Like, I was one of those people. Like, I built those 01:28:35.800 |
algorithms. Hello. Thank you. You know, and so, like, I think the best thing that we can all do is to just 01:28:43.800 |
keep those things in mind. And so that's where the conversations, like, my conversations with them, 01:28:49.320 |
like, and I'm not, like, saying, like, oh, you're so dumb. Like, you don't know. Like, people don't know, 01:28:54.200 |
right? Like, like, I've been doing this for years. Like, I've lived in the calculus, right? Like, 01:28:59.240 |
I've lived in, like, you know, understanding the mathematics of the neural network, right? Like, 01:29:05.160 |
I've spent a lot of time-- like, the data scientists, we had rigor and we had mathematics and we, like, had-- 01:29:09.880 |
every time we released a model, we had model reports and all this stuff. And now it's just like, 01:29:16.120 |
just trust it. It's good. We're fine, right? And a lot of it is, right? Like, we do-- I mean, 01:29:23.320 |
it's the world we live in. Like, I'm certainly not-- like, I'm not a naysayer. Like, I love AI. I love 01:29:28.360 |
what's possible. But I just think that we just need to be a little bit thoughtful about what we're doing. 01:29:33.720 |
And, like, that's why these are easy things to do, right? Like, and then once you're aware of it, 01:29:39.320 |
you're like, oh, yeah, I'm going to get a much better answer. That's awesome. Let's do that. Yes. 01:29:43.720 |
I have a question. So on the-- more on the knowledge graph side of things, is there, like, an easier way to-- 01:29:53.080 |
is there an easier way to automatically create the knowledge graph? Because, you know, things like 01:29:59.160 |
graphlets have really made things a lot easier for a lot of people versus, like, this is-- to me, 01:30:07.720 |
it's kind of like an old school way of creating graph, which is you've really got to be the domain 01:30:12.360 |
expert and know all the ontologies and all the relationships. But what if you wanted to do that 01:30:17.320 |
automatically many times over for different types of businesses? How do you do that? 01:30:21.480 |
Yeah. It's a really excellent question. And so hopefully everybody heard the question, which 01:30:25.240 |
is, like, how do you, like, build graphs at scale, right? Like, how do you do it without having to be 01:30:30.040 |
an expert? The knowledge graph builder, the KG builder that Neo4j built-- because these are 01:30:36.120 |
questions that people have, right? There's some really interesting ways. Like, you can actually 01:30:41.000 |
use the simple KG pipeline and just throw your documents in and just say, "Tell me what's important." 01:30:45.800 |
Right? It's going to go through and do basic named entity recognition. Based on those entities, 01:30:51.000 |
it'll do another pass and say, "How are they related?" And what it'll do is it'll look at each 01:30:55.000 |
document individually. But then what you start to see is you start to see, like, the natural organic, 01:30:59.960 |
like, what's actually in there, what comes up. So I love that for experimentation and for 01:31:05.800 |
understanding. Because oftentimes, as builders, we don't know. We're not the subject matter experts. 01:31:10.200 |
But we're responsible for building the architecture for them to get the right answer, right? So how-- 01:31:14.840 |
what tools do we have to try and service it? Because then you can run it and it'll be noisy. Like, 01:31:20.680 |
it will look like spaghetti. It will be noisy. But then you can say, "Okay, tell me how many of 01:31:24.920 |
this label. How many of that label? Show me, like, what is the page rank? What is the betweenness 01:31:31.000 |
centrality?" Right? You can run these kinds of algorithms to understand. And then from there, 01:31:35.320 |
you say, "Okay, according to this set of documents, this is the ontology that has risen out of it." So you 01:31:40.920 |
can take a very organic approach. Sometimes people will go in with a very specific data model in mind. 01:31:48.040 |
And even in those cases, I always encourage people to run the KG Builder for experimentation purposes, 01:31:54.120 |
because it could be that there's something in your documents that you didn't even know was valuable. 01:31:59.560 |
Right? Like, let them speak. Like, let them-- like, let the ontology rise. Yeah. 01:32:05.640 |
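For reference, the simple KG pipeline mentioned here lives in the neo4j-graphrag Python package; very roughly it is wired up like the sketch below, but treat the import paths and parameter names as assumptions to verify against the current neo4j-graphrag documentation.

```python
# Rough sketch only -- check the neo4j-graphrag docs for the exact imports and parameters.
import asyncio
from neo4j import GraphDatabase
from neo4j_graphrag.llm import OpenAILLM
from neo4j_graphrag.embeddings import OpenAIEmbeddings
from neo4j_graphrag.experimental.pipeline.kg_builder import SimpleKGPipeline

driver = GraphDatabase.driver("neo4j+s://<your-instance>", auth=("neo4j", "<password>"))

kg_builder = SimpleKGPipeline(
    llm=OpenAILLM(model_name="gpt-4o"),   # does the entity and relationship extraction passes
    driver=driver,
    embedder=OpenAIEmbeddings(),
    from_pdf=False,                       # passing raw text here, not a PDF path
)

# Let the entities and relationships "rise" out of the text, then inspect the result with graph algorithms.
asyncio.run(kg_builder.run_async(text="...your document text..."))
```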
The question is, like, I think today our industry has been writing content more for consumption by 01:32:11.640 |
humans. Now this is changing towards consumption by agents. Yeah. 01:32:16.120 |
Do you see the graph analytics that you're showcasing playing a role for that persona, 01:32:21.400 |
for content writers, whether for completeness or for how they're writing and producing more content? 01:32:26.360 |
Is it helping them? Is that the primary persona we are targeting here? 01:32:29.480 |
I mean, I mean, listen, it can go in any direction, right? Like, one of the things that I'm really 01:32:34.280 |
excited about is the concept of, like, I call it-- it's not about human in the loop. It's about 01:32:41.800 |
accountability in the loop, right? Like we said, we're going agent to agent to agent. And so what do we then 01:32:47.560 |
pass to that agent, right? So if we have some of this information, right? Like, if we're concerned about 01:32:53.480 |
amplifying signal, right? And if we're concerned about, you know, like, a series of agents going in 01:32:59.320 |
the wrong direction, right? You think of a ship. If it's off by one degree now, it's fine. You know, 01:33:04.120 |
a hundred miles later, not so fine. So how do we-- how do we keep steering the ship when it's autonomous? 01:33:10.040 |
And so that's some of the interesting things, too. Like, when we look at this memory graph, 01:33:14.120 |
right, we can say-- like, this application graph, we can say, "Okay, I'm going to pass this to the agent, 01:33:20.040 |
and I'll pass, like, this context, or this particular information about what came before." 01:33:25.640 |
Or, you know, like, you can look at, like, sort of, like, the velocity of how quickly it's going from one 01:33:31.480 |
community to another, right? Like, there's so many interesting ways you can do the math to look at it. 01:33:35.480 |
But yeah, like, I think, like, really understanding, like, how do we drive our agents? 01:33:43.000 |
How do we drive our agents and chains is super sexy, super exciting for me. 01:33:53.320 |
So a lot of times, we realize that it's garbage in, garbage out. 01:34:00.600 |
And we ourselves sometimes aren't aware of, like, some issues with our own content, 01:34:07.720 |
whether one part is contradicting another, or something's just inaccurate. 01:34:13.320 |
Is there any fancy math with graphs that can help us identify potential issues with content hygiene? 01:34:20.280 |
I mean, I would want to give you a more thoughtful answer than something off the top of my head. 01:34:25.800 |
Like, I don't want to, I don't want to throw something out that's just kind of like, oh, what is this? 01:34:33.480 |
But again, some of it has to do with what is it connected to, right? 01:34:37.480 |
So if we think of, you know, how fraud detection works, right? 01:34:41.560 |
Like, like, or, so I worked on the, they did election graph for, with Syracuse University last 01:34:49.720 |
And they were tracking advertising, like, across Facebook and Instagram. 01:34:53.800 |
And, and I, Jenny, who ran the program, she said something really interesting to me, 01:34:59.000 |
which is, if you want to know how someone's going to vote, it's actually not their friends. 01:35:04.360 |
But if you look at the friends of friends, that second hop out is actually more predictive 01:35:10.440 |
of how someone's going to vote than their immediate group. 01:35:16.280 |
But it's the same kind of thing that might be possible, right? 01:35:20.840 |
Are we finding that there's, like, some sort of, like, density of something, like, in that 01:35:27.000 |
The other thing about me is I love to be available. 01:35:32.200 |
So feel free to connect with me on LinkedIn and DM me. 01:35:35.880 |
And if you ever want to noodle, like, I'm always happy to noodle on, like, all the things. 01:35:40.520 |
Because my hope is that you leave here today and you start thinking about, like, what might be 01:35:46.920 |
What could you look at a little bit differently? 01:35:55.960 |
I wanted to ask about the embeddings versus the structure. 01:36:12.680 |
But I wanted to ask you if your team has any specific experiences in terms of how to leverage 01:36:18.840 |
the embedding per node, let's say, versus the structure of the graph. 01:36:23.400 |
Because, for example, in the morning session, we've seen just simple aggregation of the two. 01:36:33.640 |
So my question is, how would you start approaching a given specific domain where we have embedding 01:36:39.000 |
in the same embedding space for every node, plus the structure? 01:36:43.560 |
There are actually really interesting ways that you can do combinations of not just the text 01:36:48.040 |
embedding, but the embedding of the node itself, right? 01:36:51.080 |
So for those of you who don't know, beyond the concept that we have of this text embedding, there's also the node embedding. 01:36:56.520 |
And so it's an embedding, really, of the structure of any given node and which relationships it's part of. 01:37:01.000 |
So, yeah, there's some really interesting things that you can do there. 01:37:05.720 |
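One concrete option for an embedding of a node's structure is a GDS node-embedding algorithm such as FastRP, written onto the nodes next to the text embedding; a sketch, with the projection and property names assumed:

```python
# Sketch: FastRP produces a structural embedding per node (based purely on the graph topology),
# which you can store next to the text embedding and combine however you like downstream.
run_query("""
CALL gds.fastRP.write('doc_similarity', {
  embeddingDimension: 128,
  writeProperty: 'structural_embedding'
})
YIELD nodePropertiesWritten
RETURN nodePropertiesWritten
""")
```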
One of the things that we, like, didn't get to today but is, like, you know, leveraging page 01:37:11.160 |
rank and, you know, vector to, like, re-rank. 01:37:13.640 |
So there's definitely some interesting things that you can do. 01:37:16.440 |
I don't know if we have content on that specifically, but definitely reach out to me and I will reach 01:37:21.720 |
out to our people and we'll get you a really good answer on that for sure. 01:37:27.880 |
I still only understand the community detection a little bit. 01:37:41.960 |
So, we've got a few different ways that we can look at it. 01:37:46.920 |
This is one example how Louvain actually works. 01:37:49.800 |
So, like I said, what it'll start to do is it'll assign a label. 01:37:54.760 |
And then what it says is, okay, let me look at its neighbors. 01:37:58.120 |
And if I give the neighbors the same label, does that modularity, 01:38:02.200 |
does that interconnectedness number get better or worse, right? 01:38:06.280 |
So, different clustering algorithms are going to do things different ways, right? 01:38:09.800 |
But in this case, this is what we're looking at. 01:38:12.440 |
It's like, and if you think about it, it's like, you know how you have some friends and 01:38:17.480 |
you all know each other, you're all really tight, but then you have this other friend 01:38:20.920 |
who's got, like, all these different, like, friend groups that they're a part of. 01:38:24.200 |
They don't have, like, their one little niche. 01:38:27.240 |
So, the more interconnected it is, the more relationships they have among each other. 01:38:34.360 |
How would this be of any use in terms of, like, getting the actual data from the query? 01:38:43.320 |
Yeah, so, when we talk about the community detection, 01:38:47.160 |
the community detection is really meant to be, like, a production helper. 01:38:53.800 |
it's going to allow you to understand at scale, like, where things are moving. 01:38:58.040 |
The community itself may not necessarily be part of your answer, 01:39:05.560 |
unless, as I mentioned, you want more diversity in an answer. 01:39:10.120 |
You want to cover a larger domain than just what's really close to the vector, right? 01:39:15.960 |
So, think of, if I'm, you know, I don't know, I'm in sales at Amazon. 01:39:20.440 |
Like, I don't want to, you just bought luggage. 01:39:21.960 |
I'm not going to show you more luggage, right? 01:39:23.720 |
I want to show you the things that are often bought with it, right? 01:39:23.720 |
So, in that case, it could be helpful, but from a content perspective, 01:39:33.720 |
it's like, do I want to give you just what you have, 01:39:36.280 |
or is there some benefit to spreading the net a little wider 01:39:40.520 |
in what's being brought back by that vector retriever? 01:39:44.840 |
Okay. So, instead of just getting that one embedding, it gets, like, all the surrounding embeddings as well? 01:39:49.320 |
Yeah. So, depending on how you build your retrievers, right? 01:39:51.640 |
Depending on what your thresholds are, depending on, you know, 01:39:55.800 |
But, again, I'm happy to noodle with anyone, any time. 01:39:59.880 |
I literally get paid to help people build things, so I love my job. 01:40:04.280 |
Like, help me keep doing my job by asking me lots of questions and letting me help you. 01:40:09.400 |
So, like, when it detects what's a community, does it only, like, check its surrounding nodes, 01:40:15.000 |
or even, like, nodes across the entire graph? 01:40:17.080 |
You know, why don't you and I, we can talk about more community after, 01:40:21.000 |
because I don't want to keep everybody, because I know we're getting close to time, 01:40:23.160 |
but it's a good question, so thank you for that. 01:40:25.000 |
What I do want to say is, if you've been with us this morning and this afternoon, 01:40:29.400 |
your day with Neo4j is not over, because our lovely friend Alexi is here, 01:40:33.960 |
and he's hosting our AI agent meetup this evening. 01:40:41.640 |
Like, he's just one of the most, like, vivacious and passionate people about tech I've ever met in my life. 01:40:46.600 |
So, yeah, please join us for the agent protocols tonight on Graphs and MCP. 01:40:52.600 |
Feel free to reach out to me on LinkedIn, and I just want to thank you all for your attention. 01:40:57.320 |
I know that it was a, it was a lot, but please, like, just be curious.