Hey y'all, how are you doing? How's the day? I'm pretty excited to be here. I was here last year and then we were at the summit, the agent summit, not that long ago. I've definitely seen some familiar faces in all of those places. So thank you for coming to our conversation today.
My name is Alison Cosette. I am with the developer relations team. You will likely recognize this face from all of the GraphRAG videos on deep learning, chillin' with Andrew Ng, like, you know, all the guys that they had. Yeah, it's just how you roll. So, before I get to what I'm gonna be talking to you about today: was anybody in the Neo4j workshop this morning with Zach?
Oh, this is great. I love it. It's like Neo4j here. So yeah, we're gonna do a few things. So that's good. You may have, well, no, I'm gonna just go like you didn't do that. Okay. So this is our repo that we're gonna be working with today. It has, yes.
So is it? Is it not there? Because I can see your profile, but if I can't, you see, I do. You know, that would not surprise me that I made that mistake. One moment, please. Let's see my settings. Yeah, it probably, I think it defaults to... I know it's in here somewhere.
Change visibility. Change to public. Sorry. All right. Now I get it. So much security. All right. Sorry about that, folks. Yes. Confirming access. Okay. Is that better? Excellent. All right. Sorry about that. So yes. Oh, so you can actually see. Yeah. I'm always in dark mode. Sorry. Can we do, can we do that?
I don't know. Just these lights along here? Yeah. Just the lights in front of the video screen. We'll work on that. I love that people are raising their hand. This makes me very excited. I love talking to people. I love being in the room with folks. So this is good.
So I'll tell you a little bit about what our goals are for our next, you know, 120 minutes together. What we're focusing on, obviously, a lot of you went to the intro to GraphRAG this morning. How many people have been working on RAG applications already? Yes, yes, yes. Excellent.
How many people have been working on GraphRAG already? A good number. Okay. So what I'm going to be doing today is talking specifically about how you can use graph data science to improve what you're working on in your applications. The reason that we are doing this is because one of the beauties of having a graph is you now get access to new ways of thinking about and understanding the data that you have.
And so my goal today is to have everybody feel comfortable with some of the basics of graph algorithms. And, you know, hopefully while you're working through this, you can see something that's going to be of interest. There might be something that you want to try when you get back to your application.
I do this a lot, and I will say the most successful workshops we have are when people bring their real-world scenarios to bear. So we will be going over the code, but a lot of what we're going to be doing is talking through the concepts and around Graph as a tool for managing your RAG.
And we're really just going to be talking about Graph algorithms. Anybody here come from the data science side of things? Oh, all right, we got a few, we got a few. One of the things that I think is really interesting about the term AI engineer is, you know, somebody said, "Oh, it's just engineering.
Like, it's still pipelines." I'm like, "No, it's not. It's different." Right? I came initially from the AI side, and I can tell you that none of us, none of us look at things the same way that we did, you know, a few years back. It isn't the data scientists on one side and people putting things into production on the other.
You know, we work really closely. And there's this concept that people say all the time: that developers are just building the pipes and that the data scientists are worried about the quality of the water inside the pipe. But with AI engineers, we're all worried about the pipes and the water, right?
So what I really want is to give you as much as I can about the algorithms and about the data science and get us rolling. So let me see. Where's my repo? Where's my repo? Okay. Yeah. So just clone this repo and you'll have it. We're going to walk through it.
It's really just a click-through of a number of Jupyter notebooks. But really, it's mostly about the conversation that we're going to have today. So obviously, everybody here, we always start with the basics, right? I call them the GPS moments: let's review all the things everybody knows first.
We know that in our RAG applications, what we're really trying to do is get better answers, right? That's what everybody here is struggling with. Some of the common things I hear people struggle with are: am I getting the right data? Do I have the right chunking strategy?
How do I handle temporal data? How do I handle data quality? What are some of your biggest RAG challenges? Anybody want to share? What just irks you about what you've been working on? Everybody's so quiet. I should have set up a Slack channel.
Yeah. So we're definitely going to talk about that today. I think one of the things that we found is that the management is... what? Oh, sorry. The comment was that how to handle the volume of the data was a really big one.
And the other element was understanding the relationships among the data. So did you go to this morning's workshop? No. Okay. Well, we have lots of video on it, so we can definitely get that to you. Yes. I would add, like, temporal relationships. Yes. Yeah. Temporal is a really interesting thing.
I don't have temporal in the slides, but we can definitely talk about temporal because it's really important for sure. One of the things that we find is, you know... I will give you my apologies up front. I get super excited about graphs.
So I apologize in advance if I am overly enthusiastic about what I do because I love graphs. And there's so many times where something comes up and you're like, Oh, well, if you use graph, you could address it this way. So I just want to, you know, keep bringing these things up and let's talk through all of them.
But what we're really looking at is how do we give this complete and curated response, right? What are we using for the external data? I mean, initially everybody drops their stuff into Pinecone or Weaviate; you hit a vector store. Is anybody now working with multiple databases feeding into their RAG system?
A little bit. Yeah. Still primarily vectors, primarily vectors. Yeah. Yeah. I mean, everybody uses vectors because they work, right? And that's the starting point. But the thing about the vector is, the analogy I always use is it's like you have a bunch of little index cards.
And so imagine it's my first day on the job and everything there is to know about my job, about what I have to do, is on an index card. And I say, Oh, well, what do I know about this? And then somebody finds the index card with this little blurb.
And so like, I've just got these little things, but I don't understand how it all puts together. Right. We don't understand what is it connected to, right? What is it connected to? What context do we need? And so what I want you to start thinking about is what are the projects that you've already been working on?
When you're getting those answers back, what are some of the pieces of information that you wish your vector had? Right. Anybody have an idea of something they, you don't have to share, but like you've thought about like, Oh, if only it could do this, right? If only it knew about that, I would get a much better answer.
If you haven't, then I'm sure like the, your business folks that you're responding to have because they always have an opinion. Right. So one of the things about the knowledge graph is that it's a way for us to bring our structured data and our unstructured data together. The other thing that it will do is it will actually find knowledge inside of your unstructured documents, right?
So one of the things you brought up was how do we create the relationships among these documents or among these pieces of information? And there's a couple of ways that happens. One is it's really explicit, right? There's a table of information inside this document and you can pull that out, right?
Old school regex, everybody's friend, or it's something that is less explicit and more implicit. And that's where the LLMs come in, finding out what is in there. So in this morning's one, and if you want to hit us up, we can definitely show you how we actually extract and build the knowledge graph.
What we're working with today is a database I already have set for you folks; it's a data dump that we're going to load into your Pro Trial instances. But the thing is this data comes from lots of different places. So I guess most people here are familiar with the graph, but just in case, I do like my little GPS moments: let's talk about all the things we already know.
What is a graph, right? Basically we start with nodes. Nodes are entities, they're your nouns, they're your items. Some people relate them to the tables in your relational database. These are your table names, right? But the important thing is the relationships among those entities. So a customer and a product, a product and a part, a supplier and a buyer.
It's the relationships in and among these pieces of information that's where our business and where our information lives. In this case, we've got two people. One of them owns a car, one of them drives the car. They do know each other. I'm not sure why person one lives with person two, but not the other way around.
Who knows? Life is interesting, right? So we have these relationships, and this tells us the story of the interaction between and among these. And properties. This is where we get into the nitty gritty. We get into the details. These are those extra columns within the table, right?
Our relationships are our foreign keys. Our properties are the columns. The entities are the tables, like roughly, right? You know, don't quote me on it, but pretty close, right? From a conceptual point of view. These properties can live on the entities. They can also live on the relationships, right? So we can see that this person has been driving the car since this date.
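To make the nodes, relationships, and properties idea concrete, here is a minimal in-memory sketch of the slide in plain Python. It is not how Neo4j stores anything; the class names, the people's names, the car details, and the date are all made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                      # roughly the "table name", e.g. Person or Car
    properties: dict = field(default_factory=dict)  # roughly the "columns"

@dataclass
class Relationship:
    start: Node
    type: str
    end: Node
    properties: dict = field(default_factory=dict)  # properties can live on relationships too

# Hypothetical data mirroring the slide: two people and one car.
ann = Node("Person", {"name": "Ann"})
dan = Node("Person", {"name": "Dan"})
car = Node("Car", {"brand": "Volvo", "model": "V70"})

relationships = [
    Relationship(ann, "KNOWS", dan),
    Relationship(dan, "KNOWS", ann),
    Relationship(ann, "LIVES_WITH", dan),            # one direction only, as on the slide
    Relationship(ann, "DRIVES", car, {"since": "2011-01-10"}),
    Relationship(dan, "OWNS", car),
]

def neighbours(node, rel_type):
    """Follow outgoing relationships of one type from a node."""
    return [r.end for r in relationships if r.start is node and r.type == rel_type]

print([n.properties["name"] for n in neighbours(ann, "LIVES_WITH")])
print(neighbours(dan, "LIVES_WITH"))
```

The point of the sketch is only that direction and properties belong to the relationship itself, which is why "lives with" can hold one way and not the other.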
It's a very old car now. But one of the other things that we're going to see is we see underneath this car, we see a couple of things. We've got the brand, the model. We have a description of the car. And what you'll see underneath is the description embedding.
So we're all familiar with the embeddings, right? The embedding is a way of taking some piece of text, putting it into, you know, many, many dimensions, and creating a vector in space that represents the semantics of whatever that text is. One of the things that's really great about a graph is that you have immediately that connection from the embedding to all of the other information.
We see it in this microcosm here of the car, the embedding of the car. And you can pretty quickly see, oh, if I did a vector search on, you know, this car, right? And I wanted to know who are the people that own cars described like X, it's a pretty quick traversal.
You're going to do a vector search. It's going to take you to that node, that entity, right? Similar to your chunks that you've already been working with. But immediately, you already have all the relationships built in because they're already in the database, right? We don't have to, we don't have to like create new tables and views.
Everything's already there and it's connected. So when you go into graph rag and you're working with your retrievers, the beauty of the retriever is in this ability to traverse and traverse very quickly and get that immediate context, right? So these are the basic components of what we're going to be working with within the database.
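As a rough illustration of "vector search, then traverse," here is a small self-contained Python sketch. It is an assumption-laden toy: the embeddings are three made-up numbers each, the car and owner data are invented, and a real retriever would use a vector index in the database rather than a sorted list.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical car nodes, each with a description and a toy 3-dimensional embedding.
cars = {
    "car-1": {"description": "old family estate", "embedding": [0.9, 0.1, 0.0]},
    "car-2": {"description": "small electric city car", "embedding": [0.1, 0.9, 0.2]},
}

# Hypothetical OWNS relationships: (person, car).
owns = [("Ann", "car-2"), ("Dan", "car-1")]

def owners_of_cars_like(query_embedding, top_k=1):
    # Step 1: vector search lands on the most similar car nodes.
    ranked = sorted(
        cars,
        key=lambda car_id: cosine(query_embedding, cars[car_id]["embedding"]),
        reverse=True,
    )
    hits = ranked[:top_k]
    # Step 2: traverse the relationships that are already attached to those nodes.
    return [(person, car_id) for car_id in hits for person, owned in owns if owned == car_id]

print(owners_of_cars_like([0.8, 0.2, 0.0]))
```

The second step needs no new tables or views; it only follows relationships that already hang off the node the vector search found.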
So we're going to start at the very beginning, right? How does this, how does this work? We start with the entity. The first entity is the document and the document has a chunk, right? Pretty straightforward. It's a row in your vector store, right? But this is the interesting thing.
You also have like, anybody here working with agents already? MCP servers? Yeah. The other thing that we get and what we're going to be talking a lot about today is the application. I call it the application graph. Like they call it a memory graph. For me, it's what is the actual activity that's happening in the system?
What is, what is it doing? What are the messages going in and out? What are the context documents that are being used, right? Everything that we're building is moving quickly and it's moving at volume. There's massive amounts of documents and data. There's so much volume. How do we actually manage it?
What do we do? This is going to be a key part of what we talk about today, which is the actual understanding of what's happening in the application and how by monitoring and looking at what's going on in the application, you can do things like manage your documents at scale.
You can understand what are the actual important documents, what's influencing outcomes, right? I've got some good stories to share with you about what people have been working on. One of the things you mentioned is what other elements are there. So when we look at the entirety of the graph, we have like the memory graph for the application.
It's connected to the chunks and then those chunks are connected to the domain, the products, the people, and basically all the other structured pieces of your business or the structured elements that you've extracted from those unstructured documents. So what you end up with is you end up with a fully connected network of your data and your system together.
So much of the way that we have approached development prior is you had your data people over here and you're building people over there. We're all together now, right? We all work together and the way that we have to build is we build from that cohesive point. We build from the data and the application at the same time.
So I'm just going to show you what becomes possible as you start taking this kind of an approach. Everything that we're looking at today is just this little bit of the graph, right? So we're going to be looking at prompts and responses and we're going to be looking at the context documents that they're connected to and then from there we're going to say, okay, let's figure out where we go now and what we can start to understand.
Anybody have any questions so far? All sounds really obvious, like why are you still talking about this? Too slow, too fast, are we good? You're good, okay, all right, good pacing. If I go too fast because I'm from New York, I'm a fast talker, let me know. If you're getting bored, just kind of give me one of these and be like, okay, let's go.
So that's fine, too. But the best thing that you can do is, when you have a question, just raise your hand because everybody has the same kinds of questions. Everybody's struggling with things. And what I love about AI engineer is I get to be with the people who are doing the things. You are the people that are making everything happen right now, right?
I mean, our entire world is in like just this incredible transformation and you all are the people that are building it. And so I want to be here to support you in this moment because it's a pretty exciting time. So please don't be shy. Questions, comments? Yes. Oh, I will drop it in.
It's not in right now, but I will drop it in before we leave. Thanks. Yeah. All right. Yeah, so they're all the same system. They're really just sort of like neighborhoods, right? They're kind of neighborhoods in the system. The reason that we draw it out is because oftentimes people think of them as separate, right?
So this one doesn't have everything, but what we see here is that this one has the source. So the yellow nodes are things that are coming from your application. The blues came from unstructured data. We're not doing it today, but if you had structured data that you were also pulling in, like bringing in a table, whether it's already structured or has been brought out of your unstructured data, that would be just a different color.
It's all one system. But the reason that we do this is just so that people conceptually understand where it's coming from and what it does. Because the big thing really is understanding the fact that once it's all connected, your application is talking to all of the data, and your structured and your unstructured data are all working together.
And at any time you have access to understanding across that system. It's like, you know, complex systems and system thinking and all that good juiciness. So they're not separate. They're just neighborhoods. Yes. So, so, um, what you are... Are you proposing that you have like one massive graph? Yeah.
Okay. I mean, don't start with a massive graph, right? Like, let's be honest. Let's start with something small. I mean, my first piece of advice is always start small. The other thing, as always, is that we want to look at optimization. We want to make sure that we're building for the latency that we need.
There's always going to be context. Yes. Can you elaborate on that? They mentioned it this morning as well. Yeah. To start small. But when you're trying to map, for example, the ontology of an organization and be able to have an assistant have access to that. Yeah. We've always had to go big.
Yeah. How do you do small? Just break it apart? Yeah. Well, I mean, that's probably an ABK question. Yeah. So some of the complexity for really big graphs: part of the bigness of it is not just the number of nodes and relationships you have, but the number of labels and relationship types that you have.
And for a large organization, don't over-specify. So you start out with kind of generic terms and see if that works. And if you've got too much volume in the generic terms to segment, then you get more specific. And the other part about that, I suppose, is that you start with not all of the data, but some of the data, work with it, do some eval, figure out if it's doing what you want it to, and then refine before you suck in everything and create the entire database.
So you kind of iterate on the schema with a subset of the data. It's normal data engineering, I guess, right? Yeah. Do just some of the data, start generic, and then get more specific. Yep. When you say big versus small, you're talking ontology, not knowledge graph, right? Correct. Yeah, that's right.
So, talking about the ontology or the schema, how the graph is actually laid out, rather than the number of each kind of thing, right? So starting there, iterating on that first, and then bringing in all the data. Yeah. Yeah? What are your thoughts on the RAPTOR indexes?
Are you presenting indexes in a tree-based structure based on some hierarchy versus knowledge graphs? And what performs better in what scenarios? The RAPTOR-based indexes, like if you represent content in more hierarchical tree-based structures, and then you have summaries in hierarchical nodes, searching through those versus knowledge graphs: which scenario plays out better?
Like, when should we go with RAPTOR indexes versus knowledge graphs? The right answer is the annoying answer of it depends. And so you should have your eval drive the structure, right? So come up with some eval. What are you actually trying to do? What are the questions people are going to ask?
And it's the usual thing: the questions that you or other people are going to ask should drive the data model here, the ontology. But that also ends up meaning, do you put things into summarization over communities? Or do you have the knowledge graph representation of it?
This can be driven by the questions. Yeah, makes sense. Yeah. So if you can find your way to that workshop Neo4j, I've just pasted an OpenAI key in there. You're going to end up needing that at some point while we get this repo updated. And I guess in the meantime as well, I'll just answer any arbitrary questions you have.
Yeah. Yeah, it's neither of those. Really? It's the graph analytics workshop. What? No, I know. I've done so many different workshops lately. Like it's, yeah. I'll clean up the repos. Sorry. Slack channel, yeah. Oh. No, so. Oh, you have a Slack channel. There's a Slack channel in the AI engineer's workspace.
Oh. Oh, sorry. There's two. Yeah. Okay. So it should be the workshop-neo4j1, not the workshop-neo4j2025. I didn't realize there was yet another one. Okay. All right. All right. Yeah. Thank you, Devon, for putting in the GitHub link. Where are you? Oh, I know. Hold on. Hold the phone.
Because... So one of the things that we're going to end up exploring here, and we touched on this a little bit this morning, is some of the motivation behind both GraphRAG and then also doing graph analytics on top of the graph. One of the aspects I always like to call out, and it isn't as obvious, I think, until it gets pointed out, and then it's super obvious, is that for the core data you start with, particularly with unstructured data, if you're doing chunking, of course, by using GraphRAG and graph analytics you're expanding the number of answerable questions for the same base data set.
Say, roughly, that for every chunk you've got, you can answer one question, right? That's linear: number of chunks equals number of questions you can answer. Now, for every chunk you've got, and this is the upper bound, take every pair of chunks, do the Cartesian product of all the chunks, and say, "What do these two chunks have in common?" or "How are they different?" Create relationships or don't create relationships between them.
In the upper bounds, that is now, you know, n squared number of possible connections. Every time you do that, that's new information that's possible. So now you've got an order of magnitude more, you know, questions that can be answered. Once you've got everything connected, you can then also do the community detection stuff that they touched on this morning, and we'll get into maybe a little bit today as well, that now that you've got all these chunks connected, it's not just the pairwise connections that are new information, it's the subsets that are through those connections that also can answer new bits of information.
So that if I'm in the graph, and you're in the graph, and you're in the graph, you know, or you over there, whoever, maybe Alison, it's not just us individually, but how we're related and who we know, the connections between us and the connections across everybody else. All that ends up being slices of the graph that are particular to the individual places that you've started.
That ends up forming all this. It's not the full power set of all the subsets within the set, but it's a huge amount. Like, if you started with 10 chunks, by just going to the pairwise connections, you end up with 100 answerable questions. And once you start doing the sub-communities, so the subsets of the set, at the upper end, it ends up being exponential.
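The counting argument can be checked with a few lines of Python. This is only the arithmetic of the upper bounds described above, not a claim about how many useful relationships a real data set will yield; the function name is made up for this sketch.

```python
from math import comb

def answerable_upper_bounds(n):
    """Upper bounds on 'things you could ask about' for n chunks."""
    return {
        "single_chunks": n,              # one question per chunk: linear
        "ordered_pairs": n * n,          # Cartesian product of chunks with chunks: n squared
        "unordered_pairs": comb(n, 2),   # distinct pairs of two different chunks
        "subsets": 2 ** n,               # every possible group of chunks: the full power set
    }

for name, count in answerable_upper_bounds(10).items():
    print(f"{name}: {count}")
```

For 10 chunks the Cartesian product gives the 100 mentioned in the talk, and the full power set is 2 to the 10th, which is 1,024. As noted above, real communities are only a slice of that power set, so the exponential figure is a ceiling.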
This is the power of graphs. Yes. It's two to the n instead of n squared, right? Don't you love math? Thank you, ABK. Okay, so we have pushed. If you have pulled, you will want to refresh this pull, this fork. So what you're going to see is a few different things.
One is, I do have a link actually to the slides. And then, actually we're not doing Docker, you can ignore that. But when we go down to the bottom, in connecting to Neo4j AuraDB, you're going to see the link to console@neo4j.io.
So we're going to have everybody go there. When you get in there, you're going to create a new instance. Oops, sorry. Yeah, it's console.neo4j.io. This is in the readme in the repository. Do you not? Let me see. Is it in the Slack channel? Which Slack are we? Are we 2025?
Not the 2025. We are not 2025. We're old school. We're not 2025. We're either in the future or deep in the past. But we are not right now. All right. We're the other one. So when you go in there, you're going to have a couple of different options for free tiers.
One is the Aura Free, which is smaller. It doesn't have as much of the optimization as the Pro Trial. So we're going to be using the Pro Trial today. I believe it's 14 days right now on the Pro Trial. But you can, if you want to work on your own project later, go in and start an Aura Free instance as well.
And that, as long as you're querying it every so often, it'll stay open and it will always stay free. Yes. Is there a preference on the cloud provider? No. It's your choice. Your choice. Oh, sorry. The question was, is there a preference for cloud provider? And I said, no, there is not.
The more you spread it out, the easier it is. We move along. Yeah. So you're going to get the free trial up and running. And so what that is going to look like when you create the instance, you're going to see a couple of options. You're going to go with the AuraDB Professional.
We're not taking your credit card. We're not doing anything. Eventually, it'll turn off. But it just is the most robust version that we have. So I guess it's seven days, seven days. So you're just going to try that for free, and you're going to get that up and running.
Why do we still have an AuraDS button? No? Okay. All right. Don't worry about that. Yes. Oh, yeah. Onboarding. Sorry. Yeah. It just passed. I mean, it's going to ask you some questions. You can just like, it does that thing where it makes you make the graph. Oh, yeah.
Sorry. Yeah. It doesn't. I mean, anything is fine. Like, don't tell anybody. It's a marketing thing. Like, just pass through it. I want to just get you enabled. And then when you go through, you can just use the default. It's going to default to four gigs. You know, if you want to name your instance, you may.
One of the things that's going to -- you can definitely turn on Graph Analytics and Vector Optimization if you want. And then just click Accept, and it'll start up your instance, and it'll take a couple of minutes. Once that instance is up and running, we'll get you to the dump and we'll get that loaded in.
So I'm just going to give everybody a couple of minutes to work on this. If you get stuck on this process, just raise your hand because I've got lots of my people, my Neo4j people, wave. Alexi, that's you too. Also Alexi, that's you. Alexi, yes, he's being called on in class, right?
He wasn't asleep, so that's good. He was just working. Yeah, so if you get stuck on anything, just let us know, okay? But what I can do, while we're -- while y'all are getting that up and running, is I can show you a little bit about what is inside of our database.
The console, it makes it really easy. Not only do you have access to your instances, you have access to Graph Analytics, but, you know, the basic IDE, which is the query, will be connected as well. So got one of each. Let's see what's in the database. Oh, I know.
That's right. I emptied it because I wanted you guys to be able to, like, do the dump with me. Okay. Oh, no. There it is. Okay. All right. Okay. So we've got the query. And in the query, for those of you who love your Cypher, you know, we can say RETURN n LIMIT 10.
All the Cypher that we're using today, like, I've already got in the notebooks. But what you'll see is when you do a query, querying in Cypher is really all about the pattern. So in this case, I'm just doing a MATCH (n). You probably can't see. Yeah, I was going to ask.
Can you see that okay now? Yeah, I'll fix it. Hold on. I got to make this so people can see all the things. See all that? No, that's not going to work. I'm going to zoom in. Sorry. Yes? Oh, okay. Okay. If you already have an account. Oh, you know what you can do?
If you already have an account, one of the things you can do is you can log back in. And if you change your email to your email plus, like, workshop before the @ symbol, it'll give you a fresh account. So, like, mine, if I'm, like, Alison at, you know, Google, if I do Alison plus workshop at Google, it'll still be your email address.
It'll still go to you, but it will be seen as a new account. So that's another option. Yeah, just create a new account. It'll make your life easy. Has anybody been able to get theirs up and running? Ish? No? Yeah, maybe. It's so interesting, because lately I've been noticing, like, people are so interested in engaging in conversation.
There's less of that. But we definitely don't want to leave anybody behind. So we'll let you all do that. But just to understand what is actually happening inside Neo4j when you're working with it: we run these patterns, right? So it's similar to, you know, SQL. The good news is Cypher is very user-friendly.
And we live in the age of, like, vibe coding and LLMs. So there's plenty of help for all of this within our copilots. But depending on what you get back, in this case, you can look at it as a graph. You can see the table. Or you can see, you know, just the raw values, but the graph itself.
So in this case, we're looking at a particular document, which is actually, you know, we call it a document. It's really the chunk. I will tell you that what we have in our, what we, the data that we're working with is from a project that we actually have live.
And it's called Agent Neo. It was built a while ago now by some of the guys who were helping build the first RAG application on our graph data science platform and our knowledge share there. So this has been in play for a while. But when we look at the data model itself, right, what you're going to see is what I talked about before, right?
That application understanding. So we've got a session. So somebody logs in to the application. They start a conversation. They have their first message, right? This is the user prompt. Then that prompt is going to get an answer, right? There's a message that comes back from the assistant. And then that assistant is going to be going out and pulling in context documents, right?
So it's going to do your cosine similarity. Who does Euclidean? I don't know. It's a thing, I guess. It does your cosine similarity, and it's going to bring those documents back. And we're going to start to see the chain of what that conversation looks like, right? So this is that application base.
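The chain just described (session, conversation, user prompt, assistant response, context documents) can be sketched as a tiny edge list in Python. The relationship type names and the identifiers below are hypothetical, chosen for illustration; the actual labels used in the workshop database may differ.

```python
# Adjacency map: (start node, relationship type) -> list of end nodes.
edges = {}

def relate(start, rel_type, end):
    edges.setdefault((start, rel_type), []).append(end)

# One session with one conversation: a user prompt, then the assistant's answer,
# which pulled in two context documents (chunks). All names are made up.
relate("session-1", "HAS_CONVERSATION", "conversation-1")
relate("conversation-1", "FIRST", "msg-1")     # user prompt
relate("msg-1", "NEXT", "msg-2")               # assistant response
relate("msg-2", "HAS_CONTEXT", "doc-17")
relate("msg-2", "HAS_CONTEXT", "doc-42")

def context_for_conversation(conversation):
    """Walk the message chain and collect every context document used along the way."""
    docs = []
    current = edges.get((conversation, "FIRST"), [])
    while current:
        message = current[0]
        docs.extend(edges.get((message, "HAS_CONTEXT"), []))
        current = edges.get((message, "NEXT"), [])
    return docs

print(context_for_conversation("conversation-1"))
```

Because messages link to the chunks they used, the application's activity and the document store end up in the same graph, which is what makes the usage analysis later in the session possible.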
And then these are the chunks. So it's really just what we looked at in that previous view, but just to show you what it actually looks like. So what's in this? You know, lots of documents from our documentation, developer blogs, support. This was early on, our earliest chunking. We had a 512 chunk size.
We used a LangChain recursive splitter. So it was pretty early on. Let me see. Hold on a second. I want to show you. Where am I? Oh, that's not what I want to show you. Where did it go? Here we go. All right. So once the system actually starts running, what you see is this is actually one user having two different conversations.
This is their first conversation. This is their second conversation. The first question is, can you build graphs for me? And then the next one was, what is GDS? And then you see the beginning of what that response was. Yes, I can build for you. GDS stands for. That's just to give you an idea of how this starts to build out.
Now, this might look a little bit frightening, but I'll walk you through what we're actually looking at. So what does it mean when we start looking in practice? So this is the same user, right, with those initial two conversations. What does graph data science? Yeah, what is GDS? This is also what is GDS?
And then can you build graphs for me? And what we're looking at here is these are the conversations and these are the context documents that are being brought in. And what's interesting or plausible, it makes sense, this person asked the same opening question. So it stands to reason that these same documents are going to be used in both of these first responses, right?
Makes sense. Then what we find is this person's next conversation says, how can I have, how can I learn more about GDS is their follow-up question, right? And that makes sense that it's going to now share things with, can you build, because this person is then having their next prompt, which is, oh no, that one just stopped after one.
But what I wanted to show you is that what we see here is that in this middle conversation, the conversation started out in this area of information, right? It started out in what is GDS, what are some of the basics, but then the conversation went into something similar to someone else.
And the reason that we want you to see this is you want to be able to see how are the conversations connecting to the data itself. Because what we're going to start to see is we're going to start to see that it's not just about a single row, it's not just about a vector, because the way the documents are used and the way people travel through your data gives you information about what they're using.
The reason that becomes helpful is when you're trying to manage at scale, how do I know with all these documents, what are the ones that I should even be looking at? What is it that people are constantly going back to? What are the areas of knowledge where things are happening?
And so what we're seeing here is that these elements right here were in both answers, right? So you see this first question had, this first response had context, and this had context. So it was in both of those. These are individuals, these over here were only in this one, some of those were only in that one.
You can imagine like how quickly this gets very, very hairy. Oh, where did it go? Nope. I guess that's it. I didn't have the extra one. So the point is we can start to understand the movement of what's happening. Is there a question? Someone? No? Okay. Anybody have any questions so far?
Ponderings. Waiting till we get to juicy things. Yes? Yes. Yes. The pink ones are the chunks. The orange is the user prompt. The tan is the rag application's response. So you can already see how it's starting to connect to the system itself, right? So let me get back on point here.
We got things. We got things. Things, things, things, things, things. Okay. I'm sorry. Are you storing the entire chunk into the graph? And how performant is that? And when does performance become a concern? Yeah. So what we're going to see is we are actually, we do actually have the chunk.
Our chunks are pretty small. The size of the text doesn't really change the performance, does it? Like as far as like once it's stored in the property. It's really the traversal that is where things get. For most chunk sizes, it's fine. Because most chunks are like in the K at most.
It's not megabytes of chunk. Right. If there's megabytes, then it's like, okay, that becomes problematic. Yeah. I mean, but it would be that for any database I would imagine. Yeah. But what we see is, is like, you know, we said we had these properties. So for this particular document, I've run some of these things already, but we've got the document, we've got the text, and we have the embedding.
The embedding is actually a property of the node itself. And that's one of the things that makes it really easy is that you don't have to have a separate vector database, right? The vector is actually stored, and then the vector index is built on top. So that once you get all those vectors in, you can run the vector index.
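As a rough illustration of "the vector is a property, and the index is built on top", the sketch below only builds the Cypher statement for creating a vector index. The index name `doc_embeddings`, the `Document` label, and the 1536 dimensions are assumptions for illustration, not values confirmed in the talk.

```python
def vector_index_cypher(index_name, label, prop, dimensions, similarity="cosine"):
    """Build a CREATE VECTOR INDEX statement (Neo4j 5.x syntax).

    Index names, labels, and property keys cannot be passed as query
    parameters, so they are interpolated here; only use trusted values.
    """
    return (
        f"CREATE VECTOR INDEX {index_name} IF NOT EXISTS "
        f"FOR (n:{label}) ON (n.{prop}) "
        "OPTIONS {indexConfig: {"
        f"`vector.dimensions`: {dimensions}, "
        f"`vector.similarity_function`: '{similarity}'"
        "}}"
    )


# Hypothetical names; adjust to whatever your own graph uses.
print(vector_index_cypher("doc_embeddings", "Document", "embedding", 1536))
```

Because the index is defined per label and property, a second index on a different label is one way to get the "multiple indices" setup that comes up later in the Q&A.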
And people do really interesting things with vector indexes, for sure. So this is where, well, that's not what I want. This is where we start. Yes? Question in the back. I think, I think I have a question, so. You definitely have a question. Whether I have an answer is the only thing that's up for debate.
So let's assume, right, I have very large, let's say, legal documents, right, which I'm trying to do something very similar. Yeah. And I do these chunkings, you know. Yeah. Are you saying I would store the chunks in Neo4j? I would not store the vectors in a vector database, but I would also store that in Neo4j.
Yeah. Now, how would that be from a performance standpoint over time? Like, assuming your system continues to grow, wouldn't there be a performance impact? Um, I don't know. Would you know what the largest RAG application is that we have in production in clients right now? That's a good question.
I don't know. I know that's like, you know, tens of millions, you know, seems fine. Once you start to get into hundreds of millions, you might start to have to worry about, like, how to scale past that. Yes. And does it still worry across -- look for similarity across all embeddings?
Yeah. I mean, so it depends. So what it will do is it will look across whatever the vector index is that you have. And there are people who, I said, they do some really interesting things with vector indices. So I've seen people who, like, have multiple vector indices depending on what the use case is.
So when you create the vector index, you actually just do a query and you say, "Okay, for all of these, create an index on all of these nodes." And so that's one way that I definitely know people have been doing it. So they'll have multiple indices that already sort of pre-sort or pre-filter what's going to be part of what's coming back from the vector index.
So I know one of the ways I've seen people do it is from a governance perspective. So there's certain data that they want available to everybody, certain data they don't, and they manage that from the vector index. Yes, please. I'll let you go for a third, really. The other question is, if we had documents that actually contained images, and we wanted to somewhat have a relationship between the text and images, how would you do it here?
Does that make sense? Yeah. I mean, normally, multimodal, I think most of the time they keep the images separate, and then there's an embedding of the image on the node. That's right. That's right. Yeah. Yeah, so you put the images, like, large media files, you would put that in an S3 bucket or something, you still wouldn't, you wouldn't put that into the database.
So you'd have a URL to the S3 bucket, but you can put the embedding into Neo4j, so on the node. Yeah. Yeah. Yeah. We actually have an example of that. Yeah? For, like, for, like, you have, like, multi-layered architecture to kind of query faster. Yeah. Is there something similar to that happening here, where you can have, like, different subsets, you know, different layers to the...
Yeah, I mean, it's not... I mean, it's mostly just the way that the database is actually built. Because of the way the queries run, because you're running on a pattern, you actually... I kind of call it a... It's like GPS search, where it's like, start here, and then go find what you need to find.
Right? So, I don't know if that's the best way to explain that, but when you're running the query, it's going to be very specific. So when you're thinking about, like, how you're running the vector retriever itself, are you talking about, like, when you're actually doing the retrieval from the vectors?
Yeah. So there's a whole... I don't even know how many retrievers there are right now. Like eight different retrievers or something in the GraphRag package. It's a Python-based package. So there are some... If you go to... Actually, a great place to go is GraphRag.com. It's like an open, like, reference point.
And it talks about some of the different types of retrievers. So you can do pre-filtering. You can do post-filtering. You can manage it based on the index, which is another, like, easy way to do it. You know, sometimes folks will have them where they have different hierarchies. So, like, when you think about these large legal documents, right, there are ways that you can actually, like, create the chunks, but use the natural structure of the element as well.
So, like, we have an example we use with SEC data. Every time somebody files something with the SEC, you know, paragraph one is always this, paragraph seven is that, paragraph nine is that. And so you'll take whatever that natural architecture is within the documents themselves, and then you can leverage that as well.
So, one of the other great things about these entities is that you can actually have more than one label. Let me see. Oh, I gotta plug in. So, one of the things that you'll see is, so for this particular... This particular one, we see that it's a message, but it's also noted that it's the assistant.
So, you have some nodes that have more than one label sometimes. So, it makes it so that I can go across all messages if I want to, but if I'm trying to understand just the assistant answers, or just the prompts, I don't need all of those messages.
So, by adding on dual labels, you can actually say, "Oh, I just want to see these, or I just want to see those." So, it allows you to get some of that nuance out of that complexity by leveraging multiple labels on a single node. One of my favorite tricks, personally, but that's just me.
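A small sketch of what the dual-label trick looks like in a query. The label names `Message` and `Assistant` are assumptions based on the description above; the actual labels in the dump file may differ.

```python
def match_by_labels(labels, limit=5):
    """Build a MATCH that requires every label in `labels` on the same node.

    Labels cannot be query parameters, so only pass trusted values.
    """
    label_expr = ":".join(labels)
    return f"MATCH (m:{label_expr}) RETURN m LIMIT {limit}"


# All messages, regardless of who wrote them.
print(match_by_labels(["Message"]))
# Only the messages that also carry the (assumed) Assistant label.
print(match_by_labels(["Message", "Assistant"]))
```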
Yeah, I have a question. So, now, like, if you have one vector index, but don't want to actually search the entire index, right? So, since that, like, let's take this text, for example, or like, content, for example. You want to only search the content of few nodes while vector embedding.
Is that possible? I mean, there's a number of ways that you can do that. You know, you certainly could. I mean, I, personally, I'm a big fan of multiple indices, just because I find them to be very efficient. That's one way to do it. There was a question earlier about temporal issues.
You can also use, like, multiple vector indices to address that temporal portion as well, right? So, you can create an index from, why is this? All right, it's starting up. You can create a vector index. So, like, like, if we wanted to, like, if we wanted to have a vector index just on, you know, version 1.9 or version 2.6 or whatever, you could have that particular vector.
So, let's just say I'm building, building a chat bot, and I know that the customer is currently running whatever version of something, you can use that as the actual index that you're searching for. So, that's another way that you can, that you can address those kinds of things. Did that answer your question?
So, it's just like multiple embeddings, I mean, multiple indices. So, not one index. I mean, I don't know. I don't know if engineering would agree with me, but from, like, a purely, like, practical, like, I got to hack this and get it done, that's my go-to. I mean, is there?
Yeah, I think that's the right approach right now with Neo4j. I guess you're asking about metadata filtering within the vector index, right? Yeah, and so, for us, the vector index is just an index. It's not a vector search. Right. And so, we don't do metadata filtering on the vector itself.
We use the property graph itself for doing the filtering. Right. So, you do it before you create the index, because the index is only going to go to what's connected to it, right? So, it's just... You'll still have the index, but then you can do predicates on top of the nodes you found, and then you filter after that.
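The "query the index, then put predicates on the nodes you found" pattern might look roughly like the sketch below. The procedure `db.index.vector.queryNodes` is Neo4j's built-in vector query call; the `version` and `text` properties and the index name are hypothetical, chosen to match the product-version example from earlier.

```python
# Post-filtering: the index returns the k nearest nodes, then WHERE trims them.
# Note this can return fewer than k rows if many neighbors fail the predicate.
FILTERED_VECTOR_SEARCH = """
CALL db.index.vector.queryNodes($index_name, $k, $query_embedding)
YIELD node, score
WHERE node.version = $version
RETURN node.text AS text, score
ORDER BY score DESC
"""


def search_params(index_name, query_embedding, version, k=10):
    """Bundle the query parameters (all names here are illustrative)."""
    return {
        "index_name": index_name,
        "k": k,
        "query_embedding": list(query_embedding),
        "version": version,
    }


params = search_params("doc_embeddings", [0.1, 0.2, 0.3], "2.6")
print(FILTERED_VECTOR_SEARCH.strip())
print(params)
```

If the filter is very selective, asking the index for a larger k than you ultimately need is a common workaround.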
Yeah. So, even within, once you get that, you can certainly filter in and around that as well. And if you look at any of our, the vector retriever materials we have, like, we walk you through that. We've got a new ebook coming out. It's gone to print. It's happening.
That specifically talks about those retrievers. All right. Multiple labels, multiple indices, all the good things. So, let's do this. So, for those of you who have... Let's go back to here. The other thing you're going to see on the readme is you're going to see this Neo4j dump file.
This is the actual dump file from our database, from last year, I think. Then, so, we're going to give that to you. So, you can download that file, and I'll show you how you can drop that into your... Let me see. Sorry. All right. So, when we go back to our instances, so, you should see instance 01, most likely.
If you go on the right-hand side of there, there are three little dots. What you're going to see is backup and restore. So, you're going to click on backup and restore, and you're going to take that dump file and either browse or just drag and drop it. And it's going to drop in our version of the Neo4j agent...
Agent Neo GDS application data, I guess. Thanks. So, yeah. So, you just drag and drop it. It runs pretty quick. It'll load pretty quick. And then what you'll see is when you go to query, you'll see this will start to populate. The database information will populate. And it'll show you which nodes you have...
Which nodes and entities you have. It'll bring in some of the indices. There is... We did run some of the algorithms already. So, I'm going to show you how to run them so you can run them on your own. But in the interest of time, we have run those already.
So, you'll see some of that. Some of them we'll do live here, so they won't be there. Is anybody able to load? Yes. Oh, yeah. So, on the readme in the repository, if you go all the way down to the bottom where it says connect to Neo4j AuraDB, there's a Neo4j dump file that it will take you to.
Sorry. Yeah. And then you'll just download it from there. What's that? Oh, sorry. So, when you're in... When you go to console, right, you should be seeing something like this. Right? And then on the right-hand side, for your instance, you'll see three little dots. We're going to click on those dots, and you're going to go to backup and restore.
And my data is yours. I give it to you. I'm just dumping it right on you. All right. Yeah. It's just easier. Nobody wants to load CSVs today. We have no time for these things. Yeah. And then from there, you can go straight into the query. So, when you go to query, you should see that it's populated.
Is there anybody having trouble loading the data that wants to load the data? It's still uploading. Okay. What's that? If you... Did you... You might have had to re-pull because I... If you re-pull the repo... Yeah. Yeah. I had to update it because, you know, I'm well-intentioned, but sorry.
Yes. When I've restored the dump file, should I expect that it will work in, like, a few minutes? Yeah. Let me see what you got. It keeps doing the loading. It's still loading. So, that's normal. There's, like, a latent... It shouldn't be too slow. Oh, what is... Oh. You're probably having network issues.
The Wi-Fi is pretty slow. Oh, sorry. Yeah. That's always the challenge with these things. Yeah. Wi-Fi. See, this is why we always have it loaded. I also have it loaded on my desktop because that happened to me recently. I was somewhere and I'm like, there's no Wi-Fi at all, and I have to present.
So, I also have Neo4j Desktop. So, if anybody ever, you know, you want to go local, you want to stay on-prem, you don't want anyone up in your business, you can also, like, stay off the cloud, go local. Also an option. Let's see. Where are we? So, what are we going to do here?
All right. So, I'm going to go into... So, within our notebooks, we have three different notebooks. One is the get to know your graph, and we'll start there. I don't need this. And really, it's, like, all this is going to do is you'll run it, you'll do your pip install of all your requirements, and then what you'll...
Oh, wait. I need to give you... Is the OpenAI key on here, too? Hold on. Oh, I put it in the Slack channel. Oh, it's in the Slack channel. Okay. So, anybody can find it anywhere in the conference. What's that? Enjoy. Security. So, when you set up your database, they should have given you the opportunity to download your credentials?
Yeah, I've set that up, but then... Got it. It's getting... Probably not found or something. Um... What is this not? Not supported... Oh, did you drop the... So, you have to drop... I... Sorry, I skipped a spot. You do need to drop those credentials that you downloaded into your ENV file.
So, you did that. This is your ENV. Okay. And then... Just pasted it. Yeah. That should be... Let me see. Yeah, there. It's right here. Yeah, you got that. Um... What is it actually? What's the actual error? Is it just Wi-Fi? Yeah. What is it in here? Schema Bolt.
Oh, yeah. Yeah, the URI should be in your credentials. Exactly. I was just thinking I was just... Yeah. It should be... It should be... I would take a peek at it. ABK, can you take a look and see what's happening here? All right. So... Um... Yeah. So, again, the API key is there.
We're not using it, actually, that much today. But one of the things that we have, obviously, is we need to have... Oh, God, you can't see anything, this thing. You can't see anything. Oh, goodness. Well, that's overwhelming. That's more reasonable. Okay. Is that good? You guys can see? We're good?
Okay. Um... Yeah. So, the basics. Obviously, we're importing OS. We're bringing in our environment variables. Um... You see the Python driver. Right? This is our classic Python driver. And then what I like to do is I like to just do a run query, um, uh, function just because it just makes it easier.
Like, I just like it. Um, you don't have to do it this way. You could just do driver.session, but, um, that's just how I like to do it. And then when you run this, it'll see that, yes, you are actually connected. Oh, let's see. Maybe I should actually run everything.
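For readers following along without the notebook, a run-query helper in the spirit of what is described here might look like the sketch below. It is not the notebook's actual code. The environment variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` are assumptions about what the downloaded credentials file contains, and `connect` needs the `neo4j` package installed.

```python
import os


def load_settings(env=os.environ):
    """Read connection settings from the environment (names are assumptions)."""
    return {
        "uri": env.get("NEO4J_URI", ""),
        "user": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", ""),
    }


def connect(settings):
    """Create a driver with the official Python driver (pip install neo4j)."""
    from neo4j import GraphDatabase  # imported lazily so the helpers load without it

    return GraphDatabase.driver(
        settings["uri"], auth=(settings["user"], settings["password"])
    )


def run_query(driver, query, params=None):
    """Run one Cypher query and return the rows as plain dicts."""
    with driver.session() as session:
        result = session.run(query, params or {})
        return [record.data() for record in result]


# Typical use, once the variables are set:
#   driver = connect(load_settings())
#   print(run_query(driver, "RETURN 1 AS ok"))
#   driver.close()
```

If the .env file is not being picked up, pasting the values directly into a dict with the same three keys and passing it to `connect` sidesteps the environment entirely.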
Now we're connected. Okay. Um, the first thing that we're gonna do is we're just gonna take a look at what is actually in the database. So, in this, um, APOC actually stands for Awesome Procedures on Cypher, which I just think is hilarious. I think there's a double meaning on APOC too, but we'll have to ask, um, isn't there another APOC meaning?
Something? I don't know. Yeah. Some, some matrixy thing. Um, but what this is gonna show you is we start out with 17,000 nodes, right? We have 774,000 relationships. Um, seven different labels and 27 different relationship types. What does this tell us? What it tells us is obviously we got a lot of data.
It's actually not that much really, um, you know, for something small. Um, but you know, like almost a million relationships. And so really what you, we're running this just to make sure that everything is loaded. So everybody who has loaded their data, are you seeing these numbers? Anyone? Anyone?
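A sketch of the kind of sanity check described here, using APOC's `apoc.meta.stats` procedure. The formatting helper is hypothetical, and the sample row simply reuses the rounded numbers quoted above rather than real query output.

```python
# apoc.meta.stats() returns graph-wide counts; these four columns are a subset.
STATS_QUERY = """
CALL apoc.meta.stats()
YIELD nodeCount, relCount, labelCount, relTypeCount
RETURN nodeCount, relCount, labelCount, relTypeCount
"""


def summarize(row):
    """Turn one result row (a dict) into a one-line summary."""
    return (
        f"{row['nodeCount']:,} nodes, {row['relCount']:,} relationships, "
        f"{row['labelCount']} labels, {row['relTypeCount']} relationship types"
    )


# Rounded figures from the talk, used only to show the output shape.
sample_row = {"nodeCount": 17000, "relCount": 774000, "labelCount": 7, "relTypeCount": 27}
print(summarize(sample_row))
```

If your numbers come back as zeros, the restore most likely has not finished yet.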
Maybe? Maybe not? No? Okay. This is, I will take you back one moment, please. Hold the phone. There are three notebooks inside the notebooks folder inside the repository. So, you may have to re-pull it. I pulled it up again. You did? Yeah. Huh. And then this is the only one.
Huh. And then this is the website. Yeah. Yeah. This is not, can you reload this? Because it was updated. Yeah, um, I did a re-pull. That's why I-- That's so weird. Are you having trouble getting the connection string for JHAB there? It could be that it's not loading the .env file?
Yeah. It's easier to just copy the values into the cell where it's loading from the environment. Yeah, you can do that too. Yeah. Yeah. That's my advice. Yep. That's actually really good advice. Um... I'm uploading it to the Slack channel right now. Thank you. Thank you. What's your name?
Caleb. Caleb to the rescue. Thank you, Caleb. Everybody, round of applause. Round of applause. You're that helpful guy at work, aren't you? You're like, oh, I can just do this thing. It'll make it easy for everybody. Thank you, Caleb. Naya, also. Naya, round of applause. We love Naya. You'll probably see if you're from San Francisco, Naya is all places, as is Alexi.
They're around all the time. So hopefully they're familiar faces for you all. Um, so yeah. So the first one is just really meant to get you to connect to the graph and make sure that your connection is working. Um, and just showing you how you can get some basic summary statistics.
Um, you know, one of the things that we can do is we can preview some of these documents. Um, so again, this is our data model just to get you familiar with what we're looking at. And then let's move on to our notebook number two, which is where things get more exciting.
Um, in notebook number two, what we're looking to do is we're looking to actually start leveraging these algorithms for, uh, where am I? Okay. So when we talk about graphs, like ultimately what we're trying to do is we're trying to help you do these things at scale.
We're trying to help you be successful in production and we're trying to help you get the best possible outcome from the applications that you're building. And the way that I, that I suggest is my little moniker, which is connect cluster curate to your point of how do we actually manage this at scale?
What does that look like? What do we do? Um, what we have in these notebooks walks you through this very, this process. The way that it works is we start out by running KNN on similarity. So we take the embedding for each of those documents and we create, uh, we leverage a similarity score.
So, um, basically we take a K of 25, we run KNN and we understand, we start making connections among very similar documents, right? You'll see why we do that in a minute. But what this does is it sets us up for the next step, which is community detection, also known as clustering.
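To make the "K of 25" step concrete, here is a brute-force top-K sketch in plain Python. It is an illustration of the idea only; the KNN algorithm in Graph Data Science is, as far as I know, an approximate method built for much larger inputs. The document ids and tiny vectors are made up.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_neighbors(embeddings, k=25):
    """For each id, return its k most similar other ids as (id, score) pairs."""
    neighbors = {}
    for node_id, vector in embeddings.items():
        scored = [
            (other_id, cosine_similarity(vector, other))
            for other_id, other in embeddings.items()
            if other_id != node_id
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        neighbors[node_id] = scored[:k]
    return neighbors


# Toy data: doc_a and doc_b point almost the same way, doc_c does not.
docs = {"doc_a": [1.0, 0.0], "doc_b": [0.9, 0.1], "doc_c": [0.0, 1.0]}
for doc_id, nearest in top_k_neighbors(docs, k=1).items():
    print(doc_id, "->", nearest[0][0])
```

Each (id, neighbor, score) triple is what ends up as a weighted similarity relationship between two documents in the graph.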
Um, and what it allows you to do is it allows you to group these like items so that you can manage them together, right? It's very hard to look at any one and know what's happening, but we can certainly look at all of the things. And then finally what it leads us to is we curate the grounding data set via these techniques that work at scale.
So this is a similarity graph of context documents, um, you know, connected to their source URLs, right? So it makes sense that there's a lot of overlap and similarity because they're coming from the same source. And you may be saying, you may be saying, well, if they're all the same, like, why do I need to know that they're similar?
How does that help me? Well, it helps you in a few different ways. Uh, what the, oh, this thing's going crazy. Um, what I first want to talk about is I want to talk, talk a little bit about the math and the understanding of how community detection works, right?
Um, I promise you there's no calculus and there's no quiz. Um, but when, so like, if we're looking at this doc, you know, we're looking at a few different friends, a couple of friend groups, like it's pretty clear. Like there's one clique, there's another clique. It's pretty obvious, right?
Like that seems reasonable. And then when we start getting bigger, like these are, these are similar. And what we're seeing here, this is actually, these are actually sized by PageRank, which is an important score. Um, and they're still like, it still seems pretty obvious. Like, yeah, they're clustering together.
Like it's still pretty logical. Like I get it. Um, but anybody who has worked with graph knows that it never looks like this. Has anybody ever actually seen anything that looks as tidy? No, never. Cause it looks like this, right? This is what we're actually looking at, right? In this case, what we've got is we've got all these different nodes.
They're sized by PageRank. And then the edges are weighted edges, right? It's very much the hairball, people call it. Um, the bowl of spaghetti, whatever it is. And so like, it's very complex. So how do we actually find things that work? How do we actually run community?
Oh my God, it's so sensitive. My goodness. Um, this is the boring math part, but just so that you understand what is happening under the hood. So when you're doing clustering in graph, it's based on something called modularity optimization. A community has high modularity, it's modular, makes sense, right,
when the nodes within it are highly interconnected with each other and have very few connections to things outside, right? That's like the basics of the math. So what it's going to do is the algorithm is going to go through and it's going to say, okay, for every node, is the modularity score going to go up or is it going to go down if I give it the same label as its neighbor, right?
So does it belong with this group or does it belong with that group? What does it do to the actual numbers, right? Louvain is a particular type of algorithm and this is where it says, yeah, that's exactly what we just talked about. So how does it actually work? So what it will do is it will start by giving a single node a label and then it does the calculation and says, okay, like let me figure out, does it go up or does it go down?
And then it will do the first pass and it'll say, okay, these are the ones that are most connected to each other. And then it'll say, okay, well, let me group and aggregate those and let me look at how many, what you see in step two is you see the number of connections between each of those clusters, right?
So when we look at it, so, you know, it's pretty straightforward. The pinks and the yellows, there's three edges between them, but there's 14 within just pink and there's two within yellow and four within green and 14 within blue, right? So what it shows you is it's showing you that like there's a lot connected within and not so many on the outside, right?
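The "lots of edges inside, few edges outside" idea can be turned into a number. The sketch below computes the standard modularity score for an unweighted, undirected graph on a made-up toy graph (two triangles joined by one edge); it is not the slide's pink, yellow, green, and blue example, since the slide's full edge list is not given here.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q = sum over communities of (L_c / m) - (d_c / 2m)^2,
    where m is the edge count, L_c the edges inside community c,
    and d_c the total degree of the nodes in c."""
    m = len(edges)
    internal = defaultdict(int)
    degree = defaultdict(int)
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles (0-1-2 and 3-4-5) connected by a single bridge edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {n: "A" for n in range(6)}

print(modularity(edges, two_groups))  # higher: matches the visible structure
print(modularity(edges, one_group))   # lumping everything together scores 0
```

Louvain is essentially a search for the labeling that pushes this score as high as it can, moving one node at a time and then collapsing groups and repeating.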
Pretty straightforward. That's pretty easy math, right? And then it just keeps going through, pass after pass, and then eventually it settles. One of the things to know, um, there's another algorithm called Leiden. It is similar, but, um, depending on whether your graph is completely connected or not, there are different ones you'd pick.
But ultimately what it's doing is saying, you know, which of these things is not like the other, right? So label propagation is, uh, is another way of doing clustering. And what label propagation does is it says, okay, I'm going to assign a random label to something, or I know these labels, and then I'm going to go to the neighbors and I'm going to say, I'm going to assign what this one is to its neighbors, and then eventually we have these iterations where they converge. That's the label propagation algorithm.
Now the nice thing about the label propagation algorithm is it's actually one of the fastest clustering algorithms. So especially if you've got lots and lots of values and you want to run something very big, it's a very fast way to do it. Um, if you need something that's like highly nuanced, it's worth spending the time and the computation to use one of the other algorithms.
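Here is a tiny, deterministic sketch of label propagation on the same kind of toy graph (two triangles joined by one edge). The fixed visiting order and the "largest label wins a tie" rule are simplifications chosen so the output is reproducible; real implementations typically randomize both, and this is not the Graph Data Science implementation.

```python
from collections import Counter


def label_propagation(adjacency, max_iterations=20):
    """Each node repeatedly adopts the most common label among its neighbors."""
    labels = {node: node for node in adjacency}  # start: every node is its own label
    for _ in range(max_iterations):
        changed = False
        for node in sorted(adjacency):  # fixed order keeps the demo deterministic
            counts = Counter(labels[n] for n in adjacency[node])
            if not counts:
                continue  # isolated node keeps its label
            best = max(counts.values())
            new_label = max(lbl for lbl, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:  # converged: a full pass changed nothing
            break
    return labels


adjacency = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
print(label_propagation(adjacency))
```

There is no global score being optimized here, only local majority votes, which is a big part of why it is so fast.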
But what I wanted to do is I just wanted to, to show you what this looks like because label propagation, it's, and modularity does the same thing. It's really looking for density. It's looking for density in those graphs. So when we think about, you know, the craziness that we see here, like even in this, like you can see like little pockets of density, right?
We see these pockets of density. I always like to take the soft eye when we look at graphs. Um, and so that's what, that's what's happening with the math under the hood, right? So what does this actually mean in our, in our code? So what we've got is, let me close this out.
So what we're going to do is the first thing you do is you're going to run the KNN. So you run the KNN on similarity. Obviously we connect as always. Um, there is a weird little thing you have to do. You have to create an empty relationship, but don't worry about that.
Like I gave it to you. Um, but that's a little, little peat point. Um, that's probably cause we don't need it. Um, but then we run the query and I just want to walk through how the query is actually built and what it looks like. So the first thing that we have is, do I have my, I want to, I want to see this.
Sorry. Somebody's got a question over here. Yes. Then, then, then over there, Caleb. There we go. Here we go. Did you have a question? - Yes. - Yeah. - Mmm. Graph.project. So just to go over like one, what, what the process is when you're, when you're actually, um, running the algorithms.
The first thing you need to do is you need to project a graph because most of the time when you're running the algorithm, you're not running it on the entire graph, right? We're running it on something small. So in our case, we want to run k-nearest neighbors just on the documents, right?
So what we need to do is we create this, uh, this projection. So here we've got, we're calling the projection. We're calling it docs. So the first thing you'll see is the name of the projection, which is docs. What, then what we look at is we look at which nodes.
The next thing you're going to see is nodes. So for us, we're using documents, right? So we want documents. I want to include the property embedding. So you really want to make sure you're only bringing in what you need, right? Because it's just going to make things run faster.
Um, and then because it's just, the documents aren't related to each other yet, right? We know that from the data model those documents are related to the responses, but they're not related to each other. We're building this similarity among them. And so it happens to be an empty relationship.
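A hedged sketch of the project-then-KNN sequence, written as Cypher strings you could pass to a run-query helper. The procedure names `gds.graph.project` and `gds.knn.write` are real Graph Data Science procedures, but the `Document` label, the `SIMILAR` relationship type, and the `score` property are assumptions, and using `'*'` for the relationship projection is my stand-in for the notebook's "empty relationship" step, which may be written differently.

```python
# Step 1: project only Document nodes and their embedding into memory.
# '*' picks up any relationships among the projected nodes; before KNN has
# run, documents are not linked to each other, so that set is empty.
PROJECT_DOCS = """
CALL gds.graph.project(
  $graph_name,
  {Document: {properties: ['embedding']}},
  '*'
)
YIELD graphName, nodeCount, relationshipCount
RETURN graphName, nodeCount, relationshipCount
"""

# Step 2: run KNN on the projection and write similarity edges back.
KNN_WRITE = """
CALL gds.knn.write($graph_name, {
  nodeProperties: ['embedding'],
  topK: 25,
  writeRelationshipType: 'SIMILAR',
  writeProperty: 'score'
})
YIELD nodesCompared, relationshipsWritten
RETURN nodesCompared, relationshipsWritten
"""


def knn_pipeline(graph_name="docs"):
    """Return the (query, params) pairs in the order they should run."""
    params = {"graph_name": graph_name}
    return [(PROJECT_DOCS, params), (KNN_WRITE, params)]


for query, params in knn_pipeline():
    print(params, query.strip().splitlines()[0])
```

Keeping the projection down to one label and one property is the "only bring in what you need" advice in practice: less to load into memory and less for the algorithm to scan.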
And then what it will do is it will give us the graph name, the node count, and it'll let us know that it ran. And what that ends up looking like is, let me go to my Aura. Where's my Aura? I'm so not aura, as the middle schoolers would say right now.
Okay. Um, so let's go into Explore. Anybody here have middle schoolers or, like, know the whole brain rot thing, like rizz, aura, skibidi? It's insanity. All right. Um, so what we're going to be looking at is when we look at documents, I've run it already. Um, so we're going to see that there's similarity among them, but right now we're going to see these documents and what I want to look at is, oh yeah.
So I'm going to look at documents that are connected to, come on. Similar to documents. Oh, actually let me, yeah, that'll work. It takes a second to run. Um, one of the, one of the things that, um, I did is I actually changed what the way that we look at it in the graph, just so you know, I changed it so that it's actually the colors of the nodes are by community.
Because what I wanted to see is I wanted to see like, do we start to see, let me run this one more time. Let me clear the scene. Do this. Oh, dear sweet Wi-Fi. Maybe I should open up my Neo4j, like desktop. Um, yeah. So what we're going to see is we're going to see the connections in and among the, hold on a second.
I don't like this. I don't like this at all. What it's eventually going to show you is it's going to show you the similarities in and among, and it's going to be like the greatest ever hairball. So in this particular case, like this is what it might look like.
Um, and what we see is like we saw with the big one, like you're going to see like clusters of documents that are very similar. The other thing that we're going to see is we're going to see what I call the bridge documents. The ones that will often be the, the middleman.
We call it, um, betweenness centrality. And so what you want to understand, like I said, it's the soft eye. You want to understand what is the clustering of your documents. And again, you might be asking yourself like, why is that helpful? And I promise you the reason that it's helpful is because what we end up being able to do is understand what's happening within the data.
So this is a 2d version of, of these embeddings and these clusters. And what we see is we see, you know, it looks like a bowl of fruit loops. Like how, again, Alison, how is this helpful? I'm not a data scientist. Like what is happening here? Um, but when we zoom in, we see a couple of things.
In one case, we have a single community document cluster. So what we're seeing is there's a lot of similarity in the embedding, right? They're all in that same community. They're all very similar. And then in this other side, we've got some, they're kind of spread out, like nothing really clear on a perimeter, right?
Like you're like, is that good clustering or not? Um, and so my question to you is, which is better? Which of these do you think might say, oh yeah, like we've got it. We've got what we need. We're good. Pros and cons. Somebody say something. Single community. What's the, what's the thought on a single community?
Figure out there is some relationship between those nodes. Yeah, they're very similar. Mm-hmm. That's awesome. And in traditional clustering, that would be great, right? If we're talking about clustering customers, we're doing customer 360, we're like, oh, this is a group, this is a thing. But one of the things that we have, though, is that means we have a lot of very similar documents.
So the question is, do we really want it to look like that? Customers, yes. Documents, no. No. Right? Because one of the things that we need to understand is we need to really be efficient. So when we talk about what is a high quality grounding data set in your retrieval augmented generation, these are our five basic principles that we, that we go by.
Is it relevant? Is it actually giving an augmenting answer? Is it reliable? Right? Like, do you have, do you have like a lot of variety that's coming back on something? Like, is it really disparate from these communities? Because there's a hole in your data. Because we don't know. Right?
There's a gap in all this high dimensional space. There's this like big blind spot that you don't know about. And again, is it efficient? How many people here are really worried about their applications being efficient? Everybody. I know, right? It makes you crazy. So, again, how do we do this at scale?
What does that look like? So once we've done this clustering and we've assigned these communities to these nodes, then we can start looking at the data itself. Right? In this particular instance, one of the first things we did, and you'll, you'll see it, I left all the, the dump that you have has all the errors in it.
So you can play with them. We can see that the median average word length is 512 characters. Well, our chunks are a length of 512. So that's probably not good data for us. Right? That's not helpful. And it's clearly very different from everybody else. So what does that look like?
Median word count 1. So all of these are summary statistics at the, at that cluster level, at that community level. And so when we looked into it, we found this is very, you know, like, again, we brought our data in from, you know, web pages and all these different places.
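A minimal sketch of the kind of community-level summary statistics described here, in plain Python. The record keys `community` and `text`, the function names, and the cutoff of 40 characters are assumptions made for illustration; they are not taken from the workshop repo.

```python
from collections import defaultdict
from statistics import mean, median


def community_text_stats(chunks):
    """Summarize chunk text per community.

    `chunks` is a list of dicts with the hypothetical keys
    "community" and "text".
    """
    texts_by_community = defaultdict(list)
    for chunk in chunks:
        texts_by_community[chunk["community"]].append(chunk["text"])

    stats = {}
    for community, texts in texts_by_community.items():
        word_counts = []
        avg_word_lengths = []
        for text in texts:
            words = text.split()
            word_counts.append(len(words))
            # Average characters per word; 0.0 for an empty chunk.
            avg_word_lengths.append(
                mean([len(word) for word in words]) if words else 0.0
            )
        stats[community] = {
            "chunks": len(texts),
            "median_word_count": median(word_counts),
            "median_avg_word_length": median(avg_word_lengths),
        }
    return stats


def suspicious_communities(stats, max_avg_word_length=40):
    """Return communities whose 'words' are implausibly long (assumed cutoff)."""
    return sorted(
        community
        for community, summary in stats.items()
        if summary["median_avg_word_length"] > max_avg_word_length
    )
```

A community whose chunks are each one unbroken 512-character token shows up with a median word count of 1 and a median average word length of 512, which is the pattern called out above.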
So clearly this is not going to be helpful to an answer. Right? At least not to a human. Right? So this is a way that we can start to clean something out. The basics. Right? Like, just the basics. The other thing that's going to come up in that high density, like that high similarity we saw is this.
Highly similar text chunks. In this particular one here, community 4702, our average similarity is 0.98. We have 49 documents. 49 chunks that are almost completely identical. If you're doing a retriever that gives, you know, the top 10, that's not going to give you any context at all. It's just going to be really confident about this one thing.
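One way to surface communities like this is to average the pairwise cosine similarity of the embeddings inside each community. The sketch below is plain Python under assumed inputs: a dict mapping a community id to a list of embedding vectors, and a threshold of 0.95 chosen only for illustration.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def average_pairwise_similarity(embeddings):
    """Mean cosine similarity over every pair of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def near_duplicate_communities(embeddings_by_community, threshold=0.95):
    """Return {community: average similarity} for communities at or above threshold."""
    flagged = {}
    for community, embeddings in embeddings_by_community.items():
        score = average_pairwise_similarity(embeddings)
        if len(embeddings) > 1 and score >= threshold:
            flagged[community] = score
    return flagged
```

Comparing every pair is quadratic in community size, so on a large graph this is something to run per community rather than across the whole document set.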
Right? How do you know that it's even the right thing? Right? So one of the things that you want to start looking at is ways that you can sometimes increase the diversity. Right? Like, I have a talk that I've submitted many, many times that has never been accepted. And it's called "The Dark Side of Distance Metrics: How Cosine Similarity Broke America." Right?
I don't know why it doesn't get accepted. Why is that? Right? Because cosine similarity gives you exactly what you want. Gives you exactly what you want. Right? Like, I call it the chicken nuggets and Twinkies. If I had a robot in my house and I went to San Francisco and it was responsible for cooking for my kids, when I came back, all I'd be making is chicken nuggets and Twinkies.
Right? That's the same thing that's happening in our algorithms. And so where this becomes really, really important, where everybody here is working with agents, you are refining signal at every turn. So if you get a signal, it's going to pick up on it and it's going to run with it.
And it's going to go down to the next agent and the next agent and the next agent. And whatever that, whatever that path that it got on, it's going to go. Right? Because that's what we programmed it to do. It's doing its job. It's doing exactly what we told it to do.
But is that what we want it to do? So one of the ways that, like, this community approach to documentation is really interesting is you can look at what are the variety of documents. You can actually use re-ranking for diversity in your responses. So let's say we do a vector retriever, right?
We take the vector retriever and it comes back with this scoring. These things are all really similar. Okay? Well, most of them are from this community. But then the ones a little further down are from another community. So you can actually do different ways of re-ranking. So you can weight for diversity.
You can weight for PageRank, right? Like one of the examples we have in here is using PageRank. Of these vectors, which ones are the most important? Which are the ones that people go to the most? We want to be feeding that back, right? It's like, it's why Google made billions of dollars, right?
PageRank. Changed the way everybody searched. So what I want you to be thinking about is you need to be thinking about what I call intelligent outcome management. It's not just about like, you know, AI, artificial intelligence. But what are the outcomes that we're getting? And how do we give really good ones?
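The re-ranking idea can be sketched as a greedy pass over the retriever's candidates: each pick lowers the adjusted score of every remaining candidate from the same community, and a PageRank value can be mixed in as an extra weight. This is an illustrative sketch, not the example from the repo; the candidate keys (`id`, `score`, `community`, `pagerank`) and both weights are assumptions.

```python
def rerank(candidates, k, diversity_penalty=0.1, pagerank_weight=0.0):
    """Greedy re-rank of vector-search hits.

    candidates: list of dicts with hypothetical keys
        "id", "score" (vector similarity), "community",
        and optionally "pagerank".
    """
    remaining = list(candidates)
    picked = []
    picks_per_community = {}

    def adjusted(candidate):
        # Similarity, plus an optional PageRank boost, minus a penalty
        # for each already-picked chunk from the same community.
        repeats = picks_per_community.get(candidate["community"], 0)
        return (
            candidate["score"]
            + pagerank_weight * candidate.get("pagerank", 0.0)
            - diversity_penalty * repeats
        )

    while remaining and len(picked) < k:
        best = max(remaining, key=adjusted)
        remaining.remove(best)
        picked.append(best)
        community = best["community"]
        picks_per_community[community] = picks_per_community.get(community, 0) + 1
    return picked
```

With the penalty set to 0 this returns the plain top-k by similarity, so the same function can be used to compare the two behaviors side by side.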
And I'm here to show you how you can use these algorithms to help you like diversify that a little bit. So when we looked at these highly similar text chunks, they're highly similar. It's because it comes from our documentation, right? We've got multiple versions, right? Like we've got a version for 1.5 and a version for 1.6.
Well, the algorithm doesn't change, right? The algorithm doesn't change. So that's why we have all these repeats. So the other great thing about graphs, because, you know, every chance I get to love on a graph, I will, is you actually can do this very easily with, one moment, apoc.nodes.collapse.
APOC is Awesome Procedures on Cypher. So what it's going to do is it's going to do two things. One, it's going to make your retrieval much more efficient because you now have one version of it instead of many. But what it's also going to do is it's going to maintain the lineage.
It's going to maintain the connection. So now we've got a single document, but we can see all of those sources. We can see where it came from. So you don't lose the institutional understanding, right? Nobody likes-- and who likes dropping rows? Everybody gets worried when you drop a row, right?
Like, maybe it's just me. I always get worried. I'm like, oh, I'm dropping rows. Is this what I want? So in this case, you don't have to. And that's what's so great about and flexible about the graph is that you can do these large-scale management moments, and you're-- you get more efficient.
You get really clear. You don't-- or you're not over-bundling something or overpowering something. Question. What does it mean to collapse here? Oh, so basically what we're doing here is we're saying these four are all the-- basically all the same. The text and the embedding are the same. So we're going to collapse them into a single node, and we're going to maintain the original relationships.
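To make "collapse" concrete, here is a small in-memory illustration in plain Python. It is not the APOC procedure and does not talk to Neo4j; the node ids, the `FROM_SOURCE` relationship type, and the `collapsed_from` property are hypothetical names chosen for the sketch.

```python
def collapse_nodes(nodes, relationships, duplicate_ids):
    """Collapse `duplicate_ids` into one node and re-point their relationships.

    nodes: dict of node_id -> properties
    relationships: list of (source_id, rel_type, target_id) tuples
    duplicate_ids: node ids judged identical; the first one survives
    """
    keep, *dropped = duplicate_ids
    dropped = set(dropped)

    merged_nodes = {
        node_id: dict(props)
        for node_id, props in nodes.items()
        if node_id not in dropped
    }
    # Record which original nodes the survivor now stands for.
    merged_nodes[keep]["collapsed_from"] = list(duplicate_ids)

    merged_relationships = []
    for source, rel_type, target in relationships:
        source = keep if source in dropped else source
        target = keep if target in dropped else target
        relationship = (source, rel_type, target)
        # Skip self-loops and exact repeats created by the merge.
        if source != target and relationship not in merged_relationships:
            merged_relationships.append(relationship)
    return merged_nodes, merged_relationships
```

After the collapse there is one chunk to embed and retrieve, and it still points at every source it originally came from, which is the lineage being described.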
So in this case, when-- then when you hit that embedding, you're only hitting it once instead of 19 or 49 times in our case. Yes? So on the document that are pretty much very similar, if they have various properties, they're slightly different. When you do the collapse, does it join all the properties together?
It depends on how you want to do it. Like, you certainly could. I mean, it may be that you don't-- you may only want to, like, adjust what's in the index, right? Like, it could be a different way that you want to manage it, like, depending on what you need to maintain, right?
If it's legal, then you probably don't want to lose anything. Um, you know, there's-- there's options. Yeah? There's a microphone for you. Because it's a good question. Everybody needs to hear it. So, um, the documents will be exactly the same for each one of the URL pages. The chunking was similar and the similar kind of things got picked up.
So does chunking impact when-- in these kinds of mechanisms, how the similarity works? Yeah, I mean, because it could be, right? Because if it's the same page, but one has, like, an extra paragraph for that version, then the chunking might split differently, right? So you're not gonna-- it's not gonna-- it's not like a magic wand, right?
It's no magic eraser. But it's definitely an easy way to increase your efficiency and increase the quality of the output in the data itself. Yes? So, you did the clustering on documents, or did you do the clustering on all different kinds of nodes that you have? I did the clustering just on the document nodes, which in this case are actually chunks, right?
So it's just on those. So it takes the embedding, right, for that chunk, and then we run that KNN so that we understand, like, do we have these really high-density places? The other thing it allows you to do is understand, like, do I have the right information, right? So, um, where are we?
Right, so again, like, what are the most frequently used communities? We've seen this already, right? This is a conversation, you know, these are the original documents. This is the next set, some of it, right? Like, I've been doing more research into understanding, like, how people travel around the communities in different conversations.
Do they all end in one place, or do they start in one place? Or how long does it take someone to get to the right area? Right? I mean, it's, you know, we're basically tracking human cognition by watching the way people move around their thought process, right? You know, this is, like, one particular visualization of a single conversation, and for each of these, we've got, you know, 10 pieces, 10 context documents that come back.
You know, we have a K of 10. And so it looked at which communities are they coming from across this conversation? So this person originally was starting out in this area, and then clearly ended up over here. And then when they got to whatever community group number, or whatever it is, um, they got to their answer, and then they moved on.
So whether your conversations are short or long is going to depend. What route you want people to take, right? I think about, you know, I always go to product analytics, right? So what is it that someone might need to know? So, so much becomes available. And, you know, again, on the developer side, like, maybe these things are important to us, maybe they're not.
But as someone who's building agents, it has to be important to you, right? We have to really take accountability for the signals that we're amplifying in our systems and be aware. It's why the data scientists and the developers aren't on different sides anymore. We're all AI engineers.
Question? Someone? Someone? No? Okay. All right. My soapbox. Thank you. I appreciate that. Um, document usage. We talked about this. We talked about all these things. I don't think there's... Yeah, I think we're good. Yeah. Oh, this is just another, uh, you know, the betweenness centrality, right? Like, um, to understand why, why it might be important.
Um, again, if you think about what is it that we're delivering, you know, like somebody said, like, who's the most important person at the company? Is it the person that's connected to any, everybody? Or is it, you know, the one person that talks to both sides of the, of the business?
Right? Betweenness centrality says it's this person. Page rank says it's somebody else. So that's the other thing, too, is when we're trying to figure out what's important, we have a variety of algorithms that do different things that make that available to you. You know, conversation lengths, right? I mean, this is just some of the basics.
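A toy version of the "who is most important" question, in plain Python rather than GDS. The graph is made up for the sketch: two tightly knit teams of five, plus one person named `bridge` who talks to one member of each team. Betweenness is computed with Brandes' algorithm and PageRank with simple power iteration; the damping factor and iteration count are conventional defaults, assumed here.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Undirected adjacency sets from a list of (a, b) pairs."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def betweenness(adjacency):
    """Brandes' algorithm for an unweighted, undirected graph."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        order = []
        predecessors = {node: [] for node in adjacency}
        path_counts = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_counts[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:  # breadth-first search from source
            node = queue.popleft()
            order.append(node)
            for neighbor in adjacency[node]:
                if distance[neighbor] < 0:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
                if distance[neighbor] == distance[node] + 1:
                    path_counts[neighbor] += path_counts[node]
                    predecessors[neighbor].append(node)
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):  # accumulate from the farthest nodes back
            for pred in predecessors[node]:
                share = path_counts[pred] / path_counts[node]
                dependency[pred] += share * (1 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    # Each undirected pair was counted once from each endpoint.
    return {node: score / 2 for node, score in scores.items()}


def pagerank(adjacency, damping=0.85, iterations=50):
    """Power-iteration PageRank; assumes every node has at least one neighbor."""
    n = len(adjacency)
    rank = {node: 1 / n for node in adjacency}
    for _ in range(iterations):
        new_rank = {node: (1 - damping) / n for node in adjacency}
        for node, neighbors in adjacency.items():
            share = damping * rank[node] / len(neighbors)
            for neighbor in neighbors:
                new_rank[neighbor] += share
        rank = new_rank
    return rank


team_a = ["a1", "a2", "a3", "a4", "a5"]
team_b = ["b1", "b2", "b3", "b4", "b5"]
edges = list(combinations(team_a, 2)) + list(combinations(team_b, 2))
edges += [("a1", "bridge"), ("bridge", "b1")]
graph = build_graph(edges)

bc = betweenness(graph)
pr = pagerank(graph)
print(max(bc, key=bc.get))          # the go-between scores highest on betweenness
print(pr["a1"] > pr["bridge"])      # a well-connected team member outranks it on PageRank
```

The two measures disagree on purpose here: `bridge` has only two connections, but every shortest path between the teams runs through it.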
Um, but this is just another really interesting one is we are looking at, um, good ratings and bad, right? So this is just an analysis based on the community, right? How many are in the community? Um, and then are the ratings good or bad by community, right? If you find that there's a, that there's a cluster of documents where the ratings are significantly worse than the others, let me go look at that.
Let me go look at those documents. Maybe it's, you know, outdated. I had somebody, this is my interesting story. So, um, I do some work with some folks who do like DOD and Fed work and, um, former Navy SEAL, and he's just like, yeah, I couldn't believe it. And I'm like, what?
He's like, somebody was doing this proposal and they have like this one thing that they have to put in, like some risk assessment thing. And like, it's just like they copy and paste it over and over and over again. And so one day he got really curious and he wanted to actually track where it came from.
It was almost word for word something from the Vietnam War that people are still putting into these documents, right? Like it's got, like if we just threw it into a RAG application, it would have really high page rank. Everybody's using it. But where did it come from? Is that what we want?
So looking at the things that are influential, not because they're an anomaly, but because we're amplifying that signal, is that the signal we want to amplify? That's a whole other like AI observability thing. But the reason I bring it up is that our job is to build good, smart applications.
We are responsible for the outcomes of these processes. And these are some ways that you can take a new approach to what you're doing, right? Let me go back. Questions? Comments? Concerns? Yes? Yeah. I don't know that there's-- I mean, you have to do the projection and run the algorithm.
But we've got it in here, community detection and assignment. I believe betweenness is in here. Oh, betweenness might be in number three. Creating secondary relation to that. We're not even going to get to that. Yeah. So there's a betweenness centrality. So when you look at the algorithms, they're all going to be pretty much the same.
It's going to be GDS and then the name of the algorithm. And then there's a couple of different options. Stream means run the calculations and let me look at them. Just put them here. Mutate will actually change the projection. You can also write. So if you do the algorithm dot write, it's going to write back to the node and we'll write that assignment back.
So sometimes you just want to play. You just want to look at it, right? Sometimes you may want to just change it within the projection and just play around. So you've got a few different ways that you can apply the algorithm once those numbers come out. And the basic is you're always going to do a call.
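The naming pattern described above (gds, then the algorithm, then the mode) can be captured in a small helper that only builds the Cypher text and its parameters. This is a sketch that never connects to a database: the helper name `gds_call` and the projection name `documents` are hypothetical, and the `writeProperty` and `mutateProperty` config keys are the ones the GDS documentation uses for those modes, so check them against your installed version.

```python
MODES = {"stream", "mutate", "write"}


def gds_call(algorithm, mode, graph_name, config=None):
    """Build a parameterized GDS procedure call without running it.

    algorithm: e.g. "louvain" or "betweenness"
    mode: "stream" (just return results), "mutate" (update the
          in-memory projection), or "write" (write back to the nodes)
    graph_name: name of an existing projection (hypothetical here)
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    query = f"CALL gds.{algorithm}.{mode}($graph_name, $config)"
    params = {"graph_name": graph_name, "config": dict(config or {})}
    return query, params


# Look at community assignments without changing anything.
stream_query, stream_params = gds_call("louvain", "stream", "documents")

# Write the community id back to each node as a property.
write_query, write_params = gds_call(
    "louvain", "write", "documents", {"writeProperty": "community"}
)

# Keep betweenness scores inside the projection only.
mutate_query, mutate_params = gds_call(
    "betweenness", "mutate", "documents", {"mutateProperty": "betweenness"}
)
```

The query and parameter dict would then be handed to whatever Neo4j client is in use; the first argument is always the projection the algorithm runs on.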
You're going to call the algorithm. And then the first thing you're going to do is say, which projection am I running this on? Right? In this projection, I actually had, like, I created a new relationship called co-occurrence. Where is it? Yeah. So what this did is it says, okay, I'm going to look at the message and I want to see the different documents that are in the same message.
And for each of those where the elements aren't the same, right? So I don't want to create a connection to itself. I want to merge. Merge is, you can either create or merge a relationship. Merge just means if it doesn't exist, make it. I'm going to merge the chunk one or the context one with context two.
And I'm going to create this new relationship co-occurs with. And the reason I wanted to do this is I wanted to see, like, which things come up together a lot. Because sometimes it's not just about the similarity. If I've done a really good job of curating my data set, and I don't have a lot of that redundancy, I want to see, like, what are the concepts that are coming together?
Like, what are the things that come together? From there, you can then look at the way they travel around each piece from one to another. So what I did is I created this co-occurrence, right? So yes. So like, if we went back to our-- so basically, co-occurrence just says all of these documents that were pulled in by this question will now have a relationship among them, right?
And then we weight that by how often do they come up together, right? You can do the same thing with traversal. So these chunks come first, these chunks come second. So if you know that the conversations usually go in that direction, do you want to figure-- what's the probability?
Can I predict the next thing? Can I figure out where the conversation is going to go? So all of this to say that, you know, I love that there's a GraphRAG track. I love that it's just sort of like the way many people are doing things. And what I want you to know is you've done all the hard work of building the graph.
Now let's use it even more. Let's get even more out of it. There was a question somewhere. Yeah, yeah. Co-occurrence is when, like, the-- you ask a question and then the retrieved answer, like, you have, like, two answers that are commonly retrieved together, like, to that question. Is that co-occurrence?
Am I understanding it correctly? The co-occurrence is all-- everything that came into this one, like, if you're doing a K of 10, all of those 10 will co-occur, meaning, like, they all were answered together. They all were fed back at the same time, right? So it's almost like an affinity of, like, these things often come together.
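The co-occurrence weighting can be sketched in a few lines of plain Python: every pair of chunks retrieved for the same message gets a link, and the weight is the number of messages in which the pair appeared together. The input shape (a dict from message id to the list of retrieved chunk ids) is an assumption for the sketch.

```python
from collections import Counter
from itertools import combinations


def co_occurrence_weights(retrieved_by_message):
    """Count how often each pair of chunks was retrieved for the same message.

    retrieved_by_message: dict of message_id -> list of chunk ids
    Returns a Counter keyed by (chunk_a, chunk_b) with chunk_a < chunk_b.
    """
    weights = Counter()
    for chunk_ids in retrieved_by_message.values():
        # set() drops repeats so a chunk never pairs with itself;
        # sorted() gives each pair one canonical ordering.
        for a, b in combinations(sorted(set(chunk_ids)), 2):
            weights[(a, b)] += 1
    return weights
```

With a K of 10, each message contributes 45 pairs, so the heaviest pairs are the chunks that keep being fed back together.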
Then you can, like I said, you can do the same thing. You can expand it out into, like, follow-on. Like, these things usually follow those things, right? So we can start to see. And that's where, you know, things get juicy. Thank you. My nerdy, like, data science brain gets super excited.
Like, oh my god. Then we can check this. Then we can check that. And maybe somebody cares. Yeah. You said what? Oh, sorry, sorry, sorry. I'll be back. I got a question more-- more not a technical question, more general. So you mentioned the DOD and FED space. So we operate in the DOD space.
Oh, nice. More on the industry side. Yeah. So what is the, you know, what are your thoughts on the adoption of this technology, so within the DOD, both on the government side and also on the U.S. side? Yeah. I have a lot of opinions on that. Naya will tell you.
I think when it-- the most interesting conversations that I have are around accountability and traceability. There was-- I was talking to someone literally this weekend and said there was an unnamed person who's currently running for Congress and they said about the military, oh, they're just going to need to learn to trust the AI.
I've never been so alarmed in my life. I was just like, no, no, trust but verify. Trust but verify, please, right? I mean, like, as someone who's been a data scientist for a long time, like, we do our best, but there are unintended consequences. Like, we didn't know we were building an echo chamber.
We just thought we were giving you chicken nuggets and Twinkies, right? Like, we didn't know it was going to break America, but we did it, right? Like, I was one of those people. Like, I built those algorithms. Hello. Thank you. You know, and so, like, I think the best thing that we can all do is to just keep those things in mind.
And so that's where the conversations, like, my conversations with them, like, and I'm not, like, saying, like, oh, you're so dumb. Like, you don't know. Like, people don't know, right? Like, like, I've been doing this for years. Like, I've lived in the calculus, right? Like, I've lived in, like, you know, understanding the mathematics of the neural network, right?
Like, I've spent a lot of time-- like, the data scientists, we had rigor and we had mathematics and we, like, had-- every time we released a model, we had model reports and all this stuff. And now it's just like, just trust it. It's good. We're fine, right? And a lot of it is, right?
Like, we do-- I mean, it's the world we live in. Like, I'm certainly not-- like, I'm not a naysayer. Like, I love AI. I love what's possible. But I just think that we just need to be a little bit thoughtful about what we're doing. And, like, that's why these are easy things to do, right?
Like, and then once you're aware of it, you're like, oh, yeah, I'm going to get a much better answer. That's awesome. Let's do that. Yes. I have a question. So on the-- more on the knowledge graph side of things, is there, like, an easier way to-- is there an easier way to automatically create the knowledge graph?
Because, you know, things like graphlets have really made things a lot easier for a lot of people versus, like, this is-- to me, it's kind of like an old school way of creating graph, which is you've really got to be the domain expert and know all the ontologies and all the relationships.
But what if you wanted to do that automatically many times over for different types of businesses? How do you do that? Yeah. It's a really excellent question. And so hopefully everybody heard the question, which is, like, how do you, like, build graphs at scale, right? Like, how do you do it without having to be an expert?
The knowledge graph builder, the KG builder that Neo4j built-- because these are questions that people have, right? There's some really interesting ways. Like, you can actually use the simple KG pipeline and just throw your documents in and just say, "Tell me what's important." Right? It's going to go through and do basic named entity recognition.
Based on those entities, it'll do another pass and say, "How are they related?" And what it'll do is it'll look at each document individually. But then what you start to see is you start to see, like, the natural organic, like, what's actually in there, what comes up. So I love that for experimentation and for understanding.
Because oftentimes, as builders, we don't know. We're not the subject matter experts. But we're responsible for building the architecture for them to get the right answer, right? So how-- what tools do we have to try and service it? Because then you can run it and it'll be noisy. Like, it will look like spaghetti.
It will be noisy. But then you can say, "Okay, tell me how many of this label. How many of that label? Show me, like, what is the page rank? What is the betweenness centrality?" Right? You can run these kinds of algorithms to understand. And then from there, you say, "Okay, according to this set of documents, this is the ontology that has risen out of it." So you can take a very organic approach.
Sometimes people will go in with a very specific data model in mind. And even in those cases, I always encourage people to run the KG Builder for experimentation purposes, because it could be that there's something in your documents that you didn't even know was valuable. Right? Like, let them speak.
Like, let them-- like, let the ontology rise. Yeah. The question is, like, I think today our industry has been writing content more for consumption by humans. Now this is changing towards consumption by agents. Yeah. Do you see this playing the graph analytics and what you're showcasing is playing for persona, for content writers, and whether for completeness, how are they writing and producing more content?
It's helping them? Is that the primary persona we are targeting here? I mean, I mean, listen, it can go in any direction, right? Like, one of the things that I'm really excited about is the concept of, like, I call it-- it's not about human in the loop. It's about accountability in the loop, right?
Like we said, we're going agent to agent to agent. And so what do we then pass to that agent, right? So if we have some of this information, right? Like, if we're concerned about amplifying signal, right? And if we're concerned about, you know, like, a series of agents going in the wrong direction, right?
You think of a ship. If it's off by one degree now, it's fine. You know, a hundred miles later, not so fine. So how do we-- how do we keep steering the ship when it's autonomous? And so that's some of the interesting things, too. Like, when we look at this memory graph, right, we can say-- like, this application graph, we can say, "Okay, I'm going to pass this to the agent, and I'll pass, like, this context, or this particular information about what came before." Or, you know, like, you can look at, like, sort of, like, the velocity of how quickly it's going from one community to another, right?
Like, there's so many interesting ways you can do the math to look at it. But yeah, like, I think, like, really understanding, like, how do we drive our agents? How do we drive our agents and chains is super sexy, super exciting for me. Yeah. Question about content hygiene. Yes.
So a lot of times, we realize that it's garbage in, garbage out. Yeah. And we ourselves sometimes aren't aware of, like, some issues with our own content, whether one part is contradicting another, or something's just inaccurate. Is there any fancy math with graphs that can help us identify potential issues with content hygiene?
I mean, I would want to give you a more thoughtful answer than something off the top of my head. Like, I don't want to, I don't want to throw something out that's just kind of like, oh, what is this? Yes. Is it going to solve everything? No. But again, some of it has to do with what is it connected to, right?
So if we think of, you know, how fraud detection works, right? Like, or, so I worked on the election graph they did with Syracuse University last year for the presidential election. And they were tracking advertising, like, across Facebook and Instagram. And Jenny, who ran the program, she said something really interesting to me, which is, if you want to know how someone's going to vote, it's actually not their friends.
But if you look at the friends of friends, that second hop out is actually more predictive of how someone's going to vote than their immediate group. Which I thought was fascinating, right? Like, why is that a thing? But it's the same kind of thing that might be possible, right?
So let's look at what is it connected to? Are we finding that there's, like, some sort of, like, density of something, like, in that area, potentially? The other thing about me is I love to be available. I love to chat, as you can see. So feel free to connect with me on LinkedIn and DM me.
And if you ever want to noodle, like, I'm always happy to noodle on, like, all the things. Because my hope is that you leave here today and you start thinking about, like, what might be interesting to you? What could you look at a little bit differently? Yeah, another question.
Yeah, go ahead. Yeah. Thank you, Alison. It's a great session, really. I wanted to ask about the embeddings versus the structure. Yeah. And, yes, it's a research question. Yeah. There's a lot of approaches out there. And, of course, it depends upon the domain. But I wanted to ask you if your team has any specific experiences in terms of how to leverage the embedding per node, let's say, versus the structure of the graph.
Because, for example, in the morning session, we've seen just simple aggregation of the two. Yeah. Close the eyes and let's see how we can. You spoke a little bit about the ranking. So my question is, how would you start approaching a given specific domain where we have embedding in the same embedding space for every node, plus the structure?
Yeah. There are actually really interesting ways that you can do combinations of not just the text embedding, but the embedding of the node itself, right? So for those of you who don't know, the concept that we have of this text embedding, you can also do on the node. And so it's an embedding, really, of the structure of any given node and which relationships it's connected to.
So, yeah, there's some really interesting things that you can do there. One of the things that we, like, didn't get to today but is, like, you know, leveraging page rank and, you know, vector to, like, re-rank. So there's definitely some interesting things that you can do. I don't know if we have content on that specifically, but definitely reach out to me and I will reach out to our people and we'll get you a really good answer on that for sure.
Thank you. We're coming up on time. I still understand the community detection a little bit. So, like, how does it work? What is happening? All right. So, we've got a few different ways that we can look at it. This is one example how Louvain actually works. So, like I said, what it'll start to do is it'll assign a label.
And then what it says is, okay, let me look at its neighbors. And if I give the neighbors the same label, does that modularity, does that interconnectedness number get better or worse, right? So, different clustering algorithms are going to do things different ways, right? But in this case, this is what we're looking at.
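The "does the modularity get better or worse" check can be written out directly. The sketch below is plain Python for an unweighted, undirected graph with no self-loops and integer community labels (all assumptions); it computes modularity as the sum over communities of (internal edges / m) minus (total degree / 2m) squared, and tries a single Louvain-style local move. It does not implement the full algorithm, which repeats such moves and then aggregates communities.

```python
def modularity(edges, labels):
    """Modularity of a labeling: sum over communities of L_c/m - (d_c/2m)^2.

    edges: list of (a, b) pairs, undirected, no self-loops
    labels: dict of node -> community label
    """
    m = len(edges)
    internal = {}  # edges with both ends in the community
    degree = {}    # total degree of the community's nodes
    for a, b in edges:
        degree[labels[a]] = degree.get(labels[a], 0) + 1
        degree[labels[b]] = degree.get(labels[b], 0) + 1
        if labels[a] == labels[b]:
            internal[labels[a]] = internal.get(labels[a], 0) + 1
    return sum(
        internal.get(community, 0) / m - (total / (2 * m)) ** 2
        for community, total in degree.items()
    )


def best_local_move(node, edges, labels):
    """Try each neighbor's label for `node`; keep the one that raises modularity most."""
    neighbors = {b if a == node else a for a, b in edges if node in (a, b)}
    best_label, best_score = labels[node], modularity(edges, labels)
    for label in sorted({labels[neighbor] for neighbor in neighbors}):
        trial = {**labels, node: label}
        score = modularity(edges, trial)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Two triangles joined by a single edge between nodes 2 and 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
everyone_alone = {node: node for node in range(6)}
two_groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
```

Starting from every node in its own community, moving node 0 in with a neighbor already improves the score, and the two-triangle labeling scores higher still.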
It's like, and if you think about it, it's like, you know how you have some friends and you all know each other, you're all really tight, but then you have this other friend who's got, like, all these different, like, friend groups that they're a part of. They don't have, like, their one little niche.
So, it's kind of like that. So, the more interconnected it is, the more relationships they have among each other. That's what it's looking for in the math. How would this be of any use in terms of, like, getting the actual data from the query? Yeah, so, when we talk about the community detection, the community detection is really meant to be, like, a production helper.
It's going to allow you to curate at scale, it's going to allow you to understand at scale, like, where things are moving. The community itself may not necessarily be part of your answer, unless, as I mentioned, you want more diversity in an answer. You want to cover a larger domain than just what's really close to the vector, right?
So, think of, if I'm, you know, I don't know, I'm in sales at Amazon. Like, I don't want to, you just bought luggage. I'm not going to show you more luggage, right? I want to show you the things that are often bought with, right? The connected pieces, right? It's sort of like that.
So, in that case, it could be helpful, but from a content perspective, it's like, do I want to give you just what you have, or is there some benefit to spreading the net a little wider in what's being brought back by that vector retriever? Okay. So, instead of just getting that one embedding, it gets, like, all the surrounding embeddings as well?
Yeah. So, depending on how you build your retrievers, right? Depending on what your thresholds are, depending on, you know, lots and lots of different ways to do it. But, again, I'm happy to noodle with anyone, any time. I literally get paid to help people build things, so I love my job.
Like, help me keep doing my job by asking me lots of questions and letting me help you. And one last question. Yeah. So, like, when it detects what's a community, does it only, like, check its surrounding nodes, or even, like, nodes across the entire graph? You know, why don't you and I, we can talk about more community after, because I don't want to keep everybody, because I know we're getting close to time, but it's a good question, so thank you for that.
What I do want to say is, if you've been with us this morning and this afternoon, your day with Neo4j is not over, because our lovely friend Alexi is here, and he's hosting our AI agent meetup this evening. There we go. Everybody say hi to Alexi. If you don't know him, you should.
Like, he's just one of the most, like, vivacious and passionate people about tech I've ever met in my life. So, yeah, please join us for the agent protocols tonight on Graphs and MCP. Feel free to reach out to me on LinkedIn, and I just want to thank you all for your attention.
I know that it was a, it was a lot, but please, like, just be curious. Be curious. That's it. Thank you, Alison. Thank you.