Good afternoon, everyone. My name is Ola Mabadeje. I'm a product guy from Cisco. So, my presentation is going to be a little more producty than techy, but I think you're going to enjoy it. So, I've been at Cisco working on AI for the last three years, and I work in this group called OutShift.
So, OutShift is Cisco's incubation group. Our charter is to help Cisco look at emerging technologies and see how they can help us accelerate the roadmaps of our traditional business units. By training, I'm an electrical engineer who moved into network engineering, enjoyed it, and I've been doing that for a while.
But over the last three years, I've focused on AI. Our group also focuses on quantum technology, so quantum networking is something we're working on. If you want to learn more about what we do with OutShift at Cisco, you can look us up. So, for today, we're going to dive into this real quick.
And like I said, I'm a product guy. So, I usually start with my customers' problems, trying to understand what they are trying to solve for, and then work backwards from that towards creating a solution. As part of the process for us, we usually go through this incubation phase where we ask customers a lot of questions, and then we come up with prototypes.
We do A/B testing, and then we deliver an MVP into a production environment. And once we get product-market fit, that product graduates into Cisco's business units. So, this customer had this issue. They said, "When we do change management, we have a lot of challenges with failures in production.
How can we reduce that? Can we use AI to reduce that problem?" So, we double-clicked on that problem statement, and we realized it was a major problem across the industry. I won't go into the details here, but it's a big problem. Now, for us to solve the problem, we wanted to understand: does AI really have a place here, or is rule-based automation enough to solve this problem?
And when we looked at the workflow, we realized that there are specific spots in the workflow where AI agents can actually help address the problem. So, we highlighted steps three, four, and five, where we believe AI agents can increase the value for customers and reduce the pain points they were describing.
And so, we sat down together with the teams and said, "Let's figure out a solution for this." The solution consists of three big buckets. The first one is that it has to have a natural language interface where network operations teams can actually interact with the system.
So, that's the first thing. And not just engineers, but also systems. So, for example, in our case, we built this system to talk to an ITSM tool such as ServiceNow. So, we actually have agents on the ServiceNow side talking to agents on our side. The second piece of this is the multi-agent system that sits within this application.
So, we have agents that are tasked with doing specific things: an agent tasked with doing impact assessment, doing testing, doing reasoning around potential failures that could happen in the network. And then, the third piece of this is where we're going to spend some of the time today, which is a network knowledge graph.
So, we have the concept of a digital twin in this case. So, what we're trying to do here is to build a twin of the actual production network. And that twin includes a knowledge graph plus a set of tools to execute testing. And so, we're going to dive into that in a little bit.
But before we go into that, we had this challenge of: okay, we want to build a representation of the actual network. How are we going to do this? Because if you know networking pretty well, networking is a very complex technology. You have a variety of vendors in a customized environment, and a variety of devices: firewalls, switches, routers, and so on.
And all of these different devices are spitting out data in different formats. So, the challenge for us was how to create a representation of this real-world network using knowledge graphs, in a data schema that can be understood by agents. The goal was to create an ingestion pipeline that represents the network in such a way that agents can take the right actions in a meaningful and predictive way.
And so, to proceed with that, we had these three big buckets of things to consider. First, what are the data sources going to be? Again, in networking, there are controller systems, there are the devices themselves, there are agents in the devices, and there are configuration management systems.
All of these things are collecting data from the network, or they have data about the network. Now, when they spit out their data, they're spitting it out in different languages: YANG, JSON, and so on. That's another set of considerations to have. And then, in terms of how the data is actually coming out, it could be streaming telemetry, it could be configuration files in JSON, it could be some other form of data.
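To make that concrete, here's a minimal sketch of what normalizing two vendor formats into one OpenConfig-style shape could look like. This is not our actual pipeline; the parser helpers and field choices are hypothetical stand-ins.

```python
# Sketch only: map vendor-native configs into one OpenConfig-like dict.
# The parse_* helpers are placeholders, not real parsers.

def parse_ios(raw: str) -> dict:
    # Placeholder: a real parser would walk Cisco IOS config sections.
    return {"hostname": raw.splitlines()[0].split()[-1], "interfaces": []}

def parse_junos(raw: str) -> dict:
    # Placeholder: a real parser would read Junos structured config.
    return {"hostname": "junos-device", "interfaces": []}

PARSERS = {"cisco-ios": parse_ios, "juniper-junos": parse_junos}

def normalize(vendor: str, raw_config: str) -> dict:
    """Map a vendor-native config into a single OpenConfig-style schema."""
    parsed = PARSERS[vendor](raw_config)
    return {
        "openconfig-system:system": {"config": {"hostname": parsed["hostname"]}},
        "openconfig-interfaces:interfaces": {"interface": parsed["interfaces"]},
    }

print(normalize("cisco-ios", "hostname fw-edge-01"))
```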
How can we look at all of these three different considerations and come up with a set of requirements that allows us to build a system that addresses the customer's pain point? And so, from the product side, we had a set of requirements.
We wanted a knowledge graph that has multimodal flexibility. That means it can handle key-value pairs, understand JSON files, and understand relationships across different entities in the network. The second thing is performance. If an engineer is querying the knowledge graph, we want instant access to information about a node, no matter where that node is located. That was important for our customers.
The third thing was operational flexibility. The schema has to be such that we can consolidate into one schema framework. The fourth piece here is where the RAG piece comes into play. We've been hearing a bit about GraphRAG today; we wanted this to be a system with vector indexing built in, so that when you want to do semantic searches at some point, you can do that as well (there's a small illustrative sketch of that right after these requirements).
And then, in terms of ecosystem stability, we want to make sure that when we put this in a customer's environment, there's not a lot of heavy lifting for the customer to integrate with their systems. And again, it has to support multiple vendors.
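Here's that sketch: a toy semantic search over node descriptions using cosine similarity. The `embed` function is a stand-in for whatever embedding model you plug in; nothing here reflects the actual product.

```python
# Illustrative only: cosine-similarity search over node descriptions,
# standing in for the vector-indexing requirement.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in embedding; replace with a real embedding model's output.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

nodes = {"fw-edge-01": "edge firewall, DMZ", "core-rtr-02": "core router, BGP"}
index = {name: embed(desc) for name, desc in nodes.items()}

def semantic_search(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    scored = sorted(index.items(), key=lambda kv: -float(q @ kv[1]))
    return [name for name, _ in scored[:k]]

print(semantic_search("which device handles BGP routing?"))
```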
So, these were the requirements from the product side. Then our engineering teams started to consider some of the options on the table: Neo4j, obviously the market leader, and various other open source tools. At the end of the day, the engineering teams did some analysis around this.
I'm showing the table on the right-hand side. It's not an exhaustive list of things they considered, but these were the criteria they used to decide what the right solution was to address the requirements coming from product. And we all centered around the first two here, Neo4j and ArangoDB.
But for historical reasons, the team decided to go with ArangoDB, because we had some use cases in the security space, recommendation-system-type use cases, that we wanted to keep building on. That said, we are still exploring the use of Neo4j for some of the use cases coming up as part of this project.
So, we settled on ArangoDB for this, and we eventually came up with a solution that looks like this. This is an overview of the knowledge graph solution. On the left-hand side, we have the whole production environment: the controllers, Splunk, which is a SIEM system, traffic telemetry coming in.
All of them feed into this ingestion service, which is doing an ETL, transforming all of this information into one schema: OpenConfig. The OpenConfig schema is a schema designed primarily around networking, and it helps us because there's a lot of documentation about it on the internet.
So, LLMs understand it very well. This setup is primarily a database of networking information that uses the OpenConfig schema as the primary way to communicate with it, whether that's natural language communication from an individual engineer or from the agents interacting with the system. And we built this in the form of layers.
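Before getting into the layers, here's a hedged sketch of what loading one normalized document into ArangoDB looks like with the python-arango client. The database name, credentials, and collection names are all made up for illustration.

```python
from arango import ArangoClient

# Hypothetical connection details and collection names.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("network_kg", username="root", password="")

if not db.has_collection("devices"):
    db.create_collection("devices")
if not db.has_collection("links"):
    db.create_collection("links", edge=True)

devices = db.collection("devices")
links = db.collection("links")

# Upsert one device document already normalized into an OpenConfig-style shape.
devices.insert(
    {
        "_key": "fw-edge-01",
        "openconfig-system:system": {"config": {"hostname": "fw-edge-01"}},
    },
    overwrite=True,
)

# Edges capture relationships between entities, e.g. a data-plane adjacency.
links.insert(
    {
        "_from": "devices/fw-edge-01",
        "_to": "devices/core-rtr-02",
        "layer": "data-plane",
    }
)
```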
So, if you're into networking, again, there is a set of entities in the network that you want to be able to interact with. And we have layered this up in such a way that if there's a tool call or a decision to be made about a test, you only go where you need to.
For example, let's say you want to run a test for configuration drift. You don't need to go through all of the layers of the graph; you just go straight down to the raw configuration layer and do your comparisons there. If you're trying to do a test around reachability, on the other hand, you need a couple of layers.
Maybe you need the raw configuration layer, the data plane layer, and the control plane layer. So, it's structured in a way that when the agents make their calls to this system, they understand what the request is and they're able to go to the right layer to pick up the information they need to execute on it.
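A toy illustration of that routing idea, with made-up layer names and a plain line diff standing in for a real drift check:

```python
import difflib

# Hypothetical mapping from a test type to the graph layers it needs;
# layer names mirror the talk's structure, not a real product schema.
LAYERS_FOR_TEST = {
    "config_drift": ["raw-config"],
    "reachability": ["raw-config", "data-plane", "control-plane"],
}

def layers_needed(test_type: str) -> list[str]:
    """Let an agent fetch only the layers a given test requires."""
    return LAYERS_FOR_TEST[test_type]

def config_drift(intended: str, running: str) -> list[str]:
    """Config drift only touches the raw-config layer: a plain line diff."""
    return list(
        difflib.unified_diff(intended.splitlines(), running.splitlines(), lineterm="")
    )

print(layers_needed("reachability"))
print(config_drift("hostname fw-01\npermit tcp any", "hostname fw-01\ndeny tcp any"))
```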
So, that's a high-level view of what the graph system looks like in layers. Now, I'm going to switch gears and go back to the system. Remember, I described a system that had agents, a knowledge graph, a digital twin, and a natural language interface.
So, let's talk about the agentic layer. Before I talk about the specific agents in this application, we are looking at how to build a system that is based on open standards for the whole internet. And this is one of the challenges we have taken on within Cisco.
We are part of an open source collective that includes all of the partners you see down here: OutShift by Cisco, LangChain, Galileo, and all of these other members who are supporters of the collective. What we are trying to do is set up a system that allows agents from across the world, and it's a big vision, to talk to each other without the heavy lifting of reconstructing your agents every time you want to integrate them with another agent. It consists of identity; a schema framework for defining an agent's skills and capabilities; a directory where you actually store these agents; how you compose agents, at both the semantic layer and the syntactic layer; and how you observe the agents as they run.
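To make the directory idea tangible, here's a purely illustrative record; these field names are invented for this sketch, not the collective's actual schema.

```python
# Purely illustrative: the kind of record an agent directory might hold.
agent_record = {
    "identity": "example.com/agents/impact-assessment",        # who the agent is
    "skills": ["summarize-change-ticket", "assess-blast-radius"],  # what it can do
    "endpoint": "https://agents.example.com/impact",            # where to reach it
    "protocols": ["MCP", "A2A"],                                 # how to talk to it
    "observability": {"traces": True, "metrics": True},          # how to watch it run
}
```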
All of these are part of this collective vision as a group. If you want to learn more about this, it's on agntcy.org. I also have a slide here showing that there's real code you can actually leverage today, or if you want to contribute to the code, you can go there.
There's a GitHub repo here that you can go to, and you can start to contribute or use the code. There's documentation available as well, and there are sample applications that let you see how this works in real life. And we know that there's MCP, there's A2A; all of these protocols are becoming very popular.
We also integrate all of these protocols because the goal again is not to create something that is bespoke. We want to make it open to everyone to be able to create agents and be able to make these agents work in production environments. So, back to the specific application we're talking about.
Based on this framework, we delivered this set of agents as a group. We have five agents right now as part of this application. There's an assistant agent that acts as the planner, orchestrating things across all of these agents.
And then we have other agents that are all based on ReAct reasoning loops. There's one particular agent I want to call out here: the query agent. This query agent is the one that interacts directly with the knowledge graph on a regular basis. We had to fine-tune this agent, because we initially attempted to use RAG to do the querying of the knowledge graph.
But that was not working out well. So, we decided that for immediate results, we were going to fine-tune it. We did some fine-tuning of this agent with schema information as well as example queries, and that helped us reduce two things, the first being the number of tokens we were burning.
Before that, the AQL queries were going through all of the layers of the knowledge graph, and in a reasoning loop, that consumed lots of tokens and took a lot of time to return results. After fine-tuning, we saw a drastic reduction in the number of tokens consumed, as well as in the amount of time it took to come back with results.
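For flavor, here's the kind of (question, AQL) pair such fine-tuning data might contain. The collection names, document fields, and queries are hypothetical, not our actual training set.

```python
# Illustrative only: (question, AQL) pairs of the kind one might
# fine-tune a query agent on.
training_examples = [
    {
        "prompt": "Which interfaces on fw-edge-01 are administratively down?",
        "completion": (
            "FOR d IN devices FILTER d._key == 'fw-edge-01' "
            "FOR i IN d['openconfig-interfaces:interfaces']['interface'] "
            "FILTER i.config.enabled == false "
            "RETURN i.name"
        ),
    },
    {
        "prompt": "List all control-plane neighbors of core-rtr-02.",
        "completion": (
            "FOR v, e IN 1..1 ANY 'devices/core-rtr-02' links "
            "FILTER e.layer == 'control-plane' RETURN v._key"
        ),
    },
]
```

Targeted queries like these go straight to the layer they need, which is where the token savings come from.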
So, that helped us there. I'm going to pause here; there's a lot of slideware here, and I want to show a quick demo of what this actually looks like, tying together everything from the natural language interaction with an ITSM system, to how the agents interact, to how they collect information from the knowledge graph and deliver results to the customer.
So, the scenario we have here is a network engineer who wants to make a change to a firewall rule. They have to do that to accommodate a new server in the network. What they need to do is, first of all, start from the ITSM side: they submit a ticket in ServiceNow.
Now, the UI I'm showing right here is the UI of the actual application we've built. We have ingested information about the ticket here in natural language, so the agents are able to start working on it. I'm going to play the video here just to make it more relatable.
So, the first thing that happens here is that we ask the agents for the ticket information to be synthesized in a summarized way, so we can quickly understand what to do. The next action being requested here is to create an impact assessment.
So, impact assessment here just means that I have to understand whether this change will have any implications beyond the immediate target area. That's going to be summarized, and we're now going to ask the agent responsible for this particular task to attach this information to the ITSM ticket.
So, I'm going to say: attach this information about the impact assessment to the ITSM ticket. That's been done. Now, the next step is to actually create a test plan. Test planning is one of the biggest problems our customers are facing: they run a lot of tests, but they miss out on the right tests to run.
So, this agent is actually able to reason through a lot of information about test plans across the internet. And based on the intent that was collected from the ServiceNow ticket, it's going to come up with a list of tests that you have to run to be able to make sure that this firewall rule change doesn't make a big impact or create problems in production environments.
So, as you can see here, this agent has gone ahead and actually listed all of the test cases that need to be run and the expected results for each of the tests. So, we're going to ask this agent to attach this information again back to the ITSM ticket because that's where the approval board needs to see this information before they approve the implementation of this change in production environments.
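For a sense of shape, a generated test plan might look something like this. These cases are illustrative, not the demo's exact output.

```python
# Sketch of the shape of a generated test plan attached to the ticket.
test_plan = [
    {
        "id": "TC-1",
        "type": "reachability",
        "description": "New server can reach its default gateway",
        "expected": "pass",
    },
    {
        "id": "TC-2",
        "type": "reachability",
        "description": "DMZ hosts remain blocked from the server VLAN",
        "expected": "pass",
    },
    {
        "id": "TC-3",
        "type": "config_drift",
        "description": "Only the intended firewall rule differs from running config",
        "expected": "no unexpected diff",
    },
]
```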
So, we can see here that that information has now been attached back by this agent to the ITSM ticket: two separate systems, but agents talking to each other. Now, the next step is to actually run all of these test cases. In this case, the configuration file that is going to be used to make the change in the firewall is sitting in a GitHub repo.
And so, we're going to open a pull request with that config file. This is the GitHub repo where we're opening the pull request. We're going to take the link for that pull request and paste it into the ITSM ticket.
That way, when the execution agent starts doing its job, it's going to pull from that and use it to run its tests. So, at this moment, we're going to start running the tests. We're going to ask this agent to go ahead and execute these tests.
And so, I've attached the change. Sorry, I don't have my glasses. I've attached my change candidates to the ticket; can you go ahead and run the tests? So, what's going to happen here, if you look on the right-hand side of the screen, is a series of things.
The first thing is that this agent, called the executor agent, looks at the test cases and then goes into the knowledge graph. It's going to take a snapshot of the most recent view of the network. It's then going to take the pull request that it pulled from GitHub and the snapshot it just took from the knowledge graph.
It's going to put them together and then run all of the individual tests one at a time. So, we can see it running the tests: test one, test two, test three, test four. All of this is happening in what we call a digital twin. A digital twin, again, is a combination of the knowledge graph and a set of tools that you can use to run tests.
An example of a tool here could be Batfish, or RouteNet, or some other tool that you use for engineering purposes. Once all of these tests are completed, this agent is going to generate a report about the test results.
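As an aside, driving a tool like Batfish from Python might look roughly like this via its pybatfish client. The service endpoint and the snapshot directory are assumptions for the sketch, not our actual setup.

```python
from pybatfish.client.session import Session

# Assumptions: a Batfish service running on localhost, and a directory of
# device configs ("snapshots/candidate") built from the knowledge graph
# snapshot plus the pull-request change.
bf = Session(host="localhost")
bf.set_network("change-validation")
bf.init_snapshot("snapshots/candidate", name="candidate", overwrite=True)

# Ask Batfish to analyze reachability across the candidate network.
answer = bf.q.reachability().answer().frame()
print(answer.head())
```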
So, we'll give it some time to run through this. It's still running the tests, but once it concludes all of them, it's going to report what the test results actually are: which tests passed and which ones failed. For the ones that failed, it's going to make some recommendations of what you can do to fix the problem.
I'm going to skip ahead here to get this done quickly because of time. So, it has attached the results to the ticket, and this is the report it spits out for the tests that were run. This execution agent created a report about all of the different test cases that were run by the system.
So, a very quick, short demo there. There's a lot of detail behind the scenes, but you can ask me questions offline. The couple of things I want to leave you with, before I get to the end of this: evaluation is very critical here for us to understand how this delivers value to customers.
We're looking at a variety of things here: the agents themselves, the knowledge graph and digital twin, and what we can actually measure quantifiably. For the knowledge graph, we're looking at extrinsic metrics in particular, not intrinsic ones, because we want to map this back to the customer's use case.
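As one hypothetical example of an extrinsic measure, you could score generated test plans against what an expert would have run:

```python
# Toy extrinsic metric: did the generated test plan cover what a human
# expert would have run? The ground-truth set is assumed to exist.
def plan_recall(generated: set[str], expert: set[str]) -> float:
    return len(generated & expert) / len(expert) if expert else 1.0

def plan_precision(generated: set[str], expert: set[str]) -> float:
    return len(generated & expert) / len(generated) if generated else 1.0

print(plan_recall({"TC-1", "TC-2"}, {"TC-1", "TC-2", "TC-3"}))     # ~0.67
print(plan_precision({"TC-1", "TC-2"}, {"TC-1", "TC-2", "TC-3"}))  # 1.0
```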
So, this is the summary of what we see in terms of evaluation metrics. We are still learning; for now, this is an MVP. But what we're learning so far is that those two key building blocks, the knowledge graph and the open framework for building agents, are very critical for us to build a scalable system for our customers.
And so, I'm going to stop. It's eight seconds to go. Thank you for listening to me. And then, if you have questions, I'll be out there.