
Multi Agent AI and Network Knowledge Graphs for Change — Ola Mabadeje, Cisco



00:00:00.000 | Good afternoon, everyone. My name is Ola Mabadeje. I'm a product guy from Cisco. So,
00:00:21.720 | my presentation is going to be a little more producty than techy, but I think you're going
00:00:26.880 | to enjoy it. So, I've been at Cisco working on AI for the last three years, and I work
00:00:34.620 | in this group called OutShift. So, OutShift is Cisco's incubation group. Our charter is
00:00:40.120 | to help Cisco look at emerging technologies and see how these emerging technologies can
00:00:44.340 | help us accelerate the roadmaps of our traditional business units. And so, by training, I'm an
00:00:52.760 | electrical engineer, dabbled in network engineering, enjoyed it, and I've been doing
00:00:58.340 | that for a while. But over the last three years, focused on AI. Our group also focuses on quantum
00:01:03.980 | technology. So, quantum networking is something that we're focused on. And if you want to learn
00:01:08.540 | more about what we do with OutShift at Cisco, you can look that up. So, for today,
00:01:15.140 | we're going to dive into this real quick. And like I said, I'm a product guy. So, I usually start with my
00:01:20.660 | customers' problems, trying to understand what are they trying to solve for, and then from that work
00:01:24.920 | backwards towards creating a solution for that. So, as part of the process for us, we usually go through
00:01:30.720 | this incubation phase where we ask customers a lot of questions, and then we come up with prototypes. We
00:01:35.700 | do A/B testing, and then we kind of deliver an MVP into a production environment. And once we get
00:01:41.940 | product market fit, that product graduates into the Cisco's businesses. So, this customer had this issue.
00:01:47.480 | They said, "When we do change management, we have a lot of challenges with failures in production. How can we
00:01:54.420 | reduce that? Can we use AI to reduce that problem?" So, we double-clicked on that problem statement, and we
00:01:59.580 | realized it was a major problem across the industry. I won't go into the details here, but it's a big
00:02:04.380 | problem. Now, for us to solve the problem, we wanted to understand, does AI really have a place
00:02:09.660 | here, or is it just going to be rule-based automation to solve this problem? And when we looked at the
00:02:14.460 | workflow, we realized that there are specific spots in the workflow where AI agents can actually help
00:02:19.820 | address the problem. And so, we highlighted steps three, four, and five, where we believe that AI agents
00:02:24.780 | can help increase the value for customers and reduce the pain points that they were
00:02:29.400 | describing. And so, we sat down together with the teams. We said, "Let's figure out a solution for this."
00:02:34.760 | And so, this solution consists of three big buckets. The first one is the fact that it has to be a natural
00:02:42.440 | language interface where network operations teams can actually interact with the system. So, that's the
00:02:47.320 | first thing. And not just engineers, but also systems. So, for example, in our case, we built this system to
00:02:53.000 | talk to an ITSM tool such as ServiceNow. So, we actually have agents on the ServiceNow
00:02:58.360 | side talking to agents on our side. The second piece of this is the multi-agent system that sits within
00:03:04.040 | this application. So, we have agents that are tasked with doing specific things. So, an agent that is tasked
00:03:09.400 | with doing impact assessment, doing testing, doing reasoning around potential failures that could happen in the
00:03:16.360 | network. And then, the third piece of this is where we're going to spend some of the time today,
00:03:20.440 | which is a network knowledge graph. So, we have the concept of a digital twin in this case. So, what
00:03:25.720 | we're trying to do here is to build a twin of the actual production network. And that twin includes a
00:03:30.680 | knowledge graph plus a set of tools to execute testing. And so, we're going to dive into that
00:03:37.080 | in a little bit. But before we go into that, we had this challenge of, okay, we want to build a
00:03:43.800 | representation of the actual network. How are we going to do this? Because if you know networking
00:03:50.680 | pretty well, networking is a very complex technology. You have a variety of vendors and a customized
00:03:57.800 | environment, a variety of devices, firewalls, switches, routers, and so on. And all of these different devices are
00:04:04.120 | spitting out data in different formats. So, the challenge for us is how can we create a representation
00:04:09.800 | of this real-world network using knowledge graphs in a data schema that can be understood by agents.
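A minimal sketch of that ingestion challenge, assuming two invented vendor payload formats (all field names here are hypothetical, not Cisco's actual schema):

```python
# Illustrative sketch: normalize two hypothetical vendor payloads into one
# common, OpenConfig-style record that agents can reason over. The vendor
# names and field names are invented for illustration.

def normalize_interface(vendor: str, raw: dict) -> dict:
    """Map a vendor-specific interface dump to a common schema."""
    if vendor == "vendor_a":          # e.g. JSON from a controller API
        return {
            "name": raw["ifName"],
            "admin-status": "UP" if raw["adminUp"] else "DOWN",
            "mtu": raw["mtu"],
        }
    if vendor == "vendor_b":          # e.g. parsed CLI / YANG output
        return {
            "name": raw["interface"],
            "admin-status": raw["state"].upper(),
            "mtu": int(raw["mtu_bytes"]),
        }
    raise ValueError(f"unknown vendor: {vendor}")

a = normalize_interface("vendor_a", {"ifName": "eth0", "adminUp": True, "mtu": 1500})
b = normalize_interface("vendor_b", {"interface": "eth0", "state": "up", "mtu_bytes": "1500"})
print(a == b)  # both payloads normalize to the same record
```

The point is that after this ETL step, every downstream agent sees one schema regardless of which vendor produced the data.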
00:04:15.640 | And so, the goal was for us to create this ingestion pipeline that can represent the network in such a way
00:04:21.400 | that agents can take the right actions in a meaningful and predictive way. And so, for us to kind of
00:04:27.720 | proceed with that, we had these three big buckets of things to consider. So, we had to think about
00:04:33.400 | what are the data sources going to be? So, again, in networking, there are controller systems,
00:04:38.680 | there are the devices themselves, there are agents in the devices, there are configuration management
00:04:43.640 | systems. All of these things are collecting data from the network, or they have data about the network.
00:04:49.000 | Now, when they spit out their data, they're spitting it out in different languages: YANG,
00:04:53.320 | JSON, and so on. That's another set of considerations to have. And then, in terms of how the data is actually
00:04:58.680 | coming out, it could be coming out in terms of streaming telemetry, it could be configuration
00:05:02.280 | files in JSON, it could be some other form of data. How can we look at all of these three different
00:05:07.800 | considerations and be able to come up with a set of requirements that allows us to actually build a
00:05:12.280 | system that addresses the customer's pain point? And so, from the product side,
00:05:18.440 | we had a set of requirements. We wanted a knowledge graph that can have multimodal
00:05:23.960 | flexibility. That means it can handle key-value pairs, understand JSON files, and understand
00:05:30.360 | relationships across different entities in the network. The second thing is performance. If an engineer is
00:05:37.880 | querying a knowledge graph, we want to have instant access to the node, information about the node,
00:05:43.800 | no matter where the location of that node is. That was important for our customers. The third thing was
00:05:48.520 | operational flexibility. So, the schema has to be such that we can consolidate into one schema framework.
00:05:55.000 | The fourth piece here is where the RAG piece comes into play. So, we've been hearing a little about GraphRAG
00:06:00.920 | for a little bit today. We wanted this to be a system that has ability to have vector indexing in it,
00:06:06.760 | so that when you want to do semantic searches, at some point, you can do that as well. And then, in terms of
00:06:11.640 | just ecosystem stability, we want to make sure that when we put this in a customer's environment,
00:06:16.520 | there's not going to be a lot of heavy lifting that's going to be done by the customer to integrate
00:06:21.080 | with their systems. And again, it has to support multiple vendors. So, these were the requirements
00:06:25.160 | from the product side. And then, with our engineering teams, we started to consider some of
00:06:28.520 | the options on the table: Neo4j, obviously the market leader, and various other open source tools.
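The multimodal-flexibility requirement above (key-value properties plus relationships across entities) can be sketched as a tiny in-memory property graph; a production system would use a graph database such as the ones being evaluated here:

```python
# Minimal sketch of the "multimodal" requirement: nodes carry arbitrary
# key-value properties (e.g. parsed JSON config), edges carry a relationship
# type. Node IDs, roles, and relations are invented for illustration.

class PropertyGraph:
    def __init__(self):
        self.nodes = {}              # node id -> properties dict
        self.edges = []              # (source, relation, destination) triples

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def neighbors(self, node_id, relation):
        """All nodes reachable from node_id via the given relation."""
        return [d for s, r, d in self.edges if s == node_id and r == relation]

g = PropertyGraph()
g.add_node("fw1", role="firewall", config={"rules": 42})   # JSON-ish payload
g.add_node("sw1", role="switch")
g.add_edge("fw1", "connected_to", "sw1")
print(g.neighbors("fw1", "connected_to"))  # ['sw1']
```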
00:06:34.440 | At the end of the day, the engineering teams decided to kind of do some analysis around this. So,
00:06:39.800 | I'm showing the table on the right-hand side. It's not an exhaustive list of things that they
00:06:43.880 | considered, but these were the things that they looked at that they wanted to see, okay,
00:06:47.400 | what is the right solution to address the requirements coming from product? And we
00:06:54.440 | all centered around the first two here, Neo4j and ArangoDB. But for historical reasons,
00:07:00.440 | the team decided to go with ArangoDB because we had some use cases that were in the security space.
00:07:04.680 | Those were recommendation-system types of use cases that we wanted to continue using.
00:07:10.840 | But we are still exploring the use of Neo4j for some of the use cases that are coming up as
00:07:15.720 | part of this project. So, we settled on ArangoDB for this, and we eventually came up with a solution
00:07:23.000 | that looks like this. So, we have this knowledge graph solution. This is an overview of it.
00:07:26.200 | On the left-hand side, we have all of the production environment. We have the
00:07:30.920 | controllers, Splunk, which is a SIEM system, traffic telemetry coming in. All of them are coming
00:07:36.600 | into this ingestion service, which is doing an ETL, transforming all of this information into
00:07:42.600 | one schema: OpenConfig. So, the OpenConfig schema is a schema that is designed around networking,
00:07:47.640 | primarily. And it helps us because there's a lot of documentation about it on the internet. So,
00:07:52.760 | LLMs understand it very well. So, this setup is primarily a database of networking information that
00:08:03.800 | has open config schema as a primary way for us to communicate with it. So, natural language communication
00:08:09.080 | through an individual engineer or the agents that are actually interacting with that system.
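A simplified, OpenConfig-shaped record and the kind of lookup an agent's tool call might perform against it (paths are trimmed and values invented for illustration):

```python
# A simplified OpenConfig-shaped structure, following the public
# openconfig-interfaces model's general shape, with invented values,
# plus a tiny lookup of the kind an agent's tool call might perform.

device = {
    "openconfig-interfaces:interfaces": {
        "interface": [
            {"name": "eth0", "config": {"enabled": True, "mtu": 1500}},
            {"name": "eth1", "config": {"enabled": False, "mtu": 9000}},
        ]
    }
}

def interface_config(dev: dict, if_name: str) -> dict:
    """Return the config block for a named interface, or raise KeyError."""
    for intf in dev["openconfig-interfaces:interfaces"]["interface"]:
        if intf["name"] == if_name:
            return intf["config"]
    raise KeyError(if_name)

print(interface_config(device, "eth1")["mtu"])  # 9000
```

Because this shape is well documented publicly, an LLM-backed agent can map a natural-language question ("what is eth1's MTU?") onto this kind of structured lookup.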
00:08:13.800 | And so, we built this in the form of layers. So, if you're into networking again, there is a set of
00:08:21.960 | entities in the network that you want to be able to interact with. And so, we have layered this up in
00:08:26.200 | this way such that if there's a tool call or a decision to be made about a test, you only touch the layers you need. For example,
00:08:32.360 | let's say you want to do a test about configuration drift. You don't need to go to all of
00:08:38.360 | the layers of the graph. You just go straight down to the raw configuration file and be able to do your
00:08:42.440 | comparisons there. If you're trying to do like a test around reachability, for example, then you need a
00:08:47.000 | couple of layers. Maybe you need raw configuration layers, data plane layers, and control plane layers. So,
00:08:52.200 | it's structured in a way that when the agents are making their calls to this system, they understand
00:08:58.680 | what the request is from the system and they're able to actually go to the right layer to pick up the
00:09:04.040 | information that they need to execute on it. So, this is kind of a high level view of what the graph
00:09:09.400 | system looks like in layers. Now, I'm going to kind of switch gears and go back to the system. Remember I
00:09:17.560 | described a system that had agents, a knowledge graph, a digital twin, as well as a natural language
00:09:22.920 | interface. So, let's talk about the agentic layer. And before I kind of talk about the specific agents
00:09:28.040 | in this system, in this application, we are looking at how we are going to build a system that is based on
00:09:35.160 | open standards for an internet of agents. And this is one of the challenges we have within Cisco. We are
00:09:40.680 | looking at an open source collective that includes all of the partners
00:09:46.760 | we see down here. So, we have OutShift by Cisco. We have LangChain, Galileo. We have all of these
00:09:52.120 | members who are supporters of this collective. And what we are trying to do is to set up a system that allows
00:09:58.840 | agents from across the world (it's a big vision) to talk to each other without having to do
00:10:05.720 | the heavy lifting of reconstructing your agents every time you want to integrate them with another agent.
00:10:09.960 | So, it consists of identity, a schema framework for defining an agent's skills and capabilities,
00:10:16.040 | the directory where you actually store this agent, and then how you actually compose the agent,
00:10:20.360 | both at the semantic layer and the syntactic layer, and then how you observe the agents in process.
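A hedged sketch of the skills-and-directory idea, with invented field names rather than the collective's actual schema:

```python
# Illustrative sketch of agent identity, declared skills, and directory
# lookup. The record fields and skill names are invented for illustration,
# not the collective's actual schema.

from dataclasses import dataclass, field

@dataclass
class AgentRecord:
    agent_id: str                         # identity
    skills: list = field(default_factory=list)   # declared capabilities

class AgentDirectory:
    """Where agent records are stored and discovered by skill."""
    def __init__(self):
        self._records = {}

    def register(self, record: AgentRecord):
        self._records[record.agent_id] = record

    def find_by_skill(self, skill: str):
        return [r.agent_id for r in self._records.values() if skill in r.skills]

directory = AgentDirectory()
directory.register(AgentRecord("impact-agent", ["impact-assessment"]))
directory.register(AgentRecord("query-agent", ["graph-query", "aql"]))
print(directory.find_by_skill("graph-query"))  # ['query-agent']
```

With a shared record shape like this, an agent built elsewhere can be discovered and composed without rebuilding it for each integration.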
00:10:24.920 | All of these are part of this collective vision as a group. And if you want to learn more about this,
00:10:30.280 | it's on agntcy.org. And I also have a slide here that shows there's real code actually
00:10:36.440 | available that you can leverage today. Or if you want to contribute to the code, you can actually go there.
00:10:40.840 | There's a GitHub repo here that you can go to and you can start to contribute or use the data.
00:10:46.120 | There's documentation available as well, and there are sample applications that allow you to actually see
00:10:50.840 | how this works in real life. And we know that there's MCP, there's A2A, all of these protocols are
00:10:56.920 | becoming very popular. We also integrate all of these protocols because the goal again is not to
00:11:02.200 | create something that is bespoke. We want to make it open to everyone to be able to create agents and
00:11:07.640 | be able to make these agents work in production environments. So, back to the specific application
00:11:13.000 | we're talking about. Based on this framework, we built this set of
00:11:19.160 | agents as a group. So, we have five agents right now as part of this application. There's an assistant
00:11:24.360 | agent that's kind of the planner that orchestrates things across all
00:11:28.360 | of these agents. And then we have other agents that are all based on ReAct reasoning loops.
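A minimal ReAct-style loop with a scripted stand-in for the model (the question, tool, and scripted responses are invented for illustration):

```python
# Minimal ReAct-style loop: the agent alternates reasoning steps and tool
# calls ("actions"), feeding each observation back in, until it can answer.
# A scripted step list stands in for an LLM here; everything is invented
# for illustration.

def run_react(question, scripted_steps, tools):
    observations = []
    for step in scripted_steps:
        if step["type"] == "action":        # call a tool, record the result
            result = tools[step["tool"]](step["arg"])
            observations.append(result)
        elif step["type"] == "answer":      # the model decides it is done
            return step["text"].format(*observations)
    raise RuntimeError("loop ended without an answer")

tools = {"lookup_mtu": lambda name: 1500 if name == "eth0" else 9000}
steps = [
    {"type": "action", "tool": "lookup_mtu", "arg": "eth0"},
    {"type": "answer", "text": "eth0 MTU is {0}"},
]
print(run_react("What is eth0's MTU?", steps, tools))  # eth0 MTU is 1500
```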
00:11:32.920 | There's one particular agent I want to call out here, the query agent. This query agent is the one
00:11:37.160 | that actually interacts directly with the knowledge graph on a regular basis. We had to fine-tune this
00:11:43.080 | agent because we initially attempted to use RAG to do some querying of the
00:11:50.680 | knowledge graph. But that was not working out well. So, we decided that for immediate results, we were
00:11:55.160 | going to fine-tune it. And so, we did some fine-tuning of this agent with some schema information
00:12:01.000 | as well as example queries. And so, that helped us to actually reduce two things: the number of tokens
00:12:05.960 | we were burning, and the time to results. Because before that, the AQL queries were going through all of the layers
00:12:10.920 | of the knowledge graph, and in a reasoning loop, it was consuming lots of tokens and taking a lot of time
00:12:16.200 | to return results. After fine-tuning, we saw a drastic reduction in the number of tokens
00:12:21.400 | consumed, as well as the amount of time it took to actually come back with the results. So, that kind of
00:12:26.200 | helped us there. So, I'm going to kind of pause here. There's a lot of slideware
00:12:31.560 | here. I want to show a quick demo of what this actually looks like. So, tying together everything
00:12:36.360 | from the natural language interface interaction with an ITSM system to how the agents interact to
00:12:42.520 | how that collects information from knowledge graph and delivers results to the customer.
00:12:46.360 | So, the scenario we have here is a network engineer wants to make a change to a firewall rule. They have to
00:12:54.840 | do that to accommodate a new server into the network. And so, what they need to do is to, first of all,
00:12:59.800 | start from ITSM. So, they submit a ticket in ServiceNow. Now, in our system here, the UI I'm showing
00:13:09.880 | right here is the UI of the actual system we've built, the application we've built. We have ingested
00:13:15.160 | information about the tickets here in natural language. And so, the agents here are able to actually
00:13:22.440 | start to work on this. So, I'm going to play a video here just to make it more relatable. So, the first thing
00:13:28.360 | that is happening here is that the first agent is asked for the information
00:13:36.520 | to be synthesized in a summarized way so that it can quickly understand what to do. The next action
00:13:42.360 | that is asked for here is to create an impact assessment. So, impact assessment here just means
00:13:46.600 | that I have to understand, will this change have any implications for me beyond the immediate
00:13:51.800 | target area? And that's going to be summarized. And we are now going to ask the agent that is
00:13:58.040 | responsible for this particular task to go and attach this information into the ITSM ticket. So, I'm going to say,
00:14:04.360 | attach this information about the impact assessment into the ITSM ticket. So, that's been done. Now, the next step
00:14:12.680 | is to actually create a test plan. So, test planning is one of the biggest problems that our customers are facing.
00:14:17.320 | They run a lot of tests, but they miss out on the right tests to run. So, this agent is actually able
00:14:23.560 | to reason through a lot of information about test plans across the internet. And based on the intent that
00:14:28.440 | was collected from the ServiceNow ticket, it's going to come up with a list of tests that you have to run
00:14:33.880 | to be able to make sure that this firewall rule change doesn't make a big impact or create problems in
00:14:38.840 | production environments. So, as you can see here, this agent has gone ahead and actually listed all of the test
00:14:43.560 | cases that need to be run and the expected results for each of the tests. So, we're going to ask this
00:14:48.520 | agent to attach this information again back to the ITSM ticket because that's where the approval board
00:14:54.280 | needs to see this information before they approve the implementation of this change in production
00:14:59.560 | environments. So, we can see here that that information has now been attached back by this
00:15:03.560 | agent to the ITSM ticket. So, two separate systems, but agents talking to each other.
00:15:08.840 | Now, the next step is to actually run all of these test cases. So,
00:15:12.520 | in this case, the configuration file that is going to be used to make the change in the firewall is
00:15:17.400 | sitting in the GitHub repo. And so, we're going to do a pull request of that config file. I'm going to
00:15:22.600 | take that information. So, this is the GitHub repo where we're going to do a pull request. We're going to
00:15:27.640 | take the link for that pull request and paste it in the ITSM ticket, so that when the
00:15:32.920 | execution agent starts doing its job, it's actually going to pull from that and use it to run its tests.
00:15:38.600 | So, at this moment, we are going to start running the test. We're going to ask this agent to go ahead
00:15:44.840 | and actually run and execute these tests. And so, I have attached the change. Sorry, I don't
00:15:52.520 | have my glasses. I've attached my change candidates to the ticket. Can you go ahead and run the test?
00:15:59.320 | So, what is going to happen here is if you look on the right-hand side of the screen here,
00:16:02.520 | a series of things are happening. The first thing is that this agent called the executor agent goes,
00:16:07.720 | looks at the test cases, and then it goes into the knowledge graph. And it's going to go ahead and
00:16:12.920 | actually take a snapshot of the most recent view of, or most recent information about, the network.
00:16:18.440 | It's now going to take the pull request that it pulled from GitHub, the snapshot it just took from
00:16:24.520 | the knowledge graph. It's going to compute it together and then run all of the individual tests
00:16:29.480 | one at a time. So, we can see that it's running the test, one test, test one, test two, test three,
00:16:33.880 | test four. So, all of this is happening in what we call a digital twin. So, a digital twin, again,
00:16:38.280 | is a combination of the knowledge graph and a set of tools that you can use to run a test. So,
00:16:43.400 | an example of a tool here could be Batfish or could be RouteNet or some other tools that you use for
00:16:48.120 | engineering purposes. So, once all of these tests are completed,
00:16:54.920 | this agent is going to now generate a report about the test results. So, we'll give it some time to run
00:17:00.200 | through this. It's still running the tests. But once it concludes all of the tests, it's going to report
00:17:04.920 | what the test results actually are. So, which tests actually passed, which ones failed.
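The twin test step just described, the latest snapshot plus the change candidate plus per-test checks, can be sketched like this (the data and checks are invented stand-ins for tools like Batfish):

```python
# Illustrative sketch of the twin's test step: merge the candidate change
# into the latest knowledge-graph snapshot, then run each test case against
# the result. Rules, fields, and checks are invented for illustration.

def run_tests(snapshot: dict, change: dict, tests: list) -> dict:
    candidate = {**snapshot, **change}        # apply the change to the snapshot
    results = {}
    for test in tests:
        results[test["name"]] = test["check"](candidate)   # True = pass
    return results

snapshot = {"fw1_rules": ["allow 10.0.0.0/24"], "mtu": 1500}
change = {"fw1_rules": ["allow 10.0.0.0/24", "allow 10.0.1.5"]}
tests = [
    {"name": "new server reachable",
     "check": lambda c: "allow 10.0.1.5" in c["fw1_rules"]},
    {"name": "existing rule preserved",
     "check": lambda c: "allow 10.0.0.0/24" in c["fw1_rules"]},
]
report = run_tests(snapshot, change, tests)
print(report)  # both tests pass
```

The per-test pass/fail map is what the execution agent turns into the report attached back to the ITSM ticket.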
00:17:10.760 | For the ones that have failed, it's going to make some recommendations of what you can do to go and
00:17:14.120 | fix the problem. I'm going to skip ahead here to just get this done quickly because
00:17:20.440 | of time. So, it's attached the results to the ticket, and this is the report that it's spitting out in
00:17:28.520 | terms of the tests that were run. So, this execution agent actually created a report
00:17:33.720 | about all of the different test cases that were run by the system. So, very quick short demo here.
00:17:40.200 | There's a lot of detail behind the scenes, but you can ask me some questions offline. The couple of things
00:17:45.320 | I want to leave us with is that, before I go to the end of this, is that evaluation is very critical
00:17:50.760 | here for us to be able to understand how this delivers value to customers. We're looking at a variety of
00:17:57.480 | things here. So, the agents themselves, the knowledge graph slash digital twin, and
00:18:03.240 | what we can actually measure quantifiably. Now, for the knowledge graph, we're looking at extrinsic
00:18:07.800 | and intrinsic metrics, particularly the extrinsic rather than the intrinsic ones, because we want to map this back to the customer's
00:18:13.240 | use case. So, this is a summary of what we see in terms of evaluation metrics.
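One example of an extrinsic metric of the kind described, change failure rate before and after introducing the system (the counts are invented for illustration):

```python
# Hedged sketch of one extrinsic metric: change failure rate, i.e. the
# fraction of production changes that fail, measured before and after the
# system is introduced. All counts are invented for illustration.

def change_failure_rate(failed: int, total: int) -> float:
    return failed / total if total else 0.0

before = change_failure_rate(failed=12, total=100)   # without the system
after = change_failure_rate(failed=4, total=100)     # with the system
reduction = (before - after) / before                # relative improvement
print(f"{reduction:.0%}")
```

An extrinsic metric like this maps directly to the customer's original pain point (failed changes in production), which is why it is weighted over intrinsic graph-quality metrics.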
00:18:17.800 | We are still learning. For now, it's an MVP, but what we are learning so far is that
00:18:25.080 | those two key building blocks, the knowledge graph and the open framework for building agents, are very
00:18:30.520 | critical for us to actually build a scalable system for our customers. And so, I'm going to stop. It's
00:18:36.200 | eight seconds to go. Thank you for listening to me. And then, if you have questions, I'll be out there.