Multi Agent AI and Network Knowledge Graphs for Change — Ola Mabadeje, Cisco

Good afternoon, everyone. My name is Ola Mabadeje. I'm a product guy from Cisco, so my presentation is going to be a little more producty than techy, but I think you're going to enjoy it. I've been at Cisco working on AI for the last three years, in a group called OutShift. OutShift is Cisco's incubation group: our charter is to help Cisco look at emerging technologies and see how they can accelerate the roadmaps of our traditional business units. By training, I'm an electrical engineer who dabbled into network engineering, enjoyed it, and did that for a while, but over the last three years I've focused on AI. Our group also works on quantum technology, including quantum networking, and if you want to learn more about what we do at OutShift, you can look us up.

For today, we're going to dive into this real quick. Like I said, I'm a product guy, so I usually start with my customers' problems, trying to understand what they are trying to solve for, and then work backwards from that towards creating a solution. As part of our process, we go through an incubation phase where we ask customers a lot of questions, come up with prototypes, do A/B testing, and then deliver an MVP into a production environment. Once we get product-market fit, that product graduates into one of Cisco's business units. So, this customer had this issue.
They said, "When we do change management, we have a lot of challenges with failures in production. How can we reduce that? Can we use AI to reduce that problem?" We double-clicked on that problem statement and realized it was a major problem across the industry. I won't go into the details here, but it's a big problem. Now, to solve it, we first wanted to understand: does AI really have a place here, or is rule-based automation enough? When we looked at the change-management workflow, we realized there are specific spots where AI agents can actually help. We highlighted steps three, four, and five as the places where we believe AI agents can increase value for customers and reduce the pain points they were describing. So, we sat down with the teams and said, "Let's figure out a solution for this."
The solution consists of three big buckets. The first is a natural language interface through which network operations teams can interact with the system. And not just engineers, but also other systems: in our case, we built the system to talk to an ITSM tool such as ServiceNow, so we actually have agents on the ServiceNow side talking to agents on our side. The second piece is the multi-agent system that sits within the application. We have agents tasked with specific things: an agent doing impact assessment, doing testing, reasoning about potential failures that could happen in the network. The third piece, which is where we're going to spend some of the time today, is a network knowledge graph. We have the concept of a digital twin here: we're trying to build a twin of the actual production network, and that twin includes a knowledge graph plus a set of tools to execute testing. We'll dive into that in a little bit.

But before that, we had this challenge: we want to build a representation of the actual network. How are we going to do it? If you know networking pretty well, it's a very complex technology. You have a variety of vendors, customized environments, and a variety of devices: firewalls, switches, routers, and so on. And all of these devices are spitting out data in different formats. So the challenge was: how can we create a representation of this real-world network using knowledge graphs, in a data schema that can be understood by agents?
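To make that idea concrete, here is a minimal sketch (with invented node and edge shapes; not the actual Cisco schema) of representing a small network as a graph that an agent can traverse. The field names loosely echo OpenConfig-style naming but are illustrative only.

```python
# Hypothetical sketch: a tiny network knowledge graph as plain dicts.
# Node ids, types, and relation names are invented for illustration.

def build_graph():
    nodes = {
        "router1": {"type": "router", "vendor": "vendorA"},
        "fw1": {"type": "firewall", "vendor": "vendorB"},
        "router1:eth0": {"type": "interface", "device": "router1", "mtu": 1500},
        "fw1:eth0": {"type": "interface", "device": "fw1", "mtu": 1500},
    }
    edges = [
        {"from": "router1", "to": "router1:eth0", "rel": "has_interface"},
        {"from": "fw1", "to": "fw1:eth0", "rel": "has_interface"},
        {"from": "router1:eth0", "to": "fw1:eth0", "rel": "connected_to"},
    ]
    return nodes, edges

def neighbors(nodes, edges, node_id, rel=None):
    """Node ids reachable from node_id, optionally filtered by relation type."""
    return [e["to"] for e in edges
            if e["from"] == node_id and (rel is None or e["rel"] == rel)]
```

The point of a typed, relation-labeled structure like this is that an agent can ask "which interfaces does router1 have?" or "what is this interface connected to?" without parsing vendor-specific text.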
The goal was to create an ingestion pipeline that represents the network in such a way that agents can take the right actions in a meaningful and predictive way. To proceed, we had three big buckets of things to consider. First, what are the data sources going to be? In networking there are controller systems, the devices themselves, agents on the devices, and configuration management systems; all of these collect data from the network or hold data about it. Second, when they spit out their data, they do it in different languages: YANG, JSON, and so on. And third, in terms of how the data actually arrives, it could be streaming telemetry, configuration files in JSON, or some other form. How can we take these three considerations and come up with a set of requirements that lets us build a system that addresses the customer's pain point?

From the product side, we had a set of requirements. First, we wanted a knowledge graph with multimodal flexibility: it can handle key-value pairs, it understands JSON files, and it understands relationships across different entities in the network. Second, performance: if an engineer queries the knowledge graph, we want instant access to a node and its information no matter where that node is located; that was important for our customers. Third, operational flexibility: the schema has to be such that we can consolidate everything into one schema framework. Fourth is where RAG comes into play: we've been hearing about GraphRAG today, and we wanted a system with vector indexing built in, so that when you want to do semantic searches at some point, you can. And finally, ecosystem stability: when we put this in a customer's environment, there shouldn't be a lot of heavy lifting for the customer to integrate it with their systems, and it has to support multiple vendors. Those were the requirements from the product side. Then our engineering teams started to consider the options on the table: Neo4j, obviously, the market leader, and various other open source tools.
At the end of the day, the engineering teams did some analysis, which I'm showing in the table on the right-hand side. It's not an exhaustive list, but these were the criteria they evaluated to find the right solution for the requirements coming from product. We all centered around the first two, Neo4j and ArangoDB. For historical reasons, the team decided to go with ArangoDB, because we had some existing use cases in the security space, recommendation-system-type use cases, that we wanted to keep building on. That said, we're still exploring Neo4j for some of the use cases coming up in this project.

So, we settled on ArangoDB and eventually came up with a solution that looks like this. This is an overview of the knowledge graph solution. On the left-hand side we have the production environment: the controllers, Splunk, which is a SIEM system, and traffic telemetry. All of it flows into an ingestion service, which does an ETL, transforming the information into one schema: OpenConfig. The OpenConfig schema is designed around networking, and there's a lot of documentation about it on the internet, so LLMs understand it very well. The result is primarily a database of networking information with the OpenConfig schema as the main way to communicate with it, whether through natural language from an individual engineer or through the agents interacting with the system.
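The ETL step just described can be sketched roughly as follows. This is a hypothetical normalizer: the two vendor input shapes are invented, and the output only loosely imitates an `openconfig-interfaces`-style record rather than the real OpenConfig model.

```python
# Hypothetical sketch of the ingestion/ETL step: two vendors emit interface
# data in different shapes; both are normalized into one OpenConfig-style
# record. Vendor formats and field names here are invented for illustration.

def normalize(record):
    if record.get("source") == "vendorA":       # e.g. JSON from a controller
        return {
            "openconfig-interfaces:interface": {
                "name": record["ifName"],
                "config": {"mtu": record["mtu"], "enabled": record["adminUp"]},
            }
        }
    if record.get("source") == "vendorB":       # e.g. parsed CLI / YANG output
        return {
            "openconfig-interfaces:interface": {
                "name": record["interface"]["id"],
                "config": {"mtu": record["interface"]["mtu"],
                           "enabled": record["interface"]["state"] == "up"},
            }
        }
    raise ValueError("unknown source")
```

Once both vendors map to the same shape, downstream agents (and the LLM prompts behind them) only ever need to understand one schema.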
We built this in the form of layers. If you're into networking, there's a set of entities in the network that you want to interact with, and we've layered the graph so that when there's a tool call or a decision to be made about a test, you only touch the layers you need. For example, if you want to test for configuration drift, you don't need to go through all the layers of the graph; you go straight down to the raw configuration files and do your comparisons there. If you're testing reachability, you need a couple of layers: maybe the raw configuration layer, the data plane layer, and the control plane layer. It's structured so that when the agents make their calls to the system, they understand what the request is and can go to the right layer to pick up the information they need to execute on it. That's a high-level view of what the graph system looks like in layers.

Now I'm going to switch gears and go back to the system. Remember, I described a system that had agents, a knowledge graph, a digital twin, and a natural language interface. Let's talk about the agentic layer. Before I get to the specific agents in this application: we are looking at how to build a system based on open standards for the whole internet, and this is one of the challenges we're working on within Cisco. We are part of an open source collective that includes the partners you see down here: OutShift by Cisco, LangChain, Galileo, and other members who support the collective. What we're trying to do is set up a system that allows agents from across the world (it's a big vision) to talk to each other without the heavy lifting of reconstructing your agents every time you want to integrate them with another agent. It consists of identity, a schema framework for defining an agent's skills and capabilities, a directory where you store the agents, how you compose agents at both the semantic layer and the syntactic layer, and how you observe the agents in operation.
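The identity, skills-schema, and directory ideas might look something like the hypothetical descriptor below. Every field name here is invented for illustration; the actual specifications live with the collective, not in this sketch.

```python
# Hypothetical agent descriptor illustrating identity + skills schema +
# directory lookup. Field names are invented; this is not the real spec.

DESCRIPTOR = {
    "id": "org.example.impact-assessment-agent",   # stable agent identity
    "version": "0.1.0",
    "skills": [
        {"name": "assess_impact",
         "input_schema": {"ticket_id": "string"},
         "output_schema": {"summary": "string", "risk": "string"}},
    ],
    "protocols": ["mcp", "a2a"],                   # supported interop protocols
}

def find_agents(directory, skill):
    """Look up agents in a directory (a list of descriptors) by skill name."""
    return [d["id"] for d in directory
            if any(s["name"] == skill for s in d["skills"])]
```

The design point is that another agent can discover a collaborator by declared skill and validate inputs against its schema, instead of being hand-wired to it.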
All of this is part of the collective's vision, and if you want to learn more, it's at agntcy.org. I also have a slide here showing that there's real code you can leverage today. If you want to contribute, there's a GitHub repo you can go to, and you can start to contribute or use what's there. There's documentation available as well, and there are sample applications that let you see how this works in real life. And we know that MCP, A2A, and all of these protocols are becoming very popular; we integrate those protocols too, because the goal, again, is not to create something bespoke. We want it to be open, so that everyone can create agents and make those agents work in production environments.

So, back to the specific application we're talking about. Based on this framework, we delivered a set of agents; there are five right now in this application. There's an assistant agent, the planner, which orchestrates things across all of the other agents, and the other agents are all based on ReAct reasoning loops. One particular agent I want to call out is the query agent, the one that interacts directly with the knowledge graph on a regular basis. We had to fine-tune this agent: we initially attempted to use RAG to query the knowledge graph, but that was not working out well, so for immediate results we decided to fine-tune instead, using schema information as well as example queries. That reduced two things. Before, the AQL queries were going through all of the layers of the knowledge graph, and in a reasoning loop that consumed lots of tokens and took a long time to return results. After fine-tuning, we saw a drastic reduction in the number of tokens consumed as well as in the time it took to come back with results.

I'm going to pause here; there's a lot of slideware, and I want to show a quick demo of what this actually looks like, tying together everything from the natural language interface and the interaction with an ITSM system, to how the agents interact, to how the system collects information from the knowledge graph and delivers results to the customer.
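Before the demo: the fine-tuning data just described (schema hints plus example natural-language-to-AQL pairs) might look like this hypothetical sketch. The collection names, layer names, and AQL query are illustrative, not the production schema.

```python
# Sketch of supervised fine-tuning examples for the query agent: the graph
# schema goes in the system message so the model targets only the relevant
# layer, and each example pairs a question with a ready-made AQL query.
# All names here are invented for illustration.

SCHEMA_HINT = (
    "Collections: devices, interfaces, links. "
    "Layers: raw_config, data_plane, control_plane."
)

def make_example(question, aql):
    return {
        "messages": [
            {"role": "system", "content": SCHEMA_HINT},
            {"role": "user", "content": question},
            {"role": "assistant", "content": aql},
        ]
    }

examples = [
    make_example(
        "Show interfaces on router1 with MTU below 1500",
        "FOR i IN interfaces FILTER i.device == 'router1' "
        "AND i.config.mtu < 1500 RETURN i.name",
    ),
]
```

Teaching the model to emit a single targeted query like this, instead of reasoning its way through every graph layer, is what cuts both token usage and latency.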
The scenario we have here is a network engineer who wants to make a change to a firewall rule to accommodate a new server in the network. The first thing they do is start from the ITSM side: they submit a ticket in ServiceNow. The UI I'm showing right here is the UI of the application we've built, and we have ingested information about the ticket in natural language, so the agents can start to work on it. I'm going to play the video to make it more relatable.

The first thing that happens is that the first agent is asked to synthesize the ticket information into a summary, so we can quickly understand what to do. The next action is to create an impact assessment, which here just means understanding whether this change will have any implications beyond the immediate target area; that gets summarized too. We then ask the agent responsible for this task to attach the impact assessment to the ITSM ticket, and that's done. The next step is to create a test plan. Test planning is one of the biggest problems our customers face: they run a lot of tests but miss the right tests to run. This agent is able to reason through a lot of information about test plans from across the internet, and based on the intent collected from the ServiceNow ticket, it comes up with the list of tests you have to run to make sure this firewall rule change doesn't create problems in production. As you can see, the agent has listed all of the test cases that need to be run and the expected results for each test. We then ask the agent to attach this information back to the ITSM ticket, because that's where the approval board needs to see it before they approve the implementation of the change in production. You can see that the information has now been attached back to the ITSM ticket: two separate systems, but agents talking to each other.
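For a flavor of what "attach back to the ticket" can involve, here is a hedged sketch against ServiceNow's Attachment API (`POST /api/now/attachment/file`). The instance name, table, and sys_id are placeholders, and this only builds the request URL; an agent would then POST the report body with credentials.

```python
# Hedged sketch: building a ServiceNow Attachment API upload URL.
# Instance, table, and sys_id values are placeholders for illustration.
import urllib.parse

def attachment_request(instance, table, sys_id, file_name):
    """Return the Attachment API URL for uploading file_name to a record."""
    params = {"table_name": table, "table_sys_id": sys_id, "file_name": file_name}
    return (f"https://{instance}/api/now/attachment/file?"
            + urllib.parse.urlencode(params))

# An agent would then send the report, e.g. (not executed here):
# requests.post(url, auth=(user, pw),
#               headers={"Content-Type": "text/plain"}, data=report_text)
```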
The next step is to actually run all of these test cases. In this case, the configuration file that will be used to make the firewall change is sitting in a GitHub repo, so we open a pull request for that config file, take the link for the pull request, and paste it into the ITSM ticket, so that when the execution agent starts doing its job it can pull from there and use it to run its tests. At this moment we start running the tests. I'm going to ask the agent: I've attached my change candidate to the ticket; can you go ahead and run the tests? (Sorry, I don't have my glasses.)

If you look on the right-hand side of the screen, a series of things happens. First, the executor agent looks at the test cases and goes into the knowledge graph to take a snapshot of the most recent state of the network. It then takes the candidate config from the GitHub pull request and the snapshot it just took from the knowledge graph, puts them together, and runs the individual tests one at a time: test one, test two, test three, test four. All of this is happening in what we call the digital twin, which, again, is the combination of the knowledge graph and a set of tools you can use to run tests. An example of such a tool could be Batfish, or RouteNet, or other tools you use for engineering purposes. Once all of the tests are completed, the agent generates a report about the test results. I'll give it some time to run through this; it's still running the tests, but once it concludes, it reports what the results actually are: which tests passed and which ones failed.
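The execution loop just shown can be sketched as follows. This is a toy version: the "apply" step is a simple dict merge and the checks are plain predicates, whereas the real system would hand the combined state to tools such as Batfish inside the digital twin.

```python
# Illustrative sketch of the executor agent's loop: merge a knowledge-graph
# snapshot with the candidate config, run each test case against the merged
# state, and collect pass/fail results. Test logic here is toy.

def apply_candidate(snapshot, candidate):
    merged = dict(snapshot)
    merged.update(candidate)          # candidate config overrides the snapshot
    return merged

def run_tests(snapshot, candidate, test_cases):
    state = apply_candidate(snapshot, candidate)
    report = []
    for name, check in test_cases:    # test_cases: list of (name, predicate)
        try:
            passed = bool(check(state))
        except Exception:
            passed = False            # a crashing check counts as a failure
        report.append({"test": name, "passed": passed})
    return report
```

A report shaped like this is what gets attached back to the ITSM ticket, with failed entries driving the remediation recommendations.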
For the tests that failed, it makes recommendations of what you can do to fix the problem. I'm going to skip ahead here because of time. It has attached the results to the ticket, and this is the report it produces: the execution agent created a report covering all of the different test cases run by the system. So, that was a very quick, short demo. There's a lot of detail behind the scenes, and I can take questions offline.

A couple of things I want to leave you with before I close. Evaluation is very critical for us to understand how this delivers value to customers. We're looking at a variety of things: the agents themselves, the knowledge graph and digital twin, and what we can actually measure quantifiably. For the knowledge graph, we're looking at extrinsic metrics in particular, rather than intrinsic ones, because we want to map everything back to the customer's use case. This slide is the summary of what we see in terms of evaluation metrics.
We are still learning; this is an MVP for now. But what we've learned so far is that those two key building blocks, the knowledge graph and the open framework for building agents, are critical for building a scalable system for our customers. I'm going to stop here with eight seconds to go. Thank you for listening, and if you have questions, I'll be out there.