Production software keeps breaking and it will only get worse — Anish Agarwal, Traversal.ai

00:00:00.000 |
Hi, everyone. Thank you for coming to our talk. So, he was kind enough to already introduce us. 00:00:21.000 |
So, I'm the CEO. Matthew was the first person who joined us. If you have any difficult questions, 00:00:25.680 |
please direct them towards Matt. If you think about the three major categories of software 00:00:32.120 |
engineering, at least as we see it, there are three things. The first is system 00:00:35.780 |
design, where you think about how you actually architect a system. A lot of the talks we saw 00:00:40.560 |
in this track have been about that in some sense at a high level. Second is actually developing 00:00:45.420 |
software, putting in the business logic of your particular company and all of the DevOps 00:00:49.480 |
that comes with it. And then when your software actually hits production, invariably it's going 00:00:54.020 |
to break, and then the third is troubleshooting those production incidents. At the heart of it, those are 00:00:57.460 |
the three things interspersed with each other as you think about developing production-grade 00:01:01.460 |
software. Now, what's been happening with the magic of AI software engineering tools like 00:01:07.700 |
Cursor or Windsurf or GitHub Copilot and many others is that the part around development is getting 00:01:12.580 |
narrowed. That's the part we're relying on all these different systems to do for us, and 00:01:16.400 |
increasingly so. Over the next year, I'd say more and more of what we see is going to make 00:01:21.580 |
the development part of it really seamless. The question is what happens to the other 00:01:25.660 |
two parts of this entire software engineering workflow? The first one being system design 00:01:30.460 |
and the other one being troubleshooting. Now, the promise, as we think about 00:01:36.200 |
what software engineering could look like, is that we'll get to focus on just the most high 00:01:39.900 |
impact and creative work that happens in engineering, which is system design. So, we're hoping that 00:01:44.700 |
AI will write our code for us and also troubleshoot all the production incidents. So, we just get to 00:01:48.940 |
focus on the fun stuff, which is the creative work of how we put all these different pieces together. 00:01:52.300 |
That's the hope. I actually think that to make that happen, there's something missing, right, 00:01:56.940 |
which is the problem of troubleshooting. How do you automate that part of it? And I think if we just 00:02:01.500 |
continue in the way we're going, it's actually going to look like the opposite. It's that most of our time, 00:02:05.580 |
I think, is going to be spent doing on-call for the vast majority of us. And why is that? I think 00:02:10.300 |
troubleshooting is going to get more and more complex as we go along. First is that 00:02:15.580 |
as these AI software engineering systems write more and more of our code, 00:02:19.820 |
humans are going to have less context about what happened. They don't understand the inner workings 00:02:23.420 |
of the code. They don't have all the context in their minds, right? Second, we're going to push these 00:02:27.660 |
systems to the limits. We're going to write more and more complex systems, just like the things we saw in 00:02:31.100 |
the previous talks, right? So, the systems are going to get more complex and we're going to have less 00:02:34.540 |
understanding of it. And as a result, troubleshooting is going to get really, 00:02:37.980 |
really painful and complex. And I think that's where we're going to spend most of our time in 00:02:41.340 |
this world, which is just QA and on-call, right? And that'll be kind of a sad existence for ourselves 00:02:45.580 |
if that's what happens, right? So, this, I think, is a grim reality if we don't do something about it. 00:02:49.740 |
So, now, if you think about the workflow of troubleshooting, what does it look like? 00:02:55.660 |
In my head, you'll have all of these different wonderful companies like Grafana, Datadog, Splunk, Elastic, 00:03:03.340 |
Sentry, whatever it might be. And what they do, essentially, is they help process data and help 00:03:07.740 |
you visualize it, right? And so, for anyone who's been on a Datadog dashboard or whatever it might be, 00:03:12.620 |
you have these beautiful dashboards, thousands of them, to give you some cut as to what is the 00:03:16.940 |
health of your system. Now, something will break invariably in production. Then what happens next? 00:03:21.660 |
The next step, as I see it, is what I call dashboard dumpster diving, right? You'll go through all these 00:03:25.900 |
thousands of different dashboards to try to find the one that explains what happened. And you'll have many 00:03:29.900 |
different people doing it in parallel. At some point, you might come up with some sort of 00:03:33.260 |
promising lead, like, okay, maybe that was it. That's the dashboard that kind of explains what 00:03:36.300 |
happened. Or that's the log that kind of explains what happened. Then you want to connect it to some 00:03:40.460 |
sort of change you made in your system, right? As we think about root cause analysis, it's typically 00:03:44.220 |
trying to connect it to some particular change in the system, whether it's a pull request or a change 00:03:48.780 |
of your configuration files, whatever it might be, right? And that's where you start 00:03:52.460 |
the second stage of your troubleshooting, where you stare aggressively at your code base until, 00:03:56.300 |
you know, some sort of inspiration hits you. And then most of the time it doesn't hit you, 00:04:01.900 |
and then you kind of bring more teams into the mix, because you know it's maybe not your issue, 00:04:06.540 |
and suddenly you have 30, 40, 50, 100 people in a Slack incident channel trying to figure out what 00:04:11.500 |
happened. And this loop keeps going on and on and on, right? So the status quo of what 00:04:16.140 |
production incident debugging looks like clearly is not optimal. And I think it's only going to get worse 00:04:20.460 |
for all the reasons I talked about earlier. And obviously this problem has been around since 00:04:27.500 |
software has been written, and it's not like we're the first people to think about this 00:04:31.100 |
issue, right? The problem is that the existing approaches we've taken to troubleshooting or using 00:04:35.340 |
machine learning and AI towards it have not worked, and I don't believe will work, for really fundamental 00:04:39.820 |
reasons, right? So the first one is what we call AI ops, broadly speaking, which is using 00:04:44.220 |
traditional machine learning and statistical anomaly detection type techniques to help figure out what 00:04:49.180 |
happened, right? The problem is if any of you have actually tried these techniques in production 00:04:52.620 |
systems, it leads to too many false positives. Your system is too complex, too dynamic, and if you just 00:04:57.500 |
try to come up with some sort of numerical representation of your data, it's not going to be 00:05:01.180 |
representative enough. And what you'll see is that you'll have thousands of alerts happening. 00:05:05.820 |
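To make concrete why this tends to drown you in alerts, here is a minimal sketch of the kind of per-metric statistical check AI ops tools typically run; the metric names, window size, and threshold are assumptions for illustration, not taken from any particular product.

```python
import random
import statistics

# Minimal sketch of AI-ops-style anomaly detection: an independent z-score test
# per metric. Metric names, window size, and threshold are illustrative assumptions.
random.seed(0)
metrics = {
    f"service_{i}.latency_ms": [random.gauss(100, 15) for _ in range(500)]
    for i in range(2000)  # stand-in for thousands of dynamic production metrics
}

def is_anomalous(series, z_threshold=3.0):
    history, latest = series[:-1], series[-1]
    mu = statistics.mean(history)
    sigma = statistics.stdev(history) or 1.0  # guard against zero variance
    return abs(latest - mu) / sigma > z_threshold

alerts = [name for name, series in metrics.items() if is_anomalous(series)]
# Even on this well-behaved synthetic data, a 3-sigma test across 2,000 metrics
# fires several alerts purely by chance; real telemetry is far spikier, so the
# alert count explodes and the one useful alert is buried in the noise.
print(f"{len(alerts)} alerts fired out of {len(metrics)} metrics")
```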
Maybe one of them is useful, but you just don't know which one. So, sadly, more noise 00:05:09.260 |
than signal is what AI ops has typically led to. Okay, now we have this new world of 00:05:15.020 |
LLMs. Okay, so one option is -- I'm sure many people here have taken a log and put it in 00:05:20.780 |
ChatGPT and said, "Explain to me what happened. What's going on in this log?" Right? That's something I 00:05:24.460 |
think all of us have done. Now, that's okay if you know which log you're going to look at, but if you're 00:05:28.940 |
actually dealing with production-grade systems, you're going to have petabytes of data. You might have a 00:05:32.780 |
trillion logs. Which one do you look at? You can't take all of these different logs, the trillion logs, 00:05:37.980 |
and put them in context, right? Even if you have infinite context, it doesn't matter. The size of 00:05:42.380 |
these systems is so large that, forget about the context window, the data doesn't even fit into memory. 00:05:46.300 |
It doesn't even fit into a cluster. That's why you have such strict retention policies for data, 00:05:50.300 |
right, for logs and metrics. And second, the problem with these LLMs is that they don't have a very 00:05:54.060 |
good understanding of the numerical representation of the data. So they have a good semantic understanding, 00:05:58.540 |
but they don't have a good numerical understanding, and also the context isn't big enough. 00:06:02.220 |
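Just to put rough numbers on the context problem, here is a quick back-of-the-envelope calculation; the log count, tokens-per-log, and context-window size are assumptions for illustration, not measurements of any particular system.

```python
# Back-of-the-envelope: why "just paste the logs into the prompt" breaks down.
# All figures below are assumptions for illustration.
num_logs = 1_000_000_000_000   # roughly a trillion log lines
tokens_per_log = 50            # assume a modest structured log line
context_window = 1_000_000     # assume a generous 1M-token context window

total_tokens = num_logs * tokens_per_log
print(f"total tokens: {total_tokens:.2e}")                              # 5.00e+13
print(f"context windows needed: {total_tokens / context_window:,.0f}")  # 50,000,000

# At ~4 bytes per token that is on the order of 200 TB of raw text, which is why
# the data doesn't even fit in memory on one machine, let alone in a prompt.
```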
Okay, now the third thing we might think about is let's build an agent, right? Like, let's build a 00:06:06.780 |
ReAct-style agent. The problem with that is that if you think about these ReAct-style agents, what 00:06:11.500 |
they're going to do is they're going to assume you have some access to some sort of runbook, some sort 00:06:15.260 |
of meta-workflow that they can rely on to help decide what tool to call next. The problem 00:06:20.460 |
is any runbook you actually put into place is deprecated the second you create it. I don't know how many of 00:06:24.540 |
you have used runbooks, but what we've found in my experience and the team's experience is that 00:06:29.580 |
they're typically deprecated by the time they're built, right? And so that workflow that the agent is going 00:06:34.220 |
to go through is not going to be optimal. And also if you try to make it do a much more broad search of your 00:06:41.660 |
system, if you give it these simple tools, it's going to take too long, right? If you just put it into a 00:06:47.100 |
tool-calling loop, it might take days to run, if it doesn't time out, right? And typically if you try to solve an 00:06:53.580 |
incident, you need to solve it in two to five minutes for it to be really useful, right? So every minute counts when things are down. 00:06:59.180 |
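To make the latency point concrete, here is a rough sketch of a plain sequential ReAct-style loop over telemetry tools; the tool names, per-call latencies, the stubbed decision function, and the step count are all assumptions for illustration, not a real agent.

```python
# Rough sketch of a sequential ReAct-style loop: observe -> think -> pick one
# tool -> call it -> repeat. Tool names and latencies are illustrative assumptions.
TOOLS = {
    "query_logs": 8.0,            # assumed seconds per query against a big log store
    "query_metrics": 3.0,
    "fetch_recent_deploys": 2.0,
}

def decide_next_tool(scratchpad):
    # Stand-in for the reasoning-model call that reads the scratchpad (and a
    # runbook, if one exists) and picks the next tool. Here: naive round-robin.
    return list(TOOLS)[len(scratchpad) % len(TOOLS)]

def react_loop(steps=500):
    scratchpad, simulated_seconds = [], 0.0
    for _ in range(steps):
        tool = decide_next_tool(scratchpad)
        simulated_seconds += TOOLS[tool] + 2.0  # plus ~2s of model latency per step (assumed)
        scratchpad.append(f"called {tool}")
    return simulated_seconds

# A few hundred sequential steps, each touching one small slice of telemetry,
# already costs on the order of an hour -- far past a 2-5 minute incident budget.
print(f"~{react_loop() / 60:.0f} minutes of sequential tool calls")
```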
So as a result, none of these three things, if you do them just by themselves, is enough to really 00:07:03.980 |
troubleshoot. There are fundamental issues in all of them that I think need to be solved by thinking 00:07:08.140 |
about them in a more collective way. And that's really what we're trying to do at Traversal, right? 00:07:12.540 |
What we're trying to do at Traversal is really good out of 00:07:16.540 |
sample autonomous troubleshooting, which is that if you have a new incident, we've never seen it before, 00:07:20.940 |
can you troubleshoot it from first principles? And to do so, I think we need to combine a few different 00:07:25.660 |
ideas. The first one being statistics, the second one being semantics, and then the third one being a 00:07:30.860 |
novel agentic control flow. And with statistics, what I mean is causal machine learning. That's 00:07:35.740 |
where a lot of our research came from. So causal machine learning is built on this idea that 00:07:39.500 |
correlation is not causation. So how do you get these AI systems to pick up cause-and-effect relationships 00:07:43.820 |
from data programmatically? That's what we spend a lot of time thinking about. And obviously that problem 00:07:47.180 |
shows up a lot in production incidents, because typically when something breaks, a lot of things break 00:07:51.260 |
around it, right? So there are a lot of correlated failures, which are not the root cause. And so 00:07:54.780 |
thinking about it programmatically is what the study of causal machine learning is. The second being 00:07:58.700 |
semantics, right? Which is actually trying to push the limits of what these reasoning models can give 00:08:02.780 |
you to help you understand the rich semantic context that exists in log fields, in the metadata 00:08:08.700 |
source of the metric, and so on and so forth, right? Or in code itself, right? And so by combining 00:08:14.060 |
causal machine learning, which is the best of what statistics gives you, and reasoning models, which is the best of what 00:08:18.460 |
semantics gives you, you now have at least the tools, the basic tools to put into place so that you can 00:08:23.100 |
actually deal with this issue. Now we have to figure out how to actually make this work in an agentic 00:08:27.580 |
system. And what we found is this idea of swarms of agents, where you have these thousands of parallel 00:08:31.820 |
agentic tool calls happening, giving you this kind of exhaustive search over all of your telemetry in some 00:08:36.540 |
sort of efficient way. That's what brings it together and helps you actually deal with this issue, right? 00:08:40.780 |
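To give a rough flavor of what a swarm-style control flow can look like, here is a hedged sketch of the general fan-out-and-rank pattern; the partitioning, the toy scoring, and the agent stub are all made up for illustration and are not Traversal's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of a swarm-style control flow: fan out many small agent tasks over
# partitions of the telemetry in parallel, have each return scored root-cause
# candidates, then merge and rank. Everything here is a simplified stand-in.
TELEMETRY_PARTITIONS = [f"service-{i}" for i in range(1000)]  # assumed partitioning

def agent_investigate(partition):
    # Stand-in for one agent: a semantic read of this partition's logs and metrics
    # plus a statistical, causal-ML-style score of how likely it is to be the cause
    # rather than a correlated failure. The score here is faked for illustration.
    suspicious = partition == "service-42"
    return {
        "partition": partition,
        "candidate": f"deploy touching {partition}" if suspicious else None,
        "causal_score": 0.97 if suspicious else 0.05,
    }

def swarm_root_cause(partitions, max_workers=64):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        findings = list(pool.map(agent_investigate, partitions))
    candidates = [f for f in findings if f["candidate"]]
    return sorted(candidates, key=lambda f: f["causal_score"], reverse=True)

top_candidates = swarm_root_cause(TELEMETRY_PARTITIONS)
print(top_candidates[:3])  # highest-confidence findings, surfaced to the engineer in Slack
```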
So just to repeat, it's statistics, which is causal machine learning, semantics, which is these reasoning models, 00:08:46.700 |
and this novel agentic control flow, which we call the swarm of agents, all of this put together in 00:08:51.180 |
the right elegant way is what actually helps you autonomously troubleshoot, right? A lot of the work 00:08:56.140 |
we rely on is based on our years of experience as researchers, we've written a number 00:09:01.340 |
of papers in these various fields, and also on a lot of the research that has happened 00:09:05.500 |
in general in the field over the last couple of years, which we're using to actually make this 00:09:09.980 |
into a reality. And so if you think about the troubleshooting workflow we talked about earlier 00:09:15.660 |
and all the pains of it, each of the different things can help remove or alleviate some of those 00:09:19.500 |
pains. So the idea of finding the promising lead from your sea of information, that's where this 00:09:24.620 |
agent swarm and causal machine learning put together really help. And then from the promising lead, 00:09:28.940 |
connecting it to a specific change in your system, whether it's a pull request or change log, 00:09:32.860 |
that's where we rely on a lot of the work that's happening in code agents and also vector search. 00:09:36.380 |
As those get better, we get better, right? And as a result, because you're building this context in 00:09:41.100 |
real time and agents are doing it, they're pulling in the right team with the right context at the right 00:09:45.260 |
time. And so people don't get pulled into an incident being like, I don't really know why I'm 00:09:48.460 |
being pulled in here, which I'm sure many of you have experienced in your time as engineers. 00:09:52.460 |
And so now I'm going to hand it over to Matt to actually talk through a real-life case study 00:09:59.420 |
that we've done with one of our customers. Hi, everybody. I'm Matt. As Anish said, thank you. 00:10:05.340 |
No, I don't own any of their outfits. So I'm going to tell you a little bit about how I got involved 00:10:11.820 |
with Traversal. I'm going to show you how we are helping out some of our customers. So in my career, 00:10:17.660 |
I spent most of my time in high-frequency trading. What I wanted to spend my time doing there was 00:10:22.300 |
focusing and writing beautiful code. What I ended up doing much more often than not was debugging 00:10:29.020 |
production incidents. So I got sick of this, as you might understand, and very happily quit. 00:10:35.740 |
Sometime later, I found myself in a class on causal AI. The guy teaching the class seemed pretty smart. 00:10:43.820 |
Looking back then, it may have just been the fact that he was much better dressed than your average 00:10:47.740 |
computer science professor, and he had a fancy accent. But he mentioned in class one day, 00:10:54.860 |
sort of an offhand comment, that, oh, you might be able to use causal AI to automate incident 00:11:01.420 |
remediation workflows. And I'm sitting in the back of the class, and my head explodes. Like, 00:11:05.740 |
this sounds like an amazing idea. And so I casually come up to him at the end of class and say, you know, 00:11:10.300 |
maybe we might want to do a research project on this. And little did I know that at the time, 00:11:15.420 |
he had actually started a company to solve this problem. So he invited me to join him on that 00:11:19.020 |
journey. And I'm very grateful for that. And so fast forward a year, though, and 00:11:25.980 |
we built something that we're quite proud of. And I'll show you how it works. But to begin, 00:11:31.340 |
I'm going to be talking about a client of ours, DigitalOcean. DigitalOcean is a cloud provider. 00:11:36.860 |
They serve hundreds of thousands of customers every day. And I'm going to tell you about what the life 00:11:41.580 |
of an on-call engineer was like prior to Traversal. 00:11:47.020 |
So imagine you're in the middle of a productive workday, you're focused in, you're writing code. 00:11:51.740 |
And then you get hit with a message describing something that's broken horribly. It's causing 00:11:58.860 |
issues for customers. It might say something like, there's a potential compromise of some host 00:12:04.620 |
assigned to, you know, a bad application. And my apologies in advance here, this is not a demo, 00:12:09.740 |
per se. This is us telling you, in the real world, how we are solving issues for our customers. So 00:12:15.660 |
apologies for the redactions that you're seeing here. 00:12:18.060 |
So you get this context, and then you get thrown into an incident Slack channel. And in this Slack 00:12:25.340 |
channel, you and 40, 50, 60 other engineers begin frantically searching for the cause of 00:12:31.260 |
the customer issue, right? But you're not looking through some documents, you're not looking through 00:12:35.900 |
a database table, you're looking through hundreds of millions of metrics, which are viewable on thousands 00:12:42.860 |
of dashboards. And beyond this, you also have tens of billions of logs that you might need to search through. 00:12:49.340 |
And these are coming from thousands of services. And what's the thing you're looking for here? You're 00:12:53.740 |
looking for something that's comparatively microscopic. So this might be a single log that 00:12:59.580 |
describes the thing that went wrong, the root cause of the incident. And you know, if you're lucky, after a 00:13:04.460 |
few hours of everybody frantically searching, the incident gets resolved, and you know, 00:13:09.020 |
everybody can go back to work. That is until the next production incident comes, and the cycle repeats 00:13:15.500 |
itself. So this is a very familiar situation to me. Unfortunately, it may be a very similar situation 00:13:21.820 |
to all of you. But things have changed for DigitalOcean. So what Traversal's been able to do for 00:13:31.740 |
DigitalOcean is make their mission-critical infrastructure far more resilient. So you'll see 00:13:37.100 |
here that mean time to resolution, MTTR, has been reduced pretty dramatically for DigitalOcean. We've measured 00:13:43.100 |
about a 40% reduction in the amount of time that it takes to find and resolve production incidents. 00:13:48.300 |
And all of these minutes mean a lot of pain removed for engineers, and also 00:13:55.260 |
thousands of dollars saved for each minute. So now I'll show you how this works. 00:14:00.140 |
So in the world post-Traversal, rather than this frantic search going on all the time, when an 00:14:08.220 |
incident kicks off, the ambient Traversal AI begins its investigation. And the thing 00:14:15.260 |
that it begins its investigation with is the same little bit of context that engineers get when the 00:14:21.340 |
incident starts. And given the small amount of context, Traversal AI orchestrates a swarm of expert 00:14:27.500 |
AI SREs to sift through petabytes of observability data all in parallel. And so what was done previously 00:14:35.740 |
manually is done exhaustively and automatically. And then after about five minutes, Traversal comes back to the 00:14:43.740 |
users right where they are in the incident Slack channel and tells them what happened. 00:14:47.580 |
So here you can see that Traversal identified an issue. There was a deployment that introduced changes, causing 00:14:54.700 |
a cascade of issues throughout the entire system. And when engineers see this, they can roll it back, 00:14:59.260 |
they can move on and get back to their good work. And you'll note here that the engineers 00:15:05.180 |
of DigitalOcean noted that this was the thing that solved the issue. This was the correct finding. 00:15:09.180 |
And this is happening all the time. Engineers who want to dive in further can look in the Traversal UI, 00:15:14.220 |
where you have a wealth of information that Traversal unearths describing what happened with the incident. 00:15:19.660 |
It cites relevant observability data, just like any good engineer would. It gives you confidence levels 00:15:24.860 |
for the potential root cause candidates. It explains its reasoning. And if you want to dive in even further, 00:15:29.980 |
Traversal allows you to interact with an AI-generated impact map. And you can even ask follow-up questions 00:15:35.820 |
to Traversal, like, "How does this incident impact the part of the stack that I care the 00:15:41.500 |
most about?" And this entire experience is exactly the kind of thing that I was dying for in my previous 00:15:47.260 |
roles. And I'm so excited to share that it's now a reality. So - thank you. But this is not just 00:15:55.180 |
happening for DigitalOcean. We're working across a very heterogeneous group of enterprise environments. 00:16:02.060 |
So you'll see here - the thing that I want you to focus on here is that we're talking to basically 00:16:06.780 |
every observability tool out there, and we're talking to them on a massive scale. We're talking 00:16:12.140 |
trillions of logs. So yes, it's an AI agents problem, and I like thinking about the AI agents 00:16:17.340 |
problem. But in reality, this is just as much an AI infrastructure problem. 00:16:21.260 |
Furthermore, the problem that we're solving here is one where we have a massive set of data, 00:16:29.500 |
and we're looking for this small piece of information that's telling you everything that 00:16:34.140 |
you need to know. This is sort of a needle in the haystack problem. And we're focused right now on 00:16:38.700 |
observability. But there are plenty of other domains where there are similar principles that can be 00:16:43.180 |
applied. In my experience, when you're doing network observability or you're working on cyber 00:16:48.060 |
security, you find similar patterns, and I hope that you all in your respective domains can 00:16:53.980 |
take from this that these strategies of exhaustive search and swarms of agents can be applied in those 00:16:59.820 |
domains as well. Not only is this an amazing problem to work on, frankly, we've assembled an amazing 00:17:06.140 |
group of people to do it. We've got AI researchers from the top AI labs in the world that are pushing 00:17:11.100 |
the limits of what AI can do. We have people from top dev tools companies who know what it takes to 00:17:17.100 |
build a tool that developers love to use. We have fantastic AI product engineers who know how to make 00:17:22.700 |
delightful AI products. And, of course, we've got high-frequency quant finance traders who can deal 00:17:27.100 |
with pain and know what the pain of production downtime actually means. But I think what's most 00:17:33.420 |
special about the Traversal team is not necessarily the skills that we bring to the table. It's 00:17:39.340 |
really unusual, really rare, to be able to find a group of people who aren't all out there 00:17:44.060 |
for themselves. And everybody's working for the betterment of each other. And we have a lot of fun 00:17:48.220 |
coming into work, and everybody loves showing up to the office. And this is what makes Traversal, 00:17:52.380 |
in my opinion, an amazing place to work. And so if the team or the problem sounds interesting to you, 00:17:57.820 |
please scan the QR code, look at one of these websites, and help us try to create this picture 00:18:02.780 |
of me that we have right here for all the engineers in the world. Thank you very much.