
Production software keeps breaking and it will only get worse — Anish Agarwal, Traversal.ai


Transcript

Hi, everyone. Thank you for coming to our talk. So, he was kind enough to already introduce us. So, I'm the CEO. Matthew was the first person who joined us. If you have any difficult questions, please direct them towards Matt. If you think about software engineering, at least as we see it, there are three major categories that show up.

The first is system design, where you think about how you actually architect a system. A lot of the talks we saw in this track have been about that in some sense at a high level. Second is actually developing software, putting in the business logic of your particular company and all of the DevOps that comes with it.

And then, when your software actually hits production, invariably it's going to break, and the third is troubleshooting those production incidents. At the heart of it, those are the three things, interspersed with each other, as you think about developing production-grade software. Now, with the magic of AI software engineering tools like Cursor or Windsurf or GitHub Copilot and many others, the development part is getting compressed.

That's the part we're relying on all these different systems to do for us, and increasingly so. Over the next year, I'd say more and more of what we see is going to make the development part of it really seamless. The question is, what happens to the other two parts of this entire software engineering workflow?

The first one being system design and the other one being troubleshooting. Now, the promise, as we think about what software engineering could look like, is that we'll get to focus on just the most high-impact and creative work that happens in engineering, which is system design.

So, we're hoping that AI will write our code for us and also troubleshoot all the production incidents. So, we just get to focus on the fun stuff, which is the creative work of how we put all these different pieces together. That's the hope. I actually think that to make that happen, there's something missing, right, which is the problem of troubleshooting.

How do you automate that part of it? And I think if we just continue the way we're going, it's actually going to look like the opposite. Most of our time, for the vast majority of us, is going to be spent doing on-call. And why is that?

I think troubleshooting is going to get more and more complex as we go along. First, as these AI software engineering systems write more and more of our code, humans are going to have less context about what's happening. They don't understand the inner workings of the code.

They don't have all the context in their minds, right? Second, we're going to push these systems to the limits. We're going to write more and more complex systems, just like the things we saw in the previous talks, right? So, the system's going to get more complex and we're going to have less understanding of it.

And as a result, troubleshooting is going to get really, really painful and complex. And I think that's where we're going to spend most of our time in this world, which is just QA and on-call, right? And that'll be kind of a sad existence for ourselves if that's what happens, right?

So, this, I think, is a grim reality if we don't do something about it. So, now, if you think about the workflow of troubleshooting, what does it look like? In my head, you'll have all of these different wonderful companies like Grafana, Datadog, Splunk, Elastic, Sentry, whatever it might be.

And what they do, essentially, is they help process data and help you visualize it, right? And anyone who's been in a Datadog dashboard, or whatever it might be, knows you have these beautiful dashboards, thousands of them, to give you some cut of the health of your system.

Now, something will break invariably in production. Then what happens next? The next step, as I see it, is what I call dashboard dumpster diving, right? You'll go through all these different thousands of dashboards to try to find the one that explains what happened. And you'll try to have many different people doing it in parallel.

At some point, you might come up with some sort of promising lead, like, okay, maybe that was it. That's the dashboard that kind of explains what happened. Or that's the log that kind of explains what happened. Then you want to connect it to some sort of change you made in your system, right?

As we think about root cause analysis, it's typically trying to connect it to some particular change in the system, whether it's a pull request or a change to your configuration files, whatever it might be, right? And that's the second stage of your troubleshooting, where you stare aggressively at your code base until some sort of inspiration hits you.

And then most of the time it doesn't hit you, and then you kind of bring more teams into the mix, because you know it's maybe not your issue, and suddenly you have 30, 40, 50, 100 people in a Slack incident channel trying to figure out what happened. And this loop keeps going on and on and on, right?

So the status quo of what production incident debugging looks like is clearly not optimal. And I think it's only going to get worse, for all the reasons I talked about earlier. And obviously this problem has been around since software has been written, and it's not like we're the first people to think about this issue, right?

The problem is that the existing approaches we've taken to troubleshooting, or to applying machine learning and AI to it, have not worked, and I don't believe they will work, for really fundamental reasons, right? So the first one is what we call AIOps, broadly speaking, which is using traditional machine learning and statistical anomaly detection techniques to help figure out what happened, right?

The problem is, if any of you have actually tried these techniques in production systems, they lead to too many false positives. Your system is too complex, too dynamic, and if you just try to come up with some sort of numerical representation of your data, it's not going to be representative enough.

And what you'll see is that you'll have thousands of alerts firing. Maybe one of them is useful, but you just don't know which one. So AIOps has, sadly, typically led to more noise than signal.
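
To make that concrete, here is a minimal sketch of the kind of threshold-based anomaly detection classic AIOps tooling leans on. The metric counts, thresholds, and Gaussian noise are illustrative assumptions, not any vendor's real implementation, but they show how alert volume explodes even on perfectly healthy data.

```python
import numpy as np

# Minimal sketch of 3-sigma anomaly detection over many healthy metrics.
# All counts and distributions are illustrative assumptions.
rng = np.random.default_rng(0)

def zscore_alerts(series: np.ndarray, threshold: float = 3.0) -> int:
    """Count points more than `threshold` standard deviations from the mean."""
    z = (series - series.mean()) / (series.std() + 1e-9)
    return int((np.abs(z) > threshold).sum())

# 1,000 healthy metrics, one day of minute-granularity samples each.
n_metrics, n_samples = 1_000, 1_440
total_alerts = sum(
    zscore_alerts(rng.normal(loc=100.0, scale=5.0, size=n_samples))
    for _ in range(n_metrics)
)
# Even on pure Gaussian noise a 3-sigma rule fires roughly 0.3% of the time,
# so a perfectly healthy fleet still produces thousands of "anomalies" per day.
print(f"alerts fired with nothing actually wrong: {total_alerts}")
```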

Okay, now we have this new world of LLMs. So one option is -- I'm sure many people here have taken a log and put it in ChatGPT and said, "Explain to me what happened. What's going on in this log?" Right? That's something I think all of us have done. Now, that's okay if you know which log you're going to look at, but if you're actually dealing with production-grade systems, you're going to have petabytes of data.

You might have a trillion logs. Which one do you look at? You can't take all of those trillion logs and put them in context. Right? Even if you have infinite context, it doesn't matter. These systems are so large that, forget about the context window, the data doesn't even fit into memory.

It doesn't even fit into a cluster. That's why you have such strict retention policies for logs and metrics, right? And second, the problem with these LLMs is that they don't have a very good understanding of the numerical side of the data. They have a good semantic understanding, but not a good numerical one, and the context isn't big enough anyway.
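
As a rough back-of-envelope check on that point, with the per-log sizes and context size below being assumptions for illustration:

```python
# Back-of-envelope arithmetic for "just put the logs in the context window".
# The per-log sizes and the context size are rough assumptions, not measurements.
logs = 1_000_000_000_000              # a trillion log lines
bytes_per_log = 200                   # assume ~200 bytes per line
print(f"raw log volume: {logs * bytes_per_log / 1e12:.0f} TB")         # ~200 TB

tokens_per_log = 50                   # assume ~50 tokens per line
context_window = 1_000_000            # a generous 1M-token context
contexts_needed = logs * tokens_per_log // context_window
print(f"context windows needed to read it all: {contexts_needed:,}")   # ~50 million
```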

Okay, now the third thing we might think about is, let's build an agent, right? Like, let's build a ReAct-style agent. The problem is that these ReAct-style agents assume you have access to some sort of runbook, some sort of meta-workflow that they can rely on to decide which tool to call next.

The problem is any runbook you actually put into place is deprecated the second you create it. I don't know how many of you have used runbooks, but typically what we've found, in my experience and the team's experience, is that they're deprecated by the time they're built, right? And so the workflow that the agent is going to go through is not going to be optimal.

And if you instead try to make it do a much broader search of your system with these simple tools, it's going to take too long, right? If you just put it into a tool-calling loop, it might take days to run, if it doesn't time out, right?
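
For reference, here is a bare-bones sketch of the ReAct-style tool-calling loop being described. `call_llm` and the two query tools are hypothetical placeholders rather than a real runbook or vendor API; the point is just that the agent takes one action per step.

```python
from typing import Callable

# Hypothetical tools; in reality these would hit a log store and a metrics store.
def query_logs(query: str) -> str: ...
def query_metrics(query: str) -> str: ...

TOOLS: dict[str, Callable[[str], str]] = {
    "query_logs": query_logs,
    "query_metrics": query_metrics,
}

def react_loop(call_llm: Callable[[str], dict], incident: str, max_steps: int = 50) -> str:
    """One action per step: reason, call a tool, observe the result, repeat."""
    scratchpad = f"Incident: {incident}\n"
    for _ in range(max_steps):
        step = call_llm(scratchpad)   # -> {"tool": ..., "input": ...} or {"answer": ...}
        if "answer" in step:
            return step["answer"]
        observation = TOOLS[step["tool"]](step["input"])
        scratchpad += f"Action: {step['tool']}({step['input']})\nObservation: {observation}\n"
    return "timed out"

# Serial, one tool call at a time: over petabytes of telemetry, this is why a
# naive loop can run for hours or days before it times out.
```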

And typically if you try to solve an incident, you need to solve it in two to five minutes for it to be really useful, right? So every minute counts when things are down. So as a result, none of these three things, if you do them just by themselves, is enough to really troubleshoot.

There are fundamental issues in all of them that I think need to be solved by thinking about them in a more collective way. And that's really what we're trying to do at Traversal: really good out-of-sample autonomous troubleshooting, which means that if you have a new incident we've never seen before, can you troubleshoot it from first principles?

And to do so, I think we need to combine a few different ideas. The first one being statistics, the second one being semantics, and then the third one being a novel agentic control flow. And by statistics, what I mean is causal machine learning. That's where a lot of our research came from.

So causal machine learning is built around the idea that correlation is not causation. How do you get these AI systems to pick up cause-and-effect relationships from data programmatically? That's what we spend a lot of time thinking about. And obviously that problem shows up a lot in production incidents, because typically when something breaks, a lot of things break around it, right?
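
Here is a toy illustration of why plain correlation can't separate the cause from its symptoms. The data and the onset-ordering heuristic are made up for illustration; real causal machine learning is far more involved than this.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(600)

# One service degrades first; two downstream services follow shortly after.
root = (t > 300).astype(float)
downstream_a = (t > 320).astype(float) + rng.normal(0, 0.05, t.size)
downstream_b = (t > 335).astype(float) + rng.normal(0, 0.05, t.size)
series = {"root": root, "downstream_a": downstream_a, "downstream_b": downstream_b}

# All three signals are almost perfectly correlated with each other...
print(np.corrcoef([root, downstream_a, downstream_b]).round(2))

# ...so correlation strength alone can't tell cause from symptom.
# A crude causal-flavoured heuristic: rank candidates by anomaly onset time.
def onset(x: np.ndarray, threshold: float = 0.5) -> int:
    return int(np.argmax(x > threshold))

ranked = sorted(series, key=lambda name: onset(series[name]))
print("most likely root cause:", ranked[0])   # -> "root"
```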

So there are a lot of correlated failures, which are not the root cause, and reasoning about that programmatically is what the study of causal machine learning is. The second piece is semantics, which is about pushing the limits of what these reasoning models can give you to help you understand the rich semantic context that exists in log fields, in the metadata of your metrics, and so on and so forth, right?

Or in code itself, right? And so by combining causal machine learning, which is the best of what statistics gives you, and reasoning models, which are the best of what semantics gives you, you now have at least the basic tools to put into place so that you can actually deal with this issue.
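
One minimal way to picture the combination is to score every candidate (a log pattern, a metric, a deployment) with both a statistical anomaly score and a semantic relevance score. The weighting and the keyword-overlap stand-in for a reasoning-model call below are illustrative assumptions, not Traversal's actual method.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    anomaly_score: float   # from the statistical / causal side, in [0, 1]
    evidence: str          # raw log line, metric name, PR title, ...

def semantic_score(incident_summary: str, evidence: str) -> float:
    """Stand-in for a reasoning-model call: crude keyword overlap in [0, 1]."""
    a, b = set(incident_summary.lower().split()), set(evidence.lower().split())
    return len(a & b) / max(len(a | b), 1)

def rank(incident_summary: str, candidates: list[Candidate]) -> list[Candidate]:
    """Blend both signals; the 50/50 weighting is arbitrary, for illustration only."""
    def score(c: Candidate) -> float:
        return 0.5 * c.anomaly_score + 0.5 * semantic_score(incident_summary, c.evidence)
    return sorted(candidates, key=score, reverse=True)
```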

Now we have to figure out how you actually make this work in an agentic system. And what we found is this idea of swarms of agents, where you have thousands of parallel agentic tool calls happening, giving you an exhaustive search of all of your telemetry in an efficient way.

That's what brings it together and helps you actually deal with this issue, right? So just to repeat: it's statistics, which is causal machine learning; semantics, which is these reasoning models; and this novel agentic control flow, which we call swarms of agents. All of this put together in the right, elegant way is what actually lets you troubleshoot autonomously, right?
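
A small sketch of the fan-out idea behind the swarm, assuming a hypothetical `investigate` coroutine that examines one slice of telemetry (one service, one log index, one time window). The concurrency cap and confidence filter are illustrative.

```python
import asyncio

async def investigate(slice_id: str) -> dict:
    """Hypothetical sub-agent examining one slice of telemetry."""
    await asyncio.sleep(0)   # stand-in for real tool calls and model reasoning
    return {"slice": slice_id, "finding": None, "confidence": 0.0}

async def swarm(slices: list[str], max_concurrency: int = 200) -> list[dict]:
    sem = asyncio.Semaphore(max_concurrency)   # cap concurrent tool calls

    async def bounded(s: str) -> dict:
        async with sem:
            return await investigate(s)

    results = await asyncio.gather(*(bounded(s) for s in slices))
    # Keep only findings worth escalating, strongest first.
    return sorted((r for r in results if r["confidence"] > 0.5),
                  key=lambda r: r["confidence"], reverse=True)

# e.g. asyncio.run(swarm([f"service-{i}" for i in range(5_000)]))
```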

A lot of this work is based on our years of research and experience as researchers; we've written a number of papers across these different fields. We're also relying on a lot of the research that has happened more broadly in the field over the last couple of years to actually make this a reality.

And so if you think about the troubleshooting workflow we talked about earlier, and all the pains of it, each of these pieces helps remove or alleviate some of those pains. The idea of finding the promising lead in your sea of information, that's where the agent swarm and causal machine learning, put together, really help.

And then, connecting that promising lead to a specific change in your system, whether it's a pull request or a changelog entry, that's where we rely on a lot of the work happening in code agents and also vector search. As those get better, we get better, right?
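
A sketch of that lead-to-change step using vector search: embed recent pull requests and config changes, embed the suspicious finding, and rank changes by cosine similarity. `embed` here is a crude hashing stand-in for a real embedding model, and the example strings are made up.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model: hash words into a small vector."""
    v = np.zeros(64)
    for word in text.lower().split():
        v[hash(word) % 64] += 1.0
    return v

def most_relevant_changes(finding: str, changes: list[str], top_k: int = 3) -> list[str]:
    q = embed(finding)
    scored = []
    for change in changes:
        c = embed(change)
        cosine = float(q @ c / (np.linalg.norm(q) * np.linalg.norm(c) + 1e-9))
        scored.append((cosine, change))
    return [change for _, change in sorted(scored, reverse=True)[:top_k]]

# e.g. most_relevant_changes(
#     "checkout service OOM-killed after the 14:02 deploy",
#     ["bump checkout cache size", "add retry logic to billing", "rotate TLS certs"],
# )
```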

And as a result, because agents are building this context in real time, they're pulling in the right team with the right context at the right time. People don't want to get pulled into an incident thinking, "I don't really know why I'm being pulled in here," which I'm sure many of you have experienced in your time as engineers.

And so now I'm going to hand it over to Matt to actually talk through a real-life case study that we've done with one of our customers. Hi, everybody. I'm Matt. As Anish said, thank you. No, I don't own any of their outfits. So I'm going to tell you a little bit about how I got involved with Traversal.

I'm going to show you how we are helping out some of our customers. So in my career, I spent most of my time in high-frequency trading. What I wanted to spend my time doing there was focusing on writing beautiful code. What I ended up doing, much more often than not, was debugging production incidents.

So I got sick of this, as you might understand, and very happily quit. Sometime later, I found myself in a class on causal AI. The guy teaching the class seemed pretty smart. Looking back then, it may have just been the fact that he was much better dressed than your average computer science professor, and he had a fancy accent.

But he mentioned in class one day, as sort of an offhand comment, that, oh, you might be able to use causal AI to automate incident remediation workflows. And I'm sitting in the back of the class, and my head explodes. Like, this sounds like an amazing idea. And so I casually come up to him at the end of class and say, you know, maybe we might want to do a research project on this.

And little did I know that, at the time, he had actually started a company to solve this problem. So he invited me to join him on that journey. And I'm very grateful for that. And so fast forward a year, and we've built something that we're quite proud of.

And I'll show you how it works. But to begin, I'm going to be talking about a client of ours, DigitalOcean. DigitalOcean is a cloud provider. They serve hundreds of thousands of customers every day. And I'm going to tell you about what the life of an on-call engineer was like prior to traversal.

So imagine you're in the middle of a productive workday, you're focused in, you're writing code. And then you get hit with a message describing something that's broken horribly. It's causing issues for customers. It might say something about a potential compromise of some host assigned to, you know, a bad application.

And my apologies in advance here, this is not a demo, per se. This is us telling you, in the real world, how we are solving issues for our customers. So apologies for the redactions that you're seeing here. So you get this context, and then you get thrown into an incident Slack channel.

And in this Slack channel, you and 40, 50, 60 other engineers begin frantically searching to find what's the cause of the customer issue, right? But you're not looking through some documents, you're not looking through a database table, you're looking through hundreds of millions of metrics, which are viewable on thousands of dashboards.

And beyond this, you also have tens of billions of logs that you might need to look through. And these are coming from thousands of services. And what's the thing you're looking for here? You're looking for something that's comparatively microscopic. So this might be a single log that describes the thing that went wrong, the root cause of the incident.

And you know, if you're lucky, after a few hours of everybody frantically searching, the incident gets resolved, and you know, everybody can go back to work. That is until the next production incident comes, and the cycle repeats itself. So this is a very familiar situation to me. Unfortunately, it may be a very similar situation to all of you.

But things have changed for DigitalOcean. What Traversal has been able to do for DigitalOcean is make their mission-critical infrastructure far more resilient. You'll see here that mean time to resolution, MTTR, has dropped pretty dramatically for DigitalOcean. We've measured about a 40% reduction in the amount of time it takes to find and resolve production incidents.

And all of those minutes mean a lot of engineer pain that has now been removed, and each minute of downtime also costs thousands of dollars. So now I'll show you how this works. In the world post-Traversal, rather than this frantic search going on all the time, when an incident kicks off, the ambient Traversal AI begins its investigation.

And it begins its investigation with the same little bit of context that engineers get when the incident starts. Given that small amount of context, Traversal orchestrates a swarm of expert AI SREs to sift through petabytes of observability data, all in parallel. And so what was previously done manually is done exhaustively and automatically.

And then, after about five minutes, Traversal comes back to the users right where they are, in the incident Slack channel, and tells them what happened. So here you can see that Traversal identified the issue: there was a deployment that introduced changes, causing a cascade of issues throughout the entire system.

And when engineers see this, they can roll it back, they can move on and get back to their good work. And you'll note here that the engineers at DigitalOcean confirmed that this was the thing that solved the issue. This was the correct finding. And this is happening all the time.

Engineers who want to dive in further can look in the Traversal UI, where you have a wealth of information that Traversal unearths describing what happened with the incident. It cites relevant observability data, just like any good engineer would. It gives you confidence levels for the potential root cause candidates.

It explains its reasoning. And if you want to dive in even further, Traversal allows you to interact with an AI-generated impact map. And you can even ask follow-up questions, like, "How does this incident impact the part of the stack that I care the most about?" And this entire experience is exactly the kind of thing that I was dying for in my previous roles.

And I'm so excited to share that it's now a reality. So - thank you. But this is not just happening for DigitalOcean. We're working across a very heterogeneous group of enterprise environments. So you'll see here - the thing that I want you to focus on here is that we're talking to basically every observability tool out there, and we're talking to them on a massive scale.

We're talking trillions of logs. So yes, it's an AI agents problem, and I like thinking about the AI agents problem. But in reality, this is just as much an AI infrastructure problem. Furthermore, the problem we're solving here is one where we have a massive set of data, and we're looking for the small piece of information that tells you everything you need to know.

This is sort of a needle-in-the-haystack problem. And we're focused right now on observability, but there are plenty of other domains where similar principles can be applied. In my experience, when you're doing network observability or working on cybersecurity, you find similar patterns, and I hope that you all, in your respective domains, can take away that these strategies of exhaustive search and swarms of agents can be applied there as well.

Not only is this an amazing problem to work on, frankly, we've assembled an amazing group of people to do it. We've got AI researchers from the top AI labs in the world that are pushing the limits of what AI can do. We have people from top dev tools companies who know what it takes to build a tool that developers love to use.

We have fantastic AI product engineers who know how to make delightful AI products. And, of course, we've got high-frequency quant finance traders who can deal with pain and know what the pain of production downtime actually means. But I think what's most special about the Traversal team is not necessarily the skills we bring to the table; it's something more unusual.

It's really rare to find a group of people who aren't all out there for themselves, where everybody's working for the betterment of each other. We have a lot of fun coming into work, and everybody loves showing up to the office. And that's what makes Traversal, in my opinion, an amazing place to work.

And so if the team or the problem sounds interesting to you, please scan the QR code, look at one of these websites, and help us try to create this picture of me that we have right here for all the engineers in the world. Thank you very much.