
Production software keeps breaking and it will only get worse — Anish Agarwal, Traversal.ai



00:00:00.000 | Hi, everyone. Thank you for coming to our talk. So, he was kind enough to already introduce us.
00:00:21.000 | So, I'm the CEO. Matthew was the first person who joined us. If you have any difficult questions,
00:00:25.680 | please direct them towards Matt. If you think about the three major categories of software
00:00:32.120 | engineering, at least as we see it, there are three things that stand out. First is system
00:00:35.780 | design, where you think about how you actually architect a system. A lot of the talks we saw
00:00:40.560 | in this track have been about that in some sense at a high level. Second is actually developing
00:00:45.420 | software, putting in the business logic of your particular company and all of the DevOps
00:00:49.480 | that comes with it. And then when your software actually hits production, invariably it's going
00:00:54.020 | to break and then troubleshooting of these production incidents. At the heart of it, those are
00:00:57.460 | the three things interspersed with each other as you think about developing production-grade
00:01:01.460 | software. Now, what's been happening with the magic of AI software engineering tools like
00:01:07.700 | Cursor or Windsurf or GitHub Copilot and many others, the part around development is getting
00:01:12.580 | narrowed. That's the part we're relying on all these different systems to do for us, and
00:01:16.400 | increasingly so. Over the next year, I'd say more and more of what we see is going to make
00:01:21.580 | the development part of it really seamless. The question is what happens to the other
00:01:25.660 | two parts of this entire software engineering workflow? The first one being system design
00:01:30.460 | and the other one being troubleshooting. Now, the promise, as we think about
00:01:36.200 | what software engineering could look like, is that we'll get to focus on just the most
00:01:39.900 | high-impact and creative work that happens in engineering, which is system design. So, we're hoping that
00:01:44.700 | AI will write our code for us and also troubleshoot all the production incidents. So, we just get to
00:01:48.940 | focus on the fun stuff, which is the creative work of how we put all these different pieces together.
00:01:52.300 | That's the hope. I actually think that to make that happen, there's something missing, right,
00:01:56.940 | which is the problem of troubleshooting. How do you automate that part of it? And I think if we just
00:02:01.500 | continue in the way we're going, it's actually going to look like the opposite: most of our time,
00:02:05.580 | I think, is going to be spent doing on-call for the vast majority of us. And why is that? I think
00:02:10.300 | troubleshooting is going to get more and more complex as we go along. First is that
00:02:15.580 | as these AI software engineering systems write more and more of our code,
00:02:19.820 | humans are going to have less context about what happened. They don't understand the inner workings
00:02:23.420 | of the code. They don't have all the context in their minds, right? Second, we're going to push these
00:02:27.660 | systems to the limits. We're going to write more and more complex systems, just like the things we saw in
00:02:31.100 | the previous talks, right? So, the system's going to get more complex and we're going to have less
00:02:34.540 | understanding of it. And as a result, troubleshooting is going to get really,
00:02:37.980 | really painful and complex. And I think that's where we're going to spend most of our time in
00:02:41.340 | this world, which is just QA and on-call, right? And that'll be kind of a sad existence for ourselves
00:02:45.580 | if that's what happens, right? So, this, I think, is a grim reality if we don't do something about it.
00:02:49.740 | So, now, if you think about the workflow of troubleshooting, what does it look like?
00:02:55.660 | In my head, you'll have all of these different wonderful companies like Grafana, Datadog, Splunk, Elastic,
00:03:03.340 | Sentry, whatever it might be. And what they do, essentially, is they help process data and help
00:03:07.740 | you visualize it, right? And anyone who's been on a Datadog dashboard, or whatever it might be, knows
00:03:12.620 | you have these beautiful dashboards, thousands of them, each giving you some cut of the
00:03:16.940 | health of your system. Now, something will invariably break in production. Then what happens next?
00:03:21.660 | The next step, as I see it, is what I call dashboard dumpster diving, right? You'll go through all these
00:03:25.900 | thousands of different dashboards to try to find the one that explains what happened. And you'll have many
00:03:29.900 | different people doing it in parallel. At some point, you might come up with some sort of
00:03:33.260 | promising lead, like, okay, maybe that was it. That's the dashboard that kind of explains what
00:03:36.300 | happened. Or that's the log that kind of explains what happened. Then you want to connect it to some
00:03:40.460 | sort of change you made in your system, right? As we think about root cause analysis, it's typically
00:03:44.220 | trying to connect it to some particular change in the system, whether it's a pull request or a change
00:03:48.780 | of your configuration files, whatever it might be, right? And that's
00:03:52.460 | the second stage of your troubleshooting, where you stare aggressively at your code base until
00:03:56.300 | some sort of inspiration hits you. And then most of the time it doesn't hit you,
00:04:01.900 | and then you kind of bring more teams into the mix, because you know it's maybe not your issue,
00:04:06.540 | and suddenly you have 30, 40, 50, 100 people in a Slack incident channel trying to figure out what
00:04:11.500 | happened. And this loop keeps going on and on and on, right? So the status quo of what
00:04:16.140 | production incident debugging looks like is clearly not optimal. And I think it's only going to get worse
00:04:20.460 | for all the reasons I talked about earlier. And obviously this problem has been around since
00:04:27.500 | software has been written, and it's not like we're the first people to think about this
00:04:31.100 | issue, right? The problem is that the existing approaches we've taken to troubleshooting or using
00:04:35.340 | machine learning and AI for it have not worked, and I don't believe will work, for really fundamental
00:04:39.820 | reasons, right? So the first one is what we call AI ops, broadly speaking, which is we're using
00:04:44.220 | traditional machine learning and statistical anomaly detection type techniques to help figure out what
00:04:49.180 | happened, right? The problem is if any of you have actually tried these techniques in production
00:04:52.620 | systems, it leads to too many false positives. Your system is too complex, too dynamic, and if you just
00:04:57.500 | try to come up with some sort of numerical representation of your data, it's not going to be
00:05:01.180 | representative enough. And what you'll see is that you'll have thousands of alerts happening.
00:05:05.820 | Maybe one of them is useful, but you just don't know which one. So, sadly, AI ops has typically led
00:05:09.260 | to more noise than signal.
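
To make that false-positive flood concrete, here is a minimal sketch (not anything Traversal ships) of per-metric z-score anomaly detection run over purely synthetic, healthy data; the metric names and numbers are invented. With thousands of metrics and a fixed 3-sigma threshold, nearly every healthy metric still raises at least one alert over a day:

```python
# Minimal sketch: naive per-metric anomaly detection over synthetic, healthy data.
# Metric names and values are made up; the point is the false-positive flood.
import random

random.seed(0)

def zscore_alerts(series, threshold=3.0):
    """Return indices where a value is more than `threshold` std devs from the mean."""
    mean = sum(series) / len(series)
    std = (sum((x - mean) ** 2 for x in series) / len(series)) ** 0.5 or 1.0
    return [i for i, x in enumerate(series) if abs(x - mean) / std > threshold]

# 2,000 perfectly healthy metrics, one sample per minute for a day.
metrics = {f"service_{i}/latency_p99": [random.gauss(100, 10) for _ in range(1440)]
           for i in range(2000)}

alerting = sum(1 for series in metrics.values() if zscore_alerts(series))
print(f"{alerting} of {len(metrics)} healthy metrics raised at least one 3-sigma alert")
```
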
00:05:15.020 | Okay, now we have this new world of LLMs. So one option: I'm sure many people here have taken a log and put it in
00:05:20.780 | ChatGPT and said, "Explain to me what happened. What's going on in this log?" Right? That's something I
00:05:24.460 | think all of us have done. Now, that's okay if you know which log you're going to look at, but if you're
00:05:28.940 | actually dealing with production-grade systems, you're going to have petabytes of data. You might have a
00:05:32.780 | trillion logs. Which one do I look at? You can't take all of these different logs, the trillion logs,
00:05:37.980 | and put it in context. Right? Even if you have infinite context, it doesn't matter. The size of
00:05:42.380 | these systems are so large that, forget about the context window, it doesn't even fit into memory.
00:05:46.300 | It doesn't even fit into a cluster. That's why you have such difficult retention policies for data,
00:05:50.300 | right, for logs and metrics. And second, the problem with these LLMs is that they don't have a very
00:05:54.060 | good understanding of the numerical representation of the data. They have a good semantic understanding,
00:05:58.540 | but they don't have a good numerical one, and the context isn't big enough.
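
A rough back-of-envelope calculation, with assumed and purely illustrative numbers, shows why "paste the logs into the model" cannot work at this scale regardless of the prompt:

```python
# Back-of-envelope sketch with assumed numbers; nothing here is measured from a real system.
avg_log_line_bytes = 200             # a modest structured log line
daily_log_lines = 1_000_000_000_000  # "a trillion logs", as in the talk
bytes_per_token = 4                  # rough rule of thumb for English-like text

total_bytes = avg_log_line_bytes * daily_log_lines
total_tokens = total_bytes / bytes_per_token
context_window = 1_000_000           # a generous million-token context window

print(f"~{total_bytes / 1e12:,.0f} TB of logs per day, ~{total_tokens:.1e} tokens")
print(f"that is ~{total_tokens / context_window:,.0f}x a 1M-token context window")
```
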
00:06:02.220 | Okay, now the third thing we might think about is let's build an agent, right? Like, let's build a
00:06:06.780 | ReAct-style agent. The problem is that if you think about these ReAct-style agents, what
00:06:11.500 | they're going to do is they're going to assume you have some access to some sort of runbook, some sort
00:06:15.260 | of meta workflow that they can rely on to help you make a decision of what next tool to call. The problem
00:06:20.460 | is any runbook you actually put into place is deprecated the second you create it. I don't know how many of
00:06:24.540 | you have used runbooks, but what we've found in my experience and the team's experience is that
00:06:29.580 | they're typically deprecated by the time they're built, right? And so that workflow that the agent is going
00:06:34.220 | to go through is not going to be optimal. And also if you try to make it do a much more broad search of your
00:06:41.660 | system, if you give it these simple tools, it's going to take too long, right? If you just put it into a
00:06:47.100 | tool-calling loop, it might take days to run, if it doesn't time out, right? And typically if you try to solve an
00:06:53.580 | incident, you need to solve it in two to five minutes for it to be really useful, right? So every minute counts when things are down.
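
The latency problem follows from the shape of the loop itself. Below is a hypothetical sketch of a ReAct-style agent; `call_llm` and `run_tool` are stand-ins rather than real APIs. Because each tool call waits on the previous reasoning step, wall-clock time grows linearly with the number of steps and blows past a two-to-five-minute budget:

```python
# Hypothetical sketch of a ReAct-style loop; call_llm and run_tool are stand-ins, not real APIs.
def call_llm(history):
    """Stand-in for a reasoning model choosing the next tool call from the transcript so far."""
    return {"tool": "query_logs", "args": {"service": "checkout", "window": "15m"}}

def run_tool(action):
    """Stand-in for one observability query; real queries often take tens of seconds."""
    return f"results for {action['args']}"

def react_loop(incident, max_steps=50, seconds_per_step=30):
    history = [incident]
    for _ in range(max_steps):
        action = call_llm(history)        # reason: decide the next tool call
        history.append(run_tool(action))  # act, then observe the result
    minutes = max_steps * seconds_per_step / 60
    print(f"{max_steps} sequential steps at ~{seconds_per_step}s each: ~{minutes:.0f} minutes")

react_loop("payment errors spiking")  # ~25 minutes, and that's being generous
```
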
00:06:59.180 | So as a result, none of these three things, if you do them just by themselves, is enough to really
00:07:03.980 | troubleshoot. There are fundamental issues in all of them that I think need to be solved by thinking
00:07:08.140 | about them in a more collective way. And that's really what we're trying to do at Traversal, right?
00:07:12.540 | What we're trying to do at Traversal is really good out-of-sample
00:07:16.540 | autonomous troubleshooting, which is that if you have a new incident we've never seen before,
00:07:20.940 | can you troubleshoot it from first principles? And to do so, I think we need to combine a few different
00:07:25.660 | ideas. The first one being statistical, the second one being semantics, and then the third one being a
00:07:30.860 | novel agentic control flow. And with statistics, what I mean is causal machine learning. So that's
00:07:35.740 | where a lot of our research came from. The idea of causal machine learning is the idea that
00:07:39.500 | correlation is not causation. So how do you get these AI systems to pick up cause-and-effect relationships
00:07:43.820 | from data programmatically? That's what we spend a lot of time thinking about. And obviously that problem
00:07:47.180 | shows up a lot in production incidents, because typically when something breaks, a lot of things break
00:07:51.260 | around it, right? So there are a lot of correlated failures, which are not the root cause. And
00:07:54.780 | thinking about that programmatically is what the study of causal machine learning is about.
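
As a toy illustration of correlated failures (this is deliberately not Traversal's causal machine learning, just a crude structural heuristic over an invented service graph): when a shared dependency breaks, every caller above it looks anomalous too, and pure correlation would flag them all. Preferring the anomalous service whose own dependencies all look healthy recovers the sensible candidate:

```python
# Toy illustration only. Service names, edges, and the anomaly set are all invented.
calls = {  # calls[s] = the services that s depends on
    "frontend":  ["checkout", "search"],
    "checkout":  ["payments", "inventory"],
    "search":    ["inventory"],
    "payments":  ["postgres"],
    "inventory": ["postgres"],
    "postgres":  [],
}

anomalous = {"frontend", "checkout", "payments", "postgres"}  # correlated failures

# Keep only the anomalous services that cannot be explained by an anomalous dependency.
root_candidates = [s for s in anomalous
                   if not any(dep in anomalous for dep in calls[s])]
print(sorted(root_candidates))  # ['postgres']
```
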
00:07:58.700 | The second is semantics, which is actually trying to push the limits of what these reasoning models can give
00:08:02.780 | you, to help you understand the rich semantic context that exists in log fields, in the metadata
00:08:08.700 | of the metric source, and so on and so forth, or in code itself. And so by combining
00:08:14.060 | causal machine learning, which is the best of what statistics gives you, and reasoning models, which is the best of what
00:08:18.460 | semantics gives you, you now have at least the tools, the basic tools to put into place so that you can
00:08:23.100 | actually deal with this issue. Now we have to figure out how do you actually make this work in an agentic
00:08:27.580 | system. And what we found is this idea of a swarm of agents, where you have thousands of parallel
00:08:31.820 | agentic tool calls happening, giving you an exhaustive search across all of your telemetry in some
00:08:36.540 | sort of efficient way. That's what brings it together and helps you actually deal with this issue, right?
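
The shape of that control flow is essentially fan out, then aggregate. Here is a minimal sketch under heavy assumptions: the "agents" are plain functions over invented partition names, where real workers would be LLM-driven agents querying Datadog, Splunk, Elastic, and so on in parallel:

```python
# Minimal fan-out/aggregate sketch. investigate_partition is a stand-in for an LLM-driven agent.
from concurrent.futures import ThreadPoolExecutor

def investigate_partition(partition: str) -> dict:
    """One 'agent': scan a slice of telemetry and return a scored finding (placeholder logic)."""
    suspicious = "payments" in partition
    return {"partition": partition, "score": 0.9 if suspicious else 0.1}

partitions = [f"{svc}/{signal}"
              for svc in ("frontend", "checkout", "payments", "inventory")
              for signal in ("logs", "metrics", "traces")]

# Breadth comes from running many of these concurrently rather than one tool call at a time.
with ThreadPoolExecutor(max_workers=32) as pool:
    findings = list(pool.map(investigate_partition, partitions))

leads = sorted(findings, key=lambda f: f["score"], reverse=True)[:3]
print(leads)  # the highest-scoring slices become the promising leads to dig into
```
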
00:08:40.780 | So just to repeat, it's statistics, which is causal machine learning, semantics, which is these reasoning models,
00:08:46.700 | and this novel agentic control flow, which we call the swarm of agents, all of this put together in
00:08:51.180 | the right, elegant way, is what actually helps you autonomously troubleshoot, right? A lot of this
00:08:56.140 | is based on our years of experience as researchers; we've written a number
00:09:01.340 | of papers across these different fields, and we're also relying on a lot of the research that has happened
00:09:05.500 | in the field in general over the last couple of years to actually make this
00:09:09.980 | into a reality. And so if you think about the troubleshooting workflow we talked about earlier
00:09:15.660 | and all the pains of it, each of the different things can help remove or alleviate some of those
00:09:19.500 | pains. So the idea of finding the promising lead from your sea of information, that's where this
00:09:24.620 | agent swarm and causal machine learning put together really help. And then from the promising lead,
00:09:28.940 | connecting it to a specific change in your system, whether it's a pull request or change log,
00:09:32.860 | that's where we're relying on a lot of the work that's happening in code agents and also vector search.
00:09:36.380 | As those get better, we get better, right? And as a result, because you're building this context in
00:09:41.100 | real time and agents are doing it, they're pulling in the right team with the right context at the right
00:09:45.260 | time. And so people don't get pulled into an incident thinking, I don't really know why I'm
00:09:48.460 | being pulled in here, which I'm sure many of you have experienced in your time as engineers.
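
One simple way to picture the "connect the lead to a change" step is a similarity search of the suspicious signal against recent changes. The sketch below is hedged heavily: it uses a bag-of-words cosine similarity as a stand-in for learned embeddings, and the PR titles and log line are invented:

```python
# Illustrative sketch only: rank recent changes by similarity to a suspicious log line.
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Bag-of-words stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

recent_changes = [  # hypothetical pull requests from the last day
    "PR 1412: bump payments service connection pool size",
    "PR 1415: tighten postgres connection timeout for payments",
    "PR 1417: update frontend banner copy",
]

suspicious_log = "payments: connection timeout talking to postgres after 50ms"

ranked = sorted(recent_changes,
                key=lambda pr: cosine(embed(pr), embed(suspicious_log)),
                reverse=True)
print(ranked[0])  # the change most similar to the failing behavior surfaces first
```
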
00:09:52.460 | And so now I'm going to hand it over to Matt to actually talk through a real-life case study
00:09:59.420 | that we've done with one of our customers. Hi, everybody. I'm Matt. As Anish said, thank you.
00:10:05.340 | No, I don't own any of their outfits. So I'm going to tell you a little bit about how I got involved
00:10:11.820 | with Traversal. I'm going to show you how we are helping out some of our customers. So in my career,
00:10:17.660 | I spent most of my time in high-frequency trading. What I wanted to spend my time doing there was
00:10:22.300 | focusing on writing beautiful code. What I ended up doing much more often than not was debugging
00:10:29.020 | production incidents. So I got sick of this, as you might understand, and very happily quit.
00:10:35.740 | Sometime later, I found myself in a class on causal AI. The guy teaching the class seemed pretty smart.
00:10:43.820 | Looking back then, it may have just been the fact that he was much better dressed than your average
00:10:47.740 | computer science professor, and he had a fancy accent. But he mentioned in class one day,
00:10:54.860 | sort of as an offhand comment, that, oh, you might be able to use causal AI to automate incident
00:11:01.420 | remediation workflows. And I'm sitting in the back of the class, and my head explodes. Like,
00:11:05.740 | this sounds like an amazing idea. And so I casually come up to him at the end of class and say, you know,
00:11:10.300 | maybe we might want to do a research project on this. And little did I know that at the time,
00:11:15.420 | he had actually started a company to solve this problem. So he invited me to join him on that
00:11:19.020 | journey. And I'm very grateful for that. And so fast forward a year, though, and
00:11:25.980 | we've built something that we're quite proud of. And I'll show you how it works. But to begin,
00:11:31.340 | I'm going to be talking about a client of ours, DigitalOcean. DigitalOcean is a cloud provider.
00:11:36.860 | They serve hundreds of thousands of customers every day. And I'm going to tell you about what the life
00:11:41.580 | of an on-call engineer was like prior to traversal.
00:11:47.020 | So imagine you're in the middle of a productive workday, you're focused in, you're writing code.
00:11:51.740 | And then you get hit with a message describing something that's broken horribly. It's causing
00:11:58.860 | issues for customers. It might say something like, there's some potential compromise of some host
00:12:04.620 | assigned to, you know, a bad application. And my apologies in advance here, this is not a demo,
00:12:09.740 | per se. This is us telling you, in the real world, how we are solving issues for our customers. So
00:12:15.660 | apologies for the redactions that you're seeing here.
00:12:18.060 | So you get this context, and then you get thrown into an incident Slack channel. And in this Slack
00:12:25.340 | channel, you and 40, 50, 60 other engineers begin frantically searching to find what's the cause of
00:12:31.260 | the customer issue, right? But you're not looking through some documents, you're not looking through
00:12:35.900 | a database table, you're looking through hundreds of millions of metrics, which are viewable on thousands
00:12:42.860 | of dashboards. And beyond this, you also have tens of billions of logs that you might need to search through.
00:12:49.340 | And these are coming from thousands of services. And what's the thing you're looking for here? You're
00:12:53.740 | looking for something that's comparatively microscopic. So this might be a single log that
00:12:59.580 | describes the thing that went wrong, the root cause of the incident. And you know, if you're lucky, after a
00:13:04.460 | few hours of everybody frantically searching, the incident gets resolved, and you know,
00:13:09.020 | everybody can go back to work. That is until the next production incident comes, and the cycle repeats
00:13:15.500 | itself. So this is a very familiar situation to me. Unfortunately, it may be a very similar situation
00:13:21.820 | for all of you. But things have changed for DigitalOcean. So what Traversal's been able to do for
00:13:31.740 | DigitalOcean is make their mission-critical infrastructure far more resilient. So you'll see
00:13:37.100 | here that mean time to resolution, MTTR, has been reduced pretty dramatically for DigitalOcean. We've measured
00:13:43.100 | about a 40% reduction in the amount of time that it takes to find and resolve production incidents.
00:13:48.300 | And all of these minutes mean a lot of pain for an engineer that's now been removed, and also
00:13:55.260 | thousands of dollars for each minute. So now I'll be able to show you how this works.
00:14:00.140 | So in the world post-Traversal, rather than this frantic search going on all the time, when an
00:14:08.220 | incident kicks off, the ambient Traversal AI begins its investigation. And the thing
00:14:15.260 | that it begins its investigation with is the same little bit of context that engineers get when the
00:14:21.340 | incident starts. And given the small amount of context, Traversal AI orchestrates a swarm of expert
00:14:27.500 | AI SREs to sift through petabytes of observability data all in parallel. And so what was done previously
00:14:35.740 | manually is done exhaustively and automatically. And then after about five minutes, Traversal comes back to the
00:14:43.740 | users right where they are in the incident Slack channel and tells them what happened.
00:14:47.580 | So here you can see that Traversal identified an issue. There was a deployment that introduced
00:14:54.700 | a cascade of issues throughout the entire system. And when engineers see this, they can roll it back,
00:14:59.260 | they can move on and get back to their good work. And you'll note here that the engineers
00:15:05.180 | of DigitalOcean confirmed that this was the thing that solved the issue. This was the correct finding.
00:15:09.180 | And this is happening all the time. Engineers who want to dive in further can look in the Traversal UI,
00:15:14.220 | where you have a well of information that Traversal unearths describing what happened with the incident.
00:15:19.660 | It cites relevant observability data, just like any good engineer would. It gives you confidence levels
00:15:24.860 | for the potential root cause candidates. It explains its reasoning. And if you want to dive in even further,
00:15:29.980 | Traversal allows you to interact with an AI-generated impact map. And you can even ask Traversal
00:15:35.820 | follow-up questions like, "How does this incident impact the part of the stack that I care the
00:15:41.500 | most about?" And this entire experience is exactly the kind of thing that I was dying for in my previous
00:15:47.260 | roles. And I'm so excited to share that it's now a reality. So - thank you. But this is not just
00:15:55.180 | happening for DigitalOcean. We're working across a very heterogeneous group of enterprise environments.
00:16:02.060 | So the thing that I want you to focus on here is that we're talking to basically
00:16:06.780 | every observability tool out there, and we're talking to them on a massive scale. We're talking
00:16:12.140 | trillions of logs. So yes, it's an AI agents problem, and I like thinking about the AI agents
00:16:17.340 | problem. But in reality, this is just as much an AI infrastructure problem.
00:16:21.260 | Furthermore, the problem that we're solving here is one where we have a massive set of data,
00:16:29.500 | and we're looking for this small piece of information that's telling you everything that
00:16:34.140 | you need to know. This is sort of a needle in the haystack problem. And we're focused right now on
00:16:38.700 | observability. But there are plenty of other domains where there are similar principles that can be
00:16:43.180 | applied. In my experience, when you're doing network observability or you're working on cyber
00:16:48.060 | security, you find similar patterns. I hope that you all, in your respective domains, can
00:16:53.980 | take from this that these strategies of exhaustive search and swarms of agents can be applied in those
00:16:59.820 | domains as well. Not only is this an amazing problem to work on, but frankly, we've assembled an amazing
00:17:06.140 | group of people to do it. We've got AI researchers from the top AI labs in the world that are pushing
00:17:11.100 | the limits of what AI can do. We have people from top dev tools companies who know what it takes to
00:17:17.100 | build a tool that developers love to use. We have fantastic AI product engineers who know how to make
00:17:22.700 | delightful AI products. And, of course, we've got high-frequency quant finance traders who can deal
00:17:27.100 | with pain and know what the pain of production downtime actually means. But I think what's most
00:17:33.420 | special about the Traversal team is not necessarily the skills that we bring to the table. It's
00:17:39.340 | really rare to be able to find a group of people who aren't all out there
00:17:44.060 | for themselves, where everybody's working for the betterment of each other. And we have a lot of fun
00:17:48.220 | coming into work, and everybody loves showing up to the office. And this is what makes Traversal,
00:17:52.380 | in my opinion, an amazing place to work. And so if the team or the problem sounds interesting to you,
00:17:57.820 | please scan the QR code, look at one of these websites, and help us try to create this picture
00:18:02.780 | of me that we have right here for all the engineers in the world. Thank you very much.