- Welcome, everyone. My name is Mason Egger, I work at Temporal, and today we're gonna talk about why events are the wrong abstraction for your AI agents.
So, who here, raise of hands, recognizes what this diagram is, out of curiosity? Okay, so this is a map of our solar system in a geocentric projection. This is where we have Earth as the center of our solar system, and this is how celestial objects move around the Earth.
And this was used to kind of calculate celestial trajectories prior to like the 16th, 17th century. It's really pretty, I really like it. Complex, but really nice. And then around the 16th century, Copernicus decided to put the sun at the center of the solar system, and it greatly simplified how we view our world and how we view the universe.
This allowed people to start thinking about the laws of nature. We got the laws of gravity and a lot of other things because we were able to recenter how we look at the world and how we build things. A whole series of new discoveries and developments were made simply by shifting our focus, by changing how we decide to look at things.
Now, it's interesting to note that both of these models are actually accurate. They're both correct. But you have to take the right frame of mind: look at what you're trying to do, what tools and software you're using, and determine which frame of reference you want to use.
So, for example, if we're looking at something like a moon trajectory, it might be useful to use the Earth as the frame of reference here. But if you're thinking about something like the planets and how planetary objects move, then it's probably more useful to put the sun at the center of your ecosystem.
So you're probably wondering why I'm talking about this, and that's a pretty valid question. We're all building software here. We're all building AI software. And ensuring that the software is available and scalable for our users is extremely challenging. Scaling our AI agents is actually no different than scaling a microservice architecture.
At the end of the day, this is all just distributed systems, which is great because we're not solving any new problems here. We're solving the same problem that we've been solving for the last 20 years. Nothing here is new. All we did is we added a different label to it.
And we do have patterns for solving this, which is event-driven architecture. So I asked Claude this morning to give me a definition of what event-driven architecture is, and this is what it came up with. It said, "Event-driven architecture is a software design pattern where components communicate by producing and consuming events, allowing for loose coupling and asynchronous processing where systems react to state changes or occurrences rather than direct method calls." That's a pretty fair definition, and I'd expect Claude to get that right.
So that's a pretty good explanation of it. But let's look at what an event-driven AI architecture would potentially look like. So we have a fairly decent design, if I do say so myself. I didn't make this. So if it's great, I made it. If it's not, blame my colleague.
So we have some cron jobs here to handle inactive chat sessions, basically acting as a garbage collector for a whole bunch of things. We have a message bus where events can be published and ingested by the various tools that we are using, different LLMs, all of those things. We have a dead letter queue for handling tasks that have failed, and we just cannot possibly reprocess them.
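Just to make that plumbing concrete, here's a minimal sketch of what one of those consumers tends to look like, with an in-memory queue standing in for the message bus. The handler and the event shape are invented for illustration.

```python
import json
import queue

# Stand-ins for the message bus and the dead letter queue.
message_bus: "queue.Queue[str]" = queue.Queue()
dead_letter_queue: "queue.Queue[str]" = queue.Queue()

MAX_ATTEMPTS = 3


def summarize_chat(payload: dict) -> None:
    """The actual business logic: one function."""
    print(f"Summarizing inactive session {payload['session_id']}")


def consume_forever() -> None:
    """Everything in here is plumbing: polling, parsing, retries, DLQ routing."""
    while True:
        raw = message_bus.get()  # block until an event arrives
        payload = json.loads(raw)
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                summarize_chat(payload)
                break  # success, move on to the next event
            except Exception:
                if attempt == MAX_ATTEMPTS:
                    dead_letter_queue.put(raw)  # give up and park it in the DLQ
```

One function of business logic, and everything around it is delivery mechanics, and that's before you add the cron jobs that sweep up what the consumers miss. But back to the diagram.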
This is pretty great. This is a pretty straightforward EDA-based architecture. The question is, though, how much of this in here is actually the core business logic of your application? How much of this is actually what we're trying to solve versus trying to make sure that the darn thing doesn't break?
Good question. So some of you might be going, hey, this is great. Ship it. Love it. And you're right. That diagram does work. There are probably hundreds of thousands of agents deployed into production right now that use that exact same architecture. I'm not going to argue. It's a great, fine architecture.
It works. But I want to go back to our discussion that we were having earlier. Do we have the right thing at the center of our ecosystem? Are we looking at this through the right frame of mind? And I'm here to say that I don't think we do. We have built all of our applications in the modern world with events as the center of our universe instead of the core logic, the core foundation of what we're trying to solve.
And if you look at that diagram I showed earlier, there are more parts devoted to handling the events than to the core logic of the actual application. I was an SRE in a past life. And I've seen applications with 100 lines of code, dead simple logic, bring entire enterprises down, large travel companies that begin with the letter E, because of mismanaged queues.
So find me after the talk and I'll tell you my horror stories. So, issues with this approach. What do we have? Well, we don't get APIs in an event-driven architecture. We sacrifice clear, well-defined APIs when we adopt events. The events lack the documentation and structure that the APIs we've spent all these years building give us.
And yes, there is the AsyncAPI spec, but that really just describes the format of the messages. It's not really an API; it's just what is produced and what is consumed. It doesn't give us much more than that. And as we all know, developers are great at documentation. That's why there are so many of y'all working on documentation AI tooling right now.
So this is great, right? Next, scattered logic. Our business logic becomes fragmented and spread out across thousands of different services. Instead of having a bug and going, "Oh, let me open the file and see what's happening," now I have to open multiple files.
Now I have to grep across the code base for the event name to figure out who called what, when, where, and how. And in the worst case, I have to actually run the thing to figure out how it failed. That's the worst-case scenario, but I've seen it happen.
It's like, I have no idea what this is doing, but if I run it, I can figure it out. I'm like, that's like saying we don't know if the car is going to break when we crash it, so let's just slam it into the wall and see what happens.
Not a great way to go about things. And now every single one of our services is now an ad hoc state machine. And I know that CS majors love implementing state machines, and I'll tell you this, the only reason they love them so much is because they've never been paged for one.
Once you get paged for one, you stop loving your state machines really quick. Services now have local databases and local caches to make sure things survive when something fails. In many cases, there's no transaction spanning the message and the state update, so you get into this really fun state where we've updated one side but not the other, and now we're in this weird in-between case.
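To make that dual-write problem concrete, here's a minimal sketch. SQLite stands in for the service's local database, and the publish function is a made-up stand-in for the message bus.

```python
import sqlite3

db = sqlite3.connect("orders.db")
db.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, status TEXT)")


def publish(event: dict) -> None:
    """Hypothetical stand-in for publishing to the message bus."""
    print("published", event)


def handle_payment(order_id: str) -> None:
    # Step 1: update local state.
    db.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (order_id,))
    db.commit()

    # <-- If the process crashes right here, the order is marked paid
    #     locally but the event announcing it is never published.

    # Step 2: tell everyone else. No transaction covers both steps.
    publish({"type": "order_paid", "order_id": order_id})
```

Flip the two steps and you get the opposite failure: an event announcing state you never actually committed. Either way, somebody downstream is now acting on a version of reality that doesn't exist.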
And that leads us to the best case of all: race conditions. So now we don't know what's actually happening, and we get paged at 2:00 a.m. to deal with the hardest bugs of all. And now we either have to dig through logs to figure it out, or we can do what some companies do and just push it off to customer success: oh, they'll just reset the system whenever it crashes.
That'll work, right? We'll just roll back the database. I don't think that's a good thing. All in all, I could go on for hours and hours about why this approach doesn't work, but it leads to one thing, and I'm going to ruffle some feathers, so we'll do this lightly. EDA is a tightly coupled system.
Now, I know everyone's going to go, tightly coupled? Whoa! We've been told that event-driven architectures are loosely coupled. It's what I was taught in school. It's what Claude told me this morning when I asked it. And it is true. They are loosely coupled, but they're loosely coupled at runtime. They are not loosely coupled at design time.
We conflate the two. Because a service can go down and the other services keep running, we assume that being loosely coupled at runtime translates into being loosely coupled at design time. And that is not the case. That is not how this works.
So imagine that we had this discussion. Engineer A comes up and says, hey, I've refactored all the code to make everything more loosely coupled. And you go, wow, that's great. How did you do it? Well, I turned all of my local variables into global variables. Oh, great. Why on earth did you do that?
Well, now we can add new code, and everybody can just read from them, and we don't have to worry about telling anybody when we update them. That sounds insane, right? But that's exactly what we do with events. It's wild that we think that's a good idea.
And it really is the same logic. Everybody can just read from that event, right up until somebody decides to change the format of an event they didn't know you were reading. And now there's somebody three services downstream from you that you didn't know was reading your event.
You broke their system because you updated the event. And I can hear people's pagers going off as I'm saying this. Been there, done that. And what it does is it leads to people not being willing to iterate on their architecture or their design because they're scared if they touch the magic event, everything will crumble to the ground.
So this is what basically happens. So now what? Where do we go? I propose we need to reorient ourselves. We need to take a step back and go, is this the actual proper center? Take a step back and see what the frame is. So let's put the right thing at the center.
I believe this is durable execution. So what is durable execution? It's kind of a new category of software. It's called crash-proof execution. And basically, it enables developers to write software with less effort. It allows them to focus on the application, on what the application should achieve, instead of trying to anticipate or mitigate everything that could possibly go wrong.
This in turn will accelerate your development. And basically, it turns out that failures are inevitable, but durable execution makes them inconsequential. So here are the four characteristics that we've come up with to define durable execution. First, durable execution applications automatically preserve your application state. In a typical application, a crash will cause you to lose all your variable state.
Everything is lost. And developers will typically add a cache, a Redis cache, a local database, something of that nature, to back all of this up so they can rebuild the state if it crashes. In a durable execution system, you get the saving of the state automatically, out of the box.
Just automatically, all your local variables, all of your function calls, the inputs, the outputs, the return values, all of it is stored for you. And because of this, we can virtualize the execution. Execution usually takes place in a single process on a single machine, and it immediately ends if that process crashes for any reason.
Durable execution can happen across a series of processes on multiple machines. If the current process fails, durable execution takes the state it saved, restarts the execution, resumes from the last known save point, and continues on, very often without the developer ever even being aware that it happened.
This is also not limited by time. Because durable execution can survive a crash, your code can run for as long as you like. Most people would never even think about putting a sleep for 30 days inside their code. That's totally valid and 100% achievable in a durable execution system, because it just comes back online whenever it's ready to run.
It will survive that crash. I can sleep for 30 days. If it crashes, doesn't matter. Bring it back online. Resume the timer. Continue forward. And it's hardware agnostic. We, in the past, have tried to solve some of these problems with fault-tolerant hardware. We pay a lot of money to be able to hot-swap out our CPUs, to be able to hot-swap out memory, all of that stuff.
Durable execution is completely hardware agnostic. It can run anywhere. We ran it on a Raspberry Pi and shipped it into outer space. Check out our YouTube channel. You can run it wherever you want to run it. It builds reliability into the software side, not the hardware side. And it requires no actual specific hardware.
And it runs natively anywhere. And it overcomes all of these issues. So this is what a durable execution agent architecture would look like. We still have our chat UI. We have a durable workflow. And we automatically get retries: whenever a failure happens while we're talking to our LLMs or our tools, if the LLM goes down or we're getting rate limited, we get automatic retries.
You don't even have to build that. You make a function call, and that function will automatically retry until it succeeds. And if the whole thing crashes in the middle, it automatically reconstructs the state and continues on as if nothing had happened. We can persist the state longer term for audit purposes, at a predefined time or whenever we decide we want to close the execution.
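To give you a feel for what that looks like in code, here's a rough sketch of that kind of workflow using the Temporal Python SDK. The activity, the placeholder LLM call, and the timeout values here are invented for illustration; this isn't our demo code.

```python
import asyncio
from datetime import timedelta

from temporalio import activity, workflow


@activity.defn
async def call_llm(prompt: str) -> str:
    # Ordinary code that talks to whatever model provider you use.
    # It can fail, time out, or get rate limited -- that's fine.
    return f"(model reply to {prompt!r})"  # placeholder for the real call


@workflow.defn
class AgentWorkflow:
    @workflow.run
    async def run(self, user_message: str) -> str:
        # If the activity raises or the LLM is down, Temporal retries it
        # automatically under the default retry policy until it succeeds.
        answer = await workflow.execute_activity(
            call_llm,
            user_message,
            start_to_close_timeout=timedelta(minutes=2),
        )

        # Durable timer: keep the session around for 30 days before closing.
        # If the worker crashes in the meantime, the timer survives.
        await asyncio.sleep(timedelta(days=30).total_seconds())
        return answer
```

You run this under a worker process. If that worker dies mid-run, another worker picks up the saved history, replays it to the last known point, and keeps going.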
The diagram for this is also much simpler than what we had earlier. And now the developer doesn't have to think at all about managing queues or events or any of those things. They can focus solely on the business logic. So, Temporal provides durable execution. It's an open-source, MIT-licensed product. We're here at the booth today, in the hall.
And you can come and visit us. It currently supports seven programming language SDKs: Go, Python, TypeScript, Ruby, .NET, Java, and PHP. There will be another one coming later this year; if you've looked into it, it's not that hard to figure out which one it is. And the interesting thing about durable execution is that all of these are natively polyglot.
So I can call a function written in Ruby from code written in Go just by giving it the function name and the input parameters. Because, here's the dirty secret: it's still events under the hood. But what did we do? Why does this work? We've abstracted the complexity away from the developer and into the platform layer.
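Here's roughly what that cross-language call looks like from the Python side, as a sketch: because the activity is implemented by a worker written in another language, you refer to it purely by name. The activity name, the task queue, and the input here are made up for the example.

```python
from datetime import timedelta

from temporalio import workflow


@workflow.defn
class CheckoutWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> None:
        # "ChargeCard" is implemented by a worker in another language
        # (Ruby, Go, whatever). From here it's just a name plus inputs;
        # under the hood it's still a task on a queue.
        await workflow.execute_activity(
            "ChargeCard",
            order_id,
            task_queue="ruby-payments",  # hypothetical queue the Ruby worker polls
            start_to_close_timeout=timedelta(seconds=30),
        )
```

The calling workflow never imports the Ruby code; it just schedules a task on a queue that the Ruby worker is listening to, and the platform handles the rest.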
Software engineering, as a vocation, if you look back over its history, the 50, 60, 70 years it has existed, going back to the 1950s: we have made most of our advancements in programming languages by abstracting complexity away from the programmer and into the programming language.
None of us are sitting here writing assembly code anymore, thankfully. I mean, some of you might be, and good on you. I'm not. Fortran gave us mathematical operations, so we don't have to write assembly language and store things manually in registers. ALGOL 60 and Pascal gave us if/then/else and structured control flow.
We're not writing gotos and jumps in our code anymore. Lisp gave us memory management and garbage collection. Simula and Smalltalk gave us object-oriented programming. And this just continues on and on; we've continually abstracted away complexity. Durable execution is the next step in that progression. We're abstracting away events and the complexity of events into the software layer, so no one has to worry about them anymore.
So you no longer have to worry about your queues or any of that stuff. And I leave you with a meme: all AI is just distributed systems under the hood. If you're calling out across the network, you're a distributed system, and you need to handle that.
If you want to come and learn more, Temporal is in the hallway, or at the booth, sorry, in the expo hall. I will be there literally as soon as this talk is over, so you're welcome to come by. We have a durable agent demo running. Come by. Try to break my demo.
Turn off my computer. Turn off my laptop. I guarantee you it'll still keep running and finish what it was doing. I'll show you how all of this works. You can also chat with me in Slack. We have a community Slack channel, and we have a newsletter if you're interested. Thank you very much.