
Events are the Wrong Abstraction for Your AI Agents - Mason Egger, Temporal.io



00:00:00.640 | - Welcome, everyone.
00:00:02.120 | My name is Mason Egger, I work at Temporal,
00:00:05.240 | and today we're gonna talk about events
00:00:07.280 | are the wrong abstraction for your AI agents.
00:00:09.920 | So, who here, raise of hands, recognizes
00:00:13.000 | what this diagram is, out of curiosity?
00:00:28.600 | Okay, so this is a map of our solar system
00:00:32.240 | in a geocentric projection.
00:00:34.280 | This is where we have Earth
00:00:35.920 | as the center of our solar system,
00:00:37.560 | and this is how celestial objects move around the Earth.
00:00:40.280 | And this was used to kind of calculate celestial trajectories
00:00:44.320 | prior to like the 16th, 17th century.
00:00:46.520 | It's really pretty, I really like it.
00:00:49.840 | Complex, but really nice.
00:00:52.400 | And then around the 16th century,
00:00:54.120 | Copernicus decided to put the sun
00:00:56.040 | at the center of the solar system,
00:00:58.560 | and it greatly simplified how we view our world
00:01:01.600 | and how we view the universe.
00:01:03.040 | This allowed people to start thinking about the laws of nature.
00:01:05.160 | We got the laws of gravity and a lot of different things
00:01:07.800 | because of us being able to recenter our focus
00:01:10.560 | on the way that we look at things,
00:01:13.760 | the way that we build things.
00:01:15.480 | And basically, a whole series of new discoveries
00:01:17.720 | and developments were made simply by shifting our focus
00:01:21.040 | on how we decide to look at things.
00:01:23.360 | Now, it's interesting to note that both of these are actually accurate.
00:01:26.600 | They're both correct.
00:01:27.920 | However, you have to take the right frame of mind
00:01:32.720 | and look at what you need to use,
00:01:35.120 | what tools you need to use, what software you're using,
00:01:37.840 | and determine how we want to use it as a frame of reference.
00:01:40.880 | So, for example, if we're looking at something like a moon trajectory,
00:01:43.520 | it might be useful to use the Earth as the frame of reference here.
00:01:46.400 | But if you're thinking about something like the planets
00:01:48.200 | and how planetary objects move, then it's probably more useful
00:01:50.760 | to put the sun at the center of your ecosystem.
00:01:52.760 | So you're probably wondering why I'm talking about this,
00:01:55.080 | and that's a pretty valid question.
00:01:56.600 | We're all building software here.
00:01:58.600 | We're all building AI software.
00:02:00.200 | And ensuring that the software is available and scalable
00:02:03.160 | for our users is extremely challenging.
00:02:05.240 | Scaling our AI agents is actually no different
00:02:09.160 | than scaling a microservice architecture.
00:02:11.400 | At the end of the day, this is all just distributed systems,
00:02:13.400 | which is great because we're not solving any new problems here.
00:02:16.280 | We're solving the same problem that we've been solving
00:02:17.880 | for the last 20 years.
00:02:18.760 | Nothing here is new.
00:02:20.280 | All we did is we added a different label to it.
00:02:22.520 | And we do have patterns for solving this,
00:02:25.480 | which is event-driven architecture.
00:02:27.160 | So I asked Claude this morning to give me a definition
00:02:30.840 | of what is event-driven architecture,
00:02:32.440 | and this is what it came up with.
00:02:33.400 | It said, "Event-driven architecture is a software design pattern
00:02:36.600 | where components communicate by producing and consuming events,
00:02:39.720 | allowing for loose coupling and asynchronous processing
00:02:42.120 | where systems react to state changes or occurrences
00:02:44.920 | rather than direct method calls."
00:02:46.280 | That's a pretty fair definition, and I'd expect Claude to get that right.
00:02:50.840 | So that's a pretty good explanation of it.
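(A minimal sketch of that definition, not from the talk: two components that communicate only through a bus, with a plain in-memory queue standing in for a real broker such as Kafka or RabbitMQ.)

```python
# Minimal event-driven sketch: components communicate only through a bus.
# An in-memory queue stands in for a real broker (Kafka, RabbitMQ, etc.).
import queue

bus: "queue.Queue[dict]" = queue.Queue()

def producer(user_input: str) -> None:
    # The producer knows nothing about who will consume this event.
    bus.put({"type": "chat.message.received", "payload": user_input})

def consumer() -> None:
    # The consumer reacts to events rather than direct method calls.
    while not bus.empty():
        event = bus.get()
        if event["type"] == "chat.message.received":
            print(f"handling event: {event['payload']}")

producer("summarize this document")
consumer()
```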
00:02:52.680 | But let's look at, say, what an event-driven AI architecture would potentially look like.
00:02:57.640 | So we have a fairly decent design, if I do say so myself.
00:03:01.880 | I didn't make this.
00:03:02.440 | So if it's great, I made it.
00:03:03.880 | If it's not, blame my colleague.
00:03:06.760 | So we have some cron jobs here to handle inactive chat sessions,
00:03:09.800 | basically acting as a garbage collector for a whole bunch of things.
00:03:12.520 | We have a message bus where events can be published and ingested by the various tools
00:03:16.360 | that we are using, different LLMs, all of those things.
00:03:18.680 | We have a dead letter queue for handling tasks that have failed,
00:03:21.000 | and we just cannot possibly reprocess them.
00:03:22.840 | This is pretty great.
00:03:25.000 | This is a pretty straightforward EDA-based architecture.
00:03:29.320 | The question is, though, how much of this in here is actually the core business logic of your
00:03:33.800 | application?
00:03:34.360 | How much of this is actually what we're trying to solve
00:03:37.400 | versus trying to make sure that the darn thing doesn't break?
00:03:39.960 | Good question.
00:03:42.040 | So some of you might be going, hey, this is great.
00:03:44.520 | Ship it.
00:03:45.480 | Love it.
00:03:45.880 | And you're right.
00:03:47.560 | That diagram does work.
00:03:48.600 | There are probably hundreds of thousands of agents deployed into production right now
00:03:51.880 | that use that exact same architecture.
00:03:53.400 | I'm not going to argue.
00:03:54.120 | It's a great, fine architecture.
00:03:56.200 | It works.
00:03:57.080 | But I want to go back to our discussion that we were having earlier.
00:03:59.400 | Do we have the right thing at the center of our ecosystem?
00:04:03.640 | Are we looking at this through the right frame of mind?
00:04:06.200 | And I'm here to say that I don't think we do.
00:04:08.040 | We have built all of our applications in the modern world with events as the center of our
00:04:14.200 | universe instead of the core logic, the core foundation of what we're trying to solve.
00:04:17.880 | And if you look at that diagram that I showed previously, there are more parts devoted to handling the
00:04:22.680 | events than to the core logic of the actual application.
00:04:27.000 | I was an SRE in a past life for a company.
00:04:29.560 | And I've seen applications with 100 lines of code, dead simple logic that have brought entire
00:04:35.000 | enterprises down, large travel companies that begin with the letter E, because of mismanaged
00:04:41.560 | queues.
00:04:41.560 | So find me after the talk, I'll tell you about my horror stories.
00:04:45.400 | So issues with this approach.
00:04:47.800 | So what do we have?
00:04:48.440 | Well, we don't get APIs in an EDA architecture.
00:04:51.800 | We sacrifice clear, well-defined APIs when we adopt events.
00:04:56.120 | The events lack documentation and structure that all of our APIs that we've spent all these years
00:05:00.600 | building have given us.
00:05:01.560 | And yes, there is the AsyncAPI spec, but that really just describes the format of the messages.
00:05:07.160 | It's not really an API.
00:05:08.200 | It's just what is produced and what is consumed.
00:05:10.040 | It doesn't really give us much more than that.
00:05:11.720 | And as we all know, developers are great at documentation.
00:05:14.600 | That's why there's so many of y'all working on documentation AI tooling right now.
00:05:17.560 | So this is great, right?
00:05:20.600 | Now we have scattered logic.
00:05:22.520 | Our business logic now becomes fragmented and spread out across thousands of different services.
00:05:28.680 | Instead of having a bug and let me go, "Oh, let me open the file and see what that is,"
00:05:32.040 | now I have to open multiple files.
00:05:33.720 | Now I have to grep across the code base for the event name to figure out who called what,
00:05:37.000 | when, and where, and how.
00:05:37.880 | And if it gets even worse, I have to run the thing to figure out how it failed.
00:05:42.760 | That's worst case scenario, but I've seen it happen.
00:05:44.600 | It's like, I have no idea what this is doing, but if I run it, I can figure it out.
00:05:47.720 | I'm like, that's like saying we don't know if the car is going to break when we crash it,
00:05:51.640 | so let's just slam it into the wall and see what happens.
00:05:54.040 | Not a great way to go about things.
00:05:56.680 | And every single one of our services is now an ad hoc state machine.
00:05:59.800 | And I know that CS majors love implementing state machines, and I'll tell you this,
00:06:04.760 | the only reason they love them so much is because they've never been paged for one.
00:06:07.240 | Once you get paged for one, you stop loving your state machines really quick.
00:06:11.160 | Services now have local databases and local caches to make sure that state isn't lost when things fail.
00:06:16.440 | In many cases, there's no transaction between consuming the message and updating your state,
00:06:21.480 | so you get into this really fun situation where we've updated something, but we haven't quite
00:06:25.400 | actually updated it yet, and now we're in this weird in-between state.
00:06:27.960 | And that leads us to the best case of all, race conditions.
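(To make that failure mode concrete, a hypothetical handler sketch, not the speaker's code: the local state update and the outgoing event are two separate, non-transactional steps, so a crash in the marked window leaves them permanently out of sync.)

```python
# Hypothetical event handler with the classic dual-write problem:
# updating local state and publishing a follow-up event are two
# separate, non-transactional steps.
local_db: dict[str, str] = {}  # stand-in for a service-local database
outbox: list[dict] = []        # stand-in for the message bus

def handle_order_event(event: dict) -> None:
    # Step 1: update local state.
    local_db[event["order_id"]] = "processed"

    # <-- If the process crashes here, the database says "processed"
    #     but downstream services never hear about it. On redelivery,
    #     the same event may also be processed twice (no idempotency).

    # Step 2: publish a follow-up event.
    outbox.append({"type": "order.processed", "order_id": event["order_id"]})

handle_order_event({"type": "order.created", "order_id": "42"})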
00:06:31.720 | So now we don't know what's actually happening.
00:06:33.880 | And now we get paged at 2:00 a.m. in the morning to deal with the hardest bugs of all.
00:06:38.360 | And we now either have to dig through logs to debug this, or we can do what some companies do,
00:06:42.920 | and we can just push it off to customer success and be like, oh, they'll just reset the system
00:06:46.120 | whenever it crashes. That'll work, right? We'll just roll back the database.
00:06:48.840 | I don't think that's a good thing. All in all, I could go on hours and hours about
00:06:54.920 | why this approach doesn't work, but it leads to one thing, and I'm going to ruffle some feathers,
00:06:58.600 | so we'll do this lightly. EDA is a tightly coupled system. Now, I know everyone's going to go,
00:07:03.800 | tightly coupled. Whoa! We've been told that event-driven architectures are loosely coupled.
00:07:09.160 | That's what I was taught in school. It's what Claude told me this morning when I asked it.
00:07:12.600 | And it is true. They are loosely coupled, but they're loosely coupled at runtime.
00:07:17.400 | They are not loosely coupled at design time. We conflate the two. We think that because
00:07:22.040 | they're loosely coupled at runtime, because we can have a service go down and the other services
00:07:26.120 | can keep running, that that loose coupling translates into design time.
00:07:30.440 | And that is not the case. That is not how this works. So imagine that we had this discussion.
00:07:35.160 | Engineer A comes up and says, hey, I've refactored all the code to make everything
00:07:39.320 | more loosely coupled. And you go, wow, that's great. How did you do it?
00:07:42.280 | Well, I turned all of my local variables into global variables.
00:07:44.840 | Oh, great. Why on earth did you do that? Well, now we can add new code,
00:07:49.800 | and everybody can just read from them, and we don't have to worry about telling anybody
00:07:52.760 | when we update it. That sounds insane, right? That's what we do with events.
00:07:57.800 | So that's what's happening. It's just wild that we think that's a good idea.
00:08:05.000 | And it's the same logic. We can just read from that event until somebody decides to
00:08:10.440 | change the format, because you were reading from an event that somebody didn't know you were reading.
00:08:14.280 | And now somebody else three services downstream from you, who you didn't know was reading your
00:08:17.960 | event. You broke their system because you updated the event. And I can hear people's pagers going off
00:08:23.000 | as I'm saying this. Been there, done that. And what it does is it leads to people not being willing to
00:08:28.760 | iterate on their architecture or their design because they're scared if they touch the magic event,
00:08:33.400 | everything will crumble to the ground. So this is what basically happens. So now what? Where do we go?
00:08:37.960 | I propose we need to reorient ourselves. We need to take a step back and go, is this the actual proper
00:08:44.440 | center? Take a step back and see what the frame is. So let's put the right thing at the center.
00:08:48.120 | I believe this is durable execution. So what is durable execution? It's kind of a new category of software.
00:08:55.240 | It's called crash-proof execution. And basically, it enables developers to write software with less effort.
00:09:01.240 | It allows them to focus on the application, on what the application should achieve,
00:09:04.920 | instead of trying to anticipate or mitigate everything that could possibly go wrong.
00:09:08.680 | This in turn will accelerate your development. And basically, it turns out that failures are
00:09:08.680 | inevitable, but durable execution makes them inconsequential. So here are the four characteristics
00:09:17.400 | that we've come up with to define durable execution. Durable execution applications automatically
00:09:22.440 | preserve your application state. So in a typical application, a crash will cause you to lose all your
00:09:30.120 | variable state. Everything is lost. And developers will typically make a cache, a Redis cache, a local
00:09:35.160 | database, something of that nature, to back all of this up to so that we can rebuild the state if it
00:09:39.160 | crashes. In a durable execution system, you automatically get the saving of the state out of the box.
00:09:45.080 | Just automatically, all your local variables, all of your function calls, the inputs, outputs, returns,
00:09:48.920 | all of it is stored automatically for you. Because of this, this allows us to virtualize the execution.
00:09:55.320 | So execution usually takes place in a single process on a single machine, and basically
00:10:00.360 | will immediately end if that process crashes for any reason. Durable execution can happen
00:10:05.320 | across a series of processes across multiple machines. If the current process fails,
00:10:10.120 | durable execution will basically take the state that it saved, restart the execution, and resume
00:10:15.160 | execution from the last known save point, and continue on, very often without the developer
00:10:20.520 | ever being aware that this even happened. This is also not limited by time.
00:10:25.800 | Because durable execution can survive a crash, it enables it to run for as long as you like.
00:10:30.440 | So it's, you know, most people would never even think about putting sleep for 30 days inside your code.
00:10:36.760 | That's a totally valid and 100% achievable goal within a durable execution system, because it just comes
00:10:42.120 | back online whenever it's ready to be run. It will survive that crash. I can sleep for 30 days. If it crashes,
00:10:47.560 | doesn't matter. Bring it back online. Resume the timer. Continue forward.
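(A minimal sketch of that idea, assuming Temporal's Python SDK and hypothetical names: inside a workflow, an ordinary asyncio.sleep becomes a durable, server-backed timer, so a 30-day sleep is a perfectly normal line of code.)

```python
# Sketch (hypothetical names) of a durable workflow using Temporal's Python SDK.
# Local variables and the 30-day timer are preserved by the platform: if the
# worker process dies, the workflow is replayed to this point and the timer
# resumes rather than starting over.
import asyncio
from datetime import timedelta
from temporalio import workflow


@workflow.defn
class TrialExpiryWorkflow:
    @workflow.run
    async def run(self, user_id: str) -> str:
        note = f"trial started for {user_id}"  # ordinary local state, saved automatically

        # Durable timer: inside a workflow, asyncio.sleep is backed by the server,
        # so sleeping for 30 days is a valid thing to write here.
        await asyncio.sleep(timedelta(days=30).total_seconds())

        return f"{note}; trial expired"
```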
00:10:53.160 | And it's hardware agnostic. We, in the past, have tried to solve some of these problems with fault-tolerant hardware. We pay a lot
00:10:59.000 | of money to be able to hot-swap out our CPUs, to be able to hot-swap out memory, all of that stuff.
00:11:02.920 | Durable execution is completely hardware agnostic. It can run anywhere. We ran it on a Raspberry Pi and
00:11:07.160 | shipped it into outer space. Check out our YouTube channel. You can run it wherever you want to run it.
00:11:11.960 | It builds reliability into the software side, not the hardware side. And it requires no actual specific
00:11:17.160 | hardware. And it runs natively anywhere. And it overcomes all of these issues. So this is what a
00:11:23.560 | durable execution agent architecture would look like. We still have our chat UI. We have a durable
00:11:28.520 | workflow. We automatically get retries. Whenever a failure happens, whenever we're talking out to our
00:11:33.400 | LLMs or to our tools, if the LLM goes down or we're getting rate limited: automatic retries.
00:11:38.600 | You don't even have to develop that. You make a function call, and that function will automatically
00:11:41.400 | retry until it succeeds. And it doesn't matter if this thing crashes in the
00:11:45.560 | middle of it; it automatically reconstructs the state and continues on as if this had never happened.
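(Again a hedged sketch rather than the speaker's code, using Temporal's Python SDK with illustrative names and settings: the LLM call lives in an activity, and the automatic retries described above come from the retry policy on the activity invocation.)

```python
# Hedged sketch of a durable agent step with Temporal's Python SDK.
# The LLM call runs as an activity; failures (timeouts, rate limits) are
# retried automatically according to the retry policy, and a worker crash
# mid-run is recovered by replaying the workflow from saved history.
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def call_llm(prompt: str) -> str:
    # A real implementation would call an LLM API here; any raised
    # exception (rate limit, network error) triggers an automatic retry.
    return f"(model reply to: {prompt})"


@workflow.defn
class ChatAgentWorkflow:
    @workflow.run
    async def run(self, user_message: str) -> str:
        reply = await workflow.execute_activity(
            call_llm,
            user_message,
            start_to_close_timeout=timedelta(minutes=2),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_attempts=0,  # 0 means retry until it succeeds
            ),
        )
        return reply
```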
00:11:49.240 | We can move state out to longer-term storage, for audit purposes, at a predefined time or whenever we
00:11:55.800 | decide we want to close the execution. This is basically a much simpler diagram than what we had
00:12:02.920 | earlier. And now the developer doesn't have to think at all about managing queues or events or any of
00:12:08.360 | those things. They can focus solely on the business logic. So Temporal provides durable execution.
00:12:14.520 | It's an open-source MIT-licensed product. We're here at the booth today, in the hall. And you can
00:12:20.840 | come and visit us. And it currently supports seven programming language SDKs: Go, Python, TypeScript,
00:12:27.080 | Ruby, .NET, Java, and PHP. There will be another one coming later this year. If you looked into it,
00:12:32.120 | it's not that hard to figure out which one it is. And the interesting thing about durable execution is
00:12:36.200 | all of these are natively polyglot. So I can call a function written in Ruby from code written in
00:12:41.400 | Go by basically just providing it with the function name and the input parameters. Because, here's the
00:12:45.800 | dirty secret: it's still events under the hood. But what did we do? Why is this a thing? We've abstracted
00:12:52.360 | the complexities away into the platform layer. Software engineering, as a vocation,
00:12:57.480 | if you look back over the 50, 60, 70 years it has existed, going back to,
00:13:01.400 | you know, the 1950s: we have made most of our advancements in programming languages as we
00:13:08.040 | have abstracted the complexities away from the programmer and into the programming language.
00:13:12.840 | None of us are sitting here writing assembly code anymore, thankfully. I mean, some of you might be,
00:13:16.680 | and good on you. I'm not. Fortran gave us mathematical operations. We don't have to write
00:13:21.560 | assembly language anymore and store things manually in registers. ALGOL 60 and Pascal gave us if,
00:13:26.360 | then, else, and structured control flow concepts. We're not writing go-tos and jumps in our code
00:13:30.680 | anymore. Lisp gave us memory management and garbage collection. Simula and Smalltalk gave us object
00:13:35.880 | oriented programming. And this just continues on and on. We've continually abstracted away complexity.
00:13:40.280 | Durable execution is the next foray into this. We are abstracting away events and the complexity of
00:13:46.120 | events into the software layer and removing that from anyone having to worry about it. So you no longer have
00:13:51.080 | to worry about your queues or any of that stuff. And I leave you with a meme. So all AI is just
00:13:59.400 | distributed systems under the hood. If you are calling out across the network, you're a distributed
00:14:05.000 | system. And you basically need to handle that. If you want to come and learn more, Temporal is in the
00:14:12.200 | hallway, or in the booth, sorry, in the expo hall. I will be there literally as soon as this talk is
00:14:17.000 | over. So you're welcome to come by. We have a durable agent demo running. Come by. Try to break my demo.
00:14:22.040 | Turn off my computer. Turn off my laptop. I guarantee you it'll still keep running when it's done. I'll
00:14:27.000 | show you how all of this works. You can also chat with me in Slack. We have a community Slack channel,
00:14:31.240 | and we have a newsletter if you're interested. Thank you very much.