
Events are the Wrong Abstraction for Your AI Agents - Mason Egger, Temporal.io



00:00:00.640 | - Welcome, everyone.
00:00:02.120 | My name is Mason Egger, I work at Temporal,
00:00:05.240 | and today we're gonna talk about events
00:00:07.280 | are the wrong abstraction for your AI agents.
00:00:09.920 | So, who here, raise of hands, recognizes
00:00:13.000 | what this diagram is, out of curiosity?
00:00:28.600 | Okay, so this is a map of our solar system
00:00:32.240 | in a geocentric projection.
00:00:34.280 | This is where we have Earth
00:00:35.920 | as the center of our solar system,
00:00:37.560 | and this is how celestial objects move around the Earth.
00:00:40.280 | And this was used to kind of calculate celestial trajectories
00:00:44.320 | prior to like the 16th, 17th century.
00:00:46.520 | It's really pretty, I really like it.
00:00:49.840 | Complex, but really nice.
00:00:52.400 | And then around the 16th century,
00:00:54.120 | Copernicus decided to put the sun
00:00:56.040 | at the center of the solar system,
00:00:58.560 | and it greatly simplified how we view our world
00:01:01.600 | and how we view the universe.
00:01:03.040 | This allowed people to start thinking about the laws of nature.
00:01:05.160 | We got the laws of gravity and a lot of different things
00:01:07.800 | because of us being able to recenter our focus
00:01:10.560 | on the way that we look at things,
00:01:13.760 | the way that we build things.
00:01:15.480 | And basically, a whole series of new discoveries
00:01:17.720 | and developments were made simply by shifting our focus
00:01:21.040 | on how we decide to look at things.
00:01:23.360 | Now, it's interesting to note that both of these are actually accurate.
00:01:26.600 | They're both correct.
00:01:27.920 | However, you have to take the right frame of mind
00:01:32.720 | and look at what you need to use,
00:01:35.120 | what tools you need to use, what software you're using,
00:01:37.840 | and determine how we want to use it as a frame of reference.
00:01:40.880 | So, for example, if we're looking at something like a moon trajectory,
00:01:43.520 | it might be useful to use the Earth as the frame of reference here.
00:01:46.400 | But if you're thinking about something like the planets
00:01:48.200 | and how planetary objects move, then it's probably more useful
00:01:50.760 | to put the sun at the center of your ecosystem.
00:01:52.760 | So you're probably wondering why I'm talking about this,
00:01:55.080 | and that's a pretty valid question.
00:01:56.600 | We're all building software here.
00:01:58.600 | We're all building AI software.
00:02:00.200 | And ensuring that the software is available and scalable
00:02:03.160 | for our users is extremely challenging.
00:02:05.240 | Scaling our AI agents is actually no different
00:02:09.160 | than scaling a microservice architecture.
00:02:11.400 | At the end of the day, this is all just distributed systems,
00:02:13.400 | which is great because we're not solving any new problems here.
00:02:16.280 | We're solving the same problem that we've been solving
00:02:17.880 | for the last 20 years.
00:02:18.760 | Nothing here is new.
00:02:20.280 | All we did is we added a different label to it.
00:02:22.520 | And we do have patterns for solving this,
00:02:25.480 | which is event-driven architecture.
00:02:27.160 | So I asked Claude this morning to give me a definition
00:02:30.840 | of what is event-driven architecture,
00:02:32.440 | and this is what it came up with.
00:02:33.400 | It said, "Event-driven architecture is a software design pattern
00:02:36.600 | where components communicate by producing and consuming events,
00:02:39.720 | allowing for loose coupling and asynchronous processing
00:02:42.120 | where systems react to state changes or occurrences
00:02:44.920 | rather than direct method calls."
00:02:46.280 | That's a pretty fair definition, and I'd expect Claude to get that right.
00:02:50.840 | So that's a pretty good explanation of it.
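(A minimal sketch of that definition, not from the talk: two components that communicate only through a bus, with a plain in-memory queue standing in for a real broker such as Kafka or RabbitMQ.)

```python
# Minimal event-driven sketch: components communicate only through a bus.
# An in-memory queue stands in for a real broker (Kafka, RabbitMQ, etc.).
import queue

bus: "queue.Queue[dict]" = queue.Queue()

def producer(user_input: str) -> None:
    # The producer knows nothing about who will consume this event.
    bus.put({"type": "chat.message.received", "payload": user_input})

def consumer() -> None:
    # The consumer reacts to events rather than direct method calls.
    while not bus.empty():
        event = bus.get()
        if event["type"] == "chat.message.received":
            print(f"handling event: {event['payload']}")

producer("summarize this document")
consumer()
```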
00:02:52.680 | But let's look at, say, what an event-driven AI architecture would potentially look like.
00:02:57.640 | So we have a fairly decent design, if I do say so myself.
00:03:01.880 | I didn't make this.
00:03:02.440 | So if it's great, I made it.
00:03:03.880 | If it's not, blame my colleague.
00:03:06.760 | So we have some cron jobs here to handle inactive chat sessions,
00:03:09.800 | basically acting as a garbage collector for a whole bunch of things.
00:03:12.520 | We have a message bus where events can be published and ingested by the various tools
00:03:16.360 | that we are using, different LLMs, all of those things.
00:03:18.680 | We have a dead letter queue for handling tasks that have failed,
00:03:21.000 | and we just cannot possibly reprocess them.
00:03:22.840 | This is pretty great.
00:03:25.000 | This is a pretty straightforward EDA-based architecture.
00:03:29.320 | The question is, though, how much of this in here is actually the core business logic of your
00:03:33.800 | application?
00:03:34.360 | How much of this is actually what we're trying to solve
00:03:37.400 | versus trying to make sure that the darn thing doesn't break?
00:03:39.960 | Good question.
00:03:42.040 | So some of you might be going, hey, this is great.
00:03:44.520 | Ship it.
00:03:45.480 | Love it.
00:03:45.880 | And you're right.
00:03:47.560 | That diagram does work.
00:03:48.600 | There are probably hundreds of thousands of agents deployed into production right now
00:03:51.880 | that use that exact same architecture.
00:03:53.400 | I'm not going to argue.
00:03:54.120 | It's a great, fine architecture.
00:03:56.200 | It works.
00:03:57.080 | But I want to go back to our discussion that we were having earlier.
00:03:59.400 | Do we have the right thing at the center of our ecosystem?
00:04:03.640 | Are we looking at this through the right frame of mind?
00:04:06.200 | And I'm here to say that I don't think we do.
00:04:08.040 | We have built all of our applications in the modern world with events as the center of our
00:04:14.200 | universe instead of the core logic, the core foundation of what we're trying to solve.
00:04:17.880 | And if you look at that diagram that I showed previously, there are more parts devoted to handling the
00:04:22.680 | events than to the core logic of the actual application.
00:04:27.000 | I was an SRE in a past life for a company.
00:04:29.560 | And I've seen applications with 100 lines of code, dead simple logic that have brought entire
00:04:35.000 | enterprises down, large travel companies that begin with the letter E, because of mismanaged
00:04:41.560 | queues.
00:04:41.560 | So find me after the talk, I'll tell you about my horror stories.
00:04:45.400 | So issues with this approach.
00:04:47.800 | So what do we have?
00:04:48.440 | Well, we don't get APIs in an EDA architecture.
00:04:51.800 | We sacrifice clear, well-defined APIs when we adopt events.
00:04:56.120 | The events lack documentation and structure that all of our APIs that we've spent all these years
00:05:00.600 | building have given us.
00:05:01.560 | And yes, there is the AsyncAPI spec, but that really just describes the format of the messages.
00:05:07.160 | It's not really an API.
00:05:08.200 | It's just what is produced and what is consumed.
00:05:10.040 | It doesn't really give us much more than that.
00:05:11.720 | And as we all know, developers are great at documentation.
00:05:14.600 | That's why there's so many of y'all working on documentation AI tooling right now.
00:05:17.560 | So this is great, right?
00:05:20.600 | Now we have scattered logic.
00:05:22.520 | Our business logic now becomes fragmented and spread out across thousands of different services.
00:05:28.680 | Instead of having a bug and let me go, "Oh, let me open the file and see what that is,"
00:05:32.040 | now I have to open multiple files.
00:05:33.720 | Now I have to grep across the code base for the event name to figure out who called what,
00:05:37.000 | when, and where, and how.
00:05:37.880 | And if it gets even worse, I have to run the thing to figure out how it failed.
00:05:42.760 | That's worst case scenario, but I've seen it happen.
00:05:44.600 | It's like, I have no idea what this is doing, but if I run it, I can figure it out.
00:05:47.720 | I'm like, that's like saying we don't know if the car is going to break when we crash it,
00:05:51.640 | so let's just slam it into the wall and see what happens.
00:05:54.040 | Not a great way to go about things.
00:05:56.680 | And every single one of our services is now an ad hoc state machine.
00:05:59.800 | And I know that CS majors love implementing state machines, and I'll tell you this,
00:06:04.760 | the only reason they love them so much is because they've never been paged for one.
00:06:07.240 | Once you get paged for one, you stop loving your state machines really quick.
00:06:11.160 | Services now have local databases and local caches to make sure that state isn't lost when things fail.
00:06:16.440 | In many cases, there's no transaction between consuming the message and updating your state,
00:06:21.480 | so you get into this really fun situation where we've updated something, but we haven't quite
00:06:25.400 | actually updated it yet, and now we're in this weird in-between state.
00:06:27.960 | And that leads us to the best case of all, race conditions.
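(To make that failure mode concrete, a hypothetical handler sketch, not the speaker's code: the local state update and the outgoing event are two separate, non-transactional steps, so a crash in the marked window leaves them permanently out of sync.)

```python
# Hypothetical event handler with the classic dual-write problem:
# updating local state and publishing a follow-up event are two
# separate, non-transactional steps.
local_db: dict[str, str] = {}  # stand-in for a service-local database
outbox: list[dict] = []        # stand-in for the message bus

def handle_order_event(event: dict) -> None:
    # Step 1: update local state.
    local_db[event["order_id"]] = "processed"

    # <-- If the process crashes here, the database says "processed"
    #     but downstream services never hear about it. On redelivery,
    #     the same event may also be processed twice (no idempotency).

    # Step 2: publish a follow-up event.
    outbox.append({"type": "order.processed", "order_id": event["order_id"]})

handle_order_event({"type": "order.created", "order_id": "42"})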
00:06:31.720 | So now we don't know what's actually happening.
00:06:33.880 | And now we get paged at 2:00 a.m. in the morning to deal with the hardest bugs of all.
00:06:38.360 | And we now either have to dig through logs to debug this, or we can do what some companies do,
00:06:42.920 | and we can just push it off to customer success and be like, oh, they'll just reset the system
00:06:46.120 | whenever it crashes. That'll work, right? We'll just roll back the database.
00:06:48.840 | I don't think that's a good thing. All in all, I could go on hours and hours about
00:06:54.920 | why this approach doesn't work, but it leads to one thing, and I'm going to ruffle some feathers,
00:06:58.600 | so we'll do this lightly. EDA is a tightly coupled system. Now, I know everyone's going to go,
00:07:03.800 | tightly coupled. Whoa! We've been told that event-driven architectures are loosely coupled.
00:07:09.160 | That's what I was taught in school. It's what Claude told me this morning when I asked it.
00:07:12.600 | And it is true. They are loosely coupled, but they're loosely coupled at runtime.
00:07:17.400 | They are not loosely coupled at design time. We conflate the two. We think that because
00:07:22.040 | they're loosely coupled at runtime, because we can have a service go down and the other services
00:07:26.120 | can keep running, that that loose coupling translates into design time.
00:07:30.440 | And that is not the case. That is not how this works. So imagine that we had this discussion.
00:07:35.160 | Engineer A comes up and says, hey, I've refactored all the code to make everything
00:07:39.320 | more loosely coupled. And you go, wow, that's great. How did you do it?
00:07:42.280 | Well, I turned all of my local variables into global variables.
00:07:44.840 | Oh, great. Why on earth did you do that? Well, now we can add new code,
00:07:49.800 | and everybody can just read from them, and we don't have to worry about telling anybody
00:07:52.760 | when we update it. That sounds insane, right? That's what we do with events.
00:07:57.800 | So that's what's happening. It's just wild that we think that's a good idea.
00:08:05.000 | And it's the same logic. We can just read from that event until somebody decides to
00:08:10.440 | change the format, because you were reading from an event that somebody didn't know you were reading.
00:08:14.280 | And now somebody else three services downstream from you, who you didn't know was reading your
00:08:17.960 | event. You broke their system because you updated the event. And I can hear people's pagers going off
00:08:23.000 | as I'm saying this. Been there, done that. And what it does is it leads to people not being willing to
00:08:28.760 | iterate on their architecture or their design because they're scared if they touch the magic event,
00:08:33.400 | everything will crumble to the ground. So this is what basically happens. So now what? Where do we go?
00:08:37.960 | I propose we need to reorient ourselves. We need to take a step back and go, is this the actual proper
00:08:44.440 | center? Take a step back and see what the frame is. So let's put the right thing at the center.
00:08:48.120 | I believe this is durable execution. So what is durable execution? It's kind of a new category of software.
00:08:55.240 | It's called crash-proof execution. And basically, it enables developers to write software with less effort.
00:09:01.240 | It allows them to focus on the application, on what the application should achieve,
00:09:04.920 | instead of trying to anticipate or mitigate everything that could possibly go wrong.
00:09:08.680 | This in turn will accelerate your development. And basically, it turns out that failures are
00:09:08.680 | inevitable, but durable execution makes them inconsequential. So here are the four characteristics
00:09:17.400 | that we've come up with to define durable execution. Durable execution applications automatically
00:09:22.440 | preserve your application state. So in a typical application, a crash will cause you to lose all your
00:09:30.120 | variable state. Everything is lost. And developers will typically make a cache, a Redis cache, a local
00:09:35.160 | database, something of that nature, to back all of this up to so that we can rebuild the state if it
00:09:39.160 | crashes. In a durable execution system, you automatically get the saving of the state out of the box.
00:09:45.080 | Just automatically, all your local variables, all of your function calls, the inputs, outputs, returns,
00:09:48.920 | all of it is stored automatically for you. Because of this, this allows us to virtualize the execution.
00:09:55.320 | So execution usually takes place in a single process on a single machine, and basically
00:10:00.360 | will immediately end if that process crashes for any reason. Durable execution can happen
00:10:05.320 | across a series of processes across multiple machines. If the current process fails,
00:10:10.120 | durable execution will basically take the state that it saved, restart the execution, and resume
00:10:15.160 | execution from the last known save point, and continue on, very often without the developer
00:10:20.520 | ever being aware that this even happened. This is also not limited by time.
00:10:25.800 | Because durable execution can survive a crash, it enables it to run for as long as you like.
00:10:30.440 | So it's, you know, most people would never even think about putting sleep for 30 days inside your code.
00:10:36.760 | That's a totally valid and 100% achievable goal within a durable execution system, because it just comes
00:10:42.120 | back online whenever it's ready to be run. It will survive that crash. I can sleep for 30 days. If it crashes,
00:10:47.560 | doesn't matter. Bring it back online. Resume the timer. Continue forward.
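(A minimal sketch of that idea, assuming Temporal's Python SDK and hypothetical names: inside a workflow, an ordinary asyncio.sleep becomes a durable, server-backed timer, so a 30-day sleep is a perfectly normal line of code.)

```python
# Sketch (hypothetical names) of a durable workflow using Temporal's Python SDK.
# Local variables and the 30-day timer are preserved by the platform: if the
# worker process dies, the workflow is replayed to this point and the timer
# resumes rather than starting over.
import asyncio
from datetime import timedelta
from temporalio import workflow


@workflow.defn
class TrialExpiryWorkflow:
    @workflow.run
    async def run(self, user_id: str) -> str:
        note = f"trial started for {user_id}"  # ordinary local state, saved automatically

        # Durable timer: inside a workflow, asyncio.sleep is backed by the server,
        # so sleeping for 30 days is a valid thing to write here.
        await asyncio.sleep(timedelta(days=30).total_seconds())

        return f"{note}; trial expired"
```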
00:10:53.160 | And it's hardware agnostic. We, in the past, have tried to solve some of these problems with fault-tolerant hardware. We pay a lot
00:10:59.000 | of money to be able to hot-swap out our CPUs, to be able to hot-swap out memory, all of that stuff.
00:11:02.920 | Durable execution is completely hardware agnostic. It can run anywhere. We ran it on a Raspberry Pi and
00:11:07.160 | shipped it into outer space. Check out our YouTube channel. You can run it wherever you want to run it.
00:11:11.960 | It builds reliability into the software side, not the hardware side. And it requires no actual specific
00:11:17.160 | hardware. And it runs natively anywhere. And it overcomes all of these issues. So this is what a
00:11:23.560 | durable execution agent architecture would look like. We still have our chat UI. We have a durable
00:11:28.520 | workflow. We automatically get retries. Whenever a failure happens, whenever we're talking out to our
00:11:33.400 | LLMs or to our tools, if the LLM goes down or we're getting rate limited: automatic retries.
00:11:38.600 | You don't even have to develop that. You make a function call, and that function will automatically
00:11:41.400 | retry until it succeeds. And it doesn't matter if this thing crashes in the
00:11:45.560 | middle of it; it automatically reconstructs the state and continues on as if this had never happened.
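(Again a hedged sketch rather than the speaker's code, using Temporal's Python SDK with illustrative names and settings: the LLM call lives in an activity, and the automatic retries described above come from the retry policy on the activity invocation.)

```python
# Hedged sketch of a durable agent step with Temporal's Python SDK.
# The LLM call runs as an activity; failures (timeouts, rate limits) are
# retried automatically according to the retry policy, and a worker crash
# mid-run is recovered by replaying the workflow from saved history.
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def call_llm(prompt: str) -> str:
    # A real implementation would call an LLM API here; any raised
    # exception (rate limit, network error) triggers an automatic retry.
    return f"(model reply to: {prompt})"


@workflow.defn
class ChatAgentWorkflow:
    @workflow.run
    async def run(self, user_message: str) -> str:
        reply = await workflow.execute_activity(
            call_llm,
            user_message,
            start_to_close_timeout=timedelta(minutes=2),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=1),
                maximum_attempts=0,  # 0 means retry until it succeeds
            ),
        )
        return reply
```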
00:11:49.240 | We can move state out to longer-term storage, for audit purposes, at a predefined time or whenever we
00:11:55.800 | decide we want to close the execution. This is basically a much simpler diagram than what we had
00:12:02.920 | earlier. And now the developer doesn't have to think at all about managing queues or events or any of
00:12:08.360 | those things. They can focus solely on the business logic. So Temporal provides durable execution.
00:12:14.520 | It's an open-source MIT-licensed product. We're here at the booth today, in the hall. And you can
00:12:20.840 | come and visit us. And it currently supports seven programming language SDKs: Go, Python, TypeScript,
00:12:27.080 | Ruby, .NET, Java, and PHP. There will be another one coming later this year. If you looked into it,
00:12:32.120 | it's not that hard to figure out which one it is. And the interesting thing about durable execution is
00:12:36.200 | all of these are natively polyglot. So I can call a function written in Ruby from code written in
00:12:41.400 | Go by basically just providing it with the function name and the input parameters. Because, here's the
00:12:45.800 | dirty secret: it's still events under the hood. But what did we do? Why is this a thing? We've abstracted
00:12:52.360 | the complexities away into the platform layer. Software engineering, as a vocation,
00:12:57.480 | if you look back over the 50, 60, 70 years it has existed, going back to,
00:13:01.400 | you know, the 1950s: we have made most of our advancements in programming languages as we
00:13:08.040 | have abstracted the complexities away from the programmer and into the programming language.
00:13:12.840 | None of us are sitting here writing assembly code anymore, thankfully. I mean, some of you might be,
00:13:16.680 | and good on you. I'm not. Fortran gave us mathematical operations. We don't have to write
00:13:21.560 | assembly language anymore and store things manually in registers. ALGOL 60 and Pascal gave us if,
00:13:26.360 | then, else, and structured control flow concepts. We're not writing go-tos and jumps in our code
00:13:30.680 | anymore. Lisp gave us memory management and garbage collection. Simula and Smalltalk gave us object
00:13:35.880 | oriented programming. And this just continues on and on. We've continually abstracted away complexity.
00:13:40.280 | Durable execution is the next foray into this. We are abstracting away events and the complexity of
00:13:46.120 | events into the software layer and removing that from anyone having to worry about it. So you no longer have
00:13:51.080 | to worry about your queues or any of that stuff. And I leave you with a meme. So all AI is just
00:13:59.400 | distributed systems under the hood. If you are calling out across the network, you're a distributed
00:14:05.000 | system. And you basically need to handle that. If you want to come and learn more, Temporal is in the
00:14:12.200 | hallway, or in the booth, sorry, in the expo hall. I will be there literally as soon as this talk is
00:14:17.000 | over. So you're welcome to come by. We have a durable agent demo running. Come by. Try to break my demo.
00:14:22.040 | Turn off my computer. Turn off my laptop. I guarantee you it'll still keep running when it's done. I'll
00:14:27.000 | show you how all of this works. You can also chat with me in Slack. We have a community Slack channel,
00:14:31.240 | and we have a newsletter if you're interested. Thank you very much.