Events are the Wrong Abstraction for Your AI Agents - Mason Egger, Temporal.io

00:00:07.280 |
are the wrong abstraction for your AI agents. 00:00:37.560 |
and this is how celestial objects move around the Earth. 00:00:40.280 |
And this was used to kind of calculate celestial trajectories 00:00:58.560 |
and it greatly simplified how we view our world 00:01:03.040 |
This allowed people to start thinking about the laws of nature. 00:01:05.160 |
We got the laws of gravity and a lot of different things 00:01:07.800 |
because of us being able to recenter our focus 00:01:10.560 |
on how we look at the way that we're looking at things, 00:01:15.480 |
And basically, a whole series of new discoveries 00:01:17.720 |
and developments were made simply by shifting our focus 00:01:23.360 |
Now, it's interesting to note that both of these are actually accurate. 00:01:27.920 |
However, it's interesting that you have to take the right frame of mind 00:01:35.120 |
what tools do you need to use, what software are you using, 00:01:37.840 |
and determine how do we want to use this as a reference. 00:01:40.880 |
So, for example, if we're looking at something like a moon trajectory, 00:01:43.520 |
it might be useful to use the Earth as the frame of reference here. 00:01:46.400 |
But if you're thinking about something like the planets 00:01:48.200 |
and how planetary objects move, then it's probably more useful 00:01:50.760 |
to put the sun at the center of your ecosystem. 00:01:52.760 |
So you're probably wondering why I'm talking about this, 00:02:00.200 |
And ensuring that the software is available and scalable 00:02:05.240 |
Scaling our AI agents is actually no different 00:02:11.400 |
At the end of the day, this is all just distributed systems, 00:02:13.400 |
which is great because we're not solving any new problems here. 00:02:16.280 |
We're solving the same problem that we've been solving 00:02:20.280 |
All we did is we added a different label to it. 00:02:27.160 |
So I asked Claude this morning to give me a definition 00:02:33.400 |
It said, "Event-driven architecture is a software design pattern 00:02:36.600 |
where components communicate by producing and consuming events, 00:02:39.720 |
allowing for loose coupling and asynchronous processing 00:02:42.120 |
where systems react to state changes or occurrences 00:02:46.280 |
That's a pretty fair thing, and I expect Claude to get that right. 00:02:52.680 |
But let's look at, say, what an event-driven AI architecture would potentially look like. 00:02:57.640 |
So we have a pretty fairly decent design, if I do say so myself. 00:03:06.760 |
So we have some cron jobs here to handle inactive chat sessions, 00:03:09.800 |
basically acting as a garbage collector for a whole bunch of things. 00:03:12.520 |
We have a message bus where events can be published and ingested by the various tools 00:03:16.360 |
that we are using, different LLMs, all of those things. 00:03:18.680 |
We have a dead letter queue for handling tasks that have failed, 00:03:25.000 |
This is a pretty straightforward EDA-based architecture. 00:03:29.320 |
The question is, though, how much of this in here is actually the core business logic of your 00:03:34.360 |
How much of this is actually what we're trying to solve 00:03:37.400 |
versus trying to make sure that the darn thing doesn't break? 00:03:42.040 |
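As a rough illustration of the architecture just described, here is a minimal in-memory sketch of a message bus with a dead letter queue. Every name in it (`MessageBus`, `subscribe`, `publish`, `max_attempts`, `call_llm`, `flaky_tool`) is hypothetical and not taken from the talk's diagram or from any real broker library. It is only meant to show how much of the code ends up being delivery plumbing compared with the one stubbed line of agent logic.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class MessageBus:
    """Toy in-memory bus (hypothetical); a real system would use a broker."""

    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self.handlers: dict[str, list[Handler]] = defaultdict(list)
        # Dead letter queue: (event name, payload, last error) for failed deliveries.
        self.dead_letters: list[tuple[str, dict, str]] = []

    def subscribe(self, event_name: str, handler: Handler) -> None:
        self.handlers[event_name].append(handler)

    def publish(self, event_name: str, payload: dict) -> None:
        for handler in self.handlers[event_name]:
            self._deliver(event_name, payload, handler)

    def _deliver(self, event_name: str, payload: dict, handler: Handler) -> None:
        last_error = ""
        for _ in range(self.max_attempts):
            try:
                handler(payload)
                return
            except Exception as exc:  # retry any handler failure
                last_error = repr(exc)
        # All attempts failed: park the event for later inspection.
        self.dead_letters.append((event_name, payload, last_error))


replies: list[str] = []


def call_llm(payload: dict) -> None:
    # The only "business logic" in this sketch (stubbed, no real LLM call).
    replies.append(f"echo: {payload['text']}")


def flaky_tool(payload: dict) -> None:
    # Stand-in for a tool that is always down.
    raise TimeoutError("tool unavailable")


bus = MessageBus()
bus.subscribe("chat.message", call_llm)
bus.subscribe("chat.message", flaky_tool)
bus.publish("chat.message", {"text": "hello"})
```

After the publish call, `replies` holds the stubbed LLM response and `bus.dead_letters` holds the one event the failing tool could not process.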
So some of you might be going, hey, this is great. 00:03:48.600 |
There are probably hundreds of thousands of agents deployed into production right now 00:03:57.080 |
But I want to go back to our discussion that we were having earlier. 00:03:59.400 |
Do we have the right thing at the center of our ecosystem? 00:04:03.640 |
Are we looking at this through the right frame of mind? 00:04:06.200 |
And I'm here to say that I don't think we do. 00:04:08.040 |
We have built all of our applications in the modern world with events as the center of our 00:04:14.200 |
universe instead of the core logic, the core foundation of what we're trying to solve. 00:04:17.880 |
And if you look at that diagram that I previously had, there are more parts about handling the 00:04:22.680 |
events than there was the core logic of the actual application. 00:04:29.560 |
And I've seen applications with 100 lines of code, dead simple logic that have brought entire 00:04:35.000 |
enterprises down, large traveling industries that begin with the letter E, because of mismanaged 00:04:41.560 |
So find me after the talk, I'll tell you about my horror stories. 00:04:48.440 |
Well, we don't get APIs in an EDA architecture. 00:04:51.800 |
We sacrifice clear, well-defined APIs when we adopt events. 00:04:56.120 |
The events lack documentation and structure that all of our APIs that we've spent all these years 00:05:01.560 |
And yes, there is an AsyncAPI spec, but that really discusses the formats of the messages. 00:05:08.200 |
It's just what is produced and what is consumed. 00:05:10.040 |
It doesn't really give us much more than that. 00:05:11.720 |
And as we all know, developers are great at documentation. 00:05:14.600 |
That's why there's so many of y'all working on documentation AI tooling right now. 00:05:22.520 |
Our business logic now becomes fragmented and spread out across thousands of different services. 00:05:28.680 |
Instead of having a bug and going, "Oh, let me open the file and see what that is," 00:05:33.720 |
Now I have to grep across the code base for the event name to figure out who called what, 00:05:37.880 |
And if it gets even worse, I have to run the thing to figure out how it failed. 00:05:42.760 |
That's worst case scenario, but I've seen it happen. 00:05:44.600 |
It's like, I have no idea what this is doing, but if I run it, I can figure it out. 00:05:47.720 |
I'm like, that's like saying we don't know if the car is going to break when we crash it, 00:05:51.640 |
so let's just slam it into the wall and see what happens. 00:05:56.680 |
And now every single one of our services is now an ad hoc state machine. 00:05:59.800 |
And I know that CS majors love implementing state machines, and I'll tell you this, 00:06:04.760 |
the only reason they love them so much is because they've never been paged for one. 00:06:07.240 |
Once you get paged for one, you stop loving your state machines really quick. 00:06:11.160 |
Services now have local databases and local caches to make sure that things aren't failing. 00:06:16.440 |
In many cases, there's no transactions between the message and when you've updated your state, 00:06:21.480 |
so you get into this really fun state where we've updated something, but we haven't quite 00:06:25.400 |
actually updated it yet, and now we've got into this weird case. 00:06:27.960 |
And that leads us to the best case of all, race conditions. 00:06:31.720 |
So now we don't know what's actually happening. 00:06:33.880 |
And now we get paged at 2:00 a.m. to deal with the hardest bugs of all. 00:06:38.360 |
And we now either have to write logs to do this, or we can do what some companies do, 00:06:42.920 |
and we can just push it off to customer success and be like, oh, they'll just reset the system 00:06:46.120 |
whenever it crashes. That'll work, right? We'll just roll back the database. 00:06:48.840 |
I don't think that's a good thing. All in all, I could go on hours and hours about 00:06:54.920 |
why this approach doesn't work, but it leads to one thing, and I'm going to ruffle some feathers, 00:06:58.600 |
so we'll do this lightly. EDA is a tightly coupled system. Now, I know everyone's going to go, 00:07:03.800 |
tightly coupled. Whoa! We've been told that event-driven architectures are loosely coupled. 00:07:09.160 |
That's what I was taught in school. It's what Claude told me this morning when I asked it. 00:07:12.600 |
And it is true. They are loosely coupled, but they're loosely coupled at runtime. 00:07:17.400 |
They are not loosely coupled at design time. We conflate the two. We think that because 00:07:22.040 |
they're loosely coupled at runtime, because we can have a service go down and the other services 00:07:26.120 |
can keep running, that this translates into design time. 00:07:30.440 |
And that is not the case. That is not how this works. So imagine that we had this discussion. 00:07:35.160 |
Engineer A comes up and says, hey, I've refactored all the code to make everything 00:07:39.320 |
more loosely coupled. And you go, wow, that's great. How did you do it? 00:07:42.280 |
Well, I turned all of my local variables into global variables. 00:07:44.840 |
Oh, great. Why on earth did you do that? Well, now we can add new code, 00:07:49.800 |
and everybody can just read from them, and we don't have to worry about telling anybody 00:07:52.760 |
when we update it. That sounds insane, right? That's what we do with events. 00:07:57.800 |
So that's what's happening. So it's just wild that we think that that's a good use case. 00:08:05.000 |
And it sounds like it's the same logic. And we can just read from that event until somebody decides to 00:08:10.440 |
change the format because you were reading from an event that somebody didn't know you were. 00:08:14.280 |
And now somebody else, three stories downstream from you, that you didn't know was reading your 00:08:17.960 |
event, has a broken system because you updated the event. And I can hear people's pagers going off 00:08:23.000 |
as I'm saying this. Been there, done that. And what it does is it leads to people not being willing to 00:08:28.760 |
iterate on their architecture or their design because they're scared if they touch the magic event, 00:08:33.400 |
everything will crumble to the ground. So this is what basically happens. So now what? Where do we go? 00:08:37.960 |
I propose we need to reorient ourselves. We need to take a step back and go, is this the actual proper 00:08:44.440 |
center? Take a step back and see what the frame is. So let's put the right thing at the center. 00:08:48.120 |
I believe this is durable execution. So what is durable execution? It's kind of a new category of software. 00:08:55.240 |
It's called crash-proof execution. And basically, it enables developers to write software with less effort. 00:09:01.240 |
It allows them to focus on the application, on what the application should achieve, 00:09:04.920 |
instead of trying to anticipate or mitigate everything that could possibly go wrong. 00:09:08.680 |
This in turn will accelerate your development. And basically, it turns out that failures are 00:09:12.920 |
inevitable, but durable execution makes them inconsequential. So here are the four characteristics 00:09:17.400 |
that we've come up with to define durable execution. Durable execution applications automatically 00:09:22.440 |
preserve your application state. So in a typical application, a crash will cause you to lose all your 00:09:30.120 |
variable state. Everything is lost. And developers will typically make a cache, a Redis cache, a local 00:09:35.160 |
database, something of that nature to save all this back up to so that we can rebuild the state if it 00:09:39.160 |
crashes. In a durable execution system, you automatically get the saving of the state out of the box. 00:09:45.080 |
Just automatically, all your local variables, all of your function calls, the inputs, outputs, returns, 00:09:48.920 |
all of it is stored automatically for you. This allows us to virtualize the execution. 00:09:55.320 |
So execution usually takes place in a single process on a single machine, and basically 00:10:00.360 |
will immediately end if that process crashes for any reason. Durable execution can happen 00:10:05.320 |
across a series of processes across multiple machines. If the current process fails, 00:10:10.120 |
durable execution will basically take that state that it saved, restart the execution, and resume 00:10:15.160 |
from the point of the last known save point, and continue on, very often without the developer 00:10:20.520 |
ever even being aware that this happened. This is also not limited by time. 00:10:25.800 |
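The save-and-resume behavior described above can be sketched in a few lines. This is a toy illustration of the general idea only, not how Temporal or any other durable execution engine is actually implemented. The `Journal` class, its `step` method, and the workflow functions are all hypothetical. Each completed step's result is written to a file, so a fresh process that reruns the same function picks up from the last save point instead of redoing finished work.

```python
import json
import os
import tempfile


class Journal:
    """Hypothetical journal: persists each completed step's result."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.results: dict[str, object] = {}
        if os.path.exists(path):
            with open(path) as f:
                self.results = json.load(f)

    def step(self, name: str, fn):
        if name in self.results:
            # Completed before the crash: replay the saved result.
            return self.results[name]
        result = fn()
        self.results[name] = result
        with open(self.path, "w") as f:  # save point
            json.dump(self.results, f)
        return result


calls: list[str] = []  # records which steps really executed


def fetch_context() -> str:
    calls.append("fetch_context")
    return "context"


def ask_model(crash: bool) -> str:
    calls.append("ask_model")
    if crash:
        raise RuntimeError("process died")  # simulated crash
    return "answer"


def workflow(journal: Journal, crash: bool) -> str:
    ctx = journal.step("fetch_context", fetch_context)
    answer = journal.step("ask_model", lambda: ask_model(crash))
    return f"{ctx} -> {answer}"


path = os.path.join(tempfile.mkdtemp(), "journal.json")
try:
    workflow(Journal(path), crash=True)  # first attempt dies mid-run
except RuntimeError:
    pass
result = workflow(Journal(path), crash=False)  # a new "process" resumes
```

On the second run `fetch_context` is not executed again; its result comes from the journal, and only the step that failed is retried.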
Because durable execution can survive a crash, it enables it to run for as long as you like. 00:10:30.440 |
So it's, you know, most people would never even think about putting sleep for 30 days inside your code. 00:10:36.760 |
That's a totally valid and 100% achievable goal within a durable execution system, because it just comes 00:10:42.120 |
back online whenever it's ready to be run. It will survive that crash. I can sleep for 30 days. If it crashes, 00:10:47.560 |
doesn't matter. Bring it back online. Resume the timer. Continue forward. And it's hardware agnostic. 00:10:53.160 |
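One way to picture a sleep that survives a crash is to persist the wake-up deadline rather than rely on an in-process timer. The sketch below assumes exactly that. The `remaining_sleep` helper and the JSON file layout are made up for illustration and are not part of any real SDK, and the clock is passed in as a parameter so the example runs instantly.

```python
import json
import os
import tempfile


def remaining_sleep(path: str, duration_s: float, now: float) -> float:
    """Return seconds left on a timer whose deadline is persisted at `path`."""
    if os.path.exists(path):
        # Restarted after a crash: reuse the deadline saved earlier.
        with open(path) as f:
            deadline = json.load(f)["deadline"]
    else:
        # First run: compute the deadline once and persist it.
        deadline = now + duration_s
        with open(path, "w") as f:
            json.dump({"deadline": deadline}, f)
    return max(0.0, deadline - now)


THIRTY_DAYS = 30 * 24 * 60 * 60
path = os.path.join(tempfile.mkdtemp(), "timer.json")

start = 1_000_000.0  # arbitrary fake clock value, in seconds
first = remaining_sleep(path, THIRTY_DAYS, now=start)

# Simulated crash and restart ten days later: the timer resumes, not restarts.
after_restart = remaining_sleep(path, THIRTY_DAYS, now=start + 10 * 86400)
```

Here `first` is the full 30 days in seconds and `after_restart` is the 20 days that are left, because the deadline came from storage and not from process memory.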
We, in the past, have tried to solve some of these problems with fault-tolerant hardware. We pay a lot 00:10:59.000 |
of money to be able to hot-swap out our CPUs, to be able to hot-swap out memory, all of that stuff. 00:11:02.920 |
Durable execution is completely hardware agnostic. It can run anywhere. We ran it on a Raspberry Pi and 00:11:07.160 |
shipped it into outer space. Check out our YouTube channel. You can run it wherever you want to run it. 00:11:11.960 |
It builds reliability into the software side, not the hardware side. And it requires no actual specific 00:11:17.160 |
hardware. And it runs natively anywhere. And it overcomes all of these issues. So this is what a 00:11:23.560 |
durable execution agent architecture would look like. We still have our chat UI. We have a durable 00:11:28.520 |
workflow. We automatically get retries. Whenever a failure happens, whenever we're talking out to our 00:11:33.400 |
LLMs, we're talking out to our tools, if the LLM goes down, we're getting rate limited, automatic retries. 00:11:38.600 |
You don't even have to develop that. You call a function call, that function will automatically 00:11:41.400 |
retry until that thing becomes successful. It doesn't matter. If this thing crashes in the 00:11:45.560 |
middle of it, automatically reconstructs the state, continues on as if this had never happened. 00:11:49.240 |
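For comparison, this is roughly the retry code a developer would otherwise write by hand around every LLM or tool call. It is a minimal sketch with exponential backoff. `call_with_retries`, `RateLimited`, and `flaky_llm_call` are hypothetical names, and unlike the retry-until-success behavior described above, this version gives up after `max_attempts`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error standing in for an HTTP 429 from an LLM provider."""


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 5,
    initial_delay_s: float = 0.01,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error
            sleep(delay)
            delay *= backoff  # wait longer before the next attempt
    raise AssertionError("unreachable when max_attempts >= 1")


attempts = {"n": 0}


def flaky_llm_call() -> str:
    # Fails twice with a rate limit, then succeeds.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited("429")
    return "completion"


delays: list[float] = []  # capture the waits instead of really sleeping
result = call_with_retries(flaky_llm_call, sleep=delays.append)
```

The `sleep` parameter is injectable so the waits can be recorded without slowing the example down. In a durable execution system this wrapper is the part the platform is said to take over.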
We can persist our longer-lived state to storage, for audit purposes, at a predefined time or whenever we 00:11:55.800 |
decide we want to close the execution. This is basically a much simpler diagram than what we had 00:12:02.920 |
earlier. And now the developer doesn't have to focus at all on managing queues or events or any of 00:12:08.360 |
those things. They can focus solely on the business logic. So Temporal provides durable execution. 00:12:14.520 |
It's an open-source MIT-licensed product. We're here at the booth today, in the hall. And you can 00:12:20.840 |
come and visit us. And it supports currently seven programming language SDKs. So Go, Python, TypeScript, 00:12:27.080 |
Ruby, .NET, Java, and PHP. There will be another one coming later this year. If you looked into it, 00:12:32.120 |
it's not that hard to figure out which one it is. And the interesting thing about durable execution is 00:12:36.200 |
all of these are natively polyglot. So I can call a function written in Ruby from code written in 00:12:41.400 |
Go with basically just providing it with the function name and the input parameters. Because, here's the 00:12:45.800 |
dirty secret, it's still events under the hood. But what did we do? Why is this a thing? We've abstracted 00:12:52.360 |
the complexities away from the platform layer. Software engineering, as a vocation, as a history, 00:12:57.480 |
if you look back over the 50, 60, 70 years that software engineering has existed, we go back to, 00:13:01.400 |
you know, 1950s. We have made most of our advancements in programming languages as we 00:13:08.040 |
have abstracted away the complexities away from the programmer and into the programming language. 00:13:12.840 |
None of us are sitting here writing assembly code anymore, thankfully. I mean, some of you might be, 00:13:16.680 |
and good on you. I'm not. Fortran gave us mathematical operations. We don't have to write 00:13:21.560 |
assembly language anymore and store things manually in registers. ALGOL 60 and Pascal gave us if, 00:13:26.360 |
then, else structure and structured control flow concepts. We're not writing go-tos and jumps in our code 00:13:30.680 |
anymore. Lisp gave us memory management and garbage collection. Simula and Smalltalk gave us object 00:13:35.880 |
oriented programming. And this just continues on and on. We've continually abstracted away complexity. 00:13:40.280 |
Durable execution is the next foray into this. We are abstracting away events and the complexity of 00:13:46.120 |
events into the software layer and removing that from anyone having to worry about it. So you no longer have 00:13:51.080 |
to worry about your queues or any of that stuff. And I leave you with a meme. So all AI is just 00:13:59.400 |
distributed systems under the hood. If you are calling out across the network, you're a distributed 00:14:05.000 |
system. And you basically need to handle that. If you want to come and learn more, Temporal is in the 00:14:12.200 |
hallway, or in the booth, sorry, in the expo hall. I will be there literally as soon as this talk is 00:14:17.000 |
over. So you're welcome to come by. We have a durable agent running demo. Come by. Try to break my demo. 00:14:22.040 |
Turn off my computer. Turn off my laptop. I guarantee you it'll still keep running when it's done. I'll 00:14:27.000 |
show you how all of this works. You can also chat with me in Slack. We have a community Slack channel, 00:14:31.240 |
and we have a newsletter if you're interested. Thank you very much.