12-Factor Agents: Patterns of reliable LLM applications — Dex Horthy, HumanLayer

Who here is building agents? Leave your hand up if you've built ten-plus agents. Anyone here built like a hundred agents? All right, we got a few. Awesome. I love it.

So I think a lot of us have been on this journey of building agents. And what happened with me was: I decided I wanted to build an agent. We figured out what we wanted it to do. We want to move fast. We're developers, so we use libraries; we don't write everything from scratch. And you get it to like 70, 80%. It's enough to get the CEO excited and get six more people added to your team. But then you realize that 70, 80% isn't quite good enough, and that if you want to get past that 70, 80% quality bar, you're seven layers deep in a call stack trying to reverse engineer: how does this prompt get built? How do these tools get passed in? Where does this all come from? And if you're like me, you eventually just throw it all away and start from scratch. Or you may even find out that this is not a good problem for agents.

I remember one of the first agents I tried to build was a DevOps agent. I was like, here's my Makefile, you can run make commands, go build the project. It couldn't figure it out. It did all the things in the wrong order. I'm like, cool, let's fix the prompt. And over the next two hours, I added more and more detail about what everything was and every single step, until I got to the point where I was like, this is the exact order to run the build steps. It was a cool exercise, but at the end of it, I was like, you know, I could have written the bash script to do this in about 90 seconds. Not every problem needs an agent.
So I've been on this journey, and I think a lot of you have been on similar journeys. And what happened was, in trying to help people build better, more reliable agents, I talked to 100-plus founders, builders, and engineers. And I noticed patterns. One was that most production agents weren't that agentic at all; they were mostly just software. But there were these core things that a lot of people were doing, these patterns that were making their LLM-based apps really, really good. And none of them were doing a greenfield rewrite. Rather, they were taking these small, modular concepts that didn't have names and didn't have definitions, and they were applying them to their existing code.

And what's really cool about this: I don't think you need an AI background to do this. This is software engineering 101. Well, probably not 101. But just like Heroku needed to define what it meant to build applications that could run in the cloud (we didn't even call them cloud native back then, but that was the playbook 10 years ago), I decided to put together what I thought would be the 12 factors of AI agents, from everything I've seen working in the field. So we put up this GitHub repo. You can go read it. Turns out a lot of other people agreed and felt the same thing. So we were on the front page of Hacker News all day.
200K impressions on social. I'm just gonna put this one up and make no comment. And just for context, we got to like 4,000 stars in a month or two, and there are 14 active contributors. It's very easy to read that thing and hear this talk and say, "Oh, this is the anti-framework talk." I am not here to bash frameworks. I would think of this as much as anything as a wish list, a list of feature requests: how can we make frameworks serve the needs of really good builders who need really high reliability and still want to move fast?

So what am I here to do? I want you to forget everything you know about agents and rethink, from first principles, how we can apply everything we've learned from software engineering to the practice of building really reliable agents. We're gonna mix the order up a little bit; if you want all 12 factors in order, that's a 30-minute talk, so we're gonna bundle some stuff together. There will be a QR code at the end, and you can go dig through it at your leisure.
Factor one. The most magical thing LLMs can do has nothing to do with loops or switch statements or code or tools or anything. It is turning a sentence like this into JSON that looks like this. It doesn't even matter what you do with that JSON; that's what the other factors are for. But if you're doing that, that's one piece you can bring into your app today.
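To make that concrete, here's a minimal Python sketch of factor one. `call_llm` is a hypothetical stand-in for whatever model client you actually use, and the prompt wording and JSON shape are made up for illustration:

```python
import json

# Hypothetical stand-in for whatever model client you actually use.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your model API")

# The prompt wording and JSON shape here are invented for illustration.
EXTRACT_PROMPT = """Turn the user's message into JSON with exactly two keys,
"intent" and "parameters".

Message: {message}"""

def extract_step(message: str) -> dict:
    # Factor 1: natural language in, structured JSON out.
    return json.loads(call_llm(EXTRACT_PROMPT.format(message=message)))

# extract_step("deploy the backend to staging") might return:
# {"intent": "deploy", "parameters": {"target": "backend", "env": "staging"}}
```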
Factor four. This leads right into... Did anyone read the paper "Go To Statement Considered Harmful," or maybe just heard about it? I never actually read it. But it was all about how we had this abstraction, goto, in the C programming language and a bunch of other programming languages at the time, and the argument was: this thing makes code terrible, it's the wrong abstraction, no one should use it. I'm gonna go out on a limb here and say "tool use" is harmful. And I put it in quotes because I'm not talking about giving an agent access to the world. Obviously, that's super badass. What I think is making things hard is the idea that tool use is this magical thing where an ethereal alien entity is interacting with its environment. Because what is actually happening is: our LLM is putting out JSON, we're gonna give that to some deterministic code that's gonna do something, and then maybe we'll feed the result back. But again, those are other factors. So if you have structures like this, and you can get the LLM to output something that generates them, then you can pass it into a loop like this or a switch statement like this. There's nothing special about tools. It's just JSON and code. That's factor four.
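A rough sketch of that switch statement, with illustrative tool names; the point is that routing the model's JSON is ordinary deterministic code:

```python
# Illustrative tool implementations; in real code these are your own
# deterministic functions.
def fetch_logs(service: str) -> str:
    return f"logs for {service}"

def create_ticket(title: str) -> str:
    return f"created ticket: {title}"

def dispatch(step: dict):
    # The "switch statement": parsed JSON routed through ordinary code.
    match step["intent"]:
        case "fetch_logs":
            return fetch_logs(**step["parameters"])
        case "create_ticket":
            return create_ticket(**step["parameters"])
        case "done_for_now":
            return step.get("message")  # nothing to execute; exit signal
        case _:
            raise ValueError(f"unknown intent: {step['intent']}")
```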
Factor eight. We're gonna do a couple of these bundled together: owning your control flow. And I want to take a step back and talk about how we got here. We've been writing DAGs in software for a long time. If you've written an if statement, you've written a directed graph; code is a graph. You may also be familiar with DAG orchestrators. Anyone ever use Airflow or Prefect or any of these things? This concept of breaking things up into nodes gives you certain reliability guarantees.

But what agents were supposed to do, and I think a lot of people talk about this, and in some cases it's realized, is that you don't have to write the DAG. You just tell the LLM, here's the goal, and the LLM will find its way there. And we model this as a really simple loop: the LLM is determining the next step, and you're building up a context window until the LLM says, "Hey, we're done." So what this looks like in practice is: you have an event come in, you pass it into your prompt, it says it wants to call an API, you get your result, put that on the context window, and pass the whole thing back into the prompt. This is the most naive, simple way of building agents. The LLM is gonna call a couple of steps, and eventually it's gonna say, "Cool, we've done all the tasks from the initial event," which maybe was a user message asking it to do something, or maybe an outage. Then we get our final answer, and our materialized DAG is just these three steps in order.
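That naive loop, sketched in Python, reusing the hypothetical `call_llm` and `dispatch` stubs from the earlier sketches:

```python
import json

def next_step(context: list[dict]) -> dict:
    # Serialize everything so far and ask the model what to do next.
    prompt = ("Here is what has happened so far:\n" + json.dumps(context)
              + "\nWhat is the next step? Respond with JSON.")
    return json.loads(call_llm(prompt))

def naive_agent(event: dict) -> dict:
    context: list[dict] = [event]
    while True:                       # loop until the model says it's done
        step = next_step(context)
        if step["intent"] == "done_for_now":
            return step               # the final answer
        context.append(step)
        context.append({"type": "tool_result", "result": dispatch(step)})
```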
Turns out this doesn't really work, especially when you get to longer workflows. Mostly it's long context windows; there are other reasons you could poke at as well. And people say, "Oh, has anyone put two million tokens into Gemini before to see what happens?" You can do it. You'll get an answer; the API will return you something. But I don't think anyone will argue that you get tighter, better, higher-reliability results that way than you do by controlling and limiting the number of tokens you put in that context window.

So it doesn't quite work, but we're gonna use it as our abstraction to build on. What's an agent, really? You have your prompt, which gives instructions about how to select the next step. You have your switch statement, which takes whatever JSON the model output and does something with it. You have a way of building up your context window. And then you have a loop that determines when and where and how and why you exit. And if you own your control flow, you can do fun things like break and switch and summarize and LLM-as-judge and all this stuff.
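A sketch of what owning that loop can look like; the step budget, `context_too_long`, and `summarize` are illustrative stand-ins for whatever rules you choose:

```python
MAX_STEPS = 20  # arbitrary budget; the point is that it's yours to set

def owned_agent(event: dict) -> dict:
    context: list[dict] = [event]
    for _ in range(MAX_STEPS):            # you decide when to bail out
        if context_too_long(context):     # your threshold (stand-in)
            context = summarize(context)  # e.g. compact older steps (stand-in)
        step = next_step(context)
        if step["intent"] in ("done_for_now", "request_clarification"):
            return step                   # every exit path is explicit
        context.append(step)
        context.append({"type": "tool_result", "result": dispatch(step)})
    return {"intent": "request_clarification",
            "message": "step budget exhausted; need a human"}
```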
And this leads right into how we manage the execution state and the business state of our agents. A lot of tools will give you things like current step, next step, retry counts; DAG orchestrators have these kinds of concepts in them. But you also have your business state: what messages have happened, what data are we displaying to the user, what things are we waiting on approval for? And we want to be able to launch, pause, and resume these things like we do for any standard APIs. This is all just software. So if you can put your agent behind a REST API or an MCP server, and manage that loop in such a way that a normal request comes in and we load that context window into the LLM, we can allow our agent to call a long-running tool. We can interrupt the workflow and serialize that context window straight into a database, because we own the context window; we'll get into that. And when we launch the workflow, eventually it's going to call back with that state ID and the result. We use the state ID to load the state back out of the database, append the result to the context window, and send it right back into the LLM. The agent doesn't even know that things happened in the background. Agents are just software, so let's build software. And building really good ones requires a lot of flexibility, so you really want to own that inner loop of how all that stuff fits together. That's unifying state; that's pause and resume.
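A rough sketch of that launch/pause/resume flow, assuming a generic key-value store `db`, and with `is_long_running` and `launch` as stand-ins for your own dispatch logic:

```python
import json
import uuid

def start(event: dict, db) -> dict:
    state_id = str(uuid.uuid4())
    db.put(state_id, json.dumps([event]))
    return advance(state_id, db)

def advance(state_id: str, db) -> dict:
    context = json.loads(db.get(state_id))     # load the context window
    step = next_step(context)
    context.append(step)
    if step["intent"] == "done_for_now":
        db.put(state_id, json.dumps(context))
        return {"status": "done", "result": step}
    if is_long_running(step):                  # e.g. human approval (stand-in)
        db.put(state_id, json.dumps(context))  # serialize and stop here
        launch(step, callback_id=state_id)     # kick it off asynchronously
        return {"status": "paused", "state_id": state_id}
    context.append({"type": "tool_result", "result": dispatch(step)})
    db.put(state_id, json.dumps(context))
    return advance(state_id, db)

def on_callback(state_id: str, result: dict, db) -> dict:
    # The long-running tool calls back with the state ID and its result.
    context = json.loads(db.get(state_id))
    context.append({"type": "tool_result", "result": result})
    db.put(state_id, json.dumps(context))
    return advance(state_id, db)               # the agent never noticed
```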
Factor two. This one, I think, is the one most people find first: you really want to own your prompts. There are some good abstractions where, if you don't want to spend a lot of time handwriting a prompt, you can put stuff in and you'll get out a really good set of primitives and a really good prompt. This will make you a banger prompt, one you'd have to go to prompt school for three months to write yourself. But eventually, if you want to get past some quality bar, you're going to end up writing every single token by hand. Because LLMs are pure functions, the only thing that determines the reliability of your agent is how good the tokens you get out are. And the only thing that determines the tokens you get out, other than retraining your own model or something like that, is being really careful about what tokens you put in. I don't know what's better. I don't know how you want to build your prompt. But I know that the more things you can try, the more knobs you can test, and the more things you can evaluate, the more likely you are to find something really, really good.
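For illustration, an owned prompt is just a function in your codebase; the wording below is invented, and that's the point, you can change any token of it:

```python
# The prompt as plain code you own: every token is visible, diffable,
# and testable. The wording is purely illustrative.
def next_step_prompt(thread: str) -> str:
    return f"""You are managing a deployment workflow.

Here is everything that has happened so far:
{thread}

Choose exactly one next step. Respond with JSON:
{{"intent": "...", "parameters": {{...}}}}"""
```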
That's owning your prompts. You also want to own how you build your context window. You can use the standard OpenAI messages format, or, in this moment where you're telling the LLM to pick the next step, your only job is to tell it what's happened so far. You can put all that information however you want into a single user message and ask, "Hey, what happens next?" Or put it in the system message. You can model your event state and your thread model however you want, and stringify it however you want. Some of the traces we use in the agents we build internally (I'll get into that in a sec) might look like this. But if you're not looking at every single token, and if you're not optimizing the density and the clarity of the way you're passing information to the LLM, you might be missing out on quality upside. LLMs are pure functions: tokens in, tokens out. Everything, everything in making agents good is context engineering. Your prompt, your memory, your RAG, your history structure: it's all just, how do we get the right tokens into the model so it gives us a really good answer and solves the user's problem (solves my problem, mostly)? I don't know what's better, but I know you want to try everything. So that's owning your context building.
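One illustrative way to stringify a thread into a single dense message; the tag format here is made up:

```python
# Pack the whole thread into one dense user message rather than the
# standard messages array. You control density and clarity.
def stringify_thread(events: list[dict]) -> str:
    lines = []
    for e in events:
        body = ", ".join(f"{k}={v}" for k, v in e["data"].items())
        lines.append(f"<{e['type']}>{body}</{e['type']}>")
    return "\n".join(lines)

# stringify_thread([...]) might produce something like:
# <slack_message>user=alice, text=deploy the backend</slack_message>
# <deploy_backend>env=staging</deploy_backend>
# <deploy_result>status=ok</deploy_result>
```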
This one's a little controversial; that's why it's a standalone factor, and the way you make it good is by integrating it with the other factors. When the model screws up, when it calls an API wrong or calls an API that's down, you could take the tool call it made, grab the error that was associated with it, put that on the context window, and have it try again. Anyone ever had a bad time with this? Seen this thing just spin out, go crazy, lose context, and get stuck? That's why you need to own your context window. Don't just blindly put things on it. If you have errors and then you get a valid tool call, clear all the pending errors out. Summarize them. Don't put the whole stack trace on your context. Figure out what you want to tell the model, so you get better results.
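A small sketch of that error-compaction rule, with an illustrative event shape:

```python
def record_result(context: list[dict], step: dict,
                  error: Exception | None = None, result=None) -> None:
    if error is not None:
        # One dense line, not the whole stack trace.
        context.append({"type": "tool_error",
                        "tool": step["intent"],
                        "summary": f"{type(error).__name__}: {error}"})
    else:
        # A valid call succeeded: pending errors are stale, clear them out.
        context[:] = [e for e in context if e.get("type") != "tool_error"]
        context.append({"type": "tool_result",
                        "tool": step["intent"],
                        "result": result})
```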
Contacting humans with tools. This one's a little subtle, but it's what I've seen in the wild. Almost everybody is avoiding this very important choice at the very beginning of the output, where the model is deciding between a tool call and a message to the human. If you can push that choice into a natural-language token, you can, one, give the model different ways to end the loop: I'm done, or I need clarification, or I need to talk to a manager, or whatever it is. And two, you push the intent, on that first token generation and its sampling, into something that is natural language the model understands. So your traces might look like this, if you're pulling in human input here. This lets you build outer-loop agents. I'm not going to talk about that here; if you go on the site, there's a link to a post, and I've written a lot about it. I don't know what's better, but you should probably try everything. That's contacting humans with tools.
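Sketched as code: contacting a human is just another intent in the same JSON schema, so the model commits to the tool-versus-human choice in its first generated tokens. The intent names are illustrative, not a fixed spec:

```python
# Intent names are illustrative, not a fixed spec.
HUMAN_INTENTS = {"done_for_now", "request_clarification", "request_approval"}

def route(step: dict):
    if step["intent"] in HUMAN_INTENTS:
        return send_to_human(step)  # Slack, email, SMS... (stand-in)
    return dispatch(step)           # otherwise it's a normal tool call
```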
It goes right along with: trigger things from anywhere, and meet users exactly where they are. People don't want to have seven tabs open with different ChatGPT-style agents. Just let people email with the agents you're building. Let them Slack with the agents you're building. Discord, SMS, whatever it is. We see this taking off all over the place.

And you should have small, focused agents. So we talked about this structure and why it doesn't really work. So what does work?
The thing people are doing that works really well is microagents. You still have a mostly deterministic DAG, and you have these very small agent loops with three to ten steps. We do this at HumanLayer. We have a bot that manages our deployments. Most of our deploy pipeline is deterministic CI/CD code. But when we get to the point where the GitHub PR is merged and the tests are passing on development, we send it to a model. We say, get this thing deployed. It says, cool, I'm going to deploy the front end. Then you can send that to a human, and the human says, actually, no, do the back end first. This is taking natural language and turning it into JSON that is the next step in our workflow. The back end gets proposed, it gets approved, it gets deployed. Then the agent knows, okay, I have to go back and deploy the front end. Once that's all done and successful, we go right back out into deterministic code: now we run the end-to-end tests against prod. Otherwise, we hand it off to a little rollback agent that's very similar on the inside. I'm not going to go into it, but here it is working in our Slack channel.
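A sketch of that shape, with all the helper names invented: deterministic pipeline code on the outside, a small agent loop only where judgment is needed:

```python
def deploy_pipeline(pr) -> None:
    run_ci(pr)                      # deterministic CI/CD (stand-in)
    if not tests_pass(pr):          # deterministic gate (stand-in)
        return
    outcome = deploy_agent(pr)      # a tiny 3-10 step agent loop, with
                                    # human approval on each proposed step
    if outcome == "success":
        run_e2e_tests("prod")       # back to deterministic code
    else:
        rollback_agent(pr)          # another small, similar agent
```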
Yeah: 100 tools, 20 steps, easy. Manageable context, clear responsibilities. A lot of people say, what if LLMs keep getting smarter? What if I can put two million tokens in and it can just do it? And I think we very much will see something like this, where you start with a mostly deterministic workflow and you start sprinkling LLMs into your code, into your back end, into your logic. Over time, the LLMs are able to do bigger, more complex tasks, until this whole API endpoint or pipeline or whatever it is is just run by an agent. That's great. But you still want to know how to engineer these things to get the best quality. This is from someone on NotebookLM, and it's basically their take, and I think they did this well: find something that is right at the boundary of what the model can do reliably, something it can't get right every time. And if you can figure out how to get it right reliably anyway, because you've engineered reliability into your system, then you will have created something magical, something better than what everybody else is building. So that's small, focused agents.
There's a meme here about stateless reducers. Someone actually tweeted at me that it's not a reducer, it's a transducer, because there are multiple steps. But basically: agents should be stateless. You should own the state and manage it however you want.
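A sketch of that reducer view, where each step is a pure function from state and event to new state:

```python
from functools import reduce

# Each step is a pure function from (state, event) to new state; the
# caller owns persistence. (Strictly a fold across multiple steps,
# hence the "transducer" quibble.)
def reduce_step(context: list[dict], event: dict) -> list[dict]:
    return context + [event]  # no hidden state inside the agent

# Replaying a stored thread is just a fold:
# context = reduce(reduce_step, stored_events, [])
```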
So we're all still finding the right abstractions. There are a couple of blog posts I link in the paper, on frameworks versus libraries; there's a really old one from RubyConf about whether we want duplication or whether we want to keep trying to figure out these abstractions. If you want to make a 12-factor agent, we're working on something called create-12-factor-agent, because I believe that what agents need is not Bootstrap. You don't need a wrapper around an internal thing. You need something more like shadcn: scaffold it out, and then I'll own it, I'll own the code, and I'm okay with that.
So, in summary: agents are software. You all can build software. Anyone ever written a switch statement before? A while loop? Yeah. Okay, so we can do this stuff. LLMs are stateless functions, which means you just make sure you put the right things in the context and you'll get the best results. Own your state and your control flow, and understand it, because it's going to give you flexibility. And then find the bleeding edge: find ways to do things better than everybody else by really curating what you put into the model and how you control what comes out. And my take: agents are better with people. Find ways to let agents collaborate with humans. There are hard things in building agents, but you should probably do them anyway, at least for now. And you should do most of them. I think a lot of frameworks try to take away the hard AI parts of the problem so that you can just drop it in and go. And I think it should be the opposite: the tools we get should take away the other hard parts, so that we can spend all our time focusing on the hard AI parts, on getting the prompts right, on getting the flow right, on getting the tokens right.
So the reason why I'm here is that I do run a small business. We have a startup where we try to help you do... A lot of what we do in the open is open source, and I think it's really important and we need to work on it together. There are some other things that are hard, but not that important and not that interesting; that's what we're solving at HumanLayer. We're working on something called the A2H protocol. Come find me if you want to talk about this, but it's a way to get consolidation around how agents can contact humans. But mostly I just love automating things. I've built tons and tons of agents internally, for my personal stuff, for finding apartments, for all kinds of internal business stuff we do at HumanLayer. So thank you all for watching. Let's go build something. I'll see you in the hallway track. I'd love to chat if you want to riff on agents or building or control flow or any of this stuff. That's 12-Factor Agents.