
12-Factor Agents: Patterns of reliable LLM applications — Dex Horthy, HumanLayer



00:00:00.000 | Who here is building agents? Leave your hand up if you built like 10 plus agents. Anyone here built
00:00:21.920 | like a hundred agents? All right, we got a few. Awesome. I love it. So I think a lot of us have
00:00:27.600 | been on this journey of building agents. And what happened with me was, you know, I decided
00:00:32.720 | I wanted to build an agent. We figured out what we wanted it to do. We want to move fast.
00:00:36.400 | We're developers, so we use libraries. We don't write everything from scratch. And you
00:00:39.960 | get it to like 70, 80%. It's enough to get the CEO excited and get six more people added
00:00:43.800 | to your team. But then you kind of realize that 70, 80% isn't quite good enough. And that
00:00:49.400 | if you want to get past that 70, 80% quality bar, you're seven layers deep in a call stack
00:00:54.820 | trying to reverse-engineer: how does this prompt get built? How do these tools get
00:00:57.820 | passed in? Where does this all come from? And if you're like me, you eventually just throw
00:01:01.440 | it all away and start from scratch. Or you may even find out that this is not a good problem
00:01:06.200 | for agents. I remember one of the first agents that I tried to build was a DevOps agent. I
00:01:11.140 | was like, here's my make file. You can run make commands. Go build the project. Couldn't
00:01:15.060 | figure it out. Did all the things in the wrong order. I'm like, cool, let's fix the prompt.
00:01:18.320 | And over the next two hours, I added more and more detail about what everything was and every
00:01:22.780 | single step until I got to the point where I was like, this is the exact order to run
00:01:26.320 | the build steps. It was a cool exercise. But at the end of it, I was like, you know, I could
00:01:30.380 | have written the bash script to do this in about 90 seconds. Not every problem needs an agent.
00:01:36.260 | And so I've been on this journey. I think a lot of you have been on similar journeys.
00:01:41.560 | And what happened was, in trying to help people build better, more
00:01:45.660 | reliable agents, I talked to 100-plus founders, builders, and engineers. And I sort of noticed patterns.
00:01:53.180 | One was that most production agents weren't that agentic at all. They were mostly just software.
00:01:58.640 | But there were these core things that a lot of people were doing. There were these
00:02:01.660 | patterns that were making their LLM-based apps really, really good. And none of them were doing
00:02:06.960 | kind of a greenfield rewrite. Rather, they were taking these small modular concepts that didn't
00:02:11.080 | have names and didn't have definitions. And they were applying them to their existing
00:02:14.780 | code. And what's really cool about this, I don't think you need an AI background to do
00:02:18.560 | this. This is software engineering 101. Well, probably not 101. But just as Heroku needed
00:02:24.280 | to define what it meant to build applications that could run in the cloud 10 years ago --
00:02:27.620 | we didn't even call them cloud-native back then -- I decided to put
00:02:33.300 | together what I thought would be the 12 factors of AI agents from everything that I've seen working
00:02:39.380 | in the field. So we put up this GitHub repo. You can go read it. Turns out a lot of other
00:02:44.520 | people agreed and felt the same thing. So we were on the front page of Hacker News all day.
00:02:49.860 | 200K impressions on social. I'm just gonna put this one up and no comment. And just for context,
00:02:57.440 | we got to like 4,000 stars in like a month or two. There's 14 active contributors. It's very
00:03:03.500 | easy to read that thing and hear the talk and say like, "Oh, we're here. This is the anti-framework
00:03:08.080 | talk." I am not here to bash frameworks. I would think of this as much as anything as a wish list,
00:03:13.980 | a list of feature requests. How can we make frameworks serve the needs of really good builders who need
00:03:19.640 | really high reliability and still want to move fast? So what am I here to do? I want you to kind of forget
00:03:27.020 | everything you know about agents and kind of rethink from first principles how we can apply
00:03:31.020 | everything we've learned from software engineering to the practice of building really reliable agents.
00:03:35.500 | So we're gonna mix the order up a little bit. If you want all 12 factors in order, that's a 30-minute
00:03:40.220 | talk. So we're gonna bundle some stuff together. There will be a QR code at the end. You can go dig
00:03:43.900 | through it at your leisure. Factor one: the most magical thing that LLMs can do has nothing to do with
00:03:49.980 | loops or switch statements or code or tools or anything. It is turning a sentence like this
00:03:54.620 | into JSON that looks like this. Doesn't even matter what you do with that JSON. Those are what the other
00:04:00.620 | factors are for. But if you're doing that, that's one piece that you can bring into your app today.
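
A minimal sketch of that factor in Python. The `llm` stub, the intent name, and the JSON shape are all illustrative, not from the talk; in a real app the `llm` call would hit your model provider:

```python
import json

# Stand-in for a real model call, so this sketch runs on its own.
def llm(prompt: str) -> str:
    return json.dumps({
        "intent": "create_payment_link",
        "amount_usd": 750,
        "customer_email": "jeff@example.com",
    })

sentence = "Can you send Jeff a payment link for $750?"
prompt = ("Turn the user's request into JSON with keys "
          "intent, amount_usd, customer_email.\n\n" + sentence)
step = json.loads(llm(prompt))
print(step["intent"], step["amount_usd"])  # create_payment_link 750
```
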
00:04:04.380 | Factor four. This leads right to... Did anyone read this paper, "Go To Statement Considered Harmful,"
00:04:10.460 | or maybe just heard about it? I never actually read it. But it was all about this abstraction
00:04:14.380 | we had in the C programming language and a bunch of other programming languages at the time, which said:
00:04:18.380 | this thing, goto, makes code terrible. It's the wrong abstraction. No one should use it.
00:04:23.580 | I'm gonna go ahead and go out on a limb here and say "tool use is harmful." And I put it in quotes because
00:04:29.340 | I'm not talking about giving an agent access to the world. Obviously, that's super badass. But what
00:04:33.900 | I think is making things hard is the idea that tool use is this magical thing where this ethereal
00:04:39.900 | alien entity is interacting with its environment. Because what is happening is our LLM is putting out JSON.
00:04:45.580 | We're gonna give that to some deterministic code that's gonna do something. And then maybe we'll feed
00:04:49.580 | it back. But again, those are other factors. So if you have structures like this, and you can get the
00:04:54.300 | LLM to output JSON that matches them, then you can pass it into a loop like this or a switch
00:04:59.100 | statement like this. There's nothing special about tools. It's just JSON and code. That's factor four.
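
A sketch of that idea, with invented tool names: the "tool call" is just a tagged JSON object, and "tool use" is a plain switch over the tag:

```python
# Stub "tool": plain deterministic code on the other side of the JSON.
def create_payment_link(amount_usd: int, email: str) -> str:
    return f"https://pay.example.com/{email}?amount={amount_usd}"

def handle(step: dict):
    # "Tool use" is just a switch statement over the JSON the model emitted.
    match step["intent"]:
        case "create_payment_link":
            return create_payment_link(step["amount_usd"], step["customer_email"])
        case "done":
            return step["final_answer"]
        case _:
            raise ValueError(f"unknown intent: {step['intent']}")

print(handle({"intent": "create_payment_link",
              "amount_usd": 750, "customer_email": "jeff@example.com"}))
```
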
00:05:05.580 | Factor eight -- and we're gonna do a couple of these kind of bundled together here -- owning your control flow.
00:05:11.180 | And I want to take a step back and kind of talk about how we got here.
00:05:14.380 | We've been writing DAGs in software for a long time. If you've written an if statement, you've written a directed graph.
00:05:19.900 | Code is a graph. You may also be familiar with DAG orchestrators. Anyone ever use Airflow or Prefect or any of these things?
00:05:28.060 | So like this kind of concept of breaking things up into nodes gives you certain reliability guarantees.
00:05:32.780 | But what agents were supposed to do, and I think a lot of people talk about this, and I think in some cases
00:05:36.620 | this is realized, is you don't have to write the DAG. You just tell the LLM the goal, and the LLM will find
00:05:43.820 | its way there. And we model this as a really simple loop. You know, the LLM is determining the next step.
00:05:48.620 | You're building up some context window until the LLM says, "Hey, we're done."
00:05:51.660 | So what this looks like kind of in practice is, you know, you have an event come in.
00:05:57.180 | You pass it into your prompt. It says you want to call an API. And you get your result. Put that on the context window.
00:06:02.780 | Pass the whole thing back into the prompt. This is like the most naive, simple way of building agents.
00:06:08.460 | And the LLM is gonna call a couple steps. And then eventually it's gonna say, "Cool, we've done all the tasks from the initial event."
00:06:13.820 | Which maybe was a user message asking it to do something. Maybe it's an outage.
00:06:18.220 | But then we get our final answer. And our materialized DAG is just these three steps in order.
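 
In rough Python, the naive loop described here might look like this. The scripted "model" and the `execute` stub are stand-ins so the sketch runs on its own:

```python
# Stubs so the sketch runs standalone: the "model" does two steps, then stops.
script = iter([
    {"intent": "call_api", "url": "https://api.example.com/restart"},
    {"intent": "done", "final_answer": "service restarted"},
])
def llm_next_step(context: list) -> dict:
    return next(script)

def execute(step: dict) -> dict:
    return {"type": "result", "of": step["intent"]}

def naive_agent(event: dict):
    # The naive loop: the model picks the next step, deterministic code
    # runs it, and everything gets appended to the context window forever.
    context = [event]
    while True:
        step = llm_next_step(context)
        if step["intent"] == "done":
            return step["final_answer"]
        context.append(step)
        context.append(execute(step))

print(naive_agent({"type": "user_message", "text": "restart the api"}))
```
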
00:06:23.260 | Turns out this doesn't really work. Especially when you get to longer workflows. Mostly it's long context windows.
00:06:30.460 | There's other reasons you could poke at as well. And people say, "Oh, like, anyone put, like, two million tokens into Gemini before and, like, try to see what happens?"
00:06:37.900 | Like, you can do it. You'll get an answer. The API will return you something. But I don't think anyone will argue with this: you will always get tighter, better, higher-reliability results
00:06:47.820 | by controlling and limiting the number of tokens you put in that context window.
00:06:52.860 | So it doesn't quite work, but we're gonna use that as our abstraction to build on. What's an agent really?
00:06:57.900 | You have your prompt, which gives instructions about how to select the next step.
00:07:01.900 | You have your switch statement, which takes whatever JSON the model outputs and does something with it.
00:07:05.900 | You have a way of building up your context window. And then you have a loop that determines when and where and how and why you exit.
00:07:11.900 | And if you own your control flow, you can do fun things like break and switch and summarize and LLM-as-judge and all this stuff.
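
One possible shape for an owned loop, with all helper names invented. The point is that the prompt, the switch, the context builder, and the exit conditions are all plain code you control:

```python
from dataclasses import dataclass, field

@dataclass
class Thread:
    events: list = field(default_factory=list)

# Tiny stand-ins; in a real agent these are your prompt template, your
# switch statement, your summarizer, and your serialize-and-pause logic.
def render_prompt(t: Thread) -> str: return f"events so far: {t.events}"
def execute(step: dict) -> dict: return {"type": "result", "of": step["intent"]}
def summarize(t: Thread) -> Thread:
    return Thread(events=[{"type": "summary", "of": f"{len(t.events)} earlier events"}])
def save_and_pause(t: Thread, step: dict): print("paused for approval:", step)

def run_agent(thread: Thread, llm, max_steps: int = 10):
    # Owning the loop means all four pieces are yours: the prompt that
    # selects the next step, the switch that runs it, the context builder,
    # and the exit conditions (done, break for approval, step budget).
    for _ in range(max_steps):
        step = llm(render_prompt(thread))
        if step["intent"] == "done":
            return step["final_answer"]
        if step["intent"] == "request_approval":
            save_and_pause(thread, step)
            return None
        if len(thread.events) > 40:
            thread = summarize(thread)  # compact a long context window
        thread.events.append(step)
        thread.events.append(execute(step))
    raise RuntimeError("agent exceeded its step budget")

script = iter([{"intent": "deploy_backend"}, {"intent": "done", "final_answer": "ok"}])
print(run_agent(Thread(), llm=lambda prompt: next(script)))
```
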
00:07:18.940 | And this leads right into, kind of, how we manage execution state and business state of our agents.
00:07:24.940 | A lot of tools will give you things like current step, next step, retry counts, all these, like, DAG orchestrators.
00:07:28.940 | They'll have these kinds of concepts in them. But you also have your business state. What are the messages that have happened?
00:07:32.940 | What data are we displaying to the user? What things are we waiting on approval for?
00:07:37.980 | And we want to be able to launch, pause, resume these things like we do for any standard APIs.
00:07:43.100 | This is all just software. And so if you can put your agent behind a REST API or an MCP server
00:07:51.020 | and manage that loop in such a way that a normal request comes in and we load that context window into the LLM,
00:07:57.980 | we're going to allow our agent to call a long-running tool. We can interrupt the workflow and serialize
00:08:02.300 | that context window straight into a database, because we own the context window. We'll get into that.
00:08:05.980 | And then when we launch the workflow, eventually it's going to call back with that state ID and the result.
00:08:12.300 | We use the state ID to load the state back out of the database, append the result to the
00:08:16.940 | context window, and send it right back into the LLM. The agent doesn't even know that things happened in the
00:08:21.180 | background. Agents are just software, so let's build software. And building really good ones requires a lot of
00:08:27.900 | flexibility. And so you really want to own that inner loop of how all that stuff is fitting together.
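
A rough sketch of that pause/resume flow. The in-memory `db`, the state format, and the function names are assumptions for illustration, not HumanLayer's actual API:

```python
import json, uuid

db: dict[str, str] = {}  # stand-in for a real table of serialized threads

def kick_off_long_running_tool(call: dict, callback_state_id: str):
    print("launched:", call, "-- will call back with", callback_state_id)

def resume_agent(thread: dict):
    print("back into the LLM with", len(thread["events"]), "events")

def pause(thread: dict, pending_call: dict) -> str:
    # Because we own the context window, we can serialize it straight into
    # the database and hand the state ID to the long-running tool.
    state_id = str(uuid.uuid4())
    db[state_id] = json.dumps(thread)
    kick_off_long_running_tool(pending_call, callback_state_id=state_id)
    return state_id

def on_callback(state_id: str, result: dict):
    # The callback carries the state ID: load the thread back out of the
    # database, append the result, and send it right back into the LLM.
    # The agent never knows it was suspended.
    thread = json.loads(db[state_id])
    thread["events"].append({"type": "tool_result", "result": result})
    resume_agent(thread)

sid = pause({"events": [{"type": "user_message", "text": "deploy it"}]},
            {"intent": "deploy_backend"})
on_callback(sid, {"status": "success"})
```
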
00:08:32.380 | That's unifying. That's pause and resume. Factor two -- this one is, I think, the one most people find first --
00:08:38.940 | you really want to own your prompts. There's some good abstractions where, if you don't
00:08:42.380 | want to spend a lot of time handwriting a prompt, you can put stuff in and you'll get out
00:08:46.300 | a really good set of primitives and a really good prompt. Like this will make you a banger prompt,
00:08:57.820 | like you would have to go to prompt school for three months to build a prompt this good.
00:08:57.820 | But eventually if you want to get past some quality bar, you're going to end up writing every single
00:09:01.980 | token by hand. Because LLMs are pure functions and the only thing that determines the reliability of
00:09:08.140 | your agent is how good the tokens you get out are. And the only thing that determines the
00:09:12.780 | tokens you get out -- other than, like, retraining your own model or something like that -- is being really
00:09:17.100 | careful about what tokens you put in. I don't know what's better. I don't know how you want to build your
00:09:22.140 | prompt. But I know the more things you can try and the more knobs you can test and the more things you can
00:09:25.820 | evaluate, the more likely you are to find something really, really good.
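
For instance, an owned prompt can be as plain as a template function you can read, diff, and eval. This particular template is invented for illustration, not a recommended prompt:

```python
# An owned prompt is just code: every token is visible, diffable, and easy
# to A/B test in your evals.
def next_step_prompt(thread_as_text: str) -> str:
    return f"""You are managing a deployment workflow.

Here is everything that has happened so far:

<events>
{thread_as_text}
</events>

Choose the next step. Reply with ONLY a JSON object like
{{"intent": "deploy_backend" | "deploy_frontend" | "request_approval" | "done"}}.
"""

print(next_step_prompt("<user_message>deploy backend v1.2</user_message>"))
```
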
00:09:28.780 | Owning your prompts. You also want to own how you build your context window. So you can do the
00:09:33.180 | standard OpenAI messages format, or in this moment where you're telling the LLM, pick the next step,
00:09:38.380 | your only job is to tell it what's happened so far. You can put all that information however you want
00:09:42.700 | into a single user message and ask, "Hey, what's happening next?" Or put it in the system message. So you can
00:09:47.820 | model your event state, your thread model, however you want, and stringify it however you want. And some
00:09:53.980 | of the traces that we use in some of the agents we build internally, I'll get into that in a sec,
00:09:56.940 | might look like this. But if you're not looking at every single token, and if you're not optimizing the
00:10:02.940 | density and the clarity of the way that you're passing information to an LLM, you might be missing out on
00:10:10.140 | upside on quality. So LLMs are pure functions, tokens in, tokens out, and everything, everything in making
00:10:15.900 | agents good is context engineering. So you have your prompt, you have your memory, you have your
00:10:20.700 | RAG, you have your history structure, it's all just how do we get the right tokens into the model so it
00:10:24.380 | gives us a really good answer and solves the user's problem, solves my problem mostly.
00:10:28.220 | I don't know what's better, but I know you want to try everything, so that's on your context building.
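
One possible custom stringification, as a sketch. The tag format here is arbitrary and invented; the point is owning the density and clarity of every token the model sees:

```python
# Flatten the thread into one dense user message instead of the stock
# messages-array format.
def stringify_thread(events: list[dict]) -> str:
    lines = []
    for e in events:
        body = ", ".join(f"{k}={v}" for k, v in e.items() if k != "type")
        lines.append(f"<{e['type']}>{body}</{e['type']}>")
    return "\n".join(lines)

print(stringify_thread([
    {"type": "slack_message", "user": "dex", "text": "deploy backend v1.2"},
    {"type": "deploy_backend", "tag": "v1.2.0"},
    {"type": "deploy_result", "status": "success"},
]))
```
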
00:10:33.900 | This one's a little controversial, that's why it's a standalone factor, and the way you make it
00:10:39.580 | good is by integrating it with other factors. But when the model screws up --
00:10:44.940 | it calls an API wrong, or it calls an API that's down -- you could take the tool call that it
00:10:49.580 | made, grab the error that was associated with it, put that on the context window, and have it try
00:10:55.340 | again. Anyone ever had a bad time with this? Seen this thing just kind of spin out,
00:11:01.580 | and like go crazy, and lose context, and just get stuck? That's why you need to own your context window.
00:11:07.180 | Don't just blindly put things on. If you have errors, and then you get a valid tool call, clear all the
00:11:11.100 | pending errors out. Summarize them. Don't put the whole stack trace on your context. Figure out
00:11:15.820 | what you want to tell the model, so you get better results.
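
A sketch of that error-compaction idea in Python. The threshold and field names are illustrative, not prescriptive:

```python
# Keep a counter, append a summarized error (never the raw stack trace),
# and clear the pending errors once a valid tool call comes through.
MAX_CONSECUTIVE_ERRORS = 3

def on_tool_error(thread: dict, err: Exception):
    thread["consecutive_errors"] = thread.get("consecutive_errors", 0) + 1
    if thread["consecutive_errors"] > MAX_CONSECUTIVE_ERRORS:
        raise RuntimeError("spinning out -- escalate to a human") from err
    thread["events"].append({"type": "error",
                             "summary": f"{type(err).__name__}: {err}"[:200]})

def on_valid_tool_call(thread: dict, step: dict):
    # A valid call means the model recovered: drop the stale errors so
    # they stop polluting the context window.
    thread["consecutive_errors"] = 0
    thread["events"] = [e for e in thread["events"] if e["type"] != "error"]
    thread["events"].append(step)

thread = {"events": []}
on_tool_error(thread, ValueError("API returned 500"))
on_valid_tool_call(thread, {"intent": "deploy_backend"})
print(thread)
```
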
00:11:20.940 | Contacting humans with tools. This one's a little subtle, but this is just what I've seen in the wild. Almost everybody is avoiding
00:11:26.380 | this very important choice at the very beginning of output, where you're deciding between tool call and
00:11:31.420 | message to the human. If you can push that decision to a natural language token, you can, one, give the
00:11:36.940 | model different ways to respond. It can say, I'm done, or I need clarification, or I need to talk to a manager,
00:11:41.340 | or whatever it is. And two, you push the intent on that first token generation, and the sampling, to
00:11:47.660 | something that is natural language that the model understands. So your traces might look like this,
00:11:52.700 | if you're pulling in human input here. This lets you build outer loop agents. I'm not going to talk about
00:11:57.580 | this. If you go on the site, there's a link to this post. I've written a lot about this. I don't
00:12:02.060 | know what's better, but you should probably try everything. That's contacting humans with tools.
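
Sketched as code, with invented intent names: the human-facing choices sit in the same JSON union as the tool calls, so the decision is expressed in the model's first sampled tokens:

```python
# Contacting a human is just another intent next to the tool intents.
def ask_human(question: str): print("-> human:", question)
def notify_done(message: str): print("-> channel:", message)
def run_deploy(step: dict): print("-> deploying:", step)

def handle_step(step: dict):
    match step["intent"]:
        case "request_clarification":
            return ask_human(step["question"])  # pause; resume on reply
        case "done_for_now":
            return notify_done(step["message"])
        case "deploy_backend" | "deploy_frontend":
            return run_deploy(step)
        case _:
            raise ValueError(step["intent"])

handle_step({"intent": "request_clarification",
             "question": "Deploy backend or frontend first?"})
```
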
00:12:07.180 | It goes right along with trigger things from anywhere and meet users exactly where they are. People don't
00:12:11.980 | want to have seven tabs open of different ChatGPT-style agents. Just let people email with the agents
00:12:17.100 | you're building. Let them Slack with the agents you're building. Discord, SMS, whatever it is. We see this
00:12:21.420 | taking off all over the place. And you should have small focused agents. So we talked about this
00:12:26.700 | structure and why it doesn't really work. So what does work?
00:12:29.180 | The things that people are doing that work really well are microagents. So you still have a mostly
00:12:34.620 | deterministic DAG, and you have these very small agent loops with like three to 10 steps. We do this
00:12:39.180 | at HumanLayer. We have a bot that manages our deployments. Most of our deploy pipeline is
00:12:43.500 | deterministic CI/CD code. But when we get to the point where the GitHub PR is merged and the tests are
00:12:50.940 | passing on development, excuse me, we send it to a model. We say, get this thing deployed. It says,
00:12:57.340 | cool, I'm going to deploy the front end. And then you can send that to a human. The human says,
00:13:00.860 | actually, no, do the back end first. This is taking natural language and turning it into
00:13:04.220 | JSON that is the next step in our workflow. Back end gets proposed. That gets approved. It gets deployed.
00:13:09.980 | Then the agent knows, okay, I have to go back and deploy the front end. Once that's all done and it's
00:13:14.140 | successful, we go right back out into deterministic code. So now we're going to run the end-to-end test against
00:13:19.340 | prod if it's done. Otherwise, we hand it back to a little rollback agent that is very similar on
00:13:23.580 | the inside. I'm not going to go into it, but here's it working in our Slack channel.
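
The deploy flow he describes, sketched with invented function names: deterministic code on the outside, a small bounded agent loop at the one genuinely fuzzy step:

```python
# Mostly deterministic pipeline, with a 3-to-10-step agent loop embedded.
def run_ci(pr): print("ci: tests passing on", pr)
def merge(pr): print("merged", pr)
def run_e2e_tests(): print("e2e tests against prod")
def run_micro_agent(goal: str, max_steps: int) -> str:
    print(f"micro-agent ({max_steps} steps max): {goal}")
    return "success"

def deploy_pipeline(pr: str):
    run_ci(pr)                              # deterministic code
    merge(pr)                               # deterministic code
    outcome = run_micro_agent(              # tiny agent loop, with a
        goal="get this change deployed",    # human approving steps
        max_steps=10,
    )
    if outcome == "success":
        run_e2e_tests()                     # back to deterministic code
    else:
        run_micro_agent(goal="roll back the failed deploy", max_steps=10)

deploy_pipeline("humanlayer/pr-123")
```
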
00:13:27.500 | Yeah, 100 tools, 20 steps, easy. Manageable context, clear responsibilities. A lot of people say,
00:13:35.980 | what if LLMs do keep getting smarter? What if I can put two million tokens in and it can do it?
00:13:39.900 | And I think we very much will see something like this, where you start with a mostly deterministic
00:13:46.060 | workflow and you start sprinkling LLMs into your code, into your back end, into your logic.
00:13:51.020 | Over time, the LLMs are able to do bigger, more complex tasks until this whole API endpoint or
00:13:56.300 | pipeline or whatever it is is just run by an agent. That's great. But you still want to know how to
00:14:01.340 | engineer these things to get the best quality. This is someone from NotebookLM and it's basically
00:14:05.500 | their take, and I think they did this well, is find something that is right at the boundary of what the
00:14:09.740 | model can do reliably, like that it can't get right all the time. And if you can figure out how to get it
00:14:15.660 | right reliably anyways, because you've engineered reliability into your system, then you will have
00:14:20.940 | created something magical and you will have created something that's better than what everybody else is
00:14:24.380 | building. So that's small focused agents. There's a meme here about stateless reducers. I guess someone
00:14:29.660 | actually tweeted at me it's not a reducer, it's a transducer, because there's multiple steps.
00:14:34.220 | But basically agents should be stateless, you should own the state, manage it however you want.
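
A tiny sketch of that stateless-reducer idea: the step is a pure function of (state, event), and the stubbed "model" decision stands in for a real LLM call. The caller owns persistence, retries, and scheduling:

```python
# The agent step as a stateless reducer: (state, event) -> new state.
def agent_step(state: dict, event: dict) -> dict:
    events = state["events"] + [event]
    next_action = {"intent": "done"}  # stand-in for llm(render(events))
    return {"events": events + [next_action]}

s0 = {"events": []}
s1 = agent_step(s0, {"type": "user_message", "text": "hi"})
assert s0 == {"events": []}  # the old state is untouched -- no hidden memory
print(s1)
```
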
00:14:37.740 | So we're all still finding the right abstractions. There's a couple blog posts I link in the paper,
00:14:42.860 | frameworks versus libraries. There's a really old one from RubyConf about like,
00:14:48.140 | do we want duplication, or do we want to try to figure out these abstractions?
00:14:51.980 | If you want to make a 12-factor agent, we are working on something called create-12-factor-agent,
00:14:57.100 | because I believe that what agents need is not bootstrap. You don't need a wrapper around an internal
00:15:02.220 | thing. You need something more like shadcn, which is like, scaffold it out and then I'll own it and
00:15:06.620 | I'll own the code and I'm okay with that. So in summary, agents are software. You all can build
00:15:12.700 | software. Anyone ever written a switch statement before? While loop? Yeah. Okay, so we can do this
00:15:17.740 | stuff. LLMs are stateless functions, which means just make sure you put the right things in the context
00:15:22.060 | and you'll get the best results. Own your state and your control flow and just do it and just understand
00:15:26.460 | it because it's going to give you flexibility. And then find the bleeding edge. Find ways to do things
00:15:31.740 | better than everybody else by really curating what you put in the model and how you control
00:15:36.220 | what comes out. And my take, agents are better with people. Find ways to let agents collaborate
00:15:41.420 | with humans. There are hard things in building agents, but you should probably do them anyways,
00:15:46.940 | at least for now. And you should do most of them. I think a lot of frameworks try to take away the hard
00:15:53.500 | AI parts of the problem so that you can just kind of drop it in and go. And I think it should be the
00:15:59.020 | opposite. I think the tools that we get should take away the other hard parts so that we can spend all
00:16:04.300 | our time focusing on the hard AI parts, on getting the prompts right, on getting the flow right,
00:16:08.380 | on getting the tokens right. So the reason why I'm here is I do run a small business. We have a startup
00:16:14.220 | where we try to help you do... A lot of what we do in the open is open source, and I think it's really
00:16:19.580 | important and we need to work on it together. There's some other things that are hard, but not that
00:16:23.180 | important and not that interesting. So that's what we're solving at HumanLayer. Working on something
00:16:28.380 | called A2H protocol. Come find me if you want to talk about this, but this is a way to get
00:16:32.300 | like consolidation around how agents can contact humans. But mostly I just love automating things.
00:16:37.660 | I've built tons and tons of agents internally for my personal stuff, for finding apartments, for
00:16:42.300 | all kinds of internal business stuff we do at HumanLayer. So thank you all for watching. Let's
00:16:49.340 | go build something. I'll see you in the hallway track. I'd love to chat if you want to riff on
00:16:53.340 | agents or building or control flow or any of this stuff. That's 12 Factor Agents.