
Breakthrough Agents: Building Reliable Agentic Systems


Transcript

Hey everybody, my name is Eno, co-founder and CTO of a company called Factory. At Factory, we believe that the way we build software is radically changing. We are transitioning from the era of human-driven software development to agent-driven software development. You can see glimpses of that today; however, it seems like we are trying to get to that future incrementally.

The current zeitgeist is to take the IDE, a tool that was designed first and foremost for human beings to write lines of code by hand, and add AI on top of it, and then keep adding AI until something changes. But what we've seen is that in order for organizations, both small and large, to unlock productivity gains beyond just 10 or 15 percent, a deeper workflow change is demanded.

And so, we've noticed that in order to truly get 5, 10, 20x performance, you need to start shifting your mindset from "we are collaborating with AI systems" to "we are delegating tasks entirely to AI systems." In order to make that transition, you start to realize that you need a really different type of tool, a different type of platform, one with an interface that lets you manage and delegate to these AI systems.

You need infrastructure that lets you scale up and parallelize these agents. You need context from across your entire engineering system, integrating more than just GitHub, source code, or Jira: your observability tools, your knowledge bases, the internet. And you need to connect all of this with agents that can actually reliably execute on the task end to end.

We think that this deserves an entirely new sort of platform, and at Factory, we're building it. Most of the enterprise organizations that we work with have 5,000, 10,000-plus engineers. And so, this transition doesn't just come from switching your editor or from watching some tutorials about vibe coding.

Instead, it requires an active investment. So I'm going to talk about some of the things we've done in our platform that enable this human-computer interaction with agents, and also about some of the things we've done to enable reliable agents that can actually execute on the tasks we're talking about.

So I'm going to start with a quick video of what's possible. This video showcases a quick glimpse of what it's like to trigger an agentic system (we call them droids) from your project management system, and then watch it in our platform as it goes from ticket to pull request.

These types of tasks have quickly become table stakes for agentic systems. Tasks with a clear goal and clear success criteria are now quite solvable with AI. The agent accomplishes these tasks with a combination of tools, reasoning, search, and computer use, and then it can take all of this and bring it into one of your existing platforms, your source code manager, maybe GitHub.

But what really is an agentic system? How does it use tools? These are questions plenty of people have spoken about, and the definition remains quite fuzzy. So before we go into deep detail about how you get something like this, going from a plan all the way to mergeable code, I'm going to talk a bit about what we think an agentic system is.

So we've found that there are lots of different workflows, some people call them cognitive architectures, that we've experimented with, and plenty of other teams have experimented with too. It's hard to draw an explicit boundary around what an agent is. But there are three characteristics that we think are pretty consistent across the systems most people refer to as agentic.

The first is planning. Your agentic system needs to make plans that decide one or more future actions in the system. That can be very simple, like outlining a single edit call and then returning to the user, or it can be quite complex, like saying, "We're going to search through the codebase, plan out a couple of different edits, create tickets in our project management system, execute on each of those tickets, and come back with a write-up sent in Slack to our eng manager."

Plans can be small, they can be simple. But in addition to planning, you also need decision-making. When you're executing on a plan, there are tons of decisions your agent needs to be able to make based on the data and interactions it encounters: look at the existing state and make a call based on it.

A lot of the time, "reasoning" refers to an agent's ability to make these intermediate decisions. And finally, agents need environmental grounding. Agentic systems read and write information to their external environment. Ideally, they also use this information to react and adapt to changes in that environment.
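To make those three characteristics concrete, here's a minimal sketch of an agent loop in Python. Everything in it (the Action type, the make_plan and decide callables, the tool registry) is a hypothetical illustration of the shape of such a system, not Factory's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    tool: str   # which tool the agent chose to invoke
    args: dict  # arguments for that tool call

def run_agent(
    goal: str,
    make_plan: Callable[[str], list[str]],               # 1. planning: goal -> ordered steps
    decide: Callable[[str, list[str]], Action],          # 2. decision-making: step + history -> action
    tools: dict[str, Callable[..., str]],                # 3. environmental grounding: tools read/write the world
) -> list[str]:
    history: list[str] = []
    for step in make_plan(goal):
        action = decide(step, history)                   # reason over current state, pick an action
        observation = tools[action.tool](**action.args)  # interact with the environment
        history.append(f"{step}: {observation}")         # ground future decisions in what happened
    return history
```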

When I think about agentic systems, most of the challenge in making them reliable lives in the details of the agent itself. But there's a meta problem on top of that: no matter how reliable your agent system is, you have to answer the question, where does the human fit in?

We think we're currently at the point in history with the fewest developers there will ever be. Humans are here to stay in building software. But there's an outer loop and an inner loop of software development. In the outer loop, you are reasoning about what needs to get done.

You're working with your colleagues. You're listening to customers and translating what they say into requirements. You're iterating on architectural decisions. And in the inner loop, you are writing lines of code, testing that code, building, checking up on builds, doing code review. We think the inner loop is going away very soon.

Agents will take up the vast majority of it. So how do you create an AI UX that blends delegation, so you can stay in flow in that outer loop, with control? When an agent ultimately can't accomplish something because of the current state of the technology, you need to be able to steer it with precision.

So that's the background I want you to keep in mind as we deep dive into those three principles of agentic systems. First, planning. We like to say a droid is only as good as its plan. One of the biggest challenges we face is ensuring that a droid creates a good plan and sticks to it.

We were inspired by robotics and control systems in thinking about which techniques we can use to improve the system's ability to make high-quality plans. If a droid goes off and executes on something that's totally the wrong idea, a customer's time and money are heavily wasted.

And so there are a couple of techniques we'd call out. The first is decomposition: plans don't need to be just a high-level "do the thing." We can break plans down into subtasks. The second is model predictive control: your plans should be continuously updated by the environment as they're executed.

And the third is explicit plan templating: certain tasks have a certain sort of template or workflow they can be executed against. Now, if you're too rigid with your plan templates, you're going to reduce creativity. But if you know that at the end of the day, when your agent codes, it's going to create a pull request or a commit, then you can start to reason through what the beginning, intermediate, and end steps of your plan might look like.
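Here's a minimal sketch of how those three ideas might compose, with a hypothetical llm callable (string in, string out) standing in for the model; none of these names come from Factory's system.

```python
from dataclasses import dataclass, field
from typing import Callable

# Explicit plan template: we know a coding task starts with exploration
# and ends in a commit or pull request, so those endpoints are fixed.
CODE_TASK_TEMPLATE = [
    "Search the codebase for files relevant to the task",
    # task-specific intermediate steps are filled in by decomposition
    "Run the tests and fix any failures",
    "Open a pull request",
]

@dataclass
class Plan:
    steps: list[str]                                    # remaining steps
    completed: list[str] = field(default_factory=list)  # steps already done

def make_plan(goal: str, llm: Callable[[str], str]) -> Plan:
    # Decomposition: break the high-level goal into concrete subtasks.
    middle = llm(f"Break this goal into subtasks, one per line: {goal}").splitlines()
    return Plan(steps=CODE_TASK_TEMPLATE[:1] + middle + CODE_TASK_TEMPLATE[1:])

def replan(plan: Plan, observation: str, llm: Callable[[str], str]) -> Plan:
    # Model predictive control: after each action, re-derive the remaining
    # steps from the newest observation instead of following a stale plan.
    revised = llm(
        f"Completed: {plan.completed}\n"
        f"Latest observation: {observation}\n"
        f"Revise the remaining steps, one per line: {plan.steps}"
    ).splitlines()
    return Plan(steps=revised, completed=plan.completed)
```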

And so we built all of these systems into droids. I have a real-world example here of a droid making a plan. You'll see it searching through information, reasoning, and breaking that information down into a long-form plan with multiple steps: one for the front end, one for the back end, the tests it's going to run, and how it will actually execute commands to roll the change out with feature flags.

This type of planning is really hard to get right. But when you do get it right, it makes your system far more reliable at actually executing on the end task. And this droid will keep going. The second characteristic is decision-making. This is probably the hardest thing to control in your agents.

When you're building software, or doing tasks in other domains, human beings are making hundreds or thousands of micro-decisions: what to name a variable, the scale of the change to make, where the change should go in the codebase, whether to imitate the patterns of the codebase, or whether the codebase is filled with tech debt and we should instead innovate on it.

Agents need to be able to assess these trade-offs and take action decisively and correctly in order to be effective. There are a few factors you can introduce to improve your agents' ability to make decisions. And when you're thinking about how to introduce them, you have to ask: what sort of decisions is my agent going to face?

And do I have the ability to explicitly select the criteria by which my agent should make those decisions? If you're building an agent for travel planning or an agent for code review, you have a very limited set of things the agent will actually need to decide.

You can say: I need to decide on what the price points are, or I need to decide whether this code fits a certain set of standards. If you're more open-ended, with an agent that can do a lot of different things or write a bunch of different code paths, there are fewer explicit decision-making criteria you can introduce.

A lot of this happens in the reasoning layer of models, so you really have to think about how your agent is instructed and the context it has about the environment around it. Factory customers often ask questions like, how do I structure an API for this project?

They expect droids to be able to evaluate requirements from the user and from their organization, assess the existing codebase, reason through different performance implications, and ultimately come up with a final decision about what to do. This is really tricky, but powerful when done right. If you take these decision-making criteria and context from the environment and bring them together, that is what gives an agent the ability to actually make a proper decision.
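As a hypothetical sketch of what "explicitly selecting the criteria" can look like for a narrow agent like code review, imagine injecting a fixed rubric into the agent's context rather than leaving every judgment to open-ended reasoning. The names and criteria below are illustrative, not Factory's:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    instruction: str  # injected into the agent's context verbatim

# A narrow-domain agent (code review) can enumerate its decisions up front.
REVIEW_CRITERIA = [
    Criterion("patterns", "Does the change imitate the repo's existing naming and structural patterns?"),
    Criterion("scope",    "Is this the smallest change that satisfies the requirement, or does it sprawl?"),
    Criterion("tests",    "Are new or modified code paths covered by tests?"),
]

def build_decision_prompt(diff: str) -> str:
    # Bring together explicit criteria and context from the environment (the diff).
    rubric = "\n".join(f"- {c.name}: {c.instruction}" for c in REVIEW_CRITERIA)
    return (
        "Review this diff. For each criterion, decide pass or fail and justify briefly.\n"
        f"Criteria:\n{rubric}\n\nDiff:\n{diff}"
    )
```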

And the last characteristic is environmental grounding. This is the connection your agent has with the real world. Agents interact with the world through AI-computer interfaces: dedicated tools, and context injection from other systems into the agent. This is actually really different from, and not as simple as, saying, let's take an API or just build a simple tool.

The entire internet, and the last 40 years of computing on top of it, have existed primarily for human beings. And so we have to build new AI-computer interfaces that let our agents naturally interact with the world. This is where we spend most of our time at Factory.

We've found that control over the tools your agent uses is the single most important differentiator in your agent's reliability. Beyond the tools themselves, grounding yourself in your environment means properly processing the information that comes in. For example, say I run a CLI command and get 100,000 lines of output back from that tool call.

If we just pass that to the agent, it's going to go off the rails. We need to process it: find out what's important, hand the important information to the agent, and hide the unimportant information. These types of decisions are make-or-break for complex systems that interact with huge volumes of data.
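For instance, here's a minimal sketch of that kind of filtering, assuming a simple keyword heuristic for "important" (a real system would use something smarter); the markers and budgets here are made up for illustration:

```python
import subprocess

IMPORTANT_MARKERS = ("error", "fail", "warning", "exception", "traceback")
MAX_LINES = 200  # hard budget on what the agent is allowed to see

def run_and_filter(cmd: list[str]) -> str:
    result = subprocess.run(cmd, capture_output=True, text=True)
    lines = (result.stdout + result.stderr).splitlines()
    n = len(lines)
    # Keep the head and tail for context, plus any line that looks important.
    keep = set(range(min(20, n))) | set(range(max(0, n - 20), n))
    keep |= {i for i, line in enumerate(lines)
             if any(m in line.lower() for m in IMPORTANT_MARKERS)}
    idxs = sorted(keep)[:MAX_LINES]
    # Always tell the agent what was hidden, so it can ask for more if needed.
    note = f"[{n - len(idxs)} of {n} lines hidden; exit code {result.returncode}]"
    return "\n".join([lines[i] for i in idxs] + [note])
```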

So think about how you process information, not just the tools you're calling. And here's an example of a droid being handed a Sentry error alert and told: we need an RCA, we need to figure out what happened. It's going to search through repositories to find the candidate, using a couple of different strategies (semantic, glob, grep, the Sentry API), view the relevant information, access GitHub PRs that were merged around the time of the error, and then validate all of that with additional searches.

And so it's going to use all of that knowledge to write an RCA based on that information. This is the type of grounding your systems need to do in order to go beyond just coding and enter the world of full software development work.
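Sketched as a pipeline, that workflow might look something like this; the search, sentry, and github helpers are hypothetical stand-ins, not real Sentry or GitHub client APIs or Factory internals:

```python
from typing import Callable

def write_rca(alert: dict, llm: Callable[[str], str], search, sentry, github) -> str:
    evidence: list[str] = []
    # Search candidate repositories with several strategies.
    for strategy in ("semantic", "glob", "grep"):
        evidence += search(query=alert["message"], strategy=strategy)
    # Pull the full error details from the alerting system.
    evidence.append(sentry.get_issue(alert["issue_id"]))
    # Look at PRs merged around the time the error first appeared.
    evidence += github.prs_merged_near(alert["first_seen"])
    # Form a hypothesis, then validate it with additional searches.
    hypothesis = llm("Given this evidence, what likely caused the error?\n"
                     + "\n---\n".join(map(str, evidence)))
    evidence += search(query=hypothesis, strategy="grep")
    return llm("Write a root-cause analysis from this evidence:\n"
               + "\n---\n".join(map(str, evidence)))
```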

So, to recap, a couple of main takeaways. Start with a clear plan and clear boundaries. Show the work of your systems as they reason and make decisions; that builds trust and helps you debug. Iterate as your agent works: allow it to reason, search, and think through hard problems, and ultimately ground it in its environment.

And finally, design for human collaboration from the start. We think of AI systems like climbing gear: shipping high-quality software is climbing Mount Everest, and it's pretty hard to do. A couple of notes I want to add: we're always thinking about deeper integrations and memory, and if you're interested in working on these types of hard problems, or on agentic systems in general, we are always hiring.

And if your team is not delegating more than 50% of its engineering tasks to AI agents, you should come talk to Factory. Thank you.