Breakthrough Agents: Building Reliable Agentic Systems

00:00:00.000 |
Hey everybody, my name is Eno, co-founder and CTO of a company called Factory. 00:00:19.760 |
At Factory, we believe that the way we build software is radically changing. 00:00:26.240 |
We are transitioning from the era of human-driven software development to agent-driven software 00:00:33.380 |
You can see glimpses of that today, however, it seems like we are trying to get to that 00:00:42.700 |
The current zeitgeist is to take the IDE, a tool that was designed first and foremost 00:00:48.260 |
for human beings to write lines of code by hand, and add AI on top of that. 00:00:54.480 |
And then you keep adding AI until kind of something changes. 00:00:59.940 |
But what we've seen is that in order for organizations, both small and large, to unlock changes in productivity 00:01:08.840 |
beyond just 10%, 15%, there is more of a workflow change that is demanded. 00:01:16.500 |
And so, we've noticed that in order to truly get 5, 10, 20x performance, you need to start 00:01:24.140 |
shifting your mindset from, "We are collaborating with AI systems," to, "We are delegating tasks 00:01:33.680 |
In order to make that transition, you start to realize that you need a really different 00:01:38.140 |
type of tool, a different type of platform, one that has an interface that lets you manage 00:01:49.020 |
You need infrastructure that lets you scale up and parallelize these agents at the same time. 00:01:55.880 |
You need context from across your entire engineering system that integrates more than just GitHub, 00:02:02.540 |
source code, or Jira, but also your observability tools, right? 00:02:10.280 |
And you need to connect all of this with agents that can actually reliably execute on the task 00:02:18.040 |
We think that this deserves an entirely new sort of platform. 00:02:25.580 |
Most of the enterprise organizations that we work with have 5, 10,000-plus engineers, right? 00:02:32.260 |
And so, this transition doesn't just come from switching your editor or from 00:02:38.920 |
watching some tutorials about vibe coding. 00:02:42.200 |
Instead, it requires an active investment. 00:02:45.340 |
And so, I'm going to talk about some of the things we've done in our platform that 00:02:50.240 |
enable this human-computer interaction with agents and also talk about some of the things 00:02:55.100 |
that we've done to enable reliable agents that can actually execute on some of the tasks 00:03:03.120 |
So I'm going to start with a quick video of what's possible. 00:03:06.780 |
This video showcases a quick glimpse of what it's like to trigger an agentic system 00:03:12.160 |
(we call them droids) from your project management system 00:03:15.920 |
and then watch it, in a platform, go from ticket to pull request. 00:03:22.300 |
These types of tasks have quickly become table stakes for agentic systems. 00:03:27.600 |
Tasks with a clear goal, clear success criteria, they're now quite solvable with AI. 00:03:34.540 |
And the agent accomplishes these tasks with a combination of tools, reasoning, search, right, 00:03:40.920 |
computer use, and then it can take all of this and bring it into one of your existing 00:03:45.780 |
platforms, your source code manager, maybe GitHub. 00:03:54.240 |
These are all questions that plenty of people have spoken about. 00:04:00.020 |
And so, before we go into super deep detail about how you get something like this, going 00:04:05.040 |
from a plan all the way to mergeable code, I'm going to just talk a bit about what we think 00:04:14.020 |
So we found that there's lots of different workflows, some people call them cognitive architectures, 00:04:20.020 |
that we've experimented with, plenty of other teams have experimented with. 00:04:24.360 |
And it's hard to draw an explicit boundary of what an agent is. 00:04:28.860 |
But there are three characteristics that we think are pretty consistent amongst the systems 00:04:37.780 |
Your agentic system needs to make plans that decide one or more future actions in the system. 00:04:44.260 |
That can be very simple, like outlining a single kind of edit call and then returning back 00:04:50.080 |
to the user, or it could be quite complex, like saying, "We're going to go search through 00:04:54.960 |
the code base, plan out a couple of different edits, create tickets in our project management 00:04:59.860 |
system, execute on each of those tickets, and come back with a write-up sent in Slack to 00:05:09.840 |
But in addition to that, you also need decision-making, right? 00:05:13.840 |
And so when you are executing on a plan, there are tons of different data and interactions 00:05:20.400 |
that your agent needs to be able to make, right? 00:05:22.980 |
Look at the existing state and make a call based on that. 00:05:26.860 |
A lot of the time, reasoning is referred to as an agent's ability to make these decisions 00:05:35.160 |
And then finally, agents need environmental grounding, right? 00:05:39.020 |
So, agentic systems read and write information to their external environment. 00:05:45.480 |
Ideally, they also use this information to react and adapt to changes in that environment. 00:05:54.580 |
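To make the three characteristics concrete, here is a minimal sketch of a loop that plans, grounds itself in its environment, and then decides. This is not how droids are implemented; `Environment`, `make_plan`, and `run_agent` are hypothetical names, and the "environment" is just an in-memory dictionary of files.

```python
from dataclasses import dataclass, field


@dataclass
class Environment:
    """Hypothetical stand-in for the external world: file name -> contents."""
    files: dict = field(default_factory=dict)

    def read(self, name):
        return self.files.get(name)

    def write(self, name, text):
        self.files[name] = text


def make_plan(goal_files):
    # Planning: decide one or more future actions up front.
    return [("ensure_file", name) for name in goal_files]


def run_agent(env, goal_files):
    log = []
    for _action, name in make_plan(goal_files):
        # Environmental grounding: read the current state before acting.
        current = env.read(name)
        # Decision-making: choose what to do based on what was observed.
        if current is None:
            env.write(name, "# TODO\n")
            log.append(f"created {name}")
        else:
            log.append(f"skipped {name}")
    return log
```

A real system would replace the dictionary with tools and the `if` with model reasoning, but the shape of the loop (plan, observe, decide, act) stays the same.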
When I think about agentic systems, most of the challenge in making them reliable exists 00:06:03.420 |
But there's kind of a meta problem that exists, which is, no matter how reliable your agent 00:06:09.880 |
system is, you have to answer the question, where does the human fit in? 00:06:15.960 |
We think that we're currently at the point in time in history with the least number of developers 00:06:23.840 |
Humans are here to stay in building software. 00:06:27.300 |
But there's an outer loop and there's an inner loop of software development. 00:06:31.780 |
In the outer loop, you are reasoning about what needs to get done. 00:06:37.180 |
You're listening to customers and translating those into requirements. 00:06:40.780 |
You're iterating on different architectural decisions. 00:06:44.440 |
And in that inner loop, you are writing lines of code. 00:06:52.700 |
We think that the inner loop is going away very soon. 00:06:56.320 |
And agents will take up the vast majority of that. 00:06:59.120 |
And so how do you create an AI UX that blends delegation so that you can stay in flow in 00:07:09.760 |
When an agent ultimately can't accomplish something because of the technology, because of the current 00:07:14.320 |
state of where we're at, you need to be able to steer that with precision. 00:07:18.720 |
So that's kind of a background thing that I want to reference as we deep dive into those 00:07:28.480 |
We like to say a droid is only as good as its plan. 00:07:32.760 |
One of the biggest challenges we face is ensuring that a droid creates a good plan and sticks to 00:07:39.780 |
We were inspired by robotics and control systems in thinking about what are the techniques 00:07:46.160 |
we can use to improve the ability of the system to make high quality plans. 00:07:51.800 |
If a droid goes off and executes on something and it's totally the wrong idea, a customer's 00:08:00.100 |
And so there's a couple of things that we'd call out. 00:08:05.540 |
Plans don't need to be just a high level do the thing, right? 00:08:14.040 |
Your plans should be continuously updated by the environment as it's executed on. 00:08:23.000 |
Certain tasks have a certain sort of template or workflow, right, that they can be executed 00:08:30.080 |
Now, if you're too rigid with your plan templates and how your system is supposed to 00:08:33.320 |
do something, you're going to reduce creativity. 00:08:35.820 |
But if you know that at the end of the day, when your agent codes, that it's going to create 00:08:39.900 |
a pull request or a commit, then you can sort of start to reason through what the beginning, 00:08:45.760 |
intermediate, and end steps of your plan might look like, right? 00:08:48.960 |
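One way to picture a plan template that is also updated by the environment is the sketch below. It is an illustration only, under the assumption that a coding task always ends in a pull request; `Plan`, `coding_plan`, and `update_plan` are hypothetical names, not part of any Factory API.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    steps: list = field(default_factory=list)  # remaining steps, in order
    done: list = field(default_factory=list)   # completed steps


def coding_plan(task, middle_steps):
    # Template: the first and last steps are fixed, the middle is free-form.
    return Plan(steps=[f"understand: {task}", *middle_steps,
                       "run tests", "open pull request"])


def update_plan(plan, finished_step, observation):
    """Mark a step as done and revise the plan from what the environment reported."""
    plan.steps.remove(finished_step)
    plan.done.append(finished_step)
    if finished_step == "run tests" and observation == "failed":
        # Feedback from the environment inserts new steps before the PR.
        plan.steps = ["fix failing tests", "run tests"] + plan.steps
    return plan
```

The fixed ending keeps the agent pointed at a known deliverable, while the free middle section leaves room for the creativity that an overly rigid template would remove.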
And so we built all these systems into droids. 00:08:51.860 |
And I have an example here, a real world example of a droid that is making a plan. 00:08:57.200 |
You'll see it is searching through information, it's reasoning, it is taking that information 00:09:04.200 |
and breaking it down into a long form plan that has multiple steps, one for front end, one 00:09:10.480 |
for the back end, tests that it's going to do, and how it will actually execute commands to 00:09:18.180 |
This type of planning is really hard to get right. 00:09:21.480 |
But when you do get it right, it makes your system far more reliable at actually executing 00:09:35.080 |
This is probably the hardest thing to control in your agents. 00:09:40.980 |
When you're building software, when you are doing tasks across other domains, human beings 00:09:46.960 |
are making hundreds, thousands of micro decisions around what to name the variable, the scale 00:09:53.500 |
of the change to make, where should this change go in the code base, should we imitate the patterns 00:09:59.600 |
of the code base, or is the code base filled with tech debt and we should instead innovate 00:10:05.460 |
Agents need to be able to assess these trade-offs and take action decisively and correctly in order 00:10:14.040 |
There's a few factors you can introduce to improve your agents' ability to make decisions. 00:10:22.200 |
And when you're thinking about how to actually introduce these changes, you kind of have to 00:10:31.280 |
think, what sort of decisions am I going to face? 00:10:34.980 |
And do I have the ability to explicitly select the criteria by which my agent should make 00:10:43.440 |
If you are an agent for travel planning or an agent for code review, right, you have a very 00:10:50.540 |
limited set of things that the agent actually will need to decide, right? 00:10:56.140 |
You can say, I need to decide on what the price points are. 00:10:59.800 |
I need to decide about if this code fits a certain set of standards, right? 00:11:04.300 |
If you're more open-ended, like you have an agent that can do a lot of different things, or write a bunch of different 00:11:11.640 |
code paths, there's less explicit decision-making criteria that you can introduce, right? 00:11:17.360 |
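For a narrow agent such as code review, explicit criteria can be written down as named checks. The sketch below is a hypothetical example: the three criteria and the `review_diff` name are assumptions chosen for illustration, not standards from the talk.

```python
def review_diff(diff_lines, max_added_lines=400):
    """Apply explicit, named criteria to the added lines of a unified diff."""
    # Keep only added lines, skipping the "+++ b/file" header lines.
    added = [line[1:] for line in diff_lines
             if line.startswith("+") and not line.startswith("+++")]
    criteria = {
        "change_is_small": len(added) <= max_added_lines,
        "no_debug_prints": not any("print(" in line for line in added),
        "no_todo_left": not any("TODO" in line for line in added),
    }
    failed = [name for name, ok in criteria.items() if not ok]
    verdict = "approve" if not failed else "request_changes"
    return verdict, failed
```

Returning the names of the failed criteria, not just the verdict, also lets the system show its work to the human reviewing the agent's decision.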
A lot of this stuff happens in the reasoning layer of models. 00:11:21.460 |
So you really have to think about how your agent is instructed and the context that it 00:11:28.920 |
Factory customers often ask questions like, how do I structure an API for this project, right? 00:11:34.020 |
They expect droids to be able to evaluate requirements from the user and from their organization, assess 00:11:40.700 |
the existing code base, reason through different performance implications, and then ultimately 00:11:48.340 |
So this is really tricky, but powerful when done right. 00:11:51.280 |
And so if you take some of these decision-making criteria, context from the environment, you bring 00:11:56.540 |
it together, that is what gives an agent the ability to actually make a proper decision. 00:12:02.500 |
And the last thing is environmental grounding, right? 00:12:05.280 |
This is the connection your agent has with the real world. 00:12:09.020 |
Agents interact with the world through AI-computer interfaces, right? 00:12:14.140 |
That means dedicated tools and context injection from other systems to the agent. 00:12:19.840 |
This is actually really different and not so simple as saying, let's take an API or let's 00:12:27.220 |
The entire internet and the last 40 years of computing on top of it has basically existed 00:12:36.060 |
And so we have to build new AI computer interfaces that let our agents naturally interact with 00:12:43.580 |
This is where we spend most of our time at Factory. 00:12:46.500 |
We found that control over the tools your agent uses is the single most important differentiator 00:12:54.680 |
In addition to just the tools themselves, there is also a sense that in order to ground yourself 00:13:02.000 |
in your environment, you need to properly process information that comes in. 00:13:06.340 |
An example of this might be if I take a CLI command and I get 100,000 lines of response 00:13:14.560 |
If we just pass that to the agent, it's going to go off the rails, right? 00:13:21.400 |
We need to take the important information and hand it to the agent and the unimportant information 00:13:27.120 |
These types of decisions are make or break for complex systems that interact with huge volumes 00:13:33.060 |
So think about how you process information, not just the tools that you're actually calling. 00:13:39.120 |
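As a rough sketch of that kind of processing, the function below condenses a very long command output before it reaches the agent. The keyword list, the head and tail sizes, and the name `condense_output` are all assumptions made for illustration; a production system would likely use something more sophisticated.

```python
import re

# Hypothetical heuristic for what counts as "important" in CLI output.
IMPORTANT = re.compile(r"error|fail|exception|traceback", re.IGNORECASE)


def condense_output(text, head=5, tail=5, max_matches=20):
    """Keep the first and last lines plus any lines that look like problems."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text  # short enough to pass through unchanged
    middle = lines[head:len(lines) - tail]
    matches = [line for line in middle if IMPORTANT.search(line)][:max_matches]
    omitted = len(middle) - len(matches)
    marker = f"[... {omitted} lines omitted ...]"
    return "\n".join(lines[:head] + matches + [marker] + lines[len(lines) - tail:])
```

With this, 100,000 lines of output containing one error line shrink to a dozen lines, and the marker tells the agent that something was left out so it can ask for more if needed.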
And so here's an example of a Droid being handed a Sentry error alert. 00:13:47.760 |
It's going to search through repositories to find the candidate, search using a couple of 00:13:52.520 |
different strategies, semantic, glob, grep, APIs for Sentry, view that relevant information, 00:13:59.760 |
and access GitHub PRs that happened around the time of the merged error, and then validate 00:14:07.240 |
And so it's going to use all that knowledge to then write an RCA based on all of that information. 00:14:13.760 |
This is the type of grounding that your systems need to do in order to go beyond just coding 00:14:18.380 |
and enter into the world of full software development work. 00:14:24.920 |
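One step in that example, finding pull requests merged around the time of the error, can be sketched as a simple time-window filter. The data shape (dicts with `number` and `merged_at`) and the 24-hour default window are assumptions for illustration; in practice this data would come from your source code manager's API.

```python
from datetime import datetime, timedelta


def candidate_prs(prs, error_first_seen, window=timedelta(hours=24)):
    """Return PRs merged within `window` before the error, newest first."""
    start = error_first_seen - window
    hits = [pr for pr in prs if start <= pr["merged_at"] <= error_first_seen]
    return sorted(hits, key=lambda pr: pr["merged_at"], reverse=True)
```

The newest-first ordering reflects a common assumption in root cause analysis: the change merged closest to the first occurrence of an error is usually the first one worth inspecting.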
So just to recap, a couple of main takeaways. 00:14:28.580 |
Start with a clear plan and start with clear boundaries. 00:14:33.360 |
Show the work of your systems as they reason and make decisions, and that helps build trust, 00:14:41.380 |
And iterate as your agent works, allow it to reason, search, think through hard problems, 00:14:50.880 |
And finally, from that beginning point, design for human collaboration. 00:14:58.400 |
Shipping high-quality software is climbing Mount Everest. 00:15:03.020 |
A couple of notes that I want to add. 00:15:08.540 |
We're always thinking about deeper integrations, memory, and if you are thinking about working 00:15:14.340 |
on these types of hard problems, or working on agentic systems in general, we are always 00:15:20.480 |
And if your team is not delegating more than 50% of its engineering tasks to AI agents,