
Breakthrough Agents: Building Reliable Agentic Systems



00:00:00.000 | Hey everybody, my name is Eno, co-founder and CTO of a company called Factory.
00:00:19.760 | At Factory, we believe that the way we build software is radically changing.
00:00:26.240 | We are transitioning from the era of human-driven software development to agent-driven software
00:00:32.140 | development.
00:00:33.380 | You can see glimpses of that today; however, it seems like we are trying to get to that
00:00:40.400 | future incrementally.
00:00:42.700 | The current zeitgeist is to take the IDE, a tool that was designed first and foremost
00:00:48.260 | for human beings to write lines of code by hand, and add AI on top of that.
00:00:54.480 | And then you keep adding AI until something changes.
00:00:59.940 | But what we've seen is that for organizations, both small and large, to unlock productivity gains
00:01:08.840 | beyond just 10 or 15 percent, a deeper workflow change is demanded.
00:01:16.500 | And so, we've noticed that in order to truly get 5, 10, 20x performance, you need to start
00:01:24.140 | shifting your mindset from, "We are collaborating with AI systems," to, "We are delegating tasks
00:01:30.720 | entirely to AI systems."
00:01:33.680 | In order to make that transition, you start to realize that you need a really different
00:01:38.140 | type of tool, a different type of platform, one that has an interface that lets you manage
00:01:45.600 | and delegate to these AI systems.
00:01:49.020 | You need infrastructure that lets you scale up and parallelize these agents at the same time.
00:01:55.880 | You need context from across your entire engineering system, integrating not just GitHub,
00:02:02.540 | source code, or Jira, but also your observability tools, right?
00:02:07.000 | Your knowledge bases, the internet.
00:02:10.280 | And you need to connect all of this with agents that can actually reliably execute on the task
00:02:15.900 | in an end-to-end way.
00:02:18.040 | We think that this deserves an entirely new sort of platform.
00:02:22.320 | And so, at Factory, we're building that.
00:02:25.580 | Most of the enterprise organizations that we work with have 5,000, 10,000-plus engineers, right?
00:02:32.260 | And so, this transition doesn't just come from switching your editor or from
00:02:38.920 | watching some tutorials about vibe coding.
00:02:42.200 | Instead, it requires an active investment.
00:02:45.340 | And so, I'm going to talk about some of the things we've done in our platform that
00:02:50.240 | enable this human-computer interaction with agents and also talk about some of the things
00:02:55.100 | that we've done to enable reliable agents that can actually execute on some of the tasks
00:03:00.240 | that we are talking about.
00:03:03.120 | So I'm going to start with a quick video of what's possible.
00:03:06.780 | This video showcases a quick glimpse of what it's like to trigger an agentic system,
00:03:12.160 | which we call a droid, from your project management system.
00:03:15.920 | And then you watch it, in our platform, go from ticket to pull request.
00:03:22.300 | These types of tasks have quickly become table stakes for agentic systems.
00:03:27.600 | Tasks with a clear goal and clear success criteria are now quite solvable with AI.
00:03:34.540 | And the agent accomplishes these tasks with a combination of tools, reasoning, search,
00:03:40.920 | computer use, and then it can take all of this and bring it into one of your existing
00:03:45.780 | platforms, your source code manager, maybe GitHub.
00:03:48.960 | But what really is an agentic system?
00:03:52.580 | How does it use tools?
00:03:54.240 | These are all questions that plenty of people have spoken about.
00:03:57.220 | The definition remains quite fuzzy, right?
00:04:00.020 | And so, before we go into super deep detail about how you get something like this, going
00:04:05.040 | from a plan all the way to mergeable code, I'm going to talk a bit about what we think
00:04:11.660 | an agentic system is, right?
00:04:14.020 | So we've found that there are lots of different workflows, some people call them cognitive architectures,
00:04:20.020 | that we've experimented with, and that plenty of other teams have experimented with.
00:04:24.360 | And it's hard to draw an explicit boundary of what an agent is.
00:04:28.860 | But there are three characteristics that we think are pretty consistent amongst the systems
00:04:33.200 | that most people refer to as being agentic.
00:04:35.920 | The first is planning, right?
00:04:37.780 | Your agentic system needs to make plans that decide one or more future actions in the system.
00:04:44.260 | That can be very simple, like outlining a single edit call and then returning back
00:04:50.080 | to the user, or it could be quite complex, like saying, "We're going to go search through
00:04:54.960 | the code base, plan out a couple of different edits, create tickets in our project management
00:04:59.860 | system, execute on each of those tickets, and come back with a write-up sent in Slack to
00:05:04.840 | our eng manager," right?
00:05:07.480 | Plans can be small, they can be simple.
00:05:09.840 | But in addition to that, you also need decision-making, right?
00:05:13.840 | And so when you are executing on a plan, there are tons of decisions, across different data
00:05:20.400 | and interactions, that your agent needs to be able to make, right?
00:05:22.980 | Look at the existing state and make a call based on that.
00:05:26.860 | A lot of the time, reasoning refers to an agent's ability to make these decisions
00:05:33.160 | at intermediate steps.
00:05:35.160 | And then finally, agents need environmental grounding, right?
00:05:39.020 | So, agentic systems read and write information to their external environment.
00:05:45.480 | Ideally, they also use this information to react and adapt to changes in that environment.
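
To make those three characteristics concrete, here is a minimal sketch of that loop in Python. Every name in it (Environment, llm_plan, llm_decide, run_tool) is a hypothetical stand-in for a model or tool call, not Factory's implementation.

```python
# A minimal sketch of the plan / decide / ground loop.
# All names here are hypothetical illustrations, not Factory's actual code.
from dataclasses import dataclass, field


@dataclass
class Environment:
    """External state the agent reads from and writes to."""
    observations: list[str] = field(default_factory=list)

    def observe(self) -> str:
        return self.observations[-1] if self.observations else ""


def llm_plan(goal: str) -> list[str]:
    # Planning: decide one or more future actions up front.
    return [f"step for: {goal}"]  # placeholder for a model call


def llm_decide(step: str, observation: str) -> str:
    # Decision-making: pick the next concrete action given current state.
    return f"action for {step!r} given {observation!r}"  # placeholder


def run_tool(action: str, env: Environment) -> None:
    # Environmental grounding: act on the world, record what came back.
    env.observations.append(f"result of {action!r}")


def run_agent(goal: str, env: Environment) -> None:
    plan = llm_plan(goal)
    for step in plan:
        action = llm_decide(step, env.observe())
        run_tool(action, env)


run_agent("fix the flaky login test", Environment())
```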
00:05:54.580 | When I think about agentic systems, most of the challenge in making them reliable exists
00:06:01.020 | in the details of the agent itself.
00:06:03.420 | But there's kind of a meta problem that exists, which is, no matter how reliable your agent
00:06:09.880 | system is, you have to answer the question, where does the human fit in?
00:06:15.960 | We think that we're currently at the point in history with the fewest developers
00:06:21.720 | that there will ever be.
00:06:23.840 | Humans are here to stay in building software.
00:06:27.300 | But there's an outer loop and there's an inner loop of software development.
00:06:31.780 | In the outer loop, you are reasoning about what needs to get done.
00:06:35.280 | You're working with your colleagues.
00:06:37.180 | You're listening to customers and translating those into requirements.
00:06:40.780 | You're iterating on different architectural decisions.
00:06:44.440 | And in that inner loop, you are writing lines of code.
00:06:47.340 | You're testing that code.
00:06:48.860 | You're building.
00:06:49.860 | You're checking up on that.
00:06:51.080 | You're doing code review.
00:06:52.700 | We think that the inner loop is going away very soon,
00:06:56.320 | and agents will take up the vast majority of it.
00:06:59.120 | And so how do you create an AI UX that blends delegation, so that you can stay in flow in
00:07:06.460 | that outer loop, with control?
00:07:09.760 | When an agent ultimately can't accomplish something because of the current state
00:07:14.320 | of the technology, you need to be able to steer it with precision.
00:07:17.720 | Right?
00:07:18.720 | So that's the background I want to reference as we deep-dive into those
00:07:21.820 | three principles of agentic systems.
00:07:26.780 | So planning.
00:07:28.480 | We like to say a droid is only as good as its plan.
00:07:32.760 | One of the biggest challenges we face is ensuring that a droid creates a good plan and sticks to it.
00:07:39.780 | We were inspired by robotics and control systems in thinking about what techniques
00:07:46.160 | we can use to improve the ability of the system to make high-quality plans.
00:07:51.800 | If a droid goes off and executes on something and it's totally the wrong idea, a customer's
00:07:56.780 | time and money are heavily wasted.
00:08:00.100 | And so there are a couple of things that we'd call out.
00:08:03.340 | The first is decomposition, right?
00:08:05.540 | Plans don't need to be just a high-level "do the thing," right?
00:08:08.880 | We can break plans down into subtasks.
00:08:12.020 | The second is model predictive control, right?
00:08:14.040 | Your plans should be continuously updated by the environment as they're executed.
00:08:19.660 | And the third is explicit plan templating, right?
00:08:23.000 | Certain tasks have a certain sort of template or workflow that they can be executed against.
00:08:30.080 | Now, if you're too rigid with your plan templates and how your system is supposed to
00:08:33.320 | do something, you're going to reduce creativity.
00:08:35.820 | But if you know that at the end of the day, when your agent codes, it's going to create
00:08:39.900 | a pull request or a commit, then you can start to reason through what the beginning,
00:08:45.760 | intermediate, and end steps of your plan might look like, right?
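
As a rough illustration of those three techniques together, here is a hypothetical sketch of a templated coding plan that gets decomposed into subtasks and replanned against the environment as it executes. All names (PlanStep, coding_plan_template, replan) are invented for illustration, not Factory's implementation.

```python
# Hypothetical sketch of decomposition, plan templating, and
# MPC-style replanning; not Factory's actual implementation.
from dataclasses import dataclass


@dataclass
class PlanStep:
    description: str
    done: bool = False


def coding_plan_template(task: str) -> list[PlanStep]:
    # Explicit plan template: we know a coding task starts with context
    # gathering and ends in a commit or pull request, so the skeleton is
    # fixed and the middle is decomposed into subtasks.
    return [
        PlanStep(f"Search the codebase for context on: {task}"),
        PlanStep(f"Decompose '{task}' into concrete edits"),
        PlanStep("Apply the edits"),
        PlanStep("Run the tests"),
        PlanStep("Open a pull request"),
    ]


def replan(plan: list[PlanStep], observation: str) -> list[PlanStep]:
    # Model predictive control, loosely: after each step, re-evaluate the
    # remaining plan against what the environment actually reported.
    if "test failure" in observation:
        remaining = [s for s in plan if not s.done]
        return [PlanStep("Fix the failing test")] + remaining
    return plan


def execute(step: PlanStep) -> str:
    step.done = True
    return f"ok: {step.description}"  # placeholder for real tool calls


plan = coding_plan_template("add feature flag to checkout")
while any(not s.done for s in plan):
    step = next(s for s in plan if not s.done)
    observation = execute(step)
    plan = replan(plan, observation)
```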
00:08:48.960 | And so we built all these systems into droids.
00:08:51.860 | And I have an example here, a real world example of a droid that is making a plan.
00:08:57.200 | You'll see it is searching through information, it's reasoning, it is taking that information
00:09:04.200 | and breaking it down into a long-form plan that has multiple steps: one for the front end, one
00:09:10.480 | for the back end, the tests it's going to run, and how it will actually execute commands to
00:09:15.440 | roll this change out with feature flags.
00:09:18.180 | This type of planning is really hard to get right.
00:09:21.480 | But when you do get it right, it makes your system far more reliable at actually executing
00:09:25.880 | on that end task.
00:09:28.240 | And now this droid will keep going.
00:09:32.760 | The second thing is decision making, right?
00:09:35.080 | This is probably the hardest thing to control in your agents.
00:09:40.980 | When you're building software, or doing tasks across other domains, human beings
00:09:46.960 | are making hundreds, thousands of micro-decisions: what to name the variable, the scale
00:09:53.500 | of the change to make, where the change should go in the codebase, whether to imitate the patterns
00:09:59.600 | of the codebase, or, if the codebase is filled with tech debt, to instead innovate
00:10:03.620 | on it, right?
00:10:05.460 | Agents need to be able to assess these trade-offs and take action decisively and correctly in order
00:10:11.780 | to be effective.
00:10:14.040 | There are a few factors you can introduce to improve your agents' ability to make decisions.
00:10:22.200 | And when you're thinking about how to actually introduce these, you have to
00:10:31.280 | think: what sorts of decisions is my agent going to face?
00:10:34.980 | And do I have the ability to explicitly select criteria by which my agent should make
00:10:41.320 | these decisions, right?
00:10:43.440 | If you're building an agent for travel planning or an agent for code review, you have a very
00:10:50.540 | limited set of things that the agent will actually need to decide, right?
00:10:56.140 | You can say, I need to decide what the price points are.
00:10:59.800 | I need to decide whether this code fits a certain set of standards, right?
00:11:04.300 | If you're more open-ended, like an agent that can do a lot of different things or write a bunch of different
00:11:11.640 | code paths, there are fewer explicit decision-making criteria that you can introduce, right?
00:11:17.360 | A lot of this stuff happens in the reasoning layer of models.
00:11:21.460 | So you really have to think about how your agent is instructed and the context that it
00:11:26.020 | has about the environment around it.
00:11:28.920 | Factory customers often ask questions like, how do I structure an API for this project, right?
00:11:34.020 | They expect droids to be able to evaluate requirements from the user and from their organization, assess
00:11:40.700 | the existing code base, reason through different performance implications, and then ultimately
00:11:45.840 | come up with a final decision of what to do.
00:11:48.340 | So this is really tricky, but powerful when done right.
00:11:51.280 | And so if you take some of these decision-making criteria and context from the environment, and bring
00:11:56.540 | them together, that is what gives an agent the ability to actually make a proper decision.
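
To illustrate what explicit decision-making criteria plus environmental context might look like for a narrow agent, here is a hypothetical sketch for a code-review agent. The criteria, the names, and the call_model stub are invented for illustration, not Factory's.

```python
# Hypothetical sketch: explicit decision criteria for a narrow
# code-review agent, combined with context from the environment.
REVIEW_CRITERIA = [
    "Does the change follow the existing patterns of the codebase?",
    "Is the change covered by tests?",
    "Does the change keep the public API backward compatible?",
]


def build_review_prompt(diff: str, codebase_context: str) -> str:
    # Combine explicit decision criteria with environmental context;
    # the point from the talk is that both are needed for a proper decision.
    criteria = "\n".join(f"- {c}" for c in REVIEW_CRITERIA)
    return (
        "Review this change. Decide: approve or request-changes.\n"
        f"Criteria:\n{criteria}\n\n"
        f"Relevant codebase context:\n{codebase_context}\n\n"
        f"Diff:\n{diff}"
    )


def call_model(prompt: str) -> str:
    return "approve"  # placeholder for a real model call


decision = call_model(build_review_prompt("<diff>", "<context>"))
```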
00:12:02.500 | And the last thing is environmental grounding, right?
00:12:05.280 | This is the connection your agent has with the real world.
00:12:09.020 | Agents interact with the world through AI-computer interfaces, right?
00:12:14.140 | Dedicated tools and context injection from other systems into the agent.
00:12:19.840 | This is actually really different, and not as simple as saying, let's take an API or let's
00:12:25.420 | just build a simple tool, right?
00:12:27.220 | The entire internet and the last 40 years of computing on top of it have basically existed
00:12:33.980 | primarily for human beings.
00:12:36.060 | And so we have to build new AI computer interfaces that let our agents naturally interact with
00:12:42.060 | the world.
00:12:43.580 | This is where we spend most of our time at Factory.
00:12:46.500 | We found that control over the tools your agent uses is the single most important differentiator
00:12:51.660 | in your agent's reliability.
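
As one way to picture an AI-computer interface, here is a hypothetical sketch of a tool defined for an agent rather than a human: the description is written for the model, and the schema constrains how the agent can call it. The names and schema here are invented for illustration, not Factory's code.

```python
# Hypothetical sketch of an AI-computer interface: a tool defined for an
# agent rather than a human. Names and schema are invented for illustration.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    description: str  # written for the model, not for a human
    parameters: dict  # JSON-schema-style argument spec
    run: Callable[..., str]


def search_code(query: str, max_results: int = 10) -> str:
    return f"results for {query!r}"  # placeholder for a real search backend


search_tool = Tool(
    name="search_code",
    description=(
        "Search the codebase. Returns at most max_results matches, each as "
        "path:line plus a short snippet. Prefer narrow queries."
    ),
    parameters={
        "type": "object",
        "properties": {
            "query": {"type": "string"},
            "max_results": {"type": "integer", "default": 10},
        },
        "required": ["query"],
    },
    run=search_code,
)

print(search_tool.run("checkout feature flag"))
```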
00:12:54.680 | In addition to just the tools themselves, there is also a sense that in order to ground yourself
00:13:02.000 | in your environment, you need to properly process information that comes in.
00:13:06.340 | An example of this might be: I run a CLI command and I get 100,000 lines of output
00:13:11.980 | back from that tool call.
00:13:14.560 | If we just pass all of that to the agent, it's going to go off the rails, right?
00:13:17.920 | We need to process it.
00:13:19.460 | We need to find out what's important.
00:13:21.400 | We need to take the important information and hand it to the agent, and take the unimportant
00:13:25.800 | information and hide it.
00:13:27.120 | These types of decisions are make or break for complex systems that interact with huge volumes
00:13:32.060 | of data.
00:13:33.060 | So think about how you process information, not just the tools that you're actually calling.
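
A minimal sketch of that kind of output processing might look like the following. The markers, limits, and heuristics are invented for illustration, not Factory's actual pipeline.

```python
# Hypothetical sketch of processing a huge tool output before it reaches
# the agent: surface what looks important, hide the rest.
IMPORTANT_MARKERS = ("error", "fail", "warning", "traceback")


def compress_tool_output(raw: str, head: int = 50, tail: int = 50,
                         max_important: int = 100) -> str:
    lines = raw.splitlines()
    if len(lines) <= head + tail:
        return raw  # small enough to pass through untouched

    # Keep the head and tail verbatim, and flag lines in the middle
    # that match simple importance heuristics.
    important = [
        line for line in lines[head:-tail]
        if any(m in line.lower() for m in IMPORTANT_MARKERS)
    ][:max_important]

    hidden = len(lines) - head - tail - len(important)
    return "\n".join(
        lines[:head]
        + [f"... [{hidden} lines hidden; {len(important)} flagged lines kept] ..."]
        + important
        + lines[-tail:]
    )
```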
00:13:39.120 | And so here's an example of a droid being handed a Sentry error alert.
00:13:43.740 | And it's being told: we need an RCA.
00:13:45.740 | We need to figure out what happened, right?
00:13:47.760 | It's going to search through repositories to find the candidate code, searching with a couple of
00:13:52.520 | different strategies: semantic search, glob, grep, the Sentry APIs. It'll view the relevant information,
00:13:59.760 | access GitHub PRs that were merged around the time of the error, and then validate
00:14:05.160 | that with some additional searches.
00:14:07.240 | And so it's going to use all of that knowledge to then write an RCA.
00:14:13.760 | This is the type of grounding that your systems need to do in order to go beyond just coding
00:14:18.380 | and enter into the world of full software development work.
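
As a purely hypothetical sketch of how those grounding steps might compose into an RCA workflow; every helper function here is an invented stand-in, not Factory's API.

```python
# Hypothetical sketch of composing grounding steps into an RCA workflow.
# Every helper is an invented placeholder, not Factory's actual API.
def sentry_events(alert_id: str) -> list[str]:
    return [f"event details for {alert_id}"]  # placeholder for Sentry API calls

def semantic_search(query: str) -> list[str]:
    return []  # placeholder for an embedding-based code search

def grep_search(pattern: str) -> list[str]:
    return []  # placeholder for a literal/regex search

def recent_prs(window: str) -> list[str]:
    return []  # placeholder for a GitHub query around the error's first occurrence

def write_rca(evidence: list[str]) -> str:
    return "RCA draft"  # placeholder for a model call over the evidence


def run_rca(alert_id: str, error_message: str) -> str:
    evidence: list[str] = []
    evidence += sentry_events(alert_id)          # view the alert's details
    evidence += semantic_search(error_message)   # find candidate code
    evidence += grep_search(error_message)       # cross-check with grep
    evidence += recent_prs(window=alert_id)      # PRs merged near the error
    # Validate: a second pass of searches over what the first pass surfaced.
    for lead in list(evidence)[:5]:
        evidence += grep_search(lead)
    return write_rca(evidence)
```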
00:14:24.920 | So just to recap, a couple of main takeaways.
00:14:28.580 | Start with a clear plan and start with clear boundaries.
00:14:33.360 | Show the work of your systems as they reason and make decisions; that helps build trust,
00:14:38.760 | and it helps you debug.
00:14:41.380 | And iterate as your agent works: allow it to reason, search, and think through hard problems,
00:14:48.420 | and ultimately ground it in the environment.
00:14:50.880 | And finally, from that beginning point, design for human collaboration.
00:14:55.400 | We think of AI systems like climbing gear.
00:14:58.400 | Shipping high-quality software is like climbing Mount Everest.
00:15:02.020 | It's pretty hard to do.
00:15:03.020 | A couple of notes that I want to add.
00:15:08.540 | We're always thinking about deeper integrations, memory, and if you are thinking about working
00:15:14.340 | on these types of hard problems, or working on agentic systems in general, we are always
00:15:19.480 | hiring.
00:15:20.480 | And if your team is not delegating more than 50% of its engineering tasks to AI agents,
00:15:26.940 | you should come talk to Factory.
00:15:28.940 | Thanks.