back to indexLangChain Interrupt 2025 Factory AI Building Reliable Agentic Systems – Eno Reyes

00:00:00.800 |
The IDE, a tool that was designed first and foremost 00:00:03.960 |
for human beings to write lines of code by hand 00:00:10.000 |
And then you keep adding AI until something changes. 00:00:15.360 |
But what we've seen is that in order for organizations, 00:00:20.360 |
both small and large, to unlock changes in productivity 00:00:24.560 |
beyond just 10 and 15%, there is more of a workflow change 00:00:32.040 |
And so we've noticed that in order to truly get 5, 10, 20x 00:00:38.160 |
performance, you need to start shifting your mindset 00:00:44.480 |
to we are delegating tasks entirely to AI systems. 00:00:51.440 |
start to realize that you need a really different type 00:00:54.280 |
of tool, a different type of platform, one that has an interface 00:00:59.440 |
that lets you manage and delegate these AI systems. 00:01:04.120 |
You need infrastructure that lets you scale up and parallelize 00:01:10.800 |
You need context from across your entire engineering system that 00:01:15.760 |
integrates more than just GitHub, source code, or Jira, but also your observability 00:01:22.160 |
Your knowledge bases, the internet, and you need to connect all of this with agents that can 00:01:28.600 |
actually reliably execute on the task in an end-to-end way. 00:01:32.240 |
We think that this deserves an entirely new sort of platform. 00:01:40.680 |
most of the enterprise organizations that we work with have 5, 10,000 plus engineers, right? 00:01:47.360 |
And so this transition doesn't just come from kind of switching your editor or from switching 00:01:54.480 |
kind of to watching some tutorials about live coding. 00:01:57.360 |
So instead, it kind of requires an active investment. 00:02:00.440 |
And so I'm going to kind of talk about some of the things we've done in our platform that 00:02:05.880 |
enable this human computer interaction with agents and also talk about some of the things 00:02:11.040 |
that we've done to enable reliable agents that can actually execute on some of the tasks 00:02:18.040 |
So I'm going to start with a quick video of what's possible. 00:02:22.360 |
This video showcases a quick glimpse of what it's like to trigger an agentic system. 00:02:27.760 |
We call them droids from your project management system. 00:02:31.520 |
And if you watch in a platform, watch it go from ticket to pull request. 00:02:37.920 |
These types of tasks have quickly become table stakes for agentic systems. 00:02:43.240 |
These tasks with a clear goal, clear success criteria, they're now quite solvable with AI. 00:02:49.920 |
And the agent accomplishes these tasks with a combination of tools, reasoning, search, right, 00:02:56.760 |
computer use, and then it can take all of this and bring it into one of your existing platforms, 00:03:01.960 |
your source code management, and your GitHub. 00:03:09.720 |
These are all questions that plenty of people have spoken about. 00:03:15.520 |
And so before we go into super deep detail about how you get something like this, going 00:03:20.760 |
from a plan all the way to a mergeable code, I'm going to just talk a bit about what we think 00:03:29.520 |
So we found that there's lots of different workflows, some people call them cognitive architectures, 00:03:35.520 |
that we've experimented with, plenty of other teams have experimented with. 00:03:39.720 |
And it's hard to draw an explicit boundary of what an agent is. 00:03:44.320 |
But there are three characteristics that we think are pretty consistent amongst the systems 00:03:53.480 |
Your agentic system needs to make plans that decide one or more future actions in the system. 00:03:59.720 |
That can be very simple, like outlining a single kind of edit call and then returning back to the user, where it can be quite complex. 00:04:07.720 |
Like saying, we're going to go search through the code base, plan out a couple of different edits, create tickets in our project management system, execute on each of those tickets, and come back with a write-up sentence slap to our manager manager. 00:04:20.720 |
Plans can be small, they can be simple, but in addition to that, you also need decision making, right? 00:04:27.720 |
And so when you are executing on a plan, there are tons of different data and interactions that your agent needs to be able to make, right? 00:04:37.320 |
Look at the existing state and make a call based on that. 00:04:41.320 |
A lot of the time, reasoning is referred to as an agent's ability to make these decisions in an intermediate way. 00:04:50.320 |
And then finally, agents need environmental grounding, right? 00:04:54.320 |
So, agentic systems read and write information to their external environment. 00:05:00.320 |
Ideally, they also use this information to react and adapt to changes in that environment. 00:05:06.320 |
When I think about agentic systems, most of the challenge in making them reliable exists in the details of the agent itself. 00:05:18.320 |
But there's kind of a meta problem that exists, which is no matter how reliable your agent system is, you have to answer the question, where does the human fit in? 00:05:31.320 |
We think that we're currently at the point in time in history with the least number of developers that there will ever be. 00:05:38.320 |
Humans are here to stay in building software. 00:05:42.320 |
But there's an outer loop and there's an inner loop of software development. 00:05:46.320 |
In the outer loop, you are reasoning about what needs to get done. 00:05:52.320 |
You're listening to customers and translating those into requirements. 00:05:55.320 |
You're iterating on different architectural decisions. 00:05:59.320 |
And in that inner loop, you are writing lines of code. 00:06:07.320 |
We think that the inner loop is going away very soon. 00:06:11.320 |
And agents will take up the vast majority of that. 00:06:14.320 |
And so how do you create an AI/UFs that blends delegation so that you can stay and flow in that outer loop with control when an agent ultimately can't accomplish something because of the technology, because of the current state of where we're at? 00:06:30.320 |
You need to be able to steer that with precision, right? 00:06:33.320 |
So this is kind of a background thing that I want to reference as we deep dive into those three principles of agenting systems. 00:06:43.320 |
We like to say a droid is only as good as its plan. 00:06:47.320 |
One of the biggest challenges we face is ensuring that a droid creates a good plan and sticks to it. 00:06:55.320 |
We were inspired by robotics and control systems in thinking about what are the techniques we can use to improve the ability of the system to make high-quality plans. 00:07:07.320 |
If a droid goes off and executes on something and it's totally the wrong idea, a customer's time and money is heavily wasted, right? 00:07:15.320 |
And so there's a couple of things that we call out. 00:07:20.320 |
Plans don't need to be just a high-level do-the-thing, right? 00:07:29.320 |
Your plans should be continuously updated by the environment as it's executed on. 00:07:37.320 |
Certain tasks have a certain sort of template or workflow, right, that they can be executed on. 00:07:44.320 |
Now, if you're too rigid with your plan templates and how your system wants to do something, you're going to reduce creativity. 00:07:50.320 |
But if you know that at the end of the day when you're engaging codes that it's going to create a pull request or a commit, then you can sort of start to reason through what the beginning, intermediate, and end steps of your plan might look like. 00:08:04.320 |
And so we built all these systems into droids. 00:08:07.320 |
And I have an example here, a real-world example of a droid that is making a plan. 00:08:12.320 |
You'll see it is searching through information, it's reasoning, it is taking that information and breaking it down into a long-form plan that has multiple steps. 00:08:24.320 |
One for front-end, one for back-end tests that it's going to do, and how it will actually execute commands to roll this change out with feature flags. 00:08:33.320 |
This type of planning is really hard to get right, but when you do get it right, it makes your system far more reliable at actually executing on that end of tasks. 00:08:50.320 |
This is probably the hardest thing to control in your agents. 00:08:55.320 |
When you're building software, when you are doing tasks across other domains, human beings are making hundreds, thousands of micro decisions around what's named the variable, the scale of the change to make, where should this change go in the codebase, should we imitate the patterns of the codebase, or is the codebase filled with tech debt and we should instead innovate on the codebase, right? 00:09:20.320 |
Agents need to be able to assess these trade-offs and take action decisively and correctly in order to be effective. 00:09:29.320 |
There's a few factors you can introduce to improve your agent's ability to make decisions, right? 00:09:35.320 |
And when you're thinking about how to actually introduce these changes, you kind of have to think what sort of decisions am I going to face and do I have the ability to explicitly select or criteria by which my agent should make these decisions, right? 00:09:58.320 |
If you are an agent for travel planning or an agent for code review, right? 00:10:05.320 |
You have a very limited set of things that the agent actually will need to decide, right? 00:10:11.320 |
You can say, I need to decide on what the price points are. 00:10:14.320 |
I need to decide about if this code fits a certain set of standards, right? 00:10:19.320 |
If you're more open-ended, like you have an agent that can do a lot of different things or write a bunch of different code paths, there's less explicit decision-making criteria that you can introduce, right? 00:10:32.320 |
A lot of this stuff happens in the reasoning layer of models. 00:10:36.320 |
So you really have to think about how your agent is instructed and the context that it has about the environment around it. 00:10:43.320 |
Factory customers often ask questions like, how do I structure an API for this project, right? 00:10:49.320 |
They expect droids to be able to evaluate requirements from the user and from their organization, assess the existing code base, reason through different performance implications, and then ultimately come up with a final decision of what to do. 00:11:03.320 |
So this is really tricky, but powerful when done right. 00:11:06.320 |
And so if you take some of these decision-making criteria, context in the environment, you bring it together, that is what gives an agent the ability to actually make a proper decision. 00:11:17.320 |
And the last thing is environmental grounding, right? 00:11:20.320 |
This is the connection your agent has with the real world. 00:11:23.320 |
Agents interact through the world in AI computer interfaces, right? 00:11:28.320 |
Dedicated tools and context injection from other systems to the agent. 00:11:34.320 |
This is actually really different and not so simple as saying, let's take an API or let's just build a simple tool, right? 00:11:41.320 |
The entire internet and the last 40 years of computing on top of it has basically existed primarily for human beings. 00:11:50.320 |
And so we have to build new AI computer interfaces that let our agents naturally interact with the world. 00:11:58.320 |
This is where we spend most of our time in factory. 00:12:01.320 |
We found that control over the tools your agent uses is the single most important differentiator in your agent's reliability. 00:12:08.320 |
In addition to just the tools themselves, there is also a sense that in order to ground yourself in your environment, you need to properly process information that comes in. 00:12:21.320 |
An example of this might be if I take a CLI command and I get 100,000 lines of response from that CLI tool call. 00:12:29.320 |
If we just pass that to the agent, it's going to go off the rails, right? 00:12:36.320 |
We need to take the important information and hand it to the agent and the unimportant information and hide it. 00:12:42.320 |
These types of decisions are make or rate for complex systems that interact with huge volumes of data. 00:12:48.320 |
So think about how you process information, not just the tools that you're actually using. 00:12:53.320 |
And so here's an example of a droid being called a Sentry error. 00:13:03.320 |
It's going to search through repositories to find the candidate, search using a couple of different strategies, semantic, law, breadth, APIs, entry, view that relevant information, and access Gitmo PRs that happened around the time of the merged error, and then validate that with some additional searches.