So for this talk, what we thought would be useful is to give practical examples and lessons from problems and solutions we identified while building the droids. For context about Factory: our mission is to bring autonomy to software engineering. Concretely, that means we build products we call droids, autonomous systems applied to different stages of the software development life cycle: think code review, documentation, testing, all the way to end-to-end coding tasks like a refactor, a migration, or feature work. Each of the droids has a separate cognitive architecture mapped to the task at hand. In particular, droids like the review droid or the documentation droid, which handle processes that are more on guardrails, are pretty different from something like the code droid, which can take nearly any natural-language request and attempt or complete the coding task associated with it.

So the idea for this talk is to start by describing what we think of as an agentic system. Agentic systems have a lot of different interpretations and definitions; we think there are three characteristics that are most representative. The first is planning, and you've probably seen a lot of this in this track: the idea that the agentic system can make decisions about one or many future actions it's going to take. The second is decision making, which some people call reasoning. The ability to make a decision, with some criteria and some algorithm behind that decision, is critical for agentic systems to take on more general or broad tasks where the decision space is very wide. And then you have environmental grounding. A lot of systems have planning and decision making, but the existence of an agent within an external environment is critical to understanding some of the unique properties of agentic systems when you're actually implementing them, and being able to read from and write to those environments is a critical part of the process.

Let's talk about planning first. The first idea we encountered is inspired by control systems and robotics: the pseudo-Kalman filter. As you work through a plan, you tend to notice that your agentic system gets led astray and its reasoning changes rapidly as you iterate, especially through long plans. To migrate even a small section of a codebase, you might literally have hundreds of steps in the process. So, inspired by the folks on our team who come from self-driving and robotics backgrounds, the pseudo-Kalman filter is basically passing intermediate reasoning through the execution of the plan steps (you can get fairly sophisticated about how you modify or share that intermediate reasoning), which lets the individual decisions made along the plan slowly converge toward at least consistent reasoning. The core issue is that it also facilitates error propagation: a simple mistake, especially early in the plan, can have strong downstream effects.
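To make that concrete, here is a minimal sketch of the carry-forward idea in Python. This is not Factory's actual implementation; `call_llm` is a hypothetical stand-in for whatever model client you use, and the prompts are illustrative.

```python
# Sketch: carry intermediate reasoning through plan steps so later decisions
# stay consistent with earlier ones (the "pseudo-Kalman filter" idea).
from dataclasses import dataclass, field

@dataclass
class PlanState:
    task: str
    reasoning_summary: str = ""              # running estimate of why we're doing what we're doing
    completed: list[str] = field(default_factory=list)

def execute_plan(task: str, steps: list[str], call_llm) -> PlanState:
    state = PlanState(task=task)
    for step in steps:
        # Every step sees the task, the carried reasoning, and the current step,
        # rather than only the raw (and ever-growing) transcript.
        result = call_llm(
            f"Task: {state.task}\n"
            f"Reasoning so far: {state.reasoning_summary}\n"
            f"Current step: {step}\n"
            "Execute this step and explain your reasoning."
        )
        # "Filter" update: fold this step's reasoning back into the running summary.
        state.reasoning_summary = call_llm(
            "Merge the prior reasoning and this step's reasoning into a short, "
            "consistent summary, flagging any contradictions.\n"
            f"Prior: {state.reasoning_summary}\nNew: {result}"
        )
        state.completed.append(step)
    return state
```

The downside mentioned above shows up directly in this sketch: if an early step's reasoning is wrong, the merge step carries that error forward into every later step, which is why we pair this with the validation ideas that come next.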
Next, subtask decomposition. This is pretty well known; even ReAct and some of the earlier agentic action systems have some form of subtask decomposition. What we found is that experimenting with different forms or structures of subtask decomposition in our planning process has led to some pretty interesting downstream benefits. In particular, when you increase the resolution or fidelity of the subtasks in a given plan, you get more fine-grained control and you can define the action space much more clearly. The risk is that you're introducing a lot of decisions for the LLM to make: the more small, tiny tasks you introduce, the harder it is for the agentic system to decide what's right.

Then you have model predictive control. Again, not a new idea and not invented by anyone here, but the idea is to evaluate the outcomes of your subtasks and your current state, and to enable adaptive replanning based on real-time feedback that occurs during execution of the plan. This matters if you have a rapidly changing environment or rapidly changing information, which for us happens when other humans are actively engaged in a development process or some other workflow while the agent is executing a long-running workflow. You can honestly find a lot of material on these techniques, and you don't have to jump into true Bayesian statistics or predictive modeling; the general ideas of replanning, taking in trajectory information, and keeping it up to date are enough to see some pretty solid quality improvements in your agents.

And finally, explicit plan criteria. You actually see this pretty often in very simple agents, and then people start to remove explicit plan criteria in order to increase generalizability. What we found is that clearly defining successful structures, or at least successful initial states for plans, can have very strong downstream effects. This can be done in a lot of different ways: instruction tuning, few-shot prompt examples, validating your plans with hard-coded logic. Ultimately, all of this is about error reduction and keeping your trajectories successful for as long as possible. It is difficult to build and scale, because what I'm recommending is that you hard-code a lot of logic into your system. Depending on your domain, that may or may not sound appealing, but we're probably not building AGI tomorrow, or next week, or in the next six months, so taking some lessons from the symbolic era of AI is useful if you have a domain challenge you want to solve in the next three months. For us, delivering real value to customers today means a lot of what we do is thinking about things like explicit plan criteria and hard-coded logic. A lot of folks may not be open to admitting that, but I think it's a very important part of the process of building these kinds of agentic systems.
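As a rough illustration of what explicit plan criteria can look like in practice, here is a generic sketch, not our production checks; the specific rules are placeholder assumptions and would be domain-specific in a real system.

```python
# Sketch: validate a generated plan against explicit, hard-coded criteria
# before any step is allowed to execute.

def validate_plan(plan_steps: list[str], max_steps: int = 50) -> list[str]:
    """Return a list of problems; an empty list means the plan passes."""
    problems = []
    if not plan_steps:
        problems.append("plan is empty")
    elif len(plan_steps) > max_steps:
        problems.append(f"plan has {len(plan_steps)} steps, limit is {max_steps}")
    else:
        first, last = plan_steps[0].lower(), plan_steps[-1].lower()
        # Require the plan to start by gathering context and end with verification.
        if not any(w in first for w in ("read", "inspect", "explore")):
            problems.append("first step should gather context about the relevant code")
        if not any(w in last for w in ("test", "verify", "check")):
            problems.append("last step should verify the work (run tests or checks)")
        # Disallow actions the agent is not permitted to take.
        for step in plan_steps:
            if "force push" in step.lower():
                problems.append(f"forbidden action in step: {step!r}")
    return problems

issues = validate_plan(["Inspect the auth module", "Update the token parser", "Run the test suite"])
if issues:
    # In a real system you would send the problems back to the planner and regenerate.
    print("Plan rejected:", issues)
```

Whether you reject and replan or merely warn is a judgment call; the point is simply that a plan never starts executing without passing checks you actually wrote down.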
So now, decision making. This part is honestly also meant to be a bit of a word vomit, because hopefully you're all building agents, you care a lot about this stuff, and it can spark or inspire some ideas in your own systems.

First, consensus mechanisms. There are a lot of these: self-consistency is a very popular one, plus prompt ensembles and cluster sampling. Basically, the more inference you run at runtime, and the more cleverly you can select ideal or optimal samples from those many inferences, the higher the accuracy. It just costs more, and it may introduce longer inference wait times if you're not parallelizing, but we found this very important for getting high-quality decisions that are consistent.

Next, explicit and analogical reasoning. These also go by a lot of different names: chain of thought, checklists, chain of density, analogical prompting. Basically, you want the system to explicitly outline its reasoning or decision-making criteria to reduce the complexity of the decision-making process. For example, say you want to make a decision between left and right: if you create a checklist of what constitutes a reasonable left and what constitutes a reasonable right, your system will do better at choosing between them. This gets harder if you have a very broad domain, or a decision with a huge action space, in which case techniques like chain-of-thought reasoning or chain of density help; honestly, there is a whole galaxy of "tree of mega-brain thought" variants out there, and exploring them is well worth it for improving performance.

Fine-tuning. This is a bit of a cop-out answer, but I actually think it's pretty valuable once you really have data for the specific decisions you want to make. It may just be true that the best thing to do is spend a weekend pulling the latest open-source model, generating a bunch of training data with a high-quality model, validating it with your team members, and fine-tuning a model on it. This is expensive, and it locks in the quality of your system: a lot of the benefit of being able to sample from different models is that every time a new state-of-the-art model comes out, your system gets better, and that's a huge benefit. However, for certain decisions, especially those that are really out of distribution, fine-tuning is a pretty effective way to get a good decision.

And then simulation. Simulating decision making is tricky and definitely domain-specific. If you're working in software development, simulation is luckily almost built into the thought process: executing code and reasoning through code trajectories is very doable, so for us simulation makes up a huge amount of how we think about decision-making processes. You sample multiple decision paths and simulate them, with both real execution and LLM-imagined execution paths. Techniques like Language Agent Tree Search basically say: amongst this tree of simulated decision nodes, let's run a fancy Monte Carlo tree search style algorithm (it's not exactly MCTS, but close in spirit) to decide where to move based on the simulation results.
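Here is a much-simplified sketch of that sample-simulate-pick pattern for code tasks. It is best-of-N with a real execution check, not full Language Agent Tree Search; `propose_patches`, `apply_patch`, and `run_tests` are hypothetical helpers you would supply yourself.

```python
# Sketch: sample several candidate patches, simulate each in a throwaway copy
# of the repo, and keep the one that scores best on the test suite.
import shutil
import tempfile

def score_candidate(repo_dir: str, patch: str, apply_patch, run_tests) -> float:
    with tempfile.TemporaryDirectory() as sandbox:
        shutil.copytree(repo_dir, sandbox, dirs_exist_ok=True)
        apply_patch(sandbox, patch)            # hypothetical helper
        passed, total = run_tests(sandbox)     # hypothetical helper, e.g. wrapping pytest
        return passed / max(total, 1)

def choose_patch(repo_dir: str, task: str, propose_patches, apply_patch, run_tests, n: int = 4) -> str:
    candidates = propose_patches(task, n=n)    # hypothetical: n samples from the model
    scored = [(score_candidate(repo_dir, p, apply_patch, run_tests), p) for p in candidates]
    best_score, best_patch = max(scored)
    return best_patch
```

For the LLM-imagined execution paths mentioned above, you would swap `run_tests` for a model call that predicts the outcome of a candidate; in practice you mix both, since real execution is slower but far more trustworthy.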
You can do all of these things, and ultimately the core goal is to have some evaluation of whether or not your system makes good decisions when you want it to. Take these techniques, apply them, see what's working and what's not, look at the pros and cons, and adapt.

So that's decision making; now we have environmental grounding. The first piece is often called tool use, which is an equally valid way to describe AI-computer interfaces, but the interesting challenge is building them. There are dedicated tools that are very common: you can pull down LangChain and start using a calculator, a sandbox, even Python script execution. I'm pretty sure you can clone an open-source repo today that implements Claude's Artifacts in your own environment; it's awesome how far the open-source community has pushed tool use. But the edge of tool use, where you start moving toward building custom AI-computer interfaces, is when you need workflows or trajectories that don't exist in the known tool set you have today. If you have a calculator, you can definitely run a calculation. But suppose you know that, very consistently, you're going to use the calculator, pass its output to another system, read and parse the logs, and then do some additional transformation on the parsed output. If you're going to do that consistently, expecting the agent to come up with that sequence every single time is probably the wrong mental model for what the agent should be deciding about. Instead, you want to ask: how can we build this as a tool, and build the interface from the LLM to that tool, so those kinds of actions are streamlined? This is honestly really effective, especially in a domain like code where there are tons of dev tools: countless tools people have built to be really good at developing software, which are genuinely effective but have odd interfaces. Maybe it's a CLI, maybe it's command-shift-clicking to go to definition in VS Code. All of these things that exist in the world need a way for an LLM-based system to invoke them and reason about their outputs, so we spend a ton of our time building these AI-computer interfaces.

I alluded to this already, but designing explicit feedback processing is a very critical step in grounding your agent in an external environment. Logs are a great example: your CI/CD probably outputs an enormous amount of data about all the tests that ran, all the debug statements, all kinds of things you don't care about. If you know you're going to need to process that data and use it at some point, it's definitely good to build explicit paths, decisions, or tools into the system that take all of that feedback, process it, and maybe even run a step that is solely about LLM reasoning. The example here: you have your logs, you parse through them, and you say, I know the LLM is going to want the failing tests, so let me extract the failing tests and also provide a brief explainer of what the rest of the logs were doing.
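A rough sketch of that log-processing step, assuming a generic CI log format and the same hypothetical `call_llm` helper as before:

```python
# Sketch: turn a raw CI log into something the LLM can actually use:
# the failing tests verbatim, plus a short summary of everything else.
import re

def process_ci_log(raw_log: str, call_llm) -> str:
    lines = raw_log.splitlines()
    # The failure pattern is illustrative; real CI output formats vary a lot.
    failures = [ln for ln in lines if re.search(r"\b(FAILED|ERROR)\b", ln)]
    rest = [ln for ln in lines if ln not in failures]

    # A separate LLM pass that only summarizes the noise, so failures stay verbatim.
    summary = call_llm(
        "Summarize these CI log lines in 3-4 sentences, noting anything unusual:\n"
        + "\n".join(rest[:500])   # cap how much we send
    )
    return (
        "Failing tests and errors:\n" + "\n".join(failures or ["(none found)"])
        + "\n\nSummary of remaining log output:\n" + summary
    )
```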
That type of feedback processing is pretty critical to making these systems work really well. It also applies to the system itself: I've been talking a lot about external tools and external mechanisms, the LLM or agentic system interacting with something else, but there's also the agentic system interacting with itself. As the agent reflects and reasons about error trajectories, processing that feedback in a meaningful way and passing it back to the LLM is super important, because a lot of the time the LLM is not great at actually criticizing its own actions; it's just good at listening to you tell it to criticize itself. That, we found, is very important.

Then bounded exploration. You definitely want your agents to gather as much context as they can about the problem space, probably at the beginning of the problem, not always, but probably. There's a huge benefit for models that can handle very long context: if you use Gemini 1.5 Pro or Claude 3.5 Sonnet, and honestly a lot of models now have very long context windows, you can keep including information. But at a certain point you need to say, let's jump into the problem. What we found is that the right balance of bounded exploration time is very difficult to know in advance and honestly requires a lot of evaluation. If you can allow your agent to gather this context, have longer exploration phases, collect data, decide which data is and isn't relevant, and begin the problem with the maximum likelihood of success, that is one of the single most important things for a successful trajectory.

And then human guidance. This is also a little bit of a cop-out answer, but it's important to describe: as your agentic system interacts with humans, you want to decide at which points in those interaction patterns to ask the human to intervene or provide guidance. With careful UX and interaction design, this can be extremely effective at taking your systems from 30 or 40 percent reliability to 90 or 100 percent. Ultimately, though, this is balancing autonomy with human oversight, so it's a trade-off you have to make.

Anyway, those are all the things we learned at Factory; there are many more. If you'd like to join us, we are hiring AI engineers, software engineers, go-to-market, everybody, so please send me an email.