back to index

Building Reliable Agentic Systems: Eno Reyes


Chapters

0:00 Intro
1:22 Agentic Systems
2:45 Planning
7:36 Decision Making
11:40 Environmental Grounding

Whisper Transcript | Transcript Only Page

00:00:00.000 | So for this talk basically what we thought was it'd be cool to give kind of practical examples
00:00:20.340 | and lessons of problems and solutions that we identified while building the droids and so for
00:00:28.080 | context about Factory: our mission is to bring autonomy to software engineering, and what
00:00:32.600 | that means concretely we build these products that we call droids they are autonomous systems
00:00:38.160 | that are applied to different stages of the software development life cycle think code
00:00:42.520 | review documentation testing all the way to end-to-end coding tasks like a refactor a migration
00:00:48.640 | feature work and so each of the droids has like separate cognitive architectures which
00:00:54.320 | are mapped to the tasks at hand. in particular, droids like the review droid or the
00:01:03.080 | documentation droid, which handle processes that we think of as more guardrailed, are
00:01:08.320 | pretty different from something like the code droid, which is able to take nearly any natural
00:01:13.340 | language task and attempt or complete the coding task associated with that request. and so
00:01:20.960 | so the idea for this talk is let's just start describing what we think of as an agentic system
00:01:28.600 | this like agentic systems have a lot of different interpretations a lot of different definitions
00:01:34.640 | we think that there's three characteristics which are kind of most representative of an agentic system
00:01:43.780 | the first is planning i think you've probably seen a lot of this in this track the idea that the agentic system
00:01:49.580 | can make decisions about one or many future actions that it's going to take. then decision making:
00:01:57.220 | some people call this reasoning. i think that for a lot of these systems, just the ability to make a decision,
00:02:04.340 | with some criteria and some algorithm associated with making that decision, is kind of critical in
00:02:10.440 | order for agentic systems to take on more general or broad tasks where the decision space is very wide. and then you have environmental grounding
00:02:19.220 | and so you know a lot of systems have planning they have decision making but i think the existence of an agent within an external
00:02:28.860 | environment is very critical to understanding some of the unique properties of agentic systems when you're actually implementing them
00:02:34.860 | so being able to read and write to these environments is a critical part of this process. and so let's talk about planning first
00:02:48.860 | so the first idea that we encountered is inspired by control systems and robotics: the idea
00:02:56.500 | of the pseudo-Kalman filter. as you are working through a plan, what happens is you tend to notice that your agentic systems will be led astray,
00:03:07.500 | their reasoning will change rapidly, and as you iterate through especially long plans, you can imagine that in order to migrate even a small section of a code base you might have
00:03:18.500 | literally hundreds of steps in a process. and so, inspired by the folks on our team who come from backgrounds in self-driving and
00:03:26.140 | robotics, the pseudo-Kalman filter is basically passing intermediate reasoning through the different plan steps. you can get pretty complex about how you modify or share that intermediate reasoning, but as you pass it through the execution of the plan steps it allows the individual decisions that happen along
00:03:48.140 | the plan to slowly converge towards at least consistent reasoning. the core issue with this is it also
00:03:55.780 | facilitates error propagation: a simple mistake, especially early on in the plan, can lead to strong downstream effects
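As a rough illustration (not Factory's actual implementation), here is a minimal sketch of the idea, assuming a hypothetical `call_llm` completion helper: each plan step sees a running summary of prior reasoning, and new reasoning is merged into that summary rather than replacing it, so individual steps converge toward a consistent line of thought.

```python
# Minimal sketch of passing filtered intermediate reasoning across plan steps.
# `call_llm` is a hypothetical stand-in for whatever completion API you use.
def call_llm(prompt: str) -> str:
    return "stub response"  # replace with a real model call


def execute_plan(steps: list[str]) -> list[str]:
    reasoning_state = ""  # running summary, loosely analogous to a filter's state estimate
    outputs = []
    for step in steps:
        # Each step sees the smoothed reasoning so far, not just its own local view.
        result = call_llm(
            f"Prior reasoning (stay consistent with it):\n{reasoning_state}\n\n"
            f"Current step: {step}\nCarry out the step and explain your reasoning."
        )
        outputs.append(result)
        # "Update" phase: merge the new reasoning into the running state instead of
        # overwriting it, so one noisy step can't swing the whole trajectory.
        reasoning_state = call_llm(
            "Merge the new reasoning into the running summary, resolving conflicts conservatively.\n"
            f"Running summary:\n{reasoning_state}\n\nNew reasoning:\n{result}"
        )
    return outputs
```

The error-propagation caveat shows up directly in this shape: if an early merge bakes a mistake into the running summary, every later step inherits it.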
00:04:05.700 | subtask decomposition: this is pretty well known; even ReAct and some of the earlier agentic action systems have some form of subtask decomposition. but what we found
00:04:17.780 | is that experimenting with different forms or structures of subtask decomposition in our planning process has led to a lot of pretty interesting
00:04:27.420 | downstream positive effects. in particular, when you increase the resolution or the fidelity of your subtasks in a given plan, it gives you more fine-grained control and allows you to define the action space a lot more clearly. however, the risk is you're introducing a lot of decisions
00:04:47.420 | for the LLM to make: the more small, tiny tasks you introduce, the harder it is for the agentic system to decide what's right
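A hedged sketch of that trade-off, again assuming the hypothetical `call_llm` helper: the `granularity` knob is illustrative, but it captures the tension between finer subtasks (tighter control over the action space) and more decisions for the model to get wrong.

```python
import json


def call_llm(prompt: str) -> str:
    return '["subtask 1", "subtask 2"]'  # stand-in for a real model call


def decompose(task: str, granularity: str = "coarse") -> list[str]:
    # Finer subtasks = more fine-grained control, but also more decision points.
    guidance = {
        "coarse": "3-5 high-level subtasks",
        "fine": "10-20 small, individually verifiable subtasks",
    }[granularity]
    raw = call_llm(
        f"Break this task into {guidance}. Respond with a JSON list of strings.\nTask: {task}"
    )
    return json.loads(raw)


plan = decompose("Migrate the payments module to the new ORM", granularity="fine")
```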
00:04:57.060 | then you have model predictive control. this is again not a new idea, not invented by anyone here, but the idea is evaluating the outcomes of your subtasks and your current state, and enabling adaptive replanning based on real-time feedback that occurs during the execution
00:05:17.060 | of the plan. if you have a rapidly changing environment or rapidly changing information this can occur; for us it happens especially in situations
00:05:26.700 | where there are other humans actively engaged in a development process or some other workflow while your agent is executing a perhaps long-running workflow. you can honestly find a lot of information about these techniques, and you don't have to necessarily jump into the true
00:05:46.700 | Bayesian statistics, predictive modeling stuff; i think the general ideas of replanning, taking in the trajectory information, and making sure that you keep it up to date
00:05:56.000 | are honestly enough to see some pretty solid quality improvements on your agents
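In code, the general idea (a sketch under the same hypothetical `call_llm` assumption, with `execute_step` and `make_plan` as placeholder helpers) looks something like: execute a step, check the observed outcome against the goal, and regenerate the remaining plan when the trajectory has drifted.

```python
def call_llm(prompt: str) -> str:
    return "YES"  # stand-in for a real model call


def execute_step(step: str) -> str:
    return "observed outcome"  # stand-in: run the step against the real environment


def make_plan(goal: str, context: str) -> list[str]:
    return ["revised step"]  # stand-in: ask the model for a fresh plan from the current state


def run_with_replanning(goal: str, plan: list[str], max_replans: int = 3) -> None:
    replans, i = 0, 0
    while i < len(plan):
        outcome = execute_step(plan[i])
        on_track = call_llm(
            f"Goal: {goal}\nStep just executed: {plan[i]}\nObserved outcome: {outcome}\n"
            "Does this outcome keep us on track? Answer YES or NO."
        ).strip().upper().startswith("YES")
        if on_track:
            i += 1
        elif replans < max_replans:
            # Real-time feedback says we've drifted: replan the remainder instead of pushing ahead.
            plan = plan[: i + 1] + make_plan(goal, context=outcome)
            replans += 1
            i += 1
        else:
            raise RuntimeError("replanning budget exhausted; escalate to a human")
```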
00:06:00.860 | and then finally, explicit plan criteria. i think you actually see this pretty often for very simple agents, and then people start to remove explicit plan criteria in order to increase
00:06:14.040 | generalizability. but what we found is that clearly defining successful structures, or at least successful initial states,
00:06:23.340 | for plans can lead to very strong downstream effects. and this can be done in a lot of different ways:
00:06:30.840 | instruction tuning, few-shot prompt examples, validating your plans with different kinds of hard-coded logic.
00:06:40.140 | ultimately all of this is about error reduction and keeping your trajectories as successful as possible for as long as possible.
00:06:47.140 | but it's difficult to build and scale these, because basically what i'm recommending is that you hard-code a lot of logic into your system
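For example, plan validation with hard-coded logic can be as plain as a handful of domain-specific checks run before any step executes; the specific rules below are made up for illustration, not Factory's actual criteria.

```python
def validate_plan(plan: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the plan passes the explicit criteria."""
    problems = []
    if not plan:
        problems.append("plan is empty")
    if len(plan) > 50:
        problems.append("plan is suspiciously long; decompose the task first")
    if not any("test" in step.lower() for step in plan):
        problems.append("no step runs or updates tests")
    if any("force push" in step.lower() for step in plan):
        problems.append("plan contains a forbidden action")
    return problems


issues = validate_plan(["edit utils.py", "run the unit tests", "open a pull request"])
if issues:
    # reject the plan and ask the model to regenerate, or fall back to few-shot examples
    print("rejecting plan:", issues)
```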
00:06:54.140 | i think this is a general idea that, depending on your domain, may seem appealing or not appealing, but
00:07:03.440 | ultimately we're probably not building AGI tomorrow or next week or in the next six months, and so i think taking some of the lessons from
00:07:10.440 | the symbolic era of AI is useful if you have a domain challenge that you want to solve in the next three months.
00:07:17.740 | for us, we think about delivering real value to customers today, and that means that a lot of what we do is thinking about things like explicit plan criteria and
00:07:26.240 | hard-coded logic. i think a lot of folks may not be open to admitting that, but i think it's a very important part of
00:07:32.540 | the process of building these kinds of agentic systems. so now, talking about decision making, and i know this is also meant to be kind of a word vomit
00:07:42.040 | honestly because i think hopefully you guys are like building agents and you care a lot about this stuff and it can spark or inspire some ideas in your own systems
00:07:50.840 | when you're working on them. so first, for decision making: consensus mechanisms. there are a lot of these; self-consistency is a very popular one,
00:08:01.440 | prompt ensembles, cluster sampling. basically, the more inference you run at runtime,
00:08:08.300 | and the more cleverly you can select ideal or optimal samples from that
00:08:16.140 | large number of inferences, the higher the accuracy. it's just going to cost more, and it may introduce longer
00:08:21.860 | inference wait times if you're not parallelizing, but we found this very important for getting high-quality, consistent decisions.
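A minimal self-consistency sketch (hypothetical `call_llm` again): sample the same decision several times at a nonzero temperature and keep the most common answer. More samples generally means higher accuracy at higher cost and latency, so in practice you would issue the calls in parallel.

```python
from collections import Counter


def call_llm(prompt: str, temperature: float = 0.8) -> str:
    return "left"  # stand-in for a real sampled completion


def decide_by_consensus(prompt: str, n_samples: int = 7) -> str:
    votes = [call_llm(prompt, temperature=0.8).strip().lower() for _ in range(n_samples)]
    answer, _count = Counter(votes).most_common(1)[0]
    return answer
```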
00:08:28.560 | next you have explicit and analogical reasoning. these also have a lot of
00:08:35.420 | different names: chain of thought, checklists, chain of density, analogical prompting. basically, you
00:08:43.420 | want the system to explicitly outline its reasoning or decision making criteria to reduce the complexity
00:08:50.420 | of the decision making process so when you have for example a checklist of things that let's say you want
00:08:56.300 | to make a decision about left or right if you just create a checklist of what constitutes a reasonable
00:09:01.920 | left and what constitutes a reasonable right your system will do better at choosing left or right
00:09:06.700 | obviously, though, this introduces challenges if you have very broad domain stuff, or you have a
00:09:13.280 | decision which has a huge action space, in which case there are techniques like chain-of-thought reasoning,
00:09:18.760 | chain of density, and honestly "galactic tree of mega-brain thought"; there are a lot of
00:09:24.860 | these techniques that exist, and i think exploring them is super worth it for improving performance.
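Sticking with the left-or-right example from a moment ago, here is a sketch of the checklist idea: spell out what a reasonable "left" and a reasonable "right" look like, and have the model walk the checklist before answering. The criteria are placeholders; the point is the structure.

```python
def call_llm(prompt: str) -> str:
    return "LEFT"  # stand-in for a real model call


CHECKLIST_PROMPT = """Decide LEFT or RIGHT for the situation below.

A reasonable LEFT means:
- [ ] criterion A holds (placeholder)
- [ ] criterion B holds (placeholder)

A reasonable RIGHT means:
- [ ] criterion C holds (placeholder)
- [ ] criterion D holds (placeholder)

Walk through each checkbox, then answer with exactly one word: LEFT or RIGHT.

Situation:
{situation}
"""

decision = call_llm(CHECKLIST_PROMPT.format(situation="describe the decision here"))
```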
00:09:30.700 | fine-tuning: this is kind of a cop-out answer, but i actually think it is pretty valuable once you
00:09:36.440 | really get into having data uh for specific decisions that you want to make it may just be true
00:09:43.120 | that the best thing to do is to spend a weekend pulling the latest open source model generate a bunch
00:09:48.760 | of training data with a high-quality model, validate it with a bunch of your team members, and just train
00:09:53.880 | or fine-tune a model. this is expensive, and it locks in the quality of your system: a lot of the
00:10:00.840 | benefit of relying on being able to sample from different models is that every time a new
00:10:06.420 | state-of-the-art model comes out, your system gets better. i think that's a huge benefit. however, for
00:10:11.340 | certain decisions, especially those that are really out of distribution, fine-tuning is a pretty effective way to make a good decision.
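That "weekend fine-tune" workflow might be sketched like this: use a strong model to label the specific decision you care about, have teammates spot-check a sample, and export the pairs as JSONL in the generic prompt/completion shape most fine-tuning stacks accept. `call_llm` is still the hypothetical stand-in for the strong labeling model.

```python
import json


def call_llm(prompt: str) -> str:
    return "APPROVE: placeholder rationale"  # stand-in for the strong labeling model


def build_finetune_dataset(examples: list[str], path: str) -> None:
    with open(path, "w") as f:
        for ex in examples:
            label = call_llm(
                f"Decision task: {ex}\nAnswer with APPROVE or REJECT plus one sentence of rationale."
            )
            # in practice, route a sample of these through human review before training
            f.write(json.dumps({"prompt": ex, "completion": label}) + "\n")
```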
00:10:17.100 | and then simulation. simulation of decision making is super tricky; this
00:10:27.400 | is definitely going to be very domain specific. if you're working with software development, simulation is
00:10:34.040 | luckily kind of built into the thought process here: the ability to execute code, the ability to
00:10:39.340 | reason through code trajectories, is super doable. and so for us, simulation makes up a huge amount of
00:10:46.340 | how we think about decision making processes: sampling multiple decision paths, simulating them
00:10:52.780 | with both real and LLM-imagined execution paths, and techniques like Language Agent Tree Search (LATS),
00:10:59.660 | which basically says, amongst this simulation of decision nodes, let's implement a fancy Monte
00:11:05.960 | Carlo tree search algorithm. it's kind of like Monte Carlo tree search, not
00:11:11.060 | exactly, but fancy algorithms to decide where we want to move based on the simulation results.
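A heavily simplified stand-in for that family of techniques (much simpler than real LATS or Monte Carlo tree search, which maintain a search tree and backpropagate values): sample a few candidate actions, "simulate" each with an LLM-imagined rollout, score the outcomes, and pick the best. `call_llm` remains the hypothetical completion helper.

```python
def call_llm(prompt: str) -> str:
    return "0.5"  # stand-in for a real model call


def _score(text: str) -> float:
    try:
        return float(text.strip())
    except ValueError:
        return 0.0  # be forgiving about malformed model output


def choose_by_simulation(state: str, candidate_actions: list[str]) -> str:
    scored = []
    for action in candidate_actions:
        rollout = call_llm(
            f"State: {state}\nProposed action: {action}\n"
            "Imagine the next few steps and describe the likely outcome."
        )
        score = _score(call_llm(f"Rate how promising this outcome is from 0 to 1:\n{rollout}"))
        scored.append((score, action))
    return max(scored)[1]  # action with the highest simulated score
```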
00:11:16.680 | and so you can do all these things, and ultimately the core goal here is you want to have some
00:11:24.020 | evaluation of whether or not your system makes good decisions when you want it to make good decisions.
00:11:28.360 | you can take all these techniques, apply them, and then see what's working, what's not working, look at
00:11:34.500 | the pros and cons, and then adapt. so that's decision making
00:11:39.500 | now we have environmental grounding. the first one, this is oftentimes called tool use; i think
00:11:49.520 | that's an equivalently valid way to describe AI-computer interfaces, but i think the idea
00:11:55.920 | of building these is kind of the interesting challenge. there are dedicated tools that are very
00:12:01.980 | common: you can pull down LangChain and start using a calculator, a sandbox, even Python script
00:12:08.600 | execution. i'm pretty sure you can clone an open source repo today that implements Claude's Artifacts
00:12:14.440 | in your own environment. it's awesome how far the open source community has pushed tool use,
00:12:20.920 | but i think a lot of the edge of tool use, and where you start to move towards building custom
00:12:27.880 | AI-computer interfaces, is when you need workflows or trajectories which don't exist with the
00:12:35.400 | known tool set that you have today. so if you have a calculator you can definitely run a calculation, but
00:12:41.000 | what if you know that, very consistently, you are going to use a calculator, take the output of that calculator,
00:12:47.400 | maybe pass it to another system, then read and parse the logs, and then take
00:12:53.160 | the output of those parsed logs and do some additional transformation? if you're going to
00:12:58.360 | consistently do that, expecting your agent to just come up with that every single time is
00:13:03.320 | probably not the right mental model for what the agent should be thinking about or making a decision
00:13:08.040 | about. instead, what you want to do is say: how can we build this tool, and then build the interface to the
00:13:15.320 | LLM for this tool, so that you can streamline those types of actions that you want to take.
00:13:21.160 | and so this is honestly really effective, especially in domains like code, where there are tons of dev tools.
00:13:28.760 | i mean, there's an infinite number of tools that people have made to be really good at developing software
00:13:34.920 | that are honestly really effective; they just have kind of weird interfaces. maybe it's a CLI,
00:13:39.320 | maybe it's command-shift-clicking to go to definition in VS Code. all these things that exist in the world
00:13:47.080 | need a way for an LLM-based system to invoke them and reason around the outputs, so we spend a ton of our time
00:13:55.080 | building these AI-computer interfaces
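As a sketch of what "build the tool and then build the interface to the LLM" can look like in practice: wrap the recurring multi-step workflow in one function, return only what the model needs, and describe it with the generic name/description/parameters schema that most function-calling APIs accept. The `make`-based build step and the FAIL-line filter are illustrative stand-ins for your real chain.

```python
import subprocess


def run_build_and_summarize(target: str) -> str:
    """Run the build for `target` and return only the lines the model needs to reason about."""
    # step 1: invoke the underlying dev tool (illustrative command)
    proc = subprocess.run(["make", target], capture_output=True, text=True)
    # step 2: deterministic parsing the agent shouldn't have to re-derive every run
    failures = [line for line in proc.stdout.splitlines() if "FAIL" in line]
    # step 3: hand back a compact, reasoning-friendly result
    return "\n".join(failures) or "build succeeded"


# generic function-calling schema so the LLM can invoke the whole chain as one action
BUILD_TOOL_SPEC = {
    "name": "run_build_and_summarize",
    "description": "Builds the given target and returns only the failing lines.",
    "parameters": {
        "type": "object",
        "properties": {"target": {"type": "string"}},
        "required": ["target"],
    },
}
```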
00:13:57.080 | then, i kind of alluded to this, but
00:14:02.680 | designing explicit feedback processing is, i think, a very critical step of
00:14:09.000 | grounding your agent in an external environment. in particular, logs are
00:14:17.320 | a great example of this: your CI/CD probably outputs an enormous amount of data
00:14:24.760 | about you know all the tests that ran all the debug statements all this kind of garbage that you don't care
00:14:30.680 | about if you know that you're going to need to process that data and use it at some point it's definitely
00:14:36.840 | good to build into the system explicit kind of paths or decisions or tools that take all of this feedback
00:14:44.600 | process it, and then maybe even do a step which is solely about LLM reasoning. the example here
00:14:52.120 | is you have your logs, you parse through them, and then you say, well, i know the LLM is going
00:14:59.800 | to want the failing tests, so let me get the failing tests and also provide a brief explainer of
00:15:07.080 | what the rest of the logs were doing. that type of feedback processing is pretty critical
00:15:12.600 | to making these systems work really well. i'm talking a lot about external tools and
00:15:19.880 | external mechanisms, the LLM or agentic system interacting with something else, but there's also the LLM or
00:15:26.440 | agentic system interacting with itself. as the agent reflects and reasons about error
00:15:33.720 | trajectories, processing that feedback in a meaningful way and passing it back to the LLM is super important,
00:15:40.200 | because a lot of the time the LLM is not great at actually criticizing its own actions; it's just good at
00:15:45.640 | listening to you tell it to criticize itself. and so that we found to be very important
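The CI-log example might be sketched like this: pull the failing tests out deterministically, ask the model for a short explainer of the rest, and hand the agent that compact summary instead of the raw log. The FAILED/ERROR filter is a naive illustration you would tune to your CI format; `call_llm` is the same hypothetical helper as before.

```python
def call_llm(prompt: str) -> str:
    return "short summary of what the CI run was doing"  # stand-in for a real model call


def process_ci_feedback(raw_log: str) -> str:
    lines = raw_log.splitlines()
    failing = [line for line in lines if "FAILED" in line or "ERROR" in line]  # naive filter; adjust to your CI
    summary = call_llm(
        "In two sentences, summarize what this CI log was doing, ignoring the failures "
        "listed separately:\n" + "\n".join(lines[:200])
    )
    return "failing tests:\n" + "\n".join(failing) + "\n\ncontext:\n" + summary
```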
00:15:52.920 | uh bounded exploration so you definitely want your agents to be able to gather as much context as they
00:16:00.440 | can about the problem space as you introduce additional ways of gathering this context probably at the
00:16:08.120 | beginning of the problem not always but probably there is a huge benefit for models that can handle this
00:16:17.080 | like very long context models. if you use Gemini 1.5 Pro or Sonnet 3.5, honestly, there are a lot of
00:16:24.360 | models now that have very long context windows; you can continue to include information, but at a
00:16:30.920 | certain point you need to kind of say let's jump into the problem and what we found is that finding the
00:16:36.440 | right balance of bounded exploration time is very difficult to actually know in advance and honestly
00:16:44.360 | requires a lot of evaluation. and so if you can allow your agent to gather this context, have longer
00:16:50.600 | exploration phases, collect data, decide which data is or isn't relevant, and begin the problem with
00:16:57.880 | the maximum likelihood of success, then that is one of the single most important things to having a successful
00:17:04.040 | trajectory.
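One way to express that bound in code, as a sketch: let the agent keep gathering context until it either says it has enough or hits an explicit budget on tool calls and collected characters, then force it to start the task. `call_llm` and `run_tool` are hypothetical helpers, and, as noted above, the right budget really has to come from evaluation.

```python
def call_llm(prompt: str) -> str:
    return "DONE"  # stand-in for a real model call


def run_tool(query: str) -> str:
    return "retrieved context"  # stand-in for a search / file-read / code-navigation tool


def gather_context(task: str, max_calls: int = 10, max_chars: int = 60_000) -> list[str]:
    context: list[str] = []
    for _ in range(max_calls):  # hard cap on exploration steps
        if sum(len(c) for c in context) > max_chars:  # hard cap on collected context
            break
        next_query = call_llm(
            f"Task: {task}\nWe have gathered {len(context)} pieces of context so far.\n"
            "What should we look up next? Answer DONE if we have enough to start."
        )
        if next_query.strip().upper() == "DONE":
            break
        context.append(run_tool(next_query))
    return context
```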
00:17:08.200 | and then human guidance. this is also a little bit of a cop-out answer, but i think it's
00:17:16.360 | important to describe. as you have your agentic system interact with humans, you want to decide
00:17:23.720 | when in those interaction patterns you want to ask the human to intervene, or when you want them to provide
00:17:29.880 | guidance. with careful UX design and interaction design, i think that this can be
00:17:38.360 | extremely effective at allowing your systems to go from 30 or 40 percent reliability to 90 or 100 percent reliability,
00:17:45.720 | but ultimately this is balancing autonomy with human oversight, and so it's a trade-off that you have to make.
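One simple shape for that trade-off, sketched with illustrative thresholds and a plain CLI prompt standing in for whatever approval UX you actually build: gate risky or low-confidence actions behind an explicit human approval step and let everything else run autonomously.

```python
RISKY_KEYWORDS = ("delete", "drop table", "force push", "deploy")  # illustrative list


def approve_action(action: str, confidence: float, threshold: float = 0.7) -> bool:
    """Return True if the agent may proceed; ask a human when the action is risky or uncertain."""
    risky = any(keyword in action.lower() for keyword in RISKY_KEYWORDS)
    if risky or confidence < threshold:
        answer = input(f"agent wants to: {action!r} (confidence {confidence:.2f}). approve? [y/N] ")
        return answer.strip().lower() == "y"
    return True  # proceed autonomously
```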
00:17:50.920 | anyway, those are all the things that we learned at Factory. there are many more, and if you'd like
00:17:56.920 | to join us we are hiring: AI engineers, software engineers, go-to-market, everybody. so please send me an email