back to index

Building Reliable Agentic Systems: Eno Reyes


Chapters

0:00 Intro
1:22 Agentic Systems
2:45 Planning
7:36 Decision Making
11:40 Environmental Grounding

Whisper Transcript | Transcript Only Page

00:00:00.000 | So for this talk basically what we thought was it'd be cool to give kind of practical examples
00:00:20.340 | and lessons of problems and solutions that we identified while building the droids and so for
00:00:28.080 | context about Factory: our mission is to bring autonomy to software engineering, and what
00:00:32.600 | that means concretely we build these products that we call droids they are autonomous systems
00:00:38.160 | that are applied to different stages of the software development life cycle think code
00:00:42.520 | review documentation testing all the way to end-to-end coding tasks like a refactor a migration
00:00:48.640 | feature work and so each of the droids has like separate cognitive architectures which
00:00:54.320 | are mapped to the tasks at hand. in particular, droids like the review droid or the
00:01:03.080 | documentation droid, which handle processes that we think of as more guardrailed, are
00:01:08.320 | pretty different from something like the code droid, which is able to take nearly any natural
00:01:13.340 | language task and attempt or complete the coding task associated with that request. and so
00:01:20.960 | so the idea for this talk is let's just start describing what we think of as an agentic system
00:01:28.600 | this like agentic systems have a lot of different interpretations a lot of different definitions
00:01:34.640 | we think that there's three characteristics which are kind of most representative of an agentic system
00:01:43.780 | the first is planning i think you've probably seen a lot of this in this track the idea that the agentic system
00:01:49.580 | can make decisions about one or many future actions that it's going to take. then decision making:
00:01:57.220 | some people call this reasoning. i think that for a lot of these systems, just the ability to make a decision,
00:02:04.340 | with some criteria and some algorithm associated with making that decision, is kind of critical in
00:02:10.440 | order for agentic systems to take on more general or broad tasks where the decision space is very wide. and then you have environmental grounding
00:02:19.220 | and so you know a lot of systems have planning they have decision making but i think the existence of an agent within an external
00:02:28.860 | environment is very critical to understanding some of the unique properties of agentic systems when you're actually implementing them
00:02:34.860 | so being able to read and write to these environments is a critical part of this process. and so let's talk about planning first
00:02:48.860 | so the first idea that we encountered is inspired by control systems and robotics: the idea
00:02:56.500 | of the pseudo-Kalman filter. as you are working through a plan, what happens is you tend to notice that your agentic systems will be led astray,
00:03:07.500 | their reasoning will change rapidly, and as you iterate through especially long plans, you can imagine that in order to migrate even a small section of a code base you might have
00:03:18.500 | literally hundreds of steps in a process. and so, inspired by the folks on our team who come from backgrounds in self-driving and
00:03:26.140 | robotics, the pseudo-Kalman filter is basically passing intermediate reasoning through the different plan steps. you can get pretty complex about how you modify or share that intermediate reasoning, but as you pass it through the execution of the plan steps it allows the individual decisions that happen along
00:03:48.140 | the plan to slowly converge towards at least consistent reasoning. the core issue with this is it also
00:03:55.780 | facilitates error propagation: a simple mistake, especially early on in the plan, can lead to strong downstream effects
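As a rough illustration (not Factory's actual implementation), here is a minimal sketch of the idea, assuming a hypothetical `call_llm` completion helper: each plan step sees a running summary of prior reasoning, and new reasoning is merged into that summary rather than replacing it, so individual steps converge toward a consistent line of thought.

```python
# Minimal sketch of passing filtered intermediate reasoning across plan steps.
# `call_llm` is a hypothetical stand-in for whatever completion API you use.
def call_llm(prompt: str) -> str:
    return "stub response"  # replace with a real model call


def execute_plan(steps: list[str]) -> list[str]:
    reasoning_state = ""  # running summary, loosely analogous to a filter's state estimate
    outputs = []
    for step in steps:
        # Each step sees the smoothed reasoning so far, not just its own local view.
        result = call_llm(
            f"Prior reasoning (stay consistent with it):\n{reasoning_state}\n\n"
            f"Current step: {step}\nCarry out the step and explain your reasoning."
        )
        outputs.append(result)
        # "Update" phase: merge the new reasoning into the running state instead of
        # overwriting it, so one noisy step can't swing the whole trajectory.
        reasoning_state = call_llm(
            "Merge the new reasoning into the running summary, resolving conflicts conservatively.\n"
            f"Running summary:\n{reasoning_state}\n\nNew reasoning:\n{result}"
        )
    return outputs
```

The error-propagation caveat shows up directly in this shape: if an early merge bakes a mistake into the running summary, every later step inherits it.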
00:04:05.700 | subtask decomposition: this is pretty well known; even ReAct and some of the earlier agentic action systems have some form of subtask decomposition. but what we found
00:04:17.780 | is that experimenting with different forms or structures of subtask decomposition in our planning process has led to a lot of pretty interesting
00:04:27.420 | downstream positive effects. in particular, when you increase the resolution or the fidelity of your subtasks in a given plan, it gives you more fine-grained control and allows you to define the action space a lot more clearly. however, the risk is you're introducing a lot of decisions
00:04:47.420 | for the LLM to make: the more small, tiny tasks you introduce, the harder it is for the agentic system to decide what's right
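A hedged sketch of that trade-off, again assuming the hypothetical `call_llm` helper: the `granularity` knob is illustrative, but it captures the tension between finer subtasks (tighter control over the action space) and more decisions for the model to get wrong.

```python
import json


def call_llm(prompt: str) -> str:
    return '["subtask 1", "subtask 2"]'  # stand-in for a real model call


def decompose(task: str, granularity: str = "coarse") -> list[str]:
    # Finer subtasks = more fine-grained control, but also more decision points.
    guidance = {
        "coarse": "3-5 high-level subtasks",
        "fine": "10-20 small, individually verifiable subtasks",
    }[granularity]
    raw = call_llm(
        f"Break this task into {guidance}. Respond with a JSON list of strings.\nTask: {task}"
    )
    return json.loads(raw)


plan = decompose("Migrate the payments module to the new ORM", granularity="fine")
```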
00:04:57.060 | then you have model predictive control. this is again not a new idea, not invented by anyone here, but the idea is evaluating the outcomes of your subtasks and your current state, and enabling adaptive replanning based on real-time feedback that occurs during the execution
00:05:17.060 | of the plan. if you have a rapidly changing environment or rapidly changing information this can occur; for us it happens especially in situations
00:05:26.700 | where there are other humans actively engaged in a development process or some other workflow while your agent is executing a perhaps long-running workflow. you can honestly find a lot of information about these techniques, and you don't have to necessarily jump into the true
00:05:46.700 | Bayesian statistics, predictive modeling stuff; i think the general ideas of replanning, taking in the trajectory information, and making sure that you keep it up to date
00:05:56.000 | are honestly enough to see some pretty solid quality improvements on your agents
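In code, the general idea (a sketch under the same hypothetical `call_llm` assumption, with `execute_step` and `make_plan` as placeholder helpers) looks something like: execute a step, check the observed outcome against the goal, and regenerate the remaining plan when the trajectory has drifted.

```python
def call_llm(prompt: str) -> str:
    return "YES"  # stand-in for a real model call


def execute_step(step: str) -> str:
    return "observed outcome"  # stand-in: run the step against the real environment


def make_plan(goal: str, context: str) -> list[str]:
    return ["revised step"]  # stand-in: ask the model for a fresh plan from the current state


def run_with_replanning(goal: str, plan: list[str], max_replans: int = 3) -> None:
    replans, i = 0, 0
    while i < len(plan):
        outcome = execute_step(plan[i])
        on_track = call_llm(
            f"Goal: {goal}\nStep just executed: {plan[i]}\nObserved outcome: {outcome}\n"
            "Does this outcome keep us on track? Answer YES or NO."
        ).strip().upper().startswith("YES")
        if on_track:
            i += 1
        elif replans < max_replans:
            # Real-time feedback says we've drifted: replan the remainder instead of pushing ahead.
            plan = plan[: i + 1] + make_plan(goal, context=outcome)
            replans += 1
            i += 1
        else:
            raise RuntimeError("replanning budget exhausted; escalate to a human")
```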
00:06:00.860 | and then finally, explicit plan criteria. i think you actually see this pretty often for very simple agents, and then people start to remove explicit plan criteria in order to increase
00:06:14.040 | generalizability. but what we found is that clearly defining successful structures, or at least successful initial states,
00:06:23.340 | for plans can lead to very strong downstream effects. and this can be done in a lot of different ways:
00:06:30.840 | instruction tuning, few-shot prompt examples, validating your plans with different kinds of hard-coded logic.
00:06:40.140 | ultimately all of this is about error reduction and keeping your trajectories as successful as possible for as long as possible.
00:06:47.140 | but it's difficult to build and scale these, because basically what i'm recommending is that you hard-code a lot of logic into your system
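For example, plan validation with hard-coded logic can be as plain as a handful of domain-specific checks run before any step executes; the specific rules below are made up for illustration, not Factory's actual criteria.

```python
def validate_plan(plan: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the plan passes the explicit criteria."""
    problems = []
    if not plan:
        problems.append("plan is empty")
    if len(plan) > 50:
        problems.append("plan is suspiciously long; decompose the task first")
    if not any("test" in step.lower() for step in plan):
        problems.append("no step runs or updates tests")
    if any("force push" in step.lower() for step in plan):
        problems.append("plan contains a forbidden action")
    return problems


issues = validate_plan(["edit utils.py", "run the unit tests", "open a pull request"])
if issues:
    # reject the plan and ask the model to regenerate, or fall back to few-shot examples
    print("rejecting plan:", issues)
```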
00:06:54.140 | i think this is a general idea that, depending on your domain, may seem appealing or not appealing, but
00:07:03.440 | ultimately we're probably not building AGI tomorrow or next week or in the next six months, and so i think taking some of the lessons from
00:07:10.440 | the symbolic era of AI is useful if you have a domain challenge that you want to solve in the next three months.
00:07:17.740 | for us, we think about delivering real value to customers today, and that means that a lot of what we do is thinking about things like explicit plan criteria and
00:07:26.240 | hard-coded logic. i think a lot of folks may not be open to admitting that, but i think it's a very important part of
00:07:32.540 | the process of building these kinds of agentic systems. so now, talking about decision making, and i know this is also meant to be kind of a word vomit
00:07:42.040 | honestly because i think hopefully you guys are like building agents and you care a lot about this stuff and it can spark or inspire some ideas in your own systems
00:07:50.840 | when you're working on them. so first, for decision making: consensus mechanisms. there are a lot of these; self-consistency is a very popular one,
00:08:01.440 | prompt ensembles, cluster sampling. basically, the more inference you run at runtime,
00:08:08.300 | and the more cleverly you can select ideal or optimal samples from that
00:08:16.140 | large number of inferences, the higher the accuracy. it's just going to cost more, and it may introduce longer
00:08:21.860 | inference wait times if you're not parallelizing, but we found this very important for getting high-quality, consistent decisions.
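A minimal self-consistency sketch (hypothetical `call_llm` again): sample the same decision several times at a nonzero temperature and keep the most common answer. More samples generally means higher accuracy at higher cost and latency, so in practice you would issue the calls in parallel.

```python
from collections import Counter


def call_llm(prompt: str, temperature: float = 0.8) -> str:
    return "left"  # stand-in for a real sampled completion


def decide_by_consensus(prompt: str, n_samples: int = 7) -> str:
    votes = [call_llm(prompt, temperature=0.8).strip().lower() for _ in range(n_samples)]
    answer, _count = Counter(votes).most_common(1)[0]
    return answer
```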
00:08:28.560 | next you have explicit and analogical reasoning. these also have a lot of
00:08:35.420 | different names: chain of thought, checklists, chain of density, analogical prompting. basically, you
00:08:43.420 | want the system to explicitly outline its reasoning or decision making criteria to reduce the complexity
00:08:50.420 | of the decision making process so when you have for example a checklist of things that let's say you want
00:08:56.300 | to make a decision about left or right if you just create a checklist of what constitutes a reasonable
00:09:01.920 | left and what constitutes a reasonable right your system will do better at choosing left or right
00:09:06.700 | obviously, though, this introduces challenges if you have very broad domain stuff, or you have a
00:09:13.280 | decision which has a huge action space, in which case there are techniques like chain-of-thought reasoning,
00:09:18.760 | chain of density, and honestly "galactic tree of mega-brain thought"; there are a lot of
00:09:24.860 | these techniques that exist, and i think exploring them is super worth it for improving performance.
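Sticking with the left-or-right example from a moment ago, here is a sketch of the checklist idea: spell out what a reasonable "left" and a reasonable "right" look like, and have the model walk the checklist before answering. The criteria are placeholders; the point is the structure.

```python
def call_llm(prompt: str) -> str:
    return "LEFT"  # stand-in for a real model call


CHECKLIST_PROMPT = """Decide LEFT or RIGHT for the situation below.

A reasonable LEFT means:
- [ ] criterion A holds (placeholder)
- [ ] criterion B holds (placeholder)

A reasonable RIGHT means:
- [ ] criterion C holds (placeholder)
- [ ] criterion D holds (placeholder)

Walk through each checkbox, then answer with exactly one word: LEFT or RIGHT.

Situation:
{situation}
"""

decision = call_llm(CHECKLIST_PROMPT.format(situation="describe the decision here"))
```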
00:09:30.700 | fine-tuning: this is kind of a cop-out answer, but i actually think it is pretty valuable once you
00:09:36.440 | really get into having data uh for specific decisions that you want to make it may just be true
00:09:43.120 | that the best thing to do is to spend a weekend pulling the latest open source model generate a bunch
00:09:48.760 | of training data with a high-quality model, validate it with a bunch of your team members, and just train
00:09:53.880 | or fine-tune a model. this is expensive, and it locks in the quality of your system: a lot of the
00:10:00.840 | benefit of relying on being able to sample from different models is that every time a new
00:10:06.420 | state-of-the-art model comes out, your system gets better. i think that's a huge benefit. however, for
00:10:11.340 | certain decisions, especially those that are really out of distribution, fine-tuning is a pretty effective way to make a good decision.
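That "weekend fine-tune" workflow might be sketched like this: use a strong model to label the specific decision you care about, have teammates spot-check a sample, and export the pairs as JSONL in the generic prompt/completion shape most fine-tuning stacks accept. `call_llm` is still the hypothetical stand-in for the strong labeling model.

```python
import json


def call_llm(prompt: str) -> str:
    return "APPROVE: placeholder rationale"  # stand-in for the strong labeling model


def build_finetune_dataset(examples: list[str], path: str) -> None:
    with open(path, "w") as f:
        for ex in examples:
            label = call_llm(
                f"Decision task: {ex}\nAnswer with APPROVE or REJECT plus one sentence of rationale."
            )
            # in practice, route a sample of these through human review before training
            f.write(json.dumps({"prompt": ex, "completion": label}) + "\n")
```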
00:10:17.100 | and then simulation. simulation of decision making is super tricky; this
00:10:27.400 | is definitely going to be very domain specific. if you're working with software development, simulation is
00:10:34.040 | luckily kind of built into the thought process here: the ability to execute code, the ability to
00:10:39.340 | reason through code trajectories, is super doable. and so for us, simulation makes up a huge amount of
00:10:46.340 | how we think about decision making processes: sampling multiple decision paths, simulating them
00:10:52.780 | with both real and LLM-imagined execution paths, and techniques like Language Agent Tree Search (LATS),
00:10:59.660 | which basically says, amongst this simulation of decision nodes, let's implement a fancy Monte
00:11:05.960 | Carlo tree search algorithm. it's kind of like Monte Carlo tree search, not
00:11:11.060 | exactly, but fancy algorithms to decide where we want to move based on the simulation results.
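A heavily simplified stand-in for that family of techniques (much simpler than real LATS or Monte Carlo tree search, which maintain a search tree and backpropagate values): sample a few candidate actions, "simulate" each with an LLM-imagined rollout, score the outcomes, and pick the best. `call_llm` remains the hypothetical completion helper.

```python
def call_llm(prompt: str) -> str:
    return "0.5"  # stand-in for a real model call


def _score(text: str) -> float:
    try:
        return float(text.strip())
    except ValueError:
        return 0.0  # be forgiving about malformed model output


def choose_by_simulation(state: str, candidate_actions: list[str]) -> str:
    scored = []
    for action in candidate_actions:
        rollout = call_llm(
            f"State: {state}\nProposed action: {action}\n"
            "Imagine the next few steps and describe the likely outcome."
        )
        score = _score(call_llm(f"Rate how promising this outcome is from 0 to 1:\n{rollout}"))
        scored.append((score, action))
    return max(scored)[1]  # action with the highest simulated score
```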
00:11:16.680 | and so you can do all these things, and ultimately the core goal here is you want to have some
00:11:24.020 | evaluation of whether or not your system makes good decisions when you want it to make good decisions.
00:11:28.360 | you can take all these techniques, apply them, and then see what's working, what's not working, look at
00:11:34.500 | the pros and cons, and then adapt. so that's decision making
00:11:39.500 | now we have environmental grounding. the first one, this is oftentimes called tool use; i think
00:11:49.520 | that's an equivalently valid way to describe AI-computer interfaces, but i think the idea
00:11:55.920 | of building these is kind of the interesting challenge. there are dedicated tools that are very
00:12:01.980 | common: you can pull down LangChain and start using a calculator, a sandbox, even Python script
00:12:08.600 | execution. i'm pretty sure you can clone an open source repo today that implements Claude's Artifacts
00:12:14.440 | in your own environment. it's awesome how far the open source community has pushed tool use,
00:12:20.920 | but i think a lot of the edge of tool use, and where you start to move towards building custom
00:12:27.880 | AI-computer interfaces, is when you need workflows or trajectories which don't exist with the
00:12:35.400 | known tool set that you have today. so if you have a calculator you can definitely run a calculation, but
00:12:41.000 | what if you know that, very consistently, you are going to use a calculator, take the output of that calculator,
00:12:47.400 | maybe pass it to another system, then read and parse the logs, and then take
00:12:53.160 | the output of those parsed logs and do some additional transformation? if you're going to
00:12:58.360 | consistently do that, expecting your agent to just come up with that every single time is
00:13:03.320 | probably not the right mental model for what the agent should be thinking about or making a decision
00:13:08.040 | about. instead, what you want to do is say: how can we build this tool, and then build the interface to the
00:13:15.320 | LLM for this tool, so that you can streamline those types of actions that you want to take.
00:13:21.160 | and so this is honestly really effective, especially in domains like code, where there are tons of dev tools.
00:13:28.760 | i mean, there's an infinite number of tools that people have made to be really good at developing software
00:13:34.920 | that are honestly really effective; they just have kind of weird interfaces. maybe it's a CLI,
00:13:39.320 | maybe it's command-shift-clicking to go to definition in VS Code. all these things that exist in the world
00:13:47.080 | need a way for an LLM-based system to invoke them and reason around the outputs, so we spend a ton of our time
00:13:55.080 | building these AI-computer interfaces
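As a sketch of what "build the tool and then build the interface to the LLM" can look like in practice: wrap the recurring multi-step workflow in one function, return only what the model needs, and describe it with the generic name/description/parameters schema that most function-calling APIs accept. The `make`-based build step and the FAIL-line filter are illustrative stand-ins for your real chain.

```python
import subprocess


def run_build_and_summarize(target: str) -> str:
    """Run the build for `target` and return only the lines the model needs to reason about."""
    # step 1: invoke the underlying dev tool (illustrative command)
    proc = subprocess.run(["make", target], capture_output=True, text=True)
    # step 2: deterministic parsing the agent shouldn't have to re-derive every run
    failures = [line for line in proc.stdout.splitlines() if "FAIL" in line]
    # step 3: hand back a compact, reasoning-friendly result
    return "\n".join(failures) or "build succeeded"


# generic function-calling schema so the LLM can invoke the whole chain as one action
BUILD_TOOL_SPEC = {
    "name": "run_build_and_summarize",
    "description": "Builds the given target and returns only the failing lines.",
    "parameters": {
        "type": "object",
        "properties": {"target": {"type": "string"}},
        "required": ["target"],
    },
}
```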
00:13:57.080 | then, i kind of alluded to this, but
00:14:02.680 | designing explicit feedback processing is, i think, a very critical step of
00:14:09.000 | grounding your agent in an external environment. in particular, logs are
00:14:17.320 | a great example of this: your CI/CD probably outputs an enormous amount of data
00:14:24.760 | about you know all the tests that ran all the debug statements all this kind of garbage that you don't care
00:14:30.680 | about if you know that you're going to need to process that data and use it at some point it's definitely
00:14:36.840 | good to build into the system explicit kind of paths or decisions or tools that take all of this feedback
00:14:44.600 | process it, and then maybe even do a step which is solely about LLM reasoning. the example here
00:14:52.120 | is you have your logs, you parse through them, and then you say, well, i know the LLM is going
00:14:59.800 | to want the failing tests, so let me get the failing tests and also provide a brief explainer of
00:15:07.080 | what the rest of the logs were doing. that type of feedback processing is pretty critical
00:15:12.600 | to making these systems work really well. i'm talking a lot about external tools and
00:15:19.880 | external mechanisms, the LLM or agentic system interacting with something else, but there's also the LLM or
00:15:26.440 | agentic system interacting with itself. as the agent reflects and reasons about error
00:15:33.720 | trajectories, processing that feedback in a meaningful way and passing it back to the LLM is super important,
00:15:40.200 | because a lot of the time the LLM is not great at actually criticizing its own actions; it's just good at
00:15:45.640 | listening to you tell it to criticize itself. and so that we found to be very important
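The CI-log example might be sketched like this: pull the failing tests out deterministically, ask the model for a short explainer of the rest, and hand the agent that compact summary instead of the raw log. The FAILED/ERROR filter is a naive illustration you would tune to your CI format; `call_llm` is the same hypothetical helper as before.

```python
def call_llm(prompt: str) -> str:
    return "short summary of what the CI run was doing"  # stand-in for a real model call


def process_ci_feedback(raw_log: str) -> str:
    lines = raw_log.splitlines()
    failing = [line for line in lines if "FAILED" in line or "ERROR" in line]  # naive filter; adjust to your CI
    summary = call_llm(
        "In two sentences, summarize what this CI log was doing, ignoring the failures "
        "listed separately:\n" + "\n".join(lines[:200])
    )
    return "failing tests:\n" + "\n".join(failing) + "\n\ncontext:\n" + summary
```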
00:15:52.920 | uh bounded exploration so you definitely want your agents to be able to gather as much context as they
00:16:00.440 | can about the problem space as you introduce additional ways of gathering this context probably at the
00:16:08.120 | beginning of the problem not always but probably there is a huge benefit for models that can handle this
00:16:17.080 | like very long context models. if you use Gemini 1.5 Pro or Sonnet 3.5, honestly, there are a lot of
00:16:24.360 | models now that have very long context windows; you can continue to include information, but at a
00:16:30.920 | certain point you need to kind of say let's jump into the problem and what we found is that finding the
00:16:36.440 | right balance of bounded exploration time is very difficult to actually know in advance and honestly
00:16:44.360 | requires a lot of evaluation. and so if you can allow your agent to gather this context, have longer
00:16:50.600 | exploration phases, collect data, decide which data is or isn't relevant, and begin the problem with
00:16:57.880 | the maximum likelihood of success, then that is one of the single most important things to having a successful
00:17:04.040 | trajectory.
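One way to express that bound in code, as a sketch: let the agent keep gathering context until it either says it has enough or hits an explicit budget on tool calls and collected characters, then force it to start the task. `call_llm` and `run_tool` are hypothetical helpers, and, as noted above, the right budget really has to come from evaluation.

```python
def call_llm(prompt: str) -> str:
    return "DONE"  # stand-in for a real model call


def run_tool(query: str) -> str:
    return "retrieved context"  # stand-in for a search / file-read / code-navigation tool


def gather_context(task: str, max_calls: int = 10, max_chars: int = 60_000) -> list[str]:
    context: list[str] = []
    for _ in range(max_calls):  # hard cap on exploration steps
        if sum(len(c) for c in context) > max_chars:  # hard cap on collected context
            break
        next_query = call_llm(
            f"Task: {task}\nWe have gathered {len(context)} pieces of context so far.\n"
            "What should we look up next? Answer DONE if we have enough to start."
        )
        if next_query.strip().upper() == "DONE":
            break
        context.append(run_tool(next_query))
    return context
```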
00:17:08.200 | and then human guidance. this is also a little bit of a cop-out answer, but i think it's
00:17:16.360 | important to describe. as you have your agentic system interact with humans, you want to decide
00:17:23.720 | when in those interaction patterns you want to ask the human to intervene, or when you want them to provide
00:17:29.880 | guidance. with careful UX design and interaction design, i think that this can be
00:17:38.360 | extremely effective at allowing your systems to go from 30 or 40 percent reliability to 90 or 100 percent reliability,
00:17:45.720 | but ultimately this is balancing autonomy with human oversight, and so it's a trade-off that you have to make.
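One simple shape for that trade-off, sketched with illustrative thresholds and a plain CLI prompt standing in for whatever approval UX you actually build: gate risky or low-confidence actions behind an explicit human approval step and let everything else run autonomously.

```python
RISKY_KEYWORDS = ("delete", "drop table", "force push", "deploy")  # illustrative list


def approve_action(action: str, confidence: float, threshold: float = 0.7) -> bool:
    """Return True if the agent may proceed; ask a human when the action is risky or uncertain."""
    risky = any(keyword in action.lower() for keyword in RISKY_KEYWORDS)
    if risky or confidence < threshold:
        answer = input(f"agent wants to: {action!r} (confidence {confidence:.2f}). approve? [y/N] ")
        return answer.strip().lower() == "y"
    return True  # proceed autonomously
```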
00:17:50.920 | anyway, those are all the things that we learned at Factory. there are many more, and if you'd like
00:17:56.920 | to join us we are hiring: AI engineers, software engineers, go-to-market, everybody. so please send me an email