Building Reliable Agentic Systems: Eno Reyes

Chapters
0:00 Intro
1:22 Agentic Systems
2:45 Planning
7:36 Decision Making
11:40 Environmental Grounding
For this talk, we thought it would be useful to share practical examples and lessons from the problems and solutions we identified while building the droids. For context, Factory's mission is to bring autonomy to software engineering. Concretely, we build products we call droids: autonomous systems applied to different stages of the software development life cycle, from code review, documentation, and testing all the way to end-to-end coding tasks like refactors, migrations, and feature work. Each droid has a separate cognitive architecture mapped to its task. In particular, droids like the review droid or the documentation droid, which handle processes that run more on guardrails, are quite different from the code droid, which can take nearly any natural-language request and attempt or complete the associated coding task.
The idea for this talk is to start by describing what we mean by an agentic system. Agentic systems have many interpretations and definitions; we think three characteristics are most representative. The first is planning: you have probably seen a lot of this in this track, the idea that the agentic system can make decisions about one or many future actions it is going to take. The second is decision making, which some people call reasoning: the ability to make a decision against some criteria and some algorithm is critical if agentic systems are to take on broader tasks where the decision space is very wide. The third is environmental grounding. A lot of systems have planning and decision making, but the existence of an agent within an external environment is critical to understanding some of the unique properties of agentic systems when you actually implement them; being able to read from and write to those environments is a key part of the process. So let's talk about planning first.
The first idea we encountered is inspired by control systems and robotics: the pseudo-Kalman filter. As you work through a plan, you tend to notice that your agentic systems get led astray; their reasoning changes rapidly as you iterate through long plans, and even migrating a small section of a code base can involve literally hundreds of steps. Inspired by the folks on our team with backgrounds in self-driving and robotics, the pseudo-Kalman filter is basically passing intermediate reasoning through the execution of the plan steps. You can get fairly sophisticated about how you modify or share that intermediate reasoning between steps, but passing it along lets the individual decisions made along the plan slowly converge toward consistent reasoning. The core issue is that it also facilitates error propagation: a simple mistake, especially early in the plan, can have strong downstream effects.
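A minimal sketch of the idea, assuming a hypothetical `call_llm(prompt) -> str` helper; the point is only that each step sees a smoothed summary of the reasoning so far rather than the raw output of its predecessor.

```python
# Sketch: carry intermediate reasoning across plan steps so per-step decisions
# stay consistent. `call_llm` is a placeholder for whatever inference client you use.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def execute_plan(plan_steps: list[str], task: str) -> list[str]:
    reasoning_state = f"Task: {task}. No steps executed yet."
    outputs = []
    for step in plan_steps:
        # Each step sees the filtered reasoning so far, not just the last raw output.
        result = call_llm(
            f"Reasoning so far:\n{reasoning_state}\n\nExecute this step:\n{step}"
        )
        outputs.append(result)
        # "Filter" the state: fold the new result into a short, stable summary.
        # A mistake here propagates downstream, which is the main risk of this approach.
        reasoning_state = call_llm(
            "Update this running summary with the new step result, keeping it short "
            f"and consistent with earlier decisions.\nSummary:\n{reasoning_state}\n"
            f"New result:\n{result}"
        )
    return outputs
```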
Next is subtask decomposition. This is well known; even ReAct and some of the earlier agentic action systems have some form of it. What we found is that experimenting with different forms and structures of subtask decomposition in our planning process has produced some interesting downstream benefits. In particular, when you increase the resolution or fidelity of the subtasks in a given plan, you get more fine-grained control and can define the action space much more clearly. The risk is that you introduce many more decisions for the LLM to make: the more small, tiny tasks you introduce, the harder it is for the agentic system to decide what is right.
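A rough sketch of what higher-fidelity decomposition can look like; the structure and the migration example are illustrative, not Factory's actual schema.

```python
# Sketch: finer-grained subtasks with an explicitly constrained action space.
from dataclasses import dataclass, field

@dataclass
class Subtask:
    description: str
    # Restricting the action space per subtask gives finer-grained control,
    # at the cost of asking the model to make more (small) decisions.
    allowed_actions: list[str] = field(default_factory=list)

def decompose_migration(file_path: str) -> list[Subtask]:
    # High fidelity: many small subtasks, each with a narrow action space.
    return [
        Subtask(f"Read {file_path} and list deprecated API calls", ["read_file", "search"]),
        Subtask("Propose a replacement for each deprecated call", []),
        Subtask(f"Apply the replacements to {file_path}", ["edit_file"]),
        Subtask("Run the test suite for the touched module", ["run_tests"]),
    ]
```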
Then you have model predictive control. Again, this is not a new idea and not invented by anyone here, but the gist is to evaluate the outcomes of your subtasks and your current state, and enable adaptive replanning based on real-time feedback gathered during execution of the plan. This matters when the environment or the information changes rapidly, which for us happens when other humans are actively engaged in the development process or some other workflow while the agent is executing a long-running workflow. There is plenty of material on these techniques, and you don't have to jump into true Bayesian statistics or predictive modeling; the general ideas of replanning, taking in the trajectory information, and keeping it up to date are honestly enough to see some solid quality improvements in your agents.
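A minimal sketch of the replanning loop, assuming hypothetical `execute_step`, `evaluate_outcome`, and `replan` helpers and an arbitrary drift threshold.

```python
# Sketch: MPC-flavoured replanning. After each step, compare the observed outcome with
# what the plan assumed; if reality has drifted too far, regenerate the remaining steps.

def execute_step(step: str) -> str:
    raise NotImplementedError("run the step against your environment")

def evaluate_outcome(step: str, outcome: str, remaining: list[str]) -> float:
    raise NotImplementedError("score drift between expected and observed outcome, 0 = on track")

def replan(trajectory: list[tuple[str, str]], remaining: list[str]) -> list[str]:
    raise NotImplementedError("ask the model for fresh remaining steps given the trajectory")

def run_with_replanning(plan: list[str], max_iterations: int = 50) -> list[tuple[str, str]]:
    trajectory: list[tuple[str, str]] = []   # running record of (step, outcome)
    remaining = list(plan)
    iterations = 0
    while remaining and iterations < max_iterations:
        step = remaining.pop(0)
        outcome = execute_step(step)          # real feedback from the environment
        trajectory.append((step, outcome))
        drift = evaluate_outcome(step, outcome, remaining)
        if drift > 0.5:                       # threshold is arbitrary; tune for your domain
            remaining = replan(trajectory, remaining)
        iterations += 1
    return trajectory
```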
Finally, explicit plan criteria. You actually see this fairly often in very simple agents, and then people remove the explicit criteria to increase generalizability. What we found is that clearly defining successful structures, or at least successful initial states, for plans can have very strong downstream effects. This can be done in many ways: instruction tuning, few-shot prompt examples, or validating your plans with hard-coded logic. Ultimately all of this is about error reduction and keeping your trajectories successful for as long as possible. It is difficult to build and scale, because what I'm recommending is essentially that you hard-code a lot of logic into your system.
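One way this can look in practice is a set of hard-coded checks run before execution; the specific checks here are illustrative, the point is catching malformed plans early rather than mid-trajectory.

```python
# Sketch: hard-coded validation of a generated plan against explicit criteria.

def validate_plan(plan: list[str]) -> list[str]:
    problems = []
    if not plan:
        problems.append("plan is empty")
    if len(plan) > 50:
        problems.append("plan is suspiciously long; likely over-decomposed")
    if not any("test" in step.lower() for step in plan):
        problems.append("no verification step (e.g. running tests) in the plan")
    first = plan[0].lower() if plan else ""
    if not any(word in first for word in ("read", "inspect", "gather")):
        problems.append("plan does not start by gathering context")
    return problems  # an empty list means the plan passes the explicit criteria
```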
Depending on your domain, that may or may not seem appealing, but we are probably not building AGI tomorrow, or next week, or in the next six months, so taking some lessons from the symbolic era of AI is useful if you have a domain challenge you want to solve in the next three months. For us, delivering real value to customers today means that a lot of what we do involves things like explicit plan criteria and hard-coded logic. Many folks may not be open to admitting that, but I think it's a very important part of building these kinds of agentic systems. Now on to decision making. This section is admittedly a bit of a grab bag, but hopefully you're building agents, you care about this stuff, and it can spark or inspire some ideas in your own systems.
First, consensus mechanisms. There are many of these: self-consistency is a very popular one, along with prompt ensembles and cluster sampling. Basically, the more inference you run at runtime, and the more cleverly you can select the ideal or optimal samples from those many inferences, the higher the accuracy. It just costs more, and it may introduce longer inference wait times if you don't parallelize. We found this very important for getting high-quality decisions that are consistent.
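A minimal self-consistency sketch, assuming a hypothetical `call_llm` helper sampled with nonzero temperature; the selection rule here is a plain majority vote, which is the simplest of the consensus mechanisms mentioned above.

```python
# Sketch: self-consistency by majority vote over parallel samples of the same decision.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here (sampled, not greedy)")

def decide_by_consensus(prompt: str, n_samples: int = 5) -> str:
    # Sample the same decision several times in parallel to limit added latency.
    with ThreadPoolExecutor(max_workers=n_samples) as pool:
        answers = list(pool.map(call_llm, [prompt] * n_samples))
    # Pick the most common answer; smarter selection (clustering, scoring) also works.
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common
```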
Next, explicit and analogical reasoning. These also go by many names: chain of thought, checklists, chain of density, analogical prompting. Basically, you want the system to explicitly outline its reasoning or decision-making criteria to reduce the complexity of the decision. For example, if you want the system to decide between left and right, and you give it a checklist of what constitutes a reasonable left and what constitutes a reasonable right, it will do better at choosing. This does introduce challenges when the domain is very broad or the decision has a huge action space, in which case techniques like chain-of-thought reasoning and chain of density help; honestly, everything up to galactic-tree-of-mega-brain-thought exists out there, and exploring these techniques is well worth it for improving performance.
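A sketch of the checklist idea for the left-or-right example above; the criteria and the `call_llm` helper are made up for illustration.

```python
# Sketch: turn a binary decision into an explicit checklist so the model reasons
# against named criteria instead of making an open-ended call.

LEFT_CRITERIA = [
    "the change is isolated to a single module",
    "no schema migration is required",
]
RIGHT_CRITERIA = [
    "the change touches shared interfaces",
    "a schema migration is required",
]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def choose_left_or_right(context: str) -> str:
    checklist = "\n".join(
        [f"LEFT if: {c}" for c in LEFT_CRITERIA]
        + [f"RIGHT if: {c}" for c in RIGHT_CRITERIA]
    )
    return call_llm(
        f"Context:\n{context}\n\nWork through each checklist item, then answer with "
        f"exactly LEFT or RIGHT:\n{checklist}"
    )
```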
Fine-tuning is a bit of a cop-out answer, but I actually think it is quite valuable once you have data for specific decisions you want to make. It may simply be true that the best thing to do is spend a weekend pulling the latest open-source model, generating a batch of training data with a high-quality model, validating it with your team members, and fine-tuning. This is expensive, and it locks in the quality of your system: a lot of the benefit of being able to sample from different models is that every time a new state-of-the-art model comes out, your system gets better, which is a huge advantage. However, for certain decisions, especially those that are really out of distribution, fine-tuning is a pretty effective
way to make a good decision. Then there's simulation. Simulating decision making is tricky and very domain specific, but if you're working in software development, simulation is luckily almost built in: executing code and reasoning through code trajectories is very doable. So for us, simulation makes up a huge amount of how we think about decision making: sampling multiple decision paths, simulating them with both real and LLM-imagined execution paths, and techniques like language agent tree search, which basically says, over this simulation of decision nodes, let's run a fancy Monte Carlo tree search style algorithm (it isn't exactly Monte Carlo tree search, but close) to decide where to move next based on the simulation results.
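A stripped-down sketch of the sample-and-simulate pattern; the tree-search variants replace the flat scoring below with a search policy. The helper names are placeholders.

```python
# Sketch: sample several candidate decision paths, simulate each (with real execution
# or an LLM-imagined rollout), and pick the best-scoring one.

def propose_paths(state: str, n: int) -> list[list[str]]:
    raise NotImplementedError("ask the model for n candidate action sequences")

def simulate(path: list[str], state: str) -> float:
    raise NotImplementedError("execute or imagine the path; return a score in [0, 1]")

def pick_best_path(state: str, n_candidates: int = 4) -> list[str]:
    candidates = propose_paths(state, n_candidates)
    scored = [(simulate(path, state), path) for path in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]   # highest-scoring simulated path
```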
You can do all of these things, and ultimately the core goal is to have some evaluation of whether or not your system makes good decisions when you need it to. Take these techniques, apply them, see what is and isn't working, weigh the pros and cons, and adapt. So that's decision making.
Now for environmental grounding. The first piece is often called tool use, which is an equally valid way to describe AI-computer interfaces, but I think building these is the interesting challenge. There are dedicated tools that are very common: you can pull down LangChain and start using a calculator, a sandbox, even Python script execution. I'm pretty sure you can clone an open-source repo today that implements Claude's Artifacts in your own environment. It's awesome how far the open-source community has pushed tool use.
But the edge of tool use, where you start moving toward custom AI-computer interfaces, is when you need workflows or trajectories that don't exist within the known tool set you have today. If you have a calculator you can certainly run a calculation, but suppose you know that, very consistently, you are going to use the calculator, pass its output to another system, read and parse that system's logs, and then do some additional transformation on the parsed output. If you're going to do that consistently, expecting your agent to come up with that chain from scratch every single time is probably the wrong mental model for what the agent should be deciding about. Instead, you want to ask how to build that as a tool, and then build the interface from the LLM to that tool, so you streamline the kinds of actions you want to take (see the sketch below).
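A sketch of wrapping that recurring chain into one tool with a single interface; all the helper names are hypothetical stand-ins for whatever systems sit in your workflow.

```python
# Sketch: a composite tool. The calculate -> downstream system -> parse-logs chain is
# fixed in code, so the agent invokes one tool instead of re-planning the chain each time.

def run_calculation(expression: str) -> float:
    raise NotImplementedError("your existing calculator tool")

def run_downstream_system(value: float) -> str:
    raise NotImplementedError("the system that consumes the calculation and emits logs")

def parse_logs(raw_logs: str) -> dict:
    raise NotImplementedError("pull out only the fields the agent actually needs")

def calculate_and_summarize(expression: str) -> dict:
    """Single tool exposed to the LLM; returns a compact, structured result."""
    value = run_calculation(expression)
    raw_logs = run_downstream_system(value)
    parsed = parse_logs(raw_logs)
    return {"input": expression, "value": value, "summary": parsed}
```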
This is honestly really effective, especially in domains like code, where there are endless dev tools that people have built to be really good at developing software. They work well; they just have somewhat odd interfaces. Maybe it's a CLI, maybe it's command-shift-clicking to go to definition in VS Code. All of these things that exist in the world need a way for an LLM-based system to invoke them and reason about their outputs, and we spend a ton of our time designing those interfaces.
Explicit feedback processing is another very critical step in grounding your agent in an external environment. Logs are a great example: your CI/CD probably outputs an enormous amount of data about all the tests that ran, all the debug statements, all kinds of garbage you don't care about. If you know you will need to process that data and use it at some point, it's worth building explicit paths, decisions, or tools into the system that take all of that feedback, process it, and maybe even add a step that is purely LLM reasoning. For example: you have your logs, you parse through them, and you say, I know the LLM is going to want the failing tests, so let me extract the failing tests and also provide a brief explainer of what the rest of the logs were doing. That kind of feedback processing is pretty critical to making these systems work well.
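A sketch of that CI-log example: hard-coded parsing pulls out the failures, and a separate, purely LLM step summarizes the rest. The log format and the `call_llm` helper are assumptions for illustration.

```python
# Sketch: explicit feedback processing for CI logs before handing them to the agent.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model client here")

def process_ci_logs(raw_logs: str) -> str:
    lines = raw_logs.splitlines()
    failures = [line for line in lines if "FAILED" in line or "ERROR" in line]
    remainder = [line for line in lines if "FAILED" not in line and "ERROR" not in line]
    overview = call_llm(
        "Summarize these CI log lines in two sentences, ignoring routine output:\n"
        + "\n".join(remainder[:500])   # cap what we send; most of it is noise
    )
    return "Failing tests:\n" + "\n".join(failures) + "\n\nEverything else:\n" + overview
```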
I've been talking mostly about external tools and mechanisms, the agentic system interacting with something else, but there is also the agentic system interacting with itself. As the agent reflects and reasons about error trajectories, processing that feedback in a meaningful way and passing it back to the LLM is very important, because a lot of the time the LLM is not great at criticizing its own actions unprompted; it's just good at listening when you tell it to criticize itself. We found that to matter a lot.
Next, bounded exploration. You definitely want your agents to gather as much context as they can about the problem space, usually (though not always) at the beginning of the problem. There is a huge benefit to models that can handle this, the very long-context models: Gemini 1.5 Pro, Claude 3.5 Sonnet, and honestly many models now have very long context windows, so you can keep including information. But at a certain point you need to say, let's jump into the problem, and what we found is that the right amount of bounded exploration time is very difficult to know in advance and honestly requires a lot of evaluation. If you can let your agent gather this context, have longer exploration phases, collect data, decide which data is or isn't relevant, and begin the problem with the maximum likelihood of success, that is one of the single most important things for a successful trajectory.
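A minimal sketch of an exploration phase with an explicit budget before committing to the task; the budget value and helper names are assumptions, and in practice the right bound tends to come out of evaluation rather than being known up front.

```python
# Sketch: bounded exploration, then commit. The agent gathers context until it either
# decides it has enough or exhausts the budget, and only then starts the real trajectory.

def gather_more_context(task: str, context: list[str]) -> str | None:
    raise NotImplementedError("e.g. read another file, run a search; return None when done")

def solve(task: str, context: list[str]) -> str:
    raise NotImplementedError("start the actual trajectory with the gathered context")

def run_with_bounded_exploration(task: str, max_exploration_steps: int = 10) -> str:
    context: list[str] = []
    for _ in range(max_exploration_steps):
        new_item = gather_more_context(task, context)
        if new_item is None:          # agent decided it has enough context
            break
        context.append(new_item)
    return solve(task, context)       # jump into the problem once the budget is spent
```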
And then human guidance. This is also a bit of a cop-out answer, but it's important to describe: as your agentic system interacts with humans, you want to decide at which points in those interaction patterns you ask the human to intervene or provide guidance. With careful UX and interaction design, this can be extremely effective at taking your systems from 30 or 40 percent reliability to 90 or 100 percent. Ultimately it is about balancing autonomy with human oversight, and it's a trade-off you
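One common shape for that trade-off is a simple gate that asks for guidance only when the agent's own confidence is low; the threshold and helper functions below are illustrative, not a description of Factory's product.

```python
# Sketch: a human-in-the-loop gate driven by a confidence estimate.

def estimate_confidence(step: str, context: str) -> float:
    raise NotImplementedError("e.g. self-consistency agreement rate, or a learned score")

def ask_human(step: str, context: str) -> str:
    raise NotImplementedError("surface the question through your product's UX")

def maybe_ask_for_guidance(step: str, context: str, threshold: float = 0.6) -> str | None:
    confidence = estimate_confidence(step, context)
    if confidence < threshold:
        return ask_human(step, context)   # trade a bit of autonomy for reliability
    return None                           # proceed autonomously
```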
have to make. Anyway, those are the things we learned at Factory; there are many more, and if you'd like to join us, we are hiring AI engineers, software engineers, go-to-market, everybody, so please send me an email.