
Patrick Dougherty: How to Build AI Agents that Actually Work



00:00:00.960 | Hi, I'm Patrick. I was the co-founder and CTO of Rasgo, and two years ago we decided to rip
00:00:07.280 | apart our entire product and rebuild it around AI agents. These are some of the lessons we learned.
00:00:13.040 | First of all, let's start out with a definition, since a lot of people use the term agent but don't
00:00:21.040 | necessarily mean the same thing that I do. For my definition, I came up with three
00:00:26.240 | specific criteria that it has to meet to be considered an AI agent. Number one, the agent
00:00:32.720 | needs to be able to take directions. These can be human or AI provided, but it should be one specific
00:00:39.920 | objective or overarching goal. Two, it has to have access to call at least one tool and get a response
00:00:48.480 | back. And three, it should be able to autonomously reason about how and when to use its tools to accomplish
00:00:56.160 | that objective. What that means is it can't be a predefined sequence where this tool runs and then
00:01:04.480 | the next tool runs in a prompt-chained setup. It has to use autonomous reasoning in order
00:01:10.720 | to be called an AI agent.
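
To make that distinction concrete, here is a minimal Python sketch of an autonomous tool-use loop, as opposed to a predefined chain. The call_model and execute_tool helpers are placeholders rather than any specific SDK; the point is only that the model, not the code, decides which tool to call next and when to stop.

```python
def call_model(messages, tools):
    """Placeholder for whatever LLM API you use; returns either a tool call or a final answer."""
    raise NotImplementedError

def execute_tool(tool_call):
    """Placeholder that runs the requested tool and returns its result as text."""
    raise NotImplementedError

def run_agent(objective: str, tools: list) -> str:
    # One overarching objective, access to at least one tool, and the model reasoning
    # about how and when to use those tools -- the three criteria above.
    messages = [{"role": "user", "content": objective}]
    while True:
        reply = call_model(messages, tools)
        if reply.get("tool_call") is None:          # the model decided it is done
            return reply["content"]
        result = execute_tool(reply["tool_call"])   # not a fixed, pre-scripted sequence
        messages.append({"role": "tool", "content": result})
```
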
00:01:20.320 | One of the biggest lessons we learned in building agents was the necessity to focus on enabling the agent to think rather than be limited by what the underlying model knows.
00:01:27.600 | So a lot of our tool calls were focused on retrieval rather than trying to do RAG where we inserted
00:01:35.920 | content into the system prompt to guide the agent's actions. Instead, we focused on discrete tool calls that
00:01:43.680 | allowed it to perform retrieval and get the relevant context into its context window while it was working.
00:01:51.040 | The product that we built was enabling an AI agent to search and query your enterprise data in your data
00:02:00.000 | warehouse for you. And so a great way to illustrate the limitations of focusing on knowledge
00:02:09.440 | over reasoning is when it comes to writing a SQL query given some data. So what we found frequently was that
00:02:17.680 | if you gave the agent a whole bunch of tables and all of the columns in those tables, it would fail to
00:02:25.040 | reason correctly about which one to use. It would get overwhelmed by the number of tokens in the prompt and
00:02:32.560 | either choose the wrong one or just write a terrible query that didn't execute in the first place.
00:02:37.040 | That's where we went to more discrete, simpler building blocks of tool calls such as search tables,
00:02:47.280 | get table detail, or profile a column. The agent is then tasked with using those iteratively to find the
00:02:56.640 | right columns for the right query.
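
As an illustration, those discrete building blocks might be defined something like this. The exact names, parameter shapes, and descriptions are assumptions for illustration, not the actual tool definitions from the product.

```python
# Illustrative tool definitions for iterative retrieval over a data warehouse,
# based on the tools mentioned above (search tables, get table detail, profile a column).
TOOLS = [
    {
        "name": "search_tables",
        "description": "Search the data warehouse for tables relevant to a topic.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
    },
    {
        "name": "get_table_detail",
        "description": "Return the columns, types, and description for one table.",
        "parameters": {"type": "object",
                       "properties": {"table": {"type": "string"}},
                       "required": ["table"]},
    },
    {
        "name": "profile_column",
        "description": "Profile one column: distinct values, null rate, min and max.",
        "parameters": {"type": "object",
                       "properties": {"table": {"type": "string"}, "column": {"type": "string"}},
                       "required": ["table", "column"]},
    },
]
```
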
00:03:06.080 | We saw a similar dynamic play out as reasoning models were introduced. When you focus on reasoning, you give a reasoning model the ability to
00:03:13.840 | first attempt to find the data needed to answer a particular question. But then if it doesn't find it,
00:03:20.640 | it should be able to tell you that it didn't find it, and you can take some action with that knowledge.
00:03:25.840 | What we've seen with GPT-4o, prior to reasoning models coming out, is that regardless of whether
00:03:34.560 | the underlying data is present for it to be able to answer a question, it is going to attempt to
00:03:40.080 | write that query anyway. Let's walk through an example of this. So in this prompt, I'm providing
00:03:49.120 | GPT-4o with a table schema that is pretty standard from Salesforce. So there's a table for accounts,
00:03:56.720 | contacts, and opportunities. And these aren't all the columns in each of those tables. I'm just
00:04:01.280 | oversimplifying for representative purposes. At the bottom, I've asked GPT-4o a question.
00:04:07.040 | Write a query to see how many of my customers churned in the last month.
00:04:12.000 | What you'll see from GPT-4o is that it is very incentivized to write a query and give me back SQL.
00:04:20.960 | It's not really stopping to think about is this query even possible to write in the first place.
00:04:28.080 | So it makes some assumptions, and then it just starts writing SQL. Its definition for
00:04:35.520 | calculating churn is really bad. It's essentially just looking in the account table and assuming that
00:04:43.520 | there's an account type other than customer that would somehow be updated when a customer
00:04:51.040 | churned. So I think this is very likely to lead an analyst to a totally wrong answer if they were to
00:04:58.560 | take this query and automatically run it. But what you see is that it's not pushing back in any way.
00:05:03.760 | It doesn't stop to say, "I should think about this and consider if this is even possible."
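
For illustration only, a query along the lines described, not the actual output shown in the talk, might look like the following. It bakes in assumptions the schema does not support, which is exactly the failure mode being described.

```python
# Illustrative only: the flavor of query GPT-4o tends to produce for the churn question
# above. It assumes the accounts table has a "type" column that stops saying 'Customer'
# when an account churns, and a modification date to filter on -- neither of which the
# provided schema gives any reason to believe.
naive_churn_sql = """
SELECT COUNT(*) AS churned_customers
FROM accounts
WHERE type != 'Customer'
  AND last_modified_date >= DATEADD(month, -1, CURRENT_DATE)
"""
```
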
00:05:10.240 | Okay, now let's flip over and see the same prompt run on o1. So this prompt is the exact same up top.
00:05:18.640 | We're providing the same schema and the same question. o1 reasoned through the various aspects of this
00:05:26.240 | question, and it accurately concluded that there is no way, given the schema provided,
00:05:35.600 | to calculate a churned status on an account. And so that conclusion shows the difference between
00:05:44.880 | giving the model the freedom to think, and encouraging it to think and reason, versus just forcing it,
00:05:52.800 | essentially, to come up with a SQL query. So that's one of the key lessons that we learned in kind of
00:05:58.080 | building and deploying agents that were useful for enterprises.
00:06:04.800 | As part of this, there is a huge need to iterate on what I would call the ACI. I believe there's
00:06:13.920 | a paper that was published that coined this term, Agent-Computer Interface. And it's
00:06:19.200 | really referring to the exact syntax and structure of tool calls, both what goes into the tool call,
00:06:26.000 | and then the content and format of the response from the API or Python code that would
00:06:33.680 | handle and execute that tool call. So what we learned is really small tweaks to the Agent Computer Interface
00:06:40.800 | can have a massive impact on the accuracy and performance of your agent. And you will feel when
00:06:47.440 | you're making these tweaks like they are so trivial, it makes no sense that they would have any bearing
00:06:52.800 | on your agent's performance. However, I'm telling you that this is actually one of the best ways you can
00:06:58.880 | spend your time when you're trying to get your agent working consistently. A couple of specific examples.
00:07:05.280 | So number one, one of the things that we found was that the format of the response, depending on the model,
00:07:13.120 | was consumed better or worse, likely correlating to the model's underlying training data.
00:07:20.800 | So specifically, when working with GPT-4o, we changed how we responded with these search
00:07:27.760 | result payloads. Initially, they were formatted as Markdown. And we were seeing examples where the
00:07:33.520 | agent would look at the response that it got back from the tool call and tell us that a column did not
00:07:40.080 | exist, when you could see the column within the context passed back in the tool result.
00:07:47.040 | These were long context tool results. Oftentimes, some of our customers had 500 or 1000 column tables
00:07:52.960 | in their data warehouse. So it was understandable, if you're getting 30,000 tokens back,
00:07:57.920 | that there might be some challenges there. But for the agent to so consistently be completely blind to it,
00:08:05.120 | we felt there had to be a way to improve this. So we tested different formats. We ultimately learned that just
00:08:10.400 | switching the formatting of the response from Markdown to JSON, having that semi-structured payload as the
00:08:16.720 | response, immediately solved this problem for GPT-4o. However, we learned later on that for Claude,
00:08:23.440 | it was really important to provide XML back to the model, not Markdown and not JSON. So again, depending
00:08:30.640 | on the model you're using and the specific function arguments and then responses that you're providing
00:08:37.760 | from those tools, it can really impact your agent's performance.
00:08:46.080 | Think of the model as your brain when you're building an agent. The model is performing the
00:08:53.120 | thinking capabilities. And if the model sucks, then your users aren't going to be happy because they're
00:08:58.400 | going to see some of the obvious logical fallacies that the agent will make. So I think what's critical there is
00:09:05.760 | even if some of your tasks need to run on a cheaper model, like some of your tool calls or some of the
00:09:11.920 | sub prompts that might be triggered by your agent, it's really important that the actual model making
00:09:18.800 | the determination of which tool call to make next based on what has happened up to that point
00:09:23.760 | is a generally intelligent model. I would say Claude 3.5 Sonnet is still probably my favorite for this,
00:09:32.400 | even beyond the reasoning models, because it strikes a really nice balance between speed,
00:09:38.240 | cost, and making a good decision based on what it's learned so far.
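
A minimal sketch of that split might look like this. The model names, the call_model helper, and the summarization subtask are placeholder assumptions; the point is just that the next-step decision always goes to the stronger model.

```python
def call_model(model: str, messages: list, tools: list | None = None):
    """Placeholder for whatever LLM API you use."""
    raise NotImplementedError

ORCHESTRATOR_MODEL = "claude-3-5-sonnet"  # the "brain": decides which tool call comes next
SUBTASK_MODEL = "a-cheaper-model"         # bounded, well-defined work triggered along the way

def decide_next_step(conversation: list, tools: list):
    # The decision about what to do next, based on everything so far, stays on the strong model.
    return call_model(ORCHESTRATOR_MODEL, messages=conversation, tools=tools)

def summarize_tool_result(raw_result: str) -> str:
    # A narrow subtask like compressing one long tool result can run on the cheaper model.
    return call_model(SUBTASK_MODEL, messages=[
        {"role": "user", "content": "Summarize the key columns in:\n" + raw_result}
    ])
```
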
00:09:43.360 | Another thing, we talked about GPT-4o versus o1 and how 4o is incentivized to make an effort even if
00:09:52.080 | the task is impossible. One thing you can learn though by observing the failure modes of agents running with
00:09:58.800 | a certain model is oftentimes the way it hallucinates tells you what the model expects in a tool call for
00:10:06.880 | instance. So if you see it consistently ignoring your JSON schema for a tool call and providing an argument in a
00:10:15.040 | different format, that should be an indicator to you that the agent is telling you how it
00:10:22.000 | thinks the tool call should be defined. And if you can change it to match that expected format, you're going to
00:10:29.840 | generally improve agent performance because it's going to be closer to the training data and the
00:10:37.280 | instinct that the model has natively versus you trying to force it into doing something else.
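
Here is a hypothetical example of acting on that signal. Suppose the agent keeps calling a table-detail tool with a single table name even though your schema asks for a list; redefining the tool to match what the model keeps reaching for usually helps. The tool name and fields below are illustrative, not the actual ACI.

```python
# Original definition: the agent keeps ignoring this and passing {"table": "accounts"}.
tool_before = {
    "name": "get_table_detail",
    "parameters": {
        "type": "object",
        "properties": {
            "fully_qualified_table_names": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["fully_qualified_table_names"],
    },
}

# Redefined to match the format the model keeps hallucinating, which is closer to its
# training data and native instinct.
tool_after = {
    "name": "get_table_detail",
    "parameters": {
        "type": "object",
        "properties": {
            "table": {"type": "string", "description": "A single table name, e.g. accounts"},
        },
        "required": ["table"],
    },
}
```
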
00:10:41.760 | Another lesson we learned was that fine tuning models was a waste of time. I think this is generally
00:10:51.040 | accepted now but there's still a little bit of work happening on building agents with fine-tuned models.
00:10:57.200 | If you buy the premise that we're focusing on reasoning over inherent knowledge of the model,
00:11:05.120 | then it's logical to say that fine tuning does not really improve reasoning. In our
00:11:13.440 | experience, it actually decreased reasoning in a lot of cases, because it effectively overfit or over-tuned
00:11:19.920 | the model to do a specific sequence of tasks each time rather than stopping and thinking about whether it
00:11:26.880 | was making the right decision. So I would really spend your time focusing on that ACI iteration
00:11:35.440 | rather than trying to build a fine-tuned model to run your agent on. Another question that we
00:11:43.680 | got frequently from customers, users, and others was, "Hey, what abstraction are you using? Which framework
00:11:50.880 | are you building on?" And for two reasons we did not end up using an abstraction. Number one was simple.
00:11:57.920 | When we started building this two years ago, none of the abstraction libraries like LangGraph
00:12:03.040 | or CrewAI were publicly available yet. So we didn't even have a choice really. We were just kind
00:12:08.400 | of basing some of our research off of Auto-GPT at the time. But the second reason is that even as those
00:12:14.720 | frameworks started to become more popular, we continued to evaluate transferring some of our code to them.
00:12:20.160 | The problem was there's huge blockers and considerations when you want to go to production
00:12:25.520 | with an agent running on one of these frameworks. One of the key things for us as an example was
00:12:32.880 | the ability for an end user's security credentials to cascade down to the agent they were talking to.
00:12:41.360 | So you think about if a human is trying to use an agent to query their Snowflake account,
00:12:47.440 | they may have very granular permissions within that Snowflake account governing what they specifically are
00:12:52.080 | allowed to see in the underlying data. We needed our agent to be able to run with that user's permissions
00:13:00.240 | using an OAuth integration. And that was something that made an approach like LangGraph extremely
00:13:06.800 | difficult to build and scale because we needed to essentially manage the authentication process
00:13:13.440 | and the underlying service keys and tokens within our code base, not within a third-party framework.
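
A minimal sketch of that credential-cascading requirement, with hypothetical names rather than the product's real implementation, looks something like this: the query tool runs with the end user's own OAuth token, so the warehouse enforces exactly that user's permissions.

```python
from dataclasses import dataclass

@dataclass
class UserContext:
    user_id: str
    snowflake_oauth_token: str  # obtained through the product's OAuth integration

def connect_with_user_token(token: str):
    """Placeholder for however your code opens a Snowflake session with the user's OAuth token."""
    raise NotImplementedError

def run_query_tool(sql: str, ctx: UserContext):
    # Connect as the end user rather than with a shared service key, so granular
    # permissions on the underlying data apply exactly as if the user ran the query.
    conn = connect_with_user_token(ctx.snowflake_oauth_token)
    return conn.execute(sql).fetchall()
```
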
00:13:21.760 | So the lesson I think to take away from that is think about what your end goal is first before you
00:13:27.840 | get too dependent on one of these frameworks. There is not too much code that you have to write to build
00:13:34.000 | an agent or even a multi-agent system. If you're in prototype mode, then sure, use an abstraction,
00:13:41.440 | speed yourself up, validate something as quickly as possible. But if your goal at the end is production,
00:13:46.800 | you'll likely regret being too dependent on a third-party library.
00:13:50.000 | One of the other philosophical conclusions we made is that ultimately your agent is not your moat, meaning the
00:14:03.200 | system prompts themselves are not the valuable part. I think the most valuable thing you can do is set up the ecosystem around your agent,
00:14:20.880 | including the user experience of how your user interacts with your agent and then also the connections and the security protocols that your agent has to follow in doing its work.
00:14:31.360 | That is the most time-consuming part of building a production quality agent into a product.
00:14:37.920 | And that is ultimately going to be your moat in as much as we can even have moats these days with how quickly this stuff is moving.
00:14:45.440 | Last but not least, one of the key lessons that we learned more recently was about designing and executing on multi-agent systems.
00:14:56.720 | So about a year into our process of transitioning to an agent-based product, as our customers were getting
00:15:04.160 | comfortable with single agents, we introduced a multi-agent concept. And these are some of the key
00:15:09.040 | lessons we learned when doing that that really stuck with us and I think have continued to be highly
00:15:15.200 | resonant when you're designing agents in a product. Number one is the need to implement a manager agent within a hierarchy.
00:15:23.760 | The reason for that is that we found the manager agent would own the final
00:15:30.320 | outcome but could delegate subtasks to specific worker agents that would have more context in their
00:15:37.200 | instructions and more specific tool calls to accomplish those tasks. Whereas if you gave all
00:15:43.360 | of that information to a single manager agent, it could become overwhelmed. It might make bad decisions,
00:15:49.280 | go down bad paths. We also learned that for the number of agents working together, there's almost a two-pizza rule
00:15:58.960 | that applies here, kind of similar to how Jeff Bezos would design teams early on at Amazon. So we found that
00:16:05.280 | if you could limit yourself to between about five and eight agents working together, then that was typically a task
00:16:13.360 | that could be accomplished well by a multi-agent team. I've seen and prototyped some systems where you might
00:16:20.800 | have 25 or 50 agents working together. And really what happens is you strongly decrease the likelihood that
00:16:28.240 | the actual outcome ever gets accomplished because you're likely to trigger infinite loops or go down
00:16:34.320 | paths that you don't return from. Incentivization is the number one way to set these things up.
00:16:41.760 | So the goal should not be to force your worker agents through a discrete set of steps, but rather to
00:16:50.160 | incentivize your manager agent, meaning describe and, quote unquote, reward it with accomplishing the overall
00:16:58.800 | objective, and rely on it to manage the underlying worker agents and make sure that
00:17:04.960 | their output is valuable and that it can be used within the context of achieving that broader outcome.
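
Putting those pieces together, a rough sketch of the hierarchy and the incentive-style framing might look like the following. The prompts, worker names, and delegate_to_worker helper are all illustrative assumptions, not the actual system.

```python
# Illustrative manager/worker setup: the manager owns the outcome and is "rewarded"
# for the overall objective; workers get narrower context and more specific tools.
MANAGER_PROMPT = (
    "You are the manager agent. You own the final outcome: {objective}. "
    "You succeed only when the objective is fully accomplished. Delegate subtasks "
    "to your worker agents, check that their output is usable, and re-delegate if it is not."
)

WORKERS = {
    "schema_researcher": "Find the tables and columns relevant to the subtask you are given.",
    "sql_writer": "Write and validate a SQL query for the subtask you are given.",
}  # keep the team small -- roughly five to eight agents at most

def delegate_to_worker(worker_name: str, subtask: str) -> str:
    """Placeholder: run one worker agent (its own prompt, tools, and loop) on a subtask."""
    raise NotImplementedError
```
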
00:17:15.680 | I wrote more about designing effective multi-agent teams on my blog at astorapp.com. This is the blog post,
00:17:25.360 | so I go into a little more detail about these principles and some other thoughts as well.
00:17:28.960 | Thanks so much for your time, and I hope you enjoyed learning about all of the mistakes that I've made over the last
00:17:37.360 | couple years in designing agent systems and multi-agent systems. I hope you can avoid them and that it saves you some time.