Patrick Dougherty: How to Build AI Agents that Actually Work

Hi, I'm Patrick. I was the co-founder and CTO of Roscoe, and two years ago we decided to rip apart our entire product and rebuild it around AI agents. These are some of the lessons we learned.

First, let's start with a definition, since a lot of people use the term "agent" but don't necessarily mean the same thing that I do. My definition has three specific criteria that something has to meet to be considered an AI agent. One, the agent needs to be able to take directions. These can be human- or AI-provided, but there should be one specific objective or overarching goal. Two, it has to have access to call at least one tool and get a response back. And three, it should be able to autonomously reason about how and when to use its tools to accomplish that objective. That means it can't be a predefined sequence of "this tool will run, then this next tool will run" in a prompt-chained setup. It has to use autonomous reasoning to be called an AI agent.
One of the biggest lessons we learned in building agents was the need to focus on enabling the agent to think, rather than being limited by what the underlying model knows. So a lot of our tool calls were focused on retrieval. Rather than doing RAG, where we inserted content into the system prompt to guide the agent's actions, we focused on discrete tool calls that let the agent perform retrieval itself and pull the relevant context into its context window while it was working.
The product we built enabled an AI agent to search and query the enterprise data in your data warehouse for you. One of the best ways to illustrate the limitations of focusing on knowledge over reasoning is writing a SQL query given some data. What we found frequently was that if you gave the agent a whole bunch of tables and all of the columns in those tables, it would fail to reason correctly about which ones to use. It would get overwhelmed by the number of tokens in the prompt and either choose the wrong one or write a terrible query that didn't execute in the first place. That's where we moved to more discrete, simpler building blocks of tool calls, such as search tables, get table detail, or profile a column. The agent is then tasked with using those iteratively to find the right columns for the right query.
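Here's a rough sketch of what those discrete retrieval tools could look like as function-calling definitions. The exact names and parameters below are illustrative assumptions, not the schemas we shipped.

```python
# Illustrative function-calling schemas for the discrete retrieval tools
# described above (names and parameters are assumptions, not the shipped ones).
TOOLS = [
    {
        "name": "search_tables",
        "description": "Search the data warehouse catalog for tables relevant to a topic.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "get_table_detail",
        "description": "Return the columns, types, and description for one table.",
        "parameters": {
            "type": "object",
            "properties": {"table_name": {"type": "string"}},
            "required": ["table_name"],
        },
    },
    {
        "name": "profile_column",
        "description": "Return distinct values, null rate, and basic stats for one column.",
        "parameters": {
            "type": "object",
            "properties": {
                "table_name": {"type": "string"},
                "column_name": {"type": "string"},
            },
            "required": ["table_name", "column_name"],
        },
    },
]
```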
We saw this play out again as reasoning models were introduced. When you focus on reasoning, you give a reasoning model the ability to first attempt to find the data needed to answer a particular question. If it doesn't find it, it should be able to tell you so, and you can take some action with that knowledge. What we saw with GPT-4o, before reasoning models came out, is that regardless of whether the underlying data is even present to answer a question, it is going to attempt to write that query anyway.
Let's walk through an example. In this prompt, I'm providing GPT-4o with a table schema that is pretty standard for Salesforce: there's a table for accounts, contacts, and opportunities. These aren't all the columns in each of those tables; I'm oversimplifying for illustration. At the bottom, I've asked GPT-4o a question: write a query to see how many of my customers churned in the last month.
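The slide itself isn't reproduced here, but you can recreate a prompt along these lines yourself. The column names below are my own stand-ins for a simplified Salesforce-style schema, and the call uses the OpenAI Python SDK as one way to run the comparison.

```python
# Hypothetical reproduction of the prompt described above. The column names are
# stand-ins; the talk only says the schema covers accounts, contacts, and
# opportunities with a subset of their real columns.
from openai import OpenAI

PROMPT = """You can query the following tables:

accounts(id, name, account_type, industry, created_date)
contacts(id, account_id, first_name, last_name, email)
opportunities(id, account_id, stage, amount, close_date)

Write a query to see how many of my customers churned in the last month."""

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # swap to "o1" to see the reasoning-model behavior discussed below
    messages=[{"role": "user", "content": PROMPT}],
)
print(response.choices[0].message.content)
```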
What you'll see from GPT-4o is that it is heavily incentivized to write a query and give me back SQL. It doesn't really stop to think about whether this query is even possible to write in the first place. It makes some assumptions and then just starts writing SQL. Its definition of churn is really bad: it essentially looks in the accounts table and assumes there's an account type other than "customer" that would somehow be updated when a customer churned. This is very likely to lead an analyst to a totally wrong answer if they took this query and ran it as-is. And the model isn't pushing back in any way; it never stops to say, "I should think about this and consider whether it's even possible."
Now let's flip over and see the same prompt run on o1. The prompt is exactly the same: we're providing the same schema and the same question. o1 reasoned through the various aspects of the question and accurately concluded that there is no way, given the schema provided, to determine whether an account has churned. That conclusion shows the difference between giving the model the freedom and encouragement to think and reason, versus essentially forcing it to come up with a SQL query. That's one of the key lessons we learned in building and deploying agents that were useful for enterprises.
As part of this, there is a huge need to iterate on what I would call the ACI, the agent-computer interface; I believe the term was coined in a published paper. It refers to the exact syntax and structure of tool calls: both what goes into the tool call, and the content and format of the response from the API or Python code that handles and executes that tool call. What we learned is that really small tweaks to the agent-computer interface can have a massive impact on the accuracy and performance of your agent. When you're making these tweaks, they will feel so trivial that it makes no sense they would have any bearing on your agent's performance. But I'm telling you, this is actually one of the best ways you can spend your time when you're trying to get your agent working consistently. A couple of specific examples.
Number one: we found that the format of a tool response was consumed better or worse depending on the model, likely correlating with the model's underlying training data. Specifically, when working with GPT-4o, our search result payloads were initially formatted as Markdown, and we were seeing cases where the agent would look at the response it got back from the tool call and tell us that a column did not exist, even though you could see the column right there in the tool result. These were long-context tool results: some of our customers had tables with 500 or 1,000 columns in their data warehouse, so if you're getting 30,000 tokens back, it's understandable that there might be some challenges. But for the agent to be consistently, completely blind to a column, we felt there had to be a way to improve. So we tested different formats, and we ultimately learned that just switching the response formatting from Markdown to JSON, a semi-structured payload, immediately solved this problem for GPT-4o. However, we learned later that for Claude, it was really important to provide XML back to the model, not JSON. So again, depending on the model you're using, the specific function arguments and the responses you provide from your tools can really impact your agent's performance.
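As a rough illustration, here is the same hypothetical search-result payload serialized two ways: JSON for a GPT-4o-backed agent and XML for a Claude-backed one. The payload fields are made up; the point is only that serialization is a per-model knob worth testing.

```python
# Same hypothetical tool result rendered two ways. Which serialization works
# best depends on the model consuming it -- JSON worked for GPT-4o in our case,
# XML for Claude. The field names here are illustrative.
import json
from xml.etree import ElementTree as ET

result = {
    "table": "accounts",
    "columns": [
        {"name": "account_type", "type": "varchar"},
        {"name": "created_date", "type": "date"},
    ],
}

def to_json(payload: dict) -> str:
    return json.dumps(payload, indent=2)

def to_xml(payload: dict) -> str:
    root = ET.Element("table", name=payload["table"])
    for col in payload["columns"]:
        ET.SubElement(root, "column", name=col["name"], type=col["type"])
    return ET.tostring(root, encoding="unicode")

print(to_json(result))
print(to_xml(result))
```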
Think of the model as the brain when you're building an agent; it performs the thinking. If the model is weak, your users aren't going to be happy, because they'll see the obvious logical fallacies the agent makes. What's critical is that even if some of your tasks need to run on a cheaper model, like some of your tool calls or the sub-prompts your agent triggers, the model that actually decides which tool call to make next, based on everything that has happened up to that point, should be a generally intelligent model. Claude 3.5 Sonnet is still probably my favorite for this, even beyond the reasoning models, because it strikes a really nice balance between speed, cost, and making good decisions based on what it has learned so far.
Another thing: we talked about GPT-4o versus o1, and how 4o is incentivized to make an attempt even if the task is impossible. One thing you can learn by observing the failure modes of agents running on a given model is that the way it hallucinates often tells you what the model expects from a tool call. If you see it consistently ignoring your JSON schema for a tool call and providing an argument in a different format, that's an indicator that the agent is telling you how it thinks the tool call should be defined. If you can change the tool to match that expected format, you will generally improve agent performance, because it's closer to the training data and the model's native instincts, rather than you trying to force it into something else.
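For example (the schemas here are invented), if a model keeps passing one dotted string where your tool expects two separate fields, it can be easier to reshape the tool to match the model's instinct than to prompt the instinct away.

```python
# Invented example of adapting a tool schema to the format a model keeps
# producing. If it repeatedly sends {"column": "accounts.account_type"} despite
# a schema asking for separate fields, meet it where it already is.

# What we originally asked for:
original_schema = {
    "type": "object",
    "properties": {
        "table_name": {"type": "string"},
        "column_name": {"type": "string"},
    },
    "required": ["table_name", "column_name"],
}

# What the model kept producing anyway, and the schema changed to match it:
adapted_schema = {
    "type": "object",
    "properties": {
        "column": {
            "type": "string",
            "description": "Fully qualified column, e.g. accounts.account_type",
        },
    },
    "required": ["column"],
}
```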
Another lesson we learned was that fine-tuning models was a waste of time. I think this is generally accepted now, but there's still some work happening on building agents with fine-tuned models. If you buy the premise that we're focusing on reasoning over the model's inherent knowledge, then it follows that fine-tuning doesn't really improve reasoning. In our experience it actually decreased reasoning in a lot of cases, because it effectively overfit the model to a specific sequence of tasks each time rather than stopping to think about whether it was making the right decision. So I would spend your time on that ACI iteration rather than trying to build a fine-tuned model to run your agent on.
Another question we got frequently from customers, users, and others was, "Hey, what abstraction are you using? Which framework are you building on?" For two reasons, we did not end up using an abstraction. The first reason is simple: when we started building this two years ago, none of the abstraction libraries like LangGraph or CrewAI were publicly available yet, so we didn't really have a choice. We were basing some of our research on Auto-GPT at the time. The second reason is that even as those frameworks became more popular, we kept evaluating whether to transfer some of our code to them, and the problem was that there are huge blockers and considerations when you want to go to production with an agent running on one of these frameworks. One key example for us was the ability for an end user's security credentials to cascade down to the agent they were talking to. Think about a human using an agent to query their Snowflake account: they may have very granular permissions within that Snowflake account governing what they specifically are allowed to see in the underlying data. We needed our agent to run with that user's permissions using an OAuth integration, and that made an approach like LangGraph extremely difficult to build and scale, because we needed to manage the authentication process and the underlying service keys and tokens within our own code base, not within a third-party framework.
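As an illustration of what that credential cascade means in practice, here's a rough sketch using the Snowflake Python connector with an OAuth token. The connector does support `authenticator="oauth"` with a token; everything else here (how the token reaches the tool, the surrounding flow) is a simplified assumption about the pattern, not our actual implementation.

```python
# Sketch of cascading an end user's OAuth token down to the agent's query tool,
# so every query runs with that user's Snowflake permissions rather than a
# shared service account. Simplified; not our actual implementation.
import snowflake.connector

def run_query_as_user(sql: str, account: str, user: str, oauth_token: str):
    """Execute SQL with the end user's token so their grants apply to the query."""
    conn = snowflake.connector.connect(
        account=account,
        user=user,
        authenticator="oauth",   # supported by the Snowflake Python connector
        token=oauth_token,       # token obtained from the user's OAuth flow
    )
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()

# The agent's query tool would close over the token from the user's session,
# so no framework-level service key is ever shared across users.
```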
The lesson to take away is to think about your end goal before you get too dependent on one of these frameworks. There is not that much code you have to write to build an agent, or even a multi-agent system. If you're in prototype mode, then sure, use an abstraction, speed yourself up, and validate something as quickly as possible. But if your end goal is production, you'll likely regret being too dependent on a third-party library.
One of the other philosophical conclusions we reached is that ultimately your agent is not your moat, meaning the system prompts. The most valuable thing you can do is set up the ecosystem around your agent: the user experience of how your user interacts with the agent, and the connections and security protocols the agent has to follow in doing its work. That is the most time-consuming part of building a production-quality agent into a product, and it is ultimately going to be your moat, insofar as we can even have moats these days with how quickly this stuff is moving.
Last but not least, one of the key lessons we learned more recently was about designing and executing multi-agent systems. About a year into our transition to an agent-based product, as our customers were getting comfortable with single agents, we introduced a multi-agent concept. These are some of the lessons from doing that which stuck with us and, I think, remain highly relevant when you're designing agents in a product. Number one is the need for a manager agent within a hierarchy. We found the manager agent would own the final outcome but could delegate subtasks to specific worker agents that had more context in their instructions and more specific tool calls for accomplishing those tasks. If you gave all of that information to a single agent, it could become overwhelmed, make bad decisions, and go down bad paths.
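A bare-bones sketch of that hierarchy might look like the following. The worker names, prompts, and the `call_llm`/`run_worker` helpers are hypothetical; the point is that the manager is given the overall objective and decides how to delegate, while each worker sees only its own focused instructions.

```python
# Hypothetical manager/worker hierarchy. The manager owns the final outcome and
# delegates subtasks; each worker has narrower instructions and tools.

WORKERS = {
    "schema_researcher": "You find the tables and columns relevant to a question.",
    "sql_writer": "You write and validate a SQL query given specific columns.",
    "summarizer": "You explain query results in plain language.",
}

def run_manager(objective: str, call_llm, run_worker) -> str:
    """call_llm and run_worker are placeholders for your model and worker loop."""
    history = []
    while True:
        # The manager is "rewarded" with the overall objective, not a fixed plan.
        decision = call_llm(
            system="You are the manager. Accomplish the objective by delegating "
                   "subtasks to the available workers, then deliver the final answer.",
            objective=objective,
            workers=list(WORKERS),
            history=history,
        )
        if decision["type"] == "final_answer":
            return decision["content"]
        result = run_worker(decision["worker"], WORKERS[decision["worker"]], decision["task"])
        history.append({"worker": decision["worker"], "task": decision["task"], "result": result})
```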
We also learned that there's almost a two-pizza rule for the number of agents working together, similar to how Jeff Bezos designed teams early on at Amazon. We found that if you limit yourself to roughly five to eight agents working together, the task can typically be accomplished well by a multi-agent team. I've seen and prototyped systems with 25 or 50 agents working together, and what really happens is you strongly decrease the likelihood that the outcome ever gets accomplished, because you're likely to trigger infinite loops or go down paths you never return from. Incentivization is the number one way to set these systems up. The goal should not be to force your worker agents through a discrete set of steps, but rather to incentivize your manager agent: describe, and "reward" it with, accomplishing the overall objective, and rely on it to manage the underlying worker agents and make sure their output is valuable and usable in the context of achieving that broader outcome.
I wrote more about designing effective multi-agent teams on my blog at astorapp.com; the post goes into a little more detail about these principles and some other thoughts as well. Thanks so much for your time. I hope you enjoyed learning about all of the mistakes I've made over the last couple of years designing agent and multi-agent systems, and that avoiding them saves you some time.