Hi, I'm Patrick. I was the co-founder and CTO of Roscoe, and two years ago we decided to rip apart our entire product and rebuild it around AI agents. These are some of the lessons we learned. First of all, let's start out with a definition, since a lot of people use the term agent but don't necessarily mean the same thing that I do.
For my definition, I came up with three specific criteria that a system has to meet to be considered an AI agent. Number one, the agent needs to be able to take directions. These can be human- or AI-provided, but it should be one specific objective or overarching goal. Two, it has to have access to call at least one tool and get a response back.
And three, it should be able to autonomously reason about how and when to use its tools to accomplish that objective. What that means is it can't be a predefined sequence where this tool runs and then the next tool runs in a prompt-chained setup. It has to use autonomous reasoning to be called an AI agent.
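To make that concrete, here's a minimal sketch of what that kind of loop looks like. The `call_model` helper and the tool handlers are hypothetical stand-ins; the part that matters is that the model, not a hard-coded sequence, decides what happens next.

```python
# Minimal agent loop sketch. call_model and the tool handlers are hypothetical;
# the defining feature is that the model decides which tool to call next (or to
# stop), rather than following a predefined prompt chain.
def run_agent(objective: str, tools: dict, call_model) -> str:
    messages = [{"role": "user", "content": objective}]
    while True:
        decision = call_model(messages, tool_names=list(tools))
        if decision["type"] == "final_answer":          # model says the objective is met
            return decision["content"]
        result = tools[decision["tool"]](**decision["arguments"])  # execute one tool
        messages.append({"role": "assistant", "content": f"Called {decision['tool']}"})
        messages.append({"role": "tool", "content": str(result)})  # feed the result back
```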
One of the biggest lessons we learned in building agents was the need to focus on enabling the agent to think rather than being limited by what the underlying model knows. So a lot of our tool calls were focused on retrieval, rather than trying to do RAG where we inserted content into the system prompt to guide the agent's actions.
Instead, we focused on discrete tool calls that allowed it to perform retrieval and pull the relevant context into its context window while it was working. The product we built enabled an AI agent to search and query the enterprise data in your data warehouse for you. And one of the best ways to illustrate the limitations of focusing on knowledge over reasoning is writing a SQL query given some data.
So what we found frequently was that if you gave the agent a whole bunch of tables and all of the columns in those tables, it would fail to reason correctly about which one to use. It would get overwhelmed by the number of tokens in the prompt and either choose the wrong one or just write a terrible query that didn't execute in the first place.
That's where we went to these more discrete, simpler building blocks of tool calls, such as search tables, get table detail, or profile a column. The agent is then tasked with using those iteratively to find the right columns for the right query. Similarly, we saw this play out as reasoning models have been introduced recently.
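For illustration, the tool surface looked roughly like this; the exact names and fields here are mine for the example, but the idea is small, single-purpose retrieval calls the agent chains on its own instead of one giant schema dump in the prompt.

```python
# Illustrative tool definitions (names and fields are hypothetical) in an
# OpenAI-style function-calling format: small retrieval building blocks the
# agent composes itself, rather than a full schema dump in the system prompt.
TOOLS = [
    {
        "name": "search_tables",
        "description": "Search warehouse tables by keyword; returns table names and short descriptions.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "get_table_detail",
        "description": "Return the columns, types, and description for one table.",
        "parameters": {
            "type": "object",
            "properties": {"table_name": {"type": "string"}},
            "required": ["table_name"],
        },
    },
    {
        "name": "profile_column",
        "description": "Return sample values, null rate, and distinct count for one column.",
        "parameters": {
            "type": "object",
            "properties": {
                "table_name": {"type": "string"},
                "column_name": {"type": "string"},
            },
            "required": ["table_name", "column_name"],
        },
    },
]
```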
So when you focus on reasoning, you give a reasoning model the ability to first attempt to find the data needed to answer a particular question. But then if it doesn't find it, it should be able to tell you that it didn't find it and you can take some action with that knowledge.
What we've seen with GPT-4o, prior to reasoning models coming out, is that regardless of its ability to answer a question or whether the underlying data is even present, it is going to attempt to write that query anyway. Let's walk through an example of this. So in this prompt, I'm providing GPT-4o with a table schema that is pretty standard from Salesforce.
So there's a table for accounts, contacts, and opportunities. And these aren't all the columns in each of those tables; I'm simplifying for illustration. At the bottom, I've asked GPT-4o a question: write a query to see how many of my customers churned in the last month. What you'll see from GPT-4o is that it is very incentivized to write a query and give me back SQL.
It's not really stopping to think about whether this query is even possible to write in the first place. So it makes some assumptions, and then it just starts writing SQL. Its definition of churn is really bad: it's essentially just looking in the accounts table and assuming there's some account type other than customer that would somehow get updated when a customer churned.
So I think this is very likely to lead an analyst to a totally wrong answer if they were to take this query and just run it. But what you see is that it's not pushing back in any way. It doesn't stop to say, "I should think about this and consider if this is even possible." Okay, now let's flip over and see the same prompt run on o1.
So this prompt is exactly the same up top: we're providing the same schema and the same question. o1 reasoned through the various aspects of this question, and it accurately concluded that there is no way, given the schema provided, to calculate whether an account has churned. That conclusion shows the difference between giving the model the freedom, and the encouragement, to think and reason versus essentially forcing it to come up with a SQL query.
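If you want to reproduce that comparison yourself, a rough harness looks something like this; the schema text is a stand-in for what was on the slide, and the model names are just the identifiers at the time of writing.

```python
# Rough harness for reproducing the comparison: send the identical schema and
# question to two models and see whether each one pushes back or just writes SQL.
from openai import OpenAI

client = OpenAI()

SCHEMA = """
accounts(id, name, type, created_at)
contacts(id, account_id, email)
opportunities(id, account_id, stage, amount, close_date)
"""
QUESTION = "Write a query to see how many of my customers churned in the last month."

for model in ["gpt-4o", "o1"]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{SCHEMA}\n\n{QUESTION}"}],
    )
    print(f"--- {model} ---\n{response.choices[0].message.content}\n")
```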
So that's one of the key lessons we learned in building and deploying agents that were useful for enterprises. As part of this, there is a huge need to iterate on what I would call the ACI. I believe there's a paper that coined this term: the Agent-Computer Interface.
It refers to the exact syntax and structure of tool calls: both what goes into the tool call, and the content and format of the response from the API or Python code that handles and executes that tool call. What we learned is that really small tweaks to the Agent-Computer Interface can have a massive impact on the accuracy and performance of your agent.
When you're making these tweaks, they will feel so trivial that it seems like they couldn't possibly have any bearing on your agent's performance. However, I'm telling you that this is actually one of the best ways you can spend your time when you're trying to get your agent working consistently.
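To give a flavor of what the ACI covers, here's an illustrative tool handler: both the arguments the agent sends in and the exact shape of what comes back are part of the interface you iterate on. The warehouse client and field names here are stand-ins, not our actual implementation.

```python
# Both halves of the ACI in one place: the arguments the agent sends in, and
# the exact shape of what it gets back. The warehouse client is hypothetical.
def handle_get_table_detail(table_name: str, warehouse) -> dict:
    table = warehouse.describe_table(table_name)   # hypothetical client call
    return {
        "table": table_name,
        "row_count": table.row_count,
        # Cap what goes back into the context window; the agent can call
        # profile_column for anything it needs to inspect more closely.
        "columns": [
            {"name": c.name, "type": c.type} for c in table.columns[:200]
        ],
        "truncated": len(table.columns) > 200,
    }
```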
A couple of specific examples. Number one: we found that the format of the response was consumed better or worse depending on the model, likely correlating with the model's underlying training data. Specifically, when working with GPT-4o, we changed how we formatted these search result payloads.
Initially, they were formatted as Markdown, and we were seeing cases where the agent would look at the response it got back from the tool call and tell us that a column did not exist, even though you could see the column right there in the tool result.
These were long-context tool results; oftentimes our customers had 500- or 1,000-column tables in their data warehouse. So it was understandable that if you're getting 30,000 tokens back, there might be some challenges. But for the agent to be consistently, completely blind to a column that was in the result, we felt there had to be a way to improve this.
So we tested different formats, and we ultimately learned that just switching the response format from Markdown to JSON, making it a semi-structured payload, immediately solved this problem for GPT-4o. However, we learned later on that for Claude, it was really important to provide XML back to the model, not Markdown or JSON.
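A sketch of what that ended up looking like; the model-to-format mapping here is just what worked for us, not a general rule.

```python
import json
from xml.sax.saxutils import quoteattr

# Same tool result rendered differently per model family. The mapping below is
# just what worked for us (JSON for GPT-4o, XML for Claude), not a general rule.
def format_tool_result(columns: list[dict], model: str) -> str:
    if model.startswith("claude"):
        rows = "".join(
            f"<column name={quoteattr(c['name'])} type={quoteattr(c['type'])}/>"
            for c in columns
        )
        return f"<columns>{rows}</columns>"
    # JSON (semi-structured) worked noticeably better than Markdown for GPT-4o
    return json.dumps({"columns": columns})
```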
So again, depending on the model you're using, the specific function arguments and the responses you provide from those tools can really impact your agent's performance. Think of the model as the brain when you're building an agent: the model is performing the thinking. And if the model sucks, then your users aren't going to be happy, because they're going to see some of the obvious logical fallacies that the agent will make.
So I think what's critical there is that even if some of your tasks need to run on a cheaper model, like some of your tool calls or some of the sub-prompts your agent might trigger, it's really important that the model actually deciding which tool call to make next, based on everything that has happened up to that point, is a generally intelligent model.
I would say Claude 3.5 Sonnet is still probably my favorite for this, even beyond the reasoning models, because it strikes a really nice balance between speed, cost, and making a good decision based on what it has learned so far.
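In practice that split can be as simple as a per-role model map; the model names here are examples of the idea, not a fixed recommendation.

```python
# Illustrative model routing: the call that decides the next tool runs on a
# strong model; narrow sub-tasks can run on something cheaper. Model names
# are examples, not recommendations frozen in time.
MODEL_FOR_ROLE = {
    "orchestrator": "claude-3-5-sonnet-latest",  # decides which tool to call next
    "summarize_result": "gpt-4o-mini",           # compresses long tool outputs
    "format_sql": "gpt-4o-mini",                 # cosmetic rewriting only
}

def pick_model(role: str) -> str:
    # Default to the strongest model if a role isn't explicitly listed.
    return MODEL_FOR_ROLE.get(role, MODEL_FOR_ROLE["orchestrator"])
```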
Another thing: we talked about GPT-4o versus o1, and how GPT-4o is incentivized to make an attempt even if the task is impossible. One thing you can learn by observing the failure modes of agents running on a certain model is that the way it hallucinates often tells you what the model expects in a tool call, for instance. So if you see it consistently ignoring your JSON schema for a tool call and providing an argument in a different format, that should be an indicator that the model is telling you how it thinks the tool call should be defined.
And if you can change the tool definition to match that expected format, you're generally going to improve agent performance, because you're staying closer to the training data and the model's native instincts rather than trying to force it into doing something else.
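Here's a hypothetical example of that kind of adjustment: if the model keeps collapsing two arguments into one dotted path no matter what your schema says, accept the dotted path.

```python
# Hypothetical adjustment: the model kept sending one dot-separated path instead
# of separate table/column arguments, so the tool schema is changed to match
# that instinct rather than fighting it.
BEFORE = {
    "name": "profile_column",
    "parameters": {
        "type": "object",
        "properties": {
            "table_name": {"type": "string"},
            "column_name": {"type": "string"},
        },
        "required": ["table_name", "column_name"],
    },
}

AFTER = {
    "name": "profile_column",
    "parameters": {
        "type": "object",
        "properties": {
            "column_path": {
                "type": "string",
                "description": "Dot-separated path, e.g. schema.table.column",
            },
        },
        "required": ["column_path"],
    },
}
```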
Another lesson we learned was that fine-tuning models was a waste of time. I think this is generally accepted now, but there's still a little bit of work happening on building agents with fine-tuned models. If you buy the premise that we're focusing on reasoning over the model's inherent knowledge, then it's logical to say that fine-tuning does not really improve reasoning.
Actually, in our experience it decreased reasoning in a lot of cases, because it effectively overfit or over-tuned the model to run a specific sequence of tasks each time rather than stopping to think about whether it was making the right decision. So I would really spend your time focusing on that ACI iteration rather than trying to build a fine-tuned model to run your agent on.
Another question that we got frequently from customers, users, and others was, "Hey, what abstraction are you using? Which framework are you building on?" And for two reasons, we did not end up using an abstraction. Number one is simple: when we started building this two years ago, none of the abstraction libraries like LangGraph or CrewAI were publicly available yet.
So we didn't really have a choice; we were basing some of our research off of Auto-GPT at the time. But the second reason is that even as those frameworks started to become more popular, we kept evaluating whether to transfer some of our code to them. The problem was that there are huge blockers and considerations when you want to go to production with an agent running on one of these frameworks.
One of the key things for us, as an example, was the ability for an end user's security credentials to cascade down to the agent they were talking to. Think about a human trying to use an agent to query their Snowflake account: they may have very granular permissions within that Snowflake account for what they specifically are allowed to see in the underlying data.
We needed our agent to be able to run with that user's permissions using an OAuth integration. And that was something that made an approach like LangGraph extremely difficult to build and scale, because we needed to manage the authentication process and the underlying service keys and tokens within our own code base, not within a third-party framework.
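A rough sketch of that pattern, assuming the Snowflake Python connector's OAuth support; the account name and token plumbing are placeholders, and the token would come from the signed-in user's session rather than a stored service key.

```python
# Sketch of cascading the end user's credentials down to the agent's tool call:
# the Snowflake connection is opened with the caller's OAuth access token, not a
# shared service key, so the user's own grants and access policies apply.
import snowflake.connector

def run_query_as_user(sql: str, user_oauth_token: str, account: str = "my_account"):
    conn = snowflake.connector.connect(
        account=account,               # placeholder account identifier
        authenticator="oauth",
        token=user_oauth_token,        # the *user's* token, so their permissions apply
    )
    try:
        cur = conn.cursor()
        cur.execute(sql)
        return cur.fetchall()
    finally:
        conn.close()

# The agent's SQL-execution tool would call run_query_as_user with the token
# forwarded from the signed-in user's session, never a stored service key.
```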
So the lesson I think to take away from that is think about what your end goal is first before you get too dependent on one of these frameworks. There is not too much code that you have to write to build an agent or even a multi-agent system. If you're in prototype mode, then sure, use an abstraction, speed yourself up, validate something as quickly as possible.
But if your goal at the end is production, you'll likely regret being too dependent on a third-party library. One of the other philosophical conclusions we reached is that ultimately your agent is not your moat, meaning the system prompts aren't the valuable part. I think the most valuable thing you can do is set up the ecosystem around your agent, including the user experience of how your users interact with your agent, and the connections and security protocols your agent has to follow in doing its work.
That is the most time-consuming part of building a production-quality agent into a product, and that is ultimately going to be your moat, inasmuch as we can even have moats these days with how quickly this stuff is moving. Last but not least, one of the key lessons we learned more recently was about designing and executing on multi-agent systems.
So about a year into our process of transitioning to an agent-based product, as our customers were getting comfortable with single agents, we introduced a multi-agent concept. These are some of the key lessons from doing that that really stuck with us and, I think, continue to be highly relevant when you're designing agents in a product.
Number one is the need to implement a manager agent within a hierarchy. The reason for that is that we found the manager agent would own the final outcome but could delegate subtasks to specific worker agents that would have more context in their instructions and more specific tool calls to accomplish those tasks.
Whereas if you gave all of that information to a single manager agent, it could become overwhelmed, make bad decisions, and go down bad paths. We also learned that there's almost a two-pizza rule for the number of agents working together, similar to how Jeff Bezos designed teams early on at Amazon.
So we found that if you could limit yourself to roughly five to eight agents working together, that was typically a task a multi-agent team could accomplish well. I've seen and prototyped some systems where you might have 25 or 50 agents working together, and what really happens is you strongly decrease the likelihood that the actual outcome ever gets accomplished, because you're likely to trigger infinite loops or go down paths that you don't return from.
Incentivization is the number one way to set these systems up. So the goal should not be to force your worker agents through a discrete set of steps, but rather to incentivize your manager agent: describe the overall objective and, quote-unquote, reward it for accomplishing that objective, and rely on it to manage the underlying worker agents and make sure their output is valuable and can be used in the context of achieving that broader outcome.
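A minimal sketch of that manager/worker shape; the names, prompts, and tools are purely illustrative. The manager's instructions carry the overall objective, and each worker only sees its narrower slice.

```python
# Illustrative manager/worker setup: the manager owns the overall objective and
# delegates; each worker gets narrower instructions and a narrower tool set.
# Names, prompts, and tools here are made up for the sketch.
MANAGER_INSTRUCTIONS = (
    "You own the final answer to the user's question. Delegate subtasks to the "
    "workers below, and only finish when their combined output actually answers it."
)

WORKERS = {
    "schema_scout": {
        "tools": ["search_tables", "get_table_detail"],
        "instructions": "Find the tables and columns relevant to the subtask.",
    },
    "sql_writer": {
        "tools": ["profile_column", "execute_sql"],
        "instructions": "Write and validate a query for the subtask you are given.",
    },
    "qa_reviewer": {
        "tools": ["execute_sql"],
        "instructions": "Check the query results against the stated objective.",
    },
}  # keep the team small -- roughly the five-to-eight agent "two-pizza" range

def build_worker_objective(worker: str, subtask: str) -> str:
    """What the manager's delegate tool would hand to one worker agent."""
    spec = WORKERS[worker]
    return f"{spec['instructions']}\n\nSubtask from the manager: {subtask}"
```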
I wrote more about designing effective multi-agent teams on my blog at astorapp.com. This is the blog post, where I go into a little more detail about these principles and some other thoughts as well. Thanks so much for your time, and I hope you enjoyed learning about all of the mistakes I've made over the last couple of years designing agent systems and multi-agent systems.
I hope you can avoid them and that it saves you some time.