Cohere: Building enterprise LLM agents that work (Shaan Desai)


Transcript

Hello. My name is Shaan, and I'm a machine learning engineer here at Cohere, and today I'll be talking to you about building enterprise LLM agents that work. A quick overview: we'll have an introduction, we'll discuss some of the frameworks and approaches that we're really excited about, we'll address some of the critical components around evaluation and failure mitigation for LLM agents, and then ideally, hopefully, bring all of these things together into a nice product overview.

So, you know, agents continue to be the most exciting application of generative AI today, with growing demand across a host of sectors, including customer support assistants, personal assistants, RAG agents, as well as financial analyst agents. However, any developer who's spent time building LLM agents knows that doing so in a scalable, safe, and seamless way is actually a very difficult and challenging task.

You might ask, why is that so? Well, it turns out there's a plethora of frameworks, tools, models, approaches, and evaluation criteria to choose from and to effectively put together into one end-to-end pipeline. So we really hope in this talk we can go through the critical decision-making process in setting up enterprise agents, touching on the insights and key learnings we've had in building these agents, from the frameworks we love, to single- versus multi-agent strategies, to some of the less-discussed but critical components of evaluating LLM agents.

So let's start with frameworks. Now, over the past few years, an increasing number of frameworks have come onto the market, including AutoGen, CrewAI, and LangChain. They all have their own benefits and disadvantages, depending on a given use case.

But our core learning in the past year has really been to focus on three critical criteria. The first is observability, right? Is it easy to debug and fix? The second is the setup cost, you know, how quickly can you iterate and resolve an issue, as well as build and piece together the entire agent you're interested in building?

And then, of course, lastly is support. You know, is the framework well-documented? Does it support various models and tools and functionalities? So we've often viewed these frameworks under these three criteria. And generally, we tie these three criteria to a given use case. And more concretely, what this might look like is building large-scale enterprise agents often requires high levels of observability, for which we would really recommend going native or building with LangGraph.

Now, of course, the space of frameworks is a continuously evolving landscape, and so this is a recommendation at this point in time. But we obviously expect this to change as frameworks continue to improve their observability and ease of use. And, you know, in the same vein, what we'd recommend for quick tests and proofs of concept is frameworks like CrewAI and AutoGen.

And the reason for this is that there's generally a low setup uplift, with little code needed to get things working out of the box. And, of course, they make it easy to leverage pre-existing or pre-built agents and tools and orchestrate them all together in a multi-agent setting. So these are immediate recommendations.

Of course, here at Cohere, we're continuously improving our integration support for these various frameworks, and we hope to continue that support as we watch this space evolve. In part, what we're particularly excited about is seeing a sliding spectrum across these various frameworks for different use cases. Okay.

Now, once you decide on which framework you want to use, of course, you need to decide on the approach or strategy you plan to use this framework in, right? Do you plan to use a single agent? Do you plan to use multi-agent? Will you have human-in-the-loop feedback? Our core recommendation, and this comes from insights across a number of use cases, is to always start simple.

A single LLM with a handful of tools can often go a long way. But more importantly, being very diligent about the tool specifications really helps uplift performance. So what we found is, you know, we worked with one client, and one of their asks was, hey, we've got a long list of APIs.

And these API specifications could take in up to 10 to 15 different parameters. And could you get a model to successfully run tool calls for these tasks? What we've really found is that to achieve the performance gains they were after, we needed to really simplify the entire approach.

We needed clear descriptions with very sharp examples of how to call the tool, as well as simplified input types, so converting complex nested dictionaries into list, str, or float types. Now, in addition to these learnings, we've also found that providing a clear instruction list that is short, pithy, and to the point goes a much longer way than providing a long set of instructions, which can actually confuse the model and induce potential hallucinations.
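To make that concrete, here is a minimal sketch of what a simplified tool specification might look like. The tool name, parameter names, and schema layout are purely illustrative, not a specific Cohere or client API; the point is the short description, the sharp usage example, and the flat list/str/float input types.

```python
# Hypothetical tool specification illustrating the simplification described above.
# The tool name, fields, and schema style are illustrative, not a real API.

get_account_transactions = {
    "name": "get_account_transactions",
    # Short, concrete description with a sharp usage example baked in.
    "description": (
        "Return transactions for a single account over a date range. "
        "Example call: get_account_transactions(account_id='ACC-123', "
        "start_date='2024-01-01', end_date='2024-01-31', categories=['travel'])"
    ),
    # Flat, simple parameter types (str, float, list) instead of nested dictionaries.
    "parameters": {
        "account_id": {"type": "str", "description": "Account identifier, e.g. 'ACC-123'."},
        "start_date": {"type": "str", "description": "Inclusive start date, ISO format YYYY-MM-DD."},
        "end_date": {"type": "str", "description": "Inclusive end date, ISO format YYYY-MM-DD."},
        "min_amount": {"type": "float", "description": "Only return transactions above this amount."},
        "categories": {"type": "list", "description": "Optional list of category names to filter on."},
    },
}
```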

We've also noticed that long streams of chat history, in other words, back-and-forth conversations between the user and the chatbot that go over 20 turns, for example, can induce certain hallucinations. And this is true across a whole host of models and frameworks. To handle that particular problem, we really recommend caching. Essentially, caching that history and retrieving it whenever it is particularly relevant to a new user query can actually help your LLM agent achieve better performance over time.
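As a rough illustration of the caching idea, here is a minimal sketch that archives older turns and re-injects only the ones relevant to a new query. The class and its word-overlap scoring are our own stand-ins for illustration; a production system would typically use embeddings or the retrieval layer of your framework.

```python
# Minimal sketch: keep only recent turns live in the prompt, archive the rest,
# and retrieve archived turns when they are relevant to the new user query.
from collections import deque

class HistoryCache:
    def __init__(self, max_live_turns=10):
        self.live = deque(maxlen=max_live_turns)  # recent turns kept in the prompt
        self.archive = []                         # older turns moved out of the prompt

    def add_turn(self, role, text):
        if len(self.live) == self.live.maxlen:
            self.archive.append(self.live[0])     # oldest live turn spills into the archive
        self.live.append((role, text))

    def relevant_history(self, query, top_k=3):
        """Retrieve archived turns that share the most words with the new query."""
        q = set(query.lower().split())
        scored = [(len(q & set(text.lower().split())), (role, text))
                  for role, text in self.archive]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [turn for score, turn in scored[:top_k] if score > 0]

    def build_context(self, query):
        # Prompt context = relevant archived turns + the recent live turns.
        return self.relevant_history(query) + list(self.live)
```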

And we'll get to what we mean by performance in some later slides. Now, indeed, there are frameworks such as AutoGen that support multi-agent-style orchestration. In the multi-agent setting, you essentially have a collection of simple agents tied together, with a routing model that decides which sub-agent to go to and retrieve information from.

And there's been a growing interest in the industry in building multi-agents that are very robust and versatile. Of course, this requires a good routing model, a good reasoning model, and sub-agents that are well-constrained. And so, what we've learned for the router is that it should really contain a list of tools with clear descriptions.

That always holds. But it should also contain a sharp set of routing instructions that can encompass potential edge cases, right? So, if you're trying to route information from the router to a sub-agent and then back to another agent, providing that type of clarity and instruction to the model can really help it decide what it should do at each stage, rather than autonomously and continuously attempting things that may not be the most optimal path to the final answer.

Of course, we also recommend that for sub-agents, they should be constrained to performing independent tasks with a small set of tools to return the final answer, right? Each sub-agent should be decomposed into a specific task that it should handle. So, those are key insights we've had from building both simple and multi-agents in the enterprise setting.
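A schematic of that router/sub-agent layout might look like the following. The sub-agent names, tool lists, routing instructions, and the `call_llm(prompt) -> str` helper are all assumptions for illustration rather than any particular framework's API.

```python
# Sketch of a router over constrained sub-agents, assuming you already have
# some call_llm(prompt) -> str function. Names and prompts are illustrative only.

SUB_AGENTS = {
    "hr_support": {
        "description": "Answers HR policy questions using the policy-search and leave-balance tools only.",
        "tools": ["search_policies", "get_leave_balance"],
    },
    "financial_analyst": {
        "description": "Pulls account metrics and computes summaries using the reporting tools only.",
        "tools": ["get_account_transactions", "summarize_report"],
    },
}

# Sharp routing instructions that spell out the edge cases up front.
ROUTING_INSTRUCTIONS = """You are a router. Pick exactly one sub-agent for the user query.
- If the query mixes HR and finance, pick the sub-agent for the user's primary ask.
- If no sub-agent fits, answer 'none' instead of guessing.
Respond with only the sub-agent name."""

def route(query, call_llm):
    agent_list = "\n".join(f"- {name}: {cfg['description']}" for name, cfg in SUB_AGENTS.items())
    prompt = f"{ROUTING_INSTRUCTIONS}\n\nSub-agents:\n{agent_list}\n\nUser query: {query}"
    choice = call_llm(prompt).strip()
    # Each sub-agent then runs its own independent task with only its small tool set.
    return SUB_AGENTS.get(choice)
```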

And now, the most important bit, right? We've glossed over the fact that agents can act quite autonomously to achieve final results, but we do think safety is paramount to any scalable real-world application, right? And here are some examples. If we decide to use a Gmail agent, for example, we may want to request permission prior to sending emails, right?

We might want the user to get a pop-up that says, "Hey, are you okay with me sending this email?" right? We don't want random emails to be sent. And this might be true in the HR support bot setting, as well as in the financial analysis agent setting. What we've learned, essentially, is that incorporating human-in-the-loop is really critical for business applications.

And what's really nice about it is that you can codify a set of rules under which human-in-the-loop is triggered, right? So, under various criteria, we can force human-in-the-loop to be triggered. Typically, this happens right before a tool is called. But it could also happen right after a tool call is made, especially if the execution output, for example, may contain various pieces of information that you may not want to process completely.
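One way to codify those rules is sketched below. The tool names, sensitivity markers, and console-prompt approval are placeholders of our own; in a real application the approval step would be a UI pop-up or a review queue.

```python
# Illustrative human-in-the-loop triggers around a tool call.
# Trigger points: right before side-effecting tools, and right after tool calls
# whose output looks sensitive. All names here are hypothetical.

REQUIRE_APPROVAL_BEFORE = {"send_email", "update_salesforce_record"}   # side-effecting tools
SENSITIVE_MARKERS = ("salary", "ssn", "password")                      # flag outputs for review

def approved_by_human(message):
    # Stand-in for a real approval UI (e.g. the "okay to send this email?" pop-up).
    return input(f"{message} [y/N]: ").strip().lower() == "y"

def run_tool_with_hitl(tool_name, tool_fn, **kwargs):
    # Human-in-the-loop right before the tool call.
    if tool_name in REQUIRE_APPROVAL_BEFORE:
        if not approved_by_human(f"About to call {tool_name} with {kwargs}. Proceed?"):
            return {"status": "cancelled_by_user"}
    result = tool_fn(**kwargs)
    # Human-in-the-loop right after the call if the output may be sensitive.
    if any(marker in str(result).lower() for marker in SENSITIVE_MARKERS):
        if not approved_by_human("Tool output may contain sensitive data. Pass it to the model?"):
            return {"status": "output_withheld"}
    return result
```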

Okay, great. So we've addressed frameworks. We've addressed various approaches we've explored and the insights we've gained. Now, importantly, we need to discuss evaluation. How are we going to assess the performance of the agent that we've built? So, you know, what really makes a successful agent is a lot of things, right?

It's a lot of moving pieces that need to come together for it to be successful. Essentially, the model needs to make the right tool call at the right time. The model needs to be able to receive executed tool results and reason on top of them. And it needs to make tool calls very succinctly and accurately, passing the right input parameters.

And it needs to have the ability to course-correct even when things are going wrong, right? So what's quite interesting here is that for the end user, the only thing that particularly matters is the final answer they get from the agent.

But what matters most to developers, I think, as they're debugging and understanding how the LLM is making decisions, is not just the final output, but all the intermediate stages that go into getting to the final answer. And so we have an example here where a user may ask a model to provide information about the weather in New York City on February 5th.

Ideally, the model should decide to use a specific tool, pass in the right parameters, get a returned set of results from that tool, and reason over the returned response to provide a final output, which is that New York City will be mostly sunny, etc. Now, as you can see, there are a number of intermediate stages that take place to get to the final response.

And typically, what we do here at Cohere is we build a golden set of ground truth user queries, expected function calls, expected parameter inputs, expected outputs, as well as expected final response. The nice thing about doing this and building this evaluation set is that we can run this large corpus of evaluations through our agentic framework and assess any critical points of failure or where we think the model may be going wrong.
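To make the shape of such a golden set concrete, here is an illustrative record and a minimal check that mirrors the weather example above. The field names and the `agent_run` trace format are our own conventions for this sketch, not a Cohere evaluation API.

```python
# One hypothetical "golden" record covering each intermediate stage:
# query -> expected tool -> expected parameters -> expected tool output -> expected answer.

golden_example = {
    "user_query": "What is the weather in New York City on February 5th?",
    "expected_tool": "get_weather",
    "expected_params": {"city": "New York City", "date": "2024-02-05"},
    "expected_tool_output": {"condition": "mostly sunny", "high_c": 4},
    "expected_final_answer_contains": "mostly sunny",
}

def evaluate(agent_run, example):
    """Compare an agent trace against a golden record, checking each intermediate stage."""
    return {
        "tool_correct": agent_run["tool"] == example["expected_tool"],
        "params_correct": agent_run["params"] == example["expected_params"],
        "answer_correct": example["expected_final_answer_contains"] in agent_run["final_answer"].lower(),
    }
```

Running a large corpus of such records through the agent and aggregating the per-stage flags is what makes the critical points of failure easy to localize.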

And this makes debugging particularly easy from an evaluation standpoint. Now, you might be asking why I've mentioned debugging and observability as very important. Well, it turns out that autonomous LLM agents do indeed have a tendency to fail, as most developers know. And so we at Cohere are continuously exploring various failure mitigation strategies, right?

And what we've really come down to is this table of insights. It's really short and simple, but it's essentially that if you're working with failures at a low severity or a low failure rate, what we've found is that prompt engineering can go a really long way: simply improving the quality of the tool API specs or the tool inputs can really help close the final mile on performance gaps.

However, if you do see tool-type failures or the model hallucinating on specific tasks in the 10 to 20% range, what we've really found is that building a targeted annotation data set is really useful for closing the gap. And lastly, and perhaps most critically, if you are seeing a high failure rate, particularly if an API is very difficult to call or API names are very similar and you need to disambiguate between them, building a larger corpus using synthetic data and fine-tuning is the strategy that we employ here at Cohere.
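As a rough summary of those thresholds, here's a tiny helper encoding the decision rule; the cut-offs are the approximate ranges from the talk, not hard boundaries.

```python
# Approximate mapping from observed failure rate to mitigation strategy,
# following the ranges described above (illustrative, not a hard rule).
def mitigation_strategy(failure_rate):
    if failure_rate < 0.10:
        return "prompt engineering: tighten tool descriptions and input specs"
    if failure_rate <= 0.20:
        return "build a targeted annotation data set for the failing tasks"
    return "generate synthetic data and fine-tune the model"
```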

So I've talked to you about frameworks, approaches, various evaluation criteria and failure mitigation strategies. And what's quite nice here is that at Cohere, we're constantly working on developing and improving these various criteria. And one way in which we do this is we're continuously improving the base model performance at tool calling.

And as you can see here, we're particularly performant on BFCL v3, which is a standard benchmark for single- and multi-hop tool calling. And it's a really highly performant 7B model, as there is continued interest in really lightweight tool-calling models. In addition to this, we're also codifying this whole host of insights.

So in essence, we're bringing together the learnings from the frameworks, approaches, and deployment of these models in the wild for agentic applications into a single product, a product we've termed North. Essentially, it's a single-container deployment that has access to RAG, has access to various vector DBs and search capabilities, and also has connectivity to various applications of interest, including Gmail, Outlook, Drive, and Slack, to name a few.

So you can think of North as a one-stop shop for using and building agentic applications as a single package. So I even have a demo for you here from North, and this is it in motion. Essentially, it's connected to Gmail, Slack, Salesforce, and G Drive. The question is asked about opportunities in Salesforce.

The model invokes reasoning chains of thought, essentially. It's able to pull the relevant document of interest and essentially provide a breakdown of both the reasoning chain, the tools that were called and the tool outputs, which is pretty nice if you're hoping to debug and assess what the model is doing under the hood.

You can also then retrieve information from recent conversations, and essentially this would again pull from Salesforce calls using a SQL-like query. And you can also correct specific tool-calling behavior. For example, you could ask the model to correct which tool call was used. And ideally, what the model does is update its reasoning, and the package then decides to use Gmail and return the relevant information.

So I hope this was an insightful talk, and hopefully you've taken away some of the learnings about deploying enterprise LLM agents that we found particularly useful and have packaged into North. Thank you.