Cohere: Building enterprise LLM agents that work (Shaan Desai)

00:00:00.000 |
Hello. My name is Shaan, and I'm a machine learning engineer here at Cohere, and today I'll be talking 00:00:10.240 |
to you about building enterprise LLM agents that work. So a quick overview is that we'll have an 00:00:18.720 |
introduction, we'll discuss some of the frameworks and approaches that we're really excited about, 00:00:23.520 |
and we'll also address some of the critical components around evaluation and failure 00:00:28.960 |
mitigation for LLM agents, and then ideally, hopefully, bring all of these things together 00:00:34.720 |
into a nice product overview. So, you know, agents continue to be the most exciting application of 00:00:43.760 |
generative AI today, with growing demand across a host of sectors, including customer support 00:00:49.520 |
assistants, personal assistants, RAG agents, as well as financial analyst agents. However, any developer 00:00:59.120 |
who's spent time building LLM agents knows that doing so in a scalable, safe, and seamless way 00:01:07.520 |
is actually a very difficult and challenging task. You might ask, why is that so? Well, it turns out 00:01:16.880 |
there's a plethora of frameworks, tools, models, approaches, and evaluation criteria to choose from, 00:01:24.640 |
and to effectively put together into one end-to-end pipeline. So we really hope in this talk we can go 00:01:32.320 |
through the critical decision-making process in setting up enterprise agents, really touching on 00:01:39.280 |
the insights and key learnings we've had in building these agents, from addressing the frameworks we love, 00:01:46.160 |
to discussing single versus multi-agent strategies, as well as addressing some of the critical components 00:01:52.960 |
that are less discussed around evaluating LLM agents. 00:01:58.880 |
So let's start with frameworks. Now, over the past few years, there have been an increasing number of 00:02:07.840 |
frameworks that have come onto the market, such as Autogen, CrewAI, 00:02:15.600 |
as well as LangChain. Now, they all have their own benefits and disadvantages, depending on a given use 00:02:25.040 |
case. But our core learning in the past year has really been to focus on three critical components, 00:02:34.880 |
and those are observability, right? Is it easy to debug and fix? The second is the setup cost, you know, 00:02:42.640 |
how quickly can you iterate and resolve an issue, as well as build and piece together the entire 00:02:48.960 |
agent you're interested in building? And then, of course, lastly is support. You know, is the framework 00:02:56.880 |
well-documented? Does it support various models and tools and functionalities? 00:03:04.160 |
So we've often viewed these frameworks under these three criteria. And generally, we tie these three 00:03:14.480 |
criteria to a given use case. And more concretely, what this might look like is building large-scale 00:03:22.000 |
enterprise agents often requires high levels of observability, for which we would really recommend 00:03:29.520 |
going native or building with LangGraph. Now, of course, the space of frameworks is a continuously 00:03:36.640 |
evolving landscape. And so this is a recommendation at this point in time. But we obviously expect this to 00:03:44.400 |
change as frameworks continue to improve their observability and ease of use. And, you know, 00:03:54.400 |
in the same vein, what we'd recommend for quick tests and proof of concepts is frameworks like CrewAI and 00:04:00.160 |
Autogen. And the reason for this is that there's generally a low setup cost, with little code needed to get 00:04:07.600 |
things working out of the box. And, of course, they make it easy to leverage pre-existing or pre-built 00:04:14.000 |
agents and tools and orchestrate them all together in a multi-agent setting. So these are immediate 00:04:20.240 |
recommendations. Of course, here at Cohere, we're continuously improving our integration support for 00:04:26.240 |
these various frameworks. And we hope to continue that support and keep watching this space evolve. 00:04:33.360 |
And in part, what we're particularly excited about is seeing a sliding scale spectrum 00:04:43.200 |
across these various frameworks for different use cases. Okay. Now, once you decide on which framework 00:04:52.240 |
you want to use, of course, you need to decide on the approach or the strategy that you plan to use 00:04:58.480 |
this framework in, right? Do you plan to use single agent? Do you plan to use multi-agent? Will you have 00:05:04.080 |
human-in-the-loop feedback? Our core recommendation, and these are insights that have come from a number of use 00:05:11.520 |
cases, is always start simple. A single LLM with a handful of tools can often go a long way. But more 00:05:21.520 |
importantly, being very diligent about the tool specifications really helps uplift performance. 00:05:30.880 |
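As a rough illustration of that "start simple" advice, the sketch below shows a single LLM driving a handful of tools in a plain loop. The call_llm helper, the get_weather tool, and the message format are hypothetical stand-ins, not any particular framework's API.

```python
# Minimal single-agent loop: one LLM, a handful of well-described tools.
# call_llm is a hypothetical stand-in for a real chat/tool-calling API;
# here it just fakes one tool call and then answers from the tool result.

def get_weather(city: str, date: str) -> str:
    """Toy tool: return a short weather summary."""
    return f"Forecast for {city} on {date}: mostly sunny."

TOOLS = {"get_weather": get_weather}

def call_llm(messages, tools):
    """Hypothetical model call; swap in whichever provider API you actually use."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "get_weather",
                              "arguments": {"city": "New York City",
                                            "date": "February 5"}}}
    last_tool = [m for m in messages if m["role"] == "tool"][-1]
    return {"content": last_tool["content"]}

def run_agent(user_query: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        response = call_llm(messages, tools=list(TOOLS))
        if "tool_call" in response:
            name = response["tool_call"]["name"]
            args = response["tool_call"]["arguments"]
            result = TOOLS[name](**args)              # execute the requested tool
            messages.append({"role": "tool", "name": name, "content": result})
        else:
            return response["content"]                # final answer from the model
    return "Stopped after max_steps without a final answer."

print(run_agent("What's the weather in New York City on February 5th?"))
```

The point is just the shape: one model, a few well-described tools, and a loop that executes requested calls and feeds the results back.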
So what we found is, you know, we worked with one client, and one of their asks was, 00:05:40.080 |
hey, we've got a long list of APIs. And these API specifications could take in up to 10 to 15 different 00:05:49.280 |
parameters. And could you get a model to successfully run tool calls for these tasks? What we've really 00:06:00.640 |
found is, to achieve the performance gains they were after, we needed to really simplify the 00:06:09.120 |
entire approach. We needed clear descriptions with very sharp examples on how to call the tool, 00:06:14.480 |
as well as simplifying the input types, converting complex nested dictionaries into list, 00:06:23.440 |
str, or float types. Now, in addition to these learnings, we've also found that providing a clear 00:06:32.000 |
instruction list, which is short, pithy, and to the point, goes a much longer way than providing a long set 00:06:41.200 |
of instructions that can actually confuse the model and induce potential hallucinations. 00:06:48.480 |
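To make the tool-spec advice concrete, here is a before/after sketch in a generic JSON-schema-like style; the search_transactions tool and its fields are hypothetical examples, not a real API.

```python
# Illustration of simplifying a tool spec: flat parameters with simple types
# (list / str / float), a sharper description, and a worked example call.

# Before: deeply nested parameters that models often fill incorrectly.
complex_tool_spec = {
    "name": "search_transactions",
    "description": "Search transactions.",
    "parameters": {
        "filter": {
            "type": "object",
            "properties": {
                "date_range": {"type": "object",
                               "properties": {"start": {"type": "string"},
                                              "end": {"type": "string"}}},
                "categories": {"type": "array", "items": {"type": "string"}},
                "min_amount": {"type": "number"},
            },
        }
    },
}

# After: flat, simply typed parameters and an example call in the description.
simple_tool_spec = {
    "name": "search_transactions",
    "description": (
        "Search transactions between two dates, optionally filtered by category "
        "and minimum amount. Example: search_transactions(start_date='2024-02-01', "
        "end_date='2024-02-05', categories=['travel'], min_amount=50.0)"
    ),
    "parameters": {
        "start_date": {"type": "str", "description": "ISO date, e.g. 2024-02-01"},
        "end_date": {"type": "str", "description": "ISO date, e.g. 2024-02-05"},
        "categories": {"type": "list", "description": "Category names to include"},
        "min_amount": {"type": "float", "description": "Minimum transaction amount"},
    },
}
```

The flattened version, with an example call embedded in the description, is what we mean by sharp descriptions and simple input types.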
We've also found that long streams of chat history, in other words, back-and-forth conversations between the user and chatbot that go 00:07:00.640 |
over 20 turns, for example, can induce certain hallucinations. And this is true across a whole host of models and 00:07:09.440 |
frameworks. To handle that particular problem, we really recommend caching. Essentially caching that history 00:07:17.120 |
and retrieving it whenever it is particularly relevant to a new user query can actually help your LLM agent 00:07:25.600 |
achieve better performance through time. And we'll get to what we mean by performance in some later slides. 00:07:33.840 |
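Here's a minimal sketch of that caching idea: store past turns outside the prompt and retrieve only the ones relevant to the new query. The embed function is a stand-in for whatever embedding model you actually use; a crude character-frequency vector just keeps the sketch self-contained.

```python
# Sketch of caching long chat history and retrieving only the turns relevant
# to a new query, instead of resending all 20+ turns verbatim each time.
from typing import List, Tuple

def embed(text: str) -> List[float]:
    """Hypothetical embedding call; replace with a real embedding model."""
    vec = [0.0] * 128
    for ch in text.lower():
        vec[ord(ch) % 128] += 1.0
    return vec

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

class HistoryCache:
    """Stores past turns outside the prompt and returns only the relevant ones."""
    def __init__(self) -> None:
        self.turns: List[Tuple[str, List[float]]] = []

    def add_turn(self, text: str) -> None:
        self.turns.append((text, embed(text)))

    def relevant_turns(self, query: str, k: int = 5) -> List[str]:
        q = embed(query)
        ranked = sorted(self.turns, key=lambda t: cosine(q, t[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

The retrieved turns are then prepended to the prompt alongside the new query, so the active context stays short even as the full conversation grows.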
Now, indeed, there are frameworks such as Autogen that support multi-agent style orchestration. 00:07:43.440 |
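Before walking through the details, here is a minimal sketch of the router-plus-sub-agents pattern described next. The sub-agent names and the keyword-based route_query stand-in are hypothetical; in practice the routing decision would itself be an LLM call given the tool descriptions and routing instructions.

```python
# Minimal sketch of a routing model over constrained sub-agents.

def hr_agent(query: str) -> str:
    return f"[HR agent] answer to: {query}"

def finance_agent(query: str) -> str:
    return f"[Finance agent] answer to: {query}"

SUB_AGENTS = {
    "hr": (hr_agent, "Handles HR policy and benefits questions only."),
    "finance": (finance_agent, "Handles financial reporting questions only."),
}

ROUTING_INSTRUCTIONS = (
    "Pick exactly one sub-agent. If the query spans both domains, answer the "
    "HR part first, then route the remainder. If neither fits, say so."
)

def route_query(query: str) -> str:
    """Stand-in for an LLM routing call; here a crude keyword rule."""
    return "finance" if any(w in query.lower() for w in ("revenue", "report", "q3")) else "hr"

def run_multi_agent(query: str) -> str:
    choice = route_query(query)          # in practice, an LLM call given the
    agent_fn, _ = SUB_AGENTS[choice]     # descriptions and ROUTING_INSTRUCTIONS
    return agent_fn(query)
```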
And so, you know, in the multi-agent setting, it's a collection of simple 00:07:50.400 |
agents tied together. And they have a routing model that decides which sub-agent to go to and retrieve 00:07:56.800 |
information from. And there's been a growing interest in the industry to build multi-agents that are very 00:08:02.640 |
robust and versatile. Of course, this requires a good routing model, good reasoning model, and of course, sub-agents 00:08:11.920 |
that are well-constrained. And so, what we've learned for the router is that it should really contain a list of 00:08:17.920 |
tools with clear descriptions. That always holds. But it should also contain a sharp set of routing instructions 00:08:25.600 |
that can encompass potential edge cases, right? So, if you're trying to route information, 00:08:32.960 |
from the router to a sub-agent, and then back to another agent, providing that type of clarity and 00:08:39.840 |
instruction to the model can really help it decide what it should do at each stage, rather than it 00:08:45.760 |
autonomously and continuously trying to attempt things that may not be the most optimal path to getting to the 00:08:54.240 |
final answer. Of course, we also recommend that for sub-agents, they should be constrained to performing 00:09:01.440 |
independent tasks with a small set of tools to return the final answer, right? Each sub-agent should be 00:09:08.160 |
decomposed into a specific task that it should handle. So, those are key insights we've had from building both 00:09:15.520 |
simple and multi-agents in the enterprise setting. And now, the most important bit, right? We've glossed 00:09:25.120 |
over the fact that agents can act quite autonomously to achieve final results, but we do think safety is 00:09:34.000 |
paramount to any scalable real-world application, right? And here are some examples. If we decide to use 00:09:44.560 |
a Gmail agent, for example, we may want to request permission prior to sending emails, 00:09:52.160 |
right? We might want the user to get a pop-up that says, "Hey, are you okay with me sending this email?" 00:09:59.280 |
right? We don't want random emails to be sent. And this might be true in the HR support bot setting, 00:10:05.760 |
as well as in the financial analysis agent setting. What we've learned essentially is that incorporating 00:10:12.320 |
human-in-the-loop is thus, like, really critical for business applications. And what's really nice 00:10:17.840 |
about it is that you can codify a set of rules under which human-in-the-loop is triggered, right? So, 00:10:24.960 |
under various criteria, we can force human-in-the-loop to be triggered. And typically, 00:10:29.920 |
this can happen right before a tool is called. But it could also happen 00:10:36.880 |
right after a tool call is made, especially if the execution output, for example, may contain 00:10:44.240 |
various parts of information that you may not want to process completely. Okay, great. So we've addressed 00:10:52.800 |
frameworks. We've addressed various approaches we've explored and the insights we've gained. 00:10:58.640 |
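Before turning to evaluation, here is a minimal sketch of how those human-in-the-loop triggers might be codified around a tool call; the tool names and rules are hypothetical examples rather than a fixed policy.

```python
# Sketch of codifying when human-in-the-loop is triggered around a tool call.

SENSITIVE_TOOLS = {"send_email", "update_salesforce_record"}

def needs_approval_before(tool_name: str, args: dict) -> bool:
    """Gate *before* execution: anything that sends messages or mutates data."""
    return tool_name in SENSITIVE_TOOLS

def needs_approval_after(tool_name: str, output: str) -> bool:
    """Gate *after* execution: outputs that may contain sensitive information."""
    lowered = output.lower()
    return any(term in lowered for term in ("ssn", "salary", "password"))

def execute_with_hitl(tool_name, tool_fn, args, ask_user):
    """ask_user is a callback that shows a confirmation prompt and returns True/False."""
    if needs_approval_before(tool_name, args):
        if not ask_user(f"OK to run {tool_name} with {args}?"):
            return "Action cancelled by user."
    output = tool_fn(**args)
    if needs_approval_after(tool_name, output):
        if not ask_user("The result may contain sensitive data. Show it?"):
            return "Result withheld pending review."
    return output
```

Here ask_user would be wired to whatever confirmation surface the product exposes, such as the pop-up mentioned above.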
Now, importantly, we need to discuss evaluation. How are we going to assess the performance of the 00:11:06.080 |
agent that we've built? So, you know, what really makes a successful agent is a lot of things, right? 00:11:15.920 |
It's a lot of moving pieces that need to come together for it to be successful. Essentially, 00:11:20.400 |
the model needs to make the right tool call at the right time. The model needs to be able to 00:11:27.760 |
essentially receive executed tool results and reason on top of it. And it needs to make tool calls very 00:11:34.560 |
succinctly and accurately passing the right input parameters. And it needs to have the ability to 00:11:39.600 |
course correct even when things are going wrong, right? So what's quite interesting here is that for the 00:11:46.640 |
end user, the only thing that particularly matters to them is the final product or 00:11:52.800 |
the final answer they get from the agent. But what matters most to, I think, developers as they're debugging 00:12:00.560 |
and understanding how the LLM is making decisions is not just the final output, but all the intermediate 00:12:07.200 |
stages that go into getting to the final answer. And so we have an example here where, for example, a user may 00:12:15.520 |
ask a model to provide information about weather in New York City on February 5th. Ideally, the model should 00:12:25.760 |
decide to use a specific tool, pass in the right parameters, get a returned set of results from those 00:12:35.680 |
tools and reason over the returned response to provide a final output, which is New York City will be mostly 00:12:43.200 |
sunny, etc. Now, as you can see, there are a number of intermediate stages that take place to 00:12:50.880 |
get to the final response. And typically, what we do here at Cohere is we build a golden set of ground truth 00:13:00.640 |
user queries, expected function calls, expected parameter inputs, expected outputs, as well as expected final 00:13:09.840 |
response. The nice thing about doing this and building this evaluation set is that we can run this large corpus of 00:13:17.520 |
evaluations through our agentic framework and assess any critical points of failure or where we think the model 00:13:25.360 |
may be going wrong. And this makes debugging particularly easy from an evaluation standpoint. 00:13:31.520 |
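Concretely, a golden-set record and its per-stage checks might look like the sketch below; the field names and the agent_trace structure are illustrative only, not a fixed schema.

```python
# Sketch of one golden-set record and a check over the intermediate stages
# (tool choice, parameters) as well as the final answer.

golden_example = {
    "query": "What's the weather in New York City on February 5th?",
    "expected_tool": "get_weather",
    "expected_args": {"city": "New York City", "date": "2025-02-05"},
    "expected_answer_contains": "mostly sunny",
}

def evaluate(agent_trace: dict, golden: dict) -> dict:
    """agent_trace holds the logged tool call, its arguments, and the final answer."""
    return {
        "right_tool": agent_trace["tool"] == golden["expected_tool"],
        "right_args": agent_trace["args"] == golden["expected_args"],
        "answer_ok": golden["expected_answer_contains"] in agent_trace["answer"].lower(),
    }

# Aggregating these per-stage checks over the whole golden set shows where
# failures concentrate: tool choice, parameter filling, or final reasoning.
```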
Now, you might be asking why I've mentioned debugging and observability as very important. Well, 00:13:39.200 |
it turns out that autonomous LLM agents do indeed have a tendency to fail, as most developers know. 00:13:47.200 |
And so we at Cohere are continuously exploring various failure mitigation strategies, right? And what we've 00:13:54.880 |
really come down to is this table of insights. It's really short and simple, but it's essentially that if you're 00:14:02.880 |
working with failures at a low severity or a low failure rate, what we've found is that prompt 00:14:10.560 |
engineering can go a really long way: essentially, just improving the quality of the tool API specs or the 00:14:17.920 |
tool inputs can really help close the last mile on performance gaps. However, if you do see a tool-type failure or the model 00:14:28.480 |
hallucinating on specific tasks in the 10 to 20% range, what we've really found is actually building a targeted 00:14:37.440 |
annotation data set is really useful for closing the gap. And lastly, and perhaps most critically, is if you 00:14:47.360 |
are seeing a high failure rate, particularly if an API is very difficult to call or API names are very similar and 00:14:55.280 |
you need to disambiguate between them, actually building a larger corpus using synthetic data and fine-tuning is the recommended approach. 00:15:05.680 |
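As a compact summary of that decision rule, here is a tiny sketch; the thresholds are just the rough bands mentioned above, not hard cutoffs.

```python
def mitigation_strategy(failure_rate: float) -> str:
    """Map an observed tool-calling failure rate to the rough bands above."""
    if failure_rate < 0.10:
        return "prompt engineering: sharpen tool specs, descriptions, and inputs"
    if failure_rate <= 0.20:
        return "build a targeted annotation dataset for the failing tasks"
    return "generate a larger synthetic corpus and fine-tune the model"
```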
So I've talked to you about frameworks, approaches, 00:15:11.520 |
various evaluation criteria and failure mitigation strategies. 00:15:17.520 |
And what's quite nice here is that at Cohere, we're constantly working on developing and improving 00:15:24.640 |
these various criteria. And one way in which we do this is we're continuously improving the base model 00:15:31.760 |
performance at tool calling. And as you can see here, we're particularly performant on BFCL v3, 00:15:39.520 |
which is a standard benchmark for single and multi-hop tool calling. 00:15:43.840 |
And it's a really highly performant 7B model, as there is continued interest in really lightweight tool 00:15:52.160 |
calling models. In addition to this, we're also codifying this whole host of insights. So in essence, 00:15:59.280 |
we're bringing together the learnings from the frameworks, the approaches, and deploying these models in the wild 00:16:06.880 |
for agentic applications into a single product, a product we've termed North. 00:16:13.280 |
And essentially, it's a single container deployment that has access to RAG, has access to various vector 00:16:20.320 |
DBs and search capacities, but also has connectivity to various applications of interest, including Gmail, 00:16:29.440 |
Outlook, Drive and Slack, to name a few. So you can think of North as a one-stop shop for using and building enterprise LLM agents. 00:16:46.160 |
So I even have a demo for you here from North, and this is it in motion. 00:16:54.720 |
Essentially, it's connected to Gmail, Slack, Salesforce, and G Drive. 00:17:03.280 |
The question is asked about opportunities in Salesforce. The model invokes reasoning chains of thought, 00:17:12.080 |
essentially. It's able to pull the relevant document of interest and essentially provide a breakdown of both the 00:17:23.360 |
reasoning chain, the tools that were called and the tool outputs, which is pretty nice if you're hoping to debug 00:17:30.560 |
and assess what the model is doing under the hood. You can also then retrieve information from recent 00:17:37.760 |
conversations. And essentially, this would again pull from Salesforce, using a SQL-like 00:17:46.960 |
style query. And you can also update specific tool calling capacities. For example, you could ask the model to 00:17:55.920 |
correct which tool call was used. And ideally, what the model does is it updates its reasoning, 00:18:04.080 |
and the package decides to then eventually use Gmail and return the relevant information. So I hope this 00:18:12.080 |
is an insightful talk and hopefully you've taken away some learnings about deploying enterprise LLM agents 00:18:22.400 |
that we found particularly useful and have packaged into North. Thank you.