Cohere: Building enterprise LLM agents that work (Shaan Desai)

00:00:00.000 |
Hello. My name is Shaan, and I'm a machine learning engineer here at Cohere, and today I'll be talking 00:00:10.240 |
to you about building enterprise LLM agents that work. So a quick overview is that we'll have an 00:00:18.720 |
introduction, we'll discuss some of the frameworks and approaches that we're really excited about, 00:00:23.520 |
and we'll also address some of the critical components around evaluation and failure 00:00:28.960 |
mitigation for LLM agents, and then ideally, hopefully, bring all of these things together 00:00:34.720 |
into a nice product overview. So, you know, agents continue to be the most exciting application of 00:00:43.760 |
generative AI today, with growing demand across a host of sectors, including customer support 00:00:49.520 |
assistants, personal assistants, RAG agents, as well as financial analyst agents. However, any developer 00:00:59.120 |
who's spent time building LLM agents knows that doing so in a scalable, safe, and seamless way 00:01:07.520 |
is actually a very difficult and challenging task. You might ask, why is that so? Well, it turns out 00:01:16.880 |
there's a plethora of frameworks, tools, models, approaches, and evaluation criteria to choose from, 00:01:24.640 |
and to effectively put together into one end-to-end pipeline. So we really hope in this talk we can go 00:01:32.320 |
through the critical decision-making process in setting up enterprise agents, really touching on 00:01:39.280 |
the insights and key learnings we've had in building these agents, from addressing the frameworks we love, 00:01:46.160 |
to discussing single versus multi-agent strategies, as well as addressing some of the critical components 00:01:52.960 |
that are less discussed around evaluating LLM agents. 00:01:58.880 |
So let's start with frameworks. Now, over the past few years, there have been an increasing number of 00:02:07.840 |
frameworks that have come onto the market, such as Autogen, CrewAI, 00:02:15.600 |
as well as LangChain. Now, they all have their own benefits and disadvantages, depending on a given use 00:02:25.040 |
case. But our core learning in the past year has really been to focus on three critical components, 00:02:34.880 |
and those are observability, right? Is it easy to debug and fix? The second is the setup cost, you know, 00:02:42.640 |
how quickly can you iterate and resolve an issue, as well as build and piece together the entire 00:02:48.960 |
agent you're interested in building? And then, of course, lastly is support. You know, is the framework 00:02:56.880 |
well-documented? Does it support various models and tools and functionalities? 00:03:04.160 |
So we've often viewed these frameworks under these three criteria. And generally, we tie these three 00:03:14.480 |
criteria to a given use case. And more concretely, what this might look like is building large-scale 00:03:22.000 |
enterprise agents often requires high levels of observability, for which we would really recommend 00:03:29.520 |
going native or building with LangGraph. Now, of course, the space of frameworks is a continuously 00:03:36.640 |
evolving landscape. And so this is a recommendation at this point in time. But we obviously expect this to 00:03:44.400 |
change as frameworks continue to improve their observability and ease of use. And, you know, 00:03:54.400 |
in the same vein, what we'd recommend for quick tests and proof of concepts is frameworks like CrewAI and 00:04:00.160 |
Autogen. And the reason for this is that there's generally a low setup cost, with little code needed to get 00:04:07.600 |
things working out of the box. And, of course, they make it easy to leverage pre-existing or pre-built 00:04:14.000 |
agents and tools and orchestrate them all together in a multi-agent setting. So these are immediate 00:04:20.240 |
recommendations. Of course, here at Cohere, we're continuously improving our integration support for 00:04:26.240 |
these various frameworks. And we hope to continue that support and keep watching this space evolve. 00:04:33.360 |
And in part, what we're particularly excited about is seeing a sliding scale spectrum 00:04:43.200 |
across these various frameworks for different use cases. Okay. Now, once you decide on which framework 00:04:52.240 |
you want to use, of course, you need to decide on the approach or the strategy that you plan to use 00:04:58.480 |
this framework in, right? Do you plan to use single agent? Do you plan to use multi-agent? Will you have 00:05:04.080 |
human-in-the-loop feedback? Our core recommendation, and these are insights that have come from a number of use 00:05:11.520 |
cases, is always start simple. A single LLM with a handful of tools can often go a long way. But more 00:05:21.520 |
importantly, being very diligent about the tool specifications really helps uplift performance. 00:05:30.880 |
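As a rough illustration of that "start simple" advice, the sketch below shows a single LLM driving a handful of tools in a plain loop. The call_llm helper, the get_weather tool, and the message format are hypothetical stand-ins, not any particular framework's API.

```python
# Minimal single-agent loop: one LLM, a handful of well-described tools.
# call_llm is a hypothetical stand-in for a real chat/tool-calling API;
# here it just fakes one tool call and then answers from the tool result.

def get_weather(city: str, date: str) -> str:
    """Toy tool: return a short weather summary."""
    return f"Forecast for {city} on {date}: mostly sunny."

TOOLS = {"get_weather": get_weather}

def call_llm(messages, tools):
    """Hypothetical model call; swap in whichever provider API you actually use."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "get_weather",
                              "arguments": {"city": "New York City",
                                            "date": "February 5"}}}
    last_tool = [m for m in messages if m["role"] == "tool"][-1]
    return {"content": last_tool["content"]}

def run_agent(user_query: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_steps):
        response = call_llm(messages, tools=list(TOOLS))
        if "tool_call" in response:
            name = response["tool_call"]["name"]
            args = response["tool_call"]["arguments"]
            result = TOOLS[name](**args)              # execute the requested tool
            messages.append({"role": "tool", "name": name, "content": result})
        else:
            return response["content"]                # final answer from the model
    return "Stopped after max_steps without a final answer."

print(run_agent("What's the weather in New York City on February 5th?"))
```

The point is just the shape: one model, a few well-described tools, and a loop that executes requested calls and feeds the results back.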
So what we found is, you know, we worked with one client, and one of their asks was, 00:05:40.080 |
hey, we've got a long list of APIs. And these API specifications could take in up to 10 to 15 different 00:05:49.280 |
parameters. And could you get a model to successfully run tool calls for these tasks? What we've really 00:06:00.640 |
found is, to achieve the performance gains they were after, we needed to really simplify the 00:06:09.120 |
entire approach. We needed clear descriptions with very sharp examples on how to call the tool, 00:06:14.480 |
as well as simplifying the input types, converting complex nested dictionaries into list, 00:06:23.440 |
str, or float types. Now, in addition to these learnings, we've also found that providing a clear 00:06:32.000 |
instruction list, which is short, pithy, and to the point, goes a much longer way than providing a long set 00:06:41.200 |
of instructions that can actually confuse the model and induce potential hallucinations. 00:06:48.480 |
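To make the tool-spec advice concrete, here is a before/after sketch in a generic JSON-schema-like style; the search_transactions tool and its fields are hypothetical examples, not a real API.

```python
# Illustration of simplifying a tool spec: flat parameters with simple types
# (list / str / float), a sharper description, and a worked example call.

# Before: deeply nested parameters that models often fill incorrectly.
complex_tool_spec = {
    "name": "search_transactions",
    "description": "Search transactions.",
    "parameters": {
        "filter": {
            "type": "object",
            "properties": {
                "date_range": {"type": "object",
                               "properties": {"start": {"type": "string"},
                                              "end": {"type": "string"}}},
                "categories": {"type": "array", "items": {"type": "string"}},
                "min_amount": {"type": "number"},
            },
        }
    },
}

# After: flat, simply typed parameters and an example call in the description.
simple_tool_spec = {
    "name": "search_transactions",
    "description": (
        "Search transactions between two dates, optionally filtered by category "
        "and minimum amount. Example: search_transactions(start_date='2024-02-01', "
        "end_date='2024-02-05', categories=['travel'], min_amount=50.0)"
    ),
    "parameters": {
        "start_date": {"type": "str", "description": "ISO date, e.g. 2024-02-01"},
        "end_date": {"type": "str", "description": "ISO date, e.g. 2024-02-05"},
        "categories": {"type": "list", "description": "Category names to include"},
        "min_amount": {"type": "float", "description": "Minimum transaction amount"},
    },
}
```

The flattened version, with an example call embedded in the description, is what we mean by sharp descriptions and simple input types.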
We've also found that long streams of chat history, in other words, back-and-forth conversations between the user and chatbot that go 00:07:00.640 |
over 20 turns, for example, can induce certain hallucinations. And this is true across a whole host of models and 00:07:09.440 |
frameworks. To handle that particular problem, we really recommend caching. Essentially caching that history 00:07:17.120 |
and retrieving it whenever it is particularly relevant to a new user query can actually help your LLM agent 00:07:25.600 |
achieve better performance through time. And we'll get to what we mean by performance in some later slides. 00:07:33.840 |
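Here's a minimal sketch of that caching idea: store past turns outside the prompt and retrieve only the ones relevant to the new query. The embed function is a stand-in for whatever embedding model you actually use; a crude character-frequency vector just keeps the sketch self-contained.

```python
# Sketch of caching long chat history and retrieving only the turns relevant
# to a new query, instead of resending all 20+ turns verbatim each time.
from typing import List, Tuple

def embed(text: str) -> List[float]:
    """Hypothetical embedding call; replace with a real embedding model."""
    vec = [0.0] * 128
    for ch in text.lower():
        vec[ord(ch) % 128] += 1.0
    return vec

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

class HistoryCache:
    """Stores past turns outside the prompt and returns only the relevant ones."""
    def __init__(self) -> None:
        self.turns: List[Tuple[str, List[float]]] = []

    def add_turn(self, text: str) -> None:
        self.turns.append((text, embed(text)))

    def relevant_turns(self, query: str, k: int = 5) -> List[str]:
        q = embed(query)
        ranked = sorted(self.turns, key=lambda t: cosine(q, t[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

The retrieved turns are then prepended to the prompt alongside the new query, so the active context stays short even as the full conversation grows.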
Now, indeed, there are frameworks such as Autogen that support multi-agent style orchestration. 00:07:43.440 |
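Before walking through the details, here is a minimal sketch of the router-plus-sub-agents pattern described next. The sub-agent names and the keyword-based route_query stand-in are hypothetical; in practice the routing decision would itself be an LLM call given the tool descriptions and routing instructions.

```python
# Minimal sketch of a routing model over constrained sub-agents.

def hr_agent(query: str) -> str:
    return f"[HR agent] answer to: {query}"

def finance_agent(query: str) -> str:
    return f"[Finance agent] answer to: {query}"

SUB_AGENTS = {
    "hr": (hr_agent, "Handles HR policy and benefits questions only."),
    "finance": (finance_agent, "Handles financial reporting questions only."),
}

ROUTING_INSTRUCTIONS = (
    "Pick exactly one sub-agent. If the query spans both domains, answer the "
    "HR part first, then route the remainder. If neither fits, say so."
)

def route_query(query: str) -> str:
    """Stand-in for an LLM routing call; here a crude keyword rule."""
    return "finance" if any(w in query.lower() for w in ("revenue", "report", "q3")) else "hr"

def run_multi_agent(query: str) -> str:
    choice = route_query(query)          # in practice, an LLM call given the
    agent_fn, _ = SUB_AGENTS[choice]     # descriptions and ROUTING_INSTRUCTIONS
    return agent_fn(query)
```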
And so, you know, in the multi-agent setting, it's a collection of simple 00:07:50.400 |
agents tied together. And they have a routing model that decides which sub-agent to go to and retrieve 00:07:56.800 |
information from. And there's been a growing interest in the industry to build multi-agents that are very 00:08:02.640 |
robust and versatile. Of course, this requires a good routing model, good reasoning model, and of course, sub-agents 00:08:11.920 |
that are well-constrained. And so, what we've learned for the router is that it should really contain a list of 00:08:17.920 |
tools with clear descriptions. That always holds. But it should also contain a sharp set of routing instructions 00:08:25.600 |
that can encompass potential edge cases, right? So, if you're trying to route information, 00:08:32.960 |
from the router to a sub-agent, and then back to another agent, providing that type of clarity and 00:08:39.840 |
instruction to the model can really help it decide what it should do at each stage, rather than it 00:08:45.760 |
autonomously and continuously trying to attempt things that may not be the most optimal path to getting to the 00:08:54.240 |
final answer. Of course, we also recommend that for sub-agents, they should be constrained to performing 00:09:01.440 |
independent tasks with a small set of tools to return the final answer, right? Each sub-agent should be 00:09:08.160 |
decomposed into a specific task that it should handle. So, those are key insights we've had from building both 00:09:15.520 |
simple and multi-agents in the enterprise setting. And now, the most important bit, right? We've glossed 00:09:25.120 |
over the fact that agents can act quite autonomously to achieve final results, but we do think safety is 00:09:34.000 |
paramount to any scalable real-world application, right? And here are some examples. If we decide to use 00:09:44.560 |
a Gmail agent, for example, we may want to request permission prior to sending emails, 00:09:52.160 |
right? We might want the user to get a pop-up that says, "Hey, are you okay with me sending this email?" 00:09:59.280 |
right? We don't want random emails to be sent. And this might be true in the HR support bot setting, 00:10:05.760 |
as well as in the financial analysis agent setting. What we've learned essentially is that incorporating 00:10:12.320 |
human-in-the-loop is thus, like, really critical for business applications. And what's really nice 00:10:17.840 |
about it is that you can codify a set of rules under which human-in-the-loop is triggered, right? So, 00:10:24.960 |
under various criteria, we can force human-in-the-loop to be triggered. And typically, 00:10:29.920 |
this can happen right before a tool is called. But it could also happen 00:10:36.880 |
right after a tool call is made, especially if the execution output, for example, may contain 00:10:44.240 |
various parts of information that you may not want to process completely. Okay, great. So we've addressed 00:10:52.800 |
frameworks. We've addressed various approaches we've explored and the insights we've gained. 00:10:58.640 |
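Before turning to evaluation, here is a minimal sketch of how those human-in-the-loop triggers might be codified around a tool call; the tool names and rules are hypothetical examples rather than a fixed policy.

```python
# Sketch of codifying when human-in-the-loop is triggered around a tool call.

SENSITIVE_TOOLS = {"send_email", "update_salesforce_record"}

def needs_approval_before(tool_name: str, args: dict) -> bool:
    """Gate *before* execution: anything that sends messages or mutates data."""
    return tool_name in SENSITIVE_TOOLS

def needs_approval_after(tool_name: str, output: str) -> bool:
    """Gate *after* execution: outputs that may contain sensitive information."""
    lowered = output.lower()
    return any(term in lowered for term in ("ssn", "salary", "password"))

def execute_with_hitl(tool_name, tool_fn, args, ask_user):
    """ask_user is a callback that shows a confirmation prompt and returns True/False."""
    if needs_approval_before(tool_name, args):
        if not ask_user(f"OK to run {tool_name} with {args}?"):
            return "Action cancelled by user."
    output = tool_fn(**args)
    if needs_approval_after(tool_name, output):
        if not ask_user("The result may contain sensitive data. Show it?"):
            return "Result withheld pending review."
    return output
```

Here ask_user would be wired to whatever confirmation surface the product exposes, such as the pop-up mentioned above.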
Now, importantly, we need to discuss evaluation. How are we going to assess the performance of the 00:11:06.080 |
agent that we've built? So, you know, what really makes a successful agent is a lot of things, right? 00:11:15.920 |
It's a lot of moving pieces that need to come together for it to be successful. Essentially, 00:11:20.400 |
the model needs to make the right tool call at the right time. The model needs to be able to 00:11:27.760 |
essentially receive executed tool results and reason on top of it. And it needs to make tool calls very 00:11:34.560 |
succinctly and accurately passing the right input parameters. And it needs to have the ability to 00:11:39.600 |
course correct even when things are going wrong, right? So what's quite interesting here is that for the 00:11:46.640 |
end user, the only thing that particularly matters to them is the final product or 00:11:52.800 |
the final answer they get from the agent. But what matters most to, I think, developers as they're debugging 00:12:00.560 |
and understanding how the LLM is making decisions is not just the final output, but all the intermediate 00:12:07.200 |
stages that go into getting to the final answer. And so we have an example here where, for example, a user may 00:12:15.520 |
ask a model to provide information about weather in New York City on February 5th. Ideally, the model should 00:12:25.760 |
decide to use a specific tool, pass in the right parameters, get a returned set of results from those 00:12:35.680 |
tools and reason over the returned response to provide a final output, which is New York City will be mostly 00:12:43.200 |
sunny, etc. Now, as you can see, there are a number of intermediate stages that take place to 00:12:50.880 |
get to the final response. And typically, what we do here at Cohere is we build a golden set of ground truth 00:13:00.640 |
user queries, expected function calls, expected parameter inputs, expected outputs, as well as expected final 00:13:09.840 |
response. The nice thing about doing this and building this evaluation set is that we can run this large corpus of 00:13:17.520 |
evaluations through our agentic framework and assess any critical points of failure or where we think the model 00:13:25.360 |
may be going wrong. And this makes debugging particularly easy from an evaluation standpoint. 00:13:31.520 |
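Concretely, a golden-set record and its per-stage checks might look like the sketch below; the field names and the agent_trace structure are illustrative only, not a fixed schema.

```python
# Sketch of one golden-set record and a check over the intermediate stages
# (tool choice, parameters) as well as the final answer.

golden_example = {
    "query": "What's the weather in New York City on February 5th?",
    "expected_tool": "get_weather",
    "expected_args": {"city": "New York City", "date": "2025-02-05"},
    "expected_answer_contains": "mostly sunny",
}

def evaluate(agent_trace: dict, golden: dict) -> dict:
    """agent_trace holds the logged tool call, its arguments, and the final answer."""
    return {
        "right_tool": agent_trace["tool"] == golden["expected_tool"],
        "right_args": agent_trace["args"] == golden["expected_args"],
        "answer_ok": golden["expected_answer_contains"] in agent_trace["answer"].lower(),
    }

# Aggregating these per-stage checks over the whole golden set shows where
# failures concentrate: tool choice, parameter filling, or final reasoning.
```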
Now, you might be asking why I've mentioned debugging and observability as very important. Well, 00:13:39.200 |
it turns out that autonomous LLM agents do indeed have a tendency to fail, as most developers know. 00:13:47.200 |
And so we at Cohere are continuously exploring various failure mitigation strategies, right? And what we've 00:13:54.880 |
really come down to is this table of insights. It's really short and simple, but it's essentially that if you're 00:14:02.880 |
working with failures at a low severity or a low failure rate, what we've found is that prompt 00:14:10.560 |
engineering can go a really long way: essentially, just improving the quality of the tool API specs or the 00:14:17.920 |
tool inputs can really help close the last mile on performance gaps. However, if you do see a tool-type failure or the model 00:14:28.480 |
hallucinating on specific tasks in the 10 to 20% range, what we've really found is actually building a targeted 00:14:37.440 |
annotation data set is really useful for closing the gap. And lastly, and perhaps most critically, is if you 00:14:47.360 |
are seeing a high failure rate, particularly if an API is very difficult to call or API names are very similar and 00:14:55.280 |
you need to disambiguate between them, actually building a larger corpus using synthetic data and fine-tuning is the recommended approach. 00:15:05.680 |
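As a compact summary of that decision rule, here is a tiny sketch; the thresholds are just the rough bands mentioned above, not hard cutoffs.

```python
def mitigation_strategy(failure_rate: float) -> str:
    """Map an observed tool-calling failure rate to the rough bands above."""
    if failure_rate < 0.10:
        return "prompt engineering: sharpen tool specs, descriptions, and inputs"
    if failure_rate <= 0.20:
        return "build a targeted annotation dataset for the failing tasks"
    return "generate a larger synthetic corpus and fine-tune the model"
```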
So I've talked to you about frameworks, approaches, 00:15:11.520 |
various evaluation criteria and failure mitigation strategies. 00:15:17.520 |
And what's quite nice here is that at Cohere, we're constantly working on developing and improving 00:15:24.640 |
these various criteria. And one way in which we do this is we're continuously improving the base model 00:15:31.760 |
performance at tool calling. And as you can see here, we're particularly performant on BFCL v3, 00:15:39.520 |
which is a standard benchmark for single and multi-hop tool calling. 00:15:43.840 |
And it's a really highly performant 7B model, as there is continued interest in really lightweight tool 00:15:52.160 |
calling models. In addition to this, we're also codifying this whole host of insights. So in essence, 00:15:59.280 |
we're bringing together the learnings from the frameworks, the approaches, and deploying these models in the wild 00:16:06.880 |
for agentic applications into a single product, a product we've termed North. 00:16:13.280 |
And essentially, it's a single container deployment that has access to RAG, has access to various vector 00:16:20.320 |
DBs and search capacities, but also has connectivity to various applications of interest, including Gmail, 00:16:29.440 |
Outlook, Drive and Slack, to name a few. So you can think of North as a one-stop shop for using and building enterprise LLM agents. 00:16:46.160 |
So I even have a demo for you here from North, and this is it in motion. 00:16:54.720 |
Essentially, it's connected to Gmail, Slack, Salesforce, and G Drive. 00:17:03.280 |
The question is asked about opportunities in Salesforce. The model invokes reasoning chains of thought, 00:17:12.080 |
essentially. It's able to pull the relevant document of interest and essentially provide a breakdown of both the 00:17:23.360 |
reasoning chain, the tools that were called and the tool outputs, which is pretty nice if you're hoping to debug 00:17:30.560 |
and assess what the model is doing under the hood. You can also then retrieve information from recent 00:17:37.760 |
conversations. And essentially, this would again pull from Salesforce, using a SQL-like 00:17:46.960 |
style query. And you can also update specific tool calling capacities. For example, you could ask the model to 00:17:55.920 |
correct which tool call was used. And ideally, what the model does is it updates its reasoning, 00:18:04.080 |
and the package decides to then eventually use Gmail and return the relevant information. So I hope this 00:18:12.080 |
is an insightful talk and hopefully you've taken away some learnings about deploying enterprise LLM agents 00:18:22.400 |
that we found particularly useful and have packaged into North. Thank you.