SPEAKER 1: I want to talk today a little bit about trying to build reliable agents in the enterprise. This is something we work on with a bunch of people: developers inside an enterprise looking to build agents for their own company, and also people building solutions to bring and sell into enterprises.
And so I wanted to talk a little bit about some of the tips and tricks we see for making this succeed. The vision of the future for agents that I, and I think many other people, share is that there will be a lot of them.
They'll be running around the enterprise doing different things. There'll be an agent for every different task. We'll be coordinating with them, acting like a manager or a supervisor. So how do we get to that vision? And which parts of it will arrive before the others?
So I was thinking about this question: what makes some agents succeed in the enterprise and some fail? And I was chatting with my friend Asaf, the head of AI at Monday, who also wrote GPT Researcher, a great open source package. I was chatting with him a few weeks ago.
And a lot of the ideas here are borrowed from that conversation. He'll probably write a blog post about this with a slightly different framing, which I would encourage everyone to check out. So I just want to give him a massive shout out. And if you have the opportunity to chat with him, you should definitely take that opportunity.
Thinking about it from first principles: what makes agents successful in the enterprise? What makes them more likely to be adopted? These points probably aren't going to sound earth-shattering, but hopefully we'll get to something interesting.
The more value an agent provides when it's right, the more likely it is to be adopted. The more likely it is to succeed, the more likely it is to be adopted. And then there's the cost if it's wrong: if there are big costs when it's wrong, it will be less likely to be adopted.
So I think these are three ingredients which are pretty simple and pretty basic, but they provide an interesting first-principles approach for thinking about how to build agents and what types of agents find success. And I say "in the enterprise" here, but I think this also applies more generally across society.
If we want to put this into a fun little equation, we can multiply the probability that the agent succeeds by the value you get when it succeeds, and then do the opposite for the cost when it's wrong. And of course, this needs to be greater than the cost of running the agent for you to want to put it into production.
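To spell that out, writing p for the probability the agent succeeds, V for the value when it's right, C_wrong for the cost when it's wrong, and C_run for the cost of running the agent (my own labels for the quantities just described), the condition is roughly:

$$ p \cdot V \; - \; (1 - p) \cdot C_{\text{wrong}} \; > \; C_{\text{run}} $$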
So yeah, a fun little stats-slash-math formula. So how can we build agents that score higher on this? Because nothing so far has been earth-shattering. Hopefully we'll get to some fun insights as we talk about how to make that equation go up.
So how can we increase the value of things when they go right? And what types of agents have higher value? Part of this is choosing problems where there is really high value. A lot of the agents that have been successful so far are in those areas, and Harvey in the legal space is one of them.
In the finance space, we see things around research and summarization. These are high-value work tasks. People pay a lot of money for lawyers and for investment research, so these are examples of what I would call high-value tasks. There are also other ways to improve the value of what you're working on besides switching verticals completely.
And I think we're starting to see some of this, especially more recently. If we think about RAG, or older-school question-answering solutions, they would often respond quickly, ideally within five seconds, and give you a quick answer.
And we're starting to see a trend towards things like deep research, which goes and runs for an extended period of time. We're seeing the same with code. We started with Cursor, which has inline autocomplete and maybe some chat question answering. In the past three weeks, there have been, what, seven different examples of these ambient agents that run in the background for hours at a time.
And I think this speaks to how people are trying to get their agents to provide more value: they're getting them to do more work, which is pretty basic. But if you think about this future of agents working and what that means, it doesn't mean a copilot.
It means something working more autonomously in the background, doing larger amounts of work. So besides focusing on areas or verticals that provide high value, you can also reshape the UI/UX, the interaction pattern of what you're building, to be more long-running and to take on more substantial chunks of work.
Let's talk now about the probability of success. How do we make this go up? There are two different aspects I want to cover here. One is the reliability of agents. If you've built agents before, you know it's easy to get something that works as a prototype. It runs once, great.
You can make a video, put it on Twitter. But it's hard to make it work reliably and put it in production. And by the way, for some types of agents, that's totally fine. You can have agents that run for a while where you don't know exactly what they'll do, and that's totally fine.
But especially in the enterprise, we often see that people want more predictability, more control over what steps actually happen inside the agent. Maybe they always want to do step A after step B. If you prompt an agent to do that, great, it might do that 90% of the time.
But you don't know what the LLM will do. If you put that in a deterministic workflow, in code, it will always do that. And so, especially in the enterprise, we see that there are workflow-like things where you need more controllability and more predictability than you get by just prompting.
What we've seen is that the solution is basically to make more and more of your agent deterministic. There's this concept of workflows versus agents; Anthropic wrote a great blog post on it that I'd encourage you to check out. I would argue that instead of workflows versus agents, it's oftentimes workflows and agents.
We see that parts of an agentic system are sometimes looping and calling tools, and sometimes they're just doing A after B after C. An example of this comes up with multi-agent architectures: think about an architecture that has agent A, and after agent A finishes, you always call agent B.
Is that a workflow? Is that an agent? It's this middle ground. And so as we think about building tools for this future, one of the things we've released is LangGraph. LangGraph is an agent framework, but it's very different from other agent frameworks in that it really leans into this spectrum of workflows and agents and lets you sit wherever on that curve is best for your application.
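As a rough illustration of that middle ground, here's a minimal LangGraph-style sketch where one LLM-driven node is followed deterministically by a second node, so the A-then-B part is a hard-coded workflow edge rather than something you hope the model decides to do. The node names and state fields are made up for illustration, and the exact API may differ across LangGraph versions.

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


# Hypothetical state for illustration: the work product the agents pass along.
class State(TypedDict):
    request: str
    draft: str
    final: str


def agent_a(state: State) -> dict:
    # In a real system this node would loop over LLM calls and tools.
    return {"draft": f"draft based on: {state['request']}"}


def agent_b(state: State) -> dict:
    # Deterministically runs after agent_a, every single time.
    return {"final": state["draft"] + " (reviewed)"}


builder = StateGraph(State)
builder.add_node("agent_a", agent_a)
builder.add_node("agent_b", agent_b)
builder.add_edge(START, "agent_a")
builder.add_edge("agent_a", "agent_b")  # the workflow-like, hard-coded part
builder.add_edge("agent_b", END)

graph = builder.compile()
print(graph.invoke({"request": "summarize this quarter's deals", "draft": "", "final": ""}))
```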
And where on that curve is best totally depends on the application you're building. There's another lever that's different from just building and changing the agent itself: there are oftentimes really wide error bars in how people think about how likely an agent is to work.
This technology is new, and when you're trying to get something built or approved or put into production inside an enterprise, there's a lot of uncertainty and fear around it, which ties back to that fundamental uncertainty about how the agent is performing. So besides just making the agent better, a really important thing to do inside the enterprise, whether you're bringing in a third-party agent and selling it as a service or building inside the enterprise yourself, is to work to shrink the error bars people perceive around how the agent performs.
What I mean by that specifically is that this is where observability and evals actually play a slightly different role than we might intend. We have an observability and evals solution called LangSmith. We built it for developers so they could see what's going on inside their agent.
It's also proved really, really valuable for communicating to external stakeholders what's going on inside the agent: how it performs, where it messes up, and where it doesn't. With the observability piece, you can see every step that's happening inside the agent.
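As a rough sketch of what that tracing looks like on the developer side, LangSmith lets you decorate functions so each call shows up as a step in a trace. The function names here are hypothetical, this assumes a LangSmith API key is configured, and the exact environment variables can vary by SDK version.

```python
import os

from langsmith import traceable

# Assumes LANGSMITH_API_KEY is already set in the environment.
os.environ["LANGSMITH_TRACING"] = "true"


@traceable(name="research_step")
def research_step(question: str) -> str:
    # Placeholder for an LLM or tool call; each invocation is recorded as a run.
    return f"notes about {question}"


@traceable(name="research_agent")
def research_agent(question: str) -> str:
    notes = research_step(question)  # nested call shows up as a child step in the trace
    return f"summary of: {notes}"


print(research_agent("How did Q3 revenue compare to Q2?"))
```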
This reduces the uncertainty people have around what the agent is actually doing. They can see that it's making three or five LLM calls, not just one, and that the steps it's taking are actually really thoughtful. And then you can benchmark it against different things. There's a great story of a user of ours who used LangSmith initially to build their agent, and then brought it along and showed it to the review panel when they were trying to get the agent approved for production.
And they ended the meeting under time, which almost never happens if you've been to these review panels. They basically showed the panel everything inside LangSmith, and it helped reduce the perceived risk of these agents. The last thing I want to talk about is the cost when something goes wrong.
Similar to the probability of things going right, this plays an outsized role in people's perception of these agents, especially among review boards and managers in larger enterprises. People hear stories of agents going wild and causing brand damage or giving things away for free. I think there's an outsized perception of what could happen when things go bad.
So there are a few UI/UX tricks that successful agents use to make this a non-issue. One is simply to make it easy to reverse the changes that the agent makes. Think about code; this is a screenshot of Replit Agent.
It generates a diff and opens a PR. Code is really easy to revert: you just go back to the previous commit. I think that's part of the reason code has been one of the first real places you can apply agents, besides the fact that the models are trained on it.
It's also that when you use these agents, you create all these commits. Well, it depends how you do it. Replit does it in a very clever way: every time the agent changes a file, they save it as a new commit. So you can always go back; you can always revert what the agent does.
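The commit-per-change trick is easy to replicate outside Replit. Here's a minimal sketch (plain git via subprocess; the function name and paths are hypothetical) of wrapping every file the agent writes in its own commit so any single change can be reverted:

```python
import subprocess
from pathlib import Path


def agent_write_file(path: str, content: str, repo_dir: str = ".") -> None:
    """Write a file on the agent's behalf and snapshot it as its own commit."""
    file_path = Path(repo_dir) / path
    file_path.parent.mkdir(parents=True, exist_ok=True)
    file_path.write_text(content)

    # One commit per agent edit, so any single change can be undone with `git revert`.
    subprocess.run(["git", "add", path], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", f"agent: update {path}"], cwd=repo_dir, check=True)


# Hypothetical usage:
# agent_write_file("src/report.py", "print('first draft')")
```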
And then the second part is having a human in the loop. Rather than merging code changes into main directly, the agent opens a PR. That puts the human in the loop. So the agent isn't really the one making the changes; there's a human approving what the agent does.
This seems maybe a little subtle, but I think it completely changes the cost calculation in people's minds of what happens when the agent does something bad, because now it's reversible, and you have a human who will stop a bad change from even going in in the first place.
And so human in the loop is one of the big things we see people really leaning into, both when selling into enterprises and when building inside them. To make this a little more concrete, what are some examples? I think deep research is a pretty good one.
If we think about it, there's a period up front, when you're messaging with deep research, where you go back and forth. It asks you follow-up questions, and you calibrate on what you want to research. That puts the human in the loop. It also makes sure you get a better result.
So it increases the value you're going to get from the report, because the report is more aligned with what you actually want. And then deep research doesn't take the result and publish it as a blog post on the internet, or email it to your clients.
It just produces a report that you can read and decide what to do with. So it's not actually doing anything; it's up to you to take the output and act on it. I think code is another great example. Claude Code also has this ability where it asks questions.
It clarifies things. This both keeps the human in the loop and makes sure it yields better results. And then again with code, maybe you're not making a commit every time you change something, but the work is on a separate branch and you open up a PR; you're not pushing directly to master.
So I think these are examples of things across the industry that follow some of these patterns. OK, so we've figured out a few levers we can pull to make our agents more attractive to deploy in the enterprise. What next? What's next is: how do we scale that?
If this has positive expected value, then what we really want to do is multiply it and scale it up. And I think this speaks to the concept of ambient agents. When we think about that futuristic view of agents working in an enterprise, they're doing things in the background.
They're not being kicked off by a human sitting in the loop; they're being triggered by different events. And I think the reason this is so powerful is that it scales up this positive expected value far beyond what we can do ourselves. I can only really have one, maybe two, chat boxes open at the same time.
But now there can be hundreds of these running in the background. So when we think about the difference between chat agents, which I would argue is most of what we've seen so far, and ambient agents, one big difference is that ambient agents are triggered by events. That lets us scale ourselves beyond a one-to-one conversation.
It's now one-to-many, and the number of these agents that can be running concurrently goes from one to unlimited. The latency requirements also change. With chat, there's a UX expectation that it responds really, really quickly. That's not the case with ambient agents, because they're triggered without you even knowing.
So how would you even know, or care, how long it's been running? And what does this let you do? Why does this matter? It lets you do more complex operations. You can do more things, and you can start to build up a bigger body of work.
You can go from changing one line of code to changing a whole file, or making a new repo, or anything like that. So instead of the agent just responding directly or making a single tool call, which is usually what happens in chat applications because of the latency requirements, it can now do these more complex things.
And so the value can start increasing in terms of what you're doing. The other thing I want to emphasize is that there's still a UX for interacting with these agents. Ambient does not mean fully autonomous. And this is really, really important.
Because when people hear autonomous, they think the cost of this thing doing something bad is really high: I won't be able to oversee it, I won't know what's going on, it could go out there and run wild. So, again, ambient does not mean fully autonomous.
There are a lot of different human-in-the-loop interaction patterns you can bring into these background, ambient agents. There can be an approve-reject pattern, where for certain tools you want to explicitly say, yes, it's OK to call this tool.
You might want to edit the tool call it's making: if it messes up a tool call, you can just correct it in the UI. You might want to give it the ability to ask questions so that you can answer them and provide more information if it gets stuck halfway through.
And then there's time travel, which is part of what we call human on the loop. This happens after the agent runs: if it messed up on step 10 out of 100, you can rewind back to step 10 and say, no, resume from here but do this other thing slightly differently.
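To picture the approve-reject pattern concretely, here's a minimal sketch using LangGraph's interrupt mechanism, where the graph pauses before sending an email and resumes once a human responds. The state fields and payload are made up, and the resume API may differ slightly between LangGraph versions.

```python
from typing import TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END
from langgraph.types import Command, interrupt


class State(TypedDict):
    email_draft: str
    status: str


def send_email(state: State) -> dict:
    # Pause the run and surface the proposed action for approval (e.g. in an agent inbox).
    decision = interrupt({"action": "send_email", "draft": state["email_draft"]})
    if decision == "approve":
        # ... actually send the email here ...
        return {"status": "sent"}
    return {"status": "rejected"}


builder = StateGraph(State)
builder.add_node("send_email", send_email)
builder.add_edge(START, "send_email")
builder.add_edge("send_email", END)
graph = builder.compile(checkpointer=MemorySaver())  # a checkpointer is needed to pause and resume

config = {"configurable": {"thread_id": "demo-thread"}}
graph.invoke({"email_draft": "Hi, confirming Tuesday at 2pm.", "status": ""}, config)

# Later, once the human has reviewed the draft, resume from the interrupt:
print(graph.invoke(Command(resume="approve"), config))
```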
So human in the loop, we think, is super, super important. The other thing I want to call out briefly is that there's an intermediate stage that we're in right now. I wouldn't call deep research or Claude Code or any of these coding agents ambient agents, because they're still triggered by a human.
But I think these are good examples of sync-to-async agents. Factory is a coding agent, and they use the term async coding agents, which I really like. And I think this sync-to-async stage is a natural progression if you think about it.
A year ago, everything was a sync agent. We were chatting with it; it was very much in the moment. The future is probably these autonomous agents working in the background, still pinging us when they need help. But there's this intermediate state where the human kicks the agent off and uses that human-in-the-loop step at the start to calibrate on what you want it to do.
So I think that table I showed of chat versus ambient is probably missing a column in the middle for these sync-to-async agents. Anyway, an example of a UX we think can be interesting for ambient agents is what we call the agent inbox, where you surface all the actions the agent wants to take that need your approval.
Then you can go in and approve, reject, leave feedback, things like that. To tie this together and make it really concrete what I mean by ambient agents: email, I think, is a really natural place for them. These agents can listen to incoming emails. Those are events.
They can run on however many emails come in, so that's, in theory, unlimited. But you still probably want the human, the user, to approve any emails that go out or any calendar invites that get sent, depending on your level of comfort. And so this is a concrete example.
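To make the event-triggered side concrete, here's a rough sketch of the outer loop for an email ambient agent. The polling, drafting, and approval functions are all placeholders rather than a real email or agent API; the point is just that incoming emails are the trigger and outgoing actions wait for approval.

```python
import time


def fetch_new_emails() -> list[dict]:
    # Placeholder: in practice, poll or subscribe to Gmail/Outlook for new messages.
    return []  # e.g. [{"id": "123", "from": "a@example.com", "body": "..."}]


def draft_reply(email: dict) -> dict:
    # Placeholder for the agent: triage the email and draft a response.
    return {"email_id": email["id"], "reply": "Thanks, I'll take a look and get back to you."}


def queue_for_approval(action: dict) -> None:
    # Placeholder: surface the drafted reply in an agent inbox for the user to approve.
    print(f"awaiting approval: {action}")


# The ambient loop: triggered by events (incoming emails), not by a user opening a chat window.
while True:
    for email in fetch_new_emails():
        action = draft_reply(email)   # the agent can run on every email that arrives
        queue_for_approval(action)    # but a human still approves anything that goes out
    time.sleep(60)
```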
I actually built one that I use myself. We've used it to test out a lot of these ideas. If people want to try it out, there's a QR code you can scan to get the GitHub repo. It's all open source. It's not the only example of an ambient agent, but it's one I built myself, so we talk about it a lot internally.
That's all I have. I'm not sure if there's time for questions or not. One or two questions, if people have them. AUDIENCE MEMBER: My question is: although everybody's talking about agents, only code-generating agents are the ones getting funding. Is it because you can measure what you have done and reverse what you have done?
But for all other agents, you can do a lot of stuff, but you cannot measure what you have done and you cannot reverse what you have done. SPEAKER 1: Yeah, I think there's a variety of reasons. Of those two, measurement and reversibility, I think measurement probably matters more.
A lot of the large model labs train on a lot of coding data because you can test whether it's correct or not: you can run it and see if it compiles. Same with math data. Math is verifiable, right? So math and code are two examples of verifiable domains.
Essay writing is less verifiable. What does it mean for an essay to be correct? That's far more ambiguous. Because these domains are verifiable, you're able to bootstrap a lot of training data, so there's already a lot of training data about code in the models, and the models are better at it.
That makes the agents that use those models better at it too. Then for the second part, I do think code lends itself naturally to this commit, draft, and preview pattern, but I think that part is more generalizable. Legal is a great example: in legal, you can have first drafts of things.
That's very common. Same with essay writing. I think the concept of a first draft is actually a really good UX to aim for. It lets the agent do far more, and it also puts the human in the loop, so you get both benefits. If you put the human in the loop at every step, that doesn't provide any value.
Each step is so small. So the key is finding UX patterns where the agent does a ton of work but the human is still in the loop at key points. And first drafts, I think, are a great mental model for that. So anything where there are first drafts, like legal, writing, or code, I think is a bit more generalizable.
The verifiable stuff, that's a little bit tougher. Yeah. Oh, no, I'm good. I'll talk to you afterwards. ANDREW BROGDON: Cool. Yeah, more than happy to chat after. Thank you all. Thank you.