3 ingredients for building reliable enterprise agents - Harrison Chase, LangChain/LangGraph

00:00:16.600 |
I'm going to talk about trying to build reliable agents in the enterprise. 00:00:19.920 |
This is something we work on with a bunch of people, 00:00:25.240 |
both teams inside of an enterprise looking to build agents 00:00:29.440 |
and people who are looking to build solutions and bring them into enterprises. 00:00:35.060 |
And so I wanted to talk a little bit about some 00:00:37.420 |
of what we see as the success tips and tricks for doing that. 00:00:42.100 |
So the vision of the future for agents that I, and I think other people, 00:00:46.880 |
have a similar view of is that there'll be agents everywhere. 00:00:52.840 |
There'll be an agent for every different task. 00:00:56.440 |
We'll be kind of like a manager, a supervisor, of these agents. 00:01:02.600 |
And what parts of this will arrive before the others? 00:01:12.720 |
What makes some agents succeed in the enterprise? 00:01:33.500 |
Someone else here has covered this with a slightly different framing, which I would encourage you to check out. 00:01:36.880 |
So I just want to give him a massive shout out. 00:01:38.520 |
And if you have the opportunity to chat with him, I'd recommend it. 00:01:46.400 |
So, what makes agents successful in the enterprise? 00:01:54.480 |
The ingredients are the value of the agent if it's right, the probability that it's right, and the cost if it's wrong. 00:01:57.160 |
These probably aren't going to sound earth 00:01:59.680 |
shattering, but hopefully we'll get to some interesting points. 00:02:17.300 |
So I think these are three simple ingredients, 00:02:21.580 |
but I think they provide an interesting first-principles 00:02:23.960 |
approach for how to think about building agents 00:02:26.100 |
and what types of agents find success. 00:02:36.480 |
If we want to try to put this into a fun little equation, 00:02:39.020 |
we can multiply the probability that something succeeds times the value when it's right, 00:02:44.740 |
and then do the opposite for the cost when it's wrong. 00:02:52.440 |
So yeah, a fun little stats slash math formula. 00:02:58.720 |
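(One plausible reading of that formula, with notation that is mine rather than from the talk's slides: p is the probability the agent succeeds, V is the value when it's right, and C is the cost when it's wrong.)

```latex
% Expected value of running the agent, as sketched above (notation is mine):
% p = P(agent is right), V = value when right, C = cost when wrong
\[
\mathbb{E}[\text{value}] = p \cdot V \;-\; (1 - p) \cdot C
\]
```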
So how can we build agents that score higher on this? 00:03:06.180 |
Let's start with the first term when we talk about how to make that equation go up. 00:03:11.020 |
So how can we increase the value of things when they go right? 00:03:20.720 |
So part of this is choosing problems 00:03:24.520 |
where there is really high value. 00:03:26.980 |
So a lot of the agents that have been successful so far are in high-value verticals. 00:03:33.880 |
In the finance space, for example, we see stuff around research. 00:03:53.080 |
But there are also ways to increase the value of what you're working on besides just switching 00:03:56.880 |
problems. And I think we're starting to see some of this, 00:04:00.380 |
So if we think about RAG, or if we think about 00:04:02.360 |
existing or older-school question answering 00:04:06.380 |
solutions, they would often respond quickly, 00:04:08.680 |
ideally within five seconds, and give you a quick answer. 00:04:11.220 |
And we're starting to see a trend towards things 00:04:13.320 |
like deep research, which go and run for an extended period of time. 00:04:23.820 |
In the past like three weeks, there's been, what, 00:04:26.340 |
seven different examples of these ambient agents 00:04:28.540 |
that run in the background for like hours at a time. 00:04:30.980 |
And I think this speaks to ways that people are trying to increase the value of agents. 00:04:35.300 |
They're getting them to do more work, pretty basic. 00:04:41.000 |
And if you think about this future of agents working 00:04:43.440 |
and what that means, that doesn't mean a copilot. 00:04:46.360 |
That means something working more autonomously 00:04:48.320 |
in the background, doing larger amounts of work. 00:04:52.140 |
So besides focusing on areas or verticals 00:04:56.280 |
that provide value, I think you can also absolutely change the shape 00:05:02.800 |
of what you're building to be more long-running 00:05:05.960 |
and do more substantial patterns of work. 00:05:11.140 |
Let's talk now about the probability of success. 00:05:15.200 |
So there's two different aspects I want to talk about here. 00:05:20.000 |
One, I think, is about the reliability of agents. 00:05:25.720 |
It's easy to get something that works in a prototype, but much harder to get something reliable. 00:05:41.020 |
And for some types of agents, that's totally fine. 00:05:47.560 |
They can take whatever steps they want, and you don't know what they do, and that's totally fine. 00:05:51.580 |
But especially in the enterprise, we see oftentimes 00:05:53.800 |
that people want more predictability, more control 00:05:56.740 |
over what steps actually happen inside the agents. 00:05:59.620 |
Maybe they always want to do step A after step B. 00:06:02.760 |
And so if you prompt an agent to do that, great, maybe it happens, maybe it doesn't. 00:06:08.580 |
If you put that in a deterministic kind of workflow or code, it happens every time. 00:06:18.580 |
So there are cases where you need more controllability, more predictability. 00:06:23.660 |
And so what we've seen is the solution for this 00:06:26.300 |
is basically to make more and more of your agent deterministic. 00:06:30.200 |
There is this concept of workflows versus agents. 00:06:37.380 |
I would argue that instead of workflows versus agents, it's really a spectrum. 00:06:48.440 |
Sometimes the LLM is choosing what to do next, and sometimes they're just doing A after B after C. 00:06:53.780 |
If you think about an architecture that has agent A, 00:06:56.880 |
and then after agent A finishes, you always call agent B, that transition is deterministic, a workflow. 00:07:03.580 |
And so as we think about building tools for this future, 00:07:07.480 |
one of the things that we've released is LangGraph. 00:07:11.900 |
It's very different from other agent frameworks 00:07:13.940 |
in that it really leans in to this spectrum of workflows 00:07:17.780 |
and agents and allows you to be wherever is best for you, because where that is 00:07:26.100 |
depends on the application that you're building. 00:07:28.200 |
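As a rough sketch of what that spectrum looks like in LangGraph, here is a minimal graph where step B always runs after step A. The node names, state fields, and node bodies are invented for illustration; each node could just as easily wrap an LLM call, which is what puts you at different points on the workflow-to-agent spectrum.

```python
from typing import TypedDict

from langgraph.graph import StateGraph, START, END


class State(TypedDict):
    text: str


def step_a(state: State) -> dict:
    # Could be plain deterministic code or an LLM call; either way it is one node.
    return {"text": state["text"] + " -> A"}


def step_b(state: State) -> dict:
    # The fixed edge below guarantees B always runs after A.
    return {"text": state["text"] + " -> B"}


builder = StateGraph(State)
builder.add_node("a", step_a)
builder.add_node("b", step_b)
builder.add_edge(START, "a")  # deterministic entry point
builder.add_edge("a", "b")    # deterministic: B always follows A
builder.add_edge("b", END)

graph = builder.compile()
print(graph.invoke({"text": "input"}))  # {'text': 'input -> A -> B'}
```

Swapping the fixed `add_edge("a", "b")` for a conditional edge driven by an LLM decision is how you slide toward the more agentic end of the spectrum.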
There is another thing here that goes beyond just building the agent. 00:07:35.100 |
And I think there's oftentimes really high error bars 00:07:40.640 |
that people have when they think about how likely an agent is to work. 00:07:47.660 |
When you're trying to get something built or approved or put into production, 00:07:50.880 |
I think there's a lot of uncertainty and fear around this. 00:07:56.540 |
And I think that relates to this fundamental uncertainty about how reliable agents are. 00:08:11.860 |
So part of the job, whether you're bringing a third-party agent and selling it as a service, 00:08:14.880 |
or whether you're building inside the enterprise yourself, 00:08:18.800 |
is to work to reduce the error bars that people see around the agent. 00:08:27.700 |
And what we've found is that this is where observability and evals matter, for more audiences 00:08:32.920 |
than we would maybe think or we would maybe intend. 00:08:35.740 |
So we have an observability and eval solution called LangSmith. 00:08:39.400 |
We built it for developers so that they could see what's going on inside the agent. 00:08:43.480 |
It's also proved really, really valuable for communicating 00:08:46.940 |
to external stakeholders what's going on inside the agent, 00:08:51.000 |
and how the agent performs and where it messes up 00:08:54.000 |
and where it doesn't mess up, and basically reducing the uncertainty people 00:09:05.860 |
have around what the agent is and what it's actually doing. 00:09:07.920 |
They can see that it's making three, five LLM calls. 00:09:11.820 |
They can see that you're actually being really thoughtful about the steps. 00:09:14.420 |
And then you can benchmark it against different things. 00:09:17.800 |
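As a minimal sketch of the developer side of that, assuming the langsmith Python package is installed and a LANGSMITH_API_KEY is set in the environment; the function below is a made-up stand-in for a real agent step, not anything from the talk:

```python
from langsmith import traceable

# Assumes LANGSMITH_API_KEY (and optionally LANGSMITH_PROJECT) are set in the
# environment so traces are sent to your LangSmith project.

@traceable(name="summarize_ticket")  # each call is recorded as a run in LangSmith
def summarize_ticket(ticket_text: str) -> str:
    # Stand-in for real work; nested @traceable functions or LLM calls
    # would show up as child steps in the trace tree.
    return ticket_text[:100]

print(summarize_ticket("Customer reports that exports fail for large CSV files."))
```

The resulting trace tree, showing every step and LLM call, is the artifact you can put in front of a review board, not just in front of developers.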
And so there's a great story of a user of ours 00:09:20.960 |
who used LangSmith initially to build the agent, 00:09:24.340 |
but then brought it and showed it to the review panel 00:09:26.800 |
as they were trying to get their agent approved for production. 00:09:29.680 |
And they ended the meeting under time, which almost never 00:09:33.080 |
happens if you've been to these review panels. 00:09:39.060 |
And it helped reduce the perception of the risk of the agent. 00:09:56.620 |
Then there's the cost when things go wrong. Similar to the probability of things being right, 00:10:04.500 |
there's a lot of fear, especially in larger enterprises among review boards and managers, 00:10:10.920 |
of the agent doing something bad and causing brand damage or giving away things for free. 00:10:25.780 |
And so I think there's a few UI/UX tricks that people are doing, 00:10:31.580 |
and that successful agents do, to just make this a non-issue. 00:10:36.220 |
So one is to just make it easy to reverse the changes the agent makes. 00:10:52.580 |
This is part of why we see code being one of the first real places agents have taken off. 00:11:06.680 |
Replit does it in a very clever way where every time the agent changes something, they record a checkpoint. 00:11:11.660 |
You can always revert what the agent does. 00:11:15.080 |
And then the second part is having a human in the loop. 00:11:18.440 |
So rather than merging code changes into main directly, the agent opens a pull request. 00:11:30.760 |
There's the human who's approving what the agent does. 00:11:37.120 |
It's a simple pattern, but I think it completely changes the cost calculations 00:11:40.320 |
in people's minds about what the cost of the agent doing 00:11:42.720 |
something bad is, because now it's reversible, 00:11:46.620 |
and there's a human to prevent it from even going in in the first place. 00:11:50.480 |
And so human in the loop is one of the big things 00:11:55.680 |
that we see people finding success and building inside enterprises really leaning into. 00:12:02.860 |
I think deep research is a pretty good example of this. 00:12:06.100 |
If we think about this, there is a period of time 00:12:10.280 |
upfront when you're messaging with deep research before it kicks off. 00:12:12.940 |
It asks you follow-up questions, and you calibrate it on what you want. 00:12:18.960 |
That also makes sure that it gets a better result. 00:12:23.060 |
You get more value from the report because it's more aligned with what you want. 00:12:26.580 |
And then deep research doesn't take this 00:12:29.660 |
and publish it as a blog out on the internet, 00:12:31.560 |
and it doesn't take it and email it to your clients. It gives it back to you to review. 00:12:41.460 |
I think similarly, when you think about code, these human-in-the-loop patterns 00:12:55.360 |
not only reduce the cost of mistakes but also make sure that it yields better results. 00:12:57.360 |
And then again, with code, maybe you're not making a commit straight to main; you're opening a pull request. 00:13:09.000 |
So there are a lot of examples in the general industry that follow some of these patterns, 00:13:16.600 |
and a lot of levers we can pull to try to make our agents more interesting. 00:13:27.160 |
So if this has positive value, then what we really want to do 00:13:31.880 |
is just multiply this a bunch and scale it up a bunch. 00:13:37.520 |
This is where we get to this idea of ambient agents. When we think about agents working 00:13:43.400 |
in this futuristic view, agents working in an enterprise, 00:13:46.820 |
they're doing things in the background. 00:13:49.160 |
They're not being kicked off by humans; they're triggered by events, with humans still in the loop. 00:13:56.440 |
And I think the reason that this is so powerful 00:13:58.660 |
is that it scales up this positive expected value. 00:14:05.660 |
If I'm chatting with an agent, maybe I can have two chat boxes open at the same time. 00:14:13.780 |
But between chat agents, which I would argue we've mostly seen, 00:14:17.080 |
and ambient agents, one big difference is ambient agents aren't kicked off one at a time by a human. 00:14:21.140 |
That lets us scale ourselves instead of a one-to-one ratio. 00:14:26.480 |
And so the concurrency of these agents that can be running at the same time goes way up. 00:14:35.040 |
So with chat, you have this UX expectation of a fast response. 00:14:42.140 |
Ambient agents don't have that latency requirement, because they're triggered without you even knowing. 00:14:53.680 |
So you can start to build up a bigger body of work, 00:14:59.600 |
going from a quick answer to changing a whole file or making a new repo or any of that. 00:15:02.960 |
And so instead of this agent just responding directly 00:15:05.420 |
or calling a single tool call, which usually happens 00:15:07.600 |
in these chat applications because of the latency requirements, these agents can take many more steps. 00:15:12.020 |
And so the value can start increasing in terms of the amount of work being done. 00:15:16.520 |
And then the other thing that I want to emphasize 00:15:19.780 |
is that there's still a UX for interacting with these ambient agents. 00:15:28.340 |
Because when people hear autonomous, 00:15:30.340 |
they think the cost of this thing doing something bad goes way up. 00:15:34.480 |
Because I'm not going to be able to oversee it. 00:15:40.280 |
And so ambient does not mean fully autonomous. 00:15:43.220 |
And so there are a lot of different human in the loop 00:15:46.060 |
interaction patterns that you can bring into these ambient agents. 00:15:51.180 |
There can be an approve-reject pattern where for certain tools, 00:15:54.220 |
you want to explicitly say, yes, it's OK to call this tool. 00:15:57.180 |
You might want to edit the tool call that it's making. 00:16:03.000 |
You might want to give it the ability to ask you questions. 00:16:06.880 |
You can provide more info if it gets stuck halfway through. 00:16:10.080 |
And then time travel is something that we call human on the loop, where 00:16:17.620 |
you can reverse back to step 10 and say, hey, no, 00:16:20.880 |
resume from here but do this other thing slightly differently. 00:16:24.200 |
And so human in the loop, we think, is super, super important. 00:16:27.820 |
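As a minimal sketch of the approve-reject pattern using LangGraph's interrupt mechanism (this assumes a recent langgraph release that ships `interrupt` and `Command`; the state fields, node, and draft content are invented for illustration):

```python
from typing import TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END
from langgraph.types import Command, interrupt


class State(TypedDict):
    draft_email: str
    status: str


def send_email(state: State) -> dict:
    # Pause the run and surface the proposed action to a human.
    decision = interrupt({"action": "send_email", "draft": state["draft_email"]})
    if decision == "approve":
        # ... actually send the email here ...
        return {"status": "sent"}
    return {"status": "rejected"}


builder = StateGraph(State)
builder.add_node("send_email", send_email)
builder.add_edge(START, "send_email")
builder.add_edge("send_email", END)

# interrupt() needs a checkpointer so the paused run can be resumed later.
graph = builder.compile(checkpointer=MemorySaver())

config = {"configurable": {"thread_id": "thread-1"}}
graph.invoke({"draft_email": "Hi, following up...", "status": "pending"}, config)
# ...later, after a human approves it in some inbox UI:
graph.invoke(Command(resume="approve"), config)
```

For the time-travel pattern he mentions, graphs compiled with a checkpointer also expose `get_state_history(config)`, so you can pick an earlier checkpoint and resume the run from that point instead of from the latest state.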
The other thing that I want to call out just briefly is these recent coding agents. 00:16:43.600 |
I think these are good examples of sync to async agents. 00:16:50.580 |
People use a term like async coding agents. 00:16:54.700 |
But I think this move from sync to async agents 00:16:57.760 |
is a natural progression if you think about it. 00:17:05.460 |
The future is probably these autonomous agents working 00:17:07.520 |
in the background, still pinging us when they need help. 00:17:11.460 |
In the middle, the human kicks it off, uses that human in the loop 00:17:14.340 |
at the start to calibrate on what you want it to do. 00:17:16.780 |
And so I think that that table I showed of chat and ambient 00:17:20.900 |
is actually probably missing a column in the middle. 00:17:25.220 |
Anyways, an example of some of the UXs that we think 00:17:30.260 |
are interesting here is what we call the agent inbox, which 00:17:32.560 |
is where you surface all the actions that the agent wants to take, for a human to review. 00:17:39.940 |
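A rough sketch of how an inbox like that could be backed, reusing the interrupt-enabled graph from the example above (the thread IDs and printed fields are illustrative, not a prescribed API for any inbox product):

```python
# For each conversation thread, check whether the agent is paused waiting
# on a human, and surface the pending action if so.
for thread_id in ["thread-1", "thread-2"]:  # illustrative thread IDs
    config = {"configurable": {"thread_id": thread_id}}
    snapshot = graph.get_state(config)
    for task in snapshot.tasks:
        for pending in task.interrupts:
            print(f"[{thread_id}] pending approval: {pending.value}")
```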
To kind of tie this together and make it really concrete: 00:17:44.500 |
email, I think, is a really natural place for ambient agents. 00:17:58.200 |
An email agent can run in the background, but you still probably want the human, the user, 00:17:59.960 |
to approve any emails that go out or any calendar events that 00:18:03.220 |
get sent, depending on your level of comfort. 00:18:11.640 |
We've built an email assistant agent, and we've used it to test out a lot of these things. 00:18:14.240 |
If people want to try it out, there is a QR code. 00:18:19.620 |
And I think it's not the only example of ambient agents. 00:18:31.120 |
I'm not sure if there's time for questions or not. 00:18:48.880 |
[Audience question] Everyone is talking about agents, but only code-generating agents seem to have taken off. 00:18:54.480 |
Is it because you can measure what you have done? 00:18:59.380 |
Because for all other agents, you can do a lot of stuff, but it's hard to measure. 00:19:13.080 |
So the measure thing, I think, probably more so. 00:19:15.380 |
A lot of the large model labs train on a lot of coding 00:19:19.300 |
data because you can test whether it's correct or not. 00:19:25.680 |
So math and code are two examples of verifiable domains. 00:19:29.920 |
What does it mean for an essay to be correct? It's hard to say. 00:19:35.580 |
With verifiable domains, you're able to bootstrap a lot of training data. 00:19:37.580 |
And so there's a lot of coding data in the models already. 00:19:41.800 |
That makes the agents that use those models better at that. 00:19:44.660 |
Then the second part: I do think code lends itself 00:19:47.860 |
naturally to this commit and this draft and this preview thing. 00:20:09.480 |
Like, if you put the human in the loop at every step, it's slow, 00:20:18.540 |
but with drafts, the agent does a chunk of work and the human's still in the loop at key points. 00:20:20.920 |
And first drafts, I think, are a great mental model for that. 00:20:24.460 |
So anything where there's first drafts, legal, writing, 00:20:27.840 |
code, I think that's a little bit more generalizable. 00:20:32.480 |
The verifiable stuff, that's a little bit tougher.