3 ingredients for building reliable enterprise agents - Harrison Chase, LangChain/LangGraph

00:00:00.000 | .

00:00:01.000 | SPEAKER 1: I want to talk today a little bit

00:00:16.600 | about trying to build reliable agents in the enterprise.

00:00:19.920 | This is something we work with a bunch of people for,

00:00:22.680 | both people building as developers

00:00:25.240 | inside of an enterprise, looking to build agents

00:00:27.480 | for their company, but also people

00:00:29.440 | who are looking to build solutions and bring them

00:00:32.020 | and sell them into enterprises.

00:00:35.060 | And so I wanted to talk a little bit about some

00:00:37.420 | of what we see kind of being the success tips and tricks

00:00:40.720 | for making this happen.

00:00:42.100 | So the vision of the future that I and other people, I think,

00:00:46.880 | have a similar view of for agents is that there'll

00:00:49.600 | be a lot of them.

00:00:50.260 | They'll be running around the enterprise

00:00:51.580 | doing different things.

00:00:52.840 | They'll be an agent for every different task.

00:00:55.180 | We'll be coordinating with them.

00:00:56.440 | We'll be kind of like a manager, a supervisor.

00:00:58.880 | And so how do we get to that vision?

00:01:02.600 | And what parts of this will kind of arrive before the others?

00:01:10.960 | And so I was thinking about this question.

00:01:12.720 | What makes some agents kind of succeed in the enterprise

00:01:15.920 | and some fail?

00:01:17.040 | And I was chatting with my friend Asaf.

00:01:19.640 | He's the head of AI at Monday.

00:01:21.080 | He also wrote GPT Researcher.

00:01:22.580 | It's a great open source package.

00:01:24.880 | I was chatting with him a few weeks ago.

00:01:27.220 | And a lot of the ideas here are borrowed

00:01:29.800 | from that conversation.

00:01:30.820 | He'll probably write a blog post about this

00:01:33.500 | with a slightly different framing, which I would encourage

00:01:35.680 | everyone to check out.

00:01:36.880 | So I just want to give him a massive shout out.

00:01:38.520 | And if you have the opportunity to chat with him,

00:01:40.820 | you should definitely take that opportunity.

00:01:44.560 | Thinking about it from first principles,

00:01:46.400 | like what makes agents successful in the enterprise?

00:01:51.680 | It'll make it successful.

00:01:52.620 | It will make it more likely to be adopted.

00:01:54.480 | The greater the value of the agent if it's right.

00:01:57.160 | These probably aren't going to sound kind of like earth

00:01:59.680 | shattering, but hopefully we'll get to some interesting points.

00:02:02.060 | The more value it provides when it's right,

00:02:04.220 | the more likely it will be to be adopted.

00:02:06.800 | The more likely it is to have success,

00:02:09.140 | the more likely it will be to be adopted.

00:02:11.560 | And then the cost if it's wrong.

00:02:13.260 | If there's big costs when it's wrong,

00:02:15.540 | then it will be less likely to be adopted.

00:02:17.300 | So I think these are three kind of like ingredients,

00:02:19.640 | which are pretty simple and pretty basic,

00:02:21.580 | but I think provide an interesting kind of like first principles

00:02:23.960 | approach for how to think about building agents

00:02:26.100 | and what types of agents kind of like find success.

00:02:29.380 | And I say in the enterprise here,

00:02:31.120 | but I also think this applies just generally

00:02:33.080 | within kind of like society.

00:02:36.480 | If we want to try to put this into a fun little equation,

00:02:39.020 | we can multiply the probability that something succeeds times

00:02:43.180 | the value that you get when it succeeds,

00:02:44.740 | and then do the opposite for the cost when it's wrong.

00:02:47.560 | And of course, this needs to be greater

00:02:49.140 | than the cost of running the agent for you

00:02:51.040 | to want to put it into production.

00:02:52.440 | So yeah, fun little kind of like stats slash math formula.

00:02:58.720 | So how can we build agents that score higher on this?

00:03:01.680 | Because this hasn't been anything kind

00:03:03.480 | of like earth shattering so far.

00:03:04.860 | Hopefully, we'll get to some fun insights

00:03:06.180 | when we talk about how to make that equation kind of like go up.

00:03:11.020 | So how can we increase the value of things when they go right?

00:03:17.980 | And what types of agents have higher value?

00:03:20.720 | So part of this is choosing kind of like problems

00:03:24.520 | where there is really high kind of like value.

00:03:26.980 | So a lot of the agents that have been successful so far--

00:03:30.120 | Harvey in the legal space is one of them.

00:03:33.880 | In the finance space, we see stuff around research

00:03:36.240 | and summarization.

00:03:36.980 | These are high value work tasks.

00:03:39.300 | People pay a lot of money for lawyers

00:03:42.320 | and for research and investment research.

00:03:45.620 | And so these are examples of what I would

00:03:47.160 | say kind of like high value tasks are.

00:03:51.120 | There's other ways to kind of like improve

00:03:53.080 | the value of what you're working on besides just switching

00:03:55.520 | kind of like the vertical completely.

00:03:56.880 | And I think we're starting to see some of this,

00:03:58.760 | especially more recently.

00:04:00.380 | So if we think about RAG or if we think about kind

00:04:02.360 | of like existing question-- or older school question answering

00:04:06.380 | solutions, they would often respond kind of like quickly,

00:04:08.680 | ideally within five seconds, and give you a quick answer.

00:04:11.220 | And we're starting to see a trend towards things

00:04:13.320 | like deep research, which go and run for an extended period

00:04:17.060 | of time.

00:04:17.720 | We're seeing the same with code.

00:04:19.060 | We start with cursor.

00:04:19.900 | It has kind of like inline autocomplete,

00:04:21.600 | maybe some chat question answering there.

00:04:23.820 | In the past like three weeks, there's been, what,

00:04:26.340 | seven different examples of these ambient agents

00:04:28.540 | that run in the background for like hours at time.

00:04:30.980 | And I think this speaks to ways that people are trying

00:04:33.240 | to get their agents to provide more value.

00:04:35.300 | They're getting them to do more work, pretty basic.

00:04:39.260 | But I do think that like as we have--

00:04:41.000 | if you think about this future of agents working

00:04:43.440 | and what that means, that doesn't mean a copilot.

00:04:46.360 | That means something working more autonomously

00:04:48.320 | in the background, doing more amounts of work.

00:04:52.140 | So besides kind of like focusing on areas or verticals

00:04:56.280 | that provide value, I think you can also absolutely

00:04:59.380 | reshift the UI, UX, the interaction pattern

00:05:02.800 | of what you're building to be kind of like more long term

00:05:05.960 | and do more kind of like substantial patterns of work.

00:05:11.140 | Let's talk about now the probability of success.

00:05:13.120 | How do we make this go up?

00:05:15.200 | So there's two different aspects I want to talk about here.

00:05:20.000 | One, I think, is about the reliability of agents.

00:05:22.180 | If you've built agents before, it's

00:05:25.720 | easy to get something that works in a prototype.

00:05:27.800 | It runs once, great.

00:05:28.720 | You can make a video, put it on Twitter.

00:05:30.140 | But it's hard to make it work reliably.

00:05:31.600 | Put it in production.

00:05:34.140 | And I think the core thing that we've seen--

00:05:37.060 | and by the way, for some parts of--

00:05:41.020 | for some types of agents, that's totally fine.

00:05:43.300 | You can have agents that run for a while

00:05:47.560 | and you don't know what they do, and that's totally fine.

00:05:51.580 | Especially in the enterprise, we see oftentimes

00:05:53.800 | that people want more predictability, more control

00:05:56.740 | over what steps actually happen inside the agents.

00:05:59.620 | Maybe they always want to do step A after step B.

00:06:02.760 | And so if you prompt an agent to do that, great.

00:06:05.080 | It might do that like 90% of the time.

00:06:07.080 | You don't know what the LLM will do.

00:06:08.580 | If you put that in a deterministic kind of workflow or code,

00:06:12.120 | then it will always do that.

00:06:13.400 | And so especially in the enterprise,

00:06:14.740 | we see that there are workflow-like things

00:06:18.580 | where you need more controllability, more predictability,

00:06:21.980 | than you get by just prompting.

00:06:23.660 | And so what we've seen is the solution for this

00:06:26.300 | is basically make more and more of your agent deterministic.

00:06:30.200 | There is this concept of workflows versus agents.

00:06:32.840 | Anthropic wrote a great blog post on this

00:06:35.160 | that I'd encourage you to check out.

00:06:37.380 | I would argue that instead of workflows versus agents,

00:06:40.080 | it's oftentimes workflows and agents.

00:06:42.960 | We see that parts of an agentic system

00:06:46.120 | are sometimes looping, calling a tool.

00:06:48.440 | And sometimes they're just doing A after B after C.

00:06:50.880 | An example of this is when you think

00:06:52.140 | about multi-agent architectures.

00:06:53.780 | If you think about an architecture that has agent A,

00:06:56.880 | and then after agent A finishes, you always call agent B.

00:06:59.760 | Is that a workflow?

00:07:01.040 | Is that an agent?

00:07:01.780 | It's this middle ground.

00:07:03.580 | And so as we think about building tools for this future,

00:07:07.480 | one of the things that we've released is Langraph.

00:07:10.380 | Langraph is an agent framework.

00:07:11.900 | It's very different from other agent frameworks,

00:07:13.940 | where it really leans in to this spectrum of workflows

00:07:17.780 | and agents and allows you to be wherever is best

00:07:21.180 | for your application on that curve.

00:07:24.180 | And where on that curve is best totally

00:07:26.100 | depends on the application that you're building.

00:07:28.200 | There is another thing that is different from just building

00:07:34.280 | and changing the agent.

00:07:35.100 | And I think there's oftentimes really high error bars

00:07:40.640 | that people have when they think about how likely an agent

00:07:43.060 | is to work.

00:07:43.740 | I think this technology is new when trying

00:07:47.660 | to get something built or approved or put into production

00:07:50.080 | inside an enterprise.

00:07:50.880 | I think there's a lot of uncertainty and fear around this.

00:07:56.540 | And I think that relates to this fundamental uncertainty

00:08:01.700 | around how this agent is performing.

00:08:04.080 | And so besides just making it better,

00:08:07.100 | a really important thing that we see

00:08:09.080 | to do inside the enterprise, whether you're

00:08:11.860 | bringing a third party agent and selling it as a service,

00:08:14.880 | or whether you're building inside the enterprise yourself,

00:08:18.800 | is to work to reduce the way that people see the error bars

00:08:23.800 | of how this agent performs.

00:08:25.740 | So what I mean by that specifically

00:08:27.700 | is that this is where observability and evals

00:08:30.160 | actually plays a slightly different role

00:08:32.920 | than we would maybe think or we would maybe intend.

00:08:35.740 | So we have an observability and eval solution called LangSmith.

00:08:39.400 | We built it for developers so that they could see what's

00:08:41.740 | going on inside their agent.

00:08:43.480 | It's also proved really, really valuable for communicating

00:08:46.940 | to external shareholders what's going on inside the agent

00:08:51.000 | and how the agent performs and where it messes up

00:08:54.000 | and where it doesn't mess up and basically communicate

00:08:56.340 | these kind of patterns.

00:08:58.400 | And so again, the observability part,

00:09:00.660 | you can just see every step that's

00:09:02.040 | happening inside the agent.

00:09:03.460 | This reduces the uncertainty that people

00:09:05.860 | have around what the agent and what it's actually doing.

00:09:07.920 | They can see that it's making three, five LLM calls.

00:09:11.140 | It's not just one.

00:09:11.820 | They're actually being really thoughtful about the steps

00:09:13.500 | that are happening.

00:09:14.420 | And then you can benchmark it against different things.

00:09:17.800 | And so there's a great story of a user of ours

00:09:20.960 | who used LangSmith initially to build the agent,

00:09:24.340 | but then brought it and showed it to the review panel

00:09:26.800 | as they were trying to get their agent approved

00:09:28.380 | to go into production.

00:09:29.680 | And they ended the meeting under time, which almost never

00:09:33.080 | happens if you've been to these review panels.

00:09:35.560 | And they showed them basically everything

00:09:38.260 | inside LangSmith.

00:09:39.060 | And it helped reduce the perception or the risk

00:09:43.060 | that people had of these agents.

00:09:47.800 | And then the last thing I want to talk about

00:09:51.160 | is the cost of something if it's wrong.

00:09:56.620 | There's similar to the probability of things being right,

00:10:00.260 | this plays an outsized role, especially

00:10:04.500 | in larger enterprises among review boards and managers,

00:10:07.240 | people's perception of these agents.

00:10:08.800 | People hear stories of agents going wild

00:10:10.920 | and causing brand damage or giving away things for free.

00:10:16.420 | I think there's an outsized perception

00:10:19.060 | of what could happen if things go bad.

00:10:25.780 | And so I think there's a few UI/UX tricks that people are doing

00:10:31.580 | and that successful agents have to just make this a non-issue.

00:10:36.220 | So one is just make it easy to reverse the changes

00:10:39.380 | that the agent makes.

00:10:40.700 | So if you think about code-- and this

00:10:42.040 | is a screenshot of ReplitAgent.

00:10:44.000 | It's a diff that it generates a PR.

00:10:47.400 | Code's really easy to revert.

00:10:48.640 | You go back to the previous commit.

00:10:50.860 | And so I think that's part of the reason

00:10:52.580 | why we see code being one of the first kind of real places

00:10:57.040 | that you can apply agents besides the fact

00:10:58.940 | that the models are trained on it.

00:11:00.080 | It's also that when you use these agents,

00:11:02.840 | you create all these commits.

00:11:04.520 | And well, it depends how you do it.

00:11:06.680 | Replit does it in a very clever way where every time they change

00:11:09.320 | a file, they save it as a new commit.

00:11:10.740 | So you can always go back.

00:11:11.660 | You can always revert kind of like what the agent does.

00:11:15.080 | And then the second part is having a human in the loop.

00:11:18.440 | So rather than merging code changes into main directly,

00:11:24.140 | open up PR.

00:11:24.920 | That's putting the human in the loop.

00:11:26.400 | And so then the effect of the agent--

00:11:29.080 | it's not kind of like making changes.

00:11:30.760 | There's the human who's kind of approving what the agent does.

00:11:33.920 | And this seems maybe a little subtle,

00:11:37.120 | but I think it completely changes the cost calculations

00:11:40.320 | in people's minds about what the cost of the agent doing

00:11:42.720 | something bad is because now it's reversible,

00:11:45.260 | and you have a human who is going

00:11:46.620 | to prevent it from even going in in the first place

00:11:48.780 | if it's bad.

00:11:50.480 | And so human in the loop is one of the big things

00:11:52.620 | that we see people selling these enterprises

00:11:55.680 | and building inside enterprises really leaning into.

00:11:59.440 | So to make this a little bit more concrete,

00:12:01.340 | what are some examples of this?

00:12:02.860 | I think deep research is a pretty good example of this.

00:12:06.100 | If we think about this, there is a period of time

00:12:10.280 | upfront when you're messaging with deep research

00:12:12.000 | that you go back and forth.

00:12:12.940 | It asks you follow-up questions, and you calibrate

00:12:14.960 | on what you want to research.

00:12:16.560 | That puts the human in the loop.

00:12:18.960 | It also makes sure that it gets a better result.

00:12:21.340 | So it increases the value that you're

00:12:23.060 | going to get from the report because it's more aligned

00:12:24.960 | with what you actually want.

00:12:26.580 | And then deep research-- it doesn't take this

00:12:29.660 | and publish it as a blog out in the internet,

00:12:31.560 | or it doesn't take it and email it to your clients.

00:12:33.220 | It produces just a report that you can read

00:12:35.740 | and decide what to do with.

00:12:36.800 | So it's not actually doing anything.

00:12:38.800 | It's up to you to take that and do things.

00:12:41.460 | I think similarly, when you think about code,

00:12:44.200 | it's another great example of--

00:12:47.560 | so Claude code also has this ability

00:12:50.140 | where it asks questions.

00:12:51.140 | It clarifies things.

00:12:52.860 | This is to both keep the human in the loop,

00:12:55.360 | but also make sure that it yields better results.

00:12:57.360 | And then again, with code, maybe you're not making a commit

00:13:00.040 | every time you change things.

00:13:01.140 | But it's on a separate branch.

00:13:02.440 | You open up PR.

00:13:03.160 | You're not pushing directly to master.

00:13:05.700 | And so I think these are examples of things

00:13:09.000 | in the general industry that follow some of these patterns.

00:13:14.540 | So OK.

00:13:15.040 | So we've figured out a few levers that we

00:13:16.600 | can pull to try to make our agents more interesting

00:13:21.620 | to be deployed in the enterprise.

00:13:23.720 | What next?

00:13:24.260 | What next is how do we scale that?

00:13:27.160 | So if this has positive value, then what we really want to do

00:13:31.880 | is just multiply this a bunch and scale it up a bunch.

00:13:34.900 | And I think this speaks to the concept

00:13:37.520 | of ambient agents, which is when we think about agents working

00:13:43.400 | in this futuristic view, agents working in an enterprise doing

00:13:46.820 | things in the background, they're doing things in the background.

00:13:49.160 | They're not being kicked off by humans still in the loop.

00:13:52.900 | They're being triggered by different events.

00:13:56.440 | And I think the reason that this is so powerful

00:13:58.660 | is that it scales up this positive expected value

00:14:01.680 | thing even more than we can.

00:14:04.000 | Like I can only really have one--

00:14:05.660 | maybe I can have two chat boxes open at the same time.

00:14:08.440 | But now there can be hundreds of these

00:14:10.680 | running in the background.

00:14:12.220 | And so when we think about the difference

00:14:13.780 | between chat agents, which I would argue we've mostly seen,

00:14:17.080 | and ambient agents, one big difference is ambient agents

00:14:19.840 | are triggered by events.

00:14:21.140 | That lets us scale ourselves instead of a one-to-one.

00:14:23.300 | It's now a one-to-many conversation

00:14:24.840 | that we can be happening.

00:14:26.480 | And so the concurrences of these agents that can be running

00:14:28.880 | goes from one to unlimited.

00:14:32.600 | The latency requirements also change.

00:14:35.040 | So when chat, you have this kind of like UX expectation

00:14:37.900 | that it responds really, really quickly.

00:14:40.080 | And that's not the case with ambient agents,

00:14:42.140 | because they're triggered without you even knowing.

00:14:43.980 | So how do you know?

00:14:46.080 | How do you even care how long it's running?

00:14:47.860 | And so what does this let you do?

00:14:49.760 | Why does this matter?

00:14:50.420 | This lets you do more complex operations.

00:14:52.420 | So you can do more things.

00:14:53.680 | So you can start to build up a bigger body of work.

00:14:57.260 | You can go from changing one line of code

00:14:59.600 | to changing a whole file or making a new repo or any of that.

00:15:02.960 | And so instead of this agent just responding directly

00:15:05.420 | or calling a single tool call, which usually happens

00:15:07.600 | in these chat applications because of the latency requirements,

00:15:10.100 | it can now do these more complex things.

00:15:12.020 | And so the value can start kind of like increasing in terms

00:15:14.860 | of what you're doing.

00:15:16.520 | And then the other thing that I want to emphasize

00:15:19.780 | is that there's still kind of like a UX for interacting

00:15:23.080 | with these agents.

00:15:24.780 | So ambient does not mean fully autonomous.

00:15:26.680 | And this is really, really important.

00:15:28.340 | Because autonomous-- when people hear autonomous,

00:15:30.340 | they think the cost of this thing doing something bad

00:15:32.920 | is really high.

00:15:34.480 | Because I'm not going to be able to oversee it.

00:15:37.020 | I don't know what's going on.

00:15:38.300 | How do-- it could go out there and run wild.

00:15:40.280 | And so ambient does not mean fully autonomous.

00:15:43.220 | And so there are a lot of different kind of like human in the loop

00:15:46.060 | interaction patterns that you can bring into these kind of like

00:15:49.120 | background, these ambient agents.

00:15:51.180 | There can be an approve-reject pattern where for certain tools,

00:15:54.220 | you want to explicitly say, yes, it's OK to call this tool.

00:15:57.180 | You might want to edit the tool that it's calling.

00:15:59.220 | So if it messes up a tool call, you

00:16:00.720 | can actually just correct it in the UI.

00:16:03.000 | You might want to give it the ability to kind of like ask

00:16:05.280 | questions so that you can answer them.

00:16:06.880 | You can provide more info if it gets stuck kind of like halfway

00:16:09.420 | through.

00:16:10.080 | And then time travel is something that we call human on the loop

00:16:13.340 | as well.

00:16:13.840 | So this is after the agents run.

00:16:15.000 | If it messed up on step like 10 out of 100,

00:16:17.620 | you can reverse back to step 10 and say, hey, no,

00:16:20.880 | resume from here but do this other thing slightly differently.

00:16:24.200 | And so human in the loop, we think, is super, super important.

00:16:27.820 | The other thing that I want to call out just briefly

00:16:30.020 | is I think there's this intermediary state

00:16:33.700 | where we're starting to be right now.

00:16:35.880 | I wouldn't call deep research or cloud code

00:16:39.300 | or any of these coding agents ambient agents

00:16:41.600 | because they're still triggered by a human.

00:16:43.600 | But I think these are good examples of sync to async agents.

00:16:48.120 | And so factory is a coding agent.

00:16:50.580 | They use a term kind of like async coding agents.

00:16:53.060 | And I really like that.

00:16:54.700 | But I think this kind of like sync to async agents

00:16:57.760 | is a natural progression if you think about it.

00:16:59.720 | Like right now to start--

00:17:01.040 | or a year ago, everything was a sync agent.

00:17:03.280 | We were chatting with it.

00:17:04.100 | It was very much in the moment.

00:17:05.460 | The future is probably these autonomous agents working

00:17:07.520 | in the background, still pinging us when they need help.

00:17:09.640 | But there's this intermediate state where

00:17:11.460 | the human kicks it off, uses that kind of human in the loop

00:17:14.340 | at the start to calibrate on what you want it to do.

00:17:16.780 | And so I think that that table I showed of like chat and ambient

00:17:20.900 | is actually probably missing a column in the middle.

00:17:22.840 | That's like these sync to async agents.

00:17:25.220 | Anyways, an example of some of the UXs that we think

00:17:28.240 | can be interesting for these ambient agents

00:17:30.260 | are basically what we call agent inbox, which

00:17:32.560 | is where you surface all the actions that the agent wants

00:17:34.960 | to take that need your approval.

00:17:36.080 | And then you can go in and approve, reject,

00:17:37.780 | leave feedback, things like that.

00:17:39.940 | Just kind of tie this together and make it really concrete

00:17:42.620 | what I mean by ambient agents.

00:17:44.500 | Email, I think, is a really natural place for ambient agents.

00:17:47.120 | These agents can listen to incoming emails.

00:17:49.580 | Those are events.

00:17:50.680 | They can run on however many emails come in.

00:17:53.020 | So that's, in theory, unlimited.

00:17:55.900 | But you still probably want an agent--

00:17:58.200 | or you still probably want the human, the user,

00:17:59.960 | to approve any emails that go out or any calendar events that

00:18:03.220 | get sent, depending on your level of comfort.

00:18:06.300 | And so this is a concrete thing.

00:18:09.200 | I actually built one that I have myself.

00:18:11.640 | We've used it to kind of test out a lot of these things.

00:18:14.240 | If people want to try it out, there is a QR code

00:18:16.280 | that you can scan and get the GitHub repo.

00:18:17.980 | It's all open source.

00:18:19.620 | And I think it's not the only example of ambient agents,

00:18:23.960 | but it's one that I've built myself,

00:18:26.040 | and so we talk a lot about internally.

00:18:29.820 | That's all I have.

00:18:31.120 | I'm not sure if there's time for questions or not.

00:18:33.700 | One or two questions, if people have them.

00:18:35.740 | So my question is, although everybody's

00:18:48.880 | talking about agents, but only code generating agents

00:18:52.640 | are the ones who are getting funding,

00:18:54.480 | is it because you can measure what you have done

00:18:57.580 | and you can reverse what you have done.

00:18:59.380 | But for all other agents, you can do a lot of stuff,

00:19:02.160 | but you cannot measure what you have done.

00:19:03.820 | You cannot reverse what you have done.

00:19:05.500 | Yeah.

00:19:06.000 | I think there's a variety of reasons.

00:19:09.380 | I think those two measure and--

00:19:12.340 | well, OK.

00:19:13.080 | So the measure thing, I think probably more so.

00:19:15.380 | You can-- a lot of the large model labs train on a lot of coding

00:19:19.300 | data because you can test whether it's correct or not.

00:19:21.420 | You can run it, see if it compiles.

00:19:22.800 | Same with math data.

00:19:23.680 | Math is very-- it's verifiable, right?

00:19:25.680 | So math and code are two examples of verifiable domains.

00:19:28.420 | Essay writing is less verifiable.

00:19:29.920 | What does it mean for an essay to be correct?

00:19:31.580 | That's far more ambiguous.

00:19:33.080 | And so because of these verifiable things,

00:19:35.580 | you're able to bootstrap a lot of training data.

00:19:37.580 | And so there's a lot of training data in the models already

00:19:40.040 | about code.

00:19:40.540 | And so the models are better at that.

00:19:41.800 | That makes the agents that use those models better at that.

00:19:44.660 | Then the second part, I do think code lends itself

00:19:47.860 | naturally to this commit and this draft and this preview thing.

00:19:51.280 | I think that's more generalizable.

00:19:53.380 | So legal is a great example.

00:19:54.740 | Legal, you can have first drafts of things.

00:19:57.180 | That's very common.

00:19:57.840 | Same with essay writing.

00:19:59.040 | I think the concept of a first draft

00:20:01.260 | is actually a really good UX to aim for.

00:20:03.120 | It lets you do far more.

00:20:04.460 | It also puts the human in the loop.

00:20:05.920 | And so you get this dual kind of--

00:20:09.480 | like if you put the human in a loop at every step,

00:20:11.420 | that doesn't provide any value.

00:20:12.740 | Each step is so small.

00:20:14.080 | So the key is finding these UX patterns

00:20:16.800 | where the agent does a ton of work,

00:20:18.540 | but the human's still in the loop at key points.

00:20:20.920 | And first drafts, I think, are a great mental model for that.

00:20:24.460 | So anything where there's like first drafts, legal, writing,

00:20:27.840 | code, I think that's a little bit more generalizable.

00:20:32.480 | The verifiable stuff, that's a little bit tougher.

00:20:35.840 | Yeah.

00:20:36.340 | Yeah.

00:20:41.140 | Oh, no, I'm good.

00:20:41.820 | I'll talk to you afterwards.

00:20:43.020 | ANDREW BROGDON: Cool.

00:20:43.840 | Yeah, more than happy to chat after.

00:20:45.300 | Thank you all.

00:20:46.220 | ANDREW BROGDON: Thank you.

00:20:46.980 | ANDREW BROGDON: Thank you.

00:20:48.320 | ANDREW BROGDON: Thank you.

00:20:49.180 | ANDREW BROGDON: Thank you.

00:20:50.180 | We'll be right back.