Prompting for Agents | Code w/ Claude

Chapters
0:00 Introduction
6:43 Jeremy's introduction
7:06 Thinking like your agents
15:55 Tools
16:49 Example
17:22 Demo
20:17 Eval
26:08 Q&A
All right. Thank you. Thank you everyone for joining us. So we're picking up with prompting 00:00:10.640 |
for agents. Hopefully you were here for prompting 101 or maybe you're just joining us, but I'll give 00:00:16.520 |
a little intro. My name is Hannah. I'm part of the Applied AI team at Anthropic. Hi, I'm Jeremy. I'm 00:00:21.720 |
on our Applied AI team as well, and I'm a product engineer. So we're going to talk about prompting 00:00:26.440 |
for agents. So we're going to switch gears a little bit, move on from the basics of prompting, 00:00:30.180 |
and talk about how we do this for agents like playing Pokemon. So hopefully you were here 00:00:36.660 |
for prompting 101 or maybe you have some familiarity with basic prompting. So we're not going to 00:00:40.420 |
go over the really kind of basic console prompting or interacting with Claude in the desktop today, 00:00:46.440 |
but just a refresher. We think about prompt engineering as kind of programming in natural language. You're 00:00:51.500 |
thinking about what your agent or your model is going to be doing, what kind of tasks it's accomplishing. 00:00:56.920 |
You're trying to clearly communicate to the agent, give examples where necessary, and give guidelines. 00:01:02.460 |
We do follow a very specific structure for console prompting. I want you to remove this from your 00:01:09.640 |
minds because it could look very different for an agent. So for an agent, you may not be laying out 00:01:13.880 |
this type of very structured prompt. It's actually going to look a lot different. We're going to allow a lot 00:01:19.360 |
of different things to come in. So I'm going to turn it over. I'm going to talk about what agents are, 00:01:23.420 |
and then I'll turn it over to Jeremy to talk about how we do this for agents. So hopefully you have a 00:01:28.540 |
sense in your mind of what an agent is. At Anthropic, we like to say that agents are models using tools in 00:01:34.000 |
a loop. So we give the agent a task, and we allow it to work continuously and use tools as it thinks fit, 00:01:40.960 |
update its decisions based on the information that it's getting back from its tool calls, and continue working 00:01:46.780 |
independently until it completes the task. We keep it as simple as that: there are three pieces. The environment, 00:01:53.520 |
which is where the agent is working; the tools that the agent has; and the system prompt, which is where 00:01:58.220 |
we tell the agent what it should be doing or what it should be accomplishing. And we typically find the 00:02:03.200 |
simpler you can keep this, the better. Allow the agent to do its work, allow the model to be the model, and kind of work through this task. 00:02:08.300 |
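To make "models using tools in a loop" concrete, here is a minimal sketch with the Anthropic Python SDK; the weather tool, prompts, and model name are illustrative assumptions, not anything from the talk.

```python
# Minimal sketch of "a model using tools in a loop" with the Anthropic Python
# SDK. The weather tool, prompts, and model name are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city as a short text summary.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name."}},
        "required": ["city"],
    },
}]

messages = [{"role": "user", "content": "Should I bring an umbrella in London today?"}]

while True:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="You are a helpful assistant. Use tools when you need outside information.",
        tools=tools,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason != "tool_use":
        break  # the model decided it is done

    # Run every tool the model asked for and feed the results back in.
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            result = "Light rain, 14°C"  # stand-in for a real weather API call
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            })
    messages.append({"role": "user", "content": tool_results})

print(response.content[-1].text)
```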
So when do you use agents? You do not always need to use an agent. 00:02:15.040 |
In fact, there are many scenarios in which you won't actually want to use an agent. There are other 00:02:19.740 |
approaches that would be more appropriate. Agents are really best for complex and valuable tasks. It's not 00:02:26.440 |
something you should deploy in every possible scenario. You will not get the results that you want, 00:02:31.040 |
and you'll spend a lot more resources than you maybe need to. So we'll talk a little bit about 00:02:35.780 |
checklist or kind of ways of thinking about when you should be using an agent and maybe you don't want to 00:02:41.420 |
be using an agent. So is the task complex? Is this a task that you, a human, can think through a step-by-step 00:02:47.780 |
process to complete? If so, you probably don't need an agent. You want to use an agent where it's not clear to you how you'll go about accomplishing the task. 00:02:55.680 |
You might know where you want to go, but you don't know exactly how you're going to get there, what tools, 00:03:00.680 |
and what information you might need to arrive at the end state. Is the task valuable? Are you going to get a lot of value 00:03:06.780 |
out of the agent accomplishing this task? Or is this a low-value task? In that case, a simpler workflow 00:03:13.320 |
might be better. You don't really want to be using the resources of an agent unless it's something that's highly 00:03:18.960 |
leveraged. Maybe it's revenue generating. It's something that's really valuable to your user. Again, it's something that's complex. 00:03:24.420 |
The next piece is, are the parts of the task doable? So when you think about the task that has to occur, 00:03:31.000 |
would you be able to give the agent the tools that it needs in order to accomplish this task? If you can't define the tools, 00:03:38.580 |
or if you can't give the agent access to the information or the tool that it would need, you may want to scope the task down. 00:03:44.760 |
If you can define and give to the agent the tools that it would want, that's a better use case for an agent. 00:03:51.020 |
The last thing you might want to think about is the cost of errors, or how easy it is to discover errors. 00:03:56.600 |
So if it's really difficult to correct an error or detect an error, that is maybe not a place where you want the agent to be working independently. 00:04:04.820 |
You might want to have a human in the loop in that case. If the error is something that you can recover from, 00:04:09.600 |
or if it's not too costly to have an error occur, then you might continue to allow the agent to work independently. 00:04:16.180 |
So to make this a little bit more real, we'll talk about a few examples. I'm not going to go through every single one of these, 00:04:23.440 |
but let's pick out a few that will be pretty clear or intuitive for most of us. 00:04:27.180 |
So coding, obviously, all of you are very familiar with using agents and coding. Coding is a great use case. 00:04:33.760 |
We can think about something like a design document, and although you know where you want to get to, which is raising a PR, 00:04:40.760 |
you don't know exactly how you're going to get there. It's not clear to you what you'll build first, how you'll iterate on that, 00:04:46.500 |
what changes you might make along the way, depending on what you find. This is high value, you're all very skilled. 00:04:52.600 |
If an agent is able... This is more like what the midway is like at night. I feel more at home now. 00:05:02.820 |
Claude is great at coding, and this is a high value use case, right? If your agent is actually able to go from a design document to a PR, 00:05:12.300 |
that's a lot of time that you, a highly skilled engineer, are saved, and you're able to then spend your time on something else that's higher leverage. 00:05:19.680 |
So great use case for agents. A couple other examples I'll mention here. Maybe we'll talk about the cost of error. 00:05:27.780 |
So search, if we make an error in the search, there's ways that we can correct that, right? So we can use citations, 00:05:34.060 |
we can use other methods of double checking the results. So if the agent makes a mistake in the search process, 00:05:39.200 |
this is something we can recover from, and it's probably not too costly. Computer use. This is also a place where we can recover from errors. 00:05:46.580 |
We might just go back. We might try clicking again. It's not too difficult to allow Claude just to click a few times until it's able to use the tool properly. 00:05:56.580 |
Data analysis, I think, is another interesting example, kind of analogous to coding. We might know the end result that we want to get to. 00:06:03.200 |
We know a set of insights that we want to gather out of data or a visualization that we want to produce from data. 00:06:08.580 |
We don't know exactly what the data might look like. So the data could have different formats. It could have errors in it. 00:06:14.580 |
It could have granularity issues that we're not sure how to disaggregate. 00:06:19.580 |
We don't know the exact process that we're going to take in analyzing that data, but we know where we want to get in the end. 00:06:24.580 |
So this is another example of a great use case for agents. So hopefully these make sense to you, and I'm going to turn it over to Jeremy now. 00:06:32.960 |
He has some really rich experience building agents at Anthropic, and he's going to share some best practices for actually prompting them well and how to structure a great prompt for an agent. 00:06:42.960 |
Thanks, Hannah. Hi, all. Yeah, so prompting for agents. I think some things that we think about here, I'll go over a few of them. 00:06:51.340 |
We've learned these lessons mostly from building agents ourselves. So some agents that you can try from Anthropic are Claude Code, which works in your terminal and agentically 00:07:00.480 |
browses your files and uses the bash tool to really accomplish tasks in coding. 00:07:05.980 |
Similarly, we have our new advanced research feature in Claude.ai, and this allows you to do hours of research. 00:07:11.860 |
For example, you can find hundreds of startups building agents, or you can find hundreds of potential prospects for your company. 00:07:18.860 |
And this allows the model to do research across your tools, your Google Drive, web search, and stuff like that. 00:07:25.600 |
And so in the process of building these products, one thing that we learned is that you need to think like your agents. 00:07:31.600 |
This is maybe the most important principle. The idea is that essentially you need to understand and develop a mental model of what your agent is doing and what it's like to be in that environment. 00:07:41.280 |
So the environment for the agent is a set of tools and the responses it gets back from those tools. 00:07:46.020 |
In the context of Claude Code, the way you might do this is by actually simulating the process and just imagining if you were in Claude Code's shoes, 00:07:54.220 |
given the exact tool descriptions it has and the tool schemas it has, would you be confused or would you be able to do the task that it's doing? 00:08:01.220 |
If a human can't understand what your agent should be doing, then an AI will not be able to either. 00:08:06.220 |
And so this is really important for thinking about tool design, thinking about prompting, is to simulate and go through their environment. 00:08:12.220 |
Another is that you need to give your agents reasonable heuristics. 00:08:16.220 |
And so Hannah mentioned that prompt engineering is conceptual engineering. 00:08:22.220 |
It's one of the reasons why prompt engineering is not going away and why I personally expect prompting to get more important. 00:08:30.220 |
This is because prompting is not just about text. 00:08:32.220 |
It's not just about the words that you give the model. 00:08:34.220 |
It's about deciding what concepts the model should have and what behaviors it should follow to perform well in a specific environment. 00:08:41.220 |
So for example, Claude Code has the concept of irreversibility. 00:08:45.220 |
It should not take irreversible actions that might harm the user or harm their environment. 00:08:50.220 |
So it will avoid these kinds of harmful actions or anything that might cause irreversible damage to your environment or to your code or anything like that. 00:08:58.220 |
So that concept of irreversibility is something that you need to instill in the model and be very clear about. 00:09:04.220 |
How might the model misinterpret this concept? 00:09:09.220 |
For example, if you want the model to be very eager and you want it to be very agentic, well, it might go over the top a little bit. 00:09:15.220 |
It might misinterpret what you're saying and do more than what you expect. 00:09:19.220 |
And so you have to be very crisp and clear about the concepts you're giving the models. 00:09:23.220 |
Some examples of these reasonable heuristics that we've learned. 00:09:26.220 |
One is that while we were building research, we noticed that the model would often do a ton of web searches when it was unnecessary. 00:09:32.220 |
For example, it would find the actual answer it needed. 00:09:35.220 |
Like maybe it would find a list of scale-ups in the United States. 00:09:39.220 |
And then it would keep going, even though it already had the answer. 00:09:42.220 |
And that's because we hadn't told the model explicitly to stop once it had the answer. 00:09:48.220 |
Similarly, we had to give the model sort of budgets to think about. 00:09:52.220 |
For example, we told it that for simple queries, it should use under five tool calls. 00:09:56.220 |
But for more complex queries, it might use up to 10 or 15. 00:10:00.220 |
So these kinds of heuristics that you might assume the model already understands, you really have to articulate clearly. 00:10:06.220 |
A good way to think about this is that if you're managing maybe a new intern who's fresh out of college and has not had a job before, how would you articulate to them how to get around all the problems they might run into in their first job? 00:10:18.220 |
And how would you be very crisp and clear with them about how to accomplish that? 00:10:22.220 |
That's often how you should think about giving heuristics to your agents, which are just general principles that it should follow. 00:10:28.220 |
They may not be strict rules, but they're sort of practices. 00:10:34.220 |
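As a rough illustration of what such heuristics can look like when written down, here is a hypothetical system-prompt excerpt; the wording and numbers are assumptions, not the production research prompt.

```python
# Hypothetical system-prompt excerpt encoding heuristics like the ones above
# (tool-call budgets, stopping once the answer is found, tolerating imperfect
# sources). The wording and numbers are assumptions, not the production prompt.
RESEARCH_HEURISTICS = """
- For simple queries, use fewer than 5 tool calls; for complex, multi-part
  queries, you may use 10 to 15.
- Once you have found the information needed to answer, stop searching and
  write the answer; do not keep searching after you already have it.
- A perfect source may not exist. If you cannot find one, report the best
  available sources and note the uncertainty.
"""
```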
So as models get more powerful, they're able to handle more and more tools. 00:10:37.220 |
Sonnet 4 and Opus 4 can handle up to 100 tools, even more than that if you have great prompting. 00:10:43.220 |
But in order to use these tools, you have to be clear about which tools it should use for different tasks. 00:10:48.220 |
So for example, for research, we can give the model access to Google Drive. 00:10:51.220 |
We can give it access to MCP tools like Sentry or Datadog or GitHub. 00:10:56.220 |
It can search across all these tools, but the model doesn't know already which tools are important for which tasks, especially in your specific company context. 00:11:05.220 |
For example, if your company uses Slack a lot, maybe it should default to searching Slack for company-related information. 00:11:11.220 |
All these questions about how the model should use tools, you have to give it explicit principles about when to use which tools and in which contexts. 00:11:20.220 |
And this is really important, and it's often something I see where people don't prompt the agent at all about which tool to use, 00:11:26.220 |
and they just give the model some tools with some very short descriptions. 00:11:30.220 |
And then they wonder, like, why isn't the model using the right tool? 00:11:33.220 |
Well, it's likely because the model doesn't know what it should be doing in that context. 00:11:37.220 |
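For example, that tool-routing guidance might be spelled out in the system prompt along these lines; the tool names and rules here are hypothetical, for a company that relies heavily on Slack.

```python
# Hypothetical tool-routing guidance appended to the system prompt alongside
# the tool definitions. The tool names and rules are illustrative.
TOOL_GUIDANCE = """
When choosing tools:
- Default to slack_search for questions about internal discussions or decisions.
- Use google_drive_search for specs, design docs, and planning documents.
- Use web_search only for public, external information.
- If a tool returns nothing useful, try the next most relevant tool instead of
  repeating the same query.
"""
```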
Another point here is that you can guide the thinking process. 00:11:40.220 |
So people often sort of turn extended thinking on and then let their agents run and assume it will get out-of-the-box better performance. 00:11:48.220 |
Most of the time you will get out-of-the-box better performance, but you can squeeze even more performance out of it if you just prompt the agent to use its thinking well. 00:11:56.220 |
So, for example, for search, what we do is tell the model to plan out its search process. 00:12:01.220 |
So in advance, it should decide how complicated is this query, how many tool calls should I use here, what sources should I look for, how will I know when I'm successful. 00:12:10.220 |
We tell it to plan out all these exact things in its first thinking block. 00:12:14.220 |
And then a new capability that the Claude 4 models have is the ability to use interleaved thinking between tool calls. 00:12:21.220 |
So after getting results from the web, we often find that models assume that all web search results are true. 00:12:27.220 |
We haven't told them explicitly that this isn't the case, and so they might take these web results and run with them immediately. 00:12:34.220 |
So one thing we prompted our models to do is to use this interleaved thinking to really reflect on the quality of the search results and decide if they need to verify them, 00:12:42.220 |
if they need to get more information, or if they should add a disclaimer about how the results might not be accurate. 00:12:47.220 |
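A sketch of what this can look like in an API call: extended thinking turned on, plus a system prompt telling the agent how to use its thinking before and between tool calls. The model name, the interleaved-thinking beta header value, and the web_search tool are assumptions; check the current docs.

```python
# Sketch: extended thinking plus prompting for how to use it, before and
# between tool calls. Assumptions: the model name, the interleaved-thinking
# beta header value, and the illustrative client-side web_search tool.
import anthropic

client = anthropic.Anthropic()

web_search_tool = {
    "name": "web_search",
    "description": "Search the web and return result titles, snippets, and URLs.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},
    system=(
        "Before your first tool call, use your thinking to plan: how complex the "
        "query is, roughly how many tool calls you expect to need, what sources "
        "you will look for, and how you will know you are done. After each search, "
        "use your thinking to assess the quality of the results and decide whether "
        "to verify them, search again, or add a disclaimer."
    ),
    tools=[web_search_tool],
    messages=[{"role": "user", "content": "How many bananas fit in the cargo space of a Rivian R1S?"}],
)
```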
Another point when prompting agents is that agents are more unpredictable than workflows or just classification-type prompts. 00:12:56.220 |
Most changes will have unintended side effects. 00:12:59.220 |
This is because agents will operate in a loop autonomously. 00:13:03.220 |
And so, for example, if you tell the agent, you know, keep searching until you find the correct answer. 00:13:08.220 |
You know, find the highest quality possible source and always keep searching until you find that source. 00:13:13.220 |
What you might run into is the unintended side effect of the agent just not finding any sources. 00:13:18.220 |
Maybe this perfect source doesn't exist for the query. 00:13:21.220 |
And so it will just keep searching until it hits its context window. 00:13:24.220 |
And that's actually what we ran into as well. 00:13:26.220 |
And so you have to tell the agent, if you don't find the perfect source, that's okay. 00:13:32.220 |
So just be aware that your prompts may have unintended side effects and you may have to roll those back. 00:13:38.220 |
Another point is to help the agent manage its context window. 00:13:41.220 |
The Claude 4 models have a 200k token context window. 00:13:45.220 |
This is long enough for a lot of long-running tasks. 00:13:47.220 |
But when you're using an agent to do work autonomously, you may hit this context window. 00:13:52.220 |
And there are several strategies you can use to sort of extend the effective context window. 00:13:56.220 |
One of them that we use for Claude Code is called compaction. 00:13:59.220 |
And this is just a tool that the model has that will automatically be called once it hits around 190,000 tokens, so near the context window. 00:14:08.220 |
And this will summarize or compress everything in the context window into a really dense but accurate summary that is then passed to a new instance of Claude, which continues the process. 00:14:19.220 |
And we find that this essentially allows you to run infinitely with Claude Code. 00:14:24.220 |
Occasionally, it will miss details from the previous session. 00:14:27.220 |
But the vast majority of the time, this will keep all the important details and the model will sort of remember what happened in the last session. 00:14:34.220 |
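A minimal sketch of the compaction idea follows; the 190k threshold mirrors the talk, and the token-counting call, prompts, and model name are assumptions about the SDK surface rather than the actual implementation.

```python
# Minimal sketch of compaction: once the conversation approaches the context
# window, summarize it and continue in a fresh conversation seeded with the
# summary. Threshold mirrors the talk; everything else is illustrative.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"
COMPACTION_THRESHOLD = 190_000

def maybe_compact(messages):
    tokens = client.messages.count_tokens(model=MODEL, messages=messages).input_tokens
    if tokens < COMPACTION_THRESHOLD:
        return messages

    # Crude serialization is enough for a sketch; a real system would preserve
    # tool calls and results more carefully.
    transcript = "\n".join(str(m) for m in messages)
    summary = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        system=("You compress agent transcripts into dense but accurate summaries "
                "that preserve every decision, open task, and important detail."),
        messages=[{"role": "user", "content": f"Summarize this session:\n{transcript}"}],
    ).content[0].text

    # Start a new conversation that carries only the summary forward.
    return [{"role": "user", "content": f"Summary of the previous session:\n{summary}"}]
```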
Similarly, you can sort of write to an external file. 00:14:37.220 |
So the model can have access to an extra file. 00:14:40.220 |
And these Claude 4 models are especially good at writing memory to a file. 00:14:44.220 |
And they can use this file to essentially extend their context window. 00:14:48.220 |
Another point is that you can use sub-agents. 00:14:52.220 |
Essentially, if you have agents that are always hitting their context windows, you may delegate some of what the agent is doing to another agent. 00:15:00.220 |
For example, you can have one agent be the lead agent. 00:15:05.220 |
And then sub-agents do the actual searching process. 00:15:08.220 |
Then the sub-agents can compress the results and pass them to the lead agent in a really dense form that doesn't use as many tokens. 00:15:13.220 |
And then the lead agent can give the final report to the user. 00:15:16.220 |
So we actually use this process in our research system. 00:15:19.220 |
And this allows you to sort of compress what's going on in the search. 00:15:23.220 |
And then only use the context window for the lead agent for actually writing the report. 00:15:27.220 |
So this kind of multi-agent system can be effective for limiting the context window. 00:15:34.220 |
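A rough sketch of that lead-agent / sub-agent pattern is below; the prompts, subtasks, and model name are illustrative, and a real sub-agent would also have search tools and run the full tool-use loop shown earlier.

```python
# Rough sketch of a lead agent delegating to sub-agents that return compressed
# findings. Prompts, subtasks, and model name are illustrative; a real
# sub-agent would run the tool-use loop shown earlier with its own tools.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"

def run_subagent(subtask: str) -> str:
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=("You are a research sub-agent. Complete the subtask and return only "
                "a dense summary of your findings with citations, not raw results."),
        messages=[{"role": "user", "content": subtask}],
    )
    return response.content[0].text

subtasks = [
    "Find the cargo volume of the Rivian R1S with the seats folded.",
    "Find the typical dimensions of a banana from an authoritative source.",
]
findings = [run_subagent(t) for t in subtasks]

# The lead agent spends its own context window on synthesis, not raw search results.
report = client.messages.create(
    model=MODEL,
    max_tokens=2048,
    system="You are the lead research agent. Write the final answer from the sub-agent findings.",
    messages=[{"role": "user", "content": "Question: how many bananas fit in a Rivian R1S?\n\nFindings:\n"
               + "\n\n".join(findings)}],
).content[0].text
```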
Another thing to keep in mind is that Claude is great at being an agent already. 00:15:38.220 |
You don't have to do a ton of work at the very beginning. 00:15:40.220 |
So I would recommend just trying out your system with sort of a bare-bones prompt and bare-bones tools 00:15:45.220 |
and seeing where it goes wrong and then working from there. 00:15:48.220 |
Don't sort of assume that Claude can't do it ahead of time because Claude often will surprise you with how good it is. 00:15:54.220 |
I talked already about tool design, but essentially the key point here is you want to make sure that your tools are good. 00:16:03.220 |
Each tool should have a simple, accurate name that reflects what it does. 00:16:06.220 |
You'll have tested it and made sure that it works well. 00:16:09.220 |
It'll have a well-formed description so that a human reading this tool-- 00:16:12.220 |
imagine you give a function to another engineer on your team. 00:16:16.220 |
Would they understand this function and be able to use it? 00:16:19.220 |
You should ask the same question about the agent computer interfaces or the tools that you are giving your agent. 00:16:28.220 |
We also often find that people will give an agent a bunch of tools that have very similar names or descriptions. 00:16:34.220 |
So for example, you give it six search tools and each of the search tools searches a slightly different database, and the model can't tell which one it should use. 00:16:42.220 |
So try to keep your tools fairly distinct and combine similar tools into just one. 00:16:49.220 |
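For instance, a consolidated, well-described search tool might look something like this; the name, sources, and schema are hypothetical.

```python
# Example of a well-formed tool definition in this spirit: a clear name, a
# description another engineer could use without extra context, and one
# consolidated search tool rather than six near-duplicates. The name, sources,
# and schema are hypothetical.
search_tool = {
    "name": "search_company_knowledge",
    "description": (
        "Search the company's internal knowledge sources (Slack, Google Drive, "
        "and the wiki) for content matching a query. Returns up to `limit` "
        "results, each with a title, snippet, source, and URL. Use this for "
        "internal information; use web_search for public information."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Natural-language search query."},
            "source": {
                "type": "string",
                "enum": ["slack", "google_drive", "wiki", "all"],
                "description": "Restrict the search to one source, or 'all'.",
            },
            "limit": {"type": "integer", "description": "Maximum results to return (default 10)."},
        },
        "required": ["query"],
    },
}
```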
So one quick example here is just that you can have an agent, for example, use these different tools 00:16:54.220 |
to first search the inventory in a database, run a query. 00:16:58.220 |
Based on the information it finds, it can reflect on the inventory, think about it for a little bit, 00:17:03.220 |
then decide to generate an invoice, generate this invoice, think about what it should do next, and so on. 00:17:10.220 |
And so this loop involves the agent getting information from the database, which is its external environment, 00:17:14.220 |
using its tools, and then updating based on that information until it accomplishes the task. 00:17:19.220 |
And that's sort of how agents work in general. 00:17:27.220 |
So you can see here that this is our console. 00:17:29.220 |
The console is a great tool for sort of simulating your prompts and seeing what they would look like in a UI. 00:17:34.220 |
And I used this while we were iterating on research to sort of understand what's really going on. 00:17:41.220 |
This is a great way to think like your agents and sort of put yourself in their shoes. 00:17:50.220 |
The prompt involves the researcher going through a research process. 00:17:53.220 |
We tell it exactly what it should plan ahead of time. 00:17:56.220 |
We tell it how many tool calls it should typically use. 00:17:59.220 |
We give it some guidelines about what facts it should think about, 00:18:02.220 |
what makes a high-quality source, stuff like that. 00:18:04.220 |
And then we tell it to use parallel tool calls. 00:18:06.220 |
So, you know, run multiple web searches in parallel at the same time. The demo query here asks how many bananas fit in the cargo space of a Rivian R1S. 00:18:16.220 |
This is not a question that the model will be able to answer from its training data, 00:18:19.220 |
because the Rivian R1S came out very recently. 00:18:22.220 |
It doesn't know in advance all the specifications and everything. 00:18:29.220 |
You'll see that at the very beginning it will think and break down this request. 00:18:32.220 |
And so it realizes, okay, web search is going to be helpful here. 00:18:39.220 |
And you see here it ran two web searches in parallel at the same time. 00:18:45.220 |
That allowed it to get these results back very quickly. 00:18:50.220 |
So it's realizing, okay, I found the banana dimensions. 00:18:53.220 |
I know that the USDA identifies bananas as seven to eight inches long. 00:18:59.220 |
Let me convert these to more standard measurements. 00:19:01.220 |
You can see it's using tool calls interleaved with thinking, 00:19:04.220 |
which is something new that the Claude 4 models can do. 00:19:08.220 |
It's thinking about how many bananas could be packed into the cargo space of the truck. 00:19:16.220 |
You can see here that this is a fairly complex task, 00:19:21.220 |
It's done a bunch of web searches and it will tell you how many bananas it can fit. 00:19:33.220 |
I've seen the model estimate anything between 30,000 and 50,000. 00:19:45.220 |
I think that this sort of approach of testing out your prompt, seeing what tools the model calls, 00:19:52.220 |
reading its thinking blocks, and actually seeing how the model is thinking, 00:19:55.220 |
will often make it really obvious what the issues are and what's going wrong. 00:20:00.220 |
So you'll test it out and you'll just see, okay, maybe the model is using too many tools here. 00:20:04.220 |
Maybe it's using the wrong sources or maybe it's just following the wrong guidelines. 00:20:09.220 |
So this is a really helpful way to sort of think like your agents and make them more concrete. 00:20:23.220 |
Evaluations are really important for any system. 00:20:27.220 |
They're really important for systematically measuring whether you're making progress on your prompt. 00:20:32.220 |
Very quickly, you'll notice that it's difficult to really make progress on a prompt if you don't have an eval 00:20:37.220 |
that tells you meaningfully whether your prompt is getting better and whether your system is getting better. 00:20:42.220 |
But evals are much more difficult for agents. 00:20:49.220 |
They may not always have a predictable process. 00:20:52.220 |
Classification is easier to eval because you can just check the output against a known label. For agents, there are a few principles to keep in mind. 00:21:01.220 |
One is that the larger the effect size, the smaller the sample size you need. 00:21:06.220 |
And so this is sort of just a principle from science in general where if an effect size is very large, 00:21:11.220 |
for example, if a medication will cure people immediately, 00:21:14.220 |
you don't really need a large sample size of a ton of people to know that this treatment is having an effect. 00:21:20.220 |
Similarly, when you change a prompt, if it's really obvious that the system is getting better, you don't need a large eval. 00:21:26.220 |
I often see teams think that they need to set up a huge eval of hundreds of test cases and make it completely automated when they're just starting out building an agent. 00:21:34.220 |
This is a failure mode and it's an anti-pattern. 00:21:36.220 |
You should start out with a very small eval and just run it and see what happens. 00:21:44.220 |
But the important thing is to just get started. 00:21:46.220 |
I often see teams delaying evals because they think that they're so intimidating or that they need such a sort of intense eval to really get some signal. 00:21:54.220 |
But you can get great signal from a small number of test cases. 00:21:57.220 |
You just want to keep those test cases consistent and then keep testing them so you know whether the model and the prompt is getting better. 00:22:06.220 |
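A small eval harness in this spirit can be very little code; in the sketch below, run_agent and grade are stand-ins for your own agent loop and grading logic (programmatic checks or an LLM-as-judge), and the test cases are illustrative.

```python
# A deliberately small eval harness: a handful of realistic, consistent test
# cases and a pass rate you can track over time. run_agent and grade are
# hypothetical stand-ins for your own agent loop and grading logic.
TEST_CASES = [
    {"task": "How many employees does the company have?", "expect": "reports 47"},
    {"task": "Book a flight from SFO to JFK next Tuesday.", "expect": "uses search_flights"},
    {"task": "Summarize last week's incident reports.", "expect": "covers all incidents"},
]

def run_eval(run_agent, grade):
    passed = 0
    for case in TEST_CASES:
        transcript = run_agent(case["task"])
        ok = grade(transcript, case["expect"])
        passed += int(ok)
        print(f"{'PASS' if ok else 'FAIL'}: {case['task']}")
    print(f"{passed}/{len(TEST_CASES)} passed")
```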
Another point is to use realistic tasks. Don't just come up with arbitrary prompts or descriptions or tasks that don't really have any real correlation to what your system will be doing. 00:22:14.220 |
For example, if you're working on coding tasks, you won't want to give the model just competitive programming problems. 00:22:20.220 |
Because this is not what real-world coding is like. 00:22:22.220 |
You'll want to give it realistic tasks that really reflect what your agent will be doing. 00:22:26.220 |
Similarly, in finance, you'll want to sort of take tasks that real people are trying to solve and just use them to evaluate the agent. 00:22:34.220 |
This allows you to really measure whether the model is getting better at the tasks that you care about. 00:22:39.220 |
Another point is that LLM as judge is really powerful, especially when you give it a rubric. 00:22:44.220 |
So agents will have lots of different kinds of outputs. 00:22:46.220 |
For example, if you're using them for search, they might have tons of different kinds of search reports with different kinds of structure. 00:22:52.220 |
But LLMs are great at handling lots of different kinds of structure and text with different characteristics. 00:22:57.220 |
And so one thing that we've done, for example, is give the judge model a clear rubric and then ask it to grade each output against that rubric. 00:23:05.220 |
For example, for search tasks, we might give it a rubric that says, check that the model looked at the right sources. 00:23:13.220 |
In this case, we might say, check that the model guessed that the number of bananas that can fit in a Rivian R1S is between like 10,000 and 50,000. 00:23:22.220 |
Anything outside that range is not realistic. 00:23:25.220 |
So you can use things like that to sort of benchmark whether the model is getting the right answers. 00:23:32.220 |
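A sketch of an LLM-as-judge call with a rubric, using the banana demo as the example; the rubric wording, expected range, and model name are illustrative.

```python
# Sketch of LLM-as-judge with a rubric, using the banana demo as the example.
# The rubric wording, expected range, and model name are illustrative.
import anthropic

client = anthropic.Anthropic()

RUBRIC = """Grade the research report below against this rubric:
1. It cites an authoritative source (e.g. USDA) for banana dimensions.
2. It cites a source for the Rivian R1S cargo volume.
3. Its final estimate is between 10,000 and 50,000 bananas; anything outside
   that range is unrealistic.
For each criterion answer PASS or FAIL, then give OVERALL: PASS or FAIL."""

def judge(report: str) -> str:
    return client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{"role": "user", "content": f"{RUBRIC}\n\nReport:\n{report}"}],
    ).content[0].text
```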
At the end of the day, though, nothing is a perfect replacement for human evals. 00:23:39.220 |
You need to sort of look at the transcripts, look at what the model is doing, 00:23:42.220 |
and sort of understand your system if you want to make progress on it. 00:23:50.220 |
So one example that I sort of talked about is answer accuracy. 00:23:53.220 |
And this is where you just use an LLM as judge to judge whether the answer is accurate. 00:23:57.220 |
So for example, in this case, you might say the agent needs to use a tool to query the number of employees and then report the answer. 00:24:04.220 |
And then you know the number of employees at your company, so you can just check that with an LLM as judge. 00:24:08.220 |
The reason you use an LLM as judge here is because it's more robust to variations. 00:24:12.220 |
For example, if you're just checking for the integer 47 in this case in the output, that is not very robust. 00:24:18.220 |
And if the model writes the number out as text, like forty-seven, you'll grade it incorrectly. 00:24:22.220 |
So you want to use an LLM as judge there to be robust to those minor variations. 00:24:26.220 |
Another way you can eval agents is tool use accuracy. 00:24:31.220 |
And so if you know in advance what tools the model should use or how it should use them, 00:24:35.220 |
you can just evaluate if it used the correct tools in the process. 00:24:39.220 |
For example, in this case, I might evaluate whether the agent used web search at least five times to answer this question. 00:24:46.220 |
And so I could just check in the transcript programmatically, did the tool call for web search appear five times or not? 00:24:52.220 |
Similarly, you might check in this case, in response to the question book a flight, the agent should use the search flights tool. 00:24:59.220 |
And you can just check that programmatically. 00:25:01.220 |
And this allows you to make sure that the right tools are being used at the right times. 00:25:04.220 |
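A sketch of these programmatic checks over transcripts (lists of API responses) is below; the transcript variables and tool names are hypothetical.

```python
# Programmatic tool-use checks over transcripts (lists of API responses).
# research_transcript, booking_transcript, and the tool names are hypothetical.
def tool_calls(transcript, tool_name):
    return [
        block
        for response in transcript
        for block in response.content
        if block.type == "tool_use" and block.name == tool_name
    ]

# "Use web search at least five times to answer this question."
assert len(tool_calls(research_transcript, "web_search")) >= 5

# "In response to 'book a flight', the agent should use the search_flights tool."
assert len(tool_calls(booking_transcript, "search_flights")) >= 1
```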
Finally, a really good eval for agents is TauBench. 00:25:10.220 |
TauBench is a sort of open source benchmark that shows that you can evaluate whether agents reach the correct final state. 00:25:17.220 |
So a lot of agents are sort of modifying a database or interacting with a user in a way where you can say the model should always get to this state at the end of the process. 00:25:27.220 |
For example, if your agent is a customer service agent for airlines and the user asks to change their flight, at the end of the agentic process in response to that prompt, it should have changed the flight in the database. 00:25:40.220 |
And so you can just check at the end of the agentic process. 00:25:44.220 |
Was this row in the database changed to a different date? 00:25:47.220 |
And that can verify that the agent is working correctly. 00:25:50.220 |
This is really robust and you can use it a lot in a lot of different use cases. 00:25:53.220 |
For example, you can check that your database is updated correctly. 00:25:57.220 |
You can check that certain files were modified, things like that, as a way to evaluate the final state that the agent reaches. 00:26:08.220 |
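A sketch of a final-state check in this style, assuming a SQLite bookings table; the table name, columns, and values are hypothetical.

```python
# Sketch of a TauBench-style final-state check: inspect the database after the
# agentic process finishes instead of parsing the transcript. The table name,
# columns, and values are hypothetical.
import sqlite3

def flight_was_changed(db_path: str, booking_id: int, expected_date: str) -> bool:
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT departure_date FROM bookings WHERE id = ?", (booking_id,)
    ).fetchone()
    conn.close()
    return row is not None and row[0] == expected_date

assert flight_was_changed("airline.db", booking_id=123, expected_date="2025-07-01")
```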
Can you talk about building prompts for agents? 00:26:19.220 |
Are you giving it kind of longer prompts first and then iterating? 00:26:25.220 |
And can you show sort of a little bit more on that thought process? 00:26:36.220 |
Yeah, so you can see this is sort of a final prompt that we've arrived at. 00:26:41.220 |
I think the answer to your question is that you start with a short, simple prompt. 00:26:45.220 |
So in this case, I might start with something very short. 00:26:49.220 |
And I'll say, like, search the web to answer the question. 00:26:52.220 |
And I might just say, search the web agentically. 00:27:05.220 |
And so you'll want to start with something very simple and just see how it works. 00:27:09.220 |
You'll often find that Claude can do the task well out of the box. 00:27:12.220 |
But if you have more needs and you need it to operate really consistently in production, 00:27:16.220 |
you'll notice edge cases or small flaws as you test with more use cases. 00:27:21.220 |
And so you'll sort of add those into the prompt. 00:27:23.220 |
So I would say that building an agent prompt concretely looks like this: start simple. 00:27:31.220 |
Start collecting test cases where the model fails or succeeds. 00:27:35.220 |
And then over time, try to increase the number of test cases that pass. 00:27:38.220 |
And the way to do this is by sort of adding instructions, adding examples to the prompt. 00:27:43.220 |
But you really only do that when you find out what the edge cases are. 00:27:46.220 |
And you can see that it thinks that the models are indeed good. 00:27:51.220 |
When I do like normal prompting and it's not agentic, I'll often give like a few shot example 00:27:58.220 |
of like, hey, here's like input, here's output. 00:28:00.220 |
This works really well for, like, classification tasks and things like that, right? 00:28:03.220 |
Is there a parallel here in this like agentic world? 00:28:07.220 |
Are you finding that that's ever helpful or should I not think about it that way? 00:28:12.220 |
So, should you include few-shot examples in your prompt? Sort of traditional prompting techniques 00:28:17.220 |
involve, like, saying the model should use a chain of thought and then giving a few-shot examples. 00:28:24.220 |
We find that these techniques are not as effective for state-of-the-art frontier models and for agentic tasks. 00:28:30.220 |
The main reason for this is that if you give the model a bunch of examples of exactly what 00:28:34.220 |
process it should follow, that just limits the model too much. 00:28:37.220 |
These models are smarter than you can predict. 00:28:39.220 |
And so you don't want to tell them exactly what they need to do. 00:28:42.220 |
Similarly, chain of thought has just been trained into the models at this point. 00:28:47.220 |
They don't need to be told like use chain of thought. 00:28:50.220 |
But what we can do here is one, you can tell the model how to use its thinking. 00:28:54.220 |
So, you know, as I talked about earlier, rather than telling the model you need to use a chain of thought, 00:28:59.220 |
you can just say: use your thinking process to plan out your search or to plan out what you're going to do. 00:29:06.220 |
Or you can tell it to remember specific things in its thinking process. 00:29:10.220 |
And that sort of helps the agent stay on track. 00:29:12.220 |
As far as examples go, you'll want to give the model examples, but not overly prescriptive ones. 00:29:17.220 |
I think we are out of time, but you can come up to me personally and I'll talk to y'all after.