
Taking Claude to the Next Level


Chapters

0:00 Introduction
0:47 The Next Level of AI Agents
3:43 Thinking and Tool Use
6:18 Memory
10:06 Reward Hacking
11:04 Practical Tips
14:00 Tool Use
14:49 Recap

Whisper Transcript

00:00:00.000 | Lisa Crowfoot: Good morning, everyone. Welcome to Taking Claude to the Next Level. I'm Lisa Crowfoot. I'm a research product manager here at Anthropic. And today I have the pleasure of introducing you to our newest models, Claude Sonnet 4 and Claude Opus 4.
00:00:28.000 | Lisa Crowfoot: So we're going to start by talking about what the next level of AI agents looks like. I'll go through new capabilities in the Claude 4 family, and then we'll talk through some practical tips for how to get the most out of Sonnet and Opus.
00:00:48.000 | Lisa Crowfoot: Before we dive deep on Claude 4, I wanted to paint a picture of what we think our next generation agents really look like. We really want Claude to be great at three things. Claude should be able to work alongside you and adapt to the ways you work. Claude should be able to work entirely independently on tasks that require many steps. And in both of these cases, Claude needs to sustain performance over hours of continuous work.
00:01:16.000 | Lisa Crowfoot: Imagine this. You've been assigned a new project to refactor your auth system to support OAuth 2.0. So you decide to work with Claude on this to make faster progress.
00:01:32.000 | Lisa Crowfoot: Then you choose to write the requirements and the plan and update the documents, but decide to delegate the implementation to Claude.
00:01:39.000 | Lisa Crowfoot: What is most interesting in this collaborative mode is that we don't envision clear human-AI handoffs. We really want you to be able to work with Claude. So for example, when Claude is reviewing the code base and documents, it might find that you missed a requirement in your PRD. So Claude should challenge your assumptions, just like working with a great engineer.
00:01:50.000 | Lisa Crowfoot: Together you can achieve a higher quality outcome faster than you would have on your own. This is what augmentation, not automation, will look like.
00:02:12.000 | Lisa Crowfoot: We do also envision that Claude will be able to operate on tasks like this entirely independently.
00:02:21.000 | Lisa Crowfoot: So take that same refactor and imagine that you just assign the whole thing to Claude.
00:02:27.000 | Lisa Crowfoot: Even without tight human oversight, Claude will create comprehensive plans for the refactor. It will use tools like web search and document search to make sure that it's operating from the most up to date information.
00:02:42.000 | Lisa Crowfoot: And it will use your company standards and best practices to write production ready code. Claude writes tests, recognizes and fixes its mistakes.
00:02:52.000 | Lisa Crowfoot: It can take feedback and remember your feedback so it doesn't make the same mistake twice.
00:02:58.000 | Lisa Crowfoot: So when models work independently like this, we really think trust and communication are paramount.
00:03:03.000 | Lisa Crowfoot: So Claude needs to follow your instructions. It also needs to communicate its decisions with you in a way that you can review them.
00:03:10.000 | Lisa Crowfoot: It needs to be able to adapt to changing inputs and new information.
00:03:15.000 | Lisa Crowfoot: So in both of these examples, Claude would need to work over many hours to complete the task.
00:03:22.000 | Lisa Crowfoot: If you use Claude.ai or Claude Code regularly, you might be familiar with Claude doing in seconds what takes you minutes or hours.
00:03:30.000 | Lisa Crowfoot: But our vision goes beyond that. We want Claude to be able to take on tasks that will take it hours to complete.
00:03:36.000 | Lisa Crowfoot: And we think that when this is possible, it will dramatically expand what AI agents can do.
00:03:42.000 | Lisa Crowfoot: So this is our vision, an AI that works alongside you, builds trust while working independently,
00:03:50.000 | Lisa Crowfoot: and can take on complex tasks that require sustained focus.
00:03:54.000 | Lisa Crowfoot: So let's dive in on how Claude 4 is making this a reality.
00:03:58.000 | Lisa Crowfoot: As you heard earlier from Dario, we launched two new models today, Claude Opus 4 and Claude Sonnet 4.
00:04:06.000 | Lisa Crowfoot: And I'm going to talk through four main improvement areas: thinking and tool use, memory, instruction following, and reduced reward hacking.
00:04:18.000 | Lisa Crowfoot: We'll discuss how these improvements contribute towards our agent vision.
00:04:22.000 | Lisa Crowfoot: Let's start with extended thinking and tool use.
00:04:26.000 | Lisa Crowfoot: Earlier this year, we launched Claude 3.7 Sonnet, which was our first hybrid reasoning model.
00:04:32.000 | Lisa Crowfoot: And what that means is the model can respond near instantly to your request or think deeply before responding.
00:04:38.000 | Lisa Crowfoot: With Claude 4, we're expanding on thinking by introducing a new beta capability for Claude to alternate between thinking and tool use.
00:04:47.000 | Lisa Crowfoot: Let me walk you through an example.
00:04:49.000 | Lisa Crowfoot: So here I've provided Claude with a CSV of bike rental data, and I gave it a very open-ended prompt.
00:04:56.000 | Lisa Crowfoot: I told it to just tell me the three most interesting things about this data.
00:05:01.000 | Lisa Crowfoot: Claude has access to a REPL tool, which lets it run code autonomously to analyze the data.
00:05:08.000 | Lisa Crowfoot: But it's never seen this data before.
00:05:10.000 | Lisa Crowfoot: So when it first thinks, it's actually thinking quite tactically about how to handle the large file.
00:05:15.000 | Lisa Crowfoot: And the first thing it does is print itself out the headers so that it can understand the data structure and what is even in this data.
00:05:23.000 | Lisa Crowfoot: It's only in the second and third thinking block that it starts to actually think about the prompt and the problem at hand.
00:05:29.000 | Lisa Crowfoot: So it starts to plan where it's going to find interesting patterns.
00:05:35.000 | Lisa Crowfoot: It decides to look for hourly patterns in bike rentals, different patterns for casual versus registered users, and seasonal and weather patterns.
00:05:45.000 | Lisa Crowfoot: It runs through its plan, completing the analysis.
00:05:56.000 | Lisa Crowfoot: And Claude was able to find interesting patterns like the fact that casual versus registered users have different time of day usage.
00:06:04.000 | Lisa Crowfoot: It found a clear evening commuting pattern.
00:06:06.000 | Lisa Crowfoot: And kind of no surprise to any of us, it found that bike rentals were 1.8 times more common on sunny days versus rainy days.
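The kind of exploratory analysis described in this demo can be sketched in plain Python. The CSV contents and column names below are hypothetical, since the actual data wasn't shown; this just illustrates the hourly, user-type, and weather breakdowns Claude computed.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature of the bike-rental CSV from the demo.
CSV_DATA = """hour,user_type,weather,rentals
8,registered,sunny,120
8,casual,sunny,15
17,registered,sunny,180
17,casual,rainy,40
12,casual,sunny,90
12,registered,rainy,50
"""

def analyze(csv_text):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    by_hour = defaultdict(int)      # hourly usage pattern
    by_user = defaultdict(int)      # casual vs registered totals
    by_weather = defaultdict(int)   # weather effect
    for r in rows:
        n = int(r["rentals"])
        by_hour[int(r["hour"])] += n
        by_user[r["user_type"]] += n
        by_weather[r["weather"]] += n
    sunny_vs_rainy = by_weather["sunny"] / by_weather["rainy"]
    return by_hour, by_user, sunny_vs_rainy

hours, users, ratio = analyze(CSV_DATA)
print(max(hours, key=hours.get))  # → 17 (evening commute peak in this toy data)
print(ratio)                      # → 4.5 (sunny vs rainy ratio in this toy data)
```

In the demo, Claude runs this sort of code itself via its REPL tool, printing headers first to learn the schema before planning the analysis.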
00:06:14.000 | Lisa Crowfoot: The next capability I want to talk about is memory.
00:06:21.000 | Lisa Crowfoot: We think memory is critically important for our next-generation agent vision for two reasons.
00:06:27.000 | Lisa Crowfoot: First, no one wants to work with an agent you have to keep reminding of the same things over and over again.
00:06:34.000 | Lisa Crowfoot: But secondly, and more tactically, if Claude is working over hours, it can't keep every single detail in its context window.
00:06:41.000 | Lisa Crowfoot: It needs to be smarter and only remember the most salient and important facts.
00:06:47.000 | Lisa Crowfoot: Claude Opus 4 demonstrates remarkably better memory capabilities.
00:06:53.000 | Lisa Crowfoot: So when given an external file system where it can read and write memories, Claude Opus is able to come up with a plan, remember that plan, and track progress against it over hours of work.
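The "external file system" idea can be sketched as a pair of helpers the agent calls as tools. The file layout and helper names here are hypothetical, not Anthropic's actual memory format; the point is that salient facts persist outside the context window.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical memory file; a real agent harness would choose the location.
MEMORY_FILE = Path(tempfile.mkdtemp()) / "agent_memory.json"

def write_memory(key, value, path=MEMORY_FILE):
    # Read-modify-write the whole file (fine for a small sketch).
    memories = json.loads(path.read_text()) if path.exists() else {}
    memories[key] = value
    path.write_text(json.dumps(memories, indent=2))

def read_memory(key, path=MEMORY_FILE):
    if not path.exists():
        return None
    return json.loads(path.read_text()).get(key)

# The agent records its plan once, then reloads and updates it on later steps.
write_memory("plan", "train the team before challenging the next gym")
write_memory("battles_played", 64)
print(read_memory("battles_played"))  # → 64
```

Because the memory lives in a file rather than the context window, the agent can reload only what it needs hours later, which is exactly the behavior described in the Pokemon example that follows.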
00:07:06.000 | Lisa Crowfoot: So we're going to take a slight detour and talk about the game of Pokemon as a way to illustrate this memory capability.
00:07:15.000 | Lisa Crowfoot: So we've been using Pokemon as a practical prototype for testing Claude's agent capabilities for a while.
00:07:25.000 | Lisa Crowfoot: If you're interested in learning more about this, my colleague David will be giving a talk later today, and I recommend checking it out.
00:07:31.000 | Lisa Crowfoot: For the purpose of this talk today, I want to talk about how Claude is using memory in Pokemon.
00:07:38.000 | Lisa Crowfoot: So if you think back to your Game Boy days, the game of Pokemon is really you go around and catch Pokemon, and then you train them up so that they can win battles.
00:07:48.000 | Lisa Crowfoot: And this concept of training is really core to the game.
00:07:51.000 | Lisa Crowfoot: You need to teach your Pokemon how to win battles, and that takes time where you go around and have the Pokemon battle other Pokemon to level up.
00:07:59.000 | Lisa Crowfoot: Prior Claude models would recognize this and decide that they had to go train their Pokemon, but would quickly lose track of their plan.
00:08:07.000 | Lisa Crowfoot: And start doing something else before their Pokemon were able to level up.
00:08:14.000 | Lisa Crowfoot: Opus 4, on the other hand, is meticulously tracking its Pokemon's training progress.
00:08:20.000 | Lisa Crowfoot: So here in its memory file, it keeps track of the fact that it has played 64 battles.
00:08:28.000 | Lisa Crowfoot: And to put that into context, 64 battles would take Claude about 12 hours of continuous gameplay.
00:08:35.000 | Lisa Crowfoot: Claude Opus remains focused on its training goals, logging Pokemon level improvements in this file.
00:08:42.000 | Lisa Crowfoot: So memory is a new model capability we're excited about because of how it will unlock longer arc agentic trajectories.
00:08:50.000 | Lisa Crowfoot: A third improvement I want to highlight is improvements in complex instruction following.
00:08:57.000 | Lisa Crowfoot: This one is near and dear to me because I've spent many hours working on Claude's system prompt.
00:09:06.000 | Lisa Crowfoot: And we're finding that as agent systems become more complex, the system prompts and sets of instructions that govern Claude's behavior are getting longer.
00:09:15.000 | Lisa Crowfoot: So for example, our own Claude AI system prompt is about 16,000 tokens right now.
00:09:22.000 | Lisa Crowfoot: So that's 16,000 tokens of instructions that Claude needs to be able to follow.
00:09:25.000 | Lisa Crowfoot: For Claude to work in these systems, it's important that its behaviors are steerable by you, the developer.
00:09:30.000 | Lisa Crowfoot: So you're each building different applications that may have different requirements and principles that govern Claude's behavior.
00:09:39.000 | Lisa Crowfoot: We've trained Claude 4 models specifically to be able to follow instructions within long and complex system prompts, longer than 10,000 tokens, for example.
00:09:49.000 | Lisa Crowfoot: It's easier to steer when Claude should and shouldn't use tools, and in our own system prompt, this improved instruction following has actually allowed us to reduce the size of the prompt by 70%.
00:10:07.000 | Lisa Crowfoot: Finally, I want to highlight improvements on a behavior we call reward hacking.
00:10:12.000 | Lisa Crowfoot: So reward hacking is when models take shortcuts to achieve an outcome or a result without actually solving the problem at hand.
00:10:19.000 | Lisa Crowfoot: You can think of it like hard coding tests or commenting them out.
00:10:24.000 | Lisa Crowfoot: This behavior is extremely trust-busting for users.
00:10:27.000 | Lisa Crowfoot: So when you see it happen, it makes you feel like you have to meticulously review everything Claude does in every line of code.
00:10:35.000 | Lisa Crowfoot: While we don't consider this an entirely solved problem, Claude 4 models show significantly reduced tendency to reward hack.
00:10:43.000 | Lisa Crowfoot: On an evaluation set of problems that were selected due to this tendency in past models, Claude 4 shows more than 80% less tendency towards the behavior.
00:10:53.000 | Lisa Crowfoot: And this means you can better trust Claude to complete your task the right way while being honest about its limitations and uncertainties.
00:11:01.000 | Lisa Crowfoot: So these four improvements, thinking and tool use, memory, improved steering, and reduced reward hacking, work together to create a Claude that is more capable, coherent, and trustworthy over longer time horizons.
00:11:19.000 | Lisa Crowfoot: Now I want to spend the last few minutes getting practical and providing you and your teams tips to get the most out of these models when you get back to the office tomorrow.
00:11:29.000 | Lisa Crowfoot: The first decision you'll have to make is which model to use.
00:11:34.000 | Lisa Crowfoot: And our recommendation is always to test the models within your evaluations and your product to ultimately make this decision.
00:11:41.000 | Lisa Crowfoot: But to give you some high level guidance here, Opus is our most capable model with frontier intelligence.
00:11:48.000 | Lisa Crowfoot: It will be best for the most complex tasks you have.
00:11:51.000 | Lisa Crowfoot: So think coding within large and complex code bases, code migrations or refactors, long horizon agentic tasks, and planning and orchestration.
00:12:03.000 | Lisa Crowfoot: A good rule of thumb here is that if Sonnet 3.7 is getting 60 or 70% on your evaluation, it will be a great use case for testing Opus.
00:12:13.000 | Lisa Crowfoot: Sonnet 4 is fast and efficient and is great for use cases that Sonnet 3.7 is excelling at today.
00:12:22.000 | Lisa Crowfoot: It spikes at agentic coding and will be awesome for app development and greenfield code generation, kind of vibe coding, as well as any use case that has humans in the loop.
00:12:40.000 | Lisa Crowfoot: When upgrading to the new models, you may need to adjust your prompts to get the best performance.
00:12:45.000 | Lisa Crowfoot: So those of you familiar with Sonnet 3.7 might be aware of its ability to go above and beyond the given user request.
00:12:53.000 | Lisa Crowfoot: I've seen this described as something like you ask it to change the color on a button and it codes you an entire new app.
00:13:00.000 | Lisa Crowfoot: We call this behavior over-eagerness, and Claude 4 models are much less over-eager by default.
00:13:06.000 | Lisa Crowfoot: So what this means is if you have language in your prompt that aims to dampen Sonnet 3.7's proclivity towards over-eagerness, you'll want to remove that language.
00:13:17.000 | Lisa Crowfoot: We don't think it's needed anymore.
00:13:19.000 | Lisa Crowfoot: And if you have an application where you think this above and beyond behavior is beneficial to users, you should just tell the model to go above and beyond in the prompt.
00:13:27.000 | Lisa Crowfoot: Claude 4 models are more than capable of delivering that as well.
00:13:31.000 | Lisa Crowfoot: We are also finding the models have better attention to detail in the prompt.
00:13:36.000 | Lisa Crowfoot: This goes along with the improved instruction following.
00:13:39.000 | Lisa Crowfoot: But you might need to audit your prompt to make sure that you're actually encouraging the behaviors you want to see.
00:13:44.000 | Lisa Crowfoot: So for example, when we were testing this model on claude.ai, we couldn't figure out why occasionally it was using the wrong XML tag for citations.
00:13:53.000 | Lisa Crowfoot: And we root-caused it to a single typo in the examples in our prompt.
00:13:58.000 | Lisa Crowfoot: If you're using Claude 4 with tool use, you can prompt Claude 4 models to call tools in parallel.
00:14:07.000 | Lisa Crowfoot: So this lets Claude parallelize tasks, running more than one thing simultaneously.
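On the client side, parallel tool calls show up as multiple `tool_use` blocks in a single model turn, each of which needs a matching `tool_result`. The block shapes below mirror the Anthropic Messages API, but the tools and the hard-coded model turn are made up for illustration.

```python
# Hypothetical tool implementations keyed by name.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
    "get_time": lambda city: f"09:00 in {city}",
}

# A (fabricated) model turn that requested two tools at once.
assistant_content = [
    {"type": "tool_use", "id": "toolu_01", "name": "get_weather",
     "input": {"city": "Paris"}},
    {"type": "tool_use", "id": "toolu_02", "name": "get_time",
     "input": {"city": "Paris"}},
]

def run_tool_calls(content):
    """Execute every tool_use block and build the matching tool_result list."""
    results = []
    for block in content:
        if block["type"] != "tool_use":
            continue
        output = TOOLS[block["name"]](**block["input"])
        results.append({"type": "tool_result",
                        "tool_use_id": block["id"],
                        "content": output})
    return results

results = run_tool_calls(assistant_content)
print(len(results))  # → 2: both calls answered in one user turn
```

Returning all the `tool_result` blocks in one user message is what lets the model treat the calls as having run simultaneously.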
00:14:13.000 | Lisa Crowfoot: When using interleaved thinking and tool use, you can actually tell Claude specifically what to think about in between tool calls.
00:14:21.000 | Lisa Crowfoot: So you might tell Claude to carefully reflect on search result quality and plan next steps before proceeding.
00:14:28.000 | Lisa Crowfoot: And finally, if you're using tools, it's a good idea to tell Claude when it should and should not invoke those tools within your prompt.
00:14:37.000 | Lisa Crowfoot: We've found the improved instruction-following qualities of Claude 4 have been very effective at addressing tool over-triggering problems.
00:14:47.000 | Lisa Crowfoot: So to recap, we're building towards a long-term vision where Claude can work alongside you
00:14:56.000 | Lisa Crowfoot: and complete work for you over long, sustained durations.
00:15:00.000 | Lisa Crowfoot: We think you'll find Claude 4 models great for agents because of interleaved thinking and tool use, memory, improved instruction following, and reduced reward hacking.
00:15:11.000 | Lisa Crowfoot: So what can you do tomorrow when you get back to the office?
00:15:15.000 | Lisa Crowfoot: Start experimenting, try building with both models using Opus for your most complex and ambitious tasks, and Sonnet for everything else.
00:15:23.000 | Lisa Crowfoot: Invest some time in prompt engineering.
00:15:26.000 | Lisa Crowfoot: Very small changes to your prompt can make a large difference to performance.
00:15:29.000 | Lisa Crowfoot: All of these models are slightly different.
00:15:31.000 | Lisa Crowfoot: And share your feedback with us, because it will help us make the next generations of Claude even better.
00:15:37.000 | Lisa Crowfoot: Thanks for joining me today.
00:15:40.000 | Lisa Crowfoot: We're really excited to see what you build with these new models.
00:15:44.000 | Lisa Crowfoot: And I'm happy to take any questions.
00:15:46.000 | Lisa Crowfoot: You'll need to walk over to the microphones in the aisles here.
00:15:49.000 | Lisa Crowfoot: No questions?
00:16:02.000 | Lisa Crowfoot: You're all good to go.
00:16:03.000 | Lisa Crowfoot: Awesome.
00:16:04.000 | Audience: So both Opus 4 and Sonnet 4 are doing really well on SWE-bench and some of the other benchmarks.
00:16:19.000 | Audience: However, most folks realize that benchmarks and practical use are not really comparable.
00:16:27.000 | Audience: Are you also developing new benchmarks for software development as these things get better?
00:16:33.000 | Audience: And are there things like evaluations for over-eagerness where we can get a sense of it before the product actually releases?
00:16:43.000 | Lisa Crowfoot: Yeah, great question.
00:16:45.000 | Lisa Crowfoot: We test these models quite extensively before we release them through what we call like a Swiss cheese of testing methods.
00:16:52.000 | Lisa Crowfoot: So benchmarks are only one thing we look at.
00:16:55.000 | Lisa Crowfoot: We also use them internally quite extensively before launch.
00:16:59.000 | Lisa Crowfoot: So Anthropic employees have been using these models in Claude Code for weeks, for example, and that helps us better understand how they perform in practical use.
00:17:08.000 | Lisa Crowfoot: We do some testing with early access customers.
00:17:11.000 | Lisa Crowfoot: And so we are interested in developing more and more benchmarks, but we don't think that benchmarks are the only way to look at how good these models are.
00:17:21.000 | Lisa Crowfoot: Let's go on the side.
00:17:22.000 | Yeah, so I was curious, in all of the demos and use cases that you presented so far, it seems to be very centered on text, coding and text in general.
00:17:33.000 | So I wanted to ask you if you can comment on your multimodal capabilities of the model, in particular images and audio.
00:17:41.000 | Lisa Crowfoot: Yeah, so the models can see images and respond to images.
00:17:49.000 | We think it's pretty important for agent capabilities.
00:17:52.000 | We see image use even within coding.
00:17:54.000 | For example, when people share with the model the front end that the model designed, then the model can go back and fix things.
00:18:00.000 | Lisa Crowfoot: So we're continuing to improve on our multimodal input capabilities, because we think it's going to be really critical for Claude to be able to do these complex tasks on its own.
00:18:12.000 | Lisa Crowfoot: Hello.
00:18:13.000 | Lisa Crowfoot: Hey.
00:18:14.000 | Audience: So sometimes I use Claude tool calling not as an execution mechanism, but more so as a survey mechanism.
00:18:22.000 | For instance, I'm like, analyze the situation here, and the tools are like option A, option B, option C.
00:18:28.000 | Have you guys factored that use case for tool calling into your training, for instance?
00:18:34.000 | Lisa Crowfoot: That is not something I've heard of before, but it sounds really interesting.
00:18:39.000 | If you find me after the break, I'd love to learn more about how you think about that.
00:18:43.000 | Lisa Crowfoot: Yeah, absolutely.
00:18:44.000 | It's a great use case.
00:18:45.000 | Lisa Crowfoot: Thanks.
00:18:46.000 | So I'm really enjoying all the focus on practical software engineering tasks.
00:18:56.000 | One thing that is difficult with LLMs on a large legacy code base is that no matter how good the model is at reading a blob of text,
00:19:08.000 | the actual structure of the situation it's involved in is so vast that you kind of need to represent it some other way in order to navigate around.
00:19:17.000 | So I wonder what kind of patterns you have found useful in navigating these larger legacy contexts.
00:19:23.000 | Lisa Crowfoot: I think our general philosophy is that we're trying to improve Claude's ability to do agentic search.
00:19:30.000 | So you can think of agentic search like you search for something and then you can think about it a little bit more and then search again and use the information over time to inform what you're doing.
00:19:40.000 | Lisa Crowfoot: And that applies to both code and like the deep research capabilities which we have on Claude.ai.
00:19:45.000 | And so that combined with this memory capability where maybe Claude can write down where certain information is in the code base, we think will help solve that problem.
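The "search, think, search again" loop described in this answer can be sketched as follows. The corpus, the keyword scoring, and the query-refinement step are all stand-ins; a real harness would use real search tools and let the model do the refinement.

```python
# Tiny stand-in corpus: file paths mapped to their contents.
CORPUS = {
    "auth/oauth.py": "OAuth 2.0 token exchange lives here",
    "auth/session.py": "legacy session cookies",
    "docs/migration.md": "OAuth migration notes reference auth/oauth.py",
}

def search(query):
    """Naive keyword search over the corpus (stand-in for a real search tool)."""
    return [path for path, text in CORPUS.items() if query.lower() in text.lower()]

def agentic_search(initial_query, max_rounds=3):
    query, notes = initial_query, []
    for _ in range(max_rounds):
        hits = search(query)
        notes.extend(h for h in hits if h not in notes)  # remember findings
        if not hits:
            break
        # "Think": refine the next query from what was just found; here we
        # simply follow the file name mentioned in the top hit.
        query = hits[0].split("/")[-1]
    return notes

found = agentic_search("oauth")
print(found)
```

The memory idea from earlier in the talk slots in naturally here: the `notes` list could be written to a memory file so the agent remembers where things live in the code base across hours of work.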
00:19:55.000 | Speaker 2: Hello. So in your presentation you mentioned something about being able to specify like what the model should be thinking about in between tool calls.
00:20:05.000 | How controllable is the length of thinking, in terms of how much the model should be thinking, being able to specify the length of that, or specific tool calls within the actual chain-of-thought process?
00:20:19.000 | Is that possible with the model?
00:20:21.000 | Lisa Crowfoot: So you have control over the maximum thinking length but the model adapts its thinking length to how much it actually needs to think to solve the task.
00:20:29.000 | So you kind of have like a thinking budget which you give the model and the model won't go over that budget but might be under that budget.
00:20:35.000 | Speaker 2: Okay. So if I wanted to ask the model to think for like a specific number of tokens, that's not possible right now.
00:20:41.000 | Lisa Crowfoot: You can tell it to think for less than a certain number of tokens.
00:20:44.000 | Less than a certain number.
00:20:45.000 | Okay.
00:20:46.000 | Speaker 2: Yeah.
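The budget behavior described in this exchange corresponds to the `thinking` parameter on the Messages API: you set a maximum, and the model may think for fewer tokens but not more. The request below is only constructed, not sent; the model name, budget, and prompt are illustrative.

```python
# Sketch of a Messages API request body with a thinking budget.
request = {
    "model": "claude-opus-4-20250514",   # illustrative model id
    "max_tokens": 16000,
    "thinking": {
        "type": "enabled",
        "budget_tokens": 8000,  # an upper bound, not a target length
    },
    "messages": [
        {"role": "user",
         "content": "Plan the OAuth 2.0 refactor step by step."}
    ],
}

# The budget must stay below max_tokens, since thinking tokens count
# against the overall output allowance.
print(request["thinking"]["budget_tokens"])
```

As the answer notes, there is no way to force an exact thinking length; the budget only caps it, which is why "think for less than N tokens" is the steerable direction.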
00:20:48.000 | Audience: Two questions.
00:20:49.000 | The first one is a gimme.
00:20:50.000 | What is the preferred mechanism for feedback?
00:20:53.000 | Because there are lots of ways to get in touch with you.
00:20:55.000 | And then the second is, does the increased durability mean that we can finally ask Claude not to generate insane numbers of inane comments when it writes code?
00:21:05.000 | Lisa Crowfoot: I hope so.
00:21:09.000 | Actually, I hope that they're better at that by default because of this less over-eager tendency.
00:21:16.000 | Lisa Crowfoot: And it should also follow your instructions better.
00:21:19.000 | Lisa Crowfoot: On feedback, I think like we love just talking to people.
00:21:23.000 | So if you find an Anthropic employee today, that would be excellent and would love to hear more about your experiences.
00:21:29.000 | And then I think we have some like online feedback forms as well.
00:21:33.000 | Lisa Crowfoot: Awesome.
00:21:34.000 | I think we'll call it here, but thanks everyone for joining me.
00:21:39.000 | I hope you're excited about Claude 4 and come find us after the break to chat more about these great new models.
00:21:48.000 | Thank you.