Building Blocks for Tomorrow's AI Agents

You know, whenever I do a tech conference, I always ask to do the session right after lunch, because I know only the most motivated, hardworking, smartest, most beautiful people come back for it.
I'm a product manager at Anthropic, and we're going to talk about components for building agents.
You saw Michael in the keynote talk about our Anthropic developer platform, and today we're going to drill into these agentic components.
When we think about building agents, there are really three key parts.
First is fundamentally building the agent, and starting with our foundational models is a great way to do that: the Claude 4 family brings enhanced reasoning, memory support, much improved tool calling, and long-range planning.
There's also a set of components you can reuse, which saves your precious engineering resources for other work. But we know that regardless of how good our models are, they're only as intelligent as the data you bring to them.
And that's what the Connect pillar is all about: how can we help you bring in more context that improves the intelligence of the model?
And finally, none of that matters if you can't deliver a service that's reliable, stable, performant, and cost-effective. That's what Optimize is about.
So with Build, I want to talk about the code execution tool.
Customers have told us that while large language models can do many amazing things, there are still some tasks that require traditional software development.
When you're doing advanced data analytics, say you have a giant spreadsheet and need to do deep analysis of that data, that's still a domain where a human might need to write code, because that code is auditable, it's performant, and it's repeatable: it does the same thing every time.
So some of those use cases are still better done with code.
But you know, our models are actually pretty good at writing code.
So we thought, why not give Claude a computer and let it write and execute that code?
Let me explain it by drilling one level deep into an example.
We have a client here; it calls Claude, and that goes to a container.
We have a whole set of containers, and every organization gets a dedicated container.
Here the client is actually requesting container ID one.
Your client can decide how to use the containers and how to allocate them.
I don't know if anybody's already figured out the answer to this.
Claude thinks about that for a minute and decides that, actually, this is best done by writing code.
So Claude chooses the code execution tool and writes a set of Python code that will answer that question.
The container executes that, and then we get some results back.
All of standard out comes back, standard error comes back, along with any files that were created.
Then the model reasons over those results and comes up with a quippy answer.
So the answer was 42, and Opus, with its insightful humor, has come up with a good joke about that.
So that's generally how code execution works, and it's very simple to set up. 00:04:08.700 |
Those of you that are already customers will recognize the messages API. 00:04:15.700 |
So it's the exact same API you've been using before. 00:04:21.700 |
And keep in mind, this is really all you need to do to set this up. 00:04:25.700 |
It's one method call brings all of this power. 00:04:30.700 |
And that's just what Shopify found interesting as they experimented with this code execution tool we're building.
They have a Sidekick agent that helps merchants build their storefronts, and they're building an A/B testing experience there.
Having the power of this code execution tool is helping them bring that insight.
So to really understand, let's switch over to a demo. 00:05:03.700 |
So doing a demo at a tech conference at any time is a harrowing experience. 00:05:08.700 |
But when you're launching a brand new model with a bunch of new features in front of a live stream, 00:05:18.700 |
So what we have here, we've, thank you, thank you. 00:05:22.700 |
We have vibe coded a little command line client just to explain how the system works. 00:05:35.700 |
So what I'm going to do here is just give a very simple query here. 00:05:40.700 |
I'll let you think about what the answer to this one is. 00:05:43.700 |
So we pass this query to Claude 4, and it has the code execution tool enabled. 00:05:49.700 |
Claude's going to reason about that for a second, decide to call the code execution tool, and then we get streaming results. 00:05:57.700 |
So this is one HTTP call, but we're getting streaming responses back. 00:06:03.700 |
The code gets written by the model, passed to the tool. 00:06:06.700 |
The tool has executed the code, and we got that standard out there. 00:06:10.700 |
That's the 100th prime number, and then the model gives its quippy answer here. 00:06:32.700 |
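The generated code isn't shown verbatim in the demo output, but a prime-finding script of the kind the model writes might look something like this (a sketch, not the model's literal output):

```python
def nth_prime(n: int) -> int:
    """Return the n-th prime number (1-indexed) by trial division against known primes."""
    primes: list[int] = []
    candidate = 2
    while len(primes) < n:
        # candidate is prime if no smaller prime divides it evenly
        if all(candidate % p != 0 for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes[-1]

print(nth_prime(100))  # -> 541
```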
I've uploaded those with the Files API, which we also announced today.
And what I'm going to do now is some analysis of it.
You can see this prompt says: analyze the uploaded A/B test and compare control and treatment.
Calculate the statistical significance and key metrics, and make a recommendation.
So we're using all parts of the model here, and we're doing some code execution.
Notice that in this first turn, the model has never seen this spreadsheet before.
So it first has to analyze the types in the spreadsheet.
It gets those results back, and then, since it understands them, it writes deeper code to understand what's really happening here and pull out some insights.
Now we're executing that code on the VM, and we get all the results back in standard out.
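The analysis code itself isn't shown in the transcript, but the kind of code Claude writes for this step looks roughly like a two-proportion z-test on conversion counts. The counts below are illustrative placeholders, not the demo's actual spreadsheet:

```python
# A sketch of A/B-test significance code: a two-proportion z-test comparing
# control vs treatment conversion rates (all counts are made-up examples).
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) for conversion rates of A vs B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))              # two-sided normal tail
    return z, p_value

z, p = two_proportion_z(conv_a=480, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z={z:.2f}, p={p:.4f}")  # significant at the 5% level if p < 0.05
```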
And I've got to say, I just love Opus, because it doesn't give up.
It didn't get exactly what it wanted out of that analysis; it needs to understand a little bit more before it can finish.
So it drills in more and writes some additional data analysis code.
And see here, it's writing the output for itself to read.
All these print statements are going to come back to standard out, and then we're going to pass that to the model.
Those came back, and now the model's reasoning about its response.
And it makes the business recommendation we asked for, justified with the analysis of all that data.
The code execution tool is an Anthropic-hosted computing environment.
It's flexible and developer-controlled, so you don't need to tie it to threads or anything like that.
You can control which request goes to which container, and your containers are isolated from everybody else's.
You get 50 free container hours today, which is a good amount to get started.
And then we'd love to scale with you, so we have pricing to let that scale.
The best part is that it's available for you to use today.
Let's move on and talk about the Connect pillar: how you can bring data into the model.
Many of our customers have told us that, again, while the model's reasoning is great, it was trained at some specific time.
And maybe they need more recent information, whether that's for financial use cases, say the latest stock prices, or in legal, where case law rulings need to be kept up to date, or even in coding, where you may need the latest API documentation to make sure your code works beautifully.
In all those cases, that real-time information is very important.
So let's drill into a similar example and see how web search works in our system.
What are the most significant technological breakthroughs announced in the past three months, and what publicly traded companies would benefit from them?
That's actually a pretty complicated problem. I mean, you might pay an analyst money to answer this question.
What Claude does is not simply transform that prompt into a query.
It actually reasons about the overall task you've asked Claude to do.
And it decides the first thing it needs to do is a broad query to really understand the technological breakthroughs.
We pass it to a search engine and get a set of search results back.
Just think about the standard ten blue links: we get the title, URL, and content from each of these websites.
Then the model says: okay, given the prompt I was given and this additional context, what do I need to do now?
It decides it needs to drill into one of those particular trends.
So it picks this small language model trend, drills in, and finds companies that are related to it.
It gets the same kind of search results and content back into the model.
Then the model decides, again, to do another search.
Now, this one isn't a trend I was aware of, but we learn new things with Claude every day.
And keep in mind, all of this is happening in that one API call I showed you.
It does this one final search to go one level deeper into small language models and get a really deep insight into this particular one.
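That reason-search-reason loop can be sketched in a few lines. The model's decision-making is stubbed out here as a plain function; in the real system Claude itself chooses each next query and decides when to stop:

```python
def run_search_loop(task, decide_next_query, search_engine, max_turns=5):
    """decide_next_query(task, results) returns the next query string, or None to stop.
    search_engine(query) returns a list of {title, url, content} dicts."""
    results = []
    for _ in range(max_turns):
        query = decide_next_query(task, results)  # model reasons over all context so far
        if query is None:                         # model decides it has enough
            break
        results.extend(search_engine(query))      # the "ten blue links" for this turn
    return results

# Stubbed usage: the "model" issues two queries, then stops.
queries = iter(["recent tech breakthroughs", "small language model companies", None])
hits = run_search_loop(
    task="breakthroughs + companies that benefit",
    decide_next_query=lambda task, results: next(queries),
    search_engine=lambda q: [{"title": q, "url": "https://example.com", "content": "..."}],
)
print(len(hits))  # -> 2
```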
There are actually footnotes, citations for every fact, so you can go back and verify, make sure there are no hallucinations, and confirm it's exactly what you want.
And this is, again, one Messages API call, with a tool very similar to the code execution tool we showed earlier.
In this case, you can actually restrict the domain.
Say you're building a customer support agent: you might want to restrict the search to just one domain so you get accurate answers.
You can also control the max turns if you want to be a little conservative about how many tokens you spend.
Although it's my business that you spend a lot of tokens, so feel free.
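The tool configuration for those controls can be sketched like this. The tool type string and the field names (`allowed_domains`, `max_uses`) are taken from the launch-time docs and should be treated as assumptions to verify against the current reference:

```python
def build_web_search_tool(allowed_domains=None, max_uses=None) -> dict:
    """Assemble the web search tool block with optional restrictions."""
    tool = {"type": "web_search_20250305", "name": "web_search"}
    if allowed_domains:
        # e.g. a customer support agent locked to your own help-center domain
        tool["allowed_domains"] = allowed_domains
    if max_uses is not None:
        # cap how many searches Claude may run within this one request
        tool["max_uses"] = max_uses
    return tool

tool = build_web_search_tool(allowed_domains=["support.example.com"], max_uses=3)
```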
Okay, and that's what Quora has found really interesting.
They're building a consumer agent, and they really value that live, up-to-date information, because consumers often ask about what's going on right now.
And again, we're seeing customers across legal and coding tools find this very valuable.
What are the SWE-bench scores for all of Anthropic's models since 3.5?
This is one that's sort of real-time right now, so we'll see how well this works.
What it's doing now is considering that query and looking at all the tools it has available.
Whoa, that's good times.
So it's looking at all the tools it has available and deciding which one to call.
In this case, it calls the search tool, and it starts with 3.5, because that's the data we gave it.
So it does that search for 3.5, and then it does a more general search.
Claude's not satisfied with the answer from the first turn.
So we're not structuring this as "do three searches, and these are the searches."
Claude's deciding what searches to do, how many searches, when to stop, and what to drill in on.
And here, it looks like this page needs to get updated: we don't quite have the Sonnet 4 scores up, but they will be there very soon.
So this is a question I have wondered about for a long time: how many elephants can travel over the Golden Gate Bridge in an hour?
If you think about this question, which is a very important question, there are actually some pieces of data you need to know.
You might need to know the weight capacity of the Golden Gate Bridge and the walking speed of an elephant.
But even once you have that data, you need to do some computational work to understand it.
So Claude did all the searches, and now it's doing the computational work, or at least writing the code for that.
As soon as it finishes writing the code, we pass it over to the code execution tool, which executes it.
There we go. It executes it and gets this data, which goes back into the model.
The model sees that data and forms its answer, which is 7,000.
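The computational half of that answer can be sketched with a simple throughput model. Every constant below is a made-up assumption for illustration (lane count, walking speed, spacing), not data from the demo's actual searches:

```python
# Steady-state throughput model: each lane passes one elephant every
# (spacing / speed) hours, so flow = lanes * speed / spacing per hour.
MPH_TO_FT_PER_HR = 5280

def elephants_per_hour(lanes: int = 6, speed_mph: float = 4.5, spacing_ft: float = 20.0) -> int:
    """Elephants crossing per hour, given assumed lanes, speed, and spacing."""
    speed_ft_per_hr = speed_mph * MPH_TO_FT_PER_HR
    return int(lanes * speed_ft_per_hr / spacing_ft)

print(elephants_per_hour())  # about 7,000 with these assumed numbers
```

A fuller version would also check the bridge's weight capacity against the number of elephants on the span at once, which is the data the searches were fetching.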
Okay, so that's web search working with the code executor.
And that's Web Search: Anthropic's agentic search capability.
It's not simply passing along the search query and getting a result; the model is actually deciding how to search and how many times to search, doing that loop over and over again.
And it's done with our citation support, so everything there is fully grounded and auditable.
It's highly composable and developer-controlled, so it's very easy to add, and you have a lot of controls as developers.
And it's reasonably priced, so you can use it at scale, and it's available today.
We've just been blown away at the industry excitement around MCP.
I see a new MCP server launch literally every day, so that ecosystem is growing very quickly.
In fact, just last week, we launched support for remote MCP servers in Claude.ai.
And many of our customers have been wondering how they can take advantage of that ecosystem of MCP servers within their own agents.
So let's take a look at how that works under the covers.
This is a little bit more complicated a setup.
We have our client, and we have three different MCP servers connected.
That's because our agent is going to serve queries like this one, the kind a product manager might need to run for a team after a launch:
Create an email with a creative and motivational image about my Asana project status, and send it to the team.
So there are several components here, and Claude has these three MCP servers to figure out how to do that.
The first thing Claude decides to do is call the Asana MCP server, and a specific tool there; the Asana MCP server has tens of tools you can call.
But Claude has picked out the right one: list workspaces.
And notice that when we do this call, my Asana tasks are authenticated; not everybody can see them.
So we actually have to do an OAuth request to the MCP server.
When you make this API call, you pass an OAuth token into the Messages API.
We exchange that OAuth token for an access token and then make the call to Asana's MCP server in a secure way.
So we do that call, get the results back, and see: okay, this is the workspace that user has.
It picks this search method to search for the "Code with Claude MCP demo" project; that's one I set up, and the context it needs is knowing which project is mine.
Claude gets details about that project, finally finds the actual project ID, and can call get tasks.
Just pausing for a second: a complicated enterprise software product like Asana has a complicated API structure, and even you as a developer, or I as a developer, might take some time to understand it.
But notice how Claude is whipping through this very quickly, and it gets right at the tasks.
So Claude is using that long-range planning capability to figure out what to do next.
Next, what it needs to do is create an image; if you remember the prompt, I said to create an image about that.
So no, we're not going to announce an image creation model from Anthropic today, but there are tons of MCP servers out there that do image creation.
And if you're really deep in the MCP space, you know that most of those are actually local MCP servers, intended to run on your local machine.
But this support, like the Claude.ai support, is all about remote MCP servers.
Luckily, Cloudflare offers a service where you can take any of these local MCP servers and host it on Cloudflare in a secure way.
We've taken one of the open source MCP providers, hosted it on Cloudflare, and made that available to the model.
And so the model chooses to call that. When I made these slides, the project was definitely not in good shape, which is why the query is what it is.
We call that MCP server and get a result back: both the image URL and the actual image come back in the result.
The next thing it needs to do is send that email.
Claude will compose the email with all of that data, and then it needs to send it.
In this case, we're going to use the Zapier MCP server.
Zapier has hundreds of enterprise connections in a very well-designed MCP system that lets you enable or disable and expose exactly what you need, with enterprise control over it.
And the model has chosen this subject and so on to make that work.
This is what it looked like more recently, a little bit happier a picture, but we did actually get the email.
Hopefully, you're getting a little bit of a pattern here.
This time, we're just using this new mcp_servers attribute, and you can list as many MCP servers as you need.
And then, if it's an OAuth service, you pass the authorization token there.
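That part of the request can be sketched like this. The field names (`type`, `url`, `name`, `authorization_token`) follow the launch-time MCP connector docs and should be verified; the server URLs and token here are placeholders, not real endpoints:

```python
def build_mcp_servers(servers: list[dict]) -> list[dict]:
    """Assemble the mcp_servers list for a Messages API request."""
    entries = []
    for s in servers:
        entry = {"type": "url", "url": s["url"], "name": s["name"]}
        if "token" in s:
            # OAuth-protected server: pass the user's token through
            entry["authorization_token"] = s["token"]
        entries.append(entry)
    return entries

mcp_servers = build_mcp_servers([
    {"name": "asana", "url": "https://mcp.asana.example/sse", "token": "oauth-token-here"},
    {"name": "image-gen", "url": "https://mcp.images.example/sse"},
])
```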
We're very fortunate to have several remote MCP servers live today that you can use with the MCP connector.
Whether you're doing task management, payments, video creation, or machine management, there's an MCP server to get it done.
And I'm sure tomorrow there will be a ton more.
And that's really what Zapier has found interesting: with our mutual customers, their customers can now build really powerful agents very easily by combining our MCP connector support with their MCP servers.
Just to warm us up, let's make sure we can get this.
Again, what you're going to see this time is that it calls get Asana tasks, and we get that nice list of Asana tasks and a nice response.
But, you know, I thought maybe we should pull all of the pieces together here:
Create an email with a creative and motivational image about my Asana project status, including some analysis of the percentage complete and any news on the web about those tasks, and send it to the team.
So that starts by going through Asana and getting the tasks out of there.
Now it's writing code to analyze the status of all those tasks and get our percent complete.
That done, it knows what all the tasks are.
So now it does a search for our conference and finds the latest information; maybe a tweet from one of you will show up there.
And it decides to drill in a little bit more on Claude 4 Opus and Sonnet.
Now it decides to create that motivational image; notice the prompt it's giving to our image MCP server.
What it needs to do next is take all the data you just saw and pull it together into a really nicely formatted email.
That's going to take the model just a minute to build.
Then it's going to take that email, call the Zapier MCP service, and send it.
So hopefully any minute now we will get that email.
The formatting is a little funny here because it's HTML rendered in the JSON viewer we're using.
And then of course the real test: if this was an actual live demo, we'd expect to get an email.
And yes, this is the one we just... oh, that was 44 minutes ago.
I mean, we did zero prompt engineering on this thing. Look how nicely Opus formatted this email.
And I think I have to finish up very quickly.
The MCP connector: remote MCP server support, simple to set up, OAuth support, and only standard token prices.
You can't really talk about optimization without talking about prompt caching.
Prompt caching lets you reuse parts of your prompts that are used frequently, which saves capacity, cost, and latency.
We've had customers say that our five minutes of time between cache hits isn't enough: maybe a human walks away from the computer and comes back, or an agent runs for a long time.
So we've added a new option, in addition to the five minutes we launched with, extending that window to one hour, with the same 90% discount on cache hits.
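In the request, that looks like marking the reusable prompt prefix with a cache_control block. The "ephemeral" type and the ttl field shown here follow the launch-time shape; confirm the exact field names against the current docs:

```python
def cached_system_block(text: str, ttl: str = "1h") -> dict:
    """A system content block marked for prompt caching with a chosen TTL."""
    return {
        "type": "text",
        "text": text,
        # ttl selects between the 5-minute default ("5m") and the new 1-hour option
        "cache_control": {"type": "ephemeral", "ttl": ttl},
    }

system = [cached_system_block("...long shared instructions and reference documents...")]
```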
And batch processing is a great way to cost-effectively process large amounts of data.
Now that batch supports web search, code execution, and the MCP connector, it's not just for batch processing anymore.
You get a 50% discount for using it, and you can build async agents very quickly.
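Queuing a set of async agent runs through the batch API can be sketched like this. The request shape (a custom_id plus per-request params) follows the Message Batches docs at the time; the model name and prompts are placeholders:

```python
def build_batch_requests(prompts: list[str], tools: list[dict]) -> list[dict]:
    """One batch entry per prompt; custom_id lets you match results back up later."""
    return [
        {
            "custom_id": f"agent-run-{i}",
            "params": {
                "model": "claude-sonnet-4-20250514",
                "max_tokens": 2048,
                "messages": [{"role": "user", "content": prompt}],
                # batch requests can now carry the same agentic tools:
                # web search, code execution, and the MCP connector
                "tools": tools,
            },
        }
        for i, prompt in enumerate(prompts)
    ]

batch = build_batch_requests(
    ["Summarize yesterday's support tickets", "Analyze the uploaded A/B test"],
    tools=[{"type": "web_search_20250305", "name": "web_search"}],
)
```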
But we've also had customers tell us they need reliability and dedicated capacity to make sure they can serve the needs of their users.
So as of today, we're offering customers the ability to buy a month's worth of capacity at a discount, with 99% reliability.
So we talked about Build: Claude 4, long-range planning, and code execution.
We talked about bringing your data in with web search and the MCP connector.
And we talked about how to optimize all of that with prompt caching, batch, and priority tier.
Unfortunately, we're out of time, but I will be out there for questions.