Back to Index

Building Blocks for Tomorrow’s AI Agents


Transcript

Good afternoon and welcome back from lunch. You know, whenever I do a tech conference, I always ask to do the session right after lunch because I know only the most motivated, hardworking, smartest, most beautiful people come. Right? Am I right? So thank you. Thank you for being here. I'm Brad Abrams.

I'm a product manager at Anthropic, and we're going to talk about components for building agents today. You saw Michael in the keynote talk about our Anthropic developer platform here, and today we're going to drill into this agentic components. When we think about building agents, there's really three key parts of this.

First is fundamentally building the agent and starting with our foundational models with the Claude Four family of models with enhanced reasoning, memory support, much improved tool calling, and long range planning is a great way to start. There's also a set of components that you can reuse that saves your precious engineering resources to work on different things, but we know regardless of how good our models are, they're only as intelligent as the data that you bring to them.

And that's what the connect pillar is all about. How can we help you bring more context in that helps the intelligence of the model? And finally, none of that matters if you can't deliver a service that's reliable, that's stable, that's performant, and cost effective, and that's what Optimize is.

So this is sort of our agenda for today. Let's drill into Build. So with Build, I want to talk about the code execution tool. Customers have told us that while large language models can do many amazing things, there's still some tasks that require traditional software development. When you're doing advanced data analytics, you have a giant spreadsheet, need to understand, do deep analysis of that data, that's still the domain where a human might need to write code because that code is auditable, it's performant, it's repeatable, it does the same every time.

So some of those use cases are still better done with code. But you know, our models are actually pretty good at writing code. So we thought, why not give Claude a computer and let it write and execute that code? And that's what code execution is all about. Let me explain it by drilling in one level deep on an example.

So we have a client here, it calls Claude, and then that goes to a container. We have a whole set of containers, so every organization gets a dedicated container. And here the client is actually requesting container ID one. Your client can decide how to use the containers, how to allocate them.

The client has this prompt. I don't know if anybody's already figured out the answer to this. I'll let you noodle on that for a second. Claude thinks about that for a minute and decides, you know, actually this is best done by writing code. So Claude chooses the code execution tool, writes a set of Python code that will answer that question, and then we hand it over to the container.

The container executes that, and then we get some results back. So all of standard out comes back, standard error comes back, and any files that were created while executing on that container come back. And then the model then reasons over that results, and comes up with a quippy answer.

So the answer was 42, and Opus with its insightful humor here has come up with a good joke about that. So that's generally how code execution works, and it's very simple to set up. Those of you that are already customers will recognize the messages API. It's the core way to use our models.

So it's the exact same API you've been using before. We've just added a new tools block. And keep in mind, this is really all you need to do to set this up. It's one method call brings all of this power. And that's just what Shopify found was interesting as they experiment with this code execution tool we're building.

They have a sidekick agent that helps merchants build their storefronts, and they're building an A/B testing experience there. And having the power of this code execution tool is helping them bring that insight. So to really understand, let's switch over to a demo. Yeah, let's switch over to the demo.

So doing a demo at a tech conference at any time is a harrowing experience. But when you're launching a brand new model with a bunch of new features in front of a live stream, it's particularly crazy. So hopefully this will work well. So what we have here, we've, thank you, thank you.

We have vibe coded a little command line client just to explain how the system works. Very, very basic system here. And we're using Opus 4. So what I'm going to do here is just give a very simple query here. I'll let you think about what the answer to this one is.

So we pass this query to Claude 4, and it has the code execution tool enabled. Claude's going to reason about that for a second, decide to call the code execution tool, and then we get streaming results. So this is one HTTP call, but we're getting streaming responses back. The code gets written by the model, passed to the tool.

The tool has executed the code, and we got that standard out there. That's the 100th prime number, and then the model gives its quippy answer here. Thank you. First demo worked. I'm feeling good. Let's push it a little bit harder. Okay, so I have some A/B test results here.

I have to make my Shopify friends happy. I have some A/B test results here. I've uploaded those with the files API, which we also announced today. And then what I'm going to do now is do some analysis of it. So you can see this prompt says, analyze the uploaded A/B test and compare control and treatment.

Calculate the statistical significance and key metrics and make a recommendation. So using all parts of the model here, we're doing some code execution. Notice in this first turn, the model has never seen this spreadsheet before. So it first has to analyze the types in the spreadsheet. What's there? It gets those results back, and then since it understands them, it now writes deeper code to go understand what's really happening here and pull out some insights.

So now we're executing that code on the VM, and we get all the results back in standard out. And I got to say, I just love Opus because it doesn't give up. He didn't get exactly what he wanted out of that analysis. So it says, look, I need to drill in more.

I need to understand a little bit more before I can do this. So it drills in more, writes some additional data analysis code. And see here, it's writing the output for itself to read. So all these print statements are going to come back to standard out, and then we're going to pass that to the model.

So those came back, and now the model's reasoning about what its response is. And it makes that business recommendation that we ask for and justifies it with the analysis of all that data. So pretty good? Great. Let's switch back to the slides. Okay. So that's the first live demo.

Great. Code execution tool. The code execution tool is an anthropic hosted computing environment. And it's flexible, developer controlled, so you don't need to tie to threads or whatever. You can control which request goes to what container, and your containers are isolated from everybody else's. And you get 50 free container hours today, which is a good amount to get started.

And then love to scale with you, so we have some pricing to let that scale. And the best part is that's available for you to use today. Wow, this is a good audience. Go ahead. Okay. Let's move on and talk about the Connect pillar here, how you can bring data into the model.

So many of our customers have told us that, again, while the model's reasoning is great, it was trained at some specific time. And maybe they need more recent information, whether that's for financial use cases, say the latest stock prices, or for, in the legal case, maybe there's some case law rulings that need to be kept up to date, or even in coding you may need to get the latest API documentation to make sure your code works beautifully.

So in all those cases, that real-time information is very important. And that's the role that web search plays. So let's drill in a similar example and check in how web search works in our system. So, again, I have the same sort of setup. A client gives this prompt. What are the most significant technological breakthroughs announced in the past three months?

And what publicly traded companies would benefit from them? So that's actually a pretty complicated problem. I mean, you might pay an analyst money to actually answer this question. What Claude does is it doesn't transform that prompt into a query. It actually reasons about the overall task that you've been asked Claude to do.

And it decides, well, the first thing I need to do is do a broad query and really understand the technological breakthroughs. It issues that query. We pass it to a search engine and get a set of search results back. So just think about the standard 10 blue links. So we get title, URL, and content from each of these websites.

And so all that context goes into the model. And then the model says, okay, given the prompt that I was given and this additional context, what do I need to do now? So it says, okay, well, I need to drill into one of those particular trends. So it picks this small language model, drills in, and finds companies that are related to that.

Gets the same search results and content back into the model. And then the model decides, again, to do another search. Now, this one I don't really know is a trend that I'm aware of, but we learn new things with Claude every day. So it's really fantastic. And keep in mind, all of these things are happening in that one API call that I showed you.

One API call and you get all of this power. So we get the search results back. And then this is an interesting case. It does this one final search to go one level deeper into the small language models to get a really deep insight into this particular one. It gets the results from that.

And then it produces its report. So it reports a complete report. And all of the data is now cited. So there's actually footnotes, citations for every fact so that you can go back and verify, make sure there's no hallucinations, and it is exactly what you want. And this is, again, as I mentioned, one API call messages API, and there's a tool very similar to the code execution tool we showed earlier.

In this case, you can actually restrict the domain. So say you're building a customer support agent, you might want to restrict the domain to just one domain so you get accurate answers. And you can also control the max turns if you want to be a little conservative on how many tokens you spend.

Although it's my business that you spend a lot of tokens, so feel free. Okay, and that's what Quora has really found interesting. They're building a consumer agent, and they really value that live, up-to-date information because consumers oftentimes ask about what's going on contemporarily, and so they're really valuing that.

And again, we're seeing customers across legal coding tools find this very valuable. So, let's do another demo. So let's switch back to the demo machine. Okay, so now let's try a search query. What are the SWE bench scores for all of Anthropix models since 3.5? So this is a contemporary one.

This is one that's sort of real-time right now, so we'll see how well this works. So what it's doing now is it's actually considering that query, looking at all the tools it has available. You know, we saw before it, whoa, that's good times. Let's try that one again. We know it's real.

So it's looking at all the tools it has available and deciding which one to call. So in this case, it calls the search tool, and it starts with three files. It starts with 3.5 because that's the data that we gave it. So it does that search for 3.5, and then it does a more general search.

So Claude's not satisfied with the answer from the first turn. So we're not like structuring, oh, do three searches, these are the searches. Claude's deciding what searches to do, how many searches, when to stop, what to drill in on. And here, it looks like this gets to get updated.

We don't quite have the Sonnet 4 scores up, but they will be there very soon. Okay, but can we put these things together? So this is a question I have wondered with for a long time, and that's how many elephants can travel over the Golden Gate Bridge in an hour?

If you think about this question, which is a very important question, you know, there's actually some pieces of data you need to know. You might need to know what the weight capacity of the Golden Gate Bridge is, the walking speed of an elephant. But then even once you have that data, you need to do some computational work to go understand.

So Claude did all the searches, now it's doing the computational work, or at least it's writing the code for that. And as soon as it finishes writing the code, we pass this over to the code execution tool, and it will execute that code. There we go. It executes it and gets this data, and that goes back into the model.

The model sees that data, and it forms its answer, which is 7,000. That seems plausible, right? Okay, so that's Web Search with Code Executor. That's good? Okay. Okay, let's go back to slides, if we can. Okay, so that's Web Search, and it's Anthropic's agentic search capability. So it's not just simply passing the search query and getting the result.

The model is actually deciding how to search, how many times to search, doing that loop over and over again. And that's done with our citation support, so everything there is fully grounded and auditable. And it's highly composable, developer-controlled, so it's very easy to add. You can have a lot of controls as developers.

And then it's reasonably priced, so you can use that at scale, and that's available today. So let's talk about MCP Connector next. We've just been blown away at the industry excitement around MCPs. I see a new MCP launch literally every day, so that ecosystem is growing very quickly. And in fact, just last week, Cloud AI launched support for remote MCPs and Cloud AI.

And many of our customers have been wondering how they can take advantage of that ecosystem of MCPs within their own agents. And MCP Connector is the answer to that. So let's take a look at how that looks under the covers. So this is a little bit more complicated setup.

We have our client, and we have three different MCPs connected. And that's because we're serving, our agent's going to serve queries like this one that like a product manager might need to do for a team after a launch. Create an email with a creative and motivational image about my Asana project status and send it to the team.

So there's several components here, and Cloud's got these three MCPs to go figure out how to do that. So the first thing Cloud decides to do is to call the Asana MCP, and a specific tool there, the Asana MCP has got tens of tools that you can call. But Cloud has picked out the right one, this lists workspaces.

And notice we're doing this call, my Asana tasks are authenticated, not everybody can see them. So we actually have to do an OAuth request to the MCP server. So when you make this API call, you pass an OAuth token into the messages API. We exchange that OAuth token for access token, and then make the call to the Asana's MCP server in a secure way.

So we do that call, we get the results back, and we say, okay, this is the workspace that that user has. Cloud then decides to drill in. It picks this search method to search for this code with Cloud in MCP demo. That's one I set up in context is knowing what my project is.

We find the project ID. Cloud gets details about that project ID, finally finds the actual project ID for this, and it can call get tasks. Just pausing for a second, a complicated enterprise software project like Asana has a complicated API structure, and even you as a developer, I as a developer, might take some time to go understand that.

But notice how Claude is whipping through this very quickly, and it gets right at the tasks. So all that happened very quickly. But we're not done with the overall prompt. So Claude is using that long-range planning, that capability to figure out what to do next. So it's got the tasks.

Next, what it needs to do is to create an image. If you remember the prompt, I said to create an image about that. So no, we're not going to announce an image creation model from Anthropic today, but there are tons of MCP servers out there that do image creation.

And if you're really deep in this MCP space, you know that most of those are actually local MCP servers, like intended to run on your local machine. But this support and the Claude AI support is all about remote MCPs. So luckily, CloudFlare offers an MCP remote service where you can take any of these local MCPs and host it on CloudFlare in a secure way.

And that's what we've done. We've taken one of the open source MCP providers, hosted it on CloudFlare, and then made that available to the model. And so the model chooses to call that with, when I made these slides, the project was definitely not in good shape, which is why the query is what it is.

And we call that MCP server and get a result back, get both the image URL as well as the actual image comes back in the result. And both of those go back to Claude. So Claude now has the tasks. It now has the image. The next thing it needs to do is send that email.

So Claude will compose email with all of that data, and then it needs to send it. So in this case, we're going to use the Zapier MCP server. Zapier has got hundreds of enterprise connections in a very well-designed MCP system that lets you enable or disable, expose just exactly what you need and have enterprise control over that.

So we've set it up to just expose Gmail. And the model has chosen to use this subject and whatnot to make that work. And so we get the response. It has sent that email. And this is what it looked like more recently, a little bit happier picture, but we did actually get the email.

And this is very easy to set up. Hopefully, you're getting a little bit of a pattern here. A very composable system. It uses the existing messages API. This time, we're just using this new MCP servers attribute. And you can list as many MCP servers as you need here. You just give the URL and a name.

And then you pass, if you need, if it's an OAuth service, pass the authorization token there. So we're very fortunate to have several remote MCPs that are live today that you can use with the MCP API. Whether you're doing task management or you're doing payments, you're creating a video, you're doing machine management, there's an MCP for you to get those done.

So these are all available today. And then I'm sure tomorrow there's going to be a ton more. And that's really what Zapier has found interesting because with our mutual customers, their customers can now build really powerful agents very easily with a combination of our MCP API support and their MCP.

So let's look at a demo. Okay. So let's use this one. What are my open tasks in Asana? Just to warm us up, make sure we can get this. So again, what you're going to see this time is that it calls the get Asana tasks. So it's passing this.

And then we get that nice list of Asana tasks and a nice response. But, you know, I thought maybe we should like pull all of the pieces together here. So hang with me with this query. Create an email with a creative and motivational image about my Asana project status, including some analysis on the percentage complete and any news on the web about those tasks and send it to the team.

So that starts by going through Asana, getting the tasks out of there. And then it's got to use code executor. So now it's writing code to analyze the status of all those tasks and get our percent complete. That being done, it knows what all the tasks are. So now it does a search for our conference, finds the latest information.

Maybe a tweet from one of you will show up there. And it decides to drill in a little bit more on Claude 4 Opus and Sonnet. So now it decides to create that motivational image. And notice the prompt it's giving to our MCP. So that's pretty nice. The rocket launching looks beautiful.

Now what it needs to do is take all the data that you just saw and pull that together into like a really nicely formatted email. And that's going to take the model just a minute to build this entire email. But it's going to take that email and call the Zapier MCP service and send it.

So hopefully any minute it will get that email. It's almost, I can feel it. Okay. There it is. So it's almost done. Thanks. We have to have that moment. We have to have it. So it's a little funny formatting here because it's HTML in this JSON viewer that we're using here.

But it's generating this whole email. And then of course the real test, if this was like an actual live demo, we'd expect like to get an email. So let's see if we actually get this email. So you can see we've been practicing. But yeah, this is the one that we just, oh, that was 44 minutes ago.

Let's see if there's a more recent one. Yeah, here we go. This one, zero minutes ago. There it is. We got it. We got it. Look. I mean, we did zero prompt engineering on this thing. Look how nicely formatted Opus comes up with this email. So really fantastic. Okay.

Let's move back to the slides. And I think I have to finish up very quickly. MCP connector is a remote MCP, simple to set up, OAuth support, only standard token prices. Okay, let's drill into optimize. You can't really talk about optimization without talking about prompt caching. Prompt caching lets you reuse part of your prompts that are used frequently that saves capacity, cost, and latency.

And we've had customers say, well, your five minutes of time between cache hits isn't enough. Or some humans maybe walk away from the computer and come back, or some long-running agents. So we've added a new option in addition to the five minutes we launched with, a new option of extending that to one hour with the same 90% discount on cache hits.

And batch processing is a great way to effectively process large amounts of data. And now that batch supports web search, code execution, and MCP connector, it's not just for batch processing anymore. It's your async agentic API. So you can get a 50% discount for using that, and you can build async agents very quickly.

But we've also had customers tell us that they need dedicated, they need reliability, dedicated capacity to make sure that they can serve the needs of their users. So we offer, as of today, we're offering customers the ability to buy a month's worth of capacity at a discount, and with this 99% reliability.

So our discount for longer commits. Okay, so that's a wrap. So we talked about build, clawed for, long-range planning, and code execution. We talked about bringing your data in with web search and MCP connector. And then we talked about how to optimize that with prompt caching, batch, and priority tier.

So unfortunately, we're out of time, but I will be out there for questions. So thank you very much for coming. Thank you very much for coming. Thank you. Thank you.