All right. Welcome, everybody. I was asked to talk about tool use and some of the changes to our new models that have made them better at using tools. And so I decided to use this opportunity to instead talk about Claude Plays Pokemon, which I made. So I'm David. I'm on our Applied AI team here.
I'm the creator of Claude Plays Pokemon. And what we are going to do today is relaunch the stream live together. We're going to talk about it and we're going to have a good time. Could you flip over to my demo screen, please? All right. I have a button queued up on my machine in my house, in my basement in Seattle, which is running this.
And I'm going to hit enter, but I need help from the crowd to count down from 10 for me to windmill slam this enter button. So I'm going to start it, but I need all of you to participate. It's very important for the vibes of the run. 10, 9, 8, 7, 6, 5, 4, 3, 2, 1.
Let's go. All right. So I'm here to talk about Claude Plays Pokemon because there's some new stuff that our models can do that is really exciting and makes the models better at Pokemon, makes the models better at a lot of things, makes the models better at being agents. What this is actually going to look like in practice, and I'll show some examples later in some slides I have after we're done, is that our models will learn and adapt and think differently.
So in a minute, we're going to get to the name entry screen. One of my favorite examples I show later involves it. The name entry screen is a notoriously challenging place for Claude. Claude does not quite understand how cursors move around, and it gets a little bit lost. It's a grid.
It gets confused. One of the things that it's really good at with this extended thinking mode is building a full plan of where it needs to move and what it needs to do, thinking between tool calls. So in the past, it might not use that extended thinking to reconsider its assumptions, question itself, and figure out the right answer.
In Claude 4, you will see that happen. Another feature we added in this new version of Claude that we have been asked for a lot is the ability to make multiple tool calls at once. Parallel tool calling is the other name for it. Claude 3.7 was frustratingly bad at this.
It would only call one tool at any given time. And what that means for you is practically when you're building an agent, it'll call one tool and then it will wait. You'll have to make a whole new generation. You'll get a whole time to first token hit between generations.
With the new models, they are much more keen to call multiple tools. We actually saw it right at the beginning. I haven't seen it since. But it will do things in Pokemon, like take an action and update its memory at the same time, which essentially just saves us tokens when we're building agents.
The model is going to take more actions more quickly and not need to go through the plan-act loop as often. Tool use has evolved rapidly in the last year. When I first started helping customers with tool use last year, a lot of it was like, let's give the model a calculator tool so it can do math, because the model is bad at math, so it has a way to offload that.
These days, that is not what tool use gets used for. Tool use is the driver of agents. When people build with tools, they give models full suites of tools that enable models to take long, agentic actions and move forward. And so the agentic loop really revolves around tools.
In an agentic loop, the model will plan an action, act on whatever that plan was, learn something from what it saw, and then repeat that until it's accomplished its goals. In Pokemon, it might say, I'm going to try to talk to my mom right now. The way it would do that is I'm going to press A, and then it will reflect, see the dialog box come up, and see it worked, and keep going with its job to play Pokemon.
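In code, that plan, act, learn loop looks roughly like the sketch below against the Anthropic Messages API. The press_buttons tool, the run_tool handler, and the model id are illustrative stand-ins, not the actual Claude Plays Pokemon implementation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical tool: lets the model press Game Boy buttons. A stand-in, not the
# real Claude Plays Pokemon tool set.
TOOLS = [{
    "name": "press_buttons",
    "description": "Press a sequence of Game Boy buttons in order.",
    "input_schema": {
        "type": "object",
        "properties": {"buttons": {"type": "array", "items": {"type": "string"}}},
        "required": ["buttons"],
    },
}]

def run_tool(name: str, tool_input: dict) -> str:
    """Act on the tool call (e.g. send buttons to the emulator) and return what happened."""
    raise NotImplementedError  # emulator glue goes here

messages = [{"role": "user", "content": "Keep playing Pokemon Red toward your current goal."}]

while True:
    # Plan: the model decides what to do next, possibly by requesting tools.
    response = client.messages.create(
        model="claude-opus-4-20250514",  # illustrative model id
        max_tokens=2048,
        tools=TOOLS,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break  # no tool requested; the model considers this turn finished
    # Act, then learn: run each requested tool and feed the observations back.
    results = [
        {"type": "tool_result", "tool_use_id": block.id,
         "content": run_tool(block.name, block.input)}
        for block in response.content if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": results})
```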
So let's talk about those two big improvements we talked about with tool calling in Claude 4. The first is improved planning. By being able to use extended thinking mode between tool calls, you are now able to see the model actually break problems down, build plans, step back, reflect, and question its assumptions between tool calls.
And by calling multiple tools at once, the models will be more efficient when acting as agents. This has a practical impact in Pokemon that we also didn't get to see. So I want to talk a little bit about this interleaved thinking, thinking between tool calls, because there's some clear examples in Pokemon of how this works out.
In the past, when you launched it, we actually saw (this is the one thing we actually did see) that when you hit run on a model, it would build the whole plan for how it was going to beat Pokemon in its first message. It was a terrible plan.
It would say, I'm going to write my name down as Claude, and I'm going to give my Pokemon nicknames, and I'm going to go beat Pokemon, and that's the extent of its planning. And then it would hit a really big, horrible challenge, which is the name entry screen, and everything would fall apart.
It would occasionally hit left to move the cursor to a new letter and accidentally wrap around to the other side of the screen and think, how the heck did my cursor end up on the right side of the screen? The game must be bugged. Everything is terrible. And now with the ability to do extended thinking between tool calls, you'll see the model actually sort of catch these errors more often, adjust, adapt its thinking, and come up with a better plan.
So in this example, this is an actual trace from Claude 4 Opus where it says, I'm really stumped. The cursor went right instead of left when I hit left. What happened? And then it will actually say, wait, let's step through. What actually happened? Where did the cursor go? It was at this letter, this letter, this letter.
Actually, what I think happened is the cursor spilled over and wrapped around to the other edge. Everything's okay. I understand how this works now. Let's keep going with name entry, and it can sort of pick that up and learn it. And that ability to adapt on the fly is really meaningful as you build agents that are kind of expected to take in tons of new information as they're building.
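Turning this on is roughly a matter of enabling extended thinking plus the interleaved-thinking beta on the request. The sketch below reuses the client, TOOLS, and messages from the loop above; the beta header and model id reflect the Claude 4 launch documentation, so treat them as assumptions to verify against current docs.

```python
# Same plan/act loop as before, but with thinking enabled and interleaved between tool calls.
response = client.messages.create(
    model="claude-opus-4-20250514",  # illustrative model id
    max_tokens=8192,
    thinking={"type": "enabled", "budget_tokens": 4096},
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},  # beta header; check current docs
    tools=TOOLS,
    messages=messages,
)

# The content now interleaves thinking blocks with tool calls, so you can watch the model
# re-check its assumptions ("wait, the cursor wrapped around the grid") before it presses anything.
for block in response.content:
    if block.type == "thinking":
        print("THINKING:", block.thinking)
    elif block.type == "tool_use":
        print("TOOL CALL:", block.name, block.input)
```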
Similarly, we have parallel tool calling. Parallel tool calling is more of an efficiency game. In the past, when it was sitting there waiting to talk to mom, the model would press A to talk. And then if it wanted to update its knowledge base to keep track of where it found mom, it would have to take a whole other action, call out to Claude again, wait for the time-to-first-token hit, and make that change.
With parallel tool calling, it can do both things at once, basically. It can say, I'm going to talk to mom. I'm going to update my knowledge base. I want to advance the dialogue six times by pressing A six times because I'm bored talking to mom. This saves you time.
It saves your customers time. It speeds up how agents work. And it will make agents work more effectively for your customers, who won't have to wait around for redundant tool calls and round trips to Claude. And so what this means, and what's next, is that models are getting better at being agents.
This is obvious. We knew this. But this is one of the core things we work on at Anthropic. We find ways to make models smarter when they're acting over long time horizons and solving complex problems. Extended thinking between tool calls is an example of this. It's something we've seen make a real impact on how effective agents are in the real world.
But I also want to talk about how Claude is being trained to be a useful agent and an easier one to build with. When we build our models, we try to listen to developers. We try to understand what it means to give Claude the capabilities that make it work more easily, more seamlessly, and better for users.
And we train those into our models too. Things like parallel tool calling are capabilities where we want to hear feedback and improve our models, model over model. I will let some people ask questions. We can chat about this a little bit. Yeah. Over here. Hey. Hi. Thanks. Claude Plays Pokemon is awesome.
So one question I had was, so you have many low level actions, right, which is like click button A, click button B. And then you also have some high level actions like go to this point that you've previously visited. That's one of the high level actions. Are all these in the same hierarchy of tools?
Or how do you think of, because when you're building any agent, like, you know, you have some sort of zoomed out view action that you want to take and then some zoomed in click a button action. Yep. How do you think of this? And should it be flat? Should it be like a hierarchy?
How do you think of this? Thanks. I think, like designing any agent, designing tools tends to be the thing that actually matters the most. And this is the most simple set of tools. In fact, I've aimed for simplicity with Pokemon. It's somewhat a bad example in this sense.
But what really matters is being clear, like separating the concerns of what tool should be used when, giving good examples of what tools should be used when, and how to do that. And so in the case of Pokemon, I have this tool that allows the model to navigate to a specific place.
And then you have to just be very clear to it that it should use that when it's trying to move around in the overworld. I watched Claude play a bunch, and I found that the model was quite bad at moving around in the overworld.
And so it's just basically telling it, hey, you're not good at this. When you're trying to do this set of tasks, this is the right tool to use. You'll have a better outcome. Versus if you're in a battle, just press buttons directly. You're perfectly capable of that, and it's easier for you.
That's a good way to do it. And so I think about the loop of building these as: watch the model, see where it struggles, try to design and build tools that will help with some of the places it struggles, and then write clear descriptions that help the model understand what you have seen, what its shortcomings are, why it might need this tool and in what scenarios, and equip it with that knowledge.
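As a concrete, hypothetical illustration of that split, here is roughly what those two tools might look like in the Anthropic tool schema, with the guidance about when to use each written directly into the descriptions. These are not the real Claude Plays Pokemon definitions.

```python
# Hypothetical tool definitions; the descriptions carry what was learned from watching
# Claude play: overworld movement -> navigate_to, menus and battles -> press_buttons.
POKEMON_TOOLS = [
    {
        "name": "press_buttons",
        "description": (
            "Press a sequence of Game Boy buttons. Use this in menus, dialogue, and battles, "
            "where individual presses are easy to reason about. Do not use it for long walks "
            "across the overworld; use navigate_to for that instead."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "buttons": {
                    "type": "array",
                    "items": {"type": "string",
                              "enum": ["A", "B", "UP", "DOWN", "LEFT", "RIGHT", "START", "SELECT"]},
                }
            },
            "required": ["buttons"],
        },
    },
    {
        "name": "navigate_to",
        "description": (
            "Walk to a location you have previously visited, by name. Use this whenever you are "
            "moving around the overworld; it is far more reliable for you than pressing direction "
            "buttons one at a time."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"destination": {"type": "string"}},
            "required": ["destination"],
        },
    },
]
```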
I think there's a microphone here. Yes. There's a pattern in which if you have like a bunch of tool functions and you don't want to necessarily like clutter your current context with like a whole huge list, you adopt a helper, which kind of acts like a proxy where the model can say, hey, I want to accomplish this.
And then, okay, so you know what I'm talking about. Have the dynamics of that particular pattern changed at all with the new model? I have not studied it, and I don't think we really know exactly how that will break down with the new model. My expectation is that the smarter a model gets, the more I trust it with the full context and to make complex decisions.
So my gut when building with Opus, or Sonnet really, the Claude 4 models, is to give it the full list. And again, you're weighing that against context clutter, and avoiding it if the tool list gets too long or the tools don't quite click. Yeah, and a quick follow-up: how many tools has that gotten to, at least in your experience? Like, how many tools can it handle?
I think we've pretty confidently seen the model be able to navigate on the order of 50 to 100 tools. It's a question of definition, though. As the human who writes the prompts and the tools, the more tools you write, the less likely it is that you're going to be precise enough about where and how you define those tools to the model and how you divide the lines between them.
And so from my perspective, with well-designed tools that's possible. If it gets complicated or nuanced, or there's too much overlap, I think that's where you need to start figuring out patterns to delegate larger chunks of work, or things like that. So when you say that we should give clear descriptions of what tools should be used when, does that belong in the prompt or does that belong in the tool description?
I ask because I've been working on agentic features myself and I find that if I pass in a JSON schema where I tell it about every field and description in a way that's opinionated about what it's going to do, like that's generally worked better for me. But on the other hand, I see these architecture advancements with remote MCP servers where tools can be defined once and used in many other use cases.
So I'm not really sure what to do. Yeah, it's a great question. My lean is often to put things in the tool description, but honestly, I think you can do both. The way the prompt gets rendered when you provide tools is that the tool definitions just get rendered into the system prompt.
And so mechanically, the gap in text between writing it in the tool description and writing it just below in the system prompt is not that much. And I think it matters more to just have clear descriptions and to be clear about what it is. I think the thing that's nice about putting it in the tool description is that you're clearly tying the guidance to the tool you're talking about, and, given the way that we've trained it, you're sort of guaranteed that the syntax the model reads to understand a tool description is something it's seen before.
Whereas if you venture off that path, there's a risk that you're going to do something that's not as easy for the model to understand. But I think if you write a really strong prompt, it should work similarly well in both situations, I'd expect.
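In other words, either placement ends up as text the model reads. A rough sketch of the two options, using a made-up update_knowledge_base tool:

```python
# Option 1: guidance lives in the tool description itself (stays attached to the tool).
tool_with_guidance = {
    "name": "update_knowledge_base",
    "description": (
        "Store a durable fact about the game world, e.g. 'Mom is on the first floor of my house'. "
        "Use this whenever you learn something you will want hours from now; do not use it for "
        "moment-to-moment observations."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"fact": {"type": "string"}},
        "required": ["fact"],
    },
}

# Option 2: keep the description terse and put the same guidance in the system prompt,
# which is where tool definitions get rendered anyway.
system_prompt = (
    "You are playing Pokemon Red.\n"
    "Use update_knowledge_base only for durable facts you will want hours from now, "
    "never for moment-to-moment observations."
)
```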
So Claude 3.5 Sonnet got stuck in Mount Moon for a while. It did. Can it make it out? It will make it out. This is okay. Let's talk a little bit about Claude performance. This is a good chance to ramble here. Claude Opus is significantly better at Pokemon. But the ways that it's better are not the most satisfying ways if you want to see Pokemon get beat.
It's roughly as able to see the Game Boy screen as it was before. I didn't go to research and ask them to make the model better at reading Game Boy screens. That's not what our customers are asking for. It might be my favorite thing, but it wouldn't be a good reflection of priorities.
So it still struggles with some navigation challenges and things like that, where it's just not sure what it's seeing. Its ability to plan and execute on a plan is miles ahead of where it was in the past. My favorite example of this that I've seen: after you get the third badge, to go to Rock Tunnel, you need to get Flash, the HM.
To do that, you need to catch at least 10 species of Pokemon and then find some dude in a random building. It found the dude in the random building, it found out it needed to catch 10 Pokemon, and it went on a 24-hour grind session finding 10 Pokemon.
Uninterrupted, didn't get distracted, didn't do anything else: it caught 10 Pokemon, wandered back, got Flash, and went straight to Rock Tunnel. And this ability to build a plan and then actually track and execute against it over, in this case, something like 100 million tokens' worth of information was by far the best I've ever seen from a model.
So in this playthrough, as you watch at home, as you watch on the demo thing, I think you'll see it gets stuck in Mt. Moon for probably a similar amount of time, if I had to guess. But you'll see it do some like miles more intelligent things in the process of getting there.
Funny how it works. Yeah. Hey, I just have a question about parallel tool calling. Yeah. It's the first time I've ever seen it. Is this state of the art? Uh, no. A model should be able to do this. Frankly, I wish 3.7 could have done this. I don't think this is an insane capability, but it matters.
It's just a useful thing for people to be able to do. So just under the hood, in the messages array you use to interact with the model, are you doing some magic on your end to enable that? It's kind of on the model to say, hey, I'm done. I've described the set of tool calls I want to make and I'm done, or not.
I've described a set of tool calls I want to make and I'm done or not. So the model in the past would just like make one tool call and say, I want to wait for the result of this. The model now is more likely to understand that in some cases, I actually know two or five or eight tool calls that I want to make right now.
And it will describe all of those. And then the object you get back from the API has eight tool use blocks that say, here are the eight tools I want to use. And then you're asked to go and run those. Awesome. Thank you. Yeah.
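Mechanically, that just means one assistant turn can carry several tool_use blocks, and you send back one tool_result per block in a single user message. A sketch, assuming the response, run_tool, and messages from the earlier loop:

```python
# One generation, possibly many tool calls.
tool_calls = [block for block in response.content if block.type == "tool_use"]
for call in tool_calls:
    print(call.name, call.input)  # e.g. press_buttons, update_knowledge_base, ...

# Run them (concurrently if you like), then return every result in one user message,
# one tool_result per tool_use id, before generating again.
messages.append({
    "role": "user",
    "content": [
        {"type": "tool_result", "tool_use_id": call.id,
         "content": run_tool(call.name, call.input)}
        for call in tool_calls
    ],
})
```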
So I'm particularly interested in this idea, right? So with parallel tool calls, there are some cases where it's obvious that all the tool calls can actually happen in parallel. But then there's more of a planning sense, where you showed, like, press A, press A, press A. And so of course, I'm thrown back to being six in my mom's minivan and remembering when I restarted a really long conversation because I was spamming A.
Yep. And so I'm just nerdily curious if it's ever done that, where it's impatiently restarted a conversation. All the time. All the time. But I think that also scratches at a deeper thing: is there ever such a thing as too much planning, and do you see it being too opinionated about following the plan and not updating with new information, like the fact that the conversation has ended?
I think this is the domain of good prompting, honestly. So the reason that it actually hits many buttons is you'll see its thought process say, I'm going to hit a whole bunch of buttons and I'll stop whenever it's done. But it doesn't quite have the sense of time that we do.
So if it says, I want to hit A 500 times, it's like, oh, don't worry, I'll know when I have finished the dialogue and then I'll stop. But it doesn't quite understand that it doesn't get to see in between each press by default. I don't know, that's a very LLM problem, that you have to register 500 button presses, close your eyes, and then come back and find out what happened.
But you can actually get around that just with prompting, helping the model understand what is happening, what its limitations are, and how it should act. So in the system prompt for Claude Plays Pokemon, I just have to tell it: when you register a sequence of buttons, you're not going to see the screen in between.
And so there could be side effects. Restarting the dialogue is a simple version, but you can actually do much worse things in Pokemon. I've seen it accidentally overwrite one of its moves when it was learning a new move, in a way that was quite bad for making progress in the game.
And so this is, I think, the space where, as someone building agents, you have a lot of room to see how models make mistakes like that, help them understand why and what's going on, and build that into how you prompt your agents. And that's a lot of how I think about iterating on agent design.
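That kind of guidance ends up looking something like the snippet below. The wording is invented here for illustration, not the actual Claude Plays Pokemon system prompt.

```python
# Hypothetical excerpt of that kind of system-prompt guidance.
BUTTON_SEQUENCE_WARNING = (
    "When you register a sequence of button presses, you will NOT see the screen between "
    "presses; you only see the final state after the whole sequence has run. Long blind "
    "sequences can have side effects (restarting a dialogue, overwriting a move), so keep "
    "sequences short when the outcome is uncertain and check the screen before continuing."
)
```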
So in our production agent, we saw that with 3.7 there was not very good consistency when calling about 18 tools, versus if you were to pass the model a single tool with the exact same prompt and have it call that. And you mentioned before that the Claude 4 models are able to handle over 100 tools.
Are there any changes or differences you're seeing in how you get consistent performance across so many tools? I think we've pretty clearly seen that these models are much better at precise instruction following. This can be a double-edged sword. If you're imprecise with the instructions you write, they'll readily follow them, or sometimes get confused by conflicting instructions.
But I think the key is that with very good tool design and very crisp prompting, we've seen that these models are much more capable of following a pretty long set of different and complex instructions and executing on them. So I think the key is there's more room to hill-climb on a prompt, maybe, is what I would say with these models. Which is to say, as you write more and more precise descriptions of your tools, there's more room to get better and better across a wider range of tools and reach the same level of performance you'd expect with a single tool.
I think I am at time. I have successfully given a very different talk than I expected to, but I appreciate you all for being here, and it was fun to talk with you all. Thank you. Thank you.