Claude Plays Pokemon | Code w/ Claude

All right. Welcome, everybody. I was asked to talk about tool use and some of the changes to our new models that have made them better at using tools. And so I decided to use this opportunity to instead talk about Claude Plays Pokemon, which I made. So I'm David. I'm on our Applied AI team here. I'm the creator of Claude Plays Pokemon. And what we are going to do today is relaunch the stream live together. We're going to talk about it and we're going to have a good time. Could you flip over to my demo screen, please? All right. I have a button queued up to my machine in my house, in my basement in Seattle, which is running this. And I'm going to hit enter, but I need help from the crowd to count down from 10 for me to windmill slam this enter button. So I'm going to start it, but I need all of you to participate. It's very important for the vibes. 10, 9, 8, 7, 6, 5, 4, 3, 2, 1. Let's go.

All right. So I'm here to talk about Claude Plays Pokemon because there's some new stuff that our models can do that is really exciting and makes the models better at Pokemon, better at a lot of things, better at being agents. What this is actually going to practically look like, and I'll show some examples later in some slides I have after we're done, is that our models will learn and adapt and think differently. So in a minute, we're going to get to the name entry screen, which is one of my favorite examples I show later. The name entry screen is a notoriously challenging place for Claude. Claude does not quite understand how cursors move around, and it gets a little bit lost. It's a grid. It gets confused. One of the things that it's really good at with this extended thinking mode is building a full plan of where it needs to move and what it needs to do, thinking between tool calls. So in the past, it might not use that extended thinking to reconsider its assumptions, question itself, and figure out the right answer. In Claude 4, you will see that happen.

Another feature we added in this new version of Claude, and one we have been asked for a lot, is the ability to make multiple tool calls at once; parallel tool calling is the other name for it. Claude 3.7 was frustratingly bad at this. It would only call one tool at any given time. And what that means practically, when you're building an agent, is that it'll call one tool and then it will wait. You'll have to make a whole new generation, and you'll take a whole time-to-first-token hit between generations. The new models are much more keen to call multiple tools. We actually saw it right at the beginning of the stream; I haven't seen it since. But it will do things in Pokemon like take an action and update its memory at the same time, which essentially just saves us tokens when we're building agents. The model is going to take more actions more quickly and not need to go through the plan-act loop as often.

Tool use has evolved rapidly in the last year. When I first started helping customers with tool use last year, a lot of it was: let's give the model a calculator tool so it can do math, because the model is bad at math, so it has a way to offload that work. These days, that is not what tool use gets used for. Tool use is the driver of agents. When people build with tools, they give models full suites of tools that enable models to take long agentic actions and move forward. And so the agentic loop really revolves around tools. In an agentic loop, the model will plan an action, act on whatever that plan was, learn something from what it saw, and then repeat that until it's accomplished its goals. In Pokemon, it might say: I'm going to try to talk to my mom right now. The way it would do that is: I'm going to press A. And then it will reflect, see the dialog box come up, see it worked, and keep going with its job to play Pokemon.
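That plan, act, learn loop is short to write down. Here's a minimal sketch of it against the Anthropic Messages API; the press_buttons tool, the fake execute_buttons emulator hook, and the model string are illustrative stand-ins, not the actual Claude Plays Pokemon harness:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Stand-in for the emulator: apply the presses, report the new game state.
def execute_buttons(buttons: list[str]) -> str:
    return f"Pressed {buttons}. A dialog box is now open."

tools = [{
    "name": "press_buttons",
    "description": "Press a sequence of Game Boy buttons, e.g. ['up', 'a'].",
    "input_schema": {
        "type": "object",
        "properties": {"buttons": {"type": "array", "items": {"type": "string"}}},
        "required": ["buttons"],
    },
}]

messages = [{"role": "user", "content": "You are playing Pokemon Red. Go talk to Mom."}]

while True:
    # Plan + act: the model reasons about the goal and emits tool calls.
    response = client.messages.create(
        model="claude-opus-4-20250514",
        max_tokens=2048,
        tools=tools,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break  # no tool call requested: the model believes the goal is done
    # Learn: execute every requested tool and feed the observations back.
    messages.append({
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": block.id,
             "content": execute_buttons(block.input["buttons"])}
            for block in response.content if block.type == "tool_use"
        ],
    })
```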
So let's talk about those two big improvements we added with tool calling in Claude 4. The first is improved planning. By being able to use extended thinking mode between tool calls, you can now see the model actually break problems down, build plans, step back, reflect, and question its assumptions between tool calls. And by calling multiple tools at once, the models will be more efficient when acting as agents. This has a practical impact in Pokemon that we also didn't get to see yet.

So I want to talk a little bit about this interleaved thinking, thinking between tool calls, because there are some clear examples in Pokemon of how it works out. In the past, when you launched it (and this is the one thing we actually did see), when you hit run on a model, it would build the whole plan for how it was going to beat Pokemon in its first message. It would write a terrible plan. It would say: I'm going to write my name down as Claude, I'm going to give my Pokemon nicknames, and I'm going to go beat Pokemon. That was the extent of its planning. And then it would hit a really big, horrible challenge, which is the name entry screen, and everything would fall apart. It would occasionally hit left to move the cursor to a new letter, accidentally wrap around to the other side of the screen, and think: how the heck did my cursor end up on the right side of the screen? The game must be bugged. Everything is terrible.

Now, with the ability to do extended thinking between tool calls, you'll see the model actually catch these errors more often, adjust, adapt its thinking, and come up with a better plan. So in this example, which is an actual trace from Claude 4 Opus, it says: I'm really stumped. The cursor went right instead of left when I hit left. What happened? And then it will actually say: wait, let's step through what actually happened. Where did the cursor go? It was at this letter, this letter, this letter. Actually, what I think happened is the cursor spilled over and wrapped around to the other edge. Everything's okay. I understand how this works now. Let's keep going with name entry. It can pick that up and learn it. And that ability to adapt on the fly is really meaningful as you build agents that are expected to take in tons of new information as they're working.
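Mechanically, thinking between tool calls is opt-in at the API level. A minimal sketch, reusing the client, tools, and messages from the loop above; the thinking parameter is the documented extended thinking option, and the beta header string is the one published around the Claude 4 launch, so treat both as assumptions to check against current docs:

```python
response = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=8192,  # must exceed the thinking budget
    # Give Claude a token budget to reason with...
    thinking={"type": "enabled", "budget_tokens": 4096},
    # ...and allow thinking blocks between tool calls, not just up front.
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},
    tools=tools,
    messages=messages,
)

# Thinking blocks now appear in the content stream next to tool_use blocks;
# this is where you see it step back and question its assumptions mid-task.
for block in response.content:
    if block.type == "thinking":
        print("thinking:", block.thinking)
```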
Similarly, we have parallel tool calling. Parallel tool calling is more of an efficiency game. In the past, when you're sitting there waiting to talk to Mom, the model would press A, talk. And then if it wanted to update its knowledge base to keep track of where it found Mom, it would have to take a whole other action: call out to Claude again, wait for the time-to-first-token hit, make that change. With parallel tool calling, it can do both things at once. It can say: I'm going to talk to Mom, I'm going to update my knowledge base, and I want to advance the dialogue six times by pressing A six times because I'm bored talking to Mom. This saves you time. It saves your customers time. It speeds up how agents work. And it will make agents work more effectively for your customers, who won't have to wait around for redundant tool calls and round trips to Claude.
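Concretely, that whole turn comes back as one generation whose content holds several tool_use blocks. A hypothetical example of what the model might emit for the scene above (the update_knowledge_base tool and the block ids are invented for illustration):

```python
# response.content from ONE generation: three tool calls, no extra round trips.
response_content = [
    {"type": "tool_use", "id": "toolu_01", "name": "press_buttons",
     "input": {"buttons": ["a"]}},  # talk to Mom
    {"type": "tool_use", "id": "toolu_02", "name": "update_knowledge_base",
     "input": {"key": "mom_location", "value": "player's house, ground floor"}},
    {"type": "tool_use", "id": "toolu_03", "name": "press_buttons",
     "input": {"buttons": ["a", "a", "a", "a", "a", "a"]}},  # skip the dialogue
]
```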
And so what this means, and what's next, is that models are getting better at being agents. This is obvious. We knew this. But this is one of the core things we work on at Anthropic: we find ways to make models smarter when they're acting over long time horizons and solving complex problems. Extended thinking between tool calls is an example of this. It's something we've seen make a real impact on how effective agents are in the real world. But I also want to point out that Claude is being trained to be a useful agent and an easier one to build with. When we build our models, we try to listen to developers. We try to understand what it means to give Claude the capabilities that make it work more easily, more seamlessly, better for users. And we train those into our models too, things like parallel tool calling, where we want to hear feedback and improve, model over model.

I will let some people ask questions. We can chat about this a little bit. Yeah. Over here.
Hey. Hi. Thanks. Claude Plays Pokemon is awesome. So one question I had was: you have many low-level actions, right, like press button A, press button B. And then you also have some high-level actions, like go to this point that you've previously visited. Are all of these in the same hierarchy of tools? Because when you're building any agent, you have some sort of zoomed-out action that you want to take and then some zoomed-in press-a-button action. How do you think of this? Should it be flat? Should it be a hierarchy? Thanks.

I think, as with designing any agent, designing tools tends to be the thing that actually matters the most. And this is the most simple set of tools; in fact, I've aimed for simplicity with Pokemon, so it's somewhat a bad example in this sense. But what really matters is being clear: separating the concerns of which tool should be used when, and giving good examples of which tool should be used when and how. So in the case of Pokemon, I have this tool that allows the model to navigate to a specific place, and then you have to just be very clear to it that it should use that when it's trying to move around in the overworld. I watched Claude play a bunch and I found out that the model was quite bad at moving around in the overworld. So I just basically tell it: hey, you're not good at this; when you're trying to do this set of tasks, this is the right tool to use, and you'll have a better outcome. Versus if you're in a battle, just press buttons directly; you're perfectly capable of that, and it's easier for you. That's a good way to do it. And so I think about a loop of building these: watch the model, see where it struggles, try to design and build tools that will help in some of the places it struggles, and then write clear descriptions that help the model understand what you have seen, what its shortcomings are, why it might need this tool and in what scenarios, and equip it with that knowledge.
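As a sketch of what that separation of concerns might look like in the tool definitions themselves (illustrative names and wording, not the actual Claude Plays Pokemon schemas):

```python
tools = [
    {
        "name": "press_buttons",
        # Low-level tool: scoped to the situations where raw input works well.
        "description": (
            "Press Game Boy buttons directly. Use this in battles, menus, and "
            "dialogs, where precise button-by-button control works well for you."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"buttons": {"type": "array", "items": {"type": "string"}}},
            "required": ["buttons"],
        },
    },
    {
        "name": "navigate_to",
        # High-level tool: the description says when to prefer it, and why.
        "description": (
            "Walk to a location you have previously visited. You are not good at "
            "overworld navigation with raw button presses, so whenever you are "
            "moving around the overworld, use this tool for a better outcome."
        ),
        "input_schema": {
            "type": "object",
            "properties": {"destination": {"type": "string"}},
            "required": ["destination"],
        },
    },
]
```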
I think there's a microphone here. Yes.

There's a pattern in which, if you have a bunch of tool functions and you don't want to clutter your current context with a whole huge list, you adopt a helper which acts like a proxy, where the model can say, hey, I want to accomplish this, and then... okay, so you know what I'm talking about. Have the dynamics of that particular pattern changed at all with the new model?

I have not studied it, and I don't think we really know how that will break down with the new model. My expectation is that the smarter a model gets, the more I trust it with the full context and with making complex decisions. So my gut when building with Opus, or Sonnet really, the Claude 4 models, is to give it the full list. And again, maybe the concern about context clutter still applies if the tool list gets too long or the tools don't quite click.

Yeah, and a quick follow-up: how many tools has that held up to in your experience?

I think we've pretty confidently seen the model be able to navigate on the order of 50 to 100 tools. It's a question of definition, though. As a human who writes prompts and writes tools out, the more tools you write, the less likely it is that you're going to be precise enough in where and how you define those tools to the model and divide the lines between them. So from my perspective, it's a little bit like: with well-designed tools, that's possible. If it gets complicated or nuanced, or there's too much overlap, I think that's where you need to start figuring out patterns to delegate larger chunks of work, or things like that.

So when you say that we should give clear descriptions of what tools should be used when, does that belong in the prompt or does that belong in the tool description? I ask because I've been working on agentic features myself, and I find that if I pass in a JSON schema where I tell it about every field and description in a way that's opinionated about what it's going to do, that's generally worked better for me. But on the other hand, I see these architecture advancements with remote MCP servers where tools can be defined once and used in many other use cases. So I'm not really sure what to do.

Yeah, it's a great question. My lean is often to put things in the tool description, but honestly, I think you can do both. The way that our prompt gets rendered when you provide tools is that it just renders the tools in the system prompt. And so mechanically, the gap in text between writing it in the tool description and writing it just below in the system prompt is not that much. I think it matters more to just have clear descriptions and to be clear about what each tool is. The thing that's nice about putting it in a tool description is you're clearly separating which tool you're talking about. And beyond that, because of the way we've trained the model, we're sort of guaranteeing that the syntax used for the model to read and understand a tool description is something it's seen before; whereas if you venture off that path, there's a risk that you do something that's not as easy for the model to understand. But if you write a really strong prompt, I'd expect it to work similarly well in both situations.
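To make that concrete, here's a hedged sketch of the two placements, reusing the client from the earlier sketches. Because tool definitions get rendered into the system prompt anyway, the same guidance can live in either spot:

```python
# Option 1: usage guidance inside the tool description, in the trained-on format.
navigate_to = {
    "name": "navigate_to",
    "description": (
        "Walk to a previously visited location. Always prefer this over raw "
        "button presses when moving around the overworld."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"destination": {"type": "string"}},
        "required": ["destination"],
    },
}

# Option 2: the same guidance a few lines away, in the system prompt itself.
system = (
    "When moving around the overworld, always use the navigate_to tool "
    "rather than pressing buttons directly."
)

response = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=1024,
    system=system,
    tools=[navigate_to],
    messages=[{"role": "user", "content": "Get to the Pokemon Center."}],
)
```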
So, 3.5 Sonnet got stuck in Mount Moon for a while.

It will make it out. This is okay. Let's talk a little bit about Claude's performance; this is a good chance to ramble. Claude Opus is significantly better at Pokemon. But the ways that it's better are not the most satisfying ways if you want to see Pokemon get beaten. It's roughly as able to see the Game Boy screen as it was before. I didn't go to research and ask them to make the model better at reading Game Boy screens; that's not what our customers are asking for. It might be my favorite thing, but it wouldn't be a good reflection of priorities. So it still struggles with some navigation challenges and things like that, where it's just not sure what it's seeing. Its ability to plan and execute on a plan, though, is miles ahead of where it was in the past.

My favorite example of this that I've seen: after you get the third badge, to go through Rock Tunnel you need to get Flash, the HM. To do that, you need to catch at least 10 species of Pokemon and then find some dude in a random building. It found the dude in the random building, found out it needed to catch 10 Pokemon, and went on a 24-hour grind session catching 10 Pokemon. Uninterrupted. It didn't get distracted, didn't do anything else: catch 10 Pokemon, wander back, get Flash, go straight to Rock Tunnel. And this ability to build a plan and then actually track and execute against it over, in this case, something like 100 million tokens' worth of information was by far the best I've ever seen from a model. So in this playthrough, as you watch at home, as you watch on the demo screen, I think you'll see it get stuck in Mt. Moon for probably a similar amount of time, if I had to guess. But you'll see it do some miles-more-intelligent things in the process of getting there.
Yeah. Hey, I just have a question about parallel tool calling. Is this state of the art? It's the first time I've ever seen it.

No. A model should be able to do this. Frankly, I wish 3.7 could have done it. I don't think this is an insane capability, but it matters. It's just a useful thing for people to be able to do.

So just under the hood, in the messages array you're interacting with the model through, are you doing some magic on your end to kind of...?

It's kind of on the model to say: hey, I've described the set of tool calls I want to make, and I'm done, or not. So the model in the past would just make one tool call and say, I want to wait for the result of this. The model now is more likely to understand that in some cases it actually knows two or five or eight tool calls that it wants to make right now, and it will describe all of those. And then the object you get back in the API has eight tool use blocks that say: here are the eight tools I want to use. And then you're asked to go sort of render those.
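In SDK terms, "rendering those" just means looping over every tool_use block in the response and answering all of them in a single user turn. Continuing the earlier sketches (handle_tool is a hypothetical dispatcher to your own implementations):

```python
def handle_tool(name: str, tool_input: dict) -> str:
    # Hypothetical dispatcher: route each tool name to your implementation.
    if name == "press_buttons":
        return execute_buttons(tool_input["buttons"])
    raise ValueError(f"unknown tool: {name}")

# Echo the assistant turn, then answer EVERY tool_use id in one user message.
messages.append({"role": "assistant", "content": response.content})
messages.append({
    "role": "user",
    "content": [
        {"type": "tool_result", "tool_use_id": block.id,
         "content": handle_tool(block.name, block.input)}
        for block in response.content if block.type == "tool_use"
    ],
})
```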
So I'm particularly interested in that idea, right? With parallel tool calls, there are some cases where it's obvious that all the tool calls can actually happen in parallel. But then there's more of a planning sense, where you showed: press A, press A, press A. And of course, I'm thrown back to being six in my mom's minivan, remembering when I restarted a really long conversation because I was spamming A.

Yep.

So I'm nerdily curious: has it ever done that, where it's impatiently restarted a conversation?

All the time. All the time.

But I think that also scratches at a deeper thing: is there ever such a thing as too much planning? Do you see it being too opinionated about following the plan and not updating with new information, like the fact that the conversation has ended?

I think this is the range for good prompting, honestly. The reason that it actually hits many buttons is you'll see its thought process say: I'm going to hit a whole bunch of buttons and stop whenever the dialogue is done. But it doesn't quite have the sense of time that we do. So if it says, I want to hit A 500 times, it thinks: oh, don't worry, I'll know when I have finished the dialogue and then I'll stop. But it doesn't quite understand that it doesn't get to see in between each press by default. I don't know, it's a very LLM problem that you have to register 500 button presses, close your eyes, and then come back and find out what happened. But you can actually get around that just with prompting: helping the model understand what is happening, what its limitations are, and how it should act. So in the system prompt for Claude Plays Pokemon, I just have to tell it: when you register a sequence of buttons, you don't get to see what happens in between.
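That guidance is literally a couple of sentences of prompt. A paraphrased sketch of the idea (not the verbatim Claude Plays Pokemon system prompt):

```python
SYSTEM_PROMPT = """\
You are playing Pokemon Red through an emulator.

When you register a sequence of button presses, you will NOT see the screen
between presses; you only observe the final state after the whole sequence
has run. Keep sequences short when the outcome is uncertain (menus, move
selection, dialog choices), and only batch long sequences when you are
confident nothing important can change partway through.
"""
```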
Restarting the dialogue is a simple version, but you can actually do much worse things in Pokemon. I've seen it accidentally overwrite one of its moves when it was learning a new move, in a way that was quite bad for making progress in the game. And so this is, I think, the space where someone building agents has a lot of room: see how models make mistakes like that, help them understand why and what's going on, and build that into how you prompt your agents. And that's a lot of how I think about it.

So in our production agent, we saw that in 3.7 there was some not-very-good consistency when calling about 18 tools, versus if you were to just pass the model a single tool with the exact same prompt and have it call that. And you mentioned before that the 4 models are able to handle over 100 tools. Are there any changes or differences you're seeing in how you get consistent performance?

I think we've pretty clearly seen that these models are much better at precise instruction following. This can be a double-edged sword: if you're imprecise with the instructions you write, they'll readily follow them, or sometimes get confused by contradicting instructions. But I think the key is that with very good tool design and very crisp prompting, we've seen that these models are much more capable of following a pretty long set of different and complex instructions and executing on them. So maybe what I would say is that there's more room to hill-climb on a prompt with these models: as you write more and more precise descriptions of your tools, there's more room to get better and better across a wider range of tools, and to reach the same level of performance you'd expect with a single tool.

I think I am at time. I have successfully given a very different talk than I expected, but I appreciate you all for being here, and it was fun to talk with you all.