
Claude plays Pokemon | Code w/ Claude


Whisper Transcript

00:00:00.000 | All right. Welcome, everybody. I was asked to talk about tool use and some of the changes
00:00:14.520 | to our new models that have made them better at using tools. And so I decided to use this
00:00:21.400 | opportunity to instead talk about Claude Plays Pokemon, which I made. So I'm David. I'm on
00:00:28.120 | our Applied AI team here. I'm the creator of Claude Plays Pokemon. And what we are going to do today
00:00:33.940 | is relaunch the stream live together. We're going to talk about it and we're going to have a good
00:00:39.220 | time. Could you flip over to my demo screen, please? All right. I have a button queued up to
00:00:45.640 | my machine in my house in my basement in Seattle, which is running this. And I'm going to hit enter,
00:00:51.300 | but I need help from the crowd to count down from 10 for me to windmill slam this enter button. So
00:00:57.080 | I'm going to start it, but I need all of you to participate. It's very important for the vibes
00:01:00.700 | of the run. 10, 9, 8, 7, 6, 5, 4, 3, 2, 1. Let's go. All right. So I'm here to talk about Claude Plays
00:01:19.680 | Pokemon because there's some new stuff that our models can do that is really exciting and makes
00:01:26.600 | the models better at Pokemon, makes the models better at a lot of things, makes the models better
00:01:30.180 | at being agents. What this is actually going to practically look like, and I'll show some examples
00:01:33.380 | later in some slides I have after we're done, is that our models will learn and adapt and think
00:01:40.680 | differently. So in a minute, we're going to get to the name entry screen. One of my favorite examples I show
00:01:44.680 | later is that one. The name entry screen is a notoriously challenging place for Claude. Claude does not quite
00:01:51.140 | understand how cursors move around, and it gets a little bit lost. It's a grid. It gets confused.
00:01:55.440 | One of the things that it's really good at with this extended thinking mode is building a full plan of
00:02:01.620 | where it needs to move, what it needs to do, that kind of thinking between tool calls. So in the past, it might
00:02:05.820 | not use that extended thinking to reconsider its assumptions, question itself, figure out the right
00:02:10.940 | answer. In Claude 4, you will see that happen. Another feature we added in this new version of Claude that we
00:02:23.000 | have been asked for a lot is the ability to call multiple tool calls at once. So parallel tool calling
00:02:29.320 | is the other name. Claude 3.7 was frustratingly bad at this. It would only call one tool at any given time.
00:02:36.200 | And what that means for you practically, when you're building an agent, is it'll call one tool and then it will wait.
00:02:42.920 | You'll have to make a whole new generation. You'll get a whole time-to-first-token hit between generations.
00:02:48.580 | With the new models, they are much more keen to call multiple tools. We actually saw it right at the beginning.
00:02:55.200 | I haven't seen it since. But it will do things in Pokemon, like take an action and update its memory at the same time,
00:03:00.920 | which essentially just saves us tokens when we're building agents. The model is going to take more
00:03:05.920 | actions more quickly and not need to go through the plan act loop as often. Tool use has evolved
00:03:12.960 | rapidly in the last year. When I first started helping customers with tool use last year, a lot of it was like,
00:03:19.920 | let's use a calculator, give the model a calculator tool so it can do math because the model is bad at math,
00:03:25.760 | so it has somewhere to offload that work. These days, that is not what tool use gets used for. Tool use is the driver of agents.
00:03:31.920 | When people build with tools, they give models full suites of tools that enable models to take long,
00:03:37.840 | agentic actions and move forward. And so the agentic loop really revolves around tools.
00:03:47.680 | In an agentic loop, the model will plan an action, act on whatever that plan was, learn something from
00:03:54.960 | what it saw, and then repeat that until it's accomplished its goals. In Pokemon, it might say,
00:04:00.720 | I'm going to try to talk to my mom right now. The way it would do that is I'm going to press A,
00:04:04.400 | and then it will reflect, see the dialog box come up, and see it worked, and keep going with its job
00:04:09.840 | to play Pokemon.
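As a rough sketch of that plan-act-learn loop in code, assuming the Anthropic Python SDK; the press_button tool and the run_emulator helper are hypothetical stand-ins for a real harness, not the actual Claude Plays Pokemon setup:

```python
# A minimal plan-act-learn loop, as a sketch. The press_button tool and
# run_emulator() helper are hypothetical stand-ins, not the real harness.
import anthropic

client = anthropic.Anthropic()

tools = [{
    "name": "press_button",
    "description": "Press a single Game Boy button: A, B, UP, DOWN, LEFT, RIGHT, START, or SELECT.",
    "input_schema": {
        "type": "object",
        "properties": {"button": {"type": "string"}},
        "required": ["button"],
    },
}]

def run_emulator(button: str) -> str:
    """Hypothetical emulator bridge: presses the button, returns what happened."""
    return f"Pressed {button}. A dialog box appeared."

messages = [{"role": "user", "content": "Go talk to your mom downstairs."}]

while True:
    # Plan: the model decides what to do next.
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason != "tool_use":
        break  # no more actions requested; the model thinks it is done

    # Act, then learn: run each requested tool and feed the observations back.
    results = [
        {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": run_emulator(block.input["button"]),
        }
        for block in response.content
        if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": results})
```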
00:04:15.600 | So let's talk about those two big improvements with tool calling in Claude 4. The first is improved planning. By being able to use extended thinking mode between tool calls,
00:04:22.480 | you are now able to see the model actually break down, build plans, step back, reflect,
00:04:29.200 | question its assumptions between tool calls. And by calling multiple tools at once, the models will
00:04:34.400 | be more efficient when acting as agents. This has a practical impact in Pokemon that we also didn't get
00:04:40.320 | to see. So I want to talk a little bit about this interleaved thinking, thinking between tool calls,
00:04:46.880 | because there's some clear examples in Pokemon of how this works out. In the past, when you launched
00:04:53.680 | it, we actually saw -- this is the one thing we actually did see -- is when you hit run on a model,
00:04:58.160 | it would build the whole plan for how it was going to beat Pokemon in its first message.
00:05:02.800 | It would write a terrible plan. It would say, I'm going to write my name down as Claude,
00:05:07.520 | and I'm going to give my Pokemon nicknames, and I'm going to go beat Pokemon, and that's the extent of its
00:05:11.360 | planning. And then it would hit a really big, horrible challenge, which is the name entry screen,
00:05:17.520 | and everything would fall apart. It would occasionally hit left to move the cursor to a new
00:05:24.240 | letter and accidentally wrap around to the other side of the screen and think, how the heck did my
00:05:28.800 | cursor end up on the right side of the screen? The game must be bugged. Everything is terrible.
00:05:33.120 | And now with the ability to do extended thinking between tool calls, you'll see the model actually
00:05:40.800 | sort of catch these errors more often, adjust, adapt its thinking, and come up with a better plan.
00:05:47.120 | So in this example, this is an actual trace from Claude 4 Opus where it says, I'm really stumped. The
00:05:52.880 | cursor went right instead of left when I hit left. What happened? And then it will actually say, wait,
00:05:57.680 | let's step through. What actually happened? Where did the cursor go? It was at this letter, this letter,
00:06:01.760 | this letter. Actually, what I think happened is the cursor spilled over and wrapped around to the
00:06:07.760 | other edge. Everything's okay. I understand how this works now. Let's keep going with name entry,
00:06:12.240 | and it can sort of pick that up and learn it. And that ability to adapt on the fly is really meaningful
00:06:17.440 | as you build agents that are kind of expected to take in tons of new information as they're building.
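In API terms, interleaved thinking is enabled with a beta flag alongside a thinking budget. A minimal sketch, assuming the Anthropic Python SDK; the beta header name below is the one from the Claude 4 launch, so treat it as an assumption and check the current docs:

```python
# Sketch: turning on extended thinking between tool calls (interleaved
# thinking) for a Claude 4 model. The beta flag name is an assumption
# based on the launch docs; verify against current documentation.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},
    extra_headers={"anthropic-beta": "interleaved-thinking-2025-05-14"},
    tools=[{
        "name": "press_button",
        "description": "Press a single Game Boy button.",
        "input_schema": {
            "type": "object",
            "properties": {"button": {"type": "string"}},
            "required": ["button"],
        },
    }],
    messages=[{"role": "user", "content": "Enter the name CLAUDE on the name entry screen."}],
)

# With interleaved thinking, the content can alternate thinking blocks and
# tool_use blocks: plan, press a button, notice the cursor wrapped, re-plan.
for block in response.content:
    print(block.type)
```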
00:06:26.960 | Similarly, we have parallel tool calling. Parallel tool calling, more of an efficiency game. In the past,
00:06:32.960 | when you're sitting there waiting to talk to mom, the model would press A, talk. And then if it wanted
00:06:37.680 | to update its knowledge base to keep track of where it found mom in the past, it would have to take a
00:06:42.160 | whole other action, call out to Claude again, wait for the time-to-first-token hit, make that change.
00:06:47.280 | With parallel tool calling, it can do both things at once, basically. It can say, I'm going to talk to mom. I'm going to update my knowledge base.
00:06:54.960 | I want to advance the dialogue six times by pressing A six times because I'm bored talking to mom.
00:06:59.680 | This saves you time. It saves your customers time. It speeds up how agents work.
00:07:04.320 | And it will make agents work more effectively for your customers that won't have to wait around
00:07:11.360 | for sort of redundant tool calling and calls to Claude.
00:07:17.360 | And so what this means, and what's next, is that models are getting better at being agents. This is
00:07:23.360 | obvious. We knew this. But this is one of the core things we work on at Anthropic. We find ways to
00:07:27.920 | make models smarter when they're acting over long time horizons and solving complex problems. Extended
00:07:35.680 | thinking between tool calls is an example of this. It's something we've seen make a real impact on how
00:07:39.920 | effective agents are in the real world. But I also want to talk about how Claude
00:07:44.640 | is being trained to be a useful agent and an easier one to build with. When we build our models,
00:07:49.600 | we try to listen to developers. We understand what it means to give Claude the capabilities that make it work
00:07:54.880 | more easily, more seamlessly, better for users. And we train those into our models too. Things like
00:07:59.680 | parallel tool calling are things we want to hear feedback on and improve model over model.
00:08:04.480 | I will let some people ask questions. We can chat about this a little bit. Yeah. Over here.
00:08:09.120 | Hey. Hi. Thanks. Claude Plays Pokemon is awesome.
00:08:13.520 | So one question I had was, so you have many low level actions, right, which is like click button A,
00:08:19.440 | click button B. And then you also have some high level actions like go to this point that you've
00:08:24.000 | previously visited. That's one of the high level actions. Are all these in the same hierarchy of tools?
00:08:29.200 | Or how do you think of, because when you're building any agent, like, you know, you have some
00:08:32.800 | sort of zoomed out view action that you want to take and then some zoomed in click a button action.
00:08:37.120 | Yep. How do you think of this? And should it be flat? Should it be like a hierarchy? How do you
00:08:41.120 | think of this? Thanks. I think with designing any agent, designing tools tends to be the thing that
00:08:47.120 | actually matters the most. And this is like the most simple set of tools. In fact, I've aimed for
00:08:53.760 | simplicity with Pokemon. It's somewhat a bad example in this sense. But what really matters is being
00:08:58.720 | clear, like separating the concerns of what tool should be used when, giving good examples of
00:09:03.680 | what tools should be used when, and how to do that. And so in the case of Pokemon, I have this tool that
00:09:08.160 | allows the model to navigate to a specific place. And then you have to just be very clear to it that it
00:09:13.680 | should use that when it's trying to move around in the overworld. It's like, I watched Claude play a
00:09:20.560 | bunch and I found out that the model was quite bad at moving around in the overworld zone. And so just
00:09:27.040 | basically telling it, hey, you're not good at this. When you're trying to do this set of tasks, this is
00:09:32.000 | the right tool to use. You'll have a better outcome. Versus if you're in a battle, just pressing buttons
00:09:37.520 | directly. You're perfectly capable of that, and it's easier for you. That's a good way to do it. And so I think
00:09:42.000 | about sort of like the loop of building these, learn, like watch the model, see where it struggles,
00:09:48.480 | try to design and build tools that will help some of the places it struggles, and then write clear
00:09:54.800 | descriptions that help the model understand what you have seen, like what its shortcomings are,
00:10:00.000 | why it might need this tool and in what scenarios, and equip it with that knowledge.
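A hypothetical version of that mixed-level tool set, with the "use X when Y" guidance carried in the descriptions; the names and wording here are illustrative, not the actual Claude Plays Pokemon tools:

```python
# Hypothetical mixed-level tool set: the descriptions carry the routing
# advice ("use X when Y") learned from watching the model struggle.
tools = [
    {
        "name": "press_button",
        "description": (
            "Press a single Game Boy button. Use this in battles, menus, "
            "and dialogue, where you are reliable at direct button input."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "button": {
                    "type": "string",
                    "enum": ["A", "B", "UP", "DOWN", "LEFT", "RIGHT", "START", "SELECT"],
                },
            },
            "required": ["button"],
        },
    },
    {
        "name": "navigate_to",
        "description": (
            "Walk to a previously visited location. You are unreliable at "
            "step-by-step overworld movement, so prefer this tool whenever "
            "you are moving around the overworld, rather than pressing "
            "direction buttons yourself."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "A named place from your knowledge base, e.g. 'Viridian City Pokecenter'.",
                },
            },
            "required": ["location"],
        },
    },
]
```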
00:10:05.680 | I think there's a microphone here. Yes. There's a pattern in which, if you have a bunch of tool functions and you
00:10:12.640 | don't want to necessarily like clutter your current context with like a whole huge list, you adopt a
00:10:19.040 | helper, which kind of acts like a proxy where the model can say, hey, I want to accomplish this. And then,
00:10:24.960 | okay, so you know what I'm talking about. Have the dynamics of that particular pattern use changed at
00:10:30.160 | all with the new model? I have not studied it, and I don't think we really know
00:10:36.080 | how that will break down with the new model. My expectation is that the smarter a model
00:10:43.440 | gets, the more I trust it with the full context and to make complex decisions. So my gut with building
00:10:49.840 | with Opus would be, or Sonnet really, like the Claude 4 models, is giving it the full list. And again,
00:10:56.080 | maybe you're making a judgment call about context clutter, and avoiding it if the tool list is too long or
00:11:01.280 | the tools maybe don't click. Yeah, and a quick follow-up: how many tools has that held up with, in your
00:11:06.560 | experience? I think we've pretty confidently seen the model be able to
00:11:12.160 | navigate on the order of 50 to 100 tools. It's just a question of definition, though. As a human who writes
00:11:19.680 | prompts and writes tools out, the more tools you write, the less likely it is that
00:11:26.000 | you're going to be precise enough about where and how you actually define those tools to the model
00:11:30.880 | and sort of divide the lines between them. And so from my perspective, it's only with
00:11:35.120 | well-designed tools that that's possible. If it gets complicated or nuanced, or there's too much overlap,
00:11:40.400 | I think that's where you need to start figuring out like patterns to delegate larger chunks of work
00:11:44.000 | or things like that. So when you say that we should give
00:11:50.320 | clear descriptions of what tools should be used when, does that belong in the prompt or does that
00:11:57.520 | belong in the tool description? I ask because I've been working on agentic features myself and I find
00:12:05.600 | that if I pass in a JSON schema where I tell it about every field and description in a way that's opinionated
00:12:12.160 | about what it's going to do, like that's generally worked better for me. But on the other hand,
00:12:16.720 | I see these architecture advancements with remote MCP servers where tools can be defined once and used
00:12:22.720 | in many other use cases. So I'm not really sure what to do. Yeah, it's a great question. My lean is
00:12:30.800 | often to put things in a tool description, but honestly, I think you can do both. I mean,
00:12:36.000 | the way that our prompt gets rendered when you provide tools is that it just
00:12:41.520 | renders the tools into the system prompt. And so mechanically, the gap in text between whether you
00:12:47.280 | write it in the tool description or just below it in the system prompt is not that much. And I think it
00:12:52.560 | matters more to just have clear descriptions and to be clear about what it is. I think the thing that's nice
00:12:56.960 | about putting it in a tool description is you're clearly separating which tool you're talking about when.
00:13:02.560 | Also, with the way that we've trained it, we're sort of guaranteeing that the syntax that is used
00:13:09.600 | for the model to read and understand a tool description is something it's seen before. Whereas
00:13:13.440 | if you venture off that path, there's a risk that you're going to do something that's not as easy for
00:13:18.240 | the model to understand. But I think if you write a really strong prompt, it should work
00:13:23.840 | similarly well in both situations, I'd expect.
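As a sketch of the two placements described in that answer, with a hypothetical update_knowledge_base tool; either way the text ends up rendered into the system prompt:

```python
# Option 1: guidance lives in the tool description (attached to the tool).
tool_with_guidance = {
    "name": "update_knowledge_base",
    "description": (
        "Record a durable fact about the game world. Use this whenever you "
        "learn a location, an NPC, or an item; do NOT use it for transient "
        "battle state."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"fact": {"type": "string"}},
        "required": ["fact"],
    },
}

# Option 2: the same guidance lives in the system prompt instead. Both end
# up rendered into the system prompt; the description keeps the guidance
# unambiguously tied to its tool, in a syntax the model was trained on.
system_prompt = (
    "You have an update_knowledge_base tool. Use it whenever you learn a "
    "location, an NPC, or an item; do not use it for transient battle state."
)
```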
00:13:25.600 | So 3.5 Sonnet got stuck in Mount Moon for a while.
00:13:33.760 | It did.
00:13:34.560 | Can it make it out?
00:13:35.440 | It will make it out. Okay, let's talk a little bit about Claude's performance. This is a
00:13:40.640 | good chance to ramble here. Claude Opus is significantly better at Pokemon. But the ways
00:13:48.960 | that it's better are not the most satisfying ways if you want to see Pokemon get beat. It's like roughly
00:13:55.840 | as able to see the Game Boy screen as it was before. So, I don't know,
00:14:01.600 | I didn't go to research and ask them to make the model better at Game Boy screens. That's not what
00:14:06.400 | our customers are asking for. It might be my favorite thing, but it wouldn't be a good
00:14:10.400 | reflection. So it still struggles with some navigation challenges and stuff like that,
00:14:15.120 | where it's just not sure what it's seeing. Its ability to plan and execute on a plan is
00:14:20.160 | miles ahead of where it was in the past. My favorite example of this that I've seen,
00:14:24.160 | after you get the third badge to go to Rock Tunnel, you need to get Flash, the HM. To do that,
00:14:31.120 | you need to go catch at least 10 species of Pokemon and then like find some dude in a random building.
00:14:36.000 | It found the dude in a random building, it found out it needed to catch 10 Pokemon,
00:14:40.160 | and it went on a 24-hour grind session finding 10 Pokemon. Uninterrupted,
00:14:45.840 | didn't get distracted, didn't do anything else, caught 10 Pokemon, wandered back, got Flash,
00:14:50.000 | went straight to Rock Tunnel. And this ability to plan and execute, like build a plan,
00:14:55.280 | and then actually track and execute against it over, in this case, 100 million tokens worth of
00:15:01.600 | information, was by far the best I've ever seen from a model. So in this playthrough, as you watch at home,
00:15:08.080 | as you watch on the demo thing, I think you'll see it gets stuck in Mt. Moon for probably a similar amount of time,
00:15:13.760 | if I had to guess. But you'll see it do some like miles more intelligent things in the process of getting there.
00:15:21.200 | Funny how it works.
00:15:22.880 | Yeah. Hey, I just have a question about parallel tool calling. Yeah.
00:15:29.040 | First time I've ever seen this. Is this state of the art?
00:15:31.520 | Uh, no. A model should be able to do this. Frankly, I wish 3.7 could have done this.
00:15:37.680 | I don't think this is like an insane capability, but it matters. Like it's just a useful thing for
00:15:41.280 | people to be able to do. So just under the hood in your messages array that you're interacting with the model,
00:15:46.960 | are you just doing some magic on your end to make that work?
00:15:50.960 | It's kind of like on the model to say, hey, I'm done. I've described a set of tool calls I want to make
00:15:56.960 | and I'm done or not. So the model in the past would just like make one tool call and say,
00:16:01.520 | I want to wait for the result of this. The model now is more likely to understand that in some cases,
00:16:06.720 | I actually know two or five or eight tool calls that I want to make right now. And it will describe
00:16:12.080 | all of those. And then the object you get back in the API has eight tool use blocks that say,
00:16:16.240 | here are the eight tools I want to use. And then you're asked to go sort of like render those.
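Concretely, handling that kind of turn might look like this sketch, reusing the messages list from the loop sketch earlier and the Anthropic SDK response shape; dispatch() is a hypothetical router:

```python
# Sketch of handling a parallel tool call turn: several tool_use blocks
# arrive in one assistant message, and each gets its own tool_result in
# the next user message. dispatch() is a hypothetical router.
def dispatch(name: str, args: dict) -> str:
    """Hypothetical: route to the emulator, knowledge base, etc."""
    return f"{name} ok"

if response.stop_reason == "tool_use":
    calls = [b for b in response.content if b.type == "tool_use"]

    results = [
        {
            "type": "tool_result",
            "tool_use_id": call.id,  # ties each result back to its call
            "content": dispatch(call.name, call.input),
        }
        for call in calls
    ]

    # One round trip carries all the results back: one generation, one
    # time-to-first-token hit, instead of one per tool call.
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": results})
```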
00:16:19.600 | Awesome. Thank you. Yeah.
00:16:20.640 | So I'm particularly interested in this idea, right? So with parallel tool calls,
00:16:28.320 | there are some cases where it's obvious that all the tool calls can actually happen in parallel.
00:16:33.360 | But then there's like more of a planning sense where you showed like press A, press A, press A.
00:16:38.960 | And so of course, I'm like thrown back to being six in my mom's minivan and remembering when I
00:16:44.080 | restarted a really long conversation because I was spamming A. Yep.
00:16:47.600 | And so I'm just like, I'm nerdily curious if that, if it's ever done that where it's like
00:16:53.680 | impatiently restarted a conversation. All the time. All the time.
00:16:56.240 | But I think that also scratches at a deeper thing of, is there ever such a thing as
00:17:00.880 | too much planning, and do you see it being too opinionated about following the plan and not
00:17:07.040 | updating with new information like the conversation has ended?
00:17:10.640 | I think this is the realm for good prompting, honestly.
00:17:13.200 | So the reason that it actually hits many buttons is you'll see its thought process say,
00:17:19.680 | I'm going to hit a whole bunch of buttons and I'll stop whenever it's done.
00:17:23.600 | But it doesn't like quite have the sense of time like we do.
00:17:26.240 | So if it says I want to hit A 500 times, it's like, oh, don't worry, I'll know
00:17:30.320 | when I have finished the dialogue and then I'll stop.
00:17:33.360 | But it doesn't quite understand that it doesn't get to see in between each one by default.
00:17:38.480 | Because, I don't know, that's a very LLM problem, that you have to register 500
00:17:42.960 | buttons and then close your eyes and then come back and find out what happened.
00:17:45.600 | But you can actually get around that just like with prompting and helping the model understand what is
00:17:51.280 | happening, what its limitations are, and how it should act.
00:17:56.080 | So in the system prompt for Claude Plays Pokemon, I just have to tell it:
00:17:59.760 | when you register a sequence of buttons, you're not going to see what happens in between.
00:18:04.800 | And so there could be side effects.
00:18:06.880 | Restarting the dialogue is a simple version, but you can actually do like much worse things in Pokemon.
00:18:12.800 | Like I've seen it overwrite one of its moves accidentally when it was learning a new move
00:18:16.560 | in a way that was quite bad for making progress in the game.
00:18:19.680 | And so this is, I think, the space where, as someone building agents, you have a lot of room to
00:18:28.400 | see how models make mistakes like that, help them understand why and what's going on, and
00:18:35.200 | build that into how you prompt your agents. And that's a lot of how I think about
00:18:39.440 | iterating on agent design.
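An illustrative version of the kind of limitation note described above; this wording is assumed, not the actual Claude Plays Pokemon system prompt:

```python
# Illustrative wording only, not the actual Claude Plays Pokemon prompt:
# spell out the blind spot so the model plans around it.
SYSTEM_PROMPT = """\
When you register a sequence of button presses, all of them execute
before you see the next screenshot. You will not see the intermediate
frames, so long sequences can have side effects: you might restart a
dialogue you meant to finish, pick the wrong menu option, or overwrite
a move while learning a new one. Prefer short sequences whenever the
screen state could change partway through.
"""
```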
00:18:40.480 | So in our production agent, we saw that in 3.7, there was not very good consistency when calling across
00:18:50.080 | about 18 tools, versus if you were to just pass the model a single tool and have the exact
00:18:56.400 | same prompt and have it call that. And you mentioned before that the four models are able to handle like
00:19:03.280 | over 100 tools. Are there any changes or differences you're seeing in how you get consistent performance
00:19:09.360 | amongst so many tools?
00:19:10.240 | I think these models we've pretty clearly seen are much better at precise instruction following.
00:19:17.040 | This can be a double-edged sword. Like if you're imprecise with the instructions you write,
00:19:20.560 | they'll readily follow them, or sometimes get confused by conflicting instructions.
00:19:25.600 | But I think the key is, with very good tool design and very crisp prompting, we've seen that these
00:19:30.880 | models are much more capable at following a pretty long set of different and complex
00:19:36.480 | instructions and being able to use that to execute. So I think the key is there's more room to hill climb
00:19:41.120 | on a prompt maybe is what I would say with these models, which is to say as you are making more and
00:19:45.840 | more precise descriptions of your tools, there's more room to get better and better across a wider range
00:19:51.680 | of tools and sort of like reach that same level of performance you'd expect on a single tool.
00:19:55.600 | I think I am at time. I have successfully given a very different talk than I expected, but I appreciate
00:20:00.720 | you all for being here and it was fun to talk with you all.
00:20:06.720 | Thank you.