Stanford CS25: V2 | Strategic Games
Chapters
20:53 The multi-agent perspective
22:26 The NLP perspective
36:39 Value-based filtering
So we would sometimes spend two months training the bots on thousands of CPUs and terabytes of memory. 00:00:15.320 |
But when it came time to actually play against the humans, they would act almost instantly. 00:00:23.080 |
And the humans, when they were in a tough spot, would not act instantly. They would think-- 00:00:28.400 |
they would sit there and think for five seconds, maybe five minutes if it was a really difficult spot. 00:00:33.800 |
And it was clear that that was allowing them to come up with better strategies. 00:00:38.320 |
And so I wanted to investigate this behavior in our bots, like if we could add this to 00:00:43.240 |
our bots, how much of a difference would it make to have the ability, instead of acting instantly, 00:00:48.400 |
to take some time and compute a better strategy for the spot that the agent was in. 00:00:58.760 |
So on the x-axis here, we have like the number of buckets, the number, you can think of this 00:01:03.120 |
as like the number of parameters in your model. 00:01:04.840 |
And on the y-axis, we have distance from Nash equilibrium. 00:01:07.120 |
So this is basically like how much you would lose to worst case adversaries. 00:01:10.080 |
So the lower this number is, the better your poker bot is. 00:01:13.840 |
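For reference, the standard definition of exploitability in a two-player zero-sum game with value v* is how far a strategy falls below the game value against a best-responding opponent:

\text{exploitability}(\pi_1) \;=\; v^* \;-\; \min_{\pi_2} u_1(\pi_1, \pi_2)

so an exact Nash equilibrium strategy has exploitability zero.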
And you can see as you scale up the number of parameters, your performance improves. 00:01:16.840 |
And as you increase the number of parameters by about 100x, your exploitability goes down 00:01:23.080 |
by about half, and indeed, you're getting a much better poker bot. 00:01:26.880 |
But you can see the blue line here is if you don't have search, and the orange line is if you do have search. 00:01:32.200 |
And you can see just adding search, adding the ability to sit there and think for a bit, 00:01:36.520 |
improved the performance of these models; it reduced the exploitability, the distance from Nash equilibrium. 00:01:44.160 |
And if you were to extend that blue line and see how many parameters you would need in order to match the performance of the model with search, 00:01:52.800 |
the answer is you would need to scale up your model by about 100,000x. 00:01:58.640 |
So this was pretty mind-blowing to me when I saw this. 00:02:02.160 |
I mean, over the course of my PhD, the first three or four years of my PhD, I managed to 00:02:14.480 |
I mean, that's like a pretty impressive result, I think. 00:02:17.840 |
But what this plot was showing me was that just adding search was the equivalent of scaling up your model by about 100,000x. 00:02:25.160 |
And so all of my previous research up until this point would just be a footnote compared to this. 00:02:32.560 |
So when I saw this, it became clear this was the answer to beating top humans in poker. 00:02:36.600 |
And so for the next year, basically nonstop, I worked on scaling search. 00:02:41.500 |
Now there's a question that naturally comes up, which is why wasn't this considered before? 00:02:46.500 |
First of all, I should say search had been considered in poker before, and it's actually 00:02:50.600 |
quite natural to say, well, if you had search in chess and search in Go, why would you not use search in poker? There were a few reasons. 00:03:00.400 |
One is that culturally, the poker research community grew out of game theory and reinforcement learning. 00:03:05.840 |
And so it wasn't really from the same background as the people that were working on chess and Go. 00:03:11.680 |
Another is that when you scale search, when you scale test-time compute, it makes all your experiments much more expensive to run. 00:03:21.560 |
People are always thinking about winning the next annual computer poker competition, and 00:03:24.520 |
the ACPC limited the resources that you could use at test time. 00:03:26.960 |
So search wasn't really possible effectively in the ACPC. 00:03:32.240 |
And I think the biggest factor is that people just didn't think it would make such a huge difference. 00:03:35.760 |
I mean, I think it's reasonable to look at something like search and think, oh yeah, that would probably help a bit. 00:03:40.640 |
You probably wouldn't think it makes a 100,000x difference. 00:03:43.760 |
And so there were some people working on it, but it wasn't really the focus of a lot of research. 00:03:51.180 |
So anyway, I focused on scaling search, and that led to the 2017 Brains vs. AI competition, 00:03:57.040 |
where we again played our bot against four top poker pros, 120,000 hands of poker, $200,000 in prize money. 00:04:04.760 |
And this time, the bot won by 15 big blinds per 100, instead of nine big blinds per 100. 00:04:13.120 |
Each human lost individually to the bot, with four standard deviations of statistical significance. 00:04:21.200 |
We followed this up in 2019 with a six-player poker AI competition. 00:04:26.240 |
The big difference here is that we figured out how to do depth-limited search. 00:04:28.720 |
So before, in the 2017 bot, it would always have to search to the end of the game. 00:04:33.720 |
Here, it only had to do search a few moves ahead, and it could stop there. 00:04:37.640 |
And so this time, again, it won with statistical significance. 00:04:41.120 |
And what's really surprising about this bot is that, despite it being a much larger game, 00:04:45.680 |
the six-player poker bot, Pluribus, cost under $150 to train on cloud computing resources. 00:04:52.720 |
And it runs on 28 CPU cores at inference time, there's no GPUs. 00:04:57.000 |
So I think what this shows is that this really was an algorithmic improvement. 00:05:02.600 |
I mean, this would have been doable 20 years ago if people knew how to do it. 00:05:09.620 |
And I think it also shows the power of search. 00:05:12.880 |
If you can figure out how to scale that compute at test time, it really can make a huge difference 00:05:17.960 |
and bring down your training costs by a huge amount. 00:05:21.800 |
So I wanted to say also, this is not limited to poker. 00:05:26.400 |
If you look at Go, you see a similar pattern. 00:05:28.800 |
So this is a plot from the AlphaGo Zero paper. 00:05:31.680 |
On the x-axis, we have different versions of AlphaGo, and on the y-axis, we have Elo rating, 00:05:34.920 |
which is a way of comparing different bots, but also a way of comparing bots to humans. 00:05:39.700 |
And you can see if-- OK, so superhuman performance is around 3,600 Elo, and you can see AlphaGo 00:05:48.060 |
Lee, the version that played against Lee Sedol in 2016, that's right over the line of superhuman performance. 00:05:52.940 |
AlphaGo Zero, the strongest version of AlphaGo, is around 5,200 Elo. 00:05:58.540 |
But if you take out the test time search, if you just play according to the policy net 00:06:03.500 |
and not do any Monte Carlo tree search in AlphaGo Zero at test time, then the Elo rating 00:06:08.260 |
drops to around 3,000, which is substantially below superhuman performance. 00:06:15.020 |
So what this shows is that if you take out Monte Carlo tree search at test time, AlphaGo Zero is no longer superhuman. 00:06:22.420 |
And in fact, nobody has made a superhuman Go bot that does not use search in some form. 00:06:29.420 |
Nobody has made a raw neural network that can beat top humans in Go. 00:06:35.100 |
And I should say also, this is just if you're taking out the search at test time. 00:06:38.180 |
I'm not even talking about taking it out of training time. 00:06:40.280 |
If you took it out of training time, it wouldn't even get off the ground. 00:06:44.900 |
Now there's a question of, OK, well, surely you could just scale up the models, scale 00:06:48.420 |
up the amount of training, and you would eventually surpass superhuman performance and match the performance of the version with search. 00:06:56.640 |
And that's true. If you scale up the models and if you scale up the training, then you would eventually get there. 00:07:02.860 |
But there's a question of how much would you have to scale it up by? 00:07:06.020 |
Now a rough rule of thumb is that in order to increase your Elo rating by about 120 points, 00:07:10.380 |
you either have to double the amount of model size and training, or you have to double the amount of search at test time. 00:07:16.500 |
And so if you look at that gap of around 2,000 Elo points, and you calculate the number of 00:07:20.620 |
doublings that you would need, the answer is that in order to get the raw policy net 00:07:24.100 |
from 3,000 Elo to 5,200 Elo, you would need to scale your model and your training by about 100,000x. 00:07:37.220 |
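Spelling out that back-of-the-envelope arithmetic with the numbers just quoted:

\frac{\approx 2000 \text{ Elo}}{\approx 120 \text{ Elo per doubling}} \approx 17 \text{ doublings}, \qquad 2^{17} \approx 1.3 \times 10^{5},

which is where the roughly 100,000x figure comes from.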
I think you look at what's happening today with large language models and transformers, 00:07:42.980 |
I mean, you're getting huge-- there's a question of, what do I mean by search? 00:07:49.340 |
There are specific kinds of search, like Monte Carlo tree search, the ability to just plan ahead 00:07:53.100 |
what you're going to do instead of just acting instantly based on your pre-computed policy. 00:07:58.240 |
But really what I mean by search more broadly is the ability to scale the amount of computation at test time to get better performance. 00:08:06.140 |
I think that's the real value that search is adding. 00:08:09.240 |
Instead of just acting according to your pre-computed policy-- front-loading all of your computation, so 00:08:15.500 |
you're doing everything, all your computation, ahead of time, and then at inference time, 00:08:19.860 |
acting basically instantly-- could you get a better solution if you had five minutes 00:08:25.020 |
to output an action instead of 100 milliseconds? 00:08:31.580 |
So yeah, I think you look at-- no, sorry, there's a question. 00:08:37.900 |
Does a transformer with a search circuit count as search, or do you mean hand-engineered search? 00:08:43.540 |
I don't want to get bogged down into the details of how to do this, because the answer is nobody 00:08:50.360 |
Nobody really has a general way of doing search. 00:08:52.240 |
And in all the domains where we've done search successfully, like poker and Go, it's done in a fairly domain-specific way. 00:08:59.500 |
Go used this algorithm called Monte Carlo tree search. 00:09:04.100 |
And yeah, you could think of beam search as one simple form of search, but it does seem 00:09:08.080 |
like there should be better ways in the future. 00:09:14.500 |
So anyway, where I'm going with this is you look at how large language models are being 00:09:18.560 |
trained today, and you're seeing millions of dollars being thrown at pre-training. 00:09:25.620 |
I wouldn't be surprised if we see a large language model that would cost $100 million to train. 00:09:35.220 |
But the inference cost is still going to be very small. 00:09:40.140 |
And so there's a question of, could you do substantially better if you could scale the amount of compute at inference time? 00:09:48.340 |
Maybe that could amortize some of your training cost. 00:09:54.280 |
So there's this essay called "The Bitter Lesson" by Richard Sutton that says the biggest 00:09:59.760 |
lesson that can be learned-- and so it's a really great essay. 00:10:03.060 |
But one of the big takeaways is, he says, the biggest lesson that can be learned from 00:10:05.640 |
over 70 years of AI research is that general methods that leverage computation are ultimately the most effective. 00:11:000 |
The two methods that seem to scale arbitrarily in this way are search and learning. 00:10:15.820 |
Now, I think we've done a great job with generalizing search-- sorry, generalizing learning. 00:10:21.380 |
And I think there's still room for improvement when it comes to search. 00:10:24.620 |
And yeah, the next goal really is about generality. 00:10:29.960 |
Can we develop a truly general way of scaling inference compute, instead of just doing things 00:10:33.840 |
like Monte Carlo tree search that are specific to a particular domain, 00:10:38.680 |
and also better than things like chain of thought? 00:10:43.900 |
What this would look like is that you have much higher test-time compute, but you get much better performance. 00:10:50.720 |
And I think for certain domains, that trade-off is worth it. 00:10:52.740 |
Like, if you think about what inference costs we're willing to pay for a proof of the Riemann 00:10:56.940 |
hypothesis, I think we'd be willing to pay a lot. 00:11:00.480 |
Or the cost of-- what cost are we willing to pay for new life-saving drugs? 00:11:07.580 |
So I think that there is an opportunity here. 00:11:13.380 |
I guess any questions about that before I move on to Cicero? 00:11:16.140 |
By the way, the reason why I'm talking about this is because it's going to inform the approach 00:11:30.060 |
that we took to Cicero, which I think is quite different from the approach that a lot of 00:11:36.060 |
other researchers might have taken to this problem. 00:11:38.140 |
Someone asked, can you give an example of search? 00:11:42.100 |
Well, Monte Carlo tree search is one form of search. 00:11:44.460 |
You could also think of breadth-first search, depth-first search, these kinds of things. 00:11:50.420 |
I would also argue that chain of thought is doing something similar to search, where it's 00:11:55.900 |
allowing the model to leverage extra compute at test time to get better performance. 00:12:02.060 |
But I think that that's the main thing that you want, the ability to leverage extra compute at test time. 00:12:08.180 |
What's the search-- what's the space that you are searching over? 00:12:10.940 |
Again, in a game like Go, it's different board positions. 00:12:14.820 |
But you could also imagine searching over different sentences that you could say, things like that. 00:12:33.220 |
So first thing I should say when it comes to Cicero, this is a big team effort. 00:12:42.860 |
This was like-- this is actually one of the great things about working on this project, 00:12:45.660 |
that there was just such a diverse talent pool, experts in reinforcement learning, planning, 00:12:49.860 |
game theory, natural language processing, all working together on this. 00:12:54.220 |
And it would not have been possible without everybody. 00:12:58.460 |
So the motivation for diplomacy actually came from 2019. 00:13:01.260 |
We were looking at all the breakthroughs that were happening at the time. 00:13:04.140 |
And I think a good example of this is this XKCD comic that came out in 2012 that shows 00:13:10.140 |
like different categories of games, games that are solved, games where computers can 00:13:12.900 |
beat top humans, games where computers still lose to top humans, and games where computers may never outplay humans. 00:13:18.100 |
And in this category, computers still lose to top humans, you had four games: Go, Arimaa, poker, and StarCraft. 00:13:24.180 |
In 2015, actually one of my colleagues, David Wu, made the first AI to beat top humans in Arimaa. 00:13:32.700 |
In 2016, we have AlphaGo beating Lee Sedol in Go. 00:13:36.660 |
In 2017, you have the work that I just described where we beat top humans in Poker. 00:13:41.340 |
And in 2019, we had AlphaStar beating expert humans in StarCraft. 00:13:48.580 |
So that shows the incredible amount of progress that had happened in strategic reasoning over those few years. 00:13:57.100 |
And at the same time, we also had GPT-2 come out in 2019. 00:14:01.500 |
And it showed that language modeling and natural language processing were progressing much faster 00:14:06.020 |
than I think a lot of people, including us, expected. 00:14:10.580 |
And so after the six-player poker work, we were thinking about what to do next, and I was discussing this with my colleagues. 00:14:17.580 |
And we were throwing around different domains to work on. 00:14:22.860 |
And given the incredible amount of progress in AI, we wanted to pick something really 00:14:27.500 |
ambitious, something that we thought you couldn't just tackle by scaling up existing approaches, 00:14:32.700 |
that you really needed something new in order to address. 00:14:36.260 |
And we landed on diplomacy because we thought that it would be the hardest game to make an AI for. 00:14:44.860 |
Diplomacy is a natural language strategy game. 00:14:50.700 |
You play as one of the seven great powers of Europe-- England, France, Germany, Austria, Italy, Russia, and Turkey. 00:14:56.500 |
And your goal is to control a majority of the map. 00:15:01.880 |
If you control a majority of the map, then you've won. 00:15:04.900 |
In practice, nobody ends up winning outright. 00:15:08.680 |
And so your score is proportional to the percentage of the map that you control. 00:15:14.700 |
Now what's really interesting about diplomacy is that it is a natural language negotiation game. 00:15:21.200 |
So you have these conversations, like what you're seeing here between Germany and England, 00:15:24.580 |
where they will privately communicate with each other before making their moves. 00:15:27.540 |
And so you can have Germany ask, like, want to support Sweden? 00:15:30.860 |
England says, let me think on that, and so on. 00:15:35.900 |
So this is a popular strategy game developed in the 1950s. 00:15:40.740 |
It was JFK and Kissinger's favorite game, actually. 00:15:44.500 |
But like I said, each turn involves sophisticated private natural language negotiations. 00:15:49.220 |
And I want to make clear, this is not negotiations like you would see in a game like Settlers of Catan. 00:15:58.980 |
It's much more like Survivor, if you've ever seen the TV show Survivor. 00:16:04.020 |
You have discussions around alliances that you'd like to build, discussions around specific 00:16:09.180 |
tactics that you'd like to execute on the current turn, and also more long-term strategy 00:16:15.580 |
around where do we go from here, and how do we divide resources. 00:16:19.780 |
Now the way the game works, you have these negotiations that last between 5 and 15 minutes, 00:16:24.700 |
depending on the version of the game, on each turn. 00:16:28.900 |
And all these negotiations are done privately between the players. 00:16:36.980 |
And then after the negotiation period completes, everybody will simultaneously write down their moves. 00:16:42.100 |
And so a player could promise you something like, I'm going to support you into this territory 00:16:47.220 |
But then when people actually write down their moves, they might not write that down. 00:16:50.480 |
And so you only find out if they were true to their word when all the moves are revealed 00:16:58.780 |
And so for this reason, alliances and trust building are key. 00:17:02.400 |
The ability to trust that somebody is going to follow through on their promises, that's essential. 00:17:07.540 |
And the ability to convince people that you are going to follow through on your promises is just as essential. 00:17:14.940 |
And so for this reason, diplomacy has long been considered a challenge problem for AI. 00:17:19.060 |
There's research in the game going back to the '80s. 00:17:21.140 |
The research really only picked up-- it picked up quite intensely starting in 2019 when researchers 00:17:27.980 |
from DeepMind, ourselves, Mila, other places started working on this. 00:17:33.660 |
Now a lot of that research, the vast majority of that research actually was focused on the 00:17:37.940 |
non-language version of the game, which was seen as a stepping stone to the full natural language version. 00:17:41.900 |
Though we decided to focus from the start on the full natural language version of the game. 00:17:47.500 |
So to give you a sense of what these negotiations and dialogue look like, here is one example. 00:17:55.460 |
So here, England, you can see they move their fleet in Norway to St. Petersburg. 00:18:06.140 |
And so this is what the board state looks like after that move. 00:18:08.980 |
And now there's this conversation between Austria and Russia. 00:18:14.900 |
I'm afraid the end may be close for me, my friend. 00:18:20.600 |
England seems to still want to work together. 00:18:22.580 |
Austria says, can you make a deal with Germany? 00:18:24.280 |
So the players are now discussing what should be discussed with other players. 00:18:30.140 |
Then Austria says, you'll be fine as long as you can defend Sevastopol. 00:18:33.420 |
So Sevastopol is this territory down to the south. 00:18:35.220 |
You can see that Turkey has a fleet and an army in the Black Sea in Armenia next to Sevastopol. 00:18:41.140 |
And so they could potentially attack that territory next turn. 00:18:44.660 |
Austria says, can you support/hold Sevastopol with Ukraine and Romania? 00:18:53.980 |
Hopefully, we can start getting you back on your feet. 00:18:57.300 |
So this is an example of the kinds of conversations that you'll see in a game of diplomacy. 00:19:01.180 |
In this conversation, Austria is actually our bot, Cicero. 00:19:05.040 |
So that kind of gives you a sense of the sophistication of the agent's dialogue. 00:19:10.620 |
OK, I'll skip this for-- OK, so I guess I'll go into this. 00:19:20.580 |
Really what makes diplomacy interesting is that support is key. 00:19:23.200 |
So here, for example, Budapest and Warsaw, the red and the purple units, both try to move into Galicia. 00:19:29.660 |
And so since it's a one versus one, they both bounce back. 00:19:34.340 |
In the middle panel, you can see Vienna supports Budapest into Galicia, so now it's two versus one and Budapest gets in. 00:19:43.300 |
And what's really interesting about diplomacy is that it doesn't just have to be your own units supporting you. 00:19:50.300 |
So for example, the green player could support the red player into Galicia. 00:19:53.740 |
And then that red unit would still go in there. 00:19:57.300 |
So support is really what the game is all about, negotiating over support. 00:20:01.380 |
And so for that reason, diplomacy has this reputation as the game that ruins friendships. 00:20:05.580 |
It's really difficult to have an alliance with somebody for three or four hours and 00:20:08.960 |
then have them backstab you and basically just ruin your game. 00:20:14.940 |
But if you talk to expert diplomacy players, they view it differently. 00:20:18.420 |
They say diplomacy is ultimately about building trust in an environment that encourages you not to trust anyone. 00:20:24.860 |
And that's why we decided to work on the game. 00:20:26.580 |
Could we make an AI that is able to build trust with the players in an environment that encourages them not to trust anyone? 00:20:33.140 |
Can the bot honestly communicate that it's going to do something and evaluate whether 00:20:39.700 |
another person is being honest when they are saying that they're going to do something? 00:20:48.220 |
It sits at this nice intersection of reinforcement learning and planning and also natural language. 00:20:53.340 |
There's two perspectives that we can take on why diplomacy is a really interesting domain. 00:20:59.380 |
So here, all the previous game AI results, like chess, Go, poker, these have all been 00:21:06.100 |
in purely zero-sum, two-player zero-sum domains. 00:21:10.020 |
And in these domains, self-play is guaranteed to converge to an optimal solution. 00:21:15.380 |
Basically what this means is you can start having the bot play completely from scratch 00:21:18.340 |
with no human data, and by playing against itself repeatedly, it will eventually converge 00:21:23.180 |
to this unbeatable optimal solution called the minimax equilibrium. 00:21:27.500 |
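For reference, this is the minimax theorem for two-player zero-sum games:

\max_{\pi_1} \min_{\pi_2} \mathbb{E}\,[u_1(\pi_1, \pi_2)] \;=\; \min_{\pi_2} \max_{\pi_1} \mathbb{E}\,[u_1(\pi_1, \pi_2)] \;=\; v^*

A policy attaining that max is a minimax (Nash) equilibrium strategy: it guarantees at least the game value no matter what the opponent does, which is what makes it unbeatable in expectation.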
But that result only holds in two-player zero-sum games. 00:21:30.300 |
That whole paradigm only holds in two-player zero-sum games. 00:21:34.420 |
When you go to domains that involve cooperation, in addition to competition, then success requires 00:21:39.480 |
understanding human behavior and conventions. 00:21:41.660 |
You can't just treat the other players like machines anymore. 00:21:46.420 |
You have to model human irrationality, human suboptimality. 00:21:55.540 |
You can imagine if you were to train a bot completely from scratch in the game of diplomacy, 00:21:59.660 |
like the full natural language version of the game, there's no reason why the bot would learn to communicate in English. 00:22:04.700 |
It would learn to communicate in some weird, gibberish robot language. 00:22:07.820 |
And then when you stick it in a game with six humans, it's not going to be able to cooperate with them. 00:22:15.600 |
So we have to find a way to incorporate human data and be able to learn how humans behave 00:22:27.660 |
There's also the NLP perspective, which is that current language models are essentially just doing chitchat. 00:22:34.560 |
Now, there's been some progress with things like RLHF, but that's still not really the way humans communicate. Humans communicate with an intention in mind. 00:22:44.320 |
They come up with this intention, and then they communicate with the goal of communicating that intention. 00:22:48.240 |
And they understand that others are trying to do the same. 00:22:51.540 |
And so there's a question of, can we move beyond chitchat to grounded, intentional dialogue? 00:23:00.920 |
So Cicero is an AI agent for diplomacy that integrates high-level strategic play and open-domain dialogue. 00:23:07.520 |
And we used 50,000 human games of diplomacy acquired through a partnership with the website webDiplomacy.net. 00:23:14.480 |
So we entered Cicero in an online diplomacy league. 00:23:17.080 |
Just to give you the results up front, Cicero was not detected as an AI agent across 40 games. 00:23:24.040 |
There was one player that mentioned after the fact that they kind of made a joke about 00:23:28.880 |
us being a bot, but they didn't really follow up on it, and nobody else followed up on it. 00:23:33.400 |
And they later accused somebody else of also being a bot. 00:23:35.840 |
So we weren't sure how seriously to take that accusation. 00:23:38.960 |
But I think it's safe to say it made it through all 40 games without being detected as a bot. 00:23:42.760 |
And then we-- in fact, we told the players afterwards that it was a bot the whole time. 00:23:47.040 |
These are the kinds of responses that we got. 00:23:49.780 |
People were quite surprised-- pleasantly surprised, fortunately. 00:23:53.760 |
Nobody was upset with us, but they were quite surprised that there was a bot that had been playing with them this whole time. 00:24:02.960 |
So in terms of results, Cicero placed in the top 10% of players. 00:24:08.600 |
And so if you look at players that played five or more games, it placed second among them. 00:24:14.120 |
And it achieved more than double the average human score. 00:24:16.640 |
So I would describe this as a strong level of human performance. 00:24:19.400 |
I wouldn't go as far as to say that this is superhuman by any means. 00:24:27.920 |
Now, to give you a picture of how Cicero works. 00:24:35.480 |
So the input that we feed into the model is the board state and the recent action history 00:24:41.960 |
that's shown on the top left here, and also the dialogue that it's had with all the players 00:24:49.440 |
So that's going to get fed into a dialogue-conditional-action model that's going to predict what Cicero 00:24:56.160 |
thinks all the players are going to do this turn and what they think we will do this turn. 00:25:06.440 |
These lead to what we call anchor policies that are then used for planning. 00:25:11.880 |
Now, planning here, again, this is like the part where we leverage extra compute at test time. 00:25:23.360 |
So essentially, we take these initial predictions of what everybody's going to do, what are 00:25:26.160 |
called anchor policies, and we improve upon these predictions using this planning process 00:25:32.600 |
called piKL, where basically, we account for the fact that players will pick actions 00:25:38.400 |
that have higher expected value with higher probability. 00:25:40.920 |
We're essentially adding this rationality prior to all the players to assume that they're 00:25:45.240 |
not going to blunder as often as the model might suggest, and they're going to pick smarter 00:25:49.280 |
actions with higher probability than the initial model might suggest. 00:25:52.800 |
And what we find is that this actually gives us a better prediction of what all the players 00:25:56.200 |
will do than just relying on the raw neural net itself. 00:26:03.560 |
This gives us the action that we actually play in the game, and it also gives us what we call intents. 00:26:09.560 |
So intents are an action for ourselves and an action for the dialogue partner that we're speaking with. 00:26:17.840 |
So we have this dialogue model that conditions on these intents. 00:26:21.120 |
So the intents are fed into the dialogue model, along with the board state, the action history, and the dialogue so far. 00:26:29.240 |
And that dialogue model will then generate candidate messages that are conditioned on those intents. 00:26:38.240 |
These candidate messages go through a series of filters that filter out nonsense, grounding 00:26:42.040 |
issues, and also low expected value messages. 00:26:46.640 |
And ultimately, we get out a message to send to our dialogue partner. 00:26:52.240 |
Now every time we send or receive a message, we will repeat this whole process. 00:26:59.880 |
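To make that control flow concrete, here is a minimal sketch of one pass through the loop just described. All of the class and method names are hypothetical placeholders standing in for the components in the diagram, not Cicero's actual code.

def cicero_step(board_state, action_history, dialogue, models):
    # 1. Dialogue-conditional action model: predict every player's policy for
    #    this turn, conditioned on the board, recent actions, and messages.
    anchor_policies = models.action_model.predict(board_state, action_history, dialogue)

    # 2. Planning (piKL): refine the anchor policies, assuming players pick
    #    higher-expected-value actions more often than the raw model suggests.
    improved_policies = models.planner.refine(anchor_policies, board_state)

    # 3. Choose our own action, plus the "intent": an action for ourselves and
    #    an action for the dialogue partner we are talking to.
    my_action = improved_policies.best_action("self")
    intent = {"self": my_action,
              "partner": improved_policies.best_action("partner")}

    # 4. Dialogue model: generate candidate messages conditioned on the board
    #    state, action history, dialogue so far, and the intent.
    candidates = models.dialogue_model.generate(board_state, action_history,
                                                dialogue, intent)

    # 5. Filters: drop nonsense, grounding errors, and low expected value messages.
    for message_filter in models.filters:
        candidates = message_filter.keep(candidates)

    message = candidates[0] if candidates else None
    # This whole process repeats every time a message is sent or received.
    return my_action, message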
So there's actually a lot that is quite novel in Cicero. 00:27:04.360 |
And I'm going to try to talk about the contributions as much as possible. 00:27:08.400 |
I might go through this a little quickly, just so we have time for questions. 00:27:12.640 |
The first one is a controllable dialogue model that conditions on the game state and a set 00:27:16.600 |
of intended actions for the speaker and the recipient. 00:27:19.360 |
So we have a question, what is the action space here for the model? 00:27:28.960 |
The action space for the action prediction model is like all the actions that you could 00:27:33.800 |
take in the game, that a player could take in the game. 00:27:37.160 |
For the dialogue model, it's like, you know, messages that you can send. 00:27:44.280 |
Okay, so we train what we call an intent model that predicts what actions people will take 00:27:55.080 |
Basically, we're trying to predict what people are intending to do when they send a given message. 00:28:03.800 |
And then we use this to automatically annotate the data set with basically what we expect 00:28:10.120 |
people's intentions were when they sent that message. 00:28:12.880 |
And we filter out, as much as possible, lies from the data set, so that the text in the 00:28:21.720 |
data set is annotated with the truthful intention. 00:28:29.160 |
And then during play, Cicero conditions the dialogue model on the truthful intention that it actually plans to carry out. 00:28:34.560 |
And the hope then is that it will generate a message consistent with that intention. 00:28:41.480 |
And that is then fed into-- sorry, the intentions 00:28:47.920 |
that we generate through planning are fed into the dialogue model. 00:28:54.880 |
So to give you an example of what this looks like, this gives us a way to control the dialogue model. 00:29:02.000 |
Like here, Cicero is England, in pink, and its action is to move to Belgium, among other things. 00:29:13.080 |
And so if we feed this intention into the dialogue model, then the message that might 00:29:16.600 |
get generated is something like England saying to France, do you mind supporting me into Belgium? 00:29:25.760 |
On the other hand, let's say Cicero's action is to support France into Belgium. 00:29:33.960 |
Then if you feed that into the dialogue model, then the message that's generated might say 00:29:38.040 |
something like, let me know if you want me to support you to Belgium, otherwise I'll 00:29:46.600 |
Now what we find is that conditioning the dialogue model on these intentions in this 00:29:49.640 |
way, it makes the model more controllable, but it also leads to higher quality dialogue 00:29:55.880 |
So we found that it led to dialogue that was more consistent with the state, more consistent 00:30:00.640 |
with the plan, higher quality, lower perplexity. 00:30:03.760 |
And I think the reasoning for why this is the case is that we're relieving the dialogue 00:30:08.560 |
model of the burden of having to come up with a good strategy. 00:30:14.040 |
We're allowing the dialogue model to focus on what it does best, which is generating dialogue. 00:30:20.480 |
And we're relieving it of the strategic components of the game, because we're feeding that strategy in from the planning process. 00:30:28.840 |
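As mentioned later in the Q&A, the intent is literally converted into a natural language string and fed to the dialogue model along with everything else. Purely as an illustration of what that serialization could look like (the order notation and formatting here are made up, not Cicero's actual format):

def format_dialogue_input(board_summary, dialogue_history, intent):
    # Serialize the intended orders for ourselves and our dialogue partner
    # into text that conditions the dialogue model.
    intent_text = ("My orders: " + "; ".join(intent["self"]) +
                   ". Your orders: " + "; ".join(intent["partner"]) + ".")
    return "\n".join([board_summary, dialogue_history, intent_text])

example_input = format_dialogue_input(
    "England: fleet North Sea, army Yorkshire ...",          # hypothetical board summary
    "France: let me know what you want to do this turn.",    # hypothetical message history
    {"self": ["army Yorkshire -> Belgium"],
     "partner": ["fleet English Channel supports army Yorkshire -> Belgium"]},
)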
Okay, so that's one main contribution, this controllable dialogue model that conditions on intents. 00:30:36.680 |
The second is a planning engine that accounts for dialogue and human behavior. 00:30:43.080 |
So I mentioned that a lot of previous work on games was done using self-play in two-player zero-sum games. 00:30:54.700 |
Now the problem with pure self-play is that it can learn strong policies, but it doesn't 00:31:02.040 |
stick with human conventions, and it can't account for dialogue. 00:31:05.280 |
It's just going to ignore the human data and the human way of playing if you just do self-play. 00:31:14.240 |
The other extreme that you can go to is to just do supervised learning on human data, create 00:31:20.280 |
this model of how humans play, and then train with those imitation humans. 00:31:27.120 |
And if you do this, you'll end up with a bot that's consistent with dialogue and human 00:31:32.040 |
conventions, but it's only as strong as the training data. 00:31:36.020 |
And we found that it was actually very easily manipulable through adversarial dialogue. 00:31:41.100 |
So for example, you can send messages to it saying, "Thanks for agreeing to support me 00:31:44.520 |
into Paris," and it will think, "Well, I've only ever seen that message in my training 00:31:49.840 |
data when I've agreed to support the person into Paris, and so I guess I'm supporting 00:31:53.640 |
them into Paris this turn," even though that might be a terrible move for the bot. 00:32:00.240 |
So we came up with this algorithm called piKL that is a happy medium between these two extremes. 00:32:08.380 |
The way piKL works is it's doing self-play, but regularized toward sticking to the human imitation policy. 00:32:20.360 |
So it has a KL penalty for deviating from the human imitation policy. 00:32:27.560 |
So we have this parameter lambda that controls how easy it is to deviate from the human imitation policy. 00:32:36.920 |
At lambda equals zero, it just ignores the human imitation policy completely and just maximizes expected value. 00:32:44.100 |
And so we'll just do self-play as if from scratch at lambda equals zero. 00:32:49.220 |
At lambda equals infinity, it's just playing the human imitation policy and not doing self-play 00:32:57.620 |
But for intermediate values of lambda, what we find is that it actually gives you a good 00:33:00.420 |
medium between sticking to human conventions and performing strongly. 00:33:08.420 |
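As a minimal numerical sketch of the kind of objective being described here (a simplified one-shot version of the idea, not the actual piKL algorithm, which runs inside an iterative regret-minimization loop): if you maximize expected value minus lambda times the KL divergence to the human imitation policy, the solution has a simple closed form, and you can see the two extremes of lambda directly.

import numpy as np

def kl_regularized_policy(q_values, human_policy, lam):
    # Maximize  E_pi[Q] - lam * KL(pi || human_policy).
    # Closed form:  pi(a)  is proportional to  human_policy(a) * exp(Q(a) / lam).
    # lam -> 0:   pick the highest-value action (ignore the human policy).
    # lam -> inf: play the human imitation policy unchanged.
    q = np.asarray(q_values, dtype=float)
    tau = np.asarray(human_policy, dtype=float)
    logits = np.log(tau + 1e-12) + q / max(lam, 1e-9)
    logits -= logits.max()  # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()

# Toy example: action 2 has the highest value, but humans mostly play action 0.
q = [1.0, 0.2, 3.0]
tau = [0.7, 0.2, 0.1]
for lam in (0.1, 1.0, 10.0):
    print(lam, kl_regularized_policy(q, tau, lam).round(3))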
So you can kind of see this behavior emerge here. 00:33:15.360 |
Is this similar to offline RL, or does it also incorporate exploration? 00:33:18.260 |
So I would say there's actually a lot of similar work on having a KL penalty. 00:33:25.500 |
And so yes, I would say that it's very similar to a lot of that work. 00:33:30.060 |
And this has also been done actually in AlphaStar, where they had a KL penalty. 00:33:33.540 |
Though that was more about aiding exploration, like using human data to aid exploration rather than modeling humans. 00:33:41.960 |
So I think what's interesting about the piKL work is that, one, we find it imitates humans 00:33:45.740 |
better than just doing supervised learning alone. 00:33:49.300 |
And two, we are doing a bit of theory of mind, where we're using this not just to model 00:33:55.660 |
the other players, but also as a model of what we expect other people to think 00:34:01.420 |
our behavior is. 00:34:05.020 |
So it's like a common knowledge algorithm that we're using here. 00:34:17.940 |
So the kind of behavior that you see from this, you can see here, let's say England 00:34:22.260 |
agrees-- sorry, so let's say we're in this situation. 00:34:25.180 |
This actually came up in a real game, and it inspired a figure from our paper. 00:34:36.700 |
And France asks if England is willing to disengage. 00:34:42.620 |
And let's say England says, yes, I will move out of English Channel if you head back to NAO. 00:34:48.540 |
Well, we can see that Cicero does, in fact, back off, goes to NAO, the North Atlantic Ocean, and the disengagement happens. 00:34:57.220 |
And so this shows that the bot's strategy really is reflecting the dialogue that it's had with the other players. 00:35:05.660 |
Another message that England might send is something like, I'm sorry, you've been fighting me this whole game. 00:35:12.380 |
And so in this case, Cicero will continue its attack on England. 00:35:17.860 |
It's changing its behavior depending on the dialogue. 00:35:20.060 |
But you can also have this kind of message where England says, yes, I'll leave English 00:35:26.920 |
Channel if you move Kiel to Munich, get Holland to Belgium. 00:35:29.260 |
So these are really bad moves for Cicero to follow. 00:35:33.060 |
And so if you just look at the raw policy net, it might actually do this. 00:35:40.620 |
It might actually do these moves because England suggested it. 00:35:44.220 |
But because we're using piKL, which accounts for the expected value of different actions, 00:35:49.620 |
It will actually partially back off, but ignore the suggested moves because it recognizes 00:35:53.380 |
that those will leave it very vulnerable to an attack. 00:36:08.780 |
Another thing I should say is that we're not just doing planning. 00:36:11.780 |
We're actually doing this in a full self-play reinforcement learning loop. 00:36:16.660 |
And again, the goal here is that it's really about modeling humans better than supervised learning alone. 00:36:22.940 |
And we found that doing this self-play reinforcement learning with piKL allowed us to better 00:36:26.340 |
model human behavior than just doing imitation learning. 00:36:30.900 |
Finally, we have an ensemble of message filtering techniques that filters both nonsensical and low expected value messages. 00:36:40.060 |
So to give you an example of what these filters look like, one that we developed is value-based filtering. 00:36:45.620 |
So the motivation for this is that what we feed into our dialogue model is a plan for ourselves and for our dialogue partner. 00:36:55.380 |
But it's the entire plan that we have for ourselves. 00:36:58.460 |
And so we might end up feeding into the dialogue model the fact that we're going to attack that very dialogue partner. 00:37:04.660 |
Now the dialogue model is, to be honest, kind of dumb. 00:37:08.220 |
And it doesn't really know that it shouldn't be telling this player that they're going to be attacked. 00:37:16.140 |
And so you have these messages that might be sent, something like the second one shown 00:37:19.740 |
here, where England says to France, we have hostile intentions towards you. 00:37:26.540 |
So this is actually a message that the bot sent to a player. 00:37:30.420 |
This was preliminary testing and kind of motivated this whole approach. 00:37:36.780 |
So we don't want the bot to send these kinds of messages if it's going to attack a player. 00:37:39.460 |
We want it to send something that's not an outright lie necessarily-- 00:37:43.660 |
we want it to either not send a message at all or send something that's much more bland. 00:37:49.620 |
And so we filter out these kinds of messages by looking at the value. 00:37:54.140 |
Like what we do is we generate a bunch of candidate messages. 00:37:58.040 |
And then we see, if we were to send this message, what is the behavior that we would expect from the other players? 00:38:06.420 |
Like what actions will we expect them to do after we send this message? 00:38:10.220 |
And what do they expect we will do after we send this message? 00:38:14.000 |
And then we see what is the expected value of the action that we intend to take given 00:38:20.560 |
the prediction of what everybody else is going to do. 00:38:23.460 |
So if our intention is to attack France, then we can see, well, if I were to send this message 00:38:28.940 |
to France, then they're going to get really defensive and defend against an attack from 00:38:32.740 |
us and our attack is going to be unsuccessful. 00:38:35.340 |
And so therefore, I probably shouldn't send this message to them. 00:38:41.620 |
And so in this way, we can actually filter out messages that have low expected value. 00:38:44.780 |
And we found that this worked surprisingly well. 00:38:52.380 |
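Here is a minimal sketch of that filtering logic. The policy_model and value_model interfaces are hypothetical placeholders for the components described above, not Cicero's actual API.

def value_filter(candidate_messages, my_intended_action, state,
                 policy_model, value_model, tolerance):
    # Expected value of our intended action if we send no message at all.
    baseline_prediction = policy_model.predict_others(state)
    baseline_ev = value_model.expected_value(state, my_intended_action,
                                             baseline_prediction)
    kept = []
    for msg in candidate_messages:
        # Predict how the other players would react if this message were sent.
        prediction = policy_model.predict_others(state, extra_message=msg)
        ev = value_model.expected_value(state, my_intended_action, prediction)
        # Drop messages that make our own plan much worse, e.g. tipping off a
        # player that we intend to attack them.
        if ev >= baseline_ev - tolerance:
            kept.append(msg)
    return kept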
I'll go through just one example, for the sake of time. 00:38:57.680 |
So here Cicero is France, and France is conversing with Turkey, who's a human player, 00:39:07.520 |
and they're debating over who's going to get Tunis, this territory circled in red. 00:39:11.440 |
You can see they both have fleets next to the territory. 00:39:14.840 |
If they both go for it, neither of them are going to get it, and so they need to work 00:39:19.280 |
So France says, I'll work with you, but I need Tunis for now. 00:39:21.720 |
Turkey says, nope, you've got to let me have it. 00:39:26.640 |
And then France suggests, you can take these other territories instead. 00:39:35.760 |
And then Cicero suggests specific moves that would allow Turkey to capture these territories. 00:39:41.940 |
So Cicero says, Greece to Ionia, Ionia to Tyrrhenian. 00:39:48.840 |
And then France says, then in the fall, you take Rome and Austria collapses. 00:39:51.700 |
And so that allows Turkey to make progress against Austria. 00:39:55.860 |
But conveniently, it also allows France to capture Tunis, because Turkey will be using its fleets elsewhere. 00:40:08.580 |
One limitation is that the intent representation is just an action per player. 00:40:11.300 |
So there's a question of like, the intentions that we're feeding into the dialogue model 00:40:15.380 |
is an action that we're going to take for this turn and for the next turn, for ourselves and for our dialogue partner. 00:40:20.660 |
But ideally, we would have a richer set of intentions; we would be able to condition on 00:40:25.220 |
things like long-term strategy, or style of communication, or asking questions. 00:40:32.980 |
That's one of the limitations of this approach. 00:40:34.740 |
Now, of course, the richer you make the space of intentions, the more room there is for things to go wrong. 00:40:40.900 |
And you also have to then train the model to be able to handle this wider space of intentions. 00:40:45.700 |
There was a question: do you think the dialogue model is learning an internal world model of the game? 00:40:55.500 |
No, this is arguably why we're conditioning on intentions. 00:41:02.180 |
We're relieving the dialogue model of having to come up with a good world model, because 00:41:06.980 |
we're telling it like, these are the moves that we are planning to take this turn. 00:41:10.260 |
And these are the moves that we would like this other player to take this turn. 00:41:13.140 |
So we're able to have the world model separate from the dialogue model, but condition the dialogue model on it. 00:41:26.940 |
Another limitation is that Cicero's value model doesn't condition on dialogue. 00:41:30.940 |
And so it has a limited understanding of the long term effects of dialogue. 00:41:39.020 |
This greatly limits our ability to plan what kind of messages we should be sending. 00:41:47.400 |
And this is actually why we always condition Cicero's dialogue generation on its truthful intentions. 00:41:54.860 |
You could argue that there's situations in diplomacy where you would want to lie to the other players. 00:42:00.780 |
The best players rarely lie, but they do lie sometimes. 00:42:05.760 |
And you have to understand the trade-off: if you lie, it's 00:42:14.020 |
going to be much harder to work with this person in the future. 00:42:17.500 |
And so you have to make sure that the value that you're getting positionally is worth 00:42:22.020 |
that loss of trust and a broken relationship. 00:42:25.780 |
Now, because Cicero's value model doesn't condition on dialogue, it can't really understand the long-term consequences of lying. 00:42:34.020 |
And so for this reason, we actually always condition it on its truthful intentions. 00:42:40.700 |
Now, it is possible to have Cicero's value model condition on dialogue, but you would 00:42:47.620 |
need way more data, and it would make things much more expensive. 00:42:50.820 |
And so we weren't able to do that for this bot. 00:42:58.020 |
And finally, there's a big question that I mentioned earlier, which is, is there a more 00:43:01.780 |
general way of scaling inference time compute to achieve better performance? 00:43:06.020 |
The way that we've done planning in Cicero is, I would argue, a bit domain specific. 00:43:10.500 |
I think the idea of piKL is quite general, but I think that there are potentially more general approaches to planning. 00:43:22.340 |
Somebody's asking, looking forward to the next two to three years, what criteria will 00:43:26.140 |
you use to select the next game to try to conquer? 00:43:28.860 |
Honestly, like I said, we chose diplomacy because we thought it'd be the hardest game 00:43:32.740 |
to make an AI for, and I think that that's true. 00:43:36.060 |
I don't think that we're gonna be working on games anymore, because I can't think of 00:43:38.460 |
any other game where, if we were to succeed at it, it would be truly impressive. 00:43:46.060 |
And so I think where the research is going in the future is generality. 00:43:53.260 |
Like instead of getting an AI to play this specific game, can we get an AI that is able 00:43:59.180 |
to play diplomacy, but could also play Go or poker, or could also write essays and stories 00:44:09.720 |
I think what we will see is games serving as benchmarks for progress, but not as the end goal in themselves. 00:44:19.740 |
It'll be part of the test set, but not part of the training set. 00:44:21.880 |
And I think that's the way it should be going forward. 00:44:25.300 |
Finally, I want to add that diplomacy is an amazing testbed for multi-agent AI and grounded dialogue. 00:44:33.940 |
So if you are interested in these kinds of domains, I highly recommend taking advantage 00:44:38.960 |
of the fact that we've open sourced all of our code and models, and the dialogue and 00:44:43.940 |
action data is available through what's called an RFP, where you can apply to get access to the data. 00:44:55.660 |
To wrap up, Cicero combines strategic reasoning and natural language in diplomacy. 00:45:01.900 |
And the paper is in Science, and the code and models are publicly available at this URL. 00:45:07.660 |
And for the remaining time, I'll take questions. 00:45:11.900 |
So we've also opened up some questions from the class, and you can also take the Zoom questions. 00:45:17.020 |
So if anyone has some questions, I think, Noam, you can answer those. 00:45:20.660 |
Yeah, there's one question, are you concerned about AIs outcompeting humans at real world 00:45:24.780 |
diplomatic strategic negotiation and deception tasks? 00:45:28.820 |
So like I said, we're not very focused on deception, even though arguably deception is a part of the game. 00:45:36.420 |
I think for diplomatic and strategic negotiation, I don't feel like, look, the way that we've 00:45:43.700 |
developed Cicero, it's designed to play diplomacy, the game of diplomacy specifically, and you can't just drop it into a real-world negotiation. 00:45:52.100 |
That said, I do think that the techniques are quite general. 00:45:56.060 |
And so hopefully others can build on that and be able to do different things. 00:45:59.880 |
And I think it is entirely possible that over the next several years, you will see this 00:46:04.700 |
entering into real world negotiations much more often. 00:46:08.620 |
I actually think that diplomacy is a big step towards real world applicability compared to previous games. 00:46:19.380 |
Because now your action space is really like the space of natural language, and you have to model human behavior. 00:46:24.020 |
Do you think in the future, we could appoint an AI to the UN council? 00:46:30.300 |
Oh, hopefully, only if it does better than humans, but that would be very interesting 00:46:36.580 |
I'm also curious, like, what's like the future things that you're working on in this direction? 00:46:41.260 |
Like, do you think you can do something like AlphaGo Zero, where you just like, take this 00:46:44.880 |
like pre-built model, and then maybe just make it like self-play? 00:46:48.240 |
Or like, what sort of future directions are you thinking for improving this sort of box? 00:46:52.760 |
I think the future directions are really focused around generality. 00:46:55.680 |
Like, I think one of the big insights of Cicero is like, this ability to leverage planning 00:47:00.840 |
to get better performance with language models and in this strategic domain. 00:47:06.520 |
I think there's a lot of opportunity to do that sort of thing in a broader space of domains. 00:47:10.680 |
I mean, you look at language models today, and they do token by token prediction. 00:47:17.020 |
And I think there's a big opportunity to go beyond that. 00:47:20.800 |
I'm also curious, like, I didn't understand the exact details, how you're using planning 00:47:23.960 |
or Monte Carlo tree search with, like, the models that you have. 00:47:30.200 |
We didn't use Monte Carlo tree search in Cicero. 00:47:33.800 |
Monte Carlo tree search is a very good heuristic, but it's a heuristic that is particularly 00:47:40.240 |
useful for deterministic perfect information games. 00:47:45.200 |
And I think in order to have a truly general form of planning, we need to go more general than that. 00:47:51.380 |
We use this algorithm called piKL; it's based on a regret minimization algorithm. 00:47:56.520 |
I don't really want to go into the details of it because it's not that important for 00:48:00.480 |
But the idea is like, it is this iterative algorithm that will gradually refine the prediction 00:48:05.360 |
of what everybody's going to do and get better and better predictions the more iterations you run. 00:48:14.760 |
So yeah, my question is like, when we were talking about generalizability, how does the 00:48:30.620 |
communication between different modules of the model look like, particularly when we're 00:48:38.800 |
Like how do you send information from the policy network to the dialogue model? 00:48:41.600 |
And in the future, if you have a model that's good at different tasks, are we going to have 00:48:46.160 |
like a really big policy net that learns all of them, or like separate language modules for each? 00:48:54.120 |
So we actually convert the policy, the action for ourselves and for our dialogue partner 00:48:58.240 |
into a string, a natural language string, and just feed that into the dialogue model along with the board state and the dialogue. 00:49:13.480 |
And then what was the second part of your question? 00:49:18.560 |
Something like, are we just going to have like one giant policy net trained on everything? 00:49:22.240 |
Yeah, it was like, so if you're only using a text interface, doesn't it limit the model? 00:49:28.120 |
And if you're using it for different games, like, are you thinking like, when you say 00:49:32.720 |
in the future, you will work on generalizability, are you thinking about a big policy network 00:49:37.560 |
that is trained on separate games or is able to, like, understand different games at the same time? 00:49:42.960 |
Or do we have like separate policy networks for different games? 00:49:46.680 |
And yeah, like, doesn't this like text interface limit the model in terms of communication? 00:49:52.040 |
Like if you're using vectors, it might like, yeah, it might be a bit of a bottleneck. 00:49:58.520 |
I mean, I think ideally, you go in this direction where you have, like, you know, a foundation model. 00:50:07.200 |
Does text limit it? I mean, certainly, just doing text in, text out, that limits what you can 00:50:10.480 |
do in terms of communication, but hopefully, we get beyond that. 00:50:26.280 |
Okay, so there's a question in the chat, I'd love to hear your speculation on the future. 00:50:30.480 |
For instance, we've seen some startups that are fine-tuning LLMs to be biased toward, or experts in, specific domains. 00:51:03.160 |
I'm not too focused myself on, you know, fine tuning language models to specific tasks. 00:51:10.000 |
I think the direction that I'm much more interested in going forward is, you know, the more general direction. 00:51:16.620 |
So I don't think I can really comment on, you know, how you tune these language models for specific tasks. 00:51:27.900 |
So what sort of planning methods are you, like, interested in looking at-- like MCTS, or something else? 00:51:33.900 |
So let me, so I got to step out for just one second. 00:51:48.460 |
I was thinking, I was just asking, like, what sort of planning algorithms do you think are promising? 00:51:52.100 |
So you think like, we have like, so many options, like we have like planning kind of stuff, 00:51:56.260 |
or RL, there's like MCTS, there's like the work you did with Cicero. 00:51:59.740 |
So what do you think are the most interesting algorithms that you think will scale well in the future? 00:52:04.780 |
Well, I think that's the big question that a lot of people are trying to figure out today. 00:52:08.900 |
And it's not really clear what the answer is. 00:52:11.660 |
I mean, I think, you know, you look at some of the chain of thought stuff, and I think there's something there. 00:52:17.860 |
And I think that it should be possible to do a lot better. 00:52:20.540 |
But it is really impressive to see just how general of an approach it is. 00:52:25.540 |
And so I think it would be nice to see things that are general in that 00:52:33.780 |
way, but hopefully able to achieve better performance. 00:52:38.700 |
Also, would you say Cicero is like an encoder-decoder model, in the sense that it encodes 00:52:46.900 |
the world, and then you have the dialogue model, which is trying to decode it? 00:52:53.140 |
I don't think that that's necessarily the right choice. 00:53:09.220 |
Well, yeah, and if there are any questions, feel free to email me or reach out; I'm happy to chat.