Stanford CS25: V2 | Strategic Games
Chapters
20:53 The multi-agent perspective
22:26 The NLP perspective
36:39 Value-based filtering
So we would sometimes spend two months training the bots on thousands of CPUs and terabytes of memory. 00:00:15.320 |
But when it came time to actually play against the humans, they would act almost instantly. 00:00:23.080 |
And the humans, when they were in a tough spot, would not act instantly. They would think-- 00:00:28.400 |
they would sit there and think for five seconds, maybe five minutes if it was a really difficult spot. 00:00:33.800 |
And it was clear that that was allowing them to come up with better strategies. 00:00:38.320 |
And so I wanted to investigate this behavior in our bots, like if we could add this to 00:00:43.240 |
our bots, how much of a difference would it make to have the ability, instead of acting instantly, 00:00:48.400 |
to take some time and compute a better strategy for the spot that the agent was in. 00:00:58.760 |
So on the x-axis here, we have like the number of buckets, the number, you can think of this 00:01:03.120 |
as like the number of parameters in your model. 00:01:04.840 |
And on the y-axis, we have distance from Nash equilibrium. 00:01:07.120 |
So this is basically like how much you would lose to worst case adversaries. 00:01:10.080 |
So the lower this number is, the better your poker bot is. 00:01:13.840 |
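For reference, the standard definition of exploitability in a two-player zero-sum game with value v* is how far a strategy falls below the game value against a best-responding opponent:

\text{exploitability}(\pi_1) \;=\; v^* \;-\; \min_{\pi_2} u_1(\pi_1, \pi_2)

so an exact Nash equilibrium strategy has exploitability zero.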
And you can see as you scale up the number of parameters, your performance improves. 00:01:16.840 |
And as you increase the number of parameters by about 100x, your exploitability goes down 00:01:23.080 |
by about half, and indeed, you're getting a much better poker bot. 00:01:26.880 |
But you can see the blue line here is if you don't have search, and the orange line is if you do have search. 00:01:32.200 |
And you can see just adding search, adding the ability to sit there and think for a bit, 00:01:36.520 |
improved the performance of these models; it reduced the exploitability, the distance from Nash equilibrium. 00:01:44.160 |
And if you were to extend that blue line and see how many parameters you would need in order to match the performance of the model with search, 00:01:52.800 |
the answer is you would need to scale up your model by about 100,000x. 00:01:58.640 |
So this was pretty mind-blowing to me when I saw this. 00:02:02.160 |
I mean, over the course of my PhD, the first three or four years of my PhD, I managed to 00:02:14.480 |
I mean, that's like a pretty impressive result, I think. 00:02:17.840 |
But what this plot was showing me was that just adding search was the equivalent of scaling up your model by about 100,000x. 00:02:25.160 |
And so all of my previous research up until this point would just be a footnote compared to this. 00:02:32.560 |
So when I saw this, it became clear this was the answer to beating top humans in poker. 00:02:36.600 |
And so for the next year, basically nonstop, I worked on scaling search. 00:02:41.500 |
Now there's a question that naturally comes up, which is why wasn't this considered before? 00:02:46.500 |
First of all, I should say search had been considered in poker before, and it's actually 00:02:50.600 |
quite natural to say, well, if you had search in chess and search in Go, why would you not use search in poker? There were a few reasons. 00:03:00.400 |
One is that culturally, the poker research community grew out of game theory and reinforcement learning. 00:03:05.840 |
And so it wasn't really from the same background as the people that were working on chess and Go. 00:03:11.680 |
Another is that when you scale search, when you scale test-time compute, it makes all your experiments much more expensive to run. 00:03:21.560 |
People are always thinking about winning the next annual computer poker competition, and 00:03:24.520 |
the ACPC limited the resources that you could use at test time. 00:03:26.960 |
So search wasn't really possible effectively in the ACPC. 00:03:32.240 |
And I think the biggest factor is that people just didn't think it would make such a huge difference. 00:03:35.760 |
I mean, I think it's reasonable to look at something like search and think, oh yeah, that would probably help a bit. 00:03:40.640 |
You probably wouldn't think it makes a 100,000x difference. 00:03:43.760 |
And so there were some people working on it, but it wasn't really the focus of a lot of research. 00:03:51.180 |
So anyway, I focused on scaling search, and that led to the 2017 Brains vs. AI competition, 00:03:57.040 |
where we again played our bot against four top poker pros, 120,000 hands of poker, $200,000 in prize money. 00:04:04.760 |
And this time, the bot won by 15 big blinds per 100, instead of nine big blinds per 100. 00:04:13.120 |
Each human lost individually to the bot, with four standard deviations of statistical significance. 00:04:21.200 |
We followed this up in 2019 with a six-player poker AI competition. 00:04:26.240 |
The big difference here is that we figured out how to do depth-limited search. 00:04:28.720 |
So before, in the 2017 bot, it would always have to search to the end of the game. 00:04:33.720 |
Here, it only had to do search a few moves ahead, and it could stop there. 00:04:37.640 |
And so this time, again, it won with statistical significance. 00:04:41.120 |
And what's really surprising about this bot is that, despite it being a much larger game, 00:04:45.680 |
the six-player poker bot, Pluribus, cost under $150 to train on cloud computing resources. 00:04:52.720 |
And it runs on 28 CPU cores at inference time, there's no GPUs. 00:04:57.000 |
So I think what this shows is that this really was an algorithmic improvement. 00:05:02.600 |
I mean, this would have been doable 20 years ago if people knew how to do it. 00:05:09.620 |
And I think it also shows the power of search. 00:05:12.880 |
If you can figure out how to scale that compute at test time, it really can make a huge difference 00:05:17.960 |
and bring down your training costs by a huge amount. 00:05:21.800 |
So I wanted to say also, this is not limited to poker. 00:05:26.400 |
If you look at Go, you see a similar pattern. 00:05:28.800 |
So this is a plot from the AlphaGo Zero paper. 00:05:31.680 |
On the x-axis, we have different versions of AlphaGo, and on the y-axis, we have Elo rating, 00:05:34.920 |
which is a way of comparing different bots, but also a way of comparing bots to humans. 00:05:39.700 |
And you can see if-- OK, so superhuman performance is around 3,600 Elo, and you can see AlphaGo 00:05:48.060 |
Lee, the version that played against Lee Sedol in 2016, that's right over the line of superhuman performance. 00:05:52.940 |
AlphaGo Zero, the strongest version of AlphaGo, is around 5,200 Elo. 00:05:58.540 |
But if you take out the test time search, if you just play according to the policy net 00:06:03.500 |
and not do any Monte Carlo tree search in AlphaGo Zero at test time, then the Elo rating 00:06:08.260 |
drops to around 3,000, which is substantially below superhuman performance. 00:06:15.020 |
So what this shows is that if you take out Monte Carlo tree search at test time, AlphaGo Zero is no longer superhuman. 00:06:22.420 |
And in fact, nobody has made a superhuman Go bot that does not use search in some form. 00:06:29.420 |
Nobody has made a raw neural network that can beat top humans in Go. 00:06:35.100 |
And I should say also, this is just if you're taking out the search at test time. 00:06:38.180 |
I'm not even talking about taking it out of training time. 00:06:40.280 |
If you took it out of training time, it wouldn't even get off the ground. 00:06:44.900 |
Now there's a question of, OK, well, surely you could just scale up the models, scale 00:06:48.420 |
up the amount of training, and you would eventually surpass superhuman performance and match the performance of the version with search. 00:06:56.640 |
And that's true. If you scale up the models and if you scale up the training, then you would eventually get there. 00:07:02.860 |
But there's a question of how much would you have to scale it up by? 00:07:06.020 |
Now a rough rule of thumb is that in order to increase your Elo rating by about 120 points, 00:07:10.380 |
you either have to double the amount of model size and training, or you have to double the amount of search at test time. 00:07:16.500 |
And so if you look at that gap of around 2,000 Elo points, and you calculate the number of 00:07:20.620 |
doublings that you would need, the answer is that in order to get the raw policy net 00:07:24.100 |
from 3,000 Elo to 5,200 Elo, you would need to scale your model and your training by about 100,000x. 00:07:37.220 |
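Spelling out that back-of-the-envelope arithmetic with the numbers just quoted:

\frac{\approx 2000 \text{ Elo}}{\approx 120 \text{ Elo per doubling}} \approx 17 \text{ doublings}, \qquad 2^{17} \approx 1.3 \times 10^{5},

which is where the roughly 100,000x figure comes from.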
I think you look at what's happening today with large language models and transformers, 00:07:42.980 |
I mean, you're getting huge-- there's a question of, what do I mean by search? 00:07:49.340 |
There are specific kinds of search, like Monte Carlo tree search, the ability to just plan ahead 00:07:53.100 |
what you're going to do instead of just acting instantly based on your pre-computed policy. 00:07:58.240 |
But really what I mean by search more broadly is the ability to scale the amount of computation at test time to get better performance. 00:08:06.140 |
I think that's the real value that search is adding. 00:08:09.240 |
Instead of just acting according to your pre-computed policy-- front-loading all of your computation, so 00:08:15.500 |
you're doing everything, all your computation, ahead of time, and then at inference time, 00:08:19.860 |
acting basically instantly-- could you get a better solution if you had five minutes 00:08:25.020 |
to output an action instead of 100 milliseconds? 00:08:31.580 |
So yeah, I think you look at-- no, sorry, there's a question. 00:08:37.900 |
Does a transformer with a search circuit count as search, or do you mean hand-engineered search? 00:08:43.540 |
I don't want to get bogged down into the details of how to do this, because the answer is nobody 00:08:50.360 |
Nobody really has a general way of doing search. 00:08:52.240 |
And in all the domains where we've done search successfully, like poker and Go, it's done in a fairly domain-specific way. 00:08:59.500 |
Go used this algorithm called Monte Carlo tree search. 00:09:04.100 |
And yeah, you could think of beam search as one simple form of search, but it does seem 00:09:08.080 |
like there should be better ways in the future. 00:09:14.500 |
So anyway, where I'm going with this is you look at how large language models are being 00:09:18.560 |
trained today, and you're seeing millions of dollars being thrown at pre-training. 00:09:25.620 |
I wouldn't be surprised if we see a large language model that would cost $100 million to train. 00:09:35.220 |
But the inference cost is still going to be very small. 00:09:40.140 |
And so there's a question of, could you do substantially better if you could scale the amount of compute at inference time? 00:09:48.340 |
Maybe that could amortize some of your training cost. 00:09:54.280 |
So there's this essay called "The Bitter Lesson" by Richard Sutton that says the biggest 00:09:59.760 |
lesson that can be learned-- and so it's a really great essay. 00:10:03.060 |
But one of the big takeaways is, he says, the biggest lesson that can be learned from 00:10:05.640 |
over 70 years of AI research is that general methods that leverage computation are ultimately the most effective. 00:11:000 |
The two methods that seem to scale arbitrarily in this way are search and learning. 00:10:15.820 |
Now, I think we've done a great job with generalizing search-- sorry, generalizing learning. 00:10:21.380 |
And I think there's still room for improvement when it comes to search. 00:10:24.620 |
And yeah, the next goal really is about generality. 00:10:29.960 |
Can we develop a truly general way of scaling inference compute, instead of just doing things 00:10:33.840 |
like Monte Carlo tree search that are specific to a particular domain, 00:10:38.680 |
and also better than things like chain of thought? 00:10:43.900 |
What this would look like is that you have much higher test-time compute, but you get much better performance. 00:10:50.720 |
And I think for certain domains, that trade-off is worth it. 00:10:52.740 |
Like, if you think about what inference costs we're willing to pay for a proof of the Riemann 00:10:56.940 |
hypothesis, I think we'd be willing to pay a lot. 00:11:00.480 |
Or the cost of-- what cost are we willing to pay for new life-saving drugs? 00:11:07.580 |
So I think that there is an opportunity here. 00:11:13.380 |
I guess any questions about that before I move on to Cicero? 00:11:16.140 |
By the way, the reason why I'm talking about this is because it's going to inform the approach 00:11:30.060 |
that we took to Cicero, which I think is quite different from the approach that a lot of 00:11:36.060 |
other researchers might have taken to this problem. 00:11:38.140 |
Someone asked, can you give an example of search? 00:11:42.100 |
Well, Monte Carlo tree search is one form of search. 00:11:44.460 |
You could also think of breadth-first search, depth-first search, these kinds of things. 00:11:50.420 |
I would also argue that chain of thought is doing something similar to search, where it's 00:11:55.900 |
allowing the model to leverage extra compute at test time to get better performance. 00:12:02.060 |
But I think that that's the main thing that you want, the ability to leverage extra compute at test time. 00:12:08.180 |
What's the search-- what's the space that you are searching over? 00:12:10.940 |
Again, in a game like Go, it's different board positions. 00:12:14.820 |
But you could also imagine searching over different sentences that you could say, things like that. 00:12:33.220 |
So first thing I should say when it comes to Cicero, this is a big team effort. 00:12:42.860 |
This was like-- this is actually one of the great things about working on this project, 00:12:45.660 |
that there was just such a diverse talent pool, experts in reinforcement learning, planning, 00:12:49.860 |
game theory, natural language processing, all working together on this. 00:12:54.220 |
And it would not have been possible without everybody. 00:12:58.460 |
So the motivation for diplomacy actually came from 2019. 00:13:01.260 |
We were looking at all the breakthroughs that were happening at the time. 00:13:04.140 |
And I think a good example of this is this XKCD comic that came out in 2012 that shows 00:13:10.140 |
like different categories of games, games that are solved, games where computers can 00:13:12.900 |
beat top humans, games where computers still lose to top humans, and games where computers may never outplay humans. 00:13:18.100 |
And in this category, computers still lose to top humans, you had four games: Go, Arimaa, poker, and StarCraft. 00:13:24.180 |
In 2015, actually one of my colleagues, David Wu, made the first AI to beat top humans in Arimaa. 00:13:32.700 |
In 2016, we have AlphaGo beating Lee Sedol in Go. 00:13:36.660 |
In 2017, you have the work that I just described where we beat top humans in Poker. 00:13:41.340 |
And in 2019, we had AlphaStar beating expert humans in StarCraft. 00:13:48.580 |
So that shows the incredible amount of progress that had happened in strategic reasoning over those few years. 00:13:57.100 |
And at the same time, we also had GPT-2 come out in 2019. 00:14:01.500 |
And it showed that language modeling and natural language processing were progressing much faster 00:14:06.020 |
than I think a lot of people, including us, expected. 00:14:10.580 |
And so after the six-player poker work, we were thinking about what to do next, and I was discussing this with my colleagues. 00:14:17.580 |
And we were throwing around different domains to work on. 00:14:22.860 |
And given the incredible amount of progress in AI, we wanted to pick something really 00:14:27.500 |
ambitious, something that we thought you couldn't just tackle by scaling up existing approaches, 00:14:32.700 |
that you really needed something new in order to address. 00:14:36.260 |
And we landed on diplomacy because we thought that it would be the hardest game to make an AI for. 00:14:44.860 |
Diplomacy is a natural language strategy game. 00:14:50.700 |
You play as one of the seven great powers of Europe-- England, France, Germany, Austria, Italy, Russia, and Turkey. 00:14:56.500 |
And your goal is to control a majority of the map. 00:15:01.880 |
If you control a majority of the map, then you've won. 00:15:04.900 |
In practice, nobody ends up winning outright. 00:15:08.680 |
And so your score is proportional to the percentage of the map that you control. 00:15:14.700 |
Now what's really interesting about diplomacy is that it is a natural language negotiation game. 00:15:21.200 |
So you have these conversations, like what you're seeing here between Germany and England, 00:15:24.580 |
where they will privately communicate with each other before making their moves. 00:15:27.540 |
And so you can have Germany ask, like, want to support Sweden? 00:15:30.860 |
England says, let me think on that, and so on. 00:15:35.900 |
So this is a popular strategy game developed in the 1950s. 00:15:40.740 |
It was JFK and Kissinger's favorite game, actually. 00:15:44.500 |
But like I said, each turn involves sophisticated private natural language negotiations. 00:15:49.220 |
And I want to make clear, this is not negotiations like you would see in a game like Settlers of Catan. 00:15:58.980 |
It's much more like Survivor, if you've ever seen the TV show Survivor. 00:16:04.020 |
You have discussions around alliances that you'd like to build, discussions around specific 00:16:09.180 |
tactics that you'd like to execute on the current turn, and also more long-term strategy 00:16:15.580 |
around where do we go from here, and how do we divide resources. 00:16:19.780 |
Now the way the game works, you have these negotiations that last between 5 and 15 minutes, 00:16:24.700 |
depending on the version of the game, on each turn. 00:16:28.900 |
And all these negotiations are done privately between the players. 00:16:36.980 |
And then after the negotiation period completes, everybody will simultaneously write down their moves. 00:16:42.100 |
And so a player could promise you something like, I'm going to support you into this territory 00:16:47.220 |
But then when people actually write down their moves, they might not write that down. 00:16:50.480 |
And so you only find out if they were true to their word when all the moves are revealed 00:16:58.780 |
And so for this reason, alliances and trust building are key. 00:17:02.400 |
The ability to trust that somebody is going to follow through on their promises, that's essential. 00:17:07.540 |
And the ability to convince people that you are going to follow through on your promises is just as essential. 00:17:14.940 |
And so for this reason, diplomacy has long been considered a challenge problem for AI. 00:17:19.060 |
There's research in the game going back to the '80s. 00:17:21.140 |
The research really only picked up-- it picked up quite intensely starting in 2019 when researchers 00:17:27.980 |
from DeepMind, ourselves, Mila, other places started working on this. 00:17:33.660 |
Now a lot of that research, the vast majority of that research actually was focused on the 00:17:37.940 |
non-language version of the game, which was seen as a stepping stone to the full natural language version. 00:17:41.900 |
Though we decided to focus from the start on the full natural language version of the game. 00:17:47.500 |
So to give you a sense of what these negotiations and dialogue look like, here is one example. 00:17:55.460 |
So here, England, you can see they move their fleet in Norway to St. Petersburg. 00:18:06.140 |
And so this is what the board state looks like after that move. 00:18:08.980 |
And now there's this conversation between Austria and Russia. 00:18:14.900 |
I'm afraid the end may be close for me, my friend. 00:18:20.600 |
England seems to still want to work together. 00:18:22.580 |
Austria says, can you make a deal with Germany? 00:18:24.280 |
So the players are now discussing what should be discussed with other players. 00:18:30.140 |
Then Austria says, you'll be fine as long as you can defend Sevastopol. 00:18:33.420 |
So Sevastopol is this territory down to the south. 00:18:35.220 |
You can see that Turkey has a fleet and an army in the Black Sea in Armenia next to Sevastopol. 00:18:41.140 |
And so they could potentially attack that territory next turn. 00:18:44.660 |
Austria says, can you support/hold Sevastopol with Ukraine and Romania? 00:18:53.980 |
Hopefully, we can start getting you back on your feet. 00:18:57.300 |
So this is an example of the kinds of conversations that you'll see in a game of diplomacy. 00:19:01.180 |
In this conversation, Austria is actually our bot, Cicero. 00:19:05.040 |
So that kind of gives you a sense of the sophistication of the agent's dialogue. 00:19:10.620 |
OK, I'll skip this for-- OK, so I guess I'll go into this. 00:19:20.580 |
Really what makes diplomacy interesting is that support is key. 00:19:23.200 |
So here, for example, Budapest and Warsaw, the red and the purple units, both try to move into Galicia. 00:19:29.660 |
And so since it's a one versus one, they both bounce back. 00:19:34.340 |
In the middle panel, you can see Vienna supports Budapest into Galicia, so now it's two versus one and Budapest gets in. 00:19:43.300 |
And what's really interesting about diplomacy is that it doesn't just have to be your own units supporting you. 00:19:50.300 |
So for example, the green player could support the red player into Galicia. 00:19:53.740 |
And then that red unit would still go in there. 00:19:57.300 |
So support is really what the game is all about, negotiating over support. 00:20:01.380 |
And so for that reason, diplomacy has this reputation as the game that ruins friendships. 00:20:05.580 |
It's really difficult to have an alliance with somebody for three or four hours and 00:20:08.960 |
then have them backstab you and basically just ruin your game. 00:20:14.940 |
But if you talk to expert diplomacy players, they view it differently. 00:20:18.420 |
They say diplomacy is ultimately about building trust in an environment that encourages you not to trust anyone. 00:20:24.860 |
And that's why we decided to work on the game. 00:20:26.580 |
Could we make an AI that is able to build trust with the players in an environment that encourages them not to trust anyone? 00:20:33.140 |
Can the bot honestly communicate that it's going to do something and evaluate whether 00:20:39.700 |
another person is being honest when they are saying that they're going to do something? 00:20:48.220 |
It sits at this nice intersection of reinforcement learning and planning and also natural language. 00:20:53.340 |
There's two perspectives that we can take on why diplomacy is a really interesting domain. 00:20:59.380 |
So here, all the previous game AI results, like chess, Go, poker, these have all been 00:21:06.100 |
in purely zero-sum, two-player zero-sum domains. 00:21:10.020 |
And in these domains, self-play is guaranteed to converge to an optimal solution. 00:21:15.380 |
Basically what this means is you can start having the bot play completely from scratch 00:21:18.340 |
with no human data, and by playing against itself repeatedly, it will eventually converge 00:21:23.180 |
to this unbeatable optimal solution called the minimax equilibrium. 00:21:27.500 |
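For reference, this is the minimax theorem for two-player zero-sum games:

\max_{\pi_1} \min_{\pi_2} \mathbb{E}\,[u_1(\pi_1, \pi_2)] \;=\; \min_{\pi_2} \max_{\pi_1} \mathbb{E}\,[u_1(\pi_1, \pi_2)] \;=\; v^*

A policy attaining that max is a minimax (Nash) equilibrium strategy: it guarantees at least the game value no matter what the opponent does, which is what makes it unbeatable in expectation.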
But that result only holds in two-player zero-sum games. 00:21:30.300 |
That whole paradigm only holds in two-player zero-sum games. 00:21:34.420 |
When you go to domains that involve cooperation, in addition to competition, then success requires 00:21:39.480 |
understanding human behavior and conventions. 00:21:41.660 |
You can't just treat the other players like machines anymore. 00:21:46.420 |
You have to model human irrationality, human suboptimality. 00:21:55.540 |
You can imagine if you were to train a bot completely from scratch in the game of diplomacy, 00:21:59.660 |
like the full natural language version of the game, there's no reason why the bot would learn to communicate in English. 00:22:04.700 |
It would learn to communicate in some weird, gibberish robot language. 00:22:07.820 |
And then when you stick it in a game with six humans, it's not going to be able to cooperate with them. 00:22:15.600 |
So we have to find a way to incorporate human data and be able to learn how humans behave 00:22:27.660 |
There's also the NLP perspective, which is that current language models are essentially just doing chitchat. 00:22:34.560 |
Now, there's been some progress with things like RLHF, but that's still not really the way humans communicate. Humans communicate with an intention in mind. 00:22:44.320 |
They come up with this intention, and then they communicate with the goal of communicating that intention. 00:22:48.240 |
And they understand that others are trying to do the same. 00:22:51.540 |
And so there's a question of, can we move beyond chitchat to grounded, intentional dialogue? 00:23:00.920 |
So Cicero is an AI agent for diplomacy that integrates high-level strategic play and open-domain dialogue. 00:23:07.520 |
And we used 50,000 human games of diplomacy acquired through a partnership with the website webDiplomacy.net. 00:23:14.480 |
So we entered Cicero in an online diplomacy league. 00:23:17.080 |
Just to give you the results up front, Cicero was not detected as an AI agent across 40 games. 00:23:24.040 |
There was one player that mentioned after the fact that they kind of made a joke about 00:23:28.880 |
us being a bot, but they didn't really follow up on it, and nobody else followed up on it. 00:23:33.400 |
And they later accused somebody else of also being a bot. 00:23:35.840 |
So we weren't sure how seriously to take that accusation. 00:23:38.960 |
But I think it's safe to say it made it through all 40 games without being detected as a bot. 00:23:42.760 |
And then we-- in fact, we told the players afterwards that it was a bot the whole time. 00:23:47.040 |
These are the kinds of responses that we got. 00:23:49.780 |
People were quite surprised-- pleasantly surprised, fortunately. 00:23:53.760 |
Nobody was upset with us, but they were quite surprised that there was a bot that had been playing with them this whole time. 00:24:02.960 |
So in terms of results, Cicero placed in the top 10% of players. 00:24:08.600 |
And so if you look at players that played five or more games, it placed second among them. 00:24:14.120 |
And it achieved more than double the average human score. 00:24:16.640 |
So I would describe this as a strong level of human performance. 00:24:19.400 |
I wouldn't go as far as to say that this is superhuman by any means. 00:24:27.920 |
Now, to give you a picture of how Cicero works. 00:24:35.480 |
So the input that we feed into the model is the board state and the recent action history 00:24:41.960 |
that's shown on the top left here, and also the dialogue that it's had with all the players 00:24:49.440 |
So that's going to get fed into a dialogue-conditional-action model that's going to predict what Cicero 00:24:56.160 |
thinks all the players are going to do this turn and what they think we will do this turn. 00:25:06.440 |
These lead to what we call anchor policies that are then used for planning. 00:25:11.880 |
Now, planning here, again, this is like the part where we leverage extra compute at test time. 00:25:23.360 |
So essentially, we take these initial predictions of what everybody's going to do, what are 00:25:26.160 |
called anchor policies, and we improve upon these predictions using this planning process 00:25:32.600 |
called piKL, where basically, we account for the fact that players will pick actions 00:25:38.400 |
that have higher expected value with higher probability. 00:25:40.920 |
We're essentially adding this rationality prior to all the players to assume that they're 00:25:45.240 |
not going to blunder as often as the model might suggest, and they're going to pick smarter 00:25:49.280 |
actions with higher probability than the initial model might suggest. 00:25:52.800 |
And what we find is that this actually gives us a better prediction of what all the players 00:25:56.200 |
will do than just relying on the raw neural net itself. 00:26:03.560 |
This gives us the action that we actually play in the game, and it also gives us what we call intents. 00:26:09.560 |
So intents are an action for ourselves and an action for the dialogue partner that we're speaking with. 00:26:17.840 |
So we have this dialogue model that conditions on these intents. 00:26:21.120 |
So the intents are fed into the dialogue model, along with the board state, the action history, and the dialogue so far. 00:26:29.240 |
And that dialogue model will then generate candidate messages that are conditioned on those intents. 00:26:38.240 |
These candidate messages go through a series of filters that filter out nonsense, grounding 00:26:42.040 |
issues, and also low expected value messages. 00:26:46.640 |
And ultimately, we get out a message to send to our dialogue partner. 00:26:52.240 |
Now every time we send or receive a message, we will repeat this whole process. 00:26:59.880 |
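To make that control flow concrete, here is a minimal sketch of one pass through the loop just described. All of the class and method names are hypothetical placeholders standing in for the components in the diagram, not Cicero's actual code.

def cicero_step(board_state, action_history, dialogue, models):
    # 1. Dialogue-conditional action model: predict every player's policy for
    #    this turn, conditioned on the board, recent actions, and messages.
    anchor_policies = models.action_model.predict(board_state, action_history, dialogue)

    # 2. Planning (piKL): refine the anchor policies, assuming players pick
    #    higher-expected-value actions more often than the raw model suggests.
    improved_policies = models.planner.refine(anchor_policies, board_state)

    # 3. Choose our own action, plus the "intent": an action for ourselves and
    #    an action for the dialogue partner we are talking to.
    my_action = improved_policies.best_action("self")
    intent = {"self": my_action,
              "partner": improved_policies.best_action("partner")}

    # 4. Dialogue model: generate candidate messages conditioned on the board
    #    state, action history, dialogue so far, and the intent.
    candidates = models.dialogue_model.generate(board_state, action_history,
                                                dialogue, intent)

    # 5. Filters: drop nonsense, grounding errors, and low expected value messages.
    for message_filter in models.filters:
        candidates = message_filter.keep(candidates)

    message = candidates[0] if candidates else None
    # This whole process repeats every time a message is sent or received.
    return my_action, message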
So there's actually a lot that is quite novel in Cicero. 00:27:04.360 |
And I'm going to try to talk about the contributions as much as possible. 00:27:08.400 |
I might go through this a little quickly, just so we have time for questions. 00:27:12.640 |
The first one is a controllable dialogue model that conditions on the game state and a set 00:27:16.600 |
of intended actions for the speaker and the recipient. 00:27:19.360 |
So we have a question, what is the action space here for the model? 00:27:28.960 |
The action space for the action prediction model is like all the actions that you could 00:27:33.800 |
take in the game, that a player could take in the game. 00:27:37.160 |
For the dialogue model, it's like, you know, messages that you can send. 00:27:44.280 |
Okay, so we train what we call an intent model that predicts what actions people will take 00:27:55.080 |
Basically, we're trying to predict what people are intending to do when they send a given message. 00:28:03.800 |
And then we use this to automatically annotate the data set with basically what we expect 00:28:10.120 |
people's intentions were when they sent that message. 00:28:12.880 |
And we filter out, as much as possible, lies from the data set, so that the text in the 00:28:21.720 |
data set is annotated with the truthful intention. 00:28:29.160 |
And then during play, Cicero conditions the dialogue model on the truthful intention that it actually plans to carry out. 00:28:34.560 |
And the hope then is that it will generate a message consistent with that intention. 00:28:41.480 |
And that is then fed into-- sorry, the intentions 00:28:47.920 |
that we generate through planning are fed into the dialogue model. 00:28:54.880 |
So to give you an example of what this looks like, this gives us a way to control the dialogue model. 00:29:02.000 |
Like here, Cicero is England, in pink, and its action is to move to Belgium, among other things. 00:29:13.080 |
And so if we feed this intention into the dialogue model, then the message that might 00:29:16.600 |
get generated is something like England saying to France, do you mind supporting me into Belgium? 00:29:25.760 |
On the other hand, let's say Cicero's action is to support France into Belgium. 00:29:33.960 |
Then if you feed that into the dialogue model, then the message that's generated might say 00:29:38.040 |
something like, let me know if you want me to support you to Belgium, otherwise I'll 00:29:46.600 |
Now what we find is that conditioning the dialogue model on these intentions in this 00:29:49.640 |
way, it makes the model more controllable, but it also leads to higher quality dialogue 00:29:55.880 |
So we found that it led to dialogue that was more consistent with the state, more consistent 00:30:00.640 |
with the plan, higher quality, lower perplexity. 00:30:03.760 |
And I think the reasoning for why this is the case is that we're relieving the dialogue 00:30:08.560 |
model of the burden of having to come up with a good strategy. 00:30:14.040 |
We're allowing the dialogue model to focus on what it does best, which is generating dialogue. 00:30:20.480 |
And we're relieving it of the strategic components of the game, because we're feeding that strategy in from the planning process. 00:30:28.840 |
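As mentioned later in the Q&A, the intent is literally converted into a natural language string and fed to the dialogue model along with everything else. Purely as an illustration of what that serialization could look like (the order notation and formatting here are made up, not Cicero's actual format):

def format_dialogue_input(board_summary, dialogue_history, intent):
    # Serialize the intended orders for ourselves and our dialogue partner
    # into text that conditions the dialogue model.
    intent_text = ("My orders: " + "; ".join(intent["self"]) +
                   ". Your orders: " + "; ".join(intent["partner"]) + ".")
    return "\n".join([board_summary, dialogue_history, intent_text])

example_input = format_dialogue_input(
    "England: fleet North Sea, army Yorkshire ...",          # hypothetical board summary
    "France: let me know what you want to do this turn.",    # hypothetical message history
    {"self": ["army Yorkshire -> Belgium"],
     "partner": ["fleet English Channel supports army Yorkshire -> Belgium"]},
)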
Okay, so that's one main contribution, this controllable dialogue model that conditions on intents. 00:30:36.680 |
The second is a planning engine that accounts for dialogue and human behavior. 00:30:43.080 |
So I mentioned that a lot of previous work on games was done using self-play in two-player zero-sum games. 00:30:54.700 |
Now the problem with pure self-play is that it can learn strong policies, but it doesn't 00:31:02.040 |
stick with human conventions, and it can't account for dialogue. 00:31:05.280 |
It's just going to ignore the human data and the human way of playing if you just do self-play. 00:31:14.240 |
The other extreme that you can go to is to just do supervised learning on human data, create 00:31:20.280 |
this model of how humans play, and then train with those imitation humans. 00:31:27.120 |
And if you do this, you'll end up with a bot that's consistent with dialogue and human 00:31:32.040 |
conventions, but it's only as strong as the training data. 00:31:36.020 |
And we found that it was actually very easily manipulable through adversarial dialogue. 00:31:41.100 |
So for example, you can send messages to it saying, "Thanks for agreeing to support me 00:31:44.520 |
into Paris," and it will think, "Well, I've only ever seen that message in my training 00:31:49.840 |
data when I've agreed to support the person into Paris, and so I guess I'm supporting 00:31:53.640 |
them into Paris this turn," even though that might be a terrible move for the bot. 00:32:00.240 |
So we came up with this algorithm called piKL that is a happy medium between these two extremes. 00:32:08.380 |
The way piKL works is it's doing self-play, but regularized toward sticking to the human imitation policy. 00:32:20.360 |
So it has a KL penalty for deviating from the human imitation policy. 00:32:27.560 |
So we have this parameter lambda that controls how easy it is to deviate from the human imitation policy. 00:32:36.920 |
At lambda equals zero, it just ignores the human imitation policy completely and just maximizes expected value. 00:32:44.100 |
And so we'll just do self-play as if from scratch at lambda equals zero. 00:32:49.220 |
At lambda equals infinity, it's just playing the human imitation policy and not doing self-play 00:32:57.620 |
But for intermediate values of lambda, what we find is that it actually gives you a good 00:33:00.420 |
medium between sticking to human conventions and performing strongly. 00:33:08.420 |
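As a minimal numerical sketch of the kind of objective being described here (a simplified one-shot version of the idea, not the actual piKL algorithm, which runs inside an iterative regret-minimization loop): if you maximize expected value minus lambda times the KL divergence to the human imitation policy, the solution has a simple closed form, and you can see the two extremes of lambda directly.

import numpy as np

def kl_regularized_policy(q_values, human_policy, lam):
    # Maximize  E_pi[Q] - lam * KL(pi || human_policy).
    # Closed form:  pi(a)  is proportional to  human_policy(a) * exp(Q(a) / lam).
    # lam -> 0:   pick the highest-value action (ignore the human policy).
    # lam -> inf: play the human imitation policy unchanged.
    q = np.asarray(q_values, dtype=float)
    tau = np.asarray(human_policy, dtype=float)
    logits = np.log(tau + 1e-12) + q / max(lam, 1e-9)
    logits -= logits.max()  # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()

# Toy example: action 2 has the highest value, but humans mostly play action 0.
q = [1.0, 0.2, 3.0]
tau = [0.7, 0.2, 0.1]
for lam in (0.1, 1.0, 10.0):
    print(lam, kl_regularized_policy(q, tau, lam).round(3))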
So you can kind of see this behavior emerge here. 00:33:15.360 |
Is this similar to offline RL, or does it also incorporate exploration? 00:33:18.260 |
So I would say there's actually a lot of similar work on having a KL penalty. 00:33:25.500 |
And so yes, I would say that it's very similar to a lot of that work. 00:33:30.060 |
And this has also been done actually in AlphaStar, where they had a KL penalty. 00:33:33.540 |
Though that was more about aiding exploration, like using human data to aid exploration rather than modeling humans. 00:33:41.960 |
So I think what's interesting about the piKL work is that, one, we find it imitates humans 00:33:45.740 |
better than just doing supervised learning alone. 00:33:49.300 |
And two, we are doing a bit of theory of mind, where we're using this not just to model 00:33:55.660 |
the other players, but also as a model of what we expect other people to think 00:34:01.420 |
our behavior is. 00:34:05.020 |
So it's like a common knowledge algorithm that we're using here. 00:34:17.940 |
So the kind of behavior that you see from this, you can see here, let's say England 00:34:22.260 |
agrees-- sorry, so let's say we're in this situation. 00:34:25.180 |
This actually came up in a real game, and it inspired a figure from our paper. 00:34:36.700 |
And France asks if England is willing to disengage. 00:34:42.620 |
And let's say England says, yes, I will move out of English Channel if you head back to NAO. 00:34:48.540 |
Well, we can see that Cicero does, in fact, back off, goes to NAO, the North Atlantic Ocean, and the disengagement happens. 00:34:57.220 |
And so this shows that the bot's strategy really is reflecting the dialogue that it's had with the other players. 00:35:05.660 |
Another message that England might send is something like, I'm sorry, you've been fighting me this whole game. 00:35:12.380 |
And so in this case, Cicero will continue its attack on England. 00:35:17.860 |
It's changing its behavior depending on the dialogue. 00:35:20.060 |
But you can also have this kind of message where England says, yes, I'll leave English 00:35:26.920 |
Channel if you move Kiel to Munich, get Holland to Belgium. 00:35:29.260 |
So these are really bad moves for Cicero to follow. 00:35:33.060 |
And so if you just look at the raw policy net, it might actually do this. 00:35:40.620 |
It might actually do these moves because England suggested it. 00:35:44.220 |
But because we're using piKL, which accounts for the expected value of different actions, 00:35:49.620 |
It will actually partially back off, but ignore the suggested moves because it recognizes 00:35:53.380 |
that those will leave it very vulnerable to an attack. 00:36:08.780 |
Another thing I should say is that we're not just doing planning. 00:36:11.780 |
We're actually doing this in a full self-play reinforcement learning loop. 00:36:16.660 |
And again, the goal here is that it's really about modeling humans better than supervised learning alone. 00:36:22.940 |
And we found that doing this self-play reinforcement learning with piKL allowed us to better 00:36:26.340 |
model human behavior than just doing imitation learning. 00:36:30.900 |
Finally, we have an ensemble of message filtering techniques that filters both nonsensical and low expected value messages. 00:36:40.060 |
So to give you an example of what these filters look like, one that we developed is value-based filtering. 00:36:45.620 |
So the motivation for this is that what we feed into our dialogue model is a plan for ourselves and for our dialogue partner. 00:36:55.380 |
But it's the entire plan that we have for ourselves. 00:36:58.460 |
And so we might end up feeding into the dialogue model the fact that we're going to attack that very dialogue partner. 00:37:04.660 |
Now the dialogue model is, to be honest, kind of dumb. 00:37:08.220 |
And it doesn't really know that it shouldn't be telling this player that they're going to be attacked. 00:37:16.140 |
And so you have these messages that might be sent, something like the second one shown 00:37:19.740 |
here, where England says to France, we have hostile intentions towards you. 00:37:26.540 |
So this is actually a message that the bot sent to a player. 00:37:30.420 |
This was preliminary testing and kind of motivated this whole approach. 00:37:36.780 |
So we don't want the bot to send these kinds of messages if it's going to attack a player. 00:37:39.460 |
We want it to send something that's not an outright lie necessarily-- 00:37:43.660 |
we want it to either not send a message at all or send something that's much more bland. 00:37:49.620 |
And so we filter out these kinds of messages by looking at the value. 00:37:54.140 |
Like what we do is we generate a bunch of candidate messages. 00:37:58.040 |
And then we see, if we were to send this message, what is the behavior that we would expect from the other players? 00:38:06.420 |
Like what actions will we expect them to do after we send this message? 00:38:10.220 |
And what do they expect we will do after we send this message? 00:38:14.000 |
And then we see what is the expected value of the action that we intend to take given 00:38:20.560 |
the prediction of what everybody else is going to do. 00:38:23.460 |
So if our intention is to attack France, then we can see, well, if I were to send this message 00:38:28.940 |
to France, then they're going to get really defensive and defend against an attack from 00:38:32.740 |
us and our attack is going to be unsuccessful. 00:38:35.340 |
And so therefore, I probably shouldn't send this message to them. 00:38:41.620 |
And so in this way, we can actually filter out messages that have low expected value. 00:38:44.780 |
And we found that this worked surprisingly well. 00:38:52.380 |
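Here is a minimal sketch of that filtering logic. The policy_model and value_model interfaces are hypothetical placeholders for the components described above, not Cicero's actual API.

def value_filter(candidate_messages, my_intended_action, state,
                 policy_model, value_model, tolerance):
    # Expected value of our intended action if we send no message at all.
    baseline_prediction = policy_model.predict_others(state)
    baseline_ev = value_model.expected_value(state, my_intended_action,
                                             baseline_prediction)
    kept = []
    for msg in candidate_messages:
        # Predict how the other players would react if this message were sent.
        prediction = policy_model.predict_others(state, extra_message=msg)
        ev = value_model.expected_value(state, my_intended_action, prediction)
        # Drop messages that make our own plan much worse, e.g. tipping off a
        # player that we intend to attack them.
        if ev >= baseline_ev - tolerance:
            kept.append(msg)
    return kept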
I'll go through just one example, for the sake of time. 00:38:57.680 |
So here Cicero is France, and France is conversing with Turkey, who's a human player, 00:39:07.520 |
and they're debating over who's going to get Tunis, this territory circled in red. 00:39:11.440 |
You can see they both have fleets next to the territory. 00:39:14.840 |
If they both go for it, neither of them are going to get it, and so they need to work 00:39:19.280 |
So France says, I'll work with you, but I need Tunis for now. 00:39:21.720 |
Turkey says, nope, you've got to let me have it. 00:39:26.640 |
And then France suggests, you can take these other territories instead. 00:39:35.760 |
And then Cicero suggests specific moves that would allow Turkey to capture these territories. 00:39:41.940 |
So Cicero says, Greece to Ionia, Ionia to Tyrrhenian. 00:39:48.840 |
And then France says, then in the fall, you take Rome and Austria collapses. 00:39:51.700 |
And so that allows Turkey to make progress against Austria. 00:39:55.860 |
But conveniently, it also allows France to capture Tunis, because Turkey will be using its fleets elsewhere. 00:40:08.580 |
One limitation is that the intent representation is just an action per player. 00:40:11.300 |
So there's a question of like, the intentions that we're feeding into the dialogue model 00:40:15.380 |
is an action that we're going to take for this turn and for the next turn, for ourselves and for our dialogue partner. 00:40:20.660 |
But ideally, we would have a richer set of intentions; we would be able to condition on 00:40:25.220 |
things like long-term strategy, or style of communication, or asking questions. 00:40:32.980 |
That's one of the limitations of this approach. 00:40:34.740 |
Now, of course, the richer you make the space of intentions, the more room there is for things to go wrong. 00:40:40.900 |
And you also have to then train the model to be able to handle this wider space of intentions. 00:40:45.700 |
There was a question: do you think the dialogue model is learning an internal world model of the game? 00:40:55.500 |
No, this is arguably why we're conditioning on intentions. 00:41:02.180 |
We're relieving the dialogue model of having to come up with a good world model, because 00:41:06.980 |
we're telling it like, these are the moves that we are planning to take this turn. 00:41:10.260 |
And these are the moves that we would like this other player to take this turn. 00:41:13.140 |
So we're able to have the world model separate from the dialogue model, but condition the dialogue model on it. 00:41:26.940 |
Another limitation is that Cicero's value model doesn't condition on dialogue. 00:41:30.940 |
And so it has a limited understanding of the long term effects of dialogue. 00:41:39.020 |
This greatly limits our ability to plan what kind of messages we should be sending. 00:41:47.400 |
And this is actually why we always condition Cicero's dialogue generation on its truthful intentions. 00:41:54.860 |
You could argue that there's situations in diplomacy where you would want to lie to the other players. 00:42:00.780 |
The best players rarely lie, but they do lie sometimes. 00:42:05.760 |
And you have to understand the trade-off: if you lie, it's 00:42:14.020 |
going to be much harder to work with this person in the future. 00:42:17.500 |
And so you have to make sure that the value that you're getting positionally is worth 00:42:22.020 |
that loss of trust and a broken relationship. 00:42:25.780 |
Now, because Cicero's value model doesn't condition on dialogue, it can't really understand the long-term consequences of lying. 00:42:34.020 |
And so for this reason, we actually always condition it on its truthful intentions. 00:42:40.700 |
Now, it is possible to have Cicero's value model condition on dialogue, but you would 00:42:47.620 |
need way more data, and it would make things much more expensive. 00:42:50.820 |
And so we weren't able to do that for this bot. 00:42:58.020 |
And finally, there's a big question that I mentioned earlier, which is, is there a more 00:43:01.780 |
general way of scaling inference time compute to achieve better performance? 00:43:06.020 |
The way that we've done planning in Cicero is, I would argue, a bit domain specific. 00:43:10.500 |
I think the idea of piKL is quite general, but I think that there are potentially more general approaches to planning. 00:43:22.340 |
Somebody's asking, looking forward to the next two to three years, what criteria will 00:43:26.140 |
you use to select the next game to try to conquer? 00:43:28.860 |
Honestly, like I said, we chose diplomacy because we thought it'd be the hardest game 00:43:32.740 |
to make an AI for, and I think that that's true. 00:43:36.060 |
I don't think that we're gonna be working on games anymore, because I can't think of 00:43:38.460 |
any other game where, if we were to succeed at it, it would be truly impressive. 00:43:46.060 |
And so I think where the research is going in the future is generality. 00:43:53.260 |
Like instead of getting an AI to play this specific game, can we get an AI that is able 00:43:59.180 |
to play diplomacy, but could also play Go or poker, or could also write essays and stories 00:44:09.720 |
I think what we will see is games serving as benchmarks for progress, but not as the end goal in themselves. 00:44:19.740 |
It'll be part of the test set, but not part of the training set. 00:44:21.880 |
And I think that's the way it should be going forward. 00:44:25.300 |
Finally, I want to add that diplomacy is an amazing testbed for multi-agent AI and grounded dialogue. 00:44:33.940 |
So if you are interested in these kinds of domains, I highly recommend taking advantage 00:44:38.960 |
of the fact that we've open sourced all of our code and models, and the dialogue and 00:44:43.940 |
action data is available through what's called an RFP, where you can apply to get access to the data. 00:44:55.660 |
To wrap up, Cicero combines strategic reasoning and natural language in diplomacy. 00:45:01.900 |
And the paper is in Science, and the code and models are publicly available at this URL. 00:45:07.660 |
And for the remaining time, I'll take questions. 00:45:11.900 |
So we've also opened up some questions from the class, and you can also take the Zoom questions. 00:45:17.020 |
So if anyone has some questions, I think, Noam, you can answer those. 00:45:20.660 |
Yeah, there's one question, are you concerned about AIs outcompeting humans at real world 00:45:24.780 |
diplomatic strategic negotiation and deception tasks? 00:45:28.820 |
So like I said, we're not very focused on deception, even though arguably deception is a part of the game. 00:45:36.420 |
I think for diplomatic and strategic negotiation, I don't feel like, look, the way that we've 00:45:43.700 |
developed Cicero, it's designed to play diplomacy, the game of diplomacy specifically, and you can't just drop it into a real-world negotiation. 00:45:52.100 |
That said, I do think that the techniques are quite general. 00:45:56.060 |
And so hopefully others can build on that and be able to do different things. 00:45:59.880 |
And I think it is entirely possible that over the next several years, you will see this 00:46:04.700 |
entering into real world negotiations much more often. 00:46:08.620 |
I actually think that diplomacy is a big step towards real world applicability compared to previous games. 00:46:19.380 |
Because now your action space is really like the space of natural language, and you have to model human behavior. 00:46:24.020 |
Do you think in the future, we could appoint an AI to the UN council? 00:46:30.300 |
Oh, hopefully, only if it does better than humans, but that would be very interesting 00:46:36.580 |
I'm also curious, like, what's like the future things that you're working on in this direction? 00:46:41.260 |
Like, do you think you can do something like AlphaGo Zero, where you just like, take this 00:46:44.880 |
like pre-built model, and then maybe just make it like self-play? 00:46:48.240 |
Or like, what sort of future directions are you thinking for improving this sort of box? 00:46:52.760 |
I think the future directions are really focused around generality. 00:46:55.680 |
Like, I think one of the big insights of Cicero is like, this ability to leverage planning 00:47:00.840 |
to get better performance with language models and in this strategic domain. 00:47:06.520 |
I think there's a lot of opportunity to do that sort of thing in a broader space of domains. 00:47:10.680 |
I mean, you look at language models today, and they do token by token prediction. 00:47:17.020 |
And I think there's a big opportunity to go beyond that. 00:47:20.800 |
I'm also curious, like, I didn't understand the exact details, how you're using planning 00:47:23.960 |
or Monte Carlo tree search with, like, the models that you have. 00:47:30.200 |
We didn't use Monte Carlo tree search in Cicero. 00:47:33.800 |
Monte Carlo tree search is a very good heuristic, but it's a heuristic that is particularly 00:47:40.240 |
useful for deterministic perfect information games. 00:47:45.200 |
And I think in order to have a truly general form of planning, we need to go more general than that. 00:47:51.380 |
We use this algorithm called piKL; it's based on a regret minimization algorithm. 00:47:56.520 |
I don't really want to go into the details of it because it's not that important for 00:48:00.480 |
But the idea is like, it is this iterative algorithm that will gradually refine the prediction 00:48:05.360 |
of what everybody's going to do and get better and better predictions the more iterations you run. 00:48:14.760 |
So yeah, my question is like, when we were talking about generalizability, how does the 00:48:30.620 |
communication between different modules of the model look like, particularly when we're 00:48:38.800 |
Like how do you send information from the policy network to the dialogue model? 00:48:41.600 |
And in the future, if you have a model that's good at different tasks, are we going to have 00:48:46.160 |
like a really big policy net that learns all of them, or like separate language modules for each? 00:48:54.120 |
So we actually convert the policy, the action for ourselves and for our dialogue partner 00:48:58.240 |
into a string, a natural language string, and just feed that into the dialogue model along with the board state and the dialogue. 00:49:13.480 |
And then what was the second part of your question? 00:49:18.560 |
Something like, are we just going to have like one giant policy net trained on everything? 00:49:22.240 |
Yeah, it was like, so if you're only using a text interface, doesn't it limit the model? 00:49:28.120 |
And if you're using it for different games, like, are you thinking like, when you say 00:49:32.720 |
in the future, you will work on generalizability, are you thinking about a big policy network 00:49:37.560 |
that is trained on separate games or is able to, like, understand different games at the same time? 00:49:42.960 |
Or do we have like separate policy networks for different games? 00:49:46.680 |
And yeah, like, doesn't this like text interface limit the model in terms of communication? 00:49:52.040 |
Like if you're using vectors, it might like, yeah, it might be a bit of a bottleneck. 00:49:58.520 |
I mean, I think ideally, you go in this direction where you have, like, you know, a foundation model. 00:50:07.200 |
Does text limit it? I mean, certainly, just doing text in, text out, that limits what you can 00:50:10.480 |
do in terms of communication, but hopefully, we get beyond that. 00:50:26.280 |
Okay, so there's a question in the chat, I'd love to hear your speculation on the future. 00:50:30.480 |
For instance, we've seen some startups that are fine-tuning LLMs to be biased toward, or experts in, specific domains. 00:51:03.160 |
I'm not too focused myself on, you know, fine tuning language models to specific tasks. 00:51:10.000 |
I think the direction that I'm much more interested in going forward is, you know, the more general direction. 00:51:16.620 |
So I don't think I can really comment on, you know, how you tune these language models for specific tasks. 00:51:27.900 |
So what sort of planning methods are you, like, interested in looking at-- like MCTS, or something else? 00:51:33.900 |
So let me, so I got to step out for just one second. 00:51:48.460 |
I was thinking, I was just asking, like, what sort of planning algorithms do you think are promising? 00:51:52.100 |
So you think like, we have like, so many options, like we have like planning kind of stuff, 00:51:56.260 |
or RL, there's like MCTS, there's like the work you did with Cicero. 00:51:59.740 |
So what do you think are the most interesting algorithms that you think will scale well in the future? 00:52:04.780 |
Well, I think that's the big question that a lot of people are trying to figure out today. 00:52:08.900 |
And it's not really clear what the answer is. 00:52:11.660 |
I mean, I think, you know, you look at some of the chain of thought stuff, and I think there's something there. 00:52:17.860 |
And I think that it should be possible to do a lot better. 00:52:20.540 |
But it is really impressive to see just how general of an approach it is. 00:52:25.540 |
And so I think it would be nice to see things that are general in that 00:52:33.780 |
way, but hopefully able to achieve better performance. 00:52:38.700 |
Also, would you say Cicero is like an encoder-decoder model, in the sense that it encodes 00:52:46.900 |
the world, and then you have the dialogue model, which is trying to decode it? 00:52:53.140 |
I don't think that that's necessarily the right choice. 00:53:09.220 |
Well, yeah, and if there are any questions, feel free to email me or reach out; I'm happy to chat.