
Stanford CS25: V2 I Strategic Games


Chapters

0:00
20:53 The multi-agent perspective
22:26 The NLP perspective
36:39 Value-based filtering

Transcript

So we sometimes spent like two months training the bots on thousands of CPUs, with terabytes of memory sometimes. But when it came time to actually play against the humans, they would act almost instantly. It was just a lookup table. And the humans, when they were in a tough spot, would not act instantly. They would think: they would sit there and think for five seconds, maybe five minutes if it was a really difficult decision.

And it was clear that that was allowing them to come up with better strategies. And so I wanted to investigate this behavior in our bots: if we could add this to our bots, how much of a difference would it make to have the ability, instead of acting instantly, to take some time and compute a better strategy for the spot the agent was in?

And this is what I found. So on the x-axis here, we have the number of buckets, which you can think of as roughly the number of parameters in your model. And on the y-axis, we have distance from Nash equilibrium. So this is basically how much you would lose to worst-case adversaries.

So the lower this number is, the better your poker bot is. And you can see as you scale up the number of parameters, your performance improves. And as you increase the number of parameters by about 100x, your exploitability goes down by about half, and indeed, you're getting a much better poker bot.

But you can see the blue line here is if you don't have search, and the orange line is if you do add search. And you can see just adding search, adding the ability to sit there and think for a bit, improved the performance of these models: it reduced the exploitability, the distance from Nash equilibrium, by about 7x.

And if you were to extend that blue line and ask how many parameters you would need in order to be comparable to adding search, the answer is that you would need to scale up your model by about 100,000x. So this was pretty mind-blowing to me when I saw this. I mean, over the course of my PhD, the first three or four years of my PhD, I managed to scale up these models by about 100x.

And I was proud of that. I mean, that's like a pretty impressive result, I think. But what this plot was showing me was that just adding search was the equivalent of scaling things up by about 100,000x. And so all of my previous research up until this point would just be a footnote compared to adding search.

So when I saw this, it became clear this was the answer to beating top humans in poker. And so for the next year, basically nonstop, I worked on scaling search. Now there's a question that naturally comes up, which is why wasn't this considered before? There's a few factors. First of all, I should say search had been considered in poker before, and it's actually quite natural to say, well, if you had search in chess and search in Go, why would you not consider search in poker?

There's a few reasons. One is that, culturally, the poker research grew out of game theory and reinforcement learning, and so it wasn't really from the same background as the people working on chess and Go. Another is that when you scale search, scaling test-time compute, it makes all your experiments much more expensive and just more unpleasant to work with.

And there were just incentive structures. People are always thinking about winning the next annual computer poker competition, and the ACPC limited the resources that you could use at test time. So search wasn't really possible effectively in the ACPC. And I think the biggest factor is that people just didn't think it would make such a huge difference.

I mean, I think it's reasonable to look at something like search and think like, oh yeah, that might make a 10x difference. You probably wouldn't think it makes 100,000x difference. And so there were some people working on it, but it wasn't really the focus of a lot of people's research.

So anyway, I focused on scaling search, and that led to the 2017 Brains vs. AI competition, where we again played our bot against four top poker pros: 120,000 hands of poker, $200,000 in prize money. And this time, the bot won by 15 big blinds per 100, instead of nine big blinds per 100.

This was a crushing victory. Each human lost individually to the bot, and the result was statistically significant at four standard deviations. We followed this up in 2019 with a six-player poker AI competition. The big difference here is that we figured out how to do depth-limited search. Before, in the 2017 bot, it would always have to search to the end of the game.

Here, it only had to do search a few moves ahead, and it could stop there. And so this time, again, it won with statistical significance. And what's really surprising about this bot is that, despite it being a much larger game, the six-player poker bot, Pluribus, cost under $150 to train on cloud computing resources.

And it runs on 28 CPU cores at inference time; there are no GPUs. So I think what this shows is that this really was an algorithmic improvement. I mean, this would have been doable 20 years ago if people had known how to do it. And I think it also shows the power of search.

If you can figure out how to scale that compute at test time, it really can make a huge difference and bring down your training costs by a huge amount. Anyway, yeah. So I wanted to say also, this is not limited to poker. If you look at Go, you see a similar pattern.

So this is a plot from the AlphaGo Zero paper. On the x-axis, we have different versions of AlphaGo, and on the y-axis, we have Elo rating, which is a way of comparing different bots, but also a way of comparing bots to humans. And you can see-- OK, so superhuman performance is around 3,600 Elo, and you can see AlphaGo Lee, the version that played against Lee Sedol in 2016, is right over the line of superhuman performance.

AlphaGo Zero, the strongest version of AlphaGo, is around 5,200 Elo. But if you take out the test-time search, if you just play according to the policy net and don't do any Monte Carlo tree search in AlphaGo Zero at test time, then the Elo rating drops to around 3,000, which is substantially below superhuman performance.

So what this shows is that if you take out Monte Carlo tree search at test time, AlphaGo Zero is not superhuman. And in fact, nobody has made a superhuman Go bot that does not use search in some form. Nobody has made a raw neural network that can beat top humans in Go.

And I should say also, this is just taking out the search at test time. I'm not even talking about taking it out of training; if you took it out of training, it wouldn't even get off the ground. Now there's a question of, OK, well, surely you could just scale up the models, scale up the amount of training, and you would eventually surpass superhuman performance and match the performance you get by adding search.

And that's true, yes. If you scale up the models and the training, then you would eventually match the performance with search. But there's a question of how much you would have to scale it up by. A rough rule of thumb is that in order to increase your Elo rating by about 120 points, you either have to double the model size and training, or you have to double the amount of test-time search.

And so if you look at that gap of around 2,000 Elo points and you calculate the number of doublings that you would need, the answer is that in order to get the raw policy net from 3,000 Elo to 5,200 Elo, you would need to scale your model and your training by about 100,000x.
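
As a rough sketch of that arithmetic, using the rounded numbers quoted above and assuming the ~120-Elo-per-doubling rule of thumb (this is only a back-of-the-envelope check, not a measurement):

    # Back-of-the-envelope check of the ~100,000x claim, using rounded numbers.
    elo_gap = 2000                  # raw policy net (~3,000) vs. AlphaGo Zero (~5,200)
    elo_per_doubling = 120          # rule of thumb: ~120 Elo per doubling
    doublings = elo_gap / elo_per_doubling      # ~16.7 doublings
    scale_factor = 2 ** doublings               # ~1e5, i.e. roughly 100,000x
    print(f"{doublings:.1f} doublings -> ~{scale_factor:,.0f}x model size and training")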

OK, so why is this important? I think you look at what's happening today with large language models and transformers, and you see something similar. I mean, you're getting huge-- there's a question of, what do I mean by search? There are specific kinds of search, like Monte Carlo tree search, the ability to plan ahead what you're going to do instead of just acting instantly based on your pre-computed policy.

But really what I mean by search more broadly is the ability to scale the amount of computation to get better performance. I think that's the real value that search is adding. Instead of front-loading all of your computation, doing everything ahead of time and then acting basically instantly at inference time, could you get a better solution if you had five minutes to output an action instead of 100 milliseconds?

So yeah, I think you look at-- no, sorry, there's a question. Does a transformer with a search circuit count as search, or do you mean hand engineering search algos? I don't want to get bogged down into the details of how to do this, because the answer is nobody really knows yet.

Nobody really has a general way of doing search. And in all the domains where we've done search successfully, like poker and Go, it's done in a fairly domain-specific way. Go used this algorithm called Monte Carlo tree search. And yeah, you could think of beam search as one simple form of search, but it does seem like there should be better ways in the future.

So anyway, where I'm going with this is you look at how large language models are being trained today, and you're seeing millions of dollars being thrown at pre-training. I wouldn't be surprised if we see a large language model that would cost $100 million to train. We might even get to $1 billion.

But the inference cost is still going to be very small. And so there's a question of, could you do substantially better if you could scale the amount of inference compute as well? Maybe that could amortize some of your training cost. So there's this essay called "The Bitter Lesson" by Richard Sutton, and it's a really great essay.

I recommend reading it. But one of the big takeaways is, he says, the biggest lesson that can be learned from over 70 years of AI research is that general methods that leverage computation are ultimately the most effective. The two methods that seem to scale arbitrarily in this way are search and learning.

Now, I think we've done a great job with generalizing learning, and I think there's still room for improvement when it comes to search. And yeah, the next goal really is about generality. Can we develop a truly general way of scaling inference compute, instead of doing things like Monte Carlo tree search that are specific to a particular domain, and also better than things like chain of thought?

What this would look like is that you have much higher test time compute, but you have much more capable models. And I think for certain domains, that trade-off is worth it. Like, if you think about what inference costs we're willing to pay for a proof of the Riemann hypothesis, I think we'd be willing to pay a lot.

Or the cost of-- what cost are we willing to pay for new life-saving drugs? I think we'd be willing to pay a lot. So I think that there is an opportunity here. OK, anyway. So that's my prelude. I guess any questions about that before I move on to Cicero?

By the way, the reason why I'm talking about this is because it's going to inform the approach that we took to Cicero, which I think is quite different from the approach that a lot of other researchers might have taken to this problem. Someone asked, can you give an example of search?

Well, Monte Carlo tree search is one form of search. You could also think of breadth-first search, depth-first search, these kinds of things. They're all search. I would also argue that chain of thought is doing something similar to search, where it's allowing the model to leverage extra compute at test time to get better performance.

But I think that that's the main thing that you want, the ability to leverage extra compute at test time. What's the search-- what's the space that you are searching over? Again, in a game like Go, it's different board positions. But you could also imagine searching over different sentences that you could say, things like that.

There's a lot of flexibility there as well. So now I want to get into Cicero. So first thing I should say when it comes to Cicero, this is a big team effort. This was like-- this is actually one of the great things about working on this project, that there was just such a diverse talent pool, experts in reinforcement learning, planning, game theory, natural language processing, all working together on this.

And it would not have been possible without everybody. So the motivation for diplomacy actually came from 2019. We were looking at all the breakthroughs that were happening at the time. And I think a good example of this is this XKCD comic that came out in 2012 that shows like different categories of games, games that are solved, games where computers can beat top humans, games where computers still lose to top humans, and games where computers may never outplay top humans.

And in this category, computers still lose to top humans, you had four games: Go, Arimaa, poker, and StarCraft. In 2015, actually, one of my colleagues, David Wu, made the first AI to beat top humans in Arimaa. In 2016, we have AlphaGo beating Lee Sedol in Go. In 2017, you have the work that I just described where we beat top humans in poker.

And in 2019, we had AlphaStar beating expert humans in StarCraft. So that shows the incredible amount of progress that had happened in strategic reasoning over the several years leading up to 2019. And at the same time, we also had GPT-2 come out in 2019. And it showed that language modeling and natural language processing were progressing much faster than a lot of people, including us, expected.

And so after the six-player poker work, I was discussing with my colleagues what we should work on next, and we were throwing around different domains. And given the incredible amount of progress in AI, we wanted to pick something really ambitious, something that we thought you couldn't just tackle by scaling up existing approaches, that you really needed something new in order to address.

And we landed on diplomacy because we thought that it would be the hardest game to make an AI for. So what is diplomacy? Diplomacy is a natural language strategy game. It takes place right before World War I. You play as one of the seven great powers of Europe: England, France, Germany, Italy, Austria, Russia, and Turkey.

And your goal is to control a majority of the map; if you do, you've won. In practice, that rarely happens and nobody ends up winning outright, so your score is proportional to the percentage of the map that you control. Now what's really interesting about diplomacy is that it is a natural language negotiation game.

So you have these conversations, like what you're seeing here between Germany and England, where they will privately communicate with each other before making their moves. And so you can have Germany ask, like, want to support Sweden? England says, let me think on that, and so on. So this is a popular strategy game developed in the 1950s.

It was JFK and Kissinger's favorite game, actually. But like I said, each turn involves sophisticated private natural language negotiations. And I want to make clear, this is not negotiations like you would see in a game like Settlers of Catan, for example. It's much more like Survivor, if you've ever seen the TV show Survivor.

You have discussions around alliances that you'd like to build, discussions around specific tactics that you'd like to execute on the current turn, and also more long-term strategy around where do we go from here, and how do we divide resources. Now the way the game works, you have these negotiations that last between 5 and 15 minutes, depending on the version of the game, on each turn.

And all these negotiations are done privately. And then after the negotiation period completes, everybody simultaneously writes down their moves. And so a player could promise you something like, I'm going to support you into this territory this turn.

But then when people actually write down their moves, they might not write that down. And so you only find out if they were true to their word when all the moves are revealed simultaneously. And so for this reason, alliances and trust building is key. The ability to trust that somebody is going to follow through on their promises, that's really what this game is all about.

And the ability to convince people that you are going to follow through on your promises is really what this game is all about. And so for this reason, diplomacy has long been considered a challenge problem for AI. There's research on the game going back to the '80s. The research only really picked up, quite intensely, starting in 2019, when researchers from DeepMind, ourselves, Mila, and other places started working on it.

Now a lot of that research, the vast majority of it, actually focused on the non-language version of the game, which was seen as a stepping stone to the full natural language version. We, though, decided to focus from the start on the full natural language version of the game.

So to give you a sense of what these negotiations and dialogue look like, here is one example. So here, England, you can see they move their fleet in Norway to St. Petersburg. And that occupies the Russian territory. And so this is what the board state looks like after that move.

And now there's this conversation between Austria and Russia. Austria says, well, what happened up north? Russia says, England stabbed. I'm afraid the end may be close for me, my friend. Austria says, yeah, that's rough. Are you going to be OK up there? Russia says, I hope so. England seems to still want to work together.

Austria says, can you make a deal with Germany? So the players are now discussing what should be discussed with other players. Russia says, good idea. Then Austria says, you'll be fine as long as you can defend Sevastopol. So Sevastopol is this territory down in the south. You can see that Turkey has a fleet in the Black Sea and an army in Armenia, next to Sevastopol.

And so they could potentially attack that territory next turn. Austria says, can you support/hold Sevastopol with Ukraine and Romania? I'll support/hold Romania. Russia says, yep, I'm already doing so. Austria says, awesome. Hopefully, we can start getting you back on your feet. So this is an example of the kinds of conversations that you'll see in a game of diplomacy.

In this conversation, Austria is actually our bot, Cicero. So that kind of gives you a sense of the sophistication of the agent's dialogue. OK, I'll skip this for-- OK, so I guess I'll go into this. I don't want to take up too much time. Really what makes diplomacy interesting is that support is key.

So here, for example, Budapest and Warsaw, the red and the purple units, both try to move into Galicia. And since it's a one versus one, they both bounce back, and neither gets the territory. In the middle panel, you can see Vienna supports Budapest into Galicia. And so now it's a two versus one.

And that red unit will indeed enter Galicia. And what's really interesting about diplomacy is that it doesn't just have to be your own units that are supporting you. It could be another player's units as well. So for example, the green player could support the red player into Galicia. And then that red unit would still go in there.
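
To make the support mechanic concrete, here is a deliberately simplified sketch; it ignores cut supports, convoys, and dislodgement, and the resolve function is an illustrative toy, not the real Diplomacy adjudicator.

    # Toy resolution rule: a move's strength is 1 plus its supports, and a
    # territory is taken only by a strictly strongest mover; ties bounce.
    def resolve(contenders):
        """contenders: list of (power, strength) pairs moving into one territory."""
        best = max(strength for _, strength in contenders)
        winners = [power for power, strength in contenders if strength == best]
        return winners[0] if len(winners) == 1 else None   # None = everyone bounces

    print(resolve([("red", 1), ("purple", 1)]))   # None: 1 vs 1, both bounce
    print(resolve([("red", 2), ("purple", 1)]))   # 'red': Vienna's support makes it 2 vs 1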

So support is really what the game is all about, negotiating over support. And so for that reason, diplomacy has this reputation as the game that ruins friendships. It's really difficult to have an alliance with somebody for three or four hours and then have them backstab you and basically just ruin your game.

But if you talk to expert diplomacy players, they view it differently. They say diplomacy is ultimately about building trust in an environment that encourages you to not trust anyone. And that's why we decided to work on the game. Could we make an AI that is able to build trust with the players in an environment that encourages them to not trust anybody?

Can the bot honestly communicate that it's going to do something and evaluate whether another person is being honest when they are saying that they're going to do something? OK, so why diplomacy? It sits at this nice intersection of reinforcement learning and planning and also natural language. There's two perspectives that we can take on why diplomacy is a really interesting domain.

One is the multi-agent perspective. So here, all the previous game AI results, like chess, Go, and poker, have all been in two-player zero-sum domains. And in these domains, self-play is guaranteed to converge to an optimal solution. Basically, what this means is that you can start having the bot play completely from scratch with no human data, and by playing against itself repeatedly, it will eventually converge to this unbeatable optimal solution called the minimax equilibrium.

But that result only holds in two-player zero-sum games. That whole paradigm only holds in two-player zero-sum games. When you go to domains that involve cooperation, in addition to competition, then success requires understanding human behavior and conventions. You can't just treat the other players like machines anymore. You have to treat them like humans.

You have to model human irrationality, human suboptimality. One example of this is actually language. You can imagine if you were to train a bot completely from scratch in the game of diplomacy, like the full natural language version of the game, there's no reason why the bot would learn to communicate in English.

It would learn to communicate in some weird, gibberish robot language. And then when you stick it in a game with six humans, it's not going to be able to cooperate with them. So we have to find a way to incorporate human data and be able to learn how humans behave in order to succeed in this game.

There's also the NLP perspective, which is that current language models are essentially just imitating human-like text. Now, there's been some progress with things like RLHF, but that's still not really the way that humans communicate. They communicate with an intention in mind. They come up with this intention, and then they communicate with the goal of communicating that intention.

And they understand that others are trying to do the same. And so there's a question of, can we move beyond chitchat to grounded, intentional dialogue? So Cicero is an AI agent for diplomacy that integrates high-level strategic play and open domain dialogue. And we used 50,000 human games of diplomacy acquired through a partnership with the website webdiplomacy.net.

So we entered Cicero in an online diplomacy league. Just to give you the results up front, Cicero was not detected as an AI agent over 40 games with 82 unique players. There was one player who mentioned after the fact that they had kind of made a joke about us being a bot, but they didn't really follow up on it, and nobody else followed up on it.

And they later accused somebody else of also being a bot. So we weren't sure how seriously to take that accusation. But I think it's safe to say it made it through all 40 games without being detected as a bot. And then we-- in fact, we told the players afterwards that it was a bot the whole time.

These are the kinds of responses that we got. People were quite surprised-- pleasantly surprised, fortunately. Nobody was upset with us, but they were quite surprised that there was a bot that had been playing this game with them the whole time. So in terms of results, Cicero placed in the top 10% of players.

It's a high-variance game. And so if you look at players that played five or more games, it placed second out of 19. And it achieved more than double the average human score. So I would describe this as a strong level of human performance. I wouldn't go as far as to say that this is superhuman by any means.

But it is currently quite a strong result. Now, to give you a picture of how Cicero works. So the input that we feed into the model is the board state and the recent action history that's shown on the top left here, and also the dialogue that it's had with all the players up until now.

So that's going to get fed into a dialogue-conditional-action model that's going to predict what Cicero thinks all the players are going to do this turn and what they think we will do this turn. These lead to what we call anchor policies that are then used for planning. Now, planning here, again, this is like the part where we leverage extra compute at test time in order to get better performance.

So essentially, we take these initial predictions of what everybody's going to do, what are called anchor policies, and we improve upon these predictions using a planning process called piKL, where basically we account for the fact that players will pick actions that have higher expected value with higher probability. We're essentially adding a rationality prior over all the players: we assume they're not going to blunder as often as the model might suggest, and that they're going to pick smarter actions with higher probability than the initial model might suggest.

And what we find is that this actually gives us a better prediction of what all the players will do than just relying on the raw neural net itself. This gives us the action that we actually play in the game, and it also gives us what we call intents. So intents are an action for ourselves and an action for the dialogue partner that we're speaking to.

So we have this dialogue model that conditions on these intents. So the intents are fed into the dialogue model, along with the board state and action history, and also the dialogue that we've had so far. And that dialogue model will then generate candidate messages that are conditioned on those intents.

These candidate messages go through a series of filters that filter out nonsense, grounding issues, and also low expected value messages. And ultimately, we get out a message to send to our dialogue partner. Now every time we send or receive a message, we will repeat this whole process.
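
Putting those pieces together, here is a schematic sketch of that per-message loop. The component names are expository placeholders passed in as functions; this shows the general shape of the pipeline, not Cicero's actual code.

    # Schematic per-message loop; each component is passed in as a function.
    def cicero_step(state, dialogue,
                    predict_policies,   # dialogue-conditional action model
                    plan,               # piKL planning step
                    generate,           # intent-conditioned dialogue model
                    passes_filters):    # nonsense / grounding / value filters
        # 1. Anchor policies: initial predictions of every power's action this turn.
        anchors = predict_policies(state, dialogue)
        # 2. Planning refines the anchors with a rationality prior and yields our
        #    own action plus the intents (our action and our partner's action).
        my_action, intents = plan(anchors, state)
        # 3. The dialogue model proposes candidate messages conditioned on the intents.
        candidates = generate(state, dialogue, intents)
        # 4. Send the first candidate that survives all filters (or stay silent).
        message = next((m for m in candidates if passes_filters(m, intents, state)), None)
        return my_action, message

So there's actually a lot that is quite novel in Cicero.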

And I'm going to try to talk about the contributions as much as possible. I might go through this a little quickly, just so we have time for questions. The first one is a controllable dialogue model that conditions on the game state and a set of intended actions for the speaker and the recipient.

So we have a question: what is the action space here for the model? The action space for the action prediction model is all the actions a player could take in the game. For the dialogue model, it's the messages that you can send.

Okay, so we train what we call an intent model that predicts what actions people will actually take at the end of the turn. Basically, we're trying to predict what people intend to do when they communicate a certain message. And then we use this to automatically annotate the data set with what we expect people's intentions were when they sent each message.

And we filter out lies from the data set as much as possible, so that the text in the data set is annotated with the truthful intention. Then, during play, Cicero conditions the dialogue model on the intention it truthfully intends to carry out. And the hope is that it will generate a message consistent with that intention.
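
As a rough, hypothetical sketch of that annotation step (the intent_model, looks_like_lie, and game accessors here are placeholders invented for illustration, not the real pipeline):

    # Label each message with the actions the intent model expects the players to
    # actually take at the end of the turn, and drop messages that look like lies,
    # so the dialogue model is trained to generate text for truthful intents.
    def annotate_dialogue(games, intent_model, looks_like_lie):
        examples = []
        for game in games:
            for msg in game.messages:
                intent = intent_model(game.state_at(msg.turn), msg)
                if not looks_like_lie(msg, intent, game.orders_at(msg.turn)):
                    examples.append((msg, intent))
        return examples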

In other words, the intentions that we generate through planning are fed into the dialogue model. So to give you an example of what this looks like, this gives us a way to control the dialogue model through a set of intentions.

Like here, Cicero is England, in pink, and its action is to move to Belgium, among other things. And so if we feed this intention into the dialogue model, then the message that might get generated is something like England saying to France, do you mind supporting Edi to Belgium?

On the other hand, let's say Cicero's action is to support France into Belgium. Then if you feed that into the dialogue model, the message that's generated might say something like, let me know if you want me to support you to Belgium, otherwise I'll probably poke Holland. Now what we find is that conditioning the dialogue model on intentions in this way not only makes the model more controllable, it also leads to higher quality dialogue with less nonsense.

So we found that it led to dialogue that was more consistent with the state, more consistent with the plan, higher quality, lower perplexity. And I think the reasoning for why this is the case is that we're relieving the dialogue model of the burden of having to come up with a good strategy.

We're allowing the dialogue model to do what it does best, to focus on what it does best, which is dialogue. And we're relieving it of the strategic components of the game, because we're feeding that strategy into the dialogue model. Okay, so that's one main contribution, this controllable dialogue model that conditions on a plan.

The second is a planning engine that accounts for dialogue and human behavior. So I mentioned that a lot of previous work on games was done using self-play in two-player zero-sum settings. Now the problem with pure self-play is that it can learn strong policies, but it doesn't stick with human conventions, and it can't account for dialogue.

It's just going to ignore the human data and the human way of playing if you just do self-play. So that's one extreme. The other extreme that you can go is to just do supervised learning on human data, create this model of how humans play, and then train with those imitation humans.

And if you do this, you'll end up with a bot that's consistent with dialogue and human conventions, but it's only as strong as the training data. And we found that it was actually very easily manipulable through adversarial dialogue. So for example, you can send it messages saying, "Thanks for agreeing to support me into Paris," and it will think, "Well, I've only ever seen that message in my training data when I've agreed to support the person into Paris, so I guess I'm supporting them into Paris this turn," even though that might be a terrible move for the bot.

So we came up with this algorithm called piKL that is a happy medium between these two extremes. The way piKL works is that it does self-play, but regularized toward sticking to the human imitation policy: it has a KL penalty for deviating from the human imitation policy. And we have a parameter lambda that controls how easy it is to deviate from the human imitation policy.

At lambda equals zero, it just ignores the human imitation policy completely and just does pure self-play. And so we'll just do self-play as if from scratch at lambda equals zero. At lambda equals infinity, it's just playing the human imitation policy and not doing self-play at all. But for intermediate values of lambda, what we find is that it actually gives you a good medium between sticking to human conventions and performing strongly.
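
As a minimal numeric sketch of that trade-off (this is the closed form of a single KL-regularized decision, not the full iterative, regret-minimization-based piKL algorithm from the paper):

    # Maximizing expected value minus lambda * KL(pi || anchor) for one decision
    # gives: pi(a) proportional to anchor(a) * exp(Q(a) / lambda).
    import numpy as np

    def kl_regularized_policy(anchor, q_values, lam):
        logits = np.log(anchor) + np.asarray(q_values) / lam
        p = np.exp(logits - logits.max())
        return p / p.sum()

    anchor = np.array([0.7, 0.2, 0.1])   # human-imitation ("anchor") policy
    q = [0.0, 1.0, 0.2]                  # expected values of the three actions

    print(kl_regularized_policy(anchor, q, lam=0.01))   # ~argmax of Q: pure self-play limit
    print(kl_regularized_policy(anchor, q, lam=100.0))  # ~anchor policy: pure imitation limit
    print(kl_regularized_policy(anchor, q, lam=1.0))    # in between: human-like but value-aware

Small lambda recovers the pure self-play limit, large lambda recovers the imitation policy, and intermediate values give the human-like but value-aware middle ground described above.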

So you can kind of see this behavior emerge here. Sorry, there's a question. Is this similar to offline RL or also incorporates exploration? So I would say there's actually a lot of similar work on having a KL penalty. And so yes, I would say that it's very similar to a lot of that work.

And this has also been done in AlphaStar, where they had a KL penalty, though that was more about aiding exploration, using human data to aid exploration, rather than trying to better imitate humans. So I think what's interesting about the piKL work is that, one, we find it imitates humans better than just doing supervised learning alone.

And two, we're doing a bit of theory of mind: we use it as a model of our own behavior, of what we expect other players to think our behavior is, in addition to modeling the other players. So it's like a common-knowledge algorithm that we're using here.

So the kind of behavior that you see from this, you can see here. Let's say we're in this situation. This actually came up in a real game, and it inspired a figure from our paper. So England and France are fighting. France is the bot.

And France asks if England is willing to disengage. And let's say England says, yes, I will move out of English Channel if you head back to NAO. Well, we can see that Cicero does, in fact, back off, goes to NAO, and the disengagement is successful. And so this shows that the bot strategy really is reflecting the dialogue that it's had with this other player.

Another message that England might send is something like, I'm sorry, you've been fighting me this whole game. I can't trust you that you won't stab me. And so in this case, Cicero will continue its attack on England. And you can see, again, this is reflective. It's changing its behavior depending on the dialogue.

But you can also have this kind of message, where England says, yes, I'll leave English Channel if you move Kiel to Munich and Holland to Belgium. These are really bad moves for Cicero to follow. And so if you just look at the raw policy net, it might actually do this.

It might actually do these moves because England suggested them. But because we're using piKL, which accounts for the expected value of different actions, it will actually partially back off but ignore the suggested moves, because it recognizes that those would leave it very vulnerable to an attack. OK, I'll skip this slide for time.

Another thing I should say is that we're not just doing planning; we're doing this in a full self-play reinforcement learning loop. And again, the goal here is really about modeling humans better than supervised learning alone. And we found that doing this self-play reinforcement learning with piKL allowed us to model human behavior better than imitation learning alone.

Finally, we have an ensemble of message filtering techniques that filters both nonsensical and strategically unsound messages. So to give you an example of what these filters look like, one that we developed is value-based filtering. So the motivation for this is that what we feed into our dialogue model is a plan for ourselves and for our speaking partner.

But it's the entire plan that we have for ourselves. And so we might end up feeding into the dialogue model the fact that we're going to attack the player that we're speaking to. Now the dialogue model is, to be honest, kind of dumb. And it doesn't really know that it shouldn't be telling this player that they're going to be attacked this turn.

And so you have these messages that might be sent, something like the second one shown here, where England says to France, we have hostile intentions towards you. You must be wiped from the board. Please provide a croissant. So this is actually a message that the bot sent to a player.

Well, not to a real player; this was during preliminary testing, and it kind of motivated this whole approach. So we don't want the bot to send these kinds of messages if it's going to attack a player. We want it to either not send a message or send something much more bland, not necessarily an outright lie.

And so we filter out these kinds of messages by looking at the value. Like what we do is we generate a bunch of candidate messages. And then we see if we were to send this message, what is the behavior that we would expect the other players to take? Like what actions will we expect them to do after we send this message?

And what do they expect we will do after we send this message? And then we see what is the expected value of the action that we intend to take given the prediction of what everybody else is going to do. So if our intention is to attack France, then we can see, well, if I were to send this message to France, then they're going to get really defensive and defend against an attack from us and our attack is going to be unsuccessful.

And so therefore, I probably shouldn't send this message to them. And so in this way, we can actually filter out messages that have low expected value. And we found that this worked surprisingly well.
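
Here is a sketch of that value-based filter; the prediction and value functions are placeholders passed in as arguments, so this shows the shape of the idea rather than Cicero's actual implementation.

    # Keep only candidate messages whose expected value, given how others would
    # react to them, does not fall below the value of sending nothing at all.
    def value_filter(candidates, intended_action, state,
                     predict_reactions,   # others' policies if this message were sent
                     expected_value,      # EV of our intended action vs. those policies
                     baseline_ev,         # EV if we send no message
                     tolerance=0.0):
        kept = []
        for msg in candidates:
            reactions = predict_reactions(state, msg)
            ev = expected_value(intended_action, reactions, state)
            # Drop messages that tip off the recipient and tank our plan's value,
            # e.g. announcing an attack so the victim defends and the attack fails.
            if ev >= baseline_ev - tolerance:
                kept.append(msg)
        return kept

Dialogue examples: I'll go through one just for the sake of time. So here we have Cicero as France, and France is conversing with Turkey, who's a human player, and they're debating over who's going to get Tunis, this territory circled in red.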

You can see they both have fleets next to the territory. If they both go for it, neither of them are going to get it, and so they need to work out some sort of deal. So France says, I'll work with you, but I need Tunis for now. Turkey says, nope, you've got to let me have it.

France says, no, I need it. And then France suggests, you can take these other territories instead. You have Serbia and Rome to take. Turkey says, they're impossible targets. And then Cicero suggests specific moves that would allow Turkey to capture these territories. So Cicero says, Greece to Ionia, Ionia to Tyrrhenian.

Turkey says, hmm, you're right, good ideas. And then France says, then in the fall, you take Rome and Austria collapses. And so that allows Turkey to make progress against Austria. But conveniently, it also allows France to capture Tunis, because Turkey will be using those units for something else. So limitations and future directions.

Intent representation is just an action per player. The intentions we feed into the dialogue model are an action for this turn and the next turn, for ourselves and for the other player. But ideally we would have a richer set of intentions: we would be able to condition on things like long-term strategy, style of communication, or asking questions.

That's one of the limitations of this approach. Now, of course, the richer you make the space of intentions, the more room there is for things to go wrong. And you also have to then train the model to be able to handle these wider space of intentions. There was a question, do you think the dialogue model is learning an internal world model to be so good at predicting moves?

No, this is arguably why we're conditioning on intentions. We're relieving the dialogue model of having to come up with a good world model, because we're telling it like, these are the moves that we are planning to take this turn. And these are the moves that we would like this other player to take this turn.

So we're able to have the world model separate from the dialogue model, but condition on the output from the world model. Another limitation is that Cicero's value model doesn't condition on dialogue. And so it has a limited understanding of the long term effects of dialogue. This greatly limits our ability to plan what kind of messages we should be sending.

And this is actually why we always condition Cicero's dialogue generation on its truthful intentions. You could argue that there are situations in diplomacy where you would want to lie to the other player. The best players rarely lie, but they do lie sometimes. And you have to understand the trade-off: if you lie, it's going to be much harder to work with this person in the future.

And so you have to make sure that the value that you're getting positionally is worth that loss of trust and a broken relationship. Now, because Cicero's value model doesn't condition on dialogue, it can't really understand this trade off. And so for this reason, we actually always condition it on its truthful intentions.

Now, it is possible to have Cicero's value model condition on dialogue, but you would need way more data, and it would make things much more expensive. And so we weren't able to do it for this bot. And finally, there's a big question that I mentioned earlier, which is: is there a more general way of scaling inference-time compute to achieve better performance?

The way that we've done planning in Cicero is, I would argue, a bit domain specific. I think it's like the idea of pickle is quite general, but I think that there are potentially more general ways of doing planning. Somebody's asking, looking forward to the next two to three years, what criteria will you use to select the next game to try to conquer?

Honestly, like I said, we chose diplomacy because we thought it'd be the hardest game to make an AI for, and I think that that's true. I don't think we're going to be working on games anymore, because I can't think of any other game where, if we were to succeed at it, it would be truly impressive.

And so I think where the research is going in the future is generality. Instead of getting an AI to play one specific game, can we get an AI that is able to play diplomacy, but could also play Go or poker, or could also write essays and stories and solve math problems and prove theorems?

I think what we will see is games serving as benchmarks for progress, but not as the goal. It'll be part of the test set, but not part of the training set. And I think that's the way it should be going forward. Finally, I want to add that diplomacy is an amazing testbed for multi-agent AI and grounded dialogue.

So if you are interested in these kinds of domains, I highly recommend taking advantage of the fact that we've open-sourced all of our code and models, and the dialogue and action data is available through what's called an RFP, where you can apply to get access to it.

So thanks for listening. To wrap up, Cicero combines strategic reasoning and natural language in diplomacy. It placed in the top 10% of human players. The paper is in Science, and the code and models are publicly available at this URL. So thanks. And for the remaining time, I'll take questions.

Great. Thanks a lot for your talk. We also have some questions from the class, but you can finish the Zoom questions first. So if anyone has questions, I think, Noam, you can answer those. Yeah, there's one question: are you concerned about AIs outcompeting humans at real-world diplomatic strategic negotiation and deception tasks?

So like I said, we're not very focused on deception, even though arguably deception is a part of the game of diplomacy. I think for diplomatic and strategic negotiation-- look, the way that we've developed Cicero, it's designed to play the game of diplomacy specifically, and you can't use it out of the box for other tasks.

That said, I do think that the techniques are quite general. And so hopefully others can build on them to do different things. And I think it is entirely possible that over the next several years, you will see this entering into real-world negotiations much more often.

I actually think that diplomacy is a big step towards real world applicability compared to breakthroughs in games like Go and Poker. Because now your action space is really like the space of natural language, and you have to model human behavior. Do you think in the future, we could appoint an AI to the UN council?

Oh, hopefully. Only if it does better than humans, but that would be very interesting to see. Great. I'm also curious, what are the future things that you're working on in this direction? Do you think you can do something like AlphaGo Zero, where you take this pre-built model and then just have it do self-play?

Or what sort of future directions are you thinking about for improving this sort of bot? I think the future directions are really focused around generality. I think one of the big insights of Cicero is this ability to leverage planning to get better performance with language models in this strategic domain.

I think there's a lot of opportunity to do that sort of thing in a broader space of domains. I mean, you look at language models today, and they do token by token prediction. And I think there's a big opportunity to go beyond that. So that's what I'm excited to look into.

I'm also curious, I didn't understand the exact details of how you're using planning or Monte Carlo tree search with the models that you have. So is it like... We didn't use Monte Carlo tree search in Cicero. Monte Carlo tree search is a very good heuristic, but it's a heuristic that is particularly useful for deterministic, perfect-information games.

And I think in order to have a truly general form of planning, we need to go more abstract than Monte Carlo tree search. We use this algorithm called piKL, which is based on a regret minimization algorithm. I don't really want to go into the details of it because it's not that important for the class.

But the idea is that it is an iterative algorithm that will gradually refine the prediction of what everybody's going to do, and get better and better predictions the more iterations you run. And that's similar to search. Yeah. Sure. Go for it. You're unmuted. Okay. So yeah, my question is, when we were talking about generalizability, what does the communication between the different modules of the model look like, particularly for the dialogue model?

Like how do you send information from the policy network to the dialogue model? And in the future, if you have a model that's good at different tasks, are we going to have like a really big policy net that learns all of them or like separate language modules for all of them?

Like how do you break it down? So we actually convert the policy, the action for ourselves and for our dialogue partner, into a natural language string, and just feed that into the dialogue model along with all the dialogue that it's had so far. So it's just all text in, text out.
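
As a toy illustration of that text-in, text-out interface (the prompt format and order strings here are invented for the example; Cicero's actual format differs):

    # Flatten the planned actions into a string and prepend it to the dialogue
    # context, so the dialogue model sees only text.
    def build_prompt(my_power, partner_power, my_orders, partner_orders, dialogue_history):
        intent = (f"{my_power} intends: {'; '.join(my_orders)}. "
                  f"{my_power} wants {partner_power} to: {'; '.join(partner_orders)}.")
        transcript = "\n".join(f"{speaker}: {text}" for speaker, text in dialogue_history)
        return intent + "\n" + transcript + f"\n{my_power}:"

    print(build_prompt(
        "England", "France",
        ["F NTH - BEL"], ["A PIC S F NTH - BEL"],
        [("France", "What are you thinking up north?")],
    ))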

And that works great. And then what was the second part of your question? Something like, are we just going to have one giant policy net trained on everything? Yeah, it was: if you're only using text, doesn't it limit the model? And if you're using it for different games, when you say in the future you will work on generalizability, are you thinking about a big policy network that is trained on separate games, or is able to understand different games at the same time?

Or do we have separate policy networks for different games? And doesn't this text interface limit the model in terms of communication? Compared to using vectors, it might be a bit of a bottleneck. I mean, I think ideally, you go in this direction where you have a foundational model that works for pretty much everything.

Does text limit it? I mean, certainly, a text-in, text-out interface limits what you can do in terms of communication, but hopefully we get beyond that. I think it's a reasonable choice for now. Thank you. I think we have more Zoom questions. Okay, so there's a question in the chat: I'd love to hear your speculation on the future.

For instance, we've seen some startups that are fine-tuning LLMs to be biased, or experts in, say, subject X versus subject Y. This seems like a pretty general question. I don't have strong opinions on this. I'm not too focused myself on fine-tuning language models to specific tasks.

I think the direction that I'm much more interested in going forward is the more general forms of planning. So I don't think I can really comment on how you tune these language models in these ways. So what sort of planning methods are you interested in looking at? Like, MCTS is one.

So let me-- I've got to step out for just one second to switch rooms, excuse me. Okay, never mind, we're all good. Sorry, what was the question? Oh, yes. I was just asking what sort of planning algorithms you think are most interesting to combine.

So we have so many options: there's planning kind of stuff, or RL, there's MCTS, there's the work you did with Cicero. So what do you think are the most interesting algorithms, ones that will scale well and can generalize? Well, I think that's the big question that a lot of people are trying to figure out today.

And it's not really clear what the answer is. I mean, I think, you know, you look at some of the chain of thought, and I think there's a lot of limitations to chain of thought. And I think that it should be possible to do a lot better. But it is really impressive to see just how general of an approach it is.

And so I think it would be nice to see things that are general in that way, but hopefully able to achieve better performance. Awesome. Also, would you say Cicero is like an encoder-decoder model, in the sense that it encodes the world, and then you have the dialogue model, which is trying to decode it?

It was an encoder-decoder model, yes. I don't think that that's necessarily the right choice, but that's what we used. Any questions? Okay, I think, yeah, we're mostly good. But yeah, thanks a lot. Okay. Well, yeah, and if there are any questions, feel free to email me or reach out, I'm happy to chat.

Thanks. Bye. Bye. Bye.