
Noam Brown: AI vs Humans in Poker and Games of Strategic Negotiation | Lex Fridman Podcast #344


Chapters

0:00 Introduction
1:09 No Limit Texas Hold 'em
5:02 Solving poker
18:12 Poker vs Chess
24:50 AI playing poker
58:18 Heads-up vs Multi-way poker
69:08 Greatest poker player of all time
72:42 Diplomacy game
82:33 AI negotiating with humans
124:58 AI in geopolitics
129:43 Human-like AI for games
135:44 Ethics of AI
139:57 AGI
143:57 Advice to beginners

Transcript

00:00:00.000 | A lot of people were saying like,
00:00:01.540 | oh, this whole idea of game theory, it's just nonsense.
00:00:03.980 | And if you really want to make money,
00:00:05.480 | you got to like look into the other person's eyes
00:00:07.220 | and read their soul and figure out what cards they have.
00:00:10.640 | But what happened was where we played our bot
00:00:13.320 | against four top heads up, no limit,
00:00:15.140 | Hold'em poker players.
00:00:17.020 | And the bot wasn't trying to adapt to them.
00:00:19.460 | It wasn't trying to exploit them.
00:00:20.780 | It wasn't trying to do these mind games.
00:00:22.780 | It was just trying to approximate the Nash equilibrium
00:00:25.220 | and it crushed them.
00:00:28.660 | The following is a conversation with Noam Brown,
00:00:31.100 | research scientist at FAIR,
00:00:32.740 | the Facebook AI Research group at Meta AI.
00:00:35.500 | He co-created the first AI system
00:00:38.340 | that achieved superhuman level performance
00:00:40.600 | in no limit Texas Hold'em, both heads up and multiplayer.
00:00:44.580 | And now recently he co-created an AI system
00:00:48.540 | that can strategically out negotiate humans
00:00:51.360 | using natural language
00:00:52.900 | in a popular board game called Diplomacy,
00:00:55.340 | which is a war game that emphasizes negotiation.
00:00:58.760 | This is the Lex Fridman Podcast. To support it,
00:01:02.460 | Please check out our sponsors in the description.
00:01:04.900 | And now dear friends, here's Noam Brown.
00:01:08.440 | You've been a lead on three amazing AI projects.
00:01:12.820 | So we've got Libratus that solved
00:01:15.140 | or at least achieved human level performance
00:01:17.540 | on no limit Texas Hold'em poker
00:01:19.740 | with two players, heads up.
00:01:22.140 | You got Pluribus that solved no limit Texas Hold'em poker
00:01:26.140 | with six players.
00:01:28.100 | And just now you have Cicero.
00:01:30.220 | These are all names of systems that solved
00:01:32.980 | or achieved human level performance
00:01:35.780 | on the game of Diplomacy,
00:01:37.660 | which for people who don't know
00:01:39.980 | is a popular strategy board game.
00:01:42.140 | It was loved by JFK, John F. Kennedy and Henry Kissinger
00:01:46.980 | and many other big famous people in the decades since.
00:01:52.100 | So let's talk about poker and Diplomacy today.
00:01:54.780 | First poker, what is the game of no limit Texas Hold'em?
00:01:59.180 | And how is it different from chess?
00:02:00.460 | - Well, no limit Texas Hold'em poker
00:02:02.300 | is the most popular variant of poker in the world.
00:02:05.100 | So, you know, you go to a casino,
00:02:06.540 | you sit down at the poker table.
00:02:08.420 | The game that you're playing is no limit Texas Hold'em.
00:02:11.080 | If you watch movies about poker,
00:02:12.500 | like Casino Royale or Rounders,
00:02:14.520 | the game that they're playing
00:02:15.360 | is no limit Texas Hold'em poker.
00:02:17.420 | Now it's very different from limit Hold'em
00:02:20.940 | in that you can bet any amount of chips that you want.
00:02:23.380 | And so the stakes escalate really quickly.
00:02:26.060 | You start out with like one or $2 in the pot.
00:02:28.660 | And then by the end of the hand,
00:02:29.740 | you've got like $1,000 in there maybe.
00:02:32.460 | - So the option to increase the number very aggressively
00:02:34.980 | and very quickly is always there.
00:02:36.340 | - Right, the no limit aspect is there's no limits
00:02:38.780 | to how much you can bet.
00:02:39.980 | You know, in limit Hold'em,
00:02:42.220 | there's like $2 in the pot,
00:02:43.420 | you can only bet like $2.
00:02:45.400 | But if you got $10,000 in front of you,
00:02:47.900 | you're always welcome to put $10,000 into the pot.
00:02:50.500 | - So I've got a chance to hang out with Phil Hellmuth
00:02:52.780 | who plays all these different variants of poker.
00:02:55.780 | And correct me if I'm wrong,
00:02:57.220 | but it seems like no limit rewards crazy
00:03:01.280 | versus the other ones reward
00:03:02.900 | more kind of calculated strategy.
00:03:05.220 | Or no, because you're sort of looking
00:03:07.620 | from an analytic perspective,
00:03:10.420 | is strategy also rewarded in no limit Texas Hold'em?
00:03:14.980 | - I think both variants reward strategy.
00:03:17.220 | But I think what's different about no limit Hold'em
00:03:20.220 | is it's much easier to get jumpy.
00:03:23.220 | You know, you go in there thinking you're gonna play
00:03:26.580 | for like $100 or something.
00:03:28.900 | And suddenly there's like, you know, $1,000 in the pot.
00:03:31.100 | A lot of people can't handle that.
00:03:32.540 | - Can you define jumpy?
00:03:33.900 | - When you're playing poker,
00:03:35.220 | you always want to choose the action
00:03:37.300 | that's going to maximize your expected value.
00:03:39.420 | It's kind of like with investing, right?
00:03:41.260 | Like if you're ever in a situation
00:03:42.620 | where the amount of money that's at stake
00:03:44.820 | is going to have a material impact on your life,
00:03:49.240 | then you're gonna play in a more risk averse style.
00:03:51.860 | You know, if somebody makes a huge bet,
00:03:53.780 | you're gonna, if you're playing no limit Hold'em
00:03:55.620 | and somebody makes a huge bet,
00:03:57.540 | there might come a point where you're like,
00:03:58.900 | this is too much money for me to handle.
00:04:00.780 | Like I can't risk this amount.
00:04:03.100 | And that's what throws a lot of people off.
00:04:05.540 | So that's the big difference I think
00:04:07.860 | between no limit and limit.
00:04:09.900 | - What about on the action side
00:04:11.420 | when you're actually making that big bet?
00:04:14.140 | That's what I mean by crazy.
00:04:15.460 | I was trying to refer to the technical,
00:04:18.580 | the technical term of crazy,
00:04:20.700 | meaning use the big jump in the bet
00:04:24.460 | to completely throw off the other person
00:04:26.380 | in terms of their ability to reason optimally.
00:04:30.340 | - I think that's right.
00:04:31.160 | I think one of the key strategies in poker
00:04:34.500 | is to put the other person into an uncomfortable position.
00:04:38.020 | And if you're doing that, then you're playing poker well.
00:04:40.860 | And there's a lot of opportunities to do that
00:04:42.380 | in no limit Hold'em.
00:04:43.580 | You know, you can have like $50 in there,
00:04:45.980 | you throw in a thousand dollar bet,
00:04:47.860 | and you know, that's sometimes if you do it right,
00:04:51.020 | it puts the other person in a really tough spot.
00:04:53.200 | Now it's also possible that you make huge mistakes that way.
00:04:56.380 | And so it's really easy to lose a lot of money
00:04:57.980 | in no limit Hold'em if you don't know what you're doing.
00:05:00.740 | But there's a lot of upside potential too.
00:05:02.540 | - So when you build systems, AI systems
00:05:04.700 | that play these games, we'll talk about poker,
00:05:06.540 | we'll talk about diplomacy.
00:05:08.620 | Are you drawn in in part by the beauty of the game itself,
00:05:12.700 | AI aside, or is it to you primarily a fascinating
00:05:17.880 | problem set for the AI to solve?
00:05:20.160 | - I'm drawn in by the beauty of the game.
00:05:21.840 | When I started playing poker when I was in high school,
00:05:25.520 | and the idea to me that there is a correct,
00:05:29.960 | an objectively correct way of playing poker.
00:05:32.440 | And if you could figure out what that is,
00:05:34.280 | then you're making unlimited money basically.
00:05:38.220 | That's like a really fascinating concept to me.
00:05:41.160 | And so I was fascinated by the strategy of poker,
00:05:44.200 | even when I was like 16 years old.
00:05:46.280 | It wasn't until like much later
00:05:47.440 | that I actually worked on poker AIs.
00:05:49.000 | - So there was a sense that you can solve poker,
00:05:51.580 | like in the way you can solve chess, for example,
00:05:54.440 | or checkers, I believe checkers got solved, right?
00:05:57.440 | - Yeah, checkers is completely solved.
00:05:59.440 | - Optimal strategy. - Optimal strategy.
00:06:00.680 | It's impossible to beat the AI.
00:06:01.840 | - Yeah, and so in that same way,
00:06:03.040 | you could technically solve chess.
00:06:05.800 | - You could solve chess, you could solve poker.
00:06:07.560 | - You could solve poker.
00:06:08.800 | - So this gets into the concept of a Nash equilibrium.
00:06:12.400 | - So it is a Nash equilibrium.
00:06:14.200 | - Okay, so in any finite two-player zero-sum game,
00:06:19.080 | there is an optimal strategy that if you play it,
00:06:21.960 | you are guaranteed to not lose in expectation
00:06:24.360 | no matter what your opponent does.
00:06:26.600 | And this is kind of a radical concept to a lot of people,
00:06:29.960 | but it's true in chess, it's true in poker,
00:06:31.880 | it's true in any finite two-player zero-sum game.
00:06:34.960 | And to give some intuition for this,
00:06:36.880 | you can think of rock, paper, scissors.
00:06:39.020 | In rock, paper, scissors, if you randomly choose
00:06:41.800 | between throwing rock, paper, and scissors
00:06:43.320 | with equal probability,
00:06:44.560 | then no matter what your opponent does,
00:06:46.740 | you are not going to lose in expectation.
00:06:48.440 | You're not going to lose in expectation in the long run.
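To make that rock-paper-scissors claim concrete, here is a minimal sketch in Python (my own illustration, not anything from the episode; the payoff matrix and names are assumptions chosen for the example). It checks that the uniform 1/3-1/3-1/3 mix earns exactly zero in expectation against any opponent mix, which is what "not losing in expectation" means here.

```python
# Uniform rock-paper-scissors strategy vs. arbitrary opponent mixes.
# PAYOFF[i][j] = payoff to player 1 when player 1 plays action i and player 2 plays action j.
PAYOFF = [
    [0, -1, 1],    # rock: ties rock, loses to paper, beats scissors
    [1, 0, -1],    # paper
    [-1, 1, 0],    # scissors
]

def expected_value(my_mix, opp_mix):
    """Expected payoff of my mixed strategy against the opponent's mixed strategy."""
    return sum(my_mix[i] * opp_mix[j] * PAYOFF[i][j]
               for i in range(3) for j in range(3))

uniform = [1 / 3, 1 / 3, 1 / 3]
for opp_mix in ([1, 0, 0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]):
    print(opp_mix, round(expected_value(uniform, opp_mix), 10))   # always 0.0
```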
00:06:51.160 | Now, the same is true for poker.
00:06:53.080 | There exists some strategy,
00:06:54.440 | some really complicated strategy,
00:06:56.360 | that if you play that,
00:06:57.600 | you are guaranteed to not lose money in the long run.
00:07:00.440 | And I should say, this is for two-player poker.
00:07:01.880 | Six-player poker is a different story.
00:07:03.280 | - Yeah, it's a beautiful giant mess.
00:07:05.400 | When you say in expectation,
00:07:08.240 | you're guaranteed not to lose in expectation.
00:07:10.760 | What does in expectation mean?
00:07:12.680 | - Poker's a very high variance game.
00:07:14.240 | So you're going to have hands where you win,
00:07:15.520 | you're going to have hands where you lose.
00:07:16.480 | Even if you're playing the perfect strategy,
00:07:18.220 | you can't guarantee that you're going to win
00:07:19.400 | every single hand.
00:07:20.680 | But if you play for long enough,
00:07:22.480 | then you are guaranteed to at least break even,
00:07:25.040 | and in practice, probably win.
00:07:27.600 | - So that's in expectation,
00:07:29.080 | the size of your stack, generally speaking.
00:07:32.240 | Now, that doesn't include anything
00:07:33.760 | about the fact that you can go broke.
00:07:36.080 | It doesn't include any of those kinds
00:07:37.520 | of normal real-world limitations.
00:07:39.520 | You're talking in a theoretical world.
00:07:42.440 | What about the zero-sum aspect?
00:07:44.720 | How big of a constraint is that?
00:07:46.040 | How big of a constraint is finite?
00:07:48.480 | - So finite's not a huge constraint.
00:07:51.800 | So I mean, most games that you play are finite in size.
00:07:54.500 | It's also true, actually,
00:07:55.620 | that there exists this perfect strategy
00:07:57.760 | in many infinite games as well.
00:07:59.360 | Technically, the game has to be compact.
00:08:01.360 | There are some edge cases
00:08:03.880 | where you don't have a Nash equilibrium
00:08:05.200 | in a two-player zero-sum game.
00:08:06.720 | So you can think of a game where,
00:08:08.760 | if we're playing a game where whoever names
00:08:10.000 | the bigger number is the winner,
00:08:11.920 | there's no Nash equilibrium to that game.
00:08:13.400 | - 17, okay. - Yeah, exactly.
00:08:14.680 | 18, that beats you.
00:08:16.040 | - You win again.
00:08:17.320 | You're good at this.
00:08:18.840 | - I've played a lot of games.
00:08:20.960 | - Okay, so that's, and then the zero-sum aspect.
00:08:24.040 | The zero-sum. - Zero-sum aspect.
00:08:26.120 | So there exists a Nash equilibrium
00:08:28.360 | in non-two-player zero-sum games as well.
00:08:30.320 | And by the way, just to clarify what I mean
00:08:31.760 | by two-player zero-sum, I mean there's two players,
00:08:34.520 | and whatever one player wins, the other player loses.
00:08:36.920 | So if we're playing poker and I win $50,
00:08:38.840 | that means that you're losing $50.
00:08:41.160 | Now, outside of two-player zero-sum games,
00:08:44.480 | there still exists Nash equilibria,
00:08:46.840 | but they're not as meaningful.
00:08:48.480 | Because you can think of a game like Risk.
00:08:51.360 | If everybody else on the board decides
00:08:54.120 | to team up against you and take you out,
00:08:55.920 | there's no perfect strategy you can play
00:08:57.480 | that's gonna guarantee that you win there.
00:08:59.280 | There's just nothing you can do.
00:09:00.600 | So outside of two-player zero-sum games,
00:09:02.720 | there's no guarantee that you're going to win
00:09:05.040 | by playing a Nash equilibrium.
00:09:07.160 | - Have you ever tried to model in
00:09:08.920 | the other aspects of the game,
00:09:11.320 | which is like the pleasure you draw from playing the game?
00:09:15.360 | And then if you're a professional poker player,
00:09:18.440 | if you're exciting, even if you lose,
00:09:20.840 | the money you would get from the attention you get
00:09:25.520 | from the sponsors and all that kind of stuff,
00:09:27.320 | is that, that would be a fun thing to model in.
00:09:31.000 | Or does that make it sort of super complex
00:09:33.440 | to include the human factor in its full complexity?
00:09:36.920 | - I think you bring up a couple of good points there.
00:09:38.480 | So I think a lot of professional poker players,
00:09:41.200 | I mean, they get a huge amount of money,
00:09:42.920 | not from actually playing poker,
00:09:44.640 | but from the sponsorships and having a personality
00:09:47.600 | that people want to tune in and watch.
00:09:49.520 | That's a big way to make a name for yourself in poker.
00:09:53.360 | - I just wonder from an AI perspective,
00:09:55.040 | if you create, and we'll talk about this more,
00:09:57.440 | maybe AI system that also talks trash
00:10:01.720 | and all that kind of stuff,
00:10:02.960 | that that becomes part of the function to maximize.
00:10:05.240 | So it's not just optimal poker play.
00:10:08.560 | Maybe sometimes you want to be chaotic.
00:10:10.200 | Maybe sometimes you want to be suboptimal
00:10:11.800 | and you lose, for the chaos.
00:10:15.480 | And maybe sometimes you want to be overly aggressive
00:10:18.240 | because the audience loves that.
00:10:21.840 | That'd be fascinating.
00:10:22.960 | - I think what you're getting at here
00:10:24.200 | is that there's a difference between making an AI
00:10:25.880 | that wins a game and an AI that's fun to play with.
00:10:28.040 | - Yeah.
00:10:28.880 | - Yeah.
00:10:29.720 | - Or fun to watch.
00:10:30.540 | So those are all different things,
00:10:31.440 | fun to play with and fun to watch.
00:10:33.200 | - Yeah, and I think, I've heard talks from game designers,
00:10:37.640 | people that work on AI
00:10:39.880 | for actual recreational games that people play.
00:10:42.800 | And they say, yeah, there's a big difference
00:10:44.440 | between trying to make an AI that actually wins.
00:10:46.280 | And you look at a game like "Civilization",
00:10:49.280 | the way that the AIs play is not optimal for trying to win.
00:10:53.400 | They're playing a different game.
00:10:54.800 | They're trying to have personalities.
00:10:55.960 | They're trying to be fun and engaging.
00:10:59.000 | And that makes for a better game.
00:11:00.960 | - Yeah.
00:11:01.800 | - And we also talk about NPCs.
00:11:02.640 | We just talked to Todd Howard,
00:11:03.960 | who is the creator of "Fallout" and the "Elder Scrolls" series
00:11:06.720 | and "Starfield", the new game coming out.
00:11:10.560 | And the creator of what I think is the greatest game
00:11:13.160 | of all time, which is "Skyrim" and the NPCs there.
00:11:15.640 | The AI that governs that whole game is very interesting,
00:11:18.300 | but the NPCs also are super interesting.
00:11:20.780 | And considering what language models might do to NPCs
00:11:25.640 | in an open world RPG role-playing game,
00:11:29.800 | it's super exciting.
00:11:31.020 | - Yeah, honestly, I think this is one
00:11:33.440 | of the first applications where we're going to see
00:11:35.420 | real consumer interaction with large language models.
00:11:38.640 | I guess "Elder Scrolls VI" is in development now.
00:11:42.720 | They're probably pretty close to finishing it,
00:11:44.680 | but I would not be surprised at all if "Elder Scrolls VII"
00:11:48.160 | was using large language models for their NPCs.
00:11:49.880 | - No, they're not.
00:11:51.080 | I mean, I'm not saying anything.
00:11:52.680 | I'm not saying anything.
00:11:53.640 | - Okay, this is me speculating, not you.
00:11:55.880 | - No, but they're just releasing the "Starfield" game.
00:11:59.280 | They do one game at a time.
00:12:00.520 | - Yeah.
00:12:01.360 | - And so whatever it is, whenever the date is,
00:12:04.260 | I don't know what the date is, calm down,
00:12:07.000 | but it would be, I don't know, like 2024, '25, '26.
00:12:11.000 | So it's actually very possible
00:12:12.720 | that would include language models.
00:12:14.240 | - I was listening to this talk by a gaming executive
00:12:19.200 | when I was in grad school.
00:12:20.840 | And one of the questions that a person in the audience asked
00:12:24.160 | is why are all these games so focused
00:12:26.140 | on fighting and killing?
00:12:27.780 | And the person responded that it's just so much harder
00:12:30.880 | to make an AI that can talk with you and cooperate with you
00:12:34.080 | than it is to make an AI that can fight you.
00:12:36.600 | And I think once this technology develops further
00:12:39.680 | and you can reach a point where like
00:12:41.380 | not every single line of dialogue has to be scripted,
00:12:44.160 | it unlocks a lot of potential for new kinds of games,
00:12:46.320 | like much more like positive interactions
00:12:49.280 | that are not so focused on fighting.
00:12:50.360 | And I'm really looking forward to that.
00:12:52.080 | - It might not be positive.
00:12:53.120 | It might be just drama.
00:12:54.440 | So you'll be in like a "Call of Duty" game
00:12:56.760 | and instead of doing the shooting,
00:12:57.720 | you'll just be hanging out and like arguing with an AI
00:13:00.920 | about like passive aggressive.
00:13:03.720 | And then you won't be able to sleep that night.
00:13:05.200 | You have to return and continue the argument
00:13:07.780 | that you were emotionally hurt.
00:13:10.520 | I mean, yeah, I think that's actually an exciting world.
00:13:15.040 | Whatever is the drama, the chaos that we love,
00:13:17.640 | the push and pull of human connection,
00:13:19.480 | I think it's possible to do that in the video game world.
00:13:22.240 | And I think you could be messier
00:13:24.280 | and make more mistakes in a video game world,
00:13:26.320 | which is why it would be a nice place.
00:13:28.840 | And also it doesn't have as deep
00:13:32.680 | of a real psychological impact
00:13:34.360 | because inside video games,
00:13:35.640 | it's kind of understood that you're in a not a real world.
00:13:39.380 | So whatever crazy stuff AI does,
00:13:42.000 | we have some flexibility to play.
00:13:43.880 | Just like with a game of diplomacy, it's a game.
00:13:46.320 | This is not real geopolitics, not real war.
00:13:48.800 | It's a game.
00:13:49.880 | So you can have a little bit of fun, a little bit of chaos.
00:13:53.960 | Okay, back to Nash equilibria.
00:13:56.160 | How do we find the Nash equilibrium?
00:13:58.300 | - All right, so there's different ways
00:14:00.580 | to find a Nash equilibrium.
00:14:01.720 | So the way that we do it
00:14:04.680 | is with this process called self-play.
00:14:07.300 | Basically, we have this algorithm
00:14:09.180 | that starts by playing totally randomly,
00:14:11.800 | and it learns how to play the game
00:14:13.960 | by playing against itself.
00:14:15.640 | So it will start playing the game totally randomly,
00:14:19.560 | and then if it's playing poker,
00:14:21.240 | it'll eventually get to the end of the game and make $50.
00:14:26.200 | And then it will review all of the decisions
00:14:28.260 | that it made along the way and say,
00:14:30.420 | what would have happened
00:14:31.420 | if I had chosen this other action instead?
00:14:34.080 | If I had raised here instead of called,
00:14:36.760 | what would the other player have done?
00:14:38.460 | And because it's playing against a copy of itself,
00:14:40.220 | it's able to do that counterfactual reasoning.
00:14:42.560 | So it can say, okay, well, if I took this action
00:14:45.340 | and the other person takes this action,
00:14:46.600 | and then I take this action,
00:14:47.940 | and eventually I make $150 instead of 50.
00:14:51.440 | And so it updates the regret value for that action.
00:14:56.160 | Regret is basically like how much does it regret
00:14:58.040 | having not played that action in the past?
00:15:00.540 | And when it encounters that same situation again,
00:15:03.740 | it's going to pick actions that have higher regret
00:15:05.720 | with higher probability.
00:15:07.680 | Now, it'll just keep simulating the games this way.
00:15:10.920 | It'll keep accumulating regrets for different situations.
00:15:14.880 | And in the long run,
00:15:16.760 | if you pick actions that have higher regret
00:15:18.600 | with higher probability in the correct way,
00:15:20.760 | it's proven to converge to a Nash equilibrium.
00:15:23.580 | - Even for super complex games?
00:15:26.440 | Even for imperfect information games?
00:15:28.560 | - It's true for all games.
00:15:29.440 | It's true for chess, it's true for poker.
00:15:31.520 | It's particularly useful for poker.
00:15:33.480 | - So this is the method
00:15:34.640 | of counterfactual regret minimization?
00:15:36.400 | - This is counterfactual regret minimization.
00:15:37.880 | - That doesn't have to do with self-play,
00:15:39.400 | it has to do with just any,
00:15:41.680 | if you follow this kind of process, self-play or not,
00:15:44.520 | you will be able to arrive at an optimal set of actions.
00:15:48.280 | - So this counterfactual regret minimization
00:15:50.120 | is a kind of self-play.
00:15:51.440 | It's a principled kind of self-play
00:15:53.120 | that's proven to converge to Nash equilibria,
00:15:55.360 | even in imperfect information games.
00:15:57.680 | Now you can have other forms of self-play
00:15:59.200 | and people use other forms of self-play
00:16:00.680 | for perfect information games,
00:16:02.120 | where you have more flexibility,
00:16:04.800 | the algorithm doesn't have to be as theoretically sound
00:16:07.720 | in order to converge in that class of games
00:16:09.760 | because it's a simpler setting.
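To make the regret-matching loop described above concrete, here is a minimal self-play sketch (an illustration of the general idea only, not the Libratus or Pluribus code). Each iteration, both players sample an action in proportion to positive accumulated regret, then ask the counterfactual question "how much better would each other action have done against what the opponent actually played?" and add that to the regrets. The average strategy converges toward the Nash equilibrium, which for rock-paper-scissors is the uniform mix.

```python
import random

PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # rock/paper/scissors payoff to the row player

def strategy_from_regrets(regrets):
    # Play each action with probability proportional to its positive regret.
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1 / 3] * 3

def train(iterations=200_000):
    regrets = [[0.0] * 3 for _ in range(2)]
    strategy_sums = [[0.0] * 3 for _ in range(2)]
    for _ in range(iterations):
        strategies = [strategy_from_regrets(r) for r in regrets]
        actions = [random.choices(range(3), weights=s)[0] for s in strategies]
        for p in range(2):
            opp_action = actions[1 - p]
            realized = PAYOFF[actions[p]][opp_action]
            for a in range(3):
                # Counterfactual: how much better would action a have done?
                regrets[p][a] += PAYOFF[a][opp_action] - realized
                strategy_sums[p][a] += strategies[p][a]
    # The *average* strategy is what converges to the equilibrium.
    return [[s / iterations for s in sums] for sums in strategy_sums]

print(train())  # both averages approach [0.333..., 0.333..., 0.333...]
```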
00:16:11.800 | - Sure, so I kind of, in my brain,
00:16:14.920 | the word self-play has mapped to neural networks,
00:16:17.280 | but we're speaking something bigger
00:16:19.000 | than just neural networks.
00:16:20.000 | It could be anything.
00:16:21.320 | The self-play mechanism is just the mechanism
00:16:25.000 | of a system playing itself.
00:16:26.120 | - Exactly, yeah.
00:16:26.960 | Self-play is not tied specifically to neural nets,
00:16:28.840 | it's a kind of reinforcement learning, basically.
00:16:32.040 | And I would also say this process of trying to reason,
00:16:35.720 | oh, what would the value have been
00:16:36.840 | if I had taken this other action instead?
00:16:39.280 | This is very similar to how humans learn
00:16:40.920 | to play a game like poker.
00:16:42.960 | You probably played poker before
00:16:44.160 | and with your friends, you probably ask,
00:16:46.120 | oh, what do you have called me if I raise there?
00:16:49.400 | That's a person trying to do the same kind of learning
00:16:52.440 | from a counterfactual that the AI is doing.
00:16:54.920 | - Okay, and if you do that at scale,
00:16:56.360 | you're gonna be able to learn an optimal policy.
00:16:59.680 | - Yeah, now where the neural nets come in,
00:17:01.800 | I said, okay, if it's in that situation again,
00:17:04.720 | then it will choose the action that has high regret.
00:17:07.520 | Now, the problem is that poker is such a huge game.
00:17:10.640 | I think No Limit Texas Hold'em,
00:17:12.360 | the version that we were playing,
00:17:13.240 | has 10 to the 161 different decision points,
00:17:15.920 | which is more than the number of atoms
00:17:17.080 | in the universe squared.
00:17:18.160 | - That's heads up?
00:17:19.040 | - That's heads up, yeah.
00:17:20.240 | - 10 to the 161, you said?
00:17:21.600 | - Yeah, I mean, it depends on the number of chips
00:17:22.960 | that you have, the stacks and everything,
00:17:24.240 | but the version that we were playing was 10 to the 161.
00:17:27.080 | - Which I assume would be a somewhat simplified version
00:17:29.360 | anyway, 'cause I bet there's some step function
00:17:34.240 | you had for bets.
00:17:36.160 | - Oh, no, no, no, I'm saying we played the full game.
00:17:38.760 | You can bet whatever amount you want.
00:17:39.920 | Now, the bot maybe was constrained
00:17:41.280 | in what it considered for bet sizes,
00:17:43.200 | but the person on the other side
00:17:44.720 | could bet whatever they wanted.
00:17:45.800 | - Yeah, I mean, 161 plus or minus 10 doesn't matter.
00:17:49.520 | - Yeah, yeah.
00:17:50.360 | And so the way neural nets help out here is,
00:17:54.960 | you know, you don't have to run into the same exact situation
00:17:57.160 | 'cause that's never gonna happen again.
00:17:58.280 | The odds of you running into the same exact situation
00:18:00.160 | are pretty slim, but if you run into a similar situation,
00:18:03.160 | then you can generalize from other states
00:18:05.120 | that you've been in that kind of look like that one,
00:18:07.000 | and you can say like, well, these other situations,
00:18:09.160 | I had high regret for this action,
00:18:10.400 | and so maybe I should play that action here as well.
00:18:12.680 | - Which is the more complex game?
00:18:15.000 | Chess or poker or go or poker?
00:18:18.560 | Do you know?
00:18:19.400 | - That is a controversial question.
00:18:21.120 | - Okay.
00:18:21.960 | - I'm gonna-
00:18:22.780 | - It's like somebody's screaming on Reddit right now.
00:18:24.040 | It depends on which subreddit you're on.
00:18:25.720 | Is it chess or is it poker?
00:18:27.200 | - I'm sure like David Silver's
00:18:28.360 | gonna get really angry at me.
00:18:29.600 | - Yeah.
00:18:30.440 | - I'll say, I'm gonna say poker actually,
00:18:31.760 | and I think for a couple of reasons.
00:18:33.960 | - They're not here to defend themselves.
00:18:36.200 | - So first of all,
00:18:37.760 | you have the imperfect information aspect.
00:18:39.460 | And so it's, we can go into that,
00:18:43.760 | but like once you introduce imperfect information,
00:18:47.040 | things get much more complicated.
00:18:49.080 | - So we should say,
00:18:50.920 | maybe you can describe what is seen to the players,
00:18:54.360 | what is not seen in the game of Texas Hold 'Em.
00:18:57.960 | - Yeah, so Texas Hold 'Em,
00:18:59.040 | you get two cards face down that only you see.
00:19:02.640 | And so that's the hidden information of the game.
00:19:04.560 | The other players also all get two cards face down
00:19:06.600 | that only they see.
00:19:08.000 | And so you have to kind of, as you're playing,
00:19:10.000 | reason about like, okay, what do they think I have?
00:19:12.640 | What do they have?
00:19:13.680 | What do they think I think they have?
00:19:15.480 | That kind of stuff.
00:19:16.320 | And that's kind of where bluffing comes into play, right?
00:19:20.080 | Because the fact that you can bluff,
00:19:22.640 | the fact that you can bet with a bad hand and still win,
00:19:25.740 | is because they don't know what your cards are.
00:19:27.600 | - Right.
00:19:28.440 | - And that's the key difference
00:19:29.680 | between a perfect information game like poker,
00:19:31.880 | sorry, like chess and go,
00:19:34.120 | and imperfect information games like poker.
00:19:36.040 | - This is what trash talk looks like.
00:19:38.440 | The implied statement is,
00:19:41.960 | the game I solved is much tougher.
00:19:44.840 | But yeah, so when you're playing,
00:19:46.880 | I'm just gonna do random questions here.
00:19:48.440 | So when you're playing your opponent
00:19:51.240 | under imperfect information,
00:19:53.080 | is there some degree to which you're trying
00:19:56.240 | to estimate the range of hands that they have?
00:19:58.820 | Or is that not part of the algorithm?
00:20:01.320 | So what are the different approaches
00:20:04.280 | to the imperfect information game?
00:20:06.680 | - So the key thing to understand
00:20:08.320 | about why imperfect information makes things difficult,
00:20:11.140 | is that you have to worry not just about
00:20:13.200 | which actions to play,
00:20:14.720 | but the probability that you're gonna play those actions.
00:20:17.400 | So you think about rock, paper, scissors, for example,
00:20:21.280 | rock, paper, scissors is an imperfect information game.
00:20:24.360 | - Right.
00:20:25.200 | - Because you don't know what I'm about to throw.
00:20:26.600 | - I do, but yeah, usually not, yeah.
00:20:28.520 | - Yeah, and so you can't just say like,
00:20:30.360 | I'm just gonna throw a rock every single time,
00:20:32.240 | because the other person's gonna figure that out
00:20:34.160 | and notice a pattern,
00:20:35.260 | and then suddenly you're gonna start losing.
00:20:37.280 | And so you don't just have to figure out
00:20:38.600 | like which action to play,
00:20:39.640 | you have to figure out the probability that you play it.
00:20:42.020 | And really importantly, the value of an action
00:20:45.320 | depends on the probability that you're gonna play it.
00:20:47.580 | So if you're playing rock every single time,
00:20:50.380 | that value is really low.
00:20:51.900 | But if you're never playing rock,
00:20:54.640 | you play rock like 1% of the time,
00:20:55.960 | then suddenly the other person
00:20:57.620 | is probably gonna be throwing scissors.
00:20:59.780 | And when you throw a rock,
00:21:00.820 | the value of that action is gonna be really high.
00:21:03.340 | Now you take that to poker,
00:21:04.580 | what that means is the value of bluffing, for example,
00:21:09.100 | if you're the kind of person that never bluffs
00:21:10.600 | and you have this reputation as somebody that never bluffs,
00:21:13.240 | and suddenly you bluff,
00:21:14.480 | there's a really good chance that that bluff is gonna work
00:21:16.560 | and you're gonna make a lot of money.
00:21:18.160 | On the other hand, if you got a reputation,
00:21:19.880 | like if they've seen you play for a long time and they see,
00:21:21.720 | oh, you're the kind of person that's bluffing all the time,
00:21:24.700 | when you bluff, they're not gonna buy it
00:21:26.200 | and they're gonna call you down
00:21:27.040 | and you're gonna lose a lot of money.
00:21:28.920 | And that, finding that balance
00:21:31.240 | of how often you should be bluffing
00:21:33.240 | is the key challenge of a game of poker.
00:21:37.240 | And you contrast that with a game like chess,
00:21:40.220 | it doesn't matter if you're opening with the queen's gambit
00:21:43.880 | 10% of the time or 100% of the time,
00:21:46.140 | the value, the expected value is the same.
00:21:48.240 | So that's why we need these algorithms that understand,
00:21:54.560 | not just we have to figure out what actions are good,
00:21:56.840 | but the probabilities,
00:21:57.720 | we need to get the exact probabilities correct.
00:21:59.600 | And that's actually when we created the bot Libratus,
00:22:02.300 | Libratus means balanced
00:22:03.540 | because the algorithm that we designed
00:22:06.160 | was designed to find that right balance
00:22:08.200 | of how often it should play each action.
00:22:10.920 | - The balance of how often in the key sort of branching
00:22:13.720 | is the bluff or not to bluff.
00:22:15.320 | Is that a good crude simplification
00:22:18.800 | of the major decision in poker?
00:22:21.080 | - It's a good simplification,
00:22:22.200 | I think that's like the main tension,
00:22:23.880 | but it's not just how often to bluff or not to bluff,
00:22:27.720 | it's like, how often should you bet in general?
00:22:29.740 | How often should you, what kind of bet should you make?
00:22:33.280 | Should you bet big or should you bet small?
00:22:35.240 | And with which hands?
00:22:37.720 | And so this is where the idea of a range comes from,
00:22:40.680 | because when you're bluffing with a particular hand
00:22:43.200 | in a particular spot,
00:22:44.880 | you don't want there to be a pattern
00:22:46.220 | for the other person to pick up on.
00:22:47.440 | You don't want them to figure out,
00:22:48.520 | oh, whenever this person is in this spot,
00:22:50.700 | they're always bluffing.
00:22:51.840 | And so you have to reason about,
00:22:53.560 | okay, would I also bet with a good hand in this spot?
00:22:58.240 | You wanna be unpredictable.
00:22:59.700 | So you have to think about what would I do
00:23:01.980 | if I had this different set of cards?
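As a rough illustration of how that balance gets pinned down, here is a standard textbook simplification of a single betting spot (my own example, not anything specific to Libratus): if a player bets an amount `bet` into a pot of size `pot` with a range made up only of very strong hands and pure bluffs, the bluff share that leaves the caller indifferent is bet / (pot + 2 * bet). Bluff more often than that and calling prints money; bluff less often and calling loses money.

```python
def caller_ev(bluff_fraction, pot, bet):
    """Caller's expected value of calling a bet from a range of nut hands and pure bluffs."""
    win = bluff_fraction * (pot + bet)    # snap off a bluff: win the pot plus the bet
    lose = (1 - bluff_fraction) * bet     # pay off a value bet: lose the call amount
    return win - lose

def balanced_bluff_fraction(pot, bet):
    """Bluff share that makes calling and folding equally good for the caller."""
    return bet / (pot + 2 * bet)

pot, bet = 100, 100                            # a pot-sized bet
print(balanced_bluff_fraction(pot, bet))       # 1/3: one bluff for every two value bets
print(caller_ev(1 / 3, pot, bet))              # ~0: the caller is indifferent
print(caller_ev(0.5, pot, bet))                # > 0: bluffing too often, calls profit
print(caller_ev(0.1, pot, bet))                # < 0: bluffing too rarely, calls lose
```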
00:23:04.280 | - Is there explicit estimation of like a theory of mind
00:23:08.400 | that the other person has about you,
00:23:09.960 | or is that just a emergent thing that happens?
00:23:13.140 | - The way that the bots handle it,
00:23:17.280 | that are really successful,
00:23:18.160 | they have an explicit theory of mind.
00:23:19.660 | So they're explicitly reasoning about
00:23:21.920 | what's the common knowledge belief?
00:23:25.400 | What do you think I have?
00:23:27.320 | What do I think you have?
00:23:28.800 | What do you think I think you have?
00:23:31.060 | It's explicitly reasoning about that.
00:23:32.680 | - Is there multiple yous there?
00:23:34.360 | So maybe that's jumping ahead to six players,
00:23:37.360 | but is there a stickiness to the person?
00:23:40.720 | So it's an iterative game.
00:23:42.060 | You're playing the same person.
00:23:43.600 | There's a stickiness to that, right?
00:23:47.480 | You're gathering information as you play.
00:23:49.720 | It's not every hand is a new hand.
00:23:53.320 | Is there a continuation in terms of estimating
00:23:57.240 | what kind of player I'm facing here?
00:23:59.340 | - That's a good question.
00:24:00.180 | So you could approach the game that way.
00:24:02.780 | The way that the bots do it,
00:24:04.500 | they don't, and the way that humans approach it also,
00:24:06.660 | expert human players,
00:24:07.940 | the way they approach it is to basically assume
00:24:10.580 | that you know my strategy.
00:24:13.300 | So I'm going to try to pick a strategy
00:24:16.460 | where even if I were to play it for 10,000 hands
00:24:18.380 | and you could figure out exactly what it was,
00:24:20.200 | you still wouldn't be able to beat it.
00:24:21.440 | Basically what that means is I'm trying
00:24:22.740 | to approximate the Nash equilibrium.
00:24:24.380 | I'm trying to be perfectly balanced
00:24:25.620 | because if I'm playing the Nash equilibrium,
00:24:27.940 | even if you know what my strategy is,
00:24:30.700 | like I said, I'm still unbeatable in expectation.
00:24:33.140 | So that's what the bot aims for.
00:24:35.740 | And that's actually what a lot
00:24:36.940 | of expert poker players aim for as well,
00:24:38.700 | to start by playing the Nash equilibrium.
00:24:41.140 | And then maybe if they spot weaknesses
00:24:42.500 | in the way you're playing,
00:24:43.660 | then they can deviate a little bit
00:24:44.940 | to take advantage of that.
00:24:46.200 | - They aim to be unbeatable in expectation.
00:24:49.500 | Okay.
00:24:50.660 | So who's the greatest poker player of all time
00:24:52.780 | and why is it Phil Hellmuth?
00:24:54.180 | So this is for Phil.
00:24:56.220 | So he's known at least in part
00:25:00.260 | for maybe playing suboptimally
00:25:02.700 | and he still wins a lot.
00:25:04.820 | It's a bit chaotic.
00:25:06.100 | So maybe can you speak from an AI perspective
00:25:09.940 | about the genius of his madness
00:25:12.340 | or the madness of his genius?
00:25:14.420 | So playing suboptimally, playing chaotically
00:25:16.660 | as a way to make it hard to pin down
00:25:21.820 | about what your strategy is.
00:25:23.740 | - So, okay.
00:25:25.060 | The thing that I should explain first of all
00:25:26.420 | with like Nash equilibrium,
00:25:28.580 | it doesn't mean that it's predictable.
00:25:30.080 | The whole point of it is that
00:25:31.060 | you're trying to be unpredictable.
00:25:32.880 | Now, I think when somebody like Phil Hellmuth
00:25:35.480 | might be really successful,
00:25:36.940 | is not in being unpredictable,
00:25:38.540 | but in being able to take advantage
00:25:42.240 | of the other player and figure out
00:25:43.660 | where they're being predictable
00:25:45.700 | or guiding the other player into thinking
00:25:49.280 | that you have certain weaknesses
00:25:51.340 | and then understanding how they're going
00:25:53.080 | to change their behavior.
00:25:54.140 | They're going to deviate from a Nash equilibrium style
00:25:57.000 | of play to try to take advantage
00:25:58.540 | of those perceived weaknesses
00:25:59.500 | and then counter exploit them.
00:26:00.860 | So you kind of get into the mind games there.
00:26:02.700 | - So you think about these heads up poker
00:26:04.980 | as a dance between two agents,
00:26:07.700 | I guess, are you playing the cards
00:26:08.880 | or are you playing the player?
00:26:10.460 | - So this gets down to a big argument
00:26:12.900 | in the poker community and the academic community.
00:26:15.500 | For a long time, there was this debate
00:26:16.900 | of like what's called GTO,
00:26:19.140 | game theory optimal poker or exploitative play.
00:26:22.820 | And up until about like 2017,
00:26:25.820 | when we did the Libratus match,
00:26:27.140 | I think actually exploitative play had the advantage.
00:26:29.660 | A lot of people were saying like,
00:26:31.140 | oh, this whole idea of game theory, it's just nonsense.
00:26:33.580 | And if you really want to make money,
00:26:35.080 | you got to like look into the other person's eyes
00:26:36.820 | and read their soul and figure out what cards they have.
00:26:40.260 | But what happened was people started adopting
00:26:42.960 | the game theory optimal strategy
00:26:44.560 | and they were making good money.
00:26:46.860 | And they weren't trying to adapt so much to the other player.
00:26:50.140 | They were just trying to play the Nash equilibrium.
00:26:52.740 | And then what really solidified it, I think,
00:26:54.340 | was the Libratus match,
00:26:56.940 | where we played our bot against four top heads up,
00:26:59.360 | no limit hold 'em poker players.
00:27:01.620 | And the bot wasn't trying to adapt to them.
00:27:04.060 | It wasn't trying to exploit them.
00:27:05.380 | It wasn't trying to do these mind games.
00:27:07.380 | It was just trying to approximate the Nash equilibrium
00:27:09.820 | and it crushed them.
00:27:11.540 | I think, you know, we were playing for $50, $100 blinds.
00:27:16.540 | And over the course of about 120,000 hands,
00:27:19.660 | it made close to $2 million.
00:27:21.500 | - 120,000 hands.
00:27:22.900 | - 120,000 hands.
00:27:24.060 | - Against humans.
00:27:24.900 | - Yeah, and this was fake money to be clear.
00:27:26.580 | So there was real money at stake.
00:27:27.740 | There was $200,000.
00:27:28.620 | - First of all, all money is fake.
00:27:30.340 | But that's a different conversation.
00:27:33.780 | We give it meaning.
00:27:36.140 | It's a phenomena that gets meaning
00:27:39.180 | from our complex psychology as a human civilization.
00:27:43.340 | It's emerging from the collective intelligence
00:27:45.180 | of the human species.
00:27:46.380 | But that's not what you mean.
00:27:47.280 | You mean like there's literally,
00:27:49.140 | you can't buy stuff with it.
00:27:50.380 | Okay, can you actually step back
00:27:52.300 | and take me through that competition?
00:27:55.900 | - Yeah, okay.
00:27:56.740 | So when I was in grad school,
00:27:59.620 | there was this thing called
00:28:00.460 | the annual computer poker competition,
00:28:02.180 | where every year all the different research labs
00:28:04.620 | that were working on AI for poker would get together.
00:28:06.940 | They would make a bot.
00:28:07.780 | They would play them against each other.
00:28:09.700 | And we made a bot that actually won the 2014 competition,
00:28:14.300 | the 2016 competition.
00:28:16.220 | And so we decided we're gonna take this bot, build on it,
00:28:19.300 | and play against real top professional
00:28:22.460 | heads up no limit Texas Hold'em poker players.
00:28:25.020 | So we invited four of the world's best players
00:28:29.220 | in this specialty.
00:28:30.620 | And we challenged them to 120,000 hands of poker
00:28:33.260 | over the course of 20 days.
00:28:34.740 | And we had $200,000 in prize money at stake,
00:28:40.220 | where it would basically be divided among them,
00:28:42.060 | depending on how well they did relative to each other.
00:28:44.700 | So we wanted to have some incentive
00:28:46.060 | for them to play their best.
00:28:47.780 | - Did you have a confidence 2014, 16,
00:28:50.940 | that this is even possible?
00:28:52.580 | How much doubt was there?
00:28:53.860 | - So we did a competition actually in 2015,
00:28:56.660 | where we also played against professional poker players
00:28:58.940 | and the bot lost by a pretty sizable margin actually.
00:29:02.080 | Now there were some big improvements from 2015 to 2017.
00:29:05.820 | And so-
00:29:06.660 | - Can you speak to the improvements?
00:29:07.660 | Is it computational in nature?
00:29:08.940 | Is it the algorithm, the methods?
00:29:11.420 | - It was really an algorithmic approach.
00:29:12.940 | That was the difference.
00:29:13.900 | So 2015, it was much more focused on trying to come up
00:29:18.900 | with a strategy upfront,
00:29:20.780 | like trying to solve the entire game of poker,
00:29:23.180 | like, and then just have a lookup table
00:29:25.420 | where you're saying like, "Oh, I'm in this situation.
00:29:27.460 | "What's the strategy?"
00:29:29.420 | The approach that we took in 2017
00:29:30.980 | was much more search-based.
00:29:32.820 | It was trying to say, "Okay, well, let me in real time
00:29:36.260 | "try to compute a much better strategy
00:29:38.760 | "than what I had pre-computed
00:29:40.940 | "by playing against myself during self-play."
00:29:42.620 | - What is the search space for poker?
00:29:47.020 | What are you searching over?
00:29:48.420 | What's that look like?
00:29:50.700 | There's different actions like raising, calling.
00:29:53.940 | Yeah, what are the actions?
00:29:55.520 | Is it just a search over actions?
00:29:59.620 | - So in a game like chess, the search is like,
00:30:02.740 | "Okay, I'm in this chess position
00:30:04.920 | "and I can move these different pieces
00:30:06.920 | "and see where things end up."
00:30:08.340 | In poker, what you're searching over
00:30:09.860 | is the actions that you can take for your hand,
00:30:12.980 | the probabilities that you take those actions,
00:30:14.960 | and then also the probabilities that you take other actions
00:30:17.300 | with other hands that you might have.
00:30:19.140 | And that's kind of like hard to wrap your head around.
00:30:22.980 | Like, why are you searching over these other hands
00:30:26.220 | that you might have and trying to figure out
00:30:28.500 | what you would do with those hands?
00:30:30.740 | And the idea is, again, you wanna always be balanced
00:30:35.460 | and unpredictable.
00:30:36.840 | And so if you're a search algorithm that's saying like,
00:30:39.160 | "Oh, I want to raise with this hand."
00:30:41.300 | Well, in order to know whether that's a good action,
00:30:43.700 | like let's say it's a bluff.
00:30:44.780 | Let's say you have a bad hand and you're saying like,
00:30:46.140 | "Oh, I think I should be betting here
00:30:48.220 | "with this really bad hand and bluffing."
00:30:50.320 | Well, that's only a good action
00:30:52.060 | if you're also betting with a strong hand.
00:30:55.380 | Otherwise, it's an obvious bluff.
00:30:57.020 | - So if your action in some sense
00:30:59.300 | maximizes your unpredictability,
00:31:02.060 | so that action could be mapped by your opponent
00:31:04.220 | to a lot of different hands, then that's a good action.
00:31:07.220 | - Basically what you wanna do is put your opponent
00:31:09.680 | into a tough spot.
00:31:11.040 | So you want them to always have some doubt,
00:31:13.500 | like, "Should I call here?
00:31:14.600 | "Should I fold here?"
00:31:15.900 | And if you are raising in the appropriate balance
00:31:18.920 | between bluffs and good hands,
00:31:20.440 | then you're putting them into that tough spot.
00:31:21.800 | And so that's what we're trying to do.
00:31:22.800 | We're always trying to search for a strategy
00:31:24.760 | that would put the opponent into a difficult position.
00:31:26.840 | - Can you give a metric that you're trying to maximize
00:31:29.880 | or minimize?
00:31:30.720 | Does this have to do with the regret thing
00:31:32.120 | that we're talking about in terms of putting your opponent
00:31:35.360 | in a maximally tough spot?
00:31:37.380 | - Yeah, ultimately what you're trying to maximize
00:31:39.200 | is your expected winnings, your expected value,
00:31:41.740 | the amount of money that you're gonna walk away from,
00:31:43.660 | assuming that your opponent was playing optimally
00:31:46.300 | in response.
00:31:47.420 | So you're gonna assume that your opponent
00:31:49.420 | is also playing as well as possible
00:31:53.540 | a Nash equilibrium approach,
00:31:55.100 | because if they're not,
00:31:56.700 | then you're just gonna make more money.
00:31:59.060 | Anything that deviates, by definition,
00:32:01.340 | the Nash equilibrium is the strategy
00:32:03.780 | that does the best in expectation.
00:32:06.300 | And so if you're deviating from that,
00:32:07.860 | then you're just, they're gonna lose money.
00:32:09.580 | And since it's a two player zero sum game,
00:32:11.140 | that means you're gonna make money.
00:32:12.420 | - So there's not an explicit, like objective function
00:32:15.300 | that maximizes the toughness of the spot they're put in.
00:32:18.860 | You're always, this is from like a self play
00:32:22.060 | reinforcement learning perspective.
00:32:24.100 | You're just trying to maximize winnings
00:32:26.100 | and the rest is implicit.
00:32:27.780 | - That's right, yeah.
00:32:28.620 | So what we're actually trying to maximize
00:32:30.380 | is the expected value,
00:32:31.780 | given that the opponent is playing optimally
00:32:33.420 | in response to us.
00:32:34.520 | Now in practice, what that ends up looking like
00:32:36.700 | is it's putting the opponent into difficult situations
00:32:39.640 | where there's no obvious decision to be made.
00:32:41.860 | - So the system doesn't know anything
00:32:44.060 | about the difficulty of the situation.
00:32:46.340 | - Not at all, doesn't care.
00:32:47.180 | - Okay, in my head it was getting excited
00:32:49.260 | whenever I was making the other, the opponent sweat.
00:32:51.700 | Okay, so you're, in 2015, you didn't do as well.
00:32:55.260 | So what's the journey from that to a system
00:32:57.880 | that in your mind could have a chance?
00:33:00.380 | - So 2015, we got beat pretty badly
00:33:04.460 | and we actually learned a lot from that competition.
00:33:06.740 | And in particular, what became clear to me
00:33:09.580 | is that the way the humans were approaching the game
00:33:11.620 | was very different from how the bot was approaching the game.
00:33:15.000 | The bot would not be doing search.
00:33:17.660 | It would just be trying to compute,
00:33:19.380 | it would do like months of self play.
00:33:21.140 | It would just be playing against itself for months,
00:33:23.180 | but then when it's actually playing the game,
00:33:24.460 | it would just act instantly.
00:33:26.020 | And the humans, when they're in a tough spot,
00:33:28.700 | they would sit there and think for sometimes
00:33:31.540 | even like five minutes about whether they're gonna call
00:33:34.220 | or fold a hand.
00:33:35.060 | And it became clear to me that that's,
00:33:39.720 | there's a good chance that that's what's missing
00:33:41.540 | from our bot.
00:33:42.380 | So I actually did some initial experiments
00:33:45.020 | to try to figure out how much of a difference
00:33:46.300 | does this actually make?
00:33:47.480 | And the difference was huge.
00:33:49.020 | - As a signal to the human player,
00:33:50.900 | how long you took to think?
00:33:52.260 | - No, no, no, I'm not saying
00:33:53.100 | that there were any timing tells.
00:33:54.200 | I was saying when the human,
00:33:55.780 | like the bot would always act instantly.
00:33:57.380 | It wouldn't try to come up with a better strategy
00:33:59.780 | in real time over what it had pre-computed during training.
00:34:04.420 | Whereas the human, like they have all this intuition
00:34:06.580 | about how to play, but they're also in real time
00:34:09.640 | leveraging their ability to think,
00:34:12.420 | just to search, to plan,
00:34:14.000 | and coming up with an even better strategy
00:34:16.140 | than what their intuition would say.
00:34:17.420 | - So you're saying that there's,
00:34:18.980 | you're doing, that's what you mean by
00:34:20.500 | you're doing search also.
00:34:22.300 | You have an intuition and search on top of that,
00:34:26.700 | looking for a better solution.
00:34:28.580 | - Yeah, that's what I mean by search.
00:34:30.220 | That instead of acting instantly,
00:34:33.420 | a neural net usually gives you a response
00:34:36.060 | in like a hundred milliseconds or something.
00:34:37.420 | It depends on the size of the net.
00:34:39.100 | But if you can leverage extra computational resources,
00:34:42.920 | you can possibly get a much better outcome.
00:34:46.420 | And we did some experiments
00:34:48.580 | in small scale versions of poker.
00:34:50.920 | And what we found was that
00:34:53.660 | if you do a little bit of search, even just a little bit,
00:34:58.380 | it was the equivalent of making your,
00:35:01.580 | you know, your pre-computed strategy.
00:35:03.600 | Like you can kind of think of it as your neural net,
00:35:05.320 | a thousand times bigger.
00:35:06.980 | It was just a little bit of search.
00:35:08.520 | And it just like blew away all of the research
00:35:10.960 | that we had been working on
00:35:11.800 | and trying to like scale up this like pre-computed solution.
00:35:15.820 | It was dwarfed by the benefit that we got from search.
00:35:19.740 | - Can you just linger on what you mean by search here?
00:35:22.440 | You're searching over a space of actions
00:35:26.060 | for your hand and for other hands.
00:35:29.020 | How are you selecting the other hands to search over?
00:35:32.140 | - So yeah. - Randomly?
00:35:34.040 | - No, it's all the other hands that you could have.
00:35:35.780 | So when you're playing no limit Texas hold 'em,
00:35:37.640 | you've got two face down cards.
00:35:39.100 | And so that's 52 choose two, 1,326 different combinations.
00:35:43.860 | Now that's actually a little bit lower
00:35:45.020 | because there's face up cards in the middle.
00:35:46.620 | And so you can eliminate those as well.
00:35:48.340 | But you're looking at like around a thousand
00:35:50.140 | different possible hands that you can have.
00:35:51.980 | And so when we're doing, when the bot's doing search,
00:35:54.480 | it's thinking explicitly,
00:35:56.120 | there are these thousand different hands that I could have.
00:35:58.300 | There are these thousand different hands that you could have.
00:36:00.660 | Let me try to figure out what would be a better strategy
00:36:03.440 | than what I've pre-computed for these hands and your hands.
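To give a sense of what "all the other hands" means computationally, here is a minimal sketch (illustrative only; the card encoding and function names are my own assumptions) that enumerates the 1,326 hole-card combinations and filters out the ones blocked by your own cards and the board.

```python
from itertools import combinations

RANKS = "23456789TJQKA"
SUITS = "cdhs"
DECK = [r + s for r in RANKS for s in SUITS]     # 52 cards, e.g. "As" = ace of spades

def possible_opponent_hands(my_cards, board):
    """All hole-card combos the opponent could hold, given the cards we can see."""
    seen = set(my_cards) | set(board)
    remaining = [c for c in DECK if c not in seen]
    return list(combinations(remaining, 2))

print(len(list(combinations(DECK, 2))))                                # 1326 = 52 choose 2
print(len(possible_opponent_hands(["As", "Kd"], ["Qh", "7c", "2s"])))  # 1081 = 47 choose 2
```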
00:36:07.780 | - Okay, so that search, how do you fuse that
00:36:12.780 | with what the neural net is telling you
00:36:15.700 | or what the train system is telling you?
00:36:19.040 | - Yeah, so you kind of like,
00:36:22.040 | where the train system comes in is the value at the end.
00:36:25.920 | So there's, you only look so far ahead.
00:36:29.820 | You look like maybe one round ahead.
00:36:31.800 | So if you're on the flop,
00:36:32.620 | you're looking to the start of the turn.
00:36:34.360 | And at that point you can use the pre-computed solution
00:36:38.800 | to figure out what's the value here of this strategy.
00:36:42.900 | - Is it of a single action, essentially in that spot?
00:36:46.880 | You're getting a value or is it the value
00:36:49.760 | of the entire series of actions?
00:36:52.200 | - Well, it's kind of both
00:36:53.500 | because you're trying to maximize the value
00:36:55.560 | for the hand that you have,
00:36:57.860 | but in the process, in order to maximize the value
00:36:59.900 | of the hand that you have,
00:37:01.080 | you have to figure out what would I be doing
00:37:03.080 | with all these other hands as well.
00:37:04.680 | - Okay, but are you in the search
00:37:06.280 | or was going to the end of the game?
00:37:09.240 | - In Libratus, we did.
00:37:11.520 | So we only use search starting on the turn.
00:37:14.200 | And then we searched all the way to the end of the game.
00:37:17.000 | - The turn, the river.
00:37:18.080 | Can we take it through the terminology?
00:37:22.120 | - Yeah, there's four rounds of poker.
00:37:23.600 | So there's the pre-flop, the flop, the turn and the river.
00:37:26.800 | And so we would start doing search halfway through the game.
00:37:30.920 | Now the first half of the game, that was all pre-computed.
00:37:33.000 | It would just act instantly.
00:37:34.480 | And then when it got to the halfway point,
00:37:37.080 | then it would always search to the end of the game.
00:37:39.040 | Now we later improved this
00:37:40.120 | so it wouldn't have to search all the way
00:37:41.280 | to the end of the game.
00:37:42.120 | It would actually search just a few moves ahead.
00:37:45.000 | But that came later and that drastically reduced
00:37:48.400 | the amount of computational resources that we needed.
00:37:51.240 | - But the moves, 'cause you can keep betting
00:37:53.200 | on top of each other.
00:37:54.160 | That's what you mean by moves.
00:37:55.080 | So like that's where you don't just get one bet
00:37:58.800 | per turn of poker.
00:37:59.920 | You can have multiple arbitrary number of bets, right?
00:38:02.640 | - Right, I'm trying to think like,
00:38:04.600 | I'm gonna bet and then what are you gonna do in response?
00:38:06.640 | Are you gonna raise me or are you gonna call?
00:38:08.320 | And then if you raise, what should I do?
00:38:10.120 | So it's reasoning about that whole process
00:38:12.720 | up until the end of the game in the case of Libratus.
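As a generic illustration of the "search a few moves ahead, then plug in a pre-computed value at the leaves" idea (shown here for a perfect-information toy game; the real poker version has to solve for probabilities over whole ranges of hands rather than pick single best moves, so treat this purely as a sketch of the depth-limit-plus-value-function pattern):

```python
def depth_limited_value(state, depth, value_fn, moves_fn, apply_fn, maximizing=True):
    moves = moves_fn(state)
    if not moves:                          # terminal: the player to move has lost
        return 0.0 if maximizing else 1.0
    if depth == 0:
        return value_fn(state)             # depth limit reached: trust the leaf evaluation
    child_values = [
        depth_limited_value(apply_fn(state, m), depth - 1,
                            value_fn, moves_fn, apply_fn, not maximizing)
        for m in moves
    ]
    return max(child_values) if maximizing else min(child_values)

# Toy game: players alternately take 1 or 2 chips; whoever takes the last chip wins.
value_fn = lambda chips: 0.5                                # crude stand-in leaf estimate
moves_fn = lambda chips: [m for m in (1, 2) if m <= chips]
apply_fn = lambda chips, m: chips - m

print(depth_limited_value(10, 4, value_fn, moves_fn, apply_fn))   # 0.5: had to lean on the leaf estimate
print(depth_limited_value(10, 12, value_fn, moves_fn, apply_fn))  # 1.0: deep enough to solve exactly
```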
00:38:15.560 | - So for Libratus, what's the most number of re-raises
00:38:18.640 | have you ever seen?
00:38:19.560 | - You probably cap out at like five or something
00:38:23.760 | because at that point you're basically all in.
00:38:26.680 | - I mean, is there like interesting patterns like that
00:38:30.040 | that you've seen that the game does?
00:38:31.840 | Like you'll have like AlphaZero doing way more sacrifices
00:38:34.600 | than humans usually do.
00:38:36.920 | Is there something like Libratus was constantly re-raising
00:38:40.840 | or something like that that you've noticed?
00:38:43.040 | - There was something really interesting
00:38:44.480 | that we observed with Libratus.
00:38:46.440 | So humans, when they're playing poker,
00:38:49.360 | they usually size their bets relative
00:38:51.480 | to the size of the pot.
00:38:52.460 | So if the pot has $100 in there,
00:38:55.200 | maybe you bet like $75 or somewhere around there,
00:38:57.600 | somewhere between like 50 and $100.
00:38:59.560 | And with Libratus, we gave it the option
00:39:03.360 | to basically bet whatever it wanted.
00:39:05.120 | It was actually really easy for us to say like,
00:39:07.560 | oh, if you want, you can bet like 10 times the pot.
00:39:09.600 | And we didn't think it would actually do that.
00:39:11.080 | It was just like, why not give it the option?
00:39:13.840 | And then during the competition,
00:39:15.120 | it actually started doing this.
00:39:16.680 | And by the way, this was like a very last minute decision
00:39:18.600 | on our part to add this option.
00:39:19.760 | And so we did not think the bot would do this.
00:39:23.640 | And I was actually kind of worried
00:39:25.640 | when it did start to do this, like, oh, is this a problem?
00:39:27.780 | Like humans don't do this.
00:39:28.720 | Like is it screwing up?
00:39:29.880 | But it would put the humans into really difficult spots
00:39:33.360 | when it would do that.
00:39:34.720 | Because you could imagine like you have the second best hand
00:39:38.200 | that's possible given the board.
00:39:40.160 | And you're thinking like,
00:39:41.000 | oh, you're in a really great spot here.
00:39:42.240 | And suddenly the bot bets $20,000 into a $1,000 pot.
00:39:46.680 | And it's basically saying like,
00:39:48.600 | I have the best hand or I'm bluffing.
00:39:51.860 | And you having the second best hand,
00:39:53.920 | like now you get a really tough choice to make.
00:39:56.440 | And so the humans would sometimes think like five
00:39:59.080 | or 10 minutes about like, what do you do?
00:40:01.080 | Should I call? Should I fold?
00:40:03.120 | And when I saw the humans like really struggling
00:40:06.000 | with that decision, like that's when I realized like,
00:40:07.400 | oh, actually this is maybe a good thing to do after all.
00:40:09.760 | - And of course the system doesn't know,
00:40:13.440 | again, like we said, that it's putting them in a tough spot.
00:40:16.880 | It's just, that's part of the optimal,
00:40:19.880 | the game theory optimal.
00:40:21.280 | - Right, from the bot's perspective,
00:40:22.480 | it's just doing the thing
00:40:23.900 | that's going to make it the most money.
00:40:26.500 | And the fact that it's putting the humans
00:40:28.200 | in a difficult spot, like that's just, you know,
00:40:30.640 | a side effect of that.
00:40:32.200 | And this was, I think the one thing,
00:40:35.040 | I mean, there were a few things
00:40:35.880 | that the humans walked away from,
00:40:36.840 | but this was the number one thing
00:40:39.320 | that the humans walked away from the competition saying like,
00:40:41.680 | we need to start doing this.
00:40:43.800 | And now these overbets, what are called overbets,
00:40:46.200 | have become really common in high-level poker play.
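The standard pot-odds arithmetic shows why these overbets are so uncomfortable to face. Facing a bet B into a pot P, a call needs equity of at least B / (P + 2B) to break even, and a balanced bettor can bluff with roughly that same fraction of their betting range. The sketch below simply evaluates that formula for a normal-sized bet and for the 20x-pot overbet described above; this is textbook poker math, not code from the Libratus project.

```python
def required_call_equity(pot, bet):
    """Break-even equity for calling a bet of `bet` into a pot of `pot`.
    At equilibrium this also equals the bettor's bluffing fraction."""
    return bet / (pot + 2 * bet)

print(required_call_equity(100, 75))      # 0.30 for a typical 3/4-pot bet
print(required_call_equity(1000, 20000))  # ~0.49 for a $20,000 bet into a $1,000 pot
```

In other words, against a 20x-pot overbet even the second-best possible hand is close to a coin flip, which is exactly the "tough spot" being described.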
00:40:48.760 | - Have you ever talked to like somebody like Daniel
00:40:50.760 | Negreanu about this?
00:40:52.120 | He seems to be a student of the game.
00:40:54.120 | - I did actually have a conversation
00:40:55.480 | with Daniel Negreanu once, yeah.
00:40:56.840 | I was visiting the Isle of Man
00:40:59.600 | to talk to PokerStars about AI.
00:41:02.140 | And Daniel Negreanu was there
00:41:04.680 | when we had dinner together with some other people.
00:41:07.560 | And yeah, he was really interested in it.
00:41:10.080 | He mentioned that he was like, you know,
00:41:11.720 | excited about like learning from these AIs.
00:41:14.440 | - So he wasn't scared, he was excited.
00:41:16.360 | - He was excited.
00:41:17.200 | And he honestly, he wanted to play against the bot.
00:41:20.000 | He thought he had a decent chance of beating it.
00:41:23.400 | I think, you know, this was like several years ago
00:41:26.920 | when I think it was like not as clear to everybody
00:41:30.560 | that, you know, the AIs were taking over.
00:41:32.680 | I think now people recognize that like
00:41:34.400 | if you're playing against a bot,
00:41:36.440 | there's like no chance that you have in a game like poker.
00:41:38.640 | - So consistently the bots will win.
00:43:41.120 | The bots win, heads up and in other variants too.
00:41:45.600 | So multi, six player Texas hold 'em,
00:41:49.240 | no limit Texas hold 'em, the bots win?
00:41:51.800 | - Yeah, that's the case.
00:41:52.640 | So I think there is some debate about like,
00:41:54.320 | is it true for every single variant of poker?
00:41:56.160 | I think for every single variant of poker,
00:41:59.080 | if somebody really put in the effort,
00:42:00.920 | they can make an AI that would beat all humans at it.
00:42:04.320 | We've focused on the most popular variants.
00:42:06.760 | So heads up, no limit Texas hold 'em.
00:42:08.720 | And then we followed that up with six player poker as well,
00:42:13.000 | where we managed to make a bot
00:42:14.560 | that beat expert human players.
00:42:16.360 | And I think even there now,
00:42:18.720 | it's pretty clear that humans don't stand a chance.
00:42:20.840 | - See, I would love to hook up an AI system
00:42:22.720 | that looks at EEG, like how,
00:42:26.560 | like actually tries to optimize the toughness of the spot
00:42:29.760 | it puts a human in.
00:42:31.320 | And I would love to see how different is that
00:42:34.000 | from the game theory optimal.
00:42:35.520 | So you try to maximize the heart rate of the human player,
00:42:39.120 | like the freaking out over a long period of time.
00:42:42.720 | I wonder if there's going to be different strategies
00:42:46.480 | that emerge that are close in terms of effectiveness.
00:42:49.760 | 'Cause something tells me you could still be
00:42:53.360 | achieve superhuman level performance
00:42:56.440 | by just making people sweat.
00:42:58.560 | - I feel like that there's a good chance
00:43:00.560 | that that is the case, yeah.
00:43:01.600 | If you're able to see like,
00:43:03.840 | that it's like a decent proxy for score, right?
00:43:06.760 | And this is actually like the common poker wisdom
00:43:09.760 | where they're teaching players, before there were bots,
00:43:12.880 | and they were trying to teach people how to play poker.
00:43:14.760 | They would say like, the key to the game
00:43:16.560 | is to put your opponent into difficult spots.
00:47:18.600 | It's a good estimate of
00:47:20.520 | whether you're making the right decision.
00:43:21.800 | - So what else can you say about
00:43:23.320 | the fundamental role of search in poker?
00:43:27.240 | And maybe if you can also relate it to chess and go
00:43:30.240 | in these games,
00:43:31.420 | what's the role of search to solving these games?
00:43:35.780 | - Yeah, I think a lot of people under,
00:43:39.240 | this is true for the general public
00:43:40.760 | and I think it's true for the AI community.
00:43:42.500 | A lot of people underestimate the importance of search
00:43:45.040 | for these kinds of game AI results.
00:43:48.300 | An example of this is TD Gammon that came out in 1992.
00:43:52.720 | This was the first real instance of a neural net
00:43:55.640 | being used in a game AI.
00:43:57.120 | It's a landmark achievement.
00:43:58.240 | It was actually the inspiration for AlphaZero
00:44:00.560 | and it used search.
00:44:02.000 | It used two-ply search to figure out its next move.
00:44:05.000 | You got Deep Blue.
00:44:06.320 | There, it was very heavily focused on search,
00:44:09.960 | looking many, many moves ahead farther than any human could.
00:44:13.320 | And that was key for why it won.
00:44:15.640 | And then even with something like AlphaGo,
00:44:18.080 | I mean, AlphaGo is commonly hailed
00:44:21.060 | as a landmark achievement for neural nets, and it is,
00:44:24.780 | but there's also this huge component of search,
00:44:26.780 | Monte Carlo Tree Search to AlphaGo,
00:44:29.060 | that was key, absolutely essential
00:44:31.420 | for the AI to be able to beat top humans.
00:44:33.420 | I think a good example of this is you look at
00:44:37.740 | the latest versions of AlphaGo,
00:44:39.900 | like it was called AlphaZero,
00:44:41.560 | and there's this metric called Elo rating
00:44:45.020 | where you can compare different humans
00:44:47.220 | and you can compare bots to humans.
00:44:49.340 | Now, a top human player is around 3,600 Elo,
00:44:53.220 | maybe a little bit higher now.
00:44:55.460 | AlphaZero, the strongest version, is around 5,200 Elo.
00:44:59.940 | But if you take out the search that's being done
00:45:02.980 | at test time, and by the way, what I mean by search
00:45:05.280 | is the planning ahead, the thinking of like,
00:45:07.880 | oh, if I move my, if I place this stone here
00:45:10.320 | and then he does this,
00:45:11.540 | and then you look like five moves ahead
00:45:12.980 | and you see like what the board state looks like,
00:45:15.860 | that's what I mean by search.
00:45:17.220 | If you take out the search that's done during the game,
00:45:19.780 | the Elo rating drops to around 3,000.
00:45:21.860 | So even today, what, seven years after AlphaGo,
00:45:26.980 | if you take out the Monte Carlo Tree Search
00:45:29.080 | that's being done when playing against the human,
00:45:32.620 | the bots are not superhuman.
00:45:34.040 | Nobody has made a raw neural net that is superhuman in Go.
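Those Elo numbers can be translated into expected scores with the standard Elo formula, expected score = 1 / (1 + 10^((R_opponent - R_player)/400)). A quick sketch using the ratings quoted above:

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

print(expected_score(5200, 3600))  # ~0.9999: full AlphaZero (with search) vs top human
print(expected_score(3000, 3600))  # ~0.03: a raw policy net (no search) vs top human
```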
00:45:38.300 | - That's worth lingering on, that's quite profound.
00:45:43.120 | So without search, that just means
00:45:45.460 | looking at the next move and saying,
00:45:48.740 | this is the best move.
00:45:49.840 | So having a function that estimates accurately
00:45:52.920 | what the best move is without search.
00:45:55.940 | - Yeah, and all these bots, they have the,
00:45:58.380 | what's called a policy network, where it will tell you,
00:46:00.700 | this is what the neural net thinks is the next best move.
00:46:03.540 | And it's kind of like the intuition that a human has.
00:46:08.500 | You know, the human looks at the board
00:46:09.980 | and any Go or chess master will be able to tell you like,
00:46:14.580 | oh, instantly, here's what I think the right move is.
00:46:17.780 | And the bot is able to do the same thing.
00:46:19.740 | But just like how a human, grandmaster,
00:46:22.620 | can make a better decision if they have more time to think,
00:46:25.180 | when you add on this Monte Carlo Tree Search,
00:46:27.660 | the bot is able to make a better decision.
00:46:30.460 | - Yeah, I mean, of course a human is doing something
00:46:32.900 | like search in their brain, but it's not,
00:46:35.240 | I hesitate to draw a hard line,
00:46:38.780 | but it's not like a Monte Carlo Tree Search.
00:46:41.700 | It's more like sequential language model generation.
00:46:46.700 | So it's like a different, it's a,
00:46:48.580 | the neural network is doing the searching.
00:46:50.820 | I wonder what the human brain is doing in terms of searching
00:46:53.740 | 'cause you're doing that like computation.
00:46:55.900 | A human is computing.
00:46:57.220 | They have intuition, they have gut,
00:46:58.980 | they have a really strong ability to estimate,
00:47:02.260 | you know, amongst the top players,
00:47:03.980 | of what is good and not position
00:47:05.900 | without calculating all the details.
00:47:08.300 | But they're still doing search in their head,
00:47:10.220 | but it's a different kind of search.
00:47:11.980 | Have you ever thought about like,
00:47:12.940 | what is the difference between the human,
00:47:15.420 | the search that the human is performing
00:47:17.900 | versus what computers are doing?
00:47:21.180 | - I have thought a lot about that,
00:47:22.020 | and I think it's a really important question.
00:47:24.180 | So the AI in AlphaZero and AlphaGo,
00:47:27.820 | or any of these Go AIs,
00:47:29.300 | they're all doing Monte Carlo Tree Search,
00:47:30.860 | which is a particular kind of search.
00:47:32.660 | And it's actually a symbolic tabular search.
00:47:36.380 | It uses the neural net to guide its search,
00:47:38.860 | but it isn't actually like full on neural net.
00:47:42.740 | Now, that kind of search is very successful
00:47:46.180 | in these kinds of like perfect information board games
00:47:48.580 | like chess and Go.
00:47:50.060 | But if you take it to a game like poker, for example,
00:47:52.060 | it doesn't work.
00:47:52.900 | It can't understand the concept of hidden information.
00:47:56.660 | It doesn't understand the balance that you have to strike
00:47:58.740 | between like the amount that you're raising
00:48:00.540 | versus the amount that you're calling.
00:48:02.180 | And in every one of these games,
00:48:04.580 | you see a different kind of search.
00:48:06.660 | And the human brain is able to plan
00:48:08.700 | for all these different games in a very general way.
00:48:11.980 | Now, I think that's one thing
00:48:12.900 | that we're missing from AI today.
00:48:14.380 | And I think it's a really important missing piece.
00:48:16.260 | The ability to plan and reason more generally
00:48:20.500 | across a wide variety of different settings.
00:48:23.820 | - In a way where the general reasoning
00:48:26.380 | makes you better at each one of the games, not worse.
00:48:29.740 | - Yeah, so you can kind of think of it
00:48:30.900 | as like neural nets today,
00:48:32.820 | they'll give you like Transformers, for example,
00:48:34.780 | are super general, but they'll give you,
00:48:37.620 | it'll output an answer in like 100 milliseconds.
00:48:40.540 | And if you tell it like,
00:48:41.380 | "Oh, you've got five minutes to give me a decision,
00:48:43.420 | feel free to take more time to make a better decision."
00:48:45.620 | It's not gonna know what to do with that.
00:48:47.300 | But a human, if you're playing a game like chess,
00:48:50.700 | they're gonna give you a very different answer
00:48:51.860 | depending on if you say,
00:48:53.140 | "Oh, you've got 100 milliseconds
00:48:54.420 | or you've got five minutes."
00:48:55.780 | - Yeah, I mean, people have started using
00:49:00.020 | Transformers language models in an iterative way
00:49:02.980 | that does improve the answer
00:49:04.580 | or like showing the work kind of idea.
00:49:08.100 | - Yeah, they got this thing called
00:49:09.020 | chain of thought reasoning.
00:49:10.060 | And that's, I think-
00:49:11.340 | - Super promising, right?
00:49:12.460 | - Yeah, and I think it's a good step in the right direction.
00:49:15.860 | I would kind of like say it's similar
00:49:17.740 | to Monte Carlo rollouts in a game like chess.
00:49:20.340 | There's a kind of search that you can do
00:49:22.220 | where you're saying like,
00:49:23.060 | "I'm gonna roll out my intuition and see like,
00:49:25.420 | without really thinking,
00:49:27.460 | what are the better decisions I can make
00:49:28.740 | farther down the path?
00:49:30.740 | What would I do if I just acted according to intuition
00:49:32.820 | for the next 10 moves?"
00:49:34.980 | And that gets you an improvement,
00:49:36.980 | but I think that there's much richer kinds of planning
00:49:40.780 | that we could do.
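A minimal sketch of that kind of rollout evaluation, in the spirit of what's described: roll the fast "intuition" policy forward a fixed number of moves, average the outcomes, and spend extra thinking time on more rollouts. The `state` and `policy` interfaces here are hypothetical placeholders, not any real game API.

```python
def rollout_value(state, policy, depth=10, n_rollouts=32):
    """Estimate a state's value by letting the fast 'intuition' policy
    play the next `depth` moves, averaged over several rollouts."""
    total = 0.0
    for _ in range(n_rollouts):
        s = state
        for _ in range(depth):
            if s.is_terminal():
                break
            s = s.apply(policy.sample_action(s))
        total += s.heuristic_score()  # terminal payoff or a rough evaluation
    return total / n_rollouts

def choose_move(state, policy, rollout_budget):
    """More thinking time -> larger rollout_budget -> better estimates."""
    return max(state.legal_actions(),
               key=lambda a: rollout_value(state.apply(a), policy,
                                           n_rollouts=rollout_budget))
```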
00:49:42.140 | - So when Libratus actually beat the poker players,
00:49:44.300 | what did that feel like?
00:49:45.900 | What was that?
00:49:46.740 | I mean, actually on that day,
00:49:48.980 | what were you feeling like?
00:49:49.980 | Were you nervous?
00:49:51.700 | I mean, poker was one of the games that you thought
00:49:55.100 | like is not gonna be solvable
00:49:56.540 | 'cause it's the human factor.
00:49:57.700 | So at least in the narratives,
00:50:00.220 | we'll tell ourselves the human factor
00:50:02.820 | is so fundamental to the game of poker.
00:50:05.300 | - Yeah, the Libratus competition
00:50:06.460 | was super stressful for me.
00:50:07.820 | Also, I mean, I was working on this
00:50:11.340 | like basically continuously for a year
00:50:13.100 | leading up to the competition.
00:50:14.460 | I mean, for me, it became like very clear,
00:50:16.580 | like, okay, this is the search technique,
00:50:18.340 | this is the approach that we need.
00:50:19.620 | And then I spent a year working on this
00:50:21.260 | pretty much like nonstop.
00:50:23.020 | - Can we actually get into details?
00:50:24.420 | Like what programming languages is it written in?
00:50:26.860 | What's some interesting implementation details
00:50:30.220 | that are like fun/painful?
00:51:33.340 | - Yeah, so one of the interesting things about Libratus
00:50:35.860 | is that we had no idea what the bar was
00:50:37.620 | to actually beat top humans.
00:50:39.940 | We could play against like our prior bots
00:50:41.500 | and that kind of gives us some sense of like,
00:50:42.860 | are we making progress?
00:50:44.060 | Are we going in the right direction?
00:50:45.900 | But we had no idea like what the bar actually was.
00:50:48.060 | And so we threw a huge amount of resources
00:50:50.900 | at trying to make the strongest bot possible.
00:50:52.980 | So we use C++, it was parallelized.
00:50:55.540 | We were using, I think like a thousand CPUs,
00:50:58.340 | maybe more actually.
00:51:00.140 | And today that sounds like nothing,
00:51:02.260 | but for a grad student back in 2016,
00:51:04.740 | that was a huge amount of resources.
00:51:06.500 | - Well, it's still a lot for even any grad student today.
00:51:09.220 | It's still tough to get,
00:51:11.060 | or even to allow yourself to think in that,
00:51:15.180 | in terms of scale at CMU, at MIT, anything like that.
00:52:18.900 | - Yeah, and we're talking about terabytes of memory.
00:52:21.980 | So it was very parallelized,
00:51:24.300 | and it had to be very fast too,
00:51:25.620 | because the more games that you could simulate,
00:51:28.740 | the stronger the bot would be.
00:51:30.140 | - So is there some like John Carmack style,
00:51:33.500 | like efficiencies you had to come up with,
00:51:35.980 | like an efficient way to represent the hand,
00:51:39.100 | all that kind of stuff?
00:51:40.180 | - There were all sorts of optimizations that I had to make
00:51:42.460 | to try to get this thing to run as fast as possible.
00:51:44.500 | They were like, how do you minimize the latency?
00:51:46.620 | How do you like, you know, package things together
00:51:48.860 | so that like you minimize the amount of communication
00:51:50.940 | between the different nodes?
00:51:52.340 | How do you like optimize the algorithms
00:51:55.180 | so that you can, you know,
00:51:56.860 | try to squeeze out more and more
00:51:58.180 | from the game that you're actually playing?
00:51:59.780 | All these kinds of different decisions
00:52:01.220 | that I, you know, had to make.
00:52:03.980 | - Just a fun question.
00:52:05.020 | What IDE did you use?
00:52:07.580 | What for C++ at the time?
00:52:10.620 | - I think I used Visual Studio actually.
00:52:12.780 | - Okay. - Yeah.
00:52:13.620 | - Is that still carried through to today?
00:52:15.580 | - VS Code is what I use today.
00:52:17.060 | It seems like it's pretty popular.
00:53:17.900 | - It's what the community basically converged on.
00:52:19.980 | Okay, cool.
00:52:20.820 | So you got this super optimized C++ system,
00:52:25.340 | and then you show up to the day of competition.
00:52:28.220 | - Yeah.
00:52:29.220 | - Humans versus machine.
00:52:30.940 | How did it feel throughout the day?
00:52:34.420 | - Super stressful.
00:52:35.380 | I mean, I thought going into it
00:52:38.340 | that we had like a 50/50 chance.
00:52:40.020 | Because basically I thought if they play
00:52:42.820 | in a totally normal style, I think we'll squeak out a win.
00:52:45.780 | But there's always a chance
00:52:47.620 | that they can find some weakness in the bot.
00:52:49.820 | And if they do, and we're playing like for 20 days,
00:52:52.460 | 120,000 hands of poker.
00:52:53.660 | They have a lot of time to find weaknesses in the system.
00:52:56.340 | And if they do, we're gonna get crushed.
00:52:58.820 | And that's actually what happened
00:52:59.860 | in the previous competition.
00:53:01.900 | The humans, they started out,
00:53:03.300 | it wasn't like they were winning from the start.
00:53:05.500 | But then they found these weaknesses
00:53:06.780 | that they could take advantage of.
00:53:08.140 | And for the next 10 days,
00:53:09.900 | they were just crushing the bot, stealing money from it.
00:53:12.980 | - What were the weaknesses they found?
00:53:14.300 | Like maybe over betting was effective,
00:53:16.820 | that kind of stuff.
00:53:17.660 | So certain betting strategies worked?
00:53:19.620 | - What they found is, yeah, over betting,
00:53:21.980 | like betting certain amounts,
00:53:23.020 | the bot would have a lot of trouble
00:53:24.060 | dealing with those sizes.
00:53:25.460 | And then also, when the bot got
00:53:29.500 | into really difficult all-in situations,
00:53:31.820 | it wasn't able to, because it wasn't doing search,
00:53:35.220 | it had to clump different hands together
00:53:37.980 | and it would treat them identically.
00:53:40.540 | And so it wouldn't be able to distinguish,
00:53:42.700 | you know, like having a king high flush
00:53:44.860 | versus an ace high flush.
00:53:46.180 | And in some situations that really matters a lot.
00:53:48.180 | And so they could put the bot into those situations
00:53:50.660 | and then the bot would just bleed money.
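That "clumping" is usually called card abstraction: strategically similar hands are mapped into the same bucket so the precomputed strategy stays tractable, at the cost of treating them identically. A toy sketch (the equities and bucket count are made-up numbers):

```python
def equity_bucket(hand_equity, n_buckets=10):
    """Lossy abstraction: hands with similar equity share a bucket,
    and the precomputed strategy plays every hand in a bucket the same way."""
    return min(int(hand_equity * n_buckets), n_buckets - 1)

# A king-high flush and an ace-high flush are both very strong...
print(equity_bucket(0.93), equity_bucket(0.97))  # -> 9 9 (same bucket)
# ...so without search, the bot can't tell them apart in an all-in spot
# where that difference decides the whole pot.
```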
00:53:52.820 | - Clever humans.
00:53:54.100 | Okay, so I didn't realize it was over 20 days.
00:53:57.500 | So what were the humans like over those 20 days?
00:54:02.500 | And what was the bot like?
00:54:04.460 | - So we had set up the competition, you know,
00:54:06.700 | like I said, there was $200,000 in prize money
00:54:09.140 | and they would get paid a fraction of that
00:54:12.100 | depending on how well they did relative to each other.
00:54:14.380 | So I was kind of hoping that they wouldn't work together
00:54:16.420 | to try to find weaknesses in the bot,
00:54:18.420 | but they enter the competition
00:54:20.220 | with their like number one objective being to beat the bot.
00:54:22.700 | And they didn't care about like individual glory.
00:54:24.900 | They were like, we're all gonna work as a team
00:54:26.540 | to try to take down this bot.
00:54:28.100 | And so they immediately started comparing notes.
00:54:31.020 | What they would do is they would coordinate
00:54:34.380 | looking at different parts of the strategy
00:54:36.380 | to try to, you know, find out weaknesses.
00:54:39.060 | And then at the end of the day,
00:54:41.940 | we actually sent them a log of all the hands
00:54:43.780 | that were played and what cards the bot had
00:54:45.900 | on each of those hands.
00:54:46.900 | - Oh wow. - Yeah.
00:54:48.460 | - That's gutsy.
00:54:49.380 | - Yeah, it was honestly,
00:54:50.380 | and I'm not sure why we did that in retrospect,
00:54:52.020 | but I mean, I'm glad we did it
00:54:53.700 | 'cause we ended up winning anyway,
00:54:54.780 | but that if you've ever played poker before,
00:54:57.260 | like that is golden information.
00:54:58.900 | I mean, to know, usually when you play poker,
00:55:01.220 | you see about a third of the hands to show down
00:55:03.380 | and to just hand them all the cards
00:55:06.420 | that the bot had on every single hand,
00:55:08.380 | that was just a gold mine for them.
00:55:11.740 | And so then they would review the hands
00:55:13.700 | and try to see like, okay,
00:55:14.700 | could they find patterns in the bot, weaknesses?
00:55:16.860 | And could they, then they would coordinate
00:55:19.500 | and study together and try to figure out,
00:55:20.940 | okay, now this person's gonna explore
00:55:23.180 | this part of the strategy for weaknesses.
00:55:24.620 | This person's gonna explore this part
00:55:25.700 | of the strategy for weaknesses.
00:55:28.060 | - It's a kind of psychological warfare,
00:55:30.180 | showing them the hands.
00:55:31.500 | - Yeah.
00:55:32.540 | - I mean, I'm sure you didn't think of it that way,
00:55:33.900 | but like doing that means you're confident
00:55:36.900 | in the bot's ability to win.
00:55:38.740 | - Well, that's one way of putting it.
00:55:39.980 | I wasn't super confident.
00:55:41.540 | - Yeah.
00:55:42.380 | - So, going in, like I said,
00:55:44.140 | I think I had like 50/50 odds on us winning.
00:55:46.580 | When we actually, when we announced the competition,
00:55:49.620 | the poker community decided to gamble on who would win.
00:55:52.820 | And their initial odds against us were like four to one.
00:55:55.260 | They were really convinced that the humans
00:55:56.700 | were gonna pull out a win.
00:55:58.460 | The bot ended up winning for three days straight.
00:56:02.980 | And even then after three days,
00:56:04.660 | the betting odds were still just 50/50.
00:56:06.660 | And then at that point, it started to look like
00:56:11.180 | the humans were coming back.
00:56:13.740 | They started to like, you know,
00:56:14.900 | but poker is a very high variance game.
00:56:17.540 | And I think what happened is like,
00:56:19.500 | they thought that they spotted some weaknesses
00:56:21.140 | that weren't actually there.
00:56:22.660 | And then around day eight,
00:56:24.100 | it was just very clear
00:56:25.220 | that they were getting absolutely crushed.
00:56:27.300 | And from that point, I mean, for a while there,
00:56:30.380 | I was super stressed out thinking like,
00:56:32.460 | oh my God, the humans are coming back
00:56:33.700 | and they've found weaknesses
00:56:35.460 | and now we're just gonna lose the whole thing.
00:56:37.340 | But no, it ended up going in the other direction
00:56:39.620 | and the bot ended up like crushing them in the long run.
00:56:42.780 | - How did it feel at the end?
00:56:45.100 | Like as a human being,
00:56:47.700 | as a person who loves,
00:56:49.340 | appreciates the beauty of the game of poker
00:56:51.580 | and as a person who appreciates the beauty of AI,
00:56:55.500 | is there, did you feel a certain kind of way about it?
00:56:58.740 | - I felt a lot of things, man.
00:57:01.560 | I mean, at that point in my life,
00:57:03.660 | I had spent five years working on this project
00:57:05.980 | and it was a huge sense of accomplishment.
00:57:09.580 | I mean, to spend five years working on something
00:57:11.380 | and finally see it succeed.
00:57:12.780 | Yeah, I wouldn't trade that for anything in the world.
00:57:16.060 | - Yeah, because that's a real benchmark.
00:57:18.020 | It's not like getting some percent accuracy on a data set.
00:57:23.020 | This is like real, this is real world.
00:57:26.380 | It's just a game, but it's also a game
00:57:28.660 | that means a lot to a lot of people.
00:57:30.700 | And this is humans doing their best to beat the machine.
00:57:33.420 | So this is a real benchmark, unlike anything else.
00:57:36.460 | - Yeah, and I mean, this is what I had been dreaming about
00:57:39.460 | since I was like 16 playing poker,
00:57:41.380 | you know, with my friends in high school.
00:57:43.200 | The idea that you could find a strategy
00:57:46.220 | to approximate the Nash equilibrium,
00:57:48.060 | be able to beat all the poker players in the world with it.
00:57:51.420 | So to actually see that come to fruition and be realized,
00:57:55.540 | that was, it's kind of magical.
00:57:58.460 | - Yeah, especially money is on the line too.
00:58:00.540 | It's different than chess.
00:58:02.940 | And that aspect, like people get,
00:58:05.780 | that's why you want to look at betting markets
00:58:08.100 | if you want to actually understand what people really think.
00:58:11.500 | And in the same sense, poker, it's really high stakes
00:58:14.220 | 'cause it's money.
00:58:15.580 | And to solve that game, that's an amazing accomplishment.
00:58:18.820 | So the leap from that to multi-way six player poker,
00:58:23.820 | what's, how difficult is that jump?
00:58:27.460 | And what are some interesting differences
00:58:28.940 | between heads up poker and multi-way poker?
00:58:32.500 | - Yeah, so I mentioned, you know,
00:58:34.100 | Nash equilibrium in two player zero-sum games.
00:58:37.100 | If you play that strategy,
00:58:38.180 | you are guaranteed to not lose an expectation
00:58:40.180 | no matter what your opponent does.
00:58:41.880 | Now, once you go to six player poker,
00:58:43.340 | you're no longer playing a two player zero-sum game.
00:58:45.420 | And so there was a lot of debate
00:58:46.580 | among the academic community and among the poker community
00:58:49.220 | about how well these techniques would extend
00:58:51.440 | beyond just two player heads up poker.
00:58:54.600 | Now, what I had come to realize is that
00:58:57.780 | the techniques actually I thought
00:59:00.900 | really would extend to six player poker
00:59:03.300 | because even though in theory,
00:59:05.100 | they don't give you these guarantees
00:59:06.980 | outside of two player zero-sum games,
00:59:08.820 | in practice, it still gives you a really strong strategy.
00:59:11.780 | Now, there were a lot of complications
00:59:13.860 | that would come up with six player poker
00:59:16.100 | besides like the game theoretic aspect.
00:59:17.860 | I mean, for one, the game is just exponentially larger.
00:59:21.100 | So the main thing that allowed us
00:59:24.460 | to go from two player to six player
00:59:26.560 | was the idea of depth limited search.
00:59:29.260 | So I said before, like, you know, we would do search,
00:59:31.980 | we would plan out, the bot would plan out
00:59:34.020 | like what it's going to do next
00:59:35.540 | and for the next several moves.
00:59:37.100 | And in Libratus, that search was done
00:59:39.380 | extending all the way to the end of the game.
00:59:41.340 | So it would have to start from the turn onwards,
00:59:46.340 | like looking maybe 10 moves ahead,
00:59:49.540 | it would have to figure out
00:59:51.340 | what it was doing for all those moves.
00:59:53.580 | Now, when you get to six player poker,
00:59:55.060 | it can't do that exhaustive search anymore
00:59:57.260 | 'cause the game is just way too large.
00:59:59.160 | But by only having to look a few moves ahead
01:00:03.060 | and then stopping there and substituting a value estimate
01:00:06.140 | of like how good is that strategy at that point,
01:00:08.700 | then we're able to do a much more scalable form of search.
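In its simplest single-agent form, the idea looks like the sketch below: search a few moves deep, and where the search stops, substitute a value estimate instead of playing out to the end of the game. The actual method used in Pluribus is more involved (it solves an imperfect-information subgame over belief distributions and lets opponents choose among several continuation strategies at the leaves), so treat this only as the generic skeleton; the `state` interface is hypothetical.

```python
def depth_limited_value(state, depth, value_estimate):
    """Generic depth-limited search: recurse `depth` moves ahead,
    then fall back on a (learned or precomputed) value estimate."""
    if state.is_terminal():
        return state.payoff()
    if depth == 0:
        # Instead of searching to the end of the game, plug in an estimate
        # of how good this position/strategy is from here on.
        return value_estimate(state)
    return max(depth_limited_value(state.apply(a), depth - 1, value_estimate)
               for a in state.legal_actions())
```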
01:00:12.200 | - Is there something cool,
01:00:15.300 | looking at the paper right now,
01:00:17.060 | is there something cool in the paper in terms of graphics?
01:00:20.220 | A game tree traversal via Monte Carlo.
01:00:22.460 | - I think if you go down a bit.
01:00:24.020 | - Figure one, an example of equilibrium selection problem.
01:00:29.220 | Ooh, so yeah.
01:00:30.980 | What do we know about equilibria
01:00:32.820 | when there's multiple players?
01:00:34.780 | - So when you go outside of two-player zero-sum games.
01:00:38.120 | So a Nash equilibrium is a set of strategies,
01:00:39.980 | like one strategy for each player,
01:00:41.680 | where no player has an incentive
01:00:43.380 | to switch to a different strategy.
01:00:45.340 | And so you can kind of think of it as like,
01:00:48.300 | imagine you have a game where there's a ring.
01:00:51.620 | That's actually the visual here.
01:00:52.740 | You got a ring and the object of the game
01:00:55.320 | is to be as far away from the other players as possible.
01:00:58.780 | There's a Nash equilibrium is for all the players
01:01:01.540 | to be spaced equally apart around this ring.
01:01:04.580 | But there's infinitely many different Nash equilibria.
01:01:06.660 | There's infinitely many ways
01:01:08.100 | to space four dots along a ring.
01:01:11.100 | And if every single player independently
01:01:14.220 | computes a Nash equilibrium,
01:01:16.220 | then there's no guarantee that the joint strategy
01:01:18.900 | that they're all playing is going to be a Nash equilibrium.
01:01:22.680 | They're just gonna be like random dots
01:01:24.580 | scattered along this ring,
01:01:25.540 | rather than four coordinated dots
01:01:27.480 | being equally spaced apart.
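A quick numerical sketch of that coordination failure: two evenly-spaced placements are each perfectly good equilibria on their own, but if every player takes their position from a different equilibrium, the joint placement is no longer coordinated. The angles below are arbitrary illustrative examples.

```python
import itertools

def min_pairwise_gap(angles_deg):
    """Smallest circular distance, in degrees, between any two players on the ring."""
    gaps = []
    for a, b in itertools.combinations(angles_deg, 2):
        d = abs(a - b) % 360
        gaps.append(min(d, 360 - d))
    return min(gaps)

eq_a  = [0, 90, 180, 270]    # one Nash equilibrium: evenly spaced
eq_b  = [45, 135, 225, 315]  # a different, equally valid equilibrium
mixed = [0, 135, 180, 315]   # each player independently picked an equilibrium

print(min_pairwise_gap(eq_a))   # 90
print(min_pairwise_gap(eq_b))   # 90
print(min_pairwise_gap(mixed))  # 45 -- jointly, this is not an equilibrium anymore
```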
01:01:28.820 | - Is it possible to sort of optimally
01:01:30.380 | do this kind of selection,
01:01:32.940 | to do the selection of the equilibria you're chasing?
01:01:37.620 | So is there like a meta problem to be solved here?
01:01:40.220 | - So the meta problem is in some sense,
01:01:42.540 | how do you understand the Nash equilibria
01:01:45.340 | that the other players are going to play?
01:01:47.340 | And even if you do that, again,
01:01:51.420 | there's no guarantee that you're going to win.
01:01:53.140 | So, if you're playing risk, like I said,
01:01:58.140 | and all the other players decide to team up against you,
01:02:00.980 | you're gonna lose.
01:02:01.800 | Nash equilibrium doesn't help you there.
01:02:03.940 | And so there is this big debate about
01:02:06.420 | whether Nash equilibrium and all these techniques
01:02:08.540 | that compute it are even useful
01:02:10.460 | once you go outside of two player zero-sum games.
01:02:13.080 | Now, I think for many games,
01:02:15.160 | there is a valid criticism here.
01:02:17.020 | And I think when we talk about,
01:02:18.080 | when we go to something like diplomacy,
01:02:19.800 | we run into this issue that the approach
01:02:23.500 | of trying to approximate a Nash equilibrium
01:02:25.980 | doesn't really work anymore.
01:02:27.820 | But it turns out that in six player poker,
01:02:30.520 | because six player poker is such an adversarial game,
01:02:33.220 | where none of the players
01:02:35.820 | really try to work with each other,
01:02:38.160 | the techniques that were used in two player poker
01:02:40.180 | to try to approximate an equilibrium,
01:02:41.960 | those still end up working in practice
01:02:43.820 | in six player poker as well.
01:02:45.340 | - There's some deep way in which six player poker
01:02:49.200 | is just a bunch of heads up poker, like games in one.
01:02:53.700 | It's like embedded in it.
01:02:55.540 | So the competitiveness is more fundamental to poker
01:03:00.100 | than the cooperation.
01:03:01.620 | - Right, yeah.
01:03:02.440 | Poker is just such an adversarial game.
01:03:03.780 | There's no real cooperation.
01:03:05.280 | In fact, you're not even allowed to cooperate in poker.
01:03:07.380 | It's considered collusion.
01:03:08.340 | It's against the rules.
01:03:09.500 | And so for that reason,
01:03:12.260 | the techniques end up working really well.
01:03:13.680 | And I think that's true more broadly
01:03:16.120 | in extremely adversarial games in general.
01:03:18.300 | - But that's sort of in practice
01:03:20.100 | versus being able to prove something.
01:03:22.380 | - That's right.
01:03:23.200 | Nobody has a proof that that's the case.
01:03:24.360 | And it could be that six player poker
01:03:26.780 | belongs to some class of games
01:03:28.760 | where approximating an equilibrium through self-play
01:03:33.340 | provably works well.
01:03:34.580 | And there are other classes of games
01:03:37.520 | beyond just two players, zero sum,
01:03:39.080 | where this is proven to work well.
01:03:40.900 | So there are these kinds of games called potential games,
01:03:43.220 | which I won't go into.
01:03:44.360 | It's kind of like a complicated concept,
01:03:45.860 | but there are classes of games
01:03:49.480 | where this approach to approximating an equilibrium
01:03:53.200 | is proven to work well.
01:03:54.800 | Now, six player poker is not known to belong
01:03:57.060 | to one of those classes,
01:03:57.920 | but it is possible that there is some class of games
01:03:59.860 | where it either provably performs well
01:04:01.820 | or provably performs not that badly.
01:04:04.200 | - So what are some interesting things about Pluribus
01:04:08.180 | that was able to achieve human level performance
01:04:10.900 | on this or superhuman level performance
01:04:13.660 | on the six player version of poker?
01:04:16.180 | - Personally, I think the most interesting thing
01:04:18.300 | about Pluribus is that it was so much cheaper than Libratus.
01:04:22.780 | I mean, Libratus, if you had to put a price tag
01:04:25.540 | on the computational resources that went into it,
01:04:27.740 | I would say the final training run took about $100,000.
01:04:31.160 | You go to Pluribus, the final training run
01:04:34.360 | would cost like less than $150 on AWS.
01:04:37.940 | - Is this normalized to computational inflation?
01:04:41.540 | So meaning, does this just have to do with the fact
01:04:46.200 | that Pluribus was trained like a year later?
01:04:49.120 | - No, no, no, it's not.
01:04:50.240 | I mean, first of all, like, yeah,
01:04:51.920 | computing resources are getting cheaper every day,
01:04:55.000 | but you're not gonna see a thousand fold decrease
01:04:57.680 | in the computational resources over two years
01:05:00.600 | or even anywhere close to that.
01:05:02.060 | The real improvement was algorithmic improvements
01:05:04.720 | and in particular, the ability to do depth limited search.
01:05:08.440 | - So does depth limited search also work for Libratus?
01:05:12.420 | - Yeah, yes.
01:05:13.260 | So where this depth limited search came from is,
01:05:15.760 | you know, I developed this technique
01:05:17.440 | and ran it on two player poker first
01:05:21.080 | and that reduced the computational resources needed
01:05:24.140 | to make an AI that was superhuman
01:05:26.240 | from, you know, $100,000 for Libratus
01:05:28.800 | to something you could train on your laptop.
01:05:31.600 | - What do you learn from that, from that discovery?
01:05:35.940 | - What I would take away from that
01:05:38.080 | is that algorithmic improvements really do matter.
01:05:40.200 | - How would you describe the more general case
01:05:43.360 | of limited depth search?
01:05:45.200 | So it's basically constraining the scale, temporal,
01:05:48.120 | or in some other way of the computation you're doing,
01:05:51.400 | in some clever way.
01:05:53.280 | So like with, like how else can you significantly
01:05:56.700 | constrain computation, right?
01:05:59.640 | - Well, I think the idea is that we want to be able
01:06:02.200 | to leverage search as much as possible.
01:06:04.160 | And the way that we were doing it in Libratus
01:06:05.960 | required us to search all the way to the end of the game.
01:06:08.600 | Now, if you're playing a game like chess,
01:06:09.960 | the idea that you're gonna search always
01:06:11.440 | to the end of the game is kind of unimaginable, right?
01:06:14.120 | Like there's just so many situations
01:06:15.360 | where you just won't be able to use search in that case
01:06:17.480 | or the cost would be, you know, prohibitive.
01:06:20.960 | And this technique allowed us to leverage search
01:06:25.860 | and without having to pay such a huge
01:06:27.920 | computational cost for it,
01:06:29.480 | and be able to apply it more broadly.
01:06:31.760 | - So to what degree did you use neural nets
01:06:33.920 | for Libratus and Pluribus?
01:06:36.600 | And more generally, what role do neural nets have to play
01:06:40.320 | in superhuman level performance in poker?
01:06:44.640 | - So we actually did not use neural nets at all
01:06:46.760 | for Libratus or Pluribus.
01:06:49.220 | And a lot of people found this surprising back in 2017.
01:06:52.920 | I think they find it surprising today
01:06:55.440 | that we were able to do this without using any neural nets.
01:06:58.540 | And I think the reason for that,
01:07:01.360 | I mean, I think neural nets are incredibly powerful
01:07:04.920 | and the techniques that are used today,
01:07:06.840 | even for poker AIs, do rely quite heavily on neural nets.
01:07:11.760 | But it wasn't the main challenge for poker.
01:07:14.160 | Like I think what neural nets are really good for,
01:07:17.320 | if you're in a situation where finding features
01:07:20.440 | for a value function is really difficult,
01:07:23.000 | then neural nets are really powerful.
01:07:24.720 | And this was the problem in Go, right?
01:07:26.360 | Like the problem in Go was that,
01:07:28.760 | or the final problem in Go at least,
01:07:30.600 | was that nobody had a good way of looking at a board
01:07:33.880 | and figuring out who was winning or losing,
01:07:35.920 | describing through a simple algorithm
01:07:38.500 | who was winning or losing.
01:07:40.300 | And so there neural nets were super helpful
01:07:42.800 | because you could just feed in a ton
01:07:44.680 | of different board positions into this neural net,
01:07:46.920 | and it would be able to predict then
01:07:48.360 | who was winning or losing.
01:07:49.800 | But in poker, the features weren't the challenge.
01:07:53.400 | The challenge was how do you design a scalable algorithm
01:07:57.640 | that would allow you to find this balanced strategy
01:08:00.920 | that would understand that you have to bluff
01:08:03.660 | with the right probability?
01:08:05.560 | - So can that be somehow incorporated
01:08:08.000 | into the value function?
01:08:10.040 | The complexity of poker that you've described?
01:08:14.860 | - Yeah, so the way the value functions work in poker,
01:08:17.360 | like the latest and greatest poker AIs,
01:08:19.260 | they do use neural nets for the value function.
01:08:22.020 | The way it's done is very different
01:08:24.380 | from how it's done in a game like chess or Go,
01:08:26.360 | because in poker, you have to reason about beliefs.
01:08:31.100 | And so the value of a state depends on the beliefs
01:08:35.600 | that players have about what the different cards are.
01:08:39.300 | Like if you have pocket aces,
01:08:41.700 | then whether that's a really, really good hand
01:08:44.460 | or just an okay hand depends on whether you know
01:08:46.980 | I have pocket aces.
01:08:48.440 | Rather, if you know that I have pocket aces,
01:08:50.580 | then if I bet, you're gonna fold immediately.
01:08:53.420 | But if you think that I have a really bad hand,
01:08:55.720 | then I could bet with pocket aces and make a ton of money.
01:08:58.320 | So the value function in poker these days
01:09:02.740 | takes the beliefs as an input,
01:09:05.100 | which is very different from how chess and Go AIs work.
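A toy illustration of why the value function has to be conditioned on beliefs (this is a made-up two-line model, not the architecture of any real poker AI): the same cards are worth different amounts depending on what the opponent believes about them.

```python
def toy_value_of_aces(pot, bet, p_opponent_believes_you_are_strong):
    """Made-up model: you hold pocket aces and bet. If the opponent believes
    you're strong they fold (you win just the pot); if they think you're weak
    they call (you win pot + bet, assuming aces always win at showdown here)."""
    p_fold = p_opponent_believes_you_are_strong
    return p_fold * pot + (1 - p_fold) * (pot + bet)

print(toy_value_of_aces(100, 100, 0.9))  # 110: they know you're strong
print(toy_value_of_aces(100, 100, 0.1))  # 190: they think you're bluffing
```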
01:09:08.200 | - So as a person who appreciates the game,
01:09:13.700 | who do you think is the greatest poker player of all time?
01:09:16.940 | - That's a tough question.
01:09:19.140 | - Can AI help answer that question?
01:09:20.900 | Can you actually analyze the quality of play?
01:09:24.860 | So the chess engines can give estimates
01:09:28.940 | of the quality of play.
01:09:30.100 | I wonder if there's a,
01:09:34.060 | is there an Elo rating type of system for poker?
01:09:37.700 | I suppose you could, but there's just not enough.
01:09:41.220 | You would have to play a lot of games, right?
01:09:43.700 | A very large number of games,
01:09:45.100 | like more than you would in chess.
01:09:46.380 | The deterministic game makes it easier to estimate Elo.
01:09:49.740 | I think.
01:09:50.580 | - I think it is much harder to estimate
01:09:52.660 | something like Elo rating in poker.
01:09:54.180 | I think it's doable.
01:09:55.320 | The problem is that the game is very high variance.
01:09:57.700 | So you could play,
01:09:59.140 | you could be profitable in poker for a year
01:10:02.100 | and you could actually be a bad player
01:10:03.540 | just because the variance is so high.
01:10:05.260 | I mean, you've got top professional poker players
01:10:07.340 | that would lose for a year
01:10:08.980 | just because they're on a really bad streak.
01:10:12.300 | - So for Elo, you have to have a nice clean way of saying
01:10:16.180 | if player A played player B and A beats B,
01:10:20.700 | that says something, that's a signal.
01:10:22.700 | In poker, that's a very noisy signal.
01:10:24.580 | - It's a very noisy signal.
01:10:25.540 | Now there is a signal there.
01:10:26.500 | And so you could do this calculation.
01:10:29.300 | It would just be much harder.
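To put a rough number on how noisy that signal is: assuming a win rate of 5 big blinds per 100 hands and a standard deviation of 80 big blinds per 100 hands (typical orders of magnitude, not figures from the conversation), it takes on the order of a hundred thousand hands before a 95% confidence interval around the observed win rate even excludes zero.

```python
# Rough back-of-the-envelope: how many hands before a winner's results
# are statistically distinguishable from break-even? (Assumed numbers.)
win_rate = 5.0   # big blinds per 100 hands (assumption)
std_dev = 80.0   # big blinds per 100 hands (assumption)

blocks_of_100 = (1.96 * std_dev / win_rate) ** 2
print(f"~{round(blocks_of_100) * 100:,} hands")  # ~98,300 hands
```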
01:10:31.540 | But the same way that AIs have now taken over chess
01:10:35.180 | and all the top professional chess players train with AIs,
01:10:40.180 | the same is true for poker.
01:10:42.220 | The game has become a very computational,
01:10:46.180 | people train with AIs to try to find out
01:10:48.220 | where they're making mistakes,
01:10:49.900 | try to learn from the AIs to improve their strategy.
01:10:52.980 | So now, yeah, so the game has been revolutionized
01:10:57.740 | in the past five years by the development of AI
01:11:00.180 | in this sport.
01:11:01.020 | - The skill with which you avoided the question
01:11:03.060 | of the greatest of all time was impressive.
01:11:05.180 | - So my feeling is that it's a difficult question
01:11:08.020 | because just like in chess,
01:11:10.620 | where you can't really compare Magnus Carlsen today
01:11:13.220 | to Garry Kasparov, because the game has evolved so much.
01:11:17.260 | The poker players today are so far beyond the skills
01:11:23.180 | of people that were playing even 10 or 20 years ago.
01:11:27.420 | So you look at the kinds of all-stars that were on ESPN
01:11:30.980 | at the height of the poker boom,
01:11:33.260 | pretty much all those players are actually not that good
01:11:35.540 | at the game today, at least the strategy aspect.
01:11:39.380 | I mean, they might still be good at reading the player
01:11:42.180 | at the other side of the table and trying to figure out
01:11:44.340 | are they bluffing or not?
01:11:45.620 | But in terms of the actual computational strategy
01:11:48.340 | of the game, a lot of them have really struggled
01:11:50.860 | to keep up with that development.
01:11:52.780 | Now, so for that reason, I'll give an answer
01:11:55.900 | and I'm gonna say Daniel Legranio,
01:11:58.140 | who you actually had on the podcast recently,
01:11:59.820 | I saw it was a great episode.
01:12:00.980 | - I love this so much.
01:12:02.300 | (laughing)
01:12:03.660 | And Phil's gonna hate this so much.
01:12:05.780 | - And I'm gonna give him credit
01:12:08.540 | because he is one of the few old school,
01:12:11.580 | really strong players that have kept up
01:12:14.180 | with the development of AI.
01:12:15.180 | - So he is trying to, he's constantly studying
01:12:17.860 | the game theory optimal way of playing.
01:12:19.780 | - Exactly, yeah.
01:12:20.900 | And I think a lot of the old school poker players
01:12:23.260 | have just kind of given up on that aspect
01:12:24.780 | and I gotta give Daniel Negreanu credit
01:12:26.860 | for keeping up with all the developments
01:12:29.620 | that are happening in the sport.
01:12:31.380 | - Yeah, it's fascinating to watch.
01:12:32.500 | It's fascinating to watch where it's headed.
01:12:34.740 | Yeah, so there you go, some love for Daniel.
01:12:38.260 | Quick pause, bathroom break?
01:12:40.780 | - Yeah, let's do it.
01:12:42.180 | - Let's go from poker to diplomacy.
01:12:45.260 | What is at a high level the game of diplomacy?
01:12:48.320 | - Yeah, so I talked a lot about two players,
01:12:51.180 | zero sum games.
01:12:52.020 | And what's interesting about diplomacy
01:12:54.340 | is that it's very different from these adversarial games
01:12:59.500 | like chess, go, poker, even Starcraft and Dota.
01:13:02.820 | Diplomacy has a much bigger cooperative element to it.
01:13:05.900 | It's a seven player game.
01:13:07.580 | It was actually created in the fifties
01:13:10.220 | and it takes place before World War I.
01:13:13.700 | It's like a map of Europe with seven great powers
01:13:16.780 | and they're all trying to form alliances with each other.
01:13:20.540 | There's a lot of negotiation going on.
01:13:22.440 | And so the whole focus of the game
01:13:25.540 | is on forming alliances with the other players
01:13:28.980 | to take on the other players.
01:13:30.340 | - England, Germany, Russia, Turkey,
01:13:32.740 | Austria-Hungary, Italy, and France.
01:13:35.580 | - That's right, yeah.
01:13:36.900 | So the way the game works is on each turn,
01:13:41.000 | you spend about five to 15 minutes
01:13:43.820 | talking to the other players in privates
01:13:46.220 | and you make all sorts of deals with them.
01:13:48.820 | You say like, "Hey, let's work together.
01:13:50.820 | Let's team up against this other player."
01:13:53.040 | Because the only way that you can make progress
01:13:54.660 | is by working with somebody else against the others.
01:13:58.740 | And then after that negotiation period is done,
01:14:01.300 | all the players simultaneously submit their moves
01:14:05.440 | and they're all executed at the same time.
01:14:07.780 | And so you can tell people like,
01:14:09.280 | "Hey, I'm gonna support you this turn,"
01:14:11.780 | but then you don't follow through with it.
01:14:13.260 | And they're only gonna figure that out
01:14:14.980 | once they see the moves being read off.
01:14:16.980 | - How much of it is natural language,
01:14:18.540 | like written actual text?
01:14:21.060 | How much is like,
01:14:22.780 | you're actually saying phrases that are structured?
01:14:25.920 | - So there's different ways to play the game.
01:14:27.940 | You can play it in person,
01:14:29.140 | and in that case, it's all natural language.
01:14:31.780 | Free form communication.
01:14:32.940 | There's no constraints on the kinds of deals
01:14:34.500 | that you can make, the kinds of things that you can discuss.
01:14:37.700 | You can also play it online.
01:14:38.960 | So you can send long emails back and forth.
01:14:41.660 | You can play it live online or over voice chat.
01:14:46.580 | But the focus, the important thing to understand
01:14:49.020 | is that this is unstructured communication.
01:14:51.060 | You can say whatever you want.
01:14:52.980 | You can make any sorts of deals that you want
01:14:54.660 | and everything is done privately.
01:14:56.940 | So it's not like you're all around the board together
01:15:00.020 | having a conversation.
01:15:01.620 | You're grabbing somebody going off into a corner
01:15:03.460 | and conspiring behind everybody else's back
01:15:05.720 | about what you're planning.
01:15:07.020 | - And there's no limit in theory to the conversation
01:15:10.780 | you can have directly with one person.
01:15:12.580 | - That's right.
01:15:13.420 | You can make all sorts of,
01:15:14.580 | you can talk about anything.
01:15:15.460 | You can say like,
01:15:16.300 | "Hey, let's have a long-term alliance against this guy."
01:15:17.780 | You can say like,
01:15:18.620 | "Hey, can you support me this turn?
01:15:20.020 | And in return, I'll do this other thing for you next turn."
01:15:23.020 | Or, you know, yeah,
01:15:24.720 | just you can talk about like what you talked about
01:15:26.960 | with somebody else
01:15:27.800 | and gossip about like what they're planning.
01:15:30.760 | The way that I would describe the game
01:15:32.060 | is that it's kind of like a mix between Risk,
01:15:34.840 | poker, and the TV show "Survivor."
01:15:37.620 | There's like this big element of like trying to,
01:15:40.420 | yeah, there's a big social element.
01:15:43.680 | And the best way that I would describe the game
01:15:45.440 | is that it's really a game about people
01:15:47.600 | rather than the pieces.
01:15:48.800 | - So Risk, because it is a map,
01:15:52.420 | it's kind of war game-like.
01:15:54.900 | Poker, because there's a game theory component
01:15:58.720 | that's very kind of strategic.
01:16:00.760 | So you could convert it
01:16:01.840 | into an artificial intelligence problem.
01:16:04.200 | And then Survivor, because of the social component,
01:16:06.400 | strong social component.
01:16:07.880 | I saw that somebody said online
01:16:09.240 | that the internet version of the game
01:16:12.240 | has this quality of that it's easier
01:16:14.920 | to almost to do like role-playing.
01:16:17.560 | As opposed to being yourself,
01:16:19.520 | you can actually like be the,
01:16:21.360 | like really imagine yourself as the leader of France
01:16:24.300 | or Russia and so on.
01:16:25.500 | Like really pretend to be that person.
01:16:28.320 | It's actually fun to really lean into being that leader.
01:16:32.820 | - Yeah, so some players do go this route
01:16:34.700 | where they just like kind of view it as a strategy game,
01:16:37.060 | but also a role-playing game where they can like act out,
01:16:39.100 | like, what would I be like if I was, you know,
01:16:41.500 | a leader of France in 1900?
01:16:43.580 | - I'll forfeit right away.
01:16:44.500 | No, I'm just kidding.
01:16:45.500 | And they sometimes use like the old-timey language
01:16:50.260 | to like, or how they imagined the elites
01:16:53.700 | would talk at that time.
01:16:54.960 | Anyway, so what are the different turns of the game?
01:16:57.840 | Like what are the rounds?
01:16:59.660 | - Yeah, so on every turn,
01:17:01.160 | you got like a bunch of different units
01:17:03.360 | that you start out with.
01:17:04.200 | So you start out controlling like just a few units
01:17:07.480 | and the object of the game is to gain control
01:17:09.720 | of a majority of the map.
01:17:10.560 | If you're able to do that, then you've won the game.
01:17:13.120 | But like I said, the only way that you're able to do that
01:17:15.280 | is by working with other players.
01:17:16.840 | So on every turn, you can issue a move order.
01:17:19.440 | So for each of your units,
01:17:20.780 | you can move them to an adjacent territory,
01:17:23.960 | or you can keep them where they are,
01:17:26.340 | or you can support a move or a hold of a different unit.
01:17:31.720 | - What are the territories?
01:17:32.560 | Well, how is the map divided up?
01:17:34.300 | - It's kind of like risk where the map is divided up
01:17:36.980 | into like 50 different territories.
01:17:38.960 | Now you can enter a territory
01:17:42.460 | if you're moving into that territory with more supports
01:17:45.300 | than the person that's in there
01:17:47.000 | or the person that's trying to move in there.
01:17:48.900 | So if you're moving in and there's somebody already there,
01:17:51.660 | then if neither of you have support, it's a one versus one
01:17:54.380 | and you'll bounce back and neither of you will make progress.
01:17:56.880 | If you have a unit that's supporting
01:17:58.620 | that move into the territory, then it's a two versus one
01:18:01.300 | and you'll kick them out
01:18:02.300 | and they'll have to retreat somewhere.
01:18:03.980 | - What does support mean?
01:18:04.900 | - Support is like, it's an action
01:18:06.620 | that you can issue in the game.
01:18:07.700 | So you can say this unit, you write down,
01:18:10.060 | this unit is supporting this other unit into this territory.
01:18:13.140 | - Are these units from opposing forces?
01:18:16.100 | - They could be, they could be.
01:18:17.020 | And this is where the interesting aspect
01:18:18.620 | of the game comes in
01:18:19.460 | because you can support your own units into territory,
01:18:22.340 | but you can also support other people's units
01:18:24.460 | into territories.
01:18:25.300 | And so that's what the negotiations really revolve around.
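A toy sketch of the adjudication rule as described, where supports can come from your own units or from another player's, which is what gets negotiated. Real Diplomacy adjudication has many more cases (cut supports, convoys, and so on), so this only covers the basic move-versus-hold situation.

```python
def resolve_attack(attacker_supports, defender_supports):
    """Basic contested move: the attacker enters only with strictly more
    strength (1 + supports) than the unit already in the territory."""
    attack_strength = 1 + attacker_supports
    defense_strength = 1 + defender_supports
    if attack_strength > defense_strength:
        return "move succeeds, defender must retreat"
    return "bounce, nobody moves"

print(resolve_attack(0, 0))  # 1 vs 1 -> bounce
print(resolve_attack(1, 0))  # 2 vs 1 -> attacker dislodges the defender
```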
01:18:28.740 | - But you don't have to do the thing you say
01:18:30.860 | you're going to do, right?
01:18:32.860 | - Yeah, and so this is--
01:18:33.700 | - So you can say, I'm gonna support you,
01:18:34.940 | but then backstab the person.
01:18:36.940 | - Yeah, that's absolutely right.
01:18:38.420 | - And that tension is core to the game?
01:18:41.020 | - That tension is absolutely core to the game.
01:18:43.420 | The fact that you can make all sorts of promises,
01:18:46.640 | but you have to reason about the fact that like,
01:18:49.060 | hey, they might not trust you
01:18:50.300 | if you say you're gonna do something,
01:18:51.900 | or they might be lying to you
01:18:54.140 | when they say that they're gonna support you.
01:18:56.300 | - So maybe just to jump back,
01:18:59.340 | what's the history of the game in general?
01:19:01.360 | Is it true that Henry Kissinger loved the game
01:19:03.820 | and JFK and all those?
01:19:05.300 | I've heard like a bunch of different people that,
01:19:07.340 | or is that just one of those things
01:19:08.900 | that the cool kids say they do,
01:19:10.420 | but they don't actually play?
01:19:11.500 | - So the game was created in the '50s.
01:19:13.380 | - Yeah.
01:19:14.620 | - And from what I understand,
01:19:18.120 | it was played in the JFK White House,
01:19:19.600 | and it was Henry Kissinger's favorite game.
01:19:21.080 | I don't know if it's true,
01:19:22.320 | but that's definitely what I've heard.
01:19:23.880 | - It's interesting that they went with World War I
01:19:27.160 | when it was created after World War II.
01:19:29.600 | - So the story that I've heard for the creation of the game
01:19:32.620 | is it was created by somebody that had looked at
01:19:36.640 | the history of the 20th century,
01:19:39.200 | and they saw World War I as a failure of diplomacy.
01:19:44.120 | - Yeah.
01:19:44.960 | - They saw the fact that this war broke out
01:19:47.200 | as like the diplomats of all these countries
01:19:49.880 | really failed to prevent a war.
01:19:51.440 | And he wanted to create a game
01:19:52.640 | that would basically teach people about diplomacy.
01:19:55.140 | And it's really fascinating that in his ideal version
01:20:00.900 | of the game of diplomacy, nobody actually wins the game.
01:20:03.640 | Because the whole point is that if somebody is about to win,
01:20:05.880 | then the other players should be able to work together
01:20:08.200 | to stop that person from winning.
01:20:10.200 | And so the ideal version of the game
01:20:11.920 | is just one where nobody actually wins.
01:20:13.800 | And it kind of has a nice, wholesome take-home message
01:20:16.760 | then, that war is ultimately futile.
01:20:20.660 | - And that optimum, that futile optimum,
01:20:25.800 | could be achieved through great diplomacy.
01:20:28.080 | - Yeah.
01:20:28.920 | - So is there some asymmetry in terms of
01:20:32.580 | which is more powerful, Russia versus Germany
01:20:35.160 | versus France and so on?
01:20:38.100 | - So I think the general consensus is that France
01:20:40.760 | is the strongest power in the game.
01:20:42.120 | But the beautiful thing about diplomacy
01:20:44.200 | is that it's self-balancing, right?
01:20:45.860 | So the fact that France has an inherent advantage
01:20:48.380 | from the beginning means that the other players
01:20:51.200 | are less likely to work with it.
01:20:53.000 | - I saw that Russia has four units,
01:20:54.800 | or four of something,
01:20:56.260 | while the others have three of something.
01:20:58.000 | - That's true, yeah.
01:20:58.840 | So Russia starts off with four units
01:20:59.840 | while all the other players start with three.
01:21:02.040 | But Russia is also in a much more vulnerable position
01:21:04.560 | because they have to like,
01:21:06.400 | they have a lot more neighbors as well.
01:21:07.920 | - Got it.
01:21:08.760 | Larger territory, more, yeah, right.
01:21:11.000 | More border to defend.
01:21:13.040 | Okay, what else is important to know about the rules?
01:21:17.680 | So how many rounds are there?
01:21:20.040 | Like, is this iterative game?
01:21:22.120 | Is it finite?
01:21:23.840 | Do you just keep going indefinitely?
01:21:25.440 | - Usually the game lasts, I would say about 15 or 20 turns.
01:21:30.040 | There's in theory, no limit.
01:21:32.240 | It could last longer, but at some point,
01:21:34.280 | I mean, if you're playing a house game with friends,
01:21:35.880 | at some point you just get tired and you all agree like,
01:21:37.880 | okay, we're gonna end the game here and call it a draw.
01:21:41.320 | If you're playing online, there's usually like set limits
01:21:43.320 | on when the game will actually end.
01:21:45.000 | - And what's the end, what's the termination condition?
01:21:47.600 | Like, does one country have to conquer everything else?
01:21:52.600 | - So if somebody is able to actually gain control
01:21:55.200 | of a majority of the map, then they've won the game.
01:21:57.480 | And that is a solo victory as it's called.
01:21:59.880 | Now that pretty rarely happens,
01:22:01.480 | especially with strong players, because like I said,
01:22:03.320 | the game is designed to incentivize the other players
01:22:06.240 | to put a stop to that and all work together
01:22:07.760 | to stop the superpower.
01:22:09.640 | Usually what ends up happening is that, you know,
01:22:12.800 | all the players agree to a draw and then the score,
01:22:16.640 | the win is divided among the remaining players.
01:22:20.360 | There's a lot of different scoring systems.
01:22:21.760 | The one that we used in our research basically gives
01:22:26.400 | a score relative to how much control you have of the map.
01:22:29.800 | So the more that you control, the higher your score.
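To make "the more you control, the higher your score" concrete, here is a minimal sketch where a drawn game's credit is split in proportion to supply centers controlled. The actual scoring system used in the research may weight control differently (sum-of-squares systems are common online), so treat this as illustrative:

```python
def draw_scores(supply_centers: dict) -> dict:
    # Split one point of credit among survivors in proportion to board control.
    total = sum(supply_centers.values())
    return {power: count / total for power, count in supply_centers.items()}

# Hypothetical final position of a drawn game (34 supply centers total).
print(draw_scores({"France": 12, "Turkey": 9, "England": 7, "Germany": 6}))
# {'France': 0.35..., 'Turkey': 0.26..., 'England': 0.20..., 'Germany': 0.17...}
```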
01:22:32.640 | - What's the history of using this game
01:22:35.600 | as a benchmark for AI research?
01:22:37.800 | Do people use it?
01:22:39.720 | - Yeah, so people have been working on AI for diplomacy
01:22:42.880 | since about the '80s.
01:22:44.360 | There was some really exciting research back then,
01:22:47.640 | but the approach that was taken was very different
01:22:50.760 | from what we see today.
01:22:51.600 | I mean, the research in the '80s was a very rule-based
01:22:54.120 | approach, kind of a heuristic approach.
01:22:56.160 | It was very in line with the kind of research
01:22:57.800 | that was being done in the '80s.
01:22:59.520 | You know, basically trying to encode human knowledge
01:23:01.640 | into the strategy of the AI.
01:23:03.560 | - Sure.
01:23:04.840 | - And, you know, it's understandable.
01:23:06.280 | I mean, the game is so incredibly different
01:23:08.600 | and so much more complicated than the kinds of games
01:23:12.080 | that people were working on like chess and go and poker
01:23:15.720 | that it was honestly even hard to like start
01:23:20.080 | making any progress in diplomacy.
01:23:22.280 | - Can you just formulate what is the problem
01:23:24.960 | from an AI perspective and why is it hard?
01:23:27.600 | Why is it a challenging game to solve?
01:23:29.640 | - So there's a lot of aspects in diplomacy
01:23:31.320 | that make it a huge challenge.
01:23:33.640 | First of all, you have the natural language components.
01:23:36.720 | And I think this really is what makes it
01:23:39.280 | arguably the most difficult game
01:23:42.400 | among like the major benchmarks.
01:23:43.880 | The fact that you have to,
01:23:46.320 | it's not about moving pieces on the board.
01:23:49.760 | Your action space is basically all the different sentences
01:23:53.520 | that you could communicate to somebody else in this game.
01:23:57.480 | - Is there, can we just like linger on that?
01:23:59.400 | So is part of it like the ambiguity in the language?
01:24:04.400 | If it was like very strict,
01:24:07.920 | if you narrowed the set of possible sentences
01:24:10.040 | you could do it,
01:24:10.880 | would that simplify the game significantly?
01:24:12.720 | - The real difficulty is the breadth of things
01:24:16.640 | that you can talk about.
01:24:17.840 | You can have natural language in other games
01:24:20.960 | like Settlers of Catan, for example,
01:24:22.280 | like you could have a natural language,
01:24:23.760 | Settlers of Catan AI.
01:24:25.400 | But the things that you're gonna talk about
01:24:26.920 | are basically like, am I trading you two sheep for a wood
01:24:29.320 | or three sheep for a wood?
01:24:30.680 | Whereas in a game like Diplomacy,
01:24:33.560 | the breadth of conversations that you're going to have
01:24:35.760 | are like, am I going to support you?
01:24:38.120 | Are you gonna support me in return?
01:24:39.520 | Which units are gonna do what?
01:24:41.800 | What did this other person promise you?
01:24:45.240 | They're lying because they told this other person
01:24:47.040 | that they're gonna do this instead.
01:24:49.080 | If you help me out this turn,
01:24:50.280 | then in the future I'll do these things
01:24:52.480 | that will help you out.
01:24:53.640 | The depth and breadth of these conversations
01:24:58.200 | is really complicated.
01:25:01.000 | And it's all being done in natural language.
01:25:03.200 | Now you could approach it,
01:25:05.400 | and we actually considered doing this,
01:25:06.640 | like having a simplified language
01:25:09.200 | to make this complexity smaller.
01:25:12.640 | But ultimately we thought the most impactful way
01:25:16.360 | of doing this research would be to address
01:25:19.400 | the natural language component head on
01:25:21.360 | and just try to go for the full game upfront.
01:25:24.320 | - Just looking at sample games
01:25:25.920 | and what the conversations look like.
01:25:27.880 | Greetings England, this should prove to be a fun game
01:25:30.560 | since all the private press
01:25:32.760 | is going to be made public at the end.
01:25:35.200 | At the least it will be interesting to see
01:25:37.440 | if the press changes because of that.
01:25:39.320 | Anyway, good, okay.
01:25:40.440 | So there's like a--
01:25:41.840 | - Yeah, that's just kind of like the generic greetings
01:25:43.560 | at the beginning of the game.
01:25:44.400 | I think that the meat comes a little bit later
01:25:46.160 | when you're starting to talk about
01:25:48.040 | specific strategy and stuff.
01:25:50.120 | - I agree there are a lot of advantages
01:25:52.280 | to the two of us keeping in touch
01:25:54.640 | and our nations make strong natural allies
01:25:58.000 | in the middle game.
01:25:58.880 | So that kind of stuff, making friends, making enemies.
01:26:02.480 | - Yeah, or like if you look at the next line,
01:26:03.800 | so the person saying like,
01:26:05.040 | I've heard bits about a Lepanto and an octopus opening
01:26:08.640 | and basically telling Austria like,
01:26:10.280 | hey, just a heads up, you know,
01:26:11.760 | I've heard these whispers about like
01:26:13.040 | what might be going on behind your back.
01:26:14.960 | - Yeah, so there's all kinds of complexities
01:26:18.680 | in the language of that, right?
01:26:22.760 | Like to interpret what the heck that means.
01:26:25.200 | It's hard for us humans, but for AI it's even harder
01:26:27.920 | 'cause you have to understand like at every level
01:26:30.120 | the semantics of that.
01:26:31.800 | - Right, I mean, there's the complexity in understanding
01:26:34.240 | when somebody is saying this to me, what does that mean?
01:26:36.640 | And then there's also the complexity of like,
01:26:38.640 | should I be telling this person this?
01:26:40.120 | Like I've overheard these whispers,
01:26:42.080 | should I be telling this person that like,
01:26:43.520 | hey, you might be getting attacked by this other power?
01:26:46.720 | - Okay, so how are we supposed to think about?
01:26:51.800 | Okay, so that's the natural language.
01:26:54.160 | How do you even begin trying to solve this game?
01:26:56.160 | It seems like the Turing test on steroids.
01:27:00.120 | - Yeah, and I mean, there's the natural language aspect.
01:27:02.280 | And then even besides the natural language aspect,
01:27:04.320 | you also have the cooperative elements of the game.
01:27:07.640 | And I think this is actually something
01:27:09.840 | that I find really interesting.
01:27:11.160 | If you look at all the previous game AI breakthroughs,
01:27:15.240 | they've all happened in these purely adversarial games
01:27:17.240 | where you don't actually need to understand
01:27:19.080 | how humans play the game.
01:27:20.480 | It's all just AI versus AI, right?
01:27:23.040 | Like you look at checkers, chess, go, poker,
01:27:27.680 | Starcraft, Dota 2, like in some of those cases,
01:27:31.280 | they leveraged human data, but they never needed to.
01:27:33.840 | They were always just trying to have a scalable algorithm
01:27:38.520 | that then they could throw a lot of computational resources
01:27:41.720 | at a lot of memory at, and then eventually it would converge
01:27:45.280 | to an approximation of a Nash equilibrium.
01:27:47.760 | This perfect strategy that in a two player zero-sum game
01:27:51.920 | guarantees that they're going to be able
01:27:53.520 | to not lose to any opponent.
01:27:55.840 | - So you can't leverage self-play to solve this game.
01:27:58.320 | - You can leverage self-play,
01:27:59.840 | but it's no longer sufficient to beat humans.
01:28:02.760 | - So how do you integrate the human into the loop of this?
01:28:05.240 | - So what you have to do is incorporate human data.
01:28:08.560 | And to kind of give you some intuition
01:28:11.000 | for why this is the case,
01:28:12.000 | like imagine you're playing a negotiation game,
01:28:13.800 | like diplomacy, but you're training completely from scratch
01:28:18.600 | without any human data.
01:28:19.920 | The AI is not going to suddenly like figure out
01:28:23.200 | how to communicate in English.
01:28:24.360 | It's going to figure out some weird robot language
01:28:27.280 | that only it will understand.
01:28:29.040 | And then when you stick that in a game
01:28:30.440 | with six other humans,
01:28:31.960 | they're going to think this person's talking gibberish
01:28:34.320 | and they're just going to ally with each other
01:28:35.480 | and team up against the bot,
01:28:37.280 | or not even team up against the bot,
01:28:38.400 | but just not work with the bot.
01:28:39.880 | And so in order to be able to play this game with humans,
01:28:43.640 | it has to understand the human way of playing the game,
01:28:46.280 | not this machine way of playing the game.
01:28:48.680 | - Yeah, yeah, that's fascinating.
01:28:50.600 | So, right.
01:28:51.720 | That's a nuanced thing to understand
01:28:54.960 | 'cause a chess playing program
01:28:58.440 | doesn't need to play like a human to beat a human.
01:29:01.000 | - Exactly.
01:29:01.840 | - But here you have to play like a human
01:29:03.560 | in order to beat them.
01:29:04.680 | - Or at least you have to understand
01:29:06.040 | how humans play the game
01:29:07.000 | so that you can understand how to work with them.
01:29:08.920 | If they have certain expectations
01:29:10.640 | about what does it mean to be a good ally?
01:29:13.160 | What does it mean to have like a reciprocal relationship
01:29:16.240 | where we're working together?
01:29:17.520 | You have to abide by those conventions.
01:29:20.040 | And if you don't,
01:29:20.880 | they're just going to work with somebody else instead.
01:29:23.120 | - Do you think of this as a clean,
01:29:26.360 | in some deep sense of the spirit of the Turing test
01:29:28.800 | as formulated by Alan Turing?
01:29:30.480 | Is it, in some sense,
01:29:32.460 | this is what the Turing test actually looks like?
01:29:35.060 | - So, because of the open-ended natural language conversation,
01:29:40.520 | it seems very difficult to evaluate.
01:29:44.080 | Like here, at high stakes,
01:29:46.040 | where humans are trying to win a game,
01:29:47.760 | that seems like how you actually perform the Turing test.
01:29:51.960 | - I think it's different from the Turing test.
01:29:53.720 | Like the way that the Turing test is formulated,
01:29:55.880 | it's about trying to distinguish a human from a machine
01:29:59.120 | and seeing, oh, could the machine successfully pass
01:30:02.760 | as a human in this adversarial setting
01:30:04.640 | where the player is trying to figure out
01:30:07.320 | whether it's a machine or a human.
01:30:08.960 | Whereas in diplomacy, it's not about trying to figure out
01:30:11.960 | whether this player is a human or a machine.
01:30:14.400 | It's ultimately about whether I can work with this player
01:30:17.560 | regardless of whether they are a human or a machine.
01:30:19.880 | And can the machine do that better than a human can?
01:30:22.980 | - Yeah, I'm going to think about that,
01:30:26.100 | but that just feels like the implied requirement for that
01:30:31.100 | is for the machine to be human-like.
01:30:33.500 | - I think that's true,
01:30:35.880 | that if you're going to play in this human game,
01:30:39.080 | you have to somehow adapt to the human surroundings
01:30:43.360 | and the human play style.
01:30:44.720 | - And to win, you have to adapt.
01:30:47.220 | So you can't, if you're the outsider,
01:30:50.080 | if you're not human-like,
01:30:51.600 | I feel like that's a losing strategy.
01:30:53.840 | - I think that's correct, yeah.
01:30:55.800 | - Yeah, so, okay.
01:30:57.100 | What are the complexities here?
01:31:00.720 | What was your approach to it?
01:31:02.640 | - Before I get to that,
01:31:03.480 | one thing I should explain
01:31:04.720 | why we decided to work on Diplomacy.
01:31:07.320 | So basically what happened is in 2019,
01:31:10.560 | I was wrapping up the work on six-player poker on Pluribus
01:31:15.320 | and was trying to think about what to work on next.
01:31:17.880 | And I had been seeing all these other breakthroughs
01:31:20.840 | happening in AI.
01:31:21.680 | I mean, 2019, you have StarCraft,
01:31:24.600 | you have AlphaStar beating humans in StarCraft,
01:31:26.920 | you've got the Dota 2 stuff happening at OpenAI,
01:31:30.000 | you have GPT-2 or GPT-3 coming,
01:31:32.520 | I think it was GPT-2 at the time.
01:31:34.520 | And it became clear that AI was progressing
01:31:37.560 | really, really rapidly.
01:31:39.400 | And people were throwing out these other games
01:31:42.440 | about what should be the next challenge for multi-agent AI.
01:31:46.800 | And I just felt like we had to aim bigger.
01:31:50.060 | If you look at a game like chess or a game like Go,
01:31:54.480 | they took decades for researchers
01:31:56.320 | to ultimately reach superhuman performance at.
01:31:59.440 | I mean, chess took 40 years of AI research,
01:32:02.480 | and Go took another 20 years.
01:32:03.960 | And we thought that diplomacy
01:32:08.720 | would be this incredibly difficult challenge
01:32:10.760 | that could easily take a decade
01:32:12.840 | to make an AI that could play competently.
01:32:14.920 | But we felt like that was a goal worth aiming for.
01:32:18.200 | And so honestly, I was kind of reluctant
01:32:21.520 | to work on it at first
01:32:22.400 | because I thought it was like too far
01:32:24.960 | out of the realm of possibility.
01:32:26.120 | But I was talking to a coworker of mine, Adam Lerer,
01:32:28.680 | and he was basically saying like,
01:32:30.040 | "Why not aim for it?
01:32:31.480 | "We'll learn some interesting things along the way
01:32:33.000 | "and maybe it'll be possible."
01:32:34.480 | And so we decided to go for it.
01:32:36.320 | And I think it was the right choice
01:32:38.880 | considering just how much progress there was in AI
01:32:43.000 | and that progress has continued in the years since.
01:32:45.480 | - So winning in diplomacy, what does that really look like?
01:32:50.240 | It means talking to six other players,
01:32:54.520 | six other entities, agents,
01:32:57.480 | and convincing them of stuff
01:33:01.020 | that you want them to be convinced of.
01:33:03.600 | Like what exactly, I'm trying to get like,
01:33:06.120 | to deeply understand what the problem is.
01:33:08.400 | - Ultimately, the problem is simple to quantify, right?
01:33:14.120 | Like you're going to play this game with humans
01:33:16.360 | and you want your score on average
01:33:18.840 | to be as high as possible.
01:33:21.080 | You know, if you can say like,
01:33:22.600 | "I am winning more than any human alive,"
01:33:26.560 | then you're a champion diplomacy player.
01:33:30.000 | Now, ultimately we didn't reach that.
01:33:31.620 | We got to human level performance.
01:33:32.860 | We actually, so we played about 40 games
01:33:35.480 | with real humans online.
01:33:38.280 | The bot came in second out of all players
01:33:40.420 | that played five or more games.
01:33:41.980 | And so not like number one, but way, way higher than-
01:33:46.580 | - What was the expertise level?
01:33:48.460 | Are they beginners?
01:33:49.580 | Are they intermediate players, advanced players?
01:33:51.940 | Do you have a sense?
01:33:52.880 | - That's a great question.
01:33:53.780 | And so I think this kind of goes into
01:33:55.940 | how do you measure the performance in diplomacy?
01:33:58.060 | And I would argue that when you're measuring performance
01:34:00.160 | in a game like this,
01:34:01.360 | you don't actually want to measure it
01:34:03.160 | in games with all expert players.
01:34:05.760 | It's kind of like if you're developing a self-driving car,
01:34:08.580 | you don't want to measure that car on the road
01:34:11.220 | with a bunch of expert stunt drivers.
01:34:13.400 | You want to put it on a road of like an actual American city
01:34:16.400 | and see is this car crashing less often
01:34:19.880 | than an expert driver would.
01:34:21.920 | So that's the metric that we've used.
01:34:24.080 | We're saying like,
01:34:26.480 | we're going to stick this bot in games
01:34:27.860 | with a wide variety of skill levels.
01:34:30.460 | And then are we doing better than a strong
01:34:33.500 | or expert human player would in the same situation?
01:34:36.820 | - That's quite brilliant.
01:34:37.820 | 'Cause I played a lot of sports in my life,
01:34:39.740 | like I did tennis, judo, whatever.
01:34:42.220 | And it's somehow almost easier
01:34:44.900 | to go against experts almost always.
01:34:46.940 | I don't know.
01:34:47.780 | I think they're more predictable in the quality of play.
01:34:51.020 | The space of strategies you're operating under
01:34:54.260 | is narrower against experts.
01:34:56.760 | It's more fun.
01:34:57.600 | It's really frustrating to go against beginners.
01:34:59.480 | Also 'cause beginners talk trash to you
01:35:01.960 | when they somehow do beat you.
01:35:03.760 | So that's a human thing that they add;
01:35:05.800 | not something to be worried about here.
01:35:07.100 | But yeah, the variance in strategies is greater,
01:35:10.720 | especially with natural language.
01:35:12.000 | It's just all over the place then.
01:35:13.960 | - Yeah.
01:35:14.800 | And honestly, when you look at
01:35:17.320 | what makes a good human diplomacy player,
01:35:20.520 | obviously they're able to handle themselves
01:35:22.060 | in games with other expert humans,
01:35:23.280 | but where they really shine
01:35:24.780 | is when they're playing with these weak players.
01:35:26.640 | And they know how to take advantage
01:35:29.000 | of the fact that they're a weak player,
01:35:30.520 | that they won't be able to pull off a stab as well,
01:35:33.320 | or that they have certain tendencies
01:35:35.400 | and they can take them under their wing
01:35:36.720 | and persuade them to do things
01:35:38.240 | that might not even be in their interest.
01:35:41.080 | The really good diplomacy players
01:35:42.480 | are able to take advantage of the fact
01:35:45.080 | that there are some weak players in the game.
01:35:47.520 | - Okay, so if you have to incorporate human play data,
01:35:50.360 | how do you do that?
01:35:51.680 | How do you do that in order to train
01:35:53.320 | an AI system to play diplomacy?
01:35:55.960 | - Yeah, so that's really the crux of the problem.
01:35:58.660 | How do we leverage the benefits of self-play
01:36:02.440 | that have been so successful
01:36:04.040 | in all these other previous games,
01:36:06.180 | while keeping the strategy as human compatible as possible?
01:36:10.800 | And so what we did is we first trained a language model,
01:36:14.760 | and then we made that language model controllable
01:36:17.960 | on a set of intents, what we call intents,
01:36:21.560 | which are basically like an action that we want to play
01:36:24.160 | and an action that we would like the other player to play.
01:36:27.160 | And so this gives us a way to generate dialogue
01:36:29.400 | that's not just trying to imitate the human style,
01:36:31.880 | whatever a human would say in this situation,
01:36:34.320 | but to actually give it an intent,
01:36:36.720 | a purpose in its communication.
01:36:38.800 | We can talk about a specific move
01:36:40.600 | or we can make a specific request.
01:36:42.600 | And the determination of what that move is
01:36:45.560 | that we're discussing comes from a strategic reasoning model
01:36:50.640 | that uses reinforcement learning and planning.
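A minimal sketch, with hypothetical names and formatting, of what "controllable on a set of intents" can look like in code: the planner supplies the action we intend to play and the action we would like the recipient to play, and those get serialized into the conditioning text for the dialogue model. The exact representation Cicero uses is not shown here:

```python
from dataclasses import dataclass

@dataclass
class Intent:
    our_orders: list    # orders we plan to play, e.g. ["A Par - Bur"]
    their_orders: list  # orders we'd like the recipient to play, e.g. ["A Mun S A Par - Bur"]

def build_dialogue_prompt(power: str, recipient: str, intent: Intent, history: list) -> str:
    # Serialize the intent plus recent conversation into the text the
    # dialogue model is conditioned on (format is illustrative).
    return (
        f"POWER: {power}\n"
        f"TO: {recipient}\n"
        f"OUR PLAN: {'; '.join(intent.our_orders)}\n"
        f"DESIRED FROM THEM: {'; '.join(intent.their_orders)}\n"
        "DIALOGUE SO FAR:\n" + "\n".join(history) + "\n"
        "NEXT MESSAGE:"
    )

prompt = build_dialogue_prompt(
    "FRANCE", "GERMANY",
    Intent(["A Par - Bur"], ["A Mun S A Par - Bur"]),
    ["GERMANY: What are you thinking this turn?"],
)
```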
01:36:53.000 | - So the computing the intents for all the players,
01:36:57.000 | how is that done?
01:36:58.920 | Just as a starting point,
01:37:01.840 | is that with reinforcement learning
01:37:03.280 | or is that just optimal,
01:37:04.640 | determining what the optimal is for intents?
01:37:06.840 | - It's a combination of reinforcement learning and planning.
01:37:11.240 | Actually very similar to how we approached poker
01:37:14.280 | and how people have approached chess and Go as well.
01:37:18.200 | We're using self-play and search to try to figure out
01:37:22.920 | what is an optimal move for us
01:37:25.440 | and what is a desirable move
01:37:26.920 | that we would like this other player to play.
01:37:28.920 | Now, the difference between the way that we approached
01:37:32.960 | reinforcement learning and search in this game
01:37:35.400 | versus those previous games
01:37:37.000 | is that we have to keep it human compatible.
01:37:38.840 | We have to understand how the other person
01:37:41.800 | is likely to play rather than just assuming
01:37:43.680 | that they're gonna play like a machine.
01:37:45.480 | - And how language gets them to play
01:37:48.160 | in a way that maximize the chance of following the intent
01:37:52.760 | you want them to follow.
01:37:54.000 | Okay, how do you do that?
01:37:55.720 | How do you connect language to intent?
01:37:58.160 | - So the way that RL and planning is done
01:38:01.480 | is actually not using language.
01:38:03.160 | So we're coming up with this plan for the action
01:38:07.640 | that we're gonna play and the other person's gonna play
01:38:09.120 | and then we feed that action into the dialogue model
01:38:11.800 | that will then send a message according to those plans.
01:38:14.120 | - So the language model there is mapping action to...
01:38:19.120 | - To message. - To message.
01:38:21.720 | One word at a time.
01:38:22.720 | - Basically one message at a time.
01:38:25.400 | So we'll feed into the dialogue model,
01:38:27.360 | like here are the actions that you should be discussing.
01:38:29.320 | Here's the message,
01:38:30.160 | here's like the content of the message
01:38:33.560 | that we would like you to send
01:38:35.040 | and then it will actually generate a message
01:38:37.040 | that corresponds to that.
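As a sketch of that "plan in, message out" step, here is what the generation call could look like if a generic Hugging Face causal language model stood in for the fine-tuned dialogue model. The model name and prompt format are placeholders, not the actual system:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the real dialogue model is fine-tuned on Diplomacy dialogue.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def generate_message(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=60,
                            do_sample=True, top_p=0.9,
                            pad_token_id=tokenizer.eos_token_id)
    # Return only the newly generated tokens (the message itself).
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

msg = generate_message(
    "OUR PLAN: A Par - Bur\nDESIRED FROM THEM: A Mun S A Par - Bur\nNEXT MESSAGE:"
)
```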
01:38:37.920 | - Okay, does this actually work?
01:38:39.800 | - It works surprisingly well.
01:38:41.240 | - Okay, how...
01:38:42.080 | (laughs)
01:38:43.840 | Oh man, the number of ways it probably goes horribly,
01:38:47.480 | I would have imagined it goes horribly wrong.
01:38:50.080 | So how the heck is it effective at all?
01:38:54.000 | - I mean, there are a lot of ways that this could fail.
01:38:55.880 | So for example, I mean, you could have a situation
01:38:59.720 | where you're basically like,
01:39:01.420 | we don't tell the language model,
01:39:04.400 | like here are the pieces of our action
01:39:07.000 | or the other person's action
01:39:07.840 | that you should be communicating.
01:39:09.160 | And so like, let's say you're about to attack somebody,
01:39:11.480 | you probably don't wanna tell them
01:39:12.880 | that you're going to attack them,
01:39:14.240 | but there's nothing in the language,
01:39:15.840 | like the language model is not very smart
01:39:17.080 | at the end of the day.
01:39:17.920 | So it doesn't really have a way of knowing like,
01:39:20.200 | well, what should I be talking about?
01:39:21.560 | Should I tell this person
01:39:22.440 | that I'm about to attack them or not?
01:39:24.240 | So we have to like develop a lot of other techniques
01:39:27.400 | that deal with that.
01:39:28.620 | Like one of the things we do for example,
01:39:31.120 | is we try to calculate if I'm going to send this message,
01:39:34.660 | what would I expect the other person to do in response?
01:39:37.360 | So if it's a message like,
01:39:38.640 | hey, I'm gonna attack you this turn,
01:39:40.240 | they're probably gonna, you know, attack us
01:39:42.640 | or defend against that attack.
01:39:44.640 | And so we have a way of recognizing like,
01:39:47.120 | hey, sending this message
01:39:48.920 | is a negative expected value action
01:39:52.240 | and we should not send this message.
01:39:54.760 | - So you have for particular kinds of messages,
01:39:57.160 | you have like an extra function
01:39:59.520 | that does the, estimates the value of that message.
01:40:03.200 | - Yeah, so we have these kinds of filters that like-
01:40:05.760 | - So it's a filter.
01:40:06.600 | So is that filter a neural network
01:40:10.040 | or is it rule-based?
01:40:11.800 | - That's a neural network.
01:40:13.200 | So we're, well, it's a combination.
01:40:15.120 | It's a neural network, but it's also using planning.
01:40:18.120 | It's trying to compute like,
01:40:20.040 | what is the policy that the other players are going to play
01:40:23.720 | given that this message has been sent?
01:40:26.120 | And then is that better than not sending the message?
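A minimal sketch of the filtering idea described here, with the prediction and value models passed in as hypothetical stand-ins: condition the opponent-policy prediction on the candidate message having been sent, compute our expected value under that response, and only send the message if that beats saying nothing:

```python
def expected_value_of_message(candidate, state, predict_policy, expected_score):
    # predict_policy(state, message_or_None) -> predicted distribution over the
    #   recipient's actions (a learned model in the real system).
    # expected_score(state, recipient_policy) -> our expected game score.
    recipient_policy = predict_policy(state, candidate)
    return expected_score(state, recipient_policy)

def should_send(candidate, state, predict_policy, expected_score) -> bool:
    # Keep the message only if sending it is at least as good as staying silent.
    ev_with_message = expected_value_of_message(candidate, state, predict_policy, expected_score)
    ev_if_silent = expected_value_of_message(None, state, predict_policy, expected_score)
    return ev_with_message >= ev_if_silent
```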
01:40:28.760 | - I feel like that's how my brain works too.
01:40:30.560 | Like there's a language model that generates random crap
01:40:33.800 | and then there's these other neural nets
01:40:37.160 | that are essentially filters.
01:40:39.000 | At least that's what I tweet.
01:40:41.200 | I'll usually, my process of tweeting,
01:40:43.400 | I'll think of something and it's hilarious to me.
01:40:46.300 | And then about five seconds later,
01:40:48.080 | the filter network comes in and says,
01:40:49.880 | no, no, that's not funny at all.
01:40:52.320 | I mean, there's something interesting
01:40:53.980 | to that kind of process.
01:40:55.500 | So you have a set of actions that you want,
01:40:58.640 | you have an intent that you want to achieve,
01:41:02.080 | an intent that you want your opponent to achieve,
01:41:04.000 | then you generate messages.
01:41:05.760 | And then you evaluate if those messages
01:41:07.880 | will achieve the goal you want.
01:41:12.880 | - Yeah, and we're filtering for several things.
01:41:15.280 | We're filtering like, is this a sensible message?
01:41:17.960 | So sometimes language models will send,
01:41:19.720 | will generate messages that are just like totally nonsense.
01:41:23.800 | And we try to filter those out.
01:41:25.140 | We also try to filter out messages that are basically lies.
01:41:29.240 | So diplomacy has this reputation as a game
01:41:32.260 | that's really about deception and lying,
01:41:35.200 | but we try to actually minimize the amount
01:41:38.280 | that the bot would lie.
01:41:40.200 | This was actually mostly a-
01:41:42.440 | - Or are you?
01:41:43.760 | No, I'm just kidding.
01:41:44.600 | All right, go ahead.
01:41:45.440 | (laughing)
01:41:46.360 | - I mean, like part of the reason for this
01:41:47.680 | is that we actually found that lying
01:41:51.120 | would make the bot perform worse in the long run.
01:41:53.240 | It would end up with a lower score.
01:41:55.240 | Because once the bot lies,
01:41:56.840 | people would never trust it again.
01:41:59.660 | And trust is a huge aspect of the game of diplomacy.
01:42:02.000 | - I'm taking notes here,
01:42:02.920 | 'cause I think this applies to life lessons too.
01:42:06.960 | - Oh, I think it's a really, yeah, really strong-
01:42:08.520 | - So like lying is a dangerous thing to do.
01:42:11.040 | Like you want to avoid obvious lying.
01:42:15.000 | - Yeah, I mean, I think when people play diplomacy
01:42:17.000 | for the first time,
01:42:18.280 | they approach it as a game of deception and lying.
01:42:21.120 | And they, ultimately, if you talk to top diplomacy players,
01:42:24.920 | what they'll tell you is that diplomacy
01:42:26.760 | is a game about trust
01:42:28.200 | and being able to build trust in an environment
01:42:30.560 | that encourages people to not trust anyone.
01:42:33.360 | So that's the ultimate tension in diplomacy.
01:42:36.040 | How can this AI reason
01:42:38.320 | about whether you are being honest in your communication?
01:42:41.040 | And how can the AI persuade you that it is being honest
01:42:44.560 | when it is telling you that,
01:42:45.400 | "Hey, I'm actually going to support you this turn."
01:42:48.160 | - Is there some sense,
01:42:49.520 | I don't know if you step back and think,
01:42:50.960 | that this process will indirectly
01:42:55.960 | help us study human psychology?
01:42:59.200 | So like if trust is the ultimate goal,
01:43:01.600 | wouldn't that help us understand
01:43:03.920 | what are the fundamental aspects of forming trust
01:43:07.400 | between humans and between humans and AI?
01:43:10.000 | I mean, that's a really, really important question
01:43:11.880 | that's much bigger than strategy games.
01:43:14.920 | It's how can,
01:43:15.960 | that's fundamental to the human-robot interaction problem.
01:43:19.520 | How do we form trust between intelligent entities?
01:43:23.840 | - So one of the things I'm really excited about
01:43:25.880 | with diplomacy,
01:43:27.780 | there's never really been a good domain
01:43:30.580 | to investigate these kinds of questions.
01:43:32.620 | And diplomacy gives us a domain
01:43:35.780 | where trust is really at the center of it.
01:43:38.860 | And it's not just like you've hired
01:43:40.580 | a bunch of mechanical Turkers that are being paid
01:43:43.660 | and trying to get through the task as quickly as possible.
01:43:47.060 | You have these people that are really invested
01:43:48.940 | in the outcome of the game,
01:43:49.860 | and they're really trying to do the best that they can.
01:43:52.980 | And so I'm really excited
01:43:56.140 | that we've actually put this together:
01:43:58.820 | we're open-sourcing all of our models,
01:44:00.240 | we're open-sourcing all of the code,
01:44:03.540 | and we're making the data that we've used
01:44:05.860 | available to researchers
01:44:07.580 | so that they can investigate these kinds of questions.
01:44:10.980 | - So the data of the different,
01:44:12.140 | the human and the AI play of diplomacy,
01:44:15.260 | and the models that you use
01:44:16.660 | for the generation of the messages and the filtering.
01:44:20.240 | - Yeah, not just even the data of the AI
01:44:22.700 | playing with the humans,
01:44:23.700 | but all the training data that we had,
01:44:26.880 | that we use to train the AI
01:44:28.420 | to understand how humans play the game.
01:44:30.300 | We're setting up a system
01:44:31.580 | where researchers will be able to apply
01:44:34.580 | to be able to gain access to that data
01:44:36.420 | and be able to use it in their own research.
01:44:38.220 | - We should say, what is the name of the system?
01:44:41.060 | - We're calling the bot Cicero.
01:44:42.340 | - Cicero.
01:44:43.180 | And what's the name, like you're open sourcing,
01:44:45.580 | what's the name of the repository and the project?
01:44:49.140 | Is the big project also just called Cicero?
01:44:51.900 | Or are you still coming up with a name?
01:44:53.300 | - The data set comes from this website, webdiplomacy.net,
01:44:56.820 | is this site that's been online for like 20 years now.
01:44:59.260 | And it's one of the main sites
01:45:00.720 | that people use to play diplomacy on it.
01:45:02.800 | We've got like 50,000 games of diplomacy
01:45:06.180 | with natural language communication,
01:45:09.380 | over 10 million messages.
01:45:11.180 | So it's a pretty massive data set that people can use to,
01:45:14.820 | we're hoping that the academic community
01:45:16.540 | and the research community is able to use it
01:45:18.540 | for all sorts of interesting research questions.
01:45:21.140 | - So do you, from having studied this game,
01:45:23.980 | is this a sufficiently rich problem space
01:45:28.060 | to explore this kind of human AI interaction?
01:45:31.700 | - Yeah, absolutely.
01:45:32.540 | And I think it's maybe the best data set
01:45:36.100 | that I can think of out there
01:45:37.420 | to investigate these kinds of questions of negotiation,
01:45:41.420 | trust, persuasion.
01:45:44.220 | I wouldn't say it's the best data set in the world
01:45:45.900 | for human AI interaction, that's a very broad field.
01:45:49.660 | But I think that it's definitely up there as like,
01:45:52.100 | if you're really interested in language models
01:45:54.660 | interacting with humans in a setting
01:45:57.260 | where their incentives are not fully aligned,
01:45:59.660 | this seems like an ideal data set for investigating that.
01:46:02.900 | - So you have a paper with some impressive results
01:46:07.900 | and just an impressive paper that taken this problem on.
01:46:11.460 | What's the most exciting thing to you
01:46:13.560 | in terms of the results from the paper?
01:46:16.780 | - Well, I think there's a few--
01:46:18.860 | - Ideas or results?
01:46:20.660 | - Yeah, I think there's a few aspects of the results
01:46:23.180 | and that I think are really exciting.
01:46:25.460 | So first of all, the fact that we were able to achieve
01:46:27.980 | such strong performance,
01:46:29.900 | I was surprised by and pleasantly surprised by.
01:46:33.780 | So we played 40 games of diplomacy with real humans
01:46:37.060 | and the bot placed second out of all players
01:46:40.860 | that have played five or more games.
01:46:42.380 | So it's about 80 players total,
01:46:44.020 | 19 of whom played five or more games
01:46:46.700 | and the bot was ranked second out of those players.
01:46:49.200 | And the bot was really good in two dimensions.
01:46:53.860 | One, being able to establish strong connections
01:46:56.660 | with the other players on the board,
01:46:58.060 | being able to like persuade them to work with it,
01:47:01.600 | being able to coordinate with them
01:47:02.800 | about like how it's going to work with them.
01:47:04.980 | And then also the raw tactical
01:47:07.820 | and strategic aspects of the game,
01:47:09.940 | being able to understand
01:47:12.260 | what the other players are likely to do,
01:47:13.700 | being able to model their behavior
01:47:15.540 | and respond appropriately to that,
01:47:17.700 | the bot also really excelled at.
01:47:19.820 | - What are some interesting things that the bot said?
01:47:22.460 | By the way, are you allowed to swear in the,
01:47:26.260 | like are there rules to what you're allowed to say
01:47:28.100 | and not in diplomacy?
01:47:29.560 | - You can say whatever you want.
01:47:30.860 | I think the site will get very angry at you
01:47:32.700 | if you start like threatening somebody.
01:47:34.700 | And we actually--
01:47:36.420 | - Like if you threaten somebody,
01:47:37.780 | you're supposed to do it politely.
01:47:39.420 | - Yeah, politely, you know, like keep it in character.
01:47:43.700 | We actually had a researcher watching the bot 24/7,
01:47:46.780 | well, whenever we played a game,
01:47:47.660 | we had a researcher watching it to make sure
01:47:49.100 | that it wouldn't go off the rails
01:47:50.580 | and start like threatening somebody or something like that.
01:47:52.420 | - I would just love it if the bot started like mocking,
01:47:55.520 | mocking everybody,
01:47:56.460 | like some weird quirky strategies would emerge.
01:47:59.620 | That have you seen anything interesting that you,
01:48:01.540 | huh, that's a weird,
01:48:02.880 | that's a weird behavior,
01:48:05.940 | either the filter or the language model
01:48:09.060 | that was weird to you.
01:48:10.900 | - That was, yeah, there were definitely like
01:48:12.900 | things that the bot would do
01:48:15.660 | that were not in line with like
01:48:17.460 | how humans would approach the game.
01:48:19.500 | And that, in a good way, the humans actually,
01:48:22.340 | you know, we've talked to some expert diplomacy players
01:48:24.580 | about these results and their takeaway is that,
01:48:27.180 | well, maybe humans are approaching this the wrong way.
01:48:29.020 | And this is actually like the right way to play the game.
01:48:31.500 | - So what's required to win?
01:48:35.300 | Like what does it mean to mess up
01:48:37.860 | or to exploit the suboptimal behavior of a player?
01:48:41.380 | Like is there optimally rational behavior
01:48:45.500 | and irrational behavior that you need to estimate,
01:48:48.780 | that kind of stuff?
01:48:49.620 | Like what stands out to you?
01:48:51.060 | Like, is there a crack that you can exploit?
01:48:53.860 | Is there like a weakness that you can exploit in the game
01:48:58.300 | that everybody's looking for?
01:49:00.200 | - Well, I think you're asking kind of two questions there.
01:49:05.060 | So one, like modeling the irrationality
01:49:08.300 | and the suboptimality of humans.
01:49:10.700 | You can't, in diplomacy,
01:49:13.420 | you can't treat all the other players like they're machines.
01:49:15.700 | And if you do that,
01:49:17.060 | you're going to end up playing really poorly.
01:49:19.420 | And so we actually ran this experiment.
01:49:20.780 | So we trained a bot in a two player,
01:49:24.060 | zero-sum version of diplomacy,
01:49:26.220 | the same way that you might approach a game
01:49:27.800 | like chess or poker.
01:49:29.700 | And the bot was superhuman.
01:49:30.940 | It would crush any competitor.
01:49:32.700 | And then we took that same training approach
01:49:35.060 | and we trained a bot for the full seven player version
01:49:37.300 | of the game through self-play without any human data.
01:49:40.300 | And we stuck it in a game with six humans
01:49:42.300 | and it got destroyed.
01:49:43.800 | Even in the version of the game
01:49:44.820 | where there's no explicit natural language communication,
01:49:47.740 | it still got destroyed
01:49:49.100 | because it just wouldn't be able to understand
01:49:50.960 | how the other players were approaching the game
01:49:52.580 | and be able to work with that.
01:49:54.140 | - Can you just linger on that,
01:49:56.780 | meaning like there's an individual,
01:49:58.520 | there's an individual personality to each player
01:50:00.620 | and then you're supposed to remember that.
01:50:01.940 | But what do you mean it's not able to understand the players?
01:50:06.360 | - Well, it would, for example,
01:50:07.580 | expect the human to support it in a certain way
01:50:11.080 | when the human would simply like,
01:50:13.740 | think like, no, I'm not supposed to support you here.
01:50:16.420 | It's kind of like, you know,
01:50:17.300 | if you develop a self-driving car
01:50:19.460 | and it's trained completely from scratch
01:50:21.100 | with other self-driving cars,
01:50:22.800 | it might learn to drive on the left side of the road.
01:50:25.020 | And that's a totally reasonable thing to do
01:50:26.380 | if you're with these other self-driving cars
01:50:28.260 | that are also driving on the left side of the road.
01:50:30.120 | But if you put it in an American city, it's gonna crash.
01:50:33.020 | - But I guess the intuition I'm trying to build up
01:50:34.700 | is why does it then crush a human player heads up
01:50:37.660 | versus multiple?
01:50:40.600 | - This is an aspect of two-player zero-sum games
01:50:42.920 | versus games that involve cooperation.
01:50:45.160 | So in a two-player zero-sum game,
01:50:47.340 | you can do self-play from scratch
01:50:50.320 | and you will arrive at the Nash equilibrium
01:50:52.600 | where you don't have to worry about the other player
01:50:56.600 | playing in a very human suboptimal style.
01:50:58.520 | The only way that deviating from a Nash equilibrium
01:50:59.400 | would change things is if it helped you.
01:51:06.640 | - So what's the dynamic of cooperation
01:51:09.200 | that's effective in diplomacy?
01:51:11.280 | Do you always have to have one friend in the game?
01:51:14.920 | - You always want to maximize your friends
01:51:18.160 | and minimize your enemies.
01:51:19.460 | - Got it.
01:51:23.040 | And boy, and the lying comes into play there.
01:51:28.040 | So the more friends you have, the better.
01:51:32.280 | - Yeah, I mean, I guess you have to attack somebody
01:51:34.120 | or else you're not gonna make progress.
01:51:35.320 | - Right, so that's the tension, but this is too real.
01:51:39.160 | This is too real.
01:51:40.120 | This is too close to geopolitics
01:51:42.800 | of actual military conflict in the world.
01:51:45.320 | Okay, that's fascinating.
01:51:47.520 | So that cooperation element
01:51:49.280 | is what makes the game really, really hard.
01:51:51.440 | - Yeah, and to give you an example
01:51:53.800 | of how this suboptimality and irrationality comes into play,
01:51:57.320 | there's a really common situation in the game of diplomacy
01:52:01.640 | that where one player starts to win
01:52:04.200 | and they're at the point where they're controlling
01:52:05.960 | about half the map.
01:52:07.680 | And the remaining players
01:52:09.000 | who have all been fighting each other the whole game
01:52:11.000 | all have to work together now
01:52:12.800 | to stop this other player from winning
01:52:14.180 | or else everybody's gonna lose.
01:52:15.740 | And it's kind of like "Game of Thrones."
01:52:18.800 | I don't know if you've seen the show
01:52:19.960 | where you've got the Others coming from the north
01:52:22.040 | and all the people have to work out their differences
01:52:24.400 | and stop them from taking over.
01:52:26.580 | And the bot will do this.
01:52:30.320 | The bot will work with the other players
01:52:32.040 | to stop the superpower from winning.
01:52:33.920 | But if it's trained from scratch
01:52:36.840 | or it doesn't really have a good grounding
01:52:38.400 | in how humans approach it,
01:52:39.720 | it will also at the same time attack the other players
01:52:43.000 | with its extra units.
01:52:44.380 | So all the units that are not necessary
01:52:46.060 | to stop the superpower from winning,
01:52:47.540 | it will use those to grab as many centers as possible
01:52:50.300 | from the other players.
01:52:51.760 | And in totally rational play,
01:52:55.020 | the other players should just live with that.
01:52:56.880 | They have to understand like,
01:52:57.760 | "Hey, a score of one is better than a score of zero.
01:53:00.840 | "So, okay, he's grabbed my centers,
01:53:04.120 | "but I'll just deal with it."
01:53:06.080 | But humans don't act that way, right?
01:53:08.160 | The human gets really angry at the bot
01:53:10.560 | and ends up throwing the game
01:53:12.280 | because I'm gonna screw you over
01:53:15.160 | because you did something that's not fair to me.
01:53:17.560 | - Got it.
01:53:19.660 | And are you supposed to model that?
01:53:20.720 | Is the bot supposed to model that kind of human frustration?
01:53:24.840 | - Yeah, exactly.
01:53:25.660 | So that is something that seems almost impossible to model
01:53:29.540 | purely from scratch without any human data.
01:53:31.100 | It's a very cultural thing.
01:53:32.760 | And so you need human data to be able to understand that,
01:53:36.980 | "Hey, that's how humans behave."
01:53:38.780 | And you have to work around that.
01:53:40.100 | It might be suboptimal, it might be irrational,
01:53:42.300 | but that's an aspect of humanity that you have to deal with.
01:53:47.220 | - So how difficult is it to train on human data
01:53:49.540 | given that human data is very limited
01:53:51.340 | versus what a purely self-play mechanism can generate?
01:53:55.380 | - That's actually one of the major challenges
01:53:57.100 | that we faced in the research,
01:53:58.220 | that we had a good amount of human data.
01:53:59.980 | We had about 50,000 games.
01:54:01.380 | What we try to do is leverage as much self-play as possible
01:54:05.460 | while still leveraging the human data.
01:54:08.780 | So what we do is we do self-play,
01:54:11.660 | very similar to how it's been done in poker and Go,
01:54:14.240 | but we try to regularize the self-play
01:54:17.700 | towards the human data.
01:54:18.880 | Basically, the way to think about it is
01:54:23.100 | we penalize the bot for choosing actions
01:54:28.100 | that are very unlikely under the human data set.
01:54:32.240 | - How do you know?
01:54:34.260 | Is there some kind of function that says,
01:54:36.380 | "This is human-like and not?"
01:54:38.260 | - Yeah, so we train a bot through supervised learning
01:54:41.980 | to model the human play as much as possible.
01:54:44.140 | So we basically train a neural net on those 50,000 games,
01:54:47.900 | and that gives us an approximation,
01:54:50.080 | a policy that resembles to some extent
01:54:52.620 | how humans actually play the game.
01:54:54.300 | Now, this isn't a perfect model of human play
01:54:57.260 | because we don't have unlimited data.
01:54:58.580 | We don't have unlimited neural net capacity,
01:55:01.500 | but it gives us some approximation.
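A minimal sketch of that supervised step: fit a policy network on the human games by maximizing the likelihood of the orders humans actually chose, which yields the imperfect but useful model of human play described. The architecture and state encoding here are placeholders:

```python
import torch
import torch.nn as nn

class AnchorPolicy(nn.Module):
    # Placeholder: the real model encodes the board state and dialogue;
    # a small MLP over a feature vector stands in for that here.
    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # logits over a discrete set of candidate orders

def imitation_step(policy, optimizer, states, human_actions) -> float:
    # Cross-entropy against the orders humans actually played in the dataset.
    loss = nn.functional.cross_entropy(policy(states), human_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

policy = AnchorPolicy(state_dim=128, num_actions=1000)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss = imitation_step(policy, optimizer,
                      torch.randn(32, 128),            # fake encoded game states
                      torch.randint(0, 1000, (32,)))   # fake human order labels
```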
01:55:03.580 | - Is there some data on the internet
01:55:05.220 | that's useful besides just diplomacy?
01:55:07.620 | So on the language side of things, is there some,
01:55:09.900 | can you go to like Reddit?
01:55:11.300 | And so sort of background model formulation
01:55:16.620 | that's useful for the game of diplomacy.
01:55:18.620 | - Yeah, absolutely.
01:55:19.460 | And so for the language model,
01:55:20.300 | which is kind of like a separate question,
01:55:22.780 | we didn't use the language model during self-play training,
01:55:25.420 | but we pre-trained the language model
01:55:29.020 | on tons of internet data as much as possible.
01:55:32.660 | And then we fine-tuned it specifically
01:55:34.320 | on the diplomacy games.
01:55:35.700 | So we are able to leverage the wider data set
01:55:38.540 | in order to fill in some of the gaps
01:55:41.520 | in how communication happens more broadly
01:55:44.340 | besides just specifically in these diplomacy games.
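A minimal sketch of the pre-train-then-fine-tune recipe described, assuming a generic Hugging Face causal LM and a bare PyTorch update on one Diplomacy message; the real training uses the 50,000-game dialogue corpus and a conditioning format not shown here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Start from a model already pre-trained on broad internet text (placeholder).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def finetune_step(dialogue_text: str) -> float:
    # Standard causal-LM fine-tuning: labels are the input ids themselves.
    batch = tokenizer(dialogue_text, return_tensors="pt", truncation=True)
    out = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

loss = finetune_step("FRANCE to GERMANY: I'll support you into Burgundy this turn.")
```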
01:55:47.100 | - Okay, cool.
01:55:47.920 | What are some interesting things that came to life
01:55:50.760 | from this work to you?
01:55:53.280 | Like what are some insights about games
01:55:58.280 | where natural language is involved
01:56:01.200 | and cooperation, deep cooperation is involved?
01:56:04.400 | - Well, I think there's a few insights.
01:56:06.040 | So first of all, the fact that you can't rely purely
01:56:11.040 | or even largely on self-play,
01:56:12.920 | that you really have to have an understanding
01:56:14.640 | of how humans approach the game.
01:56:17.480 | I think that that's one of the major conclusions
01:56:19.120 | that I'm drawing from this work.
01:56:20.640 | And that is, I think, applicable more broadly
01:56:23.700 | to a lot of different games.
01:56:25.000 | So we've actually already taken the approaches
01:56:26.800 | that we've used in diplomacy and tried them
01:56:28.760 | on a cooperative card game called Hanabi.
01:56:31.840 | And we've had a lot of success in that game as well.
01:56:34.440 | On the language side, I think the fact
01:56:39.200 | that we were able to control the language model
01:56:43.080 | through this intents approach was very effective.
01:56:47.240 | And it allowed us, instead of just imitating
01:56:49.880 | how humans would communicate, we're able to go beyond that
01:56:52.720 | and able to feed into it superhuman strategies
01:56:57.720 | that it can then generate messages corresponding to.
01:57:02.600 | - Is there something you could say about detecting
01:57:04.800 | whether a person or AI is lying or not?
01:57:07.900 | - The bot doesn't explicitly try to calculate
01:57:13.260 | whether somebody is lying or not.
01:57:15.140 | But what it will do is try to predict
01:57:18.040 | what actions they're going to take,
01:57:19.920 | given the communications, given the messages
01:57:22.200 | that they've sent to us.
01:57:23.480 | So given our conversation,
01:57:24.840 | what do I think you're going to do?
01:57:26.040 | And implicitly, there is a calculation
01:57:28.800 | about whether you're lying to me in that.
01:57:30.800 | Based on your messages, if I think you're going
01:57:34.620 | to attack me this turn, even though your messages say
01:57:37.480 | that you're not, then essentially the bot
01:57:40.320 | is predicting that you're lying.
01:57:42.200 | But it doesn't view it as lying the same way
01:57:45.420 | that we would view it as lying.
01:57:47.260 | - But you could probably reformulate with all the same data
01:57:51.100 | and make a classifier lying or not.
01:57:54.700 | - Yeah, I think you could do that.
01:57:56.620 | That was not something that we were focused on,
01:57:58.280 | but I think that it is possible that,
01:58:00.240 | if you came up with some measurements of like,
01:58:03.340 | what does it mean to tell a lie?
01:58:04.660 | Because there's a spectrum, right?
01:58:06.000 | Like if you're withholding some information, is that a lie?
01:58:10.540 | If you're mostly telling the truth,
01:58:11.860 | but you forgot to mention this one action out of 10,
01:58:15.100 | is that a lie?
01:58:16.500 | It's hard to draw the line, but if you're willing to do that
01:58:19.340 | and then you could possibly use it to--
01:58:22.580 | - This feels like an argument inside a relationship now.
01:58:25.600 | What constitutes a lie?
01:58:27.860 | Depends what you mean by the definition of the word is.
01:58:32.260 | Okay, still it's fascinating because trust and lying
01:58:37.260 | is all intermixed into this and it's language models
01:58:41.700 | that are becoming more and more sophisticated.
01:58:43.540 | It's just a fascinating space to explore.
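Picking up the reformulation suggested a moment ago, here is a hedged sketch of how one might turn the same data into a lie-or-not classifier: label each message by whether the promised orders were actually played, then train a small head on top of some message-plus-context embedding. None of this is in the released system; the labeling rule and names are hypothetical:

```python
import torch
import torch.nn as nn

def label_message(promised_orders: set, played_orders: set) -> int:
    # One crude labeling rule: a message counts as a lie if any promised order
    # was not actually played. Where exactly to draw that line is the hard part.
    return int(not promised_orders.issubset(played_orders))

class LieClassifier(nn.Module):
    # Hypothetical head over a fixed-size embedding of (message, game context).
    def __init__(self, embed_dim: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(embedding))  # probability the message is a lie

y = label_message({"A Mun S A Par - Bur"}, {"A Mun - Ber"})   # promise broken -> 1
p = LieClassifier(embed_dim=256)(torch.randn(1, 256))
```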
01:58:45.860 | What do you see as the future of this work
01:58:52.340 | that is inspired by the breakthrough performance
01:58:56.580 | that you're getting here with diplomacy?
01:58:58.580 | - I think there's a few different directions
01:59:03.220 | to take this work.
01:59:04.180 | I think really what it's showing us is the potential
01:59:09.740 | that language models have.
01:59:10.740 | I mean, I think a lot of people didn't think
01:59:12.220 | that this kind of result was possible even today,
01:59:14.980 | despite all the progress that's been made in language models.
01:59:17.540 | And so it shows us how we can leverage the power
01:59:21.420 | of things like self-play on top of language models
01:59:24.260 | to get increasingly better performance.
01:59:27.520 | And the ceiling is really much higher
01:59:30.700 | than what we have right now.
01:59:32.340 | - Is this transferable somehow to chatbots
01:59:37.140 | for the more general task of dialogue?
01:59:40.620 | So there is a kind of negotiation here,
01:59:43.100 | a dance between entities that are trying to cooperate
01:59:46.780 | and at the same time, a little bit adversarial,
01:59:49.860 | which I think maps somewhat to the general,
01:59:53.880 | the entire process of Reddit or like internet communication.
01:59:59.900 | You're cooperating, you're adversarial,
02:00:02.300 | you're having debates, you're having a camaraderie,
02:00:05.180 | all that kind of stuff.
02:00:06.780 | - I think one of the things that's really useful
02:00:08.700 | about diplomacy is that we have a well-defined
02:00:11.540 | value function.
02:00:12.880 | There is a well-defined score that the bot
02:00:15.260 | is trying to optimize.
02:00:16.660 | And in a setting like a general chatbot setting,
02:00:20.580 | it would need that kind of objective
02:00:24.380 | in order to fully leverage the techniques
02:00:26.420 | that we've developed.
02:00:27.460 | - What about like what we talked about earlier
02:00:30.940 | with NPCs inside video games?
02:00:33.380 | Like how can it be used to create
02:00:36.420 | for Elder Scrolls VI more compelling NPCs
02:00:41.420 | that you could talk to instead of committing
02:00:44.740 | all kinds of violence with a sword and fighting dragons,
02:00:48.020 | just sitting in a tavern and drink all day
02:00:49.940 | and talk to the chatbot?
02:00:51.580 | - The way that we've approached AI in diplomacy
02:00:53.300 | is you condition the language on an intent.
02:00:56.380 | Now that intent in diplomacy is an action,
02:00:59.380 | but it doesn't have to be.
02:01:00.420 | And you can imagine, you could have NPCs
02:01:04.340 | in video games or the metaverse or whatever,
02:01:06.700 | where there's some intent or there's some objective
02:01:09.500 | that they're trying to maximize,
02:01:10.540 | and you can specify what that is.
02:01:12.200 | And then the language can correspond to that intent.
02:01:17.460 | Now, I'm not saying that this is happening imminently,
02:01:19.820 | but I'm saying that this is like a future application
02:01:22.500 | potentially of this direction of research.
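To make the intent-conditioning idea concrete, here is a minimal sketch of what an intent-conditioned NPC reply could look like; the prompt layout, the example intent, and the generate() call are illustrative assumptions rather than details from Cicero or any shipped game engine:

```python
# Hypothetical sketch of intent-conditioned dialogue for a game NPC.
# The prompt layout, the intent strings, and generate() are illustrative only.

def build_prompt(game_state: str, intent: str, player_message: str) -> str:
    """Condition the language model on the NPC's current objective (its intent)."""
    return (
        f"Game state: {game_state}\n"
        f"NPC objective: {intent}\n"          # e.g. "convince the player to visit the old mill"
        f"Player says: {player_message}\n"
        f"NPC replies:"
    )

def npc_reply(language_model, game_state, intent, player_message):
    prompt = build_prompt(game_state, intent, player_message)
    # generate() stands in for whatever decoding API the model actually exposes.
    return language_model.generate(prompt, max_tokens=60)
```

The key point is the same as in the Diplomacy setup: some planner or game system decides the intent, and the language model only has to produce dialogue consistent with it.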
02:01:25.020 | - So what's the more general formulation of this?
02:01:27.820 | Making self-play be able to scale the way self-play does
02:01:30.940 | and still maintain human-like behavior.
02:01:33.740 | - The way that we've approached self-play in diplomacy
02:01:37.460 | is we're trying to come up with good intents
02:01:42.460 | to condition the language model on.
02:01:43.980 | And the space of intents is actions
02:01:46.660 | that can be played in the game.
02:01:47.980 | Now, there is the potential to have a broader set of intents,
02:01:51.420 | things like long-term cooperation or long-term objectives
02:01:56.420 | or gossip about what another player was saying.
02:02:01.080 | These are things that we're currently not conditioning
02:02:03.140 | the language model on, and so we're not able to control it
02:02:07.140 | to say like, "Oh, you should be talking
02:02:08.500 | "about this thing right now."
02:02:09.800 | But it's quite possible that you could expand
02:02:12.540 | the scope of intents to be able to allow it
02:02:14.900 | to talk about those things.
02:02:16.080 | Now, in the process of doing that,
02:02:17.820 | the self-play would become much more complicated.
02:02:20.220 | And so that is a potential for future work.
02:02:23.820 | - Okay, the increase in the number of intents.
02:02:25.580 | I still am not quite clear how you keep the self-play
02:02:32.400 | integrated into the human world.
02:02:34.960 | - Yeah.
02:02:35.800 | - I'm a little bit loose on understanding how you do that.
02:02:39.380 | - So we train a neural net to imitate the human data
02:02:43.240 | as closely as possible,
02:02:44.640 | and that's what we call the anchor policy.
02:02:47.000 | And now when we're doing self-play,
02:02:49.320 | the problem with the anchor policy
02:02:50.920 | is that it's not a perfect approximation
02:02:53.680 | of how humans actually play.
02:02:54.960 | Because we don't have infinite data,
02:02:56.520 | because we don't have unlimited neural network capacity,
02:03:00.220 | it's actually a relatively suboptimal approximation
02:03:03.040 | of how humans actually play.
02:03:04.680 | And we can improve that approximation
02:03:06.820 | by adding planning and RL.
02:03:10.220 | And so what we do is we get a better approximation,
02:03:13.880 | a better model of human play by,
02:03:17.040 | during the self-play process,
02:03:19.080 | we say you can deviate from this human anchor policy
02:03:24.080 | if there is an action that has, you know,
02:03:26.440 | particularly high expected value.
02:03:29.240 | But it would have to be a really high expected value
02:03:32.260 | in order to deviate from this human-like policy.
02:03:36.180 | So you basically say,
02:03:37.500 | try to maximize your expected value
02:03:39.500 | while at the same time,
02:03:40.920 | stay as close as possible to the human policy.
02:03:44.060 | And there is a parameter that controls
02:03:46.580 | the relative weighting of those competing objectives.
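The trade-off described here can be written down concretely. Below is a minimal sketch, assuming a discrete set of candidate actions with estimated expected values and an anchor policy's probabilities over those actions; the exact formulation used in the Diplomacy work may differ in its details:

```python
import numpy as np

def regularized_policy(q_values, anchor_probs, lam):
    """Blend expected value with closeness to a human-like 'anchor' policy.

    The maximizer of  E_pi[Q] - lam * KL(pi || anchor)  is
    pi(a) proportional to anchor(a) * exp(Q(a) / lam), computed here.
    Large lam keeps the policy close to the human-like anchor;
    small lam lets it chase expected value.
    """
    q = np.asarray(q_values, dtype=float)
    anchor = np.asarray(anchor_probs, dtype=float)
    logits = np.log(anchor + 1e-12) + q / lam
    logits -= logits.max()                      # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy example: action 2 has the highest value, but the anchor rarely plays it.
q = [1.0, 1.2, 3.0]
anchor = [0.60, 0.35, 0.05]
print(regularized_policy(q, anchor, lam=5.0))   # stays near the anchor
print(regularized_policy(q, anchor, lam=0.2))   # shifts toward the high-value action
```

The single parameter lam is the weighting Noam describes: the policy deviates from the human anchor only when the expected-value gain is large enough to justify it.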
02:03:50.660 | - So the question I have
02:03:52.620 | is how sophisticated can the anchor policy get?
02:03:56.500 | So I have a policy that approximates human behavior, right?
02:03:59.840 | - Yeah.
02:04:00.680 | - So as you increase the number of intents,
02:04:03.440 | as you generalize the space in which this is applicable,
02:04:08.340 | and given that the human data is limited,
02:04:11.260 | try to anticipate a policy that works
02:04:14.260 | for a much larger number of cases.
02:04:17.800 | Like how difficult is the process
02:04:19.500 | of forming a damn good anchor policy?
02:04:22.640 | - Well, it really comes down
02:04:23.480 | to how much human data you have.
02:04:25.280 | So it's all about scaling the human data.
02:04:27.600 | - I think the more human data you have, the better.
02:04:30.040 | And I think that that's going to be the major bottleneck
02:04:32.680 | in scaling to more complicated domains.
02:04:37.000 | But that said, there might be the potential,
02:04:39.600 | just like in the language model,
02:04:40.800 | where we leveraged tons of data on the internet
02:04:43.640 | and then specialized it for diplomacy.
02:04:46.740 | There is the future potential
02:04:47.900 | that you can leverage huge amounts of data across the board
02:04:50.820 | and then specialize it in the data set
02:04:53.600 | that you have for diplomacy.
02:04:54.560 | And that way you're essentially augmenting
02:04:56.360 | the amount of data that you have.
02:04:58.540 | - To what degree does this apply
02:05:00.440 | to the general, the real world diplomacy, the geopolitics?
02:05:06.400 | You know, game theory has a history
02:05:11.040 | of being applied to understand
02:05:13.120 | and to give us hope about nuclear weapons, for example.
02:05:16.000 | The mutually assured destruction
02:05:17.820 | is a game theoretic concept that you can formulate.
02:05:21.020 | Some people say it's oversimplified,
02:05:23.080 | but nevertheless, here we are
02:05:24.800 | and we somehow haven't blown ourselves up.
02:05:27.320 | Do you see a future where this kind of system
02:05:32.320 | can be used to help us make decisions,
02:05:35.960 | geopolitical decisions in the world?
02:05:37.760 | - Well, like I said, the original motivation
02:05:40.800 | for the game of diplomacy was the failures of World War I,
02:05:44.060 | the diplomatic failures that led to war.
02:05:46.680 | And the real take-home message of diplomacy is that,
02:05:50.520 | you know, if people approach diplomacy
02:05:53.000 | the right way, then war is ultimately unsuccessful.
02:05:57.380 | The way that I see it, war is
02:06:00.440 | an inherently negative sum game, right?
02:06:02.020 | There's always a better outcome than war
02:06:04.400 | for all the parties involved.
02:06:06.020 | And my hope is that, you know, as AI progresses,
02:06:10.400 | then maybe this technology could be used
02:06:12.480 | to help people make better decisions across the board
02:06:17.000 | and, you know, hopefully avoid
02:06:19.240 | negative sum outcomes like war.
02:06:21.360 | - Yeah, I mean, I just came back from Ukraine.
02:06:24.120 | I'm going back there.
02:06:25.440 | On deep personal levels, I think a lot about
02:06:30.200 | how peace can be achieved.
02:06:34.360 | And I'm a big believer in conversation,
02:06:36.520 | leaders getting together and having conversations
02:06:39.920 | and trying to understand each other.
02:06:42.560 | Yeah, it's fascinating to think
02:06:44.820 | whether each one of those leaders
02:06:46.120 | can run a simulation ahead of time.
02:06:48.560 | Like if I'm an asshole,
02:06:50.840 | (chuckles)
02:06:52.080 | what are the possible consequences?
02:06:53.520 | If I'm nice, what are the possible consequences?
02:06:56.720 | My guess is that if the president of the United States
02:07:01.200 | got together with Volodymyr Zelenskyy and Vladimir Putin,
02:07:06.200 | that there would be significant benefits
02:07:10.240 | to the president of the United States not having the ego
02:07:14.880 | of kind of playing down, of giving away a lot of chips
02:07:19.280 | for the future success of the world.
02:07:22.200 | So giving a lot of power to the two presidents
02:07:24.500 | of the competing nations to achieve peace.
02:07:27.300 | That's my guess,
02:07:29.120 | but it'd be nice to run a bunch of simulations.
02:07:31.640 | But then you have to have human data, right?
02:07:33.280 | You really, 'cause it's like the game of diplomacy
02:07:35.840 | is fundamentally different than geopolitics.
02:07:37.760 | You need data.
02:07:39.040 | You need like, I guess that's the question I have.
02:07:42.120 | Like how transferable is this to,
02:07:44.160 | like I don't know, any kind of negotiation, right?
02:07:47.600 | Like to any kind of, some local, I don't know,
02:07:50.160 | a bunch of lawyers like arguing,
02:07:52.480 | like a divorce, like divorce lawyers.
02:07:55.360 | Like how transferable is this
02:07:56.800 | to all kinds of human negotiation?
02:07:58.840 | - Well, I feel like this isn't a question
02:08:00.440 | that's unique to diplomacy.
02:08:01.440 | I mean, I think you look at RL breakthroughs,
02:08:03.880 | reinforcement learning breakthroughs
02:08:05.160 | in previous games as well,
02:08:06.280 | like AI for StarCraft, AI for Atari.
02:08:09.200 | You haven't really seen it deployed in the real world
02:08:11.780 | because you have these problems of,
02:08:13.840 | it's really hard to collect a lot of data
02:08:16.600 | and you don't have a well-defined action space.
02:08:21.280 | You don't have a well-defined reward function.
02:08:23.360 | These are all things that you really need
02:08:25.520 | for reinforcement learning
02:08:27.200 | and planning to be really successful today.
02:08:29.440 | Now, there are some domains where you do have that.
02:08:32.800 | Code generation is one example.
02:08:35.360 | Theorem proving mathematics, that's another example
02:08:37.520 | where you have a well-defined action space.
02:08:39.000 | You have a well-defined reward function.
02:08:40.920 | And those are the kinds of domains
02:08:42.840 | where I can see RL in the short term
02:08:45.560 | being incredibly powerful.
02:08:47.120 | But yeah, I think that those are the barriers
02:08:51.120 | to deploying this at scale in the real world.
02:08:53.300 | But the hope is that in the long run,
02:08:55.280 | we'll be able to get there.
02:08:57.080 | - Yeah, but see, diplomacy feels like closer
02:08:59.900 | to the real world than does StarCraft.
02:09:02.560 | Like 'cause it's natural language, right?
02:09:04.620 | You're operating in the space of intents
02:09:06.200 | and in the space of natural language,
02:09:07.640 | that feels very close to the real world.
02:09:09.520 | And it also feels like you could get data on that
02:09:12.880 | from the internet.
02:09:14.760 | - Yeah, and that's why I do think that diplomacy
02:09:17.400 | is taking a big step closer to the real world
02:09:20.360 | than anything that's come before
02:09:21.440 | in terms of game AI breakthroughs.
02:09:23.200 | The fact that we're communicating in natural language,
02:09:27.920 | we're leveraging the fact that we have this
02:09:30.640 | like general data set of dialogue and communication
02:09:35.320 | from a breadth of the internet.
02:09:37.100 | That is a big step in that direction.
02:09:39.940 | We're not 100% there, but we're getting closer at least.
02:09:44.020 | - So if we actually return back to poker and chess,
02:09:47.320 | are some of the ideas that you're learning here
02:09:48.920 | with diplomacy, could you construct AI systems
02:09:52.320 | that play like humans?
02:09:55.080 | Like make for a fun opponent in a game of chess?
02:10:00.080 | - Yeah, absolutely.
02:10:01.240 | We've already started looking into this direction a bit.
02:10:03.080 | So we tried to use the techniques that we've developed
02:10:05.440 | for diplomacy to make chess and go AIs.
02:10:08.840 | And what we found is that it led to much more human-like
02:10:13.080 | strong chess and go players.
02:10:15.600 | The way that AIs like Stockfish today play
02:10:19.620 | is in a very inhuman style.
02:10:21.580 | It's very strong, but it's very different
02:10:23.460 | from how humans play.
02:10:25.120 | And so we can take the techniques
02:10:27.020 | that we've developed for diplomacy.
02:10:28.280 | We do something similar in chess and go,
02:10:32.400 | and we end up with a bot that's both strong and human-like.
02:10:36.440 | To elaborate on this a bit,
02:10:39.440 | like one way to approach making a human-like
02:10:43.000 | AI for chess is to collect a bunch of human games,
02:10:47.560 | like a bunch of human grandmaster games,
02:10:49.560 | and just to supervise learning on those games.
02:10:52.200 | But the problem is that if you do that,
02:10:53.880 | what you end up with is an AI that's substantially weaker
02:10:57.200 | than the human grandmasters that you trained on.
02:10:59.760 | Because the neural net is not able to approximate
02:11:03.520 | the nuance of the strategy.
02:11:06.240 | This goes back to the planning thing that I mentioned,
02:11:08.600 | the search thing that I talked about before,
02:11:10.520 | that these human grandmasters, when they're playing,
02:11:12.640 | they're using search and they're using planning.
02:11:15.560 | And the neural net alone,
02:11:17.640 | unless you have a massive neural net
02:11:19.280 | that's like a thousand times bigger
02:11:20.400 | than what we have right now,
02:11:22.080 | it's not able to approximate those details very effectively.
02:11:25.720 | And on the other hand,
02:11:28.560 | you can leverage search and planning very heavily,
02:11:32.480 | but then what you end up with is an AI
02:11:34.120 | that plays in a very different style
02:11:35.800 | from how humans play the game.
02:11:37.720 | Now, if you strike this intermediate balance
02:11:40.020 | by setting the regularization parameters correctly
02:11:43.280 | and say, you can do planning,
02:11:44.440 | but try to keep it close to the human policy,
02:11:46.880 | then you end up with an AI that plays
02:11:49.800 | in both a very human-like style and a very strong style.
02:11:54.080 | And you can actually even tune it
02:11:55.820 | to have a certain Elo rating.
02:11:58.320 | So you can say, play in the style of like a 2800 Elo human.
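A hedged sketch of how that strength knob might be calibrated in practice; measure_elo() is a hypothetical evaluation routine (for example, rating the regularized bot through games against reference opponents), not part of any published system:

```python
# Illustrative only: pick the anchor-regularization weight whose measured
# playing strength lands closest to a desired Elo rating.

def calibrate_lambda(target_elo, candidate_lambdas, measure_elo):
    """Heavier regularization -> more human-like but typically weaker;
    lighter regularization -> stronger but less human-like."""
    return min(candidate_lambdas, key=lambda lam: abs(measure_elo(lam) - target_elo))
```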
02:12:01.480 | - I wonder if you could do specific type of humans
02:12:04.920 | or categories of humans, not just skill, but style.
02:12:09.520 | - Yeah, I think so.
02:12:10.400 | And so this is where the research gets interesting.
02:12:13.720 | Like, one of the things that I was thinking about is,
02:12:16.760 | and this is actually already being done,
02:12:18.280 | there's a researcher at the University of Toronto
02:12:19.880 | that's working on this,
02:12:21.760 | is to make an AI that plays
02:12:23.200 | in the style of a particular player.
02:12:25.480 | Like Magnus Carlsen, for example,
02:12:26.960 | you can make an AI that plays like Magnus Carlsen.
02:12:29.680 | And then where I think this gets interesting is like,
02:12:31.440 | maybe you're up against Magnus Carlsen
02:12:33.320 | in the world championship or something,
02:12:35.180 | you can play against this Magnus Carlsen bot
02:12:37.800 | to prepare against the real Magnus Carlsen.
02:12:40.480 | And you can try to explore strategies
02:12:42.600 | that he might struggle with and try to figure out like,
02:12:46.000 | how do you beat this player in particular?
02:12:48.880 | On the other hand, you can also have Magnus Carlsen
02:12:51.040 | working with this bot to try to figure out where he's weak
02:12:53.700 | and where he needs to improve his strategy.
02:12:56.260 | And so I can envision this future
02:12:59.540 | where data on specific chess and Go players
02:13:03.600 | becomes extremely valuable because you can use that data
02:13:06.300 | to create specific models
02:13:08.180 | of how these particular players play.
02:13:10.040 | - So increasingly human-like behavior in bots, however,
02:13:13.400 | as you've mentioned, makes cheating,
02:13:17.120 | cheat detection much harder.
02:13:19.060 | - It does, yeah.
02:13:20.040 | The way that cheat detection works in a game like poker
02:13:23.540 | and a game like chess and Go, from what I understand,
02:13:26.100 | is trying to see like, is this person making moves
02:13:30.180 | that are very common among chess AIs or AIs in general?
02:13:36.160 | But very uncommon among top human players.
02:13:40.680 | And if you have the development of these AIs
02:13:43.920 | that play in a very strong style,
02:13:46.280 | but also a very human-like style,
02:13:48.280 | then that poses serious challenges for cheat detection.
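As a rough illustration of the engine-match idea described above, heavily simplified compared to real cheat-detection systems; engine_top_move() and human_top_move() are hypothetical stand-ins for an engine and a human-move-prediction model:

```python
# Simplified illustration: flag players whose moves agree with an engine's top
# choice far more often than they agree with a model of strong human play.

def suspicion_score(moves, positions, engine_top_move, human_top_move):
    n = max(len(moves), 1)
    engine_matches = sum(m == engine_top_move(p) for m, p in zip(moves, positions))
    human_matches = sum(m == human_top_move(p) for m, p in zip(moves, positions))
    # High engine agreement combined with low human-model agreement is the
    # signal -- exactly the signal that strong *and* human-like bots erode.
    return engine_matches / n - human_matches / n
```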
02:13:51.280 | - And it makes you now ask yourself a hard question
02:13:54.280 | about what is the role of AI systems
02:13:56.720 | as they become more and more integrated in our society?
02:13:59.800 | And this kind of human AI integration
02:14:03.620 | has some deep ethical issues that we should be aware of.
02:14:08.620 | And also it's a kind of cybersecurity challenge, right?
02:14:13.140 | For it to make, you know,
02:14:14.260 | one of the assumptions we have when we play games
02:14:17.100 | is that there's a trust that it's only humans involved.
02:14:20.140 | And the better AI systems we create,
02:14:24.740 | which makes it super exciting,
02:14:26.140 | human-like AI systems with different styles of humans
02:14:28.700 | is really exciting,
02:14:29.740 | but then we have to have the defenses
02:14:32.060 | better and better and better
02:14:33.600 | if we're to trust that we can enjoy
02:14:36.880 | a human versus human game in a deeply fair way.
02:14:40.760 | It's fascinating.
02:14:41.600 | It's just, it's humbling.
02:14:44.320 | - Yeah, I think there's a lot of like negative potential
02:14:47.160 | for this kind of technology,
02:14:48.240 | but you know, at the same time,
02:14:49.760 | there's a lot of upside for it as well.
02:14:51.600 | So, you know, for example, right now,
02:14:53.520 | it's really hard to learn how to get better
02:14:55.720 | in games like chess and poker and Go
02:14:57.800 | because the way that the AI plays
02:15:00.000 | is so foreign and incomprehensible.
02:15:02.280 | But if you have these AIs that are playing,
02:15:04.520 | you know, you can say like,
02:15:05.360 | oh, I'm a 2000 Elo human, how do I get to 2200?
02:15:08.240 | Now you can have an AI that plays in the style
02:15:10.760 | of a 2200 Elo human, and that will help you get better.
02:15:14.160 | Or, you know, you mentioned this problem of like,
02:15:16.840 | how do you know that you're actually playing with humans
02:15:19.440 | when you're playing like online and in video games?
02:15:22.000 | Well, now we have the potential of populating
02:15:24.760 | these like virtual worlds with agents,
02:15:28.560 | like AI agents that are actually fun to play with,
02:15:30.720 | and you don't have to always be playing with other humans
02:15:33.680 | to, you know, have a fun time.
02:15:36.320 | So yeah, a lot of upside potential too.
02:15:38.560 | And I think, you know, with any sort of tool,
02:15:40.320 | there's the potential for a lot of greatness
02:15:42.760 | and a lot of downsides as well.
02:15:44.520 | - So in the paper that I got a chance to look at,
02:15:47.080 | there's a section on ethical considerations.
02:15:50.600 | What's in that section?
02:15:51.640 | What are some ethical considerations here?
02:15:53.560 | Is it some of the stuff we already talked about?
02:15:55.760 | - There's some things that we've already talked about.
02:15:57.520 | I think specific to diplomacy, you know,
02:16:01.240 | there's also the challenge that the game is,
02:16:04.960 | you know, there is a deception aspect to the game.
02:16:07.360 | And so, you know, developing language models
02:16:11.360 | that are capable of deception is I think a dicey issue
02:16:14.280 | and something that, you know,
02:16:16.160 | makes research on diplomacy particularly challenging.
02:16:18.800 | And, you know, so those kinds of issues of like,
02:16:24.480 | should we even be developing AIs
02:16:25.920 | that are capable of lying to people?
02:16:27.120 | That's something that we have to, you know,
02:16:28.840 | think carefully about.
02:16:30.480 | - That's so cool.
02:16:31.320 | I mean, you have to do that kind of stuff
02:16:32.680 | in order to figure out where the ethical lines are.
02:16:35.080 | But I can see in the future it being illegal
02:16:37.120 | to have a consumer product that lies.
02:16:41.840 | - Yeah, yeah.
02:16:44.080 | - Like your personal assistant AI system
02:16:46.240 | always has to tell the truth.
02:16:48.360 | But if I ask it, do I look,
02:16:50.960 | did I get fatter over the past month?
02:16:53.360 | I sure as hell want that AI system to lie to me.
02:16:56.880 | So there's a trade off between lying and being nice.
02:17:01.320 | We have to somehow find,
02:17:02.560 | what is the ethics in that?
02:17:04.920 | And we're back to discussions inside relationships.
02:17:07.400 | Anyway, what were you saying?
02:17:08.600 | - Oh, yeah, I was saying like, yeah,
02:17:09.840 | that's kind of going to the question of like,
02:17:11.520 | what is a lie?
02:17:12.360 | You know, is a white lie a bad lie?
02:17:13.840 | Is it an ethical lie?
02:17:15.080 | You know, those kinds of questions.
02:17:16.840 | - Boy, we return time and time again
02:17:20.520 | to deep human questions as we design AI systems.
02:17:23.280 | That's exactly what they do.
02:17:24.360 | They put a mirror to humanity
02:17:26.680 | to help us understand ourselves.
02:17:29.040 | - There's also the issue of like, you know,
02:17:31.000 | in these diplomacy experiments,
02:17:32.360 | in order to do a fair comparison,
02:17:35.320 | you know, what we found is that
02:17:36.920 | there's an inherent anti-AI bias in these kinds of games.
02:17:40.760 | So we actually played a tournament
02:17:43.160 | in a non-language version of the game,
02:17:45.000 | where, you know, we told the participants like,
02:17:47.000 | hey, in every single game, there's going to be an AI.
02:17:49.440 | And what we found is that the humans
02:17:51.960 | would spend basically the entire game,
02:17:53.580 | like trying to figure out who the bot was.
02:17:55.040 | And then as soon as they thought they figured it out,
02:17:56.480 | they would all team up and try to kill it.
02:17:58.480 | And, you know, overcoming that inherent anti-AI bias
02:18:03.880 | is a challenge.
02:18:05.520 | - On the flip side, I think when robots become the enemy,
02:18:10.520 | that's when we get to heal our human divisions,
02:18:14.560 | and then we can become one.
02:18:16.800 | As long as we have one enemy,
02:18:18.000 | it's that Reagan thing: when the aliens show up,
02:18:20.680 | that's when we put aside
02:18:22.360 | our divisions and become one human species.
02:18:25.420 | - Right, we might have our differences,
02:18:26.560 | but we're at least all human.
02:18:27.760 | - At least we all hate the robots.
02:18:30.280 | No, no, no, no.
02:18:31.640 | I think there will be actually in the future
02:18:33.120 | something like a civil rights movement for robots.
02:18:35.440 | I think that's the fascinating thing about AI systems,
02:18:38.200 | and that is they force us to ask
02:18:41.360 | ethical questions about what is sentience,
02:18:44.660 | what is, how do we feel about systems
02:18:47.040 | that are capable of suffering,
02:18:48.200 | or capable of displaying suffering?
02:18:50.400 | And how do we design products that show emotion or not?
02:18:53.800 | How do we feel about that?
02:18:54.920 | Lying is another topic.
02:18:56.920 | Are we going to allow bots to lie or not?
02:19:00.060 | And where's the balance between being nice
02:19:02.440 | and telling the truth?
02:19:04.640 | I mean, these are all fascinating human questions,
02:19:07.040 | and it's so exciting to be in a century
02:19:09.160 | where we can create systems that
02:19:11.800 | take these philosophical questions
02:19:15.040 | that have been asked for centuries,
02:19:17.280 | and now we can engineer them inside systems,
02:19:20.100 | where you really have to answer them,
02:19:22.280 | because you'll have transformational impact on human society
02:19:26.960 | depending on what you design inside those systems.
02:19:30.040 | It's fascinating.
02:19:31.400 | And like you said, I feel like diplomacy
02:19:33.480 | is a step towards the direction of the real world,
02:19:35.960 | applying these RL methods towards the real world.
02:19:38.880 | From all the breakthrough performances
02:19:41.960 | in Go and Chess and StarCraft and Dota,
02:19:44.380 | this feels like the real world.
02:19:46.760 | Especially now my mind's been on war,
02:19:49.300 | and military conflict.
02:19:50.720 | This feels like it can give us some deep insights
02:19:53.320 | about human behavior at the large geopolitical scale.
02:19:56.640 | What do you think is the breakthrough
02:20:03.520 | or the directions of work that will take us
02:20:08.960 | towards solving intelligence, towards creating AGI systems?
02:20:13.400 | You've been a part of creating,
02:20:15.600 | by the way, we should say a part of great teams
02:20:19.220 | that do this, of creating systems
02:20:23.040 | that achieve breakthrough performances
02:20:24.860 | on problems previously thought unsolvable,
02:20:28.400 | like poker, multiplayer poker, diplomacy.
02:20:32.040 | We're taking steps towards that direction.
02:20:35.680 | What do you think it takes to go all the way
02:20:37.520 | to create superhuman level intelligence?
02:20:40.640 | - You know, there's a lot of people
02:20:41.480 | trying to figure that out right now.
02:20:43.360 | And I should say, the amount of progress that's been made,
02:20:46.400 | especially in the past few years is truly phenomenal.
02:20:49.360 | I mean, you look at where AI was 10 years ago,
02:20:52.360 | and the idea that you can have AIs
02:20:53.840 | that can generate language and generate images
02:20:56.200 | the way they're doing today,
02:20:57.480 | and able to play a game like diplomacy
02:20:59.440 | was just unthinkable, even five years ago,
02:21:03.400 | let alone 10 years ago.
02:21:04.560 | Now, there are aspects of AI that I think are still lacking.
02:21:11.240 | I think there's general agreement
02:21:15.240 | that one of the major issues with AI today
02:21:17.560 | is that it's very data inefficient.
02:21:19.920 | It requires a huge number of samples of training examples
02:21:24.000 | to be able to train.
02:21:25.320 | You know, you look at an AI that plays Go,
02:21:27.320 | and it needs millions of games of Go
02:21:29.920 | to learn how to play the game well.
02:21:32.000 | Whereas a human can pick it up in like, you know,
02:21:34.160 | I don't know, how many games does a human Go player,
02:21:36.560 | Go grandmaster play in their lifetime?
02:21:38.720 | Probably, you know, in the thousands
02:21:41.280 | or tens of thousands, I guess.
02:21:45.040 | So that's one issue.
02:21:46.200 | - Data efficiency.
02:21:47.080 | - Overcoming this challenge of data efficiency.
02:21:48.680 | And this is particularly important
02:21:50.120 | if we want to deploy AI systems in real world settings
02:21:55.120 | where they're interacting with humans,
02:21:56.480 | because, you know, for example, with robotics,
02:21:58.440 | it's really hard to generate a huge number of samples.
02:22:01.360 | It's a different story when you're working in these,
02:22:04.160 | you know, totally virtual games
02:22:06.100 | where you can play a million games and it's no big deal.
02:22:08.520 | - I was planning on just launching like a thousand
02:22:10.480 | of these robots in Austin.
02:22:12.320 | I don't think it's illegal for legged robots
02:22:14.520 | to roam the streets and just collect data.
02:22:16.920 | - That's not a crazy idea.
02:22:17.760 | - Of course, the worst that could happen.
02:22:18.580 | - Yeah, I mean, that's one way
02:22:20.080 | to overcome the data efficiency problem.
02:22:21.640 | It's like scale it, yeah.
02:22:23.800 | - Like I actually tried to see if there's a law
02:22:26.280 | against robots, like legged robots just operating
02:22:30.080 | in the streets of a major city and there isn't,
02:22:34.040 | I couldn't find any.
02:22:35.160 | So I'll take it all the way to the Supreme Court.
02:22:38.840 | Robot rights.
02:22:40.840 | Okay, anyway, sorry, you were saying.
02:22:42.320 | So what are the ideas for getting,
02:22:45.880 | becoming more data efficient?
02:22:48.040 | - I mean, that's the trillion dollar question in AI today.
02:22:50.840 | I mean, if you can figure out how to make AI systems
02:22:52.840 | more data efficient, then that's a huge breakthrough.
02:22:57.520 | So nobody really knows right now.
02:22:58.960 | - It could be just a gigantic background model,
02:23:01.480 | a language model, and then the training
02:23:03.780 | becomes like prompting that model
02:23:10.000 | to essentially do a kind of querying,
02:23:11.960 | a search into the space of the things it's learned
02:23:14.580 | to customize that to whatever problem you're trying to solve.
02:23:17.480 | So maybe if you form a large enough language model,
02:23:20.080 | you can go quite a long way.
02:23:21.920 | - I think there's some truth to that.
02:23:24.160 | I mean, you look at the way humans approach
02:23:26.840 | a game like poker, they're not coming at it from scratch.
02:23:29.500 | They're coming at it with a huge amount
02:23:31.400 | of background knowledge about how humans work,
02:23:34.640 | how the world works, the idea of money.
02:23:38.040 | So they're able to leverage that kind of information
02:23:41.400 | to pick up the game faster.
02:23:44.380 | So it's not really a fair comparison
02:23:45.920 | to then compare it to an AI
02:23:47.160 | that's like learning from scratch.
02:23:48.260 | And maybe one of the ways that we address
02:23:49.640 | this sample complexity problem is by allowing AIs
02:23:53.560 | to leverage that general knowledge
02:23:55.040 | across a ton of different domains.
02:23:56.740 | - So, like I said, you did a lot of incredible work
02:24:01.600 | in the space of research and actually building systems.
02:24:04.720 | What advice would you give to, let's start with beginners.
02:24:07.720 | What advice would you give to beginners
02:24:09.860 | interested in machine learning?
02:24:11.840 | Just they're at the very start of their journey,
02:24:13.600 | they're in high school and college,
02:24:15.280 | thinking like this seems like a fascinating world.
02:24:18.240 | What advice would you give them?
02:24:19.840 | - I would say that there are a lot of people
02:24:24.800 | working on similar aspects of machine learning
02:24:27.640 | and to not be afraid to try something a bit different.
02:24:31.040 | My own path in AI is pretty atypical
02:24:35.640 | for a machine learning researcher today.
02:24:37.520 | I mean, I started out working on game theory
02:24:39.920 | and then shifting more towards reinforcement learning
02:24:43.740 | as time went on.
02:24:44.880 | And that actually had a lot of benefits, I think,
02:24:46.600 | because it allowed me to look at these problems
02:24:48.960 | in a very different way
02:24:49.920 | from the way a lot of machine learning researchers view it.
02:24:54.120 | And that comes with drawbacks in some respects.
02:24:58.360 | Like I think there's definitely aspects of machine learning
02:25:00.560 | where I'm weaker than most of the researchers out there,
02:25:04.600 | but I think that diversity of perspective,
02:25:07.640 | when I'm working with my teammates,
02:25:10.160 | there's something that I'm bringing to the table
02:25:11.320 | and there's something that they're bringing to the table.
02:25:12.880 | And that kind of collaboration
02:25:13.920 | becomes very fruitful for that reason.
02:25:15.760 | - So there could be problems like poker,
02:25:18.640 | like you've chosen diplomacy,
02:25:20.960 | there could be problems like that still out there
02:25:23.120 | that you can just tackle,
02:25:24.600 | even if it seems extremely difficult.
02:25:26.440 | - I think that there's a lot of challenges left.
02:25:30.520 | And I think having a diversity of viewpoints
02:25:34.560 | and backgrounds is really helpful
02:25:36.480 | for working together to figure out
02:25:38.800 | how to tackle those kinds of challenges.
02:25:40.360 | - So as a beginner, well,
02:25:42.360 | I would say that advice is more for like a grad student
02:25:45.720 | who's already built up a base.
02:25:46.880 | For a complete beginner, what's a good journey?
02:25:49.200 | For you, that was doing some more
02:25:51.360 | on the math side of things, doing game theory,
02:25:53.880 | all that. So it's basically: build up a foundation
02:25:56.760 | in something, so programming, mathematics,
02:25:59.440 | it could even be physics, but build that foundation.
02:26:03.400 | - Yeah, I would say build a strong foundation
02:26:05.400 | in math and computer science and statistics
02:26:07.400 | and these kinds of areas,
02:26:08.280 | but don't be afraid to try something that's different
02:26:11.920 | and learn something that's different
02:26:13.080 | from the thing that everybody else is doing
02:26:15.680 | to get into machine learning.
02:26:17.680 | There's value in having a different background
02:26:20.920 | than everybody else.
02:26:22.040 | Yeah, so, but certainly having a strong math background,
02:26:26.320 | especially in things like linear algebra
02:26:28.080 | and statistics and probability
02:26:30.560 | are incredibly helpful today for learning about
02:26:33.320 | and understanding machine learning.
02:26:35.440 | - Do you think one day we'll be able to,
02:26:37.000 | since you're taking steps from poker to diplomacy,
02:26:39.680 | one day we'll be able to figure out
02:26:43.840 | how to live life optimally?
02:26:45.280 | - Well, what is it, like in poker and diplomacy,
02:26:49.360 | you need a value function, you need to have a reward system.
02:26:52.120 | And so what does it mean to live a life that's optimal?
02:26:55.080 | - So, okay, so then you can exactly like lay down
02:26:59.200 | a reward function being like, I wanna be rich
02:27:02.240 | or I want to be in a happy relationship.
02:27:07.240 | And then you'll say, well, do X.
02:27:10.920 | - There's a lot of talk today about in AI safety circles
02:27:15.760 | about like misspecification of reward function.
02:27:19.160 | So you say like, okay, my objective is to be rich
02:27:22.920 | and maybe the AI tells you like, okay,
02:27:24.600 | well, if you wanna maximize the probability
02:27:26.080 | that you're rich, go rob a bank.
02:27:27.440 | - Sure.
02:27:28.280 | - And so you wanna, is that really what you want?
02:27:30.560 | Is your objective really to be rich at all costs
02:27:33.360 | or is it more nuanced than that?
02:27:35.040 | - So the unintended consequences, yeah.
02:27:38.060 | Yeah, so maybe life is more about defining
02:27:44.480 | the reward function that minimizes
02:27:47.400 | the unintended consequences than it is about
02:27:50.760 | the actual policy that gets you to the reward function.
02:27:53.920 | Maybe life is just about constantly updating
02:27:57.360 | the reward function.
02:27:59.640 | - I think one of the challenges in life
02:28:01.080 | is figuring out exactly what that reward function is.
02:28:04.280 | Sometimes it's pretty hard to specify.
02:28:06.360 | The same way that trying to handcraft the optimal policy
02:28:09.680 | in a game like chess is really difficult.
02:28:12.280 | It's not so clear cut what the reward function is for life.
02:28:15.920 | - I think one day AI will figure it out.
02:28:18.080 | And I wonder what that would be.
02:28:21.960 | Until then, I just really appreciate
02:28:24.880 | the kind of work you're doing.
02:28:25.840 | And it's really fascinating taking a leap
02:28:28.960 | into a more and more real-world-like problem space
02:28:33.960 | and just achieving incredible results
02:28:38.680 | by applying reinforcement learning.
02:28:40.240 | Now, since I saw your work on poker,
02:28:42.600 | you've been a constant inspiration.
02:28:44.360 | It's an honor to get to finally talk to you
02:28:46.240 | and this is really fun.
02:28:47.800 | - Thanks for having me.
02:28:49.400 | - Thanks for listening to this conversation with Noam Brown.
02:28:52.400 | To support this podcast, please check out our sponsors
02:28:54.660 | in the description.
02:28:56.120 | And now, let me leave you with some words from Sun Tzu
02:28:59.560 | in "The Art of War."
02:29:01.720 | "The whole secret lies in confusing the enemy
02:29:04.960 | so that he cannot fathom our real intent."
02:29:08.120 | Thank you for listening and hope to see you next time.
02:29:12.440 | (upbeat music)