
AlphaZero and Self Play (David Silver, DeepMind) | AI Podcast Clips


Chapters

0:00
0:37 What Is Self Play
3:45 Idea for Alpha Zero
13:02 Japanese Chess

Whisper Transcript

00:00:00.000 | So the next incredible step,
00:00:05.000 | right, really the profound step is probably AlphaGo Zero.
00:00:09.280 | I mean, it's arguable, I kind of see them all
00:00:12.440 | as the same place, but really,
00:00:14.120 | and perhaps you were already thinking
00:00:15.840 | that AlphaGo Zero is the natural,
00:00:17.760 | it was always going to be the next step,
00:00:20.240 | but it's removing the reliance on human expert games
00:00:24.440 | for pre-training, as you mentioned.
00:00:26.400 | So how big of an intellectual leap was this
00:00:30.680 | that self-play could achieve
00:00:33.640 | superhuman level performance on its own?
00:00:35.760 | And maybe could you also say what is self-play?
00:00:39.640 | We kind of mentioned it a few times, but.
00:00:41.680 | - So let me start with self-play.
00:00:46.240 | So the idea of self-play is something
00:00:49.360 | which is really about systems learning for themselves,
00:00:53.040 | but in the situation where there's more than one agent.
00:00:56.680 | And so if you're in a game,
00:00:58.820 | and the game is played between two players,
00:01:01.420 | then self-play is really about understanding that game
00:01:04.760 | just by playing games against yourself
00:01:08.480 | rather than against any actual real opponent.
00:01:10.920 | And so it's a way to kind of discover strategies
00:01:14.840 | without having to actually need to go out
00:01:18.400 | and play against any particular human player
00:01:22.960 | for example.
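A minimal sketch of that idea in code, assuming a hypothetical two-player `game` interface and a single `agent` whose policy moves for both sides (illustrative only, not DeepMind's implementation):

```python
# Minimal self-play sketch: one agent plays both sides of a two-player game.
# The game/agent interfaces here are hypothetical, purely for illustration.
def self_play_game(game, agent):
    state = game.initial_state()
    trajectory = []                          # (state, action) pairs from both players
    while not game.is_terminal(state):
        action = agent.select_action(state)  # same policy acts for whichever side is to move
        trajectory.append((state, action))
        state = game.apply(state, action)
    outcome = game.outcome(state)            # e.g. +1 / 0 / -1 from the first player's view
    return trajectory, outcome
```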
00:01:23.800 | The main idea of AlphaZero was really to,
00:01:31.040 | you know, try and step back from any of the knowledge
00:01:36.040 | that we'd put into the system and ask the question,
00:01:38.720 | is it possible to come up with a single elegant principle
00:01:43.720 | by which a system can learn for itself
00:01:46.520 | all of the knowledge which it requires to play a game
00:01:50.280 | such as Go?
00:01:51.780 | Importantly, by taking knowledge out,
00:01:54.160 | you not only make the system less brittle in the sense
00:01:59.160 | that perhaps the knowledge you were putting in
00:02:01.520 | was just getting in the way
00:02:02.840 | and maybe stopping the system learning for itself,
00:02:06.320 | but also you make it more general.
00:02:08.760 | The more knowledge you put in,
00:02:10.680 | the harder it is for that system to actually be
00:02:14.380 | taken out of the domain in which it's been designed
00:02:17.680 | and placed in some other domain
00:02:19.640 | that maybe would need a completely different knowledge base
00:02:21.480 | to understand and perform well.
00:02:23.720 | And so the real goal here is to strip out
00:02:27.160 | all of the knowledge that we put in
00:02:28.760 | to the point that we can just plug it
00:02:30.480 | into something totally different.
00:02:32.700 | And that to me is really, you know,
00:02:34.720 | the promise of AI is that we can have systems such as that,
00:02:38.200 | which, you know, no matter what the goal is,
00:02:40.400 | no matter what goal we set to the system,
00:02:43.760 | we can come up with, we have an algorithm
00:02:46.880 | which can be placed into that world,
00:02:48.560 | into that environment and can succeed
00:02:50.860 | in achieving that goal.
00:02:52.800 | And that, to me, is almost the essence of intelligence,
00:02:57.520 | if we can achieve that.
00:02:58.920 | And so AlphaZero is a step towards that.
00:03:00.920 | And it's a step that was taken in the context
00:03:04.600 | of two-player perfect information games like Go and chess.
00:03:09.600 | We also applied it to Japanese chess.
00:03:12.400 | - So just to clarify, the first step was AlphaGo Zero.
00:03:16.440 | - The first step was to try and take all of the knowledge
00:03:19.940 | out of AlphaGo in such a way that it could play
00:03:23.760 | in a fully self-discovered way, purely from self-play.
00:03:28.760 | And to me, the motivation for that was always
00:03:32.720 | that we could then plug it into other domains,
00:03:35.020 | but we saved that until later.
00:03:37.740 | (both laughing)
00:03:38.960 | - Well, and- - In fact, I mean,
00:03:41.260 | just for fun, I could tell you exactly the moment
00:03:45.240 | where the idea for AlphaZero occurred to me,
00:03:48.760 | 'cause I think there's maybe a lesson there for researchers
00:03:51.280 | who are kind of too deeply embedded in their research
00:03:54.120 | and working 24/7 to try and come up with the next idea,
00:03:57.880 | which is, it actually occurred to me on honeymoon.
00:04:02.980 | (both laughing)
00:04:04.760 | And I was like at my most fully relaxed state,
00:04:08.040 | really enjoying myself, and just bing,
00:04:12.480 | this like the algorithm for AlphaZero just appeared.
00:04:17.080 | And in its full form, and this was actually
00:04:21.480 | before we played against Lee Sedol,
00:04:24.120 | but we just didn't, I think we were so busy
00:04:28.640 | trying to make sure we could beat the world champion
00:04:33.480 | that it was only later that we had the opportunity
00:04:36.840 | to step back and start examining
00:04:38.840 | that sort of deeper scientific question
00:04:41.320 | of whether this could really work.
00:04:43.240 | - So nevertheless, so self-play is probably
00:04:47.160 | one of the most sort of profound ideas
00:04:50.840 | that represents, to me at least,
00:04:54.200 | artificial intelligence.
00:04:56.360 | But the fact that you could use that kind of mechanism
00:05:00.680 | to again, beat world-class players, that's very surprising.
00:05:05.680 | So to me, it feels like you would have to train
00:05:10.080 | on a large number of expert games.
00:05:12.180 | So was it surprising to you, what was the intuition?
00:05:14.540 | Can you sort of think, not necessarily at that time,
00:05:17.400 | even now, what's your intuition,
00:05:18.880 | why this thing works so well?
00:05:20.920 | Why it's able to learn from scratch?
00:05:22.800 | - Well, let me first say why we tried it.
00:05:25.440 | So we tried it both because I feel that
00:05:27.800 | it was the deeper scientific question to be asking
00:05:31.020 | to make progress towards AI,
00:05:32.960 | and also because in general in my research,
00:05:35.840 | I don't like to do research on questions
00:05:38.400 | for which we already know the likely outcome.
00:05:41.880 | I don't see much value in running an experiment
00:05:44.100 | where you're 95% confident that you will succeed.
00:05:48.540 | And so we could have tried, maybe to take AlphaGo
00:05:52.920 | and do something which we knew for sure it would succeed on.
00:05:56.080 | But much more interesting to me was to try it
00:05:58.080 | on the things which we weren't sure about.
00:06:00.300 | And one of the big questions on our minds back then was,
00:06:05.120 | could you really do this with self-play alone?
00:06:07.080 | How far could that go?
00:06:08.540 | Would it be as strong?
00:06:10.440 | And honestly, we weren't sure.
00:06:13.080 | Yeah, it was 50/50, I think.
00:06:14.580 | If you'd asked me, I wasn't confident
00:06:18.280 | that it could reach the same level as these systems,
00:06:21.600 | but it felt like the right question to ask.
00:06:24.760 | And even if it had not achieved the same level,
00:06:27.680 | I felt that that was an important direction to be studying.
00:06:32.680 | And so then lo and behold,
00:06:38.600 | it actually ended up outperforming
00:06:41.240 | the previous version of AlphaGo
00:06:43.320 | and indeed was able to beat it by 100 games to zero.
00:06:46.860 | So what's the intuition as to why?
00:06:50.700 | I think the intuition to me is clear,
00:06:53.320 | that whenever you have errors in a system,
00:06:58.320 | as we did in AlphaGo,
00:07:00.320 | AlphaGo suffered from these delusions.
00:07:02.960 | Occasionally it would misunderstand
00:07:04.200 | what was going on in a position and mis-evaluate it.
00:07:06.840 | How can you remove all of these errors?
00:07:10.680 | Errors arise from many sources.
00:07:12.760 | For us, they were arising both from,
00:07:15.040 | starting from the human data,
00:07:16.200 | but also from the nature of the search
00:07:18.640 | and the nature of the algorithm itself.
00:07:20.720 | But the only way to address them in any complex system
00:07:24.160 | is to give the system the ability to correct its own errors.
00:07:28.920 | It must be able to correct them.
00:07:30.400 | It must be able to learn for itself
00:07:32.360 | when it's doing something wrong and correct for it.
00:07:35.540 | And so it seemed to me that the way to correct delusions
00:07:38.760 | was indeed to have more iterations
00:07:41.600 | of reinforcement learning.
00:07:42.600 | That no matter where you start,
00:07:44.460 | you should be able to correct for those errors
00:07:46.700 | until it gets to play that out and understand,
00:07:49.280 | oh, well, I thought that I was gonna win in this situation,
00:07:52.340 | but then I ended up losing.
00:07:54.200 | That suggests that I was mis-evaluating something
00:07:56.320 | and there's a hole in my knowledge
00:07:57.360 | and now the system can correct for itself
00:07:59.520 | and understand how to do better.
00:08:01.460 | Now, if you take that same idea and trace it back,
00:08:05.240 | all the way to the beginning,
00:08:07.120 | it should be able to take you from no knowledge,
00:08:10.120 | from completely random starting point,
00:08:12.720 | all the way to the highest levels of knowledge
00:08:15.680 | that you can achieve in a domain.
00:08:18.080 | And the principle is the same,
00:08:19.380 | that if you bestow a system with the ability
00:08:22.720 | to correct its own errors,
00:08:24.480 | then it can take you from random
00:08:26.440 | to something slightly better than random
00:08:28.680 | because it sees the stupid things that the random is doing
00:08:31.340 | and it can correct them.
00:08:32.520 | And then it can take you from that slightly better system
00:08:34.920 | and understand, well, what's that doing wrong?
00:08:36.880 | And it takes you on to the next level and the next level.
00:08:40.320 | And this progress can go on indefinitely.
00:08:43.960 | And indeed, what would have happened
00:08:46.200 | if we'd carried on training AlphaGo Zero for longer?
00:08:49.300 | We saw no sign of it slowing down its improvements,
00:08:54.240 | or at least it was certainly carrying on to improve.
00:08:57.560 | And presumably, if you had the computational resources,
00:09:01.960 | this could lead to better and better systems
00:09:05.400 | that discover more and more.
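A hedged sketch of that outer self-correction loop, reusing the hypothetical `self_play_game` sketch above: every visited position is labeled with the game's actual result, so positions the network mis-evaluated become training targets, and the retrained network replaces the old one. This is illustrative structure, not the published AlphaGo Zero training code:

```python
# Illustrative self-improvement loop (hypothetical interfaces, not the published code).
def self_improve(network, game, num_iterations, games_per_iteration):
    for _ in range(num_iterations):
        examples = []
        for _ in range(games_per_iteration):
            trajectory, outcome = self_play_game(game, network)
            # Label each visited position with the actual result: positions the network
            # mis-evaluated ("I thought I was winning, then I lost") become corrections.
            examples.extend((state, outcome) for state, _ in trajectory)
        network = network.trained_on(examples)  # hypothetical supervised update step
    return network
```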
00:09:06.640 | - So your intuition is fundamentally
00:09:09.840 | there's not a ceiling to this process.
00:09:12.520 | One of the surprising things, just like you said,
00:09:15.560 | is the process of patching errors.
00:09:18.280 | It intuitively makes sense.
00:09:20.240 | And reinforcement learning should be part of that process.
00:09:24.520 | But what is surprising is in the process of patching
00:09:27.740 | your own lack of knowledge,
00:09:30.200 | you don't open up other holes.
00:09:32.920 | You keep sort of,
00:09:34.680 | like there's a monotonic decrease of your weaknesses.
00:09:39.440 | - Well, let me back this up.
00:09:41.080 | I think science always should make falsifiable hypotheses.
00:09:44.720 | So let me back up this claim
00:09:46.240 | with a falsifiable hypothesis,
00:09:47.960 | which is that if someone was to, in the future,
00:09:50.680 | take AlphaZero as an algorithm
00:09:53.280 | and run it with greater computational resources
00:09:58.360 | than we had available today,
00:10:00.440 | then I would predict that they would be able
00:10:03.760 | to beat the previous system 100 games to zero.
00:10:06.280 | And that if they were then to do the same thing
00:10:08.160 | a couple of years later,
00:10:09.320 | that that would beat that previous system
00:10:11.600 | 100 games to zero.
00:10:13.000 | And that that process would continue indefinitely
00:10:16.080 | throughout at least my human lifetime.
00:10:18.480 | - Presumably the game of Go would set the ceiling.
00:10:21.920 | I mean-
00:10:22.760 | - The game of Go would set the ceiling,
00:10:24.140 | but the game of Go has 10 to the 170 states in it.
00:10:26.920 | So the ceiling is unreachable by any computational device
00:10:31.320 | that can be built out of the, you know,
00:10:33.680 | 10 to the 80 atoms in the universe.
00:10:36.280 | You asked a really good question, which is,
00:10:39.920 | do you not open up other errors
00:10:42.040 | when you correct your previous ones?
00:10:44.560 | And the answer is yes, you do.
00:10:47.080 | And so it's a remarkable fact
00:10:49.560 | about this class of two-player game,
00:10:53.160 | and also true of single agent games,
00:10:56.120 | that essentially progress will always lead you forward.
00:11:01.120 | If you have sufficient representational resources,
00:11:05.980 | like imagine you could
00:11:07.520 | represent every state in a big table of the game,
00:11:11.080 | then we know for sure that a process of self-improvement
00:11:14.960 | will lead all the way, in the single-agent case,
00:11:18.040 | to the optimal possible behavior.
00:11:19.960 | And in the two-player case to the minimax optimal behavior.
00:11:22.720 | That is the best way that I can play,
00:11:26.240 | knowing that you're playing perfectly against me.
00:11:28.960 | And so for those cases, we know that
00:11:31.800 | even if you do open up some new error,
00:11:35.640 | that in some sense you've made progress.
00:11:38.160 | You're progressing towards the best that can be done.
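The "big table" case he refers to can be made concrete: in a two-player zero-sum game, the fixed point that tabular self-improvement converges to is the minimax value. A small negamax sketch over a hypothetical enumerable game interface computes that value directly; this is only feasible for tiny games, which is exactly why Go's roughly 10^170 states keep the ceiling out of reach:

```python
# Negamax sketch: the minimax-optimal value that tabular self-improvement converges to.
# Only practical for tiny games; the game interface is hypothetical.
def negamax(game, state, cache=None):
    """Exact game value for the player to move (+1 win, 0 draw, -1 loss)."""
    if cache is None:
        cache = {}
    if state in cache:
        return cache[state]
    if game.is_terminal(state):
        value = game.outcome_for_player_to_move(state)
    else:
        # Best reply assuming the opponent also plays perfectly (so their value is negated).
        value = max(-negamax(game, game.apply(state, a), cache)
                    for a in game.legal_actions(state))
    cache[state] = value
    return value
```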
00:11:41.400 | - So AlphaGo was initially trained on expert games
00:11:46.160 | with some self-play.
00:11:47.400 | AlphaGo Zero removed the need to be trained on expert games.
00:11:51.320 | And then another incredible step for me,
00:11:54.880 | 'cause I just love chess,
00:11:56.680 | is to generalize that further, in AlphaZero,
00:12:00.440 | to be able to play the game of Go,
00:12:03.160 | beating AlphaGo Zero and AlphaGo,
00:12:05.680 | and then also being able to play the game of chess
00:12:09.120 | and others.
00:12:10.080 | So what was that step like?
00:12:11.920 | What were the interesting aspects there
00:12:14.520 | that were required to make that happen?
00:12:17.600 | - I think the remarkable observation,
00:12:20.880 | which we saw with AlphaZero,
00:12:22.880 | was that actually without modifying the algorithm at all,
00:12:26.680 | it was able to play and crack
00:12:28.440 | some of AI's greatest previous challenges.
00:12:32.240 | In particular, we dropped it into the game of chess.
00:12:35.720 | And unlike the previous systems like Deep Blue,
00:12:38.120 | which had been worked on for years and years,
00:12:41.360 | we were able to beat
00:12:43.560 | the world's strongest computer chess program convincingly
00:12:48.240 | using a system that was fully discovered
00:12:51.880 | on its own, from scratch, with its own principles.
00:12:55.880 | And in fact, one of the nice things that we found
00:12:59.120 | was that we also achieved the same result
00:13:02.480 | in Japanese chess, a variant of chess
00:13:04.440 | where you get to capture pieces
00:13:06.120 | and then place them back down on your own side
00:13:08.600 | as an extra piece.
00:13:09.920 | So a much more complicated variant of chess.
00:13:12.800 | And we also beat the world's strongest programs
00:13:15.720 | and reached superhuman performance in that game too.
00:13:19.000 | And the very first time
00:13:21.040 | that we'd ever run the system on that particular game
00:13:25.440 | was the version that we published in the paper on AlphaZero.
00:13:28.640 | It just worked out of the box, literally, no touching it.
00:13:32.680 | We didn't have to do anything.
00:13:33.800 | And there it was, superhuman performance,
00:13:36.240 | no tweaking, no twiddling.
00:13:37.920 | And so I think there's something beautiful
00:13:40.520 | about that principle that you can take an algorithm
00:13:43.920 | and without twiddling anything, it just works.
00:13:48.640 | Now, to go beyond AlphaZero, what's required?
00:13:53.640 | AlphaZero is just a step.
00:13:56.400 | And there's a long way to go beyond that
00:13:57.880 | to really crack the deep problems of AI.
00:14:00.880 | But one of the important steps is to acknowledge
00:14:04.440 | that the world is a really messy place.
00:14:07.000 | You know, it's this rich, complex, beautiful,
00:14:09.440 | but messy environment that we live in.
00:14:12.920 | And no one gives us the rules.
00:14:14.360 | Like no one knows the rules of the world.
00:14:17.040 | At least maybe we understand that it operates
00:14:19.440 | according to Newtonian or quantum mechanics
00:14:22.120 | at the micro level or according to relativity
00:14:24.960 | at the macro level.
00:14:26.080 | But that's not a model that's useful for us as people
00:14:29.360 | to operate in it.
00:14:31.160 | Somehow the agent needs to understand the world for itself
00:14:34.720 | in a way where no one tells it the rules of the game,
00:14:37.240 | and yet it can still figure out what to do in that world,
00:14:41.800 | deal with this stream of observations coming in,
00:14:44.520 | rich sensory input coming in,
00:14:46.240 | actions going out in a way that allows it to reason
00:14:49.240 | in the way that AlphaGo or AlphaZero can reason,
00:14:52.400 | in the way that these Go and chess-playing programs
00:14:54.600 | can reason, but in a way that allows it to take actions
00:14:58.720 | in that messy world to achieve its goals.
00:15:01.320 | And so this led us to the most recent step
00:15:06.200 | in the story of AlphaGo, which was a system called MuZero.
00:15:10.440 | And MuZero is a system which learns for itself
00:15:14.360 | even when the rules are not given to it.
00:15:16.400 | It actually can be dropped into a system
00:15:19.160 | with messy perceptual inputs.
00:15:20.680 | We actually tried it in some Atari games,
00:15:24.800 | the canonical domains of Atari that have been used
00:15:28.080 | for reinforcement learning.
00:15:29.480 | And this system learned to build a model
00:15:33.840 | of these Atari games that was sufficiently rich
00:15:37.880 | and useful enough for it to be able to plan successfully.
00:15:42.320 | And in fact, that system not only went on
00:15:44.440 | to beat the state of the art in Atari,
00:15:47.600 | but the same system without modification
00:15:50.240 | was able to reach the same level of superhuman performance
00:15:53.920 | in Go, chess, and shogi that we'd seen in AlphaZero,
00:15:57.840 | showing that even without the rules,
00:15:59.640 | the system can learn for itself just by trial and error,
00:16:02.040 | just by playing this game of Go
00:16:04.040 | and no one tells you what the rules are,
00:16:05.960 | but you just get to the end
00:16:07.120 | and someone says, you know, win or loss.
00:16:09.640 | You play this game of chess and someone says win or loss,
00:16:12.960 | or you play a game of breakout in Atari
00:16:16.480 | and someone just tells you, you know, your score at the end.
00:16:18.960 | And the system for itself figures out
00:16:21.520 | essentially the rules of the system,
00:16:22.840 | the dynamics of the world, how the world works.
00:16:26.120 | And not in any explicit way,
00:16:29.040 | but just implicitly enough understanding
00:16:31.680 | for it to be able to plan in that system
00:16:34.000 | in order to achieve its goals.
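Structurally, MuZero replaces the given rules with three learned functions, usually written h (representation), g (dynamics), and f (prediction): observations are encoded into a latent state, the dynamics function predicts the next latent state and reward for a chosen action, and the prediction function outputs a policy and value, so the tree search plans entirely in latent space. A simplified sketch of that interface, not the published implementation:

```python
# Simplified sketch of MuZero's learned model interface (not the published code).
class LearnedModel:
    def __init__(self, representation_net, dynamics_net, prediction_net):
        self.h = representation_net  # observations     -> latent state
        self.g = dynamics_net        # (latent, action) -> (next latent, predicted reward)
        self.f = prediction_net      # latent           -> (policy, value)

    def initial_inference(self, observations):
        """Encode raw observations and evaluate them, to start a search."""
        latent = self.h(observations)
        policy, value = self.f(latent)
        return latent, policy, value

    def recurrent_inference(self, latent, action):
        """Imagine one step forward in latent space, without knowing the real rules."""
        next_latent, reward = self.g(latent, action)
        policy, value = self.f(next_latent)
        return next_latent, reward, policy, value
```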
00:16:36.400 | - And that's the fundamental process
00:16:39.000 | that you have to go through when you're facing
00:16:40.600 | any uncertain kind of environment
00:16:42.440 | as you would in the real world,
00:16:44.120 | is figuring out the rules,
00:16:46.000 | the basic rules of the game.
00:16:47.480 | - That's right.
00:16:48.320 | - So, I mean, yeah, that allows it to be applicable
00:16:51.560 | to basically any domain that could be digitized
00:16:55.560 | in the way that it needs to in order to be consumable,
00:17:00.960 | sort of in order for the reinforcement learning framework
00:17:03.080 | to be able to sense the environment,
00:17:04.640 | to be able to act in the environment and so on.
00:17:06.520 | - The full reinforcement learning problem
00:17:07.960 | needs to deal with worlds that are unknown and complex
00:17:12.280 | and the agent needs to learn for itself
00:17:14.640 | how to deal with that.
00:17:15.800 | And so MuZero is a step, a further step in that direction.
00:17:19.720 | (upbeat music)