
AlphaZero and Self Play (David Silver, DeepMind) | AI Podcast Clips


Chapters

0:00
0:37 What Is Self Play
3:45 Idea for Alpha Zero
13:02 Japanese Chess

Whisper Transcript

00:00:00.000 | So the next incredible step,
00:00:05.000 | right, really the profound step is probably AlphaGo Zero.
00:00:09.280 | I mean, it's arguable, I kind of see them all
00:00:12.440 | as the same place, but really,
00:00:14.120 | and perhaps you were already thinking
00:00:15.840 | that AlphaGo Zero is the natural,
00:00:17.760 | it was always going to be the next step,
00:00:20.240 | but it's removing the reliance on human expert games
00:00:24.440 | for pre-training, as you mentioned.
00:00:26.400 | So how big of an intellectual leap was this
00:00:30.680 | that self-play could achieve
00:00:33.640 | superhuman level performance on its own?
00:00:35.760 | And maybe could you also say what is self-play?
00:00:39.640 | We kind of mentioned it a few times, but.
00:00:41.680 | - So let me start with self-play.
00:00:46.240 | So the idea of self-play is something
00:00:49.360 | which is really about systems learning for themselves,
00:00:53.040 | but in the situation where there's more than one agent.
00:00:56.680 | And so if you're in a game,
00:00:58.820 | and the game is played between two players,
00:01:01.420 | then self-play is really about understanding that game
00:01:04.760 | just by playing games against yourself
00:01:08.480 | rather than against any actual real opponent.
00:01:10.920 | And so it's a way to kind of discover strategies
00:01:14.840 | without having to actually need to go out
00:01:18.400 | and play against any particular human player
00:01:22.960 | for example.
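A minimal sketch of that idea in code, assuming a hypothetical two-player `game` interface and a single `agent` whose policy moves for both sides (illustrative only, not DeepMind's implementation):

```python
# Minimal self-play sketch: one agent plays both sides of a two-player game.
# The game/agent interfaces here are hypothetical, purely for illustration.
def self_play_game(game, agent):
    state = game.initial_state()
    trajectory = []                          # (state, action) pairs from both players
    while not game.is_terminal(state):
        action = agent.select_action(state)  # same policy acts for whichever side is to move
        trajectory.append((state, action))
        state = game.apply(state, action)
    outcome = game.outcome(state)            # e.g. +1 / 0 / -1 from the first player's view
    return trajectory, outcome
```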
00:01:23.800 | The main idea of AlphaZero was really to,
00:01:31.040 | you know, try and step back from any of the knowledge
00:01:36.040 | that we'd put into the system and ask the question,
00:01:38.720 | is it possible to come up with a single elegant principle
00:01:43.720 | by which a system can learn for itself
00:01:46.520 | all of the knowledge which it requires to play a game
00:01:50.280 | such as Go?
00:01:51.780 | Importantly, by taking knowledge out,
00:01:54.160 | you not only make the system less brittle in the sense
00:01:59.160 | that perhaps the knowledge you were putting in
00:02:01.520 | was just getting in the way
00:02:02.840 | and maybe stopping the system learning for itself,
00:02:06.320 | but also you make it more general.
00:02:08.760 | The more knowledge you put in,
00:02:10.680 | the harder it is for that system to actually be
00:02:14.380 | taken out of the domain in which it's been designed
00:02:17.680 | and placed in some other domain
00:02:19.640 | that maybe would need a completely different knowledge base
00:02:21.480 | to understand and perform well.
00:02:23.720 | And so the real goal here is to strip out
00:02:27.160 | all of the knowledge that we put in
00:02:28.760 | to the point that we can just plug it
00:02:30.480 | into something totally different.
00:02:32.700 | And that to me is really, you know,
00:02:34.720 | the promise of AI is that we can have systems such as that,
00:02:38.200 | which, you know, no matter what the goal is,
00:02:40.400 | no matter what goal we set to the system,
00:02:43.760 | we can come up with, we have an algorithm
00:02:46.880 | which can be placed into that world,
00:02:48.560 | into that environment and can succeed
00:02:50.860 | in achieving that goal.
00:02:52.800 | And that, to me, is almost the essence of intelligence,
00:02:57.520 | if we can achieve that.
00:02:58.920 | And so AlphaZero is a step towards that.
00:03:00.920 | And it's a step that was taken in the context
00:03:04.600 | of two-player perfect information games like Go and chess.
00:03:09.600 | We also applied it to Japanese chess.
00:03:12.400 | - So just to clarify, the first step was AlphaGo Zero.
00:03:16.440 | - The first step was to try and take all of the knowledge
00:03:19.940 | out of AlphaGo in such a way that it could play
00:03:23.760 | in a fully self-discovered way, purely from self-play.
00:03:28.760 | And to me, the motivation for that was always
00:03:32.720 | that we could then plug it into other domains,
00:03:35.020 | but we saved that until later.
00:03:37.740 | (both laughing)
00:03:38.960 | - Well, and- - In fact, I mean,
00:03:41.260 | just for fun, I could tell you exactly the moment
00:03:45.240 | where the idea for AlphaZero occurred to me,
00:03:48.760 | 'cause I think there's maybe a lesson there for researchers
00:03:51.280 | who are kind of too deeply embedded in their research
00:03:54.120 | and working 24/7 to try and come up with the next idea,
00:03:57.880 | which is, it actually occurred to me on honeymoon.
00:04:02.980 | (both laughing)
00:04:04.760 | And I was like at my most fully relaxed state,
00:04:08.040 | really enjoying myself, and just bing,
00:04:12.480 | this like the algorithm for AlphaZero just appeared.
00:04:17.080 | And in its full form, and this was actually
00:04:21.480 | before we played against Lee Sedol,
00:04:24.120 | but we just didn't, I think we were so busy
00:04:28.640 | trying to make sure we could beat the world champion
00:04:33.480 | that it was only later that we had the opportunity
00:04:36.840 | to step back and start examining
00:04:38.840 | that sort of deeper scientific question
00:04:41.320 | of whether this could really work.
00:04:43.240 | - So nevertheless, so self-play is probably
00:04:47.160 | one of the most sort of profound ideas
00:04:50.840 | that represents, to me at least,
00:04:54.200 | artificial intelligence.
00:04:56.360 | But the fact that you could use that kind of mechanism
00:05:00.680 | to again, beat world-class players, that's very surprising.
00:05:05.680 | So to me, it feels like you would have to train
00:05:10.080 | on a large number of expert games.
00:05:12.180 | So was it surprising to you, what was the intuition?
00:05:14.540 | Can you sort of think, not necessarily at that time,
00:05:17.400 | even now, what's your intuition,
00:05:18.880 | why this thing works so well?
00:05:20.920 | Why it's able to learn from scratch?
00:05:22.800 | - Well, let me first say why we tried it.
00:05:25.440 | So we tried it both because I feel that
00:05:27.800 | it was the deeper scientific question to be asking
00:05:31.020 | to make progress towards AI,
00:05:32.960 | and also because in general in my research,
00:05:35.840 | I don't like to do research on questions
00:05:38.400 | for which we already know the likely outcome.
00:05:41.880 | I don't see much value in running an experiment
00:05:44.100 | where you're 95% confident that you will succeed.
00:05:48.540 | And so we could have tried, maybe to take AlphaGo
00:05:52.920 | and do something which we knew for sure it would succeed on.
00:05:56.080 | But much more interesting to me was to try it
00:05:58.080 | on the things which we weren't sure about.
00:06:00.300 | And one of the big questions on our minds back then was,
00:06:05.120 | could you really do this with self-play alone?
00:06:07.080 | How far could that go?
00:06:08.540 | Would it be as strong?
00:06:10.440 | And honestly, we weren't sure.
00:06:13.080 | Yeah, it was 50/50, I think.
00:06:14.580 | If you'd asked me, I wasn't confident
00:06:18.280 | that it could reach the same level as these systems,
00:06:21.600 | but it felt like the right question to ask.
00:06:24.760 | And even if it had not achieved the same level,
00:06:27.680 | I felt that that was an important direction to be studying.
00:06:32.680 | And so then lo and behold,
00:06:38.600 | it actually ended up outperforming
00:06:41.240 | the previous version of AlphaGo
00:06:43.320 | and indeed was able to beat it by 100 games to zero.
00:06:46.860 | So what's the intuition as to why?
00:06:50.700 | I think the intuition to me is clear,
00:06:53.320 | that whenever you have errors in a system,
00:06:58.320 | as we did in AlphaGo,
00:07:00.320 | AlphaGo suffered from these delusions.
00:07:02.960 | Occasionally it would misunderstand
00:07:04.200 | what was going on in a position and mis-evaluate it.
00:07:06.840 | How can you remove all of these errors?
00:07:10.680 | Errors arise from many sources.
00:07:12.760 | For us, they were arising both from,
00:07:15.040 | starting from the human data,
00:07:16.200 | but also from the nature of the search
00:07:18.640 | and the nature of the algorithm itself.
00:07:20.720 | But the only way to address them in any complex system
00:07:24.160 | is to give the system the ability to correct its own errors.
00:07:28.920 | It must be able to correct them.
00:07:30.400 | It must be able to learn for itself
00:07:32.360 | when it's doing something wrong and correct for it.
00:07:35.540 | And so it seemed to me that the way to correct delusions
00:07:38.760 | was indeed to have more iterations
00:07:41.600 | of reinforcement learning.
00:07:42.600 | That no matter where you start,
00:07:44.460 | you should be able to correct for those errors
00:07:46.700 | until it gets to play that out and understand,
00:07:49.280 | oh, well, I thought that I was gonna win in this situation,
00:07:52.340 | but then I ended up losing.
00:07:54.200 | That suggests that I was mis-evaluating something
00:07:56.320 | and there's a hole in my knowledge
00:07:57.360 | and now the system can correct for itself
00:07:59.520 | and understand how to do better.
00:08:01.460 | Now, if you take that same idea and trace it back,
00:08:05.240 | all the way to the beginning,
00:08:07.120 | it should be able to take you from no knowledge,
00:08:10.120 | from completely random starting point,
00:08:12.720 | all the way to the highest levels of knowledge
00:08:15.680 | that you can achieve in a domain.
00:08:18.080 | And the principle is the same,
00:08:19.380 | that if you bestow a system with the ability
00:08:22.720 | to correct its own errors,
00:08:24.480 | then it can take you from random
00:08:26.440 | to something slightly better than random
00:08:28.680 | because it sees the stupid things that the random is doing
00:08:31.340 | and it can correct them.
00:08:32.520 | And then it can take you from that slightly better system
00:08:34.920 | and understand, well, what's that doing wrong?
00:08:36.880 | And it takes you on to the next level and the next level.
00:08:40.320 | And this progress can go on indefinitely.
00:08:43.960 | And indeed, what would have happened
00:08:46.200 | if we'd carried on training AlphaGo Zero for longer?
00:08:49.300 | We saw no sign of it slowing down its improvements,
00:08:54.240 | or at least it was certainly carrying on to improve.
00:08:57.560 | And presumably, if you had the computational resources,
00:09:01.960 | this could lead to better and better systems
00:09:05.400 | that discover more and more.
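A hedged sketch of that outer self-correction loop, reusing the hypothetical `self_play_game` sketch above: every visited position is labeled with the game's actual result, so positions the network mis-evaluated become training targets, and the retrained network replaces the old one. This is illustrative structure, not the published AlphaGo Zero training code:

```python
# Illustrative self-improvement loop (hypothetical interfaces, not the published code).
def self_improve(network, game, num_iterations, games_per_iteration):
    for _ in range(num_iterations):
        examples = []
        for _ in range(games_per_iteration):
            trajectory, outcome = self_play_game(game, network)
            # Label each visited position with the actual result: positions the network
            # mis-evaluated ("I thought I was winning, then I lost") become corrections.
            examples.extend((state, outcome) for state, _ in trajectory)
        network = network.trained_on(examples)  # hypothetical supervised update step
    return network
```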
00:09:06.640 | - So your intuition is fundamentally
00:09:09.840 | there's not a ceiling to this process.
00:09:12.520 | One of the surprising things, just like you said,
00:09:15.560 | is the process of patching errors.
00:09:18.280 | It intuitively makes sense.
00:09:20.240 | And reinforcement learning should be part of that process.
00:09:24.520 | But what is surprising is in the process of patching
00:09:27.740 | your own lack of knowledge,
00:09:30.200 | you don't open up other holes.
00:09:32.920 | You keep sort of,
00:09:34.680 | like there's a monotonic decrease of your weaknesses.
00:09:39.440 | - Well, let me back this up.
00:09:41.080 | I think science always should make falsifiable hypotheses.
00:09:44.720 | So let me back up this claim
00:09:46.240 | with a falsifiable hypothesis,
00:09:47.960 | which is that if someone was to, in the future,
00:09:50.680 | take AlphaZero as an algorithm
00:09:53.280 | and run it with greater computational resources
00:09:58.360 | than we had available today,
00:10:00.440 | then I would predict that they would be able
00:10:03.760 | to beat the previous system 100 games to zero.
00:10:06.280 | And that if they were then to do the same thing
00:10:08.160 | a couple of years later,
00:10:09.320 | that that would beat that previous system
00:10:11.600 | 100 games to zero.
00:10:13.000 | And that that process would continue indefinitely
00:10:16.080 | throughout at least my human lifetime.
00:10:18.480 | - Presumably the game of Go would set the ceiling.
00:10:21.920 | I mean-
00:10:22.760 | - The game of Go would set the ceiling,
00:10:24.140 | but the game of Go has 10 to the 170 states in it.
00:10:26.920 | So the ceiling is unreachable by any computational device
00:10:31.320 | that can be built out of the, you know,
00:10:33.680 | 10 to the 80 atoms in the universe.
00:10:36.280 | You asked a really good question, which is,
00:10:39.920 | do you not open up other errors
00:10:42.040 | when you correct your previous ones?
00:10:44.560 | And the answer is yes, you do.
00:10:47.080 | And so it's a remarkable fact
00:10:49.560 | about this class of two-player game,
00:10:53.160 | and also true of single agent games,
00:10:56.120 | that essentially progress will always lead you forward.
00:11:01.120 | If you have sufficient representational resources,
00:11:05.980 | like imagine you could
00:11:07.520 | represent every state in a big table of the game,
00:11:11.080 | then we know for sure that a process of self-improvement
00:11:14.960 | will lead all the way, in the single-agent case,
00:11:18.040 | to the optimal possible behavior.
00:11:19.960 | And in the two-player case to the minimax optimal behavior.
00:11:22.720 | That is the best way that I can play,
00:11:26.240 | knowing that you're playing perfectly against me.
00:11:28.960 | And so for those cases, we know that
00:11:31.800 | even if you do open up some new error,
00:11:35.640 | that in some sense you've made progress.
00:11:38.160 | You're progressing towards the best that can be done.
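The "big table" case he refers to can be made concrete: in a two-player zero-sum game, the fixed point that tabular self-improvement converges to is the minimax value. A small negamax sketch over a hypothetical enumerable game interface computes that value directly; this is only feasible for tiny games, which is exactly why Go's roughly 10^170 states keep the ceiling out of reach:

```python
# Negamax sketch: the minimax-optimal value that tabular self-improvement converges to.
# Only practical for tiny games; the game interface is hypothetical.
def negamax(game, state, cache=None):
    """Exact game value for the player to move (+1 win, 0 draw, -1 loss)."""
    if cache is None:
        cache = {}
    if state in cache:
        return cache[state]
    if game.is_terminal(state):
        value = game.outcome_for_player_to_move(state)
    else:
        # Best reply assuming the opponent also plays perfectly (so their value is negated).
        value = max(-negamax(game, game.apply(state, a), cache)
                    for a in game.legal_actions(state))
    cache[state] = value
    return value
```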
00:11:41.400 | - So AlphaGo was initially trained on expert games
00:11:46.160 | with some self-play.
00:11:47.400 | AlphaGo Zero removed the need to be trained on expert games.
00:11:51.320 | And then another incredible step for me,
00:11:54.880 | 'cause I just love chess,
00:11:56.680 | is to generalize that further, in AlphaZero,
00:12:00.440 | to be able to play the game of Go,
00:12:03.160 | beating AlphaGo Zero and AlphaGo,
00:12:05.680 | and then also being able to play the game of chess
00:12:09.120 | and others.
00:12:10.080 | So what was that step like?
00:12:11.920 | What were the interesting aspects there
00:12:14.520 | that were required to make that happen?
00:12:17.600 | - I think the remarkable observation,
00:12:20.880 | which we saw with AlphaZero,
00:12:22.880 | was that actually without modifying the algorithm at all,
00:12:26.680 | it was able to play and crack
00:12:28.440 | some of AI's greatest previous challenges.
00:12:32.240 | In particular, we dropped it into the game of chess.
00:12:35.720 | And unlike the previous systems like Deep Blue,
00:12:38.120 | which had been worked on for years and years,
00:12:41.360 | we were able to beat
00:12:43.560 | the world's strongest computer chess program convincingly
00:12:48.240 | using a system that was fully discovered
00:12:51.880 | on its own, from scratch, with its own principles.
00:12:55.880 | And in fact, one of the nice things that we found
00:12:59.120 | was that we also achieved the same result
00:13:02.480 | in Japanese chess, a variant of chess
00:13:04.440 | where you get to capture pieces
00:13:06.120 | and then place them back down on your own side
00:13:08.600 | as an extra piece.
00:13:09.920 | So a much more complicated variant of chess.
00:13:12.800 | And we also beat the world's strongest programs
00:13:15.720 | and reached superhuman performance in that game too.
00:13:19.000 | And the very first time
00:13:21.040 | that we'd ever run the system on that particular game
00:13:25.440 | was the version that we published in the paper on AlphaZero.
00:13:28.640 | It just worked out of the box, literally, no touching it.
00:13:32.680 | We didn't have to do anything.
00:13:33.800 | And there it was, superhuman performance,
00:13:36.240 | no tweaking, no twiddling.
00:13:37.920 | And so I think there's something beautiful
00:13:40.520 | about that principle that you can take an algorithm
00:13:43.920 | and without twiddling anything, it just works.
00:13:48.640 | Now, to go beyond AlphaZero, what's required?
00:13:53.640 | AlphaZero is just a step.
00:13:56.400 | And there's a long way to go beyond that
00:13:57.880 | to really crack the deep problems of AI.
00:14:00.880 | But one of the important steps is to acknowledge
00:14:04.440 | that the world is a really messy place.
00:14:07.000 | You know, it's this rich, complex, beautiful,
00:14:09.440 | but messy environment that we live in.
00:14:12.920 | And no one gives us the rules.
00:14:14.360 | Like no one knows the rules of the world.
00:14:17.040 | At least maybe we understand that it operates
00:14:19.440 | according to Newtonian or quantum mechanics
00:14:22.120 | at the micro level or according to relativity
00:14:24.960 | at the macro level.
00:14:26.080 | But that's not a model that's useful for us as people
00:14:29.360 | to operate in it.
00:14:31.160 | Somehow the agent needs to understand the world for itself
00:14:34.720 | in a way where no one tells it the rules of the game,
00:14:37.240 | and yet it can still figure out what to do in that world,
00:14:41.800 | deal with this stream of observations coming in,
00:14:44.520 | rich sensory input coming in,
00:14:46.240 | actions going out in a way that allows it to reason
00:14:49.240 | in the way that AlphaGo or AlphaZero can reason,
00:14:52.400 | in the way that these Go and chess-playing programs
00:14:54.600 | can reason, but in a way that allows it to take actions
00:14:58.720 | in that messy world to achieve its goals.
00:15:01.320 | And so this led us to the most recent step
00:15:06.200 | in the story of AlphaGo, which was a system called MuZero.
00:15:10.440 | And MuZero is a system which learns for itself
00:15:14.360 | even when the rules are not given to it.
00:15:16.400 | It actually can be dropped into a system
00:15:19.160 | with messy perceptual inputs.
00:15:20.680 | We actually tried it in some Atari games,
00:15:24.800 | the canonical domains of Atari that have been used
00:15:28.080 | for reinforcement learning.
00:15:29.480 | And this system learned to build a model
00:15:33.840 | of these Atari games that was sufficiently rich
00:15:37.880 | and useful enough for it to be able to plan successfully.
00:15:42.320 | And in fact, that system not only went on
00:15:44.440 | to beat the state of the art in Atari,
00:15:47.600 | but the same system without modification
00:15:50.240 | was able to reach the same level of superhuman performance
00:15:53.920 | in Go, chess, and shogi that we'd seen in AlphaZero,
00:15:57.840 | showing that even without the rules,
00:15:59.640 | the system can learn for itself just by trial and error,
00:16:02.040 | just by playing this game of Go
00:16:04.040 | and no one tells you what the rules are,
00:16:05.960 | but you just get to the end
00:16:07.120 | and someone says, you know, win or loss.
00:16:09.640 | You play this game of chess and someone says win or loss,
00:16:12.960 | or you play a game of breakout in Atari
00:16:16.480 | and someone just tells you, you know, your score at the end.
00:16:18.960 | And the system for itself figures out
00:16:21.520 | essentially the rules of the system,
00:16:22.840 | the dynamics of the world, how the world works.
00:16:26.120 | And not in any explicit way,
00:16:29.040 | but just implicitly enough understanding
00:16:31.680 | for it to be able to plan in that system
00:16:34.000 | in order to achieve its goals.
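Structurally, MuZero replaces the given rules with three learned functions, usually written h (representation), g (dynamics), and f (prediction): observations are encoded into a latent state, the dynamics function predicts the next latent state and reward for a chosen action, and the prediction function outputs a policy and value, so the tree search plans entirely in latent space. A simplified sketch of that interface, not the published implementation:

```python
# Simplified sketch of MuZero's learned model interface (not the published code).
class LearnedModel:
    def __init__(self, representation_net, dynamics_net, prediction_net):
        self.h = representation_net  # observations     -> latent state
        self.g = dynamics_net        # (latent, action) -> (next latent, predicted reward)
        self.f = prediction_net      # latent           -> (policy, value)

    def initial_inference(self, observations):
        """Encode raw observations and evaluate them, to start a search."""
        latent = self.h(observations)
        policy, value = self.f(latent)
        return latent, policy, value

    def recurrent_inference(self, latent, action):
        """Imagine one step forward in latent space, without knowing the real rules."""
        next_latent, reward = self.g(latent, action)
        policy, value = self.f(next_latent)
        return next_latent, reward, policy, value
```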
00:16:36.400 | - And that's the fundamental process
00:16:39.000 | that you have to go through when you're facing
00:16:40.600 | any uncertain kind of environment
00:16:42.440 | as you would in the real world,
00:16:44.120 | is figuring out the rules,
00:16:46.000 | the basic rules of the game.
00:16:47.480 | - That's right.
00:16:48.320 | - So, I mean, yeah, that allows it to be applicable
00:16:51.560 | to basically any domain that could be digitized
00:16:55.560 | in the way that it needs to in order to be consumable,
00:17:00.960 | sort of in order for the reinforcement learning framework
00:17:03.080 | to be able to sense the environment,
00:17:04.640 | to be able to act in the environment and so on.
00:17:06.520 | - The full reinforcement learning problem
00:17:07.960 | needs to deal with worlds that are unknown and complex
00:17:12.280 | and the agent needs to learn for itself
00:17:14.640 | how to deal with that.
00:17:15.800 | And so MuZero is a step, a further step in that direction.
00:17:19.720 | (upbeat music)