AlphaZero and Self Play (David Silver, DeepMind) | AI Podcast Clips
Chapters
0:37 What Is Self Play
3:45 Idea for AlphaZero
13:02 Japanese Chess
- Right, really the profound step is probably AlphaGo Zero. I mean, it's arguable, I kind of see them all as steps along the way, but the key step is removing the reliance on human expert games.

- And maybe could you also say, what is self-play?

- Self-play is really about systems learning for themselves, but in the situation where there's more than one agent. So if you're in a game, then self-play is really about understanding that game by playing against yourself rather than against any actual real opponent. And so it's a way to kind of discover strategies without needing a human opponent at all.
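To make that concrete, here is a minimal, self-contained sketch of a self-play loop (an illustration, not DeepMind's code): a single tabular policy controls both sides of a toy Nim-style game and improves purely from the outcomes of its own games.

```python
import random
from collections import defaultdict

ACTIONS = (1, 2, 3)  # take 1-3 stones per turn; taking the last stone wins

def self_play_episode(values, epsilon=0.1):
    """One game where the same (tabular) policy controls both players."""
    stones, player, moves = 10, 0, []
    while stones > 0:
        legal = [a for a in ACTIONS if a <= stones]
        if random.random() < epsilon:
            action = random.choice(legal)  # explore
        else:
            action = max(legal, key=lambda a: values[(stones, a)])  # exploit
        moves.append((player, stones, action))
        stones -= action
        player = 1 - player
    return moves, 1 - player  # the player who took the last stone won

def train(num_games=20_000, lr=0.1):
    values = defaultdict(float)  # (stones_left, action) -> estimated value
    for _ in range(num_games):
        moves, winner = self_play_episode(values)
        for player, stones, action in moves:
            result = 1.0 if player == winner else -1.0
            values[(stones, action)] += lr * (result - values[(stones, action)])
    return values

values = train()
# Perfect play leaves a multiple of 4 stones; from 10, taking 2 (leaving 8)
# should typically emerge as the best first move.
print(max(ACTIONS, key=lambda a: values[(10, a)]))
```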
And the idea with AlphaGo Zero was to, you know, try and step back from any of the knowledge that we'd put into the system and ask the question: is it possible to come up with a single elegant principle by which a system can learn for itself all of the knowledge which it requires to play a game like Go? Because if you can do that, you not only make the system less brittle, in the sense that perhaps the knowledge you were putting in was wrong and maybe stopping the system learning for itself, but also, the more knowledge you put in, the harder it is for the system to actually be placed elsewhere, taken out of the domain in which it's kind of been designed and put into some other domain that maybe would need a completely different knowledge base. And to me, the promise of AI is that we can have systems such as that, systems that learn for themselves. And that, to me, is almost the essence of intelligence. And it's a step that was taken in the context of two-player perfect information games like Go and chess.
- So just to clarify, the first step was AlphaGo Zero.

- The first step was to try and take all of the knowledge out of AlphaGo in such a way that it could play in a fully self-discovered way, purely from self-play. And to me, the motivation for that was always that we could then plug it into other domains.
But just for fun, I could tell you exactly the moment the idea occurred to me, 'cause I think there's maybe a lesson there for researchers who are kind of too deeply embedded in their research and working 24/7 to try and come up with the next idea, which is, it actually occurred to me on honeymoon. I was at my most fully relaxed state, and this, like, the algorithm for AlphaZero just appeared. At the time we were very focused on trying to make sure we could beat the world champion, so it was only later that we had the opportunity to try it out.
- But the fact that you could use that kind of mechanism to, again, beat world-class players, that's very surprising. To me, it feels like you'd have to train against experts. So was it surprising to you? What was the intuition? Can you sort of think through, not necessarily at that time but even now, why this works?
- To me, it was the deeper scientific question to be asking, compared to a question for which we already know the likely outcome. I don't see much value in running an experiment where you're 95% confident that you will succeed. And so we could have tried, maybe, to take AlphaGo and do something which we knew for sure it would succeed on. But much more interesting to me was to try it where we didn't know. And one of the big questions on our minds back then was: could you really do this with self-play alone? Could it reach the same level as these systems trained on human data? And even if it had not achieved the same level, I felt that that was an important direction to be studying.
And indeed, it was able to beat the original AlphaGo by 100 games to zero. One of the things we'd seen in earlier systems was that they could misunderstand what was going on in a position and mis-evaluate it; delusions, we called them. But the only way to address them in any complex system is to give the system the ability to correct its own errors, to understand when it's doing something wrong and correct for it. And so it seemed to me that the way to correct delusions was by self-play: you should be able to correct for those errors by letting the system play on, until it gets to play that out and understand: oh, well, I thought that I was gonna win in this situation, but then I lost. That suggests that I was mis-evaluating something, that there's a hole in my knowledge. Now, if you take that same idea and trace it back, it should be able to take you from no knowledge, from completely random play, all the way to the highest levels of knowledge. A random system plays against itself and gets a little better, because it sees the stupid things that the random player is doing and can correct them. And then it can take you from that slightly better system and understand, well, what's that doing wrong? And it takes you on to the next level and the next level.
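The correction mechanism described here can be sketched in tabular form (an illustration; in the real systems the value function is a neural network trained by gradient descent toward the game outcome, but the idea is the same): whenever a position that was predicted as a win is played out and lost, its value estimate is pulled toward the actual result, shrinking the delusion.

```python
def correct_values(value, states_visited, outcome, lr=0.05):
    """Move each visited state's predicted value toward the real outcome.

    value          -- dict mapping state -> predicted result in [-1, +1]
    states_visited -- states encountered in one self-play game
    outcome        -- the game's actual result: +1 win, -1 loss, 0 draw
    """
    for s in states_visited:
        error = outcome - value.get(s, 0.0)   # "I thought I'd win, but I lost"
        value[s] = value.get(s, 0.0) + lr * error
    return value
```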
Who knows what would have happened if we'd carried on training AlphaGo Zero for longer? We saw no sign of it slowing down its improvements, or at least it was certainly carrying on to improve. And presumably, if you had the computational resources, it would have continued to improve.
- One of the surprising things, just like you said, is that this bootstrapping works at all, and reinforcement learning should be part of that process. But what is surprising is that, in the process of patching your own weaknesses, you don't seem to open up new ones; like, there's a monotonic decrease of your weaknesses.
- I think science always should make falsifiable hypotheses. So let me make one, which is that if someone was to, in the future, take AlphaZero as an algorithm and run it with greater computational resources than we had available, then I would predict that they would be able to beat the previous system 100 games to zero. And that if they were then to do the same thing again, that new system would beat the previous one 100 games to zero. And that that process would continue indefinitely.
- Presumably the game of Go would set the ceiling.

- It would set the ceiling, but the game of Go has 10 to the 170 states in it, so the ceiling is unreachable by any computational device that can be built. And there are known results in the tabular case, that essentially progress will always lead you to the optimum: if you have sufficient representational resource, if you could represent every state in a big table of the game, then we know for sure that a process of self-improvement will lead all the way, in the single-agent case, to the optimal behavior, and in the two-player case, to the minimax optimal behavior. That means I play perfectly, knowing that you're playing perfectly against me. So in that sense, you're progressing towards the best that can be done.
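That tabular claim can be made concrete for a toy Nim game like the one sketched earlier (an illustration): with every state held in a table, the fixed point that repeated self-improvement converges to is exactly the minimax value, here computed directly by backward induction.

```python
from functools import lru_cache

ACTIONS = (1, 2, 3)  # take 1-3 stones per turn; taking the last stone wins

@lru_cache(maxsize=None)
def minimax_value(stones):
    """Value for the player to move: +1 = win, -1 = loss under perfect play."""
    if stones == 0:
        return -1  # the previous player took the last stone; the mover lost
    # The mover picks the action that is worst for the opponent.
    return max(-minimax_value(stones - a) for a in ACTIONS if a <= stones)

# Positions that are multiples of 4 are losses under perfect play:
assert [minimax_value(s) for s in range(1, 9)] == [1, 1, 1, -1, 1, 1, 1, -1]
```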
- So AlphaGo was initially trained on expert games. AlphaGo Zero removed the need to be trained on expert games. And then another incredible step is to generalize that further, to AlphaZero: not just playing Go without human knowledge, but then also being able to play the game of chess and other games.

- What we found was that, actually, without modifying the algorithm at all, it was able to play these other games too. In particular, we dropped it into the game of chess. And unlike the previous systems like Deep Blue, which had been worked on for years and years, AlphaZero was able to beat the world's strongest computer chess program convincingly, using a system that had learned on its own, from scratch, with its own principles.
And in fact, one of the nice things that we found was that we also achieved the same result in Japanese chess, shogi, a game where you can take captured pieces and then place them back down on your own side as your own pieces. We also beat the world's strongest programs and reached superhuman performance in that game too. And the first time that we'd ever run the system on that particular game was the version that we published in the paper on AlphaZero. It just worked out of the box, literally, no touching it.
- There's something just beautiful about that principle: that you can take an algorithm and, without twiddling anything, it just works.
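One way to picture that "no twiddling" generality (a sketch, not the real AlphaZero code): the search and learning code touches only a small game-neutral interface, so moving from Go to chess to shogi means swapping the Game implementation, never the algorithm.

```python
import random
from typing import Hashable, Protocol, Sequence

class Game(Protocol):
    """The small, game-neutral surface the algorithm depends on."""
    def legal_moves(self) -> Sequence[Hashable]: ...
    def play(self, move: Hashable) -> "Game": ...   # returns the next position
    def is_terminal(self) -> bool: ...
    def outcome(self) -> float: ...                 # +1 / -1 / 0, first player's view

def play_one_game(make_game) -> float:
    """The identical loop runs whether make_game builds Go, chess, or shogi."""
    game = make_game()  # only this constructor changes between games
    while not game.is_terminal():
        game = game.play(random.choice(list(game.legal_moves())))
    return game.outcome()
```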
Now, to go beyond AlphaZero, what's required?
- But one of the important steps is to acknowledge that the real world is not like a game of Go or chess. You know, it's this rich, complex, beautiful, but messy environment, and no one gives you the rules. At least maybe we understand that it operates according to quantum mechanics at the micro level, or according to relativity at the macro level. But that's not a model that's useful for us as people to operate in it. Somehow the agent needs to understand the world for itself, in a way where no one tells it the rules of the game, and yet it can still figure out what to do in that world, deal with this stream of observations coming in, actions going out, in a way that allows it to reason in the way that AlphaGo or AlphaZero can reason, in the way that these Go and chess playing programs can reason, but in a way that allows it to take actions in that world.
And so the next step in the story of AlphaGo was a system called MuZero. MuZero is a system which learns for itself even when it isn't given the rules, and we applied it to the canonical domains of Atari that have been used in reinforcement learning. It learned a model of these Atari games that was sufficiently rich and useful enough for it to be able to plan successfully. And in addition, the same system was able to reach the same level of superhuman performance in Go, chess, and shogi that we'd seen in AlphaZero, without ever being told the rules of the game. The system can learn for itself, just by trial and error: you play this game of chess and someone says win or loss; you play a game of Atari and someone just tells you, you know, your score at the end. And the system learns for itself the dynamics of the world, how the world works.
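The structure Silver is describing can be sketched as three learned functions (names and shapes here are illustrative; the real system, Schrittwieser et al. 2020, plans with Monte Carlo tree search over these functions rather than the simple enumeration below): a representation function h, a dynamics function g, and a prediction function f, so planning happens entirely inside the learned model and never consults the real rules.

```python
#   h: observation         -> hidden state                (representation)
#   g: hidden state, move  -> next hidden state, reward   (dynamics)
#   f: hidden state        -> policy, value               (prediction)

def plan_with_learned_model(h, g, f, observation, action_sequences):
    """Score candidate action sequences entirely inside the learned model.

    No environment rules are consulted: each rollout happens in the
    hidden-state space produced by the learned dynamics function g.
    """
    best_score, best_plan = float("-inf"), None
    for actions in action_sequences:
        state, total_reward = h(observation), 0.0
        for a in actions:
            state, reward = g(state, a)     # imagined transition
            total_reward += reward
        _, value = f(state)                 # bootstrap from the predicted value
        score = total_reward + value
        if score > best_score:
            best_score, best_plan = score, actions
    return best_plan
```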
That's the kind of process you have to go through when you're facing an unknown environment.

- So, I mean, yeah, that allows it to be applicable to basically any domain that could be digitized in the way that it needs to in order to be consumable, sort of, in order for the reinforcement learning framework to be able to act in the environment and so on.

- Right, the system needs to deal with worlds that are unknown and complex. And so MuZero is a step, a further step, in that direction.