
AlphaZero and Self Play (David Silver, DeepMind) | AI Podcast Clips


Chapters

0:00
0:37 What Is Self Play
3:45 Idea for AlphaZero
13:02 Japanese Chess

Transcript

So the next incredible step, right, really the profound step is probably AlphaGo Zero. I mean, it's arguable, I kind of see them all as the same place, but really, and perhaps you were already thinking that AlphaGo Zero is the natural, it was always going to be the next step, but it's removing the reliance on human expert games for pre-training, as you mentioned.

So how big of an intellectual leap was this that self-play could achieve superhuman level performance on its own? And maybe could you also say what is self-play? We kind of mentioned it a few times, but. - So let me start with self-play. So the idea of self-play is something which is really about systems learning for themselves, but in the situation where there's more than one agent.

And so if you're in a game, and the game is played between two players, then self-play is really about understanding that game just by playing games against yourself rather than against any actual real opponent. And so it's a way to kind of discover strategies without having to actually need to go out and play against any particular human player for example.
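
To make the idea concrete, here is a minimal self-play sketch on a toy game (Nim: take one to three stones from a pile, and whoever takes the last stone wins). The game, the value table, and all the names are illustrative stand-ins chosen for this transcript, not anything from the AlphaGo or AlphaZero codebases; the point is only that a single policy plays both sides and that its own games become its training data.

```python
import random
from collections import defaultdict

PILE = 10                          # starting pile size (toy choice)
value = defaultdict(float)         # value[pile]: estimated result for the player to move
value[0] = -1.0                    # an empty pile means the player to move has already lost

def choose_move(pile, eps=0.2):
    """Mostly pick the move our own value table likes; sometimes explore randomly."""
    moves = [m for m in (1, 2, 3) if m <= pile]
    if random.random() < eps:
        return random.choice(moves)
    # A move is good for us if it leaves the opponent in a bad position.
    return max(moves, key=lambda m: -value[pile - m])

def self_play_game():
    """Play one game against ourselves; record every position and who won."""
    pile, player, history = PILE, 0, []
    while pile > 0:
        history.append((pile, player))
        pile -= choose_move(pile)
        player = 1 - player
    return history, 1 - player     # the player who just took the last stone won

def train(n_games=5000, lr=0.1):
    for _ in range(n_games):
        history, winner = self_play_game()
        for pile, player in history:
            target = 1.0 if player == winner else -1.0
            value[pile] += lr * (target - value[pile])   # learn from our own games

train()
print({p: round(value[p], 2) for p in range(1, PILE + 1)})
```

After a few thousand games against itself, the table should end up roughly positive for piles that are not multiples of four, which is the known winning condition for this toy game, discovered without ever facing an external opponent.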

The main idea of AlphaZero was really to, you know, try and step back from any of the knowledge that we'd put into the system and ask the question, is it possible to come up with a single elegant principle by which a system can learn for itself all of the knowledge which it requires to play a game such as Go?

Importantly, by taking knowledge out, you not only make the system less brittle in the sense that perhaps the knowledge you were putting in was just getting in the way and maybe stopping the system learning for itself, but also you make it more general. The more knowledge you put in, the harder it is for a system to actually be placed, taken out of the system in which it's kind of been designed and placed in some other system that maybe would need a completely different knowledge base to understand and perform well.

And so the real goal here is to strip out all of the knowledge that we put in to the point that we can just plug it into something totally different. And that to me is really, you know, the promise of AI is that we can have systems such as that, which, you know, no matter what the goal is, no matter what goal we set to the system, we can come up with, we have an algorithm which can be placed into that world, into that environment and can succeed in achieving that goal.

And then that's to me is almost the essence of intelligence if we can achieve that. And so AlphaZero is a step towards that. And it's a step that was taken in the context of two-player perfect information games like Go and chess. We also applied it to Japanese chess. - So just to clarify, the first step was AlphaGo Zero.

- The first step was to try and take all of the knowledge out of AlphaGo in such a way that it could play in a fully self-discovered way, purely from self-play. And to me, the motivation for that was always that we could then plug it into other domains, but we saved that until later.

(both laughing) - Well, and- - In fact, I mean, just for fun, I could tell you exactly the moment where the idea for AlphaZero occurred to me, 'cause I think there's maybe a lesson there for researchers who are kind of too deeply embedded in their research and working 24/7 to try and come up with the next idea, which is, it actually occurred to me on honeymoon.

(both laughing) And I was at my most fully relaxed state, really enjoying myself, and just, bing, the algorithm for AlphaZero just appeared, and in its full form. And this was actually before we played against Lee Sedol, but we just didn't, I think we were so busy trying to make sure we could beat the world champion that it was only later that we had the opportunity to step back and start examining that sort of deeper scientific question of whether this could really work.

- So nevertheless, self-play is probably one of the most profound ideas that represents, to me at least, artificial intelligence. But the fact that you could use that kind of mechanism to, again, beat world-class players, that's very surprising. To me, it feels like you would have to train on a large number of expert games.

So was it surprising to you, what was the intuition? Can you sort of think, not necessarily at that time, even now, what's your intuition, why this thing works so well? Why it's able to learn from scratch? - Well, let me first say why we tried it. So we tried it both because I feel that it was the deeper scientific question to be asking to make progress towards AI, and also because in general in my research, I don't like to do research on questions for which we already know the likely outcome.

I don't see much value in running an experiment where you're 95% confident that you will succeed. And so we could have tried, maybe to take AlphaGo and do something which we knew for sure it would succeed on. But much more interesting to me was to try it on the things which we weren't sure about.

And one of the big questions on our minds back then was, could you really do this with self-play alone? How far could that go? Would it be as strong? And honestly, we weren't sure. Yeah, it was 50/50, I think. If you'd asked me, I wasn't confident that it could reach the same level as these systems, but it felt like the right question to ask.

And even if it had not achieved the same level, I felt that that was an important direction to be studying. And so then lo and behold, it actually ended up outperforming the previous version of AlphaGo and indeed was able to beat it by 100 games to zero. So what's the intuition as to why?

I think the intuition to me is clear, that whenever you have errors in a system, as we did in AlphaGo, AlphaGo suffered from these delusions. Occasionally it would misunderstand what was going on in a position and mis-evaluate it. How can you remove all of these errors? Errors arise from many sources.

For us, they were arising both from starting from the human data, and also from the nature of the search and the nature of the algorithm itself. But the only way to address them in any complex system is to give the system the ability to correct its own errors. It must be able to correct them.

It must be able to learn for itself when it's doing something wrong and correct for it. And so it seemed to me that the way to correct delusions was indeed to have more iterations of reinforcement learning. That no matter where you start, you should be able to correct for those errors until it gets to play that out and understand, oh, well, I thought that I was gonna win in this situation, but then I ended up losing.

That suggests that I was mis-evaluating something and there's a hole in my knowledge and now the system can correct for itself and understand how to do better. Now, if you take that same idea and trace it back, all the way to the beginning, it should be able to take you from no knowledge, from a completely random starting point, all the way to the highest levels of knowledge that you can achieve in a domain.

And the principle is the same, that if you bestow a system with the ability to correct its own errors, then it can take you from random to something slightly better than random because it sees the stupid things that the random player is doing and it can correct them. And then it can take you from that slightly better system and understand, well, what's that doing wrong?

And it takes you on to the next level and the next level. And this progress can go on indefinitely. And indeed, what would have happened if we'd carried on training AlphaGo Zero for longer? We saw no sign of it slowing down its improvements, or at least it was certainly carrying on to improve.
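
To make that correction loop concrete: in the published AlphaGo Zero paper, each self-play game yields a final outcome z and, at every position, the search probabilities π produced by MCTS; the network f_θ(s) = (p, v) is then trained to pull its value prediction v toward what actually happened and its move probabilities p toward the search, which is exactly the "fix your own mis-evaluations" mechanism described above:

```latex
% AlphaGo Zero / AlphaZero training loss (Silver et al., 2017):
% the value head v is corrected toward the actual game outcome z,
% the policy head p toward the MCTS visit distribution \pi,
% with L2 regularisation on the network weights \theta.
l = (z - v)^2 \;-\; \pi^{\top} \log p \;+\; c\,\lVert \theta \rVert^{2}
```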

And presumably, if you had the computational resources, this could lead to better and better systems that discover more and more. - So your intuition is, fundamentally, there's not a ceiling to this process. One of the surprising things, just like you said, is the process of patching errors. It intuitively makes sense.

And reinforcement learning should be part of that process. But what is surprising is that in the process of patching your own lack of knowledge, you don't open up other holes. You keep sort of, like there's a monotonic decrease of your weaknesses. - Well, let me back this up. I think science always should make falsifiable hypotheses.

So let me back up this claim with a falsifiable hypothesis, which is that if someone was to, in the future, take AlphaZero as an algorithm and run it with greater computational resources than we have available today, then I would predict that they would be able to beat the previous system 100 games to zero.

And that if they were then to do the same thing a couple of years later, that that would beat that previous system 100 games to zero. And that that process would continue indefinitely throughout at least my human lifetime. - Presumably the game of Go would set the ceiling. I mean- - The game of Go would set the ceiling, but the game of Go has 10 to the 170 states in it.

So the ceiling is unreachable by any computational device that can be built out of the, you know, 10 to the 80 atoms in the universe. You asked a really good question, which is, do you not open up other errors when you correct your previous ones? And the answer is yes, you do.

And so it's a remarkable fact about this class of two-player games, and also true of single-agent games, that essentially progress will always lead you forward. If you have sufficient representational resources, like imagine you could represent every state of the game in a big table, then we know for sure that a process of self-improvement will lead all the way, in the single-agent case, to the optimal possible behavior.

And in the two-player case to the minimax optimal behavior. That is the best way that I can play, knowing that you're playing perfectly against me. And so for those cases, we know that even if you do open up some new error, that in some sense you've made progress. You're progressing towards the best that can be done.
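
The "big table" limit is easy to see on a small game. Below is a short sketch using the same toy Nim game as in the earlier self-play example (purely illustrative, obviously nothing like Go's 10 to the 170 states): because every position fits in a cache, exhaustive search returns the minimax-optimal value exactly, and it is this table that the self-play process is converging toward.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def minimax_value(pile):
    """Minimax value of the position for the player to move: +1 win, -1 loss."""
    if pile == 0:
        return -1                      # no stones left: the previous player already won
    # The best we can do is pick the move that leaves the opponent worst off.
    return max(-minimax_value(pile - m) for m in (1, 2, 3) if m <= pile)

print([minimax_value(p) for p in range(0, 13)])
# Piles that are multiples of 4 come out as -1: losing for the player to move.
```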

- So AlphaGo was initially trained on expert games with some self-play. AlphaGo Zero removed the need to be trained on expert games. And then another incredible step for me, 'cause I just love chess, is to generalize that further, in AlphaZero, to be able to play the game of Go, beating AlphaGo Zero and AlphaGo, and then also being able to play the game of chess and others.

So what was that step like? What's the interesting aspects there that required to make that happen? - I think the remarkable observation, which we saw with AlphaZero, was that actually without modifying the algorithm at all, it was able to play and crack some of AI's greatest previous challenges. In particular, we dropped it into the game of chess.

And unlike previous systems like Deep Blue, which had been worked on for years and years, we were able to beat the world's strongest computer chess program convincingly using a system that was fully discovered on its own, from scratch, with its own principles. And in fact, one of the nice things that we found was that we also achieved the same result in Japanese chess, shogi, a variant of chess where you get to capture pieces and then place them back down on your own side as an extra piece.

So a much more complicated variant of chess. And we also beat the world's strongest programs and reached superhuman performance in that game too. And the very first time that we'd ever run the system on that particular game was the version that we published in the paper on AlphaZero.

It just worked out of the box, literally, no touching it. We didn't have to do anything. And there it was, superhuman performance, no tweaking, no twiddling. And so I think there's something beautiful about that principle that you can take an algorithm and without twiddling anything, it just works. Now, to go beyond AlphaZero, what's required?

AlphaZero is just a step. And there's a long way to go beyond that to really crack the deep problems of AI. But one of the important steps is to acknowledge that the world is a really messy place. You know, it's this rich, complex, beautiful, but messy environment that we live in.

And no one gives us the rules. Like no one knows the rules of the world. At least maybe we understand that it operates according to Newtonian or quantum mechanics at the micro level or according to relativity at the macro level. But that's not a model that's useful for us as people to operate in it.

Somehow the agent needs to understand the world for itself in a way where no one tells it the rules of the game, and yet it can still figure out what to do in that world: deal with this stream of observations coming in, rich sensory input coming in, actions going out, in a way that allows it to reason the way that AlphaGo or AlphaZero can reason, the way that these Go- and chess-playing programs can reason, but in a way that allows it to take actions in that messy world to achieve its goals.

And so this led us to the most recent step in the story of AlphaGo, which was a system called MuZero. And MuZero is a system which learns for itself even when the rules are not given to it. It actually can be dropped into a system with messy perceptual inputs.

We actually tried it in some Atari games, the canonical domains of Atari that have been used for reinforcement learning. And this system learned to build a model of these Atari games that was sufficiently rich and useful for it to be able to plan successfully. And in fact, that system not only went on to beat the state of the art in Atari, but the same system, without modification, was able to reach the same level of superhuman performance in Go, chess, and shogi that we'd seen in AlphaZero. It shows that even without the rules, the system can learn for itself just by trial and error: you just play this game of Go, and no one tells you what the rules are, but you get to the end and someone says, you know, win or loss.

You play this game of chess and someone says win or loss, or you play a game of breakout in Atari and someone just tells you, you know, your score at the end. And the system for itself figures out essentially the rules of the system, the dynamics of the world, how the world works.

And not in any explicit way, but just implicitly enough understanding for it to be able to plan in that system in order to achieve its goals. - And that's the fundamental process that you have to go through when you're facing any uncertain kind of environment that you would in the real world, is figuring out the sort of the rules, the basic rules of the game.
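
For readers who want to see the moving parts, the MuZero paper describes three learned functions: a representation function mapping observations to a latent state, a dynamics function rolling that state forward under an action, and a prediction function producing values (and a policy), with planning happening entirely inside the learned model. The sketch below uses untrained random matrices and illustrative names in place of the real networks, and a naive depth-limited search in place of MuZero's MCTS; it is only meant to show how the pieces connect, under those assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, N_ACTIONS = 8, 4, 3

# In MuZero these three functions are deep networks trained end-to-end from
# the stream of observations, actions, and rewards; here they are fixed
# random linear maps just so the sketch runs.
W_repr = rng.normal(size=(LATENT_DIM, OBS_DIM))                 # representation: observation -> latent state
W_dyn  = rng.normal(size=(N_ACTIONS, LATENT_DIM, LATENT_DIM))   # dynamics: (latent state, action) -> next latent state
W_val  = rng.normal(size=LATENT_DIM)                            # prediction: latent state -> value (policy head omitted)

def represent(obs):          return np.tanh(W_repr @ obs)
def dynamics(state, action): return np.tanh(W_dyn[action] @ state)
def value(state):            return float(np.tanh(W_val @ state))

def plan(obs, depth=3):
    """Choose the action whose imagined rollout under the LEARNED model looks best.
    MuZero proper runs MCTS over these same functions; a tiny exhaustive
    depth-limited search keeps the sketch short. No environment rules are
    consulted anywhere: planning happens entirely in the learned latent space."""
    def best(state, d):
        if d == 0:
            return value(state)
        return max(best(dynamics(state, a), d - 1) for a in range(N_ACTIONS))
    root = represent(obs)
    return max(range(N_ACTIONS), key=lambda a: best(dynamics(root, a), depth - 1))

print("chosen action:", plan(rng.normal(size=OBS_DIM)))
```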

- That's right. - So, I mean, yeah, that allows it to be applicable to basically any domain that could be digitized in the way that it needs to in order to be consumable, sort of in order for the reinforcement learning framework to be able to sense the environment, to be able to act in the environment and so on.

- The full reinforcement learning problem needs to deal with worlds that are unknown and complex, and the agent needs to learn for itself how to deal with that. And so MuZero is a step, a further step, in that direction. (upbeat music)