David Silver: AlphaGo, AlphaZero, and Deep Reinforcement Learning | Lex Fridman Podcast #86
Chapters
0:00 Introduction
4:09 First program
11:11 AlphaGo
21:42 Rules of the game of Go
25:37 Reinforcement learning: personal journey
30:15 What is reinforcement learning?
43:51 AlphaGo (continued)
53:40 Supervised learning and self-play in AlphaGo
66:12 Lee Sedol's retirement from Go
68:57 Garry Kasparov
74:10 AlphaZero and self-play
91:29 Creativity in AlphaZero
95:21 AlphaZero applications
97:59 Reward functions
100:51 Meaning of life
00:00:00.000 |
The following is a conversation with David Silver, 00:00:02.600 |
who leads the reinforcement learning research group 00:00:07.880 |
on AlphaGo, AlphaZero, and co-led the AlphaStar 00:00:12.120 |
and MuZero efforts and a lot of important work 00:00:17.200 |
I believe AlphaZero is one of the most important 00:00:20.880 |
accomplishments in the history of artificial intelligence. 00:00:28.600 |
together with a lot of other great researchers at DeepMind. 00:00:35.160 |
We were both jet lagged, but didn't care and made it happen. 00:00:39.080 |
It was a pleasure and truly an honor to talk with David. 00:00:47.000 |
For everyone feeling the medical, psychological, 00:01:04.800 |
support on Patreon, or simply connect with me on Twitter 00:01:29.080 |
by signing up to MasterClass at masterclass.com/lex 00:01:34.040 |
and downloading Cash App and using code LEXPODCAST. 00:01:57.800 |
in the context of the history of money is fascinating. 00:02:05.320 |
Debits and credits on ledgers started around 30,000 years ago. 00:02:12.840 |
and Bitcoin, the first decentralized cryptocurrency, 00:02:38.640 |
an organization that is helping to advance robotics 00:02:41.080 |
and STEM education for young people around the world. 00:02:49.480 |
to get a discount and to support this podcast. 00:02:53.580 |
if you sign up for an all-access pass for a year, 00:03:09.680 |
to watch courses from, to list some of my favorites. 00:03:15.120 |
Neil deGrasse Tyson on scientific thinking and communication, 00:03:18.080 |
Will Wright, the creator of SimCity and Sims, 00:03:21.480 |
on game design, Jane Goodall on conservation, 00:03:28.040 |
could be the most beautiful guitar song ever written, 00:03:30.940 |
Garry Kasparov on chess, Daniel Negreanu on poker, 00:03:37.840 |
and the experience of being launched into space alone 00:03:51.880 |
that will stick with you for a long time, I promise. 00:04:02.240 |
to get a discount and to support this podcast. 00:04:04.700 |
And now, here's my conversation with David Silver. 00:04:08.740 |
What was the first program you've ever written 00:04:16.680 |
My parents brought home this BBC Model B microcomputer. 00:04:26.160 |
and couldn't resist just playing around with it. 00:04:44.520 |
- How did you think about computers back then? 00:04:49.680 |
and there's this thing that you just gave birth to 00:04:52.900 |
that's able to create sort of visual elements 00:04:57.680 |
Or did you not think of it in those romantic notions? 00:05:16.440 |
I used to play with Lego with the same feeling. 00:05:21.880 |
You're not constrained by the amount of kit you've got. 00:05:28.480 |
and the advanced user guide and then learning. 00:05:34.640 |
My father also became interested in this machine 00:05:40.280 |
and study for a master's degree in artificial intelligence, 00:05:44.480 |
funnily enough, at Essex University when I was seven. 00:05:48.600 |
So I was exposed to those things at an early age. 00:05:54.840 |
and do things like querying your family tree. 00:05:59.800 |
of trying to figure things out on a computer. 00:06:04.100 |
- Those are the early steps in computer science programming. 00:06:09.320 |
with artificial intelligence or with the ideas, 00:06:13.300 |
- I think it was really when I went to study at university. 00:06:30.120 |
Where do we want to go with computer science? 00:06:40.920 |
and recreate something akin to human intelligence. 00:06:44.240 |
If we could do that, that would be a major leap forward. 00:06:47.540 |
And that idea, I certainly wasn't the first to have it, 00:06:51.000 |
but it nestled within me somewhere and became like a bug. 00:06:55.480 |
You know, I really wanted to crack that problem. 00:06:58.920 |
- So you thought it was, like you had a notion 00:07:00.800 |
that this is something that human beings can do, 00:07:03.000 |
that it is possible to create an intelligent machine? 00:07:09.160 |
in something metaphysical, then what are our brains doing? 00:07:13.440 |
Well, at some level, they're information processing systems, 00:07:17.260 |
which are able to take whatever information is in there, 00:07:40.080 |
Do you remember the first time you were in a program 00:07:47.360 |
Sort of achieved super David Silver level performance? 00:07:56.440 |
So for five years, I programmed games for my first job. 00:08:01.300 |
So it was an amazing opportunity to get involved 00:08:05.820 |
And so I was involved in building AI at that time. 00:08:10.820 |
And so for sure, there was a sense of building handcrafted, 00:08:17.100 |
what people used to call AI in the games industry, 00:08:20.280 |
which I think is not really what we might think of 00:08:29.020 |
in a way which makes things interesting and challenging 00:08:39.440 |
which in certain limited cases could do things 00:08:45.380 |
but mostly in these kind of twitch-like scenarios 00:09:09.120 |
which I still had inside me to really understand intelligence 00:09:17.100 |
short term fixes rather than long term vision. 00:09:23.600 |
which was funny enough trying to apply reinforcement learning 00:09:31.880 |
a system which would by trial and error play against itself 00:09:35.600 |
and was able to learn which patterns were actually helpful 00:09:40.560 |
to predict whether it was gonna win or lose the game 00:09:46.200 |
that would mean that you're more likely to win. 00:10:04.480 |
You know, it's like in 2001: A Space Odyssey 00:10:08.280 |
kind of realizing that you've created something that, 00:10:12.720 |
that is, you know, that's achieved human level intelligence 00:10:24.320 |
- There were no neural networks in those days. 00:10:34.200 |
which people are still using in deep reinforcement learning. 00:11:04.560 |
that something I felt should work had worked. 00:11:08.640 |
- So to me, AlphaGo, and I don't know how else to put it, 00:11:23.440 |
So you're one of the key people behind this achievement, 00:11:34.800 |
So as far as I know, the AI community at that point 00:11:40.680 |
largely saw the game of Go as unbeatable in AI, 00:11:48.760 |
Even if you consider, at least the way I saw it, 00:12:01.360 |
So given that the game of Go was impossible to master, 00:12:15.120 |
that you could actually build a computer program 00:12:21.880 |
but achieves that kind of level of playing Go? 00:12:24.880 |
- First of all, thank you, that was very kind words. 00:12:38.080 |
and it was their first meeting together since the match. 00:12:47.320 |
So these are amazing moments when they happen, 00:12:59.160 |
I've always had a fascination in board games. 00:13:01.840 |
I played chess as a kid, I played Scrabble as a kid. 00:13:04.840 |
When I was at university, I discovered the game of Go, 00:13:08.960 |
and to me, it just blew all of those other games 00:13:12.040 |
It was just so deep and profound in its complexity 00:13:17.720 |
What I discovered was that I could devote endless hours 00:13:22.720 |
to this game, and I knew in my heart of hearts 00:13:28.200 |
that no matter how many hours I would devote to it, 00:13:34.320 |
Or there was another path, and the other path 00:13:43.480 |
And so even in those days, I had this idea that, 00:13:46.760 |
what if, what if it was possible to build a program 00:14:06.280 |
It was the challenge where all other approaches had failed. 00:14:15.920 |
for the classical methods of AI, like heuristic search. 00:14:19.880 |
In the '90s, they all fell one after another, 00:14:28.840 |
There were numerous cases where systems built 00:14:37.920 |
had been able to defeat the human world champion 00:15:02.640 |
when that nine-year-old child was giving nine free moves 00:15:09.800 |
And computer Go expert beat that same strongest program 00:15:22.480 |
in around 2003, when I started working on computer Go. 00:15:30.320 |
There was very, very little in the way of progress 00:15:39.120 |
And so people, it wasn't through lack of effort. 00:15:46.600 |
that something different would be required for Go 00:15:49.800 |
than had been needed for all of these other domains 00:16:02.360 |
that a Go player would look at a position and say, 00:16:05.400 |
"Hey, here's this mess of black and white stones. 00:16:12.600 |
"that this part of the board has become my territory. 00:16:15.720 |
"This part of the board has become your territory. 00:16:17.760 |
"And I've got this overall sense that I'm gonna win 00:16:20.120 |
"and that this is about the right move to play." 00:16:24.680 |
of being able to evaluate what's going on in a position, 00:16:28.160 |
it was pivotal to humans being able to play this game 00:16:34.960 |
So this question of how to evaluate a position, 00:16:37.680 |
how to come up with these intuitive judgments 00:16:47.400 |
and the reason why methods which had succeeded so well 00:16:57.880 |
we would need to get something akin to human intuition. 00:17:00.360 |
And if we got something akin to human intuition, 00:17:06.760 |
So to me, that was the moment where it's like, 00:17:09.120 |
"Okay, this is not just about playing the game of Go. 00:17:17.480 |
Now this is the opportunity to do something meaningful 00:17:19.520 |
and transformative, and I guess a dream was born. 00:17:25.200 |
So almost this realization that you need to find, 00:17:29.000 |
formulate Go as a kind of a prediction problem 00:17:37.320 |
but to give it the ability to kind of intuit things 00:17:47.000 |
Now, okay, but what about the learning part of it? 00:18:08.520 |
- So I strongly felt that learning would be necessary, 00:18:18.160 |
to the game of Go, and not just learning of any type, 00:18:21.760 |
but I felt that the only way to really have a system 00:18:26.120 |
to progress beyond human levels of performance 00:18:33.080 |
And how else can a machine hope to understand 00:18:38.960 |
If you're not learning, what else are you doing? 00:18:40.360 |
Well, you're putting all the knowledge into the system, 00:18:42.520 |
and that just feels like something which decades of AI 00:18:50.520 |
but certainly has a ceiling to the capabilities. 00:18:53.320 |
It's known as the knowledge acquisition bottleneck, 00:19:03.560 |
That's the only way you're going to be able to get 00:19:06.360 |
a system which has sufficient knowledge in it, 00:19:10.320 |
millions and millions of pieces of knowledge, 00:19:15.520 |
and understand how those billions and trillions 00:19:17.960 |
of pieces of knowledge can be leveraged in a way 00:19:27.440 |
- Yeah, I mean, if I put myself back in that time, 00:19:42.720 |
of knowledge base, like a growing knowledge base, 00:19:53.560 |
assemble together into a large knowledge base. 00:19:56.600 |
- Well, in a sense, that was the state of the art back then. 00:20:01.080 |
which had been competing for this prize I mentioned, 00:20:04.360 |
they were an assembly of different specialized systems, 00:20:09.800 |
some of which used huge amounts of human knowledge 00:20:16.680 |
that were required to play well in the game of Go, 00:20:24.560 |
and combined with more principled search-based methods, 00:20:28.600 |
which were trying to solve for particular sub parts 00:20:48.120 |
And although not all of the pieces were handcrafted, 00:20:54.600 |
the overall effect was nevertheless still brittle, 00:20:56.760 |
and it was hard to make all these pieces work well together. 00:21:02.680 |
and the main innovation of the approach I took, 00:21:11.360 |
a principled approach where the system can learn for itself, 00:21:14.880 |
just from the outcome, like, learn for itself. 00:21:19.280 |
If you try something, did that help or did it not help? 00:21:22.680 |
And only through that procedure can you arrive 00:21:33.560 |
And so that principle was already very important 00:21:39.840 |
we were missing some important pieces back then. 00:21:49.120 |
let's take a step back, we kind of skipped it a bit, 00:21:55.720 |
The elements of it perhaps contrasting to chess 00:22:02.120 |
that sort of you really enjoy as a human being, 00:22:13.080 |
- So the game of Go has remarkably simple rules. 00:22:16.720 |
In fact, so simple that people have speculated 00:22:19.160 |
that if we were to meet alien life at some point, 00:22:22.200 |
that we wouldn't be able to communicate with them, 00:22:23.800 |
but we would be able to play a game of Go with them. 00:22:25.960 |
So they probably have discovered the same rule set. 00:22:32.240 |
and you play on the intersections of the grid 00:22:37.560 |
it's to surround as much territory as you can, 00:22:40.800 |
as many of these intersections with your stones, 00:22:43.600 |
and to surround more than your opponent does. 00:22:50.480 |
then you get to capture it and remove it from the board 00:23:09.240 |
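To make the capture rule concrete, here is a minimal sketch in Python, assuming a simple list-of-lists board encoding (0 empty, 1 black, 2 white): a connected group of stones is removed once it has no liberties, i.e. no adjacent empty points.

```python
# A minimal sketch of the capture rule just described: a connected group of
# stones is captured when it has no liberties (no empty adjacent points).
# Assumed board encoding: 0 = empty, 1 = black, 2 = white.

def group_and_liberties(board, row, col):
    """Flood-fill the group containing (row, col); return its stones and liberties."""
    n = len(board)
    color = board[row][col]
    group, liberties, frontier = {(row, col)}, set(), [(row, col)]
    while frontier:
        r, c = frontier.pop()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < n and 0 <= nc < n:
                if board[nr][nc] == 0:
                    liberties.add((nr, nc))
                elif board[nr][nc] == color and (nr, nc) not in group:
                    group.add((nr, nc))
                    frontier.append((nr, nc))
    return group, liberties

def remove_if_captured(board, row, col):
    """Remove the group at (row, col) from the board if it has no liberties."""
    group, liberties = group_and_liberties(board, row, col)
    if not liberties:
        for r, c in group:
            board[r][c] = 0
    return board
```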
that will help you acquire territory later in the game, 00:23:23.880 |
and human Go players have played this game for, 00:23:36.280 |
something like 50 million players across the world, 00:23:45.880 |
four ancient arts that was required by Chinese scholars. 00:24:10.840 |
but the evaluation of a particular static board 00:24:18.920 |
You can't, in chess you can kind of assign points 00:24:32.800 |
where both players have played the same number of stones. 00:24:41.400 |
the same number of white stones and black stones, 00:24:45.160 |
how well you're doing is this intuitive sense 00:24:52.520 |
And if you look at the complexity of a real Go position, 00:24:55.680 |
you know, it's mind boggling that kind of question 00:25:12.840 |
position evaluation is so hard in Go compared to other games. 00:25:17.440 |
In addition to that, it has an enormous search space. 00:25:19.280 |
So there's around 10 to the 170 positions in the game of Go. 00:25:30.480 |
that were so successful in things like Deep Blue 00:25:32.480 |
and chess programs just kind of fall over in Go. 00:25:36.080 |
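For a rough sense of the number quoted here: each of the 19 x 19 = 361 intersections can be empty, black, or white, which gives an upper bound of 3^361 arrangements; restricting to legal positions brings the count down to roughly 2 x 10^170 (the exact figure, computed by John Tromp and collaborators, is about 2.08 x 10^170).

$$
3^{361} \approx 1.7 \times 10^{172}, \qquad \text{legal Go positions} \approx 2.08 \times 10^{170}.
$$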
- So at which point did reinforcement learning 00:25:39.440 |
enter your life, your research life, your way of thinking? 00:25:45.440 |
but reinforcement learning is a very particular 00:25:49.680 |
One that's both philosophically sort of profound, 00:25:53.080 |
but also one that's pretty difficult to get to work 00:26:02.320 |
- So I had just finished working in the games industry 00:26:17.120 |
but I wasn't sure what that meant at that stage. 00:26:19.240 |
I really didn't feel I had the tools to decide 00:26:27.200 |
And one of the things I read was Sutton and Barto, 00:26:33.360 |
on an introduction to reinforcement learning. 00:26:43.480 |
that this is what I understood intelligence to be. 00:26:46.880 |
And this was the path that I felt would be necessary 00:27:02.720 |
in supervising me on a PhD thesis in computer Go. 00:27:24.000 |
that he'd even be around to see the end of it. 00:27:34.840 |
with a history of fantastic work in board games as well, 00:27:48.400 |
the more strongly I felt that this wasn't just the path 00:27:56.240 |
but really, this was the thing I'd been looking for. 00:28:17.520 |
in some sense, we've cracked the problem of AI. 00:28:31.320 |
And if we ever create a human level intelligence system, 00:28:34.920 |
it would be at the core of that kind of system. 00:28:57.520 |
That they can all be brought within this framework 00:29:01.080 |
and gives us a way to access them in a meaningful way 00:29:03.520 |
that allows us as scientists to understand intelligence 00:29:11.760 |
And so in that sense, I feel that it gives us a path, 00:29:16.280 |
maybe not the only path, but a path towards AI. 00:29:20.360 |
And so do I think that any system in the future 00:29:24.960 |
that's solved AI would have to have RL within it? 00:29:37.920 |
Now, what particular methods have been used to get there? 00:29:41.240 |
Well, we should keep an open mind about the best approaches 00:30:02.440 |
and there are many amazing discoveries ahead of us. 00:30:06.320 |
especially of the different kinds of RL approaches currently 00:30:25.560 |
and the science and the problem of intelligence 00:30:30.560 |
in the form of an agent that interacts with an environment. 00:30:38.160 |
like the world in which that agent is situated. 00:30:44.760 |
Those actions have some effect on the environment 00:30:47.600 |
and the environment gives back an observation 00:30:49.240 |
to the agent saying, this is what you see or sense. 00:31:13.920 |
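As a minimal sketch of the loop being described, assuming illustrative `agent` and `environment` objects rather than any particular library's API:

```python
# The agent picks an action, the environment responds with an observation and a
# reward, and the agent's job is to maximize the total reward it accumulates.
# The Agent and Environment interfaces here are assumptions for illustration.

def run_episode(agent, environment, max_steps=1000):
    observation = environment.reset()          # initial sensation of the world
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(observation)        # agent decides what to do
        observation, reward, done = environment.step(action)  # world responds
        agent.learn(observation, reward)       # agent updates itself from experience
        total_reward += reward
        if done:
            break
    return total_reward
```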
So I don't know if there's a nice brief inwards way 00:31:21.560 |
model-based, policy-based reinforcement learning. 00:31:27.920 |
okay, so there's this ambitious problem definition of RL. 00:31:33.440 |
It's trying to capture and encircle all of the things 00:31:35.640 |
in which an agent interacts with an environment and say, 00:31:43.880 |
Well, how do you solve a really hard problem like that? 00:31:46.560 |
Well, one approach you can take is to decompose 00:31:49.560 |
that very hard problem into pieces that work together 00:31:55.440 |
And so you can kind of look at the decomposition 00:32:00.740 |
and ask, well, what form does that decomposition take? 00:32:03.860 |
And some of the most common pieces that people use 00:32:06.260 |
when they're kind of putting the solution method together, 00:32:09.660 |
some of the most common pieces that people use 00:32:11.760 |
are whether or not that solution has a value function. 00:32:22.860 |
That means something which is deciding how to pick actions. 00:32:25.820 |
Is that decision-making process explicitly represented? 00:32:32.060 |
Is there something which is explicitly trying to predict 00:32:47.100 |
as choices of whether or not to use those building blocks 00:32:49.940 |
when you're trying to decompose the solution. 00:33:03.220 |
But those three fundamental choices give rise 00:33:20.500 |
are somehow implicitly learned within the system. 00:33:23.460 |
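The three building blocks mentioned here can be written out as plain interfaces; this is only an illustration of the decomposition, not a specific algorithm. A method is loosely called value-based, policy-based, or model-based according to which of these it represents explicitly:

```python
# Illustrative stubs for the three common components of an RL solution method.

class ValueFunction:
    """Prediction: how much future reward do I expect from this state?"""
    def value(self, state) -> float: ...

class Policy:
    """Decision-making: which action should I pick in this state?"""
    def action(self, state): ...

class Model:
    """Understanding of the environment: what happens if I take this action?"""
    def predict(self, state, action):
        """Return a predicted next state and reward."""
        ...
```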
So it's almost a choice of how you approach a problem 00:33:35.340 |
like the details of how you solve the problem 00:33:37.540 |
but they're not fundamentally different from each other? 00:33:40.940 |
- I think the fundamental idea is maybe at the higher level. 00:33:45.940 |
The fundamental idea is the first step of the decomposition 00:33:49.580 |
is really to say, well, how are we really gonna solve 00:33:53.980 |
any kind of problem where you're trying to figure out 00:33:56.540 |
how to take actions and just from this stream 00:33:58.780 |
of observations, you've got some agent situated 00:34:04.340 |
getting to take these actions and what should it do? 00:34:07.820 |
Maybe the complexity of the world is so great 00:34:10.820 |
that you can't even imagine how to build a system 00:34:15.740 |
And so the first step of this decomposition is to say, 00:34:18.580 |
well, you have to learn, the system has to learn for itself. 00:34:22.060 |
And so note that the reinforcement learning problem 00:34:24.460 |
doesn't actually stipulate that you have to learn. 00:34:27.100 |
Like you could maximize your rewards without learning. 00:34:29.380 |
It would just, wouldn't do a very good job of it. 00:34:32.420 |
So learning is required because it's the only way 00:34:35.380 |
to achieve good performance in any sufficiently large 00:34:45.340 |
'Cause now you might ask, well, what should you be learning? 00:34:52.980 |
you're trying to update the parameters of some system, 00:34:57.340 |
which is then the thing that actually picks the actions. 00:35:00.860 |
And those parameters could be representing anything. 00:35:03.460 |
They could be parameterizing a value function 00:35:08.540 |
And so in that sense, there's a lot of commonality 00:35:13.580 |
and it's being learned with the ultimate goal 00:35:17.500 |
But the way in which you decompose the problem 00:35:20.300 |
is really what gives the semantics to the whole system. 00:35:23.140 |
Like, are you trying to learn something to predict well, 00:35:28.580 |
Are you learning something to perform well, like a policy? 00:35:34.020 |
is kind of giving the semantics to the system. 00:35:40.340 |
And we have to make those fundamental choices 00:35:46.180 |
to be able to learn how to make those choices 00:35:51.460 |
the very first thing you have to deal with is, 00:35:56.020 |
can you even take in this huge stream of observations 00:36:08.140 |
And what is this idea of using neural networks 00:36:14.580 |
- So amongst all the approaches for reinforcement learning, 00:36:25.420 |
powerful representations that are offered by neural networks 00:36:31.620 |
to represent any of these different components 00:36:43.460 |
well, here's a powerful toolkit that's so powerful 00:36:55.020 |
that means that whatever we need to represent 00:36:57.940 |
for our policy or for our value function, for our model, 00:37:09.460 |
That as we start to put more resources into the system, 00:37:20.140 |
that these are systems that can just get better 00:37:22.220 |
and better and better at doing whatever the job is 00:37:25.540 |
Whatever we've asked that function to represent, 00:37:27.940 |
it can learn a function that does a better and better job 00:37:36.620 |
how well you're gonna do in the world, the value function, 00:37:38.820 |
whether it's gonna be choosing what to do in the world, 00:37:41.780 |
the policy, or whether it's understanding the world itself, 00:37:46.820 |
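As a toy illustration of the point, here is a tiny two-layer network used as a value function over a flattened 19 x 19 board, with a plain gradient step toward an observed outcome. This is only a sketch of a parameterized function approximator, not AlphaGo's actual architecture; the same kind of parameterized function could equally represent a policy or a model.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(64, 361))   # 19x19 board flattened to 361 inputs
W2 = rng.normal(scale=0.1, size=(1, 64))

def value(board_vector):
    """Predict how good a position is (a single scalar) from raw board features."""
    hidden = np.maximum(0.0, W1 @ board_vector)   # ReLU hidden layer
    return float(W2 @ hidden)

def update(board_vector, target, learning_rate=1e-3):
    """Nudge the weights so the prediction moves toward an observed outcome."""
    global W1, W2
    hidden = np.maximum(0.0, W1 @ board_vector)
    prediction = float(W2 @ hidden)
    error = prediction - target                   # gradient of 0.5 * error^2
    grad_W2 = error * hidden[np.newaxis, :]
    grad_hidden = (W2.flatten() * error) * (hidden > 0)
    grad_W1 = np.outer(grad_hidden, board_vector)
    W1 -= learning_rate * grad_W1
    W2 -= learning_rate * grad_W2
```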
- Nevertheless, the fact that neural networks 00:37:50.220 |
are able to learn incredibly complex representations 00:38:08.780 |
Can you still believe it works as well as it does? 00:38:11.500 |
Do you have good intuition about why it works at all 00:38:16.700 |
- I think, let me take two parts to that question. 00:38:27.420 |
that the idea of reinforcement learning works, 00:38:33.980 |
I feel it's the only thing which can, ultimately. 00:39:23.940 |
have these incredibly nonlinear kind of bumpy surfaces, 00:39:28.060 |
which to our kind of low dimensional intuitions 00:39:31.220 |
make it feel like, surely you're just gonna get stuck, 00:39:35.220 |
because you won't be able to make any further progress. 00:39:38.620 |
And yet, the big surprise is that learning continues, 00:39:46.500 |
turn out not to be, because in high dimensions, 00:39:57.800 |
that will take you out and take you lower still. 00:40:01.260 |
learning can proceed and do better and better and better, 00:40:17.500 |
and somewhat shocking that it turns out to be the case. 00:40:23.120 |
to our low dimensional intuitions, that's surprising. 00:40:36.860 |
what a billion dimensional neural network surface 00:40:43.260 |
what that even looks like, is very hard for us. 00:40:56.780 |
I think it's really down to that lack of ability 00:41:00.300 |
to generalize from low dimensions to high dimensions, 00:41:03.260 |
because back then we were in the low dimensional case, 00:41:22.540 |
we're starting to build the theory to support that. 00:41:28.260 |
but all of the theory seems to be pointing in the direction 00:41:34.820 |
both in its representational capacity, which was known, 00:41:37.220 |
but also in its learning ability, which is surprising. 00:41:40.860 |
- And it makes one wonder what else we're missing, 00:42:15.760 |
will we look back and feel that these algorithms 00:42:21.500 |
which are used even in 100,000, 10,000 years? 00:42:24.940 |
- Yeah, they'll watch back to this conversation 00:42:30.300 |
and with a smile, maybe a little bit of a laugh. 00:42:54.460 |
There's something, just like you said in the game of Go, 00:42:58.180 |
I mean, I love the systems of like cellular automata, 00:43:36.820 |
and which will carry us furthest into the future. 00:43:49.100 |
as to what those minimal ingredients might be. 00:44:08.940 |
which actually happened in the context of Computer Go. 00:44:27.380 |
not by humans saying whether that position is good or not 00:44:44.480 |
multiple times and taking the average of those outcomes 00:44:55.340 |
and you get the system to kind of play random moves 00:44:58.080 |
against itself all the way to the end of the game 00:45:05.120 |
well, you say, hey, this is a position that favors white. 00:45:31.140 |
was something called Monte Carlo tree search, 00:45:42.940 |
of the random playouts from that node onwards. 00:45:51.620 |
in the strength of computer Go playing programs. 00:46:07.620 |
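A minimal sketch of the Monte Carlo evaluation idea: play many uniformly random games to the end from a position and average the outcomes. The `game` interface used here (legal_moves, play, is_over, winner) is assumed for illustration:

```python
import random

def rollout(game, player):
    """Play uniformly random moves to the end; return 1 if `player` wins else 0."""
    while not game.is_over():
        game = game.play(random.choice(game.legal_moves()))
    return 1.0 if game.winner() == player else 0.0

def evaluate(game, player, num_rollouts=1000):
    """Estimate the probability that `player` wins from this position."""
    wins = sum(rollout(game, player) for _ in range(num_rollouts))
    return wins / num_rollouts
```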
And so this was a program by someone called Sylvain Gelly, 00:46:13.100 |
but I worked with him a little bit in those days, 00:46:21.720 |
towards the latest successes we saw in computer Go, 00:46:27.980 |
MoGo was evaluating purely by random rollouts 00:46:36.360 |
that random play should give you anything at all. 00:46:39.880 |
- Like why in this perfectly deterministic game 00:46:42.560 |
that's very precise and involves these very exact sequences, 00:46:54.080 |
captures something about the nature of the search tree, 00:47:01.800 |
the nature of the search tree from that node onwards 00:47:23.520 |
It makes you wonder about the fundamental nature 00:47:39.480 |
were a first revolution in the sense that they led to, 00:47:58.060 |
So strong players, but not anywhere near the level 00:48:01.360 |
of professionals, nevermind the world champion. 00:48:04.480 |
And so that brings us to the birth of AlphaGo, 00:48:08.400 |
which happened in the context of a startup company 00:48:19.000 |
and the project was really a scientific investigation 00:48:23.680 |
where myself and Aja Huang and an intern, Chris Maddison, 00:48:35.540 |
is there another fundamentally different approach 00:48:39.600 |
to this key question of Go, the key challenge 00:48:50.460 |
what move to play or how well you're doing in that position, 00:48:54.840 |
And so the deep learning revolution had just begun. 00:48:59.120 |
That systems like ImageNet had suddenly been won 00:49:08.640 |
well, if deep learning is able to scale up so effectively 00:49:12.480 |
with images to understand them enough to classify them, 00:49:17.520 |
Why not take the black and white stones of the Go board 00:49:22.520 |
and build a system which can understand for itself 00:49:25.360 |
what that means in terms of what move to pick 00:49:32.540 |
which we were probing and trying to understand. 00:49:49.440 |
And we showed that actually a pure deep learning system 00:49:52.400 |
with no search at all was actually able to reach 00:49:55.640 |
human Dan level, master level at the full game of Go, 00:50:06.040 |
at the level of the best Monte Carlo tree search systems, 00:50:28.440 |
Was it surprising from a scientific perspective in general, 00:50:37.300 |
In fact, it was so surprising that we had a bet back then 00:50:41.760 |
and like many good projects, bets are quite motivating 00:50:52.100 |
no search at all to beat a Dan level human player. 00:51:01.000 |
He came in and we had this first match against him. 00:51:05.560 |
- Which side of the bet were you on by the way? 00:51:24.200 |
And for me, that was the moment where it was like, 00:51:29.400 |
We have a system which without search is able 00:51:32.960 |
to already just look at this position and understand things 00:51:41.440 |
I really felt that reaching the top levels of human play, 00:51:59.600 |
I was rather keen that it would be us that achieved it. 00:52:16.040 |
And we made the decision to scale up the project, 00:52:23.460 |
And so AlphaGo became something where we had a clear goal, 00:52:28.460 |
which was to try and crack this outstanding challenge of AI 00:52:33.760 |
to see if we could beat the world's best players. 00:52:37.380 |
And this led within the space of not so many months 00:52:42.380 |
to playing against the European champion, Fan Hui, 00:53:04.220 |
Again, we were basing our predictions on our own progress 00:53:11.380 |
of our own progress when we thought we would exceed 00:53:17.700 |
And we tried to make an estimate and set up a match 00:53:20.500 |
and that became the AlphaGo versus Lee Sedol match in 2016. 00:53:35.040 |
- So maybe we could take even a broader view. 00:53:40.040 |
AlphaGo involves both learning from expert games 00:53:45.900 |
and, as far as I remember, a self-play component 00:53:50.900 |
to where it learns by playing against itself. 00:53:54.240 |
But in your sense, what was the role of learning 00:54:04.100 |
what was the thing that you're trying to do more of, 00:54:15.620 |
but did you have a hope, a dream that self-play 00:54:19.540 |
would be the key component at that moment yet? 00:54:31.380 |
And so when we had our first paper that showed 00:54:34.620 |
that it was possible to predict the winner of the game, 00:54:42.420 |
- Yeah, and so the reason that we did it that way 00:54:45.120 |
was at that time we were exploring separately 00:54:53.460 |
to me at that time, was how far could that be stretched? 00:55:14.220 |
And to us, the human data right from the beginning 00:55:16.900 |
was an expedient step to help us for pragmatic reasons 00:55:20.880 |
to go faster towards the goals of the project 00:55:24.560 |
than we might be able to starting solely from self-play. 00:55:32.800 |
and that might not be the long-term holy grail of AI, 00:55:37.360 |
but that it was something which was extremely useful to us. 00:55:42.240 |
It helped us to build deep learning representations 00:55:48.420 |
And so really I would say it's served a purpose, 00:55:53.280 |
but something which I continue to use in our research today, 00:55:56.180 |
which is trying to break down a very hard challenge 00:56:00.080 |
into pieces which are easier to understand for us 00:56:04.180 |
So if you use a component based on human data, 00:56:11.360 |
the more principled version later that does it for itself. 00:56:19.700 |
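That human-data stage amounts to supervised learning: train a network to predict which move the expert played in each recorded position. A hedged sketch in PyTorch, with a placeholder architecture much smaller than AlphaGo's actual convolutional policy network:

```python
import torch
import torch.nn as nn

policy_net = nn.Sequential(
    nn.Linear(361, 512), nn.ReLU(),      # flattened 19x19 board as input
    nn.Linear(512, 361),                 # one logit per board intersection
)
optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(positions, expert_moves):
    """positions: float tensor [batch, 361]; expert_moves: long tensor [batch]."""
    logits = policy_net(positions)
    loss = loss_fn(logits, expert_moves)  # push probability toward the expert's move
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```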
and I don't think I'm being sort of romanticizing 00:56:23.180 |
this notion, I think it's one of the greatest moments 00:56:35.900 |
'Cause to me, I feel like it's something that would, 00:56:38.560 |
we mentioned what the AGI systems of the future 00:56:42.500 |
I think they'll look back at the AlphaGo victory 00:56:52.760 |
I mean, it's funny 'cause I guess I've been working on, 00:56:56.200 |
I've been working on computer Go for a long time. 00:56:58.080 |
So I'd been working at the time of the AlphaGo match 00:57:03.040 |
And throughout that decade, I'd had this dream 00:57:07.680 |
what would it be like really to actually be able 00:57:14.280 |
And I imagined that that would be an interesting moment 00:57:31.520 |
that were following us around and the 100 million people 00:57:37.620 |
I realized that I'd been off in my estimation 00:57:43.960 |
And so there was definitely an adjustment process 00:57:57.960 |
And I think there was that moment of realization. 00:58:03.600 |
if you go into something thinking it's gonna be 00:58:10.860 |
it suddenly makes you worry about whether some 00:58:12.660 |
of the decisions you'd made were really the best ones 00:58:14.960 |
or the wisest or were going to lead to the best outcome. 00:58:18.260 |
And we knew for sure that there were still imperfections 00:58:24.400 |
And so, yeah, it was, I think, a great experience 00:58:28.160 |
and I feel privileged to have been part of it, 00:58:35.940 |
I feel privileged to have been in a moment of history, 00:58:43.260 |
in a sense, I was insulated from the knowledge of, 00:58:46.380 |
I think it would have been harder to focus on the research 00:58:48.820 |
if the full kind of reality of what was gonna come to pass 00:58:58.740 |
and we were trying to answer the scientific questions 00:59:04.540 |
And I think it was better that way in retrospect. 00:59:10.220 |
what were the chances that you could get the win? 00:59:12.720 |
So, just like you said, I'm a little bit more familiar 00:59:20.300 |
that we may not even get a chance to talk to. 00:59:26.260 |
But here, with AlphaStar and beating StarCraft, 00:59:31.140 |
there was already a track record with AlphaGo. 00:59:59.620 |
And the number of fingers that they had out on that hand 01:00:06.300 |
And there was an amazing spread in the team's predictions. 01:00:20.620 |
And one of the things which we had established 01:00:23.140 |
was that AlphaGo, in around one in five games, 01:00:27.260 |
would develop something which we called a delusion, 01:00:49.420 |
But we also knew that if there were delusions, 01:01:08.180 |
And after that, it led it into this situation 01:01:20.780 |
- So yeah, and can you maybe speak to it a little bit more? 01:01:26.380 |
Is there interesting things that come to memory 01:01:29.900 |
in terms of the play of the human or the machine? 01:01:33.620 |
- So I remember all of these games vividly, of course. 01:01:36.980 |
You know, moments like these don't come too often 01:01:47.340 |
because it was the first time that a computer program 01:01:59.540 |
invaded Lee Sedol's territory towards the end of the game. 01:01:59.540 |
hey, you thought this was gonna be your territory 01:02:26.140 |
The second game became famous for a move known as Move 37. 01:02:40.260 |
They thought that maybe the operator had made a mistake. 01:02:45.260 |
They thought that there was something crazy going on. 01:02:48.260 |
And it just broke every rule that Go players are taught 01:02:55.300 |
You can only play it on the third line or the fourth line 01:03:03.580 |
and made this beautiful pattern in the middle of the board 01:03:12.300 |
where we could say computers exhibited creativity, 01:03:17.020 |
that was something humans hadn't known about, 01:03:20.980 |
hadn't anticipated, and computers discovered this idea. 01:03:29.140 |
not in the domains of human knowledge of the game. 01:03:32.460 |
And now the humans think this is a reasonable thing to do 01:03:44.600 |
when you play against a human world champion, 01:03:46.620 |
which again, I hadn't anticipated before going there, 01:03:48.820 |
which is, you know, these players are amazing. 01:03:53.300 |
Lee Sedol was a true champion, 18-time world champion, 01:03:56.400 |
and had this amazing ability to probe AlphaGo 01:04:06.140 |
and we felt we were sailing comfortably to victory, 01:04:09.740 |
but he managed to, from nothing, stir up this fight 01:04:29.740 |
And so for us, you know, this was a real challenge. 01:04:41.420 |
The fourth game was amazing in that Lee Sedol 01:04:51.980 |
which I think only a true world champion can do, 01:05:05.300 |
It kind of, he found just a piece of genius really. 01:05:10.300 |
And after that, AlphaGo, its evaluation just tumbled. 01:05:21.420 |
And it starts to behave rather oddly at that point. 01:05:26.860 |
we as a team were convinced having seen AlphaGo 01:05:35.900 |
We were convinced that it was miss-evaluating the position 01:05:41.180 |
And it was only in the last few moves of the game 01:06:12.100 |
- So just like in the case of Deep Blue beating Garry Kasparov 01:06:12.100 |
so Garry was, I think the first time he's ever lost 01:06:18.860 |
And I mean, there's a similar situation with Lee Sedol. 01:06:24.380 |
But Lee Sedol recently announced his retirement. 01:06:47.260 |
I don't know if we can look too deeply into it, 01:06:56.040 |
but he did say that even if I become number one, 01:07:05.500 |
What do you think about his retirement from the game and go? 01:07:09.460 |
to the first part of your comment about Garry Kasparov, 01:07:15.660 |
he specifically said that when he first lost to Deep Blue, 01:07:22.340 |
He viewed that this had been a failure of his, 01:07:26.900 |
he said he'd come to realize that actually it was a success. 01:07:31.940 |
because this marked a transformational moment for AI. 01:07:38.900 |
he came to realize that that moment was pivotal 01:07:48.740 |
Lee Sedol, I think, was much more cognizant of that 01:07:48.740 |
was not only meaningful for AI, but for humans as well. 01:08:06.460 |
And he felt as a Go player that it had opened his horizons 01:08:09.900 |
and meant that he could start exploring new things. 01:08:14.420 |
because it had broken all of the conventions and barriers 01:08:18.620 |
and meant that suddenly anything was possible again. 01:08:22.460 |
And so, yeah, I was sad to hear that he'd retired, 01:08:26.100 |
but he's been a great world champion over many, many years. 01:08:31.100 |
And I think he'll be remembered for that evermore. 01:08:36.020 |
He'll be remembered as the last person to beat AlphaGo. 01:08:39.340 |
I mean, after that, we increased the power of the system 01:08:45.820 |
beats the other strong human players 60 games to nil. 01:08:56.980 |
- It's interesting that you spent time at AAAI 01:09:04.020 |
What, I mean, it's almost, I'm just curious to learn 01:09:15.500 |
he's written a book about artificial intelligence. 01:09:28.660 |
but I think Garry is the greatest chess player of all time, 01:09:32.300 |
the probably one of the greatest game players of all time. 01:09:36.540 |
And you sort of at the center of creating a system 01:09:41.540 |
that beats one of the greatest players of all time. 01:09:46.780 |
Is there anything, any interesting digs, any bets, 01:09:53.700 |
- So Garry Kasparov has an incredible respect 01:10:07.580 |
that he really appreciates and respects what we've done. 01:10:16.180 |
which later after AlphaGo, we built the AlphaZero system, 01:10:21.180 |
which defeated the world's strongest chess programs. 01:10:25.860 |
And to Garry Kasparov, that moment in computer chess 01:10:37.620 |
and a system which was able to discover for itself 01:10:44.820 |
which he hadn't always known about or anyone. 01:10:49.820 |
And in fact, one of the things I discovered at this panel 01:10:53.260 |
was that the current world champion, Magnus Carlsen, 01:10:56.620 |
apparently recently commented on his improvement 01:11:00.540 |
in performance and he attributes it to AlphaZero. 01:11:05.940 |
and he's changed his style to play more like AlphaZero. 01:11:08.820 |
And it's led to him actually increasing his rating 01:11:20.380 |
and just like you said with reinforcement learning, 01:11:26.300 |
machine learning feels like what intelligence is. 01:11:29.660 |
And you could attribute it to sort of a bitter viewpoint 01:11:34.660 |
from Garry's perspective, from us humans' perspective, 01:11:39.500 |
saying that pure search that IBM Deep Blue was doing 01:11:53.700 |
- So I think we should not demean the achievements 01:12:00.060 |
I think that Deep Blue was an amazing achievement in itself. 01:12:16.580 |
the way that the chess position was understood 01:12:22.500 |
is a limitation, which means that there's a ceiling 01:12:27.380 |
on how well it can do, but maybe more importantly, 01:12:30.500 |
it means that the same idea cannot be applied 01:12:38.500 |
and that ability to kind of encode exactly their knowledge 01:12:45.060 |
is that most domains turn out to be of the second type, 01:12:50.540 |
it's hard to extract from experts or it isn't even available. 01:12:54.020 |
And so we need to solve problems in a different way. 01:12:59.020 |
And I think AlphaGo is a step towards solving things 01:13:02.700 |
in a way which puts learning as a first-class citizen 01:13:07.700 |
and says systems need to understand for themselves 01:13:22.860 |
And in order to do that, we make progress towards AI. 01:13:27.860 |
- Yeah, so one of the nice things about this, 01:13:31.860 |
about taking a learning approach to the game of Go 01:13:35.420 |
or game playing is that the things you learn, 01:13:38.500 |
the things you figure out are actually going to be applicable 01:13:40.700 |
to other problems that are real-world problems. 01:13:46.700 |
there's two really interesting things about AlphaGo. 01:13:49.100 |
One is the science of it, just the science of learning, 01:13:54.580 |
And then the other is, well, you're actually learning 01:13:59.740 |
that would be potentially applicable in other applications, 01:14:14.940 |
really the profound step is probably AlphaGo Zero. 01:14:18.360 |
I mean, it's arguable, I kind of see them all 01:14:29.340 |
but it's removing the reliance on human expert games 01:14:39.760 |
that self-play could achieve superhuman level performance 01:14:45.720 |
And maybe could you also say what is self-play? 01:14:58.440 |
which is really about systems learning for themselves, 01:15:02.140 |
but in the situation where there's more than one agent. 01:15:10.520 |
then self-play is really about understanding that game 01:15:17.600 |
rather than against any actual real opponent. 01:15:20.040 |
And so it's a way to kind of discover strategies 01:15:27.480 |
and play against any particular human player, for example. 01:15:40.160 |
you know, try and step back from any of the knowledge 01:15:45.160 |
that we'd put into the system and ask the question, 01:15:47.840 |
is it possible to come up with a single elegant principle 01:16:08.360 |
in the sense that perhaps the knowledge you were putting in 01:16:11.960 |
and maybe stopping the system learning for itself, 01:16:19.780 |
the harder it is for a system to actually be placed, 01:16:23.480 |
taken out of the system in which it's kind of been designed 01:16:28.760 |
that maybe would need a completely different knowledge base 01:16:43.820 |
the promise of AI is that we can have systems such as that, 01:16:56.000 |
which can be placed into that world, into that environment, 01:17:05.280 |
the essence of intelligence, if we can achieve that. 01:17:10.040 |
And it's a step that was taken in the context 01:17:21.520 |
- So just to clarify, the first step was AlphaGo Zero. 01:17:25.560 |
- The first step was to try and take all of the knowledge 01:17:29.040 |
out of AlphaGo in such a way that it could play 01:17:32.840 |
in a fully self-discovered way, purely from self-play. 01:17:37.840 |
And to me, the motivation for that was always 01:17:41.840 |
that we could then plug it into other domains, 01:17:59.120 |
for researchers who are kind of too deeply embedded 01:18:08.160 |
which is, it actually occurred to me on honeymoon. 01:18:13.840 |
And I was like at my most fully relaxed state, 01:18:21.600 |
this, like, the algorithm for AlphaZero just appeared. 01:18:42.560 |
that it was only later that we had the opportunity 01:18:59.920 |
that represents, to me at least, artificial intelligence. 01:19:04.920 |
But the fact that you could use that kind of mechanism 01:19:14.820 |
So we kind of, to me, it feels like you have to train 01:19:23.640 |
Can you sort of think, not necessarily at that time, 01:19:47.500 |
for which we already know the likely outcome. 01:19:51.000 |
I don't see much value in running an experiment 01:19:53.200 |
where you're 95% confident that you will succeed. 01:19:57.640 |
And so we could have tried maybe to take AlphaGo 01:20:02.040 |
and do something which we knew for sure it would succeed on. 01:20:05.200 |
But much more interesting to me was to try it 01:20:09.400 |
And one of the big questions on our minds back then was, 01:20:14.240 |
could you really do this with self-play alone? 01:20:27.400 |
that it could reach the same level as these systems, 01:20:33.880 |
And even if it had not achieved the same level, 01:20:36.760 |
I felt that that was an important direction to be studying. 01:20:52.440 |
and indeed was able to beat it by 100 games to zero. 01:21:07.420 |
as we did in AlphaGo, AlphaGo suffered from these delusions. 01:21:12.060 |
Occasionally it would misunderstand what was going on 01:21:29.840 |
But the only way to address them in any complex system 01:21:33.280 |
is to give the system the ability to correct its own errors. 01:21:41.480 |
when it's doing something wrong and correct for it. 01:21:44.640 |
And so it seemed to me that the way to correct delusions 01:21:53.560 |
you should be able to correct for those errors 01:21:55.800 |
until it gets to play that out and understand, 01:21:58.400 |
oh, well, I thought that I was gonna win in this situation, 01:22:03.280 |
That suggests that I was mis-evaluating something, 01:22:13.440 |
and trace it back all the way to the beginning, 01:22:16.240 |
it should be able to take you from no knowledge, 01:22:21.840 |
all the way to the highest levels of knowledge 01:22:37.800 |
because it sees the stupid things that the random is doing 01:22:41.640 |
And then it can take you from that slightly better system 01:22:44.000 |
and understand, well, what's that doing wrong? 01:22:45.960 |
And it takes you on to the next level and the next level. 01:22:55.320 |
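A sketch of the self-play improvement loop being described, with assumed `agent` and `game` interfaces: the system plays against itself and uses each game's final outcome to correct the evaluations it made along the way.

```python
def self_play_game(agent, game):
    """Play one game against yourself; return visited positions and the outcome."""
    positions = []
    while not game.is_over():
        positions.append(game)
        game = game.play(agent.select_move(game))
    return positions, game.outcome()          # e.g. +1 for a black win, -1 for white

def self_improve(agent, new_game, num_games=10000):
    for _ in range(num_games):
        positions, outcome = self_play_game(agent, new_game())
        for position in positions:
            # If the agent thought it was winning here but lost (or vice versa),
            # this update nudges its evaluation toward the observed outcome.
            agent.update(position, target=outcome)
    return agent
```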
if we'd carried on training AlphaGo Zero for longer? 01:22:58.420 |
We saw no sign of it slowing down its improvements, 01:23:03.360 |
or at least it was certainly carrying on to improve. 01:23:06.680 |
And presumably, if you had the computational resources, 01:23:21.640 |
One of the surprising things, just like you said, 01:23:29.320 |
And reinforcement learning should be part of that process. 01:23:33.600 |
But what is surprising is in the process of patching 01:23:36.840 |
your own lack of knowledge, you don't open up other patches. 01:23:43.800 |
like there's a monotonic decrease of your weaknesses. 01:23:50.160 |
I think science always should make falsifiable hypotheses. 01:23:57.040 |
which is that if someone was to, in the future, 01:24:12.880 |
to beat the previous system 100 games to zero. 01:24:15.400 |
And that if they were then to do the same thing 01:24:19.240 |
that that would beat that previous system 100 games to zero. 01:24:22.080 |
And that that process would continue indefinitely 01:24:27.560 |
- Presumably the game of Go would set the ceiling. 01:24:33.200 |
but the game of Go has 10 to the 170 states in it. 01:24:36.320 |
So the ceiling is unreachable by any computational device 01:24:46.600 |
- You asked a really good question, which is, 01:25:05.200 |
that essentially progress will always lead you to, 01:25:10.200 |
if you have sufficient representational resource, 01:25:16.600 |
could represent every state in a big table of the game, 01:25:20.160 |
then we know for sure that a progress of self-improvement 01:25:24.040 |
will lead all the way in the single-agent case 01:25:29.080 |
and in the two-player case to the minimax optimal behavior. 01:25:35.280 |
knowing that you're playing perfectly against me. 01:25:39.760 |
we know that even if you do open up some new error, 01:25:46.360 |
You're progressing towards the best that can be done. 01:25:50.440 |
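In the small, fully representable case, "the best that can be done" is just the minimax value, which exhaustive search can compute exactly. A negamax sketch over an assumed toy-game interface (legal_moves, play, is_over, score for the player to move), which is obviously infeasible for Go itself:

```python
def negamax(game):
    """Exact value of the position for the player to move: +1 win, 0 draw, -1 loss."""
    if game.is_over():
        return game.score()                    # from the current player's viewpoint
    return max(-negamax(game.play(move)) for move in game.legal_moves())

def best_move(game):
    """The minimax-optimal move for the player to move."""
    return max(game.legal_moves(), key=lambda move: -negamax(game.play(move)))
```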
- So AlphaGo was initially trained on expert games 01:25:56.440 |
AlphaGo Zero removed the need to be trained on expert games. 01:26:05.720 |
is to generalize that further to be in AlphaZero 01:26:14.760 |
and then also being able to play the game of chess 01:26:31.960 |
was that actually without modifying the algorithm at all, 01:26:41.280 |
In particular, we dropped it into the game of chess. 01:26:44.760 |
And unlike the previous systems like Deep Blue, 01:26:47.160 |
which had been worked on for years and years, 01:26:52.640 |
the world's strongest computer chess program convincingly 01:27:00.920 |
by its own from scratch with its own principles. 01:27:04.920 |
And in fact, one of the nice things that we found 01:27:08.160 |
was that, in fact, we also achieved the same result 01:27:15.160 |
and then place them back down on your own side 01:27:21.840 |
And we also beat the world's strongest programs 01:27:24.760 |
and reached superhuman performance in that game too. 01:27:30.080 |
that we'd ever run the system on that particular game, 01:27:34.480 |
was the version that we published in the paper on AlphaZero. 01:27:37.680 |
It just worked out of the box, literally, no touching it. 01:27:49.560 |
about that principle that you can take an algorithm 01:27:52.960 |
and without twiddling anything, it just works. 01:27:57.720 |
Now, to go beyond AlphaZero, what's required? 01:28:09.920 |
But one of the important steps is to acknowledge 01:28:26.120 |
At least maybe we understand that it operates 01:28:31.160 |
at the micro level or according to relativity 01:28:36.800 |
that's useful for us as people to operate in it. 01:28:40.240 |
Somehow the agent needs to understand the world for itself 01:28:43.800 |
in a way where no one tells it the rules of the game, 01:28:46.320 |
and yet it can still figure out what to do in that world, 01:28:49.960 |
deal with this stream of observations coming in, 01:28:53.600 |
rich sensory input coming in, actions going out in a way 01:28:57.000 |
that allows it to reason in the way that AlphaGo 01:28:59.560 |
or AlphaZero can reason, in the way that these 01:29:15.320 |
in the story of AlphaGo, which was a system called MuZero. 01:29:19.560 |
And MuZero is a system which learns for itself 01:29:33.920 |
the canonical domains of Atari that have been used 01:29:42.920 |
of these Atari games that was sufficiently rich 01:29:46.960 |
and useful enough for it to be able to plan successfully. 01:29:59.340 |
was able to reach the same level of superhuman performance 01:30:03.020 |
in Go, chess, and shogi that we'd seen in AlphaZero, 01:30:08.740 |
the system can learn for itself just by trial and error, 01:30:15.040 |
but you just get to the end and someone says, 01:30:18.760 |
Or you play this game of chess and someone says win or loss, 01:30:31.940 |
the dynamics of the world, how the world works. 01:30:48.140 |
that you have to go through when you're facing 01:31:05.900 |
in the way that it needs to in order to be consumable, 01:31:10.060 |
sort of in order for the reinforcement learning framework 01:31:13.740 |
to be able to act in the environment and so on. 01:31:17.020 |
needs to deal with worlds that are unknown and complex 01:31:24.560 |
And so MuZero is a step, a further step in that direction. 01:31:29.460 |
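Schematically, the idea described here comes down to three learned functions, following the published description of MuZero: a representation of the raw observation, a dynamics function over a learned internal state, and a prediction of policy and value; planning then happens entirely inside that learned model. The placeholder classes below are only an outline, not the real system, which plugs these into Monte Carlo tree search.

```python
class Representation:
    """h: map raw observations (pixels, a board, ...) to a learned internal state."""
    def encode(self, observation):
        return observation            # placeholder: a real system learns this mapping

class Dynamics:
    """g: predict the next internal state and immediate reward for an action."""
    def step(self, internal_state, action):
        return internal_state, 0.0    # placeholder

class Prediction:
    """f: predict a policy (move probabilities) and a value from an internal state."""
    def evaluate(self, internal_state):
        return None, 0.0              # placeholder

def plan_one_step(representation, dynamics, prediction, observation, actions):
    """Look one step ahead entirely inside the learned model (no game rules needed)."""
    state = representation.encode(observation)
    def score(action):
        next_state, reward = dynamics.step(state, action)
        _, value = prediction.evaluate(next_state)
        return reward + value        # predicted immediate reward plus predicted future value
    return max(actions, key=score)
```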
- One of the things that inspired the general public 01:31:32.180 |
and just in conversations I have like with my parents 01:31:34.540 |
or something with my mom that just loves what was done 01:31:42.120 |
some new strategies, new behaviors that were created. 01:31:50.780 |
Do you see it the same way that there's creativity 01:31:52.920 |
and there's some behaviors, patterns that you saw 01:31:57.220 |
that AlphaZero was able to display that are truly creative? 01:32:05.860 |
that I think we should ask what creativity really means. 01:32:08.260 |
So to me, creativity means discovering something 01:32:13.260 |
which wasn't known before, something unexpected, 01:32:19.700 |
And so in that sense, the process of reinforcement learning 01:32:24.700 |
or the self-play approach that was used by AlphaZero 01:32:34.560 |
you're playing according to your current norms 01:32:44.940 |
And then that process, it's like a micro discovery 01:32:54.500 |
"Oh, this pattern, this pattern's working really well for me. 01:32:58.620 |
And now, "Oh, here's this other thing I can do. 01:33:00.700 |
"I can start to connect these stones together in this way 01:33:04.040 |
"or I can start to sacrifice stones or give up on pieces 01:33:12.440 |
The system's discovering things like this for itself 01:33:17.100 |
And so it should come as no surprise to us then 01:33:22.140 |
that they discover things that are not known to humans, 01:33:25.740 |
that to the human norms are considered creative. 01:33:39.220 |
where what we saw was that there are these opening patterns 01:33:45.500 |
These are like the patterns that humans learn 01:33:50.260 |
over literally thousands of years in the game of Go. 01:33:53.220 |
And what we saw was in the course of the training, 01:34:06.980 |
And over time, we found that all of the joseki 01:34:10.180 |
that humans played were discovered by the system 01:34:15.660 |
and this sort of essential notion of creativity. 01:34:19.700 |
But what was really interesting was that over time, 01:34:24.940 |
in favor of its own joseki that humans didn't know about. 01:34:29.580 |
you thought that the knight's move pincer joseki 01:34:34.100 |
but here's something different you can do there, 01:34:46.620 |
that are used in today's top level Go competitions. 01:34:54.780 |
maybe just makes me feel good as a human being 01:34:58.340 |
that a self-play mechanism that knows nothing about us humans 01:35:04.580 |
It's like an affirmation that we're doing okay as humans. 01:35:15.980 |
It sucks, but it's the best song we've tried. 01:35:29.580 |
with AlphaStar and so on and the current work. 01:35:43.820 |
do you see it being applied in other domains? 01:35:50.620 |
that it's applied in both the simulated environments 01:35:56.380 |
Constrained, I mean, AlphaStar really demonstrates 01:35:58.980 |
that you can remove a lot of the constraints, 01:36:00.500 |
but nevertheless, it's in a digital simulated environment. 01:36:04.100 |
Do you have a hope, a dream that it starts being applied 01:36:09.140 |
And maybe even in domains that are safety critical 01:36:12.980 |
and so on, and have a real impact in the real world, 01:36:18.300 |
which seems like a very far out dream at this point. 01:36:25.580 |
that we will get to the point where ideas just like these 01:36:34.900 |
other people use your algorithms in unexpected ways. 01:36:43.260 |
where different teams unbeknownst to us took AlphaZero 01:36:48.260 |
and applied exactly those same algorithms and ideas 01:36:52.780 |
to real world problems of huge meaning to society. 01:36:57.620 |
So one of them was the problem of chemical synthesis, 01:37:01.020 |
and they were able to beat the state of the art 01:37:02.980 |
in finding pathways of how to actually synthesize chemicals, 01:37:19.500 |
you know, one of the big questions is how to understand 01:37:22.780 |
the nature of the function in quantum computation, 01:37:26.780 |
and a system based on AlphaZero beat the state of the art 01:37:32.380 |
So these are just examples, and I think, you know, 01:37:44.180 |
You know, you provide a really powerful tool to society, 01:37:53.620 |
and for sure I hope that we see all kinds of outcomes. 01:38:01.820 |
of reinforcement learning framework is, you know, 01:38:05.580 |
you usually want to specify a reward function, 01:38:09.220 |
What do you think about sort of ideas of intrinsic rewards 01:38:13.860 |
if, and when we're not really sure about, you know, 01:38:17.460 |
if we take, you know, human beings as existence proof 01:38:32.100 |
for when you don't know how to truly specify the reward, 01:38:42.700 |
- So I think, you know, when we think about intelligence, 01:38:48.380 |
And I think it's clearest to understand that problem 01:38:52.700 |
that we want the system to try and solve for. 01:38:58.920 |
do we really even have a clearly defined problem 01:39:04.320 |
Now, within that, as with your example for humans, 01:39:08.500 |
the system may choose to create its own motivations 01:39:17.820 |
And that may indeed be a hugely important mechanism 01:39:25.500 |
that I think the system needs to be measurable 01:39:39.040 |
But if we think of those goals, really, you know, 01:39:41.860 |
like the goal of being able to pick up an object 01:39:47.180 |
or influence people to do things in a particular way 01:39:54.220 |
really, they're sub goals, really, that we set ourselves. 01:40:05.340 |
And we choose those because we think it will lead us 01:40:10.500 |
We think that's helpful to us to achieve some ultimate goal. 01:40:15.080 |
Now, I don't want to speculate whether or not humans 01:40:18.260 |
as a system necessarily have a singular overall goal 01:40:28.100 |
that if we're trying to understand intelligence 01:40:44.040 |
We have to know what we're asking the system to do. 01:40:46.380 |
Otherwise, if you don't have a clearly defined purpose, 01:40:48.860 |
you're not gonna get a clearly defined answer. 01:40:51.620 |
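One simple, generic form of the intrinsic-motivation idea raised in the question, and not a method discussed in this conversation: add a small novelty bonus to the externally defined reward, so the agent in effect sets itself the sub-goal of exploring unfamiliar states.

```python
from collections import defaultdict

visit_counts = defaultdict(int)

def shaped_reward(state, external_reward, bonus_scale=0.1):
    """External task reward plus a count-based bonus that decays with familiarity."""
    visit_counts[state] += 1
    intrinsic_bonus = bonus_scale / (visit_counts[state] ** 0.5)
    return external_reward + intrinsic_bonus
```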
- The ridiculous big question that has to naturally follow, 01:41:03.340 |
or big real questions before humans is the meaning of life, 01:41:08.060 |
is us trying to figure out our own reward function. 01:41:12.520 |
that if you want to build intelligent systems 01:41:16.260 |
you should be at least cognizant to some degree 01:41:22.000 |
what do you think is the reward function of human life, 01:41:30.740 |
- I think I'd be speculating beyond my own expertise, 01:41:39.420 |
- And say, I think that there are many levels 01:41:42.980 |
and you can understand something as optimizing 01:41:58.100 |
just following certain mechanical laws of physics 01:42:02.340 |
and that that's led to the development of the universe. 01:42:04.620 |
But at another level, you can view it as actually, 01:42:17.300 |
that you can think of this as almost like a goal 01:42:20.080 |
of the universe, that the purpose of the universe 01:42:34.080 |
well, how can that be done by a particular system? 01:42:41.740 |
that the universe discovered in order to kind of dissipate 01:42:48.080 |
And by the way, I'm borrowing from Max Tegmark 01:42:53.920 |
But if you can think of evolution as a mechanism 01:43:13.700 |
which is to actually reproduce as effectively as possible. 01:43:27.580 |
that can survive and reproduce as effectively as possible. 01:43:31.620 |
that high level goal, those individual organisms 01:43:33.860 |
discover brains, intelligences, which enable them 01:43:47.800 |
maybe they were controlling things at some direct level. 01:43:53.100 |
of pre-programmed systems, which were directly controlling 01:44:07.260 |
parts of the brain which were able to learn for themselves 01:44:10.140 |
and learn how to program themselves to achieve any goal. 01:44:20.340 |
and provides this very flexible notion of intelligence 01:44:26.540 |
why the reason we feel that we can achieve any goal. 01:44:30.020 |
So it's a very long-winded answer to say that, 01:44:32.820 |
you know, I think there are many perspectives 01:44:34.700 |
and many levels at which intelligence can be understood. 01:44:49.500 |
and understand it as AI researchers or computer scientists. 01:45:13.100 |
of the meaning of life structured so beautifully in layers. 01:45:18.340 |
which is the next step which you're responsible for, 01:45:21.740 |
which is creating the artificial intelligence layer 01:45:28.220 |
And I can't wait to see, well, I may not be around, 01:45:31.740 |
but I can't wait to see what the next layer beyond that. 01:45:36.260 |
- Well, let's just take that argument, you know, 01:45:41.260 |
So the next level indeed is for how can our learning brain 01:45:49.180 |
Well, maybe it does so by us as learning beings, 01:45:53.780 |
building a system which is able to solve for those goals 01:46:01.820 |
And so when we build a system to play the game of Go, 01:46:04.940 |
you know, when I said that I wanted to build a system 01:46:08.740 |
I've enabled myself to achieve that goal of playing Go 01:46:18.740 |
which is systems which are able to achieve goals 01:46:22.620 |
And ultimately there may be layers beyond that 01:46:25.060 |
where they set sub-goals to parts of their own system 01:46:36.100 |
I think is a multi-layered one and a multi-perspective one. 01:46:52.260 |
and for inspiring millions of people in the process. 01:47:01.300 |
with David Silver and thank you to our sponsors, 01:47:07.740 |
by signing up to Masterclass at masterclass.com/lex 01:47:12.100 |
and downloading Cash App and using code LEXPODCAST. 01:47:15.740 |
If you enjoy this podcast, subscribe on YouTube, 01:47:20.260 |
support on Patreon, or simply connect with me on Twitter 01:47:25.260 |
And now let me leave you with some words from David Silver. 01:47:28.620 |
"My personal belief is that we've seen something 01:47:31.220 |
"of a turning point where we're starting to understand 01:47:34.380 |
"that many abilities, like intuition and creativity, 01:47:38.100 |
"that we've previously thought were in the domain only 01:47:45.420 |
"And I think that's a really exciting moment in history." 01:47:48.300 |
Thank you for listening and hope to see you next time.