What is Deep Reinforcement Learning? (David Silver, DeepMind) | AI Podcast Clips
Chapters
0:00 What is reinforcement learning
1:12 Ambitious problem definition
3:27 Fundamental idea
6:00 Deep reinforcement learning
If it's okay, can we take a step back and ask the basic question of: what is, to you, reinforcement learning?
So reinforcement learning is the study and the science and the problem of intelligence in the form of an agent that interacts with an environment. So the problem you're trying to solve is represented by some environment, like the world in which the agent is situated. And the way it works is that the agent gets to take actions. Those actions have some effect on the environment, and the environment gives back an observation to the agent saying, you know, this is what you see or sense. And one special thing which it gives back is called the reward signal: how well it's doing in that environment. And the reinforcement learning problem is to simply take actions over time so as to maximize that reward signal.
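To make that loop concrete, here is a minimal Python sketch of the interaction being described: the agent takes actions, the environment answers with an observation and a reward signal, and the agent's goal is to accumulate as much reward as possible over time. The toy environment and the random agent below are invented purely for illustration.

```python
import random

class Environment:
    """A toy one-dimensional world: reward is given for reaching position +3."""
    def __init__(self):
        self.position = 0

    def step(self, action):
        # The action (-1 or +1) has some effect on the environment...
        self.position += action
        observation = self.position                    # ...which reports back what the agent "sees"
        reward = 1.0 if self.position == 3 else 0.0    # ...plus the special reward signal
        return observation, reward

class RandomAgent:
    """Picks actions at random; a learning agent would use the rewards to do better."""
    def act(self, observation):
        return random.choice([-1, +1])

env, agent = Environment(), RandomAgent()
observation, total_reward = 0, 0.0
for t in range(20):                        # the agent takes actions over time...
    action = agent.act(observation)
    observation, reward = env.step(action)
    total_reward += reward                 # ...so as to maximize cumulative reward
print("return over 20 steps:", total_reward)
```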
So I don't know if there's a nice, brief, in-words way to paint a picture of, sort of, value-based, model-based, policy-based reinforcement learning.
So now if we think about, okay, there's this ambitious problem definition of RL. It's trying to capture and encircle all of the things in which an agent interacts with an environment and say, well, how can we formalize and understand what it means to crack that? Well, how do you solve a really hard problem like that? One approach you can take is to decompose that very hard problem into pieces that work together to solve it. And so you can kind of look at the decomposition that's inside the agent's head, if you like, and ask, well, what form does that decomposition take?
And some of the most common pieces that people use when they're putting this system, the solution method, together are, first, whether or not the agent represents a value function. That means: is it explicitly trying to predict how much reward it will get in the future? Second, whether it has a policy, meaning something which is deciding how to pick actions. Is that decision-making process explicitly represented? And third, whether it has a model: is there something which is explicitly trying to predict what will happen in the environment? And so those three pieces are, to me, some of the most common building blocks.
And I understand the different choices in RL as choices of whether or not to use those building blocks when you're trying to decompose the solution. And of course, there are other things that you could add into the picture as well. But those three fundamental choices give rise to some of the branches of RL with which we're most familiar.
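As a rough picture of those three optional building blocks, the sketch below writes each one down as an explicit, tabular object that an agent may or may not choose to represent. The names and the single example transition are placeholders chosen only for illustration.

```python
# Three building blocks an agent may or may not represent explicitly.
value_function = {}   # state -> predicted future reward   ("how much reward will I get?")
policy = {}           # state -> action                     ("how should I pick actions?")
model = {}            # (state, action) -> next state       ("what will happen in the environment?")

# One made-up transition: in state 0, action +1 led to state 1 with reward 0.5.
state, action, reward, next_state = 0, +1, 0.5, 1

model[(state, action)] = next_state                                   # explicit prediction of the environment
value_function[state] = reward + value_function.get(next_state, 0.0)  # explicit prediction of future reward
policy[state] = action                                                # explicit decision of what to do

# Different branches of RL keep different subsets: value-based methods keep value_function,
# policy-based methods keep policy, and model-based methods also keep model.
```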
So, as you mentioned, there is a choice of what's specified or modeled explicitly, and the idea is that all of these are somehow implicitly learned within the system. So it's almost a choice of how you approach the problem. Do you see those as fundamental differences, or are they almost like small specifics, the details of how you solve the problem, but not fundamentally different from each other?
I think the fundamental idea is maybe at the higher level. The fundamental idea is that the first step of the decomposition is really to say, well, how are we going to solve any kind of problem where, just from this stream of observations, you're trying to figure out what to do? You've got some agent situated in its sensory-motor stream, getting all these observations in and getting to take these actions. Maybe the complexity of the world is so great that you can't even imagine how to build a system that would understand how to deal with that.
And so the first step of this decomposition is to say, well, you have to learn. And note that the reinforcement learning problem doesn't actually stipulate that you have to learn; if you tried to maximize your rewards without learning, you just wouldn't do a very good job of it. So learning is required, because it's the only way to achieve good performance in any sufficiently complex environment.
And so that step gives commonality to all of the other pieces, because now you might ask, well, what does it mean to learn? In this sense, learning might mean, well, you're trying to update the parameters of some system, which is then the thing that actually picks the actions. And those parameters could be representing anything. They could be parameterizing a value function or a model or a policy. And so, in that sense, there's a lot of commonality: whatever is being represented there is the thing which is being learned, and it's being learned with the ultimate goal of maximizing rewards.
But the way in which you decompose the problem is really what gives the semantics to the whole system. Are you trying to learn something to predict well, like a value function or a model? Are you learning something to perform well, like a policy? The form of that objective is what gives the semantics to the system. And so it really is, at the next level down, a fundamental choice, and we have to make those fundamental choices as system designers, or enable our algorithms to learn how to make those choices for themselves.
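A hedged sketch of that distinction, with made-up numbers and names: the same sort of parameter vector can be updated either to predict well, by shrinking a value-prediction error, or to perform well, by nudging a policy toward actions that earned reward. The objective, not the parameters, supplies the semantics.

```python
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.9                    # step size and discount, chosen arbitrarily
w = np.zeros(n_states)                     # parameters of a value function V(s) = w[s]
theta = np.zeros((n_states, n_actions))    # parameters of a policy pi(.|s) = softmax(theta[s])

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# One piece of hypothetical experience: state, action, reward, next state.
s, a, r, s_next = 0, 1, 1.0, 2

# Objective 1, learn to PREDICT well: move V(s) toward the bootstrapped target r + gamma * V(s').
td_error = r + gamma * w[s_next] - w[s]
w[s] += alpha * td_error

# Objective 2, learn to PERFORM well: raise the log-probability of the chosen action in
# proportion to the reward it earned (a REINFORCE-style step, using immediate reward
# in place of the full return for brevity).
probs = softmax(theta[s])
grad_log_pi = -probs
grad_log_pi[a] += 1.0                      # gradient of log pi(a|s) with respect to theta[s]
theta[s] += alpha * r * grad_log_pi

print("V:", w)
print("pi(.|s=0):", softmax(theta[0]))
```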
- So then the next step you mentioned, the very first thing you have to deal with, is: can you even take in this huge stream of observations and do anything with it? So the natural next basic question is: what is deep reinforcement learning? And what is this idea of using neural networks to deal with this huge incoming stream?
- So amongst all the approaches for reinforcement learning, deep reinforcement learning is one family of solution methods that tries to utilize the powerful representations offered by neural networks to represent any of these different components of the solution, of the agent, whether it's the value function or the model or the policy. The idea of deep learning is to say, well, here's a powerful toolkit that's so powerful that it's universal, in the sense that it can represent any function and it can learn any function. And so if we can leverage that universality, that means that whatever we need to represent for our policy or for our value function or for our model, deep learning can do it.
So deep learning is one approach that offers us a toolkit that has no ceiling to its performance. As we start to put more resources into the system, more memory and more computation and more data, more experience, more interactions with the environment, these are systems that can just get better and better and better at doing whatever job we've asked of them. Whatever we've asked that function to represent, it can learn a function that does a better and better job of representing that knowledge, whether that knowledge be estimating how well you're going to do in the world, the value function; choosing what to do in the world, the policy; or understanding the world itself, what's going to happen next, the model.
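As one concrete, and entirely illustrative, example of plugging a neural network into one of those slots, the sketch below uses PyTorch to fit a small value network to some fake reward-to-go targets; a policy head or a model head would follow the same pattern with a different output and loss. The architecture, sizes, and data are arbitrary assumptions, not a description of any DeepMind system.

```python
import torch
import torch.nn as nn

# A small neural network that maps an observation to an estimated value
# ("how well am I going to do from here?"). The same toolkit could represent
# a policy (action probabilities) or a model (predicted next observation).
value_net = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

# Fake batch of experience: 32 observations with reward-to-go targets (illustrative only).
observations = torch.randn(32, 4)
returns = torch.randn(32, 1)

for step in range(200):                                   # more data and compute, a better fit
    predictions = value_net(observations)
    loss = nn.functional.mse_loss(predictions, returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```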
- Nevertheless, the fact that neural networks are able to learn incredibly complex representations that allow you to do the policy, the model, or the value function is, at least to my mind, still quite surprising. Can you still believe it works as well as it does? Do you have good intuition about why it works at all, and works as well as it does?
- I think, let me take two parts to that question. It's not surprising to me that the idea of reinforcement learning works, because in some sense I feel it's the only thing which, ultimately, can work, and so I feel we have to address it. And success must be possible, because we have examples of intelligence: it must at some level be possible to acquire experience and use that experience to do better, in a way which is meaningful, in environments of the complexity that humans can deal with.
Am I surprised that our current systems can do as well as they can do? I think one of the big surprises for me, and for a lot of the community, is really the fact that deep learning can continue to perform so well despite the fact that these neural networks have incredibly non-linear, kind of bumpy surfaces, which to our low-dimensional intuitions make it feel like surely you're just going to get stuck. Learning will get stuck, because you won't be able to make any further progress. And yet the big surprise is that learning continues, and these things which appear to be local optima turn out not to be, because in high dimensions, when we make really big neural nets, there's always a way out, a way to go even lower. And then you're still not in a local optimum, because there's some other pathway that will take you out and take you lower still. And so no matter where you are, learning can proceed and do better and better and better. And so that is a surprising and beautiful property of neural nets, which I find elegant and beautiful and somewhat shocking that it turns out to be the case.
As you said, which I really like: to our low-dimensional intuitions, that's surprising. We're very tuned to working within a three-dimensional environment, and so to start to visualize what a billion-dimensional neural network surface that you're trying to optimize over, what that even looks like, is very hard for us.
And so I think that, really, if you try to account for essentially the AI winter, where people gave up on neural networks, I think it's really down to that lack of ability to generalize from low dimensions to high dimensions. Because back then we were in the low-dimensional case: people could only build neural nets with 50 nodes in them, or something. And to imagine that it might be possible to build a billion-dimensional neural net, and that it might have a completely different, qualitatively different property, was very hard to anticipate. And I think even now we're starting to build the theory to support that. It's incomplete at the moment, but all of the theory seems to be pointing in the direction that indeed this is an approach which truly is universal, both in its representational capacity, which was known, and in its learning ability, which is surprising.
- It makes one wonder what else we're missing due to our low-dimensional intuitions.
- I often wonder, when we one day do have AIs which are superhuman in their abilities to understand the world, what will they think of the algorithms that we're developing now? Will they look back at these days and feel that these algorithms were naive first steps, or will they still be the fundamental ideas which are used even in a hundred, a thousand, ten thousand years?
- They'll watch this conversation back with a smile, maybe a little bit of a laugh.
- My sense is, I think, that just like when we used to think that the sun revolved around the earth, they'll see our systems of today, reinforcement learning, as too complicated. There's something, just like you said with the game of Go, I love the systems of cellular automata, where there are simple rules from which incredible complexity emerges. So it feels like there might be some very simple approaches, just like what Sutton argues: these simple methods, with compute, over time seem to prove to be the most effective.
I think that if we try to anticipate what will generalize well into the future, it's likely to be the case that it's the simple, clear ideas which will have the longest legs and which will carry us furthest into the future. Nevertheless, we're in a situation where we need to make things work today, and sometimes that requires putting together more complex systems where we don't have the full answers yet as to what those minimal ingredients might be.