
What is Deep Reinforcement Learning? (David Silver, DeepMind) | AI Podcast Clips


Chapters

0:00 What is reinforcement learning
1:12 Ambitious problem definition
3:27 Fundamental idea
6:00 Deep reinforcement learning

Transcript

00:00:00.000 | If it's okay, can we take a step back and ask the basic question of what is, to you,
00:00:08.040 | reinforcement learning?
00:00:09.040 | So reinforcement learning is the study and the science and the problem of intelligence
00:00:18.040 | in the form of an agent that interacts with an environment.
00:00:22.160 | So the problem you're trying to solve is represented by some environment, like the world in which
00:00:25.400 | that agent is situated.
00:00:27.400 | And the goal of RL is clear, that the agent gets to take actions.
00:00:32.320 | Those actions have some effect on the environment and the environment gives back an observation
00:00:35.720 | to the agent saying, you know, this is what you see or sense.
00:00:39.520 | And one special thing which it gives back is called the reward signal, how well it's
00:00:43.320 | doing in the environment.
00:00:44.800 | And the reinforcement learning problem is to simply take actions over time so as to
00:00:51.240 | maximize that reward signal.
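
To make this interaction loop concrete, here is a minimal sketch in Python. It is not taken from the conversation: it assumes a Gymnasium-style environment ("CartPole-v1") and a placeholder random agent, and simply runs the action, observation, reward cycle described above while accumulating the reward signal.

```python
# Minimal sketch of the agent-environment loop described above.
# Assumes the Gymnasium API; the environment and agent are illustrative.
import gymnasium as gym


class RandomAgent:
    """Placeholder agent that picks actions uniformly at random."""

    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, observation):
        return self.action_space.sample()


env = gym.make("CartPole-v1")          # the environment the agent is situated in
agent = RandomAgent(env.action_space)

observation, info = env.reset()
total_reward = 0.0
done = False
while not done:
    action = agent.act(observation)    # the agent takes an action
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward             # the reward signal: how well it is doing
    done = terminated or truncated

print(f"Return for this episode: {total_reward}")
```

A learning agent would differ only in what happens inside `act` and in how it uses the observed rewards; the loop itself is the reinforcement learning problem in miniature.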
00:00:54.060 | So a couple of basic questions.
00:00:57.920 | What types of RL approaches are there?
00:01:00.560 | So I don't know if there's a nice, brief, in-words way to paint a picture of sort of
00:01:06.880 | value-based, model-based, policy-based reinforcement learning.
00:01:12.120 | Yeah.
00:01:13.120 | So now if we think about, okay, so there's this ambitious problem definition of RL.
00:01:18.600 | It's really, you know, it's truly ambitious.
00:01:20.040 | It's trying to capture and encircle all of the ways in which an agent interacts with
00:01:23.640 | an environment and say, well, how can we formalize and understand what it means to crack that?
00:01:28.740 | Now let's think about the solution method.
00:01:30.200 | Well, how do you solve a really hard problem like that?
00:01:32.920 | Well, one approach you can take is to decompose that very hard problem into pieces that work
00:01:39.960 | together to solve that hard problem.
00:01:42.760 | And so you can kind of look at the decomposition that's inside the agent's head, if you like,
00:01:47.400 | and ask, well, what form does that decomposition take?
00:01:50.480 | And some of the most common pieces that people use when they're kind of putting this
00:01:54.480 | solution method together are whether
00:01:59.000 | or not that solution has a value function.
00:02:01.560 | That means, is it trying to predict, explicitly trying to predict how much reward it will
00:02:05.320 | get in the future?
00:02:06.320 | Does it have a representation of a policy?
00:02:09.480 | That means something which is deciding how to pick actions.
00:02:12.420 | Is that decision-making process explicitly represented?
00:02:16.000 | And is there a model in the system?
00:02:18.680 | Is there something which is explicitly trying to predict what will happen in the environment?
00:02:23.260 | And so those three pieces are, to me, some of the most common building blocks.
00:02:29.200 | And I understand the different choices in RL as choices of whether or not to use those
00:02:35.720 | building blocks when you're trying to decompose the solution.
00:02:39.340 | Should I have a value function represented?
00:02:40.960 | Should I have a policy represented?
00:02:43.400 | Should I have a model represented?
00:02:45.160 | And there are combinations of those pieces.
00:02:46.920 | And of course, other things that you could add into the picture as well.
00:02:49.880 | But those three fundamental choices give rise to some of the branches of RL with which we're
00:02:54.240 | very familiar.
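
As a rough illustration of the decomposition described here, the three building blocks can be written as three small interfaces. This is a sketch, not anything stated in the conversation; the class and method names are illustrative, and a concrete agent might represent any subset of them (value-based, policy-based, model-based, or combinations such as actor-critic).

```python
# Illustrative interfaces for the three common building blocks of an RL agent.
from abc import ABC, abstractmethod


class ValueFunction(ABC):
    """Explicitly predicts how much reward will be received in the future."""

    @abstractmethod
    def value(self, state) -> float:
        ...


class Policy(ABC):
    """Explicitly represents the decision-making process: how to pick actions."""

    @abstractmethod
    def act(self, state):
        ...


class Model(ABC):
    """Explicitly predicts what will happen next in the environment."""

    @abstractmethod
    def predict(self, state, action):
        """Return a predicted (next_state, reward)."""
        ...
```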
00:02:55.320 | And so those, as you mentioned, there is a choice of what's specified or modeled explicitly.
00:03:04.160 | And the idea is that all of these are somehow implicitly learned within the system.
00:03:10.140 | So it's almost a choice of how you approach a problem.
00:03:15.120 | Do you see those as fundamental differences or are these almost like small specifics,
00:03:22.160 | like the details of how you solve the problem, but they're not fundamentally different from
00:03:25.600 | each other?
00:03:27.520 | I think the fundamental idea is maybe at the higher level.
00:03:32.520 | The fundamental idea is the first step of the decomposition is really to say, well,
00:03:39.020 | how are we really going to solve any kind of problem where you're trying to figure out
00:03:43.080 | how to take actions?
00:03:44.080 | And just from this stream of observations, you've got some agent situated in its sensory
00:03:48.040 | motor stream and getting all these observations in, getting to take these actions.
00:03:52.160 | And what should it do?
00:03:53.160 | How can you even broach that problem?
00:03:54.160 | Maybe the complexity of the world is so great that you can't even imagine how to build a
00:03:59.400 | system that would understand how to deal with that.
00:04:02.440 | And so the first step of this decomposition is to say, well, you have to learn.
00:04:06.160 | The system has to learn for itself.
00:04:08.720 | And so note that the reinforcement learning problem doesn't actually stipulate that you
00:04:12.960 | have to learn.
00:04:13.960 | You could try to maximize your rewards without learning; you just wouldn't do a very
00:04:17.400 | good job of it.
00:04:19.040 | So learning is required because it's the only way to achieve good performance in any sufficiently
00:04:24.400 | large and complex environment.
00:04:27.120 | So that's the first step.
00:04:28.640 | And so that step gives commonality to all of the other pieces, because now you might
00:04:32.440 | ask, well, what should you be learning?
00:04:35.440 | What does learning even mean?
00:04:36.440 | You know, in this sense, learning might mean, well, you're trying to update the parameters
00:04:42.240 | of some system, which is then the thing that actually picks the actions.
00:04:48.040 | And those parameters could be representing anything.
00:04:50.040 | They could be parameterizing a value function or a model or a policy.
00:04:55.180 | And so in that sense, there's a lot of commonality in that whatever is being represented there
00:04:58.840 | is the thing which is being learned, and it's being learned with the ultimate goal of maximizing
00:05:02.880 | rewards.
00:05:05.240 | But the way in which you decompose the problem is really what gives the semantics to the
00:05:09.080 | whole system.
00:05:10.080 | Are you trying to learn something to predict well, like a value function or a model?
00:05:15.080 | Are you learning something to perform well, like a policy?
00:05:18.760 | And the form of that objective is kind of giving the semantics to the system.
00:05:23.000 | And so it really is, at the next level down, a fundamental choice.
00:05:26.760 | And we have to make those fundamental choices as system designers, or enable our algorithms
00:05:32.680 | to be able to learn how to make those choices for themselves.
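
To make "learning means updating the parameters of the system that picks the actions" concrete, here is a toy sketch, again not from the conversation: a softmax policy over a three-armed bandit, with a REINFORCE-style update that nudges the parameters in the direction that makes rewarding actions more probable. The bandit rewards, step size, and iteration count are made up for illustration.

```python
# Toy example: the parameters being learned here parameterize a policy,
# and they are updated with the single goal of maximizing reward.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # hidden average reward of each action
theta = np.zeros(3)                       # the parameters being learned
alpha = 0.1                               # step size

for _ in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()       # softmax policy
    action = rng.choice(3, p=probs)
    reward = rng.normal(true_means[action], 0.1)      # feedback from the "environment"

    # Gradient of log pi(action) with respect to theta for a softmax policy.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0

    # Move the parameters so that actions which earned reward become more likely.
    theta += alpha * reward * grad_log_pi

print("Learned action preferences:", theta)
```

Swapping in a value function or a model changes what the parameters represent and what the update targets are, but the same structure, parameters updated in service of reward, carries over.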
00:05:36.020 | - So then the next step you mentioned, the very first thing you have to deal with is,
00:05:42.720 | can you even take in this huge stream of observations and do anything with it?
00:05:48.240 | So the natural next basic question is, what is deep reinforcement learning?
00:05:55.000 | And what is this idea of using neural networks to deal with this huge incoming stream?
00:06:01.260 | - So amongst all the approaches for reinforcement learning, deep reinforcement learning is one
00:06:07.160 | family of solution methods that tries to utilize powerful representations that are offered
00:06:16.720 | by neural networks to represent any of these different components of the solution, of the
00:06:23.800 | agent, like whether it's the value function or the model or the policy.
00:06:28.440 | The idea of deep learning is to say, well, here's a powerful toolkit that's so powerful
00:06:33.160 | that it's universal in the sense that it can represent any function and it can learn any
00:06:37.880 | function.
00:06:38.880 | And so if we can leverage that universality, that means that whatever we need to represent
00:06:44.440 | for our policy or for our value function or for our model, deep learning can do it.
00:06:48.560 | So deep learning is one approach that offers us a toolkit that has no ceiling to
00:06:55.240 | its performance, that as we start to put more resources into the system, more memory and
00:07:00.160 | more computation and more data, more experience, more interactions with the environment, that
00:07:07.280 | these are systems that can just get better and better and better at doing whatever the
00:07:10.520 | job is we've asked them to do.
00:07:12.300 | Whatever we've asked that function to represent, it can learn a function that does a better
00:07:18.120 | and better job of representing that knowledge, whether that knowledge be estimating how well
00:07:23.440 | you're going to do in the world, the value function, whether it's going to be choosing
00:07:26.600 | what to do in the world, the policy, or whether it's understanding the world itself, what's
00:07:31.600 | going to happen next, the model.
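
As a minimal sketch of that idea, assuming PyTorch as the deep learning toolkit, here is a small network with a shared trunk, a policy head, and a value head; a model of the environment could be represented by a similar network that predicts the next observation and reward. The layer sizes and names are illustrative, not a description of any particular system.

```python
# Sketch: a neural network standing in for two of the agent's components,
# the policy (what to do) and the value function (how well it will do).
import torch
import torch.nn as nn


class AgentNetwork(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, num_actions)  # the policy
        self.value_head = nn.Linear(hidden, 1)             # the value function

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        action_logits = self.policy_head(h)
        value = self.value_head(h).squeeze(-1)
        return action_logits, value


net = AgentNetwork(obs_dim=4, num_actions=2)
logits, value = net(torch.zeros(1, 4))
print(logits.shape, value.shape)   # torch.Size([1, 2]) torch.Size([1])
```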
00:07:33.800 | - Nevertheless, the fact that neural networks are able to learn incredibly complex representations
00:07:41.400 | that allow you to do the policy, the model, or the value function is, at least to my mind,
00:07:48.480 | exceptionally beautiful and surprising.
00:07:52.480 | Is it surprising, was it surprising to you?
00:07:55.420 | Can you still believe it works as well as it does?
00:07:57.880 | Do you have good intuition about why it works at all and works as well as it does?
00:08:05.800 | - I think, let me take two parts to that question.
00:08:09.400 | I think it's not surprising to me that the idea of reinforcement learning works, because
00:08:17.480 | in some sense, I feel it's the only thing which can work, ultimately, and so I feel we have
00:08:25.640 | to address it, and success must be possible, because we have examples of intelligence,
00:08:31.560 | and it must at some level be possible to acquire experience and use that experience
00:08:37.520 | to do better in a way which is meaningful in environments of the complexity that humans
00:08:43.560 | can deal with.
00:08:44.560 | It must be.
00:08:46.120 | Am I surprised that our current systems can do as well as they can do?
00:08:50.600 | I think one of the big surprises for me and a lot of the community is really the fact
00:08:58.360 | that deep learning can continue to perform so well, despite the fact that the functions these
00:09:09.120 | neural networks are representing have these incredibly non-linear, kind of bumpy surfaces,
00:09:14.720 | which to our kind of low-dimensional intuitions make it feel like, surely, you're just going
00:09:19.840 | to get stuck, and learning will get stuck, because you won't be able to make any further
00:09:24.200 | progress.
00:09:25.280 | And yet, the big surprise is that learning continues, and these, what appear to be local
00:09:32.600 | optima turn out not to be, because in high dimensions, when we make really big neural
00:09:36.400 | nets, there's always a way out, and there's a way to go even lower, and then you're still
00:09:42.040 | not in a local optimum because there's some other pathway that will take you out and take
00:09:45.240 | you lower still.
00:09:46.680 | And so no matter where you are, learning can proceed and do better and better and better
00:09:51.920 | without bound.
00:09:53.680 | And so that is a surprising and beautiful property of neural nets, which I find elegant
00:10:03.060 | and beautiful and somewhat shocking that it turns out to be the case.
00:10:07.640 | - As you said, which I really like, to our low-dimensional intuitions, that's surprising.
00:10:15.040 | - Yeah.
00:10:16.240 | We're very tuned to working within a three-dimensional environment, and so to start to visualize
00:10:23.240 | what a billion-dimensional neural network surface that you're trying to optimize over,
00:10:29.920 | what that even looks like, is very hard for us.
00:10:32.720 | And so I think that really, if you try to account for essentially the AI winter where
00:10:41.400 | people gave up on neural networks, I think it's really down to that lack of ability to
00:10:46.920 | generalize from low dimensions to high dimensions.
00:10:49.980 | Because back then we were in the low-dimensional case.
00:10:52.560 | People could only build neural nets with 50 nodes in them or something.
00:10:58.160 | And to imagine that it might be possible to build a billion-dimensional neural net and
00:11:02.640 | it might have a completely different, qualitatively different property, was very hard to anticipate.
00:11:07.960 | And I think even now we're starting to build the theory to support that.
00:11:13.240 | And it's incomplete at the moment, but all of the theory seems to be pointing in the
00:11:16.880 | direction that indeed this is an approach which truly is universal, both in its representational
00:11:22.400 | capacity, which was known, but also in its learning ability, which is surprising.
00:11:28.080 | - It makes one wonder what else we're missing due to our low-dimensional intuitions that
00:11:35.520 | will seem obvious once it's discovered.
00:11:38.240 | - I often wonder, when we one day do have AIs which are superhuman in their abilities
00:11:47.400 | to understand the world, what will they think of the algorithms that we developed back now?
00:11:55.600 | Will they look back at these days and feel
00:12:03.560 | that these algorithms were naive first steps, or will they still be the fundamental ideas
00:12:07.960 | which are used even in 10,000 or 100,000 years?
00:12:11.800 | It's hard to know.
00:12:14.720 | - They'll look back at this conversation with a smile, maybe a little bit of a laugh.
00:12:22.040 | - My sense is, I think, just like when we used to think that the sun revolved around
00:12:31.320 | the earth, they'll see our systems of today, reinforcement learning, as too complicated.
00:12:38.280 | That the answer was simple all along.
00:12:41.160 | There's something, just like you said in the game of Go, and I love the systems of cellular
00:12:47.040 | automata, where there are simple rules from which incredible complexity emerges.
00:12:52.760 | So it feels like there might be some very simple approaches, just as Sutton
00:12:58.480 | says, right?
00:13:00.760 | These simple methods, with compute, over time seem to prove to be the most effective.
00:13:07.280 | - I 100% agree.
00:13:08.480 | I think that if we try to anticipate what will generalize well into the future, I think
00:13:17.640 | it's likely to be the case that it's the simple, clear ideas which will have the longest legs
00:13:23.240 | and which will carry us furthest into the future.
00:13:25.640 | Nevertheless, we're in a situation where we need to make things work today.
00:13:29.880 | And sometimes that requires putting together more complex systems where we don't have the
00:13:34.840 | full answers yet as to what those minimal ingredients might be.
00:13:37.720 | [END]