
What is Deep Reinforcement Learning? (David Silver, DeepMind) | AI Podcast Clips


Chapters

0:00 What is reinforcement learning
1:12 Ambitious problem definition
3:27 Fundamental idea
6:00 Deep reinforcement learning

Transcript

00:00:00.000 | If it's okay, can we take a step back and ask the basic question of what is, to you,
00:00:08.040 | reinforcement learning?
00:00:09.040 | So reinforcement learning is the study and the science and the problem of intelligence
00:00:18.040 | in the form of an agent that interacts with an environment.
00:00:22.160 | So the problem you're trying to solve is represented by some environment, like the world in which
00:00:25.400 | that agent is situated.
00:00:27.400 | And the goal of RL is clear, that the agent gets to take actions.
00:00:32.320 | Those actions have some effect on the environment and the environment gives back an observation
00:00:35.720 | to the agent saying, you know, this is what you see or sense.
00:00:39.520 | And one special thing which it gives back is called the reward signal, how well it's
00:00:43.320 | doing in the environment.
00:00:44.800 | And the reinforcement learning problem is to simply take actions over time so as to
00:00:51.240 | maximize that reward signal.
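
To make this interaction loop concrete, here is a minimal sketch in Python. It is not taken from the conversation: it assumes a Gymnasium-style environment ("CartPole-v1") and a placeholder random agent, and simply runs the action, observation, reward cycle described above while accumulating the reward signal.

```python
# Minimal sketch of the agent-environment loop described above.
# Assumes the Gymnasium API; the environment and agent are illustrative.
import gymnasium as gym


class RandomAgent:
    """Placeholder agent that picks actions uniformly at random."""

    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, observation):
        return self.action_space.sample()


env = gym.make("CartPole-v1")          # the environment the agent is situated in
agent = RandomAgent(env.action_space)

observation, info = env.reset()
total_reward = 0.0
done = False
while not done:
    action = agent.act(observation)    # the agent takes an action
    observation, reward, terminated, truncated, info = env.step(action)
    total_reward += reward             # the reward signal: how well it is doing
    done = terminated or truncated

print(f"Return for this episode: {total_reward}")
```

A learning agent would differ only in what happens inside `act` and in how it uses the observed rewards; the loop itself is the reinforcement learning problem in miniature.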
00:00:54.060 | So a couple of basic questions.
00:00:57.920 | What types of RL approaches are there?
00:01:00.560 | So I don't know if there's a nice, brief, in-words way to paint a picture of sort of
00:01:06.880 | value-based, model-based, policy-based reinforcement learning.
00:01:12.120 | Yeah.
00:01:13.120 | So now if we think about, okay, so there's this ambitious problem definition of RL.
00:01:18.600 | It's really, you know, it's truly ambitious.
00:01:20.040 | It's trying to capture and encircle all of the ways in which an agent interacts with
00:01:23.640 | an environment and say, well, how can we formalize and understand what it means to crack that?
00:01:28.740 | Now let's think about the solution method.
00:01:30.200 | Well, how do you solve a really hard problem like that?
00:01:32.920 | Well, one approach you can take is to decompose that very hard problem into pieces that work
00:01:39.960 | together to solve that hard problem.
00:01:42.760 | And so you can kind of look at the decomposition that's inside the agent's head, if you like,
00:01:47.400 | and ask, well, what form does that decomposition take?
00:01:50.480 | And some of the most common pieces that people use when they're kind of putting this
00:01:54.480 | solution method together are whether
00:01:59.000 | or not that solution has a value function.
00:02:01.560 | That means, is it trying to predict, explicitly trying to predict how much reward it will
00:02:05.320 | get in the future?
00:02:06.320 | Does it have a representation of a policy?
00:02:09.480 | That means something which is deciding how to pick actions.
00:02:12.420 | Is that decision-making process explicitly represented?
00:02:16.000 | And is there a model in the system?
00:02:18.680 | Is there something which is explicitly trying to predict what will happen in the environment?
00:02:23.260 | And so those three pieces are, to me, some of the most common building blocks.
00:02:29.200 | And I understand the different choices in RL as choices of whether or not to use those
00:02:35.720 | building blocks when you're trying to decompose the solution.
00:02:39.340 | Should I have a value function represented?
00:02:40.960 | Should I have a policy represented?
00:02:43.400 | Should I have a model represented?
00:02:45.160 | And there are combinations of those pieces.
00:02:46.920 | And of course, other things that you could add into the picture as well.
00:02:49.880 | But those three fundamental choices give rise to some of the branches of RL with which we're
00:02:54.240 | very familiar.
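
As a rough illustration of the decomposition described here, the three building blocks can be written as three small interfaces. This is a sketch, not anything stated in the conversation; the class and method names are illustrative, and a concrete agent might represent any subset of them (value-based, policy-based, model-based, or combinations such as actor-critic).

```python
# Illustrative interfaces for the three common building blocks of an RL agent.
from abc import ABC, abstractmethod


class ValueFunction(ABC):
    """Explicitly predicts how much reward will be received in the future."""

    @abstractmethod
    def value(self, state) -> float:
        ...


class Policy(ABC):
    """Explicitly represents the decision-making process: how to pick actions."""

    @abstractmethod
    def act(self, state):
        ...


class Model(ABC):
    """Explicitly predicts what will happen next in the environment."""

    @abstractmethod
    def predict(self, state, action):
        """Return a predicted (next_state, reward)."""
        ...
```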
00:02:55.320 | And so those, as you mentioned, there is a choice of what's specified or modeled explicitly.
00:03:04.160 | And the idea is that all of these are somehow implicitly learned within the system.
00:03:10.140 | So it's almost a choice of how you approach a problem.
00:03:15.120 | Do you see those as fundamental differences or are these almost like small specifics,
00:03:22.160 | like the details of how you solve the problem, but they're not fundamentally different from
00:03:25.600 | each other?
00:03:27.520 | I think the fundamental idea is maybe at the higher level.
00:03:32.520 | The fundamental idea is the first step of the decomposition is really to say, well,
00:03:39.020 | how are we really going to solve any kind of problem where you're trying to figure out
00:03:43.080 | how to take actions?
00:03:44.080 | And just from this stream of observations, you've got some agent situated in its sensory
00:03:48.040 | motor stream and getting all these observations in, getting to take these actions.
00:03:52.160 | And what should it do?
00:03:53.160 | How can you even broach that problem?
00:03:54.160 | Maybe the complexity of the world is so great that you can't even imagine how to build a
00:03:59.400 | system that would understand how to deal with that.
00:04:02.440 | And so the first step of this decomposition is to say, well, you have to learn.
00:04:06.160 | The system has to learn for itself.
00:04:08.720 | And so note that the reinforcement learning problem doesn't actually stipulate that you
00:04:12.960 | have to learn.
00:04:13.960 | You could try to maximize your rewards without learning; you just wouldn't do a very
00:04:17.400 | good job of it.
00:04:19.040 | So learning is required because it's the only way to achieve good performance in any sufficiently
00:04:24.400 | large and complex environment.
00:04:27.120 | So that's the first step.
00:04:28.640 | And so that step gives commonality to all of the other pieces, because now you might
00:04:32.440 | ask, well, what should you be learning?
00:04:35.440 | What does learning even mean?
00:04:36.440 | You know, in this sense, learning might mean, well, you're trying to update the parameters
00:04:42.240 | of some system, which is then the thing that actually picks the actions.
00:04:48.040 | And those parameters could be representing anything.
00:04:50.040 | They could be parameterizing a value function or a model or a policy.
00:04:55.180 | And so in that sense, there's a lot of commonality in that whatever is being represented there
00:04:58.840 | is the thing which is being learned, and it's being learned with the ultimate goal of maximizing
00:05:02.880 | rewards.
00:05:05.240 | But the way in which you decompose the problem is really what gives the semantics to the
00:05:09.080 | whole system.
00:05:10.080 | Are you trying to learn something to predict well, like a value function or a model?
00:05:15.080 | Are you learning something to perform well, like a policy?
00:05:18.760 | And the form of that objective is kind of giving the semantics to the system.
00:05:23.000 | And so it really is, at the next level down, a fundamental choice.
00:05:26.760 | And we have to make those fundamental choices as system designers, or enable our algorithms
00:05:32.680 | to be able to learn how to make those choices for themselves.
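
To make "learning means updating the parameters of the system that picks the actions" concrete, here is a toy sketch, again not from the conversation: a softmax policy over a three-armed bandit, with a REINFORCE-style update that nudges the parameters in the direction that makes rewarding actions more probable. The bandit rewards, step size, and iteration count are made up for illustration.

```python
# Toy example: the parameters being learned here parameterize a policy,
# and they are updated with the single goal of maximizing reward.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # hidden average reward of each action
theta = np.zeros(3)                       # the parameters being learned
alpha = 0.1                               # step size

for _ in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()       # softmax policy
    action = rng.choice(3, p=probs)
    reward = rng.normal(true_means[action], 0.1)      # feedback from the "environment"

    # Gradient of log pi(action) with respect to theta for a softmax policy.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0

    # Move the parameters so that actions which earned reward become more likely.
    theta += alpha * reward * grad_log_pi

print("Learned action preferences:", theta)
```

Swapping in a value function or a model changes what the parameters represent and what the update targets are, but the same structure, parameters updated in service of reward, carries over.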
00:05:36.020 | - So then the next step you mentioned, the very first thing you have to deal with is,
00:05:42.720 | can you even take in this huge stream of observations and do anything with it?
00:05:48.240 | So the natural next basic question is, what is deep reinforcement learning?
00:05:55.000 | And what is this idea of using neural networks to deal with this huge incoming stream?
00:06:01.260 | - So amongst all the approaches for reinforcement learning, deep reinforcement learning is one
00:06:07.160 | family of solution methods that tries to utilize powerful representations that are offered
00:06:16.720 | by neural networks to represent any of these different components of the solution, of the
00:06:23.800 | agent, like whether it's the value function or the model or the policy.
00:06:28.440 | The idea of deep learning is to say, well, here's a powerful toolkit that's so powerful
00:06:33.160 | that it's universal in the sense that it can represent any function and it can learn any
00:06:37.880 | function.
00:06:38.880 | And so if we can leverage that universality, that means that whatever we need to represent
00:06:44.440 | for our policy or for our value function or for our model, deep learning can do it.
00:06:48.560 | So deep learning is one approach that offers us a toolkit that has no ceiling to
00:06:55.240 | its performance, that as we start to put more resources into the system, more memory and
00:07:00.160 | more computation and more data, more experience, more interactions with the environment, that
00:07:07.280 | these are systems that can just get better and better and better at doing whatever the
00:07:10.520 | job is we've asked them to do.
00:07:12.300 | Whatever we've asked that function to represent, it can learn a function that does a better
00:07:18.120 | and better job of representing that knowledge, whether that knowledge be estimating how well
00:07:23.440 | you're going to do in the world, the value function, whether it's going to be choosing
00:07:26.600 | what to do in the world, the policy, or whether it's understanding the world itself, what's
00:07:31.600 | going to happen next, the model.
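
As a minimal sketch of that idea, assuming PyTorch as the deep learning toolkit, here is a small network with a shared trunk, a policy head, and a value head; a model of the environment could be represented by a similar network that predicts the next observation and reward. The layer sizes and names are illustrative, not a description of any particular system.

```python
# Sketch: a neural network standing in for two of the agent's components,
# the policy (what to do) and the value function (how well it will do).
import torch
import torch.nn as nn


class AgentNetwork(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, num_actions)  # the policy
        self.value_head = nn.Linear(hidden, 1)             # the value function

    def forward(self, obs: torch.Tensor):
        h = self.trunk(obs)
        action_logits = self.policy_head(h)
        value = self.value_head(h).squeeze(-1)
        return action_logits, value


net = AgentNetwork(obs_dim=4, num_actions=2)
logits, value = net(torch.zeros(1, 4))
print(logits.shape, value.shape)   # torch.Size([1, 2]) torch.Size([1])
```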
00:07:33.800 | - Nevertheless, the fact that neural networks are able to learn incredibly complex representations
00:07:41.400 | that allow you to do the policy, the model, or the value function is, at least to my mind,
00:07:48.480 | exceptionally beautiful and surprising.
00:07:52.480 | Is it surprising, was it surprising to you?
00:07:55.420 | Can you still believe it works as well as it does?
00:07:57.880 | Do you have good intuition about why it works at all and works as well as it does?
00:08:05.800 | - I think, let me take two parts to that question.
00:08:09.400 | I think it's not surprising to me that the idea of reinforcement learning works, because
00:08:17.480 | in some sense, I feel it's the only thing which can work, ultimately, and so I feel we have
00:08:25.640 | to address it, and success must be possible, because we have examples of intelligence,
00:08:31.560 | and it must at some level be possible to acquire experience and use that experience
00:08:37.520 | to do better in a way which is meaningful in environments of the complexity that humans
00:08:43.560 | can deal with.
00:08:44.560 | It must be.
00:08:46.120 | Am I surprised that our current systems can do as well as they can do?
00:08:50.600 | I think one of the big surprises for me and a lot of the community is really the fact
00:08:58.360 | that deep learning can continue to perform so well, despite the fact that the functions these
00:09:09.120 | neural networks are representing have these incredibly non-linear, kind of bumpy surfaces,
00:09:14.720 | which to our kind of low-dimensional intuitions make it feel like, surely, you're just going
00:09:19.840 | to get stuck, and learning will get stuck, because you won't be able to make any further
00:09:24.200 | progress.
00:09:25.280 | And yet, the big surprise is that learning continues, and these, what appear to be local
00:09:32.600 | optima turn out not to be, because in high dimensions, when we make really big neural
00:09:36.400 | nets, there's always a way out, and there's a way to go even lower, and then you're still
00:09:42.040 | not in a local optimum because there's some other pathway that will take you out and take
00:09:45.240 | you lower still.
00:09:46.680 | And so no matter where you are, learning can proceed and do better and better and better
00:09:51.920 | without bound.
00:09:53.680 | And so that is a surprising and beautiful property of neural nets, which I find elegant
00:10:03.060 | and beautiful and somewhat shocking that it turns out to be the case.
00:10:07.640 | - As you said, which I really like, to our low-dimensional intuitions, that's surprising.
00:10:15.040 | - Yeah.
00:10:16.240 | We're very tuned to working within a three-dimensional environment, and so to start to visualize
00:10:23.240 | what a billion-dimensional neural network surface that you're trying to optimize over,
00:10:29.920 | what that even looks like, is very hard for us.
00:10:32.720 | And so I think that really, if you try to account for essentially the AI winter where
00:10:41.400 | people gave up on neural networks, I think it's really down to that lack of ability to
00:10:46.920 | generalize from low dimensions to high dimensions.
00:10:49.980 | Because back then we were in the low-dimensional case.
00:10:52.560 | People could only build neural nets with 50 nodes in them or something.
00:10:58.160 | And to imagine that it might be possible to build a billion-dimensional neural net and
00:11:02.640 | it might have a completely different, qualitatively different property, was very hard to anticipate.
00:11:07.960 | And I think even now we're starting to build the theory to support that.
00:11:13.240 | And it's incomplete at the moment, but all of the theory seems to be pointing in the
00:11:16.880 | direction that indeed this is an approach which truly is universal, both in its representational
00:11:22.400 | capacity, which was known, but also in its learning ability, which is surprising.
00:11:28.080 | - It makes one wonder what else we're missing due to our low-dimensional intuitions that
00:11:35.520 | will seem obvious once it's discovered.
00:11:38.240 | - I often wonder, when we one day do have AIs which are superhuman in their abilities
00:11:47.400 | to understand the world, what will they think of the algorithms that we developed back now?
00:11:55.600 | Will they look back at these days and feel
00:12:03.560 | that these algorithms were naive first steps, or will they still be the fundamental ideas
00:12:07.960 | which are used even in 10,000 or 100,000 years?
00:12:11.800 | It's hard to know.
00:12:14.720 | - They'll look back at this conversation with a smile, maybe a little bit of a laugh.
00:12:22.040 | - My sense is, I think, just like when we used to think that the sun revolved around
00:12:31.320 | the earth, they'll see our systems of today, reinforcement learning, as too complicated.
00:12:38.280 | That the answer was simple all along.
00:12:41.160 | There's something, just like you said in the game of Go, and I love the systems of cellular
00:12:47.040 | automata, where there are simple rules from which incredible complexity emerges.
00:12:52.760 | So it feels like there might be some very simple approaches, just as Sutton
00:12:58.480 | says, right?
00:13:00.760 | These simple methods, with compute, over time seem to prove to be the most effective.
00:13:07.280 | - I 100% agree.
00:13:08.480 | I think that if we try to anticipate what will generalize well into the future, I think
00:13:17.640 | it's likely to be the case that it's the simple, clear ideas which will have the longest legs
00:13:23.240 | and which will carry us furthest into the future.
00:13:25.640 | Nevertheless, we're in a situation where we need to make things work today.
00:13:29.880 | And sometimes that requires putting together more complex systems where we don't have the
00:13:34.840 | full answers yet as to what those minimal ingredients might be.
00:13:37.720 | [END]