
What is Deep Reinforcement Learning? (David Silver, DeepMind) | AI Podcast Clips


Chapters

0:00 What is reinforcement learning
1:12 Ambitious problem definition
3:27 Fundamental idea
6:00 Deep reinforcement learning

Transcript

If it's okay, can we take a step back and ask the basic question of what is, to you, reinforcement learning? So reinforcement learning is the study and the science and the problem of intelligence in the form of an agent that interacts with an environment. So the problem you're trying to solve is represented by some environment, like the world in which that agent is situated.

And the core loop of RL is clear: the agent gets to take actions. Those actions have some effect on the environment, and the environment gives back an observation to the agent saying, you know, this is what you see or sense. And one special thing it gives back is called the reward signal: how well it's doing in the environment.
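To make that loop concrete, here is a minimal sketch in Python of the interaction just described: an agent takes actions, the environment responds with an observation and a reward signal, and total reward is what the agent ultimately cares about. The toy corridor environment and all names here are illustrative assumptions, not anything referenced in the conversation.

```python
import random

class CorridorEnv:
    """Toy environment: the agent starts at position 0 and gets +1 reward for reaching position 5."""
    def reset(self):
        self.pos = 0
        return self.pos                        # observation = current position

    def step(self, action):                    # action: -1 (left) or +1 (right)
        self.pos = max(0, self.pos + action)
        reward = 1.0 if self.pos == 5 else 0.0
        done = self.pos == 5
        return self.pos, reward, done          # observation, reward signal, episode end

def run_episode(env, choose_action, max_steps=100):
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = choose_action(obs)            # agent acts
        obs, reward, done = env.step(action)   # environment responds
        total_reward += reward                 # the special reward signal
        if done:
            break
    return total_reward

# A purely random agent; the RL problem is to take actions that do better than this over time.
print(run_episode(CorridorEnv(), lambda obs: random.choice([-1, +1])))
```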

And the reinforcement learning problem is to simply take actions over time so as to maximize that reward signal. So a couple of basic questions. What types of RL approaches are there? I don't know if there's a nice, brief way to paint a picture in words of value-based, model-based, and policy-based reinforcement learning.

Yeah. So now if we think about, okay, so there's this ambitious problem definition of RL. It's really, you know, it's truly ambitious. It's trying to capture and encircle all of the things in which an agent interacts with an environment and say, well, how can we formalize and understand what it means to crack that?

Now let's think about the solution method. Well, how do you solve a really hard problem like that? Well, one approach you can take is to decompose that very hard problem into pieces that work together to solve that hard problem. And so you can kind of look at the decomposition that's inside the agent's head, if you like, and ask, well, what form does that decomposition take?

And some of the most common pieces that people use when they're putting this system, the solution method, together are whether or not that solution has a value function. That means: is it explicitly trying to predict how much reward it will get in the future?

Does it have a representation of a policy? That means something which is deciding how to pick actions. Is that decision-making process explicitly represented? And is there a model in the system? Is there something which is explicitly trying to predict what will happen in the environment? And so those three pieces are, to me, some of the most common building blocks.

And I understand the different choices in RL as choices of whether or not to use those building blocks when you're trying to decompose the solution. Should I have a value function represented? Should I have a policy represented? Should I have a model represented? And there are combinations of those pieces.

And of course, other things that you could add into the picture as well. But those three fundamental choices give rise to some of the branches of RL with which we're very familiar. So, as you mentioned, there is a choice of what's specified or modeled explicitly, and the idea is that all of these are somehow implicitly learned within the system.
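As a rough illustration of those three building blocks, the sketch below gives each one a minimal concrete form for a small discrete problem like the corridor above. The tabular value function, the epsilon-greedy policy, and the hand-written model are assumptions chosen for brevity; a real agent may represent any subset of them, learned rather than hard-coded.

```python
import random
from collections import defaultdict

# 1. Value function: predicts how much future reward a state will yield.
value = defaultdict(float)                   # state -> estimated return

# 2. Policy: decides how to pick actions in each state.
def policy(state, actions=(-1, +1), epsilon=0.1):
    if random.random() < epsilon:                        # occasionally explore
        return random.choice(actions)
    # Otherwise exploit: pick the action whose (assumed) next state looks most valuable.
    return max(actions, key=lambda a: value[max(0, state + a)])

# 3. Model: predicts what the environment will do next (here written by hand
#    to mirror the toy corridor; normally it would itself be learned).
def model(state, action):
    next_state = max(0, state + action)      # predicted next observation
    predicted_reward = 1.0 if next_state == 5 else 0.0
    return next_state, predicted_reward
```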

So it's almost a choice of how you approach a problem. Do you see those as fundamental differences or are these almost like small specifics, like the details of how you solve the problem, but they're not fundamentally different from each other? I think the fundamental idea is maybe at the higher level.

The fundamental idea is that the first step of the decomposition is really to say, well, how are we really going to solve any kind of problem where you're trying to figure out how to take actions just from this stream of observations? You've got some agent situated in its sensory-motor stream, getting all these observations in and getting to take these actions.

And what should it do? How can you even broach that problem? Maybe the complexity of the world is so great that you can't even imagine how to build a system that would understand how to deal with that. And so the first step of this decomposition is to say, well, you have to learn.

The system has to learn for itself. And note that the reinforcement learning problem doesn't actually stipulate that you have to learn. You could try to maximize your rewards without learning; it just wouldn't do a very good job of it. So learning is required, because it's the only way to achieve good performance in any sufficiently large and complex environment.

So that's the first step. And so that step gives commonality to all of the other pieces, because now you might ask, well, what should you be learning? What does learning even mean? You know, in this sense, learning might mean, well, you're trying to update the parameters of some system, which is then the thing that actually picks the actions.

And those parameters could be representing anything. They could be parameterizing a value function or a model or a policy. And so in that sense, there's a lot of commonality in that whatever is being represented there is the thing which is being learned, and it's being learned with the ultimate goal of maximizing rewards.

But the way in which you decompose the problem is really what gives the semantics to the whole system. Are you trying to learn something to predict well, like a value function or a model? Are you learning something to perform well, like a policy? And the form of that objective is kind of giving the semantics to the system.
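One concrete instance of "learning something to predict well" is the standard temporal-difference update for a value function, shown below for the tabular value table sketched earlier. Using TD(0) here is my own choice of example; the conversation itself stays at the conceptual level.

```python
def td_update(value, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Nudge value[state] toward the bootstrapped target reward + gamma * value[next_state]."""
    target = reward + gamma * value[next_state]
    value[state] += alpha * (target - value[state])   # move the prediction toward the target
```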

And so it really is, at the next level down, a fundamental choice. And we have to make those fundamental choices as system designers, or enable our algorithms to learn how to make those choices for themselves. - So then the next step you mentioned, the very first thing you have to deal with is, can you even take in this huge stream of observations and do anything with it?

So the natural next basic question is, what is deep reinforcement learning? And what is this idea of using neural networks to deal with this huge incoming stream? - So amongst all the approaches for reinforcement learning, deep reinforcement learning is one family of solution methods that tries to utilize powerful representations that are offered by neural networks to represent any of these different components of the solution, of the agent, like whether it's the value function or the model or the policy.

The idea of deep learning is to say, well, here's a powerful toolkit that's so powerful that it's universal in the sense that it can represent any function and it can learn any function. And so if we can leverage that universality, that means that whatever we need to represent for our policy or for our value function or for our model, deep learning can do it.

So deep learning is one approach that offers us a toolkit that has no ceiling to its performance: as we start to put more resources into the system, more memory and more computation and more data, more experience, more interactions with the environment, these are systems that can just get better and better and better at doing whatever job we've asked them to do.

Whatever we've asked that function to represent, it can learn a function that does a better and better job of representing that knowledge, whether that knowledge be estimating how well you're going to do in the world, the value function, whether it's going to be choosing what to do in the world, the policy, or whether it's understanding the world itself, what's going to happen next, the model.
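As an illustration of what it means for a neural network to represent one of these components, here is a minimal policy-network sketch. PyTorch is used purely as one common toolkit, and the layer sizes and dimensions are arbitrary assumptions: the network maps an observation to a distribution over actions, and its parameters are what would be trained to maximize reward.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """A small neural network standing in for the policy component."""
    def __init__(self, obs_dim=4, num_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, obs):
        return torch.softmax(self.net(obs), dim=-1)   # probabilities over actions

policy = PolicyNet()
obs = torch.randn(1, 4)                               # a fake observation, for illustration
action = torch.multinomial(policy(obs), num_samples=1)  # sample an action from the policy
print(action.item())
```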

- Nevertheless, the fact that neural networks are able to learn incredibly complex representations that allow you to do the policy, the model, or the value function is, at least to my mind, exceptionally beautiful and surprising. Is it surprising, was it surprising to you? Can you still believe it works as well as it does?

Do you have good intuition about why it works at all and works as well as it does? - Let me take two parts to that question. I think it's not surprising to me that the idea of reinforcement learning works, because in some sense I feel it's the only thing which can work, ultimately, and so I feel we have to address it. And success must be possible, because we have examples of intelligence: it must at some level be possible to acquire experience and use that experience to do better, in a way which is meaningful, in environments of the complexity that humans can deal with.

It must be. Am I surprised that our current systems can do as well as they can do? I think one of the big surprises, for me and a lot of the community, is really the fact that deep learning can continue to perform so well despite the fact that these neural networks are optimizing over incredibly non-linear, kind of bumpy surfaces, which to our kind of low-dimensional intuitions make it feel like, surely, you're just going to get stuck, and learning will get stuck, because you won't be able to make any further progress.

And yet the big surprise is that learning continues, and what appear to be local optima turn out not to be, because in high dimensions, when we make really big neural nets, there's always a way out, there's a way to go even lower, and then you're still not in a local optimum because there's some other pathway that will take you out and take you lower still.

And so no matter where you are, learning can proceed and do better and better and better without bound. And so that is a surprising and beautiful property of neural nets, which I find elegant and beautiful and somewhat shocking that it turns out to be the case. As you said, which I really like, to our low-dimensional intuitions, that's surprising.

Yeah. We're very tuned to working within a three-dimensional environment, and so trying to visualize what a billion-dimensional neural network surface that you're trying to optimize over even looks like is very hard for us. And so I think that if you try to account for essentially the AI winter, where people gave up on neural networks, it's really down to that lack of ability to generalize from low dimensions to high dimensions.

Because back then we were in the low-dimensional case. People could only build neural nets with 50 nodes in them or something. And to imagine that it might be possible to build a billion-dimensional neural net and it might have a completely different, qualitatively different property, was very hard to anticipate.

And I think even now we're starting to build the theory to support that. And it's incomplete at the moment, but all of the theory seems to be pointing in the direction that indeed this is an approach which truly is universal, both in its representational capacity, which was known, but also in its learning ability, which is surprising.

- It makes one wonder what else we're missing due to our low-dimensional intuitions that will seem obvious once it's discovered. - I often wonder, when we one day do have AIs which are superhuman in their abilities to understand the world, what will they think of the algorithms that we developed back now?

Will they look back at these days and feel that these algorithms were naive first steps, or will they still be the fundamental ideas which are used even in 10,000 or 100,000 years? It's hard to know. - They'll watch this conversation back with a smile, maybe a little bit of a laugh.

- My sense is, I think, just like when we used to think that the sun revolved around the earth, they'll see our systems of today, reinforcement learning, as too complicated, and that the answer was simple all along. There's something, just like you said with the game of Go, and I love the systems of cellular automata, where there are simple rules from which incredible complexity emerges.

So it feels like there might be some very simple approaches, just like what Sutton says, right? These simple methods, with compute, over time seem to prove to be the most effective. - I 100% agree. I think that if we try to anticipate what will generalize well into the future, it's likely to be the case that it's the simple, clear ideas which will have the longest legs and which will carry us furthest into the future.

Nevertheless, we're in a situation where we need to make things work today. And sometimes that requires putting together more complex systems where we don't have the full answers yet as to what those minimal ingredients might be.