Deep Reinforcement Learning (John Schulman, OpenAI)
Chapters
0:00
0:48 What is Reinforcement Learning?
3:04 Business Operations
6:34 How Does RL Relate to Other ML Problems?
10:07 How Does RL Relate to Other Machine Learning Problems?
14:18 Recent Success Stories in Deep RL
20:57 Episodic Setting
21:56 Parameterized Policies
23:45 Policy Gradient Methods: Overview
28:05 Score Function Gradient Estimator: Intuition
31:33 Score Function Gradient Estimator for Policies (now the random variable is a whole trajectory)
35:01 Policy Gradient: Use Temporal Structure
37:01 Policy Gradient: Introduce Baseline
39:04 Discounts for Variance Reduction
44:34 Extension: Step Sizes and Trust Regions
57:52 Value Functions
00:00:06.080 |
So I'm going to talk about some of the core methods in deep reinforcement learning. 00:00:16.120 |
First I'll do a brief introduction to what deep RL is and whether it might make sense for your problem, and then I'll go through two families of methods. 00:00:28.940 |
So on the one hand we have the policy gradient methods. 00:00:33.240 |
Then on the other hand we have methods that learn a Q function, including Q-learning and fitted Q iteration. 00:00:40.960 |
And I'll talk a little at the end about what are the pros and cons of these different methods. 00:00:51.300 |
So what is reinforcement learning? It's a branch of machine learning concerned with taking sequences of actions. 00:00:55.740 |
So often it's described in terms of an agent interacting with a previously unknown environment. 00:01:04.340 |
And it's trying to maximize some kind of cumulative reward, defined by some reward function. 00:01:12.340 |
And pretty much any kind of task where you have some kind of goal that you want to achieve can be framed as a reinforcement learning problem. 00:01:27.700 |
So what is deep reinforcement learning? It's just reinforcement learning where you're using neural networks as function approximators. 00:01:34.820 |
So the interesting thing about reinforcement learning in contrast to supervised learning 00:01:39.780 |
is it's actually not totally obvious what you should use your neural network to approximate 00:01:46.140 |
And there are different kinds of algorithms that approximate different things. 00:01:49.300 |
So one choice is to use the neural network to approximate your policy, which is how the agent chooses its actions. 00:01:57.020 |
Another choice is to approximate the value functions, which measure how good or bad different states and actions are. 00:02:05.300 |
And last, you can try to learn a model of the system, a dynamics model, which will make predictions about what happens next. 00:02:16.060 |
Okay, so I'll now give a few examples of different places where you might apply reinforcement 00:02:23.900 |
learning and what the observations and actions would be. 00:02:32.020 |
So here you could imagine a robot where the observations are the camera images and the joint angles. 00:02:39.180 |
The actions are the joint torques you're applying. 00:02:42.580 |
And the reward is going to depend on what you want the robot to do. 00:02:48.500 |
So this is something we, as the algorithm designer, get to define. 00:02:53.220 |
So the rewards could be to stay balanced, to navigate to some target location, or something along those lines. 00:03:04.060 |
So reinforcement learning has also been used in a lot of more practical applications. 00:03:10.980 |
Well, applications that have been practical in the past. 00:03:14.460 |
I think robotics will be very practical in the future. 00:03:18.860 |
And for example, one area is inventory management. 00:03:25.460 |
So this is just one example of how you could use reinforcement learning for a decision-making problem. 00:03:30.740 |
So you have to decide how much to stock up on of every item. 00:03:37.020 |
And your observations would be your current inventory levels. 00:03:40.780 |
Actions would be how much of each item you're going to purchase. 00:03:47.780 |
So people in operations research, this is a subfield, study this kind of problem a lot. 00:03:58.580 |
There are also a lot of machine learning problems where people have started to apply reinforcement learning, for example attention. 00:04:10.020 |
So the idea in attention is you don't want to look at the whole input at once. 00:04:15.620 |
So one example of this is with a large image, you might want to just crop out part of it 00:04:21.500 |
and use that and just do detection on that part of the image. 00:04:25.800 |
So here, your observation would be your current image window. 00:04:30.500 |
Action is where to look or where to crop your image. 00:04:33.780 |
And reward is whether you make a classification error or not. 00:04:39.900 |
So here, you have to try to choose the right area of the image to look at so you'll do the classification correctly. 00:04:51.700 |
Reinforcement learning has also been used in structured prediction problems, which in 00:04:58.700 |
the past often weren't considered to be reinforcement learning problems. 00:05:03.340 |
But it turns out that to properly solve them, it actually is a reinforcement learning problem. 00:05:10.980 |
So machine translation, for example, so you get a sentence in the source language and 00:05:18.660 |
you have to emit a sentence in the target language. 00:05:22.180 |
And here, your observations are the sentence in the source language. 00:05:27.540 |
You're emitting one word at a time in the target language. 00:05:30.620 |
And you have some reward function that looks at the whole sentence and tells you how good the translation is. 00:05:37.020 |
So since this is non-differentiable, you can't just differentiate through the whole thing with backpropagation. 00:05:44.500 |
So it turns out you can use a policy gradient method to optimize your translation system. 00:05:56.060 |
So those are just a few examples, not exhaustive at all. 00:06:02.180 |
But I just want to say a little bit about how reinforcement learning fits into the picture 00:06:11.980 |
of all the other types of machine learning problems. 00:06:15.860 |
So previous courses in this series have talked about supervised learning and unsupervised 00:06:24.660 |
So how does reinforcement learning relate to them? 00:06:29.260 |
So let's just first compare it to-- let's look at supervised learning. 00:06:33.900 |
So in supervised learning, first, the environment samples an input-output pair from some distribution 00:06:42.540 |
The agent makes a prediction, y hat, using its function f. 00:06:48.980 |
And it receives some loss, which tells it if it made the right prediction or the wrong 00:06:55.300 |
So the interpretation is environment asks the agent a question and then tells her the right answer. 00:07:03.180 |
So contextual bandits make this problem a little harder in that they give the learner less information. 00:07:11.700 |
So now the environment samples an input, but notice that there's not a correct output associated with it. 00:07:19.180 |
Then the agent takes an action, and the agent receives some cost, which is from some probability 00:07:30.940 |
We're sampling it from some probability distribution. 00:07:34.180 |
And the agent doesn't know what this probability distribution is, so that's what makes the 00:07:40.700 |
So environment asks the agent a question, the agent answers, and the environment gives her a noisy score for her answer. 00:07:54.320 |
So personalized recommendations is one big application of this setting, along with advertising. 00:07:58.220 |
So you have a customer, 00:08:05.460 |
and you know what they liked in the past, so you have to make a prediction about what they'll like next. 00:08:10.620 |
So you show them appropriate ads or links, like what book you want to try to advertise 00:08:17.580 |
to them, or what video you want to show them, and so on. 00:08:22.920 |
So here, the big difference between this and the supervised learning setting is you don't 00:08:26.500 |
have access to the function, the loss function you're trying to optimize. 00:08:30.320 |
So in particular, you can't differentiate through it. 00:08:33.100 |
We don't know the process that generates c, so we can't compute the gradient of the loss 00:08:37.740 |
function and use that to tune the agent's parameters. 00:08:41.580 |
So that makes the problem a bit harder, where you have to use a different kind of method. 00:08:48.860 |
Lastly, reinforcement learning is almost the same as the contextual bandit setting, except now the world is stateful. 00:08:57.940 |
So now, instead of sampling the input from scratch every time step from the same distribution, the state evolves over time. 00:09:09.020 |
So you have some transition probability distribution called p here, where the state x sub t is 00:09:16.100 |
conditioned on the previous state and the previous action. 00:09:22.500 |
And that makes the problem quite a bit harder, because now -- well, for a number of reasons. 00:09:27.920 |
For one thing, the inputs you're getting depend on what actions you're taking. 00:09:32.120 |
So now that makes it harder to develop a stable, reliable algorithm, because now as the agent changes its policy, the distribution of inputs it sees changes. 00:09:40.280 |
So that can lead to all sorts of out-of-control behavior. 00:09:46.380 |
And it also means you have delayed effects, because since the system is stateful, you 00:09:52.280 |
might need to take a lot of actions to get into the right state. 00:09:55.600 |
So you might need to -- you can't just act greedily every time step. 00:10:06.180 |
Okay, so just to summarize these differences, there are two differences. 00:10:11.160 |
The first one is, you don't have full analytic access to the function you're trying to optimize. 00:10:18.700 |
Second, you're interacting with a stateful world, which means that the input you get depends on your previous actions. 00:10:26.620 |
And if you just take the first of those differences between supervised learning and reinforcement 00:10:31.620 |
learning, you get the contextual bandit setting. 00:10:36.740 |
Okay, so I realize that this audience probably has people with different goals. 00:10:44.820 |
Some people are doing research and want to know about what's the latest in the research 00:10:50.540 |
And some people are -- want to apply these machine learning techniques to practical applications. 00:10:56.420 |
So this slide is for the latter group of people. 00:11:00.580 |
So suppose you have some problem where you think reinforcement learning 00:11:05.260 |
might be relevant and you're wondering if you should apply reinforcement learning. 00:11:10.020 |
So first, I should say that the answer might be no. 00:11:14.880 |
It might be overkill, especially deep reinforcement learning. 00:11:18.360 |
So this is a set of fairly new techniques where it's not going to work out of the box on your problem. 00:11:24.360 |
And it's -- these techniques aren't that well established. 00:11:27.920 |
So they have a lot of knobs that need to be tuned. 00:11:30.620 |
So it might be overkill and, yeah, these techniques aren't that well established at the moment. 00:11:37.180 |
So it might be worth investigating some other methods first. 00:11:42.180 |
So if your problem has a small number of parameters you're trying to optimize over 00:11:48.800 |
and you have a simulator that you can, like, just do lots of experiments on, then derivative-free 00:11:56.940 |
optimization methods are likely to be better than reinforcement learning, or at least likely to be a better thing to try first. 00:12:04.240 |
So with these methods, you give them a black box function where 00:12:09.480 |
you put in a parameter vector and it'll give you a noisy estimate of the score. 00:12:14.540 |
And these algorithms will just optimize over the parameters of that black-box function. 00:12:22.180 |
So yeah, there's a variety of different methods for derivative-free optimization, but these 00:12:27.980 |
are easier to understand than reinforcement learning, and they do kind of work out of the box. 00:12:35.660 |
A lot of problems can actually be seen as contextual bandit problems, and 00:12:41.580 |
the statefulness of the world isn't that relevant. 00:12:44.420 |
So for example, in advertising, this is where people -- people look at advertising as a 00:12:50.260 |
contextual bandit problem most of the time, because you decide what ad to present the 00:12:54.420 |
user with, and then they either click on it or they don't. 00:13:00.220 |
But the user really is kind of stateful, because if you show them a terrible ad, they might not come back. 00:13:07.460 |
So there is -- like, your actions do have some repercussions. 00:13:12.760 |
But often you can just approximate it as being a contextual bandit problem where there is no state. 00:13:18.640 |
So there's a better theoretical understanding of contextual bandit problems, and methods 00:13:26.420 |
that work well for them. So if your problem really is a contextual bandit problem, you might want to use those methods. 00:13:34.100 |
And lastly, the operations research field has been using these methods for a while on 00:13:43.000 |
real problems, and they have a set of methods which are just pretty much the basic algorithms, 00:13:51.480 |
policy iteration and value iteration, together with well-developed 00:13:57.320 |
ways of doing feature engineering for these problems that end up working pretty decently. 00:14:01.560 |
So these techniques are also worth considering instead of trying to throw a big neural network at the problem. 00:14:11.300 |
So now that I've talked about why not to use deep reinforcement 00:14:15.200 |
learning or what it's not good for, I'll just talk about some recent success stories in 00:14:21.200 |
deep reinforcement learning, which are achievements that probably wouldn't have been possible without deep neural networks. 00:14:27.920 |
So a few years ago, there was a pretty influential result by Mnih et al. from DeepMind where they 00:14:37.640 |
used a deep Q-learning algorithm to play Atari games using the screen images as input. 00:14:47.800 |
And that's hard because you're trying to do different 00:14:51.920 |
things in all these games, and some of them are kind of complicated. 00:14:54.620 |
So it's pretty remarkable that a simple algorithm can learn to play all of these games. 00:15:04.260 |
So since then, people have also solved this domain using policy gradients. 00:15:15.640 |
So another big groundbreaking result was beating a champion-level player at Go, also by DeepMind, 00:15:26.500 |
using a combination of supervised learning from, like, from expert games, plus policy 00:15:33.540 |
gradients to fine-tune the supervised learning policy, plus Monte Carlo tree search, plus 00:15:41.220 |
value functions to make the search work better. 00:15:44.340 |
So a combination of techniques and reinforcement learning. 00:15:48.040 |
So some of my colleagues at Berkeley had some very nice results learning in real 00:15:56.080 |
time how to do manipulation tasks using an algorithm called guided policy search. 00:16:06.920 |
And some of my colleagues and I have been working on robotic locomotion using policy gradient methods. 00:16:17.080 |
And people have been working on locomotion for a while and have been able to achieve 00:16:21.720 |
pretty good results using very, like, highly engineered domain-specific methods. 00:16:27.320 |
But previously, there hadn't been much success using general methods to solve it. 00:16:34.640 |
And last, there have been some recent results playing 3D games using policy gradients. 00:16:40.800 |
In fact, there was even a contest I heard about a couple days ago with this new VizDoom environment. 00:16:54.160 |
So that's it for the high-level overview part of this. 00:16:58.680 |
Now I'm going to start getting into the actual formalism and the technical details. 00:17:04.640 |
Okay, so the basic object in the field of reinforcement learning is the Markov decision 00:17:15.120 |
So the Markov decision process is defined by the following components. 00:17:21.740 |
First, the state space: these are all the different states of the system. 00:17:25.040 |
The action space, these are all the actions the agent can take. 00:17:28.600 |
And you have this probability distribution, which determines the probability of the next state and reward given the current state and action. 00:17:37.120 |
So R is the reward, S prime is the next state, and S and A are the current state and action. 00:17:42.200 |
So it's a conditional probability distribution. 00:17:44.200 |
Sometimes people split this out into a separate reward function, but that's basically an equivalent 00:17:52.300 |
And sometimes there are some extra objects defined. 00:17:59.440 |
We'll consider an initial state distribution. 00:18:02.020 |
So this says which state the world starts out in. 00:18:07.180 |
And the typical optimization problem you want to solve given this MDP is to maximize expected cumulative reward. 00:18:14.380 |
So there are various ways of defining that more precisely, which I'll go into later. 00:18:24.260 |
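To make the MDP objects above concrete, here is a minimal sketch (not from the lecture) of a tabular MDP with made-up sizes, splitting the transition distribution into a transition table P and an expected-reward table R, which the speaker notes is an equivalent formulation:

```python
import numpy as np

# Hypothetical sizes for illustration: nS states, nA actions.
nS, nA = 4, 2
rng = np.random.default_rng(0)

P = rng.random((nS, nA, nS))
P /= P.sum(axis=-1, keepdims=True)   # P[s, a, :] is a distribution over next states
R = rng.random((nS, nA))             # expected reward for each (state, action) pair
mu = np.ones(nS) / nS                # initial state distribution
gamma = 0.95                         # discount factor (used later)
```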
So there are various different settings of reinforcement learning where you define the objective a bit differently. 00:18:31.660 |
The one we'll be most concerned with is called the episodic setting. 00:18:35.600 |
So here the agent's experience is split up into a series of episodes which have finite length. 00:18:44.200 |
So in each episode, we first sample the initial state of the world from some probability distribution 00:18:52.060 |
And then the agent keeps on acting until the world ends up in some terminal state. 00:19:00.880 |
So just to give an example of what terminal states might be like and how an episodic reinforcement learning problem might be set up. 00:19:09.780 |
So one example is when termination is good and you want to terminate the episode as fast as possible. 00:19:16.820 |
So if we imagine setting up a task with some kind of taxi robot that should get to the 00:19:21.420 |
destination as fast as possible, then the episode would be like one trip and it's trying 00:19:28.360 |
to terminate the episode as fast as possible. 00:19:33.300 |
Another example is a waiter robot where you have a fixed-length shift, and the waiter 00:19:39.380 |
has to accumulate, it has to do as well as possible during that shift. 00:19:46.000 |
The waiter has to say maximize tips or customer happiness. 00:19:52.840 |
And then you could imagine another kind of task where termination is bad and you want the episode to keep going as long as possible. 00:20:03.740 |
For example, you could imagine having a walking robot where you want it to walk as far as possible, and falling over ends the episode. 00:20:15.940 |
And in this setting, it's pretty easy to define what the goal is. 00:20:21.620 |
We just want to maximize the expectation of the total reward per episode. 00:20:29.820 |
And the last object we're going to introduce here is a policy. 00:20:34.300 |
So the policy is just the function that the agent uses to choose its actions. 00:20:40.220 |
So we have deterministic policies, where the policy is denoted by pi. 00:20:45.380 |
So the action is just some function of the state. 00:20:48.780 |
And we also have stochastic policies where the policy is a conditional probability distribution. 00:20:57.180 |
So we're just going to make a little bit more precise the setting of the episodic MDP. 00:21:04.920 |
So first we sample the initial state from this distribution, mu. 00:21:11.820 |
Then we sample the first action, a0, from the policy. 00:21:16.780 |
Then we sample next state and reward from the transition probability distribution, and 00:21:21.460 |
so on, until we reach a terminal state, s sub t. 00:21:25.080 |
And then the quantity we care about is the sum of all these rewards, r0 + r1 + ... up until the end of the episode. 00:21:36.560 |
So eta of pi is just defined as the expected total reward of the policy pi. 00:21:46.100 |
Here's the picture that illustrates exactly the same thing. 00:21:56.340 |
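As a sketch of how eta(pi) can be estimated in practice, here is a simple Monte Carlo routine; the gym-style env.reset()/env.step() interface and the policy(s) callable are assumptions for illustration, not part of the lecture:

```python
import numpy as np

def estimate_eta(env, policy, n_episodes=100):
    """Monte Carlo estimate of eta(pi) = E[total reward per episode]."""
    returns = []
    for _ in range(n_episodes):
        s, done, total = env.reset(), False, 0.0
        while not done:
            s, r, done = env.step(policy(s))   # assumed step() signature
            total += r
        returns.append(total)
    return np.mean(returns)
```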
And lastly, in the policy gradient section in particular, we're going to be interested in parameterized policies. 00:22:03.420 |
So here we have a parameter vector, theta, which specifies exactly what the policy is. 00:22:11.020 |
So for example, the family of policies could be just a neural net. 00:22:15.760 |
You have a certain neural network architecture. 00:22:18.020 |
And theta specifies all the weights of this neural network. 00:22:24.100 |
So we could have a deterministic policy, of course, or a stochastic policy. 00:22:30.560 |
And if you're wondering concretely what a policy would look like, I mean, how do you use a neural network as a policy? 00:22:38.100 |
You do exactly the same thing you would do if this were a classification or regression problem. 00:22:44.500 |
So S here, the state here is your input, and the action is your output. 00:22:51.580 |
So if you have a discrete action space, a discrete set of actions, then you would use 00:22:58.220 |
a network that outputs a vector of probabilities, the probabilities of the different actions. 00:23:06.900 |
And if you have a continuous action space, you would have your neural network output 00:23:11.500 |
the mean and the diagonal of a covariance matrix of a Gaussian distribution. 00:23:17.220 |
So this is just like you're doing regression. 00:23:19.780 |
So you can use the same kind of architectures you'd use in supervised learning. 00:23:25.100 |
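Here is a sketch, using PyTorch, of the two kinds of policy heads just described; the architecture sizes are arbitrary and not from the talk:

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """Discrete actions: the network outputs a probability for each action."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))
    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

class GaussianPolicy(nn.Module):
    """Continuous actions: the network outputs a mean, and a learned log-std
    gives the diagonal of the covariance of a Gaussian."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))
    def forward(self, s):
        return torch.distributions.Normal(self.mean_net(s), self.log_std.exp())
```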
Okay, so that's it for the formalism of MDPs. 00:23:32.760 |
So now I'm going to go into policy gradient methods, which are one broad and general class 00:23:39.200 |
of reinforcement learning methods, which are quite effective. 00:23:45.260 |
So to give a brief overview of this, here's the intuition of what policy gradient methods do. 00:23:54.800 |
So here, capital R means the sum of rewards of the whole episode. 00:23:59.780 |
So our optimization problem is we want to maximize the expectation of the total reward 00:24:05.960 |
given our parameterized policy, pi sub theta. 00:24:10.680 |
And the intuition of how our algorithm is going to work is we're going to collect a bunch of trajectories. 00:24:16.880 |
I mean, this is just run a bunch of episodes using our policy. 00:24:21.160 |
And then we want to make the good trajectories more probable. 00:24:24.320 |
So I mean, some of the trajectories were lucky and they were really good. 00:24:28.120 |
Some of them, the agent was unlucky and they were bad. 00:24:30.980 |
And the ones that were good, meaning there was high reward, that means the agent probably did something right. 00:24:38.040 |
So we want to increase the probability of the actions from those trajectories. 00:24:44.260 |
So the most basic version of policy gradient methods just try to make the good trajectories 00:24:48.920 |
more probable without trying to figure out which were the good actions and which were the bad ones. 00:24:54.360 |
Slightly better or more elaborate methods try to figure out which actions were the good ones. 00:25:01.320 |
And then they try to make the good actions more probable. 00:25:05.800 |
And lastly, there's another class of methods which actually try to push the actions towards better actions. 00:25:13.160 |
So they differentiate the loss function with respect to the actions, and they try to push the actions in the direction that increases reward. 00:25:19.880 |
So we're mostly going to talk about one and two here. 00:25:34.300 |
So while we're maximizing over the policy, we're trying to find the best policy. 00:25:41.440 |
But here, the policy is assumed to be parameterized. 00:25:44.880 |
So there's some parameter vector theta that specifies the policy. 00:25:48.220 |
And now we just want to maximize with respect to theta. 00:25:59.000 |
So there's a very fundamental concept, which is called the score function gradient estimator. 00:26:10.860 |
So actually, to introduce this, we're not going to talk about policies in RL at all. 00:26:15.120 |
We're just going to assume we have some expectation. 00:26:18.320 |
We have expectation of f of x, where x is sampled from some parameterized probability distribution. 00:26:27.320 |
So we want to compute the gradient of this expectation with respect to theta. 00:26:32.340 |
So there's a general formula that'll do this. 00:26:35.440 |
And the way you derive it is you just write the expectation as an integral. 00:26:44.240 |
You swap the integral with the derivative and you turn it back into an expectation. 00:26:50.040 |
And what you get at the end is this bottom line, which says that you take the expectation 00:26:55.040 |
of function value times grad log probability. 00:27:00.360 |
So this is an unbiased estimator of the gradient, meaning if we get enough samples, it'll converge to the true gradient. 00:27:08.660 |
So the way you can compute this estimator, meaning the way you can get a noisy estimate 00:27:13.780 |
of the gradient of the expectation, is you just get one sample from this distribution. 00:27:22.900 |
And then you multiply f of x times grad log probability. 00:27:29.820 |
So the only requirement for being able to use this estimator is we need to be able to compute the log probability, log p(x given theta). 00:27:38.360 |
We need to be able to analytically compute it. 00:27:40.900 |
And we need to be able to differentiate it with respect to theta. 00:27:51.100 |
There's another way of deriving it using importance sampling. 00:27:54.940 |
So you write down the importance sampling estimator for the expectation. 00:27:58.380 |
And then you just swap the derivative with the expectation and you get the same thing. 00:28:05.740 |
So now let me just give a little bit of intuition about this estimator. 00:28:13.060 |
So f of x is measuring how good the sample x is. 00:28:17.500 |
So that means that-- so g hat here is our gradient estimator, meaning this is what we 00:28:22.700 |
get if we take one sample x sub i and we compute our estimator. 00:28:30.220 |
So if we move in direction g hat, that pushes up the log probability of our sample x sub 00:28:39.920 |
So if we got a really good function value, then we're going to try to push up its log probability a lot. 00:28:46.940 |
And if it was a bad function value, then we're not going to try to push it up very much. 00:28:56.300 |
The really nice thing is this is valid even if f of x is discontinuous or if f of x is 00:29:02.780 |
unknown, meaning you only-- you don't get to differentiate it, you just get to see the 00:29:09.580 |
Or the sample space is a discrete set, so x doesn't even have to be continuous. 00:29:15.740 |
And this is quite remarkable, actually, that you don't even need to have access to the 00:29:21.780 |
full-- you don't need to know exactly what the function is that you're optimizing. 00:29:26.820 |
You just have to be able to query it for the function value. 00:29:32.500 |
And this means this is a way of being able to compute gradients through a system that contains non-differentiable pieces. 00:29:42.960 |
So for example, in robotic locomotion, one issue is that you have contacts between the feet and the ground. 00:29:51.540 |
And you make and break contact, and that causes a discontinuous change in the dynamics. 00:29:58.720 |
So that makes it really hard to use smooth optimization techniques to come up with the behavior you want. 00:30:03.820 |
So when you use this kind of gradient estimator, along with policy gradients, which I'm going 00:30:08.460 |
to talk about very soon, you can optimize the 00:30:15.220 |
system even though it has non-differentiable pieces in it. 00:30:22.340 |
So here's another little picture of what's going on. 00:30:27.900 |
So we have our function f of x, which we're trying to maximize the expectation of. 00:30:33.980 |
And then we have our probability density, p of x. 00:30:37.220 |
So we just sample a bunch of values from our probability density. 00:30:50.660 |
And we're trying to push the probability distribution so that the probability goes up at these samples 00:31:01.020 |
So over on the right side of the curve, that means we're trying to push that probability up hard. 00:31:08.540 |
And on the left side, we're pushing it up softly. 00:31:11.160 |
So what's going to happen is the probability density is going to slide to the right. 00:31:15.340 |
If you can imagine a sort of physical analogy there. 00:31:21.900 |
So that's the score function gradient estimator. 00:31:26.300 |
It can be used in various machine learning problems. 00:31:32.500 |
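Here is a tiny numerical sketch of the estimator on a toy problem (not an RL problem): x ~ N(theta, 1) and f(x) = x^2, so E[f(x)] = theta^2 + 1, the true gradient is 2*theta, and grad_theta log p(x|theta) = x - theta:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(theta, 1.0, size=100_000)   # samples x_i ~ p(x | theta)
f = x ** 2                                 # black-box function values f(x_i)
grad_log_p = x - theta                     # score function for a unit-variance Gaussian
g_hat = np.mean(f * grad_log_p)            # estimator of grad_theta E[f(x)]
print(g_hat, 2 * theta)                    # close to the exact value 3.0
```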
Now we're going to apply it to the reinforcement learning setting. 00:31:36.580 |
And we're going to take our random variable x to be a whole trajectory. 00:31:42.960 |
So the trajectory consists of state action reward, state action reward, and so on until 00:31:50.180 |
And now to get our gradient estimator, to get the gradient of the expected reward, all 00:31:59.180 |
we've got to do is compute the grad log probability times the total reward. 00:32:07.540 |
So this probability of the trajectory, that sounds like a really unfriendly quantity because 00:32:12.020 |
there's a long, complicated process that generates this trajectory with lots of time steps. 00:32:20.020 |
OK, so we can write out what this process is, what this probability density is. 00:32:27.000 |
So we have-- it's just a product of probabilities. 00:32:29.980 |
We've got our initial-- we've got our mu of s0, which is just our initial state distribution. 00:32:36.280 |
And then every time step, we have-- we sample the action according to pi. 00:32:40.700 |
And we sample the next state and reward according to our dynamics model. 00:32:53.100 |
So when we take the log and differentiate, anything that doesn't contain theta drops out. 00:32:57.300 |
So the thing is, we didn't know-- there are parts of this probability distribution, p 00:33:03.360 |
of tau given theta, that we don't have access to. 00:33:06.260 |
So if this is reinforcement learning, we don't assume that we know the dynamics model of the system. 00:33:11.500 |
We just find out about it by sampling-- by doing episodes. 00:33:19.980 |
So since this product turns into a sum of logs, all the pieces that don't depend on theta, like the log p terms and the log mu term, drop out when we take the gradient. 00:33:34.740 |
And what we get in the end is we get a sum of log probabilities of actions. 00:33:47.340 |
So our formula for the gradient of the expectation is just the 00:33:53.340 |
expectation over trajectories of the total reward of the trajectory times the gradient of the sum of log probabilities of the actions. 00:34:06.320 |
So the interpretation of this is we're taking our good trajectories and we're trying to 00:34:12.300 |
increase their probability in proportion to how good they are. 00:34:16.220 |
And you can think of this as being similar to supervised learning, where we treat the 00:34:20.620 |
good trajectories with high rewards as positive examples in our supervised learning problem. 00:34:26.660 |
So we're using those to train the policy on which actions are good. 00:34:30.940 |
We're basically treating those actions as positive examples. 00:34:35.820 |
OK, now we can improve this formula a little bit. 00:34:41.900 |
So that was just the most basic-- I mean, this is an unbiased estimate. 00:34:46.180 |
This is an estimator for the policy gradient. 00:34:47.940 |
So if we just take that expression inside the expectation on the right-hand side and 00:34:52.940 |
we take one sample of that, it has the right mean. 00:34:56.020 |
So if we just get enough of them, we're going to get the policy gradient. 00:35:02.580 |
But we can also write down some other formulas that have the same mean but have lower variance. 00:35:07.820 |
So we can come up with better estimators for the policy gradient. 00:35:11.620 |
And that's actually quite important because the one from the previous slide is really 00:35:15.060 |
bad when you have a large number of time steps, meaning it has really high variance. 00:35:22.500 |
So the first thing we can do is we can use the temporal structure of the problem. 00:35:28.900 |
By the way, to derive these next bunch of formulas, it just takes a bunch of really 00:35:33.300 |
straightforward manipulation where you move around expectations. 00:35:37.220 |
And I'm not going to go through all the math. 00:35:45.420 |
So we can repeat the same argument from the previous slide to derive the gradient estimator for a single reward term. 00:35:54.520 |
So we end up with that reward term times the gradient of the sum of log probs of the actions up to that time. 00:36:00.260 |
And just summing over that, we get a new formula where we're not multiplying the sum of the 00:36:07.460 |
grad log prob of the whole thing times the sum of all rewards. 00:36:15.220 |
Now we have a sum over time of grad log probability of the action at that time times the sum of future rewards. 00:36:24.400 |
So in the formula from the previous slide, we would have had all the rewards in that sum, including past ones. 00:36:32.980 |
And that kind of makes sense, because an action can't affect the rewards that came before it. 00:36:39.980 |
So to figure out if the action is good, we should only be looking at the future rewards. 00:36:47.340 |
So this is a slightly better formula than the one on the previous slide, meaning it 00:36:51.180 |
has the exact same mean except the expression inside the expectation there has lower variance. 00:37:01.340 |
And we can further reduce the variance by introducing a baseline. 00:37:06.500 |
So now we can take any old function, B, which takes in a state and it outputs a real number. 00:37:13.620 |
And we can subtract it from our sum of future rewards. 00:37:19.500 |
And we didn't affect the mean of the estimator at all. 00:37:24.180 |
So we didn't change the expectation at all by introducing this baseline. 00:37:33.260 |
So yeah, for any choice of B, this gives us an unbiased estimator. 00:37:36.980 |
By the way, if you're not that familiar with the terminology of estimators, what I'm saying 00:37:41.340 |
is we have an expectation on the right-hand side of that formula. 00:37:48.980 |
And the quantity inside that expectation is what's called the estimator. 00:37:52.900 |
And if we get a bunch of samples, then we can get an estimate of the thing on the left-hand side. 00:38:04.020 |
So when I say it's an unbiased estimator, that just means that this equation is correct, 00:38:09.740 |
meaning that the thing on the right-hand side equals the thing on the left-hand side. 00:38:14.740 |
So yeah, this works for any choice of baseline. 00:38:17.660 |
And a near-optimal choice is to use the expected return. 00:38:27.180 |
And the interpretation of that is if we took an action, we only want to increase its probability if it turned out better than expected. 00:38:38.660 |
Well, the sum of rewards after that action should have been better than expected. 00:38:43.840 |
So the B of S is the expected sum of rewards. 00:38:47.780 |
And we're just taking the difference between the measured thing and the expected thing. 00:38:56.560 |
So that was a pretty key thing for variance reduction. 00:39:03.120 |
And I'm going to introduce one last variance reduction technique. 00:39:07.120 |
And actually, all three of these are really important. 00:39:09.200 |
So basically, nothing's going to work, except for maybe really small-scale problems, unless you use these variance reduction techniques. 00:39:16.100 |
So the last variance reduction technique is to use discounts. 00:39:20.760 |
So the discount factor ignores delayed effects between actions and rewards. 00:39:27.800 |
So what we're going to do here looks kind of like a hack, but there's an explanation 00:39:32.400 |
for it, which is instead of taking the sum of rewards, we're going to take a discounted 00:39:38.720 |
Meaning that we add this exponential factor, gamma. 00:39:44.320 |
So when we're multiplying the grad log probability by some future reward, we multiply it by some power of gamma, depending on how far in the future that reward is. 00:39:55.560 |
So people typically use gamma equals 0.99 or gamma equals 0.95. 00:40:01.280 |
So that means if you use 0.99, after 100 time steps you're going to be reducing the weight of that reward by a factor of about 1/e. 00:40:14.240 |
So you're exponentially decaying the effect of the future rewards. 00:40:20.540 |
And the intuition is that the action shouldn't affect rewards really far in the future. 00:40:27.240 |
Like the assumption is that the system doesn't have really long-term memory and it sort of forgets its distant past. 00:40:40.500 |
So you can just ignore the interaction between action and a reward way in the future. 00:40:51.000 |
So now, instead of taking the baseline to be the expected sum of future rewards, we want it to be the expected discounted sum of future rewards. 00:40:56.560 |
So now we're measuring if the action was better than expected according to the discounted return. 00:41:06.600 |
And now there's a more general class of formulas that looks like the one that I just wrote. 00:41:10.580 |
So this one that's on the top of the slide is pretty good. 00:41:13.840 |
And this is almost as good as anything you're going to do to within a small constant factor. 00:41:19.880 |
But there's a more general class of formulas that look like grad log probability times 00:41:28.800 |
some quantity a hat, which we call the advantage estimate. 00:41:32.520 |
And this is in general just going to be an estimate of, it has a more precise definition, 00:41:39.480 |
which is how much was this action better than the average action taken by the policy. 00:41:48.720 |
But informally, this just means how much better was the action than expected. 00:41:55.240 |
And this formula makes a lot of sense because we want to increase the probability of the 00:41:58.680 |
good actions and decrease the probability of the bad ones. 00:42:02.000 |
So we should increase it in proportion to the goodness of the action. 00:42:10.360 |
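As a sketch of the quantities just described, here is how the discounted reward-to-go and a baseline-subtracted advantage estimate could be computed for one episode (a zero baseline stands in for a learned value function; the rewards are made up):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Reward-to-go: sum over t' >= t of gamma**(t' - t) * r_{t'}."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rewards = [0.0, 0.0, 1.0, 0.0, 1.0]        # rewards from one hypothetical episode
baseline = np.zeros(len(rewards))          # placeholder for a learned b(s_t)
advantages = discounted_returns(rewards) - baseline   # "better than expected" signal
```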
So just to summarize, so I just told you there's this gradient estimator, meaning there's this 00:42:14.760 |
expression you can compute, which gives you a noisy estimate of the policy gradient. 00:42:18.960 |
So how do you actually turn this into an algorithm? 00:42:32.680 |
You take your policy, you initialize your policy parameter and your baseline function. 00:42:40.160 |
For each iteration, you execute the current policy to get a bunch of whole episodes, meaning complete trajectories. 00:42:50.920 |
And each time step in each trajectory, you should compute the return, meaning the sum 00:42:56.140 |
of rewards following that time step, the sum of discounted rewards, and the advantage estimate, 00:43:01.120 |
which is the sum of discounted rewards minus the baseline. 00:43:05.800 |
Then you refit the baseline by trying to make the baseline function equal the returns. 00:43:12.360 |
And then you update the policy using a policy gradient estimator. 00:43:16.680 |
So you're just doing SGD while updating the baseline as you go along. 00:43:25.520 |
So that's the vanilla policy gradient algorithm. 00:43:30.200 |
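Here is a sketch of what one iteration of that loop might look like in PyTorch, assuming a discrete-action policy like the Categorical sketch earlier and a crude constant baseline (the mean return) instead of a learned one; it illustrates the update, not the speaker's implementation:

```python
import torch

def vanilla_pg_update(policy, optimizer, episodes, gamma=0.99):
    """One policy gradient step from a batch of episodes.

    `episodes` is a list of (states, actions, rewards) collected with the
    current policy; `policy(s)` returns a torch.distributions.Categorical.
    """
    logps, advs = [], []
    for states, actions, rewards in episodes:
        # discounted reward-to-go at each time step
        returns, running = [], 0.0
        for r in reversed(rewards):
            running = r + gamma * running
            returns.append(running)
        returns.reverse()
        dist = policy(torch.as_tensor(states, dtype=torch.float32))
        logps.append(dist.log_prob(torch.as_tensor(actions)))
        advs.append(torch.tensor(returns, dtype=torch.float32))
    logp, adv = torch.cat(logps), torch.cat(advs)
    adv = adv - adv.mean()            # constant baseline; a value net would go here
    loss = -(logp * adv).mean()       # minimizing -J(theta) is gradient ascent on J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```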
And this is, I'll briefly talk, this has been used to obtain some pretty good results. 00:43:37.040 |
So it's not that bad of an algorithm, but there are several different directions in which you can improve it. 00:43:47.440 |
So one issue that you run into is with step sizes. 00:43:52.760 |
So in supervised learning, step sizes aren't that big of a deal. 00:43:57.960 |
Because maybe you take too big of a step, but that's OK. 00:44:04.360 |
And your current function, your current classifier, for example, doesn't affect what inputs you're going to see next. 00:44:10.880 |
So even if you just are doing really, even if your network is just thrashing around for 00:44:16.800 |
a while because you're taking too big steps, that's not going to cause any problems. 00:44:22.800 |
But yeah, so step sizes aren't that big of a deal. 00:44:30.600 |
You can start off with a large step size and anneal them down to zero. 00:44:36.720 |
In reinforcement learning, if you take too big of a step, you might wreck your policy. 00:44:41.480 |
And even if you don't actually change the network that much, so you don't lose all your 00:44:45.640 |
nice features, you might just change its behavior a little too much. 00:44:50.680 |
And now it's going to do something totally different and visit a totally different part of the state space. 00:44:55.940 |
So since in reinforcement learning, the system is stateful and your state distribution depends 00:45:00.440 |
on your policy, that brings a really different problem. 00:45:06.760 |
And now, after you took that step, the next batch of data you're going to get was collected with this wrecked policy. 00:45:13.280 |
And now you're never going to recover because you just forgot everything. 00:45:19.440 |
So one way to fix this, which my colleagues and I have worked on, is to basically 00:45:29.920 |
try to stop the policy from taking too big of a step. 00:45:33.120 |
So you can look at the KL divergence between the old policy and the new policy, like before 00:45:39.680 |
the update and after the update, and make sure that the distributions aren't that different. 00:45:48.200 |
So my colleagues and I developed an algorithm called trust region policy optimization, which 00:45:53.320 |
looks at the action distributions and tries to make sure the KL divergence isn't too large. 00:46:00.040 |
And this is very closely related to previous natural policy gradient methods, which are 00:46:08.320 |
doing something similar, but usually it's not set up as a hard constraint on the KL divergence. 00:46:20.920 |
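For discrete actions, the kind of quantity such methods constrain can be sketched as the average KL divergence between the old and new action distributions over a batch of visited states; the threshold value below is a hypothetical example, not the paper's setting:

```python
import torch

def mean_kl(old_logits, new_logits):
    """Average KL(pi_old || pi_new) over a batch of states (discrete actions)."""
    old = torch.distributions.Categorical(logits=old_logits)
    new = torch.distributions.Categorical(logits=new_logits)
    return torch.distributions.kl_divergence(old, new).mean()

# e.g. accept or shrink an update depending on whether mean_kl(...) <= 0.01
```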
So another type of extension of policy gradient methods is to use value functions more heavily. 00:46:32.800 |
Instead of just using them as a baseline, you can use them more aggressively to reduce variance. 00:46:41.200 |
So I won't go into the details in this talk, but sometimes these are called actor-critic methods. 00:46:52.800 |
There's also another type of approach, which I briefly touched on in the earlier slide, 00:46:58.880 |
where instead of just trying to increase the probability of the good actions, you actually 00:47:03.040 |
differentiate your loss with respect to the actions. 00:47:05.880 |
This is like the reparameterization trick, which is used for density modeling and unsupervised learning. 00:47:15.060 |
So here you're trying to-- instead of just increasing the probability of the good actions, 00:47:19.560 |
you're trying to push the actions towards better actions. 00:47:23.880 |
And I'd say with both of these bullet points, you're potentially decreasing your variance a lot, but introducing some bias. 00:47:33.080 |
So it actually makes the algorithms a little harder to understand and to get them working, 00:47:38.840 |
because with high variance, if you just crank up the amount of data, you can always drive the error down. 00:47:46.800 |
But with bias, no matter how much data you get, you're not going to get rid of it. 00:47:51.800 |
So if your gradient is pointing in the wrong direction, then you're not going to learn the right thing. 00:47:57.560 |
OK, so now that's it for the policy gradient section of this talk. 00:48:05.360 |
So I wanted to show a quick video of some work that my colleagues and I did on learning 00:48:10.860 |
locomotion controllers with policy gradient methods, which I found pretty exciting. 00:48:26.660 |
So here, what we've got is a humanoid simulated-- let's see. 00:48:31.240 |
OK, yeah, it's a simulated humanoid robot in a realistic physics simulator. 00:48:40.320 |
And it has a neural network policy, which takes in the joint angles of the robot. 00:48:51.080 |
It's got joint velocities and also positions of the different links of the robot. 00:48:58.500 |
It's pretty much the raw state of the robot, like no clever feature engineering there. 00:49:04.160 |
And the output is going to be the joint torques, which are set 100 times a second. 00:49:09.620 |
So we're just mapping from joint angles to joint torques. 00:49:15.300 |
And we define a reward function, which is to move forward as fast as possible. 00:49:25.680 |
So the episode ends when its head goes below a certain height, meaning it fell over. 00:49:33.060 |
There is a little bit of tweaking for the reward function, but not too extensive. 00:49:39.480 |
So you can see, first it just falls forward a lot of times. 00:49:55.180 |
And then slowly it starts to develop a half-decent looking walk. 00:50:06.420 |
And at the very end of this, it could just keep running indefinitely. 00:50:11.020 |
So I think it was actually stable in a strong sense, meaning I could just leave it running for 15 minutes or more. 00:50:18.980 |
So here's another robot model that we just created without too much thought. 00:50:26.220 |
We just decided to put a bunch of legs on this thing. 00:50:30.700 |
So we don't even know how this thing is supposed to walk. 00:50:35.040 |
And just give it to the same algorithm and it just figures out some kind of crazy way 00:50:42.920 |
So that's the nice thing about reinforcement learning. 00:50:45.260 |
You don't even need to know what you want it to do. 00:50:48.620 |
I think the physics are also a little unrealistic here. 00:50:56.140 |
Here we set up, we used a similar model to the one in the first demo. 00:51:00.940 |
But here we just give it a reward for having its head at a certain height. 00:51:04.880 |
So there's a reward telling it to get its head up as high as possible. 00:51:09.060 |
And then it figures out how to get up off the ground. 00:51:31.060 |
OK, any questions about policy gradients before I move on to the next part? 00:51:57.300 |
So the question was, is the system time invariant? 00:52:05.140 |
And also that it doesn't change from one episode to the next. 00:52:10.180 |
Of course, in some real world problems that might not be the case. 00:52:13.020 |
So I think that's also an interesting problem setting where you have a non-stationary environment. 00:52:26.180 |
So the question was, for the baseline, to learn a good baseline, do you need to know 00:52:32.340 |
So no, you can just learn it by doing regression. 00:52:36.100 |
You just estimate the empirical returns and then you do regression to try to fit a function to them. 00:53:09.620 |
Yeah, so the question is, there's a discount factor which should cause the policy to disregard 00:53:18.100 |
any effects that are delayed by more than 100 time steps. 00:53:21.520 |
So how does it still work that this guy learns how to stand up, even though that might take longer than that? 00:53:32.300 |
And in fact, I would say that these methods aren't guaranteed to work well if you have such long delays between actions and rewards. 00:53:42.140 |
Often they work anyway, but there's no guarantee. 00:53:44.740 |
So I think there's actually something pretty fundamental missing in how to deal with really long time horizons. 00:53:50.660 |
And people have recently been thinking about hierarchical reinforcement learning, where 00:53:54.700 |
you have different levels of detail of the system, where you might have one level of 00:54:01.640 |
description with a small time step, and then you have successively coarser levels with longer time steps. 00:54:08.940 |
And that allows you to plan over much longer horizons. 00:54:12.940 |
So that's something that's currently an active area of research. 00:54:15.860 |
But yeah, I would say that none of these methods are guaranteed to do anything reasonable if 00:54:21.460 |
you have more than 1 over 1 minus gamma time steps between action and reward. 00:54:28.100 |
So if you introduce some anomaly in the environment, can it still handle it? 00:54:35.500 |
Oh yeah, so in this kind of task, if you introduce terrain or something, could it-- I think if 00:54:42.260 |
you didn't train it to deal with terrain, then it might fail. 00:54:47.700 |
Actually, I don't think it would fail, because the funny thing about-- these policies are 00:54:52.020 |
actually really robust, because you train them with a stochastic policy. 00:54:57.580 |
So there's a lot of noise being generated by the policy itself. 00:55:01.540 |
So in practice, it's able to deal with huge noise introduced by the policy. 00:55:08.860 |
And as a result, I found that if you change the dynamics parameters a little, it can usually 00:55:15.420 |
But yeah, there's no guarantee that it'll do anything if you give it something you didn't 00:55:20.780 |
I think that you probably could do the same kind of training with terrain. 00:55:25.540 |
I didn't have any terrain, so I didn't try it. 00:55:30.940 |
OK, I'm going to move on to the next part of the talk. 00:55:35.540 |
Feel free, if you have more questions, to find me afterwards. 00:55:42.500 |
OK, so now I'm going to talk about a different type of reinforcement learning algorithm. 00:55:54.100 |
So the previous methods are distinguished by the fact that they explicitly represent 00:56:03.140 |
a policy, which is the function that chooses your actions. 00:56:06.500 |
And they try to optimize it with respect to the parameters of the policy. 00:56:10.700 |
So the nice thing about the policy gradient methods we just talked about is that you're directly optimizing the objective you care about. 00:56:18.100 |
And you're optimizing it with gradient descent. 00:56:20.100 |
So that makes it kind of easy to understand what's going on. 00:56:23.880 |
Because if you're getting the proper gradient estimate, and you take small enough steps, things should keep improving. 00:56:29.540 |
I mean, of course, you could still get stuck in a local optimum, 00:56:32.740 |
or get stuck in a bad local optimum. 00:56:37.940 |
And you can use our understanding of optimization to figure out what's going on. 00:56:44.080 |
So these next class of methods are a little different, because they're not optimizing 00:56:50.260 |
They're learning something else called a Q function, which measures how good state-action pairs are. 00:56:56.860 |
So it measures-- I'll say that more formally later. 00:57:00.620 |
But it's just measuring how good the actions are. 00:57:04.020 |
And these methods are actually-- these are able to exactly solve MDPs efficiently in 00:57:15.700 |
the setting where you have a finite number of states and actions. 00:57:19.620 |
So these are the preferred methods for exactly solving them in those settings. 00:57:25.100 |
But you can also apply them with continuous states and actions, using expressive function approximators like neural networks. 00:57:37.500 |
But it's a little harder to understand what's going on in these methods, like when they're 00:57:41.860 |
going to work and when they're not going to work. 00:57:50.380 |
So the Q function is defined as the expected sum of rewards when we condition on the first state and action. 00:57:59.180 |
So we're conditioning on S0 equals S, A0 equals A. And the Q function is the expected 00:58:08.380 |
discounted sum of rewards when we're acting under the policy pi. 00:58:14.620 |
So by convention, I'm starting out with time step 0. 00:58:20.900 |
I could have also said that we're taking r_t + r_(t+1) + r_(t+2) and so on. 00:58:27.740 |
But since we're assuming the system is stationary, it should be exactly the same. 00:58:32.520 |
So just by convention, I'm going to say that the first-- I'm going to always use time 0, 00:58:36.900 |
1, 2, 3, and so on, just for ease of notation. 00:58:41.980 |
So the Q function is just telling you how good this state-action pair is under your policy. 00:58:48.020 |
The value function-- well, the state value function, usually called V, is just conditioning on the state. 00:58:57.820 |
It's telling you how good that state is, what's the expected reward of that state. 00:59:02.520 |
And lastly, the advantage function is the difference between the Q function and the 00:59:07.080 |
state value function, meaning how much better is that action than what the policy would have done by default. 00:59:14.180 |
We're not going to talk about advantage functions in this section, but this 00:59:18.020 |
corresponds to the notion of an advantage estimator we briefly mentioned in the previous section. 00:59:26.140 |
So here, we're going to consider methods that explicitly store and update the Q function 00:59:30.860 |
instead of the policy, and update it using what are called Bellman equations. 00:59:41.620 |
So a Bellman equation, in general, is a consistency equation that the value function has to satisfy. 00:59:51.100 |
So here, I'm writing down the Bellman equation for Q pi. 00:59:55.780 |
And what it's saying is that the expected sum of rewards should be the first reward 01:00:04.460 |
plus this expected sum of rewards after the first time step. 01:00:12.940 |
V pi of S1 is just adding up all the rewards after time step 0. 01:00:22.380 |
So in the second equation, we write out this relationship just involving the Q function. 01:00:29.380 |
So we have a consistency equation that the Q function should satisfy. 01:00:37.640 |
We can slightly generalize this to use k time steps instead of just one time step. 01:00:42.860 |
So we can expand out the expectation-- the expected sum of rewards to write out k rewards 01:00:49.920 |
explicitly and then cap it off with the value function at the very end, which accounts for all the rewards after that. 01:00:58.640 |
OK, so here's the Bellman equation from the previous slide. 01:01:03.620 |
So now I'm going to introduce a very important concept called the Bellman backup. 01:01:11.340 |
So we have this equation that the value function Q pi should satisfy. 01:01:24.880 |
So we define this Bellman backup operator that operates on an arbitrary Q function. 01:01:34.400 |
And it's defined by just taking the right-hand side of the Bellman equation and plugging in that arbitrary Q function. 01:01:50.280 |
So Q pi is a fixed point of this operator, meaning if we apply the backup operator to Q pi, we get Q pi back. 01:02:00.980 |
And very nicely, if we keep applying this backup operator repeatedly to any old arbitrary 01:02:08.600 |
initial Q function Q, the series will converge to Q pi, which is the fixed point of the operator. 01:02:15.640 |
So that way, you can use an iterative 01:02:27.200 |
algorithm to estimate Q pi by taking any old initial Q function and repeatedly applying the backup operator. 01:02:36.380 |
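Here is a sketch of that iterative scheme for a tabular MDP, using the P and R arrays from the earlier MDP sketch and a policy given as a table of action probabilities (all illustrative, not the lecture's code):

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.95, iters=500):
    """Estimate Q^pi by repeatedly applying the Bellman backup for a fixed policy.

    P: (nS, nA, nS) transition probabilities, R: (nS, nA) expected rewards,
    pi: (nS, nA) action probabilities for the policy being evaluated.
    """
    nS, nA = R.shape
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        V = (pi * Q).sum(axis=1)       # V^pi(s) = sum_a pi(a|s) Q(s, a)
        Q = R + gamma * (P @ V)        # backup applied to every (s, a) at once
    return Q
```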
So now there's another kind of Q function that we're going to introduce called Q star. 01:02:42.200 |
So the previous Q function, Q pi, was telling you the value function under a given policy. 01:02:50.220 |
So it only makes sense with regard to some particular fixed policy pi. 01:02:54.800 |
Q star is going to be-- is going to involve the optimal policy instead. 01:03:02.920 |
So Q star is just defined as the Q function of the optimal policy. 01:03:06.720 |
So here we have pi star, the optimal policy, and Q star is just the Q function of the optimal 01:03:12.800 |
And it also happens to be the pointwise maximum over all policies of the Q function at each state-action pair. 01:03:25.920 |
So the optimal policy is deterministic, and it should satisfy the equation that it takes the argmax of Q star over actions. 01:03:36.800 |
So recall that the Q function tells you your expected return if you take the given action. 01:03:43.340 |
So obviously the optimal policy should take the action that has the best expected return. 01:03:58.000 |
So now that we know this property of the optimal policy, we can rewrite the Bellman equation. 01:04:06.200 |
So on the-- that first equation is-- that's just the Bellman equation from the previous 01:04:12.880 |
Now we can take that expectation over actions and replace it by what the optimal policy 01:04:18.880 |
is going to do, which is just going to take the argmax of the optimal Q function. 01:04:26.180 |
That should say Q star inside-- on the right-hand side. 01:04:32.820 |
So now we have a Bellman equation that the optimal policy should satisfy. 01:04:39.340 |
Now we can do the same thing with the backup operator. 01:04:42.480 |
So we take that Bellman equation and we plug in an arbitrary Q function on the right-hand 01:04:49.300 |
side instead of the optimal Q function Q star. 01:04:54.900 |
So Q star is a fixed point of this Bellman operator. 01:04:59.880 |
That's just a restatement of the Bellman equation. 01:05:04.300 |
And again, if we apply this Bellman operator repeatedly to an arbitrary initial Q function, 01:05:11.340 |
it converges to Q star, which is the optimal Q function. 01:05:16.900 |
This is the Banach fixed point theorem in both cases. 01:05:26.460 |
So based on these ideas, there are two classic algorithms for exactly solving MDPs. 01:05:33.340 |
These are sometimes called dynamic programming algorithms because they're actually quite 01:05:36.740 |
related to the kind of dynamic programming algorithms that are used to solve search problems. 01:05:46.500 |
The first one is value iteration: you just initialize your Q function arbitrarily. 01:05:49.900 |
And you repeatedly do Bellman backups until it converges. 01:06:02.580 |
The second one is policy iteration. There, at each step, you first compute, either exactly or approximately, the Q function of the current policy. 01:06:12.880 |
And then you update your policy to be the greedy policy for the Q function you just 01:06:19.020 |
So that means that your new policy just takes the argmax of the Q function. 01:06:24.500 |
So it takes the action that's best according to that Q function. 01:06:31.700 |
So I didn't say anything about how you compute Q pi. 01:06:37.340 |
You can compute it exactly, because it happens that the Bellman equation for Q pi is a linear system of equations that you can solve. 01:06:46.820 |
More commonly, if you have a large-scale problem, you might not be able to solve this linear system exactly. 01:06:51.620 |
So what people often do is they do a finite number of Bellman backups, which doesn't 01:06:58.380 |
exactly converge on Q pi, but it gives you something that's approximately Q pi. 01:07:04.700 |
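As a sketch, tabular value iteration amounts to repeatedly applying the Bellman optimality backup to a Q table until it stops changing (again using the illustrative P and R arrays from the earlier sketch, not anything from the lecture):

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, iters=500):
    """Repeatedly apply T*Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')."""
    nS, nA = R.shape
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        Q = R + gamma * (P @ Q.max(axis=1))   # backup for every (s, a) at once
    greedy_policy = Q.argmax(axis=1)          # deterministic greedy policy
    return Q, greedy_policy
```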
Okay, so I just told you about algorithms that you can implement if you have full access to the MDP. 01:07:13.300 |
Like you know the whole table of probabilities. 01:07:16.180 |
But in reinforcement learning, usually the assumption is that you don't know any of these probabilities. 01:07:24.980 |
So all of these things have to be estimated from data, or you're only able to access the system through interaction. 01:07:32.500 |
So now it turns out that these algorithms can be also implemented if you only access 01:07:39.340 |
the system through interactions, which is kind of remarkable, I think. 01:07:45.260 |
So the way it works is, so let's recall our backup formulas for Q pi and Q star. 01:07:53.620 |
So, we can, in both cases, we have this certain quantity inside an expectation. 01:08:02.620 |
In both cases, we can compute an unbiased estimator of that quantity inside the expectation, 01:08:11.780 |
Meaning if we sampled some data from our system using any old policy, then we can get an unbiased 01:08:21.580 |
estimator of the quantity on the right-hand side of those expectations. 01:08:26.300 |
I mean the quantity inside of the right-hand expectations. 01:08:31.980 |
So basically we can do an approximate version of this Bellman backup, which is unbiased. 01:08:41.980 |
So we're doing a noisy version of the Bellman backup. 01:08:46.040 |
Even with this noise, it can be proven that if you choose your step sizes appropriately, 01:08:50.500 |
with the right schedule, you're still going to converge to Q pi or Q star, depending on which backup you're using. 01:09:01.460 |
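As a rough sketch of what one such sampled, noisy backup looks like in the tabular case (illustrative names; the transition comes from interacting with the system under any behavior policy):

    import numpy as np

    def q_learning_step(Q, s, a, r, s_next, alpha, gamma=0.99):
        """One noisy, sampled backup toward Q*, from a single transition (s, a, r, s').

        The target r + gamma * max_a' Q[s', a'] is an unbiased estimate of the quantity
        inside the expectation in the Q* backup. alpha is the step size, which should be
        decayed on an appropriate schedule for the classical convergence guarantees.
        """
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        return Q

For the Q pi backup instead, the max over next actions would be replaced by the value of the action the policy actually takes; that variant comes up again below as SARSA.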
Okay, so I'll say at this point that this is pretty much the fundamental idea. 01:09:08.700 |
And now you can come up with algorithms that can be applied in the reinforcement learning 01:09:15.840 |
setting where you're just accessing the system through sampling. 01:09:18.820 |
And you can also start introducing function approximation here. 01:09:23.100 |
So I haven't said anything about what the Q function is. 01:09:26.500 |
I've just told you it's a function of state and action. 01:09:29.460 |
But now we can start having neural network Q functions, for example. 01:09:36.120 |
So we can parameterize the Q function with the neural network. 01:09:41.960 |
And now, instead of doing the Bellman backup, I mean it doesn't make sense to do the Bellman 01:09:46.980 |
backup exactly because we're not just setting the values of the neural network output. 01:09:52.420 |
The best we can do is try to encourage the neural network to have some output values. 01:09:57.980 |
So the way we do this backup is we set up a least squares problem. 01:10:04.700 |
We write down a quadratic objective that says the Q function should approximately equal the backed-up values. 01:10:16.380 |
So one version of this algorithm, introduced about 10 years ago, is called neural fitted Q iteration, and it works like this. 01:10:25.320 |
You sample trajectories using your current policy, which might be determined by the Q 01:10:32.940 |
function or it could be any old policy as it turns out. 01:10:37.500 |
And then you solve the least squares problem, where you're trying to minimize the quadratic 01:10:46.340 |
error based on the Bellman backup, the backup operator I described earlier. 01:10:55.500 |
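A minimal sketch of that least-squares objective, assuming a PyTorch Q-network that maps observations to a vector of Q-values; all names here are illustrative rather than taken from the original neural fitted Q iteration paper.

    import torch
    import torch.nn as nn

    def fitted_q_loss(q_net, batch, gamma=0.99):
        """Least-squares objective for one round of fitted Q iteration (a sketch).

        batch: tensors s [N, obs_dim], a [N] (long), r [N], s_next [N, obs_dim], done [N].
        The regression targets are the backed-up values r + gamma * max_a' Q(s', a'),
        detached so the network is simply fit to them by least squares.
        """
        s, a, r, s_next, done = batch
        with torch.no_grad():
            # Backup target; zero out the future value at terminal states.
            target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for the taken actions
        return nn.functional.mse_loss(q_sa, target)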
So one thing I haven't mentioned so far is what do you actually use as your policy. 01:11:02.020 |
So I said to sample trajectories using your policy. 01:11:05.040 |
If you have a Q function, you can turn it into a policy by just taking the action with the highest Q value. 01:11:14.120 |
So the Q function measures the goodness of all your actions. 01:11:16.920 |
So you can easily turn that into a policy by taking your best action or by taking actions 01:11:22.280 |
where the log probability is proportional to the goodness or something like that. 01:11:28.780 |
So typically you might take the probability of an action to be the exponential of its Q value divided by some temperature parameter. 01:11:40.800 |
Whereas if you just take the argmax, that's called the greedy policy. 01:11:47.340 |
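For example, here are a few simple ways to turn the vector of Q-values for the current state into an action; the epsilon and temperature values are arbitrary placeholders.

    import numpy as np

    def greedy_action(q_values):
        """Greedy policy: take the action with the highest Q-value."""
        return int(np.argmax(q_values))

    def epsilon_greedy_action(q_values, epsilon=0.1):
        """With probability epsilon pick a random action, otherwise act greedily."""
        if np.random.rand() < epsilon:
            return int(np.random.randint(len(q_values)))
        return int(np.argmax(q_values))

    def softmax_action(q_values, temperature=1.0):
        """Log-probabilities proportional to Q / temperature (a 'Boltzmann' policy)."""
        logits = np.asarray(q_values, dtype=float) / temperature
        logits -= logits.max()  # shift for numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        return int(np.random.choice(len(q_values), p=probs))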
So it turns out that with these kinds of Q-learning algorithms, you don't have to execute the policy you're learning about. 01:11:57.440 |
You actually have some freedom in what policy you execute, which is one very 01:12:02.700 |
nice property of these algorithms: you can use an exploration technique where your 01:12:09.520 |
policy is actively trying to reach new states or do something new, and you still 01:12:16.240 |
converge, still move towards Q star or Q pi, as the case may be. 01:12:24.320 |
So that's neural fitted Q iteration -- a very basic way of doing this. 01:12:31.440 |
A more recent algorithm that's gotten a lot of attention is the DQN algorithm from Mnih 01:12:36.760 |
et al. from DeepMind, which is basically an online version of this algorithm with a couple of tricks. 01:12:47.840 |
But actually, when you look at the two tricks, they make a 01:12:52.960 |
lot of sense if you just think about what value iteration is doing. 01:12:56.420 |
So one technique is you use this replay pool where it's a rolling history of your past 01:13:03.280 |
data and that's just the data you're going to use to fit your Q function. 01:13:09.080 |
So that makes sure you have like a representative sample of data to fit your Q function to. 01:13:16.040 |
And the second idea is to use a target network. 01:13:21.440 |
So instead of using your current Q function and just doing Bellman backups on that, you 01:13:29.840 |
keep around a target network, which is a copy of your Q function from some earlier point in training. 01:13:35.920 |
So that also, if you think about value iteration, you have your old Q function and you're trying 01:13:41.440 |
to make the new one equal to the backed up version of the old one. 01:13:44.680 |
So using the target network is just sort of the natural thing to do if you're trying to approximate value iteration. 01:13:53.600 |
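Putting the replay buffer and target network together, the per-minibatch loss looks roughly like this; this is a sketch of the idea with illustrative names, not the exact DQN implementation, which has many additional details.

    import torch
    import torch.nn as nn

    def dqn_loss(q_net, target_net, batch, gamma=0.99):
        """TD loss on a replay-buffer minibatch, with targets from a frozen target network.

        target_net is a copy of q_net from an earlier point in training; in practice it is
        synchronized periodically, e.g. target_net.load_state_dict(q_net.state_dict())
        every K updates. batch: tensors s, a (long), r, s_next, done.
        """
        s, a, r, s_next, done = batch
        with torch.no_grad():
            next_q = target_net(s_next).max(dim=1).values
            target = r + gamma * (1.0 - done) * next_q
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        return nn.functional.smooth_l1_loss(q_sa, target)  # Huber-style loss, commonly used here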
And there have been many extensions proposed since then. 01:13:55.880 |
I've got a bunch of citations at the bottom of the slide. 01:14:00.360 |
So this algorithm, the DQN algorithm, is using the backup B, which is the backup for Q*. 01:14:11.920 |
Remember that I also introduced this other backup, B pi, which is the backup for Q pi. 01:14:18.560 |
So there's another algorithm, like a very classic algorithm called SARSA, which is an 01:14:23.580 |
online way of doing the B pi backup, essentially. 01:14:28.640 |
Well, it's sort of an online version of policy iteration. 01:14:35.280 |
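For contrast with the Q-learning update above, here is a minimal sketch of the tabular SARSA update, where the backup target uses the action the behavior policy actually took in the next state (illustrative, not from the talk):

    def sarsa_step(Q, s, a, r, s_next, a_next, alpha, gamma=0.99):
        """One sampled B^pi backup (a sketch, assuming a tabular Q array).

        The target uses Q[s', a'], where a' is the action the policy actually took next,
        instead of the max over next actions used in the Q-learning update.
        """
        target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])
        return Q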
So it's actually been found to work as well as, or better than, DQN -- that is, better than using the Q star backup. 01:14:44.980 |
So I think the jury's still out on exactly how these things compare. 01:14:51.480 |
But I think it's worth considering both policy iteration and value iteration and all the 01:14:58.040 |
different online versions of these algorithms and taking them seriously. 01:15:01.520 |
Because it's not clear right now exactly how they all compare to each other in the function approximation setting. 01:15:12.800 |
So that's the overview of all the technical parts. 01:15:17.380 |
And now I just have a couple conclusion slides. 01:15:22.360 |
So let me just summarize the current state of affairs. 01:15:25.000 |
I introduced two kinds of algorithms, policy gradient algorithms, which explicitly represent 01:15:30.840 |
a policy and optimize it, and Q function learning algorithms, which explicitly represent a Q 01:15:37.200 |
function, which is the goodness of different actions, and use that to implicitly represent a policy. 01:15:44.560 |
So with policy gradient methods, there have been some recent successes. 01:15:55.240 |
There's a recent paper on the A3C method, which is an asynchronous implementation of policy gradients that has worked quite well. 01:16:06.820 |
There's also another class of methods: the natural policy gradient and trust region methods. 01:16:13.380 |
The result I showed you was obtained using trust region policy optimization, which is one of these methods. 01:16:20.100 |
I think these trust region methods and natural policy gradient methods 01:16:24.820 |
are more sample efficient than the vanilla methods, because you're doing 01:16:31.220 |
more than one gradient update with each little bit of data you collect. 01:16:34.920 |
So with the vanilla policy gradient, you just compute one little gradient estimate from each batch of data and then move on. 01:16:41.220 |
With the natural policy gradient, you're solving a little optimization problem with that data, so you get more out of it. 01:16:48.540 |
So that's what we have in the policy gradient world. 01:16:54.060 |
And in the Q-function world, we have the DQN algorithm and some of its relatives. 01:17:01.900 |
And these are sort of descendants of value iteration, where you're approximating the 01:17:13.360 |
Bellman backup with function approximation. And then SARSA is related to policy iteration. 01:17:21.260 |
These two are dealing with 01:17:25.580 |
different Bellman equations, so it's kind of interesting that both kinds of methods 01:17:29.100 |
work, and that they have fairly similar behaviors, as it turns out. 01:17:35.980 |
So here's how I would compare them. 01:17:40.340 |
And this is anecdotal evidence, but I think this is the consensus right now. 01:17:47.060 |
The Q-function methods are more sample efficient when they work, but they don't work as generally 01:17:52.580 |
as policy gradient methods, and it's a little harder to figure out what's going on when they don't work. 01:17:58.940 |
And that kind of makes sense, because in the policy gradient methods, you're optimizing 01:18:01.980 |
exactly the thing you care about with gradient descent. 01:18:05.380 |
Whereas with Q-function methods, you're doing something indirect, where you're trying to 01:18:09.660 |
learn a Q-function, and then you're hoping that it gives you a good policy. 01:18:14.820 |
And yeah, so I would also point out that there are also some confounds, so it's hard to make 01:18:20.660 |
a good conclusion at this point, because people use different time horizons in the policy 01:18:28.460 |
gradient methods versus the Q-function methods. 01:18:30.980 |
So they do one-step lookaheads on the Q-functions and multi-step lookaheads on the policy gradients. 01:18:36.300 |
So it's not clear if the differences come from using different time horizons or some 01:18:43.660 |
differences in how the algorithms are working, because you're either doing regression for 01:18:47.860 |
a Q-function versus learning a policy using policy gradients. 01:18:53.980 |
So just to summarize it, I would say here are some of our core model-free reinforcement 01:18:59.220 |
learning algorithms, and they - oh, whoops, I'm missing a word in the first column, which 01:19:07.300 |
I think should say reliability and robustness. 01:19:11.860 |
So this just means, is it going to work on new problems without parameter tuning, or 01:19:19.420 |
is it going to mysteriously either work or not work? 01:19:24.260 |
So this would be my slightly sloppy summary of all these different algorithms. 01:19:32.460 |
I would say there's still some room for improvement. 01:19:35.540 |
There might be some improvements in the basic methods, because there are some nice properties 01:19:39.540 |
of the Q-function methods that we don't have in the policy gradient methods. 01:19:44.260 |
Like you can easily do off-policy learning -- you can easily explore with a different policy than the one 01:19:49.780 |
that you're learning the Q-function for, and that's really important. 01:19:54.500 |
You can't do that very easily with policy gradient methods. 01:19:58.380 |
Whereas the policy gradient methods just seem more robust -- you can just apply them 01:20:03.380 |
and they're more likely to work, and it's well understood what's going on. 01:20:10.100 |
So I think there's still room for improvement there -- I don't know if it's possible to get the best of both worlds. 01:20:21.540 |
In model-based reinforcement learning, what lines of research do you find most interesting? 01:20:39.540 |
Oh yeah, so in model-based reinforcement learning, what lines of research do I find most interesting? 01:20:47.340 |
I think the work from my colleagues on guided policy search is very nice. 01:20:51.260 |
So I would say that's a kind of model-based reinforcement learning. 01:20:56.740 |
I also like some of the methods that are using the model for faster learning. 01:21:03.140 |
So there's a paper called stochastic value gradients that I like a lot. 01:21:12.640 |
So I don't think there have been a lot of really compelling results where you're able 01:21:18.060 |
to learn extremely fast -- like, with much better sample efficiency than model-free methods. 01:21:24.060 |
So it seems like that should be possible, but I don't think it's been demonstrated yet. 01:21:29.980 |
So maybe in the next couple of years we'll see that happen. 01:21:41.340 |
Is it true or not that most of this work requires some kind of simulated world to run the learning in? 01:21:51.700 |
Oh yeah, so are you asking does this work in the real world? 01:21:59.500 |
Yeah, I would say it does work if you have a lot of patience and you're willing to execute a lot of trials on the real system. 01:22:06.820 |
So the locomotion results I showed add up to about two weeks of real time. 01:22:12.840 |
So it's actually not that bad, especially when you consider that toddlers take a while 01:22:18.020 |
to learn how to walk properly, even though evolution already puts in a lot of built-in 01:22:24.820 |
So I'd say maybe, yeah, I'd say it can be run in the real world. 01:22:31.620 |
Some of my colleagues in Berkeley are doing some experiments where they are running just 01:22:36.380 |
regular reinforcement learning algorithms in the real world. 01:22:41.300 |
But I hope to see some nice results from that soon. 01:22:52.860 |
I was wondering what was your intuition on the loss surface of those deep reinforcement 01:22:59.620 |
learning optimization problems, and maybe especially how it evolves as the policy learns. 01:23:08.900 |
And I should specify in the policy gradient case. 01:23:13.100 |
So I think the situation is a little bit different in reinforcement learning from in supervised 01:23:17.900 |
So in reinforcement learning, you have a particular kind of local minimum that lives in policy space. 01:23:29.300 |
So for example -- I keep going back to the locomotion example, where you want the robot to walk. 01:23:39.260 |
There's one local minimum where it just stands and doesn't bother to walk, because there's 01:23:44.980 |
some reward it can collect without risking a fall.
And there's another local minimum where it just dives forward, because it gets a little 01:23:49.140 |
bit of reward for that before it falls to its doom. 01:23:55.040 |
So I think that's actually the hard part about the optimization problem: it comes 01:24:03.920 |
from the space of behaviors, and actually has nothing to do with the neural network. 01:24:09.520 |
So I've also found that it matters surprisingly little what kind of neural network architecture 01:24:18.300 |
you use, because I think that most of the hardness and the weirdness of the problem 01:24:21.960 |
comes from what the behavior space looks like, rather than what the actual numerical optimization problem looks like. 01:24:34.320 |
So there are many problems where the reward is only observed at the end of the task -- 01:24:41.960 |
in the terminal state of each episode. 01:24:45.280 |
And you don't see rewards in intermediate states. 01:24:48.120 |
So how much harder do these problems become for deep reinforcement learning in your experience? 01:24:54.560 |
Yeah, so if you don't get the reward until the end -- well, then 01:25:05.480 |
I don't have anything precise to say about that. 01:25:08.880 |
I think it's going to be harder if you have less frequent rewards -- if your rewards are further away. 01:25:14.520 |
So for example, in your video, for the last example of getting up and getting the head 01:25:19.600 |
above a certain height, for example, that could be one where you only get a plus 1 if 01:25:25.120 |
you're above and you don't get anything below. 01:25:28.160 |
But what you were doing was kind of shaped -- if you get your head higher, then you get more reward. 01:25:32.520 |
Yeah, so I think we came up with a reward like the squared distance from the target height, which made the problem a lot easier. 01:25:39.880 |
Yeah, the problem would have been a lot harder if you got zero reward until you got your head above the target height. 01:25:45.560 |
And that would actually be a problem of exploration, which is that you have to 01:25:51.920 |
explore all the different states to figure out where you're going to get good reward. 01:26:02.720 |
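As a hypothetical illustration of the difference being discussed, with made-up names and thresholds (the actual rewards used in the experiments may differ):

    def sparse_reward(head_height, target_height=1.0):
        """Reward only at success: +1 once the head is above the target height, else 0."""
        return 1.0 if head_height >= target_height else 0.0

    def shaped_reward(head_height, target_height=1.0):
        """Shaped alternative: penalize squared distance to the target height, so there is
        a learning signal even when the agent is far from succeeding."""
        return -(target_height - head_height) ** 2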
So I have a question about how you choose to discretize your space and time, because your 01:26:06.640 |
locomotion example is clearly a continuous system. 01:26:11.920 |
Yeah, so it's actually really important how you discretize time, like what time step you 01:26:16.840 |
use, because the algorithm does care about what the time step is. 01:26:25.600 |
It's not invariant to it, because you have discount factors, and you're also sampling a new action at every time step. 01:26:34.080 |
So if you choose too small of a time step, then the rewards will be delayed by more time steps. 01:26:47.560 |
And also, your exploration will be more like a random walk, because you're changing your action at every little time step. 01:26:56.840 |
And I'd say that's a flaw in current methods. 01:27:02.480 |
So with that, let's give John a thank you.