
Deep Reinforcement Learning (John Schulman, OpenAI)


Chapters

0:00 Introduction
0:48 What is Reinforcement Learning?
3:04 Business Operations
6:34 How Does RL Relate to Other ML Problems?
10:07 How Does RL Relate to Other Machine Learning Problems?
14:18 Recent Success Stories in Deep RL
20:57 Episodic Setting
21:56 Parameterized Policies
23:45 Policy Gradient Methods: Overview
28:05 Score Function Gradient Estimator: Intuition
31:33 Score Function Gradient Estimator for Policies (the random variable is now a whole trajectory)
35:01 Policy Gradient: Use Temporal Structure
37:01 Policy Gradient: Introduce Baseline
39:04 Discounts for Variance Reduction
44:34 Extension: Step Sizes and Trust Regions
57:52 Value Functions

Transcript

So good morning, everyone. So I'm going to talk about some of the core methods in deep reinforcement learning. So the aim of this talk is as follows. First I'll do a brief introduction to what deep RL is and whether it might make sense to apply it in your problem.

I'll talk about some of the core techniques. So on the one hand we have the policy gradient methods. Then on the other hand we have methods that learn a Q function, including Q-learning and SARSA. And I'll talk a little at the end about what are the pros and cons of these different methods.

So first, what is reinforcement learning? It's a branch of machine learning concerned with taking sequences of actions. So often it's described in terms of an agent interacting with a previously unknown environment. And it's trying to maximize some kind of cumulative reward, some kind of reward function that we've defined, accumulated over time.

And pretty much any kind of task where you have some kind of goal that you want to achieve can be stated in these terms. So this is an extremely general formulation. What's deep reinforcement learning? It's pretty simple. It's just reinforcement learning where you're using neural networks as function approximators.

So the interesting thing about reinforcement learning in contrast to supervised learning is it's actually not totally obvious what you should use your neural network to approximate in reinforcement learning. And there are different kinds of algorithms that approximate different things. So one choice is to use the neural network to approximate your policy, which is how the agent chooses its actions.

Another choice is to approximate the value functions, which measure how good or bad different states are or actions. And last, you can try to learn a model of the system, a dynamics model, which will make predictions about next states and rewards. Okay, so I'll now give a few examples of different places where you might apply reinforcement learning and what the observations and actions would be.

So one example is robotics. So here you could imagine a robot where the observations are the camera images and the joint angles of the robot. The actions are the joint torques you're applying. And the reward is going to depend on what you want the robot to do. So this is something we, as the algorithm designer, get to define.

So the rewards could be to stay balanced, to navigate to some target location, or something more abstract like serve and protect humans. So reinforcement learning has also been used in a lot of more practical applications. Well, applications that have been practical in the past. I think robotics will be very practical in the future.

And for example, one area is inventory management. So this is just one example of how you could use reinforcement learning for a decision-making problem. So you have to decide how much to stock up on of every item. And your observations would be your current inventory levels. Actions would be how much of each item you're going to purchase.

And reward is your profit. So people in operations research, this is a subfield, study this kind of problem a lot. There are also a lot of machine learning problems where people have started to apply reinforcement learning techniques. So one example is attention. So the idea in attention is you don't want to look at the whole input at once.

You want to just focus on part of it. So one example of this is with a large image, you might want to just crop out part of it and use that and just do detection on that part of the image. So here, your observation would be your current image window.

Action is where to look or where to crop your image. And reward is whether you make a classification error or not. So here, you have to try to choose the right area of the image to look at so you'll do the correct classification. Reinforcement learning has also been used in structured prediction problems, which in the past often weren't considered to be reinforcement learning problems.

But it turns out that to actually properly solve them, it actually is a reinforcement learning problem. So machine translation, for example, so you get a sentence in the source language and you have to emit a sentence in the target language. And here, your observations are the sentence in the source language.

You're emitting one word at a time in the target language. And you have some reward function that looks at the whole sentence and tells you how good your translation was. So since this is non-differentiable, you can't just differentiate through the whole thing and do gradient descent. So it turns out you can use a policy gradient method to optimize your translation system.

So people have started to do that. So those are just a few examples, not exhaustive at all. But I just want to say a little bit about how reinforcement learning fits into the picture of all the other types of machine learning problems. So previous courses in this series have talked about supervised learning and unsupervised learning.

So how does reinforcement learning relate to them? How is it different? So let's just first compare it to-- let's look at supervised learning. So in supervised learning, first, the environment samples an input-output pair from some distribution rho. The agent makes a prediction, y hat, using its function f. And it receives some loss, which tells it if it made the right prediction or the wrong prediction.

So the interpretation is environment asks the agent a question and then tells her the right answer. So contextual bandits make this problem a little harder in that they give the learning agent a little bit less information. So now the environment samples an input, but notice that there's not a correct output associated with it.

Then the agent takes an action, and the agent receives some cost, which is from some probability distribution. So here, c is the cost. We're sampling it from some probability distribution. And the agent doesn't know what this probability distribution is, so that's what makes the problem hard. So environment asks the agent a question, the agent answers, and the environment gives her a noisy score on the answer.

So this is applied. This actually has a lot of applications. So personalized recommendations is one big one, along with advertising. So you have to decide, like, customers who like this -- I mean, you have a customer, and you know what they liked in the past, so you have to make a prediction about what they're going to like in the future.

So you show them appropriate ads or links, like what book you want to try to advertise to them, or what video you want to show them, and so on. So here, the big difference between this and the supervised learning setting is you don't have access to the function, the loss function you're trying to optimize.

So in particular, you can't differentiate through it. We don't know the process that generates c, so we can't compute the gradient of the loss function and use that to tune the agent's parameters. So that makes it -- so that makes the problem a bit harder, where you have to use a different kind of algorithm.

Lastly, reinforcement learning is almost the same as the contextual bandit setting, except now the environment is stateful. So now, instead of sampling the initial state from scratch every time step, from the same distribution, the state evolves over time. So you have some transition probability distribution called p here, where the state x sub t is conditioned on the previous state and the previous action.

And that makes the problem quite a bit harder, because now -- well, for a number of reasons. For one thing, the inputs you're getting depend on what actions you're taking. So now that makes it harder to develop a stable, reliable algorithm, because now as the agent starts to learn, it gets different inputs.

So that can lead to all sorts of out-of-control behavior. And it also means you have delayed effects, because since the system is stateful, you might need to take a lot of actions to get into the right state. So you might need to -- you can't just act greedily every time step.

You have to think ahead, effectively. Okay, so just to summarize these differences, there are two differences. The first one is, you don't have full analytic access to the function you're trying to optimize. You have to query it through interaction. Second, you're interacting with a stateful world, which means that the input you get is going to depend on your previous actions.

And if you just take the first of those differences between supervised learning and reinforcement learning, you get the contextual bandit setting. So that's sort of halfway in between. Okay, so I realize that there are multiple -- this audience probably has people with different interests. Some people are doing research and want to know about what's the latest in the research world.

And some people are -- want to apply these machine learning techniques to practical applications. So this slide is for the latter group of people. So if you're wondering -- if you have some problem where you think reinforcement learning might be relevant and you're wondering if you should apply reinforcement learning.

So first, I should say that the answer might be no. It might be overkill, especially deep reinforcement learning. So this is a set of fairly new techniques where it's not going to work out of the box very well. And it's -- these techniques aren't that well established. So they require a lot of -- they have a lot of knobs to be tuned.

So it might be overkill and, yeah, these techniques aren't that well established at the moment. So it might be worth investigating some other methods first. So one -- so if your problem has a small number of parameters you're trying to optimize over and you have a simulator that you can, like, just do lots of experiments on, then derivative-free optimization methods are likely to be better than reinforcement learning, or they're likely to be easier to get working.

So these methods just look at -- they just -- you give them a black box function where you put in a parameter vector and it'll give you a noisy estimate of the score. And these algorithms will just optimize over the parameters of that black box -- I mean, that are being put into that black box.

So yeah, there's a variety of different methods for derivative-free optimization, but these are easier to understand than reinforcement learning, and they do kind of work out of the box. Okay. A lot of problems are actually -- can be seen as -- are contextual bandit problems, and the statefulness of the world isn't that relevant.

So for example, in advertising, this is where people -- people look at advertising as a contextual bandit problem most of the time, because you decide what ad to present the user with, and then they either click on it or they don't. But it's really -- the user is kind of stateful, because if you show them a terrible ad, they might just go and install an ad blocker.

So there is -- like, your actions do have some repercussions. But often you can just approximate it as being a contextual bandit problem where there is no state. So there's a better theoretical understanding of contextual bandit problems and methods that are -- that have some guarantees. So in that case -- so if it is a contextual bandit problem, you might want to use those kind of algorithms instead.

And lastly, the operations research field has been using these methods for a while on real problems, and they have a set of methods which are just pretty much the basic algorithms, policy iteration and value iteration, but they're sort of well -- they're well-developed ways of doing feature engineering for these problems that end up working pretty decently.

So these techniques are also worth considering instead of trying to throw a big neural network at it. Okay. So now -- well, now that I've talked about what -- why not to use deep reinforcement learning or what it's not good for, I'll just talk about some recent success stories in deep reinforcement learning, which are achievements that probably wouldn't have been possible using these other techniques.

So a few years ago, there was a pretty influential result by Mnih et al. from DeepMind where they used a deep Q-learning algorithm to play Atari games using the screen images as input. And that's hard because you have these -- these games are -- you're trying to do different things in all these games, and they're -- some of them are kind of complicated.

So it's pretty remarkable that you can just use a simple -- that a simple algorithm can solve them all. The same algorithm can solve them all. So since then, people have also solved -- or solved this domain using policy gradients and another algorithm called DAgger. So another big groundbreaking result was beating a champion-level player at Go, also by DeepMind, using a combination of supervised learning from, like, from expert games, plus policy gradients to fine-tune the supervised learning policy, plus Monte Carlo tree search, plus value functions to make the search work better.

So a combination of techniques and reinforcement learning. Robotic -- so some of my colleagues at Berkeley had some very nice results learning in real time how to do manipulation tasks using an algorithm called guided policy search using the PR2 robot. And some of my colleagues and I have been working on robotic locomotion using policy gradient methods.

And people have been working on locomotion for a while and have been able to achieve pretty good results using very, like, highly engineered domain-specific methods. But previously, there hadn't been much success using general methods to solve it. And last, there have been some recent results playing 3D games using policy gradients.

In fact, there was even a contest I heard about a couple days ago with this new ViZDoom task, which is pretty nice. So you might want to check out ViZDoom. Okay. So that's it for the high-level overview part of this. Now I'm going to start getting into the actual formalism and the technical details.

Okay, so the basic object in the field of reinforcement learning is the Markov decision process. So the Markov decision process is defined by the following components. You have a state space. This is all the different states of the system. The action space, these are all the actions the agent can take.

And you have this probability distribution, which determines the probability of next state and reward. So R is the reward, S prime is the next state, S and A are the actions. So it's a conditional probability distribution. Sometimes people split this out into a separate reward function, but that's basically an equivalent formulation.

Okay. And sometimes there are some extra objects defined. We'll consider an initial state distribution, which determines which state the world starts out in. And the typical optimization problem you want to solve given this MDP is to maximize expected cumulative reward. So there are various ways of defining that more precisely, which I'll go into later.
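To make these ingredients concrete, here is a minimal sketch, in Python, of how a small finite MDP could be represented. The variable names, sizes, and the toy transition rule are purely illustrative, not something from the talk.

    import numpy as np

    n_states, n_actions = 3, 2

    # P[(s, a)] is a list of (probability, next_state, reward) triples,
    # i.e. the conditional distribution P(s', r | s, a) described above.
    # The particular transition rule here is an arbitrary toy example.
    P = {
        (s, a): [(1.0, (s + a) % n_states, float(s == n_states - 1))]
        for s in range(n_states)
        for a in range(n_actions)
    }

    # mu is the initial state distribution.
    mu = np.ones(n_states) / n_states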

Okay. So there are various different settings of reinforcement learning where you define a slightly different optimization problem. The one we'll be most concerned with is called the episodic setting. So here the agent's experience is split up into a series of episodes which have finite length. So in each episode, we first sample the initial state of the world from some probability distribution mu.

And then the agent keeps on acting until the world ends up in some terminal state. So just to give an example of what terminal states might be like and how an episodic reinforcement learning problem might look. So one example is when termination is good and you want to terminate the episode as fast as possible.

So if we imagine setting up a task with some kind of taxi robot that should get to the destination as fast as possible, then the episode would be like one trip and it's trying to terminate the episode as fast as possible. Another example is a waiter robot where you have a fixed length shift, but the waiter has to accumulate, it has to do as well as possible during that shift.

So there the episode has a fixed length. The waiter has to say maximize tips or customer happiness. And then you could imagine another kind of task where termination is bad and you want the episode to last as long as possible. So you can view life as an example of that.

But also you could imagine having a walking robot where you want it to walk as far as possible before it falls over. And in this setting, it's pretty easy to define what the goal is. We just want to maximize the expectation of the total reward per episode. And the last object we're going to introduce here is a policy.

So the policy is just the function that the agent uses to choose its actions. So we have deterministic policies, which are just the policy is denoted by pi. So we have the action is just some function of the state. And we also have stochastic policies where the policy is a conditional probability distribution.

So we're just going to make a little bit more precise the setting of the episodic MDP. So first we sample the initial state from this distribution, mu. Then we sample the first action, a0, from the policy. Then we sample the next state and reward from the transition probability distribution, and so on, until we reach a terminal state, s sub T.

And then the quantity we care about is the sum of all these rewards, r0 + r1 + ... + r sub T minus 1, and we want to maximize it. So eta of pi is just defined as the expected total reward of the policy pi. Here's the picture that illustrates exactly the same thing.

So you can look at it as a graphical model. And lastly, in the policy gradient section in particular, we're going to be interested in parameterized policies. So here we have a parameter vector, theta, which specifies exactly what the policy is. So for example, the family of policies could be just a neural net.

You have a certain neural network architecture. And theta specifies all the weights of this neural network. So we could have a deterministic policy, of course, or a stochastic policy. And if you're wondering concretely what would a policy look like, I mean, how do you use a neural network to represent your policy?

It's actually exactly, you do exactly the same thing you would do if this were a classification or a regression problem. So S here, the state here is your input, and the action is your output. So if you have a discrete action space, a discrete set of actions, then you would use a network that outputs a vector of probabilities, the probabilities of the different actions.

This is exactly like a classifier. And if you have a continuous action space, you would have your neural network output the mean and the diagonal of a covariance matrix of a Gaussian distribution. So this is just like you're doing regression. So you can use the same kind of architectures you'd use in supervised learning.
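As a sketch of what those two parameterizations might look like in code (using PyTorch as an arbitrary choice of framework; the architecture and sizes are placeholders, and the state-independent log standard deviation in the Gaussian case is a common simplification, not something prescribed in the talk):

    import torch
    import torch.nn as nn

    class CategoricalPolicy(nn.Module):
        # Discrete actions: output one probability per action, like a classifier.
        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                     nn.Linear(hidden, n_actions))

        def forward(self, s):
            return torch.distributions.Categorical(logits=self.net(s))

    class GaussianPolicy(nn.Module):
        # Continuous actions: output the mean of a Gaussian with a diagonal
        # covariance, like doing regression on the action.
        def __init__(self, obs_dim, act_dim, hidden=64):
            super().__init__()
            self.mean = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, act_dim))
            self.log_std = nn.Parameter(torch.zeros(act_dim))

        def forward(self, s):
            return torch.distributions.Normal(self.mean(s), self.log_std.exp())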

Okay, so that's it for the formalism of MDPs. So now I'm going to go into policy gradient methods, which are one broad and general class of reinforcement learning methods, which are quite effective. So to give a brief overview of this, here's the intuition of what policy gradient methods are going to do.

So here, capital R means the sum of rewards of the whole episode. So our optimization problem is we want to maximize the expectation of the total reward given our parameterized policy, pi sub theta. And the intuition of how our algorithm is going to work is we're going to collect a bunch of trajectories.

I mean, this is just run a bunch of episodes using our policy. And then we want to make the good trajectories more probable. So I mean, some of the trajectories were lucky and they were really good. Some of them, the agent was unlucky and they were bad. And the ones that were good, meaning there was high reward, that means the agent probably took good actions there.

So we want to increase the probability of the actions from those trajectories. So the most basic version of policy gradient methods just try to make the good trajectories more probable without trying to figure out which were the good actions and which were the bad actions. Slightly better methods or more elaborate methods try to figure out which actions were good and which ones were bad.

And then they try to make the good actions more probable. And lastly, there's another class of methods which actually try to push the actions towards better actions. So they differentiate the loss function with respect to the actions and they try to push the actions to better actions. So we're mostly going to talk about one and two here.

Oh, there's a question? Oh, yeah, good question. So while we're maximizing over the policy, we're trying to find the best policy. But here, the policy is assumed to be parameterized. So there's some parameter vector theta that specifies the policy. And now we just want to maximize with respect to theta.

Any other questions? So there's a very fundamental concept, which is called the score function gradient estimator, which underlies policy gradient methods. So actually, to introduce this, we're not going to talk about policies in RL at all. We're just going to assume we have some expectation. We have expectation of f of x, where x is sampled from some parameterized probability distribution.

So we want to compute the gradient of this expectation with respect to theta. So there's a general formula that'll do this. And the way you derive it is you just write the expectation as an integral. And then you just move some things around. You swap the integral with the derivative and you turn it back into an expectation.

And what you get at the end is this bottom line, which says that you take the expectation of function value times grad log probability. So this is an unbiased estimator of the gradient, meaning if we get enough samples, it'll converge on the right thing. So the way you can compute this estimator, meaning the way you can get a noisy estimate of the gradient of the expectation, is you just get one sample from this distribution.

And then you multiply f of x times grad log probability. So the only requirement for being able to use this estimator is we need to be able to compute the probability density. We need to be able to analytically compute it. And we need to be able to differentiate it with respect to theta.

And often it needs to be differentiable. There's another way of deriving it using importance sampling. So you write down the importance sampling estimator for the expectation. And then you just swap the derivative with the expectation and you get the same thing. So now let me just give a little bit of intuition about this estimator.

So f of x is measuring how good the sample x is. So that means that-- so g hat here is our gradient estimator, meaning this is what we get if we take one sample x sub i and we compute our estimator. This is our estimate of the gradient. So if we move in direction g hat, that pushes up the log probability of our sample x sub i in proportion to how good it is.

So if we have really good-- if we got a really good function value, then we're going to try to push up its log probability a lot. And if it was a bad function value, then we're not going to try to push it up very much. So pretty simple intuition.

The really nice thing is this is valid even if f of x is discontinuous or if f of x is unknown, meaning you only-- you don't get to differentiate it, you just get to see the function values. Or the sample space is a discrete set, so x doesn't even have to be continuous.

And this is quite remarkable, actually, that you don't even need to have access to the full-- you don't need to know exactly what the function is that you're optimizing. You just have to be able to query it for the function value. And this means-- this is a way of being able to differentiate functions through a system that has non-differentiable pieces.

So for example, in robotic locomotion, one issue is that you have contacts between the robot's foot and the ground. And you make and break contact, and that causes a discontinuous change in the dynamics. So that makes it really hard to do smooth optimization techniques to come up with the right behavior.

So when you use this kind of gradient estimator, along with policy gradients, which I'm going to talk about very soon, you can actually just differentiate-- you can optimize the system even though it has non-differentiable pieces in it. So here's another little picture of what's going on. So we have our function f of x, which we're trying to maximize the expectation of.

And then we have our probability density, p of x. So we just sample a bunch of values from our probability density. Those are the blue dots on the x-axis. And then we look at the function values. And we're trying to push the probability distribution so that the probability goes up at these samples in proportion to the function value.

So over on the right side of the curve, that means we're trying to push that probability value up really hard. And on the left side, we're pushing it up softly. So what's going to happen is the probability density is going to slide to the right. If you can imagine a sort of physical analogy there.
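As a quick numeric sanity check of the estimator, here is a toy example of my own choosing: x sampled from N(theta, 1) and f(x) = x squared, for which the expectation is theta squared plus 1 and the true gradient is 2 times theta.

    import numpy as np

    theta, n = 1.5, 200_000
    x = np.random.randn(n) + theta    # samples x_i ~ p(x | theta) = N(theta, 1)
    f = x ** 2                        # f is only queried, never differentiated
    grad_log_p = x - theta            # d/dtheta of log N(x | theta, 1)
    g_hat = np.mean(f * grad_log_p)   # score function gradient estimate
    print(g_hat, "vs true gradient", 2 * theta)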

OK. So that's the score function gradient estimator. This is a general technique. It can be used in various machine learning problems. Now we're going to apply it to the reinforcement learning setting. And we're going to take our random variable x to be a whole trajectory. So the trajectory consists of state action reward, state action reward, and so on until the end of the episode.

And now to get our gradient estimator, to get the gradient of the expected reward, all we've got to do is compute the grad log probability times the total reward. So this probability of the trajectory, that sounds like a really unfriendly quantity because there's a long, complicated process that generates this trajectory with lots of time steps.

But log-- OK, so we can write out what this process is, what this probability density is. So we have-- it's just a product of probabilities. We've got our initial-- we've got our mu of s0, which is just our initial state distribution. And then every time step, we have-- we sample the action according to pi.

And we sample the next state and reward according to our dynamics model. So log turns that product into a sum. And here's the cool part. Anything that doesn't contain theta drops out. So the thing is, we didn't know-- there are parts of this probability distribution, p of tau given theta, that we don't have access to.

So if this is reinforcement learning, we don't assume that we know the dynamics model of the system. We just find out about it by sampling-- by doing episodes. So since this product turns into a sum, all the pieces, like the log p there and the log mu, which we don't know, just drop out.

So it doesn't matter. And what we get in the end is we get a sum of log probabilities of actions. So grad log pi of action given state. So our formula looks like-- our formula for the gradient of the expectation is just the expectation over trajectories of total reward of the trajectory times grad of the sum of all the log probs.

So the interpretation of this is we're taking our good trajectories and we're trying to increase their probability in proportion to how good they are. And you can think of this as being similar to supervised learning, where we treat the good trajectories with high rewards as positive examples in our supervised learning problem.
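In code, the most basic version of this estimator might look like the sketch below, reusing the distribution-producing policy modules sketched earlier. Again this is only illustrative (PyTorch, my choice), and for simplicity it assumes a discrete-action policy so that log_prob returns one number per time step.

    import torch

    def basic_pg_loss(policy, trajectories):
        # trajectories: list of (states, actions, rewards) tensors for whole
        # episodes collected by running the current policy.
        per_episode = []
        for states, actions, rewards in trajectories:
            logp = policy(states).log_prob(actions).sum()  # sum_t log pi(a_t | s_t)
            R = rewards.sum()                              # total episode reward
            per_episode.append(-R * logp)   # minus sign because optimizers minimize
        return torch.stack(per_episode).mean()

    # Calling .backward() on this loss produces the score function estimate of the
    # gradient of E[R], which can then be fed to any gradient-based optimizer.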

So we're using those to train the policy on which actions are good. We're basically treating those actions as positive examples. OK, now we can improve this formula a little bit. So that was just the most basic-- I mean, this is an unbiased estimate. This is an estimator for the policy gradient.

So if we just take that expression inside the expectation on the right-hand side and we take one sample of that, it has the right mean. So if we just get enough of them, we're going to get the policy gradient. But we can also write down some other formulas that have the same mean but have lower variance.

So we can come up with better estimators for the policy gradient. And that's actually quite important because the one from the previous slide is really bad when you have a large number of time steps, meaning it has really high variance. So the first thing we can do is we can use the temporal structure of the problem.

By the way, to derive these next bunch of formulas, it just takes a bunch of really straightforward manipulation where you move around expectations. And I'm not going to go through all the math. But I'll just say what the formulas are. So we can repeat the same argument from the previous slide to just derive the gradient estimator for a single reward term.

So we end up with that reward term times the grad sum of log probs. And just summing over that, we get a new formula where we're not multiplying the sum of the grad log prob of the whole thing times the sum of all rewards. So let's look at that bottom formula.

Now we have a sum over time of grad log probability of the action at that time times the sum of future rewards. So now, I mean, in the formula from the previous slide, we would have had all the rewards in that sum. But now we just have the future rewards.

And that kind of makes sense because an action can't affect the rewards that came before it. So to figure out if the action is good, we should only be looking at the future rewards. So this is a slightly better formula than the one on the previous slide, meaning it has the exact same mean except the expression inside the expectation there has lower variance.
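A minimal sketch of that "sum of future rewards" (reward-to-go) computation, which is just a backwards cumulative sum:

    import numpy as np

    def rewards_to_go(rewards):
        out = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running += rewards[t]
            out[t] = running
        return out

    # rewards_to_go([1, 0, 2]) -> array([3., 2., 2.])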

And we can further reduce the variance by introducing a baseline. So now we can take any old function, B, which takes in a state and it outputs a real number. And we can subtract it from our sum of future rewards. And we didn't affect the mean of the estimator at all.

So we didn't change the expectation at all by introducing this baseline. So yeah, for any choice of B, this gives us an unbiased estimator. By the way, if you're not that familiar with the terminology of estimators, what I'm saying is we have an expectation on the right-hand side of that formula.

And the quantity inside that expectation is what's called the estimator. And if we get a bunch of samples, then we can get an estimate of the thing on the left-hand side, which is what we care about. So when I say it's an unbiased estimator, that just means that this equation is correct, meaning that the thing on the right-hand side equals the thing on the left-hand side.

So yeah, this works for any choice of baseline. And a near optimal choice is to use the expected return. So the expected sum of future rewards. And the interpretation of that is if we took an action, we only want to increase the probability of the action if it was a good action.

So how do we tell if it was a good action? Well, the sum of rewards after that action should have been better than expected. So the B of S is the expected sum of rewards. And we're just taking the difference between the measured thing and the expected thing. So that was a pretty key thing for variance reduction.

And I'm going to introduce one last variance reduction technique. And actually, all three of these are really important. So basically, nothing's going to work except for maybe really small-scale problems unless you do these things. So the last variance reduction technique is to use discounts. So the discount factor ignores delayed effects between actions and rewards.

So what we're going to do here looks kind of like a hack, but there's an explanation for it, which is instead of taking the sum of rewards, we're going to take a discounted sum of rewards. Meaning that we add this exponential factor, gamma. So when we're multiplying the grad log probability by some future reward, we multiply it by some quantity that decays with time.

So people typically use gamma equals 0.99 or gamma equals 0.95. So that means if you use 0.99, that means after 100 time steps, you're going to be reducing the reward by a factor of 1 over e. So you're exponentially decaying the effect of the future rewards. And the intuition is that the action shouldn't affect rewards really far in the future.

Like the assumption is that the system doesn't have really long-term memory and it sort of resets. The effects aren't that far delayed. So you can just ignore the interaction between action and a reward way in the future. That's the intuition. So now instead of taking the baseline to be the expected sum of future rewards, we want to do a discounted sum.

So now we're measuring if the action was better than expected according to the discounted sum. And now there's a more general class of formulas that looks like the one that I just wrote. So this one that's on the top of the slide is pretty good. And this is almost as good as anything you're going to do to within a small constant factor.

But there's a more general class of formulas that look like grad log probability times some quantity a hat, which we call the advantage estimate. And this is in general just going to be an estimate of, it has a more precise definition, which is how much was this action better than the average action taken by the policy.

But informally, this just means how much better was the action than expected. And this formula makes a lot of sense because we want to increase the probability of the good actions and decrease the probability of the bad ones. So we should increase it in proportion to the goodness of the action.
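Putting the last few ideas together, here is a sketch of computing the discounted reward-to-go and the advantage estimates used to weight grad log pi. The baseline values are assumed to come from some fitted value function, and gamma = 0.99 is just the typical value mentioned above.

    import numpy as np

    def discounted_rewards_to_go(rewards, gamma=0.99):
        out = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            out[t] = running
        return out

    def advantage_estimates(rewards, baseline_values, gamma=0.99):
        # "How much better than expected was each action?"
        return discounted_rewards_to_go(rewards, gamma) - baseline_values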

So just to summarize, so I just told you there's this gradient estimator, meaning there's this expression you can compute, which gives you a noisy estimate of the policy gradient. So how do you actually turn this into an algorithm? So this is silly, algorithm seven. So here's what the algorithm looks like.

It's pretty much what you'd expect. You take your policy, you initialize your policy parameter and your baseline function. For each iteration, you execute the current policy to get a bunch of whole episodes, meaning whole trajectories. At each time step in each trajectory, you compute the return, meaning the discounted sum of rewards following that time step, and the advantage estimate, which is that return minus the baseline.

Then you refit the baseline by trying to make the baseline function equal the returns. And then you update the policy using a policy gradient estimator. So you're just doing SGD while updating the baseline as you go along. So that's the vanilla policy gradient algorithm. And this is, I'll briefly talk, this has been used to obtain some pretty good results.
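Here is a rough sketch of that loop in code, building on the earlier sketches. It assumes a discrete-action policy, a small baseline network mapping states to values, a run_episode(env, policy) helper that returns states and actions as tensors and rewards as a NumPy array for one episode, and the discounted_rewards_to_go helper from above; all of those names and the hyperparameters are placeholders rather than the speaker's actual code.

    import torch

    def vanilla_policy_gradient(env, policy, baseline, n_iters=100,
                                episodes_per_iter=20, gamma=0.99, lr=1e-3):
        pi_opt = torch.optim.Adam(policy.parameters(), lr=lr)
        b_opt = torch.optim.Adam(baseline.parameters(), lr=lr)
        for _ in range(n_iters):
            batch = [run_episode(env, policy) for _ in range(episodes_per_iter)]
            pg_terms, baseline_terms = [], []
            for states, actions, rewards in batch:
                returns = torch.as_tensor(
                    discounted_rewards_to_go(rewards, gamma), dtype=torch.float32)
                values = baseline(states).squeeze(-1)
                adv = returns - values.detach()          # advantage estimates
                logp = policy(states).log_prob(actions)
                pg_terms.append(-(logp * adv).sum())     # policy gradient loss
                baseline_terms.append(((values - returns) ** 2).mean())
            pi_opt.zero_grad()
            torch.stack(pg_terms).mean().backward()      # SGD step on the policy
            pi_opt.step()
            b_opt.zero_grad()
            torch.stack(baseline_terms).mean().backward()  # refit the baseline
            b_opt.step()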

So it's not that bad of an algorithm, but there are several different directions that it can be improved. So one issue that you run into is with step sizes. So in supervised learning, step sizes aren't that big of a deal. Because maybe you take too big of a step, but that's OK.

You'll fix it the next update. And your current function, your current classifier, for example, doesn't affect what inputs you're getting. So even if you just are doing really, even if your network is just thrashing around for a while because you're taking too big steps, that's not going to cause any problems.

But yeah, so step sizes aren't that big of a deal. You can just anneal them. You can start off with a large step size and anneal them down to zero. And that works pretty well. In reinforcement learning, if you take too big of a step, you might wreck your policy.

And even if you don't actually change the network that much, so you don't lose all your nice features, you might just change its behavior a little too much. And now it's going to do something totally different and visit a totally different part of state space. So since in reinforcement learning, the system is stateful and your state distribution depends on your policy, that brings a really different problem.

And now, after you took that step, the next batch of data you're going to get was collected by the bad policy. And now you're never going to recover because you just forgot everything. Yeah. So, one way that my colleagues and I, well, one way to fix this is to try to basically try to stop the policy from taking too big of a step.

So you can look at the KL divergence between the old policy and the new policy, like before the update and after the update, and make sure that the distributions aren't that different. So you're not taking too big of a step. It's kind of an obvious thing to do. So my colleagues and I developed an algorithm called trust region policy optimization, which looks at the action distributions and tries to make sure the KL divergence isn't too large.
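This is not the actual TRPO machinery (which involves a natural-gradient step and a line search), just a sketch of the underlying idea: measure how far the action distribution moved under a candidate update and refuse steps that move it too far.

    import torch

    def mean_kl(old_policy, new_policy, states):
        # Average KL divergence between the old and new action distributions
        # over a batch of visited states.
        return torch.distributions.kl_divergence(
            old_policy(states), new_policy(states)).mean()

    # Sketch of how it might be used after computing a candidate update:
    #   if mean_kl(old_policy, new_policy, states) > max_kl:
    #       shrink the step (or restore the old parameters) and try again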

And this is very closely related to previous natural policy gradient methods, which are based on-- which are doing something similar, but usually it's not set up as a hard constraint on the KL divergence. So another type of extension of policy gradient methods is to do more-- to use value functions to do more variance reduction.

Instead of just using them as a baseline, you can also-- you can use them more aggressively and introduce some bias. So I won't go into the details in this talk, but sometimes these are called actor-critic methods. There's also another type of approach, which I briefly touched on in the earlier slide, where instead of just trying to increase the probability of the good actions, you actually differentiate your loss with respect to the actions.

This is like the reparameterization trick, which is used for density modeling and unsupervised learning. So here you're trying to-- instead of just increasing the probability of the good actions, you're trying to push the actions towards better actions. And I'd say both of these bullet points, you're potentially decreasing your variance a lot, but at the cost of increasing bias.

So it actually makes the algorithms a little harder to understand and to get them working, because with high variance, if you just crank up the amount of data, you can always drive your variance down as much as you want. But with bias, even if-- no matter how much data you get, you're not going to get rid of the bias.

So if your gradient is pointing in the wrong direction, then you're not going to learn anything. OK, so now that's it for the policy gradient section of this talk. So I wanted to show a quick video of some work that my colleagues and I did on learning locomotion controllers with policy gradient methods, which I think-- well, I found pretty exciting when I saw it.

So hopefully you find it interesting. So here, what we've got is a humanoid simulated-- let's see. OK, yeah, it's a simulated humanoid robot in a physics simulator, a realistic physics simulator called MuJoCo. And it has a neural network policy, which takes in the joint angles of the robot and a little bit of other kinematic information.

It's got joint velocities and also positions of the different links of the robot. So that's what the input is. It's pretty much the raw state of the robot, like no clever feature engineering there. And the output is going to be the joint torques, which are set 100 times a second.

So we're just mapping from joint angles to joint torques. And we define a reward function, which is to move forward as fast as possible. So it gets a reward for moving forward. So the episode ends when its head goes below a certain height, meaning it fell over. So that's basically the setup.

There is a little bit of tweaking for the reward function, but not too extensive. So you can see, first it just falls forward a lot of times. And then slowly it starts to develop a half-decent looking walk. And eventually it gets it down pretty well. And at the very end of this, it could just keep running indefinitely.

So I think it was actually stable in a strong sense, meaning I could just leave it for 15 minutes and it wouldn't fall over. It would just keep going. So here's another robot model that we just created without too much thought. We just decided to put a bunch of legs on this thing.

So we don't even know how this thing is supposed to walk. And just give it to the same algorithm and it just figures out some kind of crazy way to walk. So that's the nice thing about reinforcement learning. You don't even need to know what you want it to do.

I think this is also the physics are a little unrealistic here. Here we set up, we used a similar model to the one in the first demo. But here we just give it a reward for having its head at a certain height. So there's a reward telling it to get its head up as high as possible.

And then it figures out how to get up off the ground. Let's see, I have low battery. Does anyone have a charger that I could? Thanks a lot. You're a lifesaver. OK, any questions about policy gradients before I move on to the next part? So the question was, is the system time invariant?

Yes, that's assumed. Is that it's stationary? Oh, right. And also that it doesn't change from one episode to the next. Of course, in some real world problems that might not be the case. So I think that's also an interesting problem setting where you have a non-stationary environment. So the question was, for the baseline, to learn a good baseline, do you need to know the dynamics of the system?

So no, you can just learn it by doing regression. You just estimate the empirical returns and then you do regression to try to fit a function to that. Yeah, so the question is, there's a discount factor which should cause the policy to disregard any effects that are delayed by more than 100 time steps.

So how does it still work that this guy learns how to stand up, even though that might take more than 100 time steps? Is that correct? Yeah. So yeah, you're right. And in fact, I would say that these methods aren't guaranteed to work well if you have more than 100 time steps.

So sometimes they work anyway. Often they work anyway, but there's no guarantee. So I think there's actually something pretty fundamental missing in how to deal with really long time scales. And people have recently been thinking about hierarchical reinforcement learning, where you have different levels of detail of the system, where you might have one level of description where you have a short time, a small time step, and then you have successively larger time steps.

And that allows you to plan over much longer horizons. So that's something that's currently an active area of research. But yeah, I would say that none of these methods are guaranteed to do anything reasonable if you have more than 1 over 1 minus gamma time steps between action and reward.

So if you introduce some anomaly in the environment, can it still handle it? Oh yeah, so in this kind of task, if you introduce terrain or something, could it-- I think if you didn't train it to deal with terrain, then it might fail. It probably would fail. Actually, I don't think it would fail, because the funny thing about-- these policies are actually really robust, because you train them with a stochastic policy.

So there's a lot of noise being generated by the policy itself. So in practice, it's able to deal with huge noise introduced by the policy. And as a result, I found that if you change the dynamics parameters a little, it can usually still work. But yeah, there's no guarantee that it'll do anything if you give it something you didn't train it for.

I think that you probably could do the same kind of training with terrain. I didn't have any terrain, so I didn't try it. But that would be nice to try. OK, I'm going to move on to the next part of the talk. Feel free, if you have more questions, to find me afterwards.

OK, so now I'm going to talk about a different type of reinforcement learning algorithm. So the previous methods are distinguished by the fact that they explicitly represent a policy, which is the function that chooses your actions. And they try to optimize it with respect to the parameters of the policy.

So the nice thing about the policy gradient methods we just talked about is that you're optimizing the thing you care about. And you're optimizing it with gradient descent. So that makes it kind of easy to understand what's going on. Because if you're getting the proper gradient estimate, and you take small enough steps, then you should be improving.

I mean, of course, you still could get stuck in a local minimum. But at least-- or you get stuck in a bad local minimum. But at least it's a local minimum. And you can use our understanding of optimization to figure out what's going on. So these next class of methods are a little different, because they're not optimizing the policy directly.

They're learning something else called a Q function, which measures how good state-action pairs are. So it measures-- I'll say that more formally later. But it's just measuring how good the actions are. And these methods are actually-- these are able to exactly solve MDPs efficiently in the setting where you have a finite number of states and actions.

So these are the preferred methods for exactly solving them in those settings. But you can also apply them with continuous states and actions, and using expressive function approximators like neural networks. But it's a little harder to understand what's going on in these methods, like when they're going to work and when they're not going to work.

So I'll define the relevant quantities here. So the Q function is defined as the expected sum of rewards when we condition on the first state and the first action. So we're conditioning on S0 equals S, A0 equals A. And the Q function is the expected discounted sum of rewards when we're acting under the policy pi.

So by convention, I'm starting out with time step 0. I could have also said that we're taking r_t + r_(t+1) + r_(t+2) and so on. But since we're assuming the system is stationary, it should be exactly the same. So just by convention, I'm going to say that the first-- I'm going to always use time 0, 1, 2, 3, and so on, just for ease of notation.

So the Q function is just telling you how good this state action pair is under your current policy. The value function-- well, the state value function, usually called V, is just conditioning on the state. It's telling you how good that state is, what's the expected reward of that state.

And lastly, the advantage function is the difference between the Q function and the state value function, meaning how much better is that action than what the policy would have done. We're not going to talk about advantage functions in this section, but it was actually-- this corresponds to the notion of an advantage estimator we briefly mentioned in the previous section.
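For reference, the three quantities just defined, written with the discounted-return convention the talk uses (this is my rendering of them, not the exact slide):

    Q^{\pi}(s, a) = \mathbb{E}\big[\, r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \,\big|\, s_0 = s,\ a_0 = a \,\big]
    V^{\pi}(s)    = \mathbb{E}\big[\, r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \,\big|\, s_0 = s \,\big]
    A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)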

So here, we're going to consider methods that explicitly store and update the Q function instead of the policy, and updates them using what are called Bellman equations. So the Bellman equation-- so a Bellman equation, in general, is a consistency equation that should be satisfied by a value function. So here, I'm writing down the Bellman equation for Q pi.

And what it's saying is that the expected sum of rewards should be the first reward plus this expected sum of rewards after the first time step. So it's saying something pretty simple. So R0 is the first reward. V pi of S1 is just adding up all the rewards after time step 0.

So in the second equation, we write out this relationship just involving the Q function. So we have a consistency equation that the Q function should satisfy. We can slightly generalize this to use k time steps instead of just one time step. So we can expand out the expectation-- the expected sum of rewards to write out k rewards explicitly and then cap it off with the value function at the very end, which accounts for all the rewards after that.
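In symbols, and including the discount factor gamma from the definition of Q pi above, the one-step and k-step versions of this Bellman equation can be written as follows (again my transcription of the idea, not the exact slide):

    Q^{\pi}(s, a) = \mathbb{E}\big[\, r_0 + \gamma V^{\pi}(s_1) \,\big|\, s_0 = s,\ a_0 = a \,\big]
                  = \mathbb{E}\big[\, r_0 + \gamma\, \mathbb{E}_{a_1 \sim \pi}\!\left[ Q^{\pi}(s_1, a_1) \right] \,\big|\, s_0 = s,\ a_0 = a \,\big]

    Q^{\pi}(s, a) = \mathbb{E}\big[\, r_0 + \gamma r_1 + \cdots + \gamma^{k-1} r_{k-1} + \gamma^{k} V^{\pi}(s_k) \,\big|\, s_0 = s,\ a_0 = a \,\big]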

OK, so here's the Bellman equation from the previous slide. So now I'm going to introduce a very important concept called the Bellman backup. So we have this equation that the value function Q pi should satisfy. But let's assume we don't know Q pi. So let's say we have some other Q function.

So we define this Bellman backup operator that operates on an arbitrary Q function. So it maps a Q function to a new Q function. And it's defined by just taking the right-hand side of the Bellman equation and plugging in our new Q function Q instead of the Q pi.

So Q pi is a fixed point of this operator, meaning if we apply the backup operator, we get the same thing back. And very nicely, if we keep applying this backup operator repeatedly to any old arbitrary initial Q function Q, the series will converge to Q pi, which is the fixed point of the operator.

So that's one way you can use an iterative algorithm to estimate Q pi: take any old initial Q function and repeatedly apply this backup operator. So now there's another kind of Q function that we're going to introduce called Q star.

So the previous Q function Q pi was-- this is the-- telling you the value function under the current-- under some policy pi. So it only makes sense with regard to some particular fixed policy pi. Q star is going to be-- is going to involve the optimal policy instead. So Q star is just defined as the Q function of the optimal policy.

So here we have pi star, the optimal policy, and Q star is just the Q function of the optimal policy. And it also happens to be the pointwise maximum over all policies of the Q function at each state action pair. So the optimal policy is deterministic, and it should satisfy this equation that it takes the argmax of the optimal Q function.

So recall that the Q function tells you your expected return if you take the given action. So obviously the optimal policy should take the action that has the best expected return. So that's why this last equation is evident. So now that we know this property of the optimal policy, we can rewrite the Bellman equation.

So on the-- that first equation is-- that's just the Bellman equation from the previous slides for a given policy pi. Now we can take that expectation over actions and replace it by what the optimal policy is going to do, which is just going to take-- it's going to take the argmax of the optimal Q function.

There's a typo on my slide. That should say Q star inside-- on the right-hand side. So now we have a Bellman equation that the optimal policy should satisfy. Now we can do the same thing with the backup operator. So we take that Bellman equation and we plug in an arbitrary Q function on the right-hand side instead of the optimal Q function Q star.

So Q star is a fixed point of this Bellman operator. That's just a restatement of the Bellman equation. And again, if we apply this Bellman operator repeatedly to an arbitrary initial Q function, it converges to Q star, which is the optimal Q function. This is the Banach fixed point theorem in both cases.

It can be used to prove it. OK. So based on these ideas, there are two classic algorithms for exactly solving MDPs. These are sometimes called dynamic programming algorithms because they're actually quite related to the kind of dynamic programming algorithms that are used to solve search problems. So one is called value iteration.

And you just initialize your Q function arbitrarily. And you repeatedly do Bellman backups until it converges. The second one is called policy iteration. You initialize your policy arbitrarily. Then each step, you first compute either exactly or approximately the Q function of that policy. And then you update your policy to be the greedy policy for the Q function you just computed.

So that means that your new policy just takes the argmax of the Q function. So it takes the action that's best according to that Q function. So I didn't say anything about how you compute Q pi. So one way to do it is to compute it exactly, because it happens that the Bellman equation for Q pi is a linear system of equations.

So often you can just solve them exactly. More commonly, well, if you have a large scale problem, you might not be able to solve this system. So what people often do is they do a finite number of Bellman backups, which doesn't exactly converge on Q pi, but it gives you something that's approximately Q pi.
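As a sketch, here is tabular value iteration for a finite MDP stored in the P[(s, a)] -> list of (probability, next_state, reward) form used in the earlier MDP sketch. Policy iteration would alternate evaluating Q pi (exactly, or with a few backups) with the same greedification step shown at the end.

    import numpy as np

    def value_iteration(P, n_states, n_actions, gamma=0.99, n_iters=1000):
        Q = np.zeros((n_states, n_actions))
        for _ in range(n_iters):
            new_Q = np.zeros_like(Q)
            for s in range(n_states):
                for a in range(n_actions):
                    # Bellman backup for Q star: E[ r + gamma * max_a' Q(s', a') ]
                    new_Q[s, a] = sum(p * (r + gamma * Q[s2].max())
                                      for p, s2, r in P[(s, a)])
            Q = new_Q
        greedy_policy = Q.argmax(axis=1)   # pi(s) = argmax_a Q(s, a)
        return Q, greedy_policy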

Okay, so that's, I just told you algorithms that you can implement if you have full access to the MDP. Like you know the whole table of probabilities. But in reinforcement learning, usually the assumption is that you don't know any of these probability distributions. You don't know their reward function.

You don't know the transition probabilities. So all of these things have to be estimated from data or you're only able to access the system through interaction. So now it turns out that these algorithms can be also implemented if you only access the system through interactions, which is kind of remarkable, I think.

So the way it works is, so let's recall our backup formulas for Q pi and Q star. So, we can, in both cases, we have this certain quantity inside an expectation. In both cases, we can compute an unbiased estimator of that quantity inside the expectation, just using a single sample.

Meaning if we sampled some data from our system using any old policy, then we can get an unbiased estimator of the quantity on the right-hand side of those expectations. I mean the quantity inside of the right-hand expectations. So basically we can do an approximate version of this Bellman backup, which is unbiased.

And even with this noise, so we're doing a noisy version of the Bellman backup. Even with this noise, it can be proven that if you choose your step sizes appropriately with the right schedule, you're still gonna converge to Q pi or Q star, depending on which algorithm you're implementing.
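A sketch of what those sampled, noisy backups look like in the tabular case: given one observed transition, move the stored Q value a step of size alpha toward the backed-up target. With the max over next actions this is the classic Q-learning update (the B backup); with the next action actually taken by the policy it is the SARSA-style update (the B pi backup). Q is assumed to be a NumPy array of shape (n_states, n_actions), and the step size and discount are placeholders.

    import numpy as np

    # Q = np.zeros((n_states, n_actions))

    def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
        target = r + (0.0 if done else gamma * Q[s_next].max())    # B backup
        Q[s, a] += alpha * (target - Q[s, a])

    def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.99):
        target = r + (0.0 if done else gamma * Q[s_next, a_next])  # B^pi backup
        Q[s, a] += alpha * (target - Q[s, a])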

Okay, so now, well I'll say at this point that this is pretty much the fundamental idea. And now you can come up with algorithms that can be applied in the reinforcement learning setting where you're just accessing the system through sampling. And you can also start introducing function approximation here.

So I haven't said anything about what the Q function is. I've just told you it's a function of state and action. But now we can start having neural network Q functions, for example. So we can parameterize the Q function with the neural network. Let's call it Q theta. And now, instead of doing the Bellman backup, I mean it doesn't make sense to do the Bellman backup exactly because we're not just setting the values of the neural network output.

The best we can do is try to encourage the neural network to have some output values. So what we do instead is we set up a least squares problem. So we write down this quadratic objective that says that the Q function should be approximately equal to the backed up value.

And then we just minimize it with SGD. So one version of this algorithm, which was introduced about 10 years ago, is called neural fitted Q iteration. It works exactly the way you'd expect. You sample trajectories using your current policy, which might be determined by the Q function or it could be any old policy as it turns out.

And then you solve the least squares problem where you try to minimize this quadratic error, which is based on the Bellman backup, the backup for Q star. So one thing I haven't mentioned so far is what do you actually use as your policy.

So I said sample trajectory using your policy. So if you have a Q function, you can turn it into a policy by just taking the action that has the highest Q value. That's what you typically do. So the Q function measures the goodness of all your actions. So you can easily turn that into a policy by taking your best action or by taking actions where the log probability is proportional to the goodness or something like that.

Typically you might make the probability of each action proportional to the exponential of its Q value divided by some temperature parameter. That's called Boltzmann exploration. Whereas if you just take the argmax, that's called the greedy policy. It turns out that with these kinds of Q-learning algorithms, you don't have to execute the greedy policy for learning to work.

You actually have some freedom in what policy you execute, which is one very nice property of these algorithms: you can use an exploration technique where your policy is actively trying to reach new states or do something new, and still converge, still move towards Q star or Q pi, as the case may be.
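As a small illustration, hypothetical helper functions for the greedy and Boltzmann policies derived from a vector of Q values might look like this (the function names and the default temperature are assumptions, not from the talk).

    import numpy as np

    def greedy_action(q_values):
        # greedy policy: take the action with the highest Q value
        return int(np.argmax(q_values))

    def epsilon_greedy_action(q_values, epsilon=0.1):
        # with probability epsilon act at random, otherwise act greedily
        if np.random.rand() < epsilon:
            return np.random.randint(len(q_values))
        return int(np.argmax(q_values))

    def boltzmann_action(q_values, temperature=1.0):
        # Boltzmann exploration: P(a) proportional to exp(Q(s, a) / temperature)
        logits = np.asarray(q_values, dtype=float) / temperature
        logits -= logits.max()                      # for numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        return int(np.random.choice(len(q_values), p=probs))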

So neural fitted Q iteration is sort of a basic way of doing this. A more recent algorithm that's gotten a lot of attention is the DQN algorithm from Mnih et al. at DeepMind, which is basically an online version of this algorithm with a couple of useful tweaks.

When you look at the two tricks, they make a lot of sense if you just think about what value iteration is doing. One technique is the replay pool, which is a rolling history of your past data, and that's the data you use to fit your Q function.

That makes sure you have a representative sample of data to fit your Q function to. The second idea is to use a target network. Instead of using your current Q function and just doing Bellman backups on that, you have a lagged version of your Q function.

So you have this target network, which is a copy of your Q function from some earlier time, and you use that in the backups. If you think about value iteration, you have your old Q function and you're trying to make the new one equal to the backed-up version of the old one.

So using the target network is just the natural thing to do if you're trying to implement value iteration in an online way. There have been many extensions proposed since then; I've got a bunch of citations at the bottom of the slide. This algorithm, the DQN algorithm, uses the backup B, which is the backup for Q star.
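As a sketch of how these two tricks fit together (an assumed structure, not the original DQN code), a replay pool plus a single gradient step against a lagged target network might look roughly like this in PyTorch; the class and function names are hypothetical, and transitions are assumed to be stored as tensors with integer actions and float done flags.

    import copy
    import random
    from collections import deque

    import torch

    class ReplayPool:
        """Rolling history of past transitions (the replay pool)."""
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)

        def add(self, transition):            # (obs, act, rew, next_obs, done)
            self.buffer.append(transition)

        def sample(self, batch_size):
            batch = random.sample(self.buffer, batch_size)
            return [torch.stack(field) for field in zip(*batch)]

    def dqn_update(q_net, target_net, pool, opt, batch_size=32, gamma=0.99):
        """One gradient step on the Q* backup, using the lagged target network."""
        obs, act, rew, next_obs, done = pool.sample(batch_size)
        with torch.no_grad():
            # backed-up targets come from the lagged copy of the Q network
            target = rew + gamma * (1 - done) * target_net(next_obs).max(dim=1).values
        q_sa = q_net(obs).gather(1, act.long().unsqueeze(1)).squeeze(1)
        loss = ((q_sa - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    # every so many environment steps, refresh the lagged copy:
    #   target_net = copy.deepcopy(q_net)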

Remember that I also introduced this other backup, B pi, which is the backup for Q pi. There's another, very classic algorithm called SARSA, which is essentially an online way of doing the B pi backup; it's sort of an online version of policy iteration. It's actually been found to work as well as, or better than, the B backup in some settings, though not all.
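To make the contrast concrete, here is a minimal tabular sketch (an illustration, not from the talk) of the two sampled backups side by side: SARSA bootstraps with the next action the behavior policy actually took, while Q-learning bootstraps with the max over next actions.

    import numpy as np

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
        # B pi backup: bootstrap with the action actually taken in s_next (on-policy)
        target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])

    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        # B backup: bootstrap with the best next action (moves toward Q star)
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])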

So I think the jury's still out on exactly how these things compare. But I think it's worth considering both policy iteration and value iteration and all the different online versions of these algorithms and taking them seriously. Because it's not clear right now exactly how they all compare to each other in the function approximation setting.

So that's the overview of all the technical parts, and now I just have a couple of conclusion slides. Let me summarize the current state of affairs. I introduced two kinds of algorithms: policy gradient algorithms, which explicitly represent a policy and optimize it, and Q-function learning algorithms, which explicitly represent a Q function measuring the goodness of different actions and use it to implicitly represent a policy.

For policy gradient methods, there have been successes with different variants. You have the vanilla policy gradient methods; there's a recent paper on the A3C method, which is an asynchronous implementation that gets very good results. There's also another class of methods, the natural policy gradient and trust region methods.

The result I showed you was obtained using trust region policy optimization, which is in that second category. I think these trust region methods and natural policy gradient methods are more sample efficient than the vanilla methods, because you're doing more than one gradient update with each little bit of data you collect.

With the vanilla policy gradient, you compute one gradient estimate from the data and then throw it away. With the natural policy gradient, you solve a little optimization problem with it, so you get more juice out of it. That's what we have in the policy gradient world. In the Q-function world, we have the DQN algorithm and some of its relatives.

These are descendants of value iteration, where you're approximating the Bellman backup for Q star with samples and function approximation. SARSA, on the other hand, is related to policy iteration. They're dealing with different Bellman equations, so it's kind of interesting that both kinds of methods work and that they have fairly similar behaviors, as it turns out.

Here's how I would compare them. This is anecdotal evidence, but I think it's the consensus right now: the Q-function methods are more sample efficient when they work, but they don't work as generally as policy gradient methods, and it's a little harder to figure out what's going on when they don't work.

That kind of makes sense, because in the policy gradient methods you're optimizing exactly the thing you care about with gradient descent, whereas with Q-function methods you're doing something indirect: you're trying to learn a Q function and hoping that it gives you a good policy. I would also point out that there are some confounds, so it's hard to draw a firm conclusion at this point, because people use different time horizons in the policy gradient methods versus the Q-function methods.

They do one-step lookaheads with the Q-function methods and multi-step lookaheads with the policy gradient methods. So it's not clear whether the differences come from using different time horizons or from how the algorithms work, that is, doing regression for a Q function versus learning a policy with policy gradients.

Just to summarize, here are some of our core model-free reinforcement learning algorithms. Oh, whoops, I'm missing a word in the first column, which I think should say reliability and robustness. That just means: is it going to work on new problems without parameter tuning, or is it going to mysteriously either work or not work?

This would be my slightly sloppy summary of all these different algorithms. I would say there's still some room for improvement in the basic methods, because there are some nice properties of the Q-function methods that we don't have in the policy gradient methods. For instance, you can easily do off-policy learning, exploring with a different policy than the one you're learning the Q function for, and that's really important.

You can't do that very easily with policy gradient methods. On the other hand, the policy gradient methods are easier to use: you can just apply them, they're more likely to work, and it's well understood what's going on. I don't know if it's possible to get the best of both worlds, but that's the hope.

And that's it for my talk. Thank you. Any questions? In model-based reinforcement learning, what lines of research do you find most interesting? Oh yeah, so in model-based reinforcement learning, what lines of research do I find most interesting? I think the work from my colleagues on guided policy search is very nice.

I would say that's a kind of model-based reinforcement learning. I also like some methods that use the model for faster learning, like for variance reduction. There's a paper called Stochastic Value Gradients that I like a lot. I think it's a pretty wide-open area.

So I don't think there have been a lot of really compelling results where you're able to learn extremely fast, like you're able to learn with much better sample efficiency using a model. So it seems like that should be possible, but I don't think it's been demonstrated yet. So maybe in the next couple of years we'll see that happen.

Hello. Hi. Thanks for the talk. I have a question: is it true that most of these problems require some kind of simulated world to run the episodes in? Oh yeah, so are you asking whether this works in the real world? Is that the question?

Yeah. Yeah, I would say it does work if you have a lot of patience and you're willing to execute this thing for a while. So the locomotion results I showed add up to about two weeks of real time. So it's actually not that bad, especially when you consider that toddlers take a while to learn how to walk properly, even though evolution already puts in a lot of built-in information.

So I'd say maybe, yeah, I'd say it can be run in the real world. Some of my colleagues in Berkeley are doing some experiments where they are running just regular reinforcement learning algorithms in the real world. Very brave. But I hope to see some nice results from that soon.

Thank you. Hi. Thanks for your talk. Here, on the other side. Here. Yeah. I was wondering what your intuition is on the loss surface of these deep reinforcement learning optimization problems, and especially how it evolves as the policy learns. And I should specify, in the policy gradient case.

I think the situation is a little bit different in reinforcement learning than in supervised learning. In reinforcement learning, you have a kind of local minimum in policy space. For example, and I keep going back to the locomotion example because I spent a lot of time on it.

But let's say you want your robot to walk. There's one local minimum where it just stands and it doesn't bother to walk, because there's too much penalty for falling over. And there's another local minimum where it just dives forward, because it gets a little bit of reward for that before it falls to its doom.

I think that's actually the hard part about the optimization problem: it comes from the space of behaviors and has little to do with the neural network. I've also found that it matters surprisingly little what kind of neural network architecture you use, because most of the hardness and weirdness of the problem comes from what the behavior space looks like, rather than what the numerical optimization landscape looks like.

Cool. Thank you. There are many problems where the reward is only observed at the end of the task, in the terminal state of each episode, and you don't see rewards in intermediate states. How much harder do these problems become for deep reinforcement learning, in your experience?

Thanks. Yeah, if you don't get the reward until the end, then it might be harder to learn. I don't have anything precise to say about that, but I think it's going to be harder if your rewards are further away.

For example, in your video, the last example of getting up and getting the head above a certain height, that could be one where you only get a plus one if you're above and nothing below. Oh, right. But the way we set it up, if you get your head higher, you still get some partial reward.

Yeah, I think we came up with a reward like the squared distance from the target height, which made the problem easier. The problem would have been a lot harder if you got zero reward until you got your head above the height. That would be a problem of exploration: you have to explore all the different states to figure out where you're going to get good reward.

Thanks. One last question. OK, one last question. I have a question about how you choose to discretize time, because your locomotion example is clearly a continuous system. Right. Yeah, it's actually really important how you discretize time, what time step you use, because the algorithm does care about what the time step is.

Because you have discount factors, and you're also sampling a different action at every time step. If you choose too small a time step, then the rewards will be delayed by more time steps, which makes credit assignment harder. And also, your exploration will be more like a random walk, because you're changing your mind really frequently.

So yeah, the time step is pretty important, and I'd say that's a flaw in current methods. OK, thank you. Let's thank John again. Thank you. We'll take a short break and reconvene in 15 minutes.