Deep Reinforcement Learning (John Schulman, OpenAI)
Chapters
0:00
0:48 What is Reinforcement Learning?
3:04 Business Operations
6:34 How Does RL Relate to Other ML Problems?
10:07 How Does RL Relate to Other Machine Learning Problems?
14:18 Recent Success Stories in Deep RL
20:57 Episodic Setting
21:56 Parameterized Policies
23:45 Policy Gradient Methods: Overview
28:05 Score Function Gradient Estimator: Intuition
31:33 Score Function Gradient Estimator for Policies (now the random variable is a whole trajectory)
35:01 Policy Gradient: Use Temporal Structure
37:01 Policy Gradient: Introduce Baseline
39:04 Discounts for Variance Reduction
44:34 Extension: Step Sizes and Trust Regions
57:52 Value Functions
00:00:06.080 |
So I'm going to talk about some of the core methods in deep reinforcement learning. 00:00:16.120 |
First I'll do a brief introduction to what deep RL is and whether it might make sense for your problem, and then I'll go through two families of methods. 00:00:28.940 |
So on the one hand we have the policy gradient methods. 00:00:33.240 |
Then on the other hand we have methods that learn a Q function, including Q-learning and fitted Q iteration. 00:00:40.960 |
And I'll talk a little at the end about what are the pros and cons of these different methods. 00:00:51.300 |
So what is reinforcement learning? It's a branch of machine learning concerned with taking sequences of actions. 00:00:55.740 |
So often it's described in terms of an agent interacting with a previously unknown environment. 00:01:04.340 |
And it's trying to maximize some kind of cumulative reward, defined by some reward function. 00:01:12.340 |
And pretty much any kind of task where you have some kind of goal that you want to achieve can be framed as a reinforcement learning problem. 00:01:27.700 |
So what is deep reinforcement learning? It's just reinforcement learning where you're using neural networks as function approximators. 00:01:34.820 |
So the interesting thing about reinforcement learning in contrast to supervised learning 00:01:39.780 |
is it's actually not totally obvious what you should use your neural network to approximate 00:01:46.140 |
And there are different kinds of algorithms that approximate different things. 00:01:49.300 |
So one choice is to use the neural network to approximate your policy, which is how the agent chooses its actions. 00:01:57.020 |
Another choice is to approximate the value functions, which measure how good or bad different states and actions are. 00:02:05.300 |
And last, you can try to learn a model of the system, a dynamics model, which will make predictions about what happens next. 00:02:16.060 |
Okay, so I'll now give a few examples of different places where you might apply reinforcement 00:02:23.900 |
learning and what the observations and actions would be. 00:02:32.020 |
So here you could imagine a robot where the observations are the camera images and the joint angles. 00:02:39.180 |
The actions are the joint torques you're applying. 00:02:42.580 |
And the reward is going to depend on what you want the robot to do. 00:02:48.500 |
So this is something we, as the algorithm designer, get to define. 00:02:53.220 |
So the rewards could be to stay balanced, to navigate to some target location, or something along those lines. 00:03:04.060 |
So reinforcement learning has also been used in a lot of more practical applications. 00:03:10.980 |
Well, applications that have been practical in the past. 00:03:14.460 |
I think robotics will be very practical in the future. 00:03:18.860 |
And for example, one area is inventory management. 00:03:25.460 |
So this is just one example of how you could use reinforcement learning for a decision-making problem. 00:03:30.740 |
So you have to decide how much to stock up on of every item. 00:03:37.020 |
And your observations would be your current inventory levels. 00:03:40.780 |
Actions would be how much of each item you're going to purchase. 00:03:47.780 |
So people in operations research, this is a subfield, study this kind of problem a lot. 00:03:58.580 |
There are also a lot of machine learning problems where people have started to apply reinforcement learning, for example attention. 00:04:10.020 |
So the idea in attention is you don't want to look at the whole input at once. 00:04:15.620 |
So one example of this is with a large image, you might want to just crop out part of it 00:04:21.500 |
and use that and just do detection on that part of the image. 00:04:25.800 |
So here, your observation would be your current image window. 00:04:30.500 |
Action is where to look or where to crop your image. 00:04:33.780 |
And reward is whether you make a classification error or not. 00:04:39.900 |
So here, you have to try to choose the right area of the image to look at so you'll do the classification correctly. 00:04:51.700 |
Reinforcement learning has also been used in structured prediction problems, which in 00:04:58.700 |
the past often weren't considered to be reinforcement learning problems. 00:05:03.340 |
But it turns out that to properly solve them, it actually is a reinforcement learning problem. 00:05:10.980 |
So machine translation, for example, so you get a sentence in the source language and 00:05:18.660 |
you have to emit a sentence in the target language. 00:05:22.180 |
And here, your observations are the sentence in the source language. 00:05:27.540 |
You're emitting one word at a time in the target language. 00:05:30.620 |
And you have some reward function that looks at the whole sentence and tells you how good the translation is. 00:05:37.020 |
So since this is non-differentiable, you can't just differentiate through the whole thing with backpropagation. 00:05:44.500 |
So it turns out you can use a policy gradient method to optimize your translation system. 00:05:56.060 |
So those are just a few examples, not exhaustive at all. 00:06:02.180 |
But I just want to say a little bit about how reinforcement learning fits into the picture 00:06:11.980 |
of all the other types of machine learning problems. 00:06:15.860 |
So previous courses in this series have talked about supervised learning and unsupervised 00:06:24.660 |
So how does reinforcement learning relate to them? 00:06:29.260 |
So let's just first compare it to-- let's look at supervised learning. 00:06:33.900 |
So in supervised learning, first, the environment samples an input-output pair from some distribution 00:06:42.540 |
The agent makes a prediction, y hat, using its function f. 00:06:48.980 |
And it receives some loss, which tells it if it made the right prediction or the wrong 00:06:55.300 |
So the interpretation is environment asks the agent a question and then tells her the right answer. 00:07:03.180 |
So contextual bandits make this problem a little harder in that they give the learner less information. 00:07:11.700 |
So now the environment samples an input, but notice that there's not a correct output associated with it. 00:07:19.180 |
Then the agent takes an action, and the agent receives some cost, which is from some probability 00:07:30.940 |
We're sampling it from some probability distribution. 00:07:34.180 |
And the agent doesn't know what this probability distribution is, so that's what makes the 00:07:40.700 |
So environment asks the agent a question, the agent answers, and the environment gives her a noisy score for her answer. 00:07:54.320 |
So personalized recommendations is one big application of this setting, along with advertising. 00:07:58.220 |
So you have a customer, 00:08:05.460 |
and you know what they liked in the past, so you have to make a prediction about what they'll like next. 00:08:10.620 |
So you show them appropriate ads or links, like what book you want to try to advertise 00:08:17.580 |
to them, or what video you want to show them, and so on. 00:08:22.920 |
So here, the big difference between this and the supervised learning setting is you don't 00:08:26.500 |
have access to the function, the loss function you're trying to optimize. 00:08:30.320 |
So in particular, you can't differentiate through it. 00:08:33.100 |
We don't know the process that generates c, so we can't compute the gradient of the loss 00:08:37.740 |
function and use that to tune the agent's parameters. 00:08:41.580 |
So that makes the problem a bit harder, where you have to use a different kind of method. 00:08:48.860 |
Lastly, reinforcement learning is almost the same as the contextual bandit setting, except now the world is stateful. 00:08:57.940 |
So now, instead of sampling the input from scratch every time step from the same distribution, the state evolves over time. 00:09:09.020 |
So you have some transition probability distribution called p here, where the state x sub t is 00:09:16.100 |
conditioned on the previous state and the previous action. 00:09:22.500 |
And that makes the problem quite a bit harder, because now -- well, for a number of reasons. 00:09:27.920 |
For one thing, the inputs you're getting depend on what actions you're taking. 00:09:32.120 |
So now that makes it harder to develop a stable, reliable algorithm, because now as the agent changes its policy, the distribution of inputs it sees changes. 00:09:40.280 |
So that can lead to all sorts of out-of-control behavior. 00:09:46.380 |
And it also means you have delayed effects, because since the system is stateful, you 00:09:52.280 |
might need to take a lot of actions to get into the right state. 00:09:55.600 |
So you might need to -- you can't just act greedily every time step. 00:10:06.180 |
Okay, so just to summarize these differences, there are two differences. 00:10:11.160 |
The first one is, you don't have full analytic access to the function you're trying to optimize. 00:10:18.700 |
Second, you're interacting with a stateful world, which means that the input you get depends on your previous actions. 00:10:26.620 |
And if you just take the first of those differences between supervised learning and reinforcement 00:10:31.620 |
learning, you get the contextual bandit setting. 00:10:36.740 |
Okay, so I realize that this audience probably has people with different goals. 00:10:44.820 |
Some people are doing research and want to know about what's the latest in the research 00:10:50.540 |
And some people are -- want to apply these machine learning techniques to practical applications. 00:10:56.420 |
So this slide is for the latter group of people. 00:11:00.580 |
So suppose you have some problem where you think reinforcement learning 00:11:05.260 |
might be relevant and you're wondering if you should apply reinforcement learning. 00:11:10.020 |
So first, I should say that the answer might be no. 00:11:14.880 |
It might be overkill, especially deep reinforcement learning. 00:11:18.360 |
So this is a set of fairly new techniques where it's not going to work out of the box on your problem. 00:11:24.360 |
And it's -- these techniques aren't that well established. 00:11:27.920 |
So they have a lot of knobs that need to be tuned. 00:11:30.620 |
So it might be overkill and, yeah, these techniques aren't that well established at the moment. 00:11:37.180 |
So it might be worth investigating some other methods first. 00:11:42.180 |
So if your problem has a small number of parameters you're trying to optimize over 00:11:48.800 |
and you have a simulator that you can, like, just do lots of experiments on, then derivative-free 00:11:56.940 |
optimization methods are likely to be better than reinforcement learning, or at least likely to be a better thing to try first. 00:12:04.240 |
So with these methods, you give them a black box function where 00:12:09.480 |
you put in a parameter vector and it'll give you a noisy estimate of the score. 00:12:14.540 |
And these algorithms will just optimize over the parameters of that black-box function. 00:12:22.180 |
So yeah, there's a variety of different methods for derivative-free optimization, but these 00:12:27.980 |
are easier to understand than reinforcement learning, and they do kind of work out of the box. 00:12:35.660 |
A lot of problems can actually be seen as contextual bandit problems, and 00:12:41.580 |
the statefulness of the world isn't that relevant. 00:12:44.420 |
So for example, in advertising, this is where people -- people look at advertising as a 00:12:50.260 |
contextual bandit problem most of the time, because you decide what ad to present the 00:12:54.420 |
user with, and then they either click on it or they don't. 00:13:00.220 |
But the user really is kind of stateful, because if you show them a terrible ad, they might not come back. 00:13:07.460 |
So there is -- like, your actions do have some repercussions. 00:13:12.760 |
But often you can just approximate it as being a contextual bandit problem where there is no state. 00:13:18.640 |
So there's a better theoretical understanding of contextual bandit problems, and methods 00:13:26.420 |
that work well for them. So if your problem really is a contextual bandit problem, you might want to use those methods. 00:13:34.100 |
And lastly, the operations research field has been using these methods for a while on 00:13:43.000 |
real problems, and they have a set of methods which are just pretty much the basic algorithms, 00:13:51.480 |
policy iteration and value iteration, together with well-developed 00:13:57.320 |
ways of doing feature engineering for these problems that end up working pretty decently. 00:14:01.560 |
So these techniques are also worth considering instead of trying to throw a big neural network at the problem. 00:14:11.300 |
So now that I've talked about why not to use deep reinforcement 00:14:15.200 |
learning or what it's not good for, I'll just talk about some recent success stories in 00:14:21.200 |
deep reinforcement learning, which are achievements that probably wouldn't have been possible without deep neural networks. 00:14:27.920 |
So a few years ago, there was a pretty influential result by Mnih et al. from DeepMind where they 00:14:37.640 |
used a deep Q-learning algorithm to play Atari games using the screen images as input. 00:14:47.800 |
And that's hard because you're trying to do different 00:14:51.920 |
things in all these games, and some of them are kind of complicated. 00:14:54.620 |
So it's pretty remarkable that a simple algorithm can learn to play all of these games. 00:15:04.260 |
So since then, people have also solved this domain using policy gradients. 00:15:15.640 |
So another big groundbreaking result was beating a champion-level player at Go, also by DeepMind, 00:15:26.500 |
using a combination of supervised learning from, like, from expert games, plus policy 00:15:33.540 |
gradients to fine-tune the supervised learning policy, plus Monte Carlo tree search, plus 00:15:41.220 |
value functions to make the search work better. 00:15:44.340 |
So a combination of techniques and reinforcement learning. 00:15:48.040 |
So some of my colleagues at Berkeley had some very nice results learning in real 00:15:56.080 |
time how to do manipulation tasks using an algorithm called guided policy search. 00:16:06.920 |
And some of my colleagues and I have been working on robotic locomotion using policy gradient methods. 00:16:17.080 |
And people have been working on locomotion for a while and have been able to achieve 00:16:21.720 |
pretty good results using very, like, highly engineered domain-specific methods. 00:16:27.320 |
But previously, there hadn't been much success using general methods to solve it. 00:16:34.640 |
And last, there have been some recent results playing 3D games using policy gradients. 00:16:40.800 |
In fact, there was even a contest I heard about a couple days ago with this new VizDoom environment. 00:16:54.160 |
So that's it for the high-level overview part of this. 00:16:58.680 |
Now I'm going to start getting into the actual formalism and the technical details. 00:17:04.640 |
Okay, so the basic object in the field of reinforcement learning is the Markov decision 00:17:15.120 |
So the Markov decision process is defined by the following components. 00:17:21.740 |
First, the state space: these are all the different states of the system. 00:17:25.040 |
The action space, these are all the actions the agent can take. 00:17:28.600 |
And you have this probability distribution, which determines the probability of the next state and reward given the current state and action. 00:17:37.120 |
So R is the reward, S prime is the next state, and S and A are the current state and action. 00:17:42.200 |
So it's a conditional probability distribution. 00:17:44.200 |
Sometimes people split this out into a separate reward function, but that's basically an equivalent 00:17:52.300 |
And sometimes there are some extra objects defined. 00:17:59.440 |
We'll consider an initial state distribution. 00:18:02.020 |
So this says which state the world starts out in. 00:18:07.180 |
And the typical optimization problem you want to solve given this MDP is to maximize expected cumulative reward. 00:18:14.380 |
So there are various ways of defining that more precisely, which I'll go into later. 00:18:24.260 |
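To make the MDP objects above concrete, here is a minimal sketch (not from the lecture) of a tabular MDP with made-up sizes, splitting the transition distribution into a transition table P and an expected-reward table R, which the speaker notes is an equivalent formulation:

```python
import numpy as np

# Hypothetical sizes for illustration: nS states, nA actions.
nS, nA = 4, 2
rng = np.random.default_rng(0)

P = rng.random((nS, nA, nS))
P /= P.sum(axis=-1, keepdims=True)   # P[s, a, :] is a distribution over next states
R = rng.random((nS, nA))             # expected reward for each (state, action) pair
mu = np.ones(nS) / nS                # initial state distribution
gamma = 0.95                         # discount factor (used later)
```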
So there are various different settings of reinforcement learning where you define the objective a bit differently. 00:18:31.660 |
The one we'll be most concerned with is called the episodic setting. 00:18:35.600 |
So here the agent's experience is split up into a series of episodes which have finite length. 00:18:44.200 |
So in each episode, we first sample the initial state of the world from some probability distribution 00:18:52.060 |
And then the agent keeps on acting until the world ends up in some terminal state. 00:19:00.880 |
So just to give an example of what terminal states might be like and how an episodic reinforcement learning problem might be set up. 00:19:09.780 |
So one example is when termination is good and you want to terminate the episode as fast as possible. 00:19:16.820 |
So if we imagine setting up a task with some kind of taxi robot that should get to the 00:19:21.420 |
destination as fast as possible, then the episode would be like one trip and it's trying 00:19:28.360 |
to terminate the episode as fast as possible. 00:19:33.300 |
Another example is a waiter robot where you have a fixed-length shift, and the waiter 00:19:39.380 |
has to accumulate, it has to do as well as possible during that shift. 00:19:46.000 |
The waiter has to say maximize tips or customer happiness. 00:19:52.840 |
And then you could imagine another kind of task where termination is bad and you want the episode to keep going as long as possible. 00:20:03.740 |
For example, you could imagine having a walking robot where you want it to walk as far as possible, and falling over ends the episode. 00:20:15.940 |
And in this setting, it's pretty easy to define what the goal is. 00:20:21.620 |
We just want to maximize the expectation of the total reward per episode. 00:20:29.820 |
And the last object we're going to introduce here is a policy. 00:20:34.300 |
So the policy is just the function that the agent uses to choose its actions. 00:20:40.220 |
So we have deterministic policies, where the policy is denoted by pi. 00:20:45.380 |
So the action is just some function of the state. 00:20:48.780 |
And we also have stochastic policies where the policy is a conditional probability distribution. 00:20:57.180 |
So we're just going to make a little bit more precise the setting of the episodic MDP. 00:21:04.920 |
So first we sample the initial state from this distribution, mu. 00:21:11.820 |
Then we sample the first action, a0, from the policy. 00:21:16.780 |
Then we sample next state and reward from the transition probability distribution, and 00:21:21.460 |
so on, until we reach a terminal state, s sub t. 00:21:25.080 |
And then the quantity we care about is the sum of all these rewards, r0 + r1 + ... up until the end of the episode. 00:21:36.560 |
So eta of pi is just defined as the expected total reward of the policy pi. 00:21:46.100 |
Here's the picture that illustrates exactly the same thing. 00:21:56.340 |
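As a sketch of how eta(pi) can be estimated in practice, here is a simple Monte Carlo routine; the gym-style env.reset()/env.step() interface and the policy(s) callable are assumptions for illustration, not part of the lecture:

```python
import numpy as np

def estimate_eta(env, policy, n_episodes=100):
    """Monte Carlo estimate of eta(pi) = E[total reward per episode]."""
    returns = []
    for _ in range(n_episodes):
        s, done, total = env.reset(), False, 0.0
        while not done:
            s, r, done = env.step(policy(s))   # assumed step() signature
            total += r
        returns.append(total)
    return np.mean(returns)
```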
And lastly, in the policy gradient section in particular, we're going to be interested in parameterized policies. 00:22:03.420 |
So here we have a parameter vector, theta, which specifies exactly what the policy is. 00:22:11.020 |
So for example, the family of policies could be just a neural net. 00:22:15.760 |
You have a certain neural network architecture. 00:22:18.020 |
And theta specifies all the weights of this neural network. 00:22:24.100 |
So we could have a deterministic policy, of course, or a stochastic policy. 00:22:30.560 |
And if you're wondering concretely what a policy would look like, I mean, how do you use a neural network as a policy? 00:22:38.100 |
You do exactly the same thing you would do if this were a classification or regression problem. 00:22:44.500 |
So S here, the state here is your input, and the action is your output. 00:22:51.580 |
So if you have a discrete action space, a discrete set of actions, then you would use 00:22:58.220 |
a network that outputs a vector of probabilities, the probabilities of the different actions. 00:23:06.900 |
And if you have a continuous action space, you would have your neural network output 00:23:11.500 |
the mean and the diagonal of a covariance matrix of a Gaussian distribution. 00:23:17.220 |
So this is just like you're doing regression. 00:23:19.780 |
So you can use the same kind of architectures you'd use in supervised learning. 00:23:25.100 |
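Here is a sketch, using PyTorch, of the two kinds of policy heads just described; the architecture sizes are arbitrary and not from the talk:

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """Discrete actions: the network outputs a probability for each action."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))
    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

class GaussianPolicy(nn.Module):
    """Continuous actions: the network outputs a mean, and a learned log-std
    gives the diagonal of the covariance of a Gaussian."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))
    def forward(self, s):
        return torch.distributions.Normal(self.mean_net(s), self.log_std.exp())
```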
Okay, so that's it for the formalism of MDPs. 00:23:32.760 |
So now I'm going to go into policy gradient methods, which are one broad and general class 00:23:39.200 |
of reinforcement learning methods, which are quite effective. 00:23:45.260 |
So to give a brief overview of this, here's the intuition of what policy gradient methods do. 00:23:54.800 |
So here, capital R means the sum of rewards of the whole episode. 00:23:59.780 |
So our optimization problem is we want to maximize the expectation of the total reward 00:24:05.960 |
given our parameterized policy, pi sub theta. 00:24:10.680 |
And the intuition of how our algorithm is going to work is we're going to collect a bunch of trajectories. 00:24:16.880 |
I mean, this is just run a bunch of episodes using our policy. 00:24:21.160 |
And then we want to make the good trajectories more probable. 00:24:24.320 |
So I mean, some of the trajectories were lucky and they were really good. 00:24:28.120 |
Some of them, the agent was unlucky and they were bad. 00:24:30.980 |
And the ones that were good, meaning there was high reward, that means the agent probably did something right. 00:24:38.040 |
So we want to increase the probability of the actions from those trajectories. 00:24:44.260 |
So the most basic version of policy gradient methods just try to make the good trajectories 00:24:48.920 |
more probable without trying to figure out which were the good actions and which were the bad ones. 00:24:54.360 |
Slightly better or more elaborate methods try to figure out which actions were the good ones. 00:25:01.320 |
And then they try to make the good actions more probable. 00:25:05.800 |
And lastly, there's another class of methods which actually try to push the actions towards better actions. 00:25:13.160 |
So they differentiate the loss function with respect to the actions, and they try to push the actions in the direction that increases reward. 00:25:19.880 |
So we're mostly going to talk about one and two here. 00:25:34.300 |
So while we're maximizing over the policy, we're trying to find the best policy. 00:25:41.440 |
But here, the policy is assumed to be parameterized. 00:25:44.880 |
So there's some parameter vector theta that specifies the policy. 00:25:48.220 |
And now we just want to maximize with respect to theta. 00:25:59.000 |
So there's a very fundamental concept, which is called the score function gradient estimator. 00:26:10.860 |
So actually, to introduce this, we're not going to talk about policies in RL at all. 00:26:15.120 |
We're just going to assume we have some expectation. 00:26:18.320 |
We have expectation of f of x, where x is sampled from some parameterized probability distribution. 00:26:27.320 |
So we want to compute the gradient of this expectation with respect to theta. 00:26:32.340 |
So there's a general formula that'll do this. 00:26:35.440 |
And the way you derive it is you just write the expectation as an integral. 00:26:44.240 |
You swap the integral with the derivative and you turn it back into an expectation. 00:26:50.040 |
And what you get at the end is this bottom line, which says that you take the expectation 00:26:55.040 |
of function value times grad log probability. 00:27:00.360 |
So this is an unbiased estimator of the gradient, meaning if we get enough samples, it'll converge to the true gradient. 00:27:08.660 |
So the way you can compute this estimator, meaning the way you can get a noisy estimate 00:27:13.780 |
of the gradient of the expectation, is you just get one sample from this distribution. 00:27:22.900 |
And then you multiply f of x times grad log probability. 00:27:29.820 |
So the only requirement for being able to use this estimator is we need to be able to compute the log probability, log p(x given theta). 00:27:38.360 |
We need to be able to analytically compute it. 00:27:40.900 |
And we need to be able to differentiate it with respect to theta. 00:27:51.100 |
There's another way of deriving it using importance sampling. 00:27:54.940 |
So you write down the importance sampling estimator for the expectation. 00:27:58.380 |
And then you just swap the derivative with the expectation and you get the same thing. 00:28:05.740 |
So now let me just give a little bit of intuition about this estimator. 00:28:13.060 |
So f of x is measuring how good the sample x is. 00:28:17.500 |
So that means that-- so g hat here is our gradient estimator, meaning this is what we 00:28:22.700 |
get if we take one sample x sub i and we compute our estimator. 00:28:30.220 |
So if we move in direction g hat, that pushes up the log probability of our sample x sub 00:28:39.920 |
So if we got a really good function value, then we're going to try to push up its log probability a lot. 00:28:46.940 |
And if it was a bad function value, then we're not going to try to push it up very much. 00:28:56.300 |
The really nice thing is this is valid even if f of x is discontinuous or if f of x is 00:29:02.780 |
unknown, meaning you only-- you don't get to differentiate it, you just get to see the 00:29:09.580 |
Or the sample space is a discrete set, so x doesn't even have to be continuous. 00:29:15.740 |
And this is quite remarkable, actually, that you don't even need to have access to the 00:29:21.780 |
full-- you don't need to know exactly what the function is that you're optimizing. 00:29:26.820 |
You just have to be able to query it for the function value. 00:29:32.500 |
And this means this is a way of being able to compute gradients through a system that contains non-differentiable pieces. 00:29:42.960 |
So for example, in robotic locomotion, one issue is that you have contacts between the feet and the ground. 00:29:51.540 |
And you make and break contact, and that causes a discontinuous change in the dynamics. 00:29:58.720 |
So that makes it really hard to use smooth optimization techniques to come up with the behavior you want. 00:30:03.820 |
So when you use this kind of gradient estimator, along with policy gradients, which I'm going 00:30:08.460 |
to talk about very soon, you can optimize the 00:30:15.220 |
system even though it has non-differentiable pieces in it. 00:30:22.340 |
So here's another little picture of what's going on. 00:30:27.900 |
So we have our function f of x, which we're trying to maximize the expectation of. 00:30:33.980 |
And then we have our probability density, p of x. 00:30:37.220 |
So we just sample a bunch of values from our probability density. 00:30:50.660 |
And we're trying to push the probability distribution so that the probability goes up at these samples 00:31:01.020 |
So over on the right side of the curve, that means we're trying to push that probability up hard. 00:31:08.540 |
And on the left side, we're pushing it up softly. 00:31:11.160 |
So what's going to happen is the probability density is going to slide to the right. 00:31:15.340 |
If you can imagine a sort of physical analogy there. 00:31:21.900 |
So that's the score function gradient estimator. 00:31:26.300 |
It can be used in various machine learning problems. 00:31:32.500 |
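Here is a tiny numerical sketch of the estimator on a toy problem (not an RL problem): x ~ N(theta, 1) and f(x) = x^2, so E[f(x)] = theta^2 + 1, the true gradient is 2*theta, and grad_theta log p(x|theta) = x - theta:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(theta, 1.0, size=100_000)   # samples x_i ~ p(x | theta)
f = x ** 2                                 # black-box function values f(x_i)
grad_log_p = x - theta                     # score function for a unit-variance Gaussian
g_hat = np.mean(f * grad_log_p)            # estimator of grad_theta E[f(x)]
print(g_hat, 2 * theta)                    # close to the exact value 3.0
```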
Now we're going to apply it to the reinforcement learning setting. 00:31:36.580 |
And we're going to take our random variable x to be a whole trajectory. 00:31:42.960 |
So the trajectory consists of state action reward, state action reward, and so on until 00:31:50.180 |
And now to get our gradient estimator, to get the gradient of the expected reward, all 00:31:59.180 |
we've got to do is compute the grad log probability times the total reward. 00:32:07.540 |
So this probability of the trajectory, that sounds like a really unfriendly quantity because 00:32:12.020 |
there's a long, complicated process that generates this trajectory with lots of time steps. 00:32:20.020 |
OK, so we can write out what this process is, what this probability density is. 00:32:27.000 |
So we have-- it's just a product of probabilities. 00:32:29.980 |
We've got our initial-- we've got our mu of s0, which is just our initial state distribution. 00:32:36.280 |
And then every time step, we have-- we sample the action according to pi. 00:32:40.700 |
And we sample the next state and reward according to our dynamics model. 00:32:53.100 |
So when we take the log and differentiate, anything that doesn't contain theta drops out. 00:32:57.300 |
So the thing is, we didn't know-- there are parts of this probability distribution, p 00:33:03.360 |
of tau given theta, that we don't have access to. 00:33:06.260 |
So if this is reinforcement learning, we don't assume that we know the dynamics model of the system. 00:33:11.500 |
We just find out about it by sampling-- by doing episodes. 00:33:19.980 |
So since this product turns into a sum of logs, all the pieces that don't depend on theta, like the log p terms and the log mu term, drop out when we take the gradient. 00:33:34.740 |
And what we get in the end is we get a sum of log probabilities of actions. 00:33:47.340 |
So our formula for the gradient of the expectation is just the 00:33:53.340 |
expectation over trajectories of the total reward of the trajectory times the gradient of the sum of log probabilities of the actions. 00:34:06.320 |
So the interpretation of this is we're taking our good trajectories and we're trying to 00:34:12.300 |
increase their probability in proportion to how good they are. 00:34:16.220 |
And you can think of this as being similar to supervised learning, where we treat the 00:34:20.620 |
good trajectories with high rewards as positive examples in our supervised learning problem. 00:34:26.660 |
So we're using those to train the policy on which actions are good. 00:34:30.940 |
We're basically treating those actions as positive examples. 00:34:35.820 |
OK, now we can improve this formula a little bit. 00:34:41.900 |
So that was just the most basic-- I mean, this is an unbiased estimate. 00:34:46.180 |
This is an estimator for the policy gradient. 00:34:47.940 |
So if we just take that expression inside the expectation on the right-hand side and 00:34:52.940 |
we take one sample of that, it has the right mean. 00:34:56.020 |
So if we just get enough of them, we're going to get the policy gradient. 00:35:02.580 |
But we can also write down some other formulas that have the same mean but have lower variance. 00:35:07.820 |
So we can come up with better estimators for the policy gradient. 00:35:11.620 |
And that's actually quite important because the one from the previous slide is really 00:35:15.060 |
bad when you have a large number of time steps, meaning it has really high variance. 00:35:22.500 |
So the first thing we can do is we can use the temporal structure of the problem. 00:35:28.900 |
By the way, to derive these next bunch of formulas, it just takes a bunch of really 00:35:33.300 |
straightforward manipulation where you move around expectations. 00:35:37.220 |
And I'm not going to go through all the math. 00:35:45.420 |
So we can repeat the same argument from the previous slide to derive the gradient estimator for a single reward term. 00:35:54.520 |
So we end up with that reward term times the gradient of the sum of log probs of the actions up to that time. 00:36:00.260 |
And just summing over that, we get a new formula where we're not multiplying the sum of the 00:36:07.460 |
grad log prob of the whole thing times the sum of all rewards. 00:36:15.220 |
Now we have a sum over time of grad log probability of the action at that time times the sum of future rewards. 00:36:24.400 |
So in the formula from the previous slide, we would have had all the rewards in that sum, including past ones. 00:36:32.980 |
And that kind of makes sense, because an action can't affect the rewards that came before it. 00:36:39.980 |
So to figure out if the action is good, we should only be looking at the future rewards. 00:36:47.340 |
So this is a slightly better formula than the one on the previous slide, meaning it 00:36:51.180 |
has the exact same mean except the expression inside the expectation there has lower variance. 00:37:01.340 |
And we can further reduce the variance by introducing a baseline. 00:37:06.500 |
So now we can take any old function, B, which takes in a state and it outputs a real number. 00:37:13.620 |
And we can subtract it from our sum of future rewards. 00:37:19.500 |
And we didn't affect the mean of the estimator at all. 00:37:24.180 |
So we didn't change the expectation at all by introducing this baseline. 00:37:33.260 |
So yeah, for any choice of B, this gives us an unbiased estimator. 00:37:36.980 |
By the way, if you're not that familiar with the terminology of estimators, what I'm saying 00:37:41.340 |
is we have an expectation on the right-hand side of that formula. 00:37:48.980 |
And the quantity inside that expectation is what's called the estimator. 00:37:52.900 |
And if we get a bunch of samples, then we can get an estimate of the thing on the left-hand side. 00:38:04.020 |
So when I say it's an unbiased estimator, that just means that this equation is correct, 00:38:09.740 |
meaning that the thing on the right-hand side equals the thing on the left-hand side. 00:38:14.740 |
So yeah, this works for any choice of baseline. 00:38:17.660 |
And a near-optimal choice is to use the expected return. 00:38:27.180 |
And the interpretation of that is if we took an action, we only want to increase its probability if it turned out better than expected. 00:38:38.660 |
Well, the sum of rewards after that action should have been better than expected. 00:38:43.840 |
So the B of S is the expected sum of rewards. 00:38:47.780 |
And we're just taking the difference between the measured thing and the expected thing. 00:38:56.560 |
So that was a pretty key thing for variance reduction. 00:39:03.120 |
And I'm going to introduce one last variance reduction technique. 00:39:07.120 |
And actually, all three of these are really important. 00:39:09.200 |
So basically, nothing's going to work, except for maybe really small-scale problems, unless you use these variance reduction techniques. 00:39:16.100 |
So the last variance reduction technique is to use discounts. 00:39:20.760 |
So the discount factor ignores delayed effects between actions and rewards. 00:39:27.800 |
So what we're going to do here looks kind of like a hack, but there's an explanation 00:39:32.400 |
for it, which is instead of taking the sum of rewards, we're going to take a discounted 00:39:38.720 |
Meaning that we add this exponential factor, gamma. 00:39:44.320 |
So when we're multiplying the grad log probability by some future reward, we multiply it by some power of gamma, depending on how far in the future that reward is. 00:39:55.560 |
So people typically use gamma equals 0.99 or gamma equals 0.95. 00:40:01.280 |
So that means if you use 0.99, after 100 time steps you're going to be reducing the weight of that reward by a factor of about 1/e. 00:40:14.240 |
So you're exponentially decaying the effect of the future rewards. 00:40:20.540 |
And the intuition is that the action shouldn't affect rewards really far in the future. 00:40:27.240 |
Like the assumption is that the system doesn't have really long-term memory and it sort of forgets its distant past. 00:40:40.500 |
So you can just ignore the interaction between action and a reward way in the future. 00:40:51.000 |
So now, instead of taking the baseline to be the expected sum of future rewards, we want it to be the expected discounted sum of future rewards. 00:40:56.560 |
So now we're measuring if the action was better than expected according to the discounted return. 00:41:06.600 |
And now there's a more general class of formulas that looks like the one that I just wrote. 00:41:10.580 |
So this one that's on the top of the slide is pretty good. 00:41:13.840 |
And this is almost as good as anything you're going to do to within a small constant factor. 00:41:19.880 |
But there's a more general class of formulas that look like grad log probability times 00:41:28.800 |
some quantity a hat, which we call the advantage estimate. 00:41:32.520 |
And this is in general just going to be an estimate of, it has a more precise definition, 00:41:39.480 |
which is how much was this action better than the average action taken by the policy. 00:41:48.720 |
But informally, this just means how much better was the action than expected. 00:41:55.240 |
And this formula makes a lot of sense because we want to increase the probability of the 00:41:58.680 |
good actions and decrease the probability of the bad ones. 00:42:02.000 |
So we should increase it in proportion to the goodness of the action. 00:42:10.360 |
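As a sketch of the quantities just described, here is how the discounted reward-to-go and a baseline-subtracted advantage estimate could be computed for one episode (a zero baseline stands in for a learned value function; the rewards are made up):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Reward-to-go: sum over t' >= t of gamma**(t' - t) * r_{t'}."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rewards = [0.0, 0.0, 1.0, 0.0, 1.0]        # rewards from one hypothetical episode
baseline = np.zeros(len(rewards))          # placeholder for a learned b(s_t)
advantages = discounted_returns(rewards) - baseline   # "better than expected" signal
```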
So just to summarize, so I just told you there's this gradient estimator, meaning there's this 00:42:14.760 |
expression you can compute, which gives you a noisy estimate of the policy gradient. 00:42:18.960 |
So how do you actually turn this into an algorithm? 00:42:32.680 |
You take your policy, you initialize your policy parameter and your baseline function. 00:42:40.160 |
For each iteration, you execute the current policy to get a bunch of whole episodes, meaning complete trajectories. 00:42:50.920 |
And each time step in each trajectory, you should compute the return, meaning the sum 00:42:56.140 |
of rewards following that time step, the sum of discounted rewards, and the advantage estimate, 00:43:01.120 |
which is the sum of discounted rewards minus the baseline. 00:43:05.800 |
Then you refit the baseline by trying to make the baseline function equal the returns. 00:43:12.360 |
And then you update the policy using a policy gradient estimator. 00:43:16.680 |
So you're just doing SGD while updating the baseline as you go along. 00:43:25.520 |
So that's the vanilla policy gradient algorithm. 00:43:30.200 |
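Here is a sketch of what one iteration of that loop might look like in PyTorch, assuming a discrete-action policy like the Categorical sketch earlier and a crude constant baseline (the mean return) instead of a learned one; it illustrates the update, not the speaker's implementation:

```python
import torch

def vanilla_pg_update(policy, optimizer, episodes, gamma=0.99):
    """One policy gradient step from a batch of episodes.

    `episodes` is a list of (states, actions, rewards) collected with the
    current policy; `policy(s)` returns a torch.distributions.Categorical.
    """
    logps, advs = [], []
    for states, actions, rewards in episodes:
        # discounted reward-to-go at each time step
        returns, running = [], 0.0
        for r in reversed(rewards):
            running = r + gamma * running
            returns.append(running)
        returns.reverse()
        dist = policy(torch.as_tensor(states, dtype=torch.float32))
        logps.append(dist.log_prob(torch.as_tensor(actions)))
        advs.append(torch.tensor(returns, dtype=torch.float32))
    logp, adv = torch.cat(logps), torch.cat(advs)
    adv = adv - adv.mean()            # constant baseline; a value net would go here
    loss = -(logp * adv).mean()       # minimizing -J(theta) is gradient ascent on J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```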
And this is, I'll briefly talk, this has been used to obtain some pretty good results. 00:43:37.040 |
So it's not that bad of an algorithm, but there are several different directions in which you can improve it. 00:43:47.440 |
So one issue that you run into is with step sizes. 00:43:52.760 |
So in supervised learning, step sizes aren't that big of a deal. 00:43:57.960 |
Because maybe you take too big of a step, but that's OK. 00:44:04.360 |
And your current function, your current classifier, for example, doesn't affect what inputs you're going to see next. 00:44:10.880 |
So even if you just are doing really, even if your network is just thrashing around for 00:44:16.800 |
a while because you're taking too big steps, that's not going to cause any problems. 00:44:22.800 |
But yeah, so step sizes aren't that big of a deal. 00:44:30.600 |
You can start off with a large step size and anneal them down to zero. 00:44:36.720 |
In reinforcement learning, if you take too big of a step, you might wreck your policy. 00:44:41.480 |
And even if you don't actually change the network that much, so you don't lose all your 00:44:45.640 |
nice features, you might just change its behavior a little too much. 00:44:50.680 |
And now it's going to do something totally different and visit a totally different part of the state space. 00:44:55.940 |
So since in reinforcement learning, the system is stateful and your state distribution depends 00:45:00.440 |
on your policy, that brings a really different problem. 00:45:06.760 |
And now, after you took that step, the next batch of data you're going to get was collected with this wrecked policy. 00:45:13.280 |
And now you're never going to recover because you just forgot everything. 00:45:19.440 |
So one way to fix this, which my colleagues and I have worked on, is to basically 00:45:29.920 |
try to stop the policy from taking too big of a step. 00:45:33.120 |
So you can look at the KL divergence between the old policy and the new policy, like before 00:45:39.680 |
the update and after the update, and make sure that the distributions aren't that different. 00:45:48.200 |
So my colleagues and I developed an algorithm called trust region policy optimization, which 00:45:53.320 |
looks at the action distributions and tries to make sure the KL divergence isn't too large. 00:46:00.040 |
And this is very closely related to previous natural policy gradient methods, which are 00:46:08.320 |
doing something similar, but usually it's not set up as a hard constraint on the KL divergence. 00:46:20.920 |
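For discrete actions, the kind of quantity such methods constrain can be sketched as the average KL divergence between the old and new action distributions over a batch of visited states; the threshold value below is a hypothetical example, not the paper's setting:

```python
import torch

def mean_kl(old_logits, new_logits):
    """Average KL(pi_old || pi_new) over a batch of states (discrete actions)."""
    old = torch.distributions.Categorical(logits=old_logits)
    new = torch.distributions.Categorical(logits=new_logits)
    return torch.distributions.kl_divergence(old, new).mean()

# e.g. accept or shrink an update depending on whether mean_kl(...) <= 0.01
```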
So another type of extension of policy gradient methods is to use value functions more heavily. 00:46:32.800 |
Instead of just using them as a baseline, you can use them more aggressively to reduce variance. 00:46:41.200 |
So I won't go into the details in this talk, but sometimes these are called actor-critic methods. 00:46:52.800 |
There's also another type of approach, which I briefly touched on in the earlier slide, 00:46:58.880 |
where instead of just trying to increase the probability of the good actions, you actually 00:47:03.040 |
differentiate your loss with respect to the actions. 00:47:05.880 |
This is like the reparameterization trick, which is used for density modeling and unsupervised learning. 00:47:15.060 |
So here you're trying to-- instead of just increasing the probability of the good actions, 00:47:19.560 |
you're trying to push the actions towards better actions. 00:47:23.880 |
And I'd say with both of these bullet points, you're potentially decreasing your variance a lot, but introducing some bias. 00:47:33.080 |
So it actually makes the algorithms a little harder to understand and to get them working, 00:47:38.840 |
because with high variance, if you just crank up the amount of data, you can always drive the error down. 00:47:46.800 |
But with bias, no matter how much data you get, you're not going to get rid of it. 00:47:51.800 |
So if your gradient is pointing in the wrong direction, then you're not going to learn the right thing. 00:47:57.560 |
OK, so now that's it for the policy gradient section of this talk. 00:48:05.360 |
So I wanted to show a quick video of some work that my colleagues and I did on learning 00:48:10.860 |
locomotion controllers with policy gradient methods, which I found pretty exciting. 00:48:26.660 |
So here, what we've got is a humanoid simulated-- let's see. 00:48:31.240 |
OK, yeah, it's a simulated humanoid robot in a realistic physics simulator. 00:48:40.320 |
And it has a neural network policy, which takes in the joint angles of the robot. 00:48:51.080 |
It's got joint velocities and also positions of the different links of the robot. 00:48:58.500 |
It's pretty much the raw state of the robot, like no clever feature engineering there. 00:49:04.160 |
And the output is going to be the joint torques, which are set 100 times a second. 00:49:09.620 |
So we're just mapping from joint angles to joint torques. 00:49:15.300 |
And we define a reward function, which is to move forward as fast as possible. 00:49:25.680 |
So the episode ends when its head goes below a certain height, meaning it fell over. 00:49:33.060 |
There is a little bit of tweaking for the reward function, but not too extensive. 00:49:39.480 |
So you can see, first it just falls forward a lot of times. 00:49:55.180 |
And then slowly it starts to develop a half-decent looking walk. 00:50:06.420 |
And at the very end of this, it could just keep running indefinitely. 00:50:11.020 |
So I think it was actually stable in a strong sense, meaning I could just leave it running for 15 minutes or more. 00:50:18.980 |
So here's another robot model that we just created without too much thought. 00:50:26.220 |
We just decided to put a bunch of legs on this thing. 00:50:30.700 |
So we don't even know how this thing is supposed to walk. 00:50:35.040 |
And just give it to the same algorithm and it just figures out some kind of crazy way 00:50:42.920 |
So that's the nice thing about reinforcement learning. 00:50:45.260 |
You don't even need to know what you want it to do. 00:50:48.620 |
I think the physics are also a little unrealistic here. 00:50:56.140 |
Here we set up, we used a similar model to the one in the first demo. 00:51:00.940 |
But here we just give it a reward for having its head at a certain height. 00:51:04.880 |
So there's a reward telling it to get its head up as high as possible. 00:51:09.060 |
And then it figures out how to get up off the ground. 00:51:31.060 |
OK, any questions about policy gradients before I move on to the next part? 00:51:57.300 |
So the question was, is the system time invariant? 00:52:05.140 |
And also that it doesn't change from one episode to the next. 00:52:10.180 |
Of course, in some real world problems that might not be the case. 00:52:13.020 |
So I think that's also an interesting problem setting where you have a non-stationary environment. 00:52:26.180 |
So the question was, for the baseline, to learn a good baseline, do you need to know 00:52:32.340 |
So no, you can just learn it by doing regression. 00:52:36.100 |
You just estimate the empirical returns and then you do regression to try to fit a function to them. 00:53:09.620 |
Yeah, so the question is, there's a discount factor which should cause the policy to disregard 00:53:18.100 |
any effects that are delayed by more than 100 time steps. 00:53:21.520 |
So how does it still work that this guy learns how to stand up, even though that might take longer than that? 00:53:32.300 |
And in fact, I would say that these methods aren't guaranteed to work well if you have such long delays between actions and rewards. 00:53:42.140 |
Often they work anyway, but there's no guarantee. 00:53:44.740 |
So I think there's actually something pretty fundamental missing in how to deal with really long time horizons. 00:53:50.660 |
And people have recently been thinking about hierarchical reinforcement learning, where 00:53:54.700 |
you have different levels of detail of the system, where you might have one level of 00:54:01.640 |
description with a small time step, and then you have successively coarser levels with longer time steps. 00:54:08.940 |
And that allows you to plan over much longer horizons. 00:54:12.940 |
So that's something that's currently an active area of research. 00:54:15.860 |
But yeah, I would say that none of these methods are guaranteed to do anything reasonable if 00:54:21.460 |
you have more than 1 over 1 minus gamma time steps between action and reward. 00:54:28.100 |
So if you introduce some anomaly in the environment, can it still handle it? 00:54:35.500 |
Oh yeah, so in this kind of task, if you introduce terrain or something, could it-- I think if 00:54:42.260 |
you didn't train it to deal with terrain, then it might fail. 00:54:47.700 |
Actually, I don't think it would fail, because the funny thing about-- these policies are 00:54:52.020 |
actually really robust, because you train them with a stochastic policy. 00:54:57.580 |
So there's a lot of noise being generated by the policy itself. 00:55:01.540 |
So in practice, it's able to deal with huge noise introduced by the policy. 00:55:08.860 |
And as a result, I found that if you change the dynamics parameters a little, it can usually 00:55:15.420 |
But yeah, there's no guarantee that it'll do anything if you give it something you didn't 00:55:20.780 |
I think that you probably could do the same kind of training with terrain. 00:55:25.540 |
I didn't have any terrain, so I didn't try it. 00:55:30.940 |
OK, I'm going to move on to the next part of the talk. 00:55:35.540 |
Feel free, if you have more questions, to find me afterwards. 00:55:42.500 |
OK, so now I'm going to talk about a different type of reinforcement learning algorithm. 00:55:54.100 |
So the previous methods are distinguished by the fact that they explicitly represent 00:56:03.140 |
a policy, which is the function that chooses your actions. 00:56:06.500 |
And they try to optimize it with respect to the parameters of the policy. 00:56:10.700 |
So the nice thing about the policy gradient methods we just talked about is that you're directly optimizing the objective you care about. 00:56:18.100 |
And you're optimizing it with gradient descent. 00:56:20.100 |
So that makes it kind of easy to understand what's going on. 00:56:23.880 |
Because if you're getting the proper gradient estimate, and you take small enough steps, things should keep improving. 00:56:29.540 |
I mean, of course, you could still get stuck in a local optimum, 00:56:32.740 |
or get stuck in a bad local optimum. 00:56:37.940 |
And you can use our understanding of optimization to figure out what's going on. 00:56:44.080 |
So these next class of methods are a little different, because they're not optimizing 00:56:50.260 |
They're learning something else called a Q function, which measures how good state-action pairs are. 00:56:56.860 |
So it measures-- I'll say that more formally later. 00:57:00.620 |
But it's just measuring how good the actions are. 00:57:04.020 |
And these methods are actually-- these are able to exactly solve MDPs efficiently in 00:57:15.700 |
the setting where you have a finite number of states and actions. 00:57:19.620 |
So these are the preferred methods for exactly solving them in those settings. 00:57:25.100 |
But you can also apply them with continuous states and actions, using expressive function approximators like neural networks. 00:57:37.500 |
But it's a little harder to understand what's going on in these methods, like when they're 00:57:41.860 |
going to work and when they're not going to work. 00:57:50.380 |
So the Q function is defined as the expected sum of rewards when we condition on the first state and action. 00:57:59.180 |
So we're conditioning on S0 equals S, A0 equals A. And the Q function is the expected 00:58:08.380 |
discounted sum of rewards when we're acting under the policy pi. 00:58:14.620 |
So by convention, I'm starting out with time step 0. 00:58:20.900 |
I could have also said that we're taking r_t + r_(t+1) + r_(t+2) and so on. 00:58:27.740 |
But since we're assuming the system is stationary, it should be exactly the same. 00:58:32.520 |
So just by convention, I'm going to say that the first-- I'm going to always use time 0, 00:58:36.900 |
1, 2, 3, and so on, just for ease of notation. 00:58:41.980 |
So the Q function is just telling you how good this state-action pair is under your policy. 00:58:48.020 |
The value function-- well, the state value function, usually called V, is just conditioning on the state. 00:58:57.820 |
It's telling you how good that state is, what's the expected reward of that state. 00:59:02.520 |
And lastly, the advantage function is the difference between the Q function and the 00:59:07.080 |
state value function, meaning how much better is that action than what the policy would have done by default. 00:59:14.180 |
We're not going to talk about advantage functions in this section, but this 00:59:18.020 |
corresponds to the notion of an advantage estimator we briefly mentioned in the previous section. 00:59:26.140 |
So here, we're going to consider methods that explicitly store and update the Q function 00:59:30.860 |
instead of the policy, and update it using what are called Bellman equations. 00:59:41.620 |
So a Bellman equation, in general, is a consistency equation that the value function has to satisfy. 00:59:51.100 |
So here, I'm writing down the Bellman equation for Q pi. 00:59:55.780 |
And what it's saying is that the expected sum of rewards should be the first reward 01:00:04.460 |
plus this expected sum of rewards after the first time step. 01:00:12.940 |
V pi of S1 is just adding up all the rewards after time step 0. 01:00:22.380 |
So in the second equation, we write out this relationship just involving the Q function. 01:00:29.380 |
So we have a consistency equation that the Q function should satisfy. 01:00:37.640 |
We can slightly generalize this to use k time steps instead of just one time step. 01:00:42.860 |
So we can expand out the expectation-- the expected sum of rewards to write out k rewards 01:00:49.920 |
explicitly and then cap it off with the value function at the very end, which accounts for all the rewards after that. 01:00:58.640 |
OK, so here's the Bellman equation from the previous slide. 01:01:03.620 |
So now I'm going to introduce a very important concept called the Bellman backup. 01:01:11.340 |
So we have this equation that the value function Q pi should satisfy. 01:01:24.880 |
So we define this Bellman backup operator that operates on an arbitrary Q function. 01:01:34.400 |
And it's defined by just taking the right-hand side of the Bellman equation and plugging in that arbitrary Q function. 01:01:50.280 |
So Q pi is a fixed point of this operator, meaning if we apply the backup operator to Q pi, we get Q pi back. 01:02:00.980 |
And very nicely, if we keep applying this backup operator repeatedly to any old arbitrary 01:02:08.600 |
initial Q function Q, the series will converge to Q pi, which is the fixed point of the operator. 01:02:15.640 |
So that way, you can use an iterative 01:02:27.200 |
algorithm to estimate Q pi by taking any old initial Q function and repeatedly applying the backup operator. 01:02:36.380 |
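Here is a sketch of that iterative scheme for a tabular MDP, using the P and R arrays from the earlier MDP sketch and a policy given as a table of action probabilities (all illustrative, not the lecture's code):

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.95, iters=500):
    """Estimate Q^pi by repeatedly applying the Bellman backup for a fixed policy.

    P: (nS, nA, nS) transition probabilities, R: (nS, nA) expected rewards,
    pi: (nS, nA) action probabilities for the policy being evaluated.
    """
    nS, nA = R.shape
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        V = (pi * Q).sum(axis=1)       # V^pi(s) = sum_a pi(a|s) Q(s, a)
        Q = R + gamma * (P @ V)        # backup applied to every (s, a) at once
    return Q
```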
So now there's another kind of Q function that we're going to introduce called Q star. 01:02:42.200 |
So the previous Q function, Q pi, was telling you the value function under a given policy. 01:02:50.220 |
So it only makes sense with regard to some particular fixed policy pi. 01:02:54.800 |
Q star is going to be-- is going to involve the optimal policy instead. 01:03:02.920 |
So Q star is just defined as the Q function of the optimal policy. 01:03:06.720 |
So here we have pi star, the optimal policy, and Q star is just the Q function of the optimal 01:03:12.800 |
And it also happens to be the pointwise maximum over all policies of the Q function at each state-action pair. 01:03:25.920 |
So the optimal policy is deterministic, and it should satisfy the equation that it takes the argmax of Q star over actions. 01:03:36.800 |
So recall that the Q function tells you your expected return if you take the given action. 01:03:43.340 |
So obviously the optimal policy should take the action that has the best expected return. 01:03:58.000 |
So now that we know this property of the optimal policy, we can rewrite the Bellman equation. 01:04:06.200 |
So on the-- that first equation is-- that's just the Bellman equation from the previous 01:04:12.880 |
Now we can take that expectation over actions and replace it by what the optimal policy 01:04:18.880 |
is going to do, which is just going to take the argmax of the optimal Q function. 01:04:26.180 |
That should say Q star inside-- on the right-hand side. 01:04:32.820 |
So now we have a Bellman equation that the optimal policy should satisfy. 01:04:39.340 |
Now we can do the same thing with the backup operator. 01:04:42.480 |
So we take that Bellman equation and we plug in an arbitrary Q function on the right-hand 01:04:49.300 |
side instead of the optimal Q function Q star. 01:04:54.900 |
So Q star is a fixed point of this Bellman operator. 01:04:59.880 |
That's just a restatement of the Bellman equation. 01:05:04.300 |
And again, if we apply this Bellman operator repeatedly to an arbitrary initial Q function, 01:05:11.340 |
it converges to Q star, which is the optimal Q function. 01:05:16.900 |
This is the Banach fixed point theorem in both cases. 01:05:26.460 |
So based on these ideas, there are two classic algorithms for exactly solving MDPs. 01:05:33.340 |
These are sometimes called dynamic programming algorithms because they're actually quite 01:05:36.740 |
related to the kind of dynamic programming algorithms that are used to solve search problems. 01:05:46.500 |
The first one is value iteration: you just initialize your Q function arbitrarily. 01:05:49.900 |
And you repeatedly do Bellman backups until it converges. 01:06:02.580 |
The second one is policy iteration. There, at each step, you first compute, either exactly or approximately, the Q function of the current policy. 01:06:12.880 |
And then you update your policy to be the greedy policy for the Q function you just 01:06:19.020 |
So that means that your new policy just takes the argmax of the Q function. 01:06:24.500 |
So it takes the action that's best according to that Q function. 01:06:31.700 |
So I didn't say anything about how you compute Q pi. 01:06:37.340 |
You can compute it exactly, because it happens that the Bellman equation for Q pi is a linear system of equations that you can solve. 01:06:46.820 |
More commonly, if you have a large-scale problem, you might not be able to solve this linear system exactly. 01:06:51.620 |
So what people often do is they do a finite number of Bellman backups, which doesn't 01:06:58.380 |
exactly converge on Q pi, but it gives you something that's approximately Q pi. 01:07:04.700 |
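As a sketch, tabular value iteration amounts to repeatedly applying the Bellman optimality backup to a Q table until it stops changing (again using the illustrative P and R arrays from the earlier sketch, not anything from the lecture):

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, iters=500):
    """Repeatedly apply T*Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')."""
    nS, nA = R.shape
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        Q = R + gamma * (P @ Q.max(axis=1))   # backup for every (s, a) at once
    greedy_policy = Q.argmax(axis=1)          # deterministic greedy policy
    return Q, greedy_policy
```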
Okay, so I just told you about algorithms that you can implement if you have full access to the MDP. 01:07:13.300 |
Like you know the whole table of probabilities. 01:07:16.180 |
But in reinforcement learning, usually the assumption is that you don't know any of these probabilities. 01:07:24.980 |
So all of these things have to be estimated from data, or you're only able to access the system through interaction. 01:07:32.500 |
So now it turns out that these algorithms can be also implemented if you only access 01:07:39.340 |
the system through interactions, which is kind of remarkable, I think. 01:07:45.260 |
So the way it works is, so let's recall our backup formulas for Q pi and Q star. 01:07:53.620 |
So, we can, in both cases, we have this certain quantity inside an expectation. 01:08:02.620 |
In both cases, we can compute an unbiased estimator of that quantity inside the expectation, 01:08:11.780 |
Meaning if we sampled some data from our system using any old policy, then we can get an unbiased 01:08:21.580 |
estimator of the quantity on the right-hand side of those expectations. 01:08:26.300 |
I mean the quantity inside of the right-hand expectations. 01:08:31.980 |
So basically we can do an approximate version of this Bellman backup, which is unbiased. 01:08:41.980 |
So we're doing a noisy version of the Bellman backup. 01:08:46.040 |
Even with this noise, it can be proven that if you choose your step sizes appropriately, 01:08:50.500 |
with the right schedule, you're still going to converge to Q pi or Q star, depending on which backup you're using. 01:09:01.460 |
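As a rough sketch of what one such sampled, noisy backup looks like in the tabular case (illustrative names; the transition comes from interacting with the system under any behavior policy):

    import numpy as np

    def q_learning_step(Q, s, a, r, s_next, alpha, gamma=0.99):
        """One noisy, sampled backup toward Q*, from a single transition (s, a, r, s').

        The target r + gamma * max_a' Q[s', a'] is an unbiased estimate of the quantity
        inside the expectation in the Q* backup. alpha is the step size, which should be
        decayed on an appropriate schedule for the classical convergence guarantees.
        """
        target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (target - Q[s, a])
        return Q

For the Q pi backup instead, the max over next actions would be replaced by the value of the action the policy actually takes; that variant comes up again below as SARSA.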
Okay, so I'll say at this point that this is pretty much the fundamental idea. 01:09:08.700 |
And now you can come up with algorithms that can be applied in the reinforcement learning 01:09:15.840 |
setting where you're just accessing the system through sampling. 01:09:18.820 |
And you can also start introducing function approximation here. 01:09:23.100 |
So I haven't said anything about what the Q function is. 01:09:26.500 |
I've just told you it's a function of state and action. 01:09:29.460 |
But now we can start having neural network Q functions, for example. 01:09:36.120 |
So we can parameterize the Q function with the neural network. 01:09:41.960 |
And now, instead of doing the Bellman backup, I mean it doesn't make sense to do the Bellman 01:09:46.980 |
backup exactly because we're not just setting the values of the neural network output. 01:09:52.420 |
The best we can do is try to encourage the neural network to have some output values. 01:09:57.980 |
So the way we do this backup is we set up a least squares problem. 01:10:04.700 |
We write down a quadratic objective that says the Q function should approximately equal the backed-up values. 01:10:16.380 |
So one version of this algorithm, introduced about 10 years ago, is called neural fitted Q iteration, and it works like this. 01:10:25.320 |
You sample trajectories using your current policy, which might be determined by the Q 01:10:32.940 |
function or it could be any old policy as it turns out. 01:10:37.500 |
And then you solve the least squares problem, where you're trying to minimize the quadratic 01:10:46.340 |
error based on the Bellman backup, the backup operator I described earlier. 01:10:55.500 |
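A minimal sketch of that least-squares objective, assuming a PyTorch Q-network that maps observations to a vector of Q-values; all names here are illustrative rather than taken from the original neural fitted Q iteration paper.

    import torch
    import torch.nn as nn

    def fitted_q_loss(q_net, batch, gamma=0.99):
        """Least-squares objective for one round of fitted Q iteration (a sketch).

        batch: tensors s [N, obs_dim], a [N] (long), r [N], s_next [N, obs_dim], done [N].
        The regression targets are the backed-up values r + gamma * max_a' Q(s', a'),
        detached so the network is simply fit to them by least squares.
        """
        s, a, r, s_next, done = batch
        with torch.no_grad():
            # Backup target; zero out the future value at terminal states.
            target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a) for the taken actions
        return nn.functional.mse_loss(q_sa, target)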
So one thing I haven't mentioned so far is what do you actually use as your policy. 01:11:02.020 |
So I said to sample trajectories using your policy. 01:11:05.040 |
If you have a Q function, you can turn it into a policy by just taking the action with the highest Q value. 01:11:14.120 |
So the Q function measures the goodness of all your actions. 01:11:16.920 |
So you can easily turn that into a policy by taking your best action or by taking actions 01:11:22.280 |
where the log probability is proportional to the goodness or something like that. 01:11:28.780 |
So typically you might take the probability of an action to be the exponential of its Q value divided by some temperature parameter. 01:11:40.800 |
Whereas if you just take the argmax, that's called the greedy policy. 01:11:47.340 |
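For example, here are a few simple ways to turn the vector of Q-values for the current state into an action; the epsilon and temperature values are arbitrary placeholders.

    import numpy as np

    def greedy_action(q_values):
        """Greedy policy: take the action with the highest Q-value."""
        return int(np.argmax(q_values))

    def epsilon_greedy_action(q_values, epsilon=0.1):
        """With probability epsilon pick a random action, otherwise act greedily."""
        if np.random.rand() < epsilon:
            return int(np.random.randint(len(q_values)))
        return int(np.argmax(q_values))

    def softmax_action(q_values, temperature=1.0):
        """Log-probabilities proportional to Q / temperature (a 'Boltzmann' policy)."""
        logits = np.asarray(q_values, dtype=float) / temperature
        logits -= logits.max()  # shift for numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        return int(np.random.choice(len(q_values), p=probs))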
So it turns out that with these kinds of Q-learning algorithms, you don't have to execute the policy you're learning about. 01:11:57.440 |
You actually have some freedom in what policy you execute, which is one very 01:12:02.700 |
nice property of these algorithms: you can use an exploration technique where your 01:12:09.520 |
policy is actively trying to reach new states or do something new, and you still 01:12:16.240 |
converge, still move towards Q star or Q pi, as the case may be. 01:12:24.320 |
So that's neural fitted Q iteration -- a very basic way of doing this. 01:12:31.440 |
A more recent algorithm that's gotten a lot of attention is the DQN algorithm from Mnih 01:12:36.760 |
et al. from DeepMind, which is basically an online version of this algorithm with a couple of tricks. 01:12:47.840 |
But actually, when you look at the two tricks, they make a 01:12:52.960 |
lot of sense if you just think about what value iteration is doing. 01:12:56.420 |
So one technique is you use this replay pool where it's a rolling history of your past 01:13:03.280 |
data and that's just the data you're going to use to fit your Q function. 01:13:09.080 |
So that makes sure you have like a representative sample of data to fit your Q function to. 01:13:16.040 |
And the second idea is to use a target network. 01:13:21.440 |
So instead of using your current Q function and just doing Bellman backups on that, you 01:13:29.840 |
keep around a target network, which is a copy of your Q function from some earlier point in training. 01:13:35.920 |
So that also, if you think about value iteration, you have your old Q function and you're trying 01:13:41.440 |
to make the new one equal to the backed up version of the old one. 01:13:44.680 |
So using the target network is just sort of the natural thing to do if you're trying to approximate value iteration. 01:13:53.600 |
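Putting the replay buffer and target network together, the per-minibatch loss looks roughly like this; this is a sketch of the idea with illustrative names, not the exact DQN implementation, which has many additional details.

    import torch
    import torch.nn as nn

    def dqn_loss(q_net, target_net, batch, gamma=0.99):
        """TD loss on a replay-buffer minibatch, with targets from a frozen target network.

        target_net is a copy of q_net from an earlier point in training; in practice it is
        synchronized periodically, e.g. target_net.load_state_dict(q_net.state_dict())
        every K updates. batch: tensors s, a (long), r, s_next, done.
        """
        s, a, r, s_next, done = batch
        with torch.no_grad():
            next_q = target_net(s_next).max(dim=1).values
            target = r + gamma * (1.0 - done) * next_q
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        return nn.functional.smooth_l1_loss(q_sa, target)  # Huber-style loss, commonly used here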
And there have been many extensions proposed since then. 01:13:55.880 |
I've got a bunch of citations at the bottom of the slide. 01:14:00.360 |
So this algorithm, the DQN algorithm, is using the backup B, which is the backup for Q*. 01:14:11.920 |
Remember that I also introduced this other backup, B pi, which is the backup for Q pi. 01:14:18.560 |
So there's another algorithm, like a very classic algorithm called SARSA, which is an 01:14:23.580 |
online way of doing the B pi backup, essentially. 01:14:28.640 |
Well, it's sort of an online version of policy iteration. 01:14:35.280 |
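For contrast with the Q-learning update above, here is a minimal sketch of the tabular SARSA update, where the backup target uses the action the behavior policy actually took in the next state (illustrative, not from the talk):

    def sarsa_step(Q, s, a, r, s_next, a_next, alpha, gamma=0.99):
        """One sampled B^pi backup (a sketch, assuming a tabular Q array).

        The target uses Q[s', a'], where a' is the action the policy actually took next,
        instead of the max over next actions used in the Q-learning update.
        """
        target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (target - Q[s, a])
        return Q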
So it's actually been found to work as well as, or better than, DQN -- that is, better than using the Q star backup. 01:14:44.980 |
So I think the jury's still out on exactly how these things compare. 01:14:51.480 |
But I think it's worth considering both policy iteration and value iteration and all the 01:14:58.040 |
different online versions of these algorithms and taking them seriously. 01:15:01.520 |
Because it's not clear right now exactly how they all compare to each other in the function approximation setting. 01:15:12.800 |
So that's the overview of all the technical parts. 01:15:17.380 |
And now I just have a couple conclusion slides. 01:15:22.360 |
So let me just summarize the current state of affairs. 01:15:25.000 |
I introduced two kinds of algorithms, policy gradient algorithms, which explicitly represent 01:15:30.840 |
a policy and optimize it, and Q function learning algorithms, which explicitly represent a Q 01:15:37.200 |
function, which is the goodness of different actions, and use that to implicitly represent a policy. 01:15:44.560 |
So with policy gradient methods, there have been some recent successes. 01:15:55.240 |
There's a recent paper on the A3C method, which is an asynchronous implementation of policy gradients that has worked quite well. 01:16:06.820 |
There's also another class of methods: the natural policy gradient and trust region methods. 01:16:13.380 |
The result I showed you was obtained using trust region policy optimization, which is one of these methods. 01:16:20.100 |
I think these trust region methods and natural policy gradient methods 01:16:24.820 |
are more sample efficient than the vanilla methods, because you're doing 01:16:31.220 |
more than one gradient update with each little bit of data you collect. 01:16:34.920 |
So with the vanilla policy gradient, you just compute one little gradient estimate from each batch of data and then move on. 01:16:41.220 |
With the natural policy gradient, you're solving a little optimization problem with that data, so you get more out of it. 01:16:48.540 |
So that's what we have in the policy gradient world. 01:16:54.060 |
And in the Q-function world, we have the DQN algorithm and some of its relatives. 01:17:01.900 |
And these are sort of descendants of value iteration, where you're approximating the 01:17:13.360 |
Bellman backup with function approximation. And then SARSA is related to policy iteration. 01:17:21.260 |
These two are dealing with 01:17:25.580 |
different Bellman equations, so it's kind of interesting that both kinds of methods 01:17:29.100 |
work, and that they have fairly similar behaviors, as it turns out. 01:17:35.980 |
So here's how I would compare them. 01:17:40.340 |
And this is anecdotal evidence, but I think this is the consensus right now. 01:17:47.060 |
The Q-function methods are more sample efficient when they work, but they don't work as generally 01:17:52.580 |
as policy gradient methods, and it's a little harder to figure out what's going on when they don't work. 01:17:58.940 |
And that kind of makes sense, because in the policy gradient methods, you're optimizing 01:18:01.980 |
exactly the thing you care about with gradient descent. 01:18:05.380 |
Whereas with Q-function methods, you're doing something indirect, where you're trying to 01:18:09.660 |
learn a Q-function, and then you're hoping that it gives you a good policy. 01:18:14.820 |
And yeah, so I would also point out that there are also some confounds, so it's hard to make 01:18:20.660 |
a good conclusion at this point, because people use different time horizons in the policy 01:18:28.460 |
gradient methods versus the Q-function methods. 01:18:30.980 |
So they do one-step lookaheads on the Q-functions and multi-step lookaheads on the policy gradients. 01:18:36.300 |
So it's not clear if the differences come from using different time horizons or some 01:18:43.660 |
differences in how the algorithms are working, because you're either doing regression for 01:18:47.860 |
a Q-function versus learning a policy using policy gradients. 01:18:53.980 |
So just to summarize it, I would say here are some of our core model-free reinforcement 01:18:59.220 |
learning algorithms, and they - oh, whoops, I'm missing a word in the first column, which 01:19:07.300 |
I think should say reliability and robustness. 01:19:11.860 |
So this just means, is it going to work on new problems without parameter tuning, or 01:19:19.420 |
is it going to mysteriously either work or not work? 01:19:24.260 |
So this would be my slightly sloppy summary of all these different algorithms. 01:19:32.460 |
I would say there's still some room for improvement. 01:19:35.540 |
There might be some improvements in the basic methods, because there are some nice properties 01:19:39.540 |
of the Q-function methods that we don't have in the policy gradient methods. 01:19:44.260 |
Like you can easily do off-policy learning -- you can easily explore with a different policy than the one 01:19:49.780 |
that you're learning the Q-function for, and that's really important. 01:19:54.500 |
You can't do that very easily with policy gradient methods. 01:19:58.380 |
Whereas the policy gradient methods just seem more robust -- you can just apply them 01:20:03.380 |
and they're more likely to work, and it's well understood what's going on. 01:20:10.100 |
So I think there's still room for improvement there -- I don't know if it's possible to get the best of both worlds. 01:20:21.540 |
In model-based reinforcement learning, what lines of research do you find most interesting? 01:20:39.540 |
Oh yeah, so in model-based reinforcement learning, what lines of research do I find most interesting? 01:20:47.340 |
I think the work from my colleagues on guided policy search is very nice. 01:20:51.260 |
So I would say that's a kind of model-based reinforcement learning. 01:20:56.740 |
I also like some of the methods that are using the model for faster learning. 01:21:03.140 |
So there's a paper called stochastic value gradients that I like a lot. 01:21:12.640 |
So I don't think there have been a lot of really compelling results where you're able 01:21:18.060 |
to learn extremely fast -- like, with much better sample efficiency than model-free methods. 01:21:24.060 |
So it seems like that should be possible, but I don't think it's been demonstrated yet. 01:21:29.980 |
So maybe in the next couple of years we'll see that happen. 01:21:41.340 |
Is it true or not that most of this work requires some kind of simulated world to run the learning in? 01:21:51.700 |
Oh yeah, so are you asking does this work in the real world? 01:21:59.500 |
Yeah, I would say it does work if you have a lot of patience and you're willing to execute a lot of trials on the real system. 01:22:06.820 |
So the locomotion results I showed add up to about two weeks of real time. 01:22:12.840 |
So it's actually not that bad, especially when you consider that toddlers take a while 01:22:18.020 |
to learn how to walk properly, even though evolution already puts in a lot of built-in 01:22:24.820 |
So I'd say maybe, yeah, I'd say it can be run in the real world. 01:22:31.620 |
Some of my colleagues in Berkeley are doing some experiments where they are running just 01:22:36.380 |
regular reinforcement learning algorithms in the real world. 01:22:41.300 |
But I hope to see some nice results from that soon. 01:22:52.860 |
I was wondering what was your intuition on the loss surface of those deep reinforcement 01:22:59.620 |
learning optimization problems, and maybe especially how it evolves as the policy learns. 01:23:08.900 |
And I should specify in the policy gradient case. 01:23:13.100 |
So I think the situation is a little bit different in reinforcement learning from in supervised 01:23:17.900 |
So in reinforcement learning, you have a particular kind of local minimum that lives in policy space. 01:23:29.300 |
So for example -- I keep going back to the locomotion example, where you want the robot to walk. 01:23:39.260 |
There's one local minimum where it just stands and doesn't bother to walk, because there's 01:23:44.980 |
some reward it can collect without risking a fall.
And there's another local minimum where it just dives forward, because it gets a little 01:23:49.140 |
bit of reward for that before it falls to its doom. 01:23:55.040 |
So I think that's actually the hard part about the optimization problem: it comes 01:24:03.920 |
from the space of behaviors, and actually has nothing to do with the neural network. 01:24:09.520 |
So I've also found that it matters surprisingly little what kind of neural network architecture 01:24:18.300 |
you use, because I think that most of the hardness and the weirdness of the problem 01:24:21.960 |
comes from what the behavior space looks like, rather than what the actual numerical optimization problem looks like. 01:24:34.320 |
So there are many problems where the reward is only observed at the end of the task -- 01:24:41.960 |
in the terminal state of each episode. 01:24:45.280 |
And you don't see rewards in intermediate states. 01:24:48.120 |
So how much harder do these problems become for deep reinforcement learning in your experience? 01:24:54.560 |
Yeah, so if you don't get the reward until the end -- well, then 01:25:05.480 |
I don't have anything precise to say about that. 01:25:08.880 |
I think it's going to be harder if you have less frequent rewards -- if your rewards are further away. 01:25:14.520 |
So for example, in your video, for the last example of getting up and getting the head 01:25:19.600 |
above a certain height, for example, that could be one where you only get a plus 1 if 01:25:25.120 |
you're above and you don't get anything below. 01:25:28.160 |
But what you were doing was kind of shaped -- if you get your head higher, then you get more reward. 01:25:32.520 |
Yeah, so I think we came up with a reward like the squared distance from the target height, which made the problem a lot easier. 01:25:39.880 |
Yeah, the problem would have been a lot harder if you got zero reward until you got your head above the target height. 01:25:45.560 |
And that would actually be a problem of exploration, which is that you have to 01:25:51.920 |
explore all the different states to figure out where you're going to get good reward. 01:26:02.720 |
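As a hypothetical illustration of the difference being discussed, with made-up names and thresholds (the actual rewards used in the experiments may differ):

    def sparse_reward(head_height, target_height=1.0):
        """Reward only at success: +1 once the head is above the target height, else 0."""
        return 1.0 if head_height >= target_height else 0.0

    def shaped_reward(head_height, target_height=1.0):
        """Shaped alternative: penalize squared distance to the target height, so there is
        a learning signal even when the agent is far from succeeding."""
        return -(target_height - head_height) ** 2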
So I have a question about how you choose to discretize your space and time, because your 01:26:06.640 |
locomotion example is clearly a continuous system. 01:26:11.920 |
Yeah, so it's actually really important how you discretize time, like what time step you 01:26:16.840 |
use, because the algorithm does care about what the time step is. 01:26:25.600 |
It's not invariant to it, because you have discount factors, and you're also sampling a new action at every time step. 01:26:34.080 |
So if you choose too small of a time step, then the rewards will be delayed by more time steps. 01:26:47.560 |
And also, your exploration will be more like a random walk, because you're changing your action at every little time step. 01:26:56.840 |
And I'd say that's a flaw in current methods. 01:27:02.480 |
So with that, let's give John a thank you.