
Deep Reinforcement Learning (John Schulman, OpenAI)


Chapters

0:00 Introduction
0:48 What is Reinforcement Learning?
3:04 Business Operations
6:34 How Does RL Relate to Other ML Problems?
10:07 How Does RL Relate to Other Machine Learning Problems?
14:18 Recent Success Stories in Deep RL
20:57 Episodic Setting
21:56 Parameterized Policies
23:45 Policy Gradient Methods: Overview
28:05 Score Function Gradient Estimator: Intuition
31:33 Score Function Gradient Estimator for Policies: Now the random variable is a whole trajectory
35:01 Policy Gradient: Use Temporal Structure
37:01 Policy Gradient: Introduce Baseline
39:04 Discounts for Variance Reduction
44:34 Extension: Step Sizes and Trust Regions
57:52 Value Functions

Whisper Transcript

00:00:00.000 | So good morning, everyone.
00:00:06.080 | So I'm going to talk about some of the core methods in deep reinforcement learning.
00:00:12.740 | So the aim of this talk is as follows.
00:00:16.120 | First I'll do a brief introduction to what deep RL is and whether it might make sense
00:00:21.100 | to apply it in your problem.
00:00:24.280 | I'll talk about some of the core techniques.
00:00:28.940 | So on the one hand we have the policy gradient methods.
00:00:33.240 | Then on the other hand we have methods that learn a Q function, including Q-learning and
00:00:38.500 | SARSA.
00:00:40.960 | And I'll talk a little at the end about what are the pros and cons of these different methods.
00:00:48.240 | So first, what is reinforcement learning?
00:00:51.300 | It's a branch of machine learning concerned with taking sequences of actions.
00:00:55.740 | So often it's described in terms of an agent interacting with a previously unknown environment.
00:01:04.340 | And it's trying to maximize some kind of cumulative reward, some kind of reward function that
00:01:09.240 | we've defined, accumulated over time.
00:01:12.340 | And pretty much any kind of task where you have some kind of goal that you want to achieve
00:01:16.440 | can be stated in these terms.
00:01:18.740 | So this is an extremely general formulation.
00:01:25.060 | What's deep reinforcement learning?
00:01:26.540 | It's pretty simple.
00:01:27.700 | It's just reinforcement learning where you're using neural networks as function approximators.
00:01:34.820 | So the interesting thing about reinforcement learning in contrast to supervised learning
00:01:39.780 | is it's actually not totally obvious what you should use your neural network to approximate
00:01:44.660 | in reinforcement learning.
00:01:46.140 | And there are different kinds of algorithms that approximate different things.
00:01:49.300 | So one choice is to use the neural network to approximate your policy, which is how the
00:01:54.460 | agent chooses its actions.
00:01:57.020 | Another choice is to approximate the value functions, which measure how good or bad different
00:02:02.060 | states are or actions.
00:02:05.300 | And last, you can try to learn a model of the system, a dynamics model, which will make
00:02:12.500 | predictions about next states and rewards.
00:02:16.060 | Okay, so I'll now give a few examples of different places where you might apply reinforcement
00:02:23.900 | learning and what the observations and actions would be.
00:02:28.660 | So one example is robotics.
00:02:32.020 | So here you could imagine a robot where the observations are the camera images and the
00:02:37.180 | joint angles of the robot.
00:02:39.180 | The actions are the joint torques you're applying.
00:02:42.580 | And the reward is going to depend on what you want the robot to do.
00:02:48.500 | So this is something we, as the algorithm designer, get to define.
00:02:53.220 | So the rewards could be to stay balanced, to navigate to some target location, or something
00:02:59.580 | more abstract like serve and protect humans.
00:03:04.060 | So reinforcement learning has also been used in a lot of more practical applications.
00:03:10.980 | Well, applications that have been practical in the past.
00:03:14.460 | I think robotics will be very practical in the future.
00:03:18.860 | And for example, one area is inventory management.
00:03:25.460 | So this is just one example of how you could use reinforcement learning for a decision-making
00:03:29.660 | problem.
00:03:30.740 | So you have to decide how much to stock up on of every item.
00:03:37.020 | And your observations would be your current inventory levels.
00:03:40.780 | Actions would be how much of each item you're going to purchase.
00:03:43.860 | And reward is your profit.
00:03:47.780 | So people in operations research, this is a subfield, study this kind of problem a lot.
00:03:58.580 | There are also a lot of machine learning problems where people have started to apply reinforcement
00:04:03.100 | learning techniques.
00:04:06.220 | So one example is attention.
00:04:10.020 | So the idea in attention is you don't want to look at the whole input at once.
00:04:13.380 | You want to just focus on part of it.
00:04:15.620 | So one example of this is with a large image, you might want to just crop out part of it
00:04:21.500 | and use that and just do detection on that part of the image.
00:04:25.800 | So here, your observation would be your current image window.
00:04:30.500 | Action is where to look or where to crop your image.
00:04:33.780 | And reward is whether you make a classification error or not.
00:04:39.900 | So here, you have to try to choose the right area of the image to look at so you'll do
00:04:47.740 | the correct classification.
00:04:51.700 | Reinforcement learning has also been used in structured prediction problems, which in
00:04:58.700 | the past often weren't considered to be reinforcement learning problems.
00:05:03.340 | But it turns out that to actually properly solve them, it actually is a reinforcement
00:05:09.620 | learning problem.
00:05:10.980 | So machine translation, for example, so you get a sentence in the source language and
00:05:18.660 | you have to emit a sentence in the target language.
00:05:22.180 | And here, your observations are the sentence in the source language.
00:05:27.540 | You're emitting one word at a time in the target language.
00:05:30.620 | And you have some reward function that looks at the whole sentence and tells you how good
00:05:35.180 | your translation was.
00:05:37.020 | So since this is non-differentiable, you can't just differentiate through the whole thing
00:05:43.380 | and do gradient descent.
00:05:44.500 | So it turns out you can use a policy gradient method to optimize your translation system.
00:05:51.680 | So people have started to do that.
00:05:56.060 | So those are just a few examples, not exhaustive at all.
00:06:02.180 | But I just want to say a little bit about how reinforcement learning fits into the picture
00:06:11.980 | of all the other types of machine learning problems.
00:06:15.860 | So previous courses in this series have talked about supervised learning and unsupervised
00:06:23.660 | learning.
00:06:24.660 | So how does reinforcement learning relate to them?
00:06:27.740 | How is it different?
00:06:29.260 | So let's just first compare it to-- let's look at supervised learning.
00:06:33.900 | So in supervised learning, first, the environment samples an input-output pair from some distribution.
00:06:42.540 | The agent makes a prediction, y hat, using its function f.
00:06:48.980 | And it receives some loss, which tells it if it made the right prediction or the wrong
00:06:53.660 | prediction.
00:06:55.300 | So the interpretation is environment asks the agent a question and then tells her the
00:06:59.700 | right answer.
00:07:03.180 | So contextual bandits make this problem a little harder in that they give the learning
00:07:09.100 | agent a little bit less information.
00:07:11.700 | So now the environment samples an input, but notice that there's not a correct output associated
00:07:17.340 | with it.
00:07:19.180 | Then the agent takes an action, and the agent receives some cost, which is from some probability
00:07:27.860 | distribution.
00:07:28.860 | So here, c is the cost.
00:07:30.940 | We're sampling it from some probability distribution.
00:07:34.180 | And the agent doesn't know what this probability distribution is, so that's what makes the
00:07:37.220 | problem hard.
00:07:40.700 | So environment asks the agent a question, the agent answers, and the environment gives her
00:07:45.740 | a noisy score on the answer.
00:07:50.780 | So this is applied.
00:07:52.740 | This actually has a lot of applications.
00:07:54.320 | So personalized recommendations is one big one, along with advertising.
00:07:58.220 | So you have to decide, like, customers who like this -- I mean, you have a customer,
00:08:05.460 | and you know what they liked in the past, so you have to make a prediction about what
00:08:08.940 | they're going to like in the future.
00:08:10.620 | So you show them appropriate ads or links, like what book you want to try to advertise
00:08:17.580 | to them, or what video you want to show them, and so on.
00:08:22.920 | So here, the big difference between this and the supervised learning setting is you don't
00:08:26.500 | have access to the function, the loss function you're trying to optimize.
00:08:30.320 | So in particular, you can't differentiate through it.
00:08:33.100 | We don't know the process that generates c, so we can't compute the gradient of the loss
00:08:37.740 | function and use that to tune the agent's parameters.
00:08:41.580 | So that makes it -- so that makes the problem a bit harder, where you have to use a different
00:08:46.860 | kind of algorithm.
00:08:48.860 | Lastly, reinforcement learning is almost the same as the contextual bandit setting, except
00:08:54.860 | now the environment is stateful.
00:08:57.940 | So now, instead of sampling the initial state from scratch every time step, from the same
00:09:03.980 | distribution, the state evolves over time.
00:09:09.020 | So you have some transition probability distribution called p here, where the state x sub t is
00:09:16.100 | conditioned on the previous state and the previous action.
00:09:22.500 | And that makes the problem quite a bit harder, because now -- well, for a number of reasons.
00:09:27.920 | For one thing, the inputs you're getting depend on what actions you're taking.
00:09:32.120 | So now that makes it harder to develop a stable, reliable algorithm, because now as the agent
00:09:37.320 | starts to learn, it gets different inputs.
00:09:40.280 | So that can lead to all sorts of out-of-control behavior.
00:09:46.380 | And it also means you have delayed effects, because since the system is stateful, you
00:09:52.280 | might need to take a lot of actions to get into the right state.
00:09:55.600 | So you might need to -- you can't just act greedily every time step.
00:10:00.900 | You have to think ahead, effectively.
00:10:06.180 | Okay, so just to summarize these differences, there are two differences.
00:10:11.160 | The first one is, you don't have full analytic access to the function you're trying to optimize.
00:10:15.700 | You have to query it through interaction.
00:10:18.700 | Second, you're interacting with a stateful world, which means that the input you get
00:10:23.900 | is going to depend on your previous actions.
00:10:26.620 | And if you just take the first of those differences between supervised learning and reinforcement
00:10:31.620 | learning, you get the contextual bandit setting.
00:10:33.940 | So that's sort of halfway in between.
00:10:36.740 | Okay, so I realize that there are multiple -- this audience probably has people with
00:10:43.260 | different interests.
00:10:44.820 | Some people are doing research and want to know about what's the latest in the research
00:10:49.500 | world.
00:10:50.540 | And some people are -- want to apply these machine learning techniques to practical applications.
00:10:56.420 | So this slide is for the latter group of people.
00:11:00.580 | So if you're wondering -- if you have some problem where you think reinforcement learning
00:11:05.260 | might be relevant and you're wondering if you should apply reinforcement learning.
00:11:10.020 | So first, I should say that the answer might be no.
00:11:14.880 | It might be overkill, especially deep reinforcement learning.
00:11:18.360 | So this is a set of fairly new techniques where it's not going to work out of the box
00:11:22.140 | very well.
00:11:24.360 | And it's -- these techniques aren't that well established.
00:11:27.920 | So they require a lot of -- they have a lot of knobs to be tuned.
00:11:30.620 | So it might be overkill and, yeah, these techniques aren't that well established at the moment.
00:11:37.180 | So it might be worth investigating some other methods first.
00:11:42.180 | So one -- so if your problem has a small number of parameters you're trying to optimize over
00:11:48.800 | and you have a simulator that you can, like, just do lots of experiments on, then derivative-free
00:11:56.940 | optimization methods are likely to be better than reinforcement learning, or they're likely
00:12:01.860 | to be easier to get working.
00:12:04.240 | So these methods just look at -- they just -- you give them a black box function where
00:12:09.480 | you put in a parameter vector and it'll give you a noisy estimate of the score.
00:12:14.540 | And these algorithms will just optimize over the parameters of that black box -- I mean,
00:12:19.260 | that are being put into that black box.
00:12:22.180 | So yeah, there's a variety of different methods for derivative-free optimization, but these
00:12:27.980 | are easier to understand than reinforcement learning, and they do kind of work out of
00:12:31.980 | the box.
00:12:33.980 | Okay.
00:12:35.660 | A lot of problems are actually -- can be seen as -- are contextual bandit problems, and
00:12:41.580 | the statefulness of the world isn't that relevant.
00:12:44.420 | So for example, in advertising, this is where people -- people look at advertising as a
00:12:50.260 | contextual bandit problem most of the time, because you decide what ad to present the
00:12:54.420 | user with, and then they either click on it or they don't.
00:13:00.220 | But it's really -- the user is kind of stateful, because if you show them a terrible ad, they
00:13:04.980 | might just go and download ad block.
00:13:07.460 | So there is -- like, your actions do have some repercussions.
00:13:12.760 | But often you can just approximate it as being a contextual bandit problem where there is
00:13:17.300 | no state.
00:13:18.640 | So there's a better theoretical understanding of contextual bandit problems and methods
00:13:24.300 | that are -- that have some guarantees.
00:13:26.420 | So in that case -- so if it is a contextual bandit problem, you might want to use those
00:13:31.740 | kind of algorithms instead.
00:13:34.100 | And lastly, the operations research field has been using these methods for a while on
00:13:43.000 | real problems, and they have a set of methods which are just pretty much the basic algorithms,
00:13:51.480 | policy iteration and value iteration, but they're sort of well -- they're well-developed
00:13:57.320 | ways of doing feature engineering for these problems that end up working pretty decently.
00:14:01.560 | So these techniques are also worth considering instead of trying to throw a big neural network
00:14:06.240 | at it.
00:14:08.560 | Okay.
00:14:11.300 | So now -- well, now that I've talked about what -- why not to use deep reinforcement
00:14:15.200 | learning or what it's not good for, I'll just talk about some recent success stories in
00:14:21.200 | deep reinforcement learning, which are achievements that probably wouldn't have been possible
00:14:25.520 | using these other techniques.
00:14:27.920 | So a few years ago, there was a pretty influential result by Mnih et al. from DeepMind where they
00:14:37.640 | used a deep Q-learning algorithm to play Atari games using the screen images as input.
00:14:47.800 | And that's hard because you have these -- these games are -- you're trying to do different
00:14:51.920 | things in all these games, and they're -- some of them are kind of complicated.
00:14:54.620 | So it's pretty remarkable that you can just use a simple -- that a simple algorithm can
00:14:59.760 | solve them all.
00:15:01.880 | The same algorithm can solve them all.
00:15:04.260 | So since then, people have also solved -- or solved this domain using policy gradients
00:15:11.760 | and another algorithm called DAgger.
00:15:15.640 | So another big groundbreaking result was beating a champion-level player at Go, also by DeepMind,
00:15:26.500 | using a combination of supervised learning from, like, from expert games, plus policy
00:15:33.540 | gradients to fine-tune the supervised learning policy, plus Monte Carlo tree search, plus
00:15:41.220 | value functions to make the search work better.
00:15:44.340 | So a combination of techniques and reinforcement learning.
00:15:48.040 | Robotic -- so some of my colleagues at Berkeley had some very nice results learning in real
00:15:56.080 | time how to do manipulation tasks using an algorithm called guided policy search using
00:16:03.160 | the PR2 robot.
00:16:06.920 | And some of my colleagues and I have been working on robotic locomotion using policy
00:16:13.160 | gradient methods.
00:16:17.080 | And people have been working on locomotion for a while and have been able to achieve
00:16:21.720 | pretty good results using very, like, highly engineered domain-specific methods.
00:16:27.320 | But previously, there hadn't been much success using general methods to solve it.
00:16:34.640 | And last, there have been some recent results playing 3D games using policy gradients.
00:16:40.800 | In fact, there was even a contest I heard about a couple days ago with this new ViZDoom
00:16:45.320 | task, which is pretty nice.
00:16:49.600 | So you might want to check out ViZDoom.
00:16:52.480 | Okay.
00:16:54.160 | So that's it for the high-level overview part of this.
00:16:58.680 | Now I'm going to start getting into the actual formalism and the technical details.
00:17:04.640 | Okay, so the basic object in the field of reinforcement learning is the Markov decision
00:17:13.600 | process.
00:17:15.120 | So the Markov decision process is defined by the following components.
00:17:20.200 | You have a state space.
00:17:21.740 | This is all the different states of the system.
00:17:25.040 | The action space, these are all the actions the agent can take.
00:17:28.600 | And you have this probability distribution, which determines the probability of next state
00:17:36.120 | and reward.
00:17:37.120 | So R is the reward, S prime is the next state, S and A are the actions.
00:17:42.200 | So it's a conditional probability distribution.
00:17:44.200 | Sometimes people split this out into a separate reward function, but that's basically an equivalent
00:17:49.440 | formulation.
00:17:50.960 | Okay.
00:17:52.300 | And sometimes there are some extra objects defined.
00:17:57.900 | We'll be interested in the...
00:17:59.440 | We'll consider an initial state distribution.
00:18:02.020 | So this is the world starts out in a certain state.
00:18:07.180 | And the typical optimization problem you want to solve given this MDP is to maximize expected
00:18:12.780 | cumulative reward.
00:18:14.380 | So there are various ways of defining that more precisely, which I'll go into later.
00:18:21.540 | Okay.
00:18:24.260 | So there are various different settings of reinforcement learning where you define a
00:18:29.420 | slightly different optimization problem.
00:18:31.660 | The one we'll be most concerned with is called the episodic setting.
00:18:35.600 | So here the agent's experience is split up into a series of episodes which have finite
00:18:42.740 | length.
00:18:44.200 | So in each episode, we first sample the initial state of the world from some probability distribution.
00:18:52.060 | And then the agent keeps on acting until the world ends up in some terminal state.
00:19:00.880 | So just to give an example of what terminal states might be like and how an episodic reinforcement
00:19:07.740 | learning problem might look.
00:19:09.780 | So one example is when termination is good and you want to terminate the episode as fast
00:19:15.420 | as possible.
00:19:16.820 | So if we imagine setting up a task with some kind of taxi robot that should get to the
00:19:21.420 | destination as fast as possible, then the episode would be like one trip and it's trying
00:19:28.360 | to terminate the episode as fast as possible.
00:19:33.300 | Another example is a waiter robot where you have a fixed length shift, but the waiter
00:19:39.380 | has to accumulate, it has to do as well as possible during that shift.
00:19:43.200 | So there the episode has a fixed length.
00:19:46.000 | The waiter has to say maximize tips or customer happiness.
00:19:52.840 | And then you could imagine another kind of task where termination is bad and you want
00:19:57.140 | the episode to last as long as possible.
00:20:00.060 | So you can view life as an example of that.
00:20:03.740 | But also you could imagine having a walking robot where you want it to walk as far as
00:20:10.860 | possible before it falls over.
00:20:15.940 | And in this setting, it's pretty easy to define what the goal is.
00:20:21.620 | We just want to maximize the expectation of the total reward per episode.
00:20:29.820 | And the last object we're going to introduce here is a policy.
00:20:34.300 | So the policy is just the function that the agent uses to choose its actions.
00:20:40.220 | So we have deterministic policies, which are just the policy is denoted by pi.
00:20:45.380 | So we have the action is just some function of the state.
00:20:48.780 | And we also have stochastic policies where the policy is a conditional probability distribution.
00:20:57.180 | So we're just going to make a little bit more precise the setting of the episodic MDP.
00:21:04.920 | So first we sample the initial state from this distribution, mu.
00:21:11.820 | Then we sample the first action from the policy, a0 from the policy.
00:21:16.780 | Then we sample next state and reward from the transition probability distribution, and
00:21:21.460 | so on, until we reach a terminal state, s_T.
00:21:25.080 | And then the quantity we care about is the sum of all these rewards, r_0 + r_1 + ...
00:21:31.380 | + r_{T-1}.
00:21:33.940 | And we want to maximize.
00:21:36.560 | So eta of pi is just defined as the expected total reward of the policy pi.
00:21:46.100 | Here's the picture that illustrates exactly the same thing.
00:21:49.340 | So you can look at it as a graphical model.
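To make this sampling process concrete, here is a minimal Python sketch of rolling out one episode and computing its total reward. The `env` and `policy` objects and their `reset`/`step`/`act` methods are hypothetical stand-ins for whatever environment and policy you are using, not a specific library's API.

```python
def run_episode(env, policy):
    """Sample one episode and return its total reward.

    Assumed (hypothetical) interfaces: env.reset() -> initial state s_0 ~ mu,
    env.step(action) -> (next_state, reward, done); policy.act(state) -> action.
    """
    state = env.reset()                          # s_0 ~ mu
    total_reward, done = 0.0, False
    while not done:
        action = policy.act(state)               # a_t ~ pi(. | s_t)
        state, reward, done = env.step(action)   # (s_{t+1}, r_t) ~ P(. | s_t, a_t)
        total_reward += reward
    return total_reward                          # r_0 + r_1 + ... + r_{T-1}
```

Averaging this total reward over many episodes gives a Monte Carlo estimate of eta(pi).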
00:21:56.340 | And lastly, in the policy gradient section in particular, we're going to be interested
00:22:01.280 | in parameterized policies.
00:22:03.420 | So here we have a parameter vector, theta, which specifies exactly what the policy is.
00:22:11.020 | So for example, the family of policies could be just a neural net.
00:22:15.760 | You have a certain neural network architecture.
00:22:18.020 | And theta specifies all the weights of this neural network.
00:22:24.100 | So we could have a deterministic policy, of course, or a stochastic policy.
00:22:30.560 | And if you're wondering concretely what would a policy look like, I mean, how do you use
00:22:35.620 | a neural network to represent your policy?
00:22:38.100 | It's actually exactly, you do exactly the same thing you would do if this were a classification
00:22:42.700 | or a regression problem.
00:22:44.500 | So S here, the state here is your input, and the action is your output.
00:22:51.580 | So if you have a discrete action space, a discrete set of actions, then you would use
00:22:58.220 | a network that outputs a vector of probabilities, the probabilities of the different actions.
00:23:03.260 | This is exactly like a classifier.
00:23:06.900 | And if you have a continuous action space, you would have your neural network output
00:23:11.500 | the mean and the diagonal of a covariance matrix of a Gaussian distribution.
00:23:17.220 | So this is just like you're doing regression.
00:23:19.780 | So you can use the same kind of architectures you'd use in supervised learning.
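As a sketch of what this could look like in code, here is one way to write the two policy parameterizations just described in PyTorch. The hidden layer size and architecture are arbitrary choices for illustration, not the ones used in the talk.

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """Discrete actions: the network outputs a probability for each action,
    exactly like a classifier. Layer sizes here are illustrative."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class GaussianPolicy(nn.Module):
    """Continuous actions: the network outputs the mean of a Gaussian, with a
    learned diagonal log-standard-deviation, exactly like a regression model
    plus noise."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.mean_net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                      nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return torch.distributions.Normal(self.mean_net(obs), self.log_std.exp())
```

Sampling an action is then `policy(obs).sample()`, and the log-probabilities needed later for the policy gradient come from `dist.log_prob(action)`.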
00:23:25.100 | Okay, so that's it for the formalism of MDPs.
00:23:32.760 | So now I'm going to go into policy gradient methods, which are one broad and general class
00:23:39.200 | of reinforcement learning methods, which are quite effective.
00:23:45.260 | So to give a brief overview of this, here's the intuition of what policy gradient methods
00:23:52.740 | are going to do.
00:23:54.800 | So here, capital R means the sum of rewards of the whole episode.
00:23:59.780 | So our optimization problem is we want to maximize the expectation of the total reward
00:24:05.960 | given our parameterized policy, pi sub theta.
00:24:10.680 | And the intuition of how our algorithm is going to work is we're going to collect a
00:24:15.880 | bunch of trajectories.
00:24:16.880 | I mean, this is just run a bunch of episodes using our policy.
00:24:21.160 | And then we want to make the good trajectories more probable.
00:24:24.320 | So I mean, some of the trajectories were lucky and they were really good.
00:24:28.120 | Some of them, the agent was unlucky and they were bad.
00:24:30.980 | And the ones that were good, meaning there was high reward, that means the agent
00:24:35.920 | probably took good actions there.
00:24:38.040 | So we want to increase the probability of the actions from those trajectories.
00:24:44.260 | So the most basic version of policy gradient methods just try to make the good trajectories
00:24:48.920 | more probable without trying to figure out which were the good actions and which were
00:24:52.280 | the bad actions.
00:24:54.360 | Slightly better methods or more elaborate methods try to figure out which actions were
00:24:59.640 | good and which ones were bad.
00:25:01.320 | And then they try to make the good actions more probable.
00:25:05.800 | And lastly, there's another class of methods which actually try to push the actions towards
00:25:11.640 | better actions.
00:25:13.160 | So they differentiate the loss function with respect to the actions and they try to push
00:25:16.960 | the actions to better actions.
00:25:19.880 | So we're mostly going to talk about one and two here.
00:25:23.640 | Oh, there's a question?
00:25:29.920 | Oh, yeah, good question.
00:25:34.300 | So while we're maximizing over the policy, we're trying to find the best policy.
00:25:41.440 | But here, the policy is assumed to be parameterized.
00:25:44.880 | So there's some parameter vector theta that specifies the policy.
00:25:48.220 | And now we just want to maximize with respect to theta.
00:25:53.020 | Any other questions?
00:25:59.000 | So there's a very fundamental concept, which is called the score function gradient estimator,
00:26:06.620 | which underlies policy gradient methods.
00:26:10.860 | So actually, to introduce this, we're not going to talk about policies in RL at all.
00:26:15.120 | We're just going to assume we have some expectation.
00:26:18.320 | We have expectation of f of x, where x is sampled from some parameterized probability
00:26:24.920 | distribution.
00:26:27.320 | So we want to compute the gradient of this expectation with respect to theta.
00:26:32.340 | So there's a general formula that'll do this.
00:26:35.440 | And the way you derive it is you just write the expectation as an integral.
00:26:40.460 | And then you just move some things around.
00:26:44.240 | You swap the integral with the derivative and you turn it back into an expectation.
00:26:50.040 | And what you get at the end is this bottom line, which says that you take the expectation
00:26:55.040 | of function value times grad log probability.
00:27:00.360 | So this is an unbiased estimator of the gradient, meaning if we get enough samples, it'll converge
00:27:06.800 | on the right thing.
00:27:08.660 | So the way you can compute this estimator, meaning the way you can get a noisy estimate
00:27:13.780 | of the gradient of the expectation, is you just get one sample from this distribution.
00:27:22.900 | And then you multiply f of x times grad log probability.
00:27:29.820 | So the only requirement for being able to use this estimator is we need to be able to
00:27:35.860 | compute the probability density.
00:27:38.360 | We need to be able to analytically compute it.
00:27:40.900 | And we need to be able to differentiate it with respect to theta.
00:27:44.400 | And often it needs to be differentiable.
00:27:51.100 | There's another way of deriving it using importance sampling.
00:27:54.940 | So you write down the importance sampling estimator for the expectation.
00:27:58.380 | And then you just swap the derivative with the expectation and you get the same thing.
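Written out, the derivation just described (with the swap of the gradient and the integral assumed valid) looks like this:

```latex
\nabla_\theta \,\mathbb{E}_{x \sim p(\cdot \mid \theta)}[f(x)]
  = \nabla_\theta \int p(x \mid \theta)\, f(x)\, dx
  = \int \nabla_\theta p(x \mid \theta)\, f(x)\, dx
  = \int p(x \mid \theta)\, \nabla_\theta \log p(x \mid \theta)\, f(x)\, dx
  = \mathbb{E}_{x \sim p(\cdot \mid \theta)}\big[ f(x)\, \nabla_\theta \log p(x \mid \theta) \big]

\hat{g} = f(x_i)\, \nabla_\theta \log p(x_i \mid \theta), \qquad x_i \sim p(\cdot \mid \theta)
```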
00:28:05.740 | So now let me just give a little bit of intuition about this estimator.
00:28:13.060 | So f of x is measuring how good the sample x is.
00:28:17.500 | So that means that-- so g hat here is our gradient estimator, meaning this is what we
00:28:22.700 | get if we take one sample x sub i and we compute our estimator.
00:28:27.780 | This is our estimate of the gradient.
00:28:30.220 | So if we move in direction g hat, that pushes up the log probability of our sample x sub
00:28:37.100 | i in proportion to how good it is.
00:28:39.920 | So if we have really good-- if we got a really good function value, then we're going to try
00:28:44.540 | to push up its log probability a lot.
00:28:46.940 | And if it was a bad function value, then we're not going to try to push it up very much.
00:28:51.140 | So pretty simple intuition.
00:28:56.300 | The really nice thing is this is valid even if f of x is discontinuous or if f of x is
00:29:02.780 | unknown, meaning you only-- you don't get to differentiate it, you just get to see the
00:29:07.300 | function values.
00:29:09.580 | Or the sample space is a discrete set, so x doesn't even have to be continuous.
00:29:15.740 | And this is quite remarkable, actually, that you don't even need to have access to the
00:29:21.780 | full-- you don't need to know exactly what the function is that you're optimizing.
00:29:26.820 | You just have to be able to query it for the function value.
00:29:32.500 | And this means-- this is a way of being able to differentiate functions through a system
00:29:40.500 | that has non-differentiable pieces.
00:29:42.960 | So for example, in robotic locomotion, one issue is that you have contacts between the
00:29:49.620 | robot's foot and the ground.
00:29:51.540 | And you make and break contact, and that causes a discontinuous change in the dynamics.
00:29:58.720 | So that makes it really hard to do smooth optimization techniques to come up with the
00:30:02.500 | right behavior.
00:30:03.820 | So when you use this kind of gradient estimator, along with policy gradients, which I'm going
00:30:08.460 | to talk about very soon, you can actually just differentiate-- you can optimize the
00:30:15.220 | system even though it has non-differentiable pieces in it.
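As a quick sanity check of that claim, here is a small, self-contained Python example (purely illustrative, not from the talk) that estimates the gradient of an expectation of a discontinuous step function using only function evaluations:

```python
import numpy as np

def score_function_gradient(f, theta, n_samples=200_000, seed=0):
    """Estimate d/d(theta) of E_{x ~ N(theta, 1)}[f(x)] from samples only.

    For x ~ N(theta, 1), grad_theta log p(x | theta) = (x - theta), so the
    estimator is the sample mean of f(x) * (x - theta).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=theta, scale=1.0, size=n_samples)
    return np.mean(f(x) * (x - theta))

# A discontinuous "reward": 1 if x > 0, else 0.
step = lambda x: (x > 0).astype(float)

# E[step(x)] = P(x > 0) = Phi(theta), so the true gradient at theta = 0 is
# the standard normal density at 0, about 0.399.
print(score_function_gradient(step, theta=0.0))  # roughly 0.40
```

The estimate converges to the true value even though f is discontinuous, which is exactly the property that makes this estimator useful for systems with contacts and other non-smooth pieces.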
00:30:22.340 | So here's another little picture of what's going on.
00:30:27.900 | So we have our function f of x, which we're trying to maximize the expectation of.
00:30:33.980 | And then we have our probability density, p of x.
00:30:37.220 | So we just sample a bunch of values from our probability density.
00:30:40.840 | Those are the blue dots on the x-axis.
00:30:43.320 | And then we look at the function values.
00:30:50.660 | And we're trying to push the probability distribution so that the probability goes up at these samples
00:30:58.300 | in proportion to the function value.
00:31:01.020 | So over on the right side of the curve, that means we're trying to push that probability
00:31:06.900 | value up really hard.
00:31:08.540 | And on the left side, we're pushing it up softly.
00:31:11.160 | So what's going to happen is the probability density is going to slide to the right.
00:31:15.340 | If you can imagine a sort of physical analogy there.
00:31:21.900 | So that's the score function gradient estimator.
00:31:24.140 | This is a general technique.
00:31:26.300 | It can be used in various machine learning problems.
00:31:32.500 | Now we're going to apply it to the reinforcement learning setting.
00:31:36.580 | And we're going to take our random variable x to be a whole trajectory.
00:31:42.960 | So the trajectory consists of state action reward, state action reward, and so on until
00:31:47.220 | the end of the episode.
00:31:50.180 | And now to get our gradient estimator, to get the gradient of the expected reward, all
00:31:59.180 | we've got to do is compute the grad log probability times the total reward.
00:32:07.540 | So this probability of the trajectory, that sounds like a really unfriendly quantity because
00:32:12.020 | there's a long, complicated process that generates this trajectory with lots of time steps.
00:32:20.020 | But log-- OK, so we can write out what this process is, what this probability density is.
00:32:27.000 | So we have-- it's just a product of probabilities.
00:32:29.980 | We've got our initial-- we've got our mu of s0, which is just our initial state distribution.
00:32:36.280 | And then every time step, we have-- we sample the action according to pi.
00:32:40.700 | And we sample the next state and reward according to our dynamics model.
00:32:45.300 | So log turns that product into a sum.
00:32:50.120 | And here's the cool part.
00:32:53.100 | Anything that doesn't contain theta drops out.
00:32:57.300 | So the thing is, we didn't know-- there are parts of this probability distribution, p
00:33:03.360 | of tau given theta, that we don't have access to.
00:33:06.260 | So if this is reinforcement learning, we don't assume that we know the dynamics model of
00:33:10.500 | the system.
00:33:11.500 | We just find out about it by sampling-- by doing episodes.
00:33:19.980 | So since this product turns into a sum, all the pieces, like the log p there and the log
00:33:28.100 | mu, which we don't know, just drop out.
00:33:31.300 | So it doesn't matter.
00:33:34.740 | And what we get in the end is we get a sum of log probabilities of actions.
00:33:42.340 | So grad log pi of action given state.
00:33:47.340 | So our formula looks like-- our formula for the gradient of the expectation is just the
00:33:53.340 | expectation over trajectories of total reward of the trajectory times grad of the sum of
00:34:01.540 | all the log probs.
00:34:06.320 | So the interpretation of this is we're taking our good trajectories and we're trying to
00:34:12.300 | increase their probability in proportion to how good they are.
00:34:16.220 | And you can think of this as being similar to supervised learning, where we treat the
00:34:20.620 | good trajectories with high rewards as positive examples in our supervised learning problem.
00:34:26.660 | So we're using those to train the policy on which actions are good.
00:34:30.940 | We're basically treating those actions as positive examples.
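In symbols, this is the step where the unknown dynamics and initial-state terms drop out. Writing tau for a whole trajectory:

```latex
\log p(\tau \mid \theta)
  = \log \mu(s_0)
  + \sum_{t=0}^{T-1} \Big[ \log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1}, r_t \mid s_t, a_t) \Big]

\nabla_\theta \log p(\tau \mid \theta)
  = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)

\nabla_\theta \,\mathbb{E}_\tau[R]
  = \mathbb{E}_\tau\!\Big[ R \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big],
  \qquad R = r_0 + r_1 + \cdots + r_{T-1}
```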
00:34:35.820 | OK, now we can improve this formula a little bit.
00:34:41.900 | So that was just the most basic-- I mean, this is an unbiased estimate.
00:34:46.180 | This is an estimator for the policy gradient.
00:34:47.940 | So if we just take that expression inside the expectation on the right-hand side and
00:34:52.940 | we take one sample of that, it has the right mean.
00:34:56.020 | So if we just get enough of them, we're going to get the policy gradient.
00:35:02.580 | But we can also write down some other formulas that have the same mean but have lower variance.
00:35:07.820 | So we can come up with better estimators for the policy gradient.
00:35:11.620 | And that's actually quite important because the one from the previous slide is really
00:35:15.060 | bad when you have a large number of time steps, meaning it has really high variance.
00:35:22.500 | So the first thing we can do is we can use the temporal structure of the problem.
00:35:28.900 | By the way, to derive these next bunch of formulas, it just takes a bunch of really
00:35:33.300 | straightforward manipulation where you move around expectations.
00:35:37.220 | And I'm not going to go through all the math.
00:35:40.060 | But I'll just say what the formulas are.
00:35:45.420 | So we can repeat the same argument from the previous slide to just derive the gradient
00:35:51.740 | estimator for a single reward term.
00:35:54.520 | So we end up with that reward term times the grad sum of log probs.
00:36:00.260 | And just summing over that, we get a new formula where we're not multiplying the sum of the
00:36:07.460 | grad log prob of the whole thing times the sum of all rewards.
00:36:12.580 | So let's look at that bottom formula.
00:36:15.220 | Now we have a sum over time of grad log probability of the action at that time times the sum of
00:36:21.580 | future rewards.
00:36:24.400 | So now, I mean, in the formula from the previous slide, we would have had all the rewards in
00:36:28.420 | that sum.
00:36:30.660 | But now we just have the future rewards.
00:36:32.980 | And that kind of makes sense because an action can't affect the probability of the previous
00:36:38.980 | rewards.
00:36:39.980 | So to figure out if the action is good, we should only be looking at the future rewards.
00:36:47.340 | So this is a slightly better formula than the one on the previous slide, meaning it
00:36:51.180 | has the exact same mean except the expression inside the expectation there has lower variance.
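The temporal-structure version just described, where each action only gets credit for the rewards that come after it, is:

```latex
\nabla_\theta \,\mathbb{E}_\tau[R]
  = \mathbb{E}_\tau\!\Big[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \sum_{t'=t}^{T-1} r_{t'} \Big]
```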
00:37:01.340 | And we can further reduce the variance by introducing a baseline.
00:37:06.500 | So now we can take any old function, B, which takes in a state and it outputs a real number.
00:37:13.620 | And we can subtract it from our sum of future rewards.
00:37:19.500 | And we didn't affect the mean of the estimator at all.
00:37:24.180 | So we didn't change the expectation at all by introducing this baseline.
00:37:33.260 | So yeah, for any choice of B, this gives us an unbiased estimator.
00:37:36.980 | By the way, if you're not that familiar with the terminology of estimators, what I'm saying
00:37:41.340 | is we have an expectation on the right-hand side of that formula.
00:37:48.980 | And the quantity inside that expectation is what's called the estimator.
00:37:52.900 | And if we get a bunch of samples, then we can get an estimate of the thing on the left-hand
00:37:59.180 | side, which is what we care about.
00:38:04.020 | So when I say it's an unbiased estimator, that just means that this equation is correct,
00:38:09.740 | meaning that the thing on the right-hand side equals the thing on the left-hand side.
00:38:14.740 | So yeah, this works for any choice of baseline.
00:38:17.660 | And a near optimal choice is to use the expected return.
00:38:22.440 | So the expected sum of future rewards.
00:38:27.180 | And the interpretation of that is if we took an action, we only want to increase the probability
00:38:34.180 | of the action if it was a good action.
00:38:36.660 | So how do we tell if it was a good action?
00:38:38.660 | Well, the sum of rewards after that action should have been better than expected.
00:38:43.840 | So the B of S is the expected sum of rewards.
00:38:47.780 | And we're just taking the difference between the measured thing and the expected thing.
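With the baseline subtracted, the estimator becomes the following, for any function b of the state, with the near-optimal choice being the expected sum of future rewards:

```latex
\nabla_\theta \,\mathbb{E}_\tau[R]
  = \mathbb{E}_\tau\!\Big[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
      \Big( \sum_{t'=t}^{T-1} r_{t'} - b(s_t) \Big) \Big],
  \qquad b(s_t) \approx \mathbb{E}\Big[ \sum_{t'=t}^{T-1} r_{t'} \,\Big|\, s_t \Big]
```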
00:38:56.560 | So that was a pretty key thing for variance reduction.
00:39:03.120 | And I'm going to introduce one last variance reduction technique.
00:39:07.120 | And actually, all three of these are really important.
00:39:09.200 | So basically, nothing's going to work except for maybe really small-scale problems unless
00:39:14.360 | you do these things.
00:39:16.100 | So the last variance reduction technique is to use discounts.
00:39:20.760 | So the discount factor ignores delayed effects between actions and rewards.
00:39:27.800 | So what we're going to do here looks kind of like a hack, but there's an explanation
00:39:32.400 | for it, which is instead of taking the sum of rewards, we're going to take a discounted
00:39:37.720 | sum of rewards.
00:39:38.720 | Meaning that we add this exponential factor, gamma.
00:39:44.320 | So when we're multiplying the grad log probability by some future reward, we multiply it by some
00:39:53.320 | quantity that decays with time.
00:39:55.560 | So people typically use gamma equals 0.99 or gamma equals 0.95.
00:40:01.280 | So that means if you use 0.99, that means after 100 time steps, you're going to be reducing
00:40:09.820 | the reward by a factor of 1 over e.
00:40:14.240 | So you're exponentially decaying the effect of the future rewards.
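To spell out the arithmetic behind the 1/e remark: a reward k steps in the future is weighted by gamma^k, and

```latex
0.99^{100} = e^{100 \ln 0.99} \approx e^{-1.005} \approx 0.37 \approx 1/e
```

so roughly 1/(1 - gamma) time steps is the effective horizon that the discounted objective pays attention to.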
00:40:20.540 | And the intuition is that the action shouldn't affect rewards really far in the future.
00:40:27.240 | Like the assumption is that the system doesn't have really long-term memory and it sort of
00:40:35.000 | resets.
00:40:38.360 | The effects aren't that far delayed.
00:40:40.500 | So you can just ignore the interaction between action and a reward way in the future.
00:40:49.260 | That's the intuition.
00:40:51.000 | So now instead of taking the baseline to be the expected sum of future rewards, we want
00:40:54.800 | to do a discounted sum.
00:40:56.560 | So now we're measuring if the action was better than expected according to the discounted sum of rewards.
00:41:06.600 | And now there's a more general class of formulas that looks like the one that I just wrote.
00:41:10.580 | So this one that's on the top of the slide is pretty good.
00:41:13.840 | And this is almost as good as anything you're going to do to within a small constant factor.
00:41:19.880 | But there's a more general class of formulas that look like grad log probability times
00:41:28.800 | some quantity a hat, which we call the advantage estimate.
00:41:32.520 | And this is in general just going to be an estimate of, it has a more precise definition,
00:41:39.480 | which is how much was this action better than the average action taken by the policy.
00:41:48.720 | But informally, this just means how much better was the action than expected.
00:41:55.240 | And this formula makes a lot of sense because we want to increase the probability of the
00:41:58.680 | good actions and decrease the probability of the bad ones.
00:42:02.000 | So we should increase it in proportion to the goodness of the action.
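As a concrete sketch (plain NumPy, with made-up numbers) of how the discounted returns and the advantage estimates described above might be computed for one episode:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantage_estimates(rewards, baselines, gamma=0.99):
    """A_hat_t = (discounted return from t) - b(s_t)."""
    return discounted_returns(rewards, gamma) - np.asarray(baselines)

# Illustrative episode: no reward until the last step.
rewards   = [0.0, 0.0, 1.0]
baselines = [0.3, 0.4, 0.9]   # predictions from some baseline / value function
print(advantage_estimates(rewards, baselines))   # approx [0.68, 0.59, 0.10]
```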
00:42:10.360 | So just to summarize, so I just told you there's this gradient estimator, meaning there's this
00:42:14.760 | expression you can compute, which gives you a noisy estimate of the policy gradient.
00:42:18.960 | So how do you actually turn this into an algorithm?
00:42:21.560 | So this is silly, algorithm seven.
00:42:27.560 | So here's what the algorithm looks like.
00:42:29.080 | It's pretty much what you'd expect.
00:42:32.680 | You take your policy, you initialize your policy parameter and your baseline function.
00:42:40.160 | For each iteration, you execute the current policy to get a bunch of whole episodes, meaning
00:42:49.280 | whole trajectories.
00:42:50.920 | And each time step in each trajectory, you should compute the return, meaning the sum
00:42:56.140 | of rewards following that time step, the sum of discounted rewards, and the advantage estimate,
00:43:01.120 | which is the sum of discounted rewards minus the baseline.
00:43:05.800 | Then you refit the baseline by trying to make the baseline function equal the returns.
00:43:12.360 | And then you update the policy using a policy gradient estimator.
00:43:16.680 | So you're just doing SGD while updating the baseline as you go along.
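Here is a compact sketch of that loop in PyTorch, written against the hypothetical env interface and the policy/baseline modules sketched earlier (policy(obs) returns a torch distribution, baseline(obs) returns a value prediction). The hyperparameters are placeholders; this shows the shape of the algorithm rather than a tuned implementation.

```python
import torch

def vanilla_policy_gradient(env, policy, baseline, iterations=100,
                            episodes_per_batch=16, gamma=0.99, lr=1e-2):
    """Vanilla policy gradient with a learned baseline (a sketch).

    Assumed (hypothetical) interfaces: env.reset() -> obs,
    env.step(action) -> (obs, reward, done); policy(obs_tensor) returns a
    torch.distributions object; baseline(obs_tensor) returns a value estimate.
    """
    pi_opt = torch.optim.Adam(policy.parameters(), lr=lr)
    b_opt = torch.optim.Adam(baseline.parameters(), lr=lr)

    for _ in range(iterations):
        logps, all_obs, returns = [], [], []

        # 1. Execute the current policy to collect a batch of whole episodes.
        for _ in range(episodes_per_batch):
            obs, done, rewards = env.reset(), False, []
            while not done:
                o = torch.as_tensor(obs, dtype=torch.float32)
                dist = policy(o)
                action = dist.sample()
                # .sum() gives the joint log-prob for multi-dimensional actions.
                logps.append(dist.log_prob(action).sum())
                all_obs.append(o)
                obs, r, done = env.step(action.numpy())
                rewards.append(r)
            # 2. Discounted return R_t at every time step of this episode.
            running, ep_returns = 0.0, []
            for r in reversed(rewards):
                running = r + gamma * running
                ep_returns.append(running)
            returns.extend(reversed(ep_returns))

        obs_t = torch.stack(all_obs)
        ret_t = torch.as_tensor(returns, dtype=torch.float32)

        # 3. Advantage estimate: return minus the baseline prediction.
        adv = ret_t - baseline(obs_t).squeeze(-1).detach()

        # 4. Policy gradient step: push up log-probabilities of actions
        #    in proportion to their advantage.
        pi_opt.zero_grad()
        (-(torch.stack(logps) * adv).mean()).backward()
        pi_opt.step()

        # 5. Refit the baseline to the empirical returns by regression.
        b_opt.zero_grad()
        ((baseline(obs_t).squeeze(-1) - ret_t) ** 2).mean().backward()
        b_opt.step()
```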
00:43:25.520 | So that's the vanilla policy gradient algorithm.
00:43:30.200 | And this is, I'll briefly talk, this has been used to obtain some pretty good results.
00:43:37.040 | So it's not that bad of an algorithm, but there are several different directions that
00:43:43.480 | it can be improved.
00:43:47.440 | So one issue that you run into is with step sizes.
00:43:52.760 | So in supervised learning, step sizes aren't that big of a deal.
00:43:57.960 | Because maybe you take too big of a step, but that's OK.
00:44:02.760 | You'll fix it the next update.
00:44:04.360 | And your current function, your current classifier, for example, doesn't affect what inputs you're
00:44:09.880 | getting.
00:44:10.880 | So even if you just are doing really, even if your network is just thrashing around for
00:44:16.800 | a while because you're taking too big steps, that's not going to cause any problems.
00:44:22.800 | But yeah, so step sizes aren't that big of a deal.
00:44:28.720 | You can just anneal them.
00:44:30.600 | You can start off with a large step size and anneal them down to zero.
00:44:33.700 | And that works pretty well.
00:44:36.720 | In reinforcement learning, if you take too big of a step, you might wreck your policy.
00:44:41.480 | And even if you don't actually change the network that much, so you don't lose all your
00:44:45.640 | nice features, you might just change its behavior a little too much.
00:44:50.680 | And now it's going to do something totally different and visit a totally different part
00:44:53.840 | of state space.
00:44:55.940 | So since in reinforcement learning, the system is stateful and your state distribution depends
00:45:00.440 | on your policy, that brings a really different problem.
00:45:06.760 | And now, after you took that step, the next batch of data you're going to get was collected
00:45:11.720 | by the bad policy.
00:45:13.280 | And now you're never going to recover because you just forgot everything.
00:45:17.760 | Yeah.
00:45:19.440 | So, one way that my colleagues and I, well, one way to fix this is to try to basically
00:45:29.920 | try to stop the policy from taking too big of a step.
00:45:33.120 | So you can look at the KL divergence between the old policy and the new policy, like before
00:45:39.680 | the update and after the update, and make sure that the distributions aren't that different.
00:45:44.640 | So you're not taking too big of a step.
00:45:46.400 | It's kind of an obvious thing to do.
00:45:48.200 | So my colleagues and I developed an algorithm called trust region policy optimization, which
00:45:53.320 | looks at the action distributions and tries to make sure the KL divergence isn't too large.
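Schematically, the trust region update solves a constrained problem of this form, where the surrogate objective L is built from the sampled advantages and delta is the step-size hyperparameter:

```latex
\max_\theta \;\; \mathbb{E}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t \right]
\quad \text{subject to} \quad
\mathbb{E}_t\!\left[ D_{\mathrm{KL}}\!\big( \pi_{\theta_{\text{old}}}(\cdot \mid s_t)\, \big\|\, \pi_\theta(\cdot \mid s_t) \big) \right] \le \delta
```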
00:46:00.040 | And this is very closely related to previous natural policy gradient methods, which are
00:46:08.320 | based on-- which are doing something similar, but usually it's not set up as a hard constraint
00:46:15.040 | on the KL divergence.
00:46:20.920 | So another type of extension of policy gradient methods is to do more-- to use value functions
00:46:28.880 | to do more variance reduction.
00:46:32.800 | Instead of just using them as a baseline, you can also-- you can use them more aggressively
00:46:38.720 | and introduce some bias.
00:46:41.200 | So I won't go into the details in this talk, but sometimes these are called actor-critic
00:46:47.120 | methods.
00:46:52.800 | There's also another type of approach, which I briefly touched on in the earlier slide,
00:46:58.880 | where instead of just trying to increase the probability of the good actions, you actually
00:47:03.040 | differentiate your loss with respect to the actions.
00:47:05.880 | This is like the reparameterization trick, which is used for density modeling and unsupervised
00:47:12.320 | learning.
00:47:15.060 | So here you're trying to-- instead of just increasing the probability of the good actions,
00:47:19.560 | you're trying to push the actions towards better actions.
00:47:23.880 | And I'd say both of these bullet points, you're potentially decreasing your variance a lot,
00:47:30.720 | but at the cost of increasing bias.
00:47:33.080 | So it actually makes the algorithms a little harder to understand and to get them working,
00:47:38.840 | because with high variance, if you just crank up the amount of data, you can always drive
00:47:44.180 | your variance down as much as you want.
00:47:46.800 | But with bias, even if-- no matter how much data you get, you're not going to get rid
00:47:50.800 | of the bias.
00:47:51.800 | So if your gradient is pointing in the wrong direction, then you're not going to learn
00:47:55.360 | anything.
00:47:57.560 | OK, so now that's it for the policy gradient section of this talk.
00:48:05.360 | So I wanted to show a quick video of some work that my colleagues and I did on learning
00:48:10.860 | locomotion controllers with policy gradient methods, which I think-- well, I found pretty
00:48:17.020 | exciting when I saw it.
00:48:19.160 | So hopefully you find it interesting.
00:48:26.660 | So here, what we've got is a humanoid simulated-- let's see.
00:48:31.240 | OK, yeah, it's a simulated humanoid robot in a physics simulator, a realistic physics
00:48:37.200 | simulator called MuJoCo.
00:48:40.320 | And it has a neural network policy, which takes in the joint angles of the robot and
00:48:48.600 | a little bit of other kinematic information.
00:48:51.080 | It's got joint velocities and also positions of the different links of the robot.
00:48:56.700 | So that's what the input is.
00:48:58.500 | It's pretty much the raw state of the robot, like no clever feature engineering there.
00:49:04.160 | And the output is going to be the joint torques, which are set 100 times a second.
00:49:09.620 | So we're just mapping from joint angles to joint torques.
00:49:15.300 | And we define a reward function, which is to move forward as fast as possible.
00:49:20.400 | So it gets a reward for moving forward.
00:49:25.680 | So the episode ends when its head goes below a certain height, meaning it fell over.
00:49:31.560 | So that's basically the setup.
00:49:33.060 | There is a little bit of tweaking for the reward function, but not too extensive.
00:49:39.480 | So you can see, first it just falls forward a lot of times.
00:49:55.180 | And then slowly it starts to develop a half-decent looking walk.
00:50:00.520 | And eventually it gets it down pretty well.
00:50:06.420 | And at the very end of this, it could just keep running indefinitely.
00:50:11.020 | So I think it was actually stable in a strong sense, meaning I could just leave it for 15
00:50:15.180 | minutes and it wouldn't fall over.
00:50:16.580 | It would just keep going.
00:50:18.980 | So here's another robot model that we just created without too much thought.
00:50:26.220 | We just decided to put a bunch of legs on this thing.
00:50:30.700 | So we don't even know how this thing is supposed to walk.
00:50:35.040 | And just give it to the same algorithm and it just figures out some kind of crazy way
00:50:39.340 | to walk.
00:50:42.920 | So that's the nice thing about reinforcement learning.
00:50:45.260 | You don't even need to know what you want it to do.
00:50:48.620 | I think this is also the physics are a little unrealistic here.
00:50:56.140 | Here we set up, we used a similar model to the one in the first demo.
00:51:00.940 | But here we just give it a reward for having its head at a certain height.
00:51:04.880 | So there's a reward telling it to get its head up as high as possible.
00:51:09.060 | And then it figures out how to get up off the ground.
00:51:12.220 | Let's see, I have low battery.
00:51:19.380 | Does anyone have a charger that I could?
00:51:29.060 | Thanks a lot.
00:51:30.060 | You're a lifesaver.
00:51:31.060 | OK, any questions about policy gradients before I move on to the next part?
00:51:57.300 | So the question was, is the system time invariant?
00:51:59.780 | Yes, that's assumed.
00:52:01.300 | Is that it's stationary?
00:52:03.900 | Oh, right.
00:52:05.140 | And also that it doesn't change from one episode to the next.
00:52:10.180 | Of course, in some real world problems that might not be the case.
00:52:13.020 | So I think that's also an interesting problem setting where you have a non-stationary environment.
00:52:26.180 | So the question was, for the baseline, to learn a good baseline, do you need to know
00:52:30.340 | the dynamics of the system?
00:52:32.340 | So no, you can just learn it by doing regression.
00:52:36.100 | You just estimate the empirical returns and then you do regression to try to fit a function
00:52:40.860 | to that.
00:52:41.860 | [INAUDIBLE]
00:53:09.620 | Yeah, so the question is, there's a discount factor which should cause the policy to disregard
00:53:18.100 | any effects that are delayed by more than 100 time steps.
00:53:21.520 | So how does it still work that this guy learns how to stand up, even though that might take
00:53:26.860 | more than 100 time steps?
00:53:28.580 | Is that correct?
00:53:29.580 | Yeah.
00:53:30.580 | So yeah, you're right.
00:53:32.300 | And in fact, I would say that these methods aren't guaranteed to work well if you have
00:53:37.180 | more than 100 time steps.
00:53:39.980 | So sometimes they work anyway.
00:53:42.140 | Often they work anyway, but there's no guarantee.
00:53:44.740 | So I think there's actually something pretty fundamental missing in how to deal with really
00:53:49.340 | long time scales.
00:53:50.660 | And people have recently been thinking about hierarchical reinforcement learning, where
00:53:54.700 | you have different levels of detail of the system, where you might have one level of
00:54:01.640 | description where you have a short time, a small time step, and then you have successively
00:54:07.180 | larger time steps.
00:54:08.940 | And that allows you to plan over much longer horizons.
00:54:12.940 | So that's something that's currently an active area of research.
00:54:15.860 | But yeah, I would say that none of these methods are guaranteed to do anything reasonable if
00:54:21.460 | you have more than 1 over 1 minus gamma time steps between action and reward.
00:54:28.100 | So if you introduce some anomaly in the environment, can it still handle it?
00:54:35.500 | Oh yeah, so in this kind of task, if you introduce terrain or something, could it-- I think if
00:54:42.260 | you didn't train it to deal with terrain, then it might fail.
00:54:46.700 | It probably would fail.
00:54:47.700 | Actually, I don't think it would fail, because the funny thing about-- these policies are
00:54:52.020 | actually really robust, because you train them with a stochastic policy.
00:54:57.580 | So there's a lot of noise being generated by the policy itself.
00:55:01.540 | So in practice, it's able to deal with huge noise introduced by the policy.
00:55:08.860 | And as a result, I found that if you change the dynamics parameters a little, it can usually
00:55:14.380 | still work.
00:55:15.420 | But yeah, there's no guarantee that it'll do anything if you give it something you didn't
00:55:18.820 | train it for.
00:55:20.780 | I think that you probably could do the same kind of training with terrain.
00:55:25.540 | I didn't have any terrain, so I didn't try it.
00:55:28.180 | But that would be nice to try.
00:55:30.940 | OK, I'm going to move on to the next part of the talk.
00:55:35.540 | Feel free, if you have more questions, to find me afterwards.
00:55:42.500 | OK, so now I'm going to talk about a different type of reinforcement learning algorithm.
00:55:54.100 | So the previous methods are distinguished by the fact that they explicitly represent
00:56:03.140 | a policy, which is the function that chooses your actions.
00:56:06.500 | And they try to optimize it with respect to the parameters of the policy.
00:56:10.700 | So the nice thing about the policy gradient methods we just talked about is that you're
00:56:14.860 | optimizing the thing you care about.
00:56:18.100 | And you're optimizing it with gradient descent.
00:56:20.100 | So that makes it kind of easy to understand what's going on.
00:56:23.880 | Because if you're getting the proper gradient estimate, and you take small enough steps,
00:56:28.540 | then you should be improving.
00:56:29.540 | I mean, of course, you still could get stuck in a local minimum.
00:56:32.740 | But at least-- or you get stuck in a bad local minimum.
00:56:36.460 | But at least it's a local minimum.
00:56:37.940 | And you can use our understanding of optimization to figure out what's going on.
00:56:44.080 | So these next class of methods are a little different, because they're not optimizing
00:56:48.580 | the policy directly.
00:56:50.260 | They're learning something else called a Q function, which measures how good state-action
00:56:55.860 | pairs are.
00:56:56.860 | So it measures-- I'll say that more formally later.
00:57:00.620 | But it's just measuring how good the actions are.
00:57:04.020 | And these methods are able to exactly solve MDPs efficiently in
00:57:15.700 | the setting where you have a finite number of states and actions.
00:57:19.620 | So these are the preferred methods for exactly solving MDPs in that setting.
00:57:25.100 | But you can also apply them with continuous states and actions, and using expressive function
00:57:34.460 | approximators like neural networks.
00:57:37.500 | But it's a little harder to understand what's going on in these methods, like when they're
00:57:41.860 | going to work and when they're not going to work.
00:57:45.000 | So I'll define the relevant quantities here.
00:57:50.380 | So the Q function is defined as the expected sum of rewards when we condition on the first
00:57:57.020 | state and the first action.
00:57:59.180 | So we're conditioning on s_0 = s, a_0 = a. And the Q function is the expected
00:58:08.380 | discounted sum of rewards when we're acting under the policy pi.
00:58:14.620 | So by convention, I'm starting out with time step 0.
00:58:20.900 | I could have also said that we're taking R_t + R_{t+1} + R_{t+2} and so on.
00:58:27.740 | But since we're assuming the system is stationary, it should be exactly the same.
00:58:32.520 | So just by convention, I'm going to say that the first-- I'm going to always use time 0,
00:58:36.900 | 1, 2, 3, and so on, just for ease of notation.
00:58:41.980 | So the Q function is just telling you how good this state action pair is under your
00:58:46.240 | current policy.
00:58:48.020 | The value function-- well, the state value function, usually called V, is just conditioning
00:58:54.340 | on the state.
00:58:57.820 | It's telling you how good that state is, what's the expected reward of that state.
00:59:02.520 | And lastly, the advantage function is the difference between the Q function and the
00:59:07.080 | state value function, meaning how much better is that action than what the policy would
00:59:11.700 | have done.
00:59:14.180 | We're not going to talk about advantage functions in this section, but it was actually-- this
00:59:18.020 | corresponds to the notion of an advantage estimator we briefly mentioned in the previous
00:59:23.300 | section.
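To make these three quantities concrete, here they are in standard notation (a restatement of the definitions above, with discount factor gamma and expectations taken over trajectories generated by the policy pi):

```latex
Q^{\pi}(s, a) = \mathbb{E}\big[\, r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \,\big|\, s_0 = s,\; a_0 = a \,\big]
\qquad
V^{\pi}(s) = \mathbb{E}\big[\, r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \,\big|\, s_0 = s \,\big]
\qquad
A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)
```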
00:59:26.140 | So here, we're going to consider methods that explicitly store and update the Q function
00:59:30.860 | instead of the policy, and update it using what are called Bellman equations.
00:59:41.620 | So the Bellman equation-- so a Bellman equation, in general, is a consistency equation that
00:59:46.780 | should be satisfied by a value function.
00:59:51.100 | So here, I'm writing down the Bellman equation for Q pi.
00:59:55.780 | And what it's saying is that the expected sum of rewards should be the first reward
01:00:04.460 | plus this expected sum of rewards after the first time step.
01:00:08.500 | So it's saying something pretty simple.
01:00:10.820 | So R0 is the first reward.
01:00:12.940 | V pi of S1 is just adding up all the rewards after time step 0.
01:00:22.380 | So in the second equation, we write out this relationship just involving the Q function.
01:00:29.380 | So we have a consistency equation that the Q function should satisfy.
01:00:37.640 | We can slightly generalize this to use k time steps instead of just one time step.
01:00:42.860 | So we can expand out the expectation-- the expected sum of rewards to write out k rewards
01:00:49.920 | explicitly and then cap it off with the value function at the very end, which accounts for
01:00:55.240 | all the rewards after that.
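Written out, the one-step consistency equation and its k-step generalization look like this (a restatement of the slide, in the same notation as above):

```latex
Q^{\pi}(s_0, a_0) = \mathbb{E}_{s_1}\big[\, r_0 + \gamma V^{\pi}(s_1) \,\big]
                  = \mathbb{E}_{s_1, a_1}\big[\, r_0 + \gamma Q^{\pi}(s_1, a_1) \,\big]

Q^{\pi}(s_0, a_0) = \mathbb{E}\big[\, r_0 + \gamma r_1 + \cdots + \gamma^{k-1} r_{k-1} + \gamma^{k} V^{\pi}(s_k) \,\big]
```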
01:00:58.640 | OK, so here's the Bellman equation from the previous slide.
01:01:03.620 | So now I'm going to introduce a very important concept called the Bellman backup.
01:01:11.340 | So we have this equation that the value function Q pi should satisfy.
01:01:17.280 | But let's assume we don't know Q pi.
01:01:20.080 | So let's say we have some other Q function.
01:01:24.880 | So we define this Bellman backup operator that operates on an arbitrary Q function.
01:01:31.280 | So it maps a Q function to a new Q function.
01:01:34.400 | And it's defined by just taking the right-hand side of the Bellman equation and plugging
01:01:39.640 | in our new Q function Q instead of the Q pi.
01:01:50.280 | So Q pi is a fixed point of this operator, meaning if we apply the backup operator, we
01:01:57.200 | get the same thing back.
01:02:00.980 | And very nicely, if we keep applying this backup operator repeatedly to any old arbitrary
01:02:08.600 | initial Q function Q, the series will converge to Q pi, which is the fixed point of the operator.
01:02:15.640 | So that gives you one way to estimate Q pi: you can use an iterative
01:02:27.200 | algorithm to estimate Q pi by taking any old initial Q function and repeatedly applying
01:02:32.680 | this backup operator.
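As a rough illustration of this iterative procedure, here is a minimal tabular sketch (not from the talk), assuming we know the transition table P[s, a, s'], the expected rewards R[s, a], and a stochastic policy pi[s, a]; all names are hypothetical:

```python
import numpy as np

def bellman_backup_pi(Q, P, R, pi, gamma):
    """One application of the backup operator B^pi to an arbitrary Q table.

    P[s, a, s'] : transition probabilities
    R[s, a]     : expected immediate rewards
    pi[s, a]    : probability of taking action a in state s
    """
    V = (pi * Q).sum(axis=1)        # V(s') = sum_a' pi(a'|s') Q(s', a')
    return R + gamma * P @ V        # (B^pi Q)(s, a) = R(s, a) + gamma * E_{s'}[V(s')]

def estimate_q_pi(P, R, pi, gamma, iters=1000):
    """Estimate Q^pi by repeatedly applying B^pi to an arbitrary initial Q."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = bellman_backup_pi(Q, P, R, pi, gamma)
    return Q
```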
01:02:36.380 | So now there's another kind of Q function that we're going to introduce called Q star.
01:02:42.200 | So the previous Q function, Q pi, was telling you the value function under
01:02:47.800 | some particular policy pi.
01:02:50.220 | So it only makes sense with regard to some particular fixed policy pi.
01:02:54.800 | Q star is going to involve the optimal policy instead.
01:03:02.920 | So Q star is just defined as the Q function of the optimal policy.
01:03:06.720 | So here we have pi star, the optimal policy, and Q star is just the Q function of the optimal
01:03:11.680 | policy.
01:03:12.800 | And it also happens to be the pointwise maximum over all policies of the Q function at each
01:03:21.000 | state action pair.
01:03:25.920 | So the optimal policy is deterministic, and it should satisfy this equation that it takes
01:03:34.220 | the argmax of the optimal Q function.
01:03:36.800 | So recall that the Q function tells you your expected return if you take the given action.
01:03:43.340 | So obviously the optimal policy should take the action that has the best expected return.
01:03:48.220 | So that's why this last equation is evident.
01:03:58.000 | So now that we know this property of the optimal policy, we can rewrite the Bellman equation.
01:04:06.200 | So that first equation is just the Bellman equation from the previous
01:04:10.200 | slides for a given policy pi.
01:04:12.880 | Now we can take that expectation over actions and replace it by what the optimal policy
01:04:18.880 | is going to do, which is to take the argmax of the optimal
01:04:23.660 | Q function.
01:04:24.660 | There's a typo on my slide.
01:04:26.180 | That should say Q star inside-- on the right-hand side.
01:04:32.820 | So now we have a Bellman equation that the optimal policy should satisfy.
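With that typo fixed, the optimality version of the Bellman equation reads (standard form):

```latex
Q^{*}(s_0, a_0) = \mathbb{E}_{s_1}\Big[\, r_0 + \gamma \max_{a_1} Q^{*}(s_1, a_1) \,\Big],
\qquad
\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a)
```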
01:04:39.340 | Now we can do the same thing with the backup operator.
01:04:42.480 | So we take that Bellman equation and we plug in an arbitrary Q function on the right-hand
01:04:49.300 | side instead of the optimal Q function Q star.
01:04:54.900 | So Q star is a fixed point of this Bellman operator.
01:04:59.880 | That's just a restatement of the Bellman equation.
01:05:04.300 | And again, if we apply this Bellman operator repeatedly to an arbitrary initial Q function,
01:05:11.340 | it converges to Q star, which is the optimal Q function.
01:05:16.900 | In both cases, the Banach fixed point theorem
01:05:19.940 | can be used to prove this convergence.
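The key fact is that the backup operators are gamma-contractions in the max norm, which is what lets the Banach fixed point theorem apply (a standard result, stated here for the optimality backup B):

```latex
\| B Q - B Q' \|_{\infty} \le \gamma \, \| Q - Q' \|_{\infty}
\quad \Longrightarrow \quad
\| B^{n} Q - Q^{*} \|_{\infty} \le \gamma^{n} \, \| Q - Q^{*} \|_{\infty} \to 0
```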
01:05:26.460 | So based on these ideas, there are two classic algorithms for exactly solving MDPs.
01:05:33.340 | These are sometimes called dynamic programming algorithms because they're actually quite
01:05:36.740 | related to the kind of dynamic programming algorithms that are used to solve search problems.
01:05:44.420 | So one is called value iteration.
01:05:46.500 | And you just initialize your Q function arbitrarily.
01:05:49.900 | And you repeatedly do Bellman backups until it converges.
01:05:56.100 | The second one is called policy iteration.
01:05:59.420 | You initialize your policy arbitrarily.
01:06:02.580 | Then each step, you first compute either exactly or approximately the Q function of that policy.
01:06:12.880 | And then you update your policy to be the greedy policy for the Q function you just
01:06:17.580 | computed.
01:06:19.020 | So that means that your new policy just takes the argmax of the Q function.
01:06:24.500 | So it takes the action that's best according to that Q function.
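Here is a minimal tabular sketch of both algorithms (my own illustration, not from the talk), using the same hypothetical P and R conventions as the earlier sketch:

```python
import numpy as np

def value_iteration(P, R, gamma, iters=1000):
    """Tabular value iteration: repeatedly apply the optimality backup B."""
    Q = np.zeros_like(R)                         # arbitrary initialization
    for _ in range(iters):
        V = Q.max(axis=1)                        # max_a' Q(s', a')
        Q = R + gamma * P @ V                    # (B Q)(s, a) = R(s, a) + gamma * E[max_a' Q(s', a')]
    return Q

def policy_iteration(P, R, gamma, eval_backups=50, iters=100):
    """Tabular policy iteration: (approximate) evaluation, then greedy improvement."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)       # arbitrary deterministic initial policy
    for _ in range(iters):
        Q = np.zeros_like(R)
        for _ in range(eval_backups):            # approximately compute Q^pi with finite backups
            V = Q[np.arange(n_states), policy]   # V^pi(s) = Q(s, pi(s))
            Q = R + gamma * P @ V
        policy = Q.argmax(axis=1)                # greedy policy for the Q we just computed
    return policy, Q
```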
01:06:31.700 | So I didn't say anything about how you compute Q pi.
01:06:35.100 | One way is to compute it exactly:
01:06:37.340 | it happens that the Bellman equation for Q pi is a linear
01:06:41.740 | system of equations,
01:06:43.500 | so often you can just solve it exactly.
01:06:46.820 | More commonly, well, if you have a large scale problem, you might not be able to solve this
01:06:50.540 | system.
01:06:51.620 | So what people often do is they do a finite number of Bellman backups, which doesn't
01:06:58.380 | exactly converge on Q pi, but it gives you something that's approximately Q pi.
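For reference, the reason the exact computation is a linear system is that the backup is affine in Q; in matrix form (a standard observation, with P^pi denoting the state-action transition matrix induced by pi):

```latex
Q^{\pi} = r + \gamma P^{\pi} Q^{\pi}
\quad \Longrightarrow \quad
Q^{\pi} = (I - \gamma P^{\pi})^{-1} r
```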
01:07:04.700 | Okay, so I just told you about algorithms that you can implement if you have full access
01:07:12.060 | to the MDP.
01:07:13.300 | Like you know the whole table of probabilities.
01:07:16.180 | But in reinforcement learning, usually the assumption is that you don't know any of these
01:07:20.380 | probability distributions.
01:07:21.740 | You don't know their reward function.
01:07:22.900 | You don't know the transition probabilities.
01:07:24.980 | So all of these things have to be estimated from data or you're only able to access the
01:07:30.540 | system through interaction.
01:07:32.500 | So now it turns out that these algorithms can also be implemented if you only access
01:07:39.340 | the system through interactions, which is kind of remarkable, I think.
01:07:45.260 | So the way it works is, so let's recall our backup formulas for Q pi and Q star.
01:07:53.620 | In both cases, we have a certain quantity inside an expectation.
01:08:02.620 | In both cases, we can compute an unbiased estimator of that quantity inside the expectation,
01:08:10.260 | just using a single sample.
01:08:11.780 | Meaning if we sample some data from our system using any old policy, then we can get an unbiased
01:08:21.580 | estimator of the quantity inside the expectation
01:08:26.300 | on the right-hand side.
01:08:31.980 | So basically we can do an approximate version of this Bellman backup, which is unbiased.
01:08:41.980 | So we're doing a noisy version of the Bellman backup.
01:08:46.040 | Even with this noise, it can be proven that if you choose your step sizes appropriately
01:08:50.500 | with the right schedule, you're still gonna converge to Q pi or Q star, depending on which
01:08:58.860 | algorithm you're implementing.
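Concretely, the single-sample versions of the two backups are the classic tabular Q-learning and SARSA updates. A minimal sketch (not from the talk), with alpha as the step size that has to be decayed on an appropriate schedule for the convergence guarantee:

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Noisy one-sample version of the optimality backup B (moves Q towards Q*)."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Noisy one-sample version of B^pi, using the action actually taken next (moves Q towards Q^pi)."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```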
01:09:01.460 | Okay, so now, well I'll say at this point that this is pretty much the fundamental idea.
01:09:08.700 | And now you can come up with algorithms that can be applied in the reinforcement learning
01:09:15.840 | setting where you're just accessing the system through sampling.
01:09:18.820 | And you can also start introducing function approximation here.
01:09:23.100 | So I haven't said anything about what the Q function is.
01:09:26.500 | I've just told you it's a function of state and action.
01:09:29.460 | But now we can start having neural network Q functions, for example.
01:09:36.120 | So we can parameterize the Q function with the neural network.
01:09:39.620 | Let's call it Q theta.
01:09:41.960 | And now it doesn't make sense to do the Bellman
01:09:46.980 | backup exactly, because we can't just set the output values of the neural network.
01:09:52.420 | The best we can do is try to encourage the neural network to have certain output values.
01:09:57.980 | So the way we do this backup is we set up a least
01:10:03.140 | squares problem.
01:10:04.700 | So we write down this quadratic objective that says that the Q function should be approximately
01:10:10.200 | equal to the backed up value.
01:10:11.760 | And then we just minimize it with SGD.
01:10:16.380 | So one version of this algorithm, which was introduced about 10 years ago, is called neural
01:10:20.900 | fitted Q iteration.
01:10:23.260 | Well it works exactly the way you'd expect.
01:10:25.320 | You sample trajectories using your current policy, which might be determined by the Q
01:10:32.940 | function or it could be any old policy as it turns out.
01:10:37.500 | And then you solve the least squares problem, where
01:10:46.340 | you try to minimize this quadratic error, which is based
01:10:52.820 | on the Bellman backup for Q star.
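A rough sketch of one round of this procedure in PyTorch (my own illustration, not the original NFQ code; q_net, the batch layout, and the hyperparameters are all assumptions):

```python
import torch

def fitted_q_iteration_round(q_net, optimizer, batch, gamma, grad_steps=200):
    """Fit Q_theta to backed-up targets by least squares on sampled transitions.

    batch = (s, a, r, s_next, done): tensors of states, integer (int64) actions,
    rewards, next states, and termination flags (0.0 or 1.0), collected with
    any exploration policy.
    """
    s, a, r, s_next, done = batch
    with torch.no_grad():
        # Backed-up target from the Q* backup: r + gamma * max_a' Q(s', a'), zero after terminal states.
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    for _ in range(grad_steps):
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q_theta(s, a)
        loss = ((q_sa - target) ** 2).mean()                   # the least-squares objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```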
01:10:55.500 | So one thing I haven't mentioned so far is what do you actually use as your policy.
01:11:02.020 | So I said sample trajectory using your policy.
01:11:05.040 | So if you have a Q function, you can turn it into a policy by just taking the action
01:11:11.100 | that has the highest Q value.
01:11:12.780 | That's what you typically do.
01:11:14.120 | So the Q function measures the goodness of all your actions.
01:11:16.920 | So you can easily turn that into a policy by taking your best action or by taking actions
01:11:22.280 | where the log probability is proportional to the goodness or something like that.
01:11:28.780 | So typically you might take the probability of an action to be proportional to the exponential of its Q value over some kind of temperature
01:11:36.860 | parameter.
01:11:38.600 | That's called Boltzmann exploration.
01:11:40.800 | Whereas if you just take the argmax, that's called the greedy policy.
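For example, here is a small sketch of the two action-selection rules just described (my own illustration, not code from the talk):

```python
import numpy as np

def greedy_action(q_values):
    """Greedy policy: pick the action with the highest Q-value."""
    return int(np.argmax(q_values))

def boltzmann_action(q_values, temperature=1.0, rng=np.random):
    """Boltzmann exploration: P(a) proportional to exp(Q(s, a) / temperature)."""
    logits = np.asarray(q_values, dtype=float) / temperature
    logits -= logits.max()                       # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(probs), p=probs))
```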
01:11:47.340 | So it turns out that with these kinds of Q learning algorithms, you don't have to execute
01:11:52.540 | the greedy policy for learning to work.
01:11:57.440 | You actually have some freedom in what policy you execute, which is one very
01:12:02.700 | nice property of these algorithms: you can use an exploration technique where your
01:12:09.520 | policy is actively trying to reach new states or do something new, and the algorithm will
01:12:16.240 | still converge, still move towards Q star or Q pi as the case may be.
01:12:24.320 | So that's a very basic neural fitted Q iteration is sort of a basic way of doing this.
01:12:31.440 | A more recent algorithm that's gotten a lot of attention is the DQN algorithm from Mnih
01:12:36.760 | et al. at DeepMind, which is basically an online version of this algorithm with a couple
01:12:43.040 | of useful tweaks in it.
01:12:47.840 | And when you look at the two tricks, they actually make a
01:12:52.960 | lot of sense if you just think about what value iteration is doing.
01:12:56.420 | So one technique is you use this replay pool where it's a rolling history of your past
01:13:03.280 | data and that's just the data you're going to use to fit your Q function.
01:13:09.080 | So that makes sure you have like a representative sample of data to fit your Q function to.
01:13:16.040 | And the second idea is to use a target network.
01:13:21.440 | So instead of using your current Q function and just doing Bellman backups on that, you
01:13:26.960 | have some lagged version of your Q function.
01:13:29.840 | So you have this target network which is a copy of your Q function at some earlier time
01:13:34.240 | and you use that in the backups.
01:13:35.920 | So that also, if you think about value iteration, you have your old Q function and you're trying
01:13:41.440 | to make the new one equal to the backed up version of the old one.
01:13:44.680 | So using the target network is just the natural thing to do if you're trying to
01:13:48.560 | implement value iteration in an online way.
01:13:53.600 | And there have been many extensions proposed since then.
01:13:55.880 | I've got a bunch of citations at the bottom of the slide.
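Here is a minimal sketch of those two tweaks, a replay pool plus a lagged target network, in PyTorch (my own illustration of the idea, not the DeepMind code; all names and shapes are assumptions):

```python
import random
from collections import deque
import torch

class ReplayPool:
    """Rolling history of past transitions, used as the data to fit the Q-function."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        # Each element is stored as a tensor (actions as int64, done flags as float 0./1.).
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return tuple(map(torch.stack, zip(*batch)))

def dqn_update(q_net, target_net, optimizer, pool, batch_size, gamma):
    """One SGD step on the DQN least-squares objective, backing up through the lagged network."""
    s, a, r, s_next, done = pool.sample(batch_size)
    with torch.no_grad():
        # Target computed from the lagged copy, playing the role of the "old" Q in value iteration.
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = ((q_sa - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every so often, refresh the lagged copy:
#   target_net.load_state_dict(q_net.state_dict())
```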
01:14:00.360 | So this algorithm, the DQN algorithm, is using the backup B, which is the backup for Q*.
01:14:11.920 | Remember that I also introduced this other backup, B pi, which is the backup for Q pi.
01:14:18.560 | So there's another algorithm, like a very classic algorithm called SARSA, which is an
01:14:23.580 | online way of doing the B pi backup, essentially.
01:14:28.640 | Well, it's sort of an online version of policy iteration.
01:14:35.280 | And it's actually been found to work as well as, or better than, using the
01:14:41.360 | B backup in some settings, though not all settings.
01:14:44.980 | So I think the jury's still out on exactly how these things compare.
01:14:51.480 | But I think it's worth considering both policy iteration and value iteration and all the
01:14:58.040 | different online versions of these algorithms and taking them seriously.
01:15:01.520 | Because it's not clear right now exactly how they all compare to each other in the function
01:15:08.360 | approximation setting.
01:15:12.800 | So that's the overview of all the technical parts.
01:15:17.380 | And now I just have a couple conclusion slides.
01:15:22.360 | So let me just summarize the current state of affairs.
01:15:25.000 | I introduced two kinds of algorithms, policy gradient algorithms, which explicitly represent
01:15:30.840 | a policy and optimize it, and Q function learning algorithms, which explicitly represent a Q
01:15:37.200 | function, which is the goodness of different actions, and use that to implicitly represent
01:15:42.080 | a policy.
01:15:44.560 | So for policy gradient methods, there have been some successes with
01:15:49.580 | different variants.
01:15:51.740 | So you have vanilla policy gradient methods.
01:15:55.240 | There is a recent paper on this A3C method, which is an asynchronous implementation of it, and which
01:16:03.700 | gets very good results.
01:16:06.820 | There's also another kind of method: the natural policy gradient methods and trust region
01:16:12.380 | methods.
01:16:13.380 | The one I showed you was obtained using trust region policy optimization, which is one of
01:16:18.020 | these in the second category.
01:16:20.100 | I think these trust region methods and natural policy gradient methods
01:16:24.820 | are more sample efficient than the vanilla methods, because you end up doing
01:16:31.220 | more than one gradient update with each little bit of data you collect.
01:16:34.920 | So with the vanilla policy gradient, you just compute one little gradient estimate, and
01:16:39.900 | then you throw it away.
01:16:41.220 | With natural policy gradient, you're solving a little optimization problem with it, so
01:16:44.740 | you get more juice out of it.
01:16:48.540 | So that's what we have in the policy gradient world.
01:16:54.060 | And in the Q-function world, we have the DQN algorithm and some of its relatives.
01:17:01.900 | And these are sort of descendants of value iteration, where you're approximating the
01:17:09.500 | Bellman backup with sampling and function approximation.
01:17:13.360 | And then SARSA is also-- it's related to policy iteration.
01:17:21.260 | These methods are dealing with
01:17:25.580 | different Bellman equations, so it's kind of interesting that both kinds of methods
01:17:29.100 | work, and they have fairly similar behaviors, as it turns out.
01:17:35.980 | So here's how I would compare them.
01:17:40.340 | And this is anecdotal evidence, but I think this is the consensus right now.
01:17:47.060 | The Q-function methods are more sample efficient when they work, but they don't work as generally
01:17:52.580 | as policy gradient methods, and it's a little harder to figure out what's going on when
01:17:57.260 | they don't work.
01:17:58.940 | And that kind of makes sense, because in the policy gradient methods, you're optimizing
01:18:01.980 | exactly the thing you care about with gradient descent.
01:18:05.380 | Whereas with Q-function methods, you're doing something indirect, where you're trying to
01:18:09.660 | learn a Q-function, and then you're hoping that it gives you a good policy.
01:18:14.820 | And I would also point out that there are some confounds, so it's hard to make
01:18:20.660 | a good conclusion at this point, because people use different time horizons in the policy
01:18:28.460 | gradient methods versus the Q-function methods.
01:18:30.980 | So they do one-step lookaheads on the Q-functions and multi-step lookaheads on the policy gradients.
01:18:36.300 | So it's not clear if the differences come from using different time horizons or some
01:18:43.660 | differences in how the algorithms are working, because you're either doing regression for
01:18:47.860 | a Q-function versus learning a policy using policy gradients.
01:18:53.980 | So just to summarize it, I would say here are some of our core model-free reinforcement
01:18:59.220 | learning algorithms, and they - oh, whoops, I'm missing a word in the first column, which
01:19:07.300 | I think should say reliability and robustness.
01:19:11.860 | So this just means, is it going to work on new problems without parameter tuning, or
01:19:19.420 | is it going to mysteriously either work or not work?
01:19:24.260 | So this would be my slightly sloppy summary of all these different algorithms.
01:19:32.460 | I would say there's still some room for improvement.
01:19:35.540 | There might be some improvements in the basic methods, because there are some nice properties
01:19:39.540 | of the Q-function methods that we don't have in the policy gradient methods.
01:19:44.260 | Like you can easily do off-policy learning: you can easily explore with a different policy than the one
01:19:49.780 | that you're learning the Q-function for, and that's really important.
01:19:54.500 | You can't do that very easily with policy gradient methods.
01:19:58.380 | Whereas with the policy gradient methods, you can just apply them
01:20:03.380 | and they're more likely to work, and it's well understood what's going on.
01:20:10.100 | So I don't know if it's possible to get the best of both
01:20:13.980 | worlds, but that's the hope.
01:20:18.060 | And that's it for my talk.
01:20:19.540 | Thank you.
01:20:20.540 | Any questions?
01:20:21.540 | In model-based reinforcement learning, what lines of research do you find most interesting?
01:20:39.540 | Oh yeah, so in model-based reinforcement learning, what lines of research do I find most interesting?
01:20:47.340 | I think the work from my colleagues on guided policy search is very nice.
01:20:51.260 | So I would say that's a kind of model-based reinforcement learning.
01:20:56.740 | I also like - there's some methods that are using the model for faster learning, like
01:21:02.060 | for variance reduction.
01:21:03.140 | So there's a paper called stochastic value gradients that I like a lot.
01:21:10.460 | I think it's a pretty wide open area.
01:21:12.640 | So I don't think there have been a lot of really compelling results where you're able
01:21:18.060 | to learn extremely fast, like you're able to learn with much better sample efficiency
01:21:23.060 | using a model.
01:21:24.060 | So it seems like that should be possible, but I don't think it's been demonstrated yet.
01:21:29.980 | So maybe in the next couple of years we'll see that happen.
01:21:34.060 | Hello.
01:21:36.060 | Thanks for the talk.
01:21:40.340 | So I have a question.
01:21:41.340 | Is it true or not true that most of these problems require some kind of simulated world to run
01:21:49.540 | the episodes in?
01:21:51.700 | Oh yeah, so are you asking does this work in the real world?
01:21:57.500 | Is that the question?
01:21:58.500 | Yeah.
01:21:59.500 | Yeah, I would say it does work if you have a lot of patience and you're willing to execute
01:22:05.620 | this thing for a while.
01:22:06.820 | So the locomotion results I showed add up to about two weeks of real time.
01:22:12.840 | So it's actually not that bad, especially when you consider that toddlers take a while
01:22:18.020 | to learn how to walk properly, even though evolution already puts in a lot of built-in
01:22:22.660 | information.
01:22:24.820 | So I'd say maybe, yeah, I'd say it can be run in the real world.
01:22:31.620 | Some of my colleagues in Berkeley are doing some experiments where they are running just
01:22:36.380 | regular reinforcement learning algorithms in the real world.
01:22:39.980 | Very brave.
01:22:41.300 | But I hope to see some nice results from that soon.
01:22:45.860 | Thank you.
01:22:48.860 | Thanks for your talk.
01:22:49.860 | Here, on the other side.
01:22:50.860 | Here.
01:22:51.860 | Yeah.
01:22:52.860 | I was wondering what your intuition is on the loss surface of those deep reinforcement
01:22:59.620 | learning optimization problems, and maybe especially how it evolves as the policy learns.
01:23:08.900 | And I should specify in the policy gradient case.
01:23:13.100 | So I think the situation is a little bit different in reinforcement learning from in supervised
01:23:16.900 | learning.
01:23:17.900 | So in reinforcement learning, you have one kind of local minima in policy space.
01:23:29.300 | So for example, let's say you want your-- so I keep going back to the locomotion example,
01:23:34.340 | because I spent a lot of time on it.
01:23:36.020 | But let's say you want your robot to walk.
01:23:39.260 | There's one local minimum where it just stands and it doesn't bother to walk, because there's
01:23:42.980 | too much penalty for falling over.
01:23:44.980 | And there's another local minimum where it just dives forward, because it gets a little
01:23:49.140 | bit of reward for that before it falls to its doom.
01:23:55.040 | So I think that's actually the hard part about the optimization problem: it comes
01:24:03.920 | from the space of possible behaviors, and actually has nothing to do with the neural network.
01:24:09.520 | So I've also found that it matters surprisingly little what kind of neural network architecture
01:24:18.300 | you use, because I think that most of the hardness and the weirdness of the problem
01:24:21.960 | comes from what the behavior space looks like, rather than what the actual numerical optimization
01:24:28.360 | landscape looks like.
01:24:30.200 | Cool.
01:24:31.200 | Thank you.
01:24:34.320 | So there are many problems where the reward is only observed at the end of the task, so
01:24:41.960 | in the final-- in the terminal state in each episode.
01:24:45.280 | And you don't see rewards in intermediate states.
01:24:48.120 | So how much harder do these problems become for deep reinforcement learning in your experience?
01:24:53.560 | Thanks.
01:24:54.560 | Yeah, so if you don't get the reward until the end, then
01:25:02.520 | it might be harder to learn.
01:25:04.480 | Yeah.
01:25:05.480 | I don't have anything precise to say about that.
01:25:08.880 | I think it's going to be harder if your rewards are further away.
01:25:14.520 | So for example, in your video, for the last example of getting up and getting the head
01:25:19.600 | above a certain height, for example, that could be one where you only get a plus 1 if
01:25:25.120 | you're above and you don't get anything below.
01:25:27.160 | Oh, right.
01:25:28.160 | But in what you were doing, if you get your head higher, then you
01:25:30.960 | still get something partial.
01:25:32.520 | Yeah, so I think we came up with a reward like distance from height squared, which made
01:25:38.880 | the problem easier.
01:25:39.880 | Yeah, the problem would have been a lot harder if you get zero reward until you get your
01:25:43.960 | head above the height.
01:25:45.560 | And it's actually-- that would be a problem of exploration, which is that you have to
01:25:51.920 | explore all the different states to figure out where you're going to get good reward.
01:25:56.520 | Thanks.
01:25:57.520 | One last question.
01:25:59.560 | OK, one last question.
01:26:02.720 | So I have a question about how do you choose to quantize your space time, because in your
01:26:06.640 | locomotion example, it clearly has a continuous system.
01:26:10.920 | Right.
01:26:11.920 | Yeah, so it's actually really important how you discretize time, like what time step you
01:26:16.840 | use, because the algorithm does care about what the time step is.
01:26:25.600 | That's because you have discount factors, and you're also sampling
01:26:32.240 | a different action at every time step.
01:26:34.080 | So yeah, so if you choose too small of a time step, then the rewards will be delayed by
01:26:43.520 | more time steps.
01:26:44.760 | So that makes the credit assignment harder.
01:26:47.560 | And also, your exploration will be more like a random walk, because you're changing your
01:26:52.480 | minds really frequently.
01:26:54.960 | So yeah, the time step is pretty important.
01:26:56.840 | And I'd say that's a flaw in current methods.
01:27:00.800 | OK, thank you.
01:27:02.480 | So with that, let's give John a thank you.
01:27:09.760 | Thank you.
01:27:10.760 | So take a short break.
01:27:12.280 | We convene in 15 minutes.