MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL)
Chapters
0:00 Introduction
2:14 Types of learning
6:35 Reinforcement learning in humans
8:22 What can be learned from data?
12:15 Reinforcement learning framework
14:06 Challenge for RL in real-world applications
15:40 Components of an RL agent
17:42 Example: robot in a room
23:05 AI safety and unintended consequences
26:21 Examples of RL systems
29:52 Takeaways for real-world impact
31:25 3 types of RL: model-based, value-based, policy-based
35:28 Q-learning
38:40 Deep Q-Networks (DQN)
48:00 Policy Gradient (PG)
50:36 Advantage Actor-Critic (A2C & A3C)
52:52 Deep Deterministic Policy Gradient (DDPG)
54:12 Policy Optimization (TRPO and PPO)
56:03 AlphaZero
60:50 Deep RL in real-world applications
63:09 Closing the RL simulation gap
64:44 Next step in Deep RL
00:00:00.000 |
Today I'd like to give an overview of the exciting field of deep reinforcement learning 00:00:04.640 |
introduce it and provide you with some of the basics 00:00:08.240 |
I think it's one of the most exciting fields in artificial intelligence 00:00:14.960 |
It's marrying the power of deep neural networks to understand the world 00:00:23.120 |
with the ability to act on that understanding 00:00:31.040 |
Taken as a whole, that's really what the creation of intelligent beings is 00:00:39.360 |
And the exciting breakthroughs that have recently happened 00:00:42.000 |
captivate our imagination about what's possible 00:00:45.440 |
And that's why this is my favorite area of deep learning and artificial intelligence in general 00:00:55.120 |
We've talked about deep learning, which is taking samples of data 00:01:02.080 |
and compressing and encoding a representation of that data in a way that you can reason about it 00:01:07.680 |
And we take that power and apply it to the world where sequential decisions are to be made 00:01:16.000 |
So it's looking at problems and formulations of tasks 00:01:22.640 |
Where an agent, an intelligent system has to make a sequence of decisions 00:01:38.880 |
Any intelligent being that is tasked with operating in the world. How do they learn anything? 00:01:43.680 |
Especially when you know very little in the beginning 00:01:47.280 |
Trial and error is the fundamental process by which reinforcement learning agents learn 00:01:53.280 |
And the deep part of deep reinforcement learning is neural networks 00:01:59.380 |
It's using neural networks within the reinforcement learning framework 00:02:02.720 |
Where the neural network is doing the representation 00:02:07.460 |
Of the world based on which the actions are made 00:02:17.760 |
Sometimes the terminology itself can confuse us about the fundamentals 00:02:22.340 |
There is supervised learning, there's semi-supervised learning, there's unsupervised learning, there's reinforcement learning 00:02:29.360 |
And there's this feeling that supervised learning is really the only one 00:02:33.360 |
Where you have to perform the manual annotation, where you have to do the large-scale supervision 00:02:42.640 |
Every type of machine learning is supervised learning 00:02:45.680 |
It's supervised by a loss function or a function that tells you what's good 00:02:55.680 |
Even looking at our own existence: how do we humans figure out what's good and bad? 00:03:02.080 |
There are all kinds of sources, direct and indirect, by which we form our morals and ethics and figure out what's good and bad 00:03:08.320 |
The difference between supervised, unsupervised, and reinforcement learning is the source of that supervision 00:03:16.560 |
and the hope is that the cost of human labor required to attain that supervision is low 00:03:25.440 |
It's turtles all the way down, and then there's a human at the bottom 00:03:30.960 |
At some point there needs to be human intervention 00:03:38.480 |
input to provide what's good and what's bad, and this arises in reinforcement learning as well 00:03:43.680 |
We have to remember that, because the challenges and the exciting opportunities of reinforcement learning lie in getting that supervision 00:03:53.220 |
in the most efficient way possible, but supervision nevertheless is required for any system that has an input and an output 00:04:01.840 |
and that's trying to learn, like a neural network does, to provide an output that's good. It needs somebody to say what's good and what's bad 00:04:09.360 |
For those of you curious about that, there have been a few books written throughout the last few centuries, from Socrates to Nietzsche 00:04:19.840 |
So let's look at supervised learning and reinforcement learning 00:04:23.620 |
I'd like to propose a way to think about the difference 00:04:28.160 |
That is illustrative and useful when we start talking about the techniques 00:04:41.200 |
Supervised learning is learning from examples, where ground truth provides you 00:04:47.440 |
the semantic meaning of what's in that data, and from those examples, one by one, whether sequences or single samples 00:04:56.720 |
we learn how to take future such samples and interpret them 00:05:02.240 |
Reinforcement learning is teaching an agent through experience 00:05:09.700 |
not by showing it singular samples of a dataset, but by putting it out into the world 00:05:15.840 |
The distinction there, the essential element of reinforcement learning for us 00:05:24.640 |
the essential design step, is to provide the world in which to experience 00:05:34.320 |
From the world the agent gets the dynamics of that world, the physics of the world; from that world 00:05:41.040 |
it gets the rewards, what's good and bad, and we as designers 00:05:44.660 |
of that agent don't just have to do the algorithm, we have to design the world 00:05:53.040 |
in which that agent is trying to solve a task 00:05:57.600 |
The design of the world is the process of reinforcement learning; the design of examples 00:06:03.520 |
the annotation of examples, is the world of supervised learning 00:06:06.500 |
And the essential, perhaps the most difficult, element of reinforcement learning is the reward: the good versus the bad 00:06:26.400 |
Reaching the destination, that's success; and failure is the inability to reach that destination 00:06:35.760 |
The way we learn from these very few examples 00:06:40.980 |
or appear to learn from very few examples through trial and error 00:06:46.240 |
is a mystery, a beautiful mystery full of open questions 00:06:49.280 |
It could be from the huge amount of data: 230 million years' worth of walking data, that we've been walking 00:06:55.760 |
mammals' ability to walk, or 500 million years of the ability to see, of having eyes 00:07:02.240 |
So that's the hardware side: somehow genetically encoded in us is the ability to comprehend this world extremely efficiently 00:07:12.640 |
Or it could be the software side, not the hardware, not the 500 million years, but the 00:07:20.880 |
early period, maybe even years, in the very beginning when we're born 00:07:24.720 |
the ability to learn really quickly through observation, to aggregate that information 00:07:29.940 |
filter all the junk that you don't need, and be able to learn really quickly 00:07:34.480 |
through imitation learning, through observation; for walking that might mean observing others walk 00:07:45.760 |
around us, without which we would never be able to learn the fundamentals of walking, or not as efficiently 00:07:53.200 |
And then it could be the algorithm, totally not understood: the algorithm that our brain uses to learn 00:08:01.520 |
Backpropagation, that's in artificial neural networks; the same kind of process is not understood in the brain 00:08:11.520 |
So I want you to think about that as we talk about 00:08:16.160 |
the accomplishments, by comparison, in reinforcement learning, and how we take the next steps 00:08:20.880 |
But it nevertheless is exciting to have machines that learn how to act in the world 00:08:38.000 |
The process of learning is thought of as intelligence: it's the ability to know very little and, through experience, examples 00:08:45.620 |
interaction with the world in whatever medium, whether it's data or simulation and so on, be able to form much richer and more interesting 00:08:53.780 |
representations of that world and be able to act in that world. That's the dream 00:08:58.080 |
So let's look at this stack of what it means to be an agent in this world 00:09:05.040 |
At the bottom of the stack, the input, there's an environment. We have to sense that environment 00:09:13.360 |
There are several sensory systems: on cars you can have lidar, camera 00:09:20.080 |
stereo vision, audio, microphone, networking, GPS, IMU sensors and so on, for whatever robot you can think about 00:09:29.760 |
And you have this raw sensory data, and then once you have the raw sensory data you're tasked with 00:09:34.720 |
representing that data in such a way that you can make sense of it, as opposed to, like 00:09:40.240 |
the raw sensors in the eye, the cones and so on, taking it as just a giant stream of high-bandwidth information 00:09:52.400 |
You form abstractions, features based on which we can reason, from edges to corners to faces 00:09:57.760 |
and so on. That's exactly what deep learning neural networks have stepped in to do: 00:10:02.480 |
in an automated fashion, with as little human input as possible, form higher-order representations of that information 00:10:10.100 |
Then there's the learning aspect: building on top of the greater abstractions formed through representations 00:10:17.620 |
be able to accomplish something useful, whether it's a discriminative task, a generative task and so on, based on the representation 00:10:25.040 |
Be able to make sense of the data, be able to generate new data and so on 00:10:29.440 |
from sequence to sequence, from sequence to sample, from sample to sequence and so on and so forth, to actions, as we'll talk about 00:10:42.780 |
Then there's aggregating all the information that's been received in the past into what's useful 00:10:51.440 |
pertinent to the task at hand. It's the old saying: 00:10:55.040 |
it looks like a duck, quacks like a duck, swims like a duck 00:10:58.240 |
Three different data sets; I'm sure there are state-of-the-art algorithms for the three: image classification 00:11:08.560 |
activity recognition and so on. Aggregating those three together 00:11:12.000 |
is still an open problem, and that could be the last piece. Again 00:11:16.480 |
I want you to think, as we think about reinforcement learning agents: how do we play? 00:11:20.880 |
How do we transfer from the game of Atari to the game of Go to the game of Dota to a robot 00:11:29.040 |
navigating an uncertain environment in the real world? 00:11:32.720 |
And once you have that, once you sense the raw world, once you have a representation of that world, then comes the action side 00:11:43.440 |
which is to provide actions within the constraints of the world in such a way that we believe can get us towards success 00:11:51.680 |
The promise, the excitement of deep learning is the part of the stack that converts raw data into meaningful representations 00:11:58.900 |
The promise, the dream of deep reinforcement learning 00:12:05.040 |
is building an agent that uses that representation to act 00:12:13.920 |
The framework and the formulation of reinforcement learning 00:12:22.240 |
is that there's an environment and there's an agent that acts in that environment 00:12:38.880 |
The agent gives the environment an action, it acts in that environment, and through the action 00:12:43.920 |
the environment changes in some way and then a new observation occurs 00:12:49.360 |
And then also, as you provide the action and make the observations, you receive a reward 00:12:53.940 |
In most formulations of this framework 00:13:04.080 |
the only thing you need to be concerned about is the state you came from, the state you arrived in, and the reward received 00:13:09.780 |
The open question here is what can't be modeled in this kind of way. Can we model all of it? 00:13:22.480 |
And is this a good way to formulate the learning problem of robotic systems 00:13:29.520 |
in the real world, in the simulated world? Those are the open questions 00:13:40.960 |
The environment could be single-agent or multi-agent, Atari versus driving like DeepTraffic 00:13:52.060 |
static as in chess or dynamic as in driving and most real-world applications, and discrete versus continuous: discrete like games 00:13:59.100 |
such as chess, or continuous like cart-pole, balancing a pole on a cart 00:14:03.500 |
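To make that loop concrete, here is a minimal sketch of the agent-environment interaction in Python; the toy GridEnvironment and the random policy are hypothetical stand-ins, not something from the lecture.

```python
import random

class GridEnvironment:
    """Toy environment: walk right from 0; reaching state 5 ends the episode with reward +1."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                 # action: -1 (left) or +1 (right)
        self.state = max(0, self.state + action)
        done = self.state == 5
        reward = 1.0 if done else -0.04     # small per-step cost, like the examples later
        return self.state, reward, done

def random_policy(state):
    return random.choice([-1, +1])

env = GridEnvironment()
state, done, total = env.reset(), False, 0.0
while not done:
    action = random_policy(state)           # agent acts
    state, reward, done = env.step(action)  # environment changes, new observation arrives
    total += reward                         # the reward signal is the supervision
print("episode return:", total)
```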
The challenge for RL in real world applications 00:14:16.280 |
Supervised learning is teaching by example learning by example 00:14:23.860 |
Reinforcement learning is teaching by experience 00:14:26.300 |
And the way we provide experience to reinforcement learning agents currently for the most part is through simulation 00:14:33.100 |
Or through highly constrained real world scenarios 00:14:54.280 |
There are two directions of research and work. One is on the 00:15:00.740 |
algorithms: improve the ability of the algorithms 00:15:03.640 |
to form policies that are transferable across all kinds of domains, including the real world, including especially the real world 00:15:10.920 |
So train in simulation, transfer to the real world 00:15:16.920 |
The other is to improve the simulation in such a way that the fidelity of the simulation increases to the point where the gap is 00:15:28.760 |
minimal, to a degree that things learned in simulation are directly, trivially transferable to the real world 00:15:50.440 |
The agent sees the world and makes a decision; that's a policy: it makes a decision how to act, sees the reward 00:15:56.520 |
sees a new state, acts, sees a reward, sees a new state and acts, and this repeats 00:16:16.260 |
Then there's the value function, evaluating a state-action pair: taking an action 00:16:19.540 |
in a particular state, how good is that? It's the ability to evaluate that 00:16:27.540 |
Then there's the model, which is different from the environment's own model, from the perspective of the agent 00:16:30.340 |
So the environment has a model based on which it operates 00:16:33.640 |
and the agent has a representation, its best understanding, of that model 00:16:44.340 |
The goal, simply formulated in this framework, is to maximize reward 00:16:49.780 |
The reward, mathematically and practically, is talked about 00:16:53.140 |
with a discounted framework, so we discount reward further and further into the future 00:17:00.740 |
So reward that's farther into the future means less to us in terms of maximization than reward 00:17:07.460 |
that's in the near term. And so why do we discount it? 00:17:11.220 |
First, a lot of it is a math trick, to be able to prove and analyze certain aspects of convergence 00:17:20.820 |
And because environments either are, or can be thought of as, stochastic, random, it's very difficult 00:17:27.720 |
there's a degree of uncertainty, which makes it difficult to really estimate 00:17:35.940 |
the reward that will come in the future, because of the ripple effect of the uncertainty 00:17:46.640 |
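For reference, the discounted return being maximized is usually written as follows (a standard formulation, not copied from the slides; the exact indexing convention varies by textbook):

$$ G_t = r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, \qquad 0 \le \gamma \le 1 $$

where $\gamma$ is the discount factor: the closer it is to 1, the more the agent cares about far-future reward.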
Policies, rewards, actions: there's a robot in a room. There are 00:17:50.980 |
12 cells in which you can step; it starts in the bottom left and it tries to get a reward at the top right 00:17:58.740 |
There's a plus one, a really good thing, at the top right. It wants to get there by walking around 00:18:05.300 |
There's a negative one, which is really bad; it wants to avoid that square. And the choice of actions is up, down, left, right: four actions 00:18:16.260 |
There's also a negative reward of 0.04 for each step 00:18:20.260 |
So there's a cost to each step, and there's potentially a stochastic nature to this world; we'll talk about both deterministic and stochastic 00:18:26.920 |
So in the stochastic case, when you choose the action up 00:18:36.660 |
you move up with an 80% chance, with a 10% chance you move left, and another 10% you move right 00:18:41.300 |
So that's the stochastic nature: even though you try to go up, you might end up in the block to the left or to the right 00:18:53.280 |
In the deterministic case, given that we always start in the bottom left, the optimal policy is really the shortest path 00:18:56.580 |
Because there's no stochasticity 00:19:00.980 |
you're never going to screw up and just fall into the negative-one hole, so you just compute the shortest path and 00:19:07.140 |
walk along that shortest path. Why the shortest path? Because every single step hurts: there's a negative reward to it 00:19:13.860 |
0.04, so the shortest path is the thing that maximizes the total reward on the way to the destination 00:19:22.820 |
Okay, let's look at a stochastic world, like I mentioned: 80% up and then split 10% to the left and right 00:19:29.860 |
How does the policy change? Well, first of all 00:19:34.980 |
we need to have a plan for every single block in the area, because you might end up there due to the stochasticity of the world 00:19:40.980 |
Okay, the basic addition there is in how we behave 00:19:49.300 |
the closer we get to the negative-one hole: we try to avoid actions like up, because 00:19:56.100 |
the stochastic nature of up means you might fall into the hole with a 10% chance 00:20:00.340 |
And given the 0.04 step penalty, you're willing to take the long way home 00:20:05.620 |
in some cases, in order to avoid that possibility, the negative-one possibility 00:20:10.760 |
Now let's look at the reward for each step: if it decreases to negative two, it really hurts to take every step 00:20:17.380 |
Then again we go to the shortest path, despite the fact that there's a stochastic nature 00:20:24.180 |
In fact, you don't really care if you step into the negative-one hole, because every step really hurts; you just want to get home 00:20:30.260 |
And then you can play with this reward structure 00:20:35.380 |
Instead of negative two or negative 0.04 you can look at different values: 00:20:52.420 |
the higher the negative reward for each step, the more urgency 00:20:59.160 |
versus less urgency the lower the negative reward 00:21:10.980 |
And if every step gives a positive reward, so the entire system, which is actually 00:21:17.140 |
quite common in reinforcement learning, the entire system is full of positive rewards 00:21:21.940 |
then the optimal policy becomes the longest path 00:21:25.220 |
It's like grad school: taking as long as possible, never reaching the destination 00:21:36.900 |
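To see how the reward structure flips the optimal policy, here is a small value-iteration sketch of the 4x3 robot-in-a-room world described above (80/10/10 stochastic moves, +1 and -1 terminal cells, one wall); the exact grid layout and the discount value are my assumptions about the standard version of this example, not taken from the slides.

```python
import itertools

ROWS, COLS = 3, 4
WALL = {(1, 1)}                                     # one blocked cell
TERMINALS = {(2, 3): +1.0, (1, 3): -1.0}            # +1 goal top-right, -1 hole below it
ACTIONS = {"up": (1, 0), "down": (-1, 0), "left": (0, -1), "right": (0, 1)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def move(state, action):
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    if nxt in WALL or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        return state                                # bump into wall or edge: stay put
    return nxt

def transitions(state, action):
    """80% intended direction, 10% each perpendicular direction."""
    side1, side2 = PERP[action]
    return [(0.8, move(state, action)), (0.1, move(state, side1)), (0.1, move(state, side2))]

def value_iteration(step_reward, gamma=0.99, iters=200):
    states = [s for s in itertools.product(range(ROWS), range(COLS)) if s not in WALL]
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        for s in states:
            if s in TERMINALS:
                V[s] = TERMINALS[s]
                continue
            V[s] = max(step_reward + gamma * sum(p * V[s2] for p, s2 in transitions(s, a))
                       for a in ACTIONS)
    return {s: max(ACTIONS, key=lambda a: sum(p * V[s2] for p, s2 in transitions(s, a)))
            for s in states if s not in TERMINALS}

# Same world, same dynamics; only the per-step reward changes, and so does the policy:
# cautious detours, a rush for the exit, or (with positive step reward) never finishing.
for r in (-0.04, -2.0, +0.5):
    print(r, value_iteration(r))
```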
What lessons do we draw from the robot in the room? Two things 00:21:40.660 |
The environment model, the dynamics: even in this trivial example, the stochastic nature, the difference between 80 percent and 100 percent 00:21:51.060 |
the model of the world, of the environment, has a big impact on what the optimal policy is 00:21:55.780 |
And the reward structure, most importantly, is the thing we can often control 00:22:06.900 |
in our constructs of the task we try to solve in reinforcement learning: 00:22:11.140 |
what is good and what is bad, how bad is it and how good is it. The reward structure has a big effect 00:22:21.140 |
like Robert Frost said, it can make all the difference: a complete change in the 00:22:25.540 |
policy, the choices the agent makes. So when you formulate a reinforcement learning framework 00:22:33.380 |
as researchers, as students, what you often do is you design the environment, you design the world in which the system learns 00:22:41.140 |
Even when your ultimate goal is a physical robot, there's still a lot of work done in simulation 00:22:48.100 |
So you design the world, the parameters of that world, and you also design the reward structure, and that can have 00:22:53.780 |
transformative results: slight variations in those parameters can have huge effects 00:23:01.060 |
huge differences in the policy that's arrived at. And of course 00:23:05.060 |
the example I've shown before, that I really love, is about how the 00:23:12.580 |
impact of changing the reward structure might have unintended consequences 00:23:21.140 |
Consequences for a real-world system can obviously have 00:23:27.680 |
costs that are more than just a failed game of Atari 00:23:31.220 |
So here's a human performing the task, playing the game of Coast Runners, racing around the track 00:23:41.860 |
If you finish fast you get a lot of points, and so it's natural to say, okay 00:23:47.780 |
let's have an RL agent and optimize it for those points 00:23:51.700 |
And what you find out in the game is that you also get points by picking up the little green turbo things 00:23:59.380 |
And what the agent figures out is that you can actually get a lot more points 00:24:12.020 |
just rotating over and over, slamming into the wall, fire and everything, just picking them up, especially because you 00:24:21.540 |
can avoid the terminal state at the end of finishing the race. In fact, finishing the race means you stop collecting positive reward 00:24:28.580 |
So you never want to finish, just collect the turbos 00:24:38.360 |
But examples are out there of unintended consequences that can have highly negative, detrimental effects when put in the real world 00:24:48.740 |
When you put robots, four-wheeled ones like autonomous vehicles, into the real world 00:24:53.780 |
and you have objective functions, they have to navigate difficult intersections full of pedestrians 00:24:59.080 |
So you have to form intent models of those pedestrians. Here you see cars asserting themselves through dense intersections 00:25:08.820 |
With those risks that are taken by us humans when we drive vehicles 00:25:14.260 |
we then have to encode that ability to take subtle risk into the machine 00:25:23.960 |
Then you have to think about, at the end of the day, there's an objective function 00:25:29.540 |
and if that objective function does not anticipate the green turbos that are to be collected 00:25:36.020 |
and the unintended consequences that result, it could have 00:25:43.300 |
negative effects, especially in situations that involve human life 00:25:47.300 |
That's the field of AI safety, and some of the folks we talk about, DeepMind and OpenAI 00:25:52.980 |
that are doing incredible work in RL, also have groups working on AI safety, for a very good reason 00:26:02.900 |
I believe that artificial intelligence will define some of the most impactful positive things 00:26:09.060 |
In the 21st century, but I also believe we are nowhere close 00:26:13.700 |
To solving some of the fundamental problems of AI safety that we also need to address as we develop those algorithms 00:26:20.440 |
So, okay, examples of reinforcement learning systems 00:26:23.860 |
All of it has to do with the formulation of rewards, states, and actions. You have the traditional cart-pole 00:26:37.600 |
balancing a pole, continuous control: the action is the horizontal force applied to the cart 00:26:44.160 |
so the pole stays up on the moving cart, and the reward is one at each time step if the pole is upright 00:26:49.920 |
The state, measured by the agent, is the pole angle and angular speed 00:26:55.920 |
and of course self-sensing of the cart: its position and horizontal velocity 00:27:01.460 |
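As a concrete version of that formulation, here is a short sketch using the CartPole environment from OpenAI Gym (mentioned later in the lecture as a place to practice). Note that Gym's CartPole-v1 discretizes the action into a left or right push rather than a continuous force, and the reset/step signatures differ slightly between Gym versions, which the sketch tries to tolerate.

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()                           # older Gym returns obs; newer returns (obs, info)
if isinstance(obs, tuple):
    obs = obs[0]

# State: cart position, cart velocity, pole angle, pole angular velocity.
print("state:", obs)
print("actions:", env.action_space)         # Discrete(2): push cart left or right

done, total = False, 0.0
while not done:
    action = env.action_space.sample()       # placeholder policy: random pushes
    result = env.step(action)
    obs, reward = result[0], result[1]        # reward is +1 per step the pole stays upright
    done = result[2] if len(result) == 4 else (result[2] or result[3])
    total += reward
print("episode return:", total)
```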
Another example here: I didn't want to include the video because it's really disturbing 00:27:07.360 |
but I do want to include this slide because it's really important to think about: 00:27:11.440 |
by sensing the raw pixels, learning and teaching an agent to play a shooter game 00:27:19.520 |
So the goal there is to eliminate all opponents 00:27:22.260 |
The state is the raw game pixels, the actions are up, down, shoot, reload and so on 00:27:34.000 |
and the reward is positive when an opponent is eliminated and negative when the agent is eliminated. Simple 00:27:38.800 |
I added it here because, again, on the topic of AI safety 00:27:44.420 |
we have to think about objective functions and how they translate into the world of not just autonomous vehicles but 00:27:56.720 |
things that even more directly do harm, like autonomous weapon systems. We have a lecture on this in the AGI series 00:28:05.280 |
On the robotics platform side, there's object manipulation, grasping objects. There are a few benchmarks and a few interesting applications 00:28:18.320 |
manipulating objects, rotating them and so on, especially when those objects don't have complicated shapes 00:28:25.680 |
And so the goal, in the grasping-object challenge purely, is to pick up an object 00:28:31.200 |
The state is the visual information, so it's vision-based, the raw pixels of the objects 00:28:36.800 |
The action is to move the arm, grasp the object, and pick it up 00:28:40.160 |
and obviously the reward is positive when the pickup is successful 00:28:48.960 |
The reason I'm excited is because it will finally allow us to solve the problem of the claw machine 00:28:58.880 |
I don't know, that's not at all why I'm excited, but okay 00:29:02.080 |
And then we have to think, as we get a greater and greater degree of application in the real world with robotics 00:29:11.520 |
The main focus of my passion in terms of robotics is: how do we encode some of the things that us humans encode? 00:29:19.840 |
We have to think about our own objective function, our own reward structure, our own model of the environment that we perceive and reason 00:29:27.200 |
about, in order to then encode machines that are doing the same, and I believe autonomous driving is in that category 00:29:33.040 |
We have to ask questions of ethics. We have to ask questions 00:29:38.080 |
of risk, value of human life, value of efficiency, money and so on; all these are fundamental questions that an autonomous vehicle 00:29:45.600 |
Unfortunately has to solve before it becomes fully autonomous 00:29:54.960 |
So, takeaways for the real-world impact of reinforcement learning agents 00:30:01.920 |
Okay, for these neural networks that form higher-order representations: 00:30:04.400 |
the fun part is the algorithms, all the different architectures, the different encoder-decoder structures 00:30:13.140 |
recurrence, LSTMs, GRUs, all the fun architectures, and the data, and 00:30:19.680 |
the ability to leverage different data sets in order to 00:30:27.180 |
perform discriminative tasks better than, you know 00:30:30.640 |
MIT does better than Stanford, that kind of thing. That's the fun part 00:30:34.800 |
The hard part is asking good questions and collecting huge amounts of data that's representative of the task 00:30:41.920 |
That's for real-world impact; not CVPR publication, real-world impact 00:30:46.320 |
A huge amount of data. On the deep reinforcement learning side, the key challenge: 00:30:53.040 |
the fun part, again, is the algorithms, how we learn from data, some of the stuff I'll talk about today 00:30:57.520 |
The hard part is defining the environment, defining the action space and the reward structure 00:31:03.600 |
As I mentioned, this is the big challenge, and the hardest part is how to crack the gap between simulation and the real world 00:31:14.960 |
how to solve that transfer learning problem, for the real-world impact 00:31:22.240 |
There's countless algorithms and there's a lot of ways to taxonomize them, but at the highest level 00:31:41.280 |
there are model-based methods, where you construct an estimate of how you believe the dynamics of that world operate 00:31:51.680 |
The benefit of doing that is, once you have a model, or an estimate of a model, you're able to 00:31:56.480 |
anticipate, you're able to plan into the future. You're able to 00:32:04.320 |
predict, in a branching way, how your actions will change the world, so you can plan far into the future 00:32:10.240 |
This is the mechanism by which you can plan 00:32:15.520 |
in the simplest form, because in chess you don't even need to learn the model 00:32:18.880 |
the model is given to you: chess, Go and so on 00:32:21.520 |
The most important way in which these methods differ, I think, is sample efficiency: 00:32:26.500 |
how many examples of data are needed to be able to successfully operate in the world 00:32:32.000 |
And so model-based methods, because they're constructing a model, are the most sample efficient 00:32:39.360 |
because once you have a model you can do all kinds of reasoning that doesn't require 00:32:44.100 |
experiencing every possibility of that model; you can 00:32:48.720 |
unroll the model to see how the world changes based on your actions 00:32:53.600 |
Value-based methods are ones that look to estimate 00:32:58.880 |
the quality of states, the quality of taking a certain action in a certain state 00:33:08.320 |
These are off-policy, versus the last category, which is on-policy. What does it mean to be off-policy? It means that 00:33:16.080 |
value-based agents constantly update their estimate of how good it is to take an action in a state 00:33:28.240 |
and they use that to pick the optimal action 00:33:32.000 |
They don't directly learn a policy, a strategy of how to act; they learn how good it is 00:33:39.600 |
to be in a state and use that goodness information to then act 00:33:46.960 |
and every once in a while flip a coin in order to explore 00:33:49.920 |
And then policy-based methods are ones that directly learn a policy function 00:34:02.140 |
a neural network takes in the representation of that world and its output is the policy, the action to take 00:34:09.360 |
So, okay, that's the range: model-based, value-based, and policy-based 00:34:14.820 |
Here's an image from OpenAI that I really like. I encourage you 00:34:19.700 |
as we further explore here, to look up Spinning Up in Deep RL from OpenAI 00:34:26.120 |
Here's an image that taxonomizes, in the way that I described, some of the recent developments 00:34:34.420 |
At the top is the distinction between model-free RL and model-based RL 00:34:40.180 |
In model-free RL, which is what we'll focus on today, there is a distinction between policy optimization 00:34:46.680 |
so on-policy methods, and Q-learning, which is off-policy. Policy optimization methods directly optimize the policy 00:35:00.500 |
Q-learning, off-policy methods, learn, like I mentioned, the value of taking a certain action in a state, and from that derive how to act 00:35:27.940 |
DQN really was one of the first great breakthroughs 00:35:30.600 |
from Google DeepMind on the deep RL side, solving Atari games 00:35:40.500 |
Let's take a step back and think about what Q-learning is 00:35:43.380 |
Q-learning looks at the state-action value function 00:35:50.260 |
that estimates, based on a particular policy or based on an optimal policy, how good it is to take an action: 00:36:02.720 |
what reward do I get if I take an action in this state and continue operating under an optimal policy? 00:36:11.620 |
Amongst all the actions I have, which action should I take to maximize the reward? 00:36:16.660 |
Now, in the beginning you know nothing; you don't have this value estimate 00:36:24.660 |
So you have to learn it, and you learn it with a Bellman equation update: 00:36:28.980 |
you take your current estimate and update it with the reward you see 00:36:37.700 |
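That update is commonly written as follows, where α is the learning rate and γ the discount factor (a standard form, not copied from the slides):

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] $$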
It's off policy and model free. You don't have to have any estimate or knowledge of the world 00:36:43.380 |
You don't have to have any policy whatsoever. All you're doing is 00:36:47.460 |
roaming about the world, collecting data: when you took a certain action, here's the reward you received, and you update your estimate 00:37:14.820 |
You're learning the value of taking an action, so you can always take the optimal one 00:37:20.740 |
but because you know very little in the beginning, that "optimal" action is going to be based on a poor estimate 00:37:25.140 |
You have no way of knowing whether it's good or not 00:37:28.020 |
So there's some degree of exploration; the fundamental aspect of value-based methods, or any RL method 00:37:34.180 |
like I said, it's trial and error, is exploration 00:37:36.920 |
For value-based methods like Q-learning, the way that's done is with a flip of a coin: epsilon-greedy 00:37:44.340 |
With a flip of a coin, with probability epsilon, you choose to just take a random action 00:37:52.020 |
and you slowly decrease epsilon to zero as your agent learns more and more and more 00:37:57.940 |
So in the beginning you explore a lot, with an epsilon of one, and an epsilon of zero at the end, when you're just acting greedily based on 00:38:05.540 |
your understanding of the world as represented by the Q value function 00:38:09.300 |
For non-neural-network approaches, this is simply a table: the Q 00:38:14.420 |
this Q function, is a table, like I said, with states on the y-axis 00:38:19.300 |
actions on the x-axis, and in each cell you have a reward, the 00:38:26.500 |
discounted reward you estimate will be received there 00:38:29.060 |
And as you walk around, with this Bellman equation you can update that table 00:38:34.740 |
It's a table nevertheless: number of states times number of actions 00:38:38.580 |
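Here is a minimal tabular Q-learning sketch tying together the table, the Bellman update, and the decaying epsilon-greedy exploration described above; the toy chain environment and the hyperparameter values are illustrative assumptions, not taken from the lecture.

```python
import random
from collections import defaultdict

N_STATES, GOAL = 6, 5
ACTIONS = [-1, +1]                                    # left, right

def step(state, action):
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == GOAL else -0.04            # small per-step cost, +1 at the goal
    return nxt, reward, nxt == GOAL

Q = defaultdict(float)                                # the Q "table": (state, action) -> value
alpha, gamma, epsilon = 0.1, 0.95, 1.0

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        # Bellman update: move the estimate toward reward + discounted best next value.
        target = reward + gamma * max(Q[(nxt, a)] for a in ACTIONS) * (not done)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = nxt
    epsilon = max(0.05, epsilon * 0.99)               # slowly decay exploration

print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES - 1)})
```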
Now if you look at any practical real-world problem 00:38:41.460 |
And an arcade game with raw sensory input is a very crude first step towards the real world; so, with raw sensory information 00:38:52.660 |
this kind of value iteration, updating a table, is impractical 00:38:57.160 |
Because, here, for a game of Breakout, if you look at four consecutive frames of a game of Breakout 00:39:03.080 |
the size of the raw sensory input is 84 by 84 pixels per frame, and the number of possible states 00:39:27.140 |
is significantly larger than the number of atoms in the universe 00:39:29.720 |
So the size of this Q table, if we use the traditional approach, is intractable 00:39:41.700 |
Deep RL is RL plus neural networks, where the neural network is tasked, in value-based 00:39:49.860 |
methods, with taking this Q table and learning a compressed representation of it 00:40:00.660 |
That's what we previously talked about: the powerful ability of neural networks to form compressed representations of 00:40:10.080 |
extremely high-dimensional, complex raw sensory information 00:40:14.740 |
So it's simple: the framework remains for the most part the same in reinforcement learning; it's just that this Q function 00:40:21.300 |
for value-based methods becomes a neural network, a function approximator 00:40:27.240 |
where the hope is that as you navigate the world and pick up new knowledge, through 00:40:32.900 |
backpropagating the gradient of the loss function, you're able to form a good representation of the optimal Q function 00:40:42.340 |
So we use neural networks for what neural networks are good at, which is function approximation 00:40:45.960 |
And that's DQN. The deep Q-network was used to get the initial 00:40:51.780 |
incredibly nice results on the arcade games, where the input is the raw sensory pixels, with a few convolutional and fully connected layers 00:41:06.640 |
and the output is, for each action, an estimate of the value of taking that action, and then you choose the best action 00:41:10.500 |
And so this simple agent, with a neural network that estimates that Q function, achieves 00:41:18.100 |
superhuman performance on many of these arcade games. That excited the world, because it's taking raw sensory information 00:41:28.020 |
It doesn't, in the beginning, understand any of the physics of the world, any of the dynamics of the environment, and through that intractable 00:41:39.300 |
state space it's able to learn how to actually do pretty well 00:41:57.940 |
The loss compares the Q value of taking an action in a particular state 00:42:04.420 |
against a target, against which the loss function is calculated, which is: what is the value 00:42:15.780 |
once you've taken that action? The way you calculate that is by looking at the next state and choosing the 00:42:20.900 |
maximum: if you take the best action in the next state 00:42:25.620 |
what is the Q function going to be? So there are two estimates going on, in terms of neural networks 00:42:31.620 |
There are two forward passes here; there are two Q's in this equation 00:42:41.780 |
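Written out, the loss with the two Q's takes roughly the form used in the DQN paper, where θ are the online network parameters and θ⁻ the fixed target network parameters discussed next:

$$ L(\theta) = \mathbb{E}_{(s,a,r,s')}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\big)^{2}\Big] $$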
With a few tricks, and in double DQN that's done by two neural networks 00:42:46.200 |
And I mention tricks because, with this and with most of RL, the tricks tell a lot of the story 00:42:55.940 |
A lot of what makes systems work is the details 00:43:04.180 |
The two biggest tricks for DQN, which will reappear in a lot of value-based methods, are experience replay and a fixed target network 00:43:12.180 |
So think of an agent that plays through these games, storing its experiences in a replay memory 00:43:25.380 |
The power of that is one of the central elements of what makes value-based methods attractive: 00:43:33.300 |
because you're not directly estimating the policy, but are learning the quality of taking an action in a particular state 00:43:41.380 |
you're able to then jump around through your memory and replay 00:43:48.100 |
so learn, train the network, through the historical data. And then the other trick is the fixed target network 00:44:00.740 |
Because the network defines its own target, it's a dragon chasing its own tail: 00:44:05.140 |
it's easy for the loss function to become unstable, so the training does not converge 00:44:10.580 |
So the trick of fixing a target network is taking one of the Q's 00:44:15.060 |
the same kind of network, and just fixing it, only updating it every X steps, every thousand steps and so on 00:44:22.180 |
So for the target network that defines the loss function, you just keep it fixed and only update it periodically 00:44:27.800 |
So you're chasing a fixed target with the loss function, as opposed to a dynamic one 00:44:37.300 |
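Putting the two tricks together, here is a sketch of a single DQN training step in PyTorch: sample a minibatch from the replay buffer, compute the target with a frozen target network, and periodically sync the target network. Network sizes and hyperparameters are illustrative assumptions, and the buffer is assumed to already contain (state, action, reward, next_state, done) tuples from environment interaction.

```python
import random
from collections import deque
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())       # start with identical weights
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=100_000)                        # experience replay buffer

def train_step(batch_size=32):
    # Sampling old experiences decorrelates updates and reuses past data.
    batch = random.sample(replay, batch_size)
    s, a, r, s2, done = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)      # Q(s, a; theta)
    with torch.no_grad():                              # fixed target: no gradient through it
        target = r + gamma * target_net(s2).max(dim=1).values * (1 - done)
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

def sync_target():                                     # call only every N steps
    target_net.load_state_dict(q_net.state_dict())
```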
With minimal effort, it comes up with some creative solutions here 00:44:41.220 |
Breakout here, after 10 minutes of training on the left, and after two hours of training on the right 00:44:46.900 |
it's coming up with some creative solutions. Again, it's pretty cool because this is raw pixels, right? We now 00:44:58.740 |
kind of take it for granted, but I'm still, for the most part 00:45:03.140 |
captivated by just how beautiful it is that from raw sensory information it learns 00:45:12.340 |
to act in a way that actually supersedes humans, in terms of creativity and in terms of 00:45:16.420 |
actual raw performance. It's really exciting, and games of simple form are the cleanest way to demonstrate that 00:45:26.660 |
The same kind of DQN network is able to achieve superhuman performance on a bunch of different games 00:45:31.700 |
There are improvements to this, like dueling DQN. Again 00:45:36.100 |
the Q function can be decomposed, which is useful, into a value estimate 00:45:44.020 |
and what in future slides will be called the advantage 00:45:47.000 |
So, the advantage of taking an action in that state. The nice thing about the advantage is 00:45:54.900 |
it's a measure of the action's quality relative to the average 00:46:00.020 |
action that could be taken there. That's very useful, advantage versus sort of raw value 00:46:06.900 |
when all the actions you could take are pretty good 00:46:21.540 |
So you have these two estimates, two streams in the neural network, in a dueling DQN 00:46:28.000 |
where one stream estimates the value and the other the advantage 00:46:32.900 |
And again, that dueling nature is useful 00:46:39.120 |
also because there are many states in which the 00:46:43.360 |
quality of the actions is decoupled from the state; in many states it doesn't matter which action you take 00:46:54.000 |
So you don't need to learn all the different complexities 00:46:57.300 |
all the topology of the different actions, when you're in such a state 00:47:05.280 |
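A sketch of that two-stream idea in PyTorch: the value and advantage streams are combined, with the advantage mean subtracted as in the dueling-architecture paper, to produce the Q-values. The layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feature_dim=64, n_actions=4):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)              # V(s): how good the state is
        self.advantage = nn.Linear(feature_dim, n_actions)  # A(s,a): how much better each action is
    def forward(self, features):
        v = self.value(features)
        a = self.advantage(features)
        # Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)); subtracting the mean keeps the split identifiable.
        return v + a - a.mean(dim=1, keepdim=True)

q_values = DuelingHead()(torch.randn(1, 64))                 # features from an upstream encoder
print(q_values.shape)                                        # torch.Size([1, 4])
```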
Another improvement is prioritized experience replay. Like I said, experience replay is really key to these algorithms 00:47:10.660 |
and its absence is the thing that sinks some of the policy optimization methods 00:47:15.040 |
Experience replay is collecting different memories 00:47:18.500 |
but if you just sample randomly from those memories 00:47:27.680 |
learning is really affected by the frequency with which those experiences occurred, not their importance 00:47:32.720 |
So prioritized experience replay assigns a priority to each memory 00:47:44.640 |
so the stuff you have learned the most from is given a higher priority, and therefore you get to see it more often 00:47:58.320 |
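In the prioritized experience replay paper, that priority is derived from the magnitude of the TD error δ, and sampling follows roughly this form (the standard published formulation, not something stated in the lecture; α controls how strongly priorities are used, and ε keeps every probability non-zero):

$$ P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, \qquad p_i = |\delta_i| + \epsilon $$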
Okay, moving on to policy gradients. These are on-policy, versus Q-learning, which is off-policy 00:48:13.040 |
They directly optimize the policy: the input is the raw pixels 00:48:23.520 |
the network forms representations of that environment space, and its output produces a stochastic estimate 00:48:29.280 |
a probability for the different actions. Here, in Pong, from the pixels 00:48:33.460 |
a single output produces the probability of moving the paddle up 00:48:38.800 |
So how does the vanilla policy gradient, the very basic version, work? 00:48:43.120 |
You unroll the environment, you play through the environment 00:48:49.600 |
here Pong, moving the paddle up and down and so on, collecting no reward along the way 00:49:01.760 |
Then every single action you took along the way gets either punished or rewarded based on whether it led to victory or defeat 00:49:07.680 |
It's remarkable that this works at all 00:49:17.680 |
I mean, the credit for every single thing you did along the way 00:49:23.520 |
is muddied. It's the reason policy gradient methods are less efficient, but it's still very surprising that it works at all 00:49:30.800 |
So, the pros versus DQN, the value-based methods: 00:49:35.920 |
if the world is so messy that you can't learn a Q function, the nice thing about policy gradients, because they learn 00:49:42.080 |
the policy directly, is that they will at least learn a pretty good policy 00:49:46.480 |
Usually, in many cases, faster convergence; they're able to deal with stochastic policies 00:49:51.120 |
value-based methods cannot learn stochastic policies, and they're much more naturally able to deal with continuous actions 00:50:05.360 |
The cons: it can become highly unstable during the training process, and we'll talk about some solutions to this, and there's the credit assignment problem 00:50:13.040 |
If we look at the chain of actions that leads to a positive reward 00:50:17.600 |
some might be awesome actions, some might be good actions 00:50:21.520 |
some might be terrible actions, but that doesn't matter as long as the destination was good 00:50:26.400 |
and then every single action along the way gets a positive reinforcement. That's the downside 00:50:31.060 |
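A minimal sketch of that vanilla policy gradient (REINFORCE-style) update in PyTorch: every action's log-probability is weighted by the discounted return that followed it, so good and bad actions in a winning episode all get pushed up, which is exactly the credit-assignment issue described above. Shapes and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def update(states, actions, rewards):
    """states: float tensor [T, obs_dim]; actions: long tensor [T]; rewards: list of per-step rewards."""
    returns, g = [], 0.0
    for r in reversed(rewards):                       # discounted return from each step onward
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    logits = policy(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    loss = -(log_probs * returns).mean()              # push up actions in proportion to their return
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```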
There are improvements to that: advantage actor-critic methods, A2C, combining the best of value-based and 00:50:53.440 |
policy-based methods. There's an actor, which is policy-based, and that's the one that takes the actions 00:50:56.880 |
samples the actions from the policy network, and a critic, which is value-based, evaluating those actions 00:51:06.640 |
All right. So as opposed to the vanilla policy update, the first equation there, with the reward coming from the destination 00:51:13.300 |
the reward being whether you won the game or not, the critic evaluates every action along the way 00:51:28.480 |
So you're now able to learn about the environment, to evaluate your own actions, at every step 00:51:39.280 |
There's the asynchronous variant, A3C, from DeepMind, and the synchronous one from OpenAI, both variants of this advantage actor-critic framework 00:51:47.040 |
Both are highly parallelizable. The difference with the asynchronous version 00:51:57.200 |
is that you just throw these agents out to operate in the environment, and they're learning, rolling out the games and getting the reward 00:52:05.840 |
and they're updating the global network parameters asynchronously 00:52:12.420 |
And as a result, they're also constantly operating on outdated versions of that network 00:52:18.960 |
The OpenAI approach that fixes this is that there's a coordinator: there are these rounds where everybody 00:52:26.320 |
all the agents in parallel, rolls out the episode 00:52:30.240 |
but then the coordinator waits for everybody to finish in order to make the update to the global network and then send it back 00:52:38.620 |
to all the agents. And so that means every iteration starts with the same global parameters, and that has really nice properties 00:52:53.200 |
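A sketch of the actor-critic update at the heart of A2C: the critic estimates the value of each state, the advantage replaces the raw episode return in the policy gradient, and the critic is regressed toward the observed return. This is a schematic single-process version under assumed shapes; the parallel rollout machinery is omitted.

```python
import torch
import torch.nn as nn

obs_dim, n_actions = 4, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=7e-4)

def a2c_update(states, actions, returns):
    """states: [T, obs_dim]; actions: long [T]; returns: float [T] discounted (or bootstrapped) targets."""
    values = critic(states).squeeze(1)
    advantage = returns - values                               # how much better than expected
    log_probs = torch.distributions.Categorical(logits=actor(states)).log_prob(actions)
    actor_loss = -(log_probs * advantage.detach()).mean()      # policy gradient with the critic as baseline
    critic_loss = advantage.pow(2).mean()                      # regress V(s) toward the return
    loss = actor_loss + 0.5 * critic_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```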
Next, from Google DeepMind, the deep deterministic policy gradient, DDPG 00:52:57.520 |
combines the ideas of DQN with the ability to deal with continuous action spaces 00:53:15.040 |
Instead of having the actor operate in a stochastic nature, it picks a deterministic policy, the best action 00:53:25.760 |
But okay, with that, the problem, quite naturally 00:53:29.060 |
is that when the policy is deterministic, it's able to handle a continuous action space, but because it's deterministic it's never exploring 00:53:36.580 |
So the way we inject exploration into the system is by adding noise 00:53:40.960 |
either adding noise to the actions on the output, or adding noise to the parameters of the network 00:53:50.720 |
which creates perturbations in the actions, such that the final result is that you try different kinds of things, and the 00:53:57.120 |
scale of the noise, just like with the epsilon-greedy exploration for DQN 00:54:01.680 |
decreases as you learn more and more 00:54:10.880 |
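A sketch of the exploration side of DDPG: the actor outputs a deterministic continuous action, and noise with a decaying scale is added to keep exploring. I use simple Gaussian action noise here for brevity, whereas the original DDPG paper used Ornstein-Uhlenbeck noise; sizes and decay rates are illustrative assumptions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())

noise_scale, noise_decay, act_limit = 0.3, 0.999, 1.0

def select_action(state):
    """Deterministic policy output plus exploration noise, clipped to the valid action range."""
    global noise_scale
    with torch.no_grad():
        action = actor(state)
    action = action + noise_scale * torch.randn_like(action)   # inject exploration
    noise_scale *= noise_decay                                  # anneal noise as learning progresses
    return torch.clamp(action, -act_limit, act_limit)

print(select_action(torch.randn(1, obs_dim)))
```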
We'll do a lecture just on this; there's been a lot of exciting work here 00:54:19.980 |
on policy optimization, with PPO and TRPO 00:54:27.760 |
treating reinforcement learning as purely an optimization problem 00:54:42.320 |
Each step influences the rest of the optimization process, so you have to be very careful about the updates you take, in particular 00:54:54.020 |
because they can collapse your convergence, the training performance in general 00:55:02.100 |
There are the line search methods, which is what gradient descent or gradient ascent falls under 00:55:07.700 |
which is how we train deep neural networks: 00:55:14.020 |
pick the direction of the gradient and then pick the step size 00:55:19.620 |
The problem with that is that it can get you into trouble. Here there's a nice visualization of walking along a ridge 00:55:26.420 |
where it can result in you stepping off that ridge, again collapsing the training process 00:55:33.620 |
the performance. The trust region is the underlying idea here for these 00:55:39.380 |
policy optimization methods: first pick the step size, so constrain, in various kinds of ways, the magnitude 00:55:48.100 |
of the change to the weights that's applied, and then pick the direction 00:55:55.140 |
placing a much higher priority on not choosing bad updates that can throw you off the optimization trajectory 00:56:04.740 |
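One practical realization of that trust-region idea, used by PPO, is the clipped surrogate objective: the probability ratio between the new and old policy is clipped so a single update cannot move the policy too far. A minimal sketch follows; the advantages and log-probabilities are assumed to come from collected rollouts, which are not shown here.

```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO (to be maximized; returned as a loss to minimize)."""
    ratio = torch.exp(new_log_probs - old_log_probs)           # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()               # pessimistic (clipped) bound

# Example with dummy tensors:
loss = ppo_policy_loss(torch.randn(5), torch.randn(5), torch.randn(5))
print(loss)
```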
On the model-based methods, and we'll also talk about them on the robotics side: there are a lot of 00:56:11.120 |
approaches now where deep learning is starting to be used for 00:56:15.140 |
model-based methods when the model has to be learned 00:56:17.620 |
But of course, when the model doesn't have to be learned, when it's given, inherent to the game 00:56:22.100 |
when you know the model, like in Go and chess and so on, AlphaZero has really done incredible stuff 00:56:30.660 |
So what is the model here? The way that a lot of these games are approached: 00:56:36.340 |
you know, the game of Go is turn-based, one person goes and another person goes, and there's this game tree 00:56:42.260 |
At every point there's a set of actions that can be taken, and quickly, if you look at that game tree 00:56:46.900 |
it grows exponentially 00:56:50.020 |
So it becomes huge. The game of Go is the hugest of all in terms of the number of choices you have 00:56:58.340 |
and then, you know, it gets to checkers and then tic-tac-toe; the branching at every step 00:57:04.580 |
increases or decreases based on the game structure 00:57:08.180 |
And so the task for the neural network there is to learn the quality of the board: it's to learn which branches are 00:57:24.100 |
most useful to explore and likely to result in a highly successful state 00:57:28.500 |
So that's the choice of what's good to explore, what branch is good to go down 00:57:36.980 |
With AlphaGo, the first success that beat the world champion, it was pre-trained on expert games 00:57:50.580 |
The later versions have no pre-training on expert games, so no imitation learning; they learn purely through self-play, through playing themselves and suggesting 00:57:58.900 |
new board positions. Many of these systems use Monte Carlo tree search, and during this search 00:58:05.120 |
they balance exploitation and exploration: going deep on promising positions, based on the estimation of the neural network, or 00:58:11.120 |
with a flip of a coin, playing underplayed positions 00:58:15.380 |
And so you can think of this as an intuition of looking at a board and estimating how good that board is 00:58:24.320 |
and also estimating how likely that board is to lead to victory in the end 00:58:31.520 |
So: estimating just general quality and the probability of leading to victory 00:58:37.120 |
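In the published AlphaGo/AlphaZero work, that balance of going deep on promising moves versus trying underplayed ones is usually governed by a PUCT-style selection score; this is the standard published formulation rather than something spelled out in the lecture. The search expands the action maximizing

$$ a^{*} = \arg\max_{a}\left[\, Q(s,a) + c \, P(s,a)\, \frac{\sqrt{\sum_{b} N(s,b)}}{1 + N(s,a)} \,\right] $$

where Q is the average value found so far for that move, P is the network's prior probability for it, N counts visits, and c trades off exploration against exploitation.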
The next step forward is AlphaZero, using the same, similar architecture with MCTS 00:58:43.380 |
Monte Carlo tree search, but applying it to different games 00:58:50.560 |
and beating other state-of-the-art engines in Go, in shogi, and in chess 00:59:02.480 |
These model-based approaches are really extremely sample efficient, if you can construct such a model 00:59:08.800 |
and, in robotics, if you can learn such a model 00:59:19.660 |
These are engines which are far superior to humans already; Stockfish can destroy most humans 00:59:27.360 |
The ability, through learning, through estimating the quality of a board, to be able to defeat these engines is incredible, and 00:59:34.400 |
the exciting aspect here, versus engines that don't use neural networks, is that 00:59:44.320 |
it really has to do with how, based on the neural network, you explore certain positions 00:59:59.440 |
In chess, they seem to explore very few moves 01:00:03.120 |
They have a really good neural network for estimating which are the likely branches that would provide value to explore 01:00:14.080 |
Stockfish and so on are much more brute force in their search 01:00:19.620 |
And then AlphaZero is a step towards the grandmaster 01:00:23.600 |
because the number of branches that need to be explored is much, much fewer 01:00:26.480 |
A lot of the work is done in the representation formed by the neural network, which is super exciting 01:00:34.960 |
It's able to outperform Stockfish in chess, Elmo in shogi, and 01:00:43.600 |
the previous iterations, AlphaGo Zero and so on, in Go 01:00:53.440 |
The sobering truth is that in the majority of real-world applications 01:00:57.220 |
of agents that have to act in this world, that perceive the world and act in it 01:01:09.680 |
you use neural networks to perceive certain aspects of the world, but ultimately the acting does not involve much learning 01:01:19.520 |
That's true for most of the autonomous vehicle companies, or all of the autonomous vehicle companies, operating today 01:01:30.220 |
for industrial robotics, and for any of the humanoid robots that have to navigate this world under uncertain conditions 01:01:35.780 |
All the work from Boston Dynamics doesn't involve any machine learning, as far as we know 01:01:40.080 |
Now that's beginning to change, here with ANYmal, the recent development 01:01:55.280 |
where you're trying to learn more efficient movement, more robust movement, on top of the other controllers 01:02:01.940 |
So it's quite exciting, through RL, to be able to learn some of the control dynamics here; that's able to teach 01:02:09.040 |
this particular robot to get up from arbitrary positions 01:02:13.780 |
So there's less hard-coding in order to be able to deal with 01:02:19.740 |
unexpected initial conditions and unexpected perturbations 01:02:22.820 |
So it's exciting there in terms of learning the control dynamics, and also for some of the driving policy 01:02:29.280 |
making driving behavior decisions, changing lanes, turning and so on 01:02:38.960 |
Companies are starting to use some RL in terms of the driving policy, in order to, especially, predict the future 01:02:44.560 |
They're trying to anticipate, to do intent modeling, to predict where the pedestrians and the cars are going to be based on the environment 01:02:50.000 |
They're trying to unroll what's happened recently into the future, and beginning to 01:02:54.720 |
move beyond the sort of pure end-to-end, the NVIDIA end-to-end learning approach to the control decisions 01:03:02.480 |
and are actually moving to RL and making longer-term planning decisions 01:03:12.800 |
There's still the gap, the leap needed to go from simulation to the real world 01:03:18.080 |
Most of the work is done from the design of the environment and the design of the reward structure 01:03:22.480 |
And because most of that work now is in simulation, we need to either develop better algorithms for transfer learning 01:03:29.040 |
or close the distance between simulation and the real world 01:03:33.520 |
Also, we could think outside the box a little bit 01:03:38.240 |
I had a conversation with Pieter Abbeel recently, one of the leading researchers in deep RL 01:03:46.240 |
The idea is that we don't need to make simulation more realistic 01:03:52.420 |
What we could do is just create an infinite number of simulations 01:04:03.360 |
and then, naturally, the regularization aspect of having all those simulations 01:04:07.860 |
will make it so that our reality is just another sample from those simulations 01:04:12.900 |
And so maybe the solution isn't to create higher-fidelity simulation or to create transfer learning algorithms, but to create many varied simulations 01:04:26.700 |
so that the step towards creating an agent that works in the real world is a trivial one 01:04:33.680 |
And maybe that's exactly what whoever created the simulation we're living in, and the multiverse we're living in, did 01:04:45.200 |
The lecture videos, and we have several on RL, will all be made available on deeplearning.mit.edu 01:04:54.400 |
The link is there, and I really like the essay 01:04:58.800 |
from OpenAI on spinning up as a deep RL researcher. You know, if you're interested in getting into research in RL 01:05:05.360 |
what are the steps you need to take? From developing the mathematical background, probability, statistics, and multivariate calculus 01:05:13.300 |
to some of the basics, like we covered last week on deep learning, some of the basic ideas in RL 01:05:19.040 |
just terminology and so on, some basic concepts; then picking a framework, TensorFlow or PyTorch 01:05:29.040 |
All right, then implement the algorithms I mentioned today. Those are the core RL algorithms 01:05:36.080 |
It should only take about 200 to 300 lines of code; when you actually put them down on paper, they're quite simple 01:05:46.720 |
Read the papers about those algorithms that follow, looking not for the big 01:05:54.000 |
hand-waving performance claims but for the tricks that were used to train these algorithms 01:05:58.000 |
The tricks tell a lot of the story, and that's the useful part that you need to learn 01:06:03.280 |
And iterate fast on simple benchmark environments 01:06:07.200 |
So OpenAI Gym has provided a lot of easy-to-use environments that you can play with 01:06:12.480 |
You can train an agent in minutes or hours, as opposed to days and weeks 01:06:17.200 |
And so iterating fast is the best way to learn these algorithms and then on the research side 01:06:22.480 |
there are three ways to get a best paper award, right? 01:06:25.440 |
to publish and to contribute and have an impact in the research community 01:06:31.120 |
in RL. One is to improve an existing approach: given a particular benchmark, and there are a few 01:06:37.360 |
benchmark data sets and environments that are emerging, improve some aspect of the convergence or the performance 01:06:46.400 |
Two, focus on an unsolved task: there are certain games that just haven't been solved through the RL 01:06:52.560 |
formulation. Or three, you can come up with a totally new 01:06:55.600 |
problem that hasn't been addressed by RL before 01:06:59.200 |
So with that, I'd like to thank you very much. Tomorrow I hope to see you here for DeepTraffic. Thanks