back to index

MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL)


Chapters

0:00 Introduction
2:14 Types of learning
6:35 Reinforcement learning in humans
8:22 What can be learned from data?
12:15 Reinforcement learning framework
14:06 Challenge for RL in real-world applications
15:40 Components of an RL agent
17:42 Example: robot in a room
23:05 AI safety and unintended consequences
26:21 Examples of RL systems
29:52 Takeaways for real-world impact
31:25 3 types of RL: model-based, value-based, policy-based
35:28 Q-learning
38:40 Deep Q-Networks (DQN)
48:00 Policy Gradient (PG)
50:36 Advantage Actor-Critic (A2C & A3C)
52:52 Deep Deterministic Policy Gradient (DDPG)
54:12 Policy Optimization (TRPO and PPO)
56:03 AlphaZero
60:50 Deep RL in real-world applications
63:09 Closing the RL simulation gap
64:44 Next step in Deep RL

Whisper Transcript | Transcript Only Page

00:00:00.000 | Today I'd like to overview the exciting field of deep reinforcement learning
00:00:04.640 | Introduce, overview and provide you some of the basics
00:00:08.240 | I think it's one of the most exciting fields in artificial intelligence
00:00:14.960 | It's marrying the power and the ability of deep neural networks
00:00:20.320 | to represent and comprehend the world
00:00:23.120 | with the ability to act on that understanding
00:00:29.040 | on that representation
00:00:31.040 | Taken as a whole, that's really what the creation of intelligent beings is:
00:00:36.480 | Understand the world and act
00:00:39.360 | And the exciting breakthroughs that recently have happened
00:00:42.000 | Captivate our imagination about what's possible
00:00:45.440 | And that's why this is my favorite area of deep learning and artificial intelligence in general
00:00:50.400 | And I hope you feel the same
00:00:52.560 | So what is deep reinforcement learning?
00:00:55.120 | We've talked about deep learning which is taking samples of data
00:01:00.080 | Being able to in a supervised way
00:01:02.080 | compress, encode the representation of that data in a way that you can reason about it
00:01:07.680 | And we take that power and apply it to the world where sequential decisions are to be made
00:01:16.000 | So it's looking at problems and formulations of tasks
00:01:22.640 | Where an agent, an intelligent system has to make a sequence of decisions
00:01:28.240 | And the decisions that are made
00:01:31.200 | Have an effect on the world around the agent
00:01:36.880 | How do all of us,
00:01:38.880 | any intelligent being that is tasked with operating in the world, learn anything,
00:01:43.680 | especially when you know very little in the beginning?
00:01:47.280 | Trial and error is the fundamental process by which reinforcement learning agents learn
00:01:53.280 | And the deep part of deep reinforcement learning is neural networks
00:01:59.380 | It's using the frameworks of reinforcement learning
00:02:02.720 | Where the neural network is doing the representation
00:02:07.460 | Of the world based on which the actions are made
00:02:11.520 | And we have to take a step back
00:02:15.760 | When we look at the types of learning
00:02:17.760 | Sometimes the terminology itself can confuse us to the fundamentals
00:02:22.340 | There is supervised learning, there's semi-supervised learning, there's unsupervised learning, there's reinforcement learning
00:02:29.360 | And there's this feeling that supervised learning is really the only one
00:02:33.360 | Where you have to perform the manual annotation, where you have to do the large-scale supervision
00:02:38.500 | That's not the case
00:02:42.640 | Every type of machine learning is supervised learning
00:02:45.680 | It's supervised by a loss function or a function that tells you what's good
00:02:53.360 | And what's bad
00:02:55.680 | You know even looking at our own existence is how we humans figure out what's good and bad
00:03:00.240 | there's
00:03:02.080 | All kinds of sources direct and indirect by which our morals and ethics we figure out what's good and bad
00:03:08.320 | The difference between supervised and unsupervised and reinforcement learning is the source of that supervision
00:03:13.780 | What's implied when you say unsupervised?
00:03:16.560 | Is that the cost of human labor required to attain the supervision is low
00:03:22.720 | But it's never
00:03:25.440 | Turtles all the way down it's turtles and then there's a human at the bottom
00:03:30.960 | There at some point there needs to be human intervention
00:03:37.360 | Human
00:03:38.480 | Input to provide what's good and what's bad and this will arise in reinforcement learning as well
00:03:43.680 | we have to remember that because the challenges and the exciting opportunities of reinforcement learning lie in the fact of
00:03:50.000 | How do we get that supervision?
00:03:53.220 | In the most efficient way possible, but supervision nevertheless is required for any system that has an input and an output
00:04:01.840 | That's trying to learn like a neural network does to provide an output. That's good. It needs somebody to say what's good and what's bad
00:04:09.360 | For those of you curious about that, there have been a few books, a couple, written throughout the last few centuries, from Socrates to Nietzsche
00:04:16.720 | I recommend
00:04:17.840 | the latter especially
00:04:19.840 | So let's look at supervised learning and reinforcement learning
00:04:23.620 | I'd like to propose a way to think about the difference
00:04:28.160 | That is illustrative and useful when we start talking about the techniques
00:04:33.300 | So supervised learning is taking
00:04:35.760 | a bunch of examples of data
00:04:41.200 | Learning from those examples where ground truth provides you
00:04:44.640 | the compressed
00:04:47.440 | Semantic meaning of what's in that data and from those examples one by one whether it's sequences or single samples
00:04:56.720 | We learn how to then take future such samples and interpret them
00:05:02.240 | Reinforcement learning is teaching an agent through experience
00:05:09.700 | Not by showing a singular sample of a data set but by putting them out into the world
00:05:15.840 | The distinction there the essential element of reinforcement learning then for us
00:05:20.880 | Now we'll talk about a bunch of algorithms
00:05:24.640 | But the essential design step is to provide the world in which to experience
00:05:31.140 | The agent learns from the world
00:05:34.320 | From the world it gets the dynamics of that world, the physics of the world; from that world
00:05:41.040 | it gets the rewards, what's good and bad. And we as designers
00:05:44.660 | of that agent do not just have to do the algorithm. We have to design the world
00:05:53.040 | In which that agent is trying to solve a task
00:05:57.600 | The design of the world is the process of reinforcement learning the design of examples
00:06:03.520 | The annotation of examples is the world of supervised learning
00:06:06.500 | And the essential perhaps the most difficult element of reinforcement learning is the reward the good versus bad
00:06:15.840 | Here a baby starts walking across the room
00:06:21.280 | We want to define success as a baby
00:06:23.840 | walking across the room
00:06:26.400 | And reaching the destination that's success and failure is the inability to reach that destination
00:06:32.100 | simple
00:06:33.760 | and reinforcement learning in humans
00:06:35.760 | The way we learn, or
00:06:40.980 | appear to learn, from very few examples through trial and error
00:06:46.240 | is a mystery, a beautiful mystery full of open questions
00:06:49.280 | It could be from the huge amount of data 230 million years worth of bipedal data that we've been walking
00:06:55.760 | Mammals walking ability to walk or 500 million years the ability to see having eyes
00:07:02.240 | So that's the the hardware side somehow genetically encoded in us is the ability to comprehend this world extremely efficiently
00:07:10.260 | It could be through
00:07:12.640 | not the hardware not the 500 million years, but the
00:07:16.000 | the few
00:07:18.720 | minutes hours days months
00:07:20.880 | Maybe even years in the very beginning when we're born
00:07:24.720 | The ability to learn really quickly through observation to aggregate that information
00:07:29.940 | Filter all the junk that you don't need and be able to learn really quickly
00:07:34.480 | Through imitation learning through observation the way for walking that might mean observing others to walk
00:07:42.080 | The idea there is
00:07:43.760 | If there were no others
00:07:45.760 | around, we would never be able to learn the fundamentals of this walking, or at least not as efficiently
00:07:50.420 | It's through observation
00:07:53.200 | And then it could be the algorithm, totally not understood, that our brain uses to learn
00:08:01.520 | The backpropagation that's in artificial neural networks, that same kind of process is not understood in the brain
00:08:08.960 | That could be the key
00:08:11.520 | So I want you to think about that as we talk about
00:08:13.760 | the very trivial
00:08:16.160 | by comparison accomplishments in reinforcement learning, and how do we take the next steps?
00:08:20.880 | But it nevertheless is exciting to have machines that learn how to act in the world
00:08:30.080 | the process of learning for those who have
00:08:34.560 | fallen in love with artificial intelligence
00:08:38.000 | The process of learning is thought of as intelligence. It's the ability to know very little and through experience examples
00:08:45.620 | Interaction with the world in whatever medium whether it's data or simulation so on be able to form much richer and interesting
00:08:53.780 | Representations of that world be able to act in that world. That's that's the dream
00:08:58.080 | So let's look at this stack of what it means to be an agent in this world
00:09:03.280 | from top,
00:09:05.040 | the input, to the bottom, the output. There's an environment. We have to sense that environment
00:09:11.280 | We have just a few tools us humans have
00:09:13.360 | Several sensory systems on cars you can have lidar camera
00:09:20.080 | Stereo vision audio microphone networking gps imu sensor so on whatever robot you can think about
00:09:27.440 | There's a way to sense that world
00:09:29.760 | and you have this raw sensory data and then once you have the raw sensory data you're tasked with
00:09:34.720 | representing that data in such a way that you can make sense of it as opposed to all the
00:09:40.240 | the raw sensors in the eye, the cones and so on, taken in as just a giant stream of high bandwidth information
00:09:48.000 | We have to be able to form
00:09:50.240 | higher
00:09:52.400 | Abstractions of features based on which we can reason from edges to corners to faces
00:09:57.760 | And so on that's exactly what deep learning neural networks have stepped in to be able to
00:10:02.480 | In an automated fashion with as little human input as possible be able to form higher order representations of that information
00:10:10.100 | Then there's the learning aspect, building on top of the greater abstractions formed through representations
00:10:17.620 | Be able to accomplish something useful whether it's discriminative task a generative task and so on based on the representation
00:10:25.040 | Be able to make sense of the data be able to generate new data and so on
00:10:29.440 | From sequence to sequence to sequence to sample from sample to sequence and so on and so forth to actions as we'll talk about
00:10:37.040 | and then there is the
00:10:39.920 | ability to
00:10:42.780 | Aggregate all the information that's been received in the past to the useful
00:10:48.480 | information that's
00:10:51.440 | pertinent to the task at hand. It's the old thing:
00:10:55.040 | it looks like a duck, quacks like a duck, swims like a duck
00:10:58.240 | Three different data sets, and I'm sure there are state-of-the-art algorithms for the three: image classification,
00:11:04.420 | audio recognition
00:11:06.720 | video classification
00:11:08.560 | Activity recognition so on aggregating those three together
00:11:12.000 | Is still an open problem and that could be the last piece again
00:11:16.480 | I want you to think about as we think about reinforcement learning agents. How do we play?
00:11:20.880 | How do we transfer from the game of atari to the game of go to the game of dota to the game of a robot?
00:11:29.040 | Navigating an uncertain environment in the real world
00:11:32.720 | And once you have that once you sense the raw world once you have a representation of that world then
00:11:40.240 | We need to act
00:11:43.440 | Which is provide actions within the constraints of the world in such a way that we believe can get us towards success
00:11:51.680 | The promise, the excitement of deep learning is the part of the stack that converts raw data into meaningful representations
00:11:58.900 | The promise, the dream of deep reinforcement learning
00:12:02.560 | Is going beyond
00:12:05.040 | And building an agent that uses that representation
00:12:08.020 | And acts to achieve success in the world
00:12:11.120 | That's super exciting
00:12:13.920 | The framework and the formulation of reinforcement learning
00:12:18.800 | At its
00:12:20.800 | Simplest
00:12:22.240 | Is that there's an environment and there's an agent that acts in that environment
00:12:26.500 | the agent senses the environment by
00:12:29.600 | by some
00:12:31.580 | observation whether it's partial or
00:12:33.840 | complete observation of the environment
00:12:38.880 | It gives the environment an action it acts in that environment and through the action
00:12:43.920 | The environment changes in some way and then a new observation occurs
00:12:49.360 | And then also as you provide the action make the observations you receive a reward
00:12:53.940 | In most formulations of this framework
00:12:58.000 | This entire system has no memory
00:13:00.720 | That the
00:13:04.080 | The only thing you need to be concerned about is the state you came from the state you arrived in and the reward received
00:13:09.780 | The open question here is what can't be modeled in this kind of way. Can we model all of it?
00:13:18.000 | From human life to the game of go
00:13:20.000 | Can all of this be modeled in this way?
00:13:22.480 | And is this a good way to formulate the learning problem of robotic systems
00:13:29.520 | in the real world, in the simulated world? Those are the open questions
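For readers following along, here is a minimal sketch of the observation-action-reward loop described above. The `Environment` class and `random_policy` function are hypothetical stand-ins, not any specific library's API.

```python
# Minimal sketch of the agent-environment loop described above.
# `Environment` and `random_policy` are hypothetical stand-ins, not a real library API.
import random

class Environment:
    """Toy environment: the state is a step counter; the episode ends after 10 steps."""
    def reset(self):
        self.t = 0
        return self.t  # initial observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0  # reward depends on the action taken
        done = self.t >= 10                   # terminal state reached
        return self.t, reward, done           # new observation, reward, episode-over flag

def random_policy(observation):
    return random.choice([0, 1])              # pick an action given the observation

env = Environment()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random_policy(obs)               # agent acts on its observation
    obs, reward, done = env.step(action)      # environment changes, emits a reward
    total_reward += reward
print("episode return:", total_reward)
```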
00:13:33.220 | The environment could be fully observable
00:13:37.040 | Or partially observable like in poker
00:13:40.960 | It could be single agent or multi-agent atari versus driving like deep traffic
00:13:47.340 | deterministic or stochastic
00:13:49.340 | static versus dynamic
00:13:52.060 | Static as in chess, dynamic as in driving and most real-world applications; discrete versus continuous: discrete like games,
00:13:59.100 | chess, or continuous like cart-pole, balancing a pole on a cart
00:14:03.500 | The challenge for RL in real world applications
00:14:07.840 | Is that as a reminder
00:14:16.280 | Supervised learning is teaching by example learning by example
00:14:21.100 | teaching from our perspective
00:14:23.860 | Reinforcement learning is teaching by experience
00:14:26.300 | And the way we provide experience to reinforcement learning agents currently for the most part is through simulation
00:14:33.100 | Or through highly constrained real world scenarios
00:14:37.260 | So the challenge is in the fact
00:14:42.600 | most of the successes
00:14:44.920 | is with
00:14:46.920 | Systems environments that are simulatable
00:14:49.180 | So there's two ways to then close this gap
00:14:54.280 | two directions of research and work one is to
00:14:58.440 | improve the
00:15:00.740 | algorithms improve the ability of the algorithms to then
00:15:03.640 | To form policies that are transferable across all kinds of domains including the real world including especially the real world
00:15:10.920 | So train in simulation, transfer to the real world
00:15:16.920 | The other is to improve the simulation in such a way that the fidelity of the simulation increases to the point where the gap
00:15:24.040 | between reality and simulation
00:15:28.760 | is minimal, to a degree that things learned in simulation are directly, trivially transferable to the real world
00:15:37.000 | Okay, the major components of an RL agent
00:15:44.040 | an agent
00:15:45.960 | Operates based on a strategy
00:15:47.960 | called a policy
00:15:50.440 | It sees the world it makes a decision. That's a policy makes a decision how to act sees the reward
00:15:56.520 | Sees a new state acts sees a reward sees new states and acts and this repeats
00:16:03.080 | forever until a terminal state
00:16:06.200 | the value function
00:16:11.300 | is an estimate of how good a state is
00:16:13.620 | or how good
00:16:16.260 | A state action pair is meaning taking an action
00:16:19.540 | In a particular state. How good is that ability to evaluate that?
00:16:24.660 | and then the model
00:16:27.540 | Different from the environment from the perspective of the agent
00:16:30.340 | So the environment has a model based on which it operates
00:16:33.640 | And then the agent has a representation best understanding of that model
00:16:39.300 | So the purpose for an RL agent
00:16:41.620 | In this
00:16:44.340 | Simply formulated framework is to maximize reward
00:16:47.380 | the way that
00:16:49.780 | The reward mathematically and practically is talked about
00:16:53.140 | Is with a discounted framework, so we discount further and further future reward
00:17:00.740 | So the reward that's farther into the future means less to us in terms of maximization than reward
00:17:07.460 | That's in the near term. And so why do we discount it?
00:17:11.220 | So first, a lot of it is a math trick, to be able to prove certain aspects, analyze certain aspects of convergence
00:17:17.160 | And in general on a more philosophical sense
00:17:20.820 | Because environments either are, or can be thought of as, stochastic, random, it's very difficult.
00:17:27.720 | There's a degree of uncertainty, which makes it difficult to really estimate
00:17:35.940 | the reward there'll be in the future because of the ripple effect of the uncertainty
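As a small illustration of the discounted formulation, the return is the sum of rewards weighted by increasing powers of the discount factor gamma; the gamma value and reward sequence below are chosen for illustration only.

```python
# Sketch: discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
# gamma and the reward sequence are illustrative values, not from the lecture.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):   # accumulate from the last reward backwards
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81: far-future reward counts less
```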
00:17:40.280 | Let's look at an example a simple one
00:17:43.860 | Helps us understand
00:17:46.640 | policies, rewards, actions. There's a robot in the room, there's
00:17:50.980 | 12 cells in which it can step. It starts in the bottom left. It tries to get rewards at the top right
00:17:58.740 | There's a plus one, a really good thing, at the top right. It wants to get there by walking around
00:18:05.300 | There's a negative one, which is really bad. It wants to avoid that square and the choice of actions is up down left right four actions
00:18:13.140 | So you could think of there
00:18:16.260 | being a negative reward of 0.04 for each step
00:18:20.260 | So there's a cost to each step and there's a stochastic nature to this world potentially we'll talk about both deterministic stochastic
00:18:26.920 | So in the stochastic case when you choose the action up
00:18:30.660 | with an 80% probability
00:18:34.580 | With an 80% chance you move up but
00:18:36.660 | With a 10% chance you move left and with another 10% you move right
00:18:41.300 | So that's the stochastic nature: even though you try to go up you might end up in a block to the left or to the right
00:18:46.340 | so for a deterministic world
00:18:51.280 | optimal policy here
00:18:53.280 | given that we always start in the bottom left, is really the shortest path.
00:18:56.580 | Because there's no stochasticity,
00:19:00.980 | you're never going to screw up and just fall into the negative one hole, so you just compute the shortest path and
00:19:07.140 | walk along that shortest path. Why shortest path? Because every single step hurts. There's a negative reward to it,
00:19:13.860 | 0.04, so the shortest path is the thing that minimizes the accumulated penalty, the shortest path to the
00:19:20.340 | to the plus one block
00:19:22.820 | Okay, let's look at a stochastic world. Like I mentioned the 80% up and then split 10% to the left and right
00:19:29.860 | How does the policy change? Well, first of all,
00:19:34.980 | we need to have a plan for every single block in the area, because you might end up there due to the stochasticity of the world
00:19:40.980 | Okay, the basic addition there is that we're trying to
00:19:45.940 | avoid up
00:19:49.300 | the closer you get to the negative one hole. So just try to avoid up, because
00:19:56.100 | the stochastic nature of up means you might fall into the hole with a 10% chance
00:20:00.340 | And given the 0.04 step reward you're willing to take the long way home
00:20:05.620 | In some cases in order to avoid that possibility the negative one possibility
00:20:10.760 | Now, let's look at a reward for each step if it decreases to negative two. It really hurts to take every step
00:20:17.380 | Then again, we go to the shortest path despite the fact that uh, there's a stochastic nature
00:20:24.180 | In fact, you don't really care that you step into the negative one hole because every step really hurts. You just want to get home
00:20:30.260 | And then you can play with this reward structure right yes
00:20:35.380 | instead of uh negative two or negative 0.04 you can look at
00:20:41.140 | Negative 0.1 and you can see immediately
00:20:44.600 | that the structure of the policy
00:20:47.620 | It changes
00:20:50.180 | So with a higher
00:20:52.420 | Value the higher negative reward for each step
00:20:54.980 | immediately
00:20:56.980 | the urgency of the agent increases
00:20:59.160 | Versus the less urgency the lower the negative reward
00:21:03.780 | And when the reward flips
00:21:07.380 | So it's positive
00:21:10.980 | then every step is a positive, so the entire system, which is actually
00:21:17.140 | quite common in reinforcement learning, the entire system is full of positive rewards
00:21:21.940 | And so then the optimal policy becomes the longest path
00:21:25.220 | It's like grad school: taking as long as possible, never reaching the destination
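A sketch of value iteration on this 4x3 "robot in a room" world. The layout, rewards, and 80/10/10 transition noise follow the example above; the exact positions of the blocked cell and of the -1 square are assumed from the standard version of this gridworld.

```python
# Sketch: value iteration on the 4x3 gridworld described above.
# The blocked cell at (1, 1) and the -1 square just below the +1 are assumptions
# taken from the standard version of this example.
import itertools

ROWS, COLS = 3, 4
WALL = {(1, 1)}
TERMINAL = {(0, 3): +1.0, (1, 3): -1.0}   # +1 at top right, -1 just below it
STEP_REWARD = -0.04
GAMMA = 1.0
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERP = {"up": ["left", "right"], "down": ["left", "right"],
        "left": ["up", "down"], "right": ["up", "down"]}

def move(state, action):
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < ROWS and 0 <= nc < COLS) or (nr, nc) in WALL:
        return state                      # bump into a wall or boundary: stay in place
    return (nr, nc)

def transitions(state, action):
    yield 0.8, move(state, action)        # 80% intended direction
    for perp in PERP[action]:
        yield 0.1, move(state, perp)      # 10% each perpendicular slip

states = [s for s in itertools.product(range(ROWS), range(COLS)) if s not in WALL]
V = {s: 0.0 for s in states}
for _ in range(100):                      # repeated value-iteration sweeps
    V_new = {}
    for s in states:
        if s in TERMINAL:
            V_new[s] = TERMINAL[s]
            continue
        V_new[s] = max(sum(p * (STEP_REWARD + GAMMA * V[s2])
                           for p, s2 in transitions(s, a))
                       for a in ACTIONS)
    V = V_new

policy = {s: max(ACTIONS, key=lambda a: sum(p * V[s2] for p, s2 in transitions(s, a)))
          for s in states if s not in TERMINAL}
print(policy[(2, 0)])                     # best first move from the bottom-left start
```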
00:21:36.900 | What lessons do we draw from robot in the room two things?
00:21:40.660 | The environment model the dynamics is just there in the trivial example the stochastic nature the difference between 80 percent 100 percent and
00:21:49.140 | 50 percent
00:21:51.060 | The model of the world the environment has a big impact on what the optimal policy is
00:21:55.780 | And the reward structure, most importantly, the thing we can often control
00:22:06.900 | in our constructs of the task we try to solve in reinforcement learning:
00:22:11.140 | what is good and what is bad, how bad is it and how good is it. The reward structure has a big
00:22:17.300 | impact, and that makes a complete change,
00:22:21.140 | like, uh, Robert Frost said, a complete change on the
00:22:25.540 | policy, the choices the agent makes. So when you formulate a reinforcement learning framework
00:22:33.380 | As researchers as students what you often do is you design the environment you design the world in which the system learns
00:22:41.140 | Even when your ultimate goal is the physical robot you just still there's a lot of work still done in simulation
00:22:48.100 | So you design the world the parameters of that world and you also design the reward structure and it can have
00:22:53.780 | transformative results; slight variations in those parameters can have huge results,
00:23:01.060 | huge differences in the policy that's arrived at. And of course
00:23:05.060 | the example I've shown before, that I really love, is the
00:23:12.580 | impact the changing reward structure might have: unintended consequences
00:23:19.940 | those
00:23:21.140 | Consequences for real world system can have obviously
00:23:24.200 | highly detrimental
00:23:27.680 | Costs that are more than just a failed game of atari
00:23:31.220 | So here's a human performing the task, playing the game of Coast Runners, racing around the track
00:23:37.380 | and so it's
00:23:39.620 | uh, when you finish first
00:23:41.860 | And you finish fast you get a lot of points and so it's natural to then okay
00:23:47.780 | Let's do an rl agent and then optimize this for those points
00:23:51.700 | And what you find out in the game is that you also get points by picking up the little green turbo things
00:23:59.380 | And what the agent figures out is that you can actually get a lot more points
00:24:06.500 | By simply focusing on the green turbos
00:24:09.380 | focusing on the green turbos
00:24:12.020 | Just rotating over and over slamming into the wall fire and everything just picking it up, especially because
00:24:17.540 | the ability to pick up those turbos
00:24:21.540 | can avoid the terminal state at the end of finishing the race. In fact, finishing the race means you stop collecting positive reward
00:24:28.580 | So you never want to finish, just collect the turbos
00:24:30.900 | And though that's a trivial example
00:24:34.680 | It's not actually easy to find such examples
00:24:38.360 | But they're out there of unintended consequences that can have highly negative detrimental effects when put in the real world
00:24:45.940 | We'll talk about a little bit of robotics
00:24:48.740 | When you put robots four-wheeled ones like autonomous vehicles into the real world
00:24:53.780 | And you have objective functions that have to navigate difficult intersections full of pedestrians
00:24:59.080 | So you have to form intent models of those pedestrians here. You see cars asserting themselves through dense intersections
00:25:06.280 | taking risks and
00:25:08.820 | Within those risks that are taken by us humans when we drive vehicles
00:25:14.260 | we have to then encode that ability to take subtle risk into
00:25:21.780 | AI-based control algorithms perception
00:25:23.960 | Then you have to think about at the end of the day. There's an objective function
00:25:29.540 | and if that objective function does not anticipate the green turbos that are to be collected, it can
00:25:36.020 | then result in some unintended consequences that could have
00:25:43.300 | negative effects, especially in situations that involve human life
00:25:47.300 | That's the field of AI safety, and some of the folks we talk about, DeepMind and OpenAI,
00:25:52.980 | that are doing incredible work in RL, also have groups that are working on AI safety, for a very good reason
00:26:00.180 | this is a problem that
00:26:02.900 | I believe that artificial intelligence will define some of the most impactful positive things
00:26:09.060 | In the 21st century, but I also believe we are nowhere close
00:26:13.700 | To solving some of the fundamental problems of AI safety that we also need to address as we develop those algorithms
00:26:20.440 | So okay examples of reinforcement learning systems
00:26:23.860 | All of it has to do with formulation of rewards formulation of state and actions. You have the traditional
00:26:31.320 | The often used benchmark of a cart
00:26:37.600 | Balancing a pole continuous. So the action is the horizontal force of the cart
00:26:42.000 | The goal is to balance the pole
00:26:44.160 | so it stays up on the moving cart, and the reward is one at each time step if the pole is upright
00:26:49.920 | And the state measured by the agent is the pole angle, the angular speed,
00:26:55.920 | and of course self-sensing of the cart position and the horizontal velocity
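For reference, a minimal interaction loop with this cart-pole benchmark, assuming the Gymnasium package is installed; a random policy stands in for a learned one.

```python
# Sketch: the cart-pole benchmark described above, using the Gymnasium package
# (assumes `pip install gymnasium`; a random policy stands in for a learned one).
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
# obs = [cart position, cart velocity, pole angle, pole angular velocity]
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()            # push direction: 0 = left, 1 = right
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                        # +1 per time step the pole stays upright
    done = terminated or truncated
print("episode return:", total_reward)
env.close()
```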
00:27:01.460 | Another example here didn't want to include the video because it's really disturbing
00:27:07.360 | but I do want to include this slide because it's really important to think about is
00:27:11.440 | by sensing the the raw pixels learning and teaching an agent to
00:27:16.560 | Play a game of doom
00:27:19.520 | So the goal there is to eliminate all opponents
00:27:22.260 | The state is the raw game pixels the actions up down shoot reload and so on
00:27:31.680 | The positive reward is
00:27:34.000 | When an opponent is eliminated and negative when the agent is eliminated simple
00:27:38.800 | I added it here because again on the topic of AI safety
00:27:44.420 | We have to think about objective functions and how that translates into the world of not just autonomous vehicles but
00:27:56.720 | Things that even more directly have harm like autonomous weapon systems. We have a lecture on this in the AGI series
00:28:05.280 | The on the robotics platform the manipulate object manipulation grasping objects. There's a few benchmarks. There's a few interesting applications
00:28:13.140 | learning the problem of
00:28:16.080 | grabbing objects moving objects
00:28:18.320 | Manipulating objects, rotating and so on, especially when those objects have complicated shapes
00:28:25.680 | And so the goal is to pick up an object in the purely in the grasping object challenge
00:28:31.200 | The state is the visual information. So it's vision-based, the raw pixels of the objects
00:28:36.800 | The action is to move the arm grasp the object pick it up
00:28:40.160 | And obviously it's positive when the pickup is successful
00:28:44.340 | The reason i'm personally excited
00:28:47.120 | by this
00:28:48.960 | is because it will finally allow us to solve the problem of the claw, which has been
00:28:55.040 | Torturing me for many years
00:28:58.880 | I don't know. That's not at all why i'm excited but okay
00:29:02.080 | And then we have to think about as we get greater and greater degree of application in the real world with the robotics
00:29:08.180 | Like cars
00:29:11.520 | The main focus of my passion in terms of robotics is how do we encode some of the things that us humans encode?
00:29:17.920 | How do we you know?
00:29:19.840 | We have to think about our own objective function our own reward structure our own model of the environment about which we perceive and reason
00:29:27.200 | About in order to then encode machines that are doing the same and I believe autonomous driving is in that category
00:29:33.040 | We have to ask questions of ethics. We have to ask questions
00:29:38.080 | of risk value of human life value of efficiency money and so on all these are fundamental questions that an autonomous vehicle
00:29:45.600 | Unfortunately has to solve before it becomes fully autonomous
00:29:49.220 | So here are the key takeaways of
00:29:54.960 | the real world impact of reinforcement learning agents
00:29:57.680 | On the deep learning side
00:30:01.920 | Okay, these neural networks that form higher representation
00:30:04.400 | The fun part is the algorithms all the different architectures the different encoder decoder structures
00:30:10.020 | all the attention self-attention
00:30:13.140 | recurrence LSTMs GRUs all the fun architectures and the data and
00:30:19.680 | the ability to leverage different data sets in order to
00:30:25.180 | discriminate better than, uh,
00:30:27.180 | perform discriminative tasks better than, you know,
00:30:30.640 | MIT does better than Stanford, that kind of thing. That's the fun part
00:30:34.800 | The hard part is asking good questions and collecting huge amounts of data that's representative of the task
00:30:41.920 | That's for real world impact not CVPR publication real world impact
00:30:46.320 | A huge amount of data. On the deep reinforcement learning side, the key challenge:
00:30:53.040 | The fun part again is the algorithms. How do we learn from data some of the stuff i'll talk about today?
00:30:57.520 | The hard part is defining the environment, defining the action space and the reward structure
00:31:03.600 | As I mentioned, this is the big challenge, and the hardest part is how to crack the gap between simulation and the real world
00:31:10.640 | the leaping lizard
00:31:12.960 | That's the hardest part. We don't even know
00:31:14.960 | How to solve that transfer learning problem yet for the real world impact
00:31:18.880 | The three types of reinforcement learning
00:31:22.240 | There's countless algorithms and there's a lot of ways to taxonomize them, but at the highest level
00:31:30.960 | There's model-based and there's model-free
00:31:33.760 | model-based algorithms
00:31:37.020 | Learn the model of the world
00:31:39.280 | So as you interact with the world
00:31:41.280 | You construct your estimate of how you believe the dynamics of that world operates
00:31:47.940 | The nice thing about
00:31:51.680 | Doing that is once you have a model or an estimate of a model you're able to
00:31:56.480 | Anticipate you're able to plan into the future. You're able to
00:32:01.200 | use the model to
00:32:04.320 | In a branching way predict how your actions will change the world so you can plan far into the future
00:32:10.240 | This is the mechanism by which you you can you can do
00:32:13.360 | chess
00:32:15.520 | Uh in the simplest form because in chess, you don't even need to learn the model
00:32:18.880 | The model is learned is given to you chess go and so on
00:32:21.520 | The most important way in which they're different I think is the sample efficiency
00:32:26.500 | Is how many examples of data are needed to be able to successfully operate in the world?
00:32:32.000 | And so model-based methods because they're constructing a model if they can
00:32:36.480 | Are extremely sample efficient
00:32:39.360 | Because once you have a model you can do all kinds of reasoning that doesn't require
00:32:44.100 | experiencing every possibility of that model you can
00:32:48.720 | Unroll the model to to see how the world changes based on your actions
00:32:53.600 | Value-based methods are ones that look to estimate
00:32:58.880 | the quality of states, or the quality of taking a certain action in a certain state
00:33:06.320 | They're called off policy
00:33:08.320 | Versus the last category that's on policy. What does it mean to be off policy? It means that
00:33:16.080 | Value-based agents constantly update how good taking an action in a state is
00:33:22.800 | and they have this
00:33:26.000 | model of that goodness of
00:33:28.240 | Taking action in a state and they use that to pick the optimal action
00:33:32.000 | They don't directly learn a policy a strategy of how to act they learn how good it is
00:33:39.600 | to be in a state and use that goodness information to then
00:33:44.960 | pick the best one
00:33:46.960 | And then every once in a while flip a coin in order to explore
00:33:49.920 | And then policy-based methods are ones that directly learn a policy function
00:33:56.320 | so they take
00:33:58.640 | as input
00:34:00.640 | the world,
00:34:02.140 | a representation of that world via neural networks, and as output produce
00:34:04.720 | an action,
00:34:07.360 | where the action is stochastic
00:34:09.360 | So, okay, that's the range of model-based value-based and policy-based
00:34:14.820 | Here's an image from openai that I really like I encourage you to
00:34:19.700 | As we further explore here, to look up Spinning Up in Deep RL from OpenAI
00:34:26.120 | Here's an image that taxonomizes in the way that I described some of the recent developments
00:34:30.920 | in rl
00:34:32.660 | so at the very
00:34:34.420 | top the distinction between model free rl and
00:34:37.060 | model-based rl
00:34:40.180 | In model free rl, which is what we'll focus on today. There is a distinction between policy optimization
00:34:46.680 | So on-policy methods, and Q-learning, which is off-policy. Policy optimization methods directly optimize the policy,
00:34:56.340 | Directly learn the policy in some way
00:34:58.900 | and then
00:35:00.500 | Q-learning, off-policy methods, learn, like I mentioned, the value of taking a certain action in a state, and from that
00:35:07.540 | learned
00:35:10.260 | that learned q value be able to
00:35:12.980 | Choose how to act in the world
00:35:16.020 | So let's look at a few sample representative
00:35:19.880 | approaches in this space
00:35:23.120 | Let's start with the one
00:35:27.940 | Really was one of the first great breakthroughs
00:35:30.600 | from Google DeepMind on the deep RL side in solving Atari games: DQN,
00:35:35.320 | deep Q-networks
00:35:40.500 | And let's take a step back and think about what q learning is
00:35:43.380 | Q learning looks at the state action value function
00:35:50.260 | That estimates based on a particular policy or based on an optimal policy. How good is it to take an action?
00:35:56.900 | in this state
00:36:01.360 | estimated
00:36:02.720 | Reward if I take an action in this state and continue operating under an optimal policy
00:36:09.300 | It gives you directly a way to say
00:36:11.620 | Amongst all the actions I have which action should I take to maximize the reward?
00:36:16.660 | Now in the beginning you know nothing, you don't have this value estimation
00:36:21.800 | You don't have this q function
00:36:24.660 | So you have to learn it, and you learn it with the Bellman equation by updating it
00:36:28.980 | You take your current estimate and update it with the reward you see
00:36:32.180 | Received after you take an action
00:36:37.700 | It's off policy and model free. You don't have to have any estimate or knowledge of the world
00:36:43.380 | You don't have to have any policy whatsoever. All you're doing is
00:36:47.460 | Roaming about the world collecting data: when you took a certain action, here's the reward you received, and you're updating
00:36:53.800 | gradually this table
00:36:56.800 | Where
00:36:59.620 | the table has state states on the y-axis and
00:37:03.860 | actions on
00:37:07.140 | the x-axis
00:37:09.140 | and the key part there is
00:37:11.540 | Because you always have an estimate of
00:37:14.820 | the value of taking each action, you can always take the optimal one
00:37:20.740 | But because you know very little in the beginning, that optimal,
00:37:25.140 | you have no way of knowing whether it's good or not
00:37:28.020 | So there's some degree of exploration the fundamental aspect of value-based methods or any RL methods
00:37:34.180 | Like I said, it's trial and error is exploration
00:37:36.920 | So for value-based methods like Q learning the way that's done is with a flip of a coin epsilon greedy
00:37:44.340 | With a flip of a coin you can choose to just take a random action
00:37:49.140 | and you
00:37:52.020 | Slowly decrease epsilon to zero as your agent learns more and more and more
00:37:57.940 | So in the beginning you explore a lot, an epsilon of one; in the end, an epsilon of zero, when you're just acting greedily based on
00:38:05.540 | Your understanding of the world as represented by the Q value function
00:38:09.300 | For non-neural-network approaches, this is simply a table, the Q,
00:38:14.420 | this Q function, is a table. Like I said, on the y-axis, states,
00:38:19.300 | on the x-axis, actions, and in each cell you have a reward,
00:38:26.500 | a discounted reward you estimate to be received there
00:38:29.060 | And as you walk around, with this Bellman equation you can update that table
00:38:34.740 | It's a table nevertheless number of states times number of actions
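A compact sketch of the tabular version just described: a Q-table updated with the Bellman update and epsilon-greedy exploration. The environment interface follows the earlier loop sketch, and alpha, gamma, and the epsilon schedule are illustrative hyperparameters.

```python
# Sketch: tabular Q-learning with epsilon-greedy exploration, as described above.
# `env` is assumed to expose reset()/step() as in the earlier loop sketch;
# alpha, gamma, and the epsilon schedule are illustrative choices.
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.99):
    Q = defaultdict(float)                     # Q[(state, action)], implicitly 0 at the start
    epsilon = 1.0                              # explore a lot in the beginning
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:      # flip of a coin: take a random action
                action = random.choice(actions)
            else:                              # otherwise act greedily on current estimates
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Bellman update: move the estimate toward reward + discounted best next value
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
        epsilon = max(0.05, epsilon * 0.995)   # slowly decay exploration toward greedy
    return Q
```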
00:38:38.580 | Now if you look at any practical real-world problem
00:38:41.460 | And an arcade game with raw sensory input is a very crude first step towards the real world so raw sensory information
00:38:52.660 | This kind of value iteration and updating a table is impractical
00:38:57.160 | Because here, for a game of Breakout, if you look at four consecutive frames of a game of Breakout,
00:39:03.080 | the size of the raw sensory input is 84 by 84 pixels
00:39:10.500 | grayscale
00:39:12.660 | Every pixel has 256 values
00:39:14.980 | that's 256 to the power of
00:39:21.700 | Whatever 84 times 84 times 4 is
00:39:24.180 | Whatever it is
00:39:27.140 | It's significantly larger than the number of atoms in the universe
00:39:29.720 | So the size of this Q table if we use the traditional approach is intractable
00:39:35.640 | Neural networks to the rescue
00:39:41.700 | Deep RL is RL plus neural networks, where the neural network is tasked, in value-based
00:39:49.860 | methods, with taking this Q table and learning a compressed representation of it
00:39:53.860 | Learning an approximator for the function
00:39:57.540 | from state and action to the value
00:40:00.660 | That's what we previously talked about, the powerful ability of neural networks to form
00:40:07.700 | representations from
00:40:10.080 | extremely high dimensional complex raw sensory information
00:40:14.740 | So it's simple the framework remains for the most part the same in reinforcement learning. It's just that this Q function
00:40:21.300 | For value-based methods becomes a neural network and becomes an approximator
00:40:27.240 | where the hope is as you navigate the world and you pick up new knowledge through
00:40:32.900 | The back propagating the gradient and the loss function that you're able to form a good representation of the optimal Q function
00:40:42.340 | So use neural networks with neural networks are good at which is function approximators
00:40:45.960 | And that's DQN. The deep Q-network was used to achieve the initial,
00:40:51.780 | incredibly nice results on the arcade games, where the input is the raw sensory pixels, with a few convolutional layers, fully connected layers,
00:41:00.740 | and the output is a set of actions,
00:41:03.540 | you know,
00:41:06.640 | a value for each action, and then you choose the best action
00:41:10.500 | And so this simple agent with a neural network that estimates that Q function
00:41:14.260 | Very simple network is able to achieve
00:41:18.100 | A superhuman performance on many of these arcade games that excited the world because it's taking raw sensory information
00:41:25.400 | with a pretty simple network
00:41:28.020 | That doesn't in the beginning understand any of the physics of the world any of the dynamics of the environment and through that intractable
00:41:34.920 | space, the intractable
00:41:39.300 | state space, is able to learn how to actually do pretty well
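For concreteness, here is a sketch in PyTorch of the kind of convolutional Q-network used over the 84x84x4 stacked Atari frames. Layer sizes follow the widely published DQN architecture and are given as a reference point, not as the exact network used in the course demos.

```python
# Sketch (PyTorch): a convolutional Q-network over 84x84x4 stacked grayscale frames.
# Layer sizes follow the widely published DQN architecture; they are illustrative here.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9   -> 7x7
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),          # one Q-value per action, not probabilities
        )

    def forward(self, frames):                    # frames: (batch, 4, 84, 84), scaled to [0, 1]
        return self.head(self.features(frames))

q_net = QNetwork(num_actions=4)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)     # torch.Size([1, 4])
```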
00:41:42.900 | The loss function for DQN has two
00:41:48.980 | Q functions
00:41:51.860 | One is the expected
00:41:54.200 | The predicted
00:41:57.940 | Q value of taking an action in a particular state
00:42:00.580 | and the other is
00:42:04.420 | Target against which the loss function is calculated. Which is what is the
00:42:09.380 | value that you got once you actually
00:42:12.660 | Taken that action
00:42:15.780 | and once you've taken that action the way you calculate the value is by looking at the next step and choosing the
00:42:20.900 | Maximum choosing if you take the best action in the next state
00:42:25.620 | What is going to be the Q function? So there's two estimators going on in terms of neural networks
00:42:31.620 | There's two forward passes here. There's two Q's in this equation
00:42:34.900 | so in traditional DQN, that's just
00:42:38.660 | That's done by a single neural network
00:42:41.780 | with a few tricks; in double DQN, that's done by two neural networks
00:42:46.200 | And I mention tricks because with this, and with most of RL, tricks tell a lot of the story
00:42:55.940 | A lot of what makes systems work is the details
00:43:00.100 | in games and robotic systems in these cases
00:43:04.180 | The two biggest tricks for DQN that will reappear in a lot of value-based methods is experience replay
00:43:12.180 | So think of an agent that plays through these games
00:43:16.580 | as also collecting memories
00:43:19.220 | You collect this
00:43:21.700 | Bank of memories that can then be replayed
00:43:25.380 | The power of that one of the central elements of what makes value-based methods attractive
00:43:30.980 | is that
00:43:33.300 | Because you're not directly estimating the policy but are learning the quality of taking an action in a particular state
00:43:41.380 | You're able to then jump around through your memory and play
00:43:45.460 | different aspects of that memory
00:43:48.100 | So learn, train the network through the historical data and then the other trick
00:43:54.980 | is simple. Like I said,
00:43:58.100 | the loss function has two Qs
00:44:00.740 | So it's a dragon chasing its own tail
00:44:05.140 | It's easy for the loss function to become unstable. So the training does not converge
00:44:10.580 | So the trick of fixing a target network is taking one of the Qs
00:44:15.060 | and only updating it every x steps, every thousand steps and so on; taking the same kind of network and just fixing it
00:44:22.180 | So for the target network that defines the loss function, just keeping it fixed and only updating it periodically
00:44:27.800 | So you're chasing a fixed target with a loss function as opposed to a dynamic one
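A sketch of the loss computation with the two tricks just described: a minibatch sampled from replay memory and a target computed with a separate, periodically copied target network. The batch layout and gamma are illustrative assumptions.

```python
# Sketch (PyTorch): the DQN loss with experience replay and a fixed target network.
# `q_net` and `target_net` are two copies of the Q-network (e.g. the QNetwork sketch above);
# the batch layout and gamma are illustrative assumptions.
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch   # tensors sampled from replay memory
    # Q(s, a) predicted by the online network for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                   # the target side is held fixed
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * max_next_q * (1.0 - dones)
    return F.smooth_l1_loss(q_sa, target)                   # Huber loss, commonly used for stability

# Every N gradient steps, refresh the fixed target:
#   target_net.load_state_dict(q_net.state_dict())
```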
00:44:33.780 | So you can solve a lot of the Atari games
00:44:37.300 | With minimal effort come up with some creative solutions here
00:44:41.220 | Breakout here: after 10 minutes of training on the left, after two hours of training on the right
00:44:46.900 | It's coming up with some creative solutions again. It's pretty cool because this is raw pixels, right? We're now like
00:44:53.780 | There's been a few years since this
00:44:56.740 | Breakthrough
00:44:58.740 | So we kind of take it for granted, but I am still, for the most part,
00:45:03.140 | Captivated by just how beautiful it is that from raw sensory information
00:45:08.360 | neural networks are able to
00:45:11.460 | learn
00:45:12.340 | to act in a way that actually supersedes humans in terms of creativity in terms of
00:45:16.420 | In terms of actual raw performance. It's really exciting and games of simple form is the cleanest way to demonstrate that
00:45:26.660 | The same kind of DQN network is able to achieve superhuman performance on a bunch of different games
00:45:31.700 | There are improvements to this, like dueling DQN. Again,
00:45:36.100 | The Q function can be decomposed which is useful into the value estimate
00:45:41.140 | Of being in that state and what's called
00:45:44.020 | And in future slides will be called advantage
00:45:47.000 | So the advantage of taking action in that state the nice thing of the advantage
00:45:51.960 | as a measure is that
00:45:54.900 | It's a measure of the action quality relative to the average
00:46:00.020 | action that could be taken there. So what's very useful about advantage versus sort of raw reward
00:46:06.900 | is that if all the actions you have to take are pretty good,
00:46:10.100 | you want to know, well, how much better this one is
00:46:12.980 | in terms of optimization
00:46:15.620 | That's a better measure for choosing actions
00:46:18.740 | in a value-based sense
00:46:21.540 | So when you have these two estimates you have these two streams for neural network and a dueling DQN
00:46:28.000 | DDQN where one estimates the value the other the advantage
00:46:32.900 | And again, that dueling nature is useful
00:46:39.120 | also because there are many states in which the action,
00:46:43.360 | the quality of the actions, is decoupled from the state. So in many states it doesn't matter
00:46:50.720 | which action you take
00:46:54.000 | So you don't need to learn all the different complexities
00:46:57.300 | all the topology of different actions when you're in a particular state
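A sketch of the dueling decomposition: one stream estimates the state value V(s), the other the advantage A(s, a), and they are recombined into Q(s, a) with the mean advantage subtracted, as in the dueling-network formulation. Feature and hidden sizes are illustrative.

```python
# Sketch (PyTorch): a dueling head that splits into value and advantage streams,
# then recombines them as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
# Feature and hidden sizes here are illustrative.
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feature_dim, num_actions):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                                   nn.Linear(128, 1))                # V(s): one number per state
        self.advantage = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                                       nn.Linear(128, num_actions))  # A(s, a): one per action

    def forward(self, features):
        v = self.value(features)
        a = self.advantage(features)
        return v + a - a.mean(dim=1, keepdim=True)   # subtract mean advantage for identifiability

head = DuelingHead(feature_dim=3136, num_actions=4)
print(head(torch.zeros(1, 3136)).shape)              # torch.Size([1, 4])
```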
00:47:01.920 | And another one
00:47:05.280 | is prioritized experience replay. Like I said, experience replay is really key to these algorithms,
00:47:10.660 | and the thing that sinks some of the policy optimization methods
00:47:15.040 | Experience replay is collecting different memories
00:47:18.500 | But if you just sample randomly in those memories
00:47:23.520 | you're now affected,
00:47:25.520 | the sampled experiences are
00:47:27.680 | really affected by the frequency with which those experiences occurred, not their importance
00:47:32.720 | So prioritized experience replay assigns a priority,
00:47:36.500 | a value
00:47:38.320 | based on the
00:47:40.320 | magnitude of the
00:47:42.540 | temporal difference (TD) error
00:47:44.640 | So the stuff you have learned the most from is given a higher priority, and therefore you get to see,
00:47:52.480 | through the experience replay process that
00:47:54.800 | That particular experience more often
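A simplified sketch of a replay buffer with priorities proportional to the magnitude of the TD error. The original method also uses a sum-tree and importance-sampling corrections; those are omitted here for brevity.

```python
# Sketch: a simplified prioritized replay buffer. Priorities are |TD error| + eps and
# sampling is proportional to priority. The full method also uses a sum-tree and
# importance-sampling weights; those details are omitted here.
import random

class PrioritizedReplay:
    def __init__(self, capacity=100_000, eps=1e-3):
        self.capacity, self.eps = capacity, eps
        self.memory, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.memory) >= self.capacity:        # drop the oldest experience
            self.memory.pop(0)
            self.priorities.pop(0)
        self.memory.append(transition)
        self.priorities.append(abs(td_error) + self.eps)

    def sample(self, batch_size):
        # experiences we learned the most from (large TD error) get replayed more often
        idx = random.choices(range(len(self.memory)), weights=self.priorities, k=batch_size)
        return [self.memory[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps
```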
00:47:58.320 | Okay, moving on to policy gradients. This is on-policy, versus Q-learning, which is off-policy
00:48:07.760 | Policy gradient
00:48:13.040 | Directly optimizing the policy where the input is the raw pixels
00:48:16.640 | And the policy network
00:48:19.360 | represents
00:48:23.520 | forms representations of that environment space, and its output produces a stochastic estimate,
00:48:29.280 | a probability of the different actions. Here in Pong, the input is the pixels, and there's
00:48:33.460 | A single output that produces a probability of moving the paddle up
00:48:38.800 | So how do policy gradients vanilla policy gradient very basic works
00:48:43.120 | Is you unroll the environment you play through the environment
00:48:49.600 | Here pong moving the paddle up and down and so on collecting no rewards
00:48:54.100 | And only collecting reward at the very end
00:48:57.840 | Based on whether you win or lose
00:49:01.760 | Every single action you're taking along the way gets either punished or rewarded based on whether it led to victory or defeat
00:49:07.680 | This also is remarkable that this works at all
00:49:13.120 | because the credit assignment there is,
00:49:17.680 | I mean, every single thing you did along the way
00:49:20.080 | is averaged out
00:49:23.520 | It's, like, muddied. It's the reason that policy gradient methods are less efficient, but it's still very surprising that it works at all
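A sketch of this vanilla policy gradient idea: roll out an episode, then push up the log-probability of every action in proportion to the discounted return that followed it. The tiny policy network over a 4-dimensional observation stands in for the pixel-based Pong network, and the reset()/step() environment interface follows the earlier loop sketch.

```python
# Sketch (PyTorch): vanilla policy gradient (REINFORCE). Every action in the episode
# is reinforced by the discounted return that followed it, the crude credit assignment
# described above. `env` is assumed to expose reset()/step() as in the earlier sketches.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # logits over 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def train_episode(env, gamma=0.99):
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        obs = torch.as_tensor(state, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy(obs))
        action = dist.sample()                        # stochastic policy: sample an action
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())
        rewards.append(reward)
    returns, g = [], 0.0
    for r in reversed(rewards):                       # discounted return from each step onward
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns)
    loss = -(torch.stack(log_probs) * returns).sum()  # gradient ascent on expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```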
00:49:30.800 | So the pros versus DQN the value-based methods
00:49:35.920 | Is that if the world is so messy that you can't learn a Q function the nice thing about policy gradient because it's learning
00:49:42.080 | The policy directly that it will at least learn a pretty good policy
00:49:46.480 | Usually in many cases faster convergence. It's able to deal with stochastic policies
00:49:51.120 | So value-based methods cannot learn stochastic policies and it's much more naturally able to deal with continuous actions
00:49:57.920 | The cons is it's inefficient
00:50:01.380 | versus DQN
00:50:05.360 | It can become highly unstable as we'll talk about some solutions to this during the training process and the credit assignment
00:50:13.040 | So if we look at the chain of actions that lead to a positive reward
00:50:17.600 | Some might be awesome actions. Some might be good actions
00:50:21.520 | Some might be terrible actions, but that doesn't matter as long as the destination was good
00:50:26.400 | And that's then every single action along the way gets a positive reinforcement
00:50:31.060 | That's the downside and there's now improvements to that advantage actor critic methods A2C combining the best of
00:50:42.880 | Value-based methods and policy
00:50:45.120 | based methods
00:50:48.320 | So having an actor
00:50:50.560 | two networks an actor which is
00:50:53.440 | Policy-based and that's the one that takes the actions
00:50:56.880 | Samples the actions from the policy network and the critic
00:51:00.880 | That measures how good those actions are
00:51:03.520 | And the critic is value-based
00:51:06.640 | All right. So as opposed to in the policy update the first equation there the reward coming from the destination
00:51:13.300 | the reward being from whether you won the game or not,
00:51:17.200 | Every single step along the way
00:51:19.920 | You now learn a Q value function
00:51:22.800 | QSA state and action
00:51:25.740 | using the critic network
00:51:28.480 | So you're able to now learn about the environment about evaluating your own actions at every step
00:51:35.280 | So you're much more sample efficient
00:51:37.280 | There's asynchronous
00:51:39.280 | from DeepMind and synchronous from OpenAI variants of this advantage actor-critic framework
00:51:47.040 | But both are highly parallelizable the difference with
00:51:50.800 | A3C the
00:51:55.280 | asynchronous one
00:51:57.200 | Is that every single agent so you just throw these agents operating in the environment and they're learning they're rolling out the games and getting the reward
00:52:05.840 | They're updating the original network asynchronously the global network parameters asynchronously
00:52:12.420 | And as a result, they're also operating constantly on outdated versions of that network
00:52:18.960 | The OpenAI approach that fixes this is that there's a coordinator, and there are these rounds where everybody,
00:52:26.320 | all the agents in parallel, are rolling out the episode
00:52:30.240 | but then the coordinator waits for everybody to finish in order to make the update to the global network and then
00:52:36.080 | distributes all the same parameters
00:52:38.620 | To all the agents and so that means that every iteration starts with the same global parameters and that has really nice properties
00:52:46.880 | in terms of convergence
00:52:49.520 | and stability of the training process
00:52:53.200 | From Google DeepMind, the deep deterministic policy gradient
00:52:57.520 | Is combining the ideas of dqn but dealing with continuous action spaces
00:53:05.360 | taking a policy network,
00:53:07.360 | the actor-critic framework,
00:53:10.640 | but instead of picking a stochastic policy,
00:53:15.040 | having the actor operate in a stochastic nature, it's picking a deterministic policy
00:53:21.440 | So it's always choosing the best action
00:53:25.760 | But okay with that the problem quite naturally
00:53:29.060 | Is that when the policy is now deterministic it's able to do a continuous action space, but because it's deterministic it's never exploring
00:53:36.580 | So the way we inject exploration into the system is by adding noise
00:53:40.960 | either adding noise into the action space on the output or adding noise into the parameters of the network that
00:53:47.200 | then
00:53:50.720 | create perturbations in the actions, such that the final result is that you try different kinds of things. And
00:53:57.120 | The scale of the noise just like with the epsilon greedy in the exploration for dqn
00:54:01.680 | The scale of the noise decreases as you learn more and more
00:54:04.240 | so on the policy optimization side from
00:54:07.760 | Openai and others
00:54:10.880 | We'll do a lecture just on this there's been a lot of exciting work here
00:54:16.800 | the basic idea of
00:54:19.980 | optimization on policy optimization with ppo and trpo
00:54:25.760 | First of all, we want to formulate
00:54:27.760 | A reinforcement learning as purely an optimization problem
00:54:36.640 | second of all
00:54:38.080 | if policy optimization
00:54:40.080 | the actions you take
00:54:42.320 | influence the rest of the optimization process, you have to be very careful about the actions you take. In particular,
00:54:50.660 | You have to
00:54:52.020 | Avoid taking really bad actions
00:54:54.020 | where your convergence, the training performance in general,
00:54:58.260 | collapses
00:55:00.580 | So, how do we do that?
00:55:02.100 | There's the line search methods, which is where gradient descent or gradient ascent falls under which
00:55:07.700 | which is how we train deep neural networks,
00:55:11.460 | is you first
00:55:14.020 | Pick a direction of the gradient and then pick the step size
00:55:19.620 | The problem with that is that can get you into trouble here. There's a nice visualization walking along a ridge
00:55:26.420 | Is it can it can result in you stepping off that ridge again the collapsing of the training process
00:55:33.620 | the performance. The trust region is the underlying idea here for the
00:55:39.380 | policy optimization methods: they first pick the step size, so they constrain in various kinds of ways the magnitude
00:55:48.100 | of the difference to the weights that's applied, and then the direction,
00:55:55.140 | placing a much higher priority on not choosing bad actions that can throw you off the optimization path than on the trajectory
00:56:01.220 | we should take to that path
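A sketch of the PPO clipped surrogate objective, which implements this "don't step too far" idea by clipping the ratio of new to old action probabilities to a small interval around one. The epsilon value is a commonly cited default, used here for illustration.

```python
# Sketch (PyTorch): the PPO clipped surrogate objective. The ratio of new to old action
# probabilities is clipped to [1 - eps, 1 + eps], limiting how far a single update can
# move the policy. eps = 0.2 is a commonly cited value, used here for illustration.
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)               # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                    # pessimistic (lower) bound

# Example with dummy tensors:
loss = ppo_clip_loss(torch.tensor([0.1, -0.2]), torch.tensor([0.0, 0.0]),
                     torch.tensor([1.0, -1.0]))
print(loss)
```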
00:56:03.060 | And finally,
00:56:04.740 | on the model-based methods, we'll also talk about them on the robotics side. There are a lot of
00:56:09.380 | interesting
00:56:11.120 | approaches now where deep learning is starting to be used for
00:56:15.140 | model-based methods when the model has to be learned.
00:56:17.620 | But of course, when the model doesn't have to be learned, when it's given, inherent to the game,
00:56:22.100 | when you know the model, like in Go and chess and so on, AlphaZero has really done incredible stuff.
00:56:30.660 | So what is the model here? The way that a lot of these games are approached:
00:56:36.340 | take the game of Go; it's turn-based, one person goes and another person goes, and there's this game tree.
00:56:42.260 | At every point there's a set of actions that can be taken, and if you look at that game tree,
00:56:46.900 | it grows exponentially,
00:56:50.020 | so it becomes huge. The game of Go is the largest of all, because the number of choices you have
00:56:55.620 | is the largest; then there's chess,
00:56:58.340 | and then it gets to checkers and then tic-tac-toe; the branching factor at every step
00:57:04.580 | increases or decreases based on the game structure.
00:57:08.180 | And so the task for the neural network there is to learn the quality of the board; it's to learn
00:57:14.100 | which boards,
00:57:16.820 | which game positions, are
00:57:19.540 | most useful to explore
00:57:24.100 | and most likely to result in a highly successful state.
00:57:28.500 | So that choice of what's good to explore, which branch is good to go down,
00:57:33.940 | is where we can have neural networks step in.
00:57:36.980 | With AlphaGo, the first success that beat the world champion, it was pre-trained on expert games;
00:57:44.580 | then with AlphaGo Zero
00:57:47.140 | there was
00:57:50.580 | no pre-training on expert games, so no imitation learning; it's purely through self-play, through playing itself into
00:57:58.900 | new board positions. Many of these systems use Monte Carlo tree search, and during this search they're
00:58:05.120 | balancing exploitation and exploration: going deep on promising positions based on the estimates of the neural network, or,
00:58:11.120 | with a flip of a coin, playing under-explored positions.
00:58:15.380 | You could think of this as an intuition for looking at a board and estimating how good that board is,
00:58:24.320 | and also estimating how likely that board is to lead to victory in the end,
00:58:31.520 | so estimating both general quality and the probability of leading to victory.
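A minimal sketch of a PUCT-style selection rule in the spirit of these systems (not the exact AlphaGo Zero code); the `Node` fields for visit counts, values, and network priors are assumptions:

```python
import math

def select_child(node, c_puct=1.5):
    """node.children maps a move to a child node; each child is assumed to
    carry visit_count, total_value, and a prior from the policy network."""
    total_visits = sum(child.visit_count for child in node.children.values())
    best_move, best_score = None, -float("inf")
    for move, child in node.children.items():
        q = child.total_value / child.visit_count if child.visit_count else 0.0
        u = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visit_count)
        if q + u > best_score:                     # exploitation (q) plus exploration bonus (u)
            best_move, best_score = move, q + u
    return best_move, node.children[best_move]
```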
00:58:37.120 | The next step forward is AlphaZero, using a similar architecture with MCTS,
00:58:43.380 | Monte Carlo tree search, but applying it to different games,
00:58:47.360 | applying it and competing against
00:58:50.560 | state-of-the-art engines in Go, in shogi, and in chess,
00:58:55.760 | and outperforming them with very few
00:58:58.640 | training steps.
00:59:00.480 | So these are
00:59:02.480 | model-based approaches, which are really simple and efficient if you can construct such a model,
00:59:08.800 | and in robotics, if you can learn such a model,
00:59:12.160 | they can be exceptionally powerful; here
00:59:15.760 | they're beating the
00:59:19.660 | engines which are already far superior to humans: Stockfish can destroy most humans
00:59:24.320 | on earth at the game of chess.
00:59:27.360 | The ability, through learning, through estimating the quality of a board, to defeat these engines is incredible, and
00:59:34.400 | the exciting aspect here, versus engines that don't use neural networks, is the
00:59:42.160 | number of positions explored:
00:59:44.320 | based on the neural network you explore certain positions,
00:59:50.100 | you explore certain parts of the tree.
00:59:53.840 | And if you look at grandmasters,
00:59:57.440 | human players
00:59:59.440 | in chess, they seem to explore very few moves;
01:00:03.120 | they have a really good neural network for estimating which branches are likely to provide value to explore.
01:00:11.120 | And on the other side,
01:00:14.080 | Stockfish and so on are much more brute-force in their search,
01:00:19.620 | and AlphaZero is a step towards the grandmaster,
01:00:23.600 | because the number of branches that need to be explored is much, much smaller.
01:00:26.480 | A lot of the work is done in the representation formed by the neural network, which is super exciting.
01:00:32.480 | And then it's able to outperform
01:00:34.960 | Stockfish in chess, it's able to outperform Elmo in shogi, and
01:00:40.080 | it outperforms itself in Go,
01:00:43.600 | the previous iterations, AlphaGo Zero and so on.
01:00:47.280 | Now, the challenge here,
01:00:53.440 | the sobering truth, is that the majority of real-world applications
01:00:57.220 | of agents that have to perceive the world and act in this world
01:01:01.680 | are, for the most part,
01:01:04.000 | not RL-based; they have no RL involved.
01:01:06.480 | So the action is not learned:
01:01:09.680 | you use neural networks to perceive certain aspects of the world, but ultimately
01:01:13.940 | the action
01:01:16.720 | is not learned from data.
01:01:19.520 | That's true for most, or all, of the autonomous vehicle companies operating today,
01:01:25.680 | and it's true for
01:01:28.320 | robotic manipulation,
01:01:30.220 | industrial robotics, and any of the humanoid robots that have to navigate this world under uncertain conditions.
01:01:35.780 | All the work from Boston Dynamics doesn't involve any machine learning, as far as we know.
01:01:40.080 | Now that's beginning to change, here with ANYmal, the recent development
01:01:49.360 | where certain aspects of the control
01:01:52.640 | of robotics are being learned:
01:01:55.280 | you're trying to learn more efficient movement, you're trying to learn more robust movement, on top of the other controllers.
01:02:01.940 | So it's quite exciting, through RL, to be able to learn some of the control dynamics here that are able to teach
01:02:09.040 | this particular robot to get up from arbitrary positions,
01:02:13.780 | so there's less hard-coding in order to deal with
01:02:19.740 | unexpected initial conditions and unexpected perturbations.
01:02:22.820 | So it's exciting there in terms of learning the control dynamics, and also some of the driving policy,
01:02:29.280 | so making driving behavior decisions, changing lanes, turning, and so on. If you
01:02:36.400 | were here last week, you heard from Waymo:
01:02:38.960 | they're starting to use some RL in terms of the driving policy, especially in order to predict the future.
01:02:44.560 | They're trying to anticipate, with intent modeling, where the pedestrians and the cars are going to be in the environment;
01:02:50.000 | they're trying to unroll what's happened recently into the future, and beginning to
01:02:54.720 | move beyond the sort of pure end-to-end learning approach, like NVIDIA's end-to-end control decisions,
01:03:02.480 | actually moving to RL and making long-term planning decisions.
01:03:06.660 | But again, the challenge is
01:03:12.800 | the gap, the leap needed to go from simulation to the real world.
01:03:18.080 | Most of the work is in the design of the environment and the design of the reward structure,
01:03:22.480 | and because most of that work now is in simulation, we need to either develop better algorithms for transfer learning
01:03:29.040 | or close the distance between simulation and the real world.
01:03:33.520 | Also, we could think outside the box a little bit.
01:03:38.240 | I had a conversation with Pieter Abbeel recently, one of the leading researchers in deep RL;
01:03:42.640 | he kind of, on the side, quickly mentioned
01:03:46.240 | the idea that we don't need to make simulation more realistic.
01:03:52.420 | What we could do is just create an infinite number of simulations,
01:03:58.180 | or a very large number of simulations,
01:04:03.360 | and naturally the regularization aspect of having all those simulations
01:04:07.860 | will make it so that our reality is just another sample from those simulations.
01:04:12.900 | And so maybe the solution isn't to create higher-fidelity simulation or to create transfer learning algorithms;
01:04:19.140 | maybe it's to build an
01:04:24.240 | arbitrary number of simulations,
01:04:26.700 | so that the step towards creating an agent that works in the real world is a trivial one.
01:04:33.680 | And maybe that's exactly what whoever created the simulation we're living in, and the multiverse we're living in, did.
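A minimal sketch of this "many simulations" idea as domain randomization; the physics parameters, their ranges, and the `make_sim` / `train_one_episode` helpers are hypothetical placeholders:

```python
import random

def sample_sim_params():
    """Sample a new 'world' for each episode; ranges here are arbitrary."""
    return {
        "friction":     random.uniform(0.5, 1.5),
        "mass_scale":   random.uniform(0.8, 1.2),
        "latency_ms":   random.uniform(0.0, 40.0),
        "sensor_noise": random.uniform(0.0, 0.05),
    }

def train_with_randomization(policy, make_sim, train_one_episode, episodes=10_000):
    for _ in range(episodes):
        sim = make_sim(**sample_sim_params())      # a fresh simulated world every episode
        train_one_episode(policy, sim)             # any RL algorithm can run inside
    return policy
```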
01:04:40.640 | Next steps
01:04:45.200 | The lecture videos, and we have several on RL, will all be made available on deeplearning.mit.edu.
01:04:50.580 | We'll have several tutorials on RL on GitHub;
01:04:54.400 | the link is there. And I really like the essay
01:04:58.800 | from OpenAI on spinning up as a deep RL researcher, if you're interested in getting into research in RL:
01:05:05.360 | what are the steps you need to take, from developing the mathematical background, probability, statistics, and multivariate calculus,
01:05:13.300 | to some of the basics covered last week on deep learning, some of the basic ideas in RL,
01:05:19.040 | just terminology and so on, some basic concepts, then picking a framework, TensorFlow or PyTorch,
01:05:27.280 | and learning by doing.
01:05:29.040 | Implement the algorithms I mentioned today; those are the core RL algorithms.
01:05:33.540 | Implement all of them from scratch:
01:05:36.080 | it should only take about 200 to 300 lines of code each. When you put them down on paper, they're actually quite simple,
01:05:42.720 | intuitive algorithms.
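As a taste of how compact these algorithms are, here is a minimal sketch of the core REINFORCE update (just the loss on one collected episode, not a full training script); the episode's log-probabilities and rewards are assumed to have been collected with some policy network:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """log_probs: list of log pi(a_t|s_t) tensors for one episode;
    rewards: list of floats collected at each step."""
    returns, g = [], 0.0
    for r in reversed(rewards):                    # discounted return-to-go
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # simple variance reduction
    return -torch.sum(torch.stack(log_probs) * returns)             # ascend the expected return
```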
01:05:44.960 | And then
01:05:46.720 | read the papers about those algorithms that followed, looking not for the headline performance,
01:05:54.000 | the hand-waving performance, but for the tricks that were used to train these algorithms.
01:05:58.000 | The tricks tell a lot of the story, and those are the useful parts you need to learn.
01:06:03.280 | And iterate fast on simple benchmark environments.
01:06:07.200 | OpenAI Gym has provided a lot of easy-to-use environments that you can play with,
01:06:12.480 | where you can train an agent in minutes or hours, as opposed to days and weeks.
01:06:17.200 | Iterating fast is the best way to learn these algorithms.
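A minimal interaction loop with a Gym environment as a sanity check before plugging in a learned policy, assuming the classic `gym` API where `reset()` returns an observation and `step()` returns `(observation, reward, done, info)`:

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()
done, episode_reward = False, 0.0
while not done:
    action = env.action_space.sample()             # random agent; swap in policy(obs) later
    obs, reward, done, info = env.step(action)
    episode_reward += reward
print("episode reward:", episode_reward)
env.close()
```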
01:06:22.480 | And then on the research side, there are three ways to get a best paper award, right,
01:06:25.440 | to publish, to contribute, and to have an impact in the research community
01:06:31.120 | in RL. One is to improve an existing approach on a particular benchmark; there are a few
01:06:37.360 | benchmark datasets and environments that are emerging, so you want to improve an existing approach in some aspect of convergence or performance.
01:06:44.900 | Two, you can
01:06:46.400 | focus on an unsolved task; there are certain games that just haven't been solved through the RL
01:06:52.560 | formulation. Or three, you can come up with a totally new
01:06:55.600 | problem that hasn't been addressed by RL before.
01:06:59.200 | So with that, I'd like to thank you very much. Tomorrow I hope to see you here for DeepTraffic. Thanks.