back to index

MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL)


Chapters

0:00 Introduction
2:14 Types of learning
6:35 Reinforcement learning in humans
8:22 What can be learned from data?
12:15 Reinforcement learning framework
14:06 Challenge for RL in real-world applications
15:40 Components of an RL agent
17:42 Example: robot in a room
23:05 AI safety and unintended consequences
26:21 Examples of RL systems
29:52 Takeaways for real-world impact
31:25 3 types of RL: model-based, value-based, policy-based
35:28 Q-learning
38:40 Deep Q-Networks (DQN)
48:00 Policy Gradient (PG)
50:36 Advantage Actor-Critic (A2C & A3C)
52:52 Deep Deterministic Policy Gradient (DDPG)
54:12 Policy Optimization (TRPO and PPO)
56:03 AlphaZero
60:50 Deep RL in real-world applications
63:09 Closing the RL simulation gap
64:44 Next step in Deep RL

Whisper Transcript | Transcript Only Page

00:00:00.000 | Today I'd like to overview the exciting field of deep reinforcement learning
00:00:04.640 | Introduce, overview and provide you some of the basics
00:00:08.240 | I think it's one of the most exciting fields in artificial intelligence
00:00:14.960 | It's marrying the power and the ability of deep neural networks
00:00:20.320 | to represent and comprehend the world
00:00:23.120 | with the ability to act on that understanding
00:00:29.040 | on that representation
00:00:31.040 | Taken as a whole, that's really what the creation of intelligent beings is:
00:00:36.480 | Understand the world and act
00:00:39.360 | And the exciting breakthroughs that recently have happened
00:00:42.000 | Captivate our imagination about what's possible
00:00:45.440 | And that's why this is my favorite area of deep learning and artificial intelligence in general
00:00:50.400 | And I hope you feel the same
00:00:52.560 | So what is deep reinforcement learning?
00:00:55.120 | We've talked about deep learning which is taking samples of data
00:01:00.080 | Being able to in a supervised way
00:01:02.080 | compress, encode the representation of that data in a way that you can reason about it
00:01:07.680 | And we take that power and apply it to the world where sequential decisions are to be made
00:01:16.000 | So it's looking at problems and formulations of tasks
00:01:22.640 | Where an agent, an intelligent system has to make a sequence of decisions
00:01:28.240 | And the decisions that are made
00:01:31.200 | Have an effect on the world around the agent
00:01:36.880 | How do all of us,
00:01:38.880 | any intelligent being that is tasked with operating in the world, learn anything,
00:01:43.680 | especially when you know very little in the beginning?
00:01:47.280 | Trial and error is the fundamental process by which reinforcement learning agents learn
00:01:53.280 | And the deep part of deep reinforcement learning is neural networks
00:01:59.380 | It's using the frameworks of reinforcement learning
00:02:02.720 | Where the neural network is doing the representation
00:02:07.460 | Of the world based on which the actions are made
00:02:11.520 | And we have to take a step back
00:02:15.760 | When we look at the types of learning
00:02:17.760 | Sometimes the terminology itself can confuse us to the fundamentals
00:02:22.340 | There is supervised learning, there's semi-supervised learning, there's unsupervised learning, there's reinforcement learning
00:02:29.360 | And there's this feeling that supervised learning is really the only one
00:02:33.360 | Where you have to perform the manual annotation, where you have to do the large-scale supervision
00:02:38.500 | That's not the case
00:02:42.640 | Every type of machine learning is supervised learning
00:02:45.680 | It's supervised by a loss function or a function that tells you what's good
00:02:53.360 | And what's bad
00:02:55.680 | You know even looking at our own existence is how we humans figure out what's good and bad
00:03:00.240 | there's
00:03:02.080 | All kinds of sources direct and indirect by which our morals and ethics we figure out what's good and bad
00:03:08.320 | The difference between supervised and unsupervised and reinforcement learning is the source of that supervision
00:03:13.780 | What's implied when you say unsupervised?
00:03:16.560 | Is that the cost of human labor required to attain the supervision is low
00:03:22.720 | But it's never
00:03:25.440 | Turtles all the way down it's turtles and then there's a human at the bottom
00:03:30.960 | There at some point there needs to be human intervention
00:03:37.360 | Human
00:03:38.480 | Input to provide what's good and what's bad and this will arise in reinforcement learning as well
00:03:43.680 | we have to remember that because the challenges and the exciting opportunities of reinforcement learning lie in the fact of
00:03:50.000 | How do we get that supervision?
00:03:53.220 | In the most efficient way possible, but supervision nevertheless is required for any system that has an input and an output
00:04:01.840 | That's trying to learn like a neural network does to provide an output. That's good. It needs somebody to say what's good and what's bad
00:04:09.360 | For those of you curious about that, there have been a few books, a couple, written throughout the last few centuries, from Socrates to Nietzsche
00:04:16.720 | I recommend
00:04:17.840 | the latter especially
00:04:19.840 | So let's look at supervised learning and reinforcement learning
00:04:23.620 | I'd like to propose a way to think about the difference
00:04:28.160 | That is illustrative and useful when we start talking about the techniques
00:04:33.300 | So supervised learning is taking
00:04:35.760 | a bunch of examples of data
00:04:41.200 | Learning from those examples where ground truth provides you
00:04:44.640 | the compressed
00:04:47.440 | Semantic meaning of what's in that data and from those examples one by one whether it's sequences or single samples
00:04:56.720 | We learn how to then take future such samples and interpret them
00:05:02.240 | Reinforcement learning is teaching an agent through experience
00:05:09.700 | Not by showing a singular sample of a data set but by putting them out into the world
00:05:15.840 | The distinction there the essential element of reinforcement learning then for us
00:05:20.880 | Now we'll talk about a bunch of algorithms
00:05:24.640 | But the essential design step is to provide the world in which to experience
00:05:31.140 | The agent learns from the world
00:05:34.320 | From the world it gets the dynamics of that world, the physics of the world; from that world
00:05:41.040 | it gets the rewards, what's good and bad. And we as designers
00:05:44.660 | of that agent do not just have to do the algorithm. We have to design the world
00:05:53.040 | In which that agent is trying to solve a task
00:05:57.600 | The design of the world is the process of reinforcement learning the design of examples
00:06:03.520 | The annotation of examples is the world of supervised learning
00:06:06.500 | And the essential perhaps the most difficult element of reinforcement learning is the reward the good versus bad
00:06:15.840 | Here a baby starts walking across the room
00:06:21.280 | We want to define success as a baby
00:06:23.840 | walking across the room
00:06:26.400 | And reaching the destination that's success and failure is the inability to reach that destination
00:06:32.100 | simple
00:06:33.760 | and reinforcement learning in humans
00:06:35.760 | The way we learn, or
00:06:40.980 | appear to learn, from very few examples through trial and error
00:06:46.240 | is a mystery, a beautiful mystery full of open questions
00:06:49.280 | It could be from the huge amount of data 230 million years worth of bipedal data that we've been walking
00:06:55.760 | Mammals walking ability to walk or 500 million years the ability to see having eyes
00:07:02.240 | So that's the the hardware side somehow genetically encoded in us is the ability to comprehend this world extremely efficiently
00:07:10.260 | It could be through
00:07:12.640 | not the hardware not the 500 million years, but the
00:07:16.000 | the few
00:07:18.720 | minutes hours days months
00:07:20.880 | Maybe even years in the very beginning when we're born
00:07:24.720 | The ability to learn really quickly through observation to aggregate that information
00:07:29.940 | Filter all the junk that you don't need and be able to learn really quickly
00:07:34.480 | Through imitation learning through observation the way for walking that might mean observing others to walk
00:07:42.080 | The idea there is
00:07:43.760 | If there were no others
00:07:45.760 | around, we would never be able to learn the fundamentals of this walking, or at least not as efficiently
00:07:50.420 | It's through observation
00:07:53.200 | And then it could be the algorithm, totally not understood, that our brain uses to learn
00:08:01.520 | The backpropagation that's in artificial neural networks, that same kind of process is not understood in the brain
00:08:08.960 | That could be the key
00:08:11.520 | So I want you to think about that as we talk about
00:08:13.760 | the very trivial
00:08:16.160 | by comparison accomplishments in reinforcement learning, and how do we take the next steps?
00:08:20.880 | But it nevertheless is exciting to have machines that learn how to act in the world
00:08:30.080 | the process of learning for those who have
00:08:34.560 | fallen in love with artificial intelligence
00:08:38.000 | The process of learning is thought of as intelligence. It's the ability to know very little and through experience examples
00:08:45.620 | Interaction with the world in whatever medium whether it's data or simulation so on be able to form much richer and interesting
00:08:53.780 | Representations of that world be able to act in that world. That's that's the dream
00:08:58.080 | So let's look at this stack of what it means to be an agent in this world
00:09:03.280 | from top,
00:09:05.040 | the input, to the bottom, the output. There's an environment. We have to sense that environment
00:09:11.280 | We have just a few tools us humans have
00:09:13.360 | Several sensory systems on cars you can have lidar camera
00:09:20.080 | Stereo vision audio microphone networking gps imu sensor so on whatever robot you can think about
00:09:27.440 | There's a way to sense that world
00:09:29.760 | and you have this raw sensory data and then once you have the raw sensory data you're tasked with
00:09:34.720 | representing that data in such a way that you can make sense of it as opposed to all the
00:09:40.240 | the raw sensors in the eye, the cones and so on, taken in as just a giant stream of high bandwidth information
00:09:48.000 | We have to be able to form
00:09:50.240 | higher
00:09:52.400 | Abstractions of features based on which we can reason from edges to corners to faces
00:09:57.760 | And so on that's exactly what deep learning neural networks have stepped in to be able to
00:10:02.480 | In an automated fashion with as little human input as possible be able to form higher order representations of that information
00:10:10.100 | Then there's the learning aspect, building on top of the greater abstractions formed through representations
00:10:17.620 | Be able to accomplish something useful whether it's discriminative task a generative task and so on based on the representation
00:10:25.040 | Be able to make sense of the data be able to generate new data and so on
00:10:29.440 | From sequence to sequence to sequence to sample from sample to sequence and so on and so forth to actions as we'll talk about
00:10:37.040 | and then there is the
00:10:39.920 | ability to
00:10:42.780 | Aggregate all the information that's been received in the past to the useful
00:10:48.480 | information that's
00:10:51.440 | pertinent to the task at hand. It's the old thing:
00:10:55.040 | it looks like a duck, quacks like a duck, swims like a duck
00:10:58.240 | Three different data sets, and I'm sure there are state-of-the-art algorithms for the three: image classification,
00:11:04.420 | audio recognition
00:11:06.720 | video classification
00:11:08.560 | Activity recognition so on aggregating those three together
00:11:12.000 | Is still an open problem and that could be the last piece again
00:11:16.480 | I want you to think about as we think about reinforcement learning agents. How do we play?
00:11:20.880 | How do we transfer from the game of atari to the game of go to the game of dota to the game of a robot?
00:11:29.040 | Navigating an uncertain environment in the real world
00:11:32.720 | And once you have that once you sense the raw world once you have a representation of that world then
00:11:40.240 | We need to act
00:11:43.440 | Which is provide actions within the constraints of the world in such a way that we believe can get us towards success
00:11:51.680 | The promise, the excitement of deep learning is the part of the stack that converts raw data into meaningful representations
00:11:58.900 | The promise, the dream of deep reinforcement learning
00:12:02.560 | Is going beyond
00:12:05.040 | And building an agent that uses that representation
00:12:08.020 | And acts to achieve success in the world
00:12:11.120 | That's super exciting
00:12:13.920 | The framework and the formulation of reinforcement learning
00:12:18.800 | At its
00:12:20.800 | Simplest
00:12:22.240 | Is that there's an environment and there's an agent that acts in that environment
00:12:26.500 | the agent senses the environment by
00:12:29.600 | by some
00:12:31.580 | observation whether it's partial or
00:12:33.840 | complete observation of the environment
00:12:38.880 | It gives the environment an action it acts in that environment and through the action
00:12:43.920 | The environment changes in some way and then a new observation occurs
00:12:49.360 | And then also as you provide the action make the observations you receive a reward
00:12:53.940 | In most formulations of this framework
00:12:58.000 | This entire system has no memory
00:13:00.720 | That the
00:13:04.080 | The only thing you need to be concerned about is the state you came from the state you arrived in and the reward received
00:13:09.780 | The open question here is what can't be modeled in this kind of way. Can we model all of it?
00:13:18.000 | From human life to the game of go
00:13:20.000 | Can all of this be modeled in this way?
00:13:22.480 | And is this a good way to formulate the learning problem of robotic systems
00:13:29.520 | in the real world, in the simulated world? Those are the open questions
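For readers following along, here is a minimal sketch of the observation-action-reward loop described above. The `Environment` class and `random_policy` function are hypothetical stand-ins, not any specific library's API.

```python
# Minimal sketch of the agent-environment loop described above.
# `Environment` and `random_policy` are hypothetical stand-ins, not a real library API.
import random

class Environment:
    """Toy environment: the state is a step counter; the episode ends after 10 steps."""
    def reset(self):
        self.t = 0
        return self.t  # initial observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0  # reward depends on the action taken
        done = self.t >= 10                   # terminal state reached
        return self.t, reward, done           # new observation, reward, episode-over flag

def random_policy(observation):
    return random.choice([0, 1])              # pick an action given the observation

env = Environment()
obs, done, total_reward = env.reset(), False, 0.0
while not done:
    action = random_policy(obs)               # agent acts on its observation
    obs, reward, done = env.step(action)      # environment changes, emits a reward
    total_reward += reward
print("episode return:", total_reward)
```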
00:13:33.220 | The environment could be fully observable
00:13:37.040 | Or partially observable like in poker
00:13:40.960 | It could be single agent or multi-agent atari versus driving like deep traffic
00:13:47.340 | deterministic or stochastic
00:13:49.340 | static versus dynamic
00:13:52.060 | Static as in chess, dynamic as in driving and most real-world applications; discrete versus continuous: discrete like games,
00:13:59.100 | chess, or continuous like cart-pole, balancing a pole on a cart
00:14:03.500 | The challenge for RL in real world applications
00:14:07.840 | Is that as a reminder
00:14:16.280 | Supervised learning is teaching by example learning by example
00:14:21.100 | teaching from our perspective
00:14:23.860 | Reinforcement learning is teaching by experience
00:14:26.300 | And the way we provide experience to reinforcement learning agents currently for the most part is through simulation
00:14:33.100 | Or through highly constrained real world scenarios
00:14:37.260 | So the challenge is in the fact
00:14:42.600 | most of the successes
00:14:44.920 | is with
00:14:46.920 | Systems environments that are simulatable
00:14:49.180 | So there's two ways to then close this gap
00:14:54.280 | two directions of research and work one is to
00:14:58.440 | improve the
00:15:00.740 | algorithms improve the ability of the algorithms to then
00:15:03.640 | To form policies that are transferable across all kinds of domains including the real world including especially the real world
00:15:10.920 | So train in simulation, transfer to the real world
00:15:16.920 | The other is to improve the simulation in such a way that the fidelity of the simulation increases to the point where the gap
00:15:24.040 | between reality and simulation
00:15:28.760 | is minimal, to a degree that things learned in simulation are directly, trivially transferable to the real world
00:15:37.000 | Okay, the major components of an RL agent
00:15:44.040 | an agent
00:15:45.960 | Operates based on a strategy
00:15:47.960 | called a policy
00:15:50.440 | It sees the world it makes a decision. That's a policy makes a decision how to act sees the reward
00:15:56.520 | Sees a new state acts sees a reward sees new states and acts and this repeats
00:16:03.080 | forever until a terminal state
00:16:06.200 | the value function
00:16:11.300 | is an estimate of how good a state is
00:16:13.620 | or how good
00:16:16.260 | A state action pair is meaning taking an action
00:16:19.540 | In a particular state. How good is that ability to evaluate that?
00:16:24.660 | and then the model
00:16:27.540 | Different from the environment from the perspective of the agent
00:16:30.340 | So the environment has a model based on which it operates
00:16:33.640 | And then the agent has a representation best understanding of that model
00:16:39.300 | So the purpose for an RL agent
00:16:41.620 | In this
00:16:44.340 | Simply formulated framework is to maximize reward
00:16:47.380 | the way that
00:16:49.780 | The reward mathematically and practically is talked about
00:16:53.140 | Is with a discounted framework, so we discount further and further future reward
00:17:00.740 | So the reward that's farther into the future means less to us in terms of maximization than reward
00:17:07.460 | That's in the near term. And so why do we discount it?
00:17:11.220 | So first, a lot of it is a math trick, to be able to prove certain aspects, analyze certain aspects of convergence
00:17:17.160 | And in general on a more philosophical sense
00:17:20.820 | Because environments either are, or can be thought of as, stochastic, random, it's very difficult.
00:17:27.720 | There's a degree of uncertainty, which makes it difficult to really estimate
00:17:35.940 | the reward there'll be in the future because of the ripple effect of the uncertainty
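As a small illustration of the discounted formulation, the return is the sum of rewards weighted by increasing powers of the discount factor gamma; the gamma value and reward sequence below are chosen for illustration only.

```python
# Sketch: discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
# gamma and the reward sequence are illustrative values, not from the lecture.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):   # accumulate from the last reward backwards
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81: far-future reward counts less
```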
00:17:40.280 | Let's look at an example a simple one
00:17:43.860 | Helps us understand
00:17:46.640 | policies, rewards, actions. There's a robot in the room, there's
00:17:50.980 | 12 cells in which it can step. It starts in the bottom left. It tries to get rewards at the top right
00:17:58.740 | There's a plus one, a really good thing, at the top right. It wants to get there by walking around
00:18:05.300 | There's a negative one, which is really bad. It wants to avoid that square and the choice of actions is up down left right four actions
00:18:13.140 | So you could think of there
00:18:16.260 | being a negative reward of 0.04 for each step
00:18:20.260 | So there's a cost to each step and there's a stochastic nature to this world potentially we'll talk about both deterministic stochastic
00:18:26.920 | So in the stochastic case when you choose the action up
00:18:30.660 | with an 80% probability
00:18:34.580 | With an 80% chance you move up but
00:18:36.660 | With a 10% chance you move left and with another 10% you move right
00:18:41.300 | So that's the stochastic nature: even though you try to go up you might end up in a block to the left or to the right
00:18:46.340 | so for a deterministic world
00:18:51.280 | optimal policy here
00:18:53.280 | given that we always start in the bottom left, is really the shortest path.
00:18:56.580 | Because there's no stochasticity,
00:19:00.980 | you're never going to screw up and just fall into the negative one hole, so you just compute the shortest path and
00:19:07.140 | walk along that shortest path. Why shortest path? Because every single step hurts. There's a negative reward to it,
00:19:13.860 | 0.04, so the shortest path is the thing that minimizes the accumulated penalty, the shortest path to the
00:19:20.340 | to the plus one block
00:19:22.820 | Okay, let's look at a stochastic world. Like I mentioned the 80% up and then split 10% to the left and right
00:19:29.860 | How does the policy change? Well, first of all,
00:19:34.980 | we need to have a plan for every single block in the area, because you might end up there due to the stochasticity of the world
00:19:40.980 | Okay, the basic addition there is that we're trying to
00:19:45.940 | avoid up
00:19:49.300 | the closer you get to the negative one hole. So just try to avoid up, because
00:19:56.100 | the stochastic nature of up means you might fall into the hole with a 10% chance
00:20:00.340 | And given the 0.04 step reward you're willing to take the long way home
00:20:05.620 | In some cases in order to avoid that possibility the negative one possibility
00:20:10.760 | Now, let's look at a reward for each step if it decreases to negative two. It really hurts to take every step
00:20:17.380 | Then again, we go to the shortest path despite the fact that uh, there's a stochastic nature
00:20:24.180 | In fact, you don't really care that you step into the negative one hole because every step really hurts. You just want to get home
00:20:30.260 | And then you can play with this reward structure right yes
00:20:35.380 | instead of uh negative two or negative 0.04 you can look at
00:20:41.140 | Negative 0.1 and you can see immediately
00:20:44.600 | that the structure of the policy
00:20:47.620 | It changes
00:20:50.180 | So with a higher
00:20:52.420 | Value the higher negative reward for each step
00:20:54.980 | immediately
00:20:56.980 | the urgency of the agent increases
00:20:59.160 | Versus the less urgency the lower the negative reward
00:21:03.780 | And when the reward flips
00:21:07.380 | So it's positive
00:21:10.980 | then every step is a positive, so the entire system, which is actually
00:21:17.140 | quite common in reinforcement learning, the entire system is full of positive rewards
00:21:21.940 | And so then the optimal policy becomes the longest path
00:21:25.220 | It's like grad school: taking as long as possible, never reaching the destination
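A sketch of value iteration on this 4x3 "robot in a room" world. The layout, rewards, and 80/10/10 transition noise follow the example above; the exact positions of the blocked cell and of the -1 square are assumed from the standard version of this gridworld.

```python
# Sketch: value iteration on the 4x3 gridworld described above.
# The blocked cell at (1, 1) and the -1 square just below the +1 are assumptions
# taken from the standard version of this example.
import itertools

ROWS, COLS = 3, 4
WALL = {(1, 1)}
TERMINAL = {(0, 3): +1.0, (1, 3): -1.0}   # +1 at top right, -1 just below it
STEP_REWARD = -0.04
GAMMA = 1.0
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERP = {"up": ["left", "right"], "down": ["left", "right"],
        "left": ["up", "down"], "right": ["up", "down"]}

def move(state, action):
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < ROWS and 0 <= nc < COLS) or (nr, nc) in WALL:
        return state                      # bump into a wall or boundary: stay in place
    return (nr, nc)

def transitions(state, action):
    yield 0.8, move(state, action)        # 80% intended direction
    for perp in PERP[action]:
        yield 0.1, move(state, perp)      # 10% each perpendicular slip

states = [s for s in itertools.product(range(ROWS), range(COLS)) if s not in WALL]
V = {s: 0.0 for s in states}
for _ in range(100):                      # repeated value-iteration sweeps
    V_new = {}
    for s in states:
        if s in TERMINAL:
            V_new[s] = TERMINAL[s]
            continue
        V_new[s] = max(sum(p * (STEP_REWARD + GAMMA * V[s2])
                           for p, s2 in transitions(s, a))
                       for a in ACTIONS)
    V = V_new

policy = {s: max(ACTIONS, key=lambda a: sum(p * V[s2] for p, s2 in transitions(s, a)))
          for s in states if s not in TERMINAL}
print(policy[(2, 0)])                     # best first move from the bottom-left start
```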
00:21:36.900 | What lessons do we draw from robot in the room two things?
00:21:40.660 | The environment model the dynamics is just there in the trivial example the stochastic nature the difference between 80 percent 100 percent and
00:21:49.140 | 50 percent
00:21:51.060 | The model of the world the environment has a big impact on what the optimal policy is
00:21:55.780 | And the reward structure, most importantly, the thing we can often control
00:22:06.900 | in our constructs of the task we try to solve in reinforcement learning:
00:22:11.140 | what is good and what is bad, how bad is it and how good is it. The reward structure has a big
00:22:17.300 | impact, and that makes a complete change,
00:22:21.140 | like, uh, Robert Frost said, a complete change on the
00:22:25.540 | policy, the choices the agent makes. So when you formulate a reinforcement learning framework
00:22:33.380 | As researchers as students what you often do is you design the environment you design the world in which the system learns
00:22:41.140 | Even when your ultimate goal is the physical robot you just still there's a lot of work still done in simulation
00:22:48.100 | So you design the world the parameters of that world and you also design the reward structure and it can have
00:22:53.780 | transformative results; slight variations in those parameters can have huge results,
00:23:01.060 | huge differences in the policy that's arrived at. And of course
00:23:05.060 | the example I've shown before, that I really love, is the
00:23:12.580 | impact the changing reward structure might have: unintended consequences
00:23:19.940 | those
00:23:21.140 | Consequences for real world system can have obviously
00:23:24.200 | highly detrimental
00:23:27.680 | Costs that are more than just a failed game of atari
00:23:31.220 | So here's a human performing the task, playing the game of Coast Runners, racing around the track
00:23:37.380 | and so it's
00:23:39.620 | uh, when you finish first
00:23:41.860 | And you finish fast you get a lot of points and so it's natural to then okay
00:23:47.780 | Let's do an rl agent and then optimize this for those points
00:23:51.700 | And what you find out in the game is that you also get points by picking up the little green turbo things
00:23:59.380 | And what the agent figures out is that you can actually get a lot more points
00:24:06.500 | By simply focusing on the green turbos
00:24:09.380 | focusing on the green turbos
00:24:12.020 | Just rotating over and over slamming into the wall fire and everything just picking it up, especially because
00:24:17.540 | the ability to pick up those turbos
00:24:21.540 | can avoid the terminal state at the end of finishing the race. In fact, finishing the race means you stop collecting positive reward
00:24:28.580 | So you never want to finish, just collect the turbos
00:24:30.900 | And though that's a trivial example
00:24:34.680 | It's not actually easy to find such examples
00:24:38.360 | But they're out there of unintended consequences that can have highly negative detrimental effects when put in the real world
00:24:45.940 | We'll talk about a little bit of robotics
00:24:48.740 | When you put robots four-wheeled ones like autonomous vehicles into the real world
00:24:53.780 | And you have objective functions that have to navigate difficult intersections full of pedestrians
00:24:59.080 | So you have to form intent models of those pedestrians here. You see cars asserting themselves through dense intersections
00:25:06.280 | taking risks and
00:25:08.820 | Within those risks that are taken by us humans when we drive vehicles
00:25:14.260 | we have to then encode that ability to take subtle risk into
00:25:21.780 | AI-based control algorithms perception
00:25:23.960 | Then you have to think about at the end of the day. There's an objective function
00:25:29.540 | and if that objective function does not anticipate the green turbos that are to be collected, it can
00:25:36.020 | then result in some unintended consequences that could have
00:25:43.300 | negative effects, especially in situations that involve human life
00:25:47.300 | That's the field of AI safety, and some of the folks we talk about, DeepMind and OpenAI,
00:25:52.980 | that are doing incredible work in RL, also have groups that are working on AI safety, for a very good reason
00:26:00.180 | this is a problem that
00:26:02.900 | I believe that artificial intelligence will define some of the most impactful positive things
00:26:09.060 | In the 21st century, but I also believe we are nowhere close
00:26:13.700 | To solving some of the fundamental problems of AI safety that we also need to address as we develop those algorithms
00:26:20.440 | So okay examples of reinforcement learning systems
00:26:23.860 | All of it has to do with formulation of rewards formulation of state and actions. You have the traditional
00:26:31.320 | The often used benchmark of a cart
00:26:37.600 | Balancing a pole continuous. So the action is the horizontal force of the cart
00:26:42.000 | The goal is to balance the pole
00:26:44.160 | so it stays up on the moving cart, and the reward is one at each time step if the pole is upright
00:26:49.920 | And the state measured by the agent is the pole angle, the angular speed,
00:26:55.920 | and of course self-sensing of the cart position and the horizontal velocity
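For reference, a minimal interaction loop with this cart-pole benchmark, assuming the Gymnasium package is installed; a random policy stands in for a learned one.

```python
# Sketch: the cart-pole benchmark described above, using the Gymnasium package
# (assumes `pip install gymnasium`; a random policy stands in for a learned one).
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
# obs = [cart position, cart velocity, pole angle, pole angular velocity]
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()            # push direction: 0 = left, 1 = right
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward                        # +1 per time step the pole stays upright
    done = terminated or truncated
print("episode return:", total_reward)
env.close()
```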
00:27:01.460 | Another example here didn't want to include the video because it's really disturbing
00:27:07.360 | but I do want to include this slide because it's really important to think about is
00:27:11.440 | by sensing the the raw pixels learning and teaching an agent to
00:27:16.560 | Play a game of doom
00:27:19.520 | So the goal there is to eliminate all opponents
00:27:22.260 | The state is the raw game pixels the actions up down shoot reload and so on
00:27:31.680 | The positive reward is
00:27:34.000 | When an opponent is eliminated and negative when the agent is eliminated simple
00:27:38.800 | I added it here because again on the topic of AI safety
00:27:44.420 | We have to think about objective functions and how that translates into the world of not just autonomous vehicles but
00:27:56.720 | Things that even more directly have harm like autonomous weapon systems. We have a lecture on this in the AGI series
00:28:05.280 | The on the robotics platform the manipulate object manipulation grasping objects. There's a few benchmarks. There's a few interesting applications
00:28:13.140 | learning the problem of
00:28:16.080 | grabbing objects moving objects
00:28:18.320 | Manipulating objects, rotating and so on, especially when those objects have complicated shapes
00:28:25.680 | And so the goal is to pick up an object in the purely in the grasping object challenge
00:28:31.200 | The state is the visual information. So it's vision-based, the raw pixels of the objects
00:28:36.800 | The action is to move the arm grasp the object pick it up
00:28:40.160 | And obviously it's positive when the pickup is successful
00:28:44.340 | The reason i'm personally excited
00:28:47.120 | by this
00:28:48.960 | is because it will finally allow us to solve the problem of the claw, which has been
00:28:55.040 | Torturing me for many years
00:28:58.880 | I don't know. That's not at all why i'm excited but okay
00:29:02.080 | And then we have to think about as we get greater and greater degree of application in the real world with the robotics
00:29:08.180 | Like cars
00:29:11.520 | The main focus of my passion in terms of robotics is how do we encode some of the things that us humans encode?
00:29:17.920 | How do we you know?
00:29:19.840 | We have to think about our own objective function our own reward structure our own model of the environment about which we perceive and reason
00:29:27.200 | About in order to then encode machines that are doing the same and I believe autonomous driving is in that category
00:29:33.040 | We have to ask questions of ethics. We have to ask questions
00:29:38.080 | of risk value of human life value of efficiency money and so on all these are fundamental questions that an autonomous vehicle
00:29:45.600 | Unfortunately has to solve before it becomes fully autonomous
00:29:49.220 | So here are the key takeaways of
00:29:54.960 | the real world impact of reinforcement learning agents
00:29:57.680 | On the deep learning side
00:30:01.920 | Okay, these neural networks that form higher representation
00:30:04.400 | The fun part is the algorithms all the different architectures the different encoder decoder structures
00:30:10.020 | all the attention self-attention
00:30:13.140 | recurrence LSTMs GRUs all the fun architectures and the data and
00:30:19.680 | the ability to leverage different data sets in order to
00:30:25.180 | discriminate better than, uh,
00:30:27.180 | perform discriminative tasks better than, you know,
00:30:30.640 | MIT does better than Stanford, that kind of thing. That's the fun part
00:30:34.800 | The hard part is asking good questions and collecting huge amounts of data that's representative of the task
00:30:41.920 | That's for real world impact not CVPR publication real world impact
00:30:46.320 | A huge amount of data. On the deep reinforcement learning side, the key challenge:
00:30:53.040 | The fun part again is the algorithms. How do we learn from data some of the stuff i'll talk about today?
00:30:57.520 | The hard part is defining the environment, defining the action space and the reward structure
00:31:03.600 | As I mentioned, this is the big challenge, and the hardest part is how to crack the gap between simulation and the real world
00:31:10.640 | the leaping lizard
00:31:12.960 | That's the hardest part. We don't even know
00:31:14.960 | How to solve that transfer learning problem yet for the real world impact
00:31:18.880 | The three types of reinforcement learning
00:31:22.240 | There's countless algorithms and there's a lot of ways to taxonomize them, but at the highest level
00:31:30.960 | There's model-based and there's model-free
00:31:33.760 | model-based algorithms
00:31:37.020 | Learn the model of the world
00:31:39.280 | So as you interact with the world
00:31:41.280 | You construct your estimate of how you believe the dynamics of that world operates
00:31:47.940 | The nice thing about
00:31:51.680 | Doing that is once you have a model or an estimate of a model you're able to
00:31:56.480 | Anticipate you're able to plan into the future. You're able to
00:32:01.200 | use the model to
00:32:04.320 | In a branching way predict how your actions will change the world so you can plan far into the future
00:32:10.240 | This is the mechanism by which you you can you can do
00:32:13.360 | chess
00:32:15.520 | Uh in the simplest form because in chess, you don't even need to learn the model
00:32:18.880 | The model is learned is given to you chess go and so on
00:32:21.520 | The most important way in which they're different I think is the sample efficiency
00:32:26.500 | Is how many examples of data are needed to be able to successfully operate in the world?
00:32:32.000 | And so model-based methods because they're constructing a model if they can
00:32:36.480 | Are extremely sample efficient
00:32:39.360 | Because once you have a model you can do all kinds of reasoning that doesn't require
00:32:44.100 | experiencing every possibility of that model you can
00:32:48.720 | Unroll the model to to see how the world changes based on your actions
00:32:53.600 | Value-based methods are ones that look to estimate
00:32:58.880 | the quality of states, or the quality of taking a certain action in a certain state
00:33:06.320 | They're called off policy
00:33:08.320 | Versus the last category that's on policy. What does it mean to be off policy? It means that
00:33:16.080 | Value-based agents constantly update how good taking an action in a state is
00:33:22.800 | and they have this
00:33:26.000 | model of that goodness of
00:33:28.240 | Taking action in a state and they use that to pick the optimal action
00:33:32.000 | They don't directly learn a policy a strategy of how to act they learn how good it is
00:33:39.600 | to be in a state and use that goodness information to then
00:33:44.960 | pick the best one
00:33:46.960 | And then every once in a while flip a coin in order to explore
00:33:49.920 | And then policy-based methods are ones that directly learn a policy function
00:33:56.320 | so they take
00:33:58.640 | as input
00:34:00.640 | the world,
00:34:02.140 | a representation of that world via neural networks, and as output produce
00:34:04.720 | an action,
00:34:07.360 | where the action is stochastic
00:34:09.360 | So, okay, that's the range of model-based value-based and policy-based
00:34:14.820 | Here's an image from openai that I really like I encourage you to
00:34:19.700 | As we further explore here, to look up Spinning Up in Deep RL from OpenAI
00:34:26.120 | Here's an image that taxonomizes in the way that I described some of the recent developments
00:34:30.920 | in rl
00:34:32.660 | so at the very
00:34:34.420 | top the distinction between model free rl and
00:34:37.060 | model-based rl
00:34:40.180 | In model free rl, which is what we'll focus on today. There is a distinction between policy optimization
00:34:46.680 | So on-policy methods, and Q-learning, which is off-policy. Policy optimization methods directly optimize the policy,
00:34:56.340 | Directly learn the policy in some way
00:34:58.900 | and then
00:35:00.500 | Q-learning, off-policy methods, learn, like I mentioned, the value of taking a certain action in a state, and from that
00:35:07.540 | learned
00:35:10.260 | that learned q value be able to
00:35:12.980 | Choose how to act in the world
00:35:16.020 | So let's look at a few sample representative
00:35:19.880 | approaches in this space
00:35:23.120 | Let's start with the one
00:35:27.940 | Really was one of the first great breakthroughs
00:35:30.600 | from Google DeepMind on the deep RL side in solving Atari games: DQN,
00:35:35.320 | deep Q-networks
00:35:40.500 | And let's take a step back and think about what q learning is
00:35:43.380 | Q learning looks at the state action value function
00:35:50.260 | That estimates based on a particular policy or based on an optimal policy. How good is it to take an action?
00:35:56.900 | in this state
00:36:01.360 | estimated
00:36:02.720 | Reward if I take an action in this state and continue operating under an optimal policy
00:36:09.300 | It gives you directly a way to say
00:36:11.620 | Amongst all the actions I have which action should I take to maximize the reward?
00:36:16.660 | Now in the beginning you know nothing, you don't have this value estimation
00:36:21.800 | You don't have this q function
00:36:24.660 | So you have to learn it, and you learn it with the Bellman equation by updating it
00:36:28.980 | You take your current estimate and update it with the reward you see
00:36:32.180 | Received after you take an action
00:36:37.700 | It's off policy and model free. You don't have to have any estimate or knowledge of the world
00:36:43.380 | You don't have to have any policy whatsoever. All you're doing is
00:36:47.460 | Roaming about the world collecting data: when you took a certain action, here's the reward you received, and you're updating
00:36:53.800 | gradually this table
00:36:56.800 | Where
00:36:59.620 | the table has state states on the y-axis and
00:37:03.860 | actions on
00:37:07.140 | the x-axis
00:37:09.140 | and the key part there is
00:37:11.540 | Because you always have an estimate of
00:37:14.820 | the value of taking each action, you can always take the optimal one
00:37:20.740 | But because you know very little in the beginning, that optimal,
00:37:25.140 | you have no way of knowing whether it's good or not
00:37:28.020 | So there's some degree of exploration the fundamental aspect of value-based methods or any RL methods
00:37:34.180 | Like I said, it's trial and error is exploration
00:37:36.920 | So for value-based methods like Q learning the way that's done is with a flip of a coin epsilon greedy
00:37:44.340 | With a flip of a coin you can choose to just take a random action
00:37:49.140 | and you
00:37:52.020 | Slowly decrease epsilon to zero as your agent learns more and more and more
00:37:57.940 | So in the beginning you explore a lot, an epsilon of one; in the end, an epsilon of zero, when you're just acting greedily based on
00:38:05.540 | Your understanding of the world as represented by the Q value function
00:38:09.300 | For non-neural-network approaches, this is simply a table, the Q,
00:38:14.420 | this Q function, is a table. Like I said, on the y-axis, states,
00:38:19.300 | on the x-axis, actions, and in each cell you have a reward,
00:38:26.500 | a discounted reward you estimate to be received there
00:38:29.060 | And as you walk around, with this Bellman equation you can update that table
00:38:34.740 | It's a table nevertheless number of states times number of actions
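A compact sketch of the tabular version just described: a Q-table updated with the Bellman update and epsilon-greedy exploration. The environment interface follows the earlier loop sketch, and alpha, gamma, and the epsilon schedule are illustrative hyperparameters.

```python
# Sketch: tabular Q-learning with epsilon-greedy exploration, as described above.
# `env` is assumed to expose reset()/step() as in the earlier loop sketch;
# alpha, gamma, and the epsilon schedule are illustrative choices.
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.99):
    Q = defaultdict(float)                     # Q[(state, action)], implicitly 0 at the start
    epsilon = 1.0                              # explore a lot in the beginning
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            if random.random() < epsilon:      # flip of a coin: take a random action
                action = random.choice(actions)
            else:                              # otherwise act greedily on current estimates
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Bellman update: move the estimate toward reward + discounted best next value
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
        epsilon = max(0.05, epsilon * 0.995)   # slowly decay exploration toward greedy
    return Q
```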
00:38:38.580 | Now if you look at any practical real-world problem
00:38:41.460 | And an arcade game with raw sensory input is a very crude first step towards the real world so raw sensory information
00:38:52.660 | This kind of value iteration and updating a table is impractical
00:38:57.160 | Because here, for a game of Breakout, if you look at four consecutive frames of a game of Breakout,
00:39:03.080 | the size of the raw sensory input is 84 by 84 pixels
00:39:10.500 | grayscale
00:39:12.660 | Every pixel has 256 values
00:39:14.980 | that's 256 to the power of
00:39:21.700 | Whatever 84 times 84 times 4 is
00:39:24.180 | Whatever it is
00:39:27.140 | It's significantly larger than the number of atoms in the universe
00:39:29.720 | So the size of this Q table if we use the traditional approach is intractable
00:39:35.640 | Neural networks to the rescue
00:39:41.700 | Deep RL is RL plus neural networks, where the neural network is tasked, in value-based
00:39:49.860 | methods, with taking this Q table and learning a compressed representation of it
00:39:53.860 | Learning an approximator for the function
00:39:57.540 | from state and action to the value
00:40:00.660 | That's what we previously talked about, the powerful ability of neural networks to form
00:40:07.700 | representations from
00:40:10.080 | extremely high dimensional complex raw sensory information
00:40:14.740 | So it's simple the framework remains for the most part the same in reinforcement learning. It's just that this Q function
00:40:21.300 | For value-based methods becomes a neural network and becomes an approximator
00:40:27.240 | where the hope is as you navigate the world and you pick up new knowledge through
00:40:32.900 | The back propagating the gradient and the loss function that you're able to form a good representation of the optimal Q function
00:40:42.340 | So use neural networks with neural networks are good at which is function approximators
00:40:45.960 | And that's DQN. The deep Q-network was used to achieve the initial,
00:40:51.780 | incredibly nice results on the arcade games, where the input is the raw sensory pixels, with a few convolutional layers, fully connected layers,
00:41:00.740 | and the output is a set of actions,
00:41:03.540 | you know,
00:41:06.640 | a value for each action, and then you choose the best action
00:41:10.500 | And so this simple agent with a neural network that estimates that Q function
00:41:14.260 | Very simple network is able to achieve
00:41:18.100 | A superhuman performance on many of these arcade games that excited the world because it's taking raw sensory information
00:41:25.400 | with a pretty simple network
00:41:28.020 | That doesn't in the beginning understand any of the physics of the world any of the dynamics of the environment and through that intractable
00:41:34.920 | space, the intractable
00:41:39.300 | state space, is able to learn how to actually do pretty well
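For concreteness, here is a sketch in PyTorch of the kind of convolutional Q-network used over the 84x84x4 stacked Atari frames. Layer sizes follow the widely published DQN architecture and are given as a reference point, not as the exact network used in the course demos.

```python
# Sketch (PyTorch): a convolutional Q-network over 84x84x4 stacked grayscale frames.
# Layer sizes follow the widely published DQN architecture; they are illustrative here.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, num_actions):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # 84x84 -> 20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # 20x20 -> 9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # 9x9   -> 7x7
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),          # one Q-value per action, not probabilities
        )

    def forward(self, frames):                    # frames: (batch, 4, 84, 84), scaled to [0, 1]
        return self.head(self.features(frames))

q_net = QNetwork(num_actions=4)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)     # torch.Size([1, 4])
```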
00:41:42.900 | The loss function for DQN has two
00:41:48.980 | Q functions
00:41:51.860 | One is the expected
00:41:54.200 | The predicted
00:41:57.940 | Q value of taking an action in a particular state
00:42:00.580 | and the other is
00:42:04.420 | Target against which the loss function is calculated. Which is what is the
00:42:09.380 | value that you got once you actually
00:42:12.660 | Taken that action
00:42:15.780 | and once you've taken that action the way you calculate the value is by looking at the next step and choosing the
00:42:20.900 | Maximum choosing if you take the best action in the next state
00:42:25.620 | What is going to be the Q function? So there's two estimators going on in terms of neural networks
00:42:31.620 | There's two forward passes here. There's two Q's in this equation
00:42:34.900 | so in traditional DQN, that's just
00:42:38.660 | That's done by a single neural network
00:42:41.780 | with a few tricks; in double DQN, that's done by two neural networks
00:42:46.200 | And I mention tricks because with this, and with most of RL, tricks tell a lot of the story
00:42:55.940 | A lot of what makes systems work is the details
00:43:00.100 | in games and robotic systems in these cases
00:43:04.180 | The two biggest tricks for DQN that will reappear in a lot of value-based methods is experience replay
00:43:12.180 | So think of an agent that plays through these games
00:43:16.580 | as also collecting memories
00:43:19.220 | You collect this
00:43:21.700 | Bank of memories that can then be replayed
00:43:25.380 | The power of that one of the central elements of what makes value-based methods attractive
00:43:30.980 | is that
00:43:33.300 | Because you're not directly estimating the policy but are learning the quality of taking an action in a particular state
00:43:41.380 | You're able to then jump around through your memory and play
00:43:45.460 | different aspects of that memory
00:43:48.100 | So learn, train the network through the historical data and then the other trick
00:43:54.980 | is simple. Like I said,
00:43:58.100 | the loss function has two Qs
00:44:00.740 | So it's a dragon chasing its own tail
00:44:05.140 | It's easy for the loss function to become unstable. So the training does not converge
00:44:10.580 | So the trick of fixing a target network is taking one of the Qs
00:44:15.060 | and only updating it every x steps, every thousand steps and so on; taking the same kind of network and just fixing it
00:44:22.180 | So for the target network that defines the loss function, just keeping it fixed and only updating it periodically
00:44:27.800 | So you're chasing a fixed target with a loss function as opposed to a dynamic one
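A sketch of the loss computation with the two tricks just described: a minibatch sampled from replay memory and a target computed with a separate, periodically copied target network. The batch layout and gamma are illustrative assumptions.

```python
# Sketch (PyTorch): the DQN loss with experience replay and a fixed target network.
# `q_net` and `target_net` are two copies of the Q-network (e.g. the QNetwork sketch above);
# the batch layout and gamma are illustrative assumptions.
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch   # tensors sampled from replay memory
    # Q(s, a) predicted by the online network for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                                   # the target side is held fixed
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * max_next_q * (1.0 - dones)
    return F.smooth_l1_loss(q_sa, target)                   # Huber loss, commonly used for stability

# Every N gradient steps, refresh the fixed target:
#   target_net.load_state_dict(q_net.state_dict())
```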
00:44:33.780 | So you can solve a lot of the Atari games
00:44:37.300 | With minimal effort come up with some creative solutions here
00:44:41.220 | Breakout here: after 10 minutes of training on the left, after two hours of training on the right
00:44:46.900 | It's coming up with some creative solutions again. It's pretty cool because this is raw pixels, right? We're now like
00:44:53.780 | There's been a few years since this
00:44:56.740 | Breakthrough
00:44:58.740 | So we kind of take it for granted, but I am still, for the most part,
00:45:03.140 | Captivated by just how beautiful it is that from raw sensory information
00:45:08.360 | neural networks are able to
00:45:11.460 | learn
00:45:12.340 | to act in a way that actually supersedes humans in terms of creativity in terms of
00:45:16.420 | In terms of actual raw performance. It's really exciting and games of simple form is the cleanest way to demonstrate that
00:45:26.660 | The same kind of DQN network is able to achieve superhuman performance on a bunch of different games
00:45:31.700 | There are improvements to this, like dueling DQN. Again,
00:45:36.100 | The Q function can be decomposed which is useful into the value estimate
00:45:41.140 | Of being in that state and what's called
00:45:44.020 | And in future slides will be called advantage
00:45:47.000 | So the advantage of taking action in that state the nice thing of the advantage
00:45:51.960 | as a measure is that
00:45:54.900 | It's a measure of the action quality relative to the average
00:46:00.020 | action that could be taken there. So what's very useful about advantage versus sort of raw reward
00:46:06.900 | is that if all the actions you have to take are pretty good,
00:46:10.100 | you want to know, well, how much better this one is
00:46:12.980 | in terms of optimization
00:46:15.620 | That's a better measure for choosing actions
00:46:18.740 | in a value-based sense
00:46:21.540 | So when you have these two estimates you have these two streams for neural network and a dueling DQN
00:46:28.000 | DDQN where one estimates the value the other the advantage
00:46:32.900 | And again, that dueling nature is useful
00:46:39.120 | also because there are many states in which the action,
00:46:43.360 | the quality of the actions, is decoupled from the state. So in many states it doesn't matter
00:46:50.720 | which action you take
00:46:54.000 | So you don't need to learn all the different complexities
00:46:57.300 | all the topology of different actions when you're in a particular state
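A sketch of the dueling decomposition: one stream estimates the state value V(s), the other the advantage A(s, a), and they are recombined into Q(s, a) with the mean advantage subtracted, as in the dueling-network formulation. Feature and hidden sizes are illustrative.

```python
# Sketch (PyTorch): a dueling head that splits into value and advantage streams,
# then recombines them as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
# Feature and hidden sizes here are illustrative.
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, feature_dim, num_actions):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                                   nn.Linear(128, 1))                # V(s): one number per state
        self.advantage = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(),
                                       nn.Linear(128, num_actions))  # A(s, a): one per action

    def forward(self, features):
        v = self.value(features)
        a = self.advantage(features)
        return v + a - a.mean(dim=1, keepdim=True)   # subtract mean advantage for identifiability

head = DuelingHead(feature_dim=3136, num_actions=4)
print(head(torch.zeros(1, 3136)).shape)              # torch.Size([1, 4])
```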
00:47:01.920 | And another one
00:47:05.280 | is prioritized experience replay. Like I said, experience replay is really key to these algorithms,
00:47:10.660 | and the thing that sinks some of the policy optimization methods
00:47:15.040 | Experience replay is collecting different memories
00:47:18.500 | But if you just sample randomly in those memories
00:47:23.520 | you're now affected,
00:47:25.520 | the sampled experiences are
00:47:27.680 | really affected by the frequency with which those experiences occurred, not their importance
00:47:32.720 | So prioritized experience replay assigns a priority,
00:47:36.500 | a value
00:47:38.320 | based on the
00:47:40.320 | magnitude of the
00:47:42.540 | temporal difference (TD) error
00:47:44.640 | So the stuff you have learned the most from is given a higher priority, and therefore you get to see,
00:47:52.480 | through the experience replay process that
00:47:54.800 | That particular experience more often
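A simplified sketch of a replay buffer with priorities proportional to the magnitude of the TD error. The original method also uses a sum-tree and importance-sampling corrections; those are omitted here for brevity.

```python
# Sketch: a simplified prioritized replay buffer. Priorities are |TD error| + eps and
# sampling is proportional to priority. The full method also uses a sum-tree and
# importance-sampling weights; those details are omitted here.
import random

class PrioritizedReplay:
    def __init__(self, capacity=100_000, eps=1e-3):
        self.capacity, self.eps = capacity, eps
        self.memory, self.priorities = [], []

    def add(self, transition, td_error):
        if len(self.memory) >= self.capacity:        # drop the oldest experience
            self.memory.pop(0)
            self.priorities.pop(0)
        self.memory.append(transition)
        self.priorities.append(abs(td_error) + self.eps)

    def sample(self, batch_size):
        # experiences we learned the most from (large TD error) get replayed more often
        idx = random.choices(range(len(self.memory)), weights=self.priorities, k=batch_size)
        return [self.memory[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps
```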
00:47:58.320 | Okay, moving on to policy gradients. This is on-policy, versus Q-learning, which is off-policy
00:48:07.760 | Policy gradient
00:48:13.040 | Directly optimizing the policy where the input is the raw pixels
00:48:16.640 | And the policy network
00:48:19.360 | represents
00:48:23.520 | forms representations of that environment space, and its output produces a stochastic estimate,
00:48:29.280 | a probability of the different actions. Here in Pong, the input is the pixels, and there's
00:48:33.460 | A single output that produces a probability of moving the paddle up
00:48:38.800 | So how do policy gradients vanilla policy gradient very basic works
00:48:43.120 | Is you unroll the environment you play through the environment
00:48:49.600 | Here pong moving the paddle up and down and so on collecting no rewards
00:48:54.100 | And only collecting reward at the very end
00:48:57.840 | Based on whether you win or lose
00:49:01.760 | Every single action you're taking along the way gets either punished or rewarded based on whether it led to victory or defeat
00:49:07.680 | This also is remarkable that this works at all
00:49:13.120 | because the credit assignment there is,
00:49:17.680 | I mean, every single thing you did along the way
00:49:20.080 | is averaged out
00:49:23.520 | It's, like, muddied. It's the reason that policy gradient methods are less efficient, but it's still very surprising that it works at all
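A sketch of this vanilla policy gradient idea: roll out an episode, then push up the log-probability of every action in proportion to the discounted return that followed it. The tiny policy network over a 4-dimensional observation stands in for the pixel-based Pong network, and the reset()/step() environment interface follows the earlier loop sketch.

```python
# Sketch (PyTorch): vanilla policy gradient (REINFORCE). Every action in the episode
# is reinforced by the discounted return that followed it, the crude credit assignment
# described above. `env` is assumed to expose reset()/step() as in the earlier sketches.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))   # logits over 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def train_episode(env, gamma=0.99):
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        obs = torch.as_tensor(state, dtype=torch.float32)
        dist = torch.distributions.Categorical(logits=policy(obs))
        action = dist.sample()                        # stochastic policy: sample an action
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())
        rewards.append(reward)
    returns, g = [], 0.0
    for r in reversed(rewards):                       # discounted return from each step onward
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.as_tensor(returns)
    loss = -(torch.stack(log_probs) * returns).sum()  # gradient ascent on expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```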
00:49:30.800 | So the pros versus DQN the value-based methods
00:49:35.920 | Is that if the world is so messy that you can't learn a Q function the nice thing about policy gradient because it's learning
00:49:42.080 | The policy directly that it will at least learn a pretty good policy
00:49:46.480 | Usually in many cases faster convergence. It's able to deal with stochastic policies
00:49:51.120 | So value-based methods cannot learn stochastic policies and it's much more naturally able to deal with continuous actions
00:49:57.920 | The cons is it's inefficient
00:50:01.380 | versus DQN
00:50:05.360 | It can become highly unstable as we'll talk about some solutions to this during the training process and the credit assignment
00:50:13.040 | So if we look at the chain of actions that lead to a positive reward
00:50:17.600 | Some might be awesome actions. Some might be good actions
00:50:21.520 | Some might be terrible actions, but that doesn't matter as long as the destination was good
00:50:26.400 | And that's then every single action along the way gets a positive reinforcement
00:50:31.060 | That's the downside and there's now improvements to that advantage actor critic methods A2C combining the best of
00:50:42.880 | Value-based methods and policy
00:50:45.120 | based methods
00:50:48.320 | So having an actor
00:50:50.560 | two networks an actor which is
00:50:53.440 | Policy-based and that's the one that takes the actions
00:50:56.880 | Samples the actions from the policy network and the critic
00:51:00.880 | That measures how good those actions are
00:51:03.520 | And the critic is value-based
00:51:06.640 | All right. So as opposed to in the policy update the first equation there the reward coming from the destination
00:51:13.300 | the reward being from whether you won the game or not,
00:51:17.200 | Every single step along the way
00:51:19.920 | You now learn a Q value function
00:51:22.800 | QSA state and action
00:51:25.740 | using the critic network
00:51:28.480 | So you're able to now learn about the environment about evaluating your own actions at every step
00:51:35.280 | So you're much more sample efficient
00:51:37.280 | There's asynchronous
00:51:39.280 | from DeepMind and synchronous from OpenAI variants of this advantage actor-critic framework
00:51:47.040 | But both are highly parallelizable the difference with
00:51:50.800 | A3C the
00:51:55.280 | asynchronous one
00:51:57.200 | Is that every single agent so you just throw these agents operating in the environment and they're learning they're rolling out the games and getting the reward
00:52:05.840 | They're updating the original network asynchronously the global network parameters asynchronously
00:52:12.420 | And as a result, they're also operating constantly on outdated versions of that network
00:52:18.960 | The OpenAI approach that fixes this is that there's a coordinator, and there are these rounds where everybody,
00:52:26.320 | all the agents in parallel, are rolling out the episode
00:52:30.240 | but then the coordinator waits for everybody to finish in order to make the update to the global network and then
00:52:36.080 | distributes all the same parameters
00:52:38.620 | To all the agents and so that means that every iteration starts with the same global parameters and that has really nice properties
00:52:46.880 | in terms of convergence
00:52:49.520 | and stability of the training process
00:52:53.200 | From Google DeepMind, the deep deterministic policy gradient
00:52:57.520 | Is combining the ideas of dqn but dealing with continuous action spaces
00:53:05.360 | taking a policy network,
00:53:07.360 | the actor-critic framework,
00:53:10.640 | but instead of picking a stochastic policy,
00:53:15.040 | having the actor operate in a stochastic nature, it's picking a deterministic policy
00:53:21.440 | So it's always choosing the best action
00:53:25.760 | But okay with that the problem quite naturally
00:53:29.060 | Is that when the policy is now deterministic it's able to do a continuous action space, but because it's deterministic it's never exploring
00:53:36.580 | So the way we inject exploration into the system is by adding noise
00:53:40.960 | either adding noise into the action space on the output or adding noise into the parameters of the network that
00:53:47.200 | then
00:53:50.720 | create perturbations in the actions, such that the final result is that you try different kinds of things. And
00:53:57.120 | The scale of the noise just like with the epsilon greedy in the exploration for dqn
00:54:01.680 | The scale of the noise decreases as you learn more and more
00:54:04.240 | so on the policy optimization side from
00:54:07.760 | Openai and others
00:54:10.880 | We'll do a lecture just on this there's been a lot of exciting work here
00:54:16.800 | the basic idea of
00:54:19.980 | optimization on policy optimization with ppo and trpo
00:54:25.760 | First of all, we want to formulate
00:54:27.760 | A reinforcement learning as purely an optimization problem
00:54:36.640 | second of all
00:54:38.080 | if policy optimization
00:54:40.080 | the actions you take
00:54:42.320 | influence the rest of the optimization process, you have to be very careful about the actions you take. In particular,
00:54:50.660 | You have to
00:54:52.020 | Avoid taking really bad actions
00:54:54.020 | where your convergence, the training performance in general,
00:54:58.260 | collapses
00:55:00.580 | So, how do we do that?
00:55:02.100 | There's the line search methods, which is where gradient descent or gradient ascent falls under which
00:55:07.700 | which is how we train deep neural networks,
00:55:11.460 | is you first
00:55:14.020 | Pick a direction of the gradient and then pick the step size
00:55:19.620 | The problem with that is that can get you into trouble here. There's a nice visualization walking along a ridge
00:55:26.420 | Is it can it can result in you stepping off that ridge again the collapsing of the training process
00:55:33.620 | the performance. The trust region is the underlying idea here for the
00:55:39.380 | policy optimization methods: they first pick the step size, so they constrain in various kinds of ways the magnitude
00:55:48.100 | of the difference to the weights that's applied, and then the direction,
00:55:55.140 | placing a much higher priority on not choosing bad actions that can throw you off the optimization path than on the trajectory
00:56:01.220 | we should take to that path
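A sketch of the PPO clipped surrogate objective, which implements this "don't step too far" idea by clipping the ratio of new to old action probabilities to a small interval around one. The epsilon value is a commonly cited default, used here for illustration.

```python
# Sketch (PyTorch): the PPO clipped surrogate objective. The ratio of new to old action
# probabilities is clipped to [1 - eps, 1 + eps], limiting how far a single update can
# move the policy. eps = 0.2 is a commonly cited value, used here for illustration.
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)               # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()                    # pessimistic (lower) bound

# Example with dummy tensors:
loss = ppo_clip_loss(torch.tensor([0.1, -0.2]), torch.tensor([0.0, 0.0]),
                     torch.tensor([1.0, -1.0]))
print(loss)
```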
00:56:03.060 | And finally,
00:56:04.740 | on the model-based methods, we'll also talk about them on the robotics side. There are a lot of
00:56:09.380 | interesting
00:56:11.120 | approaches now where deep learning is starting to be used for
00:56:15.140 | model-based methods when the model has to be learned.
00:56:17.620 | But of course, when the model doesn't have to be learned, when it's given, inherent to the game,
00:56:22.100 | when you know the model, like in Go and chess and so on, AlphaZero has really done incredible stuff.
00:56:30.660 | So what is the model here? The way that a lot of these games are approached:
00:56:36.340 | take the game of Go; it's turn-based, one person goes and another person goes, and there's this game tree.
00:56:42.260 | At every point there's a set of actions that can be taken, and if you look at that game tree,
00:56:46.900 | it grows exponentially,
00:56:50.020 | so it becomes huge. The game of Go is the largest of all, because the number of choices you have
00:56:55.620 | is the largest; then there's chess,
00:56:58.340 | and then it gets to checkers and then tic-tac-toe; the branching factor at every step
00:57:04.580 | increases or decreases based on the game structure.
00:57:08.180 | And so the task for the neural network there is to learn the quality of the board; it's to learn
00:57:14.100 | which boards,
00:57:16.820 | which game positions, are
00:57:19.540 | most useful to explore
00:57:24.100 | and most likely to result in a highly successful state.
00:57:28.500 | So that choice of what's good to explore, which branch is good to go down,
00:57:33.940 | is where we can have neural networks step in.
00:57:36.980 | With AlphaGo, the first success that beat the world champion, it was pre-trained on expert games;
00:57:44.580 | then with AlphaGo Zero
00:57:47.140 | there was
00:57:50.580 | no pre-training on expert games, so no imitation learning; it's purely through self-play, through playing itself into
00:57:58.900 | new board positions. Many of these systems use Monte Carlo tree search, and during this search they're
00:58:05.120 | balancing exploitation and exploration: going deep on promising positions based on the estimates of the neural network, or,
00:58:11.120 | with a flip of a coin, playing under-explored positions.
00:58:15.380 | You could think of this as an intuition for looking at a board and estimating how good that board is,
00:58:24.320 | and also estimating how likely that board is to lead to victory in the end,
00:58:31.520 | so estimating both general quality and the probability of leading to victory.
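A minimal sketch of a PUCT-style selection rule in the spirit of these systems (not the exact AlphaGo Zero code); the `Node` fields for visit counts, values, and network priors are assumptions:

```python
import math

def select_child(node, c_puct=1.5):
    """node.children maps a move to a child node; each child is assumed to
    carry visit_count, total_value, and a prior from the policy network."""
    total_visits = sum(child.visit_count for child in node.children.values())
    best_move, best_score = None, -float("inf")
    for move, child in node.children.items():
        q = child.total_value / child.visit_count if child.visit_count else 0.0
        u = c_puct * child.prior * math.sqrt(total_visits + 1) / (1 + child.visit_count)
        if q + u > best_score:                     # exploitation (q) plus exploration bonus (u)
            best_move, best_score = move, q + u
    return best_move, node.children[best_move]
```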
00:58:37.120 | The next step forward is AlphaZero, using a similar architecture with MCTS,
00:58:43.380 | Monte Carlo tree search, but applying it to different games,
00:58:47.360 | applying it and competing against
00:58:50.560 | state-of-the-art engines in Go, in shogi, and in chess,
00:58:55.760 | and outperforming them with very few
00:58:58.640 | training steps.
00:59:00.480 | So these are
00:59:02.480 | model-based approaches, which are really simple and efficient if you can construct such a model,
00:59:08.800 | and in robotics, if you can learn such a model,
00:59:12.160 | they can be exceptionally powerful; here
00:59:15.760 | they're beating the
00:59:19.660 | engines which are already far superior to humans: Stockfish can destroy most humans
00:59:24.320 | on earth at the game of chess.
00:59:27.360 | The ability, through learning, through estimating the quality of a board, to defeat these engines is incredible, and
00:59:34.400 | the exciting aspect here, versus engines that don't use neural networks, is the
00:59:42.160 | number of positions explored:
00:59:44.320 | based on the neural network you explore certain positions,
00:59:50.100 | you explore certain parts of the tree.
00:59:53.840 | And if you look at grandmasters,
00:59:57.440 | human players
00:59:59.440 | in chess, they seem to explore very few moves;
01:00:03.120 | they have a really good neural network for estimating which branches are likely to provide value to explore.
01:00:11.120 | And on the other side,
01:00:14.080 | Stockfish and so on are much more brute-force in their search,
01:00:19.620 | and AlphaZero is a step towards the grandmaster,
01:00:23.600 | because the number of branches that need to be explored is much, much smaller.
01:00:26.480 | A lot of the work is done in the representation formed by the neural network, which is super exciting.
01:00:32.480 | And then it's able to outperform
01:00:34.960 | Stockfish in chess, it's able to outperform Elmo in shogi, and
01:00:40.080 | it outperforms itself in Go,
01:00:43.600 | the previous iterations, AlphaGo Zero and so on.
01:00:47.280 | Now, the challenge here,
01:00:53.440 | the sobering truth, is that the majority of real-world applications
01:00:57.220 | of agents that have to perceive the world and act in this world
01:01:01.680 | are, for the most part,
01:01:04.000 | not RL-based; they have no RL involved.
01:01:06.480 | So the action is not learned:
01:01:09.680 | you use neural networks to perceive certain aspects of the world, but ultimately
01:01:13.940 | the action
01:01:16.720 | is not learned from data.
01:01:19.520 | That's true for most, or all, of the autonomous vehicle companies operating today,
01:01:25.680 | and it's true for
01:01:28.320 | robotic manipulation,
01:01:30.220 | industrial robotics, and any of the humanoid robots that have to navigate this world under uncertain conditions.
01:01:35.780 | All the work from Boston Dynamics doesn't involve any machine learning, as far as we know.
01:01:40.080 | Now that's beginning to change, here with ANYmal, the recent development
01:01:49.360 | where certain aspects of the control
01:01:52.640 | of robotics are being learned:
01:01:55.280 | you're trying to learn more efficient movement, you're trying to learn more robust movement, on top of the other controllers.
01:02:01.940 | So it's quite exciting, through RL, to be able to learn some of the control dynamics here that are able to teach
01:02:09.040 | this particular robot to get up from arbitrary positions,
01:02:13.780 | so there's less hard-coding in order to deal with
01:02:19.740 | unexpected initial conditions and unexpected perturbations.
01:02:22.820 | So it's exciting there in terms of learning the control dynamics, and also some of the driving policy,
01:02:29.280 | so making driving behavior decisions, changing lanes, turning, and so on. If you
01:02:36.400 | were here last week, you heard from Waymo:
01:02:38.960 | they're starting to use some RL in terms of the driving policy, especially in order to predict the future.
01:02:44.560 | They're trying to anticipate, with intent modeling, where the pedestrians and the cars are going to be in the environment;
01:02:50.000 | they're trying to unroll what's happened recently into the future, and beginning to
01:02:54.720 | move beyond the sort of pure end-to-end learning approach, like NVIDIA's end-to-end control decisions,
01:03:02.480 | actually moving to RL and making long-term planning decisions.
01:03:06.660 | But again, the challenge is
01:03:12.800 | the gap, the leap needed to go from simulation to the real world.
01:03:18.080 | Most of the work is in the design of the environment and the design of the reward structure,
01:03:22.480 | and because most of that work now is in simulation, we need to either develop better algorithms for transfer learning
01:03:29.040 | or close the distance between simulation and the real world.
01:03:33.520 | Also, we could think outside the box a little bit.
01:03:38.240 | I had a conversation with Pieter Abbeel recently, one of the leading researchers in deep RL;
01:03:42.640 | he kind of, on the side, quickly mentioned
01:03:46.240 | the idea that we don't need to make simulation more realistic.
01:03:52.420 | What we could do is just create an infinite number of simulations,
01:03:58.180 | or a very large number of simulations,
01:04:03.360 | and naturally the regularization aspect of having all those simulations
01:04:07.860 | will make it so that our reality is just another sample from those simulations.
01:04:12.900 | And so maybe the solution isn't to create higher-fidelity simulation or to create transfer learning algorithms;
01:04:19.140 | maybe it's to build an
01:04:24.240 | arbitrary number of simulations,
01:04:26.700 | so that the step towards creating an agent that works in the real world is a trivial one.
01:04:33.680 | And maybe that's exactly what whoever created the simulation we're living in, and the multiverse we're living in, did.
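A minimal sketch of this "many simulations" idea as domain randomization; the physics parameters, their ranges, and the `make_sim` / `train_one_episode` helpers are hypothetical placeholders:

```python
import random

def sample_sim_params():
    """Sample a new 'world' for each episode; ranges here are arbitrary."""
    return {
        "friction":     random.uniform(0.5, 1.5),
        "mass_scale":   random.uniform(0.8, 1.2),
        "latency_ms":   random.uniform(0.0, 40.0),
        "sensor_noise": random.uniform(0.0, 0.05),
    }

def train_with_randomization(policy, make_sim, train_one_episode, episodes=10_000):
    for _ in range(episodes):
        sim = make_sim(**sample_sim_params())      # a fresh simulated world every episode
        train_one_episode(policy, sim)             # any RL algorithm can run inside
    return policy
```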
01:04:40.640 | Next steps
01:04:45.200 | The lecture videos, and we have several on RL, will all be made available on deeplearning.mit.edu.
01:04:50.580 | We'll have several tutorials on RL on GitHub;
01:04:54.400 | the link is there. And I really like the essay
01:04:58.800 | from OpenAI on spinning up as a deep RL researcher, if you're interested in getting into research in RL:
01:05:05.360 | what are the steps you need to take, from developing the mathematical background, probability, statistics, and multivariate calculus,
01:05:13.300 | to some of the basics covered last week on deep learning, some of the basic ideas in RL,
01:05:19.040 | just terminology and so on, some basic concepts, then picking a framework, TensorFlow or PyTorch,
01:05:27.280 | and learning by doing.
01:05:29.040 | Implement the algorithms I mentioned today; those are the core RL algorithms.
01:05:33.540 | Implement all of them from scratch:
01:05:36.080 | it should only take about 200 to 300 lines of code each. When you put them down on paper, they're actually quite simple,
01:05:42.720 | intuitive algorithms.
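As a taste of how compact these algorithms are, here is a minimal sketch of the core REINFORCE update (just the loss on one collected episode, not a full training script); the episode's log-probabilities and rewards are assumed to have been collected with some policy network:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """log_probs: list of log pi(a_t|s_t) tensors for one episode;
    rewards: list of floats collected at each step."""
    returns, g = [], 0.0
    for r in reversed(rewards):                    # discounted return-to-go
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)   # simple variance reduction
    return -torch.sum(torch.stack(log_probs) * returns)             # ascend the expected return
```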
01:05:44.960 | And then
01:05:46.720 | read the papers about those algorithms that followed, looking not for the headline performance,
01:05:54.000 | the hand-waving performance, but for the tricks that were used to train these algorithms.
01:05:58.000 | The tricks tell a lot of the story, and those are the useful parts you need to learn.
01:06:03.280 | And iterate fast on simple benchmark environments.
01:06:07.200 | OpenAI Gym has provided a lot of easy-to-use environments that you can play with,
01:06:12.480 | where you can train an agent in minutes or hours, as opposed to days and weeks.
01:06:17.200 | Iterating fast is the best way to learn these algorithms.
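A minimal interaction loop with a Gym environment as a sanity check before plugging in a learned policy, assuming the classic `gym` API where `reset()` returns an observation and `step()` returns `(observation, reward, done, info)`:

```python
import gym

env = gym.make("CartPole-v1")
obs = env.reset()
done, episode_reward = False, 0.0
while not done:
    action = env.action_space.sample()             # random agent; swap in policy(obs) later
    obs, reward, done, info = env.step(action)
    episode_reward += reward
print("episode reward:", episode_reward)
env.close()
```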
01:06:22.480 | And then on the research side, there are three ways to get a best paper award, right,
01:06:25.440 | to publish, to contribute, and to have an impact in the research community
01:06:31.120 | in RL. One is to improve an existing approach on a particular benchmark; there are a few
01:06:37.360 | benchmark datasets and environments that are emerging, so you want to improve an existing approach in some aspect of convergence or performance.
01:06:44.900 | Two, you can
01:06:46.400 | focus on an unsolved task; there are certain games that just haven't been solved through the RL
01:06:52.560 | formulation. Or three, you can come up with a totally new
01:06:55.600 | problem that hasn't been addressed by RL before.
01:06:59.200 | So with that, I'd like to thank you very much. Tomorrow I hope to see you here for DeepTraffic. Thanks.