MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL)

Today I'd like to overview the exciting field of deep reinforcement learning Introduce, overview and provide you some of the basics I think it's one of the most exciting fields in artificial intelligence It's marrying the power and the ability of deep neural networks to represent and comprehend the world with the ability to act on that understanding on that representation Taking as a whole, that's really what the creation of intelligent beings is Understand the world and act And the exciting breakthroughs that recently have happened Captivate our imagination about what's possible And that's why this is my favorite area of deep learning and artificial intelligence in general And I hope you feel the same So what is deep reinforcement learning?

We've talked about deep learning which is taking samples of data Being able to in a supervised way compress, encode the representation of that data in a way that you can reason about it And we take that power and apply it to the world where sequential decisions are to be made So it's looking at problems and formulations of tasks Where an agent, an intelligent system has to make a sequence of decisions And the decisions that are made Have an effect on the world around the agent How?

How do all of us? Any intelligent being that is tasked with operating in the world. How do they learn anything? Especially when you know very little in the beginning It's trial and error is the fundamental process by which reinforcement learning agents learn And the deep part of deep reinforcement learning is neural networks It's using the frameworks and reinforcement learning Where the neural network is doing the representation Of the world based on which the actions are made And we have to take a step back When we look at the types of learning Sometimes the terminology itself can confuse us to the fundamentals There is supervised learning, there's semi-supervised learning, there's unsupervised learning, there's reinforcement learning And there's this feeling that supervised learning is really the only one Where you have to perform the manual annotation, where you have to do the large-scale supervision That's not the case Every type of machine learning is supervised learning It's supervised by a loss function or a function that tells you what's good And what's bad You know even looking at our own existence is how we humans figure out what's good and bad there's All kinds of sources direct and indirect by which our morals and ethics we figure out what's good and bad The difference between supervised and unsupervised and reinforcement learning is the source of that supervision What's implied when you say unsupervised?

Is that the cost of human labor required to attain the supervision is low But it's never Turtles all the way down it's turtles and then there's a human at the bottom There at some point there needs to be human intervention Human Input to provide what's good and what's bad and this will arise in reinforcement learning as well we have to remember that because the challenges and the exciting opportunities of reinforcement learning lie in the fact of How do we get that supervision?

In the most efficient way possible, but supervision nevertheless is required for any system that has an input and an output That's trying to learn like a neural network does to provide an output. That's good. It needs somebody to say what's good and what's bad For you curious about that.

There's been a few books a couple written throughout the last few centuries from Socrates to Nietzsche I recommend the latter especially So let's look at supervised learning and reinforcement learning I'd like to propose a way to think about the difference That is illustrative and useful when we start talking about the techniques So supervised learning is taking a bunch of examples of data And Learning from those examples where ground truth provides you the compressed Semantic meaning of what's in that data and from those examples one by one whether it's sequences or single samples We learn what how to then take future such samples and interpret them Reinforcement learning is teaching what we teach an agent through experience Not by showing a singular sample of a data set but by putting them out into the world The distinction there the essential element of reinforcement learning then for us Now we'll talk about a bunch of algorithms But the essential design step is to provide the world in which to experience The agent learns from the world The from the world it gets the dynamics of that world the physics of the world from that world It gets the rewards what's good and bad and us as designers Of that agent do not just have to do the algorithm.

We have to do design the the world In which that agent is trying to solve a task The design of the world is the process of reinforcement learning the design of examples The annotation of examples is the world of supervised learning And the essential perhaps the most difficult element of reinforcement learning is the reward the good versus bad Here a baby starts walking across the room We want to define success as a baby walking across the room And reaching the destination that's success and failure is the inability to reach that destination simple and reinforcement learning in humans The way we learn from these very few examples Appear to learn from very few examples through trial and error Is a mystery a beautiful mystery full of open questions It could be from the huge amount of data 230 million years worth of bipedal data that we've been walking Mammals walking ability to walk or 500 million years the ability to see having eyes So that's the the hardware side somehow genetically encoded in us is the ability to comprehend this world extremely efficiently It could be through not the hardware not the 500 million years, but the the few minutes hours days months Maybe even years in the very beginning when we're born The ability to learn really quickly through observation to aggregate that information Filter all the junk that you don't need and be able to learn really quickly Through imitation learning through observation the way for walking that might mean observing others to walk The idea there is If there was no others Around we would never be able to learn this the fundamentals of this walking or as efficiently It's through observation And then it could be the algorithm totally not understood is the algorithm that our brain uses to learn The back propagation that's an artificial neural networks the same kind of processes not understood in the brain That could be the key So I want you to think about that as we talk about the very trivial By comparison accomplishments and reinforcement learning and how do we take the next steps?

But it nevertheless is exciting to have machines that learn how to act in the world the process of learning for those who have fallen in love with artificial intelligence The process of learning is thought of as intelligence. It's the ability to know very little and through experience examples Interaction with the world in whatever medium whether it's data or simulation so on be able to form much richer and interesting Representations of that world be able to act in that world.

That's that's the dream So let's look at this stack of what an age what it means to be an agent in this world from top The input to the bottom the output is there's an environment. We have to sense that environment We have just a few tools us humans have Several sensory systems on cars you can have lidar camera Stereo vision audio microphone networking gps imu sensor so on whatever robot you can think about There's a way to sense that world and you have this raw sensory data and then once you have the raw sensory data you're tasked with representing that data in such a way that you can make sense of it as opposed to all the The the raw sensors in the eye the cones and so on that taken as just giant stream of high bandwidth information We have to be able to form higher Abstractions of features based on which we can reason from edges to corners to faces And so on that's exactly what deep learning neural networks have stepped in to be able to In an automated fashion with as little human input as possible be able to form higher order representations of that information Then there's the the learning aspect building on top of the greater abstractions formed through representations Be able to accomplish something useful whether it's discriminative task a generative task and so on based on the representation Be able to make sense of the data be able to generate new data and so on From sequence to sequence to sequence to sample from sample to sequence and so on and so forth to actions as we'll talk about and then there is the ability to Aggregate all the information that's been received in the past to the useful information that's Pertinent to the task at hand.

It's the thing the old It looks like a duck quacks like a duck swims like a duck Three different data sets i'm sure there's state-of-the-art algorithms for the three image classification audio recognition video classification Activity recognition so on aggregating those three together Is still an open problem and that could be the last piece again I want you to think about as we think about reinforcement learning agents.

How do we play? How do we transfer from the game of atari to the game of go to the game of dota to the game of a robot? Navigating an uncertain environment in the real world And once you have that once you sense the raw world once you have a representation of that world then We need to act Which is provide actions within the constraints of the world in such a way that we believe can get us towards success The promise excitement of deep learning is is the part of the stack that converts raw data into meaningful representations The promise the dream of deeper enforcement learning Is going beyond And building an agent that uses that representation And acts achieve success in the world That's super exciting The framework and the formulation of reinforcement learning At its Simplest Is that there's an environment and there's an agent that acts in that environment?

the agent senses the environment by by some observation whether it's partial or complete observation of the environment and It gives the environment an action it acts in that environment and through the action The environment changes in some way and then a new observation occurs And then also as you provide the action make the observations you receive a reward In most formulations of this of this framework This entire system has no memory That the The only thing you need to be concerned about is the state you came from the state you arrived in and the reward received The open question here is what can't be modeled in this kind of way.

Can we model all of it? from From human life to the game of go Can all of this be modeled in this way? And what are is this a good way to formulate the learning problem of robotic systems? In the real world in the simulated world. Those are the the open questions The environment could be fully observable Or partially observable like in poker It could be single agent or multi-agent atari versus driving like deep traffic deterministic or stochastic static versus dynamic Static as in chess dynamic again and driving in most real world applications discrete versus continuous like games Chess or continuous and carpool balancing a pull on a cart The challenge for RL in real world applications Is that as a reminder Supervised learning is teaching by example learning by example teaching from our perspective Reinforcement learning is teaching by experience And the way we provide experience to reinforcement learning agents currently for the most part is through simulation Or through highly constrained real world scenarios So the challenge is in the fact that most of the successes is with Systems environments that are simulatable So there's two ways to then close this gap two directions of research and work one is to improve the algorithms improve the ability of the algorithms to then To form policies that are transferable across all kinds of domains including the real world including especially the real world So train and simulation transfer to the real world or As we improve the simulation in such a way that the fidelity of the simulation increase increases to the point where the gap between reality and simulation is Minimal to a degree that things learned in simulation are directly trivially transferable to the real world Okay, the major components of an RL agent an agent Operates based on a strategy called a policy It sees the world it makes a decision.

That's a policy makes a decision how to act sees the reward Sees a new state acts sees a reward sees new states and acts and this repeats forever until a terminal state the value function is the estimate of how good a state is or how good A state action pair is meaning taking an action In a particular state.

How good is that ability to evaluate that? and then the model Different from the environment from the perspective of the agent So the environment has a model based on which it operates And then the agent has a representation best understanding of that model So the purpose for an RL agent In this Simply formulated framework is to maximize reward the way that The reward mathematically and practically is talked about Is with a discounted framework, so we discount further and further future reward So the reward that's farther into the future is means less to us in terms of maximization than reward That's in the near term.

And so why do we discount it? So first a lot of it is a math trick to be able to prove certain aspects analyze certain aspects of conversions And in general on a more philosophical sense Because environments either are or can be thought of a stochastic random. It's very difficult To there's a degree of uncertainty, which makes it difficult to really estimate the The reward they'll be in the future because of the ripple effect of the uncertainty Let's look at an example a simple one Helps us understand policies rewards actions, there's a robot in the room there's 12 cells in which you can step it starts in the bottom left.

It tries to get rewards on the top, right? There's a plus one. It's a really good thing at the top, right? It wants to get there by walking around There's a negative one, which is really bad. It wants to avoid that square and the choice of actions is up down left right four actions so you could think of uh, they're Being a negative reward of 0.04 for each step So there's a cost to each step and there's a stochastic nature to this world potentially we'll talk about both deterministic stochastic So in the stochastic case when you choose the action up with an 80% probability With an 80% chance you move up but With 10% chance you move left and another 10 move right So that's stochastic nature, even though you try to go up you might end up in a blocks the left into the right so for a deterministic world the optimal policy here Given that we always start in the bottom left is really shortest path Is you know, you can't ever because there's no stochasticity you're never going to screw up and just fall into the hole negative one hole that you just compute the shortest path and Walk along that shortest path why shortest path because every single step hurts.

There's a negative reward to it 0.04 so shortest path is the thing that minimizes the reward shortest path to the to the plus one block Okay, let's look at a stochastic world. Like I mentioned the 80% up and then split 10% to the left and right How does the policy change?

Well, first of all we need to have uh We need to have a plan for every single block in the area because you might end up there due to the stochasticity of the world Okay, the the basic addition there is that we're trying to go Avoid up The closer you get to the negative one hole.

So just try to avoid up because up The stochastic nature of up means you might fall into the hole with a 10% chance And given the 0.04 step reward you're willing to take the long way home In some cases in order to avoid that possibility the negative one possibility Now, let's look at a reward for each step if it decreases to negative two.

It really hurts to take every step Then again, we go to the shortest path despite the fact that uh, there's a stochastic nature In fact, you don't really care that you step into the negative one hole because every step really hurts. You just want to get home And then you can play with this reward structure right yes instead of uh negative two or negative 0.04 you can look at Negative 0.1 and you can see immediately that the structure of the policy It changes So with a higher Value the higher negative reward for each step immediately the urgency of the agent increases Versus the less urgency the lower the negative reward And when the reward flips So it's positive The every step is a positive so the entire system which is actually Quite common in reinforcement learning the entire system is full of positive rewards And so then the optimum policy becomes the longest path Is a grad school taking as long as possible never reaching the destination So What lessons do we draw from robot in the room two things?

The environment model the dynamics is just there in the trivial example the stochastic nature the difference between 80 percent 100 percent and 50 percent The model of the world the environment has a big impact on what the optimal policy is And the reward structure most importantly the thing we can often control More in our constructs of the task we try to solve in reinforcement learning is the What is good and what is bad and how bad is it and how good is it the reward structure is a big?

Impact and that has a complete change like like uh, robert frost said a complete change on the Policy the choices the agent makes so when you formulate a reinforcement learning framework As researchers as students what you often do is you design the environment you design the world in which the system learns Even when your ultimate goal is the physical robot you just still there's a lot of work still done in simulation So you design the world the parameters of that world and you also design the reward structure and it can have A transformative results slight variations in those parameters can be huge results On huge differences on the policy that's arrived and of course The example i've shown before I really love is the impact of the the changing reward structure might have unintended consequences and those Consequences for real world system can have obviously highly detrimental Costs that are more than just a failed game of atari So here's a human performing the task get playing the game of coast runners racing around the track and so it's uh, when you finish first And you finish fast you get a lot of points and so it's natural to then okay Let's do an rl agent and then optimize this for those points And what you find out in the game is that you also get points by picking up the little green turbo things And what the agent figures out is that you can actually get a lot more points even By simply focusing on the green turbos focusing on the green turbos Just rotating over and over slamming into the wall fire and everything just picking it up, especially because ability to pick up those turbos Can avoid the terminal state at the end of finishing the race in fact finishing the race means you stop collecting positive reward So you never want to finish collect the turbos And though that's a trivial example It's not actually easy to find such examples But they're out there of unintended consequences that can have highly negative detrimental effects when put in the real world We'll talk about a little bit of robotics When you put robots four-wheeled ones like autonomous vehicles into the real world And you have objective functions that have to navigate difficult intersections full of pedestrians So you have to form intent models of those pedestrians here.

You see cars asserting themselves through dense intersections taking risks and Within those risks that are taken by us humans when we drive vehicles we have to then encode that ability to take subtle risk into into AI-based control algorithms perception Then you have to think about at the end of the day.

There's an objective function and if that objective function does not anticipate the green turbos that are to be collected and then result in some unintended consequences could have very Negative effects especially in situations that involve human life That's the field of AI safety and some of the folks who talk about deep mind and open AI That are doing incredible work in RL also have groups that are working in AI safety for a very good reason this is a problem that I believe that artificial intelligence will define some of the most impactful positive things In the 21st century, but I also believe we are nowhere close To solving some of the fundamental problems of AI safety that we also need to address as we develop those algorithms So okay examples of reinforcement learning systems All of it has to do with formulation of rewards formulation of state and actions.

You have the traditional The often used benchmark of a cart Balancing a pole continuous. So the action is the horizontal force of the cart The goal is to balance the pole So it stays top in the moving cart and the reward is one at each time step if the pole is upright And the state measured by the cart by the agent is the pole angle angular speed And of course self sensing of the cart position and the horizontal velocity Another example here didn't want to include the video because it's really disturbing but I do want to include this slide because it's really important to think about is by sensing the the raw pixels learning and teaching an agent to Play a game of doom So the goal there is to eliminate all opponents The state is the raw game pixels the actions up down shoot reload and so on And The positive reward is When an opponent is eliminated and negative when the agent is eliminated simple I added it here because again on the topic of AI safety We have to think about objective functions and how that translate into the world of not just autonomous vehicles but Things that even more directly have harm like autonomous weapon systems.

We have a lecture on this in the AGI series and The on the robotics platform the manipulate object manipulation grasping objects. There's a few benchmarks. There's a few interesting applications learning the problem of grabbing objects moving objects Manipulating objects rotating and so on especially when those objects don't have have complicated shapes And so the goal is to pick up an object in the purely in the grasping object challenge The state is the visual information.

So it's visual visual based the raw pixels of the objects The action is to move the arm grasp the object pick it up And obviously it's positive when the pickup is successful The reason i'm personally excited by this is because it will finally allow us to solve the problem of the the claw which has been Torturing me for many years I don't know.

That's not at all why i'm excited but okay And then we have to think about as we get greater and greater degree of application in the real world with the robotics Like cars The the main focus of my passion in terms of robotics is how do we encode some of the things that us humans encode?

How do we you know? We have to think about our own objective function our own reward structure our own model of the environment about which we perceive and reason About in order to then encode machines that are doing the same and I believe autonomous driving is in that category We have to ask questions of ethics.

We have to ask questions of of risk value of human life value of efficiency money and so on all these are fundamental questions that an autonomous vehicle Unfortunately has to solve before it becomes fully autonomous So here are the key takeaways of the real world impact of reinforcement learning agents On the deep learning side Okay, these neural networks that form higher representation The fun part is the algorithms all the different architectures the different encoder decoder structures all the attention self-attention recurrence LSTMs GRUs all the fun architectures and the data and the ability to leverage different data sets in order to to discriminate better than uh Perform discriminatory tasks better than you know MIT does better than Stanford that kind of thing.

That's the fun part The hard part is asking good questions and collecting huge amounts of data that's representative of the task That's for real world impact not CVPR publication real world impact A huge amount of data on the deeper enforcement learning side the key challenge The fun part again is the algorithms.

How do we learn from data some of the stuff i'll talk about today? The hard part is defining the environment defining the access space and the reward structure As I mentioned this is the big challenge and the hardest part is how to crack the gap between simulation in the real world the leaping lizard That's the hardest part.

We don't even know How to solve that transfer learning problem yet for the real world impact The three types of reinforcement learning There's countless algorithms and there's a lot of ways to taxonomize them, but at the highest level There's model-based and there's model-free model-based algorithms Learn the model of the world So as you interact with the world You construct your estimate of how you believe the dynamics of that world operates The nice thing about Doing that is once you have a model or an estimate of a model you're able to Anticipate you're able to plan into the future.

You're able to use the model to In a branching way predict how your actions will change the world so you can plan far into the future This is the mechanism by which you you can you can do chess Uh in the simplest form because in chess, you don't even need to learn the model The model is learned is given to you chess go and so on The most important way in which they're different I think is the sample efficiency Is how many examples of data are needed to be able to successfully operate in the world?

And so model-based methods because they're constructing a model if they can Are extremely sample efficient Because once you have a model you can do all kinds of reasoning that doesn't require experiencing every possibility of that model you can Unroll the model to to see how the world changes based on your actions Value-based methods are ones that look to estimate The quality of states the quality of state taking a certain action in a certain state so They're called off policy Versus the last category that's on policy.

What does it mean to be off policy? It means that They constantly a value-based agents constantly update how good is taking action in a state and they have this model of that goodness of Taking action in a state and they use that to pick the optimal action They don't directly learn a policy a strategy of how to act they learn how good it is to be in a state and use that goodness information to then pick the best one And then every once in a while flip a coin in order to explore And then policy-based methods are ones that directly learn a policy function so they take as input the the world representation of that world neural networks and is output a action Where the action is stochastic So, okay, that's the range of model-based value-based and policy-based Here's an image from openai that I really like I encourage you to As we further explore here to look up spinning up in deeper enforcement learning from openai Here's an image that taxonomizes in the way that I described some of the recent developments in rl so at the very top the distinction between model free rl and model-based rl In model free rl, which is what we'll focus on today.

There is a distinction between policy optimization So on policy methods and q learning which is all policy methods policy optimization methods that directly optimize the policy Directly learn the policy in some way and then Q learning all policy methods learn like I mentioned the value of taking a certain action in a state and from that learned that learned q value be able to Choose how to act in the world So let's look at a few sample representative approaches in this space Let's start with the one that Really was one of the first great breakthroughs From google deep mind on the deep rl side and solving atari games dqn deep q learning networks deep q networks And let's take a step back and think about what q learning is Q learning looks at the state action value function Q That estimates based on a particular policy or based on an optimal policy.

How good is it to take an action? in this state the estimated Reward if I take an action in this state and continue operating under an optimal policy It gives you directly a way to say Amongst all the actions I have which action should I take to maximize the reward?

Now in the beginning, you know, nothing, you know, you don't have this value estimation You don't have this q function So you have to learn it and you learn it with a bellman equation of updating it You take your current estimate and update it with the reward you see Received after you take an action Here It's off policy and model free.

You don't have to have any estimate or knowledge of the world You don't have to have any policy whatsoever. All you're doing is Roaming about the world collecting data when you took a certain action. Here's the word you received and you're updating gradually this table Where the table has state states on the y-axis and actions on the x-axis and the key part there is Because you always have an estimate of what To take an action of the value of taking that action so you can always take the optimal one but because you know very little in the beginning that optimal is going to You have no way of knowing that's good or not So there's some degree of exploration the fundamental aspect of value-based methods or any RL methods Like I said, it's trial and error is exploration So for value-based methods like Q learning the way that's done is with a flip of a coin epsilon greedy With a flip of a coin you can choose to just take a random action and you Slowly decrease epsilon to zero as your agent learns more and more and more so in the beginning you explore a lot an epsilon of one an epsilon of zero in the end when you're just acting greedy based on the Your understanding of the world as represented by the Q value function For non-neural network approaches.

This is simply a table the Q This Q function is a table. Like I said on the y state x actions and in each cell you have a reward that's A discount or reward you estimate to be received there And as you walk around with this Bellamy equation, you can update that table but It's a table nevertheless number of states times number of actions Now if you look at any practical real-world problem And an arcade game with raw sensory input is a very crude first step towards the real world so raw sensory information This kind of value iteration and updating a table is impractical Because here's for a game of breakout if you look at four consecutive frames of a game of breakout Size of the of the raw sensory input is 84 by 84 pixels grayscale Every pixel has 256 values that's 256 to the power of Whatever 84 times 84 times 4 is Whatever it is It's significantly larger than the number of atoms in the universe So the size of this Q table if we use the traditional approach is intractable Neural networks to the rescue Deep RL is RL plus neural networks where the neural networks is tasked with taking this in value-based Methods taking this Q table and learning a compressed representation of it Learning an approximator for the function from state and action to the value That's what previously talked about the ability the powerful ability of neural networks to form representations from extremely high dimensional complex raw sensory information So it's simple the framework remains for the most part the same in reinforcement learning.

It's just that this Q function For value-based methods becomes a neural network and becomes an approximator where the hope is as you navigate the world and you pick up new knowledge through The back propagating the gradient and the loss function that you're able to form a good representation of the optimal Q function So use neural networks with neural networks are good at which is function approximators And that's DQN deep Q network was used to have the initial Incredible nice results on the arcade games where the input is the raw sensory pixels with a few convolutional layers fully connected layers And the output is a set of actions You know Probability of taking that action and then you sample that and you choose the best action And so this simple agent with a neural network that estimates that Q function Very simple network is able to achieve A superhuman performance on many of these arcade games that excited the world because it's taking raw sensory information with a pretty simple network That doesn't in the beginning understand any of the physics of the world any of the dynamics of the environment and through that intractable space the intractable State space is able to learn how to actually do pretty well The loss function for DQN has two Q functions One is the expected The predicted Q value of taking an action in a particular state and the other is the Target against which the loss function is calculated.

Which is what is the value that you got once you actually Taken that action and once you've taken that action the way you calculate the value is by looking at the next step and choosing the Maximum choosing if you take the best action in the next state What is going to be the Q function?

So there's two estimators going on in terms of neural networks There's two forward passes here. There's two Q's in this equation so in traditional DQN, that's just That's done by a single neural network With a few tricks and double DQN that's done by two neural networks And I mentioned tricks because with this and with most of RL tricks tell a lot of the story A lot of what makes systems work is the details in games and robotic systems in these cases The two biggest tricks for DQN that will reappear in a lot of value-based methods is experience replay So think of an agent that plays through these games as also collecting memories You collect this Bank of memories that can then be replayed The power of that one of the central elements of what makes value-based methods attractive is that Because you're not directly estimating the policy but are learning the quality of taking an action in a particular state the You're able to then jump around through your memory and and play different aspects of that memory So learn, train the network through the historical data and then the other trick Simple is like I said that there is So the loss function has two queues So you're it's it's a dragon chasing its own tail It's easy for the loss function to become unstable.

So the training does not converge So the trick of fixing a target network is taking one of the queues And only updating it every x steps every thousand steps and so on and taking the same kind of network is just fixing it So for the target network that defines the loss function just keeping it fixed and only updating irregularly So you're chasing a fixed target with a loss function as opposed to a dynamic one So you can solve a lot of the Atari games With minimal effort come up with some creative solutions here Break out here after 10 minutes of training on the left after two of two hours of training on the right It's coming up with some creative solutions again.

It's pretty cool because this is raw pixels, right? We're now like There's been a few years since this Breakthrough So kind of take it for granted, but I still for the most part Captivated by just how beautiful it is that from raw sensory information neural networks are able to learn to act in a way that actually supersedes humans in terms of creativity in terms of In terms of actual raw performance.

It's really exciting and games of simple form is the cleanest way to demonstrate that and The same kind of DQN network is able to achieve superhuman performance on a bunch of different games There's improvements to this like dual DQN again The Q function can be decomposed which is useful into the value estimate Of being in that state and what's called And in future slides will be called advantage So the advantage of taking action in that state the nice thing of the advantage as a measure is that It's a measure of the action quality relative to the average Action that could be taken there.

So if it's that's very useful advantage versus sort of raw reward Is that if all the actions you have to take are pretty good? You want to know well how much better it is? in terms of optimization That's a better measure for choosing actions in a value-based sense So when you have these two estimates you have these two streams for neural network and a dueling DQN DDQN where one estimates the value the other the advantage And that's again that dueling nature is useful for Also when the there are many states in which the action Is decoupled the quality of the actions is decoupled from the state.

So many states it doesn't matter Which action you take So you don't need to learn all the different complexities All the topology of different actions when you in a particular state And another one Is prioritize experience replay like I said experience replay is really key to these algorithms And the thing that syncs some of the policy optimization methods And experience replay is collecting different memories But if you just sample randomly in those memories You're now affected the sampled experiences are Really affected by the frequency of those experience occurred not their importance So prioritize experience replay assigns a priority a value based on the magnitude of the temporal difference learned error So the the stuff you have learned the most from is given a higher priority and therefore you get to see through the experience replay process that That particular experience more often Okay, moving on to policy gradients this is on policy versus Q learning off policy Policy gradient Is Directly optimizing the policy where the input is the raw pixels And the policy network represents the Forms of representations of that environment space and its output produces a stochastic estimate A probability of the different actions here in the pong the pixels A single output that produces a probability of moving the paddle up So how do policy gradients vanilla policy gradient very basic works Is you unroll the environment you play through the environment Here pong moving the paddle up and down and so on collecting no rewards And only collecting reward at the very end Based on whether you win or lose Every single action you're taking along the way gets either punished or rewarded based on whether it led to victory or defeat This also is remarkable that this works at all because the credit assignment there is a is I mean every single thing you did along the way Is averaged out It's like muddied.

It's the reason that policy gradient methods are more inefficient, but it's still very surprising that it works at all So the pros versus DQN the value-based methods Is that if the world is so messy that you can't learn a Q function the nice thing about policy gradient because it's learning The policy directly that it will at least learn a pretty good policy Usually in many cases faster convergence.

It's able to deal with stochastic policies So value-based methods cannot learn stochastic policies and it's much more naturally able to deal with continuous actions The cons is it's inefficient versus DQN it's It can become highly unstable as we'll talk about some solutions to this during the training process and the credit assignment So if we look at the chain of actions that lead to a positive reward Some might be awesome actions.

Some might be good actions Some might be terrible actions, but that doesn't matter as long as the destination was good And that's then every single action along the way gets a positive reinforcement That's the downside and there's now improvements to that advantage actor critic methods A2C combining the best of Value-based methods and policy based methods So having an actor two networks an actor which is Policy-based and that's the one that takes the actions Samples the actions from the policy network and the critic That measures how good those actions are And the critic is value-based All right.

So as opposed to in the policy update the first equation there the reward coming from the destination The the reward being from whether you won the game or not Every single step along the way You now learn a Q value function QSA state and action using the critic network So you're able to now learn about the environment about evaluating your own actions at every step So you're much more sample efficient There's asynchronous From deep mind and synchronous from open ai variants of this but of the actor advantage actor critic framework But both are highly parallelizable the difference with A3C the asynchronous one Is that every single agent so you just throw these agents operating in the environment and they're learning they're rolling out the games and getting the reward They're updating the original network asynchronously the global network parameters asynchronously And as a result, they're also operating constantly on outdated versions of that network The open ai approach that fixes this is that there's a coordinator that there's these rounds where everybody All the agents in parallel are rolling out the episode but then the coordinator waits for everybody to finish in order to make the update to the global network and then distributes all the same parameters To all the agents and so that means that every iteration starts with the same global parameters and that has really nice properties in terms of conversions and stability of the training process Okay from google deep mind the deep deterministic policy gradient Is combining the ideas of dqn but dealing with continuous action spaces so taking a policy network, but instead of the actor actor critic framework but instead of picking a stochastic policy Having the actor operating in a stochastic nature is picking the best picking a deterministic policy So it's always choosing the best action But okay with that the problem quite naturally Is that when the policy is now deterministic it's able to do a continuous action space, but because it's deterministic it's never exploring So the way we inject exploration into the system is by adding noise either adding noise into the action space on the output or adding noise into the parameters of the network that Have then that create perturbations in the actions such that the final result is that you try different kinds of things and the The scale of the noise just like with the epsilon greedy in the exploration for dqn The scale of the noise decreases as you learn more and more so on the policy optimization side from Openai and others We'll do a lecture just on this there's been a lot of exciting work here the basic idea of optimization on policy optimization with ppo and trpo is First of all, we want to formulate A reinforcement learning as purely an optimization problem And second of all if policy optimization the actions you take Influences the rest of your the optimization process.

You have to be very careful about the actions you take in particular You have to Avoid taking really bad actions When your convergence the the training performance in general collapses So, how do we do that? There's the line search methods, which is where gradient descent or gradient ascent falls under which Which is the how we train deep neural networks is you first Pick a direction of the gradient and then pick the step size The problem with that is that can get you into trouble here.

There's a nice visualization walking along a ridge Is it can it can result in you stepping off that ridge again the collapsing of the training process the performance the trust region is is the underlying idea here for the For the policy optimization methods that first pick the step size so the constraint in various kinds of ways the the magnitude Of the difference to to the weights that's applied and then the direction so Placing a much higher priority not choosing bad actions that can throw you off the optimization path trajectory We should take to that path and finally the On the model-based methods and we'll also talk about them in the robotics side.

There's a lot of interesting approaches now where deep learning is starting to be used for Model-based methods when the model has to be learned But of course when the model doesn't have to be learned is given inherent to the game You know the model like in go and chess and so on alpha zero has really done incredible stuff so What's why is what is the model here?

So the way that a lot of these games are approached You know game of go it's turn-based one person goes and another person goes and there's this game tree At every point there's a set of actions that can be taken and quickly if you look at that game tree It's it becomes you know, it grows exponentially So it becomes huge the game of go is the hugest of all in terms of because the number of choices you have Is the largest and there's chess And then, you know, it gets to checkers and then tic-tac-toe and it's just the the degree at every step increases decreases based on the game structure And so the task for neural network there is to learn the quality of the board.

It's to learn which boards which game positions are most likely to result in a Most useful to explore and result in a highly successful state So that choice of what's good to explore what what branch is good to go down Is where we can have neural networks step in And with alpha go it was pre-trained the first success that beat the world champion was pre-trained on expert games then with alpha go zero It was No pre-training on expert systems.

So no imitation learning is just purely through self-play through suggesting through playing itself New board positions many of these systems use Monte Carlo tree search and during this search balancing exploitation exploration so going deep on promising positions based on the estimation in neural network or With a flip of a coin playing underplayed positions And so this kind of here you could think of as an intuition of looking at a board and estimating how good that board is And also estimating how good that board is likely to lead to victory down the end So estimating just general quality and probability of leading to victory then The next step forward is alpha zero using the same similar architecture with mcts Monte Carlo tree search but applying it to different games And applying it and competing against other engines state-of-the-art engines in go and shogi in chess And outperforming them with very few very few steps so here's This model-based approaches which are really extremely simple efficient if you can construct such a model And in in robotics if you can learn such a model It can be exceptionally powerful here Beating the the Engines which are far superior to humans already stockfish can destroy most humans on earth at the game of chess the ability through learning through through estimating the quality of a board to be able to defeat these engines is incredible and the exciting aspect here versus engines that don't use neural networks is that the number It really has to do with based on the neural network you explore certain positions You explore certain parts of the tree And if you look at grandmasters human players In chess, they seem to explore very few moves They have a really good neural network at estimating which are the likely branches which would provide value to explore and on the other side Stockfish and so on are much more brute force in their estimation for the mcts And then alpha zero is a step towards the grandmaster Because the number of branches need to be explored is much much fewer A lot of the work is done in the representation formed by the neural network, which is super exciting And then it's able to outperform Stockfish and chess.

It's able to outperform elmo and shogi and It's itself in go Or the previous iterations of alpha goes zero and so on Now the challenge here The sobering truth is that majority of real world application Of agents that have to act in this world perceive the world and act in this world Are for the most part Not based have no rl involved So the action is not learned You use neural networks to perceive certain aspects of the world, but ultimately the action Is not is not learned from data That's true for all most of the autonomous vehicle companies or all of the autonomous vehicle companies operating today and it's true for robotic manipulation Industrial robotics and any of the humanoid robots have to navigate in this world under uncertain conditions All the work from boston dynamics doesn't involve any machine learning as far as we know Now that's beginning to change here with animal the the recent development Where the certain aspects of the control Of robotics is being learned You're trying to learn more efficient movement.

You're trying to learn more robust movement on top of the other controllers So it's quite exciting through rl to be able to learn some of the control dynamics here. That's able to teach Uh this particular robot to be able to get up from arbitrary positions So it's less hard coding in order to be able to deal with uh Unexpected initial conditions and unexpected perturbations So it's exciting there in terms of learning the control dynamics and some of the driving policy So making behavioral driving behavior decisions changing lanes turning and so on that if you uh, If you were here last week heard from waymo They they're starting to use some rl in terms of the driving policy in order to especially predict the future They're trying to anticipate intent modeling predict where the pedestrians where the cars are going to be based in the environment They're trying to unroll what's happened recently into the future and beginning to Move beyond sort of pure end-to-end on nvidia end-to-end learning approach of the control decisions Are actually moving to rl and making long-term planning decisions but again, the challenge is the The gap the leap needed to go from simulation to real world all Most of the work is done from the design of the environment and the design of the reward structure And because most of that work now is in simulation.

We need to either develop better algorithms for transfer learning or close the distance between simulation and the real world and Also, we could think outside the box a little bit I had the conversation with peter abeel recently one of the leading researchers in deep rl It kind of on the side quickly mentioned The the idea is that we don't need to make simulation more realistic What we could do is just create an infinite number of simulations Or a very large number of simulations And the naturally the regularization aspect of having all those simulations Will make it so that our our reality is just another sample from those simulations And so maybe the solution isn't to create higher fidelity simulation or to create transfer learning algorithms Maybe it's to build a Arbitrary number of simulations So then that step towards creating a agent that work that works in the real world is a trivial one And maybe that's exactly whoever created the simulation we're living in and the multiverse that we're living in did Next steps The lecture videos we have several in rl will be made all available on deeplearning.mit.edu We'll have several tutorials in rl on github The link is there and I really like the essay From openai on spinning up as a deep rl researcher, you know, if you're interested in getting into research in rl What are the steps you need to take from the background of developing the mathematical background prob stat and multivariate calculus?

To some of the basics like it's covered last week on deep learning some of the basics ideas in rl Just terminology and so on some basic concepts then picking a framework tensorflow or pytorch and Learn by doing All right implement the algorithms I mentioned today. Those are the core rl algorithms So implement all of them from scratch It should only take about 200 300 lines of code.

They're actually when you put it down on paper are quite simple intuitive algorithms and then Read papers about those algorithms that follow after looking not for the big waving performance The hand waving performance but for the tricks that were used to train these algorithms The tricks tell a lot of the story and that's the useful parts that you need to learn And iterate fast on simple benchmark environments So openai gem has provided a lot of easy to use environments that you can play with that You can train an agent in minutes hours as opposed to days and weeks And so iterating fast is the best way to learn these algorithms and then on the research side There's three ways to get a best paper award, right?

To publish and to contribute and have an impact in the research community In rl one is improve an existing approach given a particular benchmark. There's a few Benchmark data sets environments that are emerging. So you want to improve an existing approach some aspect of the convergence of the performance You can Focus on an unsolved task.

There's certain games that just haven't been solved through the rl Formulation or you can come up with a totally new Problem that hasn't been addressed by rl before So with that I'd like to thank you very much tomorrow. I hope to see you here for deep traffic. Thanks You You You You You You You You

MIT 6.S091: Introduction to Deep Reinforcement Learning (Deep RL)

Chapters

Transcript