
MIT 6.S094: Deep Reinforcement Learning for Motion Planning


Chapters

0:00 Intro
0:43 Types of machine learning
5:48 Perceptron: Weighing the Evidence
7:34 Perceptron: Implement a NAND Gate
10:06 Perceptron NAND Gate
11:01 The Process of Learning: Small Change in Weights → Small Change in Output
13:02 Combining Neurons into Layers
13:49 Task: Classify an Image of a Number
19:45 Philosophical Motivation for Reinforcement Learning
21:20 Agent and Environment
24:55 Markov Decision Process
25:30 Major Components of an RL Agent
26:53 Robot in a Room
27:57 Is this a solution?
28:22 Optimal policy
28:42 Reward for each step: -2
29:11 Reward for each step: +0.01
30:15 Value Function
31:19 Q-Learning
35:03 Exploration vs Exploitation
36:07 Q-Learning: Value Iteration
38:03 Q-Learning: Representation Matters
45:48 Philosophical Motivation for Deep Reinforcement Learning
48:00 Deep Q-Network: Atari
49:24 Deep Q-Network Training
52:18 Atari Breakout
53:53 Deep Q-Learning Algorithm

Transcript

All right. Hello everybody. Welcome back. Glad you came back. Today, we will unveil the first tutorial, the first project. This is Deep Traffic, code named Deep Traffic, where your task is to solve the traffic problem using deep reinforcement learning. And I'll talk about what's involved in designing a network there, how you submit your own network, and how you participate in the competition.

As I said, the winner gets a very special prize to be announced later. What is machine learning? There are several types. There's supervised learning. As I mentioned yesterday, that's usually what's meant when you talk about machine learning and its successes. Supervised learning requires a data set where you know the ground truth.

You know the inputs and the outputs. And you provide that to a machine learning algorithm in order to learn the mapping between the inputs and the outputs in such a way that you can generalize to further examples in the future. Unsupervised learning is the other side, when you know absolutely nothing about the outputs, about the truth of the data that you're working with.

All you get is data. And you have to find underlying structure, an underlying representation of the data that's meaningful for you to accomplish a certain task, whatever that is. There's semi-supervised learning, where only part of the data, usually a very small amount, is labeled. There's ground truth available for just a small fraction of it.

If you think of images that are out there on the Internet, and then you think about ImageNet, a data set where every image is labeled, the size of that ImageNet data set is a tiny subset of all the images available online. But that's the task we're dealing with as human beings, as people interested in doing machine learning: how to expand the part of our data that we know something about confidently.

And reinforcement learning sits somewhere in between. It's semi-supervised learning, where there's an agent that has to exist in the world. And that agent knows the inputs that the world provides, but knows very little about that world except through occasional time-delayed rewards. This is what it's like to be human.

This is what life is about. You don't know what's good and bad. You kind of have to just live it. And every once in a while, you find out that all that stuff you did last week was a pretty bad idea. That's reinforcement learning. That's semi-supervised in the sense that only a small subset of the data comes with some ground truth, some certainty that you have to then extract knowledge from.

So first, at the core of anything that works currently, in a practical sense, there has to be some ground truth. There has to be some truth that we can hold on to as we try to generalize. And that's supervised learning. Even in reinforcement learning, the only thing we can count on is the truth that comes in the form of a reward.

So the standard supervised learning pipeline is: you have some raw data, the inputs. You have ground truth, the labels, the outputs that match the inputs. Then you run some algorithm, whether that's a neural network or another processing algorithm, that extracts the features from that data set.

You can think of a picture of a face. That algorithm could extract the nose, the eyes, the corners of the eyes, the pupil, or even lower-level features in that image. After that, we insert those features into a model, a machine learning model. We train that model. Then, whatever that algorithm is, as we pass examples through that training process, we evaluate.

After we've seen this one particular example, how much better are we at other tasks? And as we repeat this loop, the model learns to perform better and better at generalizing from the raw data to the labels that we have. And finally, you get to release that model into the wild to actually do prediction on data it has never seen before, that you don't know about.

And the task there is to predict the labels. Okay, so neural networks are what this class is about. It's one of the machine learning algorithms that has proven to be very successful. And the building block, the computational building block of a neural network, is a neuron. A perceptron is a type of neuron.

It's the original old-school neuron where the output is binary, a zero or one. It's not real valued. And the process that a perceptron goes through is it has multiple inputs and a single output. The inputs, each of the inputs have weights on them, shown here on the left as 0.7, 0.6, 1.4.

Those weights are applied to the inputs, which for a perceptron are ones or zeros, binary, and then summed together. A bias on each neuron is then added on top, and there's a threshold: a test of whether that summed value plus the bias is below or above the threshold.

If it's above a threshold, it produces a one. If it's below a threshold, it produces a zero. Simple. It's one of the only things we understand about neural networks confidently. We can prove a lot of things about this neuron. For example, what we know is that a neuron can approximate a NAND gate.

A NAND gate is a logical operation, a logical function. It has two inputs, A and B, here on the diagram on the left. And the table shows what that function is. When the inputs are both zeros, or a 0 and a 1 in either order, the output is a one.

Otherwise, it's a zero. The cool thing about a NAND gate is that it's a universal gate: you can build any computer out of them. Your phone in your pocket today can be built out of just NAND gates. So it's functionally complete. You can build any logical function out of them if you stack them together in arbitrary ways.

The problem with NAND gates and computers is they're built from the bottom up. You have to design these circuits of NAND gates. So the cool thing here is with the Perceptron, we can learn this magical NAND gate. We can learn this function. So let's go through how we can do that, how a Perceptron can perform the NAND operation.

Here are the four examples. If we put weights of -2 on each of the inputs and a bias of 3 on the neuron, and if we perform that same operation of summing the weights times the inputs plus the bias, then in the top left, when the inputs are both zeros, the sum is just the bias, and we get a 3.

That's a positive number, which means the output of the perceptron will be a 1. In the top right, when the inputs are a 0 and a 1, that sum is still a positive number. Again, it produces a 1, and so on. When the inputs are both 1s, the sum is a -1, less than 0, so the output is a 0.
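
A minimal sketch of this perceptron, assuming the weights of -2 and the bias of 3 from the slide and a hard threshold at zero:

```javascript
// Perceptron with the weights and bias given above: -2 on each input, bias 3,
// step activation that outputs 1 when the weighted sum is above zero.
function perceptronNAND(a, b) {
  const weighted = -2 * a + -2 * b + 3; // weights times inputs, plus bias
  return weighted > 0 ? 1 : 0;          // threshold test
}

// Reproduces the NAND truth table:
console.log(perceptronNAND(0, 0)); // 1  (sum =  3)
console.log(perceptronNAND(0, 1)); // 1  (sum =  1)
console.log(perceptronNAND(1, 0)); // 1  (sum =  1)
console.log(perceptronNAND(1, 1)); // 0  (sum = -1)
```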

So while this is simple, it's really important to think about. It's sort of the one basic computational truth you can hold on to as we talk about some of the magical things neural networks can do. Because if you compare a circuit of NAND gates and a circuit of neurons, the difference is this: a circuit of neurons, which is what we think of as a neural network, can perform the same thing as the circuit of NAND gates.

What it can also do is learn. It can learn the arbitrary logical functions that an arbitrary circuit of NAND gates can represent. But it doesn't require the human designer. We can evolve, if you will. So one of the key aspects here, one of the key drawbacks of the perceptron, is that it's not very smooth in its output.

As we change the weights on the inputs and we change the bias, tweaking it a little bit, it's very easy to make the neuron output a 0 instead of a 1, or a 1 instead of a 0. So when we start stacking many of these together, it's hard to control the output of the thing as a whole.

Now the essential step that makes a neural network work, where a circuit of perceptrons doesn't, is that the output is made smooth, made continuous, with an activation function. And so instead of using a step function like a perceptron does, shown there on the left, we use a smooth function, like a sigmoid, where the output can change gradually as you change the weights and the bias.
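
A small sketch of the two activation functions being contrasted here, the perceptron's hard step versus a smooth sigmoid:

```javascript
// Step activation: the output jumps between 0 and 1.
function step(z) {
  return z > 0 ? 1 : 0;
}

// Sigmoid activation: the output changes gradually with the input.
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// A small change in the weighted input flips the step function completely,
// but only nudges the sigmoid, which is what makes gradual learning possible.
console.log(step(0.01), step(-0.01));       // 1 0
console.log(sigmoid(0.01), sigmoid(-0.01)); // ~0.5025 ~0.4975
```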

And this is a basic but critical step. And so learning is generally the process of adjusting those weights gradually and seeing how it has an effect on the rest of the network. You just keep tweaking weights here and there and seeing how much closer you get to the ground truth.

And if you get farther away, you just adjust the weights in the opposite direction. That's neural networks in a nutshell. What we'll mostly talk about today is feed-forward neural networks, on the left, going from inputs to outputs with no loops. There are also these amazing things called recurrent neural networks.

They're amazing because they have memory. They have a memory of state. They remember the temporal dynamics of the data that went through. But the painful thing is that they're really hard to train. Today we'll talk about feed-forward neural networks. So let's look at this example. An example of stacking a few of these neurons together.

Let's think of the basic task, now famous, of classifying numbers. You have an image of a handwritten number, and your task, given that image, is to say what number is in that image. Now what is an image? An image is a collection of pixels.

In this case, 28 by 28 pixels. That's a total of 784 numbers. Those numbers are from 0 to 255. And on the left of the network, the size of that input, despite the diagram, is 784 neurons. That's the input. Then comes the hidden layer. It's called the hidden layer because it has no interaction with the input or the output.

It is simply a block in the middle, and it's at the core of the computational power of neural networks. It's tasked with forming a representation of the data in such a way that it maps from the inputs to the outputs. In this case, there are 15 neurons in the hidden layer.

There are 10 values on the output, corresponding to each of the numbers. There are several ways you can build this kind of network, and this is part of the magic of neural networks: you can do it a lot of ways. You only really need four outputs to represent the values 0 through 9 in binary.

But in practice, it seems that having 10 outputs works better. And how do these work? Whenever the input is a 5, the output neuron in charge of the 5 gets really excited and outputs a value that's close to 1 (the outputs range from 0 to 1). And then the other ones hopefully output a value that's close to 0.

And when they don't, we adjust the weights in such a way that they get closer to 0 and closer to 1, depending on whether it's the correct neuron associated with the picture. We'll talk about the details of this training process more tomorrow when it's more relevant. But what we've discussed just now is the forward pass through the network.

It's the pass when you take the inputs, apply the weights, sum them together, add the bias, produce the output, and check which of the outputs produces the highest confidence of the number. Then once those probabilities for each of the numbers are provided, we determine the gradient that's used to punish or reward the weights that resulted in either the correct or the incorrect decisions.

And that's called back propagation. We step backwards through the network applying those punishments or rewards. Because of the smoothness of the activation functions, that is a mathematically efficient operation. That's where the GPUs step in. So, for our example of numbers, the ground truth for the number 6 looks like the following on the slide.

y(x) is a 10-dimensional vector, where only one entry, the one for the 6, is a 1 and the rest are 0. That's the ground truth that comes with the image. The loss function here, the basic loss function, is the squared error, where y(x) is the ground truth and a is the output of the neural network resulting from the forward pass.

So when you input that image of a 6 and it outputs whatever it outputs, that's a, a 10-dimensional vector. And it's summed over the inputs to produce the squared error. That's our loss function. The loss function, the objective function, is what's used to determine how much to reward or punish the weights as they're back-propagated through the network.
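
A minimal sketch of that forward pass and squared-error loss for the 784 → 15 → 10 network described above (not the lecture's actual code; the random weight initialization here is just for illustration):

```javascript
function sigmoid(z) { return 1 / (1 + Math.exp(-z)); }

// One layer: output[j] = sigmoid( sum_i weights[j][i] * input[i] + biases[j] )
function layerForward(input, weights, biases) {
  return weights.map((row, j) =>
    sigmoid(row.reduce((sum, w, i) => sum + w * input[i], biases[j])));
}

function randomMatrix(rows, cols) {
  return Array.from({ length: rows }, () =>
    Array.from({ length: cols }, () => Math.random() - 0.5));
}

const W1 = randomMatrix(15, 784), b1 = new Array(15).fill(0); // hidden layer, 15 neurons
const W2 = randomMatrix(10, 15),  b2 = new Array(10).fill(0); // output layer, 10 neurons

// Forward pass: 784 pixel values in, 10 output values between 0 and 1 out.
function forward(pixels) {
  return layerForward(layerForward(pixels, W1, b1), W2, b2);
}

// Squared error against the one-hot ground truth for a single example,
// e.g. for the digit 6, y(x) = [0,0,0,0,0,0,1,0,0,0].
function squaredError(output, label) {
  const y = new Array(10).fill(0);
  y[label] = 1;
  return 0.5 * output.reduce((s, a, i) => s + (a - y[i]) ** 2, 0);
}
```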

And the basic operation of optimizing that loss function, of minimizing that loss function, is done with various variants of gradient descent. It's hopefully a somewhat smooth function, but it's a highly nonlinear function. This is why we can't prove much about neural networks. It's a high-dimensional, highly nonlinear function that's hopefully smooth enough that gradient descent can find its way to at least a good solution.

And there has to be some stochastic element there that jumps around to ensure that it doesn't get stuck in a local minimum of this very complex function. Okay, that's supervised learning. There's inputs, there's outputs, ground truth. That's our comfort zone. Because we're pretty confident, we know what's going on.

All you have to do is take this data set, train a network on it, evaluate it, write a paper, and try to beat a previous paper. It's great. The problem is when you then use that neural network to create an intelligent system that you put out there in the world.

And now that system is no longer working with your data set. It has to exist in this world that's maybe very different from the ground truth. So the takeaway from supervised learning is that neural networks are great at memorization. But in a sort of philosophical way, they might not be great at generalizing, at reasoning beyond the specific flavor of data set that they were trained on.

The hope for reinforcement learning is that we can extend the knowledge we gain in a supervised way to the huge world outside, where we don't have the ground truth of how to act, of how good or bad a certain state is. This is a kind of brute force reasoning.

And I'll talk about kind of what I mean there. But it feels like it's closer to reasoning as opposed to memorization. That's a good way to think of supervised learning is memorization. You're just studying for an exam. And as many of you know, that doesn't mean you're going to be successful in life just because you get an A.

And so a reinforcement learning agent, or just any agent, a human being or any machine existing in this world, can operate in the following way, from the perspective of the agent. It can execute an action, it can receive an observation resulting from that action in the form of a new state, and it can receive a reward or a punishment.

You can break down our existence in this way, a simplistic view. But it's a convenient one on the computational side. And from the environment side, the environment receives the action and emits the observation: your action changes the world, so the world has to change, then tell you about it, and give you a reward or punishment for it.

So let's look at, again, one of the most fascinating things, and I'll try to convey why this is fascinating a little bit later on: the work of DeepMind on Atari. This is Atari Breakout, a game where a paddle has to move around. That's the world it exists in. The agent is a paddle, there's a bouncing ball, and your actions are move right, move left.

You're trying to move in such a way that the ball doesn't get past you. And so here is a human level performance of that agent. And so what does this paddle have to do? It has to operate in this environment. It has to act, move left, move right. Each action changes the state of the world.

This may seem obvious but moving right changes visually the state of the world. In fact, what we're watching now on the slides is the world changing before your eyes for this little guy. And it gets rewards or punishments. Rewards it gets in the form of points. They're racking up points in the top left of the video.

And then when the ball gets past the paddle, it gets punished by dying, quote-unquote. And that's the number of lives it has left, going from five to four to three, down to zero. And so the goal is to select at any one moment the action that maximizes future reward.

Without any knowledge of what a reward is, in a greater sense of the word, all you have is an instantaneous reward or punishment, instantaneous response of the world to your actions. And this can be modeled as a Markov decision process. Markov decision process is a mathematically convenient construct. It has no memory.

All you get is you have a state that you're currently in, you perform an action, you get a reward and you find yourself in a new state. And that repeats over and over. You start from state zero, you go to state one, you once again repeat an action, get a reward, go to the next state.

Okay, that's the formulation that we're operating in. When you're in a certain state, you have no memory of what happened two states ago. Everything operates instantaneously. And so what are the major components of a reinforcement learning agent? There's a policy. That's the function, broadly defined, of the agent's behavior.

That includes the knowledge of, for any given state, what action I will take with what probability. The value function is how good each state is, and how good each action is in a particular state. And there's a model. Now this is a subtle thing that is actually the biggest problem with everything you'll see today.

The model is how we represent the environment. And what you'll see today is some amazing things that neural networks can achieve on a relatively simplistic model of the world. And the question is whether that model can extend to the real world, where human lives are at stake in the case of driving.

Yeah. So let's look at the simplistic world. A robot in a room. You start at the bottom left. Your goal is to get to the top right. Your possible actions are going up, down, left and right. Now this world can be deterministic, which means when you go up, you actually go up.

Or it could be non-deterministic as human life is. It's when you go up, sometimes you go right. So in this case, if you choose to go up, you move up 80% of the time. You move left 10% of the time and you move right 10% of the time. And when you get to the top right, you get a reward of +1.

When you get to the block just below it, (4, 2), you get -1. You get punished. And every time you take a step, you get a slight punishment of -0.04. Okay, so the question is, if you start at the bottom left, is this a good solution? Is this a good policy by which you exist in the world?

And it is if the world is deterministic. If whenever you choose to go up, you go up. Whenever you choose to go right, you go right. But if the actions are stochastic, that's not the case. In what I described previously with 0.8 up and probability of 0.1 going left and right, this is the optimal policy.

Now, suppose we punish every single step with a -2 as opposed to -0.04. Every time you take a step, it hurts. You're going to try to get to a positive block as quickly as possible. And that's what this policy says: I'll walk through a -1 if I have to, as long as I stop getting a -2.

Now, if the reward for each step is -0.1, you might choose to go around that -1 block. Slight detour to avoid the pain. And then you might take an even longer detour as the reward for each step goes up or the punishment goes down, I guess. And then if there's an actual positive reward for every step you take, then you'll avoid going to the finish line.

You'll just wander the world. We saw that with the Coast Runners boat yesterday, the boat that chose not to finish the race because it was having too much fun getting points in the middle. So, let's look at how the agent evaluates the world it's operating in: the value function. That value function depends on a reward.

The reward that comes in the future. And that reward is discounted, because the world is stochastic. We can't expect the reward to come along to us in the way that we hope it does based on the policy, based on the way we choose to act. And so there's a gamma there that, as the reward is farther and farther into the future, discounts that reward, diminishes the impact of that future reward in your evaluation of the current state.
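
Written out in the usual notation (a sketch assuming the standard convention of a discount factor gamma between 0 and 1), that discounted future reward is:

```latex
R_t = r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k}
```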

And so your goal is to develop a strategy that maximizes the discounted future reward, this discounted sum. And in reinforcement learning, there are a lot of approaches for coming up with a good policy, a near-optimal or optimal policy. There's a lot of fun math there. You can try to construct a model that optimizes some estimate of this world.

You can try in a Monte Carlo way to just simulate that world and see how it unrolls. And as it unrolls, you try to compute the optimal policy. Or what we'll talk about today is Q-learning. It's an off-policy approach where the policy is estimated as we go along. The policy is represented as a Q-function.

The Q-function is shown there on the left. I apologize for the equations; I lied, there'll be some equations. The input to the Q-function is a state at time t, s_t, and an action that you choose to take in that state, a_t. And your goal is, in that state, to choose an action which maximizes the reward you get going forward.

And what Q-learning does, and I'll describe the process, is it's able to approximate through experience the optimal Q-function. The optimal function that tells you how to act in any state of the world. You just have to live it. You have to simulate this world. You have to move about it.

You have to explore in order to see every possible state, try every different action, get rewarded, get punished, and figure out what is the optimal thing to do. That's done using this Bellman equation update. On the left, the output is the new estimate, the Q-function estimate for that state and action.

And this is the update rule at the core of Q-learning. You take the old estimate and, based on the learning rate, alpha from 0 to 1, update the evaluation of that state using the new reward you received at that time. So you've arrived in a certain state, s_t, you try an action, you get a certain reward, and you update your estimate of that state and action pair based on this rule.
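
In standard notation (a reconstruction of the slide's equation, assuming the usual symbols: learning rate alpha, discount gamma), the update is:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \Big]
```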

When the learning rate is 0, you don't learn. When alpha is 0, you never change your worldview based on the new incoming evidence. When alpha is 1, you completely change your evaluation of the world every time based on the new evidence. And that's the key ingredient to reinforcement learning. First you explore, then you exploit.

First you explore in a non-greedy way and then you get greedy. You figure out what's good for you and you keep doing it. So if you want to learn an Atari game, first you try every single action, every state, you screw up, get punished, get rewarded and eventually you figure out what's actually the right thing to do and you just keep doing it.

And that's how you win against the greatest human players in the world in a game of Go, for example, as we'll talk about. And the way you do that is you have an epsilon greedy policy that over time with a probability of 1-epsilon, you perform an optimal greedy action.

With a probability of epsilon, you perform a random action. Random action being explore. And so as epsilon goes down from 1 to 0, you explore less and less. So the algorithm here is really simple on the bottom of the slide there. It's the algorithm version, the pseudocode version of the equation, the Bellman equation update.

You initialize your estimate of state-action pairs arbitrarily, with random numbers. Now this is an important point. When you start playing, or living, or whatever you're doing with reinforcement learning or driving, you have no preconceived notion of what's good and bad. It's random, or however you choose to initialize it.

And the fact that it learns anything is amazing. I want you to remember that. That's one of the amazing things about the Q-learning at all and then the deep neural network version of Q-learning. The algorithm repeats the following step. You step into the world, observe an initial state. You select an action A.

So that action, if you're exploring, will be a random action. If you're greedily pursuing the best action you can, it will be the action that maximizes the Q function. You observe a reward after you take the action, and a new state that you find yourself in. And then you update your estimate of the previous state you were in, having taken that action, using that Bellman equation update.
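
Putting those steps together, here is a minimal sketch of tabular Q-learning with an epsilon-greedy policy. The environment interface (env.reset, env.step, env.actions, env.randomAction) is a hypothetical stand-in, not anything from the course code:

```javascript
function qLearning(env, numEpisodes, alpha, gamma, epsilonDecay) {
  const Q = {};                               // Q[state][action], filled in lazily
  const q = (s, a) => (Q[s] && Q[s][a]) || 0; // arbitrary (zero) initial estimates

  let epsilon = 1.0;                          // start out fully exploratory
  for (let ep = 0; ep < numEpisodes; ep++) {
    let state = env.reset();                  // observe an initial state
    let done = false;
    while (!done) {
      // Epsilon-greedy: random action with probability epsilon, else the greedy one.
      const action = Math.random() < epsilon
        ? env.randomAction()
        : env.actions.reduce((best, a) => (q(state, a) > q(state, best) ? a : best));

      // Carry out the action, observe the reward and the new state.
      const { nextState, reward, terminal } = env.step(action);

      // Bellman update: move the old estimate toward reward + gamma * max_a' Q(s', a').
      const maxNext = Math.max(...env.actions.map(a => q(nextState, a)));
      Q[state] = Q[state] || {};
      Q[state][action] = q(state, action) +
        alpha * (reward + gamma * maxNext - q(state, action));

      state = nextState;
      done = terminal;
    }
    epsilon *= epsilonDecay;                  // explore less and less over time
  }
  return Q;
}
```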

And you repeat this, over and over. And so there on the bottom of the slide is a summary of life. Yes, question? The question was, is the Q function a single value? And yes, it's just a single continuous value. The next question was, how do you model the world?

So, how do you model the world? Let's start with this very simplistic world of the Atari paddle. You could model it as a paddle that can move left and right, there are some blocks, and you model the physics of the ball. That requires a lot of expert knowledge of that particular game. So you sit there hand-crafting this model.

That's hard to do even for a simplistic game. The other model you could take is looking at this world in the way that humans do visually. So take the model in as a set of pixels. Just the model is all the pixels of the world. You know nothing about paddles or balls or physics or colors and points.

They're just pixels coming in. That seems like a ridiculous model of the world, but it seems to work for Atari. It seems to work for human beings. When you're born, there's light coming into your eyes, and as far as we know, you don't come with an instruction manual.

You don't know that there are people in the world, that there are good guys and bad guys, or how to walk. No, all you get is light, sound, and the other senses. Every single thing you think of as the way you model the world is a learned representation.

And we'll talk about how a neural network does that. It learns to represent the world. But if we have to hand-model the world, it's an impossible task; if we have to hand-model the world, then that world had better be a simplistic one. That's a great question.

So the question was, what is the robustness of this model? If the way you represent the world is even slightly different from the way you thought that world is, what happens? That's not that well studied, as far as I'm aware. I mean, it's already amazing that if you have a certain model of the world as input, you can learn anything at all.

The question is, and it's an important one, and we'll talk a little bit about it, not about the world model but about the reward function. If the reward function is slightly different, if the real reward function of life or of driving or of Coast Runners is different than what you expected it to be.

What's the negative there? Yeah, it could be huge. So, there's another question or no? Never mind. Yep. Sorry, can you ask that again? Yes, you can change it over. So the question was, do you change the alpha value over time? And you certainly should change the alpha value over time.

So the question was, what is the complex interplay of the epsilon function with the Q-learning update? That's 100% fine-tuned, hand-tuned to the particular learning problem. So you certainly want to, the more complex, the larger the number of states in the world and the larger the number of actions, the longer you have to wait before you decrease the epsilon to zero.

But you have to play with it. It's one of the parameters you have to play with, unfortunately, and there's quite a few of them, which is why you can't just drop a reinforcement learning agent into the world. Oh, the effect in that sense. No, no, it's just a coin flip.

And if that epsilon is 0.5, half the time you're going to take a random action. So no, there's nothing specific; it's not like you'll take the best action and then with some probability take the second best and so on. I mean, you could certainly do that, but in the simple formulation that works, you just take a random action, because you don't want to have a preconceived notion of what's a good action to try when you're exploring.

The whole point is you try crazy stuff, if it's a simulation. So, good question. So, representation matters. This is the question of how we represent the world. We can think of this world of Breakout, for example, this Atari game, as a paddle that moves left and right and the exact positions of the different things it can hit, and construct this complex, expert-driven model that has to be fine-tuned to this particular problem.

But in practice, the more complex this model gets, the worse that Bellman equation update works: trying to construct the Q function for every single combination of states and actions becomes too difficult, because that table is too sparse and huge. So instead you can think of looking at this world in a general way, in the way human beings would: visually, as a collection of pixels.

If you just take in pixels, this game is a collection of 84 by 84 pixels, an image, an RGB image. And then you look at not just the current image, but the temporal trajectory of those images. So if there's a ball moving, you want to know about that movement.

So you look at four images: the current image and three images back. And say they're grayscale with 256 gray levels. The size of the Q table that the Q-value function has to learn is then on the order of 256 to the power of 84 × 84 × 4 states; whatever that number is, it's certainly larger than the number of atoms in the universe.

That's a large number. So you have to run the simulation long enough to touch at least a few times most of the states in that Q table. So, as Elon Musk says, you know, maybe we live in a simulation. You may have to run a whole universe just to compute the Q function in this case.

So that's where deep learning steps in. Instead of storing the world in a Q table, you try to learn that function. And so the takeaway from supervised learning, if you remember, is that it's good at memorizing data. The hope for reinforcement learning, with Q-learning, is that we can extend the occasional rewards we get to generalize over the actions we take in that world leading up to the rewards.

And the hope for deep learning is that we can move this reinforcement learning system into a world that doesn't need to be, that can be defined arbitrarily, can include all the pixels of an Atari game, can include all the pixels sensed by a drone or a robot or a car.

But still it needs a formalized definition of that world, which is much easier to do when you're able to take in sensors like an image. So, deep Q-learning, the deep version. Instead of keeping a Q table, in estimating that Q*, that optimal Q function, we try to learn it using machine learning.

So it tries to learn some parameters. This huge complex function, we try to learn it. And the way we do that is we have a neural network, the same kind that I showed that learned the numbers to map from an image to a classification of that image into a number.

The same kind of network is used to take in a state and action and produce a Q value. Now here's the amazing thing, that without knowing anything in the beginning, as I said with a Q table, it's initialized randomly. The Q function, this deep network knows nothing in the beginning.

All it knows is in the simulated world, the rewards you get for a particular game. So you have to play time and time again and see the rewards you get for every single iteration of the game. But in the beginning it knows nothing. And it's able to learn to play better than human beings.

This is the DeepMind paper "Playing Atari with Deep Reinforcement Learning" from 2013. It's one of the key things that got everybody excited about the role of deep learning in artificial intelligence: using a convolutional neural network, which I'll talk about tomorrow, but it's a vanilla network like any other, like I talked about earlier today.

Just a regular network that takes the raw pixels, as I said, and estimates that Q function from the raw pixels and is able to play on many of those games better than a human being. And the loss function that I mentioned previously. So again, very vanilla loss function, very simple objective function.

The first one you'll probably implement; we have a tutorial in TensorFlow. Squared error. So we take this Bellman equation, where the target for the Q-function estimate of a state and action is the reward plus the maximum value you estimate for any of the actions you can take from the next state.

And you take that action, observe the result of that action, and if the target is different from your learned estimate, if what the function expected the reward to be is different than what you actually got, you adjust it. You adjust the weights in the network.
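
As an equation (a reconstruction in the usual DQN notation, with network parameters theta assumed), that squared-error objective for one transition is:

```latex
L = \frac{1}{2}\Big[\underbrace{r + \gamma \max_{a'} Q(s', a'; \theta)}_{\text{target}} \;-\; \underbrace{Q(s, a; \theta)}_{\text{prediction}}\Big]^2
```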

And this is exactly the process by which we learn how to exist in this pixel world. So you're mapping states and actions to a Q value. The algorithm is as follows. This is how we train it. We're given a transition, S, current state, action taken in that state, R, the reward you get, and S' is the state you find yourself in.

And so we replace the basic update rule in the previous pseudocode by doing a forward pass through the network, given that state S. We look at what the predicted Q value of that action is. We then do another forward pass through that network for the next state, and see what we actually get.

And then if we're totally off, we punish, we back-propagate through the weights in a way that next time we'll make less of that mistake. And you repeat this process. And this is a simulation; you're learning against yourself. And again, the same rule applies here: exploration versus exploitation. You start out with an epsilon of 1.

You're mostly exploring and then you move towards an epsilon of 0. And with Atari breakout, this is the DeepMind paper result. It's training epochs on the X axis. On the Y axis is the average action value and the average reward per episode. I'll show why it's kind of an amazing result, but it's messy.

Because there's a lot of tricks involved. So it's not just putting in a bunch of pixels of a game and getting an agent that knows how to win at that game. There's a lot of pre-processing and playing with the data required. So which is unfortunate because the truth is messier than the hope.

But one of the critical tricks needed is called experience replay. You're learning this big network that tries to build a model of what's good to do in the world and what's not, and you're learning as you go. With experience replay, you keep track of all the things you did.

And every once in a while, you look back into your memory and pull out some of those old experiences, the good old times, and train on those again. That's as opposed to letting the agent run itself into some local optimum, where it tries to learn a very subtle aspect of the game that, in the global sense, doesn't actually get you closer to winning the game. Very much like life.
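
A minimal sketch of such a replay memory (the capacity and interface here are illustrative, not the DeepMind implementation): a bounded buffer of past transitions that you sample from at random during training.

```javascript
class ReplayMemory {
  constructor(capacity) {
    this.capacity = capacity;
    this.transitions = [];
    this.next = 0;
  }
  // Store a (state, action, reward, nextState) transition,
  // overwriting the oldest one once the buffer is full.
  store(transition) {
    this.transitions[this.next] = transition;
    this.next = (this.next + 1) % this.capacity;
  }
  // Pull a random mini-batch of old experiences to train on again.
  sample(batchSize) {
    return Array.from({ length: batchSize }, () =>
      this.transitions[Math.floor(Math.random() * this.transitions.length)]);
  }
}
```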

So here's the algorithm, the deep Q-learning algorithm, in pseudocode. We initialize the replay memory. Again, there's this little trick that's required: keeping track of stuff that's happened in the past. We initialize the action-value function Q with random weights and observe the initial state. Then, same as before, with a probability epsilon you select a random action, you explore.

Otherwise, choose the best one based on the estimate provided by the neural network. And then carry out the action, observe the reward, and store that experience in the replay memory. And then sample random transitions from the replay memory. So with a certain probability, you bring those old times back to get yourself out of local minima.

And then you train the Q network using the difference between what you actually got and your estimate. You repeat this process over and over. So here's what you can do. After 10 minutes of training on the left, so that's very little training, what you get is a paddle that learns hardly anything and it just keeps dying.

If you look at it, it goes from 5 to 4 to 3 to 2 to 1; those are the numbers of lives left. Then after two hours of training on a single GPU, it learns to win, you know, not die, rack up points, and keep the ball from getting past the paddle, which is great.

That's human-level performance, really, better than some humans, you know, but it still dies sometimes, so it's very human-level. And then after four hours, it does something really amazing. It figures out how to win at the game in a very lazy way, which is to drill a hole through the blocks up to the top and get the ball stuck up there, and then it does all the hard work for you.

That minimizes the probability of the ball getting past your paddle because it's just stuck in the blocks up top. So that might be something that you wouldn't even figure out to do yourself. And that's an... I need to sort of pause here to clearly explain what's happening. The input to this algorithm is just the pixels of the game.

It's the same thing that human beings take in when they take the visual perception and it's able to learn under this constrained definition of what is a reward and a punishment. It's able to learn to get a high reward. That's general artificial intelligence. A very small example of it but it's general.

It's general purpose. It knows nothing about games. It knows nothing about paddles or physics. It's just taking in the sensory input of the game. And they did the same thing for a bunch of different games in Atari. And what's shown here in this plot on the X-axis is a bunch of different games from Atari, and on the Y-axis is a percentage where 100% is about the best that human beings can do.

Meaning it's the score that human beings would get. So there's a point about there in the middle; everything to the left of that is far exceeding human-level performance, and below that it's on par with or worse than human-level performance. So it can learn so many of these, boxing, pinball, all of these games, and it doesn't know anything about any of the individual games.

It's just taking in pixels. It's just as if you put a human being in front of any of these games and asked them to learn to beat the game. And there have been a lot of improvements on this algorithm recently. Yes, question. (inaudible) No, no, there's no... So the question was, do they customize the model for a particular game?

And no. You could, of course, but the point is it doesn't need to be customized for the game. But the important thing is that it's still only on Atari games. Right, so there's the question of whether this is transferable to driving. Perhaps not. Right, you play the game. Well, you do.

No, you don't have the... Well yeah, you play one step of the game. So you take an action in a state and then you observe the result. So you have the simulation. I mean, that's really one of the biggest problems here: you require the simulation in order to get the ground truth.

(inaudible) So that's a great question or comment. The comment was that for a lot of these situations, the reward function might not change at all depending on your actions. The rewards are really, most of the time, delayed 10, 20, 30 steps down the line. Which is why it is amazing that this works at all.

That it's learning locally and through that process of simulation of hundreds of thousands of times it runs through the game, it's able to learn what to do now such that I get a reward later. If you just pause, look at the math of it, it's very simple math, and look at the result, it's incredible.

So there are a lot of improvements. This one is called the General Reinforcement Learning Architecture, or Gorila. The cool thing about this, in the simulated world at least, is that you can run deep reinforcement learning in a distributed way. You can do the simulation in a distributed way, and you can do the learning in a distributed way.

You can generate experiences, which is what this kind of diagram shows, either from human beings or from simulation. So for example, the way that AlphaGo, the DeepMind team, beat the game of Go is by learning both from expert games and by playing itself. So you can do this in a distributed way, and you can do the learning in a distributed way, so you can scale.

And in this particular case, Gorila achieved a better result than the DQN network; that's part of their Nature paper. Okay, so let me now get to driving for a second here. Where can reinforcement learning step in and help? So this is back to the open question that I asked yesterday.

Is driving closer to chess or to everyday conversation? Chess meaning it can be formalized in a simplistic way and we could think about it as an obstacle avoidance problem and once the obstacle avoidance is solved, you just navigate that constrained space. You choose to move left, you choose to move right in a lane, you choose to speed up or slow down.

Well, if it's a game like chess, which we'll assume for today as opposed to for tomorrow, for today we're going to go with the one on the left and we're going to look at deep traffic. Here's this game, a simulation, where the goal is to achieve the highest average speed you can on this seven lane highway full of cars.

And so, as a side note for students, the requirement is they have to follow the tutorial that I'll present a link for at the end of this presentation. And what they have to do is achieve a speed, build a network that achieves a speed of 65 miles an hour or higher.

There is a leaderboard and you get to submit the model you come up with with a simple click of a button. So all of this runs in the browser, which is also another amazing thing. And then you immediately or relatively so, make your way up the leaderboard. So let's look, let's zoom in.

What is this two-dimensional world of traffic? What does it look like to the intelligent system? We discretize that world into a grid, shown here on the left. That's the representation of the state. There are seven lanes, and every single lane is broken up into blocks spatially. And if there's a car in that block (the length of a car is about three blocks, three of those grid blocks), then that block is seen as occupied.

And then the red car is you. That's the thing the intelligent agent is running in. On the left is the current speed of the red car. It actually says MIT on top. And then you also have a count of how many cars you've passed. And if your network sucks, then that number is going to get to be negative.

You can also change the simulation speed with a drop-down, from normal on the left to fast on the right. Fast speeds up the replay of the simulation. The one on the left, normal, feels a little more like real driving. There's a drop-down for different display options.

The default is none in terms of stuff shown on the road. Then there is the learning input: while the whole space is discretized, you can choose what your car sees. You can choose how far ahead it sees, how far behind, and how far to the left and right.

And so by choosing to visualize the learning input, you get to see what you set that input to be. Then there is the safety system. This is a system that protects you from yourself. If you drive and you have adaptive cruise control in your car, this game operates in a similar way.

When it gets close to the car in front, it slows down for you, and it doesn't let you run into the car to the left of you or to the right of you, or drive off the road. So it constrains the movement capabilities of your car in such a way that you don't hit anybody, because then we would have to simulate collisions and it would just be a mess.

So it protects you from that, and you can choose to visualize that "safety system" with the visualization box. And then you can also choose to visualize the full map. This is the full occupancy map that you can, if you would like to, provide as input to the network.

Now, that input, for every single grid block, is a number. It's not just a zero or one for whether there's a car in there. The maximum speed limit is 80 miles per hour; don't get crazy, 80 miles an hour is the speed limit. A block, when it's empty, is set to 80 miles an hour.

And when it's occupied, it's set to the speed of the car in it. And then the blocks that the red car is occupying are set to a very large number, much higher than the speed limit. The safety system, here shown in red, marks the parts of the grid that your car can't move into.
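
To recap that encoding, here is a rough sketch of how such a grid input could be built. The exact values and data structures in deep-traffic.js may differ; the sentinel value for your own car is an assumption, chosen only to be much higher than the speed limit.

```javascript
const MAX_SPEED = 80;    // value for an empty block: the speed limit
const SELF_MARKER = 200; // value for blocks under the red car (assumed sentinel)

// lanes x patches grid, flattened into the array fed to the network's input layer.
// carsByBlock: Map from "lane,patch" to that car's speed; selfBlocks: Set of "lane,patch".
function encodeGrid(lanes, patches, carsByBlock, selfBlocks) {
  const grid = [];
  for (let lane = 0; lane < lanes; lane++) {
    for (let patch = 0; patch < patches; patch++) {
      const key = lane + ',' + patch;
      if (selfBlocks.has(key))       grid.push(SELF_MARKER);          // your own car
      else if (carsByBlock.has(key)) grid.push(carsByBlock.get(key)); // occupied: that car's speed
      else                           grid.push(MAX_SPEED);            // empty block
    }
  }
  return grid;
}
```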

Question? What's that? Yes. The question was, what was the third value I just mentioned? It's the red car itself: you yourself, the blocks underneath that car, are set to a really high number. It's a way for the learning algorithm to know that these blocks are special.

So the safety system shows red here if the car can't move into those blocks. When it lights up red in front of the car, it means the car can't speed up anymore. And when the blocks to the left or to the right light up as red, that means you can't change lanes to the left or right.

On the right of the slide, you're free to go, free to do whatever you want. That's what it indicates when all the blocks are yellow: the safety system says you're free to choose any of the five actions. And the five actions are move left, move right, stay in place, accelerate, or slow down.

And the action is what's produced by what's called here the brain. The brain takes in the current state as input and the last reward, and uses that reward to train the network through the backward function there, back propagation. And then you ask the brain, given the current state, to give you the next action with a forward pass, the forward function.
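
A sketch of that learn-and-act cycle. The backward and forward calls follow the lecture's description of the brain object; treat the exact API as an assumption rather than the definitive deep-traffic.js code.

```javascript
// One simulation step: train on the last reward, then pick the next action.
function learnAndAct(brain, currentState, lastReward) {
  brain.backward(lastReward);         // back propagation driven by the reward
  return brain.forward(currentState); // forward pass returns the chosen action
}
```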

You don't need to know the operation of this function in particular. This is not something you need to worry about, but you can if you want. You can customize this learning step. By the way, what I'm describing now is just a few lines of code right there in the browser.

You can change it and immediately, well, with the press of a button, change the simulation or the design of the network. You don't need to have any special hardware, you don't need to do anything special. And the tutorial cleanly outlines exactly all of these steps. But it's kind of amazing that you can design a deep neural network that's part of the reinforcement learning agent.

So it's a deep Q-learning agent right there in the browser. You can choose the lane side variable, which controls how many lanes to the side you see. When that value is zero, you only look forward. When that value is one, you have one lane to the left and one lane to the right.

It's really the lateral radius of your perception system. Patches ahead is how far ahead you look; patches behind is how far behind you look. And so for example here, lane side equals two, which means it looks two lanes to the left and two to the right. Obviously, if two to the right is off the road, it provides a value of zero in those blocks.

If we set patches behind to ten, it looks ten patches back, with the patches counted starting from the front of the car. The scoring for the evaluation, for the competition, is your average speed over a predefined period of time. The method we use to collect that speed is we run the agent for ten runs, about 30 simulated minutes of game each, and take the median speed of the ten runs.

That's the score. This is done server-side, and given that this code has recently, unfortunately, gotten some publicity online, this might be a dangerous thing to say: no cheating is possible. Because the evaluation is done server-side, and this is JavaScript running in the browser, it's hopefully sandboxed so you can't do anything tricky.

But we dare you to try. You can try it locally to get an estimate. And there's a button that says evaluate and it gives you a score right back of how well you're doing with the current network. That button is start evaluation run. You press the button, it does a progress bar and it gives you the average speed.

There's a code box where you modify all the variables I mentioned, and the tutorial describes this in detail. Once you're ready, you modify a few things and press apply code. It restarts: it kills, or resets, all the training that you've done up to this point and starts the training again.

So save often, and there's a save button. The training is done on a separate thread in web workers, which are exciting things that, amazingly, allow JavaScript to run on multiple CPU cores in a parallel way. So the simulation used for training runs a lot faster than real time.

A thousand frames a second, a thousand movement steps a second. This is all in JavaScript. And the network gets shipped to the main simulation from time to time as the training goes on. So all you have to do is press run training, and it trains, and the car behaves better over time.

Maybe I'll actually show it in the browser. Let's see if it works. Well, is it going to mess up? We're good. Why is it being weird? (silence) What could possibly go wrong? (silence) So here's the game. When it starts, this is running live in the browser. (silence) Artificial intelligence, ladies and gentlemen, in the browser, a neural network.

So currently it's not very good. It's driving at two miles an hour and watching everybody pass. What's being shown live is the loss function, which is pretty poor. So in order to train, like I said, at a thousand frames a second, you just press the run training button. And pretty quickly, based on the network you specify in the code box, and based on the input and all the things that I mentioned, training finishes.

It learns how to do a little better. We on purpose put in a network that's not very good in there. So right now it won't, on average, be doing that well, but it does better than standing there in place. And then you can do the start evaluation run to simulate the network much faster than real time, to see how well it does.

This is a similar evaluation step that we take when determining where you stand on the leaderboard. The current average speed in that 10 run simulation is 56.56 miles per hour. Now I may be logged in, maybe not. If you're logged in, you click submit your code. If you're not logged in, it says you're not logged in, please log in to submit your code.

And then all you have to do is log in. This is the most flawless demo of my life. And then you press submit model again and success. Oh man. Thank you for your submission. And so now my submission is entered as Lex in the leaderboard, and my 56.56 or whatever it was.

So I dare all of you to try to beat that. So to, as you play around with stuff, if you want to save the code, you could do so by pressing the save code button. That saves the various JavaScript configurations, and that saves the network layout to file. And then you can load from file as well.

Again, danger: it overwrites the code for you. And you press the submit button to submit the model to the competition. Make sure that you train the network; we don't train it for you. You submit a model, and you have to press train first. And it gets evaluated in time.

It enters a queue to get evaluated. This is public facing, so the queue can grow pretty big. It goes into that queue, gets evaluated, and then, depending on where you stand, you get added to the leaderboard, which shows the top 10 entries. You can resubmit often, and only the highest score counts.

Okay, we're using an implementation of neural networks done in just JavaScript by Andrej Karpathy, from Stanford, now at OpenAI. ConvNetJS is the library. And what's being visualized there, and also in the game, is the inputs to the network. In this case, it's 135 inputs. You can specify not just how far ahead, behind, and to the left and right you see, but also how far back in time you look.

And so, what's visualized there is the input to the network, 135 neurons, and then the output, a regression, similar to the kind of output we saw with the numbers, where there were 10 outputs saying whether it's a 0, a 1, up through a 9. Here, the output is one of the five actions: left, right, stay in place, speed up, or slow down.

In the ConvNetJS settings, you can select the number of inputs, if you want to mess with this stuff; it's all stuff you don't need to mess with, because we already give you the variables of lane side, patches ahead, and so on. You can select the number of actions, the temporal window, and the network size.

So, the network definition here starts with the input, the size of the input. Again, all of this is in the tutorial; this is just a little outline. The first fully connected layer has 10 neurons with ReLU activation functions, the same kind of activation idea we talked about before, and there's a regression layer for the output.
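
Roughly, that definition looks like the following in ConvNetJS (a sketch; the exact default in the deep traffic code box may differ, and num_inputs / num_actions stand in for the values computed from the variables above):

```javascript
var layer_defs = [];
layer_defs.push({type: 'input', out_sx: 1, out_sy: 1, out_depth: num_inputs}); // size of the input
layer_defs.push({type: 'fc', num_neurons: 10, activation: 'relu'});            // fully connected layer, 10 neurons, ReLU
layer_defs.push({type: 'regression', num_neurons: num_actions});               // regression layer over the five actions
```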

And there's a bunch of other messy options that you can play with, if you dare. But the ones I mentioned before are really the important ones: selecting the number of layers and the size of those layers. You get to build your very own neural network that drives.

And the actual learning is done with back propagation, and then the action is returned by doing a forward pass through the network. In case you're interested in this kind of stuff, there's an amazingly cool code editor, the Monaco editor. It just works, it does some auto-completion, so you get to play with it; it makes everything very convenient in terms of code editing.

A lot of the visualization of this game and the simulation, and the simulation we'll talk about tomorrow, is done in the browser using HTML5 Canvas. So here is a simple specification of a blue box with Canvas; it's very efficient and easy to work with. And the thing that a lot of us are excited about, a very subtle one: with the V8 engine, JavaScript has become super fast.

You can train neural networks in the browser; that's already amazing. And then with web workers, as long as you have Chrome or another modern browser, you can run multiple processes in separate threads. So you can do a lot of stuff: you can do visualization separately, and you can train in separate threads.
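
A minimal sketch of that pattern, assuming a worker file name and a visualization hook that are purely illustrative:

```javascript
// Main thread: spawn a worker to train, keep rendering here.
const trainer = new Worker('train-worker.js');     // file name assumed
trainer.postMessage({ command: 'run-training' });
trainer.onmessage = (event) => {
  // Periodically receive an updated network snapshot to drive the visualization.
  updateVisualization(event.data.networkSnapshot); // hypothetical hook
};

// Inside train-worker.js (separate thread), something like:
// onmessage = (event) => {
//   if (event.data.command === 'run-training') {
//     /* training loop; postMessage snapshots back to the main thread */
//   }
// };
```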

It's very cool. Okay, so the tutorial is at cars.mit.edu, under deep traffic. We won't put these links on the website for a little bit, because we got put on the front page of Hacker News, and we don't want these to leak out, especially with the claims that you can't cheat.

And while it's pretty efficient in terms of running time, since everything is running on your machine, client-side, you still have to pull some images and some of the code. So the tutorial is on cars.mit.edu/deep-traffic and the simulation is deep-traffic.js, so cars.mit.edu/deep-traffic.js. I encourage you to go there, play with the network, submit your code, and win the very special prize.

It is a pretty cool one, but we're still working on it. There is a prize, I swear. All right, so let's take a pause and think about what we talked about today. So the very best of deep reinforcement learning, the most exciting accomplishment, I think, involves the game of Go.

When I first started as a freshman and took intro to artificial intelligence, it was said that Go is a game that's impossible for machines to beat because of the combinatorial complexity, the sheer number of options. It's so much more complex than chess. And so the most amazing accomplishment of deep reinforcement learning, to me, is the design of AlphaGo, when for the first time the world champion in Go was beaten by DeepMind's AlphaGo.

And the way they did it, and this is, I think, very relevant to driving, is you start, first in a supervised way, by training a policy network. So you take expert games to construct a network first. The agent doesn't start by playing against itself; it learns from expert games.

So there's some human ground truth, and this human ground truth represents reality. For driving, this is important. We're starting to get a lot of data where video of drivers is being recorded. So we can learn on that data before we then run the agents through simulation, where they learn from data sets that are orders of magnitude larger.

And they did just that. Now, as a reminder of what happens when you let an agent drive itself: this is probably one of my favorite videos of all time; I just recently saw it, and I could watch it for hours. But it's a reminder that you can't trust your first estimate of a reward function

to be one that is safe and productive for our society when you're talking about an intelligent system that gets to operate in the real world. This is as clear a reminder of that as there is. So again, all the references are available online for these slides. We'll put up the slides.

If you have questions about either Docker or JavaScript, you can come down and talk to us. So the question was, what is the visualization you're seeing in deep traffic? You're seeing a car move about; how is it moving? It's moving based on the latest snapshot of the network you trained.

So it's just visualizing for you just for fun the network you trained most recently. Okay, so if people have questions, stick around afterwards. Just details on Docker and yeah. Do you want to do it offline? Want me to tuck up?