MIT 6.S094: Deep Reinforcement Learning for Motion Planning
Chapters
0:00 Intro
0:43 Types of machine learning
5:48 Perceptron: Weighing the Evidence
7:34 Perceptron: Implement a NAND Gate
10:06 Perceptron NAND Gate
11:01 The Process of Learning: Small Change in Weights → Small Change in Output
13:02 Combining Neurons into Layers
13:49 Task: Classify an Image of a Number
19:45 Philosophical Motivation for Reinforcement Learning
21:20 Agent and Environment
24:55 Markov Decision Process
25:30 Major Components of an RL Agent
26:53 Robot in a Room
27:57 Is this a solution?
28:22 Optimal policy
28:42 Reward for each step: -2
29:11 Reward for each step: +0.01
30:15 Value Function
31:19 Q Learning
35:03 Exploration vs Exploitation
36:07 Q-Learning: Value Iteration
38:03 Q-Learning: Representation Matters
45:48 Philosophical Motivation for Deep Reinforcement Learning
48:00 Deep Q-Network: Atari
49:24 Deep Q-Network Training
52:18 Atari Breakout
53:53 Deep Q-Learning Algorithm
00:00:00.000 |
All right. Hello everybody. Welcome back. Glad you came back. 00:00:04.120 |
Today, we will unveil the first tutorial, the first project. 00:00:13.580 |
This is Deep Traffic, code named Deep Traffic, 00:00:17.220 |
where your task is to solve the traffic problem using deep reinforcement learning. 00:00:23.820 |
And I'll talk about what's involved in designing a network there. 00:00:33.900 |
As I said, the winner gets a very special prize to be announced later. 00:00:47.100 |
As I mentioned yesterday, that's what's usually meant when people talk 00:00:52.480 |
about machine learning and talk about its successes. 00:00:56.320 |
Supervised learning requires a data set where you know the ground truth. 00:01:03.860 |
And you provide that to a machine learning algorithm in order to learn the mapping 00:01:11.840 |
between the inputs and the outputs in such a way that you can generalize. 00:01:22.340 |
Unsupervised learning is when you know absolutely nothing about the outputs, 00:01:28.120 |
about the truth of the data that you're working with. 00:01:37.760 |
And your task is to figure out the underlying representation of the data that's meaningful 00:01:41.600 |
for you to accomplish a certain task, whatever that is. 00:01:46.500 |
There's semi-supervised learning, where only part of the data, usually a very small amount, is labeled. 00:01:54.940 |
There's ground truth available for just a small fraction of it. 00:01:59.280 |
If you think of images that are out there on the Internet, 00:02:03.440 |
and then you think about ImageNet, a data set where every image is labeled, 00:02:08.080 |
the size of that ImageNet data set is a tiny subset of all the images out there. 00:02:18.500 |
But that's the task we're dealing with as human beings, 00:02:23.300 |
as people interested in doing machine learning: 00:02:31.340 |
generalizing from the part of our data that we know something confidently about. 00:02:37.480 |
And reinforcement learning sits somewhere in between. 00:02:46.460 |
Where there's an agent that has to exist in the world. 00:02:52.660 |
And that agent knows the inputs that the world provides, 00:02:59.460 |
but knows very little about that world except through occasional time-delayed rewards. 00:03:17.820 |
And every once in a while, you find out that all that stuff you did last week was a pretty bad idea. 00:03:27.220 |
That's semi-supervised in the sense that only a small subset of the data comes with some ground truth, 00:03:35.500 |
some certainty that you have to then extract knowledge from. 00:03:41.200 |
So first, at the core of anything that works currently, 00:03:45.900 |
in terms of a practical sense, there has to be some ground truth. 00:03:50.900 |
There has to be some truth that we can hold on to as we try to generalize. 00:03:59.440 |
Even in reinforcement learning, the only thing we can count on is that truth. 00:04:07.380 |
So the standard supervised learning pipeline is you have some raw data, the inputs. 00:04:13.340 |
You have ground truth, the labels, the outputs that match the inputs. 00:04:20.380 |
Then you run any kind of algorithm, whether that's a neural network 00:04:26.360 |
or another processing algorithm, that extracts the features from that data set. 00:04:34.460 |
That algorithm could extract the nose, the eyes, the corners of the eyes, the pupil, if you're working with faces, for example. 00:04:46.560 |
After that, we insert those features into a model, a machine learning model. 00:05:01.040 |
Then, whatever that algorithm is, as we pass it through that training process, we ask: 00:05:08.040 |
After we've seen this one particular example, how much better are we at other tasks? 00:05:16.840 |
And as we repeat this loop, the model learns to perform better and better 00:05:23.720 |
at generalizing from the raw data to the labels that we have. 00:05:29.180 |
And finally, you get to release that model into the wild to actually do prediction 00:05:34.580 |
on data it has never seen before, that you don't know about. 00:05:48.020 |
Okay, so neural networks are what this class is about. 00:05:56.920 |
It's one of the machine learning algorithms that has proven to be very successful. 00:06:01.120 |
And the building block, the computational building block of a neural network is a neuron. 00:06:13.760 |
The perceptron is the original old-school neuron, where the output is binary, a zero or a one. 00:06:25.360 |
And the process that a perceptron goes through is it has multiple inputs and a single output. 00:06:35.120 |
Each of the inputs has a weight on it, shown here on the left as 0.7, 0.6, 1.4. 00:06:45.160 |
Those weights are applied to the inputs, and in a perceptron the inputs are ones or zeros, binary. 00:06:54.160 |
The weighted inputs are then summed together. 00:06:59.320 |
A bias on each neuron is then added on top, and there's a threshold. 00:07:10.560 |
There's a test whether that summed value plus the bias is below or above a threshold. 00:07:17.060 |
If it's above a threshold, it produces a one. If it's below a threshold, it produces a zero. Simple. 00:07:23.760 |
It's one of the only things we understand about neural networks confidently. 00:07:28.260 |
We can prove a lot of things about this neuron. 00:07:30.860 |
For example, what we know is that a neuron can approximate a NAND gate. 00:07:44.960 |
A NAND gate is a logical operation, a logical function that takes its inputs. 00:07:54.060 |
It has two inputs, A and B, here on the diagram on the left. 00:08:03.460 |
When the inputs are both zeros, or a 0 and a 1 in either order, the output is a one. Only when both inputs are ones is the output a zero. 00:08:14.260 |
The cool thing about a NAND gate is that it's a universal gate. 00:08:23.460 |
Your phone in your pocket today can be built out of just NAND gates. 00:08:31.360 |
You can build any logical function out of them if you stack them together in arbitrary ways. 00:08:36.560 |
The problem with NAND gates and computers is they're built from the bottom up. 00:08:43.360 |
You have to design these circuits of NAND gates. 00:08:45.360 |
So the cool thing here is with the Perceptron, we can learn this magical NAND gate. 00:08:54.660 |
So let's go through how we can do that, how a Perceptron can perform the NAND operation. 00:09:08.060 |
If we put the weights of -2 on each of the inputs and a bias of 3 on the neuron, 00:09:15.460 |
and if we perform that same operation of summing the weights times the inputs plus the bias, 00:09:24.860 |
in the top left, when the inputs are both zeros, the weighted sum plus the bias gives us a 3. 00:09:35.460 |
That's a positive number, which means the output of a Perceptron will be a 1. 00:09:39.960 |
In the top right, when the input is a 0 and a 1, that sum is still a positive number. 00:09:51.160 |
When the inputs are both 1s, the sum is a -1, less than 0, so the output is a 0. 00:10:01.160 |
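To make this concrete, here is a minimal Python sketch of the perceptron NAND computation just described; the weights of -2, the bias of 3, and the threshold at zero come straight from the slide, while the function and variable names are just illustrative.

```python
# Perceptron acting as a NAND gate: weights of -2 on each input, bias of 3,
# output 1 if the weighted sum plus bias is positive, 0 otherwise.
def perceptron_nand(a, b):
    weights = (-2, -2)
    bias = 3
    total = weights[0] * a + weights[1] * b + bias
    return 1 if total > 0 else 0

# Truth table: only (1, 1) produces a 0, matching the NAND gate.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron_nand(a, b))
```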
So while this is simple, it's really important to think about. 00:10:07.760 |
It's sort of the one basic computational truth you can hold on to 00:10:16.960 |
as we talk about some of the magical things neural networks can do. 00:10:19.860 |
Because if you compare a circuit of NAND gates and a circuit of neurons, 00:10:29.760 |
the difference is that a circuit of neurons, which is what we think of as a neural network, 00:10:38.360 |
can perform the same thing as the circuit of NAND gates, 00:10:45.560 |
but it can learn the arbitrary logical functions that an arbitrary circuit of NAND gates has to be designed by hand to represent. 00:11:02.460 |
So one of the key aspects here, one of the key drawbacks of the perceptron, is the following. 00:11:14.960 |
As we change the weights on the inputs and we change the bias and we tweak it a little bit, 00:11:23.860 |
it's very easy to flip the neuron's output from a 1 to a 0, 00:11:34.360 |
So when we start stacking many of these together, it's hard to control the output of the thing as a whole. 00:11:44.860 |
Now the essential step that makes a neural network work, that a circuit of perceptrons doesn't, 00:11:54.160 |
is that the output is made smooth, is made continuous, with an activation function. 00:12:00.360 |
And so instead of using a step function like a perceptron does, shown there on the left, 00:12:10.760 |
we use some kind of smooth function, a sigmoid, where the output can change gradually. 00:12:33.260 |
And so learning is generally the process of adjusting those weights gradually 00:12:41.060 |
and seeing how it has an effect on the rest of the network. 00:12:45.360 |
You just keep tweaking weights here and there and seeing how much closer you get to the ground truth. 00:12:54.260 |
And if you get farther away, you just adjust the weights in the opposite direction. 00:13:03.260 |
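As a small illustrative sketch (not the lecture's code), a sigmoid neuron in Python shows how a tiny change in a weight now produces a tiny, rather than abrupt, change in the output; the specific weights and bias below are made up.

```python
import math

def sigmoid(z):
    # Smooth squashing function: output moves gradually between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# A small tweak to one weight nudges the output slightly,
# instead of the perceptron's hard jump from 0 to 1.
print(sigmoid_neuron([1, 0], [0.70, 0.60], -0.5))   # ~0.550
print(sigmoid_neuron([1, 0], [0.71, 0.60], -0.5))   # ~0.552
```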
What we'll mostly talk about today is feed-forward neural networks, 00:13:11.060 |
shown on the left, going from inputs to outputs with no loops. 00:13:17.860 |
There is also these amazing things called recurrent neural networks. 00:13:27.960 |
They're amazing because they have memory. They have a memory of state. 00:13:32.460 |
They remember the temporal dynamics of the data that went through. 00:13:38.360 |
But the painful thing is that they're really hard to train. 00:13:45.760 |
Today we'll talk about feed-forward neural networks. 00:13:51.860 |
So let's look at this example. An example of stacking a few of these neurons together. 00:13:59.860 |
Let's think of the basic task, now famous: the classification of numbers. 00:14:08.160 |
You have an image of a handwritten number, and your task, given that image, is to say which number it is. 00:14:21.660 |
Now what is an image? An image is a collection of pixels. In this case, 28 by 28 pixels. 00:14:26.960 |
That's a total of 784 numbers. Those numbers are from 0 to 255. 00:14:33.260 |
And on the left of the network, the size of that input, despite the diagram, is 784 neurons. 00:14:45.360 |
That's the input. Then comes the hidden layer. 00:14:51.360 |
It's called the hidden layer because it has no direct contact with the input or the output of the network. 00:15:02.060 |
It is simply a block in the middle, and the core of the computational power of neural networks is the hidden layer. 00:15:14.860 |
It's tasked with forming a representation of the data in such a way that it maps from the inputs to the outputs. 00:15:23.160 |
In this case, there are 15 neurons in the hidden layer and 10 values on the output, one for each digit. 00:15:36.060 |
There are several ways you can build this kind of network, and this is what the magic of neural networks is. 00:15:42.860 |
You can do it a lot of ways. You only really need four outputs to represent values 0 through 9. 00:15:48.860 |
But in practice, it seems that having 10 outputs works better. And how do these work? 00:15:55.560 |
Whenever the input is a 5, the output neuron in charge of the 5 gets really excited 00:16:02.760 |
and outputs a value that's close to 1, from 0 to 1. 00:16:08.060 |
Close to 1. And then the other ones, hopefully, output a value that's close to 0. 00:16:15.460 |
And when they don't, we adjust the weights in such a way that they get closer to 0 and closer to 1, 00:16:22.960 |
depending on whether it's the correct neuron associated with the picture. 00:16:26.960 |
We'll talk about the details of this training process more tomorrow when it's more relevant. 00:16:35.660 |
But what we've discussed just now is the forward pass through the network. 00:16:43.360 |
It's the pass when you take the inputs, apply the weights, sum them together, add the bias, 00:16:50.160 |
produce the output and check which of the outputs produces the highest confidence of the number. 00:16:55.860 |
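A rough sketch of that forward pass, with the 784-15-10 shapes described here; the random weights and helper names are placeholders, not the lecture's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((15, 784)), np.zeros(15)   # input layer -> 15 hidden neurons
W2, b2 = rng.standard_normal((10, 15)), np.zeros(10)    # hidden layer -> 10 output neurons

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(image):
    x = image.reshape(784) / 255.0        # 28x28 pixels with values 0-255, flattened and scaled
    hidden = sigmoid(W1 @ x + b1)         # hidden-layer activations
    output = sigmoid(W2 @ hidden + b2)    # one output per digit, 0 through 9
    return output.argmax()                # the digit with the highest activation

print(forward(rng.integers(0, 256, size=(28, 28))))
```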
Then once those probabilities for each of the numbers are provided, 00:17:02.860 |
we determine the gradient that's used to punish or reward the weights 00:17:12.260 |
that resulted in either the correct or the incorrect decisions. 00:17:18.660 |
We step backwards through the network applying those punishments or rewards. 00:17:23.160 |
Because of the smoothness of the activation functions, that is a mathematically efficient operation. 00:17:38.360 |
The ground truth for number 6 looks like the following in the slides. 00:17:47.560 |
y(x) equals a 10-dimensional vector, where only one of the values, the 6th, is a 1, and the rest are 0. 00:18:02.560 |
That's the ground truth that comes with the image. 00:18:06.060 |
The loss function here, the basic loss function is the squared error. 00:18:11.760 |
y(x) is the ground truth and a is the output of the neural network resulting from the forward pass. 00:18:21.360 |
So when you input that number of a 6 and it outputs whatever it outputs, that's A, a 10-dimensional vector. 00:18:31.260 |
And it's summed over the inputs to produce the squared error. 00:18:38.160 |
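Written out, the squared-error loss being described takes the standard form (the 1/2n normalization over the n training inputs is a common convention, assumed here rather than quoted from the slide):

$$ C(w, b) = \frac{1}{2n} \sum_x \left\lVert y(x) - a \right\rVert^2 $$

where y(x) is the one-hot ground truth for input x and a is the network's 10-dimensional output.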
The loss function, the objective function, that's what's used to determine 00:18:43.960 |
how much to reward or punish the back propagated weights throughout the network. 00:18:50.260 |
And the basic operation of optimizing that loss function, of minimizing that loss function 00:19:00.660 |
is done with various variants of gradient descent. 00:19:04.260 |
It's hopefully a somewhat smooth function, but it's a highly nonlinear function. 00:19:11.960 |
This is why we can't prove much about neural networks. 00:19:15.360 |
It's a highly high-dimensional, highly nonlinear function that's hopefully smooth enough 00:19:23.660 |
where the gradient descent can find its way to at least a good solution. 00:19:30.860 |
And there has to be some stochastic element there that jumps around 00:19:39.060 |
to ensure that it doesn't get stuck in a local minimum of this very complex function. 00:19:47.860 |
There's inputs, there's outputs, ground truth. 00:19:53.360 |
Because we're pretty confident, we know what's going on. 00:19:55.960 |
All you have to do is take this data set, 00:19:59.660 |
train a network on that data set, evaluate it, write a paper 00:20:05.260 |
and try to beat a previous paper. It's great. 00:20:07.760 |
The problem is when you then use that neural network to create an intelligent system that operates in the real world. 00:20:16.860 |
And now that system is no longer working with your data set. 00:20:22.560 |
It has to exist in this world that's maybe very different from the ground truth. 00:20:29.560 |
So the takeaway from supervised learning is that neural networks are great at memorization. 00:20:36.060 |
But in a sort of philosophical way, they might not be great at generalizing, 00:20:41.460 |
at reasoning beyond the specific flavor of data set that they were trained on. 00:20:49.260 |
The hope for reinforcement learning is that we can extend the knowledge we gain in a supervised way 00:20:56.960 |
to the huge world outside where we don't have the ground truth of how to act, 00:21:08.160 |
of how good a certain state is or how bad a certain state is. 00:21:16.760 |
And I'll talk about kind of what I mean there. 00:21:20.360 |
But it feels like it's closer to reasoning as opposed to memorization. 00:21:24.560 |
That's a good way to think of supervised learning is memorization. 00:21:30.660 |
And as many of you know, being good at memorization doesn't mean you're going to be successful in life. 00:21:36.760 |
And so a reinforcement learning agent or just any agent, 00:21:46.560 |
a human being or any machine existing in this world 00:21:51.960 |
can operate in the following way from the perspective of the agent. 00:21:57.960 |
It can execute an action, it can receive an observation 00:22:02.060 |
resulting from that action in the form of a new state, and it can receive a reward. 00:22:11.660 |
You can break down our existence in this way, simplistic view. 00:22:18.360 |
But it's a convenient one on the computational side. 00:22:23.460 |
And from the environment side, the environment receives the action, 00:22:28.860 |
emits the observation, so your action changes the world, 00:22:37.960 |
and then it tells you about it and gives you a reward or punishment for it. 00:22:42.760 |
So let's look at, again, one of the most fascinating things. 00:22:54.560 |
I'll try to convey why this is fascinating a little bit later on. 00:23:06.160 |
This is Atari Breakout, a game where a paddle has to move around. 00:23:15.960 |
The agent is the paddle, and there's a bouncing ball, 00:23:21.460 |
and your actions are to move right or move left. 00:23:25.960 |
You're trying to move in such a way that the ball doesn't get past you. 00:23:30.860 |
And so here is a human level performance of that agent. 00:23:47.060 |
This may seem obvious but moving right changes visually the state of the world. 00:23:54.860 |
In fact, what we're watching now on the slides is the world changing before your eyes for this little guy. 00:24:10.160 |
They're racking up points in the top left of the video. 00:24:14.860 |
And then when the ball gets past the paddle, it gets punished by dying, quote-unquote. 00:24:22.660 |
And that's the number of lives it has left, going from five to four to three, down to zero. 00:24:30.860 |
And so the goal is to select at any one moment the action that maximizes future reward. 00:24:38.660 |
Without any knowledge of what a reward is, in a greater sense of the word, 00:24:45.260 |
all you have is an instantaneous reward or punishment, instantaneous response of the world to your actions. 00:24:51.660 |
And this can be modeled as a Markov decision process. 00:25:01.160 |
A Markov decision process is a mathematically convenient construct. 00:25:06.460 |
It has no memory. All you get is you have a state that you're currently in, 00:25:12.460 |
you perform an action, you get a reward and you find yourself in a new state. 00:25:18.860 |
You start from state zero, you go to state one, you once again repeat an action, get a reward, go to the next state. 00:25:27.860 |
Okay, that's the formulation that we're operating in. 00:25:30.060 |
When you're in a certain state, you have no memory of what happened two states ago. 00:25:35.960 |
Everything operates on the instantaneous, on the current state. 00:25:43.360 |
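In the usual notation, which matches this description, a Markov decision process is a tuple of states, actions, transition probabilities, rewards, and a discount factor, with the memoryless (Markov) property that the next state depends only on the current state and action:

$$ \langle S, A, P, R, \gamma \rangle, \qquad P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} \mid s_t, a_t) $$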
And so what are the major components of a reinforcement learning agent? 00:25:47.260 |
There's a policy. That's the function, broadly defined, of the agent's behavior. 00:25:56.960 |
That includes the knowledge, for any given state, of what action I will take, with some probability. 00:26:09.560 |
The value function is how good each state is, and how good each action is in any particular state. 00:26:20.960 |
And there's a model. Now this is a subtle thing that is actually the biggest problem with everything you'll see today. 00:26:31.560 |
The model is how we represent the environment. 00:26:35.260 |
And what you'll see today is some amazing things that neural networks can achieve 00:26:39.760 |
on a relatively simplistic model of the world. 00:26:43.360 |
And the question is whether that model can extend to the real world, 00:26:47.260 |
where human lives are at stake in the case of driving. 00:26:50.760 |
So let's look at the simplistic world. A robot in a room. 00:26:58.660 |
You start at the bottom left. Your goal is to get to the top right. 00:27:04.260 |
Your possible actions are going up, down, left and right. 00:27:10.560 |
Now this world can be deterministic, which means when you go up, you actually go up. 00:27:19.560 |
Or it could be non-deterministic as human life is. 00:27:28.160 |
So in this case, if you choose to go up, you move up 80% of the time. 00:27:34.660 |
You move left 10% of the time and you move right 10% of the time. 00:27:38.560 |
And when you get to the top right, you get a reward of +1. 00:27:44.060 |
When you get to the block next to it, at (4, 2), you get a -1. You get punished. 00:27:49.660 |
And every time you take a step, you get a slight punishment of -0.04. 00:27:54.760 |
Okay, so the question is, if you start at the bottom left, is this a good solution? 00:28:03.060 |
Is this a good policy by which you exist in the world? 00:28:15.260 |
If the world is deterministic, whenever you choose to go right, you go right, and this works fine. 00:28:17.960 |
But if the actions are stochastic, that's not the case. 00:28:25.460 |
In what I described previously, with probability 0.8 of going up and probability 0.1 of going left or right, this is no longer an optimal policy. 00:28:42.560 |
Now, suppose we punish every single step with a -2 as opposed to a -0.04. 00:28:52.760 |
You're going to try to get to a positive block as quickly as possible. 00:29:03.160 |
I'll walk through a -1 if I have to, as long as I stop getting a -2. 00:29:23.360 |
And then you might take an even longer detour as the reward for each step goes up. 00:29:39.360 |
And then if there's an actual positive reward for every step you take, the agent never wants to finish the game at all, 00:30:10.360 |
because it was having too much fun getting points in the middle. 00:30:13.560 |
So, let's look at the world that this agent is operating in. 00:30:31.160 |
The value function is the expected future reward, and that reward is discounted because the world is stochastic. 00:30:36.160 |
We can't expect the reward to come along to us in the way that 00:30:43.160 |
we hope it does based on the policy, based on the way we choose to act. 00:30:52.560 |
There's a discount factor that, as the reward is farther and farther into the future, 00:30:57.060 |
discounts that reward, diminishes the impact of that future reward on the current decision. 00:31:07.660 |
And so your goal is to develop a strategy that maximizes 00:31:13.460 |
the discounted future reward, the sum, this discounted sum. 00:31:25.060 |
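That discounted sum is the standard return, with discount factor gamma between 0 and 1:

$$ R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k} $$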
There are a lot of approaches for coming up with a good policy. 00:31:37.560 |
You can try to construct a model that optimizes some estimate of this world. 00:31:45.860 |
You can try in a Monte Carlo way to just simulate that world 00:31:54.460 |
And as it unrolls, you try to compute the optimal policy. 00:31:59.460 |
Or what we'll talk about today is Q-learning. 00:32:03.960 |
It's an off-policy approach where the policy is estimated as we go along. 00:32:33.560 |
The input to the Q-function is a state at time t, s_t, 00:32:42.660 |
and an action that you choose to take in that state, a_t. 00:32:48.960 |
And your goal is in that state to choose an action which maximizes the reward in the next step. 00:32:55.560 |
And what Q-learning does, and I'll describe the process, 00:33:01.260 |
is it's able to approximate through experience the optimal Q-function. 00:33:08.460 |
The optimal function that tells you how to act in any state of the world. 00:33:23.260 |
You have to explore in order to see every possible state, 00:33:28.060 |
try every different action, get rewarded, get punished, 00:33:31.860 |
and figure out what is the optimal thing to do. 00:33:43.460 |
On the left is the output, the new estimate, 00:33:48.460 |
the Q-function estimate of that state and action. 00:33:52.860 |
And this is the update rule at the core of Q-learning. 00:33:58.060 |
You take the old estimate and, based on the learning rate, 00:34:08.360 |
alpha, from 0 to 1, update the evaluation of that state 00:34:15.860 |
based on the new reward that you received at that time. 00:34:21.360 |
So you've arrived in a certain state, s_t, you take an action, 00:34:28.560 |
you get a certain reward, and you update your estimate of that state. 00:34:37.860 |
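For reference, this is the standard Q-learning update being described, with learning rate alpha and discount factor gamma:

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] $$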
When the learning rate is 0, you don't learn. 00:34:41.960 |
When alpha is 0, you never change your worldview based on the new incoming evidence. 00:34:50.660 |
When alpha is 1, you completely replace your evaluation, 00:35:00.160 |
your evaluation of the world, with the new evidence every time. 00:35:04.760 |
And exploration versus exploitation is the key ingredient of reinforcement learning. 00:35:11.460 |
First you explore in a non-greedy way and then you get greedy. 00:35:15.660 |
You figure out what's good for you and you keep doing it. 00:35:18.160 |
So if you want to learn an Atari game, first you try every single action, every state, 00:35:24.360 |
you screw up, get punished, get rewarded and eventually you figure out 00:35:28.160 |
what's actually the right thing to do and you just keep doing it. 00:35:30.760 |
And that's how you win against the greatest human players in the world 00:35:36.960 |
in a game of Go, for example, as we'll talk about. 00:35:39.760 |
And the way you do that is you have an epsilon greedy policy 00:35:44.960 |
that over time with a probability of 1-epsilon, you perform an optimal greedy action. 00:35:52.960 |
With a probability of epsilon, you perform a random action. 00:36:00.260 |
And so as epsilon goes down from 1 to 0, you explore less and less. 00:36:05.560 |
So the algorithm here is really simple on the bottom of the slide there. 00:36:13.460 |
It's the algorithm version, the pseudocode version, of the equation. 00:36:23.160 |
You initialize your estimate of state-action pairs arbitrarily, to a random number. 00:36:32.360 |
When you start playing or living or doing whatever you're doing 00:36:38.360 |
and whatever you're doing with reinforcement learning or driving, 00:36:41.460 |
you have no preconceived notion of what's good and bad. 00:36:47.560 |
It's random or however you choose to initialize it. 00:36:51.760 |
And the fact that it learns anything is amazing. 00:36:58.260 |
That's one of the amazing things about Q-learning at all, 00:37:05.760 |
and then about the deep neural network version of Q-learning. 00:37:14.360 |
You step into the world, observe an initial state, and select an action. 00:37:21.960 |
So that action, if you're exploring, will be a random action. 00:37:25.660 |
If you're greedily pursuing the best action you can, 00:37:28.760 |
it will be the action that maximizes the Q function. 00:37:31.460 |
You observe a reward after you take the action, and the new state you find yourself in. 00:37:37.760 |
And then you update your estimate of the previous state you were in 00:37:41.860 |
having taken that action using that Bellman equation update. 00:37:49.460 |
And so there on the bottom of the slide is a summary of life. 00:38:16.160 |
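A compact Python sketch of that pseudocode, combining the epsilon-greedy action selection with the Bellman update; the env.reset()/env.step() interface and all the names here are assumptions for illustration, not part of the lecture.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                          # Q[(state, action)], arbitrary initialization
    for _ in range(episodes):
        state = env.reset()                         # observe an initial state
        done = False
        while not done:
            if random.random() < epsilon:           # explore: random action with probability epsilon
                action = random.choice(actions)
            else:                                   # exploit: greedy action under the current estimate
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)
            # Bellman update: move the old estimate toward reward + discounted best future value
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```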
The question was, is the Q function a single value? 00:38:19.160 |
And yes, it's just a single continuous value. 00:38:23.460 |
So the question was, how do you model the world? 00:38:37.060 |
So the way you model it, let's start with this very simplistic world. 00:38:47.960 |
You model it as a paddle that can move left and right 00:38:51.460 |
and there's some blocks and you model the physics of the ball. 00:38:56.160 |
That requires a lot of expert knowledge in that particular game. 00:39:04.760 |
That's hard to do even for a simplistic game. 00:39:07.660 |
The other model you could take is looking at this world as just pixels. 00:39:19.460 |
The model is just all the pixels of the world. 00:39:25.360 |
You know nothing about paddles or balls or physics or colors and points. 00:39:32.860 |
That seems like a ridiculous model of the world, but it's the one we work with as human beings. 00:39:40.260 |
When you're born, all you see is light coming into your eyes; 00:39:52.260 |
you don't come with an instruction manual when you're born. 00:39:55.160 |
Do you know there are people in the world, and that there are good guys and bad guys? 00:40:02.260 |
No, all you get is light, sound and the other sensors. 00:40:08.160 |
And you get to learn about every single thing you think of; 00:40:17.060 |
the way you model the world is a learned representation. 00:40:21.160 |
And we'll talk about how a neural network does that. 00:40:26.160 |
But if we have to hand model the world, it's an impossible task. 00:40:36.060 |
If we have to hand model the world, then that world better be a simplistic one. 00:40:45.260 |
So the question was, what is the robustness of this model? 00:40:48.360 |
If the way you represent the world is at all even slightly different 00:40:55.960 |
that's not that well studied as far as I'm aware. 00:41:00.060 |
I mean it's already amazing that, with this kind of construction, 00:41:07.160 |
you can learn anything at all. 00:41:14.760 |
There's also the question, not about the world model but about the reward function. 00:41:17.560 |
If the reward function is slightly different, 00:41:19.760 |
if the real reward function of life or driving or of Coast Runners 00:41:25.760 |
is different than what you expected it to be, what happens? 00:41:47.060 |
So the question was, do you change the alpha value over time? 00:41:50.660 |
And you certainly should change the alpha value over time. 00:42:00.360 |
So the question was, what is the complex interplay of the epsilon function 00:42:06.160 |
That's 100% fine-tuned, hand-tuned to the particular learning problem. 00:42:16.960 |
Generally, the more complex the world, the larger the number of states in the world, 00:42:27.260 |
the longer you have to wait before you decrease the epsilon to zero. 00:42:34.260 |
It's one of the parameters you have to play with, unfortunately, 00:42:38.860 |
which is why you can't just drop a reinforcement learning agent into the world. 00:42:53.960 |
half the time you're going to take a random action. 00:43:01.960 |
and then with some probability take the second best and so on. 00:43:11.060 |
because you don't want to have a preconceived notion of 00:43:13.260 |
what's a good action to try when you're exploring. 00:43:16.260 |
The whole point is you try crazy stuff, if it's a simulation. 00:43:28.160 |
This is the question about how we represent the world. 00:43:32.160 |
So we can think of this world of breakout, for example, 00:43:37.460 |
of this Atari game as a paddle that moves left and right 00:43:43.560 |
and the exact position of the different things it can hit, 00:43:48.960 |
an expert-driven model that has to be fine-tuned to this particular problem. 00:43:54.460 |
But in practice, the more complex this model gets, 00:44:08.760 |
estimating the Q value for every single combination of state and actions 00:44:11.660 |
becomes too difficult, because that function is too sparse and huge. 00:44:16.760 |
So if you think of looking at this world in a general way, 00:44:29.560 |
this game is a collection of 84 by 84 pixels, an image, RGB image. 00:44:35.160 |
And then you look at not just the current image, 00:44:40.860 |
but look at the temporal trajectory of those images. 00:44:45.960 |
So like if there's a ball moving, you want to know about that movement. 00:44:54.060 |
And say they're grayscale with 256 gray levels. 00:45:00.560 |
The size of the Q table that the Q value function has to learn is hard to even write down, 00:45:12.560 |
but it's certainly larger than the number of atoms in the universe. 00:45:19.660 |
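To make that concrete: assuming, as in the DQN setup, a state of the last four 84x84 grayscale frames with 256 gray levels each (the four-frame stack is an assumption here, not stated on this slide), the number of possible states is

$$ 256^{\,84 \times 84 \times 4} = 256^{28224} \approx 10^{67970}, $$

while the observable universe contains only on the order of 10^80 atoms.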
So you have to run the simulation long enough 00:45:23.260 |
to touch at least a few times most of the states in that Q table. 00:45:34.360 |
"You may need to run," you know, "we live in a simulation." 00:45:38.260 |
You may have to run a universe just to compute the Q function in this case. 00:45:58.360 |
So instead, you estimate, you try to learn that function. 00:46:04.160 |
And so the takeaway from supervised learning, if you remember, 00:46:09.760 |
is that it's good at memorizing data. 00:46:12.560 |
The hope for reinforcement learning with Q-learning 00:46:19.560 |
is that we can extend the occasional rewards we get 00:46:27.160 |
to the actions we take in that world leading up to those rewards. 00:46:30.760 |
And the hope for deep learning is that we can move this reinforcement learning system into a much bigger world, 00:46:48.560 |
one whose states can include all the pixels sensed by a drone or a robot or a car. 00:46:52.460 |
But still it needs a formalized definition of that world, 00:46:59.160 |
which is much easier to do when you're able to take in sensors like an image. 00:47:09.760 |
So instead of filling in a Q table, we approximate the Q function. 00:47:27.560 |
This huge, complex function, we try to learn it. 00:47:33.460 |
And the way we do that is we have a neural network, 00:47:39.260 |
the same kind that I showed that learned the numbers to map from an image 00:47:42.960 |
to a classification of that image into a number. 00:47:47.060 |
The same kind of network is used to take in a state and action and produce a Q value. 00:47:58.660 |
And the amazing thing is that, without knowing anything in the beginning, 00:48:06.060 |
as I said with a Q table, it's initialized randomly. 00:48:12.460 |
The Q function, this deep network knows nothing in the beginning. 00:48:17.560 |
All it knows is in the simulated world, the rewards you get for a particular game. 00:48:26.060 |
So you have to play time and time again and see 00:48:29.560 |
the rewards you get for every single iteration of the game. 00:48:40.060 |
And it's able to learn to play better than human beings. 00:48:44.460 |
This is a DeepMind paper playing Atari with deep reinforcement learning from 2013. 00:48:52.160 |
It's one of the key things that got everybody excited about the role of deep learning. 00:48:59.360 |
It showed that using a convolutional neural network, which I'll talk about tomorrow, 00:49:07.160 |
but it's a vanilla network like any other, like I talked about earlier today. 00:49:11.160 |
Just a regular network that takes the raw pixels, as I said, 00:49:15.760 |
and estimates that Q function from the raw pixels 00:49:19.260 |
and is able to play on many of those games better than a human being. 00:49:22.860 |
And the loss function is like the one that I mentioned previously. 00:49:27.860 |
So again, very vanilla loss function, very simple objective function. 00:49:43.460 |
So we take this Bellman equation where the estimate is Q, 00:49:48.560 |
the Q function estimate of state and action is the maximum reward you get 00:49:55.860 |
for taking any of the actions that take you to any of the future states. 00:50:03.260 |
And you try to take that action, observe the result of that action, 00:50:09.460 |
and if the target is different from your learned estimate, 00:50:14.960 |
if what the function has learned as the expected reward in that case 00:50:19.360 |
is different than what you actually got, you adjust it. 00:50:28.560 |
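In equation form, the loss compares the Bellman target with the network's current prediction (this is the standard DQN squared-error objective; the 1/2 factor is just a common convention):

$$ L = \tfrac{1}{2}\Big[\, \underbrace{r + \gamma \max_{a'} Q(s', a')}_{\text{target}} \;-\; \underbrace{Q(s, a)}_{\text{prediction}} \,\Big]^2 $$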
And this is exactly the process by which we learn how to exist in this pixel world. 00:50:35.360 |
So you're mapping states and actions to a Q value. 00:50:48.160 |
We're given a transition, S, current state, action taken in that state, 00:50:55.060 |
R, the reward you get, and S' is the state you find yourself in. 00:50:59.360 |
And so we replace the basic update rule in the previous pseudocode 00:51:06.460 |
by taking a forward pass through the network, given that S state. 00:51:14.860 |
We look at what the predicted Q value is of that action. 00:51:20.660 |
We then do another forward pass through that network for the next state, S prime, to get the target. 00:51:26.960 |
And then if we're totally off, we punish, we back propagate the weights 00:51:37.160 |
in a way that next time we'll make less of that mistake. 00:52:08.960 |
In the beginning you're mostly exploring, and then you move towards an epsilon of 0. 00:52:17.460 |
And with Atari breakout, this is the DeepMind paper result. 00:52:26.360 |
On the Y axis is the average action value and the average reward per episode. 00:52:31.560 |
I'll show why it's kind of an amazing result, but it's messy. 00:52:40.060 |
So it's not just putting in a bunch of pixels of a game 00:52:44.560 |
and getting an agent that knows how to win at that game. 00:52:48.160 |
There's a lot of pre-processing and playing with the data required. 00:52:53.360 |
Which is unfortunate, because the truth is messier than the hope. 00:52:59.760 |
But one of the critical tricks needed is called experience replay. 00:53:05.960 |
So, you're learning this big network 00:53:12.560 |
that tries to build a model of what's good to do in the world and what's not. 00:53:21.560 |
So with experience replay, you're keeping track of all the things you did. 00:53:26.260 |
And every once in a while, you look back into your memory 00:53:29.660 |
and pull out some of those old experiences, the good old times, and train on those again. 00:53:34.960 |
As opposed to letting the agent run itself into some local optima 00:53:41.960 |
where it tries to learn a very subtle aspect of the game 00:53:44.760 |
that actually in the global sense doesn't get you farther to winning the game. 00:53:52.260 |
So here's the algorithm, deep Q learning algorithm, pseudocode. 00:54:01.860 |
Again, there's this little trick that's required. 00:54:05.260 |
It's keeping track of stuff that's happened in the past. 00:54:08.860 |
We initialize the replay memory and the action value function Q with random weights. 00:54:15.660 |
Again, same thing, select an action with a probability epsilon, explore. 00:54:21.760 |
Otherwise, choose the best one based on the estimate provided by the neural network. 00:54:27.560 |
And then carry out the action, observe the reward and store that experience in the replay memory. 00:54:34.160 |
And then sample random transition from replay memory. 00:54:41.860 |
So with a certain probability, you bring those old times back to get yourself out of the local minima. 00:54:48.960 |
And then you train the Q network using the difference between what you actually got and your estimate. 00:55:08.160 |
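Putting those steps together, here is a condensed Python sketch of the deep Q-learning loop with experience replay; q_net, its train_step method, and the env interface are illustrative assumptions standing in for whatever network and simulator you actually use.

```python
import random
from collections import deque

def deep_q_learning(env, q_net, num_actions, episodes=1000, gamma=0.99,
                    epsilon=1.0, epsilon_min=0.1, epsilon_decay=0.995, batch_size=32):
    replay_memory = deque(maxlen=100_000)            # store past transitions
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy: mostly explore early on, exploit more as epsilon decays
            if random.random() < epsilon:
                action = random.randrange(num_actions)
            else:
                action = max(range(num_actions), key=lambda a: q_net(state)[a])
            next_state, reward, done = env.step(action)
            replay_memory.append((state, action, reward, next_state, done))

            if len(replay_memory) >= batch_size:
                # sample random old experiences instead of only the most recent ones
                batch = random.sample(replay_memory, batch_size)
                states, acts, targets = [], [], []
                for s, a, r, s2, d in batch:
                    target = r if d else r + gamma * max(q_net(s2))
                    states.append(s); acts.append(a); targets.append(target)
                q_net.train_step(states, acts, targets)   # minimize (target - Q(s, a))^2
            state = next_state
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
```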
After 10 minutes of training on the left, so that's very little training, 00:55:14.060 |
what you get is a paddle that learns hardly anything and it just keeps dying. 00:55:22.360 |
If you look, it goes from 5 to 4 to 3 to 2 to 1; those are the number of lives left. 00:55:28.360 |
Then after two hours of training on a single GPU, it learns to win, you know, not die, 00:55:38.160 |
rack up points and learns to avoid the ball from passing the paddle, which is great. 00:55:48.260 |
That's human level performance really, better than some humans, you know, 00:55:53.560 |
but it still dies sometimes so it's very human level. 00:55:58.060 |
And then after four hours, it does something really amazing. 00:56:03.560 |
It figures out how to win at the game in a very lazy way, which is drill a hole through the blocks 00:56:15.060 |
up to the top and get the ball stuck up there and then it does all the hard work for you. 00:56:19.960 |
That minimizes the probability of the ball getting past your paddle because it's just stuck in the blocks up top. 00:56:28.560 |
So that might be something that you wouldn't even figure out to do yourself. 00:56:32.260 |
And that's an... I need to sort of pause here to clearly explain what's happening. 00:56:40.560 |
The input to this algorithm is just the pixels of the game. 00:56:46.660 |
It's the same thing that human beings take in with their visual perception, 00:56:53.060 |
and it's able to learn under this constrained definition of what is a reward and a punishment. 00:57:13.760 |
It's general purpose. It knows nothing about games. 00:57:17.460 |
It knows nothing about paddles or physics. It's just taking sensory input of the game. 00:57:23.160 |
And they did the same thing for a bunch of different games in Atari. 00:57:27.760 |
And what's shown here in this plot on the X-axis is a bunch of different games from Atari 00:57:38.460 |
and on the Y-axis is a percentile where 100% is about the best that human beings can do. 00:57:46.660 |
Meaning it's the score that human beings would get. 00:57:48.560 |
So everything above that is better than human-level performance, and everything to the left of that 00:57:54.860 |
and below that is on par with or worse than human-level performance. 00:57:59.460 |
So it can learn so many of these games, boxing, pinball, all of these games, 00:58:06.660 |
and it doesn't know anything about any of the individual games. 00:58:11.160 |
It's just as if you put a human being behind any of these games 00:58:23.260 |
And there's been a lot of improvements in this algorithm recently. 00:58:34.260 |
So the question was, do they customize the model for a particular game? 00:58:41.560 |
but the point is it doesn't need to be customized for the game. 00:58:45.060 |
But the important thing is that it's still only on Atari games. 00:58:54.060 |
Right, so the question was whether this is transferable to driving. 00:59:12.660 |
No, you don't have the... Well yeah, you play one step of the game. 00:59:15.960 |
So you take action in a state and then you observe that. 00:59:26.660 |
I mean that's really, that's one of the biggest problems here 00:59:31.860 |
is you require the simulation in order to get the ground truth. 00:59:45.960 |
The comment was that for a lot of these situations, 00:59:49.760 |
the reward function might not change at all depending on your actions. 00:59:54.060 |
The rewards are really, most of the time, delayed 10, 20, 30 steps down the line. 01:00:02.860 |
Which is why it is amazing that this works at all. 01:00:08.260 |
That it's learning locally and through that process of simulation 01:00:15.760 |
of hundreds of thousands of times it runs through the game, 01:00:17.960 |
it's able to learn what to do now such that I get a reward later. 01:00:24.260 |
If you just pause and look at the math of it, it's very simple math. 01:00:39.460 |
This one is called the General Reinforcement Learning Architecture, or GORILLA. 01:00:45.460 |
The cool thing about this in the simulated world at least 01:00:50.860 |
is that you can run deep reinforcement learning in a distributed way. 01:00:56.160 |
You can do the simulation in a distributed way, 01:00:58.660 |
and you can do the learning in a distributed way. 01:01:00.960 |
You can generate experiences which is what this kind of diagram shows. 01:01:07.560 |
You can generate them either from human beings or from simulation. 01:01:12.060 |
So for example, the way that the DeepMind team beat the game of Go with AlphaGo 01:01:21.160 |
is that they learned from both expert games and from the system playing itself. 01:01:29.760 |
And you can do the learning in a distributed way, so you can scale. 01:01:33.360 |
And in this particular case, the GORILLA has achieved a better result than the DQN network. 01:01:45.360 |
Okay, so let me now get to driving for a second here. Is there a formulation of driving 01:01:59.860 |
where reinforcement learning can step in and help? 01:02:03.560 |
So this is back to the open question that I asked yesterday. 01:02:07.160 |
Is driving closer to chess or to everyday conversation? 01:02:11.460 |
Chess meaning it can be formalized in a simplistic way 01:02:16.560 |
and we could think about it as an obstacle avoidance problem. 01:02:27.060 |
You choose to move left, you choose to move right in a lane. 01:02:32.860 |
Well, if it's a game like chess, which we'll assume for today 01:02:38.660 |
as opposed to for tomorrow, then for today we're going to go with the one on the left. 01:02:51.660 |
Here's this game, a simulation, where the goal is to achieve the highest average speed you can. 01:03:07.060 |
And so, as a side note for students, the requirement is they have to follow the tutorial 01:03:14.060 |
that I'll present a link for at the end of this presentation. 01:03:21.860 |
and build a network that achieves a speed of 65 miles an hour or higher. 01:03:26.060 |
There is a leaderboard, and you get to submit the model you come up with. 01:03:35.160 |
So all of this runs in the browser, which is also another amazing thing. 01:03:38.860 |
And then you immediately, or relatively soon, make your way up the leaderboard. 01:03:50.860 |
What is this world, this two-dimensional world of traffic? 01:03:56.260 |
What does it look like to the intelligent system? 01:04:01.260 |
We discretize that world into a grid shown here on the left. 01:04:08.560 |
There are seven lanes and every single lane is broken up into blocks spatially. 01:04:14.460 |
And if there's a car in a block (the length of a car is about three blocks, 01:04:18.260 |
three of those grid blocks), then that block is seen as occupied. 01:04:30.360 |
That's the thing that's running in the intelligent agent. 01:04:33.860 |
On the left is the current speed of the red car. 01:04:42.360 |
And then you also have a count of how many cars you passed. 01:04:47.760 |
And if your network sucks, then that number is going to go negative. 01:04:54.160 |
You can also change with a drop down the simulation speed 01:05:01.160 |
from normal on the left to fast on the right. 01:05:05.060 |
Fast speeds up the replay of the simulation. 01:05:14.260 |
The one on the left, normal, feels a little more like real driving. 01:05:20.160 |
There's a drop down for different display options. 01:05:28.860 |
The default is none in terms of stuff you show on the road. 01:05:33.960 |
Then there is the learning input: while the whole space is discretized, the network only sees part of it. 01:05:44.060 |
And you can choose how far ahead it sees, how far behind, and how far to the left and right it sees. 01:05:52.060 |
And so by choosing to visualize the learning input, 01:05:56.860 |
you get to see what you set that input to be. 01:06:04.160 |
Then there's the safety system. This is a system that protects you from yourself. 01:06:07.460 |
The way we've made this game is that it operates under something similar 01:06:15.860 |
to adaptive cruise control: if you drive and you have adaptive cruise control in your car, 01:06:24.960 |
when it gets close to the car in front, it slows down for you 01:06:29.760 |
and it doesn't let you run the car to the left of you, to the right of you, off the road. 01:06:35.460 |
So it constrains the movement capabilities of your car in such a way that you don't hit anybody 01:06:44.160 |
because then it would have to simulate collisions and it would just be a mess. 01:06:48.060 |
So it protects you from that and so you can choose to visualize that "safety system" with the visualization box. 01:06:58.860 |
And then you can also choose to visualize the full map. 01:07:01.360 |
This is the full occupancy map that you can, if you would like to, provide as input to the network. 01:07:09.560 |
Now that input, for every single grid cell, is a number. 01:07:14.860 |
It's not just a zero or one indicating whether there's a car in there. 01:07:17.860 |
It's the maximum speed limit, which is 80 miles per hour. 01:07:24.060 |
Don't get crazy. 80 miles an hour is the speed limit. 01:07:29.260 |
That block, when it's empty, is set to the 80 miles an hour. 01:07:34.560 |
And when it's occupied, it's set to the number that's the speed of the car. 01:07:40.960 |
And then the blocks that the red car is occupying 01:07:46.360 |
are set to a very large number, much higher than the speed limit. 01:07:57.860 |
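A tiny sketch of that per-cell encoding; the exact "very large number" used for the red car's own cells isn't given here, so the marker value below is a made-up placeholder.

```python
SPEED_LIMIT = 80.0      # empty cells carry the 80 mph speed limit
EGO_MARKER = 1000.0     # assumed placeholder for "much higher than the speed limit"

def encode_cell(is_ego=False, car_speed=None):
    """Grid value for one cell, following the scheme described above."""
    if is_ego:
        return EGO_MARKER        # cells covered by the red (ego) car
    if car_speed is None:
        return SPEED_LIMIT       # empty cell
    return float(car_speed)      # occupied cell: that car's current speed
```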
So safety system, here shown in red, are the parts of the grid that your car can't move into. 01:08:19.060 |
Yes. The question was, what was the third option I just mentioned? 01:08:27.660 |
You yourself, the blocks underneath that car, are set to a really high number. 01:08:32.660 |
It's a way for the algorithm to know, for the learning algorithm to know, that these blocks are special. 01:08:39.760 |
So safety system shows red here if the car can't move into those blocks. 01:08:52.660 |
So when it lights up red in front of the car, it means the car can't speed up anymore. 01:09:01.260 |
And when the blocks to the left or to the right light up as red, 01:09:04.660 |
that means you can't change lanes to the left or right. 01:09:06.760 |
On the right of the slide, you're free to go, free to do whatever you want. 01:09:15.360 |
All the blocks there are yellow. The safety system says you're free to choose any of the five actions. 01:09:22.760 |
And the five actions are move left, move right, stay in place, accelerate or slow down. 01:09:35.160 |
That action is what's produced by what's called here the brain. 01:09:43.260 |
The brain takes in the current state and the last reward as input, 01:09:50.460 |
and uses that reward to train the network through the backward function there, backpropagation. 01:09:59.660 |
And then you ask the brain, given the current state, to give you the next action with a forward pass, the forward function. 01:10:09.760 |
You don't need to know the operation of this function in particular. 01:10:13.760 |
This is not something you need to worry about, but you can if you want. 01:10:19.560 |
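Schematically, the loop being described looks like the sketch below; the method names mirror the forward/backward description above, but the real DeepTraffic brain is JavaScript, so treat this Python version purely as an illustration.

```python
ACTIONS = ["left", "right", "stay", "accelerate", "decelerate"]

def control_step(brain, state, last_reward):
    brain.backward(last_reward)           # use the last reward to train the network (backpropagation)
    action_index = brain.forward(state)   # forward pass: ask for the next action given the current state
    return ACTIONS[action_index]
```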
By the way, what I'm describing now is just a few lines of code right there in the browser 01:10:29.160 |
that you can change, and immediately, well, with the press of a button, 01:10:34.460 |
it changes the simulation or the design of the network. 01:10:37.860 |
You don't need to have any special hardware, you don't need to do anything special. 01:10:41.560 |
And the tutorial cleanly outlines exactly all of these steps. 01:10:45.160 |
But it's kind of amazing that you can design a deep neural network 01:10:50.160 |
that's part of the reinforcement learning agent. 01:10:52.460 |
So it's a deep Q learning agent right there in the browser. 01:10:59.460 |
So you can choose the lane side variable which controls how many lanes to the side you see. 01:11:08.460 |
So when that value is zero, you only look forward. 01:11:10.760 |
When that value is one, you have one lane to the left and one lane to the right. 01:11:14.860 |
It's really the radius, in lanes, of your perception system. 01:11:18.560 |
Patches ahead is how far ahead you look, patches behind is how far behind you look. 01:11:26.260 |
And so for example here, the lane side equals two, that means it looks two to the left, two to the right. 01:11:33.160 |
Obviously, if two to the right is off-road, it provides a value of zero in those blocks. 01:11:41.860 |
If we set the patches behind to be ten, it looks ten patches back. 01:11:48.460 |
Patches behind starts counting one patch back from the front of the car. 01:11:54.160 |
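A hedged sketch of how those variables relate to the size of the spatial input the network sees; this counts grid cells only and may not be the exact formula used in the DeepTraffic code.

```python
def spatial_input_size(lanes_side, patches_ahead, patches_behind):
    lanes_seen = 2 * lanes_side + 1              # your own lane plus lanes_side lanes on each side
    return lanes_seen * (patches_ahead + patches_behind)

# e.g. one lane on each side, 10 patches ahead, 5 behind -> 45 grid cells as input
print(spatial_input_size(lanes_side=1, patches_ahead=10, patches_behind=5))
```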
The scoring for the evaluation, for the competition is your average speed over a predefined period of time. 01:12:03.060 |
And so the method we use to collect that speed is we run the agent ten runs, 01:12:10.960 |
about 30 simulated minutes of game each and take the median speed of the ten runs. 01:12:21.060 |
This is done server-side. And given that 01:12:30.360 |
this code recently got some publicity online, unfortunately, 01:12:35.160 |
this might be a dangerous thing to say, but no cheating is possible. 01:12:40.160 |
Because the scoring is done server-side, and this is JavaScript that runs in the browser, 01:12:46.660 |
it's hopefully sandboxed so you can't do anything tricky. 01:13:00.560 |
And there's a button that says evaluate and it gives you a score right back 01:13:06.360 |
of how well you're doing with the current network. 01:13:16.360 |
You press the button, it does a progress bar and it gives you the average speed. 01:13:20.960 |
There's a code box where you modify all the variables I mentioned. 01:13:32.960 |
And then once you're ready, you modify a few things, you can press apply code, it restarts, 01:13:40.660 |
it kills all the training that you've done up to this point or resets it and starts the training again. 01:13:52.360 |
So the training is done on a separate thread in web workers 01:13:58.960 |
which are exciting things that allow JavaScript to run, amazingly, in parallel in the background. 01:14:14.860 |
So the simulation that scores this or the training is done a lot faster than real time. 01:14:22.860 |
A thousand frames a second, a thousand movement steps a second. 01:14:31.660 |
And the network gets shipped to the main simulation from time to time as the training goes on. 01:14:40.160 |
So all you have to do is press run training and it trains and the car behaves better over time. 01:14:47.460 |
Maybe I'll actually show it in the browser. Let's see if it works. 01:15:27.660 |
When it starts, this is running live in the browser. 01:15:35.560 |
Artificial intelligence, ladies and gentlemen, in the browser, a neural network. 01:15:42.560 |
It's driving at two miles an hour and watching everybody pass. 01:15:47.760 |
So what's being shown live is the loss function which is pretty poor. 01:15:57.260 |
So in order to train, like I said, a thousand frames a second, you just press the run training button. 01:16:04.560 |
And pretty quickly it learns, based on the network you specify in the code box, 01:16:10.060 |
and based on the input and all the things that I mentioned. There, training finished. 01:16:22.260 |
We on purpose put in a network that's not very good in there. 01:16:26.360 |
So right now it won't, on average, be doing that well, but it does better than standing there in place. 01:16:32.060 |
And then you can do the start evaluation run to simulate the network much faster than real time. 01:16:43.960 |
This is a similar evaluation step that we take when determining where you stand on the leaderboard. 01:16:51.760 |
The current average speed in that 10 run simulation is 56.56 miles per hour. 01:17:06.160 |
If you're logged in, you click submit your code. 01:17:11.160 |
If you're not logged in, it says you're not logged in, please log in to submit your code. 01:17:25.260 |
And then you press submit model again and success. 01:17:36.860 |
And so now my submission is entered as Lex in the leaderboard. 01:17:48.460 |
So as you play around with stuff, if you want to save the code, 01:17:55.360 |
you could do so by pressing the save code button. 01:18:00.660 |
That saves the various JavaScript configurations, 01:18:11.160 |
Again, danger, it overrides the code for you. 01:18:15.660 |
And you press the submit button to submit the model to competition. 01:18:22.560 |
You bring us, you submit a model and you have to press train. 01:18:32.560 |
This is public facing, so the queue can grow pretty big. 01:18:39.560 |
and then depending on where you stand, you get added to the leaderboard. 01:18:43.760 |
You get added to the leaderboard, showing the top 10 entries. 01:19:06.660 |
by Andrej Karpathy from Stanford, now OpenAI. 01:19:27.860 |
Not just how far ahead, behind, and to the left and right you see, 01:19:32.960 |
you can specify how far back in time you look as well. 01:19:40.060 |
And so, what's visualized there is the input to the network, 135 neurons, 01:19:49.960 |
similar to the kind of output we saw with numbers, 01:19:53.260 |
where there were 10 outputs saying if it's a 0, a 1, up to a 9. 01:20:00.160 |
Here there are five outputs: left, right, stay in place, speed up or slow down. 01:20:11.960 |
this is all the stuff you don't need to mess with, 01:20:13.760 |
because we already give you the variables of lane side and patches ahead and so on. 01:20:39.060 |
Again, all this is in the tutorial, just to give you a little outline. 01:20:41.960 |
The first fully connected layer has 10 neurons, 01:20:50.560 |
with the same kind of smooth activation function that we talked about before. 01:20:59.560 |
And there's a bunch of other messy options that you can play with, if you dare. 01:21:08.460 |
But those aren't the ones I mentioned before; it's really the important ones: 01:21:11.460 |
selecting the number of layers, the size of those layers. 01:21:14.960 |
You get to build your very own neural network that drives. 01:21:17.960 |
And the actual learning is done with backward propagation, 01:21:22.160 |
and the action is returned by doing a forward pass through the network. 01:21:27.260 |
In case you're interested in this kind of stuff, 01:21:38.160 |
It just works, it does some auto-completion, 01:21:43.460 |
so you get to play with it; it makes everything very convenient in terms of code editing. 01:21:51.160 |
and the simulation, and the simulation we'll talk about tomorrow, 01:22:01.760 |
So here is a simple specification of a blue box with Canvas, 01:22:06.060 |
and this is very efficient and easy to work with. 01:22:09.860 |
And the thing that a lot of us are excited about, a very subtle one, is that 01:22:20.760 |
with the V8 engine, JavaScript has become super fast. 01:22:26.160 |
You can train neural networks in the browser, that's already amazing. 01:22:30.060 |
And then with web workers, as long as you have Chrome or another modern browser, 01:22:35.260 |
you can run multiple processes in separate threads. 01:22:42.060 |
So you can do a lot of stuff: you can do the visualization separately and the training separately. 01:22:49.660 |
Okay, so the tutorial is cars.mit.edu and deep traffic. 01:22:54.560 |
We won't put these links on the website for a little bit, 01:22:57.760 |
because we got put on the front page of Hacker News. 01:23:08.860 |
especially with the claims that you can't cheat. 01:23:12.160 |
And while it's pretty efficient in terms of running time, 01:23:17.960 |
since everything is running on your machine, client side, 01:23:20.360 |
you still have to pull some images and some of the code from our server. 01:23:24.460 |
So the tutorial is on cars.mit.edu/deep-traffic 01:23:35.460 |
I encourage you to go there, play with the network, submit your code, 01:23:42.260 |
It is a pretty cool one, but we're still working on it. 01:23:48.260 |
All right, so let's take a pause and think about what we talked about today. 01:23:58.860 |
So the very best of deep reinforcement learning is in the game of Go. 01:24:09.060 |
When I first started as a freshman, 01:24:18.860 |
it was said that it's a game that's impossible for machines to beat humans at. 01:24:32.360 |
And so the most amazing accomplishment of deep reinforcement learning to me came 01:24:41.260 |
when, for the first time, the world champion in Go was beaten by DeepMind's AlphaGo. 01:24:48.660 |
And the way they did it, and this is, I think, very relevant to driving, 01:24:55.360 |
is that you start, first in a supervised way, by training a policy network. 01:25:02.460 |
So you take expert games to construct a network first. 01:25:11.160 |
So the agent doesn't play against itself from scratch. 01:25:28.860 |
And it's the same with driving: we're starting to get a lot of data, 01:25:34.660 |
so we can learn on that data before we then run the agents through simulation, 01:25:39.660 |
where it learns much larger magnitudes of data sets through simulation. 01:25:50.760 |
Now, as a reminder of what can happen when you let an agent learn by itself, 01:25:59.860 |
this is probably one of my favorite videos of all time, 01:26:04.260 |
though I just recently saw it, and I could just watch it for hours. 01:26:06.760 |
But it's a reminder that you can't trust your first estimates of a reward function 01:26:20.360 |
To be those that are safe and productive for our society 01:26:24.460 |
when you're talking about an intelligent system that gets to operate in the real world. 01:26:29.060 |
This is just as clear of a reminder of that as there is. 01:26:33.760 |
So again, all the references are available online for these slides. 01:26:41.560 |
I imagine you might have questions; if you want to come down and talk to us, 01:26:46.960 |
we can help with either Docker or JavaScript questions. 01:26:51.360 |
So the question was, what is the visualization you're seeing in deep traffic? 01:27:03.860 |
It's moving based on the latest snapshot of the network you trained. 01:27:07.060 |
So it's just visualizing for you just for fun the network you trained most recently. 01:27:15.060 |
Okay, so if people have questions, stick around afterwards.