MIT 6.S094: Deep Reinforcement Learning for Motion Planning
Chapters
0:00 Intro
0:43 Types of machine learning
5:48 Perceptron: Weighing the Evidence
7:34 Perceptron: Implement a NAND Gate
10:06 Perceptron NAND Gate
11:01 The Process of Learning: Small Change in Weights → Small Change in Output
13:02 Combining Neurons into Layers
13:49 Task: Classify an Image of a Number
19:45 Philosophical Motivation for Reinforcement Learning
21:20 Agent and Environment
24:55 Markov Decision Process
25:30 Major Components of an RL Agent
26:53 Robot in a Room
27:57 Is this a solution?
28:22 Optimal policy
28:42 Reward for each step: -2
29:11 Reward for each step: +0.01
30:15 Value Function
31:19 Q Learning
35:03 Exploration vs Exploitation
36:07 Q-Learning: Value Iteration
38:03 Q-Learning: Representation Matters
45:48 Philosophical Motivation for Deep Reinforcement Learning
48:00 Deep Q-Network: Atari
49:24 Deep Q-Network Training
52:18 Atari Breakout
53:53 Deep Q-Learning Algorithm
00:00:00.000 |
All right. Hello everybody. Welcome back. Glad you came back. 00:00:04.120 |
Today, we will unveil the first tutorial, the first project. 00:00:13.580 |
This is Deep Traffic, code named Deep Traffic, 00:00:17.220 |
where your task is to solve the traffic problem using deep reinforcement learning. 00:00:23.820 |
And I'll talk about what's involved in designing a network there. 00:00:33.900 |
As I said, the winner gets a very special prize to be announced later. 00:00:47.100 |
As I mentioned yesterday, that's what's usually meant when people talk 00:00:52.480 |
about machine learning and talk about its successes. 00:00:56.320 |
Supervised learning requires a data set where you know the ground truth. 00:01:03.860 |
And you provide that to a machine learning algorithm in order to learn the mapping 00:01:11.840 |
between the inputs and the outputs in such a way that you can generalize. 00:01:22.340 |
Unsupervised learning is when you know absolutely nothing about the outputs, 00:01:28.120 |
about the truth of the data that you're working with. 00:01:37.760 |
And your task is to figure out the underlying representation of the data that's meaningful 00:01:41.600 |
for you to accomplish a certain task, whatever that is. 00:01:46.500 |
There's semi-supervised learning, where only part of the data, usually a very small amount, is labeled. 00:01:54.940 |
There's ground truth available for just a small fraction of it. 00:01:59.280 |
If you think of images that are out there on the Internet, 00:02:03.440 |
and then you think about ImageNet, a data set where every image is labeled, 00:02:08.080 |
the size of that ImageNet data set is a tiny subset of all the images out there. 00:02:18.500 |
But that's the task we're dealing with as human beings, 00:02:23.300 |
as people interested in doing machine learning: 00:02:31.340 |
generalizing from the part of our data that we know something confidently about. 00:02:37.480 |
And reinforcement learning sits somewhere in between. 00:02:46.460 |
Where there's an agent that has to exist in the world. 00:02:52.660 |
And that agent knows the inputs that the world provides, 00:02:59.460 |
but knows very little about that world except through occasional time-delayed rewards. 00:03:17.820 |
And every once in a while, you find out that all that stuff you did last week was a pretty bad idea. 00:03:27.220 |
That's semi-supervised in the sense that only a small subset of the data comes with some ground truth, 00:03:35.500 |
some certainty that you have to then extract knowledge from. 00:03:41.200 |
So first, at the core of anything that works currently, 00:03:45.900 |
in terms of a practical sense, there has to be some ground truth. 00:03:50.900 |
There has to be some truth that we can hold on to as we try to generalize. 00:03:59.440 |
Even in reinforcement learning, the only thing we can count on is that truth. 00:04:07.380 |
So the standard supervised learning pipeline is you have some raw data, the inputs. 00:04:13.340 |
You have ground truth, the labels, the outputs that match the inputs. 00:04:20.380 |
Then you run any kind of algorithm, whether that's a neural network 00:04:26.360 |
or another processing algorithm, that extracts the features from that data set. 00:04:34.460 |
That algorithm could extract the nose, the eyes, the corners of the eyes, the pupil, if you're working with faces, for example. 00:04:46.560 |
After that, we insert those features into a model, a machine learning model. 00:05:01.040 |
Then, whatever that algorithm is, as we pass it through that training process, we ask: 00:05:08.040 |
After we've seen this one particular example, how much better are we at other tasks? 00:05:16.840 |
And as we repeat this loop, the model learns to perform better and better 00:05:23.720 |
at generalizing from the raw data to the labels that we have. 00:05:29.180 |
And finally, you get to release that model into the wild to actually do prediction 00:05:34.580 |
on data it has never seen before, that you don't know about. 00:05:48.020 |
Okay, so neural networks are what this class is about. 00:05:56.920 |
It's one of the machine learning algorithms that has proven to be very successful. 00:06:01.120 |
And the building block, the computational building block of a neural network is a neuron. 00:06:13.760 |
The perceptron is the original old-school neuron, where the output is binary, a zero or a one. 00:06:25.360 |
And the process that a perceptron goes through is it has multiple inputs and a single output. 00:06:35.120 |
Each of the inputs has a weight on it, shown here on the left as 0.7, 0.6, 1.4. 00:06:45.160 |
Those weights are applied to the inputs, and in a perceptron the inputs are ones or zeros, binary. 00:06:54.160 |
The weighted inputs are then summed together. 00:06:59.320 |
A bias on each neuron is then added on top, and there's a threshold. 00:07:10.560 |
There's a test whether that summed value plus the bias is below or above a threshold. 00:07:17.060 |
If it's above a threshold, it produces a one. If it's below a threshold, it produces a zero. Simple. 00:07:23.760 |
It's one of the only things we understand about neural networks confidently. 00:07:28.260 |
We can prove a lot of things about this neuron. 00:07:30.860 |
For example, what we know is that a neuron can approximate a NAND gate. 00:07:44.960 |
A NAND gate is a logical operation, a logical function that takes its inputs. 00:07:54.060 |
It has two inputs, A and B, here on the diagram on the left. 00:08:03.460 |
When the inputs are both zeros, or a 0 and a 1 in either order, the output is a one. Only when both inputs are ones is the output a zero. 00:08:14.260 |
The cool thing about a NAND gate is that it's a universal gate. 00:08:23.460 |
Your phone in your pocket today can be built out of just NAND gates. 00:08:31.360 |
You can build any logical function out of them if you stack them together in arbitrary ways. 00:08:36.560 |
The problem with NAND gates and computers is they're built from the bottom up. 00:08:43.360 |
You have to design these circuits of NAND gates. 00:08:45.360 |
So the cool thing here is with the Perceptron, we can learn this magical NAND gate. 00:08:54.660 |
So let's go through how we can do that, how a Perceptron can perform the NAND operation. 00:09:08.060 |
If we put the weights of -2 on each of the inputs and a bias of 3 on the neuron, 00:09:15.460 |
and if we perform that same operation of summing the weights times the inputs plus the bias, 00:09:24.860 |
in the top left, when the inputs are both zeros, the weighted sum plus the bias gives us a 3. 00:09:35.460 |
That's a positive number, which means the output of a Perceptron will be a 1. 00:09:39.960 |
In the top right, when the input is a 0 and a 1, that sum is still a positive number. 00:09:51.160 |
When the inputs are both 1s, the sum is a -1, less than 0, so the output is a 0. 00:10:01.160 |
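To make this concrete, here is a minimal Python sketch of the perceptron NAND computation just described; the weights of -2, the bias of 3, and the threshold at zero come straight from the slide, while the function and variable names are just illustrative.

```python
# Perceptron acting as a NAND gate: weights of -2 on each input, bias of 3,
# output 1 if the weighted sum plus bias is positive, 0 otherwise.
def perceptron_nand(a, b):
    weights = (-2, -2)
    bias = 3
    total = weights[0] * a + weights[1] * b + bias
    return 1 if total > 0 else 0

# Truth table: only (1, 1) produces a 0, matching the NAND gate.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", perceptron_nand(a, b))
```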
So while this is simple, it's really important to think about. 00:10:07.760 |
It's sort of the one basic computational truth you can hold on to 00:10:16.960 |
as we talk about some of the magical things neural networks can do. 00:10:19.860 |
Because if you compare a circuit of NAND gates and a circuit of neurons, 00:10:29.760 |
the difference is that a circuit of neurons, which is what we think of as a neural network, 00:10:38.360 |
can perform the same thing as the circuit of NAND gates, 00:10:45.560 |
but it can learn the arbitrary logical functions that an arbitrary circuit of NAND gates has to be designed by hand to represent. 00:11:02.460 |
So one of the key aspects here, one of the key drawbacks of the perceptron, is the following. 00:11:14.960 |
As we change the weights on the inputs and we change the bias and we tweak it a little bit, 00:11:23.860 |
it's very easy to flip the neuron's output from a 1 to a 0, 00:11:34.360 |
So when we start stacking many of these together, it's hard to control the output of the thing as a whole. 00:11:44.860 |
Now the essential step that makes a neural network work, that a circuit of perceptrons doesn't, 00:11:54.160 |
is that the output is made smooth, is made continuous, with an activation function. 00:12:00.360 |
And so instead of using a step function like a perceptron does, shown there on the left, 00:12:10.760 |
we use some kind of smooth function, a sigmoid, where the output can change gradually. 00:12:33.260 |
And so learning is generally the process of adjusting those weights gradually 00:12:41.060 |
and seeing how it has an effect on the rest of the network. 00:12:45.360 |
You just keep tweaking weights here and there and seeing how much closer you get to the ground truth. 00:12:54.260 |
And if you get farther away, you just adjust the weights in the opposite direction. 00:13:03.260 |
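As a small illustrative sketch (not the lecture's code), a sigmoid neuron in Python shows how a tiny change in a weight now produces a tiny, rather than abrupt, change in the output; the specific weights and bias below are made up.

```python
import math

def sigmoid(z):
    # Smooth squashing function: output moves gradually between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# A small tweak to one weight nudges the output slightly,
# instead of the perceptron's hard jump from 0 to 1.
print(sigmoid_neuron([1, 0], [0.70, 0.60], -0.5))   # ~0.550
print(sigmoid_neuron([1, 0], [0.71, 0.60], -0.5))   # ~0.552
```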
What we'll mostly talk about today is feed-forward neural networks, 00:13:11.060 |
shown on the left, going from inputs to outputs with no loops. 00:13:17.860 |
There is also these amazing things called recurrent neural networks. 00:13:27.960 |
They're amazing because they have memory. They have a memory of state. 00:13:32.460 |
They remember the temporal dynamics of the data that went through. 00:13:38.360 |
But the painful thing is that they're really hard to train. 00:13:45.760 |
Today we'll talk about feed-forward neural networks. 00:13:51.860 |
So let's look at this example. An example of stacking a few of these neurons together. 00:13:59.860 |
Let's think of the basic task, now famous: the classification of numbers. 00:14:08.160 |
You have an image of a handwritten number, and your task, given that image, is to say which number it is. 00:14:21.660 |
Now what is an image? An image is a collection of pixels. In this case, 28 by 28 pixels. 00:14:26.960 |
That's a total of 784 numbers. Those numbers are from 0 to 255. 00:14:33.260 |
And on the left of the network, the size of that input, despite the diagram, is 784 neurons. 00:14:45.360 |
That's the input. Then comes the hidden layer. 00:14:51.360 |
It's called the hidden layer because it has no direct contact with the input or the output of the network. 00:15:02.060 |
It is simply a block in the middle, and the core of the computational power of neural networks is the hidden layer. 00:15:14.860 |
It's tasked with forming a representation of the data in such a way that it maps from the inputs to the outputs. 00:15:23.160 |
In this case, there are 15 neurons in the hidden layer and 10 values on the output, one for each digit. 00:15:36.060 |
There are several ways you can build this kind of network, and this is what the magic of neural networks is. 00:15:42.860 |
You can do it a lot of ways. You only really need four outputs to represent values 0 through 9. 00:15:48.860 |
But in practice, it seems that having 10 outputs works better. And how do these work? 00:15:55.560 |
Whenever the input is a 5, the output neuron in charge of the 5 gets really excited 00:16:02.760 |
and outputs a value that's close to 1, from 0 to 1. 00:16:08.060 |
Close to 1. And then the other ones, hopefully, output a value that's close to 0. 00:16:15.460 |
And when they don't, we adjust the weights in such a way that they get closer to 0 and closer to 1, 00:16:22.960 |
depending on whether it's the correct neuron associated with the picture. 00:16:26.960 |
We'll talk about the details of this training process more tomorrow when it's more relevant. 00:16:35.660 |
But what we've discussed just now is the forward pass through the network. 00:16:43.360 |
It's the pass when you take the inputs, apply the weights, sum them together, add the bias, 00:16:50.160 |
produce the output and check which of the outputs produces the highest confidence of the number. 00:16:55.860 |
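A rough sketch of that forward pass, with the 784-15-10 shapes described here; the random weights and helper names are placeholders, not the lecture's actual network.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((15, 784)), np.zeros(15)   # input layer -> 15 hidden neurons
W2, b2 = rng.standard_normal((10, 15)), np.zeros(10)    # hidden layer -> 10 output neurons

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(image):
    x = image.reshape(784) / 255.0        # 28x28 pixels with values 0-255, flattened and scaled
    hidden = sigmoid(W1 @ x + b1)         # hidden-layer activations
    output = sigmoid(W2 @ hidden + b2)    # one output per digit, 0 through 9
    return output.argmax()                # the digit with the highest activation

print(forward(rng.integers(0, 256, size=(28, 28))))
```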
Then once those probabilities for each of the numbers are provided, 00:17:02.860 |
we determine the gradient that's used to punish or reward the weights 00:17:12.260 |
that resulted in either the correct or the incorrect decisions. 00:17:18.660 |
We step backwards through the network applying those punishments or rewards. 00:17:23.160 |
Because of the smoothness of the activation functions, that is a mathematically efficient operation. 00:17:38.360 |
The ground truth for number 6 looks like the following in the slides. 00:17:47.560 |
y(x) equals a 10-dimensional vector, where only one of the values, the 6th, is a 1, and the rest are 0. 00:18:02.560 |
That's the ground truth that comes with the image. 00:18:06.060 |
The loss function here, the basic loss function is the squared error. 00:18:11.760 |
y(x) is the ground truth and a is the output of the neural network resulting from the forward pass. 00:18:21.360 |
So when you input that number of a 6 and it outputs whatever it outputs, that's A, a 10-dimensional vector. 00:18:31.260 |
And it's summed over the inputs to produce the squared error. 00:18:38.160 |
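Written out, the squared-error loss being described takes the standard form (the 1/2n normalization over the n training inputs is a common convention, assumed here rather than quoted from the slide):

$$ C(w, b) = \frac{1}{2n} \sum_x \left\lVert y(x) - a \right\rVert^2 $$

where y(x) is the one-hot ground truth for input x and a is the network's 10-dimensional output.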
The loss function, the objective function, that's what's used to determine 00:18:43.960 |
how much to reward or punish the back propagated weights throughout the network. 00:18:50.260 |
And the basic operation of optimizing that loss function, of minimizing that loss function 00:19:00.660 |
is done with various variants of gradient descent. 00:19:04.260 |
It's hopefully a somewhat smooth function, but it's a highly nonlinear function. 00:19:11.960 |
This is why we can't prove much about neural networks. 00:19:15.360 |
It's a highly high-dimensional, highly nonlinear function that's hopefully smooth enough 00:19:23.660 |
where the gradient descent can find its way to at least a good solution. 00:19:30.860 |
And there has to be some stochastic element there that jumps around 00:19:39.060 |
to ensure that it doesn't get stuck in a local minimum of this very complex function. 00:19:47.860 |
There's inputs, there's outputs, ground truth. 00:19:53.360 |
Because we're pretty confident, we know what's going on. 00:19:55.960 |
All you have to do is take this data set, 00:19:59.660 |
train a network on that data set, evaluate it, write a paper 00:20:05.260 |
and try to beat a previous paper. It's great. 00:20:07.760 |
The problem is when you then use that neural network to create an intelligent system that operates in the real world. 00:20:16.860 |
And now that system is no longer working with your data set. 00:20:22.560 |
It has to exist in this world that's maybe very different from the ground truth. 00:20:29.560 |
So the takeaway from supervised learning is that neural networks are great at memorization. 00:20:36.060 |
But in a sort of philosophical way, they might not be great at generalizing, 00:20:41.460 |
at reasoning beyond the specific flavor of data set that they were trained on. 00:20:49.260 |
The hope for reinforcement learning is that we can extend the knowledge we gain in a supervised way 00:20:56.960 |
to the huge world outside where we don't have the ground truth of how to act, 00:21:08.160 |
of how good a certain state is or how bad a certain state is. 00:21:16.760 |
And I'll talk about kind of what I mean there. 00:21:20.360 |
But it feels like it's closer to reasoning as opposed to memorization. 00:21:24.560 |
That's a good way to think of supervised learning is memorization. 00:21:30.660 |
And as many of you know, being good at memorization doesn't mean you're going to be successful in life. 00:21:36.760 |
And so a reinforcement learning agent or just any agent, 00:21:46.560 |
a human being or any machine existing in this world 00:21:51.960 |
can operate in the following way from the perspective of the agent. 00:21:57.960 |
It can execute an action, it can receive an observation 00:22:02.060 |
resulting from that action in the form of a new state, and it can receive a reward. 00:22:11.660 |
You can break down our existence in this way, simplistic view. 00:22:18.360 |
But it's a convenient one on the computational side. 00:22:23.460 |
And from the environment side, the environment receives the action, 00:22:28.860 |
emits the observation, so your action changes the world, 00:22:37.960 |
and then it tells you about it and gives you a reward or punishment for it. 00:22:42.760 |
So let's look at, again, one of the most fascinating things. 00:22:54.560 |
I'll try to convey why this is fascinating a little bit later on. 00:23:06.160 |
This is Atari Breakout, a game where a paddle has to move around. 00:23:15.960 |
The agent is the paddle, and there's a bouncing ball, 00:23:21.460 |
and your actions are to move right or move left. 00:23:25.960 |
You're trying to move in such a way that the ball doesn't get past you. 00:23:30.860 |
And so here is a human level performance of that agent. 00:23:47.060 |
This may seem obvious but moving right changes visually the state of the world. 00:23:54.860 |
In fact, what we're watching now on the slides is the world changing before your eyes for this little guy. 00:24:10.160 |
They're racking up points in the top left of the video. 00:24:14.860 |
And then when the ball gets past the paddle, it gets punished by dying, quote-unquote. 00:24:22.660 |
And that's the number of lives it has left, going from five to four to three, down to zero. 00:24:30.860 |
And so the goal is to select at any one moment the action that maximizes future reward. 00:24:38.660 |
Without any knowledge of what a reward is, in a greater sense of the word, 00:24:45.260 |
all you have is an instantaneous reward or punishment, instantaneous response of the world to your actions. 00:24:51.660 |
And this can be modeled as a Markov decision process. 00:25:01.160 |
A Markov decision process is a mathematically convenient construct. 00:25:06.460 |
It has no memory. All you get is you have a state that you're currently in, 00:25:12.460 |
you perform an action, you get a reward and you find yourself in a new state. 00:25:18.860 |
You start from state zero, you go to state one, you once again repeat an action, get a reward, go to the next state. 00:25:27.860 |
Okay, that's the formulation that we're operating in. 00:25:30.060 |
When you're in a certain state, you have no memory of what happened two states ago. 00:25:35.960 |
Everything operates on the instantaneous, on the current state. 00:25:43.360 |
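In the usual notation, which matches this description, a Markov decision process is a tuple of states, actions, transition probabilities, rewards, and a discount factor, with the memoryless (Markov) property that the next state depends only on the current state and action:

$$ \langle S, A, P, R, \gamma \rangle, \qquad P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots) = P(s_{t+1} \mid s_t, a_t) $$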
And so what are the major components of a reinforcement learning agent? 00:25:47.260 |
There's a policy. That's the function, broadly defined, of the agent's behavior. 00:25:56.960 |
That includes the knowledge, for any given state, of what action I will take, with some probability. 00:26:09.560 |
The value function is how good each state is, and how good each action is in any particular state. 00:26:20.960 |
And there's a model. Now this is a subtle thing that is actually the biggest problem with everything you'll see today. 00:26:31.560 |
The model is how we represent the environment. 00:26:35.260 |
And what you'll see today is some amazing things that neural networks can achieve 00:26:39.760 |
on a relatively simplistic model of the world. 00:26:43.360 |
And the question is whether that model can extend to the real world, 00:26:47.260 |
where human lives are at stake in the case of driving. 00:26:50.760 |
So let's look at the simplistic world. A robot in a room. 00:26:58.660 |
You start at the bottom left. Your goal is to get to the top right. 00:27:04.260 |
Your possible actions are going up, down, left and right. 00:27:10.560 |
Now this world can be deterministic, which means when you go up, you actually go up. 00:27:19.560 |
Or it could be non-deterministic as human life is. 00:27:28.160 |
So in this case, if you choose to go up, you move up 80% of the time. 00:27:34.660 |
You move left 10% of the time and you move right 10% of the time. 00:27:38.560 |
And when you get to the top right, you get a reward of +1. 00:27:44.060 |
When you get to the block next to it, at (4, 2), you get a -1. You get punished. 00:27:49.660 |
And every time you take a step, you get a slight punishment of -0.04. 00:27:54.760 |
Okay, so the question is, if you start at the bottom left, is this a good solution? 00:28:03.060 |
Is this a good policy by which you exist in the world? 00:28:15.260 |
If the world is deterministic, whenever you choose to go right, you go right, and this works fine. 00:28:17.960 |
But if the actions are stochastic, that's not the case. 00:28:25.460 |
In what I described previously, with probability 0.8 of going up and probability 0.1 of going left or right, this is no longer an optimal policy. 00:28:42.560 |
Now, suppose we punish every single step with a -2 as opposed to a -0.04. 00:28:52.760 |
You're going to try to get to a positive block as quickly as possible. 00:29:03.160 |
I'll walk through a -1 if I have to, as long as I stop getting a -2. 00:29:23.360 |
And then you might take an even longer detour as the reward for each step goes up. 00:29:39.360 |
And then if there's an actual positive reward for every step you take, the agent never wants to finish the game at all, 00:30:10.360 |
because it was having too much fun getting points in the middle. 00:30:13.560 |
So, let's look at the world that this agent is operating in. 00:30:31.160 |
The value function is the expected future reward, and that reward is discounted because the world is stochastic. 00:30:36.160 |
We can't expect the reward to come along to us in the way that 00:30:43.160 |
we hope it does based on the policy, based on the way we choose to act. 00:30:52.560 |
There's a discount factor that, as the reward is farther and farther into the future, 00:30:57.060 |
discounts that reward, diminishes the impact of that future reward on the current decision. 00:31:07.660 |
And so your goal is to develop a strategy that maximizes 00:31:13.460 |
the discounted future reward, the sum, this discounted sum. 00:31:25.060 |
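That discounted sum is the standard return, with discount factor gamma between 0 and 1:

$$ R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k} $$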
There are a lot of approaches for coming up with a good policy. 00:31:37.560 |
You can try to construct a model that optimizes some estimate of this world. 00:31:45.860 |
You can try in a Monte Carlo way to just simulate that world 00:31:54.460 |
And as it unrolls, you try to compute the optimal policy. 00:31:59.460 |
Or what we'll talk about today is Q-learning. 00:32:03.960 |
It's an off-policy approach where the policy is estimated as we go along. 00:32:33.560 |
The input to the Q-function is a state at time t, s_t, 00:32:42.660 |
and an action that you choose to take in that state, a_t. 00:32:48.960 |
And your goal is in that state to choose an action which maximizes the reward in the next step. 00:32:55.560 |
And what Q-learning does, and I'll describe the process, 00:33:01.260 |
is it's able to approximate through experience the optimal Q-function. 00:33:08.460 |
The optimal function that tells you how to act in any state of the world. 00:33:23.260 |
You have to explore in order to see every possible state, 00:33:28.060 |
try every different action, get rewarded, get punished, 00:33:31.860 |
and figure out what is the optimal thing to do. 00:33:43.460 |
On the left is the output, the new estimate, 00:33:48.460 |
the Q-function estimate of that state and action. 00:33:52.860 |
And this is the update rule at the core of Q-learning. 00:33:58.060 |
You take the old estimate and, based on the learning rate, 00:34:08.360 |
alpha, from 0 to 1, update the evaluation of that state 00:34:15.860 |
based on the new reward that you received at that time. 00:34:21.360 |
So you've arrived in a certain state, s_t, you take an action, 00:34:28.560 |
you get a certain reward, and you update your estimate of that state. 00:34:37.860 |
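For reference, this is the standard Q-learning update being described, with learning rate alpha and discount factor gamma:

$$ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] $$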
When the learning rate is 0, you don't learn. 00:34:41.960 |
When alpha is 0, you never change your worldview based on the new incoming evidence. 00:34:50.660 |
When alpha is 1, you completely replace your evaluation, 00:35:00.160 |
your evaluation of the world, with the new evidence every time. 00:35:04.760 |
And exploration versus exploitation is the key ingredient of reinforcement learning. 00:35:11.460 |
First you explore in a non-greedy way and then you get greedy. 00:35:15.660 |
You figure out what's good for you and you keep doing it. 00:35:18.160 |
So if you want to learn an Atari game, first you try every single action, every state, 00:35:24.360 |
you screw up, get punished, get rewarded and eventually you figure out 00:35:28.160 |
what's actually the right thing to do and you just keep doing it. 00:35:30.760 |
And that's how you win against the greatest human players in the world 00:35:36.960 |
in a game of Go, for example, as we'll talk about. 00:35:39.760 |
And the way you do that is you have an epsilon greedy policy 00:35:44.960 |
that over time with a probability of 1-epsilon, you perform an optimal greedy action. 00:35:52.960 |
With a probability of epsilon, you perform a random action. 00:36:00.260 |
And so as epsilon goes down from 1 to 0, you explore less and less. 00:36:05.560 |
So the algorithm here is really simple on the bottom of the slide there. 00:36:13.460 |
It's the algorithm version, the pseudocode version, of the equation. 00:36:23.160 |
You initialize your estimate of state-action pairs arbitrarily, to a random number. 00:36:32.360 |
When you start playing or living or doing whatever you're doing 00:36:38.360 |
and whatever you're doing with reinforcement learning or driving, 00:36:41.460 |
you have no preconceived notion of what's good and bad. 00:36:47.560 |
It's random or however you choose to initialize it. 00:36:51.760 |
And the fact that it learns anything is amazing. 00:36:58.260 |
That's one of the amazing things about Q-learning at all, 00:37:05.760 |
and then about the deep neural network version of Q-learning. 00:37:14.360 |
You step into the world, observe an initial state, and select an action. 00:37:21.960 |
So that action, if you're exploring, will be a random action. 00:37:25.660 |
If you're greedily pursuing the best action you can, 00:37:28.760 |
it will be the action that maximizes the Q function. 00:37:31.460 |
You observe a reward after you take the action, and the new state you find yourself in. 00:37:37.760 |
And then you update your estimate of the previous state you were in 00:37:41.860 |
having taken that action using that Bellman equation update. 00:37:49.460 |
And so there on the bottom of the slide is a summary of life. 00:38:16.160 |
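A compact Python sketch of that pseudocode, combining the epsilon-greedy action selection with the Bellman update; the env.reset()/env.step() interface and all the names here are assumptions for illustration, not part of the lecture.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                          # Q[(state, action)], arbitrary initialization
    for _ in range(episodes):
        state = env.reset()                         # observe an initial state
        done = False
        while not done:
            if random.random() < epsilon:           # explore: random action with probability epsilon
                action = random.choice(actions)
            else:                                   # exploit: greedy action under the current estimate
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)
            # Bellman update: move the old estimate toward reward + discounted best future value
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```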
The question was, is the Q function a single value? 00:38:19.160 |
And yes, it's just a single continuous value. 00:38:23.460 |
So the question was, how do you model the world? 00:38:37.060 |
So the way you model it, let's start with this very simplistic world. 00:38:47.960 |
You model it as a paddle that can move left and right 00:38:51.460 |
and there's some blocks and you model the physics of the ball. 00:38:56.160 |
That requires a lot of expert knowledge in that particular game. 00:39:04.760 |
That's hard to do even for a simplistic game. 00:39:07.660 |
The other model you could take is looking at this world as just pixels. 00:39:19.460 |
The model is just all the pixels of the world. 00:39:25.360 |
You know nothing about paddles or balls or physics or colors and points. 00:39:32.860 |
That seems like a ridiculous model of the world, but it's the one we work with as human beings. 00:39:40.260 |
When you're born, all you see is light coming into your eyes; 00:39:52.260 |
you don't come with an instruction manual when you're born. 00:39:55.160 |
Do you know there are people in the world, and that there are good guys and bad guys? 00:40:02.260 |
No, all you get is light, sound and the other sensors. 00:40:08.160 |
And you get to learn about every single thing you think of; 00:40:17.060 |
the way you model the world is a learned representation. 00:40:21.160 |
And we'll talk about how a neural network does that. 00:40:26.160 |
But if we have to hand model the world, it's an impossible task. 00:40:36.060 |
If we have to hand model the world, then that world better be a simplistic one. 00:40:45.260 |
So the question was, what is the robustness of this model? 00:40:48.360 |
If the way you represent the world is at all even slightly different 00:40:55.960 |
that's not that well studied as far as I'm aware. 00:41:00.060 |
I mean it's already amazing that, with this kind of construction, 00:41:07.160 |
you can learn anything at all. 00:41:14.760 |
There's also the question, not about the world model but about the reward function. 00:41:17.560 |
If the reward function is slightly different, 00:41:19.760 |
if the real reward function of life or driving or of Coast Runners 00:41:25.760 |
is different than what you expected it to be, what happens? 00:41:47.060 |
So the question was, do you change the alpha value over time? 00:41:50.660 |
And you certainly should change the alpha value over time. 00:42:00.360 |
So the question was, what is the complex interplay of the epsilon function 00:42:06.160 |
That's 100% fine-tuned, hand-tuned to the particular learning problem. 00:42:16.960 |
Generally, the more complex the world, the larger the number of states in the world, 00:42:27.260 |
the longer you have to wait before you decrease the epsilon to zero. 00:42:34.260 |
It's one of the parameters you have to play with, unfortunately, 00:42:38.860 |
which is why you can't just drop a reinforcement learning agent into the world. 00:42:53.960 |
half the time you're going to take a random action. 00:43:01.960 |
and then with some probability take the second best and so on. 00:43:11.060 |
because you don't want to have a preconceived notion of 00:43:13.260 |
what's a good action to try when you're exploring. 00:43:16.260 |
The whole point is you try crazy stuff, if it's a simulation. 00:43:28.160 |
This is the question about how we represent the world. 00:43:32.160 |
So we can think of this world of breakout, for example, 00:43:37.460 |
of this Atari game as a paddle that moves left and right 00:43:43.560 |
and the exact position of the different things it can hit, 00:43:48.960 |
an expert-driven model that has to be fine-tuned to this particular problem. 00:43:54.460 |
But in practice, the more complex this model gets, 00:44:08.760 |
estimating the Q value for every single combination of state and actions 00:44:11.660 |
becomes too difficult, because that function is too sparse and huge. 00:44:16.760 |
So if you think of looking at this world in a general way, 00:44:29.560 |
this game is a collection of 84 by 84 pixels, an image, RGB image. 00:44:35.160 |
And then you look at not just the current image, 00:44:40.860 |
but look at the temporal trajectory of those images. 00:44:45.960 |
So like if there's a ball moving, you want to know about that movement. 00:44:54.060 |
And say they're grayscale with 256 gray levels. 00:45:00.560 |
The size of the Q table that the Q value function has to learn is hard to even write down, 00:45:12.560 |
but it's certainly larger than the number of atoms in the universe. 00:45:19.660 |
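To make that concrete: assuming, as in the DQN setup, a state of the last four 84x84 grayscale frames with 256 gray levels each (the four-frame stack is an assumption here, not stated on this slide), the number of possible states is

$$ 256^{\,84 \times 84 \times 4} = 256^{28224} \approx 10^{67970}, $$

while the observable universe contains only on the order of 10^80 atoms.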
So you have to run the simulation long enough 00:45:23.260 |
to touch at least a few times most of the states in that Q table. 00:45:34.360 |
"You may need to run," you know, "we live in a simulation." 00:45:38.260 |
You may have to run a universe just to compute the Q function in this case. 00:45:58.360 |
So instead, you estimate, you try to learn that function. 00:46:04.160 |
And so the takeaway from supervised learning, if you remember, 00:46:09.760 |
is that it's good at memorizing data. 00:46:12.560 |
The hope for reinforcement learning with Q-learning 00:46:19.560 |
is that we can extend the occasional rewards we get 00:46:27.160 |
to the actions we take in that world leading up to those rewards. 00:46:30.760 |
And the hope for deep learning is that we can move this reinforcement learning system into a much bigger world, 00:46:48.560 |
one whose states can include all the pixels sensed by a drone or a robot or a car. 00:46:52.460 |
But still it needs a formalized definition of that world, 00:46:59.160 |
which is much easier to do when you're able to take in sensors like an image. 00:47:09.760 |
So instead of filling in a Q table, we approximate the Q function. 00:47:27.560 |
This huge, complex function, we try to learn it. 00:47:33.460 |
And the way we do that is we have a neural network, 00:47:39.260 |
the same kind that I showed that learned the numbers to map from an image 00:47:42.960 |
to a classification of that image into a number. 00:47:47.060 |
The same kind of network is used to take in a state and action and produce a Q value. 00:47:58.660 |
And the amazing thing is that, without knowing anything in the beginning, 00:48:06.060 |
as I said with a Q table, it's initialized randomly. 00:48:12.460 |
The Q function, this deep network knows nothing in the beginning. 00:48:17.560 |
All it knows is in the simulated world, the rewards you get for a particular game. 00:48:26.060 |
So you have to play time and time again and see 00:48:29.560 |
the rewards you get for every single iteration of the game. 00:48:40.060 |
And it's able to learn to play better than human beings. 00:48:44.460 |
This is a DeepMind paper playing Atari with deep reinforcement learning from 2013. 00:48:52.160 |
It's one of the key things that got everybody excited about the role of deep learning. 00:48:59.360 |
It showed that using a convolutional neural network, which I'll talk about tomorrow, 00:49:07.160 |
but it's a vanilla network like any other, like I talked about earlier today. 00:49:11.160 |
Just a regular network that takes the raw pixels, as I said, 00:49:15.760 |
and estimates that Q function from the raw pixels 00:49:19.260 |
and is able to play on many of those games better than a human being. 00:49:22.860 |
And the loss function is like the one that I mentioned previously. 00:49:27.860 |
So again, very vanilla loss function, very simple objective function. 00:49:43.460 |
So we take this Bellman equation where the estimate is Q, 00:49:48.560 |
the Q function estimate of state and action is the maximum reward you get 00:49:55.860 |
for taking any of the actions that take you to any of the future states. 00:50:03.260 |
And you try to take that action, observe the result of that action, 00:50:09.460 |
and if the target is different from your learned estimate, 00:50:14.960 |
if what the function has learned as the expected reward in that case 00:50:19.360 |
is different than what you actually got, you adjust it. 00:50:28.560 |
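In equation form, the loss compares the Bellman target with the network's current prediction (this is the standard DQN squared-error objective; the 1/2 factor is just a common convention):

$$ L = \tfrac{1}{2}\Big[\, \underbrace{r + \gamma \max_{a'} Q(s', a')}_{\text{target}} \;-\; \underbrace{Q(s, a)}_{\text{prediction}} \,\Big]^2 $$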
And this is exactly the process by which we learn how to exist in this pixel world. 00:50:35.360 |
So you're mapping states and actions to a Q value. 00:50:48.160 |
We're given a transition, S, current state, action taken in that state, 00:50:55.060 |
R, the reward you get, and S' is the state you find yourself in. 00:50:59.360 |
And so we replace the basic update rule in the previous pseudocode 00:51:06.460 |
by taking a forward pass through the network, given that S state. 00:51:14.860 |
We look at what the predicted Q value is of that action. 00:51:20.660 |
We then do another forward pass through that network for the next state, S prime, to get the target. 00:51:26.960 |
And then if we're totally off, we punish, we back propagate the weights 00:51:37.160 |
in a way that next time we'll make less of that mistake. 00:52:08.960 |
In the beginning you're mostly exploring, and then you move towards an epsilon of 0. 00:52:17.460 |
And with Atari breakout, this is the DeepMind paper result. 00:52:26.360 |
On the Y axis is the average action value and the average reward per episode. 00:52:31.560 |
I'll show why it's kind of an amazing result, but it's messy. 00:52:40.060 |
So it's not just putting in a bunch of pixels of a game 00:52:44.560 |
and getting an agent that knows how to win at that game. 00:52:48.160 |
There's a lot of pre-processing and playing with the data required. 00:52:53.360 |
Which is unfortunate, because the truth is messier than the hope. 00:52:59.760 |
But one of the critical tricks needed is called experience replay. 00:53:05.960 |
So, you're learning this big network 00:53:12.560 |
that tries to build a model of what's good to do in the world and what's not. 00:53:21.560 |
So with experience replay, you're keeping track of all the things you did. 00:53:26.260 |
And every once in a while, you look back into your memory 00:53:29.660 |
and pull out some of those old experiences, the good old times, and train on those again. 00:53:34.960 |
As opposed to letting the agent run itself into some local optima 00:53:41.960 |
where it tries to learn a very subtle aspect of the game 00:53:44.760 |
that actually in the global sense doesn't get you farther to winning the game. 00:53:52.260 |
So here's the algorithm, deep Q learning algorithm, pseudocode. 00:54:01.860 |
Again, there's this little trick that's required. 00:54:05.260 |
It's keeping track of stuff that's happened in the past. 00:54:08.860 |
We initialize the replay memory and the action value function Q with random weights. 00:54:15.660 |
Again, same thing, select an action with a probability epsilon, explore. 00:54:21.760 |
Otherwise, choose the best one based on the estimate provided by the neural network. 00:54:27.560 |
And then carry out the action, observe the reward and store that experience in the replay memory. 00:54:34.160 |
And then sample random transition from replay memory. 00:54:41.860 |
So with a certain probability, you bring those old times back to get yourself out of the local minima. 00:54:48.960 |
And then you train the Q network using the difference between what you actually got and your estimate. 00:55:08.160 |
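Putting those steps together, here is a condensed Python sketch of the deep Q-learning loop with experience replay; q_net, its train_step method, and the env interface are illustrative assumptions standing in for whatever network and simulator you actually use.

```python
import random
from collections import deque

def deep_q_learning(env, q_net, num_actions, episodes=1000, gamma=0.99,
                    epsilon=1.0, epsilon_min=0.1, epsilon_decay=0.995, batch_size=32):
    replay_memory = deque(maxlen=100_000)            # store past transitions
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy: mostly explore early on, exploit more as epsilon decays
            if random.random() < epsilon:
                action = random.randrange(num_actions)
            else:
                action = max(range(num_actions), key=lambda a: q_net(state)[a])
            next_state, reward, done = env.step(action)
            replay_memory.append((state, action, reward, next_state, done))

            if len(replay_memory) >= batch_size:
                # sample random old experiences instead of only the most recent ones
                batch = random.sample(replay_memory, batch_size)
                states, acts, targets = [], [], []
                for s, a, r, s2, d in batch:
                    target = r if d else r + gamma * max(q_net(s2))
                    states.append(s); acts.append(a); targets.append(target)
                q_net.train_step(states, acts, targets)   # minimize (target - Q(s, a))^2
            state = next_state
        epsilon = max(epsilon_min, epsilon * epsilon_decay)
```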
After 10 minutes of training on the left, so that's very little training, 00:55:14.060 |
what you get is a paddle that learns hardly anything and it just keeps dying. 00:55:22.360 |
If you look, it goes from 5 to 4 to 3 to 2 to 1; those are the number of lives left. 00:55:28.360 |
Then after two hours of training on a single GPU, it learns to win, you know, not die, 00:55:38.160 |
rack up points and learns to avoid the ball from passing the paddle, which is great. 00:55:48.260 |
That's human level performance really, better than some humans, you know, 00:55:53.560 |
but it still dies sometimes so it's very human level. 00:55:58.060 |
And then after four hours, it does something really amazing. 00:56:03.560 |
It figures out how to win at the game in a very lazy way, which is drill a hole through the blocks 00:56:15.060 |
up to the top and get the ball stuck up there and then it does all the hard work for you. 00:56:19.960 |
That minimizes the probability of the ball getting past your paddle because it's just stuck in the blocks up top. 00:56:28.560 |
So that might be something that you wouldn't even figure out to do yourself. 00:56:32.260 |
And that's an... I need to sort of pause here to clearly explain what's happening. 00:56:40.560 |
The input to this algorithm is just the pixels of the game. 00:56:46.660 |
It's the same thing that human beings take in with their visual perception, 00:56:53.060 |
and it's able to learn under this constrained definition of what is a reward and a punishment. 00:57:13.760 |
It's general purpose. It knows nothing about games. 00:57:17.460 |
It knows nothing about paddles or physics. It's just taking sensory input of the game. 00:57:23.160 |
And they did the same thing for a bunch of different games in Atari. 00:57:27.760 |
And what's shown here in this plot on the X-axis is a bunch of different games from Atari 00:57:38.460 |
and on the Y-axis is a percentile where 100% is about the best that human beings can do. 00:57:46.660 |
Meaning it's the score that human beings would get. 00:57:48.560 |
So everything above that is better than human-level performance, and everything to the left of that 00:57:54.860 |
and below that is on par with or worse than human-level performance. 00:57:59.460 |
So it can learn so many of these games, boxing, pinball, all of these games, 00:58:06.660 |
and it doesn't know anything about any of the individual games. 00:58:11.160 |
It's just as if you put a human being behind any of these games 00:58:23.260 |
And there's been a lot of improvements in this algorithm recently. 00:58:34.260 |
So the question was, do they customize the model for a particular game? 00:58:41.560 |
but the point is it doesn't need to be customized for the game. 00:58:45.060 |
But the important thing is that it's still only on Atari games. 00:58:54.060 |
Right, so the question was whether this is transferable to driving. 00:59:12.660 |
No, you don't have the... Well yeah, you play one step of the game. 00:59:15.960 |
So you take action in a state and then you observe that. 00:59:26.660 |
I mean that's really, that's one of the biggest problems here 00:59:31.860 |
is you require the simulation in order to get the ground truth. 00:59:45.960 |
The comment was that for a lot of these situations, 00:59:49.760 |
the reward function might not change at all depending on your actions. 00:59:54.060 |
The rewards are really, most of the time, delayed 10, 20, 30 steps down the line. 01:00:02.860 |
Which is why it is amazing that this works at all. 01:00:08.260 |
That it's learning locally and through that process of simulation 01:00:15.760 |
of hundreds of thousands of times it runs through the game, 01:00:17.960 |
it's able to learn what to do now such that I get a reward later. 01:00:24.260 |
If you just pause and look at the math of it, it's very simple math. 01:00:39.460 |
This one is called the General Reinforcement Learning Architecture, or GORILLA. 01:00:45.460 |
The cool thing about this in the simulated world at least 01:00:50.860 |
is that you can run deep reinforcement learning in a distributed way. 01:00:56.160 |
You can do the simulation in a distributed way, 01:00:58.660 |
and you can do the learning in a distributed way. 01:01:00.960 |
You can generate experiences which is what this kind of diagram shows. 01:01:07.560 |
You can generate them either from human beings or from simulation. 01:01:12.060 |
So for example, the way that the DeepMind team beat the game of Go with AlphaGo 01:01:21.160 |
is that they learned from both expert games and from the system playing itself. 01:01:29.760 |
And you can do the learning in a distributed way, so you can scale. 01:01:33.360 |
And in this particular case, the GORILLA has achieved a better result than the DQN network. 01:01:45.360 |
Okay, so let me now get to driving for a second here. Is there a formulation of driving 01:01:59.860 |
where reinforcement learning can step in and help? 01:02:03.560 |
So this is back to the open question that I asked yesterday. 01:02:07.160 |
Is driving closer to chess or to everyday conversation? 01:02:11.460 |
Chess meaning it can be formalized in a simplistic way 01:02:16.560 |
and we could think about it as an obstacle avoidance problem. 01:02:27.060 |
You choose to move left, you choose to move right in a lane. 01:02:32.860 |
Well, if it's a game like chess, which we'll assume for today 01:02:38.660 |
as opposed to for tomorrow, then for today we're going to go with the one on the left. 01:02:51.660 |
Here's this game, a simulation, where the goal is to achieve the highest average speed you can. 01:03:07.060 |
And so, as a side note for students, the requirement is they have to follow the tutorial 01:03:14.060 |
that I'll present a link for at the end of this presentation. 01:03:21.860 |
and build a network that achieves a speed of 65 miles an hour or higher. 01:03:26.060 |
There is a leaderboard, and you get to submit the model you come up with. 01:03:35.160 |
So all of this runs in the browser, which is also another amazing thing. 01:03:38.860 |
And then you immediately, or relatively soon, make your way up the leaderboard. 01:03:50.860 |
What is this world, this two-dimensional world of traffic? 01:03:56.260 |
What does it look like to the intelligent system? 01:04:01.260 |
We discretize that world into a grid shown here on the left. 01:04:08.560 |
There are seven lanes and every single lane is broken up into blocks spatially. 01:04:14.460 |
And if there's a car in a block (the length of a car is about three blocks, 01:04:18.260 |
three of those grid blocks), then that block is seen as occupied. 01:04:30.360 |
That's the thing that's running in the intelligent agent. 01:04:33.860 |
On the left is the current speed of the red car. 01:04:42.360 |
And then you also have a count of how many cars you passed. 01:04:47.760 |
And if your network sucks, then that number is going to go negative. 01:04:54.160 |
You can also change with a drop down the simulation speed 01:05:01.160 |
from normal on the left to fast on the right. 01:05:05.060 |
Fast speeds up the replay of the simulation. 01:05:14.260 |
The one on the left, normal, feels a little more like real driving. 01:05:20.160 |
There's a drop down for different display options. 01:05:28.860 |
The default is none in terms of stuff you show on the road. 01:05:33.960 |
Then there is the learning input: while the whole space is discretized, the network only sees part of it. 01:05:44.060 |
And you can choose how far ahead it sees, how far behind, and how far to the left and right it sees. 01:05:52.060 |
And so by choosing to visualize the learning input, 01:05:56.860 |
you get to see what you set that input to be. 01:06:04.160 |
Then there's the safety system. This is a system that protects you from yourself. 01:06:07.460 |
The way we've made this game is that it operates under something similar 01:06:15.860 |
to adaptive cruise control: if you drive and you have adaptive cruise control in your car, 01:06:24.960 |
when it gets close to the car in front, it slows down for you 01:06:29.760 |
and it doesn't let you run the car to the left of you, to the right of you, off the road. 01:06:35.460 |
So it constrains the movement capabilities of your car in such a way that you don't hit anybody 01:06:44.160 |
because then it would have to simulate collisions and it would just be a mess. 01:06:48.060 |
So it protects you from that and so you can choose to visualize that "safety system" with the visualization box. 01:06:58.860 |
And then you can also choose to visualize the full map. 01:07:01.360 |
This is the full occupancy map that you can, if you would like to, provide as input to the network. 01:07:09.560 |
Now that input, for every single grid cell, is a number. 01:07:14.860 |
It's not just a zero or one indicating whether there's a car in there. 01:07:17.860 |
It's the maximum speed limit, which is 80 miles per hour. 01:07:24.060 |
Don't get crazy. 80 miles an hour is the speed limit. 01:07:29.260 |
That block, when it's empty, is set to the 80 miles an hour. 01:07:34.560 |
And when it's occupied, it's set to the number that's the speed of the car. 01:07:40.960 |
And then the blocks that the red car is occupying 01:07:46.360 |
are set to a very large number, much higher than the speed limit. 01:07:57.860 |
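A tiny sketch of that per-cell encoding; the exact "very large number" used for the red car's own cells isn't given here, so the marker value below is a made-up placeholder.

```python
SPEED_LIMIT = 80.0      # empty cells carry the 80 mph speed limit
EGO_MARKER = 1000.0     # assumed placeholder for "much higher than the speed limit"

def encode_cell(is_ego=False, car_speed=None):
    """Grid value for one cell, following the scheme described above."""
    if is_ego:
        return EGO_MARKER        # cells covered by the red (ego) car
    if car_speed is None:
        return SPEED_LIMIT       # empty cell
    return float(car_speed)      # occupied cell: that car's current speed
```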
So safety system, here shown in red, are the parts of the grid that your car can't move into. 01:08:19.060 |
Yes. The question was, what was the third option I just mentioned? 01:08:27.660 |
You yourself, the blocks underneath that car, are set to a really high number. 01:08:32.660 |
It's a way for the algorithm to know, for the learning algorithm to know, that these blocks are special. 01:08:39.760 |
So safety system shows red here if the car can't move into those blocks. 01:08:52.660 |
So when it lights up red in front of the car, it means the car can't speed up anymore. 01:09:01.260 |
And when the blocks to the left or to the right light up as red, 01:09:04.660 |
that means you can't change lanes to the left or right. 01:09:06.760 |
On the right of the slide, you're free to go, free to do whatever you want. 01:09:15.360 |
All the blocks there are yellow. The safety system says you're free to choose any of the five actions. 01:09:22.760 |
And the five actions are move left, move right, stay in place, accelerate or slow down. 01:09:35.160 |
That action is what's produced by what's called here the brain. 01:09:43.260 |
The brain takes in the current state and the last reward as input, 01:09:50.460 |
and uses that reward to train the network through the backward function there, backpropagation. 01:09:59.660 |
And then you ask the brain, given the current state, to give you the next action with a forward pass, the forward function. 01:10:09.760 |
You don't need to know the operation of this function in particular. 01:10:13.760 |
This is not something you need to worry about, but you can if you want. 01:10:19.560 |
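Schematically, the loop being described looks like the sketch below; the method names mirror the forward/backward description above, but the real DeepTraffic brain is JavaScript, so treat this Python version purely as an illustration.

```python
ACTIONS = ["left", "right", "stay", "accelerate", "decelerate"]

def control_step(brain, state, last_reward):
    brain.backward(last_reward)           # use the last reward to train the network (backpropagation)
    action_index = brain.forward(state)   # forward pass: ask for the next action given the current state
    return ACTIONS[action_index]
```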
By the way, what I'm describing now is just a few lines of code right there in the browser 01:10:29.160 |
that you can change, and immediately, well, with the press of a button, 01:10:34.460 |
it changes the simulation or the design of the network. 01:10:37.860 |
You don't need to have any special hardware, you don't need to do anything special. 01:10:41.560 |
And the tutorial cleanly outlines exactly all of these steps. 01:10:45.160 |
But it's kind of amazing that you can design a deep neural network 01:10:50.160 |
that's part of the reinforcement learning agent. 01:10:52.460 |
So it's a deep Q learning agent right there in the browser. 01:10:59.460 |
So you can choose the lane side variable which controls how many lanes to the side you see. 01:11:08.460 |
So when that value is zero, you only look forward. 01:11:10.760 |
When that value is one, you have one lane to the left and one lane to the right. 01:11:14.860 |
It's really the radius, in lanes, of your perception system. 01:11:18.560 |
Patches ahead is how far ahead you look, patches behind is how far behind you look. 01:11:26.260 |
And so for example here, the lane side equals two, that means it looks two to the left, two to the right. 01:11:33.160 |
Obviously, if two to the right is off-road, it provides a value of zero in those blocks. 01:11:41.860 |
If we set the patches behind to be ten, it looks ten patches back. 01:11:48.460 |
Patches behind starts counting one patch back from the front of the car. 01:11:54.160 |
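A hedged sketch of how those variables relate to the size of the spatial input the network sees; this counts grid cells only and may not be the exact formula used in the DeepTraffic code.

```python
def spatial_input_size(lanes_side, patches_ahead, patches_behind):
    lanes_seen = 2 * lanes_side + 1              # your own lane plus lanes_side lanes on each side
    return lanes_seen * (patches_ahead + patches_behind)

# e.g. one lane on each side, 10 patches ahead, 5 behind -> 45 grid cells as input
print(spatial_input_size(lanes_side=1, patches_ahead=10, patches_behind=5))
```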
The scoring for the evaluation, for the competition is your average speed over a predefined period of time. 01:12:03.060 |
And so the method we use to collect that speed is we run the agent ten runs, 01:12:10.960 |
about 30 simulated minutes of game each and take the median speed of the ten runs. 01:12:21.060 |
This is done server-side. And given that 01:12:30.360 |
this code recently got some publicity online, unfortunately, 01:12:35.160 |
this might be a dangerous thing to say, but no cheating is possible. 01:12:40.160 |
Because the scoring is done server-side, and this is JavaScript that runs in the browser, 01:12:46.660 |
it's hopefully sandboxed so you can't do anything tricky. 01:13:00.560 |
And there's a button that says evaluate and it gives you a score right back 01:13:06.360 |
of how well you're doing with the current network. 01:13:16.360 |
You press the button, it does a progress bar and it gives you the average speed. 01:13:20.960 |
There's a code box where you modify all the variables I mentioned. 01:13:32.960 |
And then once you're ready, you modify a few things, you can press apply code, it restarts, 01:13:40.660 |
it kills all the training that you've done up to this point or resets it and starts the training again. 01:13:52.360 |
So the training is done on a separate thread in web workers 01:13:58.960 |
which are exciting things that allow JavaScript to run, amazingly, in parallel in the background. 01:14:14.860 |
So the simulation that scores this or the training is done a lot faster than real time. 01:14:22.860 |
A thousand frames a second, a thousand movement steps a second. 01:14:31.660 |
And the network gets shipped to the main simulation from time to time as the training goes on. 01:14:40.160 |
So all you have to do is press run training and it trains and the car behaves better over time. 01:14:47.460 |
Maybe I'll actually show it in the browser. Let's see if it works. 01:15:27.660 |
When it starts, this is running live in the browser. 01:15:35.560 |
Artificial intelligence, ladies and gentlemen, in the browser, a neural network. 01:15:42.560 |
It's driving at two miles an hour and watching everybody pass. 01:15:47.760 |
So what's being shown live is the loss function which is pretty poor. 01:15:57.260 |
So in order to train, like I said, a thousand frames a second, you just press the run training button. 01:16:04.560 |
And pretty quickly it learns, based on the network you specify in the code box, 01:16:10.060 |
and based on the input and all the things that I mentioned. There, training finished. 01:16:22.260 |
We on purpose put in a network that's not very good in there. 01:16:26.360 |
So right now it won't, on average, be doing that well, but it does better than standing there in place. 01:16:32.060 |
And then you can do the start evaluation run to simulate the network much faster than real time. 01:16:43.960 |
This is a similar evaluation step that we take when determining where you stand on the leaderboard. 01:16:51.760 |
The current average speed in that 10 run simulation is 56.56 miles per hour. 01:17:06.160 |
If you're logged in, you click submit your code. 01:17:11.160 |
If you're not logged in, it says you're not logged in, please log in to submit your code. 01:17:25.260 |
And then you press submit model again and success. 01:17:36.860 |
And so now my submission is entered as Lex in the leaderboard. 01:17:48.460 |
So as you play around with stuff, if you want to save the code, 01:17:55.360 |
you could do so by pressing the save code button. 01:18:00.660 |
That saves the various JavaScript configurations, 01:18:11.160 |
Again, danger, it overrides the code for you. 01:18:15.660 |
And you press the submit button to submit the model to competition. 01:18:22.560 |
You bring us, you submit a model and you have to press train. 01:18:32.560 |
This is public facing, so the queue can grow pretty big. 01:18:39.560 |
and then depending on where you stand, you get added to the leaderboard. 01:18:43.760 |
You get added to the leaderboard, showing the top 10 entries. 01:19:06.660 |
by Andrej Karpathy from Stanford, now OpenAI. 01:19:27.860 |
Not just how far ahead, behind, and to the left and right you see, 01:19:32.960 |
you can specify how far back in time you look as well. 01:19:40.060 |
And so, what's visualized there is the input to the network, 135 neurons, 01:19:49.960 |
similar to the kind of output we saw with numbers, 01:19:53.260 |
where there were 10 outputs saying if it's a 0, a 1, up to a 9. 01:20:00.160 |
Here there are five outputs: left, right, stay in place, speed up or slow down. 01:20:11.960 |
this is all the stuff you don't need to mess with, 01:20:13.760 |
because we already give you the variables of lane side and patches ahead and so on. 01:20:39.060 |
Again, all this is in the tutorial, just to give you a little outline. 01:20:41.960 |
The first fully connected layer has 10 neurons, 01:20:50.560 |
with the same kind of smooth activation function that we talked about before. 01:20:59.560 |
And there's a bunch of other messy options that you can play with, if you dare. 01:21:08.460 |
But those aren't the ones I mentioned before; it's really the important ones: 01:21:11.460 |
selecting the number of layers, the size of those layers. 01:21:14.960 |
You get to build your very own neural network that drives. 01:21:17.960 |
And the actual learning is done with backward propagation, 01:21:22.160 |
and the action is returned by doing a forward pass through the network. 01:21:27.260 |
In case you're interested in this kind of stuff, 01:21:38.160 |
It just works, it does some auto-completion, 01:21:43.460 |
so you get to play with it; it makes everything very convenient in terms of code editing. 01:21:51.160 |
and the simulation, and the simulation we'll talk about tomorrow, 01:22:01.760 |
So here is a simple specification of a blue box with Canvas, 01:22:06.060 |
and this is very efficient and easy to work with. 01:22:09.860 |
And the thing that a lot of us are excited about, a very subtle one, is that 01:22:20.760 |
with the V8 engine, JavaScript has become super fast. 01:22:26.160 |
You can train neural networks in the browser, that's already amazing. 01:22:30.060 |
And then with web workers, as long as you have Chrome or another modern browser, 01:22:35.260 |
you can run multiple processes in separate threads. 01:22:42.060 |
So you can do a lot of stuff: you can do the visualization separately and the training separately. 01:22:49.660 |
Okay, so the tutorial is cars.mit.edu and deep traffic. 01:22:54.560 |
We won't put these links on the website for a little bit, 01:22:57.760 |
because we got put on the front page of Hacker News. 01:23:08.860 |
especially with the claims that you can't cheat. 01:23:12.160 |
And while it's pretty efficient in terms of running time, 01:23:17.960 |
since everything is running on your machine, client side, 01:23:20.360 |
you still have to pull some images and some of the code from our server. 01:23:24.460 |
So the tutorial is on cars.mit.edu/deep-traffic 01:23:35.460 |
I encourage you to go there, play with the network, submit your code, 01:23:42.260 |
It is a pretty cool one, but we're still working on it. 01:23:48.260 |
All right, so let's take a pause and think about what we talked about today. 01:23:58.860 |
So the very best of deep reinforcement learning is in the game of Go. 01:24:09.060 |
When I first started as a freshman, 01:24:18.860 |
it was said that it's a game that's impossible for machines to beat humans at. 01:24:32.360 |
And so the most amazing accomplishment of deep reinforcement learning to me came 01:24:41.260 |
when, for the first time, the world champion in Go was beaten by DeepMind's AlphaGo. 01:24:48.660 |
And the way they did it, and this is, I think, very relevant to driving, 01:24:55.360 |
is that you start, first in a supervised way, by training a policy network. 01:25:02.460 |
So you take expert games to construct a network first. 01:25:11.160 |
So the agent doesn't play against itself from scratch. 01:25:28.860 |
And it's the same with driving: we're starting to get a lot of data, 01:25:34.660 |
so we can learn on that data before we then run the agents through simulation, 01:25:39.660 |
where it learns much larger magnitudes of data sets through simulation. 01:25:50.760 |
Now, as a reminder of what can happen when you let an agent learn by itself, 01:25:59.860 |
this is probably one of my favorite videos of all time, 01:26:04.260 |
though I just recently saw it, and I could just watch it for hours. 01:26:06.760 |
But it's a reminder that you can't trust your first estimates of a reward function 01:26:20.360 |
To be those that are safe and productive for our society 01:26:24.460 |
when you're talking about an intelligent system that gets to operate in the real world. 01:26:29.060 |
This is just as clear of a reminder of that as there is. 01:26:33.760 |
So again, all the references are available online for these slides. 01:26:41.560 |
I imagine you might have questions; if you want to come down and talk to us, 01:26:46.960 |
we can help with either Docker or JavaScript questions. 01:26:51.360 |
So the question was, what is the visualization you're seeing in deep traffic? 01:27:03.860 |
It's moving based on the latest snapshot of the network you trained. 01:27:07.060 |
So it's just visualizing for you just for fun the network you trained most recently. 01:27:15.060 |
Okay, so if people have questions, stick around afterwards.