MIT 6.S094: Deep Reinforcement Learning
Chapters
0:00 AI Pipeline from Sensors to Action
8:25 Reinforcement Learning
23:50 Deep Reinforcement Learning
36:00 AlphaGo
41:50 DeepTraffic
54:35 Conclusion
00:00:00.000 |
Today we will talk about deep reinforcement learning. 00:00:02.800 |
The question we would like to explore is to which degree 00:00:18.680 |
So let's take a step back and think of what is the full range of tasks 00:00:25.480 |
an artificial intelligence system needs to accomplish. 00:00:30.280 |
From top to bottom: at the top the input, at the bottom the output. 00:00:33.680 |
The environment at the top, the world that the agent is operating in. 00:00:41.360 |
Then come the sensors, taking in the world outside and converting it to raw data interpretable by machines. 00:00:49.080 |
And from that raw sensor data, you extract features. 00:00:58.080 |
such that you can input it, make sense of it, 00:01:05.280 |
And as we discussed, you form higher and higher order representations, 00:01:14.480 |
based on which the machine learning techniques can then be applied. 00:01:18.480 |
The machine learning techniques, as I mentioned, 00:01:27.480 |
convert the data into features, into higher-order representations, 00:01:31.680 |
and into simple, actionable, useful information. 00:01:34.480 |
We aggregate that information into knowledge. 00:01:37.680 |
We take the pieces of knowledge extracted from the data 00:01:57.280 |
to aggregate, to connect pieces of data seen in the recent past 00:02:03.680 |
or the distant past, to make sense of the world it's operating in. 00:02:07.680 |
And finally, to make a plan of how to act in that world based on its objectives, 00:02:14.480 |
As I mentioned, a simple but commonly accepted definition of intelligence 00:02:20.080 |
is a system that's able to accomplish complex goals. 00:02:24.880 |
So a system that's operating in an environment in this world 00:02:27.880 |
must have a goal, must have an objective function, a reward function. 00:02:32.080 |
And based on that, it forms a plan and takes action. 00:02:35.480 |
And because it operates in many cases in the physical world, 00:02:39.280 |
it must have tools, effectors with which it applies the actions 00:02:46.080 |
That's the full stack of an artificial intelligence system 00:03:00.080 |
What kind of task can an artificial intelligence system learn? 00:03:06.680 |
We will talk about the advancement of deep reinforcement learning approaches 00:03:12.680 |
and some of the fascinating ways it's able to take much of this stack 00:03:17.680 |
and treat it as an end-to-end learning problem. 00:03:21.480 |
But we look at games, we look at simple formalized worlds. 00:03:25.680 |
While these are still impressive, beautiful, and unprecedented accomplishments, 00:03:32.680 |
Can we then move beyond games and into expert tasks of medical diagnosis, 00:03:45.680 |
and finally the human level tasks of emotion, imagination. 00:03:52.880 |
Let's once again review the stack in practicality, 00:04:09.480 |
such as LIDAR, camera, radar, GPS, stereo cameras, audio microphone, 00:04:27.880 |
features are formed, representations are formed 00:04:30.280 |
and multiple higher and higher order representations. 00:04:34.880 |
Before neural networks, 00:04:37.880 |
before the recent successes of neural networks in going deeper 00:04:42.480 |
and therefore being able to form higher-order representations of the data. 00:04:56.280 |
the final layers of these networks are able to accomplish 00:04:59.680 |
the supervised learning tasks, the generative tasks 00:05:09.480 |
That's what we talked about a little in lecture one 00:05:19.680 |
And you can think about the output of those networks 00:05:22.480 |
as simple, clean, useful, valuable information. 00:05:27.480 |
And that knowledge can be in the form of single numbers. 00:05:33.280 |
It could be regression, continuous variables. 00:05:38.280 |
It could be images, audio, sentences, text, speech. 00:05:44.280 |
Once that knowledge is extracted and aggregated, 00:05:48.080 |
how do we connect it in multi-resolutional ways? 00:05:55.280 |
The trivial silly example is connecting images, 00:06:26.680 |
and making action, control and longer-term plans 00:06:35.080 |
are more and more amenable to the learning approach, 00:06:43.080 |
as compared with non-learning, optimization-based approaches. 00:06:45.280 |
Like with several of the guest speakers we have, 00:06:55.880 |
how much of the stack can be learned, end to end, 00:07:21.680 |
in machine learning over the past three decades has been. 00:07:30.880 |
the automated representation learning of deep learning, 00:07:49.680 |
So aggregating, forming higher representations, 00:07:56.680 |
and acting in this world from the raw sensory data. 00:08:06.480 |
with deep reinforcement learning on trivial tasks, 00:08:24.880 |
So today, let's talk about reinforcement learning. 00:09:11.680 |
And that's where reinforcement learning falls. 00:09:49.480 |
there is a temporal consistency to the world. 00:12:38.280 |
to state that does not directly have a reward. 00:13:58.080 |
into a way that's interpretable by the system. 00:14:04.480 |
The continuous problem of cart-pole balancing, 00:15:13.680 |
The state is the raw pixels of the real world, 00:16:17.880 |
that the agent represents the environment with, 00:16:40.880 |
and they're tasked with walking about this world, 00:16:50.080 |
and one square below that is a negative one. 00:17:18.280 |
to the maximum square with the maximum reward. 00:17:58.080 |
you can't control where you're going to end up, 00:18:05.080 |
an optimal action to take in every single state. 00:18:16.880 |
There's a high punishment for every single step we take. 00:18:23.480 |
The optimal policy is to take the shortest path, 00:18:41.280 |
Where there is some extra degree of wandering, 00:18:55.280 |
more and more wandering is allowed. 00:19:16.280 |
to stay on the board without ever reaching the destination. 00:19:37.880 |
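To make the effect of that per-step reward concrete, here is a minimal value-iteration sketch on a small gridworld. It assumes a deterministic, simplified version of the example (a 3x4 grid, a +1 terminal with a -1 terminal one square below it); the lecture's grid also has stochastic transitions, which are omitted here.

```python
import numpy as np

# Minimal, simplified value-iteration sketch of the gridworld example above.
# Assumptions: 3x4 grid, a +1 terminal square and a -1 terminal one square
# below it, deterministic moves, and a configurable per-step reward.

ROWS, COLS = 3, 4
TERMINALS = {(0, 3): +1.0, (1, 3): -1.0}      # goal square and the square below it
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def value_iteration(step_reward, gamma=1.0, iters=200):
    V = np.zeros((ROWS, COLS))
    for _ in range(iters):
        for r in range(ROWS):
            for c in range(COLS):
                if (r, c) in TERMINALS:
                    V[r, c] = TERMINALS[(r, c)]
                    continue
                best = -np.inf
                for dr, dc in ACTIONS:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < ROWS and 0 <= nc < COLS):
                        nr, nc = r, c          # bumping into the edge keeps you in place
                    best = max(best, step_reward + gamma * V[nr, nc])
                V[r, c] = best
    return V

# A strongly negative step reward makes the shortest path optimal; a mildly
# negative one allows wandering; a positive step reward makes staying on the
# board forever better than ever reaching the destination.
print(value_iteration(step_reward=-2.0))
print(value_iteration(step_reward=-0.04))
```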
the reward we're likely to receive in the future. 00:19:41.880 |
And the way we see the reward we're likely to receive, 00:20:00.080 |
the reward, the importance of the reward received. 00:20:05.880 |
is taking the sum of these rewards and maximizing it. 00:20:21.080 |
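The fragments above describe weighting future rewards by how far away they are and maximizing their sum. A minimal sketch of that objective, assuming the standard discounted-return formulation with a discount factor gamma (the exact formulation shown on the slide is not reproduced in the transcript):

```python
# Sketch of the discounted return the agent tries to maximize, assuming the
# standard formulation G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
# A gamma near 1 values distant rewards almost as much as immediate ones;
# a gamma near 0 makes the agent short-sighted.

def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):   # work backwards so each step adds one factor of gamma
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0]))  # a reward of 1 arriving two steps from now
```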
any policy to estimate the value of taking an action, 00:20:32.880 |
and use the Bellman equation here on the bottom, 00:20:35.280 |
to continuously update our estimate of how good, 00:20:44.680 |
this allows us to operate in a much larger state space, 00:22:18.480 |
But the better and better your estimate becomes, 00:22:23.680 |
So, usually we want to explore a lot in the beginning, 00:24:12.680 |
here, they're taking the raw pixels of the game. 00:24:49.080 |
Through the simple update of the Bellman equation. 00:25:12.280 |
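Putting those fragments together, here is a minimal tabular Q-learning sketch with epsilon-greedy exploration that decays over time: explore a lot in the beginning, exploit more and more as the estimates improve. The environment interface (reset() returns a state; step() returns next state, reward, done) and all hyperparameters are illustrative assumptions, not the lecture's exact code; DQN replaces the table below with a neural network.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch (off-policy: it estimates the value of the
# greedy policy while following an epsilon-greedy one). The env interface is a
# simplified, assumed placeholder.

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.99,
               eps_start=1.0, eps_end=0.05, eps_decay=0.995):
    Q = defaultdict(float)            # Q[(state, action)] -> estimated value
    eps = eps_start                   # explore a lot in the beginning...
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:
                a = random.choice(actions)                    # explore
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])   # exploit
            s2, r, done = env.step(a)
            # Bellman update: move the estimate toward reward + discounted
            # value of the best action available in the next state.
            target = r + (0.0 if done else gamma * max(Q[(s2, a_)] for a_ in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
        eps = max(eps_end, eps * eps_decay)   # ...and exploit more and more later
    return Q
```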
or neural networks help us memorize patterns, 00:27:34.880 |
the neural network learns with a loss function. 00:28:19.280 |
that are reachable based on the actions you can take. 00:29:08.080 |
And that's how we compute the two parts of the loss function. 00:29:11.280 |
And update the weights using backpropagation. 00:29:18.080 |
Backpropagation is how the network is trained. 00:29:33.880 |
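As a rough sketch of those two parts of the loss, here is the per-transition DQN objective in numpy. `q_net` and `target_net` are placeholder callables returning one Q-value per action; the use of a separate, slowly updated target network follows the standard DQN recipe and may not match every detail shown on the slide.

```python
import numpy as np

# Sketch of the DQN loss on a single transition (s, a, r, s2, done).
# q_net(s) and target_net(s) are assumed to return a vector with one Q-value
# per action; both names are placeholders, not the lecture's code.

def dqn_loss(q_net, target_net, s, a, r, s2, done, gamma=0.99):
    prediction = q_net(s)[a]                                      # part 1: Q(s, a) from the online network
    target = r if done else r + gamma * np.max(target_net(s2))    # part 2: Bellman target
    # The weights are then updated by backpropagating the gradient of this
    # squared error (autodiff is omitted in this numpy sketch).
    return (target - prediction) ** 2
```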
So, as the games are played through simulation, 00:29:51.080 |
by randomly sampling from the library of past experiences. 00:30:04.880 |
on the natural continuous evolution of the system. 00:31:56.880 |
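A minimal sketch of that experience-replay idea: transitions are stored as the game is played, and training batches are drawn uniformly at random from the memory instead of following the natural, highly correlated sequence of frames. The names and capacity below are illustrative.

```python
import random
from collections import deque

# Minimal experience-replay sketch: store transitions as the game is played,
# and train on random samples from this memory, which breaks the strong
# correlation between consecutive frames.

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)   # old experiences fall off the end

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```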
In unpredictable, difficult to understand ways. 00:32:47.680 |
that you only take an action every four steps. 00:32:52.480 |
as part of the temporal window to make decisions. 00:34:50.280 |
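A rough sketch of those two temporal tricks, following the commonly cited DQN setup of acting every four frames and stacking the last four frames into the state; the simplified env.step() interface and the exact numbers are assumptions for this sketch.

```python
from collections import deque
import numpy as np

# Frame skipping: repeat the chosen action for `skip` frames.
# Frame stacking: the state is the last `stack` frames, so it carries motion
# information. env.step(action) is assumed to return (frame, reward, done).

def step_with_skip(env, action, skip=4):
    total_reward, done, frame = 0.0, False, None
    for _ in range(skip):
        frame, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return frame, total_reward, done

class FrameStack:
    def __init__(self, stack=4):
        self.frames = deque(maxlen=stack)

    def reset(self, first_frame):
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return np.stack(self.frames)

    def add(self, frame):
        self.frames.append(frame)   # the oldest frame falls off automatically
        return np.stack(self.frames)
```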
That's something you'll see in DeepTraffic as well, 00:35:03.080 |
So, this algorithm has been able to accomplish, 00:35:27.080 |
that raw sensor information was used to create, 00:35:32.680 |
It makes sense of the physics of the world enough, 00:35:47.280 |
This DQN approach has been able to outperform, 00:36:06.680 |
of artificial intelligence in the last decade, 00:36:54.880 |
It's a very large number of possible positions to consider. 00:37:13.080 |
the community thought that this game was not solvable. 00:37:31.880 |
And I'll describe it in a little bit of detail, 00:38:16.080 |
And the quality of players that it is competing in, 00:38:23.280 |
to achieve a rating that's better than AlphaGo. 00:38:26.880 |
And better than the different variants of AlphaGo. 00:39:13.480 |
or to go deep in the positions you know are good, 00:39:22.680 |
quality of the choices you made leading to that position. 00:41:04.280 |
in the sense that first it outputs the probability of each possible move, 00:41:09.080 |
and it's also producing a probability of winning. 00:41:11.480 |
And there are a few ways to combine that information. 00:41:39.480 |
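One common way, in the spirit of the selection rule described in the AlphaGo papers, is to score each candidate move by its value estimate plus its policy prior, scaled down by how often the move has already been explored in the search tree. This is only a rough sketch; the names and the constant are illustrative.

```python
import math

# Rough sketch of combining the two network outputs inside tree search:
# Q[a] is the mean value (winning probability) seen so far for move a,
# P[a] is the policy network's prior probability for a, and N[a] is how many
# times a has been explored. c_puct and all names are illustrative.

def select_move(moves, Q, P, N, c_puct=1.0):
    total_visits = sum(N[a] for a in moves)
    def score(a):
        exploration = c_puct * P[a] * math.sqrt(total_visits + 1) / (1 + N[a])
        return Q[a] + exploration
    return max(moves, key=score)
```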
they updated the state-of-the-art architecture, 00:42:32.880 |
We applied deep reinforcement learning to that, 00:42:37.080 |
The goal is to achieve the highest average speed, 00:43:47.680 |
much faster than what's actually being visualized, 00:44:53.880 |
That one is controlled by the neural network. 00:45:14.880 |
but they don't have a purpose in their existence. 00:45:46.080 |
And when there's other cars that are going slow, 00:47:36.280 |
the output is the value of the different actions. 00:48:21.480 |
The brain is where the neural network is contained, 00:49:01.080 |
You have to achieve the highest average speed, 00:49:21.880 |
of achieving the average speed for all of them. 00:49:24.480 |
But the actions are taken in a greedy way for each. 00:49:28.680 |
It's very interesting what can be learned in this way. 00:49:32.080 |
Because these kinds of approaches are scalable, 00:49:49.480 |
Because they're fully greedy in their operation. 00:49:53.280 |
The number of networks that can concurrently operate, 00:50:04.680 |
The layers, the many layer types that can be added. 00:50:04.680 |
Here's a fully connected layer with ten neurons. 00:50:12.280 |
The activation functions, all of these things can be customized. 00:50:18.480 |
The final layer, a fully connected layer with 00:50:23.480 |
a regression output, giving the value of each of the five actions. 00:50:28.080 |
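DeepTraffic networks are defined in ConvNetJS (JavaScript) directly in the browser; purely to illustrate the layer structure described above, here is a rough Python analogue with a ten-neuron fully connected hidden layer and a five-output regression head. The input size is an arbitrary placeholder.

```python
import numpy as np

# Rough Python analogue of the layer structure described above, for
# illustration only: the real DeepTraffic network is defined in ConvNetJS
# (JavaScript) in the browser. The input size is a placeholder; the ten-neuron
# hidden layer and five-action regression output follow the description.

rng = np.random.default_rng(0)
NUM_INPUTS, HIDDEN, NUM_ACTIONS = 135, 10, 5   # 135 is an arbitrary placeholder
W1, b1 = rng.normal(scale=0.1, size=(NUM_INPUTS, HIDDEN)), np.zeros(HIDDEN)
W2, b2 = rng.normal(scale=0.1, size=(HIDDEN, NUM_ACTIONS)), np.zeros(NUM_ACTIONS)

def q_values(state):
    """Map a flattened state vector to one estimated value per action."""
    h = np.maximum(0.0, state @ W1 + b1)   # fully connected layer, ReLU activation
    return h @ W2 + b2                     # regression output: value of each action
```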
And there are a lot of more specific parameters. 00:50:45.680 |
The optimizer, the learning rate, momentum, batch size, 00:50:53.080 |
There's a big white button that says apply code that you press. 00:50:56.880 |
That kills all the work you've done up to this point. 00:51:00.880 |
You should be doing it only at the very beginning. 00:51:03.680 |
If you happen to leave your computer running, 00:51:07.480 |
in training for several days, as folks have done. 00:51:16.280 |
And the network state gets shipped to the main simulation from time to time. 00:51:24.280 |
is running the same network that's being trained. 00:51:33.280 |
It's constantly updating the network you see on the left. 00:51:36.480 |
So if the car, for the network that you're training, 00:51:49.680 |
Number of iterations is certainly an important parameter to control. 00:51:55.080 |
And the evaluation is something we've done a lot of work on since last year. 00:52:04.880 |
the incentive to submit the same code over and over again. 00:52:14.480 |
The method for evaluation is to collect the average speed over 10 runs. 00:52:30.880 |
And we take the median speed of the 500 runs. 00:52:44.680 |
That's just for you to feel better about your network. 00:52:46.880 |
That should produce a result that's very similar to the one we'll produce on the server. 00:52:57.280 |
we significantly reduce the influence of randomness. 00:53:01.880 |
the speed you get for the network you design should be very similar with every evaluation. 00:53:10.480 |
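A minimal sketch of that evaluation protocol, assuming it amounts to recording each run's average speed and reporting the median across many runs; `simulate_run` is a placeholder for the actual DeepTraffic simulation, and the run count just mirrors the number mentioned above.

```python
import statistics

# Sketch of the evaluation described above: run the simulation many times,
# record the average speed achieved in each run, and report the median so a
# single lucky or unlucky run doesn't dominate. `simulate_run` is a placeholder.

def evaluate(simulate_run, num_runs=500):
    average_speeds = [simulate_run() for _ in range(num_runs)]
    return statistics.median(average_speeds)
```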
If the network is huge and you want to switch computers, 00:53:14.480 |
It saves both the architecture of the network. 00:53:25.080 |
it's not saving any of the data you've already done. 00:53:29.280 |
You can't do transfer learning with JavaScript in the browser yet. 00:53:42.280 |
the weights are initialized randomly and will not do so well. 00:53:44.880 |
You can resubmit as often as you like and the highest score is what counts. 00:53:49.480 |
The coolest part is you can load your custom image, 00:53:52.480 |
specify colors and request the visualization. 00:54:17.680 |
request visualization because it's an expensive process. 00:54:32.680 |
And the details for those that truly want to win are in the arXiv paper. 00:54:32.680 |
that will come up throughout is whether these reinforcement learning approaches 00:54:43.280 |
are applicable at all, or rather whether action, planning, and control are amenable to learning. 00:54:59.880 |
because that would result in millions of crashes 00:55:08.480 |
Unless we're working, like we are with DeepCrash, on the RC car 00:55:18.880 |
It's an open question whether this is applicable. 00:55:23.080 |
and I bring up two companies because they're both guest speakers. 00:55:26.880 |
Deep RL is not involved in the most successful robots operating in the real world. 00:55:48.080 |
except with minimal addition on the perception side. 00:56:00.280 |
Deep learning is used a little bit in perception on top, 00:56:03.880 |
but most of the work is done from the sensors 00:56:07.280 |
and the optimization-based, the model-based approaches. 00:56:11.480 |
Trajectory generation and optimizing which trajectory is best to avoid collisions. 00:56:25.280 |
the unexpected local pockets of higher reward, 00:56:25.280 |
which arise in all of these situations when applied in the real world. 00:56:34.880 |
that's pretty short where the cats are ringing the bell 00:56:37.680 |
and they're learning that the ring of the bell 00:56:44.280 |
I urge you to think about how that can evolve over time in unexpected ways. 00:56:52.680 |
Where the final reward is in the form of food 00:57:05.080 |
For the artificial general intelligence course in two weeks, 00:57:10.280 |
It's how these reinforcement learning planning algorithms 00:57:23.080 |
how we can design reward functions that result in safe operation. 00:57:28.280 |
So I encourage you to come to the talk on Friday, 00:57:41.480 |
from Boston Dynamics to Ray Kurzweil and so on for AGI. 00:57:45.880 |
Now tomorrow, we'll talk about computer vision and SegFuse.