
MIT 6.S094: Deep Reinforcement Learning


Chapters

0:00 AI Pipeline from Sensors to Action
8:25 Reinforcement Learning
23:50 Deep Reinforcement Learning
36:00 AlphaGo
41:50 DeepTraffic
54:35 Conclusion


00:00:00.000 | Today we will talk about deep reinforcement learning.
00:00:02.800 | The question we would like to explore is to which degree
00:00:12.000 | we can teach systems to act,
00:00:15.000 | to perceive and act in this world from data.
00:00:18.680 | So let's take a step back and think of what is the full range of tasks
00:00:25.480 | an artificial intelligence system needs to accomplish.
00:00:28.280 | Here's the stack.
00:00:30.280 | From top to bottom, top the input, bottom output.
00:00:33.680 | The environment at the top, the world that the agent is operating in.
00:00:37.960 | Sensed by sensors,
00:00:41.360 | taking in the world outside and converting it to raw data interpretable by machines.
00:00:46.680 | Sensor data.
00:00:49.080 | And from that raw sensor data, you extract features.
00:00:53.480 | You extract structure from that data
00:00:58.080 | such that you can input it, make sense of it,
00:01:01.280 | discriminate, separate, understand the data.
00:01:05.280 | And as we discussed, you form higher and higher order representations,
00:01:12.280 | a hierarchy of representations
00:01:14.480 | based on which the machine learning techniques can then be applied.
00:01:18.480 | Once the machine learning techniques, the understanding, as I mentioned,
00:01:27.480 | converts the data into features, into higher order representations
00:01:31.680 | and into simple, actionable, useful information.
00:01:34.480 | We aggregate that information into knowledge.
00:01:37.680 | We take the pieces of knowledge extracted from the data
00:01:40.480 | through the machine learning techniques
00:01:43.080 | and build a taxonomy,
00:01:45.680 | a library of knowledge.
00:01:49.480 | And with that knowledge, we reason.
00:01:53.080 | An agent is tasked to reason
00:01:57.280 | to aggregate, to connect pieces of data seen in the recent past
00:02:03.680 | or the distant past, to make sense of the world it's operating in.
00:02:07.680 | And finally, to make a plan of how to act in that world based on its objectives,
00:02:12.680 | based on what it wants to accomplish.
00:02:14.480 | As I mentioned, a simple but commonly accepted definition of intelligence
00:02:20.080 | is a system that's able to accomplish complex goals.
00:02:24.880 | So a system that's operating in an environment in this world
00:02:27.880 | must have a goal, must have an objective function, a reward function.
00:02:32.080 | And based on that, it forms a plan and takes action.
00:02:35.480 | And because it operates in many cases in the physical world,
00:02:39.280 | it must have tools, effectors with which it applies the actions
00:02:44.680 | to change something about the world.
00:02:46.080 | That's the full stack of an artificial intelligence system
00:02:51.080 | that acts in the world.
00:02:54.080 | And the question is,
00:02:55.280 | what kind of task can such a system take on?
00:03:00.080 | What kind of task can an artificial intelligence system learn?
00:03:04.280 | As we understand AI today.
00:03:06.680 | We will talk about the advancement of deep reinforcement learning approaches
00:03:12.680 | and some of the fascinating ways it's able to take much of this stack
00:03:17.680 | and treat it as an end-to-end learning problem.
00:03:21.480 | But we look at games, we look at simple formalized worlds.
00:03:25.680 | While it's still impressive, beautiful and unprecedented accomplishments,
00:03:29.880 | it's nevertheless formal tasks.
00:03:32.680 | Can we then move beyond games and into expert tasks of medical diagnosis,
00:03:40.680 | of design and into natural language
00:03:45.680 | and finally the human level tasks of emotion, imagination.
00:03:50.280 | Consciousness.
00:03:52.880 | Let's once again review the stack in practicality,
00:04:00.080 | in the tools we have.
00:04:02.080 | The input for robots operating in the world,
00:04:06.480 | from cars to humanoid to drones,
00:04:09.480 | is LIDAR, camera, radar, GPS, stereo cameras, audio microphone,
00:04:15.480 | networking for communication
00:04:17.680 | and the various ways to measure kinematics.
00:04:20.880 | With IMU.
00:04:21.880 | The raw sensory data is then processed,
00:04:27.880 | features are formed, representations are formed
00:04:30.280 | and multiple higher and higher order representations.
00:04:33.280 | That's what deep learning gets us.
00:04:34.880 | Before neural networks, before the advent of,
00:04:37.880 | before the recent successes of neural networks to go deeper
00:04:42.480 | and therefore be able to form high order representations of the data.
00:04:46.280 | That was done by experts, by human experts.
00:04:49.480 | Today, networks are able to do that.
00:04:52.080 | That's the representation piece.
00:04:53.880 | And on top of the representation piece,
00:04:56.280 | the final layers of these networks are able to accomplish
00:04:59.680 | the supervised learning tasks, the generative tasks
00:05:02.880 | and the unsupervised clustering tasks.
00:05:06.680 | Through machine learning.
00:05:09.480 | That's what we talked about a little in lecture one
00:05:12.880 | and will continue tomorrow and Wednesday.
00:05:16.280 | That's supervised learning.
00:05:19.680 | And you can think about the output of those networks
00:05:22.480 | as simple, clean, useful, valuable information.
00:05:26.080 | That's the knowledge.
00:05:27.480 | And that knowledge can be in the form of single numbers.
00:05:33.280 | It could be regression, continuous variables.
00:05:36.680 | It could be a sequence of numbers.
00:05:38.280 | It could be images, audio, sentences, text, speech.
00:05:44.280 | Once that knowledge is extracted and aggregated,
00:05:48.080 | how do we connect it in multi-resolutional ways?
00:05:52.080 | Form hierarchies of ideas, connect ideas.
00:05:55.280 | The trivial silly example is connecting images,
00:06:00.480 | activity recognition and audio, for example.
00:06:03.280 | If it looks like a duck, quacks like a duck,
00:06:06.280 | and swims like a duck,
00:06:07.680 | we do not currently have approaches
00:06:10.680 | that effectively integrate this information
00:06:14.080 | to produce a higher confidence estimate
00:06:16.680 | that it is in fact a duck.
00:06:18.280 | And the planning piece,
00:06:19.880 | the task of taking the sensory information,
00:06:24.480 | fusing the sensory information,
00:06:26.680 | and making action, control and longer-term plans
00:06:30.880 | based on that information,
00:06:32.880 | as we'll discuss today,
00:06:35.080 | are more and more amenable to the learning approach,
00:06:39.080 | to the deep learning approach.
00:06:40.480 | But to date it has been most successful
00:06:43.080 | with non-learning, optimization-based approaches,
00:06:45.280 | like with several of the guest speakers we have,
00:06:47.680 | including the creator of this robot, Atlas,
00:06:51.280 | at Boston Dynamics.
00:06:52.880 | So the question,
00:06:55.880 | how much of the stack can be learned, end to end,
00:06:58.680 | from the input to the output?
00:07:00.280 | We know we can learn the representation,
00:07:02.680 | and the knowledge.
00:07:04.080 | From the representation to knowledge,
00:07:05.880 | even with kernel methods like SVMs,
00:07:09.280 | and certainly,
00:07:12.080 | with neural networks,
00:07:14.280 | mapping from representation to information,
00:07:18.080 | has been where the primary success
00:07:21.680 | in machine learning over the past three decades has been.
00:07:24.080 | Mapping from raw sensory data to knowledge,
00:07:28.480 | that's where the success,
00:07:30.880 | the automated representation learning of deep learning,
00:07:34.480 | has been a success.
00:07:36.280 | Going straight from raw data to knowledge.
00:07:38.680 | The open question for us,
00:07:41.480 | today and beyond,
00:07:43.080 | is if we can expand the red box there,
00:07:45.680 | of what can be learned end to end,
00:07:47.280 | from sensory data to reasoning.
00:07:49.680 | So aggregating, forming higher representations,
00:07:52.480 | of the extracted knowledge.
00:07:54.080 | And forming plans,
00:07:56.680 | and acting in this world from the raw sensory data.
00:07:59.280 | We will show the incredible fact,
00:08:02.080 | that we're able to
00:08:03.480 | learn exactly what's shown here, end to end,
00:08:06.480 | with deep reinforcement learning on trivial tasks,
00:08:10.280 | in a generalizable way.
00:08:11.680 | The question is whether that can,
00:08:13.680 | then move on to real world tasks,
00:08:16.280 | of autonomous vehicles,
00:08:18.080 | of humanoid robotics, and so on.
00:08:21.080 | That's the open question.
00:08:24.880 | So today, let's talk about reinforcement learning.
00:08:27.680 | There are three types of machine learning.
00:08:29.680 | Supervised
00:08:33.280 | and unsupervised
00:08:35.880 | are the categories at the extremes,
00:08:39.080 | relative to the amount of human input,
00:08:42.080 | that's required.
00:08:42.880 | For supervised learning,
00:08:44.080 | every piece of data,
00:08:45.480 | that's used for teaching these systems,
00:08:47.680 | is first labeled by human beings.
00:08:50.480 | And in unsupervised learning, on the right,
00:08:52.680 | no data is labeled by human beings.
00:08:55.880 | In between is some,
00:08:58.680 | sparse input from humans.
00:09:01.280 | Semi-supervised learning
00:09:03.080 | is when ground truth is provided by humans
00:09:06.280 | for only part of the data.
00:09:08.480 | And the rest,
00:09:09.280 | must be inferred, generalized by the system.
00:09:11.680 | And that's where reinforcement learning falls.
00:09:13.680 | Reinforcement learning,
00:09:15.680 | as shown there with the cats.
00:09:18.080 | As I said, every successful presentation,
00:09:21.480 | must include cats.
00:09:22.480 | They're supposed to be Pavlov's cats,
00:09:28.080 | ringing a bell,
00:09:29.280 | and every time they ring a bell,
00:09:30.680 | they're given food,
00:09:31.680 | and they learn this process.
00:09:33.080 | The goal of reinforcement learning,
00:09:37.280 | is to learn,
00:09:39.080 | from sparse reward data.
00:09:42.080 | To learn
00:09:43.080 | from sparse supervised data.
00:09:45.080 | And take advantage of the fact,
00:09:47.080 | that in simulation,
00:09:48.280 | or in the real world,
00:09:49.480 | there is a temporal consistency to the world.
00:09:52.080 | There is a,
00:09:53.280 | temporal dynamics that follows,
00:09:55.480 | from state to state,
00:09:56.480 | to state through time.
00:09:57.880 | And so you can propagate information,
00:10:00.280 | even if the information,
00:10:01.880 | that you received about,
00:10:03.280 | the supervision,
00:10:04.880 | the ground truth is sparse.
00:10:06.880 | You can follow that information,
00:10:08.280 | back through time,
00:10:09.480 | to infer,
00:10:10.680 | something about the reality,
00:10:12.280 | of what happened before then,
00:10:13.480 | even if your reward signals were weak.
00:10:16.080 | So it's using the fact,
00:10:17.880 | that the physical world,
00:10:18.880 | evolves through time
00:10:20.280 | in some
00:10:21.280 | sort of predictable way,
00:10:23.680 | to take sparse information,
00:10:26.880 | generalize it,
00:10:28.480 | over the entirety of the experience,
00:10:30.880 | that's being learned.
00:10:31.880 | So we apply this to two problems.
00:10:34.680 | Today we'll talk about deep traffic.
00:10:37.680 | As a methodology,
00:10:39.280 | as a way to introduce,
00:10:40.680 | deep reinforcement learning.
00:10:41.880 | So deep traffic is a competition,
00:10:44.080 | that we ran last year,
00:10:45.680 | and expanded significantly this year.
00:10:48.480 | And I'll talk about some of the details,
00:10:51.080 | and how the folks in this room can,
00:10:53.280 | on your smartphone today,
00:10:55.680 | or if you have a laptop,
00:10:57.080 | train an agent,
00:10:58.880 | while I'm talking.
00:11:00.080 | Training a neural network in the browser.
00:11:02.280 | Some of the things we've added:
00:11:04.080 | we've added the capability,
00:11:06.880 | we've now turned it into a multi-agent
00:11:08.680 | deep reinforcement learning problem.
00:11:10.280 | Where you can control up to,
00:11:11.880 | 10 cars within your own network.
00:11:13.880 | Perhaps less significant,
00:11:16.880 | but pretty cool,
00:11:18.480 | is the ability to customize,
00:11:20.680 | the way the agent looks.
00:11:22.280 | So you can upload,
00:11:24.280 | and people have,
00:11:25.480 | to an absurd degree,
00:11:26.880 | have already begun doing so,
00:11:28.280 | uploading different images,
00:11:30.080 | instead of the car that's shown there.
00:11:31.880 | As long as it maintains the dimensions,
00:11:34.880 | shown here is a SpaceX rocket.
00:11:36.880 | The competition is hosted on the website,
00:11:41.880 | selfdrivingcars.mit.edu/deeptraffic.
00:11:45.280 | We'll return to this later.
00:11:46.880 | The code is on GitHub,
00:11:50.480 | with some more information,
00:11:51.680 | a starter code,
00:11:52.680 | and a paper describing,
00:11:55.080 | some of the fundamental insights,
00:11:56.880 | that will help you win at this competition,
00:12:00.880 | is on arXiv.
00:12:04.280 | So, from supervised learning,
00:12:05.880 | in lecture one,
00:12:06.880 | to today.
00:12:08.080 | Supervised learning,
00:12:10.280 | we can think of as memorization,
00:12:12.680 | of ground-truth data,
00:12:15.480 | in order to form representations,
00:12:17.480 | that generalize from that ground truth.
00:12:19.680 | Reinforcement learning,
00:12:21.880 | is, we can think of,
00:12:24.080 | as a way to brute force,
00:12:25.480 | propagate that information,
00:12:27.080 | the sparse information,
00:12:28.880 | through time,
00:12:34.280 | to assign quality, reward,
00:12:38.280 | to states that do not directly have a reward.
00:12:42.280 | To make sense of this world,
00:12:44.480 | when the rewards are sparse,
00:12:47.280 | but are connected through time.
00:12:48.680 | You can think of that as reasoning.
00:12:51.280 | So, the connection through time,
00:12:56.080 | is modeled,
00:12:58.280 | in most reinforcement learning approaches,
00:13:01.080 | very simply,
00:13:03.480 | that there's an agent,
00:13:04.880 | taking an action in a state,
00:13:06.880 | and receiving a reward.
00:13:08.480 | And the agent operating in an environment,
00:13:11.080 | executes an action,
00:13:12.680 | receives an observed state,
00:13:14.480 | a new state,
00:13:15.480 | and receives a reward.
00:13:16.680 | This process continues over and over.
00:13:19.280 | In some examples,
00:13:23.880 | we can think of,
00:13:25.280 | any of the video games,
00:13:26.680 | some of which we'll talk about today,
00:13:28.480 | like Atari Breakout,
00:13:30.480 | as the environment,
00:13:32.280 | the agent,
00:13:33.480 | is the paddle.
00:13:34.480 | Each action,
00:13:37.680 | that the agent takes,
00:13:39.680 | has an influence,
00:13:40.880 | on the evolution,
00:13:42.080 | of the environment.
00:13:43.880 | And the success is measured,
00:13:46.080 | by some reward mechanism.
00:13:47.880 | In this case,
00:13:49.080 | points are given by the game.
00:13:50.680 | And every game,
00:13:52.280 | has a different point scheme,
00:13:55.480 | that must be converted,
00:13:57.080 | normalized,
00:13:58.080 | into a way that's interpretable by the system.
00:14:00.080 | And the goal is to maximize those points,
00:14:03.080 | maximize the reward.
00:14:04.480 | In the continuous problem of cart-pole balancing,
00:14:10.880 | the goal is to balance the pole,
00:14:12.080 | on top of a moving cart.
00:14:13.480 | The state is the angle,
00:14:15.680 | the angular speed,
00:14:17.080 | the position,
00:14:17.680 | the horizontal velocity.
00:14:19.080 | The actions are the horizontal force,
00:14:21.680 | applied to the cart.
00:14:22.680 | And the reward,
00:14:24.080 | is one at each time step,
00:14:25.480 | if the pole is still upright.
00:14:27.080 | All the,
00:14:31.280 | first-person shooters,
00:14:32.480 | the video games,
00:14:33.280 | and now StarCraft,
00:14:34.280 | the strategy games.
00:14:37.880 | In the case of the first-person shooter Doom,
00:14:42.880 | what is the goal?
00:14:44.080 | The environment is the game,
00:14:45.680 | the goal is to eliminate all opponents,
00:14:47.880 | the state is the raw game pixels coming in,
00:14:50.280 | the actions are moving up, down, left, right,
00:14:53.880 | and so on.
00:14:55.280 | And the reward is positive,
00:14:57.480 | when eliminating an opponent,
00:14:59.480 | and negative,
00:15:01.480 | when the agent is eliminated.
00:15:03.080 | Industrial robotics,
00:15:07.080 | bin packing with a robotic arm,
00:15:09.880 | the goal is to pick up a device from a box,
00:15:12.080 | and put it into a container.
00:15:13.680 | The state is the raw pixels of the real world,
00:15:16.280 | that the robot observes,
00:15:17.880 | the actions are the possible,
00:15:20.080 | actions of the robot,
00:15:21.480 | the different degrees of freedom,
00:15:22.680 | and moving through those degrees,
00:15:24.080 | moving the different actuators,
00:15:25.880 | to realize,
00:15:27.080 | the position of the arm.
00:15:28.880 | And the reward is positive,
00:15:30.280 | when placing a device successfully,
00:15:32.080 | and negative otherwise.
00:15:33.280 | Everything can be modeled in this way.
00:15:36.880 | Markov decision process.
00:15:39.280 | There's a state as zero,
00:15:41.280 | action A zero,
00:15:42.680 | and reward received.
00:15:44.280 | A new state is achieved.
00:15:46.280 | Again, action reward state,
00:15:48.080 | action reward state,
00:15:49.480 | until a terminal state is reached.
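
A minimal sketch of that agent-environment loop, in Python; the `env` and `agent` objects and their methods are hypothetical stand-ins rather than any particular library.

```python
# Sketch of the agent-environment loop described above (illustrative interfaces).

def run_episode(env, agent, max_steps=1000):
    state = env.reset()                                        # initial state s0
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                              # choose action a_t in state s_t
        next_state, reward, done = env.step(action)            # environment returns r_t, s_{t+1}
        agent.observe(state, action, reward, next_state, done) # let the agent learn from the transition
        total_reward += reward
        state = next_state
        if done:                                               # terminal state reached
            break
    return total_reward
```
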
00:15:52.480 | And the major components,
00:15:55.480 | of reinforcement learning,
00:15:57.080 | are a policy,
00:15:59.080 | some kind of plan,
00:16:00.080 | of what to do in every single state,
00:16:01.880 | what kind of action to perform.
00:16:03.880 | A value function,
00:16:05.680 | some kind of sense
00:16:09.080 | of what is a good state to be in,
00:16:11.480 | of what is a good action to take in a state.
00:16:13.880 | And sometimes a model,
00:16:17.880 | that the agent represents the environment with,
00:16:21.680 | some kind of sense,
00:16:23.680 | of the environment it's operating in,
00:16:25.280 | the dynamics of that environment,
00:16:26.880 | that's useful,
00:16:28.080 | for making decisions about actions.
00:16:30.080 | Let's take a trivial example.
00:16:32.680 | A grid world of 3 by 4,
00:16:37.680 | 12 squares,
00:16:39.280 | where you start at the bottom left,
00:16:40.880 | and you're tasked with walking about this world,
00:16:45.280 | to maximize reward.
00:16:47.080 | The reward at the top right is a plus one,
00:16:50.080 | and at one square below that is a negative one.
00:16:52.680 | And every step you take,
00:16:54.080 | is a punishment,
00:16:56.080 | or is a negative reward of 0.04.
00:16:59.080 | So what is the optimal policy in this world?
00:17:02.480 | Now when everything is deterministic,
00:17:06.280 | perhaps this is the policy.
00:17:08.880 | When you start at the bottom left,
00:17:11.680 | well, because every step hurts,
00:17:14.280 | every step has a negative reward,
00:17:16.080 | then you want to take the shortest path,
00:17:18.280 | to the maximum square with the maximum reward.
00:17:20.880 | When the state space is non-deterministic,
00:17:25.680 | as presented before,
00:17:28.080 | with a probability of 0.8,
00:17:30.480 | when you choose to go up,
00:17:32.080 | you go up,
00:17:33.080 | but with probability 0.1,
00:17:34.680 | you go left,
00:17:36.480 | and 0.1, you go right.
00:17:38.280 | Unfair.
00:17:39.680 | Again, much like life.
00:17:41.280 | That would be the optimal policy.
00:17:44.880 | What is the key observation here?
00:17:48.080 | That every single state in the space,
00:17:50.080 | must have a plan.
00:17:51.480 | Because,
00:17:54.280 | with the non-deterministic aspect
00:17:56.680 | of the control,
00:17:58.080 | you can't control where you're going to end up,
00:18:00.280 | so you must have a plan for every place.
00:18:02.280 | That's the policy.
00:18:03.680 | Having an action,
00:18:05.080 | an optimal action to take in every single state.
00:18:07.280 | Now suppose we change the reward structure,
00:18:10.480 | and for every step we take,
00:18:12.280 | there's a negative reward,
00:18:13.280 | a reward of negative two.
00:18:15.480 | So it really hurts.
00:18:16.880 | There's a high punishment for every single step we take.
00:18:19.280 | So no matter what,
00:18:20.880 | we always take the shortest path.
00:18:23.480 | The optimal policy is to take the shortest path,
00:18:25.880 | to the
00:18:26.480 | only spot on the board,
00:18:29.280 | that doesn't result in punishment.
00:18:32.080 | If we decrease the reward of each step,
00:18:36.880 | to negative 0.1,
00:18:38.680 | the policy changes.
00:18:41.280 | Where there is some extra degree of wandering,
00:18:45.880 | encouraged.
00:18:47.480 | And as we go further and further,
00:18:50.880 | in lowering the punishment as before,
00:18:53.080 | to negative 0.04,
00:18:55.280 | more wandering and more wandering is allowed.
00:18:57.880 | And when we finally turn the reward,
00:19:02.680 | into positive,
00:19:06.280 | so every step,
00:19:07.480 | every step increases the reward,
00:19:13.480 | then there's a significant incentive
00:19:16.280 | to stay on the board without ever reaching the destination.
00:19:22.480 | Kind of like college for a lot of people.
00:19:24.280 | So the value function,
00:19:29.880 | the way we think about,
00:19:31.480 | the value of a state,
00:19:33.280 | or the value of anything,
00:19:35.080 | in the environment,
00:19:37.880 | is the reward we're likely to receive in the future.
00:19:41.880 | And the way we see the reward we're likely to receive,
00:19:45.880 | is we discount,
00:19:48.080 | the future reward.
00:19:49.680 | Because we can't always count on it.
00:19:53.080 | Here the gamma,
00:19:54.680 | further and further out into the future,
00:19:57.480 | more and more discounting decreases
00:20:00.080 | the reward, the importance of the reward received.
00:20:03.480 | And the good strategy,
00:20:05.880 | is taking the sum of these rewards and maximizing it.
00:20:08.680 | Maximizing discounted future reward.
00:20:11.080 | That's what reinforcement learning,
00:20:13.080 | hopes to achieve.
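
Written out, the discounted future reward being maximized is the standard discounted return, with gamma the discount factor:

$$
R_t = r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k}, \qquad 0 \le \gamma \le 1.
$$
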
00:20:14.480 | And with Q-learning,
00:20:17.280 | we use,
00:20:21.080 | any policy to estimate the value of taking an action,
00:20:24.680 | in a state.
00:20:25.680 | So it's off-policy,
00:20:29.480 | forget the policy.
00:20:30.880 | We move about the world,
00:20:32.880 | and use the Bellman equation here on the bottom,
00:20:35.280 | to continuously update our estimate of how good,
00:20:38.680 | a certain action is in a certain state.
00:20:40.680 | So we don't need,
00:20:44.680 | this allows us to operate in a much larger state space,
00:20:47.480 | in a much larger action space.
00:20:48.880 | We move about this world,
00:20:50.680 | through simulation or in the real world,
00:20:52.480 | taking actions and updating our estimate,
00:20:55.080 | of how good certain actions are over time.
00:20:57.480 | The new value at the left
00:21:01.280 | is the updated value.
00:21:03.080 | The old value
00:21:04.080 | is the starting value for the equation.
00:21:05.880 | And we update that old estimate
00:21:08.280 | with the sum
00:21:09.680 | of the reward received
00:21:12.080 | by taking
00:21:13.880 | action A in state S,
00:21:18.880 | plus the maximum reward that's possible
00:21:22.280 | to be received in the following states,
00:21:24.680 | discounted.
00:21:26.680 | That update,
00:21:29.680 | is scaled by a learning rate.
00:21:31.480 | The higher the learning rate,
00:21:34.480 | the faster we learn,
00:21:36.280 | the more value we assign to new information.
00:21:39.280 | That's simple, that's it.
00:21:41.280 | That's Q-learning.
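
In symbols, that update is the standard Q-learning (Bellman) update, with alpha the learning rate and gamma the discount factor:

$$
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \Big[\, r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \,\Big].
$$
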
00:21:42.280 | The simple update rule,
00:21:43.880 | allows us to,
00:21:45.080 | to explore the world,
00:21:47.280 | and as we explore,
00:21:48.680 | get more and more information,
00:21:52.080 | about what's good to do in this world.
00:21:53.880 | And there's always a balance,
00:21:55.880 | in the various problem spaces we'll discuss,
00:21:58.280 | there's always a balance between,
00:22:00.080 | exploration and exploitation.
00:22:02.280 | As you form a better and better estimate,
00:22:06.480 | of the Q-function,
00:22:07.280 | of what actions are good to take,
00:22:08.880 | you start to get a sense,
00:22:10.880 | of what is the best action to take.
00:22:12.880 | But it's not a perfect sense,
00:22:15.080 | it's still an approximation.
00:22:16.480 | And so there's value of exploration.
00:22:18.480 | But the better and better your estimate becomes,
00:22:20.880 | the less and less exploration,
00:22:22.680 | has a benefit.
00:22:23.680 | So, usually we want to explore a lot in the beginning,
00:22:27.080 | and less and less so,
00:22:28.680 | towards the end.
00:22:29.680 | And when we finally release the system out,
00:22:32.280 | into the world,
00:22:33.280 | and wish it to operate its best,
00:22:35.280 | then we,
00:22:36.280 | have it operate,
00:22:37.880 | as a greedy system,
00:22:39.080 | always taking the optimal action,
00:22:40.480 | according to the Q-value function.
00:22:45.680 | And everything I'm talking about now,
00:22:47.680 | is parameterized,
00:22:49.080 | and these are parameters
00:22:50.880 | that are very important
00:22:53.280 | for winning the DeepTraffic competition.
00:22:55.680 | Which is using this very algorithm,
00:22:59.280 | with a neural network,
00:23:00.480 | at its core.
00:23:01.480 | So for a simple table representation,
00:23:06.080 | of a Q-function,
00:23:07.280 | where the Y-axis is state,
00:23:09.880 | four states,
00:23:10.680 | S1, S2, S3, S4.
00:23:12.280 | And the X-axis is
00:23:14.880 | actions, A1, A2, A3, A4.
00:23:17.480 | We can think of this table,
00:23:21.480 | as randomly initialized,
00:23:22.480 | or
00:23:23.480 | initialized
00:23:24.680 | in any kind of way
00:23:24.680 | that's not representative of actual reality.
00:23:27.680 | And as we move about this world,
00:23:29.280 | and we take actions,
00:23:30.480 | we update this table,
00:23:31.680 | with the Bellman equation,
00:23:32.880 | shown up top.
00:23:34.080 | And here, slides now are online,
00:23:36.480 | you can see a simple,
00:23:38.080 | pseudocode algorithm,
00:23:39.480 | of how to update it,
00:23:40.880 | of how to run,
00:23:41.880 | this Bellman equation.
00:23:44.280 | Over time,
00:23:45.480 | the approximation becomes the optimal,
00:23:48.080 | Q-table.
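
As a rough stand-in for that pseudocode, here is a minimal tabular Q-learning sketch in Python; the `env` interface is the same hypothetical one used earlier and the hyperparameter values are illustrative.

```python
import random
from collections import defaultdict

def tabular_q_learning(env, num_actions, episodes=5000,
                       alpha=0.1, gamma=0.9, epsilon=0.1):
    """Fill in a Q-table using the Bellman update shown up top."""
    Q = defaultdict(lambda: [0.0] * num_actions)      # arbitrary initialization
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
            if random.random() < epsilon:
                action = random.randrange(num_actions)
            else:
                action = max(range(num_actions), key=lambda a: Q[state][a])
            next_state, reward, done = env.step(action)
            # Move Q(s, a) toward r + gamma * max_a' Q(s', a').
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
    return Q
```
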
00:23:49.080 | The problem is
00:23:50.680 | that the Q-table
00:23:51.880 | becomes exponential in size.
00:23:54.080 | When we take in raw sensory information,
00:23:57.480 | as we do with cameras,
00:23:58.880 | with DeepCrash,
00:24:00.080 | or with DeepTraffic,
00:24:02.480 | it's taking the full grid space,
00:24:04.280 | and taking that information,
00:24:05.680 | the raw,
00:24:06.480 | the raw grid
00:24:08.680 | pixels of DeepTraffic.
00:24:10.480 | And when you take the arcade games,
00:24:12.680 | here, they're taking the raw pixels of the game.
00:24:15.280 | Or when we take Go,
00:24:18.680 | the game of Go,
00:24:19.680 | when it's taking the units,
00:24:21.080 | the board,
00:24:23.280 | the raw state of the board,
00:24:25.080 | as the input,
00:24:27.280 | the potential
00:24:29.280 | state space,
00:24:30.680 | the number of possible
00:24:32.280 | combinatorial variations of
00:24:34.080 | states that are possible,
00:24:37.080 | is extremely large.
00:24:38.280 | Larger than
00:24:39.280 | we can certainly hold in memory,
00:24:41.080 | and larger than we can
00:24:43.280 | ever be able to accurately approximate,
00:24:45.280 | through the Bellman equation,
00:24:46.680 | over time,
00:24:47.680 | through simulation.
00:24:49.080 | Through the simple update of the Bellman equation.
00:24:53.080 | So this is where,
00:24:54.880 | deep reinforcement learning comes in.
00:24:57.080 | Neural networks,
00:24:58.680 | are really good approximators.
00:25:00.680 | They're really good at exactly this task,
00:25:02.880 | of learning,
00:25:03.880 | this kind of Q-table.
00:25:05.680 | So we started with supervised learning,
00:25:12.280 | where neural networks help us memorize patterns
00:25:14.280 | using supervised
00:25:15.480 | ground-truth data,
00:25:17.280 | and we move to reinforcement learning,
00:25:19.080 | that hopes to propagate,
00:25:20.480 | outcomes to knowledge.
00:25:23.080 | Deep learning,
00:25:26.080 | allows us to do so,
00:25:27.480 | on much larger state spaces,
00:25:29.680 | and much larger
00:25:30.880 | action spaces.
00:25:32.880 | Which means,
00:25:34.680 | it's generalizable.
00:25:35.880 | It's much more capable to deal,
00:25:38.880 | with the raw,
00:25:40.280 | stuff,
00:25:41.880 | of sensory data.
00:25:43.480 | Which means it's much more capable,
00:25:45.880 | to deal with the broad variation,
00:25:47.680 | of real-world applications.
00:25:49.080 | And it does so,
00:25:55.480 | because it's able to,
00:25:56.880 | learn the representations,
00:25:58.880 | as we discussed,
00:25:59.880 | on Monday.
00:26:01.880 | The understanding comes,
00:26:06.480 | from converting,
00:26:08.280 | the raw sensory information,
00:26:10.080 | into simple, useful information,
00:26:13.080 | based on which,
00:26:14.280 | the action,
00:26:15.280 | in this particular state can be taken,
00:26:17.080 | in the same exact way.
00:26:18.480 | So instead of the Q-table,
00:26:20.480 | instead of this Q-function,
00:26:22.280 | we plug in a neural network,
00:26:23.680 | where the input is the state space,
00:26:25.880 | no matter how complex,
00:26:27.280 | and the output,
00:26:29.080 | is a value for each of the actions,
00:26:31.680 | that you could take.
00:26:32.680 | Input is the state,
00:26:36.080 | output is the,
00:26:37.480 | value of the function.
00:26:38.880 | It's simple.
00:26:40.880 | This is
00:26:42.080 | the Deep Q-Network, DQN.
00:26:45.080 | At the core
00:26:46.880 | of the success of DeepMind,
00:26:48.880 | a lot of the cool stuff you see,
00:26:50.280 | about video games,
00:26:52.280 | or variants of DQN are at play.
00:26:54.280 | This is where at first,
00:26:56.680 | with the Nature paper,
00:26:57.880 | DeepMind,
00:26:59.880 | the success came
00:27:02.680 | of playing the different games,
00:27:03.880 | including Atari
00:27:04.880 | games.
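
For concreteness, here is a sketch of that kind of network in PyTorch, simplified to fully connected layers (the Atari version used convolutions over raw pixels); the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per possible action."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),    # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)                 # shape: [batch, num_actions]

# Greedy use: pick the action with the highest predicted Q-value.
# q_net = QNetwork(state_dim=80, num_actions=5)
# action = q_net(state.unsqueeze(0)).argmax(dim=1).item()
```
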
00:27:07.680 | So, how are these things trained?
00:27:11.480 | Very similar,
00:27:13.480 | to supervised learning.
00:27:15.280 | The Bellman equation up top,
00:27:20.880 | takes the reward,
00:27:23.680 | and the discounted,
00:27:26.280 | expected reward,
00:27:27.880 | from future states.
00:27:29.080 | The loss function here,
00:27:33.880 | for neural network,
00:27:34.880 | the neural network learns with a loss function.
00:27:36.880 | It takes,
00:27:38.880 | reward received at the current state,
00:27:41.680 | does a forward pass,
00:27:43.680 | through a neural network,
00:27:44.680 | to estimate the value of the future state,
00:27:47.480 | of the best,
00:27:49.280 | action to take in the future state,
00:27:51.880 | and then subtracts that,
00:27:54.680 | from the forward pass,
00:27:57.680 | through the network,
00:27:59.280 | for the current state in action.
00:28:00.680 | So, you take the difference between,
00:28:04.280 | what your Q,
00:28:05.680 | estimator,
00:28:06.880 | the neural network,
00:28:07.880 | believes the value of the current state is,
00:28:12.280 | what,
00:28:13.080 | it more likely is to be,
00:28:15.280 | based on the value of the future states,
00:28:19.280 | that are reachable based on the actions you can take.
00:28:21.480 | Here's the algorithm.
00:28:27.680 | Input is the state,
00:28:30.680 | output is the Q value for each action,
00:28:33.480 | or in this diagram,
00:28:34.680 | input is the state and action,
00:28:36.280 | and the output is the Q value.
00:28:37.880 | It's very similar architectures.
00:28:40.480 | So, given a transition,
00:28:42.480 | of S,
00:28:43.480 | A, R,
00:28:46.080 | S': current state, taking an action,
00:28:49.680 | receiving a reward,
00:28:50.680 | and achieving state S'.
00:28:56.280 | The update:
00:28:58.880 | do a feed forward pass,
00:29:00.280 | through the network for the current state,
00:29:02.680 | do a feed forward pass for each of the,
00:29:05.280 | possible actions taken in the next state.
00:29:08.080 | And that's how we compute the two parts of the loss function.
00:29:11.280 | And update the weights using back propagation.
00:29:15.680 | Again, loss function,
00:29:18.080 | back propagation is how the network is trained.
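
A sketch of that loss and training step, continuing the PyTorch sketch above; `target_net` can be the same network, or the fixed copy discussed a bit later under the target-network trick. The batch layout is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared error between Q(s, a) and r + gamma * max_a' Q(s', a')."""
    # actions: LongTensor of action indices; dones: float tensor of 0/1 flags.
    states, actions, rewards, next_states, dones = batch

    # Forward pass for the current states; keep the Q-value of the action actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Forward pass for the next states; take the best achievable future value, discounted.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * max_next_q * (1.0 - dones)

    return F.mse_loss(q_sa, target)

# Per training step:
# loss = dqn_loss(q_net, target_net, batch)
# optimizer.zero_grad(); loss.backward(); optimizer.step()  # backpropagation updates the weights
```
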
00:29:20.080 | This has actually been around for
00:29:24.280 | much longer than,
00:29:25.280 | DeepMind.
00:29:26.480 | A few tricks
00:29:29.880 | made it really work.
00:29:31.880 | Experience replay is the biggest one.
00:29:33.880 | So, as the games are played through simulation,
00:29:38.880 | or if it's a physical system,
00:29:40.080 | as it acts in the world.
00:29:41.280 | It's actually
00:29:44.280 | collecting the observations
00:29:46.680 | into a library of experiences.
00:29:48.880 | And the training is performed
00:29:51.080 | by randomly sampling the library of the past,
00:29:53.880 | by randomly sampling
00:29:56.480 | the previous experiences
00:29:58.080 | and then
00:29:58.880 | sampling those experiences
00:30:01.080 | in batches.
00:30:01.880 | So, you're not always training,
00:30:04.880 | on the natural continuous evolution of the system.
00:30:07.680 | You're training on randomly picked batches,
00:30:10.280 | of those experiences.
00:30:11.480 | That's a huge one.
00:30:13.080 | It
00:30:15.080 | seems like a subtle trick,
00:30:16.280 | but it's a really important one.
00:30:17.480 | So, the system doesn't
00:30:20.480 | overfit
00:30:22.080 | a particular evolution
00:30:25.080 | of the game,
00:30:26.080 | of the simulation.
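
A minimal experience replay buffer along these lines might look like the following; the capacity and batch size are illustrative.

```python
import random
from collections import deque

class ReplayBuffer:
    """Library of past experiences; training samples random batches from it."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences fall off the end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks up the natural continuous evolution of the game,
        # so the network doesn't overfit to one particular trajectory.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```
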
00:30:28.880 | Another important,
00:30:30.880 | again, subtle trick,
00:30:33.680 | as in a lot of deep learning approaches,
00:30:35.880 | the subtle tricks make all the difference,
00:30:39.280 | fixing the target network.
00:30:41.280 | For the loss function,
00:30:43.680 | if you notice,
00:30:44.680 | you have to use the neural network,
00:30:47.480 | the single neural network,
00:30:48.880 | the DQN network,
00:30:50.280 | to estimate the value of the current state
00:30:52.880 | and action pair,
00:30:56.880 | and the next.
00:30:58.480 | So, you're using it,
00:30:59.680 | multiple times.
00:31:03.480 | And as you perform that operation,
00:31:05.880 | you're updating the network.
00:31:07.880 | Which means the target function,
00:31:09.880 | inside that loss function,
00:31:11.280 | is always changing.
00:31:12.480 | So,
00:31:13.480 | the very nature of your loss function,
00:31:15.880 | is changing all the time,
00:31:17.080 | as you're learning.
00:31:18.080 | And that's a big problem for stability.
00:31:20.480 | That can create big problems,
00:31:22.480 | to the learning process.
00:31:23.680 | So, this little trick,
00:31:25.280 | is to fix,
00:31:26.680 | the network,
00:31:27.480 | and only update it,
00:31:28.680 | every,
00:31:30.080 | say, thousand steps.
00:31:33.480 | So as you train the network,
00:31:35.480 | the network that's used,
00:31:38.080 | to compute the target function,
00:31:40.080 | inside the loss function, is fixed.
00:31:42.080 | It produces a more stable computation,
00:31:45.480 | of the loss function.
00:31:47.480 | The ground doesn't
00:31:48.880 | shift under you
00:31:50.280 | as you're trying to find a minimum
00:31:52.480 | for the loss function.
00:31:54.480 | The loss function doesn't change
00:31:56.880 | in unpredictable, difficult-to-understand ways.
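
In the same PyTorch style, fixing the target network amounts to keeping a frozen copy and refreshing it only occasionally; the sync interval here is illustrative.

```python
import copy
import torch.nn as nn

def make_target_net(q_net: nn.Module) -> nn.Module:
    """Frozen copy of the online Q-network, used only for the target term of the loss."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def maybe_sync_target(step: int, q_net: nn.Module, target_net: nn.Module,
                      sync_every: int = 1000):
    # Refresh the frozen copy only every `sync_every` steps (say, every thousand),
    # so the target inside the loss function doesn't shift at every update.
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```
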
00:32:00.680 | Reward clipping,
00:32:02.080 | which is
00:32:03.680 | always the case
00:32:04.880 | with general
00:32:05.880 | systems that are
00:32:08.080 | seeking to operate in a generalized way:
00:32:13.080 | for these various games,
00:32:14.680 | the points are different.
00:32:16.880 | Some, some points are low,
00:32:18.280 | some points are high,
00:32:19.280 | some go positive and negative.
00:32:20.880 | And they're all normalized,
00:32:22.280 | to a point where the good points,
00:32:24.480 | or the positive points,
00:32:26.080 | are a one,
00:32:27.080 | and negative points are a negative one.
00:32:29.880 | That's reward clipping.
00:32:31.480 | Simplify the reward structure.
00:32:33.280 | And, because a lot of the games are 30 FPS,
00:32:36.880 | or 60 FPS,
00:32:38.080 | and
00:32:41.480 | it's not valuable to take actions
00:32:43.880 | at such a high rate
00:32:45.280 | inside of these,
00:32:46.280 | particularly Atari games,
00:32:47.680 | you only take an action every four steps.
00:32:50.280 | While still taking in the frames,
00:32:52.480 | as part of the temporal window to make decisions.
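
Both tricks are small; a hedged sketch, with the same hypothetical `env` interface as before, could look like this.

```python
def clip_reward(reward):
    """Reward clipping: any positive score becomes +1, any negative score becomes -1."""
    return (reward > 0) - (reward < 0)            # sign of the reward: -1, 0, or +1

def step_with_action_repeat(env, action, repeat=4):
    """Apply the chosen action for `repeat` consecutive steps, accumulating clipped
    reward, while still collecting every frame for the temporal window."""
    frames, total, done = [], 0, False
    for _ in range(repeat):
        frame, reward, done = env.step(action)
        frames.append(frame)
        total += clip_reward(reward)
        if done:
            break
    return frames, total, done
```
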
00:32:55.480 | Tricks,
00:32:56.280 | but hopefully this gives you a sense
00:32:57.880 | of the kind of things necessary
00:33:01.680 | for both
00:33:02.880 | seminal papers like this one
00:33:05.880 | and for the more important accomplishment
00:33:08.280 | of winning DeepTraffic:
00:33:11.080 | the tricks make all the difference.
00:33:12.880 | Here on the bottom,
00:33:16.480 | the circle,
00:33:18.480 | is when the technique is used,
00:33:20.280 | and the X when it's not,
00:33:21.880 | looking at replay and target,
00:33:24.080 | that is, target network and experience replay.
00:33:26.280 | When both are used,
00:33:27.880 | for the game of breakout,
00:33:29.280 | river raid,
00:33:30.680 | sea quest,
00:33:31.480 | and space invaders.
00:33:32.680 | The higher the number,
00:33:34.080 | the better it is,
00:33:34.880 | the more points achieved.
00:33:36.280 | So
00:33:38.480 | it gives you a sense
00:33:39.680 | that replay and target
00:33:41.280 | both give significant improvements
00:33:43.680 | in the performance of the system.
00:33:45.280 | Order of magnitude improvements,
00:33:49.280 | two orders of magnitude for breakout.
00:33:52.480 | And here is,
00:33:53.480 | pseudocode,
00:33:55.680 | of implementing DQN,
00:33:57.280 | the learning.
00:33:58.280 | The key thing to notice,
00:34:01.880 | and you can look to the slides,
00:34:06.280 | the loop,
00:34:07.480 | the while loop,
00:34:08.480 | of playing through the games,
00:34:10.280 | and selecting the actions to play,
00:34:12.280 | is not part of the training.
00:34:15.080 | It's
00:34:15.680 | part of saving
00:34:17.280 | the observations,
00:34:19.880 | the
00:34:22.280 | state, action, reward,
00:34:24.480 | next-state observations,
00:34:26.280 | and saving them into replay memory,
00:34:28.080 | into that library.
00:34:29.280 | And then you sample randomly,
00:34:31.480 | from that replay memory,
00:34:33.080 | to then train the network,
00:34:34.680 | based on the loss function.
00:34:36.880 | And with the probability,
00:34:40.080 | up top,
00:34:41.880 | epsilon,
00:34:42.880 | you select a random action.
00:34:44.080 | That epsilon is,
00:34:45.880 | the probability of exploration,
00:34:49.080 | that decreases.
00:34:50.280 | That's something you'll see in deep traffic as well,
00:34:54.680 | the rate at which that,
00:34:56.080 | exploration decreases over time,
00:34:58.080 | through the training process.
00:34:59.280 | You want to explore a lot first,
00:35:00.880 | and less and less over time.
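
A simple sketch of that decaying epsilon-greedy exploration; the schedule and constants here are illustrative, not the values used in the paper or in DeepTraffic.

```python
import random

def epsilon_by_step(step, eps_start=1.0, eps_end=0.05, decay_steps=100_000):
    """Linearly anneal exploration: explore a lot at first, less and less over time."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)

def select_action(q_values, step, num_actions):
    # With probability epsilon take a random action (exploration),
    # otherwise take the action with the highest estimated Q-value (exploitation).
    if random.random() < epsilon_by_step(step):
        return random.randrange(num_actions)
    return max(range(num_actions), key=lambda a: q_values[a])
```
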
00:35:03.080 | So, this algorithm has been able to accomplish,
00:35:06.280 | in 2015,
00:35:08.080 | and since,
00:35:09.680 | a lot of incredible things.
00:35:11.280 | Things that made,
00:35:13.480 | the AI world,
00:35:15.480 | think that
00:35:19.080 | we're onto something.
00:35:20.480 | That,
00:35:22.680 | general AI is within reach.
00:35:24.680 | It's for the first time,
00:35:27.080 | that raw sensor information was used to create,
00:35:29.880 | a system that acts,
00:35:31.280 | and makes sense of the world.
00:35:32.680 | Makes sense of the physics of the world enough,
00:35:34.880 | to be able to succeed in it,
00:35:36.480 | from very little information.
00:35:38.080 | But these games are trivial.
00:35:40.280 | Even though,
00:35:44.280 | there is a lot of them.
00:35:47.280 | This DQN approach has been able to outperform
00:35:50.880 | humans on a lot of the Atari games.
00:35:52.480 | That's what's been reported on:
00:35:54.480 | outperforming human-level performance.
00:35:56.680 | But again, these games are trivial.
00:35:59.280 | What I think,
00:36:01.680 | and perhaps biased,
00:36:03.480 | I'm biased,
00:36:04.680 | but one of the greatest accomplishments,
00:36:06.680 | of artificial intelligence in the last decade,
00:36:09.080 | at least from the philosophical,
00:36:12.680 | or the research perspective,
00:36:16.480 | AlphaGo Zero.
00:36:18.880 | First AlphaGo,
00:36:20.680 | and then AlphaGo Zero.
00:36:22.080 | Is deep mind system,
00:36:25.680 | that beat the best in the world,
00:36:28.080 | in the game of Go.
00:36:29.080 | So what's the game of Go?
00:36:31.080 | It's simple.
00:36:33.080 | I won't get into the rules,
00:36:36.080 | but basically it's a 19 by 19 board,
00:36:39.280 | showing on the bottom of the slide,
00:36:42.080 | for the bottom row of the table,
00:36:45.480 | for a board of 19 by 19,
00:36:47.480 | the number of legal game positions,
00:36:51.680 | is 2 times 10 to the power of 170.
00:36:54.880 | It's a very large number of possible positions to consider.
00:36:59.280 | At any one time,
00:37:01.080 | especially as the game evolves,
00:37:02.880 | the number of possible moves is huge.
00:37:05.480 | Much larger than in chess.
00:37:08.480 | So that's why,
00:37:13.080 | the community thought that this game is not solvable.
00:37:15.680 | Until 2016,
00:37:19.880 | when AlphaGo,
00:37:22.080 | used human expert position play,
00:37:26.080 | to seed in a supervised way,
00:37:29.680 | reinforcement learning approach.
00:37:31.880 | And I'll describe it in a little bit of detail,
00:37:34.680 | in a couple of slides here,
00:37:36.680 | to beat the best in the world.
00:37:42.280 | And then AlphaGo Zero,
00:37:44.080 | that is,
00:37:45.480 | the accomplishment of the decade,
00:37:48.480 | for me, in AI,
00:37:50.680 | is being able to play
00:37:53.480 | with no
00:37:56.280 | training data on human expert
00:37:59.880 | games.
00:38:02.280 | And beat the best in the world,
00:38:05.080 | in an extremely complex game.
00:38:06.880 | This is not Atari.
00:38:08.080 | This is
00:38:09.480 | a
00:38:11.480 | much
00:38:12.280 | higher-order,
00:38:13.480 | more difficult game.
00:38:16.080 | And the quality of the players it's competing against
00:38:19.480 | is much higher.
00:38:20.480 | And it's able to extremely quickly here,
00:38:23.280 | to achieve a rating that's better than AlphaGo.
00:38:26.880 | And better than the different variants of AlphaGo.
00:38:30.880 | And certainly better than the,
00:38:32.680 | the best of the human players.
00:38:34.280 | In 21 days,
00:38:35.880 | of self play.
00:38:37.680 | So how does it work?
00:38:40.680 | All of these approaches,
00:38:41.880 | much, much like the previous ones,
00:38:44.680 | the traditional ones
00:38:45.880 | that are not based on deep learning,
00:38:47.880 | are using Monte Carlo Tree Search, MCTS.
00:38:52.880 | Which is,
00:38:55.280 | when you have such a large state space,
00:38:58.480 | you start at a board,
00:38:59.680 | and you play,
00:39:01.080 | and you choose moves,
00:39:04.080 | with some
00:39:06.480 | exploitation-exploration
00:39:08.280 | balancing,
00:39:10.480 | choosing to explore totally new positions,
00:39:13.480 | or to go deep in the positions you know are good,
00:39:15.880 | until the bottom of the game is reached,
00:39:17.880 | until the final state is reached.
00:39:20.080 | And then you backpropagate
00:39:22.680 | the quality of the choices you made leading to that position.
00:39:26.280 | And in that way, you learn the value
00:39:29.680 | of board positions and play.
00:39:32.680 | That's been used by the most successful,
00:39:35.680 | Go playing,
00:39:36.680 | engines before,
00:39:39.080 | and AlphaGo since.
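
As a rough illustration of the selection and backpropagation steps at the heart of MCTS (a generic UCB1-style sketch, not DeepMind's exact formulation):

```python
import math

class Node:
    """One board position in the search tree."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}              # action -> Node
        self.visits, self.value = 0, 0.0

def select_child(node, c=1.4):
    """Balance exploiting moves known to be good and exploring rarely tried ones."""
    def score(child):
        exploit = child.value / (child.visits + 1e-8)
        explore = c * math.sqrt(math.log(node.visits + 1) / (child.visits + 1e-8))
        return exploit + explore
    return max(node.children.values(), key=score)

def backpropagate(node, outcome):
    """Push the result of a finished playout back up the path of choices that led to it."""
    while node is not None:
        node.visits += 1
        node.value += outcome
        outcome = -outcome              # flip perspective for the opponent's moves
        node = node.parent
```
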
00:39:40.280 | But you might be able to guess,
00:39:43.080 | what's the difference,
00:39:44.480 | with AlphaGo versus the previous approaches.
00:39:46.680 | They use the neural network,
00:39:49.680 | as the,
00:39:51.680 | intuition,
00:39:53.480 | quote-unquote,
00:39:54.480 | to what are the good states,
00:39:56.480 | what are the good next,
00:39:58.480 | board positions to explore.
00:40:00.680 | And the key things,
00:40:07.280 | again, the tricks make all the difference,
00:40:09.880 | that made AlphaGo Zero
00:40:12.480 | work,
00:40:14.080 | and work much better than AlphaGo,
00:40:16.280 | are, first,
00:40:17.280 | that there was no expert play.
00:40:19.280 | Instead of human games,
00:40:21.480 | AlphaGo Zero
00:40:24.680 | used,
00:40:25.880 | that very same,
00:40:27.480 | Monte Carlo tree search algorithm,
00:40:30.280 | MCTS,
00:40:31.680 | to do an intelligent look ahead,
00:40:33.280 | based on the neural network prediction,
00:40:36.280 | of what are the good states to take,
00:40:38.080 | it checked that,
00:40:40.480 | instead of human expert play,
00:40:42.080 | it checked,
00:40:43.080 | how good indeed are those,
00:40:44.880 | states.
00:40:46.080 | It's a simple look ahead action,
00:40:48.680 | that does,
00:40:50.280 | the ground truth,
00:40:51.280 | that does the target,
00:40:52.880 | correction,
00:40:53.880 | that produces the loss function.
00:40:55.480 | The second part is the multitask learning,
00:40:58.080 | or what's now called multitask learning,
00:41:00.280 | is that the network is
00:41:01.480 | quote-unquote two-headed,
00:41:04.280 | in the sense that first it outputs the probability,
00:41:06.680 | of which move to take,
00:41:07.880 | the obvious thing,
00:41:09.080 | and it's also producing a probability of winning.
00:41:11.480 | And there's a few ways to combine that information,
00:41:15.080 | and continuously train,
00:41:16.880 | both parts of the network,
00:41:19.080 | depending on the choice taken.
00:41:20.880 | So you want to take the best choice,
00:41:22.880 | in the short term,
00:41:24.080 | and achieve the positions,
00:41:26.280 | that have a high likelihood of winning
00:41:28.680 | for the player
00:41:29.880 | whose turn it is.
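
In the spirit of that description, a two-headed policy/value network might be sketched as follows in PyTorch; the small convolutional trunk stands in for the residual tower, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class PolicyValueNet(nn.Module):
    """Shared trunk with two heads: move probabilities and probability of winning."""
    def __init__(self, board_planes: int, board_size: int, num_moves: int, channels: int = 64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(board_planes, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        flat = channels * board_size * board_size
        self.policy_head = nn.Linear(flat, num_moves)                    # which move to take (logits)
        self.value_head = nn.Sequential(nn.Linear(flat, 1), nn.Tanh())   # expected outcome in [-1, 1]

    def forward(self, board: torch.Tensor):
        x = self.trunk(board).flatten(1)
        return self.policy_head(x), self.value_head(x)
```
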
00:41:33.480 | Another big step
00:41:35.080 | is that they updated,
00:41:38.080 | from 2015,
00:41:39.480 | to the state-of-the-art architecture,
00:41:41.680 | which is now
00:41:42.880 | the architecture that won ImageNet:
00:41:45.080 | residual networks,
00:41:46.680 | ResNets,
00:41:47.680 | for ImageNet.
00:41:48.880 | Those, that's it.
00:41:50.880 | And those little changes,
00:41:52.680 | made all the difference.
00:41:54.280 | So that takes us to deep traffic,
00:41:57.480 | and the 8 billion hours stuck in traffic.
00:42:00.080 | America's pastime,
00:42:03.680 | so we tried to simulate,
00:42:05.080 | driving,
00:42:07.280 | the behavioral layer of driving.
00:42:09.480 | So not the immediate control,
00:42:11.880 | not the motion planning,
00:42:13.680 | but beyond that, on top,
00:42:15.680 | on top of those control decisions,
00:42:18.280 | the human,
00:42:20.080 | interpretable decisions of changing lane,
00:42:22.080 | of speeding up, slowing down.
00:42:23.480 | Modeling that,
00:42:24.680 | in a micro traffic simulation framework,
00:42:27.480 | that's popular in traffic engineering,
00:42:29.280 | the kind of shown here.
00:42:32.880 | We applied deep reinforcement learning to that,
00:42:35.080 | we call it deep traffic.
00:42:37.080 | The goal is to achieve the highest average speed,
00:42:40.480 | over a long period of time,
00:42:41.880 | weaving in and out of traffic.
00:42:44.080 | For students here,
00:42:46.080 | the requirement is to follow the tutorial,
00:42:48.880 | and achieve a speed of 65 miles an hour.
00:42:54.680 | But if you really want to compete,
00:42:56.280 | you need to achieve a speed
00:42:57.680 | over 70 miles an hour,
00:42:59.280 | which is what's required to win.
00:43:01.080 | And perhaps upload your own image,
00:43:05.080 | to make sure you look good doing it.
00:43:07.680 | What you should do,
00:43:10.880 | clear instructions,
00:43:12.480 | to compete, read the tutorial.
00:43:14.480 | You can change parameters in the code box,
00:43:19.080 | on that website,
00:43:20.280 | cars.mit.edu/deeptraffic.
00:43:22.880 | Click the white button that says apply code,
00:43:25.680 | which applies the code that you write.
00:43:27.680 | These are the parameters,
00:43:28.880 | that you specify for the neural network.
00:43:31.480 | It applies those parameters,
00:43:33.480 | creates the architecture that you specify.
00:43:35.680 | And now you have,
00:43:37.080 | a network written in JavaScript,
00:43:39.080 | living in the browser, ready to be trained.
00:43:40.880 | Then you click,
00:43:42.480 | the blue button that says run training.
00:43:45.080 | And that trains the network,
00:43:47.680 | much faster than what's actually being visualized,
00:43:51.680 | in the browser.
00:43:52.680 | A thousand times faster,
00:43:55.080 | by evolving the game,
00:43:56.680 | making decisions,
00:43:57.680 | taking in the grid space,
00:43:59.080 | I'll talk about here in a second.
00:44:00.680 | The speed limit is 80 miles an hour.
00:44:03.080 | Based on the various adjustments,
00:44:05.480 | that we made to the game,
00:44:06.680 | reaching 80 miles an hour,
00:44:08.480 | is certainly impossible,
00:44:09.880 | on average.
00:44:11.280 | And reaching some of the speeds,
00:44:13.080 | that we've achieved last year,
00:44:14.680 | is much, much, much more difficult.
00:44:17.280 | Finally, when you're happy,
00:44:20.080 | and the training is done,
00:44:21.480 | submit the model to competition.
00:44:26.480 | For those super eager,
00:44:28.280 | dedicated students,
00:44:29.480 | you can do so every five minutes.
00:44:31.280 | And to visualize your submission,
00:44:35.080 | you can click,
00:44:37.880 | the request visualization,
00:44:40.080 | specifying the custom image,
00:44:41.680 | and the color.
00:44:42.480 | Okay, so here's the simulation.
00:44:47.080 | Speed limit 80 miles an hour,
00:44:48.880 | cars, 20 on the screen.
00:44:51.680 | One of them is a red one in this case.
00:44:53.880 | That one is controlled by the neural network.
00:44:56.680 | It's allowed the actions
00:44:58.680 | to speed up, slow down,
00:45:00.280 | change lanes, left, right,
00:45:03.280 | or stay exactly the same.
00:45:05.280 | The other cars,
00:45:10.080 | are pretty dumb.
00:45:11.880 | They speed up, slow down, turn left, right,
00:45:14.880 | but they don't have a purpose in their existence.
00:45:17.280 | They do so randomly.
00:45:18.880 | Or at least purpose has not been discovered.
00:45:22.680 | The road, the car, the speed.
00:45:25.480 | The road is a grid space.
00:45:27.280 | An occupancy grid that specifies,
00:45:30.480 | when it's empty,
00:45:32.680 | it's set to,
00:45:36.480 | Meaning,
00:45:37.680 | that,
00:45:38.880 | the grid value,
00:45:42.080 | is whatever speed is achievable,
00:45:44.480 | if you were inside that grid.
00:45:46.080 | And when there's other cars that are going slow,
00:45:49.080 | the value in that grid,
00:45:50.280 | is the speed of that car.
00:45:52.080 | That's the state space,
00:45:53.480 | that's the state representation.
00:45:55.280 | And you can choose how much,
00:45:56.880 | what slice that state space you take in.
00:45:59.280 | That's the input to the neural network.
00:46:01.080 | For visualization purposes,
00:46:07.280 | you can choose,
00:46:08.080 | normal speed or fast speed,
00:46:10.080 | for watching,
00:46:11.080 | the network operate.
00:46:12.880 | And there's display options,
00:46:16.480 | to help you build intuition,
00:46:18.080 | about what the network takes in,
00:46:19.480 | and what space the car is operating in.
00:46:21.680 | The default,
00:46:22.880 | is no extra information is added.
00:46:25.280 | Then there's the,
00:46:26.280 | learning input,
00:46:27.480 | which visualizes exactly,
00:46:29.280 | which part of the road
00:46:30.880 | serves as the input to the network.
00:46:33.480 | Then there is the,
00:46:34.680 | safety system,
00:46:36.280 | which I'll describe in a little bit,
00:46:37.880 | which is all the parts of the road,
00:46:39.880 | the car is not allowed to go into,
00:46:41.680 | because it would result in a collision.
00:46:43.480 | And that, in JavaScript,
00:46:44.680 | would be very difficult to animate.
00:46:46.080 | And the full map.
00:46:48.480 | Here's a safety system.
00:46:51.480 | You could think of this system,
00:46:52.680 | as ACC,
00:46:54.880 | basic radar ultrasonic sensors,
00:46:57.680 | helping you avoid the obvious,
00:46:59.680 | collisions to,
00:47:00.880 | obviously detectable objects around you.
00:47:03.280 | And the task for this red car,
00:47:05.080 | for this neural network,
00:47:06.080 | is to move about
00:47:07.480 | this space,
00:47:08.880 | to move about the space
00:47:11.880 | under the constraints of the safety system.
00:47:14.480 | The red shows all the parts of the grid,
00:47:18.480 | it's not able to move into.
00:47:21.880 | So the goal for the car,
00:47:22.880 | is to not get stuck in traffic,
00:47:24.880 | to make big sweeping motions
00:47:28.080 | to avoid crowds of cars.
00:47:30.480 | The input,
00:47:33.880 | like DQN,
00:47:35.080 | is the state space,
00:47:36.280 | the output is the value of the different actions.
00:47:38.880 | And based on the epsilon parameter,
00:47:41.880 | through training and through,
00:47:44.280 | inference evaluation process,
00:47:46.880 | you choose,
00:47:48.480 | how much exploration you want to do.
00:47:50.280 | These are all parameters.
00:47:52.480 | The learning is done in the browser,
00:47:54.480 | on your own computer,
00:47:56.480 | utilizing only the CPU.
00:48:00.680 | The action space,
00:48:03.680 | there are five actions.
00:48:04.680 | I'm giving you some of the variables here;
00:48:07.080 | perhaps you can go back to the slides
00:48:08.680 | to look at it.
00:48:09.480 | The brain,
00:48:10.680 | quote-unquote,
00:48:11.680 | is the thing that takes in,
00:48:14.080 | the state,
00:48:15.080 | and the reward,
00:48:16.880 | takes a forward pass through the state,
00:48:19.280 | and produces the next action.
00:48:21.480 | The brain is where the neural network is contained,
00:48:24.280 | both for the training and the evaluation.
00:48:26.480 | The learning input,
00:48:28.880 | can be controlled in width,
00:48:30.880 | forward length,
00:48:33.080 | and backward length.
00:48:34.280 | Lane side,
00:48:35.280 | number of lanes to the side that you see,
00:48:37.480 | patches ahead,
00:48:38.680 | is the patches ahead that you see,
00:48:40.480 | patches behind,
00:48:41.480 | is patches behind that you see.
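
One plausible way to reason about the size of that learning input, assuming one value per grid patch; the function and formula below are illustrative, not the actual JavaScript in the DeepTraffic code box.

```python
def learning_input_size(lanes_side: int, patches_ahead: int, patches_behind: int,
                        temporal_window: int = 0) -> int:
    """Rough size of the grid slice fed to the network: (own lane plus lanes to each
    side) times (patches ahead plus patches behind), stacked over past frames."""
    patches = (2 * lanes_side + 1) * (patches_ahead + patches_behind)
    return patches * (temporal_window + 1)

# Example: 3 lanes to each side, 30 patches ahead, 10 behind, no temporal stacking:
# learning_input_size(3, 30, 10)  ->  7 * 40 = 280 inputs
```
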
00:48:43.080 | New this year,
00:48:45.680 | you can control the number of agents
00:48:48.680 | that are controlled by the neural network.
00:48:52.080 | Anywhere from one,
00:48:53.680 | to ten.
00:48:55.080 | And the evaluation,
00:48:59.480 | is performed exactly the same way.
00:49:01.080 | You have to achieve the highest average speed,
00:49:03.880 | for the agents.
00:49:04.680 | The very critical thing here is,
00:49:08.280 | the agents are not aware of each other.
00:49:11.080 | So they're not jointly planning.
00:49:15.280 | The network is trained,
00:49:18.480 | under the,
00:49:20.080 | joint objective,
00:49:21.880 | of achieving the average speed for all of them.
00:49:24.480 | But the actions are taken in a greedy way for each.
00:49:28.680 | It's very interesting what can be learned in this way.
00:49:32.080 | Because these kinds of approaches are scalable
00:49:35.480 | to an arbitrary number of cars.
00:49:37.080 | And you can imagine us plopping down,
00:49:40.080 | the best cars from this class together.
00:49:43.280 | And having them compete,
00:49:45.480 | in this way.
00:49:46.880 | The best neural networks.
00:49:49.480 | Because they're full in their greedy operation.
00:49:53.280 | The number of networks that can concurrently operate,
00:49:57.080 | is fully scalable.
00:49:58.680 | There's a lot of parameters:
00:50:01.280 | the temporal window,
00:50:04.680 | the layers, the many layer types that can be added.
00:50:09.880 | Here's a fully connected layer with ten neurons.
00:50:12.280 | The activation functions, all of these things can be customized,
00:50:15.480 | as specified in the tutorial.
00:50:18.480 | The final layer is a fully connected layer with
00:50:21.680 | an output of five,
00:50:23.480 | a regression giving the value of each of the five actions.
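As a sketch of how those layers are specified, here is a ConvNetJS-style layer definition in the spirit of what the tutorial describes: an input layer, a fully connected layer with ten neurons, and a final regression layer of size five giving the value of each action. It assumes num_inputs and num_actions as sketched earlier, and ignores the extra inputs that the temporal window adds.

```javascript
var layer_defs = [];

// Input layer sized to the flattened state the network sees.
layer_defs.push({ type: 'input', out_sx: 1, out_sy: 1, out_depth: num_inputs });

// Fully connected hidden layer with ten neurons and a ReLU activation.
layer_defs.push({ type: 'fc', num_neurons: 10, activation: 'relu' });

// Final layer: regression over the five actions, giving the value of each.
layer_defs.push({ type: 'regression', num_neurons: num_actions });
```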
00:50:28.080 | And there are a lot of more specific parameters,
00:50:31.280 | some of which I've discussed:
00:50:32.680 | from gamma, to epsilon,
00:50:37.080 | to experience replay size,
00:50:40.680 | to learning rate and temporal window.
00:50:45.680 | The optimizer, the learning rate, momentum, batch size,
00:50:49.680 | L2 and L1 decay for regularization, and so on.
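These hyperparameters map onto option objects passed to the ConvNetJS deep Q-learning "brain". A minimal sketch, with field names as I recall them from the deepqlearn API and values purely as examples:

```javascript
// Optimizer settings: learning rate, momentum, batch size, L2/L1 decay for regularization.
var tdtrainer_options = {
  learning_rate: 0.001,
  momentum: 0.0,
  batch_size: 64,
  l2_decay: 0.01,
  l1_decay: 0.0
};

// Q-learning settings: temporal window, replay buffer size, discount factor, exploration.
var opt = {
  temporal_window: 3,        // how many past frames are folded into the state
  experience_size: 3000,     // experience replay buffer size
  gamma: 0.7,                // discount factor
  epsilon_min: 0.05,         // exploration floor at the end of training
  epsilon_test_time: 0.0,    // no exploration during evaluation
  layer_defs: layer_defs,
  tdtrainer_options: tdtrainer_options
};

// The "brain" wraps the network for both training and evaluation.
brain = new deepqlearn.Brain(num_inputs, num_actions, opt);
```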
00:50:53.080 | There's a big white button that says "apply code" that you press.
00:50:56.880 | That kills all the work you've done up to this point,
00:50:59.680 | so be careful doing it.
00:51:00.880 | You should be doing it only at the very beginning,
00:51:03.680 | not after you've left your computer running
00:51:07.480 | in training for several days, as folks have done.
00:51:10.480 | The blue training button, you press,
00:51:13.880 | and it trains based on the parameters
00:51:15.480 | you specify.
00:51:16.280 | And the network state gets shipped to the main simulation from time to time.
00:51:20.880 | So the thing you see in the browser,
00:51:22.880 | as you open up the website,
00:51:24.280 | is running the same network that's being trained.
00:51:27.080 | And regularly it updates that network.
00:51:29.680 | So it's getting better and better.
00:51:31.080 | Even if the training takes weeks for you,
00:51:33.280 | it's constantly updating the network you see on the left.
00:51:36.480 | So if the car, for the network that you're training,
00:51:39.680 | is just standing in place and not moving,
00:51:41.880 | it's probably
00:51:44.880 | time to restart and change the parameters,
00:51:47.480 | maybe add a few layers to your network.
00:51:49.680 | The number of iterations is certainly an important parameter to control.
00:51:55.080 | And the evaluation is something we've done a lot of work on since last year,
00:52:00.680 | to remove the degree of randomness,
00:52:03.280 | to remove the
00:52:04.880 | incentive to submit the same code over and over again
00:52:09.080 | in the hope of producing a higher reward,
00:52:11.280 | a higher evaluation score.
00:52:14.480 | The method for evaluation is to collect the average speed over 10 runs,
00:52:19.880 | about 45 seconds of game each.
00:52:25.080 | Not minutes,
00:52:26.280 | 45 simulated seconds.
00:52:28.280 | And there are 500 of those,
00:52:30.880 | and we take the median speed of the 500 runs.
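The scoring rule itself is simple to state in code. A hypothetical sketch of "median of per-run average speeds"; the run counts and durations come from the lecture, and the function is purely illustrative:

```javascript
// Illustrative scoring: median of the average speed (mph) achieved in each short evaluation run.
function medianSpeed(runAverageSpeeds) {
  var sorted = runAverageSpeeds.slice().sort(function (a, b) { return a - b; });
  var mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}
```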
00:52:33.480 | It's done server side,
00:52:35.480 | so it's extremely difficult to cheat.
00:52:37.080 | I urge you to try.
00:52:38.680 | And you can try it locally:
00:52:41.680 | there's a "start evaluation run" button,
00:52:43.480 | but that one doesn't count.
00:52:44.680 | That's just for you to feel better about your network.
00:52:46.880 | It should produce a result that's very similar to the one we'll produce on the server;
00:52:52.680 | it's there to build your own intuition.
00:52:55.280 | And as I said,
00:52:57.280 | we significantly reduced the influence of randomness,
00:52:59.480 | so the
00:53:00.480 | score,
00:53:01.880 | the speed you get for the network you design, should be very similar with every evaluation.
00:53:07.480 | Loading and saving.
00:53:10.480 | If the network is huge and you want to switch computers,
00:53:13.280 | you can save the network.
00:53:14.480 | It saves both the architecture of the network
00:53:16.480 | and the weights of the network,
00:53:19.480 | and you can load it back in.
00:53:21.480 | Obviously, when you load it in,
00:53:25.080 | it doesn't include any of the training experience you've already accumulated.
00:53:29.280 | You can't do transfer learning with JavaScript in the browser yet.
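Under the hood, ConvNetJS networks, which the DeepTraffic brain is built on, can be serialized to JSON and restored, which is what makes saving the architecture plus the weights possible. A minimal sketch, assuming the standard convnetjs toJSON/fromJSON API and the deepqlearn value_net field rather than DeepTraffic's actual save/load buttons:

```javascript
// Save: serialize both the architecture and the weights of the value network to JSON text.
var savedJson = JSON.stringify(brain.value_net.toJSON());

// Load (possibly on another computer): rebuild the network from the saved JSON.
var restoredNet = new convnetjs.Net();
restoredNet.fromJSON(JSON.parse(savedJson));
// Only the architecture and weights come back; past replay experience is not included.
```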
00:53:33.080 | Submitting your network:
00:53:35.680 | submit model to competition.
00:53:37.680 | And make sure you run training first.
00:53:39.680 | Otherwise,
00:53:41.280 | the weights are initialized randomly
00:53:42.280 | and it will not do so well.
00:53:44.880 | You can resubmit as often as you like, and the highest score is what counts.
00:53:49.480 | The coolest part is you can load your custom image,
00:53:52.480 | specify colors and request the visualization.
00:53:56.280 | We have not yet shown the visualization,
00:54:00.480 | but I promise you it's going to be awesome.
00:54:02.480 | Again,
00:54:03.480 | read the tutorial,
00:54:05.080 | change the parameters in the code box,
00:54:06.880 | click apply code, run training.
00:54:09.280 | Everybody in this room on the way home,
00:54:11.280 | on the train,
00:54:12.280 | hopefully not in your car,
00:54:14.280 | should be able to do this in the browser.
00:54:16.080 | And then you can visualize:
00:54:17.680 | request visualization, because it's an expensive process.
00:54:20.480 | You have to want it for us to do it,
00:54:22.880 | because we have to run it server side.
00:54:25.680 | Competition link is there.
00:54:30.480 | GitHub starter code is there.
00:54:32.680 | And the details for those who truly want to win are in the arXiv paper.
00:54:36.480 | So the question
00:54:39.080 | that will come up throughout is whether these reinforcement learning approaches
00:54:43.280 | are applicable at all, or rather whether action, planning, and control are amenable to learning.
00:54:48.680 | Certainly in the case of driving,
00:54:52.680 | we can't do what AlphaGo Zero did.
00:54:55.280 | We can't learn from scratch from self-play
00:54:59.880 | because that would result in millions of crashes
00:55:03.480 | in order to learn to avoid the crashes.
00:55:08.480 | Unless we're working, like we are with DeepCrash, on the RC car,
00:55:11.680 | or we're working in simulation.
00:55:13.480 | So we can look at expert data.
00:55:16.080 | We can look at driver data,
00:55:17.280 | which we have a lot of, and learn from it.
00:55:18.880 | It's an open question whether this is applicable.
00:55:21.280 | To date,
00:55:23.080 | and I bring up two companies because they're both guest speakers,
00:55:26.880 | deep RL is not involved in the most successful robots operating in the real world.
00:55:33.280 | In the case of Boston Dynamics,
00:55:38.480 | most of the perception,
00:55:41.080 | control and planning, like in this robot,
00:55:44.880 | does not involve learning approaches,
00:55:48.080 | except with minimal addition on the perception side,
00:55:51.480 | to the best of our knowledge.
00:55:54.280 | And certainly the same is true with Waymo,
00:55:57.680 | as the speaker on Friday will talk about.
00:56:00.280 | Deep learning is used a little bit in perception on top,
00:56:03.880 | but most of the work is done from the sensors
00:56:07.280 | and the optimization-based, the model-based approaches:
00:56:11.480 | trajectory generation and optimizing which trajectory is best to avoid collisions.
00:56:18.080 | Deep RL is not involved.
00:56:20.680 | And coming back again and again:
00:56:25.280 | the unexpected local pockets of higher reward,
00:56:28.280 | which arise in all of these situations when applied in the real world.
00:56:31.880 | So for the cat video,
00:56:34.880 | that's pretty short, where the cats are ringing the bell
00:56:37.680 | and they're learning that the ring of the bell
00:56:40.280 | maps to food,
00:56:44.280 | I urge you to think about how that can evolve over time in unexpected ways
00:56:50.280 | that may not have a desirable effect,
00:56:52.680 | where the final reward is in the form of food
00:56:55.480 | and the intended effect is to ring the bell.
00:57:03.280 | That's where AI safety comes in.
00:57:05.080 | For the artificial general intelligence course in two weeks,
00:57:08.280 | that's something we'll explore extensively.
00:57:10.280 | It's about how these reinforcement learning and planning algorithms
00:57:17.280 | will evolve in ways that are not expected,
00:57:20.280 | and how we can constrain them,
00:57:23.080 | how we can design reward functions that result in safe operation.
00:57:28.280 | So I encourage you to come to the talk on Friday
00:57:33.280 | at 1pm. As a reminder, it's at 1pm, not 7pm,
00:57:36.680 | in Stata 32-123.
00:57:38.880 | And to the awesome talks in two weeks,
00:57:41.480 | from Boston Dynamics to Ray Kurzweil and so on for AGI.
00:57:50.880 | Now tomorrow, we'll talk about computer vision and SegFuse.
00:57:50.880 | Thank you everybody.
00:57:51.880 | [APPLAUSE]