MIT 6.S094: Deep Reinforcement Learning
Chapters
0:00 AI Pipeline from Sensors to Action
8:25 Reinforcement Learning
23:50 Deep Reinforcement Learning
36:00 AlphaGo
41:50 DeepTraffic
54:35 Conclusion
00:00:00.000 |
Today we will talk about deep reinforcement learning. 00:00:02.800 |
The question we would like to explore is to which degree 00:00:18.680 |
So let's take a step back and think of what is the full range of tasks 00:00:25.480 |
an artificial intelligence system needs to accomplish. 00:00:30.280 |
From top to bottom: at the top the input, at the bottom the output. 00:00:33.680 |
The environment at the top, the world that the agent is operating in. 00:00:41.360 |
Then come the sensors, taking in the world outside and converting it to raw data interpretable by machines. 00:00:49.080 |
And from that raw sensor data, you extract features. 00:00:58.080 |
such that you can input it, make sense of it, 00:01:05.280 |
And as we discussed, you form higher and higher order representations, 00:01:14.480 |
based on which the machine learning techniques can then be applied. 00:01:18.480 |
The machine learning techniques, as I mentioned, 00:01:27.480 |
convert the data into features, into higher-order representations, 00:01:31.680 |
and into simple, actionable, useful information. 00:01:34.480 |
We aggregate that information into knowledge. 00:01:37.680 |
We take the pieces of knowledge extracted from the data 00:01:57.280 |
to aggregate, to connect pieces of data seen in the recent past 00:02:03.680 |
or the distant past, to make sense of the world it's operating in. 00:02:07.680 |
And finally, to make a plan of how to act in that world based on its objectives, 00:02:14.480 |
As I mentioned, a simple but commonly accepted definition of intelligence 00:02:20.080 |
is a system that's able to accomplish complex goals. 00:02:24.880 |
So a system that's operating in an environment in this world 00:02:27.880 |
must have a goal, must have an objective function, a reward function. 00:02:32.080 |
And based on that, it forms a plan and takes action. 00:02:35.480 |
And because it operates in many cases in the physical world, 00:02:39.280 |
it must have tools, effectors with which it applies the actions 00:02:46.080 |
That's the full stack of an artificial intelligence system 00:03:00.080 |
What kind of task can an artificial intelligence system learn? 00:03:06.680 |
We will talk about the advancement of deep reinforcement learning approaches 00:03:12.680 |
and some of the fascinating ways it's able to take much of this stack 00:03:17.680 |
and treat it as an end-to-end learning problem. 00:03:21.480 |
But we look at games, we look at simple formalized worlds. 00:03:25.680 |
While these are still impressive, beautiful, and unprecedented accomplishments, 00:03:32.680 |
Can we then move beyond games and into expert tasks of medical diagnosis, 00:03:45.680 |
and finally the human level tasks of emotion, imagination. 00:03:52.880 |
Let's once again review the stack in practicality, 00:04:09.480 |
such as LIDAR, camera, radar, GPS, stereo cameras, audio microphone, 00:04:27.880 |
features are formed, representations are formed 00:04:30.280 |
and multiple higher and higher order representations. 00:04:34.880 |
Before neural networks, 00:04:37.880 |
before the recent successes of neural networks in going deeper 00:04:42.480 |
and therefore being able to form higher-order representations of the data. 00:04:56.280 |
the final layers of these networks are able to accomplish 00:04:59.680 |
the supervised learning tasks, the generative tasks 00:05:09.480 |
That's what we talked about a little in lecture one 00:05:19.680 |
And you can think about the output of those networks 00:05:22.480 |
as simple, clean, useful, valuable information. 00:05:27.480 |
And that knowledge can be in the form of single numbers. 00:05:33.280 |
It could be regression, continuous variables. 00:05:38.280 |
It could be images, audio, sentences, text, speech. 00:05:44.280 |
Once that knowledge is extracted and aggregated, 00:05:48.080 |
how do we connect it in multi-resolutional ways? 00:05:55.280 |
The trivial silly example is connecting images, 00:06:26.680 |
and making action, control and longer-term plans 00:06:35.080 |
are more and more amenable to the learning approach, 00:06:43.080 |
as compared with non-learning, optimization-based approaches. 00:06:45.280 |
Like with several of the guest speakers we have, 00:06:55.880 |
how much of the stack can be learned, end to end, 00:07:21.680 |
in machine learning over the past three decades has been. 00:07:30.880 |
the automated representation learning of deep learning, 00:07:49.680 |
So aggregating, forming higher representations, 00:07:56.680 |
and acting in this world from the raw sensory data. 00:08:06.480 |
with deep reinforcement learning on trivial tasks, 00:08:24.880 |
So today, let's talk about reinforcement learning. 00:09:11.680 |
And that's where reinforcement learning falls. 00:09:49.480 |
there is a temporal consistency to the world. 00:12:38.280 |
to state that does not directly have a reward. 00:13:58.080 |
into a way that's interpretable by the system. 00:14:04.480 |
The continuous problem of cart-pole balancing, 00:15:13.680 |
The state is the raw pixels of the real world, 00:16:17.880 |
that the agent represents the environment with, 00:16:40.880 |
and they're tasked with walking about this world, 00:16:50.080 |
and one square below that is a negative one. 00:17:18.280 |
to the maximum square with the maximum reward. 00:17:58.080 |
you can't control where you're going to end up, 00:18:05.080 |
an optimal action to take in every single state. 00:18:16.880 |
There's a high punishment for every single step we take. 00:18:23.480 |
The optimal policy is to take the shortest path, 00:18:41.280 |
Where there is some extra degree of wandering, 00:18:55.280 |
more and more wandering is allowed. 00:19:16.280 |
to stay on the board without ever reaching the destination. 00:19:37.880 |
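To make the effect of that per-step reward concrete, here is a minimal value-iteration sketch on a small gridworld. It assumes a deterministic, simplified version of the example (a 3x4 grid, a +1 terminal with a -1 terminal one square below it); the lecture's grid also has stochastic transitions, which are omitted here.

```python
import numpy as np

# Minimal, simplified value-iteration sketch of the gridworld example above.
# Assumptions: 3x4 grid, a +1 terminal square and a -1 terminal one square
# below it, deterministic moves, and a configurable per-step reward.

ROWS, COLS = 3, 4
TERMINALS = {(0, 3): +1.0, (1, 3): -1.0}      # goal square and the square below it
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def value_iteration(step_reward, gamma=1.0, iters=200):
    V = np.zeros((ROWS, COLS))
    for _ in range(iters):
        for r in range(ROWS):
            for c in range(COLS):
                if (r, c) in TERMINALS:
                    V[r, c] = TERMINALS[(r, c)]
                    continue
                best = -np.inf
                for dr, dc in ACTIONS:
                    nr, nc = r + dr, c + dc
                    if not (0 <= nr < ROWS and 0 <= nc < COLS):
                        nr, nc = r, c          # bumping into the edge keeps you in place
                    best = max(best, step_reward + gamma * V[nr, nc])
                V[r, c] = best
    return V

# A strongly negative step reward makes the shortest path optimal; a mildly
# negative one allows wandering; a positive step reward makes staying on the
# board forever better than ever reaching the destination.
print(value_iteration(step_reward=-2.0))
print(value_iteration(step_reward=-0.04))
```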
the reward we're likely to receive in the future. 00:19:41.880 |
And the way we see the reward we're likely to receive, 00:20:00.080 |
the reward, the importance of the reward received. 00:20:05.880 |
is taking the sum of these rewards and maximizing it. 00:20:21.080 |
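The fragments above describe weighting future rewards by how far away they are and maximizing their sum. A minimal sketch of that objective, assuming the standard discounted-return formulation with a discount factor gamma (the exact formulation shown on the slide is not reproduced in the transcript):

```python
# Sketch of the discounted return the agent tries to maximize, assuming the
# standard formulation G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
# A gamma near 1 values distant rewards almost as much as immediate ones;
# a gamma near 0 makes the agent short-sighted.

def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):   # work backwards so each step adds one factor of gamma
        g = r + gamma * g
    return g

print(discounted_return([0.0, 0.0, 1.0]))  # a reward of 1 arriving two steps from now
```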
any policy to estimate the value of taking an action, 00:20:32.880 |
and use the Bellman equation here on the bottom, 00:20:35.280 |
to continuously update our estimate of how good, 00:20:44.680 |
this allows us to operate in a much larger state space, 00:22:18.480 |
But the better and better your estimate becomes, 00:22:23.680 |
So, usually we want to explore a lot in the beginning, 00:24:12.680 |
here, they're taking the raw pixels of the game. 00:24:49.080 |
Through the simple update of the Bellman equation. 00:25:12.280 |
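Putting those fragments together, here is a minimal tabular Q-learning sketch with epsilon-greedy exploration that decays over time: explore a lot in the beginning, exploit more and more as the estimates improve. The environment interface (reset() returns a state; step() returns next state, reward, done) and all hyperparameters are illustrative assumptions, not the lecture's exact code; DQN replaces the table below with a neural network.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch (off-policy: it estimates the value of the
# greedy policy while following an epsilon-greedy one). The env interface is a
# simplified, assumed placeholder.

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.99,
               eps_start=1.0, eps_end=0.05, eps_decay=0.995):
    Q = defaultdict(float)            # Q[(state, action)] -> estimated value
    eps = eps_start                   # explore a lot in the beginning...
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < eps:
                a = random.choice(actions)                    # explore
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])   # exploit
            s2, r, done = env.step(a)
            # Bellman update: move the estimate toward reward + discounted
            # value of the best action available in the next state.
            target = r + (0.0 if done else gamma * max(Q[(s2, a_)] for a_ in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
        eps = max(eps_end, eps * eps_decay)   # ...and exploit more and more later
    return Q
```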
or neural networks help us memorize patterns, 00:27:34.880 |
the neural network learns with a loss function. 00:28:19.280 |
that are reachable based on the actions you can take. 00:29:08.080 |
And that's how we compute the two parts of the loss function. 00:29:11.280 |
And update the weights using backpropagation. 00:29:18.080 |
Backpropagation is how the network is trained. 00:29:33.880 |
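As a rough sketch of those two parts of the loss, here is the per-transition DQN objective in numpy. `q_net` and `target_net` are placeholder callables returning one Q-value per action; the use of a separate, slowly updated target network follows the standard DQN recipe and may not match every detail shown on the slide.

```python
import numpy as np

# Sketch of the DQN loss on a single transition (s, a, r, s2, done).
# q_net(s) and target_net(s) are assumed to return a vector with one Q-value
# per action; both names are placeholders, not the lecture's code.

def dqn_loss(q_net, target_net, s, a, r, s2, done, gamma=0.99):
    prediction = q_net(s)[a]                                      # part 1: Q(s, a) from the online network
    target = r if done else r + gamma * np.max(target_net(s2))    # part 2: Bellman target
    # The weights are then updated by backpropagating the gradient of this
    # squared error (autodiff is omitted in this numpy sketch).
    return (target - prediction) ** 2
```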
So, as the games are played through simulation, 00:29:51.080 |
by randomly sampling from the library of past experiences. 00:30:04.880 |
on the natural continuous evolution of the system. 00:31:56.880 |
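A minimal sketch of that experience-replay idea: transitions are stored as the game is played, and training batches are drawn uniformly at random from the memory instead of following the natural, highly correlated sequence of frames. The names and capacity below are illustrative.

```python
import random
from collections import deque

# Minimal experience-replay sketch: store transitions as the game is played,
# and train on random samples from this memory, which breaks the strong
# correlation between consecutive frames.

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)   # old experiences fall off the end

    def push(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```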
In unpredictable, difficult to understand ways. 00:32:47.680 |
that you only take an action every four steps. 00:32:52.480 |
as part of the temporal window to make decisions. 00:34:50.280 |
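A rough sketch of those two temporal tricks, following the commonly cited DQN setup of acting every four frames and stacking the last four frames into the state; the simplified env.step() interface and the exact numbers are assumptions for this sketch.

```python
from collections import deque
import numpy as np

# Frame skipping: repeat the chosen action for `skip` frames.
# Frame stacking: the state is the last `stack` frames, so it carries motion
# information. env.step(action) is assumed to return (frame, reward, done).

def step_with_skip(env, action, skip=4):
    total_reward, done, frame = 0.0, False, None
    for _ in range(skip):
        frame, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return frame, total_reward, done

class FrameStack:
    def __init__(self, stack=4):
        self.frames = deque(maxlen=stack)

    def reset(self, first_frame):
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return np.stack(self.frames)

    def add(self, frame):
        self.frames.append(frame)   # the oldest frame falls off automatically
        return np.stack(self.frames)
```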
That's something you'll see in DeepTraffic as well, 00:35:03.080 |
So, this algorithm has been able to accomplish, 00:35:27.080 |
that raw sensor information was used to create, 00:35:32.680 |
It makes sense of the physics of the world enough, 00:35:47.280 |
This DQN approach has been able to outperform, 00:36:06.680 |
of artificial intelligence in the last decade, 00:36:54.880 |
It's a very large number of possible positions to consider. 00:37:13.080 |
the community thought that this game was not solvable. 00:37:31.880 |
And I'll describe it in a little bit of detail, 00:38:16.080 |
And the quality of players that it is competing in, 00:38:23.280 |
to achieve a rating that's better than AlphaGo. 00:38:26.880 |
And better than the different variants of AlphaGo. 00:39:13.480 |
or to go deep in the positions you know are good, 00:39:22.680 |
quality of the choices you made leading to that position. 00:41:04.280 |
in the sense that first it outputs the probability of each possible move, 00:41:09.080 |
and it's also producing a probability of winning. 00:41:11.480 |
And there are a few ways to combine that information. 00:41:39.480 |
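One common way, in the spirit of the selection rule described in the AlphaGo papers, is to score each candidate move by its value estimate plus its policy prior, scaled down by how often the move has already been explored in the search tree. This is only a rough sketch; the names and the constant are illustrative.

```python
import math

# Rough sketch of combining the two network outputs inside tree search:
# Q[a] is the mean value (winning probability) seen so far for move a,
# P[a] is the policy network's prior probability for a, and N[a] is how many
# times a has been explored. c_puct and all names are illustrative.

def select_move(moves, Q, P, N, c_puct=1.0):
    total_visits = sum(N[a] for a in moves)
    def score(a):
        exploration = c_puct * P[a] * math.sqrt(total_visits + 1) / (1 + N[a])
        return Q[a] + exploration
    return max(moves, key=score)
```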
they updated the state-of-the-art architecture, 00:42:32.880 |
We applied deep reinforcement learning to that, 00:42:37.080 |
The goal is to achieve the highest average speed, 00:43:47.680 |
much faster than what's actually being visualized, 00:44:53.880 |
That one is controlled by the neural network. 00:45:14.880 |
but they don't have a purpose in their existence. 00:45:46.080 |
And when there's other cars that are going slow, 00:47:36.280 |
the output is the value of the different actions. 00:48:21.480 |
The brain is where the neural network is contained, 00:49:01.080 |
You have to achieve the highest average speed, 00:49:21.880 |
of achieving the average speed for all of them. 00:49:24.480 |
But the actions are taken in a greedy way for each. 00:49:28.680 |
It's very interesting what can be learned in this way. 00:49:32.080 |
Because these kinds of approaches are scalable, 00:49:49.480 |
Because they're fully greedy in their operation. 00:49:53.280 |
The number of networks that can concurrently operate, 00:50:04.680 |
The layers, the many layer types that can be added. 00:50:04.680 |
Here's a fully connected layer with ten neurons. 00:50:12.280 |
The activation functions, all of these things can be customized. 00:50:18.480 |
The final layer, a fully connected layer with 00:50:23.480 |
a regression output, giving the value of each of the five actions. 00:50:28.080 |
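DeepTraffic networks are defined in ConvNetJS (JavaScript) directly in the browser; purely to illustrate the layer structure described above, here is a rough Python analogue with a ten-neuron fully connected hidden layer and a five-output regression head. The input size is an arbitrary placeholder.

```python
import numpy as np

# Rough Python analogue of the layer structure described above, for
# illustration only: the real DeepTraffic network is defined in ConvNetJS
# (JavaScript) in the browser. The input size is a placeholder; the ten-neuron
# hidden layer and five-action regression output follow the description.

rng = np.random.default_rng(0)
NUM_INPUTS, HIDDEN, NUM_ACTIONS = 135, 10, 5   # 135 is an arbitrary placeholder
W1, b1 = rng.normal(scale=0.1, size=(NUM_INPUTS, HIDDEN)), np.zeros(HIDDEN)
W2, b2 = rng.normal(scale=0.1, size=(HIDDEN, NUM_ACTIONS)), np.zeros(NUM_ACTIONS)

def q_values(state):
    """Map a flattened state vector to one estimated value per action."""
    h = np.maximum(0.0, state @ W1 + b1)   # fully connected layer, ReLU activation
    return h @ W2 + b2                     # regression output: value of each action
```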
And there are a lot of more specific parameters. 00:50:45.680 |
The optimizer, the learning rate, momentum, batch size, 00:50:53.080 |
There's a big white button that says apply code that you press. 00:50:56.880 |
That kills all the work you've done up to this point. 00:51:00.880 |
You should be doing it only at the very beginning. 00:51:03.680 |
If you happen to leave your computer running, 00:51:07.480 |
in training for several days, as folks have done. 00:51:16.280 |
And the network state gets shipped to the main simulation from time to time. 00:51:24.280 |
is running the same network that's being trained. 00:51:33.280 |
It's constantly updating the network you see on the left. 00:51:36.480 |
So if the car, for the network that you're training, 00:51:49.680 |
Number of iterations is certainly an important parameter to control. 00:51:55.080 |
And the evaluation is something we've done a lot of work on since last year. 00:52:04.880 |
the incentive to submit the same code over and over again. 00:52:14.480 |
The method for evaluation is to collect the average speed over 10 runs. 00:52:30.880 |
And we take the median speed of the 500 runs. 00:52:44.680 |
That's just for you to feel better about your network. 00:52:46.880 |
That should produce a result that's very similar to the one we'll produce on the server. 00:52:57.280 |
we significantly reduce the influence of randomness. 00:53:01.880 |
the speed you get for the network you design should be very similar with every evaluation. 00:53:10.480 |
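A minimal sketch of that evaluation protocol, assuming it amounts to recording each run's average speed and reporting the median across many runs; `simulate_run` is a placeholder for the actual DeepTraffic simulation, and the run count just mirrors the number mentioned above.

```python
import statistics

# Sketch of the evaluation described above: run the simulation many times,
# record the average speed achieved in each run, and report the median so a
# single lucky or unlucky run doesn't dominate. `simulate_run` is a placeholder.

def evaluate(simulate_run, num_runs=500):
    average_speeds = [simulate_run() for _ in range(num_runs)]
    return statistics.median(average_speeds)
```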
If the network is huge and you want to switch computers, 00:53:14.480 |
It saves both the architecture of the network. 00:53:25.080 |
it's not saving any of the data you've already done. 00:53:29.280 |
You can't do transfer learning with JavaScript in the browser yet. 00:53:42.280 |
the weights are initialized randomly and will not do so well. 00:53:44.880 |
You can resubmit as often as you like and the highest score is what counts. 00:53:49.480 |
The coolest part is you can load your custom image, 00:53:52.480 |
specify colors and request the visualization. 00:54:17.680 |
request visualization because it's an expensive process. 00:54:32.680 |
And the details for those that truly want to win are in the arXiv paper. 00:54:32.680 |
that will come up throughout is whether these reinforcement learning approaches 00:54:43.280 |
are applicable at all, or rather whether action, planning, and control are amenable to learning. 00:54:59.880 |
because that would result in millions of crashes 00:55:08.480 |
Unless we're working, like we are with DeepCrash, on the RC car 00:55:18.880 |
It's an open question whether this is applicable. 00:55:23.080 |
and I bring up two companies because they're both guest speakers. 00:55:26.880 |
Deep RL is not involved in the most successful robots operating in the real world. 00:55:48.080 |
except with minimal addition on the perception side. 00:56:00.280 |
Deep learning is used a little bit in perception on top, 00:56:03.880 |
but most of the work is done from the sensors 00:56:07.280 |
and the optimization-based, the model-based approaches. 00:56:11.480 |
Trajectory generation and optimizing which trajectory is best to avoid collisions. 00:56:25.280 |
the unexpected local pockets of higher reward, 00:56:25.280 |
which arise in all of these situations when applied in the real world. 00:56:34.880 |
that's pretty short where the cats are ringing the bell 00:56:37.680 |
and they're learning that the ring of the bell 00:56:44.280 |
I urge you to think about how that can evolve over time in unexpected ways. 00:56:52.680 |
Where the final reward is in the form of food 00:57:05.080 |
For the artificial general intelligence course in two weeks, 00:57:10.280 |
It's how these reinforcement learning planning algorithms 00:57:23.080 |
how we can design reward functions that result in safe operation. 00:57:28.280 |
So I encourage you to come to the talk on Friday, 00:57:41.480 |
from Boston Dynamics to Ray Kurzweil and so on for AGI. 00:57:45.880 |
Now tomorrow, we'll talk about computer vision and SegFuse.