
MIT 6.S094: Deep Reinforcement Learning


Chapters

0:00 AI Pipeline from Sensors to Action
8:25 Reinforcement Learning
23:50 Deep Reinforcement Learning
36:00 AlphaGo
41:50 DeepTraffic
54:35 Conclusion

Transcript

Today we will talk about deep reinforcement learning. The question we would like to explore is to what degree we can teach systems to perceive and act in this world from data. So let's take a step back and think about the full range of tasks an artificial intelligence system needs to accomplish.

Here's the stack, from top to bottom: the input at the top, the output at the bottom. At the top is the environment, the world that the agent is operating in. It is sensed by sensors, which take in the world outside and convert it into raw data interpretable by machines: sensor data. And from that raw sensor data, you extract features.

You extract structure from that data so that you can input it, make sense of it, discriminate, separate, understand the data. And as we discussed, you form higher and higher order representations, a hierarchy of representations, on which the machine learning techniques can then be applied. The machine learning techniques, the understanding, as I mentioned, convert the data into features, into higher-order representations, and into simple, actionable, useful information.

We aggregate that information into knowledge. We take the pieces of knowledge extracted from the data through the machine learning techniques and build a taxonomy, a library of knowledge. And with that knowledge, we reason. An agent is tasked with reasoning: aggregating and connecting pieces of data seen in the recent past or the distant past to make sense of the world it's operating in.

And finally, to make a plan of how to act in that world based on its objectives, based on what it wants to accomplish. As I mentioned, a simple but commonly accepted definition of intelligence is a system that's able to accomplish complex goals. So a system that's operating in an environment in this world must have a goal, must have an objective function, a reward function.

And based on that, it forms a plan and takes action. And because it operates in many cases in the physical world, it must have tools, effectors with which it applies the actions to change something about the world. That's the full stack of an artificial intelligence system that acts in the world.

And the question is, what kind of task can such a system take on? What kind of task can an artificial intelligence system learn, as we understand AI today? We will talk about the advancement of deep reinforcement learning approaches and some of the fascinating ways they are able to take much of this stack and treat it as an end-to-end learning problem.

But we look at games, at simple formalized worlds. While these are still impressive, beautiful, unprecedented accomplishments, they are nevertheless formal tasks. Can we then move beyond games and into the expert tasks of medical diagnosis and design, into natural language, and finally into the human-level tasks of emotion, imagination.

Consciousness. Let's once again review the stack in practicality, in the tools we have. The input for robots operating in the world, from cars to humanoids to drones, is LIDAR, camera, radar, GPS, stereo cameras, audio microphones, networking for communication, and the various ways to measure kinematics with an IMU. The raw sensory data is then processed, features are formed, representations are formed, and then multiple, higher and higher order representations.

That's what deep learning gets us. Before neural networks, before the recent successes of neural networks in going deeper and therefore being able to form higher-order representations of the data, that was done by experts, by human experts. Today, networks are able to do that. That's the representation piece.

And on top of the representation piece, the final layers of these networks are able to accomplish the supervised learning tasks, the generative tasks, and the unsupervised clustering tasks, through machine learning. That's what we talked about a little in lecture one and will continue tomorrow and Wednesday. That's supervised learning. And you can think about the output of those networks as simple, clean, useful, valuable information.

That's the knowledge. And that knowledge can be in the form of single numbers. It could be regression, continuous variables. It could be a sequence of numbers. It could be images, audio, sentences, text, speech. Once that knowledge is extracted and aggregated, how do we connect it in multi-resolutional ways? Form hierarchies of ideas, connect ideas.

The trivial, silly example is connecting images, activity recognition, and audio. If it looks like a duck, quacks like a duck, and swims like a duck, we do not currently have approaches that effectively integrate this information to produce a higher-confidence estimate that it is in fact a duck.

And the planning piece, the task of taking the sensory information, fusing it, and making action, control, and longer-term plans based on that information, is, as we'll discuss today, more and more amenable to the learning approach, to the deep learning approach. But to date it has been most successful with non-learning, optimization-based approaches.

As with several of the guest speakers we have, including the creator of this robot, Atlas, at Boston Dynamics. So the question: how much of the stack can be learned, end to end, from the input to the output? We know we can learn the representation and the knowledge. From representation to knowledge, even with the kernel methods of SVMs, and certainly with neural networks, mapping from representation to information has been where the primary success in machine learning over the past three decades has been.

Mapping from raw sensory data to knowledge is where the automated representation learning of deep learning has been a success: going straight from raw data to knowledge. The open question for us, today and beyond, is whether we can expand the red box there, of what can be learned end to end, from sensory data all the way to reasoning.

That is, aggregating, forming higher representations of the extracted knowledge, forming plans, and acting in this world from the raw sensory data. We will show the incredible fact that we're able to learn exactly what's shown here, end to end, with deep reinforcement learning on trivial tasks, in a generalizable way.

The question is whether that can then move on to real-world tasks of autonomous vehicles, of humanoid robotics, and so on. That's the open question. So today, let's talk about reinforcement learning. There are three types of machine learning. Supervised and unsupervised are the categories at the extremes, relative to the amount of human input that's required.

For supervised learning, every piece of data that's used for teaching these systems is first labeled by human beings. In unsupervised learning, on the right, no data is labeled by human beings. In between is some sparse input from humans. Semi-supervised learning is when ground truth is provided by humans for only part of the data,

and the rest must be inferred, generalized by the system. And that's where reinforcement learning falls. Reinforcement learning is shown there with the cats; as I said, every successful presentation must include cats. They're supposed to be Pavlov's cats: every time they ring a bell, they're given food, and they learn this process.

The goal of reinforcement learning is to learn from sparse reward data, to learn from sparse supervised data, and to take advantage of the fact that, in simulation or in the real world, there is a temporal consistency to the world. There are temporal dynamics that carry from state to state to state through time.

And so you can propagate information even if the information you received about the supervision, the ground truth, is sparse. You can follow that information back through time to infer something about the reality of what happened before then, even if your reward signals were weak. So it's using the fact that the physical world evolves through time in some sort of predictable way to take sparse information and generalize it over the entirety of the experience that's being learned.

So we apply this to two problems. Today we'll talk about DeepTraffic as a methodology, as a way to introduce deep reinforcement learning. DeepTraffic is a competition that we ran last year and expanded significantly this year. And I'll talk about some of the details, and how the folks in this room can, on your smartphone today, or if you have a laptop, train an agent while I'm talking.

Training a neural network in the browser. Some of the things we've added: we've now turned it into a multi-agent deep reinforcement learning problem, where you can control up to 10 cars with your own network. Perhaps less significant, but pretty cool, is the ability to customize the way the agent looks.

So you can upload, and people have already begun doing so to an absurd degree, different images instead of the car that's shown there, as long as they maintain the dimensions; shown here is a SpaceX rocket. The competition is hosted on the website, selfdrivingcars.mit.edu/deeptraffic. We'll return to this later.

The code is on GitHub, with some more information and starter code, and a paper describing some of the fundamental insights that will help you win this competition is on arXiv. So, from supervised learning in lecture one to today: supervised learning we can think of as memorization of ground-truth data in order to form representations that generalize from that ground truth.

Reinforcement learning we can think of as a way to brute-force propagate that sparse information through time, to assign a quality, a reward, to states that do not directly have a reward: to make sense of this world when the rewards are sparse but connected through time. You can think of that as reasoning.

The connection through time is modeled, in most reinforcement learning approaches, very simply: there's an agent taking an action in a state and receiving a reward. The agent, operating in an environment, executes an action, observes a new state, and receives a reward. This process continues over and over.

As examples, we can think of any of the video games, some of which we'll talk about today. In Atari Breakout, the game is the environment and the agent is the paddle. Each action that the agent takes has an influence on the evolution of the environment, and success is measured by some reward mechanism.

In this case, points are given by the game. And every game has a different point scheme that must be converted, normalized, into a form that's interpretable by the system. And the goal is to maximize those points, maximize the reward. In the continuous problem of cart-pole balancing, the goal is to balance a pole on top of a moving cart.

The state is the pole angle, the angular speed, the cart position, and the horizontal velocity. The action is the horizontal force applied to the cart. And the reward is one at each time step if the pole is still upright.
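
As a concrete picture of that agent-environment loop, here is a minimal sketch using the Gymnasium cart-pole environment (the library choice and the random policy are mine, not something from the lecture); note that this standard version uses two discrete push-left/push-right actions rather than a continuous force:

```python
# Minimal agent-environment loop for cart-pole (illustrative sketch).
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)   # state: cart position, cart velocity, pole angle, pole angular velocity

total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()   # random policy: push cart left (0) or right (1)
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward               # +1 for every time step the pole stays upright
    done = terminated or truncated

print("episode return:", total_reward)
env.close()
```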

Or take the first-person shooters, the video games, and now strategy games like StarCraft. In the case of the first-person shooter Doom, what is the goal? The environment is the game, the goal is to eliminate all opponents, the state is the raw game pixels coming in, and the actions are moving up, down, left, right, and so on. The reward is positive when eliminating an opponent and negative when the agent is eliminated.

In industrial robotics, bin picking with a robotic arm, the goal is to pick up a device from a box and put it into a container. The state is the raw pixels of the real world that the robot observes, and the actions are the possible actions of the robot: moving through the different degrees of freedom, moving the different actuators, to realize the position of the arm.

And the reward is positive when placing a device successfully and negative otherwise. Everything can be modeled in this way, as a Markov decision process. There's a state s0, an action a0, and a reward received; a new state is reached. Again: action, reward, state; action, reward, state; until a terminal state is reached.

And the major components of reinforcement learning are a policy, some kind of plan of what to do in every single state, what kind of action to perform; a value function, some kind of sense of what is a good state to be in, of what is a good action to take in a state;

and sometimes a model with which the agent represents the environment, some kind of sense of the environment it's operating in, the dynamics of that environment, that's useful for making decisions about actions. Let's take a trivial example: a grid world of 3 by 4, 12 squares, where you start at the bottom left and you're tasked with walking about this world to maximize reward.

The reward at the top right is a plus one, and one square below that is a negative one. And every step you take is a punishment, a negative reward of 0.04. So what is the optimal policy in this world? When everything is deterministic, perhaps this is the policy.

When you start at the bottom left, because every step hurts, every step has a negative reward, you want to take the shortest path to the square with the maximum reward. When the world is non-deterministic, as presented before, then when you choose to go up, you go up with probability 0.8, but with probability 0.1 you go left and with probability 0.1 you go right.

Unfair. Again, much like life. That would be the optimal policy. What is the key observation here? That every single state in the space must have a plan. Because of the non-deterministic aspect of the control, you can't control where you're going to end up, so you must have a plan for every place.

That's the policy: having an action, an optimal action, to take in every single state. Now suppose we change the reward structure so that for every step we take, the reward is negative two. So it really hurts; there's a high punishment for every single step we take.

Then no matter what, we always take the shortest path, the shortest path to the only spot on the board that doesn't result in punishment. If we decrease the punishment for each step to negative 0.1, the policy changes: some extra degree of wandering is encouraged.

And as we go further in lowering the punishment, to negative 0.04 as before, more and more wandering is allowed. And when we finally turn the reward positive, so that every step increases the reward, then there's a significant incentive to stay on the board without ever reaching the destination. Kind of like college for a lot of people.
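
To make the effect of the step reward concrete, here is a small value-iteration sketch of that 3-by-4 grid world (a hypothetical reconstruction, not the lecturer's code; the blocked cell at (1, 1) comes from the classic textbook version of this grid world and may or may not match the slide). Change living_reward to -2, -0.1, -0.04, or a positive value and the resulting values, and hence the greedy policy, shift exactly as described above.

```python
# Value iteration on the 3x4 grid world: +1 at top-right, -1 just below it,
# an assumed wall at (1, 1), and a configurable "living reward" per step.
# Moves succeed with probability 0.8 and slip sideways with probability 0.1 each.
ROWS, COLS = 3, 4
WALL = (1, 1)
TERMINALS = {(0, 3): +1.0, (1, 3): -1.0}
ACTIONS = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}
PERP = {'U': ('L', 'R'), 'D': ('L', 'R'), 'L': ('U', 'D'), 'R': ('U', 'D')}

def move(state, a):
    """Attempt a move; bumping into the wall or the edge leaves you in place."""
    r, c = state[0] + ACTIONS[a][0], state[1] + ACTIONS[a][1]
    if not (0 <= r < ROWS and 0 <= c < COLS) or (r, c) == WALL:
        return state
    return (r, c)

def value_iteration(living_reward, gamma=1.0, sweeps=200):
    V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
    for _ in range(sweeps):
        for s in list(V):
            if s in TERMINALS or s == WALL:
                V[s] = TERMINALS.get(s, 0.0)
                continue
            best = max(0.8 * V[move(s, a)]
                       + 0.1 * V[move(s, PERP[a][0])]
                       + 0.1 * V[move(s, PERP[a][1])]
                       for a in ACTIONS)
            V[s] = living_reward + gamma * best
    return V

V = value_iteration(living_reward=-0.04)
print(V[(2, 0)])   # value of the bottom-left start state
# With a positive living_reward (and gamma = 1) the values keep growing every
# sweep: exactly the "never reach the destination" incentive described above.
```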

So the value function, the way we think about the value of a state, or the value of anything in the environment, is the reward we're likely to receive in the future. And the way we look at the reward we're likely to receive is that we discount the future reward.

Because we can't always count on it. Here, gamma is the discount: further and further out into the future, the discounting decreases the importance of the reward received. And the good strategy is taking the sum of these rewards and maximizing it: maximizing discounted future reward. That's what reinforcement learning hopes to achieve.
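
Written out (my notation, matching the slide's intent), the discounted future return from time $t$, with discount factor $\gamma \in [0, 1)$, is

$$ R_t = r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k\, r_{t+k}, $$

and the agent's objective is to choose actions that maximize the expected value of this sum.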

And with Q-learning, we can use any policy to estimate the value of taking an action in a state. It's off-policy: forget the policy. We move about the world and use the Bellman equation, here on the bottom, to continuously update our estimate of how good a certain action is in a certain state.

This allows us to operate in a much larger state space, in a much larger action space. We move about this world, through simulation or in the real world, taking actions and updating our estimate of how good certain actions are over time. On the left of the equation is the new, updated value.

The old estimate is the starting value for the equation. And we update that old estimate with the sum of the reward received by taking action a in state s and the maximum reward that can possibly be received in the following state, discounted. That update is scaled down by a learning rate.

The higher the learning rate, the faster we learn, the more weight we assign to new information. That's simple, that's it. That's Q-learning.
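
In symbols, that verbal description is the standard Q-learning update (reconstructed here in common notation, with learning rate $\alpha$ and discount $\gamma$):

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[\, r + \gamma \max_{a'} Q(s', a') - Q(s, a) \,\right], $$

where $r$ is the reward received for taking action $a$ in state $s$, and $s'$ is the state that results.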

This simple update rule allows us to explore the world and, as we explore, get more and more information about what's good to do in it. And in the various problem spaces we'll discuss, there's always a balance between exploration and exploitation. As you form a better and better estimate of the Q-function, of what actions are good to take, you start to get a sense of what the best action to take is.

But it's not a perfect sense; it's still an approximation. And so there's value in exploration. But the better and better your estimate becomes, the less and less benefit exploration has. So usually we want to explore a lot in the beginning, and less and less so towards the end.

And when we finally release the system out into the world and wish it to operate at its best, we have it operate as a greedy system, always taking the optimal action according to the Q-value function. Everything I'm talking about now is parameterized, and these are parameters that are very important for winning the DeepTraffic competition, which uses this very algorithm with a neural network at its core.
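
That exploration schedule is usually implemented as epsilon-greedy action selection with a decaying epsilon; here is a small illustrative sketch (the names and the linear schedule are my own, not DeepTraffic's exact code):

```python
# Epsilon-greedy selection with a linearly decaying exploration rate.
import random

def select_action(Q, state, actions, epsilon):
    if random.random() < epsilon:
        return random.choice(actions)                 # explore: random action
    return max(actions, key=lambda a: Q[(state, a)])  # exploit: current best action

def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    frac = min(step / decay_steps, 1.0)               # explore a lot early, little later
    return eps_start + frac * (eps_end - eps_start)
```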

So take a simple table representation of a Q-function, where the rows are states, s1 through s4, and the columns are actions, a1 through a4. We can think of this table as being initialized randomly, or in any kind of way that's not representative of actual reality.

And as we move about this world and take actions, we update this table with the Bellman equation shown up top. And here, the slides are now online, you can see a simple pseudocode algorithm for how to run this update. Over time, the approximation converges to the optimal Q-table.
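
Here is a hedged sketch of what that pseudocode amounts to, for a generic episodic environment with a reset()/step() interface (the interface is my assumption; it reuses the epsilon-greedy helpers from the sketch above):

```python
# Tabular Q-learning: explore, observe (s, a, r, s'), apply the Bellman update.
from collections import defaultdict

def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.99):
    Q = defaultdict(float)                    # Q[(state, action)], defaults to 0
    for ep in range(episodes):
        state = env.reset()
        done = False
        while not done:
            a = select_action(Q, state, actions,
                              epsilon_at(ep, decay_steps=episodes))
            next_state, reward, done = env.step(a)   # assumed environment interface
            # move Q(s, a) toward r + gamma * max_a' Q(s', a')
            target = reward if done else reward + gamma * max(
                Q[(next_state, b)] for b in actions)
            Q[(state, a)] += alpha * (target - Q[(state, a)])
            state = next_state
    return Q
```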

The problem is that the Q-table becomes exponential in size. We take in raw sensory information, as we do with cameras in DeepCrash, or with DeepTraffic, which takes in the full grid space, the raw grid of DeepTraffic. The arcade games here take in the raw pixels of the game.

Or take Go, the game of Go, taking the raw state of the board as the input: the potential state space, the number of possible combinatorial variations of states, is extremely large. Larger than we can hold in memory, and larger than we could ever accurately approximate through the Bellman equation over time, through simulation.

Through the simple update of the Bellman equation. So this is where deep reinforcement learning comes in. Neural networks are really good approximators; they're really good at exactly this task of learning this kind of Q-table. So: we started with supervised learning, where neural networks help us memorize patterns using supervised ground-truth data, and we moved to reinforcement learning, which hopes to propagate outcomes into knowledge.

Deep learning allows us to do so on much larger state spaces and much larger action spaces. Which means it's generalizable; it's much more capable of dealing with the raw stuff of sensory data, which means it's much more capable of dealing with the broad variation of real-world applications. And it does so because it's able to learn the representations, as we discussed on Monday.

The understanding comes from converting the raw sensory information into simple, useful information, based on which the action in this particular state can be taken, in the same exact way. So instead of the Q-table, instead of this Q-function, we plug in a neural network, where the input is the state space, no matter how complex, and the output is a value for each of the actions you could take.

The input is the state, the output is the value of the function. It's simple. This is the Deep Q-Network, DQN. At the core of the success of DeepMind, and of a lot of the cool stuff you see about video games, DQN or variants of DQN are at play. This is where the success first came, with the Nature paper from DeepMind, playing the different games, including the Atari games.
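
For reference, the network from the Nature DQN paper is a small convolutional net over a stack of four 84x84 frames, ending in one output per action; here is a sketch of roughly that architecture in PyTorch (my implementation, not DeepMind's code):

```python
# DQN-style network: stacked frames in, one Q-value per action out.
import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, num_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4),   # 4 stacked 84x84 grayscale frames
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512),
            nn.ReLU(),
            nn.Linear(512, num_actions),                 # one Q-value per possible action
        )

    def forward(self, x):
        return self.net(x)

q_net = DQN(num_actions=4)
print(q_net(torch.zeros(1, 4, 84, 84)).shape)  # torch.Size([1, 4])
```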

So how are these things trained? Very similarly to supervised learning. The Bellman equation up top takes the reward and the discounted expected reward from future states. The neural network learns with a loss function: it takes the reward received at the current state, does a forward pass through the network to estimate the value of the best action to take in the future state, and then takes the difference with the forward pass through the network for the current state and action.

So you take the difference between what your Q estimator, the neural network, believes the value of the current state and action is, and what it more likely should be, based on the value of the future states that are reachable through the actions you can take.
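
In symbols (my reconstruction of the slide's loss), the network with weights $\theta$ is trained to minimize the squared temporal-difference error

$$ L(\theta) = \mathbb{E}\Big[\big(\, r + \gamma \max_{a'} Q(s', a'; \theta) - Q(s, a; \theta) \,\big)^2\Big], $$

where in the basic form the same network appears in both terms; the "fixed target network" trick described below replaces the first $Q$ with a periodically frozen copy of the weights, often written $\theta^{-}$.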

Here's the algorithm. The input is the state and the output is the Q-value for each action, or, in the other diagram, the input is the state and action and the output is the Q-value; they're very similar architectures. So, given a transition (s, a, r, s'), the current state, the action taken, the reward received, and the next state reached, the update is: do a feedforward pass through the network for the current state, and do a feedforward pass for each of the possible actions in the next state.

That's how we compute the two parts of the loss function, and then we update the weights using backpropagation. Again, a loss function and backpropagation is how the network is trained. This has actually been around for much longer than DeepMind; a few tricks made it really work. Experience replay is the biggest one.

As the games are played through simulation, or, if it's a physical system, as it acts in the world, it's collecting the observations into a library of experiences. And the training is performed by randomly sampling from that library, by randomly sampling the previous experiences in batches.

So you're not always training on the natural, continuous evolution of the system; you're training on randomly picked batches of those experiences. It seems like a subtle trick, but it's a really important one: the system doesn't overfit to a particular evolution of the game, of the simulation.
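
Here is a compact sketch of experience replay plus one training step, assuming a PyTorch Q-network like the one above (the buffer size, batch size, and function names are illustrative, not DeepMind's actual values):

```python
# Experience replay: store transitions, then train on random batches of them.
import random
from collections import deque
import torch
import torch.nn.functional as F

replay = deque(maxlen=100_000)                  # the "library of experiences"

def store(s, a, r, s_next, done):
    replay.append((s, a, r, s_next, done))

def train_step(q_net, optimizer, gamma=0.99, batch_size=32):
    batch = random.sample(replay, batch_size)   # random, non-consecutive experiences
    s, a, r, s_next, done = zip(*batch)
    s, s_next = torch.stack(s), torch.stack(s_next)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a)
    with torch.no_grad():
        # basic form: same network computes the target r + gamma * max_a' Q(s', a')
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values

    loss = F.mse_loss(q_sa, target)             # squared TD error from the slide
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```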

Another important, again subtle, trick, and as in a lot of deep learning approaches, the subtle tricks make all the difference, is fixing the target network. For the loss function, if you notice, you have to use the single neural network, the DQN network, to estimate the value of the current state-action pair

and of the next one. So you're using it multiple times, and as you perform that operation, you're updating the network. Which means the target inside that loss function is always changing; the very nature of your loss function is changing all the time as you're learning. And that's a big problem for stability.

That can create big problems for the learning process. So this little trick is to fix the network and only update it every, say, thousand steps. As you train the network, the network that's used to compute the target inside the loss function is fixed. That produces a more stable computation of the loss function.
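
In code, the trick is just a second, frozen copy of the network that is synced only occasionally (the 1000-step schedule below mirrors the "say, thousand steps" above; the rest is an illustrative PyTorch sketch):

```python
# Fixed target network: a frozen copy used inside the loss, synced every N steps.
import copy

target_net = copy.deepcopy(q_net)
SYNC_EVERY = 1000

def maybe_sync(step):
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())

# In train_step above, the target line then uses target_net(s_next)
# instead of q_net(s_next), so the target stays fixed between syncs.
```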

So the ground doesn't shift under you as you're trying to find a minimum for the loss function; the loss function doesn't change in unpredictable, difficult-to-understand ways. Then there's reward clipping, which comes up with any system seeking to operate in a generalized way: for these various games, the points are different.

Some points are low, some points are high, some go positive and negative. And they're all normalized so that the good, positive points become a one and negative points become a negative one. That's reward clipping: simplify the reward structure. And because a lot of the games run at 30 or 60 frames per second, and it's not valuable to take actions at such a high rate, particularly inside these Atari games, you only take an action every four frames,

while still taking in the frames as part of the temporal window used to make decisions. These are tricks, but hopefully they give you a sense of the kinds of things necessary both for seminal papers like this one and for the more important accomplishment of winning DeepTraffic: the tricks make all the difference.
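
Those last two tricks are tiny in code; here is an illustrative sketch (the simple three-value env.step interface is assumed, not any particular library's API):

```python
# Reward clipping and acting only every four frames ("action repeat").
import numpy as np

def clip_reward(r):
    return float(np.sign(r))          # any positive score -> +1, negative -> -1, zero -> 0

ACTION_REPEAT = 4

def act_with_repeat(env, action):
    total_r, done = 0.0, False
    for _ in range(ACTION_REPEAT):    # repeat the chosen action for four frames
        obs, r, done = env.step(action)
        total_r += clip_reward(r)
        if done:
            break
    return obs, total_r, done
```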

Here on the bottom, the circle is when the technique is used and the X when it's not, looking at the target network and experience replay, for the games Breakout, River Raid, Seaquest, and Space Invaders. The higher the number, the better: the more points achieved.

It gives you a sense that replay and the target network together give significant improvements in the performance of the system: order-of-magnitude improvements, two orders of magnitude for Breakout. And here is pseudocode for implementing DQN learning. The key thing to notice, and you can look at the slides, is that the while loop of playing through the games and selecting the actions to play is not itself the training.

It's the part that saves the observations, the state, action, reward, next state, into replay memory, into that library. You then sample randomly from that replay memory to train the network based on the loss function. And, up top, with probability epsilon you select a random action.

That epsilon is the probability of exploration, and it decreases. That's something you'll see in DeepTraffic as well: the rate at which exploration decreases over time through the training process. You want to explore a lot first, and less and less over time. So, this algorithm has been able to accomplish, in 2015 and since, a lot of incredible things.

Things that made the AI world think that we're onto something, that general AI is within reach. It was the first time that raw sensor information was used to create a system that acts and makes sense of the world, makes sense of the physics of the world enough to be able to succeed in it, from very little information.

But these games are trivial, even though there are a lot of them. This DQN approach has been able to outperform human-level performance on a lot of the Atari games; that's what's been reported on. But again, these games are trivial. What I think, and perhaps I'm biased, is one of the greatest accomplishments of artificial intelligence in the last decade, at least from the philosophical or the research perspective, is AlphaGo Zero.

First AlphaGo, and then AlphaGo Zero: DeepMind's system that beat the best in the world at the game of Go. So what's the game of Go? It's simple. I won't get into the rules, but basically it's a 19 by 19 board. As shown at the bottom of the slide, in the bottom row of the table, for a board of 19 by 19 the number of legal game positions is about 2 times 10 to the power of 170.

It's a very large number of possible positions to consider. At any one time, especially as the game evolves, the number of possible moves is huge, much larger than in chess. That's why the AI community thought that this game was not solvable, until 2016, when AlphaGo used human expert play to seed, in a supervised way, a reinforcement learning approach.

I'll describe it in a little bit of detail in a couple of slides here; it beat the best in the world. And then AlphaGo Zero, which is, for me, the accomplishment of the decade in AI, is able to play with no training data on human expert games

and beat the best in the world in an extremely complex game. This is not Atari; this is a game of a much higher order of difficulty, and the quality of the players it's competing against is much higher. And it's able, extremely quickly here, to achieve a rating that's better than AlphaGo,

better than the different variants of AlphaGo, and certainly better than the best of the human players, in 21 days of self-play. So how does it work? All of these approaches, much like the previous, traditional ones that are not based on deep learning, use Monte Carlo Tree Search, MCTS.

When you have such a large state space, you start at a board position and you play, choosing moves with some balance of exploitation and exploration, choosing either to explore totally new positions or to go deep into the positions you know are good, until the bottom of the game is reached, until the final state is reached.

And then you backpropagate the quality of the choices you made leading to that position, and in that way you learn the value of board positions and plays. That's been used by the most successful Go-playing engines before, and by AlphaGo since. But you might be able to guess what the difference is between AlphaGo and the previous approaches.

They use the neural network as the "intuition", quote-unquote, for what the good states, the good next board positions, are to explore. And the key things, again the tricks make all the difference, that made AlphaGo Zero work, and work much better than AlphaGo, are, first, that because there was no human expert play, no human games,

AlphaGo Zero used that very same Monte Carlo tree search algorithm, MCTS, to do an intelligent look-ahead based on the neural network's prediction of which states are good to take. Instead of human expert play, it checked how good those states indeed are. It's a simple look-ahead that provides the ground truth, the target correction, that produces the loss function.

The second part is the multitask learning, or what's now called multitask learning: the network is, quote-unquote, two-headed, in the sense that it outputs the probability of which move to take, the obvious thing, and it also produces a probability of winning. And there are a few ways to combine that information and continuously train both parts of the network, depending on the choice taken.
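
For reference, the combined objective from the AlphaGo Zero paper (not spelled out in the lecture) trains both heads at once: the value head $v$ toward the eventual game outcome $z$, the policy head $\mathbf{p}$ toward the MCTS-improved move probabilities $\boldsymbol{\pi}$, plus weight regularization:

$$ l = (z - v)^2 \;-\; \boldsymbol{\pi}^{\top} \log \mathbf{p} \;+\; c\,\lVert \theta \rVert^2 $$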

So you want to take the best choice in the short term and reach the positions that have a high likelihood of winning for the player whose turn it is. And another big step is that they updated the architecture from 2015 to the current state of the art, the architecture that won ImageNet: residual networks, ResNets.

That's it. And those little changes made all the difference. So that takes us to DeepTraffic, and the 8 billion hours stuck in traffic, America's pastime. We tried to simulate driving, the behavioral layer of driving: not the immediate control, not the motion planning, but on top of those control decisions, the human-interpretable decisions of changing lane, of speeding up, of slowing down.

We model that in a micro-traffic simulation framework that's popular in traffic engineering, the kind shown here, and we applied deep reinforcement learning to it; we call it DeepTraffic. The goal is to achieve the highest average speed over a long period of time, weaving in and out of traffic.

For students here, the requirement is to follow the tutorial and achieve a speed of 65 miles an hour. And, if you really want to, achieve a speed over 70 miles an hour, which is what's required to win. And perhaps upload your own image, to make sure you look good doing it.

What you should do, clear instructions: to compete, read the tutorial. You can change parameters in the code box on that website, cars.mit.edu/deeptraffic. Click the white button that says apply code, which applies the code that you write. These are the parameters that you specify for the neural network; it applies those parameters and creates the architecture that you specify.

And now you have a network written in JavaScript, living in the browser, ready to be trained. Then you click the blue button that says run training, and that trains the network, much faster than what's actually being visualized in the browser, a thousand times faster, by evolving the game, making decisions, and taking in the grid space I'll talk about here in a second.

The speed limit is 80 miles an hour. Based on the various adjustments we made to the game, reaching 80 miles an hour on average is certainly impossible, and reaching some of the speeds that were achieved last year is much, much, much more difficult. Finally, when you're happy and the training is done, submit the model to the competition.

For those super eager, dedicated students, you can do so every five minutes. And to visualize your submission, you can click request visualization, specifying the custom image and the color. Okay, so here's the simulation: a speed limit of 80 miles an hour and 20 cars on the screen, one of which is red in this case.

That one is controlled by the neural network. It's allowed the actions of speeding up, slowing down, changing lanes left or right, or staying exactly the same. The other cars are pretty dumb: they speed up, slow down, turn left and right, but they don't have a purpose in their existence.

They do so randomly. Or at least their purpose has not been discovered. The road, the car, the speed: the road is a grid space, an occupancy grid. When a cell is empty, it's set to 80, meaning the grid value is whatever speed would be achievable if you were inside that cell.

And when there's another car there going slowly, the value in that cell is the speed of that car. That's the state space, that's the state representation, and you can choose what slice of that state space you take in; that's the input to the neural network. For visualization purposes, you can choose normal speed or fast speed for watching the network operate.

And there are display options to help you build intuition about what the network takes in and what space the car is operating in. The default is that no extra information is added. Then there's the learning input, which visualizes exactly which part of the road serves as the input to the network.

Then there is the safety system, which I'll describe in a little bit: all the parts of the road the car is not allowed to go into, because that would result in a collision, and collisions would be very difficult to animate in JavaScript. And there's the full map. Here's the safety system.

You could think of this system as ACC, basic radar and ultrasonic sensors, helping you avoid the obvious collisions with the obviously detectable objects around you. And the task for this red car, for this neural network, is to move about the space under the constraints of the safety system.

The red shows all the parts of the grid it's not able to move into. So the goal for the car is to not get stuck in traffic, to make big sweeping motions to avoid crowds of cars. The input, like in DQN, is the state space, and the output is the value of the different actions.

And based on the epsilon parameter, through the training and the inference/evaluation process, you choose how much exploration you want to do. These are all parameters. The learning is done in the browser, on your own computer, utilizing only the CPU. The action space has five actions; some of the variables are given here, and perhaps you can go back to the slides to look at them.

The brain, quote-unquote, is the thing that takes in the state and the reward, takes a forward pass through the network, and produces the next action. The brain is where the neural network is contained, both for the training and the evaluation. The learning input can be controlled in width, forward length, and backward length.

lanesSide is the number of lanes to the side that you see, patchesAhead is the number of patches ahead that you see, and patchesBehind is the number of patches behind that you see. New this year, you can control the number of agents controlled by the neural network, anywhere from one to ten. And the evaluation is performed exactly the same way:

you have to achieve the highest average speed for the agents. The very critical thing here is that the agents are not aware of each other, so they're not jointly planning. The network is trained under the joint objective of achieving the best average speed for all of them, but the actions are taken in a greedy way by each.

It's very interesting what can be learned in this way, because these kinds of approaches are scalable to an arbitrary number of cars. And you can imagine us plopping down the best cars from this class together, the best neural networks, and having them compete in this way. Because they're fully greedy in their operation,

the number of networks that can operate concurrently is fully scalable. There are a lot of parameters: the temporal window, the layers, the many layer types that can be added. Here's a fully connected layer with ten neurons. The activation functions, all of these things, can be customized, as is specified in the tutorial.

The final layer is a fully connected regression layer with an output of five, giving the value of each of the five actions. And there are a lot of more specific parameters, some of which I've discussed: gamma, epsilon, the experience replay size, the learning rate, the temporal window; the optimizer, the learning rate, momentum, batch size, L2 and L1 decay for regularization, and so on.
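
To make those knobs concrete, here is a rough Python analogue of the kind of configuration the JavaScript code box exposes; the names and values below are illustrative placeholders, not the actual ConvNetJS starter code (which lives on the competition site and GitHub):

```python
# Illustrative analogue of the DeepTraffic code-box parameters (hypothetical values).
config = {
    # how much of the occupancy grid the agent sees
    "lanes_side": 2,           # lanes visible to each side
    "patches_ahead": 30,       # grid patches ahead of the car
    "patches_behind": 10,      # grid patches behind the car
    "temporal_window": 3,      # how many past frames are stacked into the input

    # architecture: fully connected layers ending in a 5-way regression output
    "hidden_layers": [{"neurons": 10, "activation": "relu"}],
    "num_actions": 5,          # no-op, accelerate, decelerate, lane left, lane right

    # reinforcement-learning parameters discussed above
    "gamma": 0.7,              # discount on future reward
    "epsilon_start": 1.0,      # exploration rate at the start of training
    "epsilon_min": 0.05,       # exploration rate at the end of training
    "experience_size": 3000,   # replay-memory capacity

    # optimizer parameters
    "learning_rate": 0.001,
    "momentum": 0.0,
    "batch_size": 64,
    "l2_decay": 0.01,
}

# the input size grows with the visible slice of road (and the temporal window)
num_inputs = (2 * config["lanes_side"] + 1) * (
    config["patches_ahead"] + config["patches_behind"])
```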

There's a big white button that says apply code. Pressing it kills all the work you've done up to that point, so be careful doing it; you should be doing it only at the very beginning, especially if you happen to leave your computer running, training for several days, as folks have done.

Then you press the blue training button, and it trains based on the parameters you specify. The network state gets shipped to the main simulation from time to time, so the thing you see in the browser, as you open up the website, is running the same network that's being trained.

And it regularly updates that network, so it's getting better and better. Even if the training takes weeks for you, it's constantly updating the network you see on the left. So if the car for the network that you're training is just standing in place and not moving, it's probably time to restart and change the parameters.

Maybe add a few layers to your network. The number of iterations is certainly an important parameter to control. And the evaluation is something we've done a lot of work on since last year, to reduce the degree of randomness and to remove the incentive to submit the same code over and over again

in the hope of producing a higher reward, a higher evaluation score. The method for evaluation is to collect the average speed over 10 runs of about 45 seconds of game each (not minutes, 45 simulated seconds). There are 500 of those, and we take the median speed over the 500 runs.

It's done server-side, so it's extremely difficult to cheat; I urge you to try. And you can try it locally: there's a start evaluation run, but that one doesn't count. That's just for you to feel better about your network. It should produce a result that's very similar to the one we'll produce on the server.

It's there to build your own intuition. And as I said, we significantly reduced the influence of randomness, so the score, the speed you get for the network you design, should be very similar with every evaluation. Loading and saving: if the network is huge and you want to switch computers, you can save the network.

It saves both the architecture of the network and the weights of the network, and you can load it back in. Obviously, when you load it in, it's not saving any of the training data you've already accumulated; you can't do transfer learning with JavaScript in the browser yet. For submitting your network, click submit model to competition.

And make sure you run training first; otherwise, the weights are initialized randomly and it will not do so well. You can resubmit as often as you like, and the highest score is what counts. The coolest part is that you can load your custom image, specify colors, and request the visualization.

We have not yet shown the visualization, but I promise you it's going to be awesome. Again, read the tutorial, change the parameters in the code box, click apply code, run training. Everybody in this room on the way home, on the train, hopefully not in your car, should be able to do this in the browser.

And then you can visualize: request the visualization. Because it's an expensive process, you have to want it for us to do it, since we have to run it server-side. The competition link is there, the GitHub starter code is there, and the details for those who truly want to win are in the arXiv paper.

So the question that will come up throughout is whether these reinforcement learning approaches are applicable at all, or rather whether action, planning, and control are amenable to learning. Certainly in the case of driving, we can't do what AlphaGo Zero did. We can't learn from scratch through self-play, because that would result in millions of crashes in order to learn to avoid the crashes.

Unless we're working, as we are with DeepCrash, on an RC car, or we're working in simulation. So we can look at expert data; we can look at driver data, which we have a lot of, and learn from it. It's an open question whether this is applicable. To date, and I bring up two companies because they're both guest speakers,

deep RL is not involved in the most successful robots operating in the real world. In the case of Boston Dynamics, most of the perception, control, and planning in a robot like this does not involve learning approaches, except for minimal additions on the perception side, to the best of our knowledge. And certainly the same is true of Waymo, as the speaker on Friday will talk about.

Deep learning is used a little bit in perception on top, but most of the work is done from the sensors with the optimization-based, model-based approaches: trajectory generation, and optimizing which trajectory is best to avoid collisions. Deep RL is not involved. And I come back again and again to the unexpected local pockets of higher reward, which arise in all of these situations when applied in the real world.

So for the cat video, which is pretty short, where the cats are ringing the bell and learning that the ring of the bell maps to food, I urge you to think about how that can evolve over time in unexpected ways that may not have a desirable effect, where the final reward is in the form of food and the intended behavior is to ring the bell.

That's where AI safety comes in. In the artificial general intelligence course in two weeks, that's something we'll explore extensively: how these reinforcement learning and planning algorithms will evolve in ways that are not expected, and how we can constrain them, how we can design reward functions that result in safe operation.

So I encourage you to come to the talk on Friday at 1pm (as a reminder, it's at 1pm, not 7pm) in Stata 32-123, and to the awesome talks in two weeks, from Boston Dynamics to Ray Kurzweil and so on, for AGI. Now tomorrow, we'll talk about computer vision and SegFuse.

Thank you everybody.