Back to Index

MIT 6.S094: Introduction to Deep Learning and Self-Driving Cars


Chapters

0:00 Intro
0:54 Administrative
3:43 Project: Deep Traffic
5:02 Project: DeepTesla
6:50 Defining Artificial Intelligence
10:14 How Hard is Driving?
10:59 Chess Pieces: Self-Driving Car Sensors
13:08 Chess Pieces: Self-Driving Car Tasks
14:48 DARPA Grand Challenge II (2006)
15:23 DARPA Urban Challenge (2007)
15:48 Industry Takes on the Challenge
16:30 How Hard is it to Pass the Turing Test?
20:54 Neuron: Biological Inspiration for Computation
22:54 Perceptron: Forward Pass
23:37 Perceptron Algorithm
25:03 Neural Networks are Amazing
26:15 Special Purpose Intelligence
27:21 General Purpose Intelligence
40:43 Deep Learning Breakthroughs: What Changed?
45:19 Useful Deep Learning Terms
48:58 Neural Networks: Proceed with Caution
49:50 Deep Learning is Representation Learning
51:33 Representation Matters
52:43 Deep Learning: Scalable Machine Learning
54:01 Applications: Object Classification in Images
55:55 Illumination Variability
56:32 Pose Variability and Occlusions
57:54 Pause: Object Recognition / Classification
59:41 Pause: Object Detection

Transcript

All right, hello everybody. Hopefully you can hear me well. Yes, yes, great. So welcome to course 6.S094, Deep Learning for Self-Driving Cars. We will introduce to you the methods of deep learning, of deep neural networks, using the guiding case study of building self-driving cars. My name is Lex Fridman.

You get to listen to me for the majority of these lectures, and I am part of an amazing team with some brilliant TAs, would you say? Brilliant? Dan, Dan Brown, you guys want to stand up? You're okay? They're in the front row. Spencer, William Angell, Spencer Dodd, and all the way in the back, the smartest and the tallest person I know, Benedikt Jenik.

So what you see there on the left of the slide is a visualization of one of the two projects, one of the two simulation games, that we'll get to go through. We use it as a way to teach you about deep reinforcement learning, but also as a way to excite you by challenging you to compete against others, if you wish, to win a special prize yet to be announced, a super secret prize.

So you can reach me and the TAs at deepcars@mit.edu if you have any questions about the tutorials, about the lecture, about anything at all. The website cars.mit.edu has the lecture content and code tutorials; again, the lecture slides for today are already up in PDF form. The slides themselves, if you want to see them, just email me, but they're over a gigabyte in size because they're very heavy in videos.

So I'm just posting the PDFs. And there will be lecture videos available a few days after the lecture is given. So speaking of which, there is a camera in the back. This is being videotaped and recorded, but for the most part, the camera is just on the speaker. So you shouldn't have to worry.

If that kind of thing worries you, then you could sit on the periphery of the classroom or maybe I suggest sunglasses and a fake mustache, that would be a good idea. There is a competition for the game that you see on the left. I'll describe exactly what's involved. In order to get credit for the course, you have to design a neural network that drives the car just above the speed limit of 65 miles an hour.

But if you want to win, you need to go a little faster than that. So who is this class for? You may be new to programming, new to machine learning, new to robotics, or you're an expert in those fields but want to go back to the basics. So what you will learn is an overview of deep reinforcement learning, convolutional neural networks, recurrent neural networks, and how these methods can help improve each of the components of autonomous driving.

Perception, visual perception, localization, mapping, control, planning, and the detection of driver state. Okay, two projects. Code name Deep Traffic is the first one. In this particular formulation of it, there are seven lanes. It's a top view. It looks like a game, but I assure you it's very serious.

The agent, the car in red, is being controlled by a neural network, and we'll explain how you can control and design the various aspects, the various parameters, of this neural network. And it learns in the browser. So we're using ConvNetJS, which is a JavaScript library written by Andrej Karpathy.

So amazingly, we live in a world where you can train a neural network in your browser in a matter of minutes. And we'll talk about how to do that. The reason we did this is so that there are very few requirements to get you up and started with neural networks.

So in order to complete this project for the course, you don't need any requirements except to have a Chrome browser. And to win the competition, you don't need anything except a Chrome browser. The second project, code name DeepTesla, or Tesla, uses data of the forward roadway from a Tesla vehicle and end-to-end learning: taking the image and putting it into a convolutional neural network that acts as a regressor, mapping directly to a steering angle.
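
To make that concrete, here is a minimal sketch, in Keras, of what such an end-to-end steering regressor could look like. The layer sizes, image dimensions, and training call are illustrative assumptions, not the actual DeepTesla code.

    # Minimal sketch of an end-to-end steering regressor: image in, steering angle out.
    # Layer sizes and the 66x200 image size are assumptions for illustration only.
    import tensorflow as tf
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Conv2D(24, 5, strides=2, activation='relu', input_shape=(66, 200, 3)),
        layers.Conv2D(36, 5, strides=2, activation='relu'),
        layers.Conv2D(48, 5, strides=2, activation='relu'),
        layers.Conv2D(64, 3, activation='relu'),
        layers.Flatten(),
        layers.Dense(100, activation='relu'),
        layers.Dense(1)                      # single output: the steering angle (regression)
    ])
    model.compile(optimizer='adam', loss='mse')

    # Training would pair each forward-roadway image with the steering angle
    # recorded from the car, e.g.:
    # model.fit(images, steering_angles, epochs=10, batch_size=64)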

So all it takes is a single image and it predicts a steering angle for the car. And we have data for the car itself and you get to build a neural network that tries to do better, tries to steer better or at least as good as the car. Okay, let's get started with the question with the thing that we understand so poorly at this time because it's so shrouded in mystery but it fascinates many of us.

And that's the question of what is intelligence? This is from a 1996, March 1996, Time Magazine. And the question, can machines think, is answered below with "they already do, so what if anything is special about the human mind?" It's a good question for 1996, a good question for 2016, 17 now and the future.

And there are two ways to ask that question. One is the special purpose version. Can an artificial intelligence system achieve a well-defined, specifically, formally defined, finite set of goals? And this little diagram is from a book that got me into artificial intelligence as a bright-eyed high school student, Artificial Intelligence: A Modern Approach.

This is a beautifully simple diagram of a system. It exists in an environment. It has a set of sensors that do the perception. It takes those sensors in, does something magical, there's a question mark there, and with a set of effectors, acts in the world, manipulates objects in that world.

And so special purpose, we can, under this formulation, as long as the environment is formally defined, well-defined, as long as a set of goals are well-defined, as long as a set of actions, sensors, and the ways that the perception carries itself out is well-defined, we have good algorithms of which we'll talk about that can optimize for those goals.

The question is, if we inch along this path, will we get closer to the general formulation, to the general purpose version of what artificial intelligence is? Can it achieve poorly defined, unconstrained set of goals with an unconstrained, poorly defined set of actions, and unconstrained, poorly defined utility functions, rewards?

This is what human life is about. This is what we do pretty well most days, exist in an undefined world, full of uncertainty. So, okay, we can separate tasks into three different categories. Formal tasks, this is the easiest. It doesn't seem so, it didn't seem so at the birth of artificial intelligence, but that's in fact true if you think about it.

The easiest is the formal tasks, playing board games, theorem proving, all the kind of mathematical logic problems that can be formally defined. Then there's the expert tasks. So this is where a lot of the exciting breakthroughs have been happening, where machine learning methods, data-driven methods, can help aid or improve on the performance of our human experts.

This means medical diagnosis, hardware design, scheduling. And then there is the thing that we take for granted, the trivial thing, the thing that we do so easily every day, when we wake up in the morning, the mundane tasks of everyday speech, of written language, of visual perception, of walking, which we'll talk about in today's lecture, is a fascinatingly difficult task.

And object manipulation. So the question that we're asking here, before we talk about deep learning, before we talk about the specific methods, is that we really want to dig in and try to see what it is about driving. How difficult is driving? Is it more like chess, which you see on the left there, where we can formally define a set of lanes, a set of actions, and formulate it as, you know, a set of five actions: you can change a lane, you can avoid obstacles, you can formally define an obstacle, you can formally define the rules of the road.

Or is there something about natural language, something similar to everyday conversation about driving, that requires a much higher degree of reasoning, of communication, of learning, of existing in this underactuated space? Is it a lot more than just left lane, right lane, speed up, slow down? So let's look at it as a chess game.

Here's the chess pieces. What are the sensors we get to work with on an autonomous vehicle? And we'll get a lot more in depth on this, especially with the guest speakers who built many of these. There are the range sensors, radar and lidar, that give you information about the obstacles in the environment, that help localize the obstacles in the environment.

There's the visible light camera, the stereo vision, that gives you texture information, that helps you figure out not just where the obstacles are, but what they are, helps to classify those, helps to understand their subtle movements. Then there is the information about the vehicle itself, about the trajectory and the movement of the vehicle, that comes from the GPS and IMU sensors.

And there is the state of, the rich state of the vehicle itself. What is it doing? What are all the individual systems doing? That comes from the CAN network. And there is one of the less studied, but fascinating to us on the research side, is audio, the sounds of the road.

Those provide the rich context of a wet road, the sound a road makes when it has stopped raining but is still wet. Screeching tires and honking, these are all fascinating signals as well. And the focus of the research in our group, the thing that's really very much under-investigated, is the internal facing sensors.

The driver, sensing the state of the driver. Where are they looking? Are they sleepy? The emotional state, are they in the seat at all? And the same with audio. That comes from the visual information and the audio information. More than that, here are the tasks, if you were to break the task of building a self-driving vehicle into modules.

First, you want to know where you are, where am I? Localization and mapping. You want to map the external environment, figure out where all the different obstacles are, all the entities are, and use that estimate of the environment to then figure out where I am, where the robot is.

Then there's scene understanding. It's understanding not just the positional aspects of the external environment and the dynamics of it, but also what those entities are. Is it a car? Is it a pedestrian? Is it a bird? There's movement planning. Once you have kind of figured out to the best of your abilities, your position and the position of other entities in this world, there's figuring out a trajectory through that world.

And finally, once you've figured out how to move about, safely and effectively through that world, it's figuring out what the human that's on board is doing. Because as I will talk about, the path to a self-driving vehicle, and that is hence our focus on Tesla, may go through semi-autonomous vehicles, where the vehicle must not only drive itself, but effectively hand over control from the car to the human and back.

Okay, quick history. Well, there's a lot of fun stuff from the 80s and 90s, but the big breakthroughs came in the second DARPA Grand Challenge with Stanford's Stanley, when it won the competition, one of five cars that finished. This was an incredible accomplishment. In a desert race, a fully autonomous vehicle was able to complete the race in record time.

The DARPA Urban Challenge in 2007, where the task was no longer a race through the desert, but through an urban environment. And CMU's Boss, with GM, won that race. And a lot of that work led directly into the acceptance, and into large, major industry players taking on the challenge, of building these vehicles.

Google, now Waymo, self-driving car. Tesla, with its Autopilot system and now Autopilot 2 system. Uber, with its testing in Pittsburgh. And there are many other companies, including nuTonomy, one of the speakers for this course, that are driving the wonderful streets of Boston. Okay, so let's take a step back.

If we think about the accomplishments in the DARPA challenges, and if we look at the accomplishments of the Google self-driving car, it essentially boils the world down into a chess game. It uses incredibly accurate sensors to build a three-dimensional map of the world, localize itself effectively in that world, and move about that world in a very well-defined way.

Now, the open question is, what if driving is more like a conversation, like a natural language conversation: how hard is it to pass the Turing test? The Turing test, in the popular current formulation, is: can a computer be mistaken for a human being more than 30% of the time?

When a human is talking behind a veil, having a conversation with either a computer or a human, they mistake the other side of that conversation for being a human when it's in fact a computer. And the way you would build a system that successfully passes the Turing test is, first, the natural language processing part, to enable it to communicate successfully.

So generate language and interpret language, then you represent knowledge, the state of the conversation, transferred over time. And the last piece, and this is the hard piece, is the automated reasoning. Is reasoning, can we teach machine learning methods to reason? That is something that will propagate through our discussion because, as I will talk about, the various methods, the various deep learning methods, neural networks, are good at learning from data.

But they're not yet, there's no good mechanism for reasoning. Now, reasoning could be just something that we tell ourselves we do to feel special, better, to feel like we're better than machines. Reasoning may be simply something as simple as learning from data. We just need a larger network. Or there could be a totally different mechanism required and we'll talk about the possibilities there.

Yes. Can you go back to the US for example? Okay, so we talked about the video, so which state is that? The top states of the US or other states? No, it's very difficult to find these kind of situations in the United States. So the question was, for this video, is it in the United States or not?

I believe it's in Tokyo. So India, a few European countries, are much more towards the direction of natural language versus chess. In the United States, generally speaking, we follow rules more concretely. The quality of roads is better, the marking on the roads is better, so there's less requirements there.

I'm not sure it's going to be Tokyo, because they drive on the left side, but India is going to the right side. So Japan is less likely to use the game. These cars are driving on the left side? No, but they drive on the right side of the road, just like in the US.

I see. I just, okay. Yeah, you're right, it is, because, yep. Yeah, so, but it's certainly not the United States. I spent quite a bit of time Googling trying to find this kind of footage in the United States, and it's difficult. So let's talk about the recent breakthroughs in machine learning, and what is at the core of those breakthroughs.

It's neural networks that have been around for a long time, and I will talk about what has changed, what are the cool new things. And what hasn't changed, and what are its possibilities. But first, a neuron, crudely, is a computational building block of the brain. I know there's a few folks here, neuroscience folks.

This is hardly a model. It is mostly an inspiration. And so, the human neuron has inspired the artificial neuron, the computational building block of a neural network, of an artificial neural network. Now to give you some context, these neurons, for both artificial and human brains, are interconnected. In the human brain, there's about, I believe, 10,000 outgoing connections from every neuron, on average.

And they're interconnected to each other. The largest current, as far as I'm aware, artificial neural network has 10 billion of those connections, synapses. Our human brain, to the best estimate, that I'm aware of, has 10,000 times that. So, 100 to 1,000 trillion synapses. Now what is an artificial neuron?

This building block of a neural network. It takes a set of inputs, it puts a weight on each of those inputs, sums them together, adds a bias value that sits on the neuron, and uses an activation function that takes as input that sum plus the bias and squashes it to produce a zero-to-one signal.

And this allows a single neuron to take a few inputs and produce an output, a classification, for example, a zero or a one. And as we'll talk about, simply, it can serve as a linear classifier. So it can draw a line, it can learn to draw a line, like what's seen here, between the blue dots and the yellow dots.

And that's exactly what we'll do in the IPython notebook that I'll talk about. But the basic algorithm is: you initialize the weights on the inputs, and you compute the output. You perform this operation I just talked about, sum up, compute the output. And if the output does not match the ground truth, the expected output, the output that it should produce, the weights are punished accordingly.
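
As a concrete illustration of that algorithm, here is a minimal perceptron sketch in numpy, with a tiny made-up dataset. It follows the update rule just described; the data and learning rate are arbitrary.

    # A minimal perceptron sketch: weighted sum plus bias, a hard threshold,
    # and the update rule described above. The dataset is made up and linearly separable.
    import numpy as np

    X = np.array([[0.1, 0.8], [0.3, 0.9], [0.8, 0.2], [0.9, 0.4]])  # inputs
    y = np.array([1, 1, 0, 0])                                      # ground-truth labels

    w = np.zeros(2)   # weights, one per input
    b = 0.0           # bias
    lr = 0.1          # learning rate

    for epoch in range(100):
        mistakes = 0
        for xi, target in zip(X, y):
            output = 1 if np.dot(w, xi) + b > 0 else 0   # forward pass
            if output != target:                         # wrong? adjust the weights
                w += lr * (target - output) * xi
                b += lr * (target - output)
                mistakes += 1
        if mistakes == 0:    # repeat until no more mistakes are made
            break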

And we'll talk through a little bit of the math of that. And this process is repeated until the perceptron does not make any more mistakes. Now here's the amazing thing about neural networks. There are several amazing things, and I'll talk about them. One, on the mathematical side, is the universality of neural networks.

With just a single hidden layer, if we stack neurons together, the inputs on the left, the outputs on the right, and in the middle a single hidden layer, it can closely approximate any function. Any function. So this is an incredible property: with a single layer, any function you can think of. And you can think of driving as a function.

It takes as input the world outside, and produces as output the control of the vehicle. There exists a neural network out there that can drive, perfectly. It's a fascinating mathematical fact. So we can think of these functions, then, as special purpose functions, special purpose intelligence. You can take, say, as input the number of bedrooms, the square feet, the type of neighborhood; those are the three inputs.

It passes those values through to the hidden layer, and then, one more step, it produces the final price estimate for the house, or for the residence. And we can teach a network to do this pretty well, in a supervised way. This is supervised learning. You provide a lot of examples, where you know the number of bedrooms, the square feet, the type of neighborhood, and then you also know the final price of the house, or the residence.
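
To make the house-price example concrete, here is a minimal sketch in Keras: three inputs, a single hidden layer, one output. The numbers and the network size are made up for illustration, and in practice you would normalize the inputs.

    # Minimal sketch of the house-price example: three inputs, one hidden layer,
    # one output. The data and layer sizes are made up for illustration.
    import numpy as np
    from tensorflow.keras import layers, models

    # [bedrooms, square feet, neighborhood type (encoded as a number)] -> price
    X = np.array([[3, 2000, 1], [2, 800, 0], [4, 2600, 2]], dtype=float)
    y = np.array([500000, 250000, 800000], dtype=float)

    model = models.Sequential([
        layers.Dense(8, activation='relu', input_shape=(3,)),  # the single hidden layer
        layers.Dense(1)                                        # the price estimate
    ])
    model.compile(optimizer='adam', loss='mse')
    model.fit(X, y, epochs=10, verbose=0)   # supervised learning from labeled examples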

And then you can, as I'll talk about, through a process of backpropagation, teach these networks to make this prediction pretty well. Now, some of the exciting breakthroughs recently have been in general purpose intelligence. This is from Andrej Karpathy, who is now at OpenAI.

I would like to take a moment here to try to explain how amazing this is. This is a game of Pong. If you're not familiar with Pong, there's two paddles, and you're trying to bounce the ball back in such a way that prevents the other guy from bouncing the ball back at you.

The artificial intelligence agent is on the right in green, and up top is the score, eight to one. Now this network takes about three days to train on a regular computer. What is this network doing? It's called the policy network. The input is the raw pixels.

They're slightly processed, and also you take the difference between two frames, but it's basically the raw pixel information. That's the input. There are a few hidden layers, and the output is a single probability of moving up. That's it. That's the whole system.

And what it's doing is, it learns, even though you don't know, at any one moment, what the right thing to do is. Is it to move up? Is it to move down? You only know what the right thing to do is by the fact that eventually you win or lose the game.

So this is the amazing thing here: there's no supervised learning, there's no universal fact about any one state being good or bad, or any one action being good or bad in any state. But you punish or reward every single action you took, every single action you took for the entire game, based on the result.

So no matter what you did, if you won the game, the end justifies the means. If you won the game, every action you took, every state-action pair, gets rewarded. If you lost the game, it gets punished. And with this process, with only 200,000 games, where the system just simulates the games, it can learn to beat the computer.

This system knows nothing about Pong, nothing about games. This is general intelligence. Except for the fact that it's just a game of Pong. And I will talk about how this can be extended further, why this is so promising, and also why we should proceed with caution. So again, there's a set of actions you take, up, down, up, down, based on the output of the network.

There's a threshold: given the probability of moving up, you move up or down based on the output of the network. And you have a set of states. And every single state-action pair is rewarded if there's a win, and it's punished if there's a loss. When you go home, think about how amazing that is.
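
Here is a stripped-down sketch, in numpy, of that policy-gradient idea. Instead of Pong it uses a trivial stand-in game so the example is self-contained; the point is the sigmoid policy, sampling actions from it, and rewarding or punishing every state-action pair of an episode with the final outcome.

    # A stripped-down policy-gradient sketch. The "game" here is a made-up stand-in
    # for Pong; the policy network, the sampled actions, and the +1/-1 episode reward
    # applied to every state-action pair are the point of the example.
    import numpy as np

    rng = np.random.default_rng(0)
    n_inputs, n_hidden = 4, 8
    W1 = rng.normal(0, 0.1, (n_hidden, n_inputs))
    W2 = rng.normal(0, 0.1, n_hidden)

    def policy(x):
        h = np.maximum(0, W1 @ x)            # hidden layer (ReLU)
        p_up = 1 / (1 + np.exp(-(W2 @ h)))   # probability of taking the "up" action
        return p_up, h

    def play_episode():
        # Stand-in game: the episode is "won" if most actions match the sign of x[0].
        states, hiddens, acts = [], [], []
        score = 0
        for _ in range(20):
            x = rng.normal(size=n_inputs)
            p_up, h = policy(x)
            a = 1 if rng.random() < p_up else 0   # sample an action from the policy
            score += 1 if a == (x[0] > 0) else -1
            states.append(x); hiddens.append(h); acts.append((a, p_up))
        return states, hiddens, acts, (1.0 if score > 0 else -1.0)

    lr = 0.01
    for episode in range(2000):
        states, hiddens, acts, reward = play_episode()
        # Every state-action pair in the episode is rewarded or punished by the outcome.
        for x, h, (a, p_up) in zip(states, hiddens, acts):
            dlogit = a - p_up                                  # d log pi(a|x) / d logit
            grad_W2 = dlogit * h
            grad_W1 = np.outer(dlogit * W2 * (h > 0), x)
            W2 = W2 + lr * reward * grad_W2
            W1 = W1 + lr * reward * grad_W1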

And if you don't understand why that's amazing, spend some time on it. It's incredible. Sure, sure thing. The question was: what is supervised learning, what is unsupervised learning, what's the difference? So supervised learning, when people talk about machine learning, they mean supervised learning most of the time. Supervised learning is learning from data.

It's learning from example. When you have a set of inputs and a set of outputs, that you know are correct, what are called ground truth. So you need those examples, a large amount of them, to train any of the machine learning algorithms, to learn to then generalize that to future examples.

This is, actually there's a third one called reinforcement learning, where the ground truth is sparse. The information about, when something is good or not, the ground truth only happens every once in a while, at the end of the game, not every single frame. And unsupervised learning is when you have no information, about the outputs, that are correct or incorrect.

And the excitement of the deep learning community is unsupervised learning. But it has achieved no major breakthroughs at this point. I'll talk about what the future of deep learning is, and a lot of the people that are working in the field are excited by it.

But right now, any interesting accomplishment, has to do with supervised learning. And the brown one is just a heuristic solution, like look at the velocity. So basically the reinforcement learning here, is learning from somebody who has certain rules. And how can that be guaranteed, that it would generalize to somebody else?

So the question was, the green paddle learns to play this game successfully against this specific brown paddle, operating under specific kinds of rules. How do we know it can generalize to other games, other things? And it can't. But the mechanism by which it learns generalizes. So as long as you let it play in whatever world you want it to succeed in, long enough, it will use the same approach to learn to succeed in that world.

The problem is, this works for worlds you can simulate well. Unfortunately, one of the big challenges of neural networks, is that they're not currently efficient learners. We need a lot of data to learn anything. Human beings need one example, oftentimes, and they learn very efficiently from that one example.

And again, I'll talk about that as well. It's a good question. So the drawbacks of neural networks. So if you think about the way a human being would approach this game, this game of Pong, they would only need a simple set of instructions. You're in control of a paddle, and you can move it up and down.

And your task is to bounce the ball past the other player, controlled by AI. Now, a human being would immediately, they may not win the game, but they would immediately understand the game, and would be able to play it well enough to pretty quickly learn to beat the game.

But they need to have a concept of control. What it means to control a paddle. They need to have a concept of a paddle. They need to have a concept of moving up and down, and a ball, and bouncing. They have to know, they have to have at least a loose concept of real-world physics, that they can then project that real-world physics onto the two-dimensional world.

All of these concepts are concepts that you come to the table with. That's knowledge. And the way you transfer that knowledge from your previous experience, from childhood to now, when you come to this game, that is something that is called reasoning. Whatever reasoning means. And the question is whether, through this same kind of process, you can see the entire world as a game of Pong.

And reasoning is simply the ability to simulate that game in your mind and learn very efficiently, much more efficiently than 200,000 iterations. The other challenge of deep neural networks, and machine learning broadly, is that you need big data, because they're not efficient learners, as I said. That data also needs to be supervised data.

You need to have ground truth, which is very costly. So annotation, a human being looking at a particular image, for example, and labeling it as something, as a cat or a dog, whatever object is in the image, that's very costly. And particularly for neural networks, there are a lot of parameters to tune.

There's a lot of hyperparameters. You need to figure out the network structure first. How does this network look? How many layers? How many hidden nodes? What type of activation function in each node? There are a lot of hyperparameters there. And then once you've built your network, there are parameters for how you teach that network.

There's the learning rate, the loss function, the mini-batch size, the number of training iterations, gradient update smoothing, and even selecting the optimizer with which you solve the optimization problem involved. It's a topic of many research papers, certainly. It's rich enough for research papers, but it's also really challenging. It means that you can't just plop a network down and expect it to solve the problem generally.
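
For concreteness, here is how some of those hyperparameters show up in a typical Keras training setup. The specific values are arbitrary; choosing them well is exactly the hard part being described.

    # The hyperparameters mentioned above, made concrete in a typical Keras setup.
    # The values and the architecture are arbitrary placeholders.
    from tensorflow.keras import layers, models, optimizers

    model = models.Sequential([
        layers.Dense(64, activation='relu', input_shape=(32,)),  # layers, hidden nodes,
        layers.Dense(64, activation='relu'),                     # activation functions
        layers.Dense(10, activation='softmax')
    ])

    model.compile(
        optimizer=optimizers.SGD(learning_rate=0.01, momentum=0.9),  # optimizer, learning rate, smoothing
        loss='sparse_categorical_crossentropy',                      # loss function
        metrics=['accuracy'])

    # Mini-batch size and number of training iterations are set at fit time, e.g.:
    # model.fit(x_train, y_train, batch_size=128, epochs=20)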

And defining a good loss function, or in the case of Pong or games, a good reward function, is difficult. So here's a game. This is a recent result from OpenAI, teaching a network to play the game of Coast Runners. And the goal of Coast Runners is, you're in a boat, and the task is to go around a track and successfully complete a race against the others you're racing against.

Now, this network is an optimal one, in a sense. What it's figured out is that actually, in the game, it gets a lot of points for collecting certain objects along the path. So what you see is that it's figured out to go in a circle and collect those green turbo things. And what it's figured out is that you don't need to complete the race to earn the reward.

And despite being on fire and hitting the wall and going through this whole process, it has actually achieved at least a local optimum given the reward function of maximizing the number of points. And so it's figured out a way to earn a higher reward while ignoring the implied bigger picture goal of finishing the race, which we as humans understand much better.

This raises ethical questions for self-driving cars. Besides other questions: you can watch this for hours, and it will do that for hours. And that's the point, it's hard to teach, it's hard to encode, the formally defined utility function under which an intelligent system needs to operate. And that's made obvious even in a simple game.

And so what is, yep, question. So the question was, what's an example of a local optimum that an autonomous car, so similar to the coast race, so what would be the example in the real world for an autonomous vehicle? And it's a touchy subject, but it would certainly have to be involved with the choices we make under near crashes and crashes.

The choices a car makes when to avoid, for example, if there's a crash imminent and there's no way you can stop to prevent the crash, do you keep the driver safe or do you keep the other people safe? And there has to be some, even if you don't choose to acknowledge it, even if it's only in the data and the learning that you do, there's an implied reward function there.

And we need to be aware of what that reward function is, because it may find something. Until we actually see it, we won't know it. Once we see it, we'll realize, "Oh, that was a bad design." And that's the scary thing. It's hard to know ahead of time what that is.

So the recent breakthroughs from deep learning came from several factors. First is the compute. Moore's law. CPUs are getting faster, 100 times faster every decade. Then there are GPUs. The ability to train neural networks on GPUs, and now ASICs, has created a lot of capabilities in terms of energy efficiency and being able to train larger networks more efficiently.

Then there's larger, well, first of all, in the 21st century, there's digitized data. There are larger datasets of digital data. And now that data is becoming more organized, not just vaguely available data out there on the internet. It's actual organized datasets like ImageNet. Certainly for natural language, there are large datasets.

There are the algorithmic innovations. Backprop, backpropagation, convolutional neural networks, LSTMs, all these different architectures for dealing with specific types of domains and tasks. Then there's the huge one, infrastructure, on the software and the hardware side. There's Git, the ability to share software in an open source way. There are pieces of software that make robotics and machine learning easier.

ROS, TensorFlow. There's Amazon Mechanical Turk, which allows for efficient, cheap annotation of large scale data sets. There's AWS in the cloud hosting machine learning, hosting the data and the compute. And then there's a financial backing of large companies, Google, Facebook, Amazon. But really, nothing has changed. There really has not been any significant breakthroughs.

We're using convolutional neural networks that have been around since the 90s. Neural networks have been around since the 60s. There have been a few improvements. But that's in terms of methodology. The compute has really been the workhorse. The ability to get the hundredfold improvement every decade holds promise.

And the question is whether that reasoning thing I talked about is all you need is a larger network. That is the open question. So some terms for deep learning. First of all, deep learning is a PR term for neural networks. It is a term for deep neural networks, for neural networks that have many layers.

It is a symbolic term for the newly gained capabilities that compute has brought us, that training on GPUs has brought us. So deep learning is a subset of machine learning. There are many other methods that are still effective. The terms that will come up in this class are, first of all, multi-layer perceptron, deep neural networks, recurrent neural networks, LSTM, long short-term memory networks, CNN or ConvNets, convolutional neural networks, deep belief networks.

And the operations that will come up are convolution, pooling, activation functions, and backpropagation. Yep, cool question. So the question was, what is the purpose of the different layers in a neural network? What does it mean to have one configuration versus another? So with a neural network having several layers, the only thing you have an understanding of is the inputs and the outputs.

You don't have a good understanding about what each layer does. They're mysterious things, neural networks. So I'll talk about how with every layer it forms a higher level, a higher order representation of the input. So it's not like the first layer does localization, the second layer does path planning, the third layer does navigation, how you get from here to Florida.

Or maybe it does, but we don't know. So we know, we're beginning to visualize neural networks for simple tasks, like for ImageNet, classifying cats versus dogs. We can tell what is the thing that the first layer does, the second layer, the third layer, and we'll look at that. But for driving, where the input is just the images and the output is the steering, it's still unclear what is learned.

Partially because we don't have neural networks that drive successfully yet. Do neural networks have fixed layers, or do they eventually generate them on their own over time? So the question was, does a neural network generate layers over time? Like, does it grow? That's one of the challenges: a neural network is predefined.

The architectures, the number of nodes, number of layers, that's all fixed. Unlike the human brain where neurons die and are born all the time. Neural network is pre-specified, that's it, that's all you get. And if you want to change that, you have to change that and then retrain everything.

So it's fixed. So what I encourage you to do is proceed with caution, because there's this feeling when you first teach a network, with very little effort, how to do some amazing task, like classifying a face versus a non-face, or your face versus other faces, or cats versus dogs. It's an incredible feeling.

And then there's definitely this feeling that I'm an expert. But what you realize is, you don't actually understand how it works. And getting it to perform well for more generalized tasks, for larger scale datasets, for more useful applications, requires a lot of hyperparameter tuning. Figuring out how to tweak little things here and there.

And still in the end, you don't understand why it works so damn well. So deep learning, these deep neural network architectures is representation learning. This is the difference between traditional machine learning methods. Where, for example, for the task of having an image here as the input, the input to the network here is on the bottom, the output is up at top.

And the input is a single image of a person in this case. So the input, specifically, is all of the pixels in that image, RGB, the different colors of the pixels in the image. And layer by layer, what a network does is build a multi-resolutional representation of this data.

The first layer learns the concept of edges, for example. The second layer starts to learn composition of those edges, corners, contours. Then it starts to learn about object parts. And finally, actually provide a label for the entities that are in the input. And this is the difference between traditional machine learning methods.

Where the concepts like edges and corners and contours are manually pre-specified by human beings, human experts for the particular domain. And representation matters, because figuring out a separating line in the Cartesian coordinates of this particular dataset, where you want to design a machine learning system that tells the difference between green triangles and blue circles, is difficult.

There's no line that separates them cleanly. And if you were to ask a human being, a human expert in the field, to try to draw that line, they would probably do a PhD on it and still not succeed. But a neural network can automatically figure out to remap that input into polar coordinates.

Where the representation is such that it's an easily linearly separable dataset. And so deep learning is a subset of representation learning, which is a subset of machine learning, and a key subset of artificial intelligence. Now, this comes from its ability to compute an arbitrary number of features at the core of the representation.

So, if you were trying to detect a cat in an image, you're not specifying 215 specific features of cat ears and whiskers and so on that a human expert would specify. You allow a neural network to discover tens of thousands of such features. Maybe for cats you are an expert, but for a lot of objects you may never be able to sufficiently provide the features which would successfully be used for identifying the object.

And so this kind of representation learning, one is easy in the sense that all you have to provide is inputs and outputs. All you need to provide is a dataset that you care about without hand engineering features. And two, because of its ability to construct arbitrarily sized representations, deep neural networks are hungry for data.

The more data we give them, the more they're able to learn about this particular dataset. So let's look at some applications. First, some cool things that deep neural networks have been able to accomplish up to this point. Let me go through them. First, the basic one. AlexNet. ImageNet is a famous dataset.

It's a competition of classification and localization, where the task is, given an image, identify the five most likely things in that image and which is the most likely, and you have to do so correctly. So on the right, there's an image of a leopard, and you have to correctly classify that it is in fact a leopard.

So they're able to do this pretty well. Given a specific image, determine that it's a leopard. What's shown here on the x-axis is years, and on the y-axis is classification error. Starting from 2012 on the left with AlexNet, the error has decreased from 16%, and 40% before then with traditional methods, to below 4% today.

So human level performance: if I were to give you this picture of a leopard, for about 4% of those pictures of leopards you would not say it's a leopard. That's human level performance. So for the first time, in 2015, convolutional neural networks outperformed human beings. That in itself is incredible.

That's something that seemed impossible and now is because it's done, it's not as impressive. But I just want to get to why that's so impressive because computer vision is hard. We as human beings have evolved visual perception over millions of years, hundreds of millions of years. So we take it for granted but computer vision is really hard.

Visual perception is really hard. There is illumination variability. So it's the same object. The only way we tell anything is from the shade, the reflection of light from that surface. It could be the same object with drastically, in terms of pixels, drastically different looking shapes and we still know it's the same object.

There is pose variability and occlusions. Probably my favorite caption for an image, for a figure in an academic paper, is "deformable and truncated cat". Cats are famously deformable. They can take a lot of different shapes; arbitrary poses are possible. So computer vision should know it's still the same object, still the same class of objects, given all the variability in the pose.

And occlusions is a huge problem. We still know it's an object. We still know it's a cat even when parts of it are not visible, and sometimes large parts of it are not visible. And then there's all the inter-class variability. All of these on the top two rows are cats.

Many of them look drastically different, and the bottom two rows are dogs, which also look drastically different. And yet some of the dogs look like cats, some of the cats look like dogs, and we as human beings are pretty good at telling the difference, and we want computer vision to do better than that.

It's hard. So how is this done? This is done with convolutional neural networks, the input to which is a raw image. Here's an input on the left of a number three, and, as I'll talk about, that image is processed and passed through convolutional layers, which maintain spatial information. The output, in this case, predicts what number is shown in the image, 0, 1, 2, through 9.
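
Here is a minimal sketch, in Keras, of a convolutional network of that kind: a raw digit image in, a probability for each of the ten digits out. The exact architecture is illustrative, not the one on the slide.

    # A minimal convolutional network: raw image in, digit 0-9 out.
    # Architecture and training settings are illustrative placeholders.
    import tensorflow as tf
    from tensorflow.keras import layers, models

    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    x_train = x_train[..., None] / 255.0   # 28x28 grayscale images, scaled to [0, 1]

    model = models.Sequential([
        layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),  # conv layers
        layers.MaxPooling2D(),                                             # preserve spatial structure
        layers.Conv2D(64, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(10, activation='softmax')   # a probability for each digit 0 through 9
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=1, batch_size=128)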

And so these networks, everybody is using the same kind of network to determine exactly that. Input is an image, output is a number, or, in the case of the leopard, the probability that it's a leopard. Then there's segmentation, built on top of these convolutional neural networks, where you chop off the end and convolutionalize the network.

You chop off the end so that the output is a heat map. So instead of a detector for a cat, you can have a cat heat map, where the neurons on that output get excited spatially in the parts of the image that contain a tabby cat.

And this kind of process can be used to segment the image into different objects. A horse, so the original input on the left is a woman on a horse and the output is a fully segmented image of knowing where's the woman, where's the horse. And this kind of process can be used for object detection which is the task of detecting an object in an image.

Now the traditional method with convolutional neural networks, and in general in computer vision, is the sliding window approach. We have a detector, like the leopard detector, that you slide through the image to find where in that image there is a leopard. The segmentation approach, the R-CNN approach, efficiently segments the image in such a way that it can propose different parts of the image that are likely to contain a leopard, or in this case a cowboy.
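
The sliding-window idea, in its simplest form, looks something like the sketch below; the classifier here is a stub standing in for a trained detector.

    # The sliding-window idea in its simplest form: run a classifier at every window
    # position and keep the locations where it fires. classify_window is a stub.
    import numpy as np

    def classify_window(patch):
        # Stand-in for a trained classifier (e.g. a leopard detector).
        return float(patch.mean() > 0.8)   # made-up score for illustration

    def sliding_window_detect(image, window=64, stride=16, threshold=0.5):
        detections = []
        h, w = image.shape[:2]
        for top in range(0, h - window + 1, stride):
            for left in range(0, w - window + 1, stride):
                patch = image[top:top + window, left:left + window]
                score = classify_window(patch)
                if score > threshold:
                    detections.append((top, left, score))
        return detections

    image = np.random.rand(256, 256)        # placeholder for a real image
    print(sliding_window_detect(image))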

And that proposal-based approach drastically reduces the computational requirements of the object detection task. Currently, one of the best networks for the ImageNet localization task is the deep residual network. They're deep. VGG19, VGGNet, is one of the famous ones. You're starting to get above 20 layers in many cases.

34 layers is the ResNet one. So the lesson there is the deeper you go, the more representation power you have, the higher accuracy. But you need more data. Other applications, colorization of images. So this again, input is a single image and output is a single image. So you can take a black and white video from a film, from an old film and recolor it.

And all you need to do to train that network in a supervised way is provide modern films and convert them to grayscale. So now you have arbitrarily sized datasets of grayscale-to-color pairs. And you're able to, with very little effort on top of that, successfully, well, somewhat successfully, recolor images.
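
Creating those supervised training pairs is straightforward; a minimal numpy sketch, assuming an array of color frames, might look like this.

    # Creating supervised training pairs for colorization, as described: take color
    # frames, convert them to grayscale, and learn the grayscale-to-color mapping.
    import numpy as np

    def make_colorization_pairs(color_frames):
        # color_frames: array of shape (N, H, W, 3), values in [0, 1].
        # Standard luminance weights for RGB -> grayscale.
        gray = (0.299 * color_frames[..., 0]
                + 0.587 * color_frames[..., 1]
                + 0.114 * color_frames[..., 2])
        inputs = gray[..., None]      # network input: single-channel image
        targets = color_frames        # network target: the original color image
        return inputs, targets

    color_frames = np.random.rand(8, 64, 64, 3)   # placeholder for real film frames
    x, y = make_colorization_pairs(color_frames)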

Again, Google Translate does image translation in this way. Image to image. It first perceives, here in German I believe, famous German, correct me if I'm wrong, dark chocolate written in German on a box. So this can take this image, detect the different letters, convert them to text, translate the text, and then using the image to image mapping, map the letters, the translated letters back onto the box.

And you can do this in real time on video. So what we've talked about up to this point, on the left are vanilla neural networks, convolutional neural networks that map a single input to a single output, a single image to a number, single image to another image. Then there is recurrent neural networks that map, this is the more general formulation, that map a sequence of images or a sequence of words or a sequence of any kind to another sequence.

And these networks are able to do incredible things with natural language, with video, and with any kind of time series data. For example, we can convert typed text to handwritten text. Here we type in, and you can do this online, "deep learning for self-driving cars".

And it will use an arbitrary handwriting style to generate the words "deep learning for self-driving cars". This is done using recurrent neural networks. We can also take char-RNNs, as they're called, character-level recurrent neural networks, which train on an arbitrary text dataset and learn to generate text one character at a time.

So there is no preconceived syntactic or semantic structure provided to the network. It learns that structure. So, for example, you can train it on Wikipedia articles, like in this case, and it's able not only to generate text that makes some kind of grammatical sense, at least, but also to keep perfect syntactic structure for Wikipedia, for Markdown editing, for LaTeX editing, and so on.
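
A minimal character-level model of that kind might look like the following Keras sketch; the corpus and the layer sizes here are placeholders, not the Wikipedia setup from the slide.

    # A minimal character-level RNN: it sees text one character at a time and learns
    # to predict the next character. The corpus and sizes are placeholders.
    import numpy as np
    from tensorflow.keras import layers, models

    text = "deep learning for self-driving cars " * 50     # tiny placeholder corpus
    chars = sorted(set(text))
    char_to_idx = {c: i for i, c in enumerate(chars)}
    seq_len = 20

    # Build (sequence of characters) -> (next character) training examples.
    X = np.array([[char_to_idx[c] for c in text[i:i + seq_len]]
                  for i in range(len(text) - seq_len)])
    y = np.array([char_to_idx[text[i + seq_len]] for i in range(len(text) - seq_len)])

    model = models.Sequential([
        layers.Embedding(len(chars), 16),
        layers.LSTM(64),
        layers.Dense(len(chars), activation='softmax')   # distribution over the next character
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
    model.fit(X, y, epochs=1, batch_size=64)
    # Generation then works by sampling a character, appending it, and repeating.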

This text says, "Naturalism and decision for the majority of Arab countries, capitalized, whatever that means, was grounded by the Irish language by John Clare, and so on. These are sentences, if you didn't know better, that might sound correct. And it does so, let me pause, one character at a time.

So, these aren't words being generated. This is one character at a time. You start with the beginning three letters, "Nat", and it generates the "u" completely without knowing of the word "Naturalism". This is incredible. You can do this to start a sentence and let the neural network complete that sentence. So, for example, if you start the sentence with "Life is" or "Life is about", actually, it will complete it with a lot of fun things.

The weather, "Life is about kids", "Life is about the true love of Mr. Mom", "Life is about the truth now", and this is from Geoffrey Hinton, the last two. If you start with the meaning of life, it can complete that with the meaning of life is literary recognition, maybe true for some of us here.

Publish or perish. And the meaning of life is the tradition of ancient human reproduction. Also true for some of us here, I'm sure. Okay, so what else can you do? Something that has been very exciting recently is image caption recognition. No, generation, I'm sorry. Image caption generation is important for large datasets of images where we want to be able to determine what's going on inside those images, especially for search.

If you want to find a man sitting on a couch with a dog, you type it into Google and it's able to find that. So here shown in black text, a man sitting on a couch with a dog is generated by the system. A man sitting on a chair with a dog in his lap is generated by a human observer.

And again, these annotations are done by detecting the different obstacles, the different objects, in the scene. So segmenting the scene, detecting that on the right there's a woman, a crowd, a cat, a camera, holding, purple, all of these words are being detected. Then syntactically correct sentences are generated, a lot of them, and then you rank which sentence is the most likely.

And in this way you can generate very accurate labeling of the images, captions for the images. And you can do the same kind of process for image question answering. You can ask how many, so quantity, how many chairs are there? You can ask about location, where are the right bananas?

You can ask about the type of object, what is the object on the chair? It's a pillow. And these are again using the recurrent neural networks. You can do the same thing with video caption generation, video caption description generation. So looking at a sequence of images as opposed to just a single image.

What is the action going on in this situation? This is a difficult task, and there's a lot of work in this area. Now, on the left are correct descriptions: a man is doing stunts on his bike; a herd of zebras is walking in a field. And on the right, there's a small bus running into a building.

You know, it's talking about relevant entities but just producing an incorrect description: "A man is cutting a piece of a pair of a paper." So the words are correct, perhaps, but you're close, no cigar.

So one of the interesting things you can do with recurrent neural networks, if you think about the way we look at images, the way human beings look at images, is that we only have a small fovea with which we focus on a part of the scene. So right now your periphery is very distorted.

The only thing, if you're looking at the slides, or you're looking at me, that's the only thing that's in focus. Majority of everything else is out of focus. So we can use the same kind of concept to try to teach a neural network to steer around the image, both for perception and generation of those images.

This is important first on the general artificial intelligence point of it being just fascinating that we can selectively steer our attention. But also it's important for things like drones that have to fly at high speeds in an environment where at 300 plus frames a second you have to make decisions.

So you can't possibly localize yourself or perceive the world around yourself successfully if you have to interpret the entire scene. So what you can do is you can steer, for example, here shown is reading house numbers by steering around an image. You could do the same task for reading and for writing.

So reading numbers here, on the MNIST dataset on the left. We can also selectively steer a network around an image to generate that image, starting with a blurred image first and then getting higher and higher resolution as the steering goes on. Work here at MIT is able to map video to audio.

So you hit stuff with a drumstick in a silent video, and it's able to generate the sound that a drumstick hitting that particular object makes. So you can get texture information from that impact. So here is a video of a human soccer player playing soccer and a state-of-the-art machine playing soccer. And well, let me give it some time to build up.

(Laughter) Okay, so soccer, this is, we take this for granted but walking is hard. Object manipulation is hard. Soccer is harder than chess for us to do, much harder. On your phone now, you can have a chess engine that beats the best players in the world. And you have to internalize that because the question is, this is a painful video, the question is, where does driving fall?

Is it closer to chess or is it closer to soccer? For those incredible brilliant engineers that worked on the most recent DARPA challenge, this would be a very painful video to watch, I apologize. This is a video from the DARPA challenge of robots struggling with the basic object manipulation and walking tasks.

So it's mostly a fully autonomous navigation task. (Laughter) Maybe I'll just let this play for a few moments to let it internalize how difficult this task is. Of balancing, of planning in an under-actuated way where you don't have full control of everything. When there is a delta between your perception of what you think the world is and what the reality is.

So there, a robot was trying to turn an object that wasn't there. And this is an MIT entry that actually, I believe, successfully got points for this, because it got into that area. (Laughter) But as a lot of the teams talked about, the hardest part, so one of the things the robot had to do is get into a car and drive it and get out of the car.

And there are a few other manipulation tasks: it had to walk on unsteady ground, it had to drill a hole through a wall, all of these tasks. And what a lot of teams said is that the hardest part, the hardest task of all of them, is getting out of the car. So it's not getting into the car, it's this very task that you saw just now, a robot getting out of the car.

These are things we take for granted. So in our evaluation of what is difficult about driving, we have to remember that some of those things we may take for granted, in the same kind of way that we take walking for granted. This is Moravec's paradox, from Hans Moravec of CMU.

Let me just quickly read that quote. "Encoded in the large, highly evolved sensory and motor portions of the human brain is billions of years of experience about the nature of the world and how to survive in it." So this is data, this is big data, billions of years. Abstract thought, which is reasoning, the stuff we think of as intelligence, is perhaps less than 100,000 years old.

We haven't yet mastered it and so, sorry I'm inserting my own statements in the middle of a quote but, it's been very recent that we've learned how to think and so we respect it perhaps more than the things we take for granted like walking and visual perception and so on.

But those may be strictly a matter of data, of data and training time and network size. So walking is hard. The question is, how hard is driving? And that's an important question, because the margin of error is small. There's one fatality per 100 million miles; that's what the number of people who die in car crashes every year works out to.

One fatality per 100 million miles. That's a 0.000001% margin of error. Through all the time you spend on the road, that is the error you get. We're impressed with ImageNet being able to classify a leopard, a cat, or a dog at close to, at above, human level performance.

But this is the margin of error we get with driving. And we have to be able to deal with snow, with heavy rain, with big open parking lots, with parking garages, with pedestrians that behave irresponsibly, as rarely as that happens, or just unpredictably, again, especially in Boston. Reflections especially, this is one of the things you don't think about, the lighting variations that blind the cameras.

The question was whether that number changes if you look at just crashes, so crashes rather than fatalities. Yeah, so one of the big things is that cars have gotten really good at crashing and not hurting anybody. So the number of crashes is much, much larger than the number of fatalities, which is a great thing.

We've built safer cars. But still, you know, even one fatality is too many. So the Google self-driving car team is quite open about their performance since hitting public roads. This is from a report that shows the number of times the driver disengages: the car gives up control, it asks the driver to take control back, or the driver takes control back by force.

Meaning that they're unhappy with the decision that the car was making, or it was putting the car or other pedestrians or other cars in unsafe situations. And so you see over time that, from 2014 to 2015, there's been a total of 341 times on beautiful San Francisco roads.

And I say that seriously because the weather conditions are great there. 341 times that the driver had to elect to take control back. So it's a work in progress. Let me give you something to think about here. This with neural networks is a big open question. The question of robustness.

So this is an amazing paper, I encourage people to read it. There's a couple of papers around this topic. Deep neural networks are easily fooled. So here are eight images where if given to a neural network as input, a convolutional neural network as input, the network with higher than 99.6% confidence says that the image, for example, in the top left is a robin, next to it is a cheetah, then an armadillo, a panda, an electric guitar, a baseball, a starfish, a king penguin.

All of these things are obviously not in the images. So networks can be fooled with noise. More importantly, more practically for the real world, adding just a little bit of distortion, a little bit of noise distortion to the image can force the network to produce a totally wrong prediction.

So here's an example. There are three columns: the correctly classified image, the slight addition of distortion, and the resulting prediction of an ostrich, for all three images on the left and for all three images on the right. This ability to fool networks easily brings up an important point.

And that point is that there has been a lot of excitement about neural networks throughout their history, and a lot of excitement about artificial intelligence throughout its history. And failing to ground that excitement in reality, in the real challenges that remain, has resulted in crashes, in AI winters, when funding dried up and people lost hope in the possibilities of artificial intelligence.

So here's a 1958 New York Times article from the day the Navy revealed "the embryo of an electronic computer." This is when the perceptron that I talked about was first implemented in hardware by Frank Rosenblatt. It took a 400-pixel image as input and produced a single output; the weights were encoded in potentiometers, and they were updated with electric motors.

The New York Times wrote: "The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." Dr. Frank Rosenblatt, a research psychologist at the Cornell Aeronautical Laboratory, Buffalo, said perceptrons might be fired to the planets as mechanical space explorers.

This might seem ridiculous, but this was the general opinion of the time. And as we know now, a perceptron cannot even separate data that isn't linearly separable, something as simple as the XOR function; it's just a linear classifier (the short sketch below illustrates this). And so this led to two major AI winters, in the 70s and in the late 80s and early 90s. The Lighthill Report, commissioned by the UK government in 1973, concluded that "in no part of the field have the discoveries made so far produced the major impact that was then promised." So if the hype builds beyond the capabilities of our research, reports like this will come again, and they have the possibility of creating another AI winter.
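Here is the short sketch mentioned above: a plain NumPy perceptron trained on XOR. The data and the training loop are my own illustration, not course code; the point is simply that no setting of a single linear unit's weights classifies all four XOR points correctly, so the perceptron learning rule never converges.

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # inputs
    y = np.array([0, 1, 1, 0])                      # XOR labels: not linearly separable

    w, b = np.zeros(2), 0.0
    for epoch in range(1000):
        errors = 0
        for xi, target in zip(X, y):
            prediction = 1 if np.dot(w, xi) + b > 0 else 0
            update = target - prediction            # classic perceptron learning rule
            w, b = w + update * xi, b + update
            errors += abs(update)
        if errors == 0:                             # would mean a perfect linear separator
            break
    print(epoch, errors)  # runs all 1000 epochs and still makes mistakes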

So I want to pair the optimism, some of the cool things we'll talk about in this class, with the reality of the challenges ahead of us. Here is where the research community is focusing, according to some of the key players in deep learning: what are the things that are next for deep learning?

The five-year vision: we want to run on smaller, cheaper mobile devices. We want to explore more of the space of unsupervised learning, as I mentioned, and reinforcement learning. We want to explore the space of video more with recurrent neural networks, for example being able to summarize videos or generate short videos.

One of the big efforts, especially at companies dealing with large amounts of data, is multimodal learning: learning from multiple data sets, from multiple sources of data. And lastly, making money from these technologies. Despite the excitement, there has for the most part been an inability to make serious money from some of the more interesting parts of deep learning.

And while I got made fun of by the TAs for including this slide, because it's shown in so many business-type lectures, it is true that we're at the peak of a hype cycle, and given the large amount of hype and excitement there is, we have to make sure we proceed with caution.

One example of that, let me mention: we already talked about spoofing the cameras with a little bit of noise. So if you think about it, self-driving vehicles operate with a set of sensors, and they rely on those sensors to accurately capture information about the world.

Now what happens, not only when the world itself produces noisy visual information, but when somebody actively tries to spoof that data? One of the fascinating things that has recently been done is spoofing of LiDAR. LiDAR is a range sensor that gives a 3D point cloud of the objects in the external environment, and researchers have been able to successfully carry out a replay attack where the car sees people and other cars around it when there's actually nothing there.

In the same way, that is, that you can spoof a camera, and the neural network behind it, into seeing things that are not there. So let me run through some of the libraries that we'll work with, and that are out there for you to work with if you proceed with deep learning. TensorFlow is the most popular one these days.

It's heavily backed and developed by Google. It has primarily a Python interface and is very good at operating on multiple GPUs. Then there are Keras, and also TF Learn and TF Slim, which are libraries that operate on top of TensorFlow and provide slightly easier, more user-friendly interfaces for getting up and running.

Torch: if you're interested in getting in at the lower level, tweaking the different parameters of neural networks and creating your own architectures, Torch is excellent for that, with its own Lua interface; Lua is a programming language, and Torch is heavily backed by Facebook. There's the old-school Theano, which is what I started on, and what a lot of people early on in deep learning started on.

It's one of the first libraries that came with GPU support. It definitely encourages lower-level tinkering and has a Python interface. And many of these libraries, if not all, rely on NVIDIA's library for doing some of the low-level computations involved in training these neural networks on NVIDIA GPUs.

MXNet is heavily supported by Amazon; they recently officially announced that AWS is going to be all in on MXNet. Neon was recently bought by Intel; the company behind it started out as a maker of neural network chips, which is really exciting, and the library performs exceptionally well. A couple more worth mentioning.

Caffe started in Berkeley and was also very popular at Google before TensorFlow came out. It's primarily designed for computer vision with ConvNets, but it has now expanded to other domains. There is CNTK, as it used to be known, now called the Microsoft Cognitive Toolkit, though nobody I'm aware of calls it that yet.

It has multi-GPU support, its own custom language called BrainScript, as well as other interfaces. And what we'll get to play around with in this class is, amazingly, deep learning in the browser. Our favorite is ConvNetJS, which is what you'll use, built by Andrej Karpathy, from Stanford and now at OpenAI. It's good for explaining the basic concepts of neural networks.

It's fun to play around with. All you need is a browser, so there are very few requirements. It can't leverage GPUs, unfortunately. But for a lot of the things we're doing, you don't need GPUs; you'll be able to train a network with very little compute, and relatively efficiently, without them.

It has full support for CNNs, RNNs, and even deep reinforcement learning. KerasJS, which seems incredible, is one we tried to use for this class, but it didn't work out. It has GPU support, so it runs in the browser with GPU acceleration, through WebGL or however it works, magically. But we're able to accomplish a lot of the things we need without the use of GPUs.

So it's incredible to live in a day and age when, as I'll show in the tutorials, it literally takes just a few minutes to get started building your own neural network that classifies images, and a lot of these libraries are friendly in that way. All the references mentioned in this presentation are available at this link, and the slides are available there as well.
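To give a flavor of what those few minutes look like, here is a minimal sketch of an image classifier using the Keras API on top of TensorFlow; it's my own illustration on the standard MNIST digits dataset, not the actual code from the course tutorials, and the layer sizes are arbitrary:

    import tensorflow as tf

    # Load a small standard image dataset and scale pixels to [0, 1].
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    # A tiny fully connected network: flatten the 28x28 image, one hidden layer,
    # and a 10-way softmax over the digit classes.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

    model.fit(x_train, y_train, epochs=3)
    print(model.evaluate(x_test, y_test))   # [test loss, test accuracy]

Even a small fully connected network like this typically trains to well over 90% test accuracy in a couple of minutes on a laptop CPU, which is the point: the barrier to entry with these libraries is very low.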

So, I think in the interest of time, let me wrap up. Thank you so much for coming in today. And tomorrow I'll explain the deep reinforcement learning game and the actual competition and how you can win it. Thanks very much guys.