MIT 6.S094: Introduction to Deep Learning and Self-Driving Cars
Chapters
0:00 Intro
0:54 Administrative
3:43 Project: Deep Traffic
5:02 Project: DeepTesla
6:50 Defining Artificial Intelligence
10:14 How Hard is Driving?
10:59 Chess Pieces: Self-Driving Car Sensors
13:08 Chess Pieces: Self-Driving Car Tasks
14:48 DARPA Grand Challenge II (2006)
15:23 DARPA Urban Challenge (2007)
15:48 Industry Takes on the Challenge
16:30 How Hard is it to Pass the Turing Test?
20:54 Neuron: Biological Inspiration for Computation
22:54 Perceptron: Forward Pass
23:37 Perceptron Algorithm
25:03 Neural Networks are Amazing
26:15 Special Purpose Intelligence
27:21 General Purpose Intelligence
40:43 Deep Learning Breakthroughs: What Changed?
45:19 Useful Deep Learning Terms
48:58 Neural Networks: Proceed with Caution
49:50 Deep Learning is Representation Learning
51:33 Representation Matters
52:43 Deep Learning: Scalable Machine Learning
54:01 Applications: Object Classification in Images
55:55 Illumination Variability
56:32 Pose Variability and Occlusions
57:54 Pause: Object Recognition / Classification
59:41 Pause: Object Detection
00:00:00.000 |
All right, hello everybody. Hopefully you can hear me well. Yes, yes, great. 00:00:07.400 |
So welcome to course 6.S094, Deep Learning for Self-Driving Cars. 00:00:15.600 |
We will introduce you to the methods of deep learning, of deep neural networks, 00:00:22.500 |
using the guiding case study of building self-driving cars. 00:00:31.200 |
You get to listen to me for the majority of these lectures 00:00:35.600 |
and I am part of an amazing team with some brilliant TAs, would you say? 00:00:50.800 |
Spencer, William Angell, Spencer Dodd, and all the way in the back, 00:00:57.600 |
the smartest and the tallest person I know, Benedikt Jenik. 00:01:01.500 |
So what you see there on the left of the slide is a visualization of one of the two projects, 00:01:09.600 |
one of the two simulation games, that we'll get to go through. 00:01:16.800 |
We use it as a way to teach you about deep reinforcement learning 00:01:21.300 |
but also as a way to excite you by challenging you to compete against others, 00:01:27.900 |
if you wish, to win a special prize yet to be announced, super secret prize. 00:01:35.000 |
So you can reach me and the TAs at deepcars@mit.edu 00:01:41.000 |
if you have any questions about the tutorials, about the lecture, about anything at all. 00:01:45.400 |
The website cars.mit.edu has the lecture content. 00:01:51.200 |
Code tutorials as well. Like today, the lecture slides for today are already up in PDF form. 00:01:57.800 |
The slides themselves, if you want to see them, just email me, 00:02:02.100 |
but they're over a gigabyte in size because they're very heavy in videos. 00:02:07.800 |
And there will be lecture videos available a few days after the lecture is given. 00:02:15.800 |
So speaking of which, there is a camera in the back. 00:02:19.100 |
This is being videotaped and recorded, but for the most part, 00:02:28.600 |
If that kind of thing worries you, then you could sit on the periphery of the classroom 00:02:34.300 |
or maybe I suggest sunglasses and a fake mustache, that would be a good idea. 00:02:40.100 |
There is a competition for the game that you see on the left. 00:02:47.100 |
In order to get credit for the course, you have to design a neural network 00:02:52.200 |
that drives the car just above the speed limit of 65 miles an hour. 00:02:56.200 |
But if you want to win, you need to go a little faster than that. 00:03:05.500 |
You may be new to programming, new to machine learning, new to robotics, 00:03:11.900 |
or you're an expert in those fields but want to go back to the basics. 00:03:17.600 |
So what you will learn is an overview of deep reinforcement learning, 00:03:21.300 |
convolutional neural networks, recurrent neural networks, 00:03:25.900 |
and how these methods can help improve each of the components of autonomous driving. 00:03:31.800 |
Perception, visual perception, localization, mapping, control, planning. 00:03:42.500 |
Okay, two projects. Code name Deep Traffic is the first one. 00:03:46.700 |
There are, in this particular formulation of it, seven lanes. 00:03:52.100 |
It's a top view. It looks like a game but I assure you it's very serious. 00:04:03.300 |
The car in red is being controlled by a neural network 00:04:06.800 |
and we'll explain how you can control and design the various aspects, 00:04:12.100 |
the various parameters of this neural network. 00:04:21.900 |
which is a library programmed by Andrej Karpathy in JavaScript. 00:04:27.000 |
So amazingly, we live in a world where you can train in a matter of minutes 00:04:37.800 |
The reason we did this is so that there are very few requirements 00:04:42.800 |
to get you up and started with neural networks. 00:04:46.300 |
So in order to complete this project for the course, 00:04:50.800 |
you don't need anything except to have a Chrome browser. 00:04:54.500 |
And to win the competition, you don't need anything except a Chrome browser. 00:05:00.500 |
The second project, code name Deep Tesla, or Tesla, 00:05:08.600 |
is using data from a Tesla vehicle of the forward roadway 00:05:15.600 |
and using end-to-end learning: taking the image 00:05:18.800 |
and putting it into a convolutional neural network 00:05:22.400 |
that acts as a regressor, mapping directly to a steering angle. 00:05:30.600 |
and it predicts a steering angle for the car. 00:05:36.900 |
and you get to build a neural network that tries to do better, 00:05:41.800 |
tries to steer better or at least as good as the car. 00:05:51.600 |
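The end-to-end idea above can be sketched in miniature. The code below is a toy stand-in, not the course's actual pipeline: it maps a flattened "image" (three made-up pixel values) straight to a steering angle with a single linear layer trained by stochastic gradient descent on synthetic data. A real system would use a convolutional network on real camera frames.

```python
import random

random.seed(0)

def predict(weights, pixels):
    # One linear layer: steering angle = weighted sum of pixel values.
    return sum(w * p for w, p in zip(weights, pixels))

# Synthetic dataset: the "true" steering is a fixed linear function of pixels.
true_w = [0.5, -0.25, 0.1]
data = []
for _ in range(200):
    px = [random.uniform(-1, 1) for _ in true_w]
    data.append((px, predict(true_w, px)))

weights = [0.0] * len(true_w)
lr = 0.1
for _ in range(50):  # a few passes of SGD over the dataset
    for px, angle in data:
        err = predict(weights, px) - angle  # prediction error
        weights = [w - lr * err * p for w, p in zip(weights, px)]
```

After training, the learned weights recover the synthetic steering function, which is all this sketch is meant to show: input pixels in, steering angle out, no hand-engineered features in between.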
with the thing that we understand so poorly at this time 00:05:57.200 |
because it's so shrouded in mystery but it fascinates many of us. 00:06:01.800 |
And that's the question of what is intelligence? 00:06:06.500 |
This is from a March 1996 issue of Time Magazine. 00:06:14.100 |
And the question, can machines think, is answered below 00:06:19.000 |
with "they already do, so what if anything is special about the human mind?" 00:06:24.800 |
It's a good question for 1996, a good question for 2016, 00:06:37.800 |
Can an artificial intelligence system achieve a well-defined, 00:06:44.300 |
specifically, formally defined finite set of goals? 00:06:49.200 |
And this little diagram is from a book that got me into artificial intelligence, 00:06:57.300 |
Artificial Intelligence: A Modern Approach. 00:07:02.200 |
This is a beautifully simple diagram of a system. 00:07:10.300 |
It has a set of sensors that do the perception. 00:07:15.300 |
It takes those sensors in, does something magical, 00:07:20.500 |
and with a set of effectors, acts in the world, 00:07:26.900 |
And so special purpose, we can, under this formulation, 00:07:33.300 |
as long as the environment is formally defined, well-defined, 00:07:42.200 |
and the way that perception carries itself out is well-defined, 00:07:47.800 |
we have good algorithms of which we'll talk about 00:07:58.600 |
will we get closer to the general formulation, 00:08:04.200 |
to the general purpose version of what artificial intelligence is? 00:08:08.500 |
Can it achieve a poorly defined, unconstrained set of goals 00:08:14.000 |
with an unconstrained, poorly defined set of actions, 00:08:16.900 |
and unconstrained, poorly defined utility functions, rewards? 00:08:28.800 |
exist in an undefined world, full of uncertainty. 00:08:28.800 |
So, okay, we can separate tasks into three different categories. 00:08:46.000 |
It doesn't seem so, it didn't seem so at the birth of artificial intelligence, 00:08:50.400 |
but that's in fact true if you think about it. 00:08:52.800 |
The easiest is the formal tasks, playing board games, theorem proving, 00:08:56.800 |
all the kind of mathematical logic problems that can be formally defined. 00:09:06.200 |
So this is where a lot of the exciting breakthroughs have been happening, 00:09:12.100 |
where machine learning methods, data-driven methods, 00:09:16.100 |
can help aid or improve on the performance of our human experts. 00:09:22.800 |
This means medical diagnosis, hardware design, scheduling. 00:09:26.900 |
And then there is the thing that we take for granted, 00:09:30.500 |
the trivial thing, the thing that we do so easily every day, 00:09:37.000 |
the mundane tasks of everyday speech, of written language, 00:10:02.900 |
we really want to dig in and try to see what is it about driving. 00:10:12.900 |
Is it more like chess, which you see on the left there, 00:10:17.500 |
where we can formally define a set of lanes, a set of actions, 00:10:21.200 |
and formulate it as this, you know, there are five sets of actions, 00:10:24.800 |
you can change a lane, you can avoid obstacles, 00:10:30.000 |
you can formally define the rules of the road. 00:10:32.400 |
Or is there something about natural language, 00:10:37.200 |
something similar to everyday conversation about driving, 00:10:40.300 |
that requires a much higher degree of reasoning, 00:10:52.300 |
Is it a lot more than just left lane, right lane, speed up, slow down? 00:11:04.400 |
What are the sensors we get to work with on a self-driving car? 00:11:13.000 |
especially with the guest speakers who built many of these. 00:11:17.300 |
There are the range sensors, radar and lidar, 00:11:20.400 |
that give you information about the obstacles in the environment, 00:11:24.400 |
that help localize the obstacles in the environment. 00:11:28.200 |
There's the visible light camera, the stereo vision, 00:11:34.700 |
that helps you figure out not just where the obstacles are, 00:11:47.200 |
Then there is the information about the vehicle itself, 00:11:49.700 |
about the trajectory and the movement of the vehicle, 00:11:55.600 |
And there is the state of, the rich state of the vehicle itself. 00:12:22.400 |
the sound of a road that when it stopped raining, 00:12:38.200 |
the thing that's really much under investigated, 00:12:55.100 |
The emotional state, are they in the seat at all? 00:13:01.800 |
That comes from the visual information and the audio information. 00:13:13.100 |
the task of what it means to build a self-driving vehicle. 00:13:17.300 |
First, you want to know where you are, where am I? 00:13:24.900 |
figure out where all the different obstacles are, 00:13:29.600 |
all the entities are, and use that estimate of the environment 00:13:34.100 |
to then figure out where I am, where the robot is. 00:13:40.500 |
It's understanding not just the positional aspects 00:13:44.700 |
of the external environment and the dynamics of it, 00:13:51.100 |
Is it a car? Is it a pedestrian? Is it a bird? 00:13:57.800 |
Once you have kind of figured out to the best of your abilities, 00:14:01.700 |
your position and the position of other entities in this world, 00:14:06.200 |
there's figuring out a trajectory through that world. 00:14:09.000 |
And finally, once you've figured out how to move about, 00:14:17.200 |
it's figuring out what the human that's on board is doing. 00:14:20.600 |
Because as I will talk about, the path to a self-driving vehicle, 00:14:34.600 |
where the vehicle must not only drive itself, 00:14:40.900 |
but effectively hand over control from the car 00:14:50.300 |
Well, there's a lot of fun stuff from the 80s and 90s, but 00:14:54.000 |
the big breakthroughs came in the second DARPA Grand Challenge 00:15:02.700 |
with Stanford's Stanley, when they won the competition, 00:15:12.400 |
In a desert race, a fully autonomous vehicle was able to complete the race 00:15:32.000 |
where the task was no longer a race through the desert, 00:15:46.700 |
And a lot of that work led directly into the acceptance 00:16:00.600 |
taking on the challenge of building these vehicles. 00:16:09.100 |
Tesla, with its Autopilot system and now Autopilot 2 system. 00:16:20.400 |
including one of the speakers for this course, from nuTonomy, 00:16:24.000 |
that are driving the wonderful streets of Boston. 00:16:35.400 |
We have, if we think about the accomplishments in the DARPA challenge 00:16:39.700 |
and if we look at the accomplishments of the Google self-driving car, 00:16:45.400 |
which essentially boils the world down into a chess game. 00:16:56.800 |
to build a three-dimensional map of the world, 00:16:59.400 |
localize itself effectively in that world and move about that world 00:17:24.200 |
The Turing test, in its popular current formulation, is: 00:17:37.700 |
having a conversation with either a computer or a human, 00:17:40.500 |
they mistake the other side of that conversation 00:17:43.900 |
for being a human when it's in fact a computer. 00:17:47.900 |
And the way you would, in a natural language, 00:17:55.600 |
build a system that has successfully passed the Turing test 00:18:09.100 |
then you represent knowledge, the state of the conversation, 00:18:14.700 |
And the last piece, and this is the hard piece, 00:18:20.100 |
Is reasoning, can we teach machine learning methods to reason? 00:18:30.200 |
That is something that will propagate through our discussion 00:18:33.800 |
because, as I will talk about, the various methods, 00:18:40.400 |
the various deep learning methods, neural networks, 00:18:48.100 |
But they're not yet, there's no good mechanism for reasoning. 00:18:56.800 |
that we tell ourselves we do to feel special, 00:19:00.200 |
better, to feel like we're better than machines. 00:19:03.700 |
Reasoning may be simply something as simple as learning from data. 00:19:13.000 |
Or there could be a totally different mechanism required 00:19:18.000 |
and we'll talk about the possibilities there. 00:19:33.200 |
No, it's very difficult to find these kind of situations 00:20:30.800 |
No, but they drive on the right side of the road, 00:20:40.900 |
Yeah, so, but it's certainly not the United States. 00:20:57.000 |
the recent breakthroughs in machine learning, 00:21:02.000 |
and what is at the core of those breakthroughs. 00:21:25.100 |
is a computational building block of the brain. 00:21:50.700 |
the computational building block of a neural network, 00:21:58.800 |
these neurons, for both artificial and human brains, 00:22:12.700 |
I believe, 10,000 outgoing connections from every neuron, 00:22:29.300 |
has 10 billion of those connections, synapses. 00:23:18.900 |
and using an activation function that takes as input 00:24:05.400 |
like what's seen here, between the blue dots, 00:24:12.500 |
in the IPython notebook that I'll talk about. 00:24:28.800 |
You perform this previous operation I talked about, sum up, 00:24:40.400 |
the expected output, the output that it should produce, 00:24:47.400 |
And we'll talk through a little bit of the math of that. 00:24:51.800 |
And this process is repeated until the perceptron 00:25:21.500 |
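The update rule just described, repeated until the perceptron stops misclassifying, can be sketched in a few lines of plain Python. This is a generic textbook perceptron, not code from the course; the toy dataset and learning rate are made up for illustration.

```python
# Minimal perceptron: weighted sum plus step activation, trained with the
# classic update rule (repeat until every example is classified correctly).
def predict(weights, bias, x):
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else 0

def train_perceptron(data, lr=0.1, epochs=100):
    n = len(data[0][0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in data:
            # Difference between the expected output and what it produced.
            delta = target - predict(weights, bias, x)
            if delta != 0:
                errors += 1
                weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
                bias += lr * delta
        if errors == 0:  # converged: every point classified correctly
            break
    return weights, bias

# Linearly separable toy data (an AND-like function of two inputs).
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(data)
```

For linearly separable data like this, the perceptron convergence theorem guarantees the loop terminates with all examples correct.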
With just a single layer, if we stack them together, 00:25:25.600 |
the inputs on the left, the outputs on the right, 00:25:29.600 |
and in the middle there's a single hidden layer. 00:25:52.900 |
you know, you can think of driving as a function. 00:26:04.800 |
There exists a neural network out there that can drive, 00:26:11.900 |
So we can think of this then, these functions as a special purpose, 00:26:36.900 |
passes that value through to the hidden layer, 00:26:47.900 |
And we can teach a network to do this pretty well, 00:26:58.100 |
where you know the number of bedrooms, the square feet, 00:27:24.200 |
have been in the general purpose intelligence. 00:28:01.100 |
prevents the other guy from bouncing the ball back at you. 00:28:08.800 |
the artificial intelligence agents on the right in green, 00:28:34.600 |
processed and also you take the difference between, 00:28:40.900 |
but it's basically the raw pixel information. 00:28:48.500 |
and the output is a single probability of moving up. 00:29:10.000 |
you don't know what the right thing to do is. 00:29:19.900 |
by the fact that eventually you win or lose the game. 00:29:35.800 |
and any one action being good or bad in any state. 00:29:49.900 |
So no matter what you did, if you won the game, 00:30:38.200 |
be extended further, why this is so promising, 00:30:52.500 |
up, down, up, down, based on the output of the network. 00:30:57.800 |
you move up or down based on the output of the network. 00:31:04.600 |
And every single state action pair is rewarded if there's a win, 00:31:22.000 |
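The credit-assignment idea described here, where every state-action pair in an episode is labeled with the game's final outcome, can be sketched as follows. The state and action values are placeholders, not anything from the actual Pong setup.

```python
# Every (state, action) pair in an episode gets labeled with the final
# outcome of the game: +1 if the episode was a win, -1 if it was a loss.
def label_episode(states, actions, won):
    reward = 1.0 if won else -1.0
    # Each pair becomes a supervised-style training example with that reward,
    # regardless of whether that particular action was actually good or bad.
    return [(s, a, reward) for s, a in zip(states, actions)]

episode = label_episode(["s0", "s1", "s2"], ["up", "up", "down"], won=True)
```

The crudeness is the point: individual actions are never judged, yet averaged over many episodes the good actions show up in wins more often than in losses.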
And if you don't understand why that's amazing, 00:31:38.600 |
what is supervised learning, what is unsupervised learning, 00:31:47.100 |
they mean supervised learning most of the time. 00:31:58.700 |
When you have a set of inputs and a set of outputs, 00:32:09.800 |
to train any of the machine learning algorithms, 00:32:12.600 |
to learn to then generalize that to future examples. 00:32:23.800 |
actually there's a third one called reinforcement learning, 00:32:37.900 |
the ground truth only happens every once in a while, 00:32:42.900 |
And unsupervised learning is when you have no information, 00:32:58.600 |
But it has achieved no major breakthroughs at this point. 00:33:05.200 |
I'll talk about what the future of deep learning is, 00:33:07.400 |
and a lot of the people that are working in the field, 00:33:24.300 |
And the brown one is just a heuristic solution, 00:33:29.900 |
So basically the reinforcement learning here, 00:33:34.100 |
is learning from somebody who has certain rules. 00:33:49.900 |
the green paddle learns to play this game successfully, 00:34:01.100 |
How do we know it can generalize to other games, 00:34:07.000 |
But the mechanism by which it learns generalizes. 00:34:13.400 |
how do we know that it can generalize to other games, 00:34:19.200 |
as long as you let it play in whatever world you want it to succeed in, 00:34:28.500 |
it will use the same approach to learn to succeed in that world. 00:34:38.700 |
Unfortunately, one of the big challenges of neural networks, 00:34:45.900 |
is that they're not currently efficient learners. 00:34:53.800 |
oftentimes, and they learn very efficiently from that one example. 00:35:07.900 |
So if you think about the way a human being would approach this game, 00:35:14.400 |
they would only need a simple set of instructions. 00:35:21.800 |
And your task is to bounce the ball past the other player, 00:35:34.900 |
but they would immediately understand the game. 00:35:36.700 |
And they will be able to successfully play it well enough 00:35:49.900 |
They need to have a concept of moving up and down, 00:35:56.400 |
they have to have at least a loose concept of real-world physics, 00:36:00.300 |
that they can then project that real-world physics 00:36:06.700 |
are concepts that you come to the table with. 00:36:13.400 |
And the kind of way you transfer that knowledge from 00:36:29.700 |
And the question is whether through this same kind of process, 00:36:53.100 |
much more efficiently than 200,000 iterations. 00:37:00.900 |
and machine learning broadly is you need big data 00:37:15.100 |
a human being looking at a particular image, for example, 00:37:38.800 |
You need to figure out the network structure first. 00:37:45.300 |
What type of activation function in each node? 00:37:56.100 |
there's parameters for how you teach that network. 00:38:03.300 |
mini-batch size, number of training iterations, 00:38:09.000 |
and selecting even the optimizer with which you, 00:38:14.100 |
with which you solve the various differential equations involved. 00:38:20.000 |
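As an illustration only, the kinds of training parameters listed above might be collected like this; the names and values are examples, not settings used in the course.

```python
# Illustrative hyperparameter choices: the structural ones (layers,
# activation) define the network, the rest control how it is taught.
hyperparams = {
    "hidden_layers": [128, 64],         # network structure: nodes per layer
    "activation": "relu",               # activation function in each node
    "learning_rate": 0.01,
    "mini_batch_size": 32,
    "num_training_iterations": 10_000,
    "optimizer": "sgd",                 # e.g. plain SGD vs. momentum vs. Adam
}
```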
It's a topic of many research papers, certainly. 00:38:31.900 |
It means that you can't just plop a network down 00:38:54.600 |
I'm teaching a network to play the game of Coast Runners. 00:39:09.200 |
you're in a boat, the task is to go around a track 00:39:23.100 |
And what it's figured out that actually in the game, 00:39:26.900 |
it gets a lot of points for collecting certain objects along the path. 00:39:33.000 |
So what you see is it's figured out to go in a circle 00:39:40.700 |
And what it's figured out is you don't need to complete the game 00:39:54.200 |
and despite being on fire and hitting the wall 00:40:01.900 |
it's actually achieved at least a local optima 00:40:05.700 |
given the reward function of maximizing the number of points. 00:40:10.200 |
And so it's figured out a way to earn a higher reward 00:40:16.800 |
while ignoring the implied bigger picture goal of finishing the race, 00:40:24.400 |
This raises ethical questions for self-driving cars. 00:40:31.400 |
Besides other questions, you can watch this for hours 00:40:53.800 |
under which an intelligence system needs to operate. 00:40:56.300 |
And that's made obvious even in a simple game. 00:41:01.600 |
So the question was, what's an example of a local optimum 00:41:12.400 |
that an autonomous car, so similar to the coast race, 00:41:15.800 |
so what would be the example in the real world for an autonomous vehicle? 00:41:27.800 |
with the choices we make under near crashes and crashes. 00:41:41.200 |
and there's no way you can stop to prevent the crash, 00:41:45.200 |
do you keep the driver safe or do you keep the other people safe? 00:42:03.200 |
even if it's only in the data and the learning that you do, 00:42:08.800 |
And we need to be aware of that reward function is 00:42:25.000 |
It's hard to know ahead of time what that is. 00:42:27.800 |
So the recent breakthroughs from deep learning came 00:42:41.800 |
CPUs are getting faster, 100 times faster every decade. 00:42:49.100 |
Also, the ability to train neural networks and GPUs 00:42:53.100 |
and now ASICs has created a lot of capabilities 00:43:02.200 |
and being able to train larger networks more efficiently. 00:43:19.300 |
And now data is becoming more organized, 00:43:23.200 |
not just vaguely available data out there on the internet. 00:43:27.900 |
It's actual organized data sets like ImageNet. 00:43:31.000 |
Certainly for natural language, there's large data sets. 00:43:38.400 |
Backprop, backpropagation, convolutional neural networks, LSTMs, 00:43:43.500 |
all these different architectures for dealing with specific 00:43:57.000 |
There's Git, the ability to share software in an open-source way. 00:44:01.100 |
There are pieces of software that make robotics 00:44:16.600 |
which allows for efficient, cheap annotation of large scale data sets. 00:44:21.400 |
There's AWS in the cloud hosting machine learning, 00:44:28.800 |
And then there's a financial backing of large companies, 00:44:38.800 |
There really has not been any significant breakthroughs. 00:44:42.300 |
We're using these convolutional neural networks 00:44:46.600 |
Neural networks have been around since the 60s. 00:44:52.000 |
But the hope is, that's in terms of methodology. 00:45:00.200 |
The ability to do the hundredfold improvement every decade 00:45:08.800 |
And the question is whether that reasoning thing I talked about 00:45:22.500 |
First of all, deep learning is a PR term for neural networks. 00:45:40.800 |
It is a symbolic term for the newly gained capabilities 00:45:50.200 |
So deep learning is a subset of machine learning. 00:45:54.300 |
There's many other methods that are still effective. 00:45:56.900 |
The terms that will come up in this class are 00:46:04.600 |
deep neural networks, recurrent neural networks, 00:46:10.200 |
CNN or ConvNets, convolutional neural networks, 00:46:15.600 |
And the operation that will come up is convolution, pooling, 00:46:50.300 |
what is the purpose of the different layers in a neural network? 00:46:54.100 |
What does it mean to have one configuration versus another? 00:47:01.500 |
it's the only thing you have an understanding of 00:47:08.500 |
You don't have a good understanding about what each layer does. 00:47:16.900 |
So I'll talk about how with every layer it forms a higher level, 00:47:26.500 |
So it's not like the first layer does localization, 00:47:40.800 |
So we know, we're beginning to visualize neural networks 00:47:45.600 |
for simple tasks, like for ImageNet, classifying cats versus dogs. 00:47:51.600 |
We can tell what is the thing that the first layer does, 00:47:54.800 |
the second layer, the third layer, and we'll look at that. 00:47:57.200 |
But for driving, where the input is just the images and the output is the steering, 00:48:05.200 |
Partially because we don't have neural networks that drive successfully yet. 00:48:15.200 |
Do neural networks fill layers, or do they eventually generate them on their own over time? 00:48:23.400 |
does a neural network generate layers over time? 00:48:31.200 |
That's one of the challenges is that a neural network is predefined. 00:48:38.200 |
The architectures, the number of nodes, number of layers, that's all fixed. 00:48:42.500 |
Unlike the human brain where neurons die and are born all the time. 00:48:46.000 |
A neural network is pre-specified: that's it, that's all you get. 00:48:50.400 |
And if you want to change that, you have to change that and then retrain everything. 00:48:55.800 |
So what I encourage you is to proceed with caution 00:49:00.800 |
because there's this feeling when you first teach a network with very little effort, 00:49:06.900 |
how to do some amazing tasks, like classify a face, 00:49:12.000 |
versus non-face or your face versus other faces or cats versus dogs. 00:49:18.100 |
And then there's definitely this feeling that I'm an expert. 00:49:31.600 |
And getting it to perform well for more generalized tasks, 00:49:35.800 |
for larger scale datasets, for more useful applications, 00:49:41.600 |
Figuring out how to tweak little things here and there. 00:49:43.900 |
And still in the end, you don't understand why it works so damn well. 00:49:48.000 |
So deep learning, these deep neural network architectures, is representation learning. 00:49:59.100 |
This is the difference between traditional machine learning methods. 00:50:05.900 |
Where, for example, for the task of having an image here as the input, 00:50:14.400 |
the input to the network here is on the bottom, the output is up at top. 00:50:18.100 |
So, and the input is a single image of a person in this case. 00:50:24.700 |
And so, the input specifically is all of the pixels in that image, RGB. 00:50:35.200 |
The different colors of the pixels in the image. 00:50:44.100 |
a multi-resolutional representation of this data. 00:50:47.900 |
The first layer learns the concept of edges, for example. 00:50:55.300 |
The second layer starts to learn composition of those edges, corners, contours. 00:51:05.800 |
And finally, actually provide a label for the entities that are in the input. 00:51:12.500 |
And this is the difference between traditional machine learning methods. 00:51:16.700 |
Where the concepts like edges and corners and contours are manually pre-specified by human beings, 00:51:31.000 |
And representation matters because figuring out a line 00:51:42.900 |
for the Cartesian coordinates of this particular dataset, 00:51:46.500 |
where you want to design a machine learning system 00:51:49.200 |
that tells the difference between green triangles and blue circles is difficult. 00:52:00.000 |
And if you were to ask a human being, a human expert in the field, 00:52:04.100 |
to try to draw that line, they would probably do a PhD on it and still not succeed. 00:52:12.300 |
But a neural network can automatically figure out to remap that input into polar coordinates. 00:52:23.000 |
Where the representation is such that it's an easily linearly separable dataset. 00:52:28.600 |
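The polar remapping described here is easy to demonstrate. The toy points below are invented: one class sits inside the unit circle and the other outside, so after the coordinate change a single threshold on the radius separates them, which no straight line in Cartesian coordinates could do for a circular boundary.

```python
import math

# Remap Cartesian (x, y) to polar (r, theta). Points that are not linearly
# separable in Cartesian coordinates (one class inside a circle, one outside)
# become separable by a single threshold on the radius r.
def to_polar(x, y):
    r = math.hypot(x, y)      # radius: distance from the origin
    theta = math.atan2(y, x)  # angle
    return r, theta

inner = [(0.1, 0.2), (-0.3, 0.1), (0.2, -0.2)]  # one class, inside r = 1
outer = [(2.0, 0.5), (-1.5, 1.5), (0.5, -2.0)]  # other class, outside r = 1

# In polar coordinates, the radius alone separates the two classes.
separable = (all(to_polar(*p)[0] < 1.0 for p in inner)
             and all(to_polar(*p)[0] >= 1.0 for p in outer))
```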
And so deep learning is a subset of representation learning, 00:52:36.800 |
is a subset of machine learning and a key subset of artificial intelligence. 00:52:41.600 |
Now, because of this, because of its ability to compute an arbitrary number of features 00:52:56.400 |
So you're not, if you were trying to detect a cat in an image, 00:52:59.700 |
you're not specifying 215 specific features of cat ears and whiskers and so on 00:53:10.100 |
You allow a neural network to discover tens of thousands of such features. 00:53:14.400 |
Which maybe for cats you are an expert, but for a lot of objects 00:53:19.400 |
you may never be able to sufficiently provide the features 00:53:24.700 |
which successfully would be used for identifying the object. 00:53:30.600 |
one is easy in the sense that all you have to provide is inputs and outputs. 00:53:35.600 |
All you need to provide is a dataset that you care about without hand engineering features. 00:53:40.800 |
And two, because of its ability to construct arbitrarily sized representations, 00:53:52.700 |
The more data we give them, the more they're able to learn about this particular dataset. 00:54:06.300 |
First, some cool things that deep neural networks have been able to accomplish up to this point. 00:54:27.600 |
It's a competition of classification and localization: 00:54:34.400 |
identify what are the five most likely things in that image 00:54:38.300 |
and what is the most likely and you have to do so correctly. 00:54:41.400 |
So on the right, there's an image of a leopard 00:54:43.900 |
and you have to correctly classify that that is in fact a leopard. 00:54:50.800 |
Given a specific image, determine that it's a leopard. 00:54:55.200 |
What's shown here on the x-axis is years, 00:55:04.900 |
So starting from 2012 on the left with AlexNet 00:55:18.000 |
and 40% before then with traditional methods have decreased to below 4%. 00:55:24.200 |
So human level performance, if I were to give you this picture of a leopard, 00:55:40.100 |
convolutional neural networks outperform human beings. 00:55:47.900 |
and now is because it's done, it's not as impressive. 00:55:53.100 |
But I just want to get to why that's so impressive 00:56:02.500 |
We as human beings have evolved visual perception over millions of years, 00:56:08.300 |
So we take it for granted but computer vision is really hard. 00:56:18.400 |
The only way we tell anything is from the shade, 00:56:23.600 |
It could be the same object with drastically, in terms of pixels, 00:56:46.800 |
These are pictures, you know, cats are famously deformable. 00:57:17.500 |
We still know it's a cat even when parts of it are not visible 00:57:21.100 |
and sometimes large parts of it are not visible. 00:57:23.400 |
And then there's all the interclass variability. 00:57:27.100 |
In interclass, all of these on the top two rows are cats. 00:57:45.200 |
and as human beings are pretty good at telling the difference 00:57:48.700 |
and we want computer vision to do better than that. 00:57:56.400 |
This is done with convolutional neural networks. 00:58:01.500 |
Here's an input on the left of a number three 00:58:05.500 |
and I'll talk about through convolutional layers. 00:58:14.300 |
Convolutional layers maintain spatial information. 00:58:19.200 |
And the output, in this case, predicts what number 00:58:28.900 |
is shown in the image: 0, 1, 2, through 9. 00:58:43.800 |
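The lecture doesn't spell out how the network's final scores become a probability over the ten digit classes, but the standard choice is a softmax; a minimal sketch with made-up scores:

```python
import math

# Softmax: turn a vector of raw class scores into a probability distribution.
def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical final-layer scores for the 10 digit classes 0..9.
scores = [0.5, 0.2, 0.1, 4.0, 0.3, 0.1, 0.2, 0.1, 0.4, 0.1]
probs = softmax(scores)
predicted_digit = probs.index(max(probs))  # argmax over the probabilities
```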
And in the case of probability that it's a leopard, 00:58:49.400 |
Then there's segmentation built on top of these 00:58:52.900 |
convolutional neural networks where you chop off the end 00:59:00.600 |
You chop off the end where the output is a heat map. 00:59:03.000 |
So you can have instead of a detector for a cat, 00:59:08.000 |
you can do a cat heat map where it's the part of the image, 00:59:19.000 |
and the spatially excited in the parts of the image 00:59:25.800 |
And this kind of process can be used to segment the image 00:59:32.500 |
is a woman on a horse and the output is a fully segmented image 00:59:36.900 |
of knowing where's the woman, where's the horse. 00:59:39.500 |
And this kind of process can be used for object detection 00:59:44.700 |
which is the task of detecting an object in an image. 00:59:47.600 |
Now the traditional method with convolutional neural networks 00:59:53.200 |
and in general in computer vision is the sliding window approach. 00:59:59.500 |
that you slide through the image to find where in that image is a leopard. 01:00:10.500 |
is efficiently segment the image in such a way 01:00:14.200 |
that it can propose different parts of the image 01:00:16.400 |
that are likely to have a leopard or in this case a cowboy. 01:00:22.200 |
And that drastically reduces the computational requirements 01:00:37.100 |
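The sliding-window sweep described above can be sketched in a few lines. The `score_patch` function here is a hypothetical stand-in for a trained classifier (it just measures brightness), used only to show how the window sweep works and why its cost grows with the number of window positions that region proposals avoid.

```python
import numpy as np

def score_patch(patch):
    # Stand-in for a CNN classifier's "leopard" score;
    # here just the mean brightness of the patch.
    return patch.mean()

def sliding_window_detect(image, win=8, stride=4):
    """Score every window position; return the best box.
    Cost grows with the number of window positions."""
    best, best_box, n_windows = -np.inf, None, 0
    for y in range(0, image.shape[0] - win + 1, stride):
        for x in range(0, image.shape[1] - win + 1, stride):
            s = score_patch(image[y:y+win, x:x+win])
            n_windows += 1
            if s > best:
                best, best_box = s, (y, x, win, win)
    return best_box, n_windows

image = np.zeros((32, 32))
image[12:20, 16:24] = 1.0               # bright "object"
box, n = sliding_window_detect(image)
print(box, n)                            # best window lands on the object
```

Even on this tiny 32x32 image there are dozens of windows to score; on a full-resolution image with multiple scales the count explodes, which is what proposal-based methods drastically reduce.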
one of the best networks for the ImageNet task of localization 01:00:53.000 |
You're starting to get above 20 layers in many cases. 01:01:06.600 |
The deeper the network, the more representation power you have, the higher the accuracy. 01:01:29.000 |
So you can take a black and white video from a film, 01:01:36.700 |
And all you need to do to train that network in a supervised way 01:01:41.100 |
is provide modern films and convert them to grayscale. 01:01:48.400 |
That gives you datasets of grayscale-to-color image pairs. 01:01:48.400 |
And you're able to, with very little effort on top of it, 01:02:00.900 |
to successfully, well, somewhat successfully recolor images. 01:02:05.300 |
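Constructing that supervised dataset is nearly free. A minimal sketch, assuming the standard luma weights for the grayscale conversion:

```python
import numpy as np

def make_colorization_pair(color_frame):
    """From a color frame (H, W, 3), build a (grayscale input,
    color target) training pair -- the supervision comes for free."""
    r, g, b = color_frame[..., 0], color_frame[..., 1], color_frame[..., 2]
    gray = 0.299 * r + 0.587 * g + 0.114 * b   # standard luma weights
    return gray, color_frame

frame = np.random.rand(64, 64, 3)             # stand-in for a film frame
x, y = make_colorization_pair(frame)
print(x.shape, y.shape)                        # (64, 64) (64, 64, 3)
```

The network then trains to map `x` back to `y`, and at test time is given genuinely black-and-white footage.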
Again, Google Translate does image translation in this way. 01:02:12.800 |
It first perceives the text, here in German I believe. 01:02:21.900 |
So it can take this image, detect the different letters, 01:02:32.100 |
translate them, and map the translated letters back onto the box. 01:02:51.900 |
These networks map a single image to a number, or a single image to another image. 01:02:55.800 |
Then there are recurrent neural networks 01:03:00.900 |
that map a sequence of images or a sequence of words, 01:03:04.400 |
or a sequence of any kind, to another sequence. 01:03:09.200 |
And these networks are able to do incredible things 01:03:18.800 |
For example, we can convert typed text to handwriting. 01:03:27.000 |
Here we type in, and you could do this online, 01:03:31.600 |
type in "deep learning for self-driving cars" 01:03:33.800 |
And it will use an arbitrary handwriting style 01:03:41.700 |
to generate the words "deep learning for self-driving cars". 01:03:44.600 |
This is done using recurrent neural networks. 01:03:54.200 |
These are character-level recurrent neural networks 01:03:57.300 |
that train on an arbitrary text dataset 01:04:04.300 |
and learn to generate text one character at a time. 01:04:09.000 |
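A minimal sketch of that one-character-at-a-time generation loop, using a tiny numpy RNN with random, untrained weights; the vocabulary, layer sizes, and weights are illustrative assumptions, not the model from the lecture, so the output is gibberish until trained.

```python
import numpy as np

# Minimal character-level RNN *sampling* loop: generate text
# one character at a time from a recurrent hidden state.
chars = list("abcdefghijklmnopqrstuvwxyz ")
V, H = len(chars), 16
rng = np.random.default_rng(0)
Wxh = rng.normal(0, 0.1, (H, V))   # input  -> hidden
Whh = rng.normal(0, 0.1, (H, H))   # hidden -> hidden (recurrence)
Why = rng.normal(0, 0.1, (V, H))   # hidden -> output logits

def sample_text(seed_char, n_chars):
    h = np.zeros(H)
    idx = chars.index(seed_char)
    out = [seed_char]
    for _ in range(n_chars):
        x = np.zeros(V); x[idx] = 1.0          # one-hot current char
        h = np.tanh(Wxh @ x + Whh @ h)         # recurrent state update
        logits = Why @ h
        p = np.exp(logits - logits.max()); p /= p.sum()   # softmax
        idx = rng.choice(V, p=p)               # sample next character
        out.append(chars[idx])
    return "".join(out)

print(sample_text("n", 20))                    # gibberish until trained
```

Training adjusts the three weight matrices so that the sampled distribution matches the text corpus; the sampling loop itself stays exactly this simple.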
So there is no preconceived syntactic or semantic structure 01:04:17.700 |
So, for example, you can train it on Wikipedia articles, 01:04:24.100 |
like in this case, and it's able to generate successfully 01:04:29.800 |
not only text that makes some kind of grammatical sense at least, 01:04:35.400 |
but also keep perfect syntactic structure for Wikipedia, 01:04:41.400 |
for Markdown editing, for LaTeX editing, and so on. 01:04:45.900 |
This text says, "Naturalism and decision for the majority of Arab countries, 01:04:52.000 |
capitalized, whatever that means, was grounded by the Irish language 01:04:58.400 |
These are sentences, if you didn't know better, that might sound correct. 01:05:03.100 |
And it does so, let me pause, one character at a time. 01:05:14.400 |
You start with the beginning three letters, "Nat", 01:05:17.300 |
you generate "You" completely without knowing of the word "Naturalism". 01:05:33.600 |
and let the neural network complete that sentence. 01:05:35.700 |
So, for example, if you start the sentence with "Life is" 01:05:39.000 |
or "Life is about", actually, it will complete it with a lot of fun things. 01:05:56.700 |
and this is from Geoffrey Hinton, the last two. 01:06:03.000 |
it can complete that with "the meaning of life is literary recognition" 01:06:13.500 |
or "the meaning of life is the tradition of ancient human reproduction". 01:06:27.100 |
Something that has been very exciting recently is image caption generation. 01:06:33.100 |
Image caption generation is important for large data sets of images 01:06:41.200 |
where we want to be able to determine what's going on inside those images, 01:06:46.900 |
If you want to find a man sitting on a couch with a dog, 01:06:50.800 |
you type it into Google and it's able to find that. 01:06:59.000 |
"A man sitting on a couch with a dog" is generated by the system. 01:07:02.300 |
"A man sitting on a chair with a dog in his lap" is generated by a human observer. 01:07:07.300 |
And again, these annotations are done by detecting the different objects. 01:07:15.100 |
So segmenting the scene, detecting on the right there's a woman, a crowd, a cat, 01:07:20.400 |
a camera, "holding", "purple", all of these words are being detected. 01:07:25.400 |
Then syntactically correct sentences are generated, a lot of them, 01:07:30.700 |
and then you order them by which sentence is the most likely. 01:07:32.900 |
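A toy version of that last ranking step, with hypothetical detected words and candidate sentences; a real system scores candidates with a learned language model rather than the simple word-overlap score used here.

```python
# Toy version of the caption pipeline's final step: rank a set of
# candidate sentences by a stand-in likelihood (here, coverage of
# the words detected in the image).
detected = {"man", "couch", "dog", "sitting"}
candidates = [
    "a man sitting on a couch with a dog",
    "a dog running in a field",
    "a man standing near a car",
]

def score(sentence):
    # Count how many detected words the candidate covers.
    return len(detected & set(sentence.split()))

best = max(candidates, key=score)
print(best)
```

The pipeline structure is the point: detect words, generate many grammatical candidates, then order them and keep the most likely.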
And in this way you can generate very accurate labeling of the images, 01:07:41.800 |
And you can do the same kind of process for image question answering. 01:07:49.400 |
You can ask how many, so quantity, how many chairs are there? 01:07:53.000 |
You can ask about location, where are the ripe bananas? 01:07:59.500 |
You can ask about the type of object, what is the object on the chair? 01:08:05.500 |
And these are again using the recurrent neural networks. 01:08:15.000 |
You can do the same thing with video caption generation, 01:08:23.000 |
So looking at a sequence of images as opposed to just a single image. 01:08:26.400 |
What is the action going on in this situation? 01:08:34.600 |
Now on the left are correct descriptions: a man is doing stunts on his bike. 01:08:41.900 |
And on the right, there's a small bus running into a building. 01:08:45.200 |
You know, it's talking about relevant entities 01:08:53.500 |
A man is cutting a piece of, a piece of, a pair of a paper. 01:09:13.200 |
Another thing you can do with recurrent neural networks, 01:09:18.900 |
if you think about the way we look at images, 01:09:22.900 |
is exploit the fact that we only have a small fovea with which we focus in the scene. 01:09:30.300 |
So right now your periphery is very distorted. 01:09:33.500 |
The only thing, if you're looking at the slides, 01:09:35.900 |
or you're looking at me, that's the only thing that's in focus. 01:09:44.900 |
to try to teach a neural network to steer its attention around the image, 01:09:47.600 |
both for perception and generation of those images. 01:09:51.200 |
This is important first from a general artificial intelligence point of view. 01:10:02.900 |
But also it's important for things like drones 01:10:05.300 |
that have to fly at high speeds in an environment 01:10:08.300 |
where at 300-plus frames a second you have to make decisions 01:10:14.700 |
and perceive the world around yourself successfully. 01:10:22.900 |
For example, shown here is reading house numbers, 01:10:32.200 |
You could do the same task for reading and for writing. 01:10:38.400 |
So reading numbers here on the MNIST dataset on the left 01:10:42.900 |
We can also selectively steer a network around an image 01:10:53.700 |
and then get higher and higher resolution in the attended region 01:11:02.300 |
Work here at MIT is able to map video to audio. 01:11:18.200 |
predicting the sound that a drumstick hitting that particular object makes. 01:11:18.200 |
So you can get texture information from that impact. 01:11:29.100 |
So here is a video of a human soccer player playing soccer 01:11:38.700 |
and a state-of-the-art machine playing soccer. 01:11:44.900 |
And well, let me give it some time to build up. 01:12:03.300 |
Okay, so soccer, this is something we take for granted. 01:12:12.800 |
Soccer is harder than chess for us to do, much harder. 01:12:18.100 |
On your phone now, you can have a chess engine 01:12:26.300 |
And you have to internalize that because the question is, 01:12:37.400 |
Is it closer to chess or is it closer to soccer? 01:12:44.500 |
For those of you that worked on the most recent DARPA challenge, 01:12:44.500 |
this would be a very painful video to watch, I apologize. 01:12:55.700 |
of robots struggling with the basic object manipulation 01:13:06.900 |
So it's mostly a fully autonomous navigation task. 01:13:24.000 |
Maybe I'll just let this play for a few moments 01:13:27.100 |
to let it internalize how difficult this task is. 01:13:32.400 |
Of balancing, of planning in an under-actuated way 01:13:38.000 |
where you don't have full control of everything. 01:13:40.300 |
When there is a delta between your perception 01:13:44.300 |
of what you think the world is and what the reality is. 01:13:47.600 |
So there, a robot was trying to turn an object that wasn't there. 01:13:54.700 |
And this is an MIT entry that actually successfully, 01:14:02.300 |
I believe, got points for this because it got into that area. 01:14:12.000 |
But as a lot of the teams talked about, the hardest part, 01:14:19.200 |
was to get into a car, drive it, and get out of the car. 01:14:28.200 |
it had to drill a hole through a wall, all of these tasks. 01:14:32.000 |
And what a lot of teams said is the hardest part, 01:14:35.100 |
the hardest task of all of them is getting out of the car. 01:14:40.200 |
And it's this very task that you saw just now: a robot getting out of the car. 01:14:46.200 |
So in our evaluation of what is difficult about driving, 01:14:50.900 |
we have to remember that some of those things 01:14:54.700 |
we may take for granted in the same kind of way 01:15:11.300 |
"Encoded in the large highly evolved sensory and motor portions of the human brain 01:15:18.300 |
about the nature of the world and how to survive in it." 01:15:20.600 |
So this is data, this is big data, billions of years. 01:15:31.900 |
Abstract thought, by comparison, is perhaps less than 100,000 years old. 01:15:39.700 |
sorry I'm inserting my own statements in the middle of a quote but, 01:15:44.100 |
it's been very recent that we've learned how to think, 01:15:55.200 |
and reasoning may prove easier for machines than the things we take for granted, like walking and visual perception and so on. 01:16:19.800 |
And that's an important question because the margin of error is small. 01:16:34.400 |
That's the number of people that die in car crashes every year. 01:16:47.200 |
That's through all the time you spend on the road, 01:16:52.200 |
We're impressed with ImageNet being able to classify a leopard, a cat or a dog 01:16:57.700 |
at close to, or even above, human-level performance. 01:17:01.900 |
But this is the margin of error we get with driving. 01:17:04.600 |
And we have to be able to deal with snow, with heavy rain, 01:17:09.500 |
with big open parking lots, with parking garages, 01:17:13.500 |
with pedestrians that behave irresponsibly, as rarely as that happens, 01:17:18.300 |
or just unpredictably again, especially in Boston. 01:17:23.800 |
Reflections especially, this is one of the things you don't think about: 01:17:29.900 |
the lighting variations that blind the cameras. 01:17:33.100 |
The question was whether that number changes if you look at just crashes. 01:17:49.700 |
Crashes per year, yeah. So one of the big things is cars have gotten really good at crashing safely. 01:17:57.400 |
So the number of crashes is much, much larger than number of fatalities, 01:18:05.000 |
But still, you know, even one fatality is too many. 01:18:09.300 |
So here's one: the Google self-driving car team 01:18:20.200 |
is quite open about their performance since hitting public roads. 01:18:28.700 |
This is from a report that shows the number of disengagements, 01:18:35.700 |
when the car gives up control and asks the driver to take control back, 01:18:45.600 |
Meaning that they're unhappy with the decision that the car was making 01:18:49.900 |
or it was putting the car or other pedestrians or other cars in unsafe situations. 01:18:54.500 |
And so you see over time, from 2014 to 2015, 01:19:01.800 |
there's been a total of 341 times on beautiful San Francisco roads. 01:19:08.600 |
And I say that seriously because the weather conditions are great there. 01:19:14.400 |
341 times that the driver had to elect to take control back. 01:19:20.400 |
Let me give you something to think about here. 01:19:24.900 |
This with neural networks is a big open question. 01:19:33.200 |
So this is an amazing paper, I encourage people to read it. 01:19:38.400 |
There's a couple of papers around this topic. 01:19:43.300 |
So here are eight images where if given to a neural network as input, 01:19:54.800 |
the network with higher than 99.6% confidence says that the image, 01:20:01.900 |
for example, in the top left is a robin, next to it is a cheetah, 01:20:06.000 |
then an armadillo, a panda, an electric guitar, a baseball, a starfish, a king penguin. 01:20:13.100 |
All of these things are obviously not in the images. 01:20:19.100 |
More importantly, more practically for the real world, 01:20:25.800 |
adding just a little bit of distortion, a little bit of noise distortion to the image 01:20:31.800 |
can force the network to produce a totally wrong prediction. 01:20:39.800 |
There are three columns: the correctly classified image, the added noise distortion, 01:20:47.800 |
and the resulting image, which is predicted to be an ostrich, for all three images on the left 01:20:53.300 |
and for all three images on the right. 01:20:59.000 |
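The flavor of such an attack can be shown on a toy model. This sketch uses logistic regression, where the input gradient is exact, rather than the paper's actual deep network; the "trained" weights are random stand-ins, and the gradient-sign step is the core idea.

```python
import numpy as np

# Fooling a toy linear classifier with a small gradient-sign
# perturbation -- the same idea behind fooling deep networks,
# shown on logistic regression so the gradient is exact.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
w = rng.normal(size=100)                       # stand-in "trained" weights
x = w / np.linalg.norm(w)                      # an input classified class 1
p_clean = sigmoid(w @ x)                       # confident prediction

# The gradient of the class-1 score w.r.t. the input is just w,
# so step *against* it with a small sign perturbation.
eps = 0.2
x_adv = x - eps * np.sign(w)
p_adv = sigmoid(w @ x_adv)

print(round(p_clean, 3), round(p_adv, 3))      # confidence collapses
```

A perturbation of at most 0.2 per pixel, imperceptible at image scale, flips a near-certain prediction, which is exactly the vulnerability the slide illustrates.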
This ability to fool networks easily brings up an important point. 01:21:06.500 |
And that point is that there has been a lot of excitement 01:21:14.400 |
about neural networks throughout their history. 01:21:17.200 |
There's been a lot of excitement about artificial intelligence throughout its history. 01:21:21.000 |
And not grounding that excitement in the reality, 01:21:28.000 |
in the real challenges around it, has resulted in crashes, 01:21:36.200 |
in AI winters, when funding dried out and people became hopeless 01:21:42.300 |
in terms of the possibilities of artificial intelligence. 01:21:47.800 |
There was a New York Times article in 1958 that said the Navy revealed the embryo of an electronic computer. 01:21:52.000 |
This is when the first perceptron that I talked about 01:21:55.600 |
was implemented in hardware by Frank Rosenblatt. 01:21:59.100 |
It took a 400-pixel image as input and provided a single output. 01:22:06.000 |
The weights were encoded in hardware potentiometers 01:22:09.900 |
and updated during learning with electric motors. 01:22:13.900 |
"The Navy revealed the embryo of an electronic computer today 01:22:17.100 |
that it expects will be able to walk, talk, see, write, reproduce itself 01:22:27.000 |
Dr. Frank Rosenblatt, a research psychologist 01:22:31.200 |
at the Cornell Aeronautical Laboratory, Buffalo, 01:22:34.700 |
said perceptrons might be fired to the planets as mechanical space explorers. 01:22:39.300 |
This might seem ridiculous, but this was the general opinion of the time. 01:22:45.500 |
And as we know now, a single perceptron cannot even represent a simple nonlinear function like XOR. 01:22:57.300 |
And so this led to two major AI winters in the 70s and the late 80s and early 90s. 01:23:05.600 |
The Lighthill Report in 1973 by the UK government said that 01:23:14.000 |
"No part of the field of discoveries made so far produced the major impact that was promised." 01:23:19.000 |
So if the hype builds beyond the capabilities of our research, 01:23:31.300 |
there is the possibility of creating another AI winter. 01:23:35.200 |
So I want to pair the optimism, some of the cool things we'll talk about in this class 01:23:40.000 |
with the reality of the challenges ahead of us. 01:23:51.400 |
This is some of the key players in deep learning. 01:23:55.100 |
What are the things that are next for deep learning? 01:24:01.700 |
We want to run on smaller, cheaper mobile devices. 01:24:05.700 |
We want to explore more in the space of unsupervised learning, 01:24:12.500 |
We want to do things that explore the space of videos more. 01:24:20.300 |
with recurrent neural networks, like being able to summarize videos. 01:24:27.000 |
One of the big efforts, especially in the companies dealing with large data, 01:24:33.400 |
Learning from multiple data sets with multiple sources of data. 01:24:37.600 |
And lastly, making money from these technologies. 01:24:43.600 |
Despite all of the excitement, 01:24:48.800 |
there has been an inability for the most part to make serious money 01:24:54.800 |
from some of the more interesting parts of deep learning. 01:24:59.600 |
Now, I got made fun of by the TAs for including this slide 01:25:10.600 |
because it's shown in so many sort of business-type lectures, 01:25:13.600 |
but it is true that we're at the peak of a hype cycle, 01:25:20.400 |
and given the large amount of hype and excitement there is, we proceed with caution. 01:25:37.000 |
One concern we already talked about is spoofing the cameras, 01:25:42.800 |
spoofing the cameras with a little bit of noise. 01:25:48.600 |
self-driving vehicles operate with a set of sensors 01:25:53.200 |
and they rely on those sensors to accurately capture information about the world. 01:25:58.200 |
Now what happens, not only when the world itself produces noisy visual information, 01:26:06.600 |
but what if somebody actively tries to spoof that data? 01:26:10.000 |
One of the fascinating things that has been done recently is spoofing of LiDAR. 01:26:16.000 |
LiDAR is a range sensor that gives a 3D point cloud of the objects in the external environment, 01:26:22.800 |
and you're able to successfully do a replay attack 01:26:28.000 |
where you have the car see people and other cars around it that are not actually there. 01:26:34.600 |
In the same way that you can spoof a camera to see things that are not there, you can spoof a LiDAR. 01:26:44.200 |
So let me run through some of the libraries that we'll work with 01:26:48.400 |
and they're out there that you might work with if you proceed with deep learning. 01:26:53.400 |
TensorFlow is the most popular one these days 01:27:06.000 |
and is very good at operating on multiple GPUs. 01:27:18.200 |
There are also wrapper libraries that operate on top of TensorFlow 01:27:21.800 |
and provide slightly easier, slightly more user-friendly interfaces. 01:27:29.000 |
Torch, if you're interested in getting in at the lower level, 01:27:40.200 |
tweaking of the different parameters of neural networks, 01:27:42.800 |
creating your own architectures, Torch is excellent for that 01:27:49.200 |
It's written in the Lua programming language and is heavily backed by Facebook. 01:27:54.000 |
There's the old-school Theano, which is what I started on, 01:27:58.000 |
and what a lot of people early on in deep learning started on. 01:28:00.600 |
It's one of the first libraries that came with GPU support. 01:28:06.000 |
It definitely encourages lower level tinkering 01:28:11.400 |
And many of these, if not all, rely on NVIDIA's cuDNN library 01:28:23.600 |
for the low-level computation involved in training these neural networks on NVIDIA GPUs. 01:28:33.000 |
Amazon recently officially announced 01:28:39.800 |
that AWS is going to be all in on MXNet. 01:28:43.800 |
Neon, whose maker Nervana was recently bought by Intel, started out 01:28:52.800 |
as a manufacturer of neural network chips, which is really exciting. 01:29:01.800 |
Caffe started in Berkeley and was also very popular at Google. 01:29:10.000 |
It's primarily designed for computer vision with ConvNets 01:29:18.400 |
There is CNTK, as it used to be known, now called the Microsoft Cognitive Toolkit. 01:29:28.200 |
It has multi-GPU support and its own BrainScript 01:29:34.400 |
custom language, as well as other interfaces. 01:29:39.200 |
And what we'll get to play around in this class 01:29:41.600 |
is, amazingly, deep learning in the browser. 01:29:45.800 |
Our favorite is ConvNetJS, which you'll use, built by Andrej Karpathy. 01:29:54.000 |
It's good for explaining the basic concept of neural networks. 01:30:00.200 |
All you need is a browser, so very few requirements. 01:30:08.000 |
It has no GPU support, but for a lot of things that we're doing, you don't need GPUs. 01:30:10.600 |
You'll be able to train a network with very little 01:30:12.600 |
and relatively efficiently without the need of GPUs. 01:30:16.600 |
It has full support for CNNs, RNNs, and even deep reinforcement learning. 01:30:22.200 |
KerasJS, which seems incredible, we tried to use for this class, 01:30:28.200 |
but it didn't work out. It has GPU support, so it runs in the browser 01:30:34.200 |
with GPU support through WebGL, or however it works, magically. 01:30:39.800 |
But we're able to accomplish a lot of things we need without the use of GPUs. 01:30:44.000 |
So, it's incredible to live in a day and age when, literally, 01:30:52.200 |
as I'll show in the tutorials, it takes just a few minutes 01:30:56.000 |
to get started with building your own neural network that classifies images. 01:31:00.000 |
And a lot of these libraries are friendly in that way. 01:31:05.000 |
So, all the references mentioned in this presentation are available at this link 01:31:12.400 |
So, I think in the interest of time, let me wrap up. 01:31:19.000 |
And tomorrow I'll explain the deep reinforcement learning game 01:31:23.200 |
and the actual competition and how you can win it.