MIT 6.S094: Convolutional Neural Networks for End-to-End Learning of the Driving Task
Chapters
0:00 Intro
0:35 Illustrative Case Study: Traffic Light Detection
0:58 Deep Tesla: End-to-End Learning from Human and Autopilot Driving
2:11 Computer Vision is Machine Learning
4:46 Images are Numbers
6:39 Computer Vision is Hard
8:15 Image Classification Pipeline
8:50 Famous Computer Vision Datasets
10:25 Let's Build an Image Classifier for CIFAR-10
13:08 K-Nearest Neighbors: Generalizing the Image-Diff Classifier
19:15 Reminder: Weighing the Evidence
19:39 Reminder: Classify an Image of a Number
21:29 Reminder: "Learning" is Optimization of a Function
24:00 Convolutional Neural Networks: Layers
24:22 Dealing with Images: Local Connectivity
26:33 ConvNets: Spatial Arrangement of Output Volume
29:56 ConvNets: Pooling
32:32 Computer Vision: Object Recognition / Classification
34:37 Computer Vision: Segmentation
36:15 How Can Convolutional Neural Networks Help Us Drive?
36:32 Driving: The Numbers
37:41 Human at the Center of Automation
38:51 Distracted Humans
39:30 4 D's of Being Human: Drunk, Drugged, Distracted, Drowsy
39:51 In Context: Traffic Fatalities
43:03 Camera and Lens Selection
43:42 Semi-Autonomous Vehicle Components
44:18 Self-Driving Car Tasks
46:05 The Data
51:06 SLAM: Simultaneous Localization and Mapping
52:24 Visual Odometry in Parts
53:55 End-to-End Visual Odometry
55:53 Object Detection
56:23 Full Driving Scene Segmentation
57:48 Road Texture and Condition from Audio
59:02 Previous Approaches: Optimization-Based Control • Where Deep Learning Can Help: Reinforcement Learning
00:00:00.000 |
All right, welcome back everyone. Sound okay? All right. 00:00:08.440 |
So we talked a little bit about neural networks; 00:00:15.220 |
we started to talk about neural networks yesterday. 00:00:17.460 |
Today we'll continue to talk about neural networks that work with images, 00:00:24.780 |
convolutional neural networks and see how those types of networks can help us drive a car. 00:00:33.160 |
If we have time, we'll cover a simple illustrative case study of detecting traffic lights. 00:00:45.120 |
If we can't teach our neural networks to do that, we're in trouble. 00:00:50.040 |
But it's a good, clear, illustrative case study of a three-class classification problem. 00:01:01.560 |
Here, looped over and over in a very short GIF. 00:01:05.600 |
This is actually running live on a website right now. 00:01:08.920 |
We'll show it towards the end of the lecture. 00:01:14.880 |
DeepTesla is a neural network that learns to steer a vehicle based on the video of the forward roadway. 00:01:22.600 |
And once again, doing all of that in the browser using JavaScript. 00:01:27.520 |
So you'll be able to train your very own network to drive using real-world data. 00:01:38.800 |
We will also have a tutorial and code briefly described today at the end of lecture, if there's time, 00:01:52.120 |
So if you want to build a network that's bigger, deeper, 00:01:57.320 |
and you want to utilize GPUs to train that network, you want to not do it in your browser. 00:02:03.800 |
You want to do it offline using TensorFlow and having a powerful GPU on your computer. 00:02:13.680 |
So we talked about vanilla machine learning where the size of the input is small for the most part. 00:02:27.760 |
The number of neurons, in the case of neural networks, is on the order of 10, 100, 1000. 00:02:34.040 |
When you think of images, images are a collection of pixels. 00:02:38.520 |
One of the most iconic images from computer vision, in the bottom left there, is Lena. 00:02:44.160 |
I encourage you to Google it and figure out the story behind that image. 00:02:49.120 |
It was quite shocking when I found out recently. 00:02:55.800 |
So once again, computer vision is, these days, dominated by data-driven approaches, by machine learning. 00:03:07.360 |
Where all of the same methods that are used on other types of data are used on images, 00:03:18.440 |
where the input is just a collection of pixels. 00:03:21.520 |
And pixels are numbers from 0 to 255, discrete values. 00:03:28.440 |
So we can think exactly what we've talked about previously. 00:03:32.440 |
We could think of images in the same exact way. 00:03:38.640 |
We could do supervised learning where you have an input image and output label. 00:03:43.480 |
The input image here is a picture of a woman. 00:03:57.200 |
Again, the same goes for semi-supervised and reinforcement learning. 00:04:01.160 |
In fact, the Atari games we talked about yesterday do some pre-processing on the images. 00:04:08.680 |
They're using convolutional neural networks as we'll discuss today. 00:04:11.680 |
And the pipeline for supervised learning is again the same. 00:04:20.680 |
A machine learning algorithm performs feature extraction. 00:04:26.040 |
It trains, given the inputs and outputs, on the images and the labels of those images. 00:04:37.760 |
Accuracy is the term that's used to often describe how well a model performs. 00:04:43.080 |
I apologize for the constant presence of cats throughout this course. 00:04:51.160 |
I assure you this course is about driving, not cats. 00:05:03.000 |
As human beings, we're really good at taking in and interpreting visual information, 00:05:15.280 |
But a computer only sees numbers, RGB values for a colored image. 00:05:22.040 |
There's three values for every single pixel from 0 to 255. 00:05:26.840 |
And so given that image, we can think of two problems. 00:05:31.480 |
One is regression and the other is classification. 00:05:34.520 |
Regression is when given an image, we want to produce a real value output back. 00:05:41.000 |
So if we have an image of the forward roadway, 00:05:43.040 |
we want to produce a value for the steering wheel angle. 00:05:46.880 |
And if you have an algorithm that's really smart, 00:05:53.280 |
it can take that image and produce the perfectly correct steering angle 00:05:55.800 |
that drives the car safely across the United States. 00:05:58.520 |
We'll talk about how to do that and where that fails. 00:06:01.520 |
Classification is when the input again is an image 00:06:08.200 |
and the output is a class label, a discrete class label. 00:06:12.120 |
Underneath it, though, there is often still a regression problem: 00:06:18.720 |
you produce a probability that this particular image belongs to a particular category. 00:06:23.520 |
And we use a threshold to chop off the outputs associated with low probabilities 00:06:31.720 |
and take the labels associated with the high probabilities 00:06:35.520 |
and convert it into a discrete classification. 00:06:38.040 |
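As a rough illustration of that last step (not code from the lecture), here is a minimal numpy sketch that turns raw outputs into probabilities with a softmax and then into a discrete label; the three raw output values are made up:

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating, for numerical stability.
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

# Hypothetical raw outputs of a 3-class classifier (e.g. cat, dog, ship).
raw_outputs = np.array([2.0, 0.5, -1.0])
probs = softmax(raw_outputs)        # continuous values that sum to 1

# Convert the continuous probabilities into a discrete class label.
label = int(np.argmax(probs))       # index of the most probable class
is_confident = probs[label] > 0.5   # optional threshold on the probability
```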
I mentioned this yesterday but it bears saying again: 00:06:49.440 |
As human beings, we're really good at dealing with all these problems. 00:06:55.120 |
The object looks totally different in terms of the numbers behind the images, 00:07:00.040 |
in terms of the pixels when viewed from a different angle. 00:07:06.520 |
Objects look different when you're standing far away from them or up close. 00:07:11.080 |
We're good at handling the fact that they appear at different sizes. 00:07:20.760 |
We talked about occlusions and deformations with cats. 00:07:29.280 |
You have to separate the object of interest from the background 00:07:35.640 |
and given the three-dimensional structure of our world, 00:07:38.120 |
there's a lot of stuff often going on in the background. 00:07:44.760 |
There's intra-class variation that's often greater than inter-class variation, 00:07:47.400 |
meaning objects of the same type often have more variation among themselves 00:07:50.840 |
than relative to the objects that you're trying to separate them from. 00:08:06.760 |
Illumination varies, and the source of that light changes the way that object appears. 00:08:13.520 |
So the image classification pipeline is the same as I mentioned. 00:08:24.560 |
It's a classification problem so there's categories 00:08:30.560 |
You have a bunch of examples, image examples of each of those categories 00:08:34.960 |
and so the input is just those images paired with the category 00:08:39.640 |
and you train to map, to estimate a function that maps from the images to the categories. 00:08:58.200 |
there's a growing number of data sets but they're still relatively small. 00:09:07.120 |
They're not billions or trillions of images. 00:09:09.640 |
And these are the data sets that you will see if you read academic literature most often. 00:09:17.240 |
MNIST, the one that's been beaten to death and that we use as well in this course, 00:09:34.240 |
ImageNet, one of the largest image data sets, 00:09:42.240 |
has images with a hierarchy of categories from WordNet 00:09:52.560 |
with images associated with the words present in that hierarchy. 00:10:01.760 |
that are used to prove in a very efficient and quick way 00:10:06.120 |
offhand that your algorithm that you're trying to publish on 00:10:09.640 |
or trying to impress the world with works well. 00:10:17.960 |
and Places is a data set of natural scenes: woods, nature, city, and so on. 00:10:27.960 |
So let's look at CIFAR-10 as a data set of 10 categories, 00:10:34.520 |
They're shown there with sample images as the rows. 00:10:38.600 |
So let's build a classifier that's able to take images 00:10:44.280 |
from one of these 10 categories and tell us what is shown in the image. 00:10:51.520 |
Once again, all the algorithm sees is numbers. 00:11:03.680 |
we have to have an operator for comparing two images. 00:11:06.720 |
So given an image and I want to say if it's a cat or a dog, 00:11:10.120 |
I want to compare it to images of cats and compare it to images of dogs 00:11:18.480 |
Okay, so one way to do that is take the absolute difference 00:11:26.280 |
Take the difference between each individual pixel, 00:11:30.400 |
shown on the bottom of the slide for a 4x4 image 00:11:34.920 |
and then we sum that pixel-wise absolute difference into a single number. 00:11:42.200 |
So if the image is totally different pixel-wise, that'll be a high number. 00:11:46.880 |
If it's the same image, the number will be zero. 00:11:49.400 |
Oh, and it's the absolute value of the difference too; that's the L1 distance. 00:12:00.320 |
When we speak of distance, we usually mean L2 distance. 00:12:05.320 |
And so we can build a classifier that just 00:12:12.880 |
uses this operator to compare to every single image in the dataset 00:12:18.200 |
and say, I'm going to pick the category that's the closest using this comparative operator. 00:12:26.360 |
I'm going to find, I have a picture of a cat and I'm going to look through the dataset 00:12:31.240 |
and find the image that's the closest to this picture 00:12:33.920 |
and say that is the category that this picture belongs to. 00:12:37.600 |
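A minimal numpy sketch of that image-diff classifier, assuming the images are already loaded as equally sized arrays (illustrative only, not the course code):

```python
import numpy as np

def l1_distance(img_a, img_b):
    # Pixel-wise absolute difference, summed into a single number.
    # Identical images give 0; very different images give a large number.
    return np.sum(np.abs(img_a.astype(np.int32) - img_b.astype(np.int32)))

def nearest_neighbor_label(query_img, train_images, train_labels):
    # Compare the query against every training image and return the label
    # of the closest one (1-nearest-neighbor with L1 distance).
    distances = [l1_distance(query_img, img) for img in train_images]
    return train_labels[int(np.argmin(distances))]
```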
So if we just flip the coin and randomly pick which category an image belongs to, 00:12:43.720 |
we get that accuracy would be on average 10%. It's random. 00:12:49.520 |
The accuracy we achieve with our brilliant image difference algorithm 00:12:58.280 |
that just goes to the dataset and finds the closest one is 38%. 00:13:10.600 |
So you can think about this operation of looking through the dataset 00:13:13.840 |
and finding the closest image as what's called k nearest neighbors. 00:13:19.880 |
Where k in that case is 1, meaning you find the one closest neighbor to this image 00:13:25.520 |
that you're asking a question about and accept the label from that image. 00:13:33.280 |
Increasing k to 2 means you take the two nearest neighbors. 00:13:39.560 |
You find the two closest in terms of pixel wise image difference, 00:13:44.040 |
images to this particular query image and find which category do those belong to. 00:13:51.120 |
What's shown up top on the left is the dataset we're working with, red, green, blue. 00:13:58.680 |
What's shown in the middle is the one nearest neighbor classifier, 00:14:04.280 |
meaning this is how you segment the entire space of different things that you can compare. 00:14:10.640 |
And if a point falls into any of these regions, 00:14:16.800 |
it will immediately be assigned by the nearest neighbor algorithm to the class of that region. 00:14:23.680 |
With five nearest neighbors, there's immediately an issue. 00:14:31.200 |
The issue is that there are white regions, there are tie-breakers, 00:14:34.280 |
where your five closest neighbors are from various categories. 00:14:43.320 |
So this is a good example of parameter tuning. 00:14:52.640 |
Your task as a teacher of machine learning 00:15:00.560 |
is to teach this algorithm how to do your learning for you. 00:15:06.800 |
That's called parameter tuning, or hyperparameter tuning as it's called in neural networks. 00:15:12.800 |
And so on the bottom right of the slide is, on the x-axis is k 00:15:23.880 |
And on the y-axis is classification accuracy. 00:15:28.000 |
It turns out that the best k for this data set is 7, 7 nearest neighbors. 00:15:42.520 |
And I should say that the way we get that number, as we do with 00:15:48.400 |
a lot of the machine learning pipeline, 00:15:53.920 |
is that you separate the data set into the parts you use for training and the parts you use for testing. 00:16:02.280 |
You're not allowed to touch the testing part. 00:16:06.520 |
You construct your model of the world on the training data set 00:16:14.080 |
where you take a small part of the training data, shown in fold 5 there in yellow, 00:16:26.360 |
and then use it as part of the hyperparameter tuning. 00:16:31.640 |
As you train, figure out with that yellow part, fold 5, how well you're doing. 00:16:38.840 |
And then you choose a different fold and see how well you're doing 00:16:42.640 |
and keep playing with parameters, never touching the test part. 00:16:46.240 |
And when you're ready, you run the algorithm on the test data 00:16:50.160 |
to see how well you really do, how well it really generalizes. 00:16:54.360 |
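A minimal sketch of that fold-based tuning loop, here using scikit-learn's k-NN with L1 (Manhattan) distance rather than the course code; the candidate k values and fold count are arbitrary:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def choose_k(images, labels, candidate_ks=(1, 3, 5, 7, 9), num_folds=5):
    # Flatten each image into a vector of pixel values.
    x = images.reshape(len(images), -1)
    folds_x = np.array_split(x, num_folds)
    folds_y = np.array_split(labels, num_folds)
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        fold_accs = []
        for i in range(num_folds):
            # Fold i is held out for validation; the rest is used for training.
            train_x = np.concatenate([f for j, f in enumerate(folds_x) if j != i])
            train_y = np.concatenate([f for j, f in enumerate(folds_y) if j != i])
            clf = KNeighborsClassifier(n_neighbors=k, metric="manhattan")
            clf.fit(train_x, train_y)
            fold_accs.append(clf.score(folds_x[i], folds_y[i]))
        if np.mean(fold_accs) > best_acc:
            best_k, best_acc = k, float(np.mean(fold_accs))
    return best_k  # the held-out test set is only touched after this choice
```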
Is there any way to determine intuition of what a good K may be? 00:16:57.800 |
Or do you just have to run through all the data? 00:17:01.160 |
is there any good intuition behind what a good K is? 00:17:04.600 |
There's general rules for different data sets, 00:17:07.040 |
but usually you just have to run through it, grid search, brute force. 00:17:18.240 |
So each pixel is like one number or is it two numbers? 00:17:21.800 |
Yes, the question was, is each pixel one number or three numbers? 00:17:28.080 |
For majority of computer vision throughout its history, 00:17:31.960 |
you use grayscale images, so it's one number. 00:17:36.040 |
And there's sometimes a depth value too, so it's four numbers. 00:17:39.920 |
So it's, if you have a stereo vision camera that gives you the depth information of the pixels, 00:17:46.480 |
And then if you stack two images together, it could be six. 00:17:50.320 |
In general, everything we work with will be three numbers for a pixel. 00:18:00.240 |
So the question was, for the absolute value, it's just one number. 00:18:07.360 |
So that's, you know, this algorithm is pretty good. 00:18:12.640 |
If we optimize the hyperparameters of this algorithm 00:18:21.720 |
and choose K of seven, it seems to work well for this particular CIFAR-10 dataset. 00:18:32.280 |
Human beings perform at about 94, slightly above 94% accuracy for CIFAR-10. 00:18:40.880 |
So given an image, it's a tiny image, I should clarify, it's like a little icon. 00:18:46.440 |
Given that image, human beings are able to determine accurately one of the 10 categories with 94% accuracy. 00:18:55.240 |
And the current state of the art with convolutional neural networks is about 95.4% accuracy. 00:19:07.560 |
But the most important, the critical fact here is, it's recently surpassed humans 00:19:12.920 |
and certainly surpassed the K-Nearest Neighbors algorithm. 00:19:22.560 |
It all still boils down to this little guy, the neuron. 00:19:27.800 |
That sums the weights of its inputs, adds a bias, produces an output, 00:19:33.040 |
based on an activation, a smooth activation function. 00:19:39.640 |
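That single-neuron computation, as a tiny illustrative numpy sketch (with a sigmoid as the smooth activation; the input values are made up):

```python
import numpy as np

def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs plus a bias, passed through a smooth
    # activation function (a sigmoid here).
    z = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-z))

print(neuron_output(np.array([0.5, 0.2]), np.array([0.8, -0.4]), bias=0.1))
```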
The question was, you take a picture of a cat so you know it's a cat, 00:19:56.040 |
So you have to write as a caption, "This is my cat." 00:19:59.360 |
And then the unfortunate thing, given the internet and how witty it is, 00:20:06.280 |
Because maybe you're just being clever and it's not a cat at all. 00:20:28.160 |
do convolutional neural networks generally do better than nearest neighbors? 00:20:31.280 |
There's very few problems on which neural networks don't do better. 00:20:37.480 |
Yes, they almost always do better, except when you have almost no data. 00:20:43.880 |
And convolutional neural networks isn't some special magical thing. 00:20:49.880 |
It's just neural networks with some cheating up front that I'll explain. 00:20:55.280 |
Some tricks to try to reduce the size and make it capable to deal with images. 00:21:01.120 |
So again, the input, in the case that we looked at, 00:21:06.640 |
as opposed to doing some fancy convolutional tricks, 00:21:10.720 |
is just the entire 28x28 pixel image; that's 784 pixels as the input. 00:21:19.960 |
That's 784 neurons in the input, 15 neurons in the hidden layer, and 10 in the output. 00:21:28.640 |
Now everything we'll talk about has the same exact structure, nothing fancy. 00:21:37.520 |
where you take an input image and produce an output classification. 00:21:41.240 |
And there's a backward pass through the network, 00:21:43.880 |
through back propagation, where you adjust the weights 00:21:48.000 |
when your prediction doesn't match the ground truth output. 00:21:52.960 |
And learning just boils down to optimization. 00:21:58.640 |
It's just optimizing a smooth, differentiable function 00:22:09.360 |
that measures the difference between the true output and the one you actually got. 00:22:14.360 |
So what's the difference? What are convolutional neural networks? 00:22:32.160 |
They're networks designed for inputs that have some spatial meaning in them, like images. 00:22:37.560 |
There's other things, you can think of the dimension of time 00:22:42.440 |
and you can input audio signal into a convolutional neural network. 00:22:47.880 |
And so for every single layer 00:22:54.840 |
that's a convolutional layer, the input is a 3D volume. 00:22:59.720 |
I'm simplifying because you can call it 4D too, but it's 3D. 00:23:08.760 |
The height and the width is the width and the height of the image. 00:23:11.800 |
And then the depth for a grayscale image is 1, 00:23:20.400 |
for a 10 frame video of grayscale images, the depth is 10. 00:23:31.280 |
And the only thing that a convolutional layer does 00:23:38.840 |
is take a 3D volume as input and produce a 3D volume as output, 00:23:47.640 |
with some function operating on the sum of the inputs 00:23:51.160 |
that may or may not have parameters that you tune, that you try to optimize. 00:24:00.400 |
So Lego pieces that you stack together in the same way as we talked about before. 00:24:05.560 |
So what are the types of layers that a convolutional neural network have? 00:24:23.040 |
The convolutional layer takes advantage of the spatial relationships of the input neurons. 00:24:33.920 |
It's the same exact neuron as for a fully connected network, 00:24:41.920 |
But it just has a narrower receptive field, it's more focused. 00:24:45.960 |
The inputs to a neuron on the convolutional layer 00:24:51.920 |
come from a specific region from the previous layer. 00:24:55.200 |
And the parameters on each filter are shared; you can think of this as a filter 00:25:00.480 |
because you slide it across the entire image. 00:25:08.840 |
So if you think about two layers, 00:25:12.360 |
as opposed to connecting every single pixel in the first layer 00:25:16.400 |
to every single neuron in the following layer, 00:25:20.520 |
you only connect the neurons in the input layer that are close to each other to each output neuron. 00:25:28.120 |
And then you enforce the weights to be tied together spatially. 00:25:41.920 |
Every single layer of the output, you could think of as a filter, 00:25:49.400 |
and when it sees this particular kind of edge in the image, 00:25:54.920 |
it'll get excited whether that edge is in the top left of the image or anywhere else. 00:26:00.920 |
The assumption there is that a powerful feature 00:26:07.640 |
for detecting a cat is just as important no matter where in the image it is. 00:26:12.760 |
And this allows you to cut away a huge number of connections between neurons. 00:26:21.120 |
But it still boils down on the right as a neuron that sums 00:26:28.320 |
a collection of inputs and applies weights to them. 00:26:40.920 |
The size of the output volume relative to the input volume is controlled by three things: depth, stride, and padding. 00:26:46.680 |
So for every single "filter", you'll get an extra channel on the output. 00:26:54.520 |
So let's talk about the very first layer: if you apply 10 filters to the input image, 00:27:15.680 |
the resulting number of stacked channels in the output will be 10. 00:27:23.400 |
Stride is the step size of the filter that you slide along the image. 00:27:33.000 |
Oftentimes that's just one or three and that directly reduces 00:27:38.960 |
the spatial size, the width and the height of the output image. 00:27:44.200 |
And then there's a convenient thing that's often done, which is padding, 00:27:53.240 |
so that the input and the output have the same height and width. 00:28:05.400 |
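The standard formula for the output width or height is (W - F + 2P) / S + 1, where W is the input size, F the filter size, S the stride, and P the padding. A minimal, purely illustrative sketch:

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    # Spatial size (width or height) of a convolutional layer's output.
    return (input_size - filter_size + 2 * padding) // stride + 1

# Example: a 32x32 input with 5x5 filters, stride 1 and padding 2 stays 32x32;
# with 10 filters, the output volume would be 32 x 32 x 10.
assert conv_output_size(32, 5, stride=1, padding=2) == 32
```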
I encourage you to kind of maybe offline think about what's happening. 00:28:14.760 |
Crudely so, if there's any experts in the audience. 00:28:19.320 |
So the input here on the left is a collection of numbers, 0, 1, 2. 00:28:43.440 |
Those filters shown in red are the different weights applied on those filters. 00:28:49.320 |
And each of the filters have a depth just like the input, a depth of 3. 00:29:00.840 |
Yeah, and so you slide that filter along the image, keeping the weights the same. 00:29:11.400 |
And so your first filter, you pick the weights. 00:29:17.440 |
You pick the weights in such a way that it fires, it gets excited, at useful features. 00:29:25.160 |
And then the second filter fires for other useful features. 00:29:32.920 |
The output is a positive number when there's a strong feature in that region. 00:29:44.280 |
This allows for drastic reduction in the parameters. 00:29:48.120 |
And so you can deal with inputs that are a thousand by a thousand pixel image, for example, or video. 00:30:04.600 |
That means there's a spatial invariance to the features you're detecting. 00:30:08.800 |
It allows you to learn from arbitrary images. 00:30:11.320 |
So you don't have to be concerned about pre-processing the images in some clever way. 00:30:31.360 |
Pooling is an operation for taking a collection of outputs and choosing, for example, the max one, 00:30:40.400 |
such that the output of the pooling operation is much smaller than the input. 00:30:50.000 |
Because the justification there is that you don't need a high resolution localization 00:31:01.560 |
of exactly where which pixel is important in the image according to, you know, 00:31:07.920 |
you don't need to know exactly which pixel is associated with the cat ear or, you know, a cat face. 00:31:14.880 |
As long as you kind of know it's around that part. 00:31:18.680 |
And that reduces a lot of complexity in the operations. 00:31:32.280 |
So, pooling is a very crude operation that doesn't have any... 00:31:43.800 |
One thing you need to know is it doesn't have any parameters that are learnable. 00:31:49.000 |
So you can't learn anything clever about pooling. 00:32:04.760 |
There's an argument that you're not, you know, losing that much information 00:32:09.200 |
as long as you're not pooling the entire image into a single value. 00:32:13.000 |
But you're gaining training efficiency, you're reducing the memory size, 00:32:22.400 |
So it's definitely a thing that people debate 00:32:26.560 |
and it's a parameter that you play with to see what works for you. 00:32:30.480 |
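A minimal numpy sketch of the most common variant, 2x2 max pooling with stride 2 (illustrative only, not the course code):

```python
import numpy as np

def max_pool_2x2(feature_map):
    # Downsample a single-channel feature map by taking the maximum of each
    # non-overlapping 2x2 block; height and width must be even here.
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))  # 2x2 output; each value is the max of a 2x2 region
```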
Okay, so how does this thing look like as a whole, a convolutional neural network? 00:32:39.520 |
The input is an image. There's usually a convolutional layer. 00:32:43.600 |
There is a pooling operation, another convolutional layer, another pooling operation and so on. 00:32:52.160 |
At the very end, if the task is classification, 00:32:57.360 |
after the stack of convolutional layers and pooling layers, there are fully connected layers. 00:33:05.120 |
So you go from those, the spatial convolutional operations 00:33:11.320 |
to fully connecting every single neuron in a layer to the following layer. 00:33:15.200 |
And you do this so that by the end, you have a collection of neurons. 00:33:19.800 |
Each one is associated with a particular class. 00:33:22.920 |
So in what we looked at yesterday, the input is an image of a number, 0 through 9. 00:33:34.240 |
So you boil down that image with a collection of convolutional layers 00:33:40.960 |
with one or two or three fully connected layers at the end that all lead to 10 neurons. 00:33:47.760 |
And each of those neurons' job is to get fired up when it sees a particular number 00:33:57.000 |
and for the other ones to produce a low probability. 00:34:00.360 |
And so this kind of process is how you get the 95% accuracy on the CIFAR-10 problem. 00:34:10.720 |
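A minimal sketch of that conv / pool / fully-connected stack using the Keras API; the layer sizes are made up for illustration and are not the architecture from the lecture:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# A small convolutional classifier for 32x32 RGB images and 10 classes
# (CIFAR-10-like). Layer sizes are illustrative, not tuned.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),  # one output neuron per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=10, validation_split=0.1)
```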
This here is ImageNet dataset that I mentioned. 00:34:14.480 |
It's how you take this image of a leopard, of a container ship, 00:34:19.120 |
and produce a probability that that is a container ship or a leopard. 00:34:25.480 |
Also shown there are the outputs of the other nearest neurons in terms of their confidence. 00:34:32.040 |
Now you can use the same exact operation by chopping off the fully connected layer at the end 00:34:44.280 |
and as opposed to mapping from an image to a prediction of what's contained in the image, 00:34:54.560 |
you map from an image to an image, and you can train that output image to be one that gets excited spatially. 00:35:02.320 |
Meaning it gives you a high close to one value for areas of the image that contain the object of interest 00:35:11.840 |
and then a low number for areas of the image that are unlikely to contain that object. 00:35:20.880 |
And so from this you can go on the left an original image of a woman on a horse 00:35:25.200 |
to a segmented image of knowing where the woman is and where the horse is and where the background is. 00:35:32.440 |
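A minimal illustration of that last step, assuming the network already produces a per-pixel score for each class (purely illustrative, not SegNet code; the class layout and scores are hypothetical):

```python
import numpy as np

# Hypothetical per-pixel class scores from a convolutionalized network:
# shape (height, width, num_classes), e.g. 3 classes: background, person, horse.
scores = np.random.rand(240, 320, 3)

# The segmentation mask is just the most likely class at every pixel.
mask = np.argmax(scores, axis=-1)   # shape (240, 320), values in {0, 1, 2}
horse_pixels = (mask == 2)          # boolean map of where the "horse" class wins
```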
The same process can be done for detecting the object. 00:35:38.800 |
So you can segment the scene into a bunch of interesting objects, 00:35:45.840 |
candidates for interesting objects and then go through those candidates one by one 00:35:53.320 |
and perform the same kind of classification as in the previous step 00:35:56.560 |
where it's just an input as an image and the output is a classification. 00:36:00.280 |
And through this process of hopping around an image, 00:36:03.440 |
you can figure out exactly where is the best way to segment the cow out of the image. 00:36:11.560 |
Okay, so how can these magical convolutional neural networks help us in driving? 00:36:22.560 |
This is a video of the forward roadway from a data set that we'll look at, 00:36:37.560 |
The general driving task from the human perspective. 00:36:41.600 |
On average, an American driver in the United States drives 10,000 miles a year. 00:36:51.160 |
A little more for rural, a little less for urban. 00:36:54.720 |
There are about 30,000 fatal crashes and 32,000-plus, sometimes as high as 38,000, fatalities a year. 00:37:06.800 |
This includes car occupants, pedestrians, bicyclists and motorcycle riders. 00:37:14.360 |
This may be a surprising fact but in a class on self-driving cars, 00:37:22.760 |
we should remember that, so ignore the 59.9% that's other. 00:37:28.120 |
The most popular cars in the United States are pickup trucks. 00:37:36.560 |
It's an important point that we're still married to wanting to be in control. 00:37:48.480 |
And so one of the interesting cars that we look at 00:37:56.080 |
and the car that is the data set that we provide to the class is collected from is a Tesla. 00:38:03.760 |
It's the one that comes at the intersection of the Ford F-150 00:38:07.840 |
and the cute little Google self-driving car on the right. 00:38:11.040 |
It's fast, it allows you to have a feeling of control 00:38:16.960 |
but it can also drive itself for hundreds of miles on the highway if need be. 00:38:22.040 |
It allows you to press a button and the car takes over. 00:38:28.240 |
It's a fascinating trade-off of transferring control from the human to the car. 00:38:33.520 |
It's a transfer of trust and it's a chance for us to study the psychology 00:38:41.080 |
of human beings as they relate to machines at 60 plus miles an hour. 00:38:48.560 |
In case you're not aware, a little summary of human beings. 00:38:57.560 |
We'd like to text, use the smartphone, watch videos, groom, talk to passengers, eat, drink. 00:39:08.960 |
Texting, 169 billion texts were sent in the US every single month in 2014. 00:39:20.160 |
On average, five seconds are spent with eyes off the road while texting. 00:39:27.720 |
That's the opportunity for automation to step in. 00:39:34.000 |
More than that, there's what NHTSA refers to as the four D's. 00:39:43.000 |
Each one of those is an opportunity for automation to step in. 00:39:48.240 |
Drunk driving stands to benefit significantly from automation, perhaps. 00:39:55.320 |
So the miles, let's look at the miles, the data. 00:40:09.040 |
And Tesla Autopilot, our case study for this class, compared to us as human beings: 00:40:20.520 |
it has driven by itself 300 million miles as of December 2016. 00:40:27.440 |
And the fatality rate for human-controlled vehicles is one in 90 million miles. 00:40:40.760 |
And currently in Tesla, under Tesla Autopilot, there's one fatality. 00:40:47.200 |
There's a lot of ways you can tear that statistic apart, but it's one to think about. 00:40:50.880 |
Already, perhaps, automation results in safer driving. 00:40:57.040 |
The thing is, we don't understand automation because we don't have the data. 00:41:05.120 |
We don't have the data on the forward roadway video. 00:41:13.640 |
And we just don't have that many cars on the road today that drive themselves. 00:41:20.440 |
We'll provide some of it to you in the class. 00:41:23.440 |
And as part of our research at MIT, we're collecting huge amounts of it, 00:41:31.840 |
And collecting that data is how we get to understanding. 00:41:38.880 |
So, talking about the data and what we'll be training our algorithms on: 00:41:53.800 |
They have collected over 5,000 hours and 70,000 miles. 00:41:58.080 |
And I'll talk about the cameras that we put in them. 00:42:03.000 |
We're collecting video of the forward roadway. 00:42:08.560 |
This is a highlight of a trip from Boston to Florida of one of the people driving a Tesla. 00:42:13.920 |
What's also shown in blue is the amount of time that Autopilot was engaged. 00:42:21.760 |
Currently zero minutes and then it grows and grows. 00:42:26.880 |
For prolonged periods of time, so hundreds of miles, people engage Autopilot. 00:42:31.800 |
Out of 1.3 billion miles driven in a Tesla, 300 million are in Autopilot. 00:42:41.880 |
So we are collecting data of the forward roadway, of the driver. 00:42:50.720 |
What we're providing with the class is epochs of time of the forward roadway, for privacy considerations. 00:43:04.280 |
Cameras used to record are your regular webcam, the workhorse of the computer vision community, the C920. 00:43:13.160 |
And we have some special lenses on top of it. 00:43:19.200 |
Nothing that costs 70 bucks can be that good, right? 00:43:23.040 |
What's special about them is that they do onboard compression 00:43:29.120 |
and allow you to collect huge amounts of data and use reasonably sized storage capacity 00:43:40.080 |
to store that data and train your algorithms on. 00:43:42.400 |
So what on the self-driving side do we have to work with? 00:43:55.920 |
There is the sensors, radar, lidar, vision, audio, all looking outside, 00:44:04.280 |
helping you detect the objects in the external environment to localize yourself and so on. 00:44:09.800 |
And there's the sensors facing inside, visible light camera, audio again, and infrared camera to help detect pupils. 00:44:18.600 |
So we can decompose the self-driving car task into four steps. 00:44:25.120 |
Localization, answering where am I, scene understanding, 00:44:30.080 |
using the texture of the information of the scene around 00:44:34.640 |
to interpret the identity of the different objects in the scene 00:44:40.160 |
and the semantic meaning of those objects of their movement. 00:44:45.760 |
There's movement planning, once you figured all that out, found all the pedestrians, found all the cars, 00:44:53.680 |
how do I navigate through this maze, a clutter of objects in a safe and legal way. 00:45:02.920 |
And there's driver state: how do I detect, using video of the driver or other information, 00:45:08.840 |
their emotional state or their distraction level. 00:45:25.080 |
Lidar is the sensor that provides you the 3D point cloud of the external scene. 00:45:32.440 |
So lidar is a technology used by most folks working with self-driving cars 00:45:42.240 |
to give you a strong ground truth of the objects. 00:45:47.360 |
It's probably the best sensor we have for getting 3D information, 00:45:52.160 |
the least noisy 3D information about the external environment. 00:46:10.040 |
One of the most amazing things about this vehicle is that 00:46:14.440 |
the updates to autopilot come in the form of software. 00:46:18.320 |
So the amount of time it's available, it changes. 00:46:24.040 |
But this is one of the earlier versions, and it shows, in the second line in yellow, 00:46:32.480 |
how often the autopilot was available but not turned on. 00:46:37.120 |
So the total driving time was 10 hours, and Autopilot was available for 7 hours. 00:46:44.360 |
This particular person is a responsible driver, 00:46:49.000 |
or a more cautious driver; what you see is that it's raining. 00:46:56.920 |
The comment was that you shouldn't trust that one fatality number as an indication of safety 00:47:06.320 |
because the drivers elect to only engage the system when it's safe to do so. 00:47:15.520 |
There are a lot of bigger arguments about that number than just that one. 00:47:28.040 |
So maybe we can trust human beings to engage, you know, 00:47:34.280 |
despite the poorly filmed YouTube videos, despite the hype in the media, 00:47:40.040 |
you're still a human being riding at 60 miles an hour in a metal box with your life on the line. 00:47:46.040 |
You won't engage the system unless you know it's completely safe, 00:47:54.080 |
It's not all the stuff you see where a person gets in the back of a Tesla and starts sleeping 00:48:02.760 |
The reality is when it's just you in the car, it's still your life on the line. 00:48:06.720 |
And so you're going to do the responsible thing unless perhaps you're a teenager and so on 00:48:10.840 |
but that never changes no matter what you're in. 00:48:13.040 |
The question was what do you need to see or sense about the external environment 00:48:23.200 |
Do you need lane markings? Do you need other... 00:48:25.520 |
What are the landmarks based on which you do the localization and the navigation? 00:48:32.720 |
So with Google self-driving car in sunny California, 00:48:37.320 |
it depends on LiDAR to, in a high-resolution way, map the environment 00:48:42.920 |
in order to be able to localize itself based on LiDAR. 00:48:47.480 |
And LiDAR, now I don't know the details of exactly where LiDAR fails 00:48:53.960 |
but it's not good with rain, it's not good with snow, 00:48:58.240 |
it's not good when the environment is changing. 00:49:01.080 |
So what snow does is it changes the visual, the appearance, 00:49:05.720 |
the reflective texture of the surfaces around. 00:49:07.960 |
Us human beings are still able to figure stuff out 00:49:10.880 |
but a car that's relying heavily on LiDAR won't be able to localize itself 00:49:16.040 |
using the landmarks it previously has detected 00:49:19.880 |
because they look different now with the snow. 00:49:21.640 |
Computer vision can help us with lanes or following a car. 00:49:30.760 |
The two landmarks that we use are the lane markings and the car in front of you. 00:49:36.640 |
That's the nice thing about our roadways: they're designed for human eyes. 00:49:41.640 |
So you can use computer vision for lanes and for cars in front to follow them. 00:49:47.160 |
And there is radar that's a crude but reliable source of distance information 00:49:54.360 |
that allows you to not collide with metal objects. 00:49:58.600 |
So all of that together depending on what you want to rely on more 00:50:04.640 |
The question is, when the messy complexity of real life occurs, 00:50:13.480 |
how reliable will it be in the urban environment and so on. 00:50:26.200 |
So first let's just quick summary of visual odometry. 00:50:33.800 |
It's using a monocular or stereo input of video images 00:50:44.160 |
to estimate the position and the orientation, in this case of a vehicle, in the frame of the world. 00:50:51.280 |
And all you have to work with is a video of the forward roadway 00:50:55.280 |
and with stereo you get a little extra information of how far away different objects are. 00:51:03.120 |
And so this is where one of our speakers on Friday will talk about his expertise, 00:51:14.520 |
This is a very well studied and understood problem 00:51:17.680 |
of detecting unique features in the external scene 00:51:25.040 |
and localizing yourself based on the trajectory of those unique features. 00:51:31.480 |
When the number of features is high enough, it becomes an optimization problem. 00:51:36.320 |
You know this particular lane moved a little bit from frame to frame, 00:51:40.080 |
you can track that information and fuse everything together 00:51:44.800 |
in order to be able to estimate your trajectory through the three-dimensional space. 00:51:53.320 |
You have GPS, which is pretty accurate, not perfect but pretty accurate. 00:51:59.560 |
It's another signal to help you localize yourself. 00:52:01.840 |
You also have an IMU; the accelerometer tells you your acceleration. 00:52:07.680 |
From the gyroscope and the accelerometer, you have six-degree-of-freedom movement information 00:52:17.240 |
about how the moving object, the car, is navigating through space. 00:52:24.160 |
So you can do that using the old school way of optimization 00:52:34.600 |
given a unique set of features like SIFT features. 00:52:40.400 |
And that step involves, with stereo input, undistorting and rectifying the images. 00:52:47.920 |
You have two images, you have to, from the two images, compute the depth map. 00:52:51.960 |
So for every single pixel, computing your best estimate of the depth of that pixel, 00:52:57.600 |
the three-dimensional position relative to the camera. 00:53:03.720 |
Then you compute, that's where you compute the disparity map, that's what that's called. 00:53:13.080 |
Then you detect unique interesting features in the scene. 00:53:17.720 |
SIFT is a popular one, is a popular algorithm for detecting unique features. 00:53:22.760 |
And then you, over time, track those features. 00:53:25.600 |
And that tracking is what allows you to, through the vision alone, 00:53:30.600 |
to get information about your trajectory through three-dimensional space. 00:53:37.120 |
There's a lot of assumptions. Assumptions that bodies are rigid. 00:53:40.560 |
So you have to figure out what to do if a large object passes right in front of you; 00:53:49.600 |
you have to figure out the mobile objects in the scene and those that are stationary. 00:54:00.920 |
Or you can cheat, what we'll talk about, and do it using neural networks, end-to-end. 00:54:10.800 |
And this will come up a bunch of times throughout this class and today. 00:54:14.160 |
End-to-end means, and I refer to it as cheating because 00:54:19.520 |
it takes away a lot of the hard work of hand engineering features. 00:54:30.000 |
In this case, it's taking stereo input from a stereo vision camera. 00:54:35.720 |
So two images, a sequence of two images coming from a stereo vision camera. 00:54:39.680 |
And the output is an estimate of your trajectory through space. 00:54:47.000 |
So as opposed to doing the hard work of SLAM, of detecting unique features, 00:54:51.160 |
of localizing yourself, of tracking those features and figuring out what your trajectory is, 00:54:56.640 |
you simply train the network with some ground truth that you have from a more accurate sensor like LiDAR. 00:55:03.480 |
And you train it on a set of inputs, the stereo vision inputs. 00:55:11.680 |
You have separate convolutional neural networks for the velocity and for the orientation. 00:55:28.240 |
SLAM is one of the places where deep learning has not been able to outperform the previous approaches. 00:55:36.240 |
Where deep learning really helps is the scene understanding part. 00:55:44.320 |
It's detecting the various parts of the scene, segmenting them, 00:55:50.520 |
and with optical flow, determining their movement. 00:55:54.080 |
So previous approaches for detecting objects, 00:55:58.280 |
like the traffic signal classification detection that we have the TensorFlow tutorial for, 00:56:08.360 |
were to use Haar-like features or other types of features that are hand-engineered from the images. 00:56:18.800 |
Now we can use convolution neural networks to replace the extraction of those features. 00:56:24.040 |
And there's a TensorFlow implementation of SegNet, 00:56:34.040 |
which is taking the exact same neural network that I talked about. 00:56:39.600 |
Just the beauty is you just apply similar types of networks to different problems. 00:56:47.400 |
And depending on the complexity of the problem, it can get quite amazing performance. 00:56:52.560 |
In this case, we convolutionalize the network, meaning the output is an image, 00:57:04.320 |
where the colors indicate your best pixel by pixel estimate of what object is in that part. 00:57:10.000 |
This is not using any spatial information, it's not using any temporal information. 00:57:16.320 |
So it's processing every single frame separately. 00:57:19.760 |
And it's able to separate the road from the trees, from the pedestrians, other cars and so on. 00:57:29.200 |
This is intended to lie on top of a radar/lidar type of technology 00:57:37.680 |
that's giving you the three-dimensional or stereo vision, 00:57:40.360 |
three-dimensional information about the scene. 00:57:42.840 |
You're sort of painting that scene with the identity of the objects that are in it, 00:57:49.800 |
This is something I'll talk about tomorrow: recurrent neural networks. 00:57:56.120 |
And we can use recurrent neural networks that work with temporal data 00:58:05.320 |
In this case, we can process what's shown on the bottom is a spectrogram of audio 00:58:20.640 |
and process it in a temporal way using recurrent neural networks. 00:58:27.200 |
Just slide it across and keep feeding it to a network. 00:58:31.560 |
And it does incredibly well on the simple tasks, certainly, of dry road versus wet road. 00:58:38.080 |
This is an important, a subtle but very important task and there's many like it. 00:58:44.160 |
To know the texture, the quality, the characteristics of the road: 00:58:53.760 |
When it's not raining but the road is still wet, that information is very important. 00:59:02.360 |
The same kind of approach, shown on the right, is work from one of our other speakers, Sertac Karaman. 00:59:14.040 |
The same approach we're using to solve traffic through friendly competition 00:59:22.480 |
is the same that we can use for what Chris Gerdes does with his race cars 00:59:29.960 |
for planning trajectories in high-speed movement along complex curves. 00:59:38.000 |
So we can solve that problem using optimization, 00:59:46.560 |
or we can use it with reinforcement learning by running 00:59:50.320 |
tens of millions, hundreds of millions of times through that simulation of taking that curve 00:59:56.600 |
and learning which trajectory both optimizes the speed at which you take the turn 01:00:05.920 |
Exactly the same thing that you're using for traffic. 01:00:10.840 |
And for driver state, this is what we'll talk about next week, 01:00:15.480 |
is all the fun face stuff, eyes, face, emotion. 01:00:21.520 |
This is, we have video of the driver, video of the driver's body, video of the driver's face. 01:00:26.840 |
On the left is one of the TAs in his younger days. 01:00:36.120 |
So that's, in that particular case, you're doing one of the easier problems 01:00:47.160 |
which is one of detecting where the head and the eyes are positioned. 01:00:51.560 |
The head and eye pose in order to determine what's called the gaze of the driver, 01:00:59.760 |
And so shown, and we'll talk about these problems, from the left to the right, 01:01:05.640 |
on the left and green are the easier problems, 01:01:08.920 |
on the red are the harder from the computer vision aspect. 01:01:15.720 |
The larger the object, the easier it is to detect and the orientation of it is easier to detect. 01:01:20.160 |
And then there is pupil diameter, detecting the pupil, 01:01:23.920 |
the characteristics, the position, the size of the pupil. 01:01:28.640 |
And there's micro saccades, things that happen at one millisecond frequency, 01:01:35.000 |
All important information to determine the state of the driver. 01:01:41.920 |
Some are possible with computer vision, some are not. 01:01:44.400 |
This is something that we'll talk about, I think on Thursday, 01:01:51.280 |
is the detection of where the driver is looking. 01:01:54.520 |
So this is a bunch of the cameras that we have in the Tesla. 01:01:58.520 |
This is Dan driving a Tesla, and we're detecting exactly where he's looking, out of one of six regions. 01:02:05.000 |
We've converted it into a classification problem of left, right, rear view mirror, 01:02:10.520 |
instrument cluster, center stack or forward roadway. 01:02:13.080 |
So we have to determine, out of those six categories, which one the driver is looking at. 01:02:20.800 |
We don't care exactly the XYZ position of where the driver is looking at. 01:02:25.280 |
We care that they're looking at the road or not. 01:02:27.400 |
Are they looking at their cell phone in their lap or are they looking at the forward roadway? 01:02:30.920 |
And we'll be able to answer that pretty effectively using convolutional neural networks. 01:02:35.600 |
You can also look at emotion, using CNNs to, again, convert 01:02:54.760 |
the complex world of emotion into a binary problem of frustrated versus satisfied. 01:03:02.120 |
This is a video of drivers interacting with a voice navigation system. 01:03:07.240 |
If you've ever used one, you know, it may be a source of frustration from folks. 01:03:14.520 |
This is one of the hard ones. Driver emotion, if you're in what's called affective computing, 01:03:18.920 |
the field of studying emotion from the computational side, 01:03:23.840 |
if you're working in that field, you know that the annotation side of emotion is difficult. 01:03:32.440 |
So getting the ground truth of, well, okay, this guy is smiling. 01:03:45.000 |
In this case, we self-report, ask people how frustrated they were on a scale of 1 to 10. 01:04:04.920 |
Now what you notice is there's a very cold stoic look on Dan's face, 01:04:12.040 |
And in the case of frustration, the driver is smiling. 01:04:17.840 |
So this is a sort of a good reminder that we can't trust our own human instincts 01:04:24.200 |
in engineering features and engineering the ground truth. 01:04:27.280 |
We have to trust the data, trust the ground truth 01:04:33.960 |
that we believe is the closest reflection of the actual semantics of what's going on in the scene. 01:04:39.800 |
Okay, so end-to-end driving, getting to the project and the tutorial. 01:04:56.720 |
And thank you to the person who clarified that this video is from the Arc de Triomphe in Paris. 01:05:07.640 |
If driving is like a natural language conversation, 01:05:13.440 |
then we can think of end-to-end driving as skipping the entire Turing test components 01:05:19.800 |
and treating it as an end-to-end natural language generation. 01:05:24.640 |
So what we do is we take as input the external sensors, 01:05:36.200 |
and we replace that entire step with a neural network. 01:05:41.720 |
The TAs told me to not include this image because it's the cheesiest I've ever seen. 01:05:57.920 |
So this is to show our path to self-driving cars but it's to explain a point 01:06:08.040 |
that we have a large data set of ground truth. 01:06:12.080 |
If we were to formulate the driving task as simply taking external images 01:06:17.080 |
and producing steering commands, acceleration and braking commands, 01:06:24.680 |
We have a large number of drivers on the road every day 01:06:29.720 |
driving and therefore collecting our ground truth for us 01:06:34.040 |
because they're an interested party in producing the steering commands that keep them alive. 01:06:39.560 |
And therefore, if we were to record that data, it becomes ground truth. 01:06:44.160 |
So if it's possible to learn this, what we can do is we can collect data for the manually controlled vehicles 01:06:50.960 |
and use that data to train an algorithm to control a self-driving vehicle. 01:06:58.240 |
Okay, so one of the first folks that did this is NVIDIA 01:07:04.640 |
where they actually trained on an external image, the image of the forward roadway, 01:07:09.520 |
and a neural network, a convolutional network, a simple vanilla convolutional neural network. 01:07:16.600 |
I'll briefly outline, take an image in, produce a steering command out 01:07:22.720 |
and they're able to successfully, to some degree, learn to navigate basic turns, curves 01:07:32.400 |
and even stop or make sharp turns at a T-intersection. 01:07:46.200 |
The input is a 66 by 200 pixel image, RGB, shown on the left. 01:07:52.800 |
Or shown on the left is the raw input and then you crop it a little bit and resize it down. 01:07:58.400 |
66 by 200, that's what we have in the code as well. 01:08:04.000 |
In the two versions of the code we provide for you, both that runs in the browser and in TensorFlow. 01:08:11.640 |
It has a few layers, a few convolutional layers, a few fully connected layers and an output. 01:08:23.840 |
It's producing not a classification of cat versus dog, it's producing a steering command. 01:08:31.640 |
The rest is magic and we train it on human input. 01:08:38.960 |
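A minimal Keras sketch in the spirit of that architecture: a 66x200x3 image in, a few convolutional and fully connected layers, and a single regression output trained against the recorded steering value. The layer sizes are loosely based on the published NVIDIA description and should be treated as illustrative, not as the course code:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# End-to-end steering: image of the forward roadway in, steering value out.
model = models.Sequential([
    layers.Conv2D(24, (5, 5), strides=(2, 2), activation="relu",
                  input_shape=(66, 200, 3)),
    layers.Conv2D(36, (5, 5), strides=(2, 2), activation="relu"),
    layers.Conv2D(48, (5, 5), strides=(2, 2), activation="relu"),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(100, activation="relu"),
    layers.Dense(50, activation="relu"),
    layers.Dense(1),  # single regression output: the steering command
])
model.compile(optimizer="adam", loss="mse")  # mean squared error vs. recorded steering
# model.fit(frames, steering_values, epochs=5, validation_split=0.1)
```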
What we have here is a project, an implementation of the system in ConvNetJS that runs in your browser. 01:08:53.480 |
This is the tutorial to follow and the project to take on. 01:08:59.000 |
So unlike the deep traffic game, this is reality. 01:09:13.160 |
Demo went wonderfully yesterday, so let's see. Maybe two for two. 01:09:32.960 |
So there's a tutorial and then the actual game, the actual simulation is on DeepTeslaJS, I apologize. 01:09:57.760 |
Again, similar structure. Up top is the visualization of the loss function as the network is learning and it's always training. 01:10:07.960 |
Next is the input for the layout of the network. There's the specification, the input 200 by 66. 01:10:22.680 |
There's a convolutional layer, there's a pooling layer and the output is the regression layer, a single neuron. 01:10:30.680 |
This is a tiny version, deep tiny, right? It's a tiny version of the NVIDIA architecture. 01:10:43.480 |
And then you can visualize the operation of this network on real video. 01:10:49.920 |
The actual wheel value produced by the driver or by the autopilot system is in blue, and the output of the network is in white. 01:11:08.960 |
And what's indicated by green is the cropping of the image that is then resized to produce the 66 by 200 input to the network. 01:11:19.920 |
So once again, amazingly, this is running in your browser, training on real world video. 01:11:28.920 |
So you can get in your car today, input it and maybe teach a neural network to drive like you. 01:11:36.040 |
We have the code in ConvNetJS and TensorFlow to do that, and a tutorial. 01:11:40.800 |
Well, let me briefly describe some of the work here. 01:11:49.360 |
So the input to the network is a single image. This is for DeepTeslaJS, single image. 01:11:58.240 |
The output is a steering wheel value between -20 and 20. That's in degrees. 01:12:05.720 |
We record, like I said, thousands of hours, but we provide publicly 10 video clips of highway driving from a Tesla. 01:12:16.160 |
Half are driven by autopilot, half are driven by human. 01:12:21.120 |
The wheel value is extracted from a perfectly synchronized CAN bus. 01:12:29.760 |
We are collecting all of the messages from the CAN bus, which contain the steering wheel value, and that's synchronized with the video. 01:12:37.600 |
We crop, extract the window, the green one I mentioned, and then provide that as input to the network. 01:12:44.400 |
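A minimal sketch of that crop-and-resize step, assuming OpenCV; the crop window coordinates here are hypothetical, not the ones used in the course code:

```python
import cv2

def preprocess_frame(frame, crop_top=200, crop_bottom=480, crop_left=0, crop_right=1280):
    # Crop the region of the frame that shows the roadway (coordinates are
    # hypothetical), then resize it to the 200x66 input the network expects.
    roi = frame[crop_top:crop_bottom, crop_left:crop_right]
    return cv2.resize(roi, (200, 66))  # cv2.resize takes (width, height)
```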
So this is a slight difference from deep traffic with the red car weaving through traffic 01:12:51.080 |
because there's the messy reality of real world lighting conditions. 01:12:57.720 |
And your task, for the most part, in this simple steering task, is to stay inside the lane, inside the lane markings. 01:13:11.080 |
So ConvNetJS is a JavaScript implementation of CNNs, of convolutional neural networks. 01:13:22.200 |
It supports really arbitrary networks. I mean, all neural networks are simple, 01:13:27.960 |
but because it runs in JavaScript, it's not utilizing GPU. 01:13:32.040 |
The larger the network, the more it's going to be weighed down computationally. 01:13:39.200 |
Now, unlike deep traffic, this isn't a competition, 01:13:44.840 |
but if you are a student registered for the course, you still do have to submit the code. 01:13:49.840 |
You still have to submit your own car as part of the class. 01:13:55.880 |
So the question was the amount of data that's needed. 01:14:01.680 |
Are there general rules of thumb for the amount of data needed for a particular task? 01:14:12.440 |
You generally have to, like I said, neural networks are good memorizers. 01:14:18.840 |
So you have to just have every case represented in the training set that you're interested in 01:14:24.560 |
as much as possible. So that means, in general, if you want a picture, 01:14:31.640 |
if you want to classify the difference in cats and dogs, 01:14:33.920 |
you want to have at least a thousand cats and a thousand dogs, and then you do really well. 01:14:41.920 |
With driving, there are a couple of issues. One is that most of the time driving looks the same. 01:14:47.440 |
And the stuff you really care about is when driving looks different. It's all the edge cases. 01:14:51.520 |
So what we're not good with neural networks is generalizing from the common case to the edge cases, 01:14:58.640 |
to the outliers. So avoiding a crash, just because you can stay on the highway 01:15:03.880 |
for thousands of hours successfully, doesn't mean you can avoid a crash 01:15:07.520 |
when somebody runs in front of you on the road. 01:15:09.560 |
And the other part with driving is the accuracy you have to achieve is really high. 01:15:15.800 |
So for cat versus dog, you know, life doesn't depend on your error, 01:15:22.880 |
but it does depend on your ability to steer a car inside of a lane. 01:15:36.120 |
There's a visualization of the metrics measuring the performance of the network as it trains. 01:15:42.080 |
There is a layer visualization of what features the network is extracting 01:15:49.240 |
at every convolutional layer and every fully connected layer. 01:15:52.440 |
There is ability to restart the training, visualize the network performing on real video. 01:16:08.280 |
There is the input layer, the convolutional layers, the video visualization. 01:16:21.400 |
An interesting tidbit on the bottom right is a barcode that Will has ingeniously designed. 01:16:36.640 |
How do I clearly explain why this is so cool? 01:16:39.760 |
It's a way to, through video, synchronize multiple streams of data together. 01:16:47.360 |
So it's very easy for those who have worked with multimodal data, 01:16:51.560 |
where there are several streams of data, for them to become unsynchronized. 01:16:56.640 |
Especially when a big component of training in neural network is shuffling the data. 01:17:02.040 |
So you have to shuffle the data in clever ways so you're not overfitting any one little aspect of the video 01:17:07.840 |
and yet maintain the data perfectly synchronized. 01:17:10.960 |
So what he did instead of doing the hard work of connecting the steering wheel and the video 01:17:17.760 |
is actually putting the steering wheel on top of the video as a barcode. 01:17:23.160 |
The final result is you can watch the network operate 01:17:30.520 |
and over time it learns more and more to steer correctly. 01:17:36.600 |
I'll fly through this a little bit in the interest of time. 01:17:39.280 |
Just kind of summarize some of the things that you can play with in terms of tutorials and let you guys go. 01:17:44.240 |
This is the same kind of process, end-to-end driving with TensorFlow. 01:17:52.720 |
We just put up code on my GitHub under DeepTesla that takes in a single video 01:17:59.680 |
or an arbitrary number of videos, trains on them and produces a visualization 01:18:05.680 |
that compares the steering wheel, the actual steering wheel and the predicted steering wheel. 01:18:09.720 |
The steering wheel, when it agrees with a human driver or the autopilot system, 01:18:14.800 |
lighting up as green and when it disagrees, lighting up as red. 01:18:21.560 |
Again, this is some of the details of how that's exactly done in TensorFlow. 01:18:26.000 |
This is vanilla convolution neural networks, specifying a bunch of layers, 01:18:30.720 |
convolutional layers, a fully connected layer, train the model, 01:18:38.920 |
run the model over a test set of images and get this result. 01:18:48.240 |
We have a tutorial or IPython notebook and a tutorial up. 01:18:57.680 |
This is perhaps the best way to get started with convolutional neural networks 01:19:04.400 |
It's looking at the simplest image classification problem of traffic light classification. 01:19:14.040 |
We did the hard work of detecting them for you. 01:19:17.680 |
So now you have to figure out, you have to build a convolutional network 01:19:26.800 |
that gets excited when it sees red, yellow or green. 01:19:35.040 |
You can stay after class if you have any concerns with Docker, with TensorFlow, 01:19:42.200 |
with how to win deep traffic, just stay after class or come by Friday 5 to 7.