MIT 6.S094: Convolutional Neural Networks for End-to-End Learning of the Driving Task
Chapters
0:00 Intro
0:35 Illustrative Case Study: Traffic Light Detection
0:58 Deep Tesla: End-to-End Learning from Human and Autopilot Driving
2:11 Computer Vision is Machine Learning
4:46 Images are Numbers
6:39 Computer Vision is Hard
8:15 Image Classification Pipeline
8:50 Famous Computer Vision Datasets
10:25 Let's Build an Image Classifier for CIFAR-10
13:08 K-Nearest Neighbors: Generalizing the Image-Diff Classifier
19:15 Reminder: Weighing the Evidence
19:39 Reminder: Classify an Image of a Number
21:29 Reminder: "Learning" is Optimization of a Function
24:00 Convolutional Neural Networks: Layers
24:22 Dealing with Images: Local Connectivity
26:33 ConvNets: Spatial Arrangement of Output Volume
29:56 ConvNets: Pooling
32:32 Computer Vision: Object Recognition / Classification
34:37 Computer Vision: Segmentation
36:15 How Can Convolutional Neural Networks Help Us Drive?
36:32 Driving: The Numbers
37:41 Human at the Center of Automation
38:51 Distracted Humans
39:30 4 D's of Being Human: Drunk, Drugged, Distracted, Drowsy
39:51 In Context: Traffic Fatalities
43:03 Camera and Lens Selection
43:42 Semi-Autonomous Vehicle Components
44:18 Self-Driving Car Tasks
46:05 The Data
51:06 SLAM: Simultaneous Localization and Mapping
52:24 Visual Odometry in Parts
53:55 End-to-End Visual Odometry
55:53 Object Detection
56:23 Full Driving Scene Segmentation
57:48 Road Texture and Condition from Audio
59:02 Previous Approaches: Optimization-Based Control • Where Deep Learning Can Help: Reinforcement Learning
00:00:00.000 |
All right, welcome back everyone. Sound okay? All right. 00:00:08.440 |
So we talked a little bit about neural networks; 00:00:15.220 |
we started to talk about neural networks yesterday. 00:00:17.460 |
Today we'll continue to talk about neural networks that work with images, 00:00:24.780 |
convolutional neural networks and see how those types of networks can help us drive a car. 00:00:33.160 |
If we have time, we'll cover a simple illustrative case study of detecting traffic lights. 00:00:45.120 |
If we can't teach our neural networks to do that, we're in trouble. 00:00:50.040 |
But it's a good, clear, illustrative case study of a three-class classification problem. 00:01:01.560 |
Here, looped over and over in a very short GIF. 00:01:05.600 |
This is actually running live on a website right now. 00:01:08.920 |
We'll show it towards the end of the lecture. 00:01:14.880 |
DeepTesla is a neural network that learns to steer a vehicle based on the video of the forward roadway. 00:01:22.600 |
And once again, doing all of that in the browser using JavaScript. 00:01:27.520 |
So you'll be able to train your very own network to drive using real-world data. 00:01:38.800 |
We will also have a tutorial and code briefly described today at the end of lecture, if there's time, 00:01:52.120 |
So if you want to build a network that's bigger, deeper, 00:01:57.320 |
and you want to utilize GPUs to train that network, you want to not do it in your browser. 00:02:03.800 |
You want to do it offline using TensorFlow and having a powerful GPU on your computer. 00:02:13.680 |
So we talked about vanilla machine learning where the size of the input is small for the most part. 00:02:27.760 |
The number of neurons, in the case of neural networks, is on the order of 10, 100, 1000. 00:02:34.040 |
When you think of images, images are a collection of pixels. 00:02:38.520 |
One of the most iconic images from computer vision, in the bottom left there, is Lena. 00:02:44.160 |
I encourage you to Google it and figure out the story behind that image. 00:02:49.120 |
It was quite shocking when I found out recently. 00:02:55.800 |
So once again, computer vision is, these days, dominated by data-driven approaches, by machine learning. 00:03:07.360 |
Where all of the same methods that are used on other types of data are used on images, 00:03:18.440 |
where the input is just a collection of pixels. 00:03:21.520 |
And pixels are numbers from 0 to 255, discrete values. 00:03:28.440 |
So we can think exactly what we've talked about previously. 00:03:32.440 |
We could think of images in the same exact way. 00:03:38.640 |
We could do supervised learning where you have an input image and output label. 00:03:43.480 |
The input image here is a picture of a woman. 00:03:57.200 |
Again, the same goes for semi-supervised and reinforcement learning. 00:04:01.160 |
In fact, the Atari games we talked about yesterday do some pre-processing on the images. 00:04:08.680 |
They're using convolutional neural networks as we'll discuss today. 00:04:11.680 |
And the pipeline for supervised learning is again the same. 00:04:20.680 |
A machine learning algorithm performs feature extraction. 00:04:26.040 |
It trains, given the inputs and outputs, on the images and the labels of those images. 00:04:37.760 |
Accuracy is the term that's used to often describe how well a model performs. 00:04:43.080 |
I apologize for the constant presence of cats throughout this course. 00:04:51.160 |
I assure you this course is about driving, not cats. 00:05:03.000 |
As human beings, we're really good at taking in and interpreting visual information, 00:05:15.280 |
But a computer only sees numbers, RGB values for a colored image. 00:05:22.040 |
There's three values for every single pixel from 0 to 255. 00:05:26.840 |
And so given that image, we can think of two problems. 00:05:31.480 |
One is regression and the other is classification. 00:05:34.520 |
Regression is when given an image, we want to produce a real value output back. 00:05:41.000 |
So if we have an image of the forward roadway, 00:05:43.040 |
we want to produce a value for the steering wheel angle. 00:05:46.880 |
And if you have an algorithm that's really smart, 00:05:53.280 |
it can take that image and produce the perfectly correct steering angle 00:05:55.800 |
that drives the car safely across the United States. 00:05:58.520 |
We'll talk about how to do that and where that fails. 00:06:01.520 |
Classification is when the input again is an image 00:06:08.200 |
and the output is a class label, a discrete class label. 00:06:12.120 |
Underneath it, though, there is often still a regression problem: 00:06:18.720 |
you produce a probability that this particular image belongs to a particular category. 00:06:23.520 |
And we use a threshold to chop off the outputs associated with low probabilities 00:06:31.720 |
and take the labels associated with the high probabilities 00:06:35.520 |
and convert it into a discrete classification. 00:06:38.040 |
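As a rough illustration of that last step (not code from the lecture), here is a minimal numpy sketch that turns raw outputs into probabilities with a softmax and then into a discrete label; the three raw output values are made up:

```python
import numpy as np

def softmax(scores):
    # Subtract the max before exponentiating, for numerical stability.
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

# Hypothetical raw outputs of a 3-class classifier (e.g. cat, dog, ship).
raw_outputs = np.array([2.0, 0.5, -1.0])
probs = softmax(raw_outputs)        # continuous values that sum to 1

# Convert the continuous probabilities into a discrete class label.
label = int(np.argmax(probs))       # index of the most probable class
is_confident = probs[label] > 0.5   # optional threshold on the probability
```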
I mentioned this yesterday but it bears saying again: 00:06:49.440 |
As human beings, we're really good at dealing with all these problems. 00:06:55.120 |
The object looks totally different in terms of the numbers behind the images, 00:07:00.040 |
in terms of the pixels when viewed from a different angle. 00:07:06.520 |
Objects look different when you're standing far away from them or up close. 00:07:11.080 |
We're good at handling the fact that they appear at different sizes. 00:07:20.760 |
We talked about occlusions and deformations with cats. 00:07:29.280 |
You have to separate the object of interest from the background 00:07:35.640 |
and given the three-dimensional structure of our world, 00:07:38.120 |
there's a lot of stuff often going on in the background. 00:07:44.760 |
There's intra-class variation that's often greater than inter-class variation, 00:07:47.400 |
meaning objects of the same type often have more variation among themselves 00:07:50.840 |
than relative to the objects that you're trying to separate them from. 00:08:06.760 |
Illumination varies, and the source of that light changes the way that object appears. 00:08:13.520 |
So the image classification pipeline is the same as I mentioned. 00:08:24.560 |
It's a classification problem so there's categories 00:08:30.560 |
You have a bunch of examples, image examples of each of those categories 00:08:34.960 |
and so the input is just those images paired with the category 00:08:39.640 |
and you train to map, to estimate a function that maps from the images to the categories. 00:08:58.200 |
there's a growing number of data sets but they're still relatively small. 00:09:07.120 |
They're not billions or trillions of images. 00:09:09.640 |
And these are the data sets that you will see if you read academic literature most often. 00:09:17.240 |
MNIST, the one that's been beaten to death and that we use as well in this course, 00:09:34.240 |
ImageNet, one of the largest image data sets, 00:09:42.240 |
has images with a hierarchy of categories from WordNet 00:09:52.560 |
with images associated with the words present in that hierarchy. 00:10:01.760 |
that are used to prove in a very efficient and quick way 00:10:06.120 |
offhand that your algorithm that you're trying to publish on 00:10:09.640 |
or trying to impress the world with works well. 00:10:17.960 |
and Places is a data set of natural scenes: woods, nature, city, and so on. 00:10:27.960 |
So let's look at CIFAR-10 as a data set of 10 categories, 00:10:34.520 |
They're shown there with sample images as the rows. 00:10:38.600 |
So let's build a classifier that's able to take images 00:10:44.280 |
from one of these 10 categories and tell us what is shown in the image. 00:10:51.520 |
Once again, all the algorithm sees is numbers. 00:11:03.680 |
we have to have an operator for comparing two images. 00:11:06.720 |
So given an image and I want to say if it's a cat or a dog, 00:11:10.120 |
I want to compare it to images of cats and compare it to images of dogs 00:11:18.480 |
Okay, so one way to do that is take the absolute difference 00:11:26.280 |
Take the difference between each individual pixel, 00:11:30.400 |
shown on the bottom of the slide for a 4x4 image 00:11:34.920 |
and then we sum that pixel-wise absolute difference into a single number. 00:11:42.200 |
So if the image is totally different pixel-wise, that'll be a high number. 00:11:46.880 |
If it's the same image, the number will be zero. 00:11:49.400 |
Oh, and it's the absolute value of the difference too; that's the L1 distance. 00:12:00.320 |
When we speak of distance, we usually mean L2 distance. 00:12:05.320 |
And so we can build a classifier that just 00:12:12.880 |
uses this operator to compare to every single image in the dataset 00:12:18.200 |
and say, I'm going to pick the category that's the closest using this comparative operator. 00:12:26.360 |
I'm going to find, I have a picture of a cat and I'm going to look through the dataset 00:12:31.240 |
and find the image that's the closest to this picture 00:12:33.920 |
and say that is the category that this picture belongs to. 00:12:37.600 |
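A minimal numpy sketch of that image-diff classifier, assuming the images are already loaded as equally sized arrays (illustrative only, not the course code):

```python
import numpy as np

def l1_distance(img_a, img_b):
    # Pixel-wise absolute difference, summed into a single number.
    # Identical images give 0; very different images give a large number.
    return np.sum(np.abs(img_a.astype(np.int32) - img_b.astype(np.int32)))

def nearest_neighbor_label(query_img, train_images, train_labels):
    # Compare the query against every training image and return the label
    # of the closest one (1-nearest-neighbor with L1 distance).
    distances = [l1_distance(query_img, img) for img in train_images]
    return train_labels[int(np.argmin(distances))]
```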
So if we just flip the coin and randomly pick which category an image belongs to, 00:12:43.720 |
we get that accuracy would be on average 10%. It's random. 00:12:49.520 |
The accuracy we achieve with our brilliant image difference algorithm 00:12:58.280 |
that just goes to the dataset and finds the closest one is 38%. 00:13:10.600 |
So you can think about this operation of looking through the dataset 00:13:13.840 |
and finding the closest image as what's called k nearest neighbors. 00:13:19.880 |
Where k in that case is 1, meaning you find the one closest neighbor to this image 00:13:25.520 |
that you're asking a question about and accept the label from that image. 00:13:33.280 |
Increasing k to 2 means you take the two nearest neighbors. 00:13:39.560 |
You find the two closest in terms of pixel wise image difference, 00:13:44.040 |
images to this particular query image and find which category do those belong to. 00:13:51.120 |
What's shown up top on the left is the dataset we're working with, red, green, blue. 00:13:58.680 |
What's shown in the middle is the one nearest neighbor classifier, 00:14:04.280 |
meaning this is how you segment the entire space of different things that you can compare. 00:14:10.640 |
And if a point falls into any of these regions, 00:14:16.800 |
it will immediately be assigned by the nearest neighbor algorithm to the class of that region. 00:14:23.680 |
With five nearest neighbors, there's immediately an issue. 00:14:31.200 |
The issue is that there are white regions, there are tie-breakers, 00:14:34.280 |
where your five closest neighbors are from various categories. 00:14:43.320 |
So this is a good example of parameter tuning. 00:14:52.640 |
Your task as a teacher of machine learning 00:15:00.560 |
is to teach this algorithm how to do your learning for you. 00:15:06.800 |
That's called parameter tuning, or hyperparameter tuning as it's called in neural networks. 00:15:12.800 |
And so on the bottom right of the slide is, on the x-axis is k 00:15:23.880 |
And on the y-axis is classification accuracy. 00:15:28.000 |
It turns out that the best k for this data set is 7, 7 nearest neighbors. 00:15:42.520 |
And I should say that the way we get that number, as we do with 00:15:48.400 |
a lot of the machine learning pipeline, 00:15:53.920 |
is that you separate the data set into the parts you use for training and the parts you use for testing. 00:16:02.280 |
You're not allowed to touch the testing part. 00:16:06.520 |
You construct your model of the world on the training data set 00:16:14.080 |
where you take a small part of the training data, shown in fold 5 there in yellow, 00:16:26.360 |
and then use it as part of the hyperparameter tuning. 00:16:31.640 |
As you train, figure out with that yellow part, fold 5, how well you're doing. 00:16:38.840 |
And then you choose a different fold and see how well you're doing 00:16:42.640 |
and keep playing with parameters, never touching the test part. 00:16:46.240 |
And when you're ready, you run the algorithm on the test data 00:16:50.160 |
to see how well you really do, how well it really generalizes. 00:16:54.360 |
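A minimal sketch of that fold-based tuning loop, here using scikit-learn's k-NN with L1 (Manhattan) distance rather than the course code; the candidate k values and fold count are arbitrary:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def choose_k(images, labels, candidate_ks=(1, 3, 5, 7, 9), num_folds=5):
    # Flatten each image into a vector of pixel values.
    x = images.reshape(len(images), -1)
    folds_x = np.array_split(x, num_folds)
    folds_y = np.array_split(labels, num_folds)
    best_k, best_acc = None, -1.0
    for k in candidate_ks:
        fold_accs = []
        for i in range(num_folds):
            # Fold i is held out for validation; the rest is used for training.
            train_x = np.concatenate([f for j, f in enumerate(folds_x) if j != i])
            train_y = np.concatenate([f for j, f in enumerate(folds_y) if j != i])
            clf = KNeighborsClassifier(n_neighbors=k, metric="manhattan")
            clf.fit(train_x, train_y)
            fold_accs.append(clf.score(folds_x[i], folds_y[i]))
        if np.mean(fold_accs) > best_acc:
            best_k, best_acc = k, float(np.mean(fold_accs))
    return best_k  # the held-out test set is only touched after this choice
```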
Is there any way to determine intuition of what a good K may be? 00:16:57.800 |
Or do you just have to run through all the data? 00:17:01.160 |
is there any good intuition behind what a good K is? 00:17:04.600 |
There's general rules for different data sets, 00:17:07.040 |
but usually you just have to run through it, grid search, brute force. 00:17:18.240 |
So each pixel is like one number or is it two numbers? 00:17:21.800 |
Yes, the question was, is each pixel one number or three numbers? 00:17:28.080 |
For majority of computer vision throughout its history, 00:17:31.960 |
you use grayscale images, so it's one number. 00:17:36.040 |
And there's sometimes a depth value too, so it's four numbers. 00:17:39.920 |
So it's, if you have a stereo vision camera that gives you the depth information of the pixels, 00:17:46.480 |
And then if you stack two images together, it could be six. 00:17:50.320 |
In general, everything we work with will be three numbers for a pixel. 00:18:00.240 |
So the question was, for the absolute value, it's just one number. 00:18:07.360 |
So that's, you know, this algorithm is pretty good. 00:18:12.640 |
If we optimize the hyperparameters of this algorithm 00:18:21.720 |
and choose K of seven, it seems to work well for this particular CIFAR-10 dataset. 00:18:32.280 |
Human beings perform at about 94, slightly above 94% accuracy for CIFAR-10. 00:18:40.880 |
So given an image, it's a tiny image, I should clarify, it's like a little icon. 00:18:46.440 |
Given that image, human beings are able to determine accurately one of the 10 categories with 94% accuracy. 00:18:55.240 |
And the current state of the art with convolutional neural networks is about 95.4% accuracy. 00:19:07.560 |
But the most important, the critical fact here is, it's recently surpassed humans 00:19:12.920 |
and certainly surpassed the K-Nearest Neighbors algorithm. 00:19:22.560 |
It all still boils down to this little guy, the neuron. 00:19:27.800 |
That sums the weights of its inputs, adds a bias, produces an output, 00:19:33.040 |
based on an activation, a smooth activation function. 00:19:39.640 |
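That single-neuron computation, as a tiny illustrative numpy sketch (with a sigmoid as the smooth activation; the input values are made up):

```python
import numpy as np

def neuron_output(inputs, weights, bias):
    # Weighted sum of the inputs plus a bias, passed through a smooth
    # activation function (a sigmoid here).
    z = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-z))

print(neuron_output(np.array([0.5, 0.2]), np.array([0.8, -0.4]), bias=0.1))
```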
The question was, you take a picture of a cat so you know it's a cat, 00:19:56.040 |
So you have to write as a caption, "This is my cat." 00:19:59.360 |
And then the unfortunate thing, given the internet and how witty it is, 00:20:06.280 |
Because maybe you're just being clever and it's not a cat at all. 00:20:28.160 |
do convolutional neural networks generally do better than nearest neighbors? 00:20:31.280 |
There's very few problems on which neural networks don't do better. 00:20:37.480 |
Yes, they almost always do better, except when you have almost no data. 00:20:43.880 |
And convolutional neural networks isn't some special magical thing. 00:20:49.880 |
It's just neural networks with some cheating up front that I'll explain. 00:20:55.280 |
Some tricks to try to reduce the size and make it capable to deal with images. 00:21:01.120 |
So again, the input, in the case that we looked at, 00:21:06.640 |
as opposed to doing some fancy convolutional tricks, 00:21:10.720 |
is just the entire 28x28 pixel image; that's 784 pixels as the input. 00:21:19.960 |
That's 784 neurons in the input, 15 neurons in the hidden layer, and 10 in the output. 00:21:28.640 |
Now everything we'll talk about has the same exact structure, nothing fancy. 00:21:37.520 |
where you take an input image and produce an output classification. 00:21:41.240 |
And there's a backward pass through the network, 00:21:43.880 |
through back propagation, where you adjust the weights 00:21:48.000 |
when your prediction doesn't match the ground truth output. 00:21:52.960 |
And learning just boils down to optimization. 00:21:58.640 |
It's just optimizing a smooth, differentiable function 00:22:09.360 |
that measures the difference between the true output and the one you actually got. 00:22:14.360 |
So what's the difference? What are convolutional neural networks? 00:22:32.160 |
They're networks designed for inputs that have some spatial meaning in them, like images. 00:22:37.560 |
There's other things, you can think of the dimension of time 00:22:42.440 |
and you can input audio signal into a convolutional neural network. 00:22:47.880 |
And so for every single layer 00:22:54.840 |
that's a convolutional layer, the input is a 3D volume. 00:22:59.720 |
I'm simplifying because you can call it 4D too, but it's 3D. 00:23:08.760 |
The height and the width is the width and the height of the image. 00:23:11.800 |
And then the depth for a grayscale image is 1, 00:23:20.400 |
for a 10 frame video of grayscale images, the depth is 10. 00:23:31.280 |
And the only thing that a convolutional layer does 00:23:38.840 |
is take a 3D volume as input and produce a 3D volume as output, 00:23:47.640 |
with some function operating on the sum of the inputs 00:23:51.160 |
that may or may not have parameters that you tune, that you try to optimize. 00:24:00.400 |
So Lego pieces that you stack together in the same way as we talked about before. 00:24:05.560 |
So what are the types of layers that a convolutional neural network have? 00:24:23.040 |
The convolutional layer takes advantage of the spatial relationships of the input neurons. 00:24:33.920 |
It's the same exact neuron as for a fully connected network, 00:24:41.920 |
But it just has a narrower receptive field, it's more focused. 00:24:45.960 |
The inputs to a neuron on the convolutional layer 00:24:51.920 |
come from a specific region from the previous layer. 00:24:55.200 |
And the parameters on each filter are shared; you can think of this as a filter 00:25:00.480 |
because you slide it across the entire image. 00:25:08.840 |
So if you think about two layers, 00:25:12.360 |
as opposed to connecting every single pixel in the first layer 00:25:16.400 |
to every single neuron in the following layer, 00:25:20.520 |
you only connect the neurons in the input layer that are close to each other to each output neuron. 00:25:28.120 |
And then you enforce the weights to be tied together spatially. 00:25:41.920 |
Every single layer of the output, you could think of as a filter, 00:25:49.400 |
and when it sees this particular kind of edge in the image, 00:25:54.920 |
it'll get excited whether that edge is in the top left of the image or anywhere else. 00:26:00.920 |
The assumption there is that a powerful feature 00:26:07.640 |
for detecting a cat is just as important no matter where in the image it is. 00:26:12.760 |
And this allows you to cut away a huge number of connections between neurons. 00:26:21.120 |
But it still boils down on the right as a neuron that sums 00:26:28.320 |
a collection of inputs and applies weights to them. 00:26:40.920 |
The size of the output volume relative to the input volume is controlled by three things: depth, stride, and padding. 00:26:46.680 |
So for every single "filter", you'll get an extra channel on the output. 00:26:54.520 |
So let's talk about the very first layer: if you apply 10 filters to the input image, 00:27:15.680 |
the resulting number of stacked channels in the output will be 10. 00:27:23.400 |
Stride is the step size of the filter that you slide along the image. 00:27:33.000 |
Oftentimes that's just one or three and that directly reduces 00:27:38.960 |
the spatial size, the width and the height of the output image. 00:27:44.200 |
And then there's a convenient thing that's often done, which is padding, 00:27:53.240 |
so that the input and the output have the same height and width. 00:28:05.400 |
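The standard formula for the output width or height is (W - F + 2P) / S + 1, where W is the input size, F the filter size, S the stride, and P the padding. A minimal, purely illustrative sketch:

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    # Spatial size (width or height) of a convolutional layer's output.
    return (input_size - filter_size + 2 * padding) // stride + 1

# Example: a 32x32 input with 5x5 filters, stride 1 and padding 2 stays 32x32;
# with 10 filters, the output volume would be 32 x 32 x 10.
assert conv_output_size(32, 5, stride=1, padding=2) == 32
```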
I encourage you to kind of maybe offline think about what's happening. 00:28:14.760 |
Crudely so, if there's any experts in the audience. 00:28:19.320 |
So the input here on the left is a collection of numbers, 0, 1, 2. 00:28:43.440 |
Those filters shown in red are the different weights applied on those filters. 00:28:49.320 |
And each of the filters have a depth just like the input, a depth of 3. 00:29:00.840 |
Yeah, and so you slide that filter along the image, keeping the weights the same. 00:29:11.400 |
And so your first filter, you pick the weights. 00:29:17.440 |
You pick the weights in such a way that it fires, it gets excited, at useful features. 00:29:25.160 |
And then the second filter fires for other useful features. 00:29:32.920 |
The output is a positive number when there's a strong feature in that region. 00:29:44.280 |
This allows for drastic reduction in the parameters. 00:29:48.120 |
And so you can deal with inputs that are a thousand by a thousand pixel image, for example, or video. 00:30:04.600 |
That means there's a spatial invariance to the features you're detecting. 00:30:08.800 |
It allows you to learn from arbitrary images. 00:30:11.320 |
So you don't have to be concerned about pre-processing the images in some clever way. 00:30:31.360 |
Pooling is an operation for taking a collection of outputs and choosing, for example, the max one, 00:30:40.400 |
such that the output of the pooling operation is much smaller than the input. 00:30:50.000 |
Because the justification there is that you don't need a high resolution localization 00:31:01.560 |
of exactly where which pixel is important in the image according to, you know, 00:31:07.920 |
you don't need to know exactly which pixel is associated with the cat ear or, you know, a cat face. 00:31:14.880 |
As long as you kind of know it's around that part. 00:31:18.680 |
And that reduces a lot of complexity in the operations. 00:31:32.280 |
So, pooling is a very crude operation that doesn't have any... 00:31:43.800 |
One thing you need to know is it doesn't have any parameters that are learnable. 00:31:49.000 |
So you can't learn anything clever about pooling. 00:32:04.760 |
There's an argument that you're not, you know, losing that much information 00:32:09.200 |
as long as you're not pooling the entire image into a single value. 00:32:13.000 |
But you're gaining training efficiency, you're reducing the memory size, 00:32:22.400 |
So it's definitely a thing that people debate 00:32:26.560 |
and it's a parameter that you play with to see what works for you. 00:32:30.480 |
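A minimal numpy sketch of the most common variant, 2x2 max pooling with stride 2 (illustrative only, not the course code):

```python
import numpy as np

def max_pool_2x2(feature_map):
    # Downsample a single-channel feature map by taking the maximum of each
    # non-overlapping 2x2 block; height and width must be even here.
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))  # 2x2 output; each value is the max of a 2x2 region
```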
Okay, so how does this thing look like as a whole, a convolutional neural network? 00:32:39.520 |
The input is an image. There's usually a convolutional layer. 00:32:43.600 |
There is a pooling operation, another convolutional layer, another pooling operation and so on. 00:32:52.160 |
At the very end, if the task is classification, 00:32:57.360 |
after the stack of convolutional layers and pooling layers, there are fully connected layers. 00:33:05.120 |
So you go from those, the spatial convolutional operations 00:33:11.320 |
to fully connecting every single neuron in a layer to the following layer. 00:33:15.200 |
And you do this so that by the end, you have a collection of neurons. 00:33:19.800 |
Each one is associated with a particular class. 00:33:22.920 |
So in what we looked at yesterday, the input is an image of a number, 0 through 9. 00:33:34.240 |
So you boil down that image with a collection of convolutional layers 00:33:40.960 |
with one or two or three fully connected layers at the end that all lead to 10 neurons. 00:33:47.760 |
And each of those neurons' job is to get fired up when it sees a particular number 00:33:57.000 |
and for the other ones to produce a low probability. 00:34:00.360 |
And so this kind of process is how you get the 95% accuracy on the CIFAR-10 problem. 00:34:10.720 |
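A minimal sketch of that conv / pool / fully-connected stack using the Keras API; the layer sizes are made up for illustration and are not the architecture from the lecture:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# A small convolutional classifier for 32x32 RGB images and 10 classes
# (CIFAR-10-like). Layer sizes are illustrative, not tuned.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),  # one output neuron per class
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=10, validation_split=0.1)
```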
This here is ImageNet dataset that I mentioned. 00:34:14.480 |
It's how you take this image of a leopard, of a container ship, 00:34:19.120 |
and produce a probability that that is a container ship or a leopard. 00:34:25.480 |
Also shown there are the outputs of the other nearest neurons in terms of their confidence. 00:34:32.040 |
Now you can use the same exact operation by chopping off the fully connected layer at the end 00:34:44.280 |
and as opposed to mapping from an image to a prediction of what's contained in the image, 00:34:54.560 |
you map from an image to an image, and you can train that output image to be one that gets excited spatially. 00:35:02.320 |
Meaning it gives you a high close to one value for areas of the image that contain the object of interest 00:35:11.840 |
and then a low number for areas of the image that are unlikely to contain that object. 00:35:20.880 |
And so from this you can go on the left an original image of a woman on a horse 00:35:25.200 |
to a segmented image of knowing where the woman is and where the horse is and where the background is. 00:35:32.440 |
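A minimal illustration of that last step, assuming the network already produces a per-pixel score for each class (purely illustrative, not SegNet code; the class layout and scores are hypothetical):

```python
import numpy as np

# Hypothetical per-pixel class scores from a convolutionalized network:
# shape (height, width, num_classes), e.g. 3 classes: background, person, horse.
scores = np.random.rand(240, 320, 3)

# The segmentation mask is just the most likely class at every pixel.
mask = np.argmax(scores, axis=-1)   # shape (240, 320), values in {0, 1, 2}
horse_pixels = (mask == 2)          # boolean map of where the "horse" class wins
```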
The same process can be done for detecting the object. 00:35:38.800 |
So you can segment the scene into a bunch of interesting objects, 00:35:45.840 |
candidates for interesting objects and then go through those candidates one by one 00:35:53.320 |
and perform the same kind of classification as in the previous step 00:35:56.560 |
where it's just an input as an image and the output is a classification. 00:36:00.280 |
And through this process of hopping around an image, 00:36:03.440 |
you can figure out exactly where is the best way to segment the cow out of the image. 00:36:11.560 |
Okay, so how can these magical convolutional neural networks help us in driving? 00:36:22.560 |
This is a video of the forward roadway from a data set that we'll look at, 00:36:37.560 |
The general driving task from the human perspective. 00:36:41.600 |
On average, an American driver in the United States drives 10,000 miles a year. 00:36:51.160 |
A little more for rural, a little less for urban. 00:36:54.720 |
There are about 30,000 fatal crashes and 32,000-plus, sometimes as high as 38,000, fatalities a year. 00:37:06.800 |
This includes car occupants, pedestrians, bicyclists and motorcycle riders. 00:37:14.360 |
This may be a surprising fact but in a class on self-driving cars, 00:37:22.760 |
we should remember that, so ignore the 59.9% that's other. 00:37:28.120 |
The most popular cars in the United States are pickup trucks. 00:37:36.560 |
It's an important point that we're still married to wanting to be in control. 00:37:48.480 |
And so one of the interesting cars that we look at 00:37:56.080 |
and the car that is the data set that we provide to the class is collected from is a Tesla. 00:38:03.760 |
It's the one that comes at the intersection of the Ford F-150 00:38:07.840 |
and the cute little Google self-driving car on the right. 00:38:11.040 |
It's fast, it allows you to have a feeling of control 00:38:16.960 |
but it can also drive itself for hundreds of miles on the highway if need be. 00:38:22.040 |
It allows you to press a button and the car takes over. 00:38:28.240 |
It's a fascinating trade-off of transferring control from the human to the car. 00:38:33.520 |
It's a transfer of trust and it's a chance for us to study the psychology 00:38:41.080 |
of human beings as they relate to machines at 60 plus miles an hour. 00:38:48.560 |
In case you're not aware, a little summary of human beings. 00:38:57.560 |
We'd like to text, use the smartphone, watch videos, groom, talk to passengers, eat, drink. 00:39:08.960 |
Texting, 169 billion texts were sent in the US every single month in 2014. 00:39:20.160 |
On average, five seconds are spent with eyes off the road while texting. 00:39:27.720 |
That's the opportunity for automation to step in. 00:39:34.000 |
More than that, there's what NHTSA refers to as the four D's. 00:39:43.000 |
Each one of those is an opportunity for automation to step in. 00:39:48.240 |
Drunk driving stands to benefit significantly from automation, perhaps. 00:39:55.320 |
So the miles, let's look at the miles, the data. 00:40:09.040 |
And Tesla Autopilot, our case study for this class, compared to us as human beings: 00:40:20.520 |
it has driven by itself 300 million miles as of December 2016. 00:40:27.440 |
And the fatality rate for human-controlled vehicles is one in 90 million miles. 00:40:40.760 |
And currently in Tesla, under Tesla Autopilot, there's one fatality. 00:40:47.200 |
There's a lot of ways you can tear that statistic apart, but it's one to think about. 00:40:50.880 |
Already, perhaps, automation results in safer driving. 00:40:57.040 |
The thing is, we don't understand automation because we don't have the data. 00:41:05.120 |
We don't have the data on the forward roadway video. 00:41:13.640 |
And we just don't have that many cars on the road today that drive themselves. 00:41:20.440 |
We'll provide some of it to you in the class. 00:41:23.440 |
And as part of our research at MIT, we're collecting huge amounts of it, 00:41:31.840 |
And collecting that data is how we get to understanding. 00:41:38.880 |
So, talking about the data and what we'll be training our algorithms on: 00:41:53.800 |
They have collected over 5,000 hours and 70,000 miles. 00:41:58.080 |
And I'll talk about the cameras that we put in them. 00:42:03.000 |
We're collecting video of the forward roadway. 00:42:08.560 |
This is a highlight of a trip from Boston to Florida of one of the people driving a Tesla. 00:42:13.920 |
What's also shown in blue is the amount of time that Autopilot was engaged. 00:42:21.760 |
Currently zero minutes and then it grows and grows. 00:42:26.880 |
For prolonged periods of time, so hundreds of miles, people engage Autopilot. 00:42:31.800 |
Out of 1.3 billion miles driven in a Tesla, 300 million are in Autopilot. 00:42:41.880 |
So we are collecting data of the forward roadway, of the driver. 00:42:50.720 |
What we're providing with the class is epochs of time of the forward roadway, for privacy considerations. 00:43:04.280 |
Cameras used to record are your regular webcam, the workhorse of the computer vision community, the C920. 00:43:13.160 |
And we have some special lenses on top of it. 00:43:19.200 |
Nothing that costs 70 bucks can be that good, right? 00:43:23.040 |
What's special about them is that they do onboard compression 00:43:29.120 |
and allow you to collect huge amounts of data and use reasonably sized storage capacity 00:43:40.080 |
to store that data and train your algorithms on. 00:43:42.400 |
So what on the self-driving side do we have to work with? 00:43:55.920 |
There is the sensors, radar, lidar, vision, audio, all looking outside, 00:44:04.280 |
helping you detect the objects in the external environment to localize yourself and so on. 00:44:09.800 |
And there's the sensors facing inside, visible light camera, audio again, and infrared camera to help detect pupils. 00:44:18.600 |
So we can decompose the self-driving car task into four steps. 00:44:25.120 |
Localization, answering where am I, scene understanding, 00:44:30.080 |
using the texture of the information of the scene around 00:44:34.640 |
to interpret the identity of the different objects in the scene 00:44:40.160 |
and the semantic meaning of those objects of their movement. 00:44:45.760 |
There's movement planning, once you figured all that out, found all the pedestrians, found all the cars, 00:44:53.680 |
how do I navigate through this maze, a clutter of objects in a safe and legal way. 00:45:02.920 |
And there's driver state: how do I detect, using video of the driver or other information, 00:45:08.840 |
their emotional state or their distraction level. 00:45:25.080 |
Lidar is the sensor that provides you the 3D point cloud of the external scene. 00:45:32.440 |
So lidar is a technology used by most folks working with self-driving cars 00:45:42.240 |
to give you a strong ground truth of the objects. 00:45:47.360 |
It's probably the best sensor we have for getting 3D information, 00:45:52.160 |
the least noisy 3D information about the external environment. 00:46:10.040 |
One of the most amazing things about this vehicle is that 00:46:14.440 |
the updates to autopilot come in the form of software. 00:46:18.320 |
So the amount of time it's available, it changes. 00:46:24.040 |
But this is one of the earlier versions, and it shows, in the second line in yellow, 00:46:32.480 |
how often the autopilot was available but not turned on. 00:46:37.120 |
So the total driving time was 10 hours, and Autopilot was available for 7 hours. 00:46:44.360 |
This particular person is a responsible driver, 00:46:49.000 |
or a more cautious driver; what you see is that it's raining. 00:46:56.920 |
The comment was that you shouldn't trust that one fatality number as an indication of safety 00:47:06.320 |
because the drivers elect to only engage the system when it's safe to do so. 00:47:15.520 |
There are a lot of bigger arguments about that number than just that one. 00:47:28.040 |
So maybe we can trust human beings to engage, you know, 00:47:34.280 |
despite the poorly filmed YouTube videos, despite the hype in the media, 00:47:40.040 |
you're still a human being riding at 60 miles an hour in a metal box with your life on the line. 00:47:46.040 |
You won't engage the system unless you know it's completely safe, 00:47:54.080 |
It's not all the stuff you see where a person gets in the back of a Tesla and starts sleeping 00:48:02.760 |
The reality is when it's just you in the car, it's still your life on the line. 00:48:06.720 |
And so you're going to do the responsible thing unless perhaps you're a teenager and so on 00:48:10.840 |
but that never changes no matter what you're in. 00:48:13.040 |
The question was what do you need to see or sense about the external environment 00:48:23.200 |
Do you need lane markings? Do you need other... 00:48:25.520 |
What are the landmarks based on which you do the localization and the navigation? 00:48:32.720 |
So with Google self-driving car in sunny California, 00:48:37.320 |
it depends on LiDAR to, in a high-resolution way, map the environment 00:48:42.920 |
in order to be able to localize itself based on LiDAR. 00:48:47.480 |
And LiDAR, now I don't know the details of exactly where LiDAR fails 00:48:53.960 |
but it's not good with rain, it's not good with snow, 00:48:58.240 |
it's not good when the environment is changing. 00:49:01.080 |
So what snow does is it changes the visual, the appearance, 00:49:05.720 |
the reflective texture of the surfaces around. 00:49:07.960 |
Us human beings are still able to figure stuff out 00:49:10.880 |
but a car that's relying heavily on LiDAR won't be able to localize itself 00:49:16.040 |
using the landmarks it previously has detected 00:49:19.880 |
because they look different now with the snow. 00:49:21.640 |
Computer vision can help us with lanes or following a car. 00:49:30.760 |
The two landmarks that we use are the lane markings and the car in front of you. 00:49:36.640 |
That's the nice thing about our roadways: they're designed for human eyes. 00:49:41.640 |
So you can use computer vision for lanes and for cars in front to follow them. 00:49:47.160 |
And there is radar that's a crude but reliable source of distance information 00:49:54.360 |
that allows you to not collide with metal objects. 00:49:58.600 |
So all of that together depending on what you want to rely on more 00:50:04.640 |
The question is, when the messy complexity of real life occurs, 00:50:13.480 |
how reliable will it be in the urban environment and so on. 00:50:26.200 |
So first let's just quick summary of visual odometry. 00:50:33.800 |
It's using a monocular or stereo input of video images 00:50:44.160 |
to estimate the position and the orientation, in this case of a vehicle, in the frame of the world. 00:50:51.280 |
And all you have to work with is a video of the forward roadway 00:50:55.280 |
and with stereo you get a little extra information of how far away different objects are. 00:51:03.120 |
And so this is where one of our speakers on Friday will talk about his expertise, 00:51:14.520 |
This is a very well studied and understood problem 00:51:17.680 |
of detecting unique features in the external scene 00:51:25.040 |
and localizing yourself based on the trajectory of those unique features. 00:51:31.480 |
When the number of features is high enough, it becomes an optimization problem. 00:51:36.320 |
You know this particular lane moved a little bit from frame to frame, 00:51:40.080 |
you can track that information and fuse everything together 00:51:44.800 |
in order to be able to estimate your trajectory through the three-dimensional space. 00:51:53.320 |
You have GPS, which is pretty accurate, not perfect but pretty accurate. 00:51:59.560 |
It's another signal to help you localize yourself. 00:52:01.840 |
You also have an IMU; the accelerometer tells you your acceleration. 00:52:07.680 |
From the gyroscope and the accelerometer, you have six-degree-of-freedom movement information 00:52:17.240 |
about how the moving object, the car, is navigating through space. 00:52:24.160 |
So you can do that using the old school way of optimization 00:52:34.600 |
given a unique set of features like SIFT features. 00:52:40.400 |
And that step involves, with stereo input, undistorting and rectifying the images. 00:52:47.920 |
You have two images, you have to, from the two images, compute the depth map. 00:52:51.960 |
So for every single pixel, computing your best estimate of the depth of that pixel, 00:52:57.600 |
the three-dimensional position relative to the camera. 00:53:03.720 |
Then you compute, that's where you compute the disparity map, that's what that's called. 00:53:13.080 |
Then you detect unique interesting features in the scene. 00:53:17.720 |
SIFT is a popular one, is a popular algorithm for detecting unique features. 00:53:22.760 |
And then you, over time, track those features. 00:53:25.600 |
And that tracking is what allows you to, through the vision alone, 00:53:30.600 |
to get information about your trajectory through three-dimensional space. 00:53:37.120 |
There's a lot of assumptions. Assumptions that bodies are rigid. 00:53:40.560 |
So you have to figure out what to do if a large object passes right in front of you; 00:53:49.600 |
you have to figure out the mobile objects in the scene and those that are stationary. 00:54:00.920 |
Or you can cheat, what we'll talk about, and do it using neural networks, end-to-end. 00:54:10.800 |
And this will come up a bunch of times throughout this class and today. 00:54:14.160 |
End-to-end means, and I refer to it as cheating because 00:54:19.520 |
it takes away a lot of the hard work of hand engineering features. 00:54:30.000 |
In this case, it's taking stereo input from a stereo vision camera. 00:54:35.720 |
So two images, a sequence of two images coming from a stereo vision camera. 00:54:39.680 |
And the output is an estimate of your trajectory through space. 00:54:47.000 |
So as opposed to doing the hard work of SLAM, of detecting unique features, 00:54:51.160 |
of localizing yourself, of tracking those features and figuring out what your trajectory is, 00:54:56.640 |
you simply train the network with some ground truth that you have from a more accurate sensor like LiDAR. 00:55:03.480 |
And you train it on a set of inputs, the stereo vision inputs. 00:55:11.680 |
You have separate convolutional neural networks for the velocity and for the orientation. 00:55:28.240 |
SLAM is one of the places where deep learning has not been able to outperform the previous approaches. 00:55:36.240 |
Where deep learning really helps is the scene understanding part. 00:55:44.320 |
It's detecting the various parts of the scene, segmenting them, 00:55:50.520 |
and with optical flow, determining their movement. 00:55:54.080 |
So previous approaches for detecting objects, 00:55:58.280 |
like the traffic signal classification detection that we have the TensorFlow tutorial for, 00:56:08.360 |
were to use Haar-like features or other types of features that are hand-engineered from the images. 00:56:18.800 |
Now we can use convolution neural networks to replace the extraction of those features. 00:56:24.040 |
And there's a TensorFlow implementation of SegNet, 00:56:34.040 |
which is taking the exact same neural network that I talked about. 00:56:39.600 |
Just the beauty is you just apply similar types of networks to different problems. 00:56:47.400 |
And depending on the complexity of the problem, it can get quite amazing performance. 00:56:52.560 |
In this case, we convolutionalize the network, meaning the output is an image, 00:57:04.320 |
where the colors indicate your best pixel by pixel estimate of what object is in that part. 00:57:10.000 |
This is not using any spatial information, it's not using any temporal information. 00:57:16.320 |
So it's processing every single frame separately. 00:57:19.760 |
And it's able to separate the road from the trees, from the pedestrians, other cars and so on. 00:57:29.200 |
This is intended to lie on top of a radar/lidar type of technology 00:57:37.680 |
that's giving you the three-dimensional or stereo vision, 00:57:40.360 |
three-dimensional information about the scene. 00:57:42.840 |
You're sort of painting that scene with the identity of the objects that are in it, 00:57:49.800 |
This is something I'll talk about tomorrow: recurrent neural networks. 00:57:56.120 |
And we can use recurrent neural networks that work with temporal data 00:58:05.320 |
In this case, we can process what's shown on the bottom is a spectrogram of audio 00:58:20.640 |
and process it in a temporal way using recurrent neural networks. 00:58:27.200 |
Just slide it across and keep feeding it to a network. 00:58:31.560 |
And it does incredibly well on the simple tasks, certainly, of dry road versus wet road. 00:58:38.080 |
This is an important, a subtle but very important task and there's many like it. 00:58:44.160 |
To know the texture, the quality, the characteristics of the road: 00:58:53.760 |
When it's not raining but the road is still wet, that information is very important. 00:59:02.360 |
The same kind of approach, shown on the right, is work from one of our other speakers, Sertac Karaman. 00:59:14.040 |
The same approach we're using to solve traffic through friendly competition 00:59:22.480 |
is the same that we can use for what Chris Gerdes does with his race cars 00:59:29.960 |
for planning trajectories in high-speed movement along complex curves. 00:59:38.000 |
So we can solve that problem using optimization, 00:59:46.560 |
or we can use it with reinforcement learning by running 00:59:50.320 |
tens of millions, hundreds of millions of times through that simulation of taking that curve 00:59:56.600 |
and learning which trajectory both optimizes the speed at which you take the turn 01:00:05.920 |
Exactly the same thing that you're using for traffic. 01:00:10.840 |
And for driver state, this is what we'll talk about next week, 01:00:15.480 |
is all the fun face stuff, eyes, face, emotion. 01:00:21.520 |
This is, we have video of the driver, video of the driver's body, video of the driver's face. 01:00:26.840 |
On the left is one of the TAs in his younger days. 01:00:36.120 |
So that's, in that particular case, you're doing one of the easier problems 01:00:47.160 |
which is one of detecting where the head and the eyes are positioned. 01:00:51.560 |
The head and eye pose in order to determine what's called the gaze of the driver, 01:00:59.760 |
And so shown, and we'll talk about these problems, from the left to the right, 01:01:05.640 |
on the left and green are the easier problems, 01:01:08.920 |
on the red are the harder from the computer vision aspect. 01:01:15.720 |
The larger the object, the easier it is to detect and the orientation of it is easier to detect. 01:01:20.160 |
And then there is pupil diameter, detecting the pupil, 01:01:23.920 |
the characteristics, the position, the size of the pupil. 01:01:28.640 |
And there's micro saccades, things that happen at one millisecond frequency, 01:01:35.000 |
All important information to determine the state of the driver. 01:01:41.920 |
Some are possible with computer vision, some are not. 01:01:44.400 |
This is something that we'll talk about, I think on Thursday, 01:01:51.280 |
is the detection of where the driver is looking. 01:01:54.520 |
So this is a bunch of the cameras that we have in the Tesla. 01:01:58.520 |
This is Dan driving a Tesla, and we're detecting exactly where he's looking, out of one of six regions. 01:02:05.000 |
We've converted it into a classification problem of left, right, rear view mirror, 01:02:10.520 |
instrument cluster, center stack or forward roadway. 01:02:13.080 |
So we have to determine, out of those six categories, which one the driver is looking at. 01:02:20.800 |
We don't care exactly the XYZ position of where the driver is looking at. 01:02:25.280 |
We care that they're looking at the road or not. 01:02:27.400 |
Are they looking at their cell phone in their lap or are they looking at the forward roadway? 01:02:30.920 |
And we'll be able to answer that pretty effectively using convolutional neural networks. 01:02:35.600 |
You can also look at emotion, using CNNs to, again, convert 01:02:54.760 |
the complex world of emotion into a binary problem of frustrated versus satisfied. 01:03:02.120 |
This is a video of drivers interacting with a voice navigation system. 01:03:07.240 |
If you've ever used one, you know, it may be a source of frustration from folks. 01:03:14.520 |
This is one of the hard ones. Driver emotion, if you're in what's called affective computing, 01:03:18.920 |
the field of studying emotion from the computational side, 01:03:23.840 |
if you're working in that field, you know that the annotation side of emotion is difficult. 01:03:32.440 |
So getting the ground truth of, well, okay, this guy is smiling. 01:03:45.000 |
In this case, we self-report, ask people how frustrated they were on a scale of 1 to 10. 01:04:04.920 |
Now what you notice is there's a very cold stoic look on Dan's face, 01:04:12.040 |
And in the case of frustration, the driver is smiling. 01:04:17.840 |
So this is a sort of a good reminder that we can't trust our own human instincts 01:04:24.200 |
in engineering features and engineering the ground truth. 01:04:27.280 |
We have to trust the data, trust the ground truth 01:04:33.960 |
that we believe is the closest reflection of the actual semantics of what's going on in the scene. 01:04:39.800 |
Okay, so end-to-end driving, getting to the project and the tutorial. 01:04:56.720 |
And thank you to the person who clarified that this video is from the Arc de Triomphe in Paris. 01:05:07.640 |
If driving is like a natural language conversation, 01:05:13.440 |
then we can think of end-to-end driving as skipping the entire Turing test components 01:05:19.800 |
and treating it as an end-to-end natural language generation. 01:05:24.640 |
So what we do is we take as input the external sensors, 01:05:36.200 |
and we replace that entire step with a neural network. 01:05:41.720 |
The TAs told me to not include this image because it's the cheesiest I've ever seen. 01:05:57.920 |
So this is to show our path to self-driving cars but it's to explain a point 01:06:08.040 |
that we have a large data set of ground truth. 01:06:12.080 |
If we were to formulate the driving task as simply taking external images 01:06:17.080 |
and producing steering commands, acceleration and braking commands, 01:06:24.680 |
We have a large number of drivers on the road every day 01:06:29.720 |
driving and therefore collecting our ground truth for us 01:06:34.040 |
because they're an interested party in producing the steering commands that keep them alive. 01:06:39.560 |
And therefore, if we were to record that data, it becomes ground truth. 01:06:44.160 |
So if it's possible to learn this, what we can do is we can collect data for the manually controlled vehicles 01:06:50.960 |
and use that data to train an algorithm to control a self-driving vehicle. 01:06:58.240 |
Okay, so one of the first folks that did this is NVIDIA 01:07:04.640 |
where they actually trained on an external image, the image of the forward roadway, 01:07:09.520 |
and a neural network, a convolutional network, a simple vanilla convolutional neural network. 01:07:16.600 |
I'll briefly outline, take an image in, produce a steering command out 01:07:22.720 |
and they're able to successfully, to some degree, learn to navigate basic turns, curves 01:07:32.400 |
and even stop or make sharp turns at a T-intersection. 01:07:46.200 |
The input is a 66 by 200 pixel image, RGB, shown on the left. 01:07:52.800 |
Or shown on the left is the raw input and then you crop it a little bit and resize it down. 01:07:58.400 |
66 by 200, that's what we have in the code as well. 01:08:04.000 |
In the two versions of the code we provide for you, both that runs in the browser and in TensorFlow. 01:08:11.640 |
It has a few layers, a few convolutional layers, a few fully connected layers and an output. 01:08:23.840 |
It's producing not a classification of cat versus dog, it's producing a steering command. 01:08:31.640 |
The rest is magic and we train it on human input. 01:08:38.960 |
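A minimal Keras sketch in the spirit of that architecture: a 66x200x3 image in, a few convolutional and fully connected layers, and a single regression output trained against the recorded steering value. The layer sizes are loosely based on the published NVIDIA description and should be treated as illustrative, not as the course code:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# End-to-end steering: image of the forward roadway in, steering value out.
model = models.Sequential([
    layers.Conv2D(24, (5, 5), strides=(2, 2), activation="relu",
                  input_shape=(66, 200, 3)),
    layers.Conv2D(36, (5, 5), strides=(2, 2), activation="relu"),
    layers.Conv2D(48, (5, 5), strides=(2, 2), activation="relu"),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(100, activation="relu"),
    layers.Dense(50, activation="relu"),
    layers.Dense(1),  # single regression output: the steering command
])
model.compile(optimizer="adam", loss="mse")  # mean squared error vs. recorded steering
# model.fit(frames, steering_values, epochs=5, validation_split=0.1)
```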
What we have here is a project, an implementation of the system in ConvNetJS that runs in your browser. 01:08:53.480 |
This is the tutorial to follow and the project to take on. 01:08:59.000 |
So unlike the deep traffic game, this is reality. 01:09:13.160 |
Demo went wonderfully yesterday, so let's see. Maybe two for two. 01:09:32.960 |
So there's a tutorial and then the actual game, the actual simulation is on DeepTeslaJS, I apologize. 01:09:57.760 |
Again, similar structure. Up top is the visualization of the loss function as the network is learning and it's always training. 01:10:07.960 |
Next is the input for the layout of the network. There's the specification, the input 200 by 66. 01:10:22.680 |
There's a convolutional layer, there's a pooling layer and the output is the regression layer, a single neuron. 01:10:30.680 |
This is a tiny version, deep tiny, right? It's a tiny version of the NVIDIA architecture. 01:10:43.480 |
And then you can visualize the operation of this network on real video. 01:10:49.920 |
The actual wheel value produced by the driver or by the autopilot system is in blue, and the output of the network is in white. 01:11:08.960 |
And what's indicated by green is the cropping of the image that is then resized to produce the 66 by 200 input to the network. 01:11:19.920 |
So once again, amazingly, this is running in your browser, training on real world video. 01:11:28.920 |
So you can get in your car today, input it and maybe teach a neural network to drive like you. 01:11:36.040 |
We have the code in ConvNetJS and TensorFlow to do that, and a tutorial. 01:11:40.800 |
Well, let me briefly describe some of the work here. 01:11:49.360 |
So the input to the network is a single image. This is for DeepTeslaJS, single image. 01:11:58.240 |
The output is a steering wheel value between -20 and 20. That's in degrees. 01:12:05.720 |
We record, like I said, thousands of hours, but we provide publicly 10 video clips of highway driving from a Tesla. 01:12:16.160 |
Half are driven by autopilot, half are driven by human. 01:12:21.120 |
The wheel value is extracted from a perfectly synchronized CAN bus. 01:12:29.760 |
We are collecting all of the messages from the CAN bus, which contain the steering wheel value, and that's synchronized with the video. 01:12:37.600 |
We crop, extract the window, the green one I mentioned, and then provide that as input to the network. 01:12:44.400 |
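A minimal sketch of that crop-and-resize step, assuming OpenCV; the crop window coordinates here are hypothetical, not the ones used in the course code:

```python
import cv2

def preprocess_frame(frame, crop_top=200, crop_bottom=480, crop_left=0, crop_right=1280):
    # Crop the region of the frame that shows the roadway (coordinates are
    # hypothetical), then resize it to the 200x66 input the network expects.
    roi = frame[crop_top:crop_bottom, crop_left:crop_right]
    return cv2.resize(roi, (200, 66))  # cv2.resize takes (width, height)
```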
So this is a slight difference from deep traffic with the red car weaving through traffic 01:12:51.080 |
because there's the messy reality of real world lighting conditions. 01:12:57.720 |
And your task, for the most part, in this simple steering task, is to stay inside the lane, inside the lane markings. 01:13:11.080 |
So ConvNetJS is a JavaScript implementation of CNNs, of convolutional neural networks. 01:13:22.200 |
It supports really arbitrary networks. I mean, all neural networks are simple, 01:13:27.960 |
but because it runs in JavaScript, it's not utilizing GPU. 01:13:32.040 |
The larger the network, the more it's going to be weighed down computationally. 01:13:39.200 |
Now, unlike deep traffic, this isn't a competition, 01:13:44.840 |
but if you are a student registered for the course, you still do have to submit the code. 01:13:49.840 |
You still have to submit your own car as part of the class. 01:13:55.880 |
So the question was the amount of data that's needed. 01:14:01.680 |
Are there general rules of thumb for the amount of data needed for a particular task? 01:14:12.440 |
You generally have to, like I said, neural networks are good memorizers. 01:14:18.840 |
So you have to just have every case represented in the training set that you're interested in 01:14:24.560 |
as much as possible. So that means, in general, if you want a picture, 01:14:31.640 |
if you want to classify the difference in cats and dogs, 01:14:33.920 |
you want to have at least a thousand cats and a thousand dogs, and then you do really well. 01:14:41.920 |
With driving, there are a couple of issues. One is that most of the time driving looks the same. 01:14:47.440 |
And the stuff you really care about is when driving looks different. It's all the edge cases. 01:14:51.520 |
So what we're not good with neural networks is generalizing from the common case to the edge cases, 01:14:58.640 |
to the outliers. So avoiding a crash, just because you can stay on the highway 01:15:03.880 |
for thousands of hours successfully, doesn't mean you can avoid a crash 01:15:07.520 |
when somebody runs in front of you on the road. 01:15:09.560 |
And the other part with driving is the accuracy you have to achieve is really high. 01:15:15.800 |
So for cat versus dog, you know, life doesn't depend on your error, 01:15:22.880 |
but it does depend on your ability to steer a car inside of a lane. 01:15:36.120 |
There's a visualization of the metrics measuring the performance of the network as it trains. 01:15:42.080 |
There is a layer visualization of what features the network is extracting 01:15:49.240 |
at every convolutional layer and every fully connected layer. 01:15:52.440 |
There is ability to restart the training, visualize the network performing on real video. 01:16:08.280 |
There is the input layer, the convolutional layers, the video visualization. 01:16:21.400 |
An interesting tidbit on the bottom right is a barcode that Will has ingeniously designed. 01:16:36.640 |
How do I clearly explain why this is so cool? 01:16:39.760 |
It's a way to, through video, synchronize multiple streams of data together. 01:16:47.360 |
So it's very easy for those who have worked with multimodal data, 01:16:51.560 |
where there are several streams of data, for them to become unsynchronized. 01:16:56.640 |
Especially when a big component of training in neural network is shuffling the data. 01:17:02.040 |
So you have to shuffle the data in clever ways so you're not overfitting any one little aspect of the video 01:17:07.840 |
and yet maintain the data perfectly synchronized. 01:17:10.960 |
So what he did instead of doing the hard work of connecting the steering wheel and the video 01:17:17.760 |
is actually putting the steering wheel on top of the video as a barcode. 01:17:23.160 |
The final result is you can watch the network operate 01:17:30.520 |
and over time it learns more and more to steer correctly. 01:17:36.600 |
I'll fly through this a little bit in the interest of time. 01:17:39.280 |
Just kind of summarize some of the things that you can play with in terms of tutorials and let you guys go. 01:17:44.240 |
This is the same kind of process, end-to-end driving with TensorFlow. 01:17:52.720 |
We just put up code on my GitHub under DeepTesla that takes in a single video 01:17:59.680 |
or an arbitrary number of videos, trains on them and produces a visualization 01:18:05.680 |
that compares the steering wheel, the actual steering wheel and the predicted steering wheel. 01:18:09.720 |
The steering wheel, when it agrees with a human driver or the autopilot system, 01:18:14.800 |
lighting up as green and when it disagrees, lighting up as red. 01:18:21.560 |
Again, this is some of the details of how that's exactly done in TensorFlow. 01:18:26.000 |
This is vanilla convolution neural networks, specifying a bunch of layers, 01:18:30.720 |
convolutional layers, a fully connected layer, train the model, 01:18:38.920 |
run the model over a test set of images and get this result. 01:18:48.240 |
We have a tutorial or IPython notebook and a tutorial up. 01:18:57.680 |
This is perhaps the best way to get started with convolutional neural networks 01:19:04.400 |
It's looking at the simplest image classification problem of traffic light classification. 01:19:14.040 |
We did the hard work of detecting them for you. 01:19:17.680 |
So now you have to figure out, you have to build a convolutional network 01:19:26.800 |
that gets excited when it sees red, yellow or green. 01:19:35.040 |
You can stay after class if you have any concerns with Docker, with TensorFlow, 01:19:42.200 |
with how to win deep traffic, just stay after class or come by Friday 5 to 7.