Back to Index

MIT 6.S094: Convolutional Neural Networks for End-to-End Learning of the Driving Task


Chapters

0:00 Intro
0:35 Illustrative Case Study: Traffic Light Detection
0:58 Deep Tesla: End-to-End Learning from Human and Autopilot Driving
2:11 Computer Vision is Machine Learning
4:46 Images are Numbers
6:39 Computer Vision is Hard
8:15 Image Classification Pipeline
8:50 Famous Computer Vision Datasets
10:25 Let's Build an Image Classifier for CIFAR-10
13:08 K-Nearest Neighbors: Generalizing the Image-Diff Classifier
19:15 Reminder: Weighing the Evidence
19:39 Reminder: Classify an Image of a Number
21:29 Reminder: "Learning" is Optimization of a Function
24:00 Convolutional Neural Networks: Layers
24:22 Dealing with Images: Local Connectivity
26:33 ConvNets: Spatial Arrangement of Output Volume
29:56 ConvNets: Pooling
32:32 Computer Vision: Object Recognition / Classification
34:37 Computer Vision: Segmentation
36:15 How Can Convolutional Neural Networks Help Us Drive?
36:32 Driving: The Numbers
37:41 Human at the Center of Automation
38:51 Distracted Humans
39:30 4 D's of Being Human: Drunk, Drugged, Distracted, Drowsy
39:51 In Context: Traffic Fatalities
43:03 Camera and Lens Selection
43:42 Semi-Autonomous Vehicle Components
44:18 Self-Driving Car Tasks
46:05 The Data
51:06 SLAM: Simultaneous Localization and Mapping
52:24 Visual Odometry in Parts
53:55 End-to-End Visual Odometry
55:53 Object Detection
56:23 Full Driving Scene Segmentation
57:48 Road Texture and Condition from Audio
59:02 Previous Approaches: Optimization-Based Control / Where Deep Learning Can Help: Reinforcement Learning

Transcript

All right, welcome back everyone. Sound okay? All right. So yesterday we started to talk about neural networks. Today we'll continue with neural networks that work with images, convolutional neural networks, and see how those types of networks can help us drive a car.

If we have time, we'll cover a simple illustrative case study of detecting traffic lights, the problem of detecting green, yellow, red. If we can't teach our neural networks to do that, we're in trouble. But it's a good, clear, illustrative case study of a three-class classification problem. Okay, next there's DeepTesla.

Here, looped over and over in a very short GIF. This is actually running live on a website right now. We'll show it towards the end of the lecture. This once again, just like DeepTraffic, is a neural network that learns to steer a vehicle based on the video of the forward roadway.

And once again, doing all of that in the browser using JavaScript. So you'll be able to train your very own network to drive using real-world data. I'll explain how. We also have a tutorial and code, briefly described at the end of lecture if there's time, showing how to do the same thing in TensorFlow.

So if you want to build a network that's bigger, deeper, and you want to utilize GPUs to train that network, you want to not do it in your browser. You want to do it offline using TensorFlow and having a powerful GPU on your computer. And we'll explain how to do that.

Computer vision. So we talked about vanilla machine learning where the size of the input is small for the most part. The number of neurons, in the case of neural networks, is on the order of 10, 100, 1000. When you think of images, images are a collection of pixels. One of the most iconic images from computer vision, in the bottom left there, is Lena.

I encourage you to Google it and figure out the story behind that image. It was quite shocking when I found out recently. So once again, computer vision is, these days, dominated by data-driven approaches, by machine learning. Where all of the same methods that are used on other types of data are used on images, where the input is just a collection of pixels.

And pixels are numbers from 0 to 255, discrete values. So we can think exactly what we've talked about previously. We could think of images in the same exact way. It's just numbers. And so we can do the same kind of thing. We could do supervised learning where you have an input image and output label.

The input image here is a picture of a woman. The label might be "woman". Unsupervised learning, same thing. We'll look at that briefly as well: it's clustering images into categories. Again, semi-supervised and reinforcement learning. In fact, the Atari games we talked about yesterday do some pre-processing on the images.

They're doing computer vision. They're using convolutional neural networks as we'll discuss today. And the pipeline for supervised learning is again the same. There's raw data in the form of images. There are labels on those images. A machine learning algorithm performs feature extraction, trains given the inputs and outputs, the images and the labels of those images, constructs a model and then tests that model.

And we get a metric, an accuracy. Accuracy is the term that's often used to describe how well a model performs. It's a percentage. I apologize for the constant presence of cats throughout this course. I assure you this course is about driving, not cats. But images are numbers. So for us, we take it for granted.

We're really good at looking at and converting visual perception as human beings, converting visual perception into semantics. We see this image and we know it's a cat. But a computer only sees numbers, RGB values for a colored image. There's three values for every single pixel from 0 to 255.

And so given that image, we can think of two problems. One is regression and the other is classification. Regression is when given an image, we want to produce a real value output back. So if we have an image of the forward roadway, we want to produce a value for the steering wheel angle.

And if you have an algorithm that's really smart, it can take any image of the forward roadway and produce the perfectly correct steering angle that drives the car safely across the United States. We'll talk about how to do that and where that fails. Classification is when the input again is an image and the output is a class label, a discrete class label.

Underneath it though, often is still a regression problem and what's produced is a probability that this particular image belongs to a particular category. And we use a threshold to chop off the outputs associated with low probabilities and take the labels associated with the high probabilities and convert it into a discrete classification.

I mentioned this yesterday but bear saying again, computer vision is hard. We once again take it for granted. As human beings, we're really good at dealing with all these problems. There's viewpoint variation. The object looks totally different in terms of the numbers behind the images, in terms of the pixels when viewed from a different angle.

Viewpoint variation. Objects, when you're standing far away from them or up close, are totally different size. We're good at detecting that they're different size. It's still the same object as human beings but that's still a really hard problem because those sizes can vary drastically. We talked about occlusions and deformations with cats.

Well understood problem. There's background clutter. You have to separate the object of interest from the background and given the three-dimensional structure of our world, there's a lot of stuff often going on in the background. The clutter. There's intra-class variation that's often greater than inter-class variation, meaning objects of the same type often vary more among themselves than they differ from the objects you're trying to separate them from.

There's the hard one for driving. Illumination. Light is the way we perceive things. The reflection of light off the surface and the source of that light changes the way that object appears and we have to be robust to all of that. So the image classification pipeline is the same as I mentioned.

There's categories. It's a classification problem so there's categories of cat, dog, mug, hat. You have a bunch of examples, image examples of each of those categories and so the input is just those images paired with the category and you train to map, to estimate a function that maps from the images to the categories.

For all of that you need data. A lot of it. Unfortunately, while there's a growing number of data sets, they're still relatively small. We get excited that there are millions of images, but there aren't billions or trillions of images. And these are the data sets that you will see most often if you read academic literature.

MNIST, the one that's been beaten to death and that we use as well in this course, is a data set of handwritten digits where the categories are 0 to 9. ImageNet, one of the largest fully labeled image data sets in the world, has images with a hierarchy of categories from WordNet, and what you see there is a labeling of which images, associated with which words, are present in the data set.

CIFAR-10 and CIFAR-100 are data sets of tiny images that are used to prove, in a very efficient and quick way, that the algorithm you're trying to publish on or trying to impress the world with works well. It's a small data set. CIFAR-10 means there are 10 categories. And Places is a data set of natural scenes: woods, nature, city and so on.

So let's look at CIFAR-10 as a data set of 10 categories, airplane, automobile, bird, cat, so on. They're shown there with sample images as the rows. So let's build a classifier that's able to take images from one of these 10 categories and tell us what is shown in the image.

So how do we do that? Once again, all the algorithm sees is numbers. So we have to try to have, at the very core, we have to have an operator for comparing two images. So given an image and I want to say if it's a cat or a dog, I want to compare it to images of cats and compare it to images of dogs and see which one matches better.

So there has to be a comparative operator. Okay, so one way to do that is take the absolute difference between the two images, pixel by pixel. Take the difference between each individual pixel, shown on the bottom of the slide for a 4x4 image and then we sum that pixel-wise absolute difference into a single number.

So if the image is totally different pixel-wise, that'll be a high number. If it's the same image, the number will be zero. It's the absolute value of the difference, and that's called L1 distance; when we speak of distance we usually mean L2 distance, but the choice doesn't matter much here. So we can build a classifier that just uses this operator to compare against every single image in the dataset and picks the category that's closest under this comparison.
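To make that comparison operator concrete, here is a minimal NumPy sketch of the summed pixel-wise absolute difference; the arrays are made up for illustration, this is not the course's code:

```python
import numpy as np

def l1_distance(img_a, img_b):
    """Sum of pixel-wise absolute differences between two equal-sized images."""
    # Cast to a signed type first so the subtraction of uint8 pixels can't wrap around.
    return np.sum(np.abs(img_a.astype(np.int32) - img_b.astype(np.int32)))

# Two made-up 4x4 grayscale images with values in 0..255.
a = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)
b = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)

print(l1_distance(a, a))  # identical images -> 0
print(l1_distance(a, b))  # different images -> a large positive number
```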

I'm going to find, I have a picture of a cat and I'm going to look through the dataset and find the image that's the closest to this picture and say that is the category that this picture belongs to. So if we just flip the coin and randomly pick which category an image belongs to, we get that accuracy would be on average 10%.

It's random. The accuracy we achieve with our brilliant image difference algorithm that just goes to the dataset and finds the closest one is 38%. It's pretty good. It's way above 10%. So you can think about this operation of looking through the dataset and finding the closest image as what's called k nearest neighbors.

Where k in that case is 1, meaning you find the one closest neighbor to the image that you're asking a question about and accept the label from that image. You could do the same thing increasing k. Increasing k to 2 means you take the two nearest neighbors: you find the two images closest, in terms of pixel-wise image difference, to this particular query image and see which categories those belong to.
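A minimal sketch of that k-nearest-neighbors idea, again with made-up arrays standing in for CIFAR-10 rather than the actual data:

```python
import numpy as np
from collections import Counter

def knn_predict(query, train_images, train_labels, k=1):
    """Label a query image by majority vote among its k closest training images (L1 distance)."""
    diffs = train_images.astype(np.int32) - query.astype(np.int32)
    dists = np.abs(diffs).sum(axis=(1, 2, 3))        # one L1 distance per training image
    nearest = np.argsort(dists)[:k]                  # indices of the k closest images
    votes = Counter(train_labels[nearest].tolist())  # count the labels among those neighbors
    return votes.most_common(1)[0][0]

# Stand-ins for CIFAR-10: 1000 random 32x32x3 images with labels 0..9.
train_images = np.random.randint(0, 256, size=(1000, 32, 32, 3), dtype=np.uint8)
train_labels = np.random.randint(0, 10, size=1000)
query = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

print(knn_predict(query, train_images, train_labels, k=7))
```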

What's shown up top on the left is the dataset we're working with, red, green, blue. What's shown in the middle is the one nearest neighbor classifier, meaning this is how you segment the entire space of different things that you can compare. And if a point falls into any of these regions, it will be immediately associated with the nearest neighbor algorithm to belong to that region.

With five nearest neighbors, there's immediately an issue. The issue is that there are white regions, ties, where your five closest neighbors are from various categories, so it's unclear which one you belong to. This is a good example of parameter tuning. You have one parameter, k, and your task, as the teacher of a machine learning algorithm that does the learning for you, is to figure out that parameter.

That's called parameter tuning, or hyperparameter tuning as it's called in neural networks. And on the bottom right of the slide, the x-axis is k as we increase it from 0 to 100, and the y-axis is classification accuracy. It turns out that the best k for this data set is 7, 7 nearest neighbors.

With that, we get a performance of 30%. Human level performance. And I should say that the way we get that number, as with a lot of the machine learning pipeline process, is you separate the data into a part of the data set you use for training and another part that you use for testing.

You're not allowed to touch the testing part. That's cheating. You construct your model of the world on the training data set and you use what's called cross-validation where you take a small part of the training data, shown in fold 5 there in yellow, to leave that part out from the training and then use it as part of the hyperparameter tuning.

As you train, figure out with that yellow part, fold 5, how well you're doing. And then you choose a different fold and see how well you're doing and keep playing with parameters, never touching the test part. And when you're ready, you run the algorithm on the test data to see how well you really do, how well it really generalizes.
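A rough sketch of that fold-based tuning loop, using the same kind of toy arrays as above and a simple accuracy measure; none of this touches a held-out test set:

```python
import numpy as np

def knn_accuracy(train_x, train_y, val_x, val_y, k):
    """Fraction of validation images whose k-nearest-neighbor vote matches the true label."""
    correct = 0
    for img, label in zip(val_x, val_y):
        dists = np.abs(train_x.astype(np.int32) - img.astype(np.int32)).sum(axis=(1, 2, 3))
        nearest = np.argsort(dists)[:k]
        votes = np.bincount(train_y[nearest], minlength=10)  # tally labels of the k neighbors
        correct += int(np.argmax(votes) == label)
    return correct / len(val_y)

# Toy data standing in for the training split; the held-out test set is never touched here.
images = np.random.randint(0, 256, size=(500, 32, 32, 3), dtype=np.uint8)
labels = np.random.randint(0, 10, size=500)

folds_x = np.array_split(images, 5)
folds_y = np.array_split(labels, 5)

for k in (1, 3, 5, 7):
    accs = []
    for i in range(5):  # each fold takes a turn as the held-out validation fold
        train_x = np.concatenate([f for j, f in enumerate(folds_x) if j != i])
        train_y = np.concatenate([f for j, f in enumerate(folds_y) if j != i])
        accs.append(knn_accuracy(train_x, train_y, folds_x[i], folds_y[i], k))
    print(f"k={k}: mean validation accuracy {np.mean(accs):.3f}")
```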

Yes, question. Is there any way to determine intuition of what a good K may be? Or do you just have to run through all the data? So the question was, is there a good way to, is there any good intuition behind what a good K is? There's general rules for different data sets, but usually you just have to run through it, grid search, brute force.

Yes, question. I have a quick question. Good question. Yes. So each pixel is like one number or is it two numbers? Yes, the question was, is each pixel one number or three numbers? For majority of computer vision throughout its history, you use grayscale images, so it's one number. But RGB is three numbers.

And there's sometimes a depth value too, so it's four numbers. So it's, if you have a stereo vision camera that gives you the depth information of the pixels, that's a fourth. And then if you stack two images together, it could be six. In general, everything we work with will be three numbers for a pixel.

It was, yes. So the question was, for the absolute value, it's just one number. Exactly right. So in that case, it was grayscale images. So it's not RGB images. So that's, you know, this algorithm is pretty good. If we use the best, we optimize the hyperparameters of this algorithm, choose K of seven, seems to work well for this particular CIFAR-10 dataset.

Okay, we get 30% accuracy. It's impressive, higher than 10%. Human beings perform at about 94, slightly above 94% accuracy for CIFAR-10. So given an image, it's a tiny image, I should clarify, it's like a little icon. Given that image, human beings are able to determine accurately one of the 10 categories with 94% accuracy.

And the current state of the art with convolutional neural networks is 95.4% accuracy. And believe it or not, it's a heated battle. But the critical fact here is that it's recently surpassed humans and certainly surpassed the k-nearest neighbors algorithm. So how does this work? Let's briefly look back. It all still boils down to this little guy, the neuron.

It takes a weighted sum of its inputs, adds a bias, and produces an output based on a smooth activation function. Yes, question. The question was, you take a picture of a cat so you know it's a cat, but that's not encoded anywhere. Like, you have to write that down somewhere.

So you have to write as a caption, "This is my cat." And then the unfortunate thing, given the internet and how witty it is, you can't trust the captions and images. Because maybe you're just being clever and it's not a cat at all. It's a dog dressed as a cat.

Yes, question. Sorry, CNNs do better than what? Yeah, so the question was, do convolutional neural networks generally do better than nearest neighbors? There are very few problems on which neural networks don't do better. Yes, they almost always do better, except when you have almost no data. So you need data.

And convolutional neural networks isn't some special magical thing. It's just neural networks with some cheating up front that I'll explain. Some tricks to try to reduce the size and make it capable to deal with images. So again, yeah, the input is, in this case that we looked at, classifying an image of a number, as opposed to doing some fancy convolutional tricks.

We just take the entire 28x28 pixel image, that's 784 pixels as the input. That's 784 neurons in the input, 15 neurons on the hidden layer, and 10 neurons in the output. Now everything we'll talk about has the same exact structure, nothing fancy. There is a forward pass through the network, where you take an input image and produce an output classification.

And there's a backward pass through the network, through back propagation, where you adjust the weights when your prediction doesn't match the ground truth output. And learning just boils down to optimization. It's just optimizing a smooth function, differentiable function, that's defined as the loss function. That's usually as simple as a squared error between the true output and the one you actually got.
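To make that concrete, here's a minimal sketch of the 784-15-10 network with a squared-error loss, written with the Keras API as a convenience; it is not the course's actual TensorFlow code, and the data below is random stand-in data:

```python
import numpy as np
import tensorflow as tf

# 784 input pixels -> 15 hidden neurons -> 10 output neurons, as described above.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(15, activation="sigmoid", input_shape=(784,)),  # hidden layer, smooth activation
    tf.keras.layers.Dense(10, activation="sigmoid"),                      # one output neuron per digit
])

# The loss is the squared error between the true one-hot output and the network's output;
# the backward pass (backpropagation) adjusts the weights to reduce it.
model.compile(optimizer="sgd", loss="mean_squared_error")

# Toy stand-ins for MNIST: 28x28 images flattened to 784 values, labels one-hot encoded.
x = np.random.rand(256, 784).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 10, size=256), num_classes=10)
model.fit(x, y, epochs=1, batch_size=32)
```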

So what's the difference? What are convolutional neural networks? Convolutional neural networks take inputs that have some spatial consistency, have some meaning to the spatial, have some spatial meaning in them, like images. There's other things, you can think of the dimension of time and you can input audio signal into a convolutional neural network.

And so the input is usually, for every single layer, that's a convolutional layer, the input is a 3D volume and the output is a 3D volume. I'm simplifying because you can call it 4D too, but it's 3D. There's height, width and depth. So that's an image. The height and the width is the width and the height of the image.

And then the depth for a grayscale image is 1, for an RGB image is 3, for a 10 frame video of grayscale images, the depth is 10. It's just a volume, a 3D matrix of numbers. And the only thing that a convolutional layer does is take a 3D volume as input, produce a 3D volume as output and has some smooth function operating on the inputs, on the sum of the inputs that may or may not be a parameter that you tune, that you try to optimize.

That's it. So Lego pieces that you stack together in the same way as we talked about before. So what are the types of layers that a convolutional neural network have? There's inputs. So for example, a color image of 32x32 will be a volume of 32x32x3. A convolutional layer takes advantage of the spatial relationships of the input neurons.

And a convolutional layer, it's the same exact neuron as for a fully connected network, the regular network we talked about before. But it just has a narrower receptive field, it's more focused. The inputs to a neuron on the convolutional layer come from a specific region from the previous layer.

And the parameters on each filter, you can think of this as a filter because you slide it across the entire image. And those parameters are shared. So as opposed to taking the, if you think about two layers, as opposed to connecting every single pixel in the first layer to every single neuron in the following layer, you only connect the neurons in the input layer that are close to each other, to the output layer.

And then you enforce the weights to be tied together spatially. And what that results in is a filter, every single layer on the output, you could think of as a filter, that gets excited, for example, for an edge. And when it sees this particular kind of edge in the image, it'll get excited.

It'll get excited in the top left of the image, the top right, bottom left, bottom right. The assumption there is that a powerful feature for detecting a cat is just as important no matter where in the image it is. And this allows you to cut away a huge number of connections between neurons.

But it still boils down on the right as a neuron that sums a collection of inputs and applies weights to them. The spatial arrangement of the output volume relative to the input volume is controlled by three things. The number of filters. So for every single "filter", you'll get an extra layer on the output.

So if the input, let's talk about the very first layer, the input is 32 by 32 by 3. It's an RGB image of 32 by 32. If the number of filters is 10, then the resulting depth, the resulting number of stacked channels in the output will be 10. Stride is the step size of the filter that you slide along the image.

Oftentimes that's just one or three and that directly reduces the spatial size, the width and the height of the output image. And then there is a convenient thing that's often done is padding the image on the outside with zeros so that the input and the output have the same height and width.
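The resulting output width and height follow a simple formula, (W - F + 2P) / S + 1 for input size W, filter size F, padding P and stride S. A small sketch of that arithmetic, with example values that are mine rather than the lecture's:

```python
def conv_output_size(input_size, filter_size, stride, padding):
    """Spatial output size of a convolutional layer: (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# A 32x32x3 input with ten 5x5 filters, stride 1 and zero-padding of 2:
# width and height stay at 32, and the 10 filters give an output volume of 32x32x10.
print(conv_output_size(32, filter_size=5, stride=1, padding=2))  # -> 32

# The same filters with stride 2 roughly halve the spatial size.
print(conv_output_size(32, filter_size=5, stride=2, padding=2))  # -> 16
```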

So this is a visualization of convolution. I encourage you to kind of maybe offline think about what's happening. It's similar to the way human vision works. Crudely so, if there's any experts in the audience. So the input here on the left is a collection of numbers, 0, 1, 2.

And a filter, well, there are two filters, shown as W0 and W1. Shown in red are the different weights of those filters. And each of the filters has a depth just like the input, a depth of 3. So there are three of them in each column. And so, let's see.

Yeah, and so you slide that filter along the image, keeping the weights the same. This is the sharing of the weights. And so your first filter, you pick the weights. This is an optimization problem. You pick the weights in such a way that it fires, it gets excited at useful features and doesn't fire for not useful features.

And then this second filter that fires for useful features and not. And produces a signal on the output depending on a positive number, meaning there's a strong feature in that region and a negative number if there isn't. But the filter is the same. This allows for drastic reduction in the parameters.

And so you can deal with inputs that are a thousand by a thousand pixel image, for example, or video. There's a really powerful concept there. The spatial sharing of weights. That means there's a spatial invariance to the features you're detecting. It allows you to learn from arbitrary images. So you don't have to be concerned about pre-processing the images in some clever way.

You just give it the raw image. There's another operation, pooling. It's a way to reduce the size of the layers. By, for example, in this case, max pooling: taking a collection of outputs, choosing the maximum, and summarizing that collection of pixels such that the output of the pooling operation is much smaller than the input.

Because the justification there is that you don't need a high resolution localization of exactly where which pixel is important in the image according to, you know, you don't need to know exactly which pixel is associated with the cat ear or, you know, a cat face. As long as you kind of know it's around that part.
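Here's a minimal NumPy sketch of 2x2 max pooling with stride 2, the operation just described; the array is illustrative:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep only the largest value in each 2x2 block."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]   # drop odd rows/columns at the edge
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)  # group pixels into 2x2 blocks
    return blocks.max(axis=(1, 3))                  # summarize each block by its maximum

fmap = np.arange(16).reshape(4, 4)  # a made-up 4x4 activation map
print(max_pool_2x2(fmap))           # 2x2 output, one max per block
```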

And that reduces a lot of complexity in the operations. Yes, question. The question was, when is too much pooling? When do you stop pooling? So, pooling is a very crude operation that doesn't have any... One thing you need to know is it doesn't have any parameters that are learnable.

So you can't learn anything clever about pooling. You're just picking, in this case, max pool. So you're picking the largest number. So you're reducing the resolution. You're losing a lot of information. There's an argument that you're not, you know, losing that much information as long as you're not pooling the entire image into a single value.

But you're gaining training efficiency, you're gaining the memory size, you're reducing the size of the network. So it's definitely a thing that people debate and it's a parameter that you play with to see what works for you. Okay, so how does this thing look like as a whole, a convolutional neural network?

The input is an image. There's usually a convolutional layer. There is a pooling operation, another convolutional layer, another pooling operation and so on. At the very end, if the task is classification, you have a stack of convolutional layers and pooling layers. There are several fully connected layers. So you go from those, the spatial convolutional operations to fully connecting every single neuron in a layer to the following layer.

And you do this so that by the end, you have a collection of neurons. Each one is associated with a particular class. So in what we looked at yesterday as the input is an image of a number, 0 through 9, the output here would be 10 neurons. So you boil down that image with a collection of convolutional layers with one or two or three fully connected layers at the end that all lead to 10 neurons.
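As a rough sketch of that stack, assuming the Keras API rather than the networks actually shown in the lecture, a conv-pool-conv-pool arrangement followed by fully connected layers ending in 10 neurons might look like this:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 5, padding="same", activation="relu",
                           input_shape=(28, 28, 1)),      # convolutional layer
    tf.keras.layers.MaxPooling2D(2),                      # pooling layer
    tf.keras.layers.Conv2D(32, 5, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),         # fully connected layer
    tf.keras.layers.Dense(10, activation="softmax"),      # one output neuron per class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```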

And each of those neurons' job is to get fired up when it sees a particular number and for the other ones to produce a low probability. And so this kind of process is how you get the 95 percent accuracy on the CIFAR-10 problem. This here is the ImageNet dataset that I mentioned.

It's how you take this image of a leopard, of a container ship, and produce a probability that that is a container ship or a leopard. Also shown there are the outputs for the other top candidate classes in terms of their confidence. Now you can use the same exact operation by chopping off the fully connected layers at the end, and as opposed to mapping from an image to a prediction of what's contained in the image, you map from the image to another image.

And you can train that output image to be one that gets excited spatially, meaning it gives you a value close to one for areas of the image that contain the object of interest and a low number for areas of the image that are unlikely to contain that object.

And so from this you can go on the left an original image of a woman on a horse to a segmented image of knowing where the woman is and where the horse is and where the background is. The same process can be done for detecting the object. So you can segment the scene into a bunch of interesting objects, candidates for interesting objects and then go through those candidates one by one and perform the same kind of classification as in the previous step where it's just an input as an image and the output is a classification.
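One way to picture that "chop off the fully connected layers" idea is an all-convolutional network whose output keeps a spatial layout, one channel per class. A rough sketch with the Keras API, not the actual segmentation networks from the slides:

```python
import tensorflow as tf

num_classes = 3  # e.g. background, person, horse -- purely illustrative

# Because every layer is convolutional, the output is itself an image-shaped map:
# each spatial location gets its own per-class scores instead of one image-level label.
fcn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu",
                           input_shape=(None, None, 3)),          # accepts any image size
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(num_classes, 1, activation="softmax"),  # 1x1 conv = per-pixel classifier
])

scores = fcn(tf.random.uniform((1, 64, 96, 3)))  # a made-up 64x96 RGB image
print(scores.shape)  # (1, 64, 96, 3): a class-probability map over the image
```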

And through this process of hopping around an image, you can figure out exactly where is the best way to segment the cow out of the image. It's called object detection. Okay, so how can these magical convolutional neural networks help us in driving? This is a video of the forward roadway from a data set that we'll look at, that we've collected from a Tesla.

But first let me look at driving briefly, the general driving task from the human perspective. On average, an American driver in the United States drives 10,000 miles a year, a little more for rural, a little less for urban. There are about 30,000 fatal crashes and 32,000-plus, sometimes as high as 38,000, fatalities a year.

This includes car occupants, pedestrians, bicyclists and motorcycle riders. This may be a surprising fact, but in a class on self-driving cars we should remember it, so ignore the 59.9% that's "other": the most popular cars in the United States are pickup trucks, the Ford F-Series, Chevy Silverado, Ram. It's an important point that we're still married to wanting to be in control.

And so one of the interesting cars that we look at, and the car that the data set we provide to the class is collected from, is a Tesla. It's the one that comes at the intersection of the Ford F-150 and the cute little Google self-driving car on the right.

It's fast, it allows you to have a feeling of control but it can also drive itself for hundreds of miles on the highway if need be. It allows you to press a button and the car takes over. It's a fascinating trade-off of transferring control from the human to the car.

It's a transfer of trust and it's a chance for us to study the psychology of human beings as they relate to machines at 60 plus miles an hour. In case you're not aware, a little summary of human beings. We're distracted things. We'd like to text, use the smartphone, watch videos, groom, talk to passengers, eat, drink.

Texting: 169 billion texts were sent in the US every single month in 2014. On average, five seconds are spent with eyes off the road while texting. Five seconds. That's the opportunity for automation to step in. More than that, there's what NHTSA refers to as the four D's: drunk, drugged, distracted and drowsy.

Each one of those is an opportunity for automation to step in. Drunk driving stands to benefit significantly from automation, perhaps. So let's look at the miles, the data. There are about 3 trillion miles driven every year in the United States. And then there's Tesla Autopilot, our case study for this class, which is driven in full Autopilot mode.

It has driven by itself 300 million miles as of December 2016. And the fatality rate for human-controlled vehicles is one in 90 million miles, so about 30-plus thousand fatalities a year. And currently, under Tesla Autopilot, there has been one fatality. There are a lot of ways you can tear that statistic apart, but it's one to think about.

Already, perhaps, automation results in safer driving. The thing is, we don't understand automation because we don't have the data. We don't have the data on the forward roadway video. We don't have the data on the driver. And we just don't have that many cars on the road today that drive themselves.

So we need a lot of data. We'll provide some of it to you in the class. And as part of our research at MIT, we're collecting huge amounts of it, of cars driving themselves. And collecting that data is how we get to understanding. So let's talk about the data and what we'll be training our algorithms on.

Here is a Tesla Model S, Model X. We have instrumented 17 of them. They have collected over 5,000 hours and 70,000 miles. And I'll talk about the cameras that we put in them. We're collecting video of the forward roadway. This is a highlight of a trip from Boston to Florida of one of the people driving a Tesla.

What's also shown in blue is the amount of time that Autopilot was engaged. Currently zero minutes and then it grows and grows. For prolonged periods of time, so hundreds of miles, people engage Autopilot. Out of 1.3 billion miles driven in a Tesla, 300 million are in Autopilot. You do the math, whatever that is, 25%.

So we are collecting data of the forward roadway and of the driver. We have two cameras on the driver. What we're providing with the class is epochs of time of the forward roadway video, for privacy considerations. The cameras used to record are your regular webcam, the workhorse of the computer vision community, the C920.

And we have some special lenses on top of it. Now what's special about these webcams? Nothing that costs 70 bucks can be that good, right? What's special about them is that they do onboard compression and allow you to collect huge amounts of data and use reasonably sized storage capacity to store that data and train your algorithms on.

So what, on the self-driving side, do we have to work with? How do we build a self-driving car? There are the sensors looking outside: radar, lidar, vision, audio, all helping you detect the objects in the external environment, localize yourself and so on. And there are the sensors facing inside: a visible light camera, audio again, and an infrared camera to help detect pupils.

So we can decompose the self-driving car task into four steps. Localization: answering where am I. Scene understanding: using the texture and information of the scene around you to interpret the identity of the different objects in the scene and the semantic meaning of their movement. Movement planning: once you've figured all that out, found all the pedestrians, found all the cars, how do I navigate through this maze, this clutter of objects, in a safe and legal way.

And there's driver state: how do I detect, using video of the driver or other information, their emotional state or their distraction level. Yes, question. Yes, that's a real-time figure from lidar. Lidar is the sensor that provides you the 3D point cloud of the external scene.

So lidar is a technology used by most folks working with self-driving cars to give you a strong ground truth of the objects. It's probably the best sensor we have for getting 3D information, the least noisy 3D information about the external environment. Question. So autopilot is always changing. One of the most amazing things about this vehicle is that the updates to autopilot come in the form of software.

So the amount of time it's available changes. It's become more conservative with time. But this is one of the earlier versions, and the second line, in yellow, shows how often the autopilot was available but not turned on. So the total driving time was 10 hours, autopilot was available for 7 hours and was engaged for an hour.

This particular person is a responsible, or more cautious, driver, because what you see is that it's raining. Autopilot is still available but... The comment was that you shouldn't trust that one-fatality number as an indication of safety, because the drivers elect to only engage the system when it's safe to do so.

It's a totally open question. There are a lot bigger arguments about that number than just that one. The question is whether that's a bad thing. So maybe we can trust human beings to engage the system: you know, despite the poorly filmed YouTube videos, despite the hype in the media, you're still a human being riding at 60 miles an hour in a metal box with your life on the line.

You won't engage the system unless you know it's completely safe, unless you built up a relationship with it. It's not all the stuff you see where a person gets in the back of a Tesla and starts sleeping or just playing chess or whatever. That's all for YouTube. The reality is when it's just you in the car, it's still your life on the line.

And so you're going to do the responsible thing unless perhaps you're a teenager and so on but that never changes no matter what you're in. The question was what do you need to see or sense about the external environment to be able to successfully drive? Do you need lane markings?

Do you need other... What are the landmarks based on which you do the localization and the navigation? And that depends on the sensors. So with Google self-driving car in sunny California, it depends on LiDAR to, in a high-resolution way, map the environment in order to be able to localize itself based on LiDAR.

And LiDAR, now I don't know the details of exactly where LiDAR fails, but it's not good with rain, it's not good with snow, it's not good when the environment is changing. What snow does is change the visual appearance, the reflective texture, of the surfaces around us. We human beings are still able to figure stuff out, but a car that's relying heavily on LiDAR won't be able to localize itself using the landmarks it previously detected, because they look different now with the snow.

Computer vision can help us with lanes or with following a car. The two landmarks that we use for lane keeping are the car in front of you and the lane markings you stay between. That's the nice thing about our roadways: they're designed for human eyes. So you can use computer vision for lanes, and for cars in front, to follow them.

And there is radar that's a crude but reliable source of distance information that allows you to not collide with metal objects. So all of that together depending on what you want to rely on more gives you a lot of information. The question is when it's the messy complexity of real life occurs, how reliable will it be in the urban environment and so on.

So localization, how can deep learning help? So first let's just quick summary of visual odometry. It's using a monocular or stereo input of video images to determine your orientation in the world. The orientation in this case of a vehicle in the frame of the world. And all you have to work with is a video of the forward roadway and with stereo you get a little extra information of how far away different objects are.

And so this is where one of our speakers on Friday will talk about his expertise, SLAM, Simultaneous Localization and Mapping. This is a very well studied and understood problem of detecting unique features in the external scene and localizing yourself based on the trajectory of those unique features. When the number of features is high enough, it becomes an optimization problem.

You know this particular lane moved a little bit from frame to frame, you can track that information and fuse everything together in order to be able to estimate your trajectory through the three-dimensional space. You also have other sensors to help you out. You have GPS, which is pretty accurate, not perfect but pretty accurate.

It's another signal to help you localize yourself. You also have an IMU: the accelerometer tells you your acceleration, and from the gyroscope and the accelerometer you have six-degrees-of-freedom information about how the moving object, the car, is navigating through space. So you can do that using the old-school way of optimization given a unique set of features like SIFT features.

And with stereo input, that step involves undistorting and rectifying the images. You have two images, and from the two images you compute the depth map. So for every single pixel, you compute your best estimate of the depth of that pixel, its three-dimensional position relative to the camera. That's where you compute what's called the disparity map.

From which you get the distance. Then you detect unique, interesting features in the scene. SIFT is a popular algorithm for detecting unique features. And then you, over time, track those features. And that tracking is what allows you, through vision alone, to get information about your trajectory through three-dimensional space.

You estimate that trajectory. There's a lot of assumptions. Assumptions that bodies are rigid. So you have to figure out if a large object passes right in front of you, you have to figure out what that was. You have to figure out the mobile objects in the scene and those that are stationary.
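To make the "in parts" pipeline a bit more concrete, here is a rough monocular sketch using OpenCV. It uses ORB features rather than SIFT, skips rectification, depth and scale estimation entirely, and the file paths and camera intrinsics are placeholders, so treat it as an illustration of the track-features-then-estimate-pose step rather than a working odometry system:

```python
import cv2
import numpy as np

def relative_pose(frame1, frame2, focal, pp):
    """Estimate camera rotation R and (unit-scale) translation t between two grayscale frames."""
    orb = cv2.ORB_create(2000)                       # ORB instead of SIFT, for simplicity
    kp1, des1 = orb.detectAndCompute(frame1, None)   # unique features in frame 1
    kp2, des2 = orb.detectAndCompute(frame2, None)   # unique features in frame 2

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)              # track features from frame to frame
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Epipolar geometry: essential matrix, then the rotation/translation that best explain
    # how the tracked features moved. RANSAC discards outliers such as moving objects.
    E, _ = cv2.findEssentialMat(pts1, pts2, focal=focal, pp=pp,
                                method=cv2.RANSAC, prob=0.999, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, focal=focal, pp=pp)
    return R, t

# Usage (file names and camera intrinsics are placeholders):
# f1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
# f2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)
# R, t = relative_pose(f1, f2, focal=700.0, pp=(640.0, 360.0))
```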

Or you can cheat, what we'll talk about, and do it using neural networks, end-to-end. Now what does end-to-end mean? And this will come up a bunch of times throughout this class and today. End-to-end means, and I refer to it as cheating because it takes away a lot of the hard work of hand engineering features.

You take the raw input of whatever sensors. In this case, it's taking stereo input from a stereo vision camera. So two images, a sequence of two images coming from a stereo vision camera. And the output is an estimate of your trajectory through space. So as opposed to doing the hard work of SLAM, of detecting unique features, of localizing yourself, of tracking those features and figuring out what your trajectory is, you simply train the network with some ground truth that you have from a more accurate sensor like LiDAR.

And you train it on a set of inputs, the stereo vision inputs, and the outputs are the trajectories through space. You have separate convolutional neural networks for the velocity and for the orientation. And this works pretty well. Unfortunately, not quite well enough. And John Leonard will talk about that. SLAM is one of the places where deep learning has not been able to outperform the previous approaches.

Where deep learning really helps is the scene understanding part. It's interpreting the objects in the scene. It's detecting the various parts of the scene, segmenting them, and, with optical flow, determining their movement. So previous approaches for detecting objects, like the traffic signal classification and detection that we have the TensorFlow tutorial for, used Haar-like features or other types of features that are hand-engineered from the images.

Now we can use convolution neural networks to replace the extraction of those features. And there's a TensorFlow implementation of SegNet, which is taking the exact same neural network that I talked about. It's the same thing. Just the beauty is you just apply similar types of networks to different problems.

And depending on the complexity of the problem, it can get quite amazing performance. In this case, we convolutionalize the network, meaning the output is an image, input is an image, a single monocular image. The output is a segmented image, where the colors indicate your best pixel by pixel estimate of what object is in that part.

This is not using any spatial information, it's not using any temporal information. So it's processing every single frame separately. And it's able to separate the road from the trees, from the pedestrians, other cars and so on. This is intended to lie on top of a radar/lidar type of technology that's giving you the three-dimensional or stereo vision, three-dimensional information about the scene.

You're sort of painting that scene with the identity of the objects that are in it, your best estimate of it. This is something I'll talk about tomorrow: recurrent neural networks. And we can use recurrent neural networks, which work with temporal data, to process video and also process audio.

In this case, we can process what's shown on the bottom, a spectrogram of audio for a wet road and a dry road. You can look at that spectrogram as an image and process it in a temporal way using recurrent neural networks: just slide across it and keep feeding it to the network.

And it does incredibly well on the simple task, certainly, of dry road versus wet road. This is a subtle but very important task, and there are many like it: knowing the texture, the quality, the characteristics of the road, wetness being a critical one. When it's not raining but the road is still wet, that information is very important.
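A rough sketch of that idea: compute a spectrogram of a short audio clip and feed its time slices to a small recurrent network for a binary wet/dry decision. The window sizes, layer sizes and data below are all made up; this is not the course's model:

```python
import numpy as np
import tensorflow as tf
from scipy import signal

def audio_to_spectrogram(audio, sample_rate=16000):
    """Spectrogram as a (time_steps, freq_bins) array, i.e. a sequence of spectra."""
    _, _, spec = signal.spectrogram(audio, fs=sample_rate, nperseg=256, noverlap=128)
    return np.log1p(spec).T.astype("float32")   # time-major, log-compressed magnitudes

# A small recurrent classifier over the spectrogram's time axis: wet (1) vs. dry (0).
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(None, 129)),   # 129 frequency bins from nperseg=256
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy data standing in for one-second clips of road noise.
clips = np.random.randn(8, 16000).astype("float32")
x = np.stack([audio_to_spectrogram(c) for c in clips])   # shape (8, time_steps, 129)
y = np.random.randint(0, 2, size=8)                      # fake wet/dry labels
model.fit(x, y, epochs=1)
```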

Okay, so for movement planning, the same kind of approach: on the right is work from one of our other speakers, Sertac Karaman. The approach we're using to solve traffic through friendly competition is the same one we can use for what Chris Gerdes does with his race cars: planning trajectories for high-speed movement along complex curves.

So we can solve that control problem using optimization, or we can do it with reinforcement learning, by running tens of millions, hundreds of millions of times through that simulation of taking that curve and learning which trajectory optimizes both the speed at which you take the turn and the safety of the vehicle.

Exactly the same thing that you're using for traffic. And for driver state, this is what we'll talk about next week, is all the fun face stuff, eyes, face, emotion. This is, we have video of the driver, video of the driver's body, video of the driver's face. On the left is one of the TAs in his younger days.

Still looks the same. There he is. So that's, in that particular case, you're doing one of the easier problems which is one of detecting where the head and the eyes are positioned. The head and eye pose in order to determine what's called the gaze of the driver, where the driver is looking, glance.

And so, shown from left to right, and we'll talk about these problems, on the left in green are the easier problems and in red are the harder ones from the computer vision perspective. So on the left is body pose, head pose. The larger the object, the easier it is to detect, and the easier its orientation is to detect.

And then there is pupil diameter, detecting the pupil, the characteristics, the position, the size of the pupil. And there's micro saccades, things that happen at one millisecond frequency, the tremors of the eye. All important information to determine the state of the driver. Some are possible with computer vision, some are not.

This is something that we'll talk about, I think on Thursday: the detection of where the driver is looking. So this is a bunch of the cameras that we have in the Tesla. This is Dan driving a Tesla, and we're detecting which of six regions he's looking at. We've converted it into a classification problem of left, right, rear-view mirror, instrument cluster, center stack or forward roadway.

So we have to determine out of those six categories, which direction is the driver looking at. This is important for driving. We don't care exactly the XYZ position of where the driver is looking at. We care that they're looking at the road or not. Are they looking at their cell phone in their lap or are they looking at the forward roadway?

And we'll be able to answer that pretty effectively using convolutional neural networks. You can also look at emotion using CNNs to extract, again, converting emotion, the complex world of emotion into a binary problem of frustrated versus satisfied. This is a video of drivers interacting with a voice navigation system.

If you've ever used one, you know it may be a source of frustration for folks. And so this is self-reported. Driver emotion is one of the hard ones. If you're in what's called affective computing, the field of studying emotion from the computational side, you know that the annotation side of emotion is a really challenging one.

So getting the ground truth of, well, okay, this guy is smiling, so can I label that as happy? Or he's frowning, does that mean he's sad? Most affective computing folks do just that. In this case, we self-report: we ask people how frustrated they were on a scale of 1 to 10.

Dan up top reported a 1, so not frustrated. He's satisfied with the interaction. And the other driver reported as a 9. He was very frustrated with interaction. Now what you notice is there's a very cold stoic look on Dan's face, which is an indication of happiness. And in the case of frustration, the driver is smiling.

So this is a sort of a good reminder that we can't trust our own human instincts in engineering features and engineering the ground truth. We have to trust the data, trust the ground truth that we believe is the closest reflection of the actual semantics of what's going on in the scene.

Okay, so end-to-end driving, getting to the project and the tutorial. So, driving is like a conversation, and thank you to the person who clarified that this video is from the Arc de Triomphe in Paris. If driving is like a natural language conversation, then we can think of end-to-end driving as skipping the entire Turing test component and treating it as end-to-end natural language generation.

So what we do is we take as input the external sensors and as output the control of the vehicle, and the magic happens in the middle. We replace that entire step with a neural network. The TAs told me to not include this image because it's the cheesiest I've ever seen.

I apologize. Thank you, thank you. I regret nothing. So this is to show our path to self-driving cars but it's to explain a point that we have a large data set of ground truth. If we were to formulate the driving task as simply taking external images and producing steering commands, acceleration and braking commands, then we have a lot of ground truth.

We have a large number of drivers on the road every day driving and therefore collecting our ground truth for us because they're an interested party in producing the steering commands that keep them alive. And therefore, if we were to record that data, it becomes ground truth. So if it's possible to learn this, what we can do is we can collect data for the manually controlled vehicles and use that data to train an algorithm to control a self-driving vehicle.

Okay, so one of the first groups that did this is NVIDIA, where they trained a neural network, a simple vanilla convolutional neural network that I'll briefly outline, on external images of the forward roadway: take an image in, produce a steering command out. And they were able, to some degree, to successfully learn to navigate basic turns and curves, and even stop or make sharp turns at a T-intersection.

So this network is simple. There is input on the bottom, output up top. The input is a 66 by 200 pixel image, RGB, shown on the left. Or shown on the left is the raw input and then you crop it a little bit and resize it down. 66 by 200, that's what we have in the code as well.

In the two versions of the code we provide for you, both that runs in the browser and in TensorFlow. It has a few layers, a few convolutional layers, a few fully connected layers and an output. This is a regression network. It's producing not a classification of cat versus dog, it's producing a steering command.

How do I turn the steering wheel? That's it. The rest is magic, and we train it on human input. What we have here is a project, an implementation of the system in ConvNetJS that runs in your browser. This is the tutorial to follow and the project to take on.
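For reference, here is a Keras sketch in the spirit of that architecture, a 66x200x3 image in and a single steering value out. The filter counts follow NVIDIA's published description, but this is my shorthand, not the code provided with the course:

```python
import tensorflow as tf

# 66x200 RGB image in, one steering value out (a regression, not a classification).
model = tf.keras.Sequential([
    tf.keras.layers.Lambda(lambda x: x / 127.5 - 1.0, input_shape=(66, 200, 3)),  # normalize pixels
    tf.keras.layers.Conv2D(24, 5, strides=2, activation="relu"),
    tf.keras.layers.Conv2D(36, 5, strides=2, activation="relu"),
    tf.keras.layers.Conv2D(48, 5, strides=2, activation="relu"),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(100, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1),    # the steering command: a single real-valued output
])
model.compile(optimizer="adam", loss="mean_squared_error")
model.summary()
```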

So unlike the deep traffic game, this is reality. This is a real input from real vehicles. So you can go to this link. Demo went wonderfully yesterday, so let's see. Maybe two for two. So there's a tutorial and then the actual game, the actual simulation is on DeepTeslaJS, I apologize.

Everyone's going there now, aren't they? Does it work on a phone? It does. Great. Again, similar structure. Up top is the visualization of the loss function as the network is learning and it's always training. Next is the input for the layout of the network. There's the specification, the input 200 by 66.

There's a convolutional layer, there's a pooling layer and the output is the regression layer, a single neuron. This is a tiny version, deep tiny, right? It's a tiny version of the NVIDIA architecture. And then you can visualize the operation of this network on real video. The actual wheel value produced by the driver or by the autopilot system is in blue, and the output of the network is in white.

And what's indicated by green is the cropping of the image that is then resized to produce the 66 by 200 input to the network. So once again, amazingly, this is running in your browser, training on real world video. So you can get in your car today, input it and maybe teach a neural network to drive like you.

We have the code in ConvNetJS and TensorFlow to do that, and a tutorial. Well, let me briefly describe some of the work here. So the input to the network is a single image. This is for DeepTeslaJS, a single image. The output is a steering wheel value between -20 and 20.

That's in degrees. We record, like I said, thousands of hours, but we provide publicly 10 video clips of highway driving from a Tesla. Half are driven by Autopilot, half are driven by a human. The wheel value is extracted from a perfectly synchronized CAN bus. We are collecting all of the messages from CAN, which contain the steering wheel value, and that's synchronized with the video.

We crop, extract the window, the green one I mentioned, and then provide that as input to the network. So this is a slight difference from deep traffic with the red car weaving through traffic because there's the messy reality of real world lighting conditions. And your task, for the most part, in this simple steering task, is to stay inside the lane, inside the lane markings.

In an end-to-end way, learn to do just that. So ConvNetJS is a JavaScript implementation of CNNs, of convolutional neural networks. It supports really arbitrary networks; all these networks are simple, but because it runs in JavaScript, it's not utilizing the GPU. The larger the network, the more it's going to be weighed down computationally.

Now, unlike DeepTraffic, this isn't a competition, but if you are a student registered for the course, you still do have to submit the code. You still have to submit your own car as part of the class. Yes, question. So the question was about the amount of data that's needed.

Are there general rules of thumb for the amount of data needed for a particular task, in driving, for example? It's a good question. Like I said, neural networks are good memorizers, so you generally just have to have every case you're interested in represented in the training set as much as possible.

So that means, in general, if you want to classify the difference between cats and dogs, you want to have at least a thousand cats and a thousand dogs, and then you do really well. The problem with driving is twofold. One is that most of the time driving looks the same.

And the stuff you really care about is when driving looks different. It's all the edge cases. So what we're not good at with neural networks is generalizing from the common case to the edge cases, to the outliers. So avoiding a crash: just because you can stay on the highway for thousands of hours successfully doesn't mean you can avoid a crash when somebody runs in front of you on the road.

And the other part with driving is that the accuracy you have to achieve is really high. For cat versus dog, you know, a life doesn't depend on your error. For your ability to steer a car inside of a lane, you'd better be very close to 100% accurate. There's a box for designing the network.

There's a visualization of the metrics measuring the performance of the network as it trains. There is a layer visualization of what features the network is extracting at every convolutional layer and every fully connected layer. There is ability to restart the training, visualize the network performing on real video. There is the input layer, the convolutional layers, the video visualization.

An interesting tidbit on the bottom right is a barcode that Will has ingeniously designed. How do I clearly explain why this is so cool? It's a way to, through video, synchronize multiple streams of data together. So it's very easy for those who have worked with multimodal data, where there are several streams of data, for them to become unsynchronized.

Especially when a big component of training in neural network is shuffling the data. So you have to shuffle the data in clever ways so you're not overfitting any one little aspect of the video and yet maintain the data perfectly synchronized. So what he did instead of doing the hard work of connecting the steering wheel and the video is actually putting the steering wheel on top of the video as a barcode.

The final result is you can watch the network operate and over time it learns more and more to steer correctly. I'll fly through this a little bit in the interest of time. Just kind of summarize some of the things that you can play with in terms of tutorials and let you guys go.

This is the same kind of process, end-to-end driving with TensorFlow. So we have code available on GitHub. We just put it up on my GitHub under DeepTesla; it takes in a single video or an arbitrary number of videos, trains on them and produces a visualization that compares the actual steering wheel and the predicted steering wheel.

The steering wheel lights up as green when it agrees with the human driver or the autopilot system, and as red when it disagrees. Hopefully not too often. Again, this is some of the details of how that's exactly done in TensorFlow. This is a vanilla convolutional neural network: you specify a bunch of layers, convolutional layers, a fully connected layer, then train the model by iterating over the batches of images, run the model over a test set of images and get this result.
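A minimal sketch of that train-then-compare loop; the frames and wheel values below are random stand-ins for the actual clips, the network is a deliberately tiny one, and the 2-degree "agreement" threshold is my own illustrative choice:

```python
import numpy as np
import tensorflow as tf

# Stand-in arrays: cropped/resized frames and their CAN-synchronized wheel values in degrees.
train_frames = np.random.rand(512, 66, 200, 3).astype("float32")
train_wheel = np.random.uniform(-20, 20, size=512).astype("float32")
test_frames = np.random.rand(64, 66, 200, 3).astype("float32")
test_wheel = np.random.uniform(-20, 20, size=64).astype("float32")

# Any steering-regression network will do here; a tiny one keeps the sketch short.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 5, strides=2, activation="relu", input_shape=(66, 200, 3)),
    tf.keras.layers.Conv2D(32, 5, strides=2, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mean_squared_error")

model.fit(train_frames, train_wheel, batch_size=32, epochs=1)   # iterate over batches of images

predicted = model.predict(test_frames).ravel()
agreement = np.abs(predicted - test_wheel) < 2.0   # "green" when within 2 degrees of the recorded value
print(f"within 2 degrees on {agreement.mean():.0%} of test frames")
```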

We have an IPython notebook and a tutorial up. This is perhaps the best way to get started with convolutional neural networks in terms of our class. It's looking at the simplest image classification problem: traffic light classification. So we have these images of traffic lights. We did the hard work of detecting them for you.

So now you have to build a convolutional network that figures out the concept of color and gets excited when it sees red, yellow or green. If anyone has questions, I welcome those. You can stay after class if you have any concerns with Docker, with TensorFlow, or with how to win DeepTraffic; just stay after class or come by Friday 5 to 7.

See you guys tomorrow.