MIT 6.S094: Deep Learning for Human-Centered Semi-Autonomous Vehicles
Chapters
0:00 Introduction
2:28 Driver State Detection: A Computer Vision Perspective
4:10 Crash Test Dummy Design: Hybrid III
5:08 Sequential Detection Approach
6:19 Temporal Convolutional Neural Networks
7:04 Gaze Classification vs Gaze Estimation
10:56 Gaze Classification Pipeline
11:16 Face Alignment
12:18 A General Framework for Semi-Automated Object State Annotation
15:15 Semi-Automated Annotation Workflow
15:25 Driver Frustration
22:14 Preprocessing Pipeline
22:45 Driver Cognitive Load Estimation
24:25 Human at the Center of Automation: The Way to Full Autonomy Includes the Human
25:23 Emergence and Fundamental Breakthroughs in Deep Learning
So how do we turn this camera back in on the human? We've talked about how to detect cats and dogs, pedestrians, and lanes, and how to steer a vehicle based on the external environment.
The thing that's really fascinating and severely understudied is the interaction between the human and the machine. We have cameras in 17 Teslas driving around Cambridge, because Tesla is one of the only vehicles that lets you study that interaction between the human and the machine. Part of the path to semi-autonomous vehicles and fully autonomous vehicles is looking at billions of video frames of human beings.
What are the things that we want to know about the human? We try to break that apart into the different computer vision detection problems. Green means it's feasible even under poor lighting conditions, variable pose, noisy environments, and poor resolution. Red means it's really hard no matter what you do. That's starting on the left with face detection and body pose, which are among the easier computer vision problems, and going all the way to the slight tremors of the eye that happen in a fraction of a second.
Well, first, why do we even care about the human in the car? The car knows almost nothing about the biological thing it's carrying inside, even though you're sitting there controlling it. It has no sensors with which it's perceiving you. Some cars have a pressure sensor on the steering wheel, but that's about it. This same car that's driving 70 miles an hour knows almost nothing about its driver. The argument is that we should have a driver-facing camera in every car.
And you don't have as much of a privacy concern there. Crash test dummies like the Hybrid III make certain assumptions about body shapes and seating positions. But if you're leaning over, digging in your purse, your bag, for your cell phone, those assumptions break down. The car needs to know that you're in that position.
Whenever you have these kinds of detection tasks, the approach is similar. You have a CNN, a convolutional neural network, that takes this input image and produces as output a set of keypoints that give you the shoulders, the arms, and so on. Through time, on every single frame, you make that prediction. And because the frames are sequential, you can make certain assumptions about physics: your arm can't be in this place in one frame and somewhere completely different in the next.
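Here's a minimal sketch of that idea in PyTorch, where the network, the joint count, and the input size are all illustrative assumptions rather than the lecture's actual architecture:

```python
import torch
import torch.nn as nn

# Hypothetical joint count: shoulders, elbows, wrists, and so on.
NUM_KEYPOINTS = 14

class PoseCNN(nn.Module):
    """Tiny CNN that regresses (x, y) coordinates for each joint."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, NUM_KEYPOINTS * 2),  # (x, y) per joint
        )

    def forward(self, x):
        return self.head(self.features(x)).view(-1, NUM_KEYPOINTS, 2)

frame = torch.randn(1, 3, 224, 224)  # one video frame
keypoints = PoseCNN()(frame)         # shape: (1, 14, 2)
```

Running this per frame and then filtering the keypoint tracks over time is one way to encode the physics assumption mentioned above.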
Instead of stacking color channels, you could think of those channels as stacked in time, in what are called 3D convolutional neural networks.
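As a rough illustration (the channel and clip sizes here are made up, not the lecture's), a 3D convolution slides its kernel over time as well as space, so motion itself becomes a learnable feature:

```python
import torch
import torch.nn as nn

# The kernel spans 3 frames x 3 pixels x 3 pixels.
conv3d = nn.Conv3d(in_channels=3, out_channels=16,
                   kernel_size=(3, 3, 3), padding=1)

clip = torch.randn(1, 3, 8, 112, 112)  # (batch, channels, frames, H, W)
out = conv3d(clip)                     # shape: (1, 16, 8, 112, 112)
```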
That means the car is currently driving itself. With gaze estimation, you're predicting exactly where the person is looking, so you have to estimate the frame of the camera relative to the head; with gaze classification, you only predict which region they're looking at. The more you can estimate the identity of the person you're looking at, and the identity of the car the person is riding in, the better the performance for the different driver state classification tasks.
You have a background model that works on everyone, and then you specialize each individual network to that one individual.
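One common way to realize that (a hedged sketch, not necessarily how it was done here) is transfer learning: train a generic model on everyone, then fine-tune just the final layer on a small amount of one driver's data. The region count and architecture below are illustrative:

```python
import torch.nn as nn
import torchvision.models as models

NUM_GAZE_REGIONS = 6  # e.g., road, rearview, left, right, dash, phone

# Background model trained on data from all drivers.
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_GAZE_REGIONS)
# ... train on the full multi-driver dataset here ...

# Specialization: freeze the shared features, re-learn only the
# classifier head on frames from the one individual.
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, NUM_GAZE_REGIONS)
# ... fine-tune on that one driver's data here ...
```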
You warp each face so that the eyes and nose are in the exact same position in the image. When you want to study the subtle movement of the eyes, you remove all effects of any other motion of the head.
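A minimal sketch of that alignment step with OpenCV, assuming eye and nose landmarks have already been detected (the canonical positions are arbitrary choices):

```python
import cv2
import numpy as np

# Fixed target positions: left eye, right eye, nose tip.
CANONICAL = np.float32([[60, 80], [140, 80], [100, 130]])

def align_face(frame, left_eye, right_eye, nose, size=(200, 200)):
    """Warp the frame so the three landmarks land at CANONICAL."""
    src = np.float32([left_eye, right_eye, nose])
    M = cv2.getAffineTransform(src, CANONICAL)
    return cv2.warpAffine(frame, M, size)
```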
Once it's aligned, you dump the raw pixels in and predict whatever you need.
Drivers self-reported whether it was a frustrating experience or not; you know, they rated themselves as frustrated or not. From that video, we can train a convolutional neural network to predict the self-reported label. It turns out smiling is a strong indication of frustration.
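Since the label is binary (frustrated or not), the natural framing is binary classification; here's a toy sketch, with an invented architecture, of what training against that self-reported label could look like:

```python
import torch
import torch.nn as nn

# Tiny CNN mapping an aligned face image to a single logit.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 1),
)
loss_fn = nn.BCEWithLogitsLoss()  # 1 = self-reported frustrated, 0 = not

faces = torch.randn(8, 3, 128, 128)           # batch of aligned faces
labels = torch.randint(0, 2, (8, 1)).float()  # self-reported labels
loss = loss_fn(model(faces), labels)
loss.backward()
```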
This is the one part where CNNs have still struggled to compete. This is where I talked about the cascade of regressors: detecting landmarks on the eyebrows, the nose, the jawline, the mouth. Faces have a heavily constrained shape, and so algorithms that can utilize those constraints effectively can often perform better than end-to-end regressors that just don't have any concept of what a face is shaped like.
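dlib ships a well-known implementation of this idea, an ensemble of regression trees (Kazemi and Sullivan, 2014). A minimal usage sketch, assuming the standard 68-landmark model file has been downloaded:

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.imread("driver_frame.jpg")  # hypothetical input frame
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for face in detector(gray):
    shape = predictor(gray, face)  # cascade of regressors refines 68 points
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```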
And that's thanks to the awesome community that's building those datasets. Okay, so this is, again, the TA in his younger form.
This is an exciting direction that machine learning is headed in: the less you have to have humans look through the data, the more power these machine learning algorithms get. Currently, supervised learning is what's needed. You need human beings to label a cat and label a dog. But ideally you only have a human being label 1%, so the machine can come to the human and be like, "I don't know what I'm looking at in these pictures."
Occlusion is hard, whether it's your own arm or because of lighting conditions. This is what the Google self-driving car actually struggles with when they're trying to use their vision sensors. All kinds of occlusions are really hard.
Most of the time, all you're doing is staring forward at the roadway in the same way, so the algorithm can do all the hard work of annotation for you. It's in the transitions away from those positions that things get difficult. You use that to predict when something has changed, and that's what you bring to the human for annotation.
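In pseudocode terms, this is a confidence-routing loop; the threshold and shapes below are invented for illustration:

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.95  # illustrative value

def route_frames(probabilities):
    """probabilities: (num_frames, num_classes) softmax outputs.

    Frames the model is confident about are auto-labeled;
    low-confidence frames (often the transitions) go to a human.
    """
    confidence = probabilities.max(axis=1)
    auto_labeled = np.where(confidence >= CONFIDENCE_THRESHOLD)[0]
    needs_human = np.where(confidence < CONFIDENCE_THRESHOLD)[0]
    return auto_labeled, needs_human

# Fake softmax outputs standing in for a gaze classifier's predictions.
probs = np.random.dirichlet(np.ones(6), size=1000)
auto_idx, human_idx = route_frames(probs)
```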
We're collecting huge amounts of data in the Teslas. We're not asking drivers to, like, enter anything in an app.
It's interesting in the sense that there's a lot of good science here. Your eyes are actually going to move smoothly when tracking a moving object, which probably has to do with our hunting background. There are also tiny movements which are almost imperceptible for computer vision. And even though pupil size has been used effectively as a measure of cognitive load in the lab, it's hard to use in the car, where lighting changes constantly.
I think I'm just repeating the same thing over and over. And we dump that sequence into a 3D convolutional neural network.
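A hedged sketch of that last stage, with an invented architecture and an illustrative three-level load output:

```python
import torch
import torch.nn as nn

class CognitiveLoad3DCNN(nn.Module):
    """Classifies a short clip of aligned eye-region crops."""
    def __init__(self, num_classes=3):  # e.g., low / medium / high load
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, clips):  # (batch, 1, frames, H, W), grayscale
        return self.net(clips)

clips = torch.randn(4, 1, 16, 64, 64)
logits = CognitiveLoad3DCNN()(clips)  # shape: (4, 3)
```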
The image of the face is transformed in such a way that the eyes are always in the exact same position. It's the same process as detecting the identity of the face, the same process as detecting where the driver is looking. And all of those require very little hyperparameter tuning.
I was criticized for this being a very cheesy slide. Full autonomy won't arrive all at once; we're likely to take gradual steps towards it. Control is being given to somebody else, to the machine, and it's a gradual process of that machine earning trust. To earn that trust, the machine is going to need to see what the human is doing. So alongside billions of miles of driving data, what we need is billions of miles of driver-facing data as well. If you look at the research, you'll find that we're in the very early stages of this.
So why does a deeper network give better results? This is a mysterious thing we don't understand. There are these hundreds of millions of parameters, and somehow useful concepts emerge from them. One of my favorite examples of this emergent concept is Conway's Game of Life; the TA will probably criticize me for it being as cheesy as the stairway slide. A neuron in a neural network is a really simple computational unit, and then incredible power emerges when you just combine a lot of them in a network. The Game of Life is the same kind of system: a grid of cells, each one simple on its own.
And every single cell is operating under a simple rule. You can think of it as a cell living and dying: a live cell with two or three neighbors survives, a dead cell with exactly three neighbors comes alive, and in every other case the cell dies or stays dead. All it's doing is operating under this very local process.
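Those rules fit in a few lines; here's a minimal NumPy version (with wraparound edges) that you can run to watch global patterns emerge from the local rule:

```python
import numpy as np

def life_step(grid):
    """One Game of Life step; grid is a 2D array of 0s and 1s."""
    # Count each cell's eight neighbors using shifted copies of the grid.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Born with exactly 3 neighbors; survive with 2 or 3.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)

grid = (np.random.rand(32, 32) < 0.3).astype(int)
for _ in range(10):
    grid = life_step(grid)
```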
Or perhaps it's in the way we're currently training neural networks. I encourage you to read the Deep Learning book by Goodfellow, Bengio, and Courville.
That's how I recommend you learn machine learning. You can also look at the Udacity Self-Driving Car Engineer Nanodegree. It's been great to have such a big community of deep learning folks.
But this is actually the top three neural networks, and you can see the number of cars passed there. The actual evaluation process runs through a lot of iterations and takes the median evaluation.
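That is, scores are aggregated with a median rather than a single run; something like the following, where the run count and the stand-in simulator are purely illustrative:

```python
import numpy as np

def evaluate(run_once, num_runs=500):
    """Run many independent simulations and report the median score,
    so one lucky or unlucky run doesn't decide the leaderboard."""
    scores = [run_once() for _ in range(num_runs)]
    return float(np.median(scores))

# Hypothetical stand-in for one competition simulation run.
median_score = evaluate(lambda: np.random.normal(65.0, 2.0))
```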
With that, let me thank you guys so much. The competition will run for a while, and we're working on a journal paper. And this is obviously the first time teaching this class. So thank you so much for being a part of it.