
MIT 6.S094: Deep Learning for Human-Centered Semi-Autonomous Vehicles


Chapters

0:00
2:28 Driver State Detection: A Computer Vision Perspective
4:10 Crash Test Dummy Design: Hybrid III
5:08 Sequential Detection Approach
6:19 Temporal Convolutional Neural Networks
7:04 Gaze Classification vs Gaze Estimation
10:56 Gaze Classification Pipeline
11:16 Face Alignment
12:18 A General Framework for Semi-Automated Object State Annotation
15:15 Semi-Automated Annotation Work Flow
15:25 Driver Frustration
22:14 Preprocessing Pipeline
22:45 Driver Cognitive Load Estimation
24:25 Human at the Center of Automation: The Way to Full Autonomy includes the Human
25:23 Fundamental Breakthroughs in Deep Learning

Transcript

All right, so the human side of AI. So how do we turn this camera back in on the human? So we've been talking about perception, how to detect cats and dogs, pedestrians, lanes, how to steer a vehicle based on the external environment. The thing that's really fascinating and severely understudied is the human side.

So you know, you talk about the Tesla. We have cameras in 17 Teslas driving around Cambridge, because the Tesla is one of the only vehicles that lets you experience, in a real way on the road, the interaction between the human and the machine. And the thing that we don't have, the thing that deep learning needs on the human side of semi-autonomous and fully autonomous vehicles, is video of drivers.

That's what we're collecting. That's what my work is in: looking at billions of video frames of human beings driving 60 miles an hour plus on the highway in their semi-autonomous Tesla. What are the things that we want to know about the human? If we're a deep learning therapist and we try to break apart the different things we can detect from this raw set of pixels, we can look here: from green to red are the different detection problems, the different computer vision detection problems.

Green means it's less challenging; it's feasible even under poor lighting conditions, variable pose, a noisy environment, poor resolution. Red means it's really hard no matter what you do. That starts on the left with face detection and body pose, which are among the best-studied and easier computer vision problems.

We have huge datasets for these. And then there are microsaccades, the slight tremors of the eye that happen at a rate of a thousand times a second. Let's look at, well, first, why do we even care about the human in the car? One is trust.

This trust part is, if you think about it, that to build trust, the car needs to have some awareness of the biological thing it's carrying inside, the human inside. You kind of assume the car knows about you because you're sitting there controlling it. But if you think about it, almost every single car on the road today has no sensors with which it's perceiving you.

Some cars have a pressure sensor on the steering wheel, and a pressure sensor or some other kind of sensor detecting that you're sitting in the seat. That's the only thing the car knows about you. That's it. So how is the car supposed to, this same car that's driving 70 miles an hour on the highway autonomously, how is it supposed to build trust with you if it doesn't perceive you?

That's one of the critical things here. So if I'm constantly advocating for something, it's that we should have a driver-facing camera in every car. And despite the privacy concerns, and note that you have a camera on your phone and don't have as much of a privacy concern there, the safety benefits are huge.

The trust benefits are huge. So let's start with the easy one: detecting body pose. Why do we care? Well, there's seatbelt design. There are these crash test dummies, which are used to design the safety systems, the passive safety systems, in our cars. And they make certain assumptions about body shapes: male, female, child body shapes.

But they also make assumptions about the position of your body in the seat. They have the optimal position, the position they assume you take. The reality is that in a Tesla, when the car is driving itself, the variability goes up. If you remember the cat, the deformable cat, you start doing a little bit more of that.

You start to reach back in the back seat, in your purse, your bag, for your cell phone, these kinds of things. And that's when the crashes happen. And we need to know how often that happens. The car needs to know that you're in that position. That's critical for that very serious moment when the actual crash happens.

How do you do that? This is a deep learning class, right? So, deep learning to the rescue. Whenever you have these kinds of detection tasks, for example body pose, you're detecting points on the shoulders, points on the head, 5 to 10 points along the arms, the skeleton. How do you do that?

You have a CNN, a convolutional neural network, that takes this input image and acts as a regressor: it gives an X-Y position of whatever you're looking for, the left shoulder or the right shoulder. And then you have a cascade of regressors that give you all these points, that give you the shoulders, the arms, and so on.
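To make that concrete, here's a minimal sketch of that kind of keypoint regressor. This is an assumption on my part: the lecture doesn't specify an architecture, so the layer sizes, the 10-keypoint count, and the use of PyTorch here are all illustrative.

```python
# Minimal sketch of a CNN keypoint regressor (illustrative only, not the
# lecture's actual model). Input: a driver image; output: (x, y) per keypoint.
import torch
import torch.nn as nn

class KeypointRegressor(nn.Module):
    def __init__(self, num_keypoints=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Regression head: one (x, y) pair per keypoint, in normalized coordinates.
        self.head = nn.Linear(128, num_keypoints * 2)

    def forward(self, x):
        f = self.features(x).flatten(1)
        return self.head(f).view(x.size(0), -1, 2)

# Trained with a simple L2 loss against annotated keypoints. A second, "cascade"
# stage could take a crop around each predicted point and refine it the same way.
model = KeypointRegressor()
coords = model(torch.randn(1, 3, 224, 224))  # -> (1, 10, 2)
```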

And then, through time, on every single frame you make that prediction. And then you can optimize: you can make certain assumptions about physics, that your arm can't be in this place in one frame and then over here in the next frame. It moves smoothly through space.

So under those constraints, you can then minimize the error, the temporal error from frame to frame. Or you can just dump in all the frames as if they were different channels: like RGB is three channels, you can think of those channels as extending in time. You can dump all those frames together into what are called 3D convolutional neural networks.

You dump them all together and then you estimate the body pose in all the frames at once. There are some datasets for this from sports, and we're building our own. I don't know who that guy is. Let's fly through this a little bit. So, what's called gaze classification. Gaze is another word for glance, right?

It's a classification problem. Here's one of the TAs for this class. Again, he's not here, because he's married and had to be home. I know where his priorities are at. He's on camera, but he should be here. There are five cameras. This is what we're recording in the Tesla.

This is a Tesla vehicle. In the bottom right there's a blue icon that lights up, automatically detecting whether it's operating under Autopilot. That means the car is currently driving itself. There are five cameras: one of the forward roadway, one of the instrument cluster, one of the center stack, one of the steering wheel, and one of his face.

And then it's a classification problem. You dump the raw pixels into a convolutional neural network with six classes. You're predicting where the person is looking: forward roadway, left, right, center stack, instrument cluster, rearview mirror. And you give it millions of frames for every class. Simple. And it does incredibly well at predicting where the driver is looking.
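As a rough sketch of what that classification setup can look like: the six-class list comes from the lecture, but the backbone (a generic ResNet-18 from a recent torchvision) and the hyperparameters are placeholders of my own, not what was actually used.

```python
# Sketch of a six-class gaze-region classifier. Only the class list comes from
# the lecture; the model and training details are illustrative.
import torch
import torch.nn as nn
from torchvision import models

GAZE_CLASSES = ["forward_roadway", "left", "right",
                "center_stack", "instrument_cluster", "rearview_mirror"]

model = models.resnet18(weights=None, num_classes=len(GAZE_CLASSES))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(face_crops, labels):
    """face_crops: (B, 3, H, W) tensor of frames; labels: (B,) gaze-class indices."""
    optimizer.zero_grad()
    loss = criterion(model(face_crops), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```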

And the process is the same for the majority of the driver state problems that have to do with the face. The face has so much information: where you're looking, emotion, drowsiness, different degrees of frustration. I'll fly through those as well. But the process is the same. There's some pre-processing.

So this is in-the-wild data. There's a lot of crazy light going on. There's noise, there's vibration from the vehicle. So first you have to do video stabilization. You have to remove all that vibration, all that noise as best as you can. There are a lot of algorithms for this, non-neural-network algorithms.
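Here's a rough sketch of what one such classical stabilization step can look like, assuming OpenCV with corner tracking and a rigid-transform fit; the lecture doesn't say which specific algorithm was actually used.

```python
# Hedged sketch of a classical (non-neural) stabilization step: track corners
# between consecutive frames, estimate a rigid transform, and warp the new
# frame to cancel camera shake. Parameters are illustrative.
import cv2
import numpy as np

def stabilize_pair(prev_gray, curr_gray, curr_frame):
    # Detect trackable corners in the previous frame.
    pts_prev = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                       qualityLevel=0.01, minDistance=20)
    # Track them into the current frame with Lucas-Kanade optical flow.
    pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts_prev, None)
    good_prev = pts_prev[status.flatten() == 1]
    good_curr = pts_curr[status.flatten() == 1]
    # Estimate a similarity transform (rotation + translation + scale) between frames.
    m, _ = cv2.estimateAffinePartial2D(good_curr, good_prev)
    # Warp the current frame back toward the previous frame's geometry.
    h, w = curr_gray.shape
    return cv2.warpAffine(curr_frame, m, (w, h))
```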

Boring, but they work for removing the noise, removing the effects of sudden light variations and the vibrations of the vehicle. Then there's automated calibration. You have to estimate the frame of the camera, the position of the camera, and estimate the identity of the person you're looking at. The more you can specialize the network to the identity of the person and the identity of the car the person is riding in, the better the performance on the different driver state classification tasks.

So you personalize the network. You have a background model that works on everyone, and then you specialize it to each individual. This is transfer learning: you specialize each individual network to that one individual. All right. Then there is face frontalization, a fancy name for the fact that no matter where they're looking, you want to transform the face so the eyes and nose are in the exact same position in the image.

That way if you want to look at the eyes and you want to study the subtle movement of the eyes, the subtle blinking, the dynamics of the eyelid, the velocity of the eyelid, it's always in the same place. You can really focus in, remove all effects of any other motion of the head.
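A minimal sketch of that alignment idea, assuming OpenCV: warp each face crop so that a few landmarks land at fixed canonical positions. The canonical coordinates and crop size below are arbitrary placeholders, not values from the lecture.

```python
# Hedged sketch of frontalization/alignment: map the eye corners and nose tip
# to fixed canonical positions so the eye region is always in the same place.
import cv2
import numpy as np

# Placeholder canonical positions for left eye, right eye, nose tip.
CANONICAL = np.float32([[60, 80], [130, 80], [95, 130]])

def frontalize(face_img, left_eye, right_eye, nose_tip, size=(192, 192)):
    src = np.float32([left_eye, right_eye, nose_tip])
    # Affine transform defined by the three point correspondences.
    m = cv2.getAffineTransform(src, CANONICAL)
    return cv2.warpAffine(face_img, m, size)
```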

And then, this is the beauty of deep learning, right? There is some pre-processing because this is real-world data, but you just dump the raw pixels in. You dump the raw pixels in and predict whatever you need. What do you need? One is emotion. So we had a study where people used a crappy and a good voice-based navigation system.

The crappy one got them really frustrated, and they self-reported whether it was a frustrating experience or not on a scale of 1 to 10. That gives us ground truth. So we had a bunch of people use the system and rate themselves as frustrated or not. And then we can train a convolutional neural network to predict whether this person is frustrated or not.

I think we've seen a video of that. Turns out smiling is a strong indication of frustration. You can also predict drowsiness in this way, gaze estimation in this way, cognitive load. I'll briefly look at that. And the process is all the same. You detect the face, you find the landmark points in the face for the face alignment, face frontalization.

And then you dump the raw pixels in for classification, step 5. You can use SVMs there, or you can use what everyone uses now, convolutional neural networks. The one part where CNNs have still struggled to compete is the alignment problem. This is where the cascade of regressors I talked about comes in: finding the landmarks on the eyebrows, the nose, the jawline, the mouth.

There are certain constraints there, and so algorithms that can utilize those constraints effectively can often perform better than end-to-end regressors that just don't have any concept of what a face is shaped like. And there are huge datasets, and we're part of the awesome community that's building those datasets for face alignment.
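One widely used cascade-of-regressors implementation for this kind of face alignment is dlib's 68-point shape predictor (an ensemble of regression trees). The lecture doesn't say which implementation was used, so this is only an illustration of how such a landmark detector is typically called.

```python
# Sketch of landmark detection with dlib's pretrained 68-point shape predictor.
# Requires the shape_predictor_68_face_landmarks.dat model file from dlib.
import dlib
import cv2

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmarks(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    out = []
    for face in faces:
        shape = predictor(gray, face)
        # 68 (x, y) points along the eyebrows, eyes, nose, jawline, and mouth.
        out.append([(shape.part(i).x, shape.part(i).y) for i in range(68)])
    return out
```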

Okay, so this is, again, the TA in his younger form. This is live in the car, real-time system, predicting where they're looking. This is taking slow steps towards the exciting direction that machine learning is headed, which is unsupervised learning. The less you have to have humans look through the data and annotate that data, the more power these machine learning algorithms get.

Right, so currently supervised learning is what's needed. You need human beings to label a cat and label a dog. But what if you could have a human being label only 1%, or a tenth of a percent, of a dataset, only the hard cases, so the machine can come to the human and say, "I don't know what I'm looking at in these pictures." For example, partial occlusions: we're not good at dealing with occlusions, whether it's from your own arm or from light conditions.

We're not good with crazy light drowning out the image. This is what the Google self-driving car actually struggles with when it's trying to use its vision sensors. Moving out of frame, just all kinds of occlusions, are really hard for computer vision algorithms. And in those cases, we want the machine to step in and pass that image on to the human, saying, "Help me out with this." The other side of the corner cases is that in driving, for example, 90-plus percent of the time all you're doing is staring forward at the roadway in the same way.

That's where the machine shines. That's where machine annotation, automated annotation, shines. Because it's seen that face for hundreds of millions of frames already in that exact position. So it can do all the hard work of annotation for you. It's in the transitions away from those positions that it needs a little bit of help.

Just to make sure that this person really did start looking away from the road to the rearview mirror, you bring those frames up. So you use optical flow, putting the optical flow into the convolutional neural network. You use that to predict when something has changed. And when something has changed, you bring that frame to the human for annotation.
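Here's a hedged sketch of that change-detection idea, using dense optical flow magnitude as the "something changed" signal; the threshold and flow parameters are illustrative and not from the lecture, which feeds the flow into a network rather than thresholding it directly.

```python
# Simplified transition flagging: compute dense optical flow between consecutive
# eye/face frames and queue high-motion frames for human annotation.
import cv2
import numpy as np

def needs_human_annotation(prev_gray, curr_gray, threshold=2.0):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    magnitude = np.linalg.norm(flow, axis=2)  # per-pixel motion magnitude
    # Large average motion suggests a glance transition: send it to a human;
    # otherwise let the machine propagate its own label.
    return magnitude.mean() > threshold
```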

All of this is to build a giant annotated dataset of ground truth, billions of frames, on which to train your driver state algorithms. And in this way, you can control the trade-off. On the x-axis is the fraction of frames that a human has to annotate: 0% on the left, 10% on the right.

And then there's the accuracy trade-off. The more the human annotates, the higher the accuracy; you approach 100% accuracy. But you can still do pretty well. For the gaze classification task, that holds with an 84-fold, so almost two orders of magnitude, reduction in human annotation. This is the future of machine learning.

And hopefully one day, no human annotation. And the result is millions of images like this. Video frames. Same thing, driver frustration. This is what I was talking about. The frustrated driver is the one that's on the bottom. So a lot of movement of the eyebrows and a lot of smiling.

And that's true subject after subject. And the satisfied driver, I don't say happy, the satisfied driver is cold and stoic. And that's true subject after subject, because driving is a boring experience and you want it to stay that way. Yes, question. Great, great question. They're not.

Absolutely, that's a great question. So these are cars owned by MIT, and there is somebody in the back. The comment was that my emotions might then have nothing to do with the driving experience. Yes, and let me continue that comment: with your emotions, you're often an actor on a stage for others.

So when you're alone, you might not express emotion. You're really expressing emotion oftentimes for others. Like you're frustrated. So it's like, oh, what the heck. That's for the passenger and that's absolutely right. So one of the cool things we're doing, as I said, we now have over a billion video frames in the Tesla.

We're collecting huge amounts of data in the Teslas, and emotion is a complex thing, right? In this case, we know the ground truth of how frustrated they were. In naturalistic data, when it's just people driving around, we don't know how they're really feeling at the moment. We're not asking them to enter it into an app.

"How are you feeling right now?" But we do know certain things. Like, we know that people sing a lot. That has to be a paper at some point. It's awesome; people love singing. That doesn't happen in this kind of data, because there's somebody sitting in the car, and I think it's the same with the expression of frustration.

Yes, so the comment is that the solo dataset is probably going to be very different from a non-solo dataset with a passenger. And that's very true. The tricky thing about driving, and this is why it's a huge challenge for self-driving cars, for the external-facing sensors and for the internal-facing sensors analyzing human behavior, is that 99.9% of driving is the same thing.

It's really boring. So finding the interesting bits is actually pretty complicated. That applies to emotion. Singing is easy to find: we can track the mouth pretty well, so whenever you're talking or singing, we can find that. But how do you find the subtle expressions of emotion?

It's hard when you're solo. And cognitive load, that's a fascinating thing. It's similar to emotion, but a little more concrete, in the sense that there's a lot of good science on ways to measure cognitive load, cognitive workload, how occupied your mind is; mental workload is another term used.

And the window to the soul, the cognitive workload soul, is the eyes. So, the pupil, well, first of all, the eyes move in two different ways. They move in a lot of ways, but two major ones. There are saccades, these ballistic movements where the eyes jump around. Whenever you look around the room, they're actually just jumping around.

When you read, your eyes are jumping around. And when you follow something, say you follow this bottle with your eyes, your eyes actually move smoothly: smooth pursuit. Somebody actually just told me today that this probably has to do with our hunting background as animals. I don't know how that helps; like, frogs track flies really well.

Anyway, the point is there are smooth pursuit movements where the eyes move smoothly. And those are all indications of certain aspects of cognitive load. And then there are very subtle movements which are almost imperceptible for computer vision. These are microsaccades, these tremors of the eye.

Here's work from here, from Bill Freeman, magnifying those subtle movements. These are taken at 500 frames a second. And for cognitive load, when the pupil, that black dot in the middle of the eye, just in case we don't know what a pupil is, when it gets larger, that's an indicator of high cognitive load.

But it also gets larger when the light is dim. So there's this complex interplay. In the wild, outside, in the car, or just in general outdoors, we can't rely on pupil size. Even though pupil size has been used effectively in the lab to measure cognitive load, it can't be reliably used in the car.

And the same with blinks. When there's higher cognitive load, your blink rate decreases and your blink duration shortens. Okay. I think I'm just repeating the same thing over and over. But you can imagine how we can predict cognitive load, right? We extract video of the eye. Here is the primary eye of the person the system is observing.

It happens to be the same TA once again. We take a sequence of a hundred, oh, it's 90 images. So that's six seconds at 15 frames a second. And we dump that into a 3D convolutional neural network. Again, that means it's 90 grayscale channels rather than 90 separate frames.

And then the prediction is one of three classes of cognitive load: low cognitive load, medium cognitive load, and high cognitive load. And there's ground truth for that, because we had people, over 500 different people, do different tasks of various cognitive load.

And after some frontalization again, where, no matter where the person is looking, the image of the face is transformed in such a way that the corners of the eyes always remain in the same position. After the frontalization, we find the eye with active appearance models: 39 points of the eye, on the eyelids and the iris, and four points on the pupil.

We put all of that into a 3D CNN model: the eye image sequence on the left, the 3D CNN model in the middle, the cognitive load prediction on the right. This code, by the way, is freely available online. All you have to do is feed it a webcam video stream; the CNN runs faster than real time and predicts cognitive load.
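For illustration, here's what a small 3D CNN over a 90-frame grayscale eye sequence can look like in PyTorch. The layer sizes and crop size are placeholders of my own; this is not the released code the lecture refers to.

```python
# Sketch of a 3D CNN over a 6-second, 15-fps grayscale eye sequence,
# predicting low / medium / high cognitive load.
import torch
import torch.nn as nn

class CognitiveLoad3DCNN(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.net = nn.Sequential(
            # Input: (batch, 1 channel, 90 frames, H, W) grayscale eye crops.
            nn.Conv3d(1, 16, kernel_size=(5, 3, 3), padding=(2, 1, 1)), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=(5, 3, 3), padding=(2, 1, 1)), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.net(x)

# Example: a batch of 4 six-second sequences of 64x64 eye crops.
logits = CognitiveLoad3DCNN()(torch.randn(4, 1, 90, 64, 64))  # -> (4, 3)
```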

Same process as detecting the identity of the face, same process as detecting where the driver is looking, same process as detecting emotion. And all of those require very little hyperparameter tuning on the convolutional neural networks. They only require huge amounts of data. And why do we care about detecting what the driver is doing?

And I think Eric has mentioned this. Oh man, this is the comeback of the slide. I was criticized for this being a very cheesy slide. On the path towards full automation, we're likely to take gradual steps. Enough of that. This is better.

And especially given today, our new president, this is pickup truck country. This is manually controlled vehicle country for quite a while. And control being given to somebody else, to the machine, will be a gradual process. It's a gradual process of that machine earning trust.

And through that process, the machine, like the Tesla, like the BMW, like the Mercedes, the Volvo, that's now playing with these ideas, is going to need to see what the human is doing. And for that, to see what the human is doing, we have billions of miles of forward-facing data.

What we need is billions of miles of driver-facing data as well. We're in the process of collecting that. And this is a pitch for automakers and everybody to buy cars that have a driver-facing camera. And let me sort of close. So I said we need a lot of data.

But I think through this class, and through your own research, you'll find that we're in the very early stages of discovering the power of deep learning. For example, as Yann LeCun recently said, it seems that the deeper the network, the better the results in a lot of really important cases.

Even though the data is not increasing. So why does a deeper network give better results? This is a mysterious thing we don't understand. There's these hundreds of millions of parameters and from them is emerging some kind of structure, some kind of representation of the knowledge that we're giving it.

One of my favorite examples of this emergent concept is Conway's Game of Life. Those of you who know what this is will probably criticize me for it being as cheesy as the stairway slide. But I think it's actually such a simple and brilliant example of how a neuron in a neural network is a really simple computational unit.

And then incredible power emerges when you just combine a lot of them in a network. In the same way, this is what's called a cellular automaton. That's a weird pronunciation. And every single cell is operating under a simple rule. You can think of it as a cell living and dying.

It's filled in black when it's alive and white when it's dead. If it's alive and has two or three neighbors, it survives to the next time step. Otherwise it dies. And if it's dead and has exactly three neighbors, it comes back to life.

That's a simple rule. It's just simple. All it's doing is operating under this very local process. Same as a neuron, or the way we're currently training neural networks with local gradients. We're optimizing over a local gradient.
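Just to make that local rule concrete, here is a minimal update step for the Game of Life, a sketch in Python with NumPy that uses wrap-around edges for simplicity.

```python
# One generation of Conway's Game of Life on a 2D grid of 0s (dead) and 1s (alive).
import numpy as np

def step(grid):
    # Count the 8 neighbors of every cell by summing shifted copies of the grid
    # (np.roll wraps around, so the board is effectively a torus).
    neighbors = sum(np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
    # Survival: a live cell with 2 or 3 neighbors stays alive.
    # Birth: a dead cell with exactly 3 neighbors comes alive. Everything else dies.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)
```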

The same, local rules. And what happens when you run this system operating under really local rules is what you get on the right. Again, you have to go home, hopefully no drugs involved, but you have to open up your mind and see how amazing that is. Because what happens is that each local computational unit knows very little about the world.

But somehow really complex patterns emerge, and we don't understand why. In fact, under different rules, incredible patterns emerge, and when you just watch it, it feels like living creatures communicating. Not these examples; this is the original, but they get complex and interesting. Even in these examples, the complex geometric patterns that emerge are incredible.

We don't understand why. Same with neural networks: we don't understand why, and we need to, in order to see how these networks will be able to reason. Okay, so what's next? I encourage you to read the deep learning book. It's available online at deeplearningbook.org. As I mentioned to a few people, well, first, there's a ton of amazing papers coming out every day on arXiv.

I'll put these links up, but there are a lot of good collections of strong papers, lists of papers. There's the literally awesome list, Awesome Deep Learning Papers on GitHub. It calls itself awesome, but it happens to be awesome. And there are a lot of blogs that are just amazing.

That's how I recommend you learn machine learning, is on blogs. And if you're interested in the application of deep learning in the automotive space, you can come do research in our group. Just email me. Anyway, we have three winners. Jeffrey Hu, Michael Gump. And how do you, are you here?

Hey, how do you say your name? Oh, that's not your name. All right, so it stands for Forna, and Doli is... Oh, I see. I'm sorry. Well, anyway. So he achieved the stunning speed of, well, this is kind of incredible.

I didn't know what kind of speed we were going to be able to achieve. I thought 73 was unbeatable, because we played with it for a while and couldn't beat 73. We designed a deterministic algorithm that was able to achieve 74, I believe. Meaning it's cheating, a cheating algorithm that got 74.

And so folks have come up with algorithms that have beaten 73 and then 74. So this is really incredible. And the other two guys, so all three of you get a free term of the Udacity Self-Driving Car Engineering degree. Thanks to those guys for giving that award, and for bringing their army of brilliant people who are obsessed with self-driving cars.

We've received over 2,000 submissions for this competition, a lot of them from those folks, and they're just brilliant. So it's really exciting to have such a big community of deep learning folks working in this field. This, for the rest of eternity, well, we're going to change this up a little bit, but this is actually the three winning neural networks running side by side.

And you can see the number of cars passed there. First place is on the left, then second place and third place. And in fact, third place is almost, wait, no, second place is winning currently. But that just tells you about the random nature of the competition: sometimes you win, sometimes you lose.

So the actual evaluation process runs through a lot of iterations and takes the median evaluation. With that, let me thank you guys so much for, well, wait, wait, you have a question? Will the winning networks be available at all? Yeah, so all three winners wrote me a note about how their networks work.

I did not read that note. So I'll post, this tells you how crazy this has been. I'll post their winning networks online. And I encourage you to continue competing and continue submitting networks. This will run for a while and we're working on a journal paper for this game. We're trying to find the optimal solutions.

Okay, so this is the first time I've ever taught a class. And the first time obviously teaching this class. And so thank you so much for being a part of it. Thank you, Eric. If you didn't get a shirt, please come back, please come down and get a shirt.

Just write your email on the note, on the index note. Thank you. Bye. Thank you. Thank you.