
MIT 6.S094: Deep Learning for Human-Centered Semi-Autonomous Vehicles


Chapters

0:00
2:28 Driver State Detection: A Computer Vision Perspective
4:10 Crash Test Dummy Design: Hybrid III
5:08 Sequential Detection Approach
6:19 Temporal Convolutional Neural Networks
7:04 Gaze Classification vs Gaze Estimation
10:56 Gaze Classification Pipeline
11:16 Face Alignment
12:18 A General Framework for Semi-Automated Object State Annotation
15:15 Semi-Automated Annotation Workflow
15:25 Driver Frustration
22:14 Preprocessing Pipeline
22:45 Driver Cognitive Load Estimation
24:25 Human at the Center of Automation: The Way to Full Autonomy Includes the Human
25:23 and Fundamental Breakthroughs in Deep Learning

Whisper Transcript

00:00:00.000 | All right, so the human side of AI.
00:00:04.700 | So how do we turn this camera back in on the human?
00:00:11.400 | So we've been talking about perception,
00:00:14.500 | how to detect cats and dogs, pedestrians, lanes,
00:00:20.200 | how to steer a vehicle based on the external environment.
00:00:23.900 | The thing that's really fascinating and severely understudied
00:00:30.100 | is the human side.
00:00:31.700 | So you know, you talk about the Tesla,
00:00:34.600 | we have cameras in 17 Teslas driving around Cambridge
00:00:39.000 | because Tesla is one of the only vehicles allowing you
00:00:43.500 | to experience, in a real way, on the road,
00:00:49.900 | the interaction between the human and the machine.
00:00:53.300 | And the thing that we don't have,
00:00:57.800 | that deep learning needs on the human side
00:01:01.200 | of semi-autonomous vehicles and fully autonomous vehicles
00:01:05.300 | is video of drivers.
00:01:08.000 | That's what we're collecting.
00:01:10.000 | That's what my work is in,
00:01:12.800 | is looking at billions of video frames of human beings
00:01:17.500 | driving 60 miles an hour plus on the highway
00:01:21.700 | in their semi-autonomous Tesla.
00:01:25.500 | What are the things that we want to know about the human?
00:01:28.200 | If we're a deep learning therapist
00:01:33.300 | and we try to break apart the different things
00:01:37.800 | we can detect from this raw set of pixels,
00:01:40.300 | we can look here, from green to red,
00:01:43.700 | at the different detection problems,
00:01:45.100 | the different computer vision detection problems.
00:01:47.000 | Green means it's less challenging,
00:01:52.500 | it's feasible even under poor lighting conditions,
00:01:56.000 | variable pose, noisy environment, poor resolution.
00:02:01.300 | Red means it's really hard no matter what you do.
00:02:05.300 | That's starting on the left with face detection and body pose,
00:02:09.500 | one of the best studied
00:02:11.300 | and one of the easier computer vision problems.
00:02:13.500 | We have huge data sets for these.
00:02:16.700 | And then there are microsaccades,
00:02:19.300 | the slight tremors of the eye that happen
00:02:22.900 | at a rate of a thousand times a second.
00:02:25.900 | Let's look at,
00:02:30.500 | well first, why do we even care about the human in the car?
00:02:36.500 | One is trust.
00:02:39.100 | This trust part is,
00:02:40.700 | so you think about it,
00:02:42.300 | to build trust,
00:02:45.200 | the car needs to have some awareness
00:02:47.900 | of the biological thing it's carrying inside,
00:02:52.000 | the human inside.
00:02:52.800 | You kind of assume the car knows about you
00:02:55.000 | because you're like sitting there controlling it.
00:02:56.800 | But if you think about it,
00:02:59.300 | almost every single car on the road today
00:03:01.800 | has no sensors with which it's perceiving you.
00:03:04.900 | Some cars have a pressure sensor on the steering wheel
00:03:09.000 | and a pressure sensor or some kind of sensor
00:03:12.600 | detecting that you're sitting in the seat.
00:03:14.900 | That's the only thing it knows about you.
00:03:17.000 | That's it.
00:03:19.300 | So how is the car supposed to,
00:03:21.200 | this same car that's driving 70 miles an hour
00:03:25.200 | on the highway autonomously,
00:03:27.000 | how is it supposed to build trust with you
00:03:29.800 | if it doesn't perceive you?
00:03:31.100 | That's one of the critical things here.
00:03:33.800 | So if I'm constantly advocating something,
00:03:36.800 | it's that we should have a driver-facing camera in every car.
00:03:39.900 | And despite the privacy concerns,
00:03:43.000 | after all, you have a camera on your phone
00:03:44.900 | and you don't have as much of a privacy concern there,
00:03:51.100 | the safety benefits are huge.
00:03:54.400 | The trust benefits are huge.
00:03:57.000 | So let's start with the easy one, body pose,
00:04:01.900 | detecting body pose.
00:04:03.200 | Why do we care?
00:04:05.200 | So there's seatbelt design.
00:04:07.600 | There are these dummies,
00:04:10.600 | crash test dummies,
00:04:12.500 | which are used to design the safety systems,
00:04:16.600 | the passive safety systems in our cars.
00:04:18.700 | And they make certain assumptions about body shapes,
00:04:21.700 | male, female, child, body shapes.
00:04:25.000 | But they also make assumptions about
00:04:27.000 | the position of your body in the seat.
00:04:29.000 | They have the optimal position,
00:04:32.000 | the position they assume you take.
00:04:34.100 | The reality is, in a Tesla,
00:04:37.400 | when the car is driving itself,
00:04:39.800 | the variability, if you remember the cat,
00:04:42.500 | the deformable cat,
00:04:44.300 | you start doing a little bit more of that.
00:04:46.900 | You start to reach back in the back seat,
00:04:49.200 | in your purse, your bag, for your cell phone,
00:04:52.600 | these kinds of things.
00:04:53.900 | And that's when the crashes happen.
00:04:55.600 | And we need to know how often that happens.
00:04:58.700 | The car needs to know that you're in that position.
00:05:01.100 | That's critical for that very serious moment
00:05:05.300 | when the actual crash happens.
00:05:06.700 | How do you do it?
00:05:10.200 | This is a deep learning class, right?
00:05:12.800 | So, deep learning to the rescue.
00:05:15.200 | Whenever you have these kind of tasks of detecting,
00:05:19.100 | for example, body poses,
00:05:21.000 | you're detecting points of the shoulders,
00:05:22.700 | points of the head,
00:05:23.600 | 5, 10 points along the arms, the skeleton.
00:05:27.200 | How do you do that?
00:05:30.600 | You have a CNN, a convolutional neural network,
00:05:34.100 | that takes this input image and, as an output,
00:05:37.600 | it's a regressor,
00:05:38.600 | gives an X,Y position of
00:05:41.200 | whatever you're looking for,
00:05:42.600 | the left shoulder or the right shoulder.
00:05:43.900 | And then you have a cascade of regressors
00:05:46.100 | that give you all these points,
00:05:48.200 | that give you the shoulders, the arms and so on.
00:05:50.300 | And then you have,
00:05:52.100 | through time, on every single frame you make that prediction.
00:05:55.700 | And then you optimize,
00:05:58.900 | you know,
00:06:00.300 | you can make certain assumptions about physics,
00:06:03.800 | your arm can't be in this place in one frame
00:06:07.300 | and then the next frame be over here.
00:06:09.300 | It moves smoothly through space.
00:06:11.600 | So under those constraints,
00:06:12.900 | you can then minimize the error,
00:06:15.400 | the temporal error from frame to frame.
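A quick sketch of that idea (illustrative only, not the system from the lecture): treat each frame's detector output as a noisy measurement of a keypoint coordinate, and solve a least-squares problem that trades fidelity to the detections against frame-to-frame smoothness. The function name and the lambda value are made up for the example.

```python
import numpy as np

def smooth_trajectory(y, lam=5.0):
    """Minimize sum_t (x_t - y_t)^2 + lam * sum_t (x_{t+1} - x_t)^2.

    y: (T,) array of per-frame keypoint coordinates from the regressor.
    Closed form: solve (I + lam * L) x = y, where L is the path-graph
    Laplacian (the discrete second-difference operator).
    """
    y = np.asarray(y, dtype=float)
    T = y.shape[0]
    L = np.zeros((T, T))
    for t in range(T - 1):
        L[t, t] += 1.0
        L[t + 1, t + 1] += 1.0
        L[t, t + 1] -= 1.0
        L[t + 1, t] -= 1.0
    return np.linalg.solve(np.eye(T) + lam * L, y)

# A jittery shoulder x-coordinate: smoothing pulls the outlier at t=2 back in line.
noisy = np.array([10.0, 10.5, 25.0, 11.5, 12.0])
smoothed = smooth_trajectory(noisy, lam=5.0)
```

Because the Laplacian has zero row sums, the smoothed trajectory keeps the same total as the detections while pulling physically implausible jumps back toward their neighbors.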
00:06:20.100 | Or you can just dump all the frames
00:06:23.700 | as if there are different channels,
00:06:25.200 | like RGB is three channels,
00:06:26.800 | you could think of those channels as in time.
00:06:29.000 | You can dump all those frames together
00:06:31.000 | in what are called 3D convolutional neural networks.
00:06:34.400 | You dump them all together
00:06:36.100 | and then you estimate the body pose
00:06:38.100 | in all the frames at once.
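The stacking-frames-as-channels idea can be sketched with a naive 3D convolution over a time x height x width volume. This toy loop version is not a real framework layer, and the hand-built temporal-difference kernel is just for illustration; it shows the shapes involved when frames are stacked the way RGB stacks color channels.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D convolution (cross-correlation) over (T, H, W)."""
    t, h, w = kernel.shape
    T, H, W = volume.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(volume[i:i+t, j:j+h, k:k+w] * kernel)
    return out

# Stack grayscale frames along a "time" axis, like RGB stacks 3 color channels.
frames = [np.full((4, 4), v, dtype=float) for v in (0.0, 1.0, 2.0)]
volume = np.stack(frames)            # shape (3, 4, 4): time x height x width

# A kernel spanning 2 frames that responds to change over time.
temporal_diff = np.zeros((2, 1, 1))
temporal_diff[0, 0, 0] = -1.0
temporal_diff[1, 0, 0] = 1.0
response = conv3d_valid(volume, temporal_diff)
```

Since each frame here is brighter than the last by 1.0, the temporal-difference kernel responds with a constant 1.0 everywhere, which is exactly the kind of motion signal a learned 3D kernel can pick up across frames.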
00:06:40.400 | And there are some data sets for sports
00:06:42.600 | and we're building our own.
00:06:44.700 | I don't know who that guy is.
00:06:46.600 | Let's fly through this a little bit.
00:06:51.600 | So what's called gaze classification.
00:06:54.800 | Gaze is another word for glance, right?
00:06:57.300 | It's a classification problem.
00:07:00.600 | Here's one of the TAs for this class.
00:07:04.600 | Again, not here,
00:07:09.500 | because he's married,
00:07:10.500 | had to be home.
00:07:11.500 | I know where his priorities are at.
00:07:13.900 | This is on camera,
00:07:14.800 | should be here.
00:07:15.600 | There's five cameras.
00:07:18.700 | This is what we're recording in the Tesla.
00:07:20.100 | This is a Tesla vehicle.
00:07:21.200 | In the bottom right,
00:07:24.100 | there's a blue icon that lights up,
00:07:25.900 | automatically detecting
00:07:27.100 | if it's operating under Autopilot.
00:07:28.700 | That means the car is currently driving itself.
00:07:30.900 | There are five cameras,
00:07:32.200 | one of the forward roadway,
00:07:33.400 | one of the instrument cluster,
00:07:34.700 | one of the center stack,
00:07:35.700 | one of the steering wheel,
00:07:36.500 | one of his face.
00:07:37.800 | And then it's a classification problem.
00:07:40.400 | You dump the raw pixels
00:07:41.700 | into a convolutional neural network,
00:07:43.400 | have six classes,
00:07:45.500 | forward roadway,
00:07:46.700 | you're predicting where the person is looking,
00:07:49.000 | forward roadway, left, right,
00:07:50.500 | center stack, instrument cluster,
00:07:54.400 | rear view mirror,
00:07:55.600 | and you give it millions of frames
00:07:58.700 | for every class.
00:07:59.500 | Simple.
00:08:00.300 | And it does incredibly well at predicting
00:08:06.800 | where the driver is looking.
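A minimal sketch of that classification step, with a plain linear-softmax layer standing in for the convolutional network; the class list, image size, and random weights here are assumptions for illustration only. The point is the shape of the problem: raw pixels in, a probability per glance region out.

```python
import numpy as np

GLANCE_CLASSES = ["forward roadway", "left", "right", "center stack",
                  "instrument cluster", "rearview mirror"]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict_glance(image, weights, bias):
    """Score a face crop against the six glance regions.

    image: (H, W) grayscale crop; weights: (H*W, 6); bias: (6,).
    A real system would be a trained CNN; a linear layer keeps the
    pixels-in, class-probabilities-out structure visible.
    """
    probs = softmax(image.ravel() @ weights + bias)
    return GLANCE_CLASSES[int(np.argmax(probs))], probs

rng = np.random.default_rng(0)
image = rng.random((32, 32))                 # stand-in for a face crop
weights = rng.normal(size=(32 * 32, 6)) * 0.01
bias = np.zeros(6)
label, probs = predict_glance(image, weights, bias)
```

With millions of labeled frames per class, training replaces these random weights, but the inference path stays this simple: dump the pixels in, read off the argmax.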
00:08:08.000 | And the process is the same
00:08:09.700 | for majority of the driver state problems
00:08:13.300 | that have to do with the face.
00:08:14.400 | The face has so much information,
00:08:16.300 | where you're looking,
00:08:18.300 | emotion,
00:08:19.300 | drowsiness,
00:08:20.400 | so different degrees of frustration.
00:08:22.800 | I'll fly through those as well.
00:08:24.200 | But the process is the same.
00:08:25.600 | There's some pre-processing.
00:08:27.200 | So this is in the wild data.
00:08:29.300 | There's a lot of crazy light going on.
00:08:31.400 | There's noise,
00:08:32.200 | there's vibration from the vehicle.
00:08:33.800 | So first you have to do
00:08:36.100 | video stabilization.
00:08:37.100 | You have to remove all that vibration,
00:08:38.800 | all that noise as best as you can.
00:08:40.600 | There's a lot of
00:08:41.300 | algorithms,
00:08:43.000 | non-neural-network algorithms.
00:08:44.900 | Boring, but they work,
00:08:48.000 | for removing the noise,
00:08:50.800 | removing the effects of
00:08:51.900 | sudden light variations
00:08:53.700 | and the vibrations of the vehicle.
00:08:55.100 | There's the automated calibration.
00:08:57.300 | So you have to estimate the frame of the camera,
00:08:59.700 | the position of the camera,
00:09:02.200 | estimate the identity of the person you're looking at.
00:09:06.000 | The more you can specialize the network
00:09:08.000 | to the identity of the person
00:09:09.500 | and the identity of the car the person is riding in,
00:09:12.300 | the better the performance for the different driver state classification.
00:09:15.700 | So you personalize the network.
00:09:18.000 | You have a background model that works on everyone
00:09:20.100 | and you specialize each individual.
00:09:21.900 | This is transfer learning.
00:09:22.900 | You specialize each individual network to that one individual.
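A toy sketch of that personalization idea: a frozen "background" feature extractor shared across everyone, with only a small per-driver head refit on that driver's labeled frames. The random-projection features and ridge least-squares head are stand-ins for the frozen CNN layers and fine-tuned layer of a real transfer-learning setup.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Background model": a fixed feature extractor trained on everyone.
# Here a frozen random projection stands in for frozen CNN layers.
W_frozen = rng.normal(size=(64, 16))

def features(x):
    return np.tanh(x @ W_frozen)        # frozen: never updated per driver

def personalize(x_driver, y_driver, ridge=1e-2):
    """Fit only the last layer on one driver's labeled frames."""
    F = features(x_driver)
    # Ridge-regularized least squares for the personalized head.
    return np.linalg.solve(F.T @ F + ridge * np.eye(F.shape[1]),
                           F.T @ y_driver)

x = rng.normal(size=(40, 64))          # 40 feature vectors from one driver
y = rng.integers(0, 2, size=(40,)).astype(float)
head = personalize(x, y)
preds = features(x) @ head
```

Only the 16-parameter head is touched per driver; the shared extractor stays fixed, which is why a handful of frames per person is enough to specialize.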
00:09:26.000 | All right.
00:09:27.200 | There is face frontalization.
00:09:29.900 | Fancy name
00:09:31.500 | for the fact that
00:09:32.700 | no matter where they're looking,
00:09:34.000 | you want to transform that face
00:09:35.300 | so the eyes
00:09:36.000 | and nose are in the exact same position in the image.
00:09:38.700 | That way if you want to look at the eyes
00:09:41.400 | and you want to study the subtle movement of the eyes,
00:09:45.200 | the subtle blinking,
00:09:46.500 | the dynamics of the eyelid,
00:09:49.000 | the velocity of the eyelid,
00:09:50.600 | it's always in the same place.
00:09:51.900 | You can really focus in,
00:09:53.500 | remove all effects of any other motion of the head.
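Frontalization of landmarks can be sketched as a least-squares similarity transform (rotation, uniform scale, translation) that maps detected points onto canonical positions. The coordinates below are made up, chosen so that an exact transform exists; real eye and nose landmarks would only be fit approximately.

```python
import numpy as np

def similarity_transform(src, dst):
    """Least-squares similarity (rotation + uniform scale + translation)
    mapping src points onto dst points. Each point is (x, y).

    Parameterized as (u, v) = (a*x - c*y + tx, c*x + a*y + ty),
    which is linear in (a, c, tx, ty)."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, -y, 1, 0]); b.append(u)
        A.append([y,  x, 0, 1]); b.append(v)
    (a, c, tx, ty), *_ = np.linalg.lstsq(np.array(A, float),
                                         np.array(b, float), rcond=None)
    return np.array([[a, -c, tx], [c, a, ty]])

def apply(M, pts):
    pts = np.asarray(pts, float)
    return pts @ M[:, :2].T + M[:, 2]

# Detected landmarks in a tilted face vs. canonical positions:
# two eye corners and a nose tip (hypothetical coordinates).
detected  = [(20.0, 70.0), (20.0, 150.0), (-40.0, 110.0)]
canonical = [(30.0, 40.0), (70.0, 40.0), (50.0, 70.0)]
M = similarity_transform(detected, canonical)
warped = apply(M, detected)
```

Once every face is warped so the eye corners land at the same pixels, eyelid dynamics can be studied without head-motion confounds.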
00:09:56.400 | And then you just,
00:09:59.500 | this is the beauty of deep learning, right?
00:10:01.300 | You don't, there is some pre-processing
00:10:03.100 | because this is
00:10:05.500 | real-world data,
00:10:06.600 | but you just dump the raw pixels in.
00:10:08.800 | You dump the raw pixels in and predict whatever you need.
00:10:12.400 | What do you need?
00:10:13.500 | One is emotion.
00:10:14.900 | You can have,
00:10:15.700 | so we had a study where people
00:10:19.300 | used a crappy and a good
00:10:21.700 | voice-based navigation system.
00:10:24.000 | So the crappy one got them really frustrated
00:10:26.400 | and they self-reported as a frustrating experience or not
00:10:29.500 | on a scale of 1 to 10.
00:10:30.500 | So that gives us ground truth.
00:10:31.900 | We had a bunch of people use the system
00:10:34.800 | and, you know, they rated themselves as frustrated or not.
00:10:38.100 | And so then we can predict,
00:10:39.600 | we can train a convolutional neural network to predict
00:10:42.200 | is this person frustrated or not.
00:10:43.700 | I think we've seen a video of that.
00:10:45.400 | Turns out smiling is a strong indication of frustration.
00:10:49.400 | You can also predict drowsiness in this way,
00:10:51.600 | gaze estimation in this way,
00:10:54.400 | cognitive load.
00:10:55.600 | I'll briefly look at that.
00:10:57.400 | And the process is all the same.
00:10:59.200 | You detect the face,
00:11:00.700 | you find the landmark points in the face
00:11:02.800 | for the face alignment, face frontalization.
00:11:04.800 | And then you dump the raw pixels in
00:11:08.000 | for classification, step 5.
00:11:09.600 | You can use SVMs there
00:11:11.500 | or you can use what everyone uses now,
00:11:13.700 | convolutional neural networks.
00:11:15.000 | This is the one part where CNNs have still struggled to compete:
00:11:21.400 | the alignment problem.
00:11:23.700 | This is where I talked about the cascade of regressors,
00:11:27.800 | finding the landmarks
00:11:31.300 | on the eyebrows, the nose, the jawline, the mouth.
00:11:39.300 | There's certain constraints there
00:11:42.100 | and so algorithms that can utilize those constraints effectively
00:11:47.100 | can often perform better than end-to-end regressors
00:11:50.900 | that just don't have any concept of what a face is shaped like.
00:11:53.800 | And there are huge datasets, and we're a part
00:11:58.100 | of the awesome community that's building those datasets
00:12:01.900 | for face alignment.
00:12:03.400 | Okay, so this is, again, the TA in his younger form.
00:12:06.900 | This is live in the car, real-time system,
00:12:13.700 | predicting where they're looking.
00:12:15.500 | This is taking slow steps towards the
00:12:20.700 | exciting direction that machine learning is headed,
00:12:25.600 | which is unsupervised learning.
00:12:27.900 | The less you have to have humans look through the data
00:12:31.500 | and annotate that data,
00:12:32.900 | the more power these machine learning algorithms get.
00:12:36.700 | Right, so,
00:12:38.200 | currently supervised learning is what's needed.
00:12:41.700 | You need human beings to label a cat and label a dog.
00:12:44.900 | But if you can only have a human being label 1%,
00:12:49.800 | 1/10 of a percent of a dataset,
00:12:52.200 | only the hard cases,
00:12:54.300 | so the machine can come to the human and be like,
00:12:56.700 | "I don't know what I'm looking at in these pictures."
00:12:59.700 | Because of the partial light occlusions,
00:13:02.000 | we're not good at dealing with occlusions,
00:13:04.700 | whether it's your own arm or because of light conditions.
00:13:07.700 | We're not good with crazy light,
00:13:11.100 | drowning out the image.
00:13:12.800 | This is what Google's self-driving cars actually struggle with
00:13:15.300 | when they're trying to use their vision sensors.
00:13:17.000 | Moving out of frame,
00:13:18.600 | so just all kinds of occlusions are really hard
00:13:22.100 | for computer vision algorithms.
00:13:26.200 | And in those cases,
00:13:27.200 | we want the machine to step in
00:13:29.600 | and pass that image on to the human,
00:13:31.800 | be like, "Help me out with this."
00:13:33.000 | And the other corner cases is,
00:13:36.400 | so in driving for example,
00:13:38.200 | 90+% of the time,
00:13:40.200 | all you're doing is staring forward at the roadway in the same way.
00:13:42.700 | That's where the machine shines.
00:13:44.400 | That's where machine annotation,
00:13:46.100 | automated annotation shines.
00:13:48.800 | Because it's seen that face
00:13:51.000 | for hundreds of millions of frames already
00:13:53.800 | in that exact position.
00:13:55.600 | So it can do all the hard work of annotation for you.
00:13:57.800 | It's in the transitions away from those positions
00:14:00.800 | that it needs a little bit of help,
00:14:02.000 | just to make sure
00:14:03.400 | that this person just started looking away
00:14:06.300 | from the road to the rear-view mirror,
00:14:07.900 | and you bring those points up.
00:14:10.000 | So,
00:14:10.800 | using optical flow,
00:14:13.700 | putting the optical flow
00:14:15.200 | into the convolutional neural network,
00:14:17.100 | you use that to predict when something has changed.
00:14:21.500 | And when something has changed,
00:14:22.800 | you bring that to the human for annotation.
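A toy version of that routing logic, with mean absolute frame difference standing in for optical-flow magnitude (the helper name and threshold are hypothetical): steady frames are left to machine annotation, and frames where something changed get flagged for a human.

```python
import numpy as np

def frames_needing_human(frames, threshold=0.1):
    """Flag frames where the scene changed, using mean absolute frame
    difference as a cheap stand-in for optical-flow magnitude.

    Steady frames (driver staring at the road) get machine-annotated;
    transitions get routed to a human annotator.
    """
    flagged = []
    for i in range(1, len(frames)):
        motion = np.mean(np.abs(frames[i] - frames[i - 1]))
        if motion > threshold:
            flagged.append(i)
    return flagged

# Nine "staring forward" frames with one glance change at frame 5.
frames = [np.zeros((8, 8)) for _ in range(9)]
frames[5] = np.ones((8, 8)) * 0.5
flagged = frames_needing_human(frames, threshold=0.1)
```

Only the transition into and out of the changed frame gets flagged, which is the mechanism behind the large reduction in human annotation the lecture describes.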
00:14:25.300 | All of this is to build a giant,
00:14:27.500 | billions of frames,
00:14:29.900 | annotated dataset
00:14:31.500 | of ground truth
00:14:32.800 | with which to train your
00:14:34.800 | driver state algorithms.
00:14:36.600 | And in this way, you can control.
00:14:39.300 | On the X-axis is the fraction of frames
00:14:41.500 | that a human has to annotate.
00:14:42.700 | 0% on the left,
00:14:44.700 | 10% on the right.
00:14:46.700 | And then the accuracy trade-off.
00:14:49.500 | The more the human annotates,
00:14:50.900 | the higher the accuracy.
00:14:51.900 | You approach 100% accuracy.
00:14:54.200 | But you can still do pretty good.
00:14:55.400 | This is for the gaze classification task.
00:14:57.600 | With an
00:15:03.900 | 84-fold,
00:15:05.900 | so almost two orders of magnitude reduction
00:15:08.100 | in human annotation.
00:15:09.300 | This is the future of machine learning.
00:15:11.200 | And hopefully one day,
00:15:13.300 | no human annotation.
00:15:15.100 | And the result
00:15:21.100 | is millions of images like this.
00:15:24.300 | Video frames.
00:15:25.100 | Same thing,
00:15:27.300 | driver frustration.
00:15:28.600 | This is what I was talking about.
00:15:30.100 | The frustrated driver is the one
00:15:31.600 | that's on the bottom.
00:15:32.800 | So a lot of movement of the eyebrows
00:15:36.600 | and a lot of smiling.
00:15:37.600 | And that's true subject after subject.
00:15:39.800 | And the happy,
00:15:42.600 | well, I won't say happy,
00:15:43.500 | the satisfied driver
00:15:45.300 | is cold and stoic.
00:15:47.400 | And that's true for subject after subject.
00:15:50.000 | Because driving is a boring experience
00:15:51.800 | and you want it to stay that way.
00:15:53.300 | Yes, question.
00:15:54.000 | Great, great, great question.
00:16:00.300 | They're not.
00:16:01.000 | Absolutely, that's a great question.
00:16:04.900 | There is a,
00:16:06.100 | so this is cars owned by MIT.
00:16:08.100 | There is somebody in the back.
00:16:17.500 | The comment was
00:16:19.200 | my emotions might then have nothing to do
00:16:21.900 | with the driving experience.
00:16:23.900 | Yes, let me continue that comment:
00:16:25.700 | your emotions are often,
00:16:27.700 | you're an actor on the stage for others
00:16:31.800 | with your emotion.
00:16:32.700 | So when you're alone,
00:16:33.600 | you might not express emotion.
00:16:35.200 | You're really expressing emotion oftentimes
00:16:37.400 | for others.
00:16:38.200 | Like you're frustrated.
00:16:39.700 | So it's like, oh, what the heck.
00:16:41.200 | That's for the passenger
00:16:43.200 | and that's absolutely right.
00:16:44.600 | So one of the cool things
00:16:46.900 | we're doing,
00:16:49.600 | as I said,
00:16:50.500 | we now have over a billion video frames
00:16:52.300 | in the Tesla.
00:16:53.300 | We're collecting huge amounts of data in the Tesla
00:16:57.200 | and, you know,
00:16:59.100 | emotion is a complex thing, right?
00:17:00.400 | In this case,
00:17:01.800 | we know the ground truth,
00:17:03.000 | how frustrated they were.
00:17:03.000 | In naturalistic data,
00:17:04.900 | when it's just people driving around,
00:17:06.600 | we don't know
00:17:07.500 | how they're really feeling at the moment.
00:17:09.400 | We're not asking them to, like, enter in an app,
00:17:11.600 | "How are you feeling right now?"
00:17:12.700 | But we do know certain things,
00:17:15.900 | like we know that people sing a lot.
00:17:17.900 | That has to be a paper at some point.
00:17:22.500 | It's awesome.
00:17:23.200 | People love singing.
00:17:24.300 | So that doesn't happen in this kind of data
00:17:27.900 | because there's somebody sitting in the car
00:17:29.300 | and I think the expression of frustration
00:17:31.400 | is also the same.
00:17:32.200 | Yes. So,
00:17:43.500 | yeah, the comment is that
00:17:49.500 | the solo data set
00:17:50.600 | is probably gonna be very different
00:17:52.200 | from a data set that's non-solo
00:17:54.200 | with a passenger.
00:17:55.100 | And it's very true.
00:17:56.200 | The tricky thing about driving,
00:17:57.800 | and this is why it's a huge challenge
00:17:59.300 | for self-driving cars,
00:18:00.500 | for the external facing sensors
00:18:02.200 | and for the internal facing sensors
00:18:03.800 | analyzing human behavior,
00:18:05.100 | is like 99.9% of driving
00:18:08.500 | is the same thing.
00:18:09.500 | It's really boring.
00:18:10.900 | So finding the interesting bits
00:18:12.600 | is actually pretty complicated.
00:18:14.000 | So that has to do with emotion.
00:18:16.400 | That has to do with,
00:18:18.100 | so singing is easy to find.
00:18:20.100 | So we can track the mouth pretty well.
00:18:22.100 | So whenever you're talking or singing,
00:18:23.400 | we can find that.
00:18:24.200 | But how do you find
00:18:25.300 | the subtle expressions of emotion?
00:18:26.800 | It's hard
00:18:27.700 | when you're solo.
00:18:30.500 | And cognitive load,
00:18:33.500 | that's
00:18:35.400 | that's a fascinating thing.
00:18:38.500 | I mean, similar to emotion,
00:18:39.500 | it's a little more concrete
00:18:42.000 | in the sense that there's a lot of good science
00:18:44.800 | and ways to measure cognitive load,
00:18:46.600 | cognitive workload,
00:18:48.500 | how occupied your mind is,
00:18:51.100 | mental workload is another term used.
00:18:52.900 | And so the window to the soul,
00:18:55.700 | the cognitive workload soul
00:18:58.400 | is the eyes.
00:18:59.400 | So pupil,
00:19:01.200 | so first of all,
00:19:02.600 | the eyes move in two different ways.
00:19:04.300 | They move a lot of ways,
00:19:05.400 | but two major ways.
00:19:06.300 | There are saccades,
00:19:07.500 | these are these ballistic movements,
00:19:08.900 | they jump around.
00:19:09.700 | Whenever you look around the room,
00:19:11.800 | they're actually just jumping around.
00:19:13.800 | When you read,
00:19:14.400 | their eyes are jumping around.
00:19:15.700 | And when
00:19:17.000 | you follow,
00:19:18.900 | if you just follow this bottle with your eyes,
00:19:21.300 | your eyes are actually gonna move smoothly,
00:19:23.400 | smooth pursuit.
00:19:24.900 | Somebody actually just told me today
00:19:26.500 | that probably has to do with our hunting background
00:19:28.700 | as animals.
00:19:30.100 | I don't know how that helps,
00:19:34.100 | like frogs track flies
00:19:36.100 | really well.
00:19:37.100 | So that you have to like,
00:19:38.700 | I don't know.
00:19:39.200 | Anyway, the point is
00:19:40.900 | there's smooth pursuit movements
00:19:42.800 | where the eyes move smoothly.
00:19:44.600 | And those are all indications
00:19:46.600 | of certain aspects of cognitive load.
00:19:49.000 | And then there is very subtle movements
00:19:51.300 | which are almost imperceptible for computer vision.
00:19:53.600 | And these are
00:19:54.700 | microsaccades,
00:19:56.600 | these are tremors of the eye.
00:19:57.900 | Here's
00:20:00.200 | work from here, from Bill Freeman,
00:20:02.100 | magnifying those subtle movements.
00:20:04.200 | These are taken at
00:20:05.600 | 500 frames a second.
00:20:07.500 | And so cognitive load,
00:20:14.200 | when the pupil,
00:20:17.200 | that black dot in the middle,
00:20:18.800 | just in case you don't know what a pupil is,
00:20:20.600 | in the middle of the eye.
00:20:21.600 | When it gets larger,
00:20:23.100 | that's an indicator of high cognitive load.
00:20:25.500 | But it also gets
00:20:26.900 | larger when the light is dim.
00:20:28.700 | So there's like this complex interplay.
00:20:31.200 | So we can't rely in the wild,
00:20:32.800 | outside,
00:20:33.500 | in the car,
00:20:34.500 | or just in general outdoors
00:20:36.400 | on using the pupil size.
00:20:38.500 | Even though pupil size has been used effectively
00:20:40.800 | in a lab to measure cognitive load,
00:20:42.500 | it can't be reliably used in the car.
00:20:44.700 | And the same with blinks.
00:20:47.900 | When there's higher cognitive load,
00:20:49.600 | your blink rate decreases
00:20:51.200 | and your blink duration shortens.
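Blink rate and blink duration can be read off an eyelid-aperture signal with a few lines. This is a sketch under assumed conventions (1.0 = fully open, a closed threshold of 0.3, a fixed frame rate), not the lecture's actual feature extractor.

```python
import numpy as np

def blink_stats(aperture, fps=30, closed_below=0.3):
    """Extract blink rate and mean blink duration from an eyelid-aperture
    signal (1.0 = fully open, 0.0 = closed), sampled at `fps`."""
    closed = np.asarray(aperture) < closed_below
    blinks, length = [], 0
    for c in closed:
        if c:
            length += 1            # still inside a blink
        elif length:
            blinks.append(length)  # blink just ended
            length = 0
    if length:
        blinks.append(length)
    seconds = len(aperture) / fps
    rate_per_min = 60.0 * len(blinks) / seconds
    mean_duration = (np.mean(blinks) / fps) if blinks else 0.0
    return rate_per_min, mean_duration

# Two blinks (3 and 2 closed frames) in a 2-second window at 30 fps.
signal = [1.0] * 20 + [0.1] * 3 + [1.0] * 20 + [0.1] * 2 + [1.0] * 15
rate, duration = blink_stats(signal, fps=30)
```

Features like these (rate down, duration shorter under load) are exactly what a downstream cognitive-load classifier can consume, whether hand-fed or learned end to end.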
00:20:52.800 | Okay.
00:20:54.500 | I think I'm just repeating the same thing over and over.
00:20:58.000 | But you can imagine
00:20:59.800 | how we can predict cognitive load, right?
00:21:01.700 | We extract
00:21:04.200 | video of the eye.
00:21:05.800 | Here is
00:21:08.400 | the primary eye
00:21:10.500 | of the person the system is observing.
00:21:13.100 | Happens to be the same TA once again.
00:21:17.400 | We take the sequence of a hundred,
00:21:19.400 | oh, it's 90 images.
00:21:20.900 | So that's six seconds,
00:21:21.900 | 16 frames a second,
00:21:23.400 | 15 frames a second.
00:21:24.700 | And we dump that into a 3D convolutional neural network.
00:21:27.900 | Again,
00:21:29.000 | that means it's 90 channels
00:21:33.700 | it's not 90 frames,
00:21:35.400 | grayscale.
00:21:36.200 | And then the prediction is
00:21:38.100 | one of three
00:21:39.100 | classes of cognitive load
00:21:41.400 | is the same.
00:21:42.900 | So we can predict
00:21:43.900 | the same thing.
00:21:44.900 | Classes of cognitive load.
00:21:46.900 | Low cognitive load,
00:21:48.500 | medium cognitive load,
00:21:49.900 | and high cognitive load.
00:21:50.900 | And there's ground truth for that
00:21:52.400 | because we had
00:21:53.700 | over 500 different people
00:21:55.600 | do different tasks
00:21:56.700 | of varying cognitive load.
00:21:58.300 | And after some frontalization again,
00:22:01.500 | where you see the eyes are,
00:22:02.800 | no matter where the person is looking,
00:22:05.300 | the image of the face is transposed in such a way that
00:22:09.300 | the eyes, the corners of the eyes
00:22:11.000 | remain always in the same position.
00:22:13.800 | After the frontalization,
00:22:15.200 | we find the eye,
00:22:18.200 | active appearance models,
00:22:20.100 | find 39 points
00:22:22.000 | of the eye, of the eyelids,
00:22:25.600 | the iris,
00:22:27.400 | and four points on the pupil.
00:22:29.400 | Putting all of that
00:22:34.200 | into a 3D CNN model:
00:22:37.400 | eye image sequence on the left,
00:22:39.000 | 3D CNN model in the middle,
00:22:40.900 | cognitive load prediction on the right.
00:22:43.600 | This code, by the way, is
00:22:44.700 | freely available online.
00:22:47.600 | All you have to do is
00:22:51.000 | dump in a webcam
00:22:52.500 | video stream,
00:22:55.700 | and the CNN runs faster than real time,
00:22:58.300 | predicts cognitive load.
00:22:59.600 | Same process as detecting the identity of the face,
00:23:03.800 | same process as detecting where the driver is looking,
00:23:06.600 | same process as detecting emotion.
00:23:08.600 | And all of those require very little hyperparameter tuning
00:23:12.400 | on the convolutional neural networks.
00:23:13.700 | They only require
00:23:16.400 | huge amounts of data.
00:23:18.400 | And why do we care
00:23:21.300 | about detecting what the driver is doing?
00:23:24.100 | And I think Eric has mentioned
00:23:26.200 | this.
00:23:31.100 | Oh man, this is the comeback of the slide.
00:23:34.000 | I was criticized for this being a very cheesy slide.
00:23:44.000 | In the
00:23:46.000 | path towards
00:23:48.000 | full automation,
00:23:49.700 | we're likely to take gradual steps towards that.
00:23:55.500 | It's enough of that. This is better.
00:24:01.700 | Especially
00:24:07.600 | given, today,
00:24:09.300 | our new president,
00:24:11.800 | this is pickup truck country.
00:24:13.700 | This is manually controlled vehicle country
00:24:20.500 | for quite a while.
00:24:22.900 | And control
00:24:25.400 | being given to somebody else, to the machine,
00:24:29.700 | will be a gradual process.
00:24:31.200 | It's a gradual process of that machine earning trust.
00:24:34.300 | And through that process,
00:24:36.500 | the machine,
00:24:37.700 | like the Tesla,
00:24:39.000 | like the BMW,
00:24:40.900 | like the Mercedes, the Volvo,
00:24:42.700 | that's now playing with these ideas,
00:24:45.100 | is going to need to see what the human is doing.
00:24:48.400 | And for that,
00:24:54.400 | to see what the human is doing,
00:24:57.800 | we have
00:24:59.500 | billions of miles of forward-facing data.
00:25:03.600 | What we need
00:25:05.000 | is billions of miles of driver-facing data as well.
00:25:09.800 | We're in the process of collecting that.
00:25:11.500 | And this is a pitch
00:25:13.800 | for automakers
00:25:16.200 | and everybody to
00:25:18.100 | buy cars that have a driver-facing camera.
00:25:21.400 | And let me
00:25:25.600 | sort of
00:25:26.600 | close.
00:25:28.400 | So I said we need a lot of data.
00:25:30.200 | But I think, through this class
00:25:34.400 | and through your own
00:25:37.800 | research, you'll find that we're in the very early stages of
00:25:42.500 | discovering the power of deep learning.
00:25:47.100 | For example,
00:25:49.900 | you know, recently,
00:25:51.600 | as Yann LeCun said,
00:25:54.000 | that it seems
00:25:57.900 | that the deeper the network,
00:26:00.700 | the better the results
00:26:02.300 | in a lot of really
00:26:04.100 | important cases.
00:26:07.400 | Even though the data is not increasing.
00:26:09.600 | So why does a deeper network give better results?
00:26:13.700 | This is a mysterious thing we don't understand.
00:26:16.900 | There's these hundreds of millions of parameters
00:26:20.000 | and from them is emerging some kind of
00:26:23.800 | structure, some kind of representation of
00:26:26.800 | the knowledge that we're giving it.
00:26:28.400 | One of my favorite examples of this emergent concept
00:26:32.800 | is Conway's Game of Life.
00:28:36.100 | Those of you who know what this is
00:28:38.700 | will probably criticize me for using such a cheesy example.
00:26:43.300 | But I think it's actually such a simple
00:26:46.700 | and brilliant example
00:26:48.900 | of how
00:26:50.400 | like a neuron in a neural network is a really simple computational unit.
00:26:54.800 | And then incredible power emerges when you just combine a lot of them in a network.
00:26:59.400 | In the same way,
00:27:04.400 | this is called a cellular automaton.
00:27:06.300 | That's a weird pronunciation.
00:27:09.000 | And every single cell is operating under a simple rule.
00:27:15.100 | You can think of it as a cell living and dying.
00:27:18.100 | It's filled in black when it's alive
00:27:22.000 | and white when it's dead.
00:27:23.500 | If a cell is alive
00:27:26.800 | and has two or three live neighbors,
00:27:29.000 | it survives to the next time step.
00:27:31.400 | Otherwise it dies.
00:27:35.000 | And if a cell is dead
00:27:39.300 | and has exactly three live neighbors,
00:27:42.000 | it comes back to life.
00:27:43.500 | That's a simple rule.
00:27:44.600 | Whatever, you can just imagine.
00:27:46.800 | It's just simple.
00:27:47.400 | All it's doing is operating under this very local process.
00:27:51.800 | Same as a neuron.
00:27:53.300 | Or in the way we're currently training neural networks
00:27:58.700 | in this local gradient.
00:28:01.400 | We're optimizing over a local gradient.
00:28:03.900 | Same, local rules.
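The rules just described can be sketched in a few lines of code. This is a minimal implementation of one Game of Life step; the grid layout and helper names are my own for illustration, not from the lecture, and cells outside the grid are treated as dead.

```python
# A minimal sketch of one Game of Life step.
# 1 = alive, 0 = dead; cells outside the grid are treated as dead.

def step(grid):
    rows, cols = len(grid), len(grid[0])

    def live_neighbors(r, c):
        # Count live cells among the eight surrounding cells.
        return sum(
            grid[rr][cc]
            for rr in range(r - 1, r + 2)
            for cc in range(c - 1, c + 2)
            if (rr, cc) != (r, c) and 0 <= rr < rows and 0 <= cc < cols
        )

    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            n = live_neighbors(r, c)
            if grid[r][c] == 1:
                # A live cell with two or three live neighbors survives.
                new[r][c] = 1 if n in (2, 3) else 0
            else:
                # A dead cell with exactly three live neighbors comes back to life.
                new[r][c] = 1 if n == 3 else 0
    return new

# A "blinker": a vertical line of three live cells flips to horizontal
# and back, oscillating forever under these purely local rules.
blinker = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
```

Each cell only ever looks at its eight neighbors, yet running `step` repeatedly produces the oscillating and gliding patterns discussed next.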
00:28:05.600 | And what happens if you run this system,
00:28:09.300 | operating under these really local rules,
00:28:12.600 | is what you see on the right.
00:28:13.900 | Again, when you go home tonight,
00:28:17.800 | hopefully with no drugs involved.
00:28:20.600 | But you have to open up your mind
00:28:22.900 | and see how amazing that is.
00:28:27.500 | Because what happens is
00:28:29.100 | it's a local computational unit
00:28:32.800 | that knows very little about the world.
00:28:35.000 | But somehow really complex patterns emerge.
00:28:38.800 | And we don't understand why.
00:28:41.000 | In fact, under different rules,
00:28:43.900 | incredible patterns emerge.
00:28:45.600 | And it feels like living creatures
00:28:47.700 | communicating
00:28:49.100 | when you just watch it.
00:28:50.900 | Not these examples, this is the original rule set.
00:28:52.700 | Under other rules,
00:28:54.900 | they get really complex and interesting.
00:28:57.700 | But even in these examples,
00:28:58.900 | these complex geometric patterns that emerge,
00:29:00.800 | it's incredible.
00:29:01.700 | We don't understand why.
00:29:03.100 | Same with neural networks.
00:29:04.000 | We don't understand why.
00:29:05.000 | And we need to.
00:29:06.000 | In order to see how these networks
00:29:07.500 | will be able to reason.
00:29:08.600 | Okay, so what's next?
00:29:11.200 | I encourage you to read the deep learning book.
00:29:16.700 | It's available online, deeplearningbook.org.
00:29:20.700 | As I mentioned to a few people,
00:29:23.200 | you should, well, first,
00:29:25.000 | there's a ton of amazing papers
00:29:26.400 | every day coming out on archive.
00:29:28.000 | I'll put these links up,
00:29:31.100 | but there's a lot of good collections
00:29:33.400 | of strong papers, lists of papers.
00:29:35.600 | There is the literally awesome list,
00:29:38.200 | the awesome deep learning papers on GitHub.
00:29:41.000 | It's calling itself awesome,
00:29:43.300 | but it happens to be awesome.
00:29:44.800 | And there are a lot of blogs
00:29:47.900 | that are just amazing.
00:29:49.300 | That's how I recommend you learn machine learning,
00:29:53.400 | is on blogs.
00:29:54.400 | And if you're interested
00:29:57.600 | in the application of deep learning
00:29:59.500 | in the automotive space,
00:30:01.100 | you can come do research in our group.
00:30:03.700 | Just email me.
00:30:05.300 | Anyway, we have three winners.
00:30:09.300 | Jeffrey Hu, Michael Gump.
00:30:15.400 | And how do you, are you here?
00:30:19.400 | Hey, how do you say your name?
00:30:22.800 | No, this is not my name.
00:30:25.800 | All right.
00:30:29.300 | This is, so my name is Forna.
00:30:31.500 | Forna.
00:30:32.500 | He stands for Forna.
00:30:33.700 | And Doli is,
00:30:35.000 | Oh, I see.
00:30:36.900 | I'm sorry, I'm not going to get that right.
00:30:39.900 | Well, anyway, here.
00:30:55.700 | So he achieved a stunning speed,
00:31:01.300 | and this is kind of incredible.
00:31:04.100 | So I didn't know what kind of speed
00:31:05.200 | we're going to be able to achieve.
00:31:06.400 | I thought 73 was unbeatable,
00:31:08.200 | because we played with it for a while.
00:31:09.800 | We couldn't achieve 73.
00:31:11.100 | We designed a deterministic algorithm,
00:31:13.200 | that was able to achieve 74, I believe.
00:31:15.200 | Meaning it's cheating:
00:31:17.900 | a cheating algorithm that got 74.
00:31:20.200 | And so folks have come up
00:31:24.100 | with algorithms that have done,
00:31:26.000 | that have beaten 73 and then 74.
00:31:28.600 | So this is really incredible.
00:31:29.700 | And the other two guys,
00:31:32.300 | so all three of you get a free term
00:31:34.800 | of the Udacity Self-Driving Car Engineer Nanodegree.
00:31:37.800 | Thanks to those guys for giving that award
00:31:40.300 | and bringing their army of brilliant people,
00:31:42.100 | people who are obsessed
00:31:44.500 | with self-driving cars.
00:31:48.200 | And we've received over 2,000 submissions
00:31:53.400 | for this competition.
00:31:54.500 | A lot of them from those guys.
00:31:56.800 | And they're just brilliant.
00:31:58.100 | So it's really exciting
00:32:01.900 | to have such a big community of deep learning folks
00:32:04.100 | working in this field.
00:32:05.800 | So this is, for the rest of eternity,
00:32:09.100 | we're going to change this up a little bit,
00:32:11.600 | but this is actually the three neural networks,
00:32:15.400 | the three winning neural networks
00:32:18.500 | running side by side.
00:32:19.800 | And you can see the number of cars passed there.
00:32:22.800 | The first place is on the left,
00:32:24.200 | second place and third place.
00:32:26.600 | And in fact, the third place is almost,
00:32:29.600 | wait, no, second place is winning currently.
00:32:32.300 | But that just tells you that
00:32:35.200 | the random nature of competition,
00:32:40.400 | sometimes you win, sometimes you lose.
00:32:46.400 | So the actual evaluation process
00:32:50.000 | runs through a lot of iterations
00:32:53.900 | and takes the median evaluation.
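Taking the median over many noisy runs can be sketched like this. Here `evaluate_speed` is a hypothetical stand-in for one simulation run of a submitted network, and the numbers are illustrative, not the competition's actual scoring:

```python
import random
from statistics import median

def evaluate_speed(seed):
    # Hypothetical stand-in for one noisy evaluation run: the measured
    # speed varies from run to run around the network's true performance.
    rng = random.Random(seed)
    return 70.0 + rng.gauss(0, 1.5)

def robust_score(num_runs=500):
    # Run many independent evaluations and take the median, so a single
    # lucky or unlucky run does not decide the leaderboard.
    return median(evaluate_speed(seed) for seed in range(num_runs))
```

The median is preferred over the mean here because it is far less sensitive to a few outlier runs.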
00:32:53.900 | With that, let me thank you guys so much for,
00:32:59.400 | well, wait, wait, wait, you have a question?
00:33:01.100 | Are the winning networks available at all?
00:33:03.600 | Yeah, so,
00:33:06.200 | all three guys wrote me a note
00:33:10.800 | about how their networks work.
00:33:13.100 | I did not read that note.
00:33:18.300 | This tells you how crazy this has been.
00:33:20.500 | I'll post their winning networks
00:33:24.100 | online.
00:33:27.400 | And I encourage you to continue competing
00:33:30.300 | and continue submitting networks.
00:33:32.400 | This will run for a while and we're working on a journal paper
00:33:35.900 | for this game.
00:33:40.100 | We're trying to find the optimal solutions.
00:33:43.100 | Okay, so this is the first time
00:33:45.900 | I've ever taught a class.
00:33:48.200 | And the first time obviously teaching this class.
00:33:51.500 | And so thank you so much for being a part of it.
00:33:54.500 | Thank you, Eric.
00:33:59.900 | If you didn't get a shirt, please come back,
00:34:06.800 | please come down and get a shirt.
00:34:08.500 | Just write your email
00:34:11.500 | on the index card.
00:34:13.100 | Thank you.
00:34:16.900 | Thank you.
00:34:17.700 | Thank you.
00:34:18.200 | [audience chatter]