MIT 6.S094: Deep Learning for Human-Centered Semi-Autonomous Vehicles
Chapters
0:00 Introduction
2:28 Driver State Detection: A Computer Vision Perspective
4:10 Crash Test Dummy Design: Hybrid III
5:08 Sequential Detection Approach
6:19 Temporal Convolutional Neural Networks
7:04 Gaze Classification vs Gaze Estimation
10:56 Gaze Classification Pipeline
11:16 Face Alignment
12:18 A General Framework for Semi-Automated Object State Annotation
15:15 Semi-Automated Annotation Workflow
15:25 Driver Frustration
22:14 Preprocessing Pipeline
22:45 Driver Cognitive Load Estimation
24:25 Human at the Center of Automation: The Way to Full Autonomy Includes the Human
25:23 Emergence and Fundamental Breakthroughs in Deep Learning
So how do we turn this camera back in on the human? We've talked about how to detect cats and dogs, pedestrians, and lanes, and how to steer a vehicle based on the external environment.
The thing that's really fascinating and severely understudied is the interaction between the human and the machine. We have cameras in 17 Teslas driving around Cambridge, because Tesla is one of the only vehicles that lets you study that interaction between the human and the machine. Part of the path to semi-autonomous vehicles and fully autonomous vehicles is looking at billions of video frames of human beings.
What are the things that we want to know about the human? We try to break that apart into the different computer vision detection problems. Green means it's feasible even under poor lighting conditions, variable pose, noisy environments, and poor resolution. Red means it's really hard no matter what you do. That's starting on the left with face detection and body pose, which are among the easier computer vision problems, and going all the way to the slight tremors of the eye that happen in a fraction of a second.
Well, first, why do we even care about the human in the car? The car knows almost nothing about the biological thing it's carrying inside, even though you're sitting there controlling it. It has no sensors with which it's perceiving you. Some cars have a pressure sensor on the steering wheel, but that's about it. This same car that's driving 70 miles an hour knows almost nothing about its driver. The argument is that we should have a driver-facing camera in every car.
And you don't have as much of a privacy concern there. Crash test dummies like the Hybrid III make certain assumptions about body shapes and seating positions. But if you're leaning over, digging in your purse, your bag, for your cell phone, those assumptions break down. The car needs to know that you're in that position.
Whenever you have these kinds of detection tasks, the approach is similar. You have a CNN, a convolutional neural network, that takes this input image and produces as output a set of keypoints that give you the shoulders, the arms, and so on. Through time, on every single frame, you make that prediction. And because the frames are sequential, you can make certain assumptions about physics: your arm can't be in this place in one frame and somewhere completely different in the next.
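Here's a minimal sketch of that idea in PyTorch, where the network, the joint count, and the input size are all illustrative assumptions rather than the lecture's actual architecture:

```python
import torch
import torch.nn as nn

# Hypothetical joint count: shoulders, elbows, wrists, and so on.
NUM_KEYPOINTS = 14

class PoseCNN(nn.Module):
    """Tiny CNN that regresses (x, y) coordinates for each joint."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, NUM_KEYPOINTS * 2),  # (x, y) per joint
        )

    def forward(self, x):
        return self.head(self.features(x)).view(-1, NUM_KEYPOINTS, 2)

frame = torch.randn(1, 3, 224, 224)  # one video frame
keypoints = PoseCNN()(frame)         # shape: (1, 14, 2)
```

Running this per frame and then filtering the keypoint tracks over time is one way to encode the physics assumption mentioned above.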
Instead of stacking color channels, you could think of those channels as stacked in time, in what are called 3D convolutional neural networks.
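As a rough illustration (the channel and clip sizes here are made up, not the lecture's), a 3D convolution slides its kernel over time as well as space, so motion itself becomes a learnable feature:

```python
import torch
import torch.nn as nn

# The kernel spans 3 frames x 3 pixels x 3 pixels.
conv3d = nn.Conv3d(in_channels=3, out_channels=16,
                   kernel_size=(3, 3, 3), padding=1)

clip = torch.randn(1, 3, 8, 112, 112)  # (batch, channels, frames, H, W)
out = conv3d(clip)                     # shape: (1, 16, 8, 112, 112)
```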
That means the car is currently driving itself. With gaze estimation, you're predicting exactly where the person is looking, so you have to estimate the frame of the camera relative to the head; with gaze classification, you only predict which region they're looking at. The more you can estimate the identity of the person you're looking at, and the identity of the car the person is riding in, the better the performance for the different driver state classification tasks.
You have a background model that works on everyone, and then you specialize each individual network to that one individual.
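One common way to realize that (a hedged sketch, not necessarily how it was done here) is transfer learning: train a generic model on everyone, then fine-tune just the final layer on a small amount of one driver's data. The region count and architecture below are illustrative:

```python
import torch.nn as nn
import torchvision.models as models

NUM_GAZE_REGIONS = 6  # e.g., road, rearview, left, right, dash, phone

# Background model trained on data from all drivers.
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_GAZE_REGIONS)
# ... train on the full multi-driver dataset here ...

# Specialization: freeze the shared features, re-learn only the
# classifier head on frames from the one individual.
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, NUM_GAZE_REGIONS)
# ... fine-tune on that one driver's data here ...
```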
You warp each face so that the eyes and nose are in the exact same position in the image. When you want to study the subtle movement of the eyes, you remove all effects of any other motion of the head.
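A minimal sketch of that alignment step with OpenCV, assuming eye and nose landmarks have already been detected (the canonical positions are arbitrary choices):

```python
import cv2
import numpy as np

# Fixed target positions: left eye, right eye, nose tip.
CANONICAL = np.float32([[60, 80], [140, 80], [100, 130]])

def align_face(frame, left_eye, right_eye, nose, size=(200, 200)):
    """Warp the frame so the three landmarks land at CANONICAL."""
    src = np.float32([left_eye, right_eye, nose])
    M = cv2.getAffineTransform(src, CANONICAL)
    return cv2.warpAffine(frame, M, size)
```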
Once it's aligned, you dump the raw pixels in and predict whatever you need.
Drivers self-reported whether it was a frustrating experience or not; you know, they rated themselves as frustrated or not. From that video, we can train a convolutional neural network to predict the self-reported label. It turns out smiling is a strong indication of frustration.
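Since the label is binary (frustrated or not), the natural framing is binary classification; here's a toy sketch, with an invented architecture, of what training against that self-reported label could look like:

```python
import torch
import torch.nn as nn

# Tiny CNN mapping an aligned face image to a single logit.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 1),
)
loss_fn = nn.BCEWithLogitsLoss()  # 1 = self-reported frustrated, 0 = not

faces = torch.randn(8, 3, 128, 128)           # batch of aligned faces
labels = torch.randint(0, 2, (8, 1)).float()  # self-reported labels
loss = loss_fn(model(faces), labels)
loss.backward()
```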
This is the one part where CNNs have still struggled to compete. This is where I talked about the cascade of regressors: detecting landmarks on the eyebrows, the nose, the jawline, the mouth. Faces have a heavily constrained shape, and so algorithms that can utilize those constraints effectively can often perform better than end-to-end regressors that just don't have any concept of what a face is shaped like.
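dlib ships a well-known implementation of this idea, an ensemble of regression trees (Kazemi and Sullivan, 2014). A minimal usage sketch, assuming the standard 68-landmark model file has been downloaded:

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.imread("driver_frame.jpg")  # hypothetical input frame
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

for face in detector(gray):
    shape = predictor(gray, face)  # cascade of regressors refines 68 points
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```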
And that's thanks to the awesome community that's building those datasets. Okay, so this is, again, the TA in his younger form.
This is an exciting direction that machine learning is headed in: the less you have to have humans look through the data, the more power these machine learning algorithms get. Currently, supervised learning is what's needed. You need human beings to label a cat and label a dog. But ideally you only have a human being label 1%, so the machine can come to the human and be like, "I don't know what I'm looking at in these pictures."
Occlusion is hard, whether it's your own arm or because of lighting conditions. This is what the Google self-driving car actually struggles with when they're trying to use their vision sensors. All kinds of occlusions are really hard.
Most of the time, all you're doing is staring forward at the roadway in the same way, so the algorithm can do all the hard work of annotation for you. It's in the transitions away from those positions that things get difficult. You use that to predict when something has changed, and that's what you bring to the human for annotation.
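In pseudocode terms, this is a confidence-routing loop; the threshold and shapes below are invented for illustration:

```python
import numpy as np

CONFIDENCE_THRESHOLD = 0.95  # illustrative value

def route_frames(probabilities):
    """probabilities: (num_frames, num_classes) softmax outputs.

    Frames the model is confident about are auto-labeled;
    low-confidence frames (often the transitions) go to a human.
    """
    confidence = probabilities.max(axis=1)
    auto_labeled = np.where(confidence >= CONFIDENCE_THRESHOLD)[0]
    needs_human = np.where(confidence < CONFIDENCE_THRESHOLD)[0]
    return auto_labeled, needs_human

# Fake softmax outputs standing in for a gaze classifier's predictions.
probs = np.random.dirichlet(np.ones(6), size=1000)
auto_idx, human_idx = route_frames(probs)
```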
We're collecting huge amounts of data in the Teslas. We're not asking drivers to, like, enter anything in an app.
It's interesting in the sense that there's a lot of good science here. Your eyes are actually going to move smoothly when tracking a moving object, which probably has to do with our hunting background. There are also tiny movements which are almost imperceptible for computer vision. And even though pupil size has been used effectively as a measure of cognitive load in the lab, it's hard to use in the car, where lighting changes constantly.
I think I'm just repeating the same thing over and over. And we dump that sequence into a 3D convolutional neural network.
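A hedged sketch of that last stage, with an invented architecture and an illustrative three-level load output:

```python
import torch
import torch.nn as nn

class CognitiveLoad3DCNN(nn.Module):
    """Classifies a short clip of aligned eye-region crops."""
    def __init__(self, num_classes=3):  # e.g., low / medium / high load
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, clips):  # (batch, 1, frames, H, W), grayscale
        return self.net(clips)

clips = torch.randn(4, 1, 16, 64, 64)
logits = CognitiveLoad3DCNN()(clips)  # shape: (4, 3)
```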
The image of the face is transformed in such a way that the eyes are always in the exact same position. It's the same process as detecting the identity of the face, the same process as detecting where the driver is looking. And all of those require very little hyperparameter tuning.
I was criticized for this being a very cheesy slide. Full autonomy won't arrive all at once; we're likely to take gradual steps towards it. Control is being given to somebody else, to the machine, and it's a gradual process of that machine earning trust. To earn that trust, the machine is going to need to see what the human is doing. So alongside billions of miles of driving data, what we need is billions of miles of driver-facing data as well. If you look at the research, you'll find that we're in the very early stages of this.
So why does a deeper network give better results? This is a mysterious thing we don't understand. There are these hundreds of millions of parameters, and somehow useful concepts emerge from them. One of my favorite examples of this emergent concept is Conway's Game of Life; the TA will probably criticize me for it being as cheesy as the stairway slide. A neuron in a neural network is a really simple computational unit, and then incredible power emerges when you just combine a lot of them in a network. The Game of Life is the same kind of system: a grid of cells, each one simple on its own.
And every single cell is operating under a simple rule. You can think of it as a cell living and dying: a live cell with two or three neighbors survives, a dead cell with exactly three neighbors comes alive, and in every other case the cell dies or stays dead. All it's doing is operating under this very local process.
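Those rules fit in a few lines; here's a minimal NumPy version (with wraparound edges) that you can run to watch global patterns emerge from the local rule:

```python
import numpy as np

def life_step(grid):
    """One Game of Life step; grid is a 2D array of 0s and 1s."""
    # Count each cell's eight neighbors using shifted copies of the grid.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Born with exactly 3 neighbors; survive with 2 or 3.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)

grid = (np.random.rand(32, 32) < 0.3).astype(int)
for _ in range(10):
    grid = life_step(grid)
```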
Or perhaps it's in the way we're currently training neural networks. I encourage you to read the Deep Learning book by Goodfellow, Bengio, and Courville.
That's how I recommend you learn machine learning. You can also look at the Udacity Self-Driving Car Engineer Nanodegree. It's been great to have such a big community of deep learning folks.
But this is actually the top three neural networks, and you can see the number of cars passed there. The actual evaluation process runs through a lot of iterations and takes the median evaluation.
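That is, scores are aggregated with a median rather than a single run; something like the following, where the run count and the stand-in simulator are purely illustrative:

```python
import numpy as np

def evaluate(run_once, num_runs=500):
    """Run many independent simulations and report the median score,
    so one lucky or unlucky run doesn't decide the leaderboard."""
    scores = [run_once() for _ in range(num_runs)]
    return float(np.median(scores))

# Hypothetical stand-in for one competition simulation run.
median_score = evaluate(lambda: np.random.normal(65.0, 2.0))
```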
With that, let me thank you guys so much. The competition will run for a while, and we're working on a journal paper. And this is obviously the first time teaching this class. So thank you so much for being a part of it.