MIT 6.S094: Deep Learning for Human Sensing
Chapters
0:00 Intro
6:53 Human Imperfections
22:57 Pedestrian Detection
28:57 Body Pose Estimation
35:40 Glance Classification
47:13 Emotion Recognition
53:24 Cognitive Load Estimation
1:00:54 Human-Centered Vision for Autonomous Vehicles
to understanding, to sensing the human being. 00:00:12.520 |
Of course, we humans express ourselves visually, 00:00:16.480 |
but also through audio, voice and through text. 00:00:24.440 |
we're just going to focus on computer vision. 00:00:45.760 |
for successfully applying deep learning methods 00:01:03.040 |
to create systems that operate in the real world. 00:01:06.760 |
And in order for them to operate in the real world, 00:01:10.680 |
They sound simple, some are much harder than they sound. 00:01:30.240 |
these supervised learning methods can be trained. 00:01:33.560 |
I'll say this over and over throughout the day today, 00:01:40.200 |
is the hardest part and the most important part. 00:01:48.680 |
all the different ways we capture human beings 00:02:47.600 |
the interesting cases that we're interested in. 00:03:11.760 |
the kinds of things that happen in this world. 00:03:43.840 |
Annotation tools that are used for glance classification, 00:04:06.000 |
There needs to be tooling for each one of those elements, 00:04:27.240 |
such that we can train neural networks on them. 00:04:49.040 |
You have to do large-scale distributed compute, 00:05:13.800 |
Of course, that's really exciting and important, 00:05:20.720 |
is that as long as these algorithms learn from data, 00:05:32.960 |
meaning they learn to calibrate, self-calibrate. 00:05:51.880 |
one of the key things that comes up time and time again, 00:05:57.280 |
is a lot of the algorithms developed in deep learning 00:06:03.760 |
Now, the real world happens in both space and time, 00:06:21.520 |
are able to capture the physics of the scene. 00:06:31.120 |
unfortunately, it's that the painful, boring stuff 00:06:53.360 |
Okay, so today I'd like to talk about 00:07:28.440 |
as some of our guest speakers have spoken about, 00:07:30.320 |
and Sterling Anderson will speak about tomorrow, 00:07:37.160 |
of operating and cooperating with AI systems. 00:07:44.800 |
to try to motivate why we need to continuously 00:08:08.880 |
We're actually really good at a lot of things. 00:08:19.040 |
how distracted we are, how irrational we are. 00:08:21.880 |
But we're actually really damn good at driving. 00:08:24.840 |
Here's a video of a state-of-the-art soccer player, 00:08:28.720 |
Messi, the best soccer player in the world, obviously. 00:08:39.240 |
but I assure you, the American Ninja Warrior, 00:08:49.920 |
DARPA humanoid robotics systems shown on the right. 00:08:57.800 |
continuing on the line of thought 00:09:02.520 |
to challenge us here, that humans are amazing, 00:09:45.480 |
folks who live in Massachusetts are the least likely 00:10:12.840 |
the people getting from A to B on a mass scale, 00:10:30.840 |
the eating, the secondary tasks of talking to other passengers, 00:10:40.480 |
and manually adjusting the radio. 00:10:48.480 |
And 400,000 were injured in motor vehicle crashes 00:11:20.640 |
is the average time your eyes are off the road while texting. 00:11:24.200 |
If you're traveling 55 miles an hour in that five seconds, 00:11:28.840 |
that's enough time to cover the length of a football field. 00:11:35.440 |
In five seconds, the average time of texting, 00:11:40.120 |
And so many things can happen in that moment of time. 00:11:49.560 |
31% of traffic fatalities involve a drunk driver. 00:11:58.080 |
for a legal prescription or over-the-counter medication. 00:12:01.000 |
Distracted driving, as I said, is a huge safety risk. 00:12:07.760 |
Nearly 3% of all traffic fatalities involve a drowsy driver. 00:12:20.800 |
These are videos collected by AAA of teenagers, 00:12:24.600 |
a very large-scale naturalistic driving data set, 00:12:27.160 |
and it's capturing clips of teenagers being distracted 00:13:21.960 |
Once you take it in, the problem we're up against. 00:13:40.520 |
So in that context of human imperfections, we have to ask ourselves about the human-centered 00:13:47.440 |
approach to autonomy in systems, autonomous vehicles that are using artificial intelligence 00:13:53.040 |
to aid the driving task. Which way do we want to go, as I mentioned a couple of lectures ago? 00:14:01.720 |
The tempting path is towards full autonomy, where we remove this imperfect, flawed human 00:14:07.520 |
from the picture altogether and focus on the robotics problem of perception and control 00:14:17.080 |
Or do we work together, human and machine, to improve the safety, to alleviate distraction, 00:14:24.120 |
to bring driver attention back to the road and use artificial intelligence to increase 00:14:28.360 |
safety through collaboration, human-robot interaction versus removing the human completely 00:14:38.040 |
As I've mentioned, as Sterling will certainly talk about tomorrow and rightfully so, and 00:14:47.040 |
yesterday, or on Tuesday, as Emilio talked about, the L4 way is grounded in the literature, 00:15:00.960 |
You can count on the fact that the natural flaws of human beings, to overtrust, 00:15:08.400 |
to misbehave, to be irrational about their risk estimates, will result in improper use 00:15:16.800 |
And that leads to what I've shown before, the public perception of what drivers do in 00:15:24.240 |
The moment the system works well, they begin to overtrust. 00:15:27.840 |
They begin to do stuff they're not supposed to be doing in the car, taking it for granted. 00:15:33.760 |
A recent video that somebody posted, this is a common sort of more practical concern 00:15:39.960 |
that people have is, well, the traditional ways to ensure the physical engagement of 00:15:47.680 |
the driver is by saying they should touch the wheel, the steering wheel every once in 00:15:52.600 |
And of course, there's ways to bypass the need to touch the steering wheel. 00:15:57.080 |
Some people hang objects like a can off of the steering wheel. 00:16:01.620 |
In this case, brilliantly, I have to say, they shove an orange into the wheel to make 00:16:11.260 |
the touch sensor fire and therefore be able to take their hands off the wheel on Autopilot. 00:16:17.840 |
And that kind of idea makes us believe that humans will always find a way 00:16:25.780 |
However, I believe that that's not giving the technology enough credit. 00:16:33.100 |
Artificial intelligence systems, if they're able to perceive the human being, are also 00:16:39.940 |
And that's what I'd like to talk about today. 00:16:52.260 |
Data is everything in these real-world systems. 00:16:55.300 |
With the MIT naturalistic driving data set of 25 vehicles, of which 21 are equipped with 00:17:07.740 |
We'll see the cameras on the face, capturing high-definition video of the face. 00:17:12.380 |
That's where we get the glance classification, the emotion recognition, cognitive load, everything 00:17:17.980 |
Then we have another camera, a fisheye, that's looking at the body of the driver. 00:17:22.780 |
And from that comes the body pose estimation, hands-on wheel, activity recognition. 00:17:28.980 |
And then one video looking out for the full scene segmentation for all the scene perception 00:17:34.260 |
And everything is being recorded, synchronized together with GPS, with audio, with all the 00:17:52.340 |
We have thousands like it, traveling hundreds of miles, sometimes hundreds of miles under 00:18:07.540 |
And from this data, we can both gain understanding of what people do, which is really important 00:18:12.980 |
to understand how autonomy, successful autonomy can be deployed in the real world, and to 00:18:19.620 |
design algorithms for training the deep neural networks 00:18:19.620 |
in order to perform the perception task better. 00:18:30.860 |
25 vehicles, 21 Teslas, Model S, Model X, and now Model 3. 00:18:44.180 |
Every single day we have thousands of miles in the Boston, Massachusetts area driving 00:19:26.700 |
A car that never moves is a perfectly safe system. 00:19:33.220 |
But it doesn't provide a service that's valuable. 00:19:36.340 |
It doesn't provide an enjoyable driving experience. 00:19:44.260 |
The reality is with these Tesla vehicles and L2 systems doing automated driving, 00:19:49.220 |
people are driving 33% of miles using Tesla Autopilot. 00:19:55.620 |
That means that people are getting value from it. 00:19:58.340 |
A large fraction of their driving is done in an automated way. 00:20:07.100 |
The glance classification algorithm we'll talk about today is used as one example that 00:20:16.540 |
Shown with the bar graphs there in the red and the blue. 00:20:19.580 |
Red is your manual driving, blue is your autopilot driving. 00:20:22.820 |
And we look at glance classification, regions of where drivers are looking, 00:20:28.060 |
and whether that distribution changes between automated driving and manual driving. 00:20:33.820 |
And with these glance classification methods, 00:20:36.180 |
we can determine that there's not much difference. 00:20:38.780 |
At least until you dig into the details, which we haven't done. 00:20:42.620 |
In the aggregate, there's not a significant difference. 00:20:45.620 |
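As a rough illustration of that comparison (not the lecture's actual analysis; the labels below are made up), glance-region distributions for manual versus Autopilot epochs can be tallied like this:
```python
# Hypothetical sketch: compare glance-region distributions between manual and
# Autopilot epochs. Labels are illustrative, not the actual MIT-AVT data.
from collections import Counter

REGIONS = ["road", "left", "right", "rearview", "center_stack", "instrument_cluster"]

def glance_distribution(frame_labels):
    """Fraction of frames spent in each glance region."""
    counts = Counter(frame_labels)
    total = max(len(frame_labels), 1)
    return {r: counts.get(r, 0) / total for r in REGIONS}

# Toy per-frame glance labels for two driving modes.
manual_frames = ["road"] * 900 + ["center_stack"] * 50 + ["left"] * 50
autopilot_frames = ["road"] * 880 + ["center_stack"] * 70 + ["left"] * 50

for mode, frames in [("manual", manual_frames), ("autopilot", autopilot_frames)]:
    dist = glance_distribution(frames)
    print(mode, {r: round(p, 3) for r, p in dist.items()})
```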
That means people are getting value and enjoying using these technologies. 00:20:52.620 |
But yet they're staying attentive, or at least keeping their eyes on the road, 00:21:01.620 |
When your eyes are on the road, you might not be attentive. 00:21:04.820 |
But at the very least, your body's positioned in such a way 00:21:11.340 |
that you're physically in position to be alert and to take in the forward roadway. 00:21:17.700 |
So they're using it and they don't overtrust it. 00:21:24.100 |
And that's I think the sweet spot that human robot interaction needs to achieve. 00:21:29.900 |
Is the human gaining through experience, through exploration, through trial and error, 00:21:37.660 |
exploring and understanding the limitations of the system, 00:21:45.580 |
And using the computer vision methods I'll talk about, 00:21:49.340 |
we can continue to explore how that can be achieved in other systems. 00:21:53.700 |
When the fraction of automated driving increases, 00:22:02.740 |
It's all about the data and I'll harp on this again. 00:22:12.060 |
I will mention of course, it's the same convolutional neural networks. 00:22:16.860 |
It's the same networks that take in raw pixels and extract features of interest. 00:22:23.780 |
It's 3D convolutional neural networks that take in sequences of images 00:22:28.340 |
and extract the temporal dynamics along with the visual characteristics of the individual images. 00:22:33.180 |
It's RNNs, LSTMs, that use convolutional neural networks to extract features 00:22:38.900 |
and over time look at the dynamics of the images. 00:22:42.740 |
These are pretty basic architectures, the same kind of deep neural network architectures. 00:22:49.060 |
But they rely fundamentally and deeply on the data, on real-world data. 00:22:54.860 |
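As a rough sketch of the kind of architecture being described (layer sizes and input shapes are assumptions, not the course's exact models), here is a 2D CNN extracting per-frame features, with an LSTM modeling their temporal dynamics:
```python
# Minimal sketch: a 2D CNN extracts per-frame features and an LSTM models
# their dynamics over time. Shapes and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class FrameCNN(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, x):                      # x: (batch, 3, H, W)
        return self.fc(self.conv(x).flatten(1))

class CNNLSTM(nn.Module):
    def __init__(self, num_classes=6, feat_dim=128):
        super().__init__()
        self.cnn = FrameCNN(feat_dim)
        self.lstm = nn.LSTM(feat_dim, 64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, clip):                   # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])           # classify from the last timestep

logits = CNNLSTM()(torch.randn(2, 8, 3, 64, 64))   # -> (2, 6)
```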
So let's start where perhaps on the human sensing side it all began, 00:23:08.740 |
To put it in context, pedestrian detection is shown here on a spectrum from left to right. 00:23:12.620 |
On the left, in green, are the easier human sensing tasks. 00:23:18.220 |
Tasks of sensing some aspect of the human being. 00:23:20.980 |
Pedestrian detection, which is detecting the full body of a human being in an image or video, 00:23:31.380 |
And on the right, in red, microsaccades. 00:23:35.220 |
These are tremors of the eye; measuring the pupil diameter, 00:23:39.060 |
or measuring cognitive load from the fine blink dynamics of the eye, 00:23:43.940 |
the velocity of the blink, micro-glances and eye pose, are much harder problems. 00:23:50.740 |
So you think body pose estimation, pedestrian detection, 00:23:53.900 |
face detection, classification, recognition, head pose estimation, 00:23:59.380 |
Anything that starts getting smaller, looking at the eye 00:24:02.860 |
and everything that starts getting fine-grained, 00:24:07.980 |
So we start at the easiest, pedestrian detection. 00:24:10.820 |
And there are the usual challenges of all of computer vision we've talked about, 00:24:20.700 |
the different possible articulations of our bodies, 00:24:31.500 |
The presence of occlusion, from the accessories that we wear, 00:24:34.980 |
to self-occlusion and occluding each other. 00:24:38.260 |
But crowded scenes have a lot of humans in them and they occlude each other 00:24:45.900 |
to figure out each individual pedestrian is a very challenging problem. 00:24:52.220 |
Well, there is a need to extract features from raw pixels. 00:25:10.780 |
Because the pedestrians can be small in an image or big, 00:25:15.420 |
So you use a sliding window to detect where that pedestrian is. 00:25:19.140 |
You have a classifier that's given a single image, 00:25:23.500 |
You take that classifier, you slide it across the image 00:25:26.780 |
to find where all the pedestrians in the scene are. 00:25:32.380 |
or you can use convolutional neural networks for that classifier. 00:25:38.180 |
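Here is a toy sketch of the sliding-window idea; the scoring function is a placeholder standing in for a trained pedestrian/no-pedestrian classifier, and a real detector would also repeat this over an image pyramid to handle pedestrians of different sizes:
```python
# Illustrative sliding-window detection; the scoring function is a stand-in
# for a trained classifier (e.g., a CNN over a 128x64 window).
import numpy as np

def score_window(patch):
    # Hypothetical classifier: returns P(pedestrian) for one window.
    return float(patch.mean() > 0.5)

def sliding_window_detect(image, win=(128, 64), stride=32, threshold=0.5):
    """Return (row, col, score) for every window the classifier fires on."""
    detections = []
    h, w = image.shape[:2]
    for r in range(0, h - win[0] + 1, stride):
        for c in range(0, w - win[1] + 1, stride):
            s = score_window(image[r:r + win[0], c:c + win[1]])
            if s >= threshold:
                detections.append((r, c, s))
    return detections

image = np.random.rand(480, 640)          # toy grayscale frame
print(len(sliding_window_detect(image)))  # number of candidate windows fired
```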
Then came along R-CNN, Fast R-CNN, Faster R-CNN. 00:25:44.860 |
as opposed to doing a complete sliding window approach, 00:25:47.620 |
are much more intelligent, clever about generating the candidates to consider. 00:25:53.100 |
So as opposed to considering every possible position of a window, 00:25:57.700 |
they generate a small subset of candidates that are more likely. 00:26:04.100 |
And finally, using a CNN classifier for those candidates, 00:26:08.740 |
whether there's an object of interest or not, a face or not. 00:26:17.380 |
to figure out what is the most likely bounding box 00:26:27.420 |
really the state-of-the-art localization network, 00:26:33.460 |
on top of the bounding box also performs segmentation. 00:26:36.340 |
There's VoxelNet, which does detection on three-dimensional LiDAR data, 00:26:45.820 |
But it's all kind of grounded in the R-CNN framework. 00:27:03.780 |
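For reference, a minimal sketch of running an off-the-shelf detector from the R-CNN family; this assumes a recent torchvision (the `weights="DEFAULT"` argument) and is not the model used in the lecture. The region proposals, per-candidate classification, and non-maximum suppression all happen inside the call:
```python
# Off-the-shelf Faster R-CNN from torchvision; COCO class 1 is "person".
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)            # stand-in for a camera frame, values in [0, 1]
with torch.no_grad():
    pred = model([image])[0]               # dict with "boxes", "labels", "scores"

keep = (pred["labels"] == 1) & (pred["scores"] > 0.8)   # confident "person" boxes
pedestrian_boxes = pred["boxes"][keep]      # (N, 4) boxes in (x1, y1, x2, y2) format
print(pedestrian_boxes.shape)
```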
So for example, here's one of the intersections, 00:27:09.060 |
instrumenting it with various sensors I'll mention, 00:27:48.300 |
This is from the LiDAR data of the same intersection. 00:28:04.100 |
and the pedestrians making crossing decisions. 00:28:07.340 |
This is understanding the negotiation, 00:28:11.540 |
the nonverbal negotiation that pedestrians perform 00:28:37.100 |
is you do bounding box detection of the pedestrians, 00:29:02.580 |
Body pose estimation is also finding the joints, 00:29:19.420 |
So why is that important in driving, for example? 00:29:22.620 |
It's important to determine the vertical position 00:29:27.180 |
the seat belts and the sort of the airbag testing 00:29:33.900 |
with the dummy considering the frontal position 00:29:39.500 |
With greater and greater degrees of automation 00:29:47.180 |
from the standard quote-unquote dummy position. 00:29:49.700 |
And so body pose, or at least upper body pose estimation 00:29:53.420 |
allows you to determine how often these drivers 00:30:07.620 |
activity and help add context to glance estimation 00:30:13.660 |
So some of the more traditional methods were sequential, 00:30:30.460 |
which has been a very powerful, successful way 00:30:42.260 |
of detecting body parts from the entire image. 00:30:47.260 |
It's not sequentially stitching bodies together, 00:30:50.100 |
it's detecting the left elbow, the right elbow, 00:30:57.140 |
and then stitching everything together afterwards. 00:31:01.980 |
Allowing you to deal with the crazy deformations 00:31:05.780 |
of the body that happen, the occlusions and so on 00:31:09.100 |
because you don't need all the joints to be visible. 00:31:18.180 |
meaning these are convolutional neural networks 00:31:26.980 |
Input is an image, output is an estimate of a joint, 00:31:40.180 |
every estimation zooms in on that particular area 00:31:43.620 |
and performs a finer and finer grain estimation 00:31:58.060 |
in scenes that contain multiple people. 00:32:03.900 |
the hands, the elbows shown in the various images 00:32:06.820 |
on the right, that don't have an understanding 00:32:09.860 |
of who the head, the elbows, the hands belong to. 00:32:15.780 |
without trying to do individual person detection first. 00:32:25.700 |
but the next step is connecting them with part affinity fields 00:32:43.140 |
So you kind of stitch the different people together 00:32:45.180 |
in the scene after the detection is performed with the CNN. 00:32:48.500 |
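A heavily simplified sketch of the first stage of this kind of approach, decoding per-joint confidence maps into 2D joint locations; the part-affinity-field grouping that assembles joints into separate people is omitted, and the joint list and map sizes are assumptions:
```python
# Decode per-joint confidence maps into (x, y) keypoints via an argmax.
# The grouping of joints into individual people is not shown here.
import numpy as np

JOINTS = ["head", "neck", "l_shoulder", "r_shoulder", "l_elbow", "r_elbow"]

def decode_heatmaps(heatmaps, min_conf=0.3):
    """heatmaps: (num_joints, H, W) confidence maps -> {joint: (x, y)} dict."""
    keypoints = {}
    for name, hmap in zip(JOINTS, heatmaps):
        y, x = np.unravel_index(np.argmax(hmap), hmap.shape)
        if hmap[y, x] >= min_conf:          # skip occluded / low-confidence joints
            keypoints[name] = (int(x), int(y))
    return keypoints

toy_maps = np.random.rand(len(JOINTS), 46, 46)   # stand-in network output
print(decode_heatmaps(toy_maps))
```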
We use this approach for detecting the upper body, 00:33:02.180 |
That is used to determine the position of the driver 00:33:10.740 |
For example, looking at Autopilot driving over 00:33:14.020 |
30-minute periods, we can plot on the x-axis the time 00:33:18.180 |
and on the y-axis the position of the neck point 00:33:32.660 |
This is the slouching, the sinking into the seat. 00:33:39.420 |
and allowing us or the designers of safety systems 00:33:42.580 |
to know that information is really important. 00:33:54.580 |
as opposed to just plain pedestrian detection 00:34:31.940 |
from which the camera is observing the scene. 00:34:42.420 |
zero when the pedestrian is not looking at the car, 00:34:44.740 |
one when the pedestrian is looking at the car. 00:34:47.900 |
So we can look at thousands of episodes like this, 00:34:50.380 |
crossing decisions, nonverbal communication decisions 00:34:56.100 |
the dynamics of this nonverbal communication. 00:35:15.020 |
Interestingly, most people look away before they cross. 00:35:25.380 |
this is an example, we have thousands of these. 00:35:53.300 |
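As a toy example of what can be computed from such an episode (the attention signal below is made up), the per-frame 0/1 "looking at the car" label can be reduced to, say, the time between the last glance and the crossing decision:
```python
# Illustrative analysis of one crossing episode with a per-frame 0/1 signal.
FPS = 30

def seconds_since_last_look(attention, crossing_frame):
    """attention: list of 0/1 per frame; returns seconds between the last
    glance at the car and the crossing decision (None if they never looked)."""
    looks = [i for i, a in enumerate(attention[:crossing_frame + 1]) if a == 1]
    if not looks:
        return None
    return (crossing_frame - looks[-1]) / FPS

attention = [0] * 60 + [1] * 45 + [0] * 30      # looked for 1.5 s, then away
print(seconds_since_last_look(attention, crossing_frame=134))  # -> 1.0 s
```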
and has the most impact in the driving context 00:35:58.220 |
is for the car to know where the driver is looking. 00:36:02.780 |
And at the very crude region level information 00:36:10.980 |
That's what we mean by glance classification. 00:36:13.220 |
It's not the standard gaze estimation problem 00:36:28.780 |
On road, off road, left, right, center stack, 00:36:43.540 |
It allows you to address it as a machine learning problem. 00:36:50.700 |
Every problem we try to solve in human sensing, 00:36:53.660 |
in driver sensing, has to be learnable from data. 00:36:58.380 |
Otherwise it's not amenable to application in the real world. 00:37:06.900 |
that are deployed without learning if they involve a human. 00:37:39.500 |
the orientation of the head and the orientation of the eyes, 00:37:49.120 |
If we convert this into a gaze classification problem 00:38:00.260 |
determining in post, so humans are annotating this video, 00:38:04.460 |
which region the driver is looking at. 00:38:08.640 |
That we're able to do by converting the problem 00:38:22.020 |
It can annotate regions of where they're looking 00:38:25.500 |
and using that kind of classification approach 00:38:29.340 |
determine are they looking at the cars or not. 00:38:37.640 |
again, it's a subtle point, but think about it. 00:38:40.020 |
If you wanted to estimate exactly where they're looking, 00:39:00.260 |
in order to be able to train neural networks on this. 00:39:14.780 |
There is some degree of calibration that's required. 00:39:17.160 |
You have to determine approximately where the sensor is 00:39:31.640 |
where the camera frame is relative to the world frame. 00:39:36.100 |
The video stabilization and the face frontalization, 00:39:39.700 |
all the basic processing that removes the vibration 00:39:42.140 |
and the noise, that removes the physical movement of the head, 00:39:56.000 |
there is nothing left except taking in the raw video 00:40:01.660 |
of the face for the glance classification tasks 00:40:07.540 |
Raw pixels, that's the input to these networks. 00:40:10.020 |
And the output is whatever the training data is. 00:40:23.080 |
and the output is whatever you have data for. 00:40:30.020 |
Gaze estimation, which is a traditional geometric approach to this problem, 00:40:34.940 |
is designing algorithms that are able to detect 00:40:38.780 |
accurately the individual landmarks in the face 00:40:41.460 |
and from that estimate the geometry of the head pose. 00:40:57.860 |
But once we have that, we pass in just the raw pixels 00:41:03.100 |
As opposed to doing the estimation, it's classification. 00:41:07.580 |
Allowing you to perform what's shown there on the bottom 00:41:16.340 |
Road, left, right, center stack, instrument cluster, 00:41:25.340 |
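Schematically, and with an assumed architecture and input size rather than the course's actual network, the glance classifier is just an image classifier over those regions:
```python
# Glance classification as plain image classification: raw face-region pixels
# in, one of six glance regions out. Architecture is an illustrative assumption.
import torch
import torch.nn as nn

GLANCE_CLASSES = ["road", "left", "right", "rearview", "center_stack", "instrument_cluster"]

glance_net = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, len(GLANCE_CLASSES)),
)

face_crop = torch.randn(1, 3, 96, 96)            # stand-in face image
probs = glance_net(face_crop).softmax(dim=1)     # confidence per region
print(GLANCE_CLASSES[int(probs.argmax())], float(probs.max()))
```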
And as I mentioned, annotation tooling is key. 00:41:28.900 |
So we have a total of five billion video frames, 00:41:35.220 |
That would take tens of millions of dollars to annotate 00:41:48.580 |
in order to train the neural networks to perform this task. 00:41:59.900 |
the partial occlusions from the light or self-occlusion, 00:42:03.860 |
and the moving out of frame, the out of frame occlusions. 00:42:15.820 |
Whenever the classification has a low confidence, 00:42:21.340 |
We rely on the human only when the classifier 00:42:25.740 |
And the fundamental trade-off in all of these systems 00:42:30.860 |
is what is the accuracy we're willing to put up with. 00:42:35.540 |
Here in red and blue, in red is human choice decision, 00:42:42.380 |
In red, we select the video we want to classify. 00:42:52.660 |
the face detection task, localizing the camera, 00:43:05.380 |
So certainly a neural network can annotate glance 00:43:08.140 |
for the entire data set, but it would achieve accuracy 00:43:13.180 |
of low-90% classification on the six-class task. 00:43:44.060 |
And it repeats this process over and over on the frames 00:44:12.020 |
highly confident predictions can be highly wrong? 00:44:18.260 |
False positives that you're really confident in. 00:44:30.540 |
That usually seems to generalize over those cases. 00:44:30.540 |
Usually some degree of human annotation fixes most problems. 00:44:49.340 |
Annotating the low confidence part of the data 00:45:00.420 |
But of course, that's not always true in a general case. 00:45:05.860 |
You can imagine a lot of scenarios where that's not true. 00:45:19.060 |
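The routing logic itself is simple; here is a sketch with an illustrative threshold (the actual threshold is a design choice tied to the accuracy you are willing to accept):
```python
# Confidence-based annotation split: the network labels frames it is confident
# about; only low-confidence frames are sent to a human annotator.
CONFIDENCE_THRESHOLD = 0.95   # illustrative value

def route_frame(predicted_label, confidence):
    """Return (label_source, label) for one video frame."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return "machine", predicted_label       # trust the network
    return "human", None                        # queue for manual annotation

predictions = [("road", 0.99), ("center_stack", 0.71), ("left", 0.97)]
for label, conf in predictions:
    print(route_frame(label, conf))
# -> ('machine', 'road'), ('human', None), ('machine', 'left')
```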
we usually annotate a large amount of the data 00:45:23.380 |
So we have to make sure that the neural network 00:45:48.860 |
whether the driver's looking on road or off road. 00:45:51.340 |
This is critical information for the car to understand. 00:45:53.860 |
And you wanna pause for a second to realize that 00:45:56.780 |
when you're driving a car for those who drive, 00:46:02.660 |
it has no idea about what you're up to at all. 00:46:12.780 |
More and more now with the GM Super Cruise vehicle 00:46:15.580 |
and Tesla now has added a driver facing camera, 00:46:32.620 |
It's common sense how important this information is, 00:46:44.660 |
it's been three decades of work on gaze estimation. 00:46:48.420 |
Gaze estimation is doing head pose estimation, 00:47:00.740 |
We convert that into a classification problem. 00:47:08.660 |
Glance classification is a machine learning problem. 00:47:28.180 |
The problem with emotion, if I may speak as an expert, 00:47:39.940 |
is that there is a lot of ways to taxonomize emotion, 00:47:47.180 |
whether that's for the primary emotions of the Parrott scale 00:47:47.180 |
with love, joy, surprise, anger, sadness, fear. 00:47:57.380 |
to break those apart into hierarchical taxonomies. 00:48:11.740 |
but it's kind of how we think about primary emotions 00:48:14.700 |
is detecting the broad categories of emotion, 00:48:22.900 |
And then there is application-specific emotion recognition, 00:48:30.660 |
that all the various ways that we can deform our face 00:48:47.780 |
I mean there's countless ways of deforming the face 00:49:09.100 |
This is their task with the general emotion recognition task 00:49:19.260 |
various subtleties of that emotion in a general case, 00:49:30.300 |
I mean essentially what these algorithms are doing 00:49:32.220 |
whether they're using deep neural networks or not, 00:49:45.780 |
the component, the various expressions we can make 00:49:47.900 |
with our eyebrows, with our nose and mouth and eyes 00:50:05.180 |
that you observe a smiling expression on the face 00:50:11.420 |
If there's an increased probability of a smile, 00:50:37.940 |
That's for the general emotion recognition task 00:50:41.020 |
that's sort of the core of the affective computing movement 00:50:55.740 |
We can take, here we have a large scale data set 00:51:07.940 |
so they're talking to their GPS using their voice. 00:51:13.940 |
in most cases an incredibly frustrating experience. 00:51:21.980 |
After the task they say on a scale of one to 10, 00:51:27.500 |
And what you see on top is the expressions detected 00:51:42.300 |
Is perfectly satisfied with a voice based interaction. 00:51:49.380 |
as I believe a nine on the frustration scale. 00:51:53.460 |
So the feature that was the strongest there, the expression, 00:51:58.220 |
remember, joy, the smile, was the strongest indicator 00:52:05.140 |
Smile was the thing that was always there for frustration. 00:52:17.340 |
So that shows you the kind of clean difference 00:52:24.700 |
Here, perhaps they enjoyed an absurd moment of joy 00:52:43.580 |
And their data does the work, not the algorithms. 00:52:53.220 |
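As a heavily simplified sketch of what an application-specific pipeline like this computes (the expression names, detector outputs, and decision rule below are all placeholders; in practice the mapping is learned from labeled sessions, and, as just noted, it can be counterintuitive):
```python
# Per-frame expression probabilities (random stand-ins for detector outputs
# such as "smile" or "brow furrow") aggregated over a session, then mapped to
# a label by a trivial placeholder rule. Real systems learn this from data.
import numpy as np

EXPRESSIONS = ["smile", "brow_furrow", "eye_widen", "lip_press"]

def session_features(frame_probs):
    """frame_probs: (num_frames, num_expressions) -> mean activation per expression."""
    return dict(zip(EXPRESSIONS, frame_probs.mean(axis=0)))

frame_probs = np.random.rand(300, len(EXPRESSIONS))   # 10 s of video at 30 fps
feats = session_features(frame_probs)

# Placeholder rule; the learned mapping can be counterintuitive, e.g. smiling
# correlating with self-reported frustration, as in the example above.
label = "frustrated" if feats["smile"] > 0.5 and feats["brow_furrow"] > 0.4 else "satisfied"
print(feats, label)
```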
for the artificial general intelligence class. 00:53:24.460 |
Cognitive load, we're starting to get to the eyes. 00:53:29.460 |
Cognitive load is the degree to which a human being 00:53:35.820 |
is accessing their memory or is lost in thought, 00:53:43.300 |
to recollect something, to think about something. 00:53:57.660 |
So there's pupils, the black part of the eye, 00:53:59.940 |
they can expand and contract based on various factors, 00:54:04.260 |
including the lighting variations in the scene, 00:54:06.780 |
but they also expand and contract based on cognitive load. 00:54:16.020 |
When we look around, eyes jump around the scene. 00:54:19.020 |
They can also do something called smooth pursuit. 00:54:25.300 |
can see a delicious meal flying by or running by 00:55:02.780 |
and real world data with lighting variations, 00:55:10.820 |
non-contact way to measure cognitive load in the lab 00:55:24.900 |
3D convolutional neural networks in this case, 00:55:26.940 |
we take a sequence of images of the eye through time 00:55:29.860 |
and use 3D convolutions as opposed to 2D convolutions. 00:55:45.060 |
Every channel is operated on by the filter separately. 00:56:24.900 |
and performing what's called the N-back task. 00:56:29.380 |
And that task involves hearing numbers being read to you 00:56:33.780 |
and then recalling those numbers one at a time. 00:56:37.700 |
So with zero-back, the system gives you a number, seven, 00:56:41.740 |
and then you have to just say that number back. 00:57:01.300 |
And not get distracted by the new information coming up. 00:57:05.060 |
With two back, you have to do that two numbers back. 00:57:07.740 |
So you have to use memory more and more with two back. 00:57:23.060 |
And now we have this nice raw pixels of the eye region 00:57:39.620 |
We have a ton of data of people on the highway 00:57:51.220 |
The input is 90 images at 15 frames a second. 00:58:03.820 |
is the technique developed for face recognition 00:58:11.260 |
It's also what we use here to normalize everything 00:58:16.100 |
It's taking whatever the orientation of the face 00:58:34.140 |
Where you find, and this is where the intuition builds. 00:58:45.060 |
What's being plotted here is the relative movement 00:58:57.140 |
so when your mind is not that lost in thought. 00:59:02.260 |
when it is lost in thought, the eye moves a lot less. 00:59:09.420 |
That's an interesting finding, but it's only in the aggregate. 00:59:12.140 |
And that's what the neural network is tasked with doing, 00:59:18.100 |
This is a standard 3D convolutional architecture. 00:59:23.980 |
Again, taking in the image sequence as the input, 00:59:41.300 |
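A minimal sketch of such a 3D convolutional classifier (layer sizes are assumptions, not the course's exact network), taking a 90-frame grayscale eye-region clip and producing three outputs for the 0-back, 1-back, and 2-back conditions:
```python
# 3D convolutions over (time, height, width) of an eye-region clip.
import torch
import torch.nn as nn

cognitive_load_net = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
    nn.Conv3d(16, 32, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(32, 3),                       # 0-back, 1-back, 2-back
)

clip = torch.randn(1, 1, 90, 64, 64)        # (batch, channels, time, H, W), grayscale eye crops
logits = cognitive_load_net(clip)           # -> (1, 3)
print(logits.shape)
```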
The idea is that you can just plop in a webcam, 00:59:58.740 |
Because every single one of the zero-back, one-back, two-back classes 01:00:03.380 |
has a confidence that's associated with it, 01:00:05.300 |
so you can turn that into a real value between zero and two. 01:00:14.420 |
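The conversion to a continuous value is just an expectation over the class indices, for example:
```python
# Turn the three class confidences into a real-valued load between 0 and 2.
import torch

logits = torch.tensor([[0.2, 1.5, 0.6]])              # stand-in network output
probs = logits.softmax(dim=1)                         # confidences for 0/1/2-back
levels = torch.tensor([0.0, 1.0, 2.0])
cognitive_load = (probs * levels).sum(dim=1)          # expected load in [0, 2]
print(float(cognitive_load))
```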
driving a car, performing a task of conversation. 01:00:19.420 |
And in white, showing the cognitive load, frame by frame. 01:00:24.460 |
At 30 frames a second, estimating the cognitive load 01:00:27.220 |
of each of the drivers, from zero to two on the y-axis. 01:00:41.140 |
And when everybody's silent, the cognitive load goes down. 01:00:44.460 |
So we can perform now with this simple neural network, 01:00:49.300 |
we can extend that to any arbitrary new data set 01:00:53.260 |
Okay, those are some examples of how neural networks 01:00:59.500 |
Again, while we focus on the perception task 01:01:08.020 |
and signal processing to determine where we are 01:01:10.900 |
in the world, where the different obstacles are 01:01:12.740 |
and form trajectories around those obstacles, 01:01:15.260 |
we are still far away from completely solving that problem. 01:01:26.820 |
and so when the system is not able to control, 01:01:31.620 |
when there's some flawed aspect about the perception 01:01:34.180 |
or the driving policy, the human has to be involved. 01:01:37.580 |
And that's where we have to let the car know 01:01:42.700 |
That's the essential element of human robot interaction. 01:01:45.900 |
The most popular car in the United States today 01:01:54.740 |
The thing that sort of inspires us and makes us think 01:01:58.300 |
that transportation can be fundamentally transformed 01:02:05.540 |
and all of our guest speakers and all the folks 01:02:09.860 |
But if you look at it, the only ones who are, 01:02:13.100 |
at a mass scale or beginning to, actually injecting 01:02:17.620 |
automation into our daily lives are the ones in between. 01:02:26.740 |
the Audi, the Volvo S90s, the vehicles that are slowly 01:02:31.740 |
adding some degree of automation and teaching human beings 01:03:05.820 |
and create successful human robot interaction, 01:03:08.660 |
approach autonomous vehicles, autonomous systems 01:03:18.180 |
of the human-centered systems, like the Tesla vehicles, 01:03:23.500 |
The kind of L2 technologies have not truly penetrated 01:03:27.300 |
the market, have not penetrated our vehicles, 01:03:35.500 |
And that's going to form the core of our algorithms 01:03:40.500 |
that will eventually lead to the full autonomy. 01:03:43.940 |
All of that data, what I mentioned with Tesla 01:03:49.020 |
all of that is training data for the algorithms. 01:04:07.940 |
Why is this important, beautiful, and fundamental 01:04:20.580 |
on the human-robot interaction, are personal robots. 01:04:27.060 |
tools like a Roomba performing a particular task. 01:04:33.780 |
when there's a fundamental transfer between it, 01:04:42.700 |
there's a transfer, that is kind of a relationship 01:05:08.860 |
one-to-one understanding of each other's mental state, 01:05:17.000 |
So, one of my favorite movies, Good Will Hunting, 01:05:28.600 |
This is Robin Williams speaking about human imperfections. 01:05:38.080 |
and replace every time he mentions girl with car. 01:05:53.940 |
They call these things imperfections, but they're not. 01:06:13.740 |
Well, that'll be the idiosyncrasies that only I know about. 01:06:57.720 |
Now you could know everything in the world's worth, 01:07:17.120 |
it's the human-centered approach to autonomous vehicles 01:07:37.120 |
on deep learning for understanding the humans at CHI 2018. 01:07:45.000 |
the convolutional neural network based detection 01:08:09.360 |
to begin to explore the nature of intelligence, 01:08:31.400 |
Richard Moyes talking about autonomous weapon systems 01:08:35.400 |
and AI safety, Marc Raibert from Boston Dynamics 01:09:11.560 |
The high performer award will be given to folks, 01:09:15.180 |
the very few folks who achieve 70 miles an hour or faster. 01:09:24.160 |
having hit a few snags and invested a few thousand dollars 01:09:33.360 |
in annotating a large-scale data set for you guys. 01:09:37.840 |
We'll continue this competition that will take us 01:09:47.160 |
and DeepCrash, the deep reinforcement learning competition. 01:09:49.780 |
These competitions will continue through May 2018. 01:10:09.320 |
There's an introduction to deep learning course 01:10:16.440 |
in the very basic algorithms of deep learning 01:10:24.240 |
And there's an awesome class that I ran last year 01:10:35.640 |
I encourage you to click a link on there and register. 01:10:40.760 |
and it truly brings together a lot of cross disciplinary 01:10:44.560 |
folks to talk about ideas of artificial intelligence 01:10:51.800 |
And if you're interested in applying deep learning methods 01:11:00.240 |
We have a lot of fascinating problems to solve 01:11:04.480 |
So with that, I'd like to thank everybody here, 01:11:09.520 |
everybody across the community that's been contributing. 01:11:13.800 |
We have thousands of submissions coming in for DeepTraffic, 01:11:13.800 |
and the team behind this class is incredible. 01:11:31.220 |
extra large, extra, extra large and medium over there,