
MIT 6.S094: Deep Learning for Human Sensing


Chapters

0:00 Intro
6:53 Human Imperfections
22:57 Pedestrian Detection
28:57 Body Pose Estimation
35:40 Glance Classification
47:13 Emotion Recognition
53:24 Cognitive Load Estimation
60:54 Human-Centered Vision for Autonomous Vehicles


00:00:00.000 | Today we will talk about how to apply
00:00:03.000 | the methods of deep learning
00:00:04.720 | to understanding and sensing the human being.
00:00:08.120 | The focus will be on computer vision,
00:00:10.360 | the visual aspects of a human being.
00:00:12.520 | Of course, we humans express ourselves visually,
00:00:16.480 | but also through audio, voice and through text.
00:00:20.160 | Beautiful poetry and novels and so on,
00:00:23.080 | we're not going to touch those today,
00:00:24.440 | we're just going to focus on computer vision.
00:00:27.040 | How we can use computer vision to extract
00:00:30.240 | useful, actionable information
00:00:33.000 | from video, images, video of human beings,
00:00:37.160 | in particular in the context of the car.
00:00:41.680 | So, what are the requirements
00:00:45.760 | for successfully applying deep learning methods
00:00:48.080 | in the real world?
00:00:49.680 | So, when we talk about human sensing,
00:00:52.360 | we're not talking about
00:00:54.840 | basic face recognition of celebrity images.
00:00:58.360 | We're talking about using computer vision,
00:01:01.480 | deep learning methods
00:01:03.040 | to create systems that operate in the real world.
00:01:06.760 | And in order for them to operate in the real world,
00:01:09.160 | there are several things.
00:01:10.680 | They sound simple, some are much harder than they sound.
00:01:14.400 | First, and the most important,
00:01:16.480 | ordered here from most to least critical,
00:01:20.920 | is data.
00:01:22.360 | Data is everything.
00:01:23.920 | Real world data.
00:01:25.240 | We need a lot of real world data
00:01:27.640 | to form the data set on which
00:01:30.240 | these supervised learning methods can be trained.
00:01:33.560 | I'll say this over and over throughout the day today,
00:01:37.280 | data is everything.
00:01:38.440 | That means data collection
00:01:40.200 | is the hardest part and the most important part.
00:01:42.840 | We'll talk about how that data collection
00:01:45.320 | is carried out here,
00:01:46.760 | in our group at MIT,
00:01:48.680 | all the different ways we capture human beings
00:01:51.160 | in the driving context,
00:01:52.960 | in the road user context,
00:01:54.360 | pedestrians, cyclists.
00:01:55.960 | But the data,
00:01:59.760 | it starts and ends at data.
00:02:02.960 | The fun stuff is the algorithms.
00:02:05.760 | But the data is what makes it all work.
00:02:09.160 | Real world data.
00:02:10.760 | Okay, then once you have the data,
00:02:13.680 | okay, data isn't everything, I lied.
00:02:15.560 | Because you have to actually annotate it.
00:02:18.200 | So what do we mean by data?
00:02:19.440 | There's raw data,
00:02:22.400 | video, audio,
00:02:24.080 | LIDAR, all the types of sensors we'll
00:02:27.760 | talk about to capture
00:02:30.320 | real world
00:02:31.600 | road user interaction.
00:02:34.400 | You have to reduce that
00:02:36.640 | into meaningful representative cases
00:02:40.280 | of what happens in that real world.
00:02:42.280 | In driving, 99% of the time,
00:02:44.680 | driving looks the same.
00:02:45.680 | It's the 1%,
00:02:47.600 | the interesting cases that we're interested in.
00:02:50.280 | And what we want is algorithms to train
00:02:52.760 | learning algorithms on those 1%.
00:02:55.760 | So we have to collect the 100%,
00:02:58.880 | we have to collect all the data,
00:03:00.280 | and then figure out an automated,
00:03:02.200 | semi-automated ways
00:03:03.560 | to find the pieces of that data
00:03:06.760 | that could be used to train neural networks
00:03:08.760 | and that are representative of the general
00:03:11.760 | things, kinds of things that happen in this world.
00:03:14.480 | Efficient annotation.
00:03:18.760 | Annotation isn't just about
00:03:21.040 | drawing bounding boxes on images of cats.
00:03:24.840 | Annotation tooling is key to unlocking
00:03:29.160 | real world performance.
00:03:33.040 | Systems that successfully
00:03:36.120 | solve some problem,
00:03:37.920 | accomplish some goal in real world data.
00:03:39.960 | That means designing annotation tools
00:03:42.400 | for a particular task.
00:03:43.840 | Annotation tools that are used for glance classification,
00:03:47.800 | for determining where drivers are looking,
00:03:49.400 | are very different than annotation tools
00:03:51.400 | used for body pose estimation.
00:03:53.200 | It's very different than the tooling
00:03:55.320 | that we use for SegFuse,
00:03:57.920 | investing thousands of dollars
00:03:59.720 | for the competition for this class,
00:04:01.680 | to annotate full scene segmentation
00:04:04.800 | where every pixel is colored.
00:04:06.000 | There needs to be tooling for each one of those elements,
00:04:09.480 | and they're key.
00:04:10.280 | That's an HCI question.
00:04:12.120 | That's a design question.
00:04:13.680 | There's no deep learning,
00:04:15.520 | there's no robotics in that question.
00:04:18.880 | It's how do we leverage human computation,
00:04:22.840 | human, the human brain,
00:04:25.160 | to most effectively label images
00:04:27.240 | such that we can train neural networks on them.
00:04:29.120 | Hardware.
00:04:31.200 | In order to train these networks,
00:04:35.960 | in order to parse the data we collect,
00:04:39.560 | and we'll talk about,
00:04:40.320 | we have now over 5 billion images of data,
00:04:44.600 | of driving data.
00:04:45.800 | In order to parse that,
00:04:47.160 | you can't do it on a single machine.
00:04:49.040 | You have to do large-scale distributed compute,
00:04:52.920 | and large-scale distributed storage.
00:04:54.760 | And finally,
00:04:58.000 | the stuff that's the most exciting,
00:05:01.080 | that people,
00:05:03.280 | that this class,
00:05:04.800 | and many classes,
00:05:05.720 | and much of the literature is focused on,
00:05:08.040 | is the algorithms.
00:05:08.920 | The deep learning algorithms,
00:05:10.600 | the machine learning algorithms,
00:05:11.920 | the algorithms that learn from data.
00:05:13.800 | Of course, that's really exciting and important,
00:05:16.720 | but what we find time and time again,
00:05:19.200 | in real-world systems,
00:05:20.720 | is that as long as these algorithms learn from data,
00:05:24.200 | so as long as it's deep learning,
00:05:25.760 | the data is what's much more important.
00:05:28.760 | Of course, it's nice for the algorithms
00:05:31.040 | to be calibration-free,
00:05:32.960 | meaning they learn to calibrate, self-calibrate.
00:05:36.320 | We don't need to have the sensors
00:05:38.120 | in an exact same position every time.
00:05:40.000 | That's a very nice feature.
00:05:41.560 | The robustness of the system is then
00:05:44.160 | generalizable across multiple,
00:05:46.600 | multiple vehicles, multiple scenarios.
00:05:51.880 | One of the key things that comes up time and time again,
00:05:55.880 | and we'll mention today,
00:05:57.280 | is a lot of the algorithms developed in deep learning
00:06:00.600 | for computer vision
00:06:02.160 | are focused on single images.
00:06:03.760 | Now, the real world happens in both space and time,
00:06:08.280 | and we have to have algorithms that both
00:06:10.840 | capture the visual characteristics,
00:06:12.360 | but also look at the sequence of images,
00:06:14.600 | sequence of those visual characteristics
00:06:16.440 | that form the temporal dynamics,
00:06:18.000 | the physics of this world.
00:06:19.320 | So, it's nice when those algorithms
00:06:21.520 | are able to capture the physics of the scene.
00:06:24.840 | The big takeaway I would like,
00:06:28.920 | if you leave with anything today,
00:06:31.120 | unfortunately, it's that the painful, boring stuff
00:06:36.320 | of collecting data, of cleaning that data,
00:06:39.160 | of annotating that data,
00:06:40.840 | in order to create successful systems
00:06:43.640 | is much more important than good algorithms,
00:06:45.960 | or great algorithms.
00:06:47.560 | It's important to have good algorithms,
00:06:49.280 | as long as you have neural networks
00:06:51.840 | that learn from that data.
00:06:53.360 | Okay, so today I'll talk, I'd like to talk about
00:06:57.240 | human imperfections,
00:06:59.920 | and the various detection problems,
00:07:04.720 | the pedestrian body pose, glance,
00:07:07.400 | emotion, cognitive load estimation,
00:07:10.920 | that we can use to help those humans
00:07:14.360 | as they operate in a driving context.
00:07:17.320 | And finally, try to continue
00:07:21.840 | with the idea, the vision,
00:07:26.560 | that fully autonomous vehicles,
00:07:28.440 | as some of our guest speakers have spoken about,
00:07:30.320 | and Sterling Anderson will speak about tomorrow,
00:07:32.600 | is really far away.
00:07:34.280 | That the humans will be an integral part
00:07:37.160 | of the operating, cooperating with AI systems.
00:07:40.720 | And I'll continue on that line of thought
00:07:44.800 | to try to motivate why we need to continuously
00:07:48.200 | approach the autonomous vehicle,
00:07:51.040 | the self-driving car paradigm,
00:07:53.680 | in a human-centered way.
00:07:55.680 | Okay, first, before we
00:08:02.320 | talk about human imperfections,
00:08:04.920 | let's just pause and acknowledge
00:08:07.320 | that humans are amazing.
00:08:08.880 | We're actually really good at a lot of things.
00:08:12.760 | It's sometimes sort of fun to talk about
00:08:16.800 | how terrible of drivers we are,
00:08:19.040 | how distracted we are, how irrational we are.
00:08:21.880 | But we're actually really damn good at driving.
00:08:24.840 | Here's a video of a state-of-the-art soccer player,
00:08:28.720 | Messi, the best soccer player in the world, obviously.
00:08:31.640 | And a state-of-the-art robot on the right.
00:08:35.760 | Same thing.
00:08:36.760 | Well, it's not playing,
00:08:39.240 | but I assure you, the American Ninja Warrior,
00:08:42.120 | Casey,
00:08:45.160 | is far superior to the
00:08:49.920 | DARPA humanoid robotics systems shown on the right.
00:08:53.120 | Okay, so,
00:08:57.800 | continuing on the line of thought to challenge,
00:09:02.520 | to challenge us here, that humans are amazing,
00:09:05.880 | is, you know, there's a record high
00:09:09.600 | in 2016 in the United States.
00:09:12.400 | There was over 40,000 since many years,
00:09:16.280 | it's across the 40,000 fatalities mark.
00:09:19.480 | More than 40,000 people died in car crashes
00:09:22.240 | in the United States.
00:09:23.240 | But that's in 3.2 trillion miles traveled.
00:09:27.360 | So that's one fatality per 80 million miles.
00:09:30.320 | That's a one in 625 chance
00:09:36.880 | of dying in a car crash in your lifetime.
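(A quick back-of-the-envelope check of the miles-per-fatality figure, using the numbers quoted above; the one-in-625 lifetime odds is quoted from the lecture, not derived here.)

```python
# Rough arithmetic behind the fatality-rate claim, using the figures quoted above.
fatalities = 40_000        # approximate U.S. road fatalities in 2016
miles_traveled = 3.2e12    # approximate vehicle miles traveled that year

miles_per_fatality = miles_traveled / fatalities
print(f"{miles_per_fatality:,.0f} miles per fatality")  # 80,000,000 -> one per 80 million miles
```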
00:09:39.760 | Interesting side fact,
00:09:43.200 | for anyone in the United States,
00:09:45.480 | folks who live in Massachusetts are the least likely
00:09:49.520 | to die in a car crash.
00:09:51.120 | Montana is the most likely.
00:09:53.520 | So for everyone that
00:09:57.760 | thinks Boston drivers are terrible,
00:10:01.960 | maybe that adds some perspective.
00:10:03.520 | Here's a visualization of Waze data
00:10:06.160 | across a period of a day,
00:10:08.240 | showing you the lifeblood of the city,
00:10:10.440 | the traffic flow of the city,
00:10:12.840 | the people getting from A to B on a mass scale,
00:10:16.640 | and doing it,
00:10:18.080 | surviving, doing it okay.
00:10:20.640 | Humans are amazing.
00:10:23.080 | But they're also flawed.
00:10:26.240 | Texting,
00:10:28.000 | sources of distraction with a smartphone,
00:10:30.840 | the eating, the secondary tasks of talking to other passengers,
00:10:34.040 | grooming, reading,
00:10:35.160 | using navigation system,
00:10:38.240 | yes, sometimes watching video,
00:10:40.480 | and manually adjusting the radio.
00:10:44.360 | And 3,000 people were killed.
00:10:48.480 | And 400,000 were injured in motor vehicle crashes
00:10:53.400 | involving distraction in 2014.
00:10:57.400 | Distraction is a very serious issue for safety.
00:11:01.920 | Texting,
00:11:03.840 | every day more and more people text.
00:11:06.600 | Smartphones are proliferating our society.
00:11:08.800 | 170 billion text messages
00:11:11.560 | are sent in the United States every month.
00:11:13.560 | That's in 2014.
00:11:15.360 | You can only imagine what it is today.
00:11:17.160 | Eyes off road for five seconds
00:11:20.640 | is the average time your eyes are off the road while texting.
00:11:23.280 | Five seconds.
00:11:24.200 | If you're traveling 55 miles an hour in that five seconds,
00:11:28.840 | that's enough time to cover the length of a football field.
00:11:31.760 | So you're blindfolded.
00:11:34.080 | You're not looking at the road.
00:11:35.440 | In five seconds, the average time of texting,
00:11:37.960 | you're covering the entire football field.
00:11:40.120 | And so many things can happen in that moment of time.
00:11:44.240 | That's distraction.
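(A quick sanity check of the five-seconds-at-55-mph claim; the 360-foot football field length, including end zones, is an assumption added here.)

```python
# Distance covered during a 5-second glance away from the road at 55 mph.
speed_mph = 55
seconds = 5

speed_fps = speed_mph * 5280 / 3600        # ~80.7 feet per second
distance_ft = speed_fps * seconds          # ~403 feet

football_field_ft = 360                    # 120 yards including end zones (assumed here)
print(round(distance_ft, 1), distance_ft >= football_field_ft)  # 403.3 True
```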
00:11:45.640 | Drunk driving.
00:11:49.560 | 31% of traffic fatalities involve a drunk driver.
00:11:53.440 | Drug driving.
00:11:55.320 | 23% of nighttime drivers tested positive
00:11:58.080 | for illegal, prescription, or over-the-counter medication.
00:12:01.000 | Distracted driving, as I said, is a huge safety risk.
00:12:04.800 | Drowsy driving.
00:12:06.200 | People driving tired.
00:12:07.760 | Nearly 3% of all traffic fatalities involve a drowsy driver.
00:12:12.120 | If you are uncomfortable with videos
00:12:17.360 | that involve risk,
00:12:19.240 | I urge you to look away.
00:12:20.800 | These are videos collected by AAA of teenagers,
00:12:24.600 | a very large-scale naturalistic driving data set,
00:12:27.160 | and it's capturing clips of teenagers being distracted
00:12:30.200 | on their smartphone.
00:12:31.880 | (TEEN CHATTERING)
00:12:33.880 | (THUNDER RUMBLING)
00:12:48.960 | (TIRES SCREECHING)
00:13:21.960 | Once you take it in, the problem we're up against.
00:13:40.520 | So in that context of human imperfections, we have to ask ourselves, for the
00:13:47.440 | approach to autonomy in systems, autonomous vehicles that are using artificial intelligence
00:13:53.040 | to aid the driving task, do we want to go, as I mentioned a couple of lectures ago, the
00:13:58.760 | human-centered way or the full autonomy way?
00:14:01.720 | The tempting path is towards full autonomy, where we remove this imperfect, flawed human
00:14:07.520 | from the picture altogether and focus on the robotics problem of perception and control
00:14:12.240 | and planning and driving policy.
00:14:17.080 | Or do we work together, human and machine, to improve the safety, to alleviate distraction,
00:14:24.120 | to bring driver attention back to the road and use artificial intelligence to increase
00:14:28.360 | safety through collaboration, human-robot interaction versus removing the human completely
00:14:34.280 | from the picture?
00:14:38.040 | As I've mentioned, as Sterling will certainly talk about tomorrow, and rightfully so, and
00:14:47.040 | as Emilio talked about on Tuesday, the L4 way is grounded in literature,
00:14:55.480 | is grounded in common sense, in some sense.
00:15:00.960 | You can count on the fact that humans, the natural flaws of human beings to overtrust,
00:15:08.400 | to misbehave, to be irrational about their risk estimates will result in improper use
00:15:13.280 | of the technology.
00:15:16.800 | And that leads to what I've showed before, the public perception of what drivers do in
00:15:21.760 | semi-autonomous vehicles.
00:15:23.040 | They begin to overtrust.
00:15:24.240 | The moment the system works well, they begin to overtrust.
00:15:27.840 | They begin to do stuff they're not supposed to be doing in the car, taking it for granted.
00:15:33.760 | A recent video that somebody posted, this is a common sort of more practical concern
00:15:39.960 | that people have is, well, the traditional ways to ensure the physical engagement of
00:15:47.680 | the driver is by saying they should touch the wheel, the steering wheel every once in
00:15:51.400 | a while.
00:15:52.600 | And of course, there's ways to bypass the need to touch the steering wheel.
00:15:57.080 | Some people hang objects like a can off of the steering wheel.
00:16:01.620 | In this case, brilliantly, I have to say, they shove an orange into the wheel to make
00:16:11.260 | the touch sensor fire and therefore be able to take their hands off the autopilot.
00:16:17.840 | And that kind of idea makes us believe that humans will inevitably find a way
00:16:24.100 | to misuse this technology.
00:16:25.780 | However, I believe that that's not giving the technology enough credit.
00:16:33.100 | Artificial intelligence systems, if they're able to perceive the human being, are also
00:16:37.980 | able to work with the human being.
00:16:39.940 | And that's what I'd like to talk about today.
00:16:43.840 | Teaching cars to perceive the human being.
00:16:47.740 | And it all starts with data.
00:16:50.340 | It's all about data, as I mentioned.
00:16:52.260 | Data is everything in these real-world systems.
00:16:55.300 | With the MIT naturalistic driving data set of 25 vehicles, of which 21 are equipped with
00:17:01.620 | Tesla autopilot, we instrument them.
00:17:04.100 | This is what we do with the data collection.
00:17:05.940 | Two cameras on the driver.
00:17:07.740 | We'll see the cameras on the face, capturing high-definition video of the face.
00:17:12.380 | That's where we get the glance classification, the emotion recognition, cognitive load, everything
00:17:16.980 | coming from the face.
00:17:17.980 | Then we have another camera, a fisheye, that's looking at the body of the driver.
00:17:22.780 | And from that comes the body pose estimation, hands-on wheel, activity recognition.
00:17:28.980 | And then one video looking out for the full scene segmentation for all the scene perception
00:17:33.260 | tasks.
00:17:34.260 | And everything is being recorded, synchronized together with GPS, with audio, with all the
00:17:38.340 | CAN data coming from the car, on a single device.
00:17:42.180 | Synchronization of this data is critical.
00:17:47.740 | So that's one road trip in the data.
00:17:52.340 | We have thousands like it, traveling hundreds of miles, sometimes hundreds of miles under
00:17:57.740 | automated control, in autopilot.
00:18:02.820 | That's the data.
00:18:03.820 | Again, as I said, data is everything.
00:18:07.540 | And from this data, we can both gain understanding of what people do, which is really important
00:18:12.980 | to understand how autonomy, successful autonomy can be deployed in the real world, and to
00:18:19.620 | design algorithms for training, for training the deep learning, the deep neural networks
00:18:26.700 | in order to perform the perception task better.
00:18:30.860 | 25 vehicles, 21 Teslas, Model S, Model X, and now Model 3.
00:18:41.940 | Over a thousand miles collected a day.
00:18:44.180 | Every single day we have thousands of miles in the Boston, Massachusetts area driving
00:18:47.780 | around, all of that video being recorded.
00:18:51.060 | Now over 5 billion video frames.
00:18:56.540 | There's several ways to look at autonomy.
00:19:01.340 | One of the big ones is safety.
00:19:07.060 | That's what everybody talks about.
00:19:08.380 | How do we make these things safe?
00:19:10.740 | But the other one is enjoyment.
00:19:14.100 | Do people actually want to use it?
00:19:17.660 | We can create a perfectly safe system.
00:19:20.780 | We can create it right now.
00:19:22.500 | We've had it forever, before even cars.
00:19:26.700 | A car that never moves is a perfectly safe system.
00:19:29.500 | Well, not perfectly, but almost.
00:19:33.220 | But it doesn't provide a service that's valuable.
00:19:36.340 | It doesn't provide an enjoyable driving experience.
00:19:39.460 | So okay, what about slow-moving vehicles?
00:19:42.460 | That's an open question.
00:19:44.260 | The reality is with these Tesla vehicles and L2 systems doing automated driving,
00:19:49.220 | people are driving 33% of miles using Tesla Autopilot.
00:19:54.300 | What does that mean?
00:19:55.620 | That means that people are getting value from it.
00:19:58.340 | A large fraction of their driving is done in an automated way.
00:20:03.300 | That's value, that's enjoyment.
00:20:07.100 | The glance classification algorithm we'll talk about today is used as one example that
00:20:14.180 | we use to understand what's in this data.
00:20:16.540 | Shown with the bar graphs there in the red and the blue.
00:20:19.580 | Red is your manual driving, blue is your autopilot driving.
00:20:22.820 | And we look at glance classification, regions of where drivers are looking,
00:20:26.340 | on road and off road.
00:20:28.060 | And if that distribution changes with automated driving or manual driving.
00:20:33.820 | And with these glance classification methods,
00:20:36.180 | we can determine that there's not much difference.
00:20:38.780 | At least until you dig into the details, which we haven't done.
00:20:42.620 | In the aggregate, there's not a significant difference.
00:20:45.620 | That means people are getting value and enjoying using these technologies.
00:20:52.620 | But yet they're staying attentive or at least not attentive,
00:20:59.140 | but physically engaged.
00:21:01.620 | When your eyes are on the road, you might not be attentive.
00:21:04.820 | But you're at the very least physically, your body's positioned in such a way,
00:21:09.220 | your head is looking at the forward roadway,
00:21:11.340 | that you're physically in position to be alert and to take in the forward roadway.
00:21:17.700 | So they're using it and they don't over trust it.
00:21:24.100 | And that's I think the sweet spot that human robot interaction needs to achieve.
00:21:29.900 | Is the human gaining through experience, through exploration, through trial and error,
00:21:37.660 | exploring and understanding the limitations of the system,
00:21:40.780 | to a degree that overtrust does not occur.
00:21:43.500 | That seems to be happening in this system.
00:21:45.580 | And using the computer vision methods I'll talk about,
00:21:49.340 | we can continue to explore how that can be achieved in other systems.
00:21:53.700 | When the fraction of automated driving increases,
00:21:58.900 | from 30% to 40% to 50% and so on.
00:22:02.740 | It's all about the data and I'll harp on this again.
00:22:10.660 | The algorithms are interesting.
00:22:12.060 | I will mention of course, it's the same convolutional neural networks.
00:22:16.860 | It's the same networks that take in raw pixels and extract features of interest.
00:22:23.780 | It's 3D convolutional neural networks that take in a sequences of images
00:22:28.340 | and extract the temporal dynamics along with the visual characteristics of the individual images.
00:22:33.180 | It's RNNs, LSTMs, that use the convolutional neural networks to extract features
00:22:38.900 | and over time look at the dynamics of the images.
00:22:42.740 | These are pretty basic architectures, the same kind of deep neural network architectures.
00:22:49.060 | But they rely fundamentally and deeply on the data, on real-world data.
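(A minimal PyTorch sketch of the kind of architecture just described, not the lecture's actual model: a convolutional encoder extracts per-frame visual features and an LSTM models their temporal dynamics across a sequence of images. All layer sizes are illustrative.)

```python
import torch
import torch.nn as nn

class FrameSequenceClassifier(nn.Module):
    """Per-frame CNN features fed into an LSTM that models the temporal dynamics."""
    def __init__(self, num_classes, feat_dim=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(                       # small per-frame feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, frames):
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, (h_n, _) = self.lstm(feats)                  # final hidden state summarizes the clip
        return self.head(h_n[-1])

# Example: a batch of 2 clips, 16 frames each, 64x64 RGB, classified into 6 classes.
logits = FrameSequenceClassifier(num_classes=6)(torch.randn(2, 16, 3, 64, 64))
```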
00:22:54.860 | So let's start where perhaps on the human sensing side it all began,
00:23:00.980 | which is pedestrian detection.
00:23:02.980 | Decades ago.
00:23:08.740 | To put it in context, pedestrian detection here shown from left to right.
00:23:12.620 | On the left is green showing the easier human sensing tasks.
00:23:18.220 | Tasks of sensing some aspect of the human being.
00:23:20.980 | Pedestrian detection, which is detecting the full body of a human being in an image or video,
00:23:28.260 | is one of the easier computer vision tasks.
00:23:31.380 | And on the right, in the red, micro saccades.
00:23:35.220 | These are tremors of the eye or measuring the pupil diameter,
00:23:39.060 | or measuring the cognitive load of the fine blink dynamics of the eye,
00:23:43.940 | the velocity of the blink, micro glances and eye pose are much harder problems.
00:23:50.740 | So you think body pose estimation, pedestrian detection,
00:23:53.900 | face classification detection, recognition, head pose estimation,
00:23:57.660 | all those are easier tasks.
00:23:59.380 | Anything that starts getting smaller, looking at the eye
00:24:02.860 | and everything that start getting fine-grained,
00:24:06.380 | this is much more difficult.
00:24:07.980 | So we start at the easiest, pedestrian detection.
00:24:10.820 | And there are the usual challenges of all of computer vision we've talked about:
00:24:15.740 | the various styles of appearance,
00:24:18.140 | the intra-class variation,
00:24:20.700 | the different possible articulations of our bodies,
00:24:25.980 | superseded only perhaps by cats,
00:24:28.580 | but us humans are pretty flexible as well.
00:24:31.500 | The presence of occlusion from the accessories that we wear
00:24:34.980 | to self-occlusion and occluding each other.
00:24:38.260 | Crowded scenes have a lot of humans in them and they occlude each other,
00:24:43.100 | and therefore being able to disambiguate,
00:24:45.900 | to figure out each individual pedestrian is a very challenging problem.
00:24:49.940 | So how do people approach this problem?
00:24:52.220 | Well, there is a need to extract features from raw pixels.
00:25:00.620 | Whether that was Haar cascades, HOG, or CNNs,
00:25:04.500 | through the decades,
00:25:07.300 | the sliding window approach was used.
00:25:10.780 | Because the pedestrians can be small in an image or big,
00:25:13.820 | so there's the problem of scale.
00:25:15.420 | So you use a sliding window to detect where that pedestrian is.
00:25:19.140 | You have a classifier that's given a single image,
00:25:21.700 | says is this a pedestrian or not.
00:25:23.500 | You take that classifier, you slide it across the image
00:25:26.780 | to find where all the pedestrians are seen are.
00:25:29.260 | So you can use non-neural network methods
00:25:32.380 | or you can use convolutional neural networks for that classifier.
00:25:35.660 | It's extremely inefficient.
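(A toy sketch of the sliding-window approach described above, assuming some trained patch classifier; it makes the inefficiency obvious: the classifier runs at every position and every scale.)

```python
import numpy as np

def sliding_window_detect(image, classify, window=(128, 64), stride=16, scales=(1.0, 1.5, 2.0)):
    """Brute-force sliding-window detection: run a binary classifier at every
    position and scale. Illustrates the cost, not a production detector."""
    h, w = image.shape[:2]
    detections = []
    for s in scales:
        win_h, win_w = int(window[0] * s), int(window[1] * s)
        for y in range(0, h - win_h + 1, stride):
            for x in range(0, w - win_w + 1, stride):
                patch = image[y:y + win_h, x:x + win_w]
                score = classify(patch)                # any classifier: HOG+SVM, CNN, ...
                if score > 0.5:
                    detections.append((x, y, win_w, win_h, score))
    return detections

# Toy usage: a dummy classifier stands in for a trained pedestrian classifier.
dets = sliding_window_detect(np.zeros((480, 640, 3)), classify=lambda patch: 0.0)
```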
00:25:38.180 | Then came along R-CNN, Fast R-CNN, Faster R-CNN.
00:25:42.140 | These are networks that,
00:25:44.860 | as opposed to doing a complete sliding window approach,
00:25:47.620 | are much more intelligent, clever about generating the candidates to consider.
00:25:53.100 | So as opposed to considering every possible position of a window,
00:25:55.980 | different scales of the window,
00:25:57.700 | they generate a small subset of candidates that are more likely.
00:26:04.100 | And finally, using a CNN classifier for those candidates,
00:26:06.940 | whether there's a pedestrian or not,
00:26:08.740 | whether there's an object of interest or not, a face or not.
00:26:12.540 | And using non-maximal suppression,
00:26:15.300 | because there's overlapping bounding boxes
00:26:17.380 | to figure out what is the most likely bounding box
00:26:20.220 | around this pedestrian, around this object.
00:26:22.620 | That's R-CNN.
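(A minimal sketch of the non-maximal suppression step mentioned above: among overlapping candidate boxes, keep the highest-scoring one and discard the rest by intersection-over-union. The threshold and boxes are made up for illustration.)

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Two heavily overlapping pedestrian candidates and one separate box: the weaker overlap is suppressed.
boxes = [(10, 12, 62, 112), (12, 12, 62, 112), (200, 30, 250, 130)]
print(non_max_suppression(boxes, scores=[0.9, 0.7, 0.8]))  # indices of kept boxes: [0, 2]
```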
00:26:23.780 | And there are a lot of variants.
00:26:25.420 | Now with Mask R-CNN,
00:26:27.420 | really the state-of-the-art localization network,
00:26:30.140 | the mask adds to this:
00:26:33.460 | on top of the bounding box, it also performs segmentation.
00:26:36.340 | There's VoxelNet, which works with three-dimensional LiDAR data,
00:26:39.860 | doing localization in point clouds.
00:26:42.340 | So it's not just using 2D images, but in 3D.
00:26:45.820 | But it's all kind of grounded in the R-CNN framework.
00:26:51.340 | Okay, data.
00:26:54.100 | So we have large-scale data collection
00:26:56.540 | going on here in Cambridge.
00:26:58.420 | If you've seen cameras or LiDAR
00:27:00.020 | at various intersections throughout MIT,
00:27:02.540 | we're part of that.
00:27:03.780 | So for example, here's one of the intersections,
00:27:05.900 | we're collecting about 10 hours a day,
00:27:09.060 | instrumenting it with various sensors I'll mention,
00:27:13.820 | but we see about 12,000 pedestrians a day
00:27:16.820 | across that particular intersection,
00:27:20.980 | using 4K cameras,
00:27:23.660 | using stereo vision cameras, 360,
00:27:27.860 | now the Insta360, which is an 8K 360 camera,
00:27:31.220 | GoPro, LiDAR of various sizes,
00:27:33.980 | the 64 channel or the 16.
00:27:36.100 | And recording.
00:27:40.220 | This is where the data comes from.
00:27:44.980 | This is from the 360 video.
00:27:48.300 | This is from the LiDAR data of the same intersection.
00:27:51.380 | This is from the 4K camcorders
00:27:55.980 | pointing at a different intersection,
00:28:01.060 | then capturing the entire 360 view
00:28:03.060 | with the vehicles approaching
00:28:04.100 | and the pedestrians making crossing decisions.
00:28:07.340 | This is understanding the negotiation that pedestrians,
00:28:11.540 | the nonverbal negotiation that pedestrians perform
00:28:14.020 | in choosing to cross or not,
00:28:15.500 | especially when they're jaywalking,
00:28:17.060 | and everybody jaywalks.
00:28:18.900 | Especially if you're familiar
00:28:22.700 | with this particular intersection,
00:28:24.300 | there's more jaywalkers than non-jaywalkers.
00:28:26.740 | It's a fascinating one.
00:28:28.540 | And so we record everything about the driver
00:28:31.300 | and everything about the pedestrians.
00:28:33.260 | And again, R-CNN, this is where it comes in:
00:28:37.100 | you do bounding box detection of the pedestrians,
00:28:39.820 | here the vehicles as well,
00:28:41.420 | and allows you to convert this raw data
00:28:44.020 | into hours of pedestrian crossing decisions,
00:28:49.020 | and begin to interpret it.
00:28:51.420 | That's pedestrian detection, bounding box.
00:28:55.620 | Body pose estimation
00:28:59.380 | is the more difficult task.
00:29:02.580 | Body pose estimation is also finding the joints,
00:29:06.580 | the hands, the elbows, the shoulders,
00:29:08.820 | the hips, knees, feet,
00:29:11.700 | the landmark points in the image,
00:29:14.180 | X, Y position that mark those joints.
00:29:17.860 | That's body pose estimation.
00:29:19.420 | So why is that important in driving, for example?
00:29:22.620 | It's important to determine the vertical position
00:29:25.300 | or the alignment of the driver,
00:29:27.180 | because the airbag testing
00:29:31.540 | is always performed,
00:29:32.500 | and the seat belt testing is performed,
00:29:33.900 | with the dummy in the frontal,
00:29:36.420 | standard dummy position.
00:29:39.500 | With greater and greater degrees of automation
00:29:42.460 | comes more capability and flexibility
00:29:45.100 | for the driver to get misaligned
00:29:47.180 | from the standard quote-unquote dummy position.
00:29:49.700 | And so body pose, or at least upper body pose estimation
00:29:53.420 | allows you to determine how often these drivers
00:29:55.780 | get out of line from the standard position,
00:30:00.260 | general movement.
00:30:01.580 | And then you can look at hands on wheel,
00:30:04.060 | smartphone, smartphone detection,
00:30:07.620 | activity and help add context to glance estimation
00:30:11.980 | that we'll talk about.
00:30:13.660 | So some of the more traditional methods were sequential,
00:30:19.180 | detecting first the head
00:30:21.180 | and then step by step detecting the shoulders,
00:30:22.940 | the elbows, the hands.
00:30:24.340 | The DeepPose holistic view,
00:30:30.460 | which has been a very powerful, successful way
00:30:35.580 | for multi-person pose estimation
00:30:39.220 | is performing a regression
00:30:42.260 | of detecting body parts from the entire image.
00:30:47.260 | It's not sequentially stitching bodies together,
00:30:50.100 | it's detecting the left elbow, the right elbow,
00:30:52.500 | the hands individually.
00:30:53.940 | It's performing that detection
00:30:57.140 | and then stitching everything together afterwards.
00:31:01.980 | Allowing you to deal with the crazy deformations
00:31:05.780 | of the body that happen, the occlusions and so on
00:31:09.100 | because you don't need all the joints to be visible.
00:31:12.020 | And with this cascade of pose regressors,
00:31:18.180 | meaning these are convolutional neural networks
00:31:20.540 | that take in a raw image
00:31:22.500 | and produce an XY position of their estimate
00:31:24.940 | of each individual joint.
00:31:26.980 | Input is an image, output is an estimate of a joint,
00:31:31.180 | of an elbow, shoulder, whatever,
00:31:35.060 | one of several landmarks.
00:31:37.140 | And then you can build on top of that,
00:31:40.180 | every estimation zooms in on that particular area
00:31:43.620 | and performs a finer and finer grain estimation
00:31:47.860 | of the exact position of the joint.
00:31:49.620 | Repeating it over and over and over.
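(A hedged sketch of the cascade idea just described, in the spirit of DeepPose: stage 0 regresses a joint's position from the whole image, and each later stage zooms in on a crop around the current estimate and refines it. The regressors here are dummy stand-ins for trained CNN stages.)

```python
import numpy as np

def refine_joint(image, stages, crop_size=64):
    """Cascade of pose regressors: stage 0 predicts a joint's (x, y) from the full image;
    each later stage re-predicts it from a crop centered on the current estimate.
    `stages` are stand-ins for trained CNN regressors returning normalized (x, y) in [0, 1]."""
    h, w = image.shape[:2]
    nx, ny = stages[0](image)                 # coarse estimate, normalized to the full image
    x, y = nx * w, ny * h
    for stage in stages[1:]:
        x0 = int(max(0, min(w - crop_size, x - crop_size / 2)))
        y0 = int(max(0, min(h - crop_size, y - crop_size / 2)))
        crop = image[y0:y0 + crop_size, x0:x0 + crop_size]
        nx, ny = stage(crop)                  # refined estimate, normalized to the crop
        x, y = x0 + nx * crop_size, y0 + ny * crop_size
    return x, y

# Toy usage: dummy regressors that always predict the center of whatever they see.
image = np.zeros((480, 640, 3))
print(refine_joint(image, stages=[lambda img: (0.5, 0.5)] * 3))  # (320.0, 240.0)
```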
00:31:53.780 | So through this process,
00:31:54.940 | we can do part detection in multi-person,
00:31:58.060 | in multi-person scene that contain multiple people.
00:32:00.980 | So we can detect the head, the neck here,
00:32:03.900 | the hands, the elbows shown in the various images
00:32:06.820 | on the right that don't have an understanding
00:32:09.860 | who the head, the elbows, the hands belong to.
00:32:13.300 | It's just performing a detection
00:32:15.780 | without trying to do individual person detection first.
00:32:18.780 | And then, not finally,
00:32:25.700 | but as a next step, with part affinity fields,
00:32:28.980 | you connect those parts together.
00:32:32.340 | So first you detect individual parts,
00:32:33.940 | then you connect them together.
00:32:35.540 | And then through bipartite matching,
00:32:37.820 | you determine
00:32:39.620 | which person each individual body part
00:32:41.660 | most likely belongs to.
00:32:43.140 | So you kind of stitch the different people together
00:32:45.180 | in the scene after the detection is performed with the CNN.
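(A minimal sketch of just the matching step described above: given detected parts of two types, say elbows and wrists, and pairwise affinity scores standing in for integrated part affinity fields, bipartite matching assigns each part to the person it most likely belongs to. Uses SciPy's Hungarian-algorithm solver; a full pipeline would do this per limb type.)

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def connect_parts(affinity):
    """Bipartite matching step of bottom-up pose estimation: affinity[i, j] is the
    affinity (e.g., a PAF line integral) between elbow i and wrist j in the image.
    Returns (elbow_index, wrist_index) pairs maximizing total affinity."""
    row, col = linear_sum_assignment(-affinity)    # maximize affinity == minimize its negative
    return list(zip(row.tolist(), col.tolist()))

# Two people in the scene: the affinities pair elbow 0 with wrist 1 and elbow 1 with wrist 0.
affinity = np.array([[0.1, 0.9],
                     [0.8, 0.2]])
print(connect_parts(affinity))  # [(0, 1), (1, 0)]
```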
00:32:48.500 | We use this approach for detecting the upper body,
00:32:56.580 | specifically the shoulders, the neck,
00:32:58.220 | and the head, eyes, nose, ears.
00:33:02.180 | That is used to determine the position of the driver
00:33:08.220 | relative to the standard dummy position.
00:33:10.740 | For example, looking during autopilot driving,
00:33:14.020 | 30 minute periods, we can look at on the x-axis is time
00:33:18.180 | and the y-axis is the position of the neck point
00:33:20.620 | that I pointed out in the previous slide
00:33:23.180 | that the midpoint between the two shoulders
00:33:28.340 | the neck is the position over time
00:33:31.060 | relative to where it began.
00:33:32.660 | This is the slouching, the sinking into the seat.
00:33:35.620 | Allowing the car to know that information
00:33:39.420 | and allowing us or the designers of safety systems
00:33:42.580 | to know that information is really important.
00:33:44.780 | We can use the same body pose algorithm
00:33:48.100 | from the perspective of the vehicle,
00:33:50.460 | the outside-the-vehicle perspective.
00:33:52.380 | So the vehicle looking out is doing,
00:33:54.580 | as opposed to just plain pedestrian detection,
00:33:56.540 | body pose estimation.
00:33:58.380 | Again, here in Kendall Square,
00:34:02.980 | vehicles crossing, observing pedestrians
00:34:07.020 | making crossing decisions
00:34:08.980 | and performing body pose estimation,
00:34:11.260 | which allows you to then
00:34:14.020 | generate visualizations like this
00:34:19.900 | and gain understanding like this.
00:34:21.420 | On the x-axis is time,
00:34:23.660 | on the y-axis is on the top plot in blue
00:34:27.660 | is the speed of the vehicle.
00:34:29.260 | The speed of the vehicle, the ego vehicle
00:34:31.940 | from which the camera is observing the scene.
00:34:35.740 | And on the bottom in green,
00:34:38.540 | up and down is a binary value.
00:34:41.460 | Whether the pedestrian,
00:34:42.420 | zero when the pedestrian is not looking at the car,
00:34:44.740 | one when the pedestrian is looking at the car.
00:34:47.900 | So we can look at thousands of episodes like this,
00:34:50.380 | crossing decisions, nonverbal communication decisions
00:34:53.220 | and determine using body pose estimation,
00:34:56.100 | the dynamics of this nonverbal communication.
00:35:00.620 | Here, just nearby, by Media Lab,
00:35:03.860 | crossing, there's a pedestrian approaches.
00:35:06.380 | We can look in green there
00:35:07.660 | when the pedestrian glances, looks away,
00:35:10.020 | glances at the car, looks away.
00:35:12.220 | Fascinating glance behavior that happens.
00:35:15.020 | Interestingly, most people look away before they cross.
00:35:21.460 | Same thing here,
00:35:25.380 | this is an example, we have thousands of these.
00:35:28.260 | Body pose estimation allows you to
00:35:30.580 | get this fine-grained information
00:35:32.940 | about the pedestrian glance behavior,
00:35:35.500 | pedestrian body behavior, hesitation.
00:35:38.260 | Glance classification,
00:35:42.380 | one of the most important things in driving
00:35:44.780 | is determining where drivers are looking.
00:35:48.300 | If there's any sensing that I advocate
00:35:53.300 | and has the most impact in the driving context
00:35:58.220 | is for the car to know where the driver is looking.
00:36:02.780 | And at the very crude region level information
00:36:07.780 | is the driver looking on road or off road?
00:36:10.980 | That's what we mean by glance classification.
00:36:13.220 | It's not the standard gaze estimation problem
00:36:16.140 | of X, Y, Z determining where the eye pose
00:36:18.700 | and the head pose combine
00:36:20.620 | to determine where the driver is looking.
00:36:22.740 | No, this is classifying two regions,
00:36:25.700 | on road, off road, or six regions.
00:36:28.780 | On road, off road, left, right, center stack,
00:36:32.340 | rear view mirror and instrument cluster.
00:36:34.420 | So it's region-based glance allocation,
00:36:39.020 | not the geometric gaze estimation problem.
00:36:42.100 | Why is that important?
00:36:43.540 | It allows you to address it as a machine learning problem.
00:36:48.540 | This is subtle but critical point.
00:36:50.700 | Every problem we try to solve in human sensing,
00:36:53.660 | in driver sensing, has to be learnable from data.
00:36:58.380 | Otherwise it's not amenable to application in the real world.
00:37:03.380 | We can't design systems in the lab
00:37:06.900 | that are deployed without learning if they involve a human.
00:37:10.820 | It's possible to do SLAM localization
00:37:14.860 | by having really good sensors
00:37:17.700 | and doing localization using those sensors
00:37:20.460 | without much learning.
00:37:21.780 | It's not possible to design systems
00:37:23.820 | that deal with lighting variability
00:37:25.620 | and the full variability of human behavior
00:37:29.080 | without being able to learn.
00:37:30.780 | So gaze estimation, the geometric approach
00:37:33.580 | of finding the landmarks in the face
00:37:35.860 | and from those landmarks determining
00:37:39.500 | the orientation of the head and the orientation of the eyes,
00:37:42.780 | there's no learning there
00:37:44.700 | outside of actually training the systems
00:37:47.420 | to detect the different landmarks.
00:37:49.120 | If we convert this into a gaze classification problem
00:37:52.700 | shown here, glance classification,
00:37:55.060 | is when taking the raw video stream,
00:38:00.260 | determining in post, so humans are annotating this video,
00:38:04.460 | is the driver, which region the driver is looking at.
00:38:08.640 | That we're able to do by converting the problem
00:38:12.100 | into a simple variant of classification.
00:38:14.360 | On road, off road, left, right.
00:38:17.500 | The same can be done for pedestrians.
00:38:19.840 | Left, forward, right.
00:38:22.020 | We can annotate regions of where they're looking
00:38:25.500 | and using that kind of classification approach
00:38:29.340 | determine are they looking at the cars or not.
00:38:32.460 | Are they looking away?
00:38:33.520 | Are they looking at their smartphone?
00:38:35.140 | Without doing the 3D gaze estimation,
00:38:37.640 | again, it's a subtle point, but think about it.
00:38:40.020 | If you wanted to estimate exactly where they're looking,
00:38:43.040 | you need that ground truth.
00:38:44.480 | You don't have that ground truth unless you,
00:38:48.940 | there's no, in the real world data,
00:38:51.260 | there's no way to get the information
00:38:52.980 | about where exactly people were looking.
00:38:54.980 | You're only inferring.
00:38:56.220 | So you have to convert it
00:38:58.220 | into a region-based classification problem
00:39:00.260 | in order to be able to train neural networks on this.
00:39:03.800 | And the pipeline is the same.
00:39:05.500 | The source video, here, the face,
00:39:08.360 | the 30 frames a second video coming in
00:39:11.980 | of the driver's face or the human face.
00:39:14.780 | There is some degree of calibration that's required.
00:39:17.160 | You have to determine approximately where the sensor is
00:39:20.900 | that's taking in the image,
00:39:22.560 | especially for the classification task
00:39:25.300 | because it's region-based.
00:39:26.680 | It needs to be able to estimate
00:39:29.000 | where the forward roadway is,
00:39:31.640 | where the camera frame is relative to the world frame.
00:39:36.100 | The video stabilization and the face frontalization,
00:39:39.700 | all the basic processing that remove the vibration
00:39:42.140 | and the noise that remove the physical movement of the head,
00:39:45.840 | that remove the shaking of the car
00:39:49.040 | in order to be able to determine stuff
00:39:50.740 | about eye movement and blink dynamics.
00:39:53.180 | And finally, with neural networks,
00:39:56.000 | there is nothing left except taking in the raw video
00:40:01.660 | of the face for the glance classification tasks
00:40:04.780 | and the eye for the cognitive load tasks.
00:40:07.540 | Raw pixels, that's the input to these networks.
00:40:10.020 | And the output is whatever the training data is.
00:40:13.020 | And we'll mention each one.
00:40:16.140 | So whether that's cognitive load, glance,
00:40:18.060 | emotion, drowsiness.
00:40:20.180 | The input is the raw pixels
00:40:23.080 | and the output is whatever you have data for.
00:40:25.700 | Data is everything.
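(A minimal sketch of what region-based glance classification looks like as a learning problem, not the lecture's actual model: raw face-crop pixels in, one of six region labels out. The backbone choice and class names are illustrative.)

```python
import torch
import torch.nn as nn
from torchvision import models

# The lecture's six regions: road, left, right, center stack, instrument cluster, rear-view mirror.
GLANCE_CLASSES = ["road", "left", "right", "center_stack", "instrument_cluster", "rearview_mirror"]

# A standard image classifier repurposed for glance regions (illustrative backbone).
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, len(GLANCE_CLASSES))
model.eval()

face_crop = torch.randn(1, 3, 224, 224)          # a stabilized, cropped face image
with torch.no_grad():
    predicted = GLANCE_CLASSES[model(face_crop).argmax(dim=1).item()]
```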
00:40:27.580 | Here, the face alignment problem,
00:40:30.020 | which is a traditional geometric approach to this problem,
00:40:34.940 | is designing algorithms that are able to detect
00:40:38.780 | accurately the individual landmarks in the face
00:40:41.460 | and from that estimate the geometry of the head pose.
00:40:44.920 | For the classification version,
00:40:50.580 | we perform the same kind of alignment
00:40:53.300 | or the same kind of face detection alignment
00:40:55.180 | to determine where the head is.
00:40:57.860 | But once we have that, we pass in just the raw pixels
00:41:01.060 | and perform the classification on that.
00:41:03.100 | As opposed to doing the estimation, it's classification.
00:41:07.580 | Allowing you to perform what's shown there on the bottom
00:41:11.260 | is the real-time classification
00:41:14.660 | of where the driver is looking.
00:41:16.340 | Road, left, right, center stack, instrument cluster,
00:41:20.380 | and rear view mirror.
00:41:25.340 | And as I mentioned, annotation tooling is key.
00:41:28.900 | So we have a total of five billion video frames,
00:41:33.020 | one and a half billion of the face.
00:41:35.220 | That would take tens of millions of dollars to annotate
00:41:42.660 | just for the glance classification fully.
00:41:46.300 | So we have to figure out what to annotate
00:41:48.580 | in order to train the neural networks to perform this task.
00:41:52.780 | And what we annotate is the things
00:41:54.940 | that the network is not confident about.
00:41:57.580 | The moments of high lighting variation,
00:41:59.900 | the partial occlusions from the light or self-occlusion,
00:42:03.860 | and the moving out of frame, the out of frame occlusions.
00:42:07.780 | All the difficult cases.
00:42:09.660 | Going from frame to frame to frame here
00:42:12.020 | in the different pipelines,
00:42:13.020 | starting at the top, going to the bottom.
00:42:15.820 | Whenever the classification has a low confidence,
00:42:19.060 | we pass it to the human.
00:42:20.500 | It's simple.
00:42:21.340 | We rely on the human only when the classifier
00:42:24.220 | is not confident.
00:42:25.740 | And the fundamental trade-off in all of these systems
00:42:30.860 | is what is the accuracy we're willing to put up with.
00:42:35.540 | Here in red and blue, in red is human choice decision,
00:42:40.020 | in blue is machine task.
00:42:42.380 | In red, we select the video we want to classify.
00:42:46.940 | In blue, the neural network performs
00:42:52.660 | the face detection task, localizing the camera,
00:42:55.620 | choosing what is the angle of the camera,
00:42:57.980 | and provides a trade-off between accuracy
00:43:01.620 | and percent frames it can annotate.
00:43:05.380 | So certainly a neural network can annotate glance
00:43:08.140 | for the entire data set, but it would achieve accuracy
00:43:11.740 | in the case of glance classification
00:43:13.180 | of low 90% classification on the six-class task.
00:43:18.180 | Now if you wanted a higher accuracy,
00:43:21.260 | they would only be able to achieve that
00:43:22.940 | for a smaller fraction of frames.
00:43:25.540 | That's the choice.
00:43:27.260 | And then a human has to go in and perform
00:43:31.260 | the annotation of the frames
00:43:34.900 | that the algorithm is not confident about.
00:43:38.060 | And it repeats over and over.
00:43:39.580 | The algorithm is then trained on the frames
00:43:41.780 | that were annotated by the human.
00:43:44.060 | And it repeats this process over and over on the frames
00:43:46.300 | until everything is annotated.
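(A sketch of the annotation loop just described, with the model, training, and human-labeling steps as stand-in callables: the machine keeps the frames it is confident about, humans label a batch of the low-confidence ones, the model is retrained, and the process repeats until everything is labeled.)

```python
def annotation_loop(frame_ids, model, train, human_annotate, confidence_threshold=0.9):
    """Semi-automated annotation: trust the model on high-confidence frames, send
    low-confidence frames to human annotators, retrain, repeat until all are labeled.
    `model`, `train`, and `human_annotate` are stand-ins, not a real API."""
    labels = {}
    unlabeled = list(frame_ids)
    while unlabeled:
        uncertain = []
        for frame_id in unlabeled:
            label, confidence = model(frame_id)
            if confidence >= confidence_threshold:
                labels[frame_id] = label              # machine-annotated
            else:
                uncertain.append(frame_id)
        if not uncertain:
            break
        for frame_id in uncertain[:1000]:             # humans label a batch of the hard frames
            labels[frame_id] = human_annotate(frame_id)
        model = train(model, labels)                  # retrain on everything labeled so far
        unlabeled = [f for f in uncertain if f not in labels]
    return labels
```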
00:43:48.420 | (audience member speaking off mic)
00:43:52.180 | Yes, absolutely.
00:43:53.460 | The question was, do you ever observe
00:43:55.900 | that the classifier is highly confident
00:43:57.860 | about the incorrect class?
00:43:59.860 | (audience member speaking off mic)
00:44:03.300 | Right, question was, well then,
00:44:04.980 | how do you deal with that?
00:44:07.700 | How do you account for that?
00:44:09.700 | How do you account for the fact that
00:44:12.020 | highly confident predictions can be highly wrong?
00:44:17.020 | Yeah, false positives.
00:44:18.260 | False positives that you're really confident in.
00:44:21.980 | There's no, at least in our experience,
00:44:24.900 | there's no good answer for that,
00:44:26.100 | except more and more training data
00:44:28.620 | on the things you're not confident about.
00:44:30.540 | That usually seems to generalize over those cases.
00:44:35.140 | We don't encounter obvious large categories
00:44:38.180 | of data where you're really confident
00:44:42.740 | about the wrong thing.
00:44:45.140 | Usually some degree of human annotation fixes most problems.
00:44:49.340 | Annotating the low confidence part of the data
00:44:55.580 | solves all incorrect issues.
00:45:00.420 | But of course, that's not always true in a general case.
00:45:05.860 | You can imagine a lot of scenarios where that's not true.
00:45:08.700 | For example, one thing that we always
00:45:15.100 | perform is, for each individual person,
00:45:19.060 | we usually annotate a large amount of the data
00:45:21.540 | manually no matter what.
00:45:23.380 | So we have to make sure that the neural network
00:45:25.180 | has seen that person in the various,
00:45:27.740 | in the various ways their face looks like,
00:45:30.300 | with glasses, with different hair,
00:45:33.260 | with different lighting variation.
00:45:36.700 | So we wanna manually annotate that.
00:45:38.420 | Over time we allow the machine
00:45:40.020 | to do more and more of the work.
00:45:42.220 | So what results from this,
00:45:43.860 | in the glance classification case,
00:45:45.380 | is you can do real time classification.
00:45:46.860 | You can give the car information about
00:45:48.860 | whether the driver's looking on road or off road.
00:45:51.340 | This is critical information for the car to understand.
00:45:53.860 | And you wanna pause for a second to realize that
00:45:56.780 | when you're driving a car for those who drive,
00:45:59.100 | or for those who've driven any kind of car
00:46:00.940 | with any kind of automation,
00:46:02.660 | it has no idea about what you're up to at all.
00:46:06.900 | It has no, it doesn't have any information
00:46:09.100 | about the driver except if they're touching
00:46:11.300 | the steering wheel or not.
00:46:12.780 | More and more now with the GM Super Cruise vehicle
00:46:15.580 | and Tesla now has added a driver facing camera,
00:46:18.740 | they're slowly starting to think about
00:46:21.020 | moving towards perceiving the driver.
00:46:24.180 | But most vehicles on the road today
00:46:25.740 | have no knowledge of the driver.
00:46:27.620 | This knowledge is almost common sense
00:46:30.380 | and trivial for the car to have.
00:46:32.620 | It's common sense how important this information is,
00:46:36.420 | where the driver is looking.
00:46:38.140 | That's the glance classification problem.
00:46:40.340 | And again, emphasizing that we've converted,
00:46:44.660 | it's been three decades of work on gaze estimation.
00:46:48.420 | Gaze estimation is doing head pose estimation,
00:46:51.220 | so the geometric orientation of the head,
00:46:53.380 | combining the orientation of the eyes
00:46:55.980 | and using that combined information
00:46:57.900 | to determine where the person is looking.
00:47:00.740 | We convert that into a classification problem.
00:47:03.100 | So the standard gaze estimation definition
00:47:05.700 | is not a machine learning problem.
00:47:08.660 | Glance classification is a machine learning problem.
00:47:11.020 | This transformation is key.
00:47:12.580 | Emotion.
00:47:14.420 | Human emotion is a fascinating thing.
00:47:18.500 | So the same kind of pipeline, stabilization,
00:47:23.100 | cleaning of the data, raw pixels in,
00:47:25.820 | and then the classification is emotion.
00:47:28.180 | The problem with emotion, if I may speak as an expert,
00:47:32.860 | human, not an expert in emotions,
00:47:37.660 | just an expert at being human,
00:47:39.940 | is that there is a lot of ways to taxonomize emotion,
00:47:43.500 | to categorize emotion, to define emotion,
00:47:47.180 | whether that's the primary emotions of the Parrott scale,
00:47:51.780 | with love, joy, surprise, anger, sadness, fear.
00:47:54.980 | There's a lot of ways to mix those together,
00:47:57.380 | to break those apart into hierarchical taxonomies.
00:48:00.260 | And the way we think about it,
00:48:02.980 | in the driving context at least,
00:48:05.020 | there is a general emotion recognition task.
00:48:08.900 | Sort of, I'll mention it,
00:48:11.740 | but it's kind of how we think about primary emotions
00:48:14.700 | is detecting the broad categories of emotion,
00:48:19.380 | of joy and anger, of disgust and surprise.
00:48:22.900 | And then there is application-specific emotion recognition,
00:48:28.140 | where you're using the facial expressions
00:48:30.660 | that all the various ways that we can deform our face
00:48:33.420 | to communicate information,
00:48:35.300 | to determine a specific question
00:48:40.700 | about the interaction of the driver.
00:48:42.980 | So first for the general case,
00:48:46.300 | these are the building blocks.
00:48:47.780 | I mean there's countless ways of deforming the face
00:48:52.780 | that we use to communicate with each other.
00:48:54.900 | There's 42 individual facial muscles
00:48:57.100 | that can be used to form those expressions.
00:49:03.260 | One of our favorite tools to work with
00:49:07.260 | is the Affectiva SDK.
00:49:09.100 | Their task, the general emotion recognition task,
00:49:13.060 | is taking in raw pixels
00:49:16.980 | and determining categories of emotion,
00:49:19.260 | various subtleties of that emotion in a general case,
00:49:22.540 | producing a classification of anger,
00:49:25.020 | disgust, fear, surprise, so on.
00:49:27.540 | And then mapping,
00:49:30.300 | I mean essentially what these algorithms are doing
00:49:32.220 | whether they're using deep neural networks or not,
00:49:34.940 | whether they're using face alignment
00:49:36.340 | to do the landmark detection
00:49:37.820 | and then tracking those landmarks over time
00:49:40.020 | to do the facial actions,
00:49:42.220 | they're mapping the expressions,
00:49:45.780 | the component, the various expressions we can make
00:49:47.900 | with our eyebrows, with our nose and mouth and eyes
00:49:50.900 | to map them to the emotion.
00:49:54.740 | So I'd like to highlight one
00:49:56.500 | because I think it's an illustrative one
00:49:58.220 | for joy, an expression of joy is smiling.
00:50:01.940 | So there's an increased likelihood
00:50:05.180 | that you observe a smiling expression on the face
00:50:08.700 | when joy is experienced or vice versa.
00:50:11.420 | If there's an increased probability of a smile,
00:50:14.060 | there's an increased probability
00:50:16.300 | of emotion of joy being experienced.
00:50:18.620 | And then joy being experienced
00:50:21.380 | has a decreased likelihood
00:50:23.420 | of brow raise and brow furrow.
00:50:27.180 | So if you see a smile,
00:50:28.740 | that's a plus for joy.
00:50:32.740 | If you see brow raise or brow furrow,
00:50:34.820 | that's a minus for joy.
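(Illustrative only: the signs follow the lecture, smile as evidence for joy, brow raise and brow furrow as evidence against, but the weights are made up, not Affectiva's model.)

```python
def joy_score(expression_probs):
    """Toy evidence score for joy from facial-expression probabilities."""
    weights = {"smile": +1.0, "brow_raise": -0.5, "brow_furrow": -0.5}  # signs per the lecture, magnitudes invented
    return sum(w * expression_probs.get(name, 0.0) for name, w in weights.items())

print(joy_score({"smile": 0.8, "brow_raise": 0.1, "brow_furrow": 0.0}))  # 0.75
```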
00:50:37.940 | That's for the general emotion recognition task
00:50:40.060 | that's been well studied,
00:50:41.020 | that's sort of the core of the affective computing movement
00:50:43.340 | from the visual perspective,
00:50:44.860 | again from the computer vision perspective.
00:50:47.220 | For the application specific perspective
00:50:50.460 | which we're really focused on,
00:50:52.340 | again data is everything,
00:50:53.660 | what are you annotating?
00:50:55.740 | We can take, here we have a large scale data set
00:50:58.860 | of drivers interacting
00:51:00.220 | with a voice based navigation system.
00:51:02.100 | So they're tasked with in various vehicles
00:51:05.420 | to enter a navigation,
00:51:07.940 | so they're talking to their GPS using their voice.
00:51:10.900 | This is for, depending on the vehicle,
00:51:12.580 | depending on the system,
00:51:13.940 | in most cases an incredibly frustrating experience.
00:51:16.860 | So we have them perform this task
00:51:18.620 | and then the annotation is self-report.
00:51:21.980 | After the task they say on a scale of one to 10,
00:51:24.900 | how frustrating was this experience?
00:51:27.500 | And what you see on top is the expressions detected
00:51:32.500 | and associated with a satisfied person,
00:51:35.420 | a person who said a 10 on satisfaction,
00:51:39.300 | so a one on the frustration scale,
00:51:42.300 | perfectly satisfied with the voice-based interaction.
00:51:46.300 | On the bottom is frustrated,
00:51:49.380 | as I believe a nine on the frustration scale.
00:51:53.460 | So the strongest feature there, the strongest expression,
00:51:58.220 | remember joy, smile, was the strongest indicator
00:52:01.740 | of frustration for all our subjects.
00:52:03.820 | That was the strongest expression.
00:52:05.140 | Smile was the thing that was always there for frustration.
00:52:09.540 | There were other expressions that followed, various frowning
00:52:13.900 | and shaking of the head and so on,
00:52:15.400 | but smiles were always there.
00:52:17.340 | So that shows you the kind of clean difference
00:52:19.900 | between general emotion recognition task
00:52:22.700 | and the application specific.
00:52:24.700 | Here, perhaps they enjoyed an absurd moment of joy
00:52:29.340 | at the frustration they were experiencing.
00:52:31.740 | You can sort of get philosophical about it,
00:52:33.180 | but the practical nature is,
00:52:34.740 | they were frustrated with the experience
00:52:36.420 | and we're using the 42 muscles of the face
00:52:38.700 | that make expressions to do classification
00:52:41.940 | of frustrated or not.
00:52:43.580 | And there, the data does the work, not the algorithms.
00:52:47.540 | It's the annotation.
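A hedged sketch of what that application-specific setup could look like, assuming hypothetical per-frame expression features and a simple off-the-shelf classifier rather than the group's actual pipeline:

```python
# Hedged sketch of the application-specific setup described above: aggregate
# per-frame expression outputs into per-session features, threshold the 1-10
# self-report into a binary "frustrated" label, and fit a simple classifier.
# Feature names, threshold, and model choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def session_features(frame_expressions: np.ndarray) -> np.ndarray:
    # frame_expressions: (num_frames, num_expressions), e.g. [smile, brow_furrow, head_shake]
    # Use mean and max over the interaction as crude session-level features.
    return np.concatenate([frame_expressions.mean(axis=0),
                           frame_expressions.max(axis=0)])

# Toy data: 8 sessions, 100 frames each, 3 expression channels.
rng = np.random.default_rng(0)
sessions = [rng.random((100, 3)) for _ in range(8)]
self_report = np.array([2, 9, 8, 1, 10, 3, 7, 2])   # frustration on a 1-10 scale
labels = (self_report >= 6).astype(int)              # assumed threshold for "frustrated"

X = np.stack([session_features(s) for s in sessions])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```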
00:52:49.580 | A quick mention for the AGI class next week
00:52:53.220 | for the artificial general intelligence class.
00:52:55.620 | One of the competitions we're doing
00:52:57.740 | is we have a JavaScript face
00:53:02.740 | that's trained with a neural network
00:53:05.080 | to form various expressions
00:53:06.980 | to communicate with the observer.
00:53:12.820 | So we're interested in creating emotion,
00:53:16.540 | which is a nice mirror coupling
00:53:19.460 | of the emotion recognition problem.
00:53:21.420 | It's gonna be super cool.
00:53:24.460 | Cognitive load, we're starting to get to the eyes.
00:53:29.460 | Cognitive load is the degree to which a human being
00:53:35.820 | is accessing their memory or is lost in thought,
00:53:40.500 | how hard they're working in their mind
00:53:43.300 | to recollect something, to think about something.
00:53:46.460 | That's cognitive load.
00:53:47.740 | And to take a quick pause on the eyes
00:53:51.820 | as the window to cognitive load,
00:53:53.780 | the eyes as the window to the mind,
00:53:55.780 | there's different ways the eyes move.
00:53:57.660 | So there's pupils, the black part of the eye,
00:53:59.940 | they can expand and contract based on various factors,
00:54:04.260 | including the lighting variations in the scene,
00:54:06.780 | but they also expand and contract based on cognitive load.
00:54:10.220 | That's a strong signal.
00:54:12.820 | They can also move around.
00:54:14.140 | There are ballistic movements, saccades.
00:54:16.020 | When we look around, the eyes jump around the scene.
00:54:19.020 | They can also do something called smooth pursuit.
00:54:21.860 | When you, connecting to our animal past,
00:54:25.300 | see a delicious meal flying by or running by,
00:54:30.300 | your eyes can follow it perfectly.
00:54:33.180 | They're not jumping around.
00:54:34.740 | So when we read a book,
00:54:36.540 | our eyes are using saccadic movements
00:54:39.300 | where they jump around.
00:54:40.340 | And with smooth pursuit,
00:54:42.340 | the eye is moving perfectly smoothly.
00:54:44.140 | Those are the kinds of movements
00:54:45.180 | we have to work with.
00:54:48.060 | And cognitive load can be detected
00:54:50.780 | by looking at various factors of the eye.
00:54:54.340 | The blink dynamics, the eye movement,
00:54:56.820 | and the pupil diameter.
00:55:00.580 | The problem is, in the real world,
00:55:02.780 | with real-world data and lighting variations,
00:55:05.700 | everything goes out the window
00:55:06.860 | in terms of using pupil diameter,
00:55:08.380 | which is the standard non-contact way
00:55:10.820 | to measure cognitive load in the lab,
00:55:13.300 | when you can control lighting conditions
00:55:14.820 | and use infrared cameras.
00:55:16.780 | When you can't, all that goes out the window
00:55:19.060 | and all you have is the blink dynamics
00:55:20.740 | and the eye movement.
00:55:22.140 | So neural networks to the rescue.
00:55:24.900 | 3D convolutional neural networks in this case,
00:55:26.940 | we take a sequence of images of the eye through time
00:55:29.860 | and use 3D convolutions as opposed to 2D convolutions.
00:55:33.860 | On the left is everything we've talked about
00:55:36.620 | previous to this: 2D convolutions,
00:55:39.020 | where the convolution filter is operating
00:55:41.020 | on the X-Y 2D image.
00:55:45.060 | Every image in the sequence is operated on by the filter separately.
00:55:49.580 | 3D convolutions combine those,
00:55:51.820 | convolving across multiple images,
00:55:56.780 | across multiple channels.
00:55:58.140 | Therefore being able to learn the dynamics
00:56:03.980 | of the scene through time as well,
00:56:05.900 | not just spatially.
00:56:08.220 | Temporal.
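A minimal PyTorch sketch of that difference, with illustrative shapes: the 2D convolution only slides over height and width, while the 3D convolution also slides over time.

```python
# Minimal PyTorch sketch contrasting 2D and 3D convolutions (shapes illustrative).
import torch
import torch.nn as nn

frames = torch.randn(1, 1, 16, 64, 64)    # (batch, channels, time, height, width): 16 grayscale frames

# 2D convolution: treat the 16 frames as input channels, convolve only over H and W.
conv2d = nn.Conv2d(in_channels=16, out_channels=8, kernel_size=3, padding=1)
out2d = conv2d(frames.squeeze(1))          # (1, 16, 64, 64) -> (1, 8, 64, 64); time collapsed

# 3D convolution: the filter also spans the time axis, so temporal dynamics can be learned.
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
out3d = conv3d(frames)                     # (1, 1, 16, 64, 64) -> (1, 8, 16, 64, 64); time preserved

print(out2d.shape, out3d.shape)
```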
00:56:09.980 | And data.
00:56:11.140 | Data is everything.
00:56:12.540 | For cognitive load,
00:56:14.780 | we have in this case 92 drivers.
00:56:18.460 | So how do we sort of perform
00:56:20.060 | the cognitive load classification task?
00:56:22.380 | We have these drivers driving on the highway
00:56:24.900 | and performing what's called the N-back task.
00:56:27.260 | Zero back, one back, two back.
00:56:29.380 | And that task involves hearing numbers being read to you
00:56:33.780 | and then recalling those numbers one at a time.
00:56:37.700 | So when zero back, the system gives you a number, seven,
00:56:41.740 | and then you have to just say that number back.
00:56:44.340 | Seven.
00:56:45.460 | And it keeps repeating that, it's easy.
00:56:47.060 | It's supposed to be the easy task.
00:56:48.540 | One back is when you hear a number,
00:56:51.100 | you have to remember it.
00:56:52.580 | And for the next number,
00:56:54.860 | you have to say the number previous to that.
00:56:57.460 | So you kind of have to keep one number
00:56:59.900 | in your memory always.
00:57:01.300 | And not get distracted by the new information coming up.
00:57:05.060 | With two back, you have to do that two numbers back.
00:57:07.740 | So you have to use memory more and more with two back.
00:57:10.540 | So cognitive load is higher and higher.
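A small sketch of the N-back protocol just described, purely for illustration (the digits and trial counts are arbitrary): each trial presents a digit, and the correct response is the digit presented N trials earlier, or the current digit for zero back.

```python
# Small sketch of the N-back protocol: each trial presents a digit; the correct
# response is the digit presented N trials earlier (the current digit for 0-back).
# Purely illustrative.
import random

def n_back_trials(n: int, num_trials: int = 10, seed: int = 0):
    rng = random.Random(seed)
    stimuli = [rng.randint(0, 9) for _ in range(num_trials)]
    trials = []
    for i, digit in enumerate(stimuli):
        correct = stimuli[i - n] if i >= n else None  # no response expected yet
        trials.append((digit, correct))
    return trials

for n in (0, 1, 2):
    print(f"{n}-back:", n_back_trials(n, num_trials=6))
```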
00:57:12.700 | Okay, so what do we do?
00:57:15.020 | We use face alignment, face frontalization,
00:57:18.980 | detect the eye closest to the camera,
00:57:21.500 | and extract the eye region.
00:57:23.060 | And now we have these nice raw pixels of the eye region
00:57:27.420 | across six seconds of video.
00:57:31.100 | And we take that and put that
00:57:32.420 | into a 3D convolutional neural network
00:57:34.260 | and classify simply one of three classes.
00:57:37.860 | Zero back, one back, and two back.
00:57:39.620 | We have a ton of data of people on the highway
00:57:42.380 | performing these N-back tasks.
00:57:44.380 | And that forms the classification,
00:57:47.140 | the supervised learning training data.
00:57:49.100 | That's it.
00:57:51.220 | The input is 90 images at 15 frames a second.
00:57:54.860 | And the output is one of three classes.
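A hedged sketch of what such a 3D-convolutional classifier could look like in PyTorch; the layer sizes and eye-crop resolution are assumptions for illustration, not the exact architecture used here.

```python
# Hedged sketch of a 3D-convolutional classifier for the setup described:
# a 6-second clip of the eye region (90 frames at 15 fps) in, one of three
# N-back classes out. Layer sizes and frame resolution are assumptions.
import torch
import torch.nn as nn

class EyeCognitiveLoadNet(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                      # halve time and space
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.AdaptiveAvgPool3d(1),              # global pooling -> (B, 32, 1, 1, 1)
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):                         # x: (B, 1, 90, H, W) grayscale eye crops
        return self.classifier(self.features(x).flatten(1))

model = EyeCognitiveLoadNet()
clip = torch.randn(2, 1, 90, 48, 64)              # batch of two 90-frame eye-region clips
logits = model(clip)                              # (2, 3): 0-back, 1-back, 2-back
print(logits.shape)
```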
00:57:58.300 | Face frontalization, I should mention,
00:58:03.820 | is a technique developed for face recognition,
00:58:07.140 | because most face recognition tasks
00:58:09.060 | require frontal face orientation.
00:58:11.260 | It's also what we use here to normalize everything
00:58:14.020 | so that we can focus in on the exact blink.
00:58:16.100 | It takes whatever the orientation of the face is
00:58:21.300 | and projects it into the frontal position.
00:58:23.340 | Taking the raw pixels of the face,
00:58:28.220 | detecting the eye region,
00:58:29.900 | zooming in and grabbing the eye.
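A simplified sketch of that normalization step, assuming facial landmarks are already available from some detector and approximating frontalization with a 2D affine warp (the pipeline described above uses full 3D frontalization, so this is only an approximation):

```python
# Simplified sketch: given facial landmarks from any detector, warp the face so
# the eyes land at canonical positions, then crop the eye region. The canonical
# coordinates and crop box are assumed values for illustration.
import cv2
import numpy as np

def align_and_crop_eye(frame: np.ndarray, left_eye, right_eye, nose) -> np.ndarray:
    # Canonical landmark positions (pixels) in a 128x128 aligned face -- assumed.
    dst = np.float32([[40, 50], [88, 50], [64, 80]])
    src = np.float32([left_eye, right_eye, nose])
    M = cv2.getAffineTransform(src, dst)
    aligned = cv2.warpAffine(frame, M, (128, 128))
    # Crop a box around the canonical right-eye location; choosing the eye
    # closest to the camera happens upstream, this just grabs one of them.
    return aligned[34:66, 72:104]

frame = np.zeros((480, 640, 3), dtype=np.uint8)            # placeholder video frame
eye_crop = align_and_crop_eye(frame, (300, 200), (360, 198), (330, 240))
print(eye_crop.shape)                                       # (32, 32, 3)
```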
00:58:32.140 | (claps)
00:58:34.140 | What you find here, and this is where the intuition builds,
00:58:39.820 | is a fascinating one.
00:58:45.060 | What's being plotted here is the relative movement
00:58:47.220 | of the pupil.
00:58:49.460 | The relative movement of the eye
00:58:51.460 | based on the different cognitive loads.
00:58:54.300 | For a cognitive load of zero on the left,
00:58:57.140 | so when your mind is not that lost in thought,
00:58:59.940 | and a cognitive load of two on the right,
00:59:02.260 | when it is lost in thought, the eye moves a lot less.
00:59:05.260 | The eye is more focused on the forward roadway.
00:59:09.420 | That's an interesting finding, but it's only an aggregate.
00:59:12.140 | And that's what the neural network is tasked with doing,
00:59:14.940 | with extracting on a frame by frame basis.
00:59:18.100 | This is a standard 3D convolutional architecture.
00:59:23.980 | Again, taking in the image sequence as the input,
00:59:26.620 | cognitive load classification as the output,
00:59:29.140 | and on the right is the accuracy
00:59:33.980 | that it's able to achieve: 86%.
00:59:37.180 | That's pretty cool, from real world data.
00:59:41.300 | The idea is that you can just plop in a webcam,
00:59:43.780 | get the video going into the neural network,
00:59:48.500 | and it's predicting a continuous stream
00:59:53.500 | of cognitive load from zero to two.
00:59:58.740 | Because each of the zero back, one back, two back classes
01:00:03.380 | has a confidence associated with it,
01:00:05.300 | you can turn that into a real value between zero and two.
01:00:09.340 | And what you see here is a plot
01:00:11.380 | of three of the people on the team here,
01:00:14.420 | driving a car, performing a conversation task.
01:00:19.420 | And in white, showing the cognitive load, frame by frame.
01:00:24.460 | At 30 frames a second, estimating the cognitive load
01:00:27.220 | of each of the drivers, from zero to two on the y-axis.
01:00:31.700 | So these are high cognitive load moments,
01:00:34.020 | shown on the bottom in red and yellow
01:00:39.020 | for high and medium cognitive load.
01:00:41.140 | And when everybody's silent, the cognitive load goes down.
01:00:44.460 | So we can perform now with this simple neural network,
01:00:47.380 | with the training data that we formed,
01:00:49.300 | we can extend that to any arbitrary new data set
01:00:52.260 | and generalize.
01:00:53.260 | Okay, those are some examples of how neural networks
01:00:56.900 | can be applied.
01:00:58.140 | And why is this important?
01:00:59.500 | Again, while we focus on the sort of perception task
01:01:04.500 | of using neural networks, of using sensors
01:01:08.020 | and signal processing to determine where we are
01:01:10.900 | in the world, where the different obstacles are
01:01:12.740 | and form trajectories around those obstacles,
01:01:15.260 | we are still far away from completely solving that problem.
01:01:19.660 | I would argue 20 plus years away.
01:01:23.660 | The human will have to be involved
01:01:26.820 | and so when the system is not able to control,
01:01:29.660 | when the system is not able to perceive
01:01:31.620 | when there's some flawed aspect about the perception
01:01:34.180 | or the driving policy, the human has to be involved.
01:01:37.580 | And that's where we have to let the car know
01:01:40.820 | what the human is doing.
01:01:42.700 | That's the essential element of human robot interaction.
01:01:45.900 | The most popular car in the United States today
01:01:50.540 | is the Ford F-150, no automation.
01:01:54.740 | The thing that sort of inspires us and makes us think
01:01:58.300 | that transportation can be fundamentally transformed
01:02:02.820 | is the Google self-drive, the Waymo car
01:02:05.540 | and all of our guest speakers and all the folks
01:02:07.780 | working on autonomous vehicles.
01:02:09.860 | But if you look at it, the only ones who are,
01:02:13.100 | at a mass scale, or are beginning to, actually injecting
01:02:17.620 | automation into our daily lives are the ones in between.
01:02:22.620 | It's the L2 systems:
01:02:24.580 | the Tesla system, the Super Cruise,
01:02:26.740 | the Audi, the Volvo S90s, the vehicles that are slowly
01:02:31.740 | adding some degree of automation and teaching human beings
01:02:38.420 | how to interact with that automation.
01:02:40.700 | And here it is again.
01:02:43.380 | The path towards mass scale automation
01:02:52.820 | where the steering wheel is removed,
01:02:55.420 | where the consideration of the human is removed,
01:02:57.900 | I believe is more than two decades away.
01:03:02.420 | On the path to that, we have to understand
01:03:05.820 | and create successful human robot interaction,
01:03:08.660 | and approach autonomous vehicles, autonomous systems
01:03:13.180 | in a human-centered way.
01:03:15.340 | The mass scale integration of these systems,
01:03:18.180 | of the human-centered systems, like the Tesla vehicles,
01:03:21.380 | Tesla is just a small company right now.
01:03:23.500 | These kinds of L2 technologies have not truly penetrated
01:03:27.300 | the market, have not penetrated our vehicles,
01:03:30.380 | even the new vehicles being released today.
01:03:32.900 | I believe that happens in the early 2020s.
01:03:35.500 | And that's going to form the core of our algorithms
01:03:40.500 | that will eventually lead to the full autonomy.
01:03:43.940 | All of that data, what I mentioned with Tesla
01:03:46.100 | with the 32% of miles being driven,
01:03:49.020 | all of that is training data for the algorithms.
01:03:51.300 | The edge cases arise there.
01:03:53.020 | That's where we get all this data.
01:03:54.980 | Our data set at MIT is 400,000 miles.
01:03:59.060 | Tesla has a billion miles.
01:04:01.860 | So that's all training data on the way,
01:04:04.780 | on the stairway to mass scale automation.
01:04:07.940 | Why is this important, beautiful, and fundamental
01:04:14.240 | to the role of AI in society?
01:04:16.020 | I believe that self-driving cars,
01:04:17.560 | when they're approached in this way, focused
01:04:20.580 | on the human-robot interaction, are personal robots.
01:04:24.340 | They're not perception control systems,
01:04:27.060 | tools like a Roomba performing a particular task.
01:04:32.060 | When human life is at stake,
01:04:33.780 | when there's a fundamental transfer
01:04:36.300 | of life, of a human being giving their life
01:04:39.800 | over to an AI system directly, one-on-one,
01:04:42.700 | that transfer is the kind of relationship
01:04:47.700 | that is indicative of a personal robot.
01:04:52.700 | This requires all the things:
01:04:55.460 | understanding, communication, trust.
01:04:58.800 | It's fascinating to understand
01:05:02.300 | how a human and a robot can form enough trust
01:05:05.260 | to create really an almost
01:05:08.860 | one-to-one understanding of each other's mental state,
01:05:13.900 | and learn from each other.
01:05:16.160 | Oh boy.
01:05:17.000 | So, one of my favorite movies, Good Will Hunting,
01:05:23.780 | we're in Boston, Cambridge, so I have to.
01:05:25.940 | I'm gonna regret this one.
01:05:28.600 | This is Robin Williams speaking about human imperfections.
01:05:34.860 | So I'd like you to take this quote
01:05:38.080 | and replace every time he mentions girl with car.
01:05:43.760 | People call those things imperfections.
01:05:46.940 | Robin Williams is talking about his wife
01:05:48.700 | who passed away in the movie.
01:05:50.140 | Talking about her imperfections.
01:05:53.940 | They call these things imperfections, but they're not.
01:05:56.380 | That's the good stuff.
01:05:57.700 | And then we get to choose who we let
01:06:00.260 | into our weird little worlds.
01:06:03.340 | You're not perfect, sport.
01:06:05.120 | And let me save you the suspense.
01:06:07.140 | This girl you met, she isn't perfect either.
01:06:09.060 | You know what, let me just...
01:06:10.500 | (man speaking faintly)
01:06:13.740 | Well, it's the little idiosyncrasies that only I know about.
01:06:23.400 | That's what made her my wife.
01:06:24.800 | But she had the goods on me too,
01:06:27.700 | she knew all my little peccadillos.
01:06:30.040 | People call these things imperfections,
01:06:32.640 | but they're not.
01:06:33.480 | Oh, that's the good stuff.
01:06:36.320 | And then we get to choose who we let
01:06:37.760 | into our weird little worlds.
01:06:41.080 | You're not perfect, sport.
01:06:43.080 | Let me save you the suspense.
01:06:46.120 | This girl you met, she isn't perfect either.
01:06:49.200 | But the question is
01:06:50.120 | whether or not you're perfect for each other.
01:06:52.000 | That's the whole deal.
01:06:54.320 | That's what everything is all about.
01:06:57.720 | Now you could know everything in the world, sport,
01:06:59.400 | but the only way you're finding that out
01:07:00.680 | is by giving it a shot.
01:07:08.400 | So the approach we're taking
01:07:11.240 | in building the autonomous vehicle
01:07:14.000 | here at MIT in our group
01:07:17.120 | is the human-centered approach to autonomous vehicles
01:07:20.160 | that we're going to release in March of 2018
01:07:23.040 | in the streets of Boston.
01:07:24.360 | Those who would like to help, please do.
01:07:32.600 | I will run a course
01:07:37.120 | on deep learning for understanding humans at CHI 2018.
01:07:40.480 | We'll be going through tutorials
01:07:41.880 | that go far beyond the visual,
01:07:45.000 | the convolutional neural network based detection
01:07:47.680 | of various aspects of the face and body.
01:07:51.620 | We'll look at natural language processing,
01:07:55.000 | voice recognition, and GANs.
01:07:59.320 | If you're going to CHI, please join.
01:08:01.380 | Next week, we have an incredible course
01:08:07.000 | that aims to understand,
01:08:09.360 | to begin to explore the nature of intelligence,
01:08:15.280 | natural and artificial.
01:08:18.160 | We have Josh Tenenbaum, Ray Kurzweil,
01:08:22.240 | Lisa Barrett, Nate Derbinsky
01:08:26.200 | looking at cognitive modeling architectures,
01:08:28.200 | Andrej Karpathy, Stephen Wolfram,
01:08:31.400 | Richard Moyes talking about autonomous weapon systems
01:08:35.400 | and AI safety, Marc Raibert from Boston Dynamics
01:08:40.400 | and the amazing, incredible robots they have,
01:08:43.420 | and Ilya Sutskever from OpenAI, and myself.
01:08:49.320 | So what next?
01:08:54.080 | For folks registered for this course,
01:08:56.360 | you have to submit by tonight
01:08:58.560 | a deep traffic entry that achieves a speed
01:09:03.760 | of 65 miles an hour,
01:09:06.960 | and I hope you continue to submit entries
01:09:09.680 | that win the competition.
01:09:11.560 | The high performer award will be given to folks,
01:09:15.180 | the very few folks who achieve 70 miles an hour or faster.
01:09:20.180 | We will continue rolling out SegFuse,
01:09:24.160 | having hit a few snags and invested a few thousand dollars
01:09:30.160 | in the annotation process,
01:09:33.360 | annotating a large-scale data set for you guys.
01:09:37.840 | We'll continue this competition that will take us
01:09:40.240 | into a submission towards NIPS
01:09:44.200 | where we would hope to submit the results
01:09:46.000 | for this competition,
01:09:47.160 | and DeepCrash, the deep reinforcement learning competition.
01:09:49.780 | These competitions will continue through May 2018.
01:09:53.040 | I hope you stay tuned and participate.
01:09:55.000 | There's upcoming classes.
01:09:57.920 | The AGI class I encourage you to come to
01:10:01.360 | is going to be fascinating,
01:10:03.200 | and there's so many cool, interesting ideas
01:10:06.400 | that we're going to explore.
01:10:07.600 | It's gonna be awesome.
01:10:09.320 | There's an introduction to deep learning course
01:10:11.240 | that I'm also a part of
01:10:12.640 | where it gets a little bit more applied
01:10:15.120 | and shows folks who are interested
01:10:16.440 | in the very basic algorithms of deep learning
01:10:20.560 | how to get started with those hands-on.
01:10:24.240 | And there's an awesome class that I ran last year,
01:10:26.820 | for those who took this class last year
01:10:28.920 | we also talked about it,
01:10:30.760 | on the global business of AI and robotics.
01:10:34.340 | The slides are online.
01:10:35.640 | I encourage you to click a link on there and register.
01:10:37.960 | It's in the spring.
01:10:39.300 | It's once a week,
01:10:40.760 | and it truly brings together a lot of cross disciplinary
01:10:44.560 | folks to talk about ideas of artificial intelligence
01:10:47.520 | and the role of AI and robotics in society.
01:10:49.840 | It's an awesome class.
01:10:51.800 | And if you're interested in applying deep learning methods
01:10:56.920 | in the automotive space, come work with us.
01:11:00.240 | We have a lot of fascinating problems to solve
01:11:03.360 | or collaborate on.
01:11:04.480 | So with that, I'd like to thank everybody here,
01:11:09.520 | everybody across the community that's been contributing.
01:11:13.800 | We have thousands of submissions coming in for deep traffic,
01:11:17.120 | and I'm just truly humbled by the support
01:11:20.040 | we've been getting,
01:11:21.000 | and the team behind this class is incredible.
01:11:22.880 | Thank you to Nvidia, Google, Amazon,
01:11:25.480 | Alexa, Autoliv, and Toyota.
01:11:27.560 | And today we have shirts,
01:11:31.220 | extra large, extra extra large, and medium over there,
01:11:35.960 | small and large over there.
01:11:37.800 | The big and small people over here,
01:11:40.120 | and then the medium sized people over here.
01:11:42.220 | So just grab it, grab one, and enjoy.
01:11:46.680 | Thank you very much.
01:11:47.680 | (audience applauding)