MIT 6.S094: Deep Learning for Human Sensing
Chapters
0:00 Intro
6:53 Human Imperfections
22:57 Pedestrian Detection
28:57 Body Pose Estimation
35:40 Glance Classification
47:13 Emotion Recognition
53:24 Cognitive Load Estimation
1:00:54 Human-Centered Vision for Autonomous Vehicles
to understanding, to sensing the human being. 00:00:12.520 |
Of course, we humans express ourselves visually, 00:00:16.480 |
but also through audio, voice and through text. 00:00:24.440 |
we're just going to focus on computer vision. 00:00:45.760 |
for successfully applying deep learning methods 00:01:03.040 |
to create systems that operate in the real world. 00:01:06.760 |
And in order for them to operate in the real world, 00:01:10.680 |
They sound simple, some are much harder than they sound. 00:01:30.240 |
these supervised learning methods can be trained. 00:01:33.560 |
I'll say this over and over throughout the day today, 00:01:40.200 |
is the hardest part and the most important part. 00:01:48.680 |
all the different ways we capture human beings 00:02:47.600 |
the interesting cases that we're interested in. 00:03:11.760 |
the kinds of things that happen in this world. 00:03:43.840 |
Annotation tools that are used for glance classification, 00:04:06.000 |
There needs to be tooling for each one of those elements, 00:04:27.240 |
such that we can train neural networks on them. 00:04:49.040 |
You have to do large-scale distributed compute, 00:05:13.800 |
Of course, that's really exciting and important, 00:05:20.720 |
is that as long as these algorithms learn from data, 00:05:32.960 |
meaning they learn to calibrate, self-calibrate. 00:05:51.880 |
one of the key things that comes up time and time again, 00:05:57.280 |
is a lot of the algorithms developed in deep learning 00:06:03.760 |
Now, the real world happens in both space and time, 00:06:21.520 |
are able to capture the physics of the scene. 00:06:31.120 |
unfortunately, it's that the painful, boring stuff 00:06:53.360 |
Okay, so today I'd like to talk about 00:07:28.440 |
as some of our guest speakers have spoken about, 00:07:30.320 |
and Sterling Anderson will speak about tomorrow, 00:07:37.160 |
of operating and cooperating with AI systems. 00:07:44.800 |
to try to motivate why we need to continuously 00:08:08.880 |
We're actually really good at a lot of things. 00:08:19.040 |
how distracted we are, how irrational we are. 00:08:21.880 |
But we're actually really damn good at driving. 00:08:24.840 |
Here's a video of a state-of-the-art soccer player, 00:08:28.720 |
Messi, the best soccer player in the world, obviously. 00:08:39.240 |
but I assure you, the American Ninja Warrior, 00:08:49.920 |
DARPA humanoid robotics systems shown on the right. 00:08:57.800 |
continuing on the line of thought 00:09:02.520 |
to challenge us here, that humans are amazing, 00:09:45.480 |
folks who live in Massachusetts are the least likely 00:10:12.840 |
the people getting from A to B on a mass scale, 00:10:30.840 |
the eating, the secondary tasks of talking to other passengers, 00:10:40.480 |
and manually adjusting the radio. 00:10:48.480 |
And 400,000 were injured in motor vehicle crashes 00:11:20.640 |
is the average time your eyes are off the road while texting. 00:11:24.200 |
If you're traveling 55 miles an hour in that five seconds, 00:11:28.840 |
that's enough time to cover the length of a football field. 00:11:35.440 |
In five seconds, the average time of texting, 00:11:40.120 |
And so many things can happen in that moment of time. 00:11:49.560 |
31% of traffic fatalities involve a drunk driver. 00:11:58.080 |
for a legal prescription or over-the-counter medication. 00:12:01.000 |
Distracted driving, as I said, is a huge safety risk. 00:12:07.760 |
Nearly 3% of all traffic fatalities involve a drowsy driver. 00:12:20.800 |
These are videos collected by AAA of teenagers, 00:12:24.600 |
a very large-scale naturalistic driving data set, 00:12:27.160 |
and it's capturing clips of teenagers being distracted 00:13:21.960 |
Once you take it in, the problem we're up against. 00:13:40.520 |
So in that context of human imperfections, we have to ask ourselves about the human-centered 00:13:47.440 |
approach to autonomy in systems, autonomous vehicles that are using artificial intelligence 00:13:53.040 |
to aid the driving task. Which way do we want to go, as I mentioned a couple of lectures ago? 00:14:01.720 |
The tempting path is towards full autonomy, where we remove this imperfect, flawed human 00:14:07.520 |
from the picture altogether and focus on the robotics problem of perception and control 00:14:17.080 |
Or do we work together, human and machine, to improve the safety, to alleviate distraction, 00:14:24.120 |
to bring driver attention back to the road and use artificial intelligence to increase 00:14:28.360 |
safety through collaboration, human-robot interaction versus removing the human completely 00:14:38.040 |
As I've mentioned, as Sterling will certainly talk about tomorrow and rightfully so, and 00:14:47.040 |
yesterday, or on Tuesday, as Emilio talked about, the L4 way is grounded in the literature, 00:15:00.960 |
You can count on the fact that the natural flaws of human beings, to overtrust, 00:15:08.400 |
to misbehave, to be irrational about their risk estimates, will result in improper use 00:15:16.800 |
And that leads to what I've shown before, the public perception of what drivers do in 00:15:24.240 |
The moment the system works well, they begin to overtrust. 00:15:27.840 |
They begin to do stuff they're not supposed to be doing in the car, taking it for granted. 00:15:33.760 |
A recent video that somebody posted, this is a common sort of more practical concern 00:15:39.960 |
that people have is, well, the traditional ways to ensure the physical engagement of 00:15:47.680 |
the driver is by saying they should touch the wheel, the steering wheel every once in 00:15:52.600 |
And of course, there's ways to bypass the need to touch the steering wheel. 00:15:57.080 |
Some people hang objects like a can off of the steering wheel. 00:16:01.620 |
In this case, brilliantly, I have to say, they shove an orange into the wheel to make 00:16:11.260 |
the touch sensor fire and therefore be able to take their hands off the wheel on Autopilot. 00:16:17.840 |
And that kind of idea makes us believe that humans will always find a way 00:16:25.780 |
However, I believe that that's not giving the technology enough credit. 00:16:33.100 |
Artificial intelligence systems, if they're able to perceive the human being, are also 00:16:39.940 |
And that's what I'd like to talk about today. 00:16:52.260 |
Data is everything in these real-world systems. 00:16:55.300 |
With the MIT naturalistic driving data set of 25 vehicles, of which 21 are equipped with 00:17:07.740 |
We'll see the cameras on the face, capturing high-definition video of the face. 00:17:12.380 |
That's where we get the glance classification, the emotion recognition, cognitive load, everything 00:17:17.980 |
Then we have another camera, a fisheye, that's looking at the body of the driver. 00:17:22.780 |
And from that comes the body pose estimation, hands-on wheel, activity recognition. 00:17:28.980 |
And then one video looking out for the full scene segmentation for all the scene perception 00:17:34.260 |
And everything is being recorded, synchronized together with GPS, with audio, with all the 00:17:52.340 |
We have thousands like it, traveling hundreds of miles, sometimes hundreds of miles under 00:18:07.540 |
And from this data, we can both gain understanding of what people do, which is really important 00:18:12.980 |
to understand how autonomy, successful autonomy can be deployed in the real world, and to 00:18:19.620 |
design algorithms for training the deep neural networks 00:18:19.620 |
in order to perform the perception task better. 00:18:30.860 |
25 vehicles, 21 Teslas, Model S, Model X, and now Model 3. 00:18:44.180 |
Every single day we have thousands of miles in the Boston, Massachusetts area driving 00:19:26.700 |
A car that never moves is a perfectly safe system. 00:19:33.220 |
But it doesn't provide a service that's valuable. 00:19:36.340 |
It doesn't provide an enjoyable driving experience. 00:19:44.260 |
The reality is with these Tesla vehicles and L2 systems doing automated driving, 00:19:49.220 |
people are driving 33% of miles using Tesla Autopilot. 00:19:55.620 |
That means that people are getting value from it. 00:19:58.340 |
A large fraction of their driving is done in an automated way. 00:20:07.100 |
The glance classification algorithm we'll talk about today is used as one example that 00:20:16.540 |
Shown with the bar graphs there in the red and the blue. 00:20:19.580 |
Red is your manual driving, blue is your autopilot driving. 00:20:22.820 |
And we look at glance classification, regions of where drivers are looking, 00:20:28.060 |
and whether that distribution changes between automated driving and manual driving. 00:20:33.820 |
And with these glance classification methods, 00:20:36.180 |
we can determine that there's not much difference. 00:20:38.780 |
At least until you dig into the details, which we haven't done. 00:20:42.620 |
In the aggregate, there's not a significant difference. 00:20:45.620 |
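As a rough illustration of that comparison (not the lecture's actual analysis; the labels below are made up), glance-region distributions for manual versus Autopilot epochs can be tallied like this:
```python
# Hypothetical sketch: compare glance-region distributions between manual and
# Autopilot epochs. Labels are illustrative, not the actual MIT-AVT data.
from collections import Counter

REGIONS = ["road", "left", "right", "rearview", "center_stack", "instrument_cluster"]

def glance_distribution(frame_labels):
    """Fraction of frames spent in each glance region."""
    counts = Counter(frame_labels)
    total = max(len(frame_labels), 1)
    return {r: counts.get(r, 0) / total for r in REGIONS}

# Toy per-frame glance labels for two driving modes.
manual_frames = ["road"] * 900 + ["center_stack"] * 50 + ["left"] * 50
autopilot_frames = ["road"] * 880 + ["center_stack"] * 70 + ["left"] * 50

for mode, frames in [("manual", manual_frames), ("autopilot", autopilot_frames)]:
    dist = glance_distribution(frames)
    print(mode, {r: round(p, 3) for r, p in dist.items()})
```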
That means people are getting value and enjoying using these technologies. 00:20:52.620 |
But yet they're staying attentive, or at least keeping their eyes on the road, 00:21:01.620 |
When your eyes are on the road, you might not be attentive. 00:21:04.820 |
But at the very least, your body's positioned in such a way 00:21:11.340 |
that you're physically in position to be alert and to take in the forward roadway. 00:21:17.700 |
So they're using it and they don't overtrust it. 00:21:24.100 |
And that's I think the sweet spot that human robot interaction needs to achieve. 00:21:29.900 |
Is the human gaining through experience, through exploration, through trial and error, 00:21:37.660 |
exploring and understanding the limitations of the system, 00:21:45.580 |
And using the computer vision methods I'll talk about, 00:21:49.340 |
we can continue to explore how that can be achieved in other systems. 00:21:53.700 |
When the fraction of automated driving increases, 00:22:02.740 |
It's all about the data and I'll harp on this again. 00:22:12.060 |
I will mention of course, it's the same convolutional neural networks. 00:22:16.860 |
It's the same networks that take in raw pixels and extract features of interest. 00:22:23.780 |
It's 3D convolutional neural networks that take in sequences of images 00:22:28.340 |
and extract the temporal dynamics along with the visual characteristics of the individual images. 00:22:33.180 |
It's RNNs, LSTMs, that use convolutional neural networks to extract features 00:22:38.900 |
and over time look at the dynamics of the images. 00:22:42.740 |
These are pretty basic architectures, the same kind of deep neural network architectures. 00:22:49.060 |
But they rely fundamentally and deeply on the data, on real-world data. 00:22:54.860 |
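As a rough sketch of the kind of architecture being described (layer sizes and input shapes are assumptions, not the course's exact models), here is a 2D CNN extracting per-frame features, with an LSTM modeling their temporal dynamics:
```python
# Minimal sketch: a 2D CNN extracts per-frame features and an LSTM models
# their dynamics over time. Shapes and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class FrameCNN(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, x):                      # x: (batch, 3, H, W)
        return self.fc(self.conv(x).flatten(1))

class CNNLSTM(nn.Module):
    def __init__(self, num_classes=6, feat_dim=128):
        super().__init__()
        self.cnn = FrameCNN(feat_dim)
        self.lstm = nn.LSTM(feat_dim, 64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, clip):                   # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])           # classify from the last timestep

logits = CNNLSTM()(torch.randn(2, 8, 3, 64, 64))   # -> (2, 6)
```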
So let's start where perhaps on the human sensing side it all began, 00:23:08.740 |
To put it in context, pedestrian detection is shown here on a spectrum from left to right. 00:23:12.620 |
On the left, in green, are the easier human sensing tasks. 00:23:18.220 |
Tasks of sensing some aspect of the human being. 00:23:20.980 |
Pedestrian detection, which is detecting the full body of a human being in an image or video, 00:23:31.380 |
And on the right, in red, microsaccades. 00:23:35.220 |
These are tremors of the eye; measuring the pupil diameter, 00:23:39.060 |
or measuring cognitive load from the fine blink dynamics of the eye, 00:23:43.940 |
the velocity of the blink, micro-glances and eye pose, are much harder problems. 00:23:50.740 |
So you think body pose estimation, pedestrian detection, 00:23:53.900 |
face detection, classification, recognition, head pose estimation, 00:23:59.380 |
Anything that starts getting smaller, looking at the eye 00:24:02.860 |
and everything that starts getting fine-grained, 00:24:07.980 |
So we start at the easiest, pedestrian detection. 00:24:10.820 |
And there are the usual challenges of all of computer vision we've talked about, 00:24:20.700 |
the different possible articulations of our bodies, 00:24:31.500 |
The presence of occlusion, from the accessories that we wear, 00:24:34.980 |
to self-occlusion and occluding each other. 00:24:38.260 |
But crowded scenes have a lot of humans in them and they occlude each other 00:24:45.900 |
to figure out each individual pedestrian is a very challenging problem. 00:24:52.220 |
Well, there is a need to extract features from raw pixels. 00:25:10.780 |
Because the pedestrians can be small in an image or big, 00:25:15.420 |
So you use a sliding window to detect where that pedestrian is. 00:25:19.140 |
You have a classifier that's given a single image, 00:25:23.500 |
You take that classifier, you slide it across the image 00:25:26.780 |
to find where all the pedestrians in the scene are. 00:25:32.380 |
or you can use convolutional neural networks for that classifier. 00:25:38.180 |
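Here is a toy sketch of the sliding-window idea; the scoring function is a placeholder standing in for a trained pedestrian/no-pedestrian classifier, and a real detector would also repeat this over an image pyramid to handle pedestrians of different sizes:
```python
# Illustrative sliding-window detection; the scoring function is a stand-in
# for a trained classifier (e.g., a CNN over a 128x64 window).
import numpy as np

def score_window(patch):
    # Hypothetical classifier: returns P(pedestrian) for one window.
    return float(patch.mean() > 0.5)

def sliding_window_detect(image, win=(128, 64), stride=32, threshold=0.5):
    """Return (row, col, score) for every window the classifier fires on."""
    detections = []
    h, w = image.shape[:2]
    for r in range(0, h - win[0] + 1, stride):
        for c in range(0, w - win[1] + 1, stride):
            s = score_window(image[r:r + win[0], c:c + win[1]])
            if s >= threshold:
                detections.append((r, c, s))
    return detections

image = np.random.rand(480, 640)          # toy grayscale frame
print(len(sliding_window_detect(image)))  # number of candidate windows fired
```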
Then came along R-CNN, Fast R-CNN, Faster R-CNN. 00:25:44.860 |
as opposed to doing a complete sliding window approach, 00:25:47.620 |
are much more intelligent, clever about generating the candidates to consider. 00:25:53.100 |
So as opposed to considering every possible position of a window, 00:25:57.700 |
they generate a small subset of candidates that are more likely. 00:26:04.100 |
And finally, using a CNN classifier for those candidates, 00:26:08.740 |
whether there's an object of interest or not, a face or not. 00:26:17.380 |
to figure out what is the most likely bounding box 00:26:27.420 |
really the state-of-the-art localization network, 00:26:33.460 |
on top of the bounding box also performs segmentation. 00:26:36.340 |
There's VoxelNet, which does detection on three-dimensional LiDAR data, 00:26:45.820 |
But it's all kind of grounded in the R-CNN framework. 00:27:03.780 |
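For reference, a minimal sketch of running an off-the-shelf detector from the R-CNN family; this assumes a recent torchvision (the `weights="DEFAULT"` argument) and is not the model used in the lecture. The region proposals, per-candidate classification, and non-maximum suppression all happen inside the call:
```python
# Off-the-shelf Faster R-CNN from torchvision; COCO class 1 is "person".
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)            # stand-in for a camera frame, values in [0, 1]
with torch.no_grad():
    pred = model([image])[0]               # dict with "boxes", "labels", "scores"

keep = (pred["labels"] == 1) & (pred["scores"] > 0.8)   # confident "person" boxes
pedestrian_boxes = pred["boxes"][keep]      # (N, 4) boxes in (x1, y1, x2, y2) format
print(pedestrian_boxes.shape)
```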
So for example, here's one of the intersections, 00:27:09.060 |
instrumenting it with various sensors I'll mention, 00:27:48.300 |
This is from the LiDAR data of the same intersection. 00:28:04.100 |
and the pedestrians making crossing decisions. 00:28:07.340 |
This is understanding the negotiation, 00:28:11.540 |
the nonverbal negotiation that pedestrians perform 00:28:37.100 |
is you do bounding box detection of the pedestrians, 00:29:02.580 |
Body pose estimation is also finding the joints, 00:29:19.420 |
So why is that important in driving, for example? 00:29:22.620 |
It's important to determine the vertical position 00:29:27.180 |
the seat belts and the sort of the airbag testing 00:29:33.900 |
with the dummy considering the frontal position 00:29:39.500 |
With greater and greater degrees of automation 00:29:47.180 |
from the standard quote-unquote dummy position. 00:29:49.700 |
And so body pose, or at least upper body pose estimation 00:29:53.420 |
allows you to determine how often these drivers 00:30:07.620 |
activity and help add context to glance estimation 00:30:13.660 |
So some of the more traditional methods were sequential, 00:30:30.460 |
which has been a very powerful, successful way 00:30:42.260 |
of detecting body parts from the entire image. 00:30:47.260 |
It's not sequentially stitching bodies together, 00:30:50.100 |
it's detecting the left elbow, the right elbow, 00:30:57.140 |
and then stitching everything together afterwards. 00:31:01.980 |
Allowing you to deal with the crazy deformations 00:31:05.780 |
of the body that happen, the occlusions and so on 00:31:09.100 |
because you don't need all the joints to be visible. 00:31:18.180 |
meaning these are convolutional neural networks 00:31:26.980 |
Input is an image, output is an estimate of a joint, 00:31:40.180 |
every estimation zooms in on that particular area 00:31:43.620 |
and performs a finer and finer grain estimation 00:31:58.060 |
in scenes that contain multiple people. 00:32:03.900 |
the hands, the elbows shown in the various images 00:32:06.820 |
on the right, that don't have an understanding 00:32:09.860 |
of who the head, the elbows, the hands belong to. 00:32:15.780 |
without trying to do individual person detection first. 00:32:25.700 |
but the next step is connecting them with part affinity fields 00:32:43.140 |
So you kind of stitch the different people together 00:32:45.180 |
in the scene after the detection is performed with the CNN. 00:32:48.500 |
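A heavily simplified sketch of the first stage of this kind of approach, decoding per-joint confidence maps into 2D joint locations; the part-affinity-field grouping that assembles joints into separate people is omitted, and the joint list and map sizes are assumptions:
```python
# Decode per-joint confidence maps into (x, y) keypoints via an argmax.
# The grouping of joints into individual people is not shown here.
import numpy as np

JOINTS = ["head", "neck", "l_shoulder", "r_shoulder", "l_elbow", "r_elbow"]

def decode_heatmaps(heatmaps, min_conf=0.3):
    """heatmaps: (num_joints, H, W) confidence maps -> {joint: (x, y)} dict."""
    keypoints = {}
    for name, hmap in zip(JOINTS, heatmaps):
        y, x = np.unravel_index(np.argmax(hmap), hmap.shape)
        if hmap[y, x] >= min_conf:          # skip occluded / low-confidence joints
            keypoints[name] = (int(x), int(y))
    return keypoints

toy_maps = np.random.rand(len(JOINTS), 46, 46)   # stand-in network output
print(decode_heatmaps(toy_maps))
```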
We use this approach for detecting the upper body, 00:33:02.180 |
That is used to determine the position of the driver 00:33:10.740 |
For example, looking at Autopilot driving over 00:33:14.020 |
30-minute periods, we can plot on the x-axis the time 00:33:18.180 |
and on the y-axis the position of the neck point 00:33:32.660 |
This is the slouching, the sinking into the seat. 00:33:39.420 |
and allowing us or the designers of safety systems 00:33:42.580 |
to know that information is really important. 00:33:54.580 |
as opposed to just plain pedestrian detection 00:34:31.940 |
from which the camera is observing the scene. 00:34:42.420 |
zero when the pedestrian is not looking at the car, 00:34:44.740 |
one when the pedestrian is looking at the car. 00:34:47.900 |
So we can look at thousands of episodes like this, 00:34:50.380 |
crossing decisions, nonverbal communication decisions 00:34:56.100 |
the dynamics of this nonverbal communication. 00:35:15.020 |
Interestingly, most people look away before they cross. 00:35:25.380 |
this is an example, we have thousands of these. 00:35:53.300 |
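As a toy example of what can be computed from such an episode (the attention signal below is made up), the per-frame 0/1 "looking at the car" label can be reduced to, say, the time between the last glance and the crossing decision:
```python
# Illustrative analysis of one crossing episode with a per-frame 0/1 signal.
FPS = 30

def seconds_since_last_look(attention, crossing_frame):
    """attention: list of 0/1 per frame; returns seconds between the last
    glance at the car and the crossing decision (None if they never looked)."""
    looks = [i for i, a in enumerate(attention[:crossing_frame + 1]) if a == 1]
    if not looks:
        return None
    return (crossing_frame - looks[-1]) / FPS

attention = [0] * 60 + [1] * 45 + [0] * 30      # looked for 1.5 s, then away
print(seconds_since_last_look(attention, crossing_frame=134))  # -> 1.0 s
```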
and has the most impact in the driving context 00:35:58.220 |
is for the car to know where the driver is looking. 00:36:02.780 |
And at the very crude region level information 00:36:10.980 |
That's what we mean by glance classification. 00:36:13.220 |
It's not the standard gaze estimation problem 00:36:28.780 |
On road, off road, left, right, center stack, 00:36:43.540 |
It allows you to address it as a machine learning problem. 00:36:50.700 |
Every problem we try to solve in human sensing, 00:36:53.660 |
in driver sensing, has to be learnable from data. 00:36:58.380 |
Otherwise it's not amenable to application in the real world. 00:37:06.900 |
that are deployed without learning if they involve a human. 00:37:39.500 |
the orientation of the head and the orientation of the eyes, 00:37:49.120 |
If we convert this into a gaze classification problem 00:38:00.260 |
determining in post, so humans are annotating this video, 00:38:04.460 |
which region the driver is looking at. 00:38:08.640 |
That we're able to do by converting the problem 00:38:22.020 |
It can annotate regions of where they're looking 00:38:25.500 |
and using that kind of classification approach 00:38:29.340 |
determine are they looking at the cars or not. 00:38:37.640 |
again, it's a subtle point, but think about it. 00:38:40.020 |
If you wanted to estimate exactly where they're looking, 00:39:00.260 |
in order to be able to train neural networks on this. 00:39:14.780 |
There is some degree of calibration that's required. 00:39:17.160 |
You have to determine approximately where the sensor is 00:39:31.640 |
where the camera frame is relative to the world frame. 00:39:36.100 |
The video stabilization and the face frontalization, 00:39:39.700 |
all the basic processing that removes the vibration 00:39:42.140 |
and the noise, that removes the physical movement of the head, 00:39:56.000 |
there is nothing left except taking in the raw video 00:40:01.660 |
of the face for the glance classification tasks 00:40:07.540 |
Raw pixels, that's the input to these networks. 00:40:10.020 |
And the output is whatever the training data is. 00:40:23.080 |
and the output is whatever you have data for. 00:40:30.020 |
Gaze estimation, which is a traditional geometric approach to this problem, 00:40:34.940 |
is designing algorithms that are able to detect 00:40:38.780 |
accurately the individual landmarks in the face 00:40:41.460 |
and from that estimate the geometry of the head pose. 00:40:57.860 |
But once we have that, we pass in just the raw pixels 00:41:03.100 |
As opposed to doing the estimation, it's classification. 00:41:07.580 |
Allowing you to perform what's shown there on the bottom 00:41:16.340 |
Road, left, right, center stack, instrument cluster, 00:41:25.340 |
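Schematically, and with an assumed architecture and input size rather than the course's actual network, the glance classifier is just an image classifier over those regions:
```python
# Glance classification as plain image classification: raw face-region pixels
# in, one of six glance regions out. Architecture is an illustrative assumption.
import torch
import torch.nn as nn

GLANCE_CLASSES = ["road", "left", "right", "rearview", "center_stack", "instrument_cluster"]

glance_net = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, len(GLANCE_CLASSES)),
)

face_crop = torch.randn(1, 3, 96, 96)            # stand-in face image
probs = glance_net(face_crop).softmax(dim=1)     # confidence per region
print(GLANCE_CLASSES[int(probs.argmax())], float(probs.max()))
```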
And as I mentioned, annotation tooling is key. 00:41:28.900 |
So we have a total of five billion video frames, 00:41:35.220 |
That would take tens of millions of dollars to annotate 00:41:48.580 |
in order to train the neural networks to perform this task. 00:41:59.900 |
the partial occlusions from the light or self-occlusion, 00:42:03.860 |
and the moving out of frame, the out of frame occlusions. 00:42:15.820 |
Whenever the classification has a low confidence, 00:42:21.340 |
We rely on the human only when the classifier 00:42:25.740 |
And the fundamental trade-off in all of these systems 00:42:30.860 |
is what is the accuracy we're willing to put up with. 00:42:35.540 |
Here in red and blue, in red is human choice decision, 00:42:42.380 |
In red, we select the video we want to classify. 00:42:52.660 |
the face detection task, localizing the camera, 00:43:05.380 |
So certainly a neural network can annotate glance 00:43:08.140 |
for the entire data set, but it would achieve accuracy 00:43:13.180 |
of low-90% classification on the six-class task. 00:43:44.060 |
And it repeats this process over and over on the frames 00:44:12.020 |
highly confident predictions can be highly wrong? 00:44:18.260 |
False positives that you're really confident in. 00:44:30.540 |
That usually seems to generalize over those cases. 00:44:30.540 |
Usually some degree of human annotation fixes most problems. 00:44:49.340 |
Annotating the low confidence part of the data 00:45:00.420 |
But of course, that's not always true in a general case. 00:45:05.860 |
You can imagine a lot of scenarios where that's not true. 00:45:19.060 |
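The routing logic itself is simple; here is a sketch with an illustrative threshold (the actual threshold is a design choice tied to the accuracy you are willing to accept):
```python
# Confidence-based annotation split: the network labels frames it is confident
# about; only low-confidence frames are sent to a human annotator.
CONFIDENCE_THRESHOLD = 0.95   # illustrative value

def route_frame(predicted_label, confidence):
    """Return (label_source, label) for one video frame."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return "machine", predicted_label       # trust the network
    return "human", None                        # queue for manual annotation

predictions = [("road", 0.99), ("center_stack", 0.71), ("left", 0.97)]
for label, conf in predictions:
    print(route_frame(label, conf))
# -> ('machine', 'road'), ('human', None), ('machine', 'left')
```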
we usually annotate a large amount of the data 00:45:23.380 |
So we have to make sure that the neural network 00:45:48.860 |
whether the driver's looking on road or off road. 00:45:51.340 |
This is critical information for the car to understand. 00:45:53.860 |
And you wanna pause for a second to realize that 00:45:56.780 |
when you're driving a car for those who drive, 00:46:02.660 |
it has no idea about what you're up to at all. 00:46:12.780 |
More and more now with the GM Super Cruise vehicle 00:46:15.580 |
and Tesla now has added a driver facing camera, 00:46:32.620 |
It's common sense how important this information is, 00:46:44.660 |
it's been three decades of work on gaze estimation. 00:46:48.420 |
Gaze estimation is doing head pose estimation, 00:47:00.740 |
We convert that into a classification problem. 00:47:08.660 |
Glance classification is a machine learning problem. 00:47:28.180 |
The problem with emotion, if I may speak as an expert, 00:47:39.940 |
is that there is a lot of ways to taxonomize emotion, 00:47:47.180 |
whether that's for the primary emotions of the Parrott scale 00:47:47.180 |
with love, joy, surprise, anger, sadness, fear. 00:47:57.380 |
to break those apart into hierarchical taxonomies. 00:48:11.740 |
but it's kind of how we think about primary emotions 00:48:14.700 |
is detecting the broad categories of emotion, 00:48:22.900 |
And then there is application-specific emotion recognition, 00:48:30.660 |
that all the various ways that we can deform our face 00:48:47.780 |
I mean there's countless ways of deforming the face 00:49:09.100 |
This is their task with the general emotion recognition task 00:49:19.260 |
various subtleties of that emotion in a general case, 00:49:30.300 |
I mean essentially what these algorithms are doing 00:49:32.220 |
whether they're using deep neural networks or not, 00:49:45.780 |
the component, the various expressions we can make 00:49:47.900 |
with our eyebrows, with our nose and mouth and eyes 00:50:05.180 |
that you observe a smiling expression on the face 00:50:11.420 |
If there's an increased probability of a smile, 00:50:37.940 |
That's for the general emotion recognition task 00:50:41.020 |
that's sort of the core of the affective computing movement 00:50:55.740 |
We can take, here we have a large scale data set 00:51:07.940 |
so they're talking to their GPS using their voice. 00:51:13.940 |
in most cases an incredibly frustrating experience. 00:51:21.980 |
After the task they say on a scale of one to 10, 00:51:27.500 |
And what you see on top is the expressions detected 00:51:42.300 |
Is perfectly satisfied with a voice based interaction. 00:51:49.380 |
as I believe a nine on the frustration scale. 00:51:53.460 |
So the feature that was the strongest there, the expression, 00:51:58.220 |
remember, joy, the smile, was the strongest indicator 00:52:05.140 |
Smile was the thing that was always there for frustration. 00:52:17.340 |
So that shows you the kind of clean difference 00:52:24.700 |
Here, perhaps they enjoyed an absurd moment of joy 00:52:43.580 |
And their data does the work, not the algorithms. 00:52:53.220 |
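As a heavily simplified sketch of what an application-specific pipeline like this computes (the expression names, detector outputs, and decision rule below are all placeholders; in practice the mapping is learned from labeled sessions, and, as just noted, it can be counterintuitive):
```python
# Per-frame expression probabilities (random stand-ins for detector outputs
# such as "smile" or "brow furrow") aggregated over a session, then mapped to
# a label by a trivial placeholder rule. Real systems learn this from data.
import numpy as np

EXPRESSIONS = ["smile", "brow_furrow", "eye_widen", "lip_press"]

def session_features(frame_probs):
    """frame_probs: (num_frames, num_expressions) -> mean activation per expression."""
    return dict(zip(EXPRESSIONS, frame_probs.mean(axis=0)))

frame_probs = np.random.rand(300, len(EXPRESSIONS))   # 10 s of video at 30 fps
feats = session_features(frame_probs)

# Placeholder rule; the learned mapping can be counterintuitive, e.g. smiling
# correlating with self-reported frustration, as in the example above.
label = "frustrated" if feats["smile"] > 0.5 and feats["brow_furrow"] > 0.4 else "satisfied"
print(feats, label)
```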
for the artificial general intelligence class. 00:53:24.460 |
Cognitive load, we're starting to get to the eyes. 00:53:29.460 |
Cognitive load is the degree to which a human being 00:53:35.820 |
is accessing their memory or is lost in thought, 00:53:43.300 |
to recollect something, to think about something. 00:53:57.660 |
So there's pupils, the black part of the eye, 00:53:59.940 |
they can expand and contract based on various factors, 00:54:04.260 |
including the lighting variations in the scene, 00:54:06.780 |
but they also expand and contract based on cognitive load. 00:54:16.020 |
When we look around, eyes jump around the scene. 00:54:19.020 |
They can also do something called smooth pursuit. 00:54:25.300 |
can see a delicious meal flying by or running by 00:55:02.780 |
and real world data with lighting variations, 00:55:10.820 |
non-contact way to measure cognitive load in the lab 00:55:24.900 |
3D convolutional neural networks in this case, 00:55:26.940 |
we take a sequence of images of the eye through time 00:55:29.860 |
and use 3D convolutions as opposed to 2D convolutions. 00:55:45.060 |
Every channel is operated on by the filter separately. 00:56:24.900 |
and performing what's called the N-back task. 00:56:29.380 |
And that task involves hearing numbers being read to you 00:56:33.780 |
and then recalling those numbers one at a time. 00:56:37.700 |
So with zero-back, the system gives you a number, seven, 00:56:41.740 |
and then you have to just say that number back. 00:57:01.300 |
And not get distracted by the new information coming up. 00:57:05.060 |
With two back, you have to do that two numbers back. 00:57:07.740 |
So you have to use memory more and more with two back. 00:57:23.060 |
And now we have this nice raw pixels of the eye region 00:57:39.620 |
We have a ton of data of people on the highway 00:57:51.220 |
The input is 90 images at 15 frames a second. 00:58:03.820 |
is the technique developed for face recognition 00:58:11.260 |
It's also what we use here to normalize everything 00:58:16.100 |
It's taking whatever the orientation of the face 00:58:34.140 |
Where you find, and this is where the intuition builds. 00:58:45.060 |
What's being plotted here is the relative movement 00:58:57.140 |
so when your mind is not that lost in thought. 00:59:02.260 |
when it is lost in thought, the eye moves a lot less. 00:59:09.420 |
That's an interesting finding, but it's only in the aggregate. 00:59:12.140 |
And that's what the neural network is tasked with doing, 00:59:18.100 |
This is a standard 3D convolutional architecture. 00:59:23.980 |
Again, taking in the image sequence as the input, 00:59:41.300 |
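A minimal sketch of such a 3D convolutional classifier (layer sizes are assumptions, not the course's exact network), taking a 90-frame grayscale eye-region clip and producing three outputs for the 0-back, 1-back, and 2-back conditions:
```python
# 3D convolutions over (time, height, width) of an eye-region clip.
import torch
import torch.nn as nn

cognitive_load_net = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
    nn.Conv3d(16, 32, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(32, 3),                       # 0-back, 1-back, 2-back
)

clip = torch.randn(1, 1, 90, 64, 64)        # (batch, channels, time, H, W), grayscale eye crops
logits = cognitive_load_net(clip)           # -> (1, 3)
print(logits.shape)
```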
The idea is that you can just plop in a webcam, 00:59:58.740 |
Because every single one of the zero-back, one-back, two-back classes 01:00:03.380 |
has a confidence that's associated with it, 01:00:05.300 |
so you can turn that into a real value between zero and two. 01:00:14.420 |
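The conversion to a continuous value is just an expectation over the class indices, for example:
```python
# Turn the three class confidences into a real-valued load between 0 and 2.
import torch

logits = torch.tensor([[0.2, 1.5, 0.6]])              # stand-in network output
probs = logits.softmax(dim=1)                         # confidences for 0/1/2-back
levels = torch.tensor([0.0, 1.0, 2.0])
cognitive_load = (probs * levels).sum(dim=1)          # expected load in [0, 2]
print(float(cognitive_load))
```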
driving a car, performing a task of conversation. 01:00:19.420 |
And in white, showing the cognitive load, frame by frame. 01:00:24.460 |
At 30 frames a second, estimating the cognitive load 01:00:27.220 |
of each of the drivers, from zero to two on the y-axis. 01:00:41.140 |
And when everybody's silent, the cognitive load goes down. 01:00:44.460 |
So we can perform now with this simple neural network, 01:00:49.300 |
we can extend that to any arbitrary new data set 01:00:53.260 |
Okay, those are some examples of how neural networks 01:00:59.500 |
Again, while we focus on the perception task 01:01:08.020 |
and signal processing to determine where we are 01:01:10.900 |
in the world, where the different obstacles are 01:01:12.740 |
and form trajectories around those obstacles, 01:01:15.260 |
we are still far away from completely solving that problem. 01:01:26.820 |
and so when the system is not able to control, 01:01:31.620 |
when there's some flawed aspect about the perception 01:01:34.180 |
or the driving policy, the human has to be involved. 01:01:37.580 |
And that's where we have to let the car know 01:01:42.700 |
That's the essential element of human robot interaction. 01:01:45.900 |
The most popular car in the United States today 01:01:54.740 |
The thing that sort of inspires us and makes us think 01:01:58.300 |
that transportation can be fundamentally transformed 01:02:05.540 |
and all of our guest speakers and all the folks 01:02:09.860 |
But if you look at it, the only ones who are, 01:02:13.100 |
at a mass scale or beginning to, actually injecting 01:02:17.620 |
automation into our daily lives are the ones in between. 01:02:26.740 |
the Audi, the Volvo S90s, the vehicles that are slowly 01:02:31.740 |
adding some degree of automation and teaching human beings 01:03:05.820 |
and create successful human robot interaction, 01:03:08.660 |
approach autonomous vehicles, autonomous systems 01:03:18.180 |
of the human-centered systems, like the Tesla vehicles, 01:03:23.500 |
The kind of L2 technologies have not truly penetrated 01:03:27.300 |
the market, have not penetrated our vehicles, 01:03:35.500 |
And that's going to form the core of our algorithms 01:03:40.500 |
that will eventually lead to the full autonomy. 01:03:43.940 |
All of that data, what I mentioned with Tesla 01:03:49.020 |
all of that is training data for the algorithms. 01:04:07.940 |
Why is this important, beautiful, and fundamental 01:04:20.580 |
on the human-robot interaction, are personal robots. 01:04:27.060 |
tools like a Roomba performing a particular task. 01:04:33.780 |
when there's a fundamental transfer between it, 01:04:42.700 |
there's a transfer, that is kind of a relationship 01:05:08.860 |
one-to-one understanding of each other's mental state, 01:05:17.000 |
So, one of my favorite movies, Good Will Hunting, 01:05:28.600 |
This is Robin Williams speaking about human imperfections. 01:05:38.080 |
and replace every time he mentions girl with car. 01:05:53.940 |
They call these things imperfections, but they're not. 01:06:13.740 |
Well, that'll be the idiosyncrasies that only I know about. 01:06:57.720 |
Now you could know everything in the world's worth, 01:07:17.120 |
it's the human-centered approach to autonomous vehicles 01:07:37.120 |
on deep learning for understanding the humans at CHI 2018. 01:07:45.000 |
the convolutional neural network based detection 01:08:09.360 |
to begin to explore the nature of intelligence, 01:08:31.400 |
Richard Moyes talking about autonomous weapon systems 01:08:35.400 |
and AI safety, Marc Raibert from Boston Dynamics 01:09:11.560 |
The high performer award will be given to folks, 01:09:15.180 |
the very few folks who achieve 70 miles an hour or faster. 01:09:24.160 |
having hit a few snags and invested a few thousand dollars 01:09:33.360 |
in annotating a large-scale data set for you guys. 01:09:37.840 |
We'll continue this competition that will take us 01:09:47.160 |
and DeepCrash, the deep reinforcement learning competition. 01:09:49.780 |
These competitions will continue through May 2018. 01:10:09.320 |
There's an introduction to deep learning course 01:10:16.440 |
in the very basic algorithms of deep learning 01:10:24.240 |
And there's an awesome class that I ran last year 01:10:35.640 |
I encourage you to click a link on there and register. 01:10:40.760 |
and it truly brings together a lot of cross disciplinary 01:10:44.560 |
folks to talk about ideas of artificial intelligence 01:10:51.800 |
And if you're interested in applying deep learning methods 01:11:00.240 |
We have a lot of fascinating problems to solve 01:11:04.480 |
So with that, I'd like to thank everybody here, 01:11:09.520 |
everybody across the community that's been contributing. 01:11:13.800 |
We have thousands of submissions coming in for DeepTraffic, 01:11:13.800 |
and the team behind this class is incredible. 01:11:31.220 |
extra large, extra, extra large and medium over there,