Back to Index

MIT 6.S094: Deep Learning for Human Sensing


Chapters

0:00 Intro
6:53 Human Imperfections
22:57 Pedestrian Detection
28:57 Body Pose Estimation
35:40 Glance Classification
47:13 Emotion Recognition
53:24 Cognitive Load Estimation
60:54 Human-Centered Vision for Autonomous Vehicles

Transcript

Today we will talk about how to apply the methods of deep learning to understanding and sensing the human being. The focus will be on computer vision, the visual aspects of a human being. Of course, we humans express ourselves visually, but also through audio, voice, and through text, beautiful poetry and novels and so on. We're not going to touch those today; we're just going to focus on computer vision.

How can we use computer vision to extract useful, actionable information from images and video of human beings, in particular in the context of the car? So, what are the requirements for successfully applying deep learning methods in the real world? When we talk about human sensing, we're not talking about basic face recognition of celebrity images.

We're talking about using computer vision, deep learning methods, to create systems that operate in the real world. And in order for them to operate in the real world, there are several requirements. They sound simple, but some are much harder than they sound. First, and most important, ordered here from most to least critical, is data.

Data is everything. Real world data. We need a lot of real world data to form the data set on which these supervised learning methods can be trained. I'll say this over and over throughout the day today, data is everything. That means data collection is the hardest part and the most important part.

We'll talk about how that data collection is carried out here, in our group at MIT, all the different ways we capture human beings in the driving context, in the road user context, pedestrians, cyclists. But the data, it starts and ends at data. The fun stuff is the algorithms. But the data is what makes it all work.

Real world data. Okay, then once you have the data, okay, data isn't everything, I lied. Because you have to actually annotate it. So what do we mean by data? There's raw data, video, audio, LIDAR, all the types of sensors we'll talk about to capture real world road user interaction.

You have to reduce that into meaningful, representative cases of what happens in the real world. In driving, 99% of the time, driving looks the same. It's the 1%, the interesting cases, that we're interested in. And what we want is to train learning algorithms on that 1%. So we have to collect the 100%, we have to collect all the data, and then figure out automated, semi-automated ways to find the pieces of that data that can be used to train neural networks and that are representative of the kinds of things that happen in this world.
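As one concrete illustration of what such semi-automated filtering might look like, here is a minimal sketch that flags candidate moments from driving telemetry using simple event triggers. The field names, thresholds, and trigger logic are assumptions for illustration only, not the actual pipeline used on the MIT naturalistic driving data set.

```python
# Hypothetical sketch: flag "interesting" moments in raw driving telemetry so
# that only a small fraction of the data goes to human annotators. Thresholds
# and field names are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class TelemetrySample:
    timestamp: float          # seconds since start of trip
    decel_mps2: float         # longitudinal deceleration, m/s^2
    steering_rate_dps: float  # steering wheel rate, degrees/second

def flag_interesting_moments(samples: List[TelemetrySample],
                             decel_threshold: float = 3.0,
                             steering_threshold: float = 90.0) -> List[float]:
    """Return timestamps worth routing to human annotation."""
    flagged = []
    for s in samples:
        hard_brake = s.decel_mps2 > decel_threshold
        sharp_steer = abs(s.steering_rate_dps) > steering_threshold
        if hard_brake or sharp_steer:
            flagged.append(s.timestamp)
    return flagged
```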

Efficient annotation. Annotation isn't just about drawing bounding boxes on images of cats. Annotation tooling is key to unlocking real-world performance, to systems that successfully solve some problem, accomplish some goal on real-world data. That means designing annotation tools for a particular task. The annotation tools used for glance classification, for determining where drivers are looking, are very different from the annotation tools used for body pose estimation.

It's very different from the tooling that we use for SegFuse, in which we've invested thousands of dollars for the competition for this class, to annotate full scene segmentation, where every pixel is colored. There needs to be tooling for each one of those elements, and they're key. That's an HCI question. That's a design question.

There's no deep learning, there's no robotics in that question. It's: how do we leverage human computation, the human brain, to most effectively label images such that we can train neural networks on them? Hardware. In order to train these networks, in order to parse the data we collect, and we'll talk about it, we now have over 5 billion images of driving data.

In order to parse that, you can't do it on a single machine. You have to do large-scale distributed compute, and large-scale distributed storage. And finally, the stuff that's the most exciting, that people, that this class, and many classes, and much of the literature is focused on, is the algorithms.

The deep learning algorithms, the machine learning algorithms, the algorithms that learn from data. Of course, that's really exciting and important, but what we find time and time again, in real-world systems, is that as long as these algorithms learn from data, so as long as it's deep learning, the data is what's much more important.

Of course, it's nice for the algorithms to be calibration-free, meaning they self-calibrate. We don't need to have the sensors in exactly the same position every time. That's a very nice feature. The robustness of the system then generalizes across multiple vehicles, multiple scenarios. And one of the key things that comes up time and time again, and we'll mention today, is that a lot of the algorithms developed in deep learning for computer vision are focused on single images.

Now, the real world happens in both space and time, and we have to have algorithms that both capture the visual characteristics, but also look at the sequence of images, sequence of those visual characteristics that form the temporal dynamics, the physics of this world. So, it's nice when those algorithms are able to capture the physics of the scene.

The big takeaway I would like you to leave with today, unfortunately, is that the painful, boring stuff of collecting data, of cleaning that data, of annotating that data, in order to create successful systems, is much more important than good algorithms, or great algorithms. It's important to have good algorithms, as long as you have neural networks that learn from that data.

Okay, so today I'd like to talk about human imperfections, and the various detection problems, pedestrian detection, body pose, glance, emotion, cognitive load estimation, that we can use to help those humans as they operate in a driving context. And finally, to continue with the idea, the vision, that fully autonomous vehicles, as some of our guest speakers have spoken about, and as Sterling Anderson will speak about tomorrow, are really far away.

That humans will be an integral part of operating, cooperating with AI systems. And I'll continue on that line of thought to try to motivate why we need to continue to approach the autonomous vehicle, the self-driving car paradigm, in a human-centered way. Okay, first, before we talk about human imperfections, let's just pause and acknowledge that humans are amazing.

We're actually really good at a lot of things. It's sometimes fun to talk about how terrible we are as drivers, how distracted we are, how irrational we are. But we're actually really damn good at driving. Here's a video of a state-of-the-art soccer player, Messi, the best soccer player in the world, obviously.

And a state-of-the-art robot on the right. Same thing here. Well, it's not soccer, but I assure you, the American Ninja Warrior, Kacy, is far superior to the DARPA humanoid robotics systems shown on the right. Okay, so, continuing on this line of thought that humans are amazing: there was a record high in 2016 in the United States.

For the first time in many years, it crossed the 40,000 fatalities mark. More than 40,000 people died in car crashes in the United States. But that's over 3.2 trillion miles traveled. So that's one fatality per 80 million miles. That's a one in 625 chance of dying in a car crash in your lifetime.

Interesting side fact, for anyone in the United States: folks who live in Massachusetts are the least likely to die in a car crash. Montana is the most likely. So for everyone who thinks Boston drivers are terrible, maybe that adds some perspective. Here's a visualization of Waze data across a period of a day, showing you the rich blood flow of the city, the traffic flow of the city, people getting from A to B on a mass scale, and doing it, surviving, doing it okay.

Humans are amazing. But they're also flawed. Texting and other sources of distraction with a smartphone, eating, the secondary tasks of talking to other passengers, grooming, reading, using a navigation system, yes, sometimes watching video, and manually adjusting the radio. 3,000 people were killed and 400,000 were injured in motor vehicle crashes involving distraction in 2014.

Distraction is a very serious issue for safety. Texting: every day more and more people text. Smartphones are proliferating through our society. 170 billion text messages are sent in the United States every month. That's in 2014; you can only imagine what it is today. Eyes off the road for five seconds: that's the average time your eyes are off the road while texting.

Five seconds. If you're traveling 55 miles an hour, that five seconds is enough time to cover the length of a football field. So you're blindfolded. You're not looking at the road. In five seconds, the average time of a texting glance, you're covering an entire football field. And so many things can happen in that moment of time.
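Just to check the lecture's arithmetic, here is the distance covered in those five seconds at 55 miles per hour; the 360-foot figure for a football field (end zones included) is an assumption used only for comparison.

```python
# Distance covered during a 5-second glance at the phone while traveling 55 mph.
mph = 55
seconds = 5
feet_per_second = mph * 5280 / 3600       # about 80.7 ft/s
distance_ft = feet_per_second * seconds   # about 403 ft
football_field_ft = 360                   # 120 yards, end zones included
print(f"{distance_ft:.0f} ft traveled vs. a {football_field_ft} ft football field")
```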

That's distraction. Drunk driving: 31% of traffic fatalities involve a drunk driver. Drugged driving: 23% of nighttime drivers tested positive for an illegal, prescription, or over-the-counter drug. Distracted driving, as I said, is a huge safety risk. Drowsy driving: people driving tired. Nearly 3% of all traffic fatalities involve a drowsy driver.

If you are uncomfortable with videos that involve risk, I urge you to look away. These are videos collected by AAA of teenagers, a very large-scale naturalistic driving data set, capturing clips of teenagers being distracted on their smartphones. (TEEN CHATTERING) (TIRES SCREECHING) Once you take that in, you see the problem we're up against.

So in that context of human imperfections, we have to ask ourselves about the human-centered approach to autonomy: for autonomous vehicles that are using artificial intelligence to aid the driving task, do we want to go, as I mentioned a couple of lectures ago, the human-centered way or the full autonomy way?

The tempting path is towards full autonomy, where we remove this imperfect, flawed human from the picture altogether and focus on the robotics problem of perception and control and planning and driving policy. Or do we work together, human and machine, to improve the safety, to alleviate distraction, to bring driver attention back to the road and use artificial intelligence to increase safety through collaboration, human-robot interaction versus removing the human completely from the picture?

As I've mentioned, and as Sterling will certainly talk about tomorrow, and rightfully so, and as Emilio talked about on Tuesday, the L4 way is grounded in the literature, and in some sense grounded in common sense. You can count on the natural flaws of human beings, to overtrust, to misbehave, to be irrational about their risk estimates, resulting in improper use of the technology.

And that leads to what I've shown before, the public perception of what drivers do in semi-autonomous vehicles. They begin to overtrust. The moment the system works well, they begin to overtrust. They begin to do stuff they're not supposed to be doing in the car, taking it for granted. A recent video that somebody posted illustrates a common, more practical concern people have: the traditional way to ensure the physical engagement of the driver is to say they should touch the steering wheel every once in a while.

And of course, there are ways to bypass the need to touch the steering wheel. Some people hang objects like a can off of the steering wheel. In this case, brilliantly, I have to say, they shove an orange into the wheel to make the touch sensor fire and therefore are able to take their hands off the wheel while Autopilot is on.

And that kind of example makes us believe that humans will inevitably find a way to misuse this technology. However, I believe that's not giving the technology enough credit. Artificial intelligence systems, if they're able to perceive the human being, are also able to work with the human being.

And that's what I'd like to talk about today. Teaching cars to perceive the human being. And it all starts with data. It's all about data, as I mentioned. Data is everything in these real-world systems. With the MIT naturalistic driving data set of 25 vehicles, of which 21 are equipped with Tesla autopilot, we instrument them.

This is what we do with the data collection. Two cameras on the driver. We'll see the cameras on the face, capturing high-definition video of the face. That's where we get the glance classification, the emotion recognition, cognitive load, everything coming from the face. Then we have another camera, a fisheye, that's looking at the body of the driver.

And from that comes the body pose estimation, hands on wheel, activity recognition. And then one video camera is looking out, for the full scene segmentation, for all the scene perception tasks. And everything is being recorded, synchronized together with GPS, with audio, with all the CAN data coming from the car, on a single device.

Synchronization of this data is critical. So that's one road trip in the data. We have thousands like it, traveling hundreds of miles, sometimes hundreds of miles under automated control, in autopilot. That's the data. Again, as I said, data is everything. And from this data, we can both gain understanding of what people do, which is really important to understand how autonomy, successful autonomy can be deployed in the real world, and to design algorithms for training, for training the deep learning, the deep neural networks in order to perform the perception task better.

25 vehicles, 21 Teslas, Model S, Model X, and now Model 3. Over a thousand miles collected a day. Every single day we have thousands of miles in the Boston, Massachusetts area driving around, all of that video being recorded. Now over 5 billion video frames. There's several ways to look at autonomy.

One of the big ones is safety. That's what everybody talks about. How do we make these things safe? But the other one is enjoyment. Do people actually want to use it? We can create a perfectly safe system. We can create it right now. We've had it forever, before even cars.

A car that never moves is a perfectly safe system. Well, not perfectly, but almost. But it doesn't provide a service that's valuable. It doesn't provide an enjoyable driving experience. So okay, what about slow-moving vehicles? That's an open question. The reality is with these Tesla vehicles and L2 systems doing automated driving, people are driving 33% of miles using Tesla Autopilot.

What does that mean? That means that people are getting value from it. A large fraction of their driving is done in an automated way. That's value, that's enjoyment. The glance classification algorithm we'll talk about today is used as one example that we use to understand what's in this data.

Shown with the bar graphs there in the red and the blue. Red is your manual driving, blue is your autopilot driving. And we look at glance classification, regions of where drivers are looking, on road and off road. And if that distribution changes with automated driving or manual driving. And with these glance classification methods, we can determine that there's not much difference.

At least until you dig into the details, which we haven't done yet. In the aggregate, there's not a significant difference. That means people are getting value and enjoying using these technologies, but they're staying, if not attentive, at least physically engaged. When your eyes are on the road, you might not be attentive.

But you're at the very least physically, your body's positioned in such a way, your head is looking at the forward roadway, that you're physically in position to be alert and to take in the forward roadway. So they're using it and they don't over trust it. And that's I think the sweet spot that human robot interaction needs to achieve.

It's the human gaining, through experience, through exploration, through trial and error, an understanding of the limitations of the system, to a degree that overtrust does not occur. That seems to be happening in this system. And using the computer vision methods I'll talk about, we can continue to explore how that can be achieved in other systems.

When the fraction of automated driving increases, from 30% to 40% to 50% and so on. It's all about the data and I'll harp on this again. The algorithms are interesting. I will mention of course, it's the same convolutional neural networks. It's the same networks that take in raw pixels and extract features of interest.

It's 3D convolutional neural networks that take in sequences of images and extract the temporal dynamics along with the visual characteristics of the individual images. It's RNNs, LSTMs, that use convolutional neural networks to extract features and then look at the dynamics of the images over time. These are pretty basic architectures, the same kinds of deep neural network architectures.
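As a rough sketch of the CNN-plus-recurrence pattern just described, where a 2D CNN extracts per-frame features and an LSTM models their dynamics over time, here is a minimal PyTorch example. The layer sizes, input shapes, and number of classes are illustrative assumptions, not the architectures actually used in the lecture's systems.

```python
# Minimal sketch of the CNN + LSTM pattern: per-frame CNN features, temporal
# modeling with an LSTM, classification from the last time step.
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    def __init__(self, num_classes: int = 2, feat_dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(                    # per-frame feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # -> (B*T, 32, 1, 1)
        )
        self.proj = nn.Linear(32, feat_dim)
        self.lstm = nn.LSTM(feat_dim, 64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, video):                        # video: (B, T, 3, H, W)
        b, t, c, h, w = video.shape
        feats = self.cnn(video.view(b * t, c, h, w)).flatten(1)  # (B*T, 32)
        feats = self.proj(feats).view(b, t, -1)                  # (B, T, feat_dim)
        out, _ = self.lstm(feats)                                # (B, T, 64)
        return self.head(out[:, -1])                             # last-step logits

logits = CNNLSTMClassifier()(torch.randn(2, 8, 3, 64, 64))       # (2, num_classes)
```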

But they rely fundamentally and deeply on the data, on real-world data. So let's start where, on the human sensing side, it perhaps all began decades ago, which is pedestrian detection. To put it in context, the human sensing tasks are shown here from left to right. On the left, in green, are the easier tasks.

Tasks of sensing some aspect of the human being. Pedestrian detection, which is detecting the full body of a human being in an image or video, is one of the easier computer vision tasks. And on the right, in red, are microsaccades, the tremors of the eye, measuring the pupil diameter, measuring cognitive load from the fine blink dynamics of the eye, the velocity of the blink, micro glances, and eye pose; these are much harder problems.

So body pose estimation, pedestrian detection, face detection and recognition, head pose estimation, all of those are the easier tasks. Anything that starts getting smaller, looking at the eye, anything that starts getting fine-grained, is much more difficult. So we start at the easiest, pedestrian detection. It has the usual challenges of all of computer vision that we've talked about: the various styles of appearance, the intra-class variation, the different possible articulations of our bodies, superseded perhaps only by cats, but we humans are pretty flexible as well.

The presence of occlusion, from the accessories that we wear, to self-occlusion, to occluding each other. Crowded scenes have a lot of humans in them, and they occlude each other, and therefore being able to disambiguate, to figure out each individual pedestrian, is a very challenging problem. So how do people approach this problem?

Well, there is a need to extract features from raw pixels. Whether that was Haar cascades, HOG, or CNNs, through the decades, the sliding window approach was used. Because pedestrians can be small in an image or big, there's the problem of scale. So you use a sliding window to detect where each pedestrian is.

You have a classifier that, given a single image, says whether this is a pedestrian or not. You take that classifier and slide it across the image to find where all the pedestrians in the scene are. You can use non-neural-network methods, or you can use convolutional neural networks for that classifier.
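Here is a minimal sketch of that sliding-window idea, with `classify_window` standing in for whatever pedestrian classifier is used; the window size, stride, and scales are made-up illustrative values. The nested loops over positions and scales are exactly what makes the approach so expensive.

```python
# Rough sketch of sliding-window detection: run a classifier at every window
# position, at several scales, and keep the windows it scores as pedestrians.
def sliding_window_detect(image_w, image_h, classify_window,
                          window=(64, 128), stride=16, scales=(1.0, 1.5, 2.0)):
    detections = []
    for scale in scales:
        w, h = int(window[0] * scale), int(window[1] * scale)
        for y in range(0, image_h - h + 1, stride):
            for x in range(0, image_w - w + 1, stride):
                score = classify_window(x, y, w, h)   # pedestrian probability
                if score > 0.5:
                    detections.append((x, y, w, h, score))
    return detections
```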

It's extremely inefficient. Then came along R-CNN, Fast R-CNN, Faster R-CNN. These are networks that, as opposed to doing a complete sliding window approach, are much more intelligent, more clever about generating the candidates to consider. So as opposed to considering every possible position and scale of a window, they generate a small subset of candidates that are more likely.

And finally, a CNN classifier decides for those candidates whether there's a pedestrian or not, whether there's an object of interest or not, a face or not. And non-maximum suppression is used, because there are overlapping bounding boxes, to figure out the most likely bounding box around each pedestrian, around each object.

That's R-CNN. And there are a lot of variants. Now there's Mask R-CNN, really the state-of-the-art localization network, which adds to this: on top of the bounding box, it also performs segmentation. There's VoxelNet, which works on three-dimensional LiDAR data and does localization in point clouds. So it's not just using 2D images, but 3D.
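Since non-maximum suppression comes up in essentially all of these detection pipelines, here is a minimal NumPy sketch of it: keep the highest-scoring box and discard overlapping boxes above an IoU threshold. It is a generic illustration, not tied to any particular detector implementation.

```python
# Minimal non-maximum suppression over boxes given as [x1, y1, x2, y2] arrays.
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    order = np.argsort(scores)[::-1]      # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # drop candidates that overlap the kept box too much
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep                           # indices of boxes to keep
```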

But it's all kind of grounded in the R-CNN framework. Okay, data. So we have large-scale data collection going on here in Cambridge. If you've seen cameras or LiDAR at various intersections throughout MIT, we're part of that. So for example, here's one of the intersections, we're collecting about 10 hours a day, instrumenting it with various sensors I'll mention, but we see about 12,000 pedestrians a day across that particular intersection, using 4K cameras, using stereo vision cameras, 360, now the Insta360, which is an 8K 360 camera, GoPro, LiDAR of various sizes, the 64 channel or the 16.

And recording. This is where the data comes from. This is from the 360 video. This is from the LiDAR data of the same intersection. This is from the 4K camcorders pointing at a different intersection, capturing the entire 360 view with the vehicles approaching and the pedestrians making crossing decisions.

This is understanding the negotiation that pedestrians, the nonverbal negotiation that pedestrians perform in choosing to cross or not, especially when they're jaywalking, and everybody jaywalks. Especially if you're familiar with this particular intersection, there's more jaywalkers than non-jaywalkers. It's a fascinating one. And so we record everything about the driver and everything about the pedestrians.

And again, this is where R-CNN comes in: you do bounding box detection of the pedestrians, and here of the vehicles as well, which allows you to convert this raw data into hours of pedestrian crossing decisions and begin to interpret it. That's pedestrian detection, bounding boxes. Body pose estimation is the more difficult task.

Body pose estimation is finding the joints, the hands, the elbows, the shoulders, the hips, knees, feet, the landmark points in the image, the X, Y positions that mark those joints. That's body pose estimation. So why is that important in driving, for example? It's important for determining the vertical position and alignment of the driver: airbag testing and seat belt testing are always performed with the dummy in the frontal, standard dummy position.

With greater and greater degrees of automation come more capability and flexibility for the driver to get misaligned from the standard, quote-unquote, dummy position. And so body pose, or at least upper body pose estimation, allows you to determine how often drivers get out of alignment from the standard position, their general movement.

And then you can look at hands on wheel, smartphone detection, activity, and help add context to the glance estimation that we'll talk about. Some of the more traditional methods were sequential: detecting first the head, and then stepping down, detecting the shoulders, the elbows, the hands. The DeepPose holistic view, which has been a very powerful, successful way toward multi-person pose estimation, performs a regression, detecting body parts from the entire image.

It's not sequentially stitching bodies together; it's detecting the left elbow, the right elbow, the hands individually. It performs that detection and then stitches everything together afterwards, allowing you to deal with the crazy deformations of the body that happen, the occlusions and so on, because you don't need all the joints to be visible.

And with this cascade of pose regressors, meaning these are convolutional neural networks that take in a raw image and produce an XY position of their estimate of each individual joint. Input is an image, output is an estimate of a joint, of an elbow, shoulder, whatever, one of several landmarks.

And then you can build on top of that: each estimation zooms in on that particular area and performs a finer and finer grained estimation of the exact position of the joint, repeating it over and over and over. So through this process, we can do part detection in multi-person scenes, scenes that contain multiple people.

So we can detect the head, the neck here, the hands, the elbows, shown in the various images on the right, without an understanding of who the head, the elbows, the hands belong to. It's just performing detection, without trying to do individual person detection first. And then the next step is connecting those parts together with part affinity fields.

So first you detect individual parts, then you connect them together. And then, through bipartite matching, you determine who each individual body part most likely belongs to. So you stitch the different people in the scene together after the detection is performed with the CNN.
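Here is a toy sketch of that association step: given pairwise affinity scores between detected parts (in part-affinity-field methods these come from integrating the predicted fields along candidate limbs), a bipartite assignment picks the most likely pairing. The affinity matrix below is made up purely for illustration.

```python
# Toy sketch of "detect parts, then associate them": assign detected elbows to
# wrists using a bipartite matching over made-up affinity scores.
import numpy as np
from scipy.optimize import linear_sum_assignment

# affinity[i, j]: how strongly elbow i and wrist j appear to belong to the same arm
affinity = np.array([
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.1],
    [0.1, 0.3, 0.7],
])

# Hungarian algorithm maximizes total affinity (minimize its negative).
elbow_idx, wrist_idx = linear_sum_assignment(-affinity)
for e, w in zip(elbow_idx, wrist_idx):
    print(f"elbow {e} <-> wrist {w} (affinity {affinity[e, w]:.2f})")
```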

We use this approach for detecting the upper body, specifically the shoulders, the neck, and the head: eyes, nose, ears. That is used to determine the position of the driver relative to the standard dummy position. For example, looking at 30-minute periods of Autopilot driving, the x-axis is time and the y-axis is the position of the neck point I pointed out in the previous slide, the midpoint between the two shoulders, over time, relative to where it began.

This is the slouching, the sinking into the seat. Allowing the car to know that information, and allowing us or the designers of safety systems to know that information, is really important. We can use the same body pose algorithm from the perspective of the vehicle, the outside-the-vehicle perspective.

So the vehicle looking out is doing body pose estimation, as opposed to just plain pedestrian detection. Again, here in Kendall Square, vehicles crossing, observing pedestrians making crossing decisions, and performing body pose estimation, which allows you to then generate visualizations like this and gain understanding like this. On the x-axis is time; on the y-axis, in the top plot in blue, is the speed of the vehicle.

The speed of the vehicle, the ego vehicle from which the camera is observing the scene. And on the bottom in green, up and down is a binary value. Whether the pedestrian, zero when the pedestrian is not looking at the car, one when the pedestrian is looking at the car.

So we can look at thousands of episodes like this, crossing decisions, nonverbal communication decisions, and determine, using body pose estimation, the dynamics of this nonverbal communication. Here, just nearby, by the Media Lab, there's a crossing, and a pedestrian approaches. We can look, in green there, at when the pedestrian glances at the car, looks away, glances at the car, looks away.

Fascinating glance behavior that happens. Interestingly, most people look away before they cross. Same thing here; this is one example, and we have thousands of these. Body pose estimation allows you to get this fine-grained information about pedestrian glance behavior, pedestrian body behavior, hesitation. Glance classification: one of the most important things in driving is determining where drivers are looking.

If there's any sensing that I advocate for, the one with the most impact in the driving context is for the car to know where the driver is looking. And at the very crude region level: is the driver looking on road or off road? That's what we mean by glance classification.

It's not the standard gaze estimation problem of X, Y, Z determining where the eye pose and the head pose combine to determine where the driver is looking. No, this is classifying two regions, on road, off road, or six regions. On road, off road, left, right, center stack, rear view mirror and instrument cluster.

So it's region-based glance allocation, not the geometric gaze estimation problem. Why is that important? It allows you to address it as a machine learning problem. This is a subtle but critical point. Every problem we try to solve in human sensing, in driver sensing, has to be learnable from data. Otherwise it's not amenable to application in the real world.

We can't design systems in the lab that are deployed without learning if they involve a human. It's possible to do SLAM localization by having really good sensors and doing localization using those sensors without much learning. It's not possible to design systems that deal with lighting variability and the full variability of human behavior without being able to learn.

So in gaze estimation, the geometric approach of finding the landmarks in the face and from those landmarks determining the orientation of the head and the orientation of the eyes, there's no learning outside of training the systems to detect the different landmarks. If we convert this into a glance classification problem, shown here, we take the raw video stream and determine in post, with humans annotating the video, which region the driver is looking at.

That we're able to do by converting the problem into a simple variant of classification: on road, off road, left, right. The same can be done for pedestrians: left, forward, right. We can annotate regions of where they're looking, and using that kind of classification approach, determine whether they're looking at the cars or not.

Are they looking away? Are they looking at their smartphone? Without doing the 3D gaze estimation, again, it's a subtle point, but think about it. If you wanted to estimate exactly where they're looking, you need that ground truth. You don't have that ground truth unless you, there's no, in the real world data, there's no way to get the information about where exactly people were looking.

You're only inferring. So you have to convert it into a region-based classification problem in order to be able to train neural networks on this. And the pipeline is the same. The source video, here, the face, the 30 frames a second video coming in of the driver's face or the human face.

There is some degree of calibration that's required. You have to determine approximately where the sensor is that's taking in the image, especially for the classification task because it's region-based. It needs to be able to estimate where the forward roadway is, where the camera frame is relative to the world frame.

The video stabilization and the face frontalization, all the basic processing that remove the vibration and the noise that remove the physical movement of the head, that remove the shaking of the car in order to be able to determine stuff about eye movement and blink dynamics. And finally, with neural networks, there is nothing left except taking in the raw video of the face for the glance classification tasks and the eye for the cognitive load tasks.

Raw pixels, that's the input to these networks. And the output is whatever the training data is. And we'll mention each one. So whether that's cognitive load, glance, emotion, drowsiness. The input is the raw pixels and the output is whatever you have data for. Data is everything. Here, the face alignment problem, which is a traditional geometric approach to this problem, is designing algorithms that are able to detect accurately the individual landmarks in the face and from that estimate the geometry of the head pose.

For the classification version, we perform the same kind of alignment or the same kind of face detection alignment to determine where the head is. But once we have that, we pass in just the raw pixels and perform the classification on that. As opposed to doing the estimation, it's classification.
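A minimal sketch of glance classification framed this way: a small CNN takes the cropped, stabilized face image and outputs one of the six glance regions. The architecture, input size, and layer widths are illustrative assumptions, not the network used in the lecture.

```python
# Sketch of region-based glance classification: raw face pixels in, one of six
# glance regions out. Layer sizes are illustrative only.
import torch
import torch.nn as nn

GLANCE_REGIONS = ["road", "left", "right", "center_stack",
                  "instrument_cluster", "rearview_mirror"]

glance_net = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, len(GLANCE_REGIONS)),       # logits over the six regions
)

face_crop = torch.randn(1, 3, 128, 128)       # one stabilized face image
probs = glance_net(face_crop).softmax(dim=1)
print(GLANCE_REGIONS[int(probs.argmax())], float(probs.max()))
```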

This allows you to perform, as shown there on the bottom, real-time classification of where the driver is looking: road, left, right, center stack, instrument cluster, and rear view mirror. And as I mentioned, annotation tooling is key. We have a total of five billion video frames, one and a half billion of the face.

That would take tens of millions of dollars to annotate just for the glance classification fully. So we have to figure out what to annotate in order to train the neural networks to perform this task. And what we annotate is the things that the network is not confident about. The moments of high lighting variation, the partial occlusions from the light or self-occlusion, and the moving out of frame, the out of frame occlusions.

All the difficult cases. Going from frame to frame to frame here in the different pipelines, starting at the top, going to the bottom. Whenever the classification has a low confidence, we pass it to the human. It's simple. We rely on the human only when the classifier is not confident.

And the fundamental trade-off in all of these systems is what is the accuracy we're willing to put up with. Here in red and blue, in red is human choice decision, in blue is machine task. In red, we select the video we want to classify. In blue, the neural network performs the face detection task, localizing the camera, choosing what is the angle of the camera, and provides a trade-off between accuracy and percent frames it can annotate.

So certainly a neural network can annotate glance for the entire data set, but it would achieve, in the case of glance classification, accuracy in the low 90s percent on the six-class task. Now, if you wanted higher accuracy, it would only be able to achieve that for a smaller fraction of frames.

That's the choice. And then a human has to go in and perform the annotation of the frames that the algorithm is not confident about. And it repeats over and over. The algorithm is then trained on the frames that were annotated by the human. And it repeats this process over and over on the frames until everything is annotated.
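The loop just described can be sketched roughly as follows, with `train`, `predict`, and `ask_human` standing in for the actual model training, inference, and annotation tooling; the confidence threshold and batch size are illustrative assumptions.

```python
# Sketch of the confidence-based annotation loop: the model labels what it is
# confident about, humans label a batch of the hard frames, and the model is
# retrained on everything labeled so far, repeating until all frames are done.
def annotation_loop(frames, train, predict, ask_human,
                    confidence_threshold=0.95, human_batch_size=1000):
    labeled = {}                      # frame id -> label
    unlabeled = set(frames)
    while unlabeled:
        model = train(labeled)        # (re)train on everything labeled so far
        low_confidence = []
        for f in list(unlabeled):
            label, confidence = predict(model, f)
            if confidence >= confidence_threshold:
                labeled[f] = label    # machine-accepted annotation
                unlabeled.remove(f)
            else:
                low_confidence.append(f)
        # humans annotate only a batch of the hard frames, then we retrain
        for f in low_confidence[:human_batch_size]:
            labeled[f] = ask_human(f)
            unlabeled.remove(f)
    return labeled
```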

Yes. (audience member speaking off mic) Yes, absolutely. The question was, do you ever observe that the classifier is highly confident about the incorrect class? Yep. (audience member speaking off mic) Right, question was, well then, how do you deal with that? How do you account for that? How do you account for the fact that highly confident predictions can be highly wrong?

Yeah, false positives. False positives that you're really confident in. There's no, at least in our experience, there's no good answer for that, except more and more training data on the things you're not confident about. That usually seems to generalize over those cases. We don't encounter obvious, large categories of data where you're really confident about the wrong thing.

Usually some degree of human annotation fixes most problems. Annotating the low-confidence part of the data solves most of the incorrect cases. But of course, that's not always true in the general case; you can imagine a lot of scenarios where that's not true. For example, one thing we always do is, for each individual person, we annotate a large amount of the data manually no matter what.

So we have to make sure that the neural network has seen that person in the various ways their face can look: with glasses, with different hair, with different lighting variation. So we want to manually annotate that. Over time, we allow the machine to do more and more of the work.

So what results from this, in the glance classification case, is that you can do real-time classification. You can give the car information about whether the driver is looking on road or off road. This is critical information for the car to understand. And you want to pause for a second to realize that when you're driving a car, for those who drive, or for those who've driven any kind of car with any kind of automation, it has no idea what you're up to at all.

It has no, it doesn't have any information about the driver except if they're touching the steering wheel or not. More and more now with the GM Super Cruise vehicle and Tesla now has added a driver facing camera, they're slowly starting to think about moving towards perceiving the driver. But most vehicles on the road today have no knowledge of the driver.

This knowledge is almost common sense and trivial for the car to have. It's common sense how important this information is, where the driver is looking. That's the glance classification problem. And again, emphasizing that we've converted, it's been three decades of work on gaze estimation. Gaze estimation is doing head pose estimation, so the geometric orientation of the head, combining the orientation of the eyes and using that combined information to determine where the person is looking.

We convert that into a classification problem. So the standard gaze estimation definition is not a machine learning problem. Glance classification is a machine learning problem. This transformation is key. Emotion. Human emotion is a fascinating thing. So the same kind of pipeline, stabilization, cleaning of the data, raw pixels in, and then the classification is emotion.

The problem with emotion, if I may speak as an expert, not an expert in emotions, just an expert at being human, is that there are a lot of ways to taxonomize emotion, to categorize emotion, to define emotion, whether that's the primary emotions of the Parrott scale, with love, joy, surprise, anger, sadness, fear.

There are a lot of ways to mix those together, to break those apart into hierarchical taxonomies. And the way we think about it, in the driving context at least, is that there is a general emotion recognition task; I'll mention it, and it's how we think about primary emotions: detecting the broad categories of emotion, of joy and anger, of disgust and surprise.

And then there is application-specific emotion recognition, where you're using facial expressions, all the various ways that we can deform our face to communicate information, to answer a specific question about the interaction of the driver. So first, for the general case, these are the building blocks. I mean, there are countless ways of deforming the face that we use to communicate with each other.

There are 42 individual facial muscles that can be used to form those expressions. One of our favorite tools to work with is the Affectiva SDK. Their task, the general emotion recognition task, is taking in raw pixels and determining categories of emotion, the various subtleties of that emotion in the general case, producing a classification of anger, disgust, fear, surprise, and so on.

And then they do the mapping. Essentially what these algorithms are doing, whether they're using deep neural networks or not, whether they're using face alignment to do the landmark detection and then tracking those landmarks over time to get the facial actions, is mapping the expressions, the various component expressions we can make with our eyebrows, with our nose and mouth and eyes, to the emotion.

So I'd like to highlight one, because I think it's illustrative. For joy, an expression of joy is smiling. So there's an increased likelihood that you observe a smiling expression on the face when joy is experienced, and vice versa: if there's an increased probability of a smile, there's an increased probability of the emotion of joy being experienced.

And when joy is experienced, there's a decreased likelihood of brow raising and brow furrowing. So if you see a smile, that's a plus for joy. If you see a brow raise or a brow furrow, that's a minus for joy. That's the general emotion recognition task that's been well studied; it's sort of the core of the affective computing movement from the visual, the computer vision perspective.
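As a toy illustration of that mapping from expressions to emotion, here is a sketch that combines expression likelihoods into a joy score, weighting the smile positively and the brow movements negatively. The weights and the clamping are made up for illustration; this is not Affectiva's actual model.

```python
# Toy mapping from expression likelihoods to a joy score, per the description
# above: smile counts toward joy, brow raise and brow furrow count against it.
def joy_score(p_smile: float, p_brow_raise: float, p_brow_furrow: float) -> float:
    w_smile, w_raise, w_furrow = 1.0, 0.4, 0.6      # illustrative weights
    score = w_smile * p_smile - w_raise * p_brow_raise - w_furrow * p_brow_furrow
    return max(0.0, min(1.0, score))                # clamp to [0, 1]

print(joy_score(p_smile=0.9, p_brow_raise=0.05, p_brow_furrow=0.0))  # high joy
print(joy_score(p_smile=0.2, p_brow_raise=0.7, p_brow_furrow=0.5))   # low joy
```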

For the application-specific perspective, which we're really focused on, again, data is everything: what are you annotating? Here we have a large-scale data set of drivers interacting with a voice-based navigation system. They're tasked, in various vehicles, with entering a navigation destination, so they're talking to their GPS using their voice.

Depending on the vehicle, depending on the system, this is in most cases an incredibly frustrating experience. So we have them perform this task, and then the annotation is self-report. After the task they say, on a scale of one to ten, how frustrating was this experience? And what you see on top are the expressions detected for a satisfied person, a person who reported a ten on satisfaction, so a one on the frustration scale.

That person is perfectly satisfied with the voice-based interaction. On the bottom is a frustrated person, I believe a nine on the frustration scale. And the strongest expression there, remember, for joy it was the smile, the smile was the strongest indicator of frustration for all our subjects. That was the strongest expression. Smile was the thing that was always there for frustration.

There's other various frowning that followed, and shaking of the head and so on, but smiles were there. So that shows you the clean difference between the general emotion recognition task and the application-specific one. Here, perhaps they enjoyed an absurd moment of joy at the frustration they were experiencing. You can get philosophical about it, but the practical reality is, they were frustrated with the experience, and we're using the 42 muscles of the face that make expressions to do classification of frustrated or not.

And there, the data does the work, not the algorithms. It's the annotation. A quick mention for the AGI class next week, the artificial general intelligence class: one of the competitions we're doing is we have a JavaScript face that's trained with a neural network to form various expressions to communicate with the observer.

So we're interested in creating emotion, which is a nice mirror of the emotion recognition problem. It's gonna be super cool. Cognitive load: we're starting to get to the eyes. Cognitive load is the degree to which a human being is accessing their memory or is lost in thought, how hard they're working in their mind to recollect something, to think about something.

That's cognitive load. And to take a quick pause on eyes as the window to cognitive load, eyes as the window to the mind: there are different ways the eyes move. There are the pupils, the black part of the eye; they can expand and contract based on various factors, including the lighting variations in the scene, but they also expand and contract based on cognitive load.

That's a strong signal. The eyes can also move around. There are ballistic movements, saccades: when we look around, our eyes jump around the scene. They can also do something called smooth pursuit: connecting to our animal past, when you see a delicious meal flying by or running by, your eyes can follow it perfectly.

They're not jumping around. So when we read a book, our eyes are using saccadic movements, where they jump around. And in smooth pursuit, the eye is moving perfectly smoothly. Those are the kinds of movements we have to work with. And cognitive load can be detected by looking at various factors of the eye.

The blink dynamics, the eye movement, and the pupil diameter. The problem is that in the real world, in real-world data with lighting variations, everything goes out the window in terms of using pupil diameter, which is the standard non-contact way to measure cognitive load in the lab, where you can control lighting conditions and use infrared cameras.

When you can't, all that goes out the window, and all you have is the blink dynamics and the eye movement. So, neural networks to the rescue. 3D convolutional neural networks, in this case: we take a sequence of images of the eye through time and use 3D convolutions as opposed to 2D convolutions.

On the left is everything we've talked about previous to this: 2D convolutions, where the convolution filter operates on the XY 2D image, and every channel is operated on by the filter separately. 3D convolutions combine those: they convolve across multiple images, across multiple channels, and are therefore able to learn the dynamics of the scene through time as well, not just spatially.

Temporal. And data. Data is everything. For cognitive load, we have in this case 92 drivers. So how do we sort of perform the cognitive load classification task? We have these drivers driving on the highway and performing what's called the N-back task. Zero back, one back, two back. And that task involves hearing numbers being read to you and then recalling those numbers one at a time.

So with zero back, the system gives you a number, seven, and you just have to say that number back: seven. And it keeps repeating that; it's easy. It's supposed to be the easy task. With one back, when you hear a number, you have to remember it, and when the next number comes, you have to say the number previous to it.

So you kind of have to keep one number in your memory always. And not get distracted by the new information coming up. With two back, you have to do that two numbers back. So you have to use memory more and more with two back. So cognitive load is higher and higher.
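For concreteness, here is a tiny sketch of the n-back task logic: the correct response at each step is the digit heard n steps earlier.

```python
# Correct responses for the n-back task: at each step, answer with the digit
# heard n steps earlier (n = 0, 1, or 2).
def n_back_answers(digits, n):
    """Return (position, correct_response) pairs, starting once n digits have been heard."""
    return [(i, digits[i - n]) for i in range(n, len(digits))]

sequence = [7, 3, 9, 4, 1]
print(n_back_answers(sequence, 0))  # repeat what you just heard
print(n_back_answers(sequence, 2))  # at position 2 you answer 7, then 3, then 9
```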

Okay, so what do we do? We use face alignment, face frontalization, detect the eye closest to the camera, and extract the eye region. And now we have these nice raw pixels of the eye region across six seconds of video. We take that, put it into a 3D convolutional neural network, and classify simply one of three classes.

Zero back, one back, and two back. We have a ton of data of people on the highway performing these n-back tasks, and that forms the classification, supervised learning training data. That's it. The input is 90 images at 15 frames a second, and the output is one of three classes.
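A minimal sketch of such a 3D-convolutional classifier: the input is a 90-frame clip of the eye region (six seconds at 15 frames per second) and the output is one of the three n-back classes. The channel counts, kernel sizes, and grayscale single-channel input are illustrative assumptions, not the lecture's exact architecture.

```python
# Sketch of a 3D CNN for cognitive load classification from eye-region clips.
import torch
import torch.nn as nn

cognitive_load_net = nn.Sequential(
    # input: (batch, channels=1, frames=90, height, width), grayscale eye crops
    nn.Conv3d(1, 8, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
    nn.Conv3d(8, 16, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(16, 3),                 # logits for 0-back, 1-back, 2-back
)

clip = torch.randn(1, 1, 90, 64, 64)  # one 6-second eye-region clip
print(cognitive_load_net(clip).softmax(dim=1))
```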

Face frontalization, I should mention, is a technique developed for face recognition, because most face recognition methods require a frontal face orientation. It's also what we use here to normalize everything so that we can focus in on the exact blink dynamics. It takes whatever the orientation of the face is and projects it into the frontal position.

Taking the raw pixels of the face, detecting the eye region, zooming in, and grabbing the eye. (claps) And here is where the intuition builds. It's a fascinating one. What's being plotted here is the relative movement of the pupil, the relative movement of the eye, for the different cognitive loads.

For a cognitive load of zero on the left, when your mind is not that lost in thought, and a cognitive load of two on the right, when it is lost in thought, the eye moves a lot less. The eye is more focused on the forward roadway. That's an interesting finding, but it's only in aggregate.

And that's what the neural network is tasked with extracting, on a frame-by-frame basis. This is a standard 3D convolutional architecture, taking in the image sequence as the input, with cognitive load classification as the output, and on the right is the accuracy it's able to achieve: 86%.

That's pretty cool, from real-world data. The idea is that you can just plop in a webcam, feed the video into the neural network, and it predicts a continuous stream of cognitive load from zero to two. Because each of the zero-back, one-back, two-back classes has a confidence associated with it, you can turn that into a real value between zero and two.
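One simple way to read that conversion, assuming it is the expected value of the class index under the predicted confidences, is:

```python
# Turn the three class confidences into one continuous cognitive load value in
# [0, 2], assuming an expected-value reading of the class probabilities.
def continuous_cognitive_load(p_zero_back, p_one_back, p_two_back):
    return 0 * p_zero_back + 1 * p_one_back + 2 * p_two_back

print(continuous_cognitive_load(0.1, 0.3, 0.6))  # 1.5, fairly high load
```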

And what you see here is a plot of three of the people on the team here, driving a car, performing a task of conversation. And in white, showing the cognitive load, frame by frame. At 30 frames a second, estimating the cognitive load of each of the drivers, from zero to two on the y-axis.

So these are high cognitive load moments, shown on the bottom in red and yellow for high and medium cognitive load. And when everybody's silent, the cognitive load goes down. So now, with this simple neural network and the training data that we formed, we can extend to any arbitrary new data set and generalize.

Okay, those are some examples of how neural networks can be applied. And why is this important? While we focus on the perception task of using neural networks, of using sensors and signal processing to determine where we are in the world, where the different obstacles are, and to form trajectories around those obstacles, we are still far away from completely solving that problem.

I would argue 20 plus years away. The human will have to be involved and so when the system is not able to control, when the system is not able to perceive when there's some flawed aspect about the perception or the driving policy, the human has to be involved. And that's where we have to know, let the car know what the human is doing.

That's the essential element of human-robot interaction. The most popular car in the United States today is the Ford F-150: no automation. The thing that inspires us and makes us think that transportation can be fundamentally transformed is the Google self-driving car, the Waymo car, and all of our guest speakers and all the folks working on autonomous vehicles.

But if you look at it, the only ones who are at a mass scale, or beginning to be, actually injecting automation into our daily lives, are the ones in between. It's the L2 systems: the Tesla Autopilot, the Super Cruise, the Audis, the Volvo S90s, the vehicles that are slowly adding some degree of automation and teaching human beings how to interact with that automation.

And here it is again. The path towards mass-scale automation, where the steering wheel is removed, where the consideration of the human is removed, I believe is more than two decades away. On the path to that, we have to understand and create successful human-robot interaction, to approach autonomous vehicles, autonomous systems, in a human-centered way.

The mass-scale integration of these systems, of the human-centered systems, like the Tesla vehicles, hasn't happened yet; Tesla is just a small company right now. These kinds of L2 technologies have not truly penetrated the market, have not penetrated our vehicles, even the new vehicles being released today. I believe that happens in the early 2020s.

And that's going to form the core of our algorithms that will eventually lead to the full autonomy. All of that data, what I mentioned with Tesla with the 32% miles being driven, all of that is training data for the algorithms. The edge cases arise there. That's where we get all this data.

Our data set at MIT is 400,000 miles. Tesla has a billion miles. So that's all training data on the way, on the stairway to mass-scale automation. Why is this important, beautiful, and fundamental to the role of AI in society? I believe that self-driving cars, when approached in this way, focused on the human-robot interaction, are personal robots.

They're not perception-control systems, tools like a Roomba performing a particular task. When human life is at stake, when there's a fundamental transfer of life, of a human being giving their life over to an AI system directly, one-on-one, that is the kind of relationship that is indicative of a personal robot.

It requires all the elements of understanding, of communication, of trust. It's fascinating to understand how a human and a robot can form enough trust to create almost a one-to-one understanding of each other's mental state, to learn from each other. Oh boy. So, one of my favorite movies, Good Will Hunting. We're in Boston, Cambridge, I have to. I'm gonna regret this one.

This is Robin Williams speaking about human imperfections. So I'd like you to take this quote and replace every time he mentions girl with car. People call those things imperfections. Robin Williams is talking about his wife who passed away in the movie. Talking about her imperfections. They call these things imperfections, but they're not.

That's the good stuff. And then we get to choose who we let into our weird little worlds. You're not perfect, sport. And let me save you the suspense. This girl you met, she isn't perfect either. You know what, let me just... (man speaking faintly) Well, those are the little idiosyncrasies that only I know about.

That's what made her my wife. Oh, and she had the goods on me too, she knew all my little peccadilloes. People call these things imperfections, but they're not. Oh, that's the good stuff. And then we get to choose who we let into our weird little worlds. You're not perfect, sport.

Let me save you the suspense. This girl you met, she isn't perfect either. But the question is whether or not you're perfect for each other. That's the whole deal. That's what intimacy is all about. Now, you can know everything in the world, sport, but the only way you're finding that out is by giving it a shot.

So the approach we're taking in the autonomous vehicle we are building here at MIT, in our group, is the human-centered approach to autonomous vehicles, which we're going to release in March of 2018 on the streets of Boston. Those who would like to help, please do. I will run a course on deep learning for understanding the human at CHI 2018.

We'll be going through tutorials that go far beyond the visual, the convolutional neural network based detection of various aspects of the face and body. We'll look at natural language processing, voice recognition, and GANs. If you're going to CHI, please join. Next week, we have an incredible course that aims to understand, to begin to explore, the nature of intelligence, natural and artificial.

We have Josh Tenenbaum, Ray Kurzweil, Lisa Feldman Barrett, Nate Derbinsky looking at cognitive modeling architectures, Andrej Karpathy, Stephen Wolfram, Richard Moyes talking about autonomous weapon systems and AI safety, Marc Raibert from Boston Dynamics and his amazing, incredible robots, and Ilya Sutskever from OpenAI, and myself. So what next?

For folks registered for this course, you have to submit by tonight a DeepTraffic entry that achieves a speed of 65 miles an hour, and I hope you continue to submit more entries that win the competition. The high performer award will be given to the very few folks who achieve 70 miles an hour or faster.

We will continue rolling out SegFuse, having hit a few snags and invested a few thousand dollars in the annotation process, annotating a large-scale data set for you guys. We'll continue this competition, which will take us toward a NIPS submission, where we hope to submit the results of this competition, and DeepCrash, the deep reinforcement learning competition.

These competitions will continue through May 2018. I hope you stay tuned and participate. There are upcoming classes. The AGI class, which I encourage you to come to, is going to be fascinating, and there are so many cool, interesting ideas that we're going to explore. It's gonna be awesome. There's an introduction to deep learning course that I'm also a part of, where it gets a little more applied and shows folks who are interested in the very basic algorithms of deep learning how to get started with them hands-on.

And there's an awesome class, which I also mentioned last year for those who took this class then, on the global business of AI and robotics. The slides are online. I encourage you to click the link there and register. It's in the spring, it's once a week, and it truly brings together a lot of cross-disciplinary folks to talk about ideas of artificial intelligence and the role of AI and robotics in society.

It's an awesome class. And if you're interested in applying deep learning methods in the automotive space, come work with us, or collaborate; we have a lot of fascinating problems to solve. So with that, I'd like to thank everybody here, everybody across the community that's been contributing. We have thousands of submissions coming in for DeepTraffic, and I'm just truly humbled by the support we've been getting, and the team behind this class is incredible.

Thank you to NVIDIA, Google, Amazon Alexa, Autoliv, and Toyota. And today we have shirts: extra large, extra extra large, and medium over there, small and large over there. The big and small people over here, and then the medium-sized people over here. So grab one and enjoy.

Thank you very much. (audience applauding)