
MIT 6.S093: Introduction to Human-Centered Artificial Intelligence (AI)


Chapters

0:00 Introduction to human-centered AI
5:17 Deep Learning with human out of the loop
6:11 Deep Learning with human in the loop
8:55 Integrating the human into training process and real-world operation
11:53 Five areas of research
15:38 Machine teaching
19:27 Reward engineering
22:35 Question about representative government as a recommender system
24:27 Human sensing
27:06 Human-robot interaction experience
30:28 AI safety and ethics
33:10 Deep learning for understanding the human
34:6 Face recognition
45:20 Activity recognition
51:16 Body pose estimation
57:24 AI Safety
62:35 Human-centered autonomy
64:33 Symbiosis with learning-based AI systems
65:42 Interdisciplinary research

Transcript

Welcome to Human-Centered Artificial Intelligence. The last couple of decades of developments in deep learning have been exciting for the problems we've been able to automate, the problems we've been able to crack with learning-based methods. One of the ideas underlying this lecture and the following lectures is that with the purely learning-based approach we have been using, there are certain aspects fundamental to our reality where we're going to hit a wall, and that we have to integrate, incorporate the human being deeply into learning-based systems in order to make those systems learn well and operate in the real world.

The first prediction underlying the idea of human-centered AI in this century is that the learning-based approaches that have been successful over the past two decades, deep learning and other machine learning approaches that learn from data, are going to continue to get better and dominate real-world applications. So as opposed to fine-tuned, optimization-based models that do not learn from data, more and more we're going to see learning-based methods dominate real-world applications.

That's the underlying prediction we're working with. Now, if that's the case, the corollary is this: if learning-based methods are the solution to many of these real-world problems, then the way we get smarter AI systems is by improving both the machine learning and the machine teaching. Machine learning is the thing that we've been talking about quite a bit.

That's the deep learning, that's the algorithm, the optimization of neural network parameters where you learn from data. That's the current focus of the community, current focus in the research, and the thing that's behind the success of much of the developments in deep learning. And then there's the machine teaching.

That's the human-centered part. It's optimizing not the models, not the algorithms, but optimizing how you select the data based on which the algorithms learn. It's to make better teachers. Just like when you yourself are learning as a student or as a child how to operate in this world, the world and the parents and the teachers around you are informing you with very sparse information, but providing the kind of information that is most useful for your learning process.

The selection of data from which to learn is, I believe, the critical research direction we have to solve in order to create truly intelligent systems, ones that are able to work in the real world, and I'll explain why. Consider the implications of learning-based systems.

So when you have a learning system, a system that learns from data, neural networks, machine learning, the fundamental reality is that the model is trying to generalize across the entirety of the reality in which it will be tasked with operating, based on a very small subset of samples from that reality, and that generalization means that there's always going to be a degree of uncertainty.

There's always going to be a degree of incomplete information, and so no matter how much we want them to be, these systems will not be provably safe; we can't put down any concrete guarantee that they're safe in some specific way unless the setting is extremely constrained. Therefore, we need human supervision of these systems.

The systems will not be provably fair from an ethics perspective, from a discrimination perspective, across all dimensions of fairness. Therefore, we need human supervision of these systems. And they will not be explainable: at every step of the pipeline in which they make decisions, AI systems will not be perfectly explainable to the satisfaction of us as human supervisors.

So there again, human supervision will constantly be required. The solution to this is a whole set of techniques and ideas that we're putting under the flag of human-centered artificial intelligence, human-centered AI. The core idea there is that we need to integrate the human being deeply into the annotation process and deeply into the supervision of the real-world operation of the system, so both in the training phase and in the testing phase, the execution, the operation of the system.

So this is what deep learning looks like with the human out of the loop. The human contributes to a learning model by helping annotate some data, and that data is then used to train a model that hopefully generalizes in the real world, and that model makes decisions. Deep learning is really exciting because, with a greater and greater degree of autonomy, it's able to form high-level representations of the raw data in a way that does quite well on certain kinds of tasks that were previously very difficult. But fundamentally, the human is out of the loop, both of the training and of the operation.

First, you build the dataset, annotate the dataset, and then the systems run away with it. They train on the data, and the real-world operation does not involve the human except as the recipient of the service the system provides. Now, the human-in-the-loop version of that, the human-centered version of that, means that both the annotation and the operation of the system are aided by human beings in a deep way.

What does that mean? So we can look at human experts, individuals, and at crowd intelligence, the wisdom of the crowd and the wisdom of the individual. At the training phase, the first part of that is objective annotation. We need to significantly improve objective annotation, meaning annotation where the intelligence of a single human is sufficient to look at a sample and annotate it.

This is what we think about with ImageNet and all the basic computer vision tasks, where a single human is enough to do a pretty damn good job of determining what's in a particular sample. And then there's subjective annotation, things that are difficult for a single human being to determine on their own but where, as a crowd, we kind of converge on these difficult questions.

At the low level, these are questions about emotion, things that are a little bit fuzzy, that require multiple people to annotate, and at the high level they are ethical questions about decisions that an AI system is tasked with making, or that we're tasked with making, that nobody really knows the right answer to.

And as a crowd, we kind of converge on the right answer. That's where crowd intelligence comes in at the data annotation step. Now, in the operation phase, once you train the model, supervision of the system based on the wisdom of the individual, and I'll give more concrete examples of this, looks like operating an autonomous vehicle: a single driver is tasked with supervising the decisions of that AI system.

That's a critical step for a learning based system that's not guaranteed to be safe, that's not guaranteed to be explainable. And the subjective side of that, where the crowd intelligence is required, where a single person is not able to make it, these are, again, ethical questions about the operation of autonomous systems.

The supervision of autonomous vehicles, the supervision of systems in the medical diagnosis, in medicine in general, and this is AI operating in the real world, making ethical decisions that are fundamentally difficult decisions for humans to make, and that's where the crowd intelligence needs to come in. And so we have to transform the machine learning problem by integrating the human being.

First, up top, in the training process: on the left, that's the usual machine learning formulation of a human being doing brute-force annotation of some kind of dataset, cats and dogs in ImageNet, a segmentation dataset like Cityscapes, video action recognition in the YouTube dataset. Given the dataset, humans put in a lot of expensive labor to annotate what's going on in that data, and then the machine learns.

The flip side of that, the machine teaching side, the human-centered side, is that the machine, the learning model, the learning algorithm, mostly neural networks here, is instead tasked with selecting the small, sparse subsets of the data that are most useful for the human to annotate.

So instead of the human doing the brute-force annotation task first, the machine queries the human. This is the field called machine teaching. The machine queries the human with questions, and the task, and this is a wide-open research field, is to reduce by several orders of magnitude the amount of data that needs to be annotated.

In the real world operation side, the integration of the human looks like this. On the left, the machine, now trained with the learning model, makes decisions, and the human living in this world receives the service provided by the machine, whether that's medical diagnosis, whether that's an autonomous vehicle, whether that's a system that determines whether you get a loan or not, so on.

With the human-centered version of that, the machine makes a decision, but it's able to provide a degree of uncertainty. That's one of the big requirements: to be able to specify a degree of uncertainty in that decision, such that when confidence falls below a certain threshold, human supervision is sought.

And again, in that decision, whether that's a costly decision financially or a costly decision in terms of human life, human supervision is sought. And the service is received by humans, either the very same humans that are providing the supervision or another set of humans. But ultimately, the decision is overseen by human beings.

This is what I believe is going to be the defining mode of operation for AI systems in the 21st century: as much as we'd like to, we won't be able to create perfect AI systems that escape the need to work together with human beings at every step. There are five areas of research, grand challenges, that define human-centered AI.

I'll focus on a few today, and on one very much so. And even with that high degree of pruning, we have 120 slides, so I'll skip around. But on human-centered AI during the learning phase, there are the methods, the research arm of machine teaching. How do we select data, how do we improve supervised learning?

As opposed to needing 10,000, 100,000, a million examples, how do we reduce that, so that the algorithm queries only the essential elements and is able to learn effectively from very little information, from very few samples? Just like we do when we're students, when we learn fundamental aspects of math, language, and so on, we just need a few examples.

But those examples are critical to our understanding. The second part of that is reward engineering: during the learning process, injecting the human being into the definition of the loss function, of what's good and what's bad. Systems that have to operate in the real world have to understand what our society deems as good and bad.

And we're not always good at injecting that at the very beginning. There has to be a continuous process of adjusting the rewards, of reward re-engineering by humans, so that we can encode human values into the learning process. Now, on the second part, on the human-centered AI during real-world operation, when the system's actually trained, there is the interactive element of robots and humans working together.

The part I'll focus on quite a bit today, because there's been quite a lot of development and progress on the deep learning side, is human sensing: algorithms that understand the human being. Algorithms that take raw information, whether that's video, audio, or text, and begin to get a context, a measure of the state of the human being in the short term and over the long term, the temporal understanding and the instantaneous understanding.

Then there is the interaction aspect. So once you understand the human, the perception problem, you have to interact with them and interact in such a way that it's continuous, collaborative, and a rich, meaningful experience. We're in the very early days of creating anything like rich, meaningful experiences with AI systems, especially learning-based AI systems.

And then there's safety. In real-world operation, safety and ethics, the results of the engineered rewards that were in place during the learning process, now unroll and come to fruition. We need to make sure that the trained model does not result in things that are catastrophic to our safety, or highly detrimental to what we deem as good and bad in society, in terms of discrimination, ethical considerations, and all those kinds of things.

The gray area, the line we all walk as a society through crowd intelligence, is where we have to provide bounds on AI systems. There's an entire body of work there, and I'll mention what we're doing in that area. So first, on the machine teaching side and efficient supervised learning: I'd like to do roughly one slide on each of these areas, to give you an idea, and do two things for each area that we will elaborate on in future lectures, some of which I'll elaborate on today.

First, the near-term directions of research, the things that are within our reach now, and then a sort of thought experiment, a grand challenge, that if we can do it, that'll be damn impressive. That will be a definition of real progress in this area. So the near-term direction of research for machine teaching, for improved supervised learning, integrating the human into the annotation process, is, instead of annotating by brute force, to annotate by asking the human questions.

So we have to transform the way we do annotation, where the process is not defining the dataset and then going through the entire dataset; instead, a machine teaching system queries the user with questions to annotate. And on the algorithm side, there's active learning. These are all areas of work where we can be more clever about the way we use data and select the data on which to train.

So active learning is actively selecting, during the training process, which part of the data to train on and annotate. Data augmentation is taking things that have been supervised by a human and expanding them, modifying the data, warping the data in interesting ways so that it multiplies the human effort that was injected into helping understand what's in the data.

One-shot learning, zero-shot learning, and transfer learning are all in that category. And self-play is in the reinforcement learning area, where the system constructs a model of the world, goes off on its own, and plays with that model to try to figure out the constraints of that model and how to achieve good things there.
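To make the active learning idea above concrete, here is a minimal sketch of pool-based uncertainty sampling, where the model, not the human, chooses which samples to send for annotation. It's illustrative only, not from the lecture: the `model.predict_proba` interface and the `annotate` callback are assumed placeholders for whatever classifier and human-labeling channel you have.

```python
import numpy as np

def uncertainty_sampling_round(model, unlabeled_pool, annotate, budget=10):
    """One round of pool-based active learning via uncertainty sampling.

    model          -- any classifier exposing predict_proba(X) (assumed placeholder)
    unlabeled_pool -- numpy array of unlabeled samples, shape (N, ...)
    annotate       -- callable that asks a human for labels (assumed placeholder)
    budget         -- number of samples the human is asked to label this round
    """
    # Predicted class probabilities for every unlabeled sample.
    probs = model.predict_proba(unlabeled_pool)

    # Uncertainty = 1 - confidence of the most likely class (least-confidence strategy).
    uncertainty = 1.0 - probs.max(axis=1)

    # The machine queries the human only about the most uncertain samples.
    query_idx = np.argsort(-uncertainty)[:budget]
    new_labels = annotate(unlabeled_pool[query_idx])

    return query_idx, new_labels
```

In a full loop, the newly labeled samples would be added to the training set, the model retrained, and the round repeated, so the annotation effort concentrates on the samples the model currently finds most informative.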

An example grand challenge here that would define serious progress in the field: take the ImageNet challenge or the COCO object detection challenge and, training only on a totally different kind of data, achieve state-of-the-art results. So training only on Wikipedia, with the text and images that are there on Wikipedia, be able to perform object detection on the state-of-the-art benchmark of COCO.

COCO is a dataset of different objects with rich annotation of the localization of those objects. That, I believe, is exactly the kind of thing where all the problems in transfer learning and efficient data annotation, machine teaching, have to be solved to achieve it. Another challenge you can think of, simplifying even more, is to achieve a 0.3% error on MNIST, the handwritten digit recognition task that everybody always provides as an example.

So achieve a very good, state-of-the-art accuracy by training on only a single example of each digit, as opposed to training on thousands. That's something that most of us humans can do: given one example of each character in a language you haven't seen before, after studying them for a little bit, you're able to classify future characters with high accuracy.

The second part of the learning process where the human needs to be injected, and the near-term direction of research there, is reward engineering, and the continuous tuning of those rewards by a human being. OpenAI is doing quite a bit of work here; here's a game played by a human and an AI, and it's really my favorite example of this.

On the left, a human is controlling a boat that's finishing a race. On the right is an RL agent, a reinforcement learning agent, controlling a boat that's trying not to finish the race but to maximize the reward defined beforehand, initially, by a human being. And what it finds is that you can get much more reward by collecting the green turbos that appear along the course than by finishing the race.

It realizes that finishing the race actually gets in the way of maximizing reward. So that's the unintended consequence of a reward function that was specified beforehand, and most human supervisors looking at this result would be able to re-engineer the reward function to get the AI system here to finish the race.
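To make the re-engineering step concrete, here is a toy sketch, not the actual game's reward and not from the lecture, of how a human supervisor might adjust a misspecified racing reward after watching the turbo-collecting behavior; the `state` fields and weights are made up for illustration.

```python
def reward_v1(state):
    """Original reward: only pickups are rewarded, so the agent learns
    to circle the turbo respawn points and never finish the race."""
    return 10.0 * state["turbos_collected"]

def reward_v2(state):
    """Re-engineered reward after human review: progress and finishing
    dominate; pickups become only a small bonus."""
    reward = 1.0 * state["progress_along_track"]         # dense shaping term
    reward += 100.0 if state["finished_race"] else 0.0   # sparse goal term
    reward += 0.5 * state["turbos_collected"]             # de-emphasized bonus
    return reward
```

The point of the sketch is the workflow, not the numbers: a human watches the trained behavior, notices the loophole, and re-weights the objective before training continues.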

And that kind of continuous monitoring of the performance of the system during the training process is a near-term direction of research that a few groups, DeepMind, OpenAI, and ourselves, are taking on. An example grand challenge is allowing an AI system to operate in a context where there's a lot of fuzziness for us humans.

There's a lot of uncertainty, there's a lot of gray area, there are a lot of challenging aspects in terms of what is right and what is wrong that we continually need to improve on. The example I provide here involves one of the least popular things in the world: the US Congress.

So consider replacing the US Congress, the body of representatives of the people of the United States who make bills based on the beliefs of the people. That sounds a lot like what Netflix does in recommending what movie you should watch next, representing what people love to watch. So it's essentially a recommender system.

So it makes perfect sense that an AI system should be able to take on this challenge. And I see that as a grand challenge: replacing some of the fundamental representation of large crowds of people making ethical decisions with a human-centered AI system. Okay, in real-world operation, the first thing we have to do, before we have a robot and a human work together, is have the robot perceive the human.

Question. - The question was: there's currently Congress; do you want to change the way Congress works and make it better, or do you want to just take the system as it currently is and automate it? The idea is to take the system as it's currently supposed to work and automate that.

So an AI system can provide a lot more transparency about the inputs. The idea of Congress is that the only inputs are supposed to be the people and the beliefs of the people. And there's rich information there. For example, for me, not saying anything about politics, there are certain issues I care a lot about and certain issues I don't care much about.

And let's put that aside. Then there are certain issues that I know a lot about and certain issues I know very little about. And those don't actually intersect that well. I'm very opinionated about things I don't know anything about. It's very common; all of us are. So the challenge is being able to put that representation of me into a system, for our entire nation together, and be able to make bills that represent the people.

Now, the challenge there is that it can't be just a training set and then the system operates, AI running the country. No, there has to be that human-centered element where we're constantly supervising, just like we're, in theory, supposed to be supervising our congressmen and congresswomen. Human sensing is the first part: in order to have an AI system that works with a human being, the AI system has to perceive and understand the state of the human being, at the very simplest level and at the more complex, temporal, contextual, over-time level.

So the near-term direction of research is purely the perception problem, where deep learning shines: taking data, whether that comes in as visual, audio, text, and so on, and being able to classify the physical, mental, and social state, the social context, of the person. This is what I'll cover a little bit of today: everything from face detection, face recognition, emotion recognition, natural language processing, body pose estimation, those same recommender systems, speech recognition, all of those conversions of raw data that capture something about the human being into actually meaningful, actionable information.

The grand challenge there is emotion recognition. You know, there have been a lot of companies and claims that we've somehow cracked emotion recognition, that we're able to determine the mood of a person. But really, and this is for those who were here last year with Lisa Feldman Barrett, if you're very honest and you study emotional intelligence, emotion, and the expression of emotion, it's a fascinating area, and we're not even close to being able to build perceptual systems that detect emotion.

What we're doing, more so, is detecting very simple facial expressions that correspond to our storybook versions of emotion: smiling, crying, frowning in a caricatured way. So if you want to build a system that does real emotion recognition with high accuracy, you can think of it as stated here: an AI system that solves a binary classification problem, with 95% accuracy, of whether you want to be left alone or not.

And being able to do that after collecting data for 30 days. That I see as a really clean formulation of exactly the kind of human understanding we need to be able to build in our learning models. And we're very far away from that, especially the long temporal aspect of that, of being able to integrate data over a long period of time.

Then the second part of human robot interaction in the real world operation is the experience. This is where we're now just beginning to consider that interactive experience of how do we have a rich fulfilling experience. We have autonomous vehicles, for example, semi-autonomous vehicles, whether that's Tesla, Volvo, Super Cruise with the Cadillac.

There are a bunch of systems now with greater and greater degrees of automation in the car, and we get to have the human interact with that AI system while trying to figure out how to create a rich, fulfilling experience. In the current Volvo system, that experience is more limited.

There's a little icon; it's more of a traditional driving situation. In the Tesla, you have a much bigger display of what's going on. In the Cadillac Super Cruise system, there's a camera looking at your eyes, determining whether you're awake or not, paying attention or not.

And that, there's like an experience there that we're trying to create. And in the Tesla case, the miles are racking up. We have real data. Here at MIT, we're studying this exact interaction. There's now over a billion miles driven in the Tesla. And the same in the fully autonomous side with Waymo, they've now reached 10 plus million miles driven autonomously.

And there's a lot of people experimenting with this. But that's that collaborative interaction of going back and forth, of being able to, for the AI system to express the degree of uncertainty as about the environment. About the AI system being able to express when it needs help and not.

Being able to communicate its limitations and capabilities and so on. Trading off control. Being able to seek human supervision. There's a dance there that takes into consideration everything from neurobiological research to psychology to deep learning, to the pure robotics and HRI, human-robot interaction, aspects.

One grand challenge would be, Tesla's driven one billion miles now under autopilot, under the semi-autonomous mode. The grand challenge here is when we start getting to the kind of mileage that we see in the United States every year, you start getting into the hundreds of billions of miles driven semi-autonomously.

We get to see teenagers, 16, 17, 18, using these systems for the first time. We get to see older folks, folks who don't necessarily drive much or use any kind of AI in their lives, get to use these systems. We start to explore that aspect. That's the real challenge. And of course, there's the old Turing test, now reimagined by Alexa with the Alexa Prize social bot challenge of creating natural language conversation.

It's such a beautiful domain to explore human-robot interaction in, both on the audio side and on the text side: passing the Turing test. That's a true grand challenge in a real way, where you want to have a conversation with a robot for prolonged periods of time, maybe more than even with some of your other friends.

And on the other side of friends is the risk, the catastrophic risk that's possible when you have an AI system that's learning from data. The near-term direction of research is purely the human supervision of AI decisions in terms of safety and ethics. There are a lot of systems, like with cars or medical diagnosis and so on, with some life-critical, safety-critical aspect whose safety we want to be able to supervise.

And there are ethical decisions in terms of who gets a loan or not, who gets a certain criminal penalty or not. To any degree that AI systems are incorporated into that, you have to consider ethical questions. And even for the crude, low-level perception systems, like face recognition, you want to make sure that your face recognition systems are not discriminating based on skin color or gender or age and so on.

You want to make sure that at that basic, fundamental level of ethics, the systems are trained in a way that they maintain our human values, the better angels of our nature, the better sides of our values, some of the brighter aspects of our values. And the other thing is, maintaining values is the normal case; that's looking at the mean of the distribution.

But we also want to control the outliers, so the AI systems don't do anything catastrophic. The unintended consequences, when something happens that you didn't anticipate, you want to be able to put boundaries on that. And the grand challenge there really all boils down to the ability of an AI system to say that it's uncertain about something.

And that measure of uncertainty has to be good. It has to be able to make a prediction, always accompanied by uncertainty, even on things it hasn't seen before. That's the real challenge: to be trained on cats and dogs, then see a giraffe and say, "I'm not sure what that is." We're quite far away from that, 'cause right now, it'll probably confidently say it's a dog, depending on the giraffe.
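As a minimal sketch of the kind of uncertainty signal being described, assuming a plain softmax classifier, one crude proxy is predictive entropy with a threshold that triggers deferral to a human. The threshold here is an illustrative value, and in practice a plain softmax is often still confidently wrong on out-of-distribution inputs like the giraffe, which is exactly the open problem.

```python
import numpy as np

def predict_or_defer(probs, entropy_threshold=0.5):
    """Return the predicted class index, or None to request human supervision.

    probs -- softmax output over the known classes (e.g., cat, dog)
    """
    probs = np.asarray(probs)
    entropy = -np.sum(probs * np.log(probs + 1e-12))  # predictive entropy in nats
    if entropy > entropy_threshold:
        return None  # "I'm not sure what that is" -> seek human supervision
    return int(np.argmax(probs))

# Example: a confident prediction vs. a spread-out one.
print(predict_or_defer([0.95, 0.05]))  # 0    (confident: classify)
print(predict_or_defer([0.55, 0.45]))  # None (uncertain: defer to human)
```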

But we want to be able to have extremely high accuracy in the ability of AI systems to determine their own uncertainty, to know what they don't know. Because from that comes the supervision. From that comes the ability to stop when facing things it's uncertain about, to avoid catastrophic events. The first aspect of real-world operation is understanding the human.

One of the places where deep learning has really shined is the perception problem. It all begins with the ability to look at raw data and convert it into meaningful information. That's really where understanding the human comes in. Not the kind of understanding where, when you're in a relationship with somebody, when you're friends with somebody, over a long period of time you gain an understanding of their quirks, limitations, capabilities, and so on.

That's really fascinating. But the first step is just to be able to, when you see them, recognize who they are, what's on their mind, what their body language is, what they're saying with their mouth. All those basic raw perception tasks, that's where deep learning really shines. I'd like to cover the state of the art in those various perception tasks.

So first, face recognition. Now, there's a full slide presentation for this, and I'm skipping around. The full slide presentation has the following structure for each of these topics. The first part is the motivation, description, the excitement, the worry, the future impact. And then there are five papers. One defining the quote-unquote old-school seminal work that opened the field.

Then the early progress in the field. Paper three is the recent breakthrough, often associated with deep learning. Paper four is the current state of the art. And paper five is the thing that defines the future direction. The possible set of things that define the future direction. And then the open problems in the field, and where the future research is very much needed.

That's kind of the structure of every topic I'll cover here as quickly as possible. Face recognition. So what is it? It's the first thing, you know, the face contains so much rich information about the state of the human being. So understanding the human being really starts with the face and detecting the face is the first step.

Detecting the body, and then that there's a head on top of that body, that's the first step. And then there is the task of face recognition, which has been an exceptionally active area of research because it has a lot of applications. And through that research, we're now able to study a lot of aspects of how we perform perception on the face.

So recognition, purely stated, is the recognizing the identity of a human face. Who is this? Detection is just detecting a face. Now, recognition means there's a database of identities. What is it? Seven billion of them on earth. And you're trying to determine which of them it is, which of the seven billion it is, or whatever the database is.

The face verification problem is something that your phone uses when you unlock it with your face. Is it saying, is it you or not? Is it Lex or somebody else? It's a database of two, one person versus everybody else. There's a lot of applications here, obviously, from identification to all the security aspects of using the face as a sort of fingerprint of your identity in all the interactive elements of AI systems, software-based systems in this world.

Okay, so why is it hard? So all the usual computer vision problems come in. Lighting variation, pose variation. That's just, computer vision is really hard. It's just you get these raw numbers and you have to infer so many things that us humans take for granted. So the basic computer vision stuff.

But there's stuff on top of that. So with faces, it's like cats versus dogs: there are thousands of breeds of dogs and thousands of breeds of cats. In the same way, faces can look very similar to each other. So the classes that you're trying to separate can be very, very close together and intermingled.

Now, there's a lot of face data available, because of the applications, because of the financial benefits of such datasets. But for any one individual, unless you're Brad Pitt or Angelina Jolie or a celebrity, there are not many samples available. So for the individuals on which the classification is to be made, there's often not very much data.

Then there's a lot of variation. In the face recognition task, you have to be invariant to all the ways you change yourself over time: the hairstyles, the weight gain, the weight loss, the beard you decided to grow, the glasses you wear sometimes and not others, the different styles of glasses, makeup or no makeup.

All of these things, it's still you, still the same identity. You have to be able to classify that. And that kind of accuracy, especially for security applications, has to be extremely high. The reason it's an exciting area is that there's a lot of possibility, but there's also a lot of concern, right?

So the future impact, utopia, dystopia, and the more reasonable middle path: the face provides a very user-friendly way of letting your devices recognize you and say hello. Your voice is certainly another, but one of the most powerful ways to really classify at a distance is the face. So what does that mean?

The utopian view, the possibility of the future, the best, brightest possible future: you can use your face as a passport. You replace the license, replace all the security measures we put in place, from the passwords on our devices to the credit card and so on. Instead of Apple Pay, it'll be face pay.

You show up, and it automatically connects to all your devices, all your banking information, and so on. Obviously, the flip side of that, just rephrasing that sentence, can also be dystopian: complete violations of privacy, being watched at any time, your Facebook, social media, and all your devices being able to identify you, making it impossible for you to hide from society.

The fundamental aspects of privacy, maintaining the privacy that many of us value greatly. The middle path is really just a useful way to unlock your phone. The recent breakthroughs here started with DeepFace. The essential idea there is applying deep neural networks to the task of face recognition. With a lot of the breakthroughs on the perception side, we're not covering the old-school papers and the historical context here; the biggest breakthroughs came with deep learning, 2006, '07, '08, the last 10 years.

The same is true with face recognition. DeepFace was the first big application that achieved near-human performance on one of the big benchmarks at the time, Labeled Faces in the Wild. So using a very large dataset, it was able to form a good representation.

The state of the art, or at least close to the state of the art, is FaceNet. The key idea there is using those same deep architectures to now optimize for the representation itself directly. The notebook we're putting out, which we shared with some of you for the assignment, describes face recognition and the challenge there: it's not like the traditional classification problem.

You have to form an embedding of the face into a small, compressed vector, such that in that embedding, faces of the same identity are close in the Euclidean sense, and faces of people who are very different are far away.

And so you use that embedding to then do the classification. That's really the only way to deal with datasets where you have so little information on any one individual person. And so FaceNet optimizes that embedding in a way that directly optimizes the Euclidean distances between matching and non-matching identities.
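Here is a minimal sketch of the triplet-style objective and verification-by-distance idea being described, assuming PyTorch; the margin, threshold, and normalization choices are illustrative rather than FaceNet's exact settings.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss on L2-normalized face embeddings.

    anchor, positive -- embeddings of two images of the same identity
    negative         -- embedding of an image of a different identity
    The loss pushes same-identity pairs closer than different-identity
    pairs by at least `margin` in squared Euclidean distance.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negative = F.normalize(negative, dim=-1)

    d_pos = (anchor - positive).pow(2).sum(dim=-1)  # same identity: small
    d_neg = (anchor - negative).pow(2).sum(dim=-1)  # different identity: large
    return F.relu(d_pos - d_neg + margin).mean()

def same_person(emb_a, emb_b, threshold=1.1):
    """Face verification: compare embedding distance to a tuned threshold."""
    emb_a, emb_b = F.normalize(emb_a, dim=-1), F.normalize(emb_b, dim=-1)
    return (emb_a - emb_b).pow(2).sum(dim=-1) < threshold
```

The embedding network itself is trained so this loss is small; at verification time, no retraining is needed for a new person, you just compare distances, which is what makes the approach work with very few samples per identity.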

So there's still a lot of excitement about face recognition. There are a lot of benchmark competitions and a lot of people working on this, and really, bigger, badder networks and more data are one of the ways to crack this problem. So a public large dataset with 672,000 identities and 4.7 million photos, that's 2017, and that just keeps scaling up and up and up.

Now, we also have to be honest here about the possible future directions of work: even though the benchmarks are growing, that's still a tiny subset of the people in the world. We're still not quite at the point of having general face recognition applicable to the entire population, or a large swath of the population, of the world.

In this brief coverage of the topic, we're not covering all the aspects of the face, especially temporal aspects, that are useful in face recognition or useful for saying a lot of things about the face, such as FACS, the Facial Action Coding System, the different kinds of facial expressions that can then be used to infer emotion and so on.

Raised eyebrows and all those kinds of things that can provide rich information for recognizing and interpreting the face, and the different other modalities, including 3D face recognition, we're not covering. There's a lot of exciting areas there. We're just looking at the pure formulation of the face recognition problem of looking at a 2D single image.

The first open problem here, not often stated and often misinterpreted, is that most of these face recognition methods start by assuming that you have a bounding box around the face, and they're assuming a frontal or near-frontal view of the face.

But you can do recognition in all kinds of poses. And it's very interesting to think that recognition, the way we recognize our friends and colleagues, parents and children, often uses a lot of cues and context information beyond just the pure frontal view of the face. We can do pretty well on profile views, from body language, and so on.

So all those things, how we incorporate them into face recognition, are open in the field. Then the black-box side is problematic, both for bias and for being able to understand why incorrect decisions are made; the challenge is making those face recognition systems more interpretable. And then finally, privacy: the ability to collect the kind of data on which face recognition would perform extremely well, while not violating the fundamental aspects of privacy that we value.

Activity recognition, taking the next step forward here into the richer temporal context of what people do. Again, the same structure from recent breakthroughs to the future direction of work. What is it? It's classifying human activity from images or from video. And why is it important? Depending on the level of abstraction for the activity, it provides context for understanding the human.

What are they doing? Are they playing baseball? Are they singing? Are they sleeping? Are they putting on makeup, knitting, so on, mixing butter? Why is it hard? Again, all the usual problems in image recognition. The kind of data we're dealing with is just much larger. The kind of video, the richness of possibilities that define what activity is, is much larger.

So the complexity is much larger. It's often difficult to quantify motion, because the fundamental aspect of activity is change in the world, the motion of things. And it's difficult to determine, from the dynamics and physics of the world, especially from a 2D view, what's background information, what's noise, and what's essential to understanding the activity.

And the subjective, ambiguous elements of activity. When does a particular activity begin? When does it end? What's all the gray areas when you're partially engaging in that activity and so on? When you start to annotate these things, when you start to try to do the detection, it becomes clear that sometimes the activity is partially undertaken and the beginning and the end is fuzzy.

Future impact, utopia, dystopia, middle path. So the impact here comes from being able to understand the world in time and being able to predict. The utopian possibility is that the contextual perception that can come from this can enrich the experience between the human and the robot. The dystopian view, the flip side, is that being able to understand human activities can let the robots sever the relationship.

So it can damage the human-robot interaction to where they just do their own thing. The middle path is just finding useful information in massive amounts of data like YouTube. Now there's a YouTube video dataset, and the task is being able to identify what's going on in a video, to infer rich, useful semantic information.

And so what do we do with video? How do we do perception in video? The recent breakthrough came with deep learning and C3D, these 3D convolutional neural networks that take a sequence of images and are able to determine, in an end-to-end way, the action that's going on in the video.

That was the recent breakthrough. The state of the art comes from a different architecture that takes in two streams: one is the RGB image data, the other is optical flow data that's really focused on the motion in the image. That's what opened the wave of two-stream networks.

Here, from that paper, showing the different architectures: on the far right is the two-stream architecture, and C3D, shown under B here, takes a sequence of images. The first one uses LSTMs. These are all just different architectures for how you allow a network, a learning model, to capture the dynamics in the data.
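As a rough illustration of the 3D-convolution idea behind C3D, here is a toy PyTorch network, far smaller than any published architecture, that convolves a clip jointly over space and time; the layer sizes and the 101-class output are arbitrary choices for the sketch.

```python
import torch
import torch.nn as nn

class TinyC3D(nn.Module):
    """Toy 3D-convolutional action classifier in the spirit of C3D.

    Input: video clips of shape (batch, 3, frames, height, width).
    The temporal dimension is convolved just like the spatial ones,
    so motion patterns across frames are learned end to end.
    """
    def __init__(self, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),   # pool space only at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),           # pool space and time
            nn.AdaptiveAvgPool3d(1),               # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):
        x = self.features(clip).flatten(1)
        return self.classifier(x)

# Example: a batch of two 16-frame, 112x112 RGB clips -> logits of shape (2, 101).
logits = TinyC3D()(torch.randn(2, 3, 16, 112, 112))
```

A two-stream variant would run a second network of the same flavor on precomputed optical flow and fuse the two sets of logits.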

How do you allow a learning model to be able to capture the dynamics in the data? The future possibilities has to do, well, literally with the future, being able to take single images or sequences of images and predicting the future. It's very interesting to think about in our ability to hallucinate the future, and generate the future from images, you start to think about what are the defining qualities of activities, and in this way, augment data and be able to train much more accurate action recognition systems.

A topic not covered is the localization of activity in video. So action recognition, purely defined, is: I give you a clip and you tell me what's going on in this clip. Now, if you take a full YouTube video, you want to be able to localize, to find all the times when a particular activity is going on.

It could be multi-label, with multiple activities going on at the same time, beginning and ending asynchronously. And then there's richer three-dimensional or 2D classification of activity based on human movement: looking at skeleton-based action recognition from 3D sensors, like a Kinect, that provide you more than just the 2D image data.

The open problems is that activity recognition is more than just the way we move our body, or if it's baseball, like a ball in your hand and hitting it with a baseball bat. It also has to do with context. There's sitting down or working or looking at something, picking up an item.

Those sometimes can change profoundly based on the other objects in the scene and the activity of other people in the scene. And so being able to work with that kind of context is a totally open problem. It's having to reduce a very complex real world context into something where you can clearly identify an activity.

Body pose estimation is the task of localizing the joints that form the skeleton of the human body. So infer from visual information, the positions of the different joints. Along the line of complexity, it's important to be able to understand the body language, the rich information about the body of the human being.

So that's from reading body language to animation, to aiding activity recognition. And it's just a useful representation of the human body. If you're analyzing pedestrians or in interactive environments, human robot interaction, being able to understand what the heck it is the human is trying to do. A body pose is really useful.

It's hard because when you look at a 2D image projection of the body, it's a high-dimensional optimization problem figuring out how the raw pixels map to the actual three-dimensional orientation of the human joints. And there are the usual computer vision challenges of pose, lighting, and so on.

The future impact: it's really exciting for interactive environments for a robot to be able to know the position of the human body with which it's trying to interact. Whether it's a robot that's trying to get its favorite human a beer or whatever your favorite choice of drink, you have to be able to find where their hand is so you can do the hand-off.

Same thing in the car. You have to determine if the person's hands are on the steering wheel, if their head and orientation is such that they're able to physically take control of the vehicle. That's a really exciting set of possibilities there. And there's applications in sports and CGI and video games and all aspects when the robot and human have to work together.

The dystopian view you can imagine is, of course, being able to localize all those joints means robots that are able to more effectively hurt humans. And so that's always a huge concern and always a dark dystopian view of the world with so much AI in it. Of course, the reality is, it's just more rich, fulfilling HCI that takes advantage of not just the face, stuff coming from the face, but also the body of the human that the robot is interacting with.

So it started with deep learning being applied to the body pose estimation problem, in 2014 with DeepPose. The key idea there is looking at the holistic human pose estimation problem: detecting all the different joints of a single person in an image. The power of deep learning is that you no longer have to do handcrafted, expert-engineered features; it automatically determines the set of features.

All the parts are detected for you, so this highly complex problem is all solved with data. The state of the art, from 2017 and beyond, there have been a few papers from CMU along this line, is doing real-time multi-person 2D pose estimation, but in a bottom-up way where you're detecting individual joints first.

So all the knees in the picture, all the elbows, all the shoulders, all the wrists, and so on, and then stitching them together using part affinity fields to determine what's most likely. So if you find 17 elbows in a picture, you then have to figure out which elbow belongs to which person.

That actually turns out to be an extremely powerful way to detect body pose, especially multi-person pose, and especially to deal with occlusions. It's really interesting, and because of the separation of the detections, it's also able to run in real time, which is really exciting.
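A highly simplified sketch of the bottom-up matching step: the real method integrates a learned part affinity field along each candidate limb and matches greedily, while here a pairwise affinity score matrix is simply assumed as input and matched with the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_joints(affinity):
    """Assign detected elbows to detected wrists given pairwise affinity scores.

    affinity -- matrix of shape (num_elbows, num_wrists); higher means more likely
                to belong to the same person (assumed precomputed upstream).
    Returns a list of (elbow_index, wrist_index) pairs.
    """
    # Hungarian algorithm maximizes total affinity (we minimize negated scores).
    rows, cols = linear_sum_assignment(-affinity)
    # Keep only plausible pairs; 0.3 is an illustrative cutoff.
    return [(r, c) for r, c in zip(rows, cols) if affinity[r, c] > 0.3]

# Example: 3 elbows and 2 wrists detected in an image.
scores = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.1, 0.2]])   # third elbow has no good wrist match
print(match_joints(scores))       # [(0, 0), (1, 1)]
```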

A possible future direction is using much more information, using deformable models of the human body. So not just a skeleton, but rich volumetric information to do the detection, and then optimizing for the most likely orientation of the body. The open problem in the field is the fact that pose is not a thing that happens in a single image.

Pose happens as part of human behavior, as part of movement through time. So here, Monty Python, the Ministry of Silly Walks: people walk in funny ways. We collect a lot of data on pedestrians, and I can tell you that people walk in different ways and position their bodies in different ways.

And so the temporal aspects of human motion are, for the most part, not incorporated in the body pose estimation problem, and they should be. There are a lot of exciting possibilities in capturing the temporal dynamics. There are a lot of awesome slides here that I'm just skipping through: speech recognition; 2018 was really big for recommender systems, for Netflix, OkCupid, AI for President.

Each of the things I mentioned briefly today will have a separate mini-lecture. I taught an entire course on this at CHI last year. So, deep learning for understanding the human: it's a topic I'm really excited about, because the first step for a machine to be able to interact in a rich way with a human being is to understand that human.

And it's also an area where the most near-term impact can happen: a system that can effectively detect what a human being is up to, what they're thinking about, how to best serve them, and enrich the experience of interacting with that human. Let me jump to AI safety and then the interactive experience between humans and robots, to give examples of some work in those directions, some research I'm really excited about.

So AI safety, at the very basic level, there's an AI system that's making decisions where we want human beings to supervise those decisions. We've done quite a bit of work here at MIT on that aspect of supervising machines, with arguing machines. And OpenAI has done work with safety by having machines debate each other.

So there's this kind of idea that you can achieve safety by not giving ultimate power to any one decision maker. The disagreement that emerges from two or more AI systems having to make decisions and agree with each other allows us to produce a signal of uncertainty based on which human supervision can be sought.

Without that, when we have a state of the art black box AI system that does something like drive a car, all we have is a system that just runs and we're supposed to have faith that it's always going to be right. We don't have any uncertainty signal coming from the system.

So the idea of arguing machines that we've developed and been working on is to have multiple AI systems, an ensemble of AI systems, where, when a disagreement is detected, human supervision is sought. And the idea there is that when you have a system like Tesla Autopilot, and here we've instrumented a Tesla vehicle.

When we have a system like Tesla Autopilot, it tells you nothing about how uncertain it is about the decisions it's making. Once the system is on, it's steering the car for you, and in very rare cases it just disengages. But no matter what, it's not showing you the degree of uncertainty it has about the world around it.

And so the way we create that signal of uncertainty is by adding another, in this case end-to-end, vision system that's looking at the external environment and making steering decisions. Whenever a disagreement between the two is detected, that's when human supervision is sought. And in this way, as shown in the plot there, we can predict with high accuracy the times when the driver chose to disengage the system because they were uncomfortable.

So you're using this mechanism to detect risky, challenging situations. It's an idea about how we supervise AI by having multiple independent AI systems, where the uncertainty signal emerges through their disagreement. And we can apply this, like the debate work in natural language, in computer vision as well: taking two networks, ResNet and VGGNet, trained independently but on the same training set, ImageNet, we can have them argue and, in the process, significantly improve the accuracy.

So in the case of ResNet as an architecture and VGGNet as an architecture, trained on the ImageNet training dataset, they separately have a certain error: ResNet has an error of 8%, VGG16 has an error of 10%. When we apply the arguing machines framework, where the disagreement is brought to the human, that error rate decreases to 2.8%.
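A bare-bones sketch of the disagreement-as-uncertainty idea, illustrative only and not the actual arguing machines implementation; the `human_label_fn` callback stands in for whatever human supervision channel is available.

```python
import torch

def argue(model_a, model_b, image, human_label_fn):
    """Return a label, deferring to a human when two independent models disagree.

    model_a, model_b -- independently trained classifiers (e.g., a ResNet and a VGG)
    human_label_fn   -- callable that asks a human annotator for the label
                        (assumed placeholder for the supervision channel)
    """
    with torch.no_grad():
        pred_a = model_a(image).argmax(dim=-1)
        pred_b = model_b(image).argmax(dim=-1)

    if torch.equal(pred_a, pred_b):
        return pred_a                 # agreement: trust the ensemble's answer
    return human_label_fn(image)      # disagreement: uncertainty signal, ask a human
```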

Now, this is just the ImageNet challenge, but if that error meant the loss of human life, this kind of framework is really powerful for overseeing the operation of the AI system. Here are examples where they disagree. Taking this image from ImageNet, the ground truth is a wine bottle; the ResNet prediction, with 0.93, 93% confidence, is that it's a paper towel, and VGGNet says, with 25% confidence, that it's a seatbelt.

So these disagreements are surfaced; the fact that they disagree raises the uncertainty, human supervision is brought in, and humans are able to annotate correctly what's going on in the picture. Same thing here: the ground truth is a mailbox, and again the two architectures disagree.

One says traffic light, the other one says garbage truck. For an autonomous vehicle, you can imagine this being problematic. If there's a traffic light, you might stop for this mailbox, that kind of thing. That's early research in the field of how do we have AI systems that are more and more powerful.

We can also inject human effort to supervise when it's needed. The "when it's needed" part, the uncertainty signal, is the critical thing, so we have to figure out ways to create that uncertainty signal. Then there's the subarea of creating a rich human interaction. We're doing a lot of testing with autonomous vehicles here.

I'm tweeting. So we have a human-centered autonomous vehicle here at MIT that's taking control back and forth from the human based on the activity. That's just me explaining the video. The point is that the driving experience, the human-robot interaction experience should be fun and awesome and enriching to life.

And that's why you would want to use these kinds of systems. We have a bunch of videos online. You can check them out, including a ridiculous one of me playing guitar. And there's a paper along with this describing different principles of how we have humans and robots work together in this kind of way.

There's a lot of totally untouched problems in that space. Most of the robotics community and the machine learning community approaches AI as a system that we want to make perfect. And once it's perfect, we want to then put it in the real world where us humans get to interact with it.

Just like, what is it, Robin Williams in Good Will Hunting, talking about relationships, that nobody's perfect. I think the way I foresee it, AI systems will not be perfect for the next 100 years. And so we have to have humans and AI systems work together and optimize that problem, solve that problem.

That both of us are flawed, but together there's something enriching to both. As I mentioned, the videos here will be available online. The lectures underlying all the deep learning for understanding the human and underlying the five principles here of human-centered AI. And it's an area of active research here at MIT and globally, and it's one that I'm extremely passionate about.

And one of the analogies I think about when I think about the success of artificial intelligence systems as an analogy of parasitism and symbiosis, a lot of the ways we're training machine learning algorithms now is we inject a lot of human labor, a lot of really costly human labor, separately, offline, out of the loop, in order to improve the learning models through brute force annotation.

And what I see as success in the future requires that the learning is done, that the models improve, in a symbiotic way, as a side effect of interacting with humans. This is done a lot now in reinforcement learning, through game playing and so on. But the goal is for the human computation, the human effort of annotation, to be something that happens naturally through interaction, not a costly thing you have to pay for.

Because when it happens naturally, in a symbiotic way, we can increase scale. We can scale learning to a degree that's required to solve some of the real-world problems. That also requires solving a lot of aspects of human-robot interaction, from understanding our own brain, from the biological to the electrical and neuroscience, to the behavioral aspects captured by cognitive science, psychology, sociology, to the mathematical formulations of behavior and game theory, to when you take that human behavior and put it in the real world with engineering systems, human factors and design.

These are all giant subfields with conferences and papers, and all of them need to work together. And then on the computer science side, there's natural language processing, understanding language, and human-robot interaction, human-computer interaction: just the interfaces, how and what the computer, the robot, shows to you.

Again, entire conferences. And then the exciting aspects of learning from data and deep learning and learning to act from data and reinforcement learning, deep reinforcement learning. And then the robotics is actually building these things, the building the hardware, again, an entire area, exciting field of research. All of them have to work together to create systems here that integrate the human during the learning process and integrate the human during the operation process.

So the videos are on deeplearning.mit.edu; videos, slides, and code are available there. So with that, I'd like to thank you very much. (audience applauding) (audience cheering) (upbeat music)