
MIT 6.S093: Introduction to Human-Centered Artificial Intelligence (AI)


Chapters

0:00 Introduction to human-centered AI
5:17 Deep Learning with human out of the loop
6:11 Deep Learning with human in the loop
8:55 Integrating the human into training process and real-world operation
11:53 Five areas of research
15:38 Machine teaching
19:27 Reward engineering
22:35 Question about representative government as a recommender system
24:27 Human sensing
27:06 Human-robot interaction experience
30:28 AI safety and ethics
33:10 Deep learning for understanding the human
34:06 Face recognition
45:20 Activity recognition
51:16 Body pose estimation
57:24 AI Safety
62:35 Human-centered autonomy
64:33 Symbiosis with learning-based AI systems
65:42 Interdisciplinary research


00:00:00.000 | Welcome to Human-Centered Artificial Intelligence.
00:00:03.120 | The last couple of decades
00:00:07.040 | in the developments of deep learning
00:00:09.140 | have been exciting in the problems
00:00:14.040 | that we've been able to automate,
00:00:15.880 | in the problems that we've been able to crack
00:00:18.320 | with learning-based methods.
00:00:21.720 | One of the things underlying this lecture
00:00:23.820 | and the following lectures
00:00:26.480 | is the idea that with purely the learning-based approach
00:00:31.480 | that we have been using,
00:00:33.480 | there's certain aspects that are fundamental to our reality
00:00:37.460 | that we're going to hit a wall on,
00:00:39.860 | that we have to integrate, incorporate the human being
00:00:43.200 | deeply into the learning-based systems
00:00:45.720 | in order to make the systems learn well
00:00:50.200 | and operate in the real world.
00:00:53.180 | The underlying first prediction
00:00:57.160 | under the idea of human-centered AI in this century
00:01:00.800 | is that the learning-based approaches
00:01:03.200 | that have been successful over the past two decades,
00:01:06.200 | like deep learning, machine learning approaches
00:01:08.160 | that learn from data,
00:01:09.520 | are going to continue to become better
00:01:12.920 | and dominate the real world applications.
00:01:16.240 | So as opposed to fine-tuned optimization-based models
00:01:21.080 | that do not learn from data,
00:01:23.300 | more and more we're going to see learning-based methods
00:01:26.960 | dominate real world applications.
00:01:29.480 | That's the underlying prediction that we're working with.
00:01:33.520 | Now, if that's the case, the corollary of that,
00:01:38.220 | if learning-based methods are the solution
00:01:40.700 | to many of these real world problems,
00:01:43.380 | is the way we get smarter AI systems
00:01:46.780 | is by improving the machine learning
00:01:50.280 | and the machine teaching.
00:01:52.000 | Machine learning is the thing
00:01:53.720 | that we've been talking about quite a bit.
00:01:56.720 | That's the deep learning, that's the algorithm,
00:01:58.860 | the optimization of neural network parameters
00:02:01.520 | where you learn from data.
00:02:03.360 | That's the current focus of the community,
00:02:05.040 | current focus in the research,
00:02:06.380 | and the thing that's behind the success
00:02:08.440 | of much of the developments in deep learning.
00:02:11.400 | And then there's the machine teaching.
00:02:13.800 | That's the human-centered part.
00:02:16.000 | It's optimizing not the models,
00:02:19.940 | not the algorithms,
00:02:21.500 | but optimizing how you select the data
00:02:25.580 | based on which the algorithms learn.
00:02:27.840 | It's to make better teachers.
00:02:29.620 | Just like when you yourself are learning as a student
00:02:33.000 | or as a child how to operate in this world,
00:02:36.220 | the world and the parents and the teachers around you
00:02:41.220 | are informing you with very sparse information,
00:02:45.720 | but providing the kind of information
00:02:47.480 | that is most useful for your learning process.
00:02:49.780 | The selection of data based on which to learn,
00:02:53.680 | I believe, is the critical direction of research
00:02:57.100 | that we have to solve in order to create
00:03:01.440 | truly intelligent systems,
00:03:03.120 | the ones that are able to work in the real world,
00:03:05.960 | and I'll explain why, and what I'm referring to.
00:03:10.100 | The implications of learning-based systems.
00:03:12.880 | So when you have a learning system,
00:03:15.440 | a system that learns from data,
00:03:18.180 | neural networks, machine learning,
00:03:20.360 | learns from data,
00:03:21.540 | the fundamental reality of that
00:03:25.800 | is that the model is trying to generalize
00:03:29.560 | across the entirety of the reality
00:03:32.640 | in which it will be tasked with operating,
00:03:35.360 | based on a very small subset of samples from that reality,
00:03:40.360 | and that generalization means that
00:03:43.320 | there's always going to be a degree of uncertainty.
00:03:47.360 | There's always going to be
00:03:48.720 | a degree of incomplete information,
00:03:50.600 | and so no matter how much we want to,
00:03:53.900 | these systems will not be provably safe,
00:03:57.600 | so we can't put down anything concrete
00:04:00.960 | that guarantees safety in some specific way
00:04:05.120 | unless the system is extremely constrained.
00:04:07.880 | Therefore, we need human supervision of these systems.
00:04:11.140 | The systems will not be provably fair
00:04:13.600 | from an ethics perspective,
00:04:15.040 | from a discrimination perspective,
00:04:16.640 | from all degrees of fairness.
00:04:18.600 | Therefore, we need human supervision of these systems,
00:04:22.540 | and it will not be explainable.
00:04:26.160 | At any step of the pipeline
00:04:27.640 | in which they made the decisions,
00:04:29.300 | AI systems will not be perfectly explainable
00:04:32.440 | to the satisfaction of us as human supervisors.
00:04:37.440 | So there again,
00:04:40.800 | human supervision constantly will be required,
00:04:44.560 | and the solution to this is a whole set of techniques,
00:04:47.520 | whole set of ideas that we're putting under the flag
00:04:51.520 | of human-centered artificial intelligence,
00:04:53.800 | human-centered AI,
00:04:55.000 | and the core ideas there is that we need to integrate
00:04:58.660 | the human being deeply into the annotation process
00:05:02.680 | and deeply into the human supervision
00:05:06.440 | of the real-world operation of the system,
00:05:09.320 | so both in the training phase and the testing phase,
00:05:13.360 | the execution, the operation of the system.
00:05:16.880 | So this is what deep learning looks like
00:05:18.760 | with the human out of the loop.
00:05:20.480 | The human contributes to a learning model
00:05:25.160 | by helping annotate some data,
00:05:27.720 | and that data is then used to train a model
00:05:31.180 | that hopefully generalizes in the real world,
00:05:33.120 | and that model makes decisions,
00:05:35.560 | and deep learning is really exciting
00:05:37.120 | because it's able to,
00:05:38.840 | in a greater and greater degree of autonomy,
00:05:41.640 | able to form high-level representations of the raw data
00:05:45.840 | in a way that it's actually able to do quite well
00:05:49.000 | on certain kinds of tasks
00:05:50.680 | that were before very difficult,
00:05:52.580 | but fundamentally, the human is out of the loop,
00:05:55.140 | both of the training and the operation.
00:05:57.220 | First, you build the dataset, annotate the dataset,
00:06:00.340 | and then the systems run away with it.
00:06:02.680 | They train on the data,
00:06:03.960 | and the real-world operation does not involve the human
00:06:06.960 | except as the recipient of the service the system provides.
00:06:11.320 | Now, the human in the loop version of that,
00:06:13.360 | the human-centered version of that,
00:06:15.280 | means that annotation and operation of the system
00:06:20.280 | is both aided by human beings in a deep way.
00:06:27.920 | What does that mean?
00:06:29.960 | So we can look at human experts,
00:06:32.120 | so individuals, and crowd intelligence,
00:06:35.920 | the wisdom of the crowd and the wisdom of the individual.
00:06:39.620 | At the training phase, the first part of that
00:06:44.520 | is the objective annotation.
00:06:46.320 | We need to significantly improve objective annotation,
00:06:49.440 | meaning annotation where the human intelligence
00:06:53.160 | is sufficient to be able to look at a sample and annotate it.
00:06:57.040 | This is what we think about as an ImageNet
00:06:59.040 | and all the basic computer vision tasks
00:07:00.880 | where a single human is enough
00:07:02.320 | to do a pretty damn good job
00:07:04.120 | of determining what's in a particular sample.
00:07:06.960 | And then there's subjective annotation,
00:07:09.180 | things that are difficult for humans to determine
00:07:12.760 | as a single human being;
00:07:15.000 | as a crowd, we kind of converge on these difficult questions.
00:07:20.000 | These are questions at a low level of emotion,
00:07:24.520 | these things that are a little bit fuzzy,
00:07:26.880 | that require multiple people to annotate,
00:07:28.800 | and at the high level are ethical questions
00:07:31.640 | of decisions that an AI system is tasked with making,
00:07:36.560 | or we're tasked with making,
00:07:38.260 | that nobody really knows the right answer to.
00:07:40.640 | And as a crowd, we kind of converge on the right answer.
00:07:43.400 | That's where the crowd intelligence comes in
00:07:45.200 | on the data annotation step.
00:07:46.920 | Now in the operation, once you train the model,
00:07:50.600 | the supervision, again, of the system based,
00:07:54.600 | and I'll give examples of this more concretely,
00:07:57.480 | on the wisdom of the individual is, for example,
00:08:01.120 | operating an autonomous vehicle,
00:08:02.920 | the supervision of that autonomous vehicle,
00:08:05.040 | a single driver, is tasked with supervising
00:08:07.680 | the decisions of that AI system.
00:08:09.820 | That's a critical step for a learning based system
00:08:12.840 | that's not guaranteed to be safe,
00:08:15.520 | that's not guaranteed to be explainable.
00:08:18.040 | And the subjective side of that,
00:08:22.480 | where the crowd intelligence is required,
00:08:26.200 | where a single person is not able to make the call,
00:08:26.200 | these are, again, ethical questions
00:08:28.000 | about the operation of autonomous systems.
00:08:30.160 | The supervision of autonomous vehicles,
00:08:33.200 | the supervision of systems in the medical diagnosis,
00:08:37.920 | in medicine in general, and this is AI
00:08:42.920 | operating in the real world, making ethical decisions
00:08:46.820 | that are fundamentally difficult decisions
00:08:51.240 | for humans to make, and that's where
00:08:52.600 | the crowd intelligence needs to come in.
00:08:55.360 | And so we have to transform the machine learning problem
00:08:58.200 | by integrating the human being.
00:09:01.260 | First up top in the training process, on the left,
00:09:04.400 | that's the usual machine learning formulation
00:09:07.000 | of a human being doing brute force annotation
00:09:10.680 | of some kind of data set, cats and dogs and ImageNet,
00:09:13.780 | segmentation data set in cityscapes,
00:09:17.440 | video action recognition in the YouTube data set.
00:09:22.440 | Given the data set, humans put in a lot of expensive labor
00:09:25.820 | to annotate what's going on in that data,
00:09:27.960 | and then the machine learns.
00:09:30.880 | The flip side of that, the machine teaching side,
00:09:33.180 | the human-centered side of that, is the machine instead,
00:09:36.820 | the learning model, the learning algorithm,
00:09:39.000 | talking about mostly neural networks here,
00:09:41.360 | is tasked with providing, selecting the subset,
00:09:48.240 | the small, sparse subsets of the data
00:09:52.620 | that are most useful for the human to annotate.
00:09:55.460 | So instead of the human doing the brute force task first
00:09:59.040 | of the annotation, the machine queries the human.
00:10:02.200 | This is the field called machine teaching.
00:10:04.840 | The machine queries the human with questions,
00:10:07.240 | and therefore, the task is,
00:10:09.320 | and this is wide open research field,
00:10:11.260 | the task is to minimize in several orders of magnitude
00:10:17.360 | the amount of data that needs to be annotated.
00:10:19.600 | In the real world operation side,
00:10:22.760 | the integration of the human looks like this.
00:10:24.800 | On the left, the machine, now trained
00:10:28.160 | with the learning model, makes decisions,
00:10:30.320 | and the human living in this world
00:10:33.320 | receives the service provided by the machine,
00:10:36.520 | whether that's medical diagnosis,
00:10:38.100 | whether that's an autonomous vehicle,
00:10:40.120 | whether that's a system that determines
00:10:42.980 | whether you get a loan or not, so on.
00:10:45.800 | With the human-centered version of that,
00:10:50.160 | the machine makes a decision,
00:10:52.240 | but it's able to provide a degree of uncertainty.
00:10:57.880 | It's one of the big requirements,
00:10:59.580 | to be able to specify a degree of uncertainty
00:11:01.720 | of that decision such that when uncertainty
00:11:04.240 | is above a certain threshold, human supervision is sought.
00:11:08.280 | And again, in that decision,
00:11:10.520 | whether that's a costly decision financially
00:11:13.040 | or a costly decision in terms of human life,
00:11:15.120 | human supervision is sought.
00:11:17.160 | And the service is received by the human,
00:11:20.080 | by the very same humans that are providing the supervision,
00:11:23.040 | or another set of humans.
00:11:25.360 | But ultimately, the decision is overseen
00:11:29.360 | by human beings.
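
As a rough sketch of that routing logic (the `predict_with_uncertainty` interface and the threshold value are placeholders, not a real system):

```python
def decide_or_defer(model, x, threshold=0.2):
    """Return the model's decision, or defer to a human supervisor
    when the model's own uncertainty is too high."""
    prediction, uncertainty = model.predict_with_uncertainty(x)
    if uncertainty > threshold:
        # Costly or uncertain decisions get routed to a human.
        return {"decision": None, "needs_human": True}
    return {"decision": prediction, "needs_human": False}
```
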
00:11:31.760 | This is what I believe is going to be
00:11:34.880 | the defining mode of operation for AI systems
00:11:37.720 | in the 21st century, is we won't be able to,
00:11:40.480 | as much as we'd like, to create perfect AI systems
00:11:44.320 | that escape the need to work together
00:11:49.000 | with human beings at every step.
00:11:52.920 | There are five areas of research,
00:11:55.680 | grand challenges here, that define human-centered AI.
00:12:01.080 | I'll focus on a few today,
00:12:03.680 | and focus on one very much so.
00:12:06.720 | And even with that high degree of pruning,
00:12:11.200 | we have 120 slides, so I'll skip around.
00:12:14.040 | But, on the human-centered AI during the learning phase,
00:12:22.120 | there are the methods, the research arm of machine teaching.
00:12:25.640 | How do we select, how do we improve supervised learning?
00:12:28.760 | As opposed to needing 10,000, 100,000, a million examples,
00:12:33.280 | how do we reduce that, where the algorithm queries
00:12:36.560 | only the essential elements, and is able to learn effectively
00:12:39.880 | from very little information, from very few samples?
00:12:43.240 | Just like we do when we're students,
00:12:45.000 | when we learn fundamental aspects of math,
00:12:48.160 | language, and so on, we just need a few examples.
00:12:51.600 | But those examples are critical to our understanding.
00:12:54.200 | And the second part of that is the reward engineering.
00:12:59.480 | That during a learning process,
00:13:01.040 | injecting the human being into the definition
00:13:04.040 | of the loss function, of what's good, what's bad.
00:13:06.960 | Systems that have to operate in the real world
00:13:11.560 | have to understand what our society deems as good and bad.
00:13:16.560 | And we're not always good at injecting that
00:13:19.800 | at the very beginning.
00:13:20.920 | There has to be a continuous process
00:13:23.160 | of adjusting the rewards, of reward re-engineering
00:13:26.560 | by humans, so that we can encode human values
00:13:29.640 | into the learning process.
00:13:31.240 | Now, on the second part, on the human-centered AI
00:13:34.200 | during real-world operation,
00:13:36.240 | when the system's actually trained,
00:13:38.400 | there is the interactive element
00:13:42.080 | of robots and humans working together.
00:13:44.560 | Then there's the part I'll focus on quite a bit today,
00:13:47.840 | because there's been quite a lot of development
00:13:50.640 | and progress on the deep learning side,
00:13:53.000 | which is human sensing: algorithms
00:13:56.000 | that understand the human being.
00:13:58.280 | Algorithms that, from taking raw information,
00:14:02.120 | whether that's video, audio, text,
00:14:04.560 | begin to get a context, a measure
00:14:08.840 | of the state of the human being in the short term
00:14:10.640 | and the long term over time,
00:14:12.040 | the temporal understanding
00:14:15.200 | and the instantaneous understanding.
00:14:17.760 | Then there is the interaction aspect.
00:14:21.160 | So once you understand the human,
00:14:22.960 | the perception problem, you have to interact with them
00:14:26.280 | and interact in such a way that it's continuous,
00:14:29.000 | collaborative, and a rich, meaningful experience.
00:14:32.360 | We're in the very early days of creating anything
00:14:37.440 | like rich, meaningful experiences with AI systems,
00:14:41.200 | especially learning-based AI systems.
00:14:44.880 | And the safety, in the real world operation,
00:14:48.360 | safety, ethics, unrolling the results
00:14:52.120 | of the engineered rewards that were in place
00:14:55.360 | during the learning process, now come to fruition.
00:14:59.600 | And we need to make sure that the trained model
00:15:04.600 | does not result in things that are highly detrimental,
00:15:10.920 | catastrophic to our safety,
00:15:13.560 | or highly detrimental to what we deem as good
00:15:16.920 | and bad in society, of discrimination,
00:15:19.600 | of ethical considerations, and all those kinds of things.
00:15:22.920 | The gray area, the line we all walk as a society
00:15:27.240 | in the crowd intelligence,
00:15:28.640 | we have to provide bounds on AI systems.
00:15:32.160 | And there's an entire group of work,
00:15:34.600 | and I'll mention what we're doing in that area.
00:15:37.880 | So first, on the machine teaching side,
00:15:42.040 | and the efficient supervised learning,
00:15:44.000 | I'd like to sort of do one slide on each of these
00:15:46.800 | to kind of give you an idea,
00:15:49.400 | and do two things for each area,
00:15:54.320 | that we will elaborate in future lectures on,
00:15:56.960 | and some of it I'll elaborate today.
00:15:59.440 | First, the near-term directions of research,
00:16:03.020 | the things that are within our reach now,
00:16:05.760 | and a sort of thought experiment, a grand challenge,
00:16:10.000 | that if we can do it, that'll be damn impressive.
00:16:14.000 | That will be a definition of real progress in this area.
00:16:17.600 | So near-term directions of research for machine teaching,
00:16:21.480 | for improved supervised learning,
00:16:22.900 | integrating the human into the annotation process,
00:16:25.520 | is instead of annotating brute force,
00:16:28.240 | is annotate by asking the human questions.
00:16:30.880 | So we have to transform the way we do annotations,
00:16:34.840 | where the process of annotation is not defining the dataset,
00:16:39.680 | and then you go through the entire dataset,
00:16:41.960 | it's a machine teaching system that queries the user
00:16:48.020 | with questions to annotate.
00:16:48.020 | And on the algorithm side, active learning,
00:16:52.560 | these are all sort of areas of work
00:16:55.120 | where we can be more clever about the way we use data,
00:16:58.600 | select data on which to train.
00:17:00.560 | So active learning is actively selecting
00:17:03.520 | during the training process,
00:17:04.780 | which part of the data to train on, and annotate.
00:17:08.720 | Data augmentation is taking things
00:17:11.360 | that have been supervised by a human,
00:17:13.200 | and expanding them, modifying the data,
00:17:16.240 | warping the data in interesting ways such that it expands,
00:17:20.040 | it multiplies the human effort that was injected
00:17:23.920 | into helping understand what's in the data.
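
For instance, a minimal augmentation pipeline in torchvision might look like this (the particular transforms are illustrative): each human-labeled image is randomly warped into many training variants, so one annotation does the work of many.

```python
from torchvision import transforms

# Each labeled image yields many randomized variants at training time,
# multiplying the human annotation effort without any new labels.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```
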
00:17:26.920 | The one-shot learning, zero-shot learning,
00:17:29.080 | and transfer learning are all in that category.
00:17:31.160 | And self-play is in the reinforcement learning area
00:17:34.920 | where the system constructs a model of the world,
00:17:39.360 | and goes along alone in a room,
00:17:42.120 | and plays with that model to try to figure out
00:17:44.660 | the different constraints of the model,
00:17:46.160 | how do you achieve good things there.
00:17:48.400 | An example grand challenge here
00:17:51.680 | that would define serious progress in the field
00:17:55.080 | is if we take ImageNet or COCO,
00:17:57.840 | the ImageNet challenge or COCO object detection challenge,
00:18:01.460 | and training only on a totally different kind of data,
00:18:06.460 | be able to achieve state-of-the-art results.
00:18:10.800 | So training only on Wikipedia,
00:18:14.320 | with the text and images that are there on Wikipedia,
00:18:16.780 | be able to perform object detection
00:18:19.580 | on the state-of-the-art benchmark of COCO.
00:18:22.780 | COCO is a data set of different objects
00:18:25.140 | with rich annotation of the localization of the objects.
00:18:28.740 | That, I believe, is exactly the kind of challenge
00:18:31.980 | where all the problems in transfer learning
00:18:35.540 | and efficient data annotation, machine teaching,
00:18:38.620 | have to be solved to achieve it.
00:18:40.600 | Another challenge you can think of,
00:18:44.500 | if we can even just simplify it more,
00:18:47.060 | is to achieve a 0.3% error on MNIST,
00:18:51.620 | that's the handwritten recognition task
00:18:54.080 | that everybody always provides as an example.
00:18:56.180 | So achieve a very good accuracy,
00:19:00.340 | state-of-the-art accuracy,
00:19:02.500 | by training only on a single example of a digit,
00:19:06.100 | as opposed to training on thousands,
00:19:08.020 | training on one example.
00:19:09.640 | That's something that most of us humans can do,
00:19:12.220 | given one example of a new language
00:19:16.420 | you haven't seen before for each character,
00:19:19.940 | after studying them for a little bit,
00:19:22.000 | be able to now classify future characters
00:19:24.620 | at high accuracy.
00:19:25.700 | The second part of the learning process
00:19:32.980 | where the human needs to be injected,
00:19:34.380 | and the near-term directions of research there,
00:19:37.380 | is the reward engineering,
00:19:39.580 | and the tuning of those,
00:19:40.540 | continuous tuning of those rewards by a human being.
00:19:43.180 | So if OpenAI is doing quite a bit of work here,
00:19:50.500 | here's a game played by human and AI,
00:19:53.500 | and it's really my favorite example of this.
00:19:56.180 | On the left, human is controlling a boat
00:19:58.500 | that's finishing a race.
00:19:59.600 | On the right is a RL agent,
00:20:02.140 | reinforcement learning agent,
00:20:03.300 | that's controlling a boat that's trying to,
00:20:07.100 | not finish a race,
00:20:08.300 | trying to maximize the reward
00:20:12.260 | defined prior to, by, initially by a human being.
00:20:16.500 | And what it finds is that you can get much more reward
00:20:20.700 | by collecting green turbos that appears
00:20:23.460 | close to finishing the race.
00:20:25.260 | It realizes that finishing the race
00:20:26.980 | actually gets in the way of maximizing reward.
00:20:29.660 | And so that's the unintended consequences
00:20:32.140 | of a reward function that was specified previously,
00:20:37.140 | and most human supervisors of this result
00:20:42.060 | would be able to adjust,
00:20:44.380 | to re-engineer the reward function,
00:20:47.180 | to be able to get the robot, the AI system here,
00:20:51.140 | to finish the race.
00:20:52.420 | And that kind of continuous monitoring,
00:20:54.820 | monitoring of the performance of the system
00:20:58.180 | during the training process
00:20:59.580 | is a near-term direction of research
00:21:02.740 | that a few groups, DeepMind, OpenAI, and ourselves, are taking on.
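
To make the re-engineering step concrete, here is a hedged sketch of what the supervisor's adjustment might look like for the boat-race example (the `state` fields and the weights are hypothetical, not OpenAI's actual reward):

```python
# Version 1: the reward the agent ends up hacking.
def reward_v1(state):
    return 10.0 * state.turbos_collected

# Version 2: re-engineered by a human after observing the unintended
# behavior; the weights are the knobs the supervisor keeps re-tuning.
def reward_v2(state):
    return (1.0 * state.turbos_collected     # down-weighted pickup bonus
            + 5.0 * state.track_progress     # progress along the course
            + 100.0 * state.finished_race)   # terminal bonus for finishing
```
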
00:21:07.100 | Example grand challenge is allowing
00:21:12.180 | an AI system to operate in a context
00:21:16.700 | where there's a lot of fuzziness for us humans.
00:21:21.380 | There's a lot of uncertainty, there's a lot of gray area,
00:21:23.620 | there's a lot of challenging aspects
00:21:25.340 | in terms of what is right and what is wrong
00:21:28.740 | that we continually need to improve on.
00:21:31.020 | The example I provide here is one of the least popular things
00:21:35.900 | in the world: the US Congress.
00:21:40.620 | So replacing US Congress,
00:21:42.500 | the body of representatives of the people
00:21:44.980 | of the United States, which makes bills
00:21:47.620 | based on the beliefs of the people,
00:21:49.660 | that sounds a lot like what Netflix does
00:21:52.780 | in recommending what movie you should watch next
00:21:56.180 | in representing what people love to watch.
00:21:58.660 | So that's just the recommender system.
00:22:00.460 | So it makes perfect sense that an AI system
00:22:03.740 | should be able to take on this challenge.
00:22:06.300 | And I see that as a grand challenge,
00:22:08.860 | is replacing some of the fundamental representation
00:22:13.180 | of large crowds of people that make ethical decisions
00:22:17.660 | replaced by a human-centered AI system.
00:22:22.300 | Okay, in real world operation,
00:22:25.420 | the first thing we have to do,
00:22:27.620 | before we have a robot and a human work together,
00:22:30.460 | the first thing is the robot has to perceive the human.
00:22:34.620 | Question.
00:22:36.060 | - The question was:
00:22:38.340 | there's currently Congress,
00:22:41.540 | do you want to change the way Congress works,
00:22:45.060 | make it better, or do you want to just take the system
00:22:47.640 | as it currently is and automate it?
00:22:51.140 | So the idea is take the system as it currently
00:22:56.140 | is supposed to be and automate that.
00:22:59.420 | So an AI system can provide a lot more transparency
00:23:05.180 | of the inputs.
00:23:06.820 | The idea of Congress is that
00:23:08.420 | the only inputs are supposed to be the people
00:23:10.740 | and the beliefs of the people.
00:23:16.100 | And there's rich information there.
00:23:20.840 | So for example, for me,
00:23:24.160 | not saying anything about politics,
00:23:28.080 | but there's certain issues I care a lot about
00:23:29.880 | and certain issues I don't care much about.
00:23:32.720 | And let's put that aside.
00:23:35.560 | And then there's certain issues that I know a lot about
00:23:39.760 | and certain issues I know very little about.
00:23:42.400 | And those don't actually intersect that well.
00:23:46.080 | I'm very opinionated about things
00:23:47.580 | I don't know anything about.
00:23:48.620 | It's very common, all of us are.
00:23:50.780 | So being able to put that representation of me
00:23:55.020 | into a system that would take
00:23:58.860 | our entire nation together, and be able to make bills
00:24:03.860 | that represent the people.
00:24:08.240 | Now the challenge there, it can't be just the training set
00:24:11.060 | and then the system now operates.
00:24:13.580 | AI is running the country.
00:24:15.800 | No, there has to be that human-centered element
00:24:18.000 | where we're constantly supervising,
00:24:19.380 | just like we're, in theory, supposed to be supervising
00:24:22.240 | our congressmen and congresswomen.
00:24:25.520 | Human sensing, the first part,
00:24:28.040 | in order to have an AI system that works with a human being,
00:24:31.800 | the AI system has to perceive, understand
00:24:34.320 | the state of the human being at the very simplest level
00:24:36.920 | and the more complex, temporal, contextual, over time level.
00:24:40.660 | So the near-term directions of research
00:24:42.840 | is purely the perception problem,
00:24:44.640 | where deep learning shines, of taking data,
00:24:48.100 | whether that comes in the visual, audio, text, and so on,
00:24:53.100 | and being able to classify the physical, mental,
00:24:58.040 | social state, social context of the person.
00:25:02.080 | Be able to, everything, and this is what I'll cover
00:25:04.960 | a little bit of today, everything from face detection,
00:25:09.320 | face recognition, emotion recognition,
00:25:12.640 | natural language processing, body pose estimation,
00:25:16.620 | those same recommender systems, speech recognition,
00:25:21.720 | all of those conversions of raw data
00:25:25.320 | that captures something about the human being
00:25:27.240 | into actually meaningful, actionable information.
00:25:29.840 | The grand challenge there is emotion recognition.
00:25:34.840 | You know, there have been a lot of companies and ideas
00:25:38.760 | claiming that we've somehow cracked emotion recognition,
00:25:41.460 | that we are able to determine the mood of a person.
00:25:46.020 | But really, for those who were here last year
00:25:49.160 | with Lisa Feldman Barrett, just,
00:25:52.880 | if you're sort of very honest
00:25:54.840 | and you study emotional intelligence and emotion
00:25:59.300 | and the expression of emotion, it's a fascinating area
00:26:03.000 | and we're not even close to being able
00:26:04.760 | to build perceptual systems that detect emotion.
00:26:07.560 | What we're more so doing is detecting very simple
00:26:12.560 | facial expressions that correspond
00:26:15.180 | to our storybook versions of emotion, smiling,
00:26:20.000 | crying, like frowning in a caricatured way.
00:26:23.460 | So if you build a system that has a high accuracy
00:26:26.520 | of doing real emotion recognition,
00:26:29.560 | you can think of it as stated here,
00:26:33.080 | an AI system that classifies,
00:26:35.720 | binary classification problem, 95% accuracy
00:26:39.080 | of whether you wanna be left alone or not.
00:26:41.680 | And being able to do that after collecting data for 30 days.
00:26:46.040 | That I see as a really clean formulation
00:26:48.420 | of exactly the kind of human understanding
00:26:53.420 | we need to be able to build into our learning models.
00:26:57.440 | And we're very far away from that,
00:26:59.280 | especially the long temporal aspect of that,
00:27:02.520 | of being able to integrate data over a long period of time.
00:27:06.140 | Then the second part of human robot interaction
00:27:08.980 | in the real world operation is the experience.
00:27:12.520 | This is where we're now just beginning to consider
00:27:15.860 | that interactive experience
00:27:17.460 | of how do we have a rich fulfilling experience.
00:27:20.380 | We have autonomous vehicles, for example,
00:27:24.140 | semi-autonomous vehicles, whether that's Tesla,
00:27:26.820 | Volvo, Super Cruise with the Cadillac.
00:27:29.460 | There's a bunch of systems that have now
00:27:31.460 | greater and greater degrees of automation in the car
00:27:33.860 | and we get to have the human interact with that AI system
00:27:36.860 | and trying to figure out how do we have
00:27:39.980 | a rich fulfilling experience.
00:27:43.220 | Currently, in the Volvo system,
00:27:47.260 | that experience is more limited.
00:27:49.600 | There's a little icon.
00:27:51.220 | It's more kind of traditional driving situation.
00:27:54.420 | In the Tesla, you have a much bigger display
00:27:57.020 | about what's going on.
00:27:58.540 | In the Cadillac Super Cruise system,
00:28:03.140 | there's a camera looking at your eyes
00:28:07.500 | determining if you're awake or not,
00:28:09.940 | paying attention or not.
00:28:11.140 | And that, there's like an experience there
00:28:13.300 | that we're trying to create.
00:28:15.260 | And in the Tesla case, the miles are racking up.
00:28:19.460 | We have real data.
00:28:20.860 | Here at MIT, we're studying this exact interaction.
00:28:23.580 | There's now over a billion miles driven in the Tesla.
00:28:26.380 | And the same in the fully autonomous side with Waymo,
00:28:30.260 | they've now reached 10 plus million miles driven autonomously.
00:28:34.140 | And there's a lot of people experimenting with this.
00:28:36.780 | But that's that collaborative interaction
00:28:39.700 | of going back and forth, of being able to,
00:28:41.900 | for the AI system to express its degree of uncertainty
00:28:44.580 | about the environment.
00:28:46.060 | About the AI system being able to express
00:28:48.820 | when it needs help and not.
00:28:50.940 | Be able to communicate what are its limitations
00:28:53.500 | and capabilities and so on.
00:28:55.620 | Trade off control.
00:28:57.220 | Be able to seek human supervision.
00:28:59.000 | There's a dance there that's really,
00:29:01.620 | that takes into consideration everything
00:29:03.860 | from the neurobiological research to psychology
00:29:08.860 | to deep learning, to the pure robotics,
00:29:14.580 | HRI, human-robot interaction aspects.
00:29:17.940 | One grand challenge would be,
00:29:19.900 | Tesla's driven one billion miles now under autopilot,
00:29:23.380 | under the semi-autonomous mode.
00:29:25.300 | The grand challenge here is when we start getting
00:29:27.980 | to the kind of mileage that we see
00:29:30.560 | in the United States every year,
00:29:32.140 | you start getting into the hundreds of billions
00:29:34.500 | of miles driven semi-autonomously.
00:29:36.340 | We get to see teenagers, 16, 17, 18,
00:29:39.700 | using these systems for the first time.
00:29:41.620 | We get to see older folks,
00:29:43.140 | folks who don't necessarily drive
00:29:46.600 | or use any kind of AI in their lives
00:29:48.780 | get to use these systems.
00:29:50.060 | We start to explore that aspect.
00:29:51.820 | That's the real challenge.
00:29:53.480 | And of course, the old Turing test,
00:29:57.980 | now reimagined by Alexa,
00:30:00.440 | with the Alexa Prize challenge of Social Bot,
00:30:05.200 | is creating natural language.
00:30:07.680 | It's such a beautiful thing to explore
00:30:09.540 | human-robot interaction with,
00:30:11.400 | both on the audio side and just the text side:
00:30:15.520 | passing the Turing test.
00:30:18.040 | That's a true grand challenge in a real way,
00:30:20.640 | where you wanna have a conversation with a robot
00:30:23.360 | for prolonged periods of time,
00:30:25.120 | maybe more than even some of your other friends.
00:30:28.360 | And on the other side of friends is the risk,
00:30:33.360 | the potential catastrophic risk
00:30:35.880 | when you have an AI system that's learning from data.
00:30:39.220 | The near-term directions of research
00:30:41.000 | is purely the human supervision of AI decisions
00:30:44.200 | in terms of safety and ethics.
00:30:46.120 | There's a lot of systems, like with cars,
00:30:48.920 | or medical diagnosis and so on,
00:30:51.480 | where there's some life-critical, safety-critical aspect
00:30:54.480 | that we want to be able to supervise the safety of that.
00:30:57.080 | And there's ethical decisions
00:30:59.080 | in terms of who gets alone or not,
00:31:02.280 | who gets a certain criminal penalty or not.
00:31:05.600 | Any degree to which AI systems are incorporated into that,
00:31:09.100 | you have to consider ethical questions.
00:31:11.240 | And even just the crude,
00:31:13.140 | the low-level perception systems, like face recognition,
00:31:18.140 | you wanna make sure that your face recognition systems
00:31:21.280 | are not discriminating based on color or gender or age
00:31:23.920 | and so on.
00:31:24.840 | You wanna make sure that
00:31:26.160 | at that basic fundamental level of ethics,
00:31:30.940 | the systems are trained in a way
00:31:33.760 | they maintain our human values,
00:31:36.120 | or the better angels of our nature,
00:31:39.840 | the better sides of our values,
00:31:42.040 | some of the brighter aspects of our values.
00:31:44.800 | And the other thing is, in terms of just maintaining values,
00:31:49.360 | that's the normal,
00:31:51.480 | that's looking at the mean of the distribution.
00:31:53.940 | But we also want to control the outliers
00:31:57.420 | from the AI systems not to do anything catastrophic.
00:32:01.280 | So the unintended consequences,
00:32:03.120 | when something happens that you didn't anticipate,
00:32:06.160 | you wanna be able to put boundaries on that.
00:32:08.580 | And the grand challenge there,
00:32:11.360 | really, it all boils down to the ability of an AI system
00:32:15.080 | to say that it's uncertain about something.
00:32:17.780 | And that measure of uncertainty has to be good.
00:32:22.740 | It has to be able to make a prediction
00:32:24.800 | always accompanied with uncertainty,
00:32:28.080 | even on things it hasn't seen before.
00:32:30.160 | That's the real challenge,
00:32:31.940 | to be able to be trained on cats and dogs
00:32:36.400 | and then seeing a giraffe
00:32:38.040 | and saying, "I'm not sure what that is."
00:32:41.960 | We're quite far away from that,
00:32:45.160 | 'cause right now, it'll probably confidently say it's a dog,
00:32:48.620 | depending on the giraffe.
00:32:49.860 | But we want to be able to have an extremely high accuracy
00:32:55.580 | in the ability of AI systems
00:32:57.220 | to determine their own uncertainty,
00:32:58.620 | to know what they don't know.
00:33:00.140 | Because from that comes the supervision.
00:33:03.540 | From that comes the ability to stop
00:33:05.980 | in situations it's uncertain about, to prevent catastrophic events.
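
One common approximation of that ability, sketched here (Monte Carlo dropout is one technique among several, offered as an illustration rather than something proposed in the lecture): keep dropout active at test time and read the spread across stochastic forward passes as uncertainty.

```python
import torch

def mc_dropout_predict(model, x, n_samples=20):
    """Monte Carlo dropout: run stochastic forward passes and treat
    the disagreement between them as a rough uncertainty estimate."""
    model.train()  # keeps dropout layers active at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    mean_probs = probs.mean(dim=0)
    uncertainty = probs.var(dim=0).sum(dim=-1)  # crude dispersion measure
    return mean_probs, uncertainty
```
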
00:33:09.340 | The first aspect of real-world operation
00:33:12.620 | is understanding the human.
00:33:14.080 | One of the places where deep learning has really shined
00:33:17.980 | is the perception problem.
00:33:19.900 | It all begins at the ability to look at raw data
00:33:22.860 | and convert that into meaningful information.
00:33:25.220 | That's really the understanding the human comes in.
00:33:28.100 | Not the kind of understanding
00:33:29.540 | that when you're in a relationship with somebody,
00:33:31.620 | when you're friends with somebody,
00:33:32.940 | over a long period of time,
00:33:34.580 | you gain an understanding of their quirks,
00:33:37.140 | limitations, capabilities, so on.
00:33:39.220 | That's really fascinating.
00:33:40.940 | But the first step is just to be able to,
00:33:43.740 | when you see them, recognize who they are,
00:33:46.020 | what's on their mind,
00:33:48.200 | what's their body language,
00:33:52.860 | what are they saying with their mouth.
00:33:55.700 | All those basic raw perception tasks,
00:33:58.180 | that's where deep learning really shines.
00:33:59.660 | I'd like to cover the state of the art
00:34:01.860 | in those various perception tasks.
00:34:05.980 | So first, face recognition.
00:34:07.760 | Now there's a full slide presentation with this,
00:34:11.700 | and I'm skipping around.
00:34:13.240 | The full slide presentation has the following structure
00:34:15.680 | for each of these topics.
00:34:17.920 | It has the motivation, description, the excitement,
00:34:21.920 | the worry, the future impact is the first part.
00:34:24.900 | And then there's five papers.
00:34:26.740 | One defining the quote unquote old school seminal work
00:34:29.920 | that opened the field.
00:34:31.320 | Then the early progress in the field.
00:34:33.780 | Paper three is the recent breakthrough,
00:34:38.160 | often associated with deep learning.
00:34:40.220 | Paper four is the current state of the art.
00:34:42.300 | And paper five is the thing
00:34:43.860 | that defines the future direction.
00:34:46.040 | The possible set of things that define the future direction.
00:34:49.020 | And then the open problems in the field,
00:34:52.120 | and where the future research is very much needed.
00:34:55.480 | That's kind of the structure of every topic
00:34:58.140 | I'll cover here as quickly as possible.
00:35:00.640 | Face recognition.
00:35:04.720 | So what is it?
00:35:05.700 | It's the first thing, you know,
00:35:08.700 | the face contains so much rich information
00:35:12.760 | about the state of the human being.
00:35:15.060 | So understanding the human being really starts with the face
00:35:18.180 | and detecting the face is the first step.
00:35:21.020 | Detecting the body,
00:35:22.580 | and then that there's a head on top of that body,
00:35:25.000 | that's the first step.
00:35:26.180 | And then there is the task of face recognition,
00:35:29.320 | which has been an exceptionally active area of research
00:35:32.500 | because it has a lot of applications.
00:35:34.640 | And through that research,
00:35:36.120 | we're able to now study a lot of aspects,
00:35:39.140 | how we perform perception on the face.
00:35:41.500 | So recognition, purely stated,
00:35:44.440 | is recognizing the identity of a human face.
00:35:47.980 | Who is this?
00:35:49.120 | Detection is just detecting a face.
00:35:54.020 | Now, recognition means there's a database of identities.
00:36:00.420 | What is it?
00:36:01.420 | Seven billion of them on earth.
00:36:02.920 | And you're trying to determine
00:36:05.180 | which of them it is,
00:36:07.260 | which of the seven billion it is,
00:36:09.200 | or whatever the database is.
00:36:11.760 | The face verification problem
00:36:14.900 | is something that your phone uses
00:36:17.100 | when you unlock it with your face.
00:36:19.260 | It's saying, is it you or not?
00:36:21.780 | Is it Lex or somebody else?
00:36:23.940 | It's a database of two,
00:36:25.780 | one person versus everybody else.
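
In embedding terms, verification reduces to a distance check against the single enrolled face; a minimal sketch (`embed` stands in for a pretrained face-embedding network, and the threshold would be tuned on a validation set):

```python
import numpy as np

def verify(embed, enrolled_image, probe_image, threshold=1.0):
    """Face verification: same identity if the embedding distance
    between the enrolled face and the probe falls below a threshold."""
    d = np.linalg.norm(embed(enrolled_image) - embed(probe_image))
    return d < threshold
```
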
00:36:29.640 | There's a lot of applications here, obviously,
00:36:33.060 | from identification to all the security aspects
00:36:36.980 | of using the face as a sort of fingerprint
00:36:41.860 | of your identity in all the interactive elements
00:36:44.580 | of AI systems, software-based systems in this world.
00:36:48.240 | Okay, so why is it hard?
00:36:50.400 | So all the usual computer vision problems come in.
00:36:52.620 | Lighting variation, pose variation.
00:36:55.060 | That's just, computer vision is really hard.
00:36:57.100 | It's just you get these raw numbers
00:36:58.580 | and you have to infer so many things
00:37:00.740 | that us humans take for granted.
00:37:04.540 | So the basic computer vision stuff.
00:37:06.540 | But there's stuff on top of that.
00:37:08.320 | So with faces,
00:37:11.940 | it's like cats versus dogs.
00:37:13.780 | There's thousands of breeds of dog
00:37:15.580 | and thousands of breeds of cats.
00:37:17.380 | In that same way,
00:37:19.060 | faces can look very similar to each other.
00:37:21.820 | So these two classes that you're trying to separate
00:37:24.300 | can be very, very, very close together and intermingle.
00:37:29.800 | Now, there's a lot of face data available.
00:37:33.320 | because of the applications,
00:37:35.020 | because of the financial benefits of such data sets,
00:37:38.780 | but for any one individual,
00:37:40.460 | unless you're Brad Pitt or Angelina Jolie or celebrity,
00:37:43.580 | there's not many samples of the data available.
00:37:46.580 | So the individuals based on which the classification
00:37:49.420 | is to be made, there's often not very much data.
00:37:52.120 | Then there is a lot of variation.
00:37:56.000 | So, in performing the face recognition task,
00:37:59.540 | you have to be invariant to all the hairstyles,
00:38:02.300 | all the ways you change yourself over time,
00:38:05.700 | the weight gain, the weight loss,
00:38:07.700 | the beard you decided to grow,
00:38:10.940 | the glasses you wear sometimes and not others,
00:38:14.340 | the different styles of glasses and so on,
00:38:16.620 | makeup or no makeup.
00:38:18.020 | All of these things, it's still you,
00:38:19.660 | still the same identity.
00:38:21.140 | You have to be able to classify that.
00:38:23.140 | And that kind of accuracy,
00:38:24.580 | especially for security applications,
00:38:26.200 | extremely high, that's required.
00:38:29.300 | The reason it's an exciting area
00:38:33.700 | is there's a lot of possibility,
00:38:35.340 | but there's also a lot of concern, right?
00:38:37.940 | So the future impact, utopia, dystopia,
00:38:41.700 | and the more reasonable middle path here
00:38:44.380 | is face provides a very user-friendly way
00:38:49.380 | of letting your devices recognize you and say hello.
00:38:57.140 | Your voice is certainly one,
00:38:58.480 | but one of the most powerful ones
00:39:00.300 | to really classify at a distance is face.
00:39:04.440 | So what does that mean?
00:39:05.380 | The utopian view, the possibility of the future,
00:39:08.540 | the best possible, brightest possible future.
00:39:11.180 | As you can use your face to, as a passport,
00:39:16.180 | you replace the license,
00:39:17.460 | replace all the security measures we put
00:39:20.440 | from the passwords in our devices
00:39:22.180 | to the credit card and so on,
00:39:24.260 | all of that, Apple pays, it'll be face pay.
00:39:28.580 | You show up, it'll automatically connect
00:39:30.580 | to all your devices, all your banking information, so on.
00:39:34.020 | Obviously, the flip side of that,
00:39:36.060 | just rephrasing that sentence also can be dystopian
00:39:39.900 | because complete violations of privacy,
00:39:44.140 | being watched at any time,
00:39:45.860 | being able to, through your Facebook and social media
00:39:49.820 | and all your devices being able to identify you,
00:39:52.340 | making it impossible for you to sort of hide from society.
00:39:56.780 | The fundamental aspects of privacy,
00:39:58.980 | maintaining privacy that many of us value greatly.
00:40:02.580 | The middle path is really just a useful way
00:40:05.020 | to unlock your phone.
00:40:06.160 | The recent breakthroughs here,
00:40:09.660 | it started with deep face.
00:40:14.660 | The essential idea there is applying deep neural networks
00:40:19.980 | to the task of face recognition.
00:40:23.020 | I mean, with a lot of the breakthroughs here
00:40:24.980 | on the perception side,
00:40:27.300 | we're not covering the old school papers and so on,
00:40:29.860 | and the historical context here,
00:40:34.620 | biggest breakthroughs came with deep learning,
00:40:38.860 | 2006, '07, '08, last 10 years.
00:40:43.860 | So that's the same is true with face recognition.
00:40:48.780 | Deep face was the big first application
00:40:51.580 | that achieved near human performance
00:40:54.540 | on one of the big benchmarks at the time
00:40:57.300 | on the labeled faces in the wild.
00:40:59.740 | So using a very large data set,
00:41:01.340 | being able to form a good representation.
00:41:03.940 | The state of the art,
00:41:05.420 | or at least close to the state of the art is face net.
00:41:10.240 | The key idea there is using those same deep architectures
00:41:13.880 | to now optimize for the representation itself directly.
00:41:18.260 | The notebook we're putting out,
00:41:20.580 | we shared with some of you for the assignment,
00:41:24.020 | describes face recognition, the challenge there,
00:41:27.380 | that it's not like the traditional classification problem.
00:41:30.940 | You have to form an embedding of the face
00:41:35.940 | into a small vector, compressed vector,
00:41:42.140 | such that in that embedding,
00:41:44.340 | faces that are similar to each other,
00:41:46.060 | so identities that are close together,
00:41:48.460 | are close in the Euclidean sense in that embedding,
00:41:52.620 | and people that are very different are far away.
00:41:55.540 | And so you use that embedding to then do the classification.
00:41:58.500 | That's really the only way to deal with data sets
00:42:01.380 | for which you have so little information
00:42:02.940 | on any one individual person.
00:42:04.660 | And so FaceNet optimizes that embedding
00:42:09.080 | in a way that directly optimizes for the Euclidean distance
00:42:13.180 | between non-matching identities.
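
The triplet loss from the FaceNet paper, sketched here in PyTorch (the margin value follows the paper; the rest of the training machinery, like triplet mining, is omitted): pull an anchor toward a positive of the same identity, and push it away from a negative by at least a margin.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss on embedding vectors: enforce
    d(anchor, positive) + margin < d(anchor, negative)."""
    d_pos = F.pairwise_distance(anchor, positive) ** 2
    d_neg = F.pairwise_distance(anchor, negative) ** 2
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```
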
00:42:16.340 | So there's still a lot of excitement
00:42:17.780 | about face recognition.
00:42:18.780 | There's a lot of benchmark competitions
00:42:20.580 | and a lot of people working in this,
00:42:22.340 | and really bigger, badder networks and more data
00:42:26.500 | is really one of the ways to crack this problem.
00:42:29.460 | So, a large public data set with 672,000 identities,
00:42:34.460 | 4.7 million photos, that's in 2017,
00:42:38.660 | and that just keeps scaling up and up and up and up.
00:42:41.660 | Now we have to also be honest here
00:42:43.460 | on the possible future directions of work
00:42:47.380 | in that even though the benchmarks are growing,
00:42:50.900 | that's still a tiny subset of the people in the world.
00:42:53.420 | We're still not quite there to be able to have
00:42:57.420 | the general face recognition applicable
00:42:59.340 | to the entirety of the population,
00:43:01.540 | or a large swath of the population of the world.
00:43:04.700 | So in this topic here, brief coverage,
00:43:09.140 | we're not covering all of the aspects of the face,
00:43:13.020 | especially temporal, that are useful in face recognition
00:43:16.300 | or useful saying a lot of things about the face,
00:43:18.620 | which is the FACS, facts,
00:43:21.780 | the different kinds of facial expressions
00:43:23.700 | that can then be used to infer emotion and so on.
00:43:26.780 | Raised eyebrows and all those kinds of things
00:43:30.740 | that can provide rich information
00:43:32.140 | for recognizing and interpreting the face,
00:43:34.500 | and the different other modalities,
00:43:36.420 | including 3D face recognition, we're not covering.
00:43:39.580 | There's a lot of exciting areas there.
00:43:41.220 | We're just looking at the pure formulation
00:43:43.780 | of the face recognition problem
00:43:45.260 | of looking at a 2D single image.
00:43:49.600 | The open problems here: first,
00:43:55.200 | something not often stated and often misinterpreted by people
00:44:01.060 | is that most of these methods of face recognition
00:44:05.260 | start with assuming that you have a bounding box
00:44:08.940 | around the face.
00:44:15.740 | Now, these methods are assuming a frontal
00:44:18.460 | or near-frontal view of the face.
00:44:23.260 | But you can do recognition in all kinds of poses.
00:44:23.260 | And it's very interesting to think that recognition,
00:44:27.900 | the way we recognize our friends and colleagues,
00:44:30.740 | parents and children, is often using
00:44:33.940 | a lot of cues and context information
00:44:35.460 | that's beyond just the pure frontal view of the face.
00:44:38.420 | We can do pretty well on profile views,
00:44:40.780 | from body language, and so on.
00:44:43.020 | So all those things, that's open in the field,
00:44:45.860 | how we incorporate that into face recognition.
00:44:48.440 | Then the black box side is problematic for both bias
00:44:53.140 | and just being able to understand
00:44:54.480 | why incorrect decisions are made,
00:44:56.580 | so there's a need to make these face recognition systems more interpretable.
00:45:00.420 | And then finally, privacy.
00:45:04.860 | The ability to collect the kind of data
00:45:07.620 | where the face recognition
00:45:09.460 | would be performing extremely well,
00:45:11.540 | and yet not violating the fundamental aspects
00:45:14.700 | of privacy that we value.
00:45:16.400 | Activity recognition, taking the next step forward here
00:45:24.420 | into the richer temporal context of what people do.
00:45:30.280 | Again, the same structure from recent breakthroughs
00:45:32.700 | to the future direction of work.
00:45:34.300 | What is it?
00:45:37.120 | It's classifying human activity from images or from video.
00:45:41.900 | And why is it important?
00:45:44.200 | Depending on the level of abstraction for the activity,
00:45:51.580 | it provides context for understanding the human.
00:45:54.380 | What are they doing?
00:45:55.220 | Are they playing baseball?
00:45:56.060 | Are they singing?
00:45:56.900 | Are they sleeping?
00:45:57.940 | Are they putting on makeup, knitting, mixing batter, and so on?
00:46:02.940 | Why is it hard?
00:46:05.300 | Again, all the usual problems in image recognition.
00:46:08.620 | The kind of data we're dealing with is just much larger.
00:46:12.960 | The kind of video, the richness of possibilities
00:46:16.500 | that define what activity is, is much larger.
00:46:19.360 | So the complexity is much larger.
00:46:21.680 | It's often difficult to quantify motion
00:46:26.680 | because the fundamental aspect of activity
00:46:30.600 | is the change in the world, is the motion of things.
00:46:33.440 | And then it's difficult to determine, in the dynamics
00:46:37.320 | of the physics of the world, especially from a 2D view,
00:46:40.000 | what's background information, what's noise,
00:46:42.060 | and what's essential to understanding the activity.
00:46:46.120 | And the subjective, ambiguous elements of activity.
00:46:52.420 | When does a particular activity begin?
00:46:56.860 | When does it end?
00:46:58.060 | What are all the gray areas when you're partially engaging
00:47:03.120 | in that activity and so on?
00:47:05.200 | When you start to annotate these things,
00:47:07.040 | when you start to try to do the detection,
00:47:08.560 | it becomes clear that sometimes the activity
00:47:12.280 | is partially undertaken and the beginning
00:47:16.080 | and the end is fuzzy.
00:47:17.240 | Future impact, utopia, dystopia, middle path.
00:47:21.880 | So the impact here comes from being able
00:47:25.760 | to understand the world in time and be able to predict.
00:47:31.260 | The utopian possibility is that the contextual perception
00:47:36.260 | that can occur from here can enrich the experience
00:47:39.260 | between the human and robot.
00:47:40.660 | The dystopian view, the flip side is being able
00:47:45.140 | to understand sort of human activities
00:47:47.560 | can let the robots sever the relationship.
00:47:50.660 | So it can damage the human-robot interaction
00:47:54.860 | to where they just do their own thing.
00:47:57.260 | The middle path is just finding useful information,
00:47:59.580 | massive amounts of data like YouTube.
00:48:01.820 | Now there's a YouTube video data set,
00:48:03.820 | being able to identify what's going on in this video,
00:48:06.160 | being able to infer rich, useful semantic information.
00:48:10.580 | And so what do we do with video?
00:48:12.380 | How do we do perception in video?
00:48:14.200 | Now the recent breakthrough came with deep learning
00:48:17.340 | and C3D, these 3D convolutional neural networks
00:48:20.180 | that take a sequence of images and are able to determine
00:48:23.060 | the action that's going on in an end-to-end way,
00:48:25.300 | what's going on in the video.
00:48:26.900 | That was the recent breakthrough.
00:48:29.500 | The state of the art coming from a slightly,
00:48:32.500 | well, from a different architecture
00:48:34.180 | that takes in two streams.
00:48:35.800 | One is the image RGB data, the other is optical flow data
00:48:40.500 | that's really focusing on the motion in the image.
00:48:42.940 | Those are the two streams that opened the wave
00:48:44.820 | of two-stream networks.
00:48:46.900 | Here, from that paper, showing the different architectures:
00:48:49.700 | on the far right is the two-stream architecture,
00:48:54.340 | and C3D is shown under B here,
00:48:59.020 | taking a sequence of images.
00:49:00.300 | But all these are just different architectures.
00:49:02.340 | And then the first one is LSTMs.
00:49:05.020 | There are different architectures for how you represent,
00:49:07.580 | how you allow a network,
00:49:08.860 | how you allow a learning model,
00:49:11.140 | to capture the dynamics in the data.
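
A toy C3D-flavored sketch (far smaller than the real networks; the layer sizes and the 101-class output are illustrative): 3D convolutions slide over time as well as space, so the filters can respond to motion.

```python
import torch.nn as nn

# Input shape: (batch, channels=3, frames, height, width).
tiny_c3d = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d((1, 2, 2)),                    # pool space, keep time
    nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(64, 101),                         # e.g. 101 action classes
)
```
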
00:49:13.280 | The future possibilities have to do,
00:49:16.860 | well, literally with the future,
00:49:18.340 | being able to take single images or sequences of images
00:49:21.860 | and predicting the future.
00:49:23.540 | It's very interesting to think about
00:49:25.340 | in our ability to hallucinate the future,
00:49:28.900 | and generate the future from images,
00:49:32.340 | you start to think about what are the defining qualities
00:49:35.300 | of activities, and in this way, augment data
00:49:37.820 | and be able to train much more accurate
00:49:40.060 | action recognition systems.
00:49:42.140 | A topic not covered is the localization of activity in video.
00:49:46.920 | Action recognition, purely defined, is: I give you a clip
00:49:50.260 | and you tell me what's going on in this clip.
00:49:53.060 | Now, if you take a full YouTube video,
00:49:55.100 | you want to be able to localize,
00:49:56.660 | to find all the times when a particular activity is going on.
00:50:00.580 | It could be multi-label: multiple activities going on
00:50:02.860 | at the same time, beginning and ending asynchronously.
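As a hedged sketch of what localization adds on top of clip-level recognition, here is a simple sliding-window approach in Python. The fixed window, stride, threshold, and the `clip_model` interface are all assumptions for illustration; modern localization methods use learned temporal proposals instead.

```python
# A minimal sketch of temporal activity localization: slide a
# clip-level classifier over a long video, average overlapping
# window scores, and threshold per class to recover segments.
import numpy as np

def localize(video_frames: np.ndarray, clip_model, num_classes: int,
             window: int = 16, stride: int = 8, threshold: float = 0.5):
    """Return {class_id: [(start_frame, end_frame), ...]}.

    clip_model(clip) is assumed to map a (window, H, W, 3) clip to
    per-class sigmoid scores, so multiple labels can be active at once.
    """
    T = len(video_frames)
    scores = np.zeros((T, num_classes))
    counts = np.zeros(T)
    for start in range(0, T - window + 1, stride):
        s = clip_model(video_frames[start:start + window])  # (num_classes,)
        scores[start:start + window] += s
        counts[start:start + window] += 1
    scores /= np.maximum(counts, 1)[:, None]  # average overlapping windows

    segments = {c: [] for c in range(num_classes)}
    for c in range(num_classes):
        active, start = scores[:, c] > threshold, None
        for t in range(T):
            if active[t] and start is None:
                start = t
            elif not active[t] and start is not None:
                segments[c].append((start, t))  # activity began and ended
                start = None
        if start is not None:
            segments[c].append((start, T))
    return segments

# Example with a dummy "model" that always scores class 0 at 0.6:
dummy = lambda clip: np.array([0.6, 0.1])
video = np.zeros((100, 224, 224, 3))
print(localize(video, dummy, num_classes=2)[0])  # one segment covering most frames
```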
00:50:06.480 | And then there is richer, three-dimensional
00:50:11.980 | (or 2D) classification of activity based on human movement:
00:50:16.740 | skeleton-based action recognition
00:50:20.020 | from a Kinect or other 3D sensors,
00:50:22.980 | sensors that provide you more
00:50:25.780 | than just 2D image data.
00:50:30.300 | The open problem is that activity recognition
00:50:35.300 | is more than just the way we move our body,
00:50:39.300 | or, in baseball, a ball in your hand
00:50:42.420 | and hitting it with a baseball bat.
00:50:45.460 | It also has to do with context.
00:50:47.460 | Sitting down, working, looking at something,
00:50:52.460 | picking up an item:
00:50:53.940 | the meaning of those can change profoundly
00:50:56.780 | based on the other objects in the scene
00:50:58.940 | and the activity of other people in the scene.
00:51:00.980 | Being able to work with that kind of context
00:51:03.780 | is a totally open problem.
00:51:05.900 | It requires reducing a very complex real-world context
00:51:10.060 | into something where you can clearly identify an activity.
00:51:14.260 | Body pose estimation is the task of localizing the joints
00:51:21.620 | that form the skeleton of the human body:
00:51:26.180 | inferring, from visual information,
00:51:28.740 | the positions of the different joints.
00:51:30.700 | Along the line of complexity,
00:51:33.460 | it's important to be able to understand body language,
00:51:35.820 | the rich information carried by the body of the human being,
00:51:40.820 | for everything from reading body language to animation
00:51:43.780 | to aiding activity recognition.
00:51:47.260 | And it's just a useful representation of the human body.
00:51:51.820 | If you're analyzing pedestrians,
00:51:53.980 | or in interactive environments, in human-robot interaction,
00:51:57.260 | trying to understand what the heck it is
00:51:59.580 | the human is trying to do,
00:52:01.140 | body pose is really useful.
00:52:03.400 | It's hard because,
00:52:07.700 | when you look at a 2D image projection of the body,
00:52:11.420 | there's a lot going on;
00:52:13.540 | it's a high-dimensional optimization problem,
00:52:16.100 | figuring out how the raw pixels map
00:52:18.740 | to the actual three-dimensional orientation
00:52:21.980 | of the human joints.
00:52:24.340 | Plus the usual computer vision challenges
00:52:26.140 | of pose, lighting, and so on.
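As a hedged illustration of what "localizing the joints" looks like at inference time, here is a minimal sketch of heatmap decoding, the step most deep pose estimators end with. The joint list, heatmap shape, and confidence threshold are assumptions for illustration.

```python
# A minimal sketch of heatmap decoding for pose estimation: the
# network predicts one confidence map per joint, and each joint's 2D
# location is read off as that map's peak.
import numpy as np

JOINTS = ["head", "neck", "r_shoulder", "r_elbow", "r_wrist"]  # assumed subset

def decode_heatmaps(heatmaps: np.ndarray, min_confidence: float = 0.3):
    """heatmaps: (num_joints, H, W) array of per-joint confidence maps.

    Returns {joint_name: (x, y, confidence)} for joints whose peak
    clears the threshold; low-confidence (e.g. occluded) joints drop out.
    """
    poses = {}
    for j, name in enumerate(JOINTS):
        hm = heatmaps[j]
        y, x = np.unravel_index(np.argmax(hm), hm.shape)  # peak location
        conf = float(hm[y, x])
        if conf >= min_confidence:
            poses[name] = (int(x), int(y), conf)
    return poses

# Usage with a fake heatmap stack standing in for a network's output:
fake = np.random.rand(len(JOINTS), 64, 48) * 0.2
fake[3, 40, 20] = 0.9                 # a confident right elbow
print(decode_heatmaps(fake))          # r_elbow near (20, 40); the rest drop out
```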
00:52:28.340 | Future impact:
00:52:31.220 | it's really exciting, for interactive environments,
00:52:34.320 | for a robot to be able to know the position
00:52:37.460 | of the human body with which it's trying to interact.
00:52:39.620 | Whether it's a robot that's trying to get
00:52:42.980 | its favorite human a beer,
00:52:44.540 | or whatever your favorite choice of drink,
00:52:47.500 | it has to be able to find where the hand is
00:52:49.500 | so it can do the hand-off.
00:52:50.780 | Same thing in the car.
00:52:52.180 | You have to determine if the person's hands
00:52:54.740 | are on the steering wheel,
00:52:56.020 | if their head orientation is such
00:52:59.100 | that they're able to physically take control of the vehicle.
00:53:01.420 | That's a really exciting set of possibilities.
00:53:03.860 | And there are applications in sports and CGI
00:53:07.060 | and video games, and in all settings
00:53:09.420 | where the robot and human have to work together.
00:53:11.900 | The dystopian view you can imagine is,
00:53:14.860 | of course, that being able to localize all those joints
00:53:17.780 | means robots that are able to more effectively hurt humans.
00:53:22.060 | And so that's always a huge concern,
00:53:24.780 | and always a dark, dystopian view of a world
00:53:29.220 | with so much AI in it.
00:53:30.560 | Of course, the reality is,
00:53:32.060 | it's just richer, more fulfilling HCI
00:53:34.900 | that takes advantage of not just the face,
00:53:37.900 | the information coming from the face,
00:53:39.200 | but also the body of the human
00:53:42.660 | that the robot is interacting with.
00:53:44.580 | So it started with deep learning being applied
00:53:48.260 | to the body pose estimation problem
00:53:50.580 | in 2014, with DeepPose.
00:53:52.860 | The key idea there is looking at
00:53:54.580 | the holistic human pose estimation problem
00:53:57.300 | of detecting all the different joints
00:53:59.980 | of a single person in an image.
00:54:02.880 | The power of deep learning is that you no longer have to do
00:54:05.260 | handcrafted, expert-engineered features;
00:54:08.340 | it automatically determines a set of features,
00:54:10.840 | and all the parts are detected for you.
00:54:12.560 | So this highly complex problem is all solved with data.
00:54:16.500 | The state of the art, from 2017 and beyond,
00:54:21.420 | there have been a few papers from CMU
00:54:24.260 | along this line, is doing real-time multi-person
00:54:27.460 | 2D pose estimation,
00:54:29.260 | but in a bottom-up way,
00:54:31.500 | where you're detecting individual joints first:
00:54:35.440 | all the knees in the picture,
00:54:36.940 | all the elbows, all the shoulders,
00:54:39.460 | all the wrists, and so on,
00:54:41.420 | and then stitching them together
00:54:42.540 | using part affinity fields
00:54:44.140 | to find the most likely assignment.
00:54:45.660 | So if you find 17 elbows in a picture,
00:54:49.440 | you then have to try to see which elbow
00:54:51.980 | belongs to which person.
00:54:53.980 | That actually turns out to be an extremely powerful way
00:54:57.660 | to detect body pose, especially multi-person,
00:54:59.540 | and especially as a way
00:55:02.620 | of dealing with occlusions.
00:55:05.140 | It's really interesting, and,
00:55:08.320 | because of that,
00:55:09.700 | because of the separation of the detections,
00:55:12.840 | it's able to run in real time,
00:55:14.500 | which is also really exciting.
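Here is a minimal sketch of that bottom-up association step, the "which elbow belongs to which person" problem. Scoring candidate pairs by integrating a part affinity field along the connecting line and then matching greedily follows the spirit of the CMU work, but the field format, sampling count, and threshold here are simplifying assumptions.

```python
# A minimal sketch of part-affinity-field association: joints are
# detected first, each (shoulder, elbow) pair is scored by how well
# the predicted limb-direction field agrees with the segment between
# them, and pairs are matched greedily, best score first.
import numpy as np

def pair_score(paf: np.ndarray, a: tuple, b: tuple, samples: int = 10) -> float:
    """paf: (2, H, W) unit-vector field for one limb type (e.g. upper arm).

    Points are (x, y) and assumed to lie inside the image.
    """
    a, b = np.array(a, float), np.array(b, float)
    direction = b - a
    norm = np.linalg.norm(direction)
    if norm < 1e-6:
        return 0.0
    direction /= norm
    total = 0.0
    for t in np.linspace(0.0, 1.0, samples):
        x, y = (a + t * (b - a)).astype(int)
        total += paf[0, y, x] * direction[0] + paf[1, y, x] * direction[1]
    return total / samples

def greedy_match(shoulders, elbows, paf, threshold: float = 0.5):
    """Assign each elbow to at most one shoulder, best scores first."""
    scored = sorted(
        ((pair_score(paf, s, e), i, j)
         for i, s in enumerate(shoulders) for j, e in enumerate(elbows)),
        reverse=True)
    used_s, used_e, limbs = set(), set(), []
    for score, i, j in scored:
        if score > threshold and i not in used_s and j not in used_e:
            limbs.append((shoulders[i], elbows[j]))
            used_s.add(i); used_e.add(j)
    return limbs

# Usage with a synthetic field pointing straight down (+y):
paf = np.zeros((2, 100, 100)); paf[1] = 1.0
shoulders = [(30, 20), (70, 20)]
elbows = [(30, 50), (70, 50)]
print(greedy_match(shoulders, elbows, paf))  # pairs each shoulder with the elbow below it
```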
00:55:16.000 | A possible future direction is
00:55:19.260 | using much more information,
00:55:22.160 | using deformable models of the human body:
00:55:25.860 | not just a skeleton, but
00:55:28.020 | rich volumetric information, to do the detection,
00:55:32.820 | and then optimizing for the most likely
00:55:35.620 | orientation of the body.
00:55:37.940 | The open problem in the field is the fact that
00:55:42.900 | pose is not a thing that happens in a single image.
00:55:48.140 | Pose is part of human behavior
00:55:51.580 | and part of movement through time.
00:55:52.860 | So here, Monty Python's Ministry of Silly Walks:
00:55:56.700 | people walk in funny ways.
00:55:57.980 | We collect a lot of data on pedestrians,
00:56:01.220 | and I can tell you that people walk in different ways
00:56:03.220 | and position their bodies in different ways.
00:56:05.980 | The temporal aspects of human motion
00:56:09.860 | are, for the most part, not incorporated
00:56:13.580 | in the body pose estimation problem, and they should be.
00:56:16.100 | There are a lot of exciting possibilities
00:56:17.660 | in capturing the temporal dynamics.
00:56:20.780 | There are a lot of awesome slides here
00:56:26.140 | that I'm just skipping through:
00:56:28.620 | speech recognition;
00:56:31.000 | recommender systems, which 2018 was really big for,
00:56:36.000 | for Netflix and OkCupid; AI for President.
00:56:41.560 | Each one of the things that I mentioned briefly today
00:56:47.280 | will have a separate mini lecture.
00:56:50.440 | I taught an entire course on this at CHI last year.
00:56:52.880 | So, deep learning for understanding the human.
00:56:54.760 | It's a topic I'm really excited about,
00:56:56.920 | because understanding the human is really the first step
00:56:59.820 | for a machine to be able to interact
00:57:02.680 | in a rich way with a human being.
00:57:03.600 | And it's also the area where the most near-term impact
00:57:06.680 | can happen: a system able to effectively detect
00:57:09.920 | what a human being is up to, what they're thinking about,
00:57:13.720 | how best to serve them and enrich the experience
00:57:18.600 | of interacting with that human.
00:57:21.600 | Let me jump to AI safety, and then to the interactive
00:57:26.680 | experience between humans and robots, to give examples
00:57:31.300 | of some work in that direction, some research
00:57:33.500 | that I'm really excited about.
00:57:35.460 | So AI safety, at the very basic level,
00:57:39.180 | there's an AI system that's making decisions
00:57:42.380 | where we want human beings to supervise those decisions.
00:57:45.740 | We've done quite a bit of work here at MIT
00:57:48.140 | on that aspect of supervising machines,
00:57:51.300 | with arguing machines.
00:57:52.660 | And OpenAI has done work with safety
00:57:55.400 | by having machines debate each other.
00:58:00.440 | So there's this idea that you can achieve safety
00:58:05.440 | by not giving ultimate power to any one decision maker.
00:58:09.560 | The disagreement that emerges from two AI systems,
00:58:14.560 | or multiple AI systems, having to make decisions
00:58:19.800 | and agree with each other,
00:58:21.540 | allows us to produce a signal of uncertainty,
00:58:25.280 | based on which human supervision can be sought.
00:58:27.840 | Without that, when we have a state-of-the-art
00:58:31.440 | black-box AI system that does something like drive a car,
00:58:34.480 | all we have is a system that just runs,
00:58:37.400 | and we're supposed to have faith
00:58:38.880 | that it's always going to be right.
00:58:40.000 | We don't have any uncertainty signal coming from the system.
00:58:43.960 | So the idea of arguing machines, which we've developed
00:58:48.800 | and been working on, is to have multiple AI systems,
00:58:51.940 | an ensemble of AI systems, where,
00:58:55.020 | when a disagreement is detected,
00:58:56.700 | human supervision is sought.
00:58:58.300 | And the idea there is that when you have a system
00:59:01.260 | like Tesla Autopilot,
00:59:02.820 | and here we've instrumented a Tesla vehicle,
00:59:06.900 | it's telling you nothing
00:59:08.540 | about how uncertain it is
00:59:12.460 | about the decisions it's making.
00:59:14.180 | Once the system is on,
00:59:17.140 | it's steering the car for you,
00:59:19.200 | and in very rare cases, it will just disengage.
00:59:22.280 | But no matter what, it's not showing you
00:59:24.340 | the degree of uncertainty it has about the world around it.
00:59:27.520 | So the way we create that signal of uncertainty
00:59:30.560 | is by adding another, in this case end-to-end, vision system
00:59:34.680 | that's looking at the external environment
00:59:36.120 | and making steering decisions.
00:59:37.360 | And whenever a disagreement
00:59:38.880 | between the two is detected,
00:59:40.300 | that's when human supervision is sought.
00:59:42.760 | In this way,
00:59:46.880 | as shown in the plot there,
00:59:48.720 | we can predict with high accuracy
00:59:52.920 | the times when the driver chose to disengage the system
00:59:56.860 | because they were uncomfortable.
00:59:58.360 | So you're using this mechanism
01:00:00.840 | to detect risky, challenging situations.
01:00:04.520 | It's an idea about how we supervise AI
01:00:09.080 | by having multiple AI systems that are independent,
01:00:12.800 | and through their disagreement
01:00:14.600 | emerges the uncertainty signal.
01:00:17.320 | And just as the OpenAI folks
01:00:19.840 | had systems debate in natural language,
01:00:22.080 | we can apply this in computer vision as well,
01:00:24.760 | taking two networks, ResNet and VGGNet,
01:00:28.480 | independently trained
01:00:30.680 | but on the same training set, ImageNet,
01:00:34.480 | and we can have them argue,
01:00:37.100 | and in the process, significantly improve the accuracy.
01:00:40.840 | So in the case of ResNet as an architecture
01:00:44.880 | and VGGNet as an architecture,
01:00:46.840 | trained on the ImageNet training dataset,
01:00:49.120 | they separately have a certain error:
01:00:53.760 | ResNet has an error of 8%,
01:00:56.320 | VGG16 has an error of 10%.
01:00:58.800 | When we apply the arguing machines framework,
01:01:01.760 | where the disagreement is brought to the human,
01:01:04.480 | that error rate decreases to 2.8%.
01:01:07.880 | Now, this is just the ImageNet challenge,
01:01:10.960 | but if that error meant the loss of human life,
01:01:14.920 | this kind of framework is really powerful
01:01:17.240 | for overseeing the operation of the AI system.
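A minimal sketch of the arguing machines loop for image classification, assuming a recent torchvision with pretrained ResNet-50 and VGG-16 as stand-ins for the pair described in the lecture, and simulating the human annotator with the ground-truth label:

```python
# A minimal sketch of arguing machines for classification: two
# independently trained networks each predict a label, and whenever
# their top-1 predictions disagree, the example is routed to a human.
import torch
from torchvision import models

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).eval()

def arguing_predict(image: torch.Tensor, human_label: int) -> tuple:
    """Return (final_label, asked_human) for one preprocessed image.

    human_label stands in for the human annotator in this sketch.
    """
    with torch.no_grad():
        a = resnet(image.unsqueeze(0)).argmax(dim=1).item()
        b = vgg(image.unsqueeze(0)).argmax(dim=1).item()
    if a == b:
        return a, False       # agreement: trust the ensemble
    return human_label, True  # disagreement: defer to human supervision
```

Under this scheme, the remaining machine error comes only from cases where both networks make the same mistake, which is how two classifiers with roughly 8% and 10% error can fall to a few percent at the cost of occasional human queries.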
01:01:21.480 | Here are just examples where they disagree.
01:01:24.260 | Taking this image from ImageNet,
01:01:27.440 | the ground truth is a wine bottle,
01:01:29.240 | and ResNet's prediction, with 0.93,
01:01:32.720 | 93% confidence, is that it's a paper towel,
01:01:35.800 | while VGGNet says, with 25% confidence, that it's a seatbelt.
01:01:39.440 | These disagreements are then surfaced,
01:01:41.480 | and the fact that they disagree
01:01:45.000 | raises the uncertainty,
01:01:48.880 | human supervision is brought in,
01:01:50.920 | and humans are able to annotate correctly
01:01:53.360 | what's going on in the picture.
01:01:54.980 | Same thing here: the ground truth is a mailbox.
01:01:58.780 | Again, the two architectures disagree.
01:02:01.780 | One says traffic light, the other one says garbage truck.
01:02:04.640 | For an autonomous vehicle,
01:02:06.640 | you can imagine this being problematic:
01:02:10.120 | if it thinks there's a traffic light,
01:02:11.680 | it might stop for this mailbox, that kind of thing.
01:02:14.800 | That's early research in the field
01:02:17.840 | of how, as we have AI systems
01:02:20.240 | that are more and more powerful,
01:02:22.000 | we can also inject human effort
01:02:25.600 | to supervise when it's needed.
01:02:27.560 | The "when it's needed" part,
01:02:28.640 | the uncertainty signal, is the critical thing,
01:02:30.680 | so we have to figure out ways
01:02:31.740 | to create that uncertainty signal.
01:02:34.080 | Then there's the subarea of creating a rich human interaction.
01:02:38.240 | So here, we're doing a lot of testing
01:02:42.960 | with autonomous vehicles.
01:02:44.240 | In the video, I'm tweeting.
01:02:45.200 | We have a human-centered autonomous vehicle
01:02:51.880 | here at MIT that's passing control back and forth
01:02:54.320 | with the human based on the driver's activity.
01:02:56.860 | That's just me explaining the video.
01:03:00.440 | The point is that the driving experience,
01:03:04.600 | the human-robot interaction experience,
01:03:06.880 | should be fun and awesome and enriching to life.
01:03:11.320 | And that's why you would want to use these kinds of systems.
01:03:15.000 | We have a bunch of videos online.
01:03:17.420 | You can check them out,
01:03:18.560 | including a ridiculous one of me playing guitar.
01:03:21.760 | And there's a paper along with this,
01:03:23.280 | describing different principles
01:03:24.640 | of how we have humans and robots work together
01:03:27.040 | in this kind of way.
01:03:29.040 | There are a lot of totally untouched problems in that space.
01:03:33.120 | Most of the robotics community
01:03:34.640 | and the machine learning community approach AI
01:03:36.680 | as a system that we want to make perfect,
01:03:39.580 | and once it's perfect,
01:03:42.480 | we then put it in the real world,
01:03:44.080 | where we humans get to interact with it.
01:03:46.600 | But just like, what is it, Robin Williams in Good Will Hunting,
01:03:51.600 | talking about relationships: nobody's perfect.
01:03:56.620 | The way I foresee it,
01:03:59.360 | AI systems will not be perfect for the next 100 years.
01:04:02.600 | So we have to have humans and AI systems work together
01:04:05.520 | and optimize that problem, solve that problem:
01:04:08.640 | both of us are flawed,
01:04:10.280 | but together there's something enriching to both.
01:04:14.720 | As I mentioned, the videos here will be available online:
01:04:17.600 | the lectures underlying all of deep learning
01:04:19.760 | for understanding the human,
01:04:21.240 | and underlying the five principles here
01:04:23.280 | of human-centered AI.
01:04:24.960 | It's an area of active research here at MIT
01:04:28.800 | and globally, and it's one
01:04:30.520 | that I'm extremely passionate about.
01:04:32.580 | And one of the analogies I think about,
01:04:35.980 | when I think about the success
01:04:38.040 | of artificial intelligence systems,
01:04:40.920 | is parasitism versus symbiosis.
01:04:45.240 | A lot of the way we're training
01:04:47.420 | machine learning algorithms now
01:04:50.560 | is that we inject a lot of human labor,
01:04:54.080 | a lot of really costly human labor,
01:04:56.160 | separately, offline, out of the loop,
01:04:58.880 | in order to improve the learning models
01:05:01.920 | through brute-force annotation.
01:05:03.720 | What I see as success in the future
01:05:06.960 | requires that the learning is done,
01:05:10.920 | that the models improve, in a symbiotic way,
01:05:14.280 | as a side effect of interacting with humans.
01:05:16.800 | This is done a lot now in reinforcement learning,
01:05:18.960 | through game playing and so on.
01:05:20.740 | But the human computation,
01:05:23.600 | the human effort of annotation,
01:05:25.680 | should be something that happens naturally through interaction,
01:05:28.320 | not a costly thing you have to pay for.
01:05:30.640 | Because when it happens naturally,
01:05:33.400 | in a symbiotic way, we can increase scale.
01:05:36.200 | We can scale learning to the degree that's required
01:05:39.260 | to solve some of the real-world problems.
01:05:42.280 | That also requires solving a lot of aspects
01:05:46.000 | of human-robot interaction:
01:05:48.780 | from understanding our own brain,
01:05:51.540 | from the biological and electrical sides of neuroscience,
01:05:55.520 | to the behavioral aspects captured by cognitive science,
01:05:58.520 | psychology, and sociology,
01:06:00.720 | to the mathematical formulations of behavior
01:06:03.040 | in game theory,
01:06:04.200 | to taking that human behavior
01:06:07.320 | and putting it in the real world with engineering systems,
01:06:09.680 | human factors and design.
01:06:11.240 | These are all giant subfields, with conferences and papers,
01:06:14.660 | and all of them need to work together.
01:06:16.360 | Then, on the computer science side,
01:06:18.000 | there's natural language processing,
01:06:20.060 | understanding language;
01:06:21.580 | human-robot interaction
01:06:23.440 | and human-computer interaction,
01:06:24.560 | just the interfaces:
01:06:25.980 | how, and what, does the computer,
01:06:30.420 | the robot, show to you?
01:06:32.080 | Again, entire conferences.
01:06:33.920 | Then there are the exciting aspects of learning from data,
01:06:36.640 | deep learning, and learning to act from data,
01:06:39.960 | reinforcement learning,
01:06:41.120 | deep reinforcement learning.
01:06:42.280 | And then robotics is actually building these things,
01:06:45.620 | building the hardware,
01:06:48.720 | again an entire area, an exciting field of research.
01:06:52.940 | All of them have to work together to create systems
01:06:56.320 | that integrate the human during the learning process
01:06:58.900 | and integrate the human during the operation process.
01:07:01.980 | So, the videos are on deeplearning.mit.edu;
01:07:06.780 | videos and slides are available there,
01:07:08.900 | and code is available there.
01:07:10.940 | So with that, I'd like to thank you very much.
01:07:13.740 | (audience applauding)
01:07:14.980 | (audience cheering)
01:07:17.980 | (upbeat music)