
MIT 6.S093: Introduction to Human-Centered Artificial Intelligence (AI)


Chapters

0:00 Introduction to human-centered AI
5:17 Deep Learning with human out of the loop
6:11 Deep Learning with human in the loop
8:55 Integrating the human into training process and real-world operation
11:53 Five areas of research
15:38 Machine teaching
19:27 Reward engineering
22:35 Question about representative government as a recommender system
24:27 Human sensing
27:06 Human-robot interaction experience
30:28 AI safety and ethics
33:10 Deep learning for understanding the human
34:06 Face recognition
45:20 Activity recognition
51:16 Body pose estimation
57:24 AI Safety
62:35 Human-centered autonomy
64:33 Symbiosis with learning-based AI systems
65:42 Interdisciplinary research


00:00:00.000 | Welcome to Human-Centered Artificial Intelligence.
00:00:03.120 | The last couple of decades
00:00:07.040 | in the developments of deep learning
00:00:09.140 | have been exciting in the problems
00:00:14.040 | that we've been able to automate,
00:00:15.880 | in the problems that we've been able to crack
00:00:18.320 | with learning-based methods.
00:00:21.720 | One of the things underlying this lecture
00:00:23.820 | and the following lectures
00:00:26.480 | is the idea that with purely the learning-based approach
00:00:31.480 | that we have been using,
00:00:33.480 | there's certain aspects that are fundamental to our reality
00:00:37.460 | that we're going to hit a wall on,
00:00:39.860 | that we have to integrate, incorporate the human being
00:00:43.200 | deeply into the learning-based systems
00:00:45.720 | in order to make the systems learn well
00:00:50.200 | and operate in the real world.
00:00:53.180 | The underlying first prediction
00:00:57.160 | under the idea of human-centered AI in this century
00:01:00.800 | is that the learning-based approaches
00:01:03.200 | that have been successful over the past two decades,
00:01:06.200 | like deep learning, machine learning approaches
00:01:08.160 | that learn from data,
00:01:09.520 | are going to continue to become better
00:01:12.920 | and dominate the real world applications.
00:01:16.240 | So as opposed to fine-tuned optimization-based models
00:01:21.080 | that do not learn from data,
00:01:23.300 | more and more we're going to see learning-based methods
00:01:26.960 | dominate real world applications.
00:01:29.480 | That's the underlying prediction that we're working with.
00:01:33.520 | Now, if that's the case, the corollary of that,
00:01:38.220 | if learning-based methods are the solution
00:01:40.700 | to many of these real world problems,
00:01:43.380 | is the way we get smarter AI systems
00:01:46.780 | is by improving the machine learning
00:01:50.280 | and the machine teaching.
00:01:52.000 | Machine learning is the thing
00:01:53.720 | that we've been talking about quite a bit.
00:01:56.720 | That's the deep learning, that's the algorithm,
00:01:58.860 | the optimization of neural network parameters
00:02:01.520 | where you learn from data.
00:02:03.360 | That's the current focus of the community,
00:02:05.040 | current focus in the research,
00:02:06.380 | and the thing that's behind the success
00:02:08.440 | of much of the developments in deep learning.
00:02:11.400 | And then there's the machine teaching.
00:02:13.800 | That's the human-centered part.
00:02:16.000 | It's optimizing not the models,
00:02:19.940 | not the algorithms,
00:02:21.500 | but optimizing how you select the data
00:02:25.580 | based on which the algorithms learn.
00:02:27.840 | It's to make better teachers.
00:02:29.620 | Just like when you yourself are learning as a student
00:02:33.000 | or as a child how to operate in this world,
00:02:36.220 | the world and the parents and the teachers around you
00:02:41.220 | are informing you with very sparse information,
00:02:45.720 | but providing the kind of information
00:02:47.480 | that is most useful for your learning process.
00:02:49.780 | The selection of data based on which to learn,
00:02:53.680 | I believe, is the critical direction of research
00:02:57.100 | that we have to solve in order to create
00:03:01.440 | truly intelligent systems,
00:03:03.120 | the ones that are able to work in the real world,
00:03:05.960 | and I'll explain why, and what I'm referring to.
00:03:10.100 | The implications of learning-based systems.
00:03:12.880 | So when you have a learning system,
00:03:15.440 | a system that learns from data,
00:03:18.180 | neural networks, machine learning,
00:03:20.360 | learns from data,
00:03:21.540 | the fundamental reality of that
00:03:25.800 | is that the model is trying to generalize
00:03:29.560 | across the entirety of the reality
00:03:32.640 | in which it will be tasked with operating,
00:03:35.360 | based on a very small subset of samples from that reality,
00:03:40.360 | and that generalization means that
00:03:43.320 | there's always going to be a degree of uncertainty.
00:03:47.360 | There's always going to be
00:03:48.720 | a degree of incomplete information,
00:03:50.600 | and so no matter how much we want to,
00:03:53.900 | these systems will not be provably safe,
00:03:57.600 | so we can't put down anything concrete
00:04:00.960 | that guarantees safety in some specific way
00:04:05.120 | unless the system is extremely constrained.
00:04:07.880 | Therefore, we need human supervision of these systems.
00:04:11.140 | The systems will not be provably fair
00:04:13.600 | from an ethics perspective,
00:04:15.040 | from a discrimination perspective,
00:04:16.640 | from all degrees of fairness.
00:04:18.600 | Therefore, we need human supervision of these systems,
00:04:22.540 | and it will not be explainable.
00:04:26.160 | At any step of the pipeline
00:04:27.640 | in which they made the decisions,
00:04:29.300 | AI systems will not be perfectly explainable
00:04:32.440 | to the satisfaction of us as human supervisors.
00:04:37.440 | So there again,
00:04:40.800 | human supervision constantly will be required,
00:04:44.560 | and the solution to this is a whole set of techniques,
00:04:47.520 | whole set of ideas that we're putting under the flag
00:04:51.520 | of human-centered artificial intelligence,
00:04:53.800 | human-centered AI,
00:04:55.000 | and the core ideas there is that we need to integrate
00:04:58.660 | the human being deeply into the annotation process
00:05:02.680 | and deeply into the human supervision
00:05:06.440 | of the real-world operation of the system,
00:05:09.320 | so both in the training phase and the testing phase,
00:05:13.360 | the execution, the operation of the system.
00:05:16.880 | So this is what deep learning looks like
00:05:18.760 | with the human out of the loop.
00:05:20.480 | The human contributes to a learning model
00:05:25.160 | by helping annotate some data,
00:05:27.720 | and that data is then used to train a model
00:05:31.180 | that hopefully generalizes in the real world,
00:05:33.120 | and that model makes decisions,
00:05:35.560 | and deep learning is really exciting
00:05:37.120 | because it's able to,
00:05:38.840 | in a greater and greater degree of autonomy,
00:05:41.640 | able to form high-level representations of the raw data
00:05:45.840 | in a way that it's actually able to do quite well
00:05:49.000 | on certain kinds of tasks
00:05:50.680 | that were before very difficult,
00:05:52.580 | but fundamentally, the human is out of the loop,
00:05:55.140 | both of the training and the operation.
00:05:57.220 | First, you build the dataset, annotate the dataset,
00:06:00.340 | and then the systems run away with it.
00:06:02.680 | They train on the data,
00:06:03.960 | and the real-world operation does not involve the human
00:06:06.960 | except as the recipient of the service the system provides.
00:06:11.320 | Now, the human in the loop version of that,
00:06:13.360 | the human-centered version of that,
00:06:15.280 | means that annotation and operation of the system
00:06:20.280 | is both aided by human beings in a deep way.
00:06:27.920 | What does that mean?
00:06:29.960 | So we can look at human experts,
00:06:32.120 | so individuals, and crowd intelligence,
00:06:35.920 | the wisdom of the crowd and the wisdom of the individual.
00:06:39.620 | At the training phase, the first part of that
00:06:44.520 | is the objective annotation.
00:06:46.320 | We need to significantly improve objective annotation,
00:06:49.440 | meaning annotation where the human intelligence
00:06:53.160 | is sufficient to be able to look at a sample and annotate it.
00:06:57.040 | This is what we think about as an ImageNet
00:06:59.040 | and all the basic computer vision tasks
00:07:00.880 | where a single human is enough
00:07:02.320 | to do a pretty damn good job
00:07:04.120 | of determining what's in a particular sample.
00:07:06.960 | And then there's subjective annotation,
00:07:09.180 | things that are difficult for humans to determine
00:07:12.760 | as a single human being;
00:07:15.000 | as a crowd, we kind of converge on these difficult questions.
00:07:20.000 | These are questions at a low level of emotion,
00:07:24.520 | these things that are a little bit fuzzy,
00:07:26.880 | that require multiple people to annotate,
00:07:28.800 | and at the high level are ethical questions
00:07:31.640 | of decisions that an AI system is tasked with making,
00:07:36.560 | or we're tasked with making,
00:07:38.260 | that nobody really knows the right answer to.
00:07:40.640 | And as a crowd, we kind of converge on the right answer.
00:07:43.400 | That's where the crowd intelligence comes in
00:07:45.200 | on the data annotation step.
00:07:46.920 | Now in the operation, once you train the model,
00:07:50.600 | the supervision, again, of the system based,
00:07:54.600 | and I'll give examples of this more concretely,
00:07:57.480 | on the wisdom of the individual is, for example,
00:08:01.120 | operating an autonomous vehicle,
00:08:02.920 | the supervision of that autonomous vehicle,
00:08:05.040 | a single driver, is tasked with supervising
00:08:07.680 | the decisions of that AI system.
00:08:09.820 | That's a critical step for a learning based system
00:08:12.840 | that's not guaranteed to be safe,
00:08:15.520 | that's not guaranteed to be explainable.
00:08:18.040 | And the subjective side of that,
00:08:22.480 | where the crowd intelligence is required,
00:08:26.200 | where a single person is not able to make the call,
00:08:26.200 | these are, again, ethical questions
00:08:28.000 | about the operation of autonomous systems.
00:08:30.160 | The supervision of autonomous vehicles,
00:08:33.200 | the supervision of systems in the medical diagnosis,
00:08:37.920 | in medicine in general, and this is AI
00:08:42.920 | operating in the real world, making ethical decisions
00:08:46.820 | that are fundamentally difficult decisions
00:08:51.240 | for humans to make, and that's where
00:08:52.600 | the crowd intelligence needs to come in.
00:08:55.360 | And so we have to transform the machine learning problem
00:08:58.200 | by integrating the human being.
00:09:01.260 | First up top in the training process, on the left,
00:09:04.400 | that's the usual machine learning formulation
00:09:07.000 | of a human being doing brute force annotation
00:09:10.680 | of some kind of data set, cats and dogs and ImageNet,
00:09:13.780 | segmentation data set in cityscapes,
00:09:17.440 | video action recognition in the YouTube data set.
00:09:22.440 | Given the data set, humans put in a lot of expensive labor
00:09:25.820 | to annotate what's going on in that data,
00:09:27.960 | and then the machine learns.
00:09:30.880 | The flip side of that, the machine teaching side,
00:09:33.180 | the human-centered side of that, is the machine instead,
00:09:36.820 | the learning model, the learning algorithm,
00:09:39.000 | talking about mostly neural networks here,
00:09:41.360 | is tasked with providing, selecting the subset,
00:09:48.240 | the small, sparse subsets of the data
00:09:52.620 | that are most useful for the human to annotate.
00:09:55.460 | So instead of the human doing the brute force task first
00:09:59.040 | of the annotation, the machine queries the human.
00:10:02.200 | This is the field called machine teaching.
00:10:04.840 | The machine queries the human with questions,
00:10:07.240 | and therefore, the task is,
00:10:09.320 | and this is wide open research field,
00:10:11.260 | the task is to minimize in several orders of magnitude
00:10:17.360 | the amount of data that needs to be annotated.
00:10:19.600 | In the real world operation side,
00:10:22.760 | the integration of the human looks like this.
00:10:24.800 | On the left, the machine, now trained
00:10:28.160 | with the learning model, makes decisions,
00:10:30.320 | and the human living in this world
00:10:33.320 | receives the service provided by the machine,
00:10:36.520 | whether that's medical diagnosis,
00:10:38.100 | whether that's an autonomous vehicle,
00:10:40.120 | whether that's a system that determines
00:10:42.980 | whether you get a loan or not, so on.
00:10:45.800 | With the human-centered version of that,
00:10:50.160 | the machine makes a decision,
00:10:52.240 | but it's able to provide a degree of uncertainty.
00:10:57.880 | It's one of the big requirements,
00:10:59.580 | to be able to specify a degree of uncertainty
00:11:01.720 | of that decision such that when uncertainty
00:11:04.240 | is above a certain threshold, human supervision is sought.
00:11:08.280 | And again, in that decision,
00:11:10.520 | whether that's a costly decision financially
00:11:13.040 | or a costly decision in terms of human life,
00:11:15.120 | human supervision is sought.
00:11:17.160 | And the service is received by the human,
00:11:20.080 | by the very same humans that are providing the supervision,
00:11:23.040 | or another set of humans.
00:11:25.360 | But ultimately, the decision is overseen
00:11:29.360 | by human beings.
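
As a rough sketch of that routing logic (the `predict_with_uncertainty` interface and the threshold value are placeholders, not a real system):

```python
def decide_or_defer(model, x, threshold=0.2):
    """Return the model's decision, or defer to a human supervisor
    when the model's own uncertainty is too high."""
    prediction, uncertainty = model.predict_with_uncertainty(x)
    if uncertainty > threshold:
        # Costly or uncertain decisions get routed to a human.
        return {"decision": None, "needs_human": True}
    return {"decision": prediction, "needs_human": False}
```
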
00:11:31.760 | This is what I believe is going to be
00:11:34.880 | the defining mode of operation for AI systems
00:11:37.720 | in the 21st century, is we won't be able to,
00:11:40.480 | as much as we'd like, to create perfect AI systems
00:11:44.320 | that escape the need to work together
00:11:49.000 | with human beings at every step.
00:11:52.920 | There are five areas of research,
00:11:55.680 | grand challenges here, that define human-centered AI.
00:12:01.080 | I'll focus on a few today,
00:12:03.680 | and focus on one very much so.
00:12:06.720 | And even with that high degree of pruning,
00:12:11.200 | we have 120 slides, so I'll skip around.
00:12:14.040 | But, on the human-centered AI during the learning phase,
00:12:22.120 | there are the methods, the research arm of machine teaching.
00:12:25.640 | How do we select, how do we improve supervised learning?
00:12:28.760 | As opposed to needing 10,000, 100,000, a million examples,
00:12:33.280 | how do we reduce that, where the algorithm queries
00:12:36.560 | only the essential elements, and is able to learn effectively
00:12:39.880 | from very little information, from very few samples?
00:12:43.240 | Just like we do when we're students,
00:12:45.000 | when we learn fundamental aspects of math,
00:12:48.160 | language, and so on, we just need a few examples.
00:12:51.600 | But those examples are critical to our understanding.
00:12:54.200 | And the second part of that is the reward engineering.
00:12:59.480 | That during a learning process,
00:13:01.040 | injecting the human being into the definition
00:13:04.040 | of the loss function, of what's good, what's bad.
00:13:06.960 | Systems that have to operate in the real world
00:13:11.560 | have to understand what our society deems as good and bad.
00:13:16.560 | And we're not always good at injecting that
00:13:19.800 | at the very beginning.
00:13:20.920 | There has to be a continuous process
00:13:23.160 | of adjusting the rewards, of reward re-engineering
00:13:26.560 | by humans, so that we can encode human values
00:13:29.640 | into the learning process.
00:13:31.240 | Now, on the second part, on the human-centered AI
00:13:34.200 | during real-world operation,
00:13:36.240 | when the system's actually trained,
00:13:38.400 | there is the interactive element
00:13:42.080 | of robots and humans working together.
00:13:44.560 | Then there's the part I'll focus on quite a bit today,
00:13:47.840 | because there's been quite a lot of development
00:13:50.640 | and progress on the deep learning side,
00:13:53.000 | which is human sensing: algorithms
00:13:56.000 | that understand the human being.
00:13:58.280 | Algorithms that, from taking raw information,
00:14:02.120 | whether that's video, audio, text,
00:14:04.560 | begin to get a context, a measure
00:14:08.840 | of the state of the human being in the short term
00:14:10.640 | and the long term over time,
00:14:12.040 | the temporal understanding
00:14:15.200 | and the instantaneous understanding.
00:14:17.760 | Then there is the interaction aspect.
00:14:21.160 | So once you understand the human,
00:14:22.960 | the perception problem, you have to interact with them
00:14:26.280 | and interact in such a way that it's continuous,
00:14:29.000 | collaborative, and a rich, meaningful experience.
00:14:32.360 | We're in the very early days of creating anything
00:14:37.440 | like rich, meaningful experiences with AI systems,
00:14:41.200 | especially learning-based AI systems.
00:14:44.880 | And the safety, in the real world operation,
00:14:48.360 | safety, ethics, unrolling the results
00:14:52.120 | of the engineered rewards that were in place
00:14:55.360 | during the learning process, now come to fruition.
00:14:59.600 | And we need to make sure that the trained model
00:15:04.600 | does not result in things that are highly detrimental,
00:15:10.920 | catastrophic to our safety,
00:15:13.560 | or highly detrimental to what we deem as good
00:15:16.920 | and bad in society, of discrimination,
00:15:19.600 | of ethical considerations, and all those kinds of things.
00:15:22.920 | The gray area, the line we all walk as a society
00:15:27.240 | in the crowd intelligence,
00:15:28.640 | we have to provide bounds on AI systems.
00:15:32.160 | And there's an entire group of work,
00:15:34.600 | and I'll mention what we're doing in that area.
00:15:37.880 | So first, on the machine teaching side,
00:15:42.040 | and the efficient supervised learning,
00:15:44.000 | I'd like to sort of do one slide on each of these
00:15:46.800 | to kind of give you an idea,
00:15:49.400 | and do two things for each area,
00:15:54.320 | that we will elaborate in future lectures on,
00:15:56.960 | and some of it I'll elaborate today.
00:15:59.440 | First, the near-term directions of research,
00:16:03.020 | the things that are within our reach now,
00:16:05.760 | and a sort of thought experiment, a grand challenge,
00:16:10.000 | that if we can do it, that'll be damn impressive.
00:16:14.000 | That will be a definition of real progress in this area.
00:16:17.600 | So near-term directions of research for machine teaching,
00:16:21.480 | for improved supervised learning,
00:16:22.900 | integrating the human into the annotation process,
00:16:25.520 | is instead of annotating brute force,
00:16:28.240 | is annotate by asking the human questions.
00:16:30.880 | So we have to transform the way we do annotations,
00:16:34.840 | where the process of annotation is not defining the dataset,
00:16:39.680 | and then you go through the entire dataset,
00:16:41.960 | it's a machine teaching system that queries the user
00:16:48.020 | with questions to annotate.
00:16:48.020 | And on the algorithm side, active learning,
00:16:52.560 | these are all sort of areas of work
00:16:55.120 | where we can be more clever about the way we use data,
00:16:58.600 | select data on which to train.
00:17:00.560 | So active learning is actively selecting
00:17:03.520 | during the training process,
00:17:04.780 | which part of the data to train on, and annotate.
00:17:08.720 | Data augmentation is taking things
00:17:11.360 | that have been supervised by a human,
00:17:13.200 | and expanding them, modifying the data,
00:17:16.240 | warping the data in interesting ways such that it expands,
00:17:20.040 | it multiplies the human effort that was injected
00:17:23.920 | into helping understand what's in the data.
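
For instance, a minimal augmentation pipeline in torchvision might look like this (the particular transforms are illustrative): each human-labeled image is randomly warped into many training variants, so one annotation does the work of many.

```python
from torchvision import transforms

# Each labeled image yields many randomized variants at training time,
# multiplying the human annotation effort without any new labels.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```
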
00:17:26.920 | The one-shot learning, zero-shot learning,
00:17:29.080 | and transfer learning are all in that category.
00:17:31.160 | And self-play is in the reinforcement learning area
00:17:34.920 | where the system constructs a model of the world,
00:17:39.360 | and goes along alone in a room,
00:17:42.120 | and plays with that model to try to figure out
00:17:44.660 | the different constraints of the model,
00:17:46.160 | how do you achieve good things there.
00:17:48.400 | An example grand challenge here
00:17:51.680 | that would define serious progress in the field
00:17:55.080 | is if we take ImageNet or COCO,
00:17:57.840 | the ImageNet challenge or COCO object detection challenge,
00:18:01.460 | and training only on a totally different kind of data,
00:18:06.460 | be able to achieve state-of-the-art results.
00:18:10.800 | So training only on Wikipedia,
00:18:14.320 | with the text and images that are there on Wikipedia,
00:18:16.780 | be able to perform object detection
00:18:19.580 | on the state-of-the-art benchmark of COCO.
00:18:22.780 | COCO is a data set of different objects
00:18:25.140 | with rich annotation of the localization of the objects.
00:18:28.740 | That, I believe, is exactly the kind of challenge
00:18:31.980 | where all the problems in transfer learning
00:18:35.540 | and efficient data annotation, machine teaching,
00:18:38.620 | have to be solved to achieve it.
00:18:40.600 | Another challenge you can think of,
00:18:44.500 | if we can even just simplify it more,
00:18:47.060 | is to achieve a 0.3% error on MNIST,
00:18:51.620 | that's the handwritten recognition task
00:18:54.080 | that everybody always provides as an example.
00:18:56.180 | So achieve a very good accuracy,
00:19:00.340 | state-of-the-art accuracy,
00:19:02.500 | by training only on a single example of a digit,
00:19:06.100 | as opposed to training on thousands,
00:19:08.020 | training on one example.
00:19:09.640 | That's something that most of us humans can do,
00:19:12.220 | given one example of a new language
00:19:16.420 | you haven't seen before for each character,
00:19:19.940 | after studying them for a little bit,
00:19:22.000 | be able to now classify future characters
00:19:24.620 | at high accuracy.
00:19:25.700 | The second part of the learning process
00:19:32.980 | where the human needs to be injected,
00:19:34.380 | and the near-term directions of research there,
00:19:37.380 | is the reward engineering,
00:19:39.580 | and the tuning of those,
00:19:40.540 | continuous tuning of those rewards by a human being.
00:19:43.180 | So if OpenAI is doing quite a bit of work here,
00:19:50.500 | here's a game played by human and AI,
00:19:53.500 | and it's really my favorite example of this.
00:19:56.180 | On the left, human is controlling a boat
00:19:58.500 | that's finishing a race.
00:19:59.600 | On the right is a RL agent,
00:20:02.140 | reinforcement learning agent,
00:20:03.300 | that's controlling a boat that's trying to,
00:20:07.100 | not finish a race,
00:20:08.300 | trying to maximize the reward
00:20:12.260 | defined prior to, by, initially by a human being.
00:20:16.500 | And what it finds is that you can get much more reward
00:20:20.700 | by collecting green turbos that appears
00:20:23.460 | close to finishing the race.
00:20:25.260 | It realizes that finishing the race
00:20:26.980 | actually gets in the way of maximizing reward.
00:20:29.660 | And so that's the unintended consequences
00:20:32.140 | of a reward function that was specified previously,
00:20:37.140 | and most human supervisors of this result
00:20:42.060 | would be able to adjust,
00:20:44.380 | to re-engineer the reward function,
00:20:47.180 | to be able to get the robot, the AI system here,
00:20:51.140 | to finish the race.
00:20:52.420 | And that kind of continuous monitoring,
00:20:54.820 | monitoring of the performance of the system
00:20:58.180 | during the training process
00:20:59.580 | is a near-term direction of research
00:21:02.740 | that a few groups, DeepMind, OpenAI, and ourselves, are taking on.
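
To make the re-engineering step concrete, here is a hedged sketch of what the supervisor's adjustment might look like for the boat-race example (the `state` fields and the weights are hypothetical, not OpenAI's actual reward):

```python
# Version 1: the reward the agent ends up hacking.
def reward_v1(state):
    return 10.0 * state.turbos_collected

# Version 2: re-engineered by a human after observing the unintended
# behavior; the weights are the knobs the supervisor keeps re-tuning.
def reward_v2(state):
    return (1.0 * state.turbos_collected     # down-weighted pickup bonus
            + 5.0 * state.track_progress     # progress along the course
            + 100.0 * state.finished_race)   # terminal bonus for finishing
```
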
00:21:07.100 | Example grand challenge is allowing
00:21:12.180 | an AI system to operate in a context
00:21:16.700 | where there's a lot of fuzziness for us humans.
00:21:21.380 | There's a lot of uncertainty, there's a lot of gray area,
00:21:23.620 | there's a lot of challenging aspects
00:21:25.340 | in terms of what is right and what is wrong
00:21:28.740 | that we continually need to improve on.
00:21:31.020 | The example I provide here is one of the least popular things
00:21:35.900 | in the world: the US Congress.
00:21:40.620 | So replacing US Congress,
00:21:42.500 | the body of representatives of the people
00:21:44.980 | of the United States, which makes bills
00:21:47.620 | based on the beliefs of the people,
00:21:49.660 | that sounds a lot like what Netflix does
00:21:52.780 | in recommending what movie you should watch next
00:21:56.180 | in representing what people love to watch.
00:21:58.660 | So that's just the recommender system.
00:22:00.460 | So it makes perfect sense that an AI system
00:22:03.740 | should be able to take on this challenge.
00:22:06.300 | And I see that as a grand challenge,
00:22:08.860 | is replacing some of the fundamental representation
00:22:13.180 | of large crowds of people that make ethical decisions
00:22:17.660 | replaced by a human-centered AI system.
00:22:22.300 | Okay, in real world operation,
00:22:25.420 | the first thing we have to do,
00:22:27.620 | before we have a robot and a human work together,
00:22:30.460 | the first thing is the robot has to perceive the human.
00:22:34.620 | Question.
00:22:36.060 | - The question was:
00:22:38.340 | there's currently Congress,
00:22:41.540 | do you want to change the way Congress works,
00:22:45.060 | make it better, or do you want to just take the system
00:22:47.640 | as it currently is and automate it?
00:22:51.140 | So the idea is take the system as it currently
00:22:56.140 | is supposed to be and automate that.
00:22:59.420 | So an AI system can provide a lot more transparency
00:23:05.180 | of the inputs.
00:23:06.820 | The idea of Congress is that
00:23:08.420 | the only inputs are supposed to be the people
00:23:10.740 | and the beliefs of the people.
00:23:16.100 | And there's rich information there.
00:23:20.840 | So for example, for me,
00:23:24.160 | not saying anything about politics,
00:23:28.080 | but there's certain issues I care a lot about
00:23:29.880 | and certain issues I don't care much about.
00:23:32.720 | And let's put that aside.
00:23:35.560 | And then there's certain issues that I know a lot about
00:23:39.760 | and certain issues I know very little about.
00:23:42.400 | And those don't actually intersect that well.
00:23:46.080 | I'm very opinionated about things
00:23:47.580 | I don't know anything about.
00:23:48.620 | It's very common, all of us are.
00:23:50.780 | So being able to put that representation of me
00:23:55.020 | into a system that would take
00:23:58.860 | our entire nation together, and be able to make bills
00:24:03.860 | that represent the people.
00:24:08.240 | Now the challenge there, it can't be just the training set
00:24:11.060 | and then the system now operates.
00:24:13.580 | AI is running the country.
00:24:15.800 | No, there has to be that human-centered element
00:24:18.000 | where we're constantly supervising,
00:24:19.380 | just like we're, in theory, supposed to be supervising
00:24:22.240 | our congressmen and congresswomen.
00:24:25.520 | Human sensing, the first part,
00:24:28.040 | in order to have an AI system that works with a human being,
00:24:31.800 | the AI system has to perceive, understand
00:24:34.320 | the state of the human being at the very simplest level
00:24:36.920 | and the more complex, temporal, contextual, over time level.
00:24:40.660 | So the near-term directions of research
00:24:42.840 | is purely the perception problem,
00:24:44.640 | where deep learning shines, of taking data,
00:24:48.100 | whether that comes in the visual, audio, text, and so on,
00:24:53.100 | and being able to classify the physical, mental,
00:24:58.040 | social state, social context of the person.
00:25:02.080 | Be able to, everything, and this is what I'll cover
00:25:04.960 | a little bit of today, everything from face detection,
00:25:09.320 | face recognition, emotion recognition,
00:25:12.640 | natural language processing, body pose estimation,
00:25:16.620 | those same recommender systems, speech recognition,
00:25:21.720 | all of those conversions of raw data
00:25:25.320 | that captures something about the human being
00:25:27.240 | into actually meaningful, actionable information.
00:25:29.840 | The grand challenge there is emotion recognition.
00:25:34.840 | You know, there have been a lot of companies and ideas
00:25:38.760 | claiming that we've somehow cracked emotion recognition,
00:25:41.460 | that we are able to determine the mood of a person.
00:25:46.020 | But really, for those who were here last year
00:25:49.160 | with Lisa Feldman Barrett, just,
00:25:52.880 | if you're sort of very honest
00:25:54.840 | and you study emotional intelligence and emotion
00:25:59.300 | and the expression of emotion, it's a fascinating area
00:26:03.000 | and we're not even close to being able
00:26:04.760 | to build perceptual systems that detect emotion.
00:26:07.560 | What we're more so doing is detecting very simple
00:26:12.560 | facial expressions that correspond
00:26:15.180 | to our storybook versions of emotion, smiling,
00:26:20.000 | crying, like frowning in a caricatured way.
00:26:23.460 | So if you build a system that has a high accuracy
00:26:26.520 | of doing real emotion recognition,
00:26:29.560 | you can think of it as stated here,
00:26:33.080 | an AI system that classifies,
00:26:35.720 | binary classification problem, 95% accuracy
00:26:39.080 | of whether you wanna be left alone or not.
00:26:41.680 | And being able to do that after collecting data for 30 days.
00:26:46.040 | That I see as a really clean formulation
00:26:48.420 | of exactly the kind of human understanding
00:26:53.420 | we need to be able to build into our learning models.
00:26:57.440 | And we're very far away from that,
00:26:59.280 | especially the long temporal aspect of that,
00:27:02.520 | of being able to integrate data over a long period of time.
00:27:06.140 | Then the second part of human robot interaction
00:27:08.980 | in the real world operation is the experience.
00:27:12.520 | This is where we're now just beginning to consider
00:27:15.860 | that interactive experience
00:27:17.460 | of how do we have a rich fulfilling experience.
00:27:20.380 | We have autonomous vehicles, for example,
00:27:24.140 | semi-autonomous vehicles, whether that's Tesla,
00:27:26.820 | Volvo, Super Cruise with the Cadillac.
00:27:29.460 | There's a bunch of systems that have now
00:27:31.460 | greater and greater degrees of automation in the car
00:27:33.860 | and we get to have the human interact with that AI system
00:27:36.860 | and trying to figure out how do we have
00:27:39.980 | a rich fulfilling experience.
00:27:43.220 | Currently, in the Volvo system,
00:27:47.260 | that experience is more limited.
00:27:49.600 | There's a little icon.
00:27:51.220 | It's more kind of traditional driving situation.
00:27:54.420 | In the Tesla, you have a much bigger display
00:27:57.020 | about what's going on.
00:27:58.540 | In the Cadillac Super Cruise system,
00:28:03.140 | there's a camera looking at your eyes
00:28:07.500 | determining if you're awake or not,
00:28:09.940 | paying attention or not.
00:28:11.140 | And that, there's like an experience there
00:28:13.300 | that we're trying to create.
00:28:15.260 | And in the Tesla case, the miles are racking up.
00:28:19.460 | We have real data.
00:28:20.860 | Here at MIT, we're studying this exact interaction.
00:28:23.580 | There's now over a billion miles driven in the Tesla.
00:28:26.380 | And the same in the fully autonomous side with Waymo,
00:28:30.260 | they've now reached 10 plus million miles driven autonomously.
00:28:34.140 | And there's a lot of people experimenting with this.
00:28:36.780 | But that's that collaborative interaction
00:28:39.700 | of going back and forth, of being able to,
00:28:41.900 | for the AI system to express its degree of uncertainty
00:28:44.580 | about the environment.
00:28:46.060 | About the AI system being able to express
00:28:48.820 | when it needs help and not.
00:28:50.940 | Be able to communicate what are its limitations
00:28:53.500 | and capabilities and so on.
00:28:55.620 | Trade off control.
00:28:57.220 | Be able to seek human supervision.
00:28:59.000 | There's a dance there that's really,
00:29:01.620 | that takes into consideration everything
00:29:03.860 | from the neurobiological research to psychology
00:29:08.860 | to deep learning, to the pure robotics,
00:29:14.580 | HRI, human-robot interaction aspects.
00:29:17.940 | One grand challenge would be,
00:29:19.900 | Tesla's driven one billion miles now under autopilot,
00:29:23.380 | under the semi-autonomous mode.
00:29:25.300 | The grand challenge here is when we start getting
00:29:27.980 | to the kind of mileage that we see
00:29:30.560 | in the United States every year,
00:29:32.140 | you start getting into the hundreds of billions
00:29:34.500 | of miles driven semi-autonomously.
00:29:36.340 | We get to see teenagers, 16, 17, 18,
00:29:39.700 | using these systems for the first time.
00:29:41.620 | We get to see older folks,
00:29:43.140 | folks who don't necessarily drive
00:29:46.600 | or use any kind of AI in their lives
00:29:48.780 | get to use these systems.
00:29:50.060 | We start to explore that aspect.
00:29:51.820 | That's the real challenge.
00:29:53.480 | And of course, the old Turing test,
00:29:57.980 | now reimagined by Alexa,
00:30:00.440 | with the Alexa Prize challenge of Social Bot,
00:30:05.200 | is creating natural language.
00:30:07.680 | It's such a beautiful thing to explore
00:30:09.540 | human-robot interaction with,
00:30:11.400 | both on the audio side and just the text side:
00:30:15.520 | passing the Turing test.
00:30:18.040 | That's a true grand challenge in a real way,
00:30:20.640 | where you wanna have a conversation with a robot
00:30:23.360 | for prolonged periods of time,
00:30:25.120 | maybe more than even some of your other friends.
00:30:28.360 | And on the other side of friends is the risk,
00:30:33.360 | the potential catastrophic risk
00:30:35.880 | when you have an AI system that's learning from data.
00:30:39.220 | The near-term directions of research
00:30:41.000 | is purely the human supervision of AI decisions
00:30:44.200 | in terms of safety and ethics.
00:30:46.120 | There's a lot of systems, like with cars,
00:30:48.920 | or medical diagnosis and so on,
00:30:51.480 | where there's some life-critical, safety-critical aspect
00:30:54.480 | that we want to be able to supervise the safety of that.
00:30:57.080 | And there's ethical decisions
00:30:59.080 | in terms of who gets alone or not,
00:31:02.280 | who gets a certain criminal penalty or not.
00:31:05.600 | Any degree to which AI systems are incorporated into that,
00:31:09.100 | you have to consider ethical questions.
00:31:11.240 | And even just the crude,
00:31:13.140 | the low-level perception systems, like face recognition,
00:31:18.140 | you wanna make sure that your face recognition systems
00:31:21.280 | are not discriminating based on color or gender or age
00:31:23.920 | and so on.
00:31:24.840 | You wanna make sure that
00:31:26.160 | at that basic fundamental level of ethics,
00:31:30.940 | the systems are trained in a way
00:31:33.760 | they maintain our human values,
00:31:36.120 | or the better angels of our nature,
00:31:39.840 | the better sides of our values,
00:31:42.040 | some of the brighter aspects of our values.
00:31:44.800 | And the other thing is, in terms of just maintaining values,
00:31:49.360 | that's the normal,
00:31:51.480 | that's looking at the mean of the distribution.
00:31:53.940 | But we also want to control the outliers
00:31:57.420 | from the AI systems not to do anything catastrophic.
00:32:01.280 | So the unintended consequences,
00:32:03.120 | when something happens that you didn't anticipate,
00:32:06.160 | you wanna be able to put boundaries on that.
00:32:08.580 | And the grand challenge there,
00:32:11.360 | really, it all boils down to the ability of an AI system
00:32:15.080 | to say that it's uncertain about something.
00:32:17.780 | And that measure of uncertainty has to be good.
00:32:22.740 | It has to be able to make a prediction
00:32:24.800 | always accompanied with uncertainty,
00:32:28.080 | even on things it hasn't seen before.
00:32:30.160 | That's the real challenge,
00:32:31.940 | to be able to be trained on cats and dogs
00:32:36.400 | and then seeing a giraffe
00:32:38.040 | and saying, "I'm not sure what that is."
00:32:41.960 | We're quite far away from that,
00:32:45.160 | 'cause right now, it'll probably confidently say it's a dog,
00:32:48.620 | depending on the giraffe.
00:32:49.860 | But we want to be able to have an extremely high accuracy
00:32:55.580 | in the ability of AI systems
00:32:57.220 | to determine their own uncertainty,
00:32:58.620 | to know what they don't know.
00:33:00.140 | Because from that comes the supervision.
00:33:03.540 | From that comes the ability to stop
00:33:05.980 | in situations it's uncertain about, to prevent catastrophic events.
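
One common approximation of that ability, sketched here (Monte Carlo dropout is one technique among several, offered as an illustration rather than something proposed in the lecture): keep dropout active at test time and read the spread across stochastic forward passes as uncertainty.

```python
import torch

def mc_dropout_predict(model, x, n_samples=20):
    """Monte Carlo dropout: run stochastic forward passes and treat
    the disagreement between them as a rough uncertainty estimate."""
    model.train()  # keeps dropout layers active at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    mean_probs = probs.mean(dim=0)
    uncertainty = probs.var(dim=0).sum(dim=-1)  # crude dispersion measure
    return mean_probs, uncertainty
```
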
00:33:09.340 | The first aspect of real-world operation
00:33:12.620 | is understanding the human.
00:33:14.080 | One of the places where deep learning has really shined
00:33:17.980 | is the perception problem.
00:33:19.900 | It all begins at the ability to look at raw data
00:33:22.860 | and convert that into meaningful information.
00:33:25.220 | That's really the understanding the human comes in.
00:33:28.100 | Not the kind of understanding
00:33:29.540 | that when you're in a relationship with somebody,
00:33:31.620 | when you're friends with somebody,
00:33:32.940 | over a long period of time,
00:33:34.580 | you gain an understanding of their quirks,
00:33:37.140 | limitations, capabilities, so on.
00:33:39.220 | That's really fascinating.
00:33:40.940 | But the first step is just to be able to,
00:33:43.740 | when you see them, recognize who they are,
00:33:46.020 | what's on their mind,
00:33:48.200 | what's their body language,
00:33:52.860 | what are they saying with their mouth.
00:33:55.700 | All those basic raw perception tasks,
00:33:58.180 | that's where deep learning really shines.
00:33:59.660 | I'd like to cover the state of the art
00:34:01.860 | in those various perception tasks.
00:34:05.980 | So first, face recognition.
00:34:07.760 | Now there's a full slide presentation with this,
00:34:11.700 | and I'm skipping around.
00:34:13.240 | The full slide presentation has the following structure
00:34:15.680 | for each of these topics.
00:34:17.920 | It has the motivation, description, the excitement,
00:34:21.920 | the worry, the future impact is the first part.
00:34:24.900 | And then there's five papers.
00:34:26.740 | One defining the quote unquote old school seminal work
00:34:29.920 | that opened the field.
00:34:31.320 | Then the early progress in the field.
00:34:33.780 | Paper three is the recent breakthrough,
00:34:38.160 | often associated with deep learning.
00:34:40.220 | Paper four is the current state of the art.
00:34:42.300 | And paper five is the thing
00:34:43.860 | that defines the future direction.
00:34:46.040 | The possible set of things that define the future direction.
00:34:49.020 | And then the open problems in the field,
00:34:52.120 | and where the future research is very much needed.
00:34:55.480 | That's kind of the structure of every topic
00:34:58.140 | I'll cover here as quickly as possible.
00:35:00.640 | Face recognition.
00:35:04.720 | So what is it?
00:35:05.700 | It's the first thing, you know,
00:35:08.700 | the face contains so much rich information
00:35:12.760 | about the state of the human being.
00:35:15.060 | So understanding the human being really starts with the face
00:35:18.180 | and detecting the face is the first step.
00:35:21.020 | Detecting the body,
00:35:22.580 | and then that there's a head on top of that body,
00:35:25.000 | that's the first step.
00:35:26.180 | And then there is the task of face recognition,
00:35:29.320 | which has been an exceptionally active area of research
00:35:32.500 | because it has a lot of applications.
00:35:34.640 | And through that research,
00:35:36.120 | we're able to now study a lot of aspects,
00:35:39.140 | how we perform perception on the face.
00:35:41.500 | So recognition, purely stated,
00:35:44.440 | is recognizing the identity of a human face.
00:35:47.980 | Who is this?
00:35:49.120 | Detection is just detecting a face.
00:35:54.020 | Now, recognition means there's a database of identities.
00:36:00.420 | What is it?
00:36:01.420 | Seven billion of them on earth.
00:36:02.920 | And you're trying to determine
00:36:05.180 | which of them it is,
00:36:07.260 | which of the seven billion it is,
00:36:09.200 | or whatever the database is.
00:36:11.760 | The face verification problem
00:36:14.900 | is something that your phone uses
00:36:17.100 | when you unlock it with your face.
00:36:19.260 | It's saying, is it you or not?
00:36:21.780 | Is it Lex or somebody else?
00:36:23.940 | It's a database of two,
00:36:25.780 | one person versus everybody else.
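
In embedding terms, verification reduces to a distance check against the single enrolled face; a minimal sketch (`embed` stands in for a pretrained face-embedding network, and the threshold would be tuned on a validation set):

```python
import numpy as np

def verify(embed, enrolled_image, probe_image, threshold=1.0):
    """Face verification: same identity if the embedding distance
    between the enrolled face and the probe falls below a threshold."""
    d = np.linalg.norm(embed(enrolled_image) - embed(probe_image))
    return d < threshold
```
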
00:36:29.640 | There's a lot of applications here, obviously,
00:36:33.060 | from identification to all the security aspects
00:36:36.980 | of using the face as a sort of fingerprint
00:36:41.860 | of your identity in all the interactive elements
00:36:44.580 | of AI systems, software-based systems in this world.
00:36:48.240 | Okay, so why is it hard?
00:36:50.400 | So all the usual computer vision problems come in.
00:36:52.620 | Lighting variation, pose variation.
00:36:55.060 | That's just, computer vision is really hard.
00:36:57.100 | It's just you get these raw numbers
00:36:58.580 | and you have to infer so many things
00:37:00.740 | that us humans take for granted.
00:37:04.540 | So the basic computer vision stuff.
00:37:06.540 | But there's stuff on top of that.
00:37:08.320 | So with faces,
00:37:11.940 | it's like cats versus dogs.
00:37:13.780 | There's thousands of breeds of dog
00:37:15.580 | and thousands of breeds of cats.
00:37:17.380 | In that same way,
00:37:19.060 | faces can look very similar to each other.
00:37:21.820 | So these two classes that you're trying to separate
00:37:24.300 | can be very, very, very close together and intermingle.
00:37:29.800 | Now, there's a lot of face data available.
00:37:33.320 | because of the applications,
00:37:35.020 | because of the financial benefits of such data sets,
00:37:38.780 | but for any one individual,
00:37:40.460 | unless you're Brad Pitt or Angelina Jolie or celebrity,
00:37:43.580 | there's not many samples of the data available.
00:37:46.580 | So the individuals based on which the classification
00:37:49.420 | is to be made, there's often not very much data.
00:37:52.120 | Then there is a lot of variation.
00:37:56.000 | So, in performing the face recognition task,
00:37:59.540 | you have to be invariant to all the hairstyles,
00:38:02.300 | all the ways you change yourself over time,
00:38:05.700 | the weight gain, the weight loss,
00:38:07.700 | the beard you decided to grow,
00:38:10.940 | the glasses you wear sometimes and not others,
00:38:14.340 | the different styles of glasses and so on,
00:38:16.620 | makeup or no makeup.
00:38:18.020 | All of these things, it's still you,
00:38:19.660 | still the same identity.
00:38:21.140 | You have to be able to classify that.
00:38:23.140 | And that kind of accuracy,
00:38:24.580 | especially for security applications,
00:38:26.200 | extremely high, that's required.
00:38:29.300 | The reason it's an exciting area
00:38:33.700 | is there's a lot of possibility,
00:38:35.340 | but there's also a lot of concern, right?
00:38:37.940 | So the future impact, utopia, dystopia,
00:38:41.700 | and the more reasonable middle path here
00:38:44.380 | is face provides a very user-friendly way
00:38:49.380 | of letting your devices recognize you and say hello.
00:38:57.140 | Your voice is certainly one,
00:38:58.480 | but one of the most powerful ones
00:39:00.300 | to really classify at a distance is face.
00:39:04.440 | So what does that mean?
00:39:05.380 | The utopian view, the possibility of the future,
00:39:08.540 | the best possible, brightest possible future.
00:39:11.180 | As you can use your face to, as a passport,
00:39:16.180 | you replace the license,
00:39:17.460 | replace all the security measures we put
00:39:20.440 | from the passwords in our devices
00:39:22.180 | to the credit card and so on,
00:39:24.260 | all of that, Apple pays, it'll be face pay.
00:39:28.580 | You show up, it'll automatically connect
00:39:30.580 | to all your devices, all your banking information, so on.
00:39:34.020 | Obviously, the flip side of that,
00:39:36.060 | just rephrasing that sentence also can be dystopian
00:39:39.900 | because complete violations of privacy,
00:39:44.140 | being watched at any time,
00:39:45.860 | being able to, through your Facebook and social media
00:39:49.820 | and all your devices being able to identify you,
00:39:52.340 | making it impossible for you to sort of hide from society.
00:39:56.780 | The fundamental aspects of privacy,
00:39:58.980 | maintaining privacy that many of us value greatly.
00:40:02.580 | The middle path is really just a useful way
00:40:05.020 | to unlock your phone.
00:40:06.160 | The recent breakthroughs here,
00:40:09.660 | it started with deep face.
00:40:14.660 | The essential idea there is applying deep neural networks
00:40:19.980 | to the task of face recognition.
00:40:23.020 | I mean, with a lot of the breakthroughs here
00:40:24.980 | on the perception side,
00:40:27.300 | we're not covering the old school papers and so on,
00:40:29.860 | and the historical context here,
00:40:34.620 | biggest breakthroughs came with deep learning,
00:40:38.860 | 2006, '07, '08, last 10 years.
00:40:43.860 | So that's the same is true with face recognition.
00:40:48.780 | Deep face was the big first application
00:40:51.580 | that achieved near human performance
00:40:54.540 | on one of the big benchmarks at the time
00:40:57.300 | on the labeled faces in the wild.
00:40:59.740 | So using a very large data set,
00:41:01.340 | being able to form a good representation.
00:41:03.940 | The state of the art,
00:41:05.420 | or at least close to the state of the art is face net.
00:41:10.240 | The key idea there is using those same deep architectures
00:41:13.880 | to now optimize for the representation itself directly.
00:41:18.260 | The notebook we're putting out,
00:41:20.580 | we shared with some of you for the assignment,
00:41:24.020 | describes face recognition, the challenge there,
00:41:27.380 | that it's not like the traditional classification problem.
00:41:30.940 | You have to form an embedding of the face
00:41:35.940 | into a small vector, compressed vector,
00:41:42.140 | such that in that embedding,
00:41:44.340 | faces that are similar to each other,
00:41:46.060 | so identities that are close together,
00:41:48.460 | are close in the Euclidean sense in that embedding,
00:41:52.620 | and people that are very different are far away.
00:41:55.540 | And so you use that embedding to then do the classification.
00:41:58.500 | That's really the only way to deal with data sets
00:42:01.380 | for which you have so little information
00:42:02.940 | on any one individual person.
00:42:04.660 | And so FaceNet optimizes that embedding
00:42:09.080 | in a way that directly optimizes for the Euclidean distance
00:42:13.180 | between non-matching identities.
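
The triplet loss from the FaceNet paper, sketched here in PyTorch (the margin value follows the paper; the rest of the training machinery, like triplet mining, is omitted): pull an anchor toward a positive of the same identity, and push it away from a negative by at least a margin.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss on embedding vectors: enforce
    d(anchor, positive) + margin < d(anchor, negative)."""
    d_pos = F.pairwise_distance(anchor, positive) ** 2
    d_neg = F.pairwise_distance(anchor, negative) ** 2
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```
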
00:42:16.340 | So there's still a lot of excitement
00:42:17.780 | about face recognition.
00:42:18.780 | There's a lot of benchmark competitions
00:42:20.580 | and a lot of people working in this,
00:42:22.340 | and really bigger, badder networks and more data
00:42:26.500 | is really one of the ways to crack this problem.
00:42:29.460 | So, a large public data set with 672,000 identities,
00:42:34.460 | 4.7 million photos, that's in 2017,
00:42:38.660 | and that just keeps scaling up and up and up and up.
00:42:41.660 | Now we have to also be honest here
00:42:43.460 | on the possible future directions of work
00:42:47.380 | in that even though the benchmarks are growing,
00:42:50.900 | that's still a tiny subset of the people in the world.
00:42:53.420 | We're still not quite there to be able to have
00:42:57.420 | the general face recognition applicable
00:42:59.340 | to the entirety of the population,
00:43:01.540 | or a large swath of the population of the world.
00:43:04.700 | So in this topic here, brief coverage,
00:43:09.140 | we're not covering all of the aspects of the face,
00:43:13.020 | especially temporal, that are useful in face recognition
00:43:16.300 | or useful saying a lot of things about the face,
00:43:18.620 | which is the FACS, facts,
00:43:21.780 | the different kinds of facial expressions
00:43:23.700 | that can then be used to infer emotion and so on.
00:43:26.780 | Raised eyebrows and all those kinds of things
00:43:30.740 | that can provide rich information
00:43:32.140 | for recognizing and interpreting the face,
00:43:34.500 | and the different other modalities,
00:43:36.420 | including 3D face recognition, we're not covering.
00:43:39.580 | There's a lot of exciting areas there.
00:43:41.220 | We're just looking at the pure formulation
00:43:43.780 | of the face recognition problem
00:43:45.260 | of looking at a 2D single image.
00:43:49.600 | The open problems here: first,
00:43:55.200 | something not often stated and often misinterpreted by people
00:44:01.060 | is that most of these methods of face recognition
00:44:05.260 | start with assuming that you have a bounding box
00:44:08.940 | around the face.
00:44:15.740 | Now, these methods are assuming a frontal
00:44:18.460 | or near-frontal view of the face.
00:44:23.260 | But you can do recognition in all kinds of poses.
00:44:23.260 | And it's very interesting to think that recognition,
00:44:27.900 | the way we recognize our friends and colleagues,
00:44:30.740 | parents and children, is often using
00:44:33.940 | a lot of cues and context information
00:44:35.460 | that's beyond just the pure frontal view of the face.
00:44:38.420 | We can do pretty well on profile views,
00:44:40.780 | from body language, and so on.
00:44:43.020 | So all those things, that's open in the field,
00:44:45.860 | how we incorporate that into face recognition.
00:44:48.440 | Then the black box side is problematic for both bias
00:44:53.140 | and just being able to understand
00:44:54.480 | why incorrect decisions are made,
00:44:56.580 | so there's a need to make these face recognition systems more interpretable.
00:45:00.420 | And then finally, privacy.
00:45:04.860 | The ability to collect the kind of data
00:45:07.620 | where the face recognition
00:45:09.460 | would be performing extremely well,
00:45:11.540 | and yet not violating the fundamental aspects
00:45:14.700 | of privacy that we value.
00:45:16.400 | Activity recognition, taking the next step forward here
00:45:24.420 | into the richer temporal context of what people do.
00:45:30.280 | Again, the same structure from recent breakthroughs
00:45:32.700 | to the future direction of work.
00:45:34.300 | What is it?
00:45:37.120 | It's classifying human activity from images or from video.
00:45:41.900 | And why is it important?
00:45:44.200 | Depending on the level of abstraction for the activity,
00:45:51.580 | it provides context for understanding the human.
00:45:54.380 | What are they doing?
00:45:55.220 | Are they playing baseball?
00:45:56.060 | Are they singing?
00:45:56.900 | Are they sleeping?
00:45:57.940 | Are they putting on makeup, knitting, mixing batter, and so on?
00:46:02.940 | Why is it hard?
00:46:05.300 | Again, all the usual problems in image recognition.
00:46:08.620 | The kind of data we're dealing with is just much larger.
00:46:12.960 | The kind of video, the richness of possibilities
00:46:16.500 | that define what activity is, is much larger.
00:46:19.360 | So the complexity is much larger.
00:46:21.680 | It's often difficult to quantify motion
00:46:26.680 | because the fundamental aspect of activity
00:46:30.600 | is the change in the world, is the motion of things.
00:46:33.440 | And then it's difficult to determine, in the dynamics
00:46:37.320 | of the physics of the world, especially from a 2D view,
00:46:40.000 | what's background information, what's noise,
00:46:42.060 | and what's essential to understanding the activity.
00:46:46.120 | And the subjective, ambiguous elements of activity.
00:46:52.420 | When does a particular activity begin?
00:46:56.860 | When does it end?
00:46:58.060 | What are all the gray areas when you're partially engaging
00:47:03.120 | in that activity and so on?
00:47:05.200 | When you start to annotate these things,
00:47:07.040 | when you start to try to do the detection,
00:47:08.560 | it becomes clear that sometimes the activity
00:47:12.280 | is partially undertaken and the beginning
00:47:16.080 | and the end is fuzzy.
00:47:17.240 | Future impact, utopia, dystopia, middle path.
00:47:21.880 | So the impact here comes from being able
00:47:25.760 | to understand the world in time and be able to predict.
00:47:31.260 | The utopian possibility is that the contextual perception
00:47:36.260 | that can occur from here can enrich the experience
00:47:39.260 | between the human and robot.
00:47:40.660 | The dystopian view, the flip side is being able
00:47:45.140 | to understand sort of human activities
00:47:47.560 | can let the robots sever the relationship.
00:47:50.660 | So it can damage the human-robot interaction
00:47:54.860 | to where they just do their own thing.
00:47:57.260 | The middle path is just finding useful information,
00:47:59.580 | massive amounts of data like YouTube.
00:48:01.820 | Now there's a YouTube video data set,
00:48:03.820 | being able to identify what's going on in this video,
00:48:06.160 | being able to infer rich, useful semantic information.
00:48:10.580 | And so what do we do with video?
00:48:12.380 | How do we do perception in video?
00:48:14.200 | Now the recent breakthrough came with deep learning
00:48:17.340 | and C3D, these 3D convolutional neural networks
00:48:20.180 | that take a sequence of images and are able to determine
00:48:23.060 | the action that's going on in an end-to-end way,
00:48:25.300 | what's going on in the video.
00:48:26.900 | That was the recent breakthrough.
00:48:29.500 | The state of the art coming from a slightly,
00:48:32.500 | well, from a different architecture
00:48:34.180 | that takes in two streams.
00:48:35.800 | One is the image RGB data, the other is optical flow data
00:48:40.500 | that's really focusing on the motion in the image.
00:48:42.940 | Those are the two streams that opened the wave
00:48:44.820 | of two-stream networks.
00:48:46.900 | Here, from that paper, showing the different architectures:
00:48:49.700 | on the far right is the two-stream architecture,
00:48:54.340 | and C3D is shown under B here,
00:48:59.020 | taking a sequence of images.
00:49:00.300 | But all these are just different architectures.
00:49:02.340 | And then the first one is LSTMs.
00:49:05.020 | There are different architectures for how you represent,
00:49:07.580 | how you allow a network,
00:49:08.860 | how you allow a learning model,
00:49:11.140 | to capture the dynamics in the data.
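
A toy C3D-flavored sketch (far smaller than the real networks; the layer sizes and the 101-class output are illustrative): 3D convolutions slide over time as well as space, so the filters can respond to motion.

```python
import torch.nn as nn

# Input shape: (batch, channels=3, frames, height, width).
tiny_c3d = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool3d((1, 2, 2)),                    # pool space, keep time
    nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(64, 101),                         # e.g. 101 action classes
)
```
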
00:49:13.280 | The future possibilities have to do,
00:49:16.860 | well, literally with the future,
00:49:18.340 | being able to take single images or sequences of images
00:49:21.860 | and predicting the future.
00:49:23.540 | It's very interesting to think about
00:49:25.340 | in our ability to hallucinate the future,
00:49:28.900 | and generate the future from images,
00:49:32.340 | you start to think about what are the defining qualities
00:49:35.300 | of activities, and in this way, augment data
00:49:37.820 | and be able to train much more accurate
00:49:40.060 | action recognition systems.
00:49:42.140 | A topic not covered is the localization of activity in video.
00:49:46.920 | Action recognition, purely defined, is: I give you a clip
00:49:50.260 | and you tell me what's going on in this clip.
00:49:53.060 | Now, if you take a full YouTube video,
00:49:55.100 | you want to be able to localize,
00:49:56.660 | to find all the times when a particular activity is going on.
00:50:00.580 | It could be multi-label: multiple activities going on
00:50:02.860 | at the same time, beginning and ending asynchronously.
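As a hedged sketch of what localization adds on top of clip-level recognition, here is a simple sliding-window approach in Python. The fixed window, stride, threshold, and the `clip_model` interface are all assumptions for illustration; modern localization methods use learned temporal proposals instead.

```python
# A minimal sketch of temporal activity localization: slide a
# clip-level classifier over a long video, average overlapping
# window scores, and threshold per class to recover segments.
import numpy as np

def localize(video_frames: np.ndarray, clip_model, num_classes: int,
             window: int = 16, stride: int = 8, threshold: float = 0.5):
    """Return {class_id: [(start_frame, end_frame), ...]}.

    clip_model(clip) is assumed to map a (window, H, W, 3) clip to
    per-class sigmoid scores, so multiple labels can be active at once.
    """
    T = len(video_frames)
    scores = np.zeros((T, num_classes))
    counts = np.zeros(T)
    for start in range(0, T - window + 1, stride):
        s = clip_model(video_frames[start:start + window])  # (num_classes,)
        scores[start:start + window] += s
        counts[start:start + window] += 1
    scores /= np.maximum(counts, 1)[:, None]  # average overlapping windows

    segments = {c: [] for c in range(num_classes)}
    for c in range(num_classes):
        active, start = scores[:, c] > threshold, None
        for t in range(T):
            if active[t] and start is None:
                start = t
            elif not active[t] and start is not None:
                segments[c].append((start, t))  # activity began and ended
                start = None
        if start is not None:
            segments[c].append((start, T))
    return segments

# Example with a dummy "model" that always scores class 0 at 0.6:
dummy = lambda clip: np.array([0.6, 0.1])
video = np.zeros((100, 224, 224, 3))
print(localize(video, dummy, num_classes=2)[0])  # one segment covering most frames
```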
00:50:06.480 | And then there is richer, three-dimensional
00:50:11.980 | (or 2D) classification of activity based on human movement:
00:50:16.740 | skeleton-based action recognition
00:50:20.020 | from a Kinect or other 3D sensors,
00:50:22.980 | sensors that provide you more
00:50:25.780 | than just 2D image data.
00:50:30.300 | The open problem is that activity recognition
00:50:35.300 | is more than just the way we move our body,
00:50:39.300 | or, in baseball, a ball in your hand
00:50:42.420 | and hitting it with a baseball bat.
00:50:45.460 | It also has to do with context.
00:50:47.460 | Sitting down, working, looking at something,
00:50:52.460 | picking up an item:
00:50:53.940 | the meaning of those can change profoundly
00:50:56.780 | based on the other objects in the scene
00:50:58.940 | and the activity of other people in the scene.
00:51:00.980 | Being able to work with that kind of context
00:51:03.780 | is a totally open problem.
00:51:05.900 | It requires reducing a very complex real-world context
00:51:10.060 | into something where you can clearly identify an activity.
00:51:14.260 | Body pose estimation is the task of localizing the joints
00:51:21.620 | that form the skeleton of the human body:
00:51:26.180 | inferring, from visual information,
00:51:28.740 | the positions of the different joints.
00:51:30.700 | Along the line of complexity,
00:51:33.460 | it's important to be able to understand body language,
00:51:35.820 | the rich information carried by the body of the human being,
00:51:40.820 | for everything from reading body language to animation
00:51:43.780 | to aiding activity recognition.
00:51:47.260 | And it's just a useful representation of the human body.
00:51:51.820 | If you're analyzing pedestrians,
00:51:53.980 | or in interactive environments, in human-robot interaction,
00:51:57.260 | trying to understand what the heck it is
00:51:59.580 | the human is trying to do,
00:52:01.140 | body pose is really useful.
00:52:03.400 | It's hard because,
00:52:07.700 | when you look at a 2D image projection of the body,
00:52:11.420 | there's a lot going on;
00:52:13.540 | it's a high-dimensional optimization problem,
00:52:16.100 | figuring out how the raw pixels map
00:52:18.740 | to the actual three-dimensional orientation
00:52:21.980 | of the human joints.
00:52:24.340 | Plus the usual computer vision challenges
00:52:26.140 | of pose, lighting, and so on.
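As a hedged illustration of what "localizing the joints" looks like at inference time, here is a minimal sketch of heatmap decoding, the step most deep pose estimators end with. The joint list, heatmap shape, and confidence threshold are assumptions for illustration.

```python
# A minimal sketch of heatmap decoding for pose estimation: the
# network predicts one confidence map per joint, and each joint's 2D
# location is read off as that map's peak.
import numpy as np

JOINTS = ["head", "neck", "r_shoulder", "r_elbow", "r_wrist"]  # assumed subset

def decode_heatmaps(heatmaps: np.ndarray, min_confidence: float = 0.3):
    """heatmaps: (num_joints, H, W) array of per-joint confidence maps.

    Returns {joint_name: (x, y, confidence)} for joints whose peak
    clears the threshold; low-confidence (e.g. occluded) joints drop out.
    """
    poses = {}
    for j, name in enumerate(JOINTS):
        hm = heatmaps[j]
        y, x = np.unravel_index(np.argmax(hm), hm.shape)  # peak location
        conf = float(hm[y, x])
        if conf >= min_confidence:
            poses[name] = (int(x), int(y), conf)
    return poses

# Usage with a fake heatmap stack standing in for a network's output:
fake = np.random.rand(len(JOINTS), 64, 48) * 0.2
fake[3, 40, 20] = 0.9                 # a confident right elbow
print(decode_heatmaps(fake))          # r_elbow near (20, 40); the rest drop out
```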
00:52:28.340 | Future impact:
00:52:31.220 | it's really exciting, for interactive environments,
00:52:34.320 | for a robot to be able to know the position
00:52:37.460 | of the human body with which it's trying to interact.
00:52:39.620 | Whether it's a robot that's trying to get
00:52:42.980 | its favorite human a beer,
00:52:44.540 | or whatever your favorite choice of drink,
00:52:47.500 | it has to be able to find where the hand is
00:52:49.500 | so it can do the hand-off.
00:52:50.780 | Same thing in the car.
00:52:52.180 | You have to determine if the person's hands
00:52:54.740 | are on the steering wheel,
00:52:56.020 | if their head orientation is such
00:52:59.100 | that they're able to physically take control of the vehicle.
00:53:01.420 | That's a really exciting set of possibilities.
00:53:03.860 | And there are applications in sports and CGI
00:53:07.060 | and video games, and in all settings
00:53:09.420 | where the robot and human have to work together.
00:53:11.900 | The dystopian view you can imagine is,
00:53:14.860 | of course, that being able to localize all those joints
00:53:17.780 | means robots that are able to more effectively hurt humans.
00:53:22.060 | And so that's always a huge concern,
00:53:24.780 | and always a dark, dystopian view of a world
00:53:29.220 | with so much AI in it.
00:53:30.560 | Of course, the reality is,
00:53:32.060 | it's just richer, more fulfilling HCI
00:53:34.900 | that takes advantage of not just the face,
00:53:37.900 | the information coming from the face,
00:53:39.200 | but also the body of the human
00:53:42.660 | that the robot is interacting with.
00:53:44.580 | So it started with deep learning being applied
00:53:48.260 | to the body pose estimation problem
00:53:50.580 | in 2014, with DeepPose.
00:53:52.860 | The key idea there is looking at
00:53:54.580 | the holistic human pose estimation problem
00:53:57.300 | of detecting all the different joints
00:53:59.980 | of a single person in an image.
00:54:02.880 | The power of deep learning is that you no longer have to do
00:54:05.260 | handcrafted, expert-engineered features;
00:54:08.340 | it automatically determines a set of features,
00:54:10.840 | and all the parts are detected for you.
00:54:12.560 | So this highly complex problem is all solved with data.
00:54:16.500 | The state of the art, from 2017 and beyond,
00:54:21.420 | there have been a few papers from CMU
00:54:24.260 | along this line, is doing real-time multi-person
00:54:27.460 | 2D pose estimation,
00:54:29.260 | but in a bottom-up way,
00:54:31.500 | where you're detecting individual joints first:
00:54:35.440 | all the knees in the picture,
00:54:36.940 | all the elbows, all the shoulders,
00:54:39.460 | all the wrists, and so on,
00:54:41.420 | and then stitching them together
00:54:42.540 | using part affinity fields
00:54:44.140 | to find the most likely assignment.
00:54:45.660 | So if you find 17 elbows in a picture,
00:54:49.440 | you then have to try to see which elbow
00:54:51.980 | belongs to which person.
00:54:53.980 | That actually turns out to be an extremely powerful way
00:54:57.660 | to detect body pose, especially multi-person,
00:54:59.540 | and especially as a way
00:55:02.620 | of dealing with occlusions.
00:55:05.140 | It's really interesting, and,
00:55:08.320 | because of that,
00:55:09.700 | because of the separation of the detections,
00:55:12.840 | it's able to run in real time,
00:55:14.500 | which is also really exciting.
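Here is a minimal sketch of that bottom-up association step, the "which elbow belongs to which person" problem. Scoring candidate pairs by integrating a part affinity field along the connecting line and then matching greedily follows the spirit of the CMU work, but the field format, sampling count, and threshold here are simplifying assumptions.

```python
# A minimal sketch of part-affinity-field association: joints are
# detected first, each (shoulder, elbow) pair is scored by how well
# the predicted limb-direction field agrees with the segment between
# them, and pairs are matched greedily, best score first.
import numpy as np

def pair_score(paf: np.ndarray, a: tuple, b: tuple, samples: int = 10) -> float:
    """paf: (2, H, W) unit-vector field for one limb type (e.g. upper arm).

    Points are (x, y) and assumed to lie inside the image.
    """
    a, b = np.array(a, float), np.array(b, float)
    direction = b - a
    norm = np.linalg.norm(direction)
    if norm < 1e-6:
        return 0.0
    direction /= norm
    total = 0.0
    for t in np.linspace(0.0, 1.0, samples):
        x, y = (a + t * (b - a)).astype(int)
        total += paf[0, y, x] * direction[0] + paf[1, y, x] * direction[1]
    return total / samples

def greedy_match(shoulders, elbows, paf, threshold: float = 0.5):
    """Assign each elbow to at most one shoulder, best scores first."""
    scored = sorted(
        ((pair_score(paf, s, e), i, j)
         for i, s in enumerate(shoulders) for j, e in enumerate(elbows)),
        reverse=True)
    used_s, used_e, limbs = set(), set(), []
    for score, i, j in scored:
        if score > threshold and i not in used_s and j not in used_e:
            limbs.append((shoulders[i], elbows[j]))
            used_s.add(i); used_e.add(j)
    return limbs

# Usage with a synthetic field pointing straight down (+y):
paf = np.zeros((2, 100, 100)); paf[1] = 1.0
shoulders = [(30, 20), (70, 20)]
elbows = [(30, 50), (70, 50)]
print(greedy_match(shoulders, elbows, paf))  # pairs each shoulder with the elbow below it
```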
00:55:16.000 | A possible future direction is
00:55:19.260 | using much more information,
00:55:22.160 | using deformable models of the human body:
00:55:25.860 | not just a skeleton, but
00:55:28.020 | rich volumetric information, to do the detection,
00:55:32.820 | and then optimizing for the most likely
00:55:35.620 | orientation of the body.
00:55:37.940 | The open problem in the field is the fact that
00:55:42.900 | pose is not a thing that happens in a single image.
00:55:48.140 | Pose is part of human behavior
00:55:51.580 | and part of movement through time.
00:55:52.860 | So here, Monty Python's Ministry of Silly Walks:
00:55:56.700 | people walk in funny ways.
00:55:57.980 | We collect a lot of data on pedestrians,
00:56:01.220 | and I can tell you that people walk in different ways
00:56:03.220 | and position their bodies in different ways.
00:56:05.980 | The temporal aspects of human motion
00:56:09.860 | are, for the most part, not incorporated
00:56:13.580 | in the body pose estimation problem, and they should be.
00:56:16.100 | There are a lot of exciting possibilities
00:56:17.660 | in capturing the temporal dynamics.
00:56:20.780 | There are a lot of awesome slides here
00:56:26.140 | that I'm just skipping through:
00:56:28.620 | speech recognition;
00:56:31.000 | recommender systems, which 2018 was really big for,
00:56:36.000 | for Netflix and OkCupid; AI for President.
00:56:41.560 | Each one of the things that I mentioned briefly today
00:56:47.280 | will have a separate mini lecture.
00:56:50.440 | I taught an entire course on this at CHI last year.
00:56:52.880 | So, deep learning for understanding the human.
00:56:54.760 | It's a topic I'm really excited about,
00:56:56.920 | because understanding the human is really the first step
00:56:59.820 | for a machine to be able to interact
00:57:02.680 | in a rich way with a human being.
00:57:03.600 | And it's also the area where the most near-term impact
00:57:06.680 | can happen: a system able to effectively detect
00:57:09.920 | what a human being is up to, what they're thinking about,
00:57:13.720 | how best to serve them and enrich the experience
00:57:18.600 | of interacting with that human.
00:57:21.600 | Let me jump to AI safety, and then to the interactive
00:57:26.680 | experience between humans and robots, to give examples
00:57:31.300 | of some work in that direction, some research
00:57:33.500 | that I'm really excited about.
00:57:35.460 | So AI safety, at the very basic level,
00:57:39.180 | there's an AI system that's making decisions
00:57:42.380 | where we want human beings to supervise those decisions.
00:57:45.740 | We've done quite a bit of work here at MIT
00:57:48.140 | on that aspect of supervising machines,
00:57:51.300 | with arguing machines.
00:57:52.660 | And OpenAI has done work with safety
00:57:55.400 | by having machines debate each other.
00:58:00.440 | So there's this idea that you can achieve safety
00:58:05.440 | by not giving ultimate power to any one decision maker.
00:58:09.560 | The disagreement that emerges from two AI systems,
00:58:14.560 | or multiple AI systems, having to make decisions
00:58:19.800 | and agree with each other,
00:58:21.540 | allows us to produce a signal of uncertainty,
00:58:25.280 | based on which human supervision can be sought.
00:58:27.840 | Without that, when we have a state-of-the-art
00:58:31.440 | black-box AI system that does something like drive a car,
00:58:34.480 | all we have is a system that just runs,
00:58:37.400 | and we're supposed to have faith
00:58:38.880 | that it's always going to be right.
00:58:40.000 | We don't have any uncertainty signal coming from the system.
00:58:43.960 | So the idea of arguing machines, which we've developed
00:58:48.800 | and been working on, is to have multiple AI systems,
00:58:51.940 | an ensemble of AI systems, where,
00:58:55.020 | when a disagreement is detected,
00:58:56.700 | human supervision is sought.
00:58:58.300 | And the idea there is that when you have a system
00:59:01.260 | like Tesla Autopilot,
00:59:02.820 | and here we've instrumented a Tesla vehicle,
00:59:06.900 | it's telling you nothing
00:59:08.540 | about how uncertain it is
00:59:12.460 | about the decisions it's making.
00:59:14.180 | Once the system is on,
00:59:17.140 | it's steering the car for you,
00:59:19.200 | and in very rare cases, it will just disengage.
00:59:22.280 | But no matter what, it's not showing you
00:59:24.340 | the degree of uncertainty it has about the world around it.
00:59:27.520 | So the way we create that signal of uncertainty
00:59:30.560 | is by adding another, in this case end-to-end, vision system
00:59:34.680 | that's looking at the external environment
00:59:36.120 | and making steering decisions.
00:59:37.360 | And whenever a disagreement
00:59:38.880 | between the two is detected,
00:59:40.300 | that's when human supervision is sought.
00:59:42.760 | In this way,
00:59:46.880 | as shown in the plot there,
00:59:48.720 | we can predict with high accuracy
00:59:52.920 | the times when the driver chose to disengage the system
00:59:56.860 | because they were uncomfortable.
00:59:58.360 | So you're using this mechanism
01:00:00.840 | to detect risky, challenging situations.
01:00:04.520 | It's an idea about how we supervise AI
01:00:09.080 | by having multiple AI systems that are independent,
01:00:12.800 | and through their disagreement
01:00:14.600 | emerges the uncertainty signal.
01:00:17.320 | And just as the OpenAI folks
01:00:19.840 | had systems debate in natural language,
01:00:22.080 | we can apply this in computer vision as well,
01:00:24.760 | taking two networks, ResNet and VGGNet,
01:00:28.480 | independently trained
01:00:30.680 | but on the same training set, ImageNet,
01:00:34.480 | and we can have them argue,
01:00:37.100 | and in the process, significantly improve the accuracy.
01:00:40.840 | So in the case of ResNet as an architecture
01:00:44.880 | and VGGNet as an architecture,
01:00:46.840 | trained on the ImageNet training dataset,
01:00:49.120 | they separately have a certain error:
01:00:53.760 | ResNet has an error of 8%,
01:00:56.320 | VGG16 has an error of 10%.
01:00:58.800 | When we apply the arguing machines framework,
01:01:01.760 | where the disagreement is brought to the human,
01:01:04.480 | that error rate decreases to 2.8%.
01:01:07.880 | Now, this is just the ImageNet challenge,
01:01:10.960 | but if that error meant the loss of human life,
01:01:14.920 | this kind of framework is really powerful
01:01:17.240 | for overseeing the operation of the AI system.
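A minimal sketch of the arguing machines loop for image classification, assuming a recent torchvision with pretrained ResNet-50 and VGG-16 as stand-ins for the pair described in the lecture, and simulating the human annotator with the ground-truth label:

```python
# A minimal sketch of arguing machines for classification: two
# independently trained networks each predict a label, and whenever
# their top-1 predictions disagree, the example is routed to a human.
import torch
from torchvision import models

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).eval()

def arguing_predict(image: torch.Tensor, human_label: int) -> tuple:
    """Return (final_label, asked_human) for one preprocessed image.

    human_label stands in for the human annotator in this sketch.
    """
    with torch.no_grad():
        a = resnet(image.unsqueeze(0)).argmax(dim=1).item()
        b = vgg(image.unsqueeze(0)).argmax(dim=1).item()
    if a == b:
        return a, False       # agreement: trust the ensemble
    return human_label, True  # disagreement: defer to human supervision
```

Under this scheme, the remaining machine error comes only from cases where both networks make the same mistake, which is how two classifiers with roughly 8% and 10% error can fall to a few percent at the cost of occasional human queries.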
01:01:21.480 | Here are just examples where they disagree.
01:01:24.260 | Taking this image from ImageNet,
01:01:27.440 | the ground truth is a wine bottle,
01:01:29.240 | and ResNet's prediction, with 0.93,
01:01:32.720 | 93% confidence, is that it's a paper towel,
01:01:35.800 | while VGGNet says, with 25% confidence, that it's a seatbelt.
01:01:39.440 | These disagreements are then surfaced,
01:01:41.480 | and the fact that they disagree
01:01:45.000 | raises the uncertainty,
01:01:48.880 | human supervision is brought in,
01:01:50.920 | and humans are able to annotate correctly
01:01:53.360 | what's going on in the picture.
01:01:54.980 | Same thing here: the ground truth is a mailbox.
01:01:58.780 | Again, the two architectures disagree.
01:02:01.780 | One says traffic light, the other one says garbage truck.
01:02:04.640 | For an autonomous vehicle,
01:02:06.640 | you can imagine this being problematic:
01:02:10.120 | if it thinks there's a traffic light,
01:02:11.680 | it might stop for this mailbox, that kind of thing.
01:02:14.800 | That's early research in the field
01:02:17.840 | of how, as we have AI systems
01:02:20.240 | that are more and more powerful,
01:02:22.000 | we can also inject human effort
01:02:25.600 | to supervise when it's needed.
01:02:27.560 | The "when it's needed" part,
01:02:28.640 | the uncertainty signal, is the critical thing,
01:02:30.680 | so we have to figure out ways
01:02:31.740 | to create that uncertainty signal.
01:02:34.080 | Then there's the subarea of creating a rich human interaction.
01:02:38.240 | So here, we're doing a lot of testing
01:02:42.960 | with autonomous vehicles.
01:02:44.240 | In the video, I'm tweeting.
01:02:45.200 | We have a human-centered autonomous vehicle
01:02:51.880 | here at MIT that's passing control back and forth
01:02:54.320 | with the human based on the driver's activity.
01:02:56.860 | That's just me explaining the video.
01:03:00.440 | The point is that the driving experience,
01:03:04.600 | the human-robot interaction experience,
01:03:06.880 | should be fun and awesome and enriching to life.
01:03:11.320 | And that's why you would want to use these kinds of systems.
01:03:15.000 | We have a bunch of videos online.
01:03:17.420 | You can check them out,
01:03:18.560 | including a ridiculous one of me playing guitar.
01:03:21.760 | And there's a paper along with this,
01:03:23.280 | describing different principles
01:03:24.640 | of how we have humans and robots work together
01:03:27.040 | in this kind of way.
01:03:29.040 | There are a lot of totally untouched problems in that space.
01:03:33.120 | Most of the robotics community
01:03:34.640 | and the machine learning community approach AI
01:03:36.680 | as a system that we want to make perfect,
01:03:39.580 | and once it's perfect,
01:03:42.480 | we then put it in the real world,
01:03:44.080 | where we humans get to interact with it.
01:03:46.600 | But just like, what is it, Robin Williams in Good Will Hunting,
01:03:51.600 | talking about relationships: nobody's perfect.
01:03:56.620 | The way I foresee it,
01:03:59.360 | AI systems will not be perfect for the next 100 years.
01:04:02.600 | So we have to have humans and AI systems work together
01:04:05.520 | and optimize that problem, solve that problem:
01:04:08.640 | both of us are flawed,
01:04:10.280 | but together there's something enriching to both.
01:04:14.720 | As I mentioned, the videos here will be available online:
01:04:17.600 | the lectures underlying all of deep learning
01:04:19.760 | for understanding the human,
01:04:21.240 | and underlying the five principles here
01:04:23.280 | of human-centered AI.
01:04:24.960 | It's an area of active research here at MIT
01:04:28.800 | and globally, and it's one
01:04:30.520 | that I'm extremely passionate about.
01:04:32.580 | And one of the analogies I think about,
01:04:35.980 | when I think about the success
01:04:38.040 | of artificial intelligence systems,
01:04:40.920 | is parasitism versus symbiosis.
01:04:45.240 | A lot of the way we're training
01:04:47.420 | machine learning algorithms now
01:04:50.560 | is that we inject a lot of human labor,
01:04:54.080 | a lot of really costly human labor,
01:04:56.160 | separately, offline, out of the loop,
01:04:58.880 | in order to improve the learning models
01:05:01.920 | through brute-force annotation.
01:05:03.720 | What I see as success in the future
01:05:06.960 | requires that the learning is done,
01:05:10.920 | that the models improve, in a symbiotic way,
01:05:14.280 | as a side effect of interacting with humans.
01:05:16.800 | This is done a lot now in reinforcement learning,
01:05:18.960 | through game playing and so on.
01:05:20.740 | But the human computation,
01:05:23.600 | the human effort of annotation,
01:05:25.680 | should be something that happens naturally through interaction,
01:05:28.320 | not a costly thing you have to pay for.
01:05:30.640 | Because when it happens naturally,
01:05:33.400 | in a symbiotic way, we can increase scale.
01:05:36.200 | We can scale learning to the degree that's required
01:05:39.260 | to solve some of the real-world problems.
01:05:42.280 | That also requires solving a lot of aspects
01:05:46.000 | of human-robot interaction:
01:05:48.780 | from understanding our own brain,
01:05:51.540 | from the biological and electrical sides of neuroscience,
01:05:55.520 | to the behavioral aspects captured by cognitive science,
01:05:58.520 | psychology, and sociology,
01:06:00.720 | to the mathematical formulations of behavior
01:06:03.040 | in game theory,
01:06:04.200 | to taking that human behavior
01:06:07.320 | and putting it in the real world with engineering systems,
01:06:09.680 | human factors and design.
01:06:11.240 | These are all giant subfields, with conferences and papers,
01:06:14.660 | and all of them need to work together.
01:06:16.360 | Then, on the computer science side,
01:06:18.000 | there's natural language processing,
01:06:20.060 | understanding language;
01:06:21.580 | human-robot interaction
01:06:23.440 | and human-computer interaction,
01:06:24.560 | just the interfaces:
01:06:25.980 | how, and what, does the computer,
01:06:30.420 | the robot, show to you?
01:06:32.080 | Again, entire conferences.
01:06:33.920 | Then there are the exciting aspects of learning from data,
01:06:36.640 | deep learning, and learning to act from data,
01:06:39.960 | reinforcement learning,
01:06:41.120 | deep reinforcement learning.
01:06:42.280 | And then robotics is actually building these things,
01:06:45.620 | building the hardware,
01:06:48.720 | again an entire area, an exciting field of research.
01:06:52.940 | All of them have to work together to create systems
01:06:56.320 | that integrate the human during the learning process
01:06:58.900 | and integrate the human during the operation process.
01:07:01.980 | So, the videos are on deeplearning.mit.edu;
01:07:06.780 | videos and slides are available there,
01:07:08.900 | and code is available there.
01:07:10.940 | So with that, I'd like to thank you very much.
01:07:13.740 | (audience applauding)
01:07:14.980 | (audience cheering)
01:07:17.980 | (upbeat music)