MIT 6.S093: Introduction to Human-Centered Artificial Intelligence (AI)
Chapters
0:00 Introduction to human-centered AI
5:17 Deep Learning with human out of the loop
6:11 Deep Learning with human in the loop
8:55 Integrating the human into training process and real-world operation
11:53 Five areas of research
15:38 Machine teaching
19:27 Reward engineering
22:35 Question about representative government as a recommender system
24:27 Human sensing
27:06 Human-robot interaction experience
30:28 AI safety and ethics
33:10 Deep learning for understanding the human
34:06 Face recognition
45:20 Activity recognition
51:16 Body pose estimation
57:24 AI Safety
62:35 Human-centered autonomy
64:33 Symbiosis with learning-based AI systems
65:42 Interdisciplinary research
00:00:00.000 |
Welcome to Human-Centered Artificial Intelligence. 00:00:15.880 |
in the problems that we've been able to crack 00:00:26.480 |
is the idea that with purely the learning-based approach 00:00:33.480 |
there's certain aspects that are fundamental to our reality 00:00:39.860 |
that we have to integrate, incorporate the human being 00:00:57.160 |
under the idea of human-centered AI in this century 00:01:03.200 |
that have been successful over the past two decades, 00:01:06.200 |
like deep learning, machine learning approaches 00:01:16.240 |
So as opposed to fine-tuned optimization-based models 00:01:23.300 |
more and more we're going to see learning-based methods 00:01:29.480 |
That's the underlying prediction that we're working with. 00:01:33.520 |
Now, if that's the case, the corollary of that, 00:01:56.720 |
That's the deep learning, that's the algorithm, 00:01:58.860 |
the optimization of neural network parameters 00:02:08.440 |
of much of the developments in deep learning. 00:02:29.620 |
Just like when you yourself are learning as a student 00:02:36.220 |
the world and the parents and the teachers around you 00:02:41.220 |
are informing you with very sparse information, 00:02:47.480 |
that is most useful for your learning process. 00:02:49.780 |
The selection of data based on which to learn, 00:02:53.680 |
I believe, is the critical direction of research 00:03:03.120 |
the ones that are able to work in the real world, 00:03:05.960 |
and I'll explain why, and what I'm referring to. 00:03:35.360 |
based on a very small subset of samples from that reality, 00:03:43.320 |
there's always going to be a degree of uncertainty. 00:04:00.960 |
to how guaranteed to be safe in some specific way 00:04:07.880 |
Therefore, we need human supervision of these systems, 00:04:32.440 |
to the satisfaction of us as human supervisors. 00:04:40.800 |
human supervision constantly will be required, 00:04:44.560 |
and the solution to this is a whole set of techniques, 00:04:47.520 |
whole set of ideas that we're putting under the flag 00:04:55.000 |
and the core ideas there is that we need to integrate 00:04:58.660 |
the human being deeply into the annotation process 00:05:09.320 |
so both in the training phase and the testing phase, 00:05:31.180 |
that hopefully generalizes in the real world, 00:05:41.640 |
able to form high-level representations of the raw data 00:05:45.840 |
in a way that it's actually able to do quite well 00:05:52.580 |
but fundamentally, the human is out of the loop, 00:05:57.220 |
First, you build the dataset, annotate the dataset, 00:06:03.960 |
and the real-world operation does not involve the human 00:06:06.960 |
except as the recipient of the service the system provides. 00:06:15.280 |
means that annotation and operation of the system 00:06:35.920 |
the wisdom of the crowd and the wisdom of the individual. 00:06:39.620 |
At the training phase, the first part of that 00:06:46.320 |
We need to significantly improve objective annotation, 00:06:49.440 |
meaning annotation where the human intelligence 00:06:53.160 |
is sufficient to be able to look at a sample and annotate it. 00:07:04.120 |
of determining what's in a particular sample. 00:07:09.180 |
things that are difficult for humans to determine 00:07:15.000 |
as a crowd, we kind of converge on these difficult questions. 00:07:20.000 |
These are questions at a low level of emotion, 00:07:31.640 |
of decisions that an AI system is tasked with making 00:07:38.260 |
that nobody really knows the right answer to. 00:07:40.640 |
And as a crowd, we kind of converge on the right answer. 00:07:46.920 |
Now in the operation, once you train the model, 00:07:54.600 |
and I'll give examples of this more concretely, 00:07:57.480 |
on the wisdom of the individual is, for example, 00:08:09.820 |
That's a critical step for a learning-based system 00:08:24.320 |
where a single person is not able to make it, 00:08:33.200 |
the supervision of systems in the medical diagnosis, 00:08:42.920 |
operating in the real world, making ethical decisions 00:08:55.360 |
And so we have to transform the machine learning problem 00:09:01.260 |
First up top in the training process, on the left, 00:09:04.400 |
that's the usual machine learning formulation 00:09:07.000 |
of a human being doing brute force annotation 00:09:10.680 |
of some kind of data set, cats and dogs and ImageNet, 00:09:17.440 |
video action recognition in the YouTube data set. 00:09:22.440 |
Given the data set, humans put in a lot of expensive labor 00:09:30.880 |
The flip side of that, the machine teaching side, 00:09:33.180 |
the human-centered side of that, is that the machine instead 00:09:41.360 |
is tasked with selecting the subset 00:09:52.620 |
that are most useful for the human to annotate. 00:09:55.460 |
So instead of the human doing the brute force task first 00:09:59.040 |
of the annotation, the machine queries the human. 00:10:04.840 |
The machine queries the human with questions, 00:10:11.260 |
the task is to minimize in several orders of magnitude 00:10:17.360 |
the amount of data that needs to be annotated. 00:10:22.760 |
the integration of the human looks like this. 00:10:33.320 |
receives the service provided by the machine, 00:10:52.240 |
but it's able to provide a degree of uncertainty. 00:10:59.580 |
to be able to specify a degree of uncertainty 00:11:04.240 |
is below a certain threshold, human supervision is sought. 00:11:20.080 |
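That thresholding step can be sketched in a few lines (the function name and the numbers here are illustrative, not taken from any particular system):

```python
# Minimal sketch of uncertainty-gated human supervision: accept the model's
# prediction automatically when it is confident, escalate to a human otherwise.

def predict_with_supervision(probs, threshold=0.9):
    """probs: list of class probabilities from some model.
    Returns ('auto', class_index) when the model is confident enough,
    or ('human', class_index) when human supervision should be sought."""
    confidence = max(probs)
    best = probs.index(confidence)
    route = "auto" if confidence >= threshold else "human"
    return route, best

# A confident prediction is handled automatically...
print(predict_with_supervision([0.02, 0.95, 0.03]))  # -> ('auto', 1)
# ...an uncertain one is escalated to a human supervisor.
print(predict_with_supervision([0.40, 0.35, 0.25]))  # -> ('human', 0)
```

The threshold is exactly the knob the lecture describes: it trades off how often the human is interrupted against how much autonomy the system is granted.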
by the very same humans that are providing the supervision, 00:11:34.880 |
the defining mode of operation for AI systems 00:11:40.480 |
as much as we'd like, to create perfect AI systems 00:11:55.680 |
grand challenges here, that define human-centered AI. 00:12:14.040 |
But, on the human-centered AI during the learning phase, 00:12:22.120 |
there is the methods, the research arm of machine teaching. 00:12:25.640 |
How do we select, how do we improve supervised learning? 00:12:28.760 |
As opposed to needing 10,000, 100,000, a million examples, 00:12:33.280 |
how do we reduce that, where the algorithm queries 00:12:36.560 |
only the essential elements, and is able to learn effectively 00:12:39.880 |
from very little information, from very few samples? 00:12:48.160 |
language, and so on, we just need a few examples. 00:12:51.600 |
But those examples are critical to our understanding. 00:12:54.200 |
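The query step described here, where the machine asks the human to annotate only the most informative samples, can be sketched as least-confidence selection (all names and numbers are illustrative):

```python
# Sketch of the query step in machine teaching / active learning: instead of
# annotating everything, rank unlabeled samples by the current model's
# uncertainty and ask the human to label only the least confident ones.

def select_queries(probabilities, budget=2):
    """probabilities: list of per-sample class-probability lists from the
    current model. Returns indices of the `budget` samples the model is
    least confident about -- the ones worth spending human annotation on."""
    confidences = [max(p) for p in probabilities]
    ranked = sorted(range(len(probabilities)), key=lambda i: confidences[i])
    return ranked[:budget]

pool = [
    [0.98, 0.02],  # model is nearly certain
    [0.55, 0.45],  # model is guessing
    [0.90, 0.10],
    [0.51, 0.49],  # model is guessing
]
print(select_queries(pool))  # -> [3, 1]
```

Iterating this loop (train, query, annotate, retrain) is what lets the essential examples do the work of thousands of brute-force labels.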
And the second part of that is the reward engineering. 00:13:01.040 |
injecting the human being into the definition 00:13:04.040 |
of the loss function, of what's good, what's bad. 00:13:06.960 |
Systems that have to operate in the real world 00:13:11.560 |
have to understand what our society deems as good and bad. 00:13:23.160 |
of adjusting the rewards, of reward re-engineering 00:13:26.560 |
by humans, so that we can encode human values 00:13:31.240 |
Now, on the second part, on the human-centered AI 00:13:44.560 |
That means the part I'll focus on quite a bit today, 00:13:47.840 |
because there's been quite a lot of development 00:13:58.280 |
Algorithms that, from taking raw information, 00:14:08.840 |
of the state of the human being in the short term 00:14:22.960 |
the perception problem, you have to interact with them 00:14:26.280 |
and interact in such a way that it's continuous, 00:14:29.000 |
collaborative, and a rich, meaningful experience. 00:14:32.360 |
We're in the very early days of creating anything 00:14:37.440 |
like rich, meaningful experiences with AI systems, 00:14:55.360 |
during the learning process, now come to fruition. 00:14:59.600 |
And we need to make sure that the trained model 00:15:04.600 |
does not result in things that are highly detrimental, 00:15:13.560 |
or highly detrimental to what we deem as good 00:15:19.600 |
of ethical considerations, and all those kinds of things. 00:15:22.920 |
The gray area, the line we all walk as a society 00:15:34.600 |
and I'll mention what we're doing in that area. 00:15:44.000 |
I'd like to sort of do one slide on each of these 00:15:54.320 |
that we will elaborate in future lectures on, 00:16:05.760 |
and a sort of thought experiment, a grand challenge, 00:16:10.000 |
that if we can do it, that'll be damn impressive. 00:16:14.000 |
That will be a definition of real progress in this area. 00:16:17.600 |
So near-term directions of research for machine teaching, 00:16:22.900 |
integrating the human into the annotation process, 00:16:30.880 |
So we have to transform the way we do annotations, 00:16:34.840 |
where the process of annotation is not defining the dataset, 00:16:41.960 |
it's a machine teaching system that queries the user 00:16:55.120 |
where we can be more clever about the way we use data, 00:17:04.780 |
which part of the data to train on, and annotate. 00:17:16.240 |
warping the data in interesting ways such that it expands, 00:17:20.040 |
it multiplies the human effort that was injected 00:17:29.080 |
and transfer learning are all in that category. 00:17:31.160 |
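A toy illustration of how augmentation multiplies the human effort behind one annotation (the flips below are just the simplest label-preserving warps; real pipelines use crops, rotations, color jitter, and more):

```python
# Minimal data augmentation sketch: each human-annotated sample is warped into
# several variants that keep the same label, multiplying one annotation.

def augment(image):
    """image: 2D list of pixel values. Returns the original plus simple
    label-preserving variants (horizontal flip, vertical flip)."""
    hflip = [row[::-1] for row in image]
    vflip = image[::-1]
    return [image, hflip, vflip]

sample = [[1, 2],
          [3, 4]]
variants = augment(sample)
print(len(variants))  # -> 3: one annotation now trains on three samples
print(variants[1])    # -> [[2, 1], [4, 3]]
```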
And self-play is in the reinforcement learning area 00:17:34.920 |
where the system constructs a model of the world, 00:17:42.120 |
and plays with that model to try to figure out 00:17:51.680 |
that would define serious progress in the field 00:17:57.840 |
the ImageNet challenge or COCO object detection challenge, 00:18:01.460 |
and training only on a totally different kind of data, 00:18:14.320 |
with the text and images that are there on Wikipedia, 00:18:25.140 |
with rich annotation of the localization of the objects. 00:18:31.980 |
that all the problems in the transfer learning 00:18:35.540 |
and efficient data annotation machine teaching 00:18:40.600 |
Another way to, another challenge you can think of, 00:18:54.080 |
that everybody always provides as an example. 00:19:02.500 |
by training only on a single example of a digit, 00:19:09.640 |
That's something that most of us humans can do, 00:19:34.380 |
and the near-term directions of research there, 00:19:40.540 |
continuous tuning of those rewards by a human being. 00:19:43.180 |
So OpenAI is doing quite a bit of work here, 00:20:12.260 |
defined initially by a human being. 00:20:16.500 |
And what it finds is that you can get much more reward 00:20:26.980 |
actually gets in the way of maximizing reward. 00:20:32.140 |
of a reward function that was specified previously, 00:20:47.180 |
to be able to get the robot, the AI system here, 00:21:02.740 |
that a few of us, DeepMind, OpenAI, and ourselves, are taking on. 00:21:16.700 |
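The reward mis-specification problem described here, where maximizing a proxy reward gets in the way of the intended objective, can be shown with a toy episode (all rewards are made up, loosely echoing the boat-race example OpenAI popularized):

```python
# Toy illustration of reward mis-specification: an agent can either finish the
# course (+10, earned once) or loop forever collecting +1 pickups. Under the
# proxy reward, looping is the "optimal" policy -- which is not what we wanted.

def episode_return(policy, steps=20):
    """Return the total proxy reward for a fixed-length episode."""
    total, finished = 0, False
    for _ in range(steps):
        if policy == "finish" and not finished:
            total += 10          # intended objective, earned once
            finished = True
        elif policy == "loop":
            total += 1           # proxy reward, earned every step
    return total

print(episode_return("finish"))  # -> 10
print(episode_return("loop"))    # -> 20: the proxy reward beats the goal
```

Continuous human tuning of the reward, the "reward engineering" the lecture calls for, is one way to close this gap between what we specified and what we meant.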
where there's a lot of fuzziness for us humans. 00:21:21.380 |
There's a lot of uncertainty, there's a lot of gray area, 00:21:31.020 |
Example I provide here is one of the least popular things 00:21:52.780 |
in recommending what movie you should watch next 00:22:08.860 |
is replacing some of the fundamental representation 00:22:13.180 |
of large crowds of people that make ethical decisions 00:22:27.620 |
before we have a robot and a human work together, 00:22:30.460 |
the first thing is the robot has to perceive the human. 00:22:41.540 |
do you want to change the way Congress works, 00:22:45.060 |
make it better, or do you want to just take the system 00:22:51.140 |
So the idea is to take the system as it currently 00:22:59.420 |
So an AI system can provide a lot more transparency 00:23:16.100 |
And there's also rich information there. 00:23:24.160 |
there's, for me, not saying anything about politics, 00:23:28.080 |
but there's certain issues I care a lot about 00:23:35.560 |
And then there's certain issues that I know a lot about 00:23:42.400 |
And those don't actually intersect that well. 00:23:50.780 |
So being able to put that representation of me 00:23:58.860 |
our entire nation together, and be able to make bills 00:24:08.240 |
Now the challenge there, it can't be just the training set 00:24:15.800 |
No, there has to be that human center element 00:24:19.380 |
just like we're, in theory, supposed to be supervising 00:24:28.040 |
in order to have an AI system that works with a human being, 00:24:34.320 |
the state of the human being at the very simplest level 00:24:36.920 |
and the more complex, temporal, contextual, over time level. 00:24:48.100 |
whether that comes in the visual, audio, text, and so on, 00:24:53.100 |
and being able to classify the physical, mental, 00:25:02.080 |
Everything, and this is what I'll cover 00:25:04.960 |
a little bit of today, from face detection, 00:25:12.640 |
natural language processing, body pose estimation, 00:25:16.620 |
those same recommender systems, speech recognition, 00:25:25.320 |
that captures something about the human being 00:25:27.240 |
into actually meaningful, actionable information. 00:25:29.840 |
The grand challenge there is emotion recognition. 00:25:34.840 |
You know, there's been a lot of companies and ideas 00:25:38.760 |
that we've somehow cracked emotion recognition, 00:25:41.460 |
that we are able to determine the mood of a person. 00:25:46.020 |
But really, that's, for those who were here last year 00:25:54.840 |
and you study emotional intelligence and emotion 00:25:59.300 |
and the expression of emotion, it's a fascinating area 00:26:04.760 |
to build perceptual systems that detect emotion. 00:26:07.560 |
What we're more so doing is detecting very simple 00:26:15.180 |
to our storybook versions of emotion, smiling, 00:26:23.460 |
So if you build a system that has a high accuracy 00:26:41.680 |
And being able to do that after collecting data for 30 days. 00:26:53.420 |
we need to be able to build in our learning models. 00:27:02.520 |
of being able to integrate data over a long period of time. 00:27:06.140 |
Then the second part of human robot interaction 00:27:08.980 |
in the real world operation is the experience. 00:27:12.520 |
This is where we're now just beginning to consider 00:27:17.460 |
of how do we have a rich fulfilling experience. 00:27:24.140 |
semi-autonomous vehicles, whether that's Tesla, 00:27:31.460 |
greater and greater degrees of automation in the car 00:27:33.860 |
and we get to have the human interact with that AI system 00:27:51.220 |
It's more kind of traditional driving situation. 00:27:58.540 |
In the Super Cruise, there's a camera looking at your eyes 00:28:15.260 |
And in the Tesla case, the miles are racking up. 00:28:20.860 |
Here at MIT, we're studying this exact interaction. 00:28:23.580 |
There's now over a billion miles driven in the Tesla. 00:28:26.380 |
And the same in the fully autonomous side with Waymo, 00:28:30.260 |
they've now reached 10 plus million miles driven autonomously. 00:28:34.140 |
And there's a lot of people experimenting with this. 00:28:41.900 |
for the AI system to express the degree of uncertainty 00:28:50.940 |
Be able to communicate what are its limitations 00:29:03.860 |
from the neurobiological research to psychology 00:29:19.900 |
Tesla's driven one billion miles now under Autopilot, 00:29:25.300 |
The grand challenge here is when we start getting 00:29:32.140 |
you start getting into the hundreds of billions 00:30:00.440 |
with the Alexa Prize challenge of Social Bot, 00:30:11.400 |
is both on the audio side and just the text side, 00:30:20.640 |
where you wanna have a conversation with a robot 00:30:25.120 |
maybe more than even some of your other friends. 00:30:28.360 |
And on the other side of friends is the risk, 00:30:35.880 |
when you have an AI system that's learning from data. 00:30:41.000 |
is purely the human supervision of AI decisions 00:30:51.480 |
where there's some life-critical, safety-critical aspect 00:30:54.480 |
that we want to be able to supervise the safety of that. 00:31:05.600 |
Any degree to which AI systems are incorporated into that, 00:31:13.140 |
the low-level perception systems, like face recognition, 00:31:18.140 |
you wanna make sure that your face recognition systems 00:31:21.280 |
are not discriminating based on color or gender or age 00:31:44.800 |
And the other thing is, in terms of just maintaining values, 00:31:51.480 |
that's looking at the mean of the distribution. 00:31:57.420 |
from the AI systems not to do anything catastrophic. 00:32:03.120 |
when something happens that you didn't anticipate, 00:32:11.360 |
really, it all boils down to the ability of an AI system 00:32:17.780 |
And that measure of uncertainty has to be good. 00:32:45.160 |
'cause right now, we'll probably confidently say it's a dog, 00:32:49.860 |
But we want to be able to have an extremely high accuracy 00:33:05.980 |
under things that it's uncertain about, catastrophic events. 00:33:14.080 |
One of the places where deep learning has really shined 00:33:19.900 |
It all begins at the ability to look at raw data 00:33:22.860 |
and convert that into meaningful information. 00:33:25.220 |
That's really where understanding the human comes in. 00:33:29.540 |
that when you're in a relationship with somebody, 00:34:07.760 |
Now there's a full slide presentation with this, 00:34:13.240 |
The full slide presentation has the following structure 00:34:17.920 |
It has the motivation, description, the excitement, 00:34:21.920 |
the worry, the future impact is the first part. 00:34:26.740 |
One defining the quote unquote old school seminal work 00:34:46.040 |
The possible set of things that define the future direction. 00:34:52.120 |
and where the future research is very much needed. 00:35:15.060 |
So understanding the human being really starts with the face 00:35:22.580 |
and then that there's a head on top of that body, 00:35:26.180 |
And then there is the task of face recognition, 00:35:29.320 |
been an exceptionally active area of research 00:35:44.440 |
is recognizing the identity of a human face. 00:35:54.020 |
Now, recognition means there's a database of identities. 00:36:29.640 |
There's a lot of applications here, obviously, 00:36:33.060 |
from identification to all the security aspects 00:36:41.860 |
of your identity in all the interactive elements 00:36:44.580 |
of AI systems, software-based systems in this world. 00:36:50.400 |
So all the usual computer vision problems come in. 00:37:21.820 |
So these two classes that you're trying to separate 00:37:24.300 |
can be very, very, very close together and intermingle. 00:37:35.020 |
because of the financial benefits of such data sets, 00:37:40.460 |
unless you're Brad Pitt or Angelina Jolie or a celebrity, 00:37:43.580 |
there's not many samples of the data available. 00:37:46.580 |
So for the individuals on which the classification 00:37:49.420 |
is to be made, there's often not very much data. 00:37:56.000 |
So you have to, in making the face recognition task, 00:37:59.540 |
you have to be invariant to all the hairstyles, 00:38:10.940 |
the glasses you wear sometimes and not others, 00:38:35.340 |
but, and there's also a lot of concern, right? 00:38:49.380 |
of letting your devices recognize you and say hello. 00:39:05.380 |
The utopian view, the possibility of the future, 00:39:08.540 |
the best possible, brightest possible future. 00:39:30.580 |
to all your devices, all your banking information, so on. 00:39:36.060 |
just rephrasing that sentence also can be dystopian 00:39:45.860 |
being able to, through your Facebook and social media 00:39:49.820 |
and all your devices being able to identify you, 00:39:52.340 |
making it impossible for you to sort of hide from society. 00:39:58.980 |
maintaining privacy that many of us value greatly. 00:40:14.660 |
The essential idea there is applying deep neural networks 00:40:27.300 |
we're not covering the old school papers and so on, 00:40:34.620 |
biggest breakthroughs came with deep learning, 00:40:43.860 |
So that's the same is true with face recognition. 00:41:05.420 |
or at least close to the state of the art, is FaceNet. 00:41:10.240 |
The key idea there is using those same deep architectures 00:41:13.880 |
to now optimize for the representation itself directly. 00:41:20.580 |
we shared with some of you for the assignment, 00:41:24.020 |
describes face recognition, the challenge there, 00:41:27.380 |
that it's not like the traditional classification problem. 00:41:48.460 |
are close in the Euclidean sense in that embedding, 00:41:52.620 |
and people that are very different are far away. 00:41:55.540 |
And so you use that embedding to then do the classification. 00:41:58.500 |
That's really the only way to deal with data sets 00:42:09.080 |
in a way that directly optimizes for the Euclidean distance 00:42:22.340 |
and really bigger, badder networks and more data 00:42:26.500 |
is really one of the ways to crack this problem. 00:42:29.460 |
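The core of that idea, optimizing the embedding so that faces of the same identity are close and different identities are far apart, is FaceNet's triplet loss. Here is a minimal, framework-free sketch (the embeddings are toy 2-D vectors; real ones are learned, typically 128-D):

```python
# Minimal sketch of the triplet loss from FaceNet: the anchor-positive squared
# distance should be smaller than the anchor-negative squared distance by at
# least a margin; otherwise the triplet contributes a positive loss.

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Each argument is an embedding vector (list of floats).
    Loss is zero once ||a-p||^2 + margin <= ||a-n||^2."""
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)

a = [0.0, 0.0]
p = [0.1, 0.0]   # same person, close by in the embedding
n = [1.0, 0.0]   # different person, far away
print(triplet_loss(a, p, n))  # -> 0.0 (constraint already satisfied)
print(triplet_loss(a, n, p))  # violated triplet yields a positive loss
```

Once such an embedding is trained, classification over a database of identities reduces to nearest-neighbor lookup in Euclidean space, which is why this scales to datasets with only a few samples per person.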
So a large public data set with 672,000 identities, 00:42:38.660 |
and that just keeps scaling up and up and up and up. 00:42:47.380 |
in that even though the benchmarks are growing, 00:42:50.900 |
that's still a tiny subset of the people in the world. 00:42:53.420 |
We're still not quite there to be able to have 00:43:01.540 |
or a large swath of the population of the world. 00:43:09.140 |
we're not covering all of the aspects of the face, 00:43:13.020 |
especially temporal, that are useful in face recognition 00:43:16.300 |
or useful saying a lot of things about the face, 00:43:23.700 |
that can then be used to infer emotion and so on. 00:43:26.780 |
Raised eyebrows and all those kinds of things 00:43:36.420 |
including 3D face recognition, we're not covering. 00:43:55.200 |
not often stated and misinterpreted by people, 00:44:01.060 |
is that most of these methods of face recognition 00:44:05.260 |
start with assuming that you have a bounding box 00:44:20.500 |
But you can do recognition in all kinds of poses. 00:44:23.260 |
And it's very interesting to think that recognition, 00:44:27.900 |
the way we recognize our friends and colleagues, 00:44:35.460 |
that's beyond just the pure frontal view of the face. 00:44:43.020 |
So all those things, that's open in the field, 00:44:45.860 |
how we incorporate that into face recognition. 00:44:48.440 |
Then the black box side is problematic for both bias 00:44:56.580 |
is making those face recognition systems more interpretable. 00:45:11.540 |
and yet not violating the fundamental aspects 00:45:16.400 |
Activity recognition, taking the next step forward here 00:45:24.420 |
into the richer temporal context of what people do. 00:45:30.280 |
Again, the same structure from recent breakthroughs 00:45:37.120 |
It's classifying human activity from images or from video. 00:45:44.200 |
Depending on the level of abstraction for the activity, 00:45:51.580 |
it provides context for understanding the human. 00:45:57.940 |
Are they putting on makeup, knitting, mixing butter, and so on? 00:46:05.300 |
Again, all the usual problems in image recognition. 00:46:08.620 |
The kind of data we're dealing with is just much larger. 00:46:12.960 |
The kind of video, the richness of possibilities 00:46:16.500 |
that define what activity is, is much larger. 00:46:30.600 |
is the change in the world, is the motion of things. 00:46:33.440 |
And then it's difficult to determine the dynamics 00:46:37.320 |
of the physics of the world, and, especially from a 2D view, 00:46:40.000 |
what's background information, what's noise, 00:46:42.060 |
and what's essential to understanding the activity. 00:46:46.120 |
And the subjective, ambiguous elements of activity. 00:46:58.060 |
What's all the gray areas when you're partially engaging 00:47:17.240 |
Future impact, utopia, dystopia, middle path. 00:47:25.760 |
to understand the world in time and be able to predict. 00:47:31.260 |
The utopian possibilities is that the contextual perception 00:47:36.260 |
that can occur from here can enrich the experience 00:47:40.660 |
The dystopian view, the flip side is being able 00:47:57.260 |
The middle path is just finding useful information, 00:48:03.820 |
being able to identify what's going on in this video, 00:48:06.160 |
being able to infer rich, useful semantic information. 00:48:14.200 |
Now the recent breakthrough came with deep learning 00:48:17.340 |
and C3D, this 3D convolutional neural networks 00:48:20.180 |
that take a sequence of images and are able to determine 00:48:23.060 |
the action that's going on in an end-to-end way, 00:48:35.800 |
One is the image RGB data, the other is optical flow data 00:48:40.500 |
that's really focusing on the motion in the image. 00:48:46.900 |
Here from that paper showing the different architectures, 00:48:49.700 |
on the far right is the two-stream architecture 00:49:00.300 |
But all these are just different architectures. 00:49:05.020 |
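The second stream's motion input can be approximated very crudely with frame differencing; a sketch (real two-stream systems use proper optical flow, and these pixel values are made up):

```python
# Sketch of the motion signal behind the second stream in two-stream activity
# recognition, approximated here by simple frame differencing rather than
# true optical flow.

def motion_map(frame_prev, frame_next):
    """Frames are 2D lists of grayscale values; returns the absolute
    per-pixel change, a crude stand-in for what the motion stream sees."""
    return [[abs(b - a) for a, b in zip(ra, rb)]
            for ra, rb in zip(frame_prev, frame_next)]

f0 = [[10, 10], [10, 10]]
f1 = [[10, 50], [10, 10]]   # something moved into the top-right pixel
print(motion_map(f0, f1))   # -> [[0, 40], [0, 0]]
```

The point of the second stream is exactly this: static appearance alone misses the change in the world, and change is often what defines the activity.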
There's different architectures of how do you represent, 00:49:18.340 |
being able to take single images or sequences of images 00:49:32.340 |
you start to think about what are the defining qualities 00:49:42.140 |
Topics not covered is the localization of activity in video. 00:49:46.920 |
So action recognition purely defined is I give you a clip 00:49:50.260 |
and you tell me what's going on in this clip. 00:49:53.060 |
Now, if you take actually a full YouTube video, 00:49:56.660 |
find all the times when a particular activity is going on. 00:50:00.580 |
It could be multi-label, multiple activities going on 00:50:02.860 |
at the same time, beginning and ending, and asynchronously. 00:50:06.480 |
And then there is more richly three-dimensional 00:50:11.980 |
or 2D classification of activity based on human movement. 00:50:16.740 |
So looking at, like from a Kinect, from 3D sensors, 00:50:30.300 |
The open problems is that activity recognition 00:50:39.300 |
or if it's baseball, like a ball in your hand 00:50:47.460 |
There's sitting down or working or looking at something, 00:50:58.940 |
and the activity of other people in the scene. 00:51:00.980 |
And so being able to work with that kind of context 00:51:05.900 |
It's having to reduce a very complex real world context 00:51:10.060 |
into something where you can clearly identify an activity. 00:51:14.260 |
Body pose estimation is the task of localizing the joints 00:51:33.460 |
it's important to be able to understand the body language, 00:51:35.820 |
the rich information about the body of the human being. 00:51:40.820 |
So that's from reading body language to animation, 00:51:47.260 |
And it's just a useful representation of the human body. 00:51:53.980 |
or in interactive environments, human robot interaction, 00:52:07.700 |
when you look at a 2D image projection of the body, 00:52:13.540 |
it's a high-dimensional optimization problem, 00:52:31.220 |
it's really exciting for interactive environments 00:52:37.460 |
of the human body with which it's trying to interact. 00:52:47.500 |
you have to be able to find where their hand is 00:52:59.100 |
that they're able to physically take control of the vehicle. 00:53:01.420 |
That's a really exciting set of possibilities there. 00:53:09.420 |
when the robot and human have to work together. 00:53:14.860 |
of course, being able to localize all those joints 00:53:17.780 |
means robots that are able to more effectively hurt humans. 00:53:24.780 |
and always a dark dystopian view of the world 00:53:44.580 |
So it started with deep learning being applied 00:54:02.880 |
Power of deep learning is that you no longer have to do 00:54:08.340 |
that it automatically determines a set of features. 00:54:12.560 |
So this highly complex problem is all solved with data. 00:54:21.420 |
and beyond, there's been a few papers from CMU 00:54:24.260 |
along this line doing real-time multi-person pose estimation, 00:54:31.500 |
where you're detecting individual joints first. 00:54:53.980 |
So that actually turns out to be an extremely powerful way 00:55:28.020 |
rich volumetric information to do the detection 00:55:32.820 |
and then optimizing for what's the most likely 00:55:37.940 |
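The joint-detection step described here is typically implemented by predicting a per-joint confidence heatmap and taking its argmax; a minimal sketch with a made-up heatmap:

```python
# Sketch of heatmap decoding in keypoint-based pose estimation: the network
# outputs a 2D confidence map per joint, and the joint's location is read off
# as the most confident pixel (grid values below are invented).

def decode_joint(heatmap):
    """heatmap: 2D list of per-pixel confidences for one joint.
    Returns (row, col) of the most confident pixel."""
    best, best_rc = float("-inf"), (0, 0)
    for r, row in enumerate(heatmap):
        for c, v in enumerate(row):
            if v > best:
                best, best_rc = v, (r, c)
    return best_rc

elbow_map = [
    [0.01, 0.02, 0.01],
    [0.03, 0.90, 0.10],   # peak confidence at row 1, col 1
    [0.02, 0.05, 0.02],
]
print(decode_joint(elbow_map))  # -> (1, 1)
```

Assembling the per-joint detections into whole skeletons, the association step, is where approaches like CMU's part-affinity fields come in.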
The open problems in the field is the fact that 00:55:42.900 |
pose is not a thing that happens in a single image. 00:55:52.860 |
So here, Monty Python, Ministry of Silly Walks, 00:55:57.980 |
We collect a lot of data on pedestrians. 00:56:01.220 |
I can tell you that people walk in different ways 00:56:03.220 |
and people position their body in different ways. 00:56:13.580 |
in the body pose estimation problem and they should be. 00:56:31.000 |
That was 2018, it was really big for recommender systems, 00:56:41.560 |
Each one of the things that I mentioned briefly today 00:56:50.440 |
I taught an entire course on this at CHI last year. 00:56:52.880 |
So deep learning for understanding the human. 00:56:56.920 |
because it's really the first step for a machine 00:56:59.820 |
to be able to interact in a rich way with a human being, 00:57:03.600 |
And it's also the area where the most near-term impact 00:57:06.680 |
can happen, a system to be able to effectively detect 00:57:09.920 |
what a human being is up to, what they're thinking about, 00:57:13.720 |
how to best serve them and enrich the experience 00:57:21.600 |
Let me jump to AI safety and then the interactive experience 00:57:26.680 |
to humans and robots to just give examples of some work 00:57:31.300 |
in that direction, some research in that direction 00:57:42.380 |
where we want human beings to supervise those decisions. 00:58:00.440 |
So this kind of idea that you can achieve safety 00:58:05.440 |
by not giving ultimate power to any one decision maker. 00:58:09.560 |
And the disagreement that emerges from two AI systems 00:58:14.560 |
or multiple AI systems having to make decisions 00:58:21.540 |
it allows us to then produce a signal of uncertainty 00:58:25.280 |
based on which the human supervision can be sought. 00:58:27.840 |
Without that, when we have a state of the art 00:58:31.440 |
black box AI system that does something like drive a car, 00:58:40.000 |
We don't have any uncertainty signal coming from the system. 00:58:43.960 |
So the idea of arguing machines that we've developed 00:58:48.800 |
and been working on is to have multiple AI systems, 00:58:51.940 |
an ensemble of AI systems where the disagreement, 00:58:58.300 |
And the idea there is that when you have a system 00:59:08.540 |
it's telling you nothing about how uncertain it is 00:59:19.200 |
And in very rare cases, it just disengages. 00:59:24.340 |
the degree of uncertainty it has about the world around it. 00:59:27.520 |
And so the way we create that signal of uncertainty 00:59:30.560 |
is by adding another, in this case, end-to-end vision system 00:59:52.920 |
the times when the driver chose to disengage the system 00:59:58.360 |
So you're detecting, you're using this mechanism 01:00:09.080 |
by having multiple AI systems that are independent, 01:00:22.080 |
we can apply this in computer vision as well, 01:00:30.680 |
networks ResNet and VGGNet, trained on ImageNet, 01:00:37.100 |
and in the process, improve significantly the accuracy. 01:01:01.760 |
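A minimal sketch of that arbitration logic (the model outputs are illustrative; a real arguing-machines setup compares full output distributions, not just the top class):

```python
# Sketch of the "arguing machines" idea: run two independently trained models,
# and when their top predictions disagree, treat that disagreement as an
# uncertainty signal and escalate the decision to a human supervisor.

def arbitrate(probs_a, probs_b):
    """probs_a, probs_b: class-probability lists from two independent models.
    Agreement -> accept the shared prediction; disagreement -> ask a human."""
    top_a = max(range(len(probs_a)), key=probs_a.__getitem__)
    top_b = max(range(len(probs_b)), key=probs_b.__getitem__)
    if top_a == top_b:
        return "accept", top_a
    return "human", None   # disagreement is the uncertainty signal

# Both networks agree: the decision goes through automatically.
print(arbitrate([0.9, 0.1], [0.8, 0.2]))      # -> ('accept', 0)
# They disagree: supervision is sought, as with ResNet vs. VGGNet on ImageNet.
print(arbitrate([0.93, 0.07], [0.25, 0.75]))  # -> ('human', None)
```

The value of the scheme is that it extracts an uncertainty signal from black-box models that, on their own, report nothing about how uncertain they are.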
when the disagreement is brought to the human, 01:01:07.880 |
Now if that, this is just ImageNet challenge, 01:01:10.960 |
but if that error meant the loss of human life, 01:01:17.240 |
for overseeing the operation of the AI system. 01:01:21.480 |
That, just examples here where they disagree. 01:01:29.240 |
and ResNet prediction is that it's definitely 0.93, 01:01:35.800 |
and VGGNet, 25% confidence that it's a seatbelt. 01:01:50.920 |
and then humans are able to annotate correctly 01:01:54.980 |
Same thing here, mailbox, the ground truth is a mailbox. 01:02:01.780 |
One says traffic light, the other one says garbage truck. 01:02:11.680 |
you might stop for this mailbox, that kind of thing. 01:02:28.640 |
the uncertainty signal is the critical thing, 01:02:34.080 |
The subarea of just creating a rich human interaction. 01:02:45.200 |
So we have a human-centered autonomous vehicle 01:02:51.880 |
here at MIT that's taking control back and forth 01:03:06.880 |
should be fun and awesome and enriching to life. 01:03:11.320 |
And that's why you would want to use these kinds of systems. 01:03:18.560 |
including ridiculous one of me playing guitar. 01:03:24.640 |
of how we have humans and robots work together 01:03:29.040 |
There's a lot of totally untouched problems in that space. 01:03:34.640 |
and the machine learning community approaches AI 01:03:46.600 |
Just like, what is it, Robin Williams in Good Will Hunting, 01:03:51.600 |
talking about relationships, that nobody's perfect. 01:03:59.360 |
AI systems will not be perfect for the next 100 years. 01:04:02.600 |
And so we have to have humans and AI systems work together 01:04:05.520 |
and optimize that problem, solve that problem. 01:04:10.280 |
but together there's something enriching to both. 01:04:14.720 |
As I mentioned, the videos here will be available online. 01:04:17.600 |
The lectures underlying all the deep learning 01:04:24.960 |
And it's an area of active research here at MIT 01:05:16.800 |
This is done a lot now in reinforcement learning, 01:05:25.680 |
is something that happens naturally through interaction, 01:05:36.200 |
We can scale learning to a degree that's required 01:05:51.540 |
from the biological to the electrical and neuroscience, 01:05:55.520 |
to the behavioral aspects captured by cognitive science, 01:06:07.320 |
and put it in the real world with engineering systems, 01:06:11.240 |
These are all giant subfields with conferences and papers 01:06:25.980 |
what does, how does, and what does the computer, 01:06:33.920 |
And then the exciting aspects of learning from data 01:06:36.640 |
and deep learning and learning to act from data 01:06:42.280 |
And then the robotics is actually building these things, 01:06:48.720 |
again, an entire area, exciting field of research. 01:06:52.940 |
All of them have to work together to create systems here 01:06:56.320 |
that integrate the human during the learning process 01:06:58.900 |
and integrate the human during the operation process. 01:07:10.940 |
So with that, I'd like to thank you very much.