MIT Sloan: Intro to Machine Learning (in 360/VR)
Chapters
0:0 Intro
0:51 Course Overview
2:29 How Powerful is Artificial Intelligence
3:38 Supervised Learning
4:55 Augmented Learning
6:3 Machine Learning
9:34 Questions
12:33 Artificial Neuron
13:54 Building an Artificial Neuron
15:32 Neural Networks
20:0 Representation
24:28 General Intelligence
29:13 Data Representation
30:43 Pattern Recognition
33:53 Machine Learning Examples
35:10 How to Detect Traffic Lights
38:8 How to Generate Text
39:54 End-to-End Approach
41:23 What Can't We Do
43:8 The Pipeline
44:7 The Machine Learning
46:36 The Open Questions
47:41 Pong
49:14 The Mars Paradox
49:55 Neural Networks vs Natural Selection
00:00:00.000 |
The video you're watching now is in 360. Resolution is not great but we wanted to 00:00:05.880 |
try something different. So if you're on a desktop or laptop you can pan around 00:00:10.160 |
with your mouse or if you're on a phone or tablet you should be able to just 00:00:13.760 |
move your device to look around. Of course it's best viewed with a VR 00:00:18.320 |
headset. The video that follows is a guest lecture on machine learning that I 00:00:22.640 |
gave in an MIT Sloan course on the business of artificial intelligence. The lecture is 00:00:28.560 |
non-technical and intended to build intuition about these ideas amongst the 00:00:33.540 |
business students in the audience. The room was a half circle so we thought 00:00:37.880 |
why not film the lecture in 360. We recorded a screencast of the slides and 00:00:42.960 |
pasted it into the video so that the slides are more crisp. Let me know what 00:00:47.840 |
you think and remember it's an experiment. So this course is talking 00:00:53.840 |
about the broad context, the impact of artificial intelligence. There's the 00:00:58.640 |
global, which is the global impact of artificial intelligence, and there's the 00:01:01.920 |
business which is when you have to take these fun research ideas that I'll talk 00:01:06.280 |
about today. A lot of them are cool on toy examples when you bring them to 00:01:10.560 |
reality you face real challenges which is what I would like to really highlight 00:01:14.920 |
today. That's the business part when you want to make real impact, when you're 00:01:20.320 |
going to make these technologies a reality. So I'll talk about how amazing 00:01:23.880 |
the technology is for a nerd like me but also talk about how when you take that 00:01:30.120 |
into the real world what are the challenges you face. So machine learning 00:01:34.600 |
which is the technology at the core of artificial intelligence. We'll talk about 00:01:39.480 |
the promise, the excitement that I feel about it, the limitations, we'll bring it 00:01:45.920 |
down a little bit. What are the real capabilities of technology. We're for the 00:01:51.000 |
first time really as a civilization exploring the meaning of intelligence. It 00:01:59.240 |
is if you pause for a second and just think you know maybe many of you want to 00:02:04.440 |
make money out of this technology, many of you want to save lives, help people 00:02:09.240 |
but also on the philosophical level we get to explore what makes us human. So 00:02:15.200 |
while I'll talk about the low-level technologies also think about the 00:02:20.040 |
incredible opportunity here we get to almost psychoanalyze ourselves by trying 00:02:25.440 |
to build versions of ourselves in the machine. All right so here's the open 00:02:31.040 |
question how powerful is artificial intelligence? How powerful is machine 00:02:37.200 |
learning that lies at the core of artificial intelligence? Is it simply a 00:02:40.920 |
helpful tool, a special purpose tool to help you solve simple problems? That 00:02:45.040 |
is what it currently is. Currently machine learning, artificial intelligence, 00:02:51.080 |
is a way if you can formally define the problem, you can formally define the 00:02:56.400 |
tools you're working with, you can formally define the utility function 00:02:59.440 |
what you want to achieve with those tools. As long as you can define those 00:03:02.800 |
things we can come up with algorithms that can solve them. As long as you have 00:03:07.040 |
the right kind of data which is what I'll talk about. Data is key and the 00:03:12.600 |
question is into the future can we break past this very narrow definition of what 00:03:21.320 |
machine learning can give us which is solve specific problems to something 00:03:26.640 |
bigger to where we approach the general intelligence that we exhibit as human 00:03:31.400 |
beings. When we're born we know nothing and we learn quickly from very little 00:03:35.960 |
data. The right answer is we don't know. We don't know what are the limitations 00:03:44.080 |
of technology. What kinds of machine learning are there? There are several 00:03:48.600 |
flavors. The first is what's really achieved success 00:03:53.720 |
today. Supervised learning. What I'm showing here on the left of the slide is 00:03:57.640 |
the teacher, the data that is fed to the system, and on the right is the 00:04:03.640 |
student, which is the machine learning system itself. So there's supervised 00:04:07.800 |
learning. Whenever anybody talks about machine learning today, for the 00:04:12.360 |
most part they're referring to supervised learning which means every 00:04:15.840 |
single piece of data that is used to train the model is seen by human eyes 00:04:20.080 |
and those human eyes with an accompanying brain label that data in a 00:04:28.080 |
way that makes it useful to the machine. This is critical because, 00:04:32.240 |
one, the blue box, the human, is really costly. So whenever every single piece of 00:04:37.280 |
data that's used to train the machine needs to be seen by a 00:04:41.560 |
human, you need to pay for that human. And second, you're limited by time. 00:04:46.400 |
The amount of data necessary to label what it means to exist in this 00:04:52.240 |
world is humongous. Augmented supervised learning is when you get the machine to 00:04:59.040 |
really help you a little bit. There's a few tricks there but still only 00:05:02.760 |
tricks. It's still the human is at the core of it and the promise of future 00:05:08.600 |
research that we're pursuing, that I'm pursuing and perhaps in the applications 00:05:13.360 |
if we get to discuss or some of the speakers here get to discuss, they're 00:05:17.680 |
pursuing in semi supervised and reinforcement learning where the human 00:05:22.200 |
starts to play a smaller and smaller role in how much 00:05:25.960 |
they have to annotate the data. And the dream, the thing the wizards of the dark 00:05:32.200 |
arts of deep learning are all excited about, is unsupervised learning. It has very 00:05:37.120 |
few actual successes in application in the real world today but it is the idea 00:05:44.800 |
that you can build a machine that doesn't require a human teacher, a human 00:05:52.320 |
being to teach you anything, fills us artificial intelligence researchers 00:05:59.560 |
with excitement. There's a theme here. Machine learning is really simple. The 00:06:09.640 |
learning system in the middle, there's a training stage where you teach it 00:06:15.040 |
something. All you need is some data, input data, and you need to teach it the 00:06:23.160 |
correct output for that input data. So you have to have a lot of pairs of input 00:06:29.880 |
data and correct output. There'll be a theme of cats throughout this 00:06:33.200 |
presentation. So if you want to teach a system the difference between a cat and a dog, 00:06:39.240 |
you need a lot of images of cats and you need to tell it that this is a cat. This 00:06:43.640 |
bounding box here in the image is a cat. You have to give it a lot of images of 00:06:47.600 |
dogs and tell it, "Okay, well in these pictures there are dogs." And 00:06:51.960 |
then (there's a spelling mistake on the slide) the second stage is the testing 00:06:57.440 |
stage, when you actually give it new input data it's never seen before and you 00:07:02.840 |
hope that it has been given, for cat versus dog, enough data to guess whether this new 00:07:08.520 |
image that it's never seen before is a cat or a dog. Now one of the open questions 00:07:16.680 |
you want to keep in mind is what in this world can we not model in this way? What 00:07:26.680 |
activity, what task, what goal? I offer to you that there's nothing you can't 00:07:37.080 |
model in this way. So let's think about it in terms of machine learning. 00:07:51.160 |
Let's start small. What can be modeled in this way? First on the bottom of the 00:07:56.520 |
slide left is one-to-one mapping where the input is an image of a cat and the 00:08:01.120 |
output is a label that says cat or dog. You can also do one-to-many where the 00:08:07.480 |
image, the input is an image of a cat and the output is a story about that cat, a 00:08:14.080 |
captioning of the image. You can also do it the other way, many-to-one 00:08:20.640 |
mapping where you give it a story about a cat and it generates an image. There's 00:08:26.800 |
many-to-many, this is Google Translate, we translate a sentence from one 00:08:30.840 |
language to another and there's various flavors of that. Again, same theme here, 00:08:36.480 |
input data provided with correct output and then let it go into the wild where 00:08:43.680 |
it runs on input data it hasn't seen before to provide guesses. And it's as 00:08:52.360 |
simple as this, whatever you can convert into one of the following four things, 00:08:56.560 |
a number, a vector of numbers, so a bunch of numbers, a sequence of numbers where the 00:09:03.040 |
temporal dynamics matters, so like audio, video, where the sequence, the ordering 00:09:09.240 |
matters, or a sequence of vectors of numbers, just a bunch of numbers. If you can 00:09:13.000 |
convert it into numbers, and I propose to you that there's nothing you can't 00:09:16.920 |
convert into numbers. If you can convert it to numbers you can have a system 00:09:23.880 |
learn to do it. And the same thing with the output, generate numbers, vectors of 00:09:28.160 |
numbers, sequences of numbers, or sequences of vectors of numbers. First, are there 00:09:36.280 |
any questions at this point? Well, we have a lot of fun slides to get through, but 00:09:42.720 |
I'll pause every once in a while to make sure we're on the same page here. So what 00:09:47.680 |
kind of input are we talking about? Just to fly through it, images, so faces or 00:09:52.080 |
medical applications for looking at scans of different parts of the 00:09:58.120 |
body to diagnose any kind of medical condition. Text, so 00:10:03.120 |
conversations, your texts, articles, blog posts for sentiment analysis, question 00:10:08.560 |
answering, so you ask it a question where the output you hope is answers. Sounds, 00:10:14.320 |
so voice recognition, anything you could tell from audio. Time series 00:10:20.400 |
data, so financial data, stock market, you can use it to predict anything you want 00:10:26.040 |
about the stock market including whether to buy or sell. If you're curious, it doesn't 00:10:30.080 |
work quite well as a machine learning application. Physical world, so cars or 00:10:37.160 |
any kind of object, any kind of robot that exists in this world. So location of 00:10:42.760 |
where I am, location of where other things are, the actions of others, that 00:10:47.520 |
could be all input. All of it can be converted to numbers. And the correct 00:10:51.560 |
output, same thing. Classification, a bunch of numbers. Classification is saying is 00:10:56.400 |
this a cat or a dog, regression is saying to what degree I turn the steering wheel, 00:11:01.280 |
sequence, generating audio, generating video, generating stories, captioning, text, 00:11:07.400 |
images, generating anything you could think of as numbers. And at the core of 00:11:11.600 |
it is a bunch of data agnostic machine learning algorithms. There's traditional 00:11:19.000 |
ones, nearest neighbors, Naive Bayes, support vector machines. A lot of 00:11:24.720 |
them are limited and I'll describe how. And then there's neural networks. There's 00:11:34.280 |
nothing special and new about neural networks. And I'll describe exactly the 00:11:40.560 |
very subtle thing that is powerful, that's always been there all along and 00:11:47.920 |
certain things have now been able to unlock that power about neural networks. 00:11:52.320 |
But it's still just a flavor of machine learning algorithm. And the 00:11:56.520 |
inspiration for neural networks, as Jonathan showed last time, is our human 00:12:00.440 |
brain. It's perhaps why the media, perhaps why the hype is captivated by the idea 00:12:07.440 |
of neural networks, is because you immediately jump to this feeling like 00:12:11.760 |
because there's this mysterious structure to them that scientists don't 00:12:16.200 |
understand. Artificial neural networks I'm referring to and the biological ones. 00:12:20.640 |
We don't understand them and the similarity captivates our minds and we 00:12:26.760 |
think well this approach is perhaps as limitless as our own 00:12:31.680 |
human mind. But the comparison ends there. In fact the artificial neurons in 00:12:38.960 |
artificial neural networks are much simpler computational units. At the core 00:12:45.200 |
of everything is this neuron. This is a computational unit that does two 00:12:55.280 |
very simple operations. On the left side it takes a set of numbers as inputs, 00:13:01.520 |
it applies weights to those inputs, sums them together, applies a little bias and 00:13:10.360 |
provides an output somewhere between 0 and 1. So you can think of it as a 00:13:19.360 |
computational entity that gets excited when it sees certain inputs and gets 00:13:27.200 |
totally turned off when it gets other kinds of inputs. So maybe this neuron 00:13:33.560 |
with 0.7, 0.6, 1.4 weights, it gets really excited when it sees 00:13:40.360 |
pictures of cats and totally doesn't care about dogs. Some of us are like that. 00:13:47.720 |
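The neuron just described can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the slide: the weights 0.7, 0.6 and 1.4 are the ones mentioned above, while the bias value, the sigmoid squashing function and the two input vectors are assumptions chosen only to show the "excited" versus "turned off" behavior.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Weight each input, sum them, add the bias, then squash the result."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(total)


WEIGHTS = [0.7, 0.6, 1.4]  # the weights mentioned in the lecture
BIAS = -0.5                # assumed value, for illustration only

cat_like = [1.0, 1.0, 1.0]     # hypothetical inputs the neuron responds to
dog_like = [-1.0, -1.0, -1.0]  # hypothetical inputs it ignores

excited = neuron(cat_like, WEIGHTS, BIAS)  # close to 1
bored = neuron(dog_like, WEIGHTS, BIAS)    # close to 0
print(f"cat-like input: {excited:.2f}, dog-like input: {bored:.2f}")
```

The whole unit is one weighted sum and one nonlinear function, which is the point made in the next part of the lecture.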
So the job of this neuron is to detect cats. Now, the way you build 00:13:56.560 |
an artificial neural network, the way you release the power that I'll talk about 00:14:04.240 |
in the following slides about the applications, what could be achieved, is 00:14:08.400 |
just stacking a bunch of these together. Think about it. This is an 00:14:15.160 |
extremely simple computational unit. So you need to sort of pause whenever we 00:14:23.160 |
talk about the following slides and think that there's a few slides 00:14:29.320 |
that I'll show that say neural networks are amazing. I want you to think back to 00:14:33.960 |
this slide that everything is built on top of these really simple addition 00:14:39.960 |
operations with a simple nonlinear function applied at the end. Just a tiny 00:14:47.280 |
math operation. We stack them together in a feed-forward way so there's a 00:14:52.520 |
bunch of layers and when people talk about deep neural networks it means 00:14:56.280 |
there's a bunch of those layers and then there's recurrent neural networks that 00:15:02.480 |
are also a special flavor that's able to have memory. So as opposed to just 00:15:08.480 |
pushing input into output directly, it's also able to do stuff on the inside in a 00:15:14.280 |
loop where it remembers things. This is useful for natural language processing, 00:15:17.920 |
for audio processing, whenever the length of the 00:15:23.720 |
sequence is not defined. Okay, slide number one in terms of neural networks 00:15:29.880 |
are amazing. This is, this is perhaps for the math nerds, but also I want you to 00:15:39.800 |
use your imagination. There's a universality to neural networks. It means 00:15:44.640 |
that with this simple computational unit, where on the left is the input and on the right is the 00:15:49.320 |
output of this network, with just a single hidden layer (it's called a hidden 00:15:53.680 |
layer because it sits there in the middle, between the input and the output 00:15:57.960 |
layers), a single hidden layer with some number of nodes can represent any 00:16:06.680 |
function. Any function. That means anything you want to build in this world. 00:16:16.120 |
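Stated a little more carefully, the universality result is usually given as an approximation guarantee. The form below is a standard textbook statement for a sigmoid activation, not something taken from the slides, and the symbols $f$, $K$, $N$, $v_i$, $w_i$, $b_i$ are introduced here only for illustration. For any continuous function $f$ on a compact set $K \subset \mathbb{R}^n$ and any tolerance $\varepsilon > 0$, there is a number of hidden nodes $N$ and parameters $v_i$, $w_i$, $b_i$ such that:

```latex
\sup_{x \in K} \left| \, f(x) - \sum_{i=1}^{N} v_i \, \sigma\!\left( w_i^{\top} x + b_i \right) \right| < \varepsilon,
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

The guarantee says such a network exists. It does not say how many nodes are needed or how to find the weights, which is the problem raised next.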
Everyone in this room can be represented with a neural network with a single 00:16:23.200 |
hidden layer. So the power, and this is just one hidden layer, the power of these 00:16:29.720 |
things is limitless. The problem of course is how do you find the network? So 00:16:37.800 |
how do you build a network that is as clever as many of the people in this 00:16:43.720 |
room? But the fact that you can build such a network is incredible, is 00:16:50.520 |
amazing. I want you to think about that. And the way you train a network, so it's 00:16:58.920 |
born as a blank slate. Some random weights assigned to the edges. Again, a 00:17:04.200 |
network is represented, the numbers at the core, the parameters at the core of 00:17:07.880 |
this network are the numbers on each of those arrows, each of those edges. And you 00:17:13.680 |
start knowing nothing. This is a baby network. And the way you teach it 00:17:18.760 |
something, unfortunately, currently, as I said, in a supervised learning mechanism, 00:17:25.480 |
you have to give it pairs of input and output. You have to give it pictures of 00:17:30.240 |
cats and labels on those pictures saying that they're cats. And the basic 00:17:38.480 |
fundamental operation of learning is when you compute a measure of error 00:17:48.000 |
and you back propagate it through the network. What I mean, everything is easier with 00:17:55.960 |
cats. I apologize, I apologize, too many cats. And so the input here is a cat and 00:18:05.400 |
the neural network we trained, it's just guessing, it doesn't know. Say, I don't 00:18:10.320 |
know, it's guessing cat. Well, it happens to be right. So we have to, this is the 00:18:16.960 |
measure of error. Yes, you got it right. And you have to back propagate that 00:18:21.120 |
error. You have to reward the network for doing a good job. And all you do, what I 00:18:27.360 |
mean by reward, there's weights on each of those edges and so the node, the 00:18:32.680 |
individual neurons that were responsible, that back to that cat neuron, that cat 00:18:37.240 |
neuron needs to be rewarded for seeing the cat. So you just increase the weights 00:18:41.680 |
on the neurons that were associated with producing the correct answer. Now you 00:18:46.040 |
give it a picture of a dog and the neural network says cat. Well, that's an 00:18:51.560 |
incorrect answer, so no, there's a high error, needs to be back propagated to the 00:18:56.520 |
network. So the weights that were responsible for classifying this 00:19:00.600 |
picture as a cat need to be punished, they need to be decreased. Simple. And you 00:19:07.960 |
just repeat this process over and over. This is what we do as kids when we're 00:19:11.480 |
first learning. And you know, for the most part, we're also 00:19:18.040 |
supervised learning machines in the sense that we have our parents and we 00:19:21.840 |
have the environment, the world, that teaches us about what's correct and what's 00:19:28.360 |
incorrect. And we back propagate this error and reward through our brain to 00:19:33.040 |
learn. The problem is, as human beings, we don't need too many examples and I'll 00:19:39.120 |
talk about some of the drawbacks of these approaches. We don't need too many 00:19:43.080 |
examples. You fall off your bike once or twice and you learn how to ride the bike. 00:19:46.760 |
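The reward-and-punish loop described above can be sketched for the simplest possible case, a single neuron. This is a toy illustration under stated assumptions: the two features (`has_retractable_claws`, `barks`), the four training examples, the learning rate and the number of passes are all made up. With only one neuron, propagating the error back is a single step; a real network repeats that step layer by layer.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Return a score between 0 (dog) and 1 (cat)."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)


def train(examples, epochs=1000, learning_rate=0.5, seed=0):
    """Start from small random weights and nudge them after every guess."""
    rng = random.Random(seed)
    weights = [rng.uniform(-0.1, 0.1) for _ in examples[0][0]]
    bias = 0.0
    for _ in range(epochs):  # the same examples are shown over and over
        for features, label in examples:
            guess = predict(weights, bias, features)
            error = label - guess  # > 0: guess too low, < 0: guess too high
            # Increase the weights that should have fired more,
            # decrease the ones that pushed toward the wrong answer.
            for i, x in enumerate(features):
                weights[i] += learning_rate * error * x
            bias += learning_rate * error
    return weights, bias


# Hypothetical features: [has_retractable_claws, barks]; label 1 = cat, 0 = dog.
EXAMPLES = [([1.0, 0.0], 1), ([1.0, 0.0], 1), ([0.0, 1.0], 0), ([0.0, 1.0], 0)]

weights, bias = train(EXAMPLES)
cat_score = predict(weights, bias, [1.0, 0.0])
dog_score = predict(weights, bias, [0.0, 1.0])
print(f"cat example scores {cat_score:.3f}, dog example scores {dog_score:.3f}")
```

Even on this tiny problem the loop makes thousands of small corrections, which is a small-scale version of the data hunger discussed next.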
Unfortunately neural networks need tens of thousands of times when 00:19:52.520 |
they fall off the bike in order to learn how to not do it. That's one of the 00:19:56.400 |
limitations. And one key thing I didn't mention here is when we refer to input 00:20:03.800 |
data, we usually refer to sensory data, raw data. 00:20:12.760 |
We have to represent that data in some clever way, in some deeply clever way, 00:20:20.640 |
where we can reason about it, whether it's in our brains or in the neural 00:20:27.520 |
network. And a very simple example here to illustrate why representation of 00:20:33.720 |
data matters. So the way you represent the data can make the discrimination of 00:20:39.680 |
one class from another, a cat versus dog, either incredibly difficult or 00:20:45.720 |
incredibly simple. Here is a visualization of the same kind of data 00:20:49.960 |
in Cartesian coordinates and polar coordinates. On the right you can just 00:20:54.760 |
draw a simple line to separate the two. What you want is a system that's able to 00:21:01.640 |
learn the polar coordinate representation versus the Cartesian 00:21:07.000 |
representation automatically. And this is where deep learning has stepped in and 00:21:15.240 |
revealed the incredible power of this approach. Deep learning is the 00:21:21.520 |
smallest circle there. It's a type of representation learning. Machine 00:21:26.960 |
learning is the second biggest circle. This class is about the 00:21:31.160 |
biggest circle, AI, which includes robotics, includes all the fun things that are 00:21:35.040 |
built on learning. And I'll discuss why machine learning, I think, will close 00:21:39.000 |
this entire circle into one. But for now AI is the biggest circle, then a subset 00:21:44.920 |
of that is machine learning, and a smaller subset of that is representation 00:21:49.080 |
learning. So deep learning is not only able to say, given a few examples of cats 00:21:55.160 |
and dogs, to discriminate between a cat and a dog. It's able to represent what it 00:22:01.040 |
means to be a cat. So it's able to automatically determine what are the 00:22:08.560 |
fundamental units at the low level and the high level. This is very 00:22:14.720 |
Plato. What it means to represent a cat from the whiskers to the high level 00:22:21.600 |
shape of the head to the fuzziness and the deformable aspects of the cat. 00:22:27.640 |
Not a cat expert, but I hear these are the features of a cat that 00:22:32.400 |
are essential to discriminate between a cat and a dog. Learning those features as 00:22:37.080 |
opposed to having to have experts. This is the drawback of systems that Jonathan 00:22:42.520 |
talked about from the 80s and 90s where you have to bring in experts for any 00:22:46.600 |
specific domain that you try to solve. You had to have them encode that 00:22:50.880 |
information. Deep learning, this is simply the only big difference 00:22:58.240 |
between deep learning and other methods. Is that it learns the representation for 00:23:02.520 |
you. It learns what it means to be a cat. Nobody has to step in and help it figure 00:23:06.880 |
out that cats have whiskers and dogs don't. What does this mean? The 00:23:13.480 |
fact is that it can learn these features, these whisker features, as opposed to 00:23:19.000 |
having five or ten or a hundred or five hundred features that are encoded by 00:23:23.840 |
brilliant engineers with PhDs. It can find hundreds of thousands, millions of 00:23:30.560 |
features automatically. Hundreds of millions of features. So stuff that 00:23:37.800 |
can't be put into words or described. In fact it's one of the limitations of 00:23:41.800 |
neural networks is they find so many fundamental things about what it means 00:23:44.960 |
to be a cat that you can't visualize what it really knows. It just seems to 00:23:49.520 |
know stuff and it finds that stuff automatically. What does this mean? 00:23:55.920 |
The critical thing here is because it's able to automatically learn those 00:24:01.120 |
hundreds of millions of features, it's able to utilize data. It doesn't start, 00:24:07.720 |
the diminishing returns don't hit until, well we don't know when they hit. The 00:24:14.480 |
point is with the classical machine learning algorithms you start hitting a 00:24:18.200 |
wall when you have tens of thousands of images of cats. With deep learning you... 00:24:28.040 |
Neural networks are amazing, slide two. Here's a game, a simple arcade 00:24:36.400 |
game, where there's two paddles, they're bouncing a ball back and forth. Okay, 00:24:40.520 |
great, you can figure out an artificial intelligence agent that can play this 00:24:44.840 |
game. It can, not even that well, just kind of, it kind of learns to do alright 00:24:50.400 |
and eventually win. Here's the fascinating thing. With deep learning, as 00:24:59.760 |
opposed to encoding the position of the paddles, the position of the ball, having 00:25:06.440 |
an expert in this game, there's many, come in and encode the physics of this game. 00:25:12.920 |
The input to the neural network is the raw pixels of the game. So it's learning 00:25:23.200 |
in the following way. You give it an evolution of the game, you give it a 00:25:28.560 |
bunch of pixels. Images are built up of pixels. They're just numbers 00:25:34.360 |
from 0 to 255. So there's this array of numbers that represent each image and 00:25:40.520 |
then you give it several tens of thousands of images that represent a 00:25:45.120 |
game. So you have this stack of pixels and stack of images that represent a 00:25:50.920 |
game and the only thing you know, this giant stack of numbers, the only thing 00:25:55.640 |
you know is at the end you won or lost. That's it. So based on that you have to 00:26:04.340 |
figure out how to play the game. You know nothing about games, you know nothing 00:26:08.240 |
about colors or balls or paddles or winning or anything. That's it. So this is, 00:26:15.920 |
why is this amazing? That it even works and it works, it wins. It's amazing 00:26:22.040 |
because that's exactly what we do as human beings. This is general 00:26:24.960 |
intelligence. So I need you to pause and think about this. We'll talk about 00:26:30.760 |
special intelligence and the usefulness and okay there's cool tricks here and 00:26:34.320 |
there that we can do to get you an edge on your high-frequency trading system 00:26:39.840 |
but this is general intelligence. General intelligence is the same intelligence we 00:26:46.520 |
use as babies when we're born. What we get is an input, sensory input of image 00:26:52.240 |
sensory input. Right now all of us, most of us are seeing, hearing, feeling with 00:26:59.520 |
touch and that's the only input we get. We know nothing and with that input we 00:27:03.640 |
have to learn something. Nobody is pre-teaching us stuff and this is an 00:27:10.160 |
example of that, a trivial example but one of the first examples where this is 00:27:14.840 |
truly working. I'm sorry to linger on this but it's a fundamental fact. The fact 00:27:20.640 |
that we have systems that now outperform human beings in these simple 00:27:25.640 |
arcade games is incredible. This is the research side of things but let me step 00:27:34.280 |
back. These, again, are the takeaways. That previous slide is why I think machine 00:27:40.000 |
learning is limitless in the future. Currently it's limited. Again the 00:27:50.120 |
representation of the data matters and if you want to have impact we currently 00:27:57.200 |
can only tackle the small problems. What are those problems? Image recognition. We 00:28:03.960 |
can classify given the entire image of a leopard, of a boat, of a mite with pretty 00:28:10.960 |
good accuracy of what's in that image. That's image classification. What else? We 00:28:19.760 |
can find exactly where in that image each individual object is. That's called 00:28:23.320 |
image segmentation. Again the process is the same. The learning 00:28:30.800 |
system in the middle, a neural network, as long as you give it a set of numbers as 00:28:38.240 |
input and the correct set of labels as output, it learns to do that for data 00:28:43.240 |
it hasn't seen in the past. Let me pause a second and maybe if you have any 00:28:48.880 |
questions. Does anyone have any questions about the techniques of neural 00:29:14.120 |
So that's a great question and in a couple of slides I'll get to it exactly. 00:29:19.840 |
So the data representation, I'll elaborate in a little bit, but loosely 00:29:27.720 |
the data representation for a neural network is in the weights of each of 00:29:35.600 |
those arrows that connect the neurons. That's where the representation is. So 00:29:40.880 |
I'll show to really clarify that example of what that means. The Cartesian versus 00:29:50.480 |
polar coordinates is just a very simple visualization of the concept. 00:29:56.680 |
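The Cartesian versus polar picture can be reproduced with a small sketch. The data here is invented for illustration: two classes sitting on concentric rings of radius 1 and 3. The check is deliberately simple, asking only whether a single threshold on one coordinate separates the classes, which is enough to show how the choice of representation changes the difficulty.

```python
import math


def ring(radius, count):
    """Return `count` points evenly spaced on a circle of the given radius."""
    return [
        (radius * math.cos(2 * math.pi * k / count),
         radius * math.sin(2 * math.pi * k / count))
        for k in range(count)
    ]


def to_polar(x, y):
    """Convert Cartesian (x, y) into polar (radius, angle)."""
    return math.hypot(x, y), math.atan2(y, x)


def separable_by_threshold(values_a, values_b):
    """True if one cut-off value puts all of A on one side and all of B on the other."""
    return max(values_a) < min(values_b) or max(values_b) < min(values_a)


inner = ring(1.0, 12)  # made-up class A
outer = ring(3.0, 12)  # made-up class B

# Cartesian representation: neither x nor y alone separates the rings.
x_separable = separable_by_threshold([x for x, _ in inner], [x for x, _ in outer])
y_separable = separable_by_threshold([y for _, y in inner], [y for _, y in outer])

# Polar representation: the radius alone separates them.
r_separable = separable_by_threshold(
    [to_polar(x, y)[0] for x, y in inner],
    [to_polar(x, y)[0] for x, y in outer],
)

print(x_separable, y_separable, r_separable)
```

Here the conversion to polar coordinates is written by hand. The point of the lecture is that a deep network is expected to find a useful representation like this on its own.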
But you want to be able to represent the data in an arbitrary way where there's 00:30:03.080 |
no limits to the representation. It could be highly nonlinear, highly complex. Any 00:30:12.720 |
Generally speaking, in our current state, when we talk about machine learning or AI, it's simply statistical models that are able to recognize 00:30:20.920 |
patterns or things of that nature where they're not necessarily thinking but 00:30:26.400 |
simply recognizing. So I'm a little confused about how the current, I guess, 00:30:32.360 |
system differs from deep learning and whether you think that there is the 00:30:39.240 |
possibility of transitioning from recognizing to actually thinking. 00:30:44.520 |
So I have a couple of slides almost asking this question because there's no 00:30:49.600 |
good answers. But one could argue, and I think somebody in the last class brought up 00:30:54.160 |
that, you know, is machine learning just pattern recognition? It's possible that 00:31:01.960 |
reasoning, thinking, is just pattern recognition. And I'll describe sort of an 00:31:13.600 |
intuition behind that. So we tend to respect thinking a lot because we've 00:31:24.360 |
recently as human beings learned to do it. In our evolutionary time, we think 00:31:29.480 |
that it's somehow special from, for example, perception. We've had visual 00:31:33.520 |
perception for several orders of magnitude longer in our evolution as a 00:31:39.280 |
living species. We've started to learn to reason, I think, about a hundred thousand 00:31:45.520 |
years ago. So we think it's somehow special from the same kind of mechanism 00:31:50.720 |
we use for seeing things. Perhaps it's exactly the same thing. So perception 00:31:56.440 |
is pattern recognition. Perhaps reasoning is just a few more layers of that. 00:32:09.240 |
The concept of neural network itself is not very new. So is there any technical innovation or breakthrough to expand the use of neural network? 00:32:23.240 |
Or is it just an increase of the result of computational power? 00:32:32.240 |
Yes, that's a great question. There have been very few breakthroughs in neural networks through the AI winters that we've discussed, through a lot of excitement in spurts, and even recently there have been very few algorithmic innovations. 00:32:52.240 |
The big gains came from compute. So improvements in GPUs and better, faster computers. 00:32:59.240 |
You can't underestimate the power of community. So the ability to share code and the internet. 00:33:07.240 |
Ability to communicate together through the internet and work on code together. 00:33:12.240 |
And then digitization of data. So like ability to have large data sets easily accessible and downloadable. 00:33:19.240 |
All of those little things. But I think in terms of the future of deep learning and machine learning, it all rides on compute, I think. 00:33:28.240 |
Meaning continued bigger and faster computers. 00:33:33.240 |
That doesn't necessarily mean Moore's Law in making smaller and smaller chips. It means getting clever in different directions. 00:33:41.240 |
Massive parallelization. Coming up with ways to do super efficient, power efficient implementations of neural networks and so on. 00:33:50.240 |
So let me just fly through a few examples of what we can do with machine learning. 00:33:58.240 |
Just to give you a flavor, I think in future lectures it's possible we'll discuss with different speakers, different specific applications, really dig into those. 00:34:08.240 |
So we can, as opposed to working with just images, you can work with videos and segment those. 00:34:16.240 |
I mentioned image segmentation. We can do video segmentation. 00:34:20.240 |
Through video, you segment the different parts of a scene that are useful to a particular application. 00:34:24.240 |
Here in driving, you can segment the road from cars and vegetation and lane markings. 00:34:34.240 |
You can also, this is a subtle but important point. 00:34:38.240 |
>> Just go back one slide. How do they see the light? It's such a critical piece. 00:34:44.240 |
The more I listen to you and read your stuff, it seems like this critical, these very small pieces of information that we know are important. 00:34:53.240 |
Like there is a red light. I have to stop. I have to slow down. 00:34:57.240 |
How does it filter that out and pick out that? 00:35:01.240 |
>> It's got to be 100% reliable on that, right? 00:35:13.240 |
The question was how do you detect the traffic light and lights. 00:35:22.240 |
How do we do it as human beings, first of all? Let's start there. 00:35:27.240 |
The way we do it is by the knowledge we bring to the table. 00:35:33.240 |
We know what it means to be on the road. There's a huge network of knowledge that you come with. 00:35:39.240 |
That makes the perception problem much easier. 00:35:42.240 |
This is pure perception. You take an image and you separate different parts based purely on tiny patterns of pixels. 00:35:51.240 |
First it finds all the edges. It learns that traffic lights have certain kinds of edges around them. 00:36:01.240 |
They have a certain collection of edges that make up this black rectangle type shape. 00:36:08.240 |
It's all about shapes. It builds up knowing the shape structure of things. 00:36:14.240 |
It's a purely perception problem. One of the things I argue is that if it's purely a perception approach 00:36:21.240 |
and you bring no knowledge to the table about the physics of the world, 00:36:24.240 |
the three-dimensional physics and the temporal dynamics, 00:36:27.240 |
that you're not going to be able to successfully achieve near 100% accuracy on some of these systems. 00:36:38.240 |
For all of these things, think about how you as a human being would solve these problems. 00:36:43.240 |
What is lacking in the machine learning approach? 00:36:46.240 |
What data is lacking in the machine learning approach in order to achieve the same kind of results? 00:36:51.240 |
The same kind of reasoning required that you would use as a human. 00:36:59.240 |
Image detection, which means, it's a subtle but important point, 00:37:04.240 |
the stuff I mentioned before, image classification is given an image of a cat. 00:37:08.240 |
You find the cat. Sorry, you don't find the cat. You say this image is of a cat or not. 00:37:13.240 |
And then detection or localization is when you actually find where in the image that is. 00:37:18.240 |
That problem is much harder, but also doable with machine learning, with deep neural networks. 00:37:26.240 |
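The difference between classification and localization described above can be sketched with a toy example. This is only an illustration under stated assumptions: the "image" is a small grid of numbers where nonzero values stand in for cat pixels, and the function names `classify` and `localize` are hypothetical, not anything shown in the lecture. A real system would use a trained deep neural network for both steps.

```python
from typing import List, Optional, Tuple

# Toy grayscale image: nonzero pixels stand in for "cat" pixels (an assumption).
Image = List[List[int]]


def classify(image: Image) -> bool:
    """Classification: one yes/no answer for the whole image."""
    return any(pixel > 0 for row in image for pixel in row)


def localize(image: Image) -> Optional[Tuple[int, int, int, int]]:
    """Localization: where in the image is it? Returns (top, left, bottom, right)."""
    coords = [
        (r, c)
        for r, row in enumerate(image)
        for c, pixel in enumerate(row)
        if pixel > 0
    ]
    if not coords:
        return None  # nothing to localize
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))


image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]
print(classify(image))  # True
print(localize(image))  # (1, 1, 2, 2)
```

Classification returns a single answer for the whole image, while localization also has to return coordinates, which is part of why it is the harder problem.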
Now, as I said, inputs/outputs can be anything. 00:37:29.240 |
The input can be video. The output can be video. 00:37:32.240 |
And you can do anything you want with these videos. You can colorize the video. 00:37:36.240 |
You can take an old black and white film and produce color images. 00:37:43.240 |
Again, in terms of having an impact in the world using these applications, 00:37:50.240 |
you have to think, this is a cool demonstration, but how well does it actually work in the real world? 00:37:57.240 |
Translation, whether that's from text to text or image to image, 00:38:02.240 |
you can translate here dark chocolate from one language to another. 00:38:09.240 |
This class, Global Business of Artificial Intelligence, there's a reference below there. 00:38:16.240 |
You can generate handwriting. 00:38:21.240 |
You can type in some text and given different styles that it learns from other handwriting samples, 00:38:28.240 |
it can generate any kind of text using handwriting. 00:38:32.240 |
Again, the input is language. The output is a sequence of pen movements on the screen. 00:38:41.240 |
You can complete sentences. This is kind of a fun one where if you start... 00:38:50.240 |
And you can generate language where you start, you feed the system some input first. 00:38:54.240 |
So in black there it says, "Life is," and then have the neural network complete those sentences. 00:39:01.240 |
"Life is about kids." "Life is about the weather." 00:39:05.240 |
There's a lot of knowledge here, I think, being conveyed. 00:39:08.240 |
And you can start the sentence with, "The meaning of life is." 00:39:11.240 |
"The meaning of life is literary recognition." True for us academics. 00:39:16.240 |
Or, "The meaning of life is the tradition of ancient human production." Also true. 00:39:25.240 |
You can also caption. This has become very popular recently, is caption generation. 00:39:31.240 |
The input is an image, and the output is a set of text that captures the content of the image. 00:39:38.240 |
You find the different objects in the image. That's a perception problem. 00:39:43.240 |
And once you find the different objects, you stitch them together in a sentence that makes sense. 00:39:47.240 |
You generate a bunch of sentences and classify which sentence is the most likely to fit this image. 00:39:54.240 |
And you can, so certainly in the, I try to avoid mentioning driving too much, 00:40:01.240 |
because it is my field, it is what I'm excited about. 00:40:04.240 |
But then the moment I start talking about driving, it'll all be about driving. 00:40:10.240 |
So, but I should mention, of course, that deep learning is critical to driving applications 00:40:15.240 |
for both the perception and what is really exciting to us now is the end-to-end, 00:40:20.240 |
the end-to-end approach. So whenever you say end-to-end in any application, 00:40:25.240 |
what that means is you start from the very raw inputs that the system gets, 00:40:31.240 |
and you produce the very final output that's expected of the system. 00:40:36.240 |
So in the self-driving car case, as opposed to breaking a car down 00:40:40.240 |
into its individual components of perception, localization, mapping, control, planning, 00:40:47.240 |
it's just taking the whole stack and just ignoring all the super complex problems in the middle 00:40:53.240 |
and just taking the external scene as input, and as output, produce steering and acceleration 00:40:59.240 |
and braking commands. And so in this way, taking this input as the image of the external world, 00:41:05.240 |
in this case in a Tesla, we can generate steering commands for the car. 00:41:10.240 |
Again, input, a bunch of numbers that's just images. 00:41:14.240 |
Output, a single number that gives you the steering of the car. 00:41:23.240 |
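The "bunch of numbers in, a single number out" idea can be written down as a minimal sketch. Everything here is an assumption for illustration: the function name `steering_from_pixels`, the four-pixel "image", and the hand-picked weights are hypothetical, and a single weighted sum stands in for the deep network that a real end-to-end system would learn from driving data.

```python
from typing import List


def steering_from_pixels(pixels: List[float], weights: List[float], bias: float) -> float:
    """Map a flattened image straight to one steering value in [-1, 1]."""
    if len(pixels) != len(weights):
        raise ValueError("pixels and weights must have the same length")
    raw = sum(p * w for p, w in zip(pixels, weights)) + bias
    # Clamp so that -1 means full left and +1 means full right.
    return max(-1.0, min(1.0, raw))


# A 2x2 "image" flattened to four brightness values in [0, 1].
pixels = [0.0, 1.0, 0.0, 1.0]
# Hypothetical weights; a real system would learn these rather than set them by hand.
weights = [-0.5, 0.25, -0.5, 0.25]
print(steering_from_pixels(pixels, weights, bias=0.0))  # 0.5
```

Nothing in between the pixels and the steering value is named perception, localization, or planning, which is the point of the end-to-end framing.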
Okay, so let's step back for a second and think about what can't we do with machine learning. 00:41:31.240 |
We talked about you can map numbers to numbers. Let's think about what we can't do. 00:41:36.240 |
At the core of artificial intelligence, in terms of making an impact on this world, is robotics. 00:41:42.240 |
So what can't we solve in robotics and artificial intelligence with a machine learning approach? 00:41:48.240 |
And let's break down what artificial intelligence means. 00:41:51.240 |
Here's a stack. Starting at the very top is the environment, the world that you operate in. 00:41:56.240 |
There's sensors that sense that world. There's feature extraction and learning from that data. 00:42:02.240 |
And there's some reasoning, planning, and effectors are the ways you manipulate the world. 00:42:12.240 |
So we've had a lot of success, as Jonathan talked about, in the history of AI with formal tasks, 00:42:18.240 |
playing games, solving puzzles. Recently we're having a lot of breakthroughs with medical diagnosis. 00:42:25.240 |
We're still struggling with, but are very excited about, the more mundane tasks in the robotics space: walking, 00:42:39.240 |
of basic perception, of natural language written and spoken. 00:42:44.240 |
And then there are the human tasks, which are perhaps completely out of reach of this pipeline at the moment: 00:42:52.240 |
cognition, imagination, subjective experience. 00:43:00.240 |
So high level reasoning, not just common sense, but high level human level reasoning. 00:43:08.240 |
So let's fly through this pipeline. There's sensors, cameras, LIDAR, audio. 00:43:14.240 |
There's communication that flies through the air, wireless or wired. 00:43:21.240 |
IMU, measuring the movement of things. So that's the way, think about it, 00:43:26.240 |
that's the way as human beings and as any kind of system that you design, you measure the world. 00:43:31.240 |
You don't just get an API to the world. You need to somehow measure aspects of this world. 00:43:40.240 |
So that's how you get the data. So that's how you convert the world into data you can play with. 00:43:45.240 |
And once you have the data, this is the representation side. 00:43:49.240 |
You have to convert that raw data of raw pixels, raw audio, raw LIDAR data. 00:43:54.240 |
You have to convert that into data that's useful for the intelligence system, 00:43:59.240 |
for the learning system to use to discriminate between one thing and another. 00:44:07.240 |
For vision, that's finding edges, corners, object parts, and entire objects. 00:44:13.240 |
And there's the machine learning that I've talked about. 00:44:17.240 |
There's different kinds of mapping of the representation that you've learned to actual outputs. 00:44:24.240 |
There is, once you have this, so you have this idea of, and this goes to maybe a little bit of Simon's question, 00:44:31.240 |
is reasoning. This is something that's out of reach of machine learning at the moment. 00:44:40.240 |
Then we can build a world class machine learning system for taking an image and classifying that it's a duck. 00:45:01.240 |
So we could take, this is well studied, exceptionally well studied problem. 00:45:07.240 |
We could take audio sample of a duck and tell that it's a duck. 00:45:15.240 |
It's incredible how much research there is in bird species classification. 00:45:18.240 |
And you can look at video and do action recognition and tell that it's swimming. 00:45:28.240 |
That if it looks like a duck, it swims like a duck, and quacks like a duck, it's very likely to be a duck. 00:45:37.240 |
This is the task that I personally am obsessed with and that I hope that machine learning can close. 00:45:45.240 |
And then there is the planning action and the effectors. 00:45:56.240 |
So this is another place where machine learning has not made many strides. 00:46:05.240 |
There's mechanical issues here that are incredibly difficult. 00:46:08.240 |
There's degrees of freedom with all the actuators involved, with all the, just the ability to localize every part of yourself in this dynamic space. 00:46:23.240 |
Where things are constantly changing, where there's degrees of uncertainty, where there's noise. 00:46:27.240 |
Just that basic problem is exceptionally difficult. 00:46:39.240 |
We talked about what machine learning can do with the cats and the duck. 00:46:45.240 |
Given representation, it could predict what's in the image. 00:46:48.240 |
But one of the open questions is, and deep learning has been able to do the feature extraction, the representation learning. 00:46:55.240 |
This is the big breakthrough that everybody's excited about. 00:47:07.240 |
And as human beings do, can it close the loop entirely from sensors to effectors? 00:47:14.240 |
So learn not only the brain, but the way you sense the world and the way you affect the world. 00:47:28.240 |
The thing about that Pong game, so essentially, does the neural network get punished when it detects the ball because it goes off the map? 00:47:48.240 |
It doesn't get punished when it doesn't detect the ball. 00:47:53.240 |
It gets punished only at the very end of the game for losing the game and gets rewarded for winning the game. 00:47:59.240 |
So it knows nothing about that ball and it learns about that ball. 00:48:03.240 |
That's something you need to really sit and think about. 00:48:08.240 |
Because as human beings, imagine if you're playing with a physical ball. 00:48:16.240 |
You get hurt by it, you squeeze it, you throw it, you feel the dynamics of it, the physics of it. 00:48:30.240 |
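The point above about Pong, that the only signal is a punishment or reward at the very end of the game, can be illustrated with a small sketch. The lecture does not say which algorithm was used, so this is an assumption: discounting a delayed reward back over earlier steps is a standard reinforcement learning technique, and the function name `discounted_returns` and the value of `gamma` are chosen here for illustration.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Spread a delayed reward back over the steps that led up to it."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step sees the discounted reward that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Four moves with no feedback at all, then -1 for losing the game at the end.
episode = [0.0, 0.0, 0.0, -1.0]
print(discounted_returns(episode, gamma=0.5))  # [-0.125, -0.25, -0.5, -1.0]
```

Moves closer to the loss receive more of the blame, yet none of these numbers says anything about a ball; whatever the network knows about the ball it has to work out from this signal and the pixels.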
We take it for granted, and maybe this is what I can end on. 00:48:39.240 |
We take the simplicity of this task for granted. 00:48:42.240 |
Because we've had eyes, we, broadly speaking, as living species on planet Earth, 00:48:51.240 |
these eyes have been evolving for 540 million years. 00:48:59.240 |
We've been walking for close to that, bipedal mammals. 00:49:04.240 |
We have been thinking only very recently, so 100,000 years versus 100 million years. 00:49:12.240 |
And that's why some of these problems that we're trying to solve, 00:49:18.240 |
you can't take for granted how actually difficult they are. 00:49:21.240 |
So for example, this is Moravec's Paradox that Jonathan brought up: 00:49:27.240 |
The things we think are easy are actually really hard. 00:49:30.240 |
This is a state-of-the-art robot on the right playing soccer. 00:49:34.240 |
And that was a state-of-the-art human on the left playing soccer. 00:49:56.240 |
The question was, you know, there's a fundamental difference between the way we train neural networks 00:50:00.240 |
and the way we've trained biological neural networks through evolution 00:50:03.240 |
by discarding through natural selection a bunch of the neural networks that didn't work so well. 00:50:12.240 |
So first of all, the process of evolution is, I think, not well understood. 00:50:24.240 |
The role of evolution in the evolution of our cognition, of our intelligence. 00:50:32.240 |
I don't know if that's, so this is an open question. 00:50:37.240 |
Artificial neural networks are fixed for the most part in size. 00:50:43.240 |
It's like a single human being that gets to learn. 00:50:46.240 |
We don't have mechanisms of modifying or evolving those neural networks yet. 00:50:55.240 |
Although you could think of researchers as doing exactly that. 00:50:59.240 |
You have grad students working on different neural networks, 00:51:02.240 |
and the ones that don't do a good job don't get promoted and get a good job. 00:51:06.240 |
There is a natural selection there, but other than that, it's an open question. 00:51:13.240 |
So Lex is going to come back. He's not available next week, but he's going to come back the week after. 00:51:20.240 |
Are there any last final takeaways you want to emphasize? 00:51:25.240 |
Stay tuned and keep your head up because the future, I believe, is really promising. 00:51:32.240 |
And the slides will be made available for sure. 00:51:39.240 |
I think a lot of the explorations of what it means to build an intelligent machine have been in sci-fi movies. 00:51:45.240 |
We're now beginning to actually make it a reality. 00:51:47.240 |
This is Space Odyssey to keep with that theme in the previous lecture that we had. 00:51:53.240 |
This is as opposed to the dreamlike monolith view when the astronaut is gazing out into the open sky at the stars. 00:52:04.240 |
We're going to look at the practice of AI today and how we go. 00:52:08.240 |
If you're familiar with the movie, when this new technology appeared before our eyes and we're full of excitement, 00:52:15.240 |
how we transfer that into actual practical impact on our lives. 00:52:23.240 |
To quickly review what we talked about last time, I presented the technology and asked the question of whether this technology 00:52:31.240 |
merely serves a special purpose to answer specific tasks that can be formalized 00:52:36.240 |
or whether it can be through the process of transferring the knowledge learned on one domain be generalizable 00:52:45.240 |
to where an intelligent system that's trained in a small domain can be used to achieve general intelligent tasks 00:52:55.240 |
This is kind of the stack of artificial intelligence going from all the way up to the top of the environment, the world. 00:53:01.240 |
The sensors, the data, the intelligent system, the way it perceives this world. 00:53:07.240 |
Then once you have this, you convert the world into some numbers, you're able to extract some representation of that world 00:53:13.240 |
and this is where machine learning starts to come into play. 00:53:16.240 |
And then there's the part where I will raise it again today is can machine learning be doing the following steps too 00:53:23.240 |
that we can do very well as human beings is the reasoning step. 00:53:26.240 |
You know, you can tell the difference between a cat and a dog, but can you now start to reason about what it means to be alive, 00:53:33.240 |
what it means to be a cat, a living creature, what it means to be this kind of physical object or this kind of physical object 00:53:39.240 |
and take what's called common sense, things we take for granted, start to construct models of the world through reasoning. 00:53:47.240 |
Descartes, "I think, therefore I am." We want our neural networks to come up with that on their own. 00:53:53.240 |
And once you do that, action. You'll go right back into the world and you start acting in that world. 00:54:00.240 |
So the question is can machine learning, can this be learned from data or do experts need to encode the knowledge of reasoning, 00:54:07.240 |
the knowledge of actions, the set of actions? That's kind of the open question I raised. 00:54:14.240 |
And so as we start to think about how artificial intelligence, especially machine learning, 00:54:22.240 |
as it realizes itself through robotics, gets to impact the world, we start thinking about what are the easy problems 00:54:28.240 |
And it seems to us that vision and movement, walking, is easy because we've been doing it for millions of years, 00:54:38.240 |
hundreds of millions of years, and thinking is hard, reasoning is hard. 00:54:43.240 |
I propose to you that it's perhaps because we've only been doing it for a short time and so we think we're quite special 00:54:51.240 |
So we have to kind of question what is easy and what is hard. 00:54:55.240 |
Because when we start to develop some of these systems, you start to realize that all of these problems are equally hard. 00:55:03.240 |
So the problem of walking that we take for granted, the actuation and the ability to recognize where you are 00:55:11.240 |
in the physical space, to sense the world around you, to deal with the uncertainty of the perception problem. 00:55:20.240 |
And then, so all of these robots, by the way, this is for the most recent DARPA challenge, which MIT was also part of. 00:55:32.240 |
They don't have any, they only have sparse communication with human beings on the periphery. 00:55:39.240 |
So most of the stuff they have to do autonomously, like get inside a car. 00:55:46.240 |
They have to get in the car and the hardest task, they have to get out of the car. 00:55:52.240 |
That's walking. So this kind of raises to you the very real aspect here. 00:55:58.240 |
You want to build applications that actually work in the real world. 00:56:01.240 |
And that's the first challenge and opportunity here. 00:56:05.240 |
Many of the technologies we talked about currently crumble under the reality of our world. 00:56:15.240 |
When we transfer them from a small data set in the lab to the real world. 00:56:20.240 |
Computer vision is perhaps one of the best illustrations of this. 00:56:24.240 |
Computer vision is the task, as we talked about, of interpreting images. 00:56:29.240 |
And so when you, there's been a lot of great accomplishments on interpreting images, cats versus dogs. 00:56:35.240 |
Now, when you try to create a system like the Tesla vehicle that I've often, that we work with, 00:56:45.240 |
and I always talk about is it's a vision based robot, right? 00:56:51.240 |
It has radar for basic obstacle avoidance, but most of the understanding of the world comes from a single monocular camera. 00:56:57.240 |
Now they've expanded the number of cameras, but for the most part, 00:57:00.240 |
there's been 100,000 vehicles driving on the roads today with a single, essentially a single webcam. 00:57:07.240 |
So when you start to do that, you have to perform all of this extraction of texture, color, optical flow. 00:57:14.240 |
So the movement through time, temporal dynamics of the images, you have to construct these patterns, 00:57:20.240 |
construct the understanding of objects and entities and how they interact. 00:57:24.240 |
And from that, you have to act in this world. And that's all based on this computer vision system. 00:57:29.240 |
So it's no longer cats versus dogs. It's detection of pedestrians, where the wrong classification, 00:57:40.240 |
the wrong detection, is the difference between life and death. 00:57:45.240 |
So let's look at cats where things are a little more comfortable. 00:57:49.240 |
Computer vision, and I would like to illustrate to you why this is such a hard task. 00:57:55.240 |
We talked about, we've been doing it for 500 million years, so we think it's easy. 00:58:02.240 |
So all you're getting with your human eyes is you're getting essentially pixels in. 00:58:06.240 |
There's light coming into your eyes and all you're getting is the reflection from the different surfaces in here of light. 00:58:13.240 |
And there's perception, there's sensors inside your eyes converting that into numbers. 00:58:21.240 |
Numbers, in the case of what we use with computers, RGB images, 00:58:26.240 |
where the individual pixels are numbers from 0 to 255, so 256 possible numbers, and there's just a bunch of them. 00:58:34.240 |
And that's all we get. We get a collection of numbers where they're spatially connected. 00:58:38.240 |
The ones that are close together are part of the same object, so cat pixels are all connected together. 00:58:44.240 |
That's the only thing we have to help us, but the rest of it is just numbers, intensity numbers. 00:58:48.240 |
And we have to use those numbers to classify what's in the image. 00:58:53.240 |
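To make "a collection of numbers" concrete, here is a minimal sketch of how an RGB image looks to a program. The 2x2 image and the helper name `flatten` are made up for illustration; real images have the same structure with millions of values.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each from 0 to 255

# A tiny 2x2 image: red, green, blue, and white pixels.
image: List[List[Pixel]] = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def flatten(img: List[List[Pixel]]) -> List[int]:
    """Turn the grid of pixels into the flat list of numbers a model receives."""
    return [channel for row in img for pixel in row for channel in pixel]


numbers = flatten(image)
print(len(numbers))  # 12 (2 rows x 2 columns x 3 channels)
print(numbers[:6])   # [255, 0, 0, 0, 255, 0]
```

The only structure left is position: neighboring entries in the grid came from neighboring points in the scene, and everything else has to be inferred from the intensities.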
And if you really think about it, this is a really difficult task. 00:58:59.240 |
How the heck are you supposed to form a model of the world with which you can detect pedestrians 00:59:10.240 |
Because these pedestrians, or these cars, the cyclists in the car context, 00:59:15.240 |
or any kind of applications you're looking at, 00:59:18.240 |
even if your job is in the factory floor to detect the defective gummy bears 00:59:23.240 |
that are flying past at like 100 miles an hour, 00:59:26.240 |
your task is you don't want that bad gummy bear to get by, 00:59:29.240 |
that your product and the brand will be damaged. 00:59:33.240 |
However serious or not serious your application is, 00:59:36.240 |
you have to have a computer vision system that deals with all of these aspects. 00:59:44.240 |
Viewpoint variation, scale variation, no matter the size of the object, it's still the same object. 00:59:50.240 |
No matter the viewpoint from which area you look at that object, it's still the same object. 00:59:56.240 |
The lighting that moves, we have lighting consistently here because we're indoors, 01:00:00.240 |
but when you're outdoors or you're moving, the scene is moving, 01:00:04.240 |
the lighting, the complexity of the lighting variations is incredible. 01:00:07.240 |
From the illumination to just the movement of the different objects in the scene. 01:00:14.240 |
Now that we've had these conversations, I think about this every time I drive. 01:00:18.240 |
I think about you and this point and how hard it is to see these things. 01:00:22.240 |
And particularly when I'm driving at night, and particularly when it's twilight and the light is changing, 01:00:26.240 |
I think almost every time I drive there's one or two things that I see where I'm drawing on 200 million years 01:00:37.240 |
It's a guy who's opened his car door and I can't see him, 01:00:40.240 |
but I can just see the light doesn't look quite right on that side of the road. 01:00:47.240 |
But it seems like an almost impossible problem for the machines to get right with sufficient accuracy. 01:00:54.240 |
I will argue that the pure perception task is too hard. 01:00:58.240 |
That you come to the table as human beings with all this huge amount of knowledge. 01:01:03.240 |
That you're not actually interpreting all the complex lighting variations that you're seeing. 01:01:10.240 |
You actually know enough about the world, enough about your commute home, 01:01:14.240 |
enough about the kinds of things you would see in this world, 01:01:18.240 |
about Boston, about the way pedestrians move, the certain light of day. 01:01:22.240 |
You bring all that to the table that makes the perception task doable. 01:01:26.240 |
And that's one of the big missing pieces in the technology. 01:01:29.240 |
As I'll talk about, that's the open problem of machine learning. 01:01:33.240 |
It's how to bring all that knowledge, first of all build that knowledge, 01:01:38.240 |
As opposed to starting from scratch every time. 01:01:45.240 |
Okay, so to me occlusion, for most of the computer vision community, 01:01:52.240 |
And it really highlights how far we are from being able to reason about this world. 01:02:00.240 |
Occlusions are when, what an occlusion is, is when the objects you're trying to detect, 01:02:06.240 |
something about, classify the object, detect the object, 01:02:09.240 |
the object is blocked partially by another object in front of it. 01:02:16.240 |
This is something you think is trivial perhaps, you don't even really think about it, 01:02:21.240 |
because we reason in a three-dimensional way. 01:02:23.240 |
But the occlusion aspect makes perception incredibly difficult. 01:02:33.240 |
So this image is converted into numbers, and we, for the task of detecting, 01:02:40.240 |
You have to be able to reason about this image with that object in the scene. 01:02:45.240 |
Most of us are able to very easily detect that there's a cat in this image. 01:02:50.240 |
We're able to detect that there's a cat in this image. 01:02:53.240 |
Now think about this, there's a single eye and there's an ear. 01:02:58.240 |
So you have to think about, what is it, part of our brain, 01:03:01.240 |
that allows us to understand, to suppose that with some high degree of accuracy 01:03:10.240 |
I mean the degree of occlusion here is immense. 01:03:19.240 |
Some of you will think this is in fact a monkey eating a banana, 01:03:23.240 |
but I would venture to say that most of us are able to tell it's nevertheless a cat. 01:03:33.240 |
And so let me give you another, this is kind of a paper that's often cited, 01:03:38.240 |
or a set of papers, that illustrate how difficult computer vision is, 01:03:45.240 |
how thin the line that we're walking with all of these impressive results 01:03:51.240 |
that we've been able to show recently in the machine learning community. 01:03:56.240 |
In this case, the "Deep Neural Networks Are Easily Fooled" paper, 01:04:02.240 |
the seminal paper at this point, shows that when you apply a network trained on ImageNet, 01:04:09.240 |
so basically on detecting cats versus dogs or different categories inside images, 01:04:14.240 |
you can find an arbitrary number of images that look like noise, up in the top row, 01:04:22.240 |
where the algorithm used to classify those images in ImageNet of cat versus dog, 01:04:30.240 |
is able to confidently say with 99.6% confidence or above, 01:04:34.240 |
that it's seeing a robin or a cheetah or an armadillo or a panda in that noise. 01:04:41.240 |
So it's confidently saying, given this noise, that that's obviously a robin. 01:04:46.240 |
So you have to realize that the kind of, this is patterns, 01:04:53.240 |
the kind of processes it's using to understand what's contained in the image 01:04:58.240 |
is purely a collection of patterns that it has been able to extract from other images 01:05:06.240 |
And that perhaps is very limiting to trying to create a system 01:05:14.240 |
This is a very clean illustration of that concept. 01:05:19.240 |
In the same, you can confidently predict in those images below, 01:05:23.240 |
where there are strong patterns, it's not even noise, 01:05:26.240 |
strong patterns that have nothing to do with the entities being detected. 01:05:29.240 |
Again, confidently, that same algorithm is able to see a penguin, a starfish, 01:05:38.240 |
And more serious for people designing robots like myself, 01:05:44.240 |
on the sensor side, you can flip that and say, 01:05:49.240 |
I can take an image and I can distort it with a very small amount of noise. 01:06:02.240 |
I can completely change the confident prediction about what's in that image. 01:06:08.240 |
so on the left, in the column on the left, and again here, 01:06:13.240 |
the same kind of neural network is able to predict accurately, 01:06:19.240 |
confidently, that there is a dog in that image. 01:06:22.240 |
But if we apply just a little bit of noise to that image, 01:06:25.240 |
to produce that image, imperceptible to our human eyes, 01:06:30.240 |
the same algorithm is saying that there is confidently an ostrich in that image. 01:06:38.240 |
how noise can have such a significant impact on the prediction of these algorithms. 01:06:46.240 |
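The effect described above, where a tiny change to every pixel flips a confident prediction, can be reproduced on a toy scale. This sketch is not the method from the papers mentioned in the lecture. It assumes a hand-built linear classifier with made-up weights, and the names `score`, `predict`, and `perturb` are hypothetical. The idea it shows is that many small nudges, each in the direction that hurts the score, add up to a large change.

```python
from typing import List


def score(x: List[float], w: List[float]) -> float:
    """Weighted sum of the inputs; positive means 'dog' in this toy setup."""
    return sum(xi * wi for xi, wi in zip(x, w))


def predict(x: List[float], w: List[float]) -> str:
    return "dog" if score(x, w) > 0 else "ostrich"


def perturb(x: List[float], w: List[float], epsilon: float) -> List[float]:
    """Move each input by at most epsilon in the direction that lowers the score."""
    def sign(v: float) -> int:
        return 1 if v > 0 else -1 if v < 0 else 0
    return [xi - epsilon * sign(wi) for xi, wi in zip(x, w)]


# 100 "pixels" with made-up weights; the original input scores about +1.0.
w = [1.0, -1.0] * 50
x = [0.51, 0.49] * 50

x_adv = perturb(x, w, epsilon=0.02)  # no pixel moves by more than 0.02

print(predict(x, w))      # dog
print(predict(x_adv, w))  # ostrich
```

With more inputs the per-pixel change needed to flip the answer gets smaller, which is one intuition for why image-sized inputs can be so fragile.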
out of all the things I'll say today and I'm aware of, 01:06:50.240 |
one of the biggest challenges of machine learning being applied in the real world is robustness. 01:06:59.240 |
How much noise can you add into the system before everything falls apart? 01:07:07.240 |
So say a car company has to produce a vehicle and it has sensors in that vehicle. 01:07:11.240 |
How do you know that those sensors will not start generating slight noise due to interference of various kinds? 01:07:18.240 |
And because of that noise, instead of seeing a pedestrian, 01:07:25.240 |
So of course, the most dangerous is when it will not see an object and collide with it, 01:07:32.240 |
There's also spoofing, which a lot of people, as always, with security, 01:07:38.240 |
And perhaps people here are really concerned about this issue. 01:07:43.240 |
but because you can apply noise and convince the system that you're seeing an ostrich when there's in fact no ostrich, 01:07:50.240 |
you can do the same thing in an attacking way. 01:07:55.240 |
So you can attack the sensors of a car and make it believe, like with LIDAR spoofing, 01:07:59.240 |
so spoof LIDAR or radar or ultrasonic sensors to believe that you're seeing pedestrians when they're not there, 01:08:06.240 |
and the opposite, to hide pedestrians, make pedestrians invisible to the sensor when they're in fact there. 01:08:14.240 |
So whenever you have intelligent systems operating in this world, 01:08:18.240 |
they become susceptible to the fact that everything, so much of the work is done in software and based on sensors. 01:08:26.240 |
So at any point in the chain, if there's a failure, you have to be able to detect that failure. 01:08:32.240 |
And right now we have no mechanisms for automatically detecting that failure. 01:08:35.240 |
So on the data side, one challenge that we're constantly dealing with is that 01:08:47.240 |
the machine learning algorithms that we're using need labeled data. 01:08:57.240 |
Labeled data, again, is when you have pairs of input data and the ground truth, 01:09:04.240 |
the true label, the annotation, the class or concept that that image belongs to. 01:09:12.240 |
And it doesn't have to be an image, it could be any source of data. 01:09:19.240 |
So it's costly, and every breakthrough we've had so far relies on that labeled data. 01:09:31.240 |
And because of its cost, we don't have much of it. 01:09:35.240 |
So all the problems that come from data can either be solved by having a lot more of this data, 01:09:40.240 |
which I believe, and most people believe, is too challenging. 01:09:44.240 |
It's too challenging to have human beings annotate huge amounts of data. 01:09:48.240 |
Or we have to develop algorithms that are able to do something with the unlabeled data. 01:09:54.240 |
It's the unsupervised, semi-supervised, sparsely supervised reinforcement learning. 01:10:00.240 |
As we talked about last time, I'll mention again here. 01:10:03.240 |
So one way you understand something about data when you don't have labels is you reason about it. 01:10:13.240 |
When you're a baby, your parents give you a few facts, 01:10:15.240 |
and you go into this world with those facts, and you grow your knowledge graph, 01:10:19.240 |
your knowledge base, your understanding of the world from those few facts. 01:10:22.240 |
We don't have a good method of doing that in an automated, unrestricted way. 01:10:28.240 |
Then there's the inefficiency of our learners: the machine learning algorithms I've talked about, 01:10:32.240 |
neural networks, need a lot of examples of every single concept that they're given. 01:10:38.240 |
Thousands, tens of thousands of cats are needed to understand the spatial patterns 01:10:44.240 |
at every level, the representation of a cat, the visual representation of a cat. 01:10:52.240 |
There are a few approaches, but nothing quite robust yet. 01:10:58.240 |
And we haven't come up with a way--this is also possible-- 01:11:04.240 |
to make annotation, this labeling process, somehow be very cheap. 01:11:11.240 |
So leveraging--this is something being called human computation. 01:11:15.240 |
That term has fallen out of favor a little bit. 01:11:18.240 |
One of my big passions is human computation: using something about our behavior, 01:11:23.240 |
something about what we do in this world online or in the real world, 01:11:33.240 |
So, for example, as you drive, which is what we do, everybody has to drive, 01:11:38.240 |
and we can collect data about you driving in order to train self-driving vehicles to drive. 01:11:48.240 |
So here are the annotated data sets we have, the supervised learning data sets. 01:11:54.240 |
There's many, but these are some of the more famous ones, 01:11:57.240 |
from the toy data sets of MNIST to the large, broad, arbitrary categories of image data sets, 01:12:08.240 |
And there are data sets in health care, in audio, in video, 01:12:16.240 |
but each one of them is usually on a scale of hundreds of thousands, millions, 01:12:24.240 |
which is what we need to create systems that operate in the real world. 01:12:28.240 |
And again, these are the kinds of machine learning algorithms we have. 01:12:35.240 |
The teacher on the left is the input that the system requires in order to be trained. 01:12:44.240 |
The supervised learning at the very top is where we have all of our successes, 01:12:48.240 |
and everything else is where the promise lies. 01:12:51.240 |
The semi-supervised, the reinforcement, or the fully unsupervised learning, 01:12:55.240 |
where the input from the human is very minimal. 01:13:01.240 |
so whenever you think about machine learning today, 01:13:04.240 |
whenever somebody talks about machine learning, 01:13:06.240 |
what they're talking about is systems that memorize, that memorize patterns. 01:13:11.240 |
And so this is one of the big criticisms of the current machine learning approaches: 01:13:18.240 |
they're only as good as the human annotated data that they're provided. 01:13:23.240 |
We don't have mechanisms for actually understanding. 01:13:29.240 |
In order to create an intelligent system, it shouldn't just memorize. 01:13:32.240 |
It should understand the representations inside that data in order to operate in that world. 01:13:43.240 |
And one of the challenges and opportunities for machine learning researchers today 01:13:47.240 |
is to extend machine learning from memorization to understanding. 01:13:58.240 |
If you get information from the perception systems that it looks like a duck, 01:14:02.240 |
from the audio processing that it quacks like a duck, 01:14:06.240 |
and then from video classification, the activity recognition that it swims like a duck, 01:14:11.240 |
the reasoning step is how to connect those facts to then say that it is in fact a duck. 01:14:19.240 |
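The duck example above can be sketched as a tiny rule that combines facts coming from separate subsystems. This is only an illustration under stated assumptions: the `Observation` type, the claim names, and the 0.8 confidence threshold are hypothetical, and a hand-written rule like this is exactly the part that current systems do not learn on their own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    source: str        # which subsystem produced the fact (hypothetical names)
    claim: str         # what it asserts about the object
    confidence: float  # how sure that subsystem is, from 0.0 to 1.0


# The three facts the reasoning step needs before concluding "duck".
REQUIRED_CLAIMS = {"looks_like_duck", "quacks_like_duck", "swims_like_duck"}


def is_duck(observations: list[Observation], threshold: float = 0.8) -> bool:
    """Connect separate perceptual facts into one conclusion."""
    confident = {o.claim for o in observations if o.confidence >= threshold}
    return REQUIRED_CLAIMS <= confident  # every required fact is confidently present


if __name__ == "__main__":
    facts = [
        Observation("vision", "looks_like_duck", 0.95),
        Observation("audio", "quacks_like_duck", 0.90),
        Observation("video", "swims_like_duck", 0.85),
    ]
    print(is_duck(facts))
```

The memorizing parts (vision, audio, video) each produce one fact; the reasoning part is the step that joins them, and here it is written by hand rather than learned.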
Okay, so that's on the algorithm side and the data side. 01:14:23.240 |
Now, computational power, computational hardware, 01:14:28.240 |
is one of the things at the core of the success of machine learning. 01:14:34.240 |
So our algorithms have been the same since the '60s, since the '80s, '90s, 01:14:47.240 |
Most of you know the way the CPU side of our computers works for a single CPU 01:14:52.240 |
is that it's, for the most part, executing a single action at a time in a sequence. 01:14:59.240 |
So it's sequential, very different from our brain, which is a massively parallelized system. 01:15:06.240 |
So because it's sequential, the clock speed matters, 01:15:09.240 |
because that's how fast, essentially, those instructions are able to be executed. 01:15:16.240 |
Physics is stopping us from continuing Moore's Law. 01:15:21.240 |
Intel, AMD are aggressively pushing this Moore's Law forward. 01:15:27.240 |
And there's some promise that it will actually continue for another 10 or 15 years. 01:15:36.240 |
Then there's another form of parallelism, massive parallelism: the GPU. 01:15:46.240 |
What is essential to the recent success of neural networks 01:15:49.240 |
is the ability to utilize these inherently parallel architectures of graphics processing units, GPUs. 01:16:01.240 |
GPUs are the reason NVIDIA stock is doing extremely well. 01:16:09.240 |
So it's parallelism of basic computational processes that makes machine learning work on a GPU. 01:16:17.240 |
One of the limitations of GPUs, one of the challenges in scaling them 01:16:24.240 |
and bringing them into real-world applications, is power consumption. 01:16:29.240 |
And so there are a lot of specialized chips, specialized just for neural network architectures, 01:16:36.240 |
coming out from Google, with their Tensor Processing Unit, from IBM, Intel, and so on. 01:16:44.240 |
So this is sort of the direction of trying to design an electronic brain so it has the efficiency. 01:16:49.240 |
Our human brain is exceptionally efficient at running the neural networks in our heads. 01:16:55.240 |
It is orders of magnitude more efficient than our computers are. 01:16:58.240 |
And this is trying to design systems that are able to go towards that efficiency. 01:17:08.240 |
One concern, of course, as I'm sure we'll talk about throughout this class, is the thing in our smartphones: battery usage. 01:17:19.240 |
I think it could be attributed to the big breakthroughs in machine learning recently. 01:17:28.240 |
In the last decade, compute is important, algorithm development is important. 01:17:42.240 |
And I will show in several ways why global is essential here. 01:17:46.240 |
It's tens of thousands, hundreds of thousands, millions of programmers and mechanical engineers building robots, 01:17:56.240 |
building intelligent systems, building machine learning algorithms. 01:17:59.240 |
The exciting nature of the growth of the community perhaps is the key for the future to unlocking the power of machine learning. 01:18:13.240 |
And this is showing on the y-axis at the bottom is 2008 when GitHub first opened. 01:18:20.240 |
Quick, near exponential growth of the number of users participating and the number of repositories. 01:18:25.240 |
So these are standalone, unique projects that are being hosted on GitHub. 01:18:30.240 |
So this is one example I'll show you, this competition that we've recently been running. 01:18:34.240 |
And then I'll challenge people here to participate in this competition, if you dare. 01:18:39.240 |
So this is a chance for you to build a neural network in your browser. 01:18:46.240 |
So you can do this on your phone later tonight, of course. 01:18:51.240 |
On your phone, you can specify various parameters of the neural network, specify different numbers of layers and the depth of the network, 01:18:58.240 |
the number of neurons in the network, the type of layers. 01:19:02.240 |
It's super easy in terms of just tweaking little things. 01:19:07.240 |
And remember, machine learning to a large extent is an art at this point. 01:19:12.240 |
It's perhaps more an art than a well-understood, theoretically bounded science, which is one of the challenges. 01:19:27.240 |
Americans spend eight billion hours stuck in traffic every year. 01:19:33.240 |
And so you have a neural network that drives that little car with an MIT logo, the red one, on this highway and tries to weave in and out of traffic to get to its destination. 01:19:43.240 |
It's trying to achieve a speed of 80 miles an hour, which is the speed limit, which is the physical speed limit of the car. 01:19:50.240 |
Of course, the actual speed limit of the road is 65 miles an hour. 01:19:55.240 |
We just want to get to work as quickly as possible or home. 01:19:57.240 |
So I want to explain the basic structure of this game a little bit, and then tell you how incredibly popular it's gotten and how incredibly powerful the networks are that people from all over the world have built. The community that's built around this over a single month is incredible. 01:20:20.240 |
And this happens for thousands of projects out there. 01:20:27.240 |
OK, so you may have seen this. This is kind of ethics. 01:20:30.240 |
Most engineers don't like this. I personally love the philosophy. 01:20:35.240 |
But this kind of construction of ethics that's often presented here is one that is not usually a concern in engineering. 01:20:44.240 |
You know, when you have a car and you have a bunch of pedestrians, do you hit the larger group of pedestrians or the smaller group of pedestrians? 01:20:51.240 |
Do you avoid the group of pedestrians, but put yourself into danger? 01:20:56.240 |
These are the kinds of ethical questions asked of an intelligent system. 01:21:01.240 |
It's a problem that we can debate, and there's really no good answer, quite honestly. 01:21:05.240 |
But it's a problem that both humans and machines struggle with. 01:21:09.240 |
And so it's not interesting on the engineering side. 01:21:11.240 |
We're interested in problems that we can solve on the engineering side. 01:21:14.240 |
So the kind of problem that I'm obsessed with and very interested in is the real world problem of controlling a vehicle through this space. 01:21:25.240 |
So this is a Manhattan, New York intersection, right? 01:21:30.240 |
This is pedestrians walking perfectly legally. 01:21:33.240 |
I think they have a green light. Of course, there's a lot of jaywalking, too. 01:21:38.240 |
Well, this car just like it's not part of the point. 01:21:43.240 |
And so there's another car that starts making a left turn in a little bit. 01:21:49.240 |
So, yeah. And then there's another car after that, too, that just illustrates when you design an algorithm that's supposed to move through the space. 01:21:57.240 |
Like watch this car. The aggression it shows. 01:22:00.240 |
Now, this isn't a true example for those that try to build robots. 01:22:03.240 |
The real question is how you design a system that's able to do this. 01:22:10.240 |
So you have to think about, you have to define, reward functions, objective functions, utility functions under which it performs the planning. 01:22:18.240 |
So a car like that has several thousand candidate trajectories it can take through that intersection. 01:22:26.240 |
It can take a trajectory where it speeds up to 60 miles an hour, doesn't stop, and just swerves and hits everything. 01:22:33.240 |
Then there is the approach which most companies take, which the Google self-driving car and every company that is concerned about PR takes: whenever there's any kind of obstacle, any kind of risk at all that you could maybe even touch an obstacle, 01:22:49.240 |
then you're not going to take that trajectory. So what that means is you're going to navigate through this intersection at 10 miles an hour and let people abuse you by walking in front of you because they know you're going to stop. 01:22:59.240 |
And so in the middle there are hundreds, thousands of trajectories that are ethically questionable in the sense that you're putting other human beings at risk in order to safely and successfully navigate through an intersection. 01:23:12.240 |
And the design of those objective functions is the kind of question you have to ask for intelligent systems, for cars. 01:23:21.240 |
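The trade-off described above, between the reckless trajectory, the overly cautious one, and everything in the middle, can be sketched as a single objective function. This is a minimal illustration, not how any real planner works: the `Trajectory` fields, the three candidates, and all the numbers are made up, and the only point is that the choice of `risk_weight` decides which trajectory wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    name: str
    travel_time_s: float   # time to clear the intersection (illustrative)
    collision_risk: float  # estimated chance of touching an obstacle (illustrative)


def cost(t: Trajectory, risk_weight: float) -> float:
    """Objective function: lower is better. `risk_weight` trades time against risk."""
    return t.travel_time_s + risk_weight * t.collision_risk


def best_trajectory(candidates: list[Trajectory], risk_weight: float) -> Trajectory:
    """Pick the candidate that minimizes the objective."""
    return min(candidates, key=lambda t: cost(t, risk_weight))


# Three hypothetical candidates standing in for thousands of real ones.
CANDIDATES = [
    Trajectory("reckless", travel_time_s=5.0, collision_risk=0.9),
    Trajectory("gentle_nudge", travel_time_s=15.0, collision_risk=0.01),
    Trajectory("ultra_cautious", travel_time_s=60.0, collision_risk=0.0001),
]

if __name__ == "__main__":
    for w in (1.0, 1_000.0, 1_000_000.0):
        print(f"risk_weight={w:>11,.0f} -> {best_trajectory(CANDIDATES, w).name}")
```

A tiny weight on risk picks the reckless trajectory, an enormous one picks the car that crawls and gets abused, and something in between picks the gentle nudge. Choosing that weight is the design question.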
It's not a grandma and a few children, where you have to choose who gets to die. Those are very, very difficult problems, of course. 01:23:29.240 |
But the problem I'm very interested in, on the streets of Boston, the streets of New York, is how to gently nudge yourself through a crowd of pedestrians in the way we all actually do when we drive in New York in order to be able to safely navigate these environments. 01:23:47.240 |
And these questions come up in health care. These questions come up in factories, in robots, in armed and humanoid robots that operate with other human beings. 01:24:01.240 |
Another sort of fun illustration that folks at OpenAI use often to illustrate, well let me just pause for a second, the gamified version of this. 01:24:10.240 |
There's a game called CoastRunners and you're racing against other boats along this track and your job is, there's your score here at the bottom left, number of laps, your time, 01:24:21.240 |
and you're trying to get to the destination as quickly as possible while also collecting funky little things like these green little things along the way. 01:24:33.240 |
Okay, so what they've done is build an intelligent system, the general-purpose kind that we talked about last time, that learns how to navigate successfully through the space. 01:24:45.240 |
So you're trying to maximize the reward. And what this boat learns to do is instead of finishing the race, it learns to find a loop where it can keep going around and around, collecting those green dots, 01:25:02.240 |
and it learns the fact that they regenerate with time. 01:25:06.240 |
So it learns to maximize the score by going around and around. 01:25:11.240 |
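The boat's behavior above comes down to simple arithmetic on the reward. The sketch below is a toy model, not the actual game or the OpenAI agent: the function names and every number (finish bonus, loop length, points per target, horizon) are assumptions picked only to show how a loop with regenerating targets can out-score finishing the race.

```python
def finish_race_reward(steps_to_finish: int, finish_bonus: int, horizon: int) -> int:
    """Score for heading straight to the finish line; the episode ends there."""
    return finish_bonus if steps_to_finish <= horizon else 0


def loop_reward(loop_length: int, targets_per_loop: int,
                points_per_target: int, horizon: int) -> int:
    """Score for circling a loop whose targets regenerate before each pass."""
    full_loops = horizon // loop_length
    return full_loops * targets_per_loop * points_per_target


def preferred_policy(finish: int, loop: int) -> str:
    """A pure score maximizer just takes whichever number is larger."""
    return "loop_forever" if loop > finish else "finish_race"


if __name__ == "__main__":
    horizon = 1000  # total time steps the agent gets (illustrative)
    finish = finish_race_reward(steps_to_finish=200, finish_bonus=500, horizon=horizon)
    loop = loop_reward(loop_length=20, targets_per_loop=3,
                       points_per_target=10, horizon=horizon)
    print(f"finish={finish} loop={loop} -> {preferred_policy(finish, loop)}")
```

Under these made-up numbers the loop earns three times what finishing does, so an agent that only sees the score has no reason to finish. The reward says "collect points"; the designer meant "win the race".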
Now these are the kinds of things, this is the big challenge of reward functions, of designing systems, of designing what you want your system to achieve. 01:25:20.240 |
Not only are the ethical questions difficult, but so is just avoiding the pitfalls of local optima, 01:25:29.240 |
of latching onto something really good that happens in the short term, the greedy thing. It's like those psychology experiments where the kid eats the marshmallow 01:25:38.240 |
and can't delay gratification. 01:25:42.240 |
This idea of delayed gratification, in the case of designing intelligent systems, is a huge, serious problem. 01:25:52.240 |
So, we flew through a few concepts here. Are there any questions about some of the compute and the algorithm side we talked about today? 01:26:05.240 |
So the question was: you highlighted some of the limitations of computer vision algorithms, machine learning algorithms, 01:26:13.240 |
but you haven't highlighted some of the limitations of human beings. 01:26:16.240 |
And if you put those in a column and you compare those, are machines doing better overall? 01:26:22.240 |
Or is there any kind of way to compare those? 01:26:24.240 |
I mean, there is actually interesting work on ImageNet. So ImageNet is this categorization task where you have to classify images. 01:26:32.240 |
And you can ask the question, when I present you images of cats and dogs, when are machines better than humans and when are they not? 01:26:39.240 |
So you can compare when machines do better, what are the fail points, and what are the fail points for humans. 01:26:44.240 |
And there's a lot of interesting visual perception questions there. 01:26:47.240 |
But I think overall, it's certainly true that machines fail differently than human beings. 01:26:52.240 |
But in order to make an artificial intelligence system that's usable, that could make you a lot of money, 01:27:01.240 |
and that people would want to use, it has to be better for that particular task in every single way. 01:27:07.240 |
In order for you to want to use the system, it has to be superior to human performance, and usually far superior to human performance. 01:27:17.240 |
So on the philosophical level, it's an interesting thing to compare what we are good at and what we are not. 01:27:23.240 |
But if you're using Amazon Echo, your voice recognition, or any kind of natural language system, chat bots, or a car, 01:27:33.240 |
you're not going to say, well, this car is not so good with pedestrians, but I appreciate the fact that it can stay in the lane. 01:27:39.240 |
Fortunately, you have a very high standard for every single thing that you're good at, and it has to be superior to that. 01:27:50.240 |
I'm more of the nerd that makes the technology happen. 01:27:54.240 |
But certainly, on the self-driving car side, policy is probably the biggest challenge. 01:28:00.240 |
And I don't think there are good answers there. 01:28:04.240 |
Some of those ethical questions that come up, it feels like, so we work a lot with Tesla. 01:28:09.240 |
So I'm driving a Tesla around every day, and we're playing around with it, and studying human behavior inside Teslas. 01:28:16.240 |
And it seems like there's so much hunger amongst the media to jump on something. 01:28:21.240 |
And it feels like a very shaky PR terrain, a very shaky policy terrain, that we're all walking on. 01:28:27.240 |
Because we have no idea how we coexist with intelligent systems. 01:28:32.240 |
And then, of course, government is nervous, because how do we regulate this shaky terrain? 01:28:43.240 |
That's a perfect transition point, if that's okay. 01:28:48.240 |
Thanks a lot, Lex, for another great session.