Deep Learning Basics: Introduction and Overview
Chapters
0:00 Introduction
0:53 Deep learning in one slide
4:55 History of ideas and tools
9:43 Simple example in TensorFlow
11:36 TensorFlow in one slide
13:32 Deep learning is representation learning
16:02 Why deep learning (and why not)
22:00 Challenges for supervised learning
38:27 Key low-level concepts
46:15 Higher-level methods
66:00 Toward artificial general intelligence
00:00:06.760 |
This is 6.S094, Deep Learning for Self-Driving Cars. 00:00:10.920 |
It is part of a series of courses on deep learning 00:00:19.880 |
The website that you can get all the content, 00:00:26.400 |
The videos and slides will be made available there, 00:00:39.400 |
And you can always contact us with questions, 00:00:41.520 |
concerns, comments at hcai@mit.edu. 00:00:46.520 |
So let's start through the basics, the fundamentals. 00:00:52.480 |
To summarize in one slide, what is deep learning? 00:00:58.320 |
It is a way to extract useful patterns from data 00:01:03.960 |
with as little human effort involved as possible, 00:01:13.000 |
The fundamental aspect that we'll talk about a lot 00:01:18.940 |
The practical nature that we'll provide through the code 00:01:30.480 |
to do some of the most powerful things in deep learning 00:01:45.720 |
is asking good questions and getting good data. 00:01:54.320 |
and a lot of the exciting aspects of what is published 00:01:58.120 |
in the prestigious conferences, on arXiv, 00:02:12.580 |
That requires asking the right questions of that data, 00:02:21.520 |
that can reveal the answers to the questions you ask. 00:02:24.660 |
So why has this breakthrough over the past decade 00:02:53.440 |
all kinds of problems have now a digital form 00:02:59.800 |
Hardware, compute, both the Moore's Law of CPU 00:03:13.120 |
effective, large-scale execution of these algorithms. 00:03:18.660 |
Community, people here, people all over the world 00:03:22.340 |
are able to work together, to talk to each other, 00:03:25.160 |
to feed the fire of excitement behind machine learning. 00:03:45.620 |
to reach a solution in less and less and less time. 00:03:53.000 |
empower people to solve problems in less and less time 00:03:59.020 |
where the idea and the data become the central point, 00:04:02.440 |
not the effort that takes you from the idea to the solution. 00:04:13.840 |
of scene understanding, image classification to speech, 00:04:17.320 |
text, natural language processing, transcription, 00:04:20.520 |
translation in medical applications and medical diagnosis. 00:04:25.040 |
And cars, being able to solve many aspects of perception 00:04:29.720 |
in autonomous vehicles with drivable area lane detection, 00:04:36.040 |
the ones on your phone and beyond, the ones in your home. 00:04:40.800 |
Ads, recommender systems, from Netflix to Search 00:05:10.960 |
"AI began with the ancient wish to forge the gods." 00:05:15.000 |
Throughout our history, throughout our civilization, 00:05:18.560 |
human civilization, we've dreamed about creating echoes 00:05:22.160 |
of whatever is in this mind of ours in the machine 00:05:35.400 |
this vision, this dream of understanding intelligence 00:05:38.920 |
and creating intelligence has captivated all of us. 00:05:45.400 |
because there's aspects of it, the learning aspects 00:05:48.920 |
that captivate our imagination about what is possible, 00:05:56.660 |
learning to learn and beyond, how far that can take us. 00:06:03.000 |
And here visualized is just 3% of the neurons 00:06:06.280 |
and 1/1,000,000 of the synapses in our own brain. 00:06:15.000 |
small shadows of it in our artificial neural networks 00:06:44.040 |
and the implementation of those neural networks 00:06:53.160 |
recurrent neural networks in the '70s and '80s 00:07:05.360 |
and the rebranding and the rebirth of neural networks 00:07:17.040 |
on which the possibilities of what deep learning 00:07:21.600 |
can bring to the world has been first illustrated 00:07:37.700 |
improving the performance of neural networks. 00:07:50.140 |
the ability to, with very little supervision, 00:08:51.680 |
2018 was the year of natural language processing. 00:08:57.920 |
Google's BERT and others that we'll talk about, 00:09:02.620 |
breakthroughs on ability to understand language, 00:09:09.340 |
including generation, that's built all around that. 00:09:26.860 |
These really solidified, exciting, powerful ecosystems 00:09:49.540 |
Everything should be made as simple as possible. 00:09:51.940 |
So let's start simple, with a little piece of code, 00:10:08.820 |
At the very basic level, with just a few lines of code, 00:10:20.580 |
The classic, that I will always love, MNIST dataset, 00:10:26.860 |
to a neural network, a machine learning system, 00:10:31.020 |
and the output is the number that's in that digit. 00:10:45.940 |
Third step, like Lego bricks, stack on top of each other, 00:10:53.500 |
with a hidden layer, an input layer, an output layer. 00:11:02.680 |
Evaluate the model in step five, on the testing dataset, 00:11:15.300 |
And much of this code, obviously, much more complicated, 00:11:19.620 |
or much more elaborate, and rich, and interesting, 00:11:23.700 |
and complex, we'll be making available on GitHub, 00:11:27.540 |
on our repository that accompanies these courses. 00:11:58.820 |
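As a rough sketch of those steps, assuming TensorFlow 2.x and the Keras API that the course code uses, the whole pipeline might look something like this:

```python
import tensorflow as tf

# Step 1: import the dataset (28x28 grayscale digits, labels 0-9).
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # normalize pixels to [0, 1]

# Steps 2-3: stack layers like Lego bricks: input, hidden, output.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),   # hidden layer
    tf.keras.layers.Dense(10, activation='softmax')  # one output per digit
])

# Step 4: train on the training set.
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)

# Step 5: evaluate on the held-out test set.
model.evaluate(x_test, y_test)

# Step 6: make a prediction on new input.
predictions = model.predict(x_test[:1])
print(predictions.argmax())  # the digit the network believes is in the image
```

The hidden-layer width, optimizer, and epoch count here are illustrative defaults rather than the exact settings from the course notebooks.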
It's not just something you import in Python, 00:12:15.620 |
But there's also the ability to run in the browser 00:12:17.980 |
with TensorFlow.js, on the phone with TensorFlow Lite, 00:12:21.740 |
in the cloud, without any need to have a computer, 00:12:29.140 |
You can run all the code that we're providing 00:13:25.060 |
and in the tooling and the applied side of TensorFlow. 00:13:34.580 |
is the ability to form higher and higher level 00:13:53.140 |
and effective for being able to interpret data. 00:14:06.980 |
Cat versus dog, blue dot versus green triangle. 00:14:14.620 |
In this task, drawing a separating line in the original Cartesian coordinates 00:14:21.900 |
is very difficult, well, impossible to do accurately; under polar coordinates it becomes trivial. 00:14:25.140 |
And that's a trivial example of a representation. 00:14:31.940 |
is forming representations that map the topology, 00:14:37.700 |
the rich space of the problem that you're trying to deal 00:14:44.260 |
that the final representation is trivial to work with, 00:14:55.380 |
trivial to generate new samples of that data. 00:14:58.220 |
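As a toy illustration of that point, assuming synthetic data of two concentric rings (standing in for the blue dots and green triangles), a NumPy sketch shows that a single threshold in the right representation does what no straight line in the raw coordinates could:

```python
import numpy as np

# Hypothetical data: one class on an inner ring, the other on an outer ring.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
inner = np.c_[1.0 * np.cos(theta), 1.0 * np.sin(theta)]   # class 0
outer = np.c_[3.0 * np.cos(theta), 3.0 * np.sin(theta)]   # class 1
points = np.vstack([inner, outer])
labels = np.r_[np.zeros(200), np.ones(200)]

# In Cartesian coordinates (x, y), no straight line separates the classes.
# Map to polar coordinates: the radius alone is a perfect representation.
radius = np.hypot(points[:, 0], points[:, 1])
separable = ((radius > 2.0) == labels.astype(bool)).all()
print(separable)  # True: one threshold on the radius classifies everything
```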
And that representation of higher and higher levels 00:15:21.300 |
that's been the dream of all of science in general, 00:15:38.740 |
The models of the universe of our solar system 00:15:56.260 |
Those higher and higher levels of simple representations 00:16:29.860 |
Deep learning automates much of the extraction 00:16:40.820 |
ability to form representations from the raw data 00:16:53.580 |
with which then the machine learning algorithms 00:16:58.420 |
enables us to work with large and larger data sets 00:17:02.980 |
except for the supervision labeling step at the very end. 00:17:40.220 |
of an inflated expectation with deep learning. 00:17:51.540 |
that we'll talk about in future lectures in this course. 00:18:09.640 |
as the ups and downs of the excitement progresses forward 00:18:34.740 |
majority of the aspects of the autonomous vehicles 00:18:41.940 |
The problems are not formulated as data-driven learning. 00:18:46.260 |
Instead, they're model-based optimization methods 00:18:51.980 |
And then from the speakers these couple of weeks, 00:19:01.540 |
with amazing humanoid robotics in Boston Dynamics, 00:19:04.560 |
to date, almost no machine learning has been used 00:19:22.740 |
Plus, what's becoming, what's starting to be used 00:19:27.260 |
a little bit more is the use of recurrent neural networks 00:19:36.020 |
to predict the intent of the different players in the scene 00:19:46.860 |
the 10 million miles that Waymo has achieved, 00:19:50.340 |
has been attributed mostly to non-machine learning methods. 00:19:58.580 |
Here's a really clean example of unintended consequences. 00:20:03.700 |
Of ethical issues we have to really think about. 00:20:14.540 |
based on an objective function, a loss function, 00:20:22.820 |
that optimizes that function is not always obvious. 00:20:31.740 |
it's a boat racing game where the task is to go 00:20:34.580 |
around the racetrack and try to win the race. 00:20:38.280 |
And the objective is to get as many points as possible. 00:20:44.640 |
The finishing time, how long it took you to finish. 00:20:47.340 |
The finishing position, where you were in the ranking. 00:20:59.180 |
So we design an agent, in this case an RL agent, 00:21:10.220 |
the optimal, the agent discovers that the optimal 00:21:13.140 |
actually has nothing to do with finishing the race 00:21:27.400 |
slamming into the wall, collecting the green turbos. 00:21:32.060 |
Now that's a very clear example of a well-reasoned, 00:22:23.500 |
degrees of what the algorithm is accomplishing. 00:22:30.220 |
In fact, it's very far from scene understanding. 00:22:33.540 |
Classification may be very far from understanding. 00:22:41.660 |
across the different benchmarks and the data sets used. 00:22:52.440 |
And the real world data is where the big impact is. 00:22:56.040 |
So oftentimes the one doesn't transfer to the other. 00:23:07.940 |
all the things that we take for granted as human beings 00:23:14.320 |
greater and greater understanding of a scene. 00:23:16.580 |
And all the other things we have to close the gap on 00:23:22.420 |
Here's an image from the Andrej Karpathy blog 00:23:26.620 |
of former President Obama stepping on a scale. 00:23:30.620 |
We can classify, we can do semantic segmentation 00:23:50.000 |
We can't deal with the sparsity of information. 00:24:04.100 |
that there's human beings behind from a single image, 00:24:08.660 |
things we can trivially do using all the common sense 00:24:14.460 |
The physics of the scene, that there's gravity. 00:24:23.820 |
about what's on other people's minds, and so on. 00:24:29.260 |
being able to infer what people are thinking about. 00:24:33.900 |
there's been a lot of exciting work here at MIT 00:24:38.260 |
But we're not even close to solving that problem either. 00:24:42.100 |
we haven't even begun to really think about that problem. 00:24:52.600 |
I think I'm harping on the visual perception problem, 00:24:55.860 |
because it's one we take really for granted as human beings, 00:24:59.340 |
especially when trying to solve real world problems, 00:25:01.220 |
especially when trying to solve autonomous driving. 00:25:04.980 |
We have 540 million years of data for visual perception, 00:25:16.940 |
of abstract thought, being able to play chess, 00:25:42.020 |
The last few years, there's been a lot of papers, 00:25:44.760 |
a lot of work to show that you can mess with these systems 00:25:52.480 |
predict a dog, add a little bit of distortion. 00:25:55.500 |
Immediately the system predicts with 99% confidence 00:26:12.380 |
and real world perception that has to be solved, 00:26:18.460 |
I really like this Max Tegmark's visualization 00:26:26.800 |
of this rising sea of the landscape of human competence 00:26:34.580 |
And this is the difference as we progress forward 00:26:40.860 |
and we discuss some of these machine learning methods 00:26:52.940 |
that's able to generalize over all kinds of problems, 00:27:07.120 |
which is savants, which is specialized intelligence, 00:27:07.120 |
of art, cinematography, book writing at the peaks 00:27:38.540 |
of everything we're doing now keep the sea rising 00:27:42.300 |
or do fundamental breakthroughs have to happen 00:27:44.380 |
in order to generalize and solve these problems. 00:27:47.780 |
And so from the specialized where the successes are, 00:27:56.340 |
given the data set and given the ground truth 00:28:02.140 |
in the Boston area, be able to input several parameters 00:28:06.460 |
and based on those parameters, predict the apartment cost. 00:28:30.980 |
an entire series of lectures in the third week 00:28:38.900 |
with very little annotation through self-play 00:28:41.740 |
where their systems learn without human supervision 00:29:14.940 |
artificial intelligence, but it is a very small step 00:29:18.280 |
because it's in a simulated, very trivial situation. 00:29:35.620 |
where the majority of the teaching is done by human beings 00:29:35.620 |
and further and further down to semi-supervised learning, 00:29:49.100 |
reinforcement learning and supervised learning 00:30:20.140 |
modifying those images to grow a small data set 00:30:41.340 |
This is a video and there's many of them online 00:30:54.740 |
We learned to do this, it's one shot learning. 00:31:21.940 |
the fundamental aspect of how to solve a particular problem. 00:31:24.940 |
Machines in most cases need thousands, millions 00:31:32.980 |
on the life critical nature of the application. 00:31:49.200 |
is there's input data, there's a learning system 00:31:57.880 |
And so we use that ground truth to teach the system. 00:32:02.880 |
In the testing stage, when it goes out into the wild 00:32:05.320 |
there's new input data over which we have to generalize 00:32:07.520 |
with the learning system and have to make our best guess. 00:32:10.680 |
In the training stage, the processes with neural networks 00:32:15.680 |
is given the input data for which we have the ground truth, 00:32:18.360 |
pass it through the model, get the prediction 00:32:23.000 |
we can compare the prediction to the ground truth, 00:32:25.280 |
look at the error and based on the error adjust the weights. 00:32:37.000 |
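A minimal sketch of that training loop, written out by hand in NumPy for a single linear neuron on hypothetical data, makes the forward pass, the error, and the weight adjustment explicit:

```python
import numpy as np

# Hypothetical labeled data: inputs x with ground-truth targets y = 2x + 1.
x = np.linspace(-1, 1, 50)
y = 2 * x + 1

w, b = 0.0, 0.0          # weights to be learned
learning_rate = 0.1      # how fast the network learns

for epoch in range(200):
    prediction = w * x + b                 # forward pass through the model
    error = prediction - y                 # compare prediction to ground truth
    loss = np.mean(error ** 2)             # mean squared error
    # Gradients of the loss with respect to the weights (backpropagation,
    # written out by hand for this one-neuron case).
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Adjust the weights based on the error and the learning rate.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))  # approaches 2.0 and 1.0
```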
Here, if we look at weather, the regression problem says 00:32:46.000 |
and the classification formulation of that problem 00:32:50.320 |
or some threshold definition of what hot or cold is. 00:32:55.280 |
On the classification front, it can be multi-class 00:33:09.840 |
where a particular entity can be multiple things. 00:33:16.480 |
can be not just a single sample of the particular dataset 00:33:21.480 |
and the output doesn't have to be a particular sample 00:33:33.760 |
From video captioning, where the input is video and the output is a caption, 00:33:33.760 |
to translation to natural language generation 00:33:41.960 |
to of course the one-to-one general computer vision. 00:33:49.760 |
to a single neuron inspired by our own brain, 00:34:00.120 |
that is behind a lot of the intelligence in our mind. 00:34:03.200 |
The artificial neuron has inputs with weights on them 00:34:08.920 |
plus a bias and an activation function and an output. 00:34:20.280 |
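In code, a single artificial neuron is just a few lines; this sketch uses a sigmoid as the activation function and made-up numbers for the inputs, weights, and bias:

```python
import numpy as np

def neuron(inputs, weights, bias):
    """A single artificial neuron: weighted sum of inputs plus a bias,
    passed through a nonlinear activation (here a sigmoid)."""
    z = np.dot(inputs, weights) + bias
    return 1.0 / (1.0 + np.exp(-z))   # activation squashes the output to (0, 1)

# Hypothetical inputs and learned parameters.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2
print(neuron(x, w, b))  # the neuron "gets excited" (output near 1) or stays quiet
```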
with three million neurons and 476 million synapses. 00:34:24.000 |
The full brain has a hundred billion neurons 00:34:24.000 |
ResNet and some of the other state-of-the-art networks 00:34:36.760 |
have tens to hundreds of millions of edges, of synapses. 00:34:42.760 |
The human brain has 10 million times more synapses 00:35:00.840 |
The learning algorithm for artificial neural networks 00:35:03.960 |
is backpropagation; for our biological neurons, we don't know. 00:35:12.960 |
That's one of the mysteries of the human brain. 00:35:18.760 |
human brains are much more efficient than neural networks. 00:35:21.200 |
That's one of the problems that we're trying to solve 00:35:47.560 |
Online learning is an exceptionally difficult thing 00:35:50.040 |
that we're still in the very early stages of. 00:35:59.600 |
the fundamental computational block behind neural networks, 00:36:07.240 |
sums them up, puts it into a nonlinear activation function 00:36:12.800 |
also a learned parameter, and gives an output. 00:36:17.600 |
And the task of this neuron is to get excited 00:36:43.800 |
Different levels of abstractions form a knowledge base 00:36:49.720 |
or even act on a particular set of raw inputs. 00:36:53.640 |
And you stack these neurons together in layers, 00:36:58.240 |
both in width and depth, increasing further on, 00:37:02.000 |
and there's a lot of different architectural variants, 00:37:08.240 |
that with just a single hidden layer of a neural network, 00:37:15.720 |
Adding a neural network with a single hidden layer 00:37:35.400 |
And the other aspect here is the mathematical underpinnings 00:37:57.120 |
And that's why the other aspect on the compute, 00:38:03.080 |
is what enables some of the exciting advancements 00:38:24.440 |
to be able to train and perform inference on neural networks. 00:38:46.320 |
And for classification, it's cross-entropy loss. 00:38:48.760 |
In the cross-entropy loss, the ground truth is zero, one. 00:38:51.600 |
In the mean squared error, it's real numbered. 00:39:02.160 |
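A small sketch, assuming TensorFlow's built-in loss classes, shows the two cases side by side: real-valued ground truth with mean squared error, and zero/one ground truth with cross-entropy:

```python
import tensorflow as tf

# Regression: ground truth is real-valued, so mean squared error applies.
y_true_reg = tf.constant([21.5, 18.0])        # e.g. temperatures
y_pred_reg = tf.constant([20.0, 19.0])
mse = tf.keras.losses.MeanSquaredError()
print(float(mse(y_true_reg, y_pred_reg)))     # average squared difference

# Classification: ground truth is zero/one, so cross-entropy applies.
y_true_cls = tf.constant([[0.0, 1.0]])        # one-hot: the second class is correct
y_pred_cls = tf.constant([[0.3, 0.7]])        # predicted class probabilities
ce = tf.keras.losses.CategoricalCrossentropy()
print(float(ce(y_true_cls, y_pred_cls)))      # penalizes low probability on the true class
```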
and the weights and the bias and the activation functions 00:39:21.440 |
to have the error flow backwards through the network 00:39:21.440 |
and adjust the weights such that, once again, 00:39:30.680 |
for producing the correct output are increased, 00:39:39.040 |
for producing the incorrect output are decreased. 00:39:50.000 |
And based on the gradients, the optimization algorithm, 00:39:52.960 |
combined with a learning rate, adjust the weights. 00:39:56.800 |
The learning rate is how fast the network learns. 00:40:11.200 |
in the backward flow through the network of the gradients, 00:40:18.760 |
There's a lot of variance of this optimization algorithms 00:40:23.040 |
from dying ReLUs to vanishing gradients. 00:40:29.080 |
on momentum and so on that really just boil down 00:40:33.520 |
to all the different problems that are solved 00:40:55.920 |
or do you do it with stochastic gradient descent 00:41:11.680 |
"More importantly, it's bad for your test error. 00:41:14.000 |
"Friends don't let friends use mini-batches larger than 32." 00:41:18.480 |
Larger batch size means more computational speed, 00:41:23.320 |
'cause you don't have to update the weights as often. 00:41:31.080 |
The problem we're often on the broader scale of learning 00:41:42.000 |
And the way we solve it is through regularization. 00:41:52.520 |
that you only do well in that trained dataset. 00:41:56.240 |
So you want it to be generalizable into future, 00:41:58.880 |
into the future things that you haven't seen yet. 00:42:02.800 |
So obviously, this is a problem for small datasets 00:42:07.800 |
and also for sets of parameters that you choose. 00:42:19.200 |
trying to fit a particular set of data with the blue dots. 00:42:25.560 |
It does very well for that particular set of samples 00:42:28.240 |
but does not generalize well in the general case. 00:42:31.560 |
And the trade-off here is as you train further and further, 00:42:45.760 |
on the training set and going to one on the test set. 00:43:02.120 |
and you call it the validation set and you set it aside 00:43:04.680 |
and you evaluate the performance of your system 00:43:09.080 |
And after you notice that your trained network 00:43:17.120 |
for a prolonged period of time, that's when you stop. 00:43:20.960 |
Basically it's getting better and better and better 00:43:26.040 |
and after some period of time, it's definitely getting worse. 00:43:35.600 |
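In Keras this early-stopping recipe is a one-line callback; the sketch below assumes the same MNIST setup as earlier, sets aside 20% of the training data as the validation set, and stops once validation loss has not improved for a few epochs:

```python
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Watch performance on the held-out validation set and stop once it has not
# improved for a prolonged period (here, 5 epochs), keeping the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=5, restore_best_weights=True)

model.fit(x_train, y_train,
          validation_split=0.2,    # set aside 20% of training data as validation
          epochs=100,              # an upper bound; training stops early in practice
          callbacks=[early_stop])
```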
And there's a lot of other regularization methodologies. 00:44:01.200 |
Normalization is obviously always applied at the input. 00:44:14.240 |
as different lighting conditions, different variations, 00:44:19.080 |
you have to all kind of put it on the same level ground 00:44:21.960 |
so that we're learning the fundamental aspects 00:44:31.560 |
So we should usually always normalize, for example, 00:44:35.920 |
if it's computer vision with pixels from zero to 255, 00:44:38.960 |
you always normalize to zero to one or negative one to one 00:44:42.080 |
or normalize based on the mean and the standard deviation. 00:44:46.280 |
That's something you should almost always do. 00:44:54.160 |
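As a quick sketch of those options on a hypothetical batch of images with pixels in [0, 255]:

```python
import numpy as np

# Hypothetical batch of 32 grayscale images with pixel values in [0, 255].
images = np.random.randint(0, 256, size=(32, 28, 28)).astype(np.float32)

zero_to_one = images / 255.0                              # scale to [0, 1]
minus_one_to_one = images / 127.5 - 1.0                   # scale to [-1, 1]
standardized = (images - images.mean()) / images.std()    # zero mean, unit variance
```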
a lot of breakthrough performances in the past few years 00:44:59.080 |
It's performing this kind of same normalization 00:45:12.000 |
normalized based on the mean and the standard deviation. 00:45:15.000 |
As batch normalization with batch renormalization 00:45:23.880 |
given that you're normalizing during the training 00:45:31.880 |
that doesn't directly map to the inference stage 00:45:35.280 |
And so it allows by keeping a running average, 00:45:56.240 |
in all the levels of abstractions that you're forming. 00:45:58.720 |
And batch renorm solves a lot of these problems 00:46:03.320 |
from layer to weight to instance normalization 00:46:15.120 |
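A minimal sketch of batch normalization in Keras: the same idea as input normalization, but applied to the hidden activations between layers, with running averages kept so the layer can also be used at inference time:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128),
    tf.keras.layers.BatchNormalization(),  # normalize over the current mini-batch;
                                           # running averages are used at inference
    tf.keras.layers.Activation('relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
```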
So now let's run through a bunch of different ideas, 00:46:18.880 |
some of which we'll cover in future lectures. 00:46:22.920 |
Of what is all of this in this world of deep learning, 00:46:25.500 |
from computer vision to deep reinforcement learning, 00:46:37.760 |
So these convolutional filters slide over the image 00:46:41.520 |
of the spatial invariance of visual information 00:46:58.040 |
and use the spatial invariance of visual information 00:47:03.800 |
to slide a convolution filter across the image 00:47:09.880 |
as opposed to assigning equal value to features 00:47:14.800 |
that are present in various regions of the image. 00:47:22.440 |
high level abstractions of visual information and images. 00:47:59.280 |
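A small convolutional network in Keras, as a sketch of that idea: the same filters slide over the whole image, pooling shrinks the spatial resolution, and deeper layers form higher-level abstractions:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # Filters slide across the image, reusing the same weights everywhere.
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    # Deeper filters build higher-level abstractions from the ones below.
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.summary()
```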
is the next step in visual recognition. 00:48:02.780 |
So the image classification is just taking the entire image 00:48:10.560 |
find all the objects of interest in the scene 00:48:25.760 |
Here's a bunch of candidates that you should look at. 00:48:49.940 |
And you can really summarize region-based methods 00:49:20.460 |
that's been trained to do image classification, 00:49:22.860 |
stack a bunch of convolutional layers on top, 00:49:34.100 |
and the classes associated with those bounding boxes. 00:49:37.860 |
and this is where the popular YOLO v1, v2, v3 come from. 00:49:58.420 |
or rather objects that are small in the image 00:50:10.700 |
That's where the tutorial that we presented here 00:50:55.800 |
of compressing a representation of the scene, 00:51:08.160 |
and up sampling the pixel level classification. 00:51:13.200 |
there's a lot of tricks that we'll talk through 00:51:15.860 |
but ultimately it boils down to the encoding step 00:51:29.520 |
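A bare-bones sketch of that encoder-decoder shape, assuming a hypothetical 128x128 input and a handful of per-pixel classes: strided convolutions compress the scene, transposed convolutions upsample it back to a classification for every pixel:

```python
import tensorflow as tf

num_classes = 3  # hypothetical number of per-pixel classes

# Encoder: compress the scene into a lower-resolution representation.
inputs = tf.keras.Input(shape=(128, 128, 3))
x = tf.keras.layers.Conv2D(32, 3, strides=2, padding='same', activation='relu')(inputs)
x = tf.keras.layers.Conv2D(64, 3, strides=2, padding='same', activation='relu')(x)

# Decoder: upsample back to the input resolution for pixel-level classification.
x = tf.keras.layers.Conv2DTranspose(64, 3, strides=2, padding='same', activation='relu')(x)
x = tf.keras.layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu')(x)
outputs = tf.keras.layers.Conv2D(num_classes, 1, activation='softmax')(x)

model = tf.keras.Model(inputs, outputs)  # output: a class probability per pixel
```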
the underlying idea applied most extensively, 00:51:36.540 |
Most commonly applied way of transfer learning 00:52:11.520 |
like you want to build a pedestrian detector. 00:52:20.320 |
it's useful to take ResNet trained on ImageNet, 00:52:23.680 |
or COCO, trained on the general case of vision perception, 00:52:29.560 |
and then retraining on your specialized pedestrian dataset. 00:52:48.680 |
And this is extremely effective in computer vision, 00:52:55.480 |
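The recipe in code, assuming Keras and a hypothetical pedestrian dataset: take ResNet50 pre-trained on ImageNet, drop its original head, freeze the general-purpose features, and train a small new head for the specialized task:

```python
import tensorflow as tf

# Pre-trained ResNet50 on ImageNet, without its original classification head.
base = tf.keras.applications.ResNet50(weights='imagenet', include_top=False,
                                      input_shape=(224, 224, 3))
base.trainable = False  # keep the general-purpose visual representation fixed (at first)

# New head for the specialized task, e.g. pedestrian vs. no pedestrian.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(pedestrian_images, pedestrian_labels, epochs=...)  # hypothetical dataset
```

Once the new head has converged, some or all of the base layers can optionally be unfrozen and fine-tuned at a low learning rate.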
And so as I mentioned with the pre-trained networks, 00:53:00.480 |
they are ultimately forming representations of the data 00:53:15.600 |
or forming representations in an unsupervised way. 00:53:25.320 |
Well, if you add a bottleneck in the network, 00:53:48.760 |
and reproduce it with a latent representation 00:53:54.240 |
And that's a really powerful way to compress the data. 00:54:19.720 |
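A minimal autoencoder sketch on MNIST, assuming Keras: a 32-dimensional bottleneck is forced to become a compressed representation because the network is trained to reproduce its own input:

```python
import tensorflow as tf

# Autoencoder: squeeze the input through a bottleneck and reconstruct it,
# so the bottleneck becomes a compact latent representation of the data.
inputs = tf.keras.Input(shape=(784,))
latent = tf.keras.layers.Dense(32, activation='relu')(inputs)        # bottleneck
outputs = tf.keras.layers.Dense(784, activation='sigmoid')(latent)   # reconstruction

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='mse')

# Train the network to reproduce its own input (no labels needed).
(x_train, _), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784) / 255.0
autoencoder.fit(x_train, x_train, epochs=5, batch_size=256)
```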
In practice, if you want to form an efficient, 00:54:31.960 |
You want to train it on a discriminative task, 00:54:36.320 |
and the network is trained to identify cat versus dog. 00:54:39.960 |
That network that's trained in a discriminative way, 00:54:58.240 |
is a way to visualize these different representations, 00:55:25.240 |
One is the generator, one is the discriminator, 00:55:40.360 |
to generate images based on a certain representation, 00:55:49.320 |
that has to discriminate between real images, 00:56:02.120 |
And the discriminator gets better and better, 00:56:04.400 |
at telling the difference between real and fake, 00:56:17.040 |
I mean the ability to generate realistic faces, 00:56:34.280 |
temporally consistent video over time with GANs. 00:56:41.800 |
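Structurally, a GAN is just those two networks wired against each other. The sketch below only shows the shape of the setup on hypothetical 28x28 images; the full adversarial training loop (alternating discriminator and generator updates) is omitted:

```python
import tensorflow as tf

latent_dim = 100  # size of the random code the generator starts from

# Generator: turns a latent vector into a (here, 28x28) image.
generator = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation='relu', input_shape=(latent_dim,)),
    tf.keras.layers.Dense(28 * 28, activation='sigmoid'),
    tf.keras.layers.Reshape((28, 28)),
])

# Discriminator: looks at an image and guesses real (1) vs. generated (0).
discriminator = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

# In training, the two are pitted against each other: the discriminator learns
# to separate real from generated images, and the generator learns to produce
# images the discriminator labels as real.
noise = tf.random.normal([16, latent_dim])
fake_images = generator(noise)          # a batch of generated images
scores = discriminator(fake_images)     # discriminator's real/fake guesses
```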
I'm sure they'll, I'm sure I'll also talk about, 00:56:44.800 |
the on a pixel level from semantic segmentation, 00:56:47.400 |
So from the semantic pixel segmentation on the right, 00:56:57.160 |
All the raw rich high definition pixels on the left. 00:57:15.080 |
ability to from words to form representation, 00:57:24.400 |
The whole idea of forming representation about the data, 00:57:34.320 |
where words that are far apart from each other, 00:57:43.880 |
are semantically far apart from each other as well. 00:57:47.440 |
So things that are similar are together in that space. 00:58:16.720 |
you're able to know which words are related to each other. 00:58:25.440 |
but the main thing here with the input vector, 00:58:29.560 |
and the output of vector representing the probability, 00:58:32.920 |
that those words are connected to each other. 00:58:40.520 |
That low-dimensional representation gives you the embedding, 00:58:40.520 |
the ones that are close together semantically, 00:59:09.320 |
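A skip-gram-style sketch of that idea in Keras, with a hypothetical vocabulary size: the model sees (word, word) pairs, predicts the probability that they co-occur, and the embedding matrix it learns along the way is the representation we keep:

```python
import tensorflow as tf

vocab_size = 10000   # hypothetical vocabulary
embedding_dim = 64   # size of the learned word representation

# Inputs: indices of a target word and a candidate context word.
target = tf.keras.Input(shape=(1,), dtype='int32')
context = tf.keras.Input(shape=(1,), dtype='int32')

# Shared embedding: each word index maps to a learned vector.
embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
t_vec = tf.keras.layers.Flatten()(embedding(target))
c_vec = tf.keras.layers.Flatten()(embedding(context))

# Dot product of the two embeddings, squashed to a probability that the
# two words actually occur near each other in text.
score = tf.keras.layers.Dot(axes=1)([t_vec, c_vec])
prob = tf.keras.layers.Activation('sigmoid')(score)

model = tf.keras.Model([target, context], prob)
model.compile(optimizer='adam', loss='binary_crossentropy')
# After training on word pairs, the rows of the embedding matrix place
# semantically related words close together in the learned space.
```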
The recurrent neural networks are able to learn, 00:59:09.320 |
allow it to freely pass through information in time. 01:00:30.680 |
learning representations for what happened in the past. 01:00:39.640 |
you look into the data that follows in the sequence. 01:00:42.200 |
So with bidirectional recurrent networks, you do a forward pass through the network, 01:00:56.680 |
used very much when the sequence on the input, 01:01:13.560 |
So this is useful for machine translation for example. 01:01:43.760 |
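A compact sketch of that encoder-decoder arrangement in Keras, with hypothetical vocabulary sizes: an LSTM encoder compresses the source sentence into its final state, and an LSTM decoder generates the target sentence conditioned on that state:

```python
import tensorflow as tf

src_vocab, tgt_vocab, units = 8000, 8000, 256  # hypothetical sizes

# Encoder: reads the source sentence and compresses it into a fixed-size state.
enc_inputs = tf.keras.Input(shape=(None,))
enc_emb = tf.keras.layers.Embedding(src_vocab, units)(enc_inputs)
_, state_h, state_c = tf.keras.layers.LSTM(units, return_state=True)(enc_emb)

# Decoder: generates the target sentence, conditioned on the encoder's state.
dec_inputs = tf.keras.Input(shape=(None,))
dec_emb = tf.keras.layers.Embedding(tgt_vocab, units)(dec_inputs)
dec_out, _, _ = tf.keras.layers.LSTM(
    units, return_sequences=True, return_state=True)(
        dec_emb, initial_state=[state_h, state_c])
outputs = tf.keras.layers.Dense(tgt_vocab, activation='softmax')(dec_out)

model = tf.keras.Model([enc_inputs, dec_inputs], outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```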
is the improvement on this encoder decoder architecture, 01:04:24.000 |
to the problem of maybe what's traditionally, 01:04:35.680 |
and what is the right data to solve that question? 01:04:59.320 |
and learning from the very sparse nature of the reward, 01:05:09.760 |
when you successfully accomplish a task or not, 01:05:13.720 |
are able to learn how to behave in that world. 01:05:18.440 |
with cats learning how the bell maps to the food, 01:05:21.600 |
and a lot of the amazing work at OpenAI and DeepMind, 01:05:24.920 |
about the robotics manipulation and navigation, 01:05:33.760 |
our own deep reinforcement learning competition, 01:05:56.680 |
able to learn how to operate successfully in this world. 01:06:21.160 |
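As a sketch of that loop at its simplest, here is tabular Q-learning on a made-up five-state world where the only reward sits at the far right; the agent acts, occasionally receives the sparse reward, and nudges its value estimates until the right behavior emerges:

```python
import numpy as np

# A tiny hypothetical world: 5 states in a row; reward of 1 only at the far right.
n_states, n_actions = 5, 2        # actions: 0 = left, 1 = right
q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != n_states - 1:
        # Explore occasionally, otherwise act greedily on current estimates.
        action = rng.integers(n_actions) if rng.random() < epsilon else int(q[state].argmax())
        next_state = max(0, state - 1) if action == 0 else min(n_states - 1, state + 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0   # sparse reward
        # Q-learning update: nudge the value of (state, action) toward the
        # reward plus the discounted value of the best next action.
        q[state, action] += alpha * (reward + gamma * q[next_state].max() - q[state, action])
        state = next_state

print(q.argmax(axis=1))  # learned policy: move right (1) from every non-terminal state
```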
create data really from this understanding of the world, 01:06:51.000 |
basically removing a human as much as possible, 01:06:55.840 |
and involving the human only on the fundamental side, 01:07:07.360 |
which is understanding the fundamental big questions, 01:07:11.720 |
that empowers us to solve real world problems, 01:07:16.400 |
that needs to be struck in order to solve those problems well,