MIT 6.S094: Computer Vision
Chapters
0:00 Computer Vision and Convolutional Neural Networks
22:15 Network Architectures for Image Classification
34:39 Fully Convolutional Neural Networks
44:35 Optical Flow
50:07 SegFuse Dynamic Scene Segmentation Competition
Today we'll talk about how to make machines see. 00:00:21.400 |
teach you about concepts of deep reinforcement learning, 00:00:25.400 |
SegFuse, the deep dynamic driving scene segmentation competition 00:00:41.400 |
that would lead the world in the area of perception. 00:00:45.400 |
Perhaps together with the people running this class, 00:01:04.400 |
The majority of the successes in how we interpret, 00:01:08.400 |
form representations, understand images and videos 00:01:12.400 |
utilize to a significant degree neural networks. 00:01:34.400 |
There's annotated data where the human provides the labels 00:01:37.400 |
that serve as the ground truth in the training process. 00:01:40.400 |
Then the neural network goes through that data, 00:01:52.400 |
and then generalize over the testing data set. 00:01:56.400 |
And the kind of raw sensors we're dealing with are numbers. 00:02:15.400 |
That's something, whether you're an expert computer vision person 00:02:31.400 |
in order to perform the task you're asking it to do. 00:02:35.400 |
Perhaps the data given is highly insufficient 00:02:41.400 |
That's the question that'll come up again and again. 00:02:43.400 |
Are images enough to understand the world around you? 00:02:58.400 |
where every single pixel has three color values. 00:03:20.400 |
of what is hard and what is easy in computer vision. 00:03:41.400 |
is a little bit more similar in these regards. 00:03:44.400 |
The structure of the human visual cortex is in layers. 00:04:05.400 |
higher and higher order representations are formed. 00:04:28.400 |
forming those edges to form more complex features 00:04:31.400 |
and finally into the higher order semantic meaning 00:04:43.400 |
the illumination variability is the biggest challenge 00:04:46.400 |
or at least one of the biggest challenges in driving 00:04:58.400 |
as I'll also discuss some of the advances 00:04:58.400 |
as they are currently used for computer vision 00:05:09.400 |
are not good with representing variable pose. 00:05:25.400 |
and the object is mangled and shaped in different ways. 00:05:49.400 |
there is a lot of variability inside the classes 00:05:52.400 |
and very little variability between the classes. 00:06:07.400 |
visible light camera perception is occlusion. 00:06:50.400 |
here's a cat dressed as a monkey eating a banana. 00:06:57.400 |
most of us understand what's going on in the scene. 00:07:21.400 |
and the fact that you could argue it's a monkey, 00:07:26.400 |
And what else is missing is the dynamic information, 00:07:33.400 |
That's what's missing in a lot of the perception work 00:07:58.400 |
Those bins, there are a lot of examples of each. 00:07:58.400 |
when a new example comes along you've never seen before, 00:08:08.400 |
It's the same as the machine learning task before. 00:08:18.400 |
MNIST is a toy data set of handwritten digits, 00:09:03.400 |
some of the basic convolutional neural networks 00:09:06.400 |
So let's come up with a very trivial classifier 00:09:08.400 |
to explain the concept of how we could go about it. 00:09:59.400 |
I'm going to find one of the 10 bins for a new image 00:10:14.400 |
and put it in the same bin as that image is in. 00:10:30.400 |
much better than random, much better than 10%. 00:10:46.400 |
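The bin-matching scheme described above is essentially nearest-neighbor classification. Here is a minimal NumPy sketch of the idea; the function name and the tiny 4-pixel "dataset" are illustrative stand-ins, not from the lecture:

```python
import numpy as np

def nearest_neighbor_predict(train_images, train_labels, test_image):
    # L1 distance: sum of absolute pixel differences to every training image.
    distances = np.abs(train_images - test_image).sum(axis=1)
    # The new image goes into the same bin as its closest training image.
    return train_labels[np.argmin(distances)]

# Toy stand-in "dataset": two 4-pixel images, one per class.
train = np.array([[0, 0, 0, 0],
                  [255, 255, 255, 255]], dtype=np.int64)
labels = np.array([0, 1])
print(nearest_neighbor_predict(train, labels, np.array([10, 5, 0, 20])))  # -> 0
```

On real images the same code applies to flattened pixel vectors; only the distance computation and the argmin are needed, no training at all.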
Let's take our classifier to a whole new level. 00:10:58.400 |
and say, what class do the majority of them belong to? 00:11:14.400 |
which is optimal under this approach for CIFAR-10, 00:11:14.400 |
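The majority-vote extension above is k-nearest neighbors. A minimal sketch, again with illustrative names and toy data (the best k for CIFAR-10 would be found by validation):

```python
import numpy as np
from collections import Counter

def knn_predict(train_images, train_labels, test_image, k=7):
    # Find the k closest training images under L1 distance...
    distances = np.abs(train_images - test_image).sum(axis=1)
    nearest = np.argsort(distances)[:k]
    # ...and take a majority vote over their labels.
    votes = Counter(int(train_labels[i]) for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-pixel images: three near-black (class 0), two near-white (class 1).
train = np.array([[0, 0], [5, 5], [10, 10], [250, 250], [255, 255]], dtype=np.int64)
labels = np.array([0, 0, 0, 1, 1])
print(knn_predict(train, labels, np.array([8, 8]), k=3))  # -> 0
```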
It all starts at this basic computational unit. 00:11:54.400 |
and pass an input into a nonlinear activation function 00:11:54.400 |
the neuron which produces the highest output. 00:13:57.400 |
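That basic unit, a weighted sum plus bias pushed through a nonlinearity, with the prediction read off the highest-output neuron, can be sketched as follows. This is a hedged toy illustration (identity weights, sigmoid activation chosen for concreteness), not the lecture's exact network:

```python
import numpy as np

def neuron_outputs(x, W, b):
    # Each neuron computes a weighted sum of its inputs plus a bias,
    # then applies a nonlinear activation (a sigmoid here).
    z = W @ x + b
    return 1.0 / (1.0 + np.exp(-z))

def classify(x, W, b):
    # Predicted class = index of the neuron producing the highest output.
    return int(np.argmax(neuron_outputs(x, W, b)))

W = np.eye(3)      # toy weights: one output neuron per class
b = np.zeros(3)
print(classify(np.array([0.1, 2.0, -1.0]), W, b))  # -> 1
```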
that the neural network is tasked with learning, 00:14:01.400 |
And that's where convolutional neural networks step in. 00:14:22.400 |
That's where the convolution operation steps in. 00:15:21.400 |
what kind of features you look for in an image. 00:15:31.400 |
All kind of higher order patterns in the images. 00:16:31.400 |
the zero padding on the outside of the input, 00:18:38.400 |
That's it, that's the convolutional operation. 00:18:41.400 |
That's what's called the convolutional layer in neural networks. 00:21:06.400 |
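The operation described above, sliding a filter over the input with zero padding so the output keeps the input's size, can be sketched directly in NumPy (a naive loop version for clarity; real layers use optimized implementations and learn the filter weights):

```python
import numpy as np

def conv2d(image, kernel, pad=1):
    # Zero padding around the input keeps the output the same size.
    padded = np.pad(image, pad)
    kh, kw = kernel.shape
    out = np.zeros(image.shape)
    # Slide the filter over every position: elementwise multiply and sum.
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = (padded[i:i + kh, j:j + kw] * kernel).sum()
    return out

image = np.ones((3, 3))
box_filter = np.ones((3, 3))
result = conv2d(image, box_filter)
print(result[1, 1], result[0, 0])  # -> 9.0 4.0
```

The corner value is smaller than the center value because part of the filter overlaps the zero padding there.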
But traditionally it has been successfully done, 00:21:06.400 |
There's a certain kind of beautiful simplicity, 00:25:18.400 |
Because you can just make it deeper and deeper, 00:25:53.400 |
with the small modules within these networks, 00:26:01.400 |
The idea behind the inception module shown here, 00:28:36.400 |
that's reminiscent of recurrent neural networks, 00:28:36.400 |
to then come up with the total classification, 00:34:26.400 |
developments of how we design neural networks, 00:36:25.400 |
But it's still an incredibly difficult problem. 00:38:03.400 |
a lot of the work in the semantic segmentation, 00:39:11.400 |
the up-sampling is going to be extremely coarse. 00:39:26.400 |
So you're throwing away a lot of information, 00:41:44.400 |
by looking at the underlying image intensities. 00:42:46.400 |
but increases the resolution of the upsampling, 00:43:22.400 |
That's really the key part that made it work. 00:43:56.400 |
The steps of that dilated convolution filter, 00:44:22.400 |
is the parameterization of the upscaling filters. 00:44:27.400 |
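The dilated filter mentioned above spreads its taps apart by a dilation factor, growing the receptive field without adding parameters or pooling away resolution. A minimal sketch under that interpretation (names and toy sizes are illustrative):

```python
import numpy as np

def dilated_conv2d(image, kernel, dilation=2):
    # Effective footprint of the filter once gaps are inserted between taps.
    kh, kw = kernel.shape
    eh = (kh - 1) * dilation + 1   # effective height
    ew = (kw - 1) * dilation + 1   # effective width
    H, W = image.shape
    out = np.zeros((H - eh + 1, W - ew + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sample the input with a stride equal to the dilation factor.
            patch = image[i:i + eh:dilation, j:j + ew:dilation]
            out[i, j] = (patch * kernel).sum()
    return out

# A 3x3 filter with dilation 2 covers a 5x5 area of the input.
out = dilated_conv2d(np.ones((5, 5)), np.ones((3, 3)), dilation=2)
print(out.shape, out[0, 0])  # -> (1, 1) 9.0
```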
Okay, so that's what we use to generate that data, 00:44:30.400 |
and that's what we provide you the code with, 00:44:32.400 |
if you're interested in competing in SegFuse. 00:44:49.400 |
the temporal dynamics of the scene are thrown away. 00:45:27.400 |
and there's a large set of open problems there. 00:47:52.400 |
And it did so with two kinds of architectures, 00:48:07.400 |
and you want to produce from those two images, 00:48:23.400 |
so it produces a six-channel input to the network, 00:48:35.400 |
Then there is the FlowNet correlation architecture, 00:48:39.400 |
where you perform some convolution separately, 00:49:34.400 |
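The simpler of the two FlowNet inputs described above, stacking two consecutive frames into one six-channel image, is trivial to sketch; the frame names and sizes here are illustrative stand-ins for real video frames:

```python
import numpy as np

# Two consecutive RGB frames (height x width x 3); random stand-ins here.
frame_t = np.random.rand(64, 64, 3)
frame_t_plus_1 = np.random.rand(64, 64, 3)

# FlowNetSimple-style input: concatenate along the channel axis so the
# network convolves over a single 6-channel image and learns the motion.
stacked = np.concatenate([frame_t, frame_t_plus_1], axis=-1)
print(stacked.shape)  # -> (64, 64, 6)
```

The correlation variant instead runs shared convolutions on each frame separately before comparing the two feature maps.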
one that's common across various applications, 00:49:59.400 |
those data sets were used for the training process, 00:51:03.400 |
that's pretty damn close to the ground truth, 00:51:13.400 |
our task is to take the output of this network, 00:51:26.400 |
to help you propagate the information better. 00:51:38.400 |
it's not using the temporal information at all.