
MIT 6.S094: Computer Vision


Chapters

0:00 Computer Vision and Convolutional Neural Networks
22:15 Network Architectures for Image Classification
34:39 Fully Convolutional Neural Networks
44:35 Optical Flow
50:07 SegFuse Dynamic Scene Segmentation Competition

Whisper Transcript | Transcript Only Page

00:00:00.000 | Today we'll talk about how to make machines see.
00:00:04.400 | Computer vision.
00:00:05.800 | And we'll present,
00:00:07.400 | Thank you for whoever said yes.
00:00:09.400 | And today we will present a competition
00:00:14.400 | that unlike deep traffic,
00:00:16.400 | which is designed to explore ideas,
00:00:21.400 | teach you about concepts of deep reinforcement learning,
00:00:25.400 | SegFuse, the deep dynamic driving scene segmentation competition
00:00:30.400 | that I'll present today,
00:00:32.400 | is at the very cutting edge.
00:00:35.400 | Whoever does well in this competition
00:00:37.400 | is likely to produce a publication or ideas
00:00:41.400 | that would lead the world in the area of perception.
00:00:45.400 | Perhaps together with the people running this class,
00:00:49.400 | perhaps on your own.
00:00:51.400 | And I encourage you to do so.
00:00:54.400 | Even more cats today.
00:00:57.400 | Computer vision,
00:00:59.400 | today as it stands,
00:01:02.400 | is deep learning.
00:01:04.400 | Majority of the successes in how we interpret,
00:01:08.400 | form representations, understand images and videos
00:01:12.400 | utilize to a significant degree neural networks.
00:01:16.400 | The very ideas we've been talking about.
00:01:19.400 | That applies for supervised, unsupervised,
00:01:22.400 | and reinforcement learning.
00:01:24.400 | And for the supervised case,
00:01:27.400 | which is the focus of today,
00:01:29.400 | the process is the same.
00:01:32.400 | The data is essential.
00:01:34.400 | There's annotated data where the human provides the labels
00:01:37.400 | that serve as the ground truth in the training process.
00:01:40.400 | Then the neural network goes through that data,
00:01:46.400 | learning to map from the raw sensory input
00:01:50.400 | to the ground truth labels,
00:01:52.400 | and then generalize over the testing data set.
00:01:56.400 | And the kind of raw sensory data we're dealing with is numbers.
00:02:00.400 | I'll say this again and again,
00:02:02.400 | that for human vision,
00:02:04.400 | for us here, we take for granted
00:02:06.400 | this particular aspect of our ability.
00:02:08.400 | Which is to take in raw sensory information
00:02:11.400 | through our eyes and interpret it.
00:02:13.400 | But it's just numbers.
00:02:15.400 | That's something, whether you're an expert computer vision person
00:02:19.400 | or new to the field,
00:02:20.400 | you have to always go back to meditate on:
00:02:23.400 | what kind of things the machine is given.
00:02:27.400 | What is the data it is tasked to work with
00:02:31.400 | in order to perform the task you're asking it to do.
00:02:35.400 | Perhaps the data it is given is highly insufficient
00:02:39.400 | to do what you want it to do.
00:02:41.400 | That's the question that'll come up again and again.
00:02:43.400 | Are images enough to understand the world around you?
00:02:49.400 | And given these numbers,
00:02:52.400 | the set of numbers,
00:02:54.400 | sometimes with one channel,
00:02:56.400 | sometimes with three RGB,
00:02:58.400 | where every single pixel has three different colors.
00:03:01.400 | The task is to classify or regress.
00:03:07.400 | Producing continuous variable
00:03:09.400 | or one of a set of class labels.
00:03:13.400 | As before,
00:03:15.400 | we must be careful about our intuition
00:03:20.400 | of what is hard and what is easy in computer vision.
00:03:24.400 | Let's take a step back
00:03:29.400 | to the inspiration for neural networks.
00:03:33.400 | Our own biological neural networks.
00:03:36.400 | Because the human vision system
00:03:39.400 | and the computer vision system
00:03:41.400 | are a little bit more similar in these regards.
00:03:44.400 | The structure of the human visual cortex is in layers.
00:03:56.400 | And as information passes from the eyes
00:04:00.400 | to the parts of the brain that make sense
00:04:02.400 | of the raw sensor information,
00:04:05.400 | higher and higher order representations are formed.
00:04:08.400 | This is the inspiration, the idea behind
00:04:11.400 | using deep neural networks for images.
00:04:14.400 | Higher and higher order representations
00:04:16.400 | are formed through the layers.
00:04:19.400 | The early layers,
00:04:21.400 | taking in the very raw sensory information
00:04:24.400 | and extracting edges,
00:04:27.400 | connecting those edges,
00:04:28.400 | forming those edges to form more complex features
00:04:31.400 | and finally into the higher order semantic meaning
00:04:34.400 | that we hope to get from these images.
00:04:38.400 | In computer vision, deep learning is hard.
00:04:41.400 | I'll say this again,
00:04:43.400 | the illumination variability is the biggest challenge
00:04:46.400 | or at least one of the biggest challenges in driving
00:04:51.400 | for visible light cameras.
00:04:54.400 | Pose variability,
00:04:56.400 | the objects,
00:04:58.400 | as I'll also discuss about some of the advances
00:05:01.400 | from Jeff Hinton and the capsule networks,
00:05:03.400 | the idea with neural networks
00:05:06.400 | as they are currently used for computer vision
00:05:09.400 | are not good with representing variable pose.
00:05:13.400 | These objects in images
00:05:16.400 | and this 2D plane of color and texture
00:05:19.400 | look very different numerically
00:05:22.400 | when the object is rotated
00:05:25.400 | and the object is mangled and shaped in different ways.
00:05:28.400 | The deformable truncated cat.
00:05:31.400 | Inter-class variability,
00:05:33.400 | for the classification task
00:05:36.400 | which would be an example today throughout
00:05:39.400 | to introduce some of the networks
00:05:41.400 | over the past decade
00:05:42.400 | that have received success
00:05:43.400 | and some of the intuition and insight
00:05:45.400 | that made those networks work.
00:05:47.400 | Classification,
00:05:49.400 | there is a lot of variability inside the classes
00:05:52.400 | and very little variability between the classes.
00:05:56.400 | All of these are cats at top,
00:05:58.400 | all of those are dogs at bottom.
00:06:00.400 | They look very different
00:06:02.400 | and the other,
00:06:03.400 | I would say the second biggest problem
00:06:05.400 | in driving perception,
00:06:07.400 | visible light camera perception is occlusion.
00:06:09.400 | When part of the object is occluded,
00:06:11.400 | due to the three-dimensional
00:06:14.400 | nature of our world,
00:06:17.400 | some objects in front of others
00:06:19.400 | and they occlude the background object
00:06:23.400 | and yet we're still tasked with identifying
00:06:26.400 | the object when only part of it is visible.
00:06:29.400 | And sometimes that part,
00:06:30.400 | I told you there's cats,
00:06:32.400 | is barely visible.
00:06:34.400 | Here we're tasked with classifying a cat
00:06:37.400 | when just an ear is visible,
00:06:38.400 | just the leg.
00:06:40.400 | And on the philosophical level,
00:06:46.400 | as we'll talk about the motivation
00:06:48.400 | for our competition here,
00:06:50.400 | here's a cat dressed as a monkey eating a banana.
00:06:54.400 | On a philosophical level,
00:06:57.400 | most of us understand what's going on in the scene.
00:07:03.400 | In fact, a neural network,
00:07:07.400 | today successfully classified this image,
00:07:14.400 | this video as a cat.
00:07:17.400 | But the context,
00:07:20.400 | the humor of the situation,
00:07:21.400 | and the fact that you could argue it's a monkey,
00:07:24.400 | is missing.
00:07:26.400 | And what else is missing is the dynamic information,
00:07:30.400 | the temporal dynamics of the scene.
00:07:33.400 | That's what's missing in a lot of the perception work
00:07:37.400 | that has been done to date
00:07:39.400 | in the autonomous vehicle space
00:07:42.400 | in terms of visible light cameras.
00:07:44.400 | And we're looking to expand on that.
00:07:47.400 | That's what SegFuse is all about.
00:07:49.400 | Image classification pipeline,
00:07:51.400 | there's a bin with different categories
00:07:54.400 | inside each class,
00:07:56.400 | cat, dog, mug, hat.
00:07:58.400 | Those bins, there's a lot of examples of each.
00:08:01.400 | And you're tasked with,
00:08:03.400 | when a new example comes along you've never seen before,
00:08:05.400 | to put that image in a bin.
00:08:08.400 | It's the same as the machine learning task before.
00:08:11.400 | And everything relies on the data
00:08:14.400 | that's been ground truth,
00:08:16.400 | that's been labeled by human beings.
00:08:18.400 | MNIST is a toy data set of handwritten digits,
00:08:22.400 | often used as examples.
00:08:24.400 | And COCO, CIFAR, ImageNet, PLACES,
00:08:28.400 | and a lot of other incredible data sets,
00:08:30.400 | rich data sets of a hundred thousands,
00:08:32.400 | millions of images out there,
00:08:34.400 | represent scenes, people's faces,
00:08:37.400 | and different objects.
00:08:39.400 | Those are all ground truth data
00:08:42.400 | for testing algorithms,
00:08:43.400 | and for competing architectures
00:08:46.400 | to be evaluated against each other.
00:08:49.400 | CIFAR-10, one of the simplest,
00:08:52.400 | almost toy data sets of tiny icons
00:08:55.400 | with 10 categories,
00:08:56.400 | of airplane, automobile, bird, cat, deer,
00:08:59.400 | dog, frog, horse, ship, and truck,
00:09:01.400 | is commonly used to explore
00:09:03.400 | some of the basic convolutional neural networks
00:09:05.400 | we'll discuss.
00:09:06.400 | So let's come up with a very trivial classifier
00:09:08.400 | to explain the concept of how we could go about it.
00:09:12.400 | In fact, this is,
00:09:13.400 | maybe if you start to think about
00:09:15.400 | how to classify an image,
00:09:16.400 | if you don't know any of these techniques,
00:09:18.400 | this is perhaps the approach you would take,
00:09:21.400 | is you would subtract images.
00:09:23.400 | So in order to know that an image of a cat
00:09:26.400 | is different than an image of a dog,
00:09:27.400 | you have to compare them.
00:09:29.400 | When given those two images,
00:09:30.400 | what's the way you compare them?
00:09:33.400 | One way you could do it,
00:09:34.400 | is you just subtract it,
00:09:36.400 | and then sum all the pixel-wise differences
00:09:39.400 | in the image.
00:09:40.400 | Just subtract the intensity of the image,
00:09:42.400 | pixel by pixel, sum it up.
00:09:45.400 | If that difference is really high,
00:09:47.400 | that means the images are very different.
00:09:50.400 | Using that metric, we can look at CIFAR-10,
00:09:53.400 | and use it as a classifier.
00:09:56.400 | Saying, based on this difference function,
00:09:59.400 | I'm going to find one of the 10 bins for a new image
00:10:03.400 | that has the lowest difference.
00:10:10.400 | Find an image in this data set
00:10:12.400 | that is most like the image I have,
00:10:14.400 | and put it in the same bin as that image is in.
00:10:18.400 | So, there's 10 classes,
00:10:21.400 | if we just flip a coin,
00:10:22.400 | the accuracy of our classifier will be 10%.
00:10:25.400 | Using our image difference classifier,
00:10:28.400 | we can actually do pretty good,
00:10:30.400 | much better than random, much better than 10%.
00:10:33.400 | We can do 35, 38% accuracy.
00:10:37.400 | That's the classifier,
00:10:38.400 | we have our first classifier.
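A minimal NumPy sketch of the pixel-difference classifier just described; the arrays here are random stand-ins for CIFAR-10, and the names are illustrative rather than from the course starter code.

```python
import numpy as np

def l1_distance(a, b):
    # Sum of absolute pixel-wise differences between two images.
    return np.sum(np.abs(a.astype(np.int32) - b.astype(np.int32)))

def nearest_neighbor_classify(test_image, train_images, train_labels):
    # Compare the new image against every labeled example and
    # return the label of the single closest one (the "bin" it falls into).
    distances = [l1_distance(test_image, x) for x in train_images]
    return train_labels[int(np.argmin(distances))]

# Random stand-ins for CIFAR-10: 32x32x3 images, 10 classes.
train_images = np.random.randint(0, 256, (500, 32, 32, 3), dtype=np.uint8)
train_labels = np.random.randint(0, 10, 500)
test_image = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)
print(nearest_neighbor_classify(test_image, train_images, train_labels))
```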
00:10:44.400 | K-nearest neighbors.
00:10:46.400 | Let's take our classifier to a whole new level.
00:10:49.400 | Instead of comparing it to just,
00:10:51.400 | trying to find one image,
00:10:53.400 | that's the closest in our data set.
00:10:56.400 | We try to find K closest,
00:10:58.400 | and say, what class do the majority of them belong to?
00:11:03.400 | And we take that K,
00:11:04.400 | and increase it from 1 to 2 to 3 to 4 to 5.
00:11:08.400 | And see how that changes the problem.
00:11:12.400 | With 7 nearest neighbors,
00:11:14.400 | which is the optimal under this approach for CIFAR-10,
00:11:20.400 | we achieve 30% accuracy.
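The k-nearest-neighbor variant is a small change to the same idea: compute the L1 distance against every training image, take the k closest, and vote. A self-contained sketch with random stand-in data:

```python
import numpy as np
from collections import Counter

def knn_classify(test_image, train_images, train_labels, k=7):
    # Sum of absolute pixel-wise differences against every training image.
    diffs = np.abs(train_images.astype(np.int32) - test_image.astype(np.int32))
    distances = diffs.reshape(len(train_images), -1).sum(axis=1)
    # Take the k closest training images and return the majority label.
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_images = np.random.randint(0, 256, (500, 32, 32, 3), dtype=np.uint8)
train_labels = np.random.randint(0, 10, 500)
test_image = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)
print(knn_classify(test_image, train_images, train_labels, k=7))
```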
00:11:23.400 | Human level is 95% accuracy.
00:11:27.400 | And with convolutional neural networks,
00:11:29.400 | we get very close to 100%.
00:11:33.400 | That's where neural networks shine.
00:11:39.400 | This very task of binning images.
00:11:42.400 | It all starts at this basic computational unit.
00:11:45.400 | Signal in, each of the signals is weighted,
00:11:50.400 | summed, bias added,
00:11:54.400 | and put an input into a nonlinear activation function
00:11:58.400 | that produces an output.
00:12:00.400 | The nonlinear activation function is key.
00:12:04.400 | All of these put together,
00:12:06.400 | in more and more hidden layers,
00:12:09.400 | form a deep neural network.
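The computational unit just described, a weighted sum plus a bias passed through a nonlinearity, is only a few lines; ReLU is used here purely as one example of a nonlinear activation.

```python
import numpy as np

def neuron(x, w, b):
    # Weigh each input signal, sum, add a bias,
    # then pass through a nonlinear activation (ReLU here).
    z = np.dot(w, x) + b
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])   # input signals
w = np.array([0.8, 0.1, -0.4])   # learned weights
b = 0.2                          # learned bias
print(neuron(x, w, b))
```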
00:12:12.400 | And that deep neural network is trained,
00:12:14.400 | as we've discussed,
00:12:16.400 | by taking a forward pass,
00:12:18.400 | on examples of ground truth labels,
00:12:20.400 | seeing how close those labels are
00:12:22.400 | to the real ground truth,
00:12:24.400 | and then punishing the weights,
00:12:26.400 | that resulted in the incorrect decisions,
00:12:29.400 | and rewarding the weights,
00:12:30.400 | that resulted in correct decisions.
00:12:33.400 | For the case of 10 examples,
00:12:35.400 | the output of the network,
00:12:38.400 | is 10 different values.
00:12:43.400 | The input being handwritten digits,
00:12:46.400 | from 0 to 9, there's 10 of those.
00:12:50.400 | And we wanted our network to classify,
00:12:52.400 | what is in this image,
00:12:54.400 | of a handwritten digit.
00:12:56.400 | Is it 0, 1, 2, 3, through 9.
00:13:00.400 | The way it's often done,
00:13:02.400 | is there's 10 outputs of the network.
00:13:06.400 | And each of the neurons on the output,
00:13:09.400 | is responsible for getting really excited,
00:13:16.400 | when its number is called.
00:13:16.400 | And everybody else,
00:13:18.400 | is supposed to be not excited.
00:13:20.400 | Therefore, the number of classes,
00:13:23.400 | is the number of outputs.
00:13:24.400 | That's how it's commonly done.
00:13:27.400 | And you assign a class to the input image,
00:13:30.400 | based on the highest,
00:13:32.400 | the neuron which produces the highest output.
00:13:36.400 | But that's for a fully connected network,
00:13:38.400 | that we've discussed on Monday.
00:13:41.400 | There is in deep learning,
00:13:44.400 | a lot of tricks,
00:13:45.400 | that make things work,
00:13:47.400 | that make training much more efficient,
00:13:49.400 | on large class problems,
00:13:52.400 | where there's a lot of classes,
00:13:54.400 | on large data sets.
00:13:56.400 | When the representation,
00:13:57.400 | that the neural network is tasked with learning,
00:13:59.400 | is extremely complex.
00:14:01.400 | And that's where convolutional neural networks step in.
00:14:04.400 | The trick they use is spatial invariance.
00:14:07.400 | They use the idea that,
00:14:10.400 | a cat in the top left corner of an image,
00:14:13.400 | is the same as a cat,
00:14:14.400 | in the bottom right corner of an image.
00:14:17.400 | So we can learn the same features,
00:14:19.400 | across the image.
00:14:22.400 | That's where the convolution operation steps in.
00:14:26.400 | Instead of the fully connected networks,
00:14:28.400 | here there's a third dimension,
00:14:31.400 | of depth.
00:14:33.400 | So the blocks in this neural network,
00:14:36.400 | as input take 3D volumes,
00:14:38.400 | and as output produce 3D volumes.
00:14:46.400 | They take a slice of the image,
00:14:49.400 | a window,
00:14:50.400 | and slide it across.
00:14:52.400 | Applying the same exact weights,
00:14:54.400 | and we'll go through an example.
00:14:56.400 | The same exact weights,
00:14:58.400 | as in the fully connected network,
00:15:00.400 | on the edges that are used to,
00:15:02.400 | map the input to the output.
00:15:04.400 | Here are used to,
00:15:05.400 | map the slice of an image,
00:15:08.400 | this window of an image,
00:15:09.400 | to the output.
00:15:11.400 | And you can make several,
00:15:13.400 | many of such convolutional filters.
00:15:17.400 | Many layers,
00:15:19.400 | many different options of,
00:15:21.400 | what kind of features you look for in an image.
00:15:24.400 | What kind of window you slide across,
00:15:26.400 | in order to extract all kinds of things.
00:15:29.400 | All kinds of edges.
00:15:31.400 | All kind of higher order patterns in the images.
00:15:35.400 | The very important thing is,
00:15:37.400 | the parameters on each of these filters,
00:15:40.400 | the subset of the image,
00:15:41.400 | these windows,
00:15:42.400 | are shared.
00:15:44.400 | If the feature,
00:15:46.400 | that defines a cat,
00:15:47.400 | is useful in the top left corner,
00:15:49.400 | it's useful in the top right corner,
00:15:50.400 | it's useful in every aspect of the image.
00:15:53.400 | This is the trick,
00:15:54.400 | that makes convolutional neural networks,
00:15:56.400 | save a lot of,
00:15:58.400 | a lot of parameters,
00:16:00.400 | reduce parameters significantly.
00:16:03.400 | It's the reuse,
00:16:04.400 | the spatial sharing of features,
00:16:06.400 | across the space of the image.
00:16:11.400 | The depth of these 3D volumes,
00:16:14.400 | is the number of filters.
00:16:16.400 | The stride is the skip of the filter,
00:16:20.400 | the step size.
00:16:21.400 | How many pixels you skip,
00:16:23.400 | when you apply the filter to the input.
00:16:27.400 | And the padding,
00:16:29.400 | is the padding,
00:16:31.400 | the zero padding on the outside of the input,
00:16:34.400 | to a convolutional layer.
00:16:37.400 | Let's go through an example.
00:16:40.400 | So, on the left here,
00:16:43.400 | and the slides are now available online,
00:16:45.400 | you can follow them along.
00:16:47.400 | And I'll step through this example.
00:16:49.400 | On the left here is,
00:16:51.400 | input volume of three channels.
00:16:54.400 | The left column is the input.
00:16:57.400 | The three squares there,
00:16:59.400 | are the three channels.
00:17:01.400 | And there's numbers,
00:17:03.400 | inside those channels.
00:17:06.400 | And then we have a filter in red.
00:17:11.400 | Two of them,
00:17:13.400 | two channels of filters,
00:17:15.400 | with a bias.
00:17:17.400 | And those filters are three by three.
00:17:19.400 | Each one of them,
00:17:21.400 | is size three by three.
00:17:24.400 | And what we do is,
00:17:25.400 | we take those three by three filters,
00:17:28.400 | that are to be learned.
00:17:30.400 | These are our variables,
00:17:31.400 | our weights that we have to learn.
00:17:33.400 | And then we slide it across an image,
00:17:36.400 | to produce the output on the right,
00:17:38.400 | the green.
00:17:40.400 | So by applying the filters in the red,
00:17:42.400 | there's two of them,
00:17:44.400 | and within each one,
00:17:45.400 | there's one for every input channel.
00:17:48.400 | We go from the left,
00:17:50.400 | to the right.
00:17:51.400 | From the input volume on the left,
00:17:53.400 | to the output volume green on the right.
00:17:57.400 | And you can look,
00:17:59.400 | you can pull up the slides yourself now,
00:18:01.400 | if you can't see the numbers on the screen.
00:18:04.400 | But the operations,
00:18:08.400 | are performed on the input,
00:18:10.400 | to produce the single value,
00:18:12.400 | that's highlighted there in the green,
00:18:14.400 | in the output.
00:18:15.400 | And we slide this convolutional
00:18:18.400 | filter,
00:18:19.400 | along the image.
00:18:21.400 | With a stride, in this case,
00:18:25.400 | of two,
00:18:27.400 | skipping,
00:18:28.400 | skipping along.
00:18:30.400 | The results sum to produce, on the right,
00:18:33.400 | the two channel output,
00:18:37.400 | in green.
00:18:38.400 | That's it, that's the convolutional operation.
00:18:41.400 | That's what's called the convolutional layer in neural networks.
00:18:45.400 | And the parameters here,
00:18:47.400 | besides the bias,
00:18:48.400 | are the red values in the middle.
00:18:51.400 | That's what we're trying to learn.
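The sliding-window arithmetic in that example can be written out directly. Below is a NumPy sketch of one convolutional layer whose shapes mirror the slide (3-channel input, two 3x3x3 filters, stride 2, zero padding 1), though the numbers here are random.

```python
import numpy as np

def conv2d(x, filters, biases, stride=2, pad=1):
    # x: (C, H, W) input volume; filters: (K, C, 3, 3); biases: (K,)
    c, h, w = x.shape
    k, _, fh, fw = filters.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode='constant')
    oh = (h + 2 * pad - fh) // stride + 1
    ow = (w + 2 * pad - fw) // stride + 1
    out = np.zeros((k, oh, ow))
    for f in range(k):                      # each filter -> one output channel
        for i in range(oh):
            for j in range(ow):
                window = xp[:, i*stride:i*stride+fh, j*stride:j*stride+fw]
                out[f, i, j] = np.sum(window * filters[f]) + biases[f]
    return out

x = np.random.randn(3, 5, 5)               # 3-channel 5x5 input volume
filters = np.random.randn(2, 3, 3, 3)      # two filters, shared across the image
biases = np.zeros(2)
print(conv2d(x, filters, biases).shape)    # (2, 3, 3): two-channel output volume
```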
00:18:54.400 | And there's a lot of interesting tricks,
00:18:56.400 | we'll discuss today on top of those.
00:18:58.400 | But this is at the core.
00:19:00.400 | This is the spatially invariant,
00:19:02.400 | sharing of parameters,
00:19:04.400 | that make convolutional neural networks,
00:19:07.400 | able to efficiently,
00:19:09.400 | learn and find patterns and images.
00:19:13.400 | To build your intuition a little bit more,
00:19:16.400 | about convolution,
00:19:17.400 | here's an input image on the left.
00:19:19.400 | And on the right,
00:19:21.400 | the identity filter,
00:19:23.400 | produces the output you see on the right.
00:19:26.400 | And then there's different ways,
00:19:28.400 | you can, different kinds of edges,
00:19:30.400 | you can extract,
00:19:32.400 | with the result in activation map,
00:19:35.400 | seen on the right.
00:19:37.400 | So when applying the filters,
00:19:39.400 | with those edge detection filters,
00:19:42.400 | to the image on the left,
00:19:43.400 | you produce in white,
00:19:45.400 | are the parts that activate,
00:19:47.400 | the convolution.
00:19:49.400 | The results of these filters.
00:19:54.400 | And so you can do any kind of filter,
00:19:56.400 | that's what we're trying to learn.
00:19:58.400 | Any kind of edge,
00:20:00.400 | any kind of,
00:20:01.400 | any kind of pattern,
00:20:03.400 | you can move along in this window,
00:20:05.400 | in this way that's shown here,
00:20:06.400 | you slide around the image,
00:20:08.400 | and you produce,
00:20:09.400 | the output you see on the right.
00:20:11.400 | And depending on how many filters,
00:20:13.400 | you have in every level,
00:20:14.400 | you have many of such slices,
00:20:16.400 | that you see on the right.
00:20:17.400 | The input on the left,
00:20:18.400 | the output on the right.
00:20:20.400 | If you have,
00:20:22.400 | dozens of filters,
00:20:23.400 | you have dozens of images on the right,
00:20:25.400 | each with different results,
00:20:27.400 | that show,
00:20:29.400 | where each of the individual,
00:20:31.400 | filter patterns were found.
00:20:33.400 | And we learn,
00:20:34.400 | what patterns are useful to look for,
00:20:37.400 | in order to perform the classification task.
00:20:40.400 | That's the task,
00:20:41.400 | for the neural network,
00:20:42.400 | to learn these filters.
00:20:45.400 | And the filters,
00:20:46.400 | have higher and higher order,
00:20:49.400 | of representation.
00:20:52.400 | Going from the very basic edges,
00:20:55.400 | to the high semantic,
00:20:57.400 | meaning that spans entire images.
00:21:00.400 | And the ability to span images,
00:21:04.400 | can be done in several ways.
00:21:06.400 | But traditionally has been successfully done,
00:21:08.400 | through max pooling,
00:21:09.400 | through pooling.
00:21:10.400 | Of taking the output,
00:21:13.400 | of a convolutional operation,
00:21:17.400 | and reducing the resolution of that,
00:21:20.400 | by condensing that information,
00:21:23.400 | by for example,
00:21:24.400 | taking the maximum values,
00:21:26.400 | the maximum activations.
00:21:28.400 | Therefore reducing the,
00:21:33.400 | spatial resolution,
00:21:35.400 | which has detrimental effects,
00:21:36.400 | as we'll talk about in scene segmentation.
00:21:39.400 | But it's beneficial,
00:21:40.400 | for finding higher order representations,
00:21:43.400 | in the images,
00:21:44.400 | that bring images together.
00:21:46.400 | That bring features together,
00:21:47.400 | to form an entity,
00:21:49.400 | that we're trying to identify and classify.
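Max pooling itself is simple to write down: within each window, keep only the strongest activation, which reduces the spatial resolution by the stride. A NumPy sketch:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    # x: (C, H, W) activation volume. For each window, keep the maximum
    # activation, condensing the information and reducing resolution.
    c, h, w = x.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((c, oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[:, i, j] = x[:, i*stride:i*stride+size,
                               j*stride:j*stride+size].max(axis=(1, 2))
    return out

print(max_pool(np.random.randn(16, 8, 8)).shape)  # (16, 4, 4)
```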
00:21:51.400 | Okay.
00:21:53.400 | So that forms,
00:21:55.400 | a convolutional neural network.
00:21:57.400 | Such convolutional layers,
00:21:58.400 | stacked on top of each other,
00:22:00.400 | is the only addition,
00:22:01.400 | to a neural network that makes,
00:22:03.400 | for a convolutional neural network.
00:22:05.400 | And then at the end,
00:22:06.400 | the fully connected layers,
00:22:08.400 | or any kind of other architectures,
00:22:11.400 | allow us to apply them to particular domains.
00:22:15.400 | Let's take ImageNet,
00:22:17.400 | as a case study.
00:22:19.400 | In ImageNet,
00:22:21.400 | the data set,
00:22:23.400 | in ImageNet,
00:22:24.400 | the challenge,
00:22:26.400 | the task is classification.
00:22:28.400 | As I mentioned in the first lecture,
00:22:31.400 | ImageNet is a data set,
00:22:33.400 | one of the largest in the world of images.
00:22:36.400 | With 14 million images,
00:22:38.400 | 21,000 categories.
00:22:40.400 | And a lot of depth,
00:22:44.400 | to many of the categories.
00:22:45.400 | As I mentioned,
00:22:46.400 | 1200 Granny Smith apples.
00:22:48.400 | These allow to,
00:22:53.400 | these allow the neural networks to,
00:22:55.400 | learn the rich representations,
00:22:58.400 | in both pose,
00:22:59.400 | lighting variability,
00:23:00.400 | and intra-class
00:23:01.400 | variation,
00:23:02.400 | for the particular things,
00:23:03.400 | particular classes,
00:23:05.400 | like Granny Smith apples.
00:23:09.400 | let's look through the various networks.
00:23:11.400 | Let's discuss them,
00:23:12.400 | let's see the insights.
00:23:13.400 | It started with AlexNet,
00:23:15.400 | the first,
00:23:16.400 | really big successful,
00:23:18.400 | GPU trained neural network,
00:23:19.400 | on ImageNet,
00:23:20.400 | that's achieved a significant boost,
00:23:22.400 | over the previous year.
00:23:24.400 | And moved on to VGGNet,
00:23:26.400 | GoogleNet,
00:23:29.400 | Agulinet,
00:23:31.400 | ResNet,
00:23:33.400 | CU Image,
00:23:34.400 | and SENet,
00:23:36.400 | in 2017.
00:23:38.400 | Again,
00:23:41.400 | the numbers we'll show
00:23:42.400 | for the accuracy
00:23:43.400 | are based on the
00:23:44.400 | top five error rate.
00:23:46.400 | We get five guesses,
00:23:48.400 | and it's a one or zero.
00:23:50.400 | If you guess,
00:23:51.400 | if one of the five is correct,
00:23:52.400 | you get a one,
00:23:53.400 | for that particular guess.
00:23:54.400 | Otherwise, it's a zero.
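Computing the top-five error rate is straightforward; a sketch, assuming the network produces one score per class:

```python
import numpy as np

def top5_error(scores, labels):
    # scores: (N, num_classes) network outputs; labels: (N,) ground truth.
    # An example counts as correct (a one) if the true label is among
    # the five highest-scoring guesses; otherwise it's a zero.
    top5 = np.argsort(scores, axis=1)[:, -5:]
    correct = np.array([labels[i] in top5[i] for i in range(len(labels))])
    return 1.0 - correct.mean()

scores = np.random.randn(100, 1000)           # e.g. 1000 ImageNet classes
labels = np.random.randint(0, 1000, 100)
print(top5_error(scores, labels))
```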
00:24:02.400 | human error is 5.1.
00:24:04.400 | When a human,
00:24:05.400 | tries to achieve the same,
00:24:06.400 | tries to,
00:24:07.400 | perform the same task,
00:24:08.400 | as the task the machine is doing,
00:24:10.400 | the error is 5.1.
00:24:12.400 | The human annotation,
00:24:13.400 | is performed on the images,
00:24:14.400 | based on binary classification.
00:24:16.400 | Granny Smith,
00:24:17.400 | Apple or not,
00:24:18.400 | cat or not.
00:24:20.400 | The actual task,
00:24:21.400 | that the machine has to perform,
00:24:23.400 | and that the human competing,
00:24:25.400 | has to perform,
00:24:26.400 | is, given an image,
00:24:27.400 | to provide
00:24:28.400 | one of the many classes.
00:24:30.400 | Under that,
00:24:31.400 | human error is 5.1%,
00:24:33.400 | which was surpassed,
00:24:34.400 | in 2015,
00:24:36.400 | by ResNet,
00:24:37.400 | to achieve 4% error.
00:24:41.400 | So, let's start with,
00:24:43.400 | AlexNet.
00:24:44.400 | I'll zoom in on the later networks,
00:24:46.400 | they have some interesting insights.
00:24:48.400 | But, AlexNet,
00:24:49.400 | and VGGNet,
00:24:51.400 | both followed a very similar architecture.
00:24:54.400 | Very uniform throughout its depth.
00:24:57.400 | VGGNet in 2014,
00:25:02.400 | is convolution,
00:25:04.400 | convolution pooling,
00:25:06.400 | convolution pooling,
00:25:07.400 | convolution pooling,
00:25:08.400 | convolution pooling,
00:25:09.400 | convolution pooling,
00:25:10.400 | and fully connected layers at the end.
00:25:12.400 | There's a certain kind of beautiful simplicity,
00:25:16.400 | uniformity to these architectures.
00:25:18.400 | Because you can just make it deeper and deeper,
00:25:20.400 | and makes it very amenable to,
00:25:22.400 | implementation in a layer stack kind of way,
00:25:26.400 | in any of the deep learning frameworks.
00:25:28.400 | It's clean and beautiful to understand.
00:25:31.400 | In the case of VGGNet,
00:25:32.400 | 16 or 19 layers,
00:25:34.400 | with 138 million parameters,
00:25:36.400 | not many optimizations on these parameters,
00:25:38.400 | therefore,
00:25:39.400 | the number of parameters is much higher than
00:25:41.400 | the networks that followed it.
00:25:43.400 | Despite the layers not being that large.
00:25:45.400 | GoogleNet introduced the inception module,
00:25:50.400 | starting to do some interesting things,
00:25:53.400 | with the small modules within these networks,
00:25:57.400 | which allow for the training to be more,
00:25:59.400 | efficient and effective.
00:26:01.400 | The idea behind the inception module shown here,
00:26:05.400 | with the previous layer on bottom,
00:26:09.400 | and the convolutional layer,
00:26:13.400 | here with the inception module,
00:26:15.400 | on top,
00:26:17.400 | produced on top,
00:26:19.400 | is it used the idea that,
00:26:23.400 | different size convolutions,
00:26:25.400 | provide different value for the network.
00:26:27.400 | Smaller convolutions are able to capture,
00:26:31.400 | or propagate forward,
00:26:33.400 | features that are very local,
00:26:36.400 | at a high resolution in texture.
00:26:40.400 | Larger convolutions are better able to,
00:26:44.400 | represent and capture and catch,
00:26:47.400 | highly abstracted features,
00:26:49.400 | higher order features.
00:26:51.400 | So the idea behind the inception module,
00:26:53.400 | is to say well,
00:26:55.400 | as opposed to choosing,
00:26:57.400 | in a hyper parameter tuning process,
00:26:59.400 | or architecture design process,
00:27:01.400 | choosing which convolution size,
00:27:03.400 | we want to go with,
00:27:05.400 | why not do all of them together,
00:27:07.400 | well several together.
00:27:08.400 | In the case of the GoogleNet model,
00:27:11.400 | there's the 1x1, 3x3 and 5x5 convolutions,
00:27:15.400 | with the old trusty friend of max pooling,
00:27:18.400 | still left in there as well,
00:27:20.400 | which has lost favor,
00:27:23.400 | more and more over time,
00:27:24.400 | for the image classification task.
00:27:26.400 | And the result is,
00:27:28.400 | there's fewer parameters are required,
00:27:30.400 | if you pick,
00:27:32.400 | the placing of these,
00:27:34.400 | inception modules correctly,
00:27:36.400 | the number of parameters required,
00:27:38.400 | to achieve a higher performance,
00:27:41.400 | is much lower.
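A minimal sketch of that idea, written in PyTorch only for illustration: run 1x1, 3x3, and 5x5 convolutions plus max pooling in parallel and concatenate along the channel dimension. The channel counts are arbitrary, and the real GoogLeNet module also inserts 1x1 convolutions to reduce dimensionality before the larger filters.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    # Parallel branches with different convolution sizes; their outputs
    # are concatenated along the channel dimension.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)
        self.pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, out_ch, kernel_size=1))

    def forward(self, x):
        branches = [self.branch1(x), self.branch3(x),
                    self.branch5(x), self.pool(x)]
        return torch.cat(branches, dim=1)

x = torch.randn(1, 64, 32, 32)
print(InceptionModule(64, 32)(x).shape)  # torch.Size([1, 128, 32, 32])
```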
00:27:43.400 | ResNet,
00:27:47.400 | one of the most popular,
00:27:48.400 | still to date,
00:27:50.400 | architectures,
00:27:54.400 | that we'll discuss in,
00:27:56.400 | in scene segmentation as well.
00:27:59.400 | Came up and used,
00:28:01.400 | the idea of a residual block.
00:28:03.400 | The initial,
00:28:06.400 | inspiring observation,
00:28:07.400 | which doesn't necessarily,
00:28:09.400 | hold true as it turns out,
00:28:10.400 | but that network depth,
00:28:13.400 | increases representation power.
00:28:16.400 | So these residual blocks,
00:28:18.400 | allow you to have much deeper networks,
00:28:21.400 | and I'll explain,
00:28:22.400 | why in a second here.
00:28:24.400 | the thought was,
00:28:26.400 | they work so well,
00:28:27.400 | because the networks are much deeper.
00:28:29.400 | The key thing,
00:28:31.400 | that makes these blocks so effective,
00:28:34.400 | is the same idea,
00:28:39.400 | that's reminiscent of recurrent neural networks,
00:28:39.400 | that I hope we get a chance to talk about.
00:28:42.400 | The training of them is much easier.
00:28:46.400 | They take a simple block,
00:28:49.400 | repeated over and over,
00:28:51.400 | and they pass the input along,
00:28:54.400 | without transformation,
00:28:56.400 | along with the ability,
00:28:58.400 | to transform it,
00:28:59.400 | to learn,
00:29:00.400 | to learn the filters,
00:29:01.400 | learn the weights.
00:29:03.400 | So you're allowed to,
00:29:06.400 | you're allowed every layer,
00:29:08.400 | to not only take on,
00:29:10.400 | the processing of previous layers,
00:29:13.400 | but to take in the raw and transform data,
00:29:16.400 | and learn something new.
00:29:18.400 | The ability to learn something new,
00:29:21.400 | allows you to have,
00:29:22.400 | much deeper networks,
00:29:25.400 | and the simplicity of this block,
00:29:27.400 | allows for more effective training.
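A PyTorch sketch of that residual block: the untransformed input is passed along and added back to the learned transformation, so each block only has to learn something new. The identity shortcut assumes input and output channel counts match.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The input is passed along without transformation and added back in.
        return F.relu(out + x)

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```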
00:29:30.400 | The state-of-the-art,
00:29:34.400 | in 2017,
00:29:35.400 | the winner is,
00:29:36.400 | Squeeze and Excitation Networks.
00:29:38.400 | That unlike the previous year,
00:29:41.400 | with CU Image,
00:29:42.400 | which simply took ensemble methods,
00:29:44.400 | and combined a lot of successful approaches,
00:29:47.400 | to take a marginal improvement.
00:29:49.400 | SENet,
00:29:51.400 | got a significant improvement,
00:29:54.400 | at least in percentages,
00:29:55.400 | I think it's a 25% reduction,
00:29:57.400 | in error,
00:29:59.400 | from 4% to 3%,
00:30:03.400 | something like that.
00:30:05.400 | By using a very simple idea,
00:30:08.400 | that I think is important to mention,
00:30:10.400 | a simple insight.
00:30:12.400 | It added a parameter,
00:30:14.400 | to each channel,
00:30:15.400 | in the convolutional layer,
00:30:18.400 | in the convolutional block.
00:30:20.400 | So the network,
00:30:21.400 | can now adjust the weighting,
00:30:23.400 | on each channel,
00:30:25.400 | based,
00:30:27.400 | for each feature map,
00:30:28.400 | based on the content,
00:30:29.400 | based on the input to the network.
00:30:31.400 | This is kind of a,
00:30:33.400 | a takeaway to think about,
00:30:35.400 | about any of the networks,
00:30:36.400 | to talk about any of the architectures.
00:30:38.400 | Is, a lot of times,
00:30:41.400 | your recurrent neural networks,
00:30:43.400 | and convolutional neural networks,
00:30:45.400 | have tricks,
00:30:46.400 | that significantly reduce,
00:30:47.400 | the number of parameters,
00:30:49.400 | the bulk, the sort of low-hanging fruit.
00:30:52.400 | They use spatial invariance,
00:30:54.400 | the temporal invariance,
00:30:55.400 | to reduce the number of parameters,
00:30:57.400 | to represent the input data.
00:30:58.400 | But, they also leave certain things,
00:31:02.400 | not parameterized.
00:31:03.400 | They don't allow the network to learn it.
00:31:05.400 | Allowing in this case,
00:31:06.400 | the network to learn the weighting,
00:31:08.400 | on each of the individual channels,
00:31:10.400 | so on each of the individual filters,
00:31:12.400 | is something that you learn
00:31:14.400 | along with the filters themselves,
00:31:16.400 | and it makes a huge boost.
00:31:18.400 | The cool thing about this,
00:31:19.400 | is it's applicable to any architecture.
00:31:21.400 | This kind of block,
00:31:23.400 | this kind of,
00:31:24.400 | the squeeze and excitation block,
00:31:26.400 | is applicable to any architecture.
00:31:31.400 | because obviously,
00:31:33.400 | it just simply parameterizes
00:31:35.400 | the ability to choose,
00:31:37.400 | which filter you go with,
00:31:38.400 | based on the content.
00:31:39.400 | It's a subtle, but crucial thing.
00:31:41.400 | I think it's pretty cool.
00:31:43.400 | And, for future research,
00:31:44.400 | it inspires to think about,
00:31:46.400 | what else can be parameterized,
00:31:48.400 | in neural networks?
00:31:49.400 | What else can be controlled,
00:31:50.400 | as part of the learning process?
00:31:52.400 | Including higher and higher order,
00:31:54.400 | hyperparameters.
00:31:55.400 | Which aspects of the training,
00:31:58.400 | and the architecture of the network,
00:32:00.400 | can be part of the learning?
00:32:02.400 | This is what this network inspires.
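A sketch of the squeeze-and-excitation block: globally pool each feature map down to a single number, pass those through a small bottleneck, and use the resulting per-channel weights to rescale the feature maps, so the weighting on each channel is learned from the content. The reduction ratio and channel counts here are illustrative, and PyTorch is used only as an example framework.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)     # one number per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())                          # per-channel weight in [0, 1]

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)
        w = self.excite(w).view(b, c, 1, 1)
        return x * w                               # reweight each feature map

x = torch.randn(1, 64, 32, 32)
print(SEBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```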
00:32:05.400 | Another network,
00:32:13.400 | has been in development since the 90s
00:32:15.400 | as an idea
00:32:16.400 | by Jeff Hinton,
00:32:18.400 | but really
00:32:19.400 | was published on
00:32:20.400 | and received significant attention in 2017.
00:32:23.400 | That I won't go into detail here.
00:32:26.400 | We are going to release,
00:32:29.400 | an online only video,
00:32:32.400 | about capsule networks.
00:32:34.400 | It's a little bit too technical,
00:32:36.400 | but they inspire a very important point,
00:32:39.400 | that we should always think about,
00:32:42.400 | with deep learning.
00:32:43.400 | Whenever it's successful.
00:32:45.400 | It's to think about,
00:32:46.400 | what, as I mentioned,
00:32:48.400 | with the cat eating a banana,
00:32:50.400 | on a philosophical,
00:32:52.400 | and the mathematical level,
00:32:53.400 | we have to consider,
00:32:55.400 | what assumptions these networks make,
00:32:58.400 | and what,
00:32:59.400 | through those assumptions,
00:33:00.400 | they throw away.
00:33:01.400 | So neural networks,
00:33:03.400 | due to the spatial,
00:33:04.400 | with convolutional neural networks,
00:33:06.400 | due to their spatial invariance,
00:33:08.400 | throw away information,
00:33:09.400 | about the relationship,
00:33:11.400 | between,
00:33:13.400 | the hierarchies,
00:33:15.400 | between the simple,
00:33:16.400 | and the complex objects.
00:33:17.400 | So the face on the left,
00:33:18.400 | and the face on the right,
00:33:20.400 | looks the same,
00:33:21.400 | to a convolutional neural network.
00:33:23.400 | The presence of eyes,
00:33:24.400 | and nose,
00:33:25.400 | and mouth,
00:33:26.400 | is the essential aspect,
00:33:29.400 | of what makes,
00:33:31.400 | the classification task work,
00:33:33.400 | for convolutional network.
00:33:34.400 | Where it will fire,
00:33:36.400 | and say this is definitely a face.
00:33:38.400 | But the spatial relationship,
00:33:40.400 | is lost,
00:33:41.400 | is ignored,
00:33:43.400 | which means,
00:33:44.400 | there's a lot of implications to this,
00:33:47.400 | for things like,
00:33:49.400 | pose variation,
00:33:50.400 | that information is lost.
00:33:53.400 | We're throwing
00:33:54.400 | that away completely,
00:33:56.400 | and hoping that,
00:33:57.400 | the pooling operation,
00:33:58.400 | that's performed in these networks,
00:34:01.400 | is able to sort of,
00:34:02.400 | mesh everything together,
00:34:04.400 | to come up with the features,
00:34:06.400 | that are firing,
00:34:07.400 | of the different parts of the face,
00:34:08.400 | to then come up with the total classification,
00:34:10.400 | that it's a face.
00:34:11.400 | Without representing,
00:34:12.400 | really the relationship,
00:34:13.400 | between these features,
00:34:14.400 | at the low level,
00:34:15.400 | and the high level.
00:34:17.400 | At the low level of the hierarchy,
00:34:19.400 | at the simple,
00:34:20.400 | and the complex level.
00:34:22.400 | This is a super exciting field now,
00:34:25.400 | that hopefully will spark
00:34:26.400 | developments of how we design neural networks,
00:34:29.400 | that are able to learn,
00:34:30.400 | the rotational,
00:34:33.400 | the orientation,
00:34:34.400 | in variance,
00:34:35.400 | as well.
00:34:37.400 | Okay, so as I mentioned,
00:34:42.400 | you take these,
00:34:44.400 | convolutional neural networks,
00:34:45.400 | chop off the final layer,
00:34:47.400 | in order to apply,
00:34:48.400 | to a particular domain.
00:34:50.400 | And that is what we'll do,
00:34:52.400 | with fully convolutional neural networks.
00:34:54.400 | The ones that we tasked,
00:34:55.400 | to segment the image,
00:34:56.400 | at a pixel level.
00:34:58.400 | As a reminder,
00:35:01.400 | these networks,
00:35:02.400 | through the convolutional process,
00:35:05.400 | are really producing,
00:35:08.400 | a heat map.
00:35:09.400 | Different parts of the network,
00:35:11.400 | are getting excited,
00:35:12.400 | based on the different,
00:35:13.400 | aspects of the image.
00:35:15.400 | And so it can be used,
00:35:16.400 | to do the localization of detecting,
00:35:18.400 | not just classifying the image,
00:35:20.400 | but localizing the object.
00:35:22.400 | And they could do so,
00:35:23.400 | at a pixel level.
00:35:25.400 | So the convolutional layers,
00:35:28.400 | are doing the,
00:35:30.400 | encoding process.
00:35:31.400 | They're taking the rich,
00:35:32.400 | raw sensory information,
00:35:35.400 | in the image,
00:35:36.400 | and encoding them,
00:35:37.400 | into an interpretable set of features,
00:35:39.400 | representation,
00:35:41.400 | that can then be used for classification.
00:35:43.400 | But we can also then use a decoder,
00:35:45.400 | up sample that information,
00:35:47.400 | and produce a map like this.
00:35:50.400 | Fully convolutional neural networks,
00:35:52.400 | segmentation,
00:35:53.400 | semantic scene segmentation,
00:35:55.400 | image segmentation.
00:35:56.400 | The goal is to,
00:35:57.400 | as opposed to classify the entire image,
00:35:59.400 | you classify every single pixel.
00:36:02.400 | It's pixel level segmentation.
00:36:04.400 | You color every single pixel,
00:36:05.400 | with what that pixel,
00:36:07.400 | what object that pixel belongs to,
00:36:09.400 | in this 2D space of the image.
00:36:11.400 | The 2D projection,
00:36:14.400 | in the image of a three-dimensional world.
00:36:18.400 | So the thing is,
00:36:20.400 | there's been a lot of advancement,
00:36:21.400 | in the last three years.
00:36:25.400 | But it's still an incredibly difficult problem.
00:36:29.400 | If you think about,
00:36:32.400 | the amount of data that's used,
00:36:36.400 | for training,
00:36:37.400 | and the task of pixel level,
00:36:39.400 | of megapixels here,
00:36:41.400 | of millions of pixels,
00:36:43.400 | that are each tasked with
00:36:44.400 | being assigned a single label.
00:36:46.400 | It's an extremely difficult problem.
00:36:48.400 | Why is this interesting,
00:36:52.400 | important problem to try to solve,
00:36:54.400 | as opposed to bounding boxes,
00:36:56.400 | around cats?
00:36:57.400 | Well,
00:36:58.400 | it's whenever precise boundaries,
00:37:00.400 | of objects are important.
00:37:01.400 | Certainly medical applications,
00:37:03.400 | when looking at imaging,
00:37:05.400 | and detecting particular,
00:37:07.400 | for example,
00:37:08.400 | detecting tumors,
00:37:09.400 | in medical imaging,
00:37:13.400 | of different organs.
00:37:15.400 | And in driving,
00:37:18.400 | in robotics,
00:37:20.400 | when objects are involved,
00:37:22.400 | it's a dense scene,
00:37:23.400 | involved with vehicles,
00:37:24.400 | pedestrians, cyclists.
00:37:25.400 | We need to be able to,
00:37:27.400 | not just have a loose estimate,
00:37:29.400 | of where objects are.
00:37:30.400 | We need to be able to have,
00:37:32.400 | the exact boundaries.
00:37:33.400 | And then potentially,
00:37:35.400 | through data fusion,
00:37:37.400 | fusing sensors together,
00:37:39.400 | fusing this rich textural information,
00:37:41.400 | about pedestrians, cyclists,
00:37:43.400 | and vehicles,
00:37:44.400 | to LIDAR data,
00:37:45.400 | that's providing us the three-dimensional,
00:37:47.400 | map of the world.
00:37:48.400 | We'll have both,
00:37:49.400 | the semantic meaning,
00:37:50.400 | of the different objects,
00:37:51.400 | and their exact three-dimensional location.
00:37:53.400 | A lot of this work,
00:38:01.400 | successfully,
00:38:03.400 | a lot of the work in the semantic segmentation,
00:38:05.400 | started with,
00:38:06.400 | fully convolutional networks,
00:38:08.400 | for semantic segmentation paper.
00:38:11.400 | That's where the name FCN came from,
00:38:12.400 | in November 2014.
00:38:14.400 | Now go through a few papers here,
00:38:16.400 | to give you some intuition,
00:38:18.400 | where the field is gone.
00:38:20.400 | And how that takes us to segfuse,
00:38:23.400 | the segmentation competition.
00:38:25.400 | So FCN,
00:38:27.400 | repurposed the ImageNet pre-trained nets.
00:38:29.400 | The nets that were trained,
00:38:31.400 | to classify what's in an image,
00:38:33.400 | the entire image.
00:38:35.400 | And chopped off,
00:38:37.400 | the fully connected layers.
00:38:39.400 | And then added decoder parts,
00:38:41.400 | that up-sampled the image,
00:38:44.400 | to produce a heat map.
00:38:48.400 | Here shown,
00:38:49.400 | with a tabby cat,
00:38:51.400 | a heat map of where the cat is in the image.
00:38:53.400 | It's a much lower,
00:38:55.400 | much coarser resolution,
00:38:57.400 | than the input image.
00:38:59.400 | 1/8 at best.
00:39:01.400 | Skip connections,
00:39:04.400 | to improve coarseness of up-sampling.
00:39:06.400 | There's a few tricks.
00:39:09.400 | If you do the most naive approach,
00:39:11.400 | the up-sampling is going to be extremely coarse.
00:39:14.400 | Because that's the whole point,
00:39:16.400 | of the neural network.
00:39:17.400 | The encoding part,
00:39:18.400 | is you throw away all the useless data,
00:39:21.400 | to the most essential aspects,
00:39:24.400 | that represent that image.
00:39:26.400 | So you're throwing away a lot of information,
00:39:27.400 | that's necessary,
00:39:29.400 | to then form a high-resolution image.
00:39:32.400 | So there's a few tricks,
00:39:34.400 | where you skip a few of the final,
00:39:37.400 | pooling operations,
00:39:39.400 | to go in similar ways,
00:39:41.400 | as a residual block,
00:39:43.400 | to go to the output,
00:39:45.400 | produce higher and higher,
00:39:47.400 | resolution heat map at the end.
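A toy sketch of that encoder-decoder shape with one skip connection: the encoder downsamples, transposed convolutions upsample back, and a higher-resolution feature map from earlier in the encoder is added in to recover some of the lost detail. Layer sizes are arbitrary, and a real FCN reuses a pretrained classification backbone as the encoder.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                  nn.MaxPool2d(2))          # 1/2 resolution
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                                  nn.MaxPool2d(2))          # 1/4 resolution
        self.score1 = nn.Conv2d(32, num_classes, 1)          # skip prediction
        self.score2 = nn.Conv2d(64, num_classes, 1)
        self.up2 = nn.ConvTranspose2d(num_classes, num_classes, 2, stride=2)
        self.up1 = nn.ConvTranspose2d(num_classes, num_classes, 2, stride=2)

    def forward(self, x):
        f1 = self.enc1(x)                 # higher-resolution features
        f2 = self.enc2(f1)                # coarser, more semantic features
        coarse = self.up2(self.score2(f2))
        fused = coarse + self.score1(f1)  # skip connection restores detail
        return self.up1(fused)            # per-pixel class scores

x = torch.randn(1, 3, 128, 128)
print(TinyFCN(num_classes=19)(x).shape)  # torch.Size([1, 19, 128, 128])
```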
00:39:50.400 | SegNet in 2015,
00:39:53.400 | applied this to the driving context.
00:39:56.400 | And really, taking a Dikiti dataset,
00:39:58.400 | and have shown a lot of interesting results,
00:40:02.400 | and really explored the encoder-decoder,
00:40:05.400 | formulation of the problem.
00:40:07.400 | Really solidifying the place,
00:40:11.400 | of the encoder-decoder framework,
00:40:13.400 | for the segmentation task.
00:40:16.400 | Dilated convolution,
00:40:18.400 | I'm taking you through a few components,
00:40:20.400 | which are critical here,
00:40:21.400 | to the state of the art.
00:40:23.400 | Dilated convolutions,
00:40:25.400 | so the convolution operation,
00:40:31.400 | like the pooling operation,
00:40:31.400 | reduces resolution significantly.
00:40:34.400 | And dilated convolution,
00:40:38.400 | has a certain kind of grating,
00:40:40.400 | as visualized there,
00:40:41.400 | that maintains the local,
00:40:46.400 | high-resolution textures,
00:40:48.400 | while still capturing,
00:40:51.400 | the spatial window necessary.
00:40:55.400 | It's called dilated convolutional layer.
00:40:59.400 | And that's in a 2015 paper,
00:41:03.400 | proved to be much better at up-sampling,
00:41:05.400 | a high-resolution image.
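A dilated 3x3 convolution keeps its nine weights but spreads the taps apart, so it covers a 5x5 window without reducing resolution. In PyTorch (used here only as an example) this is just the dilation argument:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Standard 3x3 convolution: a 3x3 receptive field per layer.
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Dilated 3x3 convolution: the same 9 weights, but the taps are spread
# apart, covering a 5x5 window while keeping the full spatial resolution.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(standard(x).shape, dilated(x).shape)  # both torch.Size([1, 64, 32, 32])
```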
00:41:08.400 | DeepLab, with a B,
00:41:13.400 | V1, V2, now V3,
00:41:17.400 | added conditional random fields,
00:41:20.400 | which is the final piece of the,
00:41:23.400 | state of the art puzzle here.
00:41:25.400 | A lot of the successful networks today,
00:41:28.400 | that do segmentation, not all,
00:41:31.400 | do post-process using CRFs,
00:41:35.400 | conditional random fields.
00:41:37.400 | And what they do is,
00:41:38.400 | they smooth the segmentation,
00:41:40.400 | the up-sampled segmentation,
00:41:42.400 | that results from the FCN,
00:41:44.400 | by looking at the underlying image intensities.
00:41:48.400 | So that's the key aspects,
00:41:52.400 | of the successful approaches today.
00:41:54.400 | You have the encoder-decoder framework,
00:41:56.400 | of a fully convolutional neural network.
00:41:58.400 | It replaces the fully connected layers,
00:42:00.400 | with the convolutional layers,
00:42:02.400 | deconvolutional layers.
00:42:04.400 | And as the years progressed,
00:42:07.400 | from 2014 to today,
00:42:09.400 | as usual, the underlying networks,
00:42:13.400 | from AlexNet to VGGNet,
00:42:16.400 | and to now ResNet,
00:42:18.400 | have been one of the big reasons,
00:42:21.400 | for the improvements of these networks,
00:42:22.400 | to be able to perform the segmentation.
00:42:24.400 | So naturally, they mirrored,
00:42:26.400 | the ImageNet challenge performance,
00:42:28.400 | in adapting these networks.
00:42:30.400 | So the state of the art,
00:42:31.400 | uses ResNet or similar networks.
00:42:34.400 | Conditional random fields,
00:42:36.400 | for smoothing,
00:42:37.400 | based on the input image intensities,
00:42:40.400 | and the dilated convolution,
00:42:43.400 | that maintains the computational cost,
00:42:46.400 | but increases the resolution of the upsampling,
00:42:49.400 | throughout the intermediate feature maps.
00:42:53.400 | And that takes us to the state of the art,
00:42:57.400 | that we used to produce the images,
00:43:01.400 | to produce the images for the competition.
00:43:05.400 | ResNet-DUC, for dense upsampling convolution,
00:43:10.400 | instead of bilinear upsampling,
00:43:13.400 | you make the upsampling learnable.
00:43:17.400 | You learn the upscaling filters,
00:43:20.400 | that's on the bottom.
00:43:22.400 | That's really the key part that made it work.
00:43:25.400 | There should be a theme here.
00:43:27.400 | Sometimes the biggest addition,
00:43:29.400 | that could be done,
00:43:30.400 | is parameterizing,
00:43:32.400 | one of the aspects of the network,
00:43:33.400 | they've taken for granted.
00:43:35.400 | Letting the network learn that aspect.
00:43:37.400 | And the other,
00:43:39.400 | not sure how important it is to the success,
00:43:42.400 | but it's a cool little addition,
00:43:44.400 | is a hybrid dilated convolution.
00:43:47.400 | As I showed that visualization,
00:43:50.400 | where the convolution is spread apart,
00:43:52.400 | a little bit in the input,
00:43:55.400 | from the input to the output.
00:43:56.400 | The steps of that dilated convolution filter,
00:44:00.400 | when they're changed,
00:44:01.400 | it produces a smoother result,
00:44:03.400 | because when it's kept the same,
00:44:06.400 | certain input pixels
00:44:08.400 | get a lot more attention than others.
00:44:11.400 | So losing that favoritism,
00:44:14.400 | is what's achieved by using a variable,
00:44:16.400 | different dilation rate.
00:44:19.400 | Those are the two tricks,
00:44:20.400 | but really the biggest one,
00:44:22.400 | is the parameterization of the upscaling filters.
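A sketch of that dense-upsampling idea: a learnable convolution predicts upscale-squared score maps per class for each coarse cell, and a pixel shuffle rearranges them into a prediction that is upscale times larger, so the upsampling filters themselves are learned rather than fixed bilinear weights. The upscale factor and channel counts here are illustrative.

```python
import torch
import torch.nn as nn

class LearnedUpsample(nn.Module):
    # Predict upscale**2 score maps per class with a learnable convolution,
    # then rearrange them into a prediction upscale times larger.
    def __init__(self, in_ch, num_classes, upscale=8):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, num_classes * upscale ** 2,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(upscale)

    def forward(self, x):
        return self.shuffle(self.conv(x))

features = torch.randn(1, 512, 28, 28)   # coarse feature map from the encoder
print(LearnedUpsample(512, num_classes=19)(features).shape)
# torch.Size([1, 19, 224, 224])
```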
00:44:27.400 | Okay, so that's what we use to generate that data,
00:44:30.400 | and that's what we provide you the code with,
00:44:32.400 | if you're interested in competing in SegFuse.
00:44:35.400 | The other aspect here,
00:44:37.400 | that everything we talked about,
00:44:38.400 | from the classification,
00:44:40.400 | to the segmentation,
00:44:42.400 | to making sense of images,
00:44:44.400 | is the information about time,
00:44:49.400 | the temporal dynamics of the scene is thrown away.
00:44:53.400 | And for the driving context,
00:44:55.400 | for the robotics contest,
00:44:56.400 | and what we'd like to do with SegFuse,
00:44:58.400 | for the segmentation,
00:45:00.400 | dynamic scene segmentation context,
00:45:02.400 | of when you try to interpret,
00:45:04.400 | what's going on in the scene over time,
00:45:06.400 | and use that information.
00:45:08.400 | Time is essential,
00:45:11.400 | the movement of pixels is essential,
00:45:14.400 | through time.
00:45:16.400 | That understanding how those objects move,
00:45:19.400 | in a 3D space,
00:45:22.400 | through the 2D projection of an image,
00:45:25.400 | is fascinating,
00:45:27.400 | and there's a lot of set of open problems there.
00:45:30.400 | So flow, is what's very helpful,
00:45:35.400 | as a starting point,
00:45:37.400 | to help us understand how these pixels move.
00:45:40.400 | Flow, optical flow,
00:45:43.400 | dense optical flow is the computation,
00:45:45.400 | our best approximation,
00:45:50.400 | of where each pixel in image one,
00:45:53.400 | moved to in the temporally
00:45:58.400 | following image after that.
00:46:00.400 | There's two images,
00:46:02.400 | in 30 frames a second,
00:46:03.400 | there's one image at time zero,
00:46:05.400 | the other is 33.3 milliseconds later,
00:46:08.400 | and the dense optical flow,
00:46:10.400 | is our best estimate of how each pixel,
00:46:12.400 | in the input image moved,
00:46:14.400 | to in the output image.
00:46:16.400 | The optical flow, for every pixel,
00:46:19.400 | produces a direction,
00:46:20.400 | of where we think that pixel moved,
00:46:22.400 | and the magnitude of how far moved.
00:46:24.400 | That allows us,
00:46:26.400 | to take information that we detected,
00:46:28.400 | about the first frame,
00:46:30.400 | and try to propagate it forward.
00:46:33.400 | This is the competition,
00:46:35.400 | is to try to segment an image,
00:46:38.400 | and propagate that information forward.
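The simplest way to use dense flow for that propagation looks roughly like the NumPy sketch below: for each pixel of frame t, move its label along the flow vector into frame t+1. The array layouts and names are assumptions, and real propagation has to deal with occlusions and flow errors.

```python
import numpy as np

def propagate_labels(labels_t, flow_t):
    # labels_t: (H, W) per-pixel class ids for frame t.
    # flow_t:   (H, W, 2) dense optical flow, (dx, dy) per pixel,
    #           i.e. where each pixel of frame t moves by frame t+1.
    h, w = labels_t.shape
    propagated = np.full((h, w), -1, dtype=labels_t.dtype)  # -1 = unknown
    ys, xs = np.mgrid[0:h, 0:w]
    new_x = np.clip(np.round(xs + flow_t[..., 0]).astype(int), 0, w - 1)
    new_y = np.clip(np.round(ys + flow_t[..., 1]).astype(int), 0, h - 1)
    propagated[new_y, new_x] = labels_t[ys, xs]  # scatter labels forward
    return propagated

labels = np.random.randint(0, 19, (8, 8))   # toy label map
flow = np.random.randn(8, 8, 2)             # toy flow field
print(propagate_labels(labels, flow))
```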
00:46:41.400 | For manual annotation,
00:46:45.400 | of an image,
00:46:47.400 | so this kind of coloring book annotation,
00:46:49.400 | where you color every single pixel,
00:46:51.400 | in the state-of-the-art data set,
00:46:54.400 | for driving cityscapes,
00:46:56.400 | that it takes 1.5 hours,
00:47:00.400 | 90 minutes to do that coloring.
00:47:02.400 | That's 90 minutes per image.
00:47:05.400 | That's extremely long time.
00:47:07.400 | That's why there doesn't exist today,
00:47:10.400 | a data set,
00:47:11.400 | and in this class,
00:47:12.400 | we're going to create one,
00:47:14.400 | of segmentation of these images,
00:47:17.400 | through time,
00:47:19.400 | through video.
00:47:21.400 | So long videos,
00:47:23.400 | where every single frame,
00:47:25.400 | is fully segmented.
00:47:27.400 | That's still an open problem,
00:47:29.400 | that we need to solve.
00:47:31.400 | Flow is a piece of that,
00:47:33.400 | and we also provide you,
00:47:36.400 | this computed state-of-the-art flow,
00:47:39.400 | using flow net 2.0.
00:47:41.400 | So flow net 1.0,
00:47:43.400 | in May 2015,
00:47:46.400 | use neural networks,
00:47:48.400 | to learn the optical flow,
00:47:50.400 | the dense optical flow.
00:47:52.400 | And it did so with two kinds of architectures,
00:47:56.400 | flow net S,
00:47:57.400 | flow net simple,
00:47:58.400 | and flow net Corr,
00:47:59.400 | flow net C.
00:48:01.400 | The simple one,
00:48:02.400 | is simply taking the two images,
00:48:04.400 | so what's the task here?
00:48:06.400 | There's two images,
00:48:07.400 | and you want to produce from those two images,
00:48:09.400 | they follow each other in time,
00:48:11.400 | 33.3 milliseconds apart,
00:48:13.400 | and your task is the output,
00:48:16.400 | to produce the dense optical flow.
00:48:18.400 | So for the simple architecture,
00:48:20.400 | you just stack them together,
00:48:22.400 | each are RGB,
00:48:23.400 | so it produces a six channel input to the network,
00:48:26.400 | there's a lot of convolution,
00:48:28.400 | and finally it's the same kind of process,
00:48:30.400 | as the fully convolution neural networks,
00:48:33.400 | to produce the optical flow.
00:48:35.400 | Then there is flow net correlation architecture,
00:48:39.400 | where you perform some convolution separately,
00:48:42.400 | before using a correlation layer,
00:48:44.400 | to combine the feature maps.
00:48:47.400 | Both are effective,
00:48:50.400 | in different data sets,
00:48:53.400 | and different applications.
00:48:54.400 | So flow net 2.0,
00:48:56.400 | in December 2016,
00:48:59.400 | is one of the state-of-the-art frameworks,
00:49:02.400 | code bases,
00:49:04.400 | that we use to generate the data I'll show,
00:49:07.400 | combines the flow net S,
00:49:09.400 | and flow net C,
00:49:10.400 | and improves over the initial flow net,
00:49:13.400 | producing a smoother flow field,
00:49:15.400 | preserves the fine motion detail,
00:49:18.400 | along the edges of the objects,
00:49:20.400 | and it runs extremely efficiently,
00:49:23.400 | depending on the architecture,
00:49:25.400 | there's a few variants,
00:49:26.400 | either 8 to 140 frames a second.
00:49:30.400 | And the process there,
00:49:33.400 | is essentially,
00:49:34.400 | one that's common across various applications,
00:49:36.400 | deep learning,
00:49:37.400 | is stacking these networks together.
00:49:39.400 | The very interesting aspect here,
00:49:43.400 | that we're still exploring,
00:49:46.400 | and again,
00:49:47.400 | applicable in all of deep learning,
00:49:49.400 | in this case,
00:49:50.400 | it seemed that there was a strong effect,
00:49:53.400 | in taking sparse,
00:49:54.400 | small,
00:49:55.400 | multiple data sets,
00:49:56.400 | and doing the training,
00:49:57.400 | the order in which
00:49:59.400 | those data sets were used for the training process,
00:50:01.400 | mattered a lot.
00:50:02.400 | That's very interesting.
00:50:05.400 | So, using flow net 2.0,
00:50:09.400 | here's the data set,
00:50:11.400 | we're making available for PsycFuse,
00:50:13.400 | the competition.
00:50:14.400 | selfdrivingcars.mit.edu/segfuse
00:50:18.400 | First, the original video,
00:50:20.400 | us driving in high definition,
00:50:24.400 | 1080p,
00:50:26.400 | and a 8K 360 video,
00:50:29.400 | original video,
00:50:32.400 | driving around Cambridge.
00:50:35.400 | Then we're providing the ground truth,
00:50:40.400 | for a training set.
00:50:43.400 | For that training set,
00:50:45.400 | for every single frame,
00:50:46.400 | 30 frames a second,
00:50:47.400 | we're providing the segmentation,
00:50:49.400 | frame to frame to frame,
00:50:51.400 | segmented on Mechanical Turk.
00:50:54.400 | We're also providing the output,
00:50:58.400 | of the network that I mentioned,
00:51:01.400 | the state of the art segmentation network,
00:51:03.400 | that's pretty damn close to the ground truth,
00:51:06.400 | but still not.
00:51:09.400 | And our task is,
00:51:11.400 | this is the interesting thing is,
00:51:13.400 | our task is to take the output of this network,
00:51:18.400 | well there's two options,
00:51:19.400 | one is to take the output of this network,
00:51:22.400 | and use other networks,
00:51:26.400 | to help you propagate the information better.
00:51:29.400 | So what this segmentation,
00:51:31.400 | the output of this network does,
00:51:34.400 | is it only operates frame by frame by frame,
00:51:38.400 | it's not using the temporal information at all.
00:51:40.400 | So the question is,
00:51:42.400 | can we figure out a way,
00:51:43.400 | can we figure out tricks,
00:51:44.400 | to use temporal information,
00:51:46.400 | to improve this segmentation,
00:51:48.400 | so it looks more like this segmentation.
00:51:51.400 | And we're also providing the optical flow,
00:51:57.400 | from frame to frame to frame.
00:51:58.400 | So the optical flow,
00:52:00.400 | based on flow net 2.0,
00:52:01.400 | of how each of the pixels moved.
00:52:07.400 | Okay.
00:52:08.400 | And that forms the SegFuse competition.
00:52:11.400 | 10,000 images,
00:52:13.400 | and the task is to submit code,
00:52:17.400 | we have starter code in Python,
00:52:19.400 | and on GitHub,
00:52:21.400 | to take in the original video,
00:52:25.400 | take in for the training set,
00:52:26.400 | the ground truth,
00:52:28.400 | the segmentation from the state of the art,
00:52:30.400 | segmentation network,
00:52:31.400 | the optical flow from the state of the art,
00:52:33.400 | optical flow network,
00:52:36.400 | and taking that together,
00:52:37.400 | to improve the stuff on the bottom left,
00:52:40.400 | the segmentation,
00:52:41.400 | to try to achieve the ground truth,
00:52:43.400 | on the top right.
00:52:44.400 | Okay.
00:52:47.400 | With that,
00:52:48.400 | I'd like to thank you.
00:52:49.400 | Tomorrow at 1 p.m.,
00:52:51.400 | is Waymo in Stata,
00:52:54.400 | 32-123.
00:52:56.400 | The next lecture,
00:52:58.400 | next week,
00:52:59.400 | will be on deep learning,
00:53:00.400 | for sensing the human,
00:53:01.400 | understanding the human,
00:53:02.400 | and we will release,
00:53:03.400 | online only lecture,
00:53:05.400 | on capsule networks,
00:53:06.400 | and GANs,
00:53:07.400 | Generative Adversarial Networks.
00:53:09.400 | Thank you very much.
00:53:10.400 | Thank you very much.
00:53:11.000 | (Applause)