
MIT 6.S094: Computer Vision


Chapters

0:00 Computer Vision and Convolutional Neural Networks
22:15 Network Architectures for Image Classification
34:39 Fully Convolutional Neural Networks
44:35 Optical Flow
50:07 SegFuse Dynamic Scene Segmentation Competition

Whisper Transcript | Transcript Only Page

00:00:00.000 | Today we'll talk about how to make machines see.
00:00:04.400 | Computer vision.
00:00:05.800 | And we'll present,
00:00:07.400 | Thank you for whoever said yes.
00:00:09.400 | And today we will present a competition
00:00:14.400 | that unlike deep traffic,
00:00:16.400 | which is designed to explore ideas,
00:00:21.400 | teach you about concepts of deep reinforcement learning,
00:00:25.400 | SegFuse, the deep dynamic driving scene segmentation competition
00:00:30.400 | that I'll present today,
00:00:32.400 | is at the very cutting edge.
00:00:35.400 | Whoever does well in this competition
00:00:37.400 | is likely to produce a publication or ideas
00:00:41.400 | that would lead the world in the area of perception.
00:00:45.400 | Perhaps together with the people running this class,
00:00:49.400 | perhaps on your own.
00:00:51.400 | And I encourage you to do so.
00:00:54.400 | Even more cats today.
00:00:57.400 | Computer vision,
00:00:59.400 | today as it stands,
00:01:02.400 | is deep learning.
00:01:04.400 | Majority of the successes in how we interpret,
00:01:08.400 | form representations, understand images and videos
00:01:12.400 | utilize to a significant degree neural networks.
00:01:16.400 | The very ideas we've been talking about.
00:01:19.400 | That applies for supervised, unsupervised,
00:01:22.400 | and reinforcement learning.
00:01:24.400 | And for the supervised case,
00:01:27.400 | which is the focus of today,
00:01:29.400 | the process is the same.
00:01:32.400 | The data is essential.
00:01:34.400 | There's annotated data where the human provides the labels
00:01:37.400 | that serve as the ground truth in the training process.
00:01:40.400 | Then the neural network goes through that data,
00:01:46.400 | learning to map from the raw sensory input
00:01:50.400 | to the ground truth labels,
00:01:52.400 | and then generalize over the testing data set.
00:01:56.400 | And the kind of raw sensory data we're dealing with is numbers.
00:02:00.400 | I'll say this again and again,
00:02:02.400 | that for human vision,
00:02:04.400 | for us here, we take for granted
00:02:06.400 | this particular aspect of our ability.
00:02:08.400 | Which is to take in raw sensory information
00:02:11.400 | through our eyes and interpret it.
00:02:13.400 | But it's just numbers.
00:02:15.400 | That's something, whether you're an expert computer vision person
00:02:19.400 | or new to the field,
00:02:20.400 | you have to always go back to meditate on:
00:02:23.400 | what kind of things the machine is given.
00:02:27.400 | What is the data it is tasked to work with
00:02:31.400 | in order to perform the task you're asking it to do.
00:02:35.400 | Perhaps the data it is given is highly insufficient
00:02:39.400 | to do what you want it to do.
00:02:41.400 | That's the question that'll come up again and again.
00:02:43.400 | Are images enough to understand the world around you?
00:02:49.400 | And given these numbers,
00:02:52.400 | the set of numbers,
00:02:54.400 | sometimes with one channel,
00:02:56.400 | sometimes with three RGB,
00:02:58.400 | where every single pixel has three different colors.
00:03:01.400 | The task is to classify or regress.
00:03:07.400 | Producing continuous variable
00:03:09.400 | or one of a set of class labels.
00:03:13.400 | As before,
00:03:15.400 | we must be careful about our intuition
00:03:20.400 | of what is hard and what is easy in computer vision.
00:03:24.400 | Let's take a step back
00:03:29.400 | to the inspiration for neural networks.
00:03:33.400 | Our own biological neural networks.
00:03:36.400 | Because the human vision system
00:03:39.400 | and the computer vision system
00:03:41.400 | are a little bit more similar in these regards.
00:03:44.400 | The structure of the human visual cortex is in layers.
00:03:56.400 | And as information passes from the eyes
00:04:00.400 | to the parts of the brain that make sense
00:04:02.400 | of the raw sensor information,
00:04:05.400 | higher and higher order representations are formed.
00:04:08.400 | This is the inspiration, the idea behind
00:04:11.400 | using deep neural networks for images.
00:04:14.400 | Higher and higher order representations
00:04:16.400 | are formed through the layers.
00:04:19.400 | The early layers,
00:04:21.400 | taking in the very raw sensory information
00:04:24.400 | and extracting edges,
00:04:27.400 | connecting those edges,
00:04:28.400 | forming those edges to form more complex features
00:04:31.400 | and finally into the higher order semantic meaning
00:04:34.400 | that we hope to get from these images.
00:04:38.400 | In computer vision, deep learning is hard.
00:04:41.400 | I'll say this again,
00:04:43.400 | the illumination variability is the biggest challenge
00:04:46.400 | or at least one of the biggest challenges in driving
00:04:51.400 | for visible light cameras.
00:04:54.400 | Pose variability,
00:04:56.400 | the objects,
00:04:58.400 | as I'll also discuss about some of the advances
00:05:01.400 | from Jeff Hinton and the capsule networks,
00:05:03.400 | the idea with neural networks
00:05:06.400 | as they are currently used for computer vision
00:05:09.400 | are not good with representing variable pose.
00:05:13.400 | These objects in images
00:05:16.400 | and this 2D plane of color and texture
00:05:19.400 | look very different numerically
00:05:22.400 | when the object is rotated
00:05:25.400 | and the object is mangled and shaped in different ways.
00:05:28.400 | The deformable truncated cat.
00:05:31.400 | Inter-class variability,
00:05:33.400 | for the classification task
00:05:36.400 | which would be an example today throughout
00:05:39.400 | to introduce some of the networks
00:05:41.400 | over the past decade
00:05:42.400 | that have received success
00:05:43.400 | and some of the intuition and insight
00:05:45.400 | that made those networks work.
00:05:47.400 | Classification,
00:05:49.400 | there is a lot of variability inside the classes
00:05:52.400 | and very little variability between the classes.
00:05:56.400 | All of these are cats at top,
00:05:58.400 | all of those are dogs at bottom.
00:06:00.400 | They look very different
00:06:02.400 | and the other,
00:06:03.400 | I would say the second biggest problem
00:06:05.400 | in driving perception,
00:06:07.400 | visible light camera perception is occlusion.
00:06:09.400 | When part of the object is occluded,
00:06:11.400 | due to the three-dimensional
00:06:14.400 | nature of our world,
00:06:17.400 | some objects in front of others
00:06:19.400 | and they occlude the background object
00:06:23.400 | and yet we're still tasked with identifying
00:06:26.400 | the object when only part of it is visible.
00:06:29.400 | And sometimes that part,
00:06:30.400 | I told you there's cats,
00:06:32.400 | is barely visible.
00:06:34.400 | Here we're tasked with classifying a cat
00:06:37.400 | when just an ear is visible,
00:06:38.400 | just the leg.
00:06:40.400 | And on the philosophical level,
00:06:46.400 | as we'll talk about the motivation
00:06:48.400 | for our competition here,
00:06:50.400 | here's a cat dressed as a monkey eating a banana.
00:06:54.400 | On a philosophical level,
00:06:57.400 | most of us understand what's going on in the scene.
00:07:03.400 | In fact, a neural network,
00:07:07.400 | today successfully classified this image,
00:07:14.400 | this video as a cat.
00:07:17.400 | But the context,
00:07:20.400 | the humor of the situation,
00:07:21.400 | and the fact that you could argue it's a monkey,
00:07:24.400 | is missing.
00:07:26.400 | And what else is missing is the dynamic information,
00:07:30.400 | the temporal dynamics of the scene.
00:07:33.400 | That's what's missing in a lot of the perception work
00:07:37.400 | that has been done to date
00:07:39.400 | in the autonomous vehicle space
00:07:42.400 | in terms of visible light cameras.
00:07:44.400 | And we're looking to expand on that.
00:07:47.400 | That's what SegFuse is all about.
00:07:49.400 | Image classification pipeline,
00:07:51.400 | there's a bin with different categories
00:07:54.400 | inside each class,
00:07:56.400 | cat, dog, mug, hat.
00:07:58.400 | Those bins, there's a lot of examples of each.
00:08:01.400 | And you're tasked with,
00:08:03.400 | when a new example comes along you've never seen before,
00:08:05.400 | to put that image in a bin.
00:08:08.400 | It's the same as the machine learning task before.
00:08:11.400 | And everything relies on the data
00:08:14.400 | that's been ground truth,
00:08:16.400 | that's been labeled by human beings.
00:08:18.400 | MNIST is a toy data set of handwritten digits,
00:08:22.400 | often used as examples.
00:08:24.400 | And COCO, CIFAR, ImageNet, PLACES,
00:08:28.400 | and a lot of other incredible data sets,
00:08:30.400 | rich data sets of a hundred thousands,
00:08:32.400 | millions of images out there,
00:08:34.400 | represent scenes, people's faces,
00:08:37.400 | and different objects.
00:08:39.400 | Those are all ground truth data
00:08:42.400 | for testing algorithms,
00:08:43.400 | and for competing architectures
00:08:46.400 | to be evaluated against each other.
00:08:49.400 | CIFAR-10, one of the simplest,
00:08:52.400 | almost toy data sets of tiny icons
00:08:55.400 | with 10 categories,
00:08:56.400 | of airplane, automobile, bird, cat, deer,
00:08:59.400 | dog, frog, horse, ship, and truck,
00:09:01.400 | is commonly used to explore
00:09:03.400 | some of the basic convolutional neural networks
00:09:05.400 | we'll discuss.
00:09:06.400 | So let's come up with a very trivial classifier
00:09:08.400 | to explain the concept of how we could go about it.
00:09:12.400 | In fact, this is,
00:09:13.400 | maybe if you start to think about
00:09:15.400 | how to classify an image,
00:09:16.400 | if you don't know any of these techniques,
00:09:18.400 | this is perhaps the approach you would take,
00:09:21.400 | is you would subtract images.
00:09:23.400 | So in order to know that an image of a cat
00:09:26.400 | is different than an image of a dog,
00:09:27.400 | you have to compare them.
00:09:29.400 | When given those two images,
00:09:30.400 | what's the way you compare them?
00:09:33.400 | One way you could do it,
00:09:34.400 | is you just subtract it,
00:09:36.400 | and then sum all the pixel-wise differences
00:09:39.400 | in the image.
00:09:40.400 | Just subtract the intensity of the image,
00:09:42.400 | pixel by pixel, sum it up.
00:09:45.400 | If that difference is really high,
00:09:47.400 | that means the images are very different.
00:09:50.400 | Using that metric, we can look at CIFAR-10,
00:09:53.400 | and use it as a classifier.
00:09:56.400 | Saying, based on this difference function,
00:09:59.400 | I'm going to find one of the 10 bins for a new image
00:10:03.400 | that has the lowest difference.
00:10:10.400 | Find an image in this data set
00:10:12.400 | that is most like the image I have,
00:10:14.400 | and put it in the same bin as that image is in.
00:10:18.400 | So, there's 10 classes,
00:10:21.400 | if we just flip a coin,
00:10:22.400 | the accuracy of our classifier will be 10%.
00:10:25.400 | Using our image difference classifier,
00:10:28.400 | we can actually do pretty good,
00:10:30.400 | much better than random, much better than 10%.
00:10:33.400 | We can do 35, 38% accuracy.
00:10:37.400 | That's the classifier,
00:10:38.400 | we have our first classifier.
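A minimal NumPy sketch of the pixel-difference classifier just described; the arrays here are random stand-ins for CIFAR-10, and the names are illustrative rather than from the course starter code.

```python
import numpy as np

def l1_distance(a, b):
    # Sum of absolute pixel-wise differences between two images.
    return np.sum(np.abs(a.astype(np.int32) - b.astype(np.int32)))

def nearest_neighbor_classify(test_image, train_images, train_labels):
    # Compare the new image against every labeled example and
    # return the label of the single closest one (the "bin" it falls into).
    distances = [l1_distance(test_image, x) for x in train_images]
    return train_labels[int(np.argmin(distances))]

# Random stand-ins for CIFAR-10: 32x32x3 images, 10 classes.
train_images = np.random.randint(0, 256, (500, 32, 32, 3), dtype=np.uint8)
train_labels = np.random.randint(0, 10, 500)
test_image = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)
print(nearest_neighbor_classify(test_image, train_images, train_labels))
```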
00:10:44.400 | K-nearest neighbors.
00:10:46.400 | Let's take our classifier to a whole new level.
00:10:49.400 | Instead of comparing it to just,
00:10:51.400 | trying to find one image,
00:10:53.400 | that's the closest in our data set.
00:10:56.400 | We try to find K closest,
00:10:58.400 | and say, what class do the majority of them belong to?
00:11:03.400 | And we take that K,
00:11:04.400 | and increase it from 1 to 2 to 3 to 4 to 5.
00:11:08.400 | And see how that changes the problem.
00:11:12.400 | With 7 nearest neighbors,
00:11:14.400 | which is the optimal under this approach for CIFAR-10,
00:11:20.400 | we achieve 30% accuracy.
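The k-nearest-neighbor variant is a small change to the same idea: compute the L1 distance against every training image, take the k closest, and vote. A self-contained sketch with random stand-in data:

```python
import numpy as np
from collections import Counter

def knn_classify(test_image, train_images, train_labels, k=7):
    # Sum of absolute pixel-wise differences against every training image.
    diffs = np.abs(train_images.astype(np.int32) - test_image.astype(np.int32))
    distances = diffs.reshape(len(train_images), -1).sum(axis=1)
    # Take the k closest training images and return the majority label.
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_images = np.random.randint(0, 256, (500, 32, 32, 3), dtype=np.uint8)
train_labels = np.random.randint(0, 10, 500)
test_image = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)
print(knn_classify(test_image, train_images, train_labels, k=7))
```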
00:11:23.400 | Human level is 95% accuracy.
00:11:27.400 | And with convolutional neural networks,
00:11:29.400 | we get very close to 100%.
00:11:33.400 | That's where neural networks shine.
00:11:39.400 | This very task of binning images.
00:11:42.400 | It all starts at this basic computational unit.
00:11:45.400 | Signal in, each of the signals is weighted,
00:11:50.400 | summed, bias added,
00:11:54.400 | and put an input into a nonlinear activation function
00:11:58.400 | that produces an output.
00:12:00.400 | The nonlinear activation function is key.
00:12:04.400 | All of these put together,
00:12:06.400 | in more and more hidden layers,
00:12:09.400 | form a deep neural network.
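The computational unit just described, a weighted sum plus a bias passed through a nonlinearity, is only a few lines; ReLU is used here purely as one example of a nonlinear activation.

```python
import numpy as np

def neuron(x, w, b):
    # Weigh each input signal, sum, add a bias,
    # then pass through a nonlinear activation (ReLU here).
    z = np.dot(w, x) + b
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])   # input signals
w = np.array([0.8, 0.1, -0.4])   # learned weights
b = 0.2                          # learned bias
print(neuron(x, w, b))
```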
00:12:12.400 | And that deep neural network is trained,
00:12:14.400 | as we've discussed,
00:12:16.400 | by taking a forward pass,
00:12:18.400 | on examples of ground truth labels,
00:12:20.400 | seeing how close those labels are
00:12:22.400 | to the real ground truth,
00:12:24.400 | and then punishing the weights,
00:12:26.400 | that resulted in the incorrect decisions,
00:12:29.400 | and rewarding the weights,
00:12:30.400 | that resulted in correct decisions.
00:12:33.400 | For the case of 10 examples,
00:12:35.400 | the output of the network,
00:12:38.400 | is 10 different values.
00:12:43.400 | The input being handwritten digits,
00:12:46.400 | from 0 to 9, there's 10 of those.
00:12:50.400 | And we wanted our network to classify,
00:12:52.400 | what is in this image,
00:12:54.400 | of a handwritten digit.
00:12:56.400 | Is it 0, 1, 2, 3, through 9.
00:13:00.400 | The way it's often done,
00:13:02.400 | is there's 10 outputs of the network.
00:13:06.400 | And each of the neurons on the output,
00:13:09.400 | is responsible for getting really excited,
00:13:16.400 | when its number is called.
00:13:16.400 | And everybody else,
00:13:18.400 | is supposed to be not excited.
00:13:20.400 | Therefore, the number of classes,
00:13:23.400 | is the number of outputs.
00:13:24.400 | That's how it's commonly done.
00:13:27.400 | And you assign a class to the input image,
00:13:30.400 | based on the highest,
00:13:32.400 | the neuron which produces the highest output.
00:13:36.400 | But that's for a fully connected network,
00:13:38.400 | that we've discussed on Monday.
00:13:41.400 | There is in deep learning,
00:13:44.400 | a lot of tricks,
00:13:45.400 | that make things work,
00:13:47.400 | that make training much more efficient,
00:13:49.400 | on large class problems,
00:13:52.400 | where there's a lot of classes,
00:13:54.400 | on large data sets.
00:13:56.400 | When the representation,
00:13:57.400 | that the neural network is tasked with learning,
00:13:59.400 | is extremely complex.
00:14:01.400 | And that's where convolutional neural networks step in.
00:14:04.400 | The trick they use is spatial invariance.
00:14:07.400 | They use the idea that,
00:14:10.400 | a cat in the top left corner of an image,
00:14:13.400 | is the same as a cat,
00:14:14.400 | in the bottom right corner of an image.
00:14:17.400 | So we can learn the same features,
00:14:19.400 | across the image.
00:14:22.400 | That's where the convolution operation steps in.
00:14:26.400 | Instead of the fully connected networks,
00:14:28.400 | here there's a third dimension,
00:14:31.400 | of depth.
00:14:33.400 | So the blocks in this neural network,
00:14:36.400 | as input take 3D volumes,
00:14:38.400 | and as output produce 3D volumes.
00:14:46.400 | They take a slice of the image,
00:14:49.400 | a window,
00:14:50.400 | and slide it across.
00:14:52.400 | Applying the same exact weights,
00:14:54.400 | and we'll go through an example.
00:14:56.400 | The same exact weights,
00:14:58.400 | as in the fully connected network,
00:15:00.400 | on the edges that are used to,
00:15:02.400 | map the input to the output.
00:15:04.400 | Here are used to,
00:15:05.400 | map the slice of an image,
00:15:08.400 | this window of an image,
00:15:09.400 | to the output.
00:15:11.400 | And you can make several,
00:15:13.400 | many of such convolutional filters.
00:15:17.400 | Many layers,
00:15:19.400 | many different options of,
00:15:21.400 | what kind of features you look for in an image.
00:15:24.400 | What kind of window you slide across,
00:15:26.400 | in order to extract all kinds of things.
00:15:29.400 | All kinds of edges.
00:15:31.400 | All kind of higher order patterns in the images.
00:15:35.400 | The very important thing is,
00:15:37.400 | the parameters on each of these filters,
00:15:40.400 | the subset of the image,
00:15:41.400 | these windows,
00:15:42.400 | are shared.
00:15:44.400 | If the feature,
00:15:46.400 | that defines a cat,
00:15:47.400 | is useful in the top left corner,
00:15:49.400 | it's useful in the top right corner,
00:15:50.400 | it's useful in every aspect of the image.
00:15:53.400 | This is the trick,
00:15:54.400 | that makes convolutional neural networks,
00:15:56.400 | save a lot of,
00:15:58.400 | a lot of parameters,
00:16:00.400 | reduce parameters significantly.
00:16:03.400 | It's the reuse,
00:16:04.400 | the spatial sharing of features,
00:16:06.400 | across the space of the image.
00:16:11.400 | The depth of these 3D volumes,
00:16:14.400 | is the number of filters.
00:16:16.400 | The stride is the skip of the filter,
00:16:20.400 | the step size.
00:16:21.400 | How many pixels you skip,
00:16:23.400 | when you apply the filter to the input.
00:16:27.400 | And the padding,
00:16:29.400 | is the padding,
00:16:31.400 | the zero padding on the outside of the input,
00:16:34.400 | to a convolutional layer.
00:16:37.400 | Let's go through an example.
00:16:40.400 | So, on the left here,
00:16:43.400 | and the slides are now available online,
00:16:45.400 | you can follow them along.
00:16:47.400 | And I'll step through this example.
00:16:49.400 | On the left here is,
00:16:51.400 | input volume of three channels.
00:16:54.400 | The left column is the input.
00:16:57.400 | The three squares there,
00:16:59.400 | are the three channels.
00:17:01.400 | And there's numbers,
00:17:03.400 | inside those channels.
00:17:06.400 | And then we have a filter in red.
00:17:11.400 | Two of them,
00:17:13.400 | two channels of filters,
00:17:15.400 | with a bias.
00:17:17.400 | And those filters are three by three.
00:17:19.400 | Each one of them,
00:17:21.400 | is size three by three.
00:17:24.400 | And what we do is,
00:17:25.400 | we take those three by three filters,
00:17:28.400 | that are to be learned.
00:17:30.400 | These are our variables,
00:17:31.400 | our weights that we have to learn.
00:17:33.400 | And then we slide it across an image,
00:17:36.400 | to produce the output on the right,
00:17:38.400 | the green.
00:17:40.400 | So by applying the filters in the red,
00:17:42.400 | there's two of them,
00:17:44.400 | and within each one,
00:17:45.400 | there's one for every input channel.
00:17:48.400 | We go from the left,
00:17:50.400 | to the right.
00:17:51.400 | From the input volume on the left,
00:17:53.400 | to the output volume green on the right.
00:17:57.400 | And you can look,
00:17:59.400 | you can pull up the slides yourself now,
00:18:01.400 | if you can't see the numbers on the screen.
00:18:04.400 | But the operations,
00:18:08.400 | are performed on the input,
00:18:10.400 | to produce the single value,
00:18:12.400 | that's highlighted there in the green,
00:18:14.400 | in the output.
00:18:15.400 | And we slide this convolutional
00:18:18.400 | filter,
00:18:19.400 | along the image.
00:18:21.400 | With a stride, in this case,
00:18:25.400 | of two,
00:18:27.400 | skipping,
00:18:28.400 | skipping along.
00:18:30.400 | The results sum to produce, on the right,
00:18:33.400 | the two channel output,
00:18:37.400 | in green.
00:18:38.400 | That's it, that's the convolutional operation.
00:18:41.400 | That's what's called the convolutional layer in neural networks.
00:18:45.400 | And the parameters here,
00:18:47.400 | besides the bias,
00:18:48.400 | are the red values in the middle.
00:18:51.400 | That's what we're trying to learn.
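The sliding-window arithmetic in that example can be written out directly. Below is a NumPy sketch of one convolutional layer whose shapes mirror the slide (3-channel input, two 3x3x3 filters, stride 2, zero padding 1), though the numbers here are random.

```python
import numpy as np

def conv2d(x, filters, biases, stride=2, pad=1):
    # x: (C, H, W) input volume; filters: (K, C, 3, 3); biases: (K,)
    c, h, w = x.shape
    k, _, fh, fw = filters.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)), mode='constant')
    oh = (h + 2 * pad - fh) // stride + 1
    ow = (w + 2 * pad - fw) // stride + 1
    out = np.zeros((k, oh, ow))
    for f in range(k):                      # each filter -> one output channel
        for i in range(oh):
            for j in range(ow):
                window = xp[:, i*stride:i*stride+fh, j*stride:j*stride+fw]
                out[f, i, j] = np.sum(window * filters[f]) + biases[f]
    return out

x = np.random.randn(3, 5, 5)               # 3-channel 5x5 input volume
filters = np.random.randn(2, 3, 3, 3)      # two filters, shared across the image
biases = np.zeros(2)
print(conv2d(x, filters, biases).shape)    # (2, 3, 3): two-channel output volume
```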
00:18:54.400 | And there's a lot of interesting tricks,
00:18:56.400 | we'll discuss today on top of those.
00:18:58.400 | But this is at the core.
00:19:00.400 | This is the spatially invariant,
00:19:02.400 | sharing of parameters,
00:19:04.400 | that make convolutional neural networks,
00:19:07.400 | able to efficiently,
00:19:09.400 | learn and find patterns and images.
00:19:13.400 | To build your intuition a little bit more,
00:19:16.400 | about convolution,
00:19:17.400 | here's an input image on the left.
00:19:19.400 | And on the right,
00:19:21.400 | the identity filter,
00:19:23.400 | produces the output you see on the right.
00:19:26.400 | And then there's different ways,
00:19:28.400 | you can, different kinds of edges,
00:19:30.400 | you can extract,
00:19:32.400 | with the result in activation map,
00:19:35.400 | seen on the right.
00:19:37.400 | So when applying the filters,
00:19:39.400 | with those edge detection filters,
00:19:42.400 | to the image on the left,
00:19:43.400 | you produce in white,
00:19:45.400 | are the parts that activate,
00:19:47.400 | the convolution.
00:19:49.400 | The results of these filters.
00:19:54.400 | And so you can do any kind of filter,
00:19:56.400 | that's what we're trying to learn.
00:19:58.400 | Any kind of edge,
00:20:00.400 | any kind of,
00:20:01.400 | any kind of pattern,
00:20:03.400 | you can move along in this window,
00:20:05.400 | in this way that's shown here,
00:20:06.400 | you slide around the image,
00:20:08.400 | and you produce,
00:20:09.400 | the output you see on the right.
00:20:11.400 | And depending on how many filters,
00:20:13.400 | you have in every level,
00:20:14.400 | you have many of such slices,
00:20:16.400 | that you see on the right.
00:20:17.400 | The input on the left,
00:20:18.400 | the output on the right.
00:20:20.400 | If you have,
00:20:22.400 | dozens of filters,
00:20:23.400 | you have dozens of images on the right,
00:20:25.400 | each with different results,
00:20:27.400 | that show,
00:20:29.400 | where each of the individual,
00:20:31.400 | filter patterns were found.
00:20:33.400 | And we learn,
00:20:34.400 | what patterns are useful to look for,
00:20:37.400 | in order to perform the classification task.
00:20:40.400 | That's the task,
00:20:41.400 | for the neural network,
00:20:42.400 | to learn these filters.
00:20:45.400 | And the filters,
00:20:46.400 | have higher and higher order,
00:20:49.400 | of representation.
00:20:52.400 | Going from the very basic edges,
00:20:55.400 | to the high semantic,
00:20:57.400 | meaning that spans entire images.
00:21:00.400 | And the ability to span images,
00:21:04.400 | can be done in several ways.
00:21:06.400 | But traditionally has been successfully done,
00:21:08.400 | through max pooling,
00:21:09.400 | through pooling.
00:21:10.400 | Of taking the output,
00:21:13.400 | of a convolutional operation,
00:21:17.400 | and reducing the resolution of that,
00:21:20.400 | by condensing that information,
00:21:23.400 | by for example,
00:21:24.400 | taking the maximum values,
00:21:26.400 | the maximum activations.
00:21:28.400 | Therefore reducing the,
00:21:33.400 | spatial resolution,
00:21:35.400 | which has detrimental effects,
00:21:36.400 | as we'll talk about in scene segmentation.
00:21:39.400 | But it's beneficial,
00:21:40.400 | for finding higher order representations,
00:21:43.400 | in the images,
00:21:44.400 | that bring images together.
00:21:46.400 | That bring features together,
00:21:47.400 | to form an entity,
00:21:49.400 | that we're trying to identify and classify.
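Max pooling itself is simple to write down: within each window, keep only the strongest activation, which reduces the spatial resolution by the stride. A NumPy sketch:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    # x: (C, H, W) activation volume. For each window, keep the maximum
    # activation, condensing the information and reducing resolution.
    c, h, w = x.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((c, oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[:, i, j] = x[:, i*stride:i*stride+size,
                               j*stride:j*stride+size].max(axis=(1, 2))
    return out

print(max_pool(np.random.randn(16, 8, 8)).shape)  # (16, 4, 4)
```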
00:21:51.400 | Okay.
00:21:53.400 | So that forms,
00:21:55.400 | a convolutional neural network.
00:21:57.400 | Such convolutional layers,
00:21:58.400 | stacked on top of each other,
00:22:00.400 | is the only addition,
00:22:01.400 | to a neural network that makes,
00:22:03.400 | for a convolutional neural network.
00:22:05.400 | And then at the end,
00:22:06.400 | the fully connected layers,
00:22:08.400 | or any kind of other architectures,
00:22:11.400 | allow us to apply them to particular domains.
00:22:15.400 | Let's take ImageNet,
00:22:17.400 | as a case study.
00:22:19.400 | In ImageNet,
00:22:21.400 | the data set,
00:22:23.400 | in ImageNet,
00:22:24.400 | the challenge,
00:22:26.400 | the task is classification.
00:22:28.400 | As I mentioned in the first lecture,
00:22:31.400 | ImageNet is a data set,
00:22:33.400 | one of the largest in the world of images.
00:22:36.400 | With 14 million images,
00:22:38.400 | 21,000 categories.
00:22:40.400 | And a lot of depth,
00:22:44.400 | to many of the categories.
00:22:45.400 | As I mentioned,
00:22:46.400 | 1200 Granny Smith apples.
00:22:48.400 | These allow to,
00:22:53.400 | these allow the neural networks to,
00:22:55.400 | learn the rich representations,
00:22:58.400 | in both pose,
00:22:59.400 | lighting variability,
00:23:00.400 | and intra-class
00:23:01.400 | variation,
00:23:02.400 | for the particular things,
00:23:03.400 | particular classes,
00:23:05.400 | like Granny Smith apples.
00:23:09.400 | let's look through the various networks.
00:23:11.400 | Let's discuss them,
00:23:12.400 | let's see the insights.
00:23:13.400 | It started with AlexNet,
00:23:15.400 | the first,
00:23:16.400 | really big successful,
00:23:18.400 | GPU trained neural network,
00:23:19.400 | on ImageNet,
00:23:20.400 | that's achieved a significant boost,
00:23:22.400 | over the previous year.
00:23:24.400 | And moved on to VGGNet,
00:23:26.400 | GoogleNet,
00:23:29.400 | Agulinet,
00:23:31.400 | ResNet,
00:23:33.400 | CU Image,
00:23:34.400 | and SENet,
00:23:36.400 | in 2017.
00:23:38.400 | Again,
00:23:41.400 | the numbers we'll show
00:23:42.400 | for the accuracy
00:23:43.400 | are based on the
00:23:44.400 | top five error rate.
00:23:46.400 | We get five guesses,
00:23:48.400 | and it's a one or zero.
00:23:50.400 | If you guess,
00:23:51.400 | if one of the five is correct,
00:23:52.400 | you get a one,
00:23:53.400 | for that particular guess.
00:23:54.400 | Otherwise, it's a zero.
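Computing the top-five error rate is straightforward; a sketch, assuming the network produces one score per class:

```python
import numpy as np

def top5_error(scores, labels):
    # scores: (N, num_classes) network outputs; labels: (N,) ground truth.
    # An example counts as correct (a one) if the true label is among
    # the five highest-scoring guesses; otherwise it's a zero.
    top5 = np.argsort(scores, axis=1)[:, -5:]
    correct = np.array([labels[i] in top5[i] for i in range(len(labels))])
    return 1.0 - correct.mean()

scores = np.random.randn(100, 1000)           # e.g. 1000 ImageNet classes
labels = np.random.randint(0, 1000, 100)
print(top5_error(scores, labels))
```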
00:24:02.400 | human error is 5.1.
00:24:04.400 | When a human,
00:24:05.400 | tries to achieve the same,
00:24:06.400 | tries to,
00:24:07.400 | perform the same task,
00:24:08.400 | as the task the machine is doing,
00:24:10.400 | the error is 5.1.
00:24:12.400 | The human annotation,
00:24:13.400 | is performed on the images,
00:24:14.400 | based on binary classification.
00:24:16.400 | Granny Smith,
00:24:17.400 | Apple or not,
00:24:18.400 | cat or not.
00:24:20.400 | The actual task,
00:24:21.400 | that the machine has to perform,
00:24:23.400 | and that the human competing,
00:24:25.400 | has to perform,
00:24:26.400 | is, given an image,
00:24:27.400 | to provide
00:24:28.400 | one of the many classes.
00:24:30.400 | Under that,
00:24:31.400 | human error is 5.1%,
00:24:33.400 | which was surpassed,
00:24:34.400 | in 2015,
00:24:36.400 | by ResNet,
00:24:37.400 | to achieve 4% error.
00:24:41.400 | So, let's start with,
00:24:43.400 | AlexNet.
00:24:44.400 | I'll zoom in on the later networks,
00:24:46.400 | they have some interesting insights.
00:24:48.400 | But, AlexNet,
00:24:49.400 | and VGGNet,
00:24:51.400 | both followed a very similar architecture.
00:24:54.400 | Very uniform throughout its depth.
00:24:57.400 | VGGNet in 2014,
00:25:02.400 | is convolution,
00:25:04.400 | convolution pooling,
00:25:06.400 | convolution pooling,
00:25:07.400 | convolution pooling,
00:25:08.400 | convolution pooling,
00:25:09.400 | convolution pooling,
00:25:10.400 | and fully connected layers at the end.
00:25:12.400 | There's a certain kind of beautiful simplicity,
00:25:16.400 | uniformity to these architectures.
00:25:18.400 | Because you can just make it deeper and deeper,
00:25:20.400 | and makes it very amenable to,
00:25:22.400 | implementation in a layer stack kind of way,
00:25:26.400 | in any of the deep learning frameworks.
00:25:28.400 | It's clean and beautiful to understand.
00:25:31.400 | In the case of VGGNet,
00:25:32.400 | 16 or 19 layers,
00:25:34.400 | with 138 million parameters,
00:25:36.400 | not many optimizations on these parameters,
00:25:38.400 | therefore,
00:25:39.400 | the number of parameters is much higher than
00:25:41.400 | the networks that followed it.
00:25:43.400 | Despite the layers not being that large.
00:25:45.400 | GoogleNet introduced the inception module,
00:25:50.400 | starting to do some interesting things,
00:25:53.400 | with the small modules within these networks,
00:25:57.400 | which allow for the training to be more,
00:25:59.400 | efficient and effective.
00:26:01.400 | The idea behind the inception module shown here,
00:26:05.400 | with the previous layer on bottom,
00:26:09.400 | and the convolutional layer,
00:26:13.400 | here with the inception module,
00:26:15.400 | on top,
00:26:17.400 | produced on top,
00:26:19.400 | is it used the idea that,
00:26:23.400 | different size convolutions,
00:26:25.400 | provide different value for the network.
00:26:27.400 | Smaller convolutions are able to capture,
00:26:31.400 | or propagate forward,
00:26:33.400 | features that are very local,
00:26:36.400 | at a high resolution in texture.
00:26:40.400 | Larger convolutions are better able to,
00:26:44.400 | represent and capture and catch,
00:26:47.400 | highly abstracted features,
00:26:49.400 | higher order features.
00:26:51.400 | So the idea behind the inception module,
00:26:53.400 | is to say well,
00:26:55.400 | as opposed to choosing,
00:26:57.400 | in a hyper parameter tuning process,
00:26:59.400 | or architecture design process,
00:27:01.400 | choosing which convolution size,
00:27:03.400 | we want to go with,
00:27:05.400 | why not do all of them together,
00:27:07.400 | well several together.
00:27:08.400 | In the case of the GoogleNet model,
00:27:11.400 | there's the 1x1, 3x3 and 5x5 convolutions,
00:27:15.400 | with the old trusty friend of max pooling,
00:27:18.400 | still left in there as well,
00:27:20.400 | which has lost favor,
00:27:23.400 | more and more over time,
00:27:24.400 | for the image classification task.
00:27:26.400 | And the result is,
00:27:28.400 | there's fewer parameters are required,
00:27:30.400 | if you pick,
00:27:32.400 | the placing of these,
00:27:34.400 | inception modules correctly,
00:27:36.400 | the number of parameters required,
00:27:38.400 | to achieve a higher performance,
00:27:41.400 | is much lower.
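A minimal sketch of that idea, written in PyTorch only for illustration: run 1x1, 3x3, and 5x5 convolutions plus max pooling in parallel and concatenate along the channel dimension. The channel counts are arbitrary, and the real GoogLeNet module also inserts 1x1 convolutions to reduce dimensionality before the larger filters.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    # Parallel branches with different convolution sizes; their outputs
    # are concatenated along the channel dimension.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2)
        self.pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, out_ch, kernel_size=1))

    def forward(self, x):
        branches = [self.branch1(x), self.branch3(x),
                    self.branch5(x), self.pool(x)]
        return torch.cat(branches, dim=1)

x = torch.randn(1, 64, 32, 32)
print(InceptionModule(64, 32)(x).shape)  # torch.Size([1, 128, 32, 32])
```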
00:27:43.400 | ResNet,
00:27:47.400 | one of the most popular,
00:27:48.400 | still to date,
00:27:50.400 | architectures,
00:27:54.400 | that we'll discuss in,
00:27:56.400 | in scene segmentation as well.
00:27:59.400 | Came up and used,
00:28:01.400 | the idea of a residual block.
00:28:03.400 | The initial,
00:28:06.400 | inspiring observation,
00:28:07.400 | which doesn't necessarily,
00:28:09.400 | hold true as it turns out,
00:28:10.400 | but that network depth,
00:28:13.400 | increases representation power.
00:28:16.400 | So these residual blocks,
00:28:18.400 | allow you to have much deeper networks,
00:28:21.400 | and I'll explain,
00:28:22.400 | why in a second here.
00:28:24.400 | the thought was,
00:28:26.400 | they work so well,
00:28:27.400 | because the networks are much deeper.
00:28:29.400 | The key thing,
00:28:31.400 | that makes these blocks so effective,
00:28:34.400 | is the same idea,
00:28:39.400 | that's reminiscent of recurrent neural networks,
00:28:39.400 | that I hope we get a chance to talk about.
00:28:42.400 | The training of them is much easier.
00:28:46.400 | They take a simple block,
00:28:49.400 | repeated over and over,
00:28:51.400 | and they pass the input along,
00:28:54.400 | without transformation,
00:28:56.400 | along with the ability,
00:28:58.400 | to transform it,
00:28:59.400 | to learn,
00:29:00.400 | to learn the filters,
00:29:01.400 | learn the weights.
00:29:03.400 | So you're allowed to,
00:29:06.400 | you're allowed every layer,
00:29:08.400 | to not only take on,
00:29:10.400 | the processing of previous layers,
00:29:13.400 | but to take in the raw and transform data,
00:29:16.400 | and learn something new.
00:29:18.400 | The ability to learn something new,
00:29:21.400 | allows you to have,
00:29:22.400 | much deeper networks,
00:29:25.400 | and the simplicity of this block,
00:29:27.400 | allows for more effective training.
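A PyTorch sketch of that residual block: the untransformed input is passed along and added back to the learned transformation, so each block only has to learn something new. The identity shortcut assumes input and output channel counts match.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The input is passed along without transformation and added back in.
        return F.relu(out + x)

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```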
00:29:30.400 | The state-of-the-art,
00:29:34.400 | in 2017,
00:29:35.400 | the winner is,
00:29:36.400 | Squeeze and Excitation Networks.
00:29:38.400 | That unlike the previous year,
00:29:41.400 | with CU Image,
00:29:42.400 | which simply took ensemble methods,
00:29:44.400 | and combined a lot of successful approaches,
00:29:47.400 | to take a marginal improvement.
00:29:49.400 | SENet,
00:29:51.400 | got a significant improvement,
00:29:54.400 | at least in percentages,
00:29:55.400 | I think it's a 25% reduction,
00:29:57.400 | in error,
00:29:59.400 | from 4% to 3%,
00:30:03.400 | something like that.
00:30:05.400 | By using a very simple idea,
00:30:08.400 | that I think is important to mention,
00:30:10.400 | a simple insight.
00:30:12.400 | It added a parameter,
00:30:14.400 | to each channel,
00:30:15.400 | in the convolutional layer,
00:30:18.400 | in the convolutional block.
00:30:20.400 | So the network,
00:30:21.400 | can now adjust the weighting,
00:30:23.400 | on each channel,
00:30:25.400 | based,
00:30:27.400 | for each feature map,
00:30:28.400 | based on the content,
00:30:29.400 | based on the input to the network.
00:30:31.400 | This is kind of a,
00:30:33.400 | a takeaway to think about,
00:30:35.400 | about any of the networks,
00:30:36.400 | to talk about any of the architectures.
00:30:38.400 | Is, a lot of times,
00:30:41.400 | your recurrent neural networks,
00:30:43.400 | and convolutional neural networks,
00:30:45.400 | have tricks,
00:30:46.400 | that significantly reduce,
00:30:47.400 | the number of parameters,
00:30:49.400 | the bulk, the sort of low-hanging fruit.
00:30:52.400 | They use spatial invariance,
00:30:54.400 | the temporal invariance,
00:30:55.400 | to reduce the number of parameters,
00:30:57.400 | to represent the input data.
00:30:58.400 | But, they also leave certain things,
00:31:02.400 | not parameterized.
00:31:03.400 | They don't allow the network to learn it.
00:31:05.400 | Allowing in this case,
00:31:06.400 | the network to learn the weighting,
00:31:08.400 | on each of the individual channels,
00:31:10.400 | so on each of the individual filters,
00:31:12.400 | is something that you learn
00:31:14.400 | along with the filters themselves,
00:31:16.400 | and it makes a huge boost.
00:31:18.400 | The cool thing about this,
00:31:19.400 | is it's applicable to any architecture.
00:31:21.400 | This kind of block,
00:31:23.400 | this kind of,
00:31:24.400 | the squeeze and excitation block,
00:31:26.400 | is applicable to any architecture.
00:31:31.400 | because obviously,
00:31:33.400 | it just simply parameterizes
00:31:35.400 | the ability to choose,
00:31:37.400 | which filter you go with,
00:31:38.400 | based on the content.
00:31:39.400 | It's a subtle, but crucial thing.
00:31:41.400 | I think it's pretty cool.
00:31:43.400 | And, for future research,
00:31:44.400 | it inspires to think about,
00:31:46.400 | what else can be parameterized,
00:31:48.400 | in neural networks?
00:31:49.400 | What else can be controlled,
00:31:50.400 | as part of the learning process?
00:31:52.400 | Including higher and higher order,
00:31:54.400 | hyperparameters.
00:31:55.400 | Which aspects of the training,
00:31:58.400 | and the architecture of the network,
00:32:00.400 | can be part of the learning?
00:32:02.400 | This is what this network inspires.
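A sketch of the squeeze-and-excitation block: globally pool each feature map down to a single number, pass those through a small bottleneck, and use the resulting per-channel weights to rescale the feature maps, so the weighting on each channel is learned from the content. The reduction ratio and channel counts here are illustrative, and PyTorch is used only as an example framework.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)     # one number per channel
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid())                          # per-channel weight in [0, 1]

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)
        w = self.excite(w).view(b, c, 1, 1)
        return x * w                               # reweight each feature map

x = torch.randn(1, 64, 32, 32)
print(SEBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```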
00:32:05.400 | Another network,
00:32:13.400 | has been in development since the 90s
00:32:15.400 | as an idea
00:32:16.400 | by Jeff Hinton,
00:32:18.400 | but really
00:32:19.400 | was published on
00:32:20.400 | and received significant attention in 2017.
00:32:23.400 | That I won't go into detail here.
00:32:26.400 | We are going to release,
00:32:29.400 | an online only video,
00:32:32.400 | about capsule networks.
00:32:34.400 | It's a little bit too technical,
00:32:36.400 | but they inspire a very important point,
00:32:39.400 | that we should always think about,
00:32:42.400 | with deep learning.
00:32:43.400 | Whenever it's successful.
00:32:45.400 | It's to think about,
00:32:46.400 | what, as I mentioned,
00:32:48.400 | with the cat eating a banana,
00:32:50.400 | on a philosophical,
00:32:52.400 | and the mathematical level,
00:32:53.400 | we have to consider,
00:32:55.400 | what assumptions these networks make,
00:32:58.400 | and what,
00:32:59.400 | through those assumptions,
00:33:00.400 | they throw away.
00:33:01.400 | So neural networks,
00:33:03.400 | due to the spatial,
00:33:04.400 | with convolutional neural networks,
00:33:06.400 | due to their spatial invariance,
00:33:08.400 | throw away information,
00:33:09.400 | about the relationship,
00:33:11.400 | between,
00:33:13.400 | the hierarchies,
00:33:15.400 | between the simple,
00:33:16.400 | and the complex objects.
00:33:17.400 | So the face on the left,
00:33:18.400 | and the face on the right,
00:33:20.400 | looks the same,
00:33:21.400 | to a convolutional neural network.
00:33:23.400 | The presence of eyes,
00:33:24.400 | and nose,
00:33:25.400 | and mouth,
00:33:26.400 | is the essential aspect,
00:33:29.400 | of what makes,
00:33:31.400 | the classification task work,
00:33:33.400 | for convolutional network.
00:33:34.400 | Where it will fire,
00:33:36.400 | and say this is definitely a face.
00:33:38.400 | But the spatial relationship,
00:33:40.400 | is lost,
00:33:41.400 | is ignored,
00:33:43.400 | which means,
00:33:44.400 | there's a lot of implications to this,
00:33:47.400 | for things like,
00:33:49.400 | pose variation,
00:33:50.400 | that information is lost.
00:33:53.400 | We're throwing
00:33:54.400 | that away completely,
00:33:56.400 | and hoping that,
00:33:57.400 | the pooling operation,
00:33:58.400 | that's performed in these networks,
00:34:01.400 | is able to sort of,
00:34:02.400 | mesh everything together,
00:34:04.400 | to come up with the features,
00:34:06.400 | that are firing,
00:34:07.400 | of the different parts of the face,
00:34:08.400 | to then come up with the total classification,
00:34:10.400 | that it's a face.
00:34:11.400 | Without representing,
00:34:12.400 | really the relationship,
00:34:13.400 | between these features,
00:34:14.400 | at the low level,
00:34:15.400 | and the high level.
00:34:17.400 | At the low level of the hierarchy,
00:34:19.400 | at the simple,
00:34:20.400 | and the complex level.
00:34:22.400 | This is a super exciting field now,
00:34:25.400 | that hopefully will spark
00:34:26.400 | developments of how we design neural networks,
00:34:29.400 | that are able to learn,
00:34:30.400 | the rotational,
00:34:33.400 | the orientation,
00:34:34.400 | in variance,
00:34:35.400 | as well.
00:34:37.400 | Okay, so as I mentioned,
00:34:42.400 | you take these,
00:34:44.400 | convolutional neural networks,
00:34:45.400 | chop off the final layer,
00:34:47.400 | in order to apply,
00:34:48.400 | to a particular domain.
00:34:50.400 | And that is what we'll do,
00:34:52.400 | with fully convolutional neural networks.
00:34:54.400 | The ones that we tasked,
00:34:55.400 | to segment the image,
00:34:56.400 | at a pixel level.
00:34:58.400 | As a reminder,
00:35:01.400 | these networks,
00:35:02.400 | through the convolutional process,
00:35:05.400 | are really producing,
00:35:08.400 | a heat map.
00:35:09.400 | Different parts of the network,
00:35:11.400 | are getting excited,
00:35:12.400 | based on the different,
00:35:13.400 | aspects of the image.
00:35:15.400 | And so it can be used,
00:35:16.400 | to do the localization of detecting,
00:35:18.400 | not just classifying the image,
00:35:20.400 | but localizing the object.
00:35:22.400 | And they could do so,
00:35:23.400 | at a pixel level.
00:35:25.400 | So the convolutional layers,
00:35:28.400 | are doing the,
00:35:30.400 | encoding process.
00:35:31.400 | They're taking the rich,
00:35:32.400 | raw sensory information,
00:35:35.400 | in the image,
00:35:36.400 | and encoding them,
00:35:37.400 | into an interpretable set of features,
00:35:39.400 | representation,
00:35:41.400 | that can then be used for classification.
00:35:43.400 | But we can also then use a decoder,
00:35:45.400 | up sample that information,
00:35:47.400 | and produce a map like this.
00:35:50.400 | Fully convolutional neural networks,
00:35:52.400 | segmentation,
00:35:53.400 | semantic scene segmentation,
00:35:55.400 | image segmentation.
00:35:56.400 | The goal is to,
00:35:57.400 | as opposed to classify the entire image,
00:35:59.400 | you classify every single pixel.
00:36:02.400 | It's pixel level segmentation.
00:36:04.400 | You color every single pixel,
00:36:05.400 | with what that pixel,
00:36:07.400 | what object that pixel belongs to,
00:36:09.400 | in this 2D space of the image.
00:36:11.400 | The 2D projection,
00:36:14.400 | in the image of a three-dimensional world.
00:36:18.400 | So the thing is,
00:36:20.400 | there's been a lot of advancement,
00:36:21.400 | in the last three years.
00:36:25.400 | But it's still an incredibly difficult problem.
00:36:29.400 | If you think about,
00:36:32.400 | the amount of data that's used,
00:36:36.400 | for training,
00:36:37.400 | and the task of pixel level,
00:36:39.400 | of megapixels here,
00:36:41.400 | of millions of pixels,
00:36:43.400 | that are each tasked with
00:36:44.400 | being assigned a single label.
00:36:46.400 | It's an extremely difficult problem.
00:36:48.400 | Why is this interesting,
00:36:52.400 | important problem to try to solve,
00:36:54.400 | as opposed to bounding boxes,
00:36:56.400 | around cats?
00:36:57.400 | Well,
00:36:58.400 | it's whenever precise boundaries,
00:37:00.400 | of objects are important.
00:37:01.400 | Certainly medical applications,
00:37:03.400 | when looking at imaging,
00:37:05.400 | and detecting particular,
00:37:07.400 | for example,
00:37:08.400 | detecting tumors,
00:37:09.400 | in medical imaging,
00:37:13.400 | of different organs.
00:37:15.400 | And in driving,
00:37:18.400 | in robotics,
00:37:20.400 | when objects are involved,
00:37:22.400 | it's a dense scene,
00:37:23.400 | involved with vehicles,
00:37:24.400 | pedestrians, cyclists.
00:37:25.400 | We need to be able to,
00:37:27.400 | not just have a loose estimate,
00:37:29.400 | of where objects are.
00:37:30.400 | We need to be able to have,
00:37:32.400 | the exact boundaries.
00:37:33.400 | And then potentially,
00:37:35.400 | through data fusion,
00:37:37.400 | fusing sensors together,
00:37:39.400 | fusing this rich textural information,
00:37:41.400 | about pedestrians, cyclists,
00:37:43.400 | and vehicles,
00:37:44.400 | to LIDAR data,
00:37:45.400 | that's providing us the three-dimensional,
00:37:47.400 | map of the world.
00:37:48.400 | We'll have both,
00:37:49.400 | the semantic meaning,
00:37:50.400 | of the different objects,
00:37:51.400 | and their exact three-dimensional location.
00:37:53.400 | A lot of this work,
00:38:01.400 | successfully,
00:38:03.400 | a lot of the work in the semantic segmentation,
00:38:05.400 | started with,
00:38:06.400 | fully convolutional networks,
00:38:08.400 | for semantic segmentation paper.
00:38:11.400 | That's where the name FCN came from,
00:38:12.400 | in November 2014.
00:38:14.400 | Now go through a few papers here,
00:38:16.400 | to give you some intuition,
00:38:18.400 | where the field is gone.
00:38:20.400 | And how that takes us to segfuse,
00:38:23.400 | the segmentation competition.
00:38:25.400 | So FCN,
00:38:27.400 | repurposed the ImageNet pre-trained nets.
00:38:29.400 | The nets that were trained,
00:38:31.400 | to classify what's in an image,
00:38:33.400 | the entire image.
00:38:35.400 | And chopped off,
00:38:37.400 | the fully connected layers.
00:38:39.400 | And then added decoder parts,
00:38:41.400 | that up-sampled the image,
00:38:44.400 | to produce a heat map.
00:38:48.400 | Here shown,
00:38:49.400 | with a tabby cat,
00:38:51.400 | a heat map of where the cat is in the image.
00:38:53.400 | It's a much lower,
00:38:55.400 | much coarser resolution,
00:38:57.400 | than the input image.
00:38:59.400 | 1/8 at best.
00:39:01.400 | Skip connections,
00:39:04.400 | to improve coarseness of up-sampling.
00:39:06.400 | There's a few tricks.
00:39:09.400 | If you do the most naive approach,
00:39:11.400 | the up-sampling is going to be extremely coarse.
00:39:14.400 | Because that's the whole point,
00:39:16.400 | of the neural network.
00:39:17.400 | The encoding part,
00:39:18.400 | is you throw away all the useless data,
00:39:21.400 | to the most essential aspects,
00:39:24.400 | that represent that image.
00:39:26.400 | So you're throwing away a lot of information,
00:39:27.400 | that's necessary,
00:39:29.400 | to then form a high-resolution image.
00:39:32.400 | So there's a few tricks,
00:39:34.400 | where you skip a few of the final,
00:39:37.400 | pooling operations,
00:39:39.400 | to go in similar ways,
00:39:41.400 | as a residual block,
00:39:43.400 | to go to the output,
00:39:45.400 | produce higher and higher,
00:39:47.400 | resolution heat map at the end.
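A toy sketch of that encoder-decoder shape with one skip connection: the encoder downsamples, transposed convolutions upsample back, and a higher-resolution feature map from earlier in the encoder is added in to recover some of the lost detail. Layer sizes are arbitrary, and a real FCN reuses a pretrained classification backbone as the encoder.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                  nn.MaxPool2d(2))          # 1/2 resolution
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                                  nn.MaxPool2d(2))          # 1/4 resolution
        self.score1 = nn.Conv2d(32, num_classes, 1)          # skip prediction
        self.score2 = nn.Conv2d(64, num_classes, 1)
        self.up2 = nn.ConvTranspose2d(num_classes, num_classes, 2, stride=2)
        self.up1 = nn.ConvTranspose2d(num_classes, num_classes, 2, stride=2)

    def forward(self, x):
        f1 = self.enc1(x)                 # higher-resolution features
        f2 = self.enc2(f1)                # coarser, more semantic features
        coarse = self.up2(self.score2(f2))
        fused = coarse + self.score1(f1)  # skip connection restores detail
        return self.up1(fused)            # per-pixel class scores

x = torch.randn(1, 3, 128, 128)
print(TinyFCN(num_classes=19)(x).shape)  # torch.Size([1, 19, 128, 128])
```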
00:39:50.400 | SegNet in 2015,
00:39:53.400 | applied this to the driving context.
00:39:56.400 | And really, taking a Dikiti dataset,
00:39:58.400 | and have shown a lot of interesting results,
00:40:02.400 | and really explored the encoder-decoder,
00:40:05.400 | formulation of the problem.
00:40:07.400 | Really solidifying the place,
00:40:11.400 | of the encoder-decoder framework,
00:40:13.400 | for the segmentation task.
00:40:16.400 | Dilated convolution,
00:40:18.400 | I'm taking you through a few components,
00:40:20.400 | which are critical here,
00:40:21.400 | to the state of the art.
00:40:23.400 | Dilated convolutions,
00:40:25.400 | so the convolution operation,
00:40:31.400 | like the pooling operation,
00:40:31.400 | reduces resolution significantly.
00:40:34.400 | And dilated convolution,
00:40:38.400 | has a certain kind of grating,
00:40:40.400 | as visualized there,
00:40:41.400 | that maintains the local,
00:40:46.400 | high-resolution textures,
00:40:48.400 | while still capturing,
00:40:51.400 | the spatial window necessary.
00:40:55.400 | It's called dilated convolutional layer.
00:40:59.400 | And that's in a 2015 paper,
00:41:03.400 | proved to be much better at up-sampling,
00:41:05.400 | a high-resolution image.
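A dilated 3x3 convolution keeps its nine weights but spreads the taps apart, so it covers a 5x5 window without reducing resolution. In PyTorch (used here only as an example) this is just the dilation argument:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Standard 3x3 convolution: a 3x3 receptive field per layer.
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Dilated 3x3 convolution: the same 9 weights, but the taps are spread
# apart, covering a 5x5 window while keeping the full spatial resolution.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

print(standard(x).shape, dilated(x).shape)  # both torch.Size([1, 64, 32, 32])
```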
00:41:08.400 | DeepLab, with a B,
00:41:13.400 | V1, V2, now V3,
00:41:17.400 | added conditional random fields,
00:41:20.400 | which is the final piece of the,
00:41:23.400 | state of the art puzzle here.
00:41:25.400 | A lot of the successful networks today,
00:41:28.400 | that do segmentation, not all,
00:41:31.400 | do post-process using CRFs,
00:41:35.400 | conditional random fields.
00:41:37.400 | And what they do is,
00:41:38.400 | they smooth the segmentation,
00:41:40.400 | the up-sampled segmentation,
00:41:42.400 | that results from the FCN,
00:41:44.400 | by looking at the underlying image intensities.
00:41:48.400 | So that's the key aspects,
00:41:52.400 | of the successful approaches today.
00:41:54.400 | You have the encoder-decoder framework,
00:41:56.400 | of a fully convolutional neural network.
00:41:58.400 | It replaces the fully connected layers,
00:42:00.400 | with the convolutional layers,
00:42:02.400 | deconvolutional layers.
00:42:04.400 | And as the years progressed,
00:42:07.400 | from 2014 to today,
00:42:09.400 | as usual, the underlying networks,
00:42:13.400 | from AlexNet to VGGNet,
00:42:16.400 | and to now ResNet,
00:42:18.400 | have been one of the big reasons,
00:42:21.400 | for the improvements of these networks,
00:42:22.400 | to be able to perform the segmentation.
00:42:24.400 | So naturally, they mirrored,
00:42:26.400 | the ImageNet challenge performance,
00:42:28.400 | in adapting these networks.
00:42:30.400 | So the state of the art,
00:42:31.400 | uses ResNet or similar networks.
00:42:34.400 | Conditional random fields,
00:42:36.400 | for smoothing,
00:42:37.400 | based on the input image intensities,
00:42:40.400 | and the dilated convolution,
00:42:43.400 | that maintains the computational cost,
00:42:46.400 | but increases the resolution of the upsampling,
00:42:49.400 | throughout the intermediate feature maps.
00:42:53.400 | And that takes us to the state of the art,
00:42:57.400 | that we used to produce the images,
00:43:01.400 | to produce the images for the competition.
00:43:05.400 | ResNet-DUC, for dense upsampling convolution,
00:43:10.400 | instead of bilinear upsampling,
00:43:13.400 | you make the upsampling learnable.
00:43:17.400 | You learn the upscaling filters,
00:43:20.400 | that's on the bottom.
00:43:22.400 | That's really the key part that made it work.
00:43:25.400 | There should be a theme here.
00:43:27.400 | Sometimes the biggest addition,
00:43:29.400 | that could be done,
00:43:30.400 | is parameterizing,
00:43:32.400 | one of the aspects of the network,
00:43:33.400 | they've taken for granted.
00:43:35.400 | Letting the network learn that aspect.
00:43:37.400 | And the other,
00:43:39.400 | not sure how important it is to the success,
00:43:42.400 | but it's a cool little addition,
00:43:44.400 | is a hybrid dilated convolution.
00:43:47.400 | As I showed that visualization,
00:43:50.400 | where the convolution is spread apart,
00:43:52.400 | a little bit in the input,
00:43:55.400 | from the input to the output.
00:43:56.400 | The steps of that dilated convolution filter,
00:44:00.400 | when they're changed,
00:44:01.400 | it produces a smoother result,
00:44:03.400 | because when it's kept the same,
00:44:06.400 | certain input pixels
00:44:08.400 | get a lot more attention than others.
00:44:11.400 | So losing that favoritism,
00:44:14.400 | is what's achieved by using a variable,
00:44:16.400 | different dilation rate.
00:44:19.400 | Those are the two tricks,
00:44:20.400 | but really the biggest one,
00:44:22.400 | is the parameterization of the upscaling filters.
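A sketch of that dense-upsampling idea: a learnable convolution predicts upscale-squared score maps per class for each coarse cell, and a pixel shuffle rearranges them into a prediction that is upscale times larger, so the upsampling filters themselves are learned rather than fixed bilinear weights. The upscale factor and channel counts here are illustrative.

```python
import torch
import torch.nn as nn

class LearnedUpsample(nn.Module):
    # Predict upscale**2 score maps per class with a learnable convolution,
    # then rearrange them into a prediction upscale times larger.
    def __init__(self, in_ch, num_classes, upscale=8):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, num_classes * upscale ** 2,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(upscale)

    def forward(self, x):
        return self.shuffle(self.conv(x))

features = torch.randn(1, 512, 28, 28)   # coarse feature map from the encoder
print(LearnedUpsample(512, num_classes=19)(features).shape)
# torch.Size([1, 19, 224, 224])
```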
00:44:27.400 | Okay, so that's what we use to generate that data,
00:44:30.400 | and that's what we provide you the code with,
00:44:32.400 | if you're interested in competing in SegFuse.
00:44:35.400 | The other aspect here,
00:44:37.400 | that everything we talked about,
00:44:38.400 | from the classification,
00:44:40.400 | to the segmentation,
00:44:42.400 | to making sense of images,
00:44:44.400 | is the information about time,
00:44:49.400 | the temporal dynamics of the scene is thrown away.
00:44:53.400 | And for the driving context,
00:44:55.400 | for the robotics contest,
00:44:56.400 | and what we'd like to do with SegFuse,
00:44:58.400 | for the segmentation,
00:45:00.400 | dynamic scene segmentation context,
00:45:02.400 | of when you try to interpret,
00:45:04.400 | what's going on in the scene over time,
00:45:06.400 | and use that information.
00:45:08.400 | Time is essential,
00:45:11.400 | the movement of pixels is essential,
00:45:14.400 | through time.
00:45:16.400 | That understanding how those objects move,
00:45:19.400 | in a 3D space,
00:45:22.400 | through the 2D projection of an image,
00:45:25.400 | is fascinating,
00:45:27.400 | and there's a lot of set of open problems there.
00:45:30.400 | So flow, is what's very helpful,
00:45:35.400 | as a starting point,
00:45:37.400 | to help us understand how these pixels move.
00:45:40.400 | Flow, optical flow,
00:45:43.400 | dense optical flow is the computation,
00:45:45.400 | our best approximation,
00:45:50.400 | of where each pixel in image one,
00:45:53.400 | moved to in the temporally
00:45:58.400 | following image after that.
00:46:00.400 | There's two images,
00:46:02.400 | in 30 frames a second,
00:46:03.400 | there's one image at time zero,
00:46:05.400 | the other is 33.3 milliseconds later,
00:46:08.400 | and the dense optical flow,
00:46:10.400 | is our best estimate of how each pixel,
00:46:12.400 | in the input image moved,
00:46:14.400 | to in the output image.
00:46:16.400 | The optical flow, for every pixel,
00:46:19.400 | produces a direction,
00:46:20.400 | of where we think that pixel moved,
00:46:22.400 | and the magnitude of how far moved.
00:46:24.400 | That allows us,
00:46:26.400 | to take information that we detected,
00:46:28.400 | about the first frame,
00:46:30.400 | and try to propagate it forward.
00:46:33.400 | This is the competition,
00:46:35.400 | is to try to segment an image,
00:46:38.400 | and propagate that information forward.
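The simplest way to use dense flow for that propagation looks roughly like the NumPy sketch below: for each pixel of frame t, move its label along the flow vector into frame t+1. The array layouts and names are assumptions, and real propagation has to deal with occlusions and flow errors.

```python
import numpy as np

def propagate_labels(labels_t, flow_t):
    # labels_t: (H, W) per-pixel class ids for frame t.
    # flow_t:   (H, W, 2) dense optical flow, (dx, dy) per pixel,
    #           i.e. where each pixel of frame t moves by frame t+1.
    h, w = labels_t.shape
    propagated = np.full((h, w), -1, dtype=labels_t.dtype)  # -1 = unknown
    ys, xs = np.mgrid[0:h, 0:w]
    new_x = np.clip(np.round(xs + flow_t[..., 0]).astype(int), 0, w - 1)
    new_y = np.clip(np.round(ys + flow_t[..., 1]).astype(int), 0, h - 1)
    propagated[new_y, new_x] = labels_t[ys, xs]  # scatter labels forward
    return propagated

labels = np.random.randint(0, 19, (8, 8))   # toy label map
flow = np.random.randn(8, 8, 2)             # toy flow field
print(propagate_labels(labels, flow))
```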
00:46:41.400 | For manual annotation,
00:46:45.400 | of an image,
00:46:47.400 | so this kind of coloring book annotation,
00:46:49.400 | where you color every single pixel,
00:46:51.400 | in the state-of-the-art data set,
00:46:54.400 | for driving cityscapes,
00:46:56.400 | that it takes 1.5 hours,
00:47:00.400 | 90 minutes to do that coloring.
00:47:02.400 | That's 90 minutes per image.
00:47:05.400 | That's extremely long time.
00:47:07.400 | That's why there doesn't exist today,
00:47:10.400 | a data set,
00:47:11.400 | and in this class,
00:47:12.400 | we're going to create one,
00:47:14.400 | of segmentation of these images,
00:47:17.400 | through time,
00:47:19.400 | through video.
00:47:21.400 | So long videos,
00:47:23.400 | where every single frame,
00:47:25.400 | is fully segmented.
00:47:27.400 | That's still an open problem,
00:47:29.400 | that we need to solve.
00:47:31.400 | Flow is a piece of that,
00:47:33.400 | and we also provide you,
00:47:36.400 | this computed state-of-the-art flow,
00:47:39.400 | using flow net 2.0.
00:47:41.400 | So flow net 1.0,
00:47:43.400 | in May 2015,
00:47:46.400 | use neural networks,
00:47:48.400 | to learn the optical flow,
00:47:50.400 | the dense optical flow.
00:47:52.400 | And it did so with two kinds of architectures,
00:47:56.400 | flow net S,
00:47:57.400 | flow net simple,
00:47:58.400 | and flow net Corr,
00:47:59.400 | flow net C.
00:48:01.400 | The simple one,
00:48:02.400 | is simply taking the two images,
00:48:04.400 | so what's the task here?
00:48:06.400 | There's two images,
00:48:07.400 | and you want to produce from those two images,
00:48:09.400 | they follow each other in time,
00:48:11.400 | 33.3 milliseconds apart,
00:48:13.400 | and your task is the output,
00:48:16.400 | to produce the dense optical flow.
00:48:18.400 | So for the simple architecture,
00:48:20.400 | you just stack them together,
00:48:22.400 | each are RGB,
00:48:23.400 | so it produces a six channel input to the network,
00:48:26.400 | there's a lot of convolution,
00:48:28.400 | and finally it's the same kind of process,
00:48:30.400 | as the fully convolution neural networks,
00:48:33.400 | to produce the optical flow.
00:48:35.400 | Then there is flow net correlation architecture,
00:48:39.400 | where you perform some convolution separately,
00:48:42.400 | before using a correlation layer,
00:48:44.400 | to combine the feature maps.
00:48:47.400 | Both are effective,
00:48:50.400 | in different data sets,
00:48:53.400 | and different applications.
00:48:54.400 | So flow net 2.0,
00:48:56.400 | in December 2016,
00:48:59.400 | is one of the state-of-the-art frameworks,
00:49:02.400 | code bases,
00:49:04.400 | that we use to generate the data I'll show,
00:49:07.400 | combines the flow net S,
00:49:09.400 | and flow net C,
00:49:10.400 | and improves over the initial flow net,
00:49:13.400 | producing a smoother flow field,
00:49:15.400 | preserves the fine motion detail,
00:49:18.400 | along the edges of the objects,
00:49:20.400 | and it runs extremely efficiently,
00:49:23.400 | depending on the architecture,
00:49:25.400 | there's a few variants,
00:49:26.400 | either 8 to 140 frames a second.
00:49:30.400 | And the process there,
00:49:33.400 | is essentially,
00:49:34.400 | one that's common across various applications,
00:49:36.400 | deep learning,
00:49:37.400 | is stacking these networks together.
00:49:39.400 | The very interesting aspect here,
00:49:43.400 | that we're still exploring,
00:49:46.400 | and again,
00:49:47.400 | applicable in all of deep learning,
00:49:49.400 | in this case,
00:49:50.400 | it seemed that there was a strong effect,
00:49:53.400 | in taking sparse,
00:49:54.400 | small,
00:49:55.400 | multiple data sets,
00:49:56.400 | and doing the training,
00:49:57.400 | the order in which
00:49:59.400 | those data sets were used for the training process,
00:50:01.400 | mattered a lot.
00:50:02.400 | That's very interesting.
00:50:05.400 | So, using flow net 2.0,
00:50:09.400 | here's the data set,
00:50:11.400 | we're making available for PsycFuse,
00:50:13.400 | the competition.
00:50:14.400 | selfdrivingcars.mit.edu/segfuse
00:50:18.400 | First, the original video,
00:50:20.400 | us driving in high definition,
00:50:24.400 | 1080p,
00:50:26.400 | and a 8K 360 video,
00:50:29.400 | original video,
00:50:32.400 | driving around Cambridge.
00:50:35.400 | Then we're providing the ground truth,
00:50:40.400 | for a training set.
00:50:43.400 | For that training set,
00:50:45.400 | for every single frame,
00:50:46.400 | 30 frames a second,
00:50:47.400 | we're providing the segmentation,
00:50:49.400 | frame to frame to frame,
00:50:51.400 | segmented on Mechanical Turk.
00:50:54.400 | We're also providing the output,
00:50:58.400 | of the network that I mentioned,
00:51:01.400 | the state of the art segmentation network,
00:51:03.400 | that's pretty damn close to the ground truth,
00:51:06.400 | but still not.
00:51:09.400 | And our task is,
00:51:11.400 | this is the interesting thing is,
00:51:13.400 | our task is to take the output of this network,
00:51:18.400 | well there's two options,
00:51:19.400 | one is to take the output of this network,
00:51:22.400 | and use other networks,
00:51:26.400 | to help you propagate the information better.
00:51:29.400 | So what this segmentation,
00:51:31.400 | the output of this network does,
00:51:34.400 | is it only operates frame by frame by frame,
00:51:38.400 | it's not using the temporal information at all.
00:51:40.400 | So the question is,
00:51:42.400 | can we figure out a way,
00:51:43.400 | can we figure out tricks,
00:51:44.400 | to use temporal information,
00:51:46.400 | to improve this segmentation,
00:51:48.400 | so it looks more like this segmentation.
00:51:51.400 | And we're also providing the optical flow,
00:51:57.400 | from frame to frame to frame.
00:51:58.400 | So the optical flow,
00:52:00.400 | based on flow net 2.0,
00:52:01.400 | of how each of the pixels moved.
00:52:07.400 | Okay.
00:52:08.400 | And that forms the SegFuse competition.
00:52:11.400 | 10,000 images,
00:52:13.400 | and the task is to submit code,
00:52:17.400 | we have starter code in Python,
00:52:19.400 | and on GitHub,
00:52:21.400 | to take in the original video,
00:52:25.400 | take in for the training set,
00:52:26.400 | the ground truth,
00:52:28.400 | the segmentation from the state of the art,
00:52:30.400 | segmentation network,
00:52:31.400 | the optical flow from the state of the art,
00:52:33.400 | optical flow network,
00:52:36.400 | and taking that together,
00:52:37.400 | to improve the stuff on the bottom left,
00:52:40.400 | the segmentation,
00:52:41.400 | to try to achieve the ground truth,
00:52:43.400 | on the top right.
00:52:44.400 | Okay.
00:52:47.400 | With that,
00:52:48.400 | I'd like to thank you.
00:52:49.400 | Tomorrow at 1 p.m.,
00:52:51.400 | is Waymo in Stata,
00:52:54.400 | 32-123.
00:52:56.400 | The next lecture,
00:52:58.400 | next week,
00:52:59.400 | will be on deep learning,
00:53:00.400 | for sensing the human,
00:53:01.400 | understanding the human,
00:53:02.400 | and we will release,
00:53:03.400 | online only lecture,
00:53:05.400 | on capsule networks,
00:53:06.400 | and GANs,
00:53:07.400 | Generative Adversarial Networks.
00:53:09.400 | Thank you very much.
00:53:10.400 | Thank you very much.
00:53:11.000 | (Applause)