
MIT 6.S094: Computer Vision


Chapters

0:00 Computer Vision and Convolutional Neural Networks
22:15 Network Architectures for Image Classification
34:39 Fully Convolutional Neural Networks
44:35 Optical Flow
50:07 SegFuse Dynamic Scene Segmentation Competition

Transcript

Today we'll talk about how to make machines see: computer vision. And we'll present a competition that, unlike DeepTraffic, which is designed to explore ideas and teach you the concepts of deep reinforcement learning, is at the very cutting edge: SegFuse, the dynamic driving scene segmentation competition that I'll present today.

Whoever does well in this competition is likely to produce a publication, or ideas that could lead the world in the area of perception. Perhaps together with the people running this class, perhaps on your own. And I encourage you to do so. Even more cats today. Computer vision, as it stands today, is deep learning.

The majority of the successes in how we interpret, form representations of, and understand images and videos utilize neural networks to a significant degree. The very ideas we've been talking about. That applies to supervised, unsupervised, and reinforcement learning. And for the supervised case, which is the focus of today, the process is the same.

The data is essential. There's annotated data, where a human provides the labels that serve as the ground truth in the training process. Then the neural network goes through that data, learning to map from the raw sensory input to the ground truth labels, and then to generalize over the test data set.

And the kind of raw sensor data we're dealing with is numbers. I'll say this again and again: for human vision, for us here, we take this particular aspect of our ability for granted, the ability to take in raw sensory information through our eyes and interpret it. But to a machine, it's just numbers.

That's something, whether you're an expert computer vision person or new to the field, you always have to go back to and meditate on: what kind of input the machine is given, what data it is tasked to work with in order to perform the task you're asking it to do.

Perhaps the data it is given is highly insufficient for what you want it to do. That's a question that will come up again and again: are images enough to understand the world around you? The machine is given these sets of numbers, sometimes with one channel, sometimes with three RGB channels, where every single pixel has three color values.

The task is to classify or regress: to produce a continuous variable or one of a set of class labels. As before, we must be careful about our intuition of what is hard and what is easy in computer vision. Let's take a step back to the inspiration for neural networks: our own biological neural networks.

Because the human vision system and the computer vision system are quite similar in this regard. The structure of the human visual cortex is layered. And as information passes from the eyes to the parts of the brain that make sense of the raw sensory information, higher and higher order representations are formed.

This is the inspiration, the idea behind using deep neural networks for images: higher and higher order representations are formed through the layers. The early layers take in the very raw sensory information and extract edges; later layers connect those edges to form more complex features, and finally the higher order semantic meaning that we hope to get from these images.

In computer vision, deep learning is hard. I'll say this again: illumination variability is the biggest challenge, or at least one of the biggest challenges, in driving with visible-light cameras. Then there is pose variability: as I'll discuss with some of the advances from Geoff Hinton and capsule networks, neural networks as they are currently used for computer vision are not good at representing variable pose.

Objects in images, in this 2D plane of color and texture, look very different numerically when the object is rotated or deformed in different ways. The deformable, truncated cat. Then there is inter-class variability. The classification task will be the running example today for introducing some of the networks over the past decade that have seen success, and some of the intuition and insight that made those networks work.

For classification, there is a lot of variability inside the classes and very little variability between the classes. All of the images at the top are cats; all of those at the bottom are dogs. They look very different. And the other problem, I would say the second biggest problem in driving perception with visible-light cameras, is occlusion.

Part of the object is occluded because, due to the three-dimensional nature of our world, some objects are in front of others and occlude the objects behind them, and yet we're still tasked with identifying the object when only part of it is visible. And sometimes that part, I told you there would be cats, is barely visible.

Here we're tasked with classifying a cat when just an ear is visible, or just a leg. And on a philosophical level, as we'll talk about with the motivation for our competition: here's a cat dressed as a monkey eating a banana. On a philosophical level, most of us understand what's going on in this scene.

In fact, a neural network today successfully classified this video as a cat. But the context, the humor of the situation, and the fact that you could argue it's a monkey, are missing. What else is missing is the dynamic information, the temporal dynamics of the scene. That's what's missing in a lot of the perception work that has been done to date in the autonomous vehicle space with visible-light cameras.

And we're looking to expand on that. That's what SegFuse is all about. In the image classification pipeline, there's a bin for each category: cat, dog, mug, hat. In those bins, there are a lot of examples of each. And you're tasked, when a new example comes along that you've never seen before, with putting that image into one of the bins.

It's the same as the machine learning task described before, and everything relies on data that has been ground-truth labeled by human beings. MNIST is a toy data set of handwritten digits, often used as an example. And COCO, CIFAR, ImageNet, Places, and a lot of other incredible, rich data sets of hundreds of thousands or millions of images are out there, representing scenes, people's faces, and different objects.

Those are all ground truth data for testing algorithms, and for competing architectures to be evaluated against each other. CIFAR-10, one of the simplest, almost toy, data sets of tiny images with 10 categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), is commonly used to explore some of the basic convolutional neural networks we'll discuss.

So let's come up with a very trivial classifier to explain the concept of how we could go about this. In fact, if you start to think about how to classify an image without knowing any of these techniques, this is perhaps the approach you would take: you would subtract images.

In order to know that an image of a cat is different from an image of a dog, you have to compare them. Given those two images, how do you compare them? One way you could do it is to just subtract them, and then sum all the pixel-wise differences in the image.

Just subtract the intensities of the two images, pixel by pixel, and sum up the differences. If that sum is really high, the images are very different. Using that metric on CIFAR-10, we can build a classifier: based on this difference function, for a new image I'm going to find the one of the 10 bins with the lowest difference.

Find the image in the data set that is most like the image I have, and put my image in the same bin that image is in. There are 10 classes, so if we just flip a coin, the accuracy of our classifier will be 10%. Using our image-difference classifier, we can actually do much better than random, much better than 10%.

We can get 35 to 38% accuracy. That's it, we have our first classifier. K-nearest neighbors: let's take our classifier to a whole new level. Instead of trying to find the single closest image in our data set, we try to find the K closest, and ask: what class do the majority of them belong to?

And we take that K and increase it from 1 to 2 to 3 to 4 to 5, and see how that changes the result. With 7 nearest neighbors, which is the optimal under this approach for CIFAR-10, we achieve 30% accuracy. Human level is 95% accuracy. And with convolutional neural networks, we get very close to 100%.
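
To make the idea concrete, here is a minimal, illustrative sketch of this pixel-difference k-nearest-neighbor classifier. It is not the course's starter code; the array shapes and random stand-in data are assumptions chosen to mimic CIFAR-10's 32x32 RGB images.

```python
import numpy as np

def knn_predict(train_images, train_labels, test_image, k=7):
    """Classify one test image by the majority label of its k nearest
    training images, using summed absolute pixel differences (L1)."""
    # Flatten each image into a single vector of pixel intensities.
    train_flat = train_images.reshape(len(train_images), -1).astype(np.int64)
    test_flat = test_image.reshape(-1).astype(np.int64)

    # Subtract pixel by pixel, take the absolute value, sum the differences.
    distances = np.abs(train_flat - test_flat).sum(axis=1)

    # Indices of the k closest training images, then a majority vote.
    nearest = np.argsort(distances)[:k]
    return np.bincount(train_labels[nearest]).argmax()

# Random stand-in data shaped like CIFAR-10 (32x32 RGB images, 10 classes).
train_images = np.random.randint(0, 256, (5000, 32, 32, 3), dtype=np.uint8)
train_labels = np.random.randint(0, 10, 5000)
test_image = np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8)
print(knn_predict(train_images, train_labels, test_image))
```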

Neural networks shine at this very task of binning images. It all starts with the basic computational unit: signals come in, each signal is weighted, the weighted signals are summed, a bias is added, and the result is put through a nonlinear activation function that produces an output. The nonlinear activation function is key. All of these units put together, in more and more hidden layers, form a deep neural network.
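
As a small illustrative sketch of that single unit (the input values, weights, and the choice of a sigmoid activation are made up for this example):

```python
import numpy as np

def neuron(x, w, b):
    """One artificial neuron: weigh the incoming signals, sum them,
    add a bias, and pass the result through a nonlinear activation."""
    z = np.dot(w, x) + b             # weighted sum plus bias
    return 1.0 / (1.0 + np.exp(-z))  # nonlinear activation (sigmoid here)

x = np.array([0.5, -1.2, 3.0])  # incoming signals (example values)
w = np.array([0.8, 0.1, -0.4])  # weights the network would learn
b = 0.2                         # bias the network would learn
print(neuron(x, w, b))
```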

The deep neural network formed by stacking these units is trained, as we've discussed, by taking a forward pass on examples with ground truth labels, seeing how close the network's outputs are to the real ground truth, and then punishing the weights that resulted in incorrect decisions and rewarding the weights that resulted in correct decisions.

For the case of 10 classes, the output of the network is 10 different values. The input being handwritten digits from 0 to 9, there are 10 of those, and we want our network to classify what is in an image of a handwritten digit: is it 0, 1, 2, 3, through 9?

The way it's often done is that there are 10 outputs of the network, and each of the neurons on the output is responsible for getting really excited when its number is called, while everybody else is supposed to stay quiet. Therefore, the number of classes is the number of outputs. That's how it's commonly done.
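
As a tiny illustrative sketch (the output values below are made up), assigning the class just means taking the index of the most excited output neuron:

```python
import numpy as np

# Hypothetical outputs of the 10 output neurons for one handwritten digit.
outputs = np.array([0.01, 0.02, 0.05, 0.80, 0.01, 0.03, 0.02, 0.03, 0.02, 0.01])

predicted_digit = int(np.argmax(outputs))  # the neuron that got most excited
print(predicted_digit)  # 3
```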

And you assign a class to the input image based on the neuron that produces the highest output. But that's for the fully connected network we discussed on Monday. In deep learning there are a lot of tricks that make things work, that make training much more efficient on large-class problems, where there are a lot of classes, and on large data sets.

They matter when the representation the neural network is tasked with learning is extremely complex. And that's where convolutional neural networks step in. The trick they use is spatial invariance: they use the idea that a cat in the top left corner of an image is the same as a cat in the bottom right corner of an image.

So we can learn the same features across the image. That's where the convolution operation steps in. Instead of the fully connected network, here there's a third dimension: depth. The blocks in this neural network take 3D volumes as input and produce 3D volumes as output. They take a slice of the image, a window, and slide it across.

The same exact weights are applied, and we'll go through an example. The weights that, in the fully connected network, are used on the edges to map the input to the output are here used to map a slice of the image, a window of the image, to the output.

And you can make many such convolutional filters: many layers, many different options for what kind of features you look for in an image, what kind of window you slide across, in order to extract all kinds of things. All kinds of edges, all kinds of higher order patterns in the images.

The very important thing is that the parameters of each of these filters, applied to these windows, these subsets of the image, are shared. If the feature that defines a cat is useful in the top left corner, it's useful in the top right corner; it's useful in every part of the image. This is the trick that makes convolutional neural networks reduce the number of parameters significantly.

It's the reuse, the spatial sharing, of features across the space of the image. The depth of these 3D volumes is the number of filters. The stride is the step size of the filter: how many pixels you skip when you apply the filter to the input. And the padding is the zero padding on the outside of the input to a convolutional layer.

Let's go through an example. The slides are now available online, so you can follow along. On the left here is an input volume with three channels; the left column is the input, and the three squares there are the three channels.

There are numbers inside those channels. Then we have filters, in red: two of them, each with a bias, and each of size three by three. What we do is take those three-by-three filters, which are to be learned.

These are our variables, the weights we have to learn. We then slide them across the image to produce the output on the right, in green. By applying the filters in red, two of them, each with one kernel per input channel, we go from the left to the right.

From the input volume on the left to the output volume, in green, on the right. You can pull up the slides yourself now if you can't see the numbers on the screen. But the operations are performed on the input to produce the single value that's highlighted there in green, in the output.

And we slide this convolutional filter along the image with a stride, in this case, of two, skipping along. The per-channel products sum to the two-channel output in green. That's it, that's the convolution operation. That's what's called a convolutional layer in neural networks. And the parameters here, besides the biases, are the red values in the middle.
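
To make the operation concrete, here is a minimal NumPy sketch of a convolutional layer like the one in the example. The input values, filter values, and shapes are stand-ins, not the exact numbers on the slide; this is an illustration, not the lecture's code.

```python
import numpy as np

def conv_layer(volume, filters, biases, stride=2, pad=1):
    """Slide each filter across the zero-padded input volume.
    volume:  (H, W, C_in)        input, e.g. a 3-channel image patch
    filters: (C_out, k, k, C_in) learned weights (the red values)
    biases:  (C_out,)            one bias per filter
    Returns an output volume of shape (H_out, W_out, C_out)."""
    k = filters.shape[1]
    padded = np.pad(volume, ((pad, pad), (pad, pad), (0, 0)))
    h_out = (padded.shape[0] - k) // stride + 1
    w_out = (padded.shape[1] - k) // stride + 1
    out = np.zeros((h_out, w_out, filters.shape[0]))
    for f in range(filters.shape[0]):          # for each filter (output channel)
        for i in range(h_out):
            for j in range(w_out):
                window = padded[i*stride:i*stride+k, j*stride:j*stride+k, :]
                # element-wise multiply, sum over all channels, add the bias
                out[i, j, f] = np.sum(window * filters[f]) + biases[f]
    return out

# Toy numbers standing in for the slide's example: 5x5x3 input, two 3x3 filters.
volume = np.random.randint(0, 3, (5, 5, 3))
filters = np.random.randint(-1, 2, (2, 3, 3, 3))
biases = np.array([1, 0])
print(conv_layer(volume, filters, biases).shape)  # (3, 3, 2) with stride 2, pad 1
```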

Those red values are what we're trying to learn. There are a lot of interesting tricks we'll discuss today on top of this, but this is the core. This is the spatially invariant sharing of parameters that makes convolutional neural networks able to efficiently learn and find patterns in images. To build your intuition a little more about convolution, here's an input image on the left.

On the right, the identity filter produces the output you see. Then there are different kinds of edges you can extract, with the resulting activation map shown on the right. When applying those edge-detection filters to the image on the left, the parts shown in white are the parts that activate the convolution.

Those are the results of these filters. And you can use any kind of filter; that's what we're trying to learn. Any kind of edge, any kind of pattern: you move this window along, in the way shown here, sliding it around the image, and you produce the output you see on the right.

Depending on how many filters you have at every level, you get many such slices like the ones on the right: the input on the left, the outputs on the right. If you have dozens of filters, you have dozens of images on the right, each with different results showing where each individual filter pattern was found.

And we learn which patterns are useful to look for in order to perform the classification task. That's the task for the neural network: to learn these filters. And the filters form higher and higher orders of representation, going from the very basic edges to the high semantic meaning that spans entire images.

The ability to span images can be achieved in several ways, but traditionally it has been done successfully through pooling, through max pooling: taking the output of a convolution operation and reducing its resolution by condensing that information, for example by taking the maximum values, the maximum activations, as sketched below.
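
A small illustrative sketch of 2x2 max pooling (the activation values here are made up):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Reduce a 2D activation map by keeping the maximum in each 2x2 window."""
    h, w = feature_map.shape
    pooled = feature_map[:h - h % 2, :w - w % 2]   # drop odd edge row/column if any
    pooled = pooled.reshape(h // 2, 2, w // 2, 2)  # group into 2x2 blocks
    return pooled.max(axis=(1, 3))                 # maximum within each block

activations = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 1, 5, 6],
                        [2, 2, 7, 8]])
print(max_pool_2x2(activations))
# [[4 2]
#  [2 8]]
```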

Pooling reduces the spatial resolution, which has detrimental effects, as we'll talk about for scene segmentation. But it's beneficial for finding higher order representations in the images, representations that bring features together to form an entity that we're trying to identify and classify. Okay. So that forms a convolutional neural network.

Such convolutional layers, stacked on top of each other, are the only addition to a neural network that makes it a convolutional neural network. And then at the end, fully connected layers, or other kinds of architectures, allow us to apply it to particular domains. Let's take ImageNet as a case study.

In ImageNet, the data set, and in the ImageNet challenge, the task is classification. As I mentioned in the first lecture, ImageNet is one of the largest image data sets in the world, with 14 million images and 21,000 categories, and a lot of depth to many of the categories. As I mentioned, 1,200 Granny Smith apples.

This allows the neural networks to learn rich representations, across pose and lighting variability and within-class variation, for particular classes like Granny Smith apples. So let's look through the various networks, discuss them, and see the insights. It started with AlexNet, the first really big, successful, GPU-trained neural network on ImageNet, which achieved a significant boost over the previous year.

The field moved on to VGGNet, GoogLeNet, ResNet, CUImage, and SENet in 2017. The accuracy numbers I'll show are based on the top-five error rate: you get five guesses, and it's scored one or zero. If one of the five guesses is correct, you get a one for that image.

Otherwise, it's a zero. And human error is 5.1%: when a human tries to perform the same task the machine is doing, the error is 5.1%. The human annotation is performed on the images as binary classification: Granny Smith apple or not, cat or not.

The actual task that the machine has to perform, and that the competing human has to perform, is: given an image, provide one of the many classes. Under that setup, human error is 5.1%, which was surpassed in 2015 by ResNet, which achieved roughly 4% error. So, let's start with AlexNet. I'll zoom in on the later networks; they have some interesting insights.

AlexNet and VGGNet both followed a very similar architecture, very uniform throughout their depth. VGGNet, in 2014, is convolution, convolution, pooling, repeated again and again, with fully connected layers at the end. There's a certain kind of beautiful simplicity and uniformity to these architectures, because you can just make them deeper and deeper, and it makes them very amenable to implementation as a stack of layers in any of the deep learning frameworks.
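
As a hedged sketch of that layer-stack style, here is a small VGG-like toy in PyTorch. This is not the actual 16- or 19-layer VGGNet configuration; the channel sizes, input resolution, and 10-class output are assumptions for illustration.

```python
import torch
import torch.nn as nn

# A small VGG-like stack: blocks of convolution + ReLU followed by pooling,
# with fully connected layers at the end. Channel sizes are illustrative.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # 64x64 -> 32x32
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(64 * 16 * 16, 256), nn.ReLU(),
    nn.Linear(256, 10),                           # 10 output classes
)

logits = model(torch.randn(1, 3, 64, 64))  # one fake 64x64 RGB image
print(logits.shape)  # torch.Size([1, 10])
```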

It's clean and beautiful to understand. In the case of VGGNet, there are 16 or 19 layers with 138 million parameters, and not many optimizations on those parameters, so the number of parameters is much higher than in the networks that followed, despite the network not being that deep. GoogLeNet introduced the inception module, starting to do some interesting things with small modules within these networks, which allow the training to be more efficient and effective.

The idea behind the inception module, shown here with the previous layer on the bottom and the module's output produced on top, is that different convolution sizes provide different value to the network. Smaller convolutions are able to capture, or propagate forward, features that are very local, high resolution in texture.

Larger convolutions are better able to represent and capture highly abstracted, higher order features. So the idea behind the inception module is to say: as opposed to choosing, in a hyperparameter tuning process or an architecture design process, which convolution size to go with, why not do all of them together, or at least several together?

In the case of the GoogLeNet model, there are 1x1, 3x3, and 5x5 convolutions, with our old trusty friend max pooling still left in there as well, though it has lost favor more and more over time for the image classification task. And the result is that fewer parameters are required: if you place these inception modules correctly, the number of parameters required to achieve higher performance is much lower.
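
Here is a hedged sketch of an inception-style block in PyTorch. The channel counts are made up, and it omits the 1x1 bottleneck convolutions the real GoogLeNet places before the larger filters; it only illustrates the parallel-filter-sizes idea.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Run 1x1, 3x3, and 5x5 convolutions plus max pooling in parallel
    on the same input, then concatenate their outputs along the channel axis."""
    def __init__(self, in_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.conv3 = nn.Conv2d(in_ch, 16, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(in_ch, 16, kernel_size=5, padding=2)
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(in_ch, 16, kernel_size=1))

    def forward(self, x):
        branches = [self.conv1(x), self.conv3(x), self.conv5(x), self.pool(x)]
        return torch.cat(branches, dim=1)   # stack the parallel feature maps

x = torch.randn(1, 32, 28, 28)
print(InceptionBlock(32)(x).shape)  # torch.Size([1, 64, 28, 28])
```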

ResNet, one of the most popular architectures still to date, and one we'll discuss for scene segmentation as well, came along and used the idea of a residual block. The initial inspiring observation, which doesn't necessarily hold true as it turns out, was that network depth increases representational power. These residual blocks allow you to have much deeper networks, and I'll explain why in a second.

The thought was that they work so well because the networks are much deeper. The key thing that makes these blocks so effective, an idea reminiscent of recurrent neural networks, which I hope we get a chance to talk about, is that training them is much easier.

They take a simple block, repeated over and over, and they pass the input along without transformation, alongside the ability to transform it, to learn the filters, learn the weights. So every layer is allowed not only to take in the processing of previous layers, but also to take in the raw, untransformed data and learn something new.
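
A hedged sketch of such a residual block in PyTorch, simplified by leaving out the batch normalization the actual ResNet uses:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two convolutions whose output is added to the untransformed input,
    so the block only has to learn what is new on top of the identity."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # pass the input along, plus the learned change

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```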

The ability to learn something new allows you to have much deeper networks, and the simplicity of this block allows for more effective training. The state of the art in 2017, the winner, is Squeeze-and-Excitation Networks, unlike the previous year, when CUImage simply took ensemble methods and combined a lot of successful approaches for a marginal improvement.

SENet got a significant improvement, at least in percentage terms: I think about a 25% reduction in error, from 4% to 3%, something like that. It did so using a very simple idea, a simple insight, that I think is important to mention: it added a parameter for each channel in the convolutional block.

So the network can now adjust the weighting on each channel, each feature map, based on the content, based on the input to the network. This is a takeaway to think about for any of the networks, any of the architectures: a lot of the time, recurrent neural networks and convolutional neural networks have tricks that significantly reduce the number of parameters, the bulk, the sort of low-hanging fruit.

They use spatial invariance or temporal invariance to reduce the number of parameters needed to represent the input data. But they also leave certain things not parameterized; they don't allow the network to learn them. Allowing the network, in this case, to learn the weighting on each of the individual channels, on each of the individual filters, along with the filters themselves, gives a huge boost.
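
A hedged sketch of that squeeze-and-excitation idea in PyTorch; the channel count and reduction ratio are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Learn a per-channel weighting from the content of the feature map itself,
    then rescale each channel by that learned weight."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        squeezed = x.mean(dim=(2, 3))                  # "squeeze": global average per channel
        weights = self.fc(squeezed).view(b, c, 1, 1)   # "excitation": learned channel weights
        return x * weights                             # rescale each channel by its weight

x = torch.randn(1, 64, 56, 56)
print(SqueezeExcitation(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```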

The cool thing about this is that it's applicable to any architecture. This kind of squeeze-and-excitation block can be dropped into any architecture, because it simply parameterizes the ability to choose which filters to emphasize based on the content. It's a subtle but crucial thing.

I think it's pretty cool. And for future research, it inspires us to think about what else can be parameterized in neural networks, what else can be controlled as part of the learning process, including higher and higher order hyperparameters. Which aspects of the training and of the architecture of the network can be part of the learning?

This is what this network inspires. Another network's ideas have been in development since the 90s by Geoff Hinton, but were really published on, and received significant attention, in 2017. I won't go into detail here; we are going to release an online-only video about capsule networks.

It's a little bit too technical, but it inspires a very important point that we should always think about with deep learning, whenever it's successful. As I mentioned with the cat eating a banana, on a philosophical and a mathematical level we have to consider what assumptions these networks make, and what, through those assumptions, they throw away.

Convolutional neural networks, due to their spatial invariance, throw away information about the relationships, the hierarchies, between simple and complex objects. So the face on the left and the face on the right look the same to a convolutional neural network.

The presence of eyes and nose and mouth is the essential aspect of what makes the classification task work for a convolutional network: it will fire and say this is definitely a face. But the spatial relationship is lost, is ignored. There are a lot of implications of this, but for things like pose variation, that information is lost.

We're throwing that away completely, and hoping that the pooling operation performed in these networks is able to sort of mesh everything together, combining the features that fire for the different parts of the face to come up with the overall classification that it's a face.

It does so without really representing the relationship between these features at the low level and the high level of the hierarchy, at the simple and the complex level. This is a super exciting area now, one that hopefully will spark developments in how we design neural networks that are able to learn rotational and orientation invariance as well.

Okay, so as I mentioned, you take these convolutional neural networks and chop off the final layers in order to apply them to a particular domain. And that is what we'll do with fully convolutional neural networks: the ones we task with segmenting the image at the pixel level. As a reminder, these networks, through the convolutional process, are really producing a heat map.

Different parts of the network get excited based on different aspects of the image. So it can be used for localization: not just classifying the image, but localizing the object. And it can do so at the pixel level. The convolutional layers are doing the encoding process.

They take the rich, raw sensory information in the image and encode it into an interpretable set of features, a representation, that can then be used for classification. But we can also use a decoder to upsample that information and produce a map like this. Fully convolutional neural networks: segmentation, semantic scene segmentation, image segmentation.

The goal is, as opposed to classifying the entire image, to classify every single pixel. It's pixel-level segmentation: you color every single pixel with the object that pixel belongs to in the 2D space of the image, the 2D projection of a three-dimensional world.

The thing is, there's been a lot of advancement in the last three years, but it's still an incredibly difficult problem, if you think about the amount of data used for training and the task of assigning a single label to every pixel, megapixels here, millions of pixels.

It's an extremely difficult problem. Why is this an interesting, important problem to try to solve, as opposed to drawing bounding boxes around cats? Well, it matters whenever precise object boundaries are important. Certainly in medical applications, when looking at imaging and detecting, for example, tumors in medical imaging of different organs.

And in driving and robotics, when objects are involved in a dense scene with vehicles, pedestrians, and cyclists, we need not just a loose estimate of where objects are; we need their exact boundaries. Then, potentially through data fusion, fusing sensors together, we can fuse this rich textural information about pedestrians, cyclists, and vehicles with lidar data that provides a three-dimensional map of the world.

We'll then have both the semantic meaning of the different objects and their exact three-dimensional locations. A lot of the work in semantic segmentation started with the 'Fully Convolutional Networks for Semantic Segmentation' paper, FCN; that's where the name FCN came from, in November 2014. I'll go through a few papers here to give you some intuition of where the field has gone.

And how that takes us to SegFuse, the segmentation competition. So FCN repurposed ImageNet pre-trained nets, the nets that were trained to classify what's in the entire image, and chopped off the fully connected layers. It then added decoder parts that upsample the features to produce a heat map.

Here it's shown with a tabby cat: a heat map of where the cat is in the image. It's at a much coarser resolution than the input image, 1/8 at best. Skip connections improve the coarseness of the upsampling. There are a few tricks; if you take the most naive approach, the upsampling is going to be extremely coarse.

Because that's the whole point of the neural network: the encoding part throws away all the useless data, down to the most essential aspects that represent the image. So you're throwing away a lot of the information that's needed to then form a high-resolution output. The trick is to take skip connections from a few of the later pooling layers, in a way similar to a residual block, and feed them toward the output, producing a higher and higher resolution heat map at the end.
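
A hedged, toy-scale sketch of that encoder-decoder idea with one skip connection, in PyTorch. This is not the actual FCN-8s architecture; the layer sizes, the single skip connection, and the 20-class output are simplifications for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """Encoder downsamples and extracts features; a 1x1 convolution scores each
    spatial location per class; the decoder upsamples back to the input size,
    adding a skip connection from an earlier, higher-resolution feature map."""
    def __init__(self, num_classes=20):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                  nn.MaxPool2d(2))    # 1/2 resolution
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                                  nn.MaxPool2d(2))    # 1/4 resolution
        self.score_low = nn.Conv2d(64, num_classes, 1)   # classify coarse features
        self.score_skip = nn.Conv2d(32, num_classes, 1)  # classify earlier features

    def forward(self, x):
        f1 = self.enc1(x)                 # 1/2 resolution features
        f2 = self.enc2(f1)                # 1/4 resolution features
        coarse = self.score_low(f2)
        # upsample coarse scores to 1/2 resolution and add the skip connection
        up = F.interpolate(coarse, size=f1.shape[2:], mode='bilinear',
                           align_corners=False) + self.score_skip(f1)
        # upsample to full resolution: one class score per pixel
        return F.interpolate(up, size=x.shape[2:], mode='bilinear',
                             align_corners=False)

x = torch.randn(1, 3, 128, 256)
print(TinyFCN()(x).shape)  # torch.Size([1, 20, 128, 256])
```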

SegNet, in 2015, applied this to the driving context, training on the KITTI data set. It showed a lot of interesting results and really explored the encoder-decoder formulation of the problem, solidifying the place of the encoder-decoder framework for the segmentation task. Dilated convolutions: I'm taking you through a few components here which are critical to the state of the art.

So, dilated convolutions. The convolution operation, like the pooling operation, reduces resolution significantly. A dilated convolution has a certain kind of grating, as visualized there, that maintains the local, high-resolution textures while still capturing the wider spatial window necessary. It's called a dilated convolutional layer, and in a 2015 paper it proved to be much better for producing high-resolution upsampled output.

DeepLab v1, v2, and now v3 added conditional random fields, which are the final piece of the state-of-the-art puzzle here. A lot of the successful networks today that do segmentation, not all of them, post-process using CRFs, conditional random fields. What they do is smooth the upsampled segmentation that results from the FCN by looking at the underlying image intensities.

So those are the key aspects of the successful approaches today. You have the encoder-decoder framework of a fully convolutional neural network, which replaces the fully connected layers with convolutional and deconvolutional layers. And as the years progressed, from 2014 to today, as usual, the underlying networks, from AlexNet to VGGNet and now to ResNet, have been one of the big reasons for the improvement in these networks' ability to perform segmentation.

So naturally, segmentation performance mirrored the ImageNet challenge performance as these networks were adapted. The state of the art uses ResNet or similar networks; conditional random fields for smoothing, based on the input image intensities; and dilated convolutions, which maintain the computational cost but increase the resolution of the intermediate feature maps and the upsampling.

And that takes us to the state of the art that we used to produce the images for the competition: ResNet-DUC, for dense upsampling convolution. Instead of bilinear upsampling, you make the upsampling learnable: you learn the upscaling filters. That's on the bottom, and that's really the key part that made it work.

There should be a theme here: sometimes the biggest addition you can make is parameterizing one of the aspects of the network that has been taken for granted, letting the network learn that aspect. The other addition, and I'm not sure how important it is to the success, but it's a cool little addition, is hybrid dilated convolution.

As I showed in the visualization where the convolution is spread apart a little bit from the input to the output: when the steps of that dilated convolution filter are varied, it produces a smoother result, because when the dilation is kept the same, certain input pixels get a lot more attention than others.

Removing that favoritism is what's achieved by using a variable dilation rate. Those are the two tricks, but really the biggest one is the parameterization of the upscaling filters. Okay, so that's what we used to generate the data, and that's the code we provide you with if you're interested in competing in SegFuse.
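
A hedged sketch of the dense-upsampling idea in PyTorch: a convolution with learnable weights predicts the scores for every sub-pixel position, and those channels are then rearranged into a higher-resolution map rather than bilinearly interpolated. This is a simplified stand-in, not the DUC paper's exact configuration; the channel counts, class count, and upscale factor are assumptions.

```python
import torch
import torch.nn as nn

class LearnedUpsample(nn.Module):
    """Predict class scores for every sub-pixel position with a learned
    convolution, then rearrange those channels into a higher-resolution map."""
    def __init__(self, in_channels, num_classes, upscale=8):
        super().__init__()
        # learnable upscaling filters: num_classes * upscale^2 output channels
        self.conv = nn.Conv2d(in_channels, num_classes * upscale ** 2,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(upscale)  # rearrange channels into space

    def forward(self, x):
        return self.shuffle(self.conv(x))

features = torch.randn(1, 512, 28, 28)           # coarse backbone features
print(LearnedUpsample(512, num_classes=19)(features).shape)
# torch.Size([1, 19, 224, 224])
```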

The other aspect here is that in everything we've talked about, from classification to segmentation, in making sense of images, the information about time, the temporal dynamics of the scene, is thrown away. And for the driving context, for the robotics context, and for what we'd like to do with SegFuse, the dynamic driving scene segmentation competition, you try to interpret what's going on in the scene over time and use that information.

Time is essential; the movement of pixels through time is essential. Understanding how objects move in 3D space, through the 2D projection of an image, is fascinating, and there's a whole set of open problems there. So flow, optical flow, is what's very helpful as a starting point to help us understand how these pixels move.

Optical flow, dense optical flow, is the computation, our best approximation, of where each pixel in one image moved to in the image that temporally follows it. There are two images; at 30 frames a second, there's one image at time zero and another 33.3 milliseconds later, and the dense optical flow is our best estimate of where each pixel in the first image moved to in the second image.

The optical flow, for every pixel, produces a direction of where we think that pixel moved, and the magnitude of how far it moved. That allows us to take information that we detected about the first frame and try to propagate it forward. That is the competition: to segment an image and propagate that information forward.
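
As a hedged illustration of that propagation idea, here is a sketch that computes dense optical flow with OpenCV's classical Farneback method and uses it to carry a per-pixel label map forward one frame. This is a stand-in, not the FlowNet 2.0 pipeline used to produce the competition data; the random frames and labels are placeholders, and the backward-warp lookup is an approximation.

```python
import cv2
import numpy as np

# Two consecutive grayscale frames (random stand-ins for real video frames).
prev = np.random.randint(0, 256, (256, 512), dtype=np.uint8)
curr = np.random.randint(0, 256, (256, 512), dtype=np.uint8)

# Dense optical flow: for every pixel, a (dx, dy) vector estimating its motion
# from the first frame to the second, 33.3 ms later.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

# Propagate per-pixel labels from the previous frame to the current one.
# Backward-warp approximation: each current pixel looks up the previous label
# at (x - dx, y - dy).
labels_prev = np.random.randint(0, 19, (256, 512)).astype(np.float32)
h, w = flow.shape[:2]
grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
map_x = (grid_x - flow[..., 0]).astype(np.float32)
map_y = (grid_y - flow[..., 1]).astype(np.float32)
labels_curr = cv2.remap(labels_prev, map_x, map_y,
                        interpolation=cv2.INTER_NEAREST)
print(flow.shape, labels_curr.shape)  # (256, 512, 2) (256, 512)
```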

For manual annotation of an image, this kind of coloring-book annotation where you color every single pixel, in the state-of-the-art data set for driving, Cityscapes, it takes 1.5 hours, 90 minutes, to do that coloring. That's 90 minutes per image, an extremely long time. That's why there doesn't exist today a data set, and in this class we're going to create one, of segmentation of these images through time, through video.

So, long videos where every single frame is fully segmented: that's still an open problem that we need to solve. Flow is a piece of that, and we also provide you with computed state-of-the-art flow, using FlowNet 2.0. FlowNet 1.0, in May 2015, used neural networks to learn dense optical flow.

It did so with two kinds of architectures: FlowNetS, FlowNet Simple, and FlowNetC, FlowNet Correlation. So what's the task here? There are two images that follow each other in time, 33.3 milliseconds apart, and your task is to produce, as the output, the dense optical flow.

For the simple architecture, you just stack the two images together; each is RGB, so this produces a six-channel input to the network. There's a lot of convolution, and finally it's the same kind of process as in the fully convolutional neural networks, to produce the optical flow. Then there is the FlowNet correlation architecture, where you perform some convolutions separately on each image before using a correlation layer to combine the feature maps.

Both are effective on different data sets and in different applications. FlowNet 2.0, from December 2016, is one of the state-of-the-art frameworks, the code base we used to generate the data I'll show. It combines FlowNetS and FlowNetC and improves over the initial FlowNet, producing a smoother flow field, preserving fine motion detail along the edges of objects, and running extremely efficiently: depending on the variant of the architecture, from 8 to 140 frames a second.

And the process there is essentially one that's common across various applications of deep learning: stacking these networks together. A very interesting aspect, which we're still exploring, and which again is applicable across all of deep learning, is that in this case there seemed to be a strong effect from training on multiple small, sparse data sets: the order in which those data sets were used in the training process mattered a lot.

That's very interesting. So, using FlowNet 2.0, here's the data set we're making available for SegFuse, the competition, at cars.mit.edu/segfuse. First, the original video: us driving around Cambridge, in high-definition 1080p and in 8K 360 video. Then we're providing the ground truth for a training set: for that training set, for every single frame at 30 frames a second, we're providing the segmentation, frame to frame to frame, segmented on Mechanical Turk.

We're also providing the output of the network I mentioned, the state-of-the-art segmentation network, which is pretty damn close to the ground truth, but still not there. And your task, this is the interesting thing, well, there are two options; one is to take the output of this network and use other networks to help you propagate the information better.

The output of this segmentation network is produced frame by frame by frame; it's not using the temporal information at all. So the question is: can we figure out a way, can we figure out tricks, to use temporal information to improve this segmentation so that it looks more like the ground truth segmentation?

We're also providing the optical flow, from frame to frame to frame, based on FlowNet 2.0: how each of the pixels moved. Okay. And that forms the SegFuse competition: 10,000 images. The task is to submit code; we have starter code in Python on GitHub. Take in the original video; for the training set, take in the ground truth; take the segmentation from the state-of-the-art segmentation network and the optical flow from the state-of-the-art optical flow network; and, putting that all together, improve the segmentation on the bottom left to try to achieve the ground truth on the top right.

Okay. With that, I'd like to thank you. Tomorrow at 1 p.m. is Waymo, in Stata, 32-123. The next lecture, next week, will be on deep learning for sensing the human, understanding the human. And we will release an online-only lecture on capsule networks and GANs, generative adversarial networks. Thank you very much.

Thank you very much. (Applause)