Back to Index

Lesson 14: Deep Learning Part 2 2018 - Super resolution; Image segmentation with Unet


Chapters

0:00
2:13 Style Transfer
3:50 Super Resolution
16:33 Data Augmentation
16:42 Random Dihedral
18:20 Transformations
19:06 Transform Types
26:00 Enhanced Deep Residual Networks
34:52 Upsampling
35:40 Transposed Convolutions
43:45 Pixel Shuffle
50:53 Perceptual Loss
58:20 Progressive Resizing
64:10 What Are the Future Plans for Fast Ai in this Course
68:27 Leverage Your Knowledge about Your Domain
75:26 Reinforcement Learning
89:21 Segmentation
97:04 Transform Type Classification
102:50 Upscaling
107:09 The Dice Metric
111:48 U-Net Blocks
112:35 Dynamic U-Net
119:34 Feature Pyramid Networks

Transcript

Thank you. Welcome to the last lesson, lesson 14. We're going to be looking at image segmentation today, amongst other things, but before we do, a bit of show and tell from last week. Elena Harley did something really interesting, which was she tried finding out what would happen if you did CycleGAN on just 300 or 400 images.

I really like these projects where people just go to Google image search using the API or one of the libraries out there. Some of our students have created some very good libraries for interacting with Google images API, download a bunch of stuff they're interested in, in this case some photos and some stained glass windows, and with 300 or 400 photos of that she trained a model.

She trained actually a few different models, this is what I particularly liked, and as you can see, with quite a small number of images she gets some very nice stained glass effects. So I thought that was an interesting example of using pretty small amounts of data that was readily available, which she was able to download pretty quickly, and there's more information about that on the forum if you're interested.

It's interesting to wonder about what kinds of things people will come up with with this kind of generative model, it's clearly a great artistic medium, it's clearly a great medium for forgeries and fakeries, I wonder what other kinds of things people will realize they can do with these kind of generative models.

I think audio is going to be the next big area, and also very interactive type stuff. Nvidia just released a paper showing an interactive photo repair tool where you just brush over an object and it replaces it with a deep learning generated replacement very nicely. Those kinds of interactive tools I think will be very interesting too.

So before we talk about segmentation, we've got some stuff to finish up from last time which is that we looked at doing style transfer by actually directly optimizing pixels. Like with most of the things in Part 2, it's not so much that I'm wanting you to understand style transfer per se, but the kind of idea of optimizing your input directly and using activations as part of a loss function is really the key kind of takeaway here.

So it's interesting then to kind of see what is effectively the follow-up paper, not from the same people, but the paper that kind of came next in the sequence of these kind of vision generative models with this one from Justin Johnson and folks at Stanford. And it actually does the same thing, style transfer, but it does it in a different way.

Rather than optimizing the pixels, we're going to go back to something much more familiar and optimize some weights. And so specifically we're going to train a model which learns to take a photo and translate it into a photo in the style of a particular artwork. So each ConvNet will learn to produce one kind of style.

Now it turns out that getting to that point, there's an intermediate point which is I actually think kind of more useful and takes us halfway there, which is something called super-resolution. So we're actually going to start with super-resolution because then we'll build on top of super-resolution to finish off the style transfer, ConvNet based style transfer.

And so super-resolution is where we take a low-res image, we're going to take 72x72 and upscale it to a larger image, 288x288 in our case, trying to create a higher-res image that looks as real as possible. And so this is a pretty challenging thing to do because at 72x72 there's not that much information about a lot of the details.

And the cool thing is that we're going to do it in a way as we tend to do with vision models which is not tied to the input size. So you could totally then take this model and apply it to a 288x288 image and get something that's 4 times bigger on each side, so 16 times bigger than that.

But often it even works better at that level because you're really introducing a lot of detail into the finer details and you could really print out a high-resolution print of something which earlier on was pretty pixelated. So this is the notebook called Enhance. And it is a lot like that kind of CSI style enhancement where we're going to take something that appears like the information is just not there and we kind of invent it, but the ConvNet is going to learn to invent it in a way that's consistent with the information that is there, so hopefully it's kind of inventing the right information.

One of the really nice things about this kind of problem is that we can create our own dataset as big as we like without any labeling requirements because we can easily create a low-res image from a high-res image just by downsampling our images. So something I would love some of you to try during the week would be to do other types of image-to-image translation where you can invent kind of labels, invent your dependent variable.

For example, de-skewing, so either recognize things that have been rotated by 90 degrees or better still that have been rotated by 5 degrees and straighten them. Colorization, so make a bunch of images into black and white and learn to put the color back again. Noise reduction, maybe do a really low-quality JPEG save and learn to put it back to how it should have been, and so forth, or maybe take something that's in a 16 color palette and put it back to a higher color palette.

I think these things are all interesting because they can be used to take pictures that you may have taken back on crappy old digital cameras before they were high resolution, or you may have scanned in some old photos that have faded or whatever, I think it's a really useful thing to be able to do, and also it's a good project because it's really similar to what we're doing here, but different enough that you'll come across some interesting challenges on the way, I'm sure.

So I'm going to use ImageNet again. You don't need to use all of ImageNet at all, I just happen to have it lying around. You can download the 1% sample of ImageNet from files.fast.ai. You can use any set of pictures you have lying around, honestly. And in this case, as I said, we don't really have labels per se, so I'm just going to give everything a label of 0 just so we can use it with our existing infrastructure more easily.

Now because in this case I'm pointing at a folder that contains all of ImageNet, I certainly don't want to wait for all of ImageNet to finish in order to run an epoch. So here most of the time I would set keep_pct to 1 or 2%; I just generate a bunch of random numbers, one per file, and then I just keep those which are less than 0.02, and so that lets me quickly sub-sample my rows.
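As a rough sketch of that sub-sampling step (the file list here is just a stand-in, and keep_pct is my name for the keep percentage described above):

```python
import numpy as np

np.random.seed(42)
keep_pct = 0.02
# stand-in for the real array of ImageNet file names
fnames = np.array([f"img_{i}.jpg" for i in range(100_000)])
keep = np.random.rand(len(fnames)) < keep_pct
fnames = fnames[keep]   # roughly 2% of the files survive
```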

So we're going to use VGG16, and VGG16 is something that we haven't really looked at in this class, but it's a very simple model where we take our normal, presumably 3-channel input, and we basically run it through a number of 3x3 convolutions, and then from time to time we put it through a 2x2 MaxPool, and then we do a few more 3x3 convolutions, MaxPool, so on and so forth.

And then this is kind of our backbone, I guess. And then we don't do an average pooling layer, or an adaptive average pooling layer. After a few of these we end up with a 7x7 grid as usual, I think it's 7x7x512. And so rather than average pooling we do something different, which is we flatten the whole thing.

So that spits out a very long vector of activations of size 7x7x512, if memory serves correctly. And then that gets fed into two fully connected layers, each one of which has 4096 activations, and then one more fully connected layer which has however many classes. So if you think about it, the weight matrix here is huge, it's 7x7x512x4096, and it's because of that weight matrix really that VGG went out of favor pretty quickly, because it takes a lot of memory, it takes a lot of computation, and it's really slow.

And there's a lot of redundant stuff going on here, because really those 512 activations are not that specific to which of those 7x7 grid cells they're in, but when you have this entire weight matrix here of every possible combination, it treats all of them uniquely. And so that can also lead to generalization problems, because there's just a lot of weights and so forth.

My view is that the approach used in every modern network, which is adaptive average pooling (in Keras it's known as global average pooling, and in fast.ai we generally do a concat pooling), which spits it straight down to a 512-long activation, is throwing away too much geometry. So to me probably the correct answer is somewhere in between and would involve some kind of factored convolution or some kind of tensor decomposition, which maybe some of us can think about in the coming months.

So for now we've gone from one extreme, which is the adaptive average pooling, to the other extreme, which is this huge flattened fully connected layer. So a couple of things are interesting about VGG that make it still useful today.

The first one is that there are more interesting layers going on early in VGG. With most modern networks, including the ResNet family, the very first layer is generally a 7x7 stride-2 convolution, or something similar, which means we throw away half the grid size straight away, and so there's little opportunity to use the fine detail because we never do any computation with it. And that's a bit of a problem for things like segmentation or super resolution models, because the fine detail matters and we actually want to restore it.

And then the second problem is that the adaptive average pooling layer entirely throws away the geometry in the last few sections, which means that the rest of the model doesn't really have as much interest in learning the geometry as it otherwise might. And so therefore, for things which are dependent on position, any kind of localization-based approach or anything that requires generative modeling is going to be less effective.

So one of the things I'm hoping you're hearing as I describe this is that probably none of the existing architectures are actually ideal. We can invent a new one. And actually I just tried inventing a new one over the week, which was to take the VGG head and attach it to a ResNet backbone.

And interestingly I found I actually got a slightly better classifier than a normal ResNet, but it also was something with a little bit more useful information. It took 5 or 10% longer to train, but nothing worth worrying about. I think maybe we could also, in ResNet, replace this very early convolution, as we've talked about briefly before, with something more like an Inception stem which does a bit more computation.

I think there's definitely room for some nice little tweaks to these architectures so that we can build some models which are maybe more versatile. At the moment people tend to build architectures that just do one thing. They don't really think what am I throwing away in terms of opportunity because that's how publishing works.

You know you publish like I've got the state-of-the-art in this one thing rather than I've created something that's good at lots of things. So for these reasons we're going to use VGG today even though it's ancient and it's missing lots of great stuff. One thing we are going to do though is use a slightly more modern version which is a version of VGG where batch norm has been added after all the convolutions.

And so in fast.ai actually when you ask for a VGG network you always get the batch norm one, because that's basically always what you want. So this is actually our VGG with batch norm. There's a 16 and a 19. The 19 is way bigger and heavier and doesn't really perform any better, so no one really uses it.

So we're going to go from 72x72; LR means low resolution, and sz_lr is the low resolution size. We're going to initially scale it up by 2x with a batch size of 64, to get a 144x144 output. So that's going to be our stage 1. We'll create our own dataset for this, and it's very worthwhile looking inside the fastai.dataset module and seeing what's there, because for just about anything you'd want we probably have something that's almost what you want.

So in this case I want a dataset where my x's are images and my y's are also images. So there's already a files dataset we can inherit from where the x's are images and then I just inherit from that and I just copied and pasted the get x and turned that into get y so it just opens an image.
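Roughly, that dataset looks like the following; this is a sketch assuming the fastai 0.7 FilesDataset and open_image helpers rather than the exact notebook code:

```python
import os
from fastai.dataset import FilesDataset, open_image  # fastai 0.7-era imports

class MatchedFilesDataset(FilesDataset):
    """x and y are both images, each given as an array of file names."""
    def __init__(self, fnames, y, transform, path):
        self.y = y
        assert len(fnames) == len(y)
        super().__init__(fnames, transform, path)
    def get_y(self, i):
        # mirror get_x: just open the i-th y image from disk
        return open_image(os.path.join(self.path, self.y[i]))
    def get_c(self):
        return 0  # no classes; this is an image-to-image problem
```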

So now I've got something where the x is an image and the y is an image, and in both cases what we're passing in is an array of file names. I'm going to do some data augmentation; obviously with all of ImageNet we don't really need it, but this is mainly here for anybody who's using smaller datasets, to make the most of it.

Random dihedral is referring to every possible 90 degree rotation plus optional left/right flipping, so the dihedral group of eight symmetries. Probably we don't use this transformation for ImageNet pictures because you don't normally flip dogs upside down, but in this case we're not trying to classify whether it's a dog or a cat, we're just trying to keep the general structure of it, so actually every possible flip is a reasonably sensible thing to do for this problem.

So create a validation set in the usual way, and you can see I'm kind of using a few more slightly lower level functions, generally speaking I just copy and paste them out of the fast.ai source code to find the bits I want. So here's the bit which takes an array of validation set indexes and one or more arrays of variables and simply splits, so in this case this into a training and a validation set and this into a training and a validation set to give us our x's and y's.

Now in this case the x and y are the same, our image and our output are the same, and we're going to use transformations to make one of them lower resolution, so that's why these are the same thing. So the next thing that we need to do is to create our transformations as per usual, and we're going to use this transform y parameter like we did for bounding boxes, but rather than use transform type.coordinate, we're going to use transform type.pixel, and so that tells our transformations framework that your y values are images with normal pixels in them, so anything you do to the x you also need to do to the y.

And you need to make sure any data augmentation transforms you use have the same parameter as well. So you can see the possible transform types: basically you've got classification, which we're about to use for segmentation in the second half of today, coordinates, no transformation at all, or pixel. So once we've got a dataset class and some x and y training and validation sets, there's a handy little method called get_datasets, which basically runs that constructor over all the different things that you have to return all the datasets that you need in exactly the right format to pass to a model data constructor, in this case the image data constructor.
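Put together, the data pipeline looks roughly like this, again assuming the fastai 0.7 API; the variable names (trn_x, trn_y, val_x, val_y, PATH) are the arrays and path from the split above, and the exact argument names are approximate:

```python
from fastai.conv_learner import *  # fastai 0.7: brings in tfms_from_model, TfmType, ImageData, vgg16, ...

sz_lr, sz_hr, bs = 72, 144, 64     # low-res input size, high-res target size, batch size
arch = vgg16

# every 90-degree rotation plus flips; tfm_y=PIXEL means "do the same thing to y"
aug_tfms = [RandomDihedral(tfm_y=TfmType.PIXEL)]
tfms = tfms_from_model(arch, sz_lr, tfm_y=TfmType.PIXEL, aug_tfms=aug_tfms, sz_y=sz_hr)

datasets = ImageData.get_ds(MatchedFilesDataset, (trn_x, trn_y), (val_x, val_y), tfms, path=PATH)
md = ImageData(PATH, datasets, bs, num_workers=8, classes=None)
```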

So we're kind of like going back under the covers of fast.ai a little bit and building it up from scratch. And in the next few weeks this will all be wrapped up and refactored into something that you can do in a single step in fast.ai, but the point of this class is to learn a bit about going under the covers.

So something we've briefly seen before is that when we take images in we transform them, not just with data augmentation, but we also move the channels dimension up to the start, we subtract the mean, divide by the standard deviation, whatever. So if we want to be able to display those pictures that have come out of our datasets or data loaders, we need to denormalize them, and so the model data object's dataset has a denorm function that knows how to do that, so I'm just going to give that a short name for convenience.

So now I'm going to create a function that can show an image from a dataset, and if you pass in something saying this is a normalized image, then we'll denormalize it. So we can go ahead and have a look at that. You'll see here we've passed in the low resolution size as our size for the transforms, and the high resolution size as the size_y parameter, which is something new.

So the two bits are going to get different sizes. And so here you can see the two different resolutions of our x and our y for a whole bunch of fish. As per usual, plt.subplots to create our two plots, and then we can just use the different axes that came back to put stuff next to each other.

So we can then have a look at a few different versions of the data transformation, and there you can see them being flipped in all different directions. So let's create our model. So we're going to have an image coming in, a small image coming in, and we want to have a big image coming out.

And so we need to do some computation between those two to calculate what the big image would look like. And so essentially there's kind of two ways of doing that computation. We could first of all do some upsampling, and then do a few stride1 kind of layers to do lots of computation.

Or we could first do lots of stride-1 layers to do all the computation, and then at the end do some upsampling. We're going to pick the second approach, because we want to do lots of computation on something smaller, because it's much faster to do it that way, and also we get to leverage all that computation during the upsampling process.

So upsampling, we know a couple of possible ways to do that. We can use transposed or fractionally strided convolutions, or we can use nearest neighbor upsampling, followed by a 1x1 conv. And then in the "do lots of computation" section, we could just have a whole bunch of 3x3 convs.

But in this case in particular, it seems likely that ResNet blocks are going to be better, because really the output and the input are very similar. So we really want a flow-through path that allows as little fussing around as possible except the minimal amount necessary to do our super-resolution.

And so if we use ResNet blocks, then they have an identity path already. So you could imagine the most simple version where it does a bilinear sampling kind of approach or something. It could basically just go through identity blocks all the way through, and then in the upsampling blocks just learn to take the averages of the inputs and get something that's not too terrible.

So that's what we're going to do. We're going to create something with 5 ResNet blocks, and then for each 2x scale-up we have to do, we'll have one upsampling block. So they're all going to consist of, obviously as per usual, convolution layers, possibly with activation functions after many of them.

So I kind of like to put my standard convolution block into a function so I can refactor it more easily. As per usual I just won't worry about passing in padding and just calculate it directly as kernel size over 2. So one interesting thing about our little conv block here is that there's no batch norm, which is pretty unusual for ResNet-type models.
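That conv block is roughly the following sketch; note that there's deliberately no batch norm in it, which is what the next paragraph is about:

```python
import torch.nn as nn

def conv(ni, nf, kernel_size=3, actn=False):
    # padding = kernel_size // 2 keeps the grid size unchanged; no batch norm here
    layers = [nn.Conv2d(ni, nf, kernel_size, padding=kernel_size // 2)]
    if actn:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)
```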

And the reason there's no batch norm is because I'm stealing ideas from this fantastic recent paper which actually won a recent competition in super-resolution performance. And to see how good this paper is, here's kind of a previous state of the art, this SR ResNet, and what they've done here is they've zoomed way in to an upsampled kind of net or fence, this is the original.

And you can see in the previous best approach there's a whole lot of distortion and blurring going on, whereas in their approach it's nearly perfect. So it was a really big step up this paper. They call their model EDSR, Enhanced Deep Residual Networks. And they did two things differently to the previous standard approaches.

One was to take the ResNet block, this is a regular ResNet block, and throw away the batch norm. So why would they throw away the batch norm? Well the reason they would throw away the batch norm is because batch norm changes stuff, and we want a nice straight-through path that doesn't change stuff.

So the idea basically here is if you don't want to fiddle with the input more than you have to, then don't force it to have to calculate things like batch norm parameters. So throw away the batch norm. And the second trick we'll see shortly. So here's a conv with no batch norm.

And so then we're going to create a residual block containing, as per usual, two convolutions. And as you see in their approach, they don't even have a ReLU after their second conv. So that's why I've only got an activation on the first one. So a couple of interesting things here.

One is that this idea of having some kind of main ResNet path, like conv_relu_conv, and then turning that into a ResNet block by adding it back to the identity, is something we do so often. We've kind of factored it out into a tiny little module called res_sequential, which simply takes a bunch of layers that you want to put into your residual path, turns that into a sequential model, runs it, and then adds it back to the input.

So with this little module we can now turn anything like conv_activation_conv into a ResNet block, just by wrapping it in res_sequential. But that's not quite all I'm doing, because normally a res block just adds the residual path back onto the input in its forward, but here I'm also multiplying the residual path by res_scale. What's res_scale? Res_scale is the number 0.1.
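Here's a sketch of those two pieces, building on the conv helper above; the 0.1 is the res_scale we're about to discuss:

```python
import torch.nn as nn

class ResSequential(nn.Module):
    """Run some layers as a residual path and add the result, scaled by
    res_scale, back onto the input."""
    def __init__(self, layers, res_scale=1.0):
        super().__init__()
        self.res_scale = res_scale
        self.m = nn.Sequential(*layers)
    def forward(self, x):
        return x + self.m(x) * self.res_scale

def res_block(nf):
    # conv -> ReLU -> conv, no activation after the second conv, scaled by 0.1
    return ResSequential([conv(nf, nf, actn=True), conv(nf, nf)], res_scale=0.1)
```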

Why is it there? I'm not sure anybody quite knows. But the short answer is that the guy who invented batchnorm also somewhat more recently did a paper in which he showed, I think the first time, the ability to train imageNet in under an hour. And the way he did it was fire up lots and lots of machines and have them work in parallel to create really large batch sizes.

Now generally when you increase the batch size by order n, you also increase the learning rate by order n to go with it. So generally very large batch size training means very high learning rate training as well. And he found that with these very large batch sizes of 8,000 plus, or even up to 32,000, that at the start of training his activations would basically go straight to infinity.

And a lot of other people found that; we actually found, when we were competing in DAWNBench on both the CIFAR and the ImageNet competitions, that we really struggled to make the most of even the eight GPUs that we were trying to take advantage of, because of these challenges with these larger batch sizes and taking advantage of them.

So something that Christian found, this researcher, was that in the resNet blocks, if he multiplied them by some number smaller than 1, something like 0.1 or 0.2, it really helped stabilize training at the start. And that's kind of weird because mathematically it's kind of identical, because obviously whatever I'm multiplying it by here, I could just scale the weights by the opposite amount here and have the same number.

So it's kind of like we're not dealing with abstract math, we're dealing with real optimization problems and different initializations and learning rates and whatever else. And so the problem of weights disappearing off into infinity I guess generally is really about the kind of discrete and finite nature of computers in practice.

And so often these kind of little tricks can make the difference. So in this case we're just kind of toning things down, at least based on our initialization. And so there are probably other ways to do this. For example, one approach from some folks at Nvidia called Lars, L-A-R-S, which I briefly mentioned last week, is an approach which uses discriminative learning rates calculated in real time, basically looking at the ratio between the gradients and the activations to scale learning rates by layer.

And so they found that they didn't need this trick to scale up the batch sizes a lot. Maybe a different initialization would be all that's necessary. The reason I mention this is not so much because I think a lot of you are likely to want to train on massive clusters of computers, but rather that I think a lot of you want to train models quickly, and that means using high learning rates and ideally getting super-convergence.

And I think these kinds of tricks are the tricks that we'll need to be able to get super-convergence across more different architectures and so forth. Other than Leslie Smith, no one else is really working on super-convergence apart from some fast.ai students nowadays. So these kinds of questions about how we train at very, very high learning rates, we're going to have to be the ones who figure them out, because as far as I can tell nobody else cares yet.

So I think looking at the literature around training ImageNet in one hour, or more recently there's now a train ImageNet in 15 minutes, these papers actually have some of the tricks to allow us to train things at high learning rates. And so here's one of them. And so interestingly other than the train ImageNet in one hour paper, the only other place I've seen this mentioned was in this EDSR paper.

And it's really cool because people who win competitions, I just find them to be very pragmatic and well-read. They actually have to get things to work. And so this paper describes an approach which actually worked better than anybody else's approach. And they did these pragmatic things like throw away batch norm and use this little scaling factor which almost nobody else seems to know about and stuff like that.

So that's where the point one comes from. So basically our super-resolution ResNet is going to do a convolution to go from our three channels to 64 channels just to richen up the space a little bit. Oh sorry, I've got actually 8, not 5. Eight lots of these res blocks.

Remember every one of these res blocks is stride 1, so the grid size doesn't change, the number of filters doesn't change, it's just 64 all the way through. We'll do one more convolution and then we'll do our up-sampling by however much scale we asked for. And then something I've added, which is a little idea, is just one batch norm here, because it kind of felt like it might be helpful just to scale the last layer.

And then finally a conv to go back to the three channels we want. So you can see that's basically here's lots and lots of computation and then a little bit of up-sampling just like we kind of described. So the only other piece here then is -- and also just to mention as you can see as I'm tending to do now, this whole thing is done by creating just a list of layers and then at the end turning that into a sequential model, and so my forward function is as simple as can be.
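So the whole model is roughly the sketch below, using the conv and res_block helpers from above plus an upsample helper we'll get to in a moment (the pixel shuffle version):

```python
import torch.nn as nn

class SrResnet(nn.Module):
    def __init__(self, nf=64, scale=2):
        super().__init__()
        features = [conv(3, nf)]                        # richen 3 channels up to 64
        features += [res_block(nf) for _ in range(8)]   # lots of stride-1 computation
        features += [conv(nf, nf),
                     upsample(nf, nf, scale),           # pixel-shuffle upsampling, defined below
                     nn.BatchNorm2d(nf),                # one batch norm to scale the last layer
                     conv(nf, 3)]                       # back to 3 channels
        self.features = nn.Sequential(*features)
    def forward(self, x):
        return self.features(x)
```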

So here's our up-sampling. And up-sampling is a bit interesting because it is not doing either of these two things. So let's talk a bit about up-sampling. Here's a picture from the paper, not from the competition-winning paper but from this original paper. And so they're saying our approach is so much better, but look at their approach.

It's got goddamn artifacts in it. These just pop up everywhere, don't they? And so one of the reasons for this is that they use transposed convolutions, and we all know don't use transposed convolutions. So here are transposed convolutions. This is from this fantastic convolutional arithmetic paper that was shown also in the Theano docs.

If we're going from the blue is the original image, so a 3x3 image up to a 5x5 image, or a 6x6 if we added a layer of padding, then all a transposed convolution does is it uses a regular 3x3 conv, but it sticks white 0 pixels between every pair of pixels.

So that makes the input image bigger and when we run this convolution up over it, it therefore gives us a larger output. But that's obviously stupid because when we get here, for example, of the 9 pixels coming in, 8 of them are 0. So we're just wasting a whole lot of computation.

And then on the other hand, if we're slightly off over here, then 4 of our 9 are non-zero. But yet we only have one filter, like one kernel to use, so it can't change depending on how many zeros are coming in, so it has to be suitable for both.

And it's just not possible. So we end up with these artifacts. So one approach we've learned to make it a bit better is to not put white things here, but instead to copy this pixel's value to each of these three locations. That's certainly a bit better, but it's still pretty crappy because now still when we get to these 9 here, 4 of them are exactly the same number.

And when we move across 1, then now we've got a different situation entirely. And so depending on where we are, in particular if we're here, there's going to be a lot less repetition. So again we have this problem where there's wasted computation and too much structure in the data and it's going to lead to artifacts.

So up-sampling is better than transposed convolutions, it's better to copy them rather than replace them with zeros, but it's still not quite good enough. So instead we're going to do the pixel shuffle. So the pixel shuffle is an operation in this sub-pixel convolutional neural network. And it's a little bit mind-bending, but it's kind of fascinating.

And so we start with our input, we go through some convolutions to create some feature maps for a while until eventually we get to layer i-1, which has n_{i-1} feature maps. We're going to do another 3x3 conv. And our goal here is to go from a 7x7 grid; we're going to do a 3x upscaling, so we're going to go up to a 21x21 grid.

So what's another way we could do that? To make it simpler, let's just pick one feature map, just one filter. So we'll just take the topmost filter and just do a convolution over that, just to see what happens. And what we're going to do is use a convolution where the number of filters is 9 times bigger than we, strictly speaking, need.

So if we needed 64 filters, we're actually going to do 64 times 9 filters. Why is that? And so here r is the scale factor, so 3, so r squared, 3 squared is 9. So here are the 9 filters to cover one of these input layers, one of these input slices.

But what we can do is this: we started with 7x7 and we turned it into 7x7x9. Well, the output that we want is (7x3) by (7x3), so in other words there's an equal number of activations there as there are activations here. So we can literally reshuffle these 7x7x9 activations to create this (7x3) by (7x3) map.

And so what we're going to do is we're going to take one little tube here, the top left hand of each grid, and we're going to put the purple one up in the top left, and then the blue one, one to the right, and then the light blue one, one to the right of that, and then the slightly darker blue one in the middle of the far left, the green one in the middle, and so forth.

So each of these 9 cells in the top left are going to end up in this little 3x3 section of our grid. And then we're going to take 2, 1 and take all of those 9 and move them to these 3x3 part of the grid, and so on and so forth.

And so we're going to end up having every one of these 7x7x9 activations inside this (7x3) by (7x3) image. So the first thing to realize is, yes of course this works under some definition of works, because we have a learnable convolution here, and it's going to get some gradients, which is going to do the best job it can of filling in the correct activations such that this output is the thing we want.
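You can see the shape bookkeeping directly with PyTorch's built-in PixelShuffle; the numbers here just mirror the 7x7, r=3 example:

```python
import torch
import torch.nn as nn

# a 7x7 grid with r^2 = 9 times the channels we ultimately want (here 1 output channel)
x = torch.randn(1, 1 * 3 ** 2, 7, 7)
y = nn.PixelShuffle(3)(x)   # each 9-channel tube becomes one 3x3 spatial patch
print(y.shape)              # torch.Size([1, 1, 21, 21])
```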

So the first step is to realize there's nothing particularly magical here, we can create any architecture we like, we can move things around anyhow we want to, and our weights in the convolution will do their best to do all we asked. The real question is, is it a good idea?

Is this an easier thing for it to do, and a more flexible thing for it to do, than the transposed convolution or the upsampling followed by 1x1 conv? And the short answer is, yes it is. And the reason it's better in short is that the convolution here is happening in the low resolution 7x7 space, which is quite efficient, whereas if we first of all upsampled and then did our conv, then our conv would be happening in the 21x21 space, which is a lot of computation.

And furthermore as we discussed, there's a lot of replication and redundancy in the nearest neighbor upsampled version. So they actually show in this paper, in fact I think they have a follow-up technical note where they provide some more mathematical details as to exactly what work is being done and show that the work really is more efficient this way.

So that's what we're going to do. So for our upsampling we're going to have two steps. The first will be a 3x3 conv with R^2 times more channels than we originally wanted, and then a pixel shuffle operation which moves everything in each grid cell into the little R x R grids that are located throughout here.

So here it is, it's one line of code. And so here's the conv from number of in to number of filters out times 4, because we're doing a scale2 upsample, so 2^2 is 4. So that's our convolution, and then here is our pixel shuffle, it's built into PyTorch. Pixel shuffle is the thing that moves each thing into its right spot.
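As a sketch, using the conv helper from earlier:

```python
import math
import torch.nn as nn

def upsample(ni, nf, scale):
    # each step: conv to 4x the channels, then PixelShuffle(2) rearranges them
    # into a grid that's 2x bigger on each side; repeat log2(scale) times
    layers = []
    for _ in range(int(math.log(scale, 2))):
        layers += [conv(ni, nf * 4), nn.PixelShuffle(2)]
    return nn.Sequential(*layers)
```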

So that will increase, will upsample, by a scale factor of 2, and so we need to do that log base 2 of scale times; so if scale is 4, then we have to do it 2 times to go 2 times 2 bigger. So that's what this upsample here does. Great, guess what?

That does not get rid of the checkerboard patterns. We still have checkerboard patterns. So I'm sure in great fury and frustration, this same team from Twitter, I think this was back when they used to be at a startup called MagicPony that Twitter bought, came back again with another paper saying, okay, this time we've got rid of the checkerboard.

So why do we still have, as you can see here, we still have a checkerboard? And so the reason we still have a checkerboard, even after doing this, is that when we randomly initialize this convolutional kernel at the start, it means that each of these 9 pixels in this little 3x3 grid over here are going to be totally randomly different.

But then the next set of 3 pixels will be randomly different to each other, but will be very similar to the corresponding pixel in the previous 3x3 section. So we're going to have repeating 3x3 things all the way across. And so then as we try to learn something better, it's starting from this repeating 3x3 starting point, which is not what we want.

What we actually would want is for these 3x3 pixels to be the same to start with. So to make these 3x3 pixels the same, we would need to make these 9 channels the same here, for each filter. And so the solution in this paper is very simple: when we initialize this convolution at the start, when we randomly initialize it, we don't totally randomly initialize it.

We randomly initialize one of the R^2 sets of channels, and then we copy that to the other R^2 - 1 sets, so they're all the same. And that way, initially, each of these 3x3s will be the same. And so that is called ICNR, and that's what we're going to use in a moment.

So before we do, let's take a quick look. So we've got this super resolution ResNet, which does lots of computation with lots of ResNet blocks, and then it does some up-sampling and gets our final 3 channels out. And then to make life faster, we're going to run this in parallel.

One reason we want to run it in parallel is because Dorado told us that he has 6 GPUs, and this is what his computer looks like right now. And so I'm sure anybody who has more than one GPU has had this experience before. So how do we get all of these GPUs working together?

All you need to do is to take your PyTorch module and wrap it with nn.DataParallel. And once you've done that, it copies it to each of your GPUs and will automatically run it in parallel. It scales pretty well to 2 GPUs, okay to 3 GPUs, better than nothing to 4 GPUs, and beyond that performance starts to go backwards.

By default it will copy it to all of your GPUs. You can pass in an array of GPU IDs if you want to restrict it, which you want if, for example, you have to share a box. I have to share our box with Yannette, and if I didn't put this here, then she would be yelling at me right now, or maybe boycotting my class.

So this is how you avoid getting into trouble with Yannette. So one thing to be aware of here is that once you do this, it actually modifies your module. So if you now print out your module, let's say previously it was just an nn.Sequential, now you'll find it's an nn.Sequential embedded inside an attribute called module.

And so in other words, if you save something which you had wrapped in nn.DataParallel, and then try to load it back into something that you hadn't wrapped in nn.DataParallel, it'll say it doesn't match up, because one of them is embedded inside this module attribute and the other one isn't. It may also depend even on which GPU IDs you had it copied to.

So two possible solutions, one is don't save the module m, but instead save the module attribute m.module, because that's actually the non-data parallel bit. Or always put it on the same GPU IDs and use data parallel and load and save that every time. That's what I was using. This would be an easy thing for me to fix automatically in fast.ai and I'll do it pretty soon so it'll look for that module attribute and deal with it automatically, but for now we have to do it manually.
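In code, the wrapping and the save/load workaround look something like this; the GPU IDs and file name are just examples, and SrResnet is the sketch from above:

```python
import torch
import torch.nn as nn

model = SrResnet(64, scale=2).cuda()
model = nn.DataParallel(model, device_ids=[0, 1])   # restrict to the GPUs you're allowed to use

# After wrapping, the real network lives under model.module, so save that inner
# module; the checkpoint then loads cleanly with or without DataParallel.
torch.save(model.module.state_dict(), 'sr_resnet_x2.pt')

plain = SrResnet(64, scale=2)
plain.load_state_dict(torch.load('sr_resnet_x2.pt'))
```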

It's probably useful to know what's going on behind the scenes anyway. So we've got our module, I find it'll run like 50% or 60% faster on a 1080ti. If you're running on Volta, it actually parallelizes a bit better. There are much faster ways to parallelize, but this is a super easy way.

So we create our learner in the usual way. We could use mse_loss here, so that's just going to compare the pixels of the output to the pixels that we expected, and we can run our learning rate finder and we can train it for a while, and here's our input and here's our output, and you can see that what we've managed to do is to train a very advanced residual convolutional network that's learned to blur things.

Why is that? Well because it's what we asked for. We said to minimize mse_loss, an mse_loss between pixels, really the best way to do that is just average the pixels, i.e. to blur it. So that's why pixel_loss is no good. So we want to use our perceptual loss. So let's try perceptual loss.

So with perceptual loss, we're basically going to take our VGG network, and just like we did last week, we're going to find the block index just before we get a max pool. So here are the ends of each block of the same grid size, and if we just print them out, as we'd expect, every one of those is a ReLU module.

And so in this case, these last two blocks are less interesting to us. The grid size there is small enough, coarse enough that it's not as useful for super resolution, so we're just going to use the first three. And so just to save unnecessary computation, we're just going to use those first 23 layers for VGG, we'll throw away the rest, we'll stick it on the GPU, we're not going to be training this VGG model at all, we're just using it to compare activations.

So we'll stick it in eval mode, and we will set it to not trainable. Just like last week, we'll use a save_features class to do a forward hook, which saves the output activations at each of those layers. And so now we've got everything we need to create our perceptual loss, or as I call it here, feature_loss_plus.

And so we're going to pass in a list of layer IDs, the layers where we want the content loss to be calculated, an array of weights, a list of weights for each of those layers. So we can just go through each of those layer IDs and create an object which has got the hook function, forward hook function to store the activations.

And so in our forward, then we can just go ahead and call the forward pass of our model with the target, so the target is the high res image we're trying to create. And so the reason we do that is because that's going to then call that hook function and store in self.save_features the activations we want.

Now we're going to need to do that for our ConvNet output as well. So we need to clone these, because otherwise the ConvNet output is going to go ahead and just clobber what we already had. So now we can do the same thing for the ConvNet output, which is the input to the loss function.

And so now we've got those two things, we can zip them all together along with the weights. So we've got inputs, targets, weights, and then we can do the L1 loss between the inputs and the targets and multiply by the layer weights. The only other thing I do is I also grab the pixel loss, but I weight it down quite a bit.
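Here's a sketch of that perceptual loss in plain PyTorch; the layer indices and weights in the usage lines are placeholders, and the notebook's version differs in its details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class SaveFeatures:
    def __init__(self, module):
        self.hook = module.register_forward_hook(self.hook_fn)
    def hook_fn(self, module, inp, outp):
        self.features = outp
    def remove(self):
        self.hook.remove()

class FeatureLoss(nn.Module):
    """L1 between VGG activations at the chosen layers, plus a small pixel L1 term."""
    def __init__(self, vgg, layer_ids, layer_wgts):
        super().__init__()
        self.vgg = vgg.eval()
        for p in self.vgg.parameters():
            p.requires_grad = False        # we never train VGG, we only read its activations
        self.wgts = layer_wgts
        self.hooks = [SaveFeatures(vgg[i]) for i in layer_ids]

    def forward(self, input, target):
        with torch.no_grad():
            self.vgg(target)               # hooks now hold the target's activations
        targ_feats = [h.features.clone() for h in self.hooks]
        self.vgg(input)                    # hooks now hold the ConvNet output's activations
        loss = F.l1_loss(input, target) / 100   # pixel loss, weighted down a lot
        for h, t, w in zip(self.hooks, targ_feats, self.wgts):
            loss = loss + F.l1_loss(h.features, t) * w
        return loss

# usage: first 23 layers of VGG16 with batch norm; indices and weights are illustrative
vgg = nn.Sequential(*list(models.vgg16_bn(pretrained=True).features)[:23])
crit = FeatureLoss(vgg, layer_ids=[5, 12, 22], layer_wgts=[0.2, 0.7, 0.1])
```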

And most people don't do this, I haven't seen papers that do this, but in my opinion it's maybe a little bit better because you've got the perceptual content loss activation stuff, but at the finest level it also cares about the individual pixels. So that's our loss function, we create our super resolution ResNet, telling it how much to scale up by.

And then we're going to do our ICNR initialization of that pixel shuffle convolution. So this is very, very boring code, I actually stole it from somebody else. Literally all it does is just say, okay, you've got some weight tensor x that you want to initialize, so we're going to treat it as if it had the number of features divided by scale squared features; in practice scale squared might be 2 squared, i.e. 4, because we actually want to keep one set of them and then copy them 4 times.

So we divide it by 4, and we create something of that size, and we initialize that with a default Kaiming normal initialization, and then we just make scale squared copies of it. And the rest of it is just moving axes around a little bit. So that's going to return a new weight matrix where each initialized subkernel is repeated R squared, or scale squared, times.
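Here's a simplified version of the same idea; the notebook's helper does this with transposes and views, whereas this sketch uses repeat_interleave, and the conv layer at the end is just an illustration of applying it:

```python
import torch
import torch.nn as nn

def icnr(weight, scale=2, init=nn.init.kaiming_normal_):
    """Return a tensor shaped like `weight` (nf*scale^2, ni, k, k) in which each
    consecutive group of scale^2 output filters is an identical copy of one
    randomly initialized sub-kernel, so the pixel-shuffled output starts out smooth."""
    nf, ni, h, w = weight.shape
    sub = torch.zeros(nf // (scale ** 2), ni, h, w)
    init(sub)
    return sub.repeat_interleave(scale ** 2, dim=0)

# apply it to the conv that feeds the PixelShuffle (names are illustrative)
conv_shuf = nn.Conv2d(64, 64 * 4, 3, padding=1)
conv_shuf.weight.data.copy_(icnr(conv_shuf.weight.data, scale=2))
```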

So the details don't matter very much; all that matters here is that I just looked through to find what was the actual layer, the conv layer just before the pixel shuffle, and stored it away, and then I called ICNR on its weight matrix to get my new weight matrix, and then I copied that new weight matrix back into that layer.

So as you can see, I went to quite a lot of trouble in this exercise to really try to implement all the best practices, and I kind of tend to do things a bit one extreme or the other. I show you a really hacky version that only slightly works, or I go to the nth degree to make it work really well.

So this is a version where I'm claiming that this is pretty much a state-of-the-art implementation, it's a competition-winning approach, and the reason I'm doing that is because I think this is one of those rare papers where they actually get a lot of the details right, and I kind of want you to get a feel of what it feels like to get all the details right.

And remember, getting the details right is the difference between this hideous blurry mess and this really pretty exquisite result. So we're going to have to do data parallel on that again, we're going to set our criterion to be feature loss using our VGG model, grab the first few blocks, and these are sets of layer weights that I found worked pretty well; do a learning rate finder, fit it for a while, and I fiddled around for a little while trying to get some of these details right.

But here's my favorite part, what happens next, now that we've done it for scale=2: progressive resizing. So progressive resizing is the trick that let us get the best single computer result for ImageNet training on DAWNBench. The idea is to start small and gradually make the images bigger. Two papers have used this idea: one is the progressive growing of GANs paper, which allows training of very high-resolution GANs, and the other one is the EDSR paper.

And the cool thing about progressive resizing is not only are your earlier epochs four times faster (assuming your images are 2x smaller on each side), and you can also make the batch size maybe three or four times bigger, but more importantly, they're going to generalize better because you're feeding your model different sized images during training.

So we were able to train like half as many epochs for ImageNet as most people. So our epochs were faster and there were fewer of them. So progressive resizing is something that, particularly if you're training from scratch, I'm not so sure if it's useful for fine-tuning transfer learning, but if you're training from scratch, you probably want to do nearly all the time.

So the next step is to go all the way back to the top and change to scale=4 and batch size 32, basically restart, so I save the model before I do that and go back. And that's why there's a little bit of fussing around in here with reloading, because what I needed to do now is I needed to load my saved model back in, but there's a slight issue, which is I now have one more up-sampling layer than I used to have.

To go from 2x scale to 4x scale, my little loop here is now looping through twice, not once, and therefore it's added an extra conv and an extra pixel shuffle. So how am I going to load weights into a different network? And the answer is that I use a very handy thing in PyTorch, which is load_state_dict; this is basically what learn.load calls behind the scenes.

If I pass in the parameter strict=False, then it says if you can't fill in all of the layers, just fill in the layers you can. So after loading the model back in this way, we're going to end up with something where it's loaded in all the layers that it can, and that one conv layer that's new is going to be randomly initialized.
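Concretely, it's something like this; the checkpoint name is just an example and SrResnet is the sketch from earlier:

```python
import torch

model_x4 = SrResnet(64, scale=4)
state = torch.load('sr_resnet_x2.pt')          # weights saved from the scale=2 run
# fills in every layer whose name matches; the newly added upsampling conv
# simply keeps its random (and then ICNR) initialization
model_x4.load_state_dict(state, strict=False)
```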

And so then I freeze all my layers, and then unfreeze that up-sampling part, and then use ICNR on my newly added extra layer, and then I can go ahead and train again. And so then the rest is the same. So if you're trying to replicate this, don't just run this top to bottom, realize it involves a bit of jumping around.

The longer you train, the better it gets. I ended up training it for about 10 hours, but you'll still get very good results much more quickly if you're less patient. And so we can try it out, and here is the result. Here is my pixelated bird, and look here, it's like totally randomly pixels.

And here's the up-sampled version, it's like it's literally invented coloration. But it figured out what kind of bird it is, and it knows what these feathers are meant to look like. And so it has imagined a set of feathers which are compatible with these exact pixels, which is like genius.

Like same here, there's no way you can tell what these blue dots are meant to represent, but if you know that this kind of bird has an array of feathers here, you know that's what they must be. And then you can figure out where the feathers would have to be such that when they were pixelated they'd end up in these spots.

So it's like literally reverse engineered, given its knowledge of this exact species of bird, how it would have to have looked to create this output. And so this is like so amazing. It also knows from all the kind of signs around it that this area here was almost certainly blurred out.

So it's actually reconstructed blurred vegetation. And if it hadn't done all of those things, it wouldn't have got such a good loss function. Because in the end, it had to match the activations saying like there's a feather over here and it's kind of fluffy looking and it's in this direction and all that.

Alright, well that brings us to the end of super resolution. Don't forget to check out the Ask Jeremy Anything thread and we will do some Ask Jeremy Anything after the break. Let's see you back here at quarter to eight. Okay. So we are going to do Ask Jeremy Anything, Rachel will tell me the most voted up of your questions.

Yes, Rachel. What are the future plans for Fast AI in this course? Will there be a part three? If there is a part three, I would really love to take it. That's cool. I'm not quite sure, it's always hard to guess. I hope there will be some kind of follow-up.

Last year after part two, one of the students started up a weekly book club going through the Ian Goodfellow deep learning book and Ian actually came in and presented quite a few of the chapters and other people, like there was somebody, an expert, who presented every chapter. That was like a really cool part three.

To a large extent it will depend on you, the community, to come up with ideas and to help make them happen and I'm definitely keen to help. I've got a bunch of ideas, but I'm nervous about saying them because I'm not sure which ones will happen and which ones won't, but the more support I have in making things happen that you want to happen from you, the more likely they are to happen.

What was your experience like starting down the path of entrepreneurship? Have you always been an entrepreneur or did you start out at a big company and transition to a start-up? Did you go from academia to start-ups or start-ups to academia? I was definitely not in academia, I'm totally a fake academic.

I started at McKinsey & Company, which is a strategy firm, when I was 18, which meant I couldn't really go to university, so I didn't really turn up, and then I spent eight years in business helping really big companies on strategic questions. I always wanted to be an entrepreneur and planned to only spend two years at McKinsey; the only thing I really regret in my life was not sticking to that plan and wasting eight years instead.

So two years would have been perfect, but then I went into entrepreneurship and started two companies in Australia. The best part about that was that I didn't get any funding, so all the money that I made was mine, and all the decisions were mine and my partner's. I focused entirely on profit and product and customers and service. Whereas in San Francisco, and I'm glad I came here, the two of us, Anthony and I, came here for Kaggle and raised a ridiculous amount of money, $11 million, for a really new company.

That was really interesting, but it's also really distracting, trying to worry about scaling and VCs wanting to see what your business development plans are, and also just not having any real need to actually make a profit. So I had a bit of the same problem at Enlitic, where I again raised a lot of money, $15 million pretty quickly, and a lot of distractions.

So I think trying to bootstrap your own company and focusing on making money by selling something at a profit, and then plowing that back into the company, worked really well, because we were making a profit from three months in, and within five years we were making enough of a profit not just to pay all of us our own wages but also to see my bank account growing, and after ten years I sold it for a big chunk of money, not enough that a VC would be excited, but enough that I didn't have to worry about money again.

So I think people in the Bay Area, at least, don't seem to appreciate how good an idea bootstrapping a company is. If you were 25 years old today and knew what you know now, where would you be looking to use AI? What are you working on right now or looking to work on in the next two years?

You should ignore the last part of that, I won't even answer it, it doesn't matter where I'm looking, what you should do is leverage your knowledge about your domain. So one of the main reasons we do this is to get people who have backgrounds in whatever, recruiting, oil field surveys, journalism, activism, whatever, and solve your problems.

It will be really obvious to you what your problems are, and it will be really obvious to you what data you have and where to find it. Those are all the bits that are really hard for everybody else, so people who start out with "Oh, I know deep learning, now I'll go and find something to apply it to" basically never succeed, whereas people who are like "Oh, I've been spending 25 years doing specialized recruiting for legal firms, and I know that the key issue is this thing, and I know that this piece of data totally solves it, and so I'm just going to do that now, and I already know who to call to actually start selling it to", they're the ones who tend to win.

So if you've done nothing but academic stuff then it's more about your hobbies and interests, so everybody has hobbies. The main thing I would say is please don't focus on building tools for data scientists to use or for software engineers to use because every data scientist knows about the market of data scientists, whereas only you know about the market for analyzing oil survey well logs or understanding audiology studies or whatever it is that you do.

Given what you've shown us about applying transfer learning from image recognition to NLP, there looks to be a lot of value in paying attention to all of the developments that happen across the whole machine learning field and that if you were to focus in one area you might miss out on some great advances in other concentrations.

How do you stay aware of all the advancements across the field while still having time to dig in deep to your specific domains? Yeah, that's awesome; I mean, that's kind of the message of this course. One of the key messages of this course is that lots of good work is being done in different places, and people are so specialized that most people don't know about it. Like, if I can get state-of-the-art results in NLP within six months of starting to look at NLP, then I think that says more about NLP than it does about me.

So yeah it's kind of like the entrepreneurship thing, it's like you pick the areas that you see that you know about and kind of transfer stuff like oh we could use deep learning to solve this problem or in this case like we could use this idea of computer vision to solve that problem.

So things like transfer learning, I'm sure there are like a thousand opportunities for you, in other fields, to do what Sebastian and I did with NLP classification. So the short answer to your question is that the way to stay on top of what's going on would be to follow my feed of Twitter favorites; my approach is to follow lots and lots of people on Twitter and put the interesting things into my Twitter favorites for you.

Every time I come across something interesting I click favorite and there are two reasons I do it, the first is that when the next course comes along I go through my favorites to find which things I want to study and the second is so that you can do the same thing.

And then which do you go deep into, it almost doesn't matter, like I find every time I look at something it turns out to be super interesting and important. So just pick something which is like, you feel like solving that problem would be actually useful for some reason and it doesn't seem to be very popular, which is kind of the opposite of what everybody else does, everybody else works on the problems which everybody else is already working on because they're the ones that seem popular and I don't know.

I can't quite understand this kind of thinking but it seems to be very common. Is deep learning an overkill to use on tabular data? When is it better to use deep learning instead of machine learning on tabular data? Is that a real question or did you just put that there so that I would point out that Rachel Thomas just wrote an article?

Yes, so Rachel's just written about this and Rachel and I spent a long time talking about it and the short answer is we think it's great to use deep learning on tabular data. Actually of all the rich, complex, important and interesting things that appear in Rachel's Twitter stream covering everything from the genocide of the Rohingya through to the latest ethics violations in AI companies, the one by far that got the most attention and engagement from the community was her question about is it called tabular data or structured data.

Ask computer people how to name things and you'll get plenty of interest. There are some really good links here to stuff from Instacart and Pinterest and other folks who have done some good work in this area. Many of you that went to the Data Institute conference will have seen Jeremy Stanley's presentation about the really cool work they did at Instacart.

Yes, Rachel? I relied heavily on lessons three and four from part one in writing this post, so much of it may be familiar to you. Rachel asked me while writing the post how to tell whether you should use a decision tree ensemble like a GBM or random forest, or a neural net, and my answer is I still don't know.

Nobody I'm aware of has done that research in any particularly meaningful way, so there's a question to be answered there. I guess my approach has been to try to make both of those things as accessible as possible through the fastAI library so you can try them both and see what works.

That was it for the top three questions. Actually, before we quickly go from super resolution to style transfer, I think I missed the one on reinforcement learning. Reinforcement learning's popularity has been on a gradual rise in the recent past. What's your take on reinforcement learning? Would fastAI consider covering some ground on popular RL techniques in the future?

I'm still not a believer in reinforcement learning. I think it's an interesting problem to solve, but it's not at all clear that we have a good way of solving this problem. The problem really is the delayed credit problem. I want to learn to play Pong; I move up or down, and three minutes later I find out whether I won the game of Pong, so which of the actions I took were actually useful?

To me, with the idea of calculating the gradients of the output with respect to those inputs, the credit is so delayed that those derivatives don't seem very meaningful. I get this question quite regularly; it's come up in every one of these four courses so far, and I've always said the same thing. I'm rather pleased that recently there have finally been some results showing that basically random search often does better than reinforcement learning.

Basically what's happened is very well-funded companies with vast amounts of computational power throw all of it at reinforcement learning problems and get good results and people then say it's because of the reinforcement learning rather than the vast amounts of compute power. Or they use extremely thoughtful and clever algorithms like a combination of convolutional neural nets and Monte Carlo tree search like they did with the AlphaGo stuff to get great results and people incorrectly say that's because of reinforcement learning but it wasn't really reinforcement learning at all.

I'm very interested in solving these kind of more generic optimization type problems rather than just prediction problems and that's what these delayed credit problems look like. But I don't think we've yet got good enough best practices that I have anything I'm ready to teach and say like I'm going to teach you this thing because I think it's still going to be useful next year.

So we'll keep watching and see what happens. So we're going to now turn the super resolution network basically into a style transfer network and we'll do this pretty quickly. We basically already have something, so here's my input image and I'm going to have some loss function and I've got some neural net again.

So instead of a neural net that does a whole lot of compute and then does upsampling at the end, our input this time is just as big as our output, so we're going to do some downsampling first, then our compute, then our upsampling. So the first change we're going to make is to add some downsampling, that is, some stride-2 convolution layers, to the front of our network.

The second change is what we compare the output to. We're basically going to say the output image should still look like the input image by the end; specifically, we're going to compare them by chucking both through VGG and comparing the activations at one of the layers.

And then its style should look like some painting, which we'll do just like we did with the Gatys approach, by looking at the Gram matrix correspondence at a number of layers. So that's basically it, and so that ought to be super straightforward; it's really just combining two things we've already done.

And so all this code at the start is identical, except we don't have high res and low res, we just have one size 256, all this is the same, my model's the same. One thing I did here is I did not do any kind of fancy best practices for this one at all, partly because there doesn't seem to be any, like there's been very little follow-up in this approach compared to the super resolution stuff, and we'll talk about why in a moment.

So you'll see this is much more normal looking: I've got batch norm layers, I don't have the scaling factor here, I don't have a pixel shuffle, it's just using a normal upsampling followed by a one-by-one conv, blah blah blah, so it's just more normal. One thing they mentioned in the paper is that they had a lot of problems with zero padding creating artifacts. The way they solved that was by adding 40 pixels of reflection padding at the start, so I did the same thing, and then they used no padding at all in the convolutions in their res blocks.

Now if you've got no padding in the convolutions in your res blocks, then the two parts of your ResNet won't add up anymore, because you've lost a pixel from each side on each of your two convolutions. So my ResSequential has become ResSequentialCenter, and it removes the outer two pixels from each side of the identity path so the shapes match up.
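For reference, here's a minimal sketch of that kind of center-cropped res block in PyTorch. The names (conv_block, ResBlockCenter) are made up for illustration; it's the idea rather than the exact notebook code:

```python
import torch.nn as nn

def conv_block(ni, nf):
    # 3x3 conv with no padding, so each conv trims one pixel from every side
    return nn.Sequential(nn.Conv2d(ni, nf, 3, padding=0),
                         nn.BatchNorm2d(nf), nn.ReLU(inplace=True))

class ResBlockCenter(nn.Module):
    """Residual block whose identity path is center-cropped by 2 pixels per side,
    so it lines up with the output of two unpadded 3x3 convolutions."""
    def __init__(self, nf):
        super().__init__()
        self.convs = nn.Sequential(conv_block(nf, nf), conv_block(nf, nf))

    def forward(self, x):
        return self.convs(x) + x[:, :, 2:-2, 2:-2]
```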

So other than that, this is basically the same as what we had before. So then we can bring in our starry_night_picture, we can resize it, we can throw it through our transformations. Just to make the method a little bit easier for my brain to handle, I took my transform style image, which after transformations is 3x256x256, and I made a mini-batch.

My batch size is 24, 24 copies of it. That just makes it a little bit easier to do the batch arithmetic without worrying about some of the broadcasting, they're not really 24 copies, I used np.broadcast to basically fake 24 copies. So just like before, we create our VGG, grab the last block, this time we're going to use all of these layers so we keep everything up to the 43rd layer.
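As an aside, that broadcasting trick is roughly this (assuming the function meant is np.broadcast_to, which creates a view rather than real copies):

```python
import numpy as np

style = np.random.rand(3, 256, 256).astype(np.float32)   # stand-in for the transformed style image
# a read-only view that behaves like 24 copies without allocating 24x the memory
style_batch = np.broadcast_to(style, (24, 3, 256, 256))
print(style_batch.shape)   # (24, 3, 256, 256)
```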

And so now our combined loss is going to add together a content loss for the 3rd block plus the gram loss for all of our blocks with different weights. And so the gram loss, and again, going back to everything being as normal as possible, I've gone back to using MSE here.

Basically what happened is I had a lot of trouble getting this to train properly, so I gradually removed trick after trick and eventually just went okay, I'm just going to make it as bland as possible. Last week's gram matrix was wrong, by the way, it only worked for a batch size of 1, and we only had a batch size of 1, so that was fine.

I was using matrix multiply, which meant that every batch was being compared to every other batch. You actually need to use batch matrix multiply, which does a matrix multiply per batch. So that's something to be aware of. So I've got my Gram matrices, I do my MSE loss between the Gram matrices, and I weight them by my style weights. Then I create my combined loss, passing in the VGG network, the block IDs, and the transformed Starry Night image, and you'll see at the very start here I do a forward pass through my VGG model with that Starry Night image in order that I can save the features for it.
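Here's a minimal sketch of a Gram matrix computed with batch matrix multiply, plus an MSE loss between Gram matrices; the function names are illustrative:

```python
import torch
import torch.nn.functional as F

def gram_matrix(x):
    # x: activations of shape (batch, channels, height, width)
    b, c, h, w = x.size()
    feats = x.view(b, c, h * w)
    # bmm does a separate matrix multiply per batch item, so batches are
    # never mixed with each other (the bug a plain matmul introduces)
    g = torch.bmm(feats, feats.transpose(1, 2))
    return g / (c * h * w)

def gram_loss(input_acts, target_acts):
    return F.mse_loss(gram_matrix(input_acts), gram_matrix(target_acts))
```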

Now notice it's really important that I don't do any data augmentation, because I've saved the style features for a particular non-augmented version; if I augmented the input, it might cause some minor problems. But that's fine, because I've got all of ImageNet to train on, so I don't really need to do data augmentation anyway.

Okay, so I've got my loss function and I can go ahead and fit, and there's really nothing clever here at all, at the end I have my sumLayers equals false so I can see what each part looks like and see that they're reasonably balanced, and I can finally pop it out.

So I mentioned that this should be pretty easy, and yet it took me about four days, because I just found it incredibly fiddly to actually get it to work. When I finally got up in the morning and said to Rachel, guess what, it trained correctly, Rachel was like, I never thought that was going to happen.

It just looked awful all the time, and it was really about getting the exact right mix of content loss versus style loss, the mix of the layers of the style loss, and the worst part was it takes a really long time to train the damn CNN, and I didn't really know how long to train it before I decided it wasn't doing well, like should I just train it for longer or what?

And I don't know, changing all the little details never seemed to just slightly change the result; it would totally fall apart all the time. So I mention this partly to say: just remember that the final answer you see here is after me driving myself crazy all week, with it nearly always not working until finally, at the last minute, it does, even for things which just seem like they couldn't possibly be difficult because they're just combining two things we already have working.

The other is to be careful about how we interpret what authors claim. It was so fiddly getting this style transfer to work, and after doing it, it left me thinking, why did I bother? Because now I've got something that takes hours to create a network that can turn any kind of photo into one specific style.

It just seems very unlikely I would want that for anything; the only use I could think of would be some art project on a video, turning every frame into some style. It's an incredibly niche thing to do. But when I looked at the paper, the table was saying we're a thousand times faster than the Gatys approach, which is just such an obviously meaningless thing to say, and such an incredibly misleading thing to say, because it ignores all the hours of training for each individual style.

I find this frustrating because groups like this Stanford group clearly know better, or ought to know better, but still I guess the academic community kind of encourages people to make these ridiculously grand claims. It also completely ignores this incredibly sensitive, fiddly training process. This paper was just so well-accepted when it came out.

I remember everybody getting on Twitter and being like, wow, these Stanford people have found this way of doing style transfer a thousand times faster. And clearly, the people saying this were like all top researchers in the field, but clearly none of them actually understood it because nobody said, you know, I don't see why this is remotely useful and also I tried it and it was incredibly fiddly to get it all to work.

And so it's not until, what is this now, like 18 months later or something, that I'm finally coming back to it and kind of thinking, wait a minute, this is kind of stupid. So this is the answer, I think, to the question of why people haven't done follow-ups on this to create really amazing best practices and better approaches, like they did with the super resolution part of the paper.

And I think the answer is because it's done. So I think this part of the paper is clearly not done, you know, and it's been improved and improved and improved and now we have great super resolution and I think we can derive from that great noise reduction, great colorization, great, you know, slant removal, great interactive artifact removal, whatever else.

So I think there's a lot of really cool techniques here. It's also leveraging a lot of stuff that we've been learning and getting better and better at. Okay, so then finally let's talk about segmentation. This is from the famous CAMVID dataset which is a classic example of an academic segmentation dataset.

And basically you can see what we do is we start with a picture (there are actually video frames in this dataset, like this one), and we have some labels. They're not actually colors: each class has an ID, and the IDs are mapped to colors, so red might be one, purple might be two, pink might be three.

And so all the buildings, you know, one class or the cars or another class, all the people or another class, all the road is another class. And so what we're actually doing here is multi-class classification for every pixel, okay? And so you can see sometimes that multi-class classification really is quite tricky, you know, like these branches.

Although sometimes the labels are really not that great, you know, this is very coarse, as you can see. So here are traffic lights and so forth. So that's what we're going to do. We're going to do, this is segmentation. And so it's a lot like bounding boxes, right? But rather than just finding a box around each thing, we're actually going to label every single pixel with its class.

And really that's actually a lot easier because it fits our CNN style so nicely that we basically, we can create any CNN where the output is an n by m grid containing the integers from 0 to c where there are c categories, and then we can use cross-entropy loss with a softmax activation and we're done.
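To make that concrete, here's what per-pixel cross-entropy looks like in PyTorch; the batch size, image size and class count are just illustrative:

```python
import torch
import torch.nn.functional as F

c = 32                                         # number of classes (illustrative)
logits = torch.randn(8, c, 128, 128)           # model output: batch x classes x H x W
targets = torch.randint(0, c, (8, 128, 128))   # one integer class ID per pixel

# cross_entropy applies log-softmax over the class dimension and averages
# the negative log-likelihood over every pixel in the batch
loss = F.cross_entropy(logits, targets)
```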

So I could actually stop the class there and you can go and use exactly the approaches you've learned in like lessons 1 and 2 and you'll get a perfectly okay result. So the first thing to say is like this is not actually a terribly hard thing to do, but we're going to try and do it really well.

And so let's start by doing it the really simple way. And we're going to use the Kaggle Carvana competition, so you Google Kaggle Carvana to find it. You can download it with the Kaggle API as per usual. And basically there's a train folder containing a bunch of images which is the independent variable and a train_masks folder that contains the dependent variable and they look like this.

Here's one of the independent variable and here's one of the dependent variable. So in this case, just like cats and dogs, we're going simple. Rather than doing multi-class classification, we're going to do binary classification, but of course multi-class is just the more general version, you know, categorical cross-entropy or binary cross-entropy.

So there's no difference conceptually. So the dependent variable is just zeros and ones, whereas the independent variable is a regular image. So in order to do this well, it would really help to know what cars look like, because really what we want to do is figure out that this is the car and this is its orientation, and then put white pixels where we expect the car to be, based on the picture and our understanding of what cars look like.

The original data set came with these CSV files as well. I don't really use them for very much other than getting a list of images from them. Each image after the car ID has a 01, 02, et cetera of which I've printed out all 16 of them for one car and as you can see basically those numbers are the 16 orientations of one car.

So there that is. I don't think anybody in this competition actually used this orientation information; I believe they all just kept the car images and treated each one separately. These images are pretty big, like over 1,000 by 1,000 in size, and just opening the JPEGs and resizing them is slow. So I processed them all.

Also OpenCV can't handle GIF files, so I converted them. Yes, Rachel? Question: how would somebody get these masks for training initially, Mechanical Turk or something? Yeah, just a lot of boring work. Probably some tools that help you with a bit of edge snapping and stuff, so that the human can do it roughly and then just fine-tune the bits it gets wrong.

These kinds of labels are expensive. One of the things I really want to work on is deep learning enhanced interactive labeling tools because that's clearly something that would help a lot of people. I've got a little section here that you can run if you want to. You probably want to, which converts the GIFs into PNGs.

So just open it up with PIL and then save it as PNG, because OpenCV doesn't have GIF support. And as per usual for this kind of stuff, I do it with a thread pool so I can take advantage of parallel processing, and then also create separate directories, train-128 and train-masks-128, which contain the 128x128 resized versions.
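A minimal sketch of that preprocessing step might look like this; the directory layout and worker count are assumptions, not the exact notebook code:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from PIL import Image

PATH = Path('data/carvana')                    # hypothetical layout
(PATH/'train-128').mkdir(exist_ok=True)
(PATH/'train-masks-128').mkdir(exist_ok=True)

def resize_img(fname):
    Image.open(PATH/'train'/fname).resize((128, 128)).save(PATH/'train-128'/fname)

def convert_mask(fname):
    # open the GIF mask with PIL, shrink it, and save it as PNG so OpenCV can read it
    img = Image.open(PATH/'train_masks'/fname).resize((128, 128))
    img.save((PATH/'train-masks-128'/fname).with_suffix('.png'))

imgs  = [f.name for f in (PATH/'train').iterdir()]
masks = [f.name for f in (PATH/'train_masks').iterdir()]
with ThreadPoolExecutor(8) as ex:              # thread pool for parallel I/O
    list(ex.map(resize_img, imgs))             # list() forces any errors to surface
    list(ex.map(convert_mask, masks))
```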

And this is the kind of stuff that keeps you sane if you do it early in the process. So anytime you get a new data set, seriously think about creating a smaller version to make life fast. Anytime you find yourself waiting on your computer, try and think of a way to create a smaller version.

So after you grab it from Kaggle you probably want to run this stuff, go away, have lunch, come back, and when you're done you'll have these smaller directories which we're going to use here, 128x128 pixel versions to start with. So here's a cool trick, if you use the same axis object to plot an image twice, and the second time you use alpha, which as you might know means transparency in the computer vision world, then you can actually plot the mask over the top of the photo.
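That overlay trick is just two imshow calls on the same axes, for example (the file names here are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

img  = np.array(Image.open('data/carvana/train-128/some_car_01.jpg'))
mask = np.array(Image.open('data/carvana/train-masks-128/some_car_01.png'))

fig, ax = plt.subplots()
ax.imshow(img)              # first pass: the photo
ax.imshow(mask, alpha=0.4)  # second pass on the same axes: the semi-transparent mask
ax.axis('off')
plt.show()
```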

And so here's a nice way to see all the masks on top of the photos for all of the cars in one group. This is the same matched-files dataset we've seen twice already, and this is all the same code we've used before. But here's something important: if one image of a car was in the training set and another image of that same car was in the validation set, that would kind of be cheating, because it's the same car.

So we use a contiguous set of car IDs, and since each set is a set of 16, we make sure it's evenly divisible by 16, so we make sure that our validation set contains different car IDs to our training set. This is the kind of stuff which you've got to be careful of.

On Kaggle it's not so bad, you'll know about it because you'll submit your result and you'll get a very different result on your leaderboard compared to your validation set, but in the real world you won't know until you put it in production and send your company bankrupt and lose your job, so you might want to think carefully about your validation set.
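Here's a sketch of that kind of split, assuming the sorted file list keeps each car's 16 orientations together (paths as in the earlier sketch):

```python
from pathlib import Path

PATH = Path('data/carvana')                                  # hypothetical layout
fnames = sorted(p.name for p in (PATH/'train-128').iterdir())
n_cars = len(fnames) // 16                                   # 16 orientations per car
val_cars = max(1, n_cars // 5)                               # hold out roughly 20% of the cars
val_idxs = list(range(val_cars * 16))                        # one contiguous block of whole cars
trn_idxs = list(range(val_cars * 16, len(fnames)))
```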

So here we're going to use transform_type.classification, it's basically the same as transform_type.pixel, but if you think about it, with the pixel version if we rotate a little bit, then we probably want to average the pixels in between the two, but for classification obviously we don't, we use nearest_neighbor, so there's a slight difference there.

Also for classification, lighting doesn't kick in, and normalization doesn't kick in, for the dependent variable. They're already square images, so we don't have to do any cropping.
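You can see why nearest neighbor matters for masks with a quick experiment (a toy example, not fastai code): interpolating a label image invents values that aren't valid class labels, while nearest neighbor keeps them intact.

```python
import numpy as np
from PIL import Image

mask = (np.random.rand(128, 128) > 0.5).astype(np.uint8) * 255   # toy binary mask
m = Image.fromarray(mask)

bilinear = np.array(m.resize((64, 64), Image.BILINEAR))
nearest  = np.array(m.resize((64, 64), Image.NEAREST))

print(np.unique(nearest))     # still only {0, 255}: valid labels
print(np.unique(bilinear))    # typically many in-between values: not valid labels
```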

I get a lot of questions during our study group and stuff about how do I debug things and fix things that aren't working, and I never have a great answer other than every time I fix a problem it's because of stuff like this that I do all the time.

I just always print out everything as I go and then the one thing that I screw up always turns out to be the one thing that I forgot to check along the way. The more of this kind of thing you can do the better. If you're not looking at all of your intermediate results you're going to have troubles.

So given that we want something that knows what cars look like, we probably want to start with a pre-trained ImageNet network. So we're going to start with ResNet34 and so with ConvNetBuilder we can grab our ResNet34 and we can add a custom head. And so the custom head is going to be something that upsamples a bunch of times.

And we're going to do things really dumb for now. We're just going to do conv transpose 2D, batch norm, ReLU. This is what I'm saying: any of you could have built this without looking at this notebook at all, or at least you have the information from previous classes. There's nothing new at all.

And so at the very end we have a single filter. And now that's going to give us something which is batch size by 1, by 128, by 128. But we want something which is batch size by 128 by 128. So we have to remove that unit axis. So I've got a lambda layer here.

Lambda layers are incredibly helpful, because without the lambda layer here, which is simply removing that unit axis by just indexing into it at zero, without the lambda layer I would have to have created a custom class with a custom forward method and so forth. But by creating a lambda layer that does like the one custom bit, I can now just chuck it in the sequential.

And so that just makes life easier. So the PyTorch people are kind of snooty about this approach. Lambda layer is actually something that's part of the fast AI library, not part of the PyTorch library. And literally people on the PyTorch discussion board are like, yes, we could give people this, yes, it is only a single line of code, but then it would encourage them to use sequential too often.

So there you go. So this is our custom head. So we're going to have a ResNet34 that downsamples, and then a really simple custom head that very quickly upsamples, and hopefully that will do something. And we're going to use accuracy with a threshold of 0.5 to print out metrics.
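In rough PyTorch terms, the custom head plus the thresholded accuracy metric could look like the sketch below. It follows the description above rather than reproducing the notebook exactly; for a 128x128 input, the ResNet34 backbone ends at 4x4 with 512 channels, so five 2x upsamples get back to 128x128.

```python
import torch
import torch.nn as nn

class Lambda(nn.Module):
    # tiny wrapper so an arbitrary function can live inside nn.Sequential
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def forward(self, x):
        return self.fn(x)

def upsample_block(ni, nf):
    # ConvTranspose2d with kernel 2, stride 2 exactly doubles the spatial size
    return nn.Sequential(nn.ConvTranspose2d(ni, nf, 2, stride=2),
                         nn.BatchNorm2d(nf), nn.ReLU(inplace=True))

simple_head = nn.Sequential(
    nn.ReLU(),
    upsample_block(512, 256),
    upsample_block(256, 256),
    upsample_block(256, 256),
    upsample_block(256, 256),
    nn.ConvTranspose2d(256, 1, 2, stride=2),  # fifth 2x upsample, down to one mask channel
    Lambda(lambda x: x[:, 0]),                # (bs, 1, 128, 128) -> (bs, 128, 128)
)

def acc_thresh(preds, targs, thresh=0.5):
    # preds are raw logits; a pixel counts as "car" if sigmoid(pred) > thresh
    return ((torch.sigmoid(preds) > thresh).float() == targs).float().mean()
```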

And so after a few epochs we've got 96% accuracy. So is that good? Is 96% accuracy good? Hopefully the answer to your question is: it depends. What's it for? And the answer is, Carvana wanted this because they wanted to be able to take their car images and cut them out and paste them on exotic Monte Carlo backgrounds or whatever.

That's Monte Carlo the place, not the simulation. So to do that, you need a really good mask. You don't want to leave the rearview mirrors behind or have one wheel missing or include background or something that would look stupid. So you would need something very good. So only having 96% of the pixels correct doesn't sound great, but we won't really know until we look at it.

So let's look at it. So there's the correct version that we want to cut out. That's the 96% accurate version. So when you look at it, you realize, oh yeah, getting 96% of the pixels accurate is actually easy because all the outside bits are not car and all the inside bits are car and really the interesting bit is the edge.

So we need to do better. So let's unfreeze, because all we've done so far is train the custom head. And let's do more. And so after a bit more we've got 99.1%. So is that good? I don't know. Let's take a look. And actually no: it's totally missed the rearview mirror here, missed a lot of it here, and it's clearly got an edge wrong here, and these things are totally going to matter when we try to cut it out.

So it's still not good enough. So let's try upscaling. And the nice thing is that when we upscale to 512x512, make sure you decrease the batch size because you'll run out of memory. Here's the true ones. This is all identical. There's quite a lot more information there for it to go on.

So our accuracy increases to 99.4% and things keep getting better. But we've still got quite a few little black blocky bits. So let's go to 1024x1024, down to a batch size of 4. This is pretty high res now. And train a bit more: 99.6, 99.8. And so now if we look at the masks, they're actually looking not bad.

That's looking pretty good. So can we do better? And the answer is yes we can. So we're moving from the Carvana notebook to the Carvana UNet notebook now. And the UNet network is quite magnificent. You see, with that previous approach, our pre-trained ImageNet network was being squished all the way down to a tiny grid, 7x7 for a 224 input, and then expanded all the way back out again, which means it has to somehow store all the information about the much bigger version in that small version.

And actually most of the information about the bigger version was really in the original picture anyway. So it doesn't seem like a great approach, this squishing and unsquishing. So the UNet idea comes from this fantastic paper where it was literally invented in this very domain-specific area of biomedical image segmentation.

But in fact, basically every Kaggle winner in anything even vaguely related to segmentation has ended up using UNet. It's one of these things that like everybody in Kaggle knows is the best practice, but in more of academic circles, like even now, this has been around for a couple of years at least, a lot of people still don't realize.

This is by far the best approach. And here's the basic idea. Here's the downward path, where we basically start at 572x572 in this case and then halve the grid size, halve the grid size, halve the grid size, halve the grid size. And then here's the upward path, where we double the grid size, double, double, double, double.

But the thing that we also do is we take at every point where we've halved the grid size, we actually copy those activations over to the upward path and concatenate them together. And so you can see here these red blobs are max pooling operations, the green blobs are upward sampling, and then these gray bits here are copying.

So we copy and concat. So basically, in other words, the input image after a couple of convolutions is copied over to the output and concatenated together, and so now we get to use all of the information that's gone through all the down and all the up, plus also a slightly modified version of the input pixels, and a slightly modified version of the thing one step down from the input pixels, because they came out through here.

So we have like all of the richness of going all the way down and up, but also like a slightly less coarse version and a slightly less coarse version and then this really kind of simple version and they can all be combined together. And so that's UNET, such a cool idea.

So here we are in the Carvana UNet notebook; all this is the same code as before. And at the start I've got a simple upsample version, just to show you again the non-UNet version. This time I'm going to add in something called the dice metric. Dice is very similar, as you'll see, to Jaccard, or IoU (intersection over union).

It's just a minor difference, it's basically intersection over union with a minor tweak. And the reason we're going to use dice is that's the metric that the Kaggle competition used. And it's a little bit harder to get a high dice score than a high accuracy because it's really looking at like what the overlap of the correct pixels are with your pixels.
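A minimal dice metric, assuming binary masks and raw logits as predictions:

```python
import torch

def dice(preds, targs, thresh=0.5):
    # dice = 2 * |intersection| / (|preds| + |targs|); compare with
    # IoU/Jaccard, which is |intersection| / |union|
    preds = (torch.sigmoid(preds) > thresh).float()
    inter = (preds * targs).sum()
    return 2.0 * inter / (preds.sum() + targs.sum() + 1e-8)   # epsilon avoids 0/0
```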

But it's pretty similar. So in the Kaggle competition, people that were doing okay were getting about 99.6 dice and the winners were about 99.7 dice. So here's our standard upsample, this is all as before. And so now we can check our dice metric, and you can see that with the dice metric we're getting about 96.8 at 128x128.

And so that's not great. So let's try UNet. And I'm calling it UNet-ish because, as per usual, I'm creating my own somewhat hacky version, trying to keep things as similar to what you're used to as possible and doing things that I think make sense. So there should be plenty of opportunity for you to at least make this more authentically UNet by looking at the exact grid sizes.

You can see how, in the paper, the size goes down a little bit at each step, so they're obviously not adding any padding, and then they've got some cropping going on. There are a few differences. But one of the things is that because I want to take advantage of transfer learning, I can't quite use UNet as-is.

So here's another big opportunity is what if you create the UNET downpath and then add a classifier on the end and then train that on ImageNet. And you've now got an ImageNet trained classifier which is specifically designed to be a good backbone for UNET. And then you should be able to now come back and get pretty close to winning this old competition.

Because that pre-trained network didn't exist before. But if you think about what YOLOv3 did, it's basically that: they created DarkNet, they pre-trained it on ImageNet, and then they used it as the basis for their bounding boxes. So again, this idea of pre-training things which are designed not just for classification but for other things is just something that nobody's done yet.

But as we've shown, you can train ImageNet for 25 bucks in 3 hours. So and if people in the community are interested in doing this, hopefully I'll have credits I can help you with as well. So if you do the work to get it set up and give me a script, I can probably run it for you.

So for now though, we don't have that. So we're going to use ResNet. So we're basically going to start with this, let's see, with getBase. And so base is our base network and that was defined back up in this first section. So getBase is going to be something that calls whatever this is and this is ResNet 34.

So we're going to grab our ResNet 34, and cutModel is the first thing that our ConvNet builder does. It basically removes everything from the adaptive pooling onwards, and so that gives us back the backbone of ResNet 34. So getBase is going to give us back our ResNet 34 backbone.

And then we're going to take that ResNet 34 backbone and turn it into a UNet34. So what that's going to do is save that ResNet that we passed in, and then we're going to use forward hooks, just like before, to save the results at the second, fourth, fifth and sixth blocks, which as before is basically before each stride-2 convolution.
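A forward hook of that kind is only a few lines; this is a sketch, and the block indices in the comment are illustrative rather than the notebook's exact values:

```python
class SaveFeatures:
    """Stores the output of a module every time it runs a forward pass."""
    def __init__(self, module):
        self.hook = module.register_forward_hook(self.hook_fn)
    def hook_fn(self, module, inp, outp):
        self.features = outp
    def remove(self):
        self.hook.remove()

# e.g. hook the blocks just before each stride-2 convolution of the cut ResNet34
# backbone; the exact indices depend on how the backbone was cut
# sfs = [SaveFeatures(backbone[i]) for i in (2, 4, 5, 6)]
```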

Then we're going to create a bunch of these things we're calling UNet blocks. These UNet blocks are these things here. For each UNet block we have to tell it how many things are coming from the previous layer that we're upsampling, how many are coming across, and then how many we want coming out.

And so the amount coming across is entirely defined by whatever the base network was; whatever the downward path was, we need that many layers. And so this is a little bit awkward. Actually, one of our master's students here, Karim, has created something called dynamic unet that you'll find in fastai.unet.dynamic_unet.

And it actually calculates this all for you and automatically creates the whole unit from your base model. It's got some minor quirks still that I want to fix. By the time the video is out, it'll definitely be working and I will at least have a notebook showing how to use it and possibly an additional video.

But for now, you'll just have to go through and do it yourself. You can easily see it: once you've got a ResNet, you can just type in its name and it'll print out all the layers, and you can see how many activations there are in each block.

Or you could even have it printed out for you for each block automatically. Anyway, I just did this manually. And so the UNet block works like this. You say: okay, I've got this many coming up from the previous layer, and I've got this many coming across, this x, where across means from the downward path.

This is the amount I want coming out. Now what I do is create a certain number of channels from the upward path and a certain number from the cross path, and concatenate them together. So let's divide the number we want out by 2.

And so we're going to have our cross convolution take the cross path and create number-out-divided-by-two channels. And then the upward path is going to be a conv transpose 2D, because we want to upsample; and again, it produces number-out-divided-by-two channels.

And then at the end, I just concatenate those together. So I've got an upsample, I've got a cross convolution, and I concatenate the two together. And so that's all a UNet block is, and it's actually a pretty easy module to create. And so then in my forward path, I need to pass to the forward of the UNet block both the upward path and the cross path.

So the upward path is just wherever I'm up to so far. But then the cross path is whatever the value is of whatever the activations are that I stored on the way down. So as I come up, it's the last set of saved features that I need first. And as I gradually keep going up further and further and further, eventually it's the first set of features.
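Putting that together, a UNet block in the spirit described above might look like this (a sketch, not the exact notebook code; fastai's real version differs in details like where the batch norm sits):

```python
import torch
import torch.nn as nn

class UnetBlock(nn.Module):
    """n_up channels arrive from the layer below, n_x channels come across from
    the downward path, and n_out channels leave: half from each branch, concatenated."""
    def __init__(self, n_up, n_x, n_out):
        super().__init__()
        half = n_out // 2
        self.x_conv  = nn.Conv2d(n_x, half, 1)                      # cross path
        self.tr_conv = nn.ConvTranspose2d(n_up, half, 2, stride=2)  # upward path, 2x upsample
        self.bn = nn.BatchNorm2d(n_out)

    def forward(self, up_p, x_p):
        cat = torch.cat([self.tr_conv(up_p), self.x_conv(x_p)], dim=1)
        return self.bn(torch.relu(cat))

# in the forward pass, the saved features are consumed in reverse order, e.g.
#   x = up1(x, sfs[3].features)
#   x = up2(x, sfs[2].features)
#   x = up3(x, sfs[1].features)
#   x = up4(x, sfs[0].features)
```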

And so there are some more tricks we can do to make this a little bit better, but this is a good start. So the simple upsampling approach looked horrible and had a dice of 96.8. A UNet with everything else identical, except that we've now got these UNet blocks, has a dice of 98.5.

So that's like we've kind of halved the error with everything else exactly the same. And more to the point, you can look at it. This is actually looking somewhat car-like compared to our non-unet equivalent, which is just a blob. Because trying to do this through down and up paths, it's just asking too much.

Whereas when we actually provide the downward path pixels at every point, it can actually start to create something car-ish. So at the end of that, we call .close() to again remove those SFS (saved feature) hooks that are taking up GPU memory, go to a smaller batch size and a higher size, and you can see the dice coefficient is really going up.

So notice here I'm loading in the 128x128 version of the network, so we're doing this progressive resizing trick again. That gets us 99.3, and then unfreezing gets us to 99.4. And you can see it's now looking pretty good. Go down to a batch size of 4 and a size of 1024, load in what we just did with the 512, which takes us to 99.5; unfreeze, takes us to 99.

And as you can see, that actually looks good. In accuracy terms, 99.82. You can see this is looking like something you could just about use to cut out. I think at this point there are a couple of minor tweaks we can do to get up to 99.7, but really the key thing then, I think, is just to do a little bit of smoothing, or a little bit of post-processing.

You can go and have a look at the Carvana winner's blogs and see some of these tricks. But as I say, the difference between where we're at 99.6 and what the winner's got of 99.7 is not heaps. And so really the unit on its own pretty much solves that problem.

Okay so that's it. The last thing I wanted to mention is now to come all the way back to bounding boxes. Because you might remember I said our bounding box model was still not doing very well on small objects, so hopefully you might be able to guess where I'm going to go with this.

Which is that for the bounding box model, remember how we had at different grid cells, we spat out outputs of our model, and it was those earlier ones with the small grid sizes that weren't very good. How do we fix it? Unet it. Let's have an upward path with cross-connections.

And so then we're just going to do a unet and then spit them out of that. Because now those finer grid cells have all of the information of that path and that path and that path and that path to leverage. Now of course, this is deep learning, so that means you can't write a paper saying we just used unet for bounding boxes.

You have to invent a new word. So this is called feature pyramid networks, or FPNs. This is literally part of the RetinaNet paper; it's used in the RetinaNet paper, and it was created in an earlier paper specifically about FPNs. If memory serves correctly, they did briefly cite the UNet paper, but they kind of made it sound like it was this vaguely, slightly connected thing that maybe some people could consider slightly useful.

But really, FPNs are UNets. I don't have an implementation of it to show you, but it'll be a fun thing maybe for some of us to try. I know some of the students have been trying to get it working well on the forums. Interesting thing to try. So I think a couple of things to look at after this class, as well as the other things I mentioned, would be playing around with FPNs and also maybe trying Karim's dynamic unet.

They would both be interesting things to look at. So you guys have all been through 14 lessons of me talking at you now, so I'm sorry about that. Thanks for putting up with me. You're going to find it hard to find people who actually know as much about training neural networks in practice as you do.

It'll be really easy for you to overestimate how capable all these other people are and underestimate how capable you are. The main thing to say is please practice. Please, just because you don't have this constant thing getting you to come back here every Monday night now, it's very easy to kind of lose that momentum.

So find ways to keep it, organize a study group or a book reading group or get together with some friends and work on a project. Do something more than just deciding I want to keep working on X. Unless you're the kind of person who's super motivated and you know that whenever you decide to do something, it happens, that's not me.

For me to make something happen, I have to say, "Yes, David, in October I will absolutely teach that course." And then it's like, "Okay, I'd better actually write some material." That's the only way I can get stuff to happen. We've got a great community there on the forums. If people have ideas for ways to make it better, please tell me.

If you think you can help, whether you want to create some new forum or moderate it in some different way or whatever, just let me know. You can always PM me. There's a lot of projects going on through GitHub as well, lots of stuff. I hope to see you all back here for something else.

Thanks so much for joining me on this journey.