
Lesson 14: Deep Learning Part 2 2018 - Super resolution; Image segmentation with Unet


Chapters

2:13 Style Transfer
3:50 Super Resolution
16:33 Data Augmentation
16:42 Random Dihedral
18:20 Transformations
19:06 Transform Types
26:00 Enhanced Deep Residual Networks
34:52 Upsampling
35:40 Transposed Convolutions
43:45 Pixel Shuffle
50:53 Perceptual Loss
58:20 Progressive Resizing
64:10 What Are the Future Plans for fast.ai in This Course
68:27 Leverage Your Knowledge about Your Domain
75:26 Reinforcement Learning
89:21 Segmentation
97:04 Transform Type Classification
102:50 Upscaling
107:09 The Dice Metric
111:48 Unet Blocks
112:35 Dynamic Unet
119:34 Feature Pyramid Networks

Whisper Transcript

00:00:00.000 | Thank you.
00:00:03.000 | Welcome to the last lesson, lesson 14.
00:00:08.040 | We're going to be looking at image segmentation today, amongst other things, but before we
00:00:13.080 | do, a bit of show and tell from last week.
00:00:20.040 | Elena Harley did something really interesting, which was she tried finding out what would
00:00:24.160 | happen if you did CycleGAN on just 300 or 400 images.
00:00:28.780 | I really like these projects where people just go to Google image search using the API
00:00:33.360 | or one of the libraries out there.
00:00:35.400 | Some of our students have created some very good libraries for interacting with Google
00:00:38.720 | images API, download a bunch of stuff they're interested in, in this case some photos and
00:00:44.360 | some stained glass windows, and with 300 or 400 photos of that she trained a model.
00:00:51.320 | She trained actually a few different models, this is what I particularly liked, and as
00:00:54.840 | you can see, with quite a small number of images she gets some very nice stained glass
00:00:59.760 | effects.
00:01:00.760 | So I thought that was an interesting example of using pretty small amounts of data that
00:01:06.440 | was readily available, which she was able to download pretty quickly, and there's more
00:01:10.960 | information about that on the forum if you're interested.
00:01:17.160 | It's interesting to wonder about what kinds of things people will come up with with this
00:01:20.280 | kind of generative model, it's clearly a great artistic medium, it's clearly a great medium
00:01:28.120 | for forgeries and fakeries, I wonder what other kinds of things people will realize
00:01:35.240 | they can do with these kind of generative models.
00:01:38.080 | I think audio is going to be the next big area, and also very interactive type stuff.
00:01:45.240 | Nvidia just released a paper showing an interactive photo repair tool where you just brush over
00:01:54.940 | an object and it replaces it with a deep learning generated replacement very nicely.
00:02:01.400 | Those kinds of interactive tools I think will be very interesting too.
00:02:07.000 | So before we talk about segmentation, we've got some stuff to finish up from last time
00:02:12.240 | which is that we looked at doing style transfer by actually directly optimizing pixels.
00:02:22.760 | Like with most of the things in Part 2, it's not so much that I'm wanting you to understand
00:02:30.680 | style transfer per se, but the kind of idea of optimizing your input directly and using
00:02:37.240 | activations as part of a loss function is really the key kind of takeaway here.
00:02:46.920 | So it's interesting then to kind of see what is effectively the follow-up paper, not from
00:02:52.680 | the same people, but the paper that kind of came next in the sequence of these kind of
00:02:57.240 | vision generative models with this one from Justin Johnson and folks at Stanford.
00:03:05.240 | And it actually does the same thing, style transfer, but it does it in a different way.
00:03:11.200 | Rather than optimizing the pixels, we're going to go back to something much more familiar
00:03:16.480 | and optimize some weights.
00:03:18.760 | And so specifically we're going to train a model which learns to take a photo and translate
00:03:25.020 | it into a photo in the style of a particular artwork.
00:03:30.360 | So each ConvNet will learn to produce one kind of style.
00:03:39.260 | Now it turns out that getting to that point, there's an intermediate point which is I actually
00:03:45.040 | think kind of more useful and takes us halfway there, which is something called super-resolution.
00:03:52.480 | So we're actually going to start with super-resolution because then we'll build on top of super-resolution
00:03:57.240 | to finish off the style transfer, ConvNet based style transfer.
00:04:02.560 | And so super-resolution is where we take a low-res image, we're going to take 72x72 and
00:04:09.200 | upscale it to a larger image, 288x288 in our case, trying to create a higher-res image
00:04:18.160 | that looks as real as possible.
00:04:24.360 | And so this is a pretty challenging thing to do because at 72x72 there's not that much
00:04:28.640 | information about a lot of the details.
00:04:31.000 | And the cool thing is that we're going to do it in a way as we tend to do with vision
00:04:35.680 | models which is not tied to the input size.
00:04:39.160 | So you could totally then take this model and apply it to a 288x288 image and get something
00:04:45.160 | that's 4 times bigger on each side, so 16 times bigger than that.
00:04:51.840 | But often it even works better at that level because you're really introducing a lot of
00:04:57.440 | detail into the finer details and you could really print out a high-resolution print of
00:05:02.080 | something which earlier on was pretty pixelated.
00:05:07.560 | So this is the notebook called Enhance.
00:05:12.520 | And it is a lot like that kind of CSI style enhancement where we're going to take something
00:05:18.600 | that appears like the information is just not there and we kind of invent it, but the
00:05:25.760 | ConvNet is going to learn to invent it in a way that's consistent with the information
00:05:29.320 | that is there, so hopefully it's kind of inventing the right information.
00:05:33.920 | One of the really nice things about this kind of problem is that we can create our own dataset
00:05:40.440 | as big as we like without any labeling requirements because we can easily create a low-res image
00:05:47.240 | from a high-res image just by downsampling our images.
00:05:51.200 | So something I would love some of you to try during the week would be to do other types
00:05:57.760 | of image-to-image translation where you can invent kind of labels, invent your dependent
00:06:04.920 | variable.
00:06:05.920 | For example, de-skewing, so either recognize things that have been rotated by 90 degrees
00:06:12.720 | or better still that have been rotated by 5 degrees and straighten them.
00:06:18.480 | Colorization, so make a bunch of images into black and white and learn to put the color
00:06:24.360 | back again.
00:06:27.480 | Noise reduction, maybe do a really low-quality JPEG save and learn to put it back to how
00:06:38.080 | it should have been, and so forth, or maybe take something that's in a 16 color palette
00:06:45.360 | and put it back to a higher color palette.
00:06:49.280 | I think these things are all interesting because they can be used to take pictures that you
00:06:56.600 | may have taken back on crappy old digital cameras before they were high resolution,
00:07:00.880 | or you may have scanned in some old photos that have faded or whatever, I think it's
00:07:06.120 | a really useful thing to be able to do, and also it's a good project because it's really
00:07:11.120 | similar to what we're doing here, but different enough that you'll come across some interesting
00:07:15.140 | challenges on the way, I'm sure.
00:07:19.200 | So I'm going to use ImageNet again.
00:07:22.400 | You don't need to use all of ImageNet at all, I just happen to have it lying around.
00:07:26.360 | You can download the 1% sample of ImageNet from files.fast.ai.
00:07:29.720 | You can use any set of pictures you have lying around, honestly.
00:07:35.880 | And in this case, as I said, we don't really have labels per se, so I'm just going to give
00:07:42.160 | everything a label of 0 just so we can use it with our existing infrastructure more easily.
00:07:50.360 | Now because I'm in this case pointing at a folder that contains all of ImageNet, I certainly
00:07:54.880 | don't want to wait for all of ImageNet to finish, to run an epoch.
00:07:58.420 | So here most of the time I would set keep_pct to 1 or 2%, and then I just generate a bunch
00:08:06.720 | of random numbers, and then I just keep those which are less than 0.02, and so that lets
00:08:14.000 | me quickly sub-sample my rows.
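
To make that concrete, here is a minimal sketch of that sub-sampling step in plain numpy, assuming `fnames` is the full list of training file names (the variable name is an assumption):

```python
import numpy as np

np.random.seed(42)
keep_pct = 0.02                                  # keep roughly 2% of ImageNet for quick experiments
keeps = np.random.rand(len(fnames)) < keep_pct   # one random draw per file
fnames = np.array(fnames)[keeps]                 # the sub-sampled file list we actually use
```
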
00:08:21.720 | So we're going to use VGG16, and VGG16 is something that we haven't really looked at in this class,
00:08:35.960 | but it's a very simple model where we take our normal, presumably 3-channel input, and
00:08:46.240 | we basically run it through a number of 3x3 convolutions, and then from time to time we
00:08:55.200 | put it through a 2x2 MaxPool, and then we do a few more 3x3 convolutions, MaxPool, so
00:09:08.280 | on and so forth.
00:09:10.160 | And then this is kind of our backbone, I guess.
00:09:21.520 | And then we don't do an average pooling layer, an adaptive average pooling layer.
00:09:27.560 | After a few of these we end up with this 7x7 grid as usual, I think it's about 7x7x512.
00:09:36.760 | And so rather than average pooling we do something different, which is we flatten the whole thing.
00:09:42.160 | So that spits out a very long vector of activations of size 7x7x512 if memory serves correctly.
00:09:52.900 | And then that gets fed into two fully connected layers, each one of which has 4096 activations,
00:10:04.920 | and then one more fully connected layer which has however many classes.
00:10:10.840 | So if you think about it, the weight matrix here is huge, it's 7x7x512x4096, and it's because
00:10:25.440 | of that weight matrix really that VGG went out of favor pretty quickly, because it takes
00:10:32.100 | a lot of memory, it takes a lot of computation, and it's really slow.
00:10:36.920 | And there's a lot of redundant stuff going on here, because really those 512 activations
00:10:44.040 | are not that specific to which of those 7x7 grid cells they're in, but when you have this
00:10:51.720 | entire weight matrix here of every possible combination, it treats all of them uniquely.
00:11:00.600 | And so that can also lead to generalization problems, because there's just a lot of weights
00:11:04.560 | and so forth.
00:11:07.800 | My view is that the approach that's used in every modern network, which is here we do
00:11:14.840 | an adaptive average pooling, which in Keras is known as global average pooling, or in
00:11:21.840 | fast.ai we generally do a concat pooling, which spits it straight down to a 512-long activation.
00:11:32.320 | I think that's throwing away too much geometry, so to me probably the correct answer is somewhere
00:11:39.000 | in between and would involve some kind of factored convolution or some kind of tensor
00:11:45.160 | decomposition which maybe some of us can think about in the coming months.
00:11:51.020 | So for now we've gone from one extreme, which is the adaptive average pooling, to the other
00:11:56.200 | extreme which is this huge flattened pooling connection layer.
00:12:00.400 | So a couple of things which are interesting about VGG that make it still useful today.
00:12:08.200 | The first one is that there's more interesting layers going on here with most modern networks
00:12:18.000 | including the ResNet family.
00:12:20.600 | The very first layer generally is a stride-2 7x7 conv, or something similar, which means we throw
00:12:31.560 | away half the grid size straight away and so there's little opportunity to use the fine
00:12:39.120 | detail because we never do any computation with it.
00:12:44.640 | And so that's a bit of a problem for things like segmentation or super resolution models
00:12:52.080 | because the fine detail matters, we actually want to restore it.
00:12:57.400 | And then the second problem is that the adaptive average pooling layer entirely throws away
00:13:03.800 | the geometry in the last few sections, which means that the rest of the model doesn't really
00:13:08.800 | have as much interest in learning the geometry as it otherwise might.
00:13:13.560 | And so therefore for things which are dependent on position, any kind of localization based
00:13:18.360 | approach to anything that requires generative modeling is going to be less effective.
00:13:22.800 | So one of the things I'm hoping you're hearing as I describe this is that probably none of
00:13:28.080 | the existing architectures are actually ideal.
00:13:32.340 | We can invent a new one.
00:13:33.520 | And actually I just tried inventing a new one over the week which was to take the VGG
00:13:42.360 | head and attach it to a ResNet backbone.
00:13:47.720 | And interestingly I found I actually got a slightly better classifier than a normal ResNet,
00:13:53.520 | but it also was something with a little bit more useful information.
00:13:57.640 | It took 5 or 10% longer to train, but nothing worth worrying about.
00:14:05.960 | I think maybe we could in ResNet replace this, as we've talked about briefly before,
00:14:10.040 | this very early convolution with something more like an inception stem which does a bit
00:14:14.760 | more computation.
00:14:16.160 | I think there's definitely room for some nice little tweaks to these architectures so that
00:14:22.820 | we can build some models which are maybe more versatile.
00:14:26.360 | At the moment people tend to build architectures that just do one thing.
00:14:29.720 | They don't really think what am I throwing away in terms of opportunity because that's
00:14:35.120 | how publishing works.
00:14:36.120 | You know you publish like I've got the state-of-the-art in this one thing rather than I've created
00:14:40.240 | something that's good at lots of things.
00:14:43.480 | So for these reasons we're going to use VGG today even though it's ancient and it's missing
00:14:49.020 | lots of great stuff.
00:14:50.760 | One thing we are going to do though is use a slightly more modern version which is a
00:14:55.200 | version of VGG where batch norm has been added after all the convolutions.
00:15:00.400 | And so in fast.ai actually when you ask for a VGG network you always get the batch norm
00:15:05.200 | one because that's basically always what you want.
00:15:09.280 | So this is actually our VGG with batch norm.
00:15:12.480 | There's a 16 and a 19.
00:15:14.160 | The 19 is way bigger and heavier and doesn't really perform any better, so no one really uses it.
00:15:23.480 | So we're going to go from 72x72; lr here means low resolution, so sz_lr is the low resolution size.
00:15:30.500 | We're going to initially scale it up by 2x with a batch size of 64 to get 2 times 72, so a
00:15:37.340 | 144x144 output.
00:15:39.720 | So that's going to be our stage 1.
00:15:46.120 | We'll create our own dataset for this and the dataset, it's very worthwhile looking
00:15:52.800 | inside the fastai.dataset module and seeing what's there because just about anything you'd
00:15:59.920 | want we probably have something that's almost what you want.
00:16:04.120 | So in this case I want a dataset where my x's are images and my y's are also images.
00:16:10.880 | So there's already a files dataset we can inherit from where the x's are images and
00:16:15.360 | then I just inherit from that and I just copied and pasted the get x and turned that into
00:16:20.200 | get y so it just opens an image.
00:16:23.520 | So now I've got something where the x is an image and the y is an image and in both cases
00:16:27.760 | what we're passing in is an array of file names.
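
A minimal sketch of that dataset, following the fastai 0.7 API used in this course; the class name and exact constructor signature are reconstructed from the lesson notebook as best as possible, so treat the details as assumptions:

```python
import os
from fastai.dataset import FilesDataset, open_image  # fastai 0.7 imports

class MatchedFilesDataset(FilesDataset):
    def __init__(self, fnames, y, transform, path):
        self.y = y
        assert len(fnames) == len(y)
        super().__init__(fnames, transform, path)
    def get_y(self, i):
        # y is also an image, so open it from its filename, mirroring get_x
        return open_image(os.path.join(self.path, self.y[i]))
    def get_c(self):
        return 0  # no classes: this is an image-to-image task
```
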
00:16:31.240 | I'm going to do some data augmentation, obviously with all of ImageNet we don't really need
00:16:37.320 | it, but this is mainly here for anybody who's using smaller datasets to make most of it.
00:16:43.640 | Random dihedral is referring to every possible 90 degree rotation plus optional left/right
00:16:50.240 | flipping, so the dihedral group of eight symmetries.
00:16:56.560 | Probably we don't use this transformation for ImageNet pictures because you don't normally
00:17:01.440 | flip dogs upside down, but in this case we're not trying to classify whether it's a dog
00:17:07.480 | or a cat, we're just trying to keep the general structure of it, so actually every possible
00:17:13.640 | flip is a reasonably sensible thing to do for this problem.
00:17:20.280 | So create a validation set in the usual way, and you can see I'm kind of using a few more
00:17:26.360 | slightly lower level functions, generally speaking I just copy and paste them out of
00:17:30.440 | the fast.ai source code to find the bits I want.
00:17:34.600 | So here's the bit which takes an array of validation set indexes and one or more arrays
00:17:43.480 | of variables and simply splits, so in this case this into a training and a validation
00:17:50.440 | set and this into a training and a validation set to give us our x's and y's.
00:17:58.760 | Now in this case the x and y are the same, our image and our output are the same, we're
00:18:05.720 | going to use transformations to make one of them lower resolution, so that's why these
00:18:10.720 | are the same thing.
00:18:14.760 | So the next thing that we need to do is to create our transformations as per usual, and
00:18:24.880 | we're going to use this transform y parameter like we did for bounding boxes, but rather
00:18:30.440 | than use transform type.coordinate, we're going to use transform type.pixel, and so
00:18:38.320 | that tells our transformations framework that your y values are images with normal pixels
00:18:47.160 | in them and so anything you do with the x you also need to do the y, do the same thing.
00:18:54.200 | And you need to make sure any data augmentation transforms you use have the same parameter
00:18:59.560 | as well.
00:19:06.440 | So you can see the possible transform types, basically you've got classification, which
00:19:09.960 | we're about to use for segmentation in the second half of today, coordinates, no transformation
00:19:15.040 | at all, or pixel.
00:19:20.480 | So once we've got a dataset class and some x and y training and validation sets, there's
00:19:28.840 | a handy little method called get_datasets, which basically runs that constructor over
00:19:34.760 | all the different things that you have to return all the datasets that you need in exactly
00:19:39.560 | the right format to pass to a model data constructor, in this case the image data constructor.
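
Putting those pieces together might look roughly like this with the fastai 0.7 API; the variable names (sz_lr, sz_hr, trn_x, trn_y, val_x, val_y, PATH, PATH_TRN, bs) and the exact arguments are assumptions based on the lesson notebook:

```python
aug_tfms = [RandomDihedral(TfmType.PIXEL)]                 # flips/rotations applied to both x and y
tfms = tfms_from_model(vgg16, sz_lr, tfm_y=TfmType.PIXEL,  # y is an image of ordinary pixels...
                       aug_tfms=aug_tfms, sz_y=sz_hr)      # ...but resized to the high-res size
datasets = ImageData.get_ds(MatchedFilesDataset,
                            (trn_x, trn_y), (val_x, val_y), tfms, path=PATH_TRN)
md = ImageData(PATH, datasets, bs, num_workers=16, classes=None)
```
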
00:19:46.560 | So we're kind of like going back under the covers of fast.ai a little bit and building
00:19:51.520 | it up from scratch.
00:19:53.840 | And in the next few weeks this will all be wrapped up and refactored into something that
00:19:58.400 | you can do in a single step in fast.ai, but the point of this class is to learn a bit
00:20:03.000 | about going under the covers.
00:20:08.320 | So something we've briefly seen before is that when we take images in we transform them
00:20:17.200 | not just with data augmentation, but we also move the channels dimension up to the start,
00:20:23.800 | we subtract the mean, divide by the standard deviation, whatever.
00:20:27.800 | So if we want to be able to display those pictures that have come out of our datasets
00:20:32.640 | or data loaders, we need to denormalize them, and so the model data objects dataset has
00:20:38.840 | a denorm function that knows how to do that, so I'm just going to give that a short name
00:20:43.680 | for convenience.
00:20:46.160 | So now I'm going to create a function that can show an image from a dataset, and if you
00:20:50.320 | pass in something saying this is a normalized image, then we'll denormalize it.
00:20:55.560 | So we can go ahead and have a look at that.
00:20:59.160 | You'll see here we've passed in sz_lr as our size for the transforms, and sz_hr
00:21:07.400 | as something new, the sz_y parameter.
00:21:10.760 | So the two bits are going to get different sizes.
00:21:14.620 | And so here you can see the two different resolutions of our x and our y for a whole
00:21:20.240 | bunch of fish.
00:21:23.800 | As per usual, plot.subplots to create our two plots, and then we can just use the different
00:21:29.360 | axes that came back to put stuff next to each other.
00:21:37.980 | So we can then have a look at a few different versions of the data transformation, and there
00:21:43.640 | you can see them being flipped in all different directions.
00:21:49.480 | So let's create our model.
00:21:57.260 | So we're going to have an image coming in, a small image coming in, and we want to have
00:22:07.520 | a big image coming out.
00:22:12.720 | And so we need to do some computation between those two to calculate what the big image
00:22:18.480 | would look like.
00:22:20.000 | And so essentially there's kind of two ways of doing that computation.
00:22:23.120 | We could first of all do some upsampling, and then do a few stride 1 kind of layers to
00:22:31.640 | do lots of computation.
00:22:34.240 | Or we could first do lots of stride 1 layers to do all the computation, and then at the
00:22:39.080 | end do some upsampling.
00:22:42.760 | We're going to pick the second approach, because we want to do lots of computation on something
00:22:48.160 | smaller because it's much faster to do it that way.
00:22:53.160 | And also like all that computation we get to leverage during the upsampling process.
00:23:01.760 | So upsampling, we know a couple of possible ways to do that.
00:23:05.960 | We can use transposed or fractionally strided convolutions, or we can use nearest neighbor
00:23:14.440 | upsampling, followed by a 1x1 conv.
00:23:21.920 | And then in the do lots of computation section, we could just have a whole bunch of 3x3 convs.
00:23:30.400 | But in this case in particular, it seems likely that ResNet blocks are going to be better,
00:23:37.000 | because really the output and the input are very similar.
00:23:45.200 | So we really want a flow-through path that allows as little fussing around as possible
00:23:51.040 | except the minimal amount necessary to do our super-resolution.
00:23:55.760 | And so if we use ResNet blocks, then they have an identity path already.
00:24:01.920 | So you could imagine the most simple version where it does a bilinear sampling kind of
00:24:09.880 | approach or something.
00:24:10.880 | It could basically just go through identity blocks all the way through, and then in the
00:24:14.080 | upsampling blocks just learn to take the averages of the inputs and get something that's not
00:24:19.400 | too terrible.
00:24:22.160 | So that's what we're going to do.
00:24:23.160 | We're going to create something with 5 ResNet blocks, and then for each 2x scale-up we have
00:24:31.960 | to do, we'll have one upsampling block.
00:24:38.440 | So they're all going to consist of, obviously as per usual, convolution layers, possibly
00:24:43.600 | with activation functions after many of them.
00:24:46.760 | So I kind of like to put my standard convolution block into a function so I can refactor it
00:24:53.840 | more easily.
00:24:56.240 | As per usual I just won't worry about passing in padding and just calculate it directly
00:25:00.560 | as kernel size over 2.
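
A minimal sketch of that convolution block: padding is just kernel_size // 2 so the grid size is preserved, there is no batch norm, and the activation is optional:

```python
import torch.nn as nn

def conv(ni, nf, kernel_size=3, actn=False):
    layers = [nn.Conv2d(ni, nf, kernel_size, padding=kernel_size // 2)]
    if actn:
        layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)
```
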
00:25:04.340 | So one interesting thing about our little conv block here is that there's no batch norm,
00:25:09.720 | which is pretty unusual for ResNet-type models.
00:25:14.880 | And the reason there's no batch norm is because I'm stealing ideas from this fantastic recent
00:25:20.560 | paper which actually won a recent competition in super-resolution performance.
00:25:27.280 | And to see how good this paper is, here's kind of a previous state of the art, this
00:25:32.800 | SR ResNet, and what they've done here is they've zoomed way in to an upsampled kind of net
00:25:41.200 | or fence, this is the original.
00:25:43.680 | And you can see in the previous best approach there's a whole lot of distortion and blurring
00:25:49.080 | going on, whereas in their approach it's nearly perfect.
00:25:55.040 | So it was a really big step up this paper.
00:25:59.520 | They call their model EDSR, Enhanced Deep Residual Networks.
00:26:03.000 | And they did two things differently to the previous standard approaches.
00:26:10.280 | One was to take the ResNet block, this is a regular ResNet block, and throw away the batch
00:26:15.000 | norm.
00:26:17.360 | So why would they throw away the batch norm?
00:26:19.520 | Well the reason they would throw away the batch norm is because batch norm changes stuff,
00:26:25.980 | and we want a nice straight-through path that doesn't change stuff.
00:26:31.360 | So the idea basically here is if you don't want to fiddle with the input more than you
00:26:36.200 | have to, then don't force it to have to calculate things like batch norm parameters.
00:26:41.120 | So throw away the batch norm.
00:26:42.760 | And the second trick we'll see shortly.
00:26:46.000 | So here's a conv with no batch norm.
00:26:49.640 | And so then we're going to create a residual block containing, as per usual, two convolutions.
00:26:58.480 | And as you see in their approach, they don't even have a ReLU after their second conv.
00:27:03.520 | So that's why I've only got activation on the first one.
00:27:11.760 | So a couple of interesting things here.
00:27:16.460 | One is that this idea of having some kind of main ResNet path, like conv_relu_conv,
00:27:26.280 | and then turning that into a res block by adding it back to the identity, it's something
00:27:30.840 | we do so often.
00:27:31.840 | We've kind of factored it out into a tiny little module called res_sequential, which
00:27:36.880 | simply takes a bunch of layers that you want to put into your residual path, turns that
00:27:44.660 | into a sequential model, runs it, and then adds it back to the input.
00:27:50.560 | So with this little module we can now turn anything like conv_activation_conv into a
00:27:58.160 | ResNet block, just by wrapping it in res_sequential.
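
Here is a sketch of that wrapper, using the conv helper sketched above; it also includes the res_scale multiplier that is explained next:

```python
class ResSequential(nn.Module):
    def __init__(self, layers, res_scale=1.0):
        super().__init__()
        self.res_scale = res_scale
        self.m = nn.Sequential(*layers)         # the residual path

    def forward(self, x):
        return x + self.m(x) * self.res_scale   # identity path plus (scaled) residual

def res_block(nf):
    # conv -> ReLU -> conv, no batch norm, no activation after the second conv
    return ResSequential([conv(nf, nf, actn=True), conv(nf, nf)], res_scale=0.1)
```
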
00:28:04.960 | But that's not quite all I'm doing, because normally a res block just has the identity plus the residual path in its
00:28:10.760 | forward.
00:28:11.760 | But I've also got this res_scale multiplier.
00:28:15.760 | What's res_scale?
00:28:16.760 | Res_scale is the number 0.1.
00:28:20.040 | Why is it there?
00:28:22.400 | I'm not sure anybody quite knows.
00:28:25.320 | But the short answer is that the guy who invented batchnorm also somewhat more recently did
00:28:33.400 | a paper in which he showed, I think the first time, the ability to train imageNet in under
00:28:39.800 | an hour.
00:28:41.480 | And the way he did it was fire up lots and lots of machines and have them work in parallel
00:28:48.500 | to create really large batch sizes.
00:28:51.400 | Now generally when you increase the batch size by order n, you also increase the learning
00:28:56.880 | rate by order n to go with it.
00:28:58.960 | So generally very large batch size training means very high learning rate training as
00:29:04.200 | well.
00:29:05.200 | And he found that with these very large batch sizes of 8,000 plus, or even up to 32,000,
00:29:13.240 | that at the start of training his activations would basically go straight to infinity.
00:29:18.760 | And a lot of other people found that, we actually found that when we were competing in Dawnbench
00:29:22.880 | both on the Cypher and the imageNet competitions that we really struggled to make the most
00:29:28.920 | of even the eight GPUs that we were trying to take advantage of because of these challenges
00:29:34.920 | with these larger batch sizes and taking advantage of them.
00:29:38.760 | So something that Christian found, this researcher, was that in the resNet blocks, if he multiplied
00:29:43.920 | them by some number smaller than 1, something like 0.1 or 0.2, it really helped stabilize
00:29:50.320 | training at the start.
00:29:53.760 | And that's kind of weird because mathematically it's kind of identical, because obviously
00:30:01.000 | whatever I'm multiplying it by here, I could just scale the weights by the opposite amount
00:30:07.560 | here and have the same number.
00:30:10.440 | So it's kind of like we're not dealing with abstract math, we're dealing with real optimization
00:30:21.480 | problems and different initializations and learning rates and whatever else.
00:30:27.920 | And so the problem of weights disappearing off into infinity I guess generally is really
00:30:35.280 | about the kind of discrete and finite nature of computers in practice.
00:30:42.040 | And so often these kind of little tricks can make the difference.
00:30:46.800 | So in this case we're just kind of toning things down, at least based on our initialization.
00:30:55.040 | And so there are probably other ways to do this.
00:30:58.400 | For example, one approach from some folks at Nvidia called Lars, L-A-R-S, which I briefly
00:31:04.320 | mentioned last week, is an approach which uses discriminative learning rates calculated
00:31:09.760 | in real time, basically looking at the ratio between the gradients and the activations
00:31:18.200 | to scale learning rates by layer.
00:31:20.820 | And so they found that they didn't need this trick to scale up the batch sizes a lot.
00:31:30.000 | Maybe a different initialization would be all that's necessary.
00:31:35.060 | The reason I mention this is not so much because I think a lot of you are likely to want to
00:31:39.560 | train on massive clusters of computers, but rather that I think a lot of you want to train
00:31:45.200 | models quickly, and that means using high learning rates and ideally getting super-convergence.
00:31:51.800 | And I think these kinds of tricks, the tricks that we'll need to be able to get super-convergence
00:31:58.880 | across more different architectures and so forth.
00:32:02.640 | And other than Leslie Smith, no one else is really working on super-convergence other
00:32:10.240 | than some fast AI students nowadays.
00:32:12.640 | So these kinds of things about how do we train at very, very high learning rates, we're going
00:32:17.120 | to have to be the ones who figure it out as far as I can tell nobody else cares yet.
00:32:24.840 | So I think looking at the literature around training ImageNet in one hour, or more recently
00:32:31.160 | there's now a train ImageNet in 15 minutes, these papers actually have some of the tricks
00:32:37.720 | to allow us to train things at high learning rates.
00:32:40.960 | And so here's one of them.
00:32:42.200 | And so interestingly other than the train ImageNet in one hour paper, the only other
00:32:47.920 | place I've seen this mentioned was in this EDSR paper.
00:32:53.280 | And it's really cool because people who win competitions, I just find them to be very
00:33:00.640 | pragmatic and well-read.
00:33:03.200 | They actually have to get things to work.
00:33:05.420 | And so this paper describes an approach which actually worked better than anybody else's
00:33:10.480 | approach.
00:33:11.480 | And they did these pragmatic things like throw away batch norm and use this little scaling
00:33:17.120 | factor which almost nobody else seems to know about and stuff like that.
00:33:23.000 | So that's where the point one comes from.
00:33:26.400 | So basically our super-resolution ResNet is going to do a convolution to go from our three
00:33:32.960 | channels to 64 channels just to richen up the space a little bit.
00:33:36.760 | Oh sorry, I've got actually 8, not 5.
00:33:39.920 | Eight lots of these res blocks.
00:33:43.560 | Remember every one of these res blocks is stride 1, so the grid size doesn't change,
00:33:48.880 | the number of filters doesn't change, it's just 64 all the way through.
00:33:54.080 | We'll do one more convolution and then we'll do our up-sampling by however much scale
00:33:58.800 | we asked for.
00:34:01.160 | And then something I've added which is a little idea is just one batch norm here because it
00:34:06.720 | kind of felt like it might be helpful just to scale the last layer.
00:34:11.560 | And then finally a conv to go back to the three channels we want.
00:34:16.120 | So you can see that's basically here's lots and lots of computation and then a little
00:34:20.960 | bit of up-sampling just like we kind of described.
00:34:32.200 | So the only other piece here then is -- and also just to mention as you can see as I'm
00:34:38.640 | tending to do now, this whole thing is done by creating just a list of layers and then
00:34:44.800 | at the end turning that into a sequential model, and so my forward function is as simple
00:34:49.920 | as can be.
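
Putting the whole network together, a sketch might look like this (the upsample helper is sketched a bit further on, in the pixel shuffle section); the layer counts follow the description here, and the class name is an assumption:

```python
class SrResnet(nn.Module):
    def __init__(self, nf=64, scale=2):
        super().__init__()
        features = [conv(3, nf)]                        # richen 3 channels up to 64
        features += [res_block(nf) for _ in range(8)]   # eight stride-1 res blocks
        features += [conv(nf, nf)]                      # one more conv
        features += upsample(nf, nf, scale)             # conv + pixel shuffle per 2x step
        features += [nn.BatchNorm2d(nf),                # a single batch norm to scale the last layer
                     conv(nf, 3)]                       # back to 3 channels
        self.features = nn.Sequential(*features)

    def forward(self, x):
        return self.features(x)
```
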
00:34:53.120 | So here's our up-sampling.
00:34:55.800 | And up-sampling is a bit interesting because it is not doing either of these two things.
00:35:05.680 | So let's talk a bit about up-sampling.
00:35:13.820 | Here's a picture from the paper, not from the competition-winning paper but from this
00:35:18.040 | original paper.
00:35:20.480 | And so they're saying our approach is so much better, but look at their approach.
00:35:25.040 | It's got goddamn artifacts in it.
00:35:30.000 | These just pop up everywhere, don't they?
00:35:31.800 | And so one of the reasons for this is that they use transposed convolutions, and we all
00:35:35.960 | know don't use transposed convolutions.
00:35:40.840 | So here are transposed convolutions.
00:35:42.600 | This is from this fantastic convolutional arithmetic paper that was shown also in the
00:35:47.640 | Theano docs.
00:35:48.720 | If we're going from the blue is the original image, so a 3x3 image up to a 5x5 image, or
00:35:55.760 | a 6x6 if we added a layer of padding, then all a transposed convolution does is it uses
00:36:01.520 | a regular 3x3 conv, but it sticks white 0 pixels between every pair of pixels.
00:36:10.320 | So that makes the input image bigger and when we run this convolution up over it, it therefore
00:36:14.680 | gives us a larger output.
00:36:17.240 | But that's obviously stupid because when we get here, for example, of the 9 pixels coming
00:36:23.480 | in, 8 of them are 0.
00:36:26.200 | So we're just wasting a whole lot of computation.
00:36:28.960 | And then on the other hand, if we're slightly off over here, then 4 of our 9 are non-zero.
00:36:35.080 | But yet we only have one filter, like one kernel to use, so it can't change depending
00:36:42.600 | on how many zeros are coming in, so it has to be suitable for both.
00:36:48.640 | And it's just not possible.
00:36:50.680 | So we end up with these artifacts.
00:36:53.720 | So one approach we've learned to make it a bit better is to not put white things here,
00:36:59.320 | but instead to copy this pixel's value to each of these three locations.
00:37:07.240 | That's certainly a bit better, but it's still pretty crappy because now still when we get
00:37:11.480 | to these 9 here, 4 of them are exactly the same number.
00:37:17.160 | And when we move across 1, then now we've got a different situation entirely.
00:37:25.200 | And so depending on where we are, in particular if we're here, there's going to be a lot less
00:37:30.440 | repetition.
00:37:31.480 | So again we have this problem where there's wasted computation and too much structure
00:37:36.640 | in the data and it's going to lead to artifacts.
00:37:39.480 | So up-sampling is better than transposed convolutions, it's better to copy them rather than replace
00:37:45.160 | them with zeros, but it's still not quite good enough.
00:37:50.220 | So instead we're going to do the pixel shuffle.
00:38:00.640 | So the pixel shuffle is an operation in this sub-pixel convolutional neural network.
00:38:07.160 | And it's a little bit mind-bending, but it's kind of fascinating.
00:38:12.900 | And so we start with our input, we go through some convolutions to create some feature maps
00:38:18.200 | for a while until eventually we get to layer i-1, which has n i-1 feature maps.
00:38:27.800 | We're going to do another 3x3 conv.
00:38:29.960 | And our goal here is to go from a 7x7 grid cell, we're going to go a 3x3 upscaling, so
00:38:36.960 | we're going to go up to a 21x21 grid cell.
00:38:41.400 | So what's another way we could do that?
00:38:45.560 | To make it simpler, let's just pick one face, just one filter.
00:38:50.700 | So we'll just take the topmost filter and just do a convolution over that just to see
00:38:54.720 | what happens.
00:38:56.120 | And what we're going to do is we're going to use a convolution where
00:39:02.840 | the number of filters is 9 times bigger than we, strictly speaking, need.
00:39:12.200 | So if we needed 64 filters, we're actually going to do 64 times 9 filters.
00:39:20.600 | Why is that?
00:39:21.840 | And so here r is the scale factor, so 3, so r squared, 3 squared is 9.
00:39:27.980 | So here are the 9 filters to cover one of these input layers, one of these input slices.
00:39:38.120 | But what we can do is we started with 7x7 and we turned it into 7x7x9.
00:39:47.240 | Well the output that we want is equal to 7x3 by 7x3, so in other words there's an equal
00:39:58.160 | number of pixels here, or activations here, as there are r activations here.
00:40:04.100 | So we can literally reshuffle these 7x7x9 activations to create this (7x3)x(7x3) map.
00:40:17.320 | And so what we're going to do is we're going to take one little tube here, the top left
00:40:21.920 | hand of each grid, and we're going to put the purple one up in the top left, and then
00:40:29.280 | the blue one, one to the right, and then the light blue one, one to the right of that, and
00:40:35.440 | then the slightly darker blue one in the middle of the far left, the green one in the middle,
00:40:40.920 | and so forth.
00:40:41.920 | So each of these 9 cells in the top left are going to end up in this little 3x3 section
00:40:49.320 | of our grid.
00:40:51.640 | And then we're going to take 2, 1 and take all of those 9 and move them to these 3x3
00:40:58.760 | part of the grid, and so on and so forth.
00:41:02.160 | And so we're going to end up having every one of these 7x7x9 activations inside this
00:41:08.000 | (7x3)x(7x3) image.
00:41:13.360 | So the first thing to realize is, yes of course this works under some definition of works
00:41:19.280 | because we have a learnable convolution here, and it's going to get some gradients, which
00:41:25.440 | is going to do the best job it can of filling in the correct activation such that this output
00:41:30.880 | is the thing we want.
00:41:33.720 | So the first step is to realize there's nothing particularly magical here, we can create any
00:41:40.360 | architecture we like, we can move things around anyhow we want to, and our weights in the convolution
00:41:46.640 | will do their best to do all we asked.
00:41:49.040 | The real question is, is it a good idea?
00:41:52.760 | Is this an easier thing for it to do, and a more flexible thing for it to do, than the
00:41:58.680 | transposed convolution or the upsampling followed by 1x1 conv?
00:42:04.440 | And the short answer is, yes it is.
00:42:07.480 | And the reason it's better in short is that the convolution here is happening in the low
00:42:13.760 | resolution 7x7 space, which is quite efficient, whereas if we first of all upsampled and then
00:42:21.160 | did our conv, then our conv would be happening in the 21x21 space, which is a lot of computation.
00:42:30.960 | And furthermore as we discussed, there's a lot of replication and redundancy in the nearest
00:42:35.880 | neighbor upsampled version.
00:42:40.840 | So they actually show in this paper, in fact I think they have a follow-up technical note
00:42:45.160 | where they provide some more mathematical details as to exactly what work is being done
00:42:51.080 | and show that the work really is more efficient this way.
00:42:58.480 | So that's what we're going to do.
00:43:00.280 | So for our upsampling we're going to have two steps.
00:43:02.880 | The first will be a 3x3 conv with R^2 times more channels than we originally wanted, and
00:43:11.920 | then a pixel shuffle operation which moves everything in each grid cell into the little
00:43:20.880 | RxR grids that are located throughout here.
00:43:25.880 | So here it is, it's one line of code.
00:43:31.200 | And so here's the conv from number of in to number of filters out times 4, because we're
00:43:37.400 | doing a scale2 upsample, so 2^2 is 4.
00:43:43.560 | So that's our convolution, and then here is our pixel shuffle, it's built into PyTorch.
00:43:49.320 | Pixel shuffle is the thing that moves each thing into its right spot.
00:43:54.960 | So that will increase, will upsample by a scale factor of 2, and so we need to do that
00:44:02.800 | log base 2 of scale times, so if scale is 4, then we have to do it 2 times to go 2 times 2 bigger.
00:44:12.200 | So that's what this upsample here does.
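
A sketch of that upsampling helper: each 2x step is a 3x3 conv producing four times as many channels, followed by PyTorch's built-in nn.PixelShuffle, which rearranges those channels into a grid twice as large:

```python
import math

def upsample(ni, nf, scale):
    layers = []
    for _ in range(int(math.log(scale, 2))):   # one block per factor of 2
        layers += [conv(ni, nf * 4), nn.PixelShuffle(2)]
    return layers
```
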
00:44:17.840 | Great, guess what?
00:44:24.240 | That does not get rid of the checkerboard patterns.
00:44:27.440 | We still have checkerboard patterns.
00:44:30.580 | So I'm sure in great fury and frustration, this same team from Twitter, I think this
00:44:35.040 | was back when they used to be at a startup called MagicPony that Twitter bought, came
00:44:39.500 | back again with another paper saying, okay, this time we've got rid of the checkerboard.
00:44:52.760 | So why do we still have, as you can see here, we still have a checkerboard?
00:45:00.080 | And so the reason we still have a checkerboard, even after doing this, is that when we randomly
00:45:07.300 | initialize this convolutional kernel at the start, it means that each of these 9 pixels
00:45:13.840 | in this little 3x3 grid over here are going to be totally randomly different.
00:45:19.320 | But then the next set of 3 pixels will be randomly different to each other, but will
00:45:24.840 | be very similar to the corresponding pixel in the previous 3x3 section.
00:45:29.520 | So we're going to have repeating 3x3 things all the way across.
00:45:33.880 | And so then as we try to learn something better, it's starting from this repeating 3x3 starting
00:45:40.720 | point, which is not what we want.
00:45:44.300 | What we actually would want is for these 3x3 pixels to be the same to start with.
00:45:51.100 | So to make these 3x3 pixels the same, we would need to make these 9 channels the same here.
00:45:59.440 | For each filter.
00:46:01.760 | And so the solution, and this paper is very simple, is that when we initialize this convolution
00:46:10.980 | at the start, when we randomly initialize it, we don't totally randomly initialize it.
00:46:15.740 | We randomly initialize one of the R^2 sets of channels, and then we copy that to the
00:46:23.620 | other R^2, so they're all the same.
00:46:26.800 | And that way, initially, each of these 3x3s will be the same.
00:46:31.900 | And so that is called ICNR, and that's what we're going to use in a moment.
00:46:42.420 | So before we do, let's take a quick look.
00:46:45.140 | So we've got this super resolution ResNet, which does lots of computation with lots of
00:46:50.600 | ResNet blocks, and then it does some up-sampling and gets our final 3 channels out.
00:46:57.020 | And then to make life faster, we're going to run this in parallel.
00:47:03.140 | One reason we want to run it in parallel is because Dorado told us that he has 6 GPUs,
00:47:08.960 | and this is what his computer looks like right now.
00:47:13.240 | And so I'm sure anybody who has more than one GPU has had this experience before.
00:47:19.900 | So how do we get these men working together?
00:47:27.700 | All you need to do is to take your PyTorch module and wrap it with nn.DataParallel.
00:47:37.220 | And once you've done that, it copies it to each of your GPUs and will automatically run
00:47:43.580 | it in parallel.
00:47:45.820 | It scales pretty well to 2 GPUs, okay to 3 GPUs, better than nothing to 4 GPUs, and beyond
00:47:54.580 | that performance starts to go backwards.
00:48:00.140 | By default it will copy it to all of your GPUs.
00:48:03.220 | You can pass in an array of GPU IDs; otherwise if you want to avoid getting in trouble, for
00:48:08.940 | example I have to share our box with Yannette, and if I didn't put this here, then she would
00:48:13.340 | be yelling at me right now, or maybe boycotting my class.
00:48:17.720 | So this is how you avoid getting into trouble with Yannette.
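
In code that is a one-liner with PyTorch's nn.DataParallel; the GPU IDs below are just an example:

```python
m = SrResnet(64, scale).cuda()
m = nn.DataParallel(m, device_ids=[0, 1])   # replicate across the listed GPUs only
# The underlying weights now live under m.module, so saving
# m.module.state_dict() gives a checkpoint that loads without DataParallel.
```
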
00:48:22.740 | So one thing to be aware of here is that once you do this, it actually modifies your module.
00:48:29.460 | So if you now print out your module, let's say previously it was just an nn.Sequential,
00:48:34.140 | now you'll find it's an nn.Sequential embedded inside a module called module.
00:48:43.020 | And so in other words, if you save something which you had wrapped in nn.DataParallel, and then try
00:48:49.580 | to load it back into something that you hadn't wrapped in nn.DataParallel, it'll say it doesn't match
00:48:54.580 | up because one of them is embedded inside this module attribute and the other one isn't.
00:49:00.820 | It may also depend even on which GPU IDs you had it copied to.
00:49:07.020 | So two possible solutions, one is don't save the module m, but instead save the module attribute
00:49:16.380 | m.module, because that's actually the non-data parallel bit.
00:49:21.860 | Or always put it on the same GPU IDs and use data parallel and load and save that every
00:49:28.540 | time.
00:49:29.540 | That's what I was using.
00:49:30.540 | This would be an easy thing for me to fix automatically in fast.ai and I'll do it pretty
00:49:35.060 | soon so it'll look for that module attribute and deal with it automatically, but for now
00:49:41.140 | we have to do it manually.
00:49:42.140 | It's probably useful to know what's going on behind the scenes anyway.
00:49:46.720 | So we've got our module, I find it'll run like 50% or 60% faster on a 1080ti.
00:49:54.340 | If you're running on Volta, it actually parallelizes a bit better.
00:50:00.580 | There are much faster ways to parallelize, but this is a super easy way.
00:50:06.260 | So we create our learner in the usual way.
00:50:08.980 | We could use mse_loss here, so that's just going to compare the pixels of the output
00:50:13.340 | to the pixels that we expected, and we can run our learning rate finder and we can train
00:50:19.360 | it for a while, and here's our input and here's our output, and you can see that what we've
00:50:27.420 | managed to do is to train a very advanced residual convolutional network that's learned
00:50:32.500 | to blur things.
00:50:36.100 | Why is that?
00:50:37.100 | Well because it's what we asked for.
00:50:38.540 | We said to minimize mse_loss, an mse_loss between pixels, really the best way to do
00:50:45.900 | that is just average the pixels, i.e. to blur it.
00:50:50.400 | So that's why pixel_loss is no good.
00:50:52.620 | So we want to use our perceptual loss.
00:50:56.980 | So let's try perceptual loss.
00:51:00.120 | So with perceptual loss, we're basically going to take our VGG network, and just like we
00:51:06.260 | did last week, we're going to find the block index just before we get a max pool.
00:51:14.120 | So here are the ends of each block of the same grid size, and if we just print them
00:51:20.500 | out as we'd expect, every one of those is a value module.
00:51:26.040 | And so in this case, these last two blocks are less interesting to us.
00:51:32.440 | The grid size there is small enough, coarse enough that it's not as useful for super resolution,
00:51:39.580 | so we're just going to use the first three.
00:51:42.380 | And so just to save unnecessary computation, we're just going to use those first 23 layers
00:51:47.300 | for VGG, we'll throw away the rest, we'll stick it on the GPU, we're not going to be
00:51:54.340 | training this VGG model at all, we're just using it to compare activations.
00:51:59.740 | So we'll stick it in eval mode, and we will set it to not trainable.
00:52:07.540 | Just like last week, we'll use a save_features class to do a forward hook, which saves the
00:52:13.940 | output activations at each of those layers.
00:52:17.340 | And so now we've got everything we need to create our perceptual loss, or as I call it
00:52:21.180 | here, feature_loss_plus.
00:52:24.660 | And so we're going to pass in a list of layer IDs, the layers where we want the content
00:52:32.160 | loss to be calculated, an array of weights, a list of weights for each of those layers.
00:52:39.580 | So we can just go through each of those layer IDs and create an object which has got the
00:52:46.180 | hook function, forward hook function to store the activations.
00:52:49.860 | And so in our forward, then we can just go ahead and call the forward pass of our model
00:52:58.220 | with the target, so the target is the high res image we're trying to create.
00:53:02.620 | And so the reason we do that is because that's going to then call that hook function and
00:53:06.860 | store in self.save_features the activations we want.
00:53:14.060 | Now we're going to need to do that for our ConvNet output as well.
00:53:20.540 | So we need to clone these because otherwise the ConvNet output is going to go ahead and
00:53:24.780 | just clobber what we already had.
00:53:27.980 | So now we can do the same thing for the ConvNet output, which is the input to the loss function.
00:53:34.180 | And so now we've got those two things, we can zip them all together along with the weights.
00:53:40.500 | So we've got inputs, targets, weights, and then we can do the L1 loss between the inputs
00:53:45.420 | and the targets and multiply by the layer weights.
00:53:48.820 | The only other thing I do is I also grab the pixel loss, but I weight it down quite a bit.
00:53:57.100 | And most people don't do this, I haven't seen papers that do this, but in my opinion it's
00:54:02.260 | maybe a little bit better because you've got the perceptual content loss activation stuff,
00:54:09.860 | but at the finest level it also cares about the individual pixels.
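
A sketch of that feature loss, assuming `vgg` is the truncated, frozen VGG16-bn feature extractor built above (an nn.Sequential of its first 23 children) and `layer_ids` are the block indices just before each max pool; the layer weights and the small pixel-loss weight used here are illustrative only:

```python
import torch.nn as nn
import torch.nn.functional as F

class SaveFeatures():
    def __init__(self, m):
        self.hook = m.register_forward_hook(self.hook_fn)
    def hook_fn(self, module, inp, outp):
        self.features = outp   # stash this layer's activations
    def remove(self):
        self.hook.remove()

class FeatureLoss(nn.Module):
    def __init__(self, vgg, layer_ids, layer_wgts):
        super().__init__()
        self.vgg = vgg
        self.sfs = [SaveFeatures(vgg[i]) for i in layer_ids]  # hooks on the chosen blocks
        self.wgts = layer_wgts

    def forward(self, input, target):
        self.vgg(target)                                          # hooks now hold target activations
        targ_feat = [o.features.data.clone() for o in self.sfs]   # clone before they get clobbered
        self.vgg(input)                                           # hooks now hold input activations
        loss = F.l1_loss(input, target) * 1e-2                    # small pixel-level term (weight is an assumption)
        loss += sum(F.l1_loss(o.features, t) * w
                    for o, t, w in zip(self.sfs, targ_feat, self.wgts))
        return loss
```
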
00:54:18.660 | So that's our loss function, we create our super resolution ResNet, telling it how much
00:54:24.060 | to scale up by.
00:54:28.060 | And then we're going to do our ICNR initialization of that pixel shuffle convolution.
00:54:38.820 | So this is very, very boring code, I actually stole it from somebody else.
00:54:46.840 | Literally all it does is just say, okay, you've got some weight tensor x that you want to
00:54:53.500 | initialize, so we're going to treat it as if it had the number of features divided
00:55:01.300 | by scale squared features, so in practice this might be 2 squared, i.e. 4, because
00:55:11.020 | we actually want to keep one set of them and then copy them 4 times.
00:55:16.960 | So we divide it by 4, and we create something of that size, and we initialize that with
00:55:22.620 | a default kaiming normal initialization, and then we just make scale squared copies of it.
00:55:32.460 | And the rest of it is just moving axes around a little bit.
00:55:36.220 | So that's going to return a new weight matrix where each initialized subkernel is repeated
00:55:45.740 | R squared or scale squared times.
00:55:49.780 | So the details don't matter very much, all that matters here is that I just looked through
00:55:53.760 | to find what was the actual layer, the conv layer just before the pixel shuffle, and stored
00:56:00.820 | it away, and then I called ICNR on its weight matrix to get my new weight matrix, and then
00:56:07.100 | I copied that new weight matrix back into that layer.
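
A sketch of the ICNR initialization just described: kaiming-initialize one sub-kernel, repeat it scale squared times, and copy the result into the conv that feeds the pixel shuffle; the layer index in the usage comment is an assumption that depends on how the model was built:

```python
import torch
import torch.nn as nn

def icnr(x, scale=2, init=nn.init.kaiming_normal_):
    # treat the weight as if it had out_channels / scale^2 filters,
    # initialize that, then repeat it scale^2 times
    new_shape = [x.shape[0] // (scale ** 2)] + list(x.shape[1:])
    subkernel = init(torch.zeros(new_shape)).transpose(0, 1)
    subkernel = subkernel.contiguous().view(subkernel.shape[0], subkernel.shape[1], -1)
    kernel = subkernel.repeat(1, 1, scale ** 2)                 # scale^2 identical copies
    transposed_shape = [x.shape[1], x.shape[0]] + list(x.shape[2:])
    kernel = kernel.contiguous().view(transposed_shape)
    return kernel.transpose(0, 1)

# usage: find the conv just before the pixel shuffle and overwrite its weights, e.g.
# conv_shuffle = m.module.features[10][0]   # exact index is an assumption
# conv_shuffle.weight.data.copy_(icnr(conv_shuffle.weight, scale=scale))
```
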
00:56:13.140 | So as you can see, I went to quite a lot of trouble in this exercise to really try to
00:56:20.660 | implement all the best practices, and I kind of tend to do things a bit one extreme or
00:56:25.900 | the other.
00:56:26.900 | I show you a really hacky version that only slightly works, or I go to the nth degree
00:56:30.420 | to make it work really well.
00:56:32.860 | So this is a version where I'm claiming that this is pretty much a state-of-the-art implementation,
00:56:37.940 | it's a competition-winning approach, and the reason I'm doing that is because I think this
00:56:46.220 | is one of those rare papers where they actually get a lot of the details right, and I kind
00:56:51.180 | of want you to get a feel of what it feels like to get all the details right.
00:56:56.580 | And remember, getting the details right is the difference between this hideous blurry
00:57:02.220 | mess and this really pretty exquisite result.
00:57:14.780 | So we're going to have to do data parallel on that again, we're going to set our criterion
00:57:19.260 | to be feature loss using our VGG model, grab the first few blocks, and these are sets of
00:57:25.500 | layer weights that I found worked pretty well, do a learning rate finder, fit it for a while,
00:57:34.580 | and I fiddled around for a little while trying to get some of these details right.
00:57:40.700 | But here's my favorite part of the paper, what happens next, now that we've done it
00:57:48.260 | for scale=2, progressive resizing.
00:57:55.180 | So progressive resizing is the trick that let us get the best single computer result
00:58:00.540 | for ImageNet training on Dawnbench.
00:58:02.740 | This idea is starting small, gradually making bigger, and in two papers that have used this
00:58:07.860 | idea, one is the progressive resizing of GANs paper which allows training of very high-resolution
00:58:15.220 | GANs, and the other one is the EDSR paper.
00:58:19.620 | And the cool thing about progressive resizing is not only are your earlier epochs, assuming
00:58:26.700 | you've got two by two smaller, four times faster, you can also make the batch size maybe
00:58:33.000 | three or four times bigger, but more importantly, they're going to generalize better because
00:58:39.060 | you're feeding your model different size images during training.
00:58:44.980 | So we were able to train like half as many epochs for ImageNet as most people.
00:58:51.000 | So our epochs were faster and there were fewer of them.
00:58:54.620 | So progressive resizing is something that, particularly if you're training from scratch,
00:59:01.140 | I'm not so sure if it's useful for fine-tuning transfer learning, but if you're training
00:59:04.780 | from scratch, you probably want to do nearly all the time.
00:59:08.740 | So the next step is to go all the way back to the top and change scale to 4 and batch
00:59:16.140 | size to 32, like a restart, so I save the model before I do that, go back.
00:59:21.780 | And that's why there's a little bit of fussing around in here with reloading, because what
00:59:29.340 | I needed to do now is I needed to load my saved model back in, but there's a slight
00:59:35.580 | issue, which is I now have one more up-sampling layer than I used to have.
00:59:41.500 | To go from 2x2 to 4x4, my little loop here is now looping through twice, not once, and
00:59:54.420 | therefore it's added an extra conv and an extra pixel shuffle.
00:59:58.100 | So how am I going to load in weights through a different network?
01:00:03.900 | And the answer is that I use a very handy thing in PyTorch, which is if I call -- this
01:00:11.100 | is basically what learn.load calls behind the scenes, load_state_dict.
01:00:19.440 | If I pass in this parameter strict=False, then
01:00:26.780 | it says if you can't fill in all of the layers, just fill in the layers you can.
01:00:34.500 | So after loading the model back in this way, we're going to end up with something where
01:00:38.900 | it's loaded in all the layers that it can, and that one conv layer that's new is going
01:00:44.460 | to be randomly initialized.
01:00:46.900 | And so then I freeze all my layers and then unfreeze that up-sampling part, and then use
01:00:56.600 | ICNR on my newly added extra layer, and then I can go ahead and load again.
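
In PyTorch that reload step might look like this; the checkpoint file name is just an example:

```python
sd = torch.load('sr_scale2_weights.pt',
                map_location=lambda storage, loc: storage)   # load onto CPU regardless of saved device
m.load_state_dict(sd, strict=False)   # fill in every matching layer, skip the new upsampling conv
```
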
01:01:06.980 | And so then the rest is the same.
01:01:08.820 | So if you're trying to replicate this, don't just run this top to bottom, realize it involves
01:01:13.780 | a bit of jumping around.
01:01:21.220 | The longer you train, the better it gets.
01:01:24.460 | I ended up training it for about 10 hours, but you'll still get very good results much
01:01:28.540 | more quickly if you're less patient.
01:01:32.020 | And so we can try it out, and here is the result.
01:01:35.160 | Here is my pixelated bird, and look here, it's like totally random pixels.
01:01:41.160 | And here's the up-sampled version, it's like it's literally invented coloration.
01:01:48.900 | But it figured out what kind of bird it is, and it knows what these feathers are meant
01:01:55.300 | to look like.
01:01:56.680 | And so it has imagined a set of feathers which are compatible with these exact pixels, which
01:02:03.620 | is like genius.
01:02:04.920 | Like same here, there's no way you can tell what these blue dots are meant to represent,
01:02:10.940 | but if you know that this kind of bird has an array of feathers here, you know that's
01:02:16.120 | what they must be.
01:02:17.120 | And then you can figure out where the feathers would have to be such that when they were
01:02:20.320 | pixelated they'd end up in these spots.
01:02:23.080 | So it's like literally reverse engineered, given its knowledge of this exact species
01:02:30.780 | of bird, how it would have to have looked to create this output.
01:02:36.640 | And so this is like so amazing.
01:02:39.440 | It also knows from all the kind of signs around it that this area here was almost certainly
01:02:48.100 | blurred out.
01:02:49.580 | So it's actually reconstructed blurred vegetation.
01:02:55.520 | And if it hadn't done all of those things, it wouldn't have got such a good loss function.
01:03:00.400 | Because in the end, it had to match the activations saying like there's a feather over here and
01:03:08.440 | it's kind of fluffy looking and it's in this direction and all that.
01:03:17.320 | Alright, well that brings us to the end of super resolution.
01:03:22.460 | Don't forget to check out the Ask Jeremy Anything thread and we will do some Ask Jeremy Anything
01:03:27.960 | after the break.
01:03:28.960 | Let's see you back here at quarter to eight.
01:03:47.160 | Okay.
01:03:54.040 | So we are going to do Ask Jeremy Anything, Rachel will tell me the most voted up of your
01:04:04.960 | questions.
01:04:05.960 | Yes, Rachel.
01:04:06.960 | What are the future plans for Fast AI in this course?
01:04:15.080 | Will there be a part three?
01:04:16.560 | If there is a part three, I would really love to take it.
01:04:19.480 | That's cool.
01:04:20.480 | I'm not quite sure, it's always hard to guess.
01:04:25.400 | I hope there will be some kind of follow-up.
01:04:28.320 | Last year after part two, one of the students started up a weekly book club going through
01:04:33.700 | the Ian Goodfellow deep learning book and Ian actually came in and presented quite a
01:04:39.240 | few of the chapters and other people, like there was somebody, an expert, who presented
01:04:43.040 | every chapter.
01:04:44.040 | That was like a really cool part three.
01:04:46.440 | To a large extent it will depend on you, the community, to come up with ideas and to help
01:04:52.720 | make them happen and I'm definitely keen to help.
01:04:57.360 | I've got a bunch of ideas, but I'm nervous about saying them because I'm not sure which
01:05:01.160 | ones will happen and which ones won't, but the more support I have in making things happen
01:05:07.080 | that you want to happen from you, the more likely they are to happen.
01:05:13.800 | What was your experience like starting down the path of entrepreneurship?
01:05:17.440 | Have you always been an entrepreneur or did you start out at a big company and transition
01:05:21.920 | to a start-up?
01:05:22.920 | Did you go from academia to start-ups or start-ups to academia?
01:05:26.800 | I was definitely not in academia, I'm totally a fake academic.
01:05:31.720 | I started at McKinsey & Company which is a strategy firm when I was 18, which meant I
01:05:38.760 | couldn't really go to university, so I didn't really turn up and then I spent eight years
01:05:43.400 | in business helping really big companies on strategic questions.
01:05:47.240 | I always wanted to be an entrepreneur, I planned to only spend two years at McKinsey, the
01:05:53.380 | only thing I really regret in my life was not sticking to that plan and wasting eight
01:05:58.160 | years instead.
01:05:59.160 | So two years would have been perfect, but then I went into entrepreneurship, started
01:06:04.480 | two companies in Australia and the best part about that was that I didn't get any funding,
01:06:12.480 | so all the money that I made was mine, all the decisions were mine and my partners'.
01:06:19.540 | I focused entirely on profit and product and customer and service. Whereas in San
01:06:27.400 | Francisco, and I'm glad I came here, the two of us, Anthony and I, came here for Kaggle
01:06:38.040 | and raised a ridiculous amount of money, $11 million, for this really new company.
01:06:47.320 | That was really interesting but it's also really distracting, trying to worry about
01:06:51.720 | scaling and VCs wanting to see what your business development plans are and also just not having
01:06:58.760 | any real need to actually make a profit.
01:07:02.840 | So I had a bit of the same problem at Enlitic, where I again raised a lot of money, $15 million
01:07:11.240 | pretty quickly and a lot of distractions.
01:07:17.340 | So I think trying to bootstrap your own company and focus on making money by selling something
01:07:28.320 | at a profit and then plowing that back into the company worked really well, because
01:07:37.000 | we were making a profit from three months in, and within five years we were
01:07:43.280 | making enough of a profit not just to pay all of us our own wages but also to see
01:07:47.800 | my bank account growing, and after ten years sold it for a big chunk of money, not enough
01:07:52.680 | that a VC would be excited but enough that I didn't have to worry about money again.
01:07:59.480 | So I think bootstrapping a company is something which people in the Bay Area at least don't
01:08:05.440 | seem to appreciate how good an idea that is.
01:08:10.920 | If you were 25 years old today and still knew what you know now, where would you be looking
01:08:15.240 | to use AI? What are you working on right now or looking to work on in the next two years?
01:08:21.600 | You should ignore the last part of that, I won't even answer it, it doesn't matter where
01:08:24.920 | I'm looking, what you should do is leverage your knowledge about your domain.
01:08:32.200 | So one of the main reasons we do this is to get people who have backgrounds in whatever,
01:08:39.120 | recruiting, oil field surveys, journalism, activism, whatever, and solve your problems.
01:08:53.000 | It will be really obvious to you what your problems are and it will be really obvious
01:08:56.680 | to you what data you have and where to find it.
01:09:00.000 | Those are all the bits that for everybody else it's really hard, so people who start
01:09:03.160 | out with "Oh I know deep learning" now go and find something to apply it to, basically
01:09:09.280 | never succeed, whereas people who are like "Oh I've been spending 25 years doing specialized
01:09:16.240 | recruiting for legal firms and I know that the key issue is this thing and I know that
01:09:20.840 | this piece of data totally solves it and so I'm just going to do that now and I already
01:09:25.360 | know who to call to actually start selling it to," they're the ones who tend to win.
01:09:31.720 | So if you've done nothing but academic stuff then it's more about your hobbies and interests,
01:09:44.680 | so everybody has hobbies.
01:09:47.720 | The main thing I would say is please don't focus on building tools for data scientists
01:09:53.520 | to use or for software engineers to use because every data scientist knows about the market
01:10:00.280 | of data scientists, whereas only you know about the market for analyzing oil survey well logs
01:10:08.920 | or understanding audiology studies or whatever it is that you do.
01:10:19.560 | Given what you've shown us about applying transfer learning from image recognition to
01:10:23.360 | NLP, there looks to be a lot of value in paying attention to all of the developments that
01:10:27.920 | happen across the whole machine learning field and that if you were to focus in one area
01:10:32.000 | you might miss out on some great advances in other concentrations.
01:10:35.920 | How do you stay aware of all the advancements across the field while still having time to
01:10:39.720 | dig in deep to your specific domains?
01:10:42.280 | Yeah that's awesome, I mean that's kind of the message of this course, one of the key
01:10:46.640 | messages of this course is like lots of good works being done in different places and people
01:10:52.240 | are so specialized most people don't know about it, like if I can get state-of-the-art
01:10:57.000 | results in NLP within six months of starting to look at NLP, then I think that says more
01:11:02.440 | about NLP than it does about me.
01:11:06.720 | So yeah it's kind of like the entrepreneurship thing, it's like you pick the areas that you
01:11:13.160 | see that you know about and kind of transfer stuff like oh we could use deep learning to
01:11:17.800 | solve this problem or in this case like we could use this idea of computer vision to
01:11:24.840 | solve that problem.
01:11:27.380 | So things like transfer learning, I'm sure there's like a thousand things, opportunities
01:11:32.440 | for you to do in other fields to do what Sebastian and I did with NLP classification.
01:11:39.600 | So the short answer to your question is the way to stay ahead of what's going on would
01:11:43.600 | be to follow my feed of Twitter favorites and my approach is to follow lots and lots
01:11:50.440 | of people on Twitter and put the interesting things into my Twitter favorites for you.
01:11:55.040 | Every time I come across something interesting I click favorite and there are two reasons
01:11:59.080 | I do it, the first is that when the next course comes along I go through my favorites to find
01:12:03.640 | which things I want to study and the second is so that you can do the same thing.
01:12:11.480 | And then which do you go deep into, it almost doesn't matter, like I find every time I look
01:12:17.040 | at something it turns out to be super interesting and important.
01:12:19.880 | So just pick something which is like, you feel like solving that problem would be actually
01:12:26.400 | useful for some reason and it doesn't seem to be very popular, which is kind of the opposite
01:12:31.480 | of what everybody else does, everybody else works on the problems which everybody else
01:12:36.720 | is already working on because they're the ones that seem popular and I don't know.
01:12:41.360 | I can't quite understand this kind of thinking but it seems to be very common.
01:12:46.880 | Is deep learning an overkill to use on tabular data?
01:12:50.200 | When is it better to use deep learning instead of machine learning on tabular data?
01:12:59.320 | Is that a real question or did you just put that there so that I would point out that
01:13:03.280 | Rachel Thomas just wrote an article?
01:13:10.000 | Yes, so Rachel's just written about this and Rachel and I spent a long time talking about
01:13:16.520 | it and the short answer is we think it's great to use deep learning on tabular data.
01:13:24.280 | Actually of all the rich, complex, important and interesting things that appear in Rachel's
01:13:30.520 | Twitter stream covering everything from the genocide of the Rohingya through to the latest
01:13:37.680 | ethics violations in AI companies, the one by far that got the most attention and engagement
01:13:44.540 | from the community was her question about is it called tabular data or structured data.
01:13:51.920 | Ask computer people how to name things and you'll get plenty of interest.
01:13:57.200 | There are some really good links here to stuff from Instacart and Pinterest and other folks
01:14:03.000 | who have done some good work in this area.
01:14:05.520 | Many of you that went to the Data Institute conference will have seen Jeremy Stanley's
01:14:09.020 | presentation about the really cool work they did at Instacart.
01:14:12.400 | Yes, Rachel?
01:14:13.400 | I relied heavily on lessons three and four from part one in writing this post, so much
01:14:19.960 | of it may be familiar to you.
01:14:23.520 | Rachel asked me during the post how to tell whether you should use a decision tree ensemble
01:14:30.600 | like GBM or random forest or a neural net, and my answer is I still don't know.
01:14:37.320 | Nobody I'm aware of has done that research in any particularly meaningful way, so there's
01:14:42.480 | a question to be answered there.
01:14:44.680 | I guess my approach has been to try to make both of those things as accessible as possible
01:14:49.920 | through the fastAI library so you can try them both and see what works.
01:14:54.480 | That was it for the top three questions.
01:15:09.000 | Just quickly to go from super resolution to style transfer is kind of --
01:15:15.600 | I think I missed the one on reinforcement learning.
01:15:22.040 | Reinforcement learning popularity has been on a gradual rise in the recent past.
01:15:26.920 | What's your take on reinforcement learning?
01:15:28.980 | Would fastAI consider covering some ground and popular RL techniques in the future?
01:15:36.160 | I'm still not a believer in reinforcement learning.
01:15:41.520 | I think it's an interesting problem to solve, but it's not at all clear that we have a good
01:15:47.280 | way of solving this problem.
01:15:48.480 | The problem really is the delayed credit problem.
01:15:53.000 | I want to learn to play Pong, I move up or down, and three minutes later I find out whether
01:15:58.780 | I won the game of Pong, so which of the actions I took were actually useful?
01:16:05.520 | To me the idea of calculating the gradients of the output with respect to those inputs,
01:16:13.480 | the credit is so delayed that those derivatives don't seem very interesting.
01:16:21.720 | I get this question quite regularly in every one of these four courses so far.
01:16:25.680 | I've always said the same thing.
01:16:28.360 | I'm rather pleased that finally recently there's been some results showing that basically random
01:16:33.520 | search often does better than reinforcement learning.
01:16:39.400 | Basically what's happened is very well-funded companies with vast amounts of computational
01:16:44.800 | power throw all of it at reinforcement learning problems and get good results and people then
01:16:51.120 | say it's because of the reinforcement learning rather than the vast amounts of compute power.
01:16:56.880 | Or they use extremely thoughtful and clever algorithms like a combination of convolutional
01:17:04.600 | neural nets and Monte Carlo tree search like they did with the AlphaGo stuff to get great
01:17:09.920 | results and people incorrectly say that's because of reinforcement learning but it wasn't
01:17:15.800 | really reinforcement learning at all.
01:17:19.880 | I'm very interested in solving these kind of more generic optimization type problems
01:17:27.440 | rather than just prediction problems and that's what these delayed credit problems look like.
01:17:33.880 | But I don't think we've yet got good enough best practices that I have anything I'm ready
01:17:40.160 | to teach and say like I'm going to teach you this thing because I think it's still going
01:17:44.080 | to be useful next year.
01:17:46.640 | So we'll keep watching and see what happens.
01:17:58.080 | So we're going to now turn the super resolution network basically into a style transfer network
01:18:05.040 | and we'll do this pretty quickly.
01:18:07.160 | We basically already have something, so here's my input image and I'm going to have some
01:18:11.940 | loss function and I've got some neural net again.
01:18:16.960 | So instead of a neural net that does a whole lot of compute and then does upsampling at
01:18:20.600 | the end, our input this time is just as big as our output so we're going to do some downsampling
01:18:26.520 | first and then our compute and then our upsampling.
01:18:30.400 | So that's the first change we're going to make is we're going to add some down sampling,
01:18:34.200 | so some stride 2 convolution layers to the front of our network.
01:18:37.680 | The second is rather than just comparing y, c and x to the same thing here.
01:18:43.160 | So we're going to basically say our input image should look like itself by the end,
01:18:50.320 | so specifically we're going to compare it by chucking it through VGG and comparing it
01:18:54.120 | at one of the activation layers.
01:18:58.360 | And then its style should look like some painting, which we'll do just like we did with the Gatys
01:19:04.600 | approach by looking at the gram matrix correspondence at a number of layers.
01:19:10.360 | So that's basically it, and so that ought to be super straightforward, it's really just
01:19:16.600 | combining two things we've already done.
01:19:20.000 | And so all this code at the start is identical, except we don't have high res and low res,
01:19:24.120 | we just have one size 256, all this is the same, my model's the same.
01:19:33.280 | One thing I did here is I did not do any kind of fancy best practices for this one at all,
01:19:41.760 | partly because there doesn't seem to be any, like there's been very little follow-up in
01:19:47.360 | this approach compared to the super resolution stuff, and we'll talk about why in a moment.
01:19:55.000 | So you'll see this is much more normal looking, I've got batch norm layers, I don't have the
01:20:03.200 | scaling factor here, I don't have a pixel shuffle, it's just using a normal upsampling
01:20:09.920 | followed by a one by one conv, blah blah blah, so it's just more normal.
01:20:15.880 | One thing they mentioned in the paper is they had a lot of problems with zero padding creating
01:20:22.260 | artifacts, and the way they solved that was by adding 40 pixels of reflection padding
01:20:27.160 | at the start, so I did the same thing, and then they used no padding in their convolutions
01:20:34.120 | in their res blocks.
01:20:36.400 | Now if you've got no padding in the convolutions in your res blocks, then that means that the
01:20:41.320 | two parts of your resnet won't add up anymore because you've lost a pixel from each side
01:20:47.080 | on each of your two convolutions.
01:20:49.080 | So my res sequential has become res sequential center, and I've removed the last two pixels
01:20:56.240 | on each side of those grid cells.
01:20:59.000 | So other than that, this is basically the same as what we had before.
01:21:03.720 | So then we can bring in our starry_night_picture, we can resize it, we can throw it through
01:21:10.680 | our transformations.
01:21:14.060 | Just to make the method a little bit easier for my brain to handle, I took my transform
01:21:23.540 | style image, which after transformations is 3x256x256, and I made a mini-batch.
01:21:29.640 | My batch size is 24, 24 copies of it.
01:21:32.680 | That just makes it a little bit easier to do the batch arithmetic without worrying about
01:21:37.560 | some of the broadcasting, they're not really 24 copies, I used np.broadcast to basically
01:21:45.720 | fake 24 copies.
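One way to get that kind of zero-copy batching is NumPy's broadcast_to, sketched below; whether the notebook uses exactly this helper is an assumption.

```python
import numpy as np

# A read-only "batch" view of a single 3x256x256 style image: the batch
# arithmetic sees 24 copies, but nothing is actually duplicated in memory.
style = np.random.rand(3, 256, 256).astype(np.float32)   # stand-in for the transformed style image
style_batch = np.broadcast_to(style, (24, 3, 256, 256))
print(style_batch.shape)   # (24, 3, 256, 256)
```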
01:21:52.000 | So just like before, we create our VGG, grab the last block, this time we're going to use
01:21:58.240 | all of these layers so we keep everything up to the 43rd layer.
01:22:05.600 | And so now our combined loss is going to add together a content loss for the 3rd block
01:22:12.040 | plus the gram loss for all of our blocks with different weights.
01:22:16.840 | And so the gram loss, and again, going back to everything being as normal as possible,
01:22:23.520 | I've gone back to using MSE here.
01:22:26.800 | Basically what happened is I had a lot of trouble getting this to train properly, so
01:22:29.480 | I gradually removed trick after trick and eventually just went okay, I'm just going
01:22:32.480 | to make it as bland as possible.
01:22:38.440 | Last week's gram matrix was wrong, by the way, it only worked for a batch size of 1,
01:22:44.920 | and we only had a batch size of 1, so that was fine.
01:22:48.680 | I was using matrix multiply, which meant that every batch was being compared to every other
01:22:55.680 | batch.
01:22:56.680 | You actually need to use batch matrix multiply, which does a matrix multiply per batch.
01:23:03.840 | So that's something to be aware of there.
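A sketch of a batch-safe gram matrix along those lines; the normalization constant here is one common choice and may not match the notebook's exactly.

```python
import torch

def gram(x):
    # x: a batch of activations with shape (batch, channels, height, width).
    b, c, h, w = x.size()
    x = x.view(b, c, h * w)
    # torch.bmm does an independent matrix multiply for every item in the
    # batch, so each image's channels are only correlated with themselves,
    # not with the other images in the batch.
    return torch.bmm(x, x.transpose(1, 2)) / (c * h * w)
```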
01:23:06.960 | So I've got my gram matrices, I do my MSE loss between the gram matrices, I weight them
01:23:12.760 | by style weights, so I create that resnet, so I create my style, my combined loss, passing
01:23:20.240 | in the VGG network, passing in the block IDs, passing in the transformed starry night image,
01:23:29.180 | and so you'll see at the very start here I do a forward pass through my VGG model with
01:23:34.720 | that starry night image in order that I can save the features for it.
01:23:40.960 | Now notice it's really important now that I don't do any data augmentation because I've
01:23:46.040 | saved the style features for a particular non-augmented version, so if I augmented it
01:23:55.240 | it might make some minor problems, but that's fine because I've got all of ImageNet to deal
01:24:00.960 | with, I don't really need to do data augmentation anyway.
01:24:04.840 | Okay, so I've got my loss function and I can go ahead and fit, and there's really nothing
01:24:12.120 | clever here at all, at the end I have my sumLayers equals false so I can see what each part looks
01:24:19.360 | like and see that they're reasonably balanced, and I can finally pop it out.
01:24:27.500 | So I mentioned that should be pretty easy, and yet it took me about four days because
01:24:35.480 | I just found this incredibly fiddly to actually get it to work.
01:24:42.680 | So when I finally got up in the morning I said to Rachel, guess what, it trained
01:24:47.600 | correctly.
01:24:48.600 | Rachel was like, I never thought that was going to happen.
01:24:54.980 | It just looked awful all the time, and it was really about getting the exact right mix
01:25:00.040 | of content loss versus style loss, the mix of the layers of the style loss, and the worst
01:25:05.080 | part was it takes a really long time to train the damn CNN, and I didn't really know how
01:25:12.680 | long to train it before I decided it wasn't doing well, like should I just train it for
01:25:17.840 | longer or what?
01:25:22.320 | And I don't know, changing all the little details didn't seem to just slightly change it, it would totally
01:25:27.880 | fall apart all the time.
01:25:29.840 | So I kind of mentioned this partly to say just remember the final answer you see here
01:25:39.400 | is after me driving myself crazy all week, nearly always not working until finally at
01:25:45.240 | the last minute, it finally does, even for things which just seem like they couldn't
01:25:51.560 | possibly be difficult because they're just combining two things we already have working.
01:25:56.220 | The other is to be careful about how we interpret what authors claim.
01:26:11.280 | It was so fiddly getting this style transfer to work, and after doing it, it left me thinking,
01:26:20.640 | why did I bother? Because now I've got something that takes hours to create a network that
01:26:26.600 | can turn any kind of photo into one specific style.
01:26:31.640 | It just seems very unlikely I would want that for anything, like the only way I could
01:26:36.880 | think of that being useful would be to do some art stuff on a video to turn every frame into
01:26:43.400 | some style. It's an incredibly niche thing to do, but when I looked at the paper, the
01:26:51.480 | table was saying we're a thousand times faster than the Gatys approach, which is just such
01:26:59.880 | an obviously meaningless thing to say and such an incredibly misleading thing to say
01:27:07.040 | because it ignores all the hours of training for each individual style. I find this frustrating
01:27:14.380 | because groups like this Stanford group clearly know better, or ought to know better, but still
01:27:21.200 | I guess the academic community kind of encourages people to make these ridiculously grand claims.
01:27:29.280 | It also completely ignores this incredibly sensitive, fiddly training process.
01:27:40.880 | This paper was just so well-accepted when it came out. I remember everybody getting
01:27:45.800 | on Twitter and being like, wow, these Stanford people have found this way of doing style
01:27:50.240 | transfer a thousand times faster. And clearly, the people saying this were like all top researchers
01:27:58.600 | in the field, but clearly none of them actually understood it because nobody said, you know,
01:28:05.160 | I don't see why this is remotely useful and also I tried it and it was incredibly fiddly
01:28:09.400 | to get it all to work. And so it's not until like, what is this now, like 18 months later
01:28:14.720 | or something that I'm finally coming back to it and kind of thinking like, wait a minute,
01:28:19.320 | this is kind of stupid. So this is the answer I think to the question of why haven't people
01:28:26.280 | done follow-ups on this to like create really amazing best practices and better approaches
01:28:30.440 | like with a super resolution part of the paper? And I think the answer is because it's done.
01:28:36.760 | So I think this part of the paper is clearly not done, you know, and it's been improved
01:28:44.400 | and improved and improved and now we have great super resolution and I think we can
01:28:49.840 | derive from that great noise reduction, great colorization, great, you know, slant removal,
01:28:57.880 | great interactive artifact removal, whatever else. So I think there's a lot of really cool
01:29:06.280 | techniques here. It's also leveraging a lot of stuff that we've been learning and getting
01:29:10.560 | better and better at.
01:29:12.280 | Okay, so then finally let's talk about segmentation. This is from the famous CAMVID dataset which
01:29:20.240 | is a classic example of an academic segmentation dataset. And basically you can see what we
01:29:24.760 | do is we start with a picture, there are actually video frames in this dataset like here, and
01:29:30.920 | we construct, we have some labels where they're not actually colors, each one has an ID and
01:29:40.360 | the IDs are mapped colors, so like red might be one, purple might be two, like pink might
01:29:45.880 | be three. And so all the buildings, you know, one class or the cars or another class, all
01:29:54.760 | the people or another class, all the road is another class. And so what we're actually
01:29:59.560 | doing here is multi-class classification for every pixel, okay? And so you can see sometimes
01:30:07.720 | that multi-class classification really is quite tricky, you know, like these branches.
01:30:13.560 | Although sometimes the labels are really not that great, you know, this is very coarse,
01:30:19.000 | as you can see. So here are traffic lights and so forth. So that's what we're going to
01:30:25.920 | do. We're going to do, this is segmentation. And so it's a lot like bounding boxes, right?
01:30:32.160 | But rather than just finding a box around each thing, we're actually going to label
01:30:38.480 | every single pixel with its class. And really that's actually a lot easier because it fits
01:30:47.160 | our CNN style so nicely that we basically, we can create any CNN where the output is
01:30:54.240 | an n by m grid containing the integers from 0 to c where there are c categories, and then
01:31:02.240 | we can use cross-entropy loss with a softmax activation and we're done. So I could actually
01:31:07.920 | stop the class there and you can go and use exactly the approaches you've learned in like
01:31:12.320 | lessons 1 and 2 and you'll get a perfectly okay result. So the first thing to say is
01:31:18.800 | like this is not actually a terribly hard thing to do, but we're going to try and do
01:31:22.800 | it really well. And so let's start by doing it the really simple way. And we're going
01:31:30.320 | to use the Kaggle Carvana competition, so you Google Kaggle Carvana to find it. You
01:31:35.240 | can download it with the Kaggle API as per usual. And basically there's a train folder
01:31:40.880 | containing a bunch of images which is the independent variable and a train_masks folder
01:31:45.920 | that contains the dependent variable and they look like this. Here's one of the independent
01:31:50.760 | variable and here's one of the dependent variable.
01:31:59.280 | So in this case, just like cats and dogs, we're going simple. Rather than doing multi-class
01:32:04.960 | classification, we're going to do binary classification, but of course multi-class is just the more
01:32:10.040 | general version, you know, categorical cross-entropy or binary cross-entropy. So there's no difference
01:32:16.320 | conceptually. So the mask is just zeros and ones, whereas this is a regular image.
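To make the "classify every pixel" idea concrete, here is a tiny sketch of the per-pixel losses with made-up shapes: binary cross-entropy for a Carvana-style mask, categorical cross-entropy for a CamVid-style multi-class mask.

```python
import torch
import torch.nn.functional as F

# Binary case: one logit per pixel, target is a 0/1 mask of the same shape.
logits = torch.randn(4, 128, 128)
mask = torch.randint(0, 2, (4, 128, 128)).float()
bce = F.binary_cross_entropy_with_logits(logits, mask)

# Multi-class case: one logit per class per pixel; cross_entropy applies the
# softmax over the class dimension for every pixel.
multi_logits = torch.randn(4, 32, 128, 128)           # 32 classes, for example
labels = torch.randint(0, 32, (4, 128, 128))
ce = F.cross_entropy(multi_logits, labels)
```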
01:32:24.560 | So in order to do this well, it would really help to know what cars look like because really
01:32:31.400 | what we just want to do is figure out this is the car and this is its orientation and
01:32:35.560 | then put white pixels where we expect the car to be based on the picture and our understanding
01:32:41.440 | of what cars look like.
01:32:45.080 | The original data set came with these CSV files as well. I don't really use them for
01:32:49.640 | very much other than getting a list of images from them. Each image after the car ID has
01:33:02.760 | a 01, 02, et cetera of which I've printed out all 16 of them for one car and as you
01:33:08.680 | can see basically those numbers are the 16 orientations of one car. So there that is.
01:33:16.600 | I don't think anybody in this competition actually used this orientation information.
01:33:21.400 | I believe they all kept the car's images, just treated them separately. These images
01:33:28.160 | are pretty big, like over 1,000 by 1,000 in size and just opening the JPEGs and resizing
01:33:37.160 | them is slow. So I processed them all. Also OpenCV can't handle GIF files, so I converted
01:33:47.600 | them.
01:33:48.600 | Yes, Rachel?
01:33:49.600 | Question, how would somebody get these masks for training initially, Mechanical Turk or
01:33:53.400 | something?
01:33:54.400 | Yeah, just a lot of boring work. Probably some tools that help you with a bit of edge
01:34:03.360 | snapping and stuff so that the human can kind of do it roughly and then just fine-tune the
01:34:07.740 | bits that gets wrong.
01:34:14.120 | These kinds of labels are expensive. One of the things I really want to work on is deep
01:34:19.920 | learning enhanced interactive labeling tools because that's clearly something that would
01:34:27.760 | help a lot of people.
01:34:29.000 | I've got a little section here that you can run if you want to. You probably want to,
01:34:34.280 | which converts the GIFs into PNGs. So just open it up with a PIL and then save it as
01:34:40.160 | PNG because OpenCV doesn't have GIF support. And as per usual for this kind of stuff I
01:34:45.960 | do it with a thread pool so I can take advantage of parallel processing, and then also create
01:34:51.680 | a separate directory, train-128 and train-masks-128, which contains the 128x128 resized versions
01:34:58.800 | of them. And this is the kind of stuff that keeps you sane if you do it early in the process.
01:35:04.800 | So anytime you get a new data set, seriously think about creating a smaller version to
01:35:11.880 | make life fast. Anytime you find yourself waiting on your computer, try and think of
01:35:17.320 | a way to create a smaller version.
01:35:20.280 | So after you grab it from Kaggle you probably want to run this stuff, go away, have lunch,
01:35:24.080 | come back, and when you're done you'll have these smaller directories which we're going
01:35:28.680 | to use here, 128x128 pixel versions to start with.
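A minimal sketch of that preprocessing step; the directory names and the 128x128 size are assumptions rather than the notebook's exact ones.

```python
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor
from PIL import Image

PATH = Path('data/carvana')
(PATH / 'train_masks_png').mkdir(parents=True, exist_ok=True)
(PATH / 'train-128').mkdir(parents=True, exist_ok=True)

def gif_to_png(fn):
    # OpenCV can't read GIFs, so re-save each mask as a PNG with PIL.
    Image.open(fn).save(PATH / 'train_masks_png' / f'{fn.stem}.png')

def resize_128(fn):
    # Small copies make every later experiment much faster to iterate on.
    Image.open(fn).resize((128, 128)).save(PATH / 'train-128' / fn.name)

with ThreadPoolExecutor(8) as ex:   # thread pool for parallel I/O
    list(ex.map(gif_to_png, (PATH / 'train_masks').glob('*.gif')))
    list(ex.map(resize_128, (PATH / 'train').glob('*.jpg')))
```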
01:35:34.240 | So here's a cool trick, if you use the same axis object to plot an image twice, and the
01:35:42.280 | second time you use alpha, which as you might know means transparency in the computer vision
01:35:46.760 | world, then you can actually plot the mask over the top of the photo. And so here's a
01:35:53.240 | nice way to see all the masks on top of the photos for all of the cars in one group. This
01:35:59.520 | is the same matched files data set we've seen twice already, this is all the same code we
01:36:04.240 | used before, and here's something important though: if the model got good at an image in the training
01:36:10.520 | set, and then the validation set had that same image in it, that would kind of be cheating
01:36:17.320 | because it's the same car.
01:36:19.440 | So we use a contiguous set of car IDs, and since each set is a set of 16, we make sure
01:36:29.320 | it's evenly divisible by 16, so we make sure that our validation set contains different
01:36:35.160 | car IDs to our training set. This is the kind of stuff which you've got to be careful of.
01:36:41.120 | On Kaggle it's not so bad, you'll know about it because you'll submit your result and you'll
01:36:45.280 | get a very different result on your leaderboard compared to your validation set, but in the
01:36:51.160 | real world you won't know until you put it in production and send your company bankrupt
01:36:57.000 | and lose your job, so you might want to think carefully about your validation set.
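A rough sketch of such a grouped split, using fake Carvana-style filenames purely for illustration: as long as the validation block is a contiguous run whose length is a multiple of 16, no car appears in both sets.

```python
import numpy as np

# Fake filenames of the form '<car_id>_01.jpg' .. '<car_id>_16.jpg'.
fnames = sorted(f'car{c:03d}_{i:02d}.jpg' for c in range(318) for i in range(1, 17))
n_val_cars = 64
val_idx = np.arange(n_val_cars * 16)                 # first 64 whole cars
trn_idx = np.arange(n_val_cars * 16, len(fnames))    # everything else

val_ids = {fnames[i][:6] for i in val_idx}           # 'car000', 'car001', ...
trn_ids = {fnames[i][:6] for i in trn_idx}
assert not (val_ids & trn_ids)                       # no car leaks across the split
```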
01:37:04.760 | So here we're going to use transform_type.classification, it's basically the same as transform_type.pixel,
01:37:11.040 | but if you think about it, with the pixel version if we rotate a little bit, then we
01:37:16.040 | probably want to average the pixels in between the two, but for classification obviously
01:37:20.720 | we don't, we use nearest_neighbor, so there's a slight difference there. Also for classification,
01:37:27.240 | lighting doesn't kick in, normalization doesn't kick in to the dependent variable.
01:37:35.440 | These are already square images, so we don't have to do any cropping. So here you can see
01:37:43.360 | different versions of the augmented, you know, they're moving around a bit and they're rotating
01:37:47.560 | a bit and so forth.
01:37:52.040 | I get a lot of questions during our study group and stuff about how do I debug things
01:37:58.760 | and fix things that aren't working, and I never have a great answer other than every
01:38:04.760 | time I fix a problem it's because of stuff like this that I do all the time. I just always
01:38:11.720 | print out everything as I go and then the one thing that I screw up always turns out
01:38:17.880 | to be the one thing that I forgot to check along the way. The more of this kind of thing
01:38:22.680 | you can do the better. If you're not looking at all of your intermediate results you're
01:38:26.600 | going to have troubles.
01:38:30.800 | So given that we want something that knows what cars look like, we probably want to start
01:38:36.120 | with a pre-trained ImageNet network. So we're going to start with ResNet34 and so with ConvNetBuilder
01:38:44.360 | we can grab our ResNet34 and we can add a custom head. And so the custom head is going
01:38:50.640 | to be something that upsamples a bunch of times. And we're going to do things really
01:38:55.680 | dumb for now. We're just going to do conv transpose 2D, batch norm, ReLU. This is what I'm saying.
01:39:07.480 | Any of you could have built this without looking at any of this notebook, or at least you have
01:39:14.040 | the information from previous classes. There's nothing new at all.
01:39:19.800 | And so at the very end we have a single filter. And now that's going to give us something
01:39:29.040 | which is batch size by 1, by 128, by 128. But we want something which is batch size
01:39:36.200 | by 128 by 128. So we have to remove that unit axis. So I've got a lambda layer here. Lambda
01:39:42.560 | layers are incredibly helpful, because without the lambda layer here, which is simply removing
01:39:48.320 | that unit axis by just indexing into it at zero, without the lambda layer I would have
01:39:53.840 | to have created a custom class with a custom forward method and so forth. But by creating
01:40:00.520 | a lambda layer that does like the one custom bit, I can now just chuck it in the sequential.
01:40:05.280 | And so that just makes life easier.
01:40:07.440 | So the PyTorch people are kind of snooty about this approach. Lambda layer is actually something
01:40:13.880 | that's part of the fast AI library, not part of the PyTorch library. And literally people
01:40:18.760 | on the PyTorch discussion board are like, yes, we could give people this, yes, it is
01:40:24.880 | only a single line of code, but then it would encourage them to use sequential too often.
01:40:30.360 | So there you go.
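For reference, here is a sketch of a Lambda layer and a "really dumb" upsampling head along these lines; the channel counts are illustrative guesses, not the notebook's exact values.

```python
import torch.nn as nn

class Lambda(nn.Module):
    "Wrap a plain function as a layer so it can sit inside nn.Sequential."
    def __init__(self, f):
        super().__init__()
        self.f = f
    def forward(self, x):
        return self.f(x)

def upsample(ni, nf):
    # A stride-2 ConvTranspose2d doubles the grid size each time.
    return nn.Sequential(nn.ConvTranspose2d(ni, nf, 2, stride=2),
                         nn.BatchNorm2d(nf), nn.ReLU(inplace=True))

# Five doublings take the ResNet34 backbone's 4x4x512 output (for 128x128
# inputs) back up to 128x128; a single-filter conv then a Lambda drops the
# unit channel axis, giving (batch, 128, 128) logits.
head = nn.Sequential(
    upsample(512, 256), upsample(256, 256), upsample(256, 256),
    upsample(256, 256), upsample(256, 256),
    nn.Conv2d(256, 1, 1),
    Lambda(lambda x: x[:, 0]),
)
```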
01:40:37.240 | So this is our custom head. So we're going to have a ResNet34 that downsamples and
01:40:41.800 | then a really simple custom head that very quickly upsamples and that hopefully will
01:40:46.520 | do something. And we're going to use accuracy with a threshold of 0.5 to print out metrics.
01:40:52.800 | And so after a few epochs we've got 96% accurate.
01:40:56.520 | So is that good? Is 96% accuracy good? And hopefully the answer to your question is it
01:41:04.520 | depends. What's it for? And the answer is Carvana wanted this because they wanted to be able
01:41:11.500 | to take their car images and cut them out and paste them on exotic Monte Carlo backgrounds
01:41:21.620 | or whatever. That's Monte Carlo the place, not the simulation.
01:41:27.520 | So to do that, you need a really good mask. You don't want to leave the rearview mirrors
01:41:34.720 | behind or have one wheel missing or include background or something that would look stupid.
01:41:43.500 | So you would need something very good. So only having 96% of the pixels correct doesn't
01:41:48.760 | sound great, but we won't really know until we look at it. So let's look at it.
01:41:55.300 | So there's the correct version that we want to cut out. That's the 96% accurate version.
01:42:03.400 | So when you look at it, you realize, oh yeah, getting 96% of the pixels accurate is actually
01:42:09.480 | easy because all the outside bits are not car and all the inside bits are car and really
01:42:14.000 | the interesting bit is the edge. So we need to do better.
01:42:20.120 | So let's unfreeze because all we've done so far is train the custom head. And let's do
01:42:25.400 | more. And so after a bit more we've got 99.1%. So is that good? I don't know. Let's take
01:42:33.080 | a look. And so actually no, it's totally missed the rearview mirror here and missed
01:42:41.920 | a lot of it here and it's clearly got an edge wrong here and these things are totally going
01:42:46.520 | to matter when we try to cut it out. So it's still not good enough. So let's try upscaling.
01:42:52.400 | And the nice thing is that when we upscale to 512x512, make sure you decrease the batch
01:42:56.360 | size because you'll run out of memory. Here's the true ones. This is all identical. There's
01:43:05.960 | quite a lot more information there for it to go on. So our accuracy increases to 99.4%
01:43:11.560 | and things keep getting better. But we've still got quite a few little black blocky bits.
01:43:17.360 | So let's go to 1024x1024, down to a batch size of 4. This is pretty high res now. And train
01:43:24.480 | a bit more, 99.6, 99.8. And so now if we look at the masks, they're actually looking not
01:43:37.080 | bad. That's looking pretty good. So can we do better? And the answer is yes we can. So
01:43:47.680 | we're moving from the Carvana notebook to the Carvana UNet notebook now. And the UNet
01:43:52.080 | network is quite magnificent. You see, with that previous approach, our pre-trained ImageNet
01:44:00.000 | network was being squished down all the way down to 7x7 and then expanded out all the way
01:44:05.360 | back up to, well it's 224 and then expanded out again all this way, which means it has
01:44:15.640 | to somehow store all the information about the much bigger version in the small version.
01:44:21.860 | And actually most of the information about the bigger version was really in the original
01:44:26.040 | picture anyway. So it doesn't seem like a great approach, this squishing and unsquishing.
01:44:33.360 | So the UNet idea comes from this fantastic paper where it was literally invented in this
01:44:41.280 | very domain-specific area of biomedical image segmentation. But in fact, basically every
01:44:46.680 | Kaggle winner in anything even vaguely related to segmentation has ended up using UNet. It's
01:44:53.960 | one of these things that like everybody in Kaggle knows is the best practice, but in
01:44:57.880 | more of academic circles, like even now, this has been around for a couple of years at least,
01:45:03.560 | a lot of people still don't realize. This is by far the best approach.
01:45:11.200 | And here's the basic idea. Here's the downward path where we basically start at 572x572 in
01:45:22.240 | this case and then kind of halve the grid size, halve the grid size, halve the grid size, halve
01:45:26.420 | the grid size. And then here's the upward path where we double the grid size, double-double-double-double.
01:45:36.160 | But the thing that we also do is we take at every point where we've halved the grid size,
01:45:44.600 | we actually copy those activations over to the upward path and concatenate them together.
01:45:53.780 | And so you can see here these red blobs are max pooling operations, the green blobs are
01:45:59.720 | upward sampling, and then these gray bits here are copying. So we copy and concat. So
01:46:08.260 | basically in other words, the input image after a couple of columns is copied over to
01:46:14.160 | the output, concatenated together, and so now we get to use all of the information that's
01:46:20.600 | gone through all the down and all the up, plus also a slightly modified version of the
01:46:24.840 | input pixels, and a slightly modified version of one thing down from the input pixels because
01:46:30.640 | they came out through here. So we have like all of the richness of going all the way down
01:46:36.720 | and up, but also like a slightly less coarse version and a slightly less coarse version
01:46:41.960 | and then this really kind of simple version and they can all be combined together. And
01:46:47.320 | so that's UNet, such a cool idea. So here we are in the Carvana UNet notebook, all this
01:46:55.320 | is the same code as before. And at the start I've got a simple upsample version just to
01:47:05.320 | kind of show you again the non-UNET version. This time I'm going to add in something called
01:47:10.120 | the dice metric. Dice is very similar, as you see, to Jaccard, or IoU. It's just a
01:47:18.360 | minor difference, it's basically intersection over union with a minor tweak. And the reason
01:47:27.800 | we're going to use dice is that's the metric that the Kaggle competition used. And it's
01:47:34.560 | a little bit harder to get a high dice score than a high accuracy because it's really looking
01:47:39.480 | at like what the overlap of the correct pixels are with your pixels. But it's pretty similar.
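A simple version of such a dice metric is sketched below; the notebook's exact implementation (for example whether it averages per image) may differ.

```python
import torch

def dice(pred, targ, thresh=0.5, eps=1e-7):
    # pred: sigmoid outputs, targ: 0/1 ground-truth mask, both (batch, h, w).
    # Dice = 2*|A ∩ B| / (|A| + |B|); IoU is |A ∩ B| / |A ∪ B|.
    p = (pred > thresh).float()
    t = targ.float()
    intersection = (p * t).sum()
    return (2 * intersection / (p.sum() + t.sum() + eps)).item()
```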
01:47:46.960 | So in the Kaggle competition, people that were doing okay were getting about 99.6 dice
01:47:53.320 | and the winners were about 99.7 dice. So here's our standard upsample, this is all as before.
01:48:01.440 | And so now we can check our dice metric. And so you can see on dice metric we're getting
01:48:06.960 | like 96.8 at 128x128. And so that's not great. So let's try UNet. And I'm calling it UNet-ish
01:48:20.000 | because as per usual I'm creating my own somewhat hacky version, kind of trying to keep things
01:48:26.200 | similar to what you're used to as possible and doing things that I think make sense.
01:48:31.840 | And so there should be plenty of opportunity for you to at least make this more authentically
01:48:36.600 | UNET by looking at the exact kind of grid sizes. And like see how here the size is going
01:48:42.640 | down a little bit, so they're obviously not adding any padding, and then
01:48:47.960 | here they've got some cropping going on. There's a few differences. But one of the things is
01:48:54.920 | because I want to take advantage of transfer learning, that means I can't quite use UNET.
01:49:00.640 | So here's another big opportunity is what if you create the UNET downpath and then add
01:49:10.120 | a classifier on the end and then train that on ImageNet. And you've now got an ImageNet
01:49:16.800 | trained classifier which is specifically designed to be a good backbone for UNET. And then you
01:49:23.560 | should be able to now come back and get pretty close to winning this old competition. Because
01:49:34.040 | that pre-trained network didn't exist before. But if you think about what YOLOv3 did, it's
01:49:41.040 | basically that. They created DarkNet, they pre-trained it on ImageNet and then they used
01:49:45.840 | it as the basis for their founding boxes. So again, this kind of idea of pre-training things
01:49:55.200 | which are designed not just for classification but for other things is just something that
01:50:00.960 | nobody's done yet. But as we've shown, you can train ImageNet for 25 bucks in 3 hours.
01:50:15.720 | So and if people in the community are interested in doing this, hopefully I'll have credits
01:50:21.400 | I can help you with as well. So if you do the work to get it set up and give me a script,
01:50:25.720 | I can probably run it for you.
01:50:30.320 | So for now though, we don't have that. So we're going to use ResNet. So we're basically
01:50:38.800 | going to start with this, let's see, with getBase. And so base is our base network and that was
01:50:47.760 | defined back up in this first section. So getBase is going to be something that calls whatever
01:50:53.920 | this is and this is ResNet 34. So we're going to grab our ResNet 34 and cutModel is the
01:50:59.640 | first thing that our ConvNet builder does. It basically removes everything from the adaptive
01:51:04.400 | pooling onwards and so that gives us back the backbone of ResNet 34. So getBase is going
01:51:10.860 | to give us back our ResNet 34 backbone.
01:51:17.960 | And then we're going to take that ResNet 34 backbone and turn it into a UNet 34. So what
01:51:25.520 | that's going to do is it's going to save that ResNet that we passed in and then we're going
01:51:33.200 | to use a forward hook, just like before, to save the results at the second, fourth, fifth
01:51:38.440 | and sixth blocks, which as before is basically before each stride 2 convolution.
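A sketch of that hook pattern is below; the cut point and block indices follow the description above but should be treated as a reconstruction rather than the notebook's exact code.

```python
import torch.nn as nn
from torchvision.models import resnet34

class SaveFeatures():
    "Forward hook that stashes a module's output so the up path can reuse it."
    def __init__(self, m):
        self.hook = m.register_forward_hook(self.hook_fn)
    def hook_fn(self, module, inp, outp):
        self.features = outp
    def remove(self):
        self.hook.remove()

# Cut the ResNet34 at the adaptive pooling to get the backbone, then hook the
# blocks whose outputs sit just before each stride-2 convolution.
rn = nn.Sequential(*list(resnet34(pretrained=True).children())[:8])
sfs = [SaveFeatures(rn[i]) for i in [2, 4, 5, 6]]
```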
01:51:45.600 | Then we're going to create a bunch of these things we're calling UNet blocks. And the
01:51:50.200 | UNet blocks are these things here.
01:51:57.720 | So for the UNet block, we have to tell it how many things are coming from the kind
01:52:04.640 | of previous layer that we're upsampling, how many are coming across, and then how many
01:52:10.400 | do we want to come out. And so the amount coming across is entirely defined by whatever the
01:52:20.440 | base network was. Whatever the downward path was, we need that many layers.
01:52:28.360 | And so this is a little bit awkward. And actually one of our master's students here, Karim, has
01:52:33.800 | actually created something called dynamic unet that you'll find in fastai.unet.dynamic_unet.
01:52:41.960 | And it actually calculates this all for you and automatically creates the whole UNet from
01:52:46.760 | your base model. It's got some minor quirks still that I want to fix. By the time the
01:52:52.480 | video is out, it'll definitely be working and I will at least have a notebook showing
01:52:57.820 | how to use it and possibly an additional video. But for now, you'll just have to go through
01:53:04.640 | and do it yourself. You can easily see it just by once you've got a resnet, you can
01:53:08.960 | just go type in its name and it'll print out all the layers and you can see how many activations
01:53:16.080 | there are in each block. Or you could even have it printed out for you for each block
01:53:24.720 | automatically.
01:53:25.720 | Anyway, I just did this manually. And so the UNet block works like this. So you say, "Okay,
01:53:35.400 | I've got this many coming up from the previous layer, I've got this many coming across this
01:53:39.240 | x." I'm using "across" to mean from the downward path. This is the amount I want coming out.
01:53:45.440 | Now what I do is I then say, "Okay, we're going to create a certain amount of convolutions
01:53:50.880 | from the upward path and a certain amount from the cross path and so I'm going to be
01:53:55.120 | concatenating them together. So let's divide the number we want out by 2. And so we're
01:54:01.660 | going to have our cross convolution take our cross path and create number out divided by
01:54:08.520 | 2. And then the upward path is going to be a conv transpose 2D because we want to increase
01:54:16.400 | the size, to upsample. And again, here we've got the number out divided by 2. And then at the end, I just
01:54:23.200 | concatenate those together. So I've got an upward sample, I've got a cross convolution,
01:54:30.480 | I concatenate the two together.
01:54:33.080 | And so that's all a unit block is. And so that's actually a pretty easy module to create.
01:54:40.960 | And so then in my forward path, I need to pass to the forward of the unit block the
01:54:47.800 | upward path and the cross path. So the upward path is just wherever I'm up to so far. But
01:54:55.160 | then the cross path is whatever the value is of whatever the activations are that I
01:55:01.240 | stored on the way down.
01:55:04.600 | So as I come up, it's the last set of saved features that I need first. And as I gradually
01:55:09.900 | keep going up further and further and further, eventually it's the first set of features.
01:55:16.700 | And so there are some more tricks we can do to make this a little bit better, but this
01:55:21.640 | is a good start.
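Putting those pieces together, here is a sketch of a UNet block along the lines just described; treat it as a reconstruction from the description rather than the notebook's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnetBlock(nn.Module):
    """`up_in` channels arrive from the layer below, `x_in` channels come
    across from the saved downward-path activations, and `n_out` channels
    leave the block, half from each path, concatenated."""
    def __init__(self, up_in, x_in, n_out):
        super().__init__()
        up_out = x_out = n_out // 2
        self.x_conv = nn.Conv2d(x_in, x_out, 1)                        # cross path
        self.tr_conv = nn.ConvTranspose2d(up_in, up_out, 2, stride=2)  # upsample path
        self.bn = nn.BatchNorm2d(n_out)

    def forward(self, up_p, x_p):
        up_p = self.tr_conv(up_p)                # double the grid size
        x_p = self.x_conv(x_p)                   # squeeze the saved activations
        cat_p = torch.cat([up_p, x_p], dim=1)    # concatenate the two paths
        return self.bn(F.relu(cat_p))
```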
01:55:24.880 | So the simple upsampling approach looked horrible and had a dice of 96.8. A UNet with everything
01:55:35.280 | else identical, except we've now got these UNet blocks, has a dice of 98.5. So that's
01:55:44.600 | like we've kind of halved the error with everything else exactly the same. And more to the point,
01:55:51.720 | you can look at it. This is actually looking somewhat car-like compared to our non-unet
01:55:57.360 | equivalent, which is just a blob. Because trying to do this through down and up paths,
01:56:04.920 | it's just asking too much. Whereas when we actually provide the downward path pixels
01:56:12.300 | at every point, it can actually start to create something car-ish.
01:56:16.600 | So at the end of that, we'll go .close to again remove those SFS features that are taking
01:56:24.560 | up GPU memory, go to a smaller batch size, a higher size, and you can see the dice coefficient
01:56:31.880 | is really going up. So notice here I'm loading in the 128x128 version of the network. So we're
01:56:42.120 | doing this progressive resizing trick again. So that gets us 99.3, and then unfreeze to
01:56:48.160 | get to 99.4. And you can see it's now looking pretty good. Go down to a batch size of 4,
01:56:57.200 | size of 1024, load in what we just did with the 512, takes us to 99.5, unfreeze, takes
01:57:07.760 | us to 99. And as you can see, that actually looks good. Accuracy terms, 99.82. You can
01:57:26.360 | see this is looking like something you could just about use to cut out. I think at this
01:57:33.600 | point there's a couple of minor tweaks we can do to get up to 99.7, but really the key thing
01:57:40.200 | then I think is just maybe to do a little bit of smoothing maybe, or a little bit of
01:57:45.920 | post-processing. You can go and have a look at the Carvana winner's blogs and see some
01:57:53.560 | of these tricks. But as I say, the difference between where we're at 99.6 and what the winner's
01:57:59.840 | got of 99.7 is not heaps. And so really the UNet on its own pretty much solves that problem.
01:58:15.400 | Okay so that's it. The last thing I wanted to mention is now to come all the way back
01:58:21.160 | to bounding boxes. Because you might remember I said our bounding box model was still not
01:58:28.880 | doing very well on small objects, so hopefully you might be able to guess where I'm going
01:58:34.800 | to go with this. Which is that for the bounding box model, remember how we had at different
01:58:44.360 | grid cells, we spat out outputs of our model, and it was those earlier ones with the small
01:58:54.200 | grid sizes that weren't very good. How do we fix it? Unet it. Let's have an upward path
01:59:03.520 | with cross-connections. And so then we're just going to do a unet and then spit them
01:59:10.120 | out of that. Because now those finer grid cells have all of the information of that path and
01:59:17.960 | that path and that path and that path to leverage. Now of course, this is deep learning, so that
01:59:25.600 | means you can't write a paper saying we just used unet for bounding boxes. You have to
01:59:32.080 | invent a new word. So this is called feature pyramid networks, or FPNs. And literally this
01:59:42.040 | is part of the RetinaNet paper; it's used in the RetinaNet paper, but it was created
01:59:49.200 | in earlier papers specifically about FPNs. If memory serves correctly, they did briefly
01:59:55.200 | cite the UNet paper, but they kind of made it sound like it was this vaguely slightly
02:00:01.800 | connected thing that maybe some people could consider slightly useful. But really, FPNs
02:00:09.040 | are UNets. I don't have an implementation of it to show you, but it'll be a fun thing maybe
02:00:17.360 | for some of us to try. I know some of the students have been trying to get it working
02:00:24.400 | well on the forums. Interesting thing to try. So I think a couple of things to look at after
02:00:32.400 | this class, as well as the other things I mentioned, would be playing around with FPMs and also
02:00:39.560 | maybe trying Karim's dynamic unet. They would both be interesting things to look at.
02:00:46.360 | So you guys have all been through 14 lessons of me talking at you now, so I'm sorry about
02:00:53.400 | that. Thanks for putting up with me. You're going to find it hard to find people who actually
02:01:05.880 | know as much about training neural networks in practice as you do. It'll be really easy
02:01:12.360 | for you to overestimate how capable all these other people are and underestimate how capable
02:01:19.400 | you are. The main thing to say is please practice. Please, just because you don't have this constant
02:01:28.920 | thing getting you to come back here every Monday night now, it's very easy to kind of
02:01:34.700 | lose that momentum. So find ways to keep it, organize a study group or a book reading group
02:01:45.040 | or get together with some friends and work on a project. Do something more than just
02:01:52.740 | deciding I want to keep working on X. Unless you're the kind of person who's super motivated
02:01:59.320 | and you know that whenever you decide to do something, it happens, that's not me. For me to make
02:02:06.360 | something happen, I have to say, "Yes, David, in October I will absolutely teach
02:02:11.360 | that course." And then it's like, "Okay, I better actually write some material." That's
02:02:17.640 | the only way I can get stuff to happen. We've got a great community there on the forums.
02:02:22.160 | If people have ideas for ways to make it better, please tell me. If you think you can help
02:02:27.240 | with, if you want to create some new forum or moderate it in some different way or whatever,
02:02:32.980 | just let me know. You can always PM me. There's a lot of projects going on through GitHub
02:02:39.200 | as well, lots of stuff. I hope to see you all back here at Something Else. Thanks so much
02:02:44.560 | for joining me on this journey.
02:02:45.800 | [APPLAUSE]