Lesson 7: Deep Learning 2019 - Resnets from scratch; U-net; Generative (adversarial) networks
Chapters
8:23 add a bit of random padding
11:01 start out creating a simple cnn
27:12 create your own variations of resnet blocks
56:43 create a generator learner
72:46 print out a sample after every epoch
85:17 using the pre-trained model
116:33 add skip connections
00:00:00.000 |
Welcome to lesson seven, the last lesson of part one. 00:00:13.440 |
And so don't let that bother you, because partly what I want to do is to kind of give 00:00:18.520 |
you enough things to think about to keep you busy until part two. 00:00:24.200 |
And so, in fact, some of the things we cover today, I'm not going to tell you about some 00:00:28.720 |
of the details, I'll just point out a few things where I'll say like, okay, that we're 00:00:31.840 |
not talking about yet, that we're not talking about that. 00:00:34.360 |
And so then come back in part two to get the details on some of these extra pieces. 00:00:44.900 |
I'm going to go through things pretty quickly; it might require a few viewings to fully understand it all, and a few experiments. 00:00:52.280 |
I'm going to give you stuff to keep you amused for a couple of months. 00:00:59.520 |
Wanted to start by showing some cool work done by a couple of students, Reshma and Npata01, 00:01:07.520 |
who have developed an Android and an iOS app. 00:01:12.120 |
And so check out Reshma's post on the forum about that, because they have a demonstration 00:01:17.760 |
of how to create both Android and iOS apps that are actually on the Play Store and on 00:01:27.480 |
the App Store; they're the first ones I know of that are using fast AI. 00:01:32.000 |
And let me also say a huge thank you to Reshma for all of the work she does, both for the 00:01:36.180 |
fast AI community and the machine learning community more generally, and also for women in the machine learning community. 00:01:43.560 |
She does a lot of fantastic work, including providing lots of fantastic documentation 00:01:49.440 |
and tutorials and community organizing and so many other things. 00:01:53.340 |
So thank you, Reshma, and congrats on getting this app out there. 00:02:04.200 |
We have lots of Lesson 7 notebooks today, as you see, and we're going to start with 00:02:13.740 |
So the first notebook we're going to look at is Lesson 7 ResNet MNIST. 00:02:18.480 |
And what I want to do is look at some of the stuff we started talking about last week around 00:02:23.560 |
convolutions and convolutional neural networks and start building on top of them to create 00:02:28.600 |
a fairly modern deep learning architecture, largely from scratch. 00:02:34.000 |
When I say from scratch, I'm not going to re-implement things we already know how to 00:02:37.000 |
implement but kind of use the pre-existing PyTorch bits of those. 00:02:42.640 |
So we're going to use the MNIST dataset, which -- so urls.mnist has the whole MNIST dataset. 00:02:52.540 |
So in there, there's a training folder and a testing folder. 00:02:57.440 |
And as I read this in, I'm going to show some more details about pieces of the data block 00:03:02.000 |
API, so that you can see what's going on. 00:03:05.540 |
So far with the data block API, we've kind of said blah, blah, blah, blah, blah, and done it all in one go. 00:03:12.320 |
So the first thing you say is what kind of item list do you have? 00:03:16.640 |
So in this case, it's an item list of images. 00:03:19.680 |
And then where are you getting the list of file names from? 00:03:22.520 |
In this case, by looking in a folder recursively. 00:03:28.880 |
You can pass in arguments that end up going to Pillow, because Pillow, or PIL, is the thing that fastai uses to open the images. 00:03:35.680 |
And in this case, these are black and white rather than RGB. 00:03:39.480 |
So you have to use Pillow's convert_mode='L'. For more details, refer to the Python Imaging 00:03:45.360 |
Library documentation to see what their convert modes are. 00:03:48.960 |
But this one is going to be grayscale, which is what MNIST is. 00:03:54.840 |
So inside an item list is an items attribute. 00:03:58.880 |
And the items attribute is kind of the thing that you gave it. 00:04:02.920 |
It's the thing that it's going to use to create your items. 00:04:05.480 |
So in this case, the thing you gave it really is a list of file names. 00:04:13.600 |
When you show images, normally it shows them in RGB. 00:04:17.960 |
And so in this case, we want to use a binary color map. 00:04:20.360 |
So in FastAI, you can set a default color map. 00:04:23.240 |
For more information about cmap and color maps, refer to the matplotlib documentation. 00:04:28.380 |
And so this will set the default color map for FastAI. 00:04:33.280 |
So our image item list contains 70,000 items. 00:04:36.400 |
And it's a bunch of images that are 1 by 28 by 28. 00:04:45.200 |
You might think, why aren't they just 28 by 28 matrices rather than a 1 by 28 by 28 rank 3 tensor? 00:04:54.800 |
All the conv2d stuff and so forth works on rank 3 tensors. 00:05:00.320 |
So you want to include that unit axis at the start. 00:05:04.240 |
And so FastAI will do that for you even when it's reading 1 channel images. 00:05:11.100 |
So the .items attribute contains the thing that's read to build the image, which in this case is the file name. 00:05:18.720 |
But if you just index into an item list directly, you'll get the actual image object. 00:05:23.400 |
And so the actual image object has a show method. 00:05:28.120 |
So once you've got an image item list, you then split it into training versus validation. 00:05:35.420 |
If you don't want a validation set, you can actually use the .no_split method to create a kind of empty one. 00:05:50.440 |
First create your item list, then decide how to split. 00:05:53.440 |
In this case, we're going to do it based on folders. 00:05:58.000 |
In this case, the validation folder for MNIST is called testing. 00:06:04.520 |
So in FastAI parlance, we use the same kind of parlance that Kaggle does, which is: 00:06:11.680 |
the validation set has labels, and you use it for testing that your model's working. 00:06:19.520 |
The test set doesn't have labels; you use it for doing inference, or submitting to a competition, or sending it off to somebody 00:06:25.060 |
who's held out those labels for testing or whatever. 00:06:29.880 |
So just because a folder in your data set is called testing doesn't mean it's a test 00:06:35.120 |
This one has labels, so it's a validation set. 00:06:39.520 |
So if you want to do inference on lots of things at a time rather than one thing at 00:06:43.720 |
a time, you want to use the test= parameter in FastAI to say this is stuff which has no labels. 00:06:56.480 |
My split data is a training set and a validation set, as you can see. 00:07:02.680 |
So inside the training set, there's a folder for each class. 00:07:09.140 |
So now we can take that split data and say label from folder. 00:07:13.920 |
So first you create the item list, then you split it, then you label it. 00:07:18.200 |
And so you can see now we have an X and a Y, and the Y are category objects. 00:07:30.280 |
So if you index into a label list, such as ll.train as a label list, you will get back 00:07:39.760 |
an independent variable and a dependent variable, X and Y. 00:07:42.520 |
So in this case, the X will be an image object, which I can show, and the Y will be a category object. 00:07:51.360 |
That's the number eight category, and there's the eight. 00:07:59.840 |
In this case, we're not going to use the normal get transforms function because we're doing 00:08:05.360 |
digit recognition, and digit recognition, you wouldn't want to flip it left or right, 00:08:11.480 |
You wouldn't want to rotate it too much, that would change the meaning of it. 00:08:14.800 |
Also because these images are so small, doing zooms and stuff is going to make them so fuzzy as to not be useful. 00:08:20.280 |
So normally for small images of digits like this, you just add a bit of random padding. 00:08:25.940 |
So I'll use the random padding function, which actually returns two transforms, the bit that 00:08:32.240 |
does the padding and the bit that does the random crop. 00:08:34.480 |
So you have to use star to, say, put both these transforms in this list. 00:08:41.920 |
This empty array here is referring to the validation set transforms. 00:08:53.320 |
We can pick a batch size and choose data bunch. 00:08:59.440 |
In this case, we're not using a pre-trained model, so there's no reason to use image net 00:09:05.960 |
And so if you call normalize like this, without passing in stats, it will grab a batch of 00:09:13.200 |
data at random and use that to decide what normalization stats to use. 00:09:18.500 |
That's a good idea if you're not using a pre-trained model. 00:09:25.840 |
And so in that data bunch is a data set, which we've seen already. 00:09:33.880 |
But what is interesting is that the training data set now has data augmentation because 00:09:39.220 |
So plot multi is a fast AI function that will plot the result of calling some function for 00:09:47.960 |
So in this case, my function is just grab the first image from the training set. 00:09:53.160 |
And because each time you grab something from the training set, it's going to load it from 00:09:56.640 |
disk and it's going to transform it on the fly. 00:10:01.320 |
So people sometimes ask like, how many transformed versions of the image do you create? 00:10:09.480 |
Each time we grab one thing from the data set, we do a random transform on the fly. 00:10:15.320 |
So potentially everyone will look a little bit different. 00:10:19.080 |
So you can see here, if we plot the result of that lots of times, we get eights in slightly 00:10:24.280 |
different positions because we did random padding. 00:10:28.680 |
You can always grab a batch of data then from the data bunch. 00:10:33.000 |
Because remember, a data bunch has data loaders, and data loaders are things that you grab 00:10:39.580 |
And so you can then grab an X batch and a Y batch, look at their shape: batch size by 1 by 28 by 28. 00:10:47.560 |
All fast AI data bunches have a show batch, which will show you what's in it in some sensible 00:10:55.960 |
Okay, so that's a quick walkthrough of the data block API stuff to grab our data. 00:11:02.120 |
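For reference, here is a rough sketch of that whole data block pipeline in one place, written against the fastai v1 API used in the lesson; the variable names and the exact transform arguments are assumptions, so check the lesson notebook for the real values.

```python
from fastai.vision import *

# build the item list from the MNIST folder (grayscale, so convert_mode='L')
il = ImageItemList.from_folder(path, convert_mode='L')
defaults.cmap = 'binary'                     # default colormap for showing 1-channel images

sd = il.split_by_folder(train='training', valid='testing')   # "testing" is really a validation set
ll = sd.label_from_folder()                                  # one folder per class

tfms = ([*rand_pad(padding=3, size=28, mode='zeros')], [])   # random pad + crop for train, nothing for valid
ll = ll.transform(tfms)

data = ll.databunch(bs=128).normalize()      # no stats passed: normalizes from a random batch
```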
So let's start out creating a simple CNN, a simple convnet. 00:11:12.960 |
So let's define -- I like to define when I'm creating architectures a function which kind 00:11:18.700 |
of does the things that I do again and again and again. 00:11:20.880 |
I don't want to have to type out the same arguments each time, because I'll forget, I'll make a mistake. 00:11:24.520 |
So in this case, all of my convolutions are going to be kernel size three, stride two, padding one. 00:11:31.480 |
So let's just create a simple function to do a conv with those parameters. 00:11:34.720 |
So each time I have a convolution, it's skipping over one pixel; it's jumping two pixels at a time. 00:11:45.240 |
So that means that each time we have a convolution, it's going to halve the grid size. 00:11:50.100 |
So I've put a comment here showing what the new grid size is after each one. 00:11:54.960 |
So after the first convolution, we have one channel coming in, because remember it's a grayscale image. 00:12:02.440 |
And then how many channels coming out, whatever you like, right? 00:12:06.320 |
So remember you always get to pick how many filters you create regardless of whether it's 00:12:11.560 |
a fully connected layer, in which case it's just the width of the matrix you're multiplying 00:12:16.760 |
by, or in this case with a 2D conv, it's just how many filters do you want. 00:12:26.520 |
So the 28 by 28 image is now a 14 by 14 feature map with eight channels. 00:12:33.460 |
So specifically, therefore, it's an eight by 14 by 14 tensor of activations. 00:12:41.280 |
Then we'll do batch norm, then we'll do relu. 00:12:43.700 |
So the number of input filters to the next conv has to equal the number of output filters 00:12:47.960 |
from the previous conv, and we can just keep increasing the number of channels. 00:12:53.360 |
Because we're doing stride two, it's going to keep decreasing the grid size. 00:12:58.040 |
Notice here it goes from seven to four, because if you're doing a stride two conv over seven, 00:13:04.260 |
it's going to be kind of math.ceiling of seven divided by two. 00:13:11.600 |
Batch norm, relu, conv, we're now down to two by two. 00:13:15.200 |
Batch norm, relu, conv, we're now down to one by one. 00:13:18.960 |
So after this, we have a feature map of, let's see, ten by one by one. 00:13:40.240 |
It's a rank three tensor of ten by one by one. 00:13:46.120 |
So our loss functions expect generally a vector, not a rank three tensor. 00:13:51.800 |
So you can chuck flatten at the end, and flatten just means remove any unit axes. 00:13:59.360 |
So that will make it now just a vector of length ten, which is what we always expect. 00:14:10.260 |
So then we can return that into a learner by passing in the data and the model and the 00:14:14.960 |
loss function and, if optionally, some metrics. 00:14:19.260 |
So we're going to use cross entropy as usual. 00:14:22.240 |
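As a rough sketch (closely following the lesson notebook, though the exact channel counts are from memory and should be treated as assumptions), the model and learner look something like this:

```python
from fastai.vision import *

def conv(ni, nf):
    # kernel size 3, stride 2, padding 1: halves the grid size each time
    return nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)

model = nn.Sequential(
    conv(1, 8),                      # 28x28 -> 14x14
    nn.BatchNorm2d(8), nn.ReLU(),
    conv(8, 16),                     # 7x7
    nn.BatchNorm2d(16), nn.ReLU(),
    conv(16, 32),                    # 4x4
    nn.BatchNorm2d(32), nn.ReLU(),
    conv(32, 16),                    # 2x2
    nn.BatchNorm2d(16), nn.ReLU(),
    conv(16, 10),                    # 1x1
    nn.BatchNorm2d(10),
    Flatten()                        # 10x1x1 -> vector of length 10
)

learn = Learner(data, model, loss_func=nn.CrossEntropyLoss(), metrics=accuracy)
```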
So we can then call learn.summary and confirm. 00:14:24.700 |
After that first conv, we're down to 14 by 14. 00:14:28.480 |
And after the second conv, 7 by 7 and 4 by 4, 2 by 2, 1 by 1. 00:14:36.160 |
The flatten comes out calling it a lambda, but that, as you can see, gets rid of the 00:14:40.040 |
one by one, and it's now just a length ten vector for each item in the batch. 00:14:46.500 |
So a 128 by 10 matrix for the whole mini batch. 00:14:51.440 |
So just to confirm that this is working okay, we can grab that mini batch of X that we created 00:14:59.280 |
That's our mini batch of X. Pop it onto the GPU and call the model directly. 00:15:05.320 |
Remember any PyTorch module we can pretend it's a function. 00:15:09.820 |
And that gives us back, as we hoped, a 128 by 10 result. 00:15:14.880 |
So that's how you can directly get some predictions out. 00:15:29.040 |
And this is trained from scratch, of course, it's not pre-trained, we literally created 00:15:32.520 |
our own architecture, it's about the simplest possible architecture you can imagine, 18 00:15:37.440 |
So that's how easy it is to create a pretty accurate digit detector. 00:15:42.800 |
So let's refactor that a little rather than saying Conv Batch Norm Relu all the time. 00:15:51.520 |
Fast AI already has something called Conv_Layer, which lets you create Conv Batch Norm Relu 00:16:00.240 |
And it has various other options to do other tweaks to it, but the basic version is just 00:16:06.520 |
So we can refactor that like so, so that's exactly the same neural net. 00:16:14.280 |
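That refactor looks roughly like this (a sketch, assuming the same channel counts as before); fastai's conv_layer bundles the conv, ReLU and batch norm into one call:

```python
from fastai.vision import *

def conv2(ni, nf):
    return conv_layer(ni, nf, stride=2)   # conv + ReLU + batch norm, ks=3 by default

model = nn.Sequential(
    conv2(1, 8),    # 14x14
    conv2(8, 16),   # 7x7
    conv2(16, 32),  # 4x4
    conv2(32, 16),  # 2x2
    conv2(16, 10),  # 1x1
    Flatten()
)
```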
And so let's just train it a little bit longer, and it's now 99.1% accurate. 00:16:28.240 |
Well what we really want to do is create a deeper network. 00:16:33.980 |
And a very easy way to create a deeper network would be, after every stride two conv, 00:16:40.280 |
add a stride one conv, because the stride one conv doesn't change the feature map size at all. 00:16:54.280 |
And the problem was pointed out in this paper, very, very, very influential paper, called 00:16:58.680 |
Deep Residual Learning for Image Recognition by Kaiming He and colleagues, then at Microsoft Research. 00:17:11.520 |
Let's just look at the training error of a network trained on CIFAR-10. 00:17:17.920 |
And let's try one network with 20 layers, just basic three by three Convs, just basically 00:17:22.960 |
the same network I just showed you, but without batch norm. 00:17:28.800 |
So they trained a 20 layer one and a 56 layer one on the training set. 00:17:34.760 |
So the 56 layer one has a lot more parameters. 00:17:37.040 |
It's got a lot more of these stride one Convs in the middle. 00:17:40.560 |
So the one with more parameters should seriously overfit. 00:17:45.060 |
So you would expect the 56 layer one to zip down to zero-ish training error pretty quickly, but it doesn't; it actually ends up with a worse training error than the 20 layer one. 00:17:55.760 |
So when you see something weird happen, really good researchers don't go, oh no, it's not working; they ask why. 00:18:11.600 |
And Kaiming He said, I don't know, but what I do know is this. 00:18:16.520 |
I could take this 56 layer network and make a new version of it, which is identical, but 00:18:24.320 |
has to be at least as good as the 20 layer network. 00:18:29.920 |
Every two convolutions, I'm going to take the input to those two convolutions and add it 00:18:38.840 |
together with the result of those two convolutions. 00:18:44.700 |
So in other words, he's saying, instead of saying output equals conv2 of conv1 of x, instead, 00:18:59.680 |
he's saying output equals x plus conv2 of conv1 of x. 00:19:09.720 |
So his theory was that this network, with its 56 layers worth of convolutions, has to be at least as good 00:19:20.800 |
as the 20 layer version, because it could always just set conv2 and conv1 to a bunch of zero 00:19:28.480 |
weights for everything except the first 20 layers, because the x, the input, could just go straight through. 00:19:38.000 |
So this thing here is, as you see, called an identity connection. 00:19:51.400 |
That's the intuition the paper describes: what would happen if we created 00:19:57.400 |
something which has to train at least as well as a 20 layer neural network, because it kind of contains that 20 layer network. 00:20:05.280 |
There's literally a path where you can just skip over all the convolutions. 00:20:13.760 |
And what happened was he won ImageNet that year. 00:20:20.280 |
And in fact, even today, when we had that record-breaking result on ImageNet speed training ourselves, we used a ResNet. 00:20:39.720 |
If you're interested in doing some research, some novel research, any time you find some 00:20:45.680 |
model for anything, whether it's medical image segmentation or some kind of GAN or whatever, 00:20:54.120 |
and it was written a couple of years ago, they might have forgotten to put ResNets in. 00:20:59.960 |
This, by the way, is what we normally call a res block. 00:21:04.080 |
They might have forgotten to put resblocks in. 00:21:05.920 |
So replace their convolutional path with a bunch of resblocks, and you'll almost always 00:21:17.040 |
So at NeurIPS, which Rachel and I and David all just came back from and Sylvain, we saw 00:21:24.200 |
a new presentation where they actually figured out how to visualize the loss surface of a neural net. 00:21:35.960 |
And anybody who's watching this, lesson seven, is at a point where they will understand most 00:21:41.840 |
of the most important concepts in this paper. 00:21:45.760 |
You won't necessarily get all of it, but I'm sure you'll get enough to find it interesting. 00:21:54.000 |
Here's what happens if you draw a picture where X and Y are two projections of the weight space, and the height is the loss. 00:22:04.940 |
And so as you move through the weight space, a 56 layer neural network without skip connections has a really bumpy loss surface, full of hills and valleys. 00:22:13.680 |
And that's why this got nowhere, because it just got stuck in all these hills and valleys. 00:22:21.400 |
The exact same network with identity connections, with skip connections, has this much smoother loss landscape. 00:22:29.540 |
So it's kind of interesting how Kaiming He recognized back in 2015 that this shouldn't happen, and here's a way to fix it. 00:22:41.240 |
And it took three years before people were able to say, oh, this is kind of why it fixed 00:22:48.480 |
With the batch norm discussion we had a couple of weeks ago, people realizing a little bit 00:22:53.360 |
after the fact sometimes what's going on and why it helps. 00:22:57.860 |
So in our code, we can create a res block in just the way I described. 00:23:10.200 |
We create an nn.Module, we create two conv layers 00:23:14.120 |
(where a conv layer is conv2d, ReLU, batch norm). 00:23:23.800 |
So create two of those, and then in forward we go conv1 of x, conv2 of that, and then add x. 00:23:33.560 |
There's a res block function already in fast AI. 00:23:36.520 |
So you can just call res_block instead, and you just pass in something saying how many filters you want. 00:23:46.000 |
So there's the res block that I defined in our notebook. 00:23:51.160 |
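A minimal sketch of that res block, assuming fastai's conv_layer is available (the built-in res_block(nf) does essentially the same thing):

```python
from fastai.vision import *

class ResBlock(nn.Module):
    def __init__(self, nf):
        super().__init__()
        self.conv1 = conv_layer(nf, nf)   # stride 1, so the grid size is unchanged
        self.conv2 = conv_layer(nf, nf)

    def forward(self, x):
        # identity (skip) connection: add the input back onto the result of the two convs
        return x + self.conv2(self.conv1(x))
```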
And so with that res block we can now take every one of those, I've just copied the previous 00:23:57.000 |
CNN, and after every conv2, except the last one, I added a res block. 00:24:03.240 |
So this has now got three times as many layers. 00:24:16.960 |
Since I go conv2, res block so many times, let's just pop that into a mini sequential 00:24:23.160 |
model here, and so I can refactor that like so. 00:24:26.840 |
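The refactored version, as a sketch: a little sequential unit pairing each stride-2 conv with a res block (conv2 is the helper defined earlier; the channel counts are assumptions).

```python
from fastai.vision import *

def conv_and_res(ni, nf):
    return nn.Sequential(conv2(ni, nf), res_block(nf))   # res_block is fastai's built-in

model = nn.Sequential(
    conv_and_res(1, 8),
    conv_and_res(8, 16),
    conv_and_res(16, 32),
    conv_and_res(32, 16),
    conv2(16, 10),
    Flatten()
)
```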
Keep refactoring your architectures if you're trying novel architectures, because you'll make fewer mistakes. 00:24:32.880 |
Most research code you look at is clunky as all hell, and people often make mistakes in it. 00:24:42.760 |
So use your coding skills to make life easier. 00:25:05.120 |
So that's interesting because we've trained this literally from scratch with an architecture 00:25:14.200 |
It was just the first thing that came to mind. 00:25:17.520 |
But in terms of where that puts us, 0.45% error is around about the state of the art 00:25:24.360 |
for this data set as of three or four years ago. 00:25:28.600 |
Now, you know, today MNIST is considered a kind of trivially easy data set, so I'm not 00:25:34.760 |
saying, like, wow, we've broken some records here. 00:25:40.040 |
But what I'm saying is that, you know, this kind of ResNet is a genuinely 00:25:49.720 |
extremely useful network still today, and it's really all we use in our fast ImageNet training. 00:25:57.080 |
And one of the reasons as well is that it's so popular, so the vendors of the library 00:26:02.580 |
spend a lot of time optimizing it, so things tend to work fast, whereas some more modern-style 00:26:10.760 |
architectures using things like separable or grouped convolutions tend not to actually train as quickly in practice. 00:26:19.580 |
If you look at the definition of ResBlock in the fast AI code, you'll see it looks a little bit different from this. 00:26:27.740 |
And that's because I've created something called a merge layer. 00:26:30.800 |
And a merge layer is something which, in the forward (just skip dense for a moment), goes x plus x.orig. 00:26:41.240 |
So you can see there's something ResNet-ish going on here. 00:26:45.800 |
Well, if you create a special kind of sequential model called a sequential EX -- so this is 00:26:51.120 |
like fast AI's sequential extended -- it's just like a normal sequential model, but we make the block's original input available to every layer as x.orig. 00:27:01.400 |
And so this here, sequential EX, conv layer, conv layer, merge layer, will do exactly the same thing as that res block. 00:27:12.600 |
So you can create your own variations of ResNet blocks very easily with just sequential EX and a merge layer. 00:27:22.200 |
So there's something else here, which is when you create your merge layer, you can optionally set dense equals true. 00:27:29.660 |
Well, if you do, it doesn't go x plus x.orig, it goes cat(x, x.orig). 00:27:36.120 |
In other words, rather than putting a plus in this connection, it does a concatenate. 00:27:43.940 |
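A tiny sketch of that plus-versus-concat choice; in fastai this lives in MergeLayer (used inside SequentialEx, which stashes the block's original input as x.orig), but the core idea is just this:

```python
import torch

def merge(x, orig, dense=False):
    # plus -> ResNet-style residual block
    # concat along the channel axis -> DenseNet-style dense block
    return torch.cat([x, orig], dim=1) if dense else x + orig
```

So something like SequentialEx(conv_layer(nf, nf), conv_layer(nf, nf), MergeLayer(dense=True)) would, as far as I can tell, give you the dense-style version of the block.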
So that's pretty interesting, because what happens is that you have your input coming into the block. 00:27:53.720 |
And once you use concatenate instead of plus, it's not called a Res block anymore, it's 00:27:58.080 |
called a dense block, and it's not called a ResNet anymore, it's called a dense net. 00:28:02.600 |
So the dense net was invented about a year after the ResNet. 00:28:07.520 |
And if you read the dense net paper, it can sound incredibly complex and different, but 00:28:11.240 |
actually it's literally identical, but plus here is replaced with cat. 00:28:17.320 |
So you have your input coming into your dense block, right, and you've got a few convolutions 00:28:22.160 |
in here, and then you've got some output coming out, and then you've got your identity connection. 00:28:28.440 |
And remember, it doesn't plus, it concats, so if this is the channel axis, it gets a little bit bigger. 00:28:35.840 |
And then so we do another dense block, and at the end of that we have all of this coming in: 00:28:44.160 |
the result of the convolutions as per usual, but this time also the original input and every intermediate result, concatenated together. 00:28:57.400 |
So you can see that what happens is that with dense blocks it's getting bigger and bigger 00:29:01.600 |
and bigger, and kind of interestingly the exact input is still here, right? 00:29:09.160 |
So actually, no matter how deep you get, the original input pixels are still there, and 00:29:14.040 |
the original layer one features are still there, and the original layer two features 00:29:17.880 |
So as you can imagine, dense nets are very memory intensive. 00:29:24.040 |
There are ways to manage this, from time to time you can have a regular convolution that 00:29:28.920 |
squishes your channels back down, but they are memory intensive. 00:29:36.600 |
So for dealing with small data sets, you should definitely experiment with dense blocks and dense nets. 00:29:45.480 |
They tend to work really well on small data sets. 00:29:49.160 |
Also, because it's possible to kind of keep those original input pixels all the way down 00:29:53.480 |
the path, they work really well for segmentation, right? 00:29:56.920 |
Because for segmentation, you kind of want to be able to reconstruct the original resolution 00:30:03.140 |
of your picture, so having all of those original pixels still there is super helpful. 00:30:14.640 |
So that's res nets, and one of the main reasons, other than the fact that res nets are awesome, 00:30:22.520 |
to tell you about them, is that these skip connections are useful in other places as well, 00:30:28.640 |
and they're particularly useful in other places and other ways of designing architectures 00:30:35.280 |
So in building this lesson, I always kind of, I keep trying to take old papers and saying, 00:30:43.000 |
like I'm mentioning, what would that person have done if they had access to all the modern techniques we have now? 00:30:48.400 |
And I try to kind of rebuild them in a more modern style. 00:30:51.600 |
So I've been rebuilding this next architecture we're going to look at, called a unet, in that way. 00:30:59.260 |
And I got to the point now, I keep showing you this semantic segmentation paper with 00:31:06.760 |
the state of the art for CAMVID, which was 91.5. 00:31:11.120 |
This week I got it up to 94.1 using the architecture I'm about to show you. 00:31:17.540 |
So we keep pushing this further and further and further. 00:31:21.440 |
And it really was all about adding all of the modern tricks, many of which I'll show 00:31:30.400 |
you today, some of which we'll see in part two. 00:31:35.040 |
So what we're going to do to get there is we're going to use this UNET. 00:31:39.620 |
So we've used a UNET before, I've improved it a bit since then. 00:31:45.480 |
So we've used a unet before; we used it when we did the CAMVID segmentation, but we didn't look at the details. 00:31:51.260 |
So we're now in a position where we can understand what it was doing. 00:31:58.900 |
And so the first thing we need to do is kind of understand the basic idea of how you can 00:32:06.800 |
So if we go back to our CAMVID notebook, in our CAMVID notebook you'll remember that basically 00:32:17.000 |
what we were doing is we were taking these photos and adding a class to every single pixel. 00:32:23.920 |
And so when you go data.showbatch for something which is a segmentation item list, it will 00:32:30.240 |
automatically show you these color-coded pixels. 00:32:38.680 |
In order to color-code this as a pedestrian, but this as a bicyclist, it needs to know 00:32:48.780 |
It needs to actually know that's what a pedestrian looks like, and it needs to know that's exactly 00:32:52.320 |
where the pedestrian is, and this is the arm of the pedestrian and not part of their shopping bag. 00:32:56.440 |
It needs to really understand a lot about this picture to do this task. 00:33:01.940 |
And it really does do this task, like when you look at the results of our top model, 00:33:10.040 |
I can't see a single wrong pixel by looking at it by eye. 00:33:13.720 |
I know there's a few wrong, but I can't see the ones that are wrong, it's that accurate. 00:33:19.640 |
So the way that we're doing it to get these really, really good results is, not surprisingly, transfer learning. 00:33:29.260 |
So we start with a ResNet-34, and you can see that here: unet_learner(data, models.resnet34). 00:33:40.920 |
And if you don't say pre-trained equals false, by default, you get pre-trained equals true, 00:33:48.860 |
So we start with a ResNet-34, which starts with a big image. 00:33:57.360 |
So in this case, this is from the unet paper now. 00:33:59.960 |
Their images started with one channel by 572 by 572. 00:34:08.500 |
So after your stride 2 conv, they're doubling the number of channels to 128, and they're 00:34:15.200 |
halving the size, so they're now down to 280 by 280. 00:34:19.740 |
In this original unet paper, they didn't add any padding, so they lost a pixel on each side with every convolution. 00:34:27.960 |
So basically half the size, and then half the size, and then half the size, and then 00:34:33.000 |
half the size, until they're down to 28 by 28, with 1024 channels. 00:34:39.760 |
So that's what the unet's downsampling path (this is called the downsampling path) looks like. 00:35:01.080 |
So you can see that the size keeps halving, channels keep going up, and so forth. 00:35:08.160 |
So eventually, you've got down to a point where, if you use the unet architecture, it's 00:35:13.400 |
28 by 28 with 1024 channels; with a ResNet architecture with a 224 pixel input, it would be 7 by 7 with 512 channels. 00:35:24.720 |
So it's a pretty small grid size on this feature map. 00:35:29.480 |
Somehow we've got to end up with something which is the same size as our original picture. 00:35:38.840 |
How do you do computation which increases the grid size? 00:35:44.800 |
Well, we don't have a way to do that in our current bag of tricks. 00:35:49.340 |
We can use a stride 1 conv to do computation and keep grid size, or a stride 2 conv to do computation and halve the grid size. 00:36:00.440 |
To increase the grid size, we do a stride half conv, also known as a deconvolution, also known as a transposed convolution. 00:36:11.320 |
There is a fantastic paper called A Guide to Convolution Arithmetic for Deep Learning that 00:36:16.240 |
shows a great picture of exactly what does a 3 by 3 kernel stride half conv look like. 00:36:24.300 |
If you have a 2 by 2 input, so the blue squares are the 2 by 2 input, you add not only two 00:36:32.040 |
pixels of padding all around the outside, but you also add a pixel of padding between every pixel. 00:36:43.440 |
And so now if we put this 3 by 3 kernel here and then here and then here, you see how the 00:36:49.160 |
3 by 3 kernel is just moving across it in the usual way? 00:36:52.160 |
You will end up going from a 2 by 2 output to a 5 by 5 output. 00:36:59.000 |
So if you only added one pixel of padding around the outside, you would end up with a smaller output. 00:37:10.600 |
So this is how you can increase the resolution. 00:37:18.240 |
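In PyTorch that stride-half picture corresponds to nn.ConvTranspose2d; a quick sketch to confirm the shapes described above:

```python
import torch
import torch.nn as nn

# 3x3 kernel, stride 2 "transposed" conv: a 2x2 input becomes a 5x5 output
deconv = nn.ConvTranspose2d(in_channels=1, out_channels=1, kernel_size=3, stride=2)
x = torch.randn(1, 1, 2, 2)
print(deconv(x).shape)   # torch.Size([1, 1, 5, 5])
```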
This was the way people did it until maybe a year or two ago. 00:37:27.200 |
Here's another trick for improving things you find online, because this is actually a dumb way to do it. 00:37:32.880 |
And it's kind of obvious it's a dumb way to do it for a couple of reasons. 00:37:35.440 |
One is that, have a look at this, nearly all of those pixels are white. 00:37:49.320 |
Also, when you get down to that 3 by 3 area, 2 out of the 9 pixels are non-white, while in other positions it's a different number. 00:38:03.560 |
So there's different amounts of information going into different parts of your convolution. 00:38:09.120 |
So it just doesn't make any sense to kind of throw away information like this and to 00:38:15.280 |
do all this unnecessary computation and have different parts of the convolution having 00:38:22.240 |
So what people generally do nowadays is something really simple, which is if you have, let's 00:38:29.320 |
say, a 2 by 2 input, these are your pixel values, A, B, C, and D, and you want to create a 4 by 4, you just copy each of those pixels four times: 00:38:46.160 |
A, A, A, A, B, B, B, B, C, C, C, C, D, D, D, D. 00:38:58.960 |
I haven't done any interesting computation, but now on top of that, I could just do a 00:39:05.360 |
stride 1 convolution, and now I have done some computation. 00:39:09.240 |
So that up sample is called nearest neighbor interpolation. 00:39:20.760 |
So you can just do, and that's super fast, which is nice, so you can do a nearest neighbor 00:39:24.480 |
interpolation and then a stride 1 conv, and now you've got some computation, which is 00:39:29.840 |
actually kind of using, you know, there's no zeros here. 00:39:34.280 |
This is kind of nice because it gets a mixture of A's and B's, which is kind of what you want. 00:39:40.920 |
Another approach is instead of using nearest neighbor interpolation, you can use bilinear 00:39:45.040 |
interpolation, which basically means instead of copying A to all those different cells, 00:39:50.440 |
you take a kind of a weighted average of the cells around it. 00:39:53.840 |
So for example, if you were, you know, looking at what should go here, you would kind of 00:40:00.440 |
go like, oh, it's about 3 A's, 2 C's, 1 D, and 2 B's, and you could have taken the average. 00:40:07.800 |
Not exactly, but roughly just a weighted average. 00:40:10.920 |
Bilinear interpolation you'll find all over the place; it's a pretty standard technique. 00:40:16.280 |
Any time you look at a picture on your computer screen and change its size, it's doing bilinear interpolation. 00:40:26.200 |
So that was what people were using, well, that's what people still tend to use. 00:40:31.360 |
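As a sketch, the upsample-then-conv idea looks like this in plain PyTorch (64 channels is just an arbitrary assumption; swap mode='nearest' for mode='bilinear' to get the weighted-average version):

```python
import torch
import torch.nn as nn

up = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),           # copy each pixel into a 2x2 block
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)  # then do some real computation
)
x = torch.randn(1, 64, 28, 28)
print(up(x).shape)   # torch.Size([1, 64, 56, 56])
```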
That's as much as I'm going to teach you this part. 00:40:34.360 |
In part two, we'll actually learn what the FastAI library is actually doing behind the 00:40:39.120 |
scenes, which is something called a pixel shuffle, also known as sub-pixel convolutions. 00:40:44.900 |
It's not dramatically more complex, but complex enough that I won't cover it today. 00:40:49.800 |
All of these things are ways of basically letting us do a convolution that 00:40:54.760 |
ends up with something that's twice the size. 00:41:02.640 |
So that lets us go from 28 by 28 to 56 by 56 and keep on doubling the size. On its own, though, that doesn't give great results, 00:41:22.760 |
which is not surprising, because in this 28 by 28 feature map, how the hell is it going 00:41:28.680 |
to have enough information to reconstruct a 572 by 572 output space? 00:41:37.800 |
So you tended to end up with these things that lacked fine detail. 00:41:45.120 |
So what Olaf Ronneberger et al. did was they said, hey, let's add a skip connection, an identity connection. 00:41:58.300 |
And amazingly enough, this was before resnets existed. 00:42:08.080 |
But rather than adding a skip connection that skipped every two convolutions, they added skip connections that cross from one side of the network to the other. 00:42:17.680 |
In other words, they added a skip connection from the same part of the downsampling path 00:42:22.440 |
to the same sized bit in the upsampling path. 00:42:28.320 |
That's why you can see the white and the blue next to each other. 00:42:33.440 |
So basically these are like dense blocks, right? 00:42:36.880 |
But the skip connections are skipping over larger and larger amounts of the architecture. 00:42:42.960 |
So that over here, you've literally got nearly the input pixels themselves coming into the end of the upsampling path. 00:42:55.920 |
And so that's going to make it super handy for resolving the fine details in these segmentation 00:43:00.960 |
tasks because you've literally got all of the fine details. 00:43:04.600 |
On the downside, you don't have very many layers of computation going on here, just four. 00:43:11.480 |
So you better hope that by that stage, you've done all the computation necessary to figure 00:43:15.760 |
out, is this a bicyclist or is this a pedestrian? 00:43:18.920 |
But you can then add on top of that something saying like, is this exact pixel where their 00:43:23.800 |
nose finishes or is that the start of the tree? 00:43:39.600 |
So if you look at the fastai unet code, the key thing that comes in is the encoder. 00:43:56.480 |
In most cases, unets have this specific older-style architecture as the encoder. 00:44:01.040 |
But like I said, replace any older-style architecture bits with ResNet bits and life improves, so that's what we did. 00:44:11.080 |
So the layers of our unet are an encoder, then batch norm, then ReLU, and then a middle conv, which is just two conv layers. 00:44:20.920 |
Remember conv layer is a conv ReLU batch norm in FastAI. 00:44:26.680 |
And so the middle conv is these two extra steps here at the bottom, just doing a little bit 00:44:34.360 |
It's kind of nice to add more layers of computation where you can. 00:44:38.960 |
So encoder, batch norm, ReLU, and then two convolutions. 00:44:48.160 |
But basically, we figure out what the layer number is where each of these stride 00:44:55.280 |
two convs occurs, and we just store them in an array of indexes. 00:44:59.480 |
So then we can loop through that, and we can basically say for each one of those points, 00:45:04.600 |
create a unet block, telling it how many up-sampling channels there are and how many cross-connection channels. 00:45:11.400 |
These things here are called cross-connections, or at least that's what I call them. 00:45:16.920 |
So that's really where the main work's going on, in the unet block. 00:45:22.720 |
As I said, there's quite a few tweaks we do, as well as the fact we use a much better encoder. 00:45:27.280 |
We also use some tweaks in all of our up-sampling using this pixel shuffle. 00:45:34.200 |
And then another tweak, which I just did in the last week, is to not just take the result 00:45:39.040 |
of the convolutions and pass it across, but we actually grab the input pixels and make them a cross-connection too. 00:45:48.000 |
You can see we're literally appending a res block with the original inputs. 00:45:57.000 |
So really all the work's going on in the unet block, and the unet block has to store the activations from the downsampling path. 00:46:08.120 |
And the way to do that, as we learned in the last lesson, is with hooks. 00:46:13.140 |
So we put hooks into the ResNet-34 to store the activations each time there's a stride 2 conv. 00:46:25.760 |
And we grab the result of the stored value in that hook, and we literally just go torch.cat, 00:46:32.120 |
so we concatenate the up-sampled convolution with the result of the hook, which we chuck 00:46:44.440 |
through batch norm, and then we do two convolutions to it. 00:46:48.560 |
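Here is a very rough sketch of what one of those unet blocks does in its forward; it's a simplification of fastai's UnetBlock, with made-up names and channel bookkeeping, so treat it as an illustration rather than the real implementation.

```python
import torch
import torch.nn as nn
from fastai.vision import *

class UnetBlockSketch(nn.Module):
    def __init__(self, up_in_c, skip_c, out_c):
        super().__init__()
        # upsample the low-resolution activations (fastai actually uses pixel shuffle here)
        self.up = nn.Sequential(nn.Upsample(scale_factor=2, mode='nearest'),
                                nn.Conv2d(up_in_c, up_in_c // 2, 3, padding=1))
        self.bn = nn.BatchNorm2d(skip_c)
        self.conv1 = conv_layer(up_in_c // 2 + skip_c, out_c)
        self.conv2 = conv_layer(out_c, out_c)

    def forward(self, up_in, hooked):
        # hooked: activations stored by a hook on the matching stride-2 layer
        # of the downsampling path (the cross-connection), same spatial size
        x = self.up(up_in)
        x = torch.cat([x, self.bn(hooked)], dim=1)
        return self.conv2(self.conv1(x))
```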
And actually, you know, something you could play with at home is pretty obvious here. 00:46:53.640 |
Any time you see two convolutions like this, there's an obvious question: what if we used a ResNet block instead? 00:46:59.020 |
So you could try replacing those two convs with a ResNet block. 00:47:04.000 |
And then the kind of things I look for when I look at an architecture is like, oh, two 00:47:08.840 |
convs in a row probably should be a ResNet block. 00:47:16.260 |
So that's UNET, and it's amazing to think it preceded ResNet, it preceded DenseNet. 00:47:29.240 |
It wasn't even published in a major machine learning venue. 00:47:32.640 |
It was actually published at MICCAI, which is a specialized medical image computing conference. 00:47:39.600 |
For years, actually, it was largely unknown outside of the medical imaging community. 00:47:44.420 |
And actually, what happened was Kaggle competitions for segmentation kept on being easily won by people using unets. 00:47:52.240 |
And that was the first time I saw it getting noticed outside the medical imaging community. 00:47:56.080 |
And then, gradually, a few people in the academic machine learning community started noticing, 00:48:00.480 |
and now everybody loves UNET, which I'm glad, because it's just awesome. 00:48:09.320 |
So identity connections, regardless of whether they're a plus style or a concat style, are super useful. 00:48:20.120 |
They can basically get us close to the state of the art on lots of important tasks. 00:48:31.440 |
And so the next task I want to look at is image restoration. 00:48:36.120 |
So image restoration refers to starting with an image, and this time, we're not going to 00:48:41.240 |
create a segmentation mask, but we're going to try and create a better image. 00:48:47.440 |
And there are lots of versions of better; the output could even be quite a different image. 00:48:50.680 |
So the kind of things we can do with this image generation would be take a low res image, 00:48:55.800 |
make it high res, take a black and white image, make it color, take an image where something's 00:49:01.480 |
being cut out of it and try and replace the cut out thing, take a photo and try and turn 00:49:07.160 |
it into what looks like a line drawing, take a photo and try and make it look like a Monet painting. 00:49:12.240 |
These are all examples of kind of image to image generation tasks, which you'll know how to do after this lesson. 00:49:21.040 |
So in our case, we're going to try to do image restoration, which is going to start with 00:49:27.600 |
low resolution, poor quality JPEGs with writing written over the top of them, and turn them 00:49:35.520 |
into high resolution, good quality pictures in which the text has been removed. 00:49:51.680 |
Why do you concat before calling conv2, conv1, not after? 00:50:00.320 |
Because if you did your convs before you concat, then there's no way 00:50:06.440 |
for the channels of the two parts to interact with each other. 00:50:11.240 |
You don't get any interaction between them. So remember, in a 2D conv, it's really 3D, right? 00:50:18.320 |
It's moving across two dimensions, but in each case, it's doing a dot product of all 00:50:25.720 |
three dimensions of a rank 3 tensor, row by column by channel. 00:50:30.440 |
So generally speaking, we want as much interaction as possible. 00:50:35.000 |
We want to say this part of the downsampling path and this part of the upsampling path, 00:50:40.520 |
if you look at the combination of them, you find these interesting things. 00:50:43.480 |
So generally, you want to have as many interactions going on as possible in each computation that you do. 00:50:55.480 |
How does concatenating every layer together in a dense net work when the size of the image is changing through the layers? 00:51:07.760 |
So, if you have a stride 2 conv, you can't keep dense netting. 00:51:14.920 |
That's what actually happens in a dense net, is you kind of go like dense block growing, 00:51:19.040 |
dense block growing, dense block growing, so you're getting more and more channels. 00:51:22.000 |
And then you do a stride 2 conv without a dense block. 00:51:29.600 |
And then you just do a few more dense blocks and then it's gone. 00:51:32.280 |
So in practice, a dense block doesn't actually keep all the information all the way through, 00:51:38.920 |
but just up until each of these stride 2 convs. 00:51:45.400 |
And there are various ways of doing these bottlenecking layers, where you're basically squishing the channels back down. 00:51:52.160 |
It also helps us keep memory under control, because at that point we can decide how many channels to keep. 00:52:00.320 |
So, in order to create something which can turn crappy images into nice images, we need 00:52:09.960 |
a data set containing nice versions of images and crappy versions of the same images. 00:52:15.080 |
So the easiest way to do that is to start with some nice images and Crapify them. 00:52:20.000 |
And so the way to Crapify them is to create a function called Crapify, which contains 00:52:27.200 |
So my Crapification logic, you can pick your own, is that I open up my nice image, I resize 00:52:34.360 |
it to be really small, 96 by 96 pixels, with bilinear interpolation, I then pick a random 00:52:42.400 |
number between 10 and 70, I draw that number into my image at some random location, and 00:52:51.720 |
then I save that image with a JPEG quality of that random number. 00:52:56.760 |
And a JPEG quality of 10 is like absolute rubbish. 00:53:06.120 |
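A sketch of that crappification logic with PIL; path_hr and path_lr (the source and destination folders) are assumptions, and you can obviously swap in whatever damage you like.

```python
import random
from PIL import Image, ImageDraw

def crappify(fn, i):
    dest = path_lr / fn.relative_to(path_hr)
    dest.parent.mkdir(parents=True, exist_ok=True)
    img = Image.open(fn).convert('RGB')
    img = img.resize((96, 96), resample=Image.BILINEAR)        # shrink it right down
    q = random.randint(10, 70)                                  # random JPEG quality / number to draw
    ImageDraw.Draw(img).text((random.randint(0, 60), random.randint(0, 60)),
                             str(q), fill=(255, 255, 255))      # scribble the number on top
    img.save(dest, quality=q)                                   # low quality = heavy JPEG artifacts
```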
So I end up with high quality images, low quality images that look something like these. 00:53:15.340 |
And so you can see this one, you know, there's the image. 00:53:18.760 |
And this is after transformations, that's why it's been flipped. 00:53:22.520 |
And you won't always see the number, because we're zooming into them. 00:53:26.360 |
So a lot of the time the number is cropped out. 00:53:30.600 |
So yeah, it's trying to figure out how to take this incredibly JPEG artifacty thing with 00:53:34.800 |
text written over the top and turn it into this. 00:53:38.240 |
So I'm using the Oxford Pets dataset, again, the same one we used in lesson one. 00:53:43.220 |
So there's nothing more high quality than pictures of dogs and cats, I think we can all agree. 00:53:49.400 |
The Crapification process can take a while, but fast.ai has a function called parallel. 00:53:56.200 |
And if you pass parallel a function name and a list of things to run that function on, 00:54:01.580 |
it will run that function on them all in parallel. 00:54:12.040 |
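Usage is roughly (assuming il is an item list over the high-res folder):

```python
from fastai.vision import *

il = ImageItemList.from_folder(path_hr)
parallel(crappify, il.items)   # runs crappify(fn, i) over every file, in parallel
```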
The way you write this function is where you get to do all the interesting stuff. 00:54:19.000 |
Try and think of an interesting Crapification which does something that you want to do. 00:54:23.240 |
So if you want to colorize black and white images, you would have crappify convert the image to black and white. 00:54:29.100 |
If you want something which can take large cut out blocks of image and replace them with 00:54:35.820 |
hallucinated image, add a big black box to these. 00:54:40.600 |
If you want something which can take old family photo scans that have been folded up and have 00:54:45.400 |
crinkles in, try and find a way of adding dust prints and crinkles and so forth. 00:54:52.040 |
Anything that you don't include in crappify, your model won't learn to fix, because every 00:54:58.960 |
time it sees that in your photos, the input and output will be the same, so it won't consider it something to fix. 00:55:09.760 |
So we now want to create a model which can take an input photo that looks like that and output one that looks like this. 00:55:19.840 |
So obviously what we want to do is use a unet, because we already know that unets can do 00:55:24.080 |
exactly that kind of thing, and we just need to pass the unet that data. 00:55:30.480 |
So our data is just literally the file names from each of those two folders. 00:55:37.600 |
Do some transforms, databunch, normalize, and use ImageNet stats, because we're going to use a pre-trained model. Why a pre-trained model? 00:55:46.960 |
Well, because if you're going to get rid of this 46, you need to know what probably 00:55:52.080 |
was there, and to know what probably was there, you need to know what this is a picture of. 00:55:55.880 |
Because otherwise, how can you possibly know what it ought to look like? 00:55:59.080 |
So let's use a pre-trained model that knows about these kinds of things. 00:56:12.720 |
These three things are important and interesting and useful, but I'm going to leave them to part two. 00:56:18.460 |
For now, you should always include them when you use a unet for this kind of problem. 00:56:26.920 |
And so now we're going to-- and this whole thing I'm calling a generator. 00:56:30.200 |
It's going to generate-- this is generative modeling. 00:56:34.320 |
There's not a really formal definition, but it's basically something where the thing we're 00:56:37.280 |
outputting is like a real object, in this case, an image. 00:56:44.000 |
So we're going to create a generator learner, which is this unet learner. 00:56:53.480 |
So in other words, what's the mean squared error between the actual pixel value that 00:56:57.360 |
it should be and the pixel value that we predicted? 00:57:06.300 |
So we have a version called MSELossFlat, which simply flattens out those images into a big vector. 00:57:17.360 |
Even if you don't flatten, it'll generally also work fine. 00:57:20.340 |
So we're already down to 0.05 mean squared error on the pixel values, which is not bad, 00:57:29.540 |
Like all things in fast AI, pretty much, because we are doing transfer learning by default, 00:57:34.680 |
when you create this, it'll freeze the pre-trained part. 00:57:40.200 |
And the pre-trained part of a unet is this part, the downsampling part. 00:57:47.220 |
So let's unfreeze that and train a little more. 00:57:55.620 |
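Putting that together, the generator looks roughly like this; the extra flags (blur, weight norm, self attention, y_range) are the things deferred to part two as far as I can tell, and the exact values here are assumptions rather than the definitive settings.

```python
from fastai.vision import *

loss_gen = MSELossFlat()
learn_gen = unet_learner(data_gen, models.resnet34,
                         blur=True, norm_type=NormType.Weight, self_attention=True,
                         y_range=(-3., 3.), loss_func=loss_gen)

learn_gen.fit_one_cycle(2)                      # pre-trained downsampling path is frozen
learn_gen.unfreeze()                            # now train the whole unet
learn_gen.fit_one_cycle(3, slice(1e-6, 1e-3))
```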
So with four minutes of training, we've got something which is basically doing a perfect job of removing the numbers. 00:58:07.600 |
It's certainly not doing a good job of up sampling. 00:58:13.120 |
But it's definitely doing a nice job; sometimes when it removes a number, it maybe leaves a little artifact behind. 00:58:18.360 |
But it's certainly doing something pretty useful. 00:58:21.120 |
And so if all we wanted to do was kind of watermark removal, we'd be finished. 00:58:28.320 |
We're not finished, because we actually want this thing to look more like this thing. 00:58:38.600 |
The problem, the reason that we're not making as much progress with that as we'd like is 00:58:43.400 |
that our loss function doesn't really describe what we want. 00:58:47.160 |
Because actually, the mean squared error between the pixels of this and this is actually very small. 00:58:54.680 |
And if you actually think about it, most of the pixels are very nearly the right color. 00:59:02.520 |
But we're missing the eyeballs entirely, pretty much. 00:59:08.800 |
So we want some loss function that does a better job than pixel mean squared error loss 00:59:16.880 |
of saying like, is this a good quality picture of this thing? 00:59:23.660 |
So there's a fairly general way of answering that question. 00:59:29.560 |
And it's something called a generative adversarial network, or GAN. 00:59:36.720 |
And a GAN tries to solve this problem by using a loss function which actually calls another model. 00:59:52.760 |
So we've got our crappy image, and we've already created a generator. 01:00:12.220 |
And we can compare the high res image to the prediction with pixel MSE. 01:00:20.000 |
We could also train another model, which we would variously call either the discriminator or the critic. 01:00:31.120 |
We could try and build a binary classification model that takes all the pairs of the generated 01:00:37.680 |
image and the real high res image and tries to classify, learn to classify, which is which. 01:00:45.360 |
So look at some picture and say like, hey, what do you think? 01:00:50.320 |
Is that a high res cat or is that a generated cat? 01:00:55.200 |
So just a regular standard binary cross-entropy classifier. 01:01:04.480 |
So if we had one of those, we could now fine-tune the generator. 01:01:11.580 |
And rather than using pixel MSE as the loss, the loss could be: how good are we at fooling the critic? 01:01:19.840 |
So can we create generated images that the critic thinks are real? 01:01:30.840 |
Because if it can do that, if the loss function is am I fooling the critic, then it's going 01:01:36.840 |
to learn to create images which the critic can't tell whether they're real or fake. 01:01:43.760 |
So we could do that for a while, train a few batches, but the critic isn't that great. 01:01:52.280 |
The reason the critic isn't that great is because it wasn't that hard. 01:01:55.160 |
Like these images are really shitty, so it's really easy to tell the difference, right? 01:01:59.320 |
So after we train the generator a little bit more using the critic as the loss function, 01:02:05.680 |
the generator is going to get really good at fooling the critic. 01:02:09.600 |
So now we're going to stop training the generator and we'll train the critic some more on these newly generated images. 01:02:16.800 |
So now that the generator's better, it's now a tougher task for the critic to decide which 01:02:21.600 |
is real and which is fake, so we'll train that a little bit more. 01:02:25.880 |
And then once we've done that, and the critic's now pretty good at recognising the difference 01:02:29.040 |
between the better generated images and the originals, we'll go back and we'll fine-tune 01:02:34.480 |
the generator some more using the better discriminator, the better critic, as the loss function. 01:02:40.040 |
And so we'll just go ping pong, ping pong, backwards and forwards. 01:02:49.080 |
I don't know if anybody's written this before. 01:02:52.840 |
We've created a new version of a GAN, which is kind of a lot like the original GANs, but 01:02:57.840 |
we have this neat trick where we pre-train the generator and we pre-train the critic. 01:03:03.920 |
I mean, GANs have been kind of in the news a lot. 01:03:09.460 |
And if you've seen them, you may have heard that they're a real pain to train. 01:03:14.900 |
But it turns out we realise that really most of the pain of training them was at the start. 01:03:20.160 |
If you don't have a pre-trained generator and you don't have a pre-trained critic, then 01:03:24.280 |
it's basically the blind leading the blind, right? 01:03:27.720 |
You're basically like -- well, the generator's trying to generate something which 01:03:31.300 |
fools the critic, but the critic doesn't know anything at all, so the generator's got nothing useful to learn from. 01:03:35.760 |
And then the critics kind of try to decide whether the generated images are real or not, 01:03:39.000 |
and that's really obvious, so it doesn't have to get very good at it either. 01:03:41.320 |
And so they kind of like don't go anywhere for ages. 01:03:46.360 |
And then once they finally start picking up steam, they go along pretty quickly. 01:03:50.420 |
So if you can find a way to generate things without using a GAN, like mean squared error 01:03:56.800 |
pixel loss, and discriminate things without using a GAN, like a classifier on that first generator's outputs, then you can skip most of the slow start. 01:04:08.040 |
So to create just a totally standard fast.ai binary classification model, we need two folders, 01:04:15.760 |
one folder containing high-res images, one folder containing generated images. 01:04:20.680 |
We already have the folder with the high-res images, so we just have to save our generated images. 01:04:26.180 |
So here's a tiny, tiny bit of code that does that. 01:04:30.400 |
We're going to create a directory called image_gen, and pop it into a variable called path_gen. 01:04:37.560 |
We've got a little function called save preds that takes a data loader, and we're going 01:04:41.960 |
to grab all of the file names, because remember that in an item list, the .items attribute contains the file names. 01:04:50.340 |
So here's the file names in that data loader's data set. 01:04:55.080 |
And so now let's go through each batch of the data loader, and let's grab a batch of 01:05:00.640 |
predictions for that batch, and then reconstruct equals true, means it's actually going to create 01:05:06.680 |
fast.ai image objects for each of those, each thing in the batch. 01:05:12.000 |
And so then we'll go through each of those predictions and save them. 01:05:16.000 |
And the name we'll save it with is the name of the original file, but we're going to pop it into a different directory, the image_gen one. 01:05:28.320 |
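The save_preds function is roughly this (a sketch; path_gen is the image_gen directory from above and data_gen.fix_dl is the non-shuffled training data loader, both assumptions based on the lesson):

```python
def save_preds(dl):
    names = dl.dataset.items                 # the original file names
    i = 0
    for b in dl:
        preds = learn_gen.pred_batch(batch=b, reconstruct=True)  # fastai Image objects
        for pred in preds:
            pred.save(path_gen / names[i].name)                   # same name, different folder
            i += 1

save_preds(data_gen.fix_dl)
```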
And so you can see I'm kind of increasingly not just using stuff that's already in the 01:05:33.480 |
fast.ai library, but trying to show you how to write stuff yourself, right? 01:05:38.400 |
And generally it doesn't require heaps of code to do that. 01:05:41.920 |
And so if you come back for part two, lots of part two is kind 01:05:47.080 |
of like: here's how you use things inside the library, and here's how we wrote them ourselves. 01:05:57.520 |
So save those predictions, and then let's just do a PIL.Image.open on the first one and have a look at it. 01:06:08.120 |
So now I can train a critic in the usual way. 01:06:13.320 |
It's really annoying to have to restart Jupyter Notebook to reclaim GPU memory. 01:06:18.440 |
So one easy way to handle this is if you just set something that you knew was using a lot 01:06:22.580 |
of GPU memory to None, like this learner, and then just go gc.collect(). 01:06:28.080 |
That tells Python to do memory garbage collection, and after that you'll generally be fine. 01:06:36.920 |
You'll be able to use all of your GPU memory again. 01:06:40.340 |
If you're using Nvidia SMI to actually look at your GPU memory, you won't see it clear 01:06:45.620 |
because PyTorch still has a kind of allocated cache, but it makes it available. 01:06:51.700 |
So you should find this is how you can avoid restarting your Notebook. 01:06:56.160 |
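In other words, something like:

```python
import gc

learn_gen = None   # drop the reference to whatever was holding GPU memory
gc.collect()       # PyTorch keeps a cached allocation, but the memory is reusable again
```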
So we're going to create a critic, it's just an image item list from folder in the totally 01:07:00.520 |
usual way, and the classes will be the image gen and images. 01:07:07.960 |
We'll do a random split, because we want to know how well we're doing with the critic, so we need a validation set. 01:07:12.800 |
We just label it from folder in the usual way, add some transforms, databunch, normalize, 01:07:17.880 |
so it's a totally standard object classifier. 01:07:22.280 |
Okay, so we've got a totally standard classifier. 01:07:31.900 |
So here's one from the real images, generated images, generated images. 01:07:38.080 |
So it's going to try and figure out which class is which. 01:07:42.720 |
Okay, so we're going to use binary cross-entropy as usual; however, we're not going to use a ResNet for the critic. 01:07:59.200 |
And the reason we'll get into it in more detail in part two, but basically when you're doing 01:08:03.720 |
a GAN, you need to be particularly careful that the generator and the critic can't kind 01:08:14.440 |
of both push in the same direction and increase the weights out of control. 01:08:19.940 |
So we have to use something called spectral normalization to make GANs work nowadays. 01:08:27.400 |
So if you say GAN critic, that will give you a binary classifier suitable for GANs. 01:08:34.960 |
I strongly suspect we probably can use a ResNet here, we just have to create a pre-trained 01:08:39.320 |
ResNet with spectral norm, hope to do that pretty soon, we'll see how we go. 01:08:44.160 |
But as of now, this is kind of the best approach, there's this thing called GAN critic. 01:08:51.680 |
And again, the critic uses a slightly different way of averaging the different parts of the image when it does the loss. 01:09:03.220 |
So any time you're doing a GAN at the moment, you have to wrap your loss function with AdaptiveLoss. 01:09:08.600 |
Again, we'll look at the details in part two; for now, just know this is what you have to do. 01:09:14.700 |
So other than that, slightly odd loss function and that slightly odd architecture, everything 01:09:19.280 |
else is the same, we can call that to create our critic. 01:09:24.080 |
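(Putting those pieces together, a hedged sketch of the critic using the fastai helpers mentioned here - gan_critic, AdaptiveLoss, and the GAN-flavoured accuracy metric described just below; wd is the notebook's weight decay.)

```python
# Binary cross-entropy, wrapped so it's averaged the way a GAN critic needs.
loss_critic = AdaptiveLoss(nn.BCEWithLogitsLoss())

def create_critic_learner(data, metrics):
    # gan_critic() builds a spectral-norm binary classifier suitable for GANs.
    return Learner(data, gan_critic(), metrics=metrics, loss_func=loss_critic, wd=wd)

learn_critic = create_critic_learner(data_crit, accuracy_thresh_expand)
learn_critic.fit_one_cycle(6, 1e-3)
```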
Because we have this slightly different architecture and slightly different loss function, we did 01:09:27.920 |
a slightly different metric - this is the equivalent GAN version of accuracy for critics - and 01:09:34.320 |
then we can train it, and you can see it's 98% accurate at recognizing that kind of crappy image versus the real one. 01:09:44.280 |
And of course, we don't see the numbers here anymore, right, because these are the generated 01:09:48.040 |
images, the generator already knows how to get rid of those numbers that are written on top. 01:09:58.160 |
Now that we have pre-trained the generator and pre-trained the critic, we now need to 01:10:04.680 |
get it to ping-pong between training a little bit of each. 01:10:08.240 |
And the amount of time you spend on each of those things and the learning rates you use are still a bit fiddly to get right. 01:10:17.480 |
So we've created a GAN learner for you, which you just pass in your generator and your critic, 01:10:27.400 |
which we've just simply loaded here from the ones we just trained, and it will go ahead 01:10:33.800 |
and when you go learn.fit, it will do that for you. 01:10:37.360 |
It will figure out how much time to train the generator and then when to switch to training 01:10:40.800 |
the discriminator, the critic, and it will go back and forth. 01:10:45.200 |
These weights here relate to the fact that we don't only use the critic as the loss function. 01:10:51.880 |
If we only use the critic as the loss function, the GAN could get very good at creating pictures 01:10:57.520 |
that look like real pictures, but they actually have nothing to do with the original photo 01:11:05.860 |
So we actually add together the pixel loss and the critic loss. 01:11:10.560 |
And so those two losses are kind of on different scales. 01:11:15.240 |
So we multiply the pixel loss by something between about 50 and about 200. 01:11:21.280 |
Again, something in that range generally works pretty well. 01:11:27.700 |
Something else with GANs, GANs hate momentum when you're training them. 01:11:33.040 |
It kind of doesn't make sense to train them with momentum because you keep switching between 01:11:39.140 |
Maybe there are ways to use momentum, but I'm not sure anybody's figured it out. 01:11:43.000 |
This number here, when you create an Adam optimizer, is where the momentum goes, so you should set that to zero. 01:11:48.900 |
So anyway, if you're doing GANs, use these hyperparameters, it should work. 01:12:00.120 |
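(A sketch of that GAN learner setup: the weights_gen tuple is where the pixel-loss multiplier lives, and the Adam betas are where the momentum gets zeroed out. The switcher and the exact keyword names are from memory, so treat them as approximate.)

```python
from functools import partial
from torch import optim

switcher = partial(AdaptiveGANSwitcher, critic_thresh=0.65)   # when to flip between generator and critic
learn = GANLearner.from_learners(
    learn_gen, learn_crit,
    weights_gen=(1., 50.),                             # critic loss plus 50x pixel loss
    show_img=False, switcher=switcher,
    opt_func=partial(optim.Adam, betas=(0., 0.99)),    # first beta = momentum = zero
    wd=wd)
learn.fit(40, 1e-4)
```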
So that's what GAN learner does, and so then you can go fit, and it trains for a while. 01:12:05.680 |
And one of the tough things about GANs is that these loss numbers, they're meaningless. 01:12:14.360 |
You can't expect them to go down, because as the generator gets better, it gets harder for the critic. 01:12:22.720 |
Then as the critic gets better, it gets harder for the generator. 01:12:31.480 |
So that's one of the tough things about training GANs, is it's kind of hard to know how they're doing. 01:12:36.960 |
So the only way to know how they're doing is to actually take a look at the results from time to time. 01:12:43.200 |
And so if you put show image equals true here, it will actually print out a sample after 01:12:49.920 |
I haven't put that in the notebook because it makes it too big for the repo, but you 01:12:55.680 |
So I've just put the results at the bottom, and here it is. 01:13:05.880 |
We already knew how to get rid of the numbers, but we now don't really have that kind of 01:13:12.240 |
And it's definitely sharpening up this little kitty cat quite nicely. 01:13:23.720 |
There's some weird kind of noise going on here. 01:13:28.560 |
It's certainly a lot better than the horrible original. 01:13:40.120 |
Like here, these things ought to be eyeballs, and they're not. 01:13:48.000 |
Well, our critic doesn't know anything about eyeballs. 01:13:52.000 |
And even if it did, it wouldn't know that eyeballs are particularly important. 01:13:57.760 |
Like when we see a cat without eyes, it's a lot less cute. 01:14:02.600 |
I mean, I'm more of a dog person, but it just doesn't know that this is a feature that matters. 01:14:18.520 |
Particularly because the critic, remember, is not a pre-trained network. 01:14:21.440 |
So I kind of suspect that if we replace the critic with a pre-trained network that's been 01:14:26.160 |
pre-trained on ImageNet but is also compatible with GANs, it might do a better job here. 01:14:31.760 |
But it's definitely a shortcoming of this approach. 01:14:42.120 |
And then after the break, I will show you how to find the cat's eyeballs again. 01:14:48.880 |
For what kind of problems do you not want to use UNETs? 01:14:56.480 |
Well, UNETs are for when the size of your output is similar to the size of your input and kind of aligned with it. 01:15:08.880 |
There's no point kind of having cross-connections if that level of spatial resolution in the output isn't needed. 01:15:16.600 |
So any kind of generative modeling and segmentation is generative modeling. 01:15:23.440 |
It's generating a picture which is a mask of the original objects. 01:15:29.840 |
So probably anything where you want that resolution of the output to be of the same kind of fidelity as the resolution of the input. 01:15:39.080 |
Obviously, something like a classifier makes no sense. 01:15:42.160 |
In a classifier, you just want the downsampling path, because at the end, you just want a 01:15:48.160 |
single number, which is like, is it a dog or a cat, or what kind of pet is it, or whatever. 01:16:00.160 |
Just before we leave GANs, I'll just mention there's another notebook you might be interested 01:16:09.920 |
When GANs started a few years ago, people generally used them to kind of create images 01:16:18.000 |
out of thin air, which I personally don't think is a particularly useful or interesting 01:16:23.840 |
thing to do, but it's kind of a good, I don't know, it's a good research exercise, I guess. 01:16:30.000 |
So we implemented this wGAN paper, which was kind of really the first one to do a somewhat 01:16:36.680 |
adequate job, somewhat easily, and so you can see how to do that with the fast AI library. 01:16:43.280 |
It's kind of interesting, because the dataset we use is this LSUN bedrooms dataset, which 01:16:51.120 |
we've provided in our URLs, which just, as you can see, has bedrooms, lots and lots of bedrooms. 01:16:59.560 |
And the approach, you'll see in the prose here that Sylvain wrote, the approach that we use 01:17:08.520 |
in this case is to just say, can we create a bedroom? 01:17:14.160 |
And so what we actually do is that the input to the generator isn't an image that we clean up. 01:17:23.960 |
We actually feed to the generator random noise. 01:17:27.820 |
And so then the generator's task is, can you turn random noise into something which the 01:17:33.300 |
critic can't tell the difference between that output and a real bedroom? 01:17:38.860 |
And so we're not doing any pre-training here or any of the stuff that makes this fast and easy. 01:17:48.360 |
So this is a very traditional approach, but you can still see, you still just go, you 01:17:52.200 |
know, gan learner, and there's actually a wGAN version, which is, you know, this kind 01:17:56.160 |
of older style approach, but you just pass in the data and the generator and the critic 01:18:00.160 |
in the usual way, and you call fit, and you'll see, in this case we have a show image on, 01:18:08.400 |
you know, after epoch one, it's not creating great bedrooms or two or three, and you can 01:18:12.720 |
really see that in the early days of these kinds of gans, it doesn't do a great job of 01:18:16.400 |
anything, but eventually after, you know, a couple of hours of training, it's producing somewhat bedroom-ish things. 01:18:29.440 |
So anyway, it's a notebook you can have a play with, and it's a bit of fun. 01:18:35.520 |
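(For reference, a rough sketch of that notebook's setup, assuming fastai v1's basic_generator/basic_critic helpers and the GANLearner.wgan constructor; argument names are approximate.)

```python
from functools import partial
from torch import optim

generator = basic_generator(in_size=64, n_channels=3, n_extra_layers=1)
critic    = basic_critic(in_size=64, n_channels=3, n_extra_layers=1)

# Random noise goes in; (eventually) bedroom-ish images come out.
learn = GANLearner.wgan(data, generator, critic,
                        opt_func=partial(optim.Adam, betas=(0., 0.99)), wd=0.)
learn.fit(30, 2e-4)
```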
So I was very excited when we got fast.ai to the point in the last week or so that we 01:18:47.720 |
had gans working in a way where kind of API-wise, they're far more concise and more flexible 01:18:53.720 |
than any other library that exists, but also kind of disappointed that they take a long 01:18:59.840 |
time to train, and the outputs are still like so-so, and so the next step was like, well, can we get rid of the GAN entirely? 01:19:07.920 |
So the first step with that, I mean, obviously, the thing we really want to do is come up with a better loss function. 01:19:13.040 |
We want a loss function that does a good job of saying this is a high-quality image without 01:19:19.480 |
having to go over all the gan trouble, and preferably it also doesn't just say it's a 01:19:23.640 |
high-quality image, but it's an image which actually looks like the thing it's meant to. 01:19:29.200 |
So the real trick here comes back to this paper from a couple of years ago, perceptual 01:19:33.720 |
losses for real-time style transfer and super resolution. 01:19:37.600 |
Justin Johnson et al. created this thing they call perceptual losses. 01:19:42.080 |
It's a nice paper, but I hate this term because there's nothing particularly perceptual about them. 01:19:49.080 |
So in the fastai library, you'll see this referred to as feature losses. 01:19:53.800 |
And it shares something with gans, which is that after we go through our generator, which 01:19:59.920 |
they call the image transform net, and you can see it's got this kind of U-Net-shaped thing. 01:20:04.720 |
They didn't actually use U-Nets, because at the time this came out, nobody in the machine learning world had much heard of them. 01:20:15.600 |
I should mention, like, in these architectures where you have a downsampling path followed 01:20:21.080 |
by the upsampling path, the downsampling path is very often called the encoder. 01:20:27.000 |
As you saw in our code, actually, we called that the encoder. 01:20:30.200 |
And the upsampling path is very often called the decoder. 01:20:35.080 |
In generative models, generally, including generative text models, neural translation, 01:20:41.760 |
stuff like that, they tend to be called the encoder and the decoder, two pieces. 01:20:45.880 |
So we have this generator, and we want a loss function that says, you know, is the thing 01:20:52.920 |
that it's created like the thing that we want. 01:20:56.320 |
And so the way they do that is they take the prediction -- remember Y hat is what we normally 01:21:00.680 |
use for a prediction from a model -- we take the prediction and we put it through a pre-trained ImageNet network. 01:21:09.080 |
So at the time that this came out, the pre-trained image network they were using was VGG. 01:21:15.120 |
People still -- it's kind of old now, but people still tend to use it because it works well enough for this. 01:21:23.320 |
So they take the prediction and they put it through VGG, the pre-trained image net network. 01:21:30.960 |
And so normally the output of that would tell you, hey, is this generated thing, you know, 01:21:36.840 |
a dog or a cat or an airplane or a fire engine or whatever, right? 01:21:42.200 |
But in the process of getting to that final classification, it goes through lots of different layers. 01:21:47.640 |
And in this case, they've color-coded all the layers with the same grid size in the feature map with the same color. 01:21:53.640 |
So every time we switch colors, we're switching grid size. 01:21:56.520 |
So there's a stride-2 conv, or in VGG's case they still used to use max pooling layers, which is a similar idea. 01:22:04.840 |
And so what we could do is say, hey, let's not take the final output of the VGG model 01:22:10.080 |
on this generated image, but let's take something in the middle. 01:22:17.000 |
Let's take the activations of some layer in the middle. 01:22:20.960 |
So those activations might be a feature map of like 256 channels by 28 by 28, say. 01:22:30.500 |
And so those kind of 28 by 28 grid cells will kind of roughly semantically say things like, 01:22:35.280 |
hey, in this part of that 28 by 28 grid, is there something that looks kind of furry? 01:22:40.400 |
Or is there something that looks kind of shiny? 01:22:42.280 |
Or is there something that looks kind of circular? 01:22:43.440 |
Or is there something that kind of looks like an eyeball or whatever? 01:22:47.000 |
So what we do is that we then take the target, so the actual Y value, and we put it through 01:22:53.480 |
the same pre-trained VGG network, and we pull out the activations at the same layer, and 01:23:01.000 |
So it'll say, OK, in the real image, grid cell 1, 1 of that 28 by 28 feature map is 01:23:11.760 |
furry and blue and round shaped, and in the generated image, it's furry and blue and not round. 01:23:23.620 |
So that ought to go a long way towards fixing our eyeball problem, because in this case, 01:23:30.040 |
the feature map is going to say, there's eyeballs here-- sorry, here-- but there isn't here. 01:23:41.140 |
So that's what we call feature losses, or Johnson et al. called perceptual losses. 01:23:52.760 |
So to do that, we're going to use the Lesson 7 Super Res notebook. 01:24:02.800 |
And this time, the task we're going to do is kind of the same as the previous task, 01:24:08.000 |
but I wrote this notebook a little bit before the GAN notebook. 01:24:11.800 |
Before I came up with the idea of putting text on it and having a random JPEG quality. 01:24:18.720 |
There's no text written on top, and it's 96 by 96. 01:24:23.640 |
And it's before I realized what a great word "crappify" is, so it's called resize. 01:24:29.100 |
So here's our crappy images and our original images, kind of a similar task to what we 01:24:38.240 |
So I'm going to try and create a loss function which does this. 01:24:47.380 |
So the first thing I do is I define a base loss function, which is basically like, how 01:24:54.640 |
am I going to compare the pixels and the features? 01:25:08.240 |
So any time you see base loss, we mean L1 loss. 01:25:20.280 |
In VGG, there's an attribute called .features, which contains the convolutional part of the model. 01:25:27.200 |
So here's the convolutional part of the VGG model. 01:25:30.680 |
Because we don't need the head, because we only want the intermediate activations. 01:25:37.340 |
We'll put it into eval mode, because we're not training it. 01:25:41.120 |
And we'll turn off requires_grad, because we don't want to update the weights of this 01:25:46.840 |
We're just using it for inference, for the loss. 01:25:50.840 |
So then let's enumerate through all the children of that model and find all of the max pooling 01:25:56.760 |
Because in the VGG model, that's where the grid size changes. 01:26:01.320 |
And as you can see from this picture, we kind of want to grab features from every time just before the grid size changes. 01:26:13.280 |
So there's our list of layer numbers just before the max pooling layers. 01:26:21.160 |
And so all of those are ReLU layers, not surprisingly. 01:26:27.200 |
So those are where we want to grab some features from. 01:26:34.360 |
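(In code, that setup looks something like this - base_loss, the frozen VGG body, and the indices of the layers just before each max pool; requires_grad and children here are the fastai helper functions.)

```python
from torchvision.models import vgg16_bn

base_loss = F.l1_loss                                # "base loss" always means L1 here

vgg_m = vgg16_bn(True).features.cuda().eval()        # just the convolutional part, no head
requires_grad(vgg_m, False)                          # inference only - never update these weights

# Indices of the layers just before each max pool, i.e. just before the grid size changes.
blocks = [i - 1 for i, o in enumerate(children(vgg_m)) if isinstance(o, nn.MaxPool2d)]
```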
So here's our feature_loss class, which is going to implement this idea. 01:26:40.140 |
So basically, when we call the feature_loss class, we're going to pass it some pre-trained 01:26:50.360 |
That's the model which contains the features which we want our feature loss to be calculated from. 01:26:56.160 |
So we can go ahead and grab all of the layers from that network that we want the features from. 01:27:07.760 |
So we're going to need to hook all of those outputs. 01:27:10.360 |
Because remember, that's how we grab intermediate layers in PyTorch is by hooking them. 01:27:15.760 |
So this is going to contain our hooked outputs. 01:27:22.160 |
So now, in the forward of feature_loss, we're going to make features passing in the target. 01:27:28.960 |
So this is our actual Y, which is just going to call that VGG model and go through all 01:27:33.760 |
of the stored activations and just grab a copy of them. 01:27:39.320 |
And so we're going to do that both for the target, call that out_feat, and for the input. 01:27:48.520 |
And so now, let's calculate the L1 loss between the pixels. 01:27:55.700 |
Because we still want the pixel loss a little bit. 01:27:58.140 |
And then let's also go through all of those layers features and get the L1 loss on them. 01:28:08.000 |
So we're basically going through every one of these ends of each block and grabbing the activations there. 01:28:19.360 |
So that's going to end up in this list called feature_losses, which I then just sum up. 01:28:28.880 |
And by the way, the reason I do it as a list is because we've got this nice little callback 01:28:33.180 |
that if you put them into a thing called .metrics in your loss function, it'll print out all 01:28:38.240 |
of the separate layer loss amounts for you, which is super handy. 01:28:47.880 |
That's our perceptual loss or feature_loss class. 01:28:51.060 |
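(A condensed sketch of that class - the notebook's version also adds the gram-matrix losses mentioned a bit further on, and the layer choice and weights at the bottom are illustrative; hook_outputs is the fastai hooking helper.)

```python
class FeatureLoss(nn.Module):
    def __init__(self, m_feat, layer_ids, layer_wgts):
        super().__init__()
        self.m_feat = m_feat
        self.loss_features = [self.m_feat[i] for i in layer_ids]
        self.hooks = hook_outputs(self.loss_features, detach=False)   # grab intermediate activations
        self.wgts = layer_wgts
        self.metric_names = ['pixel'] + [f'feat_{i}' for i in range(len(layer_ids))]

    def make_features(self, x, clone=False):
        self.m_feat(x)                                   # run VGG; the hooks store the activations
        return [(o.clone() if clone else o) for o in self.hooks.stored]

    def forward(self, input, target):
        out_feat = self.make_features(target, clone=True)   # features of the real image
        in_feat  = self.make_features(input)                # features of the generated image
        self.feat_losses  = [base_loss(input, target)]      # keep a bit of plain pixel loss
        self.feat_losses += [base_loss(f_in, f_out) * w
                             for f_in, f_out, w in zip(in_feat, out_feat, self.wgts)]
        self.metrics = dict(zip(self.metric_names, self.feat_losses))  # picked up by LossMetrics
        return sum(self.feat_losses)

    def __del__(self): self.hooks.remove()

feat_loss = FeatureLoss(vgg_m, blocks[2:5], [5, 15, 2])
```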
And so now we can just go ahead and train a unit in the usual way with our data and our 01:28:55.160 |
pre-trained architecture, which is a ResNet-34, passing in our loss function, which is using our pre-trained VGG model. 01:29:02.920 |
And this is that callback I mentioned, loss_metrics, which is going to print out all the different feature-layer losses. 01:29:09.760 |
These (the blur and the weight norm arguments) are two things that we'll learn about in part two of the course, but you should feel free to use them already. 01:29:14.720 |
I just created a little function called do_fit that does fit one cycle and then saves the model. 01:29:23.020 |
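(Put together, the learner and the little do_fit helper look roughly like this; the exact keyword names are from memory, so treat them as approximate.)

```python
arch = models.resnet34
wd, lr = 1e-3, 1e-3

learn = unet_learner(data, arch, wd=wd, loss_func=feat_loss,
                     callback_fns=LossMetrics,               # prints each feature-layer loss
                     blur=True, norm_type=NormType.Weight)   # the "two things" for part two

def do_fit(save_name, lrs=slice(lr), pct_start=0.9):
    learn.fit_one_cycle(10, lrs, pct_start=pct_start)
    learn.save(save_name)
    learn.show_results(rows=1, imgsize=5)

do_fit('1a', slice(lr*10))      # frozen downsampling path first
learn.unfreeze()
do_fit('1b', slice(1e-5, lr))   # then fine-tune the whole thing
```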
So as per usual, because we're using a pre-trained network in our UNet, we start with frozen 01:29:29.240 |
layers for the downsampling path, train for a while, and as you can see, we get not only 01:29:34.720 |
the loss, but also the pixel loss and the loss at each of our feature layers. 01:29:40.000 |
And then also something we'll learn about in part two called gram_loss, which I don't 01:29:45.560 |
think anybody's used for super resolution before as far as I know, but as you'll see, it turns out to work really well. 01:29:52.160 |
So that's eight minutes, so much faster than a GAN. 01:29:55.960 |
And already, as you can see, this is our output, modeled output, pretty good. 01:30:01.500 |
So then we unfreeze and train some more, and it's a little bit better. 01:30:07.720 |
And then let's switch up to double the size, and so we need to also halve the batch size to avoid running out of GPU memory. 01:30:14.880 |
And freeze again and train some more, so it's now taking half an hour. 01:30:24.320 |
So all in all, we've done about an hour and 20 minutes of training. 01:30:33.920 |
It knows that eyes are important, so it's really made an effort. 01:30:36.920 |
It knows that fur is important, so it's really made an effort. 01:30:39.780 |
So it started with something with JPEG artifacts around the ears and all this mess and eyes 01:30:47.920 |
that are just kind of vague, light blue things, and it really created a lot of texture. 01:30:53.880 |
This cat is clearly kind of like looking over the top of one of those little clawing frames 01:30:59.480 |
covered in fuzz, so it actually recognized that this thing is probably kind of a carpety 01:31:04.440 |
material, and it's created a carpety material for us. 01:31:12.080 |
So talking of remarkable, we can now - well, I've never seen outputs like this before without a GAN. 01:31:24.520 |
So I was just so excited when we were able to generate this. 01:31:31.200 |
So if you create your own crappification functions and train this model, you'll build stuff that nobody's built before. 01:31:38.120 |
Because like nobody else's that I know of is doing it this way. 01:31:45.680 |
What we can now do is we can now, instead of starting with our low res, I actually stored 01:31:53.440 |
another set at size 256, which are called medium res. 01:31:57.600 |
So let's see what happens if we upsize a medium res. 01:32:14.480 |
So you can see there's still a lot of room for improvement. 01:32:16.860 |
Like you see the lashes here are very pixelated. 01:32:21.920 |
Areas where there should be hair here are just kind of fuzzy. 01:32:24.880 |
So watch this area as I hit down on my keyboard. 01:32:31.360 |
You know, it's taken a medium res image and it's made a totally clear thing here. 01:32:41.360 |
The eyeball here is just kind of a general blue thing. 01:32:46.040 |
Here it's added all the right texture, you know. 01:32:49.780 |
So I just think this is super exciting, you know. 01:32:54.080 |
Here's a model I trained in an hour and a half using standard stuff that you've all learned 01:32:59.840 |
about a unit, a pre-trained model, feature loss function, and we've got something which 01:33:05.680 |
can turn that into that or, you know, this absolute mess into this. 01:33:14.680 |
And like it's really exciting to think what could you do with that, right? 01:33:19.660 |
So one of the inspirations here has been a guy called Jason Antic. 01:33:26.840 |
And Jason was a student in the course last year. 01:33:34.160 |
And what he did very sensibly was decide to basically nearly quit his job and work 01:33:44.080 |
four days a week - or really six days a week - on studying deep learning. 01:33:47.760 |
And as you should do, he created a kind of capstone project. 01:33:51.240 |
And his project was to combine GANs and feature losses together. 01:33:57.320 |
And his crappification approach was to take color pictures and make them black and white. 01:34:05.200 |
So he took the whole of ImageNet, created a black and white ImageNet, and then trained a model to re-colorize it. 01:34:12.920 |
And now he's got these actual old photos from the 19th century that he's turning into color. 01:34:25.520 |
The model thought, oh, that's probably some kind of copper kettle. 01:34:32.240 |
They're probably like different colors to the wall. 01:34:38.200 |
Maybe it would be reflecting stuff outside, you know. 01:34:53.560 |
Like you can take our feature loss and our GAN loss and combine them. 01:34:58.640 |
So I'm very grateful to Jason, because he's helped us build this lesson. 01:35:03.400 |
And it's been really nice, because we've been able to help him, too, because he hadn't realized 01:35:08.520 |
that he can use all this pre-training and stuff. 01:35:10.480 |
And so hopefully you'll see DeOldify in the next couple of weeks be even better at de-oldification. 01:35:16.560 |
But hopefully you all can now add other kinds of decrappification methods as well. 01:35:23.840 |
So I like every course, if possible, to show something totally new, because then every 01:35:33.600 |
student has a chance to basically build things that have never been built before. 01:35:36.920 |
So this is kind of that thing, you know, but between the much better segmentation results 01:35:42.460 |
and these much simpler and faster decrappification results, I think you can build some really cool stuff. 01:35:55.040 |
Is it possible to use similar ideas to UNET and GANs for NLP? 01:35:59.960 |
For example, if I want to tag the verbs and nouns in a sentence or create a really good 01:36:11.920 |
It's a pretty new area, but there's a lot of opportunities there. 01:36:15.160 |
And we'll be looking at some in a moment, actually. 01:36:24.000 |
So I actually tried training this -- well, I actually tried testing this on this -- remember 01:36:30.040 |
this picture I showed you with a slide last lesson? 01:36:33.320 |
And it's a really rubbishy-looking picture, and I thought, what would happen if we tried 01:36:36.440 |
running this just through the exact same model and it changed it from that to that? 01:36:45.480 |
You can see something it didn't do, which is this weird discoloration. 01:36:49.280 |
It didn't fix it, because I didn't crappify things with weird discoloration, right? 01:36:53.520 |
So if you want to create really good image restoration, like I say, you need a really good crappification function. 01:37:01.480 |
So here's what we've learned so far, right, in the course, some of the main things. 01:37:08.200 |
So we've learned that neural nets consist of sandwich layers of affine functions, which 01:37:15.680 |
are basically matrix multiplications, or a slightly more general version of that, and nonlinearities, like ReLU. 01:37:21.520 |
And we learned that the results of those calculations are called activations, and the things that 01:37:26.640 |
go into those calculations that we learn are called parameters, and that the parameters 01:37:31.520 |
are initially randomly initialized, or we copy them over from a pre-trained model, and 01:37:36.480 |
then we train them with SGD or faster versions, and we learned that convolutions are a particular 01:37:42.720 |
affine function that work great for autocorrelated data, so things like images and stuff. 01:37:48.560 |
We learned about batch norm, dropout, data augmentation and weight decay as ways of regularizing 01:37:54.600 |
models, and also batch norm helps train models more quickly. 01:37:57.640 |
And then today we've learned about res/dense blocks. 01:38:02.760 |
We've learned a lot about image classification and regression, embeddings, categorical and 01:38:07.400 |
continuous variables, collaborative filtering, language models and NLP classification, and then segmentation, U-Nets and GANs. 01:38:15.820 |
So go over these things and make sure that you feel comfortable with each of them. 01:38:21.880 |
If you've only watched this series once, you definitely won't. 01:38:26.140 |
People normally watch it three times or so to really understand the detail. 01:38:38.540 |
So that's the last thing we're going to do, RNNs. 01:38:42.200 |
So RNNs, I'm going to introduce a little kind of diagrammatic method here to explain RNNs. 01:38:48.160 |
And the diagrammatic method, I'll start by showing you a basic neural net with a single hidden layer. 01:38:56.960 |
A rectangle is an input, so that'll be batch size by number of inputs. 01:39:01.040 |
So kind of, you know, batch size by number of inputs. 01:39:10.040 |
An arrow means a layer, broadly defined, such as a matrix product followed by ReLU. 01:39:21.840 |
So in this case, we have one set of hidden activations. 01:39:25.660 |
And so given that the input was number of inputs, this here is a matrix of number of inputs by number of activations. 01:39:34.580 |
So the output will be batch size by number of activations. 01:39:38.960 |
It's really important you know how to calculate these shapes. 01:39:41.480 |
So use learn.summary() a lot to see all the shapes. 01:39:48.620 |
So that means it's another layer, matrix product followed by non-linearity. 01:39:51.880 |
In this case, we're going to the output, so we use softmax. 01:39:59.680 |
And so this matrix product will be number of activations by number of classes. 01:40:03.240 |
So our output is batch size by number of classes. 01:40:06.160 |
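(A tiny shape check of what's being described, in plain PyTorch with made-up sizes.)

```python
import torch, torch.nn as nn, torch.nn.functional as F

bs, n_in, n_hid, n_classes = 64, 10, 50, 5        # made-up sizes
x = torch.randn(bs, n_in)                         # batch size x number of inputs
h = F.relu(nn.Linear(n_in, n_hid)(x))             # batch size x number of activations
out = nn.Linear(n_hid, n_classes)(h)              # batch size x number of classes
print(x.shape, h.shape, out.shape)                # [64, 10]  [64, 50]  [64, 5]
```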
So let's reuse that key - remember, triangle is output, circle is activations (hidden state), rectangle is input. 01:40:20.420 |
So let's now imagine that we wanted to get a big document, split it into sets of three 01:40:27.840 |
words at a time, and grab each set of three words and then try to predict the third word 01:40:36.560 |
So if we had the data set in place, we could grab word one as an input, chuck it through 01:40:42.080 |
an embedding, create some activations, pass that through a matrix product and non-linearity, 01:40:53.960 |
grab the second word, put it through an embedding, and then we could either add those two things together or concatenate them. 01:41:02.080 |
Generally speaking, when you see kind of two sets of activations coming together in a diagram, 01:41:08.400 |
you normally have a choice of concatenate or add. 01:41:13.200 |
And that's going to create a second bunch of activations, and then you can put it through 01:41:16.640 |
one more fully connected layer and softmax to create an output. 01:41:23.160 |
So that would be a totally standard, fully connected neural net with one very minor tweak, 01:41:29.520 |
which is concatenating or adding at this point, which we could use to try to predict the third word. 01:41:41.120 |
So remember, arrows represent layer operations, and I removed in this one the specifics of 01:41:48.520 |
what they are because they're always an affine function followed by a non-linearity. 01:41:56.760 |
Let's go further. What if we wanted to predict word four using words one and two and three? 01:42:03.080 |
It's basically the same picture as last time, except with one extra input and one extra circle of activations. 01:42:07.980 |
But I want to point something out, which is each time we go from rectangle to circle, 01:42:15.720 |
we're doing the same thing. We're doing an embedding, which is just a particular kind 01:42:20.320 |
of matrix multiply, where you have a one-hot encoded input. 01:42:24.740 |
Each time we go from circle to circle, we're basically taking one piece of hidden state, 01:42:31.000 |
one set of activations, and turning it into another set of activations by saying we've now got the next word. 01:42:37.280 |
And then when we go from circle to triangle, we're doing something else again, which is 01:42:41.000 |
we're saying let's convert the hidden state, these activations, into an output. 01:42:46.360 |
So it would make sense, so you can see I've colored each of those arrows differently. 01:42:50.680 |
So each of those arrows should probably use the same weight matrix, because it's doing the same thing. 01:42:57.800 |
So why would you have a different set of embeddings for each word, or a different set of -- a 01:43:02.320 |
different matrix to multiply by to go from this hidden state to this hidden state, versus this one to this one? 01:43:08.800 |
So this is what we're going to build. So we're now going to jump into human numbers, which 01:43:22.080 |
is the lesson7-human-numbers notebook, and this is the dataset that I created, which literally 01:43:25.960 |
just contains all the numbers from one to 9,999 written out in English. 01:43:31.880 |
And we're going to try and create a language model that can predict the next word in this 01:43:36.320 |
document. It's just a toy example for this purpose. 01:43:41.240 |
So in this case, we only have one document, and that one document is the list of numbers. 01:43:47.320 |
So we can use a text list to create an item list with text in for the training and the 01:43:52.200 |
validation. In this case, the validation set is the numbers from 8,000 onwards, and the training set is one to 8,000. 01:43:59.600 |
We can combine them together, turn that into a data bunch. 01:44:04.660 |
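(The data setup is along these lines - the file names and the exact TextList / label_for_lm calls are from memory, so treat them as approximate.)

```python
from fastai.text import *

bs = 64
path = untar_data(URLs.HUMAN_NUMBERS)

def readnums(d): return [', '.join(o.strip() for o in open(path/d).readlines())]
train_txt, valid_txt = readnums('train.txt'), readnums('valid.txt')

train = TextList(train_txt, path=path)
valid = TextList(valid_txt, path=path)
src = ItemLists(path=path, train=train, valid=valid).label_for_lm()
data = src.databunch(bs=bs)
```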
So we only have one document. So train zero is the document. Grab its dot text. That's 01:44:09.200 |
how you grab the contents of a text list, and here are the first 80 characters. 01:44:15.080 |
It starts with a special token, XXBOS. Anything starting with XX is a special fast AI token. 01:44:21.080 |
BOS is the beginning of stream token. It basically says this is the start of a document. It's 01:44:26.480 |
very helpful in NLP to know when documents start so that your models can learn to recognize 01:44:31.920 |
them. The validation set contains 13,000 tokens, 01:44:36.360 |
so 13,000 words or punctuation marks, because everything between spaces is a separate token. 01:44:44.640 |
The batch size that we asked for was 64. And then by default, it uses something called 01:44:54.640 |
BPTT of 70. BPTT, as we briefly mentioned, stands for backprop through time. 01:45:01.440 |
That's the sequence length. So with each of our 64 document segments, we split it up into 01:45:12.080 |
lists of 70 words that we look at at one time. 01:45:15.840 |
So what we do is we grab this for the validation set, an entire string of 13,000 tokens, and 01:45:24.040 |
then we split it into 64 roughly equal sized sections. People very, very, very often think 01:45:32.440 |
I'm saying something different. I did not say they are of length 64. They're not. They're 01:45:38.120 |
64 roughly equally sized segments. So we take the first 1/64 of the document, piece one. 01:45:46.160 |
1/64, piece two. And then for each of those 1/64 of the document, we then split those 01:45:56.960 |
into pieces of length 70. So each batch -- so let's now say, okay, for 01:46:04.720 |
those 13,000 tokens, how many batches are there? Well, divide by batch size and divide 01:46:10.080 |
by 70. So there's about 2.9 batches. So there's going to be three batches. So let's grab an 01:46:16.760 |
iterator for our data loader, grab one, two, three batches, the X and the Y, and let's add 01:46:23.840 |
up the number of elements, and we get back slightly less than this because there's a 01:46:28.920 |
little bit left over at the end that doesn't quite make up a full batch. 01:46:34.360 |
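(You can check that arithmetic yourself with something like this: roughly 13,000 tokens / 64 segments / a bptt of 70 is about 2.9, hence three batches.)

```python
it = iter(data.valid_dl)
x1, y1 = next(it)
x2, y2 = next(it)
x3, y3 = next(it)

total = x1.numel() + x2.numel() + x3.numel()
print(x1.shape, total)   # a little under the ~13,000 validation tokens: the last batch isn't full
```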
So this is the kind of stuff you should play around with a lot, lots of shapes and sizes 01:46:37.920 |
and stuff and iterators. As you can see, it's 95 by 64. I claimed it was going to be 70 by 01:46:45.680 |
64. That's because our data loader for language models slightly randomizes, BPTT, just to 01:46:53.800 |
give you a bit more kind of shuffling, get a bit more randomization. It helps the model. 01:47:00.600 |
And so here you can see the first batch of X. Remember, we've numericalized all these. 01:47:10.280 |
And here's the first batch of Y. And you'll see here, this is 2, 18, 10, 11, 8. This is 01:47:15.800 |
18, 10, 11, 8. So this one is offset by 1 from here because that's what we want to do with 01:47:23.120 |
a language model. We want to predict the next word. So after 2 should come 18. And after 01:47:30.160 |
18 should come 10. You can grab the vocab for this data set. And a vocab has a textify. 01:47:39.000 |
So if we look at the same thing but with textify, that'll just look it up in the vocab. So here 01:47:43.880 |
you can see XXBOS 8001. Whereas in the Y, there's no XXBOS. It's just 8001. So after XXBOS is 01:47:52.440 |
8, after 8 is 1, after 1000 is 1. And so then after we get 8023 comes X2. And look at this. 01:48:03.520 |
We're always looking at column 0. So this is the first batch, the first mini-batch. 01:48:08.720 |
Comes 8024 and then X3 all the way up to 8040. And so then we can go right back to the start 01:48:18.880 |
but look at batch 1. So index 1, which is batch number 2. And now we can continue. A 01:48:25.600 |
slight skip from 8040 to 8046. That's because the last mini-batch wasn't quite complete. 01:48:32.240 |
So what this means is that every mini-batch joins up with the previous mini-batch. So 01:48:41.840 |
you can go straight from X1, 0 to X2, 0. It continues. 8023, 8024, right? And so if you 01:48:50.320 |
look at the same thing for colon, comma, 1, you'll also see they join up. So all the mini-batches 01:48:57.060 |
join up. So that's the data. We can do show batch to see it. And here is our model which 01:49:09.040 |
is doing this. So this is just the code copied over. So it contains one embedding, i.e. the 01:49:25.640 |
green arrow, one hidden to hidden brown arrow layer, and one hidden to output. So each colored 01:49:35.800 |
arrow has a single matrix. And so then in the forward pass, we take our first input, 01:49:45.000 |
X0, and put it through input to hidden, the green arrow, create our first set of activations, 01:49:52.120 |
which we call H. Assuming that there is a second word, because sometimes we might be 01:49:58.320 |
at the end of a batch where there isn't a second word, assuming there is a second word, 01:50:02.160 |
then we would add to H the result of X1, put through the green arrow. Remember, that's i_h. 01:50:11.920 |
And then we would say, okay, our new H is the result of those two added together, put 01:50:20.600 |
through our hidden to hidden, orange arrow, and then ReLU, then batch norm. And then for 01:50:25.360 |
the second word, do exactly the same thing. And then finally, blue arrow, put it through 01:50:30.840 |
h_o. So that's how we convert our diagram to code. So nothing new here at all. So now 01:50:42.960 |
we can pop that in a learner and train it - about 46% accuracy. Let's take this code 01:50:49.720 |
and recognize it's pretty awful. There's a lot of duplicate code. And as coders, when 01:50:55.040 |
we see duplicate code, what do we do? We refactor. So we should refactor this into a loop. So 01:51:01.760 |
here we are. We've refactored it into a loop. So now we're going for each X, I and X and 01:51:06.800 |
doing it in the loop. Guess what? That's an RNN. An RNN is just a refactoring. It's not 01:51:16.320 |
anything new. This is now an RNN. And let's refactor our diagram from this to this. This 01:51:27.360 |
is the same diagram. But I've just replaced it with my loop. Does the same thing. So here 01:51:36.520 |
it is. It's got exactly the same in it. Literally exactly the same. Just popped a loop here. 01:51:41.400 |
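(A sketch of that loop-refactored model in the spirit of the notebook's code - nv is the vocab size and nh the number of hidden activations, both assumed to be defined already; the class name is made up.)

```python
import torch, torch.nn as nn, torch.nn.functional as F

class LoopModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)      # green arrow: input to hidden
        self.h_h = nn.Linear(nh, nh)         # orange arrow: hidden to hidden
        self.h_o = nn.Linear(nh, nv)         # blue arrow: hidden to output
        self.bn  = nn.BatchNorm1d(nh)

    def forward(self, x):
        h = torch.zeros(x.shape[0], nh).to(x.device)   # a bunch of zeros to add to
        for i in range(x.shape[1]):                    # the same operations for every word: an RNN
            h = h + self.i_h(x[:, i])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)                             # one prediction at the end of the sequence
```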
Before I start, I just have to make sure I've got a bunch of zeros to add to. And of course 01:51:47.600 |
I get exactly the same result when I train it. Okay. So next thing that you might think 01:51:53.760 |
then -- and one nice thing about the loop, though, is now this will work even if I'm 01:51:57.280 |
not predicting the fourth word from the previous three but the ninth word from the previous 01:52:01.560 |
eight. It will work for any arbitrarily length long sequence, which is nice. So let's up 01:52:07.120 |
the BPTT to 20 since we can now. And let's now say, okay, instead of just predicting 01:52:19.400 |
the nth word from the previous n minus 1, let's try to predict the second word from 01:52:25.760 |
the first and the third from the second and the fourth from the third and so forth. Because 01:52:29.920 |
previously -- look at our loss function. Previously we were comparing the result of our model 01:52:35.440 |
to just the last word of the sequence. It's very wasteful because there's a lot of words 01:52:39.560 |
in the sequence. So let's compare every word in X to every word in Y. So to do that, we 01:52:46.720 |
need to change this so it's not just one triangle at the end of the loop. But the triangle is 01:52:52.440 |
inside this, right? So that in other words, after every loop, predict, loop, predict, 01:53:00.360 |
loop, predict. So here's this code. It's the same as the previous code but now I've created 01:53:08.120 |
an array. And every time I go through the loop, I append h_o of h to the array. So now for 01:53:15.960 |
n inputs, I create n outputs. So I'm predicting after every word. Previously I had 46%. Now 01:53:23.640 |
I have 40%. Why is it worse? Well, it's worse because now, like when I'm trying to predict 01:53:31.080 |
the second word, I only have one word of state to use. Right? So like when I'm looking at 01:53:37.400 |
the third word, I only have two words of state to use. So it's a much harder problem for 01:53:42.200 |
it to solve. So the obvious way to fix this then would -- you know, the key problem is 01:53:47.640 |
here. I go H equals torch.zeros, like I reset my state to zero every time I start another 01:53:54.360 |
BPTT sequence. Well, let's not do that. Let's keep H. Right? And we can because remember 01:54:01.240 |
each batch connects to the previous batch. It's not shuffled like happens in image classification. 01:54:09.000 |
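(Concretely, the change about to be described looks roughly like this - state created in the constructor, put back into self.h at the end, and detached so gradients don't flow back through every previous batch; nv, nh and bs are assumed defined, and the class name is made up.)

```python
import torch, torch.nn as nn, torch.nn.functional as F

class StatefulModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)
        self.h_h = nn.Linear(nh, nh)
        self.h_o = nn.Linear(nh, nv)
        self.bn  = nn.BatchNorm1d(nh)
        self.h = torch.zeros(bs, nh).cuda()      # state created once, in the constructor (GPU assumed)

    def forward(self, x):
        res, h = [], self.h
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:, i])
            h = F.relu(self.h_h(h))
            res.append(self.h_o(self.bn(h)))     # predict after every word, not just the last
        self.h = h.detach()                      # keep the state for the next mini-batch
        return torch.stack(res, dim=1)
```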
So let's take this exact model and replicate it again. But let's move the creation of H 01:54:13.440 |
into the constructor. Okay. There it is. So it's now self.h. So this is now exactly the 01:54:20.160 |
same code. But at the end, let's put the new H back into self.h. So it's now doing the 01:54:25.800 |
same thing, but it's not throwing away that state. And so therefore now we actually get 01:54:33.160 |
above the original. We get all the way up to 54% accuracy. So this is what a real RNN 01:54:41.720 |
looks like. You always want to keep that state. But just keep remembering there's nothing 01:54:48.720 |
different about an RNN. It's a totally normal, fully connected neural net. It's just that 01:54:52.840 |
you've got a loop you refactored. What you could do, though, is at the end of your -- every 01:55:03.120 |
loop, you could not just spit out an output, but you could spit it out into another RNN. 01:55:07.920 |
So you could have an RNN going into an RNN. And that's nice because we've now got more 01:55:12.160 |
layers of computation. You would expect that to work better. Well, to get there, let's 01:55:18.960 |
do some more refactoring. So let's take this code and replace it with the equivalent built-in 01:55:25.720 |
PyTorch code, which is -- you just say that. So nn.RNN basically says do the loop for me. 01:55:33.000 |
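(A sketch of that refactor onto the built-in module, with the same caveats about assumed names; swapping nn.RNN for nn.GRU here is what gives the gated version discussed a bit further on.)

```python
import torch, torch.nn as nn

class RNNModel(nn.Module):
    def __init__(self, n_layers=1):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)
        self.rnn = nn.RNN(nh, nh, n_layers, batch_first=True)   # "do the loop for me"
        self.h_o = nn.Linear(nh, nv)
        self.h = torch.zeros(n_layers, bs, nh).cuda()            # same idea: keep the state (GPU assumed)

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), self.h)   # all the hidden states, plus the final one
        self.h = h.detach()
        return self.h_o(res)                     # an output for every input word
```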
We've still got the same embedding, the same output, the same batch norm, the same initialization 01:55:39.880 |
of H, but we just got rid of the loop. So one of the nice things about RNN is that you can 01:55:45.680 |
now say how many layers you want. So this is the same accuracy, of course. So here I've 01:55:53.280 |
done it with two layers. But here's the thing. When you think about this, right, think 01:56:00.480 |
about it without the loop. It looks like this, right? It's like -- it keeps on going -- and 01:56:06.360 |
we've got a BPTT of 20, so there's 20 layers of this. And we know from that visualizing 01:56:13.320 |
the loss landscapes paper that deep networks have awful, bumpy loss surfaces. So when 01:56:20.680 |
you start creating long time scales and multiple layers, these things get impossible to train. 01:56:31.520 |
So there's a few tricks you can do. One thing is you can add skip connections, of course. 01:56:37.640 |
But what people normally do is instead they put inside -- instead of just adding these 01:56:43.280 |
together, they actually use a little mini neural net to decide how much of the green 01:56:48.680 |
arrow to keep and how much of the orange arrow to keep. And when you do that, you get something 01:56:53.640 |
that's either called a GRU or an LSTM depending on the details of that little neural net. 01:56:59.200 |
And we'll learn about the details of those neural nets in part two. They really don't 01:57:02.600 |
matter, though, frankly. So we can now say let's create a GRU instead, so it's just like 01:57:07.920 |
what we had before, but it'll handle longer sequences in deeper networks. Let's use two 01:57:13.560 |
layers, and we're up to 75%. Okay. So that's RNNs. And the main reason I wanted to show 01:57:27.960 |
it to you was to remove the last remaining piece of magic. And this is one of the least 01:57:35.360 |
magical things we have in deep learning. It's just a refactored, fully connected network. 01:57:40.800 |
So don't let RNNs ever put you off. And with this approach where you basically have a sequence 01:57:48.760 |
of N inputs and a sequence of N outputs we've been using for language modeling, you can 01:57:53.360 |
use that for other tasks, right? For example, the sequence of outputs could be for every 01:57:58.440 |
word. There could be something saying is this something that is sensitive and I want to 01:58:01.980 |
anonymize or not? You know, so like is this private data or not? Or it could be a part 01:58:08.560 |
of speech tag for that word. Or it could be something saying, you know, how should that 01:58:15.840 |
word be formatted? Or whatever. And so these are called sequence labeling tasks, and so 01:58:21.360 |
you can use this same approach for pretty much any sequence labeling task. Or you can do 01:58:26.960 |
what I did in the earlier lesson, which is once you finish building your language model, 01:58:33.360 |
you can throw away the kind of this h_o bit and instead pop there a standard classification 01:58:42.480 |
head and then you can now do NLP classification, which as you saw earlier will give you state-of-the-art 01:58:49.160 |
results even on long documents. So this is a super valuable technique and not remotely 01:58:57.600 |
magical. Okay, so that's it, right? That's deep learning or at least, you know, the kind 01:59:05.880 |
of the practical pieces from my point of view. Having watched this one time, you won't get 01:59:17.120 |
it all. And I don't recommend that you do watch this so slowly that you get it all the 01:59:21.560 |
first time, but you go back and look at it again, take your time, and there'll be bits 01:59:27.200 |
that you go like, "Oh, now I see what he's saying," and then you'll be able to implement 01:59:31.080 |
things you couldn't implement before and you'll be able to dig in more than you before. So 01:59:34.560 |
definitely go back and do it again. And as you do, write code, not just for yourself, 01:59:40.640 |
but put it on GitHub. It doesn't matter if you think it's great code or not. The fact 01:59:45.640 |
that you're writing code and sharing it is impressive, and the feedback you'll get if 01:59:51.880 |
you tell people on the forum, "Hey, I wrote this code. It's not great, but it's my first 01:59:57.120 |
effort. Anything you see, jump out at you," people will say like, "Oh, that bit was done 02:00:02.320 |
well. Hey, but did you know for this bit you could have used this library and saved you 02:00:05.760 |
some time?" You'll learn a lot by interacting with your peers. As you've noticed, I've started 02:00:12.520 |
introducing more and more papers. Now, part two will be a lot of papers, and so it's a 02:00:17.320 |
good time to start reading some of the papers that have been introduced in this section. 02:00:24.160 |
All the bits that say derivation and theorems and lemmas, you can skip them. I do. They add 02:00:29.520 |
almost nothing to your understanding of practical deep learning. But the bits that say why are 02:00:36.600 |
we solving this problem, and what are the results, and so forth are really interesting. 02:00:42.540 |
And then try and write English prose. Not English prose that you want to be read by Jeff Hinton 02:00:51.200 |
and Yann LeCun, but English prose that you want to be read by you as of six months ago. 02:00:56.560 |
Because there's a lot more people in the audience of you as of six months ago than there is 02:01:02.360 |
of Jeffrey Hinton and Yann LeCun. That's the person you best understand. You know what 02:01:07.600 |
they need. Go and get help and help others. Tell us about your success stories. But perhaps 02:01:16.360 |
the most important one is get together with others. People's learning works much better 02:01:20.860 |
if you've got that social experience. So start a book club, get involved in meetups, create 02:01:27.640 |
study groups, and build things. And again, it doesn't have to be amazing. Just build 02:01:36.880 |
something that you think the world would be a little bit better if that existed. Or you 02:01:41.700 |
think it would be kind of slightly delightful to your two-year-old to see that thing. Or 02:01:46.600 |
you just want to show it to your brother the next time they come around to see what you're 02:01:49.120 |
doing. Whatever. Just finish something. Finish something. And then try and make it a bit 02:01:57.320 |
better. So for example, something I just saw this afternoon is the Elon Musk tweet generator. 02:02:09.320 |
So looking at lots of older tweets, creating a language model from Elon Musk, and then 02:02:14.520 |
creating new tweets such as humanity will also have an option to publish on its own 02:02:19.640 |
journey as an alien civilization. It will always, like all human beings, Mars is no 02:02:25.680 |
longer possible. AI will definitely be the central intelligence agency. Okay. So this 02:02:31.680 |
is great. I love this. And I love that Dave Smith wrote and said, "These are my first 02:02:37.920 |
ever commits. Thanks for teaching a finance guy how to build an app in eight weeks." Right? 02:02:43.560 |
So I think this is awesome. And I think clearly a lot of care and passion is being put into 02:02:50.520 |
this project. Will it systematically change the future direction of society as a whole? 02:02:59.720 |
Maybe not. But maybe Elon will look at this and think, "Oh, maybe I need to rethink my 02:03:05.380 |
method of prose." I don't know. I think it's great. And so, yeah. Create something. Put 02:03:12.240 |
it out there. Put a bit of yourself into it. Or get involved in fast AI. The fast AI project, 02:03:20.120 |
there's a lot going on. You know, you can help with documentation and tests, which might 02:03:24.760 |
sound boring, but you'd be surprised how incredibly not boring it is to, like, take a piece of 02:03:28.780 |
code that hasn't been properly documented and research it and understand it and ask 02:03:33.640 |
Sylvain and me on the forum what's going on. Why did you write it this way? We'll send 02:03:37.240 |
you off to the papers that we were implementing. You know, writing a test requires deeply understanding 02:03:42.200 |
that part of the machine learning world to understand how it's meant to work. So that's 02:03:46.800 |
always interesting. Stas Bekman has created this nice dev projects index which you can 02:03:53.060 |
go on to the forum in the fast AI dev section and find actually the dev project section 02:03:59.060 |
and find, like, here's some stuff going on that you might want to get involved in. Or 02:04:02.680 |
maybe there's stuff you want to exist. You can add your own. Create a study group. Dean 02:04:07.280 |
has already created a study group for San Francisco starting in January. This is how 02:04:10.720 |
easy it is to create a study group. Go on the forum, find your little time zone subcategory 02:04:15.920 |
and add a post saying let's create a study group. But make sure you give people a little 02:04:22.000 |
Google sheet to sign up, some way to actually do something. A great example is Pierre who's 02:04:28.380 |
been doing a fantastic job in Brazil of running study groups for the last couple of parts 02:04:34.400 |
of the course and he keeps posting these pictures of people having a good time and learning 02:04:40.000 |
deep learning together, creating wikis together, creating projects together. Great experience. 02:04:47.100 |
And then come back for part two, right, where we'll be looking at all of this interesting 02:04:53.400 |
stuff in particular going deep into the fast AI code base to understand how did we build 02:04:58.080 |
it exactly. We'll actually go through, as we were building it, we created notebooks 02:05:03.400 |
of like here is where we were at each stage. So we're actually going to see the software 02:05:06.920 |
development process itself. We'll talk about the process of doing research, how to read 02:05:11.760 |
academic papers, how to turn math into code, and then a whole bunch of additional types 02:05:17.080 |
of models that we haven't seen yet. So it'll be kind of like going beyond practical deep 02:05:21.800 |
learning into actually cutting edge research. So we've got five minutes to take some questions. 02:05:31.140 |
We had an AMA going on online and so we're going to have time for a couple of the highest 02:05:37.200 |
ranked AMA questions from the community. And the first one is by Jeremy's request, although 02:05:42.160 |
it's not the highest ranked. What's your typical day like? How do you manage your time across 02:05:47.160 |
so many things that you do? Yeah, I thought that I hear that all the time. So I thought 02:05:54.160 |
I should answer it. And I think I've got a few votes. Because I think people who come 02:06:01.880 |
to our study group are always shocked at how disorganized and incompetent I am. And so 02:06:09.120 |
I often hear people saying like, oh, wow, I thought you were like this deep learning 02:06:12.720 |
role model and I'd get to see how to be like you. And now I'm not sure what to be like 02:06:16.240 |
you at all. So yeah, it's for me, it's all about just having a good time with it. I never 02:06:26.160 |
really have many plans. I just try to finish what I start. If you're not having fun with 02:06:32.400 |
it, it's really, really hard to continue because there's a lot of frustration in deep learning 02:06:36.720 |
because it's not like writing a web app, where it's like, you know, authentication check, 02:06:41.840 |
you know, backend service watchdog check. Okay, user credentials check. You know, like you're 02:06:51.200 |
making progress. Where else for stuff like this and stuff that we've been doing the last 02:06:55.960 |
couple of weeks, it's just like, it's not working. It's not working. It's not working. 02:07:00.760 |
No, that also didn't work. That also didn't work until oh, my God, it's amazing. It's 02:07:05.360 |
a cat. That's kind of what it is, right? So you don't get that regular feedback. So yeah, 02:07:11.360 |
you know, you got to have fun with it. And so, so my, yeah, my day is kind of, you know, 02:07:19.560 |
I mean, the other thing I'll do, I'll say I don't, I don't do any meetings. I don't 02:07:24.320 |
do phone calls. I don't do coffees. I don't watch TV. I don't play computer games. I spend 02:07:29.920 |
a lot of time with my family, a lot of time exercising and a lot of time reading and coding 02:07:36.600 |
and doing things I like. So, you know, I think, you know, the main thing is just finish, finish 02:07:45.440 |
something like properly finish it. So when you get to that point where you think you're 02:07:50.120 |
80% of the way through, but you haven't quite created a read me yet and the install process 02:07:54.600 |
is still a bit clunky and you know, this is what 99% of GitHub projects look like. You'll 02:07:59.320 |
see the read me says to do, you know, complete baseline experiments, document, blah, blah, 02:08:06.880 |
blah. It's like, don't be that person. Like just do something properly and finish it and 02:08:13.080 |
maybe get some other people around you to work with you so that you're all doing it 02:08:22.360 |
What are the up and coming deep learning machine learning things that you are most excited 02:08:26.200 |
about? Also, you've mentioned last year that you are not a believer in reinforcement learning. 02:08:33.160 |
Yeah, I still feel exactly the same way as I did three years ago when we started this, 02:08:38.200 |
which is it's all about transfer learning. It's underappreciated. It's under researched. 02:08:44.200 |
Every time we put transfer learning into anything, we make it much better. You know, our academic 02:08:50.640 |
paper on transfer learning for NLP has, you know, helped be one piece of kind of changing 02:08:55.800 |
the direction of NLP this year. It's made it all the way to the New York Times, just 02:09:00.440 |
a stupid, obvious little thing that we threw together. So I remain excited about that. 02:09:06.160 |
I remain unexcited about reinforcement learning for most things. I don't see it used by normal 02:09:12.760 |
people for normal things, for nearly anything. It's an incredibly inefficient way to solve 02:09:17.720 |
problems which are often solved more simply and more quickly in other ways. It probably 02:09:22.480 |
has maybe a role in the world, but a limited one and not in most people's day-to-day work. 02:09:39.040 |
For someone planning to take part two in 2019, what would you recommend doing learning practicing 02:09:48.200 |
Just code. Yeah, just code all the time. I know it's perfectly possible I hear from people 02:09:53.000 |
who get to this point of the course and they haven't actually written any code yet. And 02:09:56.920 |
if that's you, it's okay. You know, you just go through and do it again and this time do 02:10:01.920 |
code and look at the shapes of your inputs and look at your outputs and make sure you 02:10:08.680 |
know how to grab a mini batch and look at its main and standard deviation and plot it. 02:10:15.080 |
There's so much material that we've covered. If you can get to a point where you can rebuild 02:10:23.960 |
those notebooks from scratch without too much cheating, when I say from scratch, I mean 02:10:30.960 |
using the fastai library, not from scratch from scratch, you'll be in the top echelon 02:10:38.400 |
of practitioners because you'll be able to do all of these things yourself and that's 02:10:41.800 |
really, really rare. And that'll put you in a great position for part two. Should we do one more? 02:10:48.240 |
It's nine o'clock. We always do one more. Where do you see the fast AI library going in the 02:10:56.040 |
Well, like I said, I don't make plans. I just piss around. So, I mean, our only plan for 02:11:05.640 |
fast AI as an organization is to make deep learning accessible as a tool for normal people 02:11:15.400 |
to use for normal stuff. So, as long as we need to code, we failed at that. So, the big 02:11:22.000 |
goal, because 99.8% of the world can't code. So, the main goal would be to get to a point 02:11:30.120 |
where it's not a library but it's a piece of software that doesn't require code. It certainly 02:11:34.280 |
shouldn't require a goddamn lengthy, hard-working course like this one. So, I want to get rid 02:11:41.840 |
of the course. I want to get rid of the code. I want to make it so you can just do useful 02:11:46.160 |
stuff quickly and easily. So, that's maybe five years? Yeah, maybe longer. 02:11:52.640 |
All right. Well, I hope to see you all back here for part two. Thank you.