
Lesson 7: Deep Learning 2019 - Resnets from scratch; U-net; Generative (adversarial) networks


Chapters

0:00
8:23 add a bit of random padding
11:01 start out creating a simple cnn
27:12 create your own variations of resnet blocks
56:43 create a generator learner
72:46 print out a sample after every epoch
85:17 using the pre-trained model
116:33 add skip connections

Transcript

Welcome to lesson seven, the last lesson of part one. This will be a pretty intense lesson, but don't let that bother you, because partly what I want to do is give you enough things to think about to keep you busy until part two. In fact, for some of the things we cover today, I'm not going to tell you all the details; I'll just point out a few things where I'll say, okay, we're not talking about that yet.

And so then come back in part two to get the details on some of these extra pieces. So today will be a lot of material pretty quickly, and it might require a few viewings to fully understand it all, a few experiments and so forth. That's kind of intentional. I'm going to give you stuff to keep you amused for a couple of months.

Wanted to start by showing some cool work done by a couple of students, Reshma and Npata01, who have developed an Android and an iOS app. And so check out Reshma's post on the forum about that, because they have a demonstration of how to create both Android and iOS apps that are actually on the Play Store and on the Apple App Store.

So that's pretty cool. They're the first ones I know of that are on the App Store that are using fastai. And let me also say a huge thank you to Reshma for all of the work she does, both for the fast.ai community and the machine learning community more generally, and also for the women in machine learning community in particular.

She does a lot of fantastic work, including providing lots of fantastic documentation and tutorials and community organizing and so many other things. So thank you, Reshma, and congrats on getting this app out there. We have lots of Lesson 7 notebooks today, as you see, and we're going to start with this one.

So the first notebook we're going to look at is Lesson 7 ResNet MNIST. And what I want to do is look at some of the stuff we started talking about last week around convolutions and convolutional neural networks and start building on top of them to create a fairly modern deep learning architecture, largely from scratch.

When I say from scratch, I'm not going to re-implement things we already know how to implement, but kind of use the pre-existing PyTorch bits of those. So we're going to use the MNIST dataset -- URLs.MNIST has the whole MNIST dataset. Often we've done stuff with a subset of it.

So in there, there's a training folder and a testing folder. And as I read this in, I'm going to show some more details about pieces of the data block API, so that you see what's going on. With the data block API so far, we've kind of said blah, blah, blah, blah, blah, and done it all in one cell.

But let's do them one cell at a time. So the first thing you say is what kind of item list do you have? So in this case, it's an item list of images. And then where are you getting the list of file names from? In this case, by looking in a folder recursively.

And that's where it's coming from. You can pass in arguments that end up going to Pillow, because Pillow or PIL is the thing that actually opens that for us. And in this case, these are black and white rather than RGB, so you have to use PIL's convert mode, convert_mode='L'.

For more details, refer to the Python imaging library documentation to see what their convert modes are. But this one is going to be grayscale, which is what MNIST is. So inside an item list is an items attribute. And the items attribute is kind of the thing that you gave it.

It's the thing that it's going to use to create your items. So in this case, the thing you gave it really is a list of file names. That's what it got from the folder. When you show images, normally it shows them in RGB. And so in this case, we want to use a binary color map.

So in fastai, you can set a default color map. For more information about cmap and color maps, refer to the matplotlib documentation. And so this will set the default color map for fastai. Okay. So our image item list contains 70,000 items. And it's a bunch of images that are 1 by 28 by 28.

Remember that PyTorch puts channel first. So they're 1 channel, 28 by 28. You might think, why aren't they just 28 by 28 matrices rather than a 1 by 28 by 28 rank 3 tensor? It's just easier that way. All the conv2d stuff and so forth works on rank 3 tensors.

So you want to include that unit axis at the start, and fastai will do that for you even when it's reading 1 channel images. So the .items attribute contains the thing that's kind of read to build the image, which in this case is the file name. But if you just index into an item list directly, you'll get the actual image object.

And so the actual image object has a show method. And so there's the image. So once you've got an image item list, you then split it into training versus validation. You nearly always want validation. If you don't, you can actually use the .no_split method to create a kind of empty validation set.

You can't skip it entirely. You have to say how to split. And one of the options is no split. And so remember, that's always the order. First create your item list, then decide how to split. In this case, we're going to do it based on folders. In this case, the validation folder for MNIST is called testing.

So in FastAI parlance, we use the same kind of parlance that Kaggle does, which is the training set is what you train on. The validation set has labels, and you do it for testing that your model's working. The test set doesn't have labels. And you use it for doing inference or submitting to a competition or sending it off to somebody who's held out those labels for vendor testing or whatever.

So just because a folder in your data set is called testing doesn't mean it's a test set. This one has labels, so it's a validation set. If you want to do inference on lots of things at a time rather than one thing at a time, you want to use the test= parameter in fastai to say this is stuff which has no labels that I'm just using for inference.

My split data is a training set and a validation set, as you can see. So inside the training set, there's a folder for each class. So now we can take that split data and say label from folder. So first you create the item list, then you split it, then you label it.

And so you can see now we have an x and a y, and the y are category objects. A category object is just a class, basically. So if you index into a label list, such as ll.train, you will get back an independent variable and a dependent variable, x and y.

So in this case, the X will be an image object, which I can show, and the Y will be a category object which I can print. That's the number eight category, and there's the eight. Next thing we can do is to add transforms. In this case, we're not going to use the normal get transforms function because we're doing digit recognition, and digit recognition, you wouldn't want to flip it left or right, that would change the meaning of it.

You wouldn't want to rotate it too much, that would change the meaning of it. Also because these images are so small, doing zooms and stuff is going to make them so fuzzy as to be unreadable. So normally for small images of digits like this, you just add a bit of random padding.

So I'll use the random padding function, rand_pad, which actually returns two transforms: the bit that does the padding and the bit that does the random crop. So you have to use star to, say, put both these transforms in this list. So now we can call transform. This empty array here is referring to the validation set transforms.

So no transforms for the validation set. Now we've got a transformed, labeled list. We can pick a batch size and call databunch, and then call normalize. In this case, we're not using a pre-trained model, so there's no reason to use ImageNet stats here. And if you call normalize like this, without passing in stats, it will grab a batch of data at random and use that to decide what normalization stats to use.
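To make that concrete, here is a minimal sketch of that data block pipeline as just described, written against the fastai v1 API of the time (ImageItemList and no_split were later renamed); the padding and batch size values here are just illustrative.

```python
from fastai.vision import *

path = untar_data(URLs.MNIST)

il = ImageItemList.from_folder(path, convert_mode='L')       # grayscale, so convert mode 'L'
defaults.cmap = 'binary'                                      # show 1-channel images with a binary color map

sd = il.split_by_folder(train='training', valid='testing')    # the "testing" folder is our validation set
ll = sd.label_from_folder()                                   # one folder per class inside the training set

tfms = ([*rand_pad(padding=3, size=28)], [])  # random pad + crop for training; no transforms for validation
ll = ll.transform(tfms)

data = ll.databunch(bs=128).normalize()       # no stats passed, so it normalizes from a random batch
```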

That's a good idea if you're not using a pre-trained model. So we've got a data bunch. And in that data bunch is a data set, which we've seen already. But what is interesting is that the training data set now has data augmentation, because we've got transforms. plot_multi is a fastai function that will plot the result of calling some function for each of this row by column grid.

So in this case, my function is just grab the first image from the training set. And because each time you grab something from the training set, it's going to load it from disk and it's going to transform it on the fly. So people sometimes ask like, how many transformed versions of the image do you create?

And the answer is kind of infinite. Each time we grab one thing from the data set, we do a random transform on the fly. So potentially everyone will look a little bit different. So you can see here, if we plot the result of that lots of times, we get eights in slightly different positions because we did random padding.

You can always grab a batch of data then from the data bunch. Because remember, a data bunch has data loaders, and data loaders are things that you grab a batch at a time. And so you can then grab an X batch and a Y batch, look at their shape, batch size by channel by row by column.

All fastai data bunches have a show_batch, which will show you what's in it in some sensible way. Okay, so that's a quick walkthrough of the data block API stuff to grab our data. So let's start out creating a simple CNN, a simple ConvNet. The input is 28 by 28.

So let's define -- I like to define when I'm creating architectures a function which kind of does the things that I do again and again and again. I don't want to call it with the same arguments because I'll forget, I'll make a mistake. So in this case, all of my convolutions are going to be kernel size three, stride two, padding one.

So let's just create a simple function to do a conv with those parameters. So each time I have a convolution, it's skipping over one pixel -- it's jumping two steps each time. So that means that each time we have a convolution, it's going to halve the grid size.

So I've put a comment here showing what the new grid size is after each one. So after the first convolution, we have one channel coming in, because remember it's a grayscale image with one channel. And then how many channels coming out, whatever you like, right? So remember you always get to pick how many filters you create regardless of whether it's a fully connected layer, in which case it's just the width of the matrix you're multiplying by, or in this case with a 2D conv, it's just how many filters do you want.

So I picked eight. And so after this, it's stride two. So the 28 by 28 image is now a 14 by 14 feature map with eight channels. So specifically, therefore, it's an eight by 14 by 14 tensor of activations. Then we'll do batch norm, then we'll do relu. So the number of input filters to the next conv has to equal the number of output filters from the previous conv, and we can just keep increasing the number of channels.

Because we're doing stride two, it's going to keep decreasing the grid size. Notice here it goes from seven to four, because if you're doing a stride two conv over seven, it's going to be kind of math.ceiling of seven divided by two. Batch norm, relu, conv, we're now down to two by two.

Batch norm, relu, conv, we're now down to one by one. So after this, we have a feature map of, let's see, ten by one by one. Does that make sense? We've got a grid size of one now. So it's not a vector of length ten. It's a rank three tensor of ten by one by one.

So our loss functions expect generally a vector, not a rank three tensor. So you can chuck flatten at the end, and flatten just means remove any unit axes. So that will make it now just a vector of length ten, which is what we always expect. So that's how we can create a CNN.
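Here is a sketch of that architecture as just described: a conv helper with kernel size 3, stride 2, padding 1, each followed by batch norm and ReLU, with the resulting grid size noted in comments. The channel counts are the ones mentioned above.

```python
import torch.nn as nn
from fastai.layers import Flatten   # fastai's Flatten layer removes the trailing unit axes

def conv(ni, nf):
    "3x3 conv, stride 2, padding 1: halves the grid size each time."
    return nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)

model = nn.Sequential(
    conv(1, 8),   nn.BatchNorm2d(8),  nn.ReLU(),  # 28 -> 14
    conv(8, 16),  nn.BatchNorm2d(16), nn.ReLU(),  # 14 -> 7
    conv(16, 32), nn.BatchNorm2d(32), nn.ReLU(),  # 7  -> 4
    conv(32, 16), nn.BatchNorm2d(16), nn.ReLU(),  # 4  -> 2
    conv(16, 10), nn.BatchNorm2d(10),             # 2  -> 1, ten channels for ten digits
    Flatten()                                     # (bs, 10, 1, 1) -> (bs, 10)
)
```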

So then we can turn that into a Learner by passing in the data and the model and the loss function and, optionally, some metrics. We're going to use cross entropy as usual. So we can then call learn.summary and confirm that after that first conv, we're down to 14 by 14.

And after the second conv, 7 by 7 and 4 by 4, 2 by 2, 1 by 1. The flatten comes out calling it a lambda, but that, as you can see, gets rid of the one by one, and it's now just a length ten vector for each item in the batch.

So a 128 by 10 matrix for the whole mini batch. So just to confirm that this is working okay, we can grab that mini batch of X that we created earlier. That's our mini batch of X. Pop it onto the GPU and call the model directly. Remember any PyTorch module we can pretend it's a function.

And that gives us back, as we hoped, a 128 by 10 result. So that's how you can directly get some predictions out. We already have a 98.6% accurate ConvNet. And this is trained from scratch, of course, it's not pre-trained, we literally created our own architecture, it's about the simplest possible architecture you can imagine, 18 seconds to train.
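For reference, those steps look roughly like this in code (a sketch; the batch size of 128 is just what this lesson happens to use):

```python
from fastai.vision import *   # as in the lesson notebooks

# data and model are the DataBunch and nn.Sequential built above
learn = Learner(data, model, loss_func=nn.CrossEntropyLoss(), metrics=accuracy)
learn.summary()               # shows the 14/7/4/2/1 grid sizes and the final Flatten

xb, yb = data.one_batch()     # grab a mini-batch
preds = model(xb.cuda())      # any nn.Module can be called like a function
preds.shape                   # torch.Size([128, 10]) with the batch size used here
```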

So that's how easy it is to create a pretty accurate digit detector. So let's refactor that a little, rather than saying conv, batch norm, relu all the time. fastai already has something called conv_layer, which lets you create conv, batch norm, relu combinations. And it has various other options to do other tweaks to it, but the basic version is just exactly what I just showed you.
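The refactored version looks something like this (a sketch; conv2 is just a shortcut name for the stride 2 version, following the notebook's pattern):

```python
def conv2(ni, nf):
    "conv_layer bundles a conv, ReLU and batch norm; stride 2 halves the grid size."
    return conv_layer(ni, nf, stride=2)

model = nn.Sequential(
    conv2(1, 8),    # 14
    conv2(8, 16),   # 7
    conv2(16, 32),  # 4
    conv2(32, 16),  # 2
    conv2(16, 10),  # 1
    Flatten()
)
```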

So we can refactor that like so, so that's exactly the same neural net. And so let's just train it a little bit longer, and it's actually 99.1% accurate if we train it for all of a minute. So that's cool. So how can we improve this? Well what we really want to do is create a deeper network.

And a very easy way to create a deeper network would be, after every stride two conv, add a stride one conv, because the stride one conv doesn't change the feature map size at all, so you can add as many as you like. But there's a problem.

And the problem was pointed out in this paper, very, very, very influential paper, called Deep Residual Learning for Image Recognition by Kaiming He and colleagues then at Microsoft Research. And they did something interesting. They said, let's look at the training error. So forget generalization even. Let's just look at the training error of a network trained on CIFAR-10.

And let's try one network with 20 layers, just basic three by three convs -- basically the same network I just showed you, but without batch norm. So they trained a 20 layer one and a 56 layer one on the training set. So the 56 layer one has a lot more parameters.

It's got a lot more of these stride one Convs in the middle. So the one with more parameters should seriously overfit. So you would expect the 56 layer one to zip down to zero-ish training error pretty quickly. And that is not what happens. It is worse than the shallower network.

So when you see something weird happen, really good researchers don't go, oh, no, it's not working. They go, that's interesting. So Kaiming He said, that's interesting. What's going on? And he said, I don't know, but what I do know is this. I could take this 56 layer network and make a new version of it, which is identical, but has to be at least as good as the 20 layer network.

And here's how. Every two convolutions, I'm going to take the input to those two convolutions and add it to the result of those two convolutions. So in other words, instead of saying output = conv2(conv1(x)), he's saying output = x + conv2(conv1(x)).

So that thing with 56 layers worth of convolutions in it, his theory goes, has to be at least as good as the 20 layer version, because it could always just set conv2 and conv1 to a bunch of zero weights for everything except for the first 20 layers, because the x, the input, could just go straight through.

So this thing here is, as you see, called an identity connection. It's the identity function. Nothing happens at all. It's also known as a skip connection. So that was a theory, right? That's what the paper describes as the intuition behind this is what would happen if we created something which has to train at least as well as a 20 layer neural network because it kind of contains that 20 layer neural network.

It's literally a path you can just skip over all the convolutions. And so what happens? What happened was he won ImageNet that year. He easily won ImageNet that year. And in fact, even today -- we had that record-breaking result on ImageNet speed training ourselves in the last year, and we used this too.

ResNet has been revolutionary. And here's a trick. If you're interested in doing some research, some novel research, any time you find some model for anything, whether it's medical image segmentation or some kind of GAN or whatever, and it was written a couple of years ago, they might have forgotten to put ResNets in.

This is what we normally call a res block. They might have forgotten to put res blocks in. So replace their convolutional path with a bunch of res blocks, and you'll almost always get better results faster. It's a good trick. So at NeurIPS, which Rachel, David, Sylvain and I all just came back from, we saw a presentation where they actually figured out how to visualize the loss surface of a neural net, which is really cool.

This is a fantastic paper. And anybody who's watching this, lesson seven, is at a point where they will understand most of the most important concepts in this paper. You can read this now. You won't necessarily get all of it, but I'm sure you'll get enough to find it interesting.

And so the big picture was this one. Here's what happens if you draw a picture where kind of X and Y here are two projections of the weight space, and Z is the loss. And so as you move through the weight space, a 56 layer neural network without skip connections is very, very bumpy.

And that's why this got nowhere, because it just got stuck in all these hills and valleys. The exact same network with identity connections, with skip connections, has this loss landscape. So it's kind of interesting how Kaiming He recognized back in 2015 that this shouldn't happen, and here's a way that must fix it.

And it took three years before people were able to say, oh, this is kind of why it fixed it. As with the batch norm discussion we had a couple of weeks ago, people sometimes realize a little bit after the fact what's going on and why it helps. So in our code, we can create a res block in just the way I described.

We create an nn.Module, and we create two conv layers (where a conv layer is conv2d, batch norm, relu -- sorry, conv2d, relu, batch norm). So create two of those, and then in forward we go conv1 of x, conv2 of that, and then add x. There's a res_block function already in fastai.
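In code, that module looks something like this (a sketch matching the description above; fastai's built-in res_block is a little fancier):

```python
class ResBlock(nn.Module):
    "x + conv2(conv1(x)), with conv_layer = conv, ReLU, batch norm."
    def __init__(self, nf):
        super().__init__()
        self.conv1 = conv_layer(nf, nf)
        self.conv2 = conv_layer(nf, nf)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))
```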

So you can just call res_block instead, and you just pass in something saying how many filters do you want. So there's the res block that I defined in our notebook. And so with that res block, I've just copied the previous CNN, and after every conv2, except the last one, I added a res block.

So this has now got three times as many layers, so it should be able to do more compute. But it shouldn't be any harder to optimize. So what happens? Well, let's just refactor it one more time. Since I go conv2, res block so many times, let's just pop that into a mini sequential model, and so I can refactor it like so.
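That refactor is roughly the following (again a sketch, following the notebook's pattern):

```python
def conv_and_res(ni, nf):
    "A stride 2 conv_layer followed by a res block, bundled into one sequential."
    return nn.Sequential(conv2(ni, nf), res_block(nf))

model = nn.Sequential(
    conv_and_res(1, 8),    # 14
    conv_and_res(8, 16),   # 7
    conv_and_res(16, 32),  # 4
    conv_and_res(32, 16),  # 2
    conv2(16, 10),         # 1
    Flatten()
)
```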

Keep refactoring your architectures if you're trying novel architectures, because you'll make fewer mistakes. Very few people do this. Most research code you look at is clunky as all hell, and people often make mistakes in that way. So don't do that. You're all coders, so use your coding skills to make life easier.

So there's my ResNet-ish architecture. And I find a learning rate as usual and fit for a while, and I get 99.54%. So that's interesting, because we've trained this literally from scratch with an architecture we built from scratch. I didn't look up this architecture anywhere. It was just the first thing that came to mind.

But in terms of where that puts us, 0.45% error is around about the state of the art for this data set as of three or four years ago. Now, you know, today MNIST is considered a kind of trivially easy data set, so I'm not saying, like, wow, we've broken some records here.

People have got beyond 0.45% error. But what I'm saying is that this kind of ResNet is a genuinely extremely useful network still today, and this is really all we use in our fast ImageNet training still. And one of the reasons as well is that it's so popular, so the vendors of the libraries spend a lot of time optimizing it, so things tend to work fast, whereas some more modern-style architectures using things like separable or grouped convolutions tend not to actually train very quickly in practice.

If you look at the definition of res_block in the fastai code, you'll see it looks a little bit different to this, and that's because I've created something called a MergeLayer. And a MergeLayer is something whose forward -- just skip dense for a moment -- says x plus x.orig.

So you can see there's something ResNet-ish going on here. What is x.orig? Well, if you create a special kind of sequential model called a SequentialEx -- this is like fastai's sequential extended -- it's just like a normal sequential model, but we store the input in x.orig.

And so this here -- SequentialEx of conv layer, conv layer, MergeLayer -- will do exactly the same as this. So you can create your own variations of ResNet blocks very easily with just SequentialEx and MergeLayer. There's something else here, which is that when you create your MergeLayer, you can optionally set dense=True.

What happens if you do? Well, if you do, it doesn't go x plus x.orig, it goes cat(x, x.orig). In other words, rather than putting a plus in this connection, it does a concatenate. So that's pretty interesting, because what happens is that you have your input coming into your res block.

And once you use concatenate instead of plus, it's not called a Res block anymore, it's called a dense block, and it's not called a ResNet anymore, it's called a dense net. So the dense net was invented about a year after the ResNet. And if you read the dense net paper, it can sound incredibly complex and different, but actually it's literally identical, but plus here is replaced with cat.
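To make that concrete, here is roughly what a MergeLayer's forward does and how a res block or dense block can be built from SequentialEx (a sketch of the idea under those assumptions; see the fastai source for the real definitions):

```python
import torch
import torch.nn as nn
from fastai.layers import SequentialEx, conv_layer

class MyMergeLayer(nn.Module):
    "Add (res block) or concatenate (dense block) the block's input back in."
    def __init__(self, dense=False):
        super().__init__()
        self.dense = dense

    def forward(self, x):
        # SequentialEx stores the block's original input on x.orig
        return torch.cat([x, x.orig], dim=1) if self.dense else (x + x.orig)

def my_res_block(nf):   return SequentialEx(conv_layer(nf, nf), conv_layer(nf, nf), MyMergeLayer())
def my_dense_block(nf): return SequentialEx(conv_layer(nf, nf), conv_layer(nf, nf), MyMergeLayer(dense=True))
```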

So you have your input coming into your dense block, right, and you've got a few convolutions in here, and then you've got some output coming out, and then you've got your identity connection. And remember, it doesn't plus, it concats, so if this is the channel axis, it gets a little bit bigger.

And then so we do another dense block, and at the end of that we have all of this coming in. So at the end of that we have the result of the convolution as per usual, but this time the identity block is that big, right? So you can see that what happens is that with dense blocks it's getting bigger and bigger and bigger, and kind of interestingly the exact input is still here, right?

So actually, no matter how deep you get, the original input pixels are still there, and the original layer one features are still there, and the original layer two features are still there. So as you can imagine, dense nets are very memory intensive. There are ways to manage this, from time to time you can have a regular convolution that squishes your channels back down, but they are memory intensive.

But they have very few parameters. So for dealing with small data sets, you should definitely experiment with dense blocks and dense nets. They tend to work really well on small data sets. Also, because it's possible to kind of keep those original input pixels all the way down the path, they work really well for segmentation, right?

Because for segmentation, you kind of want to be able to reconstruct the original resolution of your picture, so having all of those original pixels still there is super helpful. So that's res nets, and one of the main reasons, other than the fact that res nets are awesome, to tell you about them, is that these skip connections are useful in other places as well, and they're particularly useful in other places and other ways of designing architectures for segmentation.

So in building this lesson, I keep trying to take old papers and ask, as I've mentioned, what would that person have done if they had access to all the modern techniques we have now? And I try to kind of rebuild them in a more modern style.

So I've recently been rebuilding this next architecture we're going to look at, called a U-Net, in a more modern style. And I've got to the point now -- I keep showing you this semantic segmentation paper with the state of the art for CamVid, which was 91.5. This week I got it up to 94.1 using the architecture I'm about to show you.

So we keep pushing this further and further and further. And it really was all about adding all of the modern tricks, many of which I'll show you today, some of which we'll see in part two. So what we're going to do to get there is we're going to use this U-Net.

We've used a U-Net before -- I've improved it a bit since then -- when we did the CamVid segmentation, but we didn't understand what it was doing. So we're now in a position where we can understand what it was doing. And the first thing we need to do is understand the basic idea of how you can do segmentation.

So if we go back to our CamVid notebook, you'll remember that basically what we were doing is taking these photos and adding a class to every single pixel. And so when you go data.show_batch for something which is a segmentation item list, it will automatically show you these color-coded pixels.

So here's the thing. In order to color-code this as a pedestrian, but this as a bicyclist, it needs to know what it is. It needs to actually know that's what a pedestrian looks like, and it needs to know that's exactly where the pedestrian is, and this is the arm of the pedestrian and not part of their shopping basket.

It needs to really understand a lot about this picture to do this task. And it really does do this task. When you look at the results of our top model, I can't see a single wrong pixel by eye. I know there are a few wrong, but I can't find the ones that are wrong -- it's that accurate.

So how does it do that? The way that we're doing it to get these really, really good results is, not surprisingly, using pre-training. So we start with a ResNet-34, and you can see that here: unet_learner(data, models.resnet34). And if you don't say pretrained=False, by default you get pretrained=True, because why not?

So we start with a ResNet-34, which starts with a big image. In this case, this is from the U-Net paper now. Their images started with one channel by 572 by 572. This is for medical imaging segmentation. So after your stride 2 conv, they're doubling the number of channels to 128, and they're halving the size, so they're now down to 280 by 280.

In this original unet paper, they didn't add any padding, so they lost a pixel on each side each time they did a conv. That's why you're losing these two. So basically half the size, and then half the size, and then half the size, and then half the size, until they're down to 28 by 28, with 1024 channels.

So that's what the U-Net's downsampling path (this is called the downsampling path) looks like. Ours is just a ResNet-34. So you can see it here, learn.summary. This is literally a ResNet-34. So you can see that the size keeps halving, the channels keep going up, and so forth. So eventually, you get down to a point where, with the U-Net architecture, it's 28 by 28 with 1024 channels; with a ResNet architecture with a 224 pixel input, it would be 512 channels by 7 by 7.

So it's a pretty small grid size on this feature map. Somehow we've got to end up with something which is the same size as our original picture. So how do we do that? How do you do computation which increases the grid size? Well, we don't have a way to do that in our current bag of tricks.

We can use a stride 1 conv to do computation and keep grid size or a stride 2 conv to do computation and halve the grid size. So how do we double the grid size? We do a stride half conv, also known as a deconvolution, also known as a transposed convolution.

There is a fantastic paper called A Guide to Convolution Arithmetic for Deep Learning that shows a great picture of exactly what does a 3 by 3 kernel stride half conv look like. And it's literally this. If you have a 2 by 2 input, so the blue squares are the 2 by 2 input, you add not only two pixels of padding all around the outside, but you also add a pixel of padding between every pixel.

And so now if we put this 3 by 3 kernel here and then here and then here, you see how the 3 by 3 kernel is just moving across it in the usual way? You will end up going from a 2 by 2 output to a 5 by 5 output.

So if you only added one pixel of padding around the outside, you would end up with a 3 by 3 output. So sorry, 4 by 4. So this is how you can increase the resolution. This was the way people did it until maybe a year or two ago. It's another trick for improving things you find online, because this is actually a dumb way to do it.

And it's kind of obvious it's a dumb way to do it for a couple of reasons. One is that, have a look at this, nearly all of those pixels are white. They're nearly all zeros. So what a waste. What a waste of time. What a waste of computation. There's just nothing going on there.

Also, this one, when you get down to that 3 by 3 area, 2 out of the 9 pixels are non-white, but this one, 1 out of the 9 are non-white. So there's different amounts of information going into different parts of your convolution. So it just doesn't make any sense to kind of throw away information like this and to do all this unnecessary computation and have different parts of the convolution having access to different amounts of information.

So what people generally do nowadays is something really simple, which is: if you have, let's say, a 2 by 2 input, these are your pixel values, A, B, C, and D, and you want to create a 4 by 4, why not just do this? A, A, A, A, B, B, B, B, C, C, C, C, D, D, D, D.

So I've now upscaled from 2 by 2 to 4 by 4. I haven't done any interesting computation, but now on top of that, I could just do a stride 1 convolution, and now I have done some computation. This upsampling is called nearest neighbor interpolation.

And that's super fast, which is nice. So you can do a nearest neighbor interpolation and then a stride 1 conv, and now you've got some computation which is actually using -- you know, there are no zeros here. This is kind of nice, because it gets a mixture of A's and B's, which is kind of what you would want, and so forth.

Another approach is instead of using nearest neighbor interpolation, you can use bilinear interpolation, which basically means instead of copying A to all those different cells, you take a kind of a weighted average of the cells around it. So for example, if you were, you know, looking at what should go here, you would kind of go like, oh, it's about 3 A's, 2 C's, 1 D, and 2 B's, and you could have taken the average.

Not exactly, but roughly just a weighted average. Bilinear interpolation, you'll find all over the place -- it's a pretty standard technique. Any time you look at a picture on your computer screen and change its size, it's doing bilinear interpolation. So you can do that, and then a stride 1 conv.
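Here is a small PyTorch sketch contrasting the two approaches just described: a transposed ("stride half") convolution versus a nearest neighbor or bilinear upsample followed by a stride 1 conv.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 28, 28)                       # a small feature map

# The older approach: a transposed convolution (a.k.a. "stride half" conv / deconvolution).
deconv = nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2)
print(deconv(x).shape)                                # torch.Size([1, 64, 56, 56])

# The approach described above: upsample (nearest neighbor or bilinear), then a stride 1 conv.
up_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),      # or mode='bilinear'
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
)
print(up_conv(x).shape)                               # torch.Size([1, 64, 56, 56])
```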

So that was what people were using, well, that's what people still tend to use. That's as much as I'm going to teach you this part. In part two, we'll actually learn what the FastAI library is actually doing behind the scenes, which is something called a pixel shuffle, also known as sub-pixel convolutions.

It's not dramatically more complex, but complex enough that I won't cover it today. It's the same basic idea: all of these things are basically letting us do a convolution that ends up with something that's twice the size. And so that gives us our upsampling path. So that lets us go from 28 by 28 to 56 by 56 and keep on doubling the size.

So that's good. And that was it until UNET came along. That's what people did. And it didn't work real well. Which is not surprising, because in this 28 by 28 feature map, how the hell is it going to have enough information to reconstruct a 572 by 572 output space?

That's a really tough ask. So you tended to end up with these things that lacked fine detail. So what Olaf Ronneberger et al. did was they said, hey, let's add a skip connection, an identity connection. And amazingly enough, this was before ResNets existed. So this was a really big leap.

Really impressive. But rather than adding a skip connection that skipped every two convolutions, they added skip connections where these gray lines are. In other words, they added a skip connection from the same part of the downsampling path to the same sized bit in the upsampling path. And they didn't add.

That's why you can see the white and the blue next to each other. They didn't add. They concatenated. So basically these are like dense blocks, right? But the skip connections are skipping over larger and larger amounts of the architecture. So that over here, you've literally got nearly the input pixels themselves coming into the computation of these last couple of layers.

And so that's going to make it super handy for resolving the fine details in these segmentation tasks because you've literally got all of the fine details. On the downside, you don't have very many layers of computation going on here, just four. So you better hope that by that stage, you've done all the computation necessary to figure out, is this a bicyclist or is this a pedestrian?

But you can then add on top of that something saying, is this exact pixel where their nose finishes, or is that the start of the tree? So that works out really well. And that's a U-Net. So this is the U-Net code from fastai. And the key thing that comes in is the encoder.

The encoder refers to that part. In other words, in our case, a ResNet-34. In most cases, they have this specific older-style architecture. But like I said, replace any older-style architecture bits with ResNet bits and life improves, particularly if they're pre-trained. So that certainly happened for us. So we start with our encoder.

So the layers of our U-Net are an encoder, then batch norm, then ReLU, and then middle_conv, which is just conv_layer, conv_layer. Remember, conv_layer is a conv, ReLU, batch norm in fastai. And so the middle conv is these two extra steps here at the bottom, just doing a little bit of computation.

It's kind of nice to add more layers of computation where you can. So: encoder, batch norm, ReLU, and then two convolutions. And then we enumerate through these indexes. What are these indexes? I haven't included the code, but basically we figure out what the layer number is where each of these stride 2 convs occurs, and we just store it in an array of indexes.

So then we can loop through that, and we can basically say: for each one of those points, create a UnetBlock, telling it how many upsampling channels there are and how many cross-connection channels. These things here are called cross-connections, or at least that's what I call them. So that's really where the main work's going on, in the UnetBlock.

As I said, there's quite a few tweaks we do, as well as the fact we use a much better encoder. We also use some tweaks in all of our up-sampling using this pixel shuffle. We use another tweak called ICNR. And then another tweak, which I just did in the last week, is to not just take the result of the convolutions and pass it across, but we actually grab the input pixels and make them another cross-connection.

That's what this last cross is here. You can see we're literally appending a res block with the original inputs. So you can see our MergeLayer. So really all the work's going on in UnetBlock, and UnetBlock has to store the activations at each of these downsampling points.

And the way to do that, as we learned in the last lesson, is with hooks. So we put hooks into the ResNet-34 to store the activations each time there's a stride 2 conv. And so you can see here we grab the hook, and we grab the result of the stored value in that hook, and we literally just go torch.cat -- we concatenate the upsampled convolution with the result of the hook, which we chuck through batch norm, and then we do two convolutions to it.

And actually, something you could play with at home is pretty obvious here. Any time you see two convolutions like this, there's an obvious question: what if we used a ResNet block instead? So you could try replacing those two convs with a res block. You might find you get even better results.
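To give a feel for what a UnetBlock is doing, here is a deliberately simplified stand-in (not fastai's actual UnetBlock, which uses pixel shuffle and fastai's conv_layer): upsample the input, concatenate the activations stored by the hook, then run a couple of convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyUnetBlock(nn.Module):
    "Simplified cross-connection block: upsample, concat the hooked activations, two convs."
    def __init__(self, up_c, hook_c, out_c):
        super().__init__()
        self.bn    = nn.BatchNorm2d(hook_c)
        self.conv1 = nn.Conv2d(up_c + hook_c, out_c, 3, padding=1)
        self.conv2 = nn.Conv2d(out_c, out_c, 3, padding=1)

    def forward(self, up_in, hook_stored):
        up_out = F.interpolate(up_in, scale_factor=2, mode='nearest')  # fastai really uses pixel shuffle here
        x = torch.cat([up_out, self.bn(hook_stored)], dim=1)           # the cross-connection
        return self.conv2(F.relu(self.conv1(F.relu(x))))
```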

And then the kind of thing I look for when I look at an architecture is, oh, two convs in a row -- that probably should be a res block. Okay. So that's U-Net, and it's amazing to think it preceded ResNet, it preceded DenseNet. It wasn't even published in a major machine learning venue. It was actually published in MICCAI, which is a specialized medical image computing conference.

It was actually published in MICHI, which is a specialized medical image computing conference. For years, actually, it was largely unknown outside of the medical imaging community. And actually, what happened was Kaggle competitions for segmentation kept on being easily won by people using UNETs. And that was the first time I saw it getting noticed outside the medical imaging community.

And then, gradually, a few people in the academic machine learning community started noticing, and now everybody loves UNET, which I'm glad, because it's just awesome. So identity connections, regardless of whether they're a plus style or a concat style, are incredibly useful. They can basically get us close to the state of the art on lots of important tasks.

So I want to use them on another task now. And so the next task I want to look at is image restoration. So image restoration refers to starting with an image, and this time, we're not going to create a segmentation mask, but we're going to try and create a better image.

And there are lots of versions of better -- the output could be quite a different image. So the kinds of things we can do with this image generation would be: take a low res image and make it high res; take a black and white image and make it color; take an image where something's been cut out of it and try to replace the cut out thing; take a photo and try to turn it into what looks like a line drawing; take a photo and try to make it look like a Monet painting.

These are all examples of kind of image to image generation tasks, which you'll know how to do after this part of the class. So in our case, we're going to try to do image restoration, which is going to start with low resolution, poor quality JPEGs with writing written over the top of them, and get them to replace them with high resolution, good quality pictures in which the text has been removed.

Two questions? OK, let's go. Why do you concat before calling conv2, conv1, not after? Because if you did your convs before you concat, then there's no way for the channels of the two parts to interact with each other. So remember, in a 2D conv, it's really 3D, right?

It's moving across two dimensions, but in each case, it's doing a dot product of all three dimensions of a rank 3 tensor, row by column by channel. So generally speaking, we want as much interaction as possible. We want to say this part of the downsampling path and this part of the upsampling path, if you look at the combination of them, you find these interesting things.

So generally, you want to have as many interactions going on as possible in each computation that you do. How does concatenating every layer together in a dense net work when the size of the image feature maps is changing through the layers? That's a great question. So, if you have a stride 2 conv, you can't keep dense netting.

That's what actually happens in a dense net, is you kind of go like dense block growing, dense block growing, dense block growing, so you're getting more and more channels. And then you do a stride 2 conv without a dense block. And so now, it's kind of gone. And then you just do a few more dense blocks and then it's gone.

So in practice, a dense block doesn't actually keep all the information all the way through, but just up into every one of these stride 2 convs. And there's kind of various ways of doing these bottlenecking layers where you're basically saying, hey, let's reset. It also helps us keep memory under control because at that point we can decide how many channels we actually want.

Good questions. Thank you. So, in order to create something which can turn crappy images into nice images, we need a data set containing nice versions of images and crappy versions of the same images. So the easiest way to do that is to start with some nice images and crappify them.

And so the way to crappify them is to create a function called crappify, which contains your crappification logic. My crappification logic -- you can pick your own -- is that I open up my nice image, I resize it to be really small, 96 by 96 pixels, with bilinear interpolation; I then pick a random number between 10 and 70, I draw that number into my image at some random location, and then I save that image with a JPEG quality of that random number.
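Here is a sketch of such a crappify function along those lines (the paths are illustrative, and the unused index argument is only there because fastai's parallel helper, used a little further on, passes one):

```python
import random
from pathlib import Path
from PIL import Image, ImageDraw

path_hr = Path('data/oxford-iiit-pet/images')   # nice originals (illustrative path)
path_lr = Path('data/oxford-iiit-pet/crappy')   # where the crappified copies go

def crappify(fn, i):
    "Shrink the image, stamp a random number on it, and save it as a low quality JPEG."
    dest = path_lr/fn.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    img = Image.open(fn).convert('RGB').resize((96, 96), resample=Image.BILINEAR)
    q = random.randint(10, 70)                  # 10 is rubbish, 70 is not bad at all
    ImageDraw.Draw(img).text((random.randint(0, 64), random.randint(0, 64)), str(q), fill='white')
    img.save(dest, quality=q)                   # JPEG quality is that same random number
```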

And a JPEG quality of 10 is like absolute rubbish. A JPEG quality of 70 is not bad at all. So I end up with high quality images, low quality images that look something like these. And so you can see this one, you know, there's the image. And this is after transformations, that's why it's been flipped.

And you won't always see the image because we're zooming into them. So a lot of the time the image is cropped out. So yeah, it's trying to figure out how to take this incredibly JPEG artifacty thing with text written over the top and turn it into this. So I'm using the Oxford Pets dataset, again, the same one we used in lesson one.

So there's nothing more high quality than pictures of dogs and cats, I think we can all agree with that. The crappification process can take a while, but fastai has a function called parallel, and if you pass parallel a function name and a list of things to run that function on, it will run that function on them all in parallel. So this actually can run pretty quickly.
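For example, something like this (assuming the crappify and path_hr from the sketch above; ImageItemList was later renamed ImageList):

```python
from fastai.vision import *

il = ImageItemList.from_folder(path_hr)
parallel(crappify, il.items)    # runs crappify(fn, i) over every file, across worker processes
```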

So this actually can run pretty quickly. The way you write this function is where you get to do all the interesting stuff in this assignment. Try and think of an interesting Crapification which does something that you want to do. So if you want to colorize black and white images, you would replace it with black and white.

If you want something which can take large cut out blocks of image and replace them with hallucinated image, add a big black box to these. If you want something which can take old family photo scans that have been folded up and have crinkles in, try and find a way of adding dust prints and crinkles and so forth.

Something that you don't include in crappify, your model won't learn to fix, because every time it sees that in your photos, the input and output will be the same, so it won't consider that to be something worth fixing. So we now want to create a model which can take an input photo that looks like that and output something that looks like that.

So obviously what we want to do is use a U-Net, because we already know that U-Nets can do exactly that kind of thing, and we just need to pass the U-Net that data. So our data is just literally the file names from each of those two folders. Do some transforms, databunch, normalize, and use ImageNet stats because we're going to use a pre-trained model.

Why are we using a pre-trained model? Well, because like if you're going to get rid of this 46, you need to know what probably was there, and to know what probably was there, you need to know what this is a picture of. Because otherwise, how can you possibly know what it ought to look like?

So let's use a pre-trained model that knows about these kinds of things. So we create our U-Net with that data. The architecture is ResNet-34. These three things are important and interesting and useful, but I'm going to leave them to part two. For now, you should always include them when you use a U-Net for this kind of problem.

And so now we're going to-- and this whole thing I'm calling a generator. It's going to generate-- this is generative modeling. There's not a really formal definition, but it's basically something where the thing we're outputting is like a real object, in this case, an image. It's not just a number.

So we're going to create a generator learner, which is this unet_learner. And then we can fit. We're using MSE loss, right? In other words, what's the mean squared error between the actual pixel value that it should be and the pixel value that we predicted? MSE loss normally expects two vectors.

In our case, we have two images. So we have a version called MSELossFlat, which simply flattens out those images into a big long vector. There's never any reason not to use this. Even if you do have a vector, it works fine. If you don't have a vector, it'll also work fine.
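Putting that together, the generator learner is created roughly like this. This is a sketch: data_gen stands for the data bunch built from the crappy and high-res file names above, and the keyword values, epochs, and learning rates are illustrative recollections of the lesson notebook rather than anything definitive.

```python
from fastai.vision import *

arch = models.resnet34

learn_gen = unet_learner(
    data_gen, arch, wd=1e-3,
    blur=True, norm_type=NormType.Weight, self_attention=True,  # the tweaks left to part two
    y_range=(-3., 3.),                                           # keep outputs in the normalized pixel range
    loss_func=MSELossFlat())

learn_gen.fit_one_cycle(2)                       # train the upsampling path first (the encoder is frozen)
learn_gen.unfreeze()
learn_gen.fit_one_cycle(3, slice(1e-6, 1e-3))    # then fine-tune the whole thing
```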

So we're already down to 0.05 mean squared error on the pixel values, which is not bad, after 1 minute 35. Like pretty much all things in fastai, because we are doing transfer learning by default, when you create this, it'll freeze the pre-trained part. And the pre-trained part of a U-Net is this part, the downsampling part. That's where the ResNet is.

That's where the resonant is. So let's unfreeze that and train a little more. And look at that. So with four minutes of training, we've got something which is basically doing a perfect job of removing numbers. It's certainly not doing a good job of up sampling. But it's definitely doing a nice-- sometimes when it removes a number, it maybe leaves a little bit of JPEG artifact.

But it's certainly doing something pretty useful. And so if all we wanted to do was kind of watermark removal, we'd be finished. We're not finished, because we actually want this thing to look more like this thing. So how are we going to do that? The problem, the reason that we're not making as much progress with that as we'd like is that our loss function doesn't really describe what we want.

Because actually, the mean squared error between the pixels of this and this is actually very small. And if you actually think about it, most of the pixels are very nearly the right color. But we're missing the texture of the pillow. And we're missing the eyeballs entirely, pretty much. And we're missing the texture of the fur.

So we want some loss function that does a better job than pixel mean squared error loss of saying, is this a good quality picture of this thing? There's a fairly general way of answering that question, and it's something called a generative adversarial network, or GAN. And a GAN tries to solve this problem by using a loss function which actually calls another model.

And let me describe it to you. So we've got our crappy image, and we've already created a generator. It's not a great one, but it's not terrible. And that's creating predictions like this. We have a high res image like that. And we can compare the high res image to the prediction with pixel MSE.

We could also train another model, which we would variously call either the discriminator or the critic. They both mean the same thing. I'll call it a critic. We could try and build a binary classification model that takes all the pairs of the generated image and the real high res image and tries to classify, learn to classify, which is which.

So look at some picture and say like, hey, what do you think? Is that a high res cat or is that a generated cat? How about this one? Is that a high res cat or a generated cat? So just a regular standard binary cross-entropy classifier. So we know how to do that already.

So if we had one of those, we could now fine-tune the generator. And rather than using pixel MSE as the loss, the loss could be how good are we at fooling the critic? So can we create generated images that the critic thinks are real? So that would be a very good plan, right?

Because if it can do that, if the loss function is am I fooling the critic, then it's going to learn to create images which the critic can't tell whether they're real or fake. So we could do that for a while, train a few batches, but the critic isn't that great.

The reason the critic isn't that great is because it wasn't that hard. Like these images are really shitty, so it's really easy to tell the difference, right? So after we train the generator a little bit more using the critic as the loss function, the generator is going to get really good at fooling the critic.

So now we're going to stop training the generator and we'll train the critic some more on these newly generated images. So now that the generator's better, it's now a tougher task for the critic to decide which is real and which is fake, so we'll train that a little bit more.

And then once we've done that, and the critic's now pretty good at recognising the difference between the better generated images and the originals, we'll go back and we'll fine-tune the generator some more using the better discriminator, the better critic, as the loss function. And so we'll just go ping pong, ping pong, backwards and forwards.
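Here's a toy sketch of that ping-pong, with stand-in linear models and random tensors just to show the shape of the training loop; the real thing uses the pre-trained U-Net generator and the critic, and fastai wraps all of this up for you, as shown further below.

```python
import torch
import torch.nn as nn

# Stand-ins: in the lesson these are the pre-trained U-Net generator and the GAN critic.
generator = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
critic    = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0., 0.99))
opt_c = torch.optim.Adam(critic.parameters(),    lr=1e-4, betas=(0., 0.99))
bce, mse = nn.BCEWithLogitsLoss(), nn.MSELoss()

crappy, real = torch.randn(64, 16), torch.randn(64, 16)   # pretend "images"

for step in range(100):
    # Train the critic: real -> 1, generated -> 0.
    with torch.no_grad():
        fake = generator(crappy)
    preds = critic(torch.cat([real, fake]))
    targs = torch.cat([torch.ones(64, 1), torch.zeros(64, 1)])
    loss_c = bce(preds, targs)
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # Train the generator: fool the critic, plus a (heavily weighted) pixel loss.
    fake = generator(crappy)
    loss_g = 50. * mse(fake, real) + bce(critic(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```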

That's a GAN. Well, that's our version of a GAN. I don't know if anybody's written this before. We've created a new version of a GAN, which is kind of a lot like the original GANs, but we have this neat trick where we pre-train the generator and we pre-train the critic.

I mean, GANs have been kind of in the news a lot. They're a pretty fashionable tool. And if you've seen them, you may have heard that they're a real pain to train. But it turns out we realise that really most of the pain of training them was at the start.

If you don't have a pre-trained generator and you don't have a pre-trained critic, then it's basically the blind leading the blind. The generator's trying to generate something which fools the critic, but the critic doesn't know anything at all, so it's basically got nothing to do.

And then the critics kind of try to decide whether the generated images are real or not, and that gets really obvious, so that just does it. And so they kind of like don't go anywhere for ages. And then once they finally start picking up steam, they go along pretty quickly.

So if you can find a way to generate things without using a GAN, like mean squared error pixel loss, and discriminate things without using a GAN, like predict on that first generator, you can make a lot of progress. So let's create the critic. So to create just a totally standard fast.ai binary classification model, we need two folders, one folder containing high-res images, one folder containing generated images.

We already have the folder with the high-res images, so we just have to save our generated images. Here's a tiny, tiny bit of code that does that. We're going to create a directory called image_gen and pop it into a variable called path_gen. We've got a little function called save_preds that takes a data loader, and we're going to grab all of the file names, because remember that in an item list, the .items attribute contains the file names, if it's an image item list.

So here's the file names in that data loader's data set. And so now let's go through each batch of the data loader, and let's grab a batch of predictions for that batch, and then reconstruct equals true, means it's actually going to create fast.ai image objects for each of those, each thing in the batch.

And so then we'll go through each of those predictions and save them. And the name we'll save it with is the name of the original file, but we're going to pop it into our new directory. So that's it. That's how you save predictions. And so you can see I'm kind of increasingly not just using stuff that's already in the fast.ai library, but trying to show you how to write stuff yourself, right?
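That function is close to the following sketch (assuming learn_gen and data_gen from earlier, with path as the dataset root; fix_dl is the training data loader without shuffling or augmentation):

```python
path_gen = path/'image_gen'        # directory for the generated images
path_gen.mkdir(exist_ok=True)

def save_preds(dl):
    "Run the generator over a data loader and save each prediction under its original file name."
    i = 0
    names = dl.dataset.items       # .items holds the file names of an image item list
    for b in dl:
        preds = learn_gen.pred_batch(batch=b, reconstruct=True)   # reconstruct=True -> fastai Image objects
        for o in preds:
            o.save(path_gen/names[i].name)
            i += 1

save_preds(data_gen.fix_dl)
```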

And generally it doesn't require heaps of code to do that. And so if you come back to part two, this is what, you know, lots of part two were kind of like here's how you use things inside the library, and of course here's how we wrote the library. So increasingly writing our own code.

Okay. So save those predictions, and then let's just do a PIL.image.open on the first one, and yep, there it is, okay? So there's an example of a generated image. So now I can train a critic in the usual way. It's really annoying to have to restart Jupyter Notebook to reclaim GPU memory.

So one easy way to handle this is to set something that you know is using a lot of GPU memory, like this learner, to None, and then call gc.collect(). That tells Python to do memory garbage collection, and after that you'll generally be fine -- you'll be able to use all of your GPU memory again.
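In code, the trick is just:

```python
import gc

learn_gen = None   # drop whatever reference is holding the GPU memory (use your own variable name)
gc.collect()       # Python frees it; PyTorch keeps it cached but can now reuse it
```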

If you're using nvidia-smi to actually look at your GPU memory, you won't see it clear, because PyTorch still has a kind of allocated cache, but it makes it available. So you should find this is how you can avoid restarting your notebook. Okay. So we're going to create a critic. It's just an image item list from folder in the totally usual way, and the classes will be image_gen and images.

We'll do a random split because we want to know how well we're doing with a critic to have a validation set. We just label it from folder in the usual way, add some transforms, databunch, normalize, so it's a totally standard object classifier. Okay, so we've got a totally standard classifier.

So here's what some of it looks like. So here's one from the real images, generated images, generated images. So it's going to try and figure out which class is which. Okay, so we're going to use binary cross-entropy as usual, however, we're not going to use a ResNet here. And the reason we'll get into it in more detail in part two, but basically when you're doing a GAN, you need to be particularly careful that the generator and the critic can't kind of both push in the same direction and increase the weights out of control.

So we have to use something called spectral normalization to make GANs work nowadays. We'll learn about that in part two. So if you say gan_critic, fastai will give you a binary classifier suitable for GANs. I strongly suspect we probably could use a ResNet here; we'd just have to create a pre-trained ResNet with spectral norm. I hope to do that pretty soon; we'll see how we go.

But as of now, this is kind of the best approach: there's this thing called gan_critic. And a GAN critic uses a slightly different way of averaging the different parts of the image when it does the loss. So any time you're doing a GAN at the moment, you have to wrap your loss function with AdaptiveLoss.

Again, we'll look at the details in part two; for now, just know this is what you have to do and it'll work. So other than that slightly odd loss function and that slightly odd architecture, everything else is the same. We can call that to create our critic. Because we have this slightly different architecture and slightly different loss function, we use a slightly different metric, which is the GAN-critic equivalent of accuracy. And then we can train it, and you can see it's 98% accurate at recognizing that kind of crappy thing from that kind of nice thing.
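Roughly, building and training the critic looks like this. This is a sketch against the fastai v1 API of the time (some of these methods were later renamed); the split percentage, image size, batch size and epochs are illustrative, and accuracy_thresh_expand is, as far as I recall, the GAN-flavoured accuracy metric referred to above.

```python
from fastai.vision import *
from fastai.vision.gan import *

data_crit = (ImageItemList.from_folder(path, include=['image_gen', 'images'])
             .random_split_by_pct(0.1, seed=42)            # keep a validation set for the critic
             .label_from_folder(classes=['image_gen', 'images'])
             .transform(get_transforms(max_zoom=2.), size=128)
             .databunch(bs=64).normalize(imagenet_stats))

loss_critic = AdaptiveLoss(nn.BCEWithLogitsLoss())          # wrap the loss, as described above
learn_crit = Learner(data_crit, gan_critic(), metrics=accuracy_thresh_expand,
                     loss_func=loss_critic, wd=1e-3)
learn_crit.fit_one_cycle(6, 1e-3)
```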

And of course, we don't see the numbers here anymore, right, because these are the generated images, the generator already knows how to get rid of those numbers that are written on top. So let's finish up this GAN. Now that we have pre-trained the generator and pre-trained the critic, we now need to get it to ping-pong between training a little bit of each.

And the amount of time you spend on each of those things and the learning rates you use are still a little bit on the fussy side. So we've created a GANLearner for you, which you just pass your generator and your critic to - which we've just simply loaded here from the ones we just trained - and it will go ahead and, when you go learn.fit, it will do that for you.

It will figure out how much time to train the generator and then when to switch to training the discriminator - the critic - and it will go back and forth. These weights here relate to the fact that we don't only use the critic as the loss function. If we only used the critic as the loss function, the GAN could get very good at creating pictures that look like real pictures, but they actually have nothing to do with the original photo at all.

So we actually add together the pixel loss and the critic loss. And so those two losses are kind of on different scales. So we multiply the pixel loss by something between about 50 and about 200. Again, something in that range generally works pretty well. Something else with GANs, GANs hate momentum when you're training them.

It kind of doesn't make sense to train them with momentum because you keep switching between generator and critic, so it's kind of tough. Maybe there are ways to use momentum, but I'm not sure anybody's figured it out. This number here, when you create an Adam optimizer, is where the momentum goes, so you should set that to zero.
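Putting those hyperparameters together, the GANLearner setup is roughly this (a sketch: learn_gen and learn_crit are the pre-trained generator and critic learners, and the argument names follow the lesson notebook rather than a guaranteed-stable API):

```python
from functools import partial
from torch import optim
from fastai.vision.gan import *

switcher = partial(AdaptiveGANSwitcher, critic_thresh=0.65)   # decides when to flip between gen and critic

learn = GANLearner.from_learners(
    learn_gen, learn_crit,
    weights_gen=(1., 50.),                           # roughly (critic-loss weight, pixel-loss weight): pixel loss scaled up ~50-200x
    show_img=False, switcher=switcher,
    opt_func=partial(optim.Adam, betas=(0., 0.99)),  # beta1, i.e. the momentum term, set to 0
    wd=1e-3)

learn.fit(40, 1e-4)
```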

So anyway, if you're doing GANs, use these hyperparameters, it should work. So that's what GAN learner does, and so then you can go fit, and it trains for a while. And one of the tough things about GANs is that these loss numbers, they're meaningless. You can't expect them to go down, because as the generator gets better, it gets harder for the discriminator, the critic.

Then as the critic gets better, it gets harder for the generator. So the numbers should stay about the same. So that's one of the tough things about training GANs, is it's kind of hard to know how are they doing. So the only way to know how are they doing is to actually take a look at the results from time to time.

And so if you put show_img=True here, it will actually print out a sample after every epoch. I haven't put that in the notebook because it makes it too big for the repo, but you can try that. So I've just put the results at the bottom, and here it is.

So pretty beautiful, I would say. We already knew how to get rid of the numbers, but we now don't really have that kind of artifact of where it used to be. And it's definitely sharpening up this little kitty cat quite nicely. It's not great, always. There's some weird kind of noise going on here.

It's certainly a lot better than the horrible original. This is a tough job to turn that into that. But there are some really obvious problems. Like here, these things ought to be eyeballs, and they're not. So why aren't they? Well, our critic doesn't know anything about eyeballs. And even if it did, it wouldn't know that eyeballs are particularly important.

We care about eyes. Like when we see a cat without eyes, it's a lot less cute. I mean, I'm more of a dog person, but it just doesn't know that this is a feature that matters. Particularly because the critic, remember, is not a pre-trained network. So I kind of suspect that if we replace the critic with a pre-trained network that's been pre-trained on ImageNet but is also compatible with GANs, it might do a better job here.

But it's definitely a shortcoming of this approach. So we're going to have a break. Question first. And then we'll have a break. And then after the break, I will show you how to find the cat's eyeballs again. For what kind of problems do you not want to use U-Nets?

Well, U-Nets are for when the size of your output is similar to the size of your input and kind of aligned with it. There's no point kind of having cross-connections if that level of spatial resolution in the output isn't necessary or useful. So any kind of generative modeling - and segmentation is a kind of generative modeling.

It's generating a picture which is a mask of the original objects. So probably anything where you want that resolution of the output to be of the same kind of fidelity as resolution of the input. Obviously, something like a classifier makes no sense. In a classifier, you just want the downsampling path, because at the end, you just want a single number, which is like, is it a dog or a cat, or what kind of pet is it, or whatever.

Great. Okay. So let's get back together at 5 past 8. Just before we leave GANs, I'll just mention there's another notebook you might be interested in looking at, which is Lesson 7 WGAN. When GANs started a few years ago, people generally used them to kind of create images out of thin air, which I personally don't think is a particularly useful or interesting thing to do, but it's kind of a good, I don't know, it's a good research exercise, I guess.

So we implemented this WGAN paper, which was kind of really the first one to do a somewhat adequate job somewhat easily, and so you can see how to do that with the fastai library. It's kind of interesting, because the dataset we use is this LSUN bedrooms dataset, which we've provided in our URLs, which just, as you can see, has bedrooms, lots and lots and lots of bedrooms.

And the approach - you'll see in the prose here that Sylvain wrote - the approach that we use in this case is to just say, can we create a bedroom? And so what we actually do is that the input to the generator isn't an image that we clean up. We actually feed the generator random noise.

And so then the generator's task is: can you turn random noise into something which the critic can't tell the difference between that output and a real bedroom? And so we're not doing any pre-training here, or any of the stuff that makes this kind of fast and easy. So this is a very traditional approach, but you can still see, you still just go GANLearner - and there's actually a WGAN version, which is this kind of older-style approach - but you just pass in the data and the generator and the critic in the usual way, and you call fit. And you'll see - in this case we have show image on - after epoch one it's not creating great bedrooms, or two, or three. And you can really see that in the early days of these kinds of GANs, it doesn't do a great job of anything. But eventually, after a couple of hours of training, it's producing somewhat bedroom-ish things.
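As a sketch, the whole WGAN notebook boils down to something like this (all names and arguments are taken on trust from that notebook and from fastai v1, so treat them as assumptions):

```python
from functools import partial
from torch import optim
from fastai.vision import *
from fastai.vision.gan import *

# 100-dim random noise in, 64x64 bedroom-ish images out.
data = (GANItemList.from_folder(path, noise_sz=100)          # inputs are noise vectors, not images
        .split_none()                                        # no validation set needed here
        .label_from_func(noop)
        .transform(tfms=[[crop_pad(size=64, row_pct=(0, 1), col_pct=(0, 1))], []],
                   size=64, tfm_y=True)
        .databunch(bs=128))

generator = basic_generator(in_size=64, n_channels=3, n_extra_layers=1)
critic    = basic_critic(in_size=64, n_channels=3, n_extra_layers=1)

learn = GANLearner.wgan(data, generator, critic, switch_eval=False,
                        opt_func=partial(optim.Adam, betas=(0., 0.99)), wd=0.)
learn.fit(30, 2e-4)
```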

So anyway, it's a notebook you can have a play with, and it's a bit of fun. So I was very excited when we got fast.ai to the point in the last week or so that we had GANs working in a way where, kind of API-wise, they're far more concise and more flexible than any other library that exists. But I was also kind of disappointed that they take a long time to train and the outputs are still so-so. And so the next step was, well, can we get rid of GANs entirely?

So the first step with that - I mean, obviously, the thing we really want to do is come up with a better loss function. We want a loss function that does a good job of saying this is a high-quality image without having to go through all the GAN trouble, and preferably one that doesn't just say it's a high-quality image, but that it's an image which actually looks like the thing it's meant to.

So the real trick here comes back to this paper from a couple of years ago, Perceptual Losses for Real-Time Style Transfer and Super-Resolution. Justin Johnson et al. created this thing they call perceptual losses. It's a nice paper, but I hate this term because there's nothing particularly perceptual about them.

I would call them feature losses. So in the fastai library, you'll see this referred to as feature losses. And it shares something with GANs, which is that after we go through our generator - which they call the image transform net - and you can see it's got this kind of U-Net shaped thing.

They didn't actually use U-Nets, because at the time this came out, nobody in the machine learning world much knew about U-Nets. Nowadays, of course, we use U-Nets. But anyway, something U-Net-ish. I should mention, like, in these architectures where you have a downsampling path followed by the upsampling path, the downsampling path is very often called the encoder.

As you saw in our code, actually, we called that the encoder. And the upsampling path is very often called the decoder. In generative models, generally, including generative text models, neural translation, stuff like that, they tend to be called the encoder and the decoder, two pieces. So we have this generator, and we want a loss function that says, you know, is the thing that it's created like the thing that we want.

And so the way they do that is they take the prediction - remember, Y hat is what we normally use for a prediction from a model - we take the prediction and we put it through a pre-trained ImageNet network. So at the time that this came out, the pre-trained ImageNet network they were using was VGG.

People still -- it's kind of old now, but people still tend to use it because it works fine for this process. So they take the prediction and they put it through VGG, the pre-trained ImageNet network. It doesn't matter too much which one it is. And so normally the output of that would tell you, hey, is this generated thing, you know, a dog or a cat or an airplane or a fire engine or whatever, right?

But in the process of getting to that final classification, it goes through lots of different layers. And in this case, they've color-coded all the layers with the same grid size in the feature map with the same color. So every time we switch colors, we're switching grid size. So there's a stride-2 conv, or in VGG's case they actually still use max pooling layers, which is kind of a similar idea.

And so what we could do is say, hey, let's not take the final output of the VGG model on this generated image, but let's take something in the middle. Let's take the activations of some layer in the middle. So those activations might be a feature map of like 256 channels by 28 by 28, say.

And so those kind of 28 by 28 grid cells will kind of roughly semantically say things like, hey, in this part of that 28 by 28 grid, is there something that looks kind of furry? Or is there something that looks kind of shiny? Or is there something that looks kind of circular?

Or is there something that kind of looks like an eyeball or whatever? So what we do is that we then take the target, so the actual Y value, and we put it through the same pre-trained VGG network, and we pull out the activations at the same layer, and then we do a mean squared error comparison.

So it'll say, OK, in the real image, grid cell 1, 1 of that 28 by 28 feature map is furry and blue and round shaped, and in the generated image, it's furry and blue and not round shaped. So it's kind of like an OK match. So that ought to go a long way towards fixing our eyeball problem, because in this case, the feature map is going to say, there's eyeballs here-- sorry, here-- but there isn't here.

So do a better job of that, please. Make better eyeballs. So that's the idea. So that's what we call feature losses, or Johnson et al. called perceptual losses. So to do that, we're going to use the Lesson 7 Super Res notebook. And this time, the task we're going to do is kind of the same as the previous task, but I wrote this notebook a little bit before the GAN notebook.

Before I came up with the idea of putting text on it and having a random JPEG quality. So JPEG quality is always 60. There's no text written on top, and it's 96 by 96. And it's before I realized what a great word "crapify" is, so it's called resize. So here's our crappy images and our original images, kind of a similar task to what we had before.

So I'm going to try and create a loss function which does this. So the first thing I do is I define a base loss function, which is basically like, how am I going to compare the pixels and the features? And the choices mainly are like MSE or L1. Doesn't matter too much, which you choose.

I tend to like L1 better than MSE, actually. So I picked L1. So any time you see base_loss, we mean L1 loss. You could use MSE loss as well. So let's create a VGG model, just using the pre-trained model. In VGG, there's an attribute called .features, which contains the convolutional part of the model.

So here's the convolutional part of the VGG model. Because we don't need the head, because we only want the intermediate activations. So then we'll chuck that on the GPU. We'll put it into eval mode, because we're not training it. And we'll turn off requires_grad, because we don't want to update the weights of this model.

We're just using it for inference, for the loss. So then let's enumerate through all the children of that model and find all of the max pooling layers. Because in the VGG model, that's where the grid size changes. And as you can see from this picture, we kind of want to grab features from every time just before the grid size changes.

So we grab layer i minus 1. So that's the layer before it changes. So there's our list of layer numbers just before the max pooling layers. And all of those are ReLUs, not surprisingly. So those are where we want to grab some features from. So we put that in blocks.
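In code, that's roughly (a sketch following the notebook):

```python
from fastai.vision import *
import torch.nn.functional as F

base_loss = F.l1_loss            # the pixel/feature comparison: L1 (MSE would also work)

# Convolutional part of a pre-trained VGG-16-bn, on the GPU, in eval mode, frozen --
# we only use it for inference inside the loss.
vgg_m = vgg16_bn(True).features.cuda().eval()
requires_grad(vgg_m, False)

# The layer just before each MaxPool is the last one at that grid size,
# so those are the spots we'll grab activations from.
blocks = [i - 1 for i, o in enumerate(children(vgg_m)) if isinstance(o, nn.MaxPool2d)]
```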

It's just a list of IDs. So here's our feature_loss class, which is going to implement this idea. So basically, when we call the feature_loss class, we're going to pass it some pre-trained model. And so that's going to be called m_feat. That's the model which contains the features which we want our feature loss calculated on.

So we can go ahead and grab all of the layers from that network that we want the features for to create the losses. So we're going to need to hook all of those outputs. Because remember, that's how we grab intermediate layers in PyTorch is by hooking them. So this is going to contain our hooked outputs.

So now, in the forward of feature_loss, we're going to make features passing in the target. So this is our actual Y, which is just going to call that VGG model and go through all of the stored activations and just grab a copy of them. And so we're going to do that both for the target - call that out_feat - and for the input - that's the output of the generator - call that in_feat.

So that's the output of a generator in_feet. And so now, let's calculate the L1 loss between the pixels. Because we still want the pixel loss a little bit. And then let's also go through all of those layers features and get the L1 loss on them. So we're basically going through every one of these end of each block and grabbing the activations and getting the L1 on each one.

So that's going to end up in this list called feature_losses, which I then sum them all up. And by the way, the reason I do it as a list is because we've got this nice little callback that if you put them into a thing called .metrics in your loss function, it'll print out all of the separate layer loss amounts for you, which is super handy.
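A condensed sketch of that feature_loss class, leaving out the gram-matrix part that gets mentioned in a moment (base_loss, vgg_m and blocks come from the sketch above; layer choices and weights follow the notebook):

```python
from fastai.callbacks.hooks import hook_outputs

class FeatureLoss(nn.Module):
    "A condensed sketch of the lesson's feature loss."
    def __init__(self, m_feat, layer_ids, layer_wgts):
        super().__init__()
        self.m_feat = m_feat
        self.loss_features = [self.m_feat[i] for i in layer_ids]
        self.hooks = hook_outputs(self.loss_features, detach=False)   # grab intermediate activations
        self.wgts = layer_wgts
        self.metric_names = ['pixel'] + [f'feat_{i}' for i in range(len(layer_ids))]

    def make_features(self, x, clone=False):
        self.m_feat(x)                                    # a forward pass fills the hooks
        return [(o.clone() if clone else o) for o in self.hooks.stored]

    def forward(self, input, target):
        out_feat = self.make_features(target, clone=True) # features of the real image
        in_feat  = self.make_features(input)              # features of the generated image
        self.feat_losses  = [base_loss(input, target)]    # keep a little pixel loss
        self.feat_losses += [base_loss(f_in, f_out) * w
                             for f_in, f_out, w in zip(in_feat, out_feat, self.wgts)]
        self.metrics = dict(zip(self.metric_names, self.feat_losses))   # picked up by LossMetrics
        return sum(self.feat_losses)

    def __del__(self): self.hooks.remove()

feat_loss = FeatureLoss(vgg_m, blocks[2:5], [5, 15, 2])
```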

So that's it. That's our perceptual loss, or feature_loss class. And so now we can just go ahead and train a U-Net in the usual way with our data and our pre-trained architecture, which is a ResNet-34, passing in our loss function, which is using our pre-trained VGG model. And this is that callback I mentioned, LossMetrics, which is going to print out all the different layers' losses for us.

These are a couple of things that we'll learn about in part two of the course, but you should use them anyway. lr_find. I just created a little function called do_fit that does fit_one_cycle and then saves the model and then shows the results. So as per usual, because we're using a pre-trained network in our U-Net, we start with frozen layers for the downsampling path, train for a while, and as you can see, we get not only the loss, but also the pixel loss and the loss at each of our feature layers.
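The do_fit helper is roughly this (a sketch; the learning rates and epoch count are placeholders):

```python
def do_fit(save_name, lrs=slice(1e-3), pct_start=0.9):
    # one cycle of training, then save and show a row of results
    learn.fit_one_cycle(10, lrs, pct_start=pct_start)
    learn.save(save_name)
    learn.show_results(rows=1, imgsize=5)

learn.lr_find()                    # pick a learning rate as usual
do_fit('1a')                       # downsampling path frozen first
learn.unfreeze()
do_fit('1b', slice(1e-5, 1e-3))    # then unfreeze and train some more
```

During each of these fits, the LossMetrics callback prints the pixel loss and each feature-layer loss alongside the overall loss.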

And then also something we'll learn about in part two called gram_loss, which I don't think anybody's used for super-resolution before as far as I know, but as you'll see, it turns out great. So that's eight minutes, so much faster than a GAN. And already, as you can see, this is our model's output - pretty good.

So then we unfreeze and train some more, and it's a little bit better. And then let's switch up to double the size, and so we need to also halve the batch size to avoid running out of GPU memory. And freeze again and train some more, so it's now taking half an hour.

Even better. And then unfreeze and train some more. So all in all, we've done about an hour and 20 minutes of training. And look at that! It's done it. It knows that eyes are important, so it's really made an effort. It knows that fur is important, so it's really made an effort.

So it started with something with JPEG artifacts around the ears and all this mess and eyes that are just kind of vague, light blue things, and it really created a lot of texture. This cat is clearly kind of like looking over the top of one of those little clawing frames covered in fuzz, so it's actually recognized that this thing is probably a kind of carpety material and created a carpety material for us.

So I mean, that's just remarkable. So talking of remarkable, we can now - so I've never seen outputs like this before without a GAN. So I was just so excited when we were able to generate this. And so quickly - one GPU, an hour and a half. So if you create your own crapification functions and train this model, you'll build stuff that nobody's built before.

Because like nobody else's that I know of is doing it this way. So there are huge opportunities, I think. So check this out. What we can now do is we can now, instead of starting with our low res, I actually stored another set at size 256, which are called medium res.

So let's see what happens if we upsize a medium res. So we're going to grab our medium res data. And here is our medium res stored photo. And so can we improve this? So you can see there's still a lot of room for improvement. Like you see the lashes here are very pixelated.

The parts where there should be hair here are just kind of fuzzy. So watch this area as I hit down on my keyboard. Bump. Look at that. It's done it. You know, it's taken a medium res image and it's made a totally clear thing here. You know, the fur has reappeared.

Look at the eyeball. Let's go back. The eyeball here is just kind of a general blue thing. Here it's added all the right texture, you know. So I just think this is super exciting, you know. Here's a model I trained in an hour and a half using standard stuff that you've all learned about - a U-Net, a pre-trained model, a feature loss function - and we've got something which can turn that into that or, you know, this absolute mess into this.

And like it's really exciting to think what could you do with that, right? So one of the inspirations here has been a guy called Jason Antic. And Jason was a student in the course last year. And what he did very sensibly was decide to focus: he basically nearly quit his job and worked four days a week - really six days a week - on studying deep learning.

And as you should do, he created a kind of capstone project. And his project was to combine GANs and feature losses together. And his crapification approach was to take color pictures and make them black and white. So he took the whole of ImageNet, created a black and white ImageNet, and then trained a model to recolorize it.

And he's put this up as DeOldify. And now he's got these actual old photos from the 19th century that he's turning into color. And like what this is doing is incredible. Like look at this. The model thought, oh, that's probably some kind of copper kettle. So I'll make it like copper colored.

And oh, these pictures are on the wall. They're probably like different colors to the wall. And maybe that looks a bit like a mirror. Maybe it would be reflecting stuff outside, you know. These things might be vegetables. Vegetables are often red. You know, let's make them red. It's extraordinary what it's done.

And you could totally do this, too. Like you can take our feature loss and our GAN loss and combine them. So I'm very grateful to Jason, because he's helped us build this lesson. And it's been really nice, because we've been able to help him, too, because he hadn't realized that he can use all this pre-training and stuff.

And so hopefully you'll see DeOldify in the next couple of weeks be even better at de-oldification. But hopefully you all can now add other kinds of de-crapification methods as well. So I like every course, if possible, to show something totally new, because then every student has a chance to basically build things that have never been built before.

So this is kind of that thing, you know, but between the much better segmentation results and these much simpler and faster de-crapification results, I think you can build some really cool stuff. Did you have a question? Is it possible to use similar ideas to U-Nets and GANs for NLP?

For example, if I want to tag the verbs and nouns in a sentence or create a really good Shakespeare generator? Yeah, pretty much. We don't fully know yet. It's a pretty new area, but there's a lot of opportunities there. And we'll be looking at some in a moment, actually.

So I actually tried training this -- well, I actually tried testing this on this -- remember this picture I showed you with a slide last lesson? And it's a really rubbishy-looking picture, and I thought, what would happen if we tried running this just through the exact same model and it changed it from that to that?

So I thought that was a really good example. You can see something it didn't do, which is this weird discoloration. It didn't fix it, because I didn't crapify things with weird discoloration, right? So if you want to create really good image restoration, like I say, you need really good crapification.

Okay. So here's what we've learned so far in the course - some of the main things. So we've learned that neural nets consist of sandwiched layers of affine functions, which are basically matrix multiplications (or a slightly more general version), and nonlinearities, like ReLU. And we learned that the results of those calculations are called activations, and the things that go into those calculations that we learn are called parameters, and that the parameters are initially randomly initialized, or we copy them over from a pre-trained model, and then we train them with SGD or faster versions. And we learned that convolutions are a particular kind of affine function that works great for auto-correlated data, so things like images and stuff.

We learned about batch norm, dropout, data augmentation and weight decay as ways of regularizing models, and also that batch norm helps train models more quickly. And then today we've learned about res/dense blocks. We've learned a lot about image classification and regression, embeddings, categorical and continuous variables, collaborative filtering, language models and NLP classification, and then kind of segmentation, U-Nets and GANs.

So go over these things and make sure that you feel comfortable with each of them. If you've only watched this series once, you definitely won't. People normally watch it three times or so to really understand the detail. So one thing that doesn't get here is RNNs. So that's the last thing we're going to do, RNNs.

So RNNs, I'm going to introduce a little kind of diagrammatic method here to explain RNNs. And the diagrammatic method, I'll start by showing you a basic neural net with a single hidden layer. Square means an input. So that'll be batch size by number of inputs. So kind of, you know, batch size by number of inputs.

An arrow means a layer, broadly defined, such as matrix product followed by ReLU. A circle is activations. So in this case, we have one set of hidden activations. And so given that the input was number of inputs, this here is a matrix of number of inputs by number of activations.

So the output will be batch size by number of activations. It's really important you know how to calculate these shapes. So go learn.summary lots to see all the shapes. So then here's another arrow. So that means it's another layer, matrix product followed by non-linearity. In this case, we're going to the output, so we use softmax.

And then triangle means an output. And so this matrix product will be number of activations by number of classes. So our output is batch size by number of classes. So let's reuse that key, remember, triangle output, circle is activations, hidden state, we also call that, and rectangle is input.
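To make those shapes concrete, here's a tiny sketch with made-up sizes:

```python
import torch

bs, n_in, n_hidden, n_classes = 64, 784, 100, 10   # made-up sizes

x  = torch.randn(bs, n_in)                # rectangle: batch_size x n_inputs
w1 = torch.randn(n_in, n_hidden)          # arrow: n_inputs x n_activations
h  = torch.relu(x @ w1)                   # circle: batch_size x n_activations
w2 = torch.randn(n_hidden, n_classes)     # arrow: n_activations x n_classes
y  = torch.softmax(h @ w2, dim=1)         # triangle: batch_size x n_classes
print(h.shape, y.shape)                   # torch.Size([64, 100]) torch.Size([64, 10])
```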

So let's now imagine that we wanted to get a big document, split it into sets of three words at a time, and grab each set of three words and then try to predict the third word using the first two words. So if we had the data set in place, we could grab word one as an input, chuck it through an embedding, create some activations, pass that through a matrix product and non-linearity, grab the second word, put it through an embedding, and then we could either add those two things together or concatenate them.

Generally speaking, when you see kind of two sets of activations coming together in a diagram, you normally have a choice of concatenate or add. And that's going to create a second bunch of activations, and then you can put it through one more fully connected layer and softmax to create an output.

So that would be a totally standard, fully connected neural net with one very minor tweak, which is concatenating or adding at this point, which we could use to try to predict the third word from pairs of two words. So remember, arrows represent layer operations, and I removed in this one the specifics of what they are because they're always an affine function followed by a non-linearity.

Let's go further. What if we wanted to predict word four using words one and two and three? It's basically the same picture as last time, except with one extra input and one extra circle. But I want to point something out, which is each time we go from rectangle to circle, we're doing the same thing.

We're doing an embedding, which is just a particular kind of matrix multiply, where you have a one-hot encoded input. Each time we go from circle to circle, we're basically taking one piece of hidden state, one set of activations, and turning it into another set of activations by saying we're now at the next word.

And then when we go from circle to triangle, we're doing something else again, which is we're saying let's convert the hidden state, these activations, into an output. So it would make sense, so you can see I've colored each of those arrows differently. So each of those arrows should probably use the same weight matrix, because it's doing the same thing.

So why would you have a different set of embeddings for each word, or a different matrix to multiply by to go from this hidden state to this hidden state versus this one? So this is what we're going to build. So we're now going to jump into human numbers, which is lesson7-human-numbers, and this is the dataset that I created, which literally just contains all the numbers from one to 9,999 written out in English.

And we're going to try and create a language model that can predict the next word in this document. It's just a toy example for this purpose. So in this case, we only have one document, and that one document is the list of numbers. So we can use a text list to create an item list with text in for the training and the validation.

In this case, the validation set is the numbers from 8,000 onwards, and the training set is 1 to 8,000. We can combine them together, turn that into a data bunch. So we only have one document. So train zero is the document. Grab its dot text. That's how you grab the contents of a text list, and here are the first 80 characters.
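Roughly, that data setup looks like this (a sketch; the helper names follow the lesson notebook and fastai v1, so treat the details as assumptions):

```python
from fastai.text import *

path = untar_data(URLs.HUMAN_NUMBERS)

def readnums(d): return [', '.join(o.strip() for o in open(path/d).readlines())]

train = TextList(readnums('train.txt'), path=path)   # one long "document" of spelled-out numbers
valid = TextList(readnums('valid.txt'), path=path)

src  = ItemLists(path=path, train=train, valid=valid).label_for_lm()
data = src.databunch(bs=64)

data.train_ds[0][0].text[:80]   # peek at the first 80 characters -- it starts with xxbos
```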

It starts with a special token, XXBOS. Anything starting with XX is a special fast AI token. BOS is the beginning of stream token. It basically says this is the start of a document. It's very helpful in NLP to know when documents start so that your models can learn to recognize them.

The validation set contains 13,000 tokens, so 13,000 words or punctuation marks, because everything between spaces is a separate token. The batch size that we asked for was 64. And then by default, it uses something called BPTT of 70. BPTT, as we briefly mentioned, stands for backprop through time. That's the sequence length.

So with each of our 64 document segments, we split it up into lists of 70 words that we look at at one time. So what we do is we grab this for the validation set, an entire string of 13,000 tokens, and then we split it into 64 roughly equal sized sections.

People very, very, very often think I'm saying something different. I did not say they are of length 64. They're not. They're 64 equally sized roughly segments. So we take the first 1/64 of the document, piece one. 1/64, piece two. And then for each of those 1/64 of the document, we then split those into pieces of length 70.

So each batch -- so let's now say, okay, for those 13,000 tokens, how many batches are there? Well, divide by batch size and divide by 70. So there's about 2.9 batches. So there's going to be three batches. So let's grab an iterator for our data loader, grab one, two, three batches, the X and the Y, and let's add up the number of elements, and we get back slightly less than this because there's a little bit left over at the end that doesn't quite make up a full batch.

So this is the kind of stuff you should play around with a lot, lots of shapes and sizes and stuff and iterators. As you can see, it's 95 by 64. I claimed it was going to be 70 by 64. That's because our data loader for language models slightly randomizes, BPTT, just to give you a bit more kind of shuffling, get a bit more randomization.
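That poking around looks something like this (a sketch; the exact numbers will vary from run to run):

```python
it = iter(data.valid_dl)
x1, y1 = next(it)
x2, y2 = next(it)
x3, y3 = next(it)

x1.numel() + x2.numel() + x3.numel()   # a bit under the ~13,000 validation tokens
x1.shape                                # the two dimensions are the batch size of 64 and a bptt jittered around 70 (here 95)
```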

It helps the model. And so here you can see the first batch of X. Remember, we've numericalized all these. And here's the first batch of Y. And you'll see here, this is 2, 18, 10, 11, 8. This is 18, 10, 11, 8. So this one is offset by 1 from here because that's what we want to do with a language model.

We want to predict the next word. So after 2 should come 18. And after 18 should come 10. You can grab the vocab for this data set. And a vocab has a textify. So if we look at the same thing but with textify, that'll just look it up in the vocab.

So here you can see XXBOS 8001. Whereas in the Y, there's no XXBOS. It's just 8001. So after XXBOS is 8, after 8 is 1, after 1000 is 1. And so then after we get 8023 comes X2. And look at this. We're always looking at column 0. So this is the first batch, the first mini-batch.

Comes 8024 and then X3 all the way up to 8040. And so then we can go right back to the start but look at batch 1. So index 1, which is batch number 2. And now we can continue. A slight skip from 8040 to 8046. That's because the last mini-batch wasn't quite complete.

So what this means is that every mini-batch joins up with the previous mini-batch. So you can go straight from X1, 0 to X2, 0. It continues. 8023, 8024, right? And so if you look at the same thing for colon, comma, 1, you'll also see they join up. So all the mini-batches join up.

So that's the data. We can do show batch to see it. And here is our model which is doing this. So this is just the code copied over. So it contains one embedding, i.e. the green arrow, one hidden to hidden brown arrow layer, and one hidden to output. So each colored arrow has a single matrix.

And so then in the forward pass, we take our first input, X0, and put it through input to hidden, the green arrow, create our first set of activations, which we call H. Assuming that there is a second word, because sometimes we might be at the end of a batch where there isn't a second word, assuming there is a second word, then we would add to H the result of X1, put through the green arrow.

Remember that's i_h. And then we would say, okay, our new H is the result of those two added together, put through our hidden to hidden, orange arrow, and then ReLU, then batch norm. And then for the next word, do exactly the same thing. And then finally, blue arrow, put it through h_o.
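Here's a sketch of that model. The vocab size nv, hidden size nh, and the assumption that batches arrive batch-first (shape batch_size by sequence_length) are illustrative choices, not something the lesson pins down:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

nv, nh = 40, 64   # vocab size and number of hidden activations (illustrative values)

class Model0(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)   # green arrow: input (word) to hidden, an embedding
        self.h_h = nn.Linear(nh, nh)      # orange arrow: hidden to hidden
        self.h_o = nn.Linear(nh, nv)      # blue arrow: hidden to output
        self.bn  = nn.BatchNorm1d(nh)

    def forward(self, x):
        h = self.i_h(x[:, 0])                    # embed the first word
        if x.shape[1] > 1:
            h = h + self.i_h(x[:, 1])            # add the second word's embedding
            h = self.bn(F.relu(self.h_h(h)))     # hidden to hidden, ReLU, batch norm
        if x.shape[1] > 2:
            h = h + self.i_h(x[:, 2])            # and the third, if present
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)                       # predict the next word
```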

So that's how we convert our diagram to code. So nothing new here at all. So now we can chuck that in a learner and we can train it - 46%. Let's take this code and recognize it's pretty awful. There's a lot of duplicate code. And as coders, when we see duplicate code, what do we do?

We refactor. So we should refactor this into a loop. So here we are. We've refactored it into a loop. So now we're going, for each xi in x, and doing it in the loop. Guess what? That's an RNN. An RNN is just a refactoring. It's not anything new.

This is now an RNN. And let's refactor our diagram from this to this. This is the same diagram. But I've just replaced it with my loop. Does the same thing. So here it is. It's got exactly the same in it. Literally exactly the same. Just popped a loop here.
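The refactored version, as a sketch (same imports, nv and nh as in the previous sketch):

```python
class Model1(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)
        self.h_h = nn.Linear(nh, nh)
        self.h_o = nn.Linear(nh, nv)
        self.bn  = nn.BatchNorm1d(nh)

    def forward(self, x):
        h = torch.zeros(x.shape[0], nh, device=x.device)   # a bunch of zeros to add to
        for i in range(x.shape[1]):                        # the loop is the whole trick
            h = h + self.i_h(x[:, i])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)
```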

Before I start, I just have to make sure I've got a bunch of zeros to add to. And of course I get exactly the same result when I train it. Okay. So next thing that you might think then -- and one nice thing about the loop, though, is now this will work even if I'm not predicting the fourth word from the previous three but the ninth word from the previous eight.

It will work for any arbitrarily length long sequence, which is nice. So let's up the BPTT to 20 since we can now. And let's now say, okay, instead of just predicting the nth word from the previous n minus 1, let's try to predict the second word from the first and the third from the second and the fourth from the third and so forth.

Because previously -- look at our loss function. Previously we were comparing the result of our model to just the last word of the sequence. It's very wasteful because there's a lot of words in the sequence. So let's compare every word in X to every word in Y. So to do that, we need to change this so it's not just one triangle at the end of the loop.

But the triangle is inside this, right? So that in other words, after every loop, predict, loop, predict, loop, predict. So here's this code. It's the same as the previous code but now I've created an array. And every time I go through the loop, I append h_o(h) - the output - to the array.
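Same sketch again, but collecting a prediction after every word:

```python
class Model2(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)
        self.h_h = nn.Linear(nh, nh)
        self.h_o = nn.Linear(nh, nv)
        self.bn  = nn.BatchNorm1d(nh)

    def forward(self, x):
        h = torch.zeros(x.shape[0], nh, device=x.device)
        res = []
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:, i])
            h = self.bn(F.relu(self.h_h(h)))
            res.append(self.h_o(h))              # n inputs -> n outputs
        return torch.stack(res, dim=1)
```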

So now for n inputs, I create n outputs. So I'm predicting after every word. Previously I had 46%. Now I have 40%. Why is it worse? Well, it's worse because now, like when I'm trying to predict the second word, I only have one word of state to use. Right?

So like when I'm looking at the third word, I only have two words of state to use. So it's a much harder problem for it to solve. So the obvious way to fix this then would -- you know, the key problem is here. I go H equals torch.zeros, like I reset my state to zero every time I start another BPTT sequence.

Well, let's not do that. Let's keep H. Right? And we can because remember each batch connects to the previous batch. It's not shuffled like happens in image classification. So let's take this exact model and replicate it again. But let's move the creation of H into the constructor. Okay. There it is.
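As a sketch, with bs as the batch size; note the detach at the end, which keeps the state but stops the gradient history growing without bound:

```python
bs = 64   # batch size, matching the databunch

class Model3(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)
        self.h_h = nn.Linear(nh, nh)
        self.h_o = nn.Linear(nh, nv)
        self.bn  = nn.BatchNorm1d(nh)
        self.h   = torch.zeros(bs, nh).cuda()    # the state now lives across mini-batches

    def forward(self, x):
        res, h = [], self.h
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:, i])
            h = self.bn(F.relu(self.h_h(h)))
            res.append(self.h_o(h))
        self.h = h.detach()                       # keep the state, drop the old gradient graph
        return torch.stack(res, dim=1)
```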

So it's now self.h. So this is now exactly the same code. But at the end, let's put the new H back into self.h. So it's now doing the same thing, but it's not throwing away that state. And so therefore now we actually get above the original. We get all the way up to 54% accuracy.

So this is what a real RNN looks like. You always want to keep that state. But just keep remembering there's nothing different about an RNN. It's a totally normal, fully connected neural net. It's just that you've got a loop you refactored. What you could do, though, is at the end of your -- every loop, you could not just spit out an output, but you could spit it out into another RNN.

So you could have an RNN going into an RNN. And that's nice because we've now got more layers of computation. You would expect that to work better. Well, to get there, let's do some more refactoring. So let's take this code and replace it with the equivalent built-in PyTorch code, which is -- you just say that.

So nn.RNN basically says do the loop for me. We've still got the same embedding, the same output, the same batch norm, the same initialization of H, but we just got rid of the loop. So one of the nice things about nn.RNN is that you can now say how many layers you want.
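A sketch of that refactor onto nn.RNN (the batch norm on the output is left out here to keep it short; nv, nh and bs as above):

```python
class Model4(nn.Module):
    def __init__(self, n_layers=2):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)
        self.rnn = nn.RNN(nh, nh, n_layers, batch_first=True)   # "do the loop for me"
        self.h_o = nn.Linear(nh, nv)
        self.h   = torch.zeros(n_layers, bs, nh).cuda()

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), self.h)   # runs the whole sequence in one call
        self.h = h.detach()
        return self.h_o(res)
```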

So this is the same accuracy, of course. So here I've done it with two layers. But here's the thing. When you think about this, right, think about it without the loop. It looks like this, right? It's like -- it keeps on going -- and we've got a BPTT of 20, so there's 20 layers of this.

And we know from that Visualizing the Loss Landscape of Neural Nets paper that deep networks have awful, bumpy loss surfaces. So when you start creating long time scales and multiple layers, these things get impossible to train. So there's a few tricks you can do. One thing is you can add skip connections, of course.

But what people normally do is instead they put inside -- instead of just adding these together, they actually use a little mini neural net to decide how much of the green arrow to keep and how much of the orange arrow to keep. And when you do that, you get something that's either called a GRU or an LSTM, depending on the details of that little neural net.

And we'll learn about the details of those neural nets in part two. They really don't matter, though, frankly. So we can now say let's create a GRU instead. So it's just like what we had before, but it'll handle longer sequences and deeper networks. Let's use two layers, and we're up to 75%.
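Swapping in a GRU is a one-line change to the previous sketch:

```python
class Model5(Model4):
    def __init__(self, n_layers=2):
        super().__init__(n_layers)
        # same pieces as Model4, but the gated cell copes with longer sequences and deeper stacks
        self.rnn = nn.GRU(nh, nh, n_layers, batch_first=True)
```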

Okay. So that's RNNs. And the main reason I wanted to show it to you was to remove the last remaining piece of magic. And this is one of the least magical things we have in deep learning. It's just a refactored, fully connected network. So don't let RNNs ever put you off.

And with this approach where you basically have a sequence of N inputs and a sequence of N outputs we've been using for language modeling, you can use that for other tasks, right? For example, the sequence of outputs could be for every word. There could be something saying is this something that is sensitive and I want to anonymize or not?

You know, so like is this private data or not? Or it could be a part of speech tag for that word. Or it could be something saying, you know, how should that word be formatted? Or whatever. And so these are called sequence labeling tasks, and so you can use this same approach for pretty much any sequence labeling task.

Or you can do what I did in the earlier lesson, which is once you finish building your language model, you can throw away this h_o bit and instead pop there a standard classification head, and then you can now do NLP classification, which as you saw earlier will give you state-of-the-art results even on long documents.

So this is a super valuable technique and not remotely magical. Okay, so that's it, right? That's deep learning or at least, you know, the kind of the practical pieces from my point of view. Having watched this one time, you won't get it all. And I don't recommend that you do watch this so slowly that you get it all the first time, but you go back and look at it again, take your time, and there'll be bits that you go like, "Oh, now I see what he's saying," and then you'll be able to implement things you couldn't implement before and you'll be able to dig in more than you before.

So definitely go back and do it again. And as you do, write code, not just for yourself, but put it on GitHub. It doesn't matter if you think it's great code or not. The fact that you're writing code and sharing it is impressive, and the feedback you'll get if you tell people on the forum, "Hey, I wrote this code.

It's not great, but it's my first effort. Anything you see, jump out at you," people will say like, "Oh, that bit was done well. Hey, but did you know for this bit you could have used this library and saved you some time?" You'll learn a lot by interacting with your peers.

As you've noticed, I've started introducing more and more papers. Now, part two will be a lot of papers, and so it's a good time to start reading some of the papers that have been introduced in this section. All the bits that say derivation and theorems and lemmas, you can skip them.

I do. They add almost nothing to your understanding of practical deep learning. But the bits that say why are we solving this problem, and what are the results, and so forth are really interesting. And then try and write English prose. Not English prose that you want to be read by Geoffrey Hinton and Yann LeCun, but English prose that you want to be read by you as of six months ago.

Because there's a lot more people in the audience of you as of six months ago than there is of Jeffrey Hinton and Yann LeCun. That's the person you best understand. You know what they need. Go and get help and help others. Tell us about your success stories. But perhaps the most important one is get together with others.

People's learning works much better if you've got that social experience. So start a book club, get involved in meetups, create study groups, and build things. And again, it doesn't have to be amazing. Just build something that you think the world would be a little bit better if that existed.

Or you think it would be kind of slightly delightful to your two-year-old to see that thing. Or you just want to show it to your brother the next time they come around to see what you're doing. Whatever. Just finish something. Finish something. And then try and make it a bit better.

So for example, something I just saw this afternoon is the Elon Musk tweet generator. So looking at lots of older tweets, creating a language model from Elon Musk, and then creating new tweets such as humanity will also have an option to publish on its own journey as an alien civilization.

It will always, like all human beings, Mars is no longer possible. AI will definitely be the central intelligence agency. Okay. So this is great. I love this. And I love that Dave Smith wrote and said, "These are my first ever commits. Thanks for teaching a finance guy how to build an app in eight weeks." Right?

So I think this is awesome. And I think clearly a lot of care and passion is being put into this project. Will it systematically change the future direction of society as a whole? Maybe not. But maybe Elon will look at this and think, "Oh, maybe I need to rethink my method of prose." I don't know.

I think it's great. And so, yeah. Create something. Put it out there. Put a bit of yourself into it. Or get involved in fast AI. The fast AI project, there's a lot going on. You know, you can help with documentation and tests, which might sound boring, but you'd be surprised how incredibly not boring it is to, like, take a piece of code that hasn't been properly documented and research it and understand it and ask Sylvain and me on the forum what's going on.

Why did you write it this way? We'll send you off to the papers that we were implementing. You know, writing a test requires deeply understanding that part of the machine learning world to understand how it's meant to work. So that's always interesting. Stas Bekman has created this nice dev projects index, which you can go on to the forum in the fast AI dev section - actually the dev projects section - and find, like, here's some stuff going on that you might want to get involved in.

Or maybe there's stuff you want to exist. You can add your own. Create a study group. Dean has already created a study group for San Francisco starting in January. This is how easy it is to create a study group. Go on the forum, find your little time zone subcategory and add a post saying let's create a study group.

But make sure you give people a little Google sheet to sign up, some way to actually do something. A great example is Pierre who's been doing a fantastic job in Brazil of running study groups for the last couple of parts of the course and he keeps posting these pictures of people having a good time and learning deep learning together, creating wikis together, creating projects together.

Great experience. And then come back for part two, right, where we'll be looking at all of this interesting stuff in particular going deep into the fast AI code base to understand how did we build it exactly. We'll actually go through, as we were building it, we created notebooks of like here is where we were at each stage.

So we're actually going to see the software development process itself. We'll talk about the process of doing research, how to read academic papers, how to turn math into code, and then a whole bunch of additional types of models that we haven't seen yet. So it'll be kind of like going beyond practical deep learning into actually cutting edge research.

So we've got five minutes to take some questions. We had an AMA going on online and so we're going to have time for a couple of the highest ranked AMA questions from the community. And the first one is by Jeremy's request, although it's not the highest ranked. What's your typical day like?

How do you manage your time across so many things that you do? Yeah, I thought that I hear that all the time. So I thought I should answer it. And I think I've got a few votes. Because I think people who come to our study group are always shocked at how disorganized and incompetent I am.

And so I often hear people saying like, oh, wow, I thought you were like this deep learning role model and I'd get to see how to be like you. And now I'm not sure what to be like you at all. So yeah, it's for me, it's all about just having a good time with it.

I never really have many plans. I just try to finish what I start. If you're not having fun with it, it's really, really hard to continue because there's a lot of frustration in deep learning because it's not like writing a web app, where it's like, you know, authentication check, you know, backend service watchdog check.

Okay, user credentials check. You know, like you're making progress. Where else for stuff like this and stuff that we've been doing the last couple of weeks, it's just like, it's not working. It's not working. It's not working. No, that also didn't work. That also didn't work until oh, my God, it's amazing.

It's a cat. That's kind of what it is, right? So you don't get that regular feedback. So yeah, you know, you got to have fun with it. And so, so my, yeah, my day is kind of, you know, I mean, the other thing I'll do, I'll say I don't, I don't do any meetings.

I don't do phone calls. I don't do coffees. I don't watch TV. I don't play computer games. I spend a lot of time with my family, a lot of time exercising and a lot of time reading and coding and doing things I like. So, you know, I think, you know, the main thing is just finish, finish something like properly finish it.

So when you get to that point where you think you're 80% of the way through, but you haven't quite created a read me yet and the install process is still a bit clunky and you know, this is what 99% of GitHub projects look like. You'll see the read me says to do, you know, complete baseline experiments, document, blah, blah, blah.

It's like, don't be that person. Like just do something properly and finish it and maybe get some other people around you to work with you so that you're all doing it together and you know, get it done. What are the up and coming deep learning machine learning things that you are most excited about?

Also, you've mentioned last year that you are not a believer in reinforcement learning. Do you still feel the same way? Yeah, I still feel exactly the same way as I did three years ago when we started this, which is it's all about transfer learning. It's underappreciated. It's under researched.

Every time we put transfer learning into anything, we make it much better. You know, our academic paper on transfer learning for NLP has, you know, helped be one piece of kind of changing the direction of NLP this year. It's made it all the way to the New York Times, just a stupid, obvious little thing that we threw together.

So I remain excited about that. I remain unexcited about reinforcement learning for most things. I don't see it used by normal people for normal things, for nearly anything. It's an incredibly inefficient way to solve problems which are often solved more simply and more quickly in other ways. It probably has maybe a role in the world, but a limited one and not in most people's day-to-day work.

For someone planning to take part two in 2019, what would you recommend doing learning practicing until the part two course starts? Just code. Yeah, just code all the time. I know it's perfectly possible I hear from people who get to this point of the course and they haven't actually written any code yet.

And if that's you, it's okay. You know, you just go through and do it again, and this time do code, and look at the shapes of your inputs and look at your outputs, and make sure you know how to grab a mini-batch and look at its mean and standard deviation and plot it.

There's so much material that we've covered. If you can get to a point where you can rebuild those notebooks from scratch without too much cheating - when I say from scratch, I mean using the fastai library, not from scratch from scratch - you'll be in the top echelon of practitioners, because you'll be able to do all of these things yourself, and that's really, really rare.

And that'll put you in a great position for part two. Should we do one more? Nine o'clock. We always do one more. Where do you see the fast AI library going in the future, say in five years? Well, like I said, I don't make plans. I just piss around.

So, I mean, our only plan for fast AI as an organization is to make deep learning accessible as a tool for normal people to use for normal stuff. So, as long as we need to code, we failed at that. So, the big goal, because 99.8% of the world can't code.

So, the main goal would be to get to a point where it's not a library but it's a piece of software that doesn't require code. It certainly shouldn't require a goddamn lengthy, hard-working course like this one. So, I want to get rid of the course. I want to get rid of the code.

I want to make it so you can just do useful stuff quickly and easily. So, that's maybe five years? Yeah, maybe longer. All right. Well, I hope to see you all back here for part two. Thank you. (audience applauds)