Lesson 7: Deep Learning 2019 - Resnets from scratch; U-net; Generative (adversarial) networks
Chapters
8:23 add a bit of random padding
11:01 start out creating a simple cnn
27:12 create your own variations of resnet blocks
56:43 create a generator learner
72:46 print out a sample after every epoch
85:17 using the pre-trained model
116:33 add skip connections
00:00:00.000 |
Welcome to lesson seven, the last lesson of part one. 00:00:13.440 |
And so don't let that bother you, because partly what I want to do is to kind of give 00:00:18.520 |
you enough things to think about to keep you busy until part two. 00:00:24.200 |
And so, in fact, some of the things we cover today, I'm not going to tell you about some 00:00:28.720 |
of the details, I'll just point out a few things where I'll say like, okay, that we're 00:00:31.840 |
not talking about yet, that we're not talking about that. 00:00:34.360 |
And so then come back in part two to get the details on some of these extra pieces. 00:00:44.900 |
I'm going to go through things pretty quickly; it might require a few viewings to fully understand it all, and a few experiments. 00:00:52.280 |
I'm going to give you stuff to keep you amused for a couple of months. 00:00:59.520 |
Wanted to start by showing some cool work done by a couple of students, Reshma and Npata01, 00:01:07.520 |
who have developed an Android and an iOS app. 00:01:12.120 |
And so check out Reshma's post on the forum about that, because they have a demonstration 00:01:17.760 |
of how to create both Android and iOS apps that are actually on the Play Store and on 00:01:27.480 |
the App Store; they're the first ones I know of that are using fast AI. 00:01:32.000 |
And let me also say a huge thank you to Reshma for all of the work she does, both for the 00:01:36.180 |
fast AI community and the machine learning community more generally, and also for women in the machine learning community. 00:01:43.560 |
She does a lot of fantastic work, including providing lots of fantastic documentation 00:01:49.440 |
and tutorials and community organizing and so many other things. 00:01:53.340 |
So thank you, Reshma, and congrats on getting this app out there. 00:02:04.200 |
We have lots of Lesson 7 notebooks today, as you see, and we're going to start with 00:02:13.740 |
So the first notebook we're going to look at is Lesson 7 ResNet MNIST. 00:02:18.480 |
And what I want to do is look at some of the stuff we started talking about last week around 00:02:23.560 |
convolutions and convolutional neural networks and start building on top of them to create 00:02:28.600 |
a fairly modern deep learning architecture, largely from scratch. 00:02:34.000 |
When I say from scratch, I'm not going to re-implement things we already know how to 00:02:37.000 |
implement but kind of use the pre-existing PyTorch bits of those. 00:02:42.640 |
So we're going to use the MNIST dataset, which -- so urls.mnist has the whole MNIST dataset. 00:02:52.540 |
So in there, there's a training folder and a testing folder. 00:02:57.440 |
And as I read this in, I'm going to show some more details about pieces of the data block 00:03:02.000 |
API, so that you can see what's going on. 00:03:05.540 |
So far with the data block API, we've kind of said blah, blah, blah, blah, blah, and done it all in one go. 00:03:12.320 |
So the first thing you say is what kind of item list do you have? 00:03:16.640 |
So in this case, it's an item list of images. 00:03:19.680 |
And then where are you getting the list of file names from? 00:03:22.520 |
In this case, by looking in a folder recursively. 00:03:28.880 |
You can pass in arguments that end up going to Pillow, because Pillow, or PIL, is the thing that fastai uses to open the images. 00:03:35.680 |
And in this case, these are black and white rather than RGB. 00:03:39.480 |
So you have to use Pillow's convert_mode='L'. For more details, refer to the Python Imaging 00:03:45.360 |
Library documentation to see what their convert modes are. 00:03:48.960 |
But this one is going to be grayscale, which is what MNIST is. 00:03:54.840 |
So inside an item list is an items attribute. 00:03:58.880 |
And the items attribute is kind of the thing that you gave it. 00:04:02.920 |
It's the thing that it's going to use to create your items. 00:04:05.480 |
So in this case, the thing you gave it really is a list of file names. 00:04:13.600 |
When you show images, normally it shows them in RGB. 00:04:17.960 |
And so in this case, we want to use a binary color map. 00:04:20.360 |
So in FastAI, you can set a default color map. 00:04:23.240 |
For more information about cmap and color maps, refer to the matplotlib documentation. 00:04:28.380 |
And so this will set the default color map for FastAI. 00:04:33.280 |
So our image item list contains 70,000 items. 00:04:36.400 |
And it's a bunch of images that are 1 by 28 by 28. 00:04:45.200 |
You might think, why aren't they just 28 by 28 matrices rather than a 1 by 28 by 28 rank 3 tensor? 00:04:54.800 |
All the conv2d stuff and so forth works on rank 3 tensors. 00:05:00.320 |
So you want to include that unit axis at the start. 00:05:04.240 |
And so FastAI will do that for you even when it's reading 1 channel images. 00:05:11.100 |
So the .items attribute contains the thing that's read to build the image, which in this case is the file name. 00:05:18.720 |
But if you just index into an item list directly, you'll get the actual image object. 00:05:23.400 |
And so the actual image object has a show method. 00:05:28.120 |
So once you've got an image item list, you then split it into training versus validation. 00:05:35.420 |
If you don't want a validation set, you can actually use the .no_split method to create a kind of empty one. 00:05:50.440 |
First create your item list, then decide how to split. 00:05:53.440 |
In this case, we're going to do it based on folders. 00:05:58.000 |
In this case, the validation folder for MNIST is called testing. 00:06:04.520 |
So in FastAI parlance, we use the same kind of parlance that Kaggle does, which is: 00:06:11.680 |
the validation set has labels, and you use it for testing that your model's working. 00:06:19.520 |
The test set doesn't have labels; you use it for doing inference, or submitting to a competition, or sending it off to somebody 00:06:25.060 |
who's held out those labels for testing or whatever. 00:06:29.880 |
So just because a folder in your data set is called testing doesn't mean it's a test 00:06:35.120 |
This one has labels, so it's a validation set. 00:06:39.520 |
So if you want to do inference on lots of things at a time rather than one thing at 00:06:43.720 |
a time, you want to use the test= parameter in FastAI to say this is stuff which has no labels. 00:06:56.480 |
My split data is a training set and a validation set, as you can see. 00:07:02.680 |
So inside the training set, there's a folder for each class. 00:07:09.140 |
So now we can take that split data and say label from folder. 00:07:13.920 |
So first you create the item list, then you split it, then you label it. 00:07:18.200 |
And so you can see now we have an X and a Y, and the Y are category objects. 00:07:30.280 |
So if you index into a label list, such as ll.train as a label list, you will get back 00:07:39.760 |
an independent variable and a dependent variable, X and Y. 00:07:42.520 |
So in this case, the X will be an image object, which I can show, and the Y will be a category object. 00:07:51.360 |
That's the number eight category, and there's the eight. 00:07:59.840 |
In this case, we're not going to use the normal get transforms function because we're doing 00:08:05.360 |
digit recognition, and digit recognition, you wouldn't want to flip it left or right, 00:08:11.480 |
You wouldn't want to rotate it too much, that would change the meaning of it. 00:08:14.800 |
Also because these images are so small, doing zooms and stuff is going to make them so fuzzy as to not be useful. 00:08:20.280 |
So normally for small images of digits like this, you just add a bit of random padding. 00:08:25.940 |
So I'll use the random padding function, which actually returns two transforms, the bit that 00:08:32.240 |
does the padding and the bit that does the random crop. 00:08:34.480 |
So you have to use star to, say, put both these transforms in this list. 00:08:41.920 |
This empty array here is referring to the validation set transforms. 00:08:53.320 |
We can pick a batch size and choose data bunch. 00:08:59.440 |
In this case, we're not using a pre-trained model, so there's no reason to use image net 00:09:05.960 |
And so if you call normalize like this, without passing in stats, it will grab a batch of 00:09:13.200 |
data at random and use that to decide what normalization stats to use. 00:09:18.500 |
That's a good idea if you're not using a pre-trained model. 00:09:25.840 |
And so in that data bunch is a data set, which we've seen already. 00:09:33.880 |
But what is interesting is that the training data set now has data augmentation because 00:09:39.220 |
So plot multi is a fast AI function that will plot the result of calling some function for 00:09:47.960 |
So in this case, my function is just grab the first image from the training set. 00:09:53.160 |
And because each time you grab something from the training set, it's going to load it from 00:09:56.640 |
disk and it's going to transform it on the fly. 00:10:01.320 |
So people sometimes ask like, how many transformed versions of the image do you create? 00:10:09.480 |
Each time we grab one thing from the data set, we do a random transform on the fly. 00:10:15.320 |
So potentially everyone will look a little bit different. 00:10:19.080 |
So you can see here, if we plot the result of that lots of times, we get eights in slightly 00:10:24.280 |
different positions because we did random padding. 00:10:28.680 |
You can always grab a batch of data then from the data bunch. 00:10:33.000 |
Because remember, a data bunch has data loaders, and data loaders are things that you grab 00:10:39.580 |
And so you can then grab an X batch and a Y batch, look at their shape: batch size by 1 by 28 by 28. 00:10:47.560 |
All fast AI data bunches have a show batch, which will show you what's in it in some sensible 00:10:55.960 |
Okay, so that's a quick walkthrough of the data block API stuff to grab our data. 00:11:02.120 |
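For reference, here is a rough sketch of that whole data block pipeline in one place, written against the fastai v1 API used in the lesson; the variable names and the exact transform arguments are assumptions, so check the lesson notebook for the real values.

```python
from fastai.vision import *

# build the item list from the MNIST folder (grayscale, so convert_mode='L')
il = ImageItemList.from_folder(path, convert_mode='L')
defaults.cmap = 'binary'                     # default colormap for showing 1-channel images

sd = il.split_by_folder(train='training', valid='testing')   # "testing" is really a validation set
ll = sd.label_from_folder()                                  # one folder per class

tfms = ([*rand_pad(padding=3, size=28, mode='zeros')], [])   # random pad + crop for train, nothing for valid
ll = ll.transform(tfms)

data = ll.databunch(bs=128).normalize()      # no stats passed: normalizes from a random batch
```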
So let's start out creating a simple CNN, a simple convnet. 00:11:12.960 |
So let's define -- I like to define when I'm creating architectures a function which kind 00:11:18.700 |
of does the things that I do again and again and again. 00:11:20.880 |
I don't want to have to type out the same arguments each time, because I'll forget, I'll make a mistake. 00:11:24.520 |
So in this case, all of my convolutions are going to be kernel size three, stride two, padding one. 00:11:31.480 |
So let's just create a simple function to do a conv with those parameters. 00:11:34.720 |
So each time I have a convolution, it's skipping over one pixel; it's jumping two pixels at a time. 00:11:45.240 |
So that means that each time we have a convolution, it's going to halve the grid size. 00:11:50.100 |
So I've put a comment here showing what the new grid size is after each one. 00:11:54.960 |
So after the first convolution, we have one channel coming in, because remember it's a grayscale image. 00:12:02.440 |
And then how many channels coming out, whatever you like, right? 00:12:06.320 |
So remember you always get to pick how many filters you create regardless of whether it's 00:12:11.560 |
a fully connected layer, in which case it's just the width of the matrix you're multiplying 00:12:16.760 |
by, or in this case with a 2D conv, it's just how many filters do you want. 00:12:26.520 |
So the 28 by 28 image is now a 14 by 14 feature map with eight channels. 00:12:33.460 |
So specifically, therefore, it's an eight by 14 by 14 tensor of activations. 00:12:41.280 |
Then we'll do batch norm, then we'll do relu. 00:12:43.700 |
So the number of input filters to the next conv has to equal the number of output filters 00:12:47.960 |
from the previous conv, and we can just keep increasing the number of channels. 00:12:53.360 |
Because we're doing stride two, it's going to keep decreasing the grid size. 00:12:58.040 |
Notice here it goes from seven to four, because if you're doing a stride two conv over seven, 00:13:04.260 |
it's going to be kind of math.ceiling of seven divided by two. 00:13:11.600 |
Batch norm, relu, conv, we're now down to two by two. 00:13:15.200 |
Batch norm, relu, conv, we're now down to one by one. 00:13:18.960 |
So after this, we have a feature map of, let's see, ten by one by one. 00:13:40.240 |
It's a rank three tensor of ten by one by one. 00:13:46.120 |
So our loss functions expect generally a vector, not a rank three tensor. 00:13:51.800 |
So you can chuck flatten at the end, and flatten just means remove any unit axes. 00:13:59.360 |
So that will make it now just a vector of length ten, which is what we always expect. 00:14:10.260 |
So then we can return that into a learner by passing in the data and the model and the 00:14:14.960 |
loss function and, if optionally, some metrics. 00:14:19.260 |
So we're going to use cross entropy as usual. 00:14:22.240 |
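As a rough sketch (closely following the lesson notebook, though the exact channel counts are from memory and should be treated as assumptions), the model and learner look something like this:

```python
from fastai.vision import *

def conv(ni, nf):
    # kernel size 3, stride 2, padding 1: halves the grid size each time
    return nn.Conv2d(ni, nf, kernel_size=3, stride=2, padding=1)

model = nn.Sequential(
    conv(1, 8),                      # 28x28 -> 14x14
    nn.BatchNorm2d(8), nn.ReLU(),
    conv(8, 16),                     # 7x7
    nn.BatchNorm2d(16), nn.ReLU(),
    conv(16, 32),                    # 4x4
    nn.BatchNorm2d(32), nn.ReLU(),
    conv(32, 16),                    # 2x2
    nn.BatchNorm2d(16), nn.ReLU(),
    conv(16, 10),                    # 1x1
    nn.BatchNorm2d(10),
    Flatten()                        # 10x1x1 -> vector of length 10
)

learn = Learner(data, model, loss_func=nn.CrossEntropyLoss(), metrics=accuracy)
```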
So we can then call learn.summary and confirm. 00:14:24.700 |
After that first conv, we're down to 14 by 14. 00:14:28.480 |
And after the second conv, 7 by 7 and 4 by 4, 2 by 2, 1 by 1. 00:14:36.160 |
The flatten comes out calling it a lambda, but that, as you can see, gets rid of the 00:14:40.040 |
one by one, and it's now just a length ten vector for each item in the batch. 00:14:46.500 |
So a 128 by 10 matrix for the whole mini batch. 00:14:51.440 |
So just to confirm that this is working okay, we can grab that mini batch of X that we created 00:14:59.280 |
That's our mini batch of X. Pop it onto the GPU and call the model directly. 00:15:05.320 |
Remember any PyTorch module we can pretend it's a function. 00:15:09.820 |
And that gives us back, as we hoped, a 128 by 10 result. 00:15:14.880 |
So that's how you can directly get some predictions out. 00:15:29.040 |
And this is trained from scratch, of course, it's not pre-trained, we literally created 00:15:32.520 |
our own architecture, it's about the simplest possible architecture you can imagine, 18 00:15:37.440 |
So that's how easy it is to create a pretty accurate digit detector. 00:15:42.800 |
So let's refactor that a little rather than saying Conv Batch Norm Relu all the time. 00:15:51.520 |
Fast AI already has something called Conv_Layer, which lets you create Conv Batch Norm Relu 00:16:00.240 |
And it has various other options to do other tweaks to it, but the basic version is just 00:16:06.520 |
So we can refactor that like so, so that's exactly the same neural net. 00:16:14.280 |
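That refactor looks roughly like this (a sketch, assuming the same channel counts as before); fastai's conv_layer bundles the conv, ReLU and batch norm into one call:

```python
from fastai.vision import *

def conv2(ni, nf):
    return conv_layer(ni, nf, stride=2)   # conv + ReLU + batch norm, ks=3 by default

model = nn.Sequential(
    conv2(1, 8),    # 14x14
    conv2(8, 16),   # 7x7
    conv2(16, 32),  # 4x4
    conv2(32, 16),  # 2x2
    conv2(16, 10),  # 1x1
    Flatten()
)
```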
And so let's just train it a little bit longer, and it's now 99.1% accurate. 00:16:28.240 |
Well what we really want to do is create a deeper network. 00:16:33.980 |
And a very easy way to create a deeper network would be, after every stride two conv, 00:16:40.280 |
add a stride one conv, because the stride one conv doesn't change the feature map size at all. 00:16:54.280 |
And the problem was pointed out in this paper, very, very, very influential paper, called 00:16:58.680 |
Deep Residual Learning for Image Recognition by Kaiming He and colleagues, then at Microsoft Research. 00:17:11.520 |
Let's just look at the training error of a network trained on CIFAR-10. 00:17:17.920 |
And let's try one network with 20 layers, just basic three by three Convs, just basically 00:17:22.960 |
the same network I just showed you, but without batch norm. 00:17:28.800 |
So they trained a 20 layer one and a 56 layer one on the training set. 00:17:34.760 |
So the 56 layer one has a lot more parameters. 00:17:37.040 |
It's got a lot more of these stride one Convs in the middle. 00:17:40.560 |
So the one with more parameters should seriously overfit. 00:17:45.060 |
So you would expect the 56 layer one to zip down to zero-ish training error pretty quickly, but it doesn't; it actually ends up with a worse training error than the 20 layer one. 00:17:55.760 |
So when you see something weird happen, really good researchers don't go, oh no, it's not working; they ask why. 00:18:11.600 |
And Kaiming He said, I don't know, but what I do know is this. 00:18:16.520 |
I could take this 56 layer network and make a new version of it, which is identical, but 00:18:24.320 |
has to be at least as good as the 20 layer network. 00:18:29.920 |
Every two convolutions, I'm going to take the input to those two convolutions and add it 00:18:38.840 |
together with the result of those two convolutions. 00:18:44.700 |
So in other words, he's saying, instead of saying output equals conv2 of conv1 of x, instead, 00:18:59.680 |
he's saying output equals x plus conv2 of conv1 of x. 00:19:09.720 |
So his theory was that this network, with its 56 layers worth of convolutions, has to be at least as good 00:19:20.800 |
as the 20 layer version, because it could always just set conv2 and conv1 to a bunch of zero 00:19:28.480 |
weights for everything except the first 20 layers, because the x, the input, could just go straight through. 00:19:38.000 |
So this thing here is, as you see, called an identity connection. 00:19:51.400 |
That's the intuition the paper describes: what would happen if we created 00:19:57.400 |
something which has to train at least as well as a 20 layer neural network, because it kind of contains that 20 layer network. 00:20:05.280 |
There's literally a path where you can just skip over all the convolutions. 00:20:13.760 |
And what happened was he won ImageNet that year. 00:20:20.280 |
And in fact, even today, when we had that record-breaking result on ImageNet speed training ourselves, we used a ResNet. 00:20:39.720 |
If you're interested in doing some research, some novel research, any time you find some 00:20:45.680 |
model for anything, whether it's medical image segmentation or some kind of GAN or whatever, 00:20:54.120 |
and it was written a couple of years ago, they might have forgotten to put ResNets in. 00:20:59.960 |
This, by the way, is what we normally call a res block. 00:21:04.080 |
They might have forgotten to put resblocks in. 00:21:05.920 |
So replace their convolutional path with a bunch of resblocks, and you'll almost always 00:21:17.040 |
So at NeurIPS, which Rachel and I and David all just came back from and Sylvain, we saw 00:21:24.200 |
a new presentation where they actually figured out how to visualize the loss surface of a neural net. 00:21:35.960 |
And anybody who's watching this, lesson seven, is at a point where they will understand most 00:21:41.840 |
of the most important concepts in this paper. 00:21:45.760 |
You won't necessarily get all of it, but I'm sure you'll get enough to find it interesting. 00:21:54.000 |
Here's what happens if you draw a picture where X and Y are two projections of the weight space, and the height is the loss. 00:22:04.940 |
And so as you move through the weight space, a 56 layer neural network without skip connections has a really bumpy loss surface, full of hills and valleys. 00:22:13.680 |
And that's why this got nowhere, because it just got stuck in all these hills and valleys. 00:22:21.400 |
The exact same network with identity connections, with skip connections, has this much smoother loss landscape. 00:22:29.540 |
So it's kind of interesting how Kaiming He recognized back in 2015 that this shouldn't happen, and here's a way to fix it. 00:22:41.240 |
And it took three years before people were able to say, oh, this is kind of why it fixed 00:22:48.480 |
With the batch norm discussion we had a couple of weeks ago, people realizing a little bit 00:22:53.360 |
after the fact sometimes what's going on and why it helps. 00:22:57.860 |
So in our code, we can create a res block in just the way I described. 00:23:10.200 |
We create an nn.Module, we create two conv layers 00:23:14.120 |
(where a conv layer is conv2d, ReLU, batch norm). 00:23:23.800 |
So create two of those, and then in forward we go conv1 of x, conv2 of that, and then add x. 00:23:33.560 |
There's a res block function already in fast AI. 00:23:36.520 |
So you can just call res_block instead, and you just pass in something saying how many filters you want. 00:23:46.000 |
So there's the res block that I defined in our notebook. 00:23:51.160 |
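A minimal sketch of that res block, assuming fastai's conv_layer is available (the built-in res_block(nf) does essentially the same thing):

```python
from fastai.vision import *

class ResBlock(nn.Module):
    def __init__(self, nf):
        super().__init__()
        self.conv1 = conv_layer(nf, nf)   # stride 1, so the grid size is unchanged
        self.conv2 = conv_layer(nf, nf)

    def forward(self, x):
        # identity (skip) connection: add the input back onto the result of the two convs
        return x + self.conv2(self.conv1(x))
```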
And so with that res block we can now take every one of those, I've just copied the previous 00:23:57.000 |
CNN, and after every conv2, except the last one, I added a res block. 00:24:03.240 |
So this has now got three times as many layers. 00:24:16.960 |
Since I go conv2, res block so many times, let's just pop that into a mini sequential 00:24:23.160 |
model here, and so I can refactor that like so. 00:24:26.840 |
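The refactored version, as a sketch: a little sequential unit pairing each stride-2 conv with a res block (conv2 is the helper defined earlier; the channel counts are assumptions).

```python
from fastai.vision import *

def conv_and_res(ni, nf):
    return nn.Sequential(conv2(ni, nf), res_block(nf))   # res_block is fastai's built-in

model = nn.Sequential(
    conv_and_res(1, 8),
    conv_and_res(8, 16),
    conv_and_res(16, 32),
    conv_and_res(32, 16),
    conv2(16, 10),
    Flatten()
)
```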
Keep refactoring your architectures if you're trying novel architectures, because you'll make fewer mistakes. 00:24:32.880 |
Most research code you look at is clunky as all hell, and people often make mistakes in it. 00:24:42.760 |
So use your coding skills to make life easier. 00:25:05.120 |
So that's interesting because we've trained this literally from scratch with an architecture 00:25:14.200 |
It was just the first thing that came to mind. 00:25:17.520 |
But in terms of where that puts us, 0.45% error is around about the state of the art 00:25:24.360 |
for this data set as of three or four years ago. 00:25:28.600 |
Now, you know, today MNIST is considered a kind of trivially easy data set, so I'm not 00:25:34.760 |
saying, like, wow, we've broken some records here. 00:25:40.040 |
But what I'm saying is that, you know, this kind of ResNet is a genuinely 00:25:49.720 |
extremely useful network still today, and it's really all we use in our fast ImageNet training. 00:25:57.080 |
And one of the reasons as well is that it's so popular, so the vendors of the library 00:26:02.580 |
spend a lot of time optimizing it, so things tend to work fast, whereas some more modern-style 00:26:10.760 |
architectures using things like separable or grouped convolutions tend not to actually train as quickly in practice. 00:26:19.580 |
If you look at the definition of ResBlock in the fast AI code, you'll see it looks a little bit different from this. 00:26:27.740 |
And that's because I've created something called a merge layer. 00:26:30.800 |
And a merge layer is something which, in the forward (just skip dense for a moment), goes x plus x.orig. 00:26:41.240 |
So you can see there's something ResNet-ish going on here. 00:26:45.800 |
Well, if you create a special kind of sequential model called a sequential EX -- so this is 00:26:51.120 |
like fast AI's sequential extended -- it's just like a normal sequential model, but we make the block's original input available to every layer as x.orig. 00:27:01.400 |
And so this here, sequential EX, conv layer, conv layer, merge layer, will do exactly the same thing as that res block. 00:27:12.600 |
So you can create your own variations of ResNet blocks very easily with just sequential EX and a merge layer. 00:27:22.200 |
So there's something else here, which is when you create your merge layer, you can optionally set dense equals true. 00:27:29.660 |
Well, if you do, it doesn't go x plus x.orig, it goes cat(x, x.orig). 00:27:36.120 |
In other words, rather than putting a plus in this connection, it does a concatenate. 00:27:43.940 |
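A tiny sketch of that plus-versus-concat choice; in fastai this lives in MergeLayer (used inside SequentialEx, which stashes the block's original input as x.orig), but the core idea is just this:

```python
import torch

def merge(x, orig, dense=False):
    # plus -> ResNet-style residual block
    # concat along the channel axis -> DenseNet-style dense block
    return torch.cat([x, orig], dim=1) if dense else x + orig
```

So something like SequentialEx(conv_layer(nf, nf), conv_layer(nf, nf), MergeLayer(dense=True)) would, as far as I can tell, give you the dense-style version of the block.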
So that's pretty interesting, because what happens is that you have your input coming into the block. 00:27:53.720 |
And once you use concatenate instead of plus, it's not called a Res block anymore, it's 00:27:58.080 |
called a dense block, and it's not called a ResNet anymore, it's called a dense net. 00:28:02.600 |
So the dense net was invented about a year after the ResNet. 00:28:07.520 |
And if you read the dense net paper, it can sound incredibly complex and different, but 00:28:11.240 |
actually it's literally identical, but plus here is replaced with cat. 00:28:17.320 |
So you have your input coming into your dense block, right, and you've got a few convolutions 00:28:22.160 |
in here, and then you've got some output coming out, and then you've got your identity connection. 00:28:28.440 |
And remember, it doesn't plus, it concats, so if this is the channel axis, it gets a little bit bigger. 00:28:35.840 |
And then so we do another dense block, and at the end of that we have all of this coming in: 00:28:44.160 |
the result of the convolutions as per usual, but this time also the original input and every intermediate result, concatenated together. 00:28:57.400 |
So you can see that what happens is that with dense blocks it's getting bigger and bigger 00:29:01.600 |
and bigger, and kind of interestingly the exact input is still here, right? 00:29:09.160 |
So actually, no matter how deep you get, the original input pixels are still there, and 00:29:14.040 |
the original layer one features are still there, and the original layer two features 00:29:17.880 |
So as you can imagine, dense nets are very memory intensive. 00:29:24.040 |
There are ways to manage this, from time to time you can have a regular convolution that 00:29:28.920 |
squishes your channels back down, but they are memory intensive. 00:29:36.600 |
So for dealing with small data sets, you should definitely experiment with dense blocks and dense nets. 00:29:45.480 |
They tend to work really well on small data sets. 00:29:49.160 |
Also, because it's possible to kind of keep those original input pixels all the way down 00:29:53.480 |
the path, they work really well for segmentation, right? 00:29:56.920 |
Because for segmentation, you kind of want to be able to reconstruct the original resolution 00:30:03.140 |
of your picture, so having all of those original pixels still there is super helpful. 00:30:14.640 |
So that's res nets, and one of the main reasons, other than the fact that res nets are awesome, 00:30:22.520 |
to tell you about them, is that these skip connections are useful in other places as well, 00:30:28.640 |
and they're particularly useful in other places and other ways of designing architectures 00:30:35.280 |
So in building this lesson, I always kind of, I keep trying to take old papers and saying, 00:30:43.000 |
like I'm mentioning, what would that person have done if they had access to all the modern techniques we have now? 00:30:48.400 |
And I try to kind of rebuild them in a more modern style. 00:30:51.600 |
So I've been rebuilding this next architecture we're going to look at, called a unet, in that way. 00:30:59.260 |
And I got to the point now, I keep showing you this semantic segmentation paper with 00:31:06.760 |
the state of the art for CAMVID, which was 91.5. 00:31:11.120 |
This week I got it up to 94.1 using the architecture I'm about to show you. 00:31:17.540 |
So we keep pushing this further and further and further. 00:31:21.440 |
And it really was all about adding all of the modern tricks, many of which I'll show 00:31:30.400 |
you today, some of which we'll see in part two. 00:31:35.040 |
So what we're going to do to get there is we're going to use this UNET. 00:31:39.620 |
So we've used a UNET before, I've improved it a bit since then. 00:31:45.480 |
So we've used a unet before; we used it when we did the CAMVID segmentation, but we didn't look at the details. 00:31:51.260 |
So we're now in a position where we can understand what it was doing. 00:31:58.900 |
And so the first thing we need to do is kind of understand the basic idea of how you can 00:32:06.800 |
So if we go back to our CAMVID notebook, in our CAMVID notebook you'll remember that basically 00:32:17.000 |
what we were doing is we were taking these photos and adding a class to every single pixel. 00:32:23.920 |
And so when you go data.showbatch for something which is a segmentation item list, it will 00:32:30.240 |
automatically show you these color-coded pixels. 00:32:38.680 |
In order to color-code this as a pedestrian, but this as a bicyclist, it needs to know 00:32:48.780 |
It needs to actually know that's what a pedestrian looks like, and it needs to know that's exactly 00:32:52.320 |
where the pedestrian is, and this is the arm of the pedestrian and not part of their shopping bag. 00:32:56.440 |
It needs to really understand a lot about this picture to do this task. 00:33:01.940 |
And it really does do this task, like when you look at the results of our top model, 00:33:10.040 |
I can't see a single wrong pixel by looking at it by eye. 00:33:13.720 |
I know there's a few wrong, but I can't see the ones that are wrong, it's that accurate. 00:33:19.640 |
So the way that we're doing it to get these really, really good results is, not surprisingly, transfer learning. 00:33:29.260 |
So we start with a ResNet-34, and you can see that here: unet_learner(data, models.resnet34). 00:33:40.920 |
And if you don't say pre-trained equals false, by default, you get pre-trained equals true, 00:33:48.860 |
So we start with a ResNet-34, which starts with a big image. 00:33:57.360 |
So in this case, this is from the unet paper now. 00:33:59.960 |
Their images started with one channel by 572 by 572. 00:34:08.500 |
So after your stride 2 conv, they're doubling the number of channels to 128, and they're 00:34:15.200 |
halving the size, so they're now down to 280 by 280. 00:34:19.740 |
In this original unet paper, they didn't add any padding, so they lost a pixel on each side with every convolution. 00:34:27.960 |
So basically half the size, and then half the size, and then half the size, and then 00:34:33.000 |
half the size, until they're down to 28 by 28, with 1024 channels. 00:34:39.760 |
So that's what the unet's downsampling path (this is called the downsampling path) looks like. 00:35:01.080 |
So you can see that the size keeps halving, channels keep going up, and so forth. 00:35:08.160 |
So eventually, you've got down to a point where, if you use the unet architecture, it's 00:35:13.400 |
28 by 28 with 1024 channels; with a ResNet architecture with a 224 pixel input, it would be 7 by 7 with 512 channels. 00:35:24.720 |
So it's a pretty small grid size on this feature map. 00:35:29.480 |
Somehow we've got to end up with something which is the same size as our original picture. 00:35:38.840 |
How do you do computation which increases the grid size? 00:35:44.800 |
Well, we don't have a way to do that in our current bag of tricks. 00:35:49.340 |
We can use a stride 1 conv to do computation and keep grid size, or a stride 2 conv to do computation and halve the grid size. 00:36:00.440 |
To increase the grid size, we do a stride half conv, also known as a deconvolution, also known as a transposed convolution. 00:36:11.320 |
There is a fantastic paper called A Guide to Convolution Arithmetic for Deep Learning that 00:36:16.240 |
shows a great picture of exactly what does a 3 by 3 kernel stride half conv look like. 00:36:24.300 |
If you have a 2 by 2 input, so the blue squares are the 2 by 2 input, you add not only two 00:36:32.040 |
pixels of padding all around the outside, but you also add a pixel of padding between every pixel. 00:36:43.440 |
And so now if we put this 3 by 3 kernel here and then here and then here, you see how the 00:36:49.160 |
3 by 3 kernel is just moving across it in the usual way? 00:36:52.160 |
You will end up going from a 2 by 2 output to a 5 by 5 output. 00:36:59.000 |
So if you only added one pixel of padding around the outside, you would end up with a smaller output. 00:37:10.600 |
So this is how you can increase the resolution. 00:37:18.240 |
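In PyTorch that stride-half picture corresponds to nn.ConvTranspose2d; a quick sketch to confirm the shapes described above:

```python
import torch
import torch.nn as nn

# 3x3 kernel, stride 2 "transposed" conv: a 2x2 input becomes a 5x5 output
deconv = nn.ConvTranspose2d(in_channels=1, out_channels=1, kernel_size=3, stride=2)
x = torch.randn(1, 1, 2, 2)
print(deconv(x).shape)   # torch.Size([1, 1, 5, 5])
```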
This was the way people did it until maybe a year or two ago. 00:37:27.200 |
Here's another trick for improving things you find online, because this is actually a dumb way to do it. 00:37:32.880 |
And it's kind of obvious it's a dumb way to do it for a couple of reasons. 00:37:35.440 |
One is that, have a look at this, nearly all of those pixels are white. 00:37:49.320 |
Also, when you get down to that 3 by 3 area, 2 out of the 9 pixels are non-white, while in other positions it's a different number. 00:38:03.560 |
So there's different amounts of information going into different parts of your convolution. 00:38:09.120 |
So it just doesn't make any sense to kind of throw away information like this and to 00:38:15.280 |
do all this unnecessary computation and have different parts of the convolution having 00:38:22.240 |
So what people generally do nowadays is something really simple, which is if you have, let's 00:38:29.320 |
say, a 2 by 2 input, these are your pixel values, A, B, C, and D, and you want to create a 4 by 4, you just copy each of those pixels four times: 00:38:46.160 |
A, A, A, A, B, B, B, B, C, C, C, C, D, D, D, D. 00:38:58.960 |
I haven't done any interesting computation, but now on top of that, I could just do a 00:39:05.360 |
stride 1 convolution, and now I have done some computation. 00:39:09.240 |
So that up sample is called nearest neighbor interpolation. 00:39:20.760 |
So you can just do, and that's super fast, which is nice, so you can do a nearest neighbor 00:39:24.480 |
interpolation and then a stride 1 conv, and now you've got some computation, which is 00:39:29.840 |
actually kind of using, you know, there's no zeros here. 00:39:34.280 |
This is kind of nice because it gets a mixture of A's and B's, which is kind of what you want. 00:39:40.920 |
Another approach is instead of using nearest neighbor interpolation, you can use bilinear 00:39:45.040 |
interpolation, which basically means instead of copying A to all those different cells, 00:39:50.440 |
you take a kind of a weighted average of the cells around it. 00:39:53.840 |
So for example, if you were, you know, looking at what should go here, you would kind of 00:40:00.440 |
go like, oh, it's about 3 A's, 2 C's, 1 D, and 2 B's, and you could have taken the average. 00:40:07.800 |
Not exactly, but roughly just a weighted average. 00:40:10.920 |
Bilinear interpolation you'll find all over the place; it's a pretty standard technique. 00:40:16.280 |
Any time you look at a picture on your computer screen and change its size, it's doing bilinear interpolation. 00:40:26.200 |
So that was what people were using, well, that's what people still tend to use. 00:40:31.360 |
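As a sketch, the upsample-then-conv idea looks like this in plain PyTorch (64 channels is just an arbitrary assumption; swap mode='nearest' for mode='bilinear' to get the weighted-average version):

```python
import torch
import torch.nn as nn

up = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='nearest'),           # copy each pixel into a 2x2 block
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1)  # then do some real computation
)
x = torch.randn(1, 64, 28, 28)
print(up(x).shape)   # torch.Size([1, 64, 56, 56])
```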
That's as much as I'm going to teach you this part. 00:40:34.360 |
In part two, we'll actually learn what the FastAI library is actually doing behind the 00:40:39.120 |
scenes, which is something called a pixel shuffle, also known as sub-pixel convolutions. 00:40:44.900 |
It's not dramatically more complex, but complex enough that I won't cover it today. 00:40:49.800 |
All of these things are ways of basically letting us do a convolution that 00:40:54.760 |
ends up with something that's twice the size. 00:41:02.640 |
So that lets us go from 28 by 28 to 56 by 56 and keep on doubling the size. On its own, though, that doesn't give great results, 00:41:22.760 |
which is not surprising, because in this 28 by 28 feature map, how the hell is it going 00:41:28.680 |
to have enough information to reconstruct a 572 by 572 output space? 00:41:37.800 |
So you tended to end up with these things that lacked fine detail. 00:41:45.120 |
So what Olaf Ronneberger et al. did was they said, hey, let's add a skip connection, an identity connection. 00:41:58.300 |
And amazingly enough, this was before resnets existed. 00:42:08.080 |
But rather than adding a skip connection that skipped every two convolutions, they added skip connections that cross from one side of the network to the other. 00:42:17.680 |
In other words, they added a skip connection from the same part of the downsampling path 00:42:22.440 |
to the same sized bit in the upsampling path. 00:42:28.320 |
That's why you can see the white and the blue next to each other. 00:42:33.440 |
So basically these are like dense blocks, right? 00:42:36.880 |
But the skip connections are skipping over larger and larger amounts of the architecture. 00:42:42.960 |
So that over here, you've literally got nearly the input pixels themselves coming into the end of the upsampling path. 00:42:55.920 |
And so that's going to make it super handy for resolving the fine details in these segmentation 00:43:00.960 |
tasks because you've literally got all of the fine details. 00:43:04.600 |
On the downside, you don't have very many layers of computation going on here, just four. 00:43:11.480 |
So you better hope that by that stage, you've done all the computation necessary to figure 00:43:15.760 |
out, is this a bicyclist or is this a pedestrian? 00:43:18.920 |
But you can then add on top of that something saying like, is this exact pixel where their 00:43:23.800 |
nose finishes or is that the start of the tree? 00:43:39.600 |
So if you look at the fastai unet code, the key thing that comes in is the encoder. 00:43:56.480 |
In most cases, unets have this specific older-style architecture as the encoder. 00:44:01.040 |
But like I said, replace any older-style architecture bits with ResNet bits and life improves, so that's what we did. 00:44:11.080 |
So the layers of our unet are an encoder, then batch norm, then ReLU, and then a middle conv, which is just two conv layers. 00:44:20.920 |
Remember conv layer is a conv ReLU batch norm in FastAI. 00:44:26.680 |
And so the middle conv is these two extra steps here at the bottom, just doing a little bit 00:44:34.360 |
It's kind of nice to add more layers of computation where you can. 00:44:38.960 |
So encoder, batch norm, ReLU, and then two convolutions. 00:44:48.160 |
But basically, we figure out what the layer number is where each of these stride 00:44:55.280 |
two convs occurs, and we just store them in an array of indexes. 00:44:59.480 |
So then we can loop through that, and we can basically say for each one of those points, 00:45:04.600 |
create a unet block, telling it how many up-sampling channels there are and how many cross-connection channels. 00:45:11.400 |
These things here are called cross-connections, or at least that's what I call them. 00:45:16.920 |
So that's really where the main work's going on, in the unet block. 00:45:22.720 |
As I said, there's quite a few tweaks we do, as well as the fact we use a much better encoder. 00:45:27.280 |
We also use some tweaks in all of our up-sampling using this pixel shuffle. 00:45:34.200 |
And then another tweak, which I just did in the last week, is to not just take the result 00:45:39.040 |
of the convolutions and pass it across, but we actually grab the input pixels and make them a cross-connection too. 00:45:48.000 |
You can see we're literally appending a res block with the original inputs. 00:45:57.000 |
So really all the work's going on in the unet block, and the unet block has to store the activations from the downsampling path. 00:46:08.120 |
And the way to do that, as we learned in the last lesson, is with hooks. 00:46:13.140 |
So we put hooks into the ResNet-34 to store the activations each time there's a stride 2 conv. 00:46:25.760 |
And we grab the result of the stored value in that hook, and we literally just go torch.cat, 00:46:32.120 |
so we concatenate the up-sampled convolution with the result of the hook, which we chuck 00:46:44.440 |
through batch norm, and then we do two convolutions to it. 00:46:48.560 |
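Here is a very rough sketch of what one of those unet blocks does in its forward; it's a simplification of fastai's UnetBlock, with made-up names and channel bookkeeping, so treat it as an illustration rather than the real implementation.

```python
import torch
import torch.nn as nn
from fastai.vision import *

class UnetBlockSketch(nn.Module):
    def __init__(self, up_in_c, skip_c, out_c):
        super().__init__()
        # upsample the low-resolution activations (fastai actually uses pixel shuffle here)
        self.up = nn.Sequential(nn.Upsample(scale_factor=2, mode='nearest'),
                                nn.Conv2d(up_in_c, up_in_c // 2, 3, padding=1))
        self.bn = nn.BatchNorm2d(skip_c)
        self.conv1 = conv_layer(up_in_c // 2 + skip_c, out_c)
        self.conv2 = conv_layer(out_c, out_c)

    def forward(self, up_in, hooked):
        # hooked: activations stored by a hook on the matching stride-2 layer
        # of the downsampling path (the cross-connection), same spatial size
        x = self.up(up_in)
        x = torch.cat([x, self.bn(hooked)], dim=1)
        return self.conv2(self.conv1(x))
```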
And actually, you know, something you could play with at home is pretty obvious here. 00:46:53.640 |
Any time you see two convolutions like this, there's an obvious question: what if we used a ResNet block instead? 00:46:59.020 |
So you could try replacing those two convs with a ResNet block. 00:47:04.000 |
And then the kind of things I look for when I look at an architecture is like, oh, two 00:47:08.840 |
convs in a row probably should be a ResNet block. 00:47:16.260 |
So that's UNET, and it's amazing to think it preceded ResNet, it preceded DenseNet. 00:47:29.240 |
It wasn't even published in a major machine learning venue. 00:47:32.640 |
It was actually published at MICCAI, which is a specialized medical image computing conference. 00:47:39.600 |
For years, actually, it was largely unknown outside of the medical imaging community. 00:47:44.420 |
And actually, what happened was Kaggle competitions for segmentation kept on being easily won by people using unets. 00:47:52.240 |
And that was the first time I saw it getting noticed outside the medical imaging community. 00:47:56.080 |
And then, gradually, a few people in the academic machine learning community started noticing, 00:48:00.480 |
and now everybody loves UNET, which I'm glad, because it's just awesome. 00:48:09.320 |
So identity connections, regardless of whether they're a plus style or a concat style, are super useful. 00:48:20.120 |
They can basically get us close to the state of the art on lots of important tasks. 00:48:31.440 |
And so the next task I want to look at is image restoration. 00:48:36.120 |
So image restoration refers to starting with an image, and this time, we're not going to 00:48:41.240 |
create a segmentation mask, but we're going to try and create a better image. 00:48:47.440 |
And there are lots of versions of better; the output could even be quite a different image. 00:48:50.680 |
So the kind of things we can do with this image generation would be take a low res image, 00:48:55.800 |
make it high res, take a black and white image, make it color, take an image where something's 00:49:01.480 |
being cut out of it and try and replace the cut out thing, take a photo and try and turn 00:49:07.160 |
it into what looks like a line drawing, take a photo and try and make it look like a Monet painting. 00:49:12.240 |
These are all examples of kind of image to image generation tasks, which you'll know how to do after this lesson. 00:49:21.040 |
So in our case, we're going to try to do image restoration, which is going to start with 00:49:27.600 |
low resolution, poor quality JPEGs with writing written over the top of them, and turn them 00:49:35.520 |
into high resolution, good quality pictures in which the text has been removed. 00:49:51.680 |
Why do you concat before calling conv2, conv1, not after? 00:50:00.320 |
Because if you did your convs before you concat, then there's no way 00:50:06.440 |
for the channels of the two parts to interact with each other. 00:50:11.240 |
You don't get any interaction between them. So remember, in a 2D conv, it's really 3D, right? 00:50:18.320 |
It's moving across two dimensions, but in each case, it's doing a dot product of all 00:50:25.720 |
three dimensions of a rank 3 tensor, row by column by channel. 00:50:30.440 |
So generally speaking, we want as much interaction as possible. 00:50:35.000 |
We want to say this part of the downsampling path and this part of the upsampling path, 00:50:40.520 |
if you look at the combination of them, you find these interesting things. 00:50:43.480 |
So generally, you want to have as many interactions going on as possible in each computation that you do. 00:50:55.480 |
How does concatenating every layer together in a dense net work when the size of the image is changing through the layers? 00:51:07.760 |
So, if you have a stride 2 conv, you can't keep dense netting. 00:51:14.920 |
That's what actually happens in a dense net, is you kind of go like dense block growing, 00:51:19.040 |
dense block growing, dense block growing, so you're getting more and more channels. 00:51:22.000 |
And then you do a stride 2 conv without a dense block. 00:51:29.600 |
And then you just do a few more dense blocks and then it's gone. 00:51:32.280 |
So in practice, a dense block doesn't actually keep all the information all the way through, 00:51:38.920 |
but just up until each of these stride 2 convs. 00:51:45.400 |
And there are various ways of doing these bottlenecking layers, where you're basically squishing the channels back down. 00:51:52.160 |
It also helps us keep memory under control, because at that point we can decide how many channels to keep. 00:52:00.320 |
So, in order to create something which can turn crappy images into nice images, we need 00:52:09.960 |
a data set containing nice versions of images and crappy versions of the same images. 00:52:15.080 |
So the easiest way to do that is to start with some nice images and Crapify them. 00:52:20.000 |
And so the way to Crapify them is to create a function called Crapify, which contains 00:52:27.200 |
So my Crapification logic, you can pick your own, is that I open up my nice image, I resize 00:52:34.360 |
it to be really small, 96 by 96 pixels, with bilinear interpolation, I then pick a random 00:52:42.400 |
number between 10 and 70, I draw that number into my image at some random location, and 00:52:51.720 |
then I save that image with a JPEG quality of that random number. 00:52:56.760 |
And a JPEG quality of 10 is like absolute rubbish. 00:53:06.120 |
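A sketch of that crappification logic with PIL; path_hr and path_lr (the source and destination folders) are assumptions, and you can obviously swap in whatever damage you like.

```python
import random
from PIL import Image, ImageDraw

def crappify(fn, i):
    dest = path_lr / fn.relative_to(path_hr)
    dest.parent.mkdir(parents=True, exist_ok=True)
    img = Image.open(fn).convert('RGB')
    img = img.resize((96, 96), resample=Image.BILINEAR)        # shrink it right down
    q = random.randint(10, 70)                                  # random JPEG quality / number to draw
    ImageDraw.Draw(img).text((random.randint(0, 60), random.randint(0, 60)),
                             str(q), fill=(255, 255, 255))      # scribble the number on top
    img.save(dest, quality=q)                                   # low quality = heavy JPEG artifacts
```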
So I end up with high quality images, low quality images that look something like these. 00:53:15.340 |
And so you can see this one, you know, there's the image. 00:53:18.760 |
And this is after transformations, that's why it's been flipped. 00:53:22.520 |
And you won't always see the number, because we're zooming into them. 00:53:26.360 |
So a lot of the time the number is cropped out. 00:53:30.600 |
So yeah, it's trying to figure out how to take this incredibly JPEG artifacty thing with 00:53:34.800 |
text written over the top and turn it into this. 00:53:38.240 |
So I'm using the Oxford Pets dataset, again, the same one we used in lesson one. 00:53:43.220 |
So there's nothing more high quality than pictures of dogs and cats, I think we can all agree. 00:53:49.400 |
The Crapification process can take a while, but fast.ai has a function called parallel. 00:53:56.200 |
And if you pass parallel a function name and a list of things to run that function on, 00:54:01.580 |
it will run that function on them all in parallel. 00:54:12.040 |
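Usage is roughly (assuming il is an item list over the high-res folder):

```python
from fastai.vision import *

il = ImageItemList.from_folder(path_hr)
parallel(crappify, il.items)   # runs crappify(fn, i) over every file, in parallel
```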
The way you write this function is where you get to do all the interesting stuff. 00:54:19.000 |
Try and think of an interesting Crapification which does something that you want to do. 00:54:23.240 |
So if you want to colorize black and white images, you would have crappify convert the image to black and white. 00:54:29.100 |
If you want something which can take large cut out blocks of image and replace them with 00:54:35.820 |
hallucinated image, add a big black box to these. 00:54:40.600 |
If you want something which can take old family photo scans that have been folded up and have 00:54:45.400 |
crinkles in, try and find a way of adding dust prints and crinkles and so forth. 00:54:52.040 |
Anything that you don't include in crappify, your model won't learn to fix, because every 00:54:58.960 |
time it sees that in your photos, the input and output will be the same, so it won't consider it something to fix. 00:55:09.760 |
So we now want to create a model which can take an input photo that looks like that and output one that looks like this. 00:55:19.840 |
So obviously what we want to do is use a unet, because we already know that unets can do 00:55:24.080 |
exactly that kind of thing, and we just need to pass the unet that data. 00:55:30.480 |
So our data is just literally the file names from each of those two folders. 00:55:37.600 |
Do some transforms, databunch, normalize, and use ImageNet stats, because we're going to use a pre-trained model. Why a pre-trained model? 00:55:46.960 |
Well, because if you're going to get rid of this 46, you need to know what probably 00:55:52.080 |
was there, and to know what probably was there, you need to know what this is a picture of. 00:55:55.880 |
Because otherwise, how can you possibly know what it ought to look like? 00:55:59.080 |
So let's use a pre-trained model that knows about these kinds of things. 00:56:12.720 |
These three things are important and interesting and useful, but I'm going to leave them to part two. 00:56:18.460 |
For now, you should always include them when you use a unet for this kind of problem. 00:56:26.920 |
And so now we're going to-- and this whole thing I'm calling a generator. 00:56:30.200 |
It's going to generate-- this is generative modeling. 00:56:34.320 |
There's not a really formal definition, but it's basically something where the thing we're 00:56:37.280 |
outputting is like a real object, in this case, an image. 00:56:44.000 |
So we're going to create a generator learner, which is this unet learner. 00:56:53.480 |
So in other words, what's the mean squared error between the actual pixel value that 00:56:57.360 |
it should be and the pixel value that we predicted? 00:57:06.300 |
So we have a version called MSELossFlat, which simply flattens out those images into a big vector. 00:57:17.360 |
Even if you don't flatten, it'll generally also work fine. 00:57:20.340 |
So we're already down to 0.05 mean squared error on the pixel values, which is not bad, 00:57:29.540 |
Like all things in fast AI, pretty much, because we are doing transfer learning by default, 00:57:34.680 |
when you create this, it'll freeze the pre-trained part. 00:57:40.200 |
And the pre-trained part of a unet is this part, the downsampling part. 00:57:47.220 |
So let's unfreeze that and train a little more. 00:57:55.620 |
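Putting that together, the generator looks roughly like this; the extra flags (blur, weight norm, self attention, y_range) are the things deferred to part two as far as I can tell, and the exact values here are assumptions rather than the definitive settings.

```python
from fastai.vision import *

loss_gen = MSELossFlat()
learn_gen = unet_learner(data_gen, models.resnet34,
                         blur=True, norm_type=NormType.Weight, self_attention=True,
                         y_range=(-3., 3.), loss_func=loss_gen)

learn_gen.fit_one_cycle(2)                      # pre-trained downsampling path is frozen
learn_gen.unfreeze()                            # now train the whole unet
learn_gen.fit_one_cycle(3, slice(1e-6, 1e-3))
```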
So with four minutes of training, we've got something which is basically doing a perfect job of removing the numbers. 00:58:07.600 |
It's certainly not doing a good job of up sampling. 00:58:13.120 |
But it's definitely doing a nice job; sometimes when it removes a number, it maybe leaves a little artifact behind. 00:58:18.360 |
But it's certainly doing something pretty useful. 00:58:21.120 |
And so if all we wanted to do was kind of watermark removal, we'd be finished. 00:58:28.320 |
We're not finished, because we actually want this thing to look more like this thing. 00:58:38.600 |
The problem, the reason that we're not making as much progress with that as we'd like is 00:58:43.400 |
that our loss function doesn't really describe what we want. 00:58:47.160 |
Because actually, the mean squared error between the pixels of this and this is actually very small. 00:58:54.680 |
And if you actually think about it, most of the pixels are very nearly the right color. 00:59:02.520 |
But we're missing the eyeballs entirely, pretty much. 00:59:08.800 |
So we want some loss function that does a better job than pixel mean squared error loss 00:59:16.880 |
of saying like, is this a good quality picture of this thing? 00:59:23.660 |
So there's a fairly general way of answering that question. 00:59:29.560 |
And it's something called a generative adversarial network, or GAN. 00:59:36.720 |
And a GAN tries to solve this problem by using a loss function which actually calls another model. 00:59:52.760 |
So we've got our crappy image, and we've already created a generator. 01:00:12.220 |
And we can compare the high res image to the prediction with pixel MSE. 01:00:20.000 |
We could also train another model, which we would variously call either the discriminator or the critic. 01:00:31.120 |
We could try and build a binary classification model that takes all the pairs of the generated 01:00:37.680 |
image and the real high res image and tries to classify, learn to classify, which is which. 01:00:45.360 |
So look at some picture and say like, hey, what do you think? 01:00:50.320 |
Is that a high res cat or is that a generated cat? 01:00:55.200 |
So just a regular standard binary cross-entropy classifier. 01:01:04.480 |
So if we had one of those, we could now fine-tune the generator. 01:01:11.580 |
And rather than using pixel MSE as the loss, the loss could be: how good are we at fooling the critic? 01:01:19.840 |
So can we create generated images that the critic thinks are real? 01:01:30.840 |
Because if it can do that, if the loss function is am I fooling the critic, then it's going 01:01:36.840 |
to learn to create images which the critic can't tell whether they're real or fake. 01:01:43.760 |
So we could do that for a while, train a few batches, but the critic isn't that great. 01:01:52.280 |
The reason the critic isn't that great is because it wasn't that hard. 01:01:55.160 |
Like these images are really shitty, so it's really easy to tell the difference, right? 01:01:59.320 |
So after we train the generator a little bit more using the critic as the loss function, 01:02:05.680 |
the generator is going to get really good at fooling the critic. 01:02:09.600 |
So now we're going to stop training the generator and we'll train the critic some more on these newly generated images. 01:02:16.800 |
So now that the generator's better, it's now a tougher task for the critic to decide which 01:02:21.600 |
is real and which is fake, so we'll train that a little bit more. 01:02:25.880 |
And then once we've done that, and the critic's now pretty good at recognising the difference 01:02:29.040 |
between the better generated images and the originals, we'll go back and we'll fine-tune 01:02:34.480 |
the generator some more using the better discriminator, the better critic, as the loss function. 01:02:40.040 |
And so we'll just go ping pong, ping pong, backwards and forwards. 01:02:49.080 |
I don't know if anybody's written this before. 01:02:52.840 |
We've created a new version of a GAN, which is kind of a lot like the original GANs, but 01:02:57.840 |
we have this neat trick where we pre-train the generator and we pre-train the critic. 01:03:03.920 |
I mean, GANs have been kind of in the news a lot. 01:03:09.460 |
And if you've seen them, you may have heard that they're a real pain to train. 01:03:14.900 |
But it turns out we realise that really most of the pain of training them was at the start. 01:03:20.160 |
If you don't have a pre-trained generator and you don't have a pre-trained critic, then 01:03:24.280 |
it's basically the blind leading the blind, right? 01:03:27.720 |
You're basically like -- well, the generator's trying to generate something which 01:03:31.300 |
fools the critic, but the critic doesn't know anything at all, so the generator's got nothing useful to learn from. 01:03:35.760 |
And then the critics kind of try to decide whether the generated images are real or not, 01:03:39.000 |
and that's really obvious, so it doesn't have to get very good at it either. 01:03:41.320 |
And so they kind of like don't go anywhere for ages. 01:03:46.360 |
And then once they finally start picking up steam, they go along pretty quickly. 01:03:50.420 |
So if you can find a way to generate things without using a GAN, like mean squared error 01:03:56.800 |
pixel loss, and discriminate things without using a GAN, like a classifier on that first generator's outputs, then you can skip most of the slow start. 01:04:08.040 |
So to create just a totally standard fast.ai binary classification model, we need two folders, 01:04:15.760 |
one folder containing high-res images, one folder containing generated images. 01:04:20.680 |
We already have the folder with the high-res images, so we just have to save our generated images. 01:04:26.180 |
So here's a tiny, tiny bit of code that does that. 01:04:30.400 |
We're going to create a directory called image_gen, and pop it into a variable called path_gen. 01:04:37.560 |
We've got a little function called save preds that takes a data loader, and we're going 01:04:41.960 |
to grab all of the file names, because remember that in an item list, the .items attribute contains the file names. 01:04:50.340 |
So here's the file names in that data loader's data set. 01:04:55.080 |
And so now let's go through each batch of the data loader, and let's grab a batch of 01:05:00.640 |
predictions for that batch, and then reconstruct equals true, means it's actually going to create 01:05:06.680 |
fast.ai image objects for each of those, each thing in the batch. 01:05:12.000 |
And so then we'll go through each of those predictions and save them. 01:05:16.000 |
And the name we'll save it with is the name of the original file, but we're going to pop it into a different directory, the image_gen one. 01:05:28.320 |
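The save_preds function is roughly this (a sketch; path_gen is the image_gen directory from above and data_gen.fix_dl is the non-shuffled training data loader, both assumptions based on the lesson):

```python
def save_preds(dl):
    names = dl.dataset.items                 # the original file names
    i = 0
    for b in dl:
        preds = learn_gen.pred_batch(batch=b, reconstruct=True)  # fastai Image objects
        for pred in preds:
            pred.save(path_gen / names[i].name)                   # same name, different folder
            i += 1

save_preds(data_gen.fix_dl)
```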
And so you can see I'm kind of increasingly not just using stuff that's already in the 01:05:33.480 |
fast.ai library, but trying to show you how to write stuff yourself, right? 01:05:38.400 |
And generally it doesn't require heaps of code to do that. 01:05:41.920 |
And so if you come back for part two, lots of part two is kind 01:05:47.080 |
of like: here's how you use things inside the library, and here's how we wrote them ourselves. 01:05:57.520 |
So save those predictions, and then let's just do a PIL.Image.open on the first one and have a look at it. 01:06:08.120 |
So now I can train a critic in the usual way. 01:06:13.320 |
It's really annoying to have to restart Jupyter Notebook to reclaim GPU memory. 01:06:18.440 |
So one easy way to handle this is if you just set something that you knew was using a lot 01:06:22.580 |
of GPU memory to None, like this learner, and then just go gc.collect(). 01:06:28.080 |
That tells Python to do memory garbage collection, and after that you'll generally be fine. 01:06:36.920 |
You'll be able to use all of your GPU memory again. 01:06:40.340 |
If you're using Nvidia SMI to actually look at your GPU memory, you won't see it clear 01:06:45.620 |
because PyTorch still has a kind of allocated cache, but it makes it available. 01:06:51.700 |
So you should find this is how you can avoid restarting your Notebook. 01:06:56.160 |
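In other words, something like:

```python
import gc

learn_gen = None   # drop the reference to whatever was holding GPU memory
gc.collect()       # PyTorch keeps a cached allocation, but the memory is reusable again
```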
So we're going to create a critic, it's just an image item list from folder in the totally 01:07:00.520 |
usual way, and the classes will be the image gen and images. 01:07:07.960 |
We'll do a random split, because we want to know how well we're doing with the critic, so we need a validation set. 01:07:12.800 |
We just label it from folder in the usual way, add some transforms, databunch, normalize, 01:07:17.880 |
so it's a totally standard object classifier. 01:07:22.280 |
Okay, so we've got a totally standard classifier. 01:07:31.900 |
So here's one from the real images, generated images, generated images. 01:07:38.080 |
So it's going to try and figure out which class is which. 01:07:42.720 |
Okay, so we're going to use binary cross-entropy as usual; however, we're not going to use a ResNet for the critic. 01:07:59.200 |
And the reason we'll get into it in more detail in part two, but basically when you're doing 01:08:03.720 |
a GAN, you need to be particularly careful that the generator and the critic can't kind 01:08:14.440 |
of both push in the same direction and increase the weights out of control. 01:08:19.940 |
So we have to use something called spectral normalization to make GANs work nowadays. 01:08:27.400 |
So if you say GAN critic, that will give you a binary classifier suitable for GANs. 01:08:34.960 |
I strongly suspect we probably can use a ResNet here, we just have to create a pre-trained 01:08:39.320 |
ResNet with spectral norm, hope to do that pretty soon, we'll see how we go. 01:08:44.160 |
But as of now, this is kind of the best approach, there's this thing called GAN critic. 01:08:51.680 |
And again, the critic uses a slightly different way of averaging the different parts of the image when it does the loss. 01:09:03.220 |
So any time you're doing a GAN at the moment, you have to wrap your loss function with AdaptiveLoss. 01:09:08.600 |
Again, we'll look at the details in part two; for now, just know this is what you have to do. 01:09:14.700 |
So other than that, slightly odd loss function and that slightly odd architecture, everything 01:09:19.280 |
else is the same, we can call that to create our critic. 01:09:24.080 |
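(Putting those pieces together, a hedged sketch of the critic using the fastai helpers mentioned here - gan_critic, AdaptiveLoss, and the GAN-flavoured accuracy metric described just below; wd is the notebook's weight decay.)

```python
# Binary cross-entropy, wrapped so it's averaged the way a GAN critic needs.
loss_critic = AdaptiveLoss(nn.BCEWithLogitsLoss())

def create_critic_learner(data, metrics):
    # gan_critic() builds a spectral-norm binary classifier suitable for GANs.
    return Learner(data, gan_critic(), metrics=metrics, loss_func=loss_critic, wd=wd)

learn_critic = create_critic_learner(data_crit, accuracy_thresh_expand)
learn_critic.fit_one_cycle(6, 1e-3)
```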
Because we have this slightly different architecture and slightly different loss function, we did 01:09:27.920 |
a slightly different metric - this is the equivalent GAN version of accuracy for critics - and 01:09:34.320 |
then we can train it, and you can see it's 98% accurate at recognizing that kind of crappy image versus the real one. 01:09:44.280 |
And of course, we don't see the numbers here anymore, right, because these are the generated 01:09:48.040 |
images, the generator already knows how to get rid of those numbers that are written on top. 01:09:58.160 |
Now that we have pre-trained the generator and pre-trained the critic, we now need to 01:10:04.680 |
get it to ping-pong between training a little bit of each. 01:10:08.240 |
And the amount of time you spend on each of those things and the learning rates you use are still a bit fiddly to get right. 01:10:17.480 |
So we've created a GAN learner for you, which you just pass in your generator and your critic, 01:10:27.400 |
which we've just simply loaded here from the ones we just trained, and it will go ahead 01:10:33.800 |
and when you go learn.fit, it will do that for you. 01:10:37.360 |
It will figure out how much time to train the generator and then when to switch to training 01:10:40.800 |
the discriminator, the critic, and it will go back and forth. 01:10:45.200 |
These weights here relate to the fact that we don't only use the critic as the loss function. 01:10:51.880 |
If we only use the critic as the loss function, the GAN could get very good at creating pictures 01:10:57.520 |
that look like real pictures, but they actually have nothing to do with the original photo 01:11:05.860 |
So we actually add together the pixel loss and the critic loss. 01:11:10.560 |
And so those two losses are kind of on different scales. 01:11:15.240 |
So we multiply the pixel loss by something between about 50 and about 200. 01:11:21.280 |
Again, something in that range generally works pretty well. 01:11:27.700 |
Something else with GANs, GANs hate momentum when you're training them. 01:11:33.040 |
It kind of doesn't make sense to train them with momentum because you keep switching between 01:11:39.140 |
Maybe there are ways to use momentum, but I'm not sure anybody's figured it out. 01:11:43.000 |
This number here, when you create an Adam optimizer, is where the momentum goes, so you should set that to zero. 01:11:48.900 |
So anyway, if you're doing GANs, use these hyperparameters, it should work. 01:12:00.120 |
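(A sketch of that GAN learner setup: the weights_gen tuple is where the pixel-loss multiplier lives, and the Adam betas are where the momentum gets zeroed out. The switcher and the exact keyword names are from memory, so treat them as approximate.)

```python
from functools import partial
from torch import optim

switcher = partial(AdaptiveGANSwitcher, critic_thresh=0.65)   # when to flip between generator and critic
learn = GANLearner.from_learners(
    learn_gen, learn_crit,
    weights_gen=(1., 50.),                             # critic loss plus 50x pixel loss
    show_img=False, switcher=switcher,
    opt_func=partial(optim.Adam, betas=(0., 0.99)),    # first beta = momentum = zero
    wd=wd)
learn.fit(40, 1e-4)
```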
So that's what GAN learner does, and so then you can go fit, and it trains for a while. 01:12:05.680 |
And one of the tough things about GANs is that these loss numbers, they're meaningless. 01:12:14.360 |
You can't expect them to go down, because as the generator gets better, it gets harder for the critic. 01:12:22.720 |
Then as the critic gets better, it gets harder for the generator. 01:12:31.480 |
So that's one of the tough things about training GANs, is it's kind of hard to know how they're doing. 01:12:36.960 |
So the only way to know how they're doing is to actually take a look at the results from time to time. 01:12:43.200 |
And so if you put show image equals true here, it will actually print out a sample after 01:12:49.920 |
I haven't put that in the notebook because it makes it too big for the repo, but you 01:12:55.680 |
So I've just put the results at the bottom, and here it is. 01:13:05.880 |
We already knew how to get rid of the numbers, but we now don't really have that kind of 01:13:12.240 |
And it's definitely sharpening up this little kitty cat quite nicely. 01:13:23.720 |
There's some weird kind of noise going on here. 01:13:28.560 |
It's certainly a lot better than the horrible original. 01:13:40.120 |
Like here, these things ought to be eyeballs, and they're not. 01:13:48.000 |
Well, our critic doesn't know anything about eyeballs. 01:13:52.000 |
And even if it did, it wouldn't know that eyeballs are particularly important. 01:13:57.760 |
Like when we see a cat without eyes, it's a lot less cute. 01:14:02.600 |
I mean, I'm more of a dog person, but it just doesn't know that this is a feature that matters. 01:14:18.520 |
Particularly because the critic, remember, is not a pre-trained network. 01:14:21.440 |
So I kind of suspect that if we replace the critic with a pre-trained network that's been 01:14:26.160 |
pre-trained on ImageNet but is also compatible with GANs, it might do a better job here. 01:14:31.760 |
But it's definitely a shortcoming of this approach. 01:14:42.120 |
And then after the break, I will show you how to find the cat's eyeballs again. 01:14:48.880 |
For what kind of problems do you not want to use UNETs? 01:14:56.480 |
Well, UNETs are for when the size of your output is similar to the size of your input and kind of aligned with it. 01:15:08.880 |
There's no point kind of having cross-connections if that level of spatial resolution in the output isn't needed. 01:15:16.600 |
So any kind of generative modeling and segmentation is generative modeling. 01:15:23.440 |
It's generating a picture which is a mask of the original objects. 01:15:29.840 |
So probably anything where you want that resolution of the output to be of the same kind of fidelity as the resolution of the input. 01:15:39.080 |
Obviously, something like a classifier makes no sense. 01:15:42.160 |
In a classifier, you just want the downsampling path, because at the end, you just want a 01:15:48.160 |
single number, which is like, is it a dog or a cat, or what kind of pet is it, or whatever. 01:16:00.160 |
Just before we leave GANs, I'll just mention there's another notebook you might be interested 01:16:09.920 |
When GANs started a few years ago, people generally used them to kind of create images 01:16:18.000 |
out of thin air, which I personally don't think is a particularly useful or interesting 01:16:23.840 |
thing to do, but it's kind of a good, I don't know, it's a good research exercise, I guess. 01:16:30.000 |
So we implemented this wGAN paper, which was kind of really the first one to do a somewhat 01:16:36.680 |
adequate job, somewhat easily, and so you can see how to do that with the fast AI library. 01:16:43.280 |
It's kind of interesting, because the dataset we use is this LSUN bedrooms dataset, which 01:16:51.120 |
we've provided in our URLs, which just, as you can see, has bedrooms, lots and lots of bedrooms. 01:16:59.560 |
And the approach, you'll see in the prose here that Sylvain wrote, the approach that we use 01:17:08.520 |
in this case is to just say, can we create a bedroom? 01:17:14.160 |
And so what we actually do is that the input to the generator isn't an image that we clean up. 01:17:23.960 |
We actually feed to the generator random noise. 01:17:27.820 |
And so then the generator's task is, can you turn random noise into something which the 01:17:33.300 |
critic can't tell the difference between that output and a real bedroom? 01:17:38.860 |
And so we're not doing any pre-training here or any of the stuff that makes this fast and easy. 01:17:48.360 |
So this is a very traditional approach, but you can still see, you still just go, you 01:17:52.200 |
know, gan learner, and there's actually a wGAN version, which is, you know, this kind 01:17:56.160 |
of older style approach, but you just pass in the data and the generator and the critic 01:18:00.160 |
in the usual way, and you call fit, and you'll see, in this case we have a show image on, 01:18:08.400 |
you know, after epoch one, it's not creating great bedrooms or two or three, and you can 01:18:12.720 |
really see that in the early days of these kinds of gans, it doesn't do a great job of 01:18:16.400 |
anything, but eventually after, you know, a couple of hours of training, it's producing somewhat bedroom-ish things. 01:18:29.440 |
So anyway, it's a notebook you can have a play with, and it's a bit of fun. 01:18:35.520 |
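(For reference, a rough sketch of that notebook's setup, assuming fastai v1's basic_generator/basic_critic helpers and the GANLearner.wgan constructor; argument names are approximate.)

```python
from functools import partial
from torch import optim

generator = basic_generator(in_size=64, n_channels=3, n_extra_layers=1)
critic    = basic_critic(in_size=64, n_channels=3, n_extra_layers=1)

# Random noise goes in; (eventually) bedroom-ish images come out.
learn = GANLearner.wgan(data, generator, critic,
                        opt_func=partial(optim.Adam, betas=(0., 0.99)), wd=0.)
learn.fit(30, 2e-4)
```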
So I was very excited when we got fast.ai to the point in the last week or so that we 01:18:47.720 |
had gans working in a way where kind of API-wise, they're far more concise and more flexible 01:18:53.720 |
than any other library that exists, but also kind of disappointed that they take a long 01:18:59.840 |
time to train, and the outputs are still like so-so, and so the next step was like, well, can we get rid of the GAN entirely? 01:19:07.920 |
So the first step with that, I mean, obviously, the thing we really want to do is come up with a better loss function. 01:19:13.040 |
We want a loss function that does a good job of saying this is a high-quality image without 01:19:19.480 |
having to go over all the gan trouble, and preferably it also doesn't just say it's a 01:19:23.640 |
high-quality image, but it's an image which actually looks like the thing it's meant to. 01:19:29.200 |
So the real trick here comes back to this paper from a couple of years ago, perceptual 01:19:33.720 |
losses for real-time style transfer and super resolution. 01:19:37.600 |
Justin Johnson et al. created this thing they call perceptual losses. 01:19:42.080 |
It's a nice paper, but I hate this term because there's nothing particularly perceptual about them. 01:19:49.080 |
So in the fastai library, you'll see this referred to as feature losses. 01:19:53.800 |
And it shares something with gans, which is that after we go through our generator, which 01:19:59.920 |
they call the image transform net, and you can see it's got this kind of U-Net-shaped thing. 01:20:04.720 |
They didn't actually use U-Nets, because at the time this came out, nobody in the machine learning world had much heard of them. 01:20:15.600 |
I should mention, like, in these architectures where you have a downsampling path followed 01:20:21.080 |
by the upsampling path, the downsampling path is very often called the encoder. 01:20:27.000 |
As you saw in our code, actually, we called that the encoder. 01:20:30.200 |
And the upsampling path is very often called the decoder. 01:20:35.080 |
In generative models, generally, including generative text models, neural translation, 01:20:41.760 |
stuff like that, they tend to be called the encoder and the decoder, two pieces. 01:20:45.880 |
So we have this generator, and we want a loss function that says, you know, is the thing 01:20:52.920 |
that it's created like the thing that we want. 01:20:56.320 |
And so the way they do that is they take the prediction -- remember Y hat is what we normally 01:21:00.680 |
use for a prediction from a model -- we take the prediction and we put it through a pre-trained ImageNet network. 01:21:09.080 |
So at the time that this came out, the pre-trained image network they were using was VGG. 01:21:15.120 |
People still -- it's kind of old now, but people still tend to use it because it works well enough for this. 01:21:23.320 |
So they take the prediction and they put it through VGG, the pre-trained image net network. 01:21:30.960 |
And so normally the output of that would tell you, hey, is this generated thing, you know, 01:21:36.840 |
a dog or a cat or an airplane or a fire engine or whatever, right? 01:21:42.200 |
But in the process of getting to that final classification, it goes through lots of different layers. 01:21:47.640 |
And in this case, they've color-coded all the layers with the same grid size in the feature map with the same color. 01:21:53.640 |
So every time we switch colors, we're switching grid size. 01:21:56.520 |
So there's a stride-2 conv, or in VGG's case they still used to use max pooling layers, which is a similar idea. 01:22:04.840 |
And so what we could do is say, hey, let's not take the final output of the VGG model 01:22:10.080 |
on this generated image, but let's take something in the middle. 01:22:17.000 |
Let's take the activations of some layer in the middle. 01:22:20.960 |
So those activations might be a feature map of like 256 channels by 28 by 28, say. 01:22:30.500 |
And so those kind of 28 by 28 grid cells will kind of roughly semantically say things like, 01:22:35.280 |
hey, in this part of that 28 by 28 grid, is there something that looks kind of furry? 01:22:40.400 |
Or is there something that looks kind of shiny? 01:22:42.280 |
Or is there something that looks kind of circular? 01:22:43.440 |
Or is there something that kind of looks like an eyeball or whatever? 01:22:47.000 |
So what we do is that we then take the target, so the actual Y value, and we put it through 01:22:53.480 |
the same pre-trained VGG network, and we pull out the activations at the same layer, and 01:23:01.000 |
So it'll say, OK, in the real image, grid cell 1, 1 of that 28 by 28 feature map is 01:23:11.760 |
furry and blue and round shaped, and in the generated image, it's furry and blue and not round. 01:23:23.620 |
So that ought to go a long way towards fixing our eyeball problem, because in this case, 01:23:30.040 |
the feature map is going to say, there's eyeballs here-- sorry, here-- but there isn't here. 01:23:41.140 |
So that's what we call feature losses, or Johnson et al. called perceptual losses. 01:23:52.760 |
So to do that, we're going to use the Lesson 7 Super Res notebook. 01:24:02.800 |
And this time, the task we're going to do is kind of the same as the previous task, 01:24:08.000 |
but I wrote this notebook a little bit before the GAN notebook. 01:24:11.800 |
Before I came up with the idea of putting text on it and having a random JPEG quality. 01:24:18.720 |
There's no text written on top, and it's 96 by 96. 01:24:23.640 |
And it's before I realized what a great word "crappify" is, so it's called resize. 01:24:29.100 |
So here's our crappy images and our original images, kind of a similar task to what we 01:24:38.240 |
So I'm going to try and create a loss function which does this. 01:24:47.380 |
So the first thing I do is I define a base loss function, which is basically like, how 01:24:54.640 |
am I going to compare the pixels and the features? 01:25:08.240 |
So any time you see base loss, we mean L1 loss. 01:25:20.280 |
In VGG, there's an attribute called .features, which contains the convolutional part of the model. 01:25:27.200 |
So here's the convolutional part of the VGG model. 01:25:30.680 |
Because we don't need the head, because we only want the intermediate activations. 01:25:37.340 |
We'll put it into eval mode, because we're not training it. 01:25:41.120 |
And we'll turn off requires_grad, because we don't want to update the weights of this 01:25:46.840 |
We're just using it for inference, for the loss. 01:25:50.840 |
So then let's enumerate through all the children of that model and find all of the max pooling 01:25:56.760 |
Because in the VGG model, that's where the grid size changes. 01:26:01.320 |
And as you can see from this picture, we kind of want to grab features from every time just before the grid size changes. 01:26:13.280 |
So there's our list of layer numbers just before the max pooling layers. 01:26:21.160 |
And so all of those are ReLU layers, not surprisingly. 01:26:27.200 |
So those are where we want to grab some features from. 01:26:34.360 |
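(In code, that setup looks something like this - base_loss, the frozen VGG body, and the indices of the layers just before each max pool; requires_grad and children here are the fastai helper functions.)

```python
from torchvision.models import vgg16_bn

base_loss = F.l1_loss                                # "base loss" always means L1 here

vgg_m = vgg16_bn(True).features.cuda().eval()        # just the convolutional part, no head
requires_grad(vgg_m, False)                          # inference only - never update these weights

# Indices of the layers just before each max pool, i.e. just before the grid size changes.
blocks = [i - 1 for i, o in enumerate(children(vgg_m)) if isinstance(o, nn.MaxPool2d)]
```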
So here's our feature_loss class, which is going to implement this idea. 01:26:40.140 |
So basically, when we call the feature_loss class, we're going to pass it some pre-trained 01:26:50.360 |
That's the model which contains the features which we want our feature loss to be calculated from. 01:26:56.160 |
So we can go ahead and grab all of the layers from that network that we want the features from. 01:27:07.760 |
So we're going to need to hook all of those outputs. 01:27:10.360 |
Because remember, that's how we grab intermediate layers in PyTorch is by hooking them. 01:27:15.760 |
So this is going to contain our hooked outputs. 01:27:22.160 |
So now, in the forward of feature_loss, we're going to make features passing in the target. 01:27:28.960 |
So this is our actual Y, which is just going to call that VGG model and go through all 01:27:33.760 |
of the stored activations and just grab a copy of them. 01:27:39.320 |
And so we're going to do that both for the target, call that out_feat, and for the input. 01:27:48.520 |
And so now, let's calculate the L1 loss between the pixels. 01:27:55.700 |
Because we still want the pixel loss a little bit. 01:27:58.140 |
And then let's also go through all of those layers features and get the L1 loss on them. 01:28:08.000 |
So we're basically going through every one of these ends of each block and grabbing the activations there. 01:28:19.360 |
So that's going to end up in this list called feature_losses, which I then just sum up. 01:28:28.880 |
And by the way, the reason I do it as a list is because we've got this nice little callback 01:28:33.180 |
that if you put them into a thing called .metrics in your loss function, it'll print out all 01:28:38.240 |
of the separate layer loss amounts for you, which is super handy. 01:28:47.880 |
That's our perceptual loss or feature_loss class. 01:28:51.060 |
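(A condensed sketch of that class - the notebook's version also adds the gram-matrix losses mentioned a bit further on, and the layer choice and weights at the bottom are illustrative; hook_outputs is the fastai hooking helper.)

```python
class FeatureLoss(nn.Module):
    def __init__(self, m_feat, layer_ids, layer_wgts):
        super().__init__()
        self.m_feat = m_feat
        self.loss_features = [self.m_feat[i] for i in layer_ids]
        self.hooks = hook_outputs(self.loss_features, detach=False)   # grab intermediate activations
        self.wgts = layer_wgts
        self.metric_names = ['pixel'] + [f'feat_{i}' for i in range(len(layer_ids))]

    def make_features(self, x, clone=False):
        self.m_feat(x)                                   # run VGG; the hooks store the activations
        return [(o.clone() if clone else o) for o in self.hooks.stored]

    def forward(self, input, target):
        out_feat = self.make_features(target, clone=True)   # features of the real image
        in_feat  = self.make_features(input)                # features of the generated image
        self.feat_losses  = [base_loss(input, target)]      # keep a bit of plain pixel loss
        self.feat_losses += [base_loss(f_in, f_out) * w
                             for f_in, f_out, w in zip(in_feat, out_feat, self.wgts)]
        self.metrics = dict(zip(self.metric_names, self.feat_losses))  # picked up by LossMetrics
        return sum(self.feat_losses)

    def __del__(self): self.hooks.remove()

feat_loss = FeatureLoss(vgg_m, blocks[2:5], [5, 15, 2])
```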
And so now we can just go ahead and train a unit in the usual way with our data and our 01:28:55.160 |
pre-trained architecture, which is a ResNet-34, passing in our loss function, which is using our pre-trained VGG model. 01:29:02.920 |
And this is that callback I mentioned, loss_metrics, which is going to print out all the different feature-layer losses. 01:29:09.760 |
These (the blur and the weight norm arguments) are two things that we'll learn about in part two of the course, but you should feel free to use them already. 01:29:14.720 |
I just created a little function called do_fit that does fit one cycle and then saves the model. 01:29:23.020 |
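(Put together, the learner and the little do_fit helper look roughly like this; the exact keyword names are from memory, so treat them as approximate.)

```python
arch = models.resnet34
wd, lr = 1e-3, 1e-3

learn = unet_learner(data, arch, wd=wd, loss_func=feat_loss,
                     callback_fns=LossMetrics,               # prints each feature-layer loss
                     blur=True, norm_type=NormType.Weight)   # the "two things" for part two

def do_fit(save_name, lrs=slice(lr), pct_start=0.9):
    learn.fit_one_cycle(10, lrs, pct_start=pct_start)
    learn.save(save_name)
    learn.show_results(rows=1, imgsize=5)

do_fit('1a', slice(lr*10))      # frozen downsampling path first
learn.unfreeze()
do_fit('1b', slice(1e-5, lr))   # then fine-tune the whole thing
```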
So as per usual, because we're using a pre-trained network in our UNet, we start with frozen 01:29:29.240 |
layers for the downsampling path, train for a while, and as you can see, we get not only 01:29:34.720 |
the loss, but also the pixel loss and the loss at each of our feature layers. 01:29:40.000 |
And then also something we'll learn about in part two called gram_loss, which I don't 01:29:45.560 |
think anybody's used for super resolution before as far as I know, but as you'll see, it turns out to work really well. 01:29:52.160 |
So that's eight minutes, so much faster than a GAN. 01:29:55.960 |
And already, as you can see, this is our output, modeled output, pretty good. 01:30:01.500 |
So then we unfreeze and train some more, and it's a little bit better. 01:30:07.720 |
And then let's switch up to double the size, and so we need to also halve the batch size to avoid running out of GPU memory. 01:30:14.880 |
And freeze again and train some more, so it's now taking half an hour. 01:30:24.320 |
So all in all, we've done about an hour and 20 minutes of training. 01:30:33.920 |
It knows that eyes are important, so it's really made an effort. 01:30:36.920 |
It knows that fur is important, so it's really made an effort. 01:30:39.780 |
So it started with something with JPEG artifacts around the ears and all this mess and eyes 01:30:47.920 |
that are just kind of vague, light blue things, and it really created a lot of texture. 01:30:53.880 |
This cat is clearly kind of like looking over the top of one of those little clawing frames 01:30:59.480 |
covered in fuzz, so it actually recognized that this thing is probably kind of a carpety 01:31:04.440 |
material, and it's created a carpety material for us. 01:31:12.080 |
So talking of remarkable, we can now - well, I've never seen outputs like this before without a GAN. 01:31:24.520 |
So I was just so excited when we were able to generate this. 01:31:31.200 |
So if you create your own crappification functions and train this model, you'll build stuff that nobody's built before. 01:31:38.120 |
Because like nobody else's that I know of is doing it this way. 01:31:45.680 |
What we can now do is we can now, instead of starting with our low res, I actually stored 01:31:53.440 |
another set at size 256, which are called medium res. 01:31:57.600 |
So let's see what happens if we upsize a medium res. 01:32:14.480 |
So you can see there's still a lot of room for improvement. 01:32:16.860 |
Like you see the lashes here are very pixelated. 01:32:21.920 |
Areas where there should be hair here are just kind of fuzzy. 01:32:24.880 |
So watch this area as I hit down on my keyboard. 01:32:31.360 |
You know, it's taken a medium res image and it's made a totally clear thing here. 01:32:41.360 |
The eyeball here is just kind of a general blue thing. 01:32:46.040 |
Here it's added all the right texture, you know. 01:32:49.780 |
So I just think this is super exciting, you know. 01:32:54.080 |
Here's a model I trained in an hour and a half using standard stuff that you've all learned 01:32:59.840 |
about a unit, a pre-trained model, feature loss function, and we've got something which 01:33:05.680 |
can turn that into that or, you know, this absolute mess into this. 01:33:14.680 |
And like it's really exciting to think what could you do with that, right? 01:33:19.660 |
So one of the inspirations here has been a guy called Jason Antic. 01:33:26.840 |
And Jason was a student in the course last year. 01:33:34.160 |
And what he did very sensibly was decide to basically nearly quit his job and work 01:33:44.080 |
four days a week - or really six days a week - on studying deep learning. 01:33:47.760 |
And as you should do, he created a kind of capstone project. 01:33:51.240 |
And his project was to combine GANs and feature losses together. 01:33:57.320 |
And his crappification approach was to take color pictures and make them black and white. 01:34:05.200 |
So he took the whole of ImageNet, created a black and white ImageNet, and then trained a model to re-colorize it. 01:34:12.920 |
And now he's got these actual old photos from the 19th century that he's turning into color. 01:34:25.520 |
The model thought, oh, that's probably some kind of copper kettle. 01:34:32.240 |
They're probably like different colors to the wall. 01:34:38.200 |
Maybe it would be reflecting stuff outside, you know. 01:34:53.560 |
Like you can take our feature loss and our GAN loss and combine them. 01:34:58.640 |
So I'm very grateful to Jason, because he's helped us build this lesson. 01:35:03.400 |
And it's been really nice, because we've been able to help him, too, because he hadn't realized 01:35:08.520 |
that he can use all this pre-training and stuff. 01:35:10.480 |
And so hopefully you'll see DeOldify in the next couple of weeks be even better at de-oldification. 01:35:16.560 |
But hopefully you all can now add other kinds of decrappification methods as well. 01:35:23.840 |
So I like every course, if possible, to show something totally new, because then every 01:35:33.600 |
student has a chance to basically build things that have never been built before. 01:35:36.920 |
So this is kind of that thing, you know, but between the much better segmentation results 01:35:42.460 |
and these much simpler and faster decrappification results, I think you can build some really cool stuff. 01:35:55.040 |
Is it possible to use similar ideas to UNET and GANs for NLP? 01:35:59.960 |
For example, if I want to tag the verbs and nouns in a sentence or create a really good 01:36:11.920 |
It's a pretty new area, but there's a lot of opportunities there. 01:36:15.160 |
And we'll be looking at some in a moment, actually. 01:36:24.000 |
So I actually tried training this -- well, I actually tried testing this on this -- remember 01:36:30.040 |
this picture I showed you with a slide last lesson? 01:36:33.320 |
And it's a really rubbishy-looking picture, and I thought, what would happen if we tried 01:36:36.440 |
running this just through the exact same model and it changed it from that to that? 01:36:45.480 |
You can see something it didn't do, which is this weird discoloration. 01:36:49.280 |
It didn't fix it, because I didn't crappify things with weird discoloration, right? 01:36:53.520 |
So if you want to create really good image restoration, like I say, you need a really good crappification function. 01:37:01.480 |
So here's what we've learned so far, right, in the course, some of the main things. 01:37:08.200 |
So we've learned that neural nets consist of sandwich layers of affine functions, which 01:37:15.680 |
are basically matrix multiplications, or a slightly more general version of that, and nonlinearities, like ReLU. 01:37:21.520 |
And we learned that the results of those calculations are called activations, and the things that 01:37:26.640 |
go into those calculations that we learn are called parameters, and that the parameters 01:37:31.520 |
are initially randomly initialized, or we copy them over from a pre-trained model, and 01:37:36.480 |
then we train them with SGD or faster versions, and we learned that convolutions are a particular 01:37:42.720 |
affine function that work great for autocorrelated data, so things like images and stuff. 01:37:48.560 |
We learned about batch norm, dropout, data augmentation and weight decay as ways of regularizing 01:37:54.600 |
models, and also batch norm helps train models more quickly. 01:37:57.640 |
And then today we've learned about res/dense blocks. 01:38:02.760 |
We've learned a lot about image classification and regression, embeddings, categorical and 01:38:07.400 |
continuous variables, collaborative filtering, language models and NLP classification, and then segmentation, U-Nets and GANs. 01:38:15.820 |
So go over these things and make sure that you feel comfortable with each of them. 01:38:21.880 |
If you've only watched this series once, you definitely won't. 01:38:26.140 |
People normally watch it three times or so to really understand the detail. 01:38:38.540 |
So that's the last thing we're going to do, RNNs. 01:38:42.200 |
So RNNs, I'm going to introduce a little kind of diagrammatic method here to explain RNNs. 01:38:48.160 |
And the diagrammatic method, I'll start by showing you a basic neural net with a single hidden layer. 01:38:56.960 |
A rectangle is an input, so that'll be batch size by number of inputs. 01:39:01.040 |
So kind of, you know, batch size by number of inputs. 01:39:10.040 |
An arrow means a layer, broadly defined, such as a matrix product followed by ReLU. 01:39:21.840 |
So in this case, we have one set of hidden activations. 01:39:25.660 |
And so given that the input was number of inputs, this here is a matrix of number of inputs by number of activations. 01:39:34.580 |
So the output will be batch size by number of activations. 01:39:38.960 |
It's really important you know how to calculate these shapes. 01:39:41.480 |
So use learn.summary() a lot to see all the shapes. 01:39:48.620 |
So that means it's another layer, matrix product followed by non-linearity. 01:39:51.880 |
In this case, we're going to the output, so we use softmax. 01:39:59.680 |
And so this matrix product will be number of activations by number of classes. 01:40:03.240 |
So our output is batch size by number of classes. 01:40:06.160 |
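(A tiny shape check of what's being described, in plain PyTorch with made-up sizes.)

```python
import torch, torch.nn as nn, torch.nn.functional as F

bs, n_in, n_hid, n_classes = 64, 10, 50, 5        # made-up sizes
x = torch.randn(bs, n_in)                         # batch size x number of inputs
h = F.relu(nn.Linear(n_in, n_hid)(x))             # batch size x number of activations
out = nn.Linear(n_hid, n_classes)(h)              # batch size x number of classes
print(x.shape, h.shape, out.shape)                # [64, 10]  [64, 50]  [64, 5]
```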
So let's reuse that key - remember, triangle is output, circle is activations (hidden state), rectangle is input. 01:40:20.420 |
So let's now imagine that we wanted to get a big document, split it into sets of three 01:40:27.840 |
words at a time, and grab each set of three words and then try to predict the third word 01:40:36.560 |
So if we had the data set in place, we could grab word one as an input, chuck it through 01:40:42.080 |
an embedding, create some activations, pass that through a matrix product and non-linearity, 01:40:53.960 |
grab the second word, put it through an embedding, and then we could either add those two things together or concatenate them. 01:41:02.080 |
Generally speaking, when you see kind of two sets of activations coming together in a diagram, 01:41:08.400 |
you normally have a choice of concatenate or add. 01:41:13.200 |
And that's going to create a second bunch of activations, and then you can put it through 01:41:16.640 |
one more fully connected layer and softmax to create an output. 01:41:23.160 |
So that would be a totally standard, fully connected neural net with one very minor tweak, 01:41:29.520 |
which is concatenating or adding at this point, which we could use to try to predict the third word. 01:41:41.120 |
So remember, arrows represent layer operations, and I removed in this one the specifics of 01:41:48.520 |
what they are because they're always an affine function followed by a non-linearity. 01:41:56.760 |
Let's go further. What if we wanted to predict word four using words one and two and three? 01:42:03.080 |
It's basically the same picture as last time, except with one extra input and one extra circle of activations. 01:42:07.980 |
But I want to point something out, which is each time we go from rectangle to circle, 01:42:15.720 |
we're doing the same thing. We're doing an embedding, which is just a particular kind 01:42:20.320 |
of matrix multiply, where you have a one-hot encoded input. 01:42:24.740 |
Each time we go from circle to circle, we're basically taking one piece of hidden state, 01:42:31.000 |
one set of activations, and turning it into another set of activations by saying we've now got the next word. 01:42:37.280 |
And then when we go from circle to triangle, we're doing something else again, which is 01:42:41.000 |
we're saying let's convert the hidden state, these activations, into an output. 01:42:46.360 |
So it would make sense, so you can see I've colored each of those arrows differently. 01:42:50.680 |
So each of those arrows should probably use the same weight matrix, because it's doing the same thing. 01:42:57.800 |
So why would you have a different set of embeddings for each word, or a different set of -- a 01:43:02.320 |
different matrix to multiply by to go from this hidden state to this hidden state, versus this one to this one? 01:43:08.800 |
So this is what we're going to build. So we're now going to jump into human numbers, which 01:43:22.080 |
is the lesson7-human-numbers notebook, and this is the dataset that I created, which literally 01:43:25.960 |
just contains all the numbers from one to 9,999 written out in English. 01:43:31.880 |
And we're going to try and create a language model that can predict the next word in this 01:43:36.320 |
document. It's just a toy example for this purpose. 01:43:41.240 |
So in this case, we only have one document, and that one document is the list of numbers. 01:43:47.320 |
So we can use a text list to create an item list with text in for the training and the 01:43:52.200 |
validation. In this case, the validation set is the numbers from 8,000 onwards, and the training set is one to 8,000. 01:43:59.600 |
We can combine them together, turn that into a data bunch. 01:44:04.660 |
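(The data setup is along these lines - the file names and the exact TextList / label_for_lm calls are from memory, so treat them as approximate.)

```python
from fastai.text import *

bs = 64
path = untar_data(URLs.HUMAN_NUMBERS)

def readnums(d): return [', '.join(o.strip() for o in open(path/d).readlines())]
train_txt, valid_txt = readnums('train.txt'), readnums('valid.txt')

train = TextList(train_txt, path=path)
valid = TextList(valid_txt, path=path)
src = ItemLists(path=path, train=train, valid=valid).label_for_lm()
data = src.databunch(bs=bs)
```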
So we only have one document. So train zero is the document. Grab its dot text. That's 01:44:09.200 |
how you grab the contents of a text list, and here are the first 80 characters. 01:44:15.080 |
It starts with a special token, XXBOS. Anything starting with XX is a special fast AI token. 01:44:21.080 |
BOS is the beginning of stream token. It basically says this is the start of a document. It's 01:44:26.480 |
very helpful in NLP to know when documents start so that your models can learn to recognize 01:44:31.920 |
them. The validation set contains 13,000 tokens, 01:44:36.360 |
so 13,000 words or punctuation marks, because everything between spaces is a separate token. 01:44:44.640 |
The batch size that we asked for was 64. And then by default, it uses something called 01:44:54.640 |
BPTT of 70. BPTT, as we briefly mentioned, stands for backprop through time. 01:45:01.440 |
That's the sequence length. So with each of our 64 document segments, we split it up into 01:45:12.080 |
lists of 70 words that we look at at one time. 01:45:15.840 |
So what we do is we grab this for the validation set, an entire string of 13,000 tokens, and 01:45:24.040 |
then we split it into 64 roughly equal sized sections. People very, very, very often think 01:45:32.440 |
I'm saying something different. I did not say they are of length 64. They're not. They're 01:45:38.120 |
64 roughly equally sized segments. So we take the first 1/64 of the document, piece one. 01:45:46.160 |
1/64, piece two. And then for each of those 1/64 of the document, we then split those 01:45:56.960 |
into pieces of length 70. So each batch -- so let's now say, okay, for 01:46:04.720 |
those 13,000 tokens, how many batches are there? Well, divide by batch size and divide 01:46:10.080 |
by 70. So there's about 2.9 batches. So there's going to be three batches. So let's grab an 01:46:16.760 |
iterator for our data loader, grab one, two, three batches, the X and the Y, and let's add 01:46:23.840 |
up the number of elements, and we get back slightly less than this because there's a 01:46:28.920 |
little bit left over at the end that doesn't quite make up a full batch. 01:46:34.360 |
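(You can check that arithmetic yourself with something like this: roughly 13,000 tokens / 64 segments / a bptt of 70 is about 2.9, hence three batches.)

```python
it = iter(data.valid_dl)
x1, y1 = next(it)
x2, y2 = next(it)
x3, y3 = next(it)

total = x1.numel() + x2.numel() + x3.numel()
print(x1.shape, total)   # a little under the ~13,000 validation tokens: the last batch isn't full
```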
So this is the kind of stuff you should play around with a lot, lots of shapes and sizes 01:46:37.920 |
and stuff and iterators. As you can see, it's 95 by 64. I claimed it was going to be 70 by 01:46:45.680 |
64. That's because our data loader for language models slightly randomizes, BPTT, just to 01:46:53.800 |
give you a bit more kind of shuffling, get a bit more randomization. It helps the model. 01:47:00.600 |
And so here you can see the first batch of X. Remember, we've numericalized all these. 01:47:10.280 |
And here's the first batch of Y. And you'll see here, this is 2, 18, 10, 11, 8. This is 01:47:15.800 |
18, 10, 11, 8. So this one is offset by 1 from here because that's what we want to do with 01:47:23.120 |
a language model. We want to predict the next word. So after 2 should come 18. And after 01:47:30.160 |
18 should come 10. You can grab the vocab for this data set. And a vocab has a textify. 01:47:39.000 |
So if we look at the same thing but with textify, that'll just look it up in the vocab. So here 01:47:43.880 |
you can see XXBOS 8001. Whereas in the Y, there's no XXBOS. It's just 8001. So after XXBOS is 01:47:52.440 |
8, after 8 is 1, after 1000 is 1. And so then after we get 8023 comes X2. And look at this. 01:48:03.520 |
We're always looking at column 0. So this is the first batch, the first mini-batch. 01:48:08.720 |
Comes 8024 and then X3 all the way up to 8040. And so then we can go right back to the start 01:48:18.880 |
but look at batch 1. So index 1, which is batch number 2. And now we can continue. A 01:48:25.600 |
slight skip from 8040 to 8046. That's because the last mini-batch wasn't quite complete. 01:48:32.240 |
So what this means is that every mini-batch joins up with the previous mini-batch. So 01:48:41.840 |
you can go straight from X1, 0 to X2, 0. It continues. 8023, 8024, right? And so if you 01:48:50.320 |
look at the same thing for colon, comma, 1, you'll also see they join up. So all the mini-batches 01:48:57.060 |
join up. So that's the data. We can do show batch to see it. And here is our model which 01:49:09.040 |
is doing this. So this is just the code copied over. So it contains one embedding, i.e. the 01:49:25.640 |
green arrow, one hidden to hidden brown arrow layer, and one hidden to output. So each colored 01:49:35.800 |
arrow has a single matrix. And so then in the forward pass, we take our first input, 01:49:45.000 |
X0, and put it through input to hidden, the green arrow, create our first set of activations, 01:49:52.120 |
which we call H. Assuming that there is a second word, because sometimes we might be 01:49:58.320 |
at the end of a batch where there isn't a second word, assuming there is a second word, 01:50:02.160 |
then we would add to H the result of X1, put through the green arrow. Remember, that's i_h. 01:50:11.920 |
And then we would say, okay, our new H is the result of those two added together, put 01:50:20.600 |
through our hidden to hidden, orange arrow, and then ReLU, then batch norm. And then for 01:50:25.360 |
the second word, do exactly the same thing. And then finally, blue arrow, put it through 01:50:30.840 |
h_o. So that's how we convert our diagram to code. So nothing new here at all. So now 01:50:42.960 |
we can pop that in a learner and train it - about 46% accuracy. Let's take this code 01:50:49.720 |
and recognize it's pretty awful. There's a lot of duplicate code. And as coders, when 01:50:55.040 |
we see duplicate code, what do we do? We refactor. So we should refactor this into a loop. So 01:51:01.760 |
here we are. We've refactored it into a loop. So now we're going for each X, I and X and 01:51:06.800 |
doing it in the loop. Guess what? That's an RNN. An RNN is just a refactoring. It's not 01:51:16.320 |
anything new. This is now an RNN. And let's refactor our diagram from this to this. This 01:51:27.360 |
is the same diagram. But I've just replaced it with my loop. Does the same thing. So here 01:51:36.520 |
it is. It's got exactly the same in it. Literally exactly the same. Just popped a loop here. 01:51:41.400 |
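(A sketch of that loop-refactored model in the spirit of the notebook's code - nv is the vocab size and nh the number of hidden activations, both assumed to be defined already; the class name is made up.)

```python
import torch, torch.nn as nn, torch.nn.functional as F

class LoopModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)      # green arrow: input to hidden
        self.h_h = nn.Linear(nh, nh)         # orange arrow: hidden to hidden
        self.h_o = nn.Linear(nh, nv)         # blue arrow: hidden to output
        self.bn  = nn.BatchNorm1d(nh)

    def forward(self, x):
        h = torch.zeros(x.shape[0], nh).to(x.device)   # a bunch of zeros to add to
        for i in range(x.shape[1]):                    # the same operations for every word: an RNN
            h = h + self.i_h(x[:, i])
            h = self.bn(F.relu(self.h_h(h)))
        return self.h_o(h)                             # one prediction at the end of the sequence
```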
Before I start, I just have to make sure I've got a bunch of zeros to add to. And of course 01:51:47.600 |
I get exactly the same result when I train it. Okay. So next thing that you might think 01:51:53.760 |
then -- and one nice thing about the loop, though, is now this will work even if I'm 01:51:57.280 |
not predicting the fourth word from the previous three but the ninth word from the previous 01:52:01.560 |
eight. It will work for any arbitrarily length long sequence, which is nice. So let's up 01:52:07.120 |
the BPTT to 20 since we can now. And let's now say, okay, instead of just predicting 01:52:19.400 |
the nth word from the previous n minus 1, let's try to predict the second word from 01:52:25.760 |
the first and the third from the second and the fourth from the third and so forth. Because 01:52:29.920 |
previously -- look at our loss function. Previously we were comparing the result of our model 01:52:35.440 |
to just the last word of the sequence. It's very wasteful because there's a lot of words 01:52:39.560 |
in the sequence. So let's compare every word in X to every word in Y. So to do that, we 01:52:46.720 |
need to change this so it's not just one triangle at the end of the loop. But the triangle is 01:52:52.440 |
inside this, right? So that in other words, after every loop, predict, loop, predict, 01:53:00.360 |
loop, predict. So here's this code. It's the same as the previous code but now I've created 01:53:08.120 |
an array. And every time I go through the loop, I append h_o of h to the array. So now for 01:53:15.960 |
n inputs, I create n outputs. So I'm predicting after every word. Previously I had 46%. Now 01:53:23.640 |
I have 40%. Why is it worse? Well, it's worse because now, like when I'm trying to predict 01:53:31.080 |
the second word, I only have one word of state to use. Right? So like when I'm looking at 01:53:37.400 |
the third word, I only have two words of state to use. So it's a much harder problem for 01:53:42.200 |
it to solve. So the obvious way to fix this then would -- you know, the key problem is 01:53:47.640 |
here. I go H equals torch.zeros, like I reset my state to zero every time I start another 01:53:54.360 |
BPTT sequence. Well, let's not do that. Let's keep H. Right? And we can because remember 01:54:01.240 |
each batch connects to the previous batch. It's not shuffled like happens in image classification. 01:54:09.000 |
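(Concretely, the change about to be described looks roughly like this - state created in the constructor, put back into self.h at the end, and detached so gradients don't flow back through every previous batch; nv, nh and bs are assumed defined, and the class name is made up.)

```python
import torch, torch.nn as nn, torch.nn.functional as F

class StatefulModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)
        self.h_h = nn.Linear(nh, nh)
        self.h_o = nn.Linear(nh, nv)
        self.bn  = nn.BatchNorm1d(nh)
        self.h = torch.zeros(bs, nh).cuda()      # state created once, in the constructor (GPU assumed)

    def forward(self, x):
        res, h = [], self.h
        for i in range(x.shape[1]):
            h = h + self.i_h(x[:, i])
            h = F.relu(self.h_h(h))
            res.append(self.h_o(self.bn(h)))     # predict after every word, not just the last
        self.h = h.detach()                      # keep the state for the next mini-batch
        return torch.stack(res, dim=1)
```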
So let's take this exact model and replicate it again. But let's move the creation of H 01:54:13.440 |
into the constructor. Okay. There it is. So it's now self.h. So this is now exactly the 01:54:20.160 |
same code. But at the end, let's put the new H back into self.h. So it's now doing the 01:54:25.800 |
same thing, but it's not throwing away that state. And so therefore now we actually get 01:54:33.160 |
above the original. We get all the way up to 54% accuracy. So this is what a real RNN 01:54:41.720 |
looks like. You always want to keep that state. But just keep remembering there's nothing 01:54:48.720 |
different about an RNN. It's a totally normal, fully connected neural net. It's just that 01:54:52.840 |
you've got a loop you refactored. What you could do, though, is at the end of your -- every 01:55:03.120 |
loop, you could not just spit out an output, but you could spit it out into another RNN. 01:55:07.920 |
So you could have an RNN going into an RNN. And that's nice because we've now got more 01:55:12.160 |
layers of computation. You would expect that to work better. Well, to get there, let's 01:55:18.960 |
do some more refactoring. So let's take this code and replace it with the equivalent built-in 01:55:25.720 |
PyTorch code, which is -- you just say that. So nn.RNN basically says do the loop for me. 01:55:33.000 |
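(A sketch of that refactor onto the built-in module, with the same caveats about assumed names; swapping nn.RNN for nn.GRU here is what gives the gated version discussed a bit further on.)

```python
import torch, torch.nn as nn

class RNNModel(nn.Module):
    def __init__(self, n_layers=1):
        super().__init__()
        self.i_h = nn.Embedding(nv, nh)
        self.rnn = nn.RNN(nh, nh, n_layers, batch_first=True)   # "do the loop for me"
        self.h_o = nn.Linear(nh, nv)
        self.h = torch.zeros(n_layers, bs, nh).cuda()            # same idea: keep the state (GPU assumed)

    def forward(self, x):
        res, h = self.rnn(self.i_h(x), self.h)   # all the hidden states, plus the final one
        self.h = h.detach()
        return self.h_o(res)                     # an output for every input word
```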
We've still got the same embedding, the same output, the same batch norm, the same initialization 01:55:39.880 |
of H, but we just got rid of the loop. So one of the nice things about RNN is that you can 01:55:45.680 |
now say how many layers you want. So this is the same accuracy, of course. So here I've 01:55:53.280 |
done it with two layers. But here's the thing. When you think about this, right, think 01:56:00.480 |
about it without the loop. It looks like this, right? It's like -- it keeps on going -- and 01:56:06.360 |
we've got a BPTT of 20, so there's 20 layers of this. And we know from that visualizing 01:56:13.320 |
the loss landscapes paper that deep networks have awful, bumpy loss surfaces. So when 01:56:20.680 |
you start creating long time scales and multiple layers, these things get impossible to train. 01:56:31.520 |
So there's a few tricks you can do. One thing is you can add skip connections, of course. 01:56:37.640 |
But what people normally do is instead they put inside -- instead of just adding these 01:56:43.280 |
together, they actually use a little mini neural net to decide how much of the green 01:56:48.680 |
arrow to keep and how much of the orange arrow to keep. And when you do that, you get something 01:56:53.640 |
that's either called a GRU or an LSTM depending on the details of that little neural net. 01:56:59.200 |
And we'll learn about the details of those neural nets in part two. They really don't 01:57:02.600 |
matter, though, frankly. So we can now say let's create a GRU instead, so it's just like 01:57:07.920 |
what we had before, but it'll handle longer sequences in deeper networks. Let's use two 01:57:13.560 |
layers, and we're up to 75%. Okay. So that's RNNs. And the main reason I wanted to show 01:57:27.960 |
it to you was to remove the last remaining piece of magic. And this is one of the least 01:57:35.360 |
magical things we have in deep learning. It's just a refactored, fully connected network. 01:57:40.800 |
So don't let RNNs ever put you off. And with this approach where you basically have a sequence 01:57:48.760 |
of N inputs and a sequence of N outputs we've been using for language modeling, you can 01:57:53.360 |
use that for other tasks, right? For example, the sequence of outputs could be for every 01:57:58.440 |
word. There could be something saying is this something that is sensitive and I want to 01:58:01.980 |
anonymize or not? You know, so like is this private data or not? Or it could be a part 01:58:08.560 |
of speech tag for that word. Or it could be something saying, you know, how should that 01:58:15.840 |
word be formatted? Or whatever. And so these are called sequence labeling tasks, and so 01:58:21.360 |
you can use this same approach for pretty much any sequence labeling task. Or you can do 01:58:26.960 |
what I did in the earlier lesson, which is once you finish building your language model, 01:58:33.360 |
you can throw away the kind of this h_o bit and instead pop there a standard classification 01:58:42.480 |
head and then you can now do NLP classification, which as you saw earlier will give you state-of-the-art 01:58:49.160 |
results even on long documents. So this is a super valuable technique and not remotely 01:58:57.600 |
magical. Okay, so that's it, right? That's deep learning or at least, you know, the kind 01:59:05.880 |
of the practical pieces from my point of view. Having watched this one time, you won't get 01:59:17.120 |
it all. And I don't recommend that you do watch this so slowly that you get it all the 01:59:21.560 |
first time, but you go back and look at it again, take your time, and there'll be bits 01:59:27.200 |
that you go like, "Oh, now I see what he's saying," and then you'll be able to implement 01:59:31.080 |
things you couldn't implement before and you'll be able to dig in more than you before. So 01:59:34.560 |
definitely go back and do it again. And as you do, write code, not just for yourself, 01:59:40.640 |
but put it on GitHub. It doesn't matter if you think it's great code or not. The fact 01:59:45.640 |
that you're writing code and sharing it is impressive, and the feedback you'll get if 01:59:51.880 |
you tell people on the forum, "Hey, I wrote this code. It's not great, but it's my first 01:59:57.120 |
effort. Anything you see, jump out at you," people will say like, "Oh, that bit was done 02:00:02.320 |
well. Hey, but did you know for this bit you could have used this library and saved you 02:00:05.760 |
some time?" You'll learn a lot by interacting with your peers. As you've noticed, I've started 02:00:12.520 |
introducing more and more papers. Now, part two will be a lot of papers, and so it's a 02:00:17.320 |
good time to start reading some of the papers that have been introduced in this section. 02:00:24.160 |
All the bits that say derivation and theorems and lemmas, you can skip them. I do. They add 02:00:29.520 |
almost nothing to your understanding of practical deep learning. But the bits that say why are 02:00:36.600 |
we solving this problem, and what are the results, and so forth are really interesting. 02:00:42.540 |
And then try and write English prose. Not English prose that you want to be read by Jeff Hinton 02:00:51.200 |
and Yann LeCun, but English prose that you want to be read by you as of six months ago. 02:00:56.560 |
Because there's a lot more people in the audience of you as of six months ago than there is 02:01:02.360 |
of Jeffrey Hinton and Yann LeCun. That's the person you best understand. You know what 02:01:07.600 |
they need. Go and get help and help others. Tell us about your success stories. But perhaps 02:01:16.360 |
the most important one is get together with others. People's learning works much better 02:01:20.860 |
if you've got that social experience. So start a book club, get involved in meetups, create 02:01:27.640 |
study groups, and build things. And again, it doesn't have to be amazing. Just build 02:01:36.880 |
something that you think the world would be a little bit better if that existed. Or you 02:01:41.700 |
think it would be kind of slightly delightful to your two-year-old to see that thing. Or 02:01:46.600 |
you just want to show it to your brother the next time they come around to see what you're 02:01:49.120 |
doing. Whatever. Just finish something. Finish something. And then try and make it a bit 02:01:57.320 |
better. So for example, something I just saw this afternoon is the Elon Musk tweet generator. 02:02:09.320 |
So looking at lots of older tweets, creating a language model from Elon Musk, and then 02:02:14.520 |
creating new tweets such as humanity will also have an option to publish on its own 02:02:19.640 |
journey as an alien civilization. It will always, like all human beings, Mars is no 02:02:25.680 |
longer possible. AI will definitely be the central intelligence agency. Okay. So this 02:02:31.680 |
is great. I love this. And I love that Dave Smith wrote and said, "These are my first 02:02:37.920 |
ever commits. Thanks for teaching a finance guy how to build an app in eight weeks." Right? 02:02:43.560 |
So I think this is awesome. And I think clearly a lot of care and passion is being put into 02:02:50.520 |
this project. Will it systematically change the future direction of society as a whole? 02:02:59.720 |
Maybe not. But maybe Elon will look at this and think, "Oh, maybe I need to rethink my 02:03:05.380 |
method of prose." I don't know. I think it's great. And so, yeah. Create something. Put 02:03:12.240 |
it out there. Put a bit of yourself into it. Or get involved in fast AI. The fast AI project, 02:03:20.120 |
there's a lot going on. You know, you can help with documentation and tests, which might 02:03:24.760 |
sound boring, but you'd be surprised how incredibly not boring it is to, like, take a piece of 02:03:28.780 |
code that hasn't been properly documented and research it and understand it and ask 02:03:33.640 |
Sylvain and me on the forum what's going on. Why did you write it this way? We'll send 02:03:37.240 |
you off to the papers that we were implementing. You know, writing a test requires deeply understanding 02:03:42.200 |
that part of the machine learning world to understand how it's meant to work. So that's 02:03:46.800 |
always interesting. Stas Bekman has created this nice dev projects index which you can 02:03:53.060 |
go on to the forum in the fast AI dev section and find actually the dev project section 02:03:59.060 |
and find, like, here's some stuff going on that you might want to get involved in. Or 02:04:02.680 |
maybe there's stuff you want to exist. You can add your own. Create a study group. Dean 02:04:07.280 |
has already created a study group for San Francisco starting in January. This is how 02:04:10.720 |
easy it is to create a study group. Go on the forum, find your little time zone subcategory 02:04:15.920 |
and add a post saying let's create a study group. But make sure you give people a little 02:04:22.000 |
Google sheet to sign up, some way to actually do something. A great example is Pierre who's 02:04:28.380 |
been doing a fantastic job in Brazil of running study groups for the last couple of parts 02:04:34.400 |
of the course and he keeps posting these pictures of people having a good time and learning 02:04:40.000 |
deep learning together, creating wikis together, creating projects together. Great experience. 02:04:47.100 |
And then come back for part two, right, where we'll be looking at all of this interesting 02:04:53.400 |
stuff in particular going deep into the fast AI code base to understand how did we build 02:04:58.080 |
it exactly. We'll actually go through, as we were building it, we created notebooks 02:05:03.400 |
of like here is where we were at each stage. So we're actually going to see the software 02:05:06.920 |
development process itself. We'll talk about the process of doing research, how to read 02:05:11.760 |
academic papers, how to turn math into code, and then a whole bunch of additional types 02:05:17.080 |
of models that we haven't seen yet. So it'll be kind of like going beyond practical deep 02:05:21.800 |
learning into actually cutting edge research. So we've got five minutes to take some questions. 02:05:31.140 |
We had an AMA going on online and so we're going to have time for a couple of the highest 02:05:37.200 |
ranked AMA questions from the community. And the first one is by Jeremy's request, although 02:05:42.160 |
it's not the highest ranked. What's your typical day like? How do you manage your time across 02:05:47.160 |
so many things that you do? Yeah, I thought that I hear that all the time. So I thought 02:05:54.160 |
I should answer it. And I think I've got a few votes. Because I think people who come 02:06:01.880 |
to our study group are always shocked at how disorganized and incompetent I am. And so 02:06:09.120 |
I often hear people saying like, oh, wow, I thought you were like this deep learning 02:06:12.720 |
role model and I'd get to see how to be like you. And now I'm not sure what to be like 02:06:16.240 |
you at all. So yeah, it's for me, it's all about just having a good time with it. I never 02:06:26.160 |
really have many plans. I just try to finish what I start. If you're not having fun with 02:06:32.400 |
it, it's really, really hard to continue because there's a lot of frustration in deep learning 02:06:36.720 |
because it's not like writing a web app, where it's like, you know, authentication check, 02:06:41.840 |
you know, backend service watchdog check. Okay, user credentials check. You know, like you're 02:06:51.200 |
making progress. Where else for stuff like this and stuff that we've been doing the last 02:06:55.960 |
couple of weeks, it's just like, it's not working. It's not working. It's not working. 02:07:00.760 |
No, that also didn't work. That also didn't work until oh, my God, it's amazing. It's 02:07:05.360 |
a cat. That's kind of what it is, right? So you don't get that regular feedback. So yeah, 02:07:11.360 |
you know, you got to have fun with it. And so, so my, yeah, my day is kind of, you know, 02:07:19.560 |
I mean, the other thing I'll do, I'll say I don't, I don't do any meetings. I don't 02:07:24.320 |
do phone calls. I don't do coffees. I don't watch TV. I don't play computer games. I spend 02:07:29.920 |
a lot of time with my family, a lot of time exercising and a lot of time reading and coding 02:07:36.600 |
and doing things I like. So, you know, I think, you know, the main thing is just finish, finish 02:07:45.440 |
something like properly finish it. So when you get to that point where you think you're 02:07:50.120 |
80% of the way through, but you haven't quite created a read me yet and the install process 02:07:54.600 |
is still a bit clunky and you know, this is what 99% of GitHub projects look like. You'll 02:07:59.320 |
see the read me says to do, you know, complete baseline experiments, document, blah, blah, 02:08:06.880 |
blah. It's like, don't be that person. Like just do something properly and finish it and 02:08:13.080 |
maybe get some other people around you to work with you so that you're all doing it 02:08:22.360 |
What are the up and coming deep learning machine learning things that you are most excited 02:08:26.200 |
about? Also, you've mentioned last year that you are not a believer in reinforcement learning. 02:08:33.160 |
Yeah, I still feel exactly the same way as I did three years ago when we started this, 02:08:38.200 |
which is it's all about transfer learning. It's underappreciated. It's under researched. 02:08:44.200 |
Every time we put transfer learning into anything, we make it much better. You know, our academic 02:08:50.640 |
paper on transfer learning for NLP has, you know, helped be one piece of kind of changing 02:08:55.800 |
the direction of NLP this year. It's made it all the way to the New York Times, just 02:09:00.440 |
a stupid, obvious little thing that we threw together. So I remain excited about that. 02:09:06.160 |
I remain unexcited about reinforcement learning for most things. I don't see it used by normal 02:09:12.760 |
people for normal things, for nearly anything. It's an incredibly inefficient way to solve 02:09:17.720 |
problems which are often solved more simply and more quickly in other ways. It probably 02:09:22.480 |
has maybe a role in the world, but a limited one and not in most people's day-to-day work. 02:09:39.040 |
For someone planning to take part two in 2019, what would you recommend doing learning practicing 02:09:48.200 |
Just code. Yeah, just code all the time. I know it's perfectly possible I hear from people 02:09:53.000 |
who get to this point of the course and they haven't actually written any code yet. And 02:09:56.920 |
if that's you, it's okay. You know, you just go through and do it again and this time do 02:10:01.920 |
code and look at the shapes of your inputs and look at your outputs and make sure you 02:10:08.680 |
know how to grab a mini batch and look at its main and standard deviation and plot it. 02:10:15.080 |
There's so much material that we've covered. If you can get to a point where you can rebuild 02:10:23.960 |
those notebooks from scratch without too much cheating, when I say from scratch, I mean 02:10:30.960 |
using the fastai library, not from scratch from scratch, you'll be in the top echelon 02:10:38.400 |
of practitioners because you'll be able to do all of these things yourself and that's 02:10:41.800 |
really, really rare. And that'll put you in a great position for part two. Should we do one more? 02:10:48.240 |
It's nine o'clock. We always do one more. Where do you see the fast AI library going in the 02:10:56.040 |
Well, like I said, I don't make plans. I just piss around. So, I mean, our only plan for 02:11:05.640 |
fast AI as an organization is to make deep learning accessible as a tool for normal people 02:11:15.400 |
to use for normal stuff. So, as long as we need to code, we failed at that. So, the big 02:11:22.000 |
goal, because 99.8% of the world can't code. So, the main goal would be to get to a point 02:11:30.120 |
where it's not a library but it's a piece of software that doesn't require code. It certainly 02:11:34.280 |
shouldn't require a goddamn lengthy, hard-working course like this one. So, I want to get rid 02:11:41.840 |
of the course. I want to get rid of the code. I want to make it so you can just do useful 02:11:46.160 |
stuff quickly and easily. So, that's maybe five years? Yeah, maybe longer. 02:11:52.640 |
All right. Well, I hope to see you all back here for part two. Thank you.