Lesson 11 (2019) - Data Block API, and generic optimizer
Chapters
0:00 Introduction
1:20 Batch norm
3:50 LSUV
10:00 ImageNet
12:15 New Data Sets
15:55 Question
18:15 Importing data
25:20 The purpose of deep learning
29:40 Getting the files
36:00 Split validation set
39:30 Labeling
53:50 Data Bunch
Well, welcome back, welcome to lesson 11, where we're going to be talking mainly about 00:00:13.200 |
I said we would be talking about fastai.audio, but that's going to be a little bit later 00:00:19.460 |
We haven't quite got to where I wanted to get to yet, so everything I said we'd talk 00:00:23.200 |
about last week, we will talk about, but it might take a few more lessons to get there 00:00:31.600 |
So this is kind of where we're up to, is the last little bit of our CNN, and specifically 00:00:39.440 |
these were the things we were going to dive into when we've done the first four of these. 00:00:51.040 |
So we're going to keep going through this process to try to create our state-of-the-art 00:01:00.000 |
The specific items that we're working with are images, but everything we've covered so 00:01:07.960 |
far is equally valid, equally used for tabular, collaborative filtering, and text, and pretty 00:01:20.340 |
Last week we talked about BatchNorm, and I just wanted to mention that at the end of 00:01:26.760 |
the BatchNorm notebook there's another bit called simplified running BatchNorm. 00:01:32.200 |
We talked a little bit about de-biasing last week, we'll talk about it more today, but 00:01:37.400 |
Stas Bekman pointed out something which is kind of obvious in hindsight, but I didn't 00:01:41.600 |
notice at the time, which is that we had sums divided by de-bias, and we had count divided 00:01:49.040 |
by de-bias, and then we go sum divided by count, and sum divided by de-bias divided 00:01:54.800 |
by count divided by de-bias, the two de-bias cancel each other out, so we can remove all 00:02:01.680 |
We're still going to cover de-biasing today for a different purpose, but actually we didn't 00:02:06.360 |
really need it for last week, so we can remove all the de-biasing and end up with something 00:02:16.320 |
That's the version that we're going to go with. 00:02:21.080 |
Also thanks to Tom Viehmann, who pointed out that the last step where we went subtract 00:02:30.440 |
mean divided by standard deviation multiplied by mults add adds, you can just rearrange 00:02:36.120 |
that into this form, mults divided by variances and adds minus means times that factor, and 00:02:46.200 |
if you do it this way, then you don't actually have to touch X until you've done all of those 00:02:54.200 |
things, and so that's going to end up being faster. 00:02:56.160 |
If you think through the broadcasting operations there, then you're doing a lot less computation 00:03:07.720 |
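As a rough illustration of that rearrangement (a minimal sketch with made-up names, not the notebook's actual BatchNorm code), the idea is to do the arithmetic on the small per-channel statistics first and only touch the big activation tensor once:

```python
import torch

def bn_naive(x, means, vars_, mults, adds, eps=1e-5):
    # four broadcast ops, each touching the full activation tensor x
    return (x - means) / (vars_ + eps).sqrt() * mults + adds

def bn_rearranged(x, means, vars_, mults, adds, eps=1e-5):
    # do the arithmetic on the small per-channel stats first,
    # then touch x only once: x * factor + offset
    factor = mults / (vars_ + eps).sqrt()
    offset = adds - means * factor
    return x * factor + offset

x = torch.randn(64, 8, 28, 28)
means = x.mean((0, 2, 3), keepdim=True)
vars_ = x.var((0, 2, 3), keepdim=True)
mults = torch.ones(1, 8, 1, 1)
adds  = torch.zeros(1, 8, 1, 1)
assert torch.allclose(bn_naive(x, means, vars_, mults, adds),
                      bn_rearranged(x, means, vars_, mults, adds), atol=1e-5)
```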
And I should mention Tom has been helping us out quite a bit with some of this batch 00:03:12.640 |
norm stuff, one of the few people involved in the PyTorch community who's been amazingly 00:03:20.400 |
So thanks also to Soumith Chintala, who was one of the original founders of PyTorch, who's 00:03:27.600 |
been super helpful in sorting some things out for this course, and also Francisco Massa, 00:03:36.240 |
They're both part of the official Facebook engineering team, and Tom's not, but he does 00:03:42.040 |
so much work for PyTorch, he kind of seems like he must be sometimes. 00:03:50.040 |
Okay, before we moved on to data blocks, I wanted to mention one other approach to making 00:03:59.120 |
sure that your model trains nicely, and to me this is the most fast AI-ish method. 00:04:12.280 |
Wonderful researcher named Dmytro Mishkin came up with it in a paper called All You Need Is 00:04:16.560 |
a Good Init, and this is the paper. 00:04:24.600 |
And he came up with this technique called LSUV, Layer-wise Sequential Unit Variance. 00:04:28.800 |
So the basic idea is this, you've seen now how fiddly it is to get your unit variances 00:04:36.400 |
all the way through your network, and little things can change that. 00:04:42.120 |
So if you change your activation function, or something we haven't mentioned, if you 00:04:46.280 |
add dropout, or change the amount of dropout, these are all going to impact the variances 00:04:53.560 |
of your layer outputs, and if they're just a little bit different to one, you'll get 00:04:58.900 |
exponentially worse as we saw through the model. 00:05:03.360 |
So the normal approach to fixing this is to think really carefully about your architecture, 00:05:08.240 |
and exactly, analytically, figure out how to initialize everything so it works. 00:05:12.720 |
And Dmytro's idea, which I like a lot better, is let the computer figure it out, and here's 00:05:20.960 |
We create our MNIST data set in the same way as before, we create a bunch of layers with 00:05:25.040 |
these number of filters like before, and what I'm going to do is I'm going to create a conv 00:05:29.680 |
layer class which contains our convolution and our relu, and the idea is that we're going 00:05:36.240 |
to use this because now we can basically say this whole kind of combined conv plus relu 00:05:41.400 |
has kind of a, I'm calling it bias, but actually I'm taking that GeneralRelu and just saying 00:05:51.480 |
So this is kind of like something we can add or remove. 00:05:56.720 |
And then the weight is just the conv weights. 00:05:58.960 |
And you'll see why we're doing this in a moment. 00:06:01.220 |
Basically what we'll do is we'll create our learner in the usual way, and however it initializes 00:06:09.380 |
And so we can train it, that's fine, but let's try and now train it in a better way. 00:06:16.920 |
So let's recreate our learner, and let's grab a single minibatch. 00:06:25.000 |
And here's a function that will let us grab a single minibatch, making sure we're using 00:06:28.160 |
all our callbacks that the minibatch does all the things we needed to do. 00:06:37.680 |
And what we're going to do is we're going to find all of the modules that are, we're 00:06:45.120 |
going to find all of the modules which are of type conv layer. 00:06:51.460 |
And so it's just a little function that does that. 00:06:55.360 |
And generally speaking, when you're working with PyTorch modules or with neural nets more 00:07:00.800 |
generally, you need to use recursion a lot because modules can contain modules, can contain 00:07:06.000 |
So you can see here, find modules, calls find modules to find out all the modules throughout 00:07:11.400 |
your kind of tree, because really a module is like a tree. 00:07:17.960 |
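A minimal sketch of that recursive search (not necessarily the notebook's exact signature):

```python
import torch.nn as nn

def find_modules(module, cond):
    # recurse through the tree of children, collecting every module
    # for which cond(m) is True
    found = [module] if cond(module) else []
    for child in module.children():
        found += find_modules(child, cond)
    return found

# e.g. grab every Conv2d in a model
model = nn.Sequential(nn.Conv2d(3, 8, 3),
                      nn.Sequential(nn.Conv2d(8, 16, 3), nn.ReLU()))
convs = find_modules(model, lambda m: isinstance(m, nn.Conv2d))
assert len(convs) == 2
```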
And so here's our list of all of our conv layers. 00:07:20.820 |
And then what we do is we create a hook, right? 00:07:25.680 |
And the hook is just going to grab the mean and standard deviation of a particular module. 00:07:31.200 |
And so we can first of all just print those out. 00:07:33.640 |
And we can see that the means and standard deviations are not zero one. 00:07:38.840 |
The means are too high, as we know, because we've got the ReLUs. 00:07:49.460 |
So rather than coming up with our perfect init, instead, we just create a loop. 00:07:57.740 |
And the loop calls the model, passing in that mini-batch we have, right? 00:08:03.360 |
And remember, this is -- so first of all, we hook it, right? 00:08:12.800 |
We check whether the mean, the absolute value of the mean is close to zero. 00:08:17.800 |
And if it's not, we subtract the mean from the bias. 00:08:21.840 |
And so it just keeps looping through, calling the model again with the hook, subtracting 00:08:30.160 |
And then we do the same thing for the standard deviation. 00:08:32.960 |
Keep checking whether standard deviation minus one is nearly zero. 00:08:37.240 |
And as long as it isn't, we'll keep dividing by the standard deviation. 00:08:42.100 |
And so those two loops, if we run this function, then it's going to eventually give us what 00:08:57.760 |
Because we do the means first and then the standard deviations, and the standard deviation 00:09:03.640 |
But you can see our standard deviations are perfectly one. 00:09:13.040 |
And this is how, without thinking at all, you can initialize any neural network pretty 00:09:20.440 |
much to get the unit variance all the way through. 00:09:24.440 |
And this is much easier than having to think about whether you've got ReLU or ELU or whether 00:09:36.720 |
Yeah, and then we can train it, and it trains very nicely. 00:09:40.680 |
Particularly useful for complex and deeper architectures. 00:09:44.400 |
So there's kind of, for me, the fast AI approach to initializing your neural nets, which is 00:09:54.920 |
Just a simple little for loop, or in this case, a while loop. 00:09:59.600 |
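Here's a rough sketch of that LSUV loop, not the notebook's exact code. It assumes each hooked layer exposes a `bias` applied after its activation and a `weight` (the way the ConvLayer in the notebook does); `find_modules` and `ConvLayer` in the usage comment are the names used above and are assumptions here.

```python
import torch

def lsuv_layer(model, layer, xb, tol=1e-3, max_iters=50):
    """Run the mini-batch through the model, read this layer's output stats
    via a hook, and nudge its bias/weights until the output has
    roughly zero mean and unit std."""
    stats = {}
    def hook(mod, inp, out):
        stats['mean'], stats['std'] = out.mean().item(), out.std().item()

    h = layer.register_forward_hook(hook)
    with torch.no_grad():
        for _ in range(max_iters):                    # cap the loop, just in case
            model(xb)
            if abs(stats['mean']) < tol: break
            layer.bias -= stats['mean']               # shift the output mean towards 0
        for _ in range(max_iters):
            model(xb)
            if abs(stats['std'] - 1) < tol: break
            layer.weight.data /= stats['std']         # scale the output std towards 1
    h.remove()
    return stats['mean'], stats['std']

# applied to every conv layer found earlier, e.g.:
# for l in find_modules(model, lambda m: isinstance(m, ConvLayer)): lsuv_layer(model, l, xb)
```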
So I think we've done enough with MNIST because we're getting really good results. 00:10:13.960 |
Well, we're not quite ready to try ImageNet because ImageNet takes quite a lot of time. 00:10:18.880 |
You know, a few days if you've got just one GPU to train. 00:10:22.460 |
And that's really frustrating and an expensive way to try to practice things or learn things 00:10:30.640 |
I kept finding this problem of not knowing what data set I should try for my research 00:10:38.960 |
You know, it seemed like at one end there was MNIST, which is kind of too easy. 00:10:41.920 |
There was CIFAR-10 that a lot of people use, but these are 32 by 32 pixel images. 00:10:47.760 |
And it turns out, and this is something I haven't seen really well written about, but 00:10:51.960 |
our research clearly shows, it turns out that small images, 32 by 32, have very different 00:11:00.920 |
And specifically, it seems like once you get beneath about 96 by 96, things behave really 00:11:06.600 |
So stuff that works well on CIFAR-10 tends not to work well on normal sized images. 00:11:14.840 |
And stuff that tends to work well on CIFAR-10 doesn't necessarily work well on ImageNet. 00:11:20.440 |
There's this kind of gap of like something with normal sized images, which I can train 00:11:27.520 |
in a sane amount of time, but also gives me a good sense of whether something's going 00:11:33.880 |
And actually, Dmytro, who wrote that LSUV paper we just looked at, also had a fantastic 00:11:40.960 |
paper called something like Systematic Evaluation of CNN Advances on the ImageNet. 00:11:48.120 |
And he noticed that if you use 128 by 128 images with ImageNet, then the kind of things 00:11:55.680 |
that he found works well or doesn't work well, all of those discoveries applied equally well 00:12:05.120 |
128 by 128 for 1.3 million images, still too long. 00:12:09.760 |
So I thought that was a good step, but I wanted to go even further. 00:12:17.120 |
And my two new data sets are subsets of ImageNet. 00:12:21.760 |
And there's kind of like multiple versions in here, it really is, but they're both subsets 00:12:27.200 |
They both contain just 10 classes out of 1,000. 00:12:30.160 |
So they're 1/100 of the number of images of ImageNet. 00:12:34.160 |
And I create a number of versions, full size, 320 pixel size, and 160 pixel size. 00:12:41.160 |
One data set is specifically designed to be easy. 00:12:44.360 |
It contains 10 classes that are all very different to each other. 00:12:51.720 |
I thought, well, what if I create this data set, then maybe I could train it for like 00:12:55.480 |
just an epoch or two, like just a couple of minutes and see whether something was going 00:13:01.720 |
And then the second one I created was one designed to be hard, which is 10 categories 00:13:08.280 |
that are designed to be very similar to each other, so they're all dog breeds. 00:13:12.100 |
So the first data set is called Imagenette, which is very French, as you can hear. 00:13:19.080 |
And there's some helpful pronunciation tips here. 00:13:26.200 |
And you can see here I've created a leaderboard for Imagenette and for Imagewoof. 00:13:34.600 |
And I've discovered that in my very quick experiments with this, the exact observations 00:13:40.440 |
I find about what works well for the full ImageNet, also I see the same results here. 00:13:46.800 |
And it's also fascinating to see how some things are the same between the two data sets 00:13:55.940 |
And I found working with these two data sets has given me more insight into computer vision 00:14:03.200 |
model training than anything else that I've done. 00:14:09.600 |
And I really wanted to mention this to say, a big part of getting good at using deep learning 00:14:15.840 |
in your domain is knowing how to create like small, workable, useful data sets. 00:14:23.280 |
So once I decided to make this, it took me about three hours. 00:14:26.120 |
Like, it's not at all hard to create a data set, it's a quick little Python script to 00:14:33.080 |
How did I decide which 10 things, I just looked at a list of categories and picked 10 things 00:14:39.920 |
How did I decide to pick these things, I just looked at 10 things that I knew are dogs. 00:14:46.440 |
So it's like just a case of like, throw something together, get it working, and then on your 00:14:52.200 |
domain area, whether it's audio or Sanskrit texts or whatever, or genomic sequences, try 00:14:59.920 |
to come up with your version of a toy problem or two which you hope might give insight into 00:15:09.640 |
And if you're interested in computer vision, I would strongly recommend trying this out. 00:15:15.600 |
Because trying to beat me, and these are not great, they're just okay, but trying to beat 00:15:19.800 |
me will give you a sense of whether the things you're thinking about are in the ballpark 00:15:26.480 |
of what a moderately competent practitioner is able to do in a small amount of time. 00:15:32.920 |
It's also interesting to see that with like a 1/100th the size of ImageNet, like a tiny 00:15:39.600 |
data set, I was able to create a 90% accurate dog breed classifier from random weights. 00:15:45.980 |
So like you can do a lot pretty quickly without much data, even if you don't have transfer 00:15:58.080 |
So before we look at the data set, let's do the question. 00:16:03.000 |
>> So just to confirm, LSUV is something you run on all the layers once at the beginning, 00:16:19.760 |
So you'd run it once at the start of training to initialize your weights, just so that that 00:16:25.120 |
initial set of steps gives you sensible gradients, because it's those first few mini batches 00:16:32.200 |
Remember how we saw that if we didn't have a very good first few mini batches that we 00:16:37.080 |
ended up with 90% of the weights being, 90% of the activations being inactive. 00:16:43.320 |
So that's why we want to make sure we start well. 00:16:47.200 |
And yeah, if you've got a small mini batch, just run five mini batches and take the mean. 00:16:51.960 |
There's nothing special about the one mini batch, it's just a fast way to do the computation. 00:16:56.280 |
It's not like we're doing any gradient descent or anything. 00:17:05.160 |
So ImageNet is too big to read it all into RAM at once. 00:17:18.240 |
So we're going to need to be able to read it in one image at a time, which is going 00:17:22.120 |
to be true of most of our deep learning projects. 00:17:24.800 |
So we need some way to do that from scratch, because that's the rules. 00:17:33.360 |
And in the process, we're going to end up building a data block API, which you're all 00:17:39.160 |
But most people using the data block API feel familiar enough with it to do small tweaks 00:17:45.880 |
for things that they kind of know they can do. 00:17:48.160 |
But most people I speak to don't know how to really change what's going on. 00:17:54.080 |
So by the end of this notebook, you'll see how incredibly simple the data block API is. 00:17:59.680 |
And you'll be able to either write your own, maybe based on this one, or modify the one 00:18:04.320 |
in fast.ai, because this is a very direct translation of the one that's in fast.ai. 00:18:15.720 |
So the first thing to do is to read in our data. 00:18:19.760 |
And we'll see a similar thing when we build fastai.audio. 00:18:24.320 |
But whatever process you use, you're going to have to find some library that can read 00:18:32.320 |
And there's a library called PIL, or Pillow, Python Imaging Library, which can read images. 00:18:44.000 |
And we want to see what's inside our Imagenette data set. 00:18:51.160 |
Typing list(x.iterdir()) is far too complicated for me. 00:18:57.320 |
This is how easy it is to add stuff to the standard library. 00:19:00.640 |
You can just take the class and add a function to it. 00:19:07.920 |
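For example, the kind of monkey-patch meant here might look like this (a hedged sketch; `.ls` is just a convenience name):

```python
from pathlib import Path

# Patch an .ls() method onto Path so that path.ls() does
# what list(path.iterdir()) does.
Path.ls = lambda self: list(self.iterdir())

path = Path('.')
print(path.ls())
```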
So we've got a training and a validation directory. 00:19:11.840 |
Within validation, we have one directory for each category. 00:19:17.600 |
And then if we look at one category, we could grab one file name. 00:19:22.560 |
And if we look at one file name, we have a TENCH. 00:19:27.120 |
So if you want to know whether somebody is actually a deep learning practitioner, show 00:19:32.680 |
If they don't know it's a TENCH, they're lying to you, because this is the first category 00:19:37.240 |
So if you're ever using ImageNet, you know your TENCHs. 00:19:41.640 |
They're generally being held up by middle-aged men, or sometimes they're in nets. 00:19:48.520 |
That's pretty much how it always looks in ImageNet. 00:19:52.860 |
So that's why we have them in Imagenette too, because it's such a classic computer vision 00:20:00.480 |
We're cheating and importing NumPy for a moment, just so I can show you what an image contains, 00:20:05.640 |
just to turn it into an array so I can print it for you. 00:20:11.840 |
It contains integers between 0 and 255. 00:20:18.360 |
So this is what we get when we load up an image. 00:20:24.360 |
And it's got a geometry, and it's got a number of channels. 00:20:34.120 |
So we want to have some way to read in lots of images, which means we need to know what 00:20:40.520 |
images there are in this directory structure. 00:20:44.240 |
And in the full ImageNet, there's going to be 1.3 million of them. 00:20:49.120 |
So the first thing we need to know is which things are images. 00:20:54.720 |
Your computer already has a list of image extensions. 00:21:00.160 |
So you can query Python for your MIME types database for all of the images. 00:21:05.120 |
So here's a list of the image extensions that my computer knows about. 00:21:09.960 |
So now what I want to do is I want to loop through all the files in a directory and find 00:21:18.560 |
The fastest way to check whether something's in a list is to first of all turn it into 00:21:26.600 |
So Setify simply checks if it is a set, and if it is, it makes it one. 00:21:30.440 |
Otherwise, it first turns it into a list, and then turns it into a set. 00:21:37.000 |
And here's what I do when I build a little bit of functionality. 00:21:39.720 |
I just throw together a quick bunch of tests to make sure it seems to be roughly doing 00:21:45.320 |
And do you remember, in lesson one, we created our own test framework. 00:21:49.480 |
So we can now run any notebook as a test suite. 00:21:53.700 |
So it will automatically check if we break this at some point. 00:21:58.160 |
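A minimal version of `setify` (plus the `listify` helper it leans on), with a couple of the quick tests described:

```python
def listify(o):
    # a minimal version: None -> [], strings and non-iterables get wrapped in a list
    if o is None: return []
    if isinstance(o, (list, tuple)): return list(o)
    if isinstance(o, str): return [o]
    try: return list(o)
    except TypeError: return [o]

def setify(o):
    # already a set? leave it; otherwise listify, then turn it into a set
    return o if isinstance(o, set) else set(listify(o))

assert setify({'.jpg'}) == {'.jpg'}
assert setify('.jpg') == {'.jpg'}
assert setify(['.jpg', '.png', '.jpg']) == {'.jpg', '.png'}
```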
So now we need a way to go through a single directory and grab all of the images in that. 00:22:09.320 |
I always like to make sure that you can pass any of these things either a pathlib Path 00:22:16.280 |
or a string. So if you just say p = Path(p), if it's already a pathlib object, that doesn't do 00:22:22.840 |
So this is a nice, easy way to make sure that works. 00:22:25.620 |
So we just go through-- here's our pathlib object. 00:22:33.640 |
And so you'll see in a moment how we actually grab the list of files. 00:22:42.560 |
If it does, that's a Unix hidden file, or a Mac hidden file. 00:22:47.680 |
And we also check either they didn't ask for some particular extensions, or that the extension 00:22:55.320 |
So that will allow us to grab just the image files. 00:23:00.480 |
Python has something called scandir (os.scandir), which will grab a path and list all of the files 00:23:08.600 |
We go scandir, and then we go get_files, and it looks something like that. 00:23:20.560 |
And so this is something where we say, for some path, give me things with these extensions, 00:23:26.000 |
optionally recurse, optionally only include these folder names. 00:23:32.520 |
I will go through it in detail, but I'll just point out a couple of things, because being 00:23:34.880 |
able to rapidly look through files is important. 00:23:38.640 |
The first is that scandir is super, super fast. 00:23:46.720 |
So this is a really great way to quickly grab stuff for a single directory. 00:23:57.580 |
This is the thing that uses scandir internally to walk recursively through a folder tree. 00:24:04.560 |
And you can do cool stuff like change the list of directories that it's going to look 00:24:11.340 |
And it basically returns all the information that you need. 00:24:15.160 |
So os.walk and os.scandir are the things that you want to be using if you're playing with 00:24:19.880 |
directories and files in Python, and you need it to be fast. 00:24:28.180 |
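A simplified sketch of get_files along those lines (no include-folder filtering or hidden-directory pruning, so not the notebook's exact code):

```python
import os
from pathlib import Path

def _get_files(p, fs, extensions=None):
    p = Path(p)
    # skip hidden files; if extensions were given, keep only matching suffixes
    return [p/f for f in fs
            if not f.startswith('.')
            and (not extensions or f'.{f.split(".")[-1].lower()}' in extensions)]

def get_files(path, extensions=None, recurse=False):
    path = Path(path)
    extensions = {e.lower() for e in extensions} if extensions else None
    if recurse:
        res = []
        for dirpath, dirnames, filenames in os.walk(path):   # os.walk uses scandir internally
            res += _get_files(dirpath, filenames, extensions)
        return res
    # single directory: os.scandir is very fast
    fs = [o.name for o in os.scandir(path) if o.is_file()]
    return _get_files(path, fs, extensions)
```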
And so now we can say get files, path, tench, just the image extensions. 00:24:35.540 |
And then we're going to need recurse, because we've got a few levels of directory structure. 00:24:42.480 |
So if we try to get all of the file names, we have 13,000. 00:24:47.620 |
And specifically, it takes 70 milliseconds to get 13,000 file names. 00:24:55.800 |
For me to look at 13,000 files in Windows Explorer seems to take about four minutes. 00:25:06.120 |
So the full image net, which is 100 times bigger, it's going to be literally just a 00:25:11.620 |
So this gives you a sense of how incredibly fast these os.walk and scandir functions 00:25:23.720 |
I've often been confused as to whether the code Jeremy is writing in the notebooks or 00:25:27.600 |
functionality that will be integrated into the fast AI library, or whether the functions 00:25:32.520 |
and classes are meant to be written and used by the user interactively and on the fly. 00:25:41.600 |
Well I guess that's really a question about what's the purpose of this deep learning from 00:25:49.040 |
And different people will get different things out of it. 00:25:51.080 |
But for me, it's about demystifying what's going on so that you can take what's in your 00:26:01.980 |
And to do that would be always some combination of using things that are in existing libraries, 00:26:06.080 |
which might be fast AI or PyTorch or TensorFlow or whatever, and partly will be things that 00:26:12.400 |
And I don't want you to be in a situation where you say, well, that's not in fast AI, 00:26:19.880 |
So really the goal is, this is why it's also called impractical deep learning for coders, 00:26:25.640 |
is to give you the underlying expertise and tools that you need. 00:26:31.880 |
In practice, I would expect a lot of the stuff I'm showing you to end up in the fast AI library 00:26:38.560 |
because that's like literally I'm showing you my research, basically. 00:26:43.440 |
This is like my research journal of the last six months. 00:26:46.720 |
And that's what happens: I take our research and turn it into the fastai library. 00:46:53.580 |
And some of it, like this function, is pretty much copied and pasted from the existing fast 00:26:58.800 |
AI V1 code base because I spent at least a week figuring out how to make this fast. 00:27:06.520 |
I'm sure most people can do it faster, but I'm slow and it took me a long time, and this 00:27:12.200 |
So yeah, I mean, it's going to map pretty closely to what's in fast AI already. 00:27:20.840 |
Where things are new, we're telling you, like running batch norm is new, today we're going 00:27:27.560 |
But otherwise things are going to be pretty similar to what's in fast AI, so it'll make 00:27:35.400 |
And as fast AI changes, it's not going to surprise you because you'll know what's going 00:27:53.920 |
It's a little more awkward to work with because it doesn't try to do so much. 00:28:01.880 |
I suspect glob probably uses it behind the scenes. 00:28:08.080 |
Time it with glob, time it with Skander, probably depends how you use glob exactly. 00:28:12.720 |
But I remember I used to use glob and it was quite a bit slower. 00:28:16.520 |
And when I say quite a bit, you know, for those of you that have been using fast AI 00:28:22.040 |
for a while, you might have noticed that the speed at which you can grab the image net 00:28:27.340 |
folder is some orders of magnitude faster than it used to be. 00:28:34.080 |
Okay. So the reason that fast AI has a data blocks API and nobody else does is because 00:28:44.400 |
I got so frustrated in the last course at having to create every possible combination 00:28:49.200 |
of independent and dependent variable that I actually sat back for a while and did some 00:28:56.480 |
And specifically, this is what I did: I sat back and wrote this down, it's like 00:29:06.480 |
We need some way to split the validation set or multiple validation sets out, some way 00:29:11.640 |
to do labeling, optionally some augmentation, transform it to a tensor, make it into data, 00:29:18.440 |
into batches, optionally transform the batches, and then combine the data loaders together 00:29:25.520 |
into a data bunch and optionally add a test set. 00:29:29.080 |
And so when I wrote it down like that, I just went ahead and implemented an API for each 00:29:33.940 |
of those things to say like, okay, you can plug in anything you like to that part of 00:29:41.320 |
So we've already got the basic functionality to get the files. 00:29:48.120 |
We already created that list container, right? 00:29:51.560 |
So we basically can just dump our files into a list container. 00:29:55.680 |
But in the end, what we actually want is an image list for this one, right? 00:30:01.480 |
And an image list, when you call, so we're going to have this get method and when you 00:30:05.720 |
get something from the image list, it should open the image. 00:30:13.600 |
But we could get all kinds of different objects. 00:30:16.800 |
So therefore, we have this superclass, which has a get method that you override. 00:30:22.640 |
And by default, it just returns whatever you put in there, which in this case would be 00:30:29.320 |
So this is basically all item list does, right? 00:30:34.880 |
So in this case, it's going to be our file names, the path that they came from. 00:30:40.160 |
And then optionally also, there could be a list of transforms, right? 00:30:46.960 |
And we'll look at this in more detail in a moment. 00:30:48.800 |
But basically, what will happen is, when you index into your item list, remember, dunder getitem 00:30:55.200 |
(__getitem__) does that, we'll pass that back up to list container's __getitem__, and that will return 00:31:05.400 |
And if it's a single item, we'll just call self._get. 00:31:09.320 |
If it's a list of items, we'll call self._get on all of them. 00:31:13.440 |
And what that's going to do is it's going to call the get method, which in the case 00:31:25.120 |
So for those of you that haven't done any kind of more functional-style programming, 00:31:29.720 |
compose is just a concept that says, go through a list of functions and call the function 00:31:40.720 |
and replace myself with a result of that, and then call the next function and replace 00:31:48.400 |
So in other words, a deep neural network is just a composition of functions. 00:31:56.840 |
This compose does a little bit more than most composers. 00:32:00.040 |
Specifically, you can optionally say, I want to order them in some way. 00:32:05.800 |
And it checks to see whether the things have an underscore order key and sorts them. 00:32:12.040 |
And also, you could pass in some keyword arguments. 00:32:14.560 |
And if you do, it'll just keep passing in those keyword arguments. 00:32:17.480 |
But it's basically other than that, it's a pretty standard function composition function. 00:32:22.760 |
If you haven't seen compose used elsewhere in programming before, Google that because 00:32:33.800 |
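A minimal compose along the lines described, including the optional `_order` sorting:

```python
def compose(x, funcs, *args, order_key='_order', **kwargs):
    """Apply each function in `funcs` to x in turn (sorted by an optional
    `_order` attribute), passing any extra keyword arguments along."""
    if funcs is None: funcs = []
    if not isinstance(funcs, (list, tuple)): funcs = [funcs]
    key = lambda o: getattr(o, order_key, 0)
    for f in sorted(funcs, key=key):
        x = f(x, *args, **kwargs)
    return x

# e.g. composing two simple transforms, ordered by _order
add_one = lambda x: x + 1; add_one._order = 0
double  = lambda x: x * 2; double._order  = 1
assert compose(3, [double, add_one]) == 8   # (3 + 1) * 2
```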
And as you can see in this case, it means I can just pass in a list of transforms. 00:32:37.840 |
And this will simply call each of those transforms in turn modifying, in this case, the image 00:32:49.120 |
And then here's a method to create an image list from a path. 00:32:58.060 |
And then that's going to give us a list of files, which we will then pass to the class 00:33:02.800 |
constructor which expects a list of files or a list of something. 00:33:07.720 |
So this is basically the same as item list in fastai version 1. 00:33:17.240 |
It's just a list where when you try to index into it, it will call something which subclasses 00:33:33.160 |
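Putting those pieces together, a stripped-down sketch of the item list idea (indexing applies get plus the ordered transforms; boolean-mask indexing and the other conveniences are left out):

```python
from pathlib import Path
from PIL import Image

class ItemList:
    """A list of items where indexing applies get() plus the
    (order-sorted) transforms to whatever comes back."""
    def __init__(self, items, path='.', tfms=None):
        self.items, self.path, self.tfms = list(items), Path(path), tfms or []

    def __len__(self):  return len(self.items)
    def __repr__(self): return f'{self.__class__.__name__} ({len(self)} items)'

    def new(self, items):
        # same subclass, same path, same transforms -- used later when splitting
        return self.__class__(items, self.path, self.tfms)

    def get(self, item): return item                   # subclasses override this

    def _get(self, item):
        item = self.get(item)
        for t in sorted(self.tfms, key=lambda o: getattr(o, '_order', 0)):
            item = t(item)                             # apply the transforms in order
        return item

    def __getitem__(self, idx):
        if isinstance(idx, slice): return [self._get(o) for o in self.items[idx]]
        if isinstance(idx, (list, tuple)): return [self._get(self.items[i]) for i in idx]
        return self._get(self.items[idx])

class ImageList(ItemList):
    def get(self, fn): return Image.open(fn)           # open the image lazily
```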
Now one thing that happens all the time is you try to create a mini batch of images. 00:33:42.860 |
And when pillow opens up a black and white image, it gives you back by default a rank-2 (single-channel) image. 00:33:55.140 |
And then you can't stack them into a mini batch because they're not all the same shape. 00:34:01.000 |
So what you can do is you can call Pillow's convert('RGB'). 00:34:07.680 |
And if something's not RGB, it'll turn it into RGB. 00:34:15.600 |
So a transform is just a class with an underscore order. 00:34:19.120 |
And then make RGB is a transform that when you call it will call convert. 00:34:33.720 |
You can have a dunder call (__call__) or you can have a function. 00:34:43.840 |
And so if we create a image list from files using our path and pass in that transform, 00:34:54.320 |
And remember that item list inherits from list container, which we gave a dunder repr. 00:35:03.920 |
This is why we create these little convenient things to subclass from because we get all 00:35:10.840 |
So we can now see that we've got 13,000 items. 00:35:19.520 |
And when we index into it, it calls get and get calls image dot open and pillow automatically 00:35:42.720 |
And because we're using the functionality that we wrote last time for list container, 00:35:47.640 |
we can also index with a list of booleans, with a slice, with a list of ints and so forth. 00:35:55.120 |
So here's a slice containing one item, for instance. 00:36:06.600 |
So to do that, we look and we see here's a path. 00:36:21.960 |
So let's create a function called grandparent splitter that grabs the grandparent's name. 00:36:27.540 |
And you call it, telling it the name of your validation set and the name of your training 00:36:30.960 |
set, and it returns true if it's the validation set or false if it's the training set or none 00:36:41.720 |
And so here's something that will create a mask: you pass it some function. 00:36:49.080 |
So we're going to be using grandparent splitter. 00:36:51.440 |
And it will just grab all the things where that mask is false, which is the training set, 00:36:57.440 |
and the things where it's true, which is the validation set, and return them. 00:37:04.940 |
So here's a splitter that splits on grandparents. 00:37:08.080 |
And where the validation name is val, because that's what it is for Imagenette. 00:37:17.800 |
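A minimal sketch of the grandparent splitter and the split-by-function helper (False means training set, True means validation set, None gets dropped):

```python
from functools import partial
from pathlib import Path

def grandparent_splitter(fname, valid_name='val', train_name='train'):
    # .../train/<class>/<file> or .../val/<class>/<file>:
    # the grandparent directory name says which split the file belongs to
    gp = Path(fname).parent.parent.name
    if gp == valid_name: return True
    if gp == train_name: return False
    return None

def split_by_func(items, f):
    mask = [f(o) for o in items]
    train = [o for o, m in zip(items, mask) if m is False]   # False -> training set
    valid = [o for o, m in zip(items, mask) if m is True]    # True  -> validation set
    return train, valid

splitter = partial(grandparent_splitter, valid_name='val')
train, valid = split_by_func(
    ['imagenette/train/n01440764/a.jpg', 'imagenette/val/n01440764/b.jpg'], splitter)
assert (len(train), len(valid)) == (1, 1)
```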
We've now got a validation set with 500 things and a training set with 12,800 things. 00:37:28.360 |
So split data object is just something with a training set and a validation set. 00:37:41.240 |
Everything else from here is just convenience. 00:37:44.160 |
So we'll give it a representation so that you can print it. 00:37:48.480 |
We'll define dunder getattr (__getattr__) so that if you pass it some attribute that it doesn't 00:37:52.320 |
know about, it will grab it from the training set. 00:37:56.860 |
And then let's add a split by func method that just calls that split by func thing we 00:38:06.760 |
There's one trick here, though, which is we want split by func to return item lists 00:38:23.560 |
And that's why in our item list, we defined something called new. 00:38:35.480 |
PyTorch has the concept of a new method as well. 00:38:39.400 |
It says, all right, let's look at this object. 00:38:42.240 |
Let's see what class it is, because it might not be item list, right? 00:38:45.200 |
It might be image list or some other subclass. 00:38:49.880 |
And this is now the constructor for that class. 00:38:53.000 |
And let's just pass it in the items that we asked for. 00:38:56.720 |
And then pass in our path and our transforms. 00:38:59.680 |
So new is going to create a new item list of the same type with the same path and the 00:39:08.360 |
And so that's why this is now going to give us a training set and a validation set with 00:39:17.680 |
the same path, the same transforms, and the same type. 00:39:20.640 |
And so if we call split data split by func, now you can see we've got our training set 00:39:30.480 |
So next in our list of things to do is labeling. 00:39:41.120 |
And the reason it's tricky is because we need processes. 00:39:45.720 |
Processes are things which are first applied to the training set. 00:39:49.760 |
They get some state and then they get applied to the validation set. 00:39:54.720 |
For example, our labels should not be tench and French horn. 00:40:02.440 |
They should be like zero and two because when we go to do a cross entropy loss, we expect 00:40:12.680 |
So we need to be able to map tench to zero or French horn to two. 00:40:18.480 |
We need the training set to have the same mapping as the validation set. 00:40:22.960 |
And for any inference we do in the future, it's going to have the same mapping as well. 00:40:26.860 |
Because otherwise, the different data sets are going to be talking about completely different 00:40:32.160 |
things when they see the number zero, for instance. 00:40:35.680 |
So we're going to create something called a vocab. 00:40:38.700 |
And a vocab is just the list saying these are our classes and this is the order they're 00:40:44.560 |
Zero is tench, one is golf ball, two is French horn, and so forth. 00:40:50.500 |
So we're going to create the vocab from the training set. 00:40:54.400 |
And then we're going to convert all those strings into ints using the vocab. 00:40:59.880 |
And then we're going to do the same thing for the validation set, but we'll use the 00:41:06.040 |
So that's an example of a processor that converts label strings to numbers in a consistent and 00:41:14.520 |
Other things we could do would be processing texts to tokenize them and then numericalize 00:41:21.040 |
Numericalizing them is a lot like converting the label strings to numbers. 00:41:24.560 |
Or taking tabular data and filling the missing values with the median computed on the training 00:41:31.680 |
So most things we do in this labeling process is going to require some kind of processor. 00:41:40.680 |
So in our case, we want a processor that can convert label strings to numbers. 00:41:44.960 |
So the first thing we need to know is what are all of the possible labels. 00:41:48.880 |
And so therefore we need to know all the possible unique things in a list. 00:41:53.020 |
So here's some list, here's something that uniquifies them. 00:41:57.000 |
So that's how we can get all the unique values of something. 00:42:03.040 |
So now that we've got that, we can create a processor. 00:42:06.400 |
And a processor is just something that can process some items. 00:42:14.240 |
And this is the thing that's going to create our list of all of the possible categories. 00:42:19.880 |
So basically when you say process, we're going to see if there's a vocab yet. 00:42:25.640 |
And if there's not, this must be the training set. 00:42:30.200 |
And it's just the unique values of all the items. 00:42:34.200 |
And then we'll create the thing that goes not from int to object, but goes from object 00:42:40.320 |
So we just enumerate the vocabulary and create a dictionary with the reverse mapping. 00:42:46.120 |
So now that we have a vocab, we can then go through all the items and process one of them 00:42:54.200 |
And process one of them simply means look in that reverse mapping. 00:42:59.840 |
We could also deprocess, which would take a bunch of indexes. 00:43:03.040 |
We would use this, for example, to print out the inferences that we're doing. 00:43:09.060 |
So we better make sure we get a vocab by now, otherwise we can't do anything. 00:43:13.080 |
And then we just deprocess one for each index. 00:43:17.240 |
And deprocess one just looks it up in the vocab. 00:43:23.800 |
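A minimal category processor along those lines: the first call (the training set) builds the vocab, every later call reuses it. This is a sketch, not fastai's exact class:

```python
def uniqueify(x):
    # unique values, keeping first-seen order
    return list(dict.fromkeys(x))

class CategoryProcessor:
    def __init__(self): self.vocab = None

    def __call__(self, items):
        if self.vocab is None:                                 # first call = training set
            self.vocab = uniqueify(items)
            self.otoi = {v: i for i, v in enumerate(self.vocab)}   # object -> int
        return [self.proc1(o) for o in items]

    def proc1(self, item):     return self.otoi[item]
    def deproc1(self, idx):    return self.vocab[idx]
    def deprocess(self, idxs): return [self.deproc1(i) for i in idxs]

proc = CategoryProcessor()
train_y = proc(['tench', 'golf ball', 'tench', 'French horn'])   # builds the vocab
valid_y = proc(['French horn', 'tench'])                         # reuses it
assert proc.deprocess(valid_y) == ['French horn', 'tench']
```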
And so with this, we can now combine it all together. 00:43:29.560 |
And it's just a list container that contains a processor. 00:43:34.600 |
And the items in it, whatever we were given after being processed. 00:43:41.200 |
And so then, as well as being able to index in it to grab those processed items, we'll 00:43:49.120 |
And that's just the thing that's going to deprocess the items again. 00:43:54.720 |
So that's all the stuff we need to label things. 00:44:01.200 |
So we already know that for splitting, we needed the grandparent. 00:44:18.240 |
And here is something which labels things using a function. 00:44:26.880 |
And so here is our class, and we're going to have to pass it some independent variable 00:44:32.800 |
and some dependent variable and store them away. 00:44:36.920 |
And then we need an indexer to grab the x and grab the y at those indexes. 00:44:48.960 |
And then we'll just add something just like we did before, which does the labeling. 00:44:54.840 |
And passes those to a processed item list to grab the labels. 00:45:00.880 |
And then passes the inputs and outputs to our constructor to give us our label data. 00:45:11.360 |
So with that, we have a label by function where we can create our category processor. 00:45:19.900 |
We can label the validation set, and we can return the result, the split data result. 00:45:25.760 |
So the main thing to notice here is that when we say train equals labeled data dot label 00:45:33.240 |
passing in this processor, this processor has no vocab. 00:45:37.480 |
So it goes to that bit we saw that says, oh, there's no vocab. 00:45:40.560 |
So let's create a list of all the unique possibilities. 00:45:43.920 |
On the other hand, when it goes to the validation set, proc now does have a vocab. 00:45:49.320 |
So it will skip that step and use the training sets vocab. 00:45:55.680 |
People get mixed up by this all the time in machine learning and deep learning is like 00:46:00.440 |
very often when somebody says, my model's no better than random. 00:46:05.200 |
The most common reason is that they're using some kind of different mapping between their 00:46:13.300 |
So if you use a process like this, that's never going to happen because you're ensuring 00:46:24.000 |
So the details of the code are particularly important. 00:46:28.800 |
The important idea is that your labeling process needs to include some kind of processor idea. 00:46:38.360 |
And if you're doing this stuff manually, which basically every other machine learning and 00:46:44.600 |
deep learning framework does, you're asking for difficult to fix bugs because anytime 00:46:51.460 |
your computer's not doing something for you, it means you have to remember to do it yourself. 00:46:55.400 |
So whatever framework you're using, I don't think, I don't know if any other frameworks 00:47:03.600 |
So like create something like this for yourself so that you don't have that problem. 00:47:14.280 |
In the case of online streaming data, how do you deal with having new categories in 00:47:24.480 |
I mean, it happens all the time is you do inference either on your validation set or 00:47:29.760 |
test set or in production where you see something you haven't seen before. 00:47:39.320 |
For labels, it's less of a problem in inference because for inference, you don't have labels. 00:47:44.240 |
By definition, but you could certainly have that problem in your validation set. 00:47:50.360 |
So what I tend to like to do is if I have like some kind of, if I have something where 00:47:54.920 |
there's lots and lots of categories and some of them don't occur very often and I know 00:47:59.240 |
that in the future there might be new categories appearing, I'll take the few least common 00:48:06.280 |
and I'll group them together into a group called like other. 00:48:10.680 |
And that way I now have some way to ensure that my model can handle all these rare other 00:48:17.080 |
Something like that tends to work pretty well, but you do have to think of it ahead of time. 00:48:22.600 |
For many kinds of problems, you know that there's a fixed set of possibilities. 00:48:29.240 |
And if you know that it's not a fixed set, yeah, I would generally try to create an other 00:48:36.320 |
So make sure you train with some things in that other category, all right? 00:48:44.240 |
>> In the label data class, what is the class method decorator doing? 00:48:53.520 |
So I'll be quick because you can Google it, but basically this is the difference between 00:49:02.880 |
So you'll see that I'm not going to call this on an object of type label data, but I'm calling 00:49:17.000 |
So it's just a convenience, really, class methods. 00:49:19.880 |
The thing that they get passed in is the actual class that was requested. 00:49:23.980 |
So I could create a subclass of this and then ask for that subclass. 00:49:32.180 |
Pretty much every language supports class methods or something like it. 00:49:38.960 |
You can get away without them, but they're pretty convenient. 00:49:46.080 |
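A tiny illustration of why the classmethod matters (a hypothetical class, not fastai's actual LabeledData):

```python
class LabeledData:
    def __init__(self, x, y): self.x, self.y = x, y

    @classmethod
    def label_by_func(cls, items, f):
        # cls is whatever class this was called on, so a subclass gets
        # instances of itself back rather than plain LabeledData
        return cls(items, [f(o) for o in items])

ld = LabeledData.label_by_func(['train/tench/a.jpg', 'train/golf ball/b.jpg'],
                               lambda p: p.split('/')[1])
assert ld.y == ['tench', 'golf ball']
```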
So now we've got our labeled list, and if we print it out, it's got a training set and 00:49:52.920 |
a validation set, and each one has an X and a Y. 00:49:58.160 |
Our category items are a little less convenient than the FastAI version ones because the FastAI 00:50:04.760 |
ones will actually print out the name of each category. 00:50:08.600 |
We haven't done anything to make that happen. 00:50:10.200 |
So if we want the name of each category, we would actually have to refer to the .obj, which 00:50:17.040 |
you can see we're doing here, Y.obj or Y.obj with a slice. 00:50:24.240 |
So in FastAI version one, there's one extra thing we have, which is this concept of an 00:50:30.560 |
item base, and you can actually define things like category items that know how to print 00:50:37.400 |
Whether that convenience is worth the extra complexity is up to you if you're designing 00:50:46.880 |
So we still can't train a model with these because we have pillow objects. 00:50:54.800 |
So here's our labeled list, training set, zeroth object, and that has an X and a Y. 00:51:09.600 |
If they're all going to be in the batch together, they have to be the same size. 00:51:14.680 |
I mean, that's not a great way to do it, but it's a start. 00:51:26.560 |
And it has to be after all the other transforms we've seen so far because we want conversion 00:51:32.720 |
to RGB to happen beforehand, probably, stuff like that. 00:51:41.480 |
If you pass in an integer, we'll turn it into a tuple. 00:51:44.360 |
And when you call it, it'll call resize, and it'll do bilinear resizing for you. 00:51:53.600 |
Once you've turned them all into the same size, then we can turn them into tensors. 00:52:02.860 |
This is how TorchVision turns pillow objects into tensors. 00:52:14.920 |
And you see, there's two ways here of adding kind of class-level state or transform-level 00:52:23.440 |
This is really underused in Python, but it's super handy, right? 00:52:27.320 |
We just want to say, like, what's the order of the function? 00:52:32.400 |
And then that's turned it into a byte tensor. 00:52:38.440 |
And we don't want it to be between 0 and 255. 00:52:52.040 |
It doesn't matter what order they're in the array, because they're going to order them 00:53:03.440 |
Here's a little convenience to permute the order back again. 00:53:08.680 |
I don't know if you noticed this, but in to_byte_tensor, I had to permute(2, 0, 1), because 00:53:15.880 |
Pillow has the channel last, whereas PyTorch assumes the channel comes first. 00:53:22.200 |
So this is just going to pop the channel first. 00:53:24.560 |
So to print them out, we have to put the channel last again. 00:53:28.580 |
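A sketch of those three transforms; the byte-tensor conversion here goes through NumPy rather than the buffer trick TorchVision uses, but the effect (channel-first uint8, then floats 0-1) is the same:

```python
import numpy as np
import torch
from PIL import Image

try:    BILINEAR = Image.Resampling.BILINEAR     # Pillow >= 9.1
except AttributeError: BILINEAR = Image.BILINEAR

class Transform: _order = 0

class ResizeFixed(Transform):
    _order = 10
    def __init__(self, size):
        if isinstance(size, int): size = (size, size)   # int -> (width, height)
        self.size = size
    def __call__(self, item): return item.resize(self.size, BILINEAR)

def to_byte_tensor(item):
    # PIL image -> (h, w, c) uint8 array -> channel-first (c, h, w) tensor
    res = torch.from_numpy(np.array(item, dtype=np.uint8))
    return res.permute(2, 0, 1).contiguous()
to_byte_tensor._order = 20

def to_float_tensor(item):
    return item.float().div_(255.)   # bytes 0..255 -> floats 0..1
to_float_tensor._order = 30
```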
So now we can grab something from that list and show image. 00:53:34.760 |
And you can see that it is a torch tensor of this size. 00:53:40.520 |
So we now have tensors that are floats and all the same size. 00:53:49.640 |
We'll use the get data loaders function as we had before. 00:53:52.040 |
We can just pass in train and valid directly from our labeled list. 00:53:58.320 |
Let's grab a mini batch, and here it is, 64 by 3 by 128 by 128. 00:54:03.560 |
And we can have a look at it, and we can see the vocab for it. 00:54:20.480 |
And to make life even easier for the future, let's add two optional things, channels in 00:54:28.280 |
And that way any models that want to be automatically created can automatically create themselves 00:54:33.560 |
with the correct number of inputs and the correct number of outputs for our data set. 00:54:40.160 |
And let's add to our split data something called to_databunch, which is just this function. 00:54:54.880 |
So like in practice, in your actual module, you would go back and you would paste the 00:55:02.120 |
contents of this back into your split data definition. 00:55:05.880 |
But this is kind of a nice way when you're just iteratively building stuff. 00:55:10.880 |
You can not only monkey patch PyTorch things or standard library things, you can monkey patch your own classes too. 00:55:18.440 |
So here's how you can add something to a previous class when you realize later that you want 00:55:23.640 |
Okay, so let's go through and see what happens. 00:55:30.960 |
So here are all the steps, literally all the steps. 00:55:35.400 |
Grab the path, untar the data, grab the transforms, grab the item list, pass in the transforms, 00:55:35.400 |
split the data using the grandparent, using this validation name, label it using parent 00:55:50.320 |
labeler, and then turn it into a data bunch with this batch size, three channels in, ten 00:56:02.600 |
Here's our callback functions from last time. 00:56:09.420 |
In the past, we've normalized things that have had only one channel, being MNIST. 00:56:14.480 |
Now we've got three channels, so we need to make sure that we take the mean over the other 00:56:18.600 |
axes so that we get a three-channel mean and a three-channel standard deviation. 00:56:25.320 |
So let's define a function that normalizes things that are three channels. 00:56:33.360 |
So here's the mean and standard deviation of this Imagenette batch. 00:56:38.560 |
So here's a function called norm_imagenette, which we can use from now on to normalize 00:56:47.160 |
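A sketch of that three-channel normalization; the batch xb and the resulting stats here are placeholders, you'd compute them from your own data bunch:

```python
import torch
from functools import partial

def normalize_chan(x, mean, std):
    # broadcast per-channel (3,) stats over a (batch, 3, h, w) tensor
    return (x - mean[..., None, None]) / std[..., None, None]

# compute per-channel stats from one batch xb of shape (bs, 3, h, w)
# (this xb is a stand-in -- grab a real one from your data bunch)
xb = torch.rand(64, 3, 128, 128)
mean = xb.mean(dim=(0, 2, 3))
std  = xb.std(dim=(0, 2, 3))
norm_imagenette = partial(normalize_chan, mean=mean, std=std)
```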
So let's add that as a callback using the batch transform we built earlier. 00:56:54.360 |
We will create a ConvNet with this number of layers. 00:57:01.880 |
And here's the ConvNet, we're going to come back to that. 00:57:05.920 |
And then we will do our one-cycle scheduling using cosine one-cycle annealing, 00:57:11.720 |
pass that into our get_learn_run, and train. 00:57:20.720 |
And that's going to give us 72.6%, which if we look at the Imagenette leaderboard for 128 00:57:33.600 |
pixels for 5 epochs, the best is 84.6 so far. 00:57:44.120 |
So let's take a look and see what model we built, because it's kind of interesting. 00:57:48.360 |
It's a few interesting features of this model. 00:57:50.680 |
And we're going to be looking at these features quite a lot in the next two lessons. 00:57:56.760 |
The model knows how big its first layer has to start out because we pass in data, and 00:58:07.680 |
Already this is a model which you don't have to change its definition if you have hyperspectral 00:58:13.800 |
imaging with four channels, or you have black and white with one channel, or whatever. 00:58:24.720 |
Or I should say, what's the output of the first layer going to be? 00:58:34.300 |
Well, what we're going to do is we're going to say, well, our input has, we don't know, 00:58:46.280 |
But we do know that the first layer is going to be a three by three kernel, and then there's 00:58:50.600 |
going to be some number of channels, c_in channels, which in our case is three. 00:58:58.160 |
So as the convolution kernel kind of scrolls over the input image, at each time, the number 00:59:04.140 |
of things that it's multiplying together is going to be three by three by c_in. 00:59:13.800 |
So remember we talked about this last week, right? 00:59:16.280 |
We basically want to put that, we basically want to make sure that our first convolution 00:59:31.760 |
So if we're getting nine by c_in coming in, you wouldn't want more than that going out 00:59:42.900 |
So what I'm going to do is I'm going to say, okay, let's take that value, c_in by three by 00:59:49.800 |
three, and let's just look for the next largest number that's a power of two, and we'll use 01:00:01.200 |
And then I'll just go ahead and multiply by two for each of the next two layers. 01:00:05.380 |
So this way, we've got these vital first three layers are going to work out pretty well. 01:00:11.640 |
So back in the old days, we used to use five by five or seven by seven kernels, okay? 01:00:19.040 |
We'd have the first layer, would be one of those, but we know now that's not a good idea. 01:00:25.720 |
Still most people do it because people stick with what they know, but when you look at 01:00:30.680 |
the bag of tricks for image classification paper, which in turn refers to many previous 01:00:39.360 |
citations, many of which are state of the art and competition winning models, the message 01:00:47.040 |
Three by three kernels give you more bang for your buck. 01:00:50.960 |
You get deeper, you end up with the same receptive field. 01:00:54.120 |
It's faster because you've got less work going on, right? 01:00:58.180 |
And really, this goes all the way back to the classic Zeiler and Fergus paper that we've 01:01:03.540 |
looked at so many times over the years that we've been doing this course. 01:01:08.100 |
And even before that to the VGG paper, it really is three by three kernels everywhere. 01:01:16.000 |
So any place you see something that's not a three by three kernel, have a big think 01:01:25.680 |
Okay, so that's basically what we have for those critical first three layers. 01:01:32.520 |
That's where that initial feature representation is happening. 01:01:36.440 |
And then the rest of the layers is whatever we've asked for. 01:01:43.120 |
And so then we can build those layers up, just saying number of filters in to number 01:01:48.960 |
And then as usual, average pooling, flatten, and a linear layer to however many classes 01:02:03.200 |
It's very hard to, every time I write something like this, I break it the first 12 times. 01:02:10.960 |
And the only way to debug it is to see exactly what's going on. 01:02:14.840 |
To see exactly what's going on, you need to see that what module is there at each point 01:02:25.000 |
So that's why we've created this model summary. 01:02:28.360 |
So model summary's gonna use that get batch that we added in the LSUV notebook to grab 01:02:37.480 |
We will make sure that that batch is on the correct device. 01:02:42.960 |
We will use the find module thing that we used in the LSUV to find all of the places 01:02:52.440 |
If you said find all, otherwise we will grab just the immediate children. 01:03:00.680 |
We will grab a hook for every layer using the hooks that we made. 01:03:09.360 |
And the function that we've used for hooking simply prints out the module and the output 01:03:18.400 |
So that's how easy it is to create this wonderfully useful model summary. 01:03:23.320 |
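A minimal version of such a model summary (hooking just the immediate children rather than the find-all variant):

```python
import torch

def model_summary(model, xb, print_mod=False):
    """Hook every immediate child of the model and, for one forward pass
    on the batch xb, print (optionally) the module and its output shape."""
    def hook_fn(mod, inp, out):
        if print_mod: print(mod)
        print(list(out.shape))

    hooks = [m.register_forward_hook(hook_fn) for m in model.children()]
    try:
        with torch.no_grad(): model(xb)
    finally:
        for h in hooks: h.remove()

# usage (model and xb are whatever you built above): model_summary(model, xb)
```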
So to answer your question of earlier, another reason why are we doing this or what are you 01:03:29.880 |
meant to be getting out of it is to say you don't have to write much code to create really 01:03:40.400 |
So we've seen how to create like per-layer histogram viewers, how to create model summaries. 01:03:47.520 |
With the tools that you have at your disposal now, I really hope that you can dig inside 01:03:52.120 |
your models what they are and what they're doing. 01:03:59.080 |
So this hooks thing we have is just like super, super useful. 01:04:04.120 |
Now very grateful to the PyTorch team for adding this fantastic functionality. 01:04:11.160 |
The input is 128 because that's a batch size, 128 by 3 by 128 by 128. 01:04:19.120 |
And then we gradually go through these convolutions. 01:04:32.880 |
And you can see after each one they have a stride of two, get smaller and smaller. 01:04:36.900 |
And then an average pull that to a one by one. 01:04:43.460 |
So it's a really, it's like as basic a ConvNet as you could get. 01:04:50.480 |
It's just a bunch of three by three convs, ReLUs, and batch norms. 01:05:18.440 |
This is one of the bits I'm most excited about in this course actually. 01:05:23.320 |
But hopefully it's going to be like totally unexciting to you because it's just going 01:05:26.640 |
to be so obvious that you should do it this way. 01:05:28.880 |
But the reason I'm excited is that we're going to be talking about optimizers. 01:05:32.960 |
And anybody who's done work with kind of optimizers in deep learning in the past will know that 01:05:38.960 |
every library treats every optimizer as a totally different thing. 01:05:43.520 |
So there's an Adam optimizer, like in PyTorch there's an Adam optimizer and an SGD optimizer 01:05:52.520 |
And somebody comes along and says, hey, we've invented this thing called decoupled weight 01:05:57.800 |
decay, also known as AdamW. And the PyTorch folks go, oh, damn, what are we going to do? 01:06:04.400 |
And they have to add a parameter to every one of their optimizers and they have to change 01:06:08.640 |
And then somebody else comes along and says, oh, we've invented a thing called AMSGrad. 01:06:13.680 |
There's another parameter we have to put into any one of those optimizers. 01:06:16.240 |
And it's not just like inefficient and frustrating, but it holds back research because it starts 01:06:25.520 |
feeling like there are all these things called different kinds of optimizers, but there's 01:06:33.000 |
There's one optimizer and there's one optimizer in which you can inject different pieces of 01:06:39.120 |
behavior in a very, very small number of ways. 01:06:42.560 |
And what we're going to do is we're going to start with this generic optimizer and we're 01:06:49.400 |
This came out last week and it's a massive improvement as you see in what we can do with 01:06:59.320 |
This is the equation set that we're going to end up implementing from the paper. 01:07:07.880 |
And what if I told you that not only I think are we the first library to have this implemented, 01:07:15.800 |
but this is the total amount of code that we're going to write to do it. 01:07:26.240 |
So we're going to continue with Imagenette and we're going to continue with the basic 01:07:30.920 |
set of transforms we had before and the basic set of stuff to create our data bunch. 01:07:38.760 |
This is our model and this is something to pop it on CUDA to get our statistics written 01:07:46.120 |
out to do our batch transform with the normalization. 01:07:50.000 |
And so we're going to start here 52% after an epoch. 01:07:59.800 |
Now in PyTorch, the base thing called optimizer is just a dictionary that stores away some 01:08:08.600 |
hyperparameters and we've actually already used it. 01:08:17.680 |
We used something that is not part of our approved set of foundations without building 01:08:33.920 |
So we're going to go back and do it now, right? 01:08:37.000 |
Because the reason we did this is because we were using PyTorch's optim.Optimizer. 01:08:42.240 |
We've already built the kind of the main part of that, which is the thing that multiplies 01:08:46.960 |
the gradients by the learning rate and subtracts them from the parameters. 01:08:59.920 |
As always, we need something called zero grad, which is going to go through some parameters 01:09:04.640 |
and zero them out and also remove any gradient computation history. 01:09:10.020 |
And we're going to have a step function that does some kind of step. 01:09:13.720 |
The main difference here, though, is our step function isn't actually going to do anything. 01:09:20.460 |
It's going to use composition on some things that we pass on and ask them to do something. 01:09:26.440 |
So this optimizer is going to do nothing at all until we build on top of it. 01:09:33.400 |
But we're going to set it up to be able to handle things like discriminative learning 01:09:37.420 |
rates and one-cycle annealing and stuff like that. 01:09:42.320 |
And so to be able to do that, we need some way to create parameter groups. 01:09:48.160 |
This is what we call in fast AI layer groups. 01:09:52.840 |
And I kind of wish I hadn't called them layer groups. 01:09:56.400 |
I should call them parameter groups because we have a perfectly good name for them already 01:10:01.720 |
So I'm not going to call them layer groups anymore. 01:10:03.880 |
I'm just going to call them parameter groups. 01:10:08.920 |
So a parameter group-- so remember when we say parameters in PyTorch, remember right 01:10:16.920 |
back to when we created our first linear layer, we had a weight tensor and we had a bias tensor. 01:10:27.700 |
So in order to optimize something, we need to know what all the parameter tensors are 01:10:33.000 |
And you can just say model.parameters to grab them all in PyTorch. 01:10:39.960 |
And that's going to give us-- it gives us a generator. 01:10:43.160 |
But as soon as you call list on a generator, it turns it into an actual list. 01:10:46.880 |
So that's going to give us a list of all of the tensors, all of the weights and all of the biases. 01:10:58.880 |
But we might want to be able to say the last two layers should have a different learning rate. 01:11:09.000 |
And so the way we can do that is rather than just passing in a list of parameters, we'll pass in a list of lists of parameters. 01:11:16.560 |
And so let's say our list of lists has two items. 01:11:19.320 |
The first item contains all the parameters in the main body of the architecture. 01:11:24.440 |
And the last item contains just the parameters from the last two layers. 01:11:30.200 |
So if we decide that this is a list of lists, then that lets us do parameter groups. 01:11:39.240 |
Now, that's how we tell the optimizer these sets of parameters should be handled differently 01:11:44.760 |
with discriminative learning rates and stuff. 01:11:49.280 |
We're going to assume that this thing being passed in is a list of lists. 01:11:57.200 |
If it's not, then we'll turn it into a list of lists by just wrapping it in a list. 01:12:02.580 |
So if it only has one thing in it, we'll just make it a list with one item containing all the parameters. 01:12:09.120 |
So now, param groups is a list of lists of parameter tensors. 01:12:15.880 |
And so you could either pass in, so you could decide how you want to split them up into 01:12:20.520 |
different parameter groups, or you could just have them turn into a single parameter group 01:12:30.140 |
So now, we have-- our optimizer object has a param groups attribute containing our parameter 01:12:38.920 |
So just keep remembering that's a list of lists. 01:12:42.080 |
Each parameter group can have its own set of hyperparameters. 01:12:47.580 |
So hyperparameters could be learning rate, momentum, beta in Adam, epsilon in Adam, and so forth. 01:12:56.320 |
So those hyperparameters are going to be stored as a dictionary. 01:13:00.560 |
And so there's going to be one dictionary for each parameter group. 01:13:07.240 |
self.hypers contains, for each parameter group, a dictionary. 01:13:16.440 |
What's in the dictionary is whatever you pass to the constructor, OK? 01:13:20.980 |
So this is how you just pass a single bunch of keyword arguments to the constructor, and 01:13:26.440 |
it's going to construct a dictionary for every one. 01:13:29.120 |
And this is just a way of cloning a dictionary so that they're not all referring to the same 01:13:34.280 |
reference, but they all have their own reference. 01:13:39.240 |
So that's doing much the same stuff as torch's optim.Optimizer. 01:13:49.360 |
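To make that concrete, here is a minimal sketch of the kind of generic optimizer being described. It is an assumption-laden stand-in, not the notebook's exact code: in the real notebooks `compose` is a course helper that passes keyword arguments along, so a tiny stand-in version is included here so the sketch is self-contained.

```python
def compose(x, funcs, **kwargs):
    # stand-in for the course's compose: call each function in turn, passing kwargs along
    for f in funcs: x = f(x, **kwargs)
    return x

class Optimizer():
    def __init__(self, params, steppers, **defaults):
        # params may be a list of tensors or a list of lists of tensors
        self.param_groups = list(params)
        if not isinstance(self.param_groups[0], list):
            self.param_groups = [self.param_groups]
        # one hyperparameter dictionary per parameter group, cloned from the defaults
        self.hypers = [{**defaults} for _ in self.param_groups]
        self.steppers = steppers

    def grad_params(self):
        # every parameter (with its group's hyperparameters) that actually has a gradient
        return [(p, hyper) for pg, hyper in zip(self.param_groups, self.hypers)
                for p in pg if p.grad is not None]

    def zero_grad(self):
        for p, _ in self.grad_params():
            p.grad.detach_()   # remove any gradient computation history
            p.grad.zero_()

    def step(self):
        # the step itself does nothing: it just composes whatever steppers were injected
        for p, hyper in self.grad_params():
            compose(p, self.steppers, **hyper)
```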
In order to see what a stepper is, let's write one. 01:14:03.520 |
So in other words, to create an SGD optimizer, we create a partial of our Optimizer class with the SGD step function as its stepper. 01:14:15.220 |
So now when we call step, it goes through our parameters, composes together our steppers, 01:14:29.040 |
and applies them to each one. So each parameter is going to get p.data.add_(-lr, p.grad.data). 01:14:40.120 |
So with that optimization function, we can fit. 01:14:49.360 |
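As a sketch of what that stepper and its usage might look like (building on the Optimizer sketch above; the names are assumptions taken from the description here):

```python
from functools import partial

def sgd_step(p, lr, **kwargs):
    # older two-argument add_: multiplies p.grad.data by -lr, then adds in place, i.e. p -= lr * grad
    p.data.add_(-lr, p.grad.data)
    return p

# the whole "SGD optimizer" is just the generic class plus this one stepper
opt_func = partial(Optimizer, steppers=[sgd_step])
```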
But what we have done is we've done the same thing we've done 1,000 times without ever actually writing an optimizer. 01:15:05.280 |
I've created this thing called grad params, which is just a little convenience. 01:15:08.980 |
Basically when we zero the gradients, we have to go through every parameter. 01:15:14.080 |
To go through every parameter, we have to go through every parameter group. 01:15:18.440 |
And then within each parameter group, we have to go through every parameter in that group 01:15:31.920 |
And also, when we call the stepper, we want to pass to it all of our hyperparameters. 01:15:46.400 |
And learning rate is just one of the things that we've listed in our hyperparameters. 01:15:52.000 |
So remember how I said that our compose is a bit special, that it passes along any keyword 01:15:56.940 |
arguments it got to everything that it composes? 01:16:01.820 |
So that's how our sgd_step can say, oh, I need the learning rate. 01:16:06.400 |
And so as long as hyper has a learning rate in it, it's going to end up here. 01:16:11.040 |
And it'll be here as long as you pass it here. 01:16:14.760 |
And then you can change it for each different layer group. 01:16:20.320 |
So we're going to need to change our parameter scheduler to use our new generic optimizer. 01:16:29.840 |
It's simply now that we have to say, go through each hyperparameter in self.opt.hypers and 01:16:39.220 |
So that's basically the same as what we had in parameter scheduler before, but for our 01:16:45.500 |
This used to use param groups, now it uses hypers. 01:16:50.080 |
So a minor change to make these keep working. 01:16:52.960 |
So now I was super excited when we first got this working, so it's like, wow, we've just 01:16:58.680 |
built an SGD optimizer that works without ever writing an SGD optimizer. 01:17:04.340 |
So now when we want to add weight decay, right? 01:17:06.780 |
So weight decay, remember, is the thing we use when we don't want something that overfits like this. 01:17:13.080 |
And the way we do it is we use L2 regularization, which is just where we 01:17:17.580 |
add the sum of squared weights times some parameter we choose to the loss. 01:17:24.440 |
And remember that the derivative of that is actually just wd times weight. 01:17:29.680 |
So you could either add an L2 regularization term to the loss, or you can add wd times weight directly to the gradients. 01:17:39.520 |
If you've forgotten this, go back and look at weight decay in part one to remind yourself. 01:17:46.560 |
And so if we want to add either this or this, we can do it. 01:17:53.360 |
So weight decay is going to get an LR and a WD, and it's going to simply do that. 01:18:04.620 |
Or L2 regularization is going to just do that. 01:18:10.120 |
By the way, if you haven't seen this before, this is add in PyTorch. 01:18:15.960 |
Normally it just adds this tensor to this tensor. 01:18:19.240 |
But if you also pass a scalar here, it multiplies these together first. 01:18:24.040 |
This is a nice, fast way to go WD times parameter and add that to the gradient. 01:18:37.760 |
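Here is a hedged sketch of the two steppers being described, using the older two-argument in-place forms (where the scalar multiplies the tensor before the add); treat the exact names as assumptions:

```python
def weight_decay(p, lr, wd, **kwargs):
    # true weight decay: shrink the weights themselves by lr * wd
    p.data.mul_(1 - lr * wd)
    return p

def l2_reg(p, lr, wd, **kwargs):
    # L2 regularization: add wd * p to the gradient instead
    p.grad.data.add_(wd, p.data)
    return p
```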
Okay, so we've got our L2 regularization, we've got our weight decay. 01:18:53.160 |
What we need to be able to do now is to be able to somehow have the idea of defaults. 01:19:00.860 |
Because we don't want to have to say weight decay equals zero every time we want to turn it off. 01:19:07.120 |
So see how we've attached some state here to our function object? 01:19:12.020 |
So the function now has something called defaults that says it's a dictionary with wd equal to zero. 01:19:19.040 |
So let's just grab exactly the same optimizer we had before. 01:19:23.080 |
But what we're going to do is we're going to maybe update our hyperparameters with whatever defaults are attached to the steppers. 01:19:34.200 |
And the reason it's maybe update is that it's not going to replace -- if you explicitly 01:19:38.480 |
say I want this weight decay, it's not going to update it. 01:19:44.280 |
And so that's just what this little loop does, right? 01:19:46.360 |
It just goes through each of the steppers, and then goes through each of the things in its 01:19:51.000 |
defaults dictionary, and it just checks: if it's not there already, then it adds it. 01:19:55.600 |
So this is now -- everything else here is exactly the same as before. 01:20:00.920 |
So now we can say let's create an SGD optimizer. 01:20:08.560 |
It's just an optimizer with an SGD step and weight decay. 01:20:14.240 |
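A hedged sketch of the defaults mechanism and the resulting one-liner; in the real notebook the merge happens inside the constructor, so this is just the shape of the idea:

```python
# attach the default to the function object itself
weight_decay._defaults = dict(wd=0.)
l2_reg._defaults = dict(wd=0.)

def maybe_update(steppers, dest):
    # only fill in keys the user did not pass explicitly
    for stepper in steppers:
        for k, v in getattr(stepper, '_defaults', {}).items():
            if k not in dest: dest[k] = v

# so an SGD optimizer with weight decay really is one line
sgd_opt = partial(Optimizer, steppers=[weight_decay, sgd_step])
```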
And so let's create a learner, and let's try creating an optimizer, which is an SGD optimizer, 01:20:24.160 |
with our model's parameters, with some learning rate, and make sure that the hyperparameter 01:20:30.600 |
for weight decay should be zero, the hyperparameter for LR should be .1. 01:20:36.920 |
Let's try giving it a different weight decay, make sure it's there, okay, it passes as well. 01:20:41.120 |
So we've now got an ability to basically add any step functions we want, and those step 01:20:48.760 |
functions can have their own defaults that get added automatically to our optimizer object. 01:21:01.080 |
So now we've got an SGD optimizer with weight decay is one line of code. 01:21:14.660 |
So momentum is going to require a slightly better optimizer, or a slightly different 01:21:18.960 |
optimizer, because momentum needs some more state. 01:21:23.200 |
It doesn't just have parameters and hyperparameters; momentum also needs to know, for every 01:21:30.320 |
parameter, what it was updated by last time. 01:21:35.640 |
Because remember the momentum equation is, if momentum is .9, then it would be .9 times 01:21:42.560 |
whatever you did last time, plus this step, right? 01:21:47.960 |
So we actually need to track for every single parameter what happened last time. 01:21:54.320 |
And that's actually quite a bit of state, right? 01:21:56.520 |
If you've got 10 million parameters in your network, you've now got 10 million more floats 01:22:02.960 |
that you have to store, because that's your momentum. 01:22:06.680 |
So we're going to store that in a dictionary called state. 01:22:11.760 |
So a stateful optimizer is just an optimizer that has state. 01:22:19.080 |
And then we're going to have to have some stats. 01:22:29.880 |
They're objects that we're going to pass in to say, when we create this state, how do we initialize it and how do we update it. 01:22:36.880 |
So when you're doing momentum, what's the function that you run to calculate momentum? 01:22:42.160 |
So that's going to be handled by something called a stat class. 01:22:45.860 |
So for example, momentum is calculated by simply averaging the gradient, like so. 01:22:53.960 |
We take whatever the gradient average was before, we multiply it by momentum, and we add the new gradient. 01:23:10.480 |
So it's not enough just to have update, because we actually need this to be something at the very start too. 01:23:16.800 |
We can't multiply by something that doesn't exist. 01:23:18.680 |
So we're also going to define something called init state that will create a dictionary containing the initial state. 01:23:26.540 |
So that's all that stateful optimizer is going to do, right? 01:23:31.000 |
It's going to look at each of our parameters, and it's going to check to see whether that 01:23:38.480 |
parameter already exists in the state dictionary, and if it doesn't, it hasn't been initialized. 01:23:43.780 |
So we'll initialize it with an empty dictionary, and then we'll update it with the results of init state. 01:23:51.580 |
So now that we have every parameter can now be looked up in this state dictionary to find 01:23:57.160 |
out its state, and we can now, therefore, grab it, and then we can call update, like 01:24:06.040 |
so, to do, for example, average gradients. 01:24:11.080 |
And then we can call compose with our parameter and our steppers, and now we don't just pass 01:24:18.900 |
in our hyperparameters, but we also pass in our state. 01:24:23.480 |
So now that we have average gradients, which is sticking into this thing called grad average, 01:24:32.040 |
and it's going to be passed into our steppers, we can now do a momentum step. 01:24:37.860 |
And the momentum step takes not just LR, but it's now going to be getting this grad average. 01:24:46.180 |
It's just this grad average times the learning rate. 01:24:53.160 |
So now we can create an SGD with momentum optimizer with a line of code. 01:24:59.920 |
It can have a momentum step, it can have a weight decay step, it can have an average 01:25:04.560 |
grad stat, we can even give it some default weight decay, and away we go. 01:25:14.840 |
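A hedged sketch of those pieces, under the assumption that a StatefulOptimizer (as described above) calls init_state once per parameter, calls update every step, and passes the resulting state into the steppers alongside the hyperparameters:

```python
import torch

class AverageGrad():
    _defaults = dict(mom=0.9)
    def init_state(self, p):
        # state has to exist before the first step: start the average at zero
        return {'grad_avg': torch.zeros_like(p.grad.data)}
    def update(self, p, mom, grad_avg, **kwargs):
        # plain momentum: old average times mom, plus the new gradient (no dampening yet)
        grad_avg.mul_(mom).add_(p.grad.data)
        return {'grad_avg': grad_avg}

def momentum_step(p, lr, grad_avg, **kwargs):
    # the step is just the gradient average times the learning rate
    p.data.add_(-lr, grad_avg)
    return p

# SGD with momentum and weight decay, again in one line (StatefulOptimizer is the
# class described in this section, so this line is illustrative rather than runnable on its own)
sgd_mom_opt = partial(StatefulOptimizer, steppers=[momentum_step, weight_decay],
                      stats=AverageGrad(), wd=0.01)
```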
So here's something that might just blow your mind. 01:25:29.600 |
Here is a paper, L2 Regularization versus Batch and Weight Normalization. 01:25:38.040 |
Batch normalization is a commonly used trick to improve training of deep neural networks, 01:25:42.760 |
and they also use L2 regularization ostensibly to prevent overfitting. 01:25:48.280 |
However, we show that L2 regularization has no regularizing effect when combined with normalization. 01:26:06.680 |
I realized this when I was chatting to Sylvain at NeurIPS, and like we were walking around 01:26:12.440 |
the poster session, and I suddenly said to him, "Wait, Sylvain, if there's batch norm, how can weight decay be doing anything?" 01:26:35.920 |
So picture some layer, and we've got some weights that were used to create that layer of activations. 01:26:43.520 |
So these are our weights, and these are our activations, and then we pass it through some batch norm layer. 01:26:52.640 |
It's got a bunch of adds, and it's got a bunch of multiplies, right? 01:26:59.360 |
It also normalizes, but these are the learned parameters. 01:27:02.720 |
Okay, so we come along and we say, "Okay, weight decay time. 01:27:08.360 |
Your weight decay is a million," and it goes, "Uh-oh, what do I do?" 01:27:17.440 |
Because now the sum of the squares of these weights gets multiplied by a million. 01:27:30.640 |
But then the batch norm layer goes, "Oh, no, don't worry, friends," and it just multiplies 01:27:37.120 |
every single one of its mults by a million. 01:27:58.880 |
So what happens now? To get the same activations we had before, 01:28:04.880 |
all of our weights, like w1, now get divided by a million, and we get the same result. 01:28:13.600 |
And so now our weight decay basically is nothing. 01:28:19.840 |
So in other words, the model can decide exactly how much weight decay 01:28:26.680 |
loss there is by simply scaling the batch norm mults, right? 01:28:31.560 |
Now the batch norm mults get a tiny bit of weight decay applied to them, unless you turn 01:28:36.200 |
it off, which people often do, but it's tiny, right? 01:28:38.760 |
Because there's very few parameters here, and there's lots of parameters here. 01:28:47.320 |
L2 regularization has no regularizing effect, which is not what I've been telling people 01:28:54.480 |
who have been listening to these lessons the last three years, for which I apologize. 01:29:00.280 |
I feel a little bit better in knowing that pretty much everybody in the community is 01:29:08.360 |
So Twan van Laarhoven mentioned this in the middle of 2017. 01:29:17.040 |
There's a couple more papers I've mentioned in today's lesson notes from the last few 01:29:23.160 |
months where people are finally starting to really think about this, but I'm not aware 01:29:29.240 |
of any other course, which is actually pointed out we're all doing it wrong. 01:29:34.160 |
So you know how I keep mentioning how none of us know what we're doing? 01:29:38.960 |
We don't even know what L2 regularization does because it doesn't even do anything, 01:29:46.680 |
but it does do something because if you change it, something happens. 01:29:58.120 |
So a more recent paper by a team led by Roger Grosse has found three kind of ways in which 01:30:06.000 |
maybe regularization happens, but it's not the way you think. 01:30:10.560 |
This is one of the papers in the lesson notes. 01:30:13.520 |
But even in his paper, which is just a few months old, the abstract says basically, or 01:30:20.480 |
the introduction says basically no one really understands what L2 regularization does. 01:30:28.880 |
There's this thing that every model ever always has, and it totally doesn't work. 01:30:35.520 |
At least it doesn't work in the way we thought it did. 01:30:38.440 |
So that should make you feel better about, can I contribute to deep learning? 01:30:44.960 |
Obviously you can, because none of us have any idea what we're doing. 01:30:48.480 |
And this is a great place to contribute, right? 01:30:51.880 |
Is like use all this telemetry that I'm showing you, activations of different layers, and 01:30:56.560 |
see what happens experimentally, because the people who study this stuff, like what actually 01:31:02.240 |
happens with batch norm and weight decay, most of them don't know how to train models, 01:31:07.520 |
The theory people, and then there's like the practitioners who forget about actually thinking 01:31:14.760 |
But if you can combine the two and say like, oh, let's actually try some experiments. 01:31:19.080 |
Let's see what happens really when we change weight decay, now that I've assumed we don't 01:31:23.240 |
know what we're doing, I'm sure you can find some really interesting results. 01:31:29.060 |
So momentum is also interesting, and we really don't understand much about how things like 01:31:34.640 |
momentum work, but here's some nice pictures for you. 01:31:38.760 |
And hopefully it'll give you a bit of a sense of momentum. 01:31:42.000 |
Let's create 200 numbers equally spaced between minus four and four, and then let's create some random numbers to go with them. 01:31:52.840 |
And then let's create something that plots some function for these numbers, and we're 01:32:03.320 |
going to look at this function for each value of something called beta. 01:32:07.880 |
And this is the function we're going to try plotting, and this is the momentum function. 01:32:13.720 |
Okay, so what happens if we plot this function for each value of beta, for our random data? 01:32:28.680 |
So beta here is going to be our different values of momentum, and you can see what happens 01:32:33.160 |
is, with very little momentum, you just get very bumpy, very bumpy. 01:32:38.400 |
Once you get up to a high momentum, you get a totally wrong answer. 01:32:44.400 |
Because if you think about it, right, we're constantly saying 0.9 times whatever we had 01:32:50.520 |
before, plus the new thing, then basically you're continuing to say like, oh, the thing 01:32:57.320 |
I had before times 0.9 plus the new thing, and the things are all above zero. 01:33:05.960 |
And this is why, if your momentum is too high, and basically you're way away from where you 01:33:12.400 |
need to be in weight space, so it keeps on saying go that way, go that way, go that way. 01:33:16.600 |
If you get that enough with a high momentum, it will literally shoot off far faster than it should. 01:33:23.720 |
Okay, so this will give you a sense of why you've got to be really careful with high 01:33:27.400 |
momentum, it's literally biased to end up being a higher gradient than the actual gradient. 01:33:39.600 |
Like when you think about it, this is kind of dumb, right, because we shouldn't be saying 01:33:44.080 |
beta times average plus yi, we should be saying beta times average plus 1 minus beta times yi. 01:33:54.680 |
Like dampen the thing that we're adding in, and that's called an exponentially weighted 01:33:59.760 |
moving average, as we know, or lerp in PyTorch speak. 01:34:06.920 |
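A tiny illustration of that equivalence (the numbers here are just made up for the example):

```python
import torch

# avg.lerp(y, w) is avg + w * (y - avg), so with w = 1 - beta it gives
# exactly beta * avg + (1 - beta) * y, i.e. the exponentially weighted moving average
avg, y, beta = torch.tensor(2.0), torch.tensor(4.0), 0.9
ewma = avg.lerp(y, 1 - beta)   # 0.9 * 2.0 + 0.1 * 4.0 = 2.2
assert torch.isclose(ewma, beta * avg + (1 - beta) * y)
```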
So let's plot the same thing as before but this time with exponentially weighted moving 01:34:19.260 |
What if the thing that we're trying to match isn't just random but is some function? 01:34:29.940 |
Well if we use a very small momentum with exponentially weighted moving averages, we're 01:34:35.040 |
And I've added an outlier at the start just to show you what happens. 01:34:40.040 |
Even with beta 0.7 we're fine, but uh-oh, now we've got trouble. 01:34:46.280 |
And the reason we've got trouble is that the second, third, fourth, fifth observations all 01:34:55.560 |
have a whole lot of this item number one in, right? 01:34:58.840 |
Because remember item number two is 0.99 times item number one plus 0.01 times item number two. 01:35:08.640 |
And so item number one is massively biasing the start. 01:35:16.960 |
And the second thing that goes wrong with this momentum is that you can see how we're lagging. 01:35:22.560 |
We're always running a bit behind where we should be. 01:35:26.640 |
Because we're always only taking 0.1 times the new thing. 01:35:34.000 |
De-biasing is what we saw last week and it turned out, thanks to Stas Bekman's discovery, 01:35:39.520 |
we didn't really need it, but we do need it now. 01:35:42.400 |
And de-biasing is to divide by one minus beta to the power of whatever batch number we're up to. 01:35:52.460 |
If your initial starting point is zero, and that's what we use always when we're de-biasing, 01:35:58.300 |
we always start at zero, and beta is 0.9, then your first step is going to be 0.9 times zero plus 0.1 times your first value. 01:36:10.520 |
So in other words, you'll end up at 0.1 times your item, so you're going to end up 10 times smaller than you should be. 01:36:18.740 |
And if you kind of work through it, you'll see that each step is too small by a factor of one minus 0.9 to the power 01:36:30.280 |
of one, two, three, four, five, and so forth, which is exactly what we divide by. 01:36:34.440 |
And in fact, we have, of course, a spreadsheet showing you this. 01:36:40.760 |
So if you have a look at the momentum bias spreadsheet, there we go. 01:36:47.320 |
So basically, here's our batch number, and let's say these are the values that are coming 01:36:52.220 |
in, our gradients, five, one, one, one, one, one, five, one. 01:36:56.360 |
Then basically, this is our exponentially weighted moving average, and here is our de-biasing factor. 01:37:06.320 |
And then here is our resulting de-biased exponentially weighted moving average. 01:37:11.280 |
And then you can compare it to an actual moving average of the last few. 01:37:21.040 |
And Sylvain loves writing LaTeX, so he wrote all this LaTeX that basically points out that 01:37:26.840 |
if you say what I just said, which is beta times this plus one minus beta times that, 01:37:33.520 |
and you keep doing it to itself lots and lots of times, you end up with the formula shown here. 01:37:40.040 |
So this is all we need to do to take our exponentially weighted moving average, divide it by one 01:37:45.320 |
minus beta to the power of i plus one, and look at that. 01:37:51.680 |
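Written out as a sketch of the formula being described (assuming the average is started at zero and steps are counted from zero):

```latex
a_i = \beta\, a_{i-1} + (1-\beta)\, y_i, \qquad a_{-1} = 0
\qquad\Longrightarrow\qquad
\hat{a}_i = \frac{a_i}{1 - \beta^{\,i+1}}
```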
It de-biases very quickly, even if you have a bad starting point, and it looks pretty 01:37:56.640 |
It's not magic, but you can see why a beta of .9 is popular. 01:38:12.120 |
Adam is dampened, de-biased momentum, that's the numerator, divided by the dampened, de-biased root of the averaged squared gradients. 01:38:29.380 |
And so we talked about why Adam does that before, we won't go into the details. 01:38:34.320 |
But here's our average gradient again, but this time we've added optional dampening. 01:38:40.440 |
So if you say I want dampening, then we'll set momentum dampening to that, otherwise we don't dampen at all. 01:38:49.440 |
And so this is exactly the same as before, but with dampening. 01:38:54.660 |
Average squared gradients is exactly the same as average gradients, we could definitely 01:38:58.480 |
refactor these a lot, so this is all exactly the same as before, except we'll call everything the squared version. 01:39:04.640 |
We'll call it squared dampening, we'll call it squared averages, and this time, rather 01:39:08.160 |
than just adding in the p.grad.data, we will multiply p.grad.data by itself, in other words, square it. 01:39:17.720 |
This is the only difference, and we store it in a different name. 01:39:21.920 |
So with those, we're also going to need to de-bias, which means we need to know what step number we're up to. 01:39:28.280 |
So here's a stat, which just literally counts. 01:39:34.500 |
So here's our de-bias function, the one we just saw. 01:39:40.240 |
Once that's in place, Adam is just the de-biased momentum with momentum dampening, the de-biased 01:39:44.960 |
squared momentum with squared momentum dampening, and then we just take the parameter, and then 01:39:52.480 |
our learning rate, and we've got the de-bias in here, and our gradient average gets divided by the square root of the squared average. 01:40:03.400 |
And we also have our epsilon, oh, this is in the wrong spot, be careful, epsilon should go outside the square root. 01:40:15.240 |
So that's an Adam step, so now we can create an Adam optimizer in one line of code. 01:40:23.120 |
And so there's our Adam optimizer: it has an average grad stat, an average squared grad stat, and a step count. 01:40:41.880 |
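Here is a hedged sketch of an Adam stepper along those lines. It assumes the dampening equals one minus momentum (the standard Adam setting), that `step` comes from the step-count stat just mentioned and starts at 1, and it uses the older two-argument addcdiv_ form; note epsilon sits outside the square root, matching the discussion below.

```python
def adam_step(p, lr, mom, sqr_mom, grad_avg, sqr_avg, eps, step, **kwargs):
    debias1 = 1 - mom**step        # de-bias factor for the gradient average
    debias2 = 1 - sqr_mom**step    # de-bias factor for the squared-gradient average
    # p -= lr * (grad_avg / debias1) / (sqrt(sqr_avg / debias2) + eps)
    p.data.addcdiv_(-lr / debias1, grad_avg,
                    (sqr_avg / debias2).sqrt() + eps)
    return p
```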
By the way, these equations are a little nicer than these equations, and I want to point out why. 01:40:53.740 |
Look at this, m over v plus epsilon root, lambda, da-da-da, it's the same as this. 01:41:00.960 |
So like, it's just so complicated when things appear the same way in multiple places, right? 01:41:05.980 |
So when we did this equation, we gave that a new name. 01:41:10.180 |
And so now we can just look at r2, goes from all that to just that. 01:41:20.560 |
And so when you pull these things out, when you refactor your math, it's much easier to 01:41:30.600 |
When we look at this, even if you're a terrible mathematician like me, you're going to start 01:41:35.040 |
to recognize some patterns, and that's the trick to being a less terrible mathematician 01:41:40.560 |
Beta times something plus one minus beta times another thing is an exponentially weighted moving average. 01:41:48.360 |
So here's one exponentially weighted moving average. 01:41:50.160 |
Here's another exponentially weighted moving average. 01:41:58.960 |
So these are the exponentially weighted moving average of the gradient and the gradient squared. 01:42:19.280 |
Sylvain has a message, don't move the epsilon. 01:42:29.040 |
Sylvain's an actual math guy, so... >> In Adam, the epsilon goes outside the square root. 01:42:35.240 |
I always thought epsilon should always go inside the square root. 01:42:39.320 |
Jeremy just did a fix I pushed a week ago where our Adam wasn't working. 01:42:49.680 |
So to explain why this matters and why there is no right answer, here's the difference, 01:42:57.240 |
If this is 1e-7, then having it here versus having it here. 01:43:04.640 |
So like the square root of 1e-7 is very different to 1e-7. 01:43:14.360 |
And in batch norm, they do put it inside the square root, and according to Sylvain, in Adam it goes outside. 01:43:22.040 |
Neither is like the right place to put it or the wrong place to put it. 01:43:26.980 |
If you don't put it in the same place as they do in paper, it's just a totally different 01:43:32.240 |
And this is a good time to talk about epsilon in Adam, because I love epsilon in Adam. 01:43:39.280 |
Like, what if we made epsilon equal to 1, right? 01:43:45.980 |
Then we've got the kind of momentum term on the numerator, and on 01:43:54.280 |
the denominator we've got the square root of the exponentially weighted moving average of the squared gradients, plus 1. 01:44:08.240 |
And most of the time, the gradients are going to be smaller than 1 and the squared version smaller still. 01:44:15.200 |
So basically then, the 1 is going to be much bigger than this, so it basically makes this whole denominator irrelevant. 01:44:24.600 |
So if epsilon is 1, it's pretty close to being standard SGD with momentum, or at least de-biased, dampened SGD with momentum. 01:44:35.560 |
Whereas if epsilon is 1e-7, then we're basically saying, oh, we want to really use 01:44:44.240 |
these different exponentially weighted moving average squared gradients. 01:44:50.800 |
And this is really important because if you have some parameter that has had very small 01:45:01.680 |
squared gradients for a while, this could well be like 1e-6, which means when you divide by it, you get an enormous step. 01:45:11.520 |
And that could absolutely kill your optimizer. 01:45:14.700 |
So the trick to making Adam and Adam-like things work well is to make this about 0.1, somewhere 01:45:23.240 |
between 1e-3 and 1e-1 tends to work pretty well. 01:45:27.080 |
Most people use 1e-7, which just makes no sense. 01:45:32.320 |
There's no way that you want to be able to multiply your step by 10 million times. 01:45:40.520 |
So there's another place that epsilon is a super important thing to think about. 01:45:49.680 |
So LAMB then is stuff that we've all seen before, right? 01:45:56.000 |
So it's de-biased, this is Adam, right, de-biased exponentially weighted moving averages of the gradients and the squared gradients. 01:46:09.840 |
The norm is just the square root of the sum of the squares. 01:46:19.040 |
This one here, hopefully you recognize as being the Adam step. 01:46:31.060 |
So basically what LAMB is doing is it's Adam, but what we do is we average the step over all the weights in a layer. 01:46:44.440 |
That's why these L's are really important, right, because these things are happening per layer. 01:46:48.980 |
And so basically we're taking, so here's our debiased momentum, debiased squared momentum, 01:46:57.560 |
And then here's our one, and look, here's this mean, right? 01:47:03.000 |
Because remember, each stepper is created for a layer, for a parameter. 01:47:07.680 |
I shouldn't say a layer, for a parameter, okay? 01:47:11.140 |
So this is kind of like both exciting and annoying because I'd been working on this 01:47:17.600 |
exact idea, which is basically Adam but averaged out over a layer, for the previous week. 01:47:25.400 |
And then this LAMB paper came out, and I was like, "Oh, that's cool. 01:47:31.120 |
And it's like, "Oh, we do it with a new optimizer." 01:47:32.440 |
And I looked at the new optimizer and was like, "It's just the optimizer I wrote a week ago." 01:47:45.320 |
And you should definitely check out LAMB because it makes so much sense to use the average 01:47:53.440 |
over the layer of that step as a kind of a, you can see here, it's kind of got this normalization 01:48:02.520 |
Because it's just really unlikely that you want to divide every individual parameter in that tensor 01:48:09.480 |
by its own squared gradients, because it's going to vary too much. 01:48:14.120 |
There's just too much chance that there's going to be a 1e-7 in there somewhere that blows the step up. 01:48:19.880 |
So this to me is exactly the right way to do it. 01:48:22.800 |
And this is kind of like the first optimizer I've seen where I just kind of think like, 01:48:28.440 |
"Oh, finally I feel like people are heading in the right direction." 01:48:31.760 |
But when you really study this optimizer, you realize that everything we thought about optimizers is up for grabs. 01:48:37.920 |
The way optimizers are going with things like LAMB, the whole idea of what the magnitude 01:48:47.560 |
of our step should be just looks very different to everything we kind of thought of before. 01:48:55.160 |
You know, this math might look slightly intimidating at first, but now you know all of these things. 01:49:01.240 |
You know, what all they all are and you know why they exist. 01:49:08.020 |
So here's how we create a LAMB optimizer and here's how we fit with it. 01:49:14.280 |
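A hedged sketch of a LAMB-style stepper, simplified to assume the dampening equals one minus momentum: the Adam-style step for the whole parameter tensor is rescaled by a layer-wise trust ratio, the norm of the weights over the norm of the proposed step (the clamp at 10 and the epsilon in the ratio are safeguards assumed for this sketch, not claims about the paper's exact formulation).

```python
def lamb_step(p, lr, mom, sqr_mom, grad_avg, sqr_avg, eps, wd, step, **kwargs):
    debias1, debias2 = 1 - mom**step, 1 - sqr_mom**step
    r1 = p.data.pow(2).mean().sqrt()                              # norm of the weights
    adam = (grad_avg / debias1) / ((sqr_avg / debias2).sqrt() + eps) + wd * p.data
    r2 = adam.pow(2).mean().sqrt()                                # norm of the proposed step
    trust = (r1 / (r2 + eps)).clamp(max=10.)                      # layer-wise trust ratio
    p.data.add_(-lr * trust * adam)
    return p
```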
Okay, that is that, unless Sylvain says otherwise. 01:49:28.560 |
So as I was building this, I got so sick of runner because I kept on wondering when do I use the runner and when do I use the learner. 01:49:38.520 |
And then I kind of suddenly thought, like, again, like, once every month or two, I actually step back and refactor things. 01:49:48.680 |
And it's only when I get really frustrated, right? 01:49:51.620 |
So like I was getting really frustrated with runners and I actually decided to sit and look at the code. 01:49:56.400 |
And I looked at the definition of learner and I thought, wait, it doesn't do anything at 01:50:08.580 |
And then a runner has a learner in it that stores three things, like, why don't we just store those three things here? 01:50:18.920 |
So I took the runner, I took that line of code, I copied it and I pasted it just here. 01:50:29.160 |
I then found everything that said self.learn and removed the .learn and I was done. 01:50:37.720 |
And it's like, oh, it's just one of those obvious refactorings that as soon as I did 01:50:42.420 |
it, Sylvain was like, why didn't you do it that way in the first place? 01:50:47.500 |
And I was like, why didn't you fix it that way in the first place? 01:50:49.680 |
But now that we've done it, like, this is so much easier. 01:50:52.920 |
There's no more get learn run, there's no more having to match these things together. 01:50:58.920 |
So one of the nice things I like about this kind of Jupyter style of development is I 01:51:04.200 |
spend a month or two just, like, immersing myself in the code in this very experimental 01:51:12.120 |
And I feel totally fine throwing it all away and changing everything because, like, everything's 01:51:16.120 |
small and I can, like, fiddle around with it. 01:51:19.200 |
And then after a couple of months, you know, Sylvain and I will just kind of go like, okay, 01:51:24.760 |
there's a bunch of things here that work nicely together and we turn it into some modules. 01:51:29.440 |
And so that's how fast.ai version one happened. 01:51:33.840 |
People often say to us, like, turning it into modules, what a nightmare that must have been. 01:51:39.120 |
So here's what was required for me to do that. 01:51:43.720 |
I typed into Skype, Sylvain, please turn this into a module. 01:51:50.040 |
And then three hours later, Sylvain typed back and he said, done. 01:51:55.560 |
You know, it took, you know, it was four, five, six months of development in notebooks, and a few hours to turn into modules. 01:52:07.120 |
And I think this -- I find this quite delightful. 01:52:24.760 |
Sylvain wrote this fantastic package called Fast Progress, which you should totally check 01:52:32.760 |
Because remember, we're allowed to import modules that are not data science modules. 01:52:39.200 |
But now we need to attach this progress bar to our callback system. 01:52:44.240 |
So let's grab our ImageNet data as before, create a little thing with, I don't know, 01:52:56.720 |
It's basically exactly the same as it was before, except now we're storing our stats in an array. 01:53:04.240 |
And we're just passing off the array to logger. 01:53:07.560 |
Remember logger is just a print statement at this stage. 01:53:12.240 |
And then we will create our progress bar callback. 01:53:22.120 |
So with that, we can now add progress callback to our callback functions. 01:53:35.320 |
That's all the code we needed to make this happen. 01:53:43.720 |
So this is, you know, thanks to just careful, simple, decoupled software engineering. 01:53:50.080 |
We just said, okay, when you start fitting, you've got to create the master bar. 01:54:00.920 |
And then replace the logger function, not with print, but with master bar.write. 01:54:09.160 |
And then after we've done a batch, update our progress bar. 01:54:15.320 |
When we begin an epoch or begin validating, we'll have to create a new progress bar. 01:54:19.800 |
And when we're done fitting, tell the master bar we're finished. 01:54:24.520 |
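A hedged sketch of what that callback might look like, wiring fastprogress's master_bar and progress_bar into the events just described; attribute names such as self.epochs, self.dl, self.iter and self.run.logger, and the Callback base class, are assumptions about the course's callback system rather than a definitive implementation.

```python
from functools import partial
from fastprogress.fastprogress import master_bar, progress_bar

class ProgressCallback(Callback):
    _order = -1
    def begin_fit(self):
        self.mbar = master_bar(range(self.epochs))     # one outer bar for all epochs
        self.mbar.on_iter_begin()
        self.run.logger = partial(self.mbar.write, table=True)  # replace the print logger
    def after_fit(self):      self.mbar.on_iter_end()  # tell the master bar we're finished
    def after_batch(self):    self.pb.update(self.iter)
    def begin_epoch(self):    self.set_pb()
    def begin_validate(self): self.set_pb()
    def set_pb(self):
        # a fresh inner progress bar for each training or validation pass
        self.pb = progress_bar(self.dl, parent=self.mbar)
        self.mbar.update(self.epoch)
```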
So it's very easy to, once you have a system like this, to integrate with other libraries 01:54:29.360 |
if you want to use TensorBoard or Visdom or send yourself a Twilio message or whatever. 01:54:42.600 |
Okay, so we're going to finish, I think we're going to finish, unless this goes faster than I expect. 01:54:52.000 |
So so far, we've seen how to create our optimizers, we've seen how to create our data blocks API. 01:54:58.160 |
And we can use all that to train a reasonably good image net model. 01:55:03.240 |
But to make a better image net model, it's a bit short of data. 01:55:07.360 |
So we should use data augmentation as we all know. 01:55:20.360 |
The only transform we're going to use is resize fixed. 01:55:27.040 |
And let's just actually open the original pillow image without resizing it to see what it looks like. 01:55:38.360 |
When you resize, there are various resampling methods you can use. 01:55:43.320 |
So basically, when you go from one size image to another size image, do you, like, take the nearest pixel? 01:55:50.200 |
Or do you put a little cubic spline through them? 01:55:54.920 |
And so these are called resampling methods, and pillow has a few. 01:55:58.720 |
They suggest when downsampling, so going from big to small, you should use anti alias. 01:56:04.200 |
So here's what you do when you're augmenting your data, and this is not specific to images. 01:56:12.340 |
If you're doing audio, if you're doing text, if you're doing music, whatever, augment your 01:56:18.360 |
data and look at or listen to or understand your augmented data. 01:56:23.600 |
So don't like just chuck this into a model, but like look at what's going on. 01:56:28.680 |
So if I want to know what's going on here, I need to be able to see the texture of this image. 01:56:35.840 |
Now, I'm not very good at tenches, but I do know a bit about clothes. 01:56:39.420 |
So let's say if we were trying to see what this guy's wearing, it's a checkered shirt. 01:56:44.080 |
So let's zoom in and see what this guy's wearing. 01:56:51.680 |
So like, I can tell that this is going to totally break my model if we use this kind of resampling. 01:57:02.780 |
What if instead of anti aliasing, we use bilinear, which is the most common? 01:57:11.880 |
What if we use nearest neighbors, which nobody uses because everybody knows it's terrible? 01:57:20.560 |
So yeah, just look at stuff and try and find something that you can study to see whether your augmentation is destroying the detail you care about. 01:57:36.720 |
And this is interesting because what I did here was I did two steps. 01:57:39.580 |
I first of all resized to 256 by 256 with bicubic, and then I resized to my final 128 by 128. 01:57:50.080 |
And so sometimes you can combine things together in steps to get really good results. 01:57:55.360 |
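A hedged illustration of that two-step idea in Pillow; the file path and the resample choice for the second step are assumptions for the example, not the lesson's exact code:

```python
import PIL.Image

img = PIL.Image.open('tench.jpg')                            # hypothetical path
img = img.resize((256, 256), resample=PIL.Image.BICUBIC)     # step 1: bicubic to an intermediate size
img = img.resize((128, 128), resample=PIL.Image.ANTIALIAS)   # step 2: downsample to the final size
```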
Anyway, I didn't want to go into the details here, I'm just saying that when we talk about 01:58:00.440 |
image augmentation, your test is to look at or listen to or whatever your augmented data. 01:58:12.620 |
Flipping is a great data augmentation for vision. 01:58:18.000 |
The main thing I want to point out is this, at this point, our tensors contain bytes. 01:58:24.600 |
Calculating with bytes and moving bytes around is very, very fast. 01:58:28.440 |
And we really care about this because when we were doing the Dawnbench competition, one 01:58:33.100 |
of our biggest issues for speed was getting our data augmentation running fast enough 01:58:41.300 |
If you're flipping something, flipping bytes is identical to flipping floats in terms of 01:58:46.280 |
the outcome, so you should definitely do your flip while it's still a byte. 01:58:51.920 |
So image augmentation isn't just about throwing some transformation functions in there, but 01:58:58.480 |
think about when you're going to do it because you've got this pipeline where you start with 01:59:02.000 |
bytes and you start with bytes in a pillow thing and then they become bytes in a tensor 01:59:08.000 |
and then they become floats and then they get turned into a batch. 01:59:13.060 |
And so you should do whatever you can while they're still bytes. 01:59:17.400 |
Don't do things that are going to cause rounding errors or saturation problems, whatever. 01:59:27.780 |
So there's a thing in Pillow, Image.transpose with PIL.Image.FLIP_LEFT_RIGHT. 01:59:32.360 |
We'll apply it when a random number is less than 0.5. 01:59:35.600 |
Let's create an item list and let's replace that. 01:59:38.800 |
We built this ourselves, so we know how to do this stuff now. 01:59:41.240 |
Let's replace the items with 64 copies of just the first item. 01:59:46.920 |
And so that way we can now use this to create the same picture lots of times. 01:59:53.640 |
So show batch is just something that's just going to go through our batch and show all 01:59:59.320 |
Everything we're using we've built ourselves, so you never have to wonder what's going on. 02:00:04.000 |
So we can show batch with no augmentation or remember how we created our transforms. 02:00:09.240 |
We can add PIL random flip and now some of them are backwards. 02:00:15.240 |
It might be nice to turn this into a class that you actually pass a p into, to decide how often to flip. 02:00:23.320 |
You probably want to give it an order because we need to make sure it happens after we've 02:00:27.720 |
got the image and after we've converted it to RGB but before we've turned it into a tensor. 02:00:33.240 |
Since all of our PIL transforms are going to want to be that order, we may as well create 02:00:37.920 |
a PIL transform class and give it that order and then we can just inherit from that class 02:00:45.600 |
So now we've got a PIL transform class, we've got a PIL random flip, it's got this state, 02:00:50.980 |
it's going to be random, we can try it out giving it p of 0.8, and so now most of them are flipped. 02:01:02.680 |
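A sketch of those two classes as described; `Transform` is assumed to be the course's base transform class with an _order attribute, and the order value 11 is an assumption about where PIL transforms sit in the pipeline:

```python
import random
import PIL.Image

class PilTransform(Transform): _order = 11   # after opening/converting, before tensor conversion

class PilRandomFlip(PilTransform):
    def __init__(self, p=0.5): self.p = p
    def __call__(self, x):
        # flip with probability p, otherwise pass the image through unchanged
        return x.transpose(PIL.Image.FLIP_LEFT_RIGHT) if random.random() < self.p else x
```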
Or maybe we want to be able to do all these other flips. 02:01:07.220 |
So actually PIL transpose, you can pass it all kinds of different things, and they're all listed here. 02:01:21.240 |
So let's turn that into another transform where we just pick any one of those at random 02:01:38.680 |
>> It's easy to evaluate data augmentation for images; how would you handle tabular, text or time series data? 02:01:54.000 |
>> How would you handle the data augmentation? 02:01:58.560 |
So if you're augmenting text then you read the augmented text. 02:02:03.520 |
For time series you would look at the signal of the time series. 02:02:09.400 |
For tabular you would graph or however you normally visualize that kind of tabular data, 02:02:15.680 |
you would visualize that tabular data in the same way. 02:02:18.320 |
So you just kind of come and try and as a domain expert hopefully you understand your 02:02:23.040 |
data and you have to come up with a way, what are the ways you normally visualize that kind 02:02:29.120 |
of data and use the same thing for your augmented data. 02:02:40.120 |
How would you do the augmentation for tabular data text or time series? 02:02:45.880 |
I mean, again, it kind of requires your domain expertise. 02:02:53.400 |
Just before class today actually one of our alumni, Christine Payne, came in, she's at OpenAI 02:02:59.840 |
now working on music analysis and music generation, and she was talking about her data 02:03:06.600 |
augmentation, saying she's pitch shifting and volume changing and slicing bits off the front and the end. 02:03:18.560 |
It's just a case of thinking about what kinds of things could change in your data that would 02:03:24.600 |
almost certainly cause the label to not change but would still be a reasonable data item 02:03:31.480 |
and that just requires your domain expertise. 02:03:33.440 |
Oh, except for the thing I'm going to show you next which is going to be a magic trick 02:03:43.560 |
We can do random cropping and this is, again, something to be very careful of. 02:03:50.280 |
We very often want to grab a small piece of an image and zoom into that piece. 02:04:01.720 |
And if we do crop and resize, oh, we've lost his checked shirt. 02:04:09.560 |
So for example, with pillow, there's a transform called extent where you tell it what crop 02:04:17.440 |
and what resize and it does it in one step and now it's much more clear. 02:04:24.200 |
So generally speaking, you've got to be super careful, particularly when your data is still 02:04:28.720 |
bytes, not to do destructive transformations, particularly multiple destructive transformations. 02:04:36.000 |
Do them all in one go or wait until they're floats because bytes round off and disappear 02:04:47.160 |
And the cropping one takes 193 microseconds, the better one takes 500 microseconds. 02:04:57.400 |
So one approach would be to say, oh, crap, it's more than twice as long, we're screwed. 02:05:08.360 |
So here's how I thought through our time budget for this little augmentation project. 02:05:12.600 |
I know that for Dawnbench, kind of the best we could get down to is five minutes per epoch. 02:05:31.960 |
So per GPU, per minute, that's about 31,000 images, or around 500 per second. 02:05:44.880 |
Assuming four cores per GPU, that's 125 per second. 02:05:49.200 |
So we're going to try to stay under 10 milliseconds, I said 10 milliseconds per image. 02:05:58.440 |
So it's actually still a pretty small number. 02:06:05.000 |
So we're not too worried at this point about 500 microseconds. 02:06:11.240 |
But this is always kind of the thing to think about is like how much time have you got? 02:06:23.400 |
But yeah, 520 per second, we've got some time, especially since we've normally got a few cores to work with. 02:06:32.880 |
So we can just write some code to do kind of a general crop transform. 02:06:39.560 |
For ImageNet and things like that, for the validation set, what we normally do is we 02:06:48.040 |
grab the center of the image, we remove 14% from each side and grab the center. 02:06:56.360 |
So we can zoom in a little bit, so we have a center crop. 02:07:03.840 |
That's what we do for the validation set, and obviously they're all the same because there's no randomness there. 02:07:09.360 |
But for the training set, the most useful transformation by far, like all the competition 02:07:16.440 |
winners, grab a small piece of the image and zoom into it. 02:07:24.600 |
And this is going to be really useful to know about for any domain. 02:07:29.840 |
So for example, in NLP, a really useful thing to do is to grab different sized chunks of text. 02:07:37.840 |
With audio, if you're doing speech recognition, grab different sized pieces of the utterances 02:07:44.160 |
If you can find a way to get different slices of your data, it's a fantastically useful 02:07:53.760 |
And so this is like by far the main, most important augmentation used in every ImageNet 02:08:05.880 |
It's a bit weird though because what they do in this approach is this little ratio here 02:08:11.840 |
says squish it by between three over four aspect ratio to a four over three aspect ratio. 02:08:19.560 |
And so it literally makes the person, see here, he's looking quite thin. 02:08:28.120 |
It doesn't actually make any sense, this transformation, because optically speaking, there's no way 02:08:33.800 |
of looking at something in normal day-to-day life that causes them to expand outwards or squash inwards like that. 02:08:43.160 |
So when we looked at this, we thought, I think what happened here is that this 02:08:48.240 |
is the best they could do with the tools they had, but probably what they really want to 02:08:52.080 |
do is to do the thing that's kind of like physically reasonable. 02:08:57.200 |
And so the physically reasonable thing is like you might be a bit above somebody or 02:09:00.960 |
a bit below somebody or left of somebody or right of somebody, causing your perspective to change. 02:09:06.760 |
So our guess is that what we actually want is not this, but this. 02:09:12.740 |
So perspective warping is basically something that looks like this. 02:09:22.200 |
And you think about how would those four points map to four other points if you were looking from a different direction. 02:09:27.960 |
So it's like as you look from different directions, roughly speaking. 02:09:32.880 |
And the reason that I really like this idea is because when you're doing data augmentation 02:09:39.120 |
at any domain, as I mentioned, the idea is to try and create like physically reasonable 02:09:49.400 |
And these just aren't; like, you can't make somebody squishier in the real world, right? 02:09:58.760 |
So if we do a perspective transform, then they look like this. 02:10:06.280 |
If you're a bit underneath them, the fish will look a bit closer, or if you're a bit 02:10:10.720 |
over here, then the hat's a bit closer from that side. 02:10:13.720 |
So these perspective transforms make a lot more sense, right? 02:10:19.680 |
So if you're interested in perspective transforms, we have some details here on how you actually 02:10:25.760 |
The details aren't important, but what is interesting is that the transform actually requires solving a system of linear equations. 02:10:33.240 |
And did you know that PyTorch has a function for solving systems of linear equations? 02:10:37.440 |
It's amazing how much stuff is in PyTorch, right? 02:10:41.240 |
So for like lots of the things you'll need in your domain, you might be surprised to find they're already in PyTorch. 02:10:48.680 |
>> And with the cropping and resizing, what happens when you lose the object of interest? 02:11:09.600 |
And interestingly, the kind of ImageNet winning strategy is to randomly pick between 8% and 100% of the image. 02:11:19.800 |
So literally, they are very often picking 8% of the pixels. 02:11:30.380 |
So very often they'll have just the fin or just the eye. 02:11:36.720 |
So this tells us that if we want to use this really effective augmentation strategy really 02:11:40.900 |
well, we have to be very good at handling noisy labels, which we're going to learn about in 02:11:47.620 |
And it also hopefully tells you that if you already have noisy labels, don't worry about it. 02:11:54.560 |
All of the research we have tells us that we can handle labels where the thing's totally 02:12:00.880 |
missing or sometimes it's wrong, as long as it's not biased. 02:12:06.600 |
And one of the things it'll do is it'll learn to find things associated with a tench. 02:12:12.760 |
So if there's a middle-aged man looking very happy outside, could well be a tench. 02:12:19.720 |
Okay, so this is a bit of research that we're currently working on. 02:12:25.440 |
And hopefully I'll have some results to show you soon. 02:12:28.600 |
But our view is that this image warping approach is probably going to give us better results 02:12:33.960 |
than the traditional ImageNet style augmentations. 02:12:39.880 |
So here's our final transform for tilting in arbitrary directions, and here's the result. 02:12:52.360 |
The first is that it's really important to measure everything. 02:12:57.960 |
And I and many people have been shocked to discover that actually the time it takes to 02:13:03.500 |
convert an image into a float tensor is significantly longer than the amount of time it takes to do the augmentation itself. 02:13:18.600 |
So you may be thinking this image warping thing sounds really hard and slow, but be 02:13:25.280 |
careful, just converting bytes to floats is really hard and slow. 02:13:30.780 |
And then this is the one, as I mentioned, this one we're using here is the one that comes 02:13:36.120 |
We found another version that's like twice as fast, which goes directly to float. 02:13:42.460 |
So this is the one that we're going to be using. 02:13:45.400 |
So time everything, you know, if things are not running fast enough. 02:13:49.880 |
Okay, here's the thing I'm really excited about for augmentation: this stuff's all running on the CPU at the moment. 02:13:59.840 |
What if I told you, you could do arbitrary affine transformations. 02:14:06.880 |
So warping, zooming, rotating, shifting, at a speed the CPU versions just can't compare with. 02:14:25.100 |
So up to like, you know, an order of magnitude or more faster. 02:14:34.180 |
So we can actually do augmentation on the GPU. 02:14:36.880 |
And the trick is that PyTorch gives us all the functionality to make it happen. 02:14:42.440 |
So the key thing we have to do is to actually realize that our transforms, our augmentation 02:14:53.200 |
For our augmentation, we don't create one random number. 02:14:56.540 |
We create a mini batch of random numbers, which is fine because PyTorch has the ability 02:15:01.260 |
to generate batches of random numbers on the GPU. 02:15:04.880 |
And so then once we've got a mini batch of random numbers, then we just have to use that 02:15:10.200 |
to generate a mini batch of augmented images. 02:15:18.760 |
If you've done some computer vision, the next bit will look familiar; but if you're not a computer vision person, maybe not. 02:15:18.760 |
But basically, we create something called an affine grid, which is just the coordinates 02:15:26.800 |
of where is every pixel, so like literally is coordinates from minus one to one. 02:15:33.720 |
And then what we do is we multiply it by this matrix, which is called an affine transform. 02:15:41.840 |
And there are various kinds of affine transforms you can do. 02:15:44.440 |
For example, you can do a rotation transform by using this particular matrix, but these 02:15:52.320 |
And then you just, as you see here, you just do the matrix multiplication and this is how 02:15:58.240 |
So a rotation, believe it or not, is just a matrix multiplication by this little matrix. 02:16:06.320 |
If you do that, normally it's going to take you about 17 milliseconds; we can speed it 02:16:12.880 |
up a bit with einsum, or we could speed it up a little bit more with batch matrix multiply, 02:16:20.760 |
or we could stick the whole thing on the GPU and do it there. 02:16:26.000 |
And that's going to go from 11 milliseconds to 81 microseconds. 02:16:32.520 |
So if we can put things on the GPU, it's totally different. 02:16:37.120 |
And suddenly we don't have to worry about how long our augmentation is taking. 02:16:41.760 |
So this is the thing that actually rotates the coordinates, to say where each coordinate should come from. 02:16:48.080 |
And believe it or not, PyTorch has an optimized batch-wise interpolation function. 02:16:59.600 |
And not only do they have a grid sample, but this is actually even better than Pillow's, because of what you can do with padding. 02:17:05.520 |
You can say padding mode equals reflection, and the black edges are gone. 02:17:10.360 |
It just reflects what was there, which most of the time is better. 02:17:14.920 |
And so reflection padding is one of these little things we find definitely helps models. 02:17:20.160 |
So now we can put this all together into a rotate batch. 02:17:24.920 |
We can do any kind of coordinate transform here. 02:17:32.160 |
And yeah, as I say, it's dramatically faster. 02:17:38.720 |
Or in fact, we can do it all in one step because PyTorch has a thing called affine grid that 02:17:44.760 |
will actually do the multiplication as it creates a coordinate grid. 02:17:49.260 |
And this is where we get down to this incredibly fast speed. 02:17:53.040 |
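A hedged sketch of the whole idea: one random angle per image in the batch, one affine matrix per image, then affine_grid plus grid_sample with reflection padding, all on the GPU. It assumes x is a float batch of shape (batch, channels, height, width) already on CUDA; it is an illustrative sketch, not the notebook's exact implementation.

```python
import math
import torch
import torch.nn.functional as F

def rotate_batch(x, max_deg=30):
    bs = x.size(0)
    # a batch of random rotation angles, generated directly on the same device as x
    angles = (torch.rand(bs, device=x.device) - 0.5) * 2 * max_deg * math.pi / 180
    cos, sin = angles.cos(), angles.sin()
    # one 2x3 affine matrix per image
    theta = torch.zeros(bs, 2, 3, device=x.device)
    theta[:, 0, 0], theta[:, 0, 1] = cos, -sin
    theta[:, 1, 0], theta[:, 1, 1] = sin,  cos
    grid = F.affine_grid(theta, x.size())                      # batch of coordinate grids
    return F.grid_sample(x, grid, padding_mode='reflection')   # reflection padding: no black edges

# usage sketch: x_aug = rotate_batch(x_batch.cuda())
```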
So I feel like there's a whole, you know, big opportunity here. 02:18:00.440 |
There are currently no kind of hackable, anybody-can-write-their-own augmentation libraries that work this way on the GPU. 02:18:11.440 |
The entire fastai.vision library is written using PyTorch Tensor operations. 02:18:17.480 |
We did it so that we could eventually do it this way. 02:18:20.040 |
But currently they all run on the CPU, one image at a time. 02:18:30.760 |
And so whatever domain you're working in, you can hopefully start to try out these, 02:18:37.720 |
you know, randomized GPU batch wise augmentations. 02:18:44.240 |
And next week we're going to show you this magic data augmentation called mixup that's 02:18:51.280 |
going to work on the GPU, it's going to work on every kind of domain that you can think 02:18:55.560 |
of, and will possibly make most of these irrelevant, because it's so good you possibly don't need anything else. 02:19:04.400 |
So that and much more next week, we'll see you then.