Lesson 11 (2019) - Data Block API, and generic optimizer
Chapters
0:00 Introduction
1:20 Batch norm
3:50 LSUV
10:00 ImageNet
12:15 New Data Sets
15:55 Question
18:15 Importing data
25:20 The purpose of deep learning
29:40 Getting the files
36:00 Split validation set
39:30 Labeling
53:50 Data Bunch
Well, welcome back, welcome to lesson 11, where we're going to be talking mainly about 00:00:13.200 |
I said we would be talking about fastai.audio, but that's going to be a little bit later 00:00:19.460 |
We haven't quite got to where I wanted to get to yet, so everything I said we'd talk 00:00:23.200 |
about last week, we will talk about, but it might take a few more lessons to get there 00:00:31.600 |
So this is kind of where we're up to, is the last little bit of our CNN, and specifically 00:00:39.440 |
these were the things we were going to dive into when we've done the first four of these. 00:00:51.040 |
So we're going to keep going through this process to try to create our state-of-the-art 00:01:00.000 |
The specific items that we're working with are images, but everything we've covered so 00:01:07.960 |
far is equally valid, equally used for tabular, collaborative filtering, and text, and pretty 00:01:20.340 |
Last week we talked about BatchNorm, and I just wanted to mention that at the end of 00:01:26.760 |
the BatchNorm notebook there's another bit called simplified running BatchNorm. 00:01:32.200 |
We talked a little bit about de-biasing last week, we'll talk about it more today, but 00:01:37.400 |
Stas Bekman pointed out something which is kind of obvious in hindsight, but I didn't 00:01:41.600 |
notice at the time, which is that we had sums divided by de-bias, and we had count divided 00:01:49.040 |
by de-bias, and then we go sum divided by count, and sum divided by de-bias divided 00:01:54.800 |
by count divided by de-bias, the two de-bias cancel each other out, so we can remove all 00:02:01.680 |
We're still going to cover de-biasing today for a different purpose, but actually we didn't 00:02:06.360 |
really need it for last week, so we can remove all the de-biasing and end up with something 00:02:16.320 |
That's the version that we're going to go with. 00:02:21.080 |
Also thanks to Tom Viehmann, who pointed out that the last step where we went subtract 00:02:30.440 |
mean divided by standard deviation multiplied by mults add adds, you can just rearrange 00:02:36.120 |
that into this form, mults divided by variances and adds minus means times that factor, and 00:02:46.200 |
if you do it this way, then you don't actually have to touch X until you've done all of those 00:02:54.200 |
things, and so that's going to end up being faster. 00:02:56.160 |
If you think through the broadcasting operations there, then you're doing a lot less computation 00:03:07.720 |
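As a rough illustration of that rearrangement (a minimal sketch with made-up names, not the notebook's actual BatchNorm code), the idea is to do the arithmetic on the small per-channel statistics first and only touch the big activation tensor once:

```python
import torch

def bn_naive(x, means, vars_, mults, adds, eps=1e-5):
    # four broadcast ops, each touching the full activation tensor x
    return (x - means) / (vars_ + eps).sqrt() * mults + adds

def bn_rearranged(x, means, vars_, mults, adds, eps=1e-5):
    # do the arithmetic on the small per-channel stats first,
    # then touch x only once: x * factor + offset
    factor = mults / (vars_ + eps).sqrt()
    offset = adds - means * factor
    return x * factor + offset

x = torch.randn(64, 8, 28, 28)
means = x.mean((0, 2, 3), keepdim=True)
vars_ = x.var((0, 2, 3), keepdim=True)
mults = torch.ones(1, 8, 1, 1)
adds  = torch.zeros(1, 8, 1, 1)
assert torch.allclose(bn_naive(x, means, vars_, mults, adds),
                      bn_rearranged(x, means, vars_, mults, adds), atol=1e-5)
```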
And I should mention Tom has been helping us out quite a bit with some of this batch 00:03:12.640 |
norm stuff, one of the few people involved in the PyTorch community who's been amazingly 00:03:20.400 |
So thanks also to Soumith Chintala, who was one of the original founders of PyTorch, who's 00:03:27.600 |
been super helpful in sorting some things out for this course, and also Francisco Massa, 00:03:36.240 |
They're both part of the official Facebook engineering team, and Tom's not, but he does 00:03:42.040 |
so much work for PyTorch, he kind of seems like he must be sometimes. 00:03:50.040 |
Okay, before we moved on to data blocks, I wanted to mention one other approach to making 00:03:59.120 |
sure that your model trains nicely, and to me this is the most fast AI-ish method. 00:04:12.280 |
Wonderful researcher named Dmytro Mishkin came up with it in a paper called All You Need Is 00:04:16.560 |
a Good Init, and this is the paper. 00:04:24.600 |
And he came up with this technique called LSUV, Layer-wise Sequential Unit Variance. 00:04:28.800 |
So the basic idea is this, you've seen now how fiddly it is to get your unit variances 00:04:36.400 |
all the way through your network, and little things can change that. 00:04:42.120 |
So if you change your activation function, or something we haven't mentioned, if you 00:04:46.280 |
add dropout, or change the amount of dropout, these are all going to impact the variances 00:04:53.560 |
of your layer outputs, and if they're just a little bit different to one, you'll get 00:04:58.900 |
exponentially worse as we saw through the model. 00:05:03.360 |
So the normal approach to fixing this is to think really carefully about your architecture, 00:05:08.240 |
and exactly, analytically, figure out how to initialize everything so it works. 00:05:12.720 |
And Dmytro's idea, which I like a lot better, is let the computer figure it out, and here's 00:05:20.960 |
We create our MNIST data set in the same way as before, we create a bunch of layers with 00:05:25.040 |
these number of filters like before, and what I'm going to do is I'm going to create a conv 00:05:29.680 |
layer class which contains our convolution and our relu, and the idea is that we're going 00:05:36.240 |
to use this because now we can basically say this whole kind of combined conv plus relu 00:05:41.400 |
has kind of a, I'm calling it bias, but actually I'm taking that GeneralRelu and just saying 00:05:51.480 |
So this is kind of like something we can add or remove. 00:05:56.720 |
And then the weight is just the conv weights. 00:05:58.960 |
And you'll see why we're doing this in a moment. 00:06:01.220 |
Basically what we'll do is we'll create our learner in the usual way, and however it initializes 00:06:09.380 |
And so we can train it, that's fine, but let's try and now train it in a better way. 00:06:16.920 |
So let's recreate our learner, and let's grab a single minibatch. 00:06:25.000 |
And here's a function that will let us grab a single minibatch, making sure we're using 00:06:28.160 |
all our callbacks that the minibatch does all the things we needed to do. 00:06:37.680 |
And what we're going to do is we're going to find all of the modules that are, we're 00:06:45.120 |
going to find all of the modules which are of type conv layer. 00:06:51.460 |
And so it's just a little function that does that. 00:06:55.360 |
And generally speaking, when you're working with PyTorch modules or with neural nets more 00:07:00.800 |
generally, you need to use recursion a lot because modules can contain modules, can contain 00:07:06.000 |
So you can see here, find modules, calls find modules to find out all the modules throughout 00:07:11.400 |
your kind of tree, because really a module is like a tree. 00:07:17.960 |
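A minimal sketch of that recursive search (not necessarily the notebook's exact signature):

```python
import torch.nn as nn

def find_modules(module, cond):
    # recurse through the tree of children, collecting every module
    # for which cond(m) is True
    found = [module] if cond(module) else []
    for child in module.children():
        found += find_modules(child, cond)
    return found

# e.g. grab every Conv2d in a model
model = nn.Sequential(nn.Conv2d(3, 8, 3),
                      nn.Sequential(nn.Conv2d(8, 16, 3), nn.ReLU()))
convs = find_modules(model, lambda m: isinstance(m, nn.Conv2d))
assert len(convs) == 2
```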
And so here's our list of all of our conv layers. 00:07:20.820 |
And then what we do is we create a hook, right? 00:07:25.680 |
And the hook is just going to grab the mean and standard deviation of a particular module. 00:07:31.200 |
And so we can first of all just print those out. 00:07:33.640 |
And we can see that the means and standard deviations are not zero one. 00:07:38.840 |
The means are too high, as we know, because we've got the ReLUs. 00:07:49.460 |
So rather than coming up with our perfect init, instead, we just create a loop. 00:07:57.740 |
And the loop calls the model, passing in that mini-batch we have, right? 00:08:03.360 |
And remember, this is -- so first of all, we hook it, right? 00:08:12.800 |
We check whether the mean, the absolute value of the mean is close to zero. 00:08:17.800 |
And if it's not, we subtract the mean from the bias. 00:08:21.840 |
And so it just keeps looping through, calling the model again with the hook, subtracting 00:08:30.160 |
And then we do the same thing for the standard deviation. 00:08:32.960 |
Keep checking whether standard deviation minus one is nearly zero. 00:08:37.240 |
And as long as it isn't, we'll keep dividing by the standard deviation. 00:08:42.100 |
And so those two loops, if we run this function, then it's going to eventually give us what 00:08:57.760 |
Because we do the means first and then the standard deviations, and the standard deviation 00:09:03.640 |
But you can see our standard deviations are perfectly one. 00:09:13.040 |
And this is how, without thinking at all, you can initialize any neural network pretty 00:09:20.440 |
much to get the unit variance all the way through. 00:09:24.440 |
And this is much easier than having to think about whether you've got ReLU or ELU or whether 00:09:36.720 |
Yeah, and then we can train it, and it trains very nicely. 00:09:40.680 |
Particularly useful for complex and deeper architectures. 00:09:44.400 |
So there's kind of, for me, the fast AI approach to initializing your neural nets, which is 00:09:54.920 |
Just a simple little for loop, or in this case, a while loop. 00:09:59.600 |
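Here's a rough sketch of that LSUV loop, not the notebook's exact code. It assumes each hooked layer exposes a `bias` applied after its activation and a `weight` (the way the ConvLayer in the notebook does); `find_modules` and `ConvLayer` in the usage comment are the names used above and are assumptions here.

```python
import torch

def lsuv_layer(model, layer, xb, tol=1e-3, max_iters=50):
    """Run the mini-batch through the model, read this layer's output stats
    via a hook, and nudge its bias/weights until the output has
    roughly zero mean and unit std."""
    stats = {}
    def hook(mod, inp, out):
        stats['mean'], stats['std'] = out.mean().item(), out.std().item()

    h = layer.register_forward_hook(hook)
    with torch.no_grad():
        for _ in range(max_iters):                    # cap the loop, just in case
            model(xb)
            if abs(stats['mean']) < tol: break
            layer.bias -= stats['mean']               # shift the output mean towards 0
        for _ in range(max_iters):
            model(xb)
            if abs(stats['std'] - 1) < tol: break
            layer.weight.data /= stats['std']         # scale the output std towards 1
    h.remove()
    return stats['mean'], stats['std']

# applied to every conv layer found earlier, e.g.:
# for l in find_modules(model, lambda m: isinstance(m, ConvLayer)): lsuv_layer(model, l, xb)
```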
So I think we've done enough with MNIST because we're getting really good results. 00:10:13.960 |
Well, we're not quite ready to try ImageNet because ImageNet takes quite a lot of time. 00:10:18.880 |
You know, a few days if you've got just one GPU to train. 00:10:22.460 |
And that's really frustrating and an expensive way to try to practice things or learn things 00:10:30.640 |
I kept finding this problem of not knowing what data set I should try for my research 00:10:38.960 |
You know, it seemed like at one end there was MNIST, which is kind of too easy. 00:10:41.920 |
There was CIFAR-10 that a lot of people use, but these are 32 by 32 pixel images. 00:10:47.760 |
And it turns out, and this is something I haven't seen really well written about, but 00:10:51.960 |
our research clearly shows, it turns out that small images, 32 by 32, have very different 00:11:00.920 |
And specifically, it seems like once you get beneath about 96 by 96, things behave really 00:11:06.600 |
So stuff that works well on CIFAR-10 tends not to work well on normal sized images. 00:11:14.840 |
And stuff that tends to work well on CIFAR-10 doesn't necessarily work well on ImageNet. 00:11:20.440 |
There's this kind of gap of like something with normal sized images, which I can train 00:11:27.520 |
in a sane amount of time, but also gives me a good sense of whether something's going 00:11:33.880 |
And actually, Dmytro, who wrote that LSUV paper we just looked at, also had a fantastic 00:11:40.960 |
paper called something like Systematic Evaluation of CNN Advances on the ImageNet. 00:11:48.120 |
And he noticed that if you use 128 by 128 images with ImageNet, then the kind of things 00:11:55.680 |
that he found works well or doesn't work well, all of those discoveries applied equally well 00:12:05.120 |
128 by 128 for 1.3 million images, still too long. 00:12:09.760 |
So I thought that was a good step, but I wanted to go even further. 00:12:17.120 |
And my two new data sets are subsets of ImageNet. 00:12:21.760 |
And there's kind of like multiple versions in here, it really is, but they're both subsets 00:12:27.200 |
They both contain just 10 classes out of 1,000. 00:12:30.160 |
So they're 1/100 of the number of images of ImageNet. 00:12:34.160 |
And I create a number of versions, full size, 320 pixel size, and 160 pixel size. 00:12:41.160 |
One data set is specifically designed to be easy. 00:12:44.360 |
It contains 10 classes that are all very different to each other. 00:12:51.720 |
I thought, well, what if I create this data set, then maybe I could train it for like 00:12:55.480 |
just an epoch or two, like just a couple of minutes and see whether something was going 00:13:01.720 |
And then the second one I created was one designed to be hard, which is 10 categories 00:13:08.280 |
that are designed to be very similar to each other, so they're all dog breeds. 00:13:12.100 |
So the first data set is called Imagenette, which is very French, as you can hear. 00:13:19.080 |
And there's some helpful pronunciation tips here. 00:13:26.200 |
And you can see here I've created a leaderboard for Imagenette and for Imagewoof. 00:13:34.600 |
And I've discovered that in my very quick experiments with this, the exact observations 00:13:40.440 |
I find about what works well for the full ImageNet, also I see the same results here. 00:13:46.800 |
And it's also fascinating to see how some things are the same between the two data sets 00:13:55.940 |
And I found working with these two data sets has given me more insight into computer vision 00:14:03.200 |
model training than anything else that I've done. 00:14:09.600 |
And I really wanted to mention this to say, a big part of getting good at using deep learning 00:14:15.840 |
in your domain is knowing how to create like small, workable, useful data sets. 00:14:23.280 |
So once I decided to make this, it took me about three hours. 00:14:26.120 |
Like, it's not at all hard to create a data set, it's a quick little Python script to 00:14:33.080 |
How did I decide which 10 things, I just looked at a list of categories and picked 10 things 00:14:39.920 |
How did I decide to pick these things, I just looked at 10 things that I knew are dogs. 00:14:46.440 |
So it's like just a case of like, throw something together, get it working, and then on your 00:14:52.200 |
domain area, whether it's audio or Sanskrit texts or whatever, or genomic sequences, try 00:14:59.920 |
to come up with your version of a toy problem or two which you hope might give insight into 00:15:09.640 |
And if you're interested in computer vision, I would strongly recommend trying this out. 00:15:15.600 |
Because trying to beat me, and these are not great, they're just okay, but trying to beat 00:15:19.800 |
me will give you a sense of whether the things you're thinking about are in the ballpark 00:15:26.480 |
of what a moderately competent practitioner is able to do in a small amount of time. 00:15:32.920 |
It's also interesting to see that with like a 1/100th the size of ImageNet, like a tiny 00:15:39.600 |
data set, I was able to create a 90% accurate dog breed classifier from random weights. 00:15:45.980 |
So like you can do a lot pretty quickly without much data, even if you don't have transfer 00:15:58.080 |
So before we look at the data set, let's do the question. 00:16:03.000 |
>> So just to confirm, LSUV is something you run on all the layers once at the beginning, 00:16:19.760 |
So you'd run it once at the start of training to initialize your weights, just so that that 00:16:25.120 |
initial set of steps gives you sensible gradients, because it's those first few mini batches 00:16:32.200 |
Remember how we saw that if we didn't have a very good first few mini batches that we 00:16:37.080 |
ended up with 90% of the weights being, 90% of the activations being inactive. 00:16:43.320 |
So that's why we want to make sure we start well. 00:16:47.200 |
And yeah, if you've got a small mini batch, just run five mini batches and take the mean. 00:16:51.960 |
There's nothing special about the one mini batch, it's just a fast way to do the computation. 00:16:56.280 |
It's not like we're doing any gradient descent or anything. 00:17:05.160 |
So ImageNet is too big to read it all into RAM at once. 00:17:18.240 |
So we're going to need to be able to read it in one image at a time, which is going 00:17:22.120 |
to be true of most of our deep learning projects. 00:17:24.800 |
So we need some way to do that from scratch, because that's the rules. 00:17:33.360 |
And in the process, we're going to end up building a data block API, which you're all 00:17:39.160 |
But most people using the data block API feel familiar enough with it to do small tweaks 00:17:45.880 |
for things that they kind of know they can do. 00:17:48.160 |
But most people I speak to don't know how to really change what's going on. 00:17:54.080 |
So by the end of this notebook, you'll see how incredibly simple the data block API is. 00:17:59.680 |
And you'll be able to either write your own, maybe based on this one, or modify the one 00:18:04.320 |
in fast.ai, because this is a very direct translation of the one that's in fast.ai. 00:18:15.720 |
So the first thing to do is to read in our data. 00:18:19.760 |
And we'll see a similar thing when we build fastai.audio. 00:18:24.320 |
But whatever process you use, you're going to have to find some library that can read 00:18:32.320 |
And there's a library called PIL, or Pillow, Python Imaging Library, which can read images. 00:18:44.000 |
And we want to see what's inside our Imagenette data set. 00:18:51.160 |
Typing list(x.iterdir()) is far too complicated for me. 00:18:57.320 |
This is how easy it is to add stuff to the standard library. 00:19:00.640 |
You can just take the class and add a function to it. 00:19:07.920 |
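For example, the kind of monkey-patch meant here might look like this (a hedged sketch; `.ls` is just a convenience name):

```python
from pathlib import Path

# Patch an .ls() method onto Path so that path.ls() does
# what list(path.iterdir()) does.
Path.ls = lambda self: list(self.iterdir())

path = Path('.')
print(path.ls())
```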
So we've got a training and a validation directory. 00:19:11.840 |
Within validation, we have one directory for each category. 00:19:17.600 |
And then if we look at one category, we could grab one file name. 00:19:22.560 |
And if we look at one file name, we have a TENCH. 00:19:27.120 |
So if you want to know whether somebody is actually a deep learning practitioner, show 00:19:32.680 |
If they don't know it's a TENCH, they're lying to you, because this is the first category 00:19:37.240 |
So if you're ever using ImageNet, you know your TENCHs. 00:19:41.640 |
They're generally being held up by middle-aged men, or sometimes they're in nets. 00:19:48.520 |
That's pretty much how it always looks in ImageNet. 00:19:52.860 |
So that's why we have them in Imagenette too, because it's such a classic computer vision 00:20:00.480 |
We're cheating and importing NumPy for a moment, just so I can show you what an image contains, 00:20:05.640 |
just to turn it into an array so I can print it for you. 00:20:11.840 |
It contains integers between 0 and 255. 00:20:18.360 |
So this is what we get when we load up an image. 00:20:24.360 |
And it's got a geometry, and it's got a number of channels. 00:20:34.120 |
So we want to have some way to read in lots of images, which means we need to know what 00:20:40.520 |
images there are in this directory structure. 00:20:44.240 |
And in the full ImageNet, there's going to be 1.3 million of them. 00:20:49.120 |
So the first thing we need to know is which things are images. 00:20:54.720 |
Your computer already has a list of image extensions. 00:21:00.160 |
So you can query Python for your MIME types database for all of the images. 00:21:05.120 |
So here's a list of the image extensions that my computer knows about. 00:21:09.960 |
So now what I want to do is I want to loop through all the files in a directory and find 00:21:18.560 |
The fastest way to check whether something's in a list is to first of all turn it into 00:21:26.600 |
So Setify simply checks if it is a set, and if it is, it makes it one. 00:21:30.440 |
Otherwise, it first turns it into a list, and then turns it into a set. 00:21:37.000 |
And here's what I do when I build a little bit of functionality. 00:21:39.720 |
I just throw together a quick bunch of tests to make sure it seems to be roughly doing 00:21:45.320 |
And do you remember, in lesson one, we created our own test framework. 00:21:49.480 |
So we can now run any notebook as a test suite. 00:21:53.700 |
So it will automatically check if we break this at some point. 00:21:58.160 |
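A minimal version of `setify` (plus the `listify` helper it leans on), with a couple of the quick tests described:

```python
def listify(o):
    # a minimal version: None -> [], strings and non-iterables get wrapped in a list
    if o is None: return []
    if isinstance(o, (list, tuple)): return list(o)
    if isinstance(o, str): return [o]
    try: return list(o)
    except TypeError: return [o]

def setify(o):
    # already a set? leave it; otherwise listify, then turn it into a set
    return o if isinstance(o, set) else set(listify(o))

assert setify({'.jpg'}) == {'.jpg'}
assert setify('.jpg') == {'.jpg'}
assert setify(['.jpg', '.png', '.jpg']) == {'.jpg', '.png'}
```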
So now we need a way to go through a single directory and grab all of the images in that. 00:22:09.320 |
I always like to make sure that you can pass any of these things either a pathlib Path 00:22:16.280 |
or a string. So if you just say p = Path(p), if it's already a pathlib object, that doesn't do 00:22:22.840 |
So this is a nice, easy way to make sure that works. 00:22:25.620 |
So we just go through-- here's our pathlib object. 00:22:33.640 |
And so you'll see in a moment how we actually grab the list of files. 00:22:42.560 |
If it does, that's a Unix hidden file, or a Mac hidden file. 00:22:47.680 |
And we also check either they didn't ask for some particular extensions, or that the extension 00:22:55.320 |
So that will allow us to grab just the image files. 00:23:00.480 |
Python has something called scandir (os.scandir), which will grab a path and list all of the files 00:23:08.600 |
We go scandir, and then we go get_files, and it looks something like that. 00:23:20.560 |
And so this is something where we say, for some path, give me things with these extensions, 00:23:26.000 |
optionally recurse, optionally only include these folder names. 00:23:32.520 |
I will go through it in detail, but I'll just point out a couple of things, because being 00:23:34.880 |
able to rapidly look through files is important. 00:23:38.640 |
The first is that scandir is super, super fast. 00:23:46.720 |
So this is a really great way to quickly grab stuff for a single directory. 00:23:57.580 |
This is the thing that uses scandir internally to walk recursively through a folder tree. 00:24:04.560 |
And you can do cool stuff like change the list of directories that it's going to look 00:24:11.340 |
And it basically returns all the information that you need. 00:24:15.160 |
So os.walk and os.scandir are the things that you want to be using if you're playing with 00:24:19.880 |
directories and files in Python, and you need it to be fast. 00:24:28.180 |
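A simplified sketch of get_files along those lines (no include-folder filtering or hidden-directory pruning, so not the notebook's exact code):

```python
import os
from pathlib import Path

def _get_files(p, fs, extensions=None):
    p = Path(p)
    # skip hidden files; if extensions were given, keep only matching suffixes
    return [p/f for f in fs
            if not f.startswith('.')
            and (not extensions or f'.{f.split(".")[-1].lower()}' in extensions)]

def get_files(path, extensions=None, recurse=False):
    path = Path(path)
    extensions = {e.lower() for e in extensions} if extensions else None
    if recurse:
        res = []
        for dirpath, dirnames, filenames in os.walk(path):   # os.walk uses scandir internally
            res += _get_files(dirpath, filenames, extensions)
        return res
    # single directory: os.scandir is very fast
    fs = [o.name for o in os.scandir(path) if o.is_file()]
    return _get_files(path, fs, extensions)
```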
And so now we can say get files, path, tench, just the image extensions. 00:24:35.540 |
And then we're going to need recurse, because we've got a few levels of directory structure. 00:24:42.480 |
So if we try to get all of the file names, we have 13,000. 00:24:47.620 |
And specifically, it takes 70 milliseconds to get 13,000 file names. 00:24:55.800 |
For me to look at 13,000 files in Windows Explorer seems to take about four minutes. 00:25:06.120 |
So the full image net, which is 100 times bigger, it's going to be literally just a 00:25:11.620 |
So this gives you a sense of how incredibly fast these os.walk and scandir functions 00:25:23.720 |
I've often been confused as to whether the code Jeremy is writing in the notebooks or 00:25:27.600 |
functionality that will be integrated into the fast AI library, or whether the functions 00:25:32.520 |
and classes are meant to be written and used by the user interactively and on the fly. 00:25:41.600 |
Well I guess that's really a question about what's the purpose of this deep learning from 00:25:49.040 |
And different people will get different things out of it. 00:25:51.080 |
But for me, it's about demystifying what's going on so that you can take what's in your 00:26:01.980 |
And to do that would be always some combination of using things that are in existing libraries, 00:26:06.080 |
which might be fast AI or PyTorch or TensorFlow or whatever, and partly will be things that 00:26:12.400 |
And I don't want you to be in a situation where you say, well, that's not in fast AI, 00:26:19.880 |
So really the goal is, this is why it's also called impractical deep learning for coders, 00:26:25.640 |
is to give you the underlying expertise and tools that you need. 00:26:31.880 |
In practice, I would expect a lot of the stuff I'm showing you to end up in the fast AI library 00:26:38.560 |
because that's like literally I'm showing you my research, basically. 00:26:43.440 |
This is like my research journal of the last six months. 00:26:46.720 |
And that's what happens: I take our research and turn it into the fastai library. 00:46:53.580 |
And some of it, like this function, is pretty much copied and pasted from the existing fast 00:26:58.800 |
AI V1 code base because I spent at least a week figuring out how to make this fast. 00:27:06.520 |
I'm sure most people can do it faster, but I'm slow and it took me a long time, and this 00:27:12.200 |
So yeah, I mean, it's going to map pretty closely to what's in fast AI already. 00:27:20.840 |
Where things are new, we're telling you, like running batch norm is new, today we're going 00:27:27.560 |
But otherwise things are going to be pretty similar to what's in fast AI, so it'll make 00:27:35.400 |
And as fast AI changes, it's not going to surprise you because you'll know what's going 00:27:53.920 |
It's a little more awkward to work with because it doesn't try to do so much. 00:28:01.880 |
I suspect glob probably uses it behind the scenes. 00:28:08.080 |
Time it with glob, time it with Skander, probably depends how you use glob exactly. 00:28:12.720 |
But I remember I used to use glob and it was quite a bit slower. 00:28:16.520 |
And when I say quite a bit, you know, for those of you that have been using fast AI 00:28:22.040 |
for a while, you might have noticed that the speed at which you can grab the image net 00:28:27.340 |
folder is some orders of magnitude faster than it used to be. 00:28:34.080 |
Okay. So the reason that fast AI has a data blocks API and nobody else does is because 00:28:44.400 |
I got so frustrated in the last course at having to create every possible combination 00:28:49.200 |
of independent and dependent variable that I actually sat back for a while and did some 00:28:56.480 |
And specifically, this is what I did: I sat back and wrote this down, it's like 00:29:06.480 |
We need some way to split the validation set or multiple validation sets out, some way 00:29:11.640 |
to do labeling, optionally some augmentation, transform it to a tensor, make it into data, 00:29:18.440 |
into batches, optionally transform the batches, and then combine the data loaders together 00:29:25.520 |
into a data bunch and optionally add a test set. 00:29:29.080 |
And so when I wrote it down like that, I just went ahead and implemented an API for each 00:29:33.940 |
of those things to say like, okay, you can plug in anything you like to that part of 00:29:41.320 |
So we've already got the basic functionality to get the files. 00:29:48.120 |
We already created that list container, right? 00:29:51.560 |
So we basically can just dump our files into a list container. 00:29:55.680 |
But in the end, what we actually want is an image list for this one, right? 00:30:01.480 |
And an image list, when you call, so we're going to have this get method and when you 00:30:05.720 |
get something from the image list, it should open the image. 00:30:13.600 |
But we could get all kinds of different objects. 00:30:16.800 |
So therefore, we have this superclass, which has a get method that you override. 00:30:22.640 |
And by default, it just returns whatever you put in there, which in this case would be 00:30:29.320 |
So this is basically all item list does, right? 00:30:34.880 |
So in this case, it's going to be our file names, the path that they came from. 00:30:40.160 |
And then optionally also, there could be a list of transforms, right? 00:30:46.960 |
And we'll look at this in more detail in a moment. 00:30:48.800 |
But basically, what will happen is, when you index into your item list, remember, dunder getitem 00:30:55.200 |
(__getitem__) does that, we'll pass that back up to list container's __getitem__, and that will return 00:31:05.400 |
And if it's a single item, we'll just call self._get. 00:31:09.320 |
If it's a list of items, we'll call self._get on all of them. 00:31:13.440 |
And what that's going to do is it's going to call the get method, which in the case 00:31:25.120 |
So for those of you that haven't done any kind of more functional-style programming, 00:31:29.720 |
compose is just a concept that says, go through a list of functions and call the function 00:31:40.720 |
and replace myself with a result of that, and then call the next function and replace 00:31:48.400 |
So in other words, a deep neural network is just a composition of functions. 00:31:56.840 |
This compose does a little bit more than most composers. 00:32:00.040 |
Specifically, you can optionally say, I want to order them in some way. 00:32:05.800 |
And it checks to see whether the things have an underscore order key and sorts them. 00:32:12.040 |
And also, you could pass in some keyword arguments. 00:32:14.560 |
And if you do, it'll just keep passing in those keyword arguments. 00:32:17.480 |
But it's basically other than that, it's a pretty standard function composition function. 00:32:22.760 |
If you haven't seen compose used elsewhere in programming before, Google that because 00:32:33.800 |
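A minimal compose along the lines described, including the optional `_order` sorting:

```python
def compose(x, funcs, *args, order_key='_order', **kwargs):
    """Apply each function in `funcs` to x in turn (sorted by an optional
    `_order` attribute), passing any extra keyword arguments along."""
    if funcs is None: funcs = []
    if not isinstance(funcs, (list, tuple)): funcs = [funcs]
    key = lambda o: getattr(o, order_key, 0)
    for f in sorted(funcs, key=key):
        x = f(x, *args, **kwargs)
    return x

# e.g. composing two simple transforms, ordered by _order
add_one = lambda x: x + 1; add_one._order = 0
double  = lambda x: x * 2; double._order  = 1
assert compose(3, [double, add_one]) == 8   # (3 + 1) * 2
```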
And as you can see in this case, it means I can just pass in a list of transforms. 00:32:37.840 |
And this will simply call each of those transforms in turn modifying, in this case, the image 00:32:49.120 |
And then here's a method to create an image list from a path. 00:32:58.060 |
And then that's going to give us a list of files, which we will then pass to the class 00:33:02.800 |
constructor which expects a list of files or a list of something. 00:33:07.720 |
So this is basically the same as item list in fastai version 1. 00:33:17.240 |
It's just a list where when you try to index into it, it will call something which subclasses 00:33:33.160 |
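Putting those pieces together, a stripped-down sketch of the item list idea (indexing applies get plus the ordered transforms; boolean-mask indexing and the other conveniences are left out):

```python
from pathlib import Path
from PIL import Image

class ItemList:
    """A list of items where indexing applies get() plus the
    (order-sorted) transforms to whatever comes back."""
    def __init__(self, items, path='.', tfms=None):
        self.items, self.path, self.tfms = list(items), Path(path), tfms or []

    def __len__(self):  return len(self.items)
    def __repr__(self): return f'{self.__class__.__name__} ({len(self)} items)'

    def new(self, items):
        # same subclass, same path, same transforms -- used later when splitting
        return self.__class__(items, self.path, self.tfms)

    def get(self, item): return item                   # subclasses override this

    def _get(self, item):
        item = self.get(item)
        for t in sorted(self.tfms, key=lambda o: getattr(o, '_order', 0)):
            item = t(item)                             # apply the transforms in order
        return item

    def __getitem__(self, idx):
        if isinstance(idx, slice): return [self._get(o) for o in self.items[idx]]
        if isinstance(idx, (list, tuple)): return [self._get(self.items[i]) for i in idx]
        return self._get(self.items[idx])

class ImageList(ItemList):
    def get(self, fn): return Image.open(fn)           # open the image lazily
```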
Now one thing that happens all the time is you try to create a mini batch of images. 00:33:42.860 |
And when pillow opens up a black and white image, it gives you back by default a rank-2 (single-channel) image. 00:33:55.140 |
And then you can't stack them into a mini batch because they're not all the same shape. 00:34:01.000 |
So what you can do is you can call Pillow's convert('RGB'). 00:34:07.680 |
And if something's not RGB, it'll turn it into RGB. 00:34:15.600 |
So a transform is just a class with an underscore order. 00:34:19.120 |
And then make RGB is a transform that when you call it will call convert. 00:34:33.720 |
You can have a dunder call (__call__) or you can have a function. 00:34:43.840 |
And so if we create a image list from files using our path and pass in that transform, 00:34:54.320 |
And remember that item list inherits from list container, which we gave a dunder repr. 00:35:03.920 |
This is why we create these little convenient things to subclass from because we get all 00:35:10.840 |
So we can now see that we've got 13,000 items. 00:35:19.520 |
And when we index into it, it calls get and get calls image dot open and pillow automatically 00:35:42.720 |
And because we're using the functionality that we wrote last time for list container, 00:35:47.640 |
we can also index with a list of booleans, with a slice, with a list of ints and so forth. 00:35:55.120 |
So here's a slice containing one item, for instance. 00:36:06.600 |
So to do that, we look and we see here's a path. 00:36:21.960 |
So let's create a function called grandparent splitter that grabs the grandparent's name. 00:36:27.540 |
And you call it, telling it the name of your validation set and the name of your training 00:36:30.960 |
set, and it returns true if it's the validation set or false if it's the training set or none 00:36:41.720 |
And so here's something that will create a mask: you pass it some function. 00:36:49.080 |
So we're going to be using grandparent splitter. 00:36:51.440 |
And it will just grab all the things where that mask is false, which is the training set, 00:36:57.440 |
and the things where it's true, which is the validation set, and return them. 00:37:04.940 |
So here's a splitter that splits on grandparents. 00:37:08.080 |
And where the validation name is val, because that's what it is for Imagenette. 00:37:17.800 |
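A minimal sketch of the grandparent splitter and the split-by-function helper (False means training set, True means validation set, None gets dropped):

```python
from functools import partial
from pathlib import Path

def grandparent_splitter(fname, valid_name='val', train_name='train'):
    # .../train/<class>/<file> or .../val/<class>/<file>:
    # the grandparent directory name says which split the file belongs to
    gp = Path(fname).parent.parent.name
    if gp == valid_name: return True
    if gp == train_name: return False
    return None

def split_by_func(items, f):
    mask = [f(o) for o in items]
    train = [o for o, m in zip(items, mask) if m is False]   # False -> training set
    valid = [o for o, m in zip(items, mask) if m is True]    # True  -> validation set
    return train, valid

splitter = partial(grandparent_splitter, valid_name='val')
train, valid = split_by_func(
    ['imagenette/train/n01440764/a.jpg', 'imagenette/val/n01440764/b.jpg'], splitter)
assert (len(train), len(valid)) == (1, 1)
```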
We've now got a validation set with 500 things and a training set with 12,800 things. 00:37:28.360 |
So split data object is just something with a training set and a validation set. 00:37:41.240 |
Everything else from here is just convenience. 00:37:44.160 |
So we'll give it a representation so that you can print it. 00:37:48.480 |
We'll define dunder getattr (__getattr__) so that if you pass it some attribute that it doesn't 00:37:52.320 |
know about, it will grab it from the training set. 00:37:56.860 |
And then let's add a split by func method that just calls that split by func thing we 00:38:06.760 |
There's one trick here, though, which is we want split by func to return item lists 00:38:23.560 |
And that's why in our item list, we defined something called new. 00:38:35.480 |
PyTorch has the concept of a new method as well. 00:38:39.400 |
It says, all right, let's look at this object. 00:38:42.240 |
Let's see what class it is, because it might not be item list, right? 00:38:45.200 |
It might be image list or some other subclass. 00:38:49.880 |
And this is now the constructor for that class. 00:38:53.000 |
And let's just pass it in the items that we asked for. 00:38:56.720 |
And then pass in our path and our transforms. 00:38:59.680 |
So new is going to create a new item list of the same type with the same path and the 00:39:08.360 |
And so that's why this is now going to give us a training set and a validation set with 00:39:17.680 |
the same path, the same transforms, and the same type. 00:39:20.640 |
And so if we call split data split by func, now you can see we've got our training set 00:39:30.480 |
So next in our list of things to do is labeling. 00:39:41.120 |
And the reason it's tricky is because we need processes. 00:39:45.720 |
Processes are things which are first applied to the training set. 00:39:49.760 |
They get some state and then they get applied to the validation set. 00:39:54.720 |
For example, our labels should not be tench and French horn. 00:40:02.440 |
They should be like zero and two because when we go to do a cross entropy loss, we expect 00:40:12.680 |
So we need to be able to map tench to zero or French horn to two. 00:40:18.480 |
We need the training set to have the same mapping as the validation set. 00:40:22.960 |
And for any inference we do in the future, it's going to have the same mapping as well. 00:40:26.860 |
Because otherwise, the different data sets are going to be talking about completely different 00:40:32.160 |
things when they see the number zero, for instance. 00:40:35.680 |
So we're going to create something called a vocab. 00:40:38.700 |
And a vocab is just the list saying these are our classes and this is the order they're 00:40:44.560 |
Zero is tench, one is golf ball, two is French horn, and so forth. 00:40:50.500 |
So we're going to create the vocab from the training set. 00:40:54.400 |
And then we're going to convert all those strings into ints using the vocab. 00:40:59.880 |
And then we're going to do the same thing for the validation set, but we'll use the 00:41:06.040 |
So that's an example of a processor that converts label strings to numbers in a consistent and 00:41:14.520 |
Other things we could do would be processing texts to tokenize them and then numericalize 00:41:21.040 |
Numericalizing them is a lot like converting the label strings to numbers. 00:41:24.560 |
Or taking tabular data and filling the missing values with the median computed on the training 00:41:31.680 |
So most things we do in this labeling process is going to require some kind of processor. 00:41:40.680 |
So in our case, we want a processor that can convert label strings to numbers. 00:41:44.960 |
So the first thing we need to know is what are all of the possible labels. 00:41:48.880 |
And so therefore we need to know all the possible unique things in a list. 00:41:53.020 |
So here's some list, here's something that uniquifies them. 00:41:57.000 |
So that's how we can get all the unique values of something. 00:42:03.040 |
So now that we've got that, we can create a processor. 00:42:06.400 |
And a processor is just something that can process some items. 00:42:14.240 |
And this is the thing that's going to create our list of all of the possible categories. 00:42:19.880 |
So basically when you say process, we're going to see if there's a vocab yet. 00:42:25.640 |
And if there's not, this must be the training set. 00:42:30.200 |
And it's just the unique values of all the items. 00:42:34.200 |
And then we'll create the thing that goes not from int to object, but goes from object 00:42:40.320 |
So we just enumerate the vocabulary and create a dictionary with the reverse mapping. 00:42:46.120 |
So now that we have a vocab, we can then go through all the items and process one of them 00:42:54.200 |
And process one of them simply means look in that reverse mapping. 00:42:59.840 |
We could also deprocess, which would take a bunch of indexes. 00:43:03.040 |
We would use this, for example, to print out the inferences that we're doing. 00:43:09.060 |
So we better make sure we get a vocab by now, otherwise we can't do anything. 00:43:13.080 |
And then we just deprocess one for each index. 00:43:17.240 |
And deprocess one just looks it up in the vocab. 00:43:23.800 |
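A minimal category processor along those lines: the first call (the training set) builds the vocab, every later call reuses it. This is a sketch, not fastai's exact class:

```python
def uniqueify(x):
    # unique values, keeping first-seen order
    return list(dict.fromkeys(x))

class CategoryProcessor:
    def __init__(self): self.vocab = None

    def __call__(self, items):
        if self.vocab is None:                                 # first call = training set
            self.vocab = uniqueify(items)
            self.otoi = {v: i for i, v in enumerate(self.vocab)}   # object -> int
        return [self.proc1(o) for o in items]

    def proc1(self, item):     return self.otoi[item]
    def deproc1(self, idx):    return self.vocab[idx]
    def deprocess(self, idxs): return [self.deproc1(i) for i in idxs]

proc = CategoryProcessor()
train_y = proc(['tench', 'golf ball', 'tench', 'French horn'])   # builds the vocab
valid_y = proc(['French horn', 'tench'])                         # reuses it
assert proc.deprocess(valid_y) == ['French horn', 'tench']
```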
And so with this, we can now combine it all together. 00:43:29.560 |
And it's just a list container that contains a processor. 00:43:34.600 |
And the items in it, whatever we were given after being processed. 00:43:41.200 |
And so then, as well as being able to index in it to grab those processed items, we'll 00:43:49.120 |
And that's just the thing that's going to deprocess the items again. 00:43:54.720 |
So that's all the stuff we need to label things. 00:44:01.200 |
So we already know that for splitting, we needed the grandparent. 00:44:18.240 |
And here is something which labels things using a function. 00:44:26.880 |
And so here is our class, and we're going to have to pass it some independent variable 00:44:32.800 |
and some dependent variable and store them away. 00:44:36.920 |
And then we need an indexer to grab the x and grab the y at those indexes. 00:44:48.960 |
And then we'll just add something just like we did before, which does the labeling. 00:44:54.840 |
And passes those to a processed item list to grab the labels. 00:45:00.880 |
And then passes the inputs and outputs to our constructor to give us our label data. 00:45:11.360 |
So with that, we have a label by function where we can create our category processor. 00:45:19.900 |
We can label the validation set, and we can return the result, the split data result. 00:45:25.760 |
So the main thing to notice here is that when we say train equals labeled data dot label 00:45:33.240 |
passing in this processor, this processor has no vocab. 00:45:37.480 |
So it goes to that bit we saw that says, oh, there's no vocab. 00:45:40.560 |
So let's create a list of all the unique possibilities. 00:45:43.920 |
On the other hand, when it goes to the validation set, proc now does have a vocab. 00:45:49.320 |
So it will skip that step and use the training sets vocab. 00:45:55.680 |
People get mixed up by this all the time in machine learning and deep learning is like 00:46:00.440 |
very often when somebody says, my model's no better than random. 00:46:05.200 |
The most common reason is that they're using some kind of different mapping between their 00:46:13.300 |
So if you use a process like this, that's never going to happen because you're ensuring 00:46:24.000 |
So the details of the code are particularly important. 00:46:28.800 |
The important idea is that your labeling process needs to include some kind of processor idea. 00:46:38.360 |
And if you're doing this stuff manually, which basically every other machine learning and 00:46:44.600 |
deep learning framework does, you're asking for difficult to fix bugs because anytime 00:46:51.460 |
your computer's not doing something for you, it means you have to remember to do it yourself. 00:46:55.400 |
So whatever framework you're using, I don't think, I don't know if any other frameworks 00:47:03.600 |
So like create something like this for yourself so that you don't have that problem. 00:47:14.280 |
In the case of online streaming data, how do you deal with having new categories in 00:47:24.480 |
I mean, it happens all the time is you do inference either on your validation set or 00:47:29.760 |
test set or in production where you see something you haven't seen before. 00:47:39.320 |
For labels, it's less of a problem in inference because for inference, you don't have labels. 00:47:44.240 |
By definition, but you could certainly have that problem in your validation set. 00:47:50.360 |
So what I tend to like to do is if I have like some kind of, if I have something where 00:47:54.920 |
there's lots and lots of categories and some of them don't occur very often and I know 00:47:59.240 |
that in the future there might be new categories appearing, I'll take the few least common 00:48:06.280 |
and I'll group them together into a group called like other. 00:48:10.680 |
And that way I now have some way to ensure that my model can handle all these rare other 00:48:17.080 |
Something like that tends to work pretty well, but you do have to think of it ahead of time. 00:48:22.600 |
For many kinds of problems, you know that there's a fixed set of possibilities. 00:48:29.240 |
And if you know that it's not a fixed set, yeah, I would generally try to create an other 00:48:36.320 |
So make sure you train with some things in that other category, all right? 00:48:44.240 |
>> In the label data class, what is the class method decorator doing? 00:48:53.520 |
So I'll be quick because you can Google it, but basically this is the difference between 00:49:02.880 |
So you'll see that I'm not going to call this on an object of type label data, but I'm calling 00:49:17.000 |
So it's just a convenience, really, class methods. 00:49:19.880 |
The thing that they get passed in is the actual class that was requested. 00:49:23.980 |
So I could create a subclass of this and then ask for that subclass. 00:49:32.180 |
Pretty much every language supports class methods or something like it. 00:49:38.960 |
You can get away without them, but they're pretty convenient. 00:49:46.080 |
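A tiny illustration of why the classmethod matters (a hypothetical class, not fastai's actual LabeledData):

```python
class LabeledData:
    def __init__(self, x, y): self.x, self.y = x, y

    @classmethod
    def label_by_func(cls, items, f):
        # cls is whatever class this was called on, so a subclass gets
        # instances of itself back rather than plain LabeledData
        return cls(items, [f(o) for o in items])

ld = LabeledData.label_by_func(['train/tench/a.jpg', 'train/golf ball/b.jpg'],
                               lambda p: p.split('/')[1])
assert ld.y == ['tench', 'golf ball']
```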
So now we've got our labeled list, and if we print it out, it's got a training set and 00:49:52.920 |
a validation set, and each one has an X and a Y. 00:49:58.160 |
Our category items are a little less convenient than the FastAI version ones because the FastAI 00:50:04.760 |
ones will actually print out the name of each category. 00:50:08.600 |
We haven't done anything to make that happen. 00:50:10.200 |
So if we want the name of each category, we would actually have to refer to the .obj, which 00:50:17.040 |
you can see we're doing here, Y.obj or Y.obj with a slice. 00:50:24.240 |
So in FastAI version one, there's one extra thing we have, which is this concept of an 00:50:30.560 |
item base, and you can actually define things like category items that know how to print 00:50:37.400 |
Whether that convenience is worth the extra complexity is up to you if you're designing 00:50:46.880 |
So we still can't train a model with these because we have pillow objects. 00:50:54.800 |
So here's our labeled list, training set, zeroth object, and that has an X and a Y. 00:51:09.600 |
If they're all going to be in the batch together, they have to be the same size. 00:51:14.680 |
I mean, that's not a great way to do it, but it's a start. 00:51:26.560 |
And it has to be after all the other transforms we've seen so far because we want conversion 00:51:32.720 |
to RGB to happen beforehand, probably, stuff like that. 00:51:41.480 |
If you pass in an integer, we'll turn it into a tuple. 00:51:44.360 |
And when you call it, it'll call resize, and it'll do bilinear resizing for you. 00:51:53.600 |
Once you've turned them all into the same size, then we can turn them into tensors. 00:52:02.860 |
This is how TorchVision turns pillow objects into tensors. 00:52:14.920 |
And you see, there's two ways here of adding kind of class-level state or transform-level 00:52:23.440 |
This is really underused in Python, but it's super handy, right? 00:52:27.320 |
We just want to say, like, what's the order of the function? 00:52:32.400 |
And then that's turned it into a byte tensor. 00:52:38.440 |
And we don't want it to be between 0 and 255. 00:52:52.040 |
It doesn't matter what order they're in the array, because they're going to order them 00:53:03.440 |
Here's a little convenience to permute the order back again. 00:53:08.680 |
I don't know if you noticed this, but in to_byte_tensor, I had to permute(2, 0, 1), because 00:53:15.880 |
Pillow has the channel last, whereas PyTorch assumes the channel comes first. 00:53:22.200 |
So this is just going to pop the channel first. 00:53:24.560 |
So to print them out, we have to put the channel last again. 00:53:28.580 |
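A sketch of those three transforms; the byte-tensor conversion here goes through NumPy rather than the buffer trick TorchVision uses, but the effect (channel-first uint8, then floats 0-1) is the same:

```python
import numpy as np
import torch
from PIL import Image

try:    BILINEAR = Image.Resampling.BILINEAR     # Pillow >= 9.1
except AttributeError: BILINEAR = Image.BILINEAR

class Transform: _order = 0

class ResizeFixed(Transform):
    _order = 10
    def __init__(self, size):
        if isinstance(size, int): size = (size, size)   # int -> (width, height)
        self.size = size
    def __call__(self, item): return item.resize(self.size, BILINEAR)

def to_byte_tensor(item):
    # PIL image -> (h, w, c) uint8 array -> channel-first (c, h, w) tensor
    res = torch.from_numpy(np.array(item, dtype=np.uint8))
    return res.permute(2, 0, 1).contiguous()
to_byte_tensor._order = 20

def to_float_tensor(item):
    return item.float().div_(255.)   # bytes 0..255 -> floats 0..1
to_float_tensor._order = 30
```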
So now we can grab something from that list and show image. 00:53:34.760 |
And you can see that it is a torch tensor of this size. 00:53:40.520 |
So we now have tensors that are floats and all the same size. 00:53:49.640 |
We'll use the get data loaders function as we had before. 00:53:52.040 |
We can just pass in train and valid directly from our labeled list. 00:53:58.320 |
Let's grab a mini batch, and here it is, 64 by 3 by 128 by 128. 00:54:03.560 |
And we can have a look at it, and we can see the vocab for it. 00:54:20.480 |
And to make life even easier for the future, let's add two optional things, channels in 00:54:28.280 |
And that way any models that want to be automatically created can automatically create themselves 00:54:33.560 |
with the correct number of inputs and the correct number of outputs for our data set. 00:54:40.160 |
And let's add to our split data something called to_databunch, which is just this function. 00:54:54.880 |
So like in practice, in your actual module, you would go back and you would paste the 00:55:02.120 |
contents of this back into your split data definition. 00:55:05.880 |
But this is kind of a nice way when you're just iteratively building stuff. 00:55:10.880 |
You can not only monkey patch PyTorch things or standard library things, you can monkey patch your own classes too. 00:55:18.440 |
So here's how you can add something to a previous class when you realize later that you want 00:55:23.640 |
Okay, so let's go through and see what happens. 00:55:30.960 |
So here are all the steps, literally all the steps. 00:55:35.400 |
Grab the path, untar the data, grab the transforms, grab the item list, pass in the transforms, 00:55:35.400 |
split the data using the grandparent, using this validation name, label it using parent 00:55:50.320 |
labeler, and then turn it into a data bunch with this batch size, three channels in, ten 00:56:02.600 |
Here's our callback functions from last time. 00:56:09.420 |
In the past, we've normalized things that have had only one channel, being MNIST. 00:56:14.480 |
Now we've got three channels, so we need to make sure that we take the mean over the other 00:56:18.600 |
axes so that we get a three-channel mean and a three-channel standard deviation. 00:56:25.320 |
So let's define a function that normalizes things that are three channels. 00:56:33.360 |
So here's the mean and standard deviation of this Imagenette batch. 00:56:38.560 |
So here's a function called norm_imagenette, which we can use from now on to normalize 00:56:47.160 |
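A sketch of that three-channel normalization; the batch xb and the resulting stats here are placeholders, you'd compute them from your own data bunch:

```python
import torch
from functools import partial

def normalize_chan(x, mean, std):
    # broadcast per-channel (3,) stats over a (batch, 3, h, w) tensor
    return (x - mean[..., None, None]) / std[..., None, None]

# compute per-channel stats from one batch xb of shape (bs, 3, h, w)
# (this xb is a stand-in -- grab a real one from your data bunch)
xb = torch.rand(64, 3, 128, 128)
mean = xb.mean(dim=(0, 2, 3))
std  = xb.std(dim=(0, 2, 3))
norm_imagenette = partial(normalize_chan, mean=mean, std=std)
```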
So let's add that as a callback using the batch transform we built earlier. 00:56:54.360 |
We will create a ConvNet with this number of layers. 00:57:01.880 |
And here's the ConvNet, we're going to come back to that. 00:57:05.920 |
And then we will do our one-cycle scheduling using cosine one-cycle annealing, 00:57:11.720 |
pass that into our get_learn_run, and train. 00:57:20.720 |
And that's going to give us 72.6%, which if we look at the Imagenette leaderboard for 128 00:57:33.600 |
pixels for 5 epochs, the best is 84.6 so far. 00:57:44.120 |
So let's take a look and see what model we built, because it's kind of interesting. 00:57:48.360 |
It's a few interesting features of this model. 00:57:50.680 |
And we're going to be looking at these features quite a lot in the next two lessons. 00:57:56.760 |
The model knows how big its first layer has to start out because we pass in data, and 00:58:07.680 |
Already this is a model which you don't have to change its definition if you have hyperspectral 00:58:13.800 |
imaging with four channels, or you have black and white with one channel, or whatever. 00:58:24.720 |
Or I should say, what's the output of the first layer going to be? 00:58:34.300 |
Well, what we're going to do is we're going to say, well, our input has, we don't know, 00:58:46.280 |
But we do know that the first layer is going to be a three by three kernel, and then there's 00:58:50.600 |
going to be some number of channels, c_in channels, which in our case is three. 00:58:58.160 |
So as the convolution kernel kind of scrolls over the input image, at each time, the number 00:59:04.140 |
of things that it's multiplying together is going to be three by three by c_in. 00:59:13.800 |
So remember we talked about this last week, right? 00:59:16.280 |
We basically want to put that, we basically want to make sure that our first convolution 00:59:31.760 |
So if we're getting nine by c_in coming in, you wouldn't want more than that going out 00:59:42.900 |
So what I'm going to do is I'm going to say, okay, let's take that value, c_in by three by 00:59:49.800 |
three, and let's just look for the next largest number that's a power of two, and we'll use 01:00:01.200 |
And then I'll just go ahead and multiply by two for each of the next two layers. 01:00:05.380 |
So this way, we've got these vital first three layers are going to work out pretty well. 01:00:11.640 |
So back in the old days, we used to use five by five or seven by seven kernels, okay? 01:00:19.040 |
We'd have the first layer, would be one of those, but we know now that's not a good idea. 01:00:25.720 |
Still most people do it because people stick with what they know, but when you look at 01:00:30.680 |
the bag of tricks for image classification paper, which in turn refers to many previous 01:00:39.360 |
citations, many of which are state of the art and competition winning models, the message 01:00:47.040 |
Three by three kernels give you more bang for your buck. 01:00:50.960 |
You get deeper, you end up with the same receptive field. 01:00:54.120 |
It's faster because you've got less work going on, right? 01:00:58.180 |
And really, this goes all the way back to the classic Zeiler and Fergus paper that we've 01:01:03.540 |
looked at so many times over the years that we've been doing this course. 01:01:08.100 |
And even before that to the VGG paper, it really is three by three kernels everywhere. 01:01:16.000 |
So any place you see something that's not a three by three kernel, have a big think 01:01:25.680 |
Okay, so that's basically what we have for those critical first three layers. 01:01:32.520 |
That's where that initial feature representation is happening. 01:01:36.440 |
And then the rest of the layers is whatever we've asked for. 01:01:43.120 |
And so then we can build those layers up, just saying number of filters in to number 01:01:48.960 |
And then as usual, average pooling, flatten, and a linear layer to however many classes 01:02:03.200 |
It's very hard to, every time I write something like this, I break it the first 12 times. 01:02:10.960 |
And the only way to debug it is to see exactly what's going on. 01:02:14.840 |
To see exactly what's going on, you need to see that what module is there at each point 01:02:25.000 |
So that's why we've created this model summary. 01:02:28.360 |
So model summary's gonna use that get batch that we added in the LSUV notebook to grab 01:02:37.480 |
We will make sure that that batch is on the correct device. 01:02:42.960 |
We will use the find module thing that we used in the LSUV to find all of the places 01:02:52.440 |
If you said find all, otherwise we will grab just the immediate children. 01:03:00.680 |
We will grab a hook for every layer using the hooks that we made. 01:03:09.360 |
And the function that we've used for hooking simply prints out the module and the output 01:03:18.400 |
So that's how easy it is to create this wonderfully useful model summary. 01:03:23.320 |
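A minimal version of such a model summary (hooking just the immediate children rather than the find-all variant):

```python
import torch

def model_summary(model, xb, print_mod=False):
    """Hook every immediate child of the model and, for one forward pass
    on the batch xb, print (optionally) the module and its output shape."""
    def hook_fn(mod, inp, out):
        if print_mod: print(mod)
        print(list(out.shape))

    hooks = [m.register_forward_hook(hook_fn) for m in model.children()]
    try:
        with torch.no_grad(): model(xb)
    finally:
        for h in hooks: h.remove()

# usage (model and xb are whatever you built above): model_summary(model, xb)
```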
So to answer your question of earlier, another reason why are we doing this or what are you 01:03:29.880 |
meant to be getting out of it is to say you don't have to write much code to create really 01:03:40.400 |
So we've seen how to create like per-layer histogram viewers, how to create model summaries. 01:03:47.520 |
With the tools that you have at your disposal now, I really hope that you can dig inside 01:03:52.120 |
your models what they are and what they're doing. 01:03:59.080 |
So this hooks thing we have is just like super, super useful. 01:04:04.120 |
Now very grateful to the PyTorch team for adding this fantastic functionality. 01:04:11.160 |
The input is 128 because that's a batch size, 128 by 3 by 128 by 128. 01:04:19.120 |
And then we gradually go through these convolutions. 01:04:32.880 |
And you can see after each one they have a stride of two, get smaller and smaller. 01:04:36.900 |
And then an average pull that to a one by one. 01:04:43.460 |
So it's a really, it's like as basic a ConvNet as you could get. 01:04:50.480 |
It's just a bunch of three by three convs, ReLUs, and batch norms. 01:05:18.440 |
This is one of the bits I'm most excited about in this course actually. 01:05:23.320 |
But hopefully it's going to be like totally unexciting to you because it's just going 01:05:26.640 |
to be so obvious that you should do it this way. 01:05:28.880 |
But the reason I'm excited is that we're going to be talking about optimizers. 01:05:32.960 |
And anybody who's done work with kind of optimizers in deep learning in the past will know that 01:05:38.960 |
every library treats every optimizer as a totally different thing. 01:05:43.520 |
So there's an Adam optimizer, like in PyTorch there's an Adam optimizer and an SGD optimizer 01:05:52.520 |
And somebody comes along and says, hey, we've invented this thing called decoupled weight 01:05:57.800 |
decay, also known as AdamW. And the PyTorch folks go, oh, damn, what are we going to do? 01:06:04.400 |
And they have to add a parameter to every one of their optimizers and they have to change 01:06:08.640 |
And then somebody else comes along and says, oh, we've invented a thing called AMSGrad. 01:06:13.680 |
There's another parameter we have to put into any one of those optimizers. 01:06:16.240 |
And it's not just like inefficient and frustrating, but it holds back research because it starts 01:06:25.520 |
feeling like there are all these things called different kinds of optimizers, but there's 01:06:33.000 |
There's one optimizer and there's one optimizer in which you can inject different pieces of 01:06:39.120 |
behavior in a very, very small number of ways. 01:06:42.560 |
And what we're going to do is we're going to start with this generic optimizer and we're 01:06:49.400 |
This came out last week and it's a massive improvement as you see in what we can do with 01:06:59.320 |
This is the equation set that we're going to end up implementing from the paper. 01:07:07.880 |
And what if I told you that not only I think are we the first library to have this implemented, 01:07:15.800 |
but this is the total amount of code that we're going to write to do it. 01:07:26.240 |
So we're going to continue with Imagenette and we're going to continue with the basic 01:07:30.920 |
set of transforms we had before and the basic set of stuff to create our data bunch. 01:07:38.760 |
This is our model and this is something to pop it on CUDA to get our statistics written 01:07:46.120 |
out to do our batch transform with the normalization. 01:07:50.000 |
And so we're going to start here 52% after an epoch. 01:07:59.800 |
Now in PyTorch, the base thing called optimizer is just a dictionary that stores away some 01:08:08.600 |
hyperparameters and we've actually already used it. 01:08:17.680 |
We used something that is not part of our approved set of foundations without building 01:08:33.920 |
So we're going to go back and do it now, right? 01:08:37.000 |
Because the reason we did this is because we were using PyTorch's optim.Optimizer. 01:08:42.240 |
We've already built the kind of the main part of that, which is the thing that multiplies 01:08:46.960 |
the gradients by the learning rate and subtracts them from the parameters. 01:08:59.920 |
As always, we need something called zero grad, which is going to go through some parameters 01:09:04.640 |
and zero them out and also remove any gradient computation history. 01:09:10.020 |
And we're going to have a step function that does some kind of step. 01:09:13.720 |
The main difference here, though, is our step function isn't actually going to do anything. 01:09:20.460 |
It's going to use composition on some things that we pass on and ask them to do something. 01:09:26.440 |
So this optimizer is going to do nothing at all until we build on top of it. 01:09:33.400 |
But we're going to set it up to be able to handle things like discriminative learning 01:09:37.420 |
rates and one-cycle annealing and stuff like that. 01:09:42.320 |
And so to be able to do that, we need some way to create parameter groups. 01:09:48.160 |
This is what we call in fast AI layer groups. 01:09:52.840 |
And I kind of wish I hadn't called them layer groups. 01:09:56.400 |
I should call them parameter groups because we have a perfectly good name for them already 01:10:01.720 |
So I'm not going to call them layer groups anymore. 01:10:03.880 |
I'm just going to call them parameter groups. 01:10:08.920 |
So a parameter group-- so remember when we say parameters in PyTorch, remember right 01:10:16.920 |
back to when we created our first linear layer, we had a weight tensor and we had a bias tensor. 01:10:27.700 |
So in order to optimize something, we need to know what all the parameter tensors are 01:10:33.000 |
And you can just say model.parameters to grab them all in PyTorch. 01:10:39.960 |
And that's going to give us-- it gives us a generator. 01:10:43.160 |
But as soon as you call list on a generator, it turns it into an actual list. 01:10:46.880 |
So that's going to give us a list of all of the tensors, all of the weights and all of the biases. 01:10:58.880 |
But we might want to be able to say the last two layers should have a different learning rate. 01:11:09.000 |
And so the way we can do that is rather than just passing in a list of parameters, we'll pass in a list of lists of parameters. 01:11:16.560 |
And so let's say our list of lists has two items. 01:11:19.320 |
The first item contains all the parameters in the main body of the architecture. 01:11:24.440 |
And the last item contains just the parameters from the last two layers. 01:11:30.200 |
So if we decide that this is a list of lists, then that lets us do parameter groups. 01:11:39.240 |
Now, that's how we tell the optimizer these sets of parameters should be handled differently 01:11:44.760 |
with discriminative learning rates and stuff. 01:11:49.280 |
We're going to assume that this thing being passed in is a list of lists. 01:11:57.200 |
If it's not, then we'll turn it into a list of lists by just wrapping it in a list. 01:12:02.580 |
So if it only has one thing in it, we'll just make it a list with one item containing all the parameters. 01:12:09.120 |
So now, param groups is a list of lists of parameter tensors. 01:12:15.880 |
And so you could either pass in, so you could decide how you want to split them up into 01:12:20.520 |
different parameter groups, or you could just have them turn into a single parameter group 01:12:30.140 |
So now, we have-- our optimizer object has a param groups attribute containing our parameter 01:12:38.920 |
So just keep remembering that's a list of lists. 01:12:42.080 |
Each parameter group can have its own set of hyperparameters. 01:12:47.580 |
So hyperparameters could be learning rate, momentum, beta in Adam, epsilon in Adam, and so forth. 01:12:56.320 |
So those hyperparameters are going to be stored as a dictionary. 01:13:00.560 |
And so there's going to be one dictionary for each parameter group. 01:13:07.240 |
self.hypers contains, for each parameter group, a dictionary. 01:13:16.440 |
What's in the dictionary is whatever you pass to the constructor, OK? 01:13:20.980 |
So this is how you just pass a single bunch of keyword arguments to the constructor, and 01:13:26.440 |
it's going to construct a dictionary for every one. 01:13:29.120 |
And this is just a way of cloning a dictionary so that they're not all referring to the same 01:13:34.280 |
reference, but they all have their own reference. 01:13:39.240 |
So that's doing much the same stuff as torch's optim.Optimizer. 01:13:49.360 |
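To make that concrete, here is a minimal sketch of the kind of generic optimizer being described. It is an assumption-laden stand-in, not the notebook's exact code: in the real notebooks `compose` is a course helper that passes keyword arguments along, so a tiny stand-in version is included here so the sketch is self-contained.

```python
def compose(x, funcs, **kwargs):
    # stand-in for the course's compose: call each function in turn, passing kwargs along
    for f in funcs: x = f(x, **kwargs)
    return x

class Optimizer():
    def __init__(self, params, steppers, **defaults):
        # params may be a list of tensors or a list of lists of tensors
        self.param_groups = list(params)
        if not isinstance(self.param_groups[0], list):
            self.param_groups = [self.param_groups]
        # one hyperparameter dictionary per parameter group, cloned from the defaults
        self.hypers = [{**defaults} for _ in self.param_groups]
        self.steppers = steppers

    def grad_params(self):
        # every parameter (with its group's hyperparameters) that actually has a gradient
        return [(p, hyper) for pg, hyper in zip(self.param_groups, self.hypers)
                for p in pg if p.grad is not None]

    def zero_grad(self):
        for p, _ in self.grad_params():
            p.grad.detach_()   # remove any gradient computation history
            p.grad.zero_()

    def step(self):
        # the step itself does nothing: it just composes whatever steppers were injected
        for p, hyper in self.grad_params():
            compose(p, self.steppers, **hyper)
```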
In order to see what a stepper is, let's write one. 01:14:03.520 |
So in other words, to create an SGD optimizer, we create a partial of our Optimizer class with the SGD step function as its stepper. 01:14:15.220 |
So now when we call step, it goes through our parameters, composes together our steppers, 01:14:29.040 |
and applies them to each one. So each parameter is going to get p.data.add_(-lr, p.grad.data). 01:14:40.120 |
So with that optimization function, we can fit. 01:14:49.360 |
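As a sketch of what that stepper and its usage might look like (building on the Optimizer sketch above; the names are assumptions taken from the description here):

```python
from functools import partial

def sgd_step(p, lr, **kwargs):
    # older two-argument add_: multiplies p.grad.data by -lr, then adds in place, i.e. p -= lr * grad
    p.data.add_(-lr, p.grad.data)
    return p

# the whole "SGD optimizer" is just the generic class plus this one stepper
opt_func = partial(Optimizer, steppers=[sgd_step])
```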
But what we have done is we've done the same thing we've done 1,000 times without ever actually writing an optimizer. 01:15:05.280 |
I've created this thing called grad params, which is just a little convenience. 01:15:08.980 |
Basically when we zero the gradients, we have to go through every parameter. 01:15:14.080 |
To go through every parameter, we have to go through every parameter group. 01:15:18.440 |
And then within each parameter group, we have to go through every parameter in that group 01:15:31.920 |
And also, when we call the stepper, we want to pass to it all of our hyperparameters. 01:15:46.400 |
And learning rate is just one of the things that we've listed in our hyperparameters. 01:15:52.000 |
So remember how I said that our compose is a bit special, that it passes along any keyword 01:15:56.940 |
arguments it got to everything that it composes? 01:16:01.820 |
So that's how our sgd_step can say, oh, I need the learning rate. 01:16:06.400 |
And so as long as hyper has a learning rate in it, it's going to end up here. 01:16:11.040 |
And it'll be here as long as you pass it here. 01:16:14.760 |
And then you can change it for each different layer group. 01:16:20.320 |
So we're going to need to change our parameter scheduler to use our new generic optimizer. 01:16:29.840 |
It's simply now that we have to say, go through each hyperparameter in self.opt.hypers and 01:16:39.220 |
So that's basically the same as what we had in parameter scheduler before, but for our 01:16:45.500 |
This used to use param groups, now it uses hypers. 01:16:50.080 |
So a minor change to make these keep working. 01:16:52.960 |
So now I was super excited when we first got this working, so it's like, wow, we've just 01:16:58.680 |
built an SGD optimizer that works without ever writing an SGD optimizer. 01:17:04.340 |
So now when we want to add weight decay, right? 01:17:06.780 |
So weight decay, remember, is the thing we use when we don't want something that overfits like this. 01:17:13.080 |
And the way we do it is we use L2 regularization, which is just where we 01:17:17.580 |
add the sum of squared weights times some parameter we choose to the loss. 01:17:24.440 |
And remember that the derivative of that is actually just wd times weight. 01:17:29.680 |
So you could either add an L2 regularization term to the loss, or you can add wd times weight directly to the gradients. 01:17:39.520 |
If you've forgotten this, go back and look at weight decay in part one to remind yourself. 01:17:46.560 |
And so if we want to add either this or this, we can do it. 01:17:53.360 |
So weight decay is going to get an LR and a WD, and it's going to simply do that. 01:18:04.620 |
Or L2 regularization is going to just do that. 01:18:10.120 |
By the way, if you haven't seen this before, this is add in PyTorch. 01:18:15.960 |
Normally it just adds this tensor to this tensor. 01:18:19.240 |
But if you also pass a scalar here, it multiplies these together first. 01:18:24.040 |
This is a nice, fast way to go WD times parameter and add that to the gradient. 01:18:37.760 |
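Here is a hedged sketch of the two steppers being described, using the older two-argument in-place forms (where the scalar multiplies the tensor before the add); treat the exact names as assumptions:

```python
def weight_decay(p, lr, wd, **kwargs):
    # true weight decay: shrink the weights themselves by lr * wd
    p.data.mul_(1 - lr * wd)
    return p

def l2_reg(p, lr, wd, **kwargs):
    # L2 regularization: add wd * p to the gradient instead
    p.grad.data.add_(wd, p.data)
    return p
```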
Okay, so we've got our L2 regularization, we've got our weight decay. 01:18:53.160 |
What we need to be able to do now is to be able to somehow have the idea of defaults. 01:19:00.860 |
Because we don't want to have to say weight decay equals zero every time we want to turn it off. 01:19:07.120 |
So see how we've attached some state here to our function object? 01:19:12.020 |
So the function now has something called defaults that says it's a dictionary with wd equal to zero. 01:19:19.040 |
So let's just grab exactly the same optimizer we had before. 01:19:23.080 |
But what we're going to do is we're going to maybe update our hyperparameters with whatever defaults are attached to the steppers. 01:19:34.200 |
And the reason it's maybe update is that it's not going to replace -- if you explicitly 01:19:38.480 |
say I want this weight decay, it's not going to update it. 01:19:44.280 |
And so that's just what this little loop does, right? 01:19:46.360 |
It just goes through each of the steppers, and then goes through each of the things in its 01:19:51.000 |
defaults dictionary, and it just checks: if it's not there already, then it adds it. 01:19:55.600 |
So this is now -- everything else here is exactly the same as before. 01:20:00.920 |
So now we can say let's create an SGD optimizer. 01:20:08.560 |
It's just an optimizer with an SGD step and weight decay. 01:20:14.240 |
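A hedged sketch of the defaults mechanism and the resulting one-liner; in the real notebook the merge happens inside the constructor, so this is just the shape of the idea:

```python
# attach the default to the function object itself
weight_decay._defaults = dict(wd=0.)
l2_reg._defaults = dict(wd=0.)

def maybe_update(steppers, dest):
    # only fill in keys the user did not pass explicitly
    for stepper in steppers:
        for k, v in getattr(stepper, '_defaults', {}).items():
            if k not in dest: dest[k] = v

# so an SGD optimizer with weight decay really is one line
sgd_opt = partial(Optimizer, steppers=[weight_decay, sgd_step])
```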
And so let's create a learner, and let's try creating an optimizer, which is an SGD optimizer, 01:20:24.160 |
with our model's parameters, with some learning rate, and make sure that the hyperparameter 01:20:30.600 |
for weight decay should be zero, the hyperparameter for LR should be .1. 01:20:36.920 |
Let's try giving it a different weight decay, make sure it's there, okay, it passes as well. 01:20:41.120 |
So we've now got an ability to basically add any step functions we want, and those step 01:20:48.760 |
functions can have their own defaults that get added automatically to our optimizer object. 01:21:01.080 |
So now we've got an SGD optimizer with weight decay is one line of code. 01:21:14.660 |
So momentum is going to require a slightly better optimizer, or a slightly different 01:21:18.960 |
optimizer, because momentum needs some more state. 01:21:23.200 |
It doesn't just have parameters and hyperparameters; momentum also needs to know, for every 01:21:30.320 |
parameter, what it was updated by last time. 01:21:35.640 |
Because remember the momentum equation is, if momentum is .9, then it would be .9 times 01:21:42.560 |
whatever you did last time, plus this step, right? 01:21:47.960 |
So we actually need to track for every single parameter what happened last time. 01:21:54.320 |
And that's actually quite a bit of state, right? 01:21:56.520 |
If you've got 10 million parameters in your network, you've now got 10 million more floats 01:22:02.960 |
that you have to store, because that's your momentum. 01:22:06.680 |
So we're going to store that in a dictionary called state. 01:22:11.760 |
So a stateful optimizer is just an optimizer that has state. 01:22:19.080 |
And then we're going to have to have some stats. 01:22:29.880 |
They're objects that we're going to pass in to say, when we create this state, how do we initialize it and how do we update it. 01:22:36.880 |
So when you're doing momentum, what's the function that you run to calculate momentum? 01:22:42.160 |
So that's going to be handled by something called a stat class. 01:22:45.860 |
So for example, momentum is calculated by simply averaging the gradient, like so. 01:22:53.960 |
We take whatever the gradient average was before, we multiply it by momentum, and we add the new gradient. 01:23:10.480 |
So it's not enough just to have update, because we actually need this to be something at the very start too. 01:23:16.800 |
We can't multiply by something that doesn't exist. 01:23:18.680 |
So we're also going to define something called init state that will create a dictionary containing the initial state. 01:23:26.540 |
So that's all that stateful optimizer is going to do, right? 01:23:31.000 |
It's going to look at each of our parameters, and it's going to check to see whether that 01:23:38.480 |
parameter already exists in the state dictionary, and if it doesn't, it hasn't been initialized. 01:23:43.780 |
So we'll initialize it with an empty dictionary, and then we'll update it with the results of init state. 01:23:51.580 |
So now that we have every parameter can now be looked up in this state dictionary to find 01:23:57.160 |
out its state, and we can now, therefore, grab it, and then we can call update, like 01:24:06.040 |
so, to do, for example, average gradients. 01:24:11.080 |
And then we can call compose with our parameter and our steppers, and now we don't just pass 01:24:18.900 |
in our hyperparameters, but we also pass in our state. 01:24:23.480 |
So now that we have average gradients, which is sticking into this thing called grad average, 01:24:32.040 |
and it's going to be passed into our steppers, we can now do a momentum step. 01:24:37.860 |
And the momentum step takes not just LR, but it's now going to be getting this grad average. 01:24:46.180 |
It's just this grad average times the learning rate. 01:24:53.160 |
So now we can create an SGD with momentum optimizer with a line of code. 01:24:59.920 |
It can have a momentum step, it can have a weight decay step, it can have an average 01:25:04.560 |
grad stat, we can even give it some default weight decay, and away we go. 01:25:14.840 |
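A hedged sketch of those pieces, under the assumption that a StatefulOptimizer (as described above) calls init_state once per parameter, calls update every step, and passes the resulting state into the steppers alongside the hyperparameters:

```python
import torch

class AverageGrad():
    _defaults = dict(mom=0.9)
    def init_state(self, p):
        # state has to exist before the first step: start the average at zero
        return {'grad_avg': torch.zeros_like(p.grad.data)}
    def update(self, p, mom, grad_avg, **kwargs):
        # plain momentum: old average times mom, plus the new gradient (no dampening yet)
        grad_avg.mul_(mom).add_(p.grad.data)
        return {'grad_avg': grad_avg}

def momentum_step(p, lr, grad_avg, **kwargs):
    # the step is just the gradient average times the learning rate
    p.data.add_(-lr, grad_avg)
    return p

# SGD with momentum and weight decay, again in one line (StatefulOptimizer is the
# class described in this section, so this line is illustrative rather than runnable on its own)
sgd_mom_opt = partial(StatefulOptimizer, steppers=[momentum_step, weight_decay],
                      stats=AverageGrad(), wd=0.01)
```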
So here's something that might just blow your mind. 01:25:29.600 |
Here is a paper, L2 Regularization versus Batch and Weight Normalization. 01:25:38.040 |
Batch normalization is a commonly used trick to improve training of deep neural networks, 01:25:42.760 |
and they also use L2 regularization ostensibly to prevent overfitting. 01:25:48.280 |
However, we show that L2 regularization has no regularizing effect when combined with normalization. 01:26:06.680 |
I realized this when I was chatting to Sylvain at NeurIPS, and like we were walking around 01:26:12.440 |
the poster session, and I suddenly said to him, "Wait, Sylvain, if there's batch norm, how can weight decay be doing anything?" 01:26:35.920 |
So picture some layer, and we've got some weights that were used to create that layer of activations. 01:26:43.520 |
So these are our weights, and these are our activations, and then we pass it through some batch norm layer. 01:26:52.640 |
It's got a bunch of adds, and it's got a bunch of multiplies, right? 01:26:59.360 |
It also normalizes, but these are the learned parameters. 01:27:02.720 |
Okay, so we come along and we say, "Okay, weight decay time. 01:27:08.360 |
Your weight decay is a million," and it goes, "Uh-oh, what do I do?" 01:27:17.440 |
Because now the sum of the squares of these weights gets multiplied by a million. 01:27:30.640 |
But then the batch norm layer goes, "Oh, no, don't worry, friends," and it just multiplies 01:27:37.120 |
every single one of its mults by a million. 01:27:58.880 |
So what happens now? To get the same activations we had before, 01:28:04.880 |
all of our weights, like w1, now get divided by a million, and we get the same result. 01:28:13.600 |
And so now our weight decay basically is nothing. 01:28:19.840 |
So in other words, the model can decide exactly how much weight decay 01:28:26.680 |
loss there is by simply scaling the batch norm mults, right? 01:28:31.560 |
Now the batch norm mults get a tiny bit of weight decay applied to them, unless you turn 01:28:36.200 |
it off, which people often do, but it's tiny, right? 01:28:38.760 |
Because there's very few parameters here, and there's lots of parameters here. 01:28:47.320 |
L2 regularization has no regularizing effect, which is not what I've been telling people 01:28:54.480 |
who have been listening to these lessons the last three years, for which I apologize. 01:29:00.280 |
I feel a little bit better in knowing that pretty much everybody in the community is 01:29:08.360 |
So Twan van Laarhoven mentioned this in the middle of 2017. 01:29:17.040 |
There's a couple more papers I've mentioned in today's lesson notes from the last few 01:29:23.160 |
months where people are finally starting to really think about this, but I'm not aware 01:29:29.240 |
of any other course, which is actually pointed out we're all doing it wrong. 01:29:34.160 |
So you know how I keep mentioning how none of us know what we're doing? 01:29:38.960 |
We don't even know what L2 regularization does because it doesn't even do anything, 01:29:46.680 |
but it does do something because if you change it, something happens. 01:29:58.120 |
So a more recent paper by a team led by Roger Grosse has found three kind of ways in which 01:30:06.000 |
maybe regularization happens, but it's not the way you think. 01:30:10.560 |
This is one of the papers in the lesson notes. 01:30:13.520 |
But even in his paper, which is just a few months old, the abstract says basically, or 01:30:20.480 |
the introduction says basically no one really understands what L2 regularization does. 01:30:28.880 |
There's this thing that every model ever always has, and it totally doesn't work. 01:30:35.520 |
At least it doesn't work in the way we thought it did. 01:30:38.440 |
So that should make you feel better about, can I contribute to deep learning? 01:30:44.960 |
Obviously you can, because none of us have any idea what we're doing. 01:30:48.480 |
And this is a great place to contribute, right? 01:30:51.880 |
Is like use all this telemetry that I'm showing you, activations of different layers, and 01:30:56.560 |
see what happens experimentally, because the people who study this stuff, like what actually 01:31:02.240 |
happens with batch norm and weight decay, most of them don't know how to train models, 01:31:07.520 |
The theory people, and then there's like the practitioners who forget about actually thinking 01:31:14.760 |
But if you can combine the two and say like, oh, let's actually try some experiments. 01:31:19.080 |
Let's see what happens really when we change weight decay, now that I've assumed we don't 01:31:23.240 |
know what we're doing, I'm sure you can find some really interesting results. 01:31:29.060 |
So momentum is also interesting, and we really don't understand much about how things like 01:31:34.640 |
momentum work, but here's some nice pictures for you. 01:31:38.760 |
And hopefully it'll give you a bit of a sense of momentum. 01:31:42.000 |
Let's create 200 numbers equally spaced between minus four and four, and then let's create some random numbers to go with them. 01:31:52.840 |
And then let's create something that plots some function for these numbers, and we're 01:32:03.320 |
going to look at this function for each value of something called beta. 01:32:07.880 |
And this is the function we're going to try plotting, and this is the momentum function. 01:32:13.720 |
Okay, so what happens if we plot this function for each value of beta, for our random data? 01:32:28.680 |
So beta here is going to be our different values of momentum, and you can see what happens 01:32:33.160 |
is, with very little momentum, you just get very bumpy, very bumpy. 01:32:38.400 |
Once you get up to a high momentum, you get a totally wrong answer. 01:32:44.400 |
Because if you think about it, right, we're constantly saying 0.9 times whatever we had 01:32:50.520 |
before, plus the new thing, then basically you're continuing to say like, oh, the thing 01:32:57.320 |
I had before times 0.9 plus the new thing, and the things are all above zero. 01:33:05.960 |
And this is why, if your momentum is too high, and basically you're way away from where you 01:33:12.400 |
need to be in weight space, so it keeps on saying go that way, go that way, go that way. 01:33:16.600 |
If you get that enough with a high momentum, it will literally shoot off far faster than it should. 01:33:23.720 |
Okay, so this will give you a sense of why you've got to be really careful with high 01:33:27.400 |
momentum, it's literally biased to end up being a higher gradient than the actual gradient. 01:33:39.600 |
Like when you think about it, this is kind of dumb, right, because we shouldn't be saying 01:33:44.080 |
beta times average plus yi, we should be saying beta times average plus 1 minus beta times yi. 01:33:54.680 |
Like dampen the thing that we're adding in, and that's called an exponentially weighted 01:33:59.760 |
moving average, as we know, or lerp in PyTorch speak. 01:34:06.920 |
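A tiny illustration of that equivalence (the numbers here are just made up for the example):

```python
import torch

# avg.lerp(y, w) is avg + w * (y - avg), so with w = 1 - beta it gives
# exactly beta * avg + (1 - beta) * y, i.e. the exponentially weighted moving average
avg, y, beta = torch.tensor(2.0), torch.tensor(4.0), 0.9
ewma = avg.lerp(y, 1 - beta)   # 0.9 * 2.0 + 0.1 * 4.0 = 2.2
assert torch.isclose(ewma, beta * avg + (1 - beta) * y)
```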
So let's plot the same thing as before but this time with exponentially weighted moving 01:34:19.260 |
What if the thing that we're trying to match isn't just random but is some function? 01:34:29.940 |
Well if we use a very small momentum with exponentially weighted moving averages, we're 01:34:35.040 |
And I've added an outlier at the start just to show you what happens. 01:34:40.040 |
Even with beta 0.7 we're fine, but uh-oh, now we've got trouble. 01:34:46.280 |
And the reason we've got trouble is that the second, third, fourth, fifth observations all 01:34:55.560 |
have a whole lot of this item number one in, right? 01:34:58.840 |
Because remember item number two is 0.99 times item number one plus 0.01 times item number two. 01:35:08.640 |
And so item number one is massively biasing the start. 01:35:16.960 |
And the second thing that goes wrong with this momentum is that you can see how we're lagging. 01:35:22.560 |
We're always running a bit behind where we should be. 01:35:26.640 |
Because we're always only taking 0.1 times the new thing. 01:35:34.000 |
De-biasing is what we saw last week and it turned out, thanks to Stas Bekman's discovery, 01:35:39.520 |
we didn't really need it, but we do need it now. 01:35:42.400 |
And de-biasing is to divide by one minus beta to the power of whatever batch number we're up to. 01:35:52.460 |
If your initial starting point is zero, and that's what we use always when we're de-biasing, 01:35:58.300 |
we always start at zero, and beta is 0.9, then your first step is going to be 0.9 times zero plus 0.1 times your first value. 01:36:10.520 |
So in other words, you'll end up at 0.1 times your item, so you're going to end up 10 times smaller than you should be. 01:36:18.740 |
And if you kind of work through it, you'll see that each step is too small by a factor of one minus 0.9 to the power 01:36:30.280 |
of one, two, three, four, five, and so forth, which is exactly what we divide by. 01:36:34.440 |
And in fact, we have, of course, a spreadsheet showing you this. 01:36:40.760 |
So if you have a look at the momentum bias spreadsheet, there we go. 01:36:47.320 |
So basically, here's our batch number, and let's say these are the values that are coming 01:36:52.220 |
in, our gradients, five, one, one, one, one, one, five, one. 01:36:56.360 |
Then basically, this is our exponentially weighted moving average, and here is our de-biasing factor. 01:37:06.320 |
And then here is our resulting de-biased exponentially weighted moving average. 01:37:11.280 |
And then you can compare it to an actual moving average of the last few. 01:37:21.040 |
And Sylvain loves writing LaTeX, so he wrote all this LaTeX that basically points out that 01:37:26.840 |
if you say what I just said, which is beta times this plus one minus beta times that, 01:37:33.520 |
and you keep doing it to itself lots and lots of times, you end up with the formula shown here. 01:37:40.040 |
So this is all we need to do to take our exponentially weighted moving average, divide it by one 01:37:45.320 |
minus beta to the power of i plus one, and look at that. 01:37:51.680 |
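Written out as a sketch of the formula being described (assuming the average is started at zero and steps are counted from zero):

```latex
a_i = \beta\, a_{i-1} + (1-\beta)\, y_i, \qquad a_{-1} = 0
\qquad\Longrightarrow\qquad
\hat{a}_i = \frac{a_i}{1 - \beta^{\,i+1}}
```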
It de-biases very quickly, even if you have a bad starting point, and it looks pretty 01:37:56.640 |
It's not magic, but you can see why a beta of .9 is popular. 01:38:12.120 |
Adam is dampened, de-biased momentum, that's the numerator, divided by the dampened, de-biased root of the averaged squared gradients. 01:38:29.380 |
And so we talked about why Adam does that before, we won't go into the details. 01:38:34.320 |
But here's our average gradient again, but this time we've added optional dampening. 01:38:40.440 |
So if you say I want dampening, then we'll set momentum dampening to that, otherwise we don't dampen at all. 01:38:49.440 |
And so this is exactly the same as before, but with dampening. 01:38:54.660 |
Average squared gradients is exactly the same as average gradients, we could definitely 01:38:58.480 |
refactor these a lot, so this is all exactly the same as before, except we'll call everything the squared version. 01:39:04.640 |
We'll call it squared dampening, we'll call it squared averages, and this time, rather 01:39:08.160 |
than just adding in the p.grad.data, we will multiply p.grad.data by itself, in other words, square it. 01:39:17.720 |
This is the only difference, and we store it in a different name. 01:39:21.920 |
So with those, we're also going to need to de-bias, which means we need to know what step number we're up to. 01:39:28.280 |
So here's a stat, which just literally counts. 01:39:34.500 |
So here's our de-bias function, the one we just saw. 01:39:40.240 |
Once that's in place, Adam is just the de-biased momentum with momentum dampening, the de-biased 01:39:44.960 |
squared momentum with squared momentum dampening, and then we just take the parameter, and then 01:39:52.480 |
our learning rate, and we've got the de-bias in here, and our gradient average gets divided by the square root of the squared average. 01:40:03.400 |
And we also have our epsilon, oh, this is in the wrong spot, be careful, epsilon should go outside the square root. 01:40:15.240 |
So that's an Adam step, so now we can create an Adam optimizer in one line of code. 01:40:23.120 |
And so there's our Adam optimizer: it has an average grad stat, an average squared grad stat, and a step count. 01:40:41.880 |
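Here is a hedged sketch of an Adam stepper along those lines. It assumes the dampening equals one minus momentum (the standard Adam setting), that `step` comes from the step-count stat just mentioned and starts at 1, and it uses the older two-argument addcdiv_ form; note epsilon sits outside the square root, matching the discussion below.

```python
def adam_step(p, lr, mom, sqr_mom, grad_avg, sqr_avg, eps, step, **kwargs):
    debias1 = 1 - mom**step        # de-bias factor for the gradient average
    debias2 = 1 - sqr_mom**step    # de-bias factor for the squared-gradient average
    # p -= lr * (grad_avg / debias1) / (sqrt(sqr_avg / debias2) + eps)
    p.data.addcdiv_(-lr / debias1, grad_avg,
                    (sqr_avg / debias2).sqrt() + eps)
    return p
```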
By the way, these equations are a little nicer than these equations, and I want to point out why. 01:40:53.740 |
Look at this, m over v plus epsilon root, lambda, da-da-da, it's the same as this. 01:41:00.960 |
So like, it's just so complicated when things appear the same way in multiple places, right? 01:41:05.980 |
So when we did this equation, we gave that a new name. 01:41:10.180 |
And so now we can just look at r2, goes from all that to just that. 01:41:20.560 |
And so when you pull these things out, when you refactor your math, it's much easier to 01:41:30.600 |
When we look at this, even if you're a terrible mathematician like me, you're going to start 01:41:35.040 |
to recognize some patterns, and that's the trick to being a less terrible mathematician 01:41:40.560 |
Beta times something plus one minus beta times another thing is an exponentially weighted moving average. 01:41:48.360 |
So here's one exponentially weighted moving average. 01:41:50.160 |
Here's another exponentially weighted moving average. 01:41:58.960 |
So these are the exponentially weighted moving average of the gradient and the gradient squared. 01:42:19.280 |
Sylvain has a message, don't move the epsilon. 01:42:29.040 |
Sylvain's an actual math guy, so... >> In Adam, the epsilon goes outside the square root. 01:42:35.240 |
I always thought epsilon should always go inside the square root. 01:42:39.320 |
Jeremy just did a fix I pushed a week ago where our Adam wasn't working. 01:42:49.680 |
So to explain why this matters and why there is no right answer, here's the difference, 01:42:57.240 |
If this is 1e-7, then having it here versus having it here. 01:43:04.640 |
So like the square root of 1e-7 is very different to 1e-7. 01:43:14.360 |
And in batch norm, they do put it inside the square root, and according to Sylvain, in Adam it goes outside. 01:43:22.040 |
Neither is like the right place to put it or the wrong place to put it. 01:43:26.980 |
If you don't put it in the same place as they do in paper, it's just a totally different 01:43:32.240 |
And this is a good time to talk about epsilon in Adam, because I love epsilon in Adam. 01:43:39.280 |
Like, what if we made epsilon equal to 1, right? 01:43:45.980 |
Then we've got the kind of momentum term on the numerator, and on 01:43:54.280 |
the denominator we've got the square root of the exponentially weighted moving average of the squared gradients, plus 1. 01:44:08.240 |
And most of the time, the gradients are going to be smaller than 1 and the squared version smaller still. 01:44:15.200 |
So basically then, the 1 is going to be much bigger than this, so it basically makes this whole denominator irrelevant. 01:44:24.600 |
So if epsilon is 1, it's pretty close to being standard SGD with momentum, or at least de-biased, dampened SGD with momentum. 01:44:35.560 |
Whereas if epsilon is 1e-7, then we're basically saying, oh, we want to really use 01:44:44.240 |
these different exponentially weighted moving average squared gradients. 01:44:50.800 |
And this is really important because if you have some parameter that has had very small 01:45:01.680 |
squared gradients for a while, this could well be like 1e-6, which means when you divide by it, you get an enormous step. 01:45:11.520 |
And that could absolutely kill your optimizer. 01:45:14.700 |
So the trick to making Adam and Adam-like things work well is to make this about 0.1, somewhere 01:45:23.240 |
between 1e-3 and 1e-1 tends to work pretty well. 01:45:27.080 |
Most people use 1e-7, which just makes no sense. 01:45:32.320 |
There's no way that you want to be able to multiply your step by 10 million times. 01:45:40.520 |
So there's another place that epsilon is a super important thing to think about. 01:45:49.680 |
So LAMB then is stuff that we've all seen before, right? 01:45:56.000 |
So it's de-biased, this is Adam, right, de-biased exponentially weighted moving averages of the gradients and the squared gradients. 01:46:09.840 |
The norm is just the square root of the sum of the squares. 01:46:19.040 |
This one here, hopefully you recognize as being the Adam step. 01:46:31.060 |
So basically what LAMB is doing is it's Adam, but what we do is we average the step over all the weights in a layer. 01:46:44.440 |
That's why these L's are really important, right, because these things are happening per layer. 01:46:48.980 |
And so basically we're taking, so here's our debiased momentum, debiased squared momentum, 01:46:57.560 |
And then here's our one, and look, here's this mean, right? 01:47:03.000 |
Because remember, each stepper is created for a layer, for a parameter. 01:47:07.680 |
I shouldn't say a layer, for a parameter, okay? 01:47:11.140 |
So this is kind of like both exciting and annoying because I'd been working on this 01:47:17.600 |
exact idea, which is basically Adam but averaged out over a layer, for the previous week. 01:47:25.400 |
And then this LAMB paper came out, and I was like, "Oh, that's cool. 01:47:31.120 |
And it's like, "Oh, we do it with a new optimizer." 01:47:32.440 |
And I looked at the new optimizer and was like, "It's just the optimizer I wrote a week ago." 01:47:45.320 |
And you should definitely check out LAMB because it makes so much sense to use the average 01:47:53.440 |
over the layer of that step as a kind of a, you can see here, it's kind of got this normalization 01:48:02.520 |
Because it's just really unlikely that you want to divide every individual parameter in that tensor 01:48:09.480 |
by its own squared gradients, because it's going to vary too much. 01:48:14.120 |
There's just too much chance that there's going to be a 1e-7 in there somewhere that blows the step up. 01:48:19.880 |
So this to me is exactly the right way to do it. 01:48:22.800 |
And this is kind of like the first optimizer I've seen where I just kind of think like, 01:48:28.440 |
"Oh, finally I feel like people are heading in the right direction." 01:48:31.760 |
But when you really study this optimizer, you realize that everything we thought about optimizers is up for grabs. 01:48:37.920 |
The way optimizers are going with things like LAMB, the whole idea of what the magnitude 01:48:47.560 |
of our step should be just looks very different to everything we kind of thought of before. 01:48:55.160 |
You know, this math might look slightly intimidating at first, but now you know all of these things. 01:49:01.240 |
You know, what all they all are and you know why they exist. 01:49:08.020 |
So here's how we create a LAMB optimizer and here's how we fit with it. 01:49:14.280 |
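A hedged sketch of a LAMB-style stepper, simplified to assume the dampening equals one minus momentum: the Adam-style step for the whole parameter tensor is rescaled by a layer-wise trust ratio, the norm of the weights over the norm of the proposed step (the clamp at 10 and the epsilon in the ratio are safeguards assumed for this sketch, not claims about the paper's exact formulation).

```python
def lamb_step(p, lr, mom, sqr_mom, grad_avg, sqr_avg, eps, wd, step, **kwargs):
    debias1, debias2 = 1 - mom**step, 1 - sqr_mom**step
    r1 = p.data.pow(2).mean().sqrt()                              # norm of the weights
    adam = (grad_avg / debias1) / ((sqr_avg / debias2).sqrt() + eps) + wd * p.data
    r2 = adam.pow(2).mean().sqrt()                                # norm of the proposed step
    trust = (r1 / (r2 + eps)).clamp(max=10.)                      # layer-wise trust ratio
    p.data.add_(-lr * trust * adam)
    return p
```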
Okay, that is that, unless Sylvain says otherwise. 01:49:28.560 |
So as I was building this, I got so sick of runner because I kept on wondering when do I use the runner and when do I use the learner. 01:49:38.520 |
And then I kind of suddenly thought, like, again, like, once every month or two, I actually step back and refactor things. 01:49:48.680 |
And it's only when I get really frustrated, right? 01:49:51.620 |
So like I was getting really frustrated with runners and I actually decided to sit and look at the code. 01:49:56.400 |
And I looked at the definition of learner and I thought, wait, it doesn't do anything at 01:50:08.580 |
And then a runner has a learner in it that stores three things, like, why don't we just store those three things here? 01:50:18.920 |
So I took the runner, I took that line of code, I copied it and I pasted it just here. 01:50:29.160 |
I then found everything that said self.learn and removed the .learn and I was done. 01:50:37.720 |
And it's like, oh, it's just one of those obvious refactorings that as soon as I did 01:50:42.420 |
it, Sylvain was like, why didn't you do it that way in the first place? 01:50:47.500 |
And I was like, why didn't you fix it that way in the first place? 01:50:49.680 |
But now that we've done it, like, this is so much easier. 01:50:52.920 |
There's no more get learn run, there's no more having to match these things together. 01:50:58.920 |
So one of the nice things I like about this kind of Jupyter style of development is I 01:51:04.200 |
spend a month or two just, like, immersing myself in the code in this very experimental 01:51:12.120 |
And I feel totally fine throwing it all away and changing everything because, like, everything's 01:51:16.120 |
small and I can, like, fiddle around with it. 01:51:19.200 |
And then after a couple of months, you know, Sylvain and I will just kind of go like, okay, 01:51:24.760 |
there's a bunch of things here that work nicely together and we turn it into some modules. 01:51:29.440 |
And so that's how fast.ai version one happened. 01:51:33.840 |
People often say to us, like, turning it into modules, what a nightmare that must have been. 01:51:39.120 |
So here's what was required for me to do that. 01:51:43.720 |
I typed into Skype, Sylvain, please turn this into a module. 01:51:50.040 |
And then three hours later, Sylvain typed back and he said, done. 01:51:55.560 |
You know, it took, you know, it was four, five, six months of development in notebooks, and a few hours to turn into modules. 01:52:07.120 |
And I think this -- I find this quite delightful. 01:52:24.760 |
Sylvain wrote this fantastic package called Fast Progress, which you should totally check 01:52:32.760 |
Because remember, we're allowed to import modules that are not data science modules. 01:52:39.200 |
But now we need to attach this progress bar to our callback system. 01:52:44.240 |
So let's grab our ImageNet data as before, create a little thing with, I don't know, 01:52:56.720 |
It's basically exactly the same as it was before, except now we're storing our stats in an array. 01:53:04.240 |
And we're just passing off the array to logger. 01:53:07.560 |
Remember logger is just a print statement at this stage. 01:53:12.240 |
And then we will create our progress bar callback. 01:53:22.120 |
So with that, we can now add progress callback to our callback functions. 01:53:35.320 |
That's all the code we needed to make this happen. 01:53:43.720 |
So this is, you know, thanks to just careful, simple, decoupled software engineering. 01:53:50.080 |
We just said, okay, when you start fitting, you've got to create the master bar. 01:54:00.920 |
And then replace the logger function, not with print, but with master bar.write. 01:54:09.160 |
And then after we've done a batch, update our progress bar. 01:54:15.320 |
When we begin an epoch or begin validating, we'll have to create a new progress bar. 01:54:19.800 |
And when we're done fitting, tell the master bar we're finished. 01:54:24.520 |
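A hedged sketch of what that callback might look like, wiring fastprogress's master_bar and progress_bar into the events just described; attribute names such as self.epochs, self.dl, self.iter and self.run.logger, and the Callback base class, are assumptions about the course's callback system rather than a definitive implementation.

```python
from functools import partial
from fastprogress.fastprogress import master_bar, progress_bar

class ProgressCallback(Callback):
    _order = -1
    def begin_fit(self):
        self.mbar = master_bar(range(self.epochs))     # one outer bar for all epochs
        self.mbar.on_iter_begin()
        self.run.logger = partial(self.mbar.write, table=True)  # replace the print logger
    def after_fit(self):      self.mbar.on_iter_end()  # tell the master bar we're finished
    def after_batch(self):    self.pb.update(self.iter)
    def begin_epoch(self):    self.set_pb()
    def begin_validate(self): self.set_pb()
    def set_pb(self):
        # a fresh inner progress bar for each training or validation pass
        self.pb = progress_bar(self.dl, parent=self.mbar)
        self.mbar.update(self.epoch)
```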
So it's very easy to, once you have a system like this, to integrate with other libraries 01:54:29.360 |
if you want to use TensorBoard or Visdom or send yourself a Twilio message or whatever. 01:54:42.600 |
Okay, so we're going to finish, I think we're going to finish, unless this goes faster than I expect. 01:54:52.000 |
So so far, we've seen how to create our optimizers, we've seen how to create our data blocks API. 01:54:58.160 |
And we can use all that to train a reasonably good image net model. 01:55:03.240 |
But to make a better image net model, it's a bit short of data. 01:55:07.360 |
So we should use data augmentation as we all know. 01:55:20.360 |
The only transform we're going to use is resize fixed. 01:55:27.040 |
And let's just actually open the original pillow image without resizing it to see what it looks like. 01:55:38.360 |
When you resize, there are various resampling methods you can use. 01:55:43.320 |
So basically, when you go from one size image to another size image, do you, like, take the nearest pixel? 01:55:50.200 |
Or do you put a little cubic spline through them? 01:55:54.920 |
And so these are called resampling methods, and pillow has a few. 01:55:58.720 |
They suggest when downsampling, so going from big to small, you should use anti alias. 01:56:04.200 |
So here's what you do when you're augmenting your data, and this is not specific to images. 01:56:12.340 |
If you're doing audio, if you're doing text, if you're doing music, whatever, augment your 01:56:18.360 |
data and look at or listen to or understand your augmented data. 01:56:23.600 |
So don't like just chuck this into a model, but like look at what's going on. 01:56:28.680 |
So if I want to know what's going on here, I need to be able to see the texture of this image. 01:56:35.840 |
Now, I'm not very good at tenches, but I do know a bit about clothes. 01:56:39.420 |
So let's say if we were trying to see what this guy's wearing, it's a checkered shirt. 01:56:44.080 |
So let's zoom in and see what this guy's wearing. 01:56:51.680 |
So like, I can tell that this is going to totally break my model if we use this kind of resampling. 01:57:02.780 |
What if instead of anti aliasing, we use bilinear, which is the most common? 01:57:11.880 |
What if we use nearest neighbors, which nobody uses because everybody knows it's terrible? 01:57:20.560 |
So yeah, just look at stuff and try and find something that you can study to see whether your augmentation is destroying the detail you care about. 01:57:36.720 |
And this is interesting because what I did here was I did two steps. 01:57:39.580 |
I first of all resized to 256 by 256 with bicubic, and then I resized to my final 128 by 128. 01:57:50.080 |
And so sometimes you can combine things together in steps to get really good results. 01:57:55.360 |
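A hedged illustration of that two-step idea in Pillow; the file path and the resample choice for the second step are assumptions for the example, not the lesson's exact code:

```python
import PIL.Image

img = PIL.Image.open('tench.jpg')                            # hypothetical path
img = img.resize((256, 256), resample=PIL.Image.BICUBIC)     # step 1: bicubic to an intermediate size
img = img.resize((128, 128), resample=PIL.Image.ANTIALIAS)   # step 2: downsample to the final size
```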
Anyway, I didn't want to go into the details here, I'm just saying that when we talk about 01:58:00.440 |
image augmentation, your test is to look at or listen to or whatever your augmented data. 01:58:12.620 |
Flipping is a great data augmentation for vision. 01:58:18.000 |
The main thing I want to point out is this, at this point, our tensors contain bytes. 01:58:24.600 |
Calculating with bytes and moving bytes around is very, very fast. 01:58:28.440 |
And we really care about this because when we were doing the Dawnbench competition, one 01:58:33.100 |
of our biggest issues for speed was getting our data augmentation running fast enough 01:58:41.300 |
If you're flipping something, flipping bytes is identical to flipping floats in terms of 01:58:46.280 |
the outcome, so you should definitely do your flip while it's still a byte. 01:58:51.920 |
So image augmentation isn't just about throwing some transformation functions in there, but 01:58:58.480 |
think about when you're going to do it because you've got this pipeline where you start with 01:59:02.000 |
bytes and you start with bytes in a pillow thing and then they become bytes in a tensor 01:59:08.000 |
and then they become floats and then they get turned into a batch. 01:59:13.060 |
And so you should do whatever you can while they're still bytes. 01:59:17.400 |
Don't do things that are going to cause rounding errors or saturation problems, whatever. 01:59:27.780 |
So there's a thing in Pillow, Image.transpose with PIL.Image.FLIP_LEFT_RIGHT. 01:59:32.360 |
We'll apply it when a random number is less than 0.5. 01:59:35.600 |
Let's create an item list and let's replace that. 01:59:38.800 |
We built this ourselves, so we know how to do this stuff now. 01:59:41.240 |
Let's replace the items with 64 copies of just the first item. 01:59:46.920 |
And so that way we can now use this to create the same picture lots of times. 01:59:53.640 |
So show batch is just something that's just going to go through our batch and show all 01:59:59.320 |
Everything we're using we've built ourselves, so you never have to wonder what's going on. 02:00:04.000 |
So we can show batch with no augmentation or remember how we created our transforms. 02:00:09.240 |
We can add PIL random flip and now some of them are backwards. 02:00:15.240 |
It might be nice to turn this into a class that you actually pass a p into, to decide how often to flip. 02:00:23.320 |
You probably want to give it an order because we need to make sure it happens after we've 02:00:27.720 |
got the image and after we've converted it to RGB but before we've turned it into a tensor. 02:00:33.240 |
Since all of our PIL transforms are going to want to be that order, we may as well create 02:00:37.920 |
a PIL transform class and give it that order and then we can just inherit from that class 02:00:45.600 |
So now we've got a PIL transform class, we've got a PIL random flip, it's got this state, 02:00:50.980 |
it's going to be random, we can try it out giving it p of 0.8, and so now most of them are flipped. 02:01:02.680 |
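A sketch of those two classes as described; `Transform` is assumed to be the course's base transform class with an _order attribute, and the order value 11 is an assumption about where PIL transforms sit in the pipeline:

```python
import random
import PIL.Image

class PilTransform(Transform): _order = 11   # after opening/converting, before tensor conversion

class PilRandomFlip(PilTransform):
    def __init__(self, p=0.5): self.p = p
    def __call__(self, x):
        # flip with probability p, otherwise pass the image through unchanged
        return x.transpose(PIL.Image.FLIP_LEFT_RIGHT) if random.random() < self.p else x
```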
Or maybe we want to be able to do all these other flips. 02:01:07.220 |
So actually PIL transpose, you can pass it all kinds of different things, and they're all listed here. 02:01:21.240 |
So let's turn that into another transform where we just pick any one of those at random 02:01:38.680 |
>> It's easy to evaluate data augmentation for images; how would you handle tabular, text or time series data? 02:01:54.000 |
>> How would you handle the data augmentation? 02:01:58.560 |
So if you're augmenting text then you read the augmented text. 02:02:03.520 |
For time series you would look at the signal of the time series. 02:02:09.400 |
For tabular you would graph or however you normally visualize that kind of tabular data, 02:02:15.680 |
you would visualize that tabular data in the same way. 02:02:18.320 |
So you just kind of come and try and as a domain expert hopefully you understand your 02:02:23.040 |
data and you have to come up with a way, what are the ways you normally visualize that kind 02:02:29.120 |
of data and use the same thing for your augmented data. 02:02:40.120 |
How would you do the augmentation for tabular data text or time series? 02:02:45.880 |
I mean, again, it kind of requires your domain expertise. 02:02:53.400 |
Just before class today actually one of our alumni, Christine Payne, came in, she's at OpenAI 02:02:59.840 |
now working on music analysis and music generation, and she was talking about her data 02:03:06.600 |
augmentation, saying she's pitch shifting and volume changing and slicing bits off the front and the end. 02:03:18.560 |
It's just a case of thinking about what kinds of things could change in your data that would 02:03:24.600 |
almost certainly cause the label to not change but would still be a reasonable data item 02:03:31.480 |
and that just requires your domain expertise. 02:03:33.440 |
Oh, except for the thing I'm going to show you next which is going to be a magic trick 02:03:43.560 |
We can do random cropping and this is, again, something to be very careful of. 02:03:50.280 |
We very often want to grab a small piece of an image and zoom into that piece. 02:04:01.720 |
And if we do crop and resize, oh, we've lost his checked shirt. 02:04:09.560 |
So for example, with pillow, there's a transform called extent where you tell it what crop 02:04:17.440 |
and what resize and it does it in one step and now it's much more clear. 02:04:24.200 |
So generally speaking, you've got to be super careful, particularly when your data is still 02:04:28.720 |
bytes, not to do destructive transformations, particularly multiple destructive transformations. 02:04:36.000 |
Do them all in one go or wait until they're floats because bytes round off and disappear 02:04:47.160 |
And the cropping one takes 193 microseconds, the better one takes 500 microseconds. 02:04:57.400 |
So one approach would be to say, oh, crap, it's more than twice as long, we're screwed. 02:05:08.360 |
So here's how I thought through our time budget for this little augmentation project. 02:05:12.600 |
I know that for Dawnbench, kind of the best we could get down to is five minutes per epoch. 02:05:31.960 |
So per GPU, per minute, that's about 31,000 images, or around 500 per second. 02:05:44.880 |
Assuming four cores per GPU, that's 125 per second. 02:05:49.200 |
So we're going to try to stay under 10 milliseconds, I said 10 milliseconds per image. 02:05:58.440 |
So it's actually still a pretty small number. 02:06:05.000 |
So we're not too worried at this point about 500 microseconds. 02:06:11.240 |
But this is always kind of the thing to think about is like how much time have you got? 02:06:23.400 |
But yeah, 520 per second, we've got some time, especially since we've normally got a few cores to work with. 02:06:32.880 |
So we can just write some code to do kind of a general crop transform. 02:06:39.560 |
For ImageNet and things like that, for the validation set, what we normally do is we 02:06:48.040 |
grab the center of the image, we remove 14% from each side and grab the center. 02:06:56.360 |
So we can zoom in a little bit, so we have a center crop. 02:07:03.840 |
That's what we do for the validation set, and obviously they're all the same because there's no randomness there. 02:07:09.360 |
But for the training set, the most useful transformation by far, like all the competition 02:07:16.440 |
winners, grab a small piece of the image and zoom into it. 02:07:24.600 |
And this is going to be really useful to know about for any domain. 02:07:29.840 |
So for example, in NLP, a really useful thing to do is to grab different sized chunks of text. 02:07:37.840 |
With audio, if you're doing speech recognition, grab different sized pieces of the utterances 02:07:44.160 |
If you can find a way to get different slices of your data, it's a fantastically useful 02:07:53.760 |
And so this is like by far the main, most important augmentation used in every ImageNet 02:08:05.880 |
It's a bit weird though because what they do in this approach is this little ratio here 02:08:11.840 |
says squish it by between three over four aspect ratio to a four over three aspect ratio. 02:08:19.560 |
And so it literally makes the person, see here, he's looking quite thin. 02:08:28.120 |
It doesn't actually make any sense, this transformation, because optically speaking, there's no way 02:08:33.800 |
of looking at something in normal day-to-day life that causes them to expand outwards or squash inwards like that. 02:08:43.160 |
So when we looked at this, we thought, I think what happened here is that this 02:08:48.240 |
is the best they could do with the tools they had, but probably what they really want to 02:08:52.080 |
do is to do the thing that's kind of like physically reasonable. 02:08:57.200 |
And so the physically reasonable thing is like you might be a bit above somebody or 02:09:00.960 |
a bit below somebody or left of somebody or right of somebody, causing your perspective to change. 02:09:06.760 |
So our guess is that what we actually want is not this, but this. 02:09:12.740 |
So perspective warping is basically something that looks like this. 02:09:22.200 |
And you think about how would those four points map to four other points if you were looking from a different direction. 02:09:27.960 |
So it's like as you look from different directions, roughly speaking. 02:09:32.880 |
And the reason that I really like this idea is because when you're doing data augmentation 02:09:39.120 |
at any domain, as I mentioned, the idea is to try and create like physically reasonable 02:09:49.400 |
And these just aren't; like, you can't make somebody squishier in the real world, right? 02:09:58.760 |
So if we do a perspective transform, then they look like this. 02:10:06.280 |
If you're a bit underneath them, the fish will look a bit closer, or if you're a bit 02:10:10.720 |
over here, then the hat's a bit closer from that side. 02:10:13.720 |
So these perspective transforms make a lot more sense, right? 02:10:19.680 |
So if you're interested in perspective transforms, we have some details here on how you actually 02:10:25.760 |
The details aren't important, but what is interesting is that the transform actually requires solving a system of linear equations. 02:10:33.240 |
And did you know that PyTorch has a function for solving systems of linear equations? 02:10:37.440 |
It's amazing how much stuff is in PyTorch, right? 02:10:41.240 |
So for like lots of the things you'll need in your domain, you might be surprised to find they're already in PyTorch. 02:10:48.680 |
>> And with the cropping and resizing, what happens when you lose the object of interest? 02:11:09.600 |
And interestingly, the kind of ImageNet winning strategy is to randomly pick between 8% and 100% of the image. 02:11:19.800 |
So literally, they are very often picking 8% of the pixels. 02:11:30.380 |
So very often they'll have just the fin or just the eye. 02:11:36.720 |
So this tells us that if we want to use this really effective augmentation strategy really 02:11:40.900 |
well, we have to be very good at handling noisy labels, which we're going to learn about in 02:11:47.620 |
And it also hopefully tells you that if you already have noisy labels, don't worry about it. 02:11:54.560 |
All of the research we have tells us that we can handle labels where the thing's totally 02:12:00.880 |
missing or sometimes it's wrong, as long as it's not biased. 02:12:06.600 |
And one of the things it'll do is it'll learn to find things associated with a tench. 02:12:12.760 |
So if there's a middle-aged man looking very happy outside, could well be a tench. 02:12:19.720 |
Okay, so this is a bit of research that we're currently working on. 02:12:25.440 |
And hopefully I'll have some results to show you soon. 02:12:28.600 |
But our view is that this image warping approach is probably going to give us better results 02:12:33.960 |
than the traditional ImageNet style augmentations. 02:12:39.880 |
So here's our final transform for tilting in arbitrary directions, and here's the result. 02:12:52.360 |
The first is that it's really important to measure everything. 02:12:57.960 |
And I and many people have been shocked to discover that actually the time it takes to 02:13:03.500 |
convert an image into a float tensor is significantly longer than the amount of time it takes to do the augmentation itself. 02:13:18.600 |
So you may be thinking this image warping thing sounds really hard and slow, but be 02:13:25.280 |
careful, just converting bytes to floats is really hard and slow. 02:13:30.780 |
And then this is the one, as I mentioned, this one we're using here is the one that comes 02:13:36.120 |
We found another version that's like twice as fast, which goes directly to float. 02:13:42.460 |
So this is the one that we're going to be using. 02:13:45.400 |
So time everything, you know, if things are not running fast enough. 02:13:49.880 |
Okay, here's the thing I'm really excited about for augmentation: this stuff's all running on the CPU at the moment. 02:13:59.840 |
What if I told you, you could do arbitrary affine transformations. 02:14:06.880 |
So warping, zooming, rotating, shifting, at a speed the CPU versions just can't compare with. 02:14:25.100 |
So up to like, you know, an order of magnitude or more faster. 02:14:34.180 |
So we can actually do augmentation on the GPU. 02:14:36.880 |
And the trick is that PyTorch gives us all the functionality to make it happen. 02:14:42.440 |
So the key thing we have to do is to actually realize that our transforms, our augmentation 02:14:53.200 |
For our augmentation, we don't create one random number. 02:14:56.540 |
We create a mini batch of random numbers, which is fine because PyTorch has the ability 02:15:01.260 |
to generate batches of random numbers on the GPU. 02:15:04.880 |
And so then once we've got a mini batch of random numbers, then we just have to use that 02:15:10.200 |
to generate a mini batch of augmented images. 02:15:18.760 |
If you've done some computer vision, the next bit will look familiar; but if you're not a computer vision person, maybe not. 02:15:18.760 |
But basically, we create something called an affine grid, which is just the coordinates 02:15:26.800 |
of where is every pixel, so like literally is coordinates from minus one to one. 02:15:33.720 |
And then what we do is we multiply it by this matrix, which is called an affine transform. 02:15:41.840 |
And there are various kinds of affine transforms you can do. 02:15:44.440 |
For example, you can do a rotation transform by using this particular matrix, but these 02:15:52.320 |
And then you just, as you see here, you just do the matrix multiplication and this is how 02:15:58.240 |
So a rotation, believe it or not, is just a matrix multiplication by this little matrix. 02:16:06.320 |
If you do that, normally it's going to take you about 17 milliseconds; we can speed it 02:16:12.880 |
up a bit with einsum, or we could speed it up a little bit more with batch matrix multiply, 02:16:20.760 |
or we could stick the whole thing on the GPU and do it there. 02:16:26.000 |
And that's going to go from 11 milliseconds to 81 microseconds. 02:16:32.520 |
So if we can put things on the GPU, it's totally different. 02:16:37.120 |
And suddenly we don't have to worry about how long our augmentation is taking. 02:16:41.760 |
So this is the thing that actually rotates the coordinates, to say where each coordinate should come from. 02:16:48.080 |
And believe it or not, PyTorch has an optimized batch-wise interpolation function. 02:16:59.600 |
And not only do they have a grid sample, but this is actually even better than Pillow's, because of what you can do with padding. 02:17:05.520 |
You can say padding mode equals reflection, and the black edges are gone. 02:17:10.360 |
It just reflects what was there, which most of the time is better. 02:17:14.920 |
And so reflection padding is one of these little things we find definitely helps models. 02:17:20.160 |
So now we can put this all together into a rotate batch. 02:17:24.920 |
We can do any kind of coordinate transform here. 02:17:32.160 |
And yeah, as I say, it's dramatically faster. 02:17:38.720 |
Or in fact, we can do it all in one step because PyTorch has a thing called affine grid that 02:17:44.760 |
will actually do the multiplication as it creates a coordinate grid. 02:17:49.260 |
And this is where we get down to this incredibly fast speed. 02:17:53.040 |
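A hedged sketch of the whole idea: one random angle per image in the batch, one affine matrix per image, then affine_grid plus grid_sample with reflection padding, all on the GPU. It assumes x is a float batch of shape (batch, channels, height, width) already on CUDA; it is an illustrative sketch, not the notebook's exact implementation.

```python
import math
import torch
import torch.nn.functional as F

def rotate_batch(x, max_deg=30):
    bs = x.size(0)
    # a batch of random rotation angles, generated directly on the same device as x
    angles = (torch.rand(bs, device=x.device) - 0.5) * 2 * max_deg * math.pi / 180
    cos, sin = angles.cos(), angles.sin()
    # one 2x3 affine matrix per image
    theta = torch.zeros(bs, 2, 3, device=x.device)
    theta[:, 0, 0], theta[:, 0, 1] = cos, -sin
    theta[:, 1, 0], theta[:, 1, 1] = sin,  cos
    grid = F.affine_grid(theta, x.size())                      # batch of coordinate grids
    return F.grid_sample(x, grid, padding_mode='reflection')   # reflection padding: no black edges

# usage sketch: x_aug = rotate_batch(x_batch.cuda())
```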
So I feel like there's a whole, you know, big opportunity here. 02:18:00.440 |
There are currently no kind of hackable, anybody-can-write-their-own augmentation libraries that work this way on the GPU. 02:18:11.440 |
The entire fastai.vision library is written using PyTorch Tensor operations. 02:18:17.480 |
We did it so that we could eventually do it this way. 02:18:20.040 |
But currently they all run on the CPU, one image at a time. 02:18:30.760 |
And so whatever domain you're working in, you can hopefully start to try out these, 02:18:37.720 |
you know, randomized GPU batch wise augmentations. 02:18:44.240 |
And next week we're going to show you this magic data augmentation called mixup that's 02:18:51.280 |
going to work on the GPU, it's going to work on every kind of domain that you can think 02:18:55.560 |
of, and will possibly make most of these irrelevant, because it's so good you possibly don't need anything else. 02:19:04.400 |
So that and much more next week, we'll see you then.