
Lesson 11 (2019) - Data Block API, and generic optimizer


Chapters

0:00 Introduction
1:20 Batch norm
3:50 LSUV
10:00 ImageNet
12:15 New Data Sets
15:55 Question
18:15 Importing data
25:20 The purpose of deep learning
29:40 Getting the files
36:00 Split validation set
39:30 Labeling
53:50 Data Bunch

Transcript

Well, welcome back, welcome to lesson 11, where we're going to be talking mainly about data loading and optimizers. I said we would be talking about fastai.audio, but that's going to be a little bit later in the course. We haven't quite got to where I wanted to get to yet, so everything I said we'd talk about last week, we will talk about, but it might take a few more lessons to get there than I said.

So this is kind of where we're up to: the last little bit of our CNN, and specifically these were the things we were going to dive into when we'd done the first four of these: CUDA, convolutions, hooks, normalization. So we're going to keep going through this process to try to create our state-of-the-art ImageNet model.

The specific items that we're working with are images, but everything we've covered so far is equally valid, equally used for tabular, collaborative filtering, and text, and pretty much everything else as well. Last week we talked about BatchNorm, and I just wanted to mention that at the end of the BatchNorm notebook there's another bit called simplified running BatchNorm.

We talked a little bit about debiasing last week, and we'll talk about it more today, but Stas Bekman pointed out something which is kind of obvious in hindsight, but I didn't notice at the time: we had the sums divided by the debias factor, and the count divided by the debias factor, and then we take sum divided by count, which is (sum/debias)/(count/debias), so the two debias factors cancel each other out and we can remove all of them.

We're still going to cover debiasing today for a different purpose, but we didn't really need it last week, so we can remove all the debiasing and end up with something much simpler. That's the version we're going to go with. Also thanks to Thomas Viehmann, who pointed out that the last step, where we subtract the mean, divide by the standard deviation, multiply by mults and add adds, can be rearranged: compute mults divided by the square root of the variances, and adds minus the means times that factor. If you do it this way, then you don't actually have to touch x until you've done all of those things, and so that's going to end up being faster.
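
A minimal sketch of that rearrangement, with illustrative names rather than the notebook's exact code:

```python
def batchnorm_affine(x, means, vars_, mults, adds, eps=1e-5):
    # Naive form: normalize x first, then scale and shift.
    # return (x - means) / (vars_ + eps).sqrt() * mults + adds

    # Rearranged form: fold everything into one scale and one shift,
    # so x is only touched once, right at the end.
    factor = mults / (vars_ + eps).sqrt()
    offset = adds - means * factor
    return x * factor + offset
```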

If you think through the broadcasting operations there, you're doing a lot less computation this way, so that was a good idea as well. And I should mention that Thomas has been helping us out quite a bit with some of this batch norm stuff; he's one of the people in the PyTorch community who's been amazingly helpful and whom I'd like to call out.

So thanks also to Soumith Chintala, who was one of the original founders of PyTorch, who's been super helpful in sorting some things out for this course, and also Francisco Massa, also super helpful. They're both part of the official Facebook engineering team, and Thomas is not, but he does so much work for PyTorch that he kind of seems like he must be sometimes.

So thanks to all of you for your great help. Okay, before we move on to data blocks, I wanted to mention one other approach to making sure that your model trains nicely, and to me this is the most fastai-ish method. I wish I had come up with it, but I didn't.

A wonderful researcher named Dmytro Mishkin came up with it in a paper called All You Need Is a Good Init, and this is the paper. He came up with this technique called LSUV, Layer-sequential Unit-Variance. So the basic idea is this: you've seen now how fiddly it is to get your unit variances all the way through your network, and little things can change that.

So if you change your activation function, or, something we haven't mentioned, if you add dropout or change the amount of dropout, these are all going to impact the variances of your layer outputs, and if they're just a little bit different to one, the problem gets exponentially worse as you go through the model, as we saw.

So the normal approach to fixing this is to think really carefully about your architecture and exactly, analytically, figure out how to initialize everything so it works. And Dmytro's idea, which I like a lot better, is to let the computer figure it out, and here's how you let the computer figure it out.

We create our MNIST data set in the same way as before, and we create a bunch of layers with the same numbers of filters as before. And what I'm going to do is create a ConvLayer class which contains our convolution and our ReLU, and the idea is that this whole combined conv plus ReLU now has something I'm calling bias, but actually I'm taking that GeneralRelu and just saying: how much are we subtracting from it?

So this is kind of like something we can add or remove. And then the weight is just the conv weights. And you'll see why we're doing this in a moment. Basically what we'll do is we'll create our learner in the usual way, and however it initializes is fine. And so we can train it, that's fine, but let's try and now train it in a better way.

So let's recreate our learner, and let's grab a single minibatch. And here's a function that will let us grab a single minibatch, making sure we're using all our callbacks, so that the minibatch does all the things we need it to do. So here's one minibatch of X and Y. And then what we're going to do is find all of the modules which are of type ConvLayer.

And so it's just a little function that does that. And generally speaking, when you're working with PyTorch modules or with neural nets more generally, you need to use recursion a lot because modules can contain modules, can contain modules, right? So you can see here, find modules, calls find modules to find out all the modules throughout your kind of tree, because really a module is like a tree.

Modules have modules have modules. And so here's our list of all of our conv layers. And then what we do is we create a hook, right? And the hook is just going to grab the mean and standard deviation of a particular module. And so we can first of all just print those out.
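
A rough sketch of that recursive module search and the stat-grabbing hook (assuming the Hook/Hooks classes from the earlier lessons; names are illustrative):

```python
def find_modules(m, cond):
    # Recursively walk the module tree and return every module matching cond.
    if cond(m): return [m]
    return sum([find_modules(child, cond) for child in m.children()], [])

def append_stat(hook, module, inp, outp):
    # Forward hook: record the mean and std of this module's output.
    hook.mean, hook.std = outp.data.mean().item(), outp.data.std().item()
```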

And we can see that the means and standard deviations are not zero and one. The means are too high, as we know, because we've got the ReLUs. And the standard deviations are too low. So rather than coming up with our perfect init, instead we just create a loop. And the loop calls the model, passing in that mini-batch we have, right?

And remember, this is -- so first of all, we hook it, right? And then we call the model in a while loop. We check whether the mean, the absolute value of the mean is close to zero. And if it's not, we subtract the mean from the bias. And so it just keeps looping through, calling the model again with the hook, subtracting from the bias until we get about zero mean.

And then we do the same thing for the standard deviation. Keep checking whether standard deviation minus one is nearly zero. And as long as it isn't, we'll keep dividing by the standard deviation. And so those two loops, if we run this function, then it's going to eventually give us what we want.
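
Here's a minimal sketch of those two loops (assuming the Hook/append_stat helpers above, and the lesson's ConvLayer, whose bias and weight properties expose the post-ReLU shift and the conv weights):

```python
def lsuv_module(model, m, xb, tol=1e-3):
    # Hook this ConvLayer so every forward pass refreshes h.mean and h.std.
    h = Hook(m, append_stat)

    # Calling model(xb) in the condition re-runs the forward pass each time,
    # so the hook's statistics are up to date before we test them.
    while model(xb) is not None and abs(h.mean) > tol:     m.bias -= h.mean
    while model(xb) is not None and abs(h.std - 1) > tol:  m.weight.data /= h.std

    h.remove()
    return h.mean, h.std
```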

Now, it's not perfect, right? The means are still not quite zero. Because we do the means first and then the standard deviations, and the standard deviation changes will slightly change the mean. But you can see our standard deviations are perfectly one. And our means are pretty close. And that's it.

This is called LSUV. And this is how, without thinking at all, you can initialize any neural network pretty much to get the unit variance all the way through. And this is much easier than having to think about whether you've got ReLU or ELU or whether you've got dropout or whatever else.

So here's a super cool trick. Yeah, and then we can train it, and it trains very nicely. It's particularly useful for complex and deeper architectures. So there's, for me, the fastai approach to initializing your neural nets, which is no math, no thinking. Just a simple little for loop, or in this case, a while loop.

All right. So I think we've done enough with MNIST because we're getting really good results. It's running fast. It's looking good. Let's do something harder. So what are we going to try? Well, we're not quite ready to try ImageNet because ImageNet takes quite a lot of time. You know, a few days if you've got just one GPU to train.

And that's really frustrating and an expensive way to try to practice things or learn things or try things out. I kept finding this problem of not knowing what data set I should try for my research or for my practice or for my learning. You know, it seemed like at one end there was MNIST, which is kind of too easy.

There was CIFAR-10, which a lot of people use, but these are 32 by 32 pixel images. And it turns out, and this is something I haven't seen really well written about, but our research clearly shows, it turns out that small images, 32 by 32, have very different characteristics to larger images.

And specifically, it seems like once you get beneath about 96 by 96, things behave really differently. So stuff that works well on CIFAR-10 tends not to work well on normal sized images. Because 32 by 32 is tiny, right? And stuff that tends to work well on CIFAR-10 doesn't necessarily work well on ImageNet.

There's this kind of gap of like something with normal sized images, which I can train in a sane amount of time, but also gives me a good sense of whether something's going to work well or not. And actually, Dmytro, who wrote that LSUV paper we just looked at, also had a fantastic paper called something like Systematic Evaluation of CNN Advances on the ImageNet.

And he noticed that if you use 128 by 128 images with ImageNet, then the kind of things that he found works well or doesn't work well, all of those discoveries applied equally well to the full sized ImageNet. Still takes too long. 128 by 128 for 1.3 million images, still too long.

So I thought that was a good step, but I wanted to go even further. So I tried creating two new data sets. And my two new data sets are subsets of ImageNet. There are multiple versions of each in here, but they're both subsets of ImageNet.

They both contain just 10 classes out of 1,000. So they're 1/100 of the number of images of ImageNet. And I create a number of versions, full size, 320 pixel size, and 160 pixel size. One data set is specifically designed to be easy. It contains 10 classes that are all very different to each other.

So this is like my starting point. I thought, well, what if I create this data set, then maybe I could train it for like just an epoch or two, like just a couple of minutes and see whether something was going to work. And then the second one I created was one designed to be hard, which is 10 categories that are designed to be very similar to each other, so they're all dog breeds.

So the first data set is called Imagenette, which is very French, as you can hear. And there's some helpful pronunciation tips here. And the second is called Imagewoof. And you can see here I've created a leaderboard for Imagenette and for Imagewoof. And I've discovered that in my very quick experiments with this, the exact observations I find about what works well for the full ImageNet, I also see the same results here.

And it's also fascinating to see how some things are the same between the two data sets and some are different. And I found working with these two data sets has given me more insight into computer vision model training than anything else that I've done. So check them out. And I really wanted to mention this to say, a big part of getting good at using deep learning in your domain is knowing how to create like small, workable, useful data sets.

So once I decided to make this, it took me about three hours. Like, it's not at all hard to create a data set, it's a quick little Python script to grab the things I wanted. How did I decide which 10 things, I just looked at a list of categories and picked 10 things that I knew are different.

How did I decide to pick these things, I just looked at 10 things that I knew are dogs. So it's like just a case of like, throw something together, get it working, and then on your domain area, whether it's audio or Sanskrit texts or whatever, or genomic sequences, try to come up with your version of a toy problem or two which you hope might give insight into your full problem.

So this has been super helpful for me. And if you're interested in computer vision, I would strongly recommend trying this out. And specifically, try to beat me, right? Because trying to beat me, and these are not great, they're just okay, but trying to beat me will give you a sense of whether the things you're thinking about are in the ballpark of what a moderately competent practitioner is able to do in a small amount of time.

It's also interesting to see that with like a 1/100th the size of ImageNet, like a tiny data set, I was able to create a 90% accurate dog breed classifier from random weights. So like you can do a lot pretty quickly without much data, even if you don't have transfer learning, which is kind of amazing.

So we're going to use this data set now. Oh, sorry, you had a question. So before we look at the data set, let's do the question. Sorry, Rachel. >> So just to confirm, LSUV is something you run on all the layers once at the beginning, not during training. What if your batch size is small?

Would you overfit to that batch? >> Yeah, that's right. So you'd run it once at the start of training to initialize your weights, just so that that initial set of steps gives you sensible gradients, because it's those first few mini batches that are everything. Remember how we saw that if we didn't have a very good first few mini batches, we ended up with 90% of the activations being inactive.

So that's why we want to make sure we start well. And yeah, if you've got a small mini batch, just run five mini batches and take the mean. There's nothing special about the one mini batch, it's just a fast way to do the computation. It's not like we're doing any gradient descent or anything.

It's just a forward pass. Thanks, that was a good question. So Imagenette is too big to read it all into RAM at once. It's not huge, but it's too big to do that. So we're going to need to be able to read it in one image at a time, which is going to be true of most of our deep learning projects.

So we need some way to do that from scratch, because that's the rules. So let's start working through that process. And in the process, we're going to end up building a data block API, which you're all familiar with. But most people using the data block API feel familiar enough with it to do small tweaks for things that they kind of know they can do.

But most people I speak to don't know how to really change what's going on. So by the end of this notebook, you'll see how incredibly simple the data block API is. And you'll be able to either write your own, maybe based on this one, or modify the one in fast.ai, because this is a very direct translation of the one that's in fast.ai.

So you should be able to get going. So the first thing to do is to read in our data. And we'll see a similar thing when we build fastai.audio. But whatever process you use, you're going to have to find some library that can read the kind of data that you want.

So in our case, we have images. And there's a library called PIL, or Pillow, the Python Imaging Library, which can read images. So let's import it. We'll grab the data set and untar it. Import Pillow. And we want to see what's inside our Imagenette data set. Typing list(x.iterdir()) is far too complicated for me.

I just want to type ls. So be lazy. This is how easy it is to add stuff to the standard library. You can just take the class and add a function to it. So now we have ls. So here's ls. So we've got a training and a validation directory.
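
That monkey-patch is just a one-liner along these lines (a sketch of the idea):

```python
from pathlib import Path

# Add an `ls` method to every pathlib.Path object.
Path.ls = lambda self: list(self.iterdir())

# Usage: path.ls() now lists the contents of a directory.
```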

Within validation, we have one directory for each category. And then if we look at one category, we could grab one file name. And if we look at one file name, we have a tench. So if you want to know whether somebody is actually a deep learning practitioner, show them this photo.

If they don't know it's a tench, they're lying to you, because this is the first category in ImageNet. So if you're ever using ImageNet, you know your tenches. They're generally being held up by middle-aged men, or sometimes they're in nets. That's pretty much how it always looks in ImageNet.

So that's why we have them in Imagenette too, because it's such a classic computer vision fish. We're cheating and importing NumPy for a moment, just so I can show you what an image contains, just to turn it into an array so I can print it for you. This is really important.

It contains bytes. It contains numbers between 0 and 255; they're integers, not floats. So this is what we get when we load up an image. And it's got a geometry, and it's got a number of channels. And in this case, it's RGB, three channels. So we want to have some way to read in lots of images, which means we need to know what images there are in this directory structure.

And in the full ImageNet, there's going to be 1.3 million of them. So I need to be able to do that fast. So the first thing we need to know is which things are images. So I need a list of image extensions. Your computer already has a list of image extensions.

It's your MIME types database. So you can query Python for your MIME types database for all of the images. So here's a list of the image extensions that my computer knows about. So now what I want to do is I want to loop through all the files in a directory and find out which ones are one of these.

The fastest way to check whether something's in a list is to first of all turn it into a set. And of course, therefore, we need setify. So setify simply checks if it's already a set, and if it's not, it makes it one: it first turns it into a list, and then turns it into a set.
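
Roughly, those two pieces look like this (a sketch, assuming the listify helper from earlier in the course):

```python
import mimetypes

# All extensions whose MIME type starts with "image/", e.g. .jpg, .png, .gif
image_extensions = set(k for k, v in mimetypes.types_map.items()
                       if v.startswith('image/'))

def setify(o):
    # Return o unchanged if it's already a set; otherwise listify it first.
    return o if isinstance(o, set) else set(listify(o))
```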

So that's how we can setify things. And here's what I do when I build a little bit of functionality. I just throw together a quick bunch of tests to make sure it seems to be roughly doing the right thing. And do you remember, in lesson one, we created our own test framework.

So we can now run any notebook as a test suite. So it will automatically check if we break this at some point. OK. So now we need a way to go through a single directory and grab all of the images in that. So here we can say get files.

I always like to make sure that you can pass any of these things either a pathlib path or a string, to make it convenient. So you just say p = Path(p); if it's already a pathlib object, that doesn't do anything. So this is a nice, easy way to make sure that works.

So we just go through -- here's our pathlib object. And so you'll see in a moment how we actually grab the list of files. But this is our parent directory. This is going to be our list of files. We go through the list of files. We check that it doesn't start with dot.

If it does, that's a Unix hidden file, or a Mac hidden file. And we also check either that they didn't ask for some particular extensions, or that the extension is in the list of extensions we asked for. So that will allow us to grab just the image files. Python has something called scandir, which will take a path and list all of the files in that path.

So here is how we can call get_files. We go os.scandir, and then we go get_files, and it looks something like that. So that's just for one directory. So we can put all this together like so. And so this is something where we say, for some path, give me things with these extensions, optionally recurse, optionally only include these folder names.
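
Putting that together, a sketch of the whole thing (following the notebook's structure, using the setify helper above; details simplified):

```python
import os
from pathlib import Path

def _get_files(p, fs, extensions=None):
    # Keep non-hidden files, optionally filtered by extension.
    p = Path(p)
    return [p/f for f in fs
            if not f.startswith('.')
            and ((not extensions) or f'.{f.split(".")[-1].lower()}' in extensions)]

def get_files(path, extensions=None, recurse=False, include=None):
    path = Path(path)
    extensions = {e.lower() for e in setify(extensions)}
    if recurse:
        res = []
        for i, (p, d, f) in enumerate(os.walk(path)):  # os.walk uses scandir internally
            if include is not None and i == 0: d[:] = [o for o in d if o in include]
            else:                              d[:] = [o for o in d if not o.startswith('.')]
            res += _get_files(p, f, extensions)
        return res
    else:
        f = [o.name for o in os.scandir(path) if o.is_file()]
        return _get_files(path, f, extensions)
```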

And this is it. OK. I will go through it in detail, but I'll just point out a couple of things, because being able to rapidly look through files is important. The first is that scandir is super, super fast. It's Python's thin wrapper over a C API. So this is a really great way to quickly grab stuff for a single directory.

If you need to recurse, check out os.walk. This is the thing that uses scandir internally to walk recursively through a folder tree. And you can do cool stuff like change the list of directories that it's going to look at. And it basically returns all the information that you need.

It's super great. So os.walk and os.scandir are the things that you want to be using if you're playing with directories and files in Python and you need it to be fast. We do. So here's get_files. And so now we can say get_files, path, tench, just the image extensions.

There we go. And then we're going to need recurse, because we've got a few levels of directory structure. So here's recurse. So if we try to get all of the file names, we have 13,000. And specifically, it takes 70 milliseconds to get 13,000 file names. For me to look at 13,000 files in Windows Explorer seems to take about four minutes.

So this is unbelievably fast. So the full image net, which is 100 times bigger, it's going to be literally just a few seconds. So this gives you a sense of how incredibly fast these os.walk and skander functions are. Yes, questions are good. I've often been confused as to whether the code Jeremy is writing in the notebooks or functionality that will be integrated into the fast AI library, or whether the functions and classes are meant to be written and used by the user interactively and on the fly.

Well I guess that's really a question about what's the purpose of this deep learning from the foundations course. And different people will get different things out of it. But for me, it's about demystifying what's going on so that you can take what's in your head and turn it into something real.

And to do that would be always some combination of using things that are in existing libraries, which might be fast AI or PyTorch or TensorFlow or whatever, and partly will be things that aren't in existing libraries. And I don't want you to be in a situation where you say, well, that's not in fast AI, therefore I don't know how to do it.

So really the goal is, this is why it's also called impractical deep learning for coders, is to give you the underlying expertise and tools that you need. In practice, I would expect a lot of the stuff I'm showing you to end up in the fast AI library because that's like literally I'm showing you my research, basically.

This is like my research journal of the last six months. And that's what happens with the fastai library: I take our research and turn it into a library. And some of it, like this function, is pretty much copied and pasted from the existing fastai v1 code base, because I spent at least a week figuring out how to make this fast.

I'm sure most people can do it faster, but I'm slow and it took me a long time, and this is what I came up with. So yeah, I mean, it's going to map pretty closely to what's in fast AI already. Where things are new, we're telling you, like running batch norm is new, today we're going to be seeing a whole new kind of optimizer.

But otherwise things are going to be pretty similar to what's in fastai, so it'll let you quickly hack at fastai v1. And as fastai changes, it's not going to surprise you because you'll know what's going on. Sure. >> How does scandir compare to glob?

>> scandir should be much faster than glob. It's a little more awkward to work with because it doesn't try to do so much. It's the lowest level thing. I suspect glob probably uses it behind the scenes. You should try it: time it with glob, time it with scandir; it probably depends how you use glob exactly.

But I remember I used to use glob and it was quite a bit slower. And when I say quite a bit, you know, for those of you that have been using fastai for a while, you might have noticed that the speed at which you can scan the ImageNet folder is some orders of magnitude faster than it used to be.

So it's quite a big difference. Okay. So the reason that fast AI has a data blocks API and nobody else does is because I got so frustrated in the last course at having to create every possible combination of independent and dependent variable that I actually sat back for a while and did some thinking.

And specifically, what I did was sit back and write this down: what do you actually need to do? So let's go ahead and do these things. So we've already got files. We need some way to split out the validation set, or multiple validation sets, some way to do labeling, optionally some augmentation, transform each item into a tensor, make them into batches, optionally transform the batches, and then combine the data loaders together into a data bunch and optionally add a test set.

And so when I wrote it down like that, I just went ahead and implemented an API for each of those things to say like, okay, you can plug in anything you like to that part of the API. So let's do it, right? So we've already got the basic functionality to get the files.

So now we have to put them somewhere. We already created that ListContainer, right? So we can basically just dump our files into a list container. But in the end, what we actually want is an image list for this one, right? And for an image list, we're going to have this get method, and when you get something from the image list, it should open the image.

So PIL.Image.open is how you open an image. But we could get all kinds of different objects. So therefore, we have this superclass, which has a get method that you override. And by default, it just returns whatever you put in there, which in this case would be the file name.

So this is basically all item list does, right? It's got a list of items, right? So in this case, it's going to be our file names, the path that they came from. And then optionally also, there could be a list of transforms, right? And transforms are some kind of functions.

And we'll look at this in more detail in a moment. But basically, what will happen is, when you index into your item list -- remember, dunder getitem (__getitem__) does that -- we'll pass that back up to ListContainer's __getitem__, and that will return either a single item or a list of items. And if it's a single item, we'll just call self._get.

If it's a list of items, we'll call self._get on all of them. And what that's going to do is it's going to call the get method, which in the case of an image list will open the image. And then it will compose the transforms. So for those of you that haven't done any kind of more functional-style programming, compose is just a concept that says, go through a list of functions and call the function and replace myself with a result of that, and then call the next function and replace myself with a result of that, and so forth.

So in other words, a deep neural network is just a composition of functions. Each layer is a function, and we compose them all together. This compose does a little bit more than most compose functions. Specifically, you can optionally say, I want to order them in some way, and it checks to see whether the things have an _order attribute and sorts by that.

And also, you could pass in some keyword arguments. And if you do, it'll just keep passing in those keyword arguments. But it's basically other than that, it's a pretty standard function composition function. If you haven't seen compose used elsewhere in programming before, Google that because it's a super useful concept.
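
A sketch of that compose function, in the spirit of the notebook (the listify helper is assumed):

```python
def compose(x, funcs, order_key='_order', **kwargs):
    # Sort the functions by their _order attribute (default 0), then apply
    # them one after another, threading the result through and passing along
    # any keyword arguments.
    key = lambda o: getattr(o, order_key, 0)
    for f in sorted(listify(funcs), key=key):
        x = f(x, **kwargs)
    return x
```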

Comes up all the time. We use it all the time. And as you can see in this case, it means I can just pass in a list of transforms. And this will simply call each of those transforms in turn modifying, in this case, the image that I had. So here's how you create an image list.

And then here's a method to create an image list from a path. And it's just going to call that get files. And then that's going to give us a list of files, which we will then pass to the class constructor which expects a list of files or a list of something.

So this is basically the same as ItemList in fastai version 1. It's just a list. It's just a list where, when you try to index into it, it will call something which subclasses override. So now we've got an image list. We can use it. Now one thing that happens all the time is you try to create a mini batch of images.

But one of your images was black and white. And when Pillow opens up a black and white image, by default it gives you back a rank-2 array. Just the X and Y. No channel axis. And then you can't stack them into a mini batch because they're not all the same shape.

So what you can do is call Pillow's convert('RGB'), and if something's not RGB, it'll turn it into RGB. So here's our first transform. A transform is just a class with an _order. And MakeRGB is a transform that, when you call it, will call convert.

Or you can just do it this way: make it a function. Both are fine. These are both going to do the same thing. And this is often the case; we've seen it a bunch of times before. You can have a dunder call (__call__) or you can have a function. So here's our first transform.
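
Both forms look roughly like this (a sketch; class names match the notebook's spirit):

```python
class Transform():
    _order = 0

class MakeRGB(Transform):
    # Class version: the _order lives on the class, __call__ does the work.
    def __call__(self, item): return item.convert('RGB')

def make_rgb(item):
    # Function version: does exactly the same thing.
    return item.convert('RGB')
```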

And so here's the simplest version. It's just a function. And so if we create an image list from files using our path and pass in that transform, now we have an item list. And remember that item list inherits from ListContainer, which we gave a dunder repr. So it's going to give us nice printing.

So it's going to give us nice printing. This is why we create these little convenient things to subclass from because we get all this behavior for free. So we can now see that we've got 13,000 items. And here's a few of them is the path. And we can index into it.

And when we index into it, it calls get, and get calls Image.open, and Pillow automatically displays images in Jupyter. And so there it is. And this, of course, is a man with a tench -- yes. Thank you. Okay. He looks very happy with it. We're going to be seeing him a lot.

And because we're using the functionality that we wrote last time for list container, we can also index with a list of booleans, with a slice, with a list of ints and so forth. So here's a slice containing one item, for instance. All right. So that's step one. Step two, split validation set.

So to do that, we look and we see here's a path. Here's the file name. Here's the parent. Here's the grandparent. So here's the grandparent's name. That's the thing we use to split. So let's create a function called grandparent splitter that grabs the grandparent's name. And you call it, telling it the name of your validation set and the name of your training set, and it returns true if it's the validation set or false if it's the training set or none if it's neither.

And here's something that will create a mask: you pass it some function, and we're going to be using grandparent_splitter. And it will just grab all the things where that mask is false (that's the training set) and all the things where the mask is true (that's the validation set), and return them.

Okay. So there's our splitter. Remember we used partial. So here's a splitter that splits on grandparents, where the validation name is val, because that's what it is for Imagenette. And let's check that that seems to work. Yes, it does. We've now got a validation set with 500 things and a training set with 12,800 things.
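
A sketch of the splitter pieces described here (following the notebook's approach; names are illustrative):

```python
from functools import partial

def grandparent_splitter(fn, valid_name='valid', train_name='train'):
    # Look at the grandparent directory name of each file to decide
    # which set it belongs to; return None if it's in neither.
    gp = fn.parent.parent.name
    return True if gp == valid_name else False if gp == train_name else None

def split_by_func(items, f):
    mask = [f(o) for o in items]
    # None values (neither train nor valid) are dropped entirely.
    train = [o for o, m in zip(items, mask) if m is False]
    valid = [o for o, m in zip(items, mask) if m is True]
    return train, valid

# For Imagenette the validation folder is called "val":
splitter = partial(grandparent_splitter, valid_name='val')
```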

So that's looking good. So let's use it. So split data object is just something with a training set and a validation set. You pass it in. You save them away. And then that's basically it. Everything else from here is just convenience. So we'll give it a representation so that you can print it.

We'll define done to get attribute so that if you pass it some attribute that it doesn't know about, it will grab it from the training set. And then let's add a split by func method that just calls that split by func thing we just had. There's one trick here, though, which is we want split by func to return item lists of the same type that we gave it.

In this case, it would be an image list. So we call item list dot new. And that's why in our item list, we defined something called new. And this is a really handy trick. PyTorch has the concept of a new method as well. It says, all right, let's look at this object.

Let's see what class it is, because it might not be ItemList, right? It might be ImageList or some other subclass that doesn't even exist yet. And this is now the constructor for that class. And let's just pass it in the items that we asked for, and then pass in our path and our transforms.

So new is going to create a new item list of the same type with the same path and the same transforms, but with these new items. And so that's why this is now going to give us a training set and a validation set with the same path, the same transforms, and the same type.

And so if we call split_data.split_by_func, now you can see we've got our training set and our validation set. Easy. So next in our list of things to do is labeling. Labeling is a little more tricky. And the reason it's tricky is because we need processors.

Processors are things which are first applied to the training set; they get some state, and then they get applied to the validation set. For example, our labels should not be tench and French horn. They should be like zero and two, because when we go to do a cross entropy loss, we expect to see a long there, not a string.

So we need to be able to map tench to zero or French horn to two. We need the training set to have the same mapping as the validation set. And for any inference we do in the future, it's going to have the same mapping as well. Because otherwise, the different data sets are going to be talking about completely different things when they see the number zero, for instance.

So we're going to create something called a vocab. And a vocab is just the list saying these are our classes and this is the order they're in. Zero is tench, one is golf ball, two is French horn, and so forth. So we're going to create the vocab from the training set.

And then we're going to convert all those strings into ints using the vocab. And then we're going to do the same thing for the validation set, but we'll use the training set's vocab. So that's an example of a processor: something that converts label strings to numbers in a consistent and reproducible way.

Other things we could do would be processing texts to tokenize them and then numericalize them. Numericalizing them is a lot like converting the label strings to numbers. Or taking tabular data and filling the missing values with the median computed on the training set or whatever. So most things we do in this labeling process is going to require some kind of processor.

So in our case, we want a processor that can convert label strings to numbers. So the first thing we need to know is what are all of the possible labels. And so therefore we need to know all the possible unique things in a list. So here's some list, here's something that uniquifies them.

So that's how we can get all the unique values of something. So now that we've got that, we can create a processor. And a processor is just something that can process some items. And so let's create a category processor. And this is the thing that's going to create our list of all of the possible categories.

So basically when you say process, we're going to see if there's a vocab yet. And if there's not, this must be the training set. So we'll create a vocab. And it's just the unique values of all the items. And then we'll create the thing that goes not from int to object, but goes from object to int.

So it's the reverse mapping. So we just enumerate the vocabulary and create a dictionary with the reverse mapping. So now that we have a vocab, we can then go through all the items and process one of them at a time. And process one of them simply means look in that reverse mapping.

We could also deprocess, which takes a bunch of indexes. We would use this, for example, to print out the inferences that we're doing. So we'd better make sure we have a vocab by now, otherwise we can't do anything. And then we just deprocess each index, and deprocessing one just looks it up in the vocab.
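
A sketch of that processor (close in spirit to the notebook's CategoryProcessor; uniqueify is the helper mentioned above):

```python
class Processor():
    def process(self, items): return items

class CategoryProcessor(Processor):
    def __init__(self): self.vocab = None

    def process(self, items):
        # First call (the training set): build the vocab and the reverse mapping.
        if self.vocab is None:
            self.vocab = uniqueify(items)
            self.otoi = {v: k for k, v in enumerate(self.vocab)}
        return [self.proc1(o) for o in items]

    def proc1(self, item): return self.otoi[item]        # string -> int

    def deprocess(self, idxs):
        assert self.vocab is not None
        return [self.deproc1(i) for i in idxs]

    def deproc1(self, idx): return self.vocab[idx]       # int -> string
```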

So that's all we need. And so with this, we can now combine it all together, and let's create a processed item list. It's just a list container that contains a processor, and the items in it are whatever we were given, after being processed. And so then, as well as being able to index into it to grab those processed items, we'll also define something called obj, and that's just the thing that's going to deprocess the items again.

And that's just the thing that's going to deprocess the items again. So that's all the stuff we need to label things. So we already know that for splitting, we needed the grandparent. For labeling, we need the parent. So here's a parent labeler. Okay. And here is something which labels things using a function.

It just calls a function for each thing. And so here is our class, and we're going to have to pass it some independent variable and some dependent variable and store them away. And then we need an indexer to grab the x and grab the y at those indexes. We need a length.

We may as well make it print out nicely. And then we'll just add something just like we did before, which does the labeling. And passes those to a processed item list to grab the labels. And then passes the inputs and outputs to our constructor to give us our label data.

So that's basically it. So with that, we have a label by function where we can create our category processor. We can label the training set. We can label the validation set, and we can return the result, the split data result. So the main thing to notice here is that when we say train equals labeled data dot label passing in this processor, this processor has no vocab.

So it goes to that bit we saw that says, oh, there's no vocab, so let's create a list of all the unique possibilities. On the other hand, when it goes to the validation set, proc now does have a vocab, so it will skip that step and use the training set's vocab.

So this is really important, right? People get mixed up by this all the time in machine learning and deep learning is like very often when somebody says, my model's no better than random. The most common reason is that they're using some kind of different mapping between their training set and their validation set.

So if you use a processor like this, that's never going to happen, because you're ensuring that you're always using the same mapping. The details of the code aren't particularly important; the important idea is that your labeling process needs to include some kind of processor. And if you're doing this stuff manually, which basically every other machine learning and deep learning framework makes you do, you're asking for difficult-to-fix bugs, because any time your computer's not doing something for you, it means you have to remember to do it yourself.

So whatever framework you're using, I don't think, I don't know if any other frameworks have something quite like this. So like create something like this for yourself so that you don't have that problem. All right, let's go. In the case of online streaming data, how do you deal with having new categories in the test set that you don't see in training?

Yeah, I mean, great question. It's not just online streaming data. I mean, it happens all the time is you do inference either on your validation set or test set or in production where you see something you haven't seen before. For labels, it's less of a problem in inference because for inference, you don't have labels.

By definition, but you could certainly have that problem in your validation set. So what I tend to like to do is if I have like some kind of, if I have something where there's lots and lots of categories and some of them don't occur very often and I know that in the future there might be new categories appearing, I'll take the few least common and I'll group them together into a group called like other.

And that way I now have some way to ensure that my model can handle all these rare other cases. Something like that tends to work pretty well, but you do have to think of it ahead of time. For many kinds of problems, you know that there's a fixed set of possibilities.

And if you know that it's not a fixed set, yeah, I would generally try to create an other category with a few examples. So make sure you train with some things in that other category, all right? >> In the label data class, what is the class method decorator doing?

>> Sure. So I'll be quick because you can Google it, but basically this is the difference between an instance method and a class method. So you'll see it's not getting passed self. So you'll see that I'm not going to call this on an object of type LabeledData, but I'm calling it on the LabeledData class itself.

So it's just a convenience, really, class methods. The thing that they get passed in is the actual class that was requested. So I could create a subclass of this and then ask for that subclass. So anyway, they're called class methods. You should Google them. Pretty much every language supports class methods or something like it.

They're pretty convenient. You can get away without them, but they're pretty convenient. Great. So now we've got our labeled list, and if we print it out, it's got a training set and a validation set, and each one has an X and a Y. Our category items are a little less convenient than the FastAI version ones because the FastAI ones will actually print out the name of each category.

We haven't done anything to make that happen. So if we want the name of each category, we would actually have to refer to the .obj, which you can see we're doing here, y.obj or y.obj with a slice. So in fastai version one, there's one extra thing we have, which is this concept of an ItemBase, and you can actually define things like category items that know how to print themselves out.

Whether that convenience is worth the extra complexity is up to you if you're designing something similar yourself. So we still can't train a model with these because we have pillow objects. We need tensors. So here's our labeled list, training set, zeroth object, and that has an X and a Y.

So the zeroth thing in that tuple is the X. If they're all going to be in the batch together, they have to be the same size. So we can just go .resize. No problem. I mean, that's not a great way to do it, but it's a start. So here's a transform that resizes things.

And it has to be after all the other transforms we've seen so far because we want conversion to RGB to happen beforehand, probably, stuff like that. So we'll give this an order of 10. And this is something you pass in a size. If you pass in an integer, we'll turn it into a tuple.

And when you call it, it'll call resize, and it'll do bilinear resizing for you. So there's a transform. Once you've turned them all into the same size, then we can turn them into tensors. I stole this from TorchVision. This is how TorchVision turns pillow objects into tensors. And this has to happen after the resizing.

So we'll give this a later order. And you see, there's two ways here of adding kind of class-level state or transform-level state. I can actually attach state to a function. This is really underused in Python, but it's super handy, right? We've got a function, and we just want to say, what's the order of the function?

Or we can put it in the class. And then that's turned it into a byte tensor. We actually need a float tensor. So here's how you turn it into a float. And we don't want it to be between 0 and 255; we want it between 0 and 1. So we divide it in place by 255.

And that has to happen after it's a byte. So we'll give that a higher order again. So now here's our list of transforms. It doesn't matter what order they're in the array, because they're going to order them by the underscore order attribute. So we can pass that to our image list.
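
A sketch of those three transforms and their ordering (modeled on the notebook; to_byte_tensor follows the TorchVision trick mentioned above, and Transform is the base class sketched earlier):

```python
import torch
from PIL import Image

class ResizeFixed(Transform):
    _order = 10
    def __init__(self, size):
        if isinstance(size, int): size = (size, size)
        self.size = size
    def __call__(self, item):
        return item.resize(self.size, Image.BILINEAR)

def to_byte_tensor(item):
    # Pillow image -> byte tensor, channel axis moved to the front.
    res = torch.ByteTensor(torch.ByteStorage.from_buffer(item.tobytes()))
    w, h = item.size
    return res.view(h, w, -1).permute(2, 0, 1)
to_byte_tensor._order = 20

def to_float_tensor(item):
    # Bytes 0-255 -> floats 0-1.
    return item.float().div_(255.)
to_float_tensor._order = 30
```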

We can split it. We can label it. Here's a little convenience to permute the order back again. I don't know if you noticed, but in to_byte_tensor I had to permute (2, 0, 1), because Pillow has the channel last, whereas PyTorch assumes the channel comes first. So that's just going to put the channel first.

So to print them out, we have to put the channel last again. So now we can grab something from that list and show image. Here it is. And you can see that it is something of a torch thing of this size. So that's looking good. So we now have tensors that are floats and all the same size.

So we can train a model. So we've got a batch size. We'll use the get data loaders function we had before. We can just pass in train and valid directly from our labeled list. Let's grab a mini batch, and here it is: 64 by 3 by 128 by 128. And we can have a look at it, and we can see the vocab for it.

We can see the whole mini batch of y values. So now we can create a data bunch. That's going to have our data loaders. And to make life even easier for the future, let's add two optional things, channels in and channels out. And that way any models that want to be automatically created can automatically create themselves with the correct number of inputs and the correct number of outputs for our data set.

And let's create add to our split data, something called to data bunch, which is just this function. It just calls that get DLs we saw before. So like in practice, in your actual module, you would go back and you would paste the contents of this back into your split data definition.

But this is kind of a nice way when you're just iteratively building stuff. You can't only monkey patch PyTorch things or standard library things, you can monkey patch your own things. So here's how you can add something to a previous class when you realize later that you want it.

Okay, so let's go through and see what happens. So here are all the steps, literally all the steps. Grab the path, untar the data, grab the transforms, grab the item list, pass in the transforms, split the data using the grandparent with this validation name, label it using the parent labeler, and then turn it into a data bunch with this batch size, three channels in, ten channels out, and we'll use four processes.

Here's our callback functions from last time. Let's make sure that we normalize. In the past, we've normalized things that had only one channel, being MNIST. Now we've got three channels, so we need to make sure that we take the mean over the other axes so that we get a three-channel mean and a three-channel standard deviation.

So let's define a function that normalizes things that have three channels. So we're just broadcasting here. So here's the mean and standard deviation of this Imagenette batch. And here's a function called norm_imagenette, which we can use from now on to normalize anything with this data set. So let's add that as a callback using the batch transform we built earlier.
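
A sketch of that three-channel normalization (names follow the notebook's style; the statistics shown are illustrative, they should be whatever you measure on your own batch):

```python
import torch
from functools import partial

def normalize_chan(x, mean, std):
    # Broadcast per-channel stats over a batch of shape (N, C, H, W).
    return (x - mean[..., None, None]) / std[..., None, None]

# Illustrative per-channel statistics from one Imagenette batch (yours will differ);
# .cuda() assumes a GPU, as in the lesson.
_m = torch.tensor([0.47, 0.48, 0.45])
_s = torch.tensor([0.29, 0.28, 0.30])
norm_imagenette = partial(normalize_chan, mean=_m.cuda(), std=_s.cuda())
```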

We will create a ConvNet with this number of layers. And here's the ConvNet; we're going to come back to that. And then we will do our one-cycle scheduling using cosine one-cycle annealing, pass that into our get_learn_run, and train. And that's going to give us 72.6%. If we look at the Imagenette leaderboard for 128 pixels and 5 epochs, the best is 84.6% so far.

So this is looking pretty good. We're very much on the right track. So let's take a look and see what model we built, because it's kind of interesting. It's a few interesting features of this model. And we're going to be looking at these features quite a lot in the next two lessons.

The model knows how big its first layer has to start out because we pass in data, and data has the channels in. So this is nice. Already this is a model which you don't have to change its definition if you have hyperspectral imaging with four channels, or you have black and white with one channel, or whatever.

So this is going to change itself. Now what's the second layer going to be? Or I should say, what's the output of the first layer going to be? The input's going to be c_in. What's the output going to be? Is it going to be 16, 32, 64? Well, what we're going to do is say, well, our input has, we don't know, some number of channels, right?

But we do know that the first layer is going to be a three by three kernel, and then there's going to be some number of channels, c_in channels, which in our case is three. So as the convolution kernel scrolls over the input image, each time, the number of things that it's multiplying together is going to be three by three by c_in.

So nine times c_in. So remember we talked about this last week, right? We basically want to make sure that our first convolution actually has something useful to do, right? So if we've got nine times c_in values coming in, you wouldn't want more than that going out, because it's basically wasted time, okay?

So we discussed that briefly last week. So what I'm going to do is say, okay, let's take that value, c_in times three times three, and let's just look for the nearest number that's a power of two, and we'll use that. So that's how I do that.
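
A sketch of that stem sizing, in the spirit of the notebook (here the output width is the largest power of two not exceeding 9 × c_in, which is what the "nothing wasted" argument suggests; helper names are illustrative):

```python
import math

def prev_pow_2(x):
    # Largest power of two that is <= x.
    return 2 ** math.floor(math.log2(x))

def stem_sizes(c_in):
    # First conv: 3x3 kernel over c_in channels -> at most 9*c_in useful outputs.
    l2 = prev_pow_2(c_in * 3 * 3)
    # Then double the width for each of the next two stem layers.
    return [c_in, l2, l2 * 2, l2 * 4]

print(stem_sizes(3))   # [3, 16, 32, 64] for a 3-channel input
```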

And then I'll just go ahead and multiply by two for each of the next two layers. So this way, these vital first three layers are going to work out pretty well. So back in the old days, we used to use five by five or seven by seven kernels, okay? The first layer would be one of those, but we know now that's not a good idea.

Still most people do it because people stick with what they know, but when you look at the bag of tricks for image classification paper, which in turn refers to many previous citations, many of which are state of the art and competition winning models, the message is always clear. Three by three kernels give you more bang for your buck.

You get deeper, you end up with the same receptive field. It's faster because you've got less work going on, right? And really, this goes all the way back to the classic Zeiler and Fergus paper that we've looked at so many times over the years that we've been doing this course.

And even before that to the VGG paper, it really is three by three kernels everywhere. So any place you see something that's not a three by three kernel, have a big think about whether that makes sense. Okay, so that's basically what we have for those critical first three layers.

That's where that initial feature representation is happening. And then the rest of the layers are whatever we've asked for. And so then we can build those layers up, just saying number of filters in to number of filters out for each layer. And then as usual, average pooling, flatten, and a linear layer to however many classes are in our data. Okay, that's it.

Every time I write something like this, I break it the first 12 times. And the only way to debug it is to see exactly what's going on. To see exactly what's going on, you need to see what module is there at each point and what the output shape is at each module.

So that's why we've created this model summary. Model summary is going to use that get_batch that we added in the LSUV notebook to grab one batch of data. We will make sure that that batch is on the correct device. If you said find_all, we will use the find_modules thing that we used in LSUV to find all of the places where there's a linear layer; otherwise we will grab just the immediate children.

We will register a hook for every one of those layers using the Hooks we made. And so now we can pass that batch through the model, and the function that we've used for hooking simply prints out the module and the output shape.
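
A sketch of that summary function (assuming the get_batch, find_modules, Hooks, and is_lin_layer helpers from the earlier notebooks):

```python
def model_summary(run, learn, data, find_all=False):
    # Grab one batch through the callbacks, and move it to the model's device.
    xb, yb = get_batch(data.valid_dl, run)
    device = next(learn.model.parameters()).device
    xb, yb = xb.to(device), yb.to(device)

    # Either every matching module in the tree, or just the immediate children.
    mods = find_modules(learn.model, is_lin_layer) if find_all else learn.model.children()

    # The hook just prints each module and its output shape.
    f = lambda hook, mod, inp, out: print(f"{mod}\n{out.shape}\n")
    with Hooks(mods, f) as hooks:
        learn.model(xb)
```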

So that's how easy it is to create this wonderfully useful model summary. So to answer your question of earlier, another reason why are we doing this or what are you meant to be getting out of it is to say you don't have to write much code to create really useful tools and telemetry.

So we've seen how to create like per-layer histogram viewers, how to create model summaries. With the tools that you have at your disposal now, I really hope that you can dig inside your models what they are and what they're doing. And you see that it's all about hooks. So this hooks thing we have is just like super, super useful.

Now very grateful to the PyTorch team for adding this fantastic functionality. So you can see here we start. The input is 128 because that's a batch size, 128 by 3 by 128 by 128. And then we gradually go through these convolutions. The first one has a stride of one.

The next ones have a stride of two, so that goes 64, 32, and you can see after each one with a stride of two it gets smaller and smaller. And then an average pool takes that to a one by one. And then we flatten it, and then we have a linear layer.

So it's really as basic a ConvNet as you could get. It really is. It's just a bunch of three by three conv, ReLU, batch norm blocks. But it does terrifically well. It's deep enough. So I think that's a good start. I think that's a good time to take a break.

So let's come back at 7.45. This is one of the bits I'm most excited about in this course actually. But hopefully it's going to be like totally unexciting to you because it's just going to be so obvious that you should do it this way. But the reason I'm excited is that we're going to be talking about optimizers.

And anybody who's done work with optimizers in deep learning in the past will know that every library treats every optimizer as a totally different thing. So in PyTorch there's an Adam optimizer and an SGD optimizer and an RMSprop optimizer. And somebody comes along and says, hey, we've invented this thing called decoupled weight decay, also known as AdamW.

And the PyTorch folks go, oh, damn, what are we going to do? And they have to add a parameter to every one of their optimizers and they have to change every one of their optimizers. And then somebody else comes along and says, oh, we've invented a thing called AMSGrad.

There's another parameter we have to put into every one of those optimizers. And it's not just inefficient and frustrating, but it holds back research, because it starts feeling like there are all these things called different kinds of optimizers, but there's not. I'm going to show you there's not.

There's one optimizer and there's one optimizer in which you can inject different pieces of behavior in a very, very small number of ways. And what we're going to do is we're going to start with this generic optimizer and we're going to end up with this. This came out last week and it's a massive improvement as you see in what we can do with natural language processing.

This is the equation set that we're going to end up implementing from the paper. And what if I told you that not only I think are we the first library to have this implemented, but this is the total amount of code that we're going to write to do it.

So that's where we're going. So we're going to continue with Imagenette, and we're going to continue with the basic set of transforms we had before and the basic set of stuff to create our data bunch. This is our model, and this is something to pop it on CUDA, to get our statistics written out, and to do our batch transform with the normalization.

And so we're going to start here 52% after an epoch. And so let's try to create an optimizer. Now in PyTorch, the base thing called optimizer is just a dictionary that stores away some hyperparameters and we've actually already used it. And I deeply apologize for this. We cheated. We used something that is not part of our approved set of foundations without building it ourselves.

And we did it here. We never wrote param groups. We never wrote param groups. So we're going to go back and do it now, right? The reason we did this is because we were using torch's optim.Optimizer. We've already built the main part of that, which is the thing that multiplies by the learning rate and subtracts from the gradients.

But we didn't build param groups. So let's do it here. So here's what's going to happen. As always, we need something called zero grad, which is going to go through some parameters and zero them out and also remove any gradient computation history. And we're going to have a step function that does some kind of step.

The main difference here, though, is our step function isn't actually going to do anything. It's going to use composition on some things that we pass in and ask them to do something. So this optimizer is going to do nothing at all until we build on top of it. But we're going to set it up to be able to handle things like discriminative learning rates and one-cycle annealing and stuff like that.

And so to be able to do that, we need some way to create parameter groups. This is what we call in fast AI layer groups. And I kind of wish I hadn't called them layer groups. I should call them parameter groups because we have a perfectly good name for them already in PyTorch.

So I'm not going to call them layer groups anymore. I'm just going to call them parameter groups. But it's the same thing. OK, parameter groups and layer groups. So a parameter group-- so remember when we say parameters in PyTorch, remember right back to when we've created our first linear layer, we had a weight tensor and we had a bias tensor.

And each one of those is a parameter. It's a parameter tensor. So in order to optimize something, we need to know what all the parameter tensors are in a model. And you can just say model.parameters to grab them all in PyTorch. And that's going to give us-- it gives us a generator.

But as soon as you call list on a generator, it turns it into an actual list. So that's going to give us a list of all of the tensors, all of the weights and all of the biases, basically. But we might want to be able to say the last two layers should have a different learning rate to all the other layers.

And so the way we can do that is rather than just passing in a list of parameters, we'll pass in a list of lists. And so let's say our list of lists has two items. The first item contains all the parameters in the main body of the architecture. And the last item contains just the parameters from the last two layers.

So if we make this-- decide that this is a list of lists, then that lets us do parameter groups. Now, that's how we tell the optimizer these sets of parameters should be handled differently with discriminative learning rates and stuff. And so that's what we're going to do. We're going to assume that this thing being passed in is a list of lists.

Well, we won't quite assume. We'll check. Right? If it's not, then we'll turn it into a list of lists by just wrapping it in a list. So if it only has one thing in it, we'll just make it a list of-- with one item containing a list. So now, param groups is a list of lists of parameter tensors.

And so you could either pass in, so you could decide how you want to split them up into different parameter groups, or you could just have them turn into a single parameter group for you. So that's the first thing we need. So now, we have-- our optimizer object has a param groups attribute containing our parameter groups.

So just keep remembering that's a list of lists. All right. Each parameter group can have its own set of hyperparameters. So hyperparameters could be learning rate, momentum, beta in Adam, epsilon in Adam, and so forth. So those hyperparameters are going to be stored as a dictionary. And so there's going to be one dictionary for each parameter group.

So here's where we create it: self.hypers contains, for each parameter group, a dictionary. And what's in the dictionary? What's in the dictionary is whatever you pass to the constructor, OK? So this is how you just pass a single bunch of keyword arguments to the constructor, and it's going to construct a dictionary for every one.

And this is just a way of cloning a dictionary so that they're not all referring to the same reference, but they all have their own reference. All right. So that's doing much the same stuff as PyTorch's optim.Optimizer. And here's the new bit, stepper. In order to see what a stepper is, let's write one.

Here's a stepper. It's a function. It's called SGD step. What does it do? It does the SGD step. We've seen it before. So in other words, to create an SGD optimizer, we create a partial with our optimizer with the steppers being SGD step. So now when we call step, it goes through our parameters, composes together our steppers, which is just one thing, right?

And calls it on the parameter. So the step is going to go p.data.add_(-lr, p.grad.data). So that's how we can create SGD. So with that optimization function, we can fit. It's not doing anything different at all. But what we have done is we've done the same thing we've done 1,000 times without ever creating an SGD optimizer.
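Here's a minimal sketch of that generic optimizer and the SGD stepper, in the spirit of what's being described (the helper names `listify` and `compose` are assumptions of this sketch, not necessarily the exact notebook code):

```python
from functools import partial
import torch

def listify(o):
    # normalize anything into a list
    if o is None: return []
    return list(o) if isinstance(o, (list, tuple)) else [o]

def compose(x, funcs, **kwargs):
    # apply each function in turn, passing along any keyword arguments
    for f in listify(funcs): x = f(x, **kwargs)
    return x

class Optimizer():
    def __init__(self, params, steppers, **defaults):
        # params may be a list of tensors or a list of lists of tensors
        self.param_groups = list(params)
        if not isinstance(self.param_groups[0], list):
            self.param_groups = [self.param_groups]
        # one hyperparameter dictionary per parameter group
        self.hypers = [{**defaults} for _ in self.param_groups]
        self.steppers = listify(steppers)

    def grad_params(self):
        # every (parameter, hyperparameter-dict) pair that has a gradient
        return [(p, hyper) for pg, hyper in zip(self.param_groups, self.hypers)
                for p in pg if p.grad is not None]

    def zero_grad(self):
        for p, _ in self.grad_params():
            p.grad.detach_()          # drop any gradient computation history
            p.grad.zero_()

    def step(self):
        for p, hyper in self.grad_params():
            compose(p, self.steppers, **hyper)

def sgd_step(p, lr, **kwargs):
    # the whole of SGD: subtract lr times the gradient, in place
    p.data.add_(p.grad.data, alpha=-lr)
    return p

sgd_opt = partial(Optimizer, steppers=[sgd_step])   # "an SGD optimizer"
```

The stepper only knows about the learning rate; anything else it might need arrives through the keyword arguments that compose passes along from the hyperparameter dictionary.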

It's an optimizer with an SGD step. I've created this thing called grad params, which is just a little convenience. Basically when we zero the gradients, we have to go through every parameter. To go through every parameter, we have to go through every parameter group. And then within each parameter group, we have to go through every parameter in that group where the gradient exists.

They're the ones that we have to zero. And ditto for when we do a step. That's why I just refashioned it. And also, when we call the stepper, we want to pass to it all of our hyperparameters. Because the stepper might want them. Like it'll probably want learning rate.

And learning rate is just one of the things that we've listed in our hyperparameters. So remember how I said that our compose is a bit special, that it passes along any keyword arguments it got to everything that it composes? Here's a nice way to use that, right? So that's how sgd_step can say, oh, I need the learning rate.

And so as long as hyper has a learning rate in it, it's going to end up here. And it'll be here as long as you pass it here. And then you can change it for each different layer group. You can anneal it and so forth. So we're going to need to change our parameter scheduler to use our new generic optimizer.

It's simply now that we have to say, go through each hyperparameter in self.opt.hypers and schedule it. So that's basically the same as what we had in parameter scheduler before, but for our new thing. And ditto for recorder. This used to use param groups, now it uses hypers. So a minor change to make these keep working.

So now I was super excited when we first got this working, so it's like, wow, we've just built an SGD optimizer that works without ever writing an SGD optimizer. So now when we want to add weight decay, right? So weight decay, remember, is the thing where we don't want something that fits this.

We want something that fits this. And the way we do it is we use L2 regularization, which just means we add the sum of squared weights times some parameter we choose. And remember that the derivative of that is actually just wd times weight. So you could either add an L2 regularization term to the loss, or you can add wd times weight to the gradients.

If you've forgotten this, go back and look at weight decay in part one to remind yourself. And so if we want to add either this or this, we can do it. We can add a stepper. So weight decay is going to get an LR and a WD, and it's going to simply do that.

There it is, okay? Or L2 regularization is going to just do that. By the way, if you haven't seen this before: add_ in PyTorch normally just adds one tensor to another tensor, but if you also pass a scalar, it multiplies the two together first. This is a nice, fast way to go wd times parameter and add that to the gradient.
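As a sketch (same caveats as the sketch above), the two steppers look roughly like this:

```python
def weight_decay(p, lr, wd, **kwargs):
    # true weight decay: shrink the weights directly, p = p * (1 - lr * wd)
    p.data.mul_(1 - lr * wd)
    return p
weight_decay._defaults = dict(wd=0.)

def l2_reg(p, lr, wd, **kwargs):
    # L2 regularization: add wd * p to the gradient (the derivative of the penalty)
    p.grad.data.add_(p.data, alpha=wd)
    return p
l2_reg._defaults = dict(wd=0.)
```

(The `_defaults` attribute attached to each function is the bit of state that comes up next.)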

So there's that. Okay, so we've got our L2 regularization, we've got our weight decay. What we need to be able to do now is to be able to somehow have the idea of defaults. Because we don't want to have to say weight decay equals zero every time we want to turn it off.

So see how we've attached some state here to our function object? So the function now has something called defaults that says it's a dictionary with WD equals zero. So let's just grab exactly the same optimizer we had before. But what we're going to do is we're going to maybe update our defaults with whatever self.steppers has in their defaults.

And the reason it's maybe update is that it's not going to replace -- if you explicitly say I want this weight decay, it's not going to update it. It will only update it if it's missing. And so that's just what this little loop does, right? Just goes through each of the things, and then goes through each of the things in the dictionary, and it just checks if it's not there, then it updates it.
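Sketched out, that merging looks like this, with the call that would hook it into the Optimizer's constructor shown as a comment:

```python
def get_defaults(o): return getattr(o, '_defaults', {})

def maybe_update(things, dest, f):
    # copy each thing's defaults into dest, but never overwrite a value
    # that was passed in explicitly
    for o in things:
        for k, v in f(o).items():
            if k not in dest: dest[k] = v

# inside Optimizer.__init__, before self.hypers is built, one would add:
#     maybe_update(self.steppers, defaults, get_defaults)

# and then SGD with weight decay really is one line
sgd_opt = partial(Optimizer, steppers=[weight_decay, sgd_step], wd=0.01)
```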

So this is now -- everything else here is exactly the same as before. So now we can say let's create an SGD optimizer. It's just an optimizer with a SGD step and weight decay. And so let's create a learner, and let's try creating an optimizer, which is an SGD optimizer, with our model's parameters, with some learning rate, and make sure that the hyperparameter for weight decay should be zero, the hyperparameter for LR should be .1.

Yep, it passes. Let's try giving it a different weight decay, make sure it's there, okay, it passes as well. So we've now got an ability to basically add any step functions we want, and those step functions can have their own state that gets added automatically to our optimization object, and we can go ahead and fit, so that's fine.

So now we've got an SGD optimizer with weight decay in one line of code. Let's now add momentum. So momentum is going to require a slightly different optimizer, because momentum needs some more state. It doesn't just have parameters and hyperparameters; momentum also needs to know, for every parameter, what it was updated by last time.

Because remember the momentum equation is, if momentum is .9, then it would be .9 times whatever you did last time, plus this step, right? So we actually need to track for every single parameter what happened last time. And that's actually quite a bit of state, right? If you've got 10 million parameters in your network, you've now got 10 million more floats that you have to store, because that's your momentum.

So we're going to store that in a dictionary called state. So a stateful optimizer is just an optimizer that has state. And then we're going to have to have some stats. And stats are a lot like steppers. They're objects that we're going to pass in to say, when we create this state, how do you create it?

So when you're doing momentum, what's the function that you run to calculate momentum? So that's going to be called something of a stat class. So for example, momentum is calculated by simply averaging the gradient, like so. We take whatever the gradient averaged before, we multiply it by momentum, and we add the current gradient.

That's the definition of momentum. So this is an example of a stat class. So it's not enough just to have update, because we actually need this to be something at the start. We can't multiply by something that doesn't exist. So we're also going to define something called init state that will create a dictionary containing the initial state.

So that's all that stateful optimizer is going to do, right? It's going to look at each of our parameters, and it's going to check to see whether that parameter already exists in the state dictionary, and if it doesn't, it hasn't been initialized. So we'll initialize it with an empty dictionary, and then we'll update it with the results of that init state call we just saw.

So now that we have every parameter can now be looked up in this state dictionary to find out its state, and we can now, therefore, grab it, and then we can call update, like so. Oh, this one's not opening, like so, to do, for example, average gradients. And then we can call compose with our parameter and our steppers, and now we don't just pass in our hyperparameters, but we also pass in our state.
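A rough sketch of the stateful optimizer and the momentum stat being described, building on the sketches above (the Stat base class is an assumption of this sketch):

```python
class Stat():
    _defaults = {}
    def init_state(self, p): raise NotImplementedError
    def update(self, p, state, **kwargs): raise NotImplementedError

class AverageGrad(Stat):
    _defaults = dict(mom=0.9)
    def init_state(self, p): return {'grad_avg': torch.zeros_like(p.grad.data)}
    def update(self, p, state, mom, **kwargs):
        # running (undampened) average of the gradients
        state['grad_avg'].mul_(mom).add_(p.grad.data)
        return state

class StatefulOptimizer(Optimizer):
    def __init__(self, params, steppers, stats=None, **defaults):
        self.stats = listify(stats)
        maybe_update(self.stats, defaults, get_defaults)  # e.g. pick up mom=0.9
        super().__init__(params, steppers, **defaults)
        self.state = {}

    def step(self):
        for p, hyper in self.grad_params():
            if p not in self.state:
                # first time we see this parameter: give it an empty dict,
                # then let each stat initialize whatever it needs
                self.state[p] = {}
                maybe_update(self.stats, self.state[p],
                             lambda o: o.init_state(p))
            state = self.state[p]
            for stat in self.stats:
                state = stat.update(p, state, **hyper)
            compose(p, self.steppers, **state, **hyper)
            self.state[p] = state
```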

So now that we have average gradients, which is sticking into this thing called grad average, and it's going to be passed into our steppers, we can now do a momentum step. And the momentum step takes not just LR, but it's now going to be getting this grad average. And here is the momentum step.

It's just this grad average times the learning rate. That's all you do. So now we can create an SGD with momentum optimizer with a line of code. It can have a momentum step, it can have a weight decay step, it can have an average grad stat, we can even give it some default weight decay, and away we go.
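And the momentum step itself, plus the one-liner that assembles SGD with momentum and weight decay (again a sketch built on the pieces above; the wd value is just an example):

```python
def momentum_step(p, lr, grad_avg, **kwargs):
    # step in the direction of the running average of the gradients
    p.data.add_(grad_avg, alpha=-lr)
    return p

sgd_mom_opt = partial(StatefulOptimizer,
                      steppers=[momentum_step, weight_decay],
                      stats=AverageGrad(), wd=0.01)
```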

So here's something that might just blow your mind. Let me read it to you. Here is a paper, L2 regularization versus batch and weight norm. Batch normalization is a commonly used trick to improve training of deep neural networks, and they also use L2 regularization ostensibly to prevent overfitting. However, we show that L2 regularization has no regularizing effect.

What? Okay. It's true. Watch this. I realized this when I was chatting to Sylvain at NeurIPS, and like we were walking around the poster session, and I suddenly said to him, "Wait, Sylvain, if there's batch norm, how can L2 regularization possibly work?" And I'll tell you what I laid out to him.

This is before I discovered this paper. We've got some layer of activations, right? And some layer, and we've got some weights that was used to create that layer of activations. So these are our weights, and these are our activations, and then we pass it through some batch norm layer, right?

The batch norm layer does two things. It's got a bunch of adds, and it's got a bunch of multiplies, right? It also normalizes, but these are the learned parameters. Okay, so we come along and we say, "Okay, weight decay time. Your weight decay is a million," and it goes, "Uh-oh, what do I do?" Because now the squared of these, the sum of the squares of these gets multiplied by 1e6.

My loss function's destroyed. I can't possibly learn anything. But then the batch norm layer goes, "Oh, no, don't worry, friends," and it multiplies every single one of these batch norm multipliers by a million, okay? So what just happened?

So what happens now? To get the same activations we had before, all of our weights, like w1, now have to get divided by a million to get the same result. And so now our weight decay basically is nothing. So in other words, we can decide exactly how much weight decay loss there is simply by using the batch norm mults, right?

Now the batch norm mults get a tiny bit of weight decay applied to them, unless you turn it off, which people often do, but it's tiny, right? Because there's very few parameters here, and there's lots of parameters here. So it's true. It's true. L2 regularization has no regularizing effect, which is not what I've been telling people who have been listening to these lessons the last three years, for which I apologize.

I was wrong. I feel a little bit better in knowing that pretty much everybody in the community is wrong. We've all been doing it wrong. So Twan van Laarhoven mentioned this in the middle of 2017. Basically nobody noticed. There's a couple more papers I've mentioned in today's lesson notes from the last few months where people are finally starting to really think about this, but I'm not aware of any other course which has actually pointed out we're all doing it wrong.
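If you want to see this for yourself, here's a tiny experiment (my own illustration, not from the lesson notebooks): shrink a conv layer's weights by a factor of a million, and batch norm gives you back exactly the same activations while the L2 penalty on those weights collapses. The eps=0 is just so the rescaling is exact.

```python
import torch, torch.nn as nn

torch.manual_seed(0)
x = torch.randn(64, 3, 8, 8)

conv = nn.Conv2d(3, 16, 3, padding=1, bias=False)
bn = nn.BatchNorm2d(16, eps=0.)     # eps=0 so the comparison below is exact

y1 = bn(conv(x))                    # training mode: normalizes by batch stats
penalty1 = (conv.weight ** 2).sum().item()

with torch.no_grad():
    conv.weight /= 1e6              # make the L2 penalty a trillion times smaller...

y2 = bn(conv(x))                    # ...and batch norm undoes the rescaling
penalty2 = (conv.weight ** 2).sum().item()

print(torch.allclose(y1, y2, atol=1e-4))   # True: the activations are unchanged
print(penalty1, penalty2)                  # the "regularizer" barely notices
```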

So you know how I keep mentioning how none of us know what we're doing? We don't even know what L2 regularization does because it doesn't even do anything, but it does do something because if you change it, something happens. So this guy's wrong too. It doesn't do nothing. So a more recent paper by a team led by Roger Grosse has found three kind of ways in which maybe regularization happens, but it's not the way you think.

This is one of the papers in the lesson notes. But even in his paper, which is just a few months old, the abstract says basically, or the introduction says basically no one really understands what L2 regularization does. So we have no idea what we're doing. There's this thing that every model ever always has, and it totally doesn't work.

At least it doesn't work in the way we thought it did. So that should make you feel better about, can I contribute to deep learning? Obviously you can, because none of us have any idea what we're doing. And this is a great place to contribute, right? Is like use all this telemetry that I'm showing you, activations of different layers, and see what happens experimentally, because the people who study this stuff, like what actually happens with batch norm and weight decay, most of them don't know how to train models, right?

The theory people, and then there's like the practitioners who forget about actually thinking about the foundations at all. But if you can combine the two and say like, oh, let's actually try some experiments. Let's see what happens really when we change weight decay, now that I've assumed we don't know what we're doing, I'm sure you can find some really interesting results.

So momentum is also interesting, and we really don't understand much about how things like momentum work, but here's some nice pictures for you. And hopefully it'll give you a bit of a sense of momentum. Let's create 200 numbers equally spaced between minus four and four, and then let's create another 200 random numbers that average 0.3.

And then let's create something that plots some function for these numbers, and we're going to look at this function for each value of something called beta. And this is the function we're going to try plotting, and this is the momentum function. Okay, so what happens if we plot this function for each value of beta, for our data where the y is random and averages 0.3?

So beta here is going to be our different values of momentum, and you can see what happens is, with very little momentum, you just get very bumpy, very bumpy. Once you get up to a high momentum, you get a totally wrong answer. Why is this? Because if you think about it, right, we're constantly saying 0.9 times whatever we had before, plus the new thing, then basically you're continuing to say like, oh, the thing I had before times 0.9 plus the new thing, and the things are all above zero.

So you end up with a number that's too high. And this is why, if your momentum is too high, and basically you're way away from where you need to be in weight space, so it keeps on saying go that way, go that way, go that way. If you get that enough with a high momentum, it will literally shoot off far faster than is reasonable.

Okay, so this will give you a sense of why you've got to be really careful with high momentum, it's literally biased to end up being a higher gradient than the actual gradient. So we can fix that. Like when you think about it, this is kind of dumb, right, because we shouldn't be saying beta times average plus yi, we should be saying beta times average plus 1 minus beta times the other thing.

Like dampen the thing that we're adding in, and that's called an exponentially weighted moving average, as we know, or lerp in PyTorch speak. So let's plot the same thing as before but this time with exponentially weighted moving average. Ah, perfect. Okay, so we're done, right? Not quite. What if the thing that we're trying to match isn't just random but is some function?

So it looks something like this. Well if we use a very small momentum with exponentially weighted moving averages, we're fine. And I've added an outlier at the start just to show you what happens. Even with beta 0.7 we're fine, but uh-oh, now we've got trouble. And the reason we've got trouble is that the second, third, fourth, fifth observations all have a whole lot of this item number one in, right?

Because remember item number two is 0.99 times item number one plus 0.01 times item number two. Right? And so item number one is massively biasing the start. Even here, it takes a very long time. And the second thing that goes wrong is with this momentum is that you see how we're a bit to the right of where we should be?

We're always running a bit behind where we should be. Which makes perfect sense, right? Because we're always only taking 0.1 times the new thing. So we can use de-biasing. De-biasing is what we saw last week and it turned out, thanks to Stas Bekman's discovery, we didn't really need it, but we do need it now.

And de-biasing is to divide by one minus beta to the power of whatever batch number we're up to. So you can kind of tell, right? If your initial starting point is zero, and that's what we use always when we're de-biasing, we always start at zero, and beta is 0.9, then your first step is going to be 0.9 times zero plus 0.1 times your item.

So in other words, you'll end up at 0.1 times your item, so you're going to end up 10 times lower than you should be. So you need to divide by 0.1, right? And if you kind of work through it, you'll see that the correction for each step is simply one minus 0.9 to the power of one, two, three, four, five, and so forth.

And in fact, we have, of course, a spreadsheet showing you this. So if you have a look at the momentum bias spreadsheet, there we go. So basically, here's our batch number, and let's say these are the values that are coming in, our gradients, five, one, one, one, one, one, five, one.

Then basically, this is our exponentially weighted moving average, and here is our de-biasing correction. And then here is our resulting de-biased exponentially weighted moving average. And then you can compare it to an actual moving average of the last few. So that's basically how this works. And Sylvain loves writing LaTeX, so he wrote all this LaTeX that basically points out that if you say what I just said, which is beta times this plus one minus beta times that, and you keep doing it to itself lots and lots of times, you end up with something that they all cancel out to that.

So this is all we need to do to take our exponentially weighted moving average, divide it by one minus beta to the power of i plus one, and look at that. It's pretty good, right? It de-biases very quickly, even if you have a bad starting point, and it looks pretty good.
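In code, the de-biased exponentially weighted moving average is just this (a small sketch, not the notebook's plotting code):

```python
import torch

beta = 0.9
y = torch.randn(200) + 0.3            # noisy values that average about 0.3

avg, debiased = 0., []
for i, yi in enumerate(y):
    # lerp: beta times the old average plus (1 - beta) times the new value
    avg = beta * avg + (1 - beta) * yi.item()
    # starting from zero biases the average low, so divide by 1 - beta**(i+1)
    debiased.append(avg / (1 - beta ** (i + 1)))
```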

It's not magic, but you can see why a beta of .9 is popular. It's kind of got a pretty nice behavior. So let's use all that to create Adam. So what's Adam? Adam is dampened de-biased momentum, that's the numerator, divided by the square root of the dampened de-biased average of the squared gradients. And so we talked about why Adam does that before, we won't go into the details.

But here's our average gradient again, but this time we've added optional dampening. So if you say I want dampening, then we'll set momentum dampening to that, otherwise we'll set it to one. And so this is exactly the same as before, but with dampening. Average squared gradients is exactly the same as average gradients, we could definitely refactor these a lot, so this is all exactly the same as before, except we'll call them different things.

We'll call it squared dampening, we'll call it squared averages, and this time, rather than just adding in p.grad.data, we will multiply p.grad.data by itself; in other words, we get the squares. This is the only difference, and we store it under a different name. So with those, we're also going to need to de-bias, which means we need to know what step we're up to.

So here's a stat, which just literally counts. So here's our de-bias function, the one we just saw. And so here's Adam. Once that's in place, Adam is just the de-biased momentum with momentum dampening, the de-biased squared momentum with squared momentum dampening, and then we just take the parameter, and then our learning rate, and we've got the de-bias in here, our gradient average, and divided by the square root of the squared average.

And we also have our epsilon, oh, this is in the wrong spot, be careful, epsilon should always go inside the square root. So that's an Adam step, so now we can create an Adam optimizer in one line of code. And so there's our Adam optimizer, it has average grads, it's got average squared grads, it's got a step, and we can now try it out.
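Pulling that together, here's a sketch of the squared-gradient stat, the step counter, and an Adam step in the same style as before (for simplicity this sketch bakes the dampening into the stats rather than making it optional, and eps is passed explicitly instead of through the defaults mechanism):

```python
class DampenedAverageGrad(Stat):
    _defaults = dict(mom=0.9)
    def init_state(self, p): return {'grad_avg': torch.zeros_like(p.grad.data)}
    def update(self, p, state, mom, **kwargs):
        # dampened: the new gradient is scaled by (1 - mom)
        state['grad_avg'].mul_(mom).add_(p.grad.data, alpha=1 - mom)
        return state

class AverageSqrGrad(Stat):
    _defaults = dict(sqr_mom=0.99)
    def init_state(self, p): return {'sqr_avg': torch.zeros_like(p.grad.data)}
    def update(self, p, state, sqr_mom, **kwargs):
        # dampened running average of the squared gradients
        state['sqr_avg'].mul_(sqr_mom).addcmul_(p.grad.data, p.grad.data,
                                                value=1 - sqr_mom)
        return state

class StepCount(Stat):
    def init_state(self, p): return {'step': 0}
    def update(self, p, state, **kwargs):
        state['step'] += 1
        return state

def debias(beta, step): return 1 - beta ** step

def adam_step(p, lr, mom, sqr_mom, step, grad_avg, sqr_avg, eps, **kwargs):
    debias1, debias2 = debias(mom, step), debias(sqr_mom, step)
    # note: following the Adam paper, eps sits outside the square root here
    p.data.addcdiv_(grad_avg, (sqr_avg / debias2).sqrt() + eps,
                    value=-lr / debias1)
    return p

adam_opt = partial(StatefulOptimizer, steppers=adam_step,
                   stats=[DampenedAverageGrad(), AverageSqrGrad(), StepCount()],
                   eps=1e-5)
```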

So here's LAMB. By the way, these equations are a little nicer than these equations, and I want to point something out. Mathematicians hate refactoring. Don't be like them. Look at this, m over v plus epsilon root, lambda, da-da-da, it's the same as this. So like, it's just so complicated when things appear the same way in multiple places, right?

So when we did this equation, we gave that a new name. And so now we can just look at r2, goes from all that to just that. And wt goes from all that to just that. And so when you pull these things out, when you refactor your math, it's much easier to see what's going on.

So here's the cool thing, right? When we look at this, even if you're a terrible mathematician like me, you're going to start to recognize some patterns, and that's the trick to being a less terrible mathematician is recognizing patterns. Beta times something plus one minus beta times another thing is exponentially weighted moving average, right?

So here's one exponentially weighted moving average. Here's another exponentially weighted moving average. This one has a gradient. This one has gradient squared. This means element-wise multiplication. So these are the exponentially weighted moving average of the gradient and the gradient squared. Oh, beta to the t, debiasing. So that's the debiased version of m.

There's the debiased version of v. Not to move the epsilon? Really? Sylvain has a message, don't move the epsilon. Don't listen to Jeremy. Don't listen to Jeremy. Okay. Sylvain's an actual math guy, so- In Adam, the epsilon goes outside the square root. No way. I always thought epsilon should always go inside the square root.

Jeremy just undid a fix I pushed a week ago, when our Adam wasn't working. Let's press Control-Z a few times. There we go. That's great. So to explain why this matters and why there is no right answer, here's the difference, right? If this is 1e-7, then having it here versus having it here.

So like the square root of 1e-7 is very different to 1e-7. And in batch norm, they do put it inside the square root, and according to Sylvain, in Adam they don't. Neither is like the right place to put it or the wrong place to put it.

If you don't put it in the same place as they do in the paper, it's just a totally different number. And this is a good time to talk about epsilon in Adam, because I love epsilon in Adam. What if we made epsilon equal to 1, right? Then we've got the momentum term in the numerator, and in the denominator we've got the square root of the exponentially weighted moving average of the squared gradients.

So we're dividing by that plus 1. And most of the time, the gradients are going to be smaller than 1 and the squared version is going to be much smaller than 1. So basically then, the 1 is going to be much bigger than this, so it basically makes this go away.

So if epsilon is 1, it's pretty close to being standard SGD with momentum, or at least de-biased dampened momentum. Whereas if epsilon is 1e-7, then we're basically saying, oh, we really want to use these exponentially weighted moving averages of the squared gradients. And this is really important, because if you have some parameter that has had very small squared gradients for a while, this could well be like 1e-6, which means when you divide by it, you're multiplying by a million.

And that could absolutely kill your optimizer. So the trick to making Adam and Adam-like things work well is to make this about 0.1; somewhere between 1e-3 and 1e-1 tends to work pretty well. Most people use 1e-7, which just makes no sense. There's no way that you want to be able to multiply your step by 10 million times.

That's just never going to be a good idea. So there's another place that epsilon is a super important thing to think about. Okay. So LAMB then is stuff that we've all seen before, right? So it's debiased, this is Adam, right, debiased exponentially weighted moving averages of gradients and gradients squared.

This here is the norm of the weights. The norm is just the root of the sum of the squares. So this is just weight decay. So LAMB has weight decay built in. This one here, hopefully you recognize as being the Adam step. And so this is the norm of the Adam step.

So basically what LAMB is doing is it's Adam, but what we do is we average all the steps over a whole layer, right? That's why these L's are really important, right, because these things are happening over a layer. And so basically we're taking, so here's our debiased momentum, debiased squared momentum, right?

And then here's our one, and look, here's this mean, right? So it's for a layer. Because remember, each stepper is created for a layer, for a parameter. I shouldn't say a layer, for a parameter, okay? So this is kind of like both exciting and annoying because I'd been working on this exact idea, which is basically Adam but averaged out over a layer, for the previous week.

And then this LAMB paper came out, and I was like, "Oh, that's cool. Some paper about BERT training. I'll check it out." And it's like, "Oh, we do it with a new optimizer." And I looked at the new optimizer and was like, "It's just the optimizer I wrote a week before we were going to present it." So I'm thrilled that this thing exists.

I think it's exactly what we need. And you should definitely check out LAMB because it makes so much sense to use the average over the layer of that step as a kind of a, you can see here, it's kind of got this normalization going on. Because it's just really unlikely that every individual parameter in that tensor, you don't want to divide it by its squared gradients because it's going to vary too much.

There's just too much chance that there's going to be a 1e-7 in there somewhere or something, right? So this to me is exactly the right way to do it. And this is kind of like the first optimizer I've seen where I just kind of think like, "Oh, finally I feel like people are heading in the right direction." But when you really study this optimizer, you realize that everything we thought about optimizers kind of doesn't make sense.

The way optimizers are going with things like LAMB is the whole idea of what is the magnitude of our step, it just looks very different to everything we kind of thought of before. So check out this paper. You know, this math might look slightly intimidating at first, but now you know all of these things. You know what they all are and you know why they exist. So I think you'll be fine.
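For reference, here's what a LAMB-style step looks like in the same sketch style as the earlier blocks (the clamp at 10 and the zero-norm guard are pragmatic choices for the sketch, not something the paper prescribes; the "norm" here is a root-mean-square, whose constant factor cancels in the ratio):

```python
def lamb_step(p, lr, mom, sqr_mom, step, grad_avg, sqr_avg, eps, wd, **kwargs):
    debias1, debias2 = debias(mom, step), debias(sqr_mom, step)
    # r1: the norm of this parameter tensor (the "layer")
    r1 = p.data.pow(2).mean().sqrt()
    # the Adam-style step, with weight decay folded in
    adam = (grad_avg / debias1) / ((sqr_avg / debias2).sqrt() + eps) + wd * p.data
    # r2: the norm of that step; the ratio r1/r2 rescales the step per layer
    r2 = adam.pow(2).mean().sqrt()
    q = 1. if r1 == 0 or r2 == 0 else min((r1 / r2).item(), 10.)
    p.data.add_(adam, alpha=-lr * q)
    return p

lamb_opt = partial(StatefulOptimizer, steppers=lamb_step,
                   stats=[DampenedAverageGrad(), AverageSqrGrad(), StepCount()],
                   eps=1e-6, wd=0.)
```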

You know, what all they all are and you know why they exist. So I think you'll be fine. So here's how we create a LAM optimizer and here's how we fit with it. Okay, that is that, unless Silvass says otherwise. All right. So as I was building this, I got so sick of runner because I kept on wondering when do I pass a runner?

When do I pass a learner? And then I kind of suddenly thought, like, again, like, once every month or two, I actually sit and think. And it's only when I get really frustrated, right? So like I was getting really frustrated with runners and I actually decided to sit and think.

And I looked at the definition of learner and I thought, wait, it doesn't do anything at all. It stores three things. What kind of class just stores three things? And then a runner has a learner in it that stores three things, like, why don't we store the three things in the runner?

So I took the runner, I took that line of code, I copied it and I pasted it just here. I then renamed runner to learner. I then found everything that said self.learn and removed the .learn and I was done. And now there's no more runner. And it's like, oh, it's just one of those obvious refactorings that as soon as I did it, Sylvain was like, why didn't you do it that way in the first place?

And I was like, why didn't you fix it that way in the first place? But now that we've done it, like, this is so much easier. There's no more get learn run, there's no more having to match these things together. It's just super simple. So one of the nice things I like about this kind of Jupyter style of development is I spend a month or two just, like, immersing myself in the code in this very experimental way.

And I feel totally fine throwing it all away and changing everything because, like, everything's small and I can, like, fiddle around with it. And then after a couple of months, you know, Sylvain and I will just kind of go like, okay, there's a bunch of things here that work nicely together and we turn it into some modules.

And so that's how fast.ai version one happened. People often say to us, like, turning it into modules, what a nightmare that must have been. So here's what was required for me to do that. I typed into Skype, Sylvain, please turn this into a module. So that was pretty easy.

And then three hours later, Sylvain typed back and he said, done. It was three hours of work. You know, it took, you know, it was four, five, six months of development in notebooks, three hours to convert it into modules. So it's really -- it's not a hassle. And I think this -- I find this quite delightful.

It works super well. So no more runner, thank God. Runner is now called Learner. We're kind of back to where we were. We want progress bars. Sylvain wrote this fantastic package called Fast Progress, which you should totally check out. And we're allowed to import it. Because remember, we're allowed to import modules that are not data science modules.

Progress bar is not a data science module. But now we need to attach this progress bar to our callback system. So let's grab our ImageNet data as before, create a little model with, I don't know, four 32-filter layers. Let's rewrite our stats callback. It's basically exactly the same as it was before, except now we're storing our stats in an array, okay?

And we're just passing off the array to logger. Remember logger is just a print statement at this stage. And then we will create our progress bar callback. And that is actually the entirety of it. That's all we need. So with that, we can now add progress callback to our callback functions.

And grab our learner, no runner, fit. Now that's kind of magic, right? That's all the code we needed to make this happen. And look at the end. Oh, creates a nice little table. Pretty good. So this is, you know, thanks to just careful, simple, decoupled software engineering. We just said, okay, when you start fitting, you've got to create the master bar.

So that's the thing that tracks the epochs. And then tell the master bar we're starting. And then replace the logger function, not with print, but with master bar.write. So it's going to print the HTML into there. And then after we've done a batch, update our progress bar. When we begin an epoch or begin validating, we'll have to create a new progress bar.

And when we're done fitting, tell the master bar we're finished. That's it. So it's very easy, once you have a system like this, to integrate with other libraries if you want to use TensorBoard or Visdom or send yourself a Twilio message or whatever. It's super easy. Okay, so we're going to finish, I think we're going to finish, unless this goes faster than I expect, with data augmentation.
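For reference, driven standalone rather than through a callback, the fastprogress API being wrapped looks roughly like this (a sketch; the loop bodies are placeholders):

```python
from fastprogress.fastprogress import master_bar, progress_bar

epochs = 3
mb = master_bar(range(epochs))            # outer bar: one tick per epoch
for epoch in mb:
    for batch in progress_bar(range(100), parent=mb):   # inner bar: batches
        pass                              # the training step would go here
    mb.write(f"finished epoch {epoch}")   # the same call the logger is replaced with
```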

So so far, we've seen how to create our optimizers, we've seen how to create our data blocks API. And we can use all that to train a reasonably good ImageNet model. But to make a better ImageNet model, we're a bit short of data. So we should use data augmentation, as we all know.

Now, so let's load it in as before. And let's just grab an image list for now. The only transforms, we're going to use resize fixed. And here's our chap with a tench. And let's just actually open the original pillow image without resizing it to see what he looks like full size.

So here he is. And I want to point something out. When you resize, there are various resampling methods you can use. So basically, when you go from one size image to another size image, do you like take the pixels and take the average of them? Or do you put a little cubic spline through them?

Or what? And so these are called resampling methods, and Pillow has a few. They suggest when downsampling, so going from big to small, you should use antialias. So here's what you do when you're augmenting your data, and nothing I'm going to say here is really specific to vision.

If you're doing audio, if you're doing text, if you're doing music, whatever, augment your data and look at or listen to or understand your augmented data. So don't like just chuck this into a model, but like look at what's going on. So if I want to know what's going on here, I need to be able to see the texture of this tench.

Now, I'm not very good at tenches, but I do know a bit about clothes. So let's say if we were trying to see what this guy's wearing, it's a checkered shirt. So let's zoom in and see what this guy's wearing. I have no idea. The checkered shirt's gone. So like, I can tell that this is going to totally break my model if we use this kind of image augmentation.

So let's try a few more. What if instead of anti aliasing, we use bilinear, which is the most common? No, I still don't know what he's wearing. Okay. What if we use nearest neighbors, which nobody uses because everybody knows it's terrible? Oh, it totally works. So yeah, just look at stuff and try and find something that you can study to see whether it works.

Here's something interesting, though. This looks better still, don't you think? And this is interesting because what I did here was I did two steps. I first of all resized to 256 by 256 with bicubic, and then I resized to my final 128 by 128 with nearest neighbors. And so sometimes you can combine things together in steps to get really good results.
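In Pillow terms, the comparison above looks roughly like this (the filename is made up; LANCZOS is what older Pillow versions called ANTIALIAS):

```python
from PIL import Image

img = Image.open('tench.jpg')    # hypothetical filename

small_aa = img.resize((128, 128), resample=Image.LANCZOS)   # "antialias"
small_bl = img.resize((128, 128), resample=Image.BILINEAR)
small_nn = img.resize((128, 128), resample=Image.NEAREST)

# two steps: bicubic down to an intermediate size, then nearest to the target
small_2step = (img.resize((256, 256), resample=Image.BICUBIC)
                  .resize((128, 128), resample=Image.NEAREST))
```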

Anyway, I didn't want to go into the details here, I'm just saying that when we talk about image augmentation, your test is to look at or listen to or whatever your augmented data. So resizing is very important for vision. Flipping is a great data augmentation for vision. I don't particularly care about flipping.

The main thing I want to point out is this, at this point, our tensors contain bytes. Calculating with bytes and moving bytes around is very, very fast. And we really care about this because when we were doing the Dawnbench competition, one of our biggest issues for speed was getting our data augmentation running fast enough and doing stuff on floats is slow.

If you're flipping something, flipping bytes is identical to flipping floats in terms of the outcome, so you should definitely do your flip while it's still a byte. So image augmentation isn't just about throwing some transformation functions in there, but think about when you're going to do it because you've got this pipeline where you start with bytes and you start with bytes in a pillow thing and then they become bytes in a tensor and then they become floats and then they get turned into a batch.

Where are you going to do the work? And so you should do whatever you can while they're still bytes. But be careful. Don't do things that are going to cause rounding errors or saturation problems, whatever. But flips, definitely good. So let's do our flips. So there's a thing in PIL called Image.transpose, which we can pass PIL.Image.FLIP_LEFT_RIGHT.

Let's check it for random numbers less than 0.5. Let's create an item list and let's replace that. We built this ourselves, so we know how to do this stuff now. Let's replace the items with just the first item with 64 copies of it. And so that way we can now use this to create the same picture lots of times.

So show batch is just something that's just going to go through our batch and show all the images. Everything we're using we've built ourselves, so you never have to wonder what's going on. So we can show batch with no augmentation or remember how we created our transforms. We can add PIL random flip and now some of them are backwards.

It might be nice to turn this into a class that you actually pass a P into to decide what the probability of a flip is. You probably want to give it an order because we need to make sure it happens after we've got the image and after we've converted it to RGB but before we've turned it into a tensor.

Since all of our PIL transforms are going to want to be that order, we may as well create a PIL transform class and give it that order and then we can just inherit from that class every time we want a PIL transform. So now we've got a PIL transform class, we've got a PIL random flip, it's got this state, it's going to be random, we can try it out giving it P of 0.8 and so now most of them are flipped.

Or maybe we want to be able to do all these other flips. So actually PIL transpose, you can pass it all kinds of different things and they're basically just numbers between 0 and 6. So here are all the options. So let's turn that into another transform where we just pick any one of those at random and there it is.

So this is how we can do data augmentation. All right, now's a good time. >> It's easy to evaluate data augmentation for images, how would you handle tabular, text, or time series? >> For text you read it. >> How would you handle the data augmentation? >> You would read the augmented text.

So if you're augmenting text then you read the augmented text. For time series you would look at the signal of the time series. For tabular you would graph or however you normally visualize that kind of tabular data, you would visualize that tabular data in the same way. So you just kind of come and try and as a domain expert hopefully you understand your data and you have to come up with a way, what are the ways you normally visualize that kind of data and use the same thing for your augmented data.

Make sure it makes sense. Yeah, make sure it seems reasonable. >> Sorry, I think I misread. How would you do the augmentation for tabular data, text, or time series? >> How would you do the augmentation? I mean, again, it kind of requires your domain expertise. Just before class today actually one of our alumni, Christine Payne, came in. She's at OpenAI now working on music analysis and music generation, and she was talking about her data augmentation, saying she's pitch shifting and volume changing and slicing bits off the front and the end and stuff like that.

So there isn't an answer. It's just a case of thinking about what kinds of things could change in your data that would almost certainly cause the label to not change but would still be a reasonable data item and that just requires your domain expertise. Oh, except for the thing I'm going to show you next which is going to be a magic trick that works for everything.

So we'll come to that. We can do random cropping and this is, again, something to be very careful of. We very often want to grab a small piece of an image and zoom into that piece. It's a great way to do data augmentation. One way would be to crop and then resize.

And if we do crop and resize, oh, we've lost his checked shirt. But very often you can do both in one step. So for example, with pillow, there's a transform called extent where you tell it what crop and what resize and it does it in one step and now it's much more clear.
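Roughly, in Pillow (the filename and crop box below are made up):

```python
from PIL import Image

img = Image.open('tench.jpg')    # hypothetical filename

# two destructive steps: crop, then resize
two_step = img.crop((60, 60, 320, 320)).resize((128, 128), resample=Image.BILINEAR)

# one step: the generic transform with the EXTENT method crops and resizes at once
one_step = img.transform((128, 128), Image.EXTENT, (60, 60, 320, 320),
                         resample=Image.BILINEAR)
```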

So generally speaking, you've got to be super careful, particularly when your data is still bytes, not to do destructive transformations, particularly multiple destructive transformations. Do them all in one go or wait until they're floats because bytes round off and disappear or saturate whereas floats don't. And the cropping one takes 193 microseconds, the better one takes 500 microseconds.

So one approach would be to say, oh, crap, it's more than twice as long, we're screwed. But that's not how to think. How to think is what's your time budget? Does it matter? So here's how I thought through our time budget for this little augmentation project. I know that for DAWNBench, kind of the best we could get down to is five minutes per epoch of ImageNet on eight GPUs.

And so that's 1.25 million images. So per GPU per minute, that's about 31,000, or roughly 500 per second. Assuming four cores per GPU, that's 125 per second per core. So we're going to try to stay under about 10 milliseconds per image.

So it's actually still a pretty small number. So we're not too worried at this point about 500 microseconds. But this is always kind of the thing to think about is like how much time have you got? And sometimes these times really add up. But yeah, 520 per second, we've got some time, especially since we've got normally a few cores per GPU.
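Spelling that arithmetic out (just the rough targets mentioned above):

```python
images_per_epoch = 1_250_000          # ImageNet-ish
gpus, minutes, cores_per_gpu = 8, 5, 4

per_gpu_per_min  = images_per_epoch / gpus / minutes     # about 31,250
per_gpu_per_sec  = per_gpu_per_min / 60                  # about 520
per_core_per_sec = per_gpu_per_sec / cores_per_gpu       # about 130
budget_ms        = 1000 / per_core_per_sec               # about 8 ms per image
```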

So we can just write some code to do kind of a general crop transform. For ImageNet and things like that, for the validation set, what we normally do is we grab the center of the image, we remove 14% from each side and grab the center. So we can zoom in a little bit, so we have a center crop.

So here we show all that. That's what we do for the validation set, and obviously they're all the same because validation set doesn't have the randomness. But for the training set, the most useful transformation by far, like all the competition winners, grab a small piece of the image and zoom into it.

This is called a random resize crop. And this is going to be really useful to know about for any domain. So for example, in NLP, really useful thing to do is to grab different sized chunks of contiguous text. With audio, if you're doing speech recognition, grab different sized pieces of the utterances and so forth.

If you can find a way to get different slices of your data, it's a fantastically useful data augmentation approach. And so this is like by far the main, most important augmentation used in every ImageNet winner for the last six years or so. It's a bit weird though because what they do in this approach is this little ratio here says squish it by between three over four aspect ratio to a four over three aspect ratio.

And so it literally makes the person, see here, he's looking quite thin. And see here, he's looking quite wide. It doesn't actually make any sense, this transformation, because optically speaking, there's no way of looking at something in normal day-to-day life that causes them to expand outwards or contract inwards.

So when we looked at this, we thought, I think what happened here is that they were, this is the best they could do with the tools they had, but probably what they really want to do is to do the thing that's kind of like physically reasonable. And so the physically reasonable thing is like you might be a bit above somebody or a bit below somebody or left of somebody or right of somebody, causing your perspective to change.

So our guess is that what we actually want is not this, but this. So perspective warping is basically something that looks like this. You basically have four points, right? And you think about how would those four points map to four other points if they were going through some angle.

So it's like as you look from different directions, roughly speaking. And the reason that I really like this idea is because when you're doing data augmentation at any domain, as I mentioned, the idea is to try and create like physically reasonable in your domain inputs. And these just aren't like you can't make somebody squishier in real world, right?

But you can shift their perspective. So if we do a perspective transform, then they look like this. This is true, right? If you're a bit underneath them, the fish will look a bit closer, or if you're a bit over here, then the hat's a bit closer from that side.

So these perspective transforms make a lot more sense, right? So if you're interested in perspective transforms, we have some details here on how you actually do them mathematically. The details aren't important, but what are interesting is the transform actually requires solving a system of linear equations. And did you know that PyTorch has a function for solving systems of linear equations?

It's amazing how much stuff is in PyTorch, right? So for like lots of the things you'll need in your domain, you might be surprised to find what's already there. Question? >> And with the cropping and resizing, what happens when you lose the object of interest, so when the fish has been cropped out?

>> That's a great question. It's not just a fish, it's a tench, yeah. So there's no tench here. And so these are noisy labels. And interestingly, the kind of ImageNet winning strategy is to randomly pick between 8% and 100% of the pixels. So literally, they are very often picking 8% of the pixels.

And that's the ImageNet winning strategy. So they very often have no tench. So very often they'll have just the fin or just the eye. So this tells us that if we want to use this really effective augmentation strategy really well, we have to be very good at handling noisy labels, which we're going to learn about in the next lesson, right?

And it also hopefully tells you that if you already have noisy labels, don't worry about it. All of the research we have tells us that we can handle labels where the thing's totally missing or sometimes it's wrong, as long as it's not biased. So yeah, it's okay. And one of the things it'll do is it'll learn to find things associated with a tench.

So if there's a middle-aged man looking very happy outside, could well be a tench. Okay, so this is a bit of research that we're currently working on. And hopefully I'll have some results to show you soon. But our view is that this image warping approach is probably going to give us better results than the traditional ImageNet style augmentations.

So here's our final transform for tilting in arbitrary directions, and here's the result. Not bad. So a couple of things to finish on. The first is that it's really important to measure everything. And I and many people have been shocked to discover that actually the time it takes to convert an image into a float tensor is significantly longer than the amount of time it takes to do something as complicated as a warp.

So you may be thinking this image warping thing sounds really hard and slow, but be careful, just converting bytes to floats is really hard and slow. And then this is the one, as I mentioned, this one we're using here is the one that comes from Torch Vision. We found another version that's like twice as fast, which goes directly to float.

So this is the one that we're going to be using. So time everything if you're running, you know, if things are running not fast enough. Okay, here's the thing I'm really excited about for augmentation, is this stuff's all still too slow. What if I told you, you could do arbitrary affine transformations.

So warping, zooming, rotating, shifting at a speed which would compare. This is the normal speed. This is our speed. So up to like, you know, an order of magnitude or more faster. How do we do it? We figured out how to do it on the GPU. So we can actually do augmentation on the GPU.

And the trick is that PyTorch gives us all the functionality to make it happen. So the key thing we have to do is to actually realize that our transforms, our augmentation should happen after you create a batch. So here's what we do. For our augmentation, we don't create one random number.

We create a mini batch of random numbers, which is fine because PyTorch has the ability to generate batches of random numbers on the GPU. And so then once we've got a mini batch of random numbers, then we just have to use that to generate a mini batch of augmented images.

I won't kind of bore you with the details. I find them very interesting details. But if you're not a computer vision person, maybe not. But basically, we create something called an affine grid, which is just the coordinates of where is every pixel, so like literally is coordinates from minus one to one.

And then what we do is we multiply it by this matrix, which is called an affine transform. And there are various kinds of affine transforms you can do. For example, you can do a rotation transform by using this particular matrix, but these are all just matrix multiplications. And then you just, as you see here, you just do the matrix multiplication and this is how you can rotate.

So a rotation, believe it or not, is just a matrix multiplication by this little matrix. If you do that, normally it's going to take you about 17 milliseconds, but we can speed it up a bit with einsum, or we could speed it up a little bit more with batch matrix multiply, or we could stick the whole thing on the GPU and do it there.

And that's going to go from 11 milliseconds to 81 microseconds. So if we can put things on the GPU, it's totally different. And suddenly we don't have to worry about how long our augmentation is taking. So this is the thing that actually rotates the coordinates, to say where the coordinates are now.

Then we have to do the interpolation. And believe it or not, PyTorch has an optimized batch-wise interpolation function. It's called grid sample. And so here it is. We run it. There it is. And not only do they have a grid sample, but this is actually even better than Pillows because you don't have to have these black edges.

You can say padding mode equals reflection, and the black edges are gone. It just reflects what was there, which most of the time is better. And so reflection padding is one of these little things we find definitely helps models. So now we can put this all together into a rotate batch.

We can do any kind of coordinate transform here. One of them is rotate batch. To do it a batch at a time. And yeah, as I say, it's dramatically faster. Or in fact, we can do it all in one step because PyTorch has a thing called affine grid that will actually do the multiplication as it creates a coordinate grid.
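A sketch of that batch-wise GPU rotation using affine_grid and grid_sample (not the notebook's exact code; the angle range is an arbitrary choice):

```python
import math
import torch
import torch.nn.functional as F

def rotate_batch(x, max_deg=30.):
    # x: a batch of images, shape (bs, c, h, w), ideally already on the GPU
    bs = x.size(0)
    # one random angle per image, generated directly on the same device
    theta = (torch.rand(bs, device=x.device) - 0.5) * 2 * max_deg * math.pi / 180
    cos, sin = theta.cos(), theta.sin()
    # a batch of 2x3 affine matrices: rotation only, no translation
    m = torch.zeros(bs, 2, 3, device=x.device)
    m[:, 0, 0], m[:, 0, 1] = cos, -sin
    m[:, 1, 0], m[:, 1, 1] = sin,  cos
    grid = F.affine_grid(m, x.size(), align_corners=False)   # where to sample from
    return F.grid_sample(x, grid, padding_mode='reflection', align_corners=False)

# usage, on a batch that's already on the GPU:
#   xb_aug = rotate_batch(xb.cuda())
```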

And this is where we get down to this incredibly fast speed. So I feel like there's a whole, you know, big opportunity here. There are currently no kind of hackable, anybody can write their own augmentation. Run on the GPU libraries out there. The entire fastai.vision library is written using PyTorch Tensor operations.

We did it so that we could eventually do it this way. But currently they all run on the CPU, one image at a time. But this is our template now. So now you can do them a batch at a time. And so whatever domain you're working in, you can hopefully start to try out these, you know, randomized GPU batch wise augmentations.

And next week we're going to show you this magic data augmentation called mixup that's going to work on the GPU, it's going to work on every kind of domain that you can think of and will possibly make most of these irrelevant because it's so good you possibly don't need any others.

So that and much more next week, we'll see you then. (audience applauds)