
Lesson 2: Deep Learning 2018


Chapters

0:00 Introduction
2:40 Learning Rate
7:35 Learning Rate Finder
15:50 Data Augmentation
17:58 Transform Sets
29:24 Accuracy
30:16 Learning Rate Annealing
40:34 Saving Your Model
43:50 Fine-tuning
46:39 Differential Learning Rates

Transcript

Okay, so welcome back to deep learning lesson 2. Last week we got to the point where we had successfully trained a pretty accurate image classifier. And so just to remind you about how we did that, can you guys see okay? Actually we can't turn the front lights off, can you guys all see the screen, okay?

We can turn just these ones, can we? That pitches us all into darkness, but if that works then... Okay, that's better, isn't it? Do you mind doing the other two? Maybe that one as well. So just to remind you, the way we built this image classifier was we used a small amount of code, basically three lines of code, and these three lines of code pointed at a particular path which already had some data in it.

And so the key thing for it to know how to train this model was that this path, which was data/dogscats, had to have a particular structure: it had a train folder and a valid folder, and in each of those train and valid folders there was a cats folder and a dogs folder, and each of the cats and dogs folders was a bunch of images of cats and dogs.

So this is like a pretty standard, it's one of two main structures that are used to say here is the data that I want you to train an image model from. So I know some of you during the week went away and tried different data sets where you had folders with different sets of images in and created your own image classifiers.

And generally that seems to be working pretty well from what I can see on the forums. So to make it clear, at this point this is everything you need to get started. So if you create your own folders with different sets of images, a few hundred or a few thousand in each folder, and run the same three lines of code, that will give you an image classifier and you'll be able to see this third column tells you how accurate it is.
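To make that concrete, here is a minimal sketch of the kind of three-line setup being described, in the style of the 2018-era fastai library (version 0.7); the path, architecture and image size are just illustrative assumptions:

```python
# Sketch only: fastai 0.7-era API as used in this course; folder names are examples
from fastai.conv_learner import *   # brings in ImageClassifierData, ConvLearner, resnet34, tfms_from_model

PATH = 'data/dogscats/'             # must contain train/ and valid/, each with one folder per class
arch, sz = resnet34, 224            # pretrained architecture and input image size

data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 3)                  # learning rate 0.01, 3 epochs
```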

So we looked at some kind of simple visualizations to see what was it uncertain about, what was it wrong about, and so forth, and that's always a really good idea. And then we learned about the one key number you have to pick. So this number here is the one key number, this 0.01, and this is called the learning rate.

So I wanted to go over this again, and we'll learn about the theory behind what this is during the rest of the course in quite a lot of detail, but for now I just wanted to talk about the practice. We're going to talk about the other ones shortly, so the main one we're going to look at for now is the last column, which is the accuracy.

The first column as you can see is the epoch number, so this tells us how many times it has been through the entire data set trying to learn a better classifier, and then the next two columns are what's called the loss, which we'll be learning about either later today or next week.

The first one is the loss on the training set, these are the images that we're looking at in order to try to make a better classifier. The second is the loss on the validation set, these are the images that we're not looking at when we're training, but we're just setting them aside to see how accurate we are.

So we'll learn about the difference between loss and accuracy later. So we've got the epoch number, the training loss is the second column, the validation loss is the third column and the accuracy is the fourth column. So the basic idea of the learning rate is it's the thing that's going to decide how quickly do we hone in on the solution.

And so I find that a good way to think about this is to think about what if we were trying to fit to a function that looks something like this, we're trying to say whereabouts is the minimum point? This is basically what we do when we do deep learning, is we try to find the minimum point of a function.

Now our function happens to have millions or hundreds of millions of parameters, but it works the same basic way. And so when we look at it, we can immediately see that the lowest point is here, but how would you do that if you were a computer algorithm? And what we do is we start out at some point at random, so we pick say here, and we have a look and we say what's the loss or the error at this point, and we say what's the gradient, in other words which way is up and which way is down.

And it tells us that down is going to be in that direction, and it also tells us how fast is it going down, which at this point is going down pretty quickly. And so then we take a step in the direction that's down, and the distance we travel is going to be proportional to the gradient, it's going to be proportional to how steep it is.

The idea is that if it's steeper, then we're probably further away; that's the general idea. And so specifically what we do is we take the gradient, which is how steep it is at this point, and we multiply it by some number, and that number is called the learning rate. So if we pick a number that is very small, then we're guaranteed that we're going to go a little bit closer and a little bit closer and a little bit closer each time.

But it's going to take us a very long time to eventually get to the bottom. If we pick a number that's very big, we could actually step too far, we could go in the right direction, but we could step all the way over to here as a result of which we end up further away than we started, and we could oscillate and it gets worse and worse.
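To see that trade-off numerically, here is a tiny self-contained sketch (plain Python, not lecture code) of gradient descent on a one-dimensional parabola with a small, a reasonable, and a too-large learning rate:

```python
# Gradient descent on f(x) = x^2, whose minimum is at x = 0, starting from x = 5
def descend(lr, steps=20, x=5.0):
    for _ in range(steps):
        grad = 2 * x       # derivative of x^2 at the current point
        x -= lr * grad     # step size is the gradient times the learning rate
    return x

print(descend(0.01))   # too small: after 20 steps we're still around 3.3
print(descend(0.1))    # reasonable: we're very close to 0
print(descend(1.1))    # too big: we overshoot and oscillate further away each step
```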

So if you start training a neural net and you find that your accuracy or your loss is shooting off into infinity, almost certainly your learning rate is too high. So in a sense, learning rate too low is a better problem to have because you're just going to have to wait a long time, but wouldn't it be nice if there was a way to figure out what's the best learning rate?

Something where you could quickly go like, boom, boom, boom. And so that's why we use this thing called a learning rate finder. And what the learning rate finder does is it tries, each time it looks at another, remember the term minibatch? Minibatch is a few images that we look at each time so that we're using the parallel processing power of the GPU effectively.

We look generally at around 64 or 128 images at a time. For each minibatch, which is labeled here as an iteration, we gradually increase the learning rate. In fact, multiplicatively increase the learning rate. We start at really tiny learning rates to make sure that we don't start at something too high, and we gradually increase it.

And so the idea is that eventually the learning rate will be so big that the loss will start getting worse. So what we're going to do then is look at the plot of learning rate against loss. So when the learning rate's tiny, the loss improves slowly, then it starts to improve a bit faster, and then eventually it stops improving as quickly and in fact it starts getting worse.

So clearly here, make sure you're familiar with this scientific notation. So 10^-1 is 0.1, 10^-2 is 0.01, and when we write this in Python, we'll generally write it like this. Rather than writing 10^-1 or 10^-2, we'll just write 1e-1 or 1e-2. They mean the same thing, you're going to see that all the time.

And remember that equals 0.1, 0.01. So don't be confused by this text that it prints out here. This loss here is the final loss at the end, it's not of any interest. So ignore this, this is only interesting when we're doing regular training, not interesting for the learning rate finder.

The thing that's interesting for the learning rate finder is this learn.sched.plot, and specifically we're not looking for the point where it's the lowest, because the point where it's the lowest is actually not getting better anymore, so that's too high a learning rate. So I generally look to see where it is the lowest, and then I go back about one order of magnitude.

So 1e-2 would be a pretty good choice. So that's why you saw when we ran our fit here, we picked 0.01, which is 1e-2. So an important point to make here is this is the one key number that we've learned to adjust, and if you just adjust this number and nothing else, most of the time you're going to be able to get pretty good results.
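In the 2018-era fastai library, the workflow being described looks roughly like this (a sketch that continues the hypothetical `learn` object from the earlier snippet):

```python
# Run the learning rate finder, then inspect its plots (fastai 0.7-era API)
learn.lr_find()        # trains with an exponentially increasing learning rate until the loss blows up
learn.sched.plot_lr()  # learning rate vs. iteration: the increasing schedule it used
learn.sched.plot()     # learning rate vs. loss: pick a rate about 10x below where the loss bottoms out

learn.fit(1e-2, 2)     # e.g. if the loss bottomed out around 1e-1, train with 1e-2
```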

And this is like a very different message to what you would hear or see in any textbook or any video or any course, because up until now there's been like dozens and dozens, they're called hyperparameters, dozens and dozens of hyperparameters to set, and they've been thought of as highly sensitive and difficult to set.

So inside the fast.ai library, we kind of do all that stuff for you as much as we can. And during the course, we're going to learn that there are some more we can tweak to get slightly better results, but it's kind of a funny situation here, because for those of you that haven't done any deep learning before, it's kind of like, oh, that's all there is to it, this is very easy; and then when you talk to people outside this class, they'll be like, deep learning is so difficult, there's so much to set, it's a real art form. And so that's why there's this difference.

And so the truth is that the learning rate really is the key thing to set, and this ability to use this trick to figure out how to set it, although the paper is now probably 18 months old, almost nobody knows about this paper, it was from a guy who's not from a famous research lab, so most people kind of ignored it, and in fact even this particular technique was one sub-part of a paper that was about something else.

So again, this idea of this is how you can set the learning rate, really nobody outside this classroom just about knows about it, obviously the guy who wrote it, Leslie Smith knows about it, so it's a good thing to tell your colleagues about, it's like here is actually a great way to set the learning rate, and there's even been papers called, like one of the famous papers is called No More Pesky Learning Rates, which actually is a less effective technique than this one, but this idea that like setting learning rates is very difficult and fiddly has been true for most of the kind of deep learning history.

So here's the trick, look at this plot, find kind of the lowest, go back about a multiple of 10 and try that, and if that doesn't quite work you can always try going back another multiple of 10, but this has always worked for me so far. That's a great question.

So we're going to learn during this course about a number of ways of improving gradient descent, like you mentioned momentum and Adam and so forth. This is orthogonal, in fact. So one of the things the fast.ai library tries to do is figure out the right gradient descent variant, and in fact behind the scenes this is actually using something called Adam.

And so this technique is telling us this is the best learning rate to use, given what other tweaks you're using, in this case the Adam optimizer. So it's not that there's some compromise between this and some other approaches; this sits on top of those approaches, and you still have to set the learning rate when you use other approaches.

So we're trying to find the best kind of optimizer to use for a problem, but you still have to set the learning rate, and this is how we can do it. And in fact this idea of using this technique on top of more advanced optimizers like Adam I haven't even seen mentioned in a paper before. So I think this is not a huge breakthrough, it seems obvious, but nobody else seems to have tried it.

So as you can see it works well. When we use optimizers like Adam which have adaptive learning rates, when we set this learning rate, is it like an initial learning rate, since it changes during the epoch? So we're going to be learning about things like Adam, the details about it, later in the class, but the basic answer is no: even with Adam there actually is a learning rate; it's basically being divided by the average previous gradient and also the recent sum of squares of gradients.

So there's still a number called the learning rate. Even these so-called dynamic learning rate methods still have a learning rate. So the most important thing that you can do to make your model better is to give it more data. The challenge that happens is that these models have hundreds of millions of parameters, and if you train them for a while they start to do what's called overfitting.

And so overfitting means that they're going to start to see like the specific details of the images you're giving them rather than the more general learning that can transfer across to the validation set. So the best thing we can do to avoid overfitting is to find more data. Now obviously one way to do that would just be to collect more data from wherever you're getting it from or label more data, but a really easy way that we should always do is to use something called data augmentation.

So data augmentation is one of these things that in many courses isn't even mentioned at all, or if it is, it's kind of an advanced topic right at the end, but actually it's like the most important thing that you can do to make a better model. So it's built into the fast.ai library to make it very easy to do.

And so we're going to look at the details of the code shortly, but the basic idea is that in our initial code we had a line that said ImageClassifierData.from_paths and we passed in the path to our data, and for transforms we passed in basically the size and the architecture.

We'll look at this in more detail shortly. We just add one more parameter which is what kind of data augmentation do you want to do. And so to understand data augmentation, it's maybe easiest to look at some pictures of data augmentation. So what I've done here, again we'll look at the code in more detail later, but the basic idea is I've built a data class multiple times, I'm going to do it six times, and each time I'm going to plot the same cat.

And you can see that what happens is that this cat here is further over to the left, and this one here is further over to the right, and this one here is flipped horizontally, and so forth. So data augmentation, different types of image are going to want different types of data augmentation.

So for example, if you were trying to recognize letters and digits, you wouldn't want to flip horizontally because it actually has a different meaning. Whereas on the other hand, if you're looking at photos of cats and dogs, you probably don't want to flip vertically because cats aren't generally upside down.

Whereas if you were looking at the current Kaggle competition, which is recognizing icebergs in satellite images, you probably do want to flip them upside down, because it doesn't really matter which way around the iceberg or the satellite was. So one of the examples of the transform sets we have is transforms_side_on.

So in other words, if you have photos that are generally taken from the side, which generally means you want to be able to flip them horizontally but not vertically, this is going to give you all the transforms you need for that. So it will flip them sideways, rotate them by small amounts but not too much, slightly vary their contrast and brightness, and slightly zoom in and out a little bit and move them around a little bit.
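In code, turning this on looks roughly like the following (fastai 0.7-era API; the max_zoom value is just an example):

```python
# Side-on augmentation: horizontal flips, small rotations, zooms and lighting changes
tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(arch, data, precompute=True)
```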

So each time it's a slightly different, slightly different image. I'm getting a couple of questions from people about, could you explain again the reason why you don't take the minimum of the loss curve but a slightly higher rate? Also, people want to understand whether this works for every CNN, or for every architecture?

Could you put your hand up if there's a spare seat next to you? So there was a question about the learning rate finder about why do we use the learning rate that's less than the lowest point? And so the reason why is to understand what's going on with this learning rate finder.

So let's go back to our picture here of how do we figure out what learning rate to use. And so what we're going to do is we're going to take steps and each time we're going to double the learning rate, so double the amount by which we're multiplying the gradient.

So in other words, we'd go tiny step, slightly bigger, slightly bigger, slightly bigger, slightly bigger, slightly bigger, slightly bigger. And so the purpose of this is not to find the minimum, the purpose of this is to figure out what learning rate is allowing us to decrease quickly. So the point at which the loss was lowest here is actually there, but that learning rate actually looks like it's probably too high, it's going to just jump probably backwards and forwards.

So instead what we do is we go back to the point where the learning rate is giving us a quick decrease in the loss. So here is the actual learning rate increasing every single time we look at a new minibatch, so minibatch or iteration versus learning rate. And then here is learning rate versus loss.

So here's that point at the bottom where it was now already too high, and so here's the point where we go back a little bit and it's decreasing nice and quickly. We're going to learn about something called stochastic gradient descent with restarts shortly where we're going to see, in a sense you might want to go back to 1e-3 where it's actually even steeper still, and maybe we would actually find this will actually learn even quicker; you could try it, but we're going to see later why actually using a higher number is going to give us better generalization.

So for now we'll just put that aside. So as we increase the iterations in the learning rate finder, the learning rate is going up. This is iteration versus learning rate. So as we do that, as the learning rate increases and we plot it here, the loss goes down until we get to the point where the learning rate is too high.

And at that point the loss is now getting worse. Because I asked the question because you were just indicating that even though the minimum was at 10^-1, you suggested we should choose 10^-2, but now you're saying maybe we should go back the other way higher. I didn't mean to say that, I'm sorry if I said something backwards, so I want to go back down to a lower learning rate.

So possibly I said higher when I meant lower; I want this lower learning rate. Last class you said that all the local minima are the same, and this graph also shows the same. Is that something that was observed or is there a theory behind it? That's not what this graph is showing.

This graph is simply showing that there's a point where if we increase the learning rate more, then it stops getting better and it actually starts getting worse. The idea that all local minima are the same is a totally separate issue and it's actually something we'll see a picture of shortly, so let's come back to that.

That's a great question. I certainly run it once when I start. Later on in this class, we're going to learn about unfreezing layers, and after I unfreeze layers I sometimes run it again. If I do something to change the thing I'm training or change the way I'm training it, you may want to run it again.

Particularly if you've changed something about how you train unfreezing layers, which we're going to soon learn about, and you're finding the other training is unstable or too slow, you can run it again. There's never any harm in running it. It doesn't take very long. That's a great question. Back to data augmentation.

When we run this little tfms_from_model function, we pass in augmentation transforms; we can pass in the main two, transforms_side_on or transforms_top_down. Later on we'll learn about creating your own custom transform lists as well, but for now, because we're taking pictures from the side of cats and dogs, we'll say transforms_side_on.

Now each time we look at an image, it's going to be zoomed in or out a little bit, moved around a little bit, rotated a little bit, possibly flipped. What this does is it's not exactly creating new data, but as far as the convolutional neural net is concerned, it's a different way of looking at this thing, and it actually therefore allows it to learn how to recognize cats or dogs from somewhat different angles.

So when we do data augmentation, we're basically trying to say based on our domain knowledge, here are different ways that we can mess with this image that we know still make it the same image, and that we could expect that you might actually see that kind of image in the real world.

So what we can do now is when we call this from_paths function, which we'll learn more about shortly, we can now pass in this set of transforms which actually have these augmentations in. So we're going to start from scratch here, we do a fit, and initially the augmentations actually don't do anything.

And the reason initially they don't do anything is because we've got here something that says pre-compute = true. We're going to go back to this lots of times. But basically what this is doing is do you remember this picture we saw where we learned each different layer has these activations that basically look for anything from the middle of flowers to eyeballs of birds or whatever.

And so literally what happens is that the later layers of this convolutional neural network have these things called activations. Activation is a number that says this feature, like eyeball of birds, is in this location with this level of confidence, with this probability. And so we're going to see a lot of this later.

But what we can do is we can say, in this we've got a pre-trained network, and a pre-trained network is one where it's already learned to recognize certain things. In this case it's learned to recognize the 1.5 million images in the ImageNet dataset. And so what we could do is we could take the second last layer, so the one which has got all of the information necessary to figure out what kind of thing a thing is, and we can save those activations.

So basically saving things saying there's this level of eyeballness here, and this level of dog's face-ness here, and this level of fluffy ear there, and so forth. And so we save for every image these activations, and we call them the pre-computed activations. And so the idea is now that when we want to create a new classifier which can basically take advantage of these pre-computed activations, we can very quickly train a simple linear model based on those.

And so that's what happens when we say pre-compute = true. And that's why, you may have noticed this week, the first time that you run a new model, it takes a minute or two. Whereas you saw when I ran it, it took like 5 or 10 seconds; for you it took a minute or two, and that's because it had to pre-compute these activations. It just has to do that once.

If you're using your own computer or AWS, it just has to do it once ever. If you're using Crestle, it actually has to do it once every single time you rerun Crestle, because for these pre-computed activations Crestle uses a special little kind of scratch space that disappears each time you restart your Crestle instance.

So other than that Crestle special case, generally speaking, you just have to run it once ever for a dataset. So the issue with that is that we've pre-computed for each image how much does it have an ear here and how much does it have a lizard's eyeball there and so forth.

That means that data augmentations don't work. In other words, even though we're trying to show it a different version of the cat each time, we've pre-computed the activations for a particular version of that cat. So in order to use data augmentation, we just have to go learn.precompute = False, and then we can run a few more epochs.
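So the sequence being described is roughly this (a sketch, fastai 0.7-era API; the cycle_len argument that appears here is explained just below):

```python
learn.fit(1e-2, 1)                 # quick first fit on the precomputed activations
learn.precompute = False           # now the augmented images are actually fed through the network
learn.fit(1e-2, 3, cycle_len=1)    # a few more epochs, each one seeing differently augmented images
```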

And so you can see here that as we run more epochs, the accuracy isn't particularly getting better. That's the bad news. The good news is that you can see the train loss, this is like a way of measuring the error of this model; although that's getting better, the error's going down, the validation error isn't going down, but we're not overfitting.

And overfitting would mean that the training loss is much lower than the validation loss. We're going to talk about that a lot during this course, but the general idea here is if you're doing a much better job on the training set than you are on the validation set, that means your model's not generalizing.

So we're not at that point, which is good, but we're not really improving. So we're going to have to figure out how to deal with that. Before we do, I want to show you one other cool trick. I've added here cycle length=1, and this is another really interesting idea.

Here's the basic idea. Cycle length=1 enables a fairly recent discovery in deep learning called stochastic gradient descent with restarts. And the basic idea is this. As you get closer and closer to the right spot, I may want to start to decrease my learning rate. Because as I get closer, I'm pretty close to where I want to be, so let's slow down my steps to try to get exactly to the right spot.

And so as we do more iterations, our learning rate perhaps should actually go down. Because as we go along, we're getting closer and closer to where we want to be and we want to get exactly to the right spot. So the idea of decreasing the learning rate as you train is called learning rate annealing.

And it's very, very common, very, very popular. Everybody uses it basically all the time. The most common kind of learning rate annealing is really horrendously hacky. It's basically that researchers pick a learning rate that seems to work for a while, and then when it stops learning well, they drop it down by about 10 times, and then they keep learning a bit more until it doesn't seem to be improving, and they drop it down by another 10 times.

That's what most academic research papers and most people in the industry do. So this would be like stepwise annealing, very manual, very annoying. A better approach is simply to pick some kind of functional form like a line. It turns out that a really good functional form is one half of a cosine curve.

And the reason why is that for a while when you're not very close, you kind of have a really high learning rate, and then as you do get close you kind of quickly drop down and do a few iterations with a really low learning rate. And so this is called cosine annealing.
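To make "half a cosine" concrete, here is a short numpy sketch (not lecture code) of one cycle of a cosine-annealed learning rate, starting at a maximum of 1e-2 and dropping towards zero over 100 iterations:

```python
import numpy as np

lr_max, n_iter = 1e-2, 100
t = np.arange(n_iter)
# Half a cosine: stays near lr_max early on, then drops quickly towards 0 at the end of the cycle
lrs = 0.5 * lr_max * (1 + np.cos(np.pi * t / n_iter))
print(lrs[0], lrs[n_iter // 2], lrs[-1])   # ~1e-2 at the start, ~5e-3 halfway, ~0 at the end
```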

So to those of you who haven't done trigonometry for a while, cosine basically looks something like this. So we've picked one little half piece. So we're going to use cosine annealing. But here's the thing: when you're in a very high dimensional space, and here we're only able to show 3 dimensions, but in reality we've got hundreds of millions of dimensions, we've got lots of different fairly flat points, all of which are pretty good, but they might differ in a really interesting way. Some of those flat points... let me show you.

Let's imagine we've got a surface that looks something like this. Now imagine that our random guess started here, and our initial learning rate annealing schedule got us down to here. Now indeed that's a pretty nice low error, but it probably doesn't generalize very well, which is to say if we use a different dataset where things are just kind of slightly different in one of these directions, suddenly it's a terrible solution, whereas over here it's basically equally good in terms of loss, but it rather suggests that if you have slightly different datasets that are slightly moved in different directions, it's still going to be good.

So in other words, we would expect this solution here is probably going to generalize better than the spiky one. So here's what we do: if we've got a bunch of different low bits, then our standard learning rate annealing approach will go downhill, downhill, downhill, downhill, downhill to one spot.

But what we could do instead is use a learning rate schedule that looks like this, which is to say we do a cosine annealing and then suddenly jump up again, do a cosine annealing and then jump up again. And so each time we jump up, it means that if we're in a spiky bit and then we suddenly increase the learning rate, it jumps all the way over to here, and so then we anneal the learning rate down to here, and then we jump up again to a high learning rate, and so on.

So in other words, each time we jump up the learning rate, it means that if it's in a nasty spiky part of the surface, it's going to hop out of the spiky part, and hopefully if we do that enough times, it will eventually find a nice smooth bowl. Could you get the same effect by running multiple iterations from different randomized starting points, so that eventually you explore all possible minima and then compare them?

Yeah so in fact, that's a great question, and before this approach, which is called stochastic gradient descent with restarts was created, that's exactly what people used to do. They used to create these things called ensembles where they would basically relearn a whole new model 10 times in the hope that one of them is going to end up being better.

And so the cool thing about this stochastic gradient descent with restarts is that once we're in a reasonably good spot, each time we jump up the learning rate, it doesn't restart, it actually hangs out in this nice part of the space and then keeps getting better. So interestingly it turns out that this approach where we do this a bunch of separate cosine annealing steps, we end up with a better result than if we just randomly tried a few different starting points.

So it's a super neat trick and it's a fairly recent development, and again almost nobody's heard of it, but I found it's now like my superpower. Using this, along with the learning rate finder, I can get better results than nearly anybody; like in a Kaggle competition, in the first week or two I can jump in, spend an hour or two, and bang, I've got a fantastically good result.

And so this is why I didn't pick the point where it's got the steepest slope, I actually tried to pick something kind of aggressively high, it's still getting down but maybe getting to the point where it's nearly too high. Because when we do this stochastic gradient descent with restarts, this 10^-2 represents the highest number that it uses.

So it goes up to 10^-2 and then goes down, then up to 10^-2 and then down. So if I used a lower learning rate, it wouldn't jump to a different part of the function. In terms of this part here where it's going down, we change the learning rate every single mini-batch.

And then the number of times we reset it is set by the cycle length parameter, so 1 means reset it after every epoch. So if I had 2 there, it would reset it after every 2 epochs. And interestingly, this point that when we do the learning rate annealing we actually change it every single batch turns out to be really critical to making this work, and again it's very different to what nearly everybody in industry and academia has done before.

We're going to come back to that multiple times in this course. So the way this course is going to work is we're going to do a really high-level version of each thing, and then we're going to come back to it in 2 or 3 lessons and then come back to it at the end of the course.

And each time we're going to see more of the math, more of the code, and get a deeper view. We can talk about it also in the forums during the week. Our main goal is to generalize and we don't want to get those narrow optimals. Yeah, that's a very good summary.

So with this method, are we keeping track of the minima and averaging them, and ensembling them? That's another level of sophistication, and indeed you can see there's something here called snapshot ensembles, so we're not doing it in the code right now. But yes, if you wanted to make this generalize even better, you can save the weights here and here and here and then take the average of the predictions.

But for now, we're just going to pick the last one. If you want to skip ahead, there's a parameter called cycle_save_name, which you can add as well as cycle_len, and that will save a set of weights at the end of every learning rate cycle, and then you can ensemble them.

So we've got a pretty decent model here, 99.3% accuracy, and we've gone through a few steps that have taken a minute or two to run. And so from time to time, I tend to save my weights. So if you go learn.save and then pass in a file name, it's going to go ahead and save that for you.

Later on, if you go learn.load, you'll be straight back to where you came from. So it's a good idea to do that from time to time. This is a good time to mention what happens when you do this. When you go learn.save, when you create pre-computed activations, another thing we'll learn about soon, when you create resized images, these are all creating various temporary files.
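In code, the save and load being described are just (fastai 0.7-era API; the file name is an example):

```python
learn.save('224_lastlayer')   # writes the weights under the data folder, in a models/ subdirectory
# ... later, even in a new session, after rebuilding the same learn object:
learn.load('224_lastlayer')   # restores exactly that state
```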

And so what happens is, if we go to data, and we go to dogscats, this is my data folder, and you'll see there's a folder here called tmp, and so this is automatically created, and all of my pre-computed activations end up in here. I mention this because if you're getting weird errors, it might be because you've got some pre-computed activations that were only half completed, or are in some way incompatible with what you're doing.

So you can always go ahead and just delete this tmp, this temporary directory, and see if that causes your error to go away. This is the fast.ai equivalent of turning it off and then on again. You'll also see there's a directory called models, and that's where all of these, when you say .save with a model, that's where that's going to go.

Actually it reminds me, when the Stochastic Gradient Descent with Restarts paper came out, I saw a tweet that was somebody who was like, "Oh, to make your deep learning work better, turn it off and then on again." Is there a question there? "If I want to say I want to retrain my model from scratch again, do I just delete everything from that folder?" If you want to train your model from scratch, there's generally no reason to delete the pre-computed activations because the pre-computed activations are without any training.

That's what the pre-trained model created with the weights that you downloaded off the internet. The only reason you want to delete the pre-computed activations is that there was some error caused by half creating them and crashing or something like that. As you change the size of your input, change different architectures and so forth, they all create different sets of activations with different file names, so generally you shouldn't have to worry about it.

If you want to start training again from scratch, all you have to do is create a new learn object. So each time you go ConvLearner.pretrained, that creates a new object with a new set of weights to train from. Before our break, we'll finish off by talking about fine-tuning and differential learning rates.

So far everything we've done has not changed any of these pre-trained filters. We've used a pre-trained model that already knows how to find at the early stages edges and gradients, and then corners and curves, and then repeating patterns, and bits of text, and eventually eyeballs. We have not re-trained any of those activations, any of those features, or more specifically any of those weights in the convolutional kernels.

All we've done is we've learned some new layers that we've added on top of these things. We've learned how to mix and match these pre-trained features. Now obviously it may turn out that your pictures have different kinds of eyeballs or faces, or if you're using different kinds of images like satellite images, totally different kinds of features altogether.

So if you're training to recognize icebergs, you'll probably want to go all the way back and learn all the way back to different combinations of these simple gradients and edges. In our case as dogs versus cats, we're going to have some minor differences, but we still may find it's helpful to slightly tune some of these later layers as well.

So to tell the learner that we now want to start actually changing the convolutional filters themselves, we simply say unfreeze. So a frozen layer is a layer which is not trained, which is not updated. So unfreeze unfreezes all of the layers. Now when you think about it, it's pretty obvious that layer 1, which is like a diagonal edge or a gradient, probably doesn't need to change by much, if at all.

From the 1.5 million images on ImageNet, it probably already has figured out pretty well how to find edges and gradients. It probably already knows also which kind of corners to look for and how to find which kinds of curves and so forth. So in other words, these early layers probably need little if any learning, whereas these later ones are much more likely to need more learning.

This is universally true regardless of whether you're looking for satellite images of rainforest or icebergs or whether you're looking for cats versus dogs. So what we do is we create an array of learning rates where we say these are the learning rates to use for our additional layers that we've added on top.

These are the learning rates to use in the middle few layers, and these are the learning rates to use for the first few layers. So these are the ones for the layers that represent very basic geometric features. These are the ones that are used for the more complex, sophisticated convolutional features, and these are the ones that are used for the features that we've added and learned from scratch.

So we can create an array of learning rates, and then when we call .fit and pass in an array of learning rates, it's now going to use those different learning rates for different parts of the model. This is not something that we've invented, but I'd also say it's so not that common that it doesn't even have a name as far as I know.
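A sketch of the unfreeze-plus-differential-learning-rates step being described (fastai 0.7-era API; the specific values just follow the roughly 10x spacing between groups mentioned later):

```python
import numpy as np

learn.unfreeze()                      # allow every layer group to be trained, not just the added head
lr = np.array([1e-4, 1e-3, 1e-2])     # early layers, middle layers, newly added layers
learn.fit(lr, 3, cycle_len=1)         # each group now trains with its own learning rate
```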

So we're going to call it differential learning rates. If it actually has a name, or indeed if somebody's actually written a paper specifically talking about it, I don't know. There's a great researcher called Jason Yosinski who did write a paper about the idea that you might want different learning rates and showing why, but I don't think any other library supports it, and I don't know of a name for it.

Having said that though, this ability to unfreeze and then use these differential learning rates I found is the secret to taking a pretty good model and turning it into an awesome model. So just to clarify, you have three numbers there, three hyperparameters. The first one is for the early layers? How many layers are there in your model?

So the short answer is many, many, and they're kind of in groups, and we're going to learn about the architecture. This is called a ResNet, a residual network, and it kind of has ResNet blocks. And so what we're doing is we're grouping the blocks into three groups, so this first number is for the earliest layers, the ones closest to the pixels that represent like corners and edges and gradients.

I thought those layers are frozen at first. They are, right. So we just said unfreeze. Unfreeze. So you're unfreezing them because you have kind of partially trained all the late layers. We've trained our added layers, yes. Now you're retraining the earlier layers. Exactly. I see. And the learning rate is particularly small for the early layers, because you just kind of want to fine-tune and you don't want it to...

We probably don't want to change them at all, but if it does need to, then it can. Thanks. No problem. So using differential learning rates, how is it different from grid search? There's no similarity to grid search. So grid search is where we're trying to find the best hyperparameter for something.

So for example, you could kind of think of the learning rate finder as a really sophisticated grid search, which is like trying lots and lots of learning rates to find which one is best. But this has nothing to do with that. This is actually for the entire training from now on, it's actually going to use a different learning rate for each layer.

So I was wondering, if you have a pre-trained model, then you have to use the same input dimensions, right? Because I was thinking, okay, let's say you have these big machines to train these things and you want to take advantage of it. How would you go about if you have images that are bigger than the ones that they use?

We're going to be talking about sizes later, but the short answer is that with this library and the modern architectures we're using, we can use any size we like. So Jeremy, can we unfreeze just a specific layer? We can. We're not doing it yet, but if you wanted to, you can type learn.freeze_to and pass in a layer number.

Much to my surprise, or at least initially to my surprise, it turns out I almost never need to do that. I almost never find it helpful, and I think it's because using differential learning rates, the optimizer can learn just as much as it needs to. The one place I have found it helpful is if I'm using a really big memory-intensive model and I'm running out of GPU memory; the fewer layers you unfreeze, the less memory it takes and the less time it takes, so it's that kind of practical aspect.

To make sure I ask the question right, can I just unfreeze a specific layer? No, you can only unfreeze layers from layer N onwards. You could probably delve inside the library and unfreeze one layer, but I don't know why you would. So I'm really excited to be showing you guys this stuff, because it's something we've been kind of researching all year is figuring out how to train state-of-the-art models.

And we've kind of found these tiny number of tricks. And so once we do that, we now go learn.fit, and you can see, look at this, we get right up to like 99.5% accuracy, which is crazy. There's one other trick you might see here that as well as using stochastic gradient descent with restarts, i.e.

cycle length = 1, we've done 3 cycles. Earlier on I lied to you: I said this is the number of epochs; it's actually the number of cycles. So if you said cycle length = 2, it would do 3 cycles of two epochs each. So here I've said do 3 cycles, yet somehow it's done 7 epochs.

The reason why is I've got one last trick to show you, which is cycle_mult = 2. And to tell you what that does, I'm simply going to show you the picture. If I go learn.sched.plot_lr, there it is. Now you can see what cycle_mult = 2 is doing. It's doubling the length of the cycle after each cycle.
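Here is a small numpy sketch (not lecture code) of the schedule that plot is showing: with cycle_len=1 and cycle_mult=2, three cycles last 1, 2 and 4 epochs, which is the 7 epochs seen above, each cycle being one half-cosine from the maximum learning rate down towards zero:

```python
import numpy as np

lr_max, iters_per_epoch = 1e-2, 100
cycle_len, schedule = 1, []
for _ in range(3):                                   # 3 cycles in total
    n = cycle_len * iters_per_epoch
    t = np.arange(n)
    schedule.append(0.5 * lr_max * (1 + np.cos(np.pi * t / n)))
    cycle_len *= 2                                   # cycle_mult = 2: each cycle is twice as long
schedule = np.concatenate(schedule)
print(len(schedule) / iters_per_epoch)               # 7.0 epochs = 1 + 2 + 4
```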

And so in the paper that introduced this stochastic gradient descent with restarts, the researcher kind of said hey, this is something that seems to sometimes work pretty well, and I've certainly found that often to be the case. So basically, intuitively speaking, if your cycle length is too short, then it kind of starts going down to find a good spot and then it pops out.

And it goes down to try and find a good spot and pops out, and it never actually gets to find a good spot. So earlier on, you want it to do that because it's trying to find the bit that's smoother. But then later on you want it to do more exploring, and then more exploring.

So that's why this cycle_mult = 2 thing often seems to be a pretty good approach. So suddenly we're introducing more and more hyperparameters, having told you there aren't that many. But the reason is that you can really get away with just picking a good learning rate, but then adding these extra tweaks really helps get that extra level up without any effort.

And so in practice, I find this kind of 3 cycles starting at 1, mult = 2 works very, very often to get a pretty decent model. If it doesn't, then often I'll just do 3 cycles of length 2 with no mult. There's kind of like 2 things that seem to work a lot.

There's not too much fiddling, I find, necessary. As I say, even if you use this line every time, I'd be surprised if you didn't get a reasonable result. Is there a question here? Why do smoother surfaces correlate with more generalized networks? So it's this intuitive explanation I tried to give back here.

Which is that if you've got something spiky, and so what this x-axis is showing is how good is this at recognizing dogs versus cats as you change this particular parameter? And so for something to be generalizable, it means that we want it to work when we give it a slightly different data set.

And so a slightly different data set may have a slightly different relationship between this parameter and how catty versus doggy it is. It may instead look a little bit like this. So in other words, if we end up at this point, then it's not going to do a good job on this slightly different data set, whereas if we end up on this other point, it's still going to do a good job.

So that's why we want to end up in these flatter parts of the surface. So we've got one last thing before we're going to take a break, which is we're now going to take this model, which has 99.5% accuracy, and we're going to try to make it better still. And what we're going to do is we're not actually going to change the model at all, but instead we're going to look back at the original visualization we did where we looked at some of our incorrect pictures.

Now what I've done is I've printed out the whole of these incorrect pictures, but the key thing to realize is that when we do the validation set, all of our inputs to our model all the time have to be square. The reason for that is it's kind of a minor technical detail, but basically the GPU doesn't go very quickly if you have different dimensions for different images because it needs to be consistent so that every part of the GPU can do the same thing.

I think this is probably fixable, but for now that's the state of the technology we have. So for our validation set, when we actually say for this particular thing whether it's a dog, what we actually do to make it square is we just pick out the square in the middle. So we would take off its two edges; we take the whole height and then as much of the middle as we can.

And so you can see in this case we wouldn't actually see this dog's head. So I think the reason this was actually not correctly classified was because the validation set only got to see the body, and the body doesn't look particularly dog-like or cat-like, it's not at all sure what it is.

So what we're going to do when we calculate the predictions for our validation set is we're going to use something called test time augmentation. And what this means is that every time we decide is this cat or a dog, not in the training but after we train the model, is we're going to actually take four random data augmentations.

And remember the data augmentations move around and zoom in and out and flip. So we're going to take four of them at random and we're going to take the original un-augmented center crop image and we're going to do a prediction for all of those. And then we're going to take the average of those predictions.

So we're going to say is this a cat, is this a cat, is this a cat, is this a cat. And so hopefully in one of those random ones we actually make sure that the face is there, zoomed in by a similar amount to other dog's faces at sea and it's rotated by the amount that it expects to see it and so forth.

And so to do that, all we have to do is just call tta, tta stands for test time augmentation. This term of what do we call it when we're making predictions from a model we've trained, sometimes it's called inference time, sometimes it's called test time, everybody seems to have a different name.

So tta. And so when we do that, we go learn.TTA, check the accuracy, and lo and behold we're now at 99.65%, which is kind of crazy. Where's our green box? But for every epoch we're only showing one type of augmentation of a particular image, right? So when we're training back here, we're not doing any tta.

So you could, and sometimes I've written libraries where after each epoch I run tta to see how well it's going. But that's not what's happening here. I trained the whole thing with training time augmentation, which doesn't have a special name because that's what we mean. When we say data augmentation, we mean training time augmentation.

So here every time we showed a picture, we were randomly changing it a little bit. So each epoch, each of these seven epochs, it was seeing slightly different versions of the picture. Having done that, we now have a fully trained model, we then said okay, let's look at the validation set.

So tta by default uses the validation set and said okay, what are your predictions of which ones are cats and which ones are dogs. And it did four predictions with different random augmentations, plus one on the unaugmented version, averaged them all together, and that's what we got, and that's what we got the accuracy from.
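In code, this looks roughly like the following in the 2018-era library (a sketch; TTA returns log-probabilities for the original plus the augmented versions, which we average ourselves here rather than relying on any helper function):

```python
import numpy as np

log_preds, y = learn.TTA()              # predictions on the validation set: original + augmented versions
probs = np.mean(np.exp(log_preds), 0)   # average the class probabilities across those versions
acc = (np.argmax(probs, axis=1) == y).mean()
print(acc)                              # accuracy of the averaged predictions
```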

So is there a high probability of having a sample in tta that was not shown during training? Yeah, actually every data augmented image is unique because the rotation could be like 0.034 degrees and zoom could be 1.0165. So every time it's slightly different. Okay, thank you. No problem. Who's behind you?

What about, why not use white padding or something like that? White padding? Just put like a white border around it. Oh, padding's not... yeah, so there's lots of different types of data augmentation you can do, and so one of the things you can do is to add a border around it.

Basically, adding a border around it in my experiments doesn't help; it doesn't make it any less cat-like, and the convolutional neural network doesn't seem to find it very interesting. One thing that I do do, we'll see later, is something called reflection padding, which is where I add borders that are just the outside of the image reflected; it's a way to make somewhat bigger images.

It works well with satellite imagery in particular, but in general I don't do a lot of padding, instead I do a bit of zooming. Let's kind of follow up to that last one, but rather than cropping, just add white space because when you crop you lose the dog's face, but if you added white space you wouldn't have.

Yeah, so that's where the reflection padding or the zooming or whatever can help. So there are ways in the fastai library when you do custom transforms of making that happen. I find that it kind of depends on the image size, but generally speaking it seems that using TTA plus data augmentation, the best thing to do is to try to use as large an image as possible.

So if you kind of crop the thing down and put white borders on top and bottom, it's now quite a lot smaller. And so to make it as big as it was before, you now have to use more GPU, and if you're going to use all that more GPU you could have zoomed in and used a bigger image.

So in my playing around that doesn't seem to be generally as successful. There is a lot of interest on the topic of how do you do that augmentation in older than images, in data that is not images? No one seems to know. I asked some of my friends in the natural language processing community about this, and we'll get to natural language processing in a couple of lessons, it seems like it would be really helpful.

There have been a very, very few examples of papers where people would try replacing synonyms, for instance, but on the whole an understanding of appropriate data augmentation for non-image domains is under-researched and underdeveloped. The question was, couldn't we just use a sliding window to generate all the images? So in that dog picture, couldn't we generate three parts of that? Wouldn't that be better?

Yeah, for TTA you mean? Just in general, when you're creating your training data. For training time, I would say no, that wouldn't be better, because we're not going to get as much variation. We want to have it one degree off, five degrees off, ten pixels up, lots of slightly different versions, and so if you just have three standard ways, then you're not giving it as many different ways of looking at the data.

For test time augmentation, having fixed crop locations I think probably would be better, and I just haven't gotten around to writing that yet. I have a version in an older library; I think having fixed crop locations plus random contrast, brightness and rotation changes might be better. The reason I haven't gotten around to it yet is because in my testing it didn't seem to help in practice very much and it made the code a lot more complicated, so it's an interesting question.

"I just want to know how this fast AI API is that you're using, is it open source?" Yeah, that's a great question. The fast AI library is open source, and let's talk about it a bit more generally. The fact that we're using this library is kind of interesting and unusual, and it sits on top of something called PyTorch.

So PyTorch is a fairly recent development, and I've noticed all the researchers that I respect pretty much are now using PyTorch. I found in part 2 of last year's course that a lot of the cutting edge stuff I wanted to teach I couldn't do it in Keras and TensorFlow, which is what we used to teach with, and so I had to switch the course to PyTorch halfway through part 2.

The problem was that PyTorch isn't very easy to use; you have to write your own training loop from scratch. Basically, if you write everything from scratch, all the stuff you see inside the fast.ai library is what we would have had to write just to learn. And so it really makes it very hard to learn deep learning when you have to write hundreds of lines of code to do anything.

So we decided to create a library on top of PyTorch because our mission is to teach world class deep learning. So we wanted to show you how you can be the best in the world at doing X, and we found that a lot of the world class stuff we needed to show really needed PyTorch, or at least with PyTorch it was far easier, but then PyTorch itself just wasn't suitable as a first thing to teach with for new deep learning practitioners.

So we built this library on top of PyTorch, initially heavily influenced by Keras, which is what we taught last year. But then we realized we could actually make things much easier than Keras. So in Keras, if you look back at last year's course notes, you'll find that all of the code is 2-3 times longer, and there's lots more opportunities for mistakes because there's just a lot of things you have to get right.

So we ended up building this library in order to make it easier to get into deep learning, but also easier to get state-of-the-art results. And then over the last year as we started developing on top of that, we started discovering that by using this library, it made us so much more productive that we actually started developing new state-of-the-art results and new methods ourselves, and we started realizing that there's a whole bunch of papers that have kind of been ignored or lost, which when you use them it could semi-automate stuff, like learning rate finder that's not in any other library.

So I kind of got to the point where now not only is kind of fast.ai lets us do things much easier than any other approach, but at the same time it actually has a lot more sophisticated stuff behind the scenes than anything else. So it's kind of an interesting mix.

So we've released this library, at this stage it's like a very early version, and so through this course, by the end of this course I hope as a group a lot of people are already helping have developed it into something that's really pretty stable and rock-solid. And anybody can then use it to build your own models under an open-source license, as you can see it's available on GitHub.

Behind the scenes it's creating PyTorch models, and so PyTorch models can then be exported into various different formats. Having said that, a lot of folks, if you want to do something on a mobile phone, for example, you're probably going to need to use TensorFlow. And so later on in this course, we're going to show how some of the things that we're doing in the fast.ai library you can do in Keras and TensorFlow so you can get a sense of what the different libraries look like.

Generally speaking, the simple stuff will take you a small number of days to learn to do it in Keras and TensorFlow versus fast.ai and PyTorch. And the more complex stuff often just won't be possible. So if you need it to be in TensorFlow, you'll just have to simplify it often a little bit.

I think the more important thing to realize is every year, the libraries that are available and which ones are the best totally change. So the main thing I hope that you get out of this course is an understanding of the concepts, like here's how you find a learning rate, here's why differential learning rates are important, here's how you do learning rate annealing, here's what stochastic gradient descent with restarts does, so on and so forth.

Because by the time we do this course again next year, the library situation is going to be different again. Yes, there's a question: I was wondering if you have an opinion on Pyro, which is Uber's new release. I haven't looked at it, no; I'm very interested in probabilistic programming, and it's really cool that it's built on top of PyTorch.

So one of the things we'll learn about in this course is we'll see that PyTorch is much more than just a deep learning library, it actually lets us write arbitrary GPU accelerated algorithms from scratch, which we're actually going to do, and Pyro is a great example of what people are now doing with PyTorch outside of the deep learning world.

Great let's take an 8 minute break and we'll come back at 7.55. So 99.65% accuracy, what does that mean? In classification, when we do classification in machine learning, a really simple way to look at the result of a classification is what's called the confusion matrix. This is not just deep learning, but any kind of classifier in machine learning where we say what was the actual truth, there were a thousand cats and a thousand dogs, and of the thousand actual cats, how many did we predict were cats?

This is obviously in the validation sets, this is the images that we didn't use to train with. It turns out there were 998 cats that we actually predicted as cats and 2 that we got wrong. And then for dogs, there were 995 that we predicted were dogs and then 5 that we got wrong.

So these confusion matrices can often be helpful, particularly if you've got four or five classes and you're trying to see which group you're having the most trouble with. You can see it uses color-coding to highlight the large cells, and ideally the diagonal is the highlighted section.
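As a rough sketch of how you could compute one yourself, assuming you already have the model's validation-set predictions (the names log_preds and y below are illustrative, not part of the lesson notebook):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# log_preds: per-class log-probabilities for the validation set, y: the true labels
preds = np.argmax(log_preds, axis=1)   # predicted class for each validation image
cm = confusion_matrix(y, preds)        # rows are actual classes, columns are predictions
print(cm)                              # e.g. [[998   2]
                                       #       [  5 995]]
```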

So now that we've retrained the model, it can be quite helpful to look back and see which ones in particular were incorrect. And we can see here there were actually only two incorrect cats; it prints out four by default, but these two have probabilities less than 0.5, so they weren't actually wrong.

So it's actually only these two that were wrong cats: this one isn't obviously a cat at all, and this one is, but it looks like it's got a lot of weird artifacts and you can't see its eyeballs at all. And then here are the five wrong dogs, or four of them; that one's not obviously a dog.

That looks like a mistake, that looks like a mistake, and that one I guess doesn't have enough information, but it's probably a mistake. So we've done a pretty good job here of creating a good classifier. Based on entering a lot of Kaggle competitions and comparing my results to various research papers,

I can tell you it's a state-of-the-art classifier, right up there with the best in the world. We're going to make it a little bit better in a moment, but here are the basic steps. So if you want to create a world class image classifier, the steps we just went through were: turn data augmentation on by passing aug_tfms=, where you use either the side-on or the top-down transforms depending on what your images look like.

Start with precompute=True, find a decent learning rate, and then train for just one or two epochs, which takes a few seconds because we've got precompute=True. Then we turn precompute off, which allows us to use data augmentation, and do another two or three epochs, generally with cycle_len=1.

Then I unfreeze all the layers, I then set the earlier layers to be somewhere between 3 times to 10 times lower learning rate than the previous. As a rule of thumb, knowing that you're starting with a pre-trained ImageNet model, if you can see that the things that you're now trying to classify are pretty similar to the kinds of things in ImageNet, i.e.

pictures of normal objects in normal environments, you probably want about a 10x difference because you think that the earlier layers are probably very good already. Whereas if you're doing something like satellite imagery or medical imaging, which is not at all like ImageNet, then you probably want to be training those earlier layers a lot more so you might have just a 3x difference.

So the one change I make is to use either a 10x or a 3x ratio. Then after unfreezing, you can call lr_find again. I actually didn't in this case, but once you've unfrozen all the layers and turned on differential learning rates, you can call lr_find again.

You can then check whether the point you picked last time still looks about right. Something to note is that if you call lr_find having set differential learning rates, what it's actually going to plot is the learning rate of the last layers; because you've got three different learning rates, it's only showing you the last one.
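To make those steps concrete, here's a minimal sketch of the whole recipe, assuming the fastai (0.7-era) API used in this lesson and a PATH laid out like the dogs-and-cats data; the learning rates and epoch counts are just illustrative:

```python
import numpy as np
from fastai.conv_learner import *  # 0.7-era imports, as in the lesson notebooks

arch, sz = resnet34, 224
tfms  = tfms_from_model(arch, sz, aug_tfms=transforms_side_on)   # 1. enable data augmentation
data  = ImageClassifierData.from_paths(PATH, tfms=tfms)
learn = ConvLearner.pretrained(arch, data, precompute=True)      # 2. start with precompute=True

learn.lr_find()                                                  # 3. find a decent learning rate
learn.fit(1e-2, 2)                                               # 4. 1-2 epochs on precomputed activations

learn.precompute = False                                         # 5. turn precompute off...
learn.fit(1e-2, 3, cycle_len=1)                                  # 6. ...so augmentation works; 2-3 epochs

learn.unfreeze()                                                 # 7. unfreeze all the layers
lrs = np.array([1e-4, 1e-3, 1e-2])                               # 8. differential learning rates, ~10x apart
learn.lr_find(lrs / 1000)                                        #    optionally re-check the learning rate
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)                     # 9. train until overfitting or out of time
```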

So then I train the full network with cycle_mult=2 until either it starts overfitting or I run out of time. So let me show you. Let's do this again for a totally different dataset. So this morning, I noticed that some of you on the forums were playing around with this playground Kaggle competition, very similar, called dog breed identification.

So the dog breed identification Kaggle challenge is one where you don't actually have to decide which ones are cats and which ones are dogs, they're all dogs, but you have to decide what kind of dog it is. There are 120 different breeds of dogs. So obviously this could be different types of cells in pathology slides, it could be different kinds of cancers in CT scans, it could be different kinds of icebergs and satellite images, whatever, as long as you've got some kind of labeled images.

So I want to show you what I did this morning; it took me about an hour to go end to end on something I'd never seen before. So I downloaded the data from Kaggle, and I'll show you how to do that shortly, but the short answer is there's something called Kaggle CLI, which is a GitHub project you can search for, and if you read the docs, you basically run kg download, provide the competition name, and it will grab all the data for you onto your Crestle or Amazon or whatever instance.

I put it in my data folder and I then went LS and I saw that it's a little bit different to our previous dataset. It's not that there's a train folder which has a separate folder for each kind of dog, but instead it turned out there was a CSV file.

And the CSV file, I read it in with pandas, so pandas is the thing we use in Python to do structured data analysis like CSV files, so pandas we call pd, that's pretty much universal, pd.read_csv reads in a CSV file, we can then take a look at it and you can see that basically it had some kind of identifier and then the breed.

So this is like a different way, this is the second main way that people kind of give you image labels. One is to put different images into different folders, the second is generally to give you some kind of file like a CSV file to tell you here's the image name and here's the label.

So what I then did was use pandas again to create a pivot table which basically groups it up, just to see how many of each breed there were, and I sorted them. And so I saw they've got about 100 of some of the more common breeds, and some of the less common breeds have like 60 or so.
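Here's a sketch of that with pandas; labels.csv is the file Kaggle provides, and PATH is assumed to point at wherever you downloaded it:

```python
import pandas as pd

label_csv = f'{PATH}labels.csv'          # the Kaggle-provided CSV of id,breed pairs
label_df = pd.read_csv(label_csv)
label_df.head()                          # an identifier column and a breed column

# count how many images there are of each breed, most common first
label_df.pivot_table(index='breed', aggfunc=len).sort_values('id', ascending=False)
```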

Altogether there were 120 rows, so there must have been 120 different breeds represented. So I'm going to go through the steps. First, enable data augmentation: when we call tfms_from_model, you just pass in aug_tfms, and in this case I chose the side-on transforms; again, these are pictures of dogs and so on, so they're side-on photos.

We'll talk about max_zoom in more detail later, but it basically says that when we do the data augmentation, we zoom in by up to 1.1 times: randomly between 1 (the original image size) and 1.1 times. So it's not always cropping out the middle or an edge; it could be cropping out a smaller part.

Having done that, the key step now is that rather than using from_paths, which tells the library that the names of the folders are the labels, we use from_csv and pass in the CSV file that contains the labels.

So we're passing in the path that contains all of the data, the name of the folder that contains the training data, the CSV that contains the labels. We need to also tell it where the test set is if we want to submit to Kaggle later, talk more about that next week.

Now this time, the previous dataset I had actually separated a validation set out into a separate folder, but in this case you'll see that there is not a separate folder called validation. So we want to be able to track how good our performance is locally, so we're going to have to separate some of the images out to put it into a validation set.

So I do that at random, and so up here you can see I've basically opened up the CSV file, turned it into a list of rows, and then taken the length of that minus 1, because there's a header at the top. And so that's the number of rows in the CSV file, which must be the number of images that we have.

And then this is a fastai thing, get_cv_idxs (get cross-validation indexes); we'll talk about cross-validation later. But basically if you call this and pass in a number, it's going to return by default a random 20% of the row indexes to use as your validation set, and you can pass in parameters to get different amounts.

So this is now going to grab 20% of the data and say this is the indexes, the numbers of the files which we're going to use as a validation set. So now that we've got that, let's run this so you can see what that looks like. So val_indexes is just a big bunch of numbers, and so n is 10,000, and so about 20% of those is going to be in the validation set.

So when we call from_csv, we can pass in a parameter which tells it which indexes to treat as the validation set, so let's pass in those indexes. One thing that's a little bit tricky here is that the file names actually have a .jpg on the end, and these IDs obviously don't.

So when you call from_csv, you can pass in a suffix that says the labels don't actually contain the full file names, and you need to add this to them. That's basically all I need to do to set up my data. And as a lot of you noticed during the week, inside that data object you can get access to the training dataset by saying trn_ds, and inside trn_ds is a whole bunch of things, including the file names.
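Putting that together, here's a sketch of the data setup described above, assuming the fastai (0.7-era) API, the folder layout of the Kaggle download, and arch, sz and bs defined as before:

```python
n = len(list(open(label_csv))) - 1        # number of rows in the CSV, minus the header
val_idxs = get_cv_idxs(n)                 # a random 20% of the row indexes, for validation

tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_csv(PATH, 'train', label_csv, test_name='test',
                                    val_idxs=val_idxs, suffix='.jpg',   # file names are <id>.jpg
                                    tfms=tfms, bs=bs)
```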

So trn_ds.fnames contains the file names of everything in the training set, and here's an example of one. So I can now go ahead and open that file and take a look at it. The next thing I did was to try to understand what my dataset looks like, and I found an adorable puppy, so that was very nice.

So I'm feeling good about this. I also want to know how big are these files, like how big are the images, because that's a key issue. If they're huge, I'm going to have to think really carefully about how to deal with huge images, that's really challenging. If they're tiny, well that's also challenging.

Most ImageNet models are trained on either 224x224 or 299x299 images, so any time your images are in that kind of range, that's really encouraging; you're probably not going to have to do too much differently. In this case, the first image I looked at was about the right size, so I'm thinking it's looking pretty hopeful.

So what I did then is I created a dictionary comprehension. Now if you don't know about list comprehensions and dictionary comprehensions in Python, go study them. They're the most useful thing, super handy. You can see the basic idea here is that I'm going through all of the files, and I'm creating a dictionary that maps the name of the file to the size of that file.

Then there's a handy little Python feature, which I'll let you learn about during the week if you don't know it, called zip, and using this special star notation it's going to take this dictionary and turn it into the rows and the columns. So I can turn those into NumPy arrays, and here are the first five row sizes of my images.
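Here's a sketch of that size check; the file-name attribute is data.trn_ds.fnames in the library version used here, so check yours if it differs, and PATH is as before:

```python
import numpy as np
from PIL import Image

# dictionary comprehension: file name -> (width, height) of that image
size_d = {f: Image.open(PATH + f).size for f in data.trn_ds.fnames}

# zip(*...) unpacks the (width, height) pairs into two separate sequences
row_sz, col_sz = list(zip(*size_d.values()))
row_sz, col_sz = np.array(row_sz), np.array(col_sz)
row_sz[:5]     # the first five sizes, just to eyeball them
```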

And then Matplotlib is something you want to be very familiar with if you do any kind of data science or machine learning in Python. Matplotlib we always refer to as plt. This is a histogram, and so I've got a histogram of how many rows there are in each image, i.e. how tall they are.

So you can see here I'm kind of getting a sense. Before I start doing any modeling, I kind of need to know what I'm modeling with. And I can see some of the images are going to be like 2500-3000 pixels high, but most of them seem to be around 500.

So given that so few of them were bigger than 1000, I used standard NumPy slicing to just grab those that are smaller than 1000 and histogram that, just to zoom in a little bit. And I can see here it looks like the vast majority are around 500. And so this actually also prints out the histogram, so I can actually go through and I can see here 4500 of them are about 450.
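The histograms themselves are one line each with matplotlib, assuming the row_sz array from above:

```python
import matplotlib.pyplot as plt

plt.hist(row_sz)                    # distribution of image sizes across the training set
plt.hist(row_sz[row_sz < 1000])     # zoom in on just the images smaller than 1000 pixels
```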

So Jeremy, how many images should we get in the validation set, is it always 20%? So the size of the validation set, using 20% is fine unless you're kind of feeling like my data set is really small, I'm not sure that's enough. Basically think of it this way, if you train the same model multiple times and you're getting very different validation set results and your validation set is kind of small, like smaller than 1000 or so, then it's going to be quite hard to interpret how well you're doing.

This is particularly true if you care about the third decimal place of accuracy: with 1,000 things in your validation set, a single image changing class is what you're looking at. So it really depends on how much difference you care about. I would say in general, at the point where you care about the difference between 0.01 and 0.02, the second decimal place, you want that difference to represent 10 or 20 rows changing class; then it's something you can be pretty confident of.

So most of the time, given the data sizes we normally have, 20% seems to work fine. It depends a lot on specifically what you're doing and what you care about. And it's not a deep learning specific question either. So those who are interested in this kind of thing, we're going to look into it in a lot more detail in our machine learning course, which will also be available online.

So I did the same thing for the columns just to make sure that these aren't super wide, and I got similar results and checked in and again found that 400-500 seems to be about the average size. So based on all of that, I thought this looks like a pretty normal kind of image dataset that I can probably use pretty normal kinds of models on.

I was also particularly encouraged to see, when I looked at the dog, that the dog takes up most of the frame, so I'm not too worried about cropping problems. If the dog were just a tiny little piece in one corner, then I'd be thinking about doing things differently, maybe zooming in a lot more or something.

Like in medical imaging, that happens a lot, like often the tumor or the cell or whatever is like one tiny piece and that's much more complex. So based on all that, this morning I kind of thought, okay, this looks pretty standard. So I went ahead and created a little function called getData that basically had my normal two lines of code in it.

But I made it so I could pass in a size and a batch size. The reason for this is that when I start working with a new dataset, I want everything to go super fast. And so if I use small images, it's going to go super fast. So I actually started out with size=64, just to create some super small images that just go like a second to run through and see how it went.
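A sketch of that helper, under the same assumptions as before (the name get_data just follows the lesson):

```python
def get_data(sz, bs):
    # rebuild the data object at a given image size and batch size
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
    return ImageClassifierData.from_csv(PATH, 'train', label_csv, test_name='test',
                                        val_idxs=val_idxs, suffix='.jpg',
                                        tfms=tfms, bs=bs)

data = get_data(64, 64)   # start small so each experiment takes seconds, not minutes
```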

Later on, I started using some big images and also some bigger architectures, at which point I started running out of GPU memory, so I started getting these errors saying CUDA out of memory error. When you get a CUDA out of memory error, the first thing you need to do is go kernel restart.

Once you get an out of memory error on your GPU, you can't really recover from it. It doesn't matter what you do, you have to restart. And once I restarted, I then just changed my batch size to something smaller. So when you create your data object, you can pass in a batch size parameter.

And I normally use 64 until I hit something that says out of memory, and then I just halve it. And if I still get out of memory, I'll just halve it again. So that's where I created this, to allow me to start making my sizes bigger as I looked into it more and as I started running out of memory to decrease my batch size.

So at this point I went through a couple of iterations of this, but I basically found everything was working fine. Once it was, I set the size to 224 and created my learner with precompute=True. The first time I did that it took a minute or so to create the precomputed activations, and then it ran through this in about 4 or 5 seconds, and you can see I was getting 83% accuracy.

Now remember accuracy means it's exactly right, and so it's predicting out of 120 categories. So when you see something with 2 classes is 80% accurate versus something with 120 classes is 80% accurate, they're very different levels. So when I saw 83% accuracy with just a pre-computed classifier, no data augmentation, no unfreezing, anything else across 120 classes, oh this looks good.

So then I just kept going through our little standard process. So then I turned pre-compute off, and cycle length equals 1, and I started doing a few more cycles, a few more epochs. So remember an epoch is 1 passed through the data, and a cycle is however many epochs you said is in a cycle.

It's the learning rate going from the top value you asked for all the way down to 0. Since here cycle_len=1, a cycle and an epoch are the same. So I tried a few epochs; I did actually run the learning rate finder, and I found 1e-2 again looked fine, it often does.

And I found it kept improving, so I tried 5 epochs and found my accuracy getting better. So then I saved that and tried something which we haven't looked at before, but it's kind of cool. If you train something on a smaller size, you can then actually call learn.set_data() and pass in a larger-sized dataset.

And that's going to take your model, however it's trained so far, and it's going to let you continue to train on larger images. And I'll tell you something amazing. This actually is another way you can get state-of-the-art results, and I've never seen this written in any paper or discussed anywhere as far as I know this is a new insight.

Basically I've got a pre-trained model, which in this case I've trained a few epochs with a size of 224x224, and I'm now going to do a few more epochs with a size of 299x299. Now I've got very little data kind of by deep learning standards, I've only got 10,000 images.

So with 224x224 I kind of built these final layers to find things that worked well at 224x224. When I go to 299x299, if I was overfitting before, I'm definitely not going to be overfitting now. I've changed the size of my images, so they're kind of totally different, but conceptually they're still the same kinds of pictures of the same kinds of things.

So I found this trick of starting training on small images for a few epochs and then switching to bigger images and continuing training is an amazingly effective way to avoid overfitting. And it's so easy and so obvious, I don't understand why it's never been written about before, maybe it's in some paper somewhere and I haven't found it, but I haven't seen it.
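A sketch of that trick, assuming the get_data helper above; learn.set_data swaps in the new data object but keeps the weights trained so far, and the epoch counts are illustrative:

```python
learn = ConvLearner.pretrained(arch, get_data(224, bs), precompute=True)
learn.fit(1e-2, 2)                       # quick start on precomputed activations
learn.precompute = False
learn.fit(1e-2, 5, cycle_len=1)          # a few epochs at 224x224 with augmentation

learn.set_data(get_data(299, bs))        # swap in bigger images; the trained weights are kept
learn.freeze()                           # still only training the last layers
learn.fit(1e-2, 3, cycle_len=1)          # continue training at 299x299
```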

Would it be possible to do the same thing using, let's say, Keras or TensorFlow as well, to feed in images of different sizes? Yeah, I think so. As long as you use one of these more modern architectures, what we call fully convolutional architectures, which means not VGG (and you'll see we don't use VGG in this course because it doesn't have this property), most of the architectures developed in the last couple of years can handle pretty much arbitrary sizes, so it would be worth trying.

I think it ought to work. So I call get_data again; remember get_data is just the little function I created back up here, so I just pass a different size to it. And I call freeze just to make sure that everything except the last layer is frozen; actually it already was at this point, so that didn't really do anything.

You can see now with precompute off, I've got data augmentation working, so I run a few more epochs. And what I notice here is that my validation set loss is a lot lower than my training set loss.

This is still just training the last layers. So what this is telling me is that I'm underfitting, and if I'm underfitting, it means cycle_len=1 is too short: it's finding something better, popping out, and never getting a chance to zoom in properly. So then I set cycle_mult=2 to give it more time, so the first cycle is 1 epoch, the second one is 2 epochs, the third one is 4 epochs, and you can see now the validation and training losses are about the same.

So that's kind of thinking yeah, this is about the right track. And so then I tried using test time augmentation to see if that gets any better still, it didn't actually help a hell of a lot, just a tiny bit. And just at this point I'm thinking this is nearly done, so I just did one more cycle of 2 to see if it got any better, and it did get a little bit better.
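A sketch of those last two steps; the numbers are illustrative, and accuracy_np is the library's NumPy accuracy helper in the version used here (or you can compute accuracy yourself with NumPy):

```python
import numpy as np

learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2)   # cycles of 1, 2 and 4 epochs: 7 epochs in total

log_preds, y = learn.TTA()                      # predictions averaged over augmented versions of each image
probs = np.mean(np.exp(log_preds), 0)
accuracy_np(probs, y)
```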

And then I'm like okay, that looks pretty good; I've got a validation set loss of 0.199. And you'll notice here I actually haven't tried unfreezing. The reason is that when I tried unfreezing and training more, it didn't get any better. And the reason for this, clearly, is that this data set is so similar to ImageNet that training those convolutional layers doesn't help in the slightest.

And actually when I later looked into it, it turns out that this competition is actually using a subset of ImageNet. So then if we check this out, 0.199 against the leaderboard, this is only a playground competition so it's not like the best here, but it's still interesting, it gets us somewhere around 10th or 11th.

In fact, we're competing against, I notice, a fast.ai student; and these people up here, I know, actually posted that they cheated: they went and downloaded the original images and trained on those. This is why it's a playground competition, it's not real, it's just to allow us to try things out.

We basically see out of 200 and something people, we're getting some very good results without doing anything remotely interesting or clever, and we haven't even used the whole data set, we've only used 80% of it. To get a better result, I would go back and remove that validation set and just re-run the same steps and then submit that, let's just use 100% of the data.

I have three questions; the first one is that the classes in this case are not balanced. It's not totally balanced, but it's not bad: it's like between 60 and 100 per class, so it's not unbalanced enough that I would give it a second thought. We'll get to that later in this course, and don't let me forget.

The short answer is that a paper came out about two or three weeks ago on this, and it said the best way to deal with very unbalanced data sets is basically to make copies of the rare cases. My second question is, I want to pin down the difference between precompute and unfreeze; you have these two options here.

So when you are beginning to add data augmentation, you set precompute=True, but in that case the layers are still frozen. And not only are they frozen, they're precomputed, so the data augmentation doesn't do anything at that point. When you set precompute=True, but before you unfreeze everything, what exactly does it do?

Does it only precompute the activations? So we're going to learn more about the details as we look into the math in coming lessons, but basically what happened was we started with a pre-trained network, which was finding activations that capture these kinds of rich features. And then we add a couple of layers on the end of it, which start out random.

And so with everything frozen, and indeed with pre-compute equals true, all we're learning is those couple of layers that we've added. And so with pre-compute equals true, we actually pre-calculate how much does this image have something that looks like this eyeball and looks like this face and so forth.

And therefore data augmentation doesn't do anything with pre-compute equals true, because we're actually showing exactly the same activations each time. We can then set pre-compute equals false, which means it's still only training those last two layers that we added, it's still frozen, but data augmentation is now working because it's actually going through and recalculating all of the activations from scratch.

And then finally when we unfreeze, that's actually saying okay, now you can go ahead and change all of these earlier convolutional filters. The only reason to have pre-compute equals true is it's just much faster. It's about 10 or more times faster, so particularly if you're working with quite a large data set, it can save quite a bit of time, but there's no accuracy reason ever to use pre-compute equals true.

So it's just a shortcut. It's also quite handy if you're throwing together a quick model; it only takes a few seconds to train. If your question is, is there some shorter version of this that's a bit quicker and easier, I could delete a few things here.

I think this is kind of the minimal version that will still get you a very good result: don't worry about precompute=True, because that's just saving a little bit of time. I would still suggest using lr_find at the start to find a good learning rate.

By default, everything is frozen from the start, so you can just go ahead and run 2 or 3 epochs with cycle_len=1, unfreeze, and then train the rest of the network with differential learning rates. So it's basically three steps: learning rate finder; train the frozen network with cycle_len=1; then train the unfrozen network with differential learning rates and cycle_mult=2.
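That minimal version, sketched under the same assumptions as before, really is only a handful of lines:

```python
import numpy as np

learn = ConvLearner.pretrained(arch, data)        # no precompute; layers are frozen by default
learn.lr_find()                                   # 1. pick a learning rate from the plot

learn.fit(1e-2, 3, cycle_len=1)                   # 2. train the frozen network

learn.unfreeze()
lrs = np.array([1e-4, 1e-3, 1e-2])                #    differential learning rates
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)      # 3. train the unfrozen network
```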

So that's something you could turn into, I guess, 5 or 6 lines of code total. By reducing the batch size, does it only affect the speed of training? Yeah, pretty much. So each batch, and again we're going to see all this stuff about pre-compute and batch size as we dig into the details of the algorithm, it's going to make a lot more sense intuitively.

But basically if you're showing it less images each time, then it's calculating the gradient with less images, which means it's less accurate, which means knowing which direction to go and how far to go in that direction is less accurate. So as you make the batch size smaller, you're basically making it more volatile.

It kind of impacts the optimal learning rate that you would need to use, but in practice I generally find I'm only dividing the batch size by 2 or 4, and it doesn't seem to change things very much. Should I reduce the learning rate accordingly? If you change the batch size by much, you can re-run the learning rate finder to see if the answer has changed, but since we're generally only looking at the nearest power of 10, it's probably not going to change things enough to matter.

This is sort of a conceptual and basic question, going back to the previous slide where you showed... Could you repeat that into the microphone? Sorry, yeah, this is more of a conceptual and basic question, going back to your previous slide where you showed what the different layers were doing.

So in this slide, I understand the meaning of the third column relative to the fourth column is that you're interpreting what the layer is doing based on what images actually trigger that layer. Yeah, so we're going to look at this in more detail. So these grey ones basically say this is kind of what the filter looks like.

So on the first layer you can see exactly what the filter looks like because the input to it are pixels, so you can absolutely say and remember we looked at what a convolutional kernel was, like a 3x3 thing. So these look like they're 7x7 kernels, you can say this is actually what it looks like.

But later on, the inputs are themselves activations, which are combinations of activations, which are combinations of activations. So you can't draw them directly, but there's a clever technique that Zeiler and Fergus created which allowed them to say this is roughly what the filters tended to look like on average; so this is kind of what the filters looked like.

And then here are specific examples of image patches which activated that filter highly. The pictures are the ones I find more useful, because they tell you that this kernel is, say, a unicycle-wheel finder. Well, we may come back to that, if not in this part then in the next part.

Probably in part 2, actually, because to create these visualizations this paper uses something called a deconvolution, which I'm pretty sure we won't do in this part, but we will in part 2. So if you're interested, check out the paper; there's a link to it in the notebook, Zeiler and Fergus.

It's a very clever technique and not terribly intuitive. So you mentioned that it was good that the dog took up the full picture and it would have been a problem if it was kind of like off in one of the corners and really tiny. What would your technique have been to try to make that work?

Something that we'll learn about in part 2, but basically there's a technique that allows you to figure out roughly which parts of an image are most likely to have the interesting things in them, and then you can crop out those bits. If you're interested in learning about it, we did cover it briefly in lesson 7 of part 1, but I'm going to actually do it properly in part 2 of this course because I didn't really cover it thoroughly enough.

Maybe we'll find time to have a quick look at it, but we'll see. I know Yannet has written some of the code that we need already. So once I have something like this notebook basically working, I can immediately make it better by doing two things, assuming the image size I was using is smaller than the average size of the images we've been given.

I can increase the size, and as I showed before with the dog breeds, you can actually increase it during training. The other thing I can do is to use a better architecture. We're going to talk a lot in this course about architectures, but basically there are different ways of putting together what size convolutional filters and how they're connected to each other and so forth.

Different architectures have different numbers of layers and sizes of kernels and number of filters and so forth. The one that we've been using, ResNet-34, is a great starting point and often a good finishing point because it doesn't have too many parameters, often it works pretty well with small amounts of data as we've seen and so forth.

But there's actually an architecture that I really like called, not ResNet, but ResNeXt, which was the second-place winner in last year's ImageNet competition. Like ResNet, you can put a number after ResNeXt to say how big it is, and my next step after ResNet-34 is always ResNeXt-50. You'll find ResNeXt-50 can take twice as long as ResNet-34, and 2 to 4 times as much memory.

So what I wanted to do was rerun that previous notebook with ResNeXt and increase the image size to 299. So here I just set the architecture to ResNeXt-50 and the size to 299, and then I found that I had to take the batch size all the way back to 28 to get it to fit.
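In code, the only change is at the top of the notebook; resnext50 is the identifier the library version used here provides for this architecture (check yours if it differs), and PATH is the dogs-and-cats folder from before:

```python
arch = resnext50      # a bigger, more modern architecture than resnet34
sz = 299              # bigger images too
bs = 28               # reduced until it fit in GPU memory

tfms  = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data  = ImageClassifierData.from_paths(PATH, tfms=tfms, bs=bs)
learn = ConvLearner.pretrained(arch, data, precompute=True)
```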

My GPU is 11GB; if you're using AWS or Crestle, I think they're like 12GB, so you might be able to make it a bit higher, but this is what I found I had to do. So then this is literally a copy of the previous notebook, so you can actually go File, Make a Copy, and then rerun it with these different parameters.

And I deleted some of the prose and some of the exploratory stuff, and basically everything else is the same, all the same steps as before; in fact you can kind of see what the minimum set of steps looks like. I didn't need to worry about the learning rate finder, so I just left it as is.

Transforms, data, learner, fit; precompute off, fit with cycle_len=1; unfreeze, differential learning rates, fit some more. And you can see here I didn't do the cycle_mult thing, because I found that now that I'm using a bigger architecture with more parameters, it was overfitting pretty quickly.

So rather than cycle_len=1 never finding the right spot, it actually did find the right spot, and if I used longer cycle lengths I found my validation error was higher than my training error: it was overfitting. So check this out though: using these three steps plus TTA, I got 99.75%.

So what does that mean? That means I have one incorrect dog and four incorrect cats, and when we look at the pictures of them, my incorrect dog actually has a cat in it, and this one isn't really one either, so I've actually got maybe one genuine mistake; and then one of the incorrect ones is just teeth. So we're at a point where we can train a classifier so good that it basically makes one mistake.

And so when people say we have super-human image performance now, this is kind of what they're talking about. So when I looked at the dog breed one I did this morning, it was getting the dog breeds much better than I ever could. So this is what we can get to if you use a really modern architecture like ResNext.

And this only took, I don't know, 20 minutes to train. So that's kind of where we're up to. So if you wanted to do satellite imagery instead, it's the same thing. And in fact the Planet satellite data set is already on Crestle, so if you're using Crestle you can jump straight there.

And I just linked it into data/planet, and I can do exactly the same thing: ImageClassifierData.from_csv. You can see these three lines are exactly the same as my dog breed lines, how many lines are in the file, grab my validation indexes, and this get_data; as you can see it's identical except I've changed side-on to top-down.

The satellite images are taken top-down, so I can flip them vertically and they still make sense. And so you can see here I'm doing this trick where I start with size=64 and train a little bit, first running the learning rate finder. And interestingly, in this case you can see I want really high learning rates.

I don't know what it is about this particular dataset that makes this true, but clearly I can use super high learning rates. So I used a learning rate of 0.2 and trained for a while with differential learning rates; and remember I said that if the dataset is very different to ImageNet, I probably want to train those earlier layers a lot more, so I'm dividing by 3 rather than by 10.

Other than that, it's the same thing: cycle_mult=2, and then I was just keeping an eye on it. You can actually plot the loss with learn.sched.plot_loss(), and you can see here is the first cycle, here's the second cycle, here's the third cycle, so you can see how it's working.
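A sketch of that, with the 3x spacing and the loss plot; lr=0.2 is what worked on this dataset, and yours may well differ:

```python
import numpy as np

lr = 0.2
lrs = np.array([lr/9, lr/3, lr])                # 3x steps, since satellite data is unlike ImageNet

learn.unfreeze()
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)
learn.sched.plot_loss()                         # loss over training; each bump is the start of a new cycle
```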

It's better, it pops out. It's better, it pops out. It's better, it pops out. And each time it finds something better than the last time. Then set the size up to 128, and just repeat exactly the last few steps. And then set it up to 256, and repeat the last two steps.
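And the resizing steps look roughly like this, reusing the same get_data helper (with the top-down transforms this time); the exact sequence of fits is illustrative:

```python
learn.set_data(get_data(128, bs))               # up to 128x128, then repeat the last couple of steps
learn.freeze()
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
learn.unfreeze()
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)

learn.set_data(get_data(256, bs))               # then 256x256, and repeat again
learn.freeze()
learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
learn.unfreeze()
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)
```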

And then do TTA, and if you submit this, then this gets about 30th place in this competition. So these basic steps work super well. This thing where I went all the way back to a size of 64, I wouldn't do that if I was doing like dogs and cats or dog breeds, because this is so small that if the thing I was working on was very similar to ImageNet, I would kind of destroy those ImageNet weights.

Like 64 by 64 is so small, but in this case the satellite imagery data is so different to ImageNet. I really found that it worked pretty well to start right back to these tiny images. It really helped me to avoid overfitting. And interestingly, using this kind of approach, I actually found that even with using only 128 by 128, I was getting much better Kaggle results than nearly everybody on the leaderboard.

And when I say 30th place, this is a very recent competition. And so I find in the last year, a lot of people have got a lot better at computer vision. And so the people in the top 50 in this competition were generally ensembling dozens of models, lots of people on a team, lots of pre-processing specific satellite data and so forth.

So to be able to get 30th using this totally standard technique is pretty cool. So now that we've got to this point, we've got through two lessons. If you're still here, then hopefully you're thinking okay, this is actually pretty useful, I want to do more, in which case Cressel might not be where you want to stay.

Crestle is pretty handy and pretty cheap, but something we haven't talked about much is that Paperspace is another great choice, by the way. Paperspace is shortly going to be releasing Crestle-like instant Jupyter notebooks; unfortunately they're not quite ready yet, but they basically have the best price/performance ratio right now, and you can SSH into them and use them.

So they're also a great choice, and probably by the time this is a MOOC, we'll probably have a separate lesson showing you how to set up PaperSpace because they're likely to be a great option. But at some point you're probably going to want to look at AWS, a couple of reasons why.

The first is, as you all know by now, Amazon has been kind enough to donate about $200,000 worth of compute time to this course. So I want to say thank you very much to Amazon: everybody here has been given credits. Thanks very much to AWS. Sorry if you're on the MOOC, we couldn't get them for you, but AWS credits for everybody here.

But even if you're not here in person, you can get AWS credits from lots of places. GitHub has a student pack (google "GitHub Student Pack"), which is like $150 worth of credits. AWS Educate also gives credits; these are all for students. So there are lots of places you can get started on AWS.

Pretty much everybody, a lot of the people that you might work with will be using AWS because it's super flexible. Right now AWS has the fastest available GPUs that you can get in the cloud, the P3s. They're kind of expensive at 3 bucks an hour, but if you've got a model where you've done all the steps before and you're thinking this is looking pretty good, for 6 bucks you could get a P3 for 2 hours and run at turbo speed.

We didn't start with AWS because, A, its cheapest GPU is twice as expensive as Crestle, and B, it takes some setup. I wanted to go through and show you how to get your AWS setup working, so we're going to go slightly over time to do that, but I'll show you very quickly, so feel free to go if you have to.

But I want to show you very quickly how you can get your AWS setup from scratch. Basically you have to go to console.aws.amazon.com and it will take you to the console. You can follow along on the video with this because I'm going to do it very quickly. From here you have to go to EC2, this is where you set up your instances.

And so from EC2 you need to do what's called launching an instance. So launching an instance means you're basically creating a computer, you're creating a computer on Amazon. So I say launch instance, and what we've done is we've created a fast AMI, an AMI is like a template for how your computer is going to be created.

So if you go to Community AMIs and type in fastai, you'll see there's one there called fastai part 1 version 2 for the P2. So I'm going to select that, and we need to say what kind of computer we want. I can say I want a GPU compute instance, and then I can say I want a p2.xlarge.

This is the cheapest reasonably effective instance type they have for deep learning. And then I can say review and launch, and then launch. And at this point, they ask you to choose a key pair. Now if you don't have a key pair, you have to create one.

So to create a key pair, you need to open your terminal. If you've got a Mac or a Linux box, you've definitely got one. If you've got Windows, hopefully you've got Ubuntu. If you don't already have Ubuntu set up, you can go to the Windows Store and click on Ubuntu.

So from there, you basically go ssh-keygen, and that will create like a special password for your computer to be able to log in to Amazon. And then you just hit enter three times, and that's going to create for you your key that you can use to get into Amazon.

So then what I do is copy that key somewhere I know where to find it; it will be in the .ssh folder, and it's called id_rsa.pub. And so I'm going to copy it to my hard drive. If you're on a Mac or on Linux, it will already be in an easy-to-find place, your .ssh folder; here let's put it in Documents.

So from there, back in AWS, you have to tell it that you've created this key. So you can go to keypairs, and you say import keypair, and you just browse to that file that you just created. There it is, I say import. So if you've ever used ssh before, you've already got the keypair, you don't have to do those steps.

If you've used AWS before, you've already imported it, you don't have to do that step. If you haven't done any of those things, you have to do both steps. So now I can go ahead and launch my instance, community-amis-search-fastai-select-launch. And so now it asks me where's your keypair, and I can choose that one that I just grabbed.

So this is going to go ahead and create a new computer for me to log into. And you can see here, it says the following have been initiated. So if I click on that, it will show me this new computer that I've created. So to be able to log into it, I need to know its IP address.

So here it is, the IP address there. So I can copy that, and that's the IP address of my computer. So to get to this computer, I need to SSH to it. So SSH into a computer means connecting to that computer so that it's like you're typing that computer.

So I type SSH, and the username for this instance is always Ubuntu. And then I can paste in that IP address, and then there's one more thing I have to do, which is I have to connect up the Jupyter notebook on that instance to the Jupyter notebook on my machine.

And so to do that, there's just a particular flag that I set. We can talk on the forums about exactly what it does, but you just type -L 8888:localhost:8888. Once you've done it once, you can save that as an alias and type the same thing every time.

So we can check here, we can see it says that it's running. So we should be able to now hit enter. First time ever we connect to it, it just checks, this is OK. I'll say yes, and then that goes ahead and SSH is in. So this AMI is all set up for you.

So you'll find that the very first time you log in, it takes a few extra seconds because it's getting everything set up. But once it's logged in, you'll see there that there's a directory called fastai. And the fastai directory contains our fastai repo that contains all the notebooks, all the code, etc.

So I can just go cd fastai. First thing you do when you get in is to make sure it's updated. So you just go git pull, and that updates to make sure that your repo is the same as the most recent repo. And so as you can see, there we go, let's make sure it's got all the most recent code.

The second thing you should do is type conda env update. You can do this maybe once a month or so, and it makes sure you have all the most recent libraries. I'm not going to run that because it takes a couple of minutes. And then the last step is to type jupyter notebook.

So this is going to go ahead and launch the Jupyter Notebook server on this machine. And the first time I do it, the first time you do everything on AWS, it just takes like a minute or two. And then once you've done it in the future, it will be just as fast as running it locally.

So you can see it's going ahead and firing up the notebook. And so what's going to happen is that because when we SSH into it, we said to connect our notebook port to the remote notebook port, we're just going to be able to use this locally. So you can see it says here, copy/paste this URL.

So I'm going to grab that URL, and I'm going to paste it into my browser, and that's it. So this notebook is now actually not running on my machine. It's actually running on AWS using the AWS GPU, which has got a lot of memory. It's not the fastest around, but it's not terrible.

You can always fire up a P3 if you want something that's super fast. This is costing me 90 cents a minute. So when you're finished, please don't forget to shut it down. To shut it down, you can right-click on it and choose Instance State, then Stop. We've got $500 of credit each, assuming you've put your details down in the spreadsheet.

One thing I forgot to do: the first time I showed you this, I said make sure you choose a P2; the second time I went through, I didn't choose the P2 by mistake, so just don't forget to choose GPU compute, p2.xlarge. Do you have a question? Ah, it's 90 cents an hour, thank you, 90 cents an hour.

It also costs 3 or 4 bucks a month for the storage as well. Thanks for checking that. Alright, see you next week. Sorry we ran a little over.