back to indexLesson 2: Deep Learning 2018
Chapters
0:0 Introduction
2:40 Learning Rate
7:35 Learning Rate Finder
15:50 Data Augmentation
17:58 Transform Sets
29:24 Accuracy
30:16 Learning rate annealing
40:34 Saving your model
43:50 Finetuning
46:39 Differential Learning Rates
00:00:00.000 |
Okay, so welcome back to deep learning lesson 2. 00:00:07.560 |
Last week we got to the point where we had successfully trained a pretty accurate image 00:00:16.200 |
And so just to remind you about how we did that, can you guys see okay? 00:00:23.440 |
Actually we can't turn the front lights off, can you guys all see the screen, okay? 00:00:33.280 |
That pitch is all into darkness, but if that works then... 00:01:02.880 |
So just to remind you, the way we built this image classifier was we used a small amount 00:01:09.080 |
of code, basically three lines of code, and these three lines of code pointed at a particular 00:01:19.240 |
And so the key thing for this to know how to train this model was that this path, which 00:01:24.840 |
was data, dogs, cats, had to have a particular structure, which is that it had a train folder 00:01:32.320 |
and a valid folder, and in each of those train and valid folders there was a cats folder 00:01:36.960 |
and a dogs folder, and each of the cats and the dogs folders was a bunch of images of 00:01:43.000 |
So this is like a pretty standard, it's one of two main structures that are used to say 00:01:49.400 |
here is the data that I want you to train an image model from. 00:01:53.600 |
So I know some of you during the week went away and tried different data sets where you 00:01:59.320 |
had folders with different sets of images in and created your own image classifiers. 00:02:04.960 |
And generally that seems to be working pretty well from what I can see on the forums. 00:02:08.520 |
So to make it clear, at this point this is everything you need to get started. 00:02:15.160 |
So if you create your own folders with different sets of images, a few hundred or a few thousand 00:02:24.000 |
in each folder, and run the same three lines of code, that will give you an image classifier 00:02:31.840 |
and you'll be able to see this third column tells you how accurate it is. 00:02:37.900 |
So we looked at some kind of simple visualizations to see what was it uncertain about, what was 00:02:48.200 |
it wrong about, and so forth, and that's always a really good idea. 00:02:54.520 |
And then we learned about the one key number you have to pick. 00:02:57.840 |
So this number here is the one key number, this 0.01, and this is called the learning 00:03:04.320 |
So I wanted to go over this again, and we'll learn about the theory behind what this is 00:03:10.720 |
during the rest of the course in quite a lot of detail, but for now I just wanted to talk 00:03:17.000 |
We're going to talk about the other ones shortly, so the main one we're going to look at for 00:03:41.880 |
now is the last column, which is the accuracy. 00:03:46.520 |
The first column as you can see is the epoch number, so this tells us how many times has 00:03:51.960 |
it been through the entire data set trying to learn a better classifier, and then the 00:03:57.360 |
next two columns is what's called the loss, which we'll be learning about either later 00:04:03.480 |
The first one is the loss on the training set, these are the images that we're looking 00:04:07.200 |
at in order to try to make a better classifier. 00:04:09.760 |
The second is the loss on the validation set, these are the images that we're not looking 00:04:13.680 |
at when we're training, but we're just setting them aside to see how accurate we are. 00:04:17.920 |
So we'll learn about the difference between loss and accuracy later. 00:04:29.760 |
So we've got the epoch number, the training loss is the second column, the validation 00:04:35.160 |
loss is the third column and the accuracy is the fourth column. 00:04:46.000 |
So the basic idea of the learning rate is it's the thing that's going to decide how 00:05:04.520 |
And so I find that a good way to think about this is to think about what if we were trying 00:05:09.120 |
to fit to a function that looks something like this, we're trying to say whereabouts 00:05:19.520 |
This is basically what we do when we do deep learning, is we try to find the minimum point 00:05:26.840 |
Now our function happens to have millions or hundreds of millions of parameters, but 00:05:32.440 |
And so when we look at it, we can immediately see that the lowest point is here, but how 00:05:39.440 |
would you do that if you were a computer algorithm? 00:05:41.880 |
And what we do is we start out at some point at random, so we pick say here, and we have 00:05:47.680 |
a look and we say what's the loss or the error at this point, and we say what's the gradient, 00:05:53.640 |
in other words which way is up and which way is down. 00:05:56.880 |
And it tells us that down is going to be in that direction, and it also tells us how fast 00:06:02.000 |
is it going down, which at this point is going down pretty quickly. 00:06:07.440 |
And so then we take a step in the direction that's down, and the distance we travel is 00:06:13.440 |
going to be proportional to the gradient, it's going to be proportional to how steep 00:06:17.480 |
The idea is if it's deeper, then we're probably further away, that's the general idea. 00:06:23.640 |
And so specifically what we do is we take the gradient, which is how steep is it at 00:06:27.320 |
this point, and we multiply it by some number, and that number is called the learning rate. 00:06:32.120 |
So if we pick a number that is very small, then we're guaranteed that we're going to 00:06:38.360 |
go a little bit closer and a little bit closer and a little bit closer each time. 00:06:42.000 |
But it's going to take us a very long time to eventually get to the bottom. 00:06:47.640 |
If we pick a number that's very big, we could actually step too far, we could go in the 00:06:52.680 |
right direction, but we could step all the way over to here as a result of which we end 00:06:58.160 |
up further away than we started, and we could oscillate and it gets worse and worse. 00:07:03.520 |
So if you start training a neural net and you find that your accuracy or your loss is 00:07:08.440 |
like spitting off into infinity, almost certainly your learning rate is too high. 00:07:14.520 |
So in a sense, learning rate too low is a better problem to have because you're going 00:07:20.800 |
to have to wait a long time, but wouldn't it be nice if there was a way to figure out 00:07:27.000 |
Something where you could quickly go like, boom, boom, boom. 00:07:32.000 |
And so that's why we use this thing called a learning rate finder. 00:07:36.800 |
And what the learning rate finder does is it tries, each time it looks at another, remember 00:07:43.320 |
Minibatch is a few images that we look at each time so that we're using the parallel 00:07:50.920 |
We look generally at around 64 or 128 images at a time. 00:07:55.600 |
For each minibatch, which is labeled here as an iteration, we gradually increase the 00:08:01.400 |
In fact, multiplicatively increase the learning rate. 00:08:02.920 |
We start at really tiny learning rates to make sure that we don't start at something 00:08:10.920 |
And so the idea is that eventually the learning rate will be so big that the loss will start 00:08:17.680 |
So what we're going to do then is look at the plot of learning rate against loss. 00:08:24.920 |
So when the learning rate's tiny, it increases slowly, then it starts to increase a bit faster, 00:08:30.960 |
and then eventually it starts not increasing as quickly and in fact it starts getting worse. 00:08:36.400 |
So clearly here, make sure you're going to be familiar with this scientific notation. 00:08:42.720 |
So 10^-1 is 0.1, 10^-2 is 0.01, and when we write this in Python, we'll generally write 00:08:54.520 |
Rather than writing 10^-1 or 10^-2, we'll just write 1e1 or 1e2. 00:09:02.760 |
They mean the same thing, you're going to see that all the time. 00:09:17.200 |
So don't be confused by this text that it prints out here. 00:09:22.400 |
This loss here is the final loss at the end, it's not of any interest. 00:09:28.640 |
So ignore this, this is only interesting when we're doing regular training, not interesting 00:09:34.760 |
The thing that's interesting for the learning rate finder is this learn.shed.plot, and specifically 00:09:40.840 |
we're not looking for the point where it's the lowest, because the point where it's the 00:09:43.960 |
lowest is actually not getting better anymore, so that's too high a learning rate. 00:09:47.840 |
So I generally look to see where is it the lowest, and then I go back like 1 over magnitude. 00:09:59.400 |
So that's why you saw when we ran our fit here, we picked 0.01, which is 1e1-2. 00:10:10.640 |
So an important point to make here is this is the one key number that we've learned to 00:10:16.520 |
adjust, and if you just adjust this number and nothing else, most of the time you're 00:10:25.760 |
And this is like a very different message to what you would hear or see in any textbook 00:10:31.120 |
or any video or any course, because up until now there's been like dozens and dozens, they're 00:10:38.680 |
called hyperparameters, dozens and dozens of hyperparameters to set, and they've been 00:10:42.560 |
thought of as highly sensitive and difficult to set. 00:10:45.320 |
So inside the first AI library, we kind of do all that stuff for you as much as we can. 00:10:52.320 |
And during the course, we're going to learn that there are some more we can break to get 00:10:55.680 |
slightly better results, but it's kind of like, it's kind of in a funny situation here 00:11:02.560 |
because for those of you that haven't done any deep learning before, it's kind of like 00:11:06.320 |
oh, this is, that's all there is to it, this is very easy, and then when you talk to people 00:11:11.040 |
outside this class, they'll be like deep learning is so difficult, there's so much to set, it's 00:11:15.000 |
a real art form, and so that's why there's this difference. 00:11:19.040 |
And so the truth is that the learning rate really is the key thing to set, and this ability 00:11:23.200 |
to use this trick to figure out how to set it, although the paper is now probably 18 00:11:29.360 |
months old, almost nobody knows about this paper, it was from a guy who's not from a famous 00:11:35.200 |
research lab, so most people kind of ignored it, and in fact even this particular technique 00:11:39.400 |
was one sub-part of a paper that was about something else. 00:11:43.240 |
So again, this idea of this is how you can set the learning rate, really nobody outside 00:11:48.320 |
this classroom just about knows about it, obviously the guy who wrote it, Leslie Smith 00:11:52.800 |
knows about it, so it's a good thing to tell your colleagues about, it's like here is actually 00:11:59.240 |
a great way to set the learning rate, and there's even been papers called, like one 00:12:03.520 |
of the famous papers is called No More Pesky Learning Rates, which actually is a less effective 00:12:09.080 |
technique than this one, but this idea that like setting learning rates is very difficult 00:12:13.760 |
and fiddly has been true for most of the kind of deep learning history. 00:12:19.440 |
So here's the trick, look at this plot, find kind of the lowest, go back about a multiple 00:12:24.760 |
of 10 and try that, and if that doesn't quite work you can always try going back another 00:12:30.520 |
multiple of 10, but this has always worked for me so far. 00:13:00.160 |
So we're going to learn during this course about a number of ways of improving gradient 00:13:04.400 |
percent, like you mentioned momentum and atom and so forth. 00:13:10.360 |
So one of the things the class AI library tries to do is figure out the right gradient 00:13:15.080 |
percent version, and in fact behind the scenes this is actually using something called atom. 00:13:19.760 |
And so this technique is telling us this is the best learning rate to use, given what 00:13:25.240 |
other tweaks you're using in this case, the atom optimizer. 00:13:29.000 |
So it's not that there's some compromise between this and some other approaches, this sits 00:13:33.320 |
on top of those approaches, and you still have to set the learning rate when you use 00:13:38.160 |
So we're trying to find the best kind of optimizer to use for a problem, but you still have to 00:13:42.440 |
set the learning rate and this is how we can do it. 00:13:45.320 |
And in fact this idea of using this technique on top of more advanced optimizers like atom 00:13:50.080 |
I haven't even seen mentioned in a paper before, so I think this is not a huge breakthrough, 00:13:55.600 |
it seems obvious, but nobody else seems to have tried it. 00:14:02.360 |
When we use optimizers like atom which have like adaptive learning rates, so when we set 00:14:10.880 |
this learning rate, is it like initial learning rate because it changes during the epoch? 00:14:19.480 |
So we're going to be learning about things like atom, the details about it later in the 00:14:25.320 |
class, but the basic answer is no, even with atom there actually is a learning rate, it's 00:14:35.160 |
being basically divided by the average previous gradient and also the recent sum of squares 00:14:43.580 |
So there's still like a number called the learning rate. 00:14:46.160 |
There isn't even these so-called dynamic learning rate methods that still have a learning rate. 00:14:56.720 |
So the most important thing that you can do to make your model better is to give it more 00:15:09.720 |
So the challenge that happens is that these models have hundreds of millions of parameters, 00:15:16.280 |
and if you train them for a while they start to do what's called overfitting. 00:15:21.200 |
And so overfitting means that they're going to start to see like the specific details 00:15:25.400 |
of the images you're giving them rather than the more general learning that can transfer 00:15:33.960 |
So the best thing we can do to avoid overfitting is to find more data. 00:15:38.800 |
Now obviously one way to do that would just be to collect more data from wherever you're 00:15:41.880 |
getting it from or label more data, but a really easy way that we should always do is 00:15:51.040 |
So data augmentation is one of these things that in many courses it's not even mentioned 00:15:56.580 |
at all or if it is it's kind of like an advanced topic right at the end, but actually it's 00:16:00.520 |
like the most important thing that you can do to make a better model. 00:16:05.080 |
So it's built into the fast.io library to make it very easy to do. 00:16:08.600 |
And so we're going to look at the details of the code shortly, but the basic idea is 00:16:13.680 |
that in our initial code we had a line that said image_classifier.data.from_paths and we 00:16:22.840 |
passed in the path to our data, and for transforms we passed in basically the size and the architecture. 00:16:32.620 |
We just add one more parameter which is what kind of data augmentation do you want to do. 00:16:39.120 |
And so to understand data augmentation, it's maybe easiest to look at some pictures of 00:16:46.120 |
So what I've done here, again we'll look at the code in more detail later, but the basic 00:16:50.080 |
idea is I've built a data class multiple times, I'm going to do it six times, and each time 00:17:03.120 |
And you can see that what happens is that this cat here is further over to the left, 00:17:08.120 |
and this one here is further over to the right, and this one here is flipped horizontally, 00:17:13.800 |
So data augmentation, different types of image are going to want different types of data 00:17:22.000 |
So for example, if you were trying to recognize letters and digits, you wouldn't want to flip 00:17:28.040 |
horizontally because it actually has a different meaning. 00:17:30.880 |
Whereas on the other hand, if you're looking at photos of cats and dogs, you probably don't 00:17:36.440 |
want to flip vertically because cats aren't generally upside down. 00:17:41.000 |
Whereas if you were looking at the current Kaggle competition which is recognizing icebergs 00:17:47.640 |
in satellite images, you probably do want to flip them upside down because it doesn't 00:17:52.320 |
really matter which way around the iceberg or the satellite was. 00:17:56.480 |
So one of the examples of the transform sets we have is "transforms side on". 00:18:03.200 |
So in other words, if you have photos that are generally taken from the side, which generally 00:18:07.360 |
means you want to be able to flip them horizontally but not vertically, this is going to give 00:18:12.680 |
So it will flip them sideways, rotate them by small amounts but not too much, slightly 00:18:18.400 |
bury their contrast and brightness, and slightly zoom in and out a little bit and move them 00:18:25.320 |
So each time it's a slightly different, slightly different edge. 00:18:28.760 |
I'm getting a couple of questions from people about, could you explain again the reason 00:18:36.360 |
why you don't take the minimum of the loss curve but a slightly higher rate? 00:18:42.360 |
Also, could people understand if this works for every CNN or for CNN or for every internet? 00:18:51.600 |
Could you put your hand up if there's a spare seat next to you? 00:19:11.100 |
So there was a question about the learning rate finder about why do we use the learning 00:19:18.160 |
And so the reason why is to understand what's going on with this learning rate finder. 00:19:25.120 |
So let's go back to our picture here of how do we figure out what learning rate to use. 00:19:32.980 |
And so what we're going to do is we're going to take steps and each time we're going to 00:19:38.320 |
double the learning rate, so double the amount by which we're multiplying the gradient. 00:19:44.000 |
So in other words, we'd go tiny step, slightly bigger, slightly bigger, slightly bigger, 00:19:49.360 |
slightly bigger, slightly bigger, slightly bigger. 00:19:57.040 |
And so the purpose of this is not to find the minimum, the purpose of this is to figure 00:20:02.200 |
out what learning rate is allowing us to decrease quickly. 00:20:08.000 |
So the point at which the loss was lowest here is actually there, but that learning 00:20:13.920 |
rate actually looks like it's probably too high, it's going to just jump probably backwards 00:20:20.940 |
So instead what we do is we go back to the point where the learning rate is giving us 00:20:31.200 |
So here is the actual learning rate increasing every single time we look at a new minibatch, 00:20:38.520 |
so minibatch or iteration versus learning rate. 00:20:43.600 |
So here's that point at the bottom where it was now already too high, and so here's the 00:20:48.920 |
point where we go back a little bit and it's increasing nice and quickly. 00:20:54.880 |
We're going to learn about something called stochastic gradient descent with restarts 00:20:59.200 |
shortly where we're going to see, in a sense you might want to go back to 1e3 where it's 00:21:03.660 |
actually even steeper still, and maybe we would actually find this will actually learn 00:21:10.400 |
even quicker, you could try it, but we're going to see later why actually using a higher 00:21:14.620 |
number is going to give us better generalization. 00:21:20.480 |
So as we increase the iterations in the learning rate finder, the learning rate is going up. 00:21:39.280 |
So as we do that, as the learning rate increases and we plot it here, the loss goes down until 00:21:46.400 |
we get to the point where the learning rate is too high. 00:21:49.440 |
And at that point the loss is now getting worse. 00:21:51.480 |
Because I asked the question because you were just indicating that even though the minimum 00:21:56.040 |
was at 10^-1, you suggested we should choose 10^-2, but now you're saying maybe we should 00:22:06.200 |
I didn't mean to say that, I'm sorry if I said something backwards, so I want to go 00:22:14.080 |
So possibly I said higher when I meant higher in this lower learning rate. 00:22:23.680 |
Last class you said that all the local minima are the same and this graph also shows the 00:22:31.520 |
Is that something that was observed or is there a theory behind it? 00:22:39.720 |
This graph is simply showing that there's a point where if we increase the learning 00:22:43.260 |
rate more, then it stops getting better and it actually starts getting worse. 00:22:47.840 |
The idea that all local minima are the same is a totally separate issue and it's actually 00:22:57.400 |
something we'll see a picture of shortly, so let's come back to that. 00:23:32.920 |
Later on in this class, we're going to learn about unfreezing layers, and after I unfreeze 00:23:41.160 |
If I do something to change the thing I'm training or change the way I'm training it, 00:23:50.920 |
Particularly if you've changed something about how you train unfreezing layers, which we're 00:23:54.800 |
going to soon learn about, and you're finding the other training is unstable or too slow, 00:24:14.600 |
When we run this little transforms from model function, we pass in augmentation transforms, 00:24:21.920 |
we can pass in the main two, a transform side on or transforms top down. 00:24:27.160 |
Later on we'll learn about creating your own custom transform lists as well, but for now 00:24:31.980 |
because we're taking pictures from the side of cats and dogs, we'll say transform side 00:24:37.760 |
Now each time we look at an image, it's going to be zoomed in or out a little bit, moved 00:24:42.480 |
around a little bit, rotated a little bit, possibly flipped. 00:24:49.720 |
What this does is it's not exactly creating new data, but as far as the convolutional 00:24:54.880 |
neural net is concerned, it's a different way of looking at this thing, and it actually 00:24:59.440 |
therefore allows it to learn how to recognize cats or dogs from somewhat different angles. 00:25:07.360 |
So when we do data augmentation, we're basically trying to say based on our domain knowledge, 00:25:14.240 |
here are different ways that we can mess with this image that we know still make it the 00:25:19.360 |
same image, and that we could expect that you might actually see that kind of image 00:25:26.260 |
So what we can do now is when we call this from_parts function, which we'll learn more 00:25:31.600 |
about shortly, we can now pass in this set of transforms which actually have these augmentations 00:25:41.700 |
So we're going to start from scratch here, we do a fit, and initially the augmentations 00:25:52.720 |
And the reason initially they don't do anything is because we've got here something that says 00:25:58.160 |
We're going to go back to this lots of times. 00:26:03.640 |
But basically what this is doing is do you remember this picture we saw where we learned 00:26:08.680 |
each different layer has these activations that basically look for anything from the 00:26:15.120 |
middle of flowers to eyeballs of birds or whatever. 00:26:21.760 |
And so literally what happens is that the later layers of this convolutional neural 00:26:28.840 |
network have these things called activations. 00:26:32.440 |
Activation is a number that says this feature, like eyeball of birds, is in this location 00:26:42.420 |
with this level of confidence, with this probability. 00:26:46.240 |
And so we're going to see a lot of this later. 00:26:48.560 |
But what we can do is we can say, in this we've got a pre-trained network, and a pre-trained 00:26:56.360 |
network is one where it's already learned to recognize certain things. 00:26:59.640 |
In this case it's learned to recognize the 1.5 million images in the ImageNet dataset. 00:27:05.320 |
And so what we could do is we could take the second last layer, so the one which has got 00:27:11.920 |
all of the information necessary to figure out what kind of thing a thing is, and we 00:27:18.320 |
So basically saving things saying there's this level of eyeballness here, and this level 00:27:23.600 |
of dog's face-ness here, and this level of fluffy ear there, and so forth. 00:27:28.520 |
And so we save for every image these activations, and we call them the pre-computed activations. 00:27:36.700 |
And so the idea is now that when we want to create a new classifier which can basically 00:27:42.960 |
take advantage of these pre-computed activations, we can very quickly train a simple linear model 00:27:54.760 |
And so that's what happens when we say pre-compute = true. 00:27:58.120 |
And that's why, you may have noticed this week, the first time that you run a new model, it 00:28:08.040 |
Whereas you saw when I ran it, it took like 5 or 10 seconds, it took you a minute or two, 00:28:12.520 |
and that's because it had to pre-compute these activations, it just has to do that once. 00:28:17.800 |
If you're using your own computer or AWS, it just has to do it once ever. 00:28:22.960 |
If you're using Cressel, it actually has to do it once every single time you rerun Cressel 00:28:30.320 |
because Cressel just for these pre-computed activations, it uses a special little kind 00:28:35.640 |
of scratch space that disappears each time you restart your Cressel instance. 00:28:40.680 |
So other than special Cressel, generally speaking, you just have to run it once ever for a dataset. 00:28:49.560 |
So the issue with that is that since we're pre-computed for each image, how much does 00:28:56.040 |
it have an ear here and how much does it have a lizard's eyeball there and so forth? 00:29:01.520 |
That means that data augmentations don't work. 00:29:04.280 |
In other words, even though we're trying to show it a different version of the cat each 00:29:07.180 |
time, we've pre-computed the activations for a particular version of that cat. 00:29:13.160 |
So in order to use data augmentation, we just have to go learn.precompute=false, and then 00:29:24.000 |
And so you can see here that as we run more APOCs, the accuracy isn't particularly getting 00:29:32.680 |
The good news is that you can see the train loss, this is like a way of measuring the 00:29:39.800 |
error of this model, although that's getting better, the error's going down, the validation 00:29:45.080 |
error isn't going down, but we're not overfitting. 00:29:49.840 |
And overfitting would mean that the training loss is much lower than the validation loss. 00:29:56.160 |
We're going to talk about that a lot during this course, but the general idea here is 00:30:00.080 |
if you're doing a much better job on the training set than you are on the validation set, that 00:30:08.240 |
So we're not at that point, which is good, but we're not really improving. 00:30:13.740 |
So we're going to have to figure out how to deal with that. 00:30:17.500 |
Before we do, I want to show you one other cool trick. 00:30:20.880 |
I've added here cycle length=1, and this is another really interesting idea. 00:30:31.080 |
Cycle length=1 enables a fairly recent discovery in deep learning called Stochastic Gradient 00:30:44.080 |
As you get closer and closer to the right spot, I may want to start to decrease my learning 00:30:57.520 |
Because as I get closer, I'm pretty close down, so let's slow down my steps to try to 00:31:06.600 |
And so as we do more iterations, our learning rate perhaps should actually go down. 00:31:18.660 |
Because as we go along, we're getting closer and closer to where we want to be and we want 00:31:25.580 |
So the idea of decreasing the learning rate as you train is called learning rate annealing. 00:31:32.940 |
And it's very, very common, very, very popular. 00:31:40.320 |
The most common kind of learning rate annealing is really horrendously hacky. 00:31:45.640 |
It's basically that researchers pick a learning rate that seems to work for a while, and then 00:31:50.720 |
when it stops learning well, they drop it down by about 10 times, and then they keep 00:31:54.940 |
learning a bit more until it doesn't seem to be improving, and they drop it down by 00:31:59.680 |
That's what most academic research papers and most people in the industry do. 00:32:04.080 |
So this would be like stepwise annealing, very manual, very annoying. 00:32:09.620 |
A better approach is simply to pick some kind of functional form like a line. 00:32:16.480 |
It turns out that a really good functional form is one half of a cosine curve. 00:32:23.720 |
And the reason why is that for a while when you're not very close, you kind of have a 00:32:28.200 |
really high learning rate, and then as you do get close you kind of quickly drop down 00:32:32.520 |
and do a few iterations with a really low learning rate. 00:32:39.360 |
So to those of you who haven't done trigonometry for a while, cosine basically looks something 00:32:55.080 |
But here's the thing, when you're in a very high dimensional space, and here we're only 00:33:02.120 |
able to show 3 dimensions, but in reality we've got hundreds of millions of dimensions, 00:33:08.440 |
we've got lots of different fairly flat points, they're fairly flat points, all of which are 00:33:16.920 |
pretty good, but they might differ in a really interesting way, which is that some of those 00:33:33.360 |
Let's imagine we've got a surface that looks something like this. 00:33:41.720 |
Now imagine that our random guess started here, and our initial learning rate annealing 00:33:52.720 |
Now indeed that's a pretty nice low error, but it probably doesn't generalize very well, 00:33:59.560 |
which is to say if we use a different dataset where things are just kind of slightly different 00:34:04.620 |
in one of these directions, suddenly it's a terrible solution, whereas over here it's 00:34:11.600 |
basically equally good in terms of loss, but it rather suggests that if you have slightly 00:34:18.080 |
different datasets that are slightly moved in different directions, it's still going 00:34:22.760 |
So in other words, we would expect this solution here is probably going to generalize better 00:34:31.840 |
So here's what we do, if we've got a bunch of different low bits, then our standard learning 00:34:39.720 |
rate annealing approach will go downhill, downhill, downhill, downhill, downhill to 00:34:46.960 |
But what we could do instead is use a learning rate schedule that looks like this, which is 00:34:54.280 |
to say we do a cosine annealing and then suddenly jump up again, do a cosine annealing and then 00:35:00.040 |
And so each time we jump up, it means that if we're in a spiky bit and then we suddenly 00:35:05.080 |
increase the learning rate and it jumps now all the way over to here, and so then we kind 00:35:10.360 |
of learning rate and near, learning rate and near down to here, and then we jump up again 00:35:18.120 |
So in other words, each time we jump up the learning rate, it means that if it's in a 00:35:22.480 |
nasty spiky part of the surface, it's going to hop out of the spiky part, and hopefully 00:35:27.640 |
if we do that enough times, it will eventually find a nice smooth ball. 00:35:35.120 |
Could you get the same effect by running multiple iterations through the different randomized 00:35:45.640 |
running points so that eventually you explore all possible minimals and then compare them? 00:35:51.080 |
Yeah so in fact, that's a great question, and before this approach, which is called 00:36:01.040 |
stochastic gradient descent with restarts was created, that's exactly what people used 00:36:08.200 |
They used to create these things called ensembles where they would basically relearn a whole 00:36:13.240 |
new model 10 times in the hope that one of them is going to end up being better. 00:36:20.360 |
And so the cool thing about this stochastic gradient descent with restarts is that once 00:36:26.720 |
we're in a reasonably good spot, each time we jump up the learning rate, it doesn't restart, 00:36:33.120 |
it actually hangs out in this nice part of the space and then keeps getting better. 00:36:37.880 |
So interestingly it turns out that this approach where we do this a bunch of separate cosine 00:36:44.200 |
annealing steps, we end up with a better result than if we just randomly tried a few different 00:36:52.800 |
So it's a super neat trick and it's a fairly recent development, and again almost nobody's 00:37:01.560 |
heard of it, but I found it's now like my superpower. 00:37:07.760 |
Using this, along with the learning rate finder, I can get better results than nearly anybody 00:37:15.320 |
like in a Kaggle competition in the first week or two, I can jump in, spend an hour 00:37:20.960 |
or two and back, I've got a fantastically good result. 00:37:25.440 |
And so this is why I didn't pick the point where it's got the steepest slope, I actually 00:37:31.360 |
tried to pick something kind of aggressively high, it's still getting down but maybe getting 00:37:40.240 |
Because when we do this stochastic gradient descent with restarts, this 10^-2 represents 00:37:51.400 |
So it goes up to 10^-2 and then goes down, then up to 10^-2 and then down. 00:37:57.180 |
So if I use to lower learning rate, it's not going to jump to a different part of the function. 00:38:04.520 |
In terms of this part here where it's going down, we change the learning rate every single 00:38:22.880 |
And then the number of times we reset it is set by the cycle length parameter, so 1 means 00:38:32.860 |
So if I had 2 there, it would reset it up to every 2 epochs. 00:38:36.760 |
And interestingly this point that when we do the learning rate annealing that we actually 00:38:40.480 |
change it every single batch, it turns out to be really critical to making this work, 00:38:47.600 |
and again it's very different to what nearly everybody in industry and academia has done 00:38:52.280 |
We're going to come back to that multiple times in this course. 00:39:04.700 |
So the way this course is going to work is we're going to do a really high-level version 00:39:08.900 |
of each thing, and then we're going to come back to it in 2 or 3 lessons and then come 00:39:14.400 |
And each time we're going to see more of the math, more of the code, and get a deeper view. 00:39:20.060 |
We can talk about it also in the forums during the week. 00:39:27.540 |
Our main goal is to generalize and we don't want to get those narrow optimals. 00:39:34.780 |
So with this method, are we keeping track of the minimals and averaging them, and assembling 00:39:42.500 |
That's another level of sophistication, and indeed you can see there's something here 00:39:46.340 |
called Snapshot Ensembles, so we're not doing it in the code right now. 00:39:52.220 |
But yes, if you wanted to make this generalized even better, you can save the weights here 00:39:57.760 |
and here and here and then take the average of the predictions. 00:40:02.260 |
But for now, we're just going to pick the last one. 00:40:10.260 |
If you want to skip ahead, there's a parameter called CycleSaveName, which you can add as 00:40:20.500 |
well as CycleLend, and that will save a set of weights at the end of every learning rate 00:40:34.660 |
So we've got a pretty decent model here, 99.3% accuracy, and we've gone through a few steps 00:40:46.420 |
And so from time to time, I tend to save my weights. 00:40:48.620 |
So if you go Learn.Save and then pass in a file name, it's going to go ahead and save 00:40:54.660 |
Later on, if you go Learn.Load, you'll be straight back to where you came from. 00:40:59.260 |
So it's a good idea to do that from time to time. 00:41:03.140 |
This is a good time to mention what happens when you do this. 00:41:08.540 |
When you go Learn.Save, when you create pre-computed activations, another thing we'll learn about 00:41:13.480 |
soon when you create resized images, these are all creating various temporary files. 00:41:20.500 |
And so what happens is, if we go to Data, and we go to DogsCats, this is my data folder, 00:41:35.520 |
and you'll see there's a folder here called TMP, and so this is automatically created, 00:41:42.820 |
and all of my pre-computed activations end up in here. 00:41:46.180 |
I mention this because if you're getting weird errors, it might be because you've got some 00:41:53.620 |
pre-computed activations that were only half completed, or are in some way incompatible 00:42:00.900 |
So you can always go ahead and just delete this TMP, this temporary directory, and see 00:42:08.260 |
This is the fast AI equivalent of turning it off and then on again. 00:42:13.660 |
You'll also see there's a directory called Models, and that's where all of these, when 00:42:17.700 |
you say .save with a model, that's where that's going to go. 00:42:22.460 |
Actually it reminds me, when the Stochastic Gradient Descent with Restarts paper came 00:42:25.900 |
out, I saw a tweet that was somebody who was like, "Oh, to make your deep learning work 00:42:34.340 |
"If I want to say I want to retrain my model from scratch again, do I just delete everything 00:42:50.220 |
If you want to train your model from scratch, there's generally no reason to delete the 00:42:55.300 |
pre-computed activations because the pre-computed activations are without any training. 00:43:01.660 |
That's what the pre-trained model created with the weights that you downloaded off the internet. 00:43:12.220 |
The only reason you want to delete the pre-computed activations is that there was some error caused 00:43:16.780 |
by half creating them and crashing or something like that. 00:43:22.260 |
As you change the size of your input, change different architectures and so forth, they 00:43:25.980 |
all create different sets of activations with different file names, so generally you shouldn't 00:43:32.380 |
If you want to start training again from scratch, all you have to do is create a new learn object. 00:43:39.540 |
So each time you go con-learner.pre-trained, that creates a new object with new sets of 00:43:51.460 |
Before our break, we'll finish off by talking about fine-tuning and differential learning 00:44:00.580 |
So far everything we've done has not changed any of these pre-trained filters. 00:44:09.260 |
We've used a pre-trained model that already knows how to find at the early stages edges 00:44:17.460 |
and gradients, and then corners and curves, and then repeating patterns, and bits of text, 00:44:30.320 |
We have not re-trained any of those activations, any of those features, or more specifically 00:44:38.580 |
any of those weights in the convolutional kernels. 00:44:42.340 |
All we've done is we've learned some new layers that we've added on top of these things. 00:44:48.140 |
We've learned how to mix and match these pre-trained features. 00:44:53.000 |
Now obviously it may turn out that your pictures have different kinds of eyeballs or faces, 00:45:01.700 |
or if you're using different kinds of images like satellite images, totally different kinds 00:45:08.220 |
So if you're training to recognize icebergs, you'll probably want to go all the way back 00:45:14.140 |
and learn all the way back to different combinations of these simple gradients and edges. 00:45:20.500 |
In our case as dogs versus cats, we're going to have some minor differences, but we still 00:45:25.980 |
may find it's helpful to slightly tune some of these later layers as well. 00:45:32.660 |
So to tell the learner that we now want to start actually changing the convolutional 00:45:42.880 |
So a frozen layer is a layer which is not trained, which is not updated. 00:45:51.340 |
Now when you think about it, it's pretty obvious that layer 1, which is like a diagonal edge 00:45:59.420 |
or a gradient, probably doesn't need to change by much if at all. 00:46:04.620 |
From the 1.5 million images on ImageNet, it probably already has figured out pretty well 00:46:11.340 |
It probably already knows also which kind of corners to look for and how to find which 00:46:17.980 |
So in other words, these early layers probably need little if any learning, whereas these 00:46:25.380 |
later ones are much more likely to need more learning. 00:46:29.060 |
This is universally true regardless of whether you're looking for satellite images of rainforest 00:46:33.900 |
or icebergs or whether you're looking for cats versus dogs. 00:46:39.340 |
So what we do is we create an array of learning rates where we say these are the learning 00:46:45.700 |
rates to use for our additional layers that we've added on top. 00:46:51.980 |
These are the learning rates to use in the middle few layers, and these are the learning 00:46:59.780 |
So these are the ones for the layers that represent very basic geometric features. 00:47:05.140 |
These are the ones that are used for the more complex, sophisticated convolutional features, 00:47:12.540 |
and these are the ones that are used for the features that we've added and learned from 00:47:17.300 |
So we can create an array of learning rates, and then when we call .fit and pass in an 00:47:22.700 |
array of learning rates, it's now going to use those different learning rates for different 00:47:30.140 |
This is not something that we've invented, but I'd also say it's so not that common that 00:47:38.700 |
it doesn't even have a name as far as I know. 00:47:41.440 |
So we're going to call it differential learning rates. 00:47:46.540 |
If it actually has a name, or indeed if somebody's actually written a paper specifically talking 00:47:53.340 |
There's a great researcher called Jason Yosinski who did write a paper about the idea that 00:47:58.540 |
you might want different learning rates and showing why, but I don't think any other library 00:48:02.940 |
support it, and I don't know of a name for it. 00:48:07.260 |
Having said that though, this ability to unfreeze and then use these differential learning 00:48:13.100 |
rates I found is the secret to taking a pretty good model and turning it into an awesome model. 00:48:25.400 |
So just to clarify, you have three numbers there, three hyperparameters. 00:48:32.280 |
The first one is for the late layers, the other layers are there in your model. 00:48:40.540 |
So the short answer is many, many, and they're kind of in groups, and we're going to learn 00:48:47.620 |
This is called a ResNet, a residual network, and it kind of has ResNet blocks. 00:48:52.820 |
And so what we're doing is we're grouping the blocks into three groups, so this first 00:48:59.100 |
number is for the earliest layers, the ones closest to the pixels that represent like 00:49:17.940 |
So you're unfreezing them because you have kind of partially trained all the late layers. 00:49:30.420 |
And the learning rate is particularly small for the early layers, because you just kind 00:49:35.020 |
of want to fine-tune and you don't want it to. 00:49:37.820 |
We probably don't want to change them at all, but if it does need to then it can. 00:49:49.220 |
So using differential random rates, how is it different from grid search? 00:49:57.500 |
So grid search is where we're trying to find the best hyperparameter for something. 00:50:03.420 |
So for example, you could kind of think of the learning rate finder as a really sophisticated 00:50:10.220 |
grid search, which is like trying lots and lots of learning rates to find which one is 00:50:16.660 |
This is actually for the entire training from now on, it's actually going to use a different 00:50:30.620 |
So I was wondering, if you have a pre-trained model, then you have to use the same input 00:50:37.540 |
Because I was thinking, okay, let's say you have these big machines to train these things 00:50:46.260 |
How would you go about if you have images that are bigger than the ones that they use? 00:50:51.580 |
We're going to be talking about sizes later, but the short answer is that with this library 00:50:55.860 |
and the modern architectures we're using, we can use any size we like. 00:51:03.260 |
So Jeremy, can we unfreeze just a specific layer? 00:51:09.060 |
We're not doing it yet, but if you wanted to, you can type learn.freeze_to and pass 00:51:19.900 |
Much to my surprise, or at least initially my surprise, it turns out I almost never need 00:51:26.540 |
I almost never find it helpful, and I think it's because using differential learning rates, 00:51:32.360 |
the optimizer can kind of learn just as much as it needs to. 00:51:38.360 |
The one place I have found it helpful is if I'm using a really big memory-intensive model 00:51:55.860 |
and I'm running out of GPU, the less layers you unfreeze, the less memory it takes and 00:52:03.940 |
the less time it takes, so it's that kind of practical aspect. 00:52:06.740 |
To make sure I ask the question right, can I just unfreeze a specific layer? 00:52:14.260 |
No, you can only unfreeze layers from layer N onwards. 00:52:21.580 |
You could probably delve inside the library and unfreeze one layer, but I don't know why 00:52:29.140 |
So I'm really excited to be showing you guys this stuff, because it's something we've been 00:52:32.340 |
kind of researching all year is figuring out how to train state-of-the-art models. 00:52:37.700 |
And we've kind of found these tiny number of tricks. 00:52:40.940 |
And so once we do that, we now go learn.fit, and you can see, look at this, we get right 00:52:52.220 |
There's one other trick you might see here that as well as using stochastic gradient descent 00:52:56.780 |
with restarts, i.e. cycle length = 1, we've done 3 cycles. 00:53:02.980 |
Earlier on I lied to you, I said this is the number of epochs, it's actually the number 00:53:07.980 |
So if you said cycle length = 2, it would do 3 cycles of each of two epochs. 00:53:16.300 |
So here I've said do 3 cycles, yet somehow it's done 7 epochs. 00:53:20.900 |
The reason why is I've got one last trick to show you which is cycle_mult = 2. 00:53:25.740 |
And to tell you what that does, I'm simply going to show you the picture. 00:53:30.460 |
If I go learn.share.plot_learning_rate, there it is. 00:53:34.980 |
Now you can see what cycle_mult = 2 is doing. 00:53:38.500 |
It's doubling the length of the cycle after each cycle. 00:53:43.020 |
And so in the paper that introduced this stochastic gradient descent with restarts, the researcher 00:53:48.500 |
kind of said hey, this is something that seems to sometimes work pretty well, and I've certainly 00:53:55.100 |
So basically, intuitively speaking, if your cycle length is too short, then it kind of 00:54:04.140 |
starts going down to find a good spot and then it pops out. 00:54:07.540 |
And it goes down to try and find a good spot and pops out, and it never actually gets to 00:54:12.260 |
So earlier on, you want it to do that because it's trying to find the bit that's smoother. 00:54:17.780 |
But then later on you want it to do more exploring, and then more exploring. 00:54:22.700 |
So that's why this cycle_mult = 2 thing often seems to be a pretty good approach. 00:54:29.860 |
So suddenly we're introducing more and more hyperparameters, having told you there aren't 00:54:35.980 |
But the reason is that you can really get away with just picking a good learning rate, 00:54:41.460 |
but then adding these extra tweaks really helps get that extra level up without any 00:54:50.540 |
And so in practice, I find this kind of 3 cycles starting at 1, mult = 2 works very, 00:55:05.260 |
If it doesn't, then often I'll just do 3 cycles of length 2 with no mult. 00:55:12.100 |
There's kind of like 2 things that seem to work a lot. 00:55:14.780 |
There's not too much fiddling, I find, necessary. 00:55:17.780 |
As I say, even if you use this line every time, I'd be surprised if you didn't get 00:55:29.100 |
Why does smoother services correlate to more generalized networks? 00:55:40.100 |
So it's kind of this intuitive explanation I tried to give back here. 00:55:47.340 |
Which is that if you've got something spiky, and so what this x-axis is showing is how 00:56:03.580 |
good is this at recognizing dogs versus cats as you change this particular parameter? 00:56:09.260 |
And so for something to be generalizable, it means that we want it to work when we give 00:56:16.500 |
And so a slightly different data set may have a slightly different relationship between 00:56:21.900 |
this parameter and how catty versus doggy it is. 00:56:31.900 |
So in other words, if we end up at this point, then it's not going to do a good job on this 00:56:38.140 |
slightly different data set, or else if we end up on this point, it's still going to 00:56:53.420 |
So we've got one last thing before we're going to take a break, which is we're now going 00:56:57.140 |
to take this model, which has 99.5% accuracy, and we're going to try to make it better still. 00:57:03.580 |
And what we're going to do is we're not actually going to change the model at all, but instead 00:57:08.500 |
we're going to look back at the original visualization we did where we looked at some of our incorrect 00:57:17.460 |
Now what I've done is I've printed out the whole of these incorrect pictures, but the 00:57:26.340 |
key thing to realize is that when we do the validation set, all of our inputs to our model 00:57:40.940 |
The reason for that is it's kind of a minor technical detail, but basically the GPU doesn't 00:57:46.740 |
go very quickly if you have different dimensions for different images because it needs to be 00:57:51.620 |
consistent so that every part of the GPU can do the same thing. 00:57:55.460 |
I think this is probably fixable, but now that's the state of the technology we have. 00:57:59.020 |
So our validation set, when we actually say for this particular thing is it's a dog, what 00:58:03.980 |
we actually do to make it square is we just pick out the square in the middle. 00:58:09.220 |
So we would take off its two edges, and so we take the whole height and then as much 00:58:15.940 |
And so you can see in this case we wouldn't actually see this dog's head. 00:58:20.480 |
So I think the reason this was actually not correctly classified was because the validation 00:58:25.220 |
set only got to see the body, and the body doesn't look particularly dog-like or cat-like, 00:58:34.100 |
So what we're going to do when we calculate the predictions for our validation set is 00:58:39.740 |
we're going to use something called test time augmentation. 00:58:42.860 |
And what this means is that every time we decide is this cat or a dog, not in the training 00:58:47.580 |
but after we train the model, is we're going to actually take four random data augmentations. 00:58:57.720 |
And remember the data augmentations move around and zoom in and out and flip. 00:59:03.580 |
So we're going to take four of them at random and we're going to take the original un-augmented 00:59:09.020 |
center crop image and we're going to do a prediction for all of those. 00:59:13.180 |
And then we're going to take the average of those predictions. 00:59:16.580 |
So we're going to say is this a cat, is this a cat, is this a cat, is this a cat. 00:59:21.660 |
And so hopefully in one of those random ones we actually make sure that the face is there, 00:59:26.940 |
zoomed in by a similar amount to other dog's faces at sea and it's rotated by the amount 00:59:33.100 |
And so to do that, all we have to do is just call tta, tta stands for test time augmentation. 00:59:41.740 |
This term of what do we call it when we're making predictions from a model we've trained, 00:59:46.860 |
sometimes it's called inference time, sometimes it's called test time, everybody seems to 00:59:52.780 |
And so when we do that we go learn.tta, check the accuracy, and lo and behold we're now 01:00:05.580 |
But for every bug we're only showing one type of augmentation of a particular image, right? 01:00:13.300 |
So when we're training back here, we're not doing any tta. 01:00:17.820 |
So you could, and sometimes I've written libraries where after each epoch I run tta to see how 01:00:27.780 |
I trained the whole thing with training time augmentation, which doesn't have a special 01:00:34.500 |
When we say data augmentation, we mean training time augmentation. 01:00:37.620 |
So here every time we showed a picture, we were randomly changing it a little bit. 01:00:42.300 |
So each epoch, each of these seven epochs, it was seeing slightly different versions 01:00:47.540 |
Having done that, we now have a fully trained model, we then said okay, let's look at the 01:00:53.100 |
So tta by default uses the validation set and said okay, what are your predictions of 01:00:59.260 |
And it did four predictions with different random augmentations, plus one on the unaugmented 01:01:05.020 |
version, averaged them all together, and that's what we got, and that's what we got the accuracy 01:01:10.700 |
So is there a high probability of having a sample in tta that was not shown during training? 01:01:17.220 |
Yeah, actually every data augmented image is unique because the rotation could be like 01:01:37.300 |
What's your, why not use white padding or something like that? 01:01:46.740 |
Oh, padding's not, yeah, so there's lots of different types of data augmentation you can 01:01:51.060 |
do and so one of the things you can do is to add a border around it. 01:01:56.420 |
Basically adding a border around it in my experiments doesn't help, it doesn't make 01:02:00.180 |
it any less cat-like, convolutional neural network doesn't seem to find it very interesting. 01:02:07.220 |
One thing that I do do, we'll see later, is I do something called reflection padding which 01:02:10.900 |
is where I add some borders that are the outside just reflected, it's a way to kind of make 01:02:16.980 |
It works well with satellite imagery in particular, but in general I don't do a lot of padding, 01:02:25.260 |
Let's kind of follow up to that last one, but rather than cropping, just add white space 01:02:34.060 |
because when you crop you lose the dog's face, but if you added white space you wouldn't 01:02:39.740 |
Yeah, so that's where the reflection padding or the zooming or whatever can help. 01:02:44.820 |
So there are ways in the fastai library when you do custom transforms of making that happen. 01:02:53.100 |
I find that it kind of depends on the image size, but generally speaking it seems that 01:03:03.540 |
using TTA plus data augmentation, the best thing to do is to try to use as large an image 01:03:10.140 |
So if you kind of crop the thing down and put white borders on top and bottom, it's 01:03:16.100 |
And so to make it as big as it was before, you now have to use more GPU, and if you're 01:03:19.660 |
going to use all that more GPU you could have zoomed in and used a bigger image. 01:03:22.660 |
So in my playing around that doesn't seem to be generally as successful. 01:03:28.380 |
There is a lot of interest on the topic of how do you do that augmentation in older than 01:03:52.500 |
I asked some of my friends in the natural language processing community about this, 01:03:55.980 |
and we'll get to natural language processing in a couple of lessons, it seems like it would 01:04:01.260 |
There's been a very, very few examples of people where papers would try replacing synonyms 01:04:07.460 |
for instance, but on the whole an understanding of appropriate data augmentation for non-image 01:04:13.300 |
domains is under-researched and underdeveloped. 01:04:23.940 |
The question was, couldn't we just use a sliding window to generate all the images? 01:04:28.060 |
So in that dog picture, couldn't we generate three parts of that, wouldn't that be better? 01:04:37.080 |
Just in general when you're creating your realism. 01:04:40.180 |
For training time, I would say no, that wouldn't be better because we're not going to get as 01:04:45.620 |
We want to have it one degree off, five degrees off, ten pixels up, lots of slightly different 01:04:51.700 |
versions and so if you just have three standard ways, then you're not giving it as many different 01:05:00.300 |
For test time augmentation, having fixed crop locations I think probably would be better, 01:05:08.620 |
and I just haven't gotten around to writing that yet. 01:05:11.500 |
I have a version in an olden library, I think having fixed crop locations plus random contrast 01:05:24.060 |
The reason I haven't gotten around to it yet is because in my testing it didn't seem to 01:05:27.540 |
help in practice very much and it made the code a lot more complicated, so it's an interesting 01:05:33.540 |
"I just want to know how this fast AI API is that you're using, is it open source?" 01:05:46.020 |
The fast AI library is open source, and let's talk about it a bit more generally. 01:05:54.540 |
The fact that we're using this library is kind of interesting and unusual, and it sits 01:06:01.660 |
So PyTorch is a fairly recent development, and I've noticed all the researchers that 01:06:15.820 |
I found in part 2 of last year's course that a lot of the cutting edge stuff I wanted to 01:06:20.220 |
teach I couldn't do it in Keras and TensorFlow, which is what we used to teach with, and so 01:06:26.980 |
I had to switch the course to PyTorch halfway through part 2. 01:06:31.620 |
The problem was that PyTorch isn't very easy to use; you have to write your own training loop from scratch, for example. 01:06:38.620 |
Basically, if you write everything from scratch, all the stuff you see inside the fastai library 01:06:42.460 |
is stuff we would have had to write just to learn. 01:06:46.460 |
And it really makes it very hard to learn deep learning when you have to write hundreds of lines of code before anything works. 01:06:54.340 |
So we decided to create a library on top of PyTorch, because our mission is to teach world-class deep learning. 01:07:02.580 |
So we wanted to show you how you can be the best in the world at doing X, and we found 01:07:08.060 |
that a lot of the world class stuff we needed to show really needed PyTorch, or at least 01:07:13.420 |
with PyTorch it was far easier, but then PyTorch itself just wasn't suitable as a first thing 01:07:20.820 |
to teach with for new deep learning practitioners. 01:07:25.820 |
So we built this library on top of PyTorch, initially heavily influenced by Keras, which is a great library in many ways. 01:07:33.480 |
But then we realized we could actually make things much easier than Keras. 01:07:37.380 |
So in Keras, if you look back at last year's course notes, you'll find that all of the 01:07:42.500 |
code is 2-3 times longer, and there are lots more opportunities for mistakes because there's so much more to get right. 01:07:51.960 |
So we ended up building this library in order to make it easier to get into deep learning, 01:07:59.460 |
but also easier to get state-of-the-art results. 01:08:03.060 |
And then over the last year as we started developing on top of that, we started discovering 01:08:06.980 |
that by using this library, it made us so much more productive that we actually started 01:08:13.940 |
developing new state-of-the-art results and new methods ourselves, and we started realizing 01:08:18.180 |
that there's a whole bunch of papers that have kind of been ignored or lost, which when 01:08:23.180 |
you use them can semi-automate stuff, like the learning rate finder, which isn't in any other library. 01:08:30.100 |
So I've kind of got to the point where now, not only does fastai let us do things 01:08:37.700 |
much more easily than any other approach, it actually has a lot more sophisticated techniques built in as well. 01:08:52.580 |
So we've released this library; at this stage it's a very early version, and so through 01:08:57.700 |
this course, by the end of this course, I hope that as a group (a lot of people are already helping) we'll 01:09:03.420 |
have developed it into something that's really pretty stable and rock-solid. 01:09:10.300 |
And anybody can then use it to build their own models under an open-source license. 01:09:23.540 |
Behind the scenes it's creating PyTorch models, and those PyTorch models can then be exported and used wherever PyTorch can be used. 01:09:34.100 |
Having said that, for a lot of folks, if you want to do something on a mobile phone, for example, 01:09:38.780 |
you're probably going to need to use TensorFlow. 01:09:41.620 |
And so later on in this course, we're going to show how some of the things that we're 01:09:46.540 |
doing in the fast.ai library you can do in Keras and TensorFlow, so you can get a sense of how they compare. 01:09:54.300 |
Generally speaking, the simple stuff will take you a small number of days to learn to 01:10:01.860 |
do it in Keras and TensorFlow versus fast.ai and PyTorch. 01:10:05.460 |
And the more complex stuff often just won't be possible. 01:10:09.160 |
So if you need it to be in TensorFlow, you'll often just have to simplify it a little bit. 01:10:18.820 |
I think the more important thing to realize is that every year, the libraries that are available change dramatically. 01:10:28.580 |
So the main thing I hope that you get out of this course is an understanding of the 01:10:32.180 |
concepts, like here's how you find a learning rate, here's why differential learning rates 01:10:36.300 |
are important, here's how you do learning rate annealing, here's what stochastic 01:10:40.740 |
gradient descent with restarts does, and so on and so forth. 01:10:45.140 |
Because by the time we do this course again next year, the library situation is going to be different again. 01:10:57.820 |
I was wondering if you've had an opinion on Pyro, which is Uber's new release. 01:11:09.340 |
I haven't looked at it, no. I'm very interested in probabilistic programming, and it's really good to see that work happening. 01:11:15.500 |
So one of the things we'll learn about in this course is we'll see that PyTorch is much 01:11:18.780 |
more than just a deep learning library, it actually lets us write arbitrary GPU accelerated 01:11:26.060 |
algorithms from scratch, which we're actually going to do, and Pyro is a great example of 01:11:31.140 |
what people are now doing with PyTorch outside of the deep learning world. 01:11:36.500 |
Great, let's take an 8-minute break and we'll come back at 7:55. 01:11:57.020 |
In classification, when we do classification in machine learning, a really simple way to 01:12:03.220 |
look at the result of a classification is what's called the confusion matrix. 01:12:07.180 |
This is not just deep learning, but any kind of classifier in machine learning where we 01:12:10.900 |
say what was the actual truth, there were a thousand cats and a thousand dogs, and of 01:12:17.940 |
the thousand actual cats, how many did we predict were cats? 01:12:21.980 |
This is obviously on the validation set, the images that we didn't use to train with. 01:12:27.140 |
It turns out there were 998 cats that we actually predicted as cats and 2 that we got wrong. 01:12:33.420 |
And then for dogs, there were 995 that we predicted were dogs and then 5 that we got 01:12:39.140 |
So often these confusion matrices can be helpful, particularly if you've got four or five classes 01:12:44.700 |
you're trying to predict which group you're having the most trouble with and you can see 01:12:49.180 |
it uses color-coding to highlight the large cells, and hopefully the diagonal is where the big numbers are. 01:13:00.540 |
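As a minimal sketch of what that looks like in code, assuming the fastai 0.7-style learn and data objects set up earlier in the course, and that plot_confusion_matrix is the fastai plotting helper:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# predictions on the validation set; fastai's predict() returns log-probabilities
log_preds = learn.predict()
preds = np.argmax(log_preds, axis=1)

# rows are the actual classes, columns are the predicted classes
cm = confusion_matrix(data.val_y, preds)

# fastai helper that draws the colour-coded matrix, with the correct counts on the diagonal
plot_confusion_matrix(cm, data.classes)
```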
So now that we've retrained the model and it's better, it can be quite helpful to actually 01:13:04.820 |
look back and see which ones in particular were incorrect. 01:13:09.180 |
And we can see here there were actually only two incorrect cats; it prints out four by 01:13:15.180 |
default, but you can see these two are actually less than 0.5, so they weren't predicted wrongly. 01:13:20.220 |
So it's actually only these two that were wrong cats, and this one isn't obviously a cat at all. 01:13:27.060 |
This one is, but it looks like it's got a lot of weird artifacts and you can't see its face clearly. 01:13:33.580 |
And then here are the 5 wrong dogs; here are 4 of them, and that's not obviously a dog. 01:13:42.340 |
That looks like a mistake, that looks like a mistake, and that one I guess doesn't have enough of the animal visible to tell. 01:13:50.620 |
So we've done a pretty good job here of creating a good classifier. Based on entering a lot 01:13:57.940 |
of Kaggle competitions and comparing the results I've gotten to various research papers, 01:14:01.700 |
I can tell you it's a state-of-the-art classifier, right up there with the best in the world. 01:14:07.460 |
We're going to make it a little bit better in a moment, but here are the basic steps. 01:14:11.380 |
So if you want to create a world class image classifier, the steps that we just went through 01:14:16.060 |
was that we turned data augmentation on by setting aug_tfms= to either 01:14:21.940 |
transforms_side_on or transforms_top_down, depending on what you're doing. 01:14:25.100 |
Start with precompute=True, find a decent learning rate, and then train for just 01:14:30.540 |
one or two epochs, which takes a few seconds because we've got precompute=True. 01:14:35.940 |
Then we turn off pre-compute, which allows us to use data augmentation to do another 01:14:41.320 |
two or three epochs, generally with cycle length equals one. 01:14:45.580 |
Then I unfreeze all the layers, and I set the earlier layers to somewhere between 01:14:51.220 |
3 times and 10 times lower a learning rate than the next layer group. 01:15:07.620 |
As a rule of thumb, knowing that you're starting with a pre-trained ImageNet model, if you 01:15:13.700 |
can see that the things that you're now trying to classify are pretty similar to the kinds 01:15:17.460 |
of things in ImageNet, i.e. pictures of normal objects in normal environments, you probably 01:15:22.920 |
want about a 10x difference, because you think that the earlier layers are probably already very close to what you need. 01:15:29.660 |
Whereas if you're doing something like satellite imagery or medical imaging, which is not at 01:15:34.380 |
all like ImageNet, then you probably want to be training those earlier layers a lot more. 01:15:42.300 |
So that's the one choice I make: either a 10x or a 3x gap between the layer groups. 01:15:52.100 |
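Purely as an illustration of that rule of thumb (a sketch, not a fixed recipe; the base learning rate is whatever lr_find suggested):

```python
import numpy as np

lr = 1e-2   # learning rate chosen for the last layer group with lr_find()

# data that looks like ImageNet: earlier layer groups get ~10x lower rates
lrs_imagenet_like = np.array([lr / 100, lr / 10, lr])

# data very unlike ImageNet (satellite, medical): only ~3x lower
lrs_very_different = np.array([lr / 9, lr / 3, lr])

learn.unfreeze()
learn.fit(lrs_imagenet_like, 3, cycle_len=1, cycle_mult=2)
```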
So then after unfreezing, you can now call LRFind again. 01:15:57.660 |
I actually didn't in this case, but once you've unfrozen all the layers and turned on 01:16:02.620 |
differential learning rates, you can then call LRFind again. 01:16:09.820 |
So you can then check whether the point you picked last time still looks about right. 01:16:15.300 |
Something to note is that if you call LRFind having set differential learning rates, the 01:16:21.620 |
thing it's actually going to print out is the learning rate of the last layers, because 01:16:25.700 |
you've got three different learning rates, so it's actually showing you the last layer. 01:16:29.680 |
So then I train the full network with cycle_mult = 2, until it starts overfitting. 01:16:39.980 |
Let's do this again for a totally different dataset. 01:16:42.800 |
So this morning, I noticed that some of you on the forums were playing around with this 01:16:47.420 |
playground Kaggle competition, very similar, called dog breed identification. 01:16:54.140 |
So the dog breed identification Kaggle challenge is one where you don't actually have to decide 01:17:01.020 |
which ones are cats and which ones are dogs, they're all dogs, but you have to decide what 01:17:10.500 |
So obviously this could be different types of cells in pathology slides, it could be 01:17:19.460 |
different kinds of cancers in CT scans, it could be different kinds of icebergs and satellite 01:17:26.260 |
images, whatever, as long as you've got some kind of labeled images. 01:17:31.300 |
So I want to show you what I did this morning, it took me about an hour basically to go end 01:17:36.860 |
to end from something I've never seen before. 01:17:40.100 |
So I downloaded the data from Kaggle, and I'll show you how to do that shortly, but the short 01:17:45.380 |
answer is there's something called Kaggle CLI, which is a GitHub project you can search 01:17:49.860 |
for and if you read the docs, you basically run kg download, provide the competition name 01:17:55.080 |
and it will grab all the data for you onto your Crestle or Amazon or whatever instance. 01:18:00.260 |
I put it in my data folder, I then ran ls, and I saw that it's a little bit different. 01:18:13.900 |
It's not that there's a train folder which has a separate folder for each kind of dog, 01:18:19.860 |
but instead it turned out there was a CSV file. 01:18:22.580 |
And the CSV file, I read it in with pandas, so pandas is the thing we use in Python to 01:18:28.100 |
do structured data analysis like CSV files, so pandas we call pd, that's pretty much universal, 01:18:34.980 |
pd.read_csv reads in a CSV file, we can then take a look at it and you can see that basically 01:18:40.220 |
it had some kind of identifier and then the breed. 01:18:44.940 |
So this is like a different way, this is the second main way that people kind of give you 01:18:50.940 |
One is to put different images into different folders, the second is generally to give you 01:18:55.060 |
some kind of file like a CSV file to tell you here's the image name and here's the label. 01:19:02.460 |
So what I then did was I used pandas again to create a pivot table which basically groups 01:19:09.700 |
it up just to see how many of each breed there were and I sorted them. 01:19:14.900 |
And so I saw they've got about 100 of some of the more common breeds, and rather fewer of the less common ones. 01:19:25.540 |
Altogether there were 120 rows in that table, so there must have been 120 different breeds represented. 01:19:31.220 |
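A hedged sketch of that exploration, assuming the Kaggle file is called labels.csv and has 'id' and 'breed' columns (the column names are an assumption based on the description here):

```python
import pandas as pd

label_csv = f'{PATH}labels.csv'      # PATH is the data folder set up earlier
label_df = pd.read_csv(label_csv)
label_df.head()                      # one row per image: id (file name stem) and breed

# how many images of each breed, most common first (value_counts sorts for you)
label_df['breed'].value_counts()
```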
So I'm going to go through the steps, so enable data augmentation. 01:19:37.220 |
So to enable data augmentation, when we call tfms_from_model you just pass 01:19:41.680 |
in aug_tfms; in this case I chose transforms_side_on, because again these are pictures of dogs taken from the side. 01:19:50.780 |
We'll talk about max_zoom in more detail later, but max_zoom basically says that when you do the 01:19:56.260 |
data augmentation, we zoom into the image by up to 1.1 times, chosen randomly between 1 (the original size) and 1.1. 01:20:07.820 |
So it's not always cropping out the middle or an edge; it could be cropping out a randomly zoomed-in piece of the image. 01:20:13.580 |
So having done that, the key step now is that rather than going from_paths (previously 01:20:20.240 |
we went from_paths, which tells it that the names of the folders are the names of the labels), 01:20:26.100 |
we go from_csv, and we pass in the CSV file that contains the labels. 01:20:31.920 |
So we're passing in the path that contains all of the data, the name of the folder that 01:20:37.100 |
contains the training data, the CSV that contains the labels. 01:20:43.120 |
We also need to tell it where the test set is if we want to submit to Kaggle later; we'll talk about that shortly. 01:20:49.620 |
Now this time, the previous dataset I had actually separated a validation set out into 01:20:57.300 |
a separate folder, but in this case you'll see that there is not a separate folder called valid. 01:21:04.980 |
So we want to be able to track how good our performance is locally, so we're going to 01:21:09.620 |
have to separate some of the images out to put it into a validation set. 01:21:14.580 |
So I do that at random, and so up here you can see I've basically opened up the CSV file, 01:21:22.900 |
turned it into a list of rows, and then taken the length of that minus 1, because there's a header row. 01:21:30.580 |
And so that's the number of rows in the CSV file, which must be the number of labelled images that we have. 01:21:36.860 |
And then this is a fastai function, get_cv_idxs (get cross-validation indexes); we'll talk about cross-validation later. 01:21:42.900 |
But basically if you call this and pass in a number, it's going to return to you by default 01:21:50.100 |
a random 20% of the rows to use as your validation set, and you can pass in parameters to get a different percentage or a different random subset. 01:21:58.420 |
So this is now going to grab 20% of the data and say this is the indexes, the numbers of 01:22:05.500 |
the files which we're going to use as a validation set. 01:22:09.420 |
So now that we've got that, let's run this so you can see what that looks like. 01:22:17.580 |
So val_idxs is just a big bunch of numbers; n is 10,000, and so about 20% of those, roughly 2,000, end up as validation indexes. 01:22:35.740 |
So when we call from_csv, we can pass in a parameter which tells it which indexes 01:22:47.380 |
to treat as the validation set, and so we pass in those indexes. 01:22:52.820 |
One thing that's a little bit tricky here is that the file names actually have a .jpg 01:23:04.700 |
on the end, and the ids in the CSV obviously don't have a .jpg. 01:23:09.260 |
So when you call from CSV, you can pass in a suffix that says the labels don't actually 01:23:15.860 |
contain the full file names, you need to add this to them. 01:23:22.100 |
So that's basically all I need to do to set up my data. 01:23:26.780 |
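Putting those pieces together, a minimal sketch of the data setup using the fastai 0.7 names described here (arch, sz and bs are whatever you chose earlier):

```python
# count the labelled images: number of CSV rows minus the header
n = len(list(open(label_csv))) - 1
val_idxs = get_cv_idxs(n)            # a random 20% of row indexes by default

tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_csv(PATH, 'train', label_csv,
                                    test_name='test',    # needed for Kaggle submissions later
                                    val_idxs=val_idxs,
                                    suffix='.jpg',       # the CSV ids lack the extension
                                    tfms=tfms, bs=bs)
```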
And as a lot of you have noticed during the week, inside that data object, you can actually 01:23:32.820 |
get access to the training dataset by saying data.trn_ds, and inside trn_ds there's a whole bunch of useful things. 01:23:42.060 |
So data.trn_ds.fnames contains all of the file names of everything in the training set, so here's the first one. 01:23:52.680 |
So I can now go ahead and open that file and take a look at it. 01:23:56.020 |
So the next thing I did was to try to understand what my dataset looks like, so I opened up an image and had a look at it. 01:24:05.540 |
I also want to know how big are these files, like how big are the images, because that's 01:24:12.420 |
If they're huge, I'm going to have to think really carefully about how to deal with huge 01:24:17.780 |
If they're tiny, well that's also challenging. 01:24:21.180 |
Most ImageNet models are trained on either 224x224 or 299x299 images, so any time you 01:24:28.380 |
have images in that kind of range, that's a good sign; you're probably not going to have too much trouble. 01:24:33.980 |
In this case, the first image I looked at was about the right size, so I'm thinking this is looking promising. 01:24:40.340 |
So what I did then is I created a dictionary comprehension. 01:24:42.940 |
Now if you don't know about list comprehensions and dictionary comprehensions in Python, go and learn about them during the week. 01:24:52.500 |
You can see the basic idea here is that I'm going through all of the files, and I'm creating 01:24:56.700 |
a dictionary that maps the name of the file to the size of that file. 01:25:04.900 |
Again this is a handy little Python feature which I'll let you learn about during the 01:25:07.980 |
week if you don't know about it, which is zip; using this special star notation, 01:25:11.900 |
it takes the dictionary's values and splits them into the rows and the columns. 01:25:19.380 |
So I can now turn those into NumPy arrays, and here are the row sizes for the first 5 images. 01:25:28.260 |
And then Matplotlib is something you want to be very familiar with if you do any kind 01:25:32.020 |
of data science on machine learning in Python. 01:25:34.380 |
Matplotlib we always refer to as PLT, this is a histogram, and so I've got a histogram 01:25:40.620 |
of how high, how many rows there are in each image. 01:25:45.140 |
So you can see here I'm kind of getting a sense. 01:25:47.540 |
Before I start doing any modeling, I kind of need to know what I'm modeling with. 01:25:50.900 |
And I can see some of the images are going to be like 2500-3000 pixels high, but most of them are a lot smaller than that. 01:25:58.540 |
So given that so few of them were bigger than 1000, I used standard NumPy slicing to just 01:26:04.980 |
grab those that are smaller than 1000 and histogram that, just to zoom in a little bit. 01:26:09.980 |
And I can see here it looks like the vast majority are around 500. 01:26:14.660 |
And so this actually also prints out the histogram values, so I can go through and see the actual counts in each bin. 01:26:28.380 |
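Roughly, that size exploration looks like this (a sketch; PIL's Image.size gives width and height, so which one you call "rows" is just a naming choice):

```python
import PIL
import numpy as np
import matplotlib.pyplot as plt

fnames = data.trn_ds.fnames                              # file names in the training set

# dictionary comprehension: file name -> (width, height)
size_d = {f: PIL.Image.open(PATH + f).size for f in fnames}

# zip(*...) splits the (width, height) pairs into two separate sequences
col_sz, row_sz = zip(*size_d.values())
row_sz = np.array(row_sz)

plt.hist(row_sz)                    # heights of all the images
plt.hist(row_sz[row_sz < 1000])     # zoom in on the ones under 1000 pixels
```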
So Jeremy, how many images should we get in the validation set, is it always 20%? 01:26:39.620 |
So the size of the validation set: using 20% is fine, unless you're feeling like 01:26:48.620 |
your dataset is really small and you're not sure that's enough. 01:26:57.980 |
Basically think of it this way, if you train the same model multiple times and you're getting 01:27:01.620 |
very different validation set results, and your validation set is kind of small, like 01:27:05.620 |
smaller than 1000 or so, then it's going to be quite hard to interpret how well you're doing. 01:27:13.020 |
This is particularly true if you care about the third decimal place of accuracy: if you've 01:27:19.300 |
got 1000 things in your validation set, then a single image changing class moves that third decimal place. 01:27:26.380 |
So it really depends on how much difference you care about. 01:27:34.860 |
I would say in general, at the point where you care about the difference between 0.01 01:27:39.620 |
and 0.02, the second decimal place, you want that to represent 10 or 20 rows, like changing 01:27:48.100 |
the class of 10 or 20 rows, then that's something you can be pretty confident of. 01:27:54.420 |
So most of the time, given the data sizes we normally have, 20% seems to work fine. 01:28:05.500 |
It depends a lot on specifically what you're doing and what you care about. 01:28:12.140 |
And it's not a deep learning specific question either. 01:28:15.780 |
So those who are interested in this kind of thing, we're going to look into it in a lot 01:28:18.540 |
more detail in our machine learning course, which will also be available online. 01:28:27.940 |
So I did the same thing for the columns, just to make sure that these aren't super wide, 01:28:31.660 |
and I got similar results; I checked and again found that 400-500 seems to be about typical. 01:28:38.540 |
So based on all of that, I thought this looks like a pretty normal kind of image dataset 01:28:42.900 |
that I can probably use pretty normal kinds of models on. 01:28:46.100 |
I was also particularly encouraged to see that when I looked at the dog that the dog 01:28:49.940 |
takes up most of the frame, so I'm not too worried about cropping problems. 01:28:56.340 |
If the dog was just a tiny little piece of one little corner that I'd be thinking about 01:29:01.060 |
doing different, maybe zooming in a lot more or something. 01:29:05.420 |
Like in medical imaging, that happens a lot, like often the tumor or the cell or whatever 01:29:09.580 |
is like one tiny piece and that's much more complex. 01:29:13.540 |
So based on all that, this morning I kind of thought, okay, this looks pretty standard. 01:29:19.380 |
So I went ahead and created a little function called get_data that basically has my usual couple of lines in it, the transforms and the data object. 01:29:30.740 |
But I made it so I could pass in a size and a batch size. 01:29:35.460 |
The reason for this is that when I start working with a new dataset, I want everything to go really quickly. 01:29:40.580 |
And so if I use small images, it's going to go super fast. 01:29:44.580 |
So I actually started out with size=64, just to create some super small images that take just 01:29:50.460 |
a second or so to run through, to see how it went. 01:29:54.260 |
Later on, I started using some big images and also some bigger architectures, at which 01:29:59.940 |
point I started running out of GPU memory, so I started getting these errors saying CUDA out of memory. 01:30:06.780 |
When you get a CUDA out of memory error, the first thing you need to do is go kernel restart. 01:30:12.580 |
Once you get an out of memory error on your GPU, you can't really recover from it. 01:30:17.140 |
It doesn't matter what you do, you have to restart. 01:30:21.780 |
And once I restarted, I then just changed my batch size to something smaller. 01:30:26.000 |
So when you create your data object, you can pass in a batch size parameter. 01:30:35.180 |
And I normally use 64 until I hit something that says out of memory, and then I just halve it. 01:30:41.060 |
And if I still get out of memory, I'll just halve it again. 01:30:44.940 |
So that's why I created this function: to let me make my sizes bigger as I looked 01:30:48.780 |
into it more, and to decrease my batch size as I started running out of memory. 01:30:53.780 |
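A hedged sketch of what that get_data helper might look like for this dataset; the name and arguments follow the description here, not any fixed fastai convention:

```python
def get_data(sz, bs):
    """sz: image size to train at; bs: batch size (halve it if CUDA runs out of memory)."""
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
    return ImageClassifierData.from_csv(PATH, 'train', label_csv,
                                        test_name='test', val_idxs=val_idxs,
                                        suffix='.jpg', tfms=tfms, bs=bs)

data = get_data(64, 64)    # tiny images first, just to check everything runs end to end
# later: get_data(224, 64); if you hit out-of-memory, restart the kernel and retry with bs=32, 16, ...
```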
So at this point I went through a couple of iterations, but I basically found everything was working fine. 01:31:00.940 |
Once it was working fine, I set size to 224 and created my learner with precompute=True. 01:31:09.800 |
First time I did that it took a minute to create the pre-computed activations, and then 01:31:13.700 |
it ran through this in about 4 or 5 seconds, and you can see I was getting 83% accuracy. 01:31:19.220 |
Now remember accuracy means it's exactly right, and so it's predicting out of 120 categories. 01:31:27.420 |
So when you see something with 2 classes is 80% accurate versus something with 120 classes 01:31:34.380 |
is 80% accurate, they're very different levels. 01:31:38.700 |
So when I saw 83% accuracy with just a pre-computed classifier, no data augmentation, no unfreezing, no 01:31:46.060 |
anything else, across 120 classes, I thought: oh, this looks good. 01:31:51.540 |
So then I just kept going through our little standard process. 01:31:57.220 |
So then I turned precompute off, set cycle_len=1, and started doing a few epochs. 01:32:11.940 |
So remember, an epoch is one pass through the data, and a cycle is however many epochs you've said make up one cycle: 01:32:23.100 |
it's the learning rate going from the top value that you asked for all the way down to 0. 01:32:28.200 |
So since here cycle_len=1, a cycle and an epoch are the same. 01:32:33.780 |
So I tried a few epochs; I did actually do the learning rate finder, and I found 1e-2 01:32:40.580 |
again looks fine, as it often does. 01:32:44.940 |
And I found it kept improving, so I tried 5 epochs and I found my accuracy getting better. 01:32:52.620 |
So then I saved that, and I tried something which we haven't looked at before, but it's very handy. 01:33:00.300 |
If you train something on a smaller size, you can then actually call learn.set_data() and pass in a data object with bigger images. 01:33:11.460 |
And that's going to take your model, however it's been trained so far, and let you keep training on those bigger images. 01:33:23.220 |
This is actually another way you can get state-of-the-art results, and I've never seen it written 01:33:27.580 |
about in any paper or discussed anywhere; as far as I know this is a new insight. 01:33:33.300 |
Basically I've got a pre-trained model, which in this case I've trained a few epochs with 01:33:37.940 |
a size of 224x224, and I'm now going to do a few more epochs with a size of 299x299. 01:33:45.340 |
Now I've got very little data by deep learning standards, only about 10,000 images. 01:33:51.340 |
So with 224x224, I built these final layers to try to find things that worked well at that size. 01:33:59.060 |
When I go to 299x299, if I was overfitting before, I'm definitely not going to be overfitting now. 01:34:07.380 |
I've changed the size of my images, they're kind of totally different, but conceptually 01:34:12.820 |
they're still the same kinds of pictures as the same kinds of things. 01:34:16.420 |
So I found this trick of starting training on small images for a few epochs and then 01:34:21.180 |
switching to bigger images and continuing training is an amazingly effective way to avoid overfitting. 01:34:29.380 |
And it's so easy and so obvious, I don't understand why it's never been written about before, maybe 01:34:35.500 |
it's in some paper somewhere and I haven't found it, but I haven't seen it. 01:34:40.900 |
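A minimal sketch of that trick, assuming the learner was first built and trained on 224x224 images with the get_data helper sketched above (the epoch counts are just examples):

```python
learn = ConvLearner.pretrained(arch, get_data(224, bs), precompute=True)
learn.fit(1e-2, 2)                        # a couple of quick epochs on precomputed activations
learn.precompute = False
learn.fit(1e-2, 5, cycle_len=1)           # now with data augmentation

# switch the same model over to bigger images and keep training
learn.set_data(get_data(299, bs))
learn.freeze()                            # still only training the layers we added
learn.fit(1e-2, 3, cycle_len=1)
learn.fit(1e-2, 3, cycle_len=1, cycle_mult=2)

log_preds, y = learn.TTA()                # test-time augmentation at the end
```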
Would it be possible to do the same thing using, let's say, Keras and TensorFlow as well? 01:34:53.700 |
As long as you use one of these more modern architectures, what we call fully convolutional 01:34:58.380 |
architectures, which means not VGG, and you'll see we don't use VGG in this course because 01:35:03.800 |
it doesn't have this property, but most of the architectures developed in the last couple 01:35:08.100 |
of years can handle pretty much arbitrary sizes, it would be worth trying. 01:35:18.060 |
So I call get_data again; remember, get_data is just the little function that I created, 01:35:22.580 |
so I just passed a different size to it. 01:35:27.900 |
And so I call freeze just to make sure that everything except the last layer is frozen, 01:35:34.220 |
I mean, actually it already was frozen at this point, so that didn't really do anything. 01:35:42.700 |
You can see now, with precompute off, I've got data augmentation working, so I kept training. 01:35:49.740 |
And what I notice here is that the loss on my training set is still higher than the loss on my validation set. 01:36:00.860 |
So what this is telling me is I'm underfitting, and if I'm underfitting, it means cycle_len=1 01:36:06.340 |
is too short: it's finding something better, popping out, 01:36:10.980 |
and never getting a chance to zoom in properly. 01:36:14.320 |
So then I set cycle_mult=2 to give it more time, so the first cycle is 1 epoch, the second 01:36:21.580 |
one is 2 epochs, the third one is 4 epochs, and you can see now the validation and training losses are tracking each other much more closely. 01:36:30.700 |
So that's me thinking, yeah, this is on about the right track. 01:36:34.620 |
And so then I tried using test time augmentation to see if that gets any better still, it didn't 01:36:40.020 |
actually help a hell of a lot, just a tiny bit. 01:36:43.620 |
And just at this point I'm thinking this is nearly done, so I just did one more cycle 01:36:49.700 |
of 2 to see if it got any better, and it did get a little bit better. 01:36:54.220 |
And then I'm like okay, that looks pretty good. 01:37:03.980 |
And so you'll notice here I actually haven't tried unfreezing. 01:37:07.540 |
The reason why was when I tried unfreezing and training more, it didn't get any better. 01:37:12.180 |
And the reason for this, clearly, is that this dataset is so similar to ImageNet that 01:37:18.820 |
training those convolutional layers doesn't actually help in the slightest. 01:37:22.780 |
And actually, when I later looked into it, it turns out that the data for this competition is actually a subset of ImageNet. 01:37:31.860 |
So then if we check this out, 0.199 against the leaderboard: this is only a playground 01:37:37.740 |
competition, so it's not like the best people are here, but it's still interesting; it gets us somewhere very near the top. 01:37:49.180 |
In fact we're competing against (I notice this is a fast.ai student) these people up 01:37:58.020 |
here, who I know actually posted that they cheated: they went and downloaded the original labelled data from outside Kaggle. 01:38:07.500 |
This is why this is a playground competition; it's not real, it's just to allow us to try things out. 01:38:14.300 |
We basically see that out of 200 and something people, we're getting some very good results 01:38:23.260 |
without doing anything remotely interesting or clever, and we haven't even used the whole training set. 01:38:29.140 |
To get a better result, I would go back and remove that validation set and just re-run 01:38:33.500 |
the same steps and then submit that, let's just use 100% of the data. 01:38:38.100 |
I have three questions. The first one is that the classes in this case are not balanced; is that a problem? 01:38:55.420 |
It's not totally balanced but it's not bad, it's like between 60 and 100, it's not unbalanced 01:39:04.620 |
enough that I would give it a second thought. 01:39:13.680 |
Let's get to that later in this course, and don't let me forget. 01:39:17.340 |
The short answer is that a paper came out about two or three 01:39:20.500 |
weeks ago on this, and it said the best way to deal with very unbalanced datasets is basically to over-sample the rare classes, i.e. make copies of them. 01:39:31.020 |
My second question is, I want to pin down the difference between precompute and unfreeze. 01:39:44.860 |
So when you are beginning to add data augmentation, you said precompute is true, but in that case the layers are frozen anyway. 01:39:53.180 |
And not only are they frozen, they're pre-computed, so the data augmentation doesn't do anything 01:39:57.740 |
So when you said precompute equals true, before you unfreeze everything, what does it actually do? 01:40:09.020 |
So we're going to learn more about the details as we look into the math and stuff in coming 01:40:13.380 |
lessons, but basically what happened was we started with a pre-trained network, which 01:40:19.940 |
was finding activations that had these kind of rich features. 01:40:26.540 |
And then we add a couple of layers on the end of it, which start out random. 01:40:33.380 |
And so with everything frozen, and indeed with precompute equals true, all we're learning is those couple of layers we added on the end. 01:40:44.080 |
And so with pre-compute equals true, we actually pre-calculate how much does this image have 01:40:49.380 |
something that looks like this eyeball and looks like this face and so forth. 01:40:53.500 |
And therefore data augmentation doesn't do anything with pre-compute equals true, because 01:40:57.300 |
we're actually showing exactly the same activations each time. 01:41:01.980 |
We can then set pre-compute equals false, which means it's still only training those 01:41:06.700 |
last two layers that we added, it's still frozen, but data augmentation is now working 01:41:12.260 |
because it's actually going through and recalculating all of the activations from scratch. 01:41:17.100 |
And then finally, when we unfreeze, that's actually saying okay, now you can go ahead and change all of the other layers as well. 01:41:30.260 |
The only reason to have pre-compute equals true is it's just much faster. 01:41:34.500 |
It's about 10 or more times faster, so particularly if you're working with quite a large data set, 01:41:41.400 |
it can save quite a bit of time, but there's no accuracy reason ever to use precompute equals true. 01:41:52.540 |
It's also quite handy if you're throwing together a quick model: it means it takes a few seconds rather than minutes to get going. 01:42:02.300 |
If your question is like is there some shorter version of this that's a bit quicker and easier, 01:42:40.100 |
I think this is a kind of a minimal version to get you a very good result, which is like 01:42:44.980 |
don't worry about precompute equals true, because that's just saving a little bit of time. 01:42:49.060 |
So I would still suggest use lrfind at the start to find a good learning rate. 01:42:55.940 |
By default, everything is frozen from the start, so you can just go ahead and run 01:42:59.620 |
2 or 3 epochs with cycle_len=1, unfreeze, and then train the rest of the network with differential learning rates. 01:43:07.700 |
So it's basically 3 steps: learning rate finder, train the frozen network with cycle_len=1, 01:43:16.620 |
and then train the unfrozen network with differential learning rates and cycle_mult=2. 01:43:22.140 |
So that's something you could turn into, I guess, 5 or 6 lines of code total. 01:43:30.420 |
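As a hedged sketch, those few lines might look something like this (the learning rate and epoch counts are just examples, and the lrs ratios follow the 10x rule of thumb discussed earlier):

```python
import numpy as np

learn = ConvLearner.pretrained(arch, data)     # frozen by default, no precompute

learn.lr_find()                                # step 1: find a good learning rate
learn.sched.plot()                             # pick a value from the plot, say 1e-2

learn.fit(1e-2, 3, cycle_len=1)                # step 2: train the frozen network

learn.unfreeze()                               # step 3: differential learning rates
lrs = np.array([1e-4, 1e-3, 1e-2])
learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)
```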
By reducing the batch size, does it only affect the speed of training? 01:43:40.020 |
So each batch, and again we're going to see all this stuff about pre-compute and batch 01:43:44.340 |
size as we dig into the details of the algorithm, it's going to make a lot more sense intuitively. 01:43:49.060 |
But basically, if you're showing it fewer images each time, then it's calculating the gradient 01:43:57.300 |
from fewer images, which means it's less accurate, which means knowing which direction to go 01:44:01.740 |
and how far to go in that direction is less accurate. 01:44:05.660 |
So as you make the batch size smaller, you're basically making it more volatile. 01:44:11.340 |
It kind of impacts the optimal learning rate that you would need to use, but in practice, 01:44:22.540 |
since I'm generally only dividing the batch size by 2 or 4, it doesn't seem to change things very much. 01:44:28.700 |
Should I reduce the learning rate accordingly? 01:44:32.300 |
If you change the batch size by much, you can re-run the learning rate finder to see 01:44:36.380 |
if it's changed by much, but since we're generally only looking for the right power of 10, it probably 01:44:41.300 |
isn't going to change by enough to matter. 01:44:45.700 |
This is sort of a conceptual and basic question, so going back to the previous slide where 01:44:54.460 |
Sorry, yeah, this is more of a conceptual and basic question. 01:44:58.460 |
Going back to your previous slide where you showed what the different layers were doing. 01:45:06.460 |
So in this slide, my understanding of the third column relative to the fourth 01:45:12.980 |
column is that you're interpreting what the layer is doing based on which image patches actually activate it? 01:45:21.960 |
Yeah, so we're going to look at this in more detail. 01:45:24.420 |
So these grey ones basically say this is kind of what the filter looks like. 01:45:28.900 |
So on the first layer you can see exactly what the filter looks like because the input 01:45:33.020 |
to it are pixels, so you can draw it exactly; remember, we looked at what a convolutional kernel is. 01:45:40.620 |
So these look like they're 7x7 kernels, and you can say this is actually what the filter looks like. 01:45:45.060 |
But later on, the inputs to it are themselves activations, which are combinations of activations from earlier layers. 01:45:54.100 |
So you can't draw it, but there's a clever technique that Zeiler and Fergus created which 01:45:58.780 |
allowed them to say this is kind of what the filters tended to look like on average, so 01:46:04.100 |
this is kind of what the filters looked like. 01:46:06.020 |
And then here are specific examples of image patches which activated that filter highly. 01:46:14.340 |
So the pictures are the ones that I find more useful, because they tell you what kinds of image features the filter is responding to. 01:46:23.860 |
Well, we may come back to that, if not in this part then in the next part. 01:46:45.100 |
Probably in part 2, actually, because to create these things this paper uses 01:46:50.460 |
something called a deconvolution, which I'm pretty sure we won't cover in this part, but we will then. 01:46:55.460 |
So if you're interested, check out the paper, it's in the notebook, there's a link to it, 01:47:02.620 |
It's a very clever technique and not terribly intuitive. 01:47:11.940 |
So you mentioned that it was good that the dog took up the full picture and it would 01:47:21.540 |
have been a problem if it was kind of like off in one of the corners and really tiny. 01:47:26.900 |
What would your technique have been to try to make that work? 01:47:33.940 |
Something that we'll learn about in part 2, but basically there's a technique that allows 01:47:38.900 |
you to figure out roughly which parts of an image are most likely to have the interesting 01:47:44.860 |
things in them, and then you can crop out those bits. 01:47:49.300 |
If you're interested in learning about it, we did cover it briefly in lesson 7 of part 01:47:54.580 |
1, but I'm going to actually do it properly in part 2 of this course, because I didn't really cover it in enough depth last time. 01:48:03.580 |
Maybe we'll find time to have a quick look at it, but we'll see. 01:48:09.380 |
I know UNET's written some of the code that we need already. 01:48:17.220 |
So once I have something like this notebook that's basically working, I can immediately 01:48:26.380 |
make it better by doing two things. Assuming that the image size I was using is smaller 01:48:33.680 |
than the average size of the images we've been given, 01:48:36.340 |
I can increase the size, and as I showed before with the dog breeds, you can actually increase it partway through training and it keeps working. 01:48:42.820 |
The other thing I can do is to use a better architecture. 01:48:48.180 |
We're going to talk a lot in this course about architectures, but basically there are different 01:48:57.580 |
ways of putting together the convolutional filters: what sizes they are, how they're connected to each other, and so forth. 01:49:06.660 |
Different architectures have different numbers of layers, sizes of kernels, and numbers of filters. 01:49:16.460 |
The one that we've been using, ResNet-34, is a great starting point and often a good 01:49:21.740 |
finishing point because it doesn't have too many parameters, often it works pretty well 01:49:27.140 |
with small amounts of data as we've seen and so forth. 01:49:31.860 |
But there's actually an architecture that I really like, not ResNet but ResNeXt, which 01:49:37.460 |
was actually the second-place winner in last year's ImageNet competition. 01:49:44.620 |
Like ResNet, you can put a number after ResNeXt to say how big it is, and ResNeXt-50 is the one I used next. 01:49:54.700 |
You'll find ResNeXt-50 can take twice as long to train as ResNet-34, and 2 to 4 times as much memory. 01:50:07.060 |
So what I wanted to do was I wanted to rerun that previous notebook with ResNext and increase 01:50:14.380 |
So here I just set the architecture to resnext50 and the size to 299, and then I found that I had to 01:50:20.460 |
take the batch size all the way down to 28 to get it to fit. 01:50:23.740 |
My GPU is 11GB; if you're using AWS or Crestle, I think they're like 12GB, so you might be 01:50:30.300 |
able to make it a bit higher, but this is what I found I had to do. 01:50:33.380 |
So then this is literally a copy of the previous notebook, so you can actually go file, make 01:50:37.820 |
a copy, and then rerun it with these different parameters. 01:50:43.060 |
And so I deleted some of the prose and some of the exploratory stuff; basically 01:50:48.940 |
I said everything else is the same, all the same steps as before, in fact you can kind 01:50:54.460 |
of see what this minimum set of steps looks like. 01:50:56.580 |
I didn't need to worry about learning rate finder, so I just left it as is. 01:51:00.560 |
Transforms, data, learner, fit, precompute=False, fit with cycle_len=1, unfreeze, fit with differential learning rates. 01:51:13.020 |
And you can see here I didn't do the cycle_mult thing, because I found that now that I'm using 01:51:18.400 |
a bigger architecture with more parameters, it was overfitting pretty quickly. 01:51:23.660 |
So rather than cycle_len=1 never finding the right spot, it actually did find the right 01:51:28.420 |
spot, and when I used longer cycle lengths I found that my validation error got worse. 01:51:39.920 |
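The changed parameters at the top of the copied notebook are basically just these. This is a sketch, not the exact notebook; resnext50 is one of the architectures the fastai 0.7 imports provide, though its pretrained weights may need to be downloaded separately:

```python
import numpy as np

arch = resnext50     # instead of resnet34
sz = 299             # bigger images
bs = 28              # small enough to fit in roughly 11GB of GPU memory

tfms = tfms_from_model(arch, sz, aug_tfms=transforms_side_on, max_zoom=1.1)
data = ImageClassifierData.from_paths(PATH, tfms=tfms, bs=bs)
learn = ConvLearner.pretrained(arch, data, precompute=True)

learn.fit(1e-2, 2)
learn.precompute = False
learn.fit(1e-2, 2, cycle_len=1)                # no cycle_mult: it overfits quickly
learn.unfreeze()
learn.fit(np.array([1e-4, 1e-3, 1e-2]), 1, cycle_len=1)
log_preds, y = learn.TTA()                     # averaging over augmented predictions
```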
So check this out though: by using these 3 steps plus TTA, I got 99.75% accuracy. 01:51:49.760 |
That means I have one incorrect dog and four incorrect cats, and when we look at the pictures, this one is not actually a cat. 01:52:01.580 |
This one is not either, so I've actually only got one real mistake, which is my incorrect dog. 01:52:12.080 |
So we're at a point where we're now able to train a classifier that's so good that it basically only gets confused by images that are genuinely ambiguous or mislabeled. 01:52:23.780 |
And so when people say we have super-human image performance now, this is kind of what they mean. 01:52:29.500 |
So when I looked at the dog breed one I did this morning, it was getting dog breeds right that I certainly couldn't tell apart. 01:52:40.240 |
So this is what we can get to if you use a really modern architecture like ResNeXt. 01:52:46.700 |
And this only took, I don't know, 20 minutes to train. 01:52:59.080 |
So if you wanted to do satellite imagery instead, then it's the same thing. 01:53:07.860 |
And in fact the planet satellite dataset is already on Crestle, if you're using Crestle. 01:53:14.940 |
And I just linked it into data/planet, and I can do exactly the same thing. 01:53:21.380 |
I can use ImageClassifierData.from_csv, and you can see these three lines are exactly the same 01:53:30.660 |
as my dog breed lines: how many lines are in the file, grab my validation indexes, and this 01:53:37.060 |
get_data; as you can see, it's identical except I've changed transforms_side_on to transforms_top_down. 01:53:43.060 |
The satellite images are taken top-down, so I can flip them vertically and they still make sense. 01:53:48.820 |
And so you can see here I'm doing this trick where I'm going to use size=64 and train a bit at that small size first. 01:53:57.540 |
And interestingly in this case you can see I want really high learning rates. 01:54:02.380 |
I don't know what it is about this particular dataset, but it's true: clearly I can use much higher learning rates here. 01:54:08.340 |
So I used a learning rate of 0.2, and so I've trained for a while, differential learning 01:54:14.460 |
rates, and so remember I said like if the dataset is very different to ImageNet, I probably 01:54:21.180 |
want to train those middle layers a lot more, so I'm using divided by 3 rather than divided by 10. 01:54:27.020 |
Other than that, here's the same thing, cycle_mult=2, and then I was just kind of repeating the same steps. 01:54:34.620 |
So you can actually plot the loss if you call learn.sched.plot_loss(), and you can see here 01:54:38.540 |
is the first cycle, here's the second cycle, here's the third cycle, so you can see the loss jumping up at the start of each cycle and then improving. 01:54:48.260 |
And each time it finds something better than the last time. 01:54:51.140 |
Then set the size up to 128, and just repeat exactly the last few steps. 01:54:56.580 |
And then set it up to 256, and repeat the last two steps. 01:55:01.860 |
And then do TTA, and if you submit this, then this gets about 30th place in this competition. 01:55:13.140 |
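A hedged sketch of that planet workflow. The planet labels are multi-label, which from_csv handles from the CSV itself; the competition's F2 metric is left out here, and the folder names train-jpg and test-jpg are assumptions based on the Kaggle dataset layout:

```python
import numpy as np

def get_data(sz, bs):
    tfms = tfms_from_model(arch, sz, aug_tfms=transforms_top_down, max_zoom=1.1)
    return ImageClassifierData.from_csv(PATH, 'train-jpg', label_csv, tfms=tfms,
                                        suffix='.jpg', val_idxs=val_idxs,
                                        test_name='test-jpg', bs=bs)

lr  = 0.2                               # unusually high, but it worked for this data
lrs = np.array([lr / 9, lr / 3, lr])    # only ~3x apart: very un-ImageNet-like images

learn = ConvLearner.pretrained(arch, get_data(64, bs))
for sz in (64, 128, 256):               # progressively bigger images
    learn.set_data(get_data(sz, bs))
    learn.freeze()
    learn.fit(lr, 3, cycle_len=1, cycle_mult=2)
    learn.unfreeze()
    learn.fit(lrs, 3, cycle_len=1, cycle_mult=2)

learn.sched.plot_loss()                 # one dip per cycle, each better than the last
log_preds, y = learn.TTA()
```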
This thing where I went all the way back to a size of 64, I wouldn't do that if I was doing 01:55:20.380 |
like dogs and cats or dog breeds, because this is so small that if the thing I was working 01:55:26.140 |
on was very similar to ImageNet, I would kind of destroy those ImageNet weights. 01:55:31.940 |
Like 64 by 64 is so small, but in this case the satellite imagery data is so different 01:55:37.580 |
I really found that it worked pretty well to start right back to these tiny images. 01:55:46.300 |
And interestingly, using this kind of approach, I actually found that even with using only 01:55:51.060 |
128 by 128, I was getting much better Kaggle results than nearly everybody on the leaderboard. 01:55:58.340 |
And when I say 30th place, this is a very recent competition. 01:56:02.100 |
And so I find in the last year, a lot of people have got a lot better at computer vision. 01:56:07.700 |
And so the people in the top 50 in this competition were generally ensembling dozens of models, 01:56:12.860 |
lots of people on a team, lots of pre-processing specific satellite data and so forth. 01:56:18.220 |
So to be able to get 30th using this totally standard technique is pretty cool. 01:56:27.220 |
So now that we've got to this point, we've got through two lessons. 01:56:30.860 |
If you're still here, then hopefully you're thinking okay, this is actually pretty useful, 01:56:37.020 |
I want to do more, in which case Crestle might not be where you want to stay. 01:56:42.580 |
Crestle has its issues, but it's pretty handy and pretty cheap. Something we haven't 01:56:47.580 |
talked about much is Paperspace, which is another great choice, by the way. 01:56:52.020 |
Paperspace is shortly going to be releasing Crestle-like instant Jupyter notebooks; unfortunately 01:56:56.900 |
they're not ready quite yet, but basically they have the best price performance relationship 01:57:04.460 |
right now, and you can SSH into them and use them. 01:57:08.500 |
So they're also a great choice, and probably by the time this is a MOOC, we'll probably 01:57:13.260 |
have a separate lesson showing you how to set up PaperSpace because they're likely to 01:57:19.460 |
But at some point you're probably going to want to look at AWS, a couple of reasons why. 01:57:25.580 |
The first is, as you all know by now, Amazon have been kind enough to donate about $200,000 worth of credits. 01:57:35.940 |
So I want to say thank you very much to Amazon; we've all been given credits, everybody here is getting some. 01:57:45.300 |
So sorry if you're on the MOOC, we didn't get it for you, but AWS credits for everybody. 01:57:52.540 |
But even if you're not here in person, you can get AWS credits from lots of places. 01:57:58.060 |
GitHub has a student pack, Google for GitHub student pack, that's like $150 worth of credits. 01:58:03.660 |
AWS Educate can get credits, these are all for students. 01:58:08.260 |
So there's lots of places you can get started on AWS. 01:58:11.540 |
Pretty much everybody, or at least a lot of the people that you might work with, will be using AWS, so it's worth getting familiar with. 01:58:21.660 |
Right now AWS has the fastest available GPUs that you can get in the cloud, the P3s. 01:58:29.900 |
They're kind of expensive at 3 bucks an hour, but if you've got a model where you've done 01:58:33.740 |
all the steps before and you're thinking this is looking pretty good, for 6 bucks you could run it for a couple of hours on the fastest GPU available. 01:58:44.660 |
We didn't start with AWS because, for one thing, it's twice as expensive as Crestle's cheapest GPU. 01:58:54.380 |
I wanted to go through and show you how to get your AWS set up, so we're going to be going 01:59:00.320 |
slightly over time to do that, but I want to show you very quickly, so feel free to go if you need to. 01:59:06.340 |
But I want to show you very quickly how you can get AWS set up from scratch. 01:59:13.740 |
Basically you have to go to console.aws.amazon.com and it will take you to the console. 01:59:20.500 |
You can follow along on the video with this because I'm going to do it very quickly. 01:59:24.500 |
From here you have to go to EC2, this is where you set up your instances. 01:59:30.300 |
And so from EC2 you need to do what's called launching an instance. 01:59:35.640 |
So launching an instance means you're basically creating a computer, you're creating a computer 01:59:41.060 |
So I say launch instance, and what we've done is we've created a fastai AMI; an AMI is like 01:59:48.620 |
a template for how your computer is going to be created. 01:59:51.500 |
So if you go to Community AMIs and type in fastai, you'll see that there's one there. 02:00:02.500 |
So I'm going to select that, and we need to say what kind of computer do you want. 02:00:08.020 |
And so I can say I want a GPU computer, and then I can say I want a p2.xlarge. 02:00:16.340 |
This is the cheapest, reasonably effective for deep learning instance type they have. 02:00:21.820 |
And then I can say Review and Launch, and then Launch. 02:00:27.820 |
And so at this point, they ask you to choose a key pair. 02:00:33.220 |
Now if you don't have the key pair, you have to create one. 02:00:37.580 |
So to create a key pair, you need to open your terminal. 02:00:46.740 |
If you've got a Mac or a Linux box, you've definitely got one. 02:00:49.480 |
If you've got Windows, hopefully you've got Ubuntu. 02:00:53.220 |
If you don't already have Ubuntu set up, you can go to the Windows Store and click on Ubuntu. 02:01:04.180 |
So from there, you basically go ssh-keygen, and that will create like a special password 02:01:13.860 |
for your computer to be able to log in to Amazon. 02:01:16.780 |
And then you just hit enter three times, and that's going to create your key for you. 02:01:25.020 |
So then what I do is copy that key somewhere I know where it is; it will be in the .ssh folder in your home directory. 02:01:33.180 |
And so I'm going to copy it to my hard drive. 02:01:39.180 |
So if you're on a Mac or on Linux, it will already be in an easy-to-find place; it will 02:01:42.300 |
be in your .ssh folder. Let's put that in Documents. 02:01:49.060 |
So from there, back in AWS, you have to tell it that you've created this key. 02:01:55.060 |
So you can go to Key Pairs, and you say import key pair, and you just browse to that file you saved. 02:02:09.900 |
So if you've ever used ssh before, you've already got the keypair, you don't have to 02:02:15.420 |
If you've used AWS before, you've already imported it, you don't have to do that step. 02:02:19.220 |
If you haven't done any of those things, you have to do both steps. 02:02:24.220 |
So now I can go ahead and launch my instance: Community AMIs, search fastai, select, launch. 02:02:39.540 |
And so now it asks me where's your keypair, and I can choose that one that I just grabbed. 02:02:49.940 |
So this is going to go ahead and create a new computer for me to log into. 02:02:55.380 |
And you can see here, it says the following have been initiated. 02:02:58.100 |
So if I click on that, it will show me this new computer that I've created. 02:03:05.620 |
So to be able to log into it, I need to know its IP address. 02:03:13.740 |
So I can copy that, and that's the IP address of my computer. 02:03:19.420 |
So to get to this computer, I need to SSH to it. 02:03:22.100 |
So SSH-ing into a computer means connecting to that computer so that it's like you're typing directly at its terminal. 02:03:27.180 |
So I type ssh, and the username for this instance is always ubuntu. 02:03:34.260 |
And then I can paste in that IP address, and then there's one more thing I have to do, 02:03:40.020 |
which is that I have to connect the Jupyter notebook port on that instance to the Jupyter notebook port on my own machine. 02:03:46.460 |
And so to do that, there's just a particular flag that I set. 02:03:50.220 |
We can talk about it on the forums as to exactly what it does. 02:03:59.220 |
So once you've done it once, you can save that as an alias and type the same thing each time. 02:04:06.120 |
So we can check here, we can see it says that it's running. 02:04:11.940 |
The first time we ever connect to it, it just checks: is this host OK? 02:04:15.340 |
I'll say yes, and then that goes ahead and SSHes in. 02:04:29.180 |
So you'll find that the very first time you log in, it takes a few extra seconds because 02:04:35.020 |
But once it's logged in, you'll see there that there's a directory called fastai. 02:04:40.340 |
And the fastai directory contains our fastai repo, which contains all the notebooks, all the code, and so on. 02:04:51.700 |
First thing you do when you get in is to make sure it's updated. 02:04:54.420 |
So you just go git pull, and that updates your repo to make sure it's the same as the latest version on GitHub. 02:05:04.380 |
And so as you can see, there we go, let's make sure it's got all the most recent code. 02:05:07.760 |
The second thing you should do is type conda env update. 02:05:11.900 |
You can just do this maybe once a month or so, and that makes sure that the libraries are all up to date. 02:05:18.000 |
I'm not going to run that because it takes a couple of minutes. 02:05:21.100 |
And then the last step is to type Jupyter Notebook. 02:05:27.060 |
So this is going to go ahead and launch the Jupyter Notebook server on this machine. 02:05:33.180 |
And the first time I do it, like the first time you do anything on AWS, it just takes a little while. 02:05:40.740 |
And then once you've done it in the future, it will be just as fast as running it locally. 02:05:46.780 |
So you can see it's going ahead and firing up the notebook. 02:05:49.700 |
And so what's going to happen is that because when we SSH into it, we said to connect our 02:05:55.180 |
notebook port to the remote notebook port, we're just going to be able to use this locally. 02:06:00.340 |
So you can see it says here, copy/paste this URL. 02:06:03.020 |
So I'm going to grab that URL, and I'm going to paste it into my browser, and that's it. 02:06:13.580 |
So this notebook is now actually not running on my machine. 02:06:17.580 |
It's actually running on AWS using the AWS GPU, which has got a lot of memory. 02:06:22.840 |
It's not the fastest around, but it's not terrible. 02:06:26.300 |
You can always fire up a P3 if you want something that's super fast. 02:06:33.520 |
So when you're finished, please don't forget to shut it down. 02:06:37.700 |
So to shut it down, you can right-click on it and say "Instance State, Stop". 02:06:48.340 |
You've each got 500 bucks of credit, assuming that you put your code down in the spreadsheet. 02:06:54.260 |
One thing I forgot to do: the first time I showed you this, by the way, I said make sure you choose the P2 instance type. 02:07:02.060 |
The second time I went through, I didn't choose P2 by mistake, so just don't forget to choose it. 02:07:11.420 |
It's 90 cents an hour, thank you, 90 cents an hour. 02:07:19.260 |
It also costs 3 or 4 bucks a month for the storage as well. 02:07:25.060 |
Alright, see you next week. Sorry we've run a bit over.