Lesson 2: Deep Learning 2018

00:00:00.000 | Okay, so welcome back to deep learning lesson 2.

00:00:07.560 | Last week we got to the point where we had successfully trained a pretty accurate image

00:00:14.160 | classifier.

00:00:16.200 | And so just to remind you about how we did that, can you guys see okay?

00:00:23.440 | Actually we can't turn the front lights off, can you guys all see the screen, okay?

00:00:30.360 | We can turn just these ones, can we?

00:00:33.280 | That pitch is all into darkness, but if that works then...

00:00:38.960 | Okay, that's better isn't it?

00:00:42.320 | Do you mind doing the other two?

00:00:47.280 | Maybe that one as well.

00:01:02.880 | So just to remind you, the way we built this image classifier was we used a small amount

00:01:09.080 | of code, basically three lines of code, and these three lines of code pointed at a particular

00:01:15.680 | path which already had some data in it.

00:01:19.240 | And so the key thing for this to know how to train this model was that this path, which

00:01:24.840 | was data, dogs, cats, had to have a particular structure, which is that it had a train folder

00:01:32.320 | and a valid folder, and in each of those train and valid folders there was a cats folder

00:01:36.960 | and a dogs folder, and each of the cats and the dogs folders was a bunch of images of

00:01:42.000 | cats and dogs.

00:01:43.000 | So this is like a pretty standard, it's one of two main structures that are used to say

00:01:49.400 | here is the data that I want you to train an image model from.

00:01:53.600 | So I know some of you during the week went away and tried different data sets where you

00:01:59.320 | had folders with different sets of images in and created your own image classifiers.

00:02:04.960 | And generally that seems to be working pretty well from what I can see on the forums.

00:02:08.520 | So to make it clear, at this point this is everything you need to get started.

00:02:15.160 | So if you create your own folders with different sets of images, a few hundred or a few thousand

00:02:24.000 | in each folder, and run the same three lines of code, that will give you an image classifier

00:02:31.840 | and you'll be able to see this third column tells you how accurate it is.

00:02:37.900 | So we looked at some kind of simple visualizations to see what was it uncertain about, what was

00:02:48.200 | it wrong about, and so forth, and that's always a really good idea.

00:02:54.520 | And then we learned about the one key number you have to pick.

00:02:57.840 | So this number here is the one key number, this 0.01, and this is called the learning

00:03:03.320 | rate.

00:03:04.320 | So I wanted to go over this again, and we'll learn about the theory behind what this is

00:03:10.720 | during the rest of the course in quite a lot of detail, but for now I just wanted to talk

00:03:14.640 | about the practice.

00:03:17.000 | We're going to talk about the other ones shortly, so the main one we're going to look at for

00:03:41.880 | now is the last column, which is the accuracy.

00:03:46.520 | The first column as you can see is the epoch number, so this tells us how many times has

00:03:51.960 | it been through the entire data set trying to learn a better classifier, and then the

00:03:57.360 | next two columns is what's called the loss, which we'll be learning about either later

00:04:01.960 | today or next week.

00:04:03.480 | The first one is the loss on the training set, these are the images that we're looking

00:04:07.200 | at in order to try to make a better classifier.

00:04:09.760 | The second is the loss on the validation set, these are the images that we're not looking

00:04:13.680 | at when we're training, but we're just setting them aside to see how accurate we are.

00:04:17.920 | So we'll learn about the difference between loss and accuracy later.

00:04:29.760 | So we've got the epoch number, the training loss is the second column, the validation

00:04:35.160 | loss is the third column and the accuracy is the fourth column.

00:04:46.000 | So the basic idea of the learning rate is it's the thing that's going to decide how

00:04:59.120 | quickly do we hone in on the solution.

00:05:04.520 | And so I find that a good way to think about this is to think about what if we were trying

00:05:09.120 | to fit to a function that looks something like this, we're trying to say whereabouts

00:05:17.440 | is the minimum point?

00:05:19.520 | This is basically what we do when we do deep learning, is we try to find the minimum point

00:05:24.080 | of a function.

00:05:26.840 | Now our function happens to have millions or hundreds of millions of parameters, but

00:05:31.000 | it works the same basic way.

00:05:32.440 | And so when we look at it, we can immediately see that the lowest point is here, but how

00:05:39.440 | would you do that if you were a computer algorithm?

00:05:41.880 | And what we do is we start out at some point at random, so we pick say here, and we have

00:05:47.680 | a look and we say what's the loss or the error at this point, and we say what's the gradient,

00:05:53.640 | in other words which way is up and which way is down.

00:05:56.880 | And it tells us that down is going to be in that direction, and it also tells us how fast

00:06:02.000 | is it going down, which at this point is going down pretty quickly.

00:06:07.440 | And so then we take a step in the direction that's down, and the distance we travel is

00:06:13.440 | going to be proportional to the gradient, it's going to be proportional to how steep

00:06:16.480 | it is.

00:06:17.480 | The idea is if it's deeper, then we're probably further away, that's the general idea.

00:06:23.640 | And so specifically what we do is we take the gradient, which is how steep is it at

00:06:27.320 | this point, and we multiply it by some number, and that number is called the learning rate.

00:06:32.120 | So if we pick a number that is very small, then we're guaranteed that we're going to

00:06:38.360 | go a little bit closer and a little bit closer and a little bit closer each time.

00:06:42.000 | But it's going to take us a very long time to eventually get to the bottom.

00:06:47.640 | If we pick a number that's very big, we could actually step too far, we could go in the

00:06:52.680 | right direction, but we could step all the way over to here as a result of which we end

00:06:58.160 | up further away than we started, and we could oscillate and it gets worse and worse.

00:07:03.520 | So if you start training a neural net and you find that your accuracy or your loss is

00:07:08.440 | like spitting off into infinity, almost certainly your learning rate is too high.

00:07:14.520 | So in a sense, learning rate too low is a better problem to have because you're going

00:07:20.800 | to have to wait a long time, but wouldn't it be nice if there was a way to figure out

00:07:25.060 | what's the best learning rate?

00:07:27.000 | Something where you could quickly go like, boom, boom, boom.

00:07:32.000 | And so that's why we use this thing called a learning rate finder.

00:07:36.800 | And what the learning rate finder does is it tries, each time it looks at another, remember

00:07:42.320 | the term minibatch?

00:07:43.320 | Minibatch is a few images that we look at each time so that we're using the parallel

00:07:48.820 | processing power of the GPU effectively.

00:07:50.920 | We look generally at around 64 or 128 images at a time.

00:07:55.600 | For each minibatch, which is labeled here as an iteration, we gradually increase the

00:08:00.400 | learning rate.

00:08:01.400 | In fact, multiplicatively increase the learning rate.

00:08:02.920 | We start at really tiny learning rates to make sure that we don't start at something

00:08:07.920 | too high, and we gradually increase it.

00:08:10.920 | And so the idea is that eventually the learning rate will be so big that the loss will start

00:08:16.680 | getting worse.

00:08:17.680 | So what we're going to do then is look at the plot of learning rate against loss.

00:08:24.920 | So when the learning rate's tiny, it increases slowly, then it starts to increase a bit faster,

00:08:30.960 | and then eventually it starts not increasing as quickly and in fact it starts getting worse.

00:08:36.400 | So clearly here, make sure you're going to be familiar with this scientific notation.

00:08:42.720 | So 10^-1 is 0.1, 10^-2 is 0.01, and when we write this in Python, we'll generally write

00:08:53.520 | it like this.

00:08:54.520 | Rather than writing 10^-1 or 10^-2, we'll just write 1e1 or 1e2.

00:09:02.760 | They mean the same thing, you're going to see that all the time.

00:09:08.720 | And remember that equals 0.1, 0.01.

00:09:17.200 | So don't be confused by this text that it prints out here.

00:09:22.400 | This loss here is the final loss at the end, it's not of any interest.

00:09:28.640 | So ignore this, this is only interesting when we're doing regular training, not interesting

00:09:33.360 | for the learning rate finder.

00:09:34.760 | The thing that's interesting for the learning rate finder is this learn.shed.plot, and specifically

00:09:40.840 | we're not looking for the point where it's the lowest, because the point where it's the

00:09:43.960 | lowest is actually not getting better anymore, so that's too high a learning rate.

00:09:47.840 | So I generally look to see where is it the lowest, and then I go back like 1 over magnitude.

00:09:53.480 | So 1e2 would be a pretty good choice.

00:09:59.400 | So that's why you saw when we ran our fit here, we picked 0.01, which is 1e1-2.

00:10:10.640 | So an important point to make here is this is the one key number that we've learned to

00:10:16.520 | adjust, and if you just adjust this number and nothing else, most of the time you're

00:10:23.540 | going to be able to get pretty good results.

00:10:25.760 | And this is like a very different message to what you would hear or see in any textbook

00:10:31.120 | or any video or any course, because up until now there's been like dozens and dozens, they're

00:10:38.680 | called hyperparameters, dozens and dozens of hyperparameters to set, and they've been

00:10:42.560 | thought of as highly sensitive and difficult to set.

00:10:45.320 | So inside the first AI library, we kind of do all that stuff for you as much as we can.

00:10:52.320 | And during the course, we're going to learn that there are some more we can break to get

00:10:55.680 | slightly better results, but it's kind of like, it's kind of in a funny situation here

00:11:02.560 | because for those of you that haven't done any deep learning before, it's kind of like

00:11:06.320 | oh, this is, that's all there is to it, this is very easy, and then when you talk to people

00:11:11.040 | outside this class, they'll be like deep learning is so difficult, there's so much to set, it's

00:11:15.000 | a real art form, and so that's why there's this difference.

00:11:19.040 | And so the truth is that the learning rate really is the key thing to set, and this ability

00:11:23.200 | to use this trick to figure out how to set it, although the paper is now probably 18

00:11:29.360 | months old, almost nobody knows about this paper, it was from a guy who's not from a famous

00:11:35.200 | research lab, so most people kind of ignored it, and in fact even this particular technique

00:11:39.400 | was one sub-part of a paper that was about something else.

00:11:43.240 | So again, this idea of this is how you can set the learning rate, really nobody outside

00:11:48.320 | this classroom just about knows about it, obviously the guy who wrote it, Leslie Smith

00:11:52.800 | knows about it, so it's a good thing to tell your colleagues about, it's like here is actually

00:11:59.240 | a great way to set the learning rate, and there's even been papers called, like one

00:12:03.520 | of the famous papers is called No More Pesky Learning Rates, which actually is a less effective

00:12:09.080 | technique than this one, but this idea that like setting learning rates is very difficult

00:12:13.760 | and fiddly has been true for most of the kind of deep learning history.

00:12:19.440 | So here's the trick, look at this plot, find kind of the lowest, go back about a multiple

00:12:24.760 | of 10 and try that, and if that doesn't quite work you can always try going back another

00:12:30.520 | multiple of 10, but this has always worked for me so far.

00:12:58.920 | That's a great question.

00:13:00.160 | So we're going to learn during this course about a number of ways of improving gradient

00:13:04.400 | percent, like you mentioned momentum and atom and so forth.

00:13:08.600 | This is orthogonal in fact.

00:13:10.360 | So one of the things the class AI library tries to do is figure out the right gradient

00:13:15.080 | percent version, and in fact behind the scenes this is actually using something called atom.

00:13:19.760 | And so this technique is telling us this is the best learning rate to use, given what

00:13:25.240 | other tweaks you're using in this case, the atom optimizer.

00:13:29.000 | So it's not that there's some compromise between this and some other approaches, this sits

00:13:33.320 | on top of those approaches, and you still have to set the learning rate when you use

00:13:36.760 | other approaches.

00:13:38.160 | So we're trying to find the best kind of optimizer to use for a problem, but you still have to

00:13:42.440 | set the learning rate and this is how we can do it.

00:13:45.320 | And in fact this idea of using this technique on top of more advanced optimizers like atom

00:13:50.080 | I haven't even seen mentioned in a paper before, so I think this is not a huge breakthrough,

00:13:55.600 | it seems obvious, but nobody else seems to have tried it.

00:13:59.200 | So as you can see it works well.

00:14:02.360 | When we use optimizers like atom which have like adaptive learning rates, so when we set

00:14:10.880 | this learning rate, is it like initial learning rate because it changes during the epoch?

00:14:19.480 | So we're going to be learning about things like atom, the details about it later in the

00:14:25.320 | class, but the basic answer is no, even with atom there actually is a learning rate, it's

00:14:35.160 | being basically divided by the average previous gradient and also the recent sum of squares

00:14:42.580 | of gradients.

00:14:43.580 | So there's still like a number called the learning rate.

00:14:46.160 | There isn't even these so-called dynamic learning rate methods that still have a learning rate.

00:14:56.720 | So the most important thing that you can do to make your model better is to give it more

00:15:08.440 | data.

00:15:09.720 | So the challenge that happens is that these models have hundreds of millions of parameters,

00:15:16.280 | and if you train them for a while they start to do what's called overfitting.

00:15:21.200 | And so overfitting means that they're going to start to see like the specific details

00:15:25.400 | of the images you're giving them rather than the more general learning that can transfer

00:15:31.480 | across to the validation set.

00:15:33.960 | So the best thing we can do to avoid overfitting is to find more data.

00:15:38.800 | Now obviously one way to do that would just be to collect more data from wherever you're

00:15:41.880 | getting it from or label more data, but a really easy way that we should always do is

00:15:46.800 | to use something called data augmentation.

00:15:51.040 | So data augmentation is one of these things that in many courses it's not even mentioned

00:15:56.580 | at all or if it is it's kind of like an advanced topic right at the end, but actually it's

00:16:00.520 | like the most important thing that you can do to make a better model.

00:16:05.080 | So it's built into the fast.io library to make it very easy to do.

00:16:08.600 | And so we're going to look at the details of the code shortly, but the basic idea is

00:16:13.680 | that in our initial code we had a line that said image_classifier.data.from_paths and we

00:16:22.840 | passed in the path to our data, and for transforms we passed in basically the size and the architecture.

00:16:29.880 | We'll look at this in more detail shortly.

00:16:32.620 | We just add one more parameter which is what kind of data augmentation do you want to do.

00:16:39.120 | And so to understand data augmentation, it's maybe easiest to look at some pictures of

00:16:44.320 | data augmentation.

00:16:46.120 | So what I've done here, again we'll look at the code in more detail later, but the basic

00:16:50.080 | idea is I've built a data class multiple times, I'm going to do it six times, and each time

00:16:59.920 | I'm going to plot the same cat.

00:17:03.120 | And you can see that what happens is that this cat here is further over to the left,

00:17:08.120 | and this one here is further over to the right, and this one here is flipped horizontally,

00:17:12.800 | and so forth.

00:17:13.800 | So data augmentation, different types of image are going to want different types of data

00:17:20.160 | augmentation.

00:17:22.000 | So for example, if you were trying to recognize letters and digits, you wouldn't want to flip

00:17:28.040 | horizontally because it actually has a different meaning.

00:17:30.880 | Whereas on the other hand, if you're looking at photos of cats and dogs, you probably don't

00:17:36.440 | want to flip vertically because cats aren't generally upside down.

00:17:41.000 | Whereas if you were looking at the current Kaggle competition which is recognizing icebergs

00:17:47.640 | in satellite images, you probably do want to flip them upside down because it doesn't

00:17:52.320 | really matter which way around the iceberg or the satellite was.

00:17:56.480 | So one of the examples of the transform sets we have is "transforms side on".

00:18:03.200 | So in other words, if you have photos that are generally taken from the side, which generally

00:18:07.360 | means you want to be able to flip them horizontally but not vertically, this is going to give

00:18:10.960 | you all the transforms you need for that.

00:18:12.680 | So it will flip them sideways, rotate them by small amounts but not too much, slightly

00:18:18.400 | bury their contrast and brightness, and slightly zoom in and out a little bit and move them

00:18:24.320 | around a little bit.

00:18:25.320 | So each time it's a slightly different, slightly different edge.

00:18:28.760 | I'm getting a couple of questions from people about, could you explain again the reason

00:18:36.360 | why you don't take the minimum of the loss curve but a slightly higher rate?

00:18:42.360 | Also, could people understand if this works for every CNN or for CNN or for every internet?

00:18:51.600 | Could you put your hand up if there's a spare seat next to you?

00:19:11.100 | So there was a question about the learning rate finder about why do we use the learning

00:19:15.240 | rate that's less than the lowest point?

00:19:18.160 | And so the reason why is to understand what's going on with this learning rate finder.

00:19:25.120 | So let's go back to our picture here of how do we figure out what learning rate to use.

00:19:32.980 | And so what we're going to do is we're going to take steps and each time we're going to

00:19:38.320 | double the learning rate, so double the amount by which we're multiplying the gradient.

00:19:44.000 | So in other words, we'd go tiny step, slightly bigger, slightly bigger, slightly bigger,

00:19:49.360 | slightly bigger, slightly bigger, slightly bigger.

00:19:57.040 | And so the purpose of this is not to find the minimum, the purpose of this is to figure

00:20:02.200 | out what learning rate is allowing us to decrease quickly.

00:20:08.000 | So the point at which the loss was lowest here is actually there, but that learning

00:20:13.920 | rate actually looks like it's probably too high, it's going to just jump probably backwards

00:20:18.880 | and forwards.

00:20:20.940 | So instead what we do is we go back to the point where the learning rate is giving us

00:20:26.280 | a quick increase in the loss.

00:20:31.200 | So here is the actual learning rate increasing every single time we look at a new minibatch,

00:20:38.520 | so minibatch or iteration versus learning rate.

00:20:41.640 | And then here is learning rate versus loss.

00:20:43.600 | So here's that point at the bottom where it was now already too high, and so here's the

00:20:48.920 | point where we go back a little bit and it's increasing nice and quickly.

00:20:54.880 | We're going to learn about something called stochastic gradient descent with restarts

00:20:59.200 | shortly where we're going to see, in a sense you might want to go back to 1e3 where it's

00:21:03.660 | actually even steeper still, and maybe we would actually find this will actually learn

00:21:10.400 | even quicker, you could try it, but we're going to see later why actually using a higher

00:21:14.620 | number is going to give us better generalization.

00:21:18.600 | So for now we'll just put that aside.

00:21:20.480 | So as we increase the iterations in the learning rate finder, the learning rate is going up.

00:21:36.320 | This is iteration versus learning rate.

00:21:39.280 | So as we do that, as the learning rate increases and we plot it here, the loss goes down until

00:21:46.400 | we get to the point where the learning rate is too high.

00:21:49.440 | And at that point the loss is now getting worse.

00:21:51.480 | Because I asked the question because you were just indicating that even though the minimum

00:21:56.040 | was at 10^-1, you suggested we should choose 10^-2, but now you're saying maybe we should

00:22:04.160 | go back the other way higher.

00:22:06.200 | I didn't mean to say that, I'm sorry if I said something backwards, so I want to go

00:22:10.000 | back down to a lower learning rate.

00:22:14.080 | So possibly I said higher when I meant higher in this lower learning rate.

00:22:23.680 | Last class you said that all the local minima are the same and this graph also shows the

00:22:30.520 | same.

00:22:31.520 | Is that something that was observed or is there a theory behind it?

00:22:37.560 | That's not what this graph is showing.

00:22:39.720 | This graph is simply showing that there's a point where if we increase the learning

00:22:43.260 | rate more, then it stops getting better and it actually starts getting worse.

00:22:47.840 | The idea that all local minima are the same is a totally separate issue and it's actually

00:22:57.400 | something we'll see a picture of shortly, so let's come back to that.

00:23:00.920 | That's a great question.

00:23:28.840 | I certainly run it once when I start.

00:23:32.920 | Later on in this class, we're going to learn about unfreezing layers, and after I unfreeze

00:23:38.540 | layers I sometimes run it again.

00:23:41.160 | If I do something to change the thing I'm training or change the way I'm training it,

00:23:45.880 | you may want to run it again.

00:23:50.920 | Particularly if you've changed something about how you train unfreezing layers, which we're

00:23:54.800 | going to soon learn about, and you're finding the other training is unstable or too slow,

00:24:02.220 | you can run it again.

00:24:03.220 | There's never any harm in running it.

00:24:05.960 | It doesn't take very long.

00:24:08.620 | That's a great question.

00:24:11.320 | Back to data augmentation.

00:24:14.600 | When we run this little transforms from model function, we pass in augmentation transforms,

00:24:21.920 | we can pass in the main two, a transform side on or transforms top down.

00:24:27.160 | Later on we'll learn about creating your own custom transform lists as well, but for now

00:24:31.980 | because we're taking pictures from the side of cats and dogs, we'll say transform side

00:24:36.420 | on.

00:24:37.760 | Now each time we look at an image, it's going to be zoomed in or out a little bit, moved

00:24:42.480 | around a little bit, rotated a little bit, possibly flipped.

00:24:49.720 | What this does is it's not exactly creating new data, but as far as the convolutional

00:24:54.880 | neural net is concerned, it's a different way of looking at this thing, and it actually

00:24:59.440 | therefore allows it to learn how to recognize cats or dogs from somewhat different angles.

00:25:07.360 | So when we do data augmentation, we're basically trying to say based on our domain knowledge,

00:25:14.240 | here are different ways that we can mess with this image that we know still make it the

00:25:19.360 | same image, and that we could expect that you might actually see that kind of image

00:25:23.640 | in the real world.

00:25:26.260 | So what we can do now is when we call this from_parts function, which we'll learn more

00:25:31.600 | about shortly, we can now pass in this set of transforms which actually have these augmentations

00:25:37.200 | in.

00:25:41.700 | So we're going to start from scratch here, we do a fit, and initially the augmentations

00:25:50.520 | actually don't do anything.

00:25:52.720 | And the reason initially they don't do anything is because we've got here something that says

00:25:57.160 | pre-compute = true.

00:25:58.160 | We're going to go back to this lots of times.

00:26:03.640 | But basically what this is doing is do you remember this picture we saw where we learned

00:26:08.680 | each different layer has these activations that basically look for anything from the

00:26:15.120 | middle of flowers to eyeballs of birds or whatever.

00:26:21.760 | And so literally what happens is that the later layers of this convolutional neural

00:26:28.840 | network have these things called activations.

00:26:32.440 | Activation is a number that says this feature, like eyeball of birds, is in this location

00:26:42.420 | with this level of confidence, with this probability.

00:26:46.240 | And so we're going to see a lot of this later.

00:26:48.560 | But what we can do is we can say, in this we've got a pre-trained network, and a pre-trained

00:26:56.360 | network is one where it's already learned to recognize certain things.

00:26:59.640 | In this case it's learned to recognize the 1.5 million images in the ImageNet dataset.

00:27:05.320 | And so what we could do is we could take the second last layer, so the one which has got

00:27:11.920 | all of the information necessary to figure out what kind of thing a thing is, and we

00:27:16.360 | can save those activations.

00:27:18.320 | So basically saving things saying there's this level of eyeballness here, and this level

00:27:23.600 | of dog's face-ness here, and this level of fluffy ear there, and so forth.

00:27:28.520 | And so we save for every image these activations, and we call them the pre-computed activations.

00:27:36.700 | And so the idea is now that when we want to create a new classifier which can basically

00:27:42.960 | take advantage of these pre-computed activations, we can very quickly train a simple linear model

00:27:53.520 | based on those.

00:27:54.760 | And so that's what happens when we say pre-compute = true.

00:27:58.120 | And that's why, you may have noticed this week, the first time that you run a new model, it

00:28:05.400 | takes a minute or two.

00:28:08.040 | Whereas you saw when I ran it, it took like 5 or 10 seconds, it took you a minute or two,

00:28:12.520 | and that's because it had to pre-compute these activations, it just has to do that once.

00:28:17.800 | If you're using your own computer or AWS, it just has to do it once ever.

00:28:22.960 | If you're using Cressel, it actually has to do it once every single time you rerun Cressel

00:28:30.320 | because Cressel just for these pre-computed activations, it uses a special little kind

00:28:35.640 | of scratch space that disappears each time you restart your Cressel instance.

00:28:40.680 | So other than special Cressel, generally speaking, you just have to run it once ever for a dataset.

00:28:49.560 | So the issue with that is that since we're pre-computed for each image, how much does

00:28:56.040 | it have an ear here and how much does it have a lizard's eyeball there and so forth?

00:29:01.520 | That means that data augmentations don't work.

00:29:04.280 | In other words, even though we're trying to show it a different version of the cat each

00:29:07.180 | time, we've pre-computed the activations for a particular version of that cat.

00:29:13.160 | So in order to use data augmentation, we just have to go learn.precompute=false, and then

00:29:19.760 | we can run a few more APOCs.

00:29:24.000 | And so you can see here that as we run more APOCs, the accuracy isn't particularly getting

00:29:30.520 | better.

00:29:31.520 | That's the bad news.

00:29:32.680 | The good news is that you can see the train loss, this is like a way of measuring the

00:29:39.800 | error of this model, although that's getting better, the error's going down, the validation

00:29:45.080 | error isn't going down, but we're not overfitting.

00:29:49.840 | And overfitting would mean that the training loss is much lower than the validation loss.

00:29:56.160 | We're going to talk about that a lot during this course, but the general idea here is

00:30:00.080 | if you're doing a much better job on the training set than you are on the validation set, that

00:30:05.760 | means your model's not generalizing.

00:30:08.240 | So we're not at that point, which is good, but we're not really improving.

00:30:13.740 | So we're going to have to figure out how to deal with that.

00:30:17.500 | Before we do, I want to show you one other cool trick.

00:30:20.880 | I've added here cycle length=1, and this is another really interesting idea.

00:30:28.940 | Here's the basic idea.

00:30:31.080 | Cycle length=1 enables a fairly recent discovery in deep learning called Stochastic Gradient

00:30:37.320 | % with restarts.

00:30:39.360 | And the basic idea is this.

00:30:44.080 | As you get closer and closer to the right spot, I may want to start to decrease my learning

00:30:56.520 | rate.

00:30:57.520 | Because as I get closer, I'm pretty close down, so let's slow down my steps to try to

00:31:03.700 | get exactly to the right spot.

00:31:06.600 | And so as we do more iterations, our learning rate perhaps should actually go down.

00:31:18.660 | Because as we go along, we're getting closer and closer to where we want to be and we want

00:31:21.820 | to get exactly to the right spot.

00:31:25.580 | So the idea of decreasing the learning rate as you train is called learning rate annealing.

00:31:32.940 | And it's very, very common, very, very popular.

00:31:37.080 | Everybody uses it basically all the time.

00:31:40.320 | The most common kind of learning rate annealing is really horrendously hacky.

00:31:45.640 | It's basically that researchers pick a learning rate that seems to work for a while, and then

00:31:50.720 | when it stops learning well, they drop it down by about 10 times, and then they keep

00:31:54.940 | learning a bit more until it doesn't seem to be improving, and they drop it down by

00:31:58.000 | another 10 times.

00:31:59.680 | That's what most academic research papers and most people in the industry do.

00:32:04.080 | So this would be like stepwise annealing, very manual, very annoying.

00:32:09.620 | A better approach is simply to pick some kind of functional form like a line.

00:32:16.480 | It turns out that a really good functional form is one half of a cosine curve.

00:32:23.720 | And the reason why is that for a while when you're not very close, you kind of have a

00:32:28.200 | really high learning rate, and then as you do get close you kind of quickly drop down

00:32:32.520 | and do a few iterations with a really low learning rate.

00:32:36.040 | And so this is called cosine annealing.

00:32:39.360 | So to those of you who haven't done trigonometry for a while, cosine basically looks something

00:32:44.120 | like this.

00:32:45.120 | So we've picked one little half piece.

00:32:51.640 | So we're going to use cosine annealing.

00:32:55.080 | But here's the thing, when you're in a very high dimensional space, and here we're only

00:33:02.120 | able to show 3 dimensions, but in reality we've got hundreds of millions of dimensions,

00:33:08.440 | we've got lots of different fairly flat points, they're fairly flat points, all of which are

00:33:16.920 | pretty good, but they might differ in a really interesting way, which is that some of those

00:33:22.680 | flat points let me show you.

00:33:33.360 | Let's imagine we've got a surface that looks something like this.

00:33:41.720 | Now imagine that our random guess started here, and our initial learning rate annealing

00:33:49.760 | schedule got us down to here.

00:33:52.720 | Now indeed that's a pretty nice low error, but it probably doesn't generalize very well,

00:33:59.560 | which is to say if we use a different dataset where things are just kind of slightly different

00:34:04.620 | in one of these directions, suddenly it's a terrible solution, whereas over here it's

00:34:11.600 | basically equally good in terms of loss, but it rather suggests that if you have slightly

00:34:18.080 | different datasets that are slightly moved in different directions, it's still going

00:34:21.760 | to be good.

00:34:22.760 | So in other words, we would expect this solution here is probably going to generalize better

00:34:27.920 | than the Sparky one.

00:34:31.840 | So here's what we do, if we've got a bunch of different low bits, then our standard learning

00:34:39.720 | rate annealing approach will go downhill, downhill, downhill, downhill, downhill to

00:34:44.520 | one spot.

00:34:46.960 | But what we could do instead is use a learning rate schedule that looks like this, which is

00:34:54.280 | to say we do a cosine annealing and then suddenly jump up again, do a cosine annealing and then

00:34:58.600 | jump up again.

00:35:00.040 | And so each time we jump up, it means that if we're in a spiky bit and then we suddenly

00:35:05.080 | increase the learning rate and it jumps now all the way over to here, and so then we kind

00:35:10.360 | of learning rate and near, learning rate and near down to here, and then we jump up again

00:35:13.840 | to a high learning rate, and it stays here.

00:35:18.120 | So in other words, each time we jump up the learning rate, it means that if it's in a

00:35:22.480 | nasty spiky part of the surface, it's going to hop out of the spiky part, and hopefully

00:35:27.640 | if we do that enough times, it will eventually find a nice smooth ball.

00:35:35.120 | Could you get the same effect by running multiple iterations through the different randomized

00:35:45.640 | running points so that eventually you explore all possible minimals and then compare them?

00:35:51.080 | Yeah so in fact, that's a great question, and before this approach, which is called

00:36:01.040 | stochastic gradient descent with restarts was created, that's exactly what people used

00:36:07.200 | to do.

00:36:08.200 | They used to create these things called ensembles where they would basically relearn a whole

00:36:13.240 | new model 10 times in the hope that one of them is going to end up being better.

00:36:20.360 | And so the cool thing about this stochastic gradient descent with restarts is that once

00:36:26.720 | we're in a reasonably good spot, each time we jump up the learning rate, it doesn't restart,

00:36:33.120 | it actually hangs out in this nice part of the space and then keeps getting better.

00:36:37.880 | So interestingly it turns out that this approach where we do this a bunch of separate cosine

00:36:44.200 | annealing steps, we end up with a better result than if we just randomly tried a few different

00:36:51.040 | starting points.

00:36:52.800 | So it's a super neat trick and it's a fairly recent development, and again almost nobody's

00:37:01.560 | heard of it, but I found it's now like my superpower.

00:37:07.760 | Using this, along with the learning rate finder, I can get better results than nearly anybody

00:37:15.320 | like in a Kaggle competition in the first week or two, I can jump in, spend an hour

00:37:20.960 | or two and back, I've got a fantastically good result.

00:37:25.440 | And so this is why I didn't pick the point where it's got the steepest slope, I actually

00:37:31.360 | tried to pick something kind of aggressively high, it's still getting down but maybe getting

00:37:36.240 | to the point where it's nearly too high.

00:37:40.240 | Because when we do this stochastic gradient descent with restarts, this 10^-2 represents

00:37:47.400 | the highest number that it uses.

00:37:51.400 | So it goes up to 10^-2 and then goes down, then up to 10^-2 and then down.

00:37:57.180 | So if I use to lower learning rate, it's not going to jump to a different part of the function.

00:38:04.520 | In terms of this part here where it's going down, we change the learning rate every single

00:38:20.680 | mini-batch.

00:38:22.880 | And then the number of times we reset it is set by the cycle length parameter, so 1 means

00:38:30.000 | reset it after every epoch.

00:38:32.860 | So if I had 2 there, it would reset it up to every 2 epochs.

00:38:36.760 | And interestingly this point that when we do the learning rate annealing that we actually

00:38:40.480 | change it every single batch, it turns out to be really critical to making this work,

00:38:47.600 | and again it's very different to what nearly everybody in industry and academia has done

00:38:51.280 | before.

00:38:52.280 | We're going to come back to that multiple times in this course.

00:39:04.700 | So the way this course is going to work is we're going to do a really high-level version

00:39:08.900 | of each thing, and then we're going to come back to it in 2 or 3 lessons and then come

00:39:12.860 | back to it at the end of the course.

00:39:14.400 | And each time we're going to see more of the math, more of the code, and get a deeper view.

00:39:20.060 | We can talk about it also in the forums during the week.

00:39:27.540 | Our main goal is to generalize and we don't want to get those narrow optimals.

00:39:33.300 | Yeah, that's a very good summary.

00:39:34.780 | So with this method, are we keeping track of the minimals and averaging them, and assembling

00:39:41.500 | them?

00:39:42.500 | That's another level of sophistication, and indeed you can see there's something here

00:39:46.340 | called Snapshot Ensembles, so we're not doing it in the code right now.

00:39:52.220 | But yes, if you wanted to make this generalized even better, you can save the weights here

00:39:57.760 | and here and here and then take the average of the predictions.

00:40:02.260 | But for now, we're just going to pick the last one.

00:40:10.260 | If you want to skip ahead, there's a parameter called CycleSaveName, which you can add as

00:40:20.500 | well as CycleLend, and that will save a set of weights at the end of every learning rate

00:40:25.660 | cycle and then you can ensemble them.

00:40:34.660 | So we've got a pretty decent model here, 99.3% accuracy, and we've gone through a few steps

00:40:43.280 | that have taken a minute or two to run.

00:40:46.420 | And so from time to time, I tend to save my weights.

00:40:48.620 | So if you go Learn.Save and then pass in a file name, it's going to go ahead and save

00:40:53.660 | that for you.

00:40:54.660 | Later on, if you go Learn.Load, you'll be straight back to where you came from.

00:40:59.260 | So it's a good idea to do that from time to time.

00:41:03.140 | This is a good time to mention what happens when you do this.

00:41:08.540 | When you go Learn.Save, when you create pre-computed activations, another thing we'll learn about

00:41:13.480 | soon when you create resized images, these are all creating various temporary files.

00:41:20.500 | And so what happens is, if we go to Data, and we go to DogsCats, this is my data folder,

00:41:35.520 | and you'll see there's a folder here called TMP, and so this is automatically created,

00:41:42.820 | and all of my pre-computed activations end up in here.

00:41:46.180 | I mention this because if you're getting weird errors, it might be because you've got some

00:41:53.620 | pre-computed activations that were only half completed, or are in some way incompatible

00:41:59.900 | with what you're doing.

00:42:00.900 | So you can always go ahead and just delete this TMP, this temporary directory, and see

00:42:06.460 | if that causes your error to go away.

00:42:08.260 | This is the fast AI equivalent of turning it off and then on again.

00:42:13.660 | You'll also see there's a directory called Models, and that's where all of these, when

00:42:17.700 | you say .save with a model, that's where that's going to go.

00:42:22.460 | Actually it reminds me, when the Stochastic Gradient Descent with Restarts paper came

00:42:25.900 | out, I saw a tweet that was somebody who was like, "Oh, to make your deep learning work

00:42:29.580 | better, turn it off and then on again."

00:42:31.060 | Is there a question there?

00:42:34.340 | "If I want to say I want to retrain my model from scratch again, do I just delete everything

00:42:43.500 | from that folder?"

00:42:50.220 | If you want to train your model from scratch, there's generally no reason to delete the

00:42:55.300 | pre-computed activations because the pre-computed activations are without any training.

00:43:01.660 | That's what the pre-trained model created with the weights that you downloaded off the internet.

00:43:12.220 | The only reason you want to delete the pre-computed activations is that there was some error caused

00:43:16.780 | by half creating them and crashing or something like that.

00:43:22.260 | As you change the size of your input, change different architectures and so forth, they

00:43:25.980 | all create different sets of activations with different file names, so generally you shouldn't

00:43:30.540 | have to worry about it.

00:43:32.380 | If you want to start training again from scratch, all you have to do is create a new learn object.

00:43:39.540 | So each time you go con-learner.pre-trained, that creates a new object with new sets of

00:43:45.700 | weights we've trained from.

00:43:51.460 | Before our break, we'll finish off by talking about fine-tuning and differential learning

00:43:57.980 | roots.

00:44:00.580 | So far everything we've done has not changed any of these pre-trained filters.

00:44:09.260 | We've used a pre-trained model that already knows how to find at the early stages edges

00:44:17.460 | and gradients, and then corners and curves, and then repeating patterns, and bits of text,

00:44:27.420 | and eventually eyeballs.

00:44:30.320 | We have not re-trained any of those activations, any of those features, or more specifically

00:44:38.580 | any of those weights in the convolutional kernels.

00:44:42.340 | All we've done is we've learned some new layers that we've added on top of these things.

00:44:48.140 | We've learned how to mix and match these pre-trained features.

00:44:53.000 | Now obviously it may turn out that your pictures have different kinds of eyeballs or faces,

00:45:01.700 | or if you're using different kinds of images like satellite images, totally different kinds

00:45:06.140 | of features altogether.

00:45:08.220 | So if you're training to recognize icebergs, you'll probably want to go all the way back

00:45:14.140 | and learn all the way back to different combinations of these simple gradients and edges.

00:45:20.500 | In our case as dogs versus cats, we're going to have some minor differences, but we still

00:45:25.980 | may find it's helpful to slightly tune some of these later layers as well.

00:45:32.660 | So to tell the learner that we now want to start actually changing the convolutional

00:45:38.700 | fielders themselves, we simply say unfreeze.

00:45:42.880 | So a frozen layer is a layer which is not trained, which is not updated.

00:45:47.900 | So unfreeze unfreezes all of the layers.

00:45:51.340 | Now when you think about it, it's pretty obvious that layer 1, which is like a diagonal edge

00:45:59.420 | or a gradient, probably doesn't need to change by much if at all.

00:46:04.620 | From the 1.5 million images on ImageNet, it probably already has figured out pretty well

00:46:09.260 | how to find edges and gradients.

00:46:11.340 | It probably already knows also which kind of corners to look for and how to find which

00:46:16.140 | kinds of curves and so forth.

00:46:17.980 | So in other words, these early layers probably need little if any learning, whereas these

00:46:25.380 | later ones are much more likely to need more learning.

00:46:29.060 | This is universally true regardless of whether you're looking for satellite images of rainforest

00:46:33.900 | or icebergs or whether you're looking for cats versus dogs.

00:46:39.340 | So what we do is we create an array of learning rates where we say these are the learning

00:46:45.700 | rates to use for our additional layers that we've added on top.

00:46:51.980 | These are the learning rates to use in the middle few layers, and these are the learning

00:46:56.900 | rates to use for the first few layers.

00:46:59.780 | So these are the ones for the layers that represent very basic geometric features.

00:47:05.140 | These are the ones that are used for the more complex, sophisticated convolutional features,

00:47:12.540 | and these are the ones that are used for the features that we've added and learned from

00:47:16.300 | scratch.

00:47:17.300 | So we can create an array of learning rates, and then when we call .fit and pass in an

00:47:22.700 | array of learning rates, it's now going to use those different learning rates for different

00:47:27.500 | parts of the model.

00:47:30.140 | This is not something that we've invented, but I'd also say it's so not that common that

00:47:38.700 | it doesn't even have a name as far as I know.

00:47:41.440 | So we're going to call it differential learning rates.

00:47:46.540 | If it actually has a name, or indeed if somebody's actually written a paper specifically talking

00:47:50.980 | about it, I don't know.

00:47:53.340 | There's a great researcher called Jason Yosinski who did write a paper about the idea that

00:47:58.540 | you might want different learning rates and showing why, but I don't think any other library

00:48:02.940 | support it, and I don't know of a name for it.

00:48:07.260 | Having said that though, this ability to unfreeze and then use these differential learning

00:48:13.100 | rates I found is the secret to taking a pretty good model and turning it into an awesome model.

00:48:25.400 | So just to clarify, you have three numbers there, three hyperparameters.

00:48:32.280 | The first one is for the late layers, the other layers are there in your model.

00:48:40.540 | So the short answer is many, many, and they're kind of in groups, and we're going to learn

00:48:46.620 | about the architecture.

00:48:47.620 | This is called a ResNet, a residual network, and it kind of has ResNet blocks.

00:48:52.820 | And so what we're doing is we're grouping the blocks into three groups, so this first

00:48:59.100 | number is for the earliest layers, the ones closest to the pixels that represent like

00:49:06.600 | corners and edges and gradients.

00:49:08.540 | I thought those layers are frozen at first.

00:49:14.940 | They are, right.

00:49:15.940 | So we just said unfreeze.

00:49:16.940 | Unfreeze.

00:49:17.940 | So you're unfreezing them because you have kind of partially trained all the late layers.

00:49:24.380 | We've trained our added layers, yes.

00:49:26.420 | Now you're retraining the old steps.

00:49:28.420 | Exactly.

00:49:29.420 | I see.

00:49:30.420 | And the learning rate is particularly small for the early layers, because you just kind

00:49:35.020 | of want to fine-tune and you don't want it to.

00:49:37.820 | We probably don't want to change them at all, but if it does need to then it can.

00:49:44.340 | Thanks.

00:49:45.340 | No problem.

00:49:49.220 | So using differential random rates, how is it different from grid search?

00:49:55.640 | There's no similarity to grid search.

00:49:57.500 | So grid search is where we're trying to find the best hyperparameter for something.

00:50:03.420 | So for example, you could kind of think of the learning rate finder as a really sophisticated

00:50:10.220 | grid search, which is like trying lots and lots of learning rates to find which one is

00:50:14.580 | best.

00:50:15.580 | But this has nothing to do with that.

00:50:16.660 | This is actually for the entire training from now on, it's actually going to use a different

00:50:21.660 | learning rate for each layer.

00:50:30.620 | So I was wondering, if you have a pre-trained model, then you have to use the same input

00:50:36.380 | dimensions, right?

00:50:37.540 | Because I was thinking, okay, let's say you have these big machines to train these things

00:50:44.500 | and you want to take advantage of it.

00:50:46.260 | How would you go about if you have images that are bigger than the ones that they use?

00:50:51.580 | We're going to be talking about sizes later, but the short answer is that with this library

00:50:55.860 | and the modern architectures we're using, we can use any size we like.

00:51:03.260 | So Jeremy, can we unfreeze just a specific layer?

00:51:08.060 | We can.

00:51:09.060 | We're not doing it yet, but if you wanted to, you can type learn.freeze_to and pass

00:51:13.780 | in a latent number.

00:51:19.900 | Much to my surprise, or at least initially my surprise, it turns out I almost never need

00:51:25.540 | to do that.

00:51:26.540 | I almost never find it helpful, and I think it's because using differential learning rates,

00:51:32.360 | the optimizer can kind of learn just as much as it needs to.

00:51:38.360 | The one place I have found it helpful is if I'm using a really big memory-intensive model

00:51:55.860 | and I'm running out of GPU, the less layers you unfreeze, the less memory it takes and

00:52:03.940 | the less time it takes, so it's that kind of practical aspect.

00:52:06.740 | To make sure I ask the question right, can I just unfreeze a specific layer?

00:52:14.260 | No, you can only unfreeze layers from layer N onwards.

00:52:21.580 | You could probably delve inside the library and unfreeze one layer, but I don't know why

00:52:25.020 | you would.

00:52:29.140 | So I'm really excited to be showing you guys this stuff, because it's something we've been

00:52:32.340 | kind of researching all year is figuring out how to train state-of-the-art models.

00:52:37.700 | And we've kind of found these tiny number of tricks.

00:52:40.940 | And so once we do that, we now go learn.fit, and you can see, look at this, we get right

00:52:46.700 | up to like 99.5% accuracy, which is crazy.

00:52:52.220 | There's one other trick you might see here that as well as using stochastic gradient descent

00:52:56.780 | with restarts, i.e. cycle length = 1, we've done 3 cycles.

00:53:02.980 | Earlier on I lied to you, I said this is the number of epochs, it's actually the number

00:53:06.980 | of cycles.

00:53:07.980 | So if you said cycle length = 2, it would do 3 cycles of each of two epochs.

00:53:16.300 | So here I've said do 3 cycles, yet somehow it's done 7 epochs.

00:53:20.900 | The reason why is I've got one last trick to show you which is cycle_mult = 2.

00:53:25.740 | And to tell you what that does, I'm simply going to show you the picture.

00:53:30.460 | If I go learn.share.plot_learning_rate, there it is.

00:53:34.980 | Now you can see what cycle_mult = 2 is doing.

00:53:38.500 | It's doubling the length of the cycle after each cycle.

00:53:43.020 | And so in the paper that introduced this stochastic gradient descent with restarts, the researcher

00:53:48.500 | kind of said hey, this is something that seems to sometimes work pretty well, and I've certainly

00:53:53.500 | found that often to be the case.

00:53:55.100 | So basically, intuitively speaking, if your cycle length is too short, then it kind of

00:54:04.140 | starts going down to find a good spot and then it pops out.

00:54:07.540 | And it goes down to try and find a good spot and pops out, and it never actually gets to

00:54:10.740 | find a good spot.

00:54:12.260 | So earlier on, you want it to do that because it's trying to find the bit that's smoother.

00:54:17.780 | But then later on you want it to do more exploring, and then more exploring.

00:54:22.700 | So that's why this cycle_mult = 2 thing often seems to be a pretty good approach.

00:54:29.860 | So suddenly we're introducing more and more hyperparameters, having told you there aren't

00:54:34.980 | that many.

00:54:35.980 | But the reason is that you can really get away with just picking a good learning rate,

00:54:41.460 | but then adding these extra tweaks really helps get that extra level up without any

00:54:49.340 | effort.

00:54:50.540 | And so in practice, I find this kind of 3 cycles starting at 1, mult = 2 works very,

00:55:00.380 | very often to get a pretty decent model.

00:55:05.260 | If it doesn't, then often I'll just do 3 cycles of length 2 with no mult.

00:55:12.100 | There's kind of like 2 things that seem to work a lot.

00:55:14.780 | There's not too much fiddling, I find, necessary.

00:55:17.780 | As I say, even if you use this line every time, I'd be surprised if you didn't get

00:55:23.020 | a reasonable result.

00:55:24.260 | Is there a question here?

00:55:29.100 | Why does smoother services correlate to more generalized networks?

00:55:40.100 | So it's kind of this intuitive explanation I tried to give back here.

00:55:47.340 | Which is that if you've got something spiky, and so what this x-axis is showing is how

00:56:03.580 | good is this at recognizing dogs versus cats as you change this particular parameter?

00:56:09.260 | And so for something to be generalizable, it means that we want it to work when we give

00:56:14.460 | it a slightly different data set.

00:56:16.500 | And so a slightly different data set may have a slightly different relationship between

00:56:21.900 | this parameter and how catty versus doggy it is.

00:56:25.180 | It may instead look a little bit like this.

00:56:31.900 | So in other words, if we end up at this point, then it's not going to do a good job on this

00:56:38.140 | slightly different data set, or else if we end up on this point, it's still going to

00:56:41.740 | do a good job on this data set.

00:56:50.860 | So that's what cyclemult equals do.

00:56:53.420 | So we've got one last thing before we're going to take a break, which is we're now going

00:56:57.140 | to take this model, which has 99.5% accuracy, and we're going to try to make it better still.

00:57:03.580 | And what we're going to do is we're not actually going to change the model at all, but instead

00:57:08.500 | we're going to look back at the original visualization we did where we looked at some of our incorrect

00:57:16.460 | pictures.

00:57:17.460 | Now what I've done is I've printed out the whole of these incorrect pictures, but the

00:57:26.340 | key thing to realize is that when we do the validation set, all of our inputs to our model

00:57:37.500 | all the time have to be square.

00:57:40.940 | The reason for that is it's kind of a minor technical detail, but basically the GPU doesn't

00:57:46.740 | go very quickly if you have different dimensions for different images because it needs to be

00:57:51.620 | consistent so that every part of the GPU can do the same thing.

00:57:55.460 | I think this is probably fixable, but now that's the state of the technology we have.

00:57:59.020 | So our validation set, when we actually say for this particular thing is it's a dog, what

00:58:03.980 | we actually do to make it square is we just pick out the square in the middle.

00:58:09.220 | So we would take off its two edges, and so we take the whole height and then as much

00:58:14.360 | of the middle as we can.

00:58:15.940 | And so you can see in this case we wouldn't actually see this dog's head.

00:58:20.480 | So I think the reason this was actually not correctly classified was because the validation

00:58:25.220 | set only got to see the body, and the body doesn't look particularly dog-like or cat-like,

00:58:31.380 | it's not at all sure what it is.

00:58:34.100 | So what we're going to do when we calculate the predictions for our validation set is

00:58:39.740 | we're going to use something called test time augmentation.

00:58:42.860 | And what this means is that every time we decide is this cat or a dog, not in the training

00:58:47.580 | but after we train the model, is we're going to actually take four random data augmentations.

00:58:57.720 | And remember the data augmentations move around and zoom in and out and flip.

00:59:03.580 | So we're going to take four of them at random and we're going to take the original un-augmented

00:59:09.020 | center crop image and we're going to do a prediction for all of those.

00:59:13.180 | And then we're going to take the average of those predictions.

00:59:16.580 | So we're going to say is this a cat, is this a cat, is this a cat, is this a cat.

00:59:21.660 | And so hopefully in one of those random ones we actually make sure that the face is there,

00:59:26.940 | zoomed in by a similar amount to other dog's faces at sea and it's rotated by the amount

00:59:30.820 | that it expects to see it and so forth.

00:59:33.100 | And so to do that, all we have to do is just call tta, tta stands for test time augmentation.

00:59:41.740 | This term of what do we call it when we're making predictions from a model we've trained,

00:59:46.860 | sometimes it's called inference time, sometimes it's called test time, everybody seems to

00:59:50.620 | have a different name.

00:59:51.620 | So tta.

00:59:52.780 | And so when we do that we go learn.tta, check the accuracy, and lo and behold we're now

00:59:57.760 | at 99.65%, which is kind of crazy.

01:00:01.460 | Where's our green box?

01:00:05.580 | But for every bug we're only showing one type of augmentation of a particular image, right?

01:00:13.300 | So when we're training back here, we're not doing any tta.

01:00:17.820 | So you could, and sometimes I've written libraries where after each epoch I run tta to see how

01:00:25.780 | well it's going.

01:00:26.780 | But that's not what's happening here.

01:00:27.780 | I trained the whole thing with training time augmentation, which doesn't have a special

01:00:33.500 | name because that's what we mean.

01:00:34.500 | When we say data augmentation, we mean training time augmentation.

01:00:37.620 | So here every time we showed a picture, we were randomly changing it a little bit.

01:00:42.300 | So each epoch, each of these seven epochs, it was seeing slightly different versions

01:00:45.820 | of the picture.

01:00:47.540 | Having done that, we now have a fully trained model, we then said okay, let's look at the

01:00:52.100 | validation set.

01:00:53.100 | So tta by default uses the validation set and said okay, what are your predictions of

01:00:57.180 | which ones are cats and which ones are dogs.

01:00:59.260 | And it did four predictions with different random augmentations, plus one on the unaugmented

01:01:05.020 | version, averaged them all together, and that's what we got, and that's what we got the accuracy

01:01:09.700 | from.

01:01:10.700 | So is there a high probability of having a sample in tta that was not shown during training?

01:01:17.220 | Yeah, actually every data augmented image is unique because the rotation could be like

01:01:24.220 | 0.034 degrees and zoom could be 1.0165.

01:01:29.500 | So every time it's slightly different.

01:01:31.500 | Okay, thank you.

01:01:32.500 | No problem.

01:01:33.500 | Who's behind you?

01:01:37.300 | What's your, why not use white padding or something like that?

01:01:42.980 | White padding?

01:01:43.980 | Just put like a white border around.

01:01:46.740 | Oh, padding's not, yeah, so there's lots of different types of data augmentation you can

01:01:51.060 | do and so one of the things you can do is to add a border around it.

01:01:56.420 | Basically adding a border around it in my experiments doesn't help, it doesn't make

01:02:00.180 | it any less cat-like, convolutional neural network doesn't seem to find it very interesting.

01:02:07.220 | One thing that I do do, we'll see later, is I do something called reflection padding which

01:02:10.900 | is where I add some borders that are the outside just reflected, it's a way to kind of make

01:02:15.980 | some bigger images.

01:02:16.980 | It works well with satellite imagery in particular, but in general I don't do a lot of padding,

01:02:22.780 | instead I do a bit of zooming.

01:02:25.260 | Let's kind of follow up to that last one, but rather than cropping, just add white space

01:02:34.060 | because when you crop you lose the dog's face, but if you added white space you wouldn't

01:02:38.740 | have.

01:02:39.740 | Yeah, so that's where the reflection padding or the zooming or whatever can help.

01:02:44.820 | So there are ways in the fastai library when you do custom transforms of making that happen.

01:02:53.100 | I find that it kind of depends on the image size, but generally speaking it seems that

01:03:03.540 | using TTA plus data augmentation, the best thing to do is to try to use as large an image

01:03:09.140 | as possible.

01:03:10.140 | So if you kind of crop the thing down and put white borders on top and bottom, it's

01:03:13.620 | now quite a lot smaller.

01:03:16.100 | And so to make it as big as it was before, you now have to use more GPU, and if you're

01:03:19.660 | going to use all that more GPU you could have zoomed in and used a bigger image.

01:03:22.660 | So in my playing around that doesn't seem to be generally as successful.

01:03:28.380 | There is a lot of interest on the topic of how do you do that augmentation in older than

01:03:40.780 | images, in data that is not images?

01:03:47.820 | No one seems to know.

01:03:52.500 | I asked some of my friends in the natural language processing community about this,

01:03:55.980 | and we'll get to natural language processing in a couple of lessons, it seems like it would

01:04:00.260 | be really helpful.

01:04:01.260 | There's been a very, very few examples of people where papers would try replacing synonyms

01:04:07.460 | for instance, but on the whole an understanding of appropriate data augmentation for non-image

01:04:13.300 | domains is under-researched and underdeveloped.

01:04:23.940 | The question was, couldn't we just use a sliding window to generate all the images?

01:04:28.060 | So in that dog picture, couldn't we generate three parts of that, wouldn't that be better?

01:04:34.860 | Yeah, for TTA you mean?

01:04:37.080 | Just in general when you're creating your realism.

01:04:40.180 | For training time, I would say no, that wouldn't be better because we're not going to get as

01:04:43.940 | much variation.

01:04:45.620 | We want to have it one degree off, five degrees off, ten pixels up, lots of slightly different

01:04:51.700 | versions and so if you just have three standard ways, then you're not giving it as many different

01:04:58.020 | ways of looking at the data.

01:05:00.300 | For test time augmentation, having fixed crop locations I think probably would be better,

01:05:08.620 | and I just haven't gotten around to writing that yet.

01:05:11.500 | I have a version in an olden library, I think having fixed crop locations plus random contrast

01:05:19.180 | brightness rotation changes might be better.

01:05:24.060 | The reason I haven't gotten around to it yet is because in my testing it didn't seem to

01:05:27.540 | help in practice very much and it made the code a lot more complicated, so it's an interesting

01:05:32.540 | question.

01:05:33.540 | "I just want to know how this fast AI API is that you're using, is it open source?"

01:05:44.100 | Yeah, that's a great question.

01:05:46.020 | The fast AI library is open source, and let's talk about it a bit more generally.

01:05:54.540 | The fact that we're using this library is kind of interesting and unusual, and it sits

01:05:59.160 | on top of something called PyTorch.

01:06:01.660 | So PyTorch is a fairly recent development, and I've noticed all the researchers that

01:06:12.180 | I respect pretty much are now using PyTorch.

01:06:15.820 | I found in part 2 of last year's course that a lot of the cutting edge stuff I wanted to

01:06:20.220 | teach I couldn't do it in Keras and TensorFlow, which is what we used to teach with, and so

01:06:26.980 | I had to switch the course to PyTorch halfway through part 2.

01:06:31.620 | The problem was that PyTorch isn't very easy to use, you have to write your own training

01:06:37.620 | loop from scratch.

01:06:38.620 | Basically if you write everything from scratch, all the stuff you see inside the fast AI library,

01:06:42.460 | we would have had to have written it to learn.

01:06:46.460 | And so it really makes it very hard to learn deep learning when you have to write hundreds

01:06:50.180 | of lines of code to do anything.

01:06:54.340 | So we decided to create a library on top of PyTorch because our mission is to teach world

01:07:01.580 | class deep learning.

01:07:02.580 | So we wanted to show you how you can be the best in the world at doing X, and we found

01:07:08.060 | that a lot of the world class stuff we needed to show really needed PyTorch, or at least

01:07:13.420 | with PyTorch it was far easier, but then PyTorch itself just wasn't suitable as a first thing

01:07:20.820 | to teach with for new deep learning practitioners.

01:07:25.820 | So we built this library on top of PyTorch, initially heavily influenced by Keras, which

01:07:31.460 | is what we taught last year.

01:07:33.480 | But then we realized we could actually make things much easier than Keras.

01:07:37.380 | So in Keras, if you look back at last year's course notes, you'll find that all of the

01:07:42.500 | code is 2-3 times longer, and there's lots more opportunities for mistakes because there's

01:07:48.380 | just a lot of things you have to get right.

01:07:51.960 | So we ended up building this library in order to make it easier to get into deep learning,

01:07:59.460 | but also easier to get state-of-the-art results.

01:08:03.060 | And then over the last year as we started developing on top of that, we started discovering

01:08:06.980 | that by using this library, it made us so much more productive that we actually started

01:08:13.940 | developing new state-of-the-art results and new methods ourselves, and we started realizing

01:08:18.180 | that there's a whole bunch of papers that have kind of been ignored or lost, which when

01:08:23.180 | you use them it could semi-automate stuff, like learning rate finder that's not in any

01:08:29.100 | other library.

01:08:30.100 | So I kind of got to the point where now not only is kind of fast.ai lets us do things

01:08:37.700 | much easier than any other approach, but at the same time it actually has a lot more sophisticated

01:08:45.620 | stuff behind the scenes than anything else.

01:08:47.260 | So it's kind of an interesting mix.

01:08:52.580 | So we've released this library, at this stage it's like a very early version, and so through

01:08:57.700 | this course, by the end of this course I hope as a group a lot of people are already helping

01:09:03.420 | have developed it into something that's really pretty stable and rock-solid.

01:09:10.300 | And anybody can then use it to build your own models under an open-source license, as

01:09:17.660 | you can see it's available on GitHub.

01:09:23.540 | Behind the scenes it's creating PyTorch models, and so PyTorch models can then be exported

01:09:30.740 | into various different formats.

01:09:34.100 | Having said that, a lot of folks, if you want to do something on a mobile phone, for example,

01:09:38.780 | you're probably going to need to use TensorFlow.

01:09:41.620 | And so later on in this course, we're going to show how some of the things that we're

01:09:46.540 | doing in the fast.ai library you can do in Keras and TensorFlow so you can get a sense

01:09:51.560 | of what the different libraries look like.

01:09:54.300 | Generally speaking, the simple stuff will take you a small number of days to learn to

01:10:01.860 | do it in Keras and TensorFlow versus fast.ai and PyTorch.

01:10:05.460 | And the more complex stuff often just won't be possible.

01:10:09.160 | So if you need it to be in TensorFlow, you'll just have to simplify it often a little bit.

01:10:18.820 | I think the more important thing to realize is every year, the libraries that are available

01:10:26.580 | and which ones are the best totally changes.

01:10:28.580 | So the main thing I hope that you get out of this course is an understanding of the

01:10:32.180 | concepts, like here's how you find a learning rate, here's why differential learning rates

01:10:36.300 | are important, here's how you do learning rate and kneeling, here's what stochastic

01:10:40.740 | gradient acceptance restarts does, so on and so forth.

01:10:45.140 | Because by the time we do this course again next year, the library situation is going

01:10:52.260 | to be different again.

01:10:53.860 | That's a question to ask.

01:10:57.820 | I was wondering if you've had an opinion on Pyro, which is Uber's new release.

01:11:09.340 | I haven't looked at it, no, I'm very interested in probabilistic programming and it's really

01:11:13.780 | cool that it's built on top of PyTorch.

01:11:15.500 | So one of the things we'll learn about in this course is we'll see that PyTorch is much

01:11:18.780 | more than just a deep learning library, it actually lets us write arbitrary GPU accelerated

01:11:26.060 | algorithms from scratch, which we're actually going to do, and Pyro is a great example of

01:11:31.140 | what people are now doing with PyTorch outside of the deep learning world.

01:11:36.500 | Great let's take an 8 minute break and we'll come back at 7.55.

01:11:47.060 | So 99.65% accuracy, what does that mean?

01:11:57.020 | In classification, when we do classification in machine learning, a really simple way to

01:12:03.220 | look at the result of a classification is what's called the confusion matrix.

01:12:07.180 | This is not just deep learning, but any kind of classifier in machine learning where we

01:12:10.900 | say what was the actual truth, there were a thousand cats and a thousand dogs, and of

01:12:17.940 | the thousand actual cats, how many did we predict were cats?

01:12:21.980 | This is obviously in the validation sets, this is the images that we didn't use to train

01:12:26.140 | with.

01:12:27.140 | It turns out there were 998 cats that we actually predicted as cats and 2 that we got wrong.

01:12:33.420 | And then for dogs, there were 995 that we predicted were dogs and then 5 that we got

01:12:38.140 | wrong.

01:12:39.140 | So often these confusion matrices can be helpful, particularly if you've got four or five classes

01:12:44.700 | you're trying to predict which group you're having the most trouble with and you can see

01:12:49.180 | it uses color-coding to highlight the large bits, and you've got to help with the diagonal

01:12:56.540 | is the highlighted section.

01:13:00.540 | So now that we've retrained the model, it can be quite helpful, now it's better to actually

01:13:04.820 | look back and see which ones in particular were incorrect.

01:13:09.180 | And we can see here there were actually only two incorrect cats, it prints out four by

01:13:15.180 | default so you can actually see these two actually less than 0.5 so they weren't wrong.

01:13:20.220 | So it's actually only these two were wrong cats, this one isn't obviously a cat at all.

01:13:27.060 | This one is, but it looks like it's got a lot of weird artifacts and you can't see its

01:13:31.420 | eyeballs at all.

01:13:33.580 | And then here are the 5 wrong dogs, here are 4 of them, that's not obviously a dog.

01:13:42.340 | That looks like a mistake, that looks like a mistake, that one I guess doesn't have enough

01:13:47.020 | information, but I guess it's a mistake.

01:13:50.620 | So we've done a pretty good job here of creating a good classifier based on entering a lot

01:13:57.940 | of Kaggle competitions and comparing results I've done to various research papers.

01:14:01.700 | I can tell you it's a state-of-the-art classifier, it's right up there with the best in the world.

01:14:07.460 | We're going to make it a little bit better in a moment, but here are the basic steps.

01:14:11.380 | So if you want to create a world class image classifier, the steps that we just went through

01:14:16.060 | was that we turned data augmentation on by saying all transforms equals, you either say

01:14:21.940 | side on or top down depending on what you're doing.

01:14:25.100 | Start with pre-compute equals true, find a decent learning rate, we then train just like

01:14:30.540 | at one or two epochs, which takes a few seconds because we've got pre-compute equals true.

01:14:35.940 | Then we turn off pre-compute, which allows us to use data augmentation to do another

01:14:41.320 | two or three epochs, generally with cycle length equals one.

01:14:45.580 | Then I unfreeze all the layers, I then set the earlier layers to be somewhere between

01:14:51.220 | 3 times to 10 times lower learning rate than the previous.

01:15:07.620 | As a rule of thumb, knowing that you're starting with a pre-trained ImageNet model, if you

01:15:13.700 | can see that the things that you're now trying to classify are pretty similar to the kinds

01:15:17.460 | of things in ImageNet, i.e. pictures of normal objects in normal environments, you probably

01:15:22.920 | want about a 10x difference because you think that the earlier layers are probably very

01:15:28.660 | good already.

01:15:29.660 | Whereas if you're doing something like satellite imagery or medical imaging, which is not at

01:15:34.380 | all like ImageNet, then you probably want to be training those earlier layers a lot

01:15:38.140 | more so you might have just a 3x difference.

01:15:42.300 | So that's one change that I make is to try to make it either 10x or 3x.

01:15:52.100 | So then after unfreezing, you can now call LRFind again.

01:15:57.660 | I actually didn't in this case, but once you've unfreezed all the layers, you've turned on

01:16:02.620 | differential learning rates, you can then call LRFind again.

01:16:09.820 | So you can then check does it still look like the same point I had last time as about right.

01:16:15.300 | Something to note is that if you call LRFind having set differential learning rates, the

01:16:21.620 | thing it's actually going to print out is the learning rate of the last layers, because

01:16:25.700 | you've got three different learning rates, so it's actually showing you the last layer.

01:16:29.680 | So then I train the full network with cycle_mult = 2, and until either it starts with a fitting

01:16:35.140 | or I run out of time.

01:16:37.780 | So let me show you.

01:16:39.980 | Let's do this again for a totally different dataset.

01:16:42.800 | So this morning, I noticed that some of you on the forums were playing around with this

01:16:47.420 | playground Kaggle competition, very similar, called dog breed identification.

01:16:54.140 | So the dog breed identification Kaggle challenge is one where you don't actually have to decide

01:17:01.020 | which ones are cats and which ones are dogs, they're all dogs, but you have to decide what

01:17:04.820 | kind of dog it is.

01:17:06.460 | There are 120 different breeds of dogs.

01:17:10.500 | So obviously this could be different types of cells in pathology slides, it could be

01:17:19.460 | different kinds of cancers in CT scans, it could be different kinds of icebergs and satellite

01:17:26.260 | images, whatever, as long as you've got some kind of labeled images.

01:17:31.300 | So I want to show you what I did this morning, it took me about an hour basically to go end

01:17:36.860 | to end from something I've never seen before.

01:17:40.100 | So I downloaded the data from Kaggle, and I'll show you how to do that shortly, but the short

01:17:45.380 | answer is there's something called Kaggle CLI, which is a GitHub project you can search

01:17:49.860 | for and if you read the docs, you basically run kg download, provide the competition name

01:17:55.080 | and it will grab all the data for you to your Cressel or Amazon or whatever instance.

01:18:00.260 | I put it in my data folder and I then went LS and I saw that it's a little bit different

01:18:10.100 | to our previous dataset.

01:18:13.900 | It's not that there's a train folder which has a separate folder for each kind of dog,

01:18:19.860 | but instead it turned out there was a CSV file.

01:18:22.580 | And the CSV file, I read it in with pandas, so pandas is the thing we use in Python to

01:18:28.100 | do structured data analysis like CSV files, so pandas we call pd, that's pretty much universal,

01:18:34.980 | pd.read_csv reads in a CSV file, we can then take a look at it and you can see that basically

01:18:40.220 | it had some kind of identifier and then the breed.

01:18:44.940 | So this is like a different way, this is the second main way that people kind of give you

01:18:49.940 | image labels.

01:18:50.940 | One is to put different images into different folders, the second is generally to give you

01:18:55.060 | some kind of file like a CSV file to tell you here's the image name and here's the label.

01:19:02.460 | So what I then did was I used pandas again to create a pivot table which basically groups

01:19:09.700 | it up just to see how many of each breed there were and I sorted them.

01:19:14.900 | And so I saw they've got about 100 of some of the more common breeds and some of the

01:19:21.380 | less common breeds that got like 60 or so.

01:19:25.540 | All together there were 120 rows and there must have been 120 different breeds represented.

01:19:31.220 | So I'm going to go through the steps, so enable data augmentation.

01:19:37.220 | So to enable data augmentation when we call this transforms from model, you just pass

01:19:41.680 | in an org transforms, in this case I chose Sidon, again these are pictures of dots and

01:19:46.780 | stuff so they're Sidon photos.

01:19:50.780 | We'll talk about MaxZoom in more detail later, but MaxZoom basically says when you do the

01:19:56.260 | data augmentation, we zoom into it by up to 1.1 times, randomly between 1, the original

01:20:05.600 | image size, and 1.1 times.

01:20:07.820 | So it's not always cropping out in the middle or an edge, but it could be cropping out a

01:20:11.740 | smaller part.

01:20:13.580 | So having done that, the key step now is that rather than going from paths, so previously

01:20:20.240 | we went from paths and that tells us that the names of the folders are the names of

01:20:25.100 | the labels.

01:20:26.100 | We go from CSV and we pass in the CSV file that contains the labels.

01:20:31.920 | So we're passing in the path that contains all of the data, the name of the folder that

01:20:37.100 | contains the training data, the CSV that contains the labels.

01:20:43.120 | We need to also tell it where the test set is if we want to submit to Kaggle later, talk

01:20:46.820 | more about that next week.

01:20:49.620 | Now this time, the previous dataset I had actually separated a validation set out into

01:20:57.300 | a separate folder, but in this case you'll see that there is not a separate folder called

01:21:03.100 | validation.

01:21:04.980 | So we want to be able to track how good our performance is locally, so we're going to

01:21:09.620 | have to separate some of the images out to put it into a validation set.

01:21:14.580 | So I do that at random, and so up here you can see I've basically opened up the CSV file,

01:21:22.900 | turned it into a list of rows, and then taken the length of that minus 1, because there's

01:21:28.300 | a header at the top.

01:21:30.580 | And so that's the number of rows in the CSV file, which must be the number of images that

01:21:35.700 | we have.

01:21:36.860 | And then this is a fastai thing, get-cross-validation-indexes, we'll talk about cross-validation later.

01:21:42.900 | But basically if you call this and pass in a number, it's going to return to you by default

01:21:50.100 | a random 20% of the rows to use as your validation set, and you can pass in parameters to get

01:21:56.980 | different amounts.

01:21:58.420 | So this is now going to grab 20% of the data and say this is the indexes, the numbers of

01:22:05.500 | the files which we're going to use as a validation set.

01:22:09.420 | So now that we've got that, let's run this so you can see what that looks like.

01:22:17.580 | So val_indexes is just a big bunch of numbers, and so n is 10,000, and so about 20% of those

01:22:33.060 | is going to be in the validation set.

01:22:35.740 | So when we call from CSV, we can pass in a parameter which is to tell it which indexes

01:22:47.380 | to treat it as a validation set, and so let's pass in those indexes.

01:22:52.820 | One thing that's a little bit tricky here is that the file names actually have a dot

01:23:04.700 | jpg on the end, and these obviously don't have a dot jpg.

01:23:09.260 | So when you call from CSV, you can pass in a suffix that says the labels don't actually

01:23:15.860 | contain the full file names, you need to add this to them.

01:23:22.100 | So that's basically all I need to do to set up my data.

01:23:26.780 | And as a lot of you have noticed during the week, inside that data object, you can actually

01:23:32.820 | get access to the training dataset by saying train-ds, and inside train-ds is a whole bunch

01:23:39.260 | of things including the file names.

01:23:42.060 | So train-ds.file_names contains all of the file names of everything in the training set,

01:23:46.780 | and so here's one file name.

01:23:48.980 | So here's an example of one file name.

01:23:52.680 | So I can now go ahead and open that file and take a look at it.

01:23:56.020 | So the next thing I did was to try and understand what my dataset looks like, and it found an

01:24:01.540 | adorable puppy, so that was very nice.

01:24:03.940 | So I'm feeling good about this.

01:24:05.540 | I also want to know how big are these files, like how big are the images, because that's

01:24:11.420 | a key issue.

01:24:12.420 | If they're huge, I'm going to have to think really carefully about how to deal with huge

01:24:15.500 | images, that's really challenging.

01:24:17.780 | If they're tiny, well that's also challenging.

01:24:21.180 | Most of ImageNet models are trained on either 224x224 or 299x299 images, so any time you

01:24:28.380 | have images in that kind of range, that's really hopeful, you're probably not going to have

01:24:32.980 | to do too much different.

01:24:33.980 | In this case, the first image I looked at was about the right size, so I'm thinking it's

01:24:38.180 | looking pretty hopeful.

01:24:40.340 | So what I did then is I created a dictionary comprehension.

01:24:42.940 | Now if you don't know about list comprehensions and dictionary comprehensions in Python, go

01:24:48.100 | study them.

01:24:49.100 | They're the most useful thing, super handy.

01:24:52.500 | You can see the basic idea here is that I'm going through all of the files, and I'm creating

01:24:56.700 | a dictionary that maps the name of the file to the size of that file.

01:25:04.900 | Again this is a handy little Python feature which I'll let you learn about during the

01:25:07.980 | week if you don't know about it, which is zip, and using this special star notation

01:25:11.900 | is now going to take this dictionary and turn it into the rows and the columns.

01:25:19.380 | So I can now turn those into NumPy arrays, and here are the first 5 row sizes for each

01:25:26.540 | of my images.

01:25:28.260 | And then Matplotlib is something you want to be very familiar with if you do any kind

01:25:32.020 | of data science on machine learning in Python.

01:25:34.380 | Matplotlib we always refer to as PLT, this is a histogram, and so I've got a histogram

01:25:40.620 | of how high, how many rows there are in each image.

01:25:45.140 | So you can see here I'm kind of getting a sense.

01:25:47.540 | Before I start doing any modeling, I kind of need to know what I'm modeling with.

01:25:50.900 | And I can see some of the images are going to be like 2500-3000 pixels high, but most

01:25:55.780 | of them seem to be around 500.

01:25:58.540 | So given that so few of them were bigger than 1000, I used standard NumPy slicing to just

01:26:04.980 | grab those that are smaller than 1000 and histogram that, just to zoom in a little bit.

01:26:09.980 | And I can see here it looks like the vast majority are around 500.

01:26:14.660 | And so this actually also prints out the histogram, so I can actually go through and I can see

01:26:19.580 | here 4500 of them are about 450.

01:26:28.380 | So Jeremy, how many images should we get in the validation set, is it always 20%?

01:26:39.620 | So the size of the validation set, using 20% is fine unless you're kind of feeling like

01:26:48.620 | my data set is really small, I'm not sure that's enough.

01:26:57.980 | Basically think of it this way, if you train the same model multiple times and you're getting

01:27:01.620 | very different validation set results and your validation set is kind of small, like

01:27:05.620 | smaller than 1000 or so, then it's going to be quite hard to interpret how well you're

01:27:12.020 | doing.

01:27:13.020 | This is particularly true, if you care about the third decimal place of accuracy and you've

01:27:19.300 | got 1000 things in your validation set, then you're thinking about a single image changing

01:27:23.620 | plus is what you're looking at.

01:27:26.380 | So it really depends on how much difference you care about.

01:27:34.860 | I would say in general, at the point where you care about the difference between 0.01

01:27:39.620 | and 0.02, the second decimal place, you want that to represent 10 or 20 rows, like changing

01:27:48.100 | the class of 10 or 20 rows, then that's something you can be pretty confident of.

01:27:54.420 | So most of the time, given the data sizes we normally have, 20% seems to work fine.

01:28:05.500 | It depends a lot on specifically what you're doing and what you care about.

01:28:12.140 | And it's not a deep learning specific question either.

01:28:15.780 | So those who are interested in this kind of thing, we're going to look into it in a lot

01:28:18.540 | more detail in our machine learning course, which will also be available online.

01:28:27.940 | So I did the same thing for the columns just to make sure that these aren't super wide,

01:28:31.660 | and I got similar results and checked in and again found that 400-500 seems to be about

01:28:36.820 | the average size.

01:28:38.540 | So based on all of that, I thought this looks like a pretty normal kind of image dataset

01:28:42.900 | that I can probably use pretty normal kinds of models on.

01:28:46.100 | I was also particularly encouraged to see that when I looked at the dog that the dog

01:28:49.940 | takes up most of the frame, so I'm not too worried about cropping problems.

01:28:56.340 | If the dog was just a tiny little piece of one little corner that I'd be thinking about

01:29:01.060 | doing different, maybe zooming in a lot more or something.

01:29:05.420 | Like in medical imaging, that happens a lot, like often the tumor or the cell or whatever

01:29:09.580 | is like one tiny piece and that's much more complex.

01:29:13.540 | So based on all that, this morning I kind of thought, okay, this looks pretty standard.

01:29:19.380 | So I went ahead and created a little function called getData that basically had my normal

01:29:27.500 | two lines of code in it.

01:29:30.740 | But I made it so I could pass in a size and a batch size.

01:29:35.460 | The reason for this is that when I start working with a new dataset, I want everything to go

01:29:39.180 | super fast.

01:29:40.580 | And so if I use small images, it's going to go super fast.

01:29:44.580 | So I actually started out with size=64, just to create some super small images that just

01:29:50.460 | go like a second to run through and see how it went.

01:29:54.260 | Later on, I started using some big images and also some bigger architectures, at which

01:29:59.940 | point I started running out of GPU memory, so I started getting these errors saying CUDA

01:30:05.220 | out of memory error.

01:30:06.780 | When you get a CUDA out of memory error, the first thing you need to do is go kernel restart.

01:30:12.580 | Once you get an out of memory error on your GPU, you can't really recover from it.

01:30:17.140 | It doesn't matter what you do, you have to restart.

01:30:21.780 | And once I restarted, I then just changed my batch size to something smaller.

01:30:26.000 | So when you call createYourData object, you can pass in a batch size parameter.

01:30:35.180 | And I normally use 64 until I hit something that says out of memory, and then I just halve

01:30:40.060 | it.

01:30:41.060 | And if I still get out of memory, I'll just halve it again.

01:30:44.940 | So that's where I created this, to allow me to start making my sizes bigger as I looked

01:30:48.780 | into it more and as I started running out of memory to decrease my batch size.

01:30:53.780 | So at this point, I went through this a couple of iterations, but I basically found everything

01:30:59.940 | was working fine.

01:31:00.940 | Once it was working fine, I set size to 224, and I created my pre-computed equals true.

01:31:09.800 | First time I did that it took a minute to create the pre-computed activations, and then

01:31:13.700 | it ran through this in about 4 or 5 seconds, and you can see I was getting 83% accuracy.

01:31:19.220 | Now remember accuracy means it's exactly right, and so it's predicting out of 120 categories.

01:31:27.420 | So when you see something with 2 classes is 80% accurate versus something with 120 classes

01:31:34.380 | is 80% accurate, they're very different levels.

01:31:38.700 | So when I saw 83% accuracy with just a pre-computed classifier, no data augmentation, no unfreezing,

01:31:46.060 | anything else across 120 classes, oh this looks good.

01:31:51.540 | So then I just kept going through our little standard process.

01:31:57.220 | So then I turned pre-compute off, and cycle length equals 1, and I started doing a few

01:32:08.060 | more cycles, a few more epochs.

01:32:11.940 | So remember an epoch is 1 passed through the data, and a cycle is however many epochs you

01:32:21.760 | said is in a cycle.

01:32:23.100 | It's the learning rate going from the top that you asked for all the way down to 0.

01:32:28.200 | So since here cycle length equals 1, a cycle and an epoch are the same.

01:32:33.780 | So I tried a few epochs, I did actually do the learning rate finder, and I found one

01:32:40.580 | in a 2 again looks fine, it often looks fine.

01:32:44.940 | And I found it kept improving, so I tried 5 epochs and I found my accuracy getting better.

01:32:52.620 | So then I saved that and I tried something which we haven't looked at before, but it's

01:32:58.140 | kind of cool.

01:33:00.300 | If you train something on a smaller size, you can then actually call learn.setData()

01:33:08.340 | and pass in a larger size dataset.

01:33:11.460 | And that's going to take your model, however it's trained so far, and it's going to let

01:33:16.020 | you continue to train on larger images.

01:33:20.420 | And I'll tell you something amazing.

01:33:23.220 | This actually is another way you can get state-of-the-art results, and I've never seen this written

01:33:27.580 | in any paper or discussed anywhere as far as I know this is a new insight.

01:33:33.300 | Basically I've got a pre-trained model, which in this case I've trained a few epochs with

01:33:37.940 | a size of 224x224, and I'm now going to do a few more epochs with a size of 299x299.

01:33:45.340 | Now I've got very little data kind of by deep learning standards, I've only got 10,000 images.

01:33:51.340 | So with a 224x224 I kind of built these final layers to try to find things that worked well

01:33:58.060 | at 224x224.

01:33:59.060 | When I go to 299x299, if I overfit the 4, I'm definitely not going to overfit now.

01:34:07.380 | I've changed the size of my images, they're kind of totally different, but conceptually

01:34:12.820 | they're still the same kinds of pictures as the same kinds of things.

01:34:16.420 | So I found this trick of starting training on small images for a few epochs and then

01:34:21.180 | switching to bigger images and continuing training is an amazingly effective way to avoid

01:34:26.380 | overfitting.

01:34:29.380 | And it's so easy and so obvious, I don't understand why it's never been written about before, maybe

01:34:35.500 | it's in some paper somewhere and I haven't found it, but I haven't seen it.

01:34:40.900 | Would it be possible to do the same thing using let's take care of TensorFlow as well,

01:34:49.340 | to feed an image of different science?

01:34:51.660 | Yeah, I think so.

01:34:53.700 | As long as you use one of these more modern architectures, what we call fully convolutional

01:34:58.380 | architectures, which means not VGG, and you'll see we don't use VGG in this course because

01:35:03.800 | it doesn't have this property, but most of the architectures developed in the last couple

01:35:08.100 | of years can handle pretty much arbitrary sizes, it would be worth trying.

01:35:14.660 | I think it ought to work.

01:35:18.060 | So I call getData again, remember getData is just the little function that I created

01:35:21.580 | back up here.

01:35:22.580 | GetData is just this little function, so I just passed a different size to it.

01:35:27.900 | And so I call freeze just to make sure that everything except the last layer is frozen,

01:35:34.220 | I mean actually it already was at these points that didn't really do anything.

01:35:42.700 | You can see now with free compute off, I've now got data augmentation working, so I kind

01:35:47.500 | of run a few more epochs.

01:35:49.740 | And what I notice here is that the loss of my training set and the loss of my validation

01:35:55.940 | set losses a lot lower than my training set.

01:35:58.580 | This is still just training the last layer.

01:36:00.860 | So what this is telling me is I'm underfitting, and so if I'm underfitting, it means the cycle

01:36:06.340 | length equals 1 is too short, it means it's like finding something better, popping out

01:36:10.980 | and never getting a chance to zoom in properly.

01:36:14.320 | So then I set cycle_malt = 2 to give it more time, so the first time is 1 epoch, the second

01:36:21.580 | one is 2 epochs, the third one is 4 epochs, and you can see now the validation train and

01:36:28.300 | training are about the same.

01:36:30.700 | So that's kind of thinking yeah, this is about the right track.

01:36:34.620 | And so then I tried using test time augmentation to see if that gets any better still, it didn't

01:36:40.020 | actually help a hell of a lot, just a tiny bit.

01:36:43.620 | And just at this point I'm thinking this is nearly done, so I just did one more cycle

01:36:49.700 | of 2 to see if it got any better, and it did get a little bit better.

01:36:54.220 | And then I'm like okay, that looks pretty good.

01:36:57.580 | I've got a validation set loss of 0.199.

01:37:03.980 | And so you'll notice here I actually haven't tried unfreezing.

01:37:07.540 | The reason why was when I tried unfreezing and training more, it didn't get any better.

01:37:12.180 | And so the reason for this clearly is that this data set is so similar to ImageNet, the

01:37:18.820 | training that convolutional layers actually doesn't help in the slightest.

01:37:22.780 | And actually when I later looked into it, it turns out that this competition is actually

01:37:27.180 | using a subset of ImageNet.

01:37:31.860 | So then if we check this out, 0.199 against the leaderboard, this is only a playground

01:37:37.740 | competition so it's not like the best here, but it's still interesting, it gets us somewhere

01:37:45.780 | around 10th or 11th.

01:37:49.180 | In fact we're competing against, I notice this is a first AI student, these people up

01:37:58.020 | here, I know they actually posted that they cheated, they actually went and downloaded

01:38:01.900 | the original images and trained for that.

01:38:07.500 | This is why this is a playground competition, it's not real, it's just to allow us to try

01:38:13.300 | things out.

01:38:14.300 | We basically see out of 200 and something people, we're getting some very good results

01:38:23.260 | without doing anything remotely interesting or clever, and we haven't even used the whole

01:38:26.740 | data set, we've only used 80% of it.

01:38:29.140 | To get a better result, I would go back and remove that validation set and just re-run

01:38:33.500 | the same steps and then submit that, let's just use 100% of the data.

01:38:38.100 | I have three questions, the first one is that class in this case is not balanced, it's not

01:38:54.420 | balanced.

01:38:55.420 | It's not totally balanced but it's not bad, it's like between 60 and 100, it's not unbalanced

01:39:04.620 | enough that I would give it a second thought.

01:39:13.680 | Let's get to that later in this course, and don't let me forget.

01:39:17.340 | The short answer is that there was a recent list, a paper came out about two or three

01:39:20.500 | weeks ago on this and it said the best way to deal with very unbalanced data sets is

01:39:25.100 | to basically make copies of the rare cases.

01:39:31.020 | My second question is, I want to pin down a difference between pre-computed and unscrewed,

01:39:43.300 | so you have these two options here.

01:39:44.860 | So when you are beginning to add data augmentation, you said pre-computed is true, but in that

01:39:50.860 | case, the layers are still true.

01:39:53.180 | And not only are they frozen, they're pre-computed, so the data augmentation doesn't do anything

01:39:56.740 | at that point.

01:39:57.740 | What you said pre-computed equals true, but before you increase everything, what does

01:40:04.660 | exactly do?

01:40:05.660 | You only increase the activation?

01:40:09.020 | So we're going to learn more about the details as we look into the math and stuff in coming

01:40:13.380 | lessons, but basically what happened was we started with a pre-trained network, which

01:40:19.940 | was finding activations that had these kind of rich features.

01:40:26.540 | And then we add a couple of layers on the end of it, which start out random.

01:40:33.380 | And so with everything frozen, and indeed with pre-compute equals true, all we're learning

01:40:40.360 | is those couple of layers that we've added.

01:40:44.080 | And so with pre-compute equals true, we actually pre-calculate how much does this image have

01:40:49.380 | something that looks like this eyeball and looks like this face and so forth.

01:40:53.500 | And therefore data augmentation doesn't do anything with pre-compute equals true, because

01:40:57.300 | we're actually showing exactly the same activations each time.

01:41:01.980 | We can then set pre-compute equals false, which means it's still only training those

01:41:06.700 | last two layers that we added, it's still frozen, but data augmentation is now working

01:41:12.260 | because it's actually going through and recalculating all of the activations from scratch.

01:41:17.100 | And then finally when we unfreeze, that's actually saying okay, now you can go ahead and change

01:41:21.700 | all of these earlier convolutional filters.

01:41:30.260 | The only reason to have pre-compute equals true is it's just much faster.

01:41:34.500 | It's about 10 or more times faster, so particularly if you're working with quite a large data set,

01:41:41.400 | it can save quite a bit of time, but there's no accuracy reason ever to use pre-compute

01:41:50.300 | equals true.

01:41:51.300 | So it's just a short plan.

01:41:52.540 | It's also quite handy if you're throwing together a quick model, it can take a few seconds to

01:41:59.300 | create it.

01:42:02.300 | If your question is like is there some shorter version of this that's a bit quicker and easier,

01:42:29.980 | I could like to delete a few things here.

01:42:40.100 | I think this is a kind of a minimal version to get you a very good result, which is like

01:42:44.980 | don't worry about pre-compute equals true because that's just saving a little bit of

01:42:48.060 | time.

01:42:49.060 | So I would still suggest use lrfind at the start to find a good learning rate.

01:42:55.940 | By default, everything is frozen from the start, so you can just go ahead and run a

01:42:59.620 | 2 or 3 epochs with cycle equals 1, unfreeze, and then train the rest of the network with

01:43:06.500 | differential learning rates.

01:43:07.700 | So it's basically 3 steps, learning rate finder, train frozen network with cycle equals 1,

01:43:16.620 | and then train unfrozen network with differential learning rates and cycle multiples 2.

01:43:22.140 | So that's something you could turn into, I guess, 5 or 6 lines of code total.

01:43:30.420 | By reducing the batch size, does it only affect the speed of training?

01:43:39.020 | Yeah, pretty much.

01:43:40.020 | So each batch, and again we're going to see all this stuff about pre-compute and batch

01:43:44.340 | size as we dig into the details of the algorithm, it's going to make a lot more sense intuitively.

01:43:49.060 | But basically if you're showing it less images each time, then it's calculating the gradient

01:43:57.300 | with less images, which means it's less accurate, which means knowing which direction to go

01:44:01.740 | and how far to go in that direction is less accurate.

01:44:05.660 | So as you make the batch size smaller, you're basically making it more volatile.

01:44:11.340 | It kind of impacts the optimal learning rate that you would need to use, but in practice

01:44:22.540 | I generally find I'm only dividing the batch size by 2 or 4, it doesn't seem to change

01:44:27.700 | things very much.

01:44:28.700 | Should I reduce the learning rate accordingly?

01:44:32.300 | If you change the batch size by much, you can re-run the learning rate finder to see

01:44:36.380 | if it's changed by much, but since we're only generally looking at a power of 10, it probably

01:44:41.300 | is not going to change things enough that you can't think of it as a possibility.

01:44:45.700 | This is sort of a conceptual and basic question, so going back to the previous slide where

01:44:52.460 | you showed.

01:44:53.460 | Could you lift that up at home?

01:44:54.460 | Sorry, yeah, this is more of a conceptual and basic question.

01:44:58.460 | Going back to your previous slide where you showed what the different layers were doing.

01:45:06.460 | So in this slide, I understand the meaning of the third column relative to the fourth

01:45:12.980 | column is that you're interpreting what the layer is doing based on what images actually

01:45:20.960 | trigger that layer.

01:45:21.960 | Yeah, so we're going to look at this in more detail.

01:45:24.420 | So these grey ones basically say this is kind of what the filter looks like.

01:45:28.900 | So on the first layer you can see exactly what the filter looks like because the input

01:45:33.020 | to it are pixels, so you can absolutely say and remember we looked at what a convolutional

01:45:37.220 | kernel was, like a 3x3 thing.

01:45:40.620 | So these look like they're 7x7 kernels, you can say this is actually what it looks like.

01:45:45.060 | But later on the input to it are themselves activations which are combinations of activations

01:45:52.940 | which are combinations of activations.

01:45:54.100 | So you can't draw it, but there's a clever technique that Ziler and Fergus created which

01:45:58.780 | allowed them to say this is kind of what the filters tended to look like on average, so

01:46:04.100 | this is kind of what the filters looked like.

01:46:06.020 | And then here is specific examples of patches of image which activated that filter highly.

01:46:14.340 | So the pictures are the ones that I kind of find more useful because it tells you this

01:46:19.860 | kernel is kind of a unicycle wheel finder.

01:46:23.860 | Well we may come back to that, if not in this part and the next part.

01:46:45.100 | Probably in part 2 actually because this paper uses to create these things, this paper uses

01:46:50.460 | something called a deconvolution which I'm pretty sure we won't do in this part, but

01:46:54.300 | we will do it in part 2.

01:46:55.460 | So if you're interested, check out the paper, it's in the notebook, there's a link to it,

01:47:01.620 | Ziler and Fergus.

01:47:02.620 | It's a very clever technique and not terribly intuitive.

01:47:11.940 | So you mentioned that it was good that the dog took up the full picture and it would

01:47:21.540 | have been a problem if it was kind of like off in one of the corners and really tiny.

01:47:26.900 | What would your technique have been to try to make that work?

01:47:33.940 | Something that we'll learn about in part 2, but basically there's a technique that allows

01:47:38.900 | you to figure out roughly which parts of an image are most likely to have the interesting

01:47:44.860 | things in them, and then you can crop out those bits.

01:47:49.300 | If you're interested in learning about it, we did cover it briefly in lesson 7 of part

01:47:54.580 | 1, but I'm going to actually do it properly in part 2 of this course because I didn't

01:48:01.580 | really cover it thoroughly enough.

01:48:03.580 | Maybe we'll find time to have a quick look at it, but we'll see.

01:48:09.380 | I know UNET's written some of the code that we need already.

01:48:17.220 | So once I have something like this notebook that's basically working, I can immediately

01:48:26.380 | make it better by doing two things, assuming that the size image I was using is smaller

01:48:33.680 | than the average size of the image that we've been given.

01:48:36.340 | I can increase the size, and as I showed before with the dog breeds, you can actually increase

01:48:40.900 | it during training.

01:48:42.820 | The other thing I can do is to use a better architecture.

01:48:48.180 | We're going to talk a lot in this course about architectures, but basically there are different

01:48:57.580 | ways of putting together what size convolutional filters and how they're connected to each other

01:49:03.260 | and so forth.

01:49:06.660 | Different architectures have different numbers of layers and sizes of kernels and number

01:49:10.660 | of filters and so forth.

01:49:16.460 | The one that we've been using, ResNet-34, is a great starting point and often a good

01:49:21.740 | finishing point because it doesn't have too many parameters, often it works pretty well

01:49:27.140 | with small amounts of data as we've seen and so forth.

01:49:31.860 | But there's actually an architecture that I really like called ResNet but ResNext, which

01:49:37.460 | was actually the second place winner in last year's ImageNet competition.

01:49:44.620 | Like ResNet, you can put a number after the ResNext to say how big it is, and my next

01:49:50.780 | step after ResNet-34 is always ResNext-50.

01:49:54.700 | You'll find ResNext-50 can take twice as long as ResNet-34, it can take 2 to 4 times as

01:50:03.260 | much memory as ResNet-34.

01:50:07.060 | So what I wanted to do was I wanted to rerun that previous notebook with ResNext and increase

01:50:12.140 | in the image size to 299.

01:50:14.380 | So here I just said architecture = ResNext-50, size = 299, and then I found that I had to

01:50:20.460 | take the batch size all the way back to 28 to get it to fit.

01:50:23.740 | My GPU is 11GB, if you're using AWS or Cressel, I think they're like 12GB, so you might be

01:50:30.300 | able to make it a bit higher, but this is what I found I had to do.

01:50:33.380 | So then this is literally a copy of the previous notebook, so you can actually go file, make

01:50:37.820 | a copy, and then rerun it with these different parameters.

01:50:43.060 | And so I deleted some of the pros and some of the exploratory stuff to see, basically

01:50:48.940 | I said everything else is the same, all the same steps as before, in fact you can kind

01:50:54.460 | of see what this minimum set of steps looks like.

01:50:56.580 | I didn't need to worry about learning rate finder, so I just left it as is.

01:51:00.560 | Transforms, data = learn = fit, pre-compute = false, fit, fit cycle length = 1, unfreeze,

01:51:09.860 | cycle learning rates, fit some more.

01:51:13.020 | And you can see here I didn't do the cycle-mult thing, because I found now that I'm using

01:51:18.400 | a bigger architecture, it's got more parameters, it was overfitting pretty quickly.

01:51:23.660 | So rather than cycle length = 1, never finding the right spot, it actually did find the right

01:51:28.420 | spot and if I used longer cycle lengths, I found that my validation error was higher

01:51:36.260 | than my training error, it was overfitting.

01:51:39.920 | So check this out though, by using these 3 steps, I got +TTA, I got 99.75.

01:51:48.320 | So what does that mean?

01:51:49.760 | That means I have one incorrect dog, four incorrect cats, and when we look at the pictures

01:51:55.300 | of them, my incorrect dog has a cat in it.

01:52:01.580 | This one is not either, so I've actually got one mistake, and then my incorrect dog is

01:52:09.380 | teeth.

01:52:12.080 | So we're at a point where we're now able to train a classifier that's so good that it

01:52:18.700 | has like basically one mistake.

01:52:23.780 | And so when people say we have super-human image performance now, this is kind of what

01:52:28.220 | they're talking about.

01:52:29.500 | So when I looked at the dog breed one I did this morning, it was getting the dog breeds

01:52:36.580 | much better than I ever could.

01:52:40.240 | So this is what we can get to if you use a really modern architecture like ResNext.

01:52:46.700 | And this only took, I don't know, 20 minutes to train.

01:52:54.880 | So that's kind of where we're up to.

01:52:59.080 | So if you wanted to do satellite imagery instead, then it's the same thing.

01:53:07.860 | And in fact the planet satellite data set is already on Cresol, if you're using Cresol

01:53:12.020 | you can jump straight there.

01:53:14.940 | And I just linked it into data/planet, and I can do exactly the same thing.

01:53:21.380 | I can image classifier from CSV, and you can see these three lines are exactly the same

01:53:30.660 | as my dog breed lines, how many lines are in the file, grab my validation indexes, this

01:53:37.060 | get data, as you can see it's identical except I've changed side on to top down.

01:53:43.060 | The satellite images are like top down, so I can fit them vertically and they still make

01:53:47.820 | sense.

01:53:48.820 | And so you can see here I'm doing this trick where I'm going to do size=64 and train a

01:53:54.380 | little bit, the first learning rate finder.

01:53:57.540 | And interestingly in this case you can see I want really high learning rates.

01:54:02.380 | I don't know what it is about this particular dataset, this is true, but clearly I can use

01:54:07.340 | super high learning rates.

01:54:08.340 | So I used a learning rate of 0.2, and so I've trained for a while, differential learning

01:54:14.460 | rates, and so remember I said like if the dataset is very different to ImageNet, I probably

01:54:21.180 | want to train those middle layers a lot more, so I'm using divided by 3 rather than divided

01:54:26.020 | by 10.

01:54:27.020 | Other than that, here's the same thing, cycle mod equals 2, and then I was just kind of

01:54:33.620 | keeping an eye on it.

01:54:34.620 | So you can actually plot the loss if you go to learn.shed.plotloss, and you can see here

01:54:38.540 | is the first cycle, here's the second cycle, here's the third cycle, so you can see it's

01:54:44.260 | work.

01:54:45.260 | It's better, it pops out.

01:54:46.260 | It's better, it pops out.

01:54:47.260 | It's better, it pops out.

01:54:48.260 | And each time it finds something better than the last time.

01:54:51.140 | Then set the size up to 128, and just repeat exactly the last few steps.

01:54:56.580 | And then set it up to 256, and repeat the last two steps.

01:55:01.860 | And then do TTA, and if you submit this, then this gets about 30th place in this competition.

01:55:10.220 | So these basic steps work super well.

01:55:13.140 | This thing where I went all the way back to a size of 64, I wouldn't do that if I was doing

01:55:20.380 | like dogs and cats or dog breeds, because this is so small that if the thing I was working

01:55:26.140 | on was very similar to ImageNet, I would kind of destroy those ImageNet weights.

01:55:31.940 | Like 64 by 64 is so small, but in this case the satellite imagery data is so different

01:55:36.580 | to ImageNet.

01:55:37.580 | I really found that it worked pretty well to start right back to these tiny images.

01:55:43.280 | It really helped me to avoid overfitting.

01:55:46.300 | And interestingly, using this kind of approach, I actually found that even with using only

01:55:51.060 | 128 by 128, I was getting much better Kaggle results than nearly everybody on the leaderboard.

01:55:58.340 | And when I say 30th place, this is a very recent competition.

01:56:02.100 | And so I find in the last year, a lot of people have got a lot better at computer vision.

01:56:07.700 | And so the people in the top 50 in this competition were generally ensembling dozens of models,

01:56:12.860 | lots of people on a team, lots of pre-processing specific satellite data and so forth.

01:56:18.220 | So to be able to get 30th using this totally standard technique is pretty cool.

01:56:27.220 | So now that we've got to this point, we've got through two lessons.

01:56:30.860 | If you're still here, then hopefully you're thinking okay, this is actually pretty useful,

01:56:37.020 | I want to do more, in which case Cressel might not be where you want to stay.

01:56:42.580 | The issues with Cressel, it's pretty handy, it's pretty cheap, and something we haven't

01:56:47.580 | talked about much is PaperSpace is another great choice, by the way.

01:56:52.020 | PaperSpace is shortly going to be releasing Cressel-like instant Jupyter notebooks, unfortunately

01:56:56.900 | they're not ready quite yet, but basically they have the best price performance relationship

01:57:04.460 | right now, and you can SSH into them and use them.

01:57:08.500 | So they're also a great choice, and probably by the time this is a MOOC, we'll probably

01:57:13.260 | have a separate lesson showing you how to set up PaperSpace because they're likely to

01:57:18.140 | be a great option.

01:57:19.460 | But at some point you're probably going to want to look at AWS, a couple of reasons why.

01:57:25.580 | The first is, as you all know by now, Amazon have been kind enough to donate about $200,000

01:57:33.020 | worth of compute time to this course.

01:57:35.940 | So I want to say thank you very much to Amazon, we've all been given credits, everybody is

01:57:41.220 | here.

01:57:42.220 | Thanks very much to AWS.

01:57:45.300 | So sorry if you're on the MOOC, we didn't get it for you, but AWS credits for everybody.

01:57:52.540 | But even if you're not here in person, you can get AWS credits from lots of places.

01:57:58.060 | GitHub has a student pack, Google for GitHub student pack, that's like $150 worth of credits.

01:58:03.660 | AWS Educate can get credits, these are all for students.

01:58:08.260 | So there's lots of places you can get started on AWS.

01:58:11.540 | Pretty much everybody, a lot of the people that you might work with will be using AWS

01:58:19.580 | because it's super flexible.

01:58:21.660 | Right now AWS has the fastest available GPUs that you can get in the cloud, the P3s.

01:58:29.900 | They're kind of expensive at 3 bucks an hour, but if you've got a model where you've done

01:58:33.740 | all the steps before and you're thinking this is looking pretty good, for 6 bucks you could

01:58:38.180 | get a P3 for 2 hours and run at turbo speed.

01:58:44.660 | We didn't start with AWS because A, it's twice as expensive as Cressel, the cheapest GPU.

01:58:51.540 | And B, it takes some setup.

01:58:54.380 | I wanted to go through and show you how to get your AWS setup, so we're going to be going

01:59:00.320 | slightly over time to do that, but I want to show you very quickly, so feel free to go

01:59:05.340 | if you have to.

01:59:06.340 | But I want to show you very quickly how you can get your AWS setup from scratch.

01:59:13.740 | Basically you have to go to console.aws.amazon.com and it will take you to the console.

01:59:20.500 | You can follow along on the video with this because I'm going to do it very quickly.

01:59:24.500 | From here you have to go to EC2, this is where you set up your instances.

01:59:30.300 | And so from EC2 you need to do what's called launching an instance.

01:59:35.640 | So launching an instance means you're basically creating a computer, you're creating a computer

01:59:40.060 | on Amazon.

01:59:41.060 | So I say launch instance, and what we've done is we've created a fast AMI, an AMI is like

01:59:48.620 | a template for how your computer is going to be created.

01:59:51.500 | So if you've got a community AMIs and type in fastai, you'll see that there's one there

01:59:57.180 | called fastai part 1 version 2 for the P2.

02:00:02.500 | So I'm going to select that, and we need to say what kind of computer do you want.

02:00:08.020 | And so I can say I want a GPU computer, and then I can say I want a P2 x lives.

02:00:16.340 | This is the cheapest, reasonably effective for deep learning instance type they have.

02:00:21.820 | And then I can say launch, and then I can say launch.

02:00:27.820 | And so at this point, they ask you to choose a key pair.

02:00:33.220 | Now if you don't have the key pair, you have to create one.

02:00:37.580 | So to create a key pair, you need to open your terminal.

02:00:46.740 | If you've got a Mac or a Linux box, you've definitely got one.

02:00:49.480 | If you've got Windows, hopefully you've got Ubuntu.

02:00:53.220 | If you don't already have Ubuntu set up, you can go to the Windows Store and click on Ubuntu.

02:01:04.180 | So from there, you basically go ssh-keygen, and that will create like a special password

02:01:13.860 | for your computer to be able to log in to Amazon.

02:01:16.780 | And then you just hit enter three times, and that's going to create for you your key that

02:01:22.580 | you can use to get into Amazon.

02:01:25.020 | So then what I do is I copy that key somewhere that I know where it is, so it will be in

02:01:28.540 | the .ssh folder, and it's called idrsa.pub.

02:01:33.180 | And so I'm going to copy it to my hard drive.

02:01:39.180 | So if you're in a Mac or on Linux, it will already be in an easy to find place, it will

02:01:42.300 | be in your .ssh folder, let's put that in documents.

02:01:49.060 | So from there, back in AWS, you have to tell it that you've created this key.

02:01:55.060 | So you can go to keypairs, and you say import keypair, and you just browse to that file

02:02:02.020 | that you just created.

02:02:03.940 | There it is, I say import.

02:02:09.900 | So if you've ever used ssh before, you've already got the keypair, you don't have to

02:02:14.140 | do those steps.

02:02:15.420 | If you've used AWS before, you've already imported it, you don't have to do that step.

02:02:19.220 | If you haven't done any of those things, you have to do both steps.

02:02:24.220 | So now I can go ahead and launch my instance, community-amis-search-fastai-select-launch.

02:02:39.540 | And so now it asks me where's your keypair, and I can choose that one that I just grabbed.

02:02:49.940 | So this is going to go ahead and create a new computer for me to log into.

02:02:55.380 | And you can see here, it says the following have been initiated.

02:02:58.100 | So if I click on that, it will show me this new computer that I've created.

02:03:05.620 | So to be able to log into it, I need to know its IP address.

02:03:11.420 | So here it is, the IP address there.

02:03:13.740 | So I can copy that, and that's the IP address of my computer.

02:03:19.420 | So to get to this computer, I need to SSH to it.

02:03:22.100 | So SSH into a computer means connecting to that computer so that it's like you're typing

02:03:26.100 | that computer.

02:03:27.180 | So I type SSH, and the username for this instance is always Ubuntu.

02:03:34.260 | And then I can paste in that IP address, and then there's one more thing I have to do,

02:03:40.020 | which is I have to connect up the Jupyter notebook on that instance to the Jupyter notebook

02:03:45.060 | on my machine.

02:03:46.460 | And so to do that, there's just a particular flag that I set.

02:03:50.220 | We can talk about it on the forums as to exactly what it does.

02:03:53.260 | But you just type -l 888 localhost 8888.

02:03:59.220 | So once you've done it once, you can save that as an alias and type in the same thing

02:04:03.360 | every time.

02:04:06.120 | So we can check here, we can see it says that it's running.

02:04:08.880 | So we should be able to now hit enter.

02:04:11.940 | First time ever we connect to it, it just checks, this is OK.

02:04:15.340 | I'll say yes, and then that goes ahead and SSH is in.

02:04:22.980 | So this AMI is all set up for you.

02:04:29.180 | So you'll find that the very first time you log in, it takes a few extra seconds because

02:04:32.740 | it's getting everything set up.

02:04:35.020 | But once it's logged in, you'll see there that there's a directory called fastai.

02:04:40.340 | And the fastai directory contains our fastai repo that contains all the notebooks, all the

02:04:47.760 | code, etc.

02:04:48.760 | So I can just go cd fastai.

02:04:51.700 | First thing you do when you get in is to make sure it's updated.

02:04:54.420 | So you just go git pull, and that updates to make sure that your repo is the same as

02:05:01.420 | the most recent repo.

02:05:04.380 | And so as you can see, there we go, let's make sure it's got all the most recent code.

02:05:07.760 | The second thing you should do is type condor end update.

02:05:11.900 | You can just do this maybe once a month or so, and that makes sure that the library is

02:05:15.660 | there, all the most recent libraries.

02:05:18.000 | I'm not going to run that because it takes a couple of minutes.

02:05:21.100 | And then the last step is to type Jupyter Notebook.

02:05:27.060 | So this is going to go ahead and launch the Jupyter Notebook server on this machine.

02:05:33.180 | And the first time I do it, the first time you do everything on AWS, it just takes like

02:05:38.620 | a minute or two.

02:05:40.740 | And then once you've done it in the future, it will be just as fast as running it locally.

02:05:46.780 | So you can see it's going ahead and firing up the notebook.

02:05:49.700 | And so what's going to happen is that because when we SSH into it, we said to connect our

02:05:55.180 | notebook port to the remote notebook port, we're just going to be able to use this locally.

02:06:00.340 | So you can see it says here, copy/paste this URL.

02:06:03.020 | So I'm going to grab that URL, and I'm going to paste it into my browser, and that's it.

02:06:13.580 | So this notebook is now actually not running on my machine.

02:06:17.580 | It's actually running on AWS using the AWS GPU, which has got a lot of memory.

02:06:22.840 | It's not the fastest around, but it's not terrible.

02:06:26.300 | You can always fire up a P3 if you want something that's super fast.

02:06:29.860 | This is costing me 90 cents a minute.

02:06:33.520 | So when you're finished, please don't forget to shut it down.

02:06:37.700 | So to shut it down, you can right-click on it and say "instance date, stop".

02:06:48.340 | We've got 500 bucks of credit, assuming that you put your code down in a spreadsheet.

02:06:54.260 | One thing I forgot to do, the first time I showed you this, by the way I said, make sure

02:06:58.300 | you choose a P2.

02:07:02.060 | The second time I went through I didn't choose P2 by mistake, so just don't forget to choose

02:07:06.420 | gpu compute P2.

02:07:08.500 | Do you have a question?

02:07:11.420 | I've got 90 cents an hour, thank you, 90 cents an hour.

02:07:19.260 | It also costs 3 or 4 bucks a month for the storage as well.

02:07:24.060 | Thanks for checking that.

02:07:25.060 | Alright, see you next week. Sorry we haven't been over.

02:07:27.300 | [BLANK_AUDIO]

Lesson 2: Deep Learning 2018

Chapters