Lesson 17: Deep Learning Foundations to Stable Diffusion
Chapters
0:00 Changes to previous lesson
7:50 Trying to get 90% accuracy on Fashion-MNIST
11:58 Jupyter notebooks and GPU memory
14:59 Autoencoder or Classifier
16:05 Why do we need a mean of 0 and standard deviation of 1?
21:21 What exactly do we mean by variance?
25:56 Covariance
29:33 Xavier Glorot initialization
35:27 ReLU and Kaiming He initialization
36:52 Applying an init function
38:59 Learning rate finder and MomentumLearner
40:10 What's happening in each stride-2 convolution?
42:32 Normalizing input matrix
46:09 85% accuracy
47:30 Using with_transform to modify input data
48:18 ReLU and 0 mean
52:06 Changing the activation function
55:09 87% accuracy and nice looking training graphs
57:16 “All You Need Is a Good Init”: Layer-wise Sequential Unit Variance
63:55 Batch Normalization, Intro
66:39 Layer Normalization
75:47 Batch Normalization
83:28 Batch Norm, Layer Norm, Instance Norm and Group Norm
86:11 Putting it all together: towards 90%
88:42 Accelerated SGD
93:32 Regularization
97:37 Momentum
105:32 Batch size
106:37 RMSProp
111:27 Adam: RMSProp plus Momentum
Hi everybody, and welcome to lesson 17 of Practical Deep Learning for Coders. I'm really excited about what we're going to look at over the next lesson or two. It's actually been turning out really well, much better than I could have hoped, so I can't wait to dive in. Before I do, I'm just going to mention a couple of minor changes that I made to our miniai library this week.
One was that I went back to our Callback class in the learner notebook and decided, in the end, to add a dunder getattr to it. For four attributes it passes the lookup down to self.learn: in a callback, model gives you self.learn.model, opt gives self.learn.opt, batch gives self.learn.batch, and epoch gives self.learn.epoch. You can change these: you could subclass Callback and add your own attributes to _forward, or remove things from it. But these four are things I access a lot, and I was sick of typing self.learn. I also added one more property, training, which saves typing self.learn.model.training; since we have model you could already drop the learn, but you check training so often that now you can just write self.training in a callback. So that was one change I made.
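As a rough sketch of what that looks like (the names follow the description here, the real miniai class may differ in detail, and self.learn is assumed to be set by the learner when it registers the callback):

```python
class Callback():
    order = 0

    # attributes forwarded straight to the learner, so a callback can write
    # self.model instead of self.learn.model
    _forward = ('model', 'opt', 'batch', 'epoch')

    def __getattr__(self, name):
        if name in self._forward: return getattr(self.learn, name)
        raise AttributeError(name)

    @property
    def training(self): return self.learn.model.training
```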
The second change: I found myself getting a bit bored of adding train_cb every time, so I took the four training methods from the MomentumLearner subclass and moved them, along with zero_grad, into a TrainLearner subclass. MomentumLearner now inherits from TrainLearner and just adds its slightly quirky momentum behaviour by changing what zero_grad does. We'll be using TrainLearner quite a bit over the next lesson or two: it's just a Learner with the usual training steps, exactly the same as fastai has, or as you'd have in most PyTorch training loops. Obviously by using it you lose the ability to change those steps with a callback, so it's a little bit less flexible. Okay, so those are the little changes. Then I made some changes to what we looked at last week, which is the activations notebook.
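As a sketch of the idea (the exact signatures in miniai may differ slightly; Learner here is the class from the earlier learner notebook):

```python
import torch
from torch import optim

class TrainLearner(Learner):
    # the usual training steps, hard-coded rather than driven by callbacks
    def predict(self):   self.preds = self.model(self.batch[0])
    def get_loss(self):  self.loss = self.loss_func(self.preds, self.batch[1])
    def backward(self):  self.loss.backward()
    def step(self):      self.opt.step()
    def zero_grad(self): self.opt.zero_grad()

class MomentumLearner(TrainLearner):
    def __init__(self, model, dls, loss_func, lr=None, cbs=None, opt_func=optim.SGD, mom=0.85):
        self.mom = mom
        super().__init__(model, dls, loss_func, lr=lr, cbs=cbs, opt_func=opt_func)

    # instead of zeroing the gradients, scale them down: the next backward() adds
    # new gradients on top, giving an exponentially weighted sum, i.e. momentum
    def zero_grad(self):
        with torch.no_grad():
            for p in self.model.parameters(): p.grad *= self.mom
```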
Specifically, I added a HooksCallback. Previously we had a Hooks class, and it didn't really require too much ceremony to use, but I thought we could make it even simpler, and a bit more fastai-ish, or miniai-ish, by putting the hooks into a callback. As usual you pass this callback a function that's going to be called for your hook, and you can optionally pass it a filter for which modules you want to hook. In before_fit it filters the modules in the learner; this is one of those places we can now tidy up, because we don't need self.learn here, since model is one of the four attributes we have a shortcut to. Then it creates the Hooks object and stores it in self.hooks. One convenient thing is that your hook function no longer has to check whether you're in training: the callback always checks, and only calls the hook function you passed in while training. After fitting finishes, it removes the hooks. And you can iterate through the callback, and take its length, because it just passes __iter__ and __len__ down to self.hooks.
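A minimal sketch of the idea, assuming the Hooks class from the previous activations notebook and the Callback base above (the real miniai version may differ in detail):

```python
class HooksCallback(Callback):
    def __init__(self, hookfunc, mod_filter=lambda m: True):
        self.hookfunc, self.mod_filter = hookfunc, mod_filter

    def before_fit(self):
        mods = [m for m in self.model.modules() if self.mod_filter(m)]
        self.hooks = Hooks(mods, self._hookfunc)   # Hooks: from the previous notebook

    def _hookfunc(self, *args, **kwargs):
        # only record while training, so validation batches don't pollute the stats
        if self.training: self.hookfunc(*args, **kwargs)

    def after_fit(self): self.hooks.remove()

    def __iter__(self): return iter(self.hooks)
    def __len__(self): return len(self.hooks)
```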
To show how this works, we can create a HooksCallback using the same append_stats function, and then run the model, passing it as an extra callback to our fit function. I don't remember if fit took extra callbacks before, I'm not sure it did, so just to explain: I added the ability to pass extra callbacks to fit, and it simply adds them in. Then, because we can iterate through the callback we created, we can treat it as if it were the hooks themselves and plot in the usual way. So I think that's a convenient little addition.
Then I took our colorful-dimension stuff, which Stefano and I came up with a few years ago, and decided to wrap all of that up in a callback as well. I subclassed the HooksCallback to create ActivationStats, which uses append_stats to record the means, the standard deviations, and the histograms. I also changed one thing slightly: the plot that shows "dead" activations now takes the ratio of just the very first, smallest histogram bin to the rest of the bins, so it's really measuring the activations that are very nearly dead, which is why these graphs look a little bit different. So, yes, I subclassed the HooksCallback and added a color_dim method, a dead_chart method, and a plot_stats method.
To see them at work: if we want the activations of all the convolutions, we create our ActivationStats, add it as an extra callback, train the model, and then just call color_dim to get that plot, dead_chart to get that one, and plot_stats to get that one. So now we have absolutely no excuse for not getting all of these really fantastic, informative visualizations of what's going on inside our model, because it's literally as easy as adding one line of code and putting it in your callbacks. I really think it couldn't be easier, and I hope that even for models you thought were training really well, you'll try using this, because you might be surprised to discover that they're not. Okay, so those are some changes: pretty minor, but hopefully useful.
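Hypothetical usage, assuming the class and method names described here (color_dim, dead_chart, plot_stats) and the learner pieces from the earlier notebooks:

```python
import torch.nn as nn
import torch.nn.functional as F

astats = ActivationStats(lambda m: isinstance(m, nn.Conv2d))   # hook every conv layer
learn = MomentumLearner(get_model(), dls, F.cross_entropy, lr=0.2, cbs=cbs + [astats])
learn.fit(1)

astats.color_dim()    # "colorful dimension": histogram of activations over training, per layer
astats.dead_chart()   # fraction of near-dead activations per layer
astats.plot_stats()   # layer means and standard deviations over training
```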
Today, and over the next lesson or two, we're going to try to reach an important milestone: getting Fashion-MNIST to train to an accuracy of 90% or more. That's certainly not the end of the road, but it's not bad. If we look at Papers with Code, 90% accuracy means 10% error; there are folks who have got down to 3 or 4 percent error at the very best, which is very impressive, but 10% error wouldn't be way off that leaderboard. I don't know how far we'll get eventually, but without using any architectural changes at all, no resnets or anything, we're going to try to get into that 10% error range.
All right. The first few cells are just copied from earlier, and here's our ridiculously simple model. All I did here was say: the very first convolution is looking at a 3 by 3 patch of a 1-channel input, so 9 numbers, and we should compress that at least a little bit, so I made it 8 output channels. Then I just doubled it: 8 to 16, to 32, to 64. These are stride-2 convolutions, so that gives, as the comments say, a 14 by 14 grid, then 7 by 7, then 4 by 4, then 2 by 2, and the last one gets us to 1 by 1, with 10 channels for the 10 classes. There was really no thought at all behind this architecture; it's just a pure convolutional architecture, and remember the Flatten at the end is necessary to get rid of the unit axes we end up with, because the final output is 1 by 1.
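As a sketch, the model being described is roughly this (the exact helper names are assumptions):

```python
import torch.nn as nn

def conv(ni, nf, ks=3, stride=2, act=True):
    layers = [nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks//2)]
    if act: layers.append(nn.ReLU())
    return nn.Sequential(*layers)

def get_model():
    return nn.Sequential(
        conv(1, 8),               # 28x28 -> 14x14
        conv(8, 16),              # 14x14 -> 7x7
        conv(16, 32),             # 7x7   -> 4x4
        conv(32, 64),             # 4x4   -> 2x2
        conv(64, 10, act=False),  # 2x2   -> 1x1, one channel per class
        nn.Flatten())             # drop the trailing 1x1 axes -> (batch, 10)
```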
Let's do a learning-rate finder on this very simple model. What I found is that the situation is so bad that when I tried to use the learning-rate finder in the usual way, which would be to start at 1e-5 or 1e-4 and run it, the result looked ridiculous; it was impossible to see what was going on. If you remember, we added a multiplier, which we called lr_mult, or gamma as PyTorch calls it, so we ended up calling it gamma. I dialed that way down to make the sweep much more gradual, which means I had to dial up the starting learning rate, and only then did the learning-rate finder tell us anything useful. So there we are: that's our learning-rate finder. I'm just going to come back to these three cells later.
I tried using a learning rate of 0.2; after trying a few different values, 0.4, 0.1, 0.2 seems about the highest we can get away with, and even this is actually too high; much lower, and it didn't train much at all. You can see what happens: it starts training and then we lose it, which is unfortunate, and you can see in the colorful-dimension plot that we get this classic pattern of activations growing and crashing, growing and crashing. The key problem here is really that we don't have zero-mean, standard-deviation-one layers at the start, so we certainly don't keep them throughout, and that's a problem.
Now, just something I've got to mention, by the way: when you're training stuff in Jupyter notebooks (this is something we've just added), you can easily run out of GPU memory, and it turns out there are two particular reasons why that happens after you've run a few cells. The first is that, for your convenience, Jupyter (you may or may not know this) stores the results of your previous evaluations: if you just type underscore, it gives you the very last thing you evaluated, and you can use more underscores to go further back in time, or use numbers, so Out[16], for example, would be _16. The reason this is an issue is that if one of those outputs is a big CUDA tensor and you've shown it in a cell, that keeps its GPU memory basically forever, which is a bit of a problem. So if you're running out of memory, one thing you want to do is clean out all of those underscore variables. I found there's a function that nearly does that in the IPython source code, so I copied the important bits out of it and put them here: if you call clean_ipython_hist (don't worry about the lines of code at all), it gets that GPU memory back. The second thing, which Peter figured out in the last week or so, is that if you have a CUDA error at any point, or in fact any kind of exception, the exception object is stored by Python, and any tensors that were allocated anywhere in that traceback stay allocated basically forever. Again, that's a big problem, so I created a clean_tb function, based on Peter's code, which gets rid of that. This is particularly problematic because if you have a CUDA out-of-memory error and then try to rerun the cell, you'll still have a CUDA out-of-memory error, because all the memory that was allocated before is still referenced by that traceback. So basically, any time you get a CUDA out-of-memory error, or any kind of memory problem, you can call clean_mem: that cleans the memory referenced by your traceback, cleans the memory held in your Jupyter history, does a garbage collect, and empties the CUDA cache. That should give you a totally clean GPU without having to restart your notebook.
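The helpers are roughly along these lines; this is a sketch reconstructed from the description above, so the real notebook functions may differ in detail (especially the IPython-history part):

```python
import gc, sys, traceback
import torch

def clean_tb():
    # an exception's traceback keeps every frame (and any CUDA tensors referenced
    # in those frames) alive, so clear it out
    if hasattr(sys, 'last_traceback'):
        traceback.clear_frames(sys.last_traceback)
        delattr(sys, 'last_traceback')
    for attr in ('last_type', 'last_value'):
        if hasattr(sys, attr): delattr(sys, attr)

def clean_ipython_hist():
    # drop IPython's cached cell outputs (_, __, ___, _1, _2, ...), which can pin tensors
    if 'get_ipython' not in globals(): return
    ip = get_ipython()
    user_ns = ip.user_ns
    ip.displayhook.flush()
    pc = ip.displayhook.prompt_count + 1
    for n in range(1, pc): user_ns.pop('_' + repr(n), None)
    user_ns.update(dict(_='', __='', ___=''))

def clean_mem():
    clean_tb()
    clean_ipython_hist()
    gc.collect()
    torch.cuda.empty_cache()
```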
Sam asked a very good question in the chat, so just to remind you all: he asked, "I thought we were training an autoencoder; are you training a classifier, or what?" We started on that autoencoder back in notebook 8, and we decided that we don't yet have the tools to make it work well, so let's go back, create the tools, and then come back to it. In creating the tools, we're building a classifier: we're trying to make a really good Fashion-MNIST classifier, or rather, we're trying to create tools which will hopefully have the side effect of giving us a really good classifier, and then using those tools we hope to build a really good autoencoder. So yes, we're gradually unwinding, and we'll come back to where we were actually trying to get to. That's why we're doing this classifier; the techniques and library pieces we're building will all be very necessary.
Okay. So, (a) why do we need a mean of zero and a standard deviation of one, and (b) how do we get it? First, the why. Think about what a deep neural network does: it takes an input and puts it through a whole bunch of matrix multiplications. Of course there are activation functions sandwiched in there too, but don't worry about those for now; they don't change the argument. So imagine a 50-layer-deep network: ignoring the activation functions, it's basically taking the previous layer's output and doing a matrix multiply by some, initially random, weights. These are just a bunch of random weights, and torch.randn gives mean zero, variance one. If we run this, after multiplying by a matrix 50 times we end up with NaNs. That's no good. Maybe the numbers in our matrix were too big, so each time we multiply, the numbers get bigger and bigger; so maybe we should make them a bit smaller. Let's try scaling the matrix we multiply by down by 0.01, and multiply lots of times: now we've got zeros. Mathematically speaking, the first result isn't actually NaN, it's some really big number, and the second isn't really zero, it's some really small number; but computers can't handle really, really big numbers or really, really small numbers. Really big numbers eventually just become NaN, and really small numbers eventually just become zero, so they get washed out. In fact, even before you hit NaN or zero, for numbers that are extremely big the internal floating-point representation loses the ability to discriminate between nearby values; the further you get from zero, the less accurate the numbers are. So this is a problem: we have to scale our weight matrices exactly right, in such a way that the standard deviation at every layer stays at one and the mean stays at zero.
There's actually a paper that describes how to do this when you're multiplying lots of matrices together, and it works through some fairly simple math. What did they do? They looked at the propagation of activations and gradients, and came up with a particular weight initialization: a uniform distribution with 1 over root n as its bounds, and they studied what happens with various different activation functions. As a result, we now have a way of initializing neural networks called Glorot initialization, or Xavier initialization: we scale our random numbers by 1 over the square root of n_in, where n_in is the number of inputs. In our case we have 100 inputs, the square root of 100 is 10, and 1 over 10 is 0.1. And if we actually run that, starting with our random numbers and repeatedly multiplying by random matrices scaled by 0.1 (which is the Glorot initialization), you can see we end up with numbers that are actually reasonable. So that's pretty cool.
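A small runnable version of that experiment (the shapes follow the 100-input example in the lesson):

```python
import torch

x = torch.randn(200, 100)
for i in range(50): x = x @ torch.randn(100, 100)
print(x.mean(), x.std())          # nan: the activations have exploded past float range

x = torch.randn(200, 100)
for i in range(50): x = x @ (torch.randn(100, 100) * 0.01)
print(x.mean(), x.std())          # 0: the activations have underflowed to nothing

x = torch.randn(200, 100)
for i in range(50): x = x @ (torch.randn(100, 100) * 0.1)   # Glorot: 1/sqrt(100)
print(x.mean(), x.std())          # stays in a sensible range
```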
Just some background, in case you're not familiar with some of these details: what exactly do we mean by variance? Take a tensor, call it t, containing 1, 2, 4, 18. The mean is simply the sum divided by the count: 6.25. Now we want to come up with a measure of how far away each data point is from the mean; that tells you how much variation there is. If all the data points are very similar to each other, the mean sits right among them, and the average distance of each point from the mean is small. Whereas if the points are spread widely all over the place, you might end up with the same mean, but the distance from each point to the mean is now quite large. So we want some measure of how far the points are, on average, from the mean. We could take our tensor, subtract the mean, and take the mean of that; but that doesn't work, because some numbers are bigger than the mean and some are smaller, so averaging them all out gives, by definition, exactly zero. So instead we can either square those differences, and optionally take the square root afterwards to get back to the original scale, or we can take the absolute differences. I'm doing it in two steps here: the first quantity is on a different scale, and the square root brings it back to the same scale. So 6.87 and 5.88 are quite similar; they're not mathematically the same, but they're similar ideas. One is the mean absolute difference, and the others are called the standard deviation and the variance. The reason the standard deviation is bigger than the mean absolute difference is that in our original data one of the numbers is much bigger than the others, and when we square it, that number has an outsized influence. That's a bit of an issue with standard deviation and variance in general: outliers like this have an outsized influence, so you've got to be a bit careful.
an outsized influence so you've got to be a bit careful okay so here's the 00:24:21.560 |
formula for the standard deviation that's normally written as sigma okay so 00:24:25.880 |
it's just going to be each of our data points minus the mean squared plus the 00:24:30.760 |
next data point minus the mean squared so forth for all the data points and then 00:24:34.440 |
divide that by the number of data points and square root and okay so one thing I 00:24:40.840 |
point out here is that the mean absolute deviation isn't used as much as the 00:24:45.480 |
standard deviation because mathematicians find it difficult to use but 00:24:52.040 |
we're not mathematicians we have computers so we can use it okay now 00:24:58.280 |
variance we can calculate like this as we said the main of the square of the 00:25:04.040 |
differences and if you feel like doing some math you could discover that 00:25:08.920 |
actually this is exactly the same as you can see and this is actually nice 00:25:15.160 |
because this is showing that the mean of the square data points minus the square 00:25:24.680 |
of the mean of the data points is also the variance and this is very helpful 00:25:29.480 |
because it means you actually never have to calculate this you can just calculate 00:25:34.200 |
the mean so with just the data points on their own you can actually calculate the 00:25:38.880 |
variance this is a really nice shortcut this is how we normally calculate 00:25:43.280 |
variance and so there is the LaTeX version which of course I didn't write 00:25:49.600 |
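The shortcut being referred to, checked on the same little tensor:

```python
import torch

t = torch.tensor([1., 2., 4., 18.])
m = t.mean()

(t - m).pow(2).mean()    # definition: mean of squared deviations -> 47.19
(t * t).mean() - m * m   # shortcut:   E[x^2] - (E[x])^2          -> same value
```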
Now, there's a very similar idea called covariance, which has already come up a little bit in the first lesson or two, and particularly in the extra math lesson. Covariance tells you how much two things vary, not just on their own, but together. There's a definition in math notation, but I like code, so let's see the code. Here's our tensor again; now we want two things, so let's create u, which is just two times our tensor plus a bit of randomness. You can see that u and t are very closely correlated, but not perfectly. The covariance tells us how they vary together. We take exactly the same thing we had before, each data point minus its mean, but now we've got two different tensors, so we also take the other tensor's data points minus their mean, and multiply the two together. It's like the standard deviation calculation, except instead of a thing's deviation with itself, it's the co-deviation of two things. We take the mean of that product, and that gives us the covariance between the two tensors; you can see it's quite a high number. If we compare it to two things that aren't related at all, say a totally random tensor v which has nothing to do with t, and do exactly the same thing, taking the differences of t from its mean and v from its mean and the mean of their product, we get a very small number. So covariance is basically telling us how related two tensors are. Covariance and variance are basically the same thing: you can think of variance as a tensor's covariance with itself. And just as with variance, you can turn the mathematical definition we just wrote in code into an easier-to-calculate version, which, as you can see, gives exactly the same answer. If you haven't done much with covariance before, you should experiment a bit with it, by creating a few different plots and playing around.
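A runnable sketch of those experiments:

```python
import torch

t = torch.tensor([1., 2., 4., 18.])
u = t * 2 + torch.randn_like(t)        # closely (but not perfectly) related to t
v = torch.randn_like(t)                # unrelated to t

def cov(a, b): return ((a - a.mean()) * (b - b.mean())).mean()

cov(t, u)                              # large: t and u vary together
cov(t, v)                              # much smaller: no shared variation
cov(t, t)                              # covariance of t with itself is just the variance of t

(t * u).mean() - t.mean() * u.mean()   # shortcut form E[xy] - E[x]E[y]: same answer as cov(t, u)
```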
Finally, the Pearson correlation coefficient, which is normally called rho, is just the covariance divided by the product of the two standard deviations. You've probably seen that number many times; it's just a scaled version of the same thing.
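In code, re-using the cov helper above:

```python
import torch

t = torch.tensor([1., 2., 4., 18.])
u = t * 2 + torch.randn_like(t)
v = torch.randn_like(t)

def cov(a, b):  return ((a - a.mean()) * (b - b.mean())).mean()
def corr(a, b): return cov(a, b) / (cov(a, a).sqrt() * cov(b, b).sqrt())

corr(t, u)   # close to 1: strongly related
corr(t, v)   # typically much smaller (with only 4 points it's noisy, so try a longer tensor too)
```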
So, with that in mind, here is how Xavier init, or Glorot init, is derived. When you do a matrix multiplication, each y_i is a sum of products: y_i = a_{i,0} x_0 + a_{i,1} x_1 + ..., which we can write in sigma notation as the sum over k of a_{i,k} x_k. This is the stuff we did in our first lesson of part 2; here it is in pure Python code, and here it is in NumPy. At the very beginning our input vector has a mean of about 0 and a standard deviation of about 1, because that's what we asked for: that's what randn gives you. So let's create some random numbers and confirm: yes, mean about 0, standard deviation about 1. Now, if we choose the weights in a to have a mean of 0, we can work out the standard deviation of the output quite easily. A hundred times over, let's create an x, create something to multiply it by, do the matrix multiplication, and record the mean and the mean of the squares: the mean comes out very close to zero, and the mean of the squares comes out very close to 100, the number of inputs. I won't go through the full derivation (you can look at it if you like), but as long as the elements of a and x are independent, which obviously they are because they're random, each product a_{i,k} x_k has a mean of 0 and a standard deviation of 1. We can try that too: create one normally distributed random number, and a second one, multiply them together, repeat a bunch of times, and indeed we get mean 0, standard deviation 1. Summing 100 independent products like that gives a variance of 100, in other words a standard deviation of 10, and that's the reason we need the math.sqrt(100): dividing by the square root of the number of inputs brings the standard deviation back down to 1. We don't normally worry about exactly why things are the way they are mathematically, but sometimes it's fun to go through it; you can check out the paper if you want more detail, or experiment with these little simulations.
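A sketch of those simulations (the 512 rows are just an arbitrary choice for illustration):

```python
import torch

# mean and mean-of-squares of y = a @ x, with a and x both drawn from N(0, 1)
mean, sqr = 0., 0.
for i in range(100):
    x = torch.randn(100)
    a = torch.randn(512, 100)
    y = a @ x
    mean += y.mean().item()
    sqr  += y.pow(2).mean().item()
print(mean/100, sqr/100)        # roughly 0 and roughly 100: each output has variance n_in

# a single product of two independent N(0, 1) numbers has mean 0 and variance 1
mean, sqr = 0., 0.
for i in range(10_000):
    x = torch.randn(1)
    a = torch.randn(1)
    y = a * x
    mean += y.item()
    sqr  += y.pow(2).item()
print(mean/10_000, sqr/10_000)  # roughly 0 and roughly 1
```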
Now, the problem is that this doesn't work for us, because we use rectified linear units, which isn't something Xavier Glorot looked at. Let's take a look. We'll create a couple of matrices: the input is 200 by 100, plus a target vector of length 200; and then two weight matrices and two bias vectors. So we've got some input data, x's and y's, and we've got weight matrices and bias vectors. Let's define a linear-layer function, which we've done lots of times before, and start going through a little neural net; I'm describing the forward pass here. We apply our linear layer to the x's with the first weight matrix and the first bias, and check the mean and standard deviation: about 0 and about 1, so that's good news, and the reason is that we have 100 inputs and we divided by the square root of 100, just as Glorot told us to. The second layer has 50 inputs, so we divide by the square root of 50. So this all ought to work, and so far it does; but now we mess everything up by applying a ReLU. After a ReLU, look: we don't have a zero mean or a unit standard deviation anymore. And if we build a deep neural network this way, with Glorot initialization but ReLU activations, oh dear: it's all disappeared, everything has gone to zero. You can see why: after each matrix multiply and ReLU, our means and variances keep shrinking, because the ReLU squashes out the negative half. I'm not going to go through the math of why, but a very important paper indeed, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" by Kaiming He et al., came up with a new init. It's just like Glorot initialization, except that where Glorot used 1 over root n, this uses root 2-over-n, with n again the number of inputs. So let's try it: we've got 100 inputs, so we multiply by root 2-over-100, and there we go, we are in fact getting some non-zero numbers, which is very encouraging, even after going through 50 layers of depth. So that's good news. This is called Kaiming initialization, or He initialization; note that although it's written "He", it's a Chinese surname pronounced more like "her", and maybe that's why a lot of people increasingly call it Kaiming initialization, so they don't have to say a surname that's a little harder to pronounce.
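A small runnable version of the ReLU experiment:

```python
import torch

def relu(x): return x.clamp_min(0.)

x = torch.randn(200, 100)
for i in range(50):                    # Glorot scaling + ReLU: activations shrink away to zero
    x = relu(x @ (torch.randn(100, 100) / 100**0.5))
print(x.mean(), x.std())               # effectively 0

x = torch.randn(200, 100)
for i in range(50):                    # Kaiming scaling sqrt(2/n): activations survive
    x = relu(x @ (torch.randn(100, 100) * (2/100)**0.5))
print(x.mean(), x.std())               # non-zero even after 50 layers
```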
All right, so how on earth do we actually use this, now that we know what initialization to use for a deep neural network with ReLU activations? The trick is to use a method called apply, which all nn.Modules have. If we grab our model, we can apply any function we like; for example, apply a function that prints the name of each module's type, and you can see it goes through and prints out all of the modules inside our model. Notice that our model has modules inside modules (a conv inside a sequential inside a sequential), but model.apply goes through all of them, regardless of depth. So we can apply an init function: one which simply fills the weights with normally distributed random numbers times the square root of 2 over the number of inputs. That's such an easy thing it's not even worth writing yourself; it's already been written, and it's called init.kaiming_normal_. As we've seen before, if there's an underscore at the end of a PyTorch method name, it means it changes something in place; so init.kaiming_normal_ modifies the weight matrix so that it's initialized with normally distributed random numbers scaled by the square root of 2 over the number of inputs. Now, you can't do that to a Sequential layer, or a ReLU layer, or a Flatten layer, so we should check that the module is a conv or linear layer, and then we can just say model.apply with that function.
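A minimal sketch of that init function (get_model is assumed to be the model builder from above):

```python
import torch.nn as nn

def init_weights(m):
    # only conv and linear layers carry a weight matrix we want to re-initialize
    if isinstance(m, (nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight)

model = get_model()
model.apply(init_weights)   # walks every sub-module, at any depth, and returns the model
```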
Having done that, I can use the learning-rate finder callback we created earlier; actually, this time we can create our own, because we don't even need the weird gamma thing anymore. So let's go back and copy that, get rid of the gamma=1.1 (it shouldn't be necessary now), and we can probably adjust the starting value too. Oh, I should have recreated the model first; there we go. Okay, that's looking much more sensible, so at least we've got to the point where the learning-rate finder works, which is a good sign. So now when we create our learner, still using MomentumLearner, after we get the model we apply init_weights; and since apply also returns the model, this actually returns the model with the initialization applied. While that trains, I'll answer some questions.
Fabrizio asks: why do we double the number of filters in successive convolutions? What's happening is that each of these is a stride-2 convolution, so it changes the grid size from 28 by 28 to 14 by 14; it reduces the size of the grid by a factor of 4 in total. The same goes as we move from one layer to the next: 14 by 14 to 7 by 7, again reducing the grid size by 4. We want the network to learn something here, and if you give it exactly the same number of activations, you're not really forcing it to learn much. So ideally, as we decrease the grid size, we want enough channels that we end up with somewhat fewer activations than before, but not too many fewer. If we double the number of channels, then we've decreased the grid size by a factor of 4 and increased the channel count by a factor of 2, so overall the number of activations has decreased by a factor of 2. That's what we want: we're forcing it to find ways of compressing the information intelligently as it goes down. We also want a roughly similar amount of compute throughout the network: as we decrease the grid size we decrease the amount of compute, and increasing the channel count gives it more things to compute. So we get a nice compromise between the amount of compute the network does at each stage, and giving it some compression work to do. That's the basic idea.
Well, it's still not able to train well. If we leave it for a while... okay, it's not great, but it actually is starting to train, which is encouraging, and we got up to 70% accuracy. Not surprisingly, we're still getting these spikes, and in the statistics you can see that it didn't quite work: we don't have a mean of zero, and we don't have a standard deviation of one, even at the start. Why is that? Because we forgot something critical. Go back to our original point: even when we had a correctly normalized weight matrix to multiply by (look at the Kaiming version), you also have to have a correctly normalized input matrix, and we never did anything to normalize our inputs. If we just grab the first x mini-batch and get its mean and standard deviation, it has a mean of 0.28 and a standard deviation of 0.35. So we didn't even start with a (0, 1) input: the mean is above zero and the standard deviation is well below one, which made things very hard. Using the init helped, so at least we're able to train a little bit, but it's not what we want: we need to modify our inputs so they have a mean of zero and a standard deviation of one.
So we could create a callback to do that. Let's create a BatchTransformCB: we pass in a function that's going to transform every batch, and in before_batch we simply set the batch to be the function applied to the batch. Note, by the way, that we don't need self.learn.batch on the right-hand side here, because batch is one of the four things we proxy down to the learner automatically; but we do need it on the left-hand side, because the proxying only happens in __getattr__, which handles reads, not assignments, so be very careful about that. Actually, I might just write self.learn.batch on both sides, so that people don't get confused. Then let's create a function, _norm, that subtracts the mean and divides by the standard deviation; remember a batch is an (x, y) pair, so it's the x part that we normalize, and the y of the new batch is exactly the same as before. We create an instance of BatchTransformCB with that normalization function, call it norm, and pass it as an additional callback to our learner.
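A sketch of that callback, assuming the Callback base above and a training batch xb from which the statistics were computed earlier in the notebook:

```python
class BatchTransformCB(Callback):
    def __init__(self, tfm): self.tfm = tfm
    # replace each batch with its transformed version; note the explicit self.learn
    # on the left-hand side, since __getattr__ only forwards reads
    def before_batch(self): self.learn.batch = self.tfm(self.learn.batch)

xmean, xstd = xb.mean(), xb.std()                    # statistics from one training batch

def _norm(b): return (b[0] - xmean) / xstd, b[1]     # normalize x, leave y unchanged

norm = BatchTransformCB(_norm)
```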
And now that's looking a lot better. You can see that all we had to do was make sure our input matrix had mean zero and standard deviation one, and that all our weight matrices had mean zero and standard deviation one, and without any other tricks at all it was able to train, reaching an accuracy of 85%. And if we look at the color_dim and stats plots, they look beautiful: this is layer one, this is layer two, three, four. It's still not perfect; there's some randomness, and with seven or eight layers that randomness compounds as you go through them, so by the last layer it still gets a bit ugly, and you can see it bouncing around here as a result, and you can see that in the means and standard deviations too. There are some other reasons this happens, which we'll see in a moment. But this is the first time we've really got an even somewhat deep convolutional model to train, and that's a really exciting step: from scratch, in a sequence of 11 notebooks, we've managed to create a real convolutional neural network that is training properly. I think that's pretty amazing.
Now, we don't have to use a callback for this. The other way to modify the input data is to use the with_transform method from the Hugging Face datasets library: we can modify our transform function to also subtract the mean and divide by the standard deviation, and then recreate our data loaders. If we now get a batch out of that and check it, yes, it has a mean of zero and a standard deviation of one. So we could do it this way too. Generally speaking, for things that need to dynamically modify the batch, you can often do it either in your data-processing code or in a callback; neither is right or wrong, they both work well, and you can use whichever works best for you.
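A hedged sketch of the datasets-library route; dsd (the Fashion-MNIST DatasetDict), the 'image' column name, and the xmean/xstd statistics are all assumed from earlier in the notebook:

```python
import torchvision.transforms.functional as TF

xl = 'image'

def transformi(b):
    # convert each PIL image to a tensor, then normalize with the training statistics
    b[xl] = [(TF.to_tensor(o) - xmean) / xstd for o in b[xl]]
    return b

tds = dsd.with_transform(transformi)   # applied lazily, batch by batch, on access
```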
Now I'm going to show you something amazing. It's great that this is training well, but when you look at our stats, despite what we did with the normalized inputs and the normalized weight matrices, we don't have a mean of zero, and we don't have a standard deviation of one, even from the start. Why is that? The problem is that we're putting our data through a ReLU, and our activation stats are looking at the output of those ReLU blocks, because that's the end of each combination of weight-matrix multiplication and activation function; that's the activation. And since a ReLU removes all of the negative numbers, it's impossible for the output of a ReLU to have a mean of zero, unless literally every single number is zero, because there are no negatives left to balance things out. So ReLU seems to me to be fundamentally incompatible with the idea of a correctly calibrated bunch of layers in a neural net. So I came up with this idea: why don't we take our normal ReLU and give it the ability to subtract something from its output? We just take the result of the ReLU and subtract a constant (I could write it more obviously as a minus-equals). That pulls the whole thing down, so the bottom of our ReLU sits underneath the x-axis; it has negatives, which is what allows a mean of zero. And while we're there, let's also do something that's existed for a while (I didn't come up with this idea), which is a leaky ReLU: rather than the negative side being truncated totally flat, the negative values are just decreased by some constant factor. Put the two together and I'll call it GeneralRelu: a leaky ReLU, from which we also subtract something. I've created a little function for plotting a function, so let's plot GeneralRelu with a leakiness of 0.1, meaning there's a slope of 0.1 below zero, and a subtraction of 0.4. You can see that above zero it's just the normal y = x line, but pushed down by 0.4, and below zero it's not flat anymore; it has a slope of 1/10. This is now something where, if you find the right amount to subtract for each amount of leakiness, you can get a mean of zero, and I found that this particular combination gives us a mean of zero, or thereabouts.
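A sketch of that activation (the miniai version may have a few extra options, but this is the core of it):

```python
import torch.nn as nn
import torch.nn.functional as F

class GeneralRelu(nn.Module):
    def __init__(self, leak=None, sub=None):
        super().__init__()
        self.leak, self.sub = leak, sub

    def forward(self, x):
        # leaky (or plain) ReLU, then shift the whole curve down by `sub`
        x = F.leaky_relu(x, self.leak) if self.leak is not None else F.relu(x)
        if self.sub is not None: x = x - self.sub
        return x

act = GeneralRelu(leak=0.1, sub=0.4)   # the combination used in the lesson
```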
So let's now create a new convolution function where we can actually change which activation function is used; that gives us the ability to change the activation functions in our neural nets. Let's change get_model to take an activation function, which gets passed down into the layers, and while we're there, let's also make it easy to change the number of filters: we pass in a list of the number of filters for each layer, defaulting to the numbers we've discussed, and then go through in a list comprehension, creating a convolution from each number of filters to the next, and pop it all into a Sequential along with a Flatten at the end. While we're there, we also need to be careful about init_weights, because this is something people tend to forget: Kaiming initialization, by default, only applies to layers with a ReLU activation function. We don't have ReLU anymore; we have a leaky ReLU. The fact that we're subtracting a bit from it doesn't change things, but the fact that it's leaky does. Luckily (a lot of people don't know this), PyTorch's kaiming_normal_ has an adjustment for leaky ReLUs; somewhat weirdly, the parameter is just called a. If you pass your leaky ReLU's slope to kaiming_normal_ as a, you get the correct initialization for a leaky ReLU. So we need to change init_weights to pass in the leakiness.
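A sketch of that change, building on the GeneralRelu sketch above and assuming get_model now takes an act argument as described:

```python
from functools import partial
import torch.nn as nn

def init_weights(m, leaky=0.):
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_normal_(m.weight, a=leaky)   # `a` is the negative slope of the leaky ReLU

act_gr = partial(GeneralRelu, leak=0.1, sub=0.4)
iw = partial(init_weights, leaky=0.1)
model = get_model(act=act_gr).apply(iw)              # apply() returns the (initialized) model
```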
So let's put all this together. Our general-ReLU activation function is GeneralRelu with a leak of 0.1 and a subtraction of 0.4; we use partial to create a function with those parameters built in. For ActivationStats we now need to look for GeneralRelu modules, not nn.ReLU. And our init_weights becomes a partial with leaky=0.1. Now we get our model using that new activation function and that new init_weights, and fit. Oh, that's encouraging: an accuracy of 0.845, which is about as high as we got to at the end previously... and wow, look at that, we're up to an accuracy of 87%. And look at these plots: we've still got a little bit of a spike, but it's almost smooth. Our mean is starting at about zero; the standard deviation is still a bit low, generally around 0.8, but it's coming up towards one, so it's not too bad. It's all looking pretty encouraging, I think, and look: the percentage of dead units in each layer is very small. So we've finally got some very nice-looking training graphs here. It's interesting that we literally had to invent our own activation function to make this work, and I think that gives you a sense of how few people actually care about this, which is crazy, because in some ways it's the only thing that matters. It's not at all mathematically difficult to make it work, and it's not at all computationally difficult to check whether it's working; but other frameworks don't even let you plot these kinds of things, so nobody even knows that they've completely messed up their initialization. Now you know.
Now for some very nice news. The first thing to be aware of, which is tricky, is that a lot of models nowadays use more complicated activation functions than ReLU, leaky ReLU, or even this general version. You need to initialize your neural network correctly, and most people don't; sometimes nobody has even figured out, or bothered to try to figure out, what the correct initialization to use is. But there's a very cool trick, which almost nobody knows about, from a paper called "All You Need Is a Good Init" that Dmytro Mishkin wrote a few years ago. What Dmytro showed is that there's actually a completely general way of initializing any neural network correctly, regardless of what activation functions are in it, and it uses a very, very simple idea: create your model, initialize it however you like, then put a single batch of data through it. Look at the first layer and see what the mean and standard deviation coming out of it are: if the standard deviation is too big, divide the weight matrix down a bit; if the mean is off, shift it a bit; and repeat for that first layer until you get the correct mean and standard deviation. Then go to the second layer and do the same thing, then the third layer, and so forth. We can do that using hooks.
This is called Layer-wise Sequential Unit Variance, LSUV. We can create a little lsuv_stats hook function that grabs the mean of a layer's activations and the standard deviation of those activations, and create a hook with that function. Then, using that hook, we run the model, get the mean and standard deviation, check whether the standard deviation isn't one or the mean isn't zero, and if so we subtract the mean from the bias and divide the weight matrix by the standard deviation, and we keep doing that until we get a standard deviation of one and a mean of zero. By making this a hook, we can apply it to all the relus (well, GeneralRelus) and all the convs. Just to show you what happens there: once I've got all the relus and all the convs, I can use zip. zip in Python takes a bunch of lists and gives you tuples of the first items, then the second items, then the third items, and so forth. If I go through zip(relus, convs) and print them out, you can see it prints the first relu with the first conv, the second relu with the second conv, the third with the third, and so on. We use zip all the time in Python, so it's a really important thing to be aware of. So we go through the relus and the convs and call the layer-wise sequential unit variance init, passing in each module pair, the relu and the conv, and of course the batch, which we need to put on the correct device for our model. Having done that (it ran almost instantly), all the biases and weights have now been adjusted to give us mean 0, standard deviation 1.
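A self-contained sketch of the LSUV loop described here (the miniai version uses its Hook class and may differ in detail):

```python
import torch

@torch.no_grad()
def lsuv_init(model, act_layer, prev_layer, xb, tol=1e-3, max_iters=50):
    """Tweak prev_layer until act_layer's output on batch xb has mean 0 and std 1."""
    stats = {}
    def hook(mod, inp, outp):
        stats['mean'], stats['std'] = outp.mean().item(), outp.std().item()
    h = act_layer.register_forward_hook(hook)
    for _ in range(max_iters):
        model(xb)                                    # forward pass fills `stats`
        if abs(stats['mean']) < tol and abs(stats['std'] - 1) < tol: break
        prev_layer.bias   -= stats['mean']           # pull the mean towards 0
        prev_layer.weight /= stats['std']            # scale the std towards 1
    h.remove()

# hypothetical usage, with `relus` and `convs` listing the activation and conv modules in order:
# for relu, conv in zip(relus, convs): lsuv_init(model, relu, conv, xb.to(device))
```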
And now, if I train it: there it is. We didn't do any initialization of the model at all, other than calling the LSUV init, and this time we've got an accuracy of 0.86, versus 0.87 previously, so pretty much the same thing; close enough. If you want to actually watch it happening (I guess it's going to be pretty obvious once we run it), we could print h.mean and h.std before and after. There we go: the first layer started at a mean of about negative 0.13 and a variance of about 0.46, and it kept doing the divide, subtract, divide, subtract, until eventually it got to a mean of zero and a standard deviation of one; then it went to the next layer and kept going until that was (0, 1), then the third layer, then the fourth. At that point all of the layers had a mean of zero and a standard deviation of one. One thing about LSUV: it's very convenient mathematically; we don't have to spend any time thinking about the right formula if we've invented a new activation function, or if we're using some activation function where nobody seems to have figured out the correct initialization; we can just use LSUV. It did require a little bit more fiddling around with hooks and such to get it working, and I haven't even put it into a callback or anything. So if you decide you want to try using this in some of your models, it would actually be good homework to see if you can come up with a callback that does LSUV initialization for you; that would be pretty cool, wouldn't it? It would go in before_fit, I guess; you'd have to be a bit careful, because if you ran fit multiple times it would re-initialize each time, so that's one issue to think about.
Okay, so something which is quite similar to LSUV is batch normalization. We're going to have a seven-minute break, and then we'll come back and talk about batch normalization. See you in seven minutes.

Okay, hi, let's do this: batch normalization. Batch normalization was such an important paper. I remember when it came out; I was at Enlitic, my medical startup, I think that's right, and everybody was talking about it, and in particular about the graph that basically showed what it used to take, before batch norm, to train a model on ImageNet: how many training steps you'd need to get to a certain accuracy. Then they showed what you could do with batch norm: so much faster. It was amazing; we all thought it couldn't be true, but it was.
basically the key idea of batch norm is that, you know, with LSUV and 01:05:38.400 |
input normalization and Kaiming init we are normalizing each layer's 01:05:46.720 |
inputs before training, but the distribution of each layer's inputs 01:05:51.280 |
changes during training and that's a problem so you end up having to decrease 01:06:02.680 |
your learning rates and as we've seen you'd have to be very careful about 01:06:06.320 |
parameter initialization so the fact that the layers inputs change during 01:06:12.880 |
training they call internal covariate shift which for some reason a lot of 01:06:17.120 |
people tend to find a confusing statement or a confusing name but it's 01:06:20.160 |
that's very clear to me and you can fix it by normalizing layer inputs during 01:06:27.520 |
training so you're making the normalization a part of the model 01:06:31.760 |
architecture and you perform the normalization for each mini batch now 01:06:36.600 |
I'm actually not going to start with batch normalization I'm going to start 01:06:39.800 |
with something that came out one year later called layer normalization because 01:06:44.360 |
layer normalization is simpler let's do the simpler one first so layer 01:06:50.320 |
normalization came out as this group of fellows the last of whom I'm sure you 01:06:57.440 |
heard of and it's probably easiest to explain by showing you the code so if 01:07:06.880 |
you're thinking layer normalization, well, it's a whole paper, a Geoffrey Hinton paper, it 01:07:11.480 |
must be complicated no the whole thing is this code what is layer normalization 01:07:16.640 |
well we can create a module and we're going to pass in we don't need to pass 01:07:25.840 |
in anything actually you can totally ignore the parameters for now in fact 01:07:29.000 |
what we're going to do is we're going to have a single number called mult for the 01:07:33.480 |
multiplier and a single number called add that's the thing we're going to add 01:07:37.240 |
and we're going to start off by multiplying things by one and adding zero 01:07:41.960 |
so we're going to start off by doing nothing at all okay this is the layer 01:07:46.640 |
it has a forward function and in the forward function so remember that by 01:07:54.560 |
default we have NCHW we have batch by channel by height by width we're going to 01:08:06.880 |
take the mean over the channel height and width so we're just going to find the 01:08:13.760 |
mean activation for each input in the mini batch and when I say input though 01:08:20.880 |
remember that this is going to be this is a layer right so we can put this layer 01:08:25.120 |
anywhere we like so it's the input to that layer and we'll do the same thing 01:08:29.400 |
for finding the variance okay and then we're going to normalize our data by 01:08:41.340 |
subtracting the mean and dividing by the square root of the variance which of 01:08:49.280 |
course is the standard deviation we're going to add a very small number by 01:08:56.120 |
default 1e-5, to the denominator just in case the variance is 01:09:01.320 |
zero or ridiculously small this will keep the number from going giant just if 01:09:06.640 |
we happen to get something with a very small variance this idea of an epsilon 01:09:11.960 |
as being something we add to a divisor is really really common and in general 01:09:17.780 |
you should not assume that the defaults are correct very often the defaults are 01:09:21.480 |
too small for algorithms that use an epsilon okay so here we are as you can 01:09:31.520 |
see we are normalizing the the batch I mean I can call it a batch but just 01:09:44.840 |
remember it isn't necessarily the first layer right so it's wherever which 01:09:48.800 |
whichever layer we decide to put this in so we normalize it now the thing is 01:09:53.240 |
maybe we don't want it to be normalized maybe we wanted to have something other 01:10:00.840 |
than unit variance and something other than zero mean well what we do is we 01:10:07.000 |
then multiply it back by self dot mult and add self dot add. Now remember 01:10:07.000 |
self dot mult was one and self dot add is zero, so at first that does nothing at all. So 01:10:12.200 |
at first this is just normalizing the data so that's good but because these 01:10:23.720 |
are parameters these two numbers are learnable that means that the SGD 01:10:28.120 |
algorithm can change them so there's a very subtle thing going on here which is 01:10:32.840 |
that in fact this might not be normalizing the data at all or normalizing 01:10:37.680 |
the inputs to the next layer at all, because self dot mult and self dot add 01:10:42.360 |
could be anything so I tend to think that when people think about these kind of 01:10:48.280 |
things like layer normalization and batch normalization thinking of this 01:10:51.200 |
normalization in some ways is not the right way to think of it it's actually 01:10:59.120 |
doing something I think to really well it's definitely normalizing it for the 01:11:02.680 |
initial layers, and we don't really need LSUV anymore if we have this in 01:11:07.160 |
here because it's going to normalize it automatically so that's handy but after a 01:11:14.120 |
few batches it's not really normalizing at all but what it is doing is 01:11:19.760 |
previously this idea of like how big are the numbers overall and how much 01:11:26.560 |
variation do they have overall was kind of built into every single number in the 01:11:33.080 |
weight matrix and in the bias vector this way those two things have been 01:11:41.360 |
turned into just two numbers and I think this makes training a lot a lot easier 01:11:46.840 |
for it basically to just have just two numbers that it can focus on to change 01:11:51.400 |
this overall like positioning and variation so there's something very 01:11:56.240 |
subtle going on here because it's not just doing normalization at least not 01:12:02.000 |
after the first few batches are complete because it can learn to create any 01:12:07.320 |
distribution of outputs it wants. So there's our layer. 01:12:13.000 |
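For reference, here is a sketch of that layer as just described: a single learnable mult and a single learnable add, with the statistics taken over channel, height and width (the notebook's version may differ in small details).

```python
import torch
from torch import nn

class LayerNorm(nn.Module):
    def __init__(self, nf=None, eps=1e-5):   # nf is ignored; it's only there so conv() can pass it
        super().__init__()
        self.eps = eps
        self.mult = nn.Parameter(torch.tensor(1.))   # learnable scale, starts as "do nothing"
        self.add  = nn.Parameter(torch.tensor(0.))   # learnable shift, starts as "do nothing"

    def forward(self, x):                     # x is NCHW
        m = x.mean((1, 2, 3), keepdim=True)   # one mean per item in the mini-batch
        v = x.var ((1, 2, 3), keepdim=True)   # one variance per item
        x = (x - m) / (v + self.eps).sqrt()   # normalize, with eps to avoid dividing by ~0
        return x * self.mult + self.add       # then let SGD learn a different scale and shift
```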
So we're going to need to change our conv function. Again, previously we changed it to make the 01:12:17.560 |
activation function modifiable; now we're going to also 01:12:22.240 |
change it to allow us to add normalization layers to the end so our 01:12:27.240 |
basic layers well we'll start off by adding our conv2d as usual and then if 01:12:32.880 |
you're doing normalization we will append the normalization layer with this 01:12:38.840 |
many inputs now in fact layer norm doesn't care how many inputs so I just 01:12:43.560 |
ignore it but you'll see batch normal care if you've got an activation 01:12:47.760 |
function add it and so our convolutional layer is actually a sequential bunch of 01:12:51.720 |
layers now. One thing that's interesting, I think, is the bias in the conv: if 01:13:02.160 |
you're using well this isn't quite true is it I was going to say if you're using 01:13:08.880 |
layer norm you don't need bias but actually you kind of do so maybe we 01:13:15.120 |
should actually change that for batch norm we won't need bias but actually for 01:13:21.240 |
this one we do so put this back bias equals true bias equals bias okay so 01:13:35.400 |
then these initial layers right here yes so they all have bias and then we've got 01:13:42.120 |
bias equals false okay so now in our model we're going to add layer 01:13:57.080 |
normalization to every layer except for the last one and let's see how we go oh 01:14:09.640 |
nice: 0.873; okay, 0.860 and 0.872 before, so we've just 01:14:19.660 |
got our best by a little bit, so that's cool. 01:14:32.640 |
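As a sketch, the modified conv helper looks roughly like this (the argument names here are assumptions rather than the notebook's exact signature):

```python
from torch import nn

def conv(ni, nf, ks=3, stride=2, act=nn.ReLU, norm=None, bias=True):
    layers = [nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks//2, bias=bias)]
    if norm: layers.append(norm(nf))   # e.g. our LayerNorm (which ignores nf), or a batch norm
    if act:  layers.append(act())
    return nn.Sequential(*layers)

# the model then stacks these, with a norm layer on every conv except the last, e.g.:
# nn.Sequential(conv(1, 8, norm=LayerNorm), ..., conv(64, 10, act=None, norm=None), nn.Flatten())
```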
The thing about these normalization layers, though, is that they do cause a lot of challenges in models, 01:14:41.000 |
and generally speaking, ever since batch norm appeared, well, there's been this 01:14:45.400 |
kind of like big change of a view towards it at first people like oh my god batch 01:14:51.280 |
norm is our savior and it kind of was it let us train much deeper models and get 01:14:56.960 |
great results and train quickly but then increasingly people realized it also 01:15:01.720 |
added a lot of complexity. These learnable parameters turned out to 01:15:07.040 |
create all kind of complexity and in particular batch norm which we'll see in 01:15:09.960 |
a minute created all kinds of complexity so there has been a tendency in recent 01:15:15.280 |
years to be trying to get rid of or at least reduce the use of these kinds of 01:15:19.960 |
layers so knowing how to actually initialize your models correctly at 01:15:28.360 |
first is becoming increasingly important as people are trying to move away from 01:15:33.200 |
these normalization layers increasingly so I will I will say that so I you know 01:15:40.400 |
they're still very helpful but they're not a silver bullet as it turns out 01:15:47.080 |
alright so now let's look at batch norm so batch norm is still not huge but it's 01:15:53.480 |
a little bit bigger than layer norm and you'll see that we've now we've got the 01:15:59.840 |
mult and add as before, but it's not just one number to add or one number to 01:16:08.280 |
multiply but actually we've got a whole bunch of them and the reason is that 01:16:12.360 |
we're going to have one for every channel and so now when we take the mean 01:16:17.000 |
and the variance, we're actually taking it over the batch dimension and 01:16:23.480 |
the height and width dimensions, so we're ending up with one mean per channel and 01:16:29.960 |
one variance per channel so just like before once we get our means and 01:16:38.760 |
variances we subtract them out and divide them by the epsilon modified variance 01:16:48.480 |
and just like before we then multiply by mult and add add but now we're actually 01:16:53.760 |
multiplying by a vector of mults and we're adding a vector of adds, and that's 01:16:57.840 |
why we have to pass in the number of filters because we have to know how many 01:17:03.640 |
ones and how many zeros we have in our initial mults and adds. So that's the main 01:17:11.120 |
difference in a sense is that we are we have one per channel and that and that 01:17:17.640 |
we're also taking the average across all of the things in the batch where else in 01:17:25.280 |
layer norm we didn't each thing in the batch had its own separate normalization 01:17:33.640 |
it was doing then there's something else in batch norm which is a bit tricky 01:17:42.280 |
which is that during training we are not just subtracting the mean and the 01:17:51.320 |
variance but instead we're getting an exponentially weighted moving average of 01:17:56.920 |
the means and the variances of the last few batches. That's what 01:18:06.140 |
this is doing: we basically create something called vars 01:18:11.000 |
and something called means and initially the variances are all one and the means 01:18:17.000 |
are all zero and there's one per channel just like before or one per filter this 01:18:21.680 |
is number of filters same idea I guess filters we tend to actually use inside 01:18:27.440 |
the model and channels we tend to use as the first input so I should probably say 01:18:31.040 |
filters either works though so we get out let's for example we get our mean per 01:18:39.080 |
filter and then what we do is we use this thing called lerp and lerp is 01:18:43.960 |
simply saying yes that's what it's done so what lerp does is it takes two 01:19:00.120 |
numbers in this case I'm going to take 5 and 15 or two tensors they could be 01:19:06.860 |
vectors or matrices and it creates a weighted average of them and the amount 01:19:11.680 |
of weight it uses is this number here let me explain in this case if I put 01:19:17.360 |
0.5 it's going to take half of this number plus half of this number so we 01:19:22.960 |
end up with just the mean but what if we used 0.75 then that's going to take 01:19:32.080 |
that's going to take 0.75 of this number plus 0.25 of this number. So it 01:19:41.200 |
basically allows it to be on a sliding scale: one extreme would 01:19:46.360 |
be to take all of the second number so that would be lerp with one there and 01:19:50.080 |
the other extreme would be all of the first number and then you can slide 01:19:54.800 |
anywhere between them like so right so that's exactly the same as saying five 01:20:02.400 |
times 0.9 plus 15 times 0.1 right so this this number here is how much of the 01:20:17.240 |
second number do we have and one minus that is how much of this number do we 01:20:21.400 |
have and you can also move this as you can with most PyTorch things you can 01:20:26.080 |
move the first parameter into there and get exactly the same result so that's 01:20:32.880 |
what lerp is. 01:20:32.880 |
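To make that concrete, here is the same weighted-average behaviour in a few lines of PyTorch:

```python
import torch

a, b = torch.tensor(5.), torch.tensor(15.)
a.lerp(b, 0.5)         # tensor(10.)  = 0.5*a + 0.5*b, i.e. the mean
a.lerp(b, 0.75)        # tensor(12.5) = 0.25*a + 0.75*b
a.lerp(b, 0.1)         # tensor(6.)   = 0.9*a + 0.1*b, same as 5*0.9 + 15*0.1
torch.lerp(a, b, 0.1)  # same result, with the first argument moved inside the call
```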
So what we're doing here is an in-place lerp: we're replacing self dot means with one minus momentum times self dot means, plus self 01:20:40.760 |
dot momentum times this particular mini-batch's mean. So this is basically doing 01:20:58.120 |
momentum again which is why we indeed are calling the parameter mom from 01:21:02.840 |
momentum so with a mom of point one which I kind of think is the opposite of 01:21:10.760 |
what I'd expect momentum to mean (I'd expect it to be 0.9), but with a 01:21:13.960 |
mom of 0.1 it's saying that each mini-batch, self dot means will be 0.1 of 01:21:22.560 |
this particular mini batches mean and 0.9 of the previous one the previous 01:21:31.440 |
sequence in fact and that ends up giving us what's called an exponentially weighted 01:21:35.920 |
moving average and we do the same thing for variances okay so that's only 01:21:45.680 |
updated during training okay and then during inference we can we just use the 01:21:53.840 |
saved means and variances so this and then why do we have buffers what does 01:21:59.720 |
that mean these buffers mean that these means and variances will be actually 01:22:04.840 |
saved as part of the model so it's important to understand that this 01:22:11.240 |
information about the means and variances that your model saw saved in 01:22:17.800 |
the model and this is the key thing which makes batch norm very tricky to 01:22:23.240 |
deal with and particularly tricky as we'll see in later lessons with 01:22:26.760 |
transfer learning but what this does do is that it means that we're going to get 01:22:32.600 |
something that's much smoother you know a single weird mini batch shouldn't screw 01:22:37.400 |
things around too much and because we're averaging across their mini batch it's 01:22:42.080 |
also going to make things smoother so this whole thing should lead to a pretty 01:22:44.960 |
nice smooth training. 01:22:51.680 |
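Here is a sketch of the batch norm layer as described: per-channel mults and adds, plus means and vars buffers that are updated with an in-place lerp during training and used as-is at inference (the notebook's version may differ in small details).

```python
import torch
from torch import nn

class BatchNorm(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom, self.eps = mom, eps
        self.mults = nn.Parameter(torch.ones (nf, 1, 1))   # one learnable scale per channel
        self.adds  = nn.Parameter(torch.zeros(nf, 1, 1))   # one learnable shift per channel
        # buffers: saved with the model's state, but not updated by the optimizer
        self.register_buffer('vars',  torch.ones (1, nf, 1, 1))
        self.register_buffer('means', torch.zeros(1, nf, 1, 1))

    def update_stats(self, x):
        m = x.mean((0, 2, 3), keepdim=True)   # stats over batch, height, width: one per channel
        v = x.var ((0, 2, 3), keepdim=True)
        self.means.lerp_(m, self.mom)         # exponentially weighted moving averages
        self.vars .lerp_(v, self.mom)
        return m, v

    def forward(self, x):
        if self.training:
            with torch.no_grad(): m, v = self.update_stats(x)
        else: m, v = self.means, self.vars    # at inference, use the saved statistics
        x = (x - m) / (v + self.eps).sqrt()
        return x * self.mults + self.adds
```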
So we can train this: this time we're going to use our batch norm layer for norm. Oh, actually, do we need to change the bias 01:22:55.920 |
setting? No, that's fine. Okay. And one interesting thing I found 01:23:10.160 |
here is I was able to now finally increase the learning rate up to 0.4 for 01:23:16.220 |
the first time so each time I was really trying to see if I can push the learning 01:23:19.320 |
rate and I'm now able to double the learning rate and still as you can see 01:23:24.240 |
it's training very smoothly which is really cool so there's actually a number 01:23:30.200 |
of different types of normal layer based normalization we can use in this lesson 01:23:34.960 |
we've specifically seen batch norm and layer norm I wanted to mention that 01:23:39.400 |
there's also instance norm and group norm and this picture from the group 01:23:42.640 |
norm paper explains what happens the what it's showing is that we've got here 01:23:48.080 |
the N C H W and so they've kind of concatenated flattened H W into a single 01:23:53.640 |
axis, since they can't draw 4D cubes, and what they're saying is in batch norm all 01:24:01.200 |
this blue stuff is what we average over so we average across the batch and 01:24:05.680 |
across the height and width and we end up with one therefore normalization 01:24:10.960 |
number per channel right so you can kind of slide these blue blocks across so 01:24:16.400 |
batch norm is averaging over the batch and height width layer norm as we 01:24:21.760 |
learned averages over the channel and the height and the width and it has a 01:24:25.520 |
separate one per item in the mini batch I mean kind of it's a bit it's a bit 01:24:35.320 |
subtle, right, because remember the overall mult and add just had 01:24:40.840 |
literally a single number for each right so it's not quite as simple as this but 01:24:44.960 |
that's a general idea instance norm which we're not looking at today only 01:24:50.920 |
averages across height and width so there's going to be a separate one for 01:24:55.720 |
every channel and every element of the mini batch and then finally group norm 01:25:01.120 |
which I'm quite fond of is like instance norm but it arbitrarily basically groups 01:25:06.960 |
a bunch of channels together and you can decide how many groups of channels there 01:25:12.240 |
are and averages over them group norm tends to be a bit slow unfortunately 01:25:16.520 |
because the way these things are implemented is a bit tricky but group 01:25:20.360 |
norm does allow you to yeah avoid some of the the challenges of some of the 01:25:27.240 |
other methods so it's worth trying if you can and of course batch norm has the 01:25:35.280 |
additional thing of the kind of momentum based statistics but in general the idea 01:25:40.320 |
of like do you use momentum based statistics do you store things you know 01:25:47.160 |
per channel or a single mean and variance in your buffers or whatever you 01:25:53.480 |
know all that kind of stuff along with what do you average over they're all 01:25:56.480 |
somewhat independent choices you can make and particular combinations of those 01:26:00.240 |
have been given particular names. And so there we go. 01:26:00.240 |
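If you want to see those four variants side by side, PyTorch has built-in versions of each; note that its LayerNorm learns a per-element scale and shift rather than the single mult and add we wrote above.

```python
import torch
from torch import nn

x = torch.randn(64, 32, 28, 28)                    # a fake NCHW activation tensor

bn = nn.BatchNorm2d(32)                            # stats over (N,H,W): one mean/var per channel
ln = nn.LayerNorm([32, 28, 28])                    # stats over (C,H,W): one mean/var per item
inorm = nn.InstanceNorm2d(32)                      # stats over (H,W): one per item per channel
gn = nn.GroupNorm(num_groups=8, num_channels=32)   # like instance norm, over groups of channels

for layer in (bn, ln, inorm, gn):
    print(layer(x).shape)                          # all keep the shape (64, 32, 28, 28)
```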
Okay, so we've got some good initialization methods here; let's try putting them all 01:26:09.400 |
together and one other thing we can do is we've been using a batch size of 1024 01:26:21.000 |
for speed purposes if we drop it down a bit to 256 it's going to mean that it's 01:26:26.960 |
going to get to see more mini batches so that should improve performance and so 01:26:32.400 |
we're trying to get to 90% remember so let's yeah do all this this time we'll 01:26:40.080 |
use PyTorch's own batch norm; we'll just use PyTorch's. There's nothing wrong 01:26:44.080 |
with ours, but we try to switch to PyTorch's once something we've recreated 01:26:49.480 |
exists there we'll use our momentum learner and we'll fit for three epochs 01:26:57.440 |
and so as you can see it's going a little bit more slowly now and then the 01:27:04.240 |
other thing I'm going to do is I'm going to decrease the learning rate and keep 01:27:10.840 |
the existing model and then train for a little bit longer the idea being that as 01:27:20.160 |
the you know as it's kind of getting close to a pretty good answer maybe it 01:27:26.600 |
just wants to be able to fine-tune that a little bit and so we by decreasing the 01:27:31.080 |
learning rate we give it a chance to fine-tune a little bit so let's see how 01:27:38.960 |
we're going so we got to eighty seven point eight percent accuracy after three 01:27:43.360 |
epochs which is an improvement I guess mainly thanks to well basically thanks 01:27:52.080 |
to using this smaller mini batch size now with a smaller mini batch size you 01:27:57.360 |
do have to decrease the learning rate so I found I could still get away with point 01:28:01.320 |
two which is pretty cool and look at this after just one more epoch by 01:28:05.880 |
decreasing the learning rate we've got up to 89.7... oh, we didn't 01:28:10.640 |
make it; 89.9? So towards ninety percent, but not quite ninety 01:28:15.240 |
percent: 89.9. So we're going to have to do some more work to 01:28:20.520 |
get up to our magical 90% number but we are getting pretty close all right so 01:28:29.280 |
that is the end of initialization, an incredibly important topic, as hopefully you can now appreciate. Okay: 01:28:40.600 |
accelerated SGD. Let's see if we can use this to get us up to (above) 90 01:28:49.680 |
percent so let's do our normal imports and data set up as usual and so just to 01:28:57.220 |
summarize what we've got we've got our metrics callback we've got our activation 01:29:01.960 |
stats on the GeneralRelu, so our callbacks are going to be the device 01:29:01.960 |
callback to put it on CUDA or whatever the metrics the progress bar the 01:29:09.680 |
activation stats. Our activation function is going to be our GeneralRelu with 01:29:09.680 |
0.1 leakiness and 0.4 subtraction, and we've got the init 01:29:15.240 |
weights function, which we need to tell how leaky the activations are. And then if we're 01:29:21.600 |
doing a learning rate finder we've got a different set of callbacks so it's no 01:29:30.760 |
real reason to have a progress bar callback with a learning rate finder I 01:29:34.640 |
guess it's pretty short anyway oh which reminds me there was one little thing I 01:29:40.440 |
didn't mention in initializing which is a fun trick you might want to play around 01:29:48.240 |
with and in fact Sam Watkins asked a question earlier in the chat and I 01:29:54.840 |
didn't answer it because it's actually exactly here: in GeneralRelu I added a 01:30:01.440 |
second thing you might have seen which is the maximum value and if the maximum 01:30:06.280 |
value is set then I clamp the value to be no more than the maximum so basically 01:30:14.720 |
as a result let's say you set it to three then the line would go up to here 01:30:19.160 |
it like it does here and then it go up to three like it does here and then it 01:30:22.480 |
will be flat and using that can be a nice way I mean I'd probably go higher 01:30:29.360 |
up to about six but that can be a nice way to avoid yeah numbers getting too 01:30:34.480 |
big and maybe if you really wanted to have fun you could do kind of like a 01:30:39.320 |
leaky maximum which I haven't tried yet where maybe at the top it kind of goes 01:30:43.440 |
like you know ten times smaller kind of just exactly like the leaky could be so 01:30:50.920 |
anyway if you do that you'd need to make sure that the you know that you're still 01:30:57.800 |
getting zero one layers with your initialization but that would be 01:31:02.040 |
something you could consider playing with. 01:31:17.680 |
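As a sketch, a GeneralRelu with that optional maximum looks something like this (the names and defaults here are assumptions, and the notebook's version may differ):

```python
import torch.nn.functional as F
from torch import nn

class GeneralRelu(nn.Module):
    def __init__(self, leak=None, sub=None, maxv=None):
        super().__init__()
        self.leak, self.sub, self.maxv = leak, sub, maxv

    def forward(self, x):
        x = F.leaky_relu(x, self.leak) if self.leak is not None else F.relu(x)
        if self.sub  is not None: x = x - self.sub            # shift so the mean is closer to 0
        if self.maxv is not None: x = x.clamp_max(self.maxv)  # flatten the line above maxv
        return x

# e.g. GeneralRelu(leak=0.1, sub=0.4, maxv=6.)
```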
Okay, so let's create our own little SGD class. An SGD class is going to need to know what parameters to optimize, 01:31:26.240 |
and if you remember the module dot parameters method returns a generator so 01:31:31.760 |
we use a list to to turn you know we want to turn that into a list so it's 01:31:36.640 |
kind of forced to be a particular you know not not something that's going to 01:31:39.760 |
change we're going to need to know the learning rate we're going to need to know 01:31:45.280 |
the weight decay which we'll look at a bit in a moment and for reasons we'll 01:31:50.320 |
discuss later we also want to keep track of what batch number are we up to so an 01:31:56.160 |
optimizer basically has two things a step and a zero grad so what steps going 01:32:01.880 |
to do is obviously with no grad because this is not part of the learn part of 01:32:07.320 |
the thing that we're optimizing this is the optimization itself we go through 01:32:11.320 |
each tensor of parameters and we do a step of the optimizer and we'll come 01:32:16.520 |
back to this in a moment we do a step of the regularizer and we keep track of 01:32:20.120 |
what batch number we're up to and so what does SGD do in our step of the 01:32:24.720 |
optimizer it subtracts out from the parameter it's gradient times the 01:32:31.980 |
learning rate so that's an SGD optimization step and to zero the 01:32:36.840 |
gradients we go through each parameter and we zero it and that's in torch dot 01:32:48.660 |
no grad. Or you can use dot data; that way, if you use dot data, then you don't 01:33:01.640 |
need to say the no grad, it's just a little typing saver. 01:33:10.920 |
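Here is a sketch of that SGD class, close to what was just described (the notebook's exact code may differ slightly; the order of the regularizer and optimizer steps barely matters in practice):

```python
import torch

class SGD:
    def __init__(self, params, lr, wd=0.):
        self.params = list(params)   # force the generator into a fixed list
        self.lr, self.wd = lr, wd
        self.i = 0                   # batch counter (used later for Adam's unbiasing)

    def step(self):
        with torch.no_grad():
            for p in self.params:
                self.reg_step(p)     # weight decay
                self.opt_step(p)     # the actual update
        self.i += 1

    def opt_step(self, p): p -= p.grad * self.lr

    def reg_step(self, p):
        if self.wd != 0: p *= 1 - self.lr * self.wd

    def zero_grad(self):
        for p in self.params: p.grad.data.zero_()
```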
Okay, so let's create a train learner, a learner with the training callback kind of built in. And 01:33:15.120 |
we're going to set the optimization function to be this SGD we just wrote 01:33:19.360 |
and we'll use the batch norm model with the weight initialization we've used 01:33:24.360 |
before and if we train it then just this is just should give us basically the 01:33:30.640 |
same results we've had before while this is training I'm going to talk about 01:33:35.400 |
regularization hopefully you remember from part one of this course or from 01:33:43.560 |
your other learning what weight decay is and so just to remind you weight decay 01:33:53.500 |
or L2 regularization are kind of the same thing and basically what we're doing is 01:34:02.600 |
we're saying let's add the square of the weights to the loss function now if we 01:34:13.120 |
add the square of the weights to the loss function so whatever our loss 01:34:19.040 |
function is, we'll just call it loss, we're adding the sum of the 01:34:32.160 |
square of the weights so that's our L and so the only thing we actually care 01:34:38.280 |
about is the derivative of that and the derivative of that is equal to the 01:34:50.080 |
derivative of the loss plus the derivative of this which is just the 01:35:06.200 |
sum of 2w. And then what we do is we multiply this bit here by some 01:35:16.320 |
constant, which is the weight decay, so we call that weight decay, and since the 01:35:20.240 |
weight decay could directly incorporate the 2, we can actually just 01:35:24.600 |
delete that entirely and just multiply by weight decay. I'm doing this very 01:35:38.600 |
quickly because we have already covered it in part one so this is hopefully 01:35:43.120 |
something that you've all seen before so we can do weight decay by taking our 01:35:52.800 |
gradients and adding on the weight decay times the weights and so as a result then 01:36:07.520 |
in SGD because that's part of the gradient oh man I got it the wrong way 01:36:14.160 |
around need to do that first I guess well whatever okay so since that's part of 01:36:27.800 |
the gradient then in the optimization step that's using the gradient and it's 01:36:34.360 |
subtracting out gradient times learning rate but what you could do is because 01:36:42.400 |
we're just ending up doing p dot grad times self dot LR and the p dot grad 01:36:47.080 |
update is just to add in WT times weight we could simply skip updating the 01:36:54.240 |
gradients and instead directly update the weights to subtract out the learning 01:36:59.360 |
rate times the WD times weight so they would be mathematically identical and 01:37:04.840 |
that is what we've done here in the regularization step we basically say if 01:37:10.080 |
you've got weight decay then just take P times equals 1 minus the learning rate 01:37:21.240 |
times the weight decay which is mathematically the same as this because 01:37:26.320 |
we've got weight on both sides. So that's why the regularization step is here inside our SGD. 01:37:34.480 |
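Written out, with the factor of 2 folded into the weight decay constant, the equivalence is:

```latex
L_{\mathrm{wd}}(w) = L(w) + \tfrac{\mathrm{wd}}{2}\,\lVert w\rVert^2
\quad\Longrightarrow\quad
\nabla L_{\mathrm{wd}}(w) = \nabla L(w) + \mathrm{wd}\cdot w
```

and one SGD step with that gradient is

```latex
w \leftarrow w - \mathrm{lr}\,\bigl(\nabla L(w) + \mathrm{wd}\cdot w\bigr)
            = (1 - \mathrm{lr}\cdot\mathrm{wd})\,w - \mathrm{lr}\,\nabla L(w),
```

which is exactly the `p *= 1 - lr*wd` form used in the regularization step.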
And yes, it's finished running; that's good, we've got 85% accuracy, 01:37:40.960 |
that all looks fine and we're able to train at a high learning rate of 0.4 so 01:37:51.120 |
that's pretty cool so now let's add momentum now we had a kind of a hacky 01:37:56.840 |
momentum learner before but we're going to see momentum should be in an 01:38:00.040 |
optimizer really and so let's talk a bit about what momentum actually is so let's 01:38:07.200 |
just create some some data so our X's are just going to be equally spaced 01:38:13.720 |
numbers from minus four to four a hundred of them and our Y's are just 01:38:18.040 |
going to be our X's divided by three squared one minus that plus some 01:38:26.080 |
randomization and so these dots here is our random data I'm going to show you 01:38:32.640 |
what momentum is by example, and this is something that Sylvain Gugger helped 01:38:39.320 |
build, so thank you Sylvain, for our book actually, if memory serves correctly; 01:38:45.620 |
actually it might have even been the course before that. What we're going to do is 01:38:50.440 |
we're going to show you what momentum looks like for a range of different 01:38:53.440 |
levels of momentum these are the different levels we're going to use so 01:38:57.880 |
let's take a beta of 0.5 so that's going to be our first one so we're going to 01:39:01.400 |
do a scatter plot of our X's and Y's that's the blue dots and then we're 01:39:05.520 |
going to go through each of the Y's and we're going to do this hopefully looks 01:39:10.320 |
familiar this is doing a loop we're going to take our previous average which 01:39:16.840 |
we'll start at zero times beta which is 0.5 plus 1 minus beta that's 0.5 times 01:39:26.920 |
our new average and then we'll append that to this red line and we'll do that 01:39:38.040 |
for all the data points and then plot them and you can see what happens when we 01:39:42.600 |
do that is that the red line becomes less bumpy right because each one is 01:39:49.880 |
half it's this exact dot and half of whatever the red line previously was so 01:39:55.720 |
again this is an exponentially weighted moving average and so we could have 01:40:00.000 |
implemented this using loop so as the beta gets higher it's saying do more of 01:40:11.080 |
just be wherever the red line used to be and less of where this particular data 01:40:16.640 |
point is and so that means when we have these kind of outliers the red line 01:40:21.800 |
doesn't jump around as much as you see but if your momentum gets too high then 01:40:30.120 |
it doesn't follow what's going on at all and in fact it's way behind right when 01:40:35.380 |
you're using momentum it's always going to be partially responding to how things 01:40:40.920 |
were many batches ago and so even at beta is 0.9 here the red line is offset 01:40:51.320 |
to the right because again it's taking it a while for it to recognize that all 01:40:55.800 |
things have changed because each time it's 0.9 of it is where the red line 01:41:01.320 |
used to be and only point one of it is what does this data point say so that's 01:41:07.400 |
what momentum does so the reason that momentum is useful is because when you 01:41:16.000 |
have a you know a loss function that's actually kind of like very very bumpy 01:41:27.600 |
like that right you want to be able to follow the actual curve right so using 01:41:37.120 |
momentum you don't quite get that but you get a kind of a version of that that's 01:41:41.120 |
offset to the right a little bit but still you know hopefully spending a lot 01:41:47.760 |
more time you don't really want to be heading off in this direction which you 01:41:51.200 |
would if you follow the line and then this direction which you would if you 01:41:53.820 |
follow the line you really want to be following the average of those 01:41:57.440 |
directions and that's what momentum lets you do so to use momentum we will 01:42:11.680 |
inherit from SGD and we will override the definition of the optimization step 01:42:16.840 |
remember there was two things that step called it called the regularization step 01:42:21.000 |
and the optimization step so we're going to modify the optimization step we're 01:42:26.200 |
not just going to do minus equals grad times self dot LR but instead then when 01:42:32.440 |
we create our momentum object we will tell it what momentum we want or default 01:42:39.360 |
to point nine store that away and then in the optimization step for each 01:42:45.560 |
parameter because remember the optimization step is being called for 01:42:49.560 |
each parameter in our model so that's each layers weights and each layers 01:42:54.280 |
biases for example we'll find out for that parameter have we ever stored away 01:42:59.320 |
its moving average of gradients before and if we haven't then we'll set them to 01:43:05.600 |
zero initially just like we did here and then we will do our loop right so we're 01:43:16.560 |
going to say the moving average of exponentially weighted moving average of 01:43:19.640 |
gradients is equal to whatever it used to be times the momentum plus this 01:43:28.600 |
actual new batches gradients times one minus momentum so that's just doing the 01:43:34.160 |
loop as we discussed and so then we're just going to do exactly the same as the 01:43:38.200 |
SGD update step but instead of multiplying by p dot grad we're 01:43:42.400 |
multiplying it by p dot grad average so there's a cool little trick here right 01:43:46.160 |
which is that we are basically inventing a brand new attribute putting it inside 01:43:51.120 |
the parameter tensor and that attribute is where we're storing away the moving 01:43:58.200 |
average exponentially weighted moving average of gradients for that particular 01:44:01.920 |
parameter so as we loop through the parameters we don't have to do any 01:44:05.400 |
special work to get access to that, so I think that's pretty handy. 01:44:13.920 |
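A sketch of that momentum optimizer, building on the SGD sketch from earlier (the attribute name grad_avg is an assumption):

```python
import torch

class Momentum(SGD):
    def __init__(self, params, lr, wd=0., mom=0.9):
        super().__init__(params, lr, wd=wd)
        self.mom = mom

    def opt_step(self, p):
        if not hasattr(p, 'grad_avg'): p.grad_avg = torch.zeros_like(p.grad)
        # exponentially weighted moving average of the gradients (the lerp from earlier)
        p.grad_avg = p.grad_avg * self.mom + p.grad * (1 - self.mom)
        p -= p.grad_avg * self.lr
```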
All right, so one very interesting thing I found here is I could really hike the 01:44:17.360 |
learning rate way up to 1.5 and the reason why is because we're not getting 01:44:21.560 |
these huge bumps anymore and so by getting rid of the huge bumps it the 01:44:25.480 |
whole thing's just a whole lot smoother so previously we got up to 85% because 01:44:33.000 |
we've gone back to our 1024 batch size and just three epochs and a constant 01:44:38.840 |
learning rate and look at that we've got up to 87.6% so it's really improved 01:44:44.600 |
things and the the loss function is nice and smooth as you can see okay and so 01:44:54.280 |
then in our color dim plot you can see it's this is actually that's really the 01:44:58.840 |
really the smoothest we've seen and it's a bit different to the momentum learner 01:45:04.280 |
because the momentum learner didn't have this one minus part right it wasn't 01:45:10.640 |
lerping it was it was basically always including all of the grad plus a bit of 01:45:15.680 |
the momentum part so this is yeah this is a different better approach I think 01:45:26.140 |
and yeah we've got a really nice smooth result one person's asking don't we get a 01:45:33.760 |
similar effect I think in terms of the smoothness if we increase the batch size 01:45:37.060 |
which we do but if you just increase the batch size you're giving it less 01:45:42.000 |
opportunities to update so having a really big batch size is actually not 01:45:46.560 |
great. Yeah, and LeCun, who created the first really successful convnets, 01:45:52.040 |
LeNet-5, says he thinks the ideal batch size, if you can get away 01:45:57.000 |
with it, is one; but it's just slow, and you want to have as many opportunities to update 01:46:04.120 |
as possible there's this weird thing recently where people seem to be trying 01:46:07.680 |
to create really large batch sizes which to me is yeah doesn't make any sense we 01:46:19.440 |
want the smallest batch size we can get away with generally speaking to give it 01:46:22.440 |
the most chances to update so this has done a great job of that and we've 01:46:25.840 |
getting very good results despite using yeah only three epochs of very large 01:46:32.120 |
batch size okay so that's called momentum now something that was developed in a 01:46:38.560 |
course, or announced in a Coursera course back in maybe 2012 or 2013 by Geoffrey Hinton, 01:46:45.200 |
has never been published is called RMS prop let's have it running while we talk 01:46:50.560 |
about it RMS prop is going to update the optimization step using something very 01:46:56.640 |
similar to momentum but rather than lerping on the p dot grad we're going to 01:47:09.920 |
lerp on p dot grad squared and well just to keep it to keep it kind of consistent 01:47:19.400 |
we won't call it mom we call it square mom but this is just the multiplier and 01:47:23.320 |
what are we doing with the grad squared well the idea is that a large grad 01:47:29.120 |
squared indicates a large variance of gradients so what we're then going to do 01:47:36.800 |
is divide by the square root of that plus epsilon now you'll see I've 01:47:45.320 |
actually been a bit all over the place here with my batch norm I put the epsilon 01:47:50.960 |
inside the square root in this case I'm putting the epsilon outside the square 01:47:56.840 |
root it does make a difference and so be careful as to how your epsilon is being 01:48:03.200 |
interpreted generally speaking I can't remember if I've been exactly right but 01:48:07.440 |
I've tried to be consistent with the papers or normal implementations this is 01:48:12.160 |
a very common cause of confusion and errors though so what we're doing here is 01:48:18.880 |
we're dividing the gradient by the the amount of variation so the square root 01:48:27.880 |
of the moving average of gradient squared and so the idea here is that if 01:48:34.080 |
the gradient has been moving around all over the place then we don't really know 01:48:40.920 |
what it is right so should we shouldn't do a very big update if the gradient is 01:48:48.280 |
very very much the same all the time then we're very confident about it so we 01:48:54.080 |
do want to be a big update I have no idea why we're doing this in two steps 01:48:57.600 |
let's just pop this over here now because we are dividing our gradient by 01:49:08.000 |
this generally possibly rather small number we generally have to decrease the 01:49:15.320 |
learning rate so bring the learning rate back to 0.01 and as you see it's 01:49:20.320 |
training... oh, it's not amazing, but it's training okay. 01:49:30.080 |
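A sketch of RMSProp as described: lerp on the squared gradients, epsilon outside the square root, and the squared-gradient average initialized from the first batch rather than from zeros (defaults here are assumptions).

```python
class RMSProp(SGD):
    def __init__(self, params, lr, wd=0., sqr_mom=0.99, eps=1e-5):
        super().__init__(params, lr, wd=wd)
        self.sqr_mom, self.eps = sqr_mom, eps

    def opt_step(self, p):
        if not hasattr(p, 'sqr_avg'):
            p.sqr_avg = p.grad ** 2          # start from the first batch's grad**2, not zeros
        p.sqr_avg = p.sqr_avg * self.sqr_mom + (p.grad ** 2) * (1 - self.sqr_mom)
        # divide by the (moving) size of the gradients: bigger update when they're consistent,
        # smaller update when they're jumping around
        p -= self.lr * p.grad / (p.sqr_avg.sqrt() + self.eps)
```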
So RMSProp can be quite nice. It's a bit bumpy there, isn't it? I mean, I could try decreasing it a little: 01:49:43.160 |
that's a little bit better and a bit smoother that's probably good see what 01:49:58.840 |
the colorful dimension plot looks like - shall we again it's very nice isn't it 01:50:04.600 |
that's great now one thing I did which I don't think I've seen done before I 01:50:12.400 |
don't remember people talking about is I actually decided not to do the normal 01:50:16.860 |
thing of initializing to zeros because if I initialize to zeros then my 01:50:25.280 |
initial denominator here will basically be 0 plus epsilon which will mean my 01:50:30.320 |
initial learning rate will be very very high which I certainly don't want so I 01:50:34.560 |
actually initialized it at first to just whatever the first mini-batch's gradient 01:50:38.960 |
is, squared, and I think this is a really useful little trick for using RMS 01:50:45.880 |
prop. Momentum, you know, can be a bit aggressive sometimes for some really 01:50:57.080 |
you know finicky learning methods finicky architectures and so RMS prop 01:51:04.720 |
can be a good way to get reasonably fast optimization of a very finicky 01:51:12.080 |
architectures and in particular efficient net is an architecture which 01:51:16.720 |
people have generally trained best with RMS prop so you don't see it a whole lot 01:51:21.840 |
but you know in some ways it's just historical interest but you see it a bit 01:51:26.560 |
but I mean the thing we really want to look at is our RMS prop plus momentum 01:51:32.400 |
together and RMS prop plus momentum together exists it has a name you will 01:51:37.320 |
have heard the name many times name is Adam Adam is literally just RMS prop and 01:51:43.400 |
momentum so we rather annoyingly call them beta 1 and beta 2 they should be 01:51:51.280 |
called momentum and square momentum or momentum of squares I suppose so beta 1 01:51:58.320 |
is just the momentum from from the momentum optimizer beta 2 is just these 01:52:03.920 |
momentum for the squares from the RMS prop optimizer so we'll store those away 01:52:10.720 |
and just like RMS prop we need the epsilon so I'm going to as before store 01:52:19.440 |
away the gradient average and the square average and then we're going to do our 01:52:25.840 |
lerping but there's a nice little trick here which is in order to avoid doing 01:52:33.160 |
this where we just put the initial batch gradients as our starting values we're 01:52:41.560 |
going to use zeros as our starting values and then we're going to unbias 01:52:47.720 |
them so basically the idea is that for the very first mini batch if you have 01:52:52.480 |
zero here being lerped with the gradient then the first mini batch will obviously 01:53:02.160 |
be closer to zero than it should be but we know exactly how much closer it 01:53:07.400 |
should be to zero which is just it's going to be self beta 1 times closer at 01:53:17.080 |
least in the first mini batch because that's what we've worked with and then 01:53:20.080 |
the second mini batch to be self beta 1 squared and so and the third mini batch 01:53:24.120 |
to be self beta 1 cubed and so forth and that's why we had this self dot I back 01:53:30.320 |
in our SGD which was keeping track of what mini batch were up to so we need 01:53:38.560 |
that in order to do this unbiasing of the average oh dear I'm not unbiasing the 01:53:49.680 |
square of the average am I I'm not whoops so we need to do that here as well 01:53:58.800 |
wonder if this is going to help things a little bit unbiased square average is 01:54:05.360 |
going to be P dot square average and that will be beta 2 and so we will use 01:54:16.960 |
those unbiased versions so this this this unbiasing only matters for the 01:54:21.880 |
first few mini batches where otherwise it would be too close to zero you know 01:54:25.800 |
it'll be closer to zero than it should be. 01:54:40.760 |
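And a sketch of Adam as described: RMSProp plus momentum, with both moving averages started at zero and then unbiased using the batch counter self.i from the SGD sketch (the default betas here are assumptions).

```python
import torch

class Adam(SGD):
    def __init__(self, params, lr, wd=0., beta1=0.9, beta2=0.99, eps=1e-5):
        super().__init__(params, lr, wd=wd)
        self.beta1, self.beta2, self.eps = beta1, beta2, eps

    def opt_step(self, p):
        if not hasattr(p, 'avg'):
            p.avg, p.sqr_avg = torch.zeros_like(p.grad), torch.zeros_like(p.grad)
        p.avg     = p.avg     * self.beta1 + p.grad        * (1 - self.beta1)
        p.sqr_avg = p.sqr_avg * self.beta2 + (p.grad ** 2) * (1 - self.beta2)
        # unbias: early on, the averages are too close to zero by a known factor
        unbias_avg     = p.avg     / (1 - self.beta1 ** (self.i + 1))
        unbias_sqr_avg = p.sqr_avg / (1 - self.beta2 ** (self.i + 1))
        p -= self.lr * unbias_avg / (unbias_sqr_avg.sqrt() + self.eps)
```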
Right, so we run that. And again, you'd expect the learning rate to be similar to what RMSProp 01:54:45.480 |
needs, because we're doing that same division, so we actually have the 01:54:49.440 |
same learning rate here. And yeah, we're up to 86.5 percent accuracy, so 01:54:58.080 |
that's pretty good I think yeah it's actually a bit less good than momentum 01:55:05.120 |
which is fine you know obviously you can fiddle around or momentum we had 0.9 yeah 01:55:14.120 |
so you can fiddle around with different values of beta 2 beta 1 see if you can 01:55:18.880 |
beat the momentum version I suspect you probably can okay oh we're a bit out of 01:55:31.000 |
time aren't we all right I'm excited about the next bit but I wanted to spend 01:55:36.480 |
time doing it properly so I won't rush through it now but instead we're going 01:55:39.880 |
to do it next time so I will yes I will give you a hint that in our next lesson 01:55:47.720 |
we will in fact get above 90% and it's got some very cool stuff to show you I 01:55:54.920 |
can't wait to show you that then but you know I think in the meantime let's give 01:56:00.040 |
ourselves a pat on the back that we have successfully implemented, you know, I mean, 01:56:04.960 |
think about all this stuff we've got running and happening and we've done the 01:56:08.720 |
whole thing from scratch using nothing but what's in the Python standard 01:56:13.820 |
library we've re-implemented everything and it's we understand exactly what's 01:56:18.880 |
going on so I think this is this is really quite 01:56:22.520 |
terrifically cool personally I hope you feel the same way and look forward to