Back to Index

Lesson 17: Deep Learning Foundations to Stable Diffusion


Chapters

0:00 Changes to previous lesson
7:50 Trying to get 90% accuracy on Fashion-MNIST
11:58 Jupyter notebooks and GPU memory
14:59 Autoencoder or Classifier
16:05 Why do we need a mean of 0 and standard deviation of 1?
21:21 What exactly do we mean by variance?
25:56 Covariance
29:33 Xavier Glorot initialization
35:27 ReLU and Kaiming He initialization
36:52 Applying an init function
38:59 Learning rate finder and MomentumLearner
40:10 What's happening in each stride-2 convolution?
42:32 Normalizing input matrix
46:09 85% accuracy
47:30 Using with_transform to modify input data
48:18 ReLU and 0 mean
52:06 Changing the activation function
55:09 87% accuracy and nice looking training graphs
57:16 “All You Need Is a Good Init”: Layer-wise Sequential Unit Variance
63:55 Batch Normalization, Intro
66:39 Layer Normalization
75:47 Batch Normalization
83:28 Batch Norm, Layer Norm, Instance Norm and Group Norm
86:11 Putting it all together: towards 90%
88:42 Accelerated SGD
93:32 Regularization
97:37 Momentum
105:32 Batch size
106:37 RMSProp
111:27 Adam: RMSProp plus Momentum

Transcript

Hi everybody, and welcome to lesson 17 of Practical Deep Learning for Coders. I'm really excited about what we're going to look at over the next lesson or two. It's actually been turning out really well, much better than I could have hoped, so I can't wait to dive in. Before I do, I'm just going to mention a couple of minor changes that I made to our miniai library this week.

One was that I went back to our callback class in the learner notebook and decided in the end to add a dunder getattr (__getattr__) to it that forwards these four attributes down to self.learn. So in a callback you'll be able to refer to model to get self.learn.model, opt will be self.learn.opt, batch will be self.learn.batch, and epoch will be self.learn.epoch.

You can change these: you could subclass the callback and add your own names to _forward, or remove things from _forward, or whatever, but I felt like these are four things I access a lot, and I was sick of typing self.learn. Then I added one more property, because in a callback there'll be a self.training, which saves typing self.learn.model.training. Since we have model you could at least get rid of the learn, but you so often have to check whether you're training that now you can just use self.training in a callback. So that was one change I made.
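Here's a minimal sketch of what that callback base class might look like after these changes. The attribute name _forward is my guess from the description above, so treat this as illustrative rather than the actual miniai source.

class Callback():
    # names that are forwarded to self.learn when looked up on the callback
    _forward = ('model', 'opt', 'batch', 'epoch')

    def __getattr__(self, name):
        # __getattr__ is only called when normal attribute lookup fails
        if name in self._forward: return getattr(self.learn, name)
        raise AttributeError(name)

    @property
    def training(self):
        # shortcut for self.learn.model.training
        return self.learn.model.training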

The second change I made was that I found myself getting a bit bored of adding train_cb every time. So I took the four training methods from the MomentumLearner subclass and moved them into a TrainLearner subclass, along with zero_grad, so MomentumLearner now inherits from TrainLearner and just adds its slightly quirky momentum method, changing zero_grad to do the momentum thing. We'll be using TrainLearner quite a bit over the next lesson or two: TrainLearner is just a learner with the usual training step, exactly the same as fastai has, or as you'd have in most PyTorch training loops. Obviously by using it you lose the ability to change those steps with a callback, so it's a little bit less flexible. Okay, so those are little changes. Then I made some changes to what we looked at last week, which is the activations notebook. I added a hooks callback: previously we had a Hooks class, and it didn't really require too much ceremony to use, but I thought we could make it even simpler, and a bit more fastai-ish or miniai-ish, by putting hooks into a callback. As usual, you pass this callback a function that's going to be called for your hook, and you can optionally pass it a filter for which modules you want to hook. Then in before_fit it filters the modules in the learner (this is one of the places we can now get rid of .learn, because model is one of the four things we have a shortcut to), creates the Hooks object, and stores it in self.hooks. One thing that's convenient here is that you no longer have to worry about checking in your hook functions whether you're in training or not: it always checks whether you're training and, if so, calls the hook function you passed in. After fit finishes it removes the hooks, and you can iterate through the callback and get its length, because it just passes those down to self.hooks. To show how this works, we can create a hooks callback using the same append_stats as before and then train the model, adding it as an extra callback to our fit function. (To explain: I added extra callbacks to the fit function; I don't remember if we had that before, I'm not sure we did, so fit can now take any extra callbacks.) Because we can iterate through that callback, we can just treat it as if it's the hooks and plot in the usual way. So that's a convenient little thing I added, I think. Then I took the colorful dimension stuff, which Stefano and I came up with a few years ago, and decided to wrap all that up in a callback as well, by subclassing the hooks callback to create ActivationStats. That uses append_stats, which appends the means, the standard deviations and the histograms. Oh, and I changed the thing which creates the dead plots very slightly too: I changed it to just take the ratio of the very first, very smallest histogram bin to the rest of the bins, so these are now really more like "very dead", which is why those graphs look a little bit different.
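A rough sketch of what such a hooks callback could look like, assuming the Hooks class and the Callback base from the earlier notebooks; this is my reconstruction from the description, not the exact miniai code.

# Hooks and Callback come from the earlier miniai notebooks
class HooksCallback(Callback):
    def __init__(self, hookfunc, mod_filter=lambda m: True):
        self.hookfunc, self.mod_filter = hookfunc, mod_filter

    def before_fit(self):
        # `model` is forwarded to self.learn.model by the callback __getattr__
        mods = [m for m in self.model.modules() if self.mod_filter(m)]
        self.hooks = Hooks(mods, self._hookfunc)

    def _hookfunc(self, *args, **kwargs):
        # only record stats while training
        if self.training: self.hookfunc(*args, **kwargs)

    def after_fit(self): self.hooks.remove()

    # iterating over the callback iterates over its hooks
    def __iter__(self): return iter(self.hooks)
    def __len__(self): return len(self.hooks)

Usage would then be something like hc = HooksCallback(append_stats, mod_filter=lambda m: isinstance(m, nn.Conv2d)), passed to fit as an extra callback.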
So I subclassed the HooksCallback and added a color_dim method, a dead_chart method and a plot_stats method. To see them at work, if we want the activations of all the convs, we create our ActivationStats, add it as an extra callback, train our model, and then we can just call color_dim to get that plot, dead_chart to get that plot, and plot_stats to get that chart. So now we have absolutely no excuse for not getting all of these really fantastic, informative visualizations of what's going on inside our model, because it's literally as easy as adding one line of code and putting it in your callbacks. It couldn't be easier, so even for models you thought were training really well, why don't you try using this? You might be surprised to discover that they're not. Okay, so those are some changes, pretty minor but hopefully useful. Today and over the next lesson or two we're going to try to get to an important milestone, which is to get Fashion-MNIST training to an accuracy of 90% or more. That's certainly not the end of the road, but it's not bad: if we look at Papers with Code, 90% would be a 10% error, and there are folks who have got down to 3 or 4 percent error at the very best, which is very impressive, but 10% error wouldn't be way off what's on that leaderboard. I don't know how far we'll get eventually, but without using any architectural changes, no resnets or anything, we're going to try to get into the 10% error range. All right, so the first few cells are just copied from earlier, and here's our ridiculously simple model. All I did here was say: the very first convolution is looking at a 3 by 3 patch of a 1-channel input, so just 9 values, and we should compress that at least a little bit, so I made it 8 channels of output for that convolution, and then I just doubled it to 16, doubled it to 32, doubled it to 64. So that's going to get us to a 14 by 14 image, then 7 by 7, then 4 by 4, then 2 by 2, and then this last one gets us to 1 by 1, where of course we output the 10 classes. There was no real thought behind this architecture; it's just a pure convolutional architecture, and remember the flatten at the end is necessary to get rid of the unit axes we end up with, because the output is 1 by 1. So let's do a learning rate finder on this very simple model. What I found was that the situation is so bad that when I tried to use the learning rate finder in the usual way, just starting at say 1e-5 or 1e-4 and running it, it looks ridiculous; it's impossible to see what's going on. If you remember, we added that multiplier (we called it lr_mult, or gamma, which is what they call it in PyTorch, so we ended up calling it gamma), and I dialed that way down to make it much more gradual, which means I had to dial up the starting learning rate, and only then did I manage to get the learning rate finder to tell us anything useful. So there we are, that's our learning rate finder; I'm just going to come back to these three later. I tried using a learning rate of 0.2: after trying a few different values, 0.4, 0.1, 0.2, that seems about the highest we can get up to, and even this is actually too high. I found with much lower it didn't train much at all. You can see what happens when I run it: it starts training, and then we lose it, which is unfortunate.
You can see it in the colorful dimension plot: we get this classic picture of activations crashing, growing, and crashing again. The key problem here is really that we don't have zero-mean, standard-deviation-one layers at the start, so we certainly don't keep them throughout, and that's a problem. Just something I've got to mention, by the way: when you're training stuff in Jupyter notebooks (this is just a new thing we've added), you can easily run out of GPU memory, and it turns out there are two particular reasons why that happens after you've run a few cells. The first is that, for your convenience, Jupyter (you may or may not know this) stores the results of your previous evaluations. If you just type underscore it shows you the very last thing you evaluated, and you can add more underscores to go further back in time, or you can use numbers: Out[16], for example, would be _16. The reason this is an issue is that if one of your outputs is a big CUDA tensor and you've displayed it in a cell, that's going to keep that GPU memory basically forever, which is a bit of a problem. So if you're running out of memory, one thing you want to do is clean out all of those underscore things. I found there's actually a function that nearly does this in the IPython source code, so I copied the important bits out of it and put them in here: if you call clean_ipython_hist (don't worry about the lines of code at all), it's just a thing you can use to get back that GPU memory. The second thing, which Peter figured out in the last week or so, is that if you have a CUDA error at any point, or even any kind of exception, then the exception object is stored by Python, and any tensors that were allocated anywhere in that traceback stay allocated basically forever. Again, that's a big problem, so I created this clean_tb function based on Peter's code which gets rid of that. It's particularly problematic because if you have a CUDA out-of-memory error and then you try to rerun the cell, you'll still get a CUDA out-of-memory error, because all the memory that was allocated before is now held by that traceback. So basically, any time you get a CUDA out-of-memory error, or any kind of error to do with memory, you can call clean_mem: that will clean the memory held in your traceback, clean the memory used in your Jupyter history, do a garbage collect, and empty the CUDA cache, which should give you a totally clean GPU. You don't have to restart your notebook.
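A rough sketch of what these helpers do, pieced together from the description above; the real versions in the notebook differ in detail, and the IPython-history handling in particular is only an approximation.

import gc, sys, traceback
import torch

def clean_ipython_hist():
    # blank out IPython's output history (Out, _N, _, __, ___) so that any big
    # CUDA tensors it references can be garbage collected
    if 'get_ipython' not in globals(): return
    user_ns = get_ipython().user_ns
    for k in list(user_ns):
        if k in ('_', '__', '___') or (k.startswith('_') and k[1:].isdigit()):
            user_ns[k] = None
    if 'Out' in user_ns: user_ns['Out'].clear()

def clean_tb():
    # drop the saved exception and traceback, which otherwise keep every tensor
    # allocated in those frames alive
    if hasattr(sys, 'last_traceback'):
        traceback.clear_frames(sys.last_traceback)
        delattr(sys, 'last_traceback')
    for attr in ('last_type', 'last_value'):
        if hasattr(sys, attr): delattr(sys, attr)

def clean_mem():
    clean_tb()
    clean_ipython_hist()
    gc.collect()
    torch.cuda.empty_cache()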
Okay, so Sam asked a very good question in the chat, so just to remind you guys: he's asking, "I thought we were training an autoencoder; are you training a classifier, or what?" Yes, we did start doing an autoencoder back in notebook 8, and we decided we don't have the tools to make this work yet, so let's go back and create the tools and then come back to it. In creating the tools we're doing a classifier: we're trying to make a really good Fashion-MNIST classifier, or rather we're trying to create tools which hopefully will have the side effect of giving us a really good classifier, and then using those tools we hope we'll be able to create a really good autoencoder. So yes, we're kind of gradually unwinding, and we'll come back to where we were actually trying to get to; that's why we're doing this classifier, and the techniques and library pieces we're building will all be very necessary. Okay, so why do we need a zero mean and a standard deviation of one? A, why do we need that, and b, how do we get it? First of all, on the why: if you think about what a deep neural net does, it takes an input and puts it through a whole bunch of matrix multiplications. Of course there are activation functions sandwiched in there, but don't worry about the activation functions; that doesn't change the argument. So let's imagine a 50-deep neural net. Ignoring the activation functions, it basically takes the previous input and does a matrix multiply by some initially random weights. These are just a bunch of random weights, and torch.randn gives mean zero, variance one. If we run this, multiplying by a matrix, by a matrix, by a matrix, 50 times over, we end up with NaNs. That's no good. So maybe the numbers in our matrices were too big: each time we multiply, the numbers get bigger and bigger, so maybe we should make them a bit smaller. Okay, so let's try scaling the matrices we multiply by down by 0.01 and do that lots of times. Oh, now we've got zeros. Now, mathematically speaking the first case isn't actually NaN, it's some really big number, and the second case isn't really zero, it's some really small number, but computers can't handle really, really small numbers or really, really big numbers: really big numbers eventually just get called NaN, and really small numbers eventually just get called zero, so they get washed out. In fact, even if you don't quite get a NaN or a zero, for numbers that are extremely big the internal representation has no ability to discriminate between even slightly different values; in the way floating point is stored, the further you get from zero, the less accurate the numbers are. So this is a problem: we have to scale our weight matrices exactly right, and we have to scale them in such a way that the standard deviation at every point stays at one and the mean stays at zero. There's actually a paper that describes how to do this for multiplying lots of matrices together, and it basically just went through some pretty simple math. They looked at the propagation of activations and gradients, and they came up with a particular weight initialization: a uniform distribution with 1 over root n as its bounds. They studied what happened with various different activation functions, and as a result we now have this way of initializing neural networks, which is called either Glorot initialization or Xavier initialization. This is the amount we scale our random numbers by, where n_in is the number of inputs: in our case we have 100 inputs, so root 100 is 10, and 1 over 10 is 0.1.
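Here's a small sketch of that experiment, roughly along the lines of the notebook; the shapes (200 by 100 inputs, 100 by 100 weights) are just illustrative.

import torch

x = torch.randn(200, 100)           # input: mean ~0, std ~1
for i in range(50): x = x @ torch.randn(100, 100)
print(x[0, :5])                     # nan, nan, ... after 50 unscaled matmuls

x = torch.randn(200, 100)
for i in range(50): x = x @ (torch.randn(100, 100) * 0.01)
print(x[0, :5])                     # 0., 0., ... everything has washed out

x = torch.randn(200, 100)
for i in range(50): x = x @ (torch.randn(100, 100) * 0.1)   # Glorot: 1/sqrt(100)
print(x.mean(), x.std())            # stays at a reasonable scale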
And indeed, if we start with our random numbers and multiply by random matrices scaled by 0.1, which is the Glorot initialization, you can see we end up with numbers that are actually reasonable, so that's pretty cool. Just some background, in case you're not familiar with some of these details: what exactly do we mean by variance? Take a tensor, let's call it t, and put 1, 2, 4 and 18 in it. The mean of that is simply the sum divided by the count, which is 6.25. Now we want to come up with a measure of how far away each data point is from the mean, because that tells you how much variation there is. If you've got a bunch of data points that are all pretty similar to each other, the mean will be somewhere in the middle and the average distance of each point from the mean is not very far; whereas if the points are widely spread all over the place, you might end up with the same mean, but the distance from each point to the mean is now quite long. So we want some measure of how far away the points are, on average, from the mean. We could try taking our tensor, subtracting the mean, and then taking the mean of that, but that doesn't work, because some numbers are bigger than the mean and some are smaller, so when you average them out you get exactly zero, by definition. Instead you could either square those differences (and you could also take the square root of that, if you want to get back onto the same scale) or you could take the absolute differences. I'm doing it in two steps here: the first version is on a different scale, and then I take the square root to bring it back onto the same scale. So 6.87 and 5.88 are quite similar; they're mathematically not quite the same thing, but they're both similar ideas. The second one is the mean absolute difference, the first is called the standard deviation, and without the square root it's called the variance. The reason the standard deviation is bigger than the mean absolute difference here is that in our original data one of the numbers, 18, is much bigger than the others, and when we square it that number ends up having an outsized influence. That's a bit of an issue in general with standard deviation and variance: outliers like this have an outsized influence, so you've got to be a bit careful. The standard deviation is normally written as sigma, and the formula is just: each data point minus the mean, squared, plus the next data point minus the mean, squared, and so forth for all the data points; divide by the number of data points; and take the square root. One thing I'll point out is that the mean absolute deviation isn't used as much as the standard deviation, because mathematicians find it more difficult to work with, but we're not mathematicians, we have computers, so we can use it. Now, variance we can calculate as we said: the mean of the squares of the differences. And if you feel like doing some math, you could discover that this is exactly the same as the mean of the squared data points minus the square of the mean of the data points. That's very helpful, because it means you never actually have to calculate the per-point differences from the mean: with just the data points on their own you can calculate the variance.
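Written out, the identities being discussed are (with mu the mean of the n data points):

\sigma^2 = \operatorname{Var}(x)
         = \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)^2
         = \frac{1}{n}\sum_{i=1}^{n}x_i^2 \;-\; \mu^2
         = \operatorname{E}[x^2]-\operatorname{E}[x]^2,
\qquad
\sigma = \sqrt{\operatorname{Var}(x)},
\qquad
\text{MAD} = \frac{1}{n}\sum_{i=1}^{n}\lvert x_i-\mu\rvert .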
That's a really nice shortcut, and it's how we normally calculate variance. (The LaTeX version I didn't write myself, of course; I stole it from Wikipedia, because I'm lazy.) Now, there's a very, very similar idea called covariance, which has already come up a little bit in the first lesson or two, and particularly in the extra math lesson. Covariance tells you how much two things vary, not just on their own, but together. There's a definition in math notation, but I like code, so let's see the code. Here's our tensor t again; now we want two things, so let's create something called u, which is just two times our tensor plus a bit of randomness. You can see that u and t are very closely correlated, but they're not perfectly correlated. The covariance tells us how they vary together: we take exactly the same thing we had before, each data point minus its mean, but now we've got two different tensors, so we also take the other tensor's data points minus their mean, and we multiply the two together. It's actually the same thing as the variance calculation, except instead of multiplying a deviation by itself, we multiply it by the other tensor's deviation; variance is, in a sense, the covariance of something with itself. So that's a product we can calculate, and then we take the mean of that, which gives us the covariance between those two tensors, and you can see it's quite a high number. Compare that to two things that aren't related at all: take a totally random tensor v, which has nothing to do with t, do exactly the same thing (the difference of t from its mean times the difference of v from its mean, then the mean of that), and it's a very small number. So covariance is basically telling us how related two tensors are. And you can change this mathematical version, the one we just wrote in code, into an easier-to-calculate version, just as we did for variance, which as you can see gets exactly the same answer. If you haven't done much with covariance before, you should experiment with it a bit by creating a few different plots. Finally, the Pearson correlation coefficient, which is normally written rho, is just the covariance divided by the product of the two standard deviations. You've probably seen that number many times; it's just a scaled version of the same thing. Okay, so with that in mind, here is how Xavier init, or Glorot init, is derived. When you do a matrix multiplication, each of the y_i's is a sum of products: y_i = a_i,0 times x_0, plus a_i,1 times x_1, and so on, which we can write in sigma notation as the sum over k of a_i,k times x_k. This is the stuff we did in our first lesson of part 2, and here it is in pure Python code and in NumPy code. Now, at the very beginning our input vector has a mean of about 0 and a standard deviation of about 1, because that's what we asked for; that's what randn gives you, a mean of 0 and a standard deviation of 1. So let's create some random numbers, and we can confirm that, yes, they have a mean of about 0 and a standard deviation of about 1.
So if we choose weights for a that have a mean of 0, we can compute the standard deviation of the output quite easily. Let's do that: a hundred times over, we create our x, create something to multiply it by, do the matrix multiplication, and record the mean and the mean of the squares; the mean of the squares comes out very close to 100, the number of inputs. I won't go through the proof (you can look at it if you like), but basically, as long as the elements of a and x are independent, which obviously they are because they're random, each of the individual products ends up with a mean of 0 and a standard deviation of 1. We can try that too: create one normally distributed random number, then a second, multiply them together, do it a bunch of times, and you can see we get our zero and one. Summing 100 of those products is what gives a variance of 100, and that's the reason we need this math.sqrt(100). We don't normally worry about the exact mathematical reasons why things are the way they are, but I thought I'd dive into this one because sometimes it's fun to go through it; you can check out the paper if you want to look at it in more detail, or experiment with these little simulations. Now, the problem is that this doesn't work for us, because we use rectified linear units, which is not something Xavier Glorot looked at. Let's take a look. Let's create a couple of matrices: the x's are 200 by 100, and the y's are just a vector (well, a matrix and a vector) of length 200; then two weight matrices and two bias vectors. So we've got some input data, x's and y's, and we've got some weight matrices and bias vectors. Let's create a linear layer function, which we've done lots of times before, and start going through a little neural net; this is the forward pass. We apply our linear layer to the x's with our first set of weights and our first set of biases and check the mean and standard deviation: it's about 0 and about 1, so that's good news, and the reason is that we have 100 inputs and we divided by the square root of 100, just like Glorot told us to; the second layer has 50 inputs, so we divide by the square root of 50. So this all ought to work, and so far it does, but now we're going to mess everything up by doing a ReLU. Look, after a ReLU we don't have a zero mean or a standard deviation of one anymore. And if we go through and create a deep neural network with Glorot initialization but with ReLUs: oh dear, it's all disappeared, it's all gone to zero. You can see why: after each matrix multiply and ReLU our means and variances keep going down, and of course they go down because a ReLU squishes everything below zero. I'm not going to worry about the math of why, but a very important paper indeed, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" by Kaiming He et al., came up with a new init which is just like Glorot initialization, except that where Glorot used 1 over root n, this one uses the square root of 2 over n, and again n is the number of inputs. So let's try it: we've got 100 inputs, so we multiply by the square root of 2 over 100, and there we go, we are in fact getting some nonzero numbers, which is very encouraging even after going through 50 layers of depth. So that's good news. This is called Kaiming initialization, or He initialization.
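As a sketch of that comparison (shapes again just illustrative): with Glorot scaling the activations collapse once ReLUs are involved, while Kaiming's extra factor of root 2 keeps them alive.

import math, torch

def relu(x): return x.clamp_min(0.)

x = torch.randn(200, 100)
for i in range(50):                       # Glorot scaling: 1/sqrt(n_in)
    x = relu(x @ (torch.randn(100, 100) / math.sqrt(100)))
print(x.mean(), x.std())                  # collapses towards zero

x = torch.randn(200, 100)
for i in range(50):                       # Kaiming scaling: sqrt(2/n_in)
    x = relu(x @ (torch.randn(100, 100) * math.sqrt(2/100)))
print(x.mean(), x.std())                  # stays at a usable scale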
Notice that although it's written "He", it's a Chinese surname, so it's actually pronounced closer to "her". Maybe that's why a lot of people increasingly call it Kaiming initialization: they don't have to say his surname, which is just a little bit harder to pronounce. All right, so how on earth do we actually use this, now that we know what initialization function to use for a deep neural network with a ReLU activation function? The trick is to use a method called apply, which all nn.Modules have. If we grab our model, we can apply any function we like; for example, let's apply a function that prints the name of each module's type. You can see it goes through and prints out all of the modules inside our model, and notice that our model has modules inside modules (it's a conv, in a sequential, in a sequential), but model.apply goes through all of them regardless of depth. So we can apply an init function, one which simply fills the weights with normally distributed random numbers times the square root of 2 over the number of inputs. That's such an easy thing that it's not even worth writing ourselves; it's already been written, and it's called init.kaiming_normal_. As we've seen before, if there's an underscore at the end of a PyTorch method name it means it changes something in place, so init.kaiming_normal_ will modify this weight matrix so that it's initialized with normally distributed random numbers scaled by root 2 over the number of inputs. Now, you can't do that to a Sequential layer, or a ReLU layer, or a Flatten layer, so we should check that the module is a conv or linear layer first, and then we can just say model.apply with that function.
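A minimal sketch of such an init function, assuming the name init_weights that's used later in the lesson:

import torch.nn as nn
from torch.nn import init

def init_weights(m):
    # only conv and linear layers carry weights we want to re-initialize;
    # Sequential, ReLU, Flatten and friends are just skipped
    if isinstance(m, (nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear)):
        init.kaiming_normal_(m.weight)

# apply() visits every sub-module at any depth; `model` is the network built above
model.apply(init_weights)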
So if we do that, I can use the learning rate finder callback we created earlier, and this time I don't have to worry about the weird gamma thing anymore. Let's go back and copy that, get rid of the gamma=1.1 since it shouldn't be necessary anymore, and we can probably make that 4 now (oh, I should remember to recreate the model first; there we go). Okay, that's looking much more sensible, so at least we've got to a point where the learning rate finder works, which is a good sign. Now when we create our learner we're still going to use our MomentumLearner, and after we get the model we apply init_weights; apply also returns the model, so this is actually going to return the model with the initialization applied. While I wait for that to train, I'll answer questions. Fabrizio asks why we double the number of filters in successive convolutions. What's happening is that these are all stride-2 convolutions, so each one changes the grid size, from 28 by 28 to 14 by 14 say, which reduces the size of the grid by a factor of 4 in total. As we go from this layer to the next one, same deal: 14 by 14 to 7 by 7, so the grid size is reduced by a factor of 4 again. We want it to learn something at each stage, and if you give it exactly the same number of units or activations you're not really forcing it to learn things as much. So ideally, as we decrease the grid size, we want enough channels that you end up with a few fewer activations, but not too many fewer. If we double the number of channels, then we've decreased the grid size by a factor of 4 and increased the channel count by a factor of 2, so overall the number of activations has decreased by a factor of 2, and that's what we want: we want to be forcing it to find ways of compressing the information intelligently as it goes through. Also, we want a roughly similar amount of compute through the neural net: as we decrease the grid size we can add more channels, because decreasing the grid size decreases the amount of compute and increasing the channels gives it more things to compute. So we're getting a nice compromise between the amount of compute it's doing and giving it some compression work to do. That's the basic idea. Well, it's still not able to train all that well. Okay, if we leave it for a while: it's not great, but it is actually starting to train, which is encouraging, and we got up to 70% accuracy. Not surprisingly, we're getting these spikes, and in the statistics you can see it didn't quite work: we don't have a mean of zero, we don't have a standard deviation of one, even at the start. Why is that? Because we forgot something critical. If you go back to our original point (let's go back to the Kaiming version), even when we had a correctly normalized matrix that we're multiplying by, you also have to have a correctly normalized input matrix, and we never did anything to normalize our inputs. If we grab the first x mini-batch and get its mean and standard deviation, it has a mean of 0.28 and a standard deviation of 0.35, so we didn't even start with a (0, 1) input: we started with a mean above zero and a standard deviation beneath one, so it was very hard for the network. Using the init helped, at least we're able to train a little bit, but it's not quite what we want; we actually need to modify our inputs so they have a mean of zero and a standard deviation of one. We could create a callback to do that. Let's create a batch transform callback: we pass in a function that's going to transform every batch, and in before_batch we simply set the batch to be the function applied to the batch. I'll note, by the way, that we don't need self.learn.batch on the right-hand side, because batch is one of the four things we proxy down to the learner automatically, but we do need it on the left-hand side, because the shortcut only exists in __getattr__, remember, so be careful; I might just leave it the same on both sides so people don't get confused. Then let's create a function _norm that subtracts the mean and divides by the standard deviation. Remember a batch has an x and a y: it's the x part we normalize, and the new batch will be that as the x with the y exactly as it was before. So let's create an instance of the batch transform callback with the normalization function, call it norm, and pass it as an additional callback to our learner.
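A minimal sketch of that callback and the normalization function, assuming the Callback base and the names BatchTransformCB and _norm from the description above:

class BatchTransformCB(Callback):
    def __init__(self, tfm): self.tfm = tfm
    def before_batch(self):
        # reading `batch` would be proxied to self.learn by __getattr__, but
        # assignment has to go through self.learn explicitly, so use it on both sides
        self.learn.batch = self.tfm(self.learn.batch)

# xmean, xstd: dataset statistics computed beforehand (e.g. from one training batch)
def _norm(b):
    xb, yb = b
    return (xb - xmean) / xstd, yb

norm = BatchTransformCB(_norm)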
And now that's looking a lot better. All we had to do was make sure our input matrix had zero mean and unit standard deviation, and that all our weight matrices had zero mean and unit standard deviation, and we didn't have to use any tricks at all: it was able to train, and it got to an accuracy of 85%. If we look at the color_dim and stats plots, look at this, it looks beautiful. This is layer one, layer two, three, four. It's still not perfect: there's some randomness, and we've got what is it, seven or eight layers, so that randomness accumulates as you go through the layers and by the last one it still gets a bit ugly; you can see it bouncing around here as a result, and you can see that also in the means and standard deviations. There are some other reasons this is happening, which we'll see in a moment, but this is the first time we've really got an even somewhat deep convolutional model to train. So this is a really exciting step: from scratch, in a sequence of 11 notebooks, we've managed to create a real convolutional neural network that is training properly, and I think that's pretty amazing. Now, we don't have to use a callback for this; the other way to modify the input data, of course, is to use the with_transform method from the Hugging Face datasets library. We could modify our transformi function to also subtract the mean and divide by the standard deviation, then recreate our data loaders, and if we now grab a batch out of that and check it, yep, its mean is zero and its standard deviation is one. So we could also do it that way. Generally speaking, for stuff that needs to dynamically modify the batch, you can often do it either in your data processing code or in a callback; neither is right or wrong, they both work well, and you can use whichever works best for you. Okay, now I'm going to show you something amazing. It's great that this is training well, but when you look at our stats, despite what we did with the normalized inputs and the normalized weight matrices, we don't have a mean of zero and we don't have a standard deviation of one, even from the start. Why is that? Well, the problem is that we're putting our data through a ReLU, and our activation stats are looking at the output of those ReLU blocks, because that's the end of each combination of weight-matrix multiplication and activation function; that's the activation of each block. And since a ReLU removes all of the negative numbers, it's impossible for the output of a ReLU to have a mean of zero unless literally every single number is zero, because it's got no negatives. So ReLU seems to me to be fundamentally incompatible with the idea of a correctly calibrated bunch of layers in a neural net. So I came up with this idea of saying: why don't we take our normal ReLU and have the ability to subtract something from it? We just take the result of the ReLU and subtract (I could write it as a .sub_ call, or just minus-equals, it's the same thing), and that will pull the whole thing down so the bottom of our ReLU is underneath the x-axis; it has negatives, which would allow the output to have a mean of zero. And while we're there, let's also do something that's existed for a while (I didn't come up with this idea), which is a leaky ReLU: instead of the negatives being totally flat, just truncated, they're decreased by some constant slope. Putting the two together, I'm going to call it GeneralReLU: it does the leaky ReLU thing, so it's not flat under zero but just less sloped, and it also subtracts something from the result.
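A sketch of that GeneralReLU, assuming just the leak and subtract pieces described above:

import torch.nn.functional as F
from torch import nn

class GeneralRelu(nn.Module):
    def __init__(self, leak=None, sub=None):
        super().__init__()
        self.leak, self.sub = leak, sub

    def forward(self, x):
        # leaky below zero if a leak is given, otherwise a plain ReLU
        x = F.leaky_relu(x, self.leak) if self.leak is not None else F.relu(x)
        # pull the whole curve down so the outputs can have a mean of zero
        if self.sub is not None: x -= self.sub
        return x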
I've created a little function here for plotting a function, so let's plot the GeneralReLU with a leakiness of 0.1, meaning there's a slope of 0.1 below zero, and a subtract of 0.4. You can see that above zero it's just a normal y = x line, but pushed down by 0.4, and below zero it's not flat anymore, it has a slope of one tenth. This is now something where, if you find the right amount to subtract for each amount of leakiness, you can get a mean of zero, and I found that this particular combination gives a mean of about zero. So let's now create a new conv function where we can actually change which activation function is used, which gives us the ability to change the activation functions in our neural nets, and let's change get_model to take an activation function that gets passed down into the layers. While we're there, let's also make it easy to change the number of filters: we'll pass in a list of the number of filters for each layer, defaulting to the numbers we've discussed, and then just go through in a list comprehension creating a convolution from the previous number of filters to the next number of filters, and pop the whole lot into a Sequential along with a flatten at the end. While we're there, we also need to be careful about init_weights, because this is something people tend to forget: Kaiming initialization, by default, only applies to layers that have a ReLU activation function. We don't have ReLU anymore, we have a leaky ReLU; the fact that we subtract a bit from it doesn't change things, but the fact that it's leaky does. Now, luckily (and a lot of people don't know this), PyTorch's kaiming_normal_ actually has an adjustment for leaky ReLUs; weirdly enough, they just call it a. If you pass your leaky ReLU's slope into kaiming_normal_ as a, you'll get the correct initialization for a leaky ReLU, so we need to change init_weights to pass in the leakiness. All right, let's put all this together. Our general ReLU activation function is GeneralRelu with a leak of 0.1 and a subtract of 0.4, so we'll use partial to create a function with those parameters built in; for ActivationStats we need to update it to look for GeneralRelus, not nn.ReLUs; and our init_weights will be a partial with leaky=0.1, so we'll call that our init_weights. Great, so now we get our model using that new activation function and that new init_weights and fit it. Oh, that's encouraging: an accuracy of 0.845, which is about as high as we got at the end previously, and wow, look at that, we're up to an accuracy of 87%. Let's take a look: we've still got a little bit of a spike, but it's almost smooth and flat; and look at this, our mean is starting at about zero, and the standard deviation is still a bit low but it's coming up to around one, generally around 0.8, so it's all looking pretty encouraging, I think. And look, the percentage of dead units in each layer is very small. So finally we've got some very nice-looking training graphs here.
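For reference, here's a sketch of how that parameterized conv and get_model might look, under the assumptions above (kernel size 3, stride 2, a 1-channel Fashion-MNIST style input, 10 classes, and the GeneralRelu class sketched earlier); the real notebook code may differ in detail.

from functools import partial
import torch.nn as nn

def conv(ni, nf, ks=3, stride=2, act=nn.ReLU):
    layers = [nn.Conv2d(ni, nf, ks, stride=stride, padding=ks//2)]
    if act is not None: layers.append(act())
    return nn.Sequential(*layers)

def get_model(act=nn.ReLU, nfs=(8, 16, 32, 64)):
    nfs = (1,) + tuple(nfs)                       # 1-channel input
    layers = [conv(nfs[i], nfs[i+1], act=act) for i in range(len(nfs)-1)]
    # final conv maps to the 10 classes on a 1x1 grid, no activation, then flatten
    return nn.Sequential(*layers, conv(nfs[-1], 10, act=None), nn.Flatten())

act_gr = partial(GeneralRelu, leak=0.1, sub=0.4)  # the activation discussed above
model = get_model(act=act_gr)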
It's interesting that we had to literally invent our own activation function to make this work, and I think that gives you a sense of how few people actually care about this, which is crazy, because as you can see, in some ways it's the only thing that matters, and it's not at all mathematically difficult to make it work, nor computationally difficult to check whether it's working. But other frameworks don't even let you plot these kinds of things, so nobody even knows that they've completely messed up their initialization. So, now you know. Now, one thing to be aware of, which is tricky, is that a lot of models nowadays use more complicated activation functions than ReLU or leaky ReLU or even this general version. You need to initialize your neural network correctly, and most people don't; and sometimes nobody's even figured out, or bothered to try to figure out, what the correct initialization for a particular activation function is. But there's actually a very cool trick which almost nobody knows about, from a paper called "All You Need Is a Good Init" which Dmytro Mishkin wrote a few years ago. What Dmytro showed is that there's a completely general way of initializing any neural network correctly, regardless of what activation functions are in it, and it uses a very, very simple idea: create your model, initialize it however you like, then put a single batch of data through and look at the first layer. See what the mean and standard deviation coming out of that first layer are; if the standard deviation is too big, scale the weight matrix down a bit; if the mean is too high, subtract a bit off; and do that repeatedly for the first layer until you get the correct mean and standard deviation. Then go to the second layer and do the same thing, then the third layer, and so forth. This is called Layer-wise Sequential Unit Variance, LSUV, and we can do it using hooks. We create a little _lsuv_stats function that grabs the mean of a layer's activations and the standard deviation of its activations, and we create a hook with that function. Then, after we've run the model so that hook has recorded the layer's mean and standard deviation, we check: if the standard deviation is not one or the mean is not zero, we subtract the mean from the bias and divide the weight matrix by the standard deviation, and we keep doing that until we get a standard deviation of one and a mean of zero. By making that a hook, what we do is grab all the ReLUs and all the convs. Just to show you what happens there: once I've got all the ReLUs and all the convs, I can use zip. zip in Python takes a bunch of lists and creates a sequence of the first items together, the second items together, the third items, and so forth. So if I go through the zip of the ReLUs and the convs and print them out, you can see it prints the first ReLU and the first conv, the second ReLU and the second conv, the third ReLU and the third conv, and so forth. We use zip all the time in Python, so it's a really important thing to be aware of.
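A sketch of what that LSUV routine could look like, assuming names like _lsuv_stats and lsuv_init and the Hook class from the earlier notebook; the real code may differ in detail.

import torch

def _lsuv_stats(hook, mod, inp, outp):
    # record the mean and std of this layer's activations
    hook.mean = outp.mean().item()
    hook.std  = outp.std().item()

def lsuv_init(model, relu, conv, xb):
    h = Hook(relu, _lsuv_stats)              # Hook class from the earlier notebook
    with torch.no_grad():
        # keep running a batch through and rescaling until this layer is (0, 1)
        while model(xb) is not None and (abs(h.std - 1) > 1e-3 or abs(h.mean) > 1e-3):
            conv.bias   -= h.mean            # pull the mean towards zero
            conv.weight /= h.std             # pull the std towards one
    h.remove()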
So we go through the ReLUs and the convs and call lsuv_init, passing in each module pair (the relu and the conv) along with the batch, which of course we need to put on the correct device for our model. Now that I've done that (it ran almost instantly), it's made all the biases and weights correct, giving us our zero and one, and now if I train it, there it is. We didn't do any initialization of the model at all other than calling lsuv_init, and this time we get an accuracy of 0.86 versus 0.87 previously, so pretty much the same thing; close enough. If you actually want to see it happening, we could print h.mean and h.std, and we can do it before and after. There we go: the first layer started at a mean of about -0.13 and a standard deviation of about 0.46, and it kept doing the divide, subtract, divide, subtract until eventually it got to a mean of zero and a standard deviation of one; then it went to the next layer and kept going until that was zero and one, then the third layer, then the fourth, and at that point all of the layers had a mean of zero and a standard deviation of one. One thing about LSUV: it's very mathematically convenient. We don't have to spend any time thinking about it; if we've invented a new activation function, or we're using some activation function where nobody seems to have figured out the correct initialization, we can just use LSUV. It did require a little bit more fiddling around with hooks and stuff to get it to work, and I haven't even put it into a callback or anything, so if you decide you want to try using this in some of your models, it would actually be good homework to see if you can come up with a callback that does LSUV initialization for you. That would be pretty cool, wouldn't it? It would go in before_fit, I guess; you'd have to be a bit careful, because if you ran fit multiple times it would re-initialize each time, so that's one issue to think about. Okay, so something which is quite similar to LSUV is batch normalization. We're going to have a seven-minute break, and then we'll come back and talk about batch normalization. See you in seven minutes. Okay, hi, let's do this: batch normalization. Batch normalization was such an important paper. I remember when it came out, I was at Enlitic, my medical startup, I think that's right, and everybody was talking about it, and in particular about this graph that basically showed what it used to be like, until batch norm, to train a model on ImageNet: how many training steps you'd have to do to get to a certain accuracy, and then what you could do with batch norm, so much faster. It was amazing; we all thought that can't be true, but it was true. So basically, the key idea of batch norm is that with LSUV, input normalization and Kaiming init, we are normalizing each layer's inputs before training, but the distribution of each layer's inputs changes during training, and that's a problem: you end up having to decrease your learning rates, and as we've seen, you have to be very careful about parameter initialization. The fact that the layers' inputs change during training is what they call internal covariate shift.
For some reason a lot of people tend to find "internal covariate shift" a confusing name, but it seems very clear to me, and you can fix it by normalizing layer inputs during training: you make the normalization a part of the model architecture, and you perform the normalization for each mini-batch. Now, I'm actually not going to start with batch normalization; I'm going to start with something that came out a year later, called layer normalization, because layer normalization is simpler, so let's do the simpler one first. Layer normalization came from this group of folks, the last of whom I'm sure you've heard of, and it's probably easiest to explain by showing you the code. If you're thinking "layer normalization, a whole paper, a Geoffrey Hinton paper, it must be complicated": no, the whole thing is this code. We create a module, and you can totally ignore its parameters for now; we don't really need to pass in anything. What we're going to do is have a single number called mult, the multiplier, and a single number called add, the thing we're going to add, and we're going to start off by multiplying by one and adding zero, so we start off by doing nothing at all. The layer has a forward function, and in the forward function (remember that by default we have NCHW, batch by channel by height by width) we take the mean over the channel, height and width dimensions, so we're just finding the mean activation for each input in the mini-batch. When I say input, remember that this is a layer, so we can put it anywhere we like; it's the input to that point in the model. We do the same thing to find the variance. Then we normalize our data by subtracting the mean and dividing by the square root of the variance, which of course is the standard deviation, and we add a very small number, by default 1e-5, to the denominator, just in case the variance is zero or ridiculously small; this keeps the result from blowing up if we happen to get something with a very small variance. This idea of an epsilon added to a divisor is really, really common, and in general you should not assume the defaults are correct; very often the default epsilon is too small for the algorithm that uses it. So, as you can see, we are normalizing the batch, though remember it isn't necessarily the first layer; it's whichever layer we decide to put this in. Now, the thing is, maybe we don't want it to be normalized; maybe we want something other than unit variance and something other than zero mean. Well, what we do is then multiply back by self.mult and add self.add. Remember, self.mult starts at one and self.add starts at zero, so at first that does nothing at all; at first this is just normalizing the data, which is good. But because these are parameters, these two numbers are learnable, which means the SGD algorithm can change them. So there's a very subtle thing going on here, which is that this might not actually be normalizing the data, or normalizing the inputs to the next layer, at all, because self.mult and self.add could end up being anything.
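Here's roughly what that layer looks like, reconstructed from the description; the dummy argument mirrors the fact that the notebook version accepts an argument it ignores, so it can be swapped in where a batch-norm-style layer would expect a channel count.

import torch
from torch import nn

class LayerNorm(nn.Module):
    def __init__(self, dummy, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.mult = nn.Parameter(torch.tensor(1.))   # starts as "multiply by 1"
        self.add  = nn.Parameter(torch.tensor(0.))   # starts as "add 0"

    def forward(self, x):
        # x is NCHW: average over channel, height and width, one value per item
        m = x.mean((1, 2, 3), keepdim=True)
        v = x.var((1, 2, 3), keepdim=True)
        x = (x - m) / (v + self.eps).sqrt()
        return x * self.mult + self.add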
I tend to think that when people think about these kinds of things, layer normalization and batch normalization, thinking of them as "normalization" is in some ways not quite the right way to think of it; it's actually doing something else as well. It's definitely normalizing things at first, and we don't really need LSUV anymore if we have this in here, because it's going to normalize automatically, so that's handy; but after a few batches it's not really normalizing at all. What it is doing is this: previously, the idea of how big the numbers are overall and how much variation they have overall was baked into every single number in the weight matrix and in the bias vector; this way, those two things have been pulled out into just two numbers, and I think that makes training a lot easier. The model has just two numbers it can focus on to change the overall positioning and variation. So there's something very subtle going on here, because it's not just doing normalization, at least not after the first few batches are complete, since it can learn to create any distribution of outputs it wants. So there's our layer. Now we need to change our conv function again: previously we changed it to make the activation function modifiable, and now we're also going to change it to allow us to add a normalization layer at the end. For our basic layers, we start by adding our Conv2d as usual; then, if you're doing normalization, we append the normalization layer, passing in this many inputs (in fact layer norm doesn't care how many inputs there are, so it just ignores that, but you'll see batch norm will care); and if there's an activation function, we add it. So our convolutional "layer" is actually a sequential bunch of layers now. One thing that's interesting, I think, is the bias in the conv: I was going to say that if you're using layer norm you don't need a bias, but actually you kind of do; it's with batch norm that we won't need a bias. So let's put that back and make it bias=bias: these initial layers here all have a bias, and then later we'll have bias=False. Okay, so now in our model we add layer normalization to every layer except for the last one, and let's see how we go. Oh nice, 0.873; okay, 0.860 and 0.872, so we've just beaten our best by a little bit, which is cool. The thing about these normalization layers, though, is that they do cause a lot of challenges in models. Generally speaking, ever since batch norm appeared there's been this big shift in how people view it. At first it was "oh my god, batch norm is our savior", and it kind of was: it let us train much deeper models, get great results, and train quickly. But increasingly people realized it also added a lot of complexity: these learnable parameters turned out to create all kinds of complications, and batch norm in particular, as we'll see in a minute, created all kinds of complexity. So there has been a tendency in recent years to try to get rid of, or at least reduce the use of, these kinds of layers, which means knowing how to initialize your models correctly at the start is becoming increasingly important as people try to move away from normalization layers. So I will say: they're still very helpful, but they're not a silver bullet, as it turns out. All right, so now let's look at batch norm. Batch norm is still not huge, but it's a little bit bigger than layer norm.
You'll see that we've got the mult and add as before, but it's not just one number to add and one number to multiply: we've actually got a whole bunch of them, and the reason is that we're going to have one for every channel. And now, when we take the mean and the variance, we take them over the batch dimension and the height and width dimensions, so we end up with one mean per channel and one variance per channel. Just like before, once we get our means and variances, we subtract them out and divide by the square root of the epsilon-modified variance, and just like before we then multiply by mult and add add; but now we're multiplying by a vector of mults and adding a vector of adds, and that's why we have to pass in the number of filters: we have to know how many ones and how many zeros to put in our initial mults and adds. So that's the main difference, in a sense: we have one per channel, and we're also taking the average across all of the things in the batch, whereas in layer norm each thing in the batch had its own separate normalization. Then there's something else in batch norm which is a bit tricky, which is that during training we're not just subtracting this batch's mean and dividing by this batch's variance; instead we keep an exponentially weighted moving average of the means and the variances of the last few batches, and that's what this is doing. We create something called vars and something called means; initially the variances are all one and the means are all zero, and there's one per channel, just like before, or one per filter (number of filters, same idea; I guess "filters" is what we tend to say inside the model and "channels" is what we tend to use for the first input, but either works). So we get our mean per filter, for example, and then we use this thing called lerp. What lerp does is take two numbers, in this case 5 and 15, or two tensors (they could be vectors or matrices), and create a weighted average of them, where the amount of weighting is this final number. If I put 0.5 it takes half of this number plus half of that number, so we end up with just the midpoint. But what if we used 0.75? Then it takes 0.75 of the second number plus 0.25 of the first. So it's basically a sliding scale: one extreme would be to take all of the second number, which is lerp with a 1 there, and the other extreme is all of the first number, and you can slide anywhere in between. So lerp with 0.1 is exactly the same as saying 5 times 0.9 plus 15 times 0.1: this number is how much of the second number you take, and one minus that is how much of the first. (And, as with most PyTorch things, you can move the first argument into the method call and get exactly the same result.) So what we're doing here is an in-place lerp: we're replacing self.means with one minus momentum times self.means, plus self.momentum times this particular mini-batch's mean. This is basically doing momentum again, which is why we've indeed called the parameter mom, for momentum.
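A sketch of a batch norm layer along these lines, again reconstructed from the description rather than copied from the notebook:

import torch
from torch import nn

class BatchNorm(nn.Module):
    def __init__(self, nf, mom=0.1, eps=1e-5):
        super().__init__()
        self.mom, self.eps = mom, eps
        # one learnable multiplier and one learnable offset per channel/filter
        self.mults = nn.Parameter(torch.ones(1, nf, 1, 1))
        self.adds  = nn.Parameter(torch.zeros(1, nf, 1, 1))
        # running statistics, saved with the model (buffers, not parameters)
        self.register_buffer('vars',  torch.ones(1, nf, 1, 1))
        self.register_buffer('means', torch.zeros(1, nf, 1, 1))

    def update_stats(self, x):
        # per-channel mean/var over the batch, height and width dimensions
        m = x.mean((0, 2, 3), keepdim=True)
        v = x.var((0, 2, 3), keepdim=True)
        # exponentially weighted moving average: new = 0.9*old + 0.1*batch stat
        self.means.lerp_(m, self.mom)
        self.vars.lerp_(v, self.mom)
        return m, v

    def forward(self, x):
        if self.training:
            with torch.no_grad(): m, v = self.update_stats(x)
        else:
            m, v = self.means, self.vars
        x = (x - m) / (v + self.eps).sqrt()
        return x * self.mults + self.adds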
One thing to note: with a mom of 0.1, which I kind of think is the opposite of what I'd expect momentum to mean (I'd expect it to be 0.9), it's saying that on each mini-batch, self.means becomes 0.1 of this particular mini-batch's mean and 0.9 of the previous value (the previous sequence, in fact), and that ends up giving us what's called an exponentially weighted moving average. We do the same thing for the variances. That's only updated during training; during inference we just use the saved means and variances.

So why do we have buffers, what does that mean? These buffers mean that the means and variances are actually saved as part of the model. It's important to understand that this information about the means and variances your model saw is stored in the model, and this is the key thing which makes batch norm tricky to deal with, and particularly tricky, as we'll see in later lessons, with transfer learning. But what it does give us is something that's much smoother: a single weird mini-batch shouldn't throw things around too much, and because we're averaging across the mini-batch it also makes things smoother, so this whole thing should lead to pretty nice smooth training.

So we can train this. This time we're going to use our batch norm layer for the norm (oh, actually, we need to get the bias thing right... no, that's fine). One interesting thing I found here is that I was able to increase the learning rate up to 0.4 for the first time. Each time I've really been trying to see how far I can push the learning rate, and I'm now able to double it and, as you can see, it's still training very smoothly, which is really cool.

There are actually a number of different types of normalization layer we can use. In this lesson we've specifically seen batch norm and layer norm, but I wanted to mention that there's also instance norm and group norm, and this picture from the group norm paper explains what happens. It shows the N, C, H, W axes, with H and W flattened into a single axis since they can't draw 4D cubes, and all the blue stuff is what we average over. In batch norm we average across the batch and across the height and width, and we end up with one normalization number per channel (you can kind of slide those blue blocks across). Layer norm, as we learned, averages over the channel and the height and width, and it has a separate one per item in the mini-batch. I mean, kind of; it's a bit subtle, because remember our overall mult and add had literally just a single number each, so it's not quite as simple as the picture, but that's the general idea. Instance norm, which we're not looking at today, only averages across height and width, so there's a separate one for every channel and every element of the mini-batch. And finally group norm, which I'm quite fond of, is like instance norm but it groups a bunch of channels together (you decide how many groups of channels there are) and averages over them. Group norm tends to be a bit slow, unfortunately, because the way these things are implemented is a bit tricky, but it does allow you to avoid some of the challenges of the other methods, so it's worth trying if you can. And of course batch norm has the additional thing of the momentum-based statistics.
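To make the picture from the group norm paper concrete, here's a rough illustration of which dimensions each variant averages over for an N,C,H,W activation tensor. This is illustrative only: the real layers also have their learnable mults and adds, and batch norm its running statistics, and the group count of 4 here is just an example.

```python
import torch

x = torch.randn(64, 32, 28, 28)      # N, C, H, W

# batch norm: average over batch, height, width -> one stat per channel
bn_mean = x.mean((0, 2, 3))           # shape (32,)

# layer norm: average over channel, height, width -> one stat per item
ln_mean = x.mean((1, 2, 3))           # shape (64,)

# instance norm: average over height, width only -> one per item per channel
in_mean = x.mean((2, 3))              # shape (64, 32)

# group norm: like instance norm, but over groups of channels (here 4 groups of 8)
g = x.view(64, 4, 8, 28, 28)
gn_mean = g.mean((2, 3, 4))           # shape (64, 4)
```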
In general, though, things like whether you use momentum-based statistics, whether you store things per channel or keep a single mean and variance in your buffers, and what exactly you average over are all somewhat independent choices you can make, and particular combinations of those choices have been given particular names. So there we go.

Okay, so we've got some good initialization methods here; let's try putting them all together. One other thing we can do: we've been using a batch size of 1024 for speed purposes. If we drop it down a bit, to 256, the model gets to see more mini-batches, which should improve performance. Remember, we're trying to get to 90%. This time we'll use PyTorch's own batch norm; there's nothing wrong with ours, but we try to switch to PyTorch's once something we've recreated exists there. We'll use our momentum learner and fit for three epochs, and as you can see it's going a little bit more slowly now. Then the other thing I'm going to do is decrease the learning rate, keep the existing model, and train for a little bit longer, the idea being that as it gets close to a pretty good answer maybe it just needs a chance to fine-tune a little, and by decreasing the learning rate we give it that chance.

So let's see how we're going. We got to 87.8% accuracy after three epochs, which is an improvement, mainly thanks to using this smaller mini-batch size. Now, with a smaller mini-batch size you do have to decrease the learning rate, but I found I could still get away with 0.2, which is pretty cool. And look at this: after just one more epoch with the decreased learning rate we've got up to 89.7... no, 89.9. So towards 90%, but not quite there: 89.9. We're going to have to do some more work to get up to our magical 90% number, but we are getting pretty close. All right, so that is the end of initialization, an incredibly important topic, as hopefully you've seen.

Accelerated SGD. Let's see if we can use this to get above 90%. So let's do our normal imports and data set up as usual, and just to summarize what we've got: we've got our metrics callback and our activation stats on the GeneralRelu, so our callbacks are going to be the device callback (to put it on CUDA or whatever), the metrics, the progress bar and the activation stats. Our activation function is going to be our GeneralRelu with 0.1 leakiness and 0.4 subtraction, and we've got the init weights function, which we need to tell how leaky the activation is. And then if we're doing a learning rate finder we've got a different set of callbacks, since there's no real reason to have a progress bar callback with a learning rate finder (I guess it's pretty short anyway).

Oh, which reminds me, there was one little thing I didn't mention when we did initialization, a fun trick you might want to play around with. In fact Sam Watkins asked a question about it earlier in the chat and I didn't answer it, because it's exactly here: in GeneralRelu I added a second thing, which you might have seen, which is a maximum value. If the maximum value is set then I clamp the activation to be no more than that maximum. So, for example, if you set it to three, the line goes up like it does here, up to three, and then it's flat. Using that can be a nice way (I'd probably go higher, up to about six) to avoid numbers getting too big. And maybe, if you really wanted to have fun, you could do a kind of leaky maximum, which I haven't tried yet, where at the top it keeps going but, say, ten times more slowly, just like the leaky part at the bottom. If you do that you'd need to make sure that you're still getting mean-zero, variance-one layers with your initialization, but that would be something you could consider playing with.
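For reference, a GeneralRelu along these lines might look roughly like this; it's a sketch based on the description above, where leak is the leakiness, sub is the subtraction and maxv is the optional clamp (the leaky-maximum idea isn't included here).

```python
from functools import partial
import torch.nn.functional as F
from torch import nn

class GeneralRelu(nn.Module):
    def __init__(self, leak=None, sub=None, maxv=None):
        super().__init__()
        self.leak, self.sub, self.maxv = leak, sub, maxv

    def forward(self, x):
        # leaky relu if a leak is given, plain relu otherwise
        x = F.leaky_relu(x, self.leak) if self.leak is not None else F.relu(x)
        # shift down so the output mean is closer to zero
        if self.sub is not None: x = x - self.sub
        # optional clamp so activations can't grow past maxv
        if self.maxv is not None: x = x.clamp_max(self.maxv)
        return x

# e.g. act_fn = partial(GeneralRelu, leak=0.1, sub=0.4)
```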
Okay, so let's create our own little SGD class. An SGD class needs to know what parameters to optimize; remember that the module.parameters method returns a generator, so we use list() to turn that into a list, forcing it into something fixed that isn't going to change under us. We need to know the learning rate, we need to know the weight decay (which we'll look at in a moment), and for reasons we'll discuss later we also want to keep track of what batch number we're up to.

An optimizer basically has two things: a step and a zero_grad. What step does, with no_grad (because this isn't part of the computation we're differentiating, it's the optimization itself), is go through each tensor of parameters, do a step of the optimizer (we'll come back to this in a moment) and a step of the regularizer, and keep track of what batch number we're up to. And what does SGD do in its optimization step? It subtracts from the parameter its gradient times the learning rate; that's an SGD optimization step. To zero the gradients we go through each parameter and zero it, inside torch.no_grad; or, if you use .data, you don't need to say no_grad, which is just a little typing saver.

So let's create a TrainLearner, a learner with the training callback built in, set its optimization function to be this SGD we just wrote, use the batch norm model with the weight initialization we've used before, and train it; this should give us basically the same results we've had before.

While this is training, I'm going to talk about regularization. Hopefully you remember, from part one of this course or from your other learning, what weight decay is. Just to remind you, weight decay and L2 regularization are basically the same thing: we add the square of the weights to the loss function. So whatever our loss function is (we'll just call it loss), we add a constant, the weight decay, times the sum of the squares of the weights. The only thing we actually care about is the derivative of that, and the derivative is the derivative of the loss plus the derivative of this extra term, which is just 2 times the weights, times the constant. And since the weight decay constant can directly absorb the 2, we can just delete the 2 entirely. I'm going through this quickly because we already covered it in part one, so it's hopefully something you've all seen before.

So we can do weight decay by taking our gradients and adding on the weight decay times the weights (oh man, I got it the wrong way around, we need to do that first I guess... well, whatever). Since that's now part of the gradient, the optimization step, which uses the gradient, subtracts gradient times learning rate and so includes it. But because we just end up doing p.grad times self.lr, and the p.grad update is just to add in wd times the weight, we can skip updating the gradients entirely and instead directly update the weights, subtracting learning rate times wd times the weight. Those are mathematically identical, and that's what we've done here in the regularization step: if you've got weight decay, just do p *= 1 - lr * wd, which is the same thing because we've got the weight on both sides. That's why the regularization lives here inside our SGD.
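Here's roughly what that SGD class looks like, with the weight decay applied as a direct shrink of the weights; it's a sketch reconstructed from the description above, so details may differ from the notebook.

```python
import torch

class SGD:
    def __init__(self, params, lr, wd=0.):
        self.params = list(params)   # force the generator into a fixed list
        self.lr, self.wd = lr, wd
        self.i = 0                   # batch counter (used later for Adam's unbiasing)

    def step(self):
        with torch.no_grad():
            for p in self.params:
                self.reg_step(p)
                self.opt_step(p)
        self.i += 1

    def opt_step(self, p):
        p -= p.grad * self.lr        # plain SGD update

    def reg_step(self, p):
        # weight decay: mathematically the same as adding wd*p to the gradient
        if self.wd != 0: p *= 1 - self.lr * self.wd

    def zero_grad(self):
        for p in self.params:
            p.grad.data.zero_()      # .data means we don't need another no_grad
```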
And yes, it's finished running, that's good: we've got 85% accuracy, it all looks fine, and we're able to train at a high learning rate of 0.4, so that's pretty cool.

So now let's add momentum. We had a kind of hacky momentum learner before, but momentum should really live in an optimizer. Let's talk a bit about what momentum actually is. First, let's create some data: our x's are just 100 equally spaced numbers from -4 to 4, and our y's are 1 minus (x divided by 3) squared, plus some randomization; these dots here are our random data. I'm going to show you what momentum is by example, and this is something that Sylvain Gugger helped build (thank you, Sylvain), for our book actually, if memory serves correctly, or it might even have been the course before that.

What we're going to do is show what momentum looks like for a range of different levels of momentum; these are the levels we'll use. Let's take a beta of 0.5 first. We do a scatter plot of our x's and y's, that's the blue dots, and then we go through each of the y's and do this (hopefully it looks familiar): in a loop, we take our previous average, which starts at zero, times beta (0.5), plus 1 minus beta (so also 0.5) times the new value, append that to the red line, do that for all the data points, and plot them. You can see what happens: the red line becomes less bumpy, because each point is half this exact dot and half of whatever the red line previously was. So again this is an exponentially weighted moving average, and we could have implemented it using lerp.

As the beta gets higher, it's saying: stay more with wherever the red line used to be, and respond less to this particular data point. That means when we hit these outliers the red line doesn't jump around as much, as you can see. But if your momentum gets too high then it doesn't follow what's going on at all; in fact it's way behind. When you're using momentum you're always partially responding to how things were many batches ago, and even at beta 0.9 the red line is offset to the right, because it takes a while for it to recognize that things have changed: each time, 0.9 of it is where the red line used to be and only 0.1 of it is what this data point says. So that's what momentum does.
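Here's a small reconstruction of that demo; the exact noise level, styling and helper functions in the notebook will differ, this is just the idea.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4, 4, 100)
y = 1 - (x / 3) ** 2 + np.random.randn(100) * 0.1   # noisy quadratic

betas = [0.5, 0.7, 0.9, 0.99]
fig, axs = plt.subplots(2, 2, figsize=(10, 6))
for beta, ax in zip(betas, axs.flat):
    ax.scatter(x, y)
    avg, res = 0., []
    for yi in y:
        # exponentially weighted moving average of the y values
        avg = beta * avg + (1 - beta) * yi
        res.append(avg)
    ax.plot(x, res, color='red')
    ax.set_title(f'beta={beta}')
plt.show()
```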
The reason momentum is useful is that when you have a loss function that's actually very bumpy, like this, you want to be able to follow the underlying curve. Using momentum you don't quite get that, you get a version of it that's offset to the right a little, but still: you don't really want to be heading off in this direction and then that direction, which is what you'd do if you followed the raw line; you really want to be following the average of those directions, and that's what momentum lets you do.

So to use momentum we'll inherit from SGD and override the optimization step. Remember there were two things that step called: the regularization step and the optimization step. We're going to modify the optimization step so it's not just p -= p.grad * lr. When we create our Momentum object we tell it what momentum we want, defaulting to 0.9, and store that away. Then in the optimization step, for each parameter (remember the optimization step is called for each parameter in our model, so each layer's weights and each layer's biases, for example), we check whether we've ever stored away a moving average of gradients for that parameter, and if we haven't, we set it to zeros initially, just like we did here. Then we do our lerp: the exponentially weighted moving average of gradients becomes whatever it used to be times the momentum, plus this actual new batch's gradients times 1 minus the momentum. So that's just the loop we discussed. And then we do exactly the same as the SGD update step, but instead of multiplying by p.grad we multiply by p.grad_avg.

There's a cool little trick here, which is that we're basically inventing a brand new attribute and storing it inside the parameter tensor itself; that attribute is where we keep the exponentially weighted moving average of gradients for that particular parameter. So as we loop through the parameters we don't have to do any special work to get access to it, which I think is pretty handy.

One very interesting thing I found here is that I could really hike the learning rate, way up to 1.5, and the reason is that we're not getting those huge bumps anymore; by getting rid of the huge bumps the whole thing is just a whole lot smoother. Previously we got up to 85%, because we've gone back to our 1024 batch size with just three epochs and a constant learning rate, and look at that: we've got up to 87.6%. So it's really improved things, and the loss curve is nice and smooth, as you can see. In our color dim plot you can see this is really the smoothest we've seen. It's a bit different from the momentum learner, because the momentum learner didn't have this 1-minus part: it wasn't lerping, it was basically always including all of the grad plus a bit of the momentum part. So this is a different, and I think better, approach, and we've got a really nice smooth result.
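A sketch of that Momentum optimizer, building on the SGD class sketched above (so torch is already imported); the moving average is stashed straight onto each parameter tensor as a new attribute.

```python
class Momentum(SGD):
    def __init__(self, params, lr, wd=0., mom=0.9):
        super().__init__(params, lr, wd=wd)
        self.mom = mom

    def opt_step(self, p):
        # invent a new attribute on the parameter tensor to hold its moving average
        if not hasattr(p, 'grad_avg'):
            p.grad_avg = torch.zeros_like(p.grad)
        # grad_avg = mom*grad_avg + (1-mom)*grad  (the same lerp as the demo above)
        p.grad_avg = p.grad_avg * self.mom + p.grad * (1 - self.mom)
        p -= p.grad_avg * self.lr
```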
One person's asking: don't we get a similar effect, in terms of smoothness, if we increase the batch size? We do, but if you just increase the batch size you're giving the model fewer opportunities to update, so having a really big batch size is actually not great. LeCun, who created the first really successful convnet, LeNet-5, has said he thinks the ideal batch size, if you can get away with it, is 1; it's just slow. You want as many opportunities to update as possible. There's been this weird trend recently where people seem to be trying to create really large batch sizes, which to me doesn't make sense; generally speaking we want the smallest batch size we can get away with, to give the model the most chances to update. So momentum has done a great job here, and we're getting very good results despite using only three epochs and a very large batch size.

Okay, so that's momentum. Now, something that was announced in a Coursera course back in maybe 2012 or 2013 by Geoffrey Hinton, and has never been published, is called RMSProp. Let's have it running while we talk about it. RMSProp updates the optimization step with something very similar to momentum, but rather than lerping on p.grad we lerp on p.grad squared, and just to keep things consistent we won't call the multiplier mom, we'll call it square momentum (sqr_mom). So what are we doing with the grad squared? The idea is that a large grad squared indicates a large variance of gradients, so what we then do is divide the gradient by the square root of that, plus epsilon. You'll notice I've been a bit all over the place with epsilon: in my batch norm I put the epsilon inside the square root, and here I'm putting it outside the square root. It does make a difference, so be careful about how your epsilon is being interpreted; I can't remember if I've got it exactly right everywhere, but I've tried to be consistent with the papers or the usual implementations. This is a very common cause of confusion and errors.

So what we're doing here is dividing the gradient by the amount of variation, the square root of the moving average of the gradient squared. The idea is that if the gradient has been jumping around all over the place then we don't really know what it is, so we shouldn't do a very big update; if the gradient is very much the same all the time, then we're confident about it, so we do want a big update. (I have no idea why I did this in two steps; let's just pop this over here.) Now, because we're dividing our gradient by this possibly rather small number, we generally have to decrease the learning rate, so bring it back to 0.01, and as you see it's training. It's not amazing, but it's training okay, though it's a bit bumpy; I could try decreasing it a little, down to 3e-3 instead. That's a little better and a bit smoother, so that's probably good. Shall we see what the colorful dimension plot looks like too? Again, it's very nice; that's great.

Now, one thing I did here which I don't think I've seen done before, and I don't remember people talking about, is that I decided not to do the normal thing of initializing the squared-gradient average to zeros. If I initialize it to zeros then my initial denominator will basically be 0 plus epsilon, which means my initial effective learning rate will be very, very high, which I certainly don't want. So I actually initialized it to whatever the first mini-batch's gradient squared is, and I think that's a really useful little trick for RMSProp.
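A sketch of RMSProp in the same style, including the initialize-from-the-first-batch trick; the default values here are my assumptions rather than the notebook's exact numbers, and epsilon sits outside the square root as described above.

```python
class RMSProp(SGD):
    def __init__(self, params, lr, wd=0., sqr_mom=0.99, eps=1e-5):
        super().__init__(params, lr, wd=wd)
        self.sqr_mom, self.eps = sqr_mom, eps

    def opt_step(self, p):
        if not hasattr(p, 'sqr_avg'):
            # start from the first batch's grad**2 rather than zeros, so the
            # first updates aren't divided by a denominator of roughly eps
            p.sqr_avg = p.grad ** 2
        p.sqr_avg = p.sqr_avg * self.sqr_mom + p.grad ** 2 * (1 - self.sqr_mom)
        # divide the gradient by the amount of variation (sqrt of the moving average)
        p -= self.lr * p.grad / (p.sqr_avg.sqrt() + self.eps)
```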
Momentum can be a bit aggressive sometimes for really finicky learning methods and finicky architectures, so RMSProp can be a good way to get reasonably fast optimization of a very finicky architecture; in particular, EfficientNet is an architecture which people have generally trained best with RMSProp. You don't see it a whole lot, and in some ways it's mainly of historical interest, but you do still see it a bit.

The thing we really want to look at, though, is RMSProp plus momentum together. RMSProp plus momentum together exists, and it has a name you'll have heard many times: Adam. Adam is literally just RMSProp and momentum. The two factors are, rather annoyingly, called beta1 and beta2; they should really be called the momentum and the momentum of the squares, I suppose. Beta1 is just the momentum from the momentum optimizer, and beta2 is just the momentum for the squares from the RMSProp optimizer. So we store those away, and just like RMSProp we need the epsilon. As before we store away the gradient average and the square average, and then we do our lerping, but there's a nice little trick here: in order to avoid doing that thing where we put the initial batch's gradients in as our starting values, we use zeros as our starting values and then unbias them. The basic idea is that on the very first mini-batch, if you lerp zero with the gradient, the result will obviously be closer to zero than it should be; but we know exactly how much closer. The shortfall involves a factor of beta1 for the first mini-batch, beta1 squared for the second, beta1 cubed for the third, and so forth (the average is too small by a factor of 1 minus beta1 to the power of the number of batches seen). That's why we had this self.i back in our SGD, keeping track of which mini-batch we're up to: we need it to do this unbiasing of the average. Oh dear, I'm not unbiasing the square of the average, am I? Whoops, so we need to do that here as well; I wonder if this is going to help things a little bit. The unbiased square average is p.sqr_avg with beta2, and we'll use those unbiased versions. This unbiasing only matters for the first few mini-batches, where otherwise the values would be closer to zero than they should be.

So we run that, and you'd expect the learning rate to be similar to what RMSProp needs, because we're doing that same division, so we do have the same learning rate here. And we're up to 86.5% accuracy, so that's pretty good, I think; actually it's a bit less good than momentum, which is fine. You can obviously fiddle around with different values of beta1 and beta2 (for momentum we had 0.9) and see if you can beat the momentum version; I suspect you probably can.
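And a sketch of Adam, again building on the SGD class above, with the bias correction applied to both averages; the defaults are my choices, and epsilon sits outside the square root as in the RMSProp sketch.

```python
class Adam(SGD):
    def __init__(self, params, lr, wd=0., beta1=0.9, beta2=0.99, eps=1e-5):
        super().__init__(params, lr, wd=wd)
        self.beta1, self.beta2, self.eps = beta1, beta2, eps

    def opt_step(self, p):
        if not hasattr(p, 'avg'):
            # start from zeros, then unbias below
            p.avg = torch.zeros_like(p.grad)
            p.sqr_avg = torch.zeros_like(p.grad)
        p.avg = self.beta1 * p.avg + (1 - self.beta1) * p.grad
        p.sqr_avg = self.beta2 * p.sqr_avg + (1 - self.beta2) * p.grad ** 2
        # bias correction: undo the pull towards zero in the first few batches
        # (self.i is the 0-based batch counter kept by SGD.step)
        unbias_avg = p.avg / (1 - self.beta1 ** (self.i + 1))
        unbias_sqr_avg = p.sqr_avg / (1 - self.beta2 ** (self.i + 1))
        p -= self.lr * unbias_avg / (unbias_sqr_avg.sqrt() + self.eps)
```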
Oh, we're a bit out of time, aren't we? All right. I'm excited about the next bit, but I want to spend the time to do it properly, so I won't rush through it now; instead we'll do it next time. I will give you a hint, though: in our next lesson we will in fact get above 90%, and there's some very cool stuff to show you. I can't wait to show you that then.

But in the meantime, let's give ourselves a pat on the back, because think about all the stuff we've now got running: we've done the whole thing from scratch, using nothing but what's in the Python standard library, we've re-implemented everything, and we understand exactly what's going on. I think that's really quite terrifically cool, personally; I hope you feel the same way, and I look forward to seeing you in the next lesson. Thanks, bye.