Lesson 9 (2019) - How to train your model

Welcome back to lesson 9 part 2, how to train your model. Before we talk about training our model, though, I wanted to revisit a couple of things that came up last week. And the reason I really wanted to revisit them is because I wanted to kind of give you an insight into how I do research.

I mean, a lot of this course really will be me showing you how I do research and how I do software development, in the hope that that is somewhat helpful to you. So one of the questions that came up last week was about when we looked inside the `nn.Conv2d` that comes with PyTorch to see how it goes about initializing parameters.

We found that inside `_ConvNd.reset_parameters` is where it does its initialization, and we found this `math.sqrt(5)` without any commentary, which was quite mysterious.

So I decided to do some research into this, kind of dual research. One part is, what's the impact of this `math.sqrt(5)`? And at the same time, I tried to get in touch with the PyTorch team to ask them where this `math.sqrt(5)` comes from.

So let me show you how I went about doing that research. So I loaded up just what we had from last week, which was the ability to download the MNIST data and open it up. And then the function to normalize it, which I thought we'd export if we haven't already.

And then we'd grab the data and normalize it. We're going to be talking a lot about convolutions towards the end of today's lesson, and particularly next lesson I suspect, so I'll skip over some of the details about convolutions for now. But basically, to do a convolution, as you know, we need a square or rectangular input.

And our MNIST input, remember, was just a single vector per image, 784 long. So I resized them all to 28 by 28 one-channel images, so that we could test out the impact of this Conv2d init in PyTorch, and set up the various variables that we wanted to have.

And then I created a Conv2d layer. So we have one input, because it's just one channel; `nh`, which is number hidden, which is 32 outputs; and let's do a five by five kernel. We'll talk more about why five by five might be suitable. And just for testing, let's just grab the first hundred elements of the validation set.

So we've now got a tensor of 100 by 1 by 28 by 28. It's a really good idea when you're playing with anything in software development, including notebooks, to refactor things. I'm going to be wanting to look at the mean and standard deviation of a bunch of things.

So let's create a little function called `stats` to do that. I never plan ahead what I'm going to do. When you see this in a notebook, it always means that I've written that out by hand, then I copied it, and then I'm like, OK, I'm using it twice, I'll chuck it in a function.

So then I go back and create the function. So here I've got the mean and standard deviation of my `l1`, which is a Conv2d layer. A Conv2d layer contains a weight tensor parameter and a bias tensor parameter. So just to remind you, `l1.weight.shape` is 32 output filters, because that's what number hidden was, one input filter, because we only have one channel, and then five by five.

OK, so that's the size of our tensor. And if you've forgotten why that's the size of the tensor, you can go back to the Excel directory, for example, from part one, where you can find the conv example spreadsheet.

And in the conv example spreadsheet, you can see what each of those parameters does. We basically had a filter for each input channel and for each output channel, so that's kind of what it looked like. And so for the next layer, we now have a four-dimensional tensor, a rank four tensor: we've got the three by three, and we've got it for each input and for each output.

OK, so that's the 32 by 1 by 5 by 5. The mean and standard deviation of the weights are 0 and 0.11. And this is because we know that, behind the scenes, it has called this function to initialize.

So the bias is initialized with a uniform random number between negative of this and positive of this, and then the weights are initialized with Kaiming uniform, with this odd `math.sqrt(5)` thing. So that's fine. That's not particularly interesting. What's more interesting is to take our input tensor of MNIST numbers and put it through this layer, which we called `l1`, which remember is a Conv2d layer.

So let's call `l1` on it to create an output `t`, and let's look at the stats of `t`. So this is the stats of the output of this layer. We would like it to have a mean of zero and a standard deviation, or a variance, of one. The mean of zero is there, but the standard deviation of one is not there.

So that looks like a problem. Let's compare this to the normal Kaiming init. The normal Kaiming init, remember, is designed to be used after a ReLU layer, or more generally a leaky ReLU layer. And recall that a leaky ReLU has y equals x here, and here the gradient of this is called `a`, or leak, or whatever.

Now in our case, we're just looking at a conv layer, so we don't have anything kind of going on there. In fact, it's a straight line here as well. So effectively we have, if you like, a leak of one, or an `a` of one. So to use Kaiming init with no ReLU, we can just put `a=1`.

And if we do that, then we get a mean of zero and a variance of one. So Kaiming init seems to be working nicely. OK, so let's now try it with ReLU. So let's now define a function, which is the function for layer one, which is to pass it through our layer one conv and then do a ReLU with some `a`, some leak amount, which we'll set to zero by default.

So this will be just a regular ReLU. And you can see that if we now run that with Kaiming initialization, we get a variance of one, which is good, and the mean is no longer zero, as we discussed last week. It's about a half.

But if we go back and reinitialize the Conv2d with the default PyTorch init, this is not looking good at all. With ReLU it's even worse, because remember, they don't have anything handling that ReLU case in the default conv. So this looks like a problem: a variance of 0.35.

It may not sound a lot lower than one, but let's take a look at what that means. I forgot to mention where we are: this is the 02a_why_sqrt5 notebook. So in order to explore this, I decided that I would try and write my own Kaiming init function.

Normally with the Kaiming init function, if we were working with just a regular fully connected matrix multiplication, we would basically be asking how many output filters there are. So if this is the weight matrix, then what's the width of the weight matrix? For a convolutional layer, it's a little bit different, because what we actually want to know is how many things get multiplied and added each time. In this case, we're basically multiplying all these together with some set of inputs and then adding them all up. That's basically what a single step of a matrix multiplication is.

In a convolution we're also multiplying a bunch of things together and adding them up, but if it was three by three, what we're actually multiplying together is each of the three by three elements, and also the channel dimension. We multiply all of those together and add them all up, because convolution and matrix multiplication are kind of one and the same thing, as we know, with some weight tying and with some zeros.

So in order to calculate the total number of multiplications and additions going on for a convolutional layer, we need to basically take the kernel size, which in this case is 5 by 5, and multiply it by the number of filters. The general way to get that 5 by 5 piece is we can just grab any one piece of this weight tensor, and that will return a 5 by 5 kernel, and then say how many elements are in that part of the weight tensor. That's going to be the receptive field size.

So the receptive field size, for just the immediate layer before, is how many elements are in that kernel. For this it's 25, because it's 5 by 5. If we then grab the shape of the weight matrix, it gives us the number of filters out, 32, and then the number of filters in, 1, and I'll skip the rest because they're the only two things I want.

So now for the Kaiming He init, we can calculate that fan-in is the number of input filters times the receptive field size, so that's 1 times 25, and fan-out is 32 times 25.
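The fan-in and fan-out arithmetic described above can be sketched in a few lines of plain Python. This is only an illustration: `conv_fans` is a hypothetical helper name, not something from the lesson notebook or from PyTorch, and it assumes the weight shape is laid out as (output filters, input filters, kernel dimensions).

```python
import math

def conv_fans(weight_shape):
    """Return (fan_in, fan_out) for a conv weight shaped (n_out, n_in, *kernel).

    Hypothetical helper for illustration only.
    """
    n_out, n_in, *kernel = weight_shape
    receptive_field = math.prod(kernel)  # elements in one kernel, e.g. 5 * 5 = 25
    return n_in * receptive_field, n_out * receptive_field

fan_in, fan_out = conv_fans((32, 1, 5, 5))
print(fan_in, fan_out)  # 25 800
```

For the 32 by 1 by 5 by 5 weight tensor from the lesson this gives 1 times 25 and 32 times 25.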

So there you can see how we calculate the effective fan-in and fan-out for a convolutional layer, and we can do all that by hand. Then, for the Kaiming init formula with a ReLU, you need to multiply by root 2. Or if there's a leaky part in it, so if `a` is not equal to 0, then it is actually the square root of 2 divided by (1 plus a squared).

So that's just the formula for the Kaiming init, and that's often called the gain for the init. You can see that if `a` is 1, then the gain is 1: that's just linear, there's no nonlinearity at all, so there's no change to the calculation of how you do the initialization.

On the other hand, if it's a standard ReLU, then you've got the root 2 which we saw last week from the Kaiming paper. With an `a` of 0.01 it's about root 2 as well, it's pretty close, and this is a kind of a common leaky ReLU amount.
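As a quick check of the gain formula, here is a small sketch that evaluates the square root of 2 / (1 + a²) for the leak values mentioned in the lesson, plus the `math.sqrt(5)` that appears in PyTorch's default conv init. The function name `gain` is chosen here for illustration.

```python
import math

def gain(a):
    """Kaiming gain for a leaky ReLU whose negative-side slope is `a`."""
    return math.sqrt(2.0 / (1 + a ** 2))

# a=1 is purely linear, a=0 is a plain ReLU, a=0.01 is a common leaky ReLU,
# and a=sqrt(5) is the value used in the default init discussed in this lesson.
for a in (1, 0, 0.01, math.sqrt(5)):
    print(f"a={a:.3f}  gain={gain(a):.3f}")
```

With `a = math.sqrt(5)` the expression becomes the square root of 2/6, which is about 0.577.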

But what about in the case of the PyTorch init? In the case of the PyTorch init, `a` is root 5, which gives a gain of 0.577, which sounds like an odd number. It's a long way away from what we were expecting to see, so that's a bit concerning. But one thing we have to account for here is that the initialization that they use for PyTorch is not Kaiming normal, it's Kaiming uniform.

Normally distributed random numbers look like that, but uniform random numbers look like that. And the uniform random numbers they were using as their kind of starting point were between minus one and one. The standard deviation of that obviously is not one; it's obviously less than one. You can Google for the standard deviation of a uniform distribution, or you could jump into Excel or Python and just grab a bunch of random numbers and find out what the standard deviation is.
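The "grab a bunch of random numbers" check can be done with the standard library alone. This sketch draws ten thousand samples from a uniform distribution on [-1, 1] and compares their standard deviation with the closed-form value (b - a) / sqrt(12), which is 1 / sqrt(3) for this interval. The seed and variable names are arbitrary choices for the example.

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed so the run is repeatable
samples = [random.uniform(-1, 1) for _ in range(10_000)]

empirical_std = statistics.pstdev(samples)
closed_form = 1 / math.sqrt(3)  # (b - a) / sqrt(12) with a=-1, b=1
print(empirical_std, closed_form)
```

The two printed values should agree to roughly two decimal places, which is why a uniform init needs an extra factor of root 3 to reach the same standard deviation as a normal one.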

And you'll find, and I've got it here actually, I've grabbed ten thousand random numbers from that uniform distribution and asked for their standard deviation, that it's one over root three. So part of the reason for this difference is that they need a gain to handle uniform random numbers rather than just normal random numbers. But it still doesn't quite account for the difference, so let's take a look.

Here's my version of Kaiming init, `kaiming2`, in which I've just grabbed all of the previous lines of code and merged them together, and then I've just added this thing to multiply it by root three because of the uniform random numbers. If I run this `kaiming2` on my weights and get the stats of that, nicely, I again get a variance of about one. And if I do it with `a=math.sqrt(5)`, I would expect to get the same result as the PyTorch default, which I do: it's about 0.4, which is what we found back here, 0.35. So it seems like we've successfully re-implemented what they had.

So at this point I was like, OK, well, what does this do? What does this mean? To see what this looks like, I threw together a quick convnet, and I grabbed the first hundred dependent variables. Then I took my input and ran it through the whole convnet to get the stats of the results. So this is now telling me what happens when I use the default PyTorch init and put it through a four-layer convnet, and the answer is I end up with a variance of 0.006.

And that sounds likely to be a really big problem, because there's so little variation going on in that last layer, and also there's a huge difference between the first layer and the last layer. That's the really big issue. The input layer is 1, the first hidden layer is 0.4, and the last layer is 0.06, so these are all going at totally different rates.

And then what we could do is grab that prediction and put it through mean squared error, which is the function we created last week, run backward, and get the stats on the gradients.

That's for the first layer weights, so this is now going all the way forward and all the way back again. And again, the standard deviation is nowhere near one, so that sounds like a big problem. So let's try using Kaiming uniform instead. If you look at the Kaiming uniform source code, you'll see that it's got the steps that we saw: gain over root of the fan, and here is the square root of three because it's uniform.

So let's go through every layer, and if it's a convolutional layer, then let's call Kaiming uniform on the weights and set the biases to zero. So we'll initialize it ourselves. And then we'll grab `t`, and it's not one, but it's a lot better than 0.08. So this is pretty encouraging, that we can get through four layers. We probably wouldn't want to have a forty-layer neural network which is losing this much variance, but it should be plenty good enough for a four-layer network. And then let's also confirm on the backward pass: the first layer's gradient is 0.5.

So that was my kind of starting point for the research here, and at the end of this I kind of thought this is pretty concerning. And why did I think it was concerning? We'll be seeing a lot more about why it's concerning, but let's quickly look at 02b_initializing. Sylvain put this together today, and he called it "why you need a good init". He's pointed out here that if you grab a random vector `x` and a random matrix `a`, which is random normally distributed with a mean of zero and a standard deviation of one, and then a hundred times you basically go `x = a @ x`, so you're basically multiplying again and again and again, then after a hundred iterations your standard deviation and mean are not a number (NaN).

So basically the issue is that when you multiply by a matrix lots and lots of times, you explode out to the point that your computer can't even keep track. So what Sylvain did next was he actually put something in a loop to check whether it's not a number, and he found it was 28 iterations before it died.

So it didn't take very long to explode. Now on the other hand, what if we take random numbers with a standard deviation of 0.01 instead of one, and we do that a hundred times? Then it disappears to zero. So you can see that if you've got a hundred-layer neural net, because that's what this is doing, it's doing a hundred matrix multiplies on the output of each previous one, you've got to be super careful to find some set of weights that works. Because if this is your starting set of weights, if it's 0.01 or if it's one standard deviation, you can't ever learn anything, because there are no gradients. The gradients are either zero or NaN. So you actually have to have a reasonable starting point, and this is really why for decades people weren't able to train deep neural networks: people hadn't figured out how to initialize them.

So instead we have to use some other init, and we'll talk about that in a moment. For those who are interested, Sylvain's notebook goes on to describe why it is that you have to divide by the square root of the fan.
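The explode-or-vanish behaviour described above can be reproduced without any libraries. The sketch below is a scaled-down variant, not the notebook's code: it uses a 32 by 32 matrix and a vector instead of large tensors, and `final_std` is a name made up for this example. Python floats are double precision, so the exploding case prints an enormous number rather than reaching NaN as quickly as a float32 tensor would.

```python
import math
import random

def final_std(weight_std, n=32, steps=100, seed=0):
    """Repeat x = a @ x `steps` times and return the std of the final x."""
    rng = random.Random(seed)
    a = [[rng.gauss(0, weight_std) for _ in range(n)] for _ in range(n)]
    x = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(steps):
        x = [sum(w * v for w, v in zip(row, x)) for row in a]  # x = a @ x
    mean = sum(x) / n
    return math.sqrt(sum((v - mean) ** 2 for v in x) / n)

exploding = final_std(1.0)                 # weights with std 1
vanishing = final_std(0.01)                # weights with std 0.01
scaled = final_std(1 / math.sqrt(32))      # std divided by sqrt(fan-in)
print(exploding, vanishing, scaled)
```

The first value should be astronomically large and the second vanishingly small, while dividing the weight scale by the square root of the number of inputs keeps the result far closer to where it started.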

And so feel free to keep reading that if you're interested. It's cool, but we don't need to know it for now; it's just some derivation and further discussion. So in parallel, I also asked the PyTorch team about this, and I sent these results to them and said, what's going on? And so Soumith finally appeared, and he said it was a historical accident, because for 15 years before PyTorch appeared there was a product called Torch, a neural network product in Lua, and they did it that way.

And so then on Google Plus in 2014, he started talking to Sander Dieleman, who's now at DeepMind, and at about this time, maybe a bit before, he was our intern actually, at Enlitic. And Sander said this root 5 thing looks weird, and Soumith said, no, no, go look at the paper, and Sander said, no, that's not what the paper says, and Soumith said, oh, it's a bug, but it's a good bug, because somebody went and checked it out and they thought that they were getting better results with this thing.

So then I talked to Soumith, and he was already aware of this issue to some extent, and within a couple of hours the PyTorch team had created an issue saying they're going to update their inits. So this is super cool. This is partly to say this is an awesome team, super responsive, and this is why PyTorch works so well: they see issues and they fix them.

But it's also to say, when you see something in a library, don't assume it's right or that it makes sense. When it comes to deep learning, none of us know what we're doing, and you can see it doesn't take too much to dig into something. And then you can raise an issue and say, here's the analysis that I did. There's a fantastic extension called Gist-it for Jupyter notebooks that lets you take your little research notebook, press a single button, and it turns it into a shareable gist that you can then put a link to, to say here's the analysis that I did.

And so yeah, that's a little bit of research I did in answering this question from last week. There are lots of interesting initialization approaches you can use. We've already talked about the Glorot and Bengio paper, and we've already talked about the Kaiming He paper. There's an interesting paper called "All You Need Is a Good Init", which describes how you can iteratively go through your network and set one layer of weights at a time, to literally do a little optimization to find out which set of parameters gets you a unit variance at every point.

There's another cool paper which talks about something called orthogonal initialization. If you've done some linear algebra, particularly if you've done Rachel's computational linear algebra course, you'll know about the idea of orthogonal matrices, and they make good inits. We talked briefly last week about Fixup initialization, and then there's also something called self-normalizing neural networks.

Fixup and self-normalizing neural networks are both interesting papers because they describe how to try to set a combination of activation functions and init such that you are guaranteed a unit variance as deep as you like. Both of those two papers went to something like a thousand layers deep and trained them successfully.

Fixup is much more recent, but in both cases people have kind of hailed them as reasons we can get rid of batch norm. I think that's very unlikely to be true. Very few people use this SELU thing now, because in both cases they're incredibly fiddly.

So for example, in the self-normalizing neural networks case, if you put in dropout you need to put in a correction, and if you do anything different you need to put in a correction. As you've seen, as soon as something changes, like the amount of leakiness in your activation function or whatever, all of your assumptions about what your variance will be in the next layer disappear.

And for the SELU paper it was a particular problem, because it relied on two specific numbers that were calculated in a famous 96-page-long appendix of math in the SELU paper. So if you wanted to do a slightly different architecture in any way, and they only showed this for a fully connected network, so even if you just want to do convolutions, what are you going to do, redo that 96 pages of math? That 96 pages of math is now so famous that it has its own Twitter handle, the SELU Appendix, which has the pinned tweet "Why does nobody want to read me?" And this is literally what the entire 96 pages of the appendix looks like.

I will mention that, in my opinion, this is kind of a dumb way of finding those two numbers. The "All You Need Is a Good Init" paper is a much better approach to doing these things, in my opinion: if you've got a couple of parameters you need to set, then why not set them using a quick little loop or something? So if you want to find the kind of SELU parameters that work for your architecture, you can find them empirically pretty quickly and pretty easily.

OK, so that's a little bit about init; we'll come back to more of that very shortly. There was one other question from last week, which was that we noticed that the shape of the manual linear layer we created and the shape of the PyTorch one were transposed, and the question was why.

And so again I did some digging into this, until eventually Soumith from the PyTorch team pointed out to me this commit from seven years ago in the old Lua Torch code where this actually happened. Basically it's because that old Lua library couldn't handle batch matrix multiplication without doing it in this transposed way, and that's why still to this day PyTorch does it kind of upside down. Which is fine; it's not slower, it's not a problem. But again it's kind of an interesting case. I find this happens all the time in deep learning: something's done a particular way once, and then everybody does it that way forever, and nobody goes back and asks why. Now in this particular case it really doesn't matter, I don't think.

But often it does. Things like how do we initialize neural networks, and how many layers should they have, and stuff like that: nobody really challenged the normal practices for years. So I'm hoping that with this really ground-up approach you can see what the assumptions we're making are, and see how to question them, and see that, to me, PyTorch is the best library around at the moment, and even PyTorch has these weird kind of archaic edges to it.

OK, so that was a little diversion to start with, but a fun diversion, because that's something I spent a couple of days this week on and I think it's pretty interesting. So to go back to how we implement a basic modern CNN model: we got to this point. We've done a matrix multiplication, so that's our affine function. We've done ReLU, so that's our nonlinearity. And so a fully connected network forward pass is simply layering together those two things. So we did that, and then we did the backward pass, and we refactored that nicely, and it turned out that it looked pretty similar to PyTorch's way of doing things.

And so now we're ready to train our model, and that's where we're up to. So here we are in 03_minibatch_training, and we're going to train our model. We can start by grabbing our MNIST data, so again we're just importing the stuff that we just exported from the previous class.

Here's the model we created in the previous class. Let's get some predictions from that model, and we'll call them `pred`. Now, to train our model, the first thing we need is a loss function, because without a loss function we can't train it. Previously we used mean squared error, which I said was a total cheat.

Now that we've decided to trust PyTorch's autograd, we can use many more things, because we don't have to write our own gradients, and I'm too lazy to do that. So let's go ahead and use cross entropy, because cross entropy makes a lot more sense. To remind you, from the last class there is an entropy example notebook, where we learnt that cross entropy requires doing two things. In the case of multi-class categorical cross entropy, first you do softmax, and then you do the negative log likelihood.

So the softmax was: if we have a bunch of different possible predictions, and we got some output for each one from our model, then we take e to the power of that output, we sum them all up, and then we take e to the power of each one divided by the sum of e to the power of all of them. That was our softmax. So there it is in math form, there it is in summation math form, and here it is in code form.

So it's `x.exp()` divided by `x.exp().sum()`, and then on the whole thing we do a `.log()`. That's because in PyTorch, negative log likelihood expects a log softmax, not just a softmax, and we'll see why in a moment. So we pop a log on the end, and here's our log softmax function.
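To make the "e to the x, divided by the sum, then log" recipe concrete, here is a dependency-free sketch for a single row of model outputs. It is a plain-Python stand-in, not the tensor version from the notebook, and the input values are made up for illustration.

```python
import math

def log_softmax(row):
    """log(exp(x_i) / sum_j exp(x_j)) for one row of model outputs."""
    total = sum(math.exp(v) for v in row)
    return [math.log(math.exp(v) / total) for v in row]

out = log_softmax([1.0, 2.0, 3.0])  # illustrative activations
print(out)
print(sum(math.exp(v) for v in out))  # exponentiating recovers probabilities summing to ~1
```

Every entry is negative or zero because each one is the log of a probability, and exponentiating the outputs gives back the softmax itself.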

So now we can go ahead and create our softmax predictions by passing `pred` to `log_softmax`. Now that we've done that, we can calculate cross entropy loss. Cross entropy loss is generally expressed in this form: the sum of the actual times the log of the probability of that actual. So in other words, if we have is-cat and is-dog, then here's our actual, which is one-hot encoded: is-cat yes, is-dog no.

We have our predictions from our model, from our softmax. We can then say, well, what's the log of the probability it's a cat, so log of this, and what's the log of the probability it's a dog, so log of one minus that. And so then our negative log likelihood is simply B times E plus C times F, and then take the negative of all that. That's negative log likelihood, which is what this is.

But remember, and I know I keep saying this because people keep forgetting, not you guys but people out in the world, that when you're multiplying by stuff which is mainly zero, as in one-hot encoded multi-class classification, most of your categories are zero. Every time you multiply by zero you're doing nothing at all, but you're doing it very slowly.

So rather than multiplying by zero and then adding up the one 1 that you have, a much faster way, as we know, to multiply by a one-hot encoded thing is to first of all simply ask what's the location of the 1. So in this case it's location two, and in this case it's location one, if we index from one. And then we just look up into our array of probabilities directly, offset by this amount.

Or to put it in math terms, for one-hot encoded x's the above is simply minus log of p_i, where i is the index of the actual. So how do we write this in PyTorch? I'll show you a really cool trick.

This is what we're going to end up with: this is our negative log likelihood implementation, and it's incredibly fast and incredibly concise, and I'll show you how we do it. Let's look at our dependent variable, so let's just look at the first three values: 5, 0, 4.

What we want to do is find the probability associated with 5 in our predictions, and with 0, and with 4. Our softmax predictions, remember, are 50,000 by 10. If we take the very first of those, there it is, and the dependent variable said that the actual answer should be 5. So if we go into this, 0, 1, 2, 3, 4, 5, that's the answer that we're going to want.

So here's how we can grab all three of those at once. We can index into our array with the whole thing, 5, 0, 4, and then for the first bit we pass in just contiguous integers, 0, 1, 2. Why does this work? This works because PyTorch supports all of the advanced indexing support from NumPy. If you click on this link, one of the many types of indexing that NumPy, and therefore PyTorch, supports is integer array indexing. What this is, is that you pass a list for each dimension. In this case we have two dimensions, so we need to pass two lists: the first is the list of all of the row indexes you want, and the second is the list of all of the column indexes you want. So this is going to end up returning [0, 5], [1, 0], and [2, 4], which are the exact numbers that we wanted. So for example, [0, 5] is -2.49.

So to grab the entire list of the exact things that we want for our negative log likelihood, we basically say, OK, let's look in our predictions, and then for our row indexes it's every single row index, so `range(target.shape[0])`. `target.shape[0]` is the number of rows, so the range of that is all of the numbers from zero to the number of rows: 0, 1, 2, 3, and so on up to 49,999.

And then which columns do we want for each of those rows? Well, whatever our target is, whatever the actual value is, so in this case 5, 0, 4, etc. That returns all of the values we need. We then take minus, because it's negative log likelihood, and take the mean. So that's all it takes to do negative log likelihood in PyTorch, which is super wonderfully easy.
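The row-and-column lookup described above can be mimicked with nested Python lists. This is a plain-Python stand-in for integer array indexing, written to show that picking one entry per row gives the same answer as multiplying by a one-hot vector and summing. The function names and the probability values are invented for this example.

```python
import math

def nll_lookup(log_probs, targets):
    """Mean negative log likelihood via direct lookup of log_probs[row][target]."""
    picked = [log_probs[row][col] for row, col in zip(range(len(targets)), targets)]
    return -sum(picked) / len(picked)

def nll_one_hot(log_probs, targets):
    """Same quantity, computed the slow way by multiplying with one-hot vectors."""
    total = 0.0
    for row, target in zip(log_probs, targets):
        one_hot = [1.0 if j == target else 0.0 for j in range(len(row))]
        total += sum(o * lp for o, lp in zip(one_hot, row))
    return -total / len(targets)

probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]  # illustrative values
log_probs = [[math.log(p) for p in row] for row in probs]
targets = [0, 1, 2]
print(nll_lookup(log_probs, targets), nll_one_hot(log_probs, targets))
```

Both functions return the same number; the lookup version just skips all the multiplications by zero.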

So now we can calculate our loss, which is the negative log likelihood of the softmax predictions, which is what we had up here, compared to our actual `y_train`. And so there it is. Now, this was our softmax formula, which is e to the x over the sum of e to the x's.

So we have a over b, and then it's all logged, so we've got a log of a over b. Remember, I keep telling you that one thing you want to remember from high school math is how logs work, so I do want you to try to recall that log of a over b is log of a minus log of b.

And so we can rewrite this as log of e to the x minus the log of all that. And of course e to the something and log are opposites of each other, so log of e to the x is just x. So that ends up being x minus `x.exp().sum().log()`.
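The log-rule refactoring can be checked numerically. This sketch, again in plain Python with made-up inputs and hypothetical function names, computes log softmax both ways and compares them.

```python
import math

def log_softmax_direct(row):
    """log(exp(x) / sum(exp(x))), the original form."""
    total = sum(math.exp(v) for v in row)
    return [math.log(math.exp(v) / total) for v in row]

def log_softmax_simplified(row):
    """x - log(sum(exp(x))), using log(a/b) = log(a) - log(b)."""
    log_total = math.log(sum(math.exp(v) for v in row))
    return [v - log_total for v in row]

row = [0.5, -1.2, 3.0, 0.0]  # illustrative activations
direct = log_softmax_direct(row)
simplified = log_softmax_simplified(row)
print(max(abs(d - s) for d, s in zip(direct, simplified)))  # tiny rounding difference
```

The largest difference between the two versions should be at the level of floating point rounding.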

So this is useful. And let's just check that that actually works. As I keep refactoring these things, and to me these mathematical manipulations are just refactoring, just refactoring the math, I keep checking along the way. We created `test_near` last time, so let's use it to make sure that it's the same as our loss.

Now you'll see here this is taking the log of the sum of the X and there's a trick called log some X. Now the reason we need this trick is that when you go into the power of something you can get ridiculously big numbers and if you've done Rachel's new computational linear algebra course then you'll know that very big numbers in floating point.

On a computer are really inaccurate basically the further you get away from zero the less kind of fine grained they are you know gets to the point where like two numbers a thousand apart the computer thinks they're the same number so you don't want big numbers particularly when you're calculating gradients.

So anywhere you see an eight of the X we get nervous we don't want X to be big but it turns out that if you do this little mathematical substitution you can actually subtract a number from your exes and add them back at the front and you get the same answer so what you could do is you can find the maximum of all of your exes.

You can subtract it from all of your exes and then add it back afterwards outside the X and you get exactly the same answer so in other words let's find the maximum. That's subtracted from all of our exes and then let's do log some X and then at the end we'll add it back again and that gives you exactly the same number that without this numerical problem.

So when people talk about numerical stability tricks, they're talking about stuff like this, and this is a really helpful numerical stability trick. So this is how you do LogSumExp in real life. We can check that this one here is the same, and look, in fact logsumexp is already a method in PyTorch. It's such an important and useful thing. You can just use PyTorch's, and you'll get the same result as the one we just wrote.
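
To make the trick concrete, here is a minimal sketch in plain Python (not PyTorch, so it runs anywhere). The function names logsumexp_naive and logsumexp are my own for illustration; the lesson's version works on tensors.

```python
import math

def logsumexp_naive(xs):
    # Direct translation of log(sum(exp(x))): overflows when any x is large.
    return math.log(sum(math.exp(x) for x in xs))

def logsumexp(xs):
    # Subtract the maximum before exponentiating, then add it back outside the log.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

small = [1.0, 2.0, 3.0]
print(abs(logsumexp(small) - logsumexp_naive(small)) < 1e-12)  # same answer on small inputs

big = [1000.0, 1001.0, 1002.0]
try:
    logsumexp_naive(big)
except OverflowError:
    print("naive version overflows")
print(logsumexp(big))  # the stable version still gives a finite number
```

The largest exponent the stable version ever computes is e to the 0, so nothing can blow up.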

So now we can use it: log softmax is now just x minus x.logsumexp(). And let's check. Yep, still the same. So now that that's all working, we may as well just use PyTorch's log_softmax and PyTorch's nll_loss. But actually, nll_loss of log_softmax is called cross entropy.

So finally we get test_near: F.cross_entropy is the same as loss, and it is. So we've now recreated PyTorch's cross entropy, so we're allowed to use it according to our rules. OK, so now that we have a loss function, we can use it to train, and we may as well also define a metric, because it's nice to see accuracy to see how we're going. It's just much more interpretable. Remember from part one that the accuracy is simply: grab the argmax to find out which of the numbers in our softmax is the highest, and the index of that is our prediction, and then check whether that's equal to the actual.

And then we want to take the mean. But in PyTorch you can't take the mean of ints, you have to take the mean of floats, which makes some sense, so turn it into a float first. So there's our accuracy. So let's just check. Let's grab a batch size of 64 and let's grab our first x batch. This is our first playing around with mini-batches, right? So our first x batch is going to be our training set from zero up to batch size.
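
The accuracy metric described above can be sketched in plain Python like this. It is an illustration on nested lists rather than the tensor version from the lesson, and the sample numbers are made up.

```python
def accuracy(preds, targets):
    # preds: one list of per-class scores per example; targets: integer class labels.
    hits = []
    for scores, target in zip(preds, targets):
        predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax
        hits.append(float(predicted == target))  # bool -> float before averaging
    return sum(hits) / len(hits)

preds = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.3, 0.3, 0.4]]
targets = [1, 0, 0]
print(accuracy(preds, targets))  # two of the three predictions match
```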

So for our predictions we're just going to run our model, and remember our model was linear, ReLU, linear. We're still using a super simple model. So let's calculate some predictions and have a look at them. Here's some predictions, and it's 64 by 10, as you'd expect: batch size 64 and 10 possible probabilities, right.

So now we can grab our first batch of dependent variables and calculate our loss. OK, that's 2.3. And calculate our accuracy, and as you'd expect it's about 10%, because we haven't trained our model. So we've got a model that's giving basically random answers. So let's train it.

We need a learning rate, we need to pick a number of epochs, and we need a training loop. Our training loop, if you remember from part one (remember lesson 2 SGD), looks like this: calculate your predictions, calculate your loss, do backward, subtract learning rate times gradients, and zero the gradients. So let's do exactly the same thing.

We're going to go through each epoch, and go through i up until n, which is 50,000 (that's the number of rows), integer divided by batch size, because we're going to do a batch at a time. And so then we'll grab everything starting at i times batch size and ending at that plus batch size.
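
Here is a small sketch of that index arithmetic. The helper name batch_slices is hypothetical, and using (n - 1) // bs + 1 so that a final partial batch is included is my assumption rather than something stated above.

```python
def batch_slices(n, bs):
    # Yield (start, end) index pairs that cover n rows in chunks of bs.
    for i in range((n - 1) // bs + 1):
        start = i * bs
        yield start, min(start + bs, n)

print(list(batch_slices(10, 4)))          # [(0, 4), (4, 8), (8, 10)]
print(len(list(batch_slices(50000, 64)))) # number of mini-batches per epoch
```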

So this is going to be a mini-batch. Let's grab one x mini-batch and one y mini-batch, and pass that through the model and our loss function. And then do backward. And then we're going to do our update, which, remember, we have to do with no_grad, because this is not part of the gradient calculation.

This is the result of it. But now we can't just go a.sub_ of learning rate times gradient. We have to do that for every single one of our parameters. Our model has three layers. The ReLU has no parameters in it. The first linear layer has weight and bias, and the second linear layer has weight and bias, so we've basically got four tensors to deal with.

So we're going to go through all of our layers, and let's just check whether that layer has an attribute called weight or not. That's a bit more flexible than hard-coding things. And if it does, then let's update the weight, minus-equals the gradient of that times the learning rate, and the bias, minus-equals the bias gradient times the learning rate, and then zero those gradients when we're done. So let's run it. And then let's check the loss function and the accuracy. The loss has gone down from 2.3 to 0.05, and the accuracy has gone up from 0.12 to 1.

Notice that this accuracy is for only a single mini-batch, and it's a mini-batch from the training set, so it doesn't mean too much, but obviously our model is learning something, so this is good. Well, we haven't really done conv, but I guess we've got a basic training loop. We're now here: we have a basic training loop, which is great. So we've kind of got all the pieces.

So let's try to make this simpler, because this is too much code, right, and it's too hard to fiddle around with. So the first thing we'll do is try to get rid of this mess, and we're going to replace it with this. The difference here is that rather than manually going through weight and bias for each one, we're going to loop through something called model.parameters(). So we're not even going to loop through the layers, we're just going to loop directly through model.parameters(), and for each parameter we'll say that parameter minus-equals gradient times learning rate.

So somehow we need to be able to get all of the parameters of our model, because if we could do that we could greatly simplify this part of the loop, and also make it much more flexible. To do that, we could create something like this. I'm calling this DummyModule. In DummyModule, what I'm going to do is say: every time I set an attribute like l1 or l2 (in this case to a linear layer), I want to update a list called _modules with a list of all of the modules I have. So in other words, after I create this DummyModule, I want to be able to print out (here's my representation) the list of those modules and see the modules that are there. Because then I can define a method called parameters that will go through everything in my _modules list and then go through all of their parameters, and that's what lets me do model.parameters() here.

So how did I create this? It's not inheriting from something, right? This is all written in pure Python.

How did I make it so that as soon as I assigned an attribute in my __init__, it somehow magically appeared in this _modules list, so that I could then create this parameters method, so that I could then do this refactoring? The trick is that Python has a special __setattr__ method, and every time you assign to anything inside self, Python will call this method if you've got one. This method just checks that the key (in other words, the attribute name) doesn't start with underscore, because if it does it might be _modules, and then it's going to be a recursive loop. Also, Python's got all kinds of internal stuff that starts with underscore.

So as long as it's not some internal private stuff, put that value inside my _modules dictionary and call it k. That's it. Then after you've done that, do whatever the superclass does when it sets attributes. In this case the superclass is object: if you don't say what it is, then it's just Python's highest-level object. So now we have something that has all of the stuff we need to do this refactoring.
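
As a rough sketch of the idea, here is a pure-Python version with no PyTorch. FakeLinear is a hypothetical stand-in for a linear layer that just holds two lists, and the constructor arguments are illustrative.

```python
class FakeLinear:
    # Hypothetical stand-in for a linear layer: it only holds two "parameters".
    def __init__(self, n_in, n_out):
        self.weight = [[0.0] * n_in for _ in range(n_out)]
        self.bias = [0.0] * n_out

    def parameters(self):
        yield self.weight
        yield self.bias

class DummyModule:
    def __init__(self, n_in, nh, n_out):
        self._modules = {}               # must exist before any layer is assigned
        self.l1 = FakeLinear(n_in, nh)   # each assignment goes through __setattr__
        self.l2 = FakeLinear(nh, n_out)

    def __setattr__(self, k, v):
        # Register public attributes; names starting with "_" are skipped so that
        # setting self._modules itself never touches the registry.
        if not k.startswith("_"):
            self._modules[k] = v
        super().__setattr__(k, v)        # then do what object normally does

    def __repr__(self):
        return f"{self._modules}"

    def parameters(self):
        for layer in self._modules.values():
            yield from layer.parameters()

mdl = DummyModule(4, 3, 2)
print(list(mdl._modules))            # ['l1', 'l2']
print(len(list(mdl.parameters())))   # 4
```

Note that this simple version registers every public attribute, so it only works if everything you assign is a layer with a parameters method.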

But the good news is PyTorch also has something that does that, and it's called nn.Module. So we can do the exact same thing: rather than implementing that __setattr__ stuff ourselves, we can just inherit from nn.Module and it does it for us. And so now you know why you have to call super().__init__() first: because it has to set up its equivalent of this _modules dictionary.

After you've done that, in PyTorch it's exactly the same as what I just showed you. It now creates something which you can access through named_children, and you can see here, if I print out the name and the layer, there is the name and the layer.

So this is how PyTorch does the exact same thing. Just like I created a __repr__, PyTorch also has a __repr__, so if you just print out model, it prints it out like so. You can grab the attributes in the normal Pythonic way. It's just a normal Python class that has a bit of this extra behavior.

So now we can run it with this refactoring and make sure everything works. And there we go. So this is doing exactly the same thing as before, but a little bit more conveniently. Not convenient enough for my liking, though, so one thing we could try to do is get rid of the need to write every layer separately, and maybe go back to having it as a list again.

So if we made it a list of layers, we'd want to be able to pass that to some model class: pass in the layers. This is not enough to make them available as parameters, because the only things PyTorch is going to make available as parameters are things that it knows are proper nn.Modules. But here's the cool thing: you can just go through all of those layers and call self.add_module. That's just the equivalent of what I did here when I said self._modules blah blah. So in PyTorch you can just call self.add_module, and just like I did, you give it a name and then you pass in the layer. If you do that, then you end up with the same thing.

OK, so that's one thing you could do, but this is kind of clunky, so it'd be nice if PyTorch would do it for you. And it does: that's what nn.ModuleList does. If you use an nn.ModuleList, then it basically just calls that line of code for you. You can see us doing it here: we're going to create something called SequentialModel, which just sets self.layers to that module list, and then when we call it, it just goes through each layer, x equals that layer of x, and returns it.
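
The calling logic of such a model can be sketched in plain Python as below. This is only an illustration: the layers here are ordinary functions, and the parameter registration that nn.ModuleList provides is left out.

```python
class SequentialModel:
    def __init__(self, layers):
        self.layers = list(layers)   # in PyTorch this would be a registered module list

    def __call__(self, x):
        # Feed the output of each layer into the next one.
        for layer in self.layers:
            x = layer(x)
        return x

def add_one(x):
    return x + 1

def double(x):
    return x * 2

model = SequentialModel([add_one, double])
print(model(3))  # (3 + 1) * 2 = 8
```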

And there's that. OK, even this is a little bit on the clunky side. Why would we have to write it ourselves? We don't: PyTorch has that code already. It's called nn.Sequential. So we've now recreated nn.Sequential, and there it is doing the same thing.

So again, we're not creating dumbed-down versions. If you look at nn.Sequential and look at the source code and look at forward, it's even the same name: go through each module in self._modules.values(), input equals module of input, return input.

So that's their version, and remember our version was basically the same. We could even have put it in something called _modules. So that's all nn.Sequential is doing for you. OK, so we're making some progress. It's less ugly than it used to be, but still more ugly than we would like.

This is where we got our fit function up to. So let's try and simplify it a bit more. Let's replace all this torch.no_grad, for p in model.parameters(), blah blah blah, with something where we can just write those two lines of code. That would be nice.

So let's create a class called Optimizer. We're going to pass in some parameters and store them away, and we're going to pass in the learning rate and store that away. And so if we're going to be able to go opt.step(), then opt.step() has to do this. So here is step: with torch.no_grad, go through each parameter and do that. So we've just factored that out. And zero_grad: we probably shouldn't actually go model.zero_grad, because it's actually possible for the user, as you know, to say I don't want to include certain parameters in my optimizer, like when we're doing gradual unfreezing and stuff. So really zero_grad should actually do this: it should go through the list of parameters that you asked to optimize and zero those gradients.

So here we've now created something called Optimizer, and we can now grab our model. Remember the model now has something called .parameters(), so we can pass that to our optimizer, and then we can just go opt.step(), opt.zero_grad(). Let's test it, and it works.
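
Here is a minimal pure-Python sketch of that optimizer. Param is a hypothetical scalar stand-in for a tensor with a .grad, so there is no torch.no_grad here; the point is only that step and zero_grad both touch just the parameters that were passed in.

```python
class Param:
    # Hypothetical scalar parameter: a value plus an accumulated gradient.
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

class Optimizer:
    def __init__(self, params, lr=0.5):
        self.params = list(params)   # only these get updated and zeroed
        self.lr = lr

    def step(self):
        for p in self.params:
            p.value -= p.grad * self.lr

    def zero_grad(self):
        for p in self.params:
            p.grad = 0.0

w = Param(1.0)
frozen = Param(5.0)               # deliberately not given to the optimizer
opt = Optimizer([w], lr=0.5)
w.grad, frozen.grad = 2.0, 2.0    # pretend a backward pass filled these in
opt.step()
opt.zero_grad()
print(w.value, w.grad)            # 0.0 0.0
print(frozen.value, frozen.grad)  # 5.0 2.0
```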

OK, now of course, luckily for us, PyTorch already has these lines of code. It's called optim.SGD. Now optim.SGD does do a few more things: weight decay, momentum, stuff like that. So let's have a look. Here's optim.SGD and here's its step function.

It's got weight decay, momentum, dampening, Nesterov. We're going to see all these things very shortly, but basically all it does is go through each layer group and do the exact thing that we just saw. We're going to be implementing this in a much better way than PyTorch very soon, but once you remove the momentum and all that, their optim.SGD is exactly the same as our Optimizer.

So let's go ahead and use that instead. And it's kind of nice then, if we're going to use all the parameters of the model, to just create a get_model which creates a model and returns it, as well as an SGD optimizer with all the parameters.

And okay there's a training loop. And seems to be working. It's nice to put tests in from time to time and I like to put these tests in like hey my accuracy should be significantly better than 50%. You know note that these kind of stochastic tests are highly imperfect in many ways it's theoretically possible it could fail because you got really unlucky.

I know though that this is really vanishingly unlikely to happen it's always much more than 90%. It's also possible that your code could be failing in a way that causes the accuracy to be. A bit lower than it should be but not this low. I still think it's a great idea to have these kinds of tests when you're doing machine learning because they give you a hint when something's going wrong and you'll notice I don't set a random seed at any point this is very intentional.

I really like it that if there's variation going on when I run my model at different times, I want to see it. I don't want it to be hidden away behind a fixed seed. There's a big push in science for reproducible science, which is great for many reasons, but it's not how you should develop your models. When you're developing your models, you want to have a good intuitive sense of what bits are stable and what bits are unstable, and how much variation to expect. And so if you have a test which fails one out of every 100 times, it's good to know that. In the fastai code there are lots of tests like that.

And so then sometimes there'll be a test that fails, and it's nothing particularly to do with the push that just happened, but it's really helpful for us, because then we can look at it and be like, oh, this thing we thought should pretty much always be true sometimes isn't true. Then we'll go back and deeply study why that is, and figure out how to make it more stable and how to make it reliably pass that test. So this is kind of a controversial kind of test to have, but it's something that I've found in practice is very, very helpful.

It's not complete and it's not totally automated and it's imperfect in many ways but it's nonetheless helpful. OK let's get rid of these 2 lines of code these were the lines of code that grabbed X mini batch from the training set and the Y mini batch from the training set.

Let's do them both in one line of code. So it'd be nice to have some kind of object where we can pass in the indexes we want and get back both x and y. And that's called a dataset, as you know. So here's our Dataset class. Again, it's not inheriting from anything, it's all from scratch in plain Python. We initialize it by passing in the x and the y, and we'll store them away.

It's very handy to have a length. Hopefully you know by now, and if you don't then now's a good time to realize, that __len__ is the thing that lets you go len of something in Python and have it work. That's what len will call. So now we've got the length of our dataset.

And __getitem__ is the thing that, when you index into it, will return that. So we just return the tuple of x[i] and y[i]. So let's go ahead and create a dataset for our training set and our validation set, and check that the lengths are right.
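
A from-scratch dataset along these lines can be sketched as follows. The tiny xs and ys lists are made-up data standing in for the training tensors.

```python
class Dataset:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __len__(self):
        # Called by len(ds).
        return len(self.x)

    def __getitem__(self, i):
        # Called by ds[i]; i may be an int or a slice.
        return self.x[i], self.y[i]

xs = [[0, 1], [2, 3], [4, 5], [6, 7]]
ys = [0, 1, 0, 1]
ds = Dataset(xs, ys)
print(len(ds))   # 4
xb, yb = ds[0:2]
print(xb, yb)    # [[0, 1], [2, 3]] [0, 1]
```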

Check the first few. So that's a dataset. The next thing we're going to do is create a data loader. This is what the start of our training loop looked like before, and let's replace it with this single line of code. To do that, we're going to have to have a class that takes a dataset and a batch size and stores them away. And when you go for blah in blah, behind the scenes in Python it calls __iter__.

And so what we're going to do is loop through a range from zero up to the size of the dataset, jumping up batch size at a time, so 0, 64, 128, etc., up to 50,000. Each time we go through, we will yield our dataset at an index starting at i and ending at i plus the batch size.

Probably quite a lot of you haven't seen yield before. It's an incredibly useful concept. If you're really interested, it's something called a coroutine. It's basically this really interesting idea that you can have a function that doesn't return just one thing once, but can return lots of things, and you can ask for it lots of times. The way these iterators work in Python is that when you call this, it basically returns something which you can then call next on lots of times, and each time you call next it will return the next thing that is yielded. I don't have time to explain coroutines in detail here, but it's really worth looking up and learning about. We'll be using them lots.

They're a super valuable thing, and not just for data science. They're really handy for things like network programming, web apps, stuff like that as well. So it's well worth being familiar with yield in Python, and nowadays most programming languages have something like this, so you'll be able to take it with you wherever you go.

Take it to wherever you go. So now we have a data loader we can create a training one and a validation one and we can this is how we do it. It's a valid deal is the thing that basically kind of generates our co routine for us and then next is the thing that grabs the next thing yielded out of that car routine so this is a very common thing you'll be doing lots next in a blah you probably did it a whole lot of times.

In part one because we kind of did it without diving in very deeply into what's going on and that returns. One thing from our data set in the data set returns two things because that's what we put in it so we expect to get two things back and we can check that those two things are the right size.
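
Putting the dataset, the yield-based __iter__, and next(iter(...)) together, a plain-Python sketch might look like this. The data is made up and the class bodies are my reconstruction from the description above, not the notebook's exact code.

```python
class Dataset:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __len__(self):
        return len(self.x)
    def __getitem__(self, i):
        return self.x[i], self.y[i]

class DataLoader:
    def __init__(self, ds, bs):
        self.ds, self.bs = ds, bs

    def __iter__(self):
        # Each pass through the loop hands back one mini-batch, lazily.
        for i in range(0, len(self.ds), self.bs):
            yield self.ds[i:i + self.bs]

dl = DataLoader(Dataset(list(range(10)), list(range(10, 20))), bs=4)
xb, yb = next(iter(dl))          # grab just the first mini-batch
print(xb, yb)                    # [0, 1, 2, 3] [10, 11, 12, 13]
print([len(x) for x, _ in dl])   # [4, 4, 2]
```

Because __iter__ starts a fresh generator each time, the same data loader can be looped over once per epoch.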

So that's our data loader, and let's double check. There it is, good stuff. So now there's our fit function. Let's call the fit function. Looking good. So this is about as neat as we're going to get. That's quite beautiful, right? It's all the steps you can think of if you said it in English: go through each epoch; go through each batch, grabbing the independent and dependent variables; calculate the predictions; calculate the loss; calculate the gradients; update with the learning rate; reset the gradients.

So that's kind of where you want to get to: a point where you can read your code in a very intuitive way to a domain expert. Until you get to that point, I find it's very hard to really maintain the code and understand the code.

And this is the trick for doing research as well. It's not just for hardcore software engineering. A researcher that can't do those things to their code can't do research properly, right? Because if you think of something you want to try, you don't know how to do it, or it takes weeks, or if there are bugs in it you don't know. So you want your code to be quite beautiful.

I think this is beautiful code. And at this point, this Dataset and this DataLoader are the same abstractions that PyTorch uses. So let's dig into this a little bit more. We do have a problem, which is that we're always looping through our training set in order.

And that's very problematic, because we lose the randomness of shuffling it each time. Particularly if our training set was already ordered by dependent variable, then every batch is going to be exactly the same dependent variable. So we really want to shuffle it. So let's try random sampling. For random sampling I'm going to create a Sampler class, and we're going to pass into it a dataset to sample, a batch size, and something that says whether to shuffle or not. As per usual, we just store those away. I don't actually store away the dataset, I just store away the length of the dataset, so that we know how many items to sample.

And then here's that __iter__, right? Remember, this is the thing that we can call next on lots of times. If we are shuffling, then let's grab a random permutation of the numbers from 0 to n minus 1. And if we're not shuffling, then let's grab all of the integers in order from 0 to n minus 1.

And then this is the same as we had before: go through that range and yield the indexes. So what does that look like? Here's a sampler with shuffle equals False and a batch size of 3: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. And here it is with shuffle equals True: 5, 4, 3, 7, 6, 2, 8, 9, 0, 1.
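
A sampler like the one just described can be sketched in plain Python. Here random.shuffle stands in for a random permutation of tensor indexes, which is my substitution so the example needs no PyTorch.

```python
import random

class Sampler:
    def __init__(self, ds, bs, shuffle=False):
        # Only the length is kept, not the dataset itself.
        self.n, self.bs, self.shuffle = len(ds), bs, shuffle

    def __iter__(self):
        idxs = list(range(self.n))
        if self.shuffle:
            random.shuffle(idxs)   # new random order every time we iterate
        for i in range(0, self.n, self.bs):
            yield idxs[i:i + self.bs]

small = list(range(10))
print(list(Sampler(small, 3, shuffle=False)))  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
print(list(Sampler(small, 3, shuffle=True)))   # same indexes, random order
```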

So now that we've got these, we can replace our data loader with something where we pass it a sampler, and we then loop through for s in sampler. That's going to loop through each of these, and the cool thing is that because we used yield, these are only going to be calculated when we ask for them. They're not all calculated up front, so we can use these on really big datasets, no problem.

This is a common thing, where you're looping through something which is itself a coroutine and then yielding something which does other things to that. It's a really nice way of doing streaming computations: it's done lazily and you're not going to run out of memory. So then we're going to grab all of the indexes in that sample, and grab the dataset at each index, so now we've got a list of tensors.

Then we need some way to collect them all together into a single pair of tensors, so we've created a function called collate, which just grabs the x's and the y's and stacks them up. torch.stack just grabs a bunch of tensors and glues them together on a new axis. You might want to do different things, like add some padding, so you can pass in a different collate function if you want to, and it'll store it away and use it.

So now we can create our two samplers, and create our two data loaders with those samplers. The training one is shuffling, the validation one is not. So let's check the validation data loader and the training data loader. If we call it twice with exactly the same index, we get different things. In this case we got two 8s, but they're different 8s.
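
Here is a pure-Python sketch of a data loader driven by a sampler and a collate function. The collate below regroups pairs into two lists rather than calling torch.stack, which is my simplification so the example runs without PyTorch; the data is made up.

```python
import random

class Dataset:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __len__(self):
        return len(self.x)
    def __getitem__(self, i):
        return self.x[i], self.y[i]

class Sampler:
    def __init__(self, ds, bs, shuffle=False):
        self.n, self.bs, self.shuffle = len(ds), bs, shuffle
    def __iter__(self):
        idxs = list(range(self.n))
        if self.shuffle:
            random.shuffle(idxs)
        for i in range(0, self.n, self.bs):
            yield idxs[i:i + self.bs]

def collate(batch):
    # batch is a list of (x, y) pairs; regroup into (all xs, all ys).
    xs, ys = zip(*batch)
    return list(xs), list(ys)

class DataLoader:
    def __init__(self, ds, sampler, collate_fn=collate):
        self.ds, self.sampler, self.collate_fn = ds, sampler, collate_fn

    def __iter__(self):
        # Lazily turn each list of indexes into one collated mini-batch.
        for idxs in self.sampler:
            yield self.collate_fn([self.ds[i] for i in idxs])

ds = Dataset(list(range(6)), [v * 10 for v in range(6)])
train_dl = DataLoader(ds, Sampler(ds, 4, shuffle=True))
for xb, yb in train_dl:
    print(xb, yb)   # x and y stay aligned even though the order is random
```

Passing a different collate_fn (for example one that pads sequences) changes how items are combined without touching the sampler or the loop.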

Call it another two times and we're getting different numbers, so it is shuffling as we hoped. And so again we can train our model, and that's fine. So the PyTorch DataLoader does exactly that. Let's import the PyTorch DataLoader, and you can see it takes exactly the same arguments.

And we can even pass in the exact collate function that we just wrote. It doesn't have a single sampler that you pass shuffle equals True or False to. It has two separate samplers, one called RandomSampler and one called SequentialSampler. So it's slightly different to the API we just wrote, but it does exactly the same thing.

And so you can create those data loaders, and it works exactly the same. So that's what a PyTorch DataLoader does. Most of the time you don't need the flexibility of writing your own sampler and your own collation function, so you can just pass in shuffle and it will use the default sampler and collation function, which work the way we just showed.

Something that we did not implement from PyTorch's DataLoader is that you can pass in an extra parameter called num_workers, and that will fire off that many processes. Each one will separately grab stuff out of your dataset, and then it'll collect them together afterwards. So if your dataset's doing things like opening big JPEG files and doing all kinds of image transformations, that's a really good idea. We won't implement that.

So finally, for this section, we should add validation. To know if we're overfitting, we need to have a separate validation set. So here's the same loop that we had before.

And here's the same loop pretty much again, but with torch.no_grad, going through the validation set. For this we grab the predictions and the loss as before, but we don't call backward and we don't step the optimizer, because it's just validation. Instead we just keep track of the loss, and we also keep track of the accuracy.

The only other difference is that we've added here model.train() and here model.eval(). What does that do? Well, actually all it does is set an internal attribute called .training to True or False. So let's try it. If I put print(model.training) after each one and train this, I see True, False, True, False, True, False.

And so why does it set this thing called model.training to True or False? Because some kinds of layers need to have different behavior depending on whether it's training or evaluation. For example, batch norm only updates its running statistics if it's training, and dropout only does randomized dropout if it's training. They're the two main ones. So that's why you always want to call train and eval, and if you forget to put something into eval mode when you're done training, you'll often be surprised, because you'll be getting worse results than you expected.
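
To illustrate how a layer can branch on that flag, here is a hypothetical, minimal dropout written in plain Python. The class is my own sketch, and rescaling the kept values by 1 / (1 - p) is an assumption about the usual formulation, not something stated above.

```python
import random

class Dropout:
    # Hypothetical minimal layer whose behaviour depends on self.training.
    def __init__(self, p=0.5):
        self.p = p
        self.training = True

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

    def __call__(self, xs):
        if not self.training:
            return list(xs)   # evaluation: pass inputs through unchanged
        # training: zero each element with probability p and rescale the rest
        return [0.0 if random.random() < self.p else x / (1 - self.p) for x in xs]

layer = Dropout(0.5)
layer.eval()
print(layer.training, layer([1.0, 2.0, 3.0]))   # False [1.0, 2.0, 3.0]
layer.train()
print(layer.training, layer([1.0, 2.0, 3.0]))   # True, with some entries zeroed at random
```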

OK, so that's our fit loop. One thing to note: are these validation results correct if the batch size varies? What we're doing here is adding up the loss and adding up the accuracy, and then at the end we see how big our data loader is, that is, how many batches there are, and we divide.

But if you think about it, if you had one mini-batch of size 1000 and one mini-batch of size 1, you can't actually just do that. You need a weighted average, weighted by the size of the mini-batch. So this incorrect way is how nearly every library does it. fastai does it the proper way, and next time we do this we're going to do it the proper way.
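
A quick numeric sketch shows how far apart the two averages can be. The function names and the per-batch accuracies here are made up for illustration.

```python
def naive_mean(batch_means):
    # Divide by the number of batches, ignoring how big each batch was.
    return sum(batch_means) / len(batch_means)

def weighted_mean(batch_means, batch_sizes):
    # Weight each batch's mean by the number of items it contains.
    total = sum(m * n for m, n in zip(batch_means, batch_sizes))
    return total / sum(batch_sizes)

batch_means = [0.9, 0.0]    # per-batch accuracy
batch_sizes = [1000, 1]     # one big batch, one batch with a single item

print(naive_mean(batch_means))                  # 0.45
print(weighted_mean(batch_means, batch_sizes))  # close to 0.9, as it should be
```

When every batch has the same size the two functions agree, which is why the simple version usually goes unnoticed.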

Here's what most people do, and it does not work correctly when your batch size varies. So it's handy to have something that we can pass a training dataset, a validation dataset, and a batch size to, and just grab the data loaders. The training dataset will be shuffled, the validation one won't be. Also, for the validation dataset we don't need to do the backward pass, so we don't need to store the gradients, which means that we have twice as much room, so we can make the batch size twice as big. So it's another nice thing to refactor out: you don't have to type it anymore, and it also means that you won't accidentally make a mistake.

And so now we can go ahead and fit. Let's do five epochs. And so now these are actual validation accuracies.

OK, great, so we've successfully built a training loop. Let's have a six-minute break, come back at 7:55, and talk about callbacks. Before we continue, Rachel, any questions? So, why do we have to zero out our gradients in PyTorch? Why do you have to zero out your gradients in PyTorch. So, yeah.

Let's go back to our optimizer. Or let's go back even further: here's our first version. So this is with no additional help from PyTorch at all. If we didn't go grad.zero_() here, then what's going to happen the next time we go through and say loss.backward() is that it's going to add the new gradients to those existing gradients. Now why does that happen? Well, that happens because we often have lots of sources of gradients. There are lots of different modules all connected together, and so they're getting their gradients from lots of different places, and they all have to be added up. So when we call backward, we wouldn't want backward to zero the gradients, because then we would lose this ability to plug lots of things together and just have them work.

So that's why we need the grad.zero_() here. That's part one of the answer. Part two of the answer is.

Why did we write our optimizer so that there was one thing called step and one thing called zero_grad? Because what we could have done is remove these lines and push this up here, so that step could have done both. And then, since we've actually got this loop twice now, we could put it all inside one for loop.

So we could certainly have written our optimizer like this: it goes through each parameter, does the update, and sets the gradient to zero, and then we would be able to remove this line. The problem with that is that we've then removed the ability to not zero the gradients here, and that means any time we don't want to zero the gradients, we can't use the optimizer.

So for example, what if you are working with some pretty big objects? Say you're doing super resolution and you're trying to create a 2K output, and you can only fit two images on the GPU at a time. The stability of the gradients that you get from a batch size of two is so poor that you need to use a larger batch size.

Well, that would be really easy to do if you did it like this, because we could say: if i % 2, then do the step and the zero_grad. This is now going to only run these things every two iterations, and so that means that our effective batch size is now double. So that's handy, right? That's called gradient accumulation. Gradient accumulation is where you change your training loop so that your optimizer step and your zero_grad only happen occasionally.
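
Here is a toy, pure-Python sketch of gradient accumulation. Param, Optimizer and the backward function are hypothetical stand-ins (a scalar weight with a hand-written gradient for squared error), and I use (i + 1) % accum == 0 as the condition, which is my variant of the i % 2 test mentioned above.

```python
class Param:
    def __init__(self, value):
        self.value, self.grad = value, 0.0

class Optimizer:
    def __init__(self, params, lr):
        self.params, self.lr = list(params), lr
    def step(self):
        for p in self.params:
            p.value -= p.grad * self.lr
    def zero_grad(self):
        for p in self.params:
            p.grad = 0.0

def backward(w, xb, yb):
    # Stand-in for loss.backward(): ADDS d/dw of sum((w*x - y)**2) to w.grad.
    for x, y in zip(xb, yb):
        w.grad += 2 * (w.value * x - y) * x

# Four mini-batches of size 1, all drawn from y = 2x.
data = [([1.0], [2.0]), ([2.0], [4.0]), ([3.0], [6.0]), ([4.0], [8.0])]
w = Param(0.0)
opt = Optimizer([w], lr=0.01)
accum = 2          # step every 2 mini-batches, so the effective batch size is 2
steps = 0
for i, (xb, yb) in enumerate(data):
    backward(w, xb, yb)            # gradients keep adding up between steps
    if (i + 1) % accum == 0:
        opt.step()
        opt.zero_grad()
        steps += 1
print(steps)     # 2
print(w.value)   # has moved from 0.0 towards 2.0
```

This only works because zero_grad is a separate call: if step always zeroed the gradients, there would be nothing to accumulate.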

That's really the reason: there might be times you don't want to zero the gradients every time you do a step, and if there's no way to do that, that's a problem. That said, you could argue, and I can't think of a reason that this isn't a good idea, that we could give our optimizer something like auto_zero equals True, say, and then have something in here which says: if self.auto_zero, then self.zero_grad(). That could even be the default, and then you wouldn't have to worry about it unless you explicitly wanted to do gradient accumulation. I think that would maybe be a better API design, but that's not what they've done. It's so easy to write your own optimizers that you could totally do that. But the upside is removing a single line of code, which isn't a huge upside anyway.

Any other questions, Rachel? OK. So that's our training loop.

But it's not quite where we wanted to be and I'm stealing some slides here from silver who had a really cool talk recently called an infinitely customizable training loop so I'll steal her slides. Before I do I would like to do a big thank you to silver. He has been working full time with fast day I for.

Well over a year now I guess and a huge amount of what you see in the fast day I library and research and courses is is him so massive thank you to silver who's the most awesome person I've worked within my whole life so. That's pretty cool but also thank you to lots of other people huge thanks to stairs who a lot of you all have come across in the forum and he's he's done a lot of the stuff that makes faster I work.

well, and he's entirely a volunteer, so I'm super grateful to him. You know, the stuff that lets you check that your installation works properly, that lets you quickly check whether your performance is what it should be, and also organizing lots of helpful projects through the forums. He's been fantastic. Lots of other folks as well.

Andrew Shaw wrote a lot of the original documentation stuff that we have. Fred Monroe has been helpful in thousands of ways and is just incredibly generous. Jason, who a lot of you will already be aware of, helped a lot with the final lesson of the last course and is hard at work now on taking it even further, doing some stuff that's going to blow you away.

I particularly want to point out Radek, because this is the list of the (I can't quite count) 20 or so most helpful people on the forum, as ranked by number of likes. When somebody clicks that like button, that means they're saying "you've helped me", and more people have said that about Radek than anybody else. It's not surprising, because Radek is not just an incredibly helpful person but extremely thoughtful.

When he started as a fast.ai student, he considered himself, if I remember correctly, basically a failed ML student. He had tried a number of times to learn ML and hadn't succeeded. But he's just applied himself so well for the last couple of years, and he's now a Kaggle winner,

a world-recognized deep learning practitioner. So thank you to all of these people and everybody else who's contributed in so many ways, and of course Rachel, who's sitting right next to me. So this is the fit function that we just wrote, or rather the slightly more elegant one from before we added validation to it.

So: go through each epoch, go through each mini-batch, get the prediction, get the loss, do the backward pass, update your parameters, and then zero the gradients. That's basically what we're doing: model, predictions, loss, gradients, step. And each time we grab a bit more training data.

But that's not really all we want to do in a training loop. We might want to add the beautiful fastprogress progress bars and animations that Sylvain created in his fastprogress library, or TensorBoard, or whatever. And thanks to Jason, we now actually have TensorBoard integration in fastai, so be sure to check that out if you want extra pretty graphs like these.

Hyperparameter scheduling. You might want to add all kinds of different regularization techniques; these are all examples of regularization techniques that we support in fastai, and many more. Mixed precision training, to take advantage of the tensor cores in a Volta GPU to train much faster. There are more tweaks you might want to make to the training loop than we could possibly think of, and even if we did think of all of the ones that exist now, somebody will come up with a new one tomorrow.

So you've got some possible ways you could solve this problem. And some of the things we're talking about are even things like how you add GANs, more complex stuff. So one approach is: write a training loop for every possible way you want to train. This is particularly problematic when you start to want to combine multiple different tweaks, right, as you're cutting and pasting or whatever.

So that's certainly not going to work for fastai. Then there's what I tried for fastai 0.7. This is my training loop from the last time I tried this, which is: throw in every damn thing. And it just got... So every time somebody would say "a new paper's come out, can you please implement it?", I'd just be like: no, I couldn't bear it.

So now we have something better: callbacks. Every library has callbacks, but nobody else has callbacks anything like our callbacks, and you'll see what I mean. Our callbacks let you not only look at but fully customize every one of these steps.

Here's our starting training loop. Here's the fastai version 1 training loop. It's the same, right? There are the exact same lines of code, plus a bunch of calls to callbacks. Each one basically says: before I do a step, on_step_begin; after I do a step, on_step_end; after I do a batch, on_batch_end; after I do an epoch, on_epoch_end; and after I finish training, on_train_end.

And they also have the ability to change things, or even to say "please skip the next step" by returning a boolean. So with this we can create, and have created, all kinds of things in fastai, like learning rate schedulers, and early stopping, and a parallel trainer. When I wrote the parallel trainer, this is literally the entire callback I wrote.

This is the entire gradient clipping callback: after you do the backward pass, clip the gradients. So you can do a lot with a little. And then you can mix them all together, because all of the callbacks work with all of the other callbacks. So these are some of the callbacks that we have in fastai

right now. So, for example, how did we do GANs last course? What we did behind the scenes was create a GAN module that was ridiculously simple. It had a forward method that just said: are you in generator mode or not? (Not means discriminator mode.) If you're in generator mode, call the generator; otherwise call the critic.

And then there's a function called switch that just changes generator mode backwards and forwards between generator and discriminator. Same thing for the loss: we created a loss function where there was a generator loss and a critic loss. And so then we created a callback, which had a switch that just switched the generator mode on and off

and passed that along to the model I just showed you and the loss function I just showed you. And then it would set requires_grad on the generator or discriminator as appropriate. And then it would have on_train_begin, on_train_end and similar callbacks to do the right thing at the right time. So, most importantly, at the start of an epoch,

set your generator mode, and at the end of training, set your generator mode. If you look at other libraries' implementations of GANs, they're basically a whole new training loop, whole new data loaders, whole new everything. It was really cool that in fastai we were able to create a GAN

in this incredibly small amount of code for such a complex task. So let's do that ourselves, right? Because we've got a training loop; if we add callbacks, we should then be able to do everything. So let's start out by grabbing our data as before. We've got the number of hidden units is 50, batch size is 64, and the loss function is cross entropy.

This is the signature of our fit function from before, and I get very nervous when I see functions with lots of things being passed to them. It makes me think: do we really need to pass all those things, or can some of them be packaged up together?

There are a lot of benefits to packaging things up together when they're alike things. You can pass them around together to everything that needs them. You can create them using factory methods that create them together. And you can do smart things like look at the combination of them and make smart decisions for your users,

rather than having them set everything themselves. So there are lots of reasons that I would prefer to keep epochs, but put all these other things into a single object. And specifically, we can do that in two steps. First of all, let's take this data and say that training and validation data conceptually should be one thing. It's my data; maybe there's test data there as well.

So let's create a class called DataBunch that we're going to pass training data and validation data into, and we'll store them away. And that's the entirety of it; there's no logic here. But for convenience, let's make it easy to grab the datasets out of them as well. Remember, you can now use either the

handmade data loader that we built in the last one, or the PyTorch DataLoader. They're both providing exactly the same API at this point, except for the num_workers issue. So remember that we passed these data loaders a dataset, which you can access. And then it would be nice if we could create a get_model

function, which could create our model but automatically set the last layer to have the correct number of activations, because the data knows how many activations it needs. So let's also optionally make it so that you can pass in c, which is going to get stored away, so that when we create our data we can pass in c, which, remember, we set to our

maximum y value. And that way we never have to think about that again. So that's our DataBunch class. And there's our get_model. It's just going to create a model where the number of inputs is the size of the input data, the number of hidden units is whatever we pass in, then a ReLU, and then a linear layer from hidden to data.c.

And it returns the model and an optimizer. And we know all about .parameters() now. So then there's the rest of the stuff: model, loss_func, opt and data. Let's store them in something: model, opt, loss_func, data. I want to store them away, and that thing we'll call a Learner.

So notice the Learner class has no logic at all. It's just a storage device for these four things. So now we can create a learner, passing in the model and the optimizer. Since they're returned in this order from get_model, we can just say *get_model(...). That's going to pass in the model and the optimizer. And we've got our loss function already; at the top here we set it to cross entropy.

And we've got our data, because it's that DataBunch we just created. So there's nothing magic going on with DataBunches and Learners; they're just wrappers for the information that we need. So now we'll take the fit function we had before. I just pasted it here, but every time I had model I replaced it with learn.model, every time I had data I replaced it with learn.data, and so forth.
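As a rough sketch of the two containers just described, here is a version that runs without PyTorch. DataBunch and Learner follow the description above; get_model_stub and the SimpleNamespace loaders are hypothetical stand-ins (the lecture's get_model builds a real model and optimizer, which are replaced here by plain dicts).

```python
from types import SimpleNamespace

class DataBunch:
    """Holds the training and validation loaders, plus c (number of outputs)."""
    def __init__(self, train_dl, valid_dl, c=None):
        self.train_dl, self.valid_dl, self.c = train_dl, valid_dl, c

    @property
    def train_ds(self):
        return self.train_dl.dataset

    @property
    def valid_ds(self):
        return self.valid_dl.dataset


class Learner:
    """No logic at all: just stores model, optimizer, loss function and data."""
    def __init__(self, model, opt, loss_func, data):
        self.model, self.opt = model, opt
        self.loss_func, self.data = loss_func, data


def get_model_stub(data, lr=0.5, nh=50):
    # Stand-in for get_model: a real version would build layers and an optimizer.
    # The output size comes from the data, so callers never pass it by hand.
    model = {"n_in": len(data.train_ds[0]), "n_hidden": nh, "n_out": data.c}
    opt = {"lr": lr}
    return model, opt


# Fake loaders: anything with a .dataset attribute works for this sketch.
train_dl = SimpleNamespace(dataset=[[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
valid_dl = SimpleNamespace(dataset=[[0.5, 0.5, 0.5, 0.5]])
data = DataBunch(train_dl, valid_dl, c=10)

# The star unpacks (model, opt) into the first two Learner arguments.
learn = Learner(*get_model_stub(data), loss_func="cross_entropy", data=data)
```

Because the stub returns the pair in the same order that Learner expects, the star-unpacking call mirrors the one-liner described in the lecture.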

Okay, so there's the exact same thing that we had before, still working fine. And so now let's add callbacks. Our fit function before basically said: for epoch in range(epochs), for batch in train_dl, and then it had these contents: predictions, loss, backward, step, zero grad. I factored out the contents into something called one_batch.

Okay, and then I added all these callbacks: cb.after_backward, cb.after_step. I did one other refactoring, which is that the training loop has to loop through every batch and the validation loop has to loop through every batch, so I just created something called all_batches.

Okay, so this is my fit loop: begin_fit; for epoch in epochs, begin_epoch; all_batches with the training set; begin_validate; no_grad; all_batches with the validation set; after_epoch; after_fit. Okay, so that's that. So here's a callback, which has all the stuff.

And so then we need a callback handler, and that's going to be something where you just say: here are all my callbacks. Basically, for each event it's just going to go through every callback and call it, and keep track of whether we've received a False yet or not (False means don't keep going any more), and then return it. We do that for begin_fit, after_fit, begin_epoch, begin_validate, after_epoch, begin_batch, after_loss, after_backward, after_step.

So here's an example of a little callback we could create. At the start of the fit, it'll set the number of iterations to zero. Then after every step it'll say number of iterations plus-equals one, and print that out. And if we get past 10 iterations, then it'll tell the learner to stop,

because we have this little thing called do_stop that gets checked at the end. So let's test it. We called fit and it only did 10 batches. This is actually a really handy callback, because quite often you want to just run a few batches to make sure things seem to be working; you don't want to run a whole epoch. So here's a quick way you can do something like that.

This is basically what fastai v1 looks like right now. It does have a little bit of extra stuff that lets you pass back a different loss and different data, but it's nearly exactly the same. But I really like rewriting stuff, because when I rewrite stuff it lets me look and see what I've written. And when I looked back at this,

I saw that there's this object, cb, the callback handler, that's being passed everywhere. And that's a code smell. That code smell says something should have that state. Specifically, these three functions should be the methods of something that has this state. So after I wrote this part of the lesson, I suddenly realized: oh, fastai is doing it the dumb way, so let's fix it. This is likely to appear in a future version of fastai. I created a new thing called Runner.

Runner is a new class that contains the three things I just mentioned: one_batch, all_batches and fit. So here's fit. It's incredibly simple. We're going to keep track of how many epochs we're doing, and we're going to keep track of the learner that we're running. And remember, the learner has no logic in it; it just stores four things.

And then we tell each of our callbacks what runner they're currently working with, and then we call begin_fit. Then we go through each epoch: set the epoch, call begin_epoch, call all_batches, and then, with no_grad, call begin_validate and then all_batches, and then call after_epoch. And at the end we call after_fit.

That's it. Now, this call to self with a string might look a bit weird. But look at what we had before. Again, a horrible code smell is lots of duplicate code: res = True, for callback, blah blah blah, begin_epoch; res = True, for callback, blah blah blah, begin_validate. So that's bad, right? Code duplication means cognitive overhead to understand what's going on, lots of opportunities to accidentally have an "or" instead of an "and", and lots of places you have to change

if you need to edit something. So basically I took that out and factored it into dunder call (__call__). We've seen __call__ before; it's the thing that lets you treat an object as if it were a function. I could have called this lots of things. I could have called it self.run_callback or whatever. But it's the thing that happens absolutely everywhere, and my rule of thumb is: if you do something lots of times,

make it small. And __call__ is the smallest possible way you can call something; you don't have to give it a name at all when you call it. So we say: call the callback called after_epoch. It also makes sense, right? We're calling a callback, so why not use __call__ to call a callback? So for after_epoch,

I go through all of my callbacks (we'll talk about this sorted in a moment). The other thing I didn't like before is that all of my callbacks had to inherit from this Callback superclass, because if they didn't, they would have been missing one of these methods, and then when it tried to call the method there would have been an exception. I don't like forcing people to inherit from something; they should be able to do whatever they like.

So what we did here was use getattr, which is the Python thing that says: look inside this object and try to find something of this name, e.g. begin_validate, and default to None if you can't find it. So it tries to find that callback, and it will be None if the callback doesn't exist. And if you find it,

then you can call it. So this is a nice way to call any callback. And when you implement a callback, as you can see, look how much easier our TestCallback is now. It's just super simple: just implement what you need. Here we inherit from a new Callback class.
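The dispatch just described can be sketched as follows. This is a stripped-down illustration, not the lecture's Runner: MiniRunner is a hypothetical name, the forward pass, loss, backward pass and optimizer step are deliberately omitted, and only the callback plumbing (the call method, getattr with a None default, sorting by _order, and "True means stop") is shown.

```python
class Callback:
    _order = 0
    def set_runner(self, run):
        self.run = run


class TestCallback(Callback):
    _order = 1
    def begin_fit(self):
        self.n_iters = 0  # returns None, which is falsy: keep going

    def after_step(self):
        self.n_iters += 1
        if self.n_iters >= 10:
            self.run.stop = True
            return True  # True means stop


class MiniRunner:
    def __init__(self, cbs):
        self.cbs, self.stop = list(cbs), False

    def __call__(self, cb_name):
        # Callbacks only implement the events they care about.
        for cb in sorted(self.cbs, key=lambda cb: cb._order):
            f = getattr(cb, cb_name, None)
            if f and f():
                return True
        return False

    def all_batches(self, dl, train):
        for batch in dl:
            if self.stop:
                break
            if self('begin_batch'):
                continue
            # Forward pass, loss, backward pass and step would happen here.
            if train:
                self('after_step')
            self('after_batch')
        self.stop = False

    def fit(self, epochs, train_dl, valid_dl):
        self.epochs = epochs
        for cb in self.cbs:
            cb.set_runner(self)
        if self('begin_fit'):
            return
        for epoch in range(epochs):
            self.epoch = epoch
            if not self('begin_epoch'):
                self.all_batches(train_dl, train=True)
            if not self('begin_validate'):
                self.all_batches(valid_dl, train=False)
            if self('after_epoch'):
                break
        self('after_fit')


test_cb = TestCallback()
run = MiniRunner([test_cb])
run.fit(1, train_dl=range(25), valid_dl=range(5))
```

Here the training pass stops after 10 of the 25 batches, and the callback never had to define begin_epoch, after_batch or any other event it does not use.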

But we don't have to any more. The main reason we do is that our Callback class now has an _order, which we can use to choose what order callbacks run in. We'll talk about that after we handle this question: what is the difference between hooks in PyTorch and callbacks in fastai?

We're going to do hooks very shortly. But if you think about it, if I want to add a callback after I calculate the forward pass of the second layer of my model, there's no way for me to do that, because the point at which I do the forward pass

looks like this: a single call to self.model. Or if I want to hook into the point at which I've just done the backward pass of my penultimate layer, I can't do that either, because the whole thing appears here as self.loss.backward(). So hooks, PyTorch hooks, are callbacks that you can add to specific PyTorch modules.

And we're going to see them very shortly. It might be next class; we'll see how we go. Okay. So very often you want to be able to inject behavior into something, but the different things can influence each other. Transformations are an example; we're going to be seeing this when we do data augmentation. So quite often you'll need things to run in a particular order.

So when I add this kind of injectable behavior, like callbacks, I like to just add something which says what order it should run in. Actually, sorry: when we look at transformations, that won't require an order, but this one does require an order.

Yeah, okay, so your callbacks need to be something that has an _order attribute, and this way we can make sure that some things run after other things. For example, you might have noticed that our Runner, in the fit function, never calls model.eval() and never calls model.train(). So it literally doesn't do anything, you know. It just

says "these are the steps I have to run", and the callbacks do the running. So I created a TrainEvalCallback that at the beginning of an epoch calls model.train() and at the beginning of validation calls model.eval(). I also added stuff to keep track of how many epochs it has done, and this is quite nice: it actually does it as a floating point number, not just as an int, so you could be, like, 2.3 epochs in.
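A minimal sketch of such a callback, driven by hand so that it runs without PyTorch. ToyModel and the SimpleNamespace runner are hypothetical stand-ins, and the attribute names n_epochs, n_iter, iters and in_train are assumptions made for this illustration.

```python
from types import SimpleNamespace

class ToyModel:
    """Stand-in for a model with train()/eval() mode switches."""
    def __init__(self):
        self.training = True
    def train(self):
        self.training = True
    def eval(self):
        self.training = False


class TrainEvalCallback:
    _order = 0
    def __init__(self, run):
        self.run = run

    def begin_fit(self):
        self.run.n_epochs, self.run.n_iter = 0.0, 0

    def begin_epoch(self):
        self.run.n_epochs = self.run.epoch
        self.run.model.train()
        self.run.in_train = True

    def after_batch(self):
        if not self.run.in_train:
            return
        # Advance by a fraction of an epoch, so progress is a float.
        self.run.n_epochs += 1.0 / self.run.iters
        self.run.n_iter += 1

    def begin_validate(self):
        self.run.model.eval()
        self.run.in_train = False


# Drive the callback by hand: third epoch (index 2), 10 batches per epoch.
run = SimpleNamespace(model=ToyModel(), iters=10, epoch=0, in_train=False)
cb = TrainEvalCallback(run)
cb.begin_fit()
run.epoch = 2
cb.begin_epoch()
for _ in range(3):
    cb.after_batch()
progress = run.n_epochs  # roughly 2.3 epochs in
cb.begin_validate()
```

Three batches into the third epoch, the float counter reads about 2.3, which is the kind of value a scheduler can use to know exactly how far through training it is.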

It also keeps track of how many iterations have been done. So now that we have this thing keeping track of iterations, the TestCallback that should stop training after 10 iterations, rather than keeping track of n_iter itself, should just use the n_iter that was defined in this callback. So what we can do is say: all right, well, TrainEvalCallback has an _order of 0, because it inherits that.

So what we can do here is make sure that this one is later: _order = 1. And that way we can now refer to stuff that's inside the TrainEvalCallback, like n_iter. Sorry, actually we don't even need to do that, because it's putting n_iter inside self.run.

So we can just go self.n_iter. If this ran before TrainEvalCallback, that would be a problem, because n_iter might not have been updated yet. So that's what the order is for. Another nice thing about Runner... sorry, a nice thing about the Callback class

is that I have defined dunder getattr (__getattr__), and I have defined it to say: return getattr(self.run, k). An important thing to know about __getattr__ is that it is only called by Python if it can't find the attribute that you've asked for.

So if something asks for self.name, well, I have self.name, so it's never going to get to here. If you get to here, it means Python looked for this attribute and couldn't find it. And very, very often the thing you actually want in the callback is actually inside the runner,

which we store away as self.run. So this means that in all of our callbacks (let's look at one) you can basically just use self. for pretty much everything and it will grab what you want, even though most of the stuff you want is inside the runner. So you'll see this

pattern in fastai a lot: when one object contains another object, or composes another object, we very often delegate getattr to the other object. So, for example, if you're looking at a dataset, then I think we delegate to x. If you're looking at stuff in the data block API, it will often delegate to stuff lower in the data block API, and so forth.
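The delegation pattern can be illustrated with a few lines of plain Python. ToyRunner and its attribute values are made up for the example; the point is only that the fallback method fires when normal lookup fails.

```python
class ToyRunner:
    """Hypothetical runner holding state that callbacks want to read."""
    def __init__(self):
        self.epoch, self.n_iter = 3, 120


class Callback:
    label = "mine"  # found by normal lookup, so never delegated

    def set_runner(self, run):
        self.run = run

    def __getattr__(self, k):
        # Python only calls this after normal attribute lookup has failed.
        return getattr(self.run, k)


cb = Callback()
cb.set_runner(ToyRunner())

own = cb.label        # found on the callback itself
delegated = cb.epoch  # not on the callback, so fetched from cb.run
```

One caveat with this sketch: if an attribute is requested before set_runner has been called, looking up self.run itself falls through to the same method and recurses, so the runner should be attached first.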

So I find this pretty handy. OK, so we have a Callback that, as you see, has very little to it. One interesting thing you might notice is that a callback has a name property. And the name property works like this: if you have a callback called TrainEvalCallback,

then we've got a function called camel2snake. This is called camel case: it means you've got uppercase and lowercase letters, like a camel's humps. And snake case looks like this, with underscores. So camel2snake turns the camel-case name into snake case. And then what we do here is remove "callback" from the end,

and that's its name. So TrainEvalCallback has a name which is just train_eval. And then in the runner, for any callback functions that you pass in, which it uses to create new callbacks, it actually assigns each one to an attribute with that name. So we now have something called

runner.train_eval, for example. We do this in the fastai library too: when you say learn.recorder, we didn't actually add an attribute called recorder to Learner; it just automatically sets that, because there's a LearnerCallback. So let's see how to use this. There's a question? Okay, let's do that in a moment.
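A sketch of how such a name property could work. The two regular expressions are one common way to convert camel case to snake case and are an assumption here, not necessarily what the library uses.

```python
import re

_camel_re1 = re.compile(r'(.)([A-Z][a-z]+)')
_camel_re2 = re.compile(r'([a-z0-9])([A-Z])')

def camel2snake(name):
    """Convert CamelCase to snake_case, e.g. 'TrainEval' to 'train_eval'."""
    s1 = _camel_re1.sub(r'\1_\2', name)
    return _camel_re2.sub(r'\1_\2', s1).lower()


class Callback:
    @property
    def name(self):
        # Strip a trailing 'Callback' from the class name, then snake-case it.
        base = re.sub(r'Callback$', '', self.__class__.__name__)
        return camel2snake(base or 'callback')


class TrainEvalCallback(Callback):
    pass

class AvgStatsCallback(Callback):
    pass


names = [TrainEvalCallback().name, AvgStatsCallback().name, Callback().name]
```

A runner can then call setattr(self, cb.name, cb) for each callback it creates, which is what makes an attribute like runner.train_eval appear.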

Let's use this to add metrics, because it's no fun having a training loop where we can't actually see how we're going. Part of the whole point of this is that our actual training loop is now so incredibly tight and neat and easy, but we still want to do all the stuff we want to do. So what if we create a little callback called AvgStatsCallback,

where we're going to stick into it a couple of objects to keep track of our loss and metrics, one for training and one for validation. At the start of an epoch we'll reset the statistics, at the end of an epoch we'll print out the statistics, and after we've got the loss calculated

we will accumulate the statistics. So then all we need is an object that has an accumulate method. So let's create a class that does that, and here's our accumulate method. It's going to add up the total loss, and for each metric it'll add up the total metrics. And then we'll give it a property called avg_stats that will go through all of those losses and metrics and return

the average. And you might notice here I've fixed the problem of having different batch sizes in the average: we're actually adding loss times the size of the batch, count plus the size of the batch, and metrics times the size of the batch, and then we're dividing here by the total

batch size. So this is going to keep track of our stats. We'll add a dunder repr (__repr__) so that it prints out those statistics in a nice way. And so now we can create a learner and our AvgStatsCallback, and when we call fit, we've added metrics and loss tracking to our minimal training loop.
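The batch-size weighting can be checked with a small pure-Python sketch. The class below follows the description above but works on plain floats and lists rather than tensors; the accuracy helper and the sample numbers are made up for illustration.

```python
class AvgStats:
    def __init__(self, metrics):
        self.metrics = list(metrics)
        self.reset()

    def reset(self):
        self.tot_loss, self.count = 0.0, 0
        self.tot_mets = [0.0] * len(self.metrics)

    @property
    def avg_stats(self):
        # Divide the weighted totals by the total number of items seen.
        return [t / self.count for t in [self.tot_loss] + self.tot_mets]

    def accumulate(self, loss, preds, targets):
        bn = len(targets)            # size of this batch
        self.tot_loss += loss * bn   # weight the batch mean by batch size
        self.count += bn
        for i, m in enumerate(self.metrics):
            self.tot_mets[i] += m(preds, targets) * bn


def accuracy(preds, targets):
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


stats = AvgStats([accuracy])
# A batch of 4 (mean loss 1.0, accuracy 0.75), then a smaller batch of 2.
stats.accumulate(loss=1.0, preds=[1, 0, 1, 1], targets=[1, 0, 0, 1])
stats.accumulate(loss=0.5, preds=[1, 1], targets=[1, 0])
avg_loss, avg_acc = stats.avg_stats
```

Averaging the two batch means directly would give a loss of 0.75; weighting by batch size gives 5/6, which is the true per-item mean.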

Runner's __call__ exits early when the first callback returns True. Why is that? So one of the things I noticed was really annoying in the first way I wrote the callback handler was that something had to return True to mean keep going, so basically False meant stop. That was really awkward, because if you don't add a return in Python, then it actually returns None, and None is falsy. And I thought: if I forget to return something, that should mean keep going; that should be the default. So

the first thing to point out is that the basic loop now actually says "if not" rather than "if": if not begin_epoch. In other words, if your callback handler returns False, then keep going. And that means that basically none of my callbacks

need to return anything most of the time, except for TestCallback, which returns True. So True means cancel, means stop. So if one of my callbacks says stop, then (I mean, I could certainly imagine an argument either way, but) the way I thought of it was: if it says stop,

let's just stop right now. Why do we need to run the other callbacks? So if it says stop, then it returns stop, which says we don't want to go any more, and then what happens depends on where you are. If it's after_epoch that returns stop, it's actually going to stop the loop

entirely. So that's why. Yeah, so this is a little awkward: we had to construct our AvgStatsCallback, then we had to pass that to run, and then later on we can refer to stats.valid_stats.avg_stats, because remember avg_stats was where we grabbed this.

So that's OK, but it's a little awkward. So instead what I do is create an accuracy callback function. That is the AvgStatsCallback constructor, passing in accuracy, but with partial. So partial is a function that returns a function. And so this is now a function which can create a callback, and I can pass this to cb_funcs, and now I don't have to store it away, because the runner

(this is what we saw before) will go through each of the cb_funcs, call that function to create the callback, and then stick that callback inside the runner, giving it this name as the attribute. So this way we can say: this is our callback function.
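A small sketch of passing callback factories instead of callback instances. ToyRunner and the fixed name attribute are simplifications assumed for this example; the real name would be derived from the class name as described earlier.

```python
from functools import partial

class AvgStatsCallback:
    name = "avg_stats"  # fixed here for brevity
    def __init__(self, metrics):
        self.metrics = metrics


class ToyRunner:
    def __init__(self, cbs=None, cb_funcs=None):
        self.cbs = list(cbs or [])
        for cbf in (cb_funcs or []):
            cb = cbf()                  # call the factory to build the callback
            setattr(self, cb.name, cb)  # expose it as an attribute by name
            self.cbs.append(cb)


def accuracy(preds, targets):
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


# partial returns a new function; calling it with no arguments runs
# AvgStatsCallback([accuracy]).
acc_cbf = partial(AvgStatsCallback, [accuracy])

run = ToyRunner(cb_funcs=[acc_cbf])
stats_cb = run.avg_stats  # no need to keep our own reference to the callback
```

Because each runner calls the factory itself, the same list of callback functions can be reused across many runs, with every run getting fresh callback objects.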

This is our runner, and fit, and now it's automatically available inside run.avg_stats. So this is what fastai v1 does, except it puts them inside the learner, because we don't have a Runner concept there. So I think that's pretty handy. It looks a little bit awkward the first time you do it, but you can create a standard set of

callback functions that you want to use for particular types of models, and then you can just store them away in a list and you don't have to think about them again, which is what you'll see we'll do lots of times. So, like a lot of things in this part 2 of the course, you can choose how deep to go on different things. I think our approach to callbacks is super interesting, and if you do too, you might want to go deep here and really look into,

you know, what kind of callbacks you can build and what things you can do with them that we haven't done yet. But a lot of these details are around exactly how I do this. If you're not as interested in the details of software engineering, this might be something you care less about, which is fine.

The main thing that everybody should take away is that that's our training loop. Okay. So the other stuff, like exactly how we created our AvgStatsCallback and exactly what __call__ does, are fairly minor details. But you should recognize that the fit function

stores how many epochs we're doing and what learner we're working with, and calls each of the different callbacks at each time. I never remember which ones are at which place. If you go to docs.fast.ai, the callbacks documentation will show you. Personally, I just always look at the source code, because it's so easy to see exactly what happens at each time and exactly what's available

at each time. So let's use this. And let's use this to do one cycle training, because it's pretty hard when you have to have a constant learning rate all the time, particularly because I was really wanting to show you a deep dive, which we're about to see using hooks, into

the mechanics, or rather the dynamics, of training models. What we'll learn is that the first batches are everything. If you can get the first batches working well, then things are going to be good, and this is how you can get super-convergence.

So if you want your first batches to be good, it turns out that good annealing is critical. So let's do that right away. Let's set up good annealing, because we have the mechanics we need, because we have callbacks. So we're in notebook 05, anneal. We get our data; this is all the same as before. Here's something to create a learner in one line.

So let's create a learner with that same little model we had before, and the loss function and data, and we'll create a runner with an AvgStatsCallback. This defaulted to a learning rate of 0.5; maybe we could try it with a learning rate of 0.3. It's pretty handy being able to quickly create things with different learning rates, so let's create a function that's just going to be partial get_model with a learning rate.

And so now we can just call get_model_func and pass a learning rate in, and we'll immediately have something with a different learning rate. Yes, tell me the question. "What is your typical debugging process?" My debugging process is to use the debugger. So if I got an exception while I was running a cell, then I just go into the next cell and type %debug,

and that pops open the debugger. If things aren't working the way I expected but it wasn't an exception, then I'll just add set_trace somewhere around the point I care about. That's about it. I find that works pretty well. Most of the time it's then just a case of looking at the shape of everything and what everything contains, like a couple of objects in the batch. I normally find something's got NaNs or zeros or whatever.

Yeah, it's really rare, using that debugger, that I find debugging is that difficult. If it is, then it's just a case of stepping away and questioning your assumptions. But with the help of a debugger, all of the state is right there in front of you. That's one of the great things about PyTorch: it supports this kind of development.

Okay. All right. So we're going to create a callback that's going to do hyperparameter scheduling. For this notebook we're just going to do learning rate as a hyperparameter, but in the last 12 months one of the really successful areas of research has been people pointing out that

you can and should schedule everything: your dropout amount, what kind of data augmentation you do, weight decay, learning rate, momentum, everything. Which makes sense, right? Because the other thing that we've been learning a lot about in the last 12 months is how, as you train a model, it goes through these different phases, where

the loss landscapes of neural nets look very different at the start, in the middle, and at the end. And so it's very unlikely that you would want the same hyperparameters throughout. So being able to schedule anything is super handy.

So we'll create a ParamScheduler callback, and you're just going to pass in a function and the name of a parameter to schedule. We're going to be passing in lr, because lr is what PyTorch calls learning rate. And then this function will be something which takes a single argument,

which is the number of epochs divided by total epochs. Remember I told you that the TrainEvalCallback we added is going to set this to be a float, so this will be, like, epoch number 2.35 out of 6. So this will be a float of exactly how far through training we are. We'll pass that to some function that we're going to write, and the result of that function will be used to set

the hyperparameter, in this case learning rate. As you know from part 1, you don't necessarily want to have the same value of a hyperparameter for all of your layers. So PyTorch has something called parameter groups, for which we use an abstraction we call layer groups in fastai, but they're basically the same thing.

And so a PyTorch optimizer contains a number of parameter groups. Unless you explicitly create more than one, it'll all be in one. But any time we do stuff with hyperparameters, you have to loop through them: for pg in self.opt.param_groups. So then the learning rate for this parameter group, this layer group,

Is equal to the result of this function. And then every time we start a new batch. If we're training. And we'll run our schedule. Pretty hard to know if our schedule is working if we can't actually see what's happening to the learning rate as we go so let's create another call back called recorder.
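
As a rough sketch of the scheduler callback just described, the code below sets one hyperparameter in every parameter group at the start of each training batch. It is not the notebook's code: the method names and arguments are assumptions, and a `SimpleNamespace` stands in for the optimizer so the example runs without PyTorch (a real `torch.optim` optimizer exposes the same `param_groups` list of dicts).

```python
from types import SimpleNamespace

class ParamScheduler:
    """Set one hyperparameter in every param group at the start of each training batch."""
    def __init__(self, pname, sched_func):
        self.pname, self.sched_func = pname, sched_func

    def set_param(self, opt, n_epochs, epochs):
        # pos is the float fraction of training completed, e.g. 2.35 / 6
        pos = n_epochs / epochs
        for pg in opt.param_groups:
            pg[self.pname] = self.sched_func(pos)

    def begin_batch(self, opt, n_epochs, epochs, in_train=True):
        # Only schedule while training, not during validation
        if in_train:
            self.set_param(opt, n_epochs, epochs)

# Stand-in for an optimizer (assumption): only param_groups is needed here.
opt = SimpleNamespace(param_groups=[{"lr": 0.1}, {"lr": 0.1}])

# Hypothetical schedule: linear from 1.0 down to 0.0.
sched = ParamScheduler("lr", lambda pos: 1.0 - pos)
sched.begin_batch(opt, n_epochs=3.0, epochs=6)
print([pg["lr"] for pg in opt.param_groups])  # [0.5, 0.5]
```

Halfway through training (3.0 out of 6 epochs) both groups get the schedule's value at position 0.5.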

At the start of fitting, it sets the lrs and losses arrays to being empty. And then after each batch, as long as you're training, it appends the current learning rate and the current loss. Now there's actually lots of learning rates potentially, because there's lots of layer groups, so in fastai we tend to use the final layer group as the learning rate we actually print out, but you don't have to do it that way.

And then we'll add something to plot the learning rates and we'll add something to plot the losses. So hopefully this looks pretty familiar compared to the Recorder in fastai v1. So with that in place, we now need to create a function that takes the percentage through the learning, which we're going to call pos for position, and returns the value of learning rate.

And so let's create one for linear schedules. What we want to be able to pass this is a starting learning rate and an ending learning rate. So you might pass it 10 and 1, and it would start at a learning rate of 10 and go down to 1. That would be ridiculously high, but whatever.

But we need a function that just takes position, so this is a function that's going to return a function. So here's a function that takes a start learning rate and an end learning rate and a position, and returns the learning rate: start plus position times the difference. So to convert that function into one which only takes position, we do partial, passing in that function and the start and the end we were given.

So now this function just takes position, because that's the only thing from inner that we haven't set. So that's going to work fine, but it's inconvenient because we're going to create lots of different schedulers.
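
The function-returning-a-function version described above might look like the following sketch. The names `sched_lin` and `_inner` are assumptions chosen to match the description; `functools.partial` is the standard-library call that fixes `start` and `end` and leaves only `pos` open.

```python
from functools import partial

def sched_lin(start, end):
    # Inner function takes all three values and does the linear interpolation
    def _inner(start, end, pos):
        return start + pos * (end - start)
    # Fix start and end; the returned callable takes only pos
    return partial(_inner, start, end)

f = sched_lin(10, 1)
print(f(0.0), f(0.5), f(1.0))  # 10.0 5.5 1.0
```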

And I don't want to have to write all this every time. So we can simplify the way that you create these by using a decorator. Here's the version with a decorator. With a decorator, you create the linear scheduler in the natural way: it's something that takes a start learning rate and end learning rate and a position and returns this.

And then we add an annealer decorator, and the annealer decorator is the thing that does all this inner-partial nonsense. What's a decorator? A decorator is a function that returns a function. And what Python does is, if it sees the name of a function here with an at sign before it, it takes this function, passes it into this function, and replaces the definition of this function with whatever this returns.

So it's going to take this, it's going to pass it over here, and then it's going to say return inner, where inner is the partial as we described before. So let's see that. So now sched_lin, we wrote it as taking start and end and pos, but if I hit shift-tab, this says it only takes start and end. Why is that? Because we've replaced this function with this function, and this function just takes start and end.

And this is where Jupyter is going to give you a lot more happy times than pretty much any IDE, because with this kind of dynamic code generation, it's pretty hard for an IDE to do that for you. Whereas in Jupyter, it's actually running the code in an actual Python process, so it knows exactly what sched_lin means.

So this has now created a function that takes start and end and returns a function which takes pos, which is what we need for our scheduler. So let's try it. Let's say f = sched_lin(1, 2), so this is a scheduler that starts at learning rate 1 and ends at learning rate 2, and then we'll say, hey, what should that be 30% of the way through training?
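
Here is a minimal sketch of the decorator version, assuming the decorator is called `annealer` and its inner function `_inner` as described above. `inspect.signature` plays the role of shift-tab here: it shows that after decoration the name `sched_lin` refers to a function of `start` and `end` only.

```python
import inspect
from functools import partial

def annealer(f):
    # Replace f(start, end, pos) with a function of (start, end)
    # that returns a function of pos alone.
    def _inner(start, end):
        return partial(f, start, end)
    return _inner

@annealer
def sched_lin(start, end, pos):
    return start + pos * (end - start)

print(inspect.signature(sched_lin))  # (start, end)

f = sched_lin(1, 2)
print(inspect.signature(f))          # (pos)
print(f(0.3))                        # roughly 1.3
```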

And again, if I hit shift-tab here, it knows that f is something that takes pos. So it's really nice in Jupyter that you can take advantage of Python's dynamic nature. There's no point using a dynamic language if you're not taking advantage of its dynamic nature, right? So things like decorators are a super convenient way to do this stuff.

There are other languages like Julia that can do similar things with macros. This is not the only way to get this kind of nice, very expressive ability, but it's one good way to do it. So now we can just go ahead and define all of our different schedulers, each taking a start, an end, and a pos.

So for example, no scheduling is something which always returns start. Or cosine scheduling, exponential scheduling. So let's define those. And then let's try to plot them, and it doesn't work. Why doesn't it work? Because you can't plot PyTorch tensors. But it turns out the only reason you can't plot PyTorch tensors is because tensors don't have an ndim attribute, which tells matplotlib how many dimensions there are.

So watch this: torch.Tensor.ndim equals a property that is the length of the shape. And now we've, again using the dynamic features of Python, replaced, or actually inserted into, the definition of Tensor a new property called ndim.
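
To illustrate the technique of inserting a property into an existing class, here is a small sketch on a hypothetical stand-in class, `FakeTensor`, so that it runs without PyTorch. The lecture applies the same kind of assignment to `torch.Tensor`; recent PyTorch releases already provide `Tensor.ndim`, so treat this purely as a demonstration of the idea.

```python
class FakeTensor:
    """Hypothetical stand-in for a tensor type that has shape but no ndim."""
    def __init__(self, shape):
        self.shape = tuple(shape)

t = FakeTensor((28, 28))
print(hasattr(t, "ndim"))  # False

# Insert a new read-only property into the class after it was defined.
FakeTensor.ndim = property(lambda self: len(self.shape))

# Existing instances pick the property up too, because lookup goes via the class.
print(t.ndim)  # 2
```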

And now we can plot tensors. So the nice thing about Python is you never have to be like, oh, this isn't supported, because you can change everything. You can insert things, you can replace things, whatever. So here we've now got a nice printout of our four different schedulers.
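
The four schedulers could be defined along the lines below. The linear and constant ones follow the description directly. The exact cosine and exponential formulas are assumptions (a half-cosine from start to end, and geometric interpolation), as are the function names. Plain Python floats are used instead of tensors so the sketch has no dependencies.

```python
import math
from functools import partial

def annealer(f):
    def _inner(start, end):
        return partial(f, start, end)
    return _inner

@annealer
def sched_lin(start, end, pos):
    return start + pos * (end - start)

@annealer
def sched_cos(start, end, pos):
    # Half cosine: equals start at pos=0 and end at pos=1
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

@annealer
def sched_no(start, end, pos):
    # "No schedule": always return start
    return start

@annealer
def sched_exp(start, end, pos):
    # Geometric interpolation; assumes start and end are both positive
    return start * (end / start) ** pos

positions = [i / 100 for i in range(101)]
for name, fn in [("lin", sched_lin), ("cos", sched_cos), ("no", sched_no), ("exp", sched_exp)]:
    f = fn(2, 1e-2)
    values = [f(p) for p in positions]
    print(name, round(values[0], 4), round(values[50], 4), round(values[-1], 4))
```

The `values` lists can be passed straight to a plotting call to reproduce the kind of comparison shown in the lecture.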

Which isn't really enough, because if you want to do one-cycle scheduling, then in fact most of the time nowadays you want some kind of warm-up and some kind of cool-down. Or if you're doing something like SGDR, you've got multiple cool-downs. So we really need to be able to paste some of these schedulers together.

So let's create another function called combine_scheds, and it's going to look like this. We're going to pass in the phases we want. So phase one will be a cosine schedule from a learning rate of 0.3 to 0.6. Phase two will be a cosine schedule with a learning rate going from 0.6 to 0.2.

And phase one will take up 30% of our batches and phase two will take up 70%. So that's what we're going to pass in: how long each phase is, and what the schedule is in each phase. Here's how we do that. I don't think I need to go through the code; there's nothing interesting about it.
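
Since the code is not walked through, here is one possible pure-Python sketch of pasting schedules together. It is not the notebook's implementation: the body of `combine_scheds`, the cosine formula, and the clamping at `pos = 1.0` are all assumptions. It uses the phases from the example above (30% going from 0.3 to 0.6, then 70% going from 0.6 to 0.2).

```python
import math
from functools import partial
from itertools import accumulate

def annealer(f):
    def _inner(start, end):
        return partial(f, start, end)
    return _inner

@annealer
def sched_cos(start, end, pos):
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def combine_scheds(pcts, scheds):
    """Run scheds[i] during the i-th fraction pcts[i] of training (0 <= pos <= 1)."""
    assert len(pcts) == len(scheds)
    assert abs(sum(pcts) - 1.0) < 1e-9
    bounds = [0.0] + list(accumulate(pcts))  # phase boundaries, e.g. 0, 0.3, 1.0
    def _inner(pos):
        # Last phase whose start is <= pos; limiting i to valid phases
        # keeps pos == 1.0 inside the final phase.
        idx = max(i for i in range(len(scheds)) if pos >= bounds[i])
        # Rescale pos to the 0..1 range within that phase
        actual_pos = (pos - bounds[idx]) / (bounds[idx + 1] - bounds[idx])
        return scheds[idx](actual_pos)
    return _inner

sched = combine_scheds([0.3, 0.7], [sched_cos(0.3, 0.6), sched_cos(0.6, 0.2)])
for p in (0.0, 0.15, 0.3, 0.65, 1.0):
    print(p, round(sched(p), 4))
```

The combined schedule starts at 0.3, peaks at 0.6 when 30% of training is done, and anneals down to 0.2 at the end.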

But what we do once we have that is we can then plot that schedule, and you can kind of see why we're very fond of these cosine one-cycle schedules. I don't think this has ever been published anywhere, but it's what fastai uses by default nowadays.

You get a nice gentle warm-up at the start, and this is the time when things are just super sensitive and fall apart really quickly. But it doesn't take long, as you'll see in next week's lesson when we do a deep dive into this stuff using hooks, for it to get into a decent part of the loss landscape, and so you can quite quickly increase the learning rate.

And then there's something that people have realized in the last four months or so, and we'll start looking at papers next week for this. Leslie Smith really showed us this two years ago, but it's only been the last four months or so that people have really understood this in the wider academic literature: you need to train at a high learning rate for a long time. And so with this kind of cosine schedule, we keep it up high for a long time.

But then you also need to fine-tune at a very low learning rate for a long time. So this has all of the nice features that we want. So cosine one-cycle schedules are terrific, and we now can build them from scratch.

So let's try training like this. Let's create a list of callback functions that has a Recorder in it, an average-stats callback with accuracy in it, and a parameter scheduler that schedules the learning rate using this schedule. And then fit. That's looking pretty good: we're getting up towards 94% pretty quickly. And we can now go plot_lr, and it's the shape that we hoped for, and we can even say plot_loss.
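
To make the interplay between the scheduler and the Recorder concrete without a model, here is a toy loop. Everything in it is a stand-in: the `Recorder` method signatures, the `SimpleNamespace` optimizer, the linear warm-up schedule, and the placeholder loss are assumptions for illustration, not the notebook's classes or real training results.

```python
from types import SimpleNamespace

class Recorder:
    def begin_fit(self):
        # Start each fit with empty histories
        self.lrs, self.losses = [], []

    def after_batch(self, opt, loss, in_train=True):
        if not in_train:
            return
        # Record the learning rate of the final parameter group
        self.lrs.append(opt.param_groups[-1]["lr"])
        self.losses.append(loss)

opt = SimpleNamespace(param_groups=[{"lr": 0.0}, {"lr": 0.0}])
sched_func = lambda pos: 0.1 + 0.4 * pos  # hypothetical linear warm-up
epochs, batches_per_epoch = 2, 5

rec = Recorder()
rec.begin_fit()
for epoch in range(epochs):
    for i in range(batches_per_epoch):
        # Float epoch count, e.g. 1.8 means 80% of the way through the second epoch
        n_epochs = epoch + i / batches_per_epoch
        for pg in opt.param_groups:
            pg["lr"] = sched_func(n_epochs / epochs)
        fake_loss = 1.0 / (1 + epoch * batches_per_epoch + i)  # placeholder, not a real loss
        rec.after_batch(opt, fake_loss)

print(len(rec.lrs), rec.lrs[0], rec.lrs[-1])
```

Passing `rec.lrs` or `rec.losses` to a plotting function gives the learning rate and loss curves over batches.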

We still haven't looked at convolutions, really; we'll do that next week, and a lot more. But you kind of have the ability now to hopefully think of lots of things that you might want to try, and try them out. So next week we're going to be starting with convnets.

And we're going to be finally using a GPU, because once we start creating convnets of this size it starts taking a little bit too long. But just to read ahead a little bit: what's it going to take to put stuff on the GPU? This is the entirety of the callback. So we've now got the mechanics we need to do things unbelievably quickly.

Oh, and also we'll be wanting to add some transformations. This is the entirety of what it takes to do batch-wise transformations with our callbacks. As we discussed, though, we can't add callbacks between layers, so we will add callbacks between layers initially manually and then using PyTorch hooks. That way we're going to be able to plot and see exactly what's going on inside our models as they train, and we'll find ways to train them much, much more nicely, so that by the end of the next notebook we'll be up over 98% accuracy.

And that's going to be super cool. And then we're going to do a deep dive into batch norm, the data blocks API, optimizers, and transforms, and at that point I think we'll have basically all the mechanics we need to go into some more advanced architectures and training methods and see how we did some of the cool stuff that we did in part one. So I'll see you next week.
