Back to Index

Lesson 18: Deep Learning Foundations to Stable Diffusion


Chapters

0:00 Accelerated SGD done in Excel
1:35 Basic SGD
10:56 Momentum
15:37 RMSProp
16:35 Adam
20:11 Adam with annealing tab
23:02 Learning Rate Annealing in PyTorch
26:34 How PyTorch’s Optimizers work?
32:44 How schedulers work?
34:32 Plotting learning rates from a scheduler
36:36 Creating a scheduler callback
40:03 Training with Cosine Annealing
42:18 1-Cycle learning rate
48:26 HasLearnCB - passing learn as parameter
51:01 Changes from last week, /compare in GitHub
52:40 fastcore’s patch to the Learner with lr_find
55:11 New fit() parameters
56:38 ResNets
77:44 Training the ResNet
81:17 ResNets from timm
83:48 Going wider
86:02 Pooling
91:15 Reducing the number of parameters and megaFLOPS
95:34 Training for longer
98:06 Data Augmentation
105:56 Test Time Augmentation
109:22 Random Erasing
115:55 Random Copying
118:52 Ensembling
120:54 Wrap-up and homework

Transcript

Hi folks, thanks for joining me for lesson 18. We're going to start today in Microsoft Excel. You'll see there's an Excel folder in the course22p2 repo, and in there there's a spreadsheet called graddesc, as in gradient descent, which I guess we should zoom in on a bit here.

So there are some instructions here, but this is basically describing what's in each sheet. We're going to be looking at the various accelerated SGD approaches we saw last time, but done in a spreadsheet. We're going to do something very, very simple, which is to try to solve a linear regression.

So the actual data was generated with y equals ax plus b where a which is the slope was 2 and b which is the intercept or constant was 30. And so you can see we've got some random numbers here and then over here we've got the ax plus b calculation.
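Just to make the setup concrete, here is roughly how the same kind of data could be generated in Python; the spreadsheet's particular random numbers and row count aren't shown in the transcript, so these values are placeholders.

```python
import torch

a, b = 2.0, 30.0          # the true slope and intercept the SGD sheets try to recover
x = torch.rand(30) * 100  # some random x values (placeholder for the sheet's random numbers)
y = a * x + b             # the "ax plus b" column
```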

So then what I did is I copied and pasted as values just one set of those random numbers into the next sheet, called basic. This is the basic SGD sheet, so that's what x and y are. And so the idea is we're going to try to use SGD to learn that the intercept is 30 and the slope is 2.

So those are our weights, or parameters. The way we do SGD is we start out at some random kind of guess, so my random guess is going to be 1 and 1 for the intercept and slope.

And so if we look at the very first data point, which is x is 14 and y is 58, and the intercept and slope are both 1, then we can make a prediction. Our prediction is just equal to slope times x plus the intercept, so the prediction will be 15.

Now actually the answer was 58, so we're a long way off, so we're going to use mean squared error. The mean squared error is just the error, that is, the difference, squared. Okay, so one way to calculate how much the squared error would change if we changed the intercept, which is b, would be just to change the intercept by a little bit and see what the error is.

So here that's what I've done: I've just added 0.01 to the intercept and then calculated y, and then calculated the difference squared. And so this is what I mean by b1: this is the squared error I get if I change b by 0.01, and it's made the error go down a little bit.

So that suggests that we should probably increase b, the intercept. So we can calculate the estimated derivative by simply taking the change in the squared error between using the actual intercept and using the intercept plus 0.01, so that's the rise, and we divide it by the run, which as we said is 0.01, and that gives us the estimated derivative of the squared error with respect to b, the intercept.

Okay, so it's about negative 86, -85.99. We can do exactly the same thing for a: change the slope by 0.01, calculate y, calculate the difference and square it, and we can calculate the estimated derivative in the same way, rise (which is the difference) divided by run (which is 0.01), and that's quite a big number, about -1200.
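Since this kind of finite differencing is easy to check outside a spreadsheet, here is a minimal Python sketch of the same calculation for that first data point (x=14, y=58, starting guesses of 1 and 1); it's not the spreadsheet itself, and the analytic derivatives it is compared against are the ones discussed next.

```python
def sq_err(a, b, x, y): return ((a*x + b) - y)**2   # squared error for one data point

x, y = 14.0, 58.0
a, b = 1.0, 1.0          # current guesses for slope and intercept
eps = 0.01

# rise over run: nudge each parameter a little and see how the squared error changes
db = (sq_err(a, b+eps, x, y) - sq_err(a, b, x, y)) / eps   # ~ -85.99
da = (sq_err(a+eps, b, x, y) - sq_err(a, b, x, y)) / eps   # ~ -1202

# analytic versions for comparison: 2*(pred-y) for the intercept, 2*(pred-y)*x for the slope
pred = a*x + b
print(db, 2*(pred - y))      # both about -86
print(da, 2*(pred - y)*x)    # both about -1200
```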

In both cases the estimated derivatives are negative so that suggests we should increase the intercept and the slope and we know that that's true because actually the intercept and the slope are both bigger than one the intercept is 30 should be 30 and the slope should be 2. So there's one way to calculate the derivatives another way is analytically and the the derivative of squared is two times so here it is here I've just written it down for you so here's the analytic derivative it's just two times the difference and then the derivative for the slope is here and you can see that the estimated version using the rise over run and the little 0.01 change and the actual they're pretty similar okay and same thing here they're pretty similar so anytime I calculate gradients kind of analytically but by hand I always like to test them against doing the actual rise over run calculation with some small number and this is called using the finite differencing approach we only use it for testing because it's slow because you have to do a separate calculation for every single weight but it's good for testing we use analytic derivatives all the time in real life anyway so however we calculate the derivatives we can now calculate a new slope so our new slope will be equal to the previous slope minus the derivative times the learning rate which we just set here at 0.0001 and we can do the same thing for the intercept as you see and so here's our new slope intercept so we can use that for the second row of data so the second row of data is x equals 86 y equals 202 so our intercept is not 1 1 anymore the intercept and slope are not 1 1 but they're 1.01 and 1.12 so here's we're just using a formula just to point at the old at the new intercept and slope we can get a new prediction and squared error and derivatives and then we can get another new slope and intercept and so that was a pretty good one actually it really helped our slope head in the right direction although the intercepts moving pretty slowly and so we can do that for every row of data now strictly speaking this is not mini batch gradient descent that we normally do in deep learning it's a simpler version where every batch is a size one so I mean it's still stochastic gradient descent it's just not it's just a batch size of one but I think sometimes it's called online gradient descent if I remember correctly so we go through every data point in our very small data set until we get to the very end and so at the end of the first epoch we've got an intercept of 1.06 and a slope of 2.57 and those indeed are better estimates than our starting estimates of 1 1 so what I would do is I would copy our slope 2.57 up to here 2.57 I'll just type it for now and I'll copy our intercept up to here and then it goes through the entire epoch again then we get another interception slope and so we could keep copying and pasting and copying and pasting again and again and we can watch the root mean squared error going down now that's pretty boring doing that copying and pasting so what we could do is fire up visual basic for applications and sorry this might be a bit small I'm not sure how to increase the font size and what it shows here so sorry this is a bit small so you might want to just open it on your own computer be able to see it clearly but basically it shows I've created a little macro where if you click on the reset button it's just going to set the slope and constant to one and calculate and if you click the run button it's going to go through five times 
calling one step and what one step's going to do is it's going to copy the slope last slope to the new slope and the last constant intercept to the new constant intercept and also do the same for the RMSE and it's actually going to paste it down to the bottom for reasons I'll show you in a moment so if I now run this I'll reset and then run there we go you can see it's run it five times and each time it's pasted the RMSE and here's a chart of it showing it going down and so you can see the new slope is 2.57 new intercept is 1.27 I could keep running it another five so this is just doing copy paste copy paste copy paste five times and you can see that the RMSE is very very very slowly going down and the intercept and slope are very very very slowly getting closer to where they want to be the big issue really is that the intercept is meant to be 30 it looks like it's going to take a very very long time to get there but it will get there eventually if you click run enough times or maybe set the VBA macro to more than five steps at a time but you can see it's it's very slowly and and importantly though you can see like it's kind of taking this linear route every time these are increasing so why not increase it by more and more and more and so you'll remember from last week that that is what momentum does so on the next sheet we show momentum and so everything's exactly the same as the previous sheet but this sheet we didn't bother with the finite differencing we just have the analytic derivatives which are exactly the same as last time the data is the same as last time the slope and intercept are the same starting points as last time and this is the new b and new a that we get but what we do this time is that we've added a momentum term which we're calling beta and so the beta is going to these cells here and what are these cells what what these cells are is that they're maybe it's most interesting to take this one here what it's doing is it's taking the gradient and it's taking the gradient and it's using that to update the weights but it's also taking the previous update so you can see here the blue one minus 25 so that is going to get multiplied by 0.9 the momentum and then the derivative is then multiplied by 0.1 so this is momentum which is getting a little bit of each and so then what we do is we then use that instead of the derivative to multiply by our learning rate so we keep doing that again and again and again as per usual and so we've got one column which is calculating the next which is calculating the momentum you know lerped version of the gradient for both b and for a and so you can see that for this one it's the same thing you look at what was the previous move and that's going to be 0.9 of what you're going to use for your momentum version gradient and 0.1 is for this version the momentum gradient and so then that's again what we're going to use to multiply by the learning rate and so you can see what happens is when you keep moving in the same direction which here is we're saying the derivative is negative again and again and again so it gets higher and higher and higher and get over here and so particularly with this big jump we get we keep getting big jumps because still we want to then there's still negative gradient negative gradient negative gradient so if we at so at the end of this our new our b and our a have jumped ahead and so we can click run and we can click clicking it and you can see that it's moving you know not super fast but certainly faster than it was 
before so if you haven't used vba visual basic for applications before you can hit alt alt f11 or option f11 to to open it and you may need to go into your preferences and turn on the developer tools so that you can see it you can also right click and choose assign macro on a button and you can see what macro has been assigned so if i hit alt f11 and i can just double or you can just double click on the sheet name and it'll open it up and you can see that this is exactly the same as the previous one there's no difference here oh one difference is that to keep track of momentum at the very very end so i've got my momentum values going all the way down the very last momentum i copy back up to the top h epoch so that we don't lose track of our kind of optimizer state if you like okay so that's what momentum looks like so yeah if you're kind of a more of a visual person like me you like to see everything laid out in front of you and like to be able to experiment which i think is a good idea this can be really helpful so rms prop we've seen and it's very similar to momentum but in this case instead of keeping track of kind of a lerped moving average an exponential moving average of gradients we're keeping track of a moving average of gradient squared and then rather than simply adding that you know using that as the gradient what instead we're doing is we are dividing our gradient by the square root of that and so remember the reason we were doing that is to say if you know if if the there's very little variation very little going on in your gradients then you probably want to jump further so that's rms prop and then finally atom remember was a combination of both so in atom we've got both the lerped version of the gradient and we've got the lerped version of the gradient squared and then we do both when we update we're both dividing the gradient by the square root of the lerped the moving exponentially weighting average moving averages and we're also using the momentumized version and so again we just go through that each time and so if i reset run and so oh wow look at that it jumped up there very quickly because remember we wanted to get to 2 and 30 so just two sets so that's five that's 10 epochs now if i keep running it it's kind of now not getting closer it's kind of jumping up and down between pretty much the same values so probably what we'd need to do is decrease the learning rate at that point and yeah that's pretty good and now it's jumping up and down between the same two values again so maybe decrease the learning rate a little bit more and i kind of like playing around like this because it gives me a really intuitive feeling for what training looks like so i've got a question from our youtube chat which is how is j 33 being initialized so it's it's this is just what happens is we take the very last cell here well these actually all these last four cells and we copy them to here as values so this is what those looked like in the last epoch so if i basically we're going we go copy and then paste as values and then they this here just refers back to them as you see and it's interesting that they're kind of you can see how they're exact opposites of each other which is really you can really see how they're it's it's just fluctuating around the actual optimum at this point um okay thank you to sam whatkins we've now got a nicer sized editor that's great um where are we adam okay so with um so with adam basically it all looks pretty much the same except now we have to copy and paste our 
both our momentums and our um squared gradients and of course the slopes and intercepts at the end of each step but other than that it's just doing the same thing and when we reset it it just sets everything back to their default values now one thing that occurred to me you know when i first wrote this spreadsheet a few years ago was that manually changing the learning rate seems pretty annoying now of course we can use a scheduler but a scheduler is something we set up ahead of time and i did wonder if it's possible to create an automatic scheduler and so i created this adam annealing tab which honestly i've never really got back to experimenting with so if anybody's interested they should check this out um what i did here was i used exactly the same spreadsheet as the adam spreadsheet but i added an extra after i do the step i added an extra thing which is i automatically decreased the learning rate in a certain situation and the situation in which i in which i decreased it was i kept track of the average of the um squared gradients and anytime the average of the squared gradients decreased during an epoch i stored it so i basically kept track of the the lowest squared gradients we had and then what i did was if we got a if that resulted in the gradients the squared gradients average halving then i would decrease the learning rate by then i would decrease the learning rate by a factor of four so i was keeping track of this gradient ratio now when you see a range like this you can find what that's referring to by just clicking up here and finding gradient ratio and there it is and you can see that it's equal to the ratio between the average of the squared gradients versus the minimum that we've seen so far um so this is kind of like my theory here was thinking that yeah basically as you train you kind of get into flatter more stable areas and as you do that that's a sign that you know you might want to decrease your learning rate so uh yeah if i try that if i hit run again it jumps straight to a pretty good value but i'm not going to change the learning rate manually i just press run and because it's changed the learning rate automatically now and if i keep hitting run without doing anything look at that it's got pretty good hasn't it and the learning rates got lower and lower and we basically got almost exactly the right answer so yeah that's a little experiment i tried so maybe some of you should try experiments around whether you can create a an automatic annealer using the um using mini ai i think that would be fun so that is an excellent segue into our notebook because we are going to talk about annealing now so we've seen it manually before um where we've just where we've just decreased the learning rate in a notebook and like ran a second cell um and we've seen something in excel um but let's look at what we generally do in pytorch so we're still in the same notebook as last time the accelerated SGD notebook um and now that we've reimplemented all the main optimizers that t equal tend to use most of the time from scratch we can use pytorches of course um so let's see look look now at how we can do our own learning rate scheduling or annealing within the mini ai framework now we've seen when we implemented the learning rate finder um that that we saw how to create something that adjusts the learning rate so just to remind you this was all we had to do so we had to go through the optimizers parameter groups and in each group set the learning rate to times equals some model player if 
we're just that was for the learning rate finder um so since we know how to do that we're not going to bother reimplementing all the schedulers from scratch um because we know the basic idea now so instead what we're going to have do is have a look inside the torch dot optim dot lr scheduler module and see what's defined in there so the lr scheduler module you know you can hit dot tab and see what's in there but something that i quite like to do is to use dir because dir lr scheduler is a nice little function that tells you everything inside a python object and this particular object is a module object and it tells you all the stuff in the module um when you use the dot version tab it doesn't show you stuff that starts with an underscore by the way because that stuff's considered private or else dir does show you that stuff now i can kind of see from here that the things that start with a capital and then a small letter look like the things we care about we probably don't care about this we probably don't care about these um so we can just do a little list comprehension that checks that the first letter is an uppercase and the second letter is lowercase and then join those all together with a space and so here is a nice way to get a list of all of the schedulers that pytorch has available and actually um i didn't couldn't find such a list on the pytorch website in the documentation um so this is actually a handy thing to have available so here's various schedulers we can use and so i thought we might experiment with using cosine annealing um so before we do we have to recognize that these um pytorch schedulers work with pytorch optimizers not with of course with our custom sgd class and pytorch optimizers have a slightly different api and so we might learn how they work so to learn how they work we need an optimizer um so some one easy way to just grab an optimizer would be to create a learner just kind of pretty much any old random learner and pass in that single batch callback that we created do you remember that single batch callback single batch it just after batch it cancels the fit so it literally just does one batch um and we could fit and from that we've now got a learner and an optimizer and so we can do the same thing we can do our optimizer to see what attributes it has this is a nice way or of course just read the documentation in pytorch this one is documented um i think showing all the things it can do um as you would expect it's got the step and the zero grad like we're familiar with um or you can just if you just hit um opt um so you can uh the optimizers in pytorch do actually have a a repra as it's called which means you can just type it in and hit shift enter and you can also see the information about it this way now an optimizer it'll tell you what kind of optimizer it is and so in this case the default optimizer um for a learner when we created it we decided was uh optim.sgd.sgd so we've got an sgd optimizer and it's got these things called parameter groups um what are parameter groups well parameter groups are as it suggests they're groups of parameters and in fact we only have one parameter group here which means all of our parameters are in this group um so let me kind of try and show you it's a little bit confusing but it's kind of quite neat so let's grab all of our parameters um and that's actually a generator so we have to turn that into an iterator and call next and that will just give us our first parameter okay now what we can do is we can then check the 
state of the optimizer and the state is a dictionary and the keys are parameter tensors so this is kind of pretty interesting because you might be i'm sure you're familiar with dictionaries i hope you're familiar with dictionaries but normally you probably use um numbers or strings as keys but actually you can use tensors as keys and indeed that's what happens here if we look at param it's a tensor it's actually a parameter which remember is a tensor which it knows to to require grad and to to list in the parameters of the module and so we're actually using that to index into the state so if you look at up.state it's a dictionary where the keys are parameters now what's this for well what we want to be able to do is if you think back to this we actually had each parameter we have state for it we have the average of the gradients or the exponentially way to moving average gradients and of squared averages and we actually stored them as attributes um so pytorch does it a bit differently it doesn't store them as attributes but instead it it the the optimizer has a dictionary where you can look at where you can index into it using a parameter and that gives you the state and so you can see here it's got a this is the this is the exponentially weighted moving averages and both because we haven't done any training yet and because we're using non-momentum std it's none but that's that's how it would be stored so this is really important to understand pytorch optimizers i quite liked our way of doing it of just storing the state directly as attributes but this works as well and it's it's it's fine you just have to know it's there and then as i said rather than just having parameters so we in sgd stored the parameters directly but in pytorch those parameters can be put into groups and so since we haven't put them into groups the length of param groups is one there's just one group so here is the param groups and that group contains all of our parameters okay so pg just to clarify here what's going on pg is a dictionary it's a parameter group and to get the keys from a dictionary you can just listify it that gives you back the keys and so this is one quick way of finding out all the keys in a dictionary so that you can see all the parameters in the group and you can see all of the hyper parameters the learning rate the momentum weight decay and so forth so that gives you some background about about what's what's going on inside an optimizer so seva asks isn't indexing by a tensor just like passing a tensor argument to a method and no it's not quite the same because this is this is state so this is how the optimizer stores state about the parameters it has to be stored somewhere for our homemade mini ii version we stored it as attributes on the parameter but in the pytorch optimizers they store it as a dictionary so it's just how it's stored okay so with that in mind let's look at how schedulers work so let's create a cosine annealing scheduler so a scheduler in pytorch you have to pass it the optimizer and the reason for that is we want to be able to tell it to change the learning rates of our optimizer so it needs to know what optimizer to change the learning rates of so it can then do that for each set of parameters and the reason that it does it by parameter group is that as we'll learn in a later lesson for things like transfer learning we often want to adjust the learning rates of the later layers differently to the earlier layers and actually have different learning rates and so that's why we 
can have different groups, and the different groups can have different learning rates, momentums, and so forth. Okay, so we pass in the optimizer, and then if I hit shift-tab a couple of times it'll tell me all of the things that you can pass in. And so it needs to know T_max, how many iterations you're going to do, and that's because it's trying to do one, you know, half a wave, if you like, of the cosine curve, so it needs to know how many iterations you're going to do so it knows how far to step each time. So we're going to do 100 iterations. The scheduler is going to store the base learning rate, and where did it get that from? It got it from our optimizer, on which we set a learning rate. Okay, so it's going to steal the optimizer's learning rate, and that's going to be the starting learning rate, the base learning rate, and it's a list because there could be a different one for each parameter group; we only have one parameter group. You can also get the most recent learning rate from a scheduler, which of course is the same at this point. And I couldn't find any method in PyTorch to actually plot a scheduler's learning rates, so I just made a tiny little thing that creates a list, sets it to the last learning rate of the scheduler (which is going to start at 0.06), and then goes through however many steps you ask for: steps the optimizer, steps the scheduler (so this is the thing that causes the scheduler to adjust its learning rate), and then appends that new learning rate to a list of learning rates, and then plots it. And what I've done here is I've intentionally gone over 100, because I had told it I was going to do 100, and you can see the learning rate, if we did 100 iterations, would start high for a while, then go down, and then stay low for a while; and if we intentionally go past the maximum it actually starts going up again, because this is a cosine curve. So one of the main things I guess I wanted to show here is what it looks like to really investigate, in a REPL environment like a notebook, how an object behaves, what's in it. This is something I would always want to do when I'm using something from an API I'm not very familiar with: I really want to see what's in it, see what things do, run it totally independently, plot anything I can plot. This is how I like to learn about the stuff I'm working with. You know, data scientists don't spend all of their time just coding, so we can't just rely on using the same classes and APIs every day; we have to be very good at exploring them and learning about them, and that's why I think this is a really good approach. Okay, so let's create a scheduler callback. A scheduler callback is something we're going to pass the scheduling class into — the scheduling callable, actually — and remember that when we create the scheduler we have to pass in the optimizer to schedule. And so before_fit, that's the point at which we have an optimizer, so that's where we will create the scheduling object ("schedo" — I like this name, it's very Australian). So the scheduling object we will create by passing the optimizer into the scheduler callable, and then when we do _step, we'll check if we're training, and if so we'll step. Okay, so then what's going to call _step is after_batch, so after each batch we'll call step, and that would be if you want your scheduler to update the learning rate every batch. We could also have an epoch scheduler callback, which we'll see later.
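Pieced together, the two things just described might look roughly like this. The plotting helper is a guess at the "tiny little thing" mentioned, and the callback sketch assumes the course's miniai Callback base class and a learner that exposes opt and training; the actual course code may differ in details.

```python
import matplotlib.pyplot as plt
from torch.optim import lr_scheduler

def sched_lrs(sched, steps):
    "Step an optimizer/scheduler pair `steps` times and plot the learning rate at each step."
    lrs = [sched.get_last_lr()]
    for i in range(steps):
        sched.optimizer.step()   # the scheduler expects the optimizer to step as well
        sched.step()             # this is what adjusts the learning rate
        lrs.append(sched.get_last_lr())
    plt.plot(lrs)

# e.g. sched = lr_scheduler.CosineAnnealingLR(opt, T_max=100); sched_lrs(sched, 110)

# The batch scheduler callback described above (Callback is miniai's base callback class):
class BaseSchedCB(Callback):
    def __init__(self, sched): self.sched = sched
    def before_fit(self, learn): self.schedo = self.sched(learn.opt)   # now we have an optimizer
    def _step(self, learn):
        if learn.training: self.schedo.step()

class BatchSchedCB(BaseSchedCB):
    def after_batch(self, learn): self._step(learn)
```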
And that one is just going to use after_epoch instead. So in order to actually see what the schedule is doing, we're going to need to create a new callback to keep track of what's going on in our learner, and I figured we could create a recorder callback. What we're going to do is pass in the name of the thing that we want to record, the thing we want to keep track of in each batch, and a function which is going to be responsible for grabbing the thing that we want. And so in this case the function here is going to grab, from the callback, its pg attribute (a parameter group) and grab the learning rate. Where does the pg attribute come from? Well, in before_fit the recorder callback is going to grab just the first parameter group — you've got to pick some parameter group to track, so we'll just grab the first one. And then also we're going to create a dictionary of all the things that we're recording, so we'll get all the names, which in this case is just lr, and initially each one is just going to be an empty list. And then in after_batch we'll go through each of the items in that dictionary, which in this case is just lr as the key and the _lr function as the value, and we will append to that list the result of calling that function, that callable, passing in this callback — and that's why it gets the callback. And so that's basically going to give us a dictionary of the results of each of these functions after each batch during training, and then we'll just go through and plot them all. So let me show you what that's going to look like. Let's create a cosine annealing callable: we're going to have to use a partial to say that this callable is going to have T_max equal to three times however many batches we have in our data loader, and that's because we're going to do three epochs. And then we will set it running, and we're passing in the batch scheduler with the scheduler callable, and we're also going to pass in our recorder callback, saying we want to track the learning rate using the _lr function. We're going to call fit, and oh, this is actually a pretty good accuracy — we're getting close to 90 percent now in only three epochs, which is impressive. And so when we then call rec.plot — remember, rec is the recorder callback — it plots the learning rate. Isn't that sweet? So we could, as I said, do exactly the same thing but replace after_batch with after_epoch, and this will now become a scheduler which steps at the end of each epoch rather than the end of each batch. So I can do exactly the same thing now using an epoch scheduler; this time T_max is three, because we're only going to be stepping three times — we're not stepping at the end of each batch, just at the end of each epoch. So that trains, and then we can call rec.plot after it trains, and as you can see there, it's just stepping three times. So you can see here we're really digging in deeply to understanding what's happening everywhere in our models: what do all the activations look like, what do the losses look like, what do our learning rates look like — and we've built all this from scratch. So hopefully that gives you a sense that we can really do a lot ourselves.
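To recap, the recorder callback and the cosine-annealing setup just described might look something like this; it's a sketch that again assumes miniai's Callback class and a learner with opt and training, so treat names and details as approximate.

```python
import matplotlib.pyplot as plt
from functools import partial
from torch.optim import lr_scheduler

class RecorderCB(Callback):
    "After every training batch, record whatever the passed-in functions pull out of this callback."
    def __init__(self, **d): self.d = d
    def before_fit(self, learn):
        self.recs = {k: [] for k in self.d}        # one empty list per recorded name
        self.pg = learn.opt.param_groups[0]        # track the first (here: only) parameter group
    def after_batch(self, learn):
        if not learn.training: return
        for k, v in self.d.items(): self.recs[k].append(v(self))
    def plot(self):
        for k, v in self.recs.items(): plt.plot(v, label=k)
        plt.legend(); plt.show()

def _lr(cb): return cb.pg['lr']

# rough usage, assuming dls and learn from earlier in the notebook:
#   sched = partial(lr_scheduler.CosineAnnealingLR, T_max=3*len(dls.train))
#   rec = RecorderCB(lr=_lr)
#   learn.fit(3, cbs=[BatchSchedCB(sched), rec]); rec.plot()
```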
Now, if you've done the fastai part 1 course you'll be very aware of one-cycle training, which came from a terrific paper by Leslie Smith (which I'm not sure ever actually got published). One-cycle training is — well, let's take a look at it. We can just replace our scheduler with the one-cycle learning rate scheduler, which is in PyTorch — and of course if it wasn't in PyTorch we could very easily just write our own. We're going to make it a batch scheduler, and we're going to do five epochs this time, so we're going to train a bit longer. And the first thing I'll point out is, hooray, we have got a new record for us, 90.6, so that's great. And then you can see here's the plot, and now look, two things are being plotted, and that's because I've now passed into the recorder callback a plot of learning rates and also a plot of momentums. For the momentums it's going to grab betas[0], because remember, for Adam they're called betas: betas[0] and betas[1] are the momentum of the gradients and the momentum of the gradients squared. And you can see what the one cycle is doing: the learning rate is starting very low, going up high, and then coming down again, but the momentum is starting high, then going down, and then up again.
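The exact arguments used on screen aren't visible in the transcript, so here is only a sketch of that setup, reusing the RecorderCB, _lr and BatchSchedCB sketches above; max_lr and total_steps are OneCycleLR's real parameters, while dls, lr and the 'mom' label are assumptions.

```python
from functools import partial
from torch.optim import lr_scheduler

def _beta1(cb): return cb.pg['betas'][0]   # Adam-style optimizers keep their momentum coefficients in 'betas'

epochs = 5
tmax = epochs * len(dls.train)             # dls assumed from earlier in the notebook
sched = partial(lr_scheduler.OneCycleLR, max_lr=lr, total_steps=tmax)  # lr also from earlier
rec = RecorderCB(lr=_lr, mom=_beta1)
# then train with something like learn.fit(epochs, cbs=[BatchSchedCB(sched), rec]) and call rec.plot()
```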
So what's the theory here? Well, starting out at a low learning rate is particularly important if you have a not-perfectly-initialized model, which almost everybody almost always does. Even though we spent a lot of time learning to initialize models, we use a lot of models that get more complicated, and it takes a while before people figure out how to initialize more complex models properly. So for example, there's a very cool paper from 2019 where the team figured out how to initialize ResNets properly — we'll be looking at ResNets very shortly — and they discovered that when they did that, they did not need batch norm: they could train networks of 10,000 layers and get state-of-the-art performance with no batch norm. And there's actually been something similar for transformers, called T-Fixup, that does a similar kind of thing. But anyway, it is quite difficult to initialize models correctly, and most people fail to realize that they generally don't need tricks like warm-up and batch norm if they do initialize them correctly. In fact, T-Fixup explicitly looks at this: it looks at the difference between no warm-up versus with warm-up, with their correct initialization versus with normal initialization, and you can see the pictures they're showing — log-scale histograms of gradients — are very similar to our colorful dimension plots. I kind of like our colorful dimension plots better in some ways, because I think they're easier to read, although theirs are probably prettier. So there you go, Stefano, there's something to inspire you if you want to try more things with our colorful dimension plots; I think it's interesting that some papers are actually starting to use a similar idea — I don't know if they got it from us or came up with it independently, and it doesn't really matter. So we do a warm-up because, if our network's not quite initialized correctly, starting at a very low learning rate means it's not going to jump way outside the area where the weights even make sense, and then you gradually increase the learning rate as the weights move into a part of the space that does make sense. And during that time, while we have low learning rates, we use very high momentum: if the gradients keep moving in the same direction, the weights will move more and more quickly, but if they keep moving in different directions, the momentum is just going to average out to the underlying direction they're moving in. And then once you have got to a good part of the weight space, you can use a very high learning rate, and with a very high learning rate you wouldn't want so much momentum, so that's why there's low momentum during the time when there's a high learning rate. And then, as we saw in our spreadsheet (which did this automatically), as you get closer to the optimum you generally want to decrease the learning rate, and since we're decreasing it again, we can increase the momentum. So you can see that, starting from random weights, we've got a pretty good accuracy on Fashion-MNIST with a totally standard convolutional neural network — no ResNets, nothing else, everything built from scratch by hand, artisanal neural network training — and we've got 90.6 percent. All right, let's take a seven-minute break and I'll see you back shortly. I should warn you, we've got a lot more to cover, so I hope you're okay with a long lesson today. Okay, we're back. I just wanted to mention something we skipped over here, which is this HasLearnCB — this is more important for the people doing the live course than for the recordings; if you're watching the recording you will have already seen this. But since I created Learner, Peter Zappa (I don't know how to pronounce your surname, sorry Peter) pointed out that there's actually a nicer way of handling the Learner. Previously we were putting the learner object itself into self.learn in each callback, and that meant we were using self.learn.model and self.learn.opt and self.learn-everything all over the place, which was kind of ugly. So we've modified Learner this week to instead pass the learner in when it calls the callback: run_cbs, which is what the Learner calls, you might remember, now passes the learner as a parameter to the method. So now the Learner no longer goes through the callbacks and sets their .learn attribute; instead, in your callbacks you have to put learn as a parameter in all of the callback methods. So for example, DeviceCB has a before_fit, so now it's got ", learn" here, and now this is not self.learn, it's just learn. It does make a lot of the code less yucky, to not have all this self.learn.pred equals self.learn.model of self.learn.batch — it's now just learn.pred, learn.model, learn.batch.
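A sketch of what that change looks like for one callback; Callback, def_device and to_device are assumed to be the course's miniai helpers, and the exact method bodies may differ from the real DeviceCB.

```python
# Before the change, a callback reached everything via an attribute the learner set:
#     def before_fit(self): self.learn.model.to(self.device)
# Now run_cbs passes the learner in, so each callback method takes `learn` as a parameter.
class DeviceCB(Callback):
    def __init__(self, device=def_device): self.device = device
    def before_fit(self, learn): learn.model.to(self.device)
    def before_batch(self, learn): learn.batch = to_device(learn.batch, device=self.device)
```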

it also is good because you don't generally want to have um both have the learner um has a reference to the callbacks but also the callbacks having a reference back to the learner it creates something called a cycle so there's a couple of benefits there um and that reminds me there's a few other little changes we've made to the code and i want to show you a cool little trick i want to show you a cool little trick for how i'm going to find quickly all of the changes that we've made to the code in the last week so to do that we can go to the course repo and on any repo you can add slash compare in github and then you can compare across um you know all kinds of different things but one of the examples they've got here is to compare across different times look at the master branch now versus one day ago so i actually want the master branch now versus seven days ago so i just hit this change this to seven and there we go there's all my commits and i can immediately see the changes from last week um and so you can basically see what are the things i had to do when i change things so for example you can see here all of my self.learns became learns i added the nearly that's right i made augmentation and so in learner i added an lr find oh yes i will show you that one that's pretty fun so here's the changes we made to run cbs to fit so this is a nice way i can quickly yeah find out um what i've changed since last time and make sure that i don't forget to tell you folks about any of them oh yes clean up fit i have to tell you about that as well okay that's a useful reminder so um the main other change to mention is that calling the learning rate finder is now easier because i added what's called a patch to the learner um fast cause patch decorator that's you take a function and it will turn that function into a method of this class of whatever class you put after the colon so this has created a new method called lr find or learner dot lr find and what it does is it calls self.fit where self is a learner passing in however many epochs you set as the maximum you want to check for your learning rate finder what to start the learning rate at and then it says to use as callbacks the learning rate finder callback now this is new as well um self dot learn.fit didn't used to have a callbacks parameter um so that's very convenient because what it does is it adds those callbacks just during the fit so if you pass in callbacks then it goes through each one and appends it to self.cb's and when it's finished fitting it removes them again so these are callbacks that are just added for the period of this one fit which is what we want for a learning rate finder it should just be added for that one fit um so with this patch in place it says this is all it's required to do the learning rate finder is now to create your learner and call dot lr find and there you go bang so patch is a very convenient thing it's um one of these things which you know python has a lot of kind of like folk wisdom about what is and isn't considered pythonic or good and a lot of people uh really don't like patching um in other languages it's used very widely and is considered very good um so i i don't tend to have strong opinions either way about what's good or what's bad in fact instead i just you know figure out what's useful in a particular situation um so in this situation obviously it's very nice to be able to add in this additional functionality to our class so that's what lr find is um and then the only other thing we added to the 
learner uh this week was we added a few more parameters to fit fit used to just take the number of epochs um as well as the callbacks parameter it now also has a learning rate parameter and so you've always been able to provide a learning rate to um the constructor but you can override the learning rate for one fit so if you pass in the learning rate it will use it if you pass it in and if you don't it'll use the learning rate passed into the constructor and then i also added these two booleans to say when you fit do you want to do the training loop and do you want to do the validation loop so by default it'll do both and you can see here there's just an if train do the training loop if valid do the validation loop um i'm not even going to talk about this but if you're interested in testing your understanding of decorators you might want to think about why it is that i didn't have to say with torch.nograd but instead i called torch.nograd parentheses function that will be a very if you can get to a point that you understand why that works and what it does you'll be on your way to understanding decorators better okay so that is the end of excel sgd resnets okay so we are up to 90 point what was it three percent uh yeah let's keep track of this oh yeah 90.6 percent is what we're up to okay so to remind you the model um actually so we're going to open 13 resnet now um and we're going to do the usual important setup initially and the model that we've been using is the same one we've been using for a while which is that it's a convolution and an activation and an optimal optional batch norm and uh in our models we were using batch norm and applying our weight initialization the kiming weight initialization and then we've got comms that take the channels from 1 to 8 to 16 to 32 to 64 and each one's dried two and at the end we then do a flatten and so that ended up with a one by one so that's been the model we've been using for a while so the number of layers is one two three four so four four convolutional layers with a maximum of 64 channels in the last one so can we beat 90.9 about 90 and a half 90.6 can we beat 90.6 percent so before we do a resnet i thought well let's just see if we can improve the architecture thoughtfully so generally speaking um more depth and more channels gives the neural net more opportunity to learn and since we're pretty good at initializing our neural nets and using batch norm we should be able to handle deeper so um one thing we could do is we could let's just remind ourselves of the previous version so we can compare is we could have our go up to 128 parameters now the way we do that is we could make our very first convolutional layer have a stride of one so that would be one that goes from the one input channel to eight output channels or eight filters if you like so if we make it a stride of one then that allows us to have one extra layer and then that one extra layer could again double the number of channels and take us up to 128 so that would make it uh deeper and effectively wider as a result um so we can do our normal batch norm 2d and our new one cycle learning rate with our scheduler um and the callbacks we're going to use is the device call back our metrics our progress bar and our activation stats looking for general values and i won't what have you watched them train because that would be kind of boring but if i do this with this deeper and eventually wider network this is pretty amazing we get up to 91.7 percent so that's like quite a big difference 
and literally the only difference to our previous model is this one line of code, which means that instead of going from one up to 64 channels, it goes from eight up to 128. So that's a very small change, but it massively improved things: the error rate's gone down by about a tenth — well over 10 percent, relatively speaking — so there's a huge impact we've already had, again in five epochs. So now what we're going to do is make it deeper still, but there comes a point... Kaiming He et al. noted that there comes a point where making neural nets deeper stops working well — and remember, this is the guy who created the initializer that we know and love — and he pointed out that even with that good initialization, there comes a time where adding more layers becomes problematic. And he pointed out something particularly interesting. He said, let's take a 20-layer neural network — this is in the paper called "Deep Residual Learning for Image Recognition" that introduced ResNets — and train it for, what's that, tens of thousands of iterations and track its test error. And now let's do exactly the same thing on an otherwise identical but deeper 56-layer network. And he pointed out that the 56-layer network had a worse error than the 20-layer one, and it wasn't just a problem of generalization, because it was worse on the training set as well. Now the insight that he had is: if you just set the additional 36 layers to be the identity — you know, identity matrices — they would do nothing at all, and so a 56-layer network is a superset of a 20-layer network, so it should be at least as good. But it's not, it's worse, so clearly the problem here is something about training it. And so he and his team came up with a really clever insight, which is: can we create a 56-layer network which has the same training dynamics as a 20-layer network, or even less? And they realized yes, you can: you could add something called a shortcut connection. Basically the idea is that normally we have our inputs coming into our convolution — so let's say that was our inputs, and here's our convolution, and here's our outputs. Now if we do this 56 times, that's a lot of stacked-up convolutions, which are effectively matrix multiplications, with a lot of opportunity for gradient explosions and all that fun stuff. So how could we make it so that we have convolutions, but with the training dynamics of a much shallower network? And here's what he did. He said, let's actually put two convs in here, to make it twice as deep — because we are trying to make things deeper — but then let's add what's called a skip connection, where instead of just being out = conv2(conv1(in)) (so this is conv1, this is conv2, and assume these include activation functions), let's make it out = conv2(conv1(in)) + in. Now if we initialize these such that at first they have weights of zero, then initially this will do nothing at all — it will output zero — and therefore at first you'll just get out = in, which is exactly what we wanted: we actually want it to be as if there are no extra layers. And so this way we end up with a network which can be deep, but also, at least when you start training, behaves as if it's shallow. It's called a residual connection because if we subtract in from both sides, we get out - in = conv2(conv1(in)); in other words, the difference between the end point and the starting point, which is the residual.
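Here is a bare-bones sketch of just that idea — it is not the course's ResBlock, which comes next; the name TinyResBlock is made up for illustration, and it only handles the easy case where the channel count and grid size don't change.

```python
import torch.nn as nn

class TinyResBlock(nn.Module):
    "Two convs on the main path, with the input added straight back on: out = conv2(conv1(in)) + in."
    def __init__(self, nf):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(nf, nf, 3, padding=1), nn.ReLU())
        self.conv2 = nn.Conv2d(nf, nf, 3, padding=1)
    def forward(self, x): return self.conv2(self.conv1(x)) + x
```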
So another way of thinking about it is that this block is calculating a residual — so there are a couple of ways of thinking about it — and this thing here is called the res block, or ResNet block. Okay, so Sam Watkins has just pointed out the confusion here, which is that this only works — let's put the "minus in" back over here — this only works if you can add these together. Now if conv1 and conv2 both have the same number of channels as in, the same number of filters, and they also have stride 1, then that will work fine: the output shape will be exactly the same as the input shape and you can add them together. But if they are not the same, then you're in a bit of trouble. So what do you do? The answer that Kaiming He et al. came up with is to add a conv on in as well, but to make it as simple as possible. We call this the identity conv — it's not really an identity anymore, but we're trying to make it as simple as possible, so that we do as little to mess up these training dynamics as we can. And the simplest possible convolution is a one-by-one filter — a one-by-one kernel, I guess we should call it, a one-by-one kernel size — and we can also add a stride or whatever if we want to. So let me show you the code. We're going to create something called a conv block, and the conv block is going to do the two convs. So we've got some number of input filters, some number of output filters, some stride, some activation function, possibly a normalization, and some kernel size. The second conv is actually going to go from output filters to output filters, because the first conv is going to be from input filters to output filters, so by the time we get to the second conv it's going to be nf to nf. The first conv we will set to stride one, and then the second conv will have the requested stride, and that way the two convs back to back will overall have the requested stride. So the combination of these two convs is going to take us from ni to nf in terms of the number of filters, and it's going to have the stride that we requested. So the conv block is a sequential block consisting of a convolution followed by another convolution, each one with the requested kernel size and the requested normalization layer, but the second conv won't have an activation function — I'll explain why in a moment. And so I mentioned that one way to make this behave as if it didn't exist would be to set the convolutional weights to zero and the biases to zero, but actually we would like to have correctly randomly initialized weights. So instead what we can do, if we're using batch norm, is initialize the batch norm weights of the second conv to zero. Now if you've forgotten what that means, go back and have a look at our implementation from scratch of batch norm, because the batch norm weights are the thing we multiply by: remember, in batch norm we subtract the exponential moving average mean, we divide by the exponential moving average standard deviation, but then we add back the batch norm's bias layer and we multiply by the batch norm's weights — well, the other way around, multiply by the weights first.
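Here is roughly what the conv block being described looks like in code; the conv helper is a sketch in the spirit of the one built in earlier lessons rather than the exact course code, and the batch-norm zero trick just mentioned is picked up again right below.

```python
import torch.nn as nn

def conv(ni, nf, ks=3, stride=2, act=nn.ReLU, norm=None, bias=None):
    "A conv layer, optionally followed by a norm layer and an activation (similar in spirit to the course's earlier helper)."
    if bias is None: bias = norm is None          # skip the conv bias when a norm layer follows
    layers = [nn.Conv2d(ni, nf, kernel_size=ks, stride=stride, padding=ks//2, bias=bias)]
    if norm: layers.append(norm(nf))
    if act: layers.append(act())
    return nn.Sequential(*layers)

def _conv_block(ni, nf, stride, act=nn.ReLU, norm=None, ks=3):
    "Two convs back to back: ni->nf at stride 1, then nf->nf at the requested stride, with no activation on the second."
    return nn.Sequential(conv(ni, nf, stride=1,      act=act,  norm=norm, ks=ks),
                         conv(nf, nf, stride=stride, act=None, norm=norm, ks=ks))
```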
So if we set the batch norm layer's weights to zero, we're multiplying by zero, and so this will cause the initial conv block output to be all zeros, and that gives us what we wanted: nothing's happening here, so we just end up with the input (with this possible id conv). So a res block is going to contain those convolutions in the convolution block we just discussed, and then we're going to need this id conv. The id conv is going to be a no-op — nothing at all — if the number of channels in is equal to the number of channels out, but otherwise we're going to use a convolution with a kernel size of one and a stride of one, and that is going to, with as little work as possible, change the number of filters so that they match. Also, what if the stride's not one? Well, if the stride is two — actually, this isn't going to work for any stride, it only works for a stride of two — if there's a stride of two, we will simply average, using average pooling. So this is just saying take the mean of every two-by-two set of items in the grid; we'll just take the mean. So we basically have here pool(id_conv(in)), if the stride is two and if the number of filters has changed, and that's the minimum amount of work. So here it is, here is the forward pass: we get our input, and on the identity connection we call pool (and if stride is one, that's a no-op, so it does nothing at all), and we do id_conv (and if the number of filters has not changed, that's also a no-op, so in that situation this is just the input), and then we add that to the result of the convs. And here's something interesting: we then apply the activation function to the whole thing. I wouldn't say this is the only way you can do it, but this is a way that works pretty well — apply the activation function to the result of the whole ResNet block — and that's why I didn't add an activation function to the second conv. So that's a res block; it's not a huge amount of code, right?
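Putting that together with the conv and _conv_block sketches above, the res block might look something like this; it's a sketch rather than the exact course code, and the batch-norm-to-zero initialization described above is assumed to happen separately in the model's init function.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    "Sketch of the res block described: two convs plus the minimal-work identity path."
    def __init__(self, ni, nf, stride=1, ks=3, act=nn.ReLU, norm=None):
        super().__init__()
        self.convs = _conv_block(ni, nf, stride, act=act, norm=norm, ks=ks)
        # identity path: do as little as possible to make the shapes match
        self.idconv = nn.Identity() if ni == nf else conv(ni, nf, ks=1, stride=1, act=None)
        self.pool = nn.Identity() if stride == 1 else nn.AvgPool2d(2, ceil_mode=True)
        self.act = act()
    def forward(self, x):
        # the activation is applied to the whole block, which is why the second conv has none
        return self.act(self.convs(x) + self.idconv(self.pool(x)))
```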
And so now I've literally copied and pasted our get_model, but everywhere that previously we had a conv, I've just replaced it with a res block. In fact, let's have a look at get_model: previously we started with a conv from 1 to 8; now we do a res block from 1 to 8, stride one. Then we added convs from number of filters i to number of filters i plus one; now it's a res block from number of filters i to number of filters i plus one. So it's exactly the same. One change I have made, though — and it doesn't actually make any difference at all, I think it's mathematically identical — is that previously the very last conv at the end went from the 128 channels down to the 10 channels, followed by a flatten; but that conv is actually working on a one-by-one input, so an alternative way, which I think makes it clearer, is to flatten first and then use a linear layer, because a conv on a one-by-one input is identical to a linear layer. If that doesn't immediately make sense, that's totally fine, but this is one of those places where you should pause and stop and think about why a conv on a one-by-one input is the same, and maybe go back to the Excel spreadsheet if you like, or the Python from-scratch conv we did, because this is a very important insight. So I think it's very useful with a more complex model like this to take a good look at it, to see exactly what the inputs and outputs of each layer are. So here's a little function called print_shape, which takes the things that a hook takes, and we will print out, for each layer, the name of the class, the shape of the input, and the shape of the output. So we can get our model, create our learner, and use our handy little Hooks context manager we built in an earlier lesson, and call the print_shape function, and then we will call fit for one epoch, doing just the evaluation, not the training; and if we use the single batch callback it'll just do a single batch, pass it through, and that hook will, as you see, print out for each layer the input shape and the output shape. So you can see we're starting with an input of batch size 1024, one channel, 28 by 28. Our first res block was stride one, so we still end up with 28 by 28, but now we've got eight channels, and then we gradually decrease the grid size to 14, to 7, to 4, to 2, to 1, as we gradually increase the number of channels. We then flatten it, which gets rid of that one-by-one, which allows us then to do the linear to go down to the 10. And then there's some discussion about whether you want a batch norm at the end or not; I was finding it quite useful in this case, so we've got a batch norm at the end. I think this is very useful, so I decided to create a patch for Learner called summary that does basically exactly the same thing, but as a markdown table. So if we create a train learner with our model and call .summary — this method is now available because it's been patched into the Learner — it's going to do exactly the same thing as our print, but more prettily, using a markdown table if it's in a notebook; otherwise it'll just print it. fastcore has a handy thing for keeping track of whether you're in a notebook, and in a notebook, to make something markdown you can just use IPython.display.Markdown, as you see. And the other thing that I added, as well as the input and the output, is the number of parameters, which we can calculate, as we've seen before, by summing up the number of elements for each parameter in that module. And I've kept track of that as well, so that at the end I can also print out the total number of parameters. So we've got a 1.2 million parameter model, and you can see that there are very few parameters in the early layers; nearly all the parameters are actually in the last layers. Why is that? Well, you might want to go back to our Excel convolutional spreadsheet to see this: you have a set of parameters for every input channel, they all get added up across each position of the three-by-three kernel, and then that's done for every output filter, every output channel that you want. In fact, let's take a look — let's just grab one particular layer. So create our model, and we'll just have a look at the sizes, and you can see here there is this 256 by 256 by 3 by 3, so that's a lot of parameters. Okay, so we can call lr_find on that and get a sense of what kind of learning rate to use, so I chose 2e-2, so 0.02. This is our standard training thing; you don't have to watch it train, I've just trained it, and look at this: by using a ResNet we've gone up from 91.7 — this just keeps getting better — to 92.2 in 5 epochs. So that's pretty nice, and you know, this ResNet is not anything fancy, it's the simplest possible res block, right? The model is literally copied and pasted from before, replacing each place it said conv with res block — we've just been thoughtful about it.
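The shape-printing hook mentioned above might look roughly like this; the usage comment assumes the Hooks context manager, single batch callback and learner classes from the earlier lessons, so treat those names as coming from the course notebooks rather than from this sketch.

```python
def _print_shape(hook, mod, inp, outp):
    # for each hooked layer: class name, input shape, output shape
    print(type(mod).__name__, tuple(inp[0].shape), tuple(outp.shape))

# used roughly like this, with Hooks and SingleBatchCB from earlier lessons:
#   with Hooks(model, _print_shape) as hooks:
#       learn.fit(1, train=False, cbs=SingleBatchCB())
#
# the `summary` patch described above does the same walk, but accumulates a markdown
# table row per layer plus sum(p.numel() for p in mod.parameters()) as the param count
```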
So we can call lr_find on that to get a sense of what kind of learning rate to use; I chose 2e-2, so 0.02. This is our standard training setup, and you don't have to watch it train, I've just trained it. And look at this: by using a ResNet we've gone up from 91.7, and it just keeps getting better, to 92.2 in five epochs. That's pretty nice, and this ResNet isn't anything fancy. It's the simplest possible ResBlock; the model is literally copied and pasted from before, with each conv replaced by a ResBlock. We've just been thoughtful about it.

And here's something very interesting: we can try lots of other ResNets by grabbing timm, Ross Wightman's PyTorch Image Models library. If you call timm.list_models('*resnet*'), there are a lot of ResNets, and I tried quite a few of them. One thing that's interesting is that if you look at the source code for timm, you'll see the various ResNets (resnet18, resnet18d and so on) are defined in a very nice way, using a very elegant configuration, so you can see exactly what's different; there's basically only one line of code different between the main ResNet variants. So I tried all the timm models I could find, and I even tried importing the underlying pieces and building my own ResNets from them, and the best I found was resnet18d. If I train it in exactly the same way, I get to 92 percent.

The interesting thing is that that's less than our 92.2, and it's not like I tried lots of things to get here: this was the very first thing I tried, whereas the resnet18d result came after trying lots of different timm models. What this shows is that a thoughtfully designed basic architecture goes a very long way; for this problem it's actually better than any of the PyTorch Image Models ResNets I could find. I think that's quite amazing, actually. It's really cool, and it shows that you can create a state-of-the-art architecture just by using some common sense. So I hope that's encouraging.
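The lesson doesn't show the exact calls, but the timm usage is roughly this (the in_chans and num_classes arguments are my guess at how you would adapt it to Fashion-MNIST):

```python
import timm

# see which resnet variants are available
print(timm.list_models('*resnet*')[:10])

# the variant that did best in the lesson's comparison; in_chans=1 because
# Fashion-MNIST is single-channel, num_classes=10 for the ten classes
model = timm.create_model('resnet18d', in_chans=1, num_classes=10)
```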
So anyway, we're up to 92.2 percent, and we're not done yet, because we haven't even talked about data augmentation. Let's keep going: we'll make everything the same as before, but before we do data augmentation, we're going to try to improve the model even further if we can. As I said, it wasn't constructed with any great care; we just took the convnet and replaced each conv with a ResBlock, so it's effectively twice as deep, because each conv block has two convolutions. But ResNets train better than convnets, so surely we could go deeper and wider still.

So I thought, okay, how could we go wider? Previously we were going from 8 channels up to 256; what if we could get up to 512? One way to do that would be to give our very first ResBlock a kernel size of five rather than three. That means each patch is five by five, so 25 inputs, and I think it's fair enough then to have 16 outputs. If I use a kernel size of five and 16 outputs and keep doubling as before, I end up at 512 rather than 256. That's the only change I made: add k=5 here and double all the sizes. And if I train that, wow, look at this: 92.7 percent. We're getting better still, and again it wasn't lots of trying and failing; it was just saying, well, this makes sense, and the first thing I tried just worked. We're just trying to use these sensible, thoughtful approaches.

The next thing I'm going to try isn't necessarily something to make it better, but something to make our ResNet more flexible. Our current ResNet is a bit awkward in that the number of stride-two layers has to be exactly big enough that the last of them ends up with a one-by-one output, so you can flatten it and do the linear layer. That's not very flexible: what if you've got a different input size? And 28 by 28 is a pretty small image. To create that situation, I made a get_model2 which goes less far: it has one less layer, so it only goes up to 256 despite starting at 16, and because it's got one less layer it ends up at two by two rather than one by one. So what do we do? Something very straightforward: we take the mean over the two by two. Taking the mean over the two by two gives us a batch-size by channels output, which is exactly what we can then put into our linear layer. This ridiculously simple thing is called a global average pooling layer; that's the Keras term. In PyTorch it's basically the same thing, called an adaptive average pooling layer, although in PyTorch you can give it an output other than one by one (nobody ever really uses it that way). Ours is actually a little more convenient than the PyTorch version because you don't have to flatten afterwards. So you can see that after our last ResBlock, which gives us a two-by-two output, we have a global average pool that just takes the mean, and then we do the linear layer and batch norm as usual.
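A sketch of that global average pooling layer (the class name is mine):

```python
import torch.nn as nn

class GlobalAvgPool(nn.Module):
    # take the mean over the two spatial dimensions: (B, C, H, W) -> (B, C)
    def forward(self, x): return x.mean((-2, -1))

# PyTorch's built-in equivalent needs an extra flatten afterwards:
# nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten())
```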
I also wanted to improve my summary patch to include not only the number of parameters but also the approximate number of megaFLOPs. A FLOP is a floating point operation. I'm not going to promise my calculation is exactly right; I think the basic idea is right, but really it's not FLOPs, it's the number of multiplications that I counted, so it's not perfectly accurate, although it's pretty indicative. So this is the same summary as before, but I've added a flops function where you pass in the weight matrix and the height and width of your grid. If the number of dimensions of the weight matrix is less than three, we're just doing something like a linear layer, so the number of elements is the number of operations, because it's just a matrix multiply. But if you're doing a convolution, so the weight has four dimensions, you do that matrix multiply at every position of the height-by-width grid, and that's how I calculate this flops-equivalent number.

If I run that on this model, we can now see the number of parameters compared to the previous ResNet model has gone from 1.2 million up to 4.9 million. The reason is the ResBlock that gets all the way up to 512; the way we did that is by making it a stride-one layer, which is why you can see it goes to two-by-two and stays at two-by-two. I wanted to make it as similar as possible to the previous model, with the same 512 final number of channels, and most of the parameters are in that last block, for the reason we just discussed. Interestingly, though, that's not as clear for the megaFLOPs: it is the largest, but while in terms of parameters that block has more than all the other ones added together by a lot, the same isn't true of megaFLOPs, because the first layer has to be computed 28 by 28 times whereas this layer only has to be computed two by two times. Anyway, I trained that and got a pretty similar result, 92.6.

That made me think: let's fiddle around with this a little more, to see what kinds of things would reduce the number of parameters and the megaFLOPs. The reason you care about reducing parameters is lower memory requirements; the reason you want to reduce the FLOPs is less compute. So here, what I've done is remove the line of code that takes it up to 512, so we don't have that layer anymore, and the number of parameters has gone down from 4.9 million to 1.2 million: not a huge impact on the megaFLOPs, but a huge impact on the parameters, which we've reduced by something like three-quarters just by getting rid of that. And if we take the very first ResBlock, why is it 5.3 megaFLOPs? Because although the very first conv starts with just one channel, our ResBlocks have two convs, so the second conv is a 16 by 16 by 5 by 5. I'm partly doing this to show you the actual details of this architecture, but I'm also showing it so that you can see how to investigate exactly what's going on in your own models; I really want you to try these things. If we train that one, interestingly, even though it's only about a quarter of the size, we get the same accuracy, 92.7. So that's interesting.

Can we make it faster? At this point the obvious place to look is that first ResBlock, because that's where the megaFLOPs are, and as I said, the reason is that it's got two convs, and the second one is 16 channels in, 16 channels out, with five-by-five kernels, applied across the whole 28 by 28 grid. That's the bulk of the compute. So what we could do is replace that ResBlock with just one convolution, and if we do, you'll see we've got rid of the 16 by 16 by 5 by 5 and we just have the 16 by 1 by 5 by 5, so the megaFLOPs have gone down from 18.3 to 13.3. The number of parameters hasn't really changed at all, because that block only had about 6,800 parameters. So be very careful: when you see people say their model has fewer parameters, that doesn't mean it's faster. There's no particular relationship between parameters and speed, and even counting megaFLOPs doesn't always work that well, because it doesn't account for how much data is moving through memory, but it's not a bad approximation here. So here's a model with far fewer megaFLOPs and, in this case, about the same accuracy as well. I think this is really interesting: we've managed to build a model with far fewer parameters and far fewer megaFLOPs and basically exactly the same accuracy, and remember, this is still way better than the resnet18d from timm. So we've built something that is fast, small and accurate.
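A sketch of the multiply-counting helper described above; as noted, it counts multiplications rather than true FLOPs, and the function name is mine:

```python
import torch

def _flops(weight, h, w):
    # linear-style weights (fewer than 3 dims): one multiply per weight element
    if weight.ndim < 3: return weight.numel()
    # conv weights (4 dims): the kernel multiplies are repeated at every
    # position of the h-by-w output grid
    if weight.ndim == 4: return weight.numel() * h * w
    return 0

# e.g. the 16-in, 16-out, 5x5 conv applied over a 28x28 grid costs roughly
# 16*16*5*5*28*28, i.e. about 5.0 million multiplies
print(_flops(torch.randn(16, 16, 5, 5), 28, 28) / 1e6)
```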
The obvious question is: what if we train for longer? And the answer is that if we train for 20 epochs (I'm not going to make you wait for it), the training accuracy gets up to 0.999, but the validation accuracy is worse, at 0.924. The reason is that after 20 epochs it has seen the same pictures so many times that it's just memorizing them, and once you start memorizing, things actually go downhill. So we need to regularize.

Now, something we have claimed in the past can regularize is weight decay, but here's where I'm going to point out that weight decay doesn't regularize at all if you use batch norm. It's fascinating: for years people didn't even seem to notice this, and then somebody finally wrote a paper pointing it out, and everyone went, oh wow, that's weird. But it's really obvious when you think about it. A batch norm layer has a single set of coefficients which multiplies an entire layer, so that set of coefficients could just be, say, the number 100 in every place, and that multiplies the entire previous weight matrix (the convolution kernel) by 100. As far as weight decay is concerned, that's not much of an impact at all, because the batch norm layer has very few weights, but it massively increases the effective scale of the weight matrix. So batch norm basically lets the neural net cheat, increasing the parameters nearly as much as it wants, indirectly, just by changing the batch norm layer's weights. So weight decay is not going to save us, and that's really important to recognize. With batch norm layers I don't see the point of it at all. There have been some studies of what it does, and it does have some weird second-order effects on the learning rate, but I don't think you should rely on them; you should use a scheduler for changing the learning rate rather than weird second-order effects caused by weight decay.

So instead we're going to do data augmentation, which means we modify every image a little bit, by a random change, so the model doesn't see the same image each time. There's no particular reason to implement these from scratch, to be honest; we have implemented them all from scratch in fastai, so you can certainly look them up if you're interested, but it's a little separate from what we're meant to be learning about, so I'm not going to go through it. If you're interested, go into fastai.vision.augment and you can see, for example, how we do flip (it's just x.transpose, which isn't that interesting), how we do cropping and padding, how we do random crops, and so on. fastai probably has the best implementation of these, but torchvision's are fine, so we'll just use those.

We created a batch transform callback before, which we used for normalization, if you remember. So what we could do is create a transform_batch function which transforms the inputs and transforms the outputs using two different functions, and that would give us an augmentation callback. Then you'd say: for the transform-batch function, in this case, we want to transform our x's, and we want to transform them using this module, which is a Sequential module that first does a random crop and then a random horizontal flip. Now, it seems weird to randomly crop a 28 by 28 image to get a 28 by 28 image, but we can add padding to it, so effectively it randomly adds padding on one or both sides and then crops back down. One thing I did change in the batch transform callback (I can't remember if I've mentioned this before, but I changed it slightly since we first wrote it) is that I added on_train and on_validate flags, so it only runs if you said you want it on training and it's training, or you want it on validation and it's not training; that's all the code is. Data augmentation generally speaking shouldn't be done on validation, so here on_validate is False.
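Here is a minimal sketch of that idea, assuming the miniai-style callback conventions from earlier lessons (a before_batch hook that receives the learner, and learn.batch holding an (xb, yb) pair); the notebook's own callback goes through a tfm_batch function, so treat these names and details as illustrative:

```python
import torch.nn as nn
import torchvision.transforms as T

# the augmentation pipeline: pad by one pixel, randomly crop back to 28x28,
# then randomly flip; wrapping in nn.Sequential lets it run on (GPU) tensors
augment_x = nn.Sequential(
    T.RandomCrop(28, padding=1),
    T.RandomHorizontalFlip(),
)

class BatchTransformCB:
    # sketch: transform the batch before it reaches the model, but only in the
    # phases you asked for (on_train / on_validate)
    def __init__(self, tfm_x, on_train=True, on_validate=False):
        self.tfm_x, self.on_train, self.on_validate = tfm_x, on_train, on_validate
    def before_batch(self, learn):
        training = learn.model.training
        if (self.on_train and training) or (self.on_validate and not training):
            xb, yb = learn.batch
            learn.batch = (self.tfm_x(xb), yb)
```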
So what I'm going to do first is use our classic single-batch callback trick and fit (in fact, fit for one epoch, just doing training). After I fit, I can grab the batch out of the learner, and this is quite a cool trick: it's a way to see exactly what the model sees, not relying on any approximations. Remember, when we fit, the learner puts the batch it's looking at into learn.batch, so if we fit for a single batch we can grab that batch back out and call show_images, and here you can see the little crop it's added.

Now, something you'll notice is that every single image in this batch (I grabbed just the first 16, since I don't want to show you 1024) has exactly the same augmentation, and that makes sense, because we're applying a batch transform. Why is this good, and why is it bad? It's good because this runs on the GPU, which is great, because nowadays it's often really hard to get enough CPU to feed your fast GPU, particularly if you use something like Kaggle or Colab, which are really underpowered for CPU (particularly Kaggle). This way, all of our augmentation happens on the GPU. On the downside, it means there's a little less variety: every mini-batch has the same augmentation. I don't think that downside matters, though, because the model is going to see lots of mini-batches, and the fact that each mini-batch gets a different augmentation is really all I care about. You can see that if we run this multiple times, each mini-batch has a different augmentation.

Okay, so I decided to use a padding of just one, so a very small amount of data augmentation, and to do 20 epochs using a 1-cycle learning rate. This takes quite a while to train, so we won't watch it, but check this out: we get to 93.8 percent. That's pretty wild. I actually went on Twitter and said to the entire world (if you're watching this in 2023 and Twitter doesn't exist anymore, ask somebody to tell you what it used to be; it still does for now): can anybody beat this in 20 epochs? You can use any model you like, any library you like. And nobody has got anywhere close. So this is pretty amazing, and when I had a look at Papers with Code, you can see it's right up there with the best models listed, certainly better than some of these, and the better models all use 250 or more epochs. So I'm hoping somebody watching this will find a way to beat it in 20 epochs; that would be really great, because as you can see, we haven't done anything amazingly, weirdly clever. It's all very, very basic.
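The 1-cycle schedule here comes from the course's own scheduler callback; purely for reference, the equivalent wiring with PyTorch's built-in OneCycleLR looks roughly like this (the model, learning rate and step counts below are placeholders):

```python
import torch
from torch import nn, optim

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # placeholder model
opt = optim.AdamW(model.parameters(), lr=2e-2)
epochs, steps_per_epoch = 20, 100          # steps_per_epoch would be len(train_dl)
sched = optim.lr_scheduler.OneCycleLR(opt, max_lr=2e-2,
                                      epochs=epochs, steps_per_epoch=steps_per_epoch)

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        # ...forward pass, loss and backward would go here...
        opt.step()
        opt.zero_grad()
        sched.step()                       # the 1-cycle schedule steps once per batch
```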
We can actually go even a bit further than 93.8. Just before we do: I mentioned that since this now takes a while to train (I can't remember exactly, something like 10 to 15 seconds per epoch, so you're waiting a few minutes), you may as well save it. You can just call torch.save on a model and load it back later.

Something that can make things even better is called test time augmentation, or TTA. Test time augmentation actually runs our batch transform callback on validation as well. In this case we're going to do a very, very simple version of it: we add a batch transform callback that runs on validate, and it's not random; it just always does a horizontal flip. And check this out: we create a new callback called CapturePreds, and after each batch it appends the predictions to one list and the targets to another. That way we can just call learn.fit(train=False) and it will show us the accuracy, which is the same number we saw before. Then we can do exactly the same thing but with a different callback, the horizontal-flip one, so it does the same evaluation with every image flipped, and weirdly enough that accuracy is slightly higher. That's not the interesting bit, though. The interesting bit is that we've now got two sets of predictions: the predictions on the un-flipped images and the predictions on the flipped images. We can stack those together and take the mean, so we're averaging the flipped and un-flipped predictions, and that gives us a better result still: 94.2 percent. Why is it better? Because looking at the image from multiple different directions gives the model more opportunities to understand what it's a picture of; in this case I'm just giving it two directions, flipped and un-flipped, and taking their average. So this is a really nice little trick.
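A sketch of the CapturePreds idea and the prediction averaging, again assuming miniai-style hooks and a learn.preds attribute; names are illustrative:

```python
import torch

class CapturePreds:
    # collect predictions and targets for every batch of a validation-only fit
    def before_fit(self, learn):
        self.all_preds, self.all_targs = [], []
    def after_batch(self, learn):
        self.all_preds.append(learn.preds.detach().cpu())
        self.all_targs.append(learn.batch[1].detach().cpu())

def tta_mean_accuracy(preds_plain, preds_flipped, targs):
    # average the un-flipped and flipped predictions, then measure accuracy
    avg = torch.stack([preds_plain, preds_flipped]).mean(0)
    return (avg.argmax(dim=1) == targs).float().mean().item()
```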

Sam has pointed out that it's a bit like random forests, which is true; it's a kind of bagging we're doing, getting multiple predictions and bringing them together. So 94.2 is, I think, my best 20-epoch result, and notice I didn't have to do any additional training, so it still counts as a 20-epoch result. You can do test time augmentation with a much wider range of the augmentations you trained with and use them at test time as well: more crops, rotations, warps, and so on.

Now I want to show you one of my favourite data augmentation approaches, which is called random erasing. I'll show you what it looks like: we basically delete a little bit of each picture and replace it with some random Gaussian noise. In this case we've just got one patch, but eventually we'll do more than one. I wanted to implement this because, remember, we have to implement everything from scratch, and this one's a bit less trivial than the previous transforms, so we should do it from scratch. Also, I'm not sure there are that many good implementations around (Ross Wightman's timm has one, I think), and it's a very good exercise to see how to implement it from scratch.

So let's grab a batch out of the training set, take the first 16 images, and get the mean and standard deviation. What we want to do is delete a patch from each image, but rather than actually deleting it (if we set those pixels to zero, the mean and standard deviation would no longer be 0 and 1), we replace it with pixels that have exactly the same mean and standard deviation as our data set, so the statistics don't change. That's why we grabbed the mean and standard deviation. Say we want to delete 0.2, so 20 percent, of the height and width: we work out how big that is, 0.2 of the height and of the width, which gives us the size in x and y, and then we just randomly pick a starting point. In this example the starting point for x is 14 and for y is zero, and it's going to be a five-by-five patch, and then we do a Gaussian (normal) initialization of that x slice and y slice, for everything in the batch and every channel, with that mean and standard deviation. So it's just that tiny little bit of code. You'll see I don't start by writing a function; I start by writing single lines of code that I can run independently, make sure they all work, and look at the pictures to make sure it's doing the right thing.

Now, one thing that's wrong here is that some of the images look black and some look gray. At first this was confusing me: what has changed? The original images didn't look like that. And I realized the problem is that the minimum and maximum have changed. It used to go from about -0.8 to 2, and now it goes from about -3 to 3. The noise we added has the same mean and standard deviation, but it doesn't have the same range, because the pixels weren't normally distributed in the first place.
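Pulled together into a single function, the steps just walked through look roughly like this (in the lesson they are first run as separate notebook lines; the function name is mine):

```python
import torch

def erase_one_patch(xb, pct=0.2):
    # xb is a batch of images (B, C, H, W); erase one pct*H by pct*W patch,
    # filling it with noise that has the batch's own mean and std so the
    # overall statistics barely change
    mean, std = xb.mean().item(), xb.std().item()
    szx, szy = int(pct * xb.shape[-2]), int(pct * xb.shape[-1])
    stx = int(torch.randint(0, xb.shape[-2] - szx, (1,)))
    sty = int(torch.randint(0, xb.shape[-1] - szy, (1,)))
    xb[:, :, stx:stx + szx, sty:sty + szy].normal_(mean, std)
    return xb
```

As discussed next, this version still has the range problem: the noise is not clamped to the original pixel range.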
So normally distributed noise is actually wrong here. To fix that, I created a new version, now in a function, that does all the same stuff as before but clamps the random pixels to be between the original min and max. It's exactly the same thing, but it makes sure the range doesn't change, which I think is really important, because changing the range really impacts your activations quite a lot. Here's what that looks like: as you can see, all of the backgrounds now have that nice black, and it's still giving me random pixels. I can check, and because of the clamping the mean and standard deviation aren't quite 0 and 1 anymore, but they're very, very close, so I'm going to call that good enough; and of course the min and max haven't changed, because I clamped them to ensure they didn't.

So that's my random erasing; it randomly erases one block. Then I can create a rand_erase which randomly chooses up to, in this case, four blocks. With that function... oh, that's annoying, it happened to choose zero blocks this time; I'll just run it again, and this time it's got three, so that's good... actually maybe that's four: one, two, three, four blocks. So that's what this data augmentation looks like. We can then create a class to do it: you pass in what percentage to erase in each block and the maximum number of blocks, store those away, and in the forward we just call our random-erase function, passing in the input and those parameters. Great. So now we can use random crop, random flip and random erase, check that a batch looks okay, and go all the way up to 50 epochs. If I run this for 50 epochs, I get 94.6. Isn't that crazy? We're really right up there now; we're even above this one, so we're somewhere up here, and this is the kind of thing people wrote papers about in 2019 and 2020. Oh look, here's the random erasing paper, that's cool; they were way ahead of their time in 2017, although that would have been trained for a lot longer.
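A sketch of the clamped version and the module wrapper described above (names and details are illustrative, not necessarily the notebook's exact code):

```python
import torch
from torch import nn

def rand_erase_block(xb, pct=0.2):
    # like the unclamped version above, but clamp the noise to the batch's
    # existing min/max so the value range of the activations doesn't change
    mean, std = xb.mean().item(), xb.std().item()
    xmin, xmax = xb.min().item(), xb.max().item()
    szx, szy = int(pct * xb.shape[-2]), int(pct * xb.shape[-1])
    stx = int(torch.randint(0, xb.shape[-2] - szx, (1,)))
    sty = int(torch.randint(0, xb.shape[-1] - szy, (1,)))
    xb[:, :, stx:stx + szx, sty:sty + szy].normal_(mean, std).clamp_(xmin, xmax)
    return xb

class RandErase(nn.Module):
    # apply up to max_num randomly placed erased blocks to each batch
    def __init__(self, pct=0.2, max_num=4):
        super().__init__()
        self.pct, self.max_num = pct, max_num
    def forward(self, x):
        for _ in range(int(torch.randint(0, self.max_num + 1, (1,)))):
            rand_erase_block(x, self.pct)
        return x
```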
Now, I was having a think, and I realized something: how do we actually get the correct distribution here? In some ways it shouldn't matter, but I was bothered by the fact that we don't actually end up with exactly 0 and 1, and the clamping all feels a bit weird. How do we replace these pixels with something that is guaranteed to have the correct distribution? And I realized there's a very simple answer: we could copy another part of the picture over here. If we copy part of the picture, we're guaranteed to have the correct distribution of pixels. It wouldn't exactly be random erasing anymore; it would be random copying. Now, I'm sure somebody else has invented this; I'm not saying nobody's ever thought of it before, so if anybody knows a paper that's done this, please tell me about it. But I think it's a very sensible approach, and it's very easy to implement.

So again we implement it all manually: get our x mini-batch, get the size as before, get the x and y of the patch we're going to replace (this time we're not erasing it, we're copying over it), then randomly pick a different x and y to copy from, and now, instead of filling in random noise, we just replace that slice of the batch with this other slice of the batch. You end up with, as you can see here, little bits copied across; some of them you can't really see at all, because I think some copied black onto black, but it's knocked the end off this shoe and added a little bit extra here and there. Once I've tested it in the notebook, we turn it into a function (in this case it's often copying from regions that are largely black), and then again we can do the thing where we apply it multiple times, and here we go: now it's got a couple of random copies. So again we turn that into a class, create our transforms, have a look at a batch to make sure it looks sensible, and train for just 25 epochs this time, and it gets to 94 percent.

Now, why did I do 25 epochs? Because I was trying to think about how to beat my 50-epoch record, which was 94.6, and I thought: I could train for 25 epochs, then train a whole new model for a different 25 epochs, in a different learner, learn2. This one got 94.1, so one of the models was 94.1 and one was 94. You can maybe guess what we're going to do next: it's a bit like test time augmentation, but instead we grab the predictions of our first learner and the predictions of our second learner, stack them up, and take their mean. This is called ensembling, and not surprisingly the ensemble is better than either of the two individual models, at 94.4, although unfortunately we didn't beat our best. It's still a useful trick, and it was an interesting experiment: using the exact same total number of epochs, can I get a better result by ensembling instead of training for longer? The answer was no; maybe the random copy isn't as good, or maybe I'm using too much augmentation, who knows, but it's something you could experiment with. Someone in the chat mentions that CutMix is similar, which is a good point; I'd forgotten CutMix. CutMix copies from different images rather than from the same image, but yes, it's pretty much the same idea; very similar.
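And a sketch of the random-copy variant (again the name is mine; the notebook builds it up line by line first):

```python
import torch

def copy_one_patch(xb, pct=0.2):
    # pick a destination patch and a different, randomly chosen source patch of
    # the same size, and copy the source pixels over the destination, so the
    # replacement pixels are guaranteed to have the right distribution
    szx, szy = int(pct * xb.shape[-2]), int(pct * xb.shape[-1])
    stx  = int(torch.randint(0, xb.shape[-2] - szx, (1,)))
    sty  = int(torch.randint(0, xb.shape[-1] - szy, (1,)))
    stx2 = int(torch.randint(0, xb.shape[-2] - szx, (1,)))
    sty2 = int(torch.randint(0, xb.shape[-1] - szy, (1,)))
    src = xb[:, :, stx2:stx2 + szx, sty2:sty2 + szy].clone()  # clone in case the patches overlap
    xb[:, :, stx:stx + szx, sty:sty + szy] = src
    return xb
```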
All right, so that brings us to the end of the lesson, and I am so pumped and excited to share this with you, because I don't think this has ever been done before: to go from scratch, step by step, to an absolute state-of-the-art model where we build everything ourselves, it runs this quickly, and we're even using our own custom ResNet, just using common sense at every stage. Even in our previous courses we've never done this. Hopefully that shows that deep learning is not magic; we really can build the pieces ourselves. And as you'll see when we go up to larger data sets, absolutely nothing changes: it's exactly these techniques. This is actually how I do 99 percent of my research, on very small data sets, because you can iterate much more quickly and understand them much better, and I don't think there's ever been a time when I've then gone up to a bigger data set and my findings didn't continue to hold true.

Now, homework. What I would really like you to do is the thing that I didn't do, which is to create your own schedulers that work with PyTorch's optimizers. The tricky bit will be making sure you understand the PyTorch API well, which I've really laid out here, so study it carefully. Create your own cosine annealing scheduler from scratch, then create your own 1-cycle scheduler from scratch, and make sure they work correctly with this batch scheduler callback. This will be a very good exercise in, hopefully, getting extremely frustrated as things don't work the way you hoped, being mystified for a while, and then working through it using this very step-by-step approach: lots of experimentation, lots of exploration, and then figuring it out. That's the journey I'm hoping you have. If it's all super easy and you get it on the first go, then you'll have to find something else to do, but I'm hoping you'll find it surprisingly tricky to get everything working properly, and in the process you'll do a lot of exploration and experimentation. You'll also realize that it requires no prerequisite knowledge at all: if it doesn't work first time, it's not because of something you didn't learn in graduate school, or because if only you'd done a PhD you'd be fine; it's just that you need to dig through slowly and carefully to see how it all works. And then see how neat and concise you can get it.

The other homework is to try to beat me. I really, really want people to beat me: try to beat me on the 5-epoch, 20-epoch or 50-epoch Fashion-MNIST results, ideally using miniai with things you've added yourself. You can try grabbing other libraries if you like, but ideally, if you do grab another library and find you can beat my approach, try to re-implement that library; that way you're still within the spirit of the game.

In our next lesson, Johno, Tanishq and I are going to be putting this all together to create a diffusion model from scratch, and we're actually going to take a couple of lessons for this, covering not just a diffusion model but a variety of interesting generative approaches. So we're starting to come full circle. Thank you so much for joining me on this very extensive journey, and I look forward to hearing what you come up with. Please do come and join us on forums.fast.ai and share your progress. Bye!