Welcome back. And here is lesson 4, which is where we get deep into the weeds of exactly what is going on when we are training a neural network. We started looking at this in the previous lesson, where we were looking at stochastic gradient descent. And so to remind you, we were looking at what Arthur Samuel said.
Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment, or we would call it parameter assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize that performance. So we could make that entirely automatic and a machine so programmed would learn from its experience.
And that was our goal. So our initial attempt on the MNIST data set was not really based on that. We didn't really have any parameters. So then last week we tried to figure out how we could parameterize it, how we could create a function that had parameters. And what we thought we could do would be to have something where say the probability of being some particular number was expressed in terms of the pixels of that number and some weights, and then we would just multiply them together and add them up.
So we looked at how stochastic gradient descent worked last week. And the basic idea is that we start out by initializing the parameters randomly. We use them to make a prediction using a function such as this one. We then see how good that prediction is by measuring using a loss function.
We then calculate the gradient which is how much would the loss change if I changed one parameter by a little bit. We then use that to make a small step to change each of the parameters by a little bit by multiplying the learning rate by the gradient to get a new set of predictions.
And so we went round and round and round a few times until eventually we decided to stop. And so these are the basic seven steps that we went through. And so we did that for a simple quadratic equation. And we had something which looked like this. And so by the end, we had this nice example of a curve getting closer and closer and closer.
So I have a little summary at the start of this section, summarizing gradient descent, that Sylvain and I have in the notebooks and in the book of what we just did. So you can review that and make sure it makes sense to you. So now let's use this to create our MNIST threes versus sevens model.
And so to create a model, we're going to need to create something that we can pass into a function like, let's see where it was, passing to a function like this one. So we need just some pixels that are all lined up and some parameters that are all lined up.
And then we're going to sum them up. So our X's are going to be pixels. And so in this case, because we're just going to multiply each pixel by a parameter and add them up, the fact that they're laid out in a grid is not important. So let's reshape those grids and turn them into vectors.
The way we reshape things in PyTorch is by using the View method. And so the View method, you can pass to it how large you want each dimension to be. And so in this case, we want the number of columns to be equal to the total number of pixels in each picture, which is 28 times 28, because they're 28 by 28 images.
And then the number of rows will be however many rows there are in the data. And so if you just use minus one, when you call View, that means, you know, as many as there are in the data. So this will create something of the same with the same total number of elements that we had before.
So we can grab all our threes. We can concatenate them, torch.cat, with all of our 7s, and then reshape that into a matrix where each row is one image with all of the rows and columns of the image all lined up in a single vector. So then we're going to need labels.
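Roughly, following the notebook (stacked_threes and stacked_sevens are the stacked image tensors built earlier in the notebook):

```python
from fastai.vision.all import *   # the notebook's standard import (brings in torch, tensor, L, plt, ...)

# flatten each 28x28 image into a 784-long vector; -1 means "as many rows as the data has"
train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28*28)
```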
So that's our x. So we're going to need labels. Our labels will be a 1 for each of the threes and a 0 for each of the 7s. So basically we're going to create an is3 model. So that's going to create a vector. We actually need it to be a matrix in PyTorch.
So unsqueeze will add an additional unit dimension to wherever I've asked for. So here in position 1. So in other words, this is going to turn it from something which is a vector of 12,396 long into a matrix with 12,396 rows and one column. That's just what PyTorch expects to see.
So now we're going to turn our x and y into a data set. And a data set is a very specific concept in PyTorch. It's something which we can index into using square brackets. And when we do so, it's expected to return a tuple. So here if we look at, we're going to create this data set.
And when we index into it, it's going to return a tuple containing our independent variable and our dependent variable for each particular row. And so to do that, we can use the Python zip function, which takes one element of the first thing and combines it with, concatenates it with one element of the second thing.
And then it does that again and again and again. And so then if we create a list of those, it gives us a data set. It gives us a list, which when we index into it, it's going to contain one image and one label. And so here you can see why there's my label and my image.
I won't print out the whole thing, but it's a 784 long vector. So that's a really important concept. A data set is something that you can index into and get back a tuple. And here I am, this is called destructuring the tuple, which means I'm taking the two parts of the tuple and putting the first part in one variable and the second part in the other variable, which is something we do a lot in Python.
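A sketch of the labels and the dataset, as in the notebook (threes and sevens are assumed to be the lists of image files from earlier):

```python
# 1 for each three, 0 for each seven; unsqueeze(1) turns the vector into a one-column matrix
train_y = tensor([1]*len(threes) + [0]*len(sevens)).unsqueeze(1)

# a dataset: something we can index into and get back an (x, y) tuple
dset = list(zip(train_x, train_y))
x, y = dset[0]        # destructuring the tuple
x.shape, y            # a 784-long vector and its label
```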
It's pretty handy. A lot of other languages support that as well. Repeat the same three steps for a validation set. So we've now got a training data set and a validation data set. Right. So now we need to initialize our parameters. And so to do that, as we've discussed, we just do it randomly.
So here's a function that, given some size, some shape if you like, will randomly initialize using a normal random number distribution in PyTorch. That's what randn does. And we can hit Shift + Tab to see how that works. Okay. And it says here that it's going to have a variance of 1.
So I probably shouldn't have called this standard deviation. I probably should call this variance actually. So multiply it by the variance to change its variance to whatever is requested, which will default to 1. And then as we talked about when it comes to calculating our gradients, we have to tell PyTorch which things we want gradients for.
And the way we do that is requires grad underscore. Remember this underscore at the end is a special magic symbol, which tells PyTorch that we want this function to actually change the thing that it's referring to. So this will change this tensor such that it requires gradients. So here's some weights.
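That initializer looks roughly like this:

```python
def init_params(size, std=1.0):
    # normally distributed random numbers, scaled by std, with gradient tracking turned on
    return (torch.randn(size)*std).requires_grad_()

weights = init_params((28*28, 1))   # one weight per pixel, laid out as a column
```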
So our weights are going to need to be 28*28 by 1 in shape: 28 by 28 because every pixel is going to need a weight, and then 1 because we're going to need to have that unit axis to make it into a column.
So that's what PyTorch expects. So there's our weights. Now just weights times pixels actually isn't going to be enough, because weights times pixels will always equal 0 when the pixels are equal to 0. It has a 0 intercept. So we really want something like wx plus b, a line.
So the b is what we call the bias. And so that's just going to be a single number. So let's grab a single number for our bias. So remember I told you there's a difference between the parameters and the weights, strictly speaking. So here the weights are the w in this equation, the bias is the b in this equation, and the weights and bias together are the parameters of the function.
They're all the things that we're going to change. They're all the things that have gradients that we're going to update. So there's an important bit of jargon for you. The weights and biases of the model are the parameters. So we can, yes question. What's the difference between gradient descent and stochastic gradient descent?
So far we've only done gradient descent. We'll be doing stochastic gradient descent in a few minutes. So we can now calculate a prediction for one image. So we can take an image, such as the first one, and multiply by the weights. We need to transpose them to make them line up in terms of the rows and columns, add it up, and add the bias, and there is a prediction.
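With the bias added, the single-image prediction is roughly:

```python
bias = init_params(1)   # the bias: just a single number

# prediction for the first image: multiply each pixel by its weight, sum, add the bias
(train_x[0]*weights.T).sum() + bias
```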
We want to do that for every image. We could do that with a for loop and that would be really, really slow. It wouldn't run on the GPU and it wouldn't run in optimized C code. So whenever we want to do something like looping over pixels or looping over images, we always need to try to make sure we're doing it without a Python for loop. In this case, doing this calculation for lots of rows and columns is a mathematical operation called matrix multiplication. So if you've forgotten your matrix multiplication, or maybe never quite got around to it at high school,
It would be a good idea to have a look at Khan Academy or something to learn about what it is, but it's actually, I'll give you the quick answer. This is from Wikipedia. If these are two matrices A and B, then this element here 1, 2 in the output is going to be equal to the first bit here times the first bit here plus the second bit here times the second bit here.
So it's going to be B12 times A11 plus B22 times A12. That's, you can see the orange matches the orange. Ditto for over here. This would be equal to B13 times A31 plus B23 times A32 and so forth for every part. Here's a great picture of that in action.
If you look at matrixmultiplication.xyz, another way to think of it is we can kind of flip the second matrix over on top and then multiply each bit together and add them up, multiply each bit together and add them up. And you can see the second one here always ends up in the second spot and the first one ends up in the first spot.
And that's what matrix multiplication is. So we can do our multiply-and-add-up by using matrix multiplication. And in Python, and therefore PyTorch, matrix multiplication is the @ sign operator. So when you see @ that means matrix multiply. So here is our 20.2336. If I do a matrix multiply of our training set by our weights and then add the bias, here is our 20.2336 again for the first one.
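As a function, applied to the whole training set at once:

```python
def linear1(xb):
    return xb@weights + bias   # @ is matrix multiply

preds = linear1(train_x)       # one prediction per image; the first matches the 20.2336 above
```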
And you can see though it's doing every single one. So that's really important is that matrix multiplication gives us an optimized way to do these simple linear functions for as many kind of rows and columns as we want. So this is one of the two fundamental equations of any neural network.
Some rows of data, rows and columns of data, matrix multiply, some weights, add some bias. And the second one which we'll see in a moment is an activation function. So that is some predictions from our randomly initialized model. So we can check how good our model is. And so to do that we can decide that anything greater than zero we will call a 3 and anything less than zero we will call a 7.
So preds greater than zero tells us whether or not something is predicted to be a 3 or not. Then turn that into a float. So rather than true and false make it 1 and 0 because that's what our training set contains. And then check whether our thresholded predictions are equal to our training set.
And this will return true every time a row is correctly predicted and false otherwise. So if we take all those trues and falses and turn them into floats so that'll be ones and zeros and then take their mean it's 0.49. So not surprisingly our randomly initialized model is right about half the time at predicting threes from sevens.
I added one more method here, which is .item. Without .item this would return a tensor. It's a rank-0 tensor: it has no rows, it has no columns, it's just a number on its own. But I actually wanted to unwrap it to create a normal Python scalar, mainly just because I wanted to easily see the full set of decimal places.
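The accuracy check, roughly as in the notebook:

```python
corrects = (preds > 0.0).float() == train_y   # threshold the predictions, compare to the labels
corrects.float().mean().item()                # about 0.49 with random weights
```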
And the reason for that is I want to show you how we would calculate the derivative of the accuracy by changing a parameter by a tiny bit. So let's take one parameter, which will be weight 0, and multiply it by 1.0001. And so that's going to make it a little bit bigger.
And then if I calculate how the accuracy changes based on the change in that weight that will be the gradient of the accuracy with respect to that parameter. So I can do that by calculating my new set of predictions and then I can threshold them and then I can check whether they're equal to the training set and then take the mean and I get back exactly the same number.
So remember that gradient is equal to rise over run if you remember back to your calculus or if you'd forgotten your calculus hopefully you've reviewed it on Khan Academy. So the change in the y so y new minus y old which is 0.4912 etc minus 0.4912 etc which is 0 divided by this change will give us 0.
So at this point we have a problem our derivative is 0 so we have 0 gradients which means our step will be 0 which means our prediction will be unchanged. Okay so we have a problem and our problem is that our gradient is 0 and with a gradient of 0 we can't take a step and we can't get better predictions.
And so intuitively speaking, the reason that our gradient is 0 is because when we change a single parameter by a tiny bit, we might not change any actual prediction from a 3 to a 7 or vice versa, because we have this threshold.
Okay and so in other words our accuracy loss function here is very bumpy it's like flat step flat step flat step so it's got this 0 gradient all over the place. So what we need to do is use something other than accuracy as our loss function. So let's try and create a new function and what this new function is going to do is it's going to give us a better value kind of in much the same way that accuracy gives a better value.
So this is the loss; remember a small loss is better, so it'll give us a lower loss when the accuracy is better, but it won't have a 0 gradient. So it means that a slightly better prediction needs to have a slightly better loss. So let's have a look at an example. Let's say our targets, our is-3 labels, are 1, 0, 1; there are just three rows, three images, here. And we've made some predictions from a neural net, and those predictions are 0.9, 0.4, and 0.2.
So now consider this loss function. We're going to use torch.where, which is basically the same as this list comprehension; it's basically an if statement. It's going to say: where target equals 1, return 1 minus the prediction, so here target is 1, so it'll be 1 minus 0.9; and where target is not 1, just return the prediction.

So for these examples here: the first one has target equals 1, so it will be 1 minus 0.9, which is 0.1; the next one has target equals 0, so it will just be the prediction, 0.4; and for the third one the target is 1, so it'll be 1 minus the prediction, which is 0.8. And so you can see here that when the prediction is correct, in other words when it's a high number where the target is 1 and a low number where the target is 0, these numbers are going to be smaller.

So the worst one is when we predicted 0.2: we really thought that was actually a 0, but it's actually a 1, so we ended up with 0.8 here, because 1 minus the prediction, 1 minus 0.2, is 0.8. So we can then take the mean of all of these to calculate a loss.
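The worked example and this first version of the loss, roughly as in the notebook:

```python
trgts = tensor([1, 0, 1])          # the is-3 labels
prds  = tensor([0.9, 0.4, 0.2])    # predictions for those three images

def mnist_loss(predictions, targets):
    # 1 - prediction where the target is 1, the prediction itself where it is 0
    return torch.where(targets==1, 1-predictions, predictions).mean()

mnist_loss(prds, trgts)                       # about 0.4333
mnist_loss(tensor([0.9, 0.4, 0.8]), trgts)    # about 0.2333: fixing the bad prediction lowers the loss
```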
So if you think about it, this loss will be the smallest if the predictions are exactly right. If the predictions were actually identical to the targets, then this would be 0, 0, 0; or if they were exactly wrong, let's say they were 1 minus the targets, then it would be 1, 1, 1.

So the loss will be better, i.e. smaller, when the predictions are closer to the targets. And so here we can now take the mean, and when we do, we get 0.433. So let's say we change this last bad one, this inaccurate prediction, from 0.2 to 0.8, and the loss gets better, from 0.43 to 0.23.

So this function is just torch.where followed by mean. This is actually pretty good: it's a loss function which pretty closely tracks accuracy, in that where the accuracy is better, the loss will be smaller. But it also doesn't have these zero gradients, because every time we change the prediction, the loss changes, because the prediction is literally part of the loss. That's pretty neat, isn't it?
One problem is this is only going to work well as long as the predictions are between 0 and 1 otherwise this 1 minus prediction thing is going to look a bit funny. So we should try and find a way to ensure that the predictions are always between 0 and 1 and that's also going to just make a lot more intuitive sense because you know we like to be able to kind of think of these as if they're like probabilities or at least nicely scaled numbers.
So we need some function that can take these big numbers and turn them all into numbers between 0 and 1, and it so happens that we have exactly the right function: it's called the sigmoid function. So the sigmoid function looks like this.
If you pass in a really small number you get a number very close to 0 if you pass in a big number you get a number very close to 1 it never gets past 1 and it never goes smaller than 0 and then it's kind of like the smooth curve between and in the middle it looks a lot like the y = x line.
This is the definition of the sigmoid function: it's 1 over 1 plus e to the minus x. What is exp? torch.exp(x) is just e to the power of x. And e is just a number, like pi; it's just a number that has a particular value.
So if we go e squared, and then compute the same thing with torch.exp, making it a float tensor in PyTorch, there we go: you can see that these are the same number. So that's what torch.exp means. Okay, so, you know, for me, when I see these kinds of interesting functions, I don't worry too much about the definition.
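For reference, the definition is a one-liner (PyTorch also provides this as torch.sigmoid):

```python
def sigmoid(x):
    return 1/(1+torch.exp(-x))   # squashes any number into the range (0, 1)
```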
What I care about is the shape. So you can have a play around with graphing calculators or whatever to kind of see why it is that you end up with this shape from this particular equation but for me I just never think about that. It never really matters to me.
What's important is this sigmoid shape, which is what we want. It's something that squashes every number to be between 0 and 1. So we can change mnist_loss to be exactly the same as it was before, except we put everything through the sigmoid first and then use torch.where. So that is a loss function that has all the properties we want.
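So the updated loss, as in the notebook:

```python
def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()   # force every prediction into (0, 1) first
    return torch.where(targets==1, 1-predictions, predictions).mean()
```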
It's something which is not going to have any of those nasty 0 gradients, and we've ensured that the input to the where is between 0 and 1. So the reason we did this is because what we really care about is a good accuracy.

But we can't use accuracy to get our gradients, to create our step to improve our parameters. So we change it to another function that is similar, in that it's better when the accuracy is better, but that does not have these 0 gradients. And so you can see now why we have a metric and a loss.
The metric is the thing we actually care about. The loss is the thing that's similar to what we care about that has a nicely behaved gradient. Sometimes the thing you care about your metric does have a nicely defined gradient and you can use it directly as a loss. For example, we often use mean squared error but for classification unfortunately not.
So we need to now use this to update the parameters. And so there's a couple of ways we could do this. One would be to loop through every image, calculate a prediction for that image and then calculate a loss and then do a step and then step through the parameters and then do that again for the next image and the next image and the next image.
That's going to be really slow because we're doing a single step for a single image. So that would mean an epoch would take quite a while. We could go much faster by doing every single image in the dataset at once, one big matrix multiplication; it can all be parallelized on the GPU, and then we could do a step based on the gradients from the entire dataset.
But now that's going to be like a lot of work to just update the weights once. And remember sometimes our datasets have millions or tens of millions of items. So that's probably a bad idea too. So why not compromise? Let's grab a few data items at a time to calculate our loss and our step.
If we grab a few data items at a time, those few data items are called a mini-batch. And a mini-batch just means a few pieces of data. And so the size of your mini-batch is called, not surprisingly, the batch size. So the bigger the batch size, the closer you get to the full size of your dataset, and the longer it's going to take to calculate a single set of losses, a single step.
But the more accurate it's going to be: the gradients are going to be much closer to the true dataset gradients. And then the smaller the batch size, the faster each step will be, but those steps will represent a smaller number of items.
And so they won't be such an accurate approximation of the real gradient of the whole dataset. Is there a reason the mean of the loss is calculated over, say, doing a median, since the median is less prone to getting influenced by outliers? In the example you gave, if the third point, which was wrongly predicted as an outlier, then the derivative would push the function away while doing SGD, and a median could be better in that case.
Honestly, I've never tried using a median. The problem with a median is it ends up really only caring about one number, which is the number in the middle. So it could end up really pretty much ignoring all of the things at each end. In fact, all it really cares about is the order of things.
So my guess is that you would end up with something that is only good at predicting one thing in the middle, but I haven't tried it. It would be interesting to see. Well, I guess the other thing that would happen with a median is you would have a lot of zero gradients, I think, because it's picking the thing in the middle and you could, you know, change your values and the thing in the middle.
Well, it wouldn't be zero gradients, but bumpy gradients. I think in the middle would suddenly jump to being a different item. So it might not behave very well. That's my guess. You should try it. Okay. So how do we ask for a few items at a time? It turns out that PyTorch and FastAI provide something to do that for you.
You can pass in any data set to this class called data loader and it will grab a few items from that data set at a time. You can ask for how many by asking for a batch size. And then you can, as you can see, it will grab a few items at a time until it's grabbed all of them.
So here I'm saying let's create a collection that just contains all the numbers from 0 to 14. Let's pass that into a data loader with a batch size of 5. And then that's going to be something, it's called an iterator in Python. It's something that you can ask for one more thing from an iterator.
If you pass an iterator to list in Python, it returns all of the things from the iterator. So here are my three mini-batches, and you'll see here all the numbers from 0 to 14 appear. They appear in a random order and they appear five at a time. They appear in random order because shuffle equals true.
So normally in the training set we ask for things to be shuffled. So it gives us a little bit more randomization. More randomization is good because it makes it harder for it to kind of learn what the data set looks like. So that's what a data loader, that's how a data loader is created.
Now remember, though, that our data sets actually return tuples. And here I've just got single ints. So let's actually create a tuple. So if we enumerate all the letters of English, that returns (0, 'a'), (1, 'b'), (2, 'c'), etc. Let's make that our data set. So if we pass that to a data loader with a batch size of 6, you can see it returns tuples containing 6 of the first things and the associated 6 of the second things.
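Both of those examples look roughly like this:

```python
coll = range(15)
dl = DataLoader(coll, batch_size=5, shuffle=True)
list(dl)   # three shuffled mini-batches of five numbers each

import string   # may already be in scope via the fastai import
ds = L(enumerate(string.ascii_lowercase))      # a dataset of (index, letter) tuples
dl = DataLoader(ds, batch_size=6, shuffle=True)
list(dl)   # each batch is a tuple: six indices and the six matching letters
```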
So this is like our independent variable and this is like our dependent variable. And so and then at the end, you know, the batch size won't necessarily exactly divide nicely into the full size of the data set. You might end up with a smaller batch. So basically then we already have a data set, remember.
And so we could pass it to a data loader and then we can basically say this, an iterator in Python is something that you can actually loop through. So when we say for in data loader, it's going to return a tuple. We can de-structure it into the first bit and the second bit.
And so that's going to be our x and y. We can calculate our predictions, we can calculate our loss from the predictions and the targets, we can ask it to calculate our gradients and then we can update our parameters just like we did in our toy SGD example for the quadratic equation.
So let's reinitialize our weights and bias with the same two lines of code before. Let's create the data loader this time from our actual MNIST data set and create a nice big batch size. So we did plenty of work each time. And just to take a look, let's just grab the first thing from the data loader.
First is a fast AI function, which just grabs the first thing from an iterator. Just it's useful to look at, you know, kind of an arbitrary mini batch. So here is the shape. We're going to have the first mini batch is 256 rows of 784 long, that's 28 by 28.
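Roughly (valid_dset is the validation dataset, built the same way as dset from the validation images):

```python
dl = DataLoader(dset, batch_size=256)
xb, yb = first(dl)        # first() is a fastai helper: grab one mini-batch from an iterator
xb.shape, yb.shape        # 256 rows of 784, and 256 one-long labels

valid_dl = DataLoader(valid_dset, batch_size=256)
```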
So 256 flattened out images and 256 labels that are one long because that's just the number zero or the number one, depending on whether it's a three or a seven. Do the same for the validation set. So here's our validation data loader. And so let's grab a batch here, testing, pass it into, well, why do we do that?
We should, yeah, I guess, yeah, actually for our testing, I'm going to just manually grab the first four things just so that we can make sure everything lines up. So let's grab just the first four things. We'll call that a batch. Pass it into that linear function we created earlier.
Remember, linear1 was just batch @ weights, a matrix multiply, plus bias. And so that's going to give us four results: a prediction for each of those four images. And so then we can calculate the loss using that loss function we just wrote; let's just grab the first four items of the training set labels, and there's the loss.
Okay. And so now we can calculate the gradients. And so the weights gradient is 784 by 1. So in other words, it's a column where every weight has a gradient: it's the change in loss for a small change in that parameter. And then the bias has a gradient that's a single number, because the bias is just a single number.
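Those steps, roughly as in the notebook:

```python
batch = train_x[:4]                     # just the first four images, for testing
preds = linear1(batch)                  # four predictions
loss = mnist_loss(preds, train_y[:4])
loss.backward()                         # calculate the gradients
weights.grad.shape, weights.grad.mean(), bias.grad
```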
So we can take those three steps and put them in a function. So this is calc_grad: you pass it an x batch, a y batch, and some model, and it's going to calculate the predictions, calculate the loss, and do the backward step. And here we see calculate gradient.
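Which gives us something like:

```python
def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()

calc_grad(batch, train_y[:4], linear1)
```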
And so we can get the, just to take a look, the mean of the weights gradient and the bias gradient. And there it is. If I call it a second time and look, notice I have not done any step here. This is exactly the same parameters. I get a different value.
That's a concern. You would expect to get the same gradient every time you called it with the same data. Why have the gradients changed? That's because loss dot backward does not just calculate the gradients. It calculates the gradients and adds them to the existing gradients, the things in the dot grad attribute.
The reasons for that we'll come to later, but for now the thing to know is just: it does that. So actually what we need to do is to call grad.zero_. zero_ sets the tensor to contain zeros, and remember the underscore at the end means it does it in place. So that updates the weights.grad attribute, which is a tensor, to contain zeros.
So now if I do that and call it again, I will get exactly the same number. So here is how you train one epoch with SGD. Loop through the data loader, grabbing the X batch and the Y batch, calculate the gradient, prediction loss backward. Go through each of the parameters and we're going to be passing those in.
So there are going to be the 784 weights and the one bias. And then for each of those, update the parameter: p minus equals gradient times learning rate. That's our gradient descent step, and then zero it out for the next time around the loop. I'm not just saying p minus equals.
I'm saying p dot data minus equals. And the reason for that is that, remember, PyTorch keeps track of all of the calculations we do so that it can calculate the gradient. Well, I don't want my gradient descent step itself to be included in the gradient calculation. That's not part of the model, right?
So dot data is a special attribute in PyTorch where if you write to it, it tells PyTorch not to update the gradients using that calculation. So this is your most basic standard SGD stochastic gradient descent loop. So now we can answer that earlier question. The difference between stochastic gradient descent and gradient descent is that gradient descent does not have this here that loops through each mini-batch.
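Putting the whole epoch together, roughly as in the notebook (dl is the training DataLoader from above):

```python
def train_epoch(model, lr, params):
    for xb, yb in dl:
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad*lr    # .data: don't track this update in the gradient calculation
            p.grad.zero_()         # reset the gradients for the next mini-batch
```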
For gradient descent, it does it on the whole data set each time around. So train epoch for gradient descent would simply not have the for loop at all, but instead it would calculate the gradient for the whole data set and update the parameters based on the whole data set, which we never really do in practice.
We always use mini-batches of various sizes. Okay, so we can take the function we had before where we compare the predictions to whether that, well we used to be comparing the predictions to whether they were greater or less than zero, right? But now that we're doing the sigmoid, remember the sigmoid will squish everything between 0 and 1.
So now we should compare the predictions to whether they're greater than 0.5 or not. Just look back at our sigmoid function: what used to be 0 is now, after the sigmoid, 0.5. Okay, so we just need to make that slight change to our measure of accuracy.
So to calculate the accuracy for some x batch and some y batch, where the x batch here is actually assumed to be the predictions: we take the sigmoid of the predictions, we compare them to 0.5 to tell us whether it's a 3 or not, we check what the actual target was to see which ones are correct, and then we take the mean of those after converting the Booleans to floats.
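Which is just:

```python
def batch_accuracy(xb, yb):
    preds = xb.sigmoid()             # xb here is actually the model's predictions
    correct = (preds > 0.5) == yb
    return correct.float().mean()
```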
So we can check that. Accuracy, let's take our batch, put it through our simple linear model, compare it to the four items of the training set, and there's the accuracy. So if we do that for every batch in the validation set, then we can loop through with a list comprehension every batch in the validation set, get the accuracy based on some model, stack those all up together so that this is a list, right?
So if we want to turn that list into a tensor whose items are the items of the list, that's what stack does. So we can stack up all those, take the mean, convert it to a standard Python scalar by calling dot item, and round it to four decimal places just for display.
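Roughly:

```python
def validate_epoch(model):
    accs = [batch_accuracy(model(xb), yb) for xb, yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)
```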
And so here is our validation set accuracy as you would expect. It's about 50% because it's random. So we can now train for one epoch. So we can say, remember train epoch needed the parameters. So our parameters in this case are the weights tensor and the bias tensor. So train one epoch using the linear one model with the learning rate of one with these two parameters and then validate and look at that.
Our accuracy is now 68.8%. So we've trained an epoch. So let's just repeat that 20 times: train and validate. And you can see the accuracy goes up and up and up, to about 97%. So that's cool. We've built an SGD optimizer of a simple linear function that is getting about 97% on our simplified MNIST, where there's just the threes and the sevens.
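The whole thing, roughly as in the notebook:

```python
lr = 1.
params = weights, bias
train_epoch(linear1, lr, params)
validate_epoch(linear1)            # roughly 0.69 after one epoch

for i in range(20):
    train_epoch(linear1, lr, params)
    print(validate_epoch(linear1), end=' ')   # climbs to roughly 0.97
```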
So a lot of steps there. Let's simplify this through some refactoring. So the kind of simple refactoring we're going to do, we're going to do a couple, but the basic idea is we're going to create something called an optimizer class. The first thing we'll do is we'll get rid of the linear one function.
Remember the linear one function does x at w plus b. There's actually a class in PyTorch that does that equation for us. So we might as well use it. It's called nn.linear. And nn.linear does two things. It does that function for us and it also initializes the parameters for us.
So we don't have to do the weights and bias and init_params ourselves anymore. We just create an nn.Linear object, and that's going to create a weight matrix for 28*28 inputs and 1 output, and a bias of size 1. It will set requires_grad equals true for us. It's all going to be encapsulated in this class, and then when I call it as a function, it's going to do my x @ w plus b.
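Roughly:

```python
linear_model = nn.Linear(28*28, 1)   # does x@w + b, and initializes w and b for us
w, b = linear_model.parameters()
w.shape, b.shape                     # 784 weights and a single bias
```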
So to see the parameters in it, we would expect it to contain 784 weights and one bias. We can just call dot parameters and we can de-structure it to w comma b and see, yep, it is 784 and 1 for the weights and bias. So that's cool. So this is just, you know, it could be an interesting exercise for you to create this class yourself from scratch.
You should be able to at this point so that you can confirm that you can recreate something that behaves exactly like an nn.linear. So now that we've got this object which contains our parameters in a parameters method, we can now create an optimizer. So for our optimizer, we're going to pass it the parameters to optimize and a learning rate.
We'll store them away and we'll have something called step which goes through each parameter and does that thing we just saw. p dot data minus equals p dot grad times learning rate and it's also going to have something called zero grad which goes through each parameter and zeros it out or we could even just set it to none.
So that's the thing we're going to call basic optimizer. So those are exactly the same lines of code we've already seen wrapped up into a class. So we can now create an optimizer passing in the parameters of the linear model for these and our learning rate. And so now our training loop is loop through each mini batch in the data loader, calculate the gradient, opt dot step, opt dot zero grad, that's it.
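That optimizer and the new training loop look roughly like this:

```python
class BasicOptim:
    def __init__(self, params, lr):
        self.params, self.lr = list(params), lr
    def step(self, *args, **kwargs):
        for p in self.params: p.data -= p.grad.data * self.lr
    def zero_grad(self, *args, **kwargs):
        for p in self.params: p.grad = None

opt = BasicOptim(linear_model.parameters(), lr)

def train_epoch(model):
    for xb, yb in dl:
        calc_grad(xb, yb, model)
        opt.step()
        opt.zero_grad()
```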
Validation function doesn't have to change and so let's put our training loop into a function that's going to loop through a bunch of epochs, call an epoch, print validate epoch and then run it and it's the same. We're getting a slightly different result here but much the same idea.
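That is:

```python
def train_model(model, epochs):
    for i in range(epochs):
        train_epoch(model)
        print(validate_epoch(model), end=' ')

train_model(linear_model, 20)
```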
Okay, so that's cool, right? We've now refactored, creating our own optimizer and using PyTorch's built-in nn.Linear class. And, you know, by the way, we don't actually need to use our own basic optimizer. Not surprisingly, PyTorch comes with something which does exactly this, and not surprisingly it's called SGD.
And actually this SGD is provided by fastai; fastai and PyTorch provide some overlapping functionality, and they work much the same way. So you can pass to SGD your parameters and your learning rate, just like with basic optimizer. Okay, and train it and get the same result. So as you can see, these classes that are in fastai and PyTorch are not mysterious; they're just pretty thin wrappers around functionality that we've now written ourselves.
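Swapping in fastai's SGD is roughly a one-line change:

```python
linear_model = nn.Linear(28*28, 1)
opt = SGD(linear_model.parameters(), lr)   # a drop-in replacement for BasicOptim here
train_model(linear_model, 20)
```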
So there were quite a few steps there, and if you haven't done gradient descent before, then there's a lot of unpacking. So this lesson is kind of the key lesson. It's the one where we should really stop and take a deep breath at this point and make sure you're comfortable.
What's a data set? What's a data loader? What's nn.linear? What's SGD? And if you you know if any or all of those don't make sense go back to where we defined it from scratch using Python code. Well the data loader we didn't define from scratch but it you know the functionality is not particularly interesting.
You could certainly create your own from scratch if you wanted to; that would be another pretty good exercise. Let's refactor some more. Fastai has a DataLoaders class which, as we've mentioned before, is a tiny class: you just pass it a bunch of data loaders and it stores them away as a .train and a .valid.
Even though it's a tiny class it's it's super handy because with that we now have a single object that knows all the data we have and so it can make sure that your training data loader is shuffled and your validation loader isn't shuffled you know make sure everything works properly.
So that's what the DataLoaders class is: you can pass in the training and valid data loaders. And then the next thing we have in fastai is the Learner class. The Learner class is something where we're going to pass in our data loaders, our model, our optimization function, our loss function, and our metrics.
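Roughly, following the notebook:

```python
dls = DataLoaders(dl, valid_dl)
learn = Learner(dls, nn.Linear(28*28, 1), opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(10, lr=lr)
```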
So all the stuff we've just done manually, that's all Learner does; it's just going to do that for us. It's just going to call this train model and this train epoch; it's all inside Learner. So now if we go learn.fit, you can see again it's doing the same thing, getting the same result. And it's got some nice functionality: it's printing it out into a pretty table for us, and it's showing us the losses and the accuracy and how long it takes. But there's nothing magic, right? You've been able to do exactly the same thing by hand using Python and PyTorch.
So these abstractions are here to let you write less code, to save some time, and to save some cognitive overhead, but they're not doing anything you can't do yourself. And that's important, right? Because if they're doing things you can't do yourself, then you can't customize them, you can't debug them, you can't profile them.
So we want to make sure that the stuff we're using is stuff that we understand. Now, this is just a linear function, which is not great; we want a neural network. So how do we turn this into a neural network? Remember, this is a linear function, x @ w plus b. To turn it into a neural network, we have two linear functions, exactly the same form but with different weights and different biases, and in between this magic line of code, which takes the result of our first linear function and then does a max between that and 0.
So a max of res and 0 is going to take any negative numbers and turn them into zeros. So we're going to do a linear function, we're going to replace the negatives with 0, and then we're going to take that and put it through another linear function, and that, believe it or not, is a neural net.
So w1 and w2 are weight tensors, b1 and b2 are bias tensors, just like before, so we can initialize them just like before, and we could now call exactly the same training code that we did before to train this. So taking the max of res and 0 is called a rectified linear unit, which you will always see referred to as ReLU. PyTorch already has this function; it's called F.relu. And so if we plot it, you can see it's as you'd expect: it's 0 for all negative numbers, and then it's y equals x for positive numbers.
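A sketch of that network (the hidden size of 30 is just the notebook's choice):

```python
w1 = init_params((28*28, 30))
b1 = init_params(30)
w2 = init_params((30, 1))
b2 = init_params(1)

def simple_net(xb):
    res = xb@w1 + b1
    res = res.max(tensor(0.0))   # ReLU: replace every negative with zero
    res = res@w2 + b2
    return res
```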
So, you know, here's some jargon: rectified linear unit. Sounds scary, sounds complicated, but it's actually this incredibly tiny line of code, this incredibly simple function. And this happens a lot in deep learning: things that sound complicated and sophisticated and impressive turn out to be normally super simple, frankly, at least once you know what they are.
So why do we do linear layer, ReLU, linear layer? Well, if we got rid of the middle ReLU and just went linear layer, linear layer, then you could rewrite that as a single linear layer: when you multiply things and add, and then multiply things and add again, you can just change the coefficients and make it into a single multiply-and-add.
So no matter how many linear layers we stack on top of each other, we can never make anything more effective than a single linear model. But if you put a non-linearity between the linear layers, then the opposite holds: this is where something called the universal approximation theorem applies, which says that if the weight and bias matrices are big enough, this can actually approximate any arbitrary function, including the function of how do I recognize threes from sevens, or whatever.
So that's kind of amazing, right? This tiny thing is actually a universal function approximator, as long as w1, b1, w2, and b2 have the right numbers, and we know how to make them the right numbers: you use SGD. It could take a very long time, it could take a lot of memory, but the basic idea is that there is some solution to any computable problem. And this is one of the biggest challenges a lot of beginners have with deep learning: there's nothing else to it. There's often this "okay, how do I make a neural net?" Oh, that is a neural net. Or "how do I do deep learning training?" There's SGD; there are things to make it train a bit faster, there are things that mean you need a few less parameters, but everything from here is just performance tweaks, honestly. So this is the key understanding of training a neural network.

Okay, we can simplify things a bit more. We already know that we can use nn.Linear to replace the weight and bias, so let's do that for both of the linear layers. And then, since we're simply taking the result of one function and passing it into the next, and taking the result of that function and passing it to the next, and so forth, and then returning the end, this is called function composition. Function composition is when you just take the result of one function and pass it to a new one, take the result of that and pass it to a new one, and so on. Pretty much every neural network is just doing function composition of linear layers and these things called activation functions, or non-linearities. PyTorch provides something to do function composition for us, and it's called nn.Sequential (sketched in code below). So it's going to do a linear layer, pass the result to a ReLU, pass the result to a linear layer. You'll see here I'm not using F.relu, I'm using nn.ReLU; this is identical, it returns exactly the same thing, but this is a class rather than a function.

Yes, Rachel? By using the non-linearity, won't using a function that makes all negative output zero make many of the gradients in the network zero, and stop the learning process due to many zero gradients?

Well, that's a fantastic question, and the answer is yes, it does. But it won't be zero for every image, and remember the mini-batches are shuffled, so even if it's zero for every image in one mini-batch, it won't be for the next mini-batch, and it won't be the next time around when we go through another epoch. So yes, it can create zeros, and if the neural net ends up with a set of parameters such that lots and lots of inputs end up as zeros, you can end up with whole mini-batches that are zero, and you can end up in a situation where some of the neurons remain inactive. Inactive means they're zero; they're basically dead units, and this is a huge problem. It basically means you're wasting computation. So there are a few tricks to avoid that, which we'll be learning about a lot. One simple trick is to not make this thing flat here but just make it a less steep line; that's called a leaky ReLU, a leaky rectified linear unit, and they help a bit. As we'll learn, though, even better is to make sure that we just initialize to sensible initial values that are not too big and not too small, and step by sensible amounts that are particularly not too big. And generally, if we do that, we can keep things in the zone where they're positive most of the time. But we are going to learn about how to actually analyze inside a network and find out how many dead units we have, how many of these zeros we have, because as you point out, they are bad news: they don't do any work, and they'll continue to not do any work if enough of the inputs end up being zero.
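Here is the nn.Sequential version described above, as in the notebook:

```python
simple_net = nn.Sequential(
    nn.Linear(28*28, 30),   # first linear layer
    nn.ReLU(),              # the non-linearity (same as F.relu, but as a class)
    nn.Linear(30, 1)        # second linear layer
)
```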
Okay, so now that we've got a neural net, we can use exactly the same Learner we had before, but this time we'll pass in the simple net instead of the linear one. Everything else is the same, and we can call fit just like before. Generally, as your models get deeper (here we've gone from one layer to two, and I'm only counting the parameterized layers as layers; you could say it's three, but I'm just going to call it two, since there are two trainable layers), they tend to be kind of bumpier, less nicely behaved, so often you need to use lower learning rates; I've dropped my learning rate from 1 to 0.1. And so we train it for a while.

And we can actually find out what that training looks like by looking inside our Learner. There's an attribute we create for you called recorder, and that's going to record everything that appears in this table, basically: these three things, the training loss, the validation loss, and the accuracy, or any metrics. So recorder.values contains that kind of table of results, and so item number two of each row will be the accuracy. The capital L class, which I'm using here, has a nice little method called itemgot that will get the second item from every row, and then I can plot that to see how the training went. And I can get the final accuracy by grabbing the last row of the table and grabbing index two (zero, one, two), and my final accuracy is not bad: ninety-eight point three percent.

So this is pretty amazing. We now have a function that can solve any problem to any level of accuracy, if we can find the right parameters, and we have a way to find, hopefully, the best, or at least a very good, set of parameters for any function. So this is kind of the magic.

Yes, Rachel? How could we use what we're learning here to get an idea of what the network is learning along the way, like Zeiler and Fergus did?

More or less, we will look at that later, not in the full detail of their paper, but basically you can look in the .parameters to see the values of those parameters. And at this point, well, why don't you try it yourself? You've actually got the parameters now. If you want to grab the model, you can actually see learn.model; we can look inside learn.model to see the actual model that we just trained, and you can see it's got the three things in it: the Linear, the ReLU, and the Linear. What I kind of like to do is to put that into a variable to make it a bit easier to work with. You can grab one layer by indexing in, and you can look at its parameters, and that just gives me something called a generator; it's something that will give me a list of the parameters when I ask for them. So I can just go weight comma bias equals, to destructure them. And the weight is 30 by 784, because that's what I asked for.

So one of the things to note here is that to create a neural net, something with more than one layer, I actually have 30 outputs from the first layer, not just one. So I'm kind of generating lots of features; it's kind of like 30 different linear models here, and then I combine those 30 back into one. So you could look at one of those by having a look at the numbers in the first row; we could reshape that into the original shape of the images, and we could even have a look, and there it is.
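Training the deeper model and poking around inside it looks roughly like this (show_image is fastai's plotting helper; the exact accuracy will vary a little from run to run):

```python
learn = Learner(dls, simple_net, opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(40, 0.1)                               # lower learning rate for the deeper model

plt.plot(L(learn.recorder.values).itemgot(2))    # column 2 of each row is the accuracy
learn.recorder.values[-1][2]                     # final accuracy, about 0.983 in the lesson

m = learn.model                                  # the nn.Sequential we just trained
w, b = m[0].parameters()                         # parameters of the first linear layer
w.shape                                          # 30 rows of 784 weights
show_image(w[0].view(28, 28))                    # view one unit's weights as a 28x28 image
```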
This is cool, right? We can actually see here that we've got something which is kind of learning to find things at the top and the bottom and the middle. And so we could look at the second one: okay, no idea what that's showing. Some of them are like that; I've probably got far more than I need, which is why they're not that obvious. But you can see, here's another thing that's looking pretty similar, and here's something that's kind of looking for this little bit in the middle. So this is the basic idea. To understand the features in not the first layer but later layers, you have to be a bit more sophisticated, but to see the first-layer ones, you can just plot them.

Okay, so then, just to compare, we could use the full fastai toolkit: grab our data loaders by using ImageDataLoaders.from_folder, as we've done before, create a cnn_learner and a ResNet, and fit it for a single epoch, and whoa, 99.7, right? So we did 40 epochs and got 98.3. As I said, using all the tricks, you can really speed things up and make things a lot better. And so by the end of this course, or at least both parts of this course, you'll be able to get this 99.7 in a single epoch, from scratch.

All right, so, jargon, just to remind us. ReLU: a function that returns zero for negatives. Mini-batch: a few inputs and labels, which optionally are randomly selected. The forward pass is the bit where we calculate the predictions. The loss is the function that we're going to take the derivative of, and the gradient is the derivative of the loss with respect to each parameter. The backward pass is when we calculate those gradients. Gradient descent is that full thing of taking a step in the direction opposite to the gradients after calculating the loss, and the learning rate is the size of the step that we take.

Other things to know: perhaps the two most important pieces of jargon are these. All of the numbers in a neural network that we're learning are called parameters, and the numbers that we're calculating, so every value that's calculated, every matrix multiplication element that's calculated, are called activations. So activations and parameters are all of the numbers in the neural net. And so be very careful: when I say, from here on in these lessons, "activations" or "parameters", you've got to make sure you know what those mean, because that's basically almost the entire set of numbers that exist inside a neural net. So activations are calculated; parameters are learned.

We're doing this stuff with tensors, and tensors are just regularly shaped arrays. Rank-0 tensors we call scalars, rank-1 tensors we call vectors, rank-2 tensors we call matrices, and we continue on to rank-3 tensors, rank-4 tensors, and so forth. Higher-rank tensors are very common in deep learning, so don't be scared of going up to higher numbers of dimensions.

Okay, so let's have a break. Oh, we've got a question: is there a rule of thumb for what non-linearity to choose, given that there are many? Yeah, there are many non-linearities to choose from, and it doesn't generally matter very much which you choose. So choose ReLU, or leaky ReLU, or whatever; any one should work fine. Later on we'll look at the minor differences between them, but it's not so much something that you pick on a per-problem basis; it's more like some take a little bit longer and are a little bit more accurate, and some are a bit faster and a little bit less accurate. That's a good question.
Okay, so before you move on, it's really important that you finish the questionnaire for this chapter, because there's a whole lot of concepts that we've just covered. So try to go through the questionnaire, go back and re-look at the notebook, and please run the code, do the experiments, and make sure it makes sense. All right, let's have a seven-minute break; see you back here in seven minutes' time.

Okay, welcome back. So now that we know how to create and train a neural net, let's cycle back and look deeper at some applications. We're going to try to interpolate in from each end: at one end we've done the from-scratch version, at the other end we've done the four-lines-of-code version, and we're going to gradually nibble at each end until we find ourselves in the middle and we've touched on all of it. So let's go back up to the four-lines-of-code version and delve a little deeper.

So let's go back to pets, and let's think about how you actually start with a new data set and figure out how to use it. The data sets we provide are easy enough to untar; you just say untar_data, or download it and untar it. If it's a data set that you're getting yourself, you can just use the terminal, or Python, or whatever. So let's assume we have a path that's pointing at something. Initially you don't know what that something is, so we can start by doing ls to have a look and see what's inside there. So the pets data set that we saw in lesson one contains three things: annotations, images, and models. And you'll see we have this little trick here where we say Path.BASE_PATH equals the path to our data; that just does a little simple thing where, when we print things out, it shows them relative to this path, which is a bit more convenient.

If you go and have a look at the README for the original pets data set, it tells you what these images and annotations folders are. And not surprisingly, the images path (we go path/"images", which is how we use pathlib to grab a subdirectory) and then ls, and we can see here are the paths to the images. As it mentions here, most functions and methods in fastai which return a collection don't return a Python list; they return a capital L, and a capital L, as we briefly mentioned, is basically an enhanced list. One of the enhancements is the way it prints: the representation starts by showing you how many items there are in the collection, so there are seven thousand three hundred and ninety-four images, and if there are more than ten things it truncates the output and just says dot dot dot to avoid filling up your screen. So there are a couple of little conveniences there.

And so we can see from this output that, as we mentioned in lesson one, if the first letter of the file name is a capital, it means it's a cat, and if the first letter is lowercase, it means it's a dog. But this time we've got to do something a bit more complex, well, a lot more complex, which is figure out what breed it is. And you can see the breed is everything in the file name up to the last underscore, before this number.
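The initial exploration described above looks roughly like this:

```python
path = untar_data(URLs.PETS)    # download and extract the pets dataset
Path.BASE_PATH = path           # print paths relative to this directory
path.ls()                       # annotations, images, models
(path/"images").ls()            # an L of 7,394 image paths
```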
So we want to label everything with its breed, and we're going to take advantage of this structure. The way I would do this is to use a regular expression. A regular expression is something that looks at a string and basically lets you pull it apart into its pieces in a very flexible way; it's this kind of simple little language for doing that. If you haven't used regular expressions before, please google "regular expression tutorial" now, because it's going to be one of the most useful tools you'll come across in your life. I use them almost every day. I won't go into the details of how to use them, since there are so many great tutorials, and there are also a lot of great exercises: regex (regex is short for regular expression) crosswords, regex Q&A, all kinds of cool regex things. A lot of people, like me, love this tool. There's also a regex lesson in the fastai NLP course, maybe even two regex lessons. Oh yeah, I'm sorry for forgetting about the fastai NLP course; what an excellent resource that is.

So regular expressions are a bit hard to get right the first time, so the best thing to do is to get a sample string. A good way to do that is to just grab one of the file names and pop it in fname, and then you can experiment with regular expressions. re is the regular expression module in Python, and findall will grab all the parts of a regular expression that have parentheses around them. So this regular expression (and r is a special kind of string in Python which basically says don't treat backslash as special, because normally in Python backslash n means a new line) says: capture one or more characters, followed by an underscore, followed by a digit one or more times, followed by anything (I probably should have used backslash dot, but that's fine), followed by the letters jpg, followed by the end of the string. And so if I call that regular expression against my file name's name: oh, looks good, right? So we kind of check it out.

So now that that seems to work, we can create a data block where the independent variables are images and the dependent variables are categories, just like before; get_items is going to be get_image_files; we're going to split it randomly as per usual; and then we're going to get the label by calling RegexLabeller, which is just a handy little fastai class which labels things with a regular expression. We can't call this particular expression directly on the pathlib Path object; we actually want to call it on the name attribute, and fastai has a nice little function called using_attr, using attribute, which takes this function and changes it to a function which will be passed this attribute. So that's going to be using RegexLabeller on the name attribute. And then from that data block we can create the data loaders as usual.

There are two interesting lines here: Resize and aug_transforms. aug_transforms we have seen before, in notebook 2, in the section called data augmentation. aug_transforms was the thing which can zoom in and zoom out and warp and rotate and change contrast and change brightness and so forth, and flip; it's almost like giving us more data, generated synthetically from the data we already have. And we also learned about random resized crop, which is a really cool way of ensuring you get square images at the same time that you're augmenting the data. Here we have a resize to a really large image; well, by deep learning standards 460 by 460 is a really large image; and then we're using aug_transforms with a size, so that's actually going to use a random resized crop to a smaller size. Why are we doing that? This particular combination of two steps does something which I think is unique to fastai, which we call presizing.
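Putting that together, roughly as in the notebook (the seed and min_scale values are the notebook's choices):

```python
fname = (path/"images").ls()[0]
re.findall(r'(.+)_\d+.jpg$', fname.name)    # e.g. ['great_pyrenees']

pets = DataBlock(
    blocks=(ImageBlock, CategoryBlock),       # independent and dependent variables
    get_items=get_image_files,
    splitter=RandomSplitter(seed=42),
    get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
    item_tfms=Resize(460),                    # presizing step 1: one big square crop
    batch_tfms=aug_transforms(size=224, min_scale=0.75))   # step 2: augment and crop on the GPU
dls = pets.dataloaders(path/"images")
```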
The best way to show how presizing works is with this beautiful example of some PowerPoint wizardry that I'm very excited about. What presizing does is this: the first step, where we say resize to 460 by 460, grabs a square, and it grabs it randomly. If it's a landscape-orientation photo, it'll take the whole height and randomly grab somewhere along the width; if it's a portrait orientation, it'll take the full width and grab a random bit from top to bottom. So we take this area here, and here it is. That's what the first Resize does. Then the second, aug_transforms bit will grab a random warped crop, possibly rotated, from in there, and turn that into a square. So there are two steps: first, resize to a square that's big; then, second, do a kind of rotation and warping and zooming stage down to something smaller, in this case 224 by 224. Because the first step creates something that's square and always the same size, the second step can happen on the GPU. Normally things like rotating and image warping are actually pretty slow, and normally doing a zoom and a rotate and a warp is really destructive to the image, because each of those things requires an interpolation step; it's not just slow, it actually makes the image quite low quality. So we do it in a very special way in fastai, which I think is unique, where we do all of these coordinate transforms, like rotations and warps and zooms and so forth, not on the actual pixels; instead we keep track of the changing coordinate values in a non-lossy way, with the full floating-point values, and then once, at the very end, we do the interpolation.
The results are quite striking. Here is what the difference looks like (hopefully you can see this on the video): on the left is our presizing approach, and on the right is the standard approach that other libraries use. You can see that the one on the right is a lot less nicely focused, and it also has weird artefacts: this should be grass here, but the dog has its bum sticking way out; this one has a little bit of weird distortion; this one has loads of weird distortions. So you can see the presized version really ends up way, way better. And I think we have a question, Rachel: "Are the blocks in the DataBlock an ordered list? Do they specify the input and output structures respectively? Are there always two blocks, or can there be more than two? For example, if you wanted a segmentation model, would the second block be something about segmentation?" So yes, this is an ordered list: the first item says I want to create an image, and the second item says I want to create a category, so that's my independent and dependent variable. You can have one thing here, you can have three things here, you can have any number of things you want; obviously the vast majority of the time it'll be two, because normally there's an independent variable and a dependent variable. We'll be seeing this in more detail later, although if you go back to the earlier lesson where we introduced DataBlocks, I do have a picture showing how these pieces fit together.
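To make the answer to that question concrete, here is a hedged sketch (not from this lesson) of a DataBlock whose second block is a MaskBlock rather than a CategoryBlock, which is what you'd use for segmentation; the dataset, the `codes` file and the label-path pattern are assumptions about one particular dataset layout.

```python
# Hypothetical segmentation example: the second block is a MaskBlock, not a CategoryBlock
camvid_path = untar_data(URLs.CAMVID_TINY)
codes = np.loadtxt(camvid_path/'codes.txt', dtype=str)   # class names for the mask pixels

camvid = DataBlock(
    blocks=(ImageBlock, MaskBlock(codes)),
    get_items=get_image_files,
    get_y=lambda o: camvid_path/'labels'/f'{o.stem}_P{o.suffix}',  # assumed label naming
    splitter=RandomSplitter(seed=42),
    item_tfms=Resize(224))
camvid_dls = camvid.dataloaders(camvid_path/'images', bs=8)
```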
So after you've put together your DataBlock and created your DataLoaders, you want to make sure it's working correctly. The obvious thing to do for a computer vision DataBlock is show_batch, which will show you the items so you can make sure they look sensible; here, the labels look reasonable. If you add unique=True, it's going to show you the same image with all the different augmentations, which is a good way to make sure your augmentations work. If you make a mistake in your DataBlock (in this example there's no Resize, so the different images are going to be different sizes and it'll be impossible to collate them into a batch), you can call .summary: this really neat thing will go through and tell you everything that's happening. Collecting the items: how many did I find? What happened when I split them? What are the independent and dependent variables I'm creating? Let's try to create one of these: here's a step, create my image, create my category; here's what the first thing gave me, an American Bulldog; the final sample is this image, this size, this category. Then eventually it says: oh, it's not possible to collate your items. I tried to collate the zero-index members of your tuples (in other words, the independent variable), and this one was size 500 by 375, this one was 375 by 500, and I can't collate these into a tensor because they're different sizes. So this is a super great tool for debugging your DataBlocks. You have a question: "How does the item transform presize work if the resize is smaller than the image? Is the whole width or height still taken, or is it just a random crop with the resize value?" If you remember back to lesson two, we looked at the different ways of creating these things: you can use squish, you can use pad, or you can use crop. If your image is smaller than the presize value, then squish will really be a zoom, so it will just stretch it, and pad and crop will do much the same thing; you'll end up with something that looks just like these, but a bit more pixelated, lower resolution, because it's having to zoom in a little. Okay, so a lot of people say that you should do a lot of data cleaning before you model. We don't; we say model as soon as you can, because remember what we found in notebook 02: your model can teach you about the problems in your data. So as soon as I've got to a point where I have a DataBlock that's working and I have DataLoaders, I'm going to build a model. It also tells me how I'm going: I'm getting seven percent error, which is actually really good for a pets model. And at this point, now that I have a model, I can do the stuff we learned about earlier in notebook 02 where we train our model and use it to clean the data: we can look at the classification interpretation's confusion matrix, top losses, the image cleaner widget, and so forth. Now, one interesting thing here: in notebook 04 we included a loss function when we created a Learner, and here we don't pass in a loss function. Why is that? It's because fastai will try to automatically pick a somewhat sensible loss function for you, and for an image classification task it knows which loss function is the normal one to pick, so it's done it for you. But let's have a look and see what it actually picked: we can look at learn.loss_func, and we will see it is cross-entropy loss.
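Here is a minimal sketch of that workflow, assuming the `pets` DataBlock, `dls` and `path` from the sketch above; the learner setup (a resnet34 fine-tuned for two epochs) follows the usual pattern from earlier notebooks rather than anything new said here.

```python
# Sanity-check the data
dls.show_batch(nrows=1, ncols=3)               # images with their labels
dls.show_batch(nrows=1, ncols=3, unique=True)  # one image, many augmentations

# A DataBlock with no Resize can't collate; .summary() walks every step,
# prints what it finds, and then raises an error at the collation step
pets_no_resize = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(seed=42),
    get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'))
pets_no_resize.summary(path/"images")

# Model as soon as you can (newer fastai versions call this vision_learner)
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(2)
print(learn.loss_func)  # CrossEntropyLoss, picked automatically
```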
What on earth is cross-entropy loss? I'm glad you asked; let's find out. Cross-entropy loss is really much the same as the MNIST loss we created with that sigmoid and the one-minus-predictions and predictions, but it's an extended version of that. The extension is that the torch.where we looked at in notebook 04 only works when you have a binary outcome; in that case it was "is it a three or not", but in this case we've got "which of the 37 pet breeds is it". So we want to create something just like that sigmoid and torch.where, but which also works nicely for more than two categories. Let's see how we can do that. First of all, let's grab a batch. Yes, question: "Why do we want to build a model before cleaning the data? I would think a clean dataset would help in training." Yes, absolutely, a clean dataset helps in training, but remember, as we saw in notebook 02, an initial model helps you clean the dataset. Remember how plot_top_losses helped us identify mislabelled images, the confusion matrix helped us recognise which things we were getting confused and might need fixing, and the image classifier cleaner actually let us find things like an image that contained two bears rather than one bear, and clean it up. So a model is just a fantastic way to help you zoom in on the data that matters: which things seem to have problems, which things are most important, stuff like that. So you go through and clean it with the model helping you, and then you go back and train it again with the clean data. Thanks for that great question. Okay, so in order to understand cross-entropy loss, let's grab a batch of data, which we can do with dls.one_batch; that's going to grab a batch from the training set. We could also go first(dls.train), and that will do exactly the same thing. Then we can destructure that into the independent and dependent variables, and the dependent variable shows us we've got a batch size of 64: it shows us the 64 categories, and remember those numbers simply refer to indexes into the vocab; for example, 16 is a boxer. That all happens for you automatically, and when we say show_batch it shows us those strings. So here's a first mini-batch, and now we can view the predictions, that is, the activations of the final layer of the network, by calling get_preds. You can pass in a data loader, and a data loader can really be anything that returns a sequence of mini-batches, so we can just pass in a list containing our mini-batch as the data loader, and that's going to get the predictions for one mini-batch. So here are some predictions. If we go preds[0].sum() to grab the predictions for the first image and add them all up, they add up to 1, and there are 37 of them. So that makes sense: the very first one is the probability that the image is the first element of the vocab, so what's the probability it's an Abyssinian cat: 10 to the negative 6, you see, and so forth. So it's basically saying it's not this, it's not this, it's not this, and you can look through and, oh, here's the one it obviously thinks it is. We obviously want the probabilities to sum to one, because it would be pretty weird if they didn't: it would say that the probability of it being one of these things is more than one or less than one, which would be extremely odd.
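A minimal sketch of grabbing a batch and looking at the activations, assuming the `learn` and `dls` from above:

```python
x, y = dls.one_batch()        # one mini-batch from the training set
print(y)                      # 64 category indexes into dls.vocab
print(dls.vocab[y[0]])        # the breed name for the first item in the batch

# Activations of the final layer for just this batch;
# a list of batches works as a data loader here
preds, _ = learn.get_preds(dl=[(x, y)])
print(preds.shape)            # 64 rows, 37 columns
print(preds[0].sum())         # each row of probabilities sums to 1
```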
So how do we go about creating these predictions, where each one is between 0 and 1 and they all add up to 1? To do that we use something called softmax. Softmax is basically an extension of sigmoid to handle more than two categories. Remember, the sigmoid function looked like this, and we used it for our threes-versus-sevens model. So what if we want 37 categories rather than two? We need one activation for every category. Actually, for the threes-and-sevens model, rather than thinking of it as an "is it a three" model, we could say it has two categories, and so create two activations: one representing how three-like something is, and one representing how seven-like it is. So let's say we have six MNIST digits, and this first column contains the activations of my model for one activation and the second column the activations for a second activation; my final layer now has two activations. This is how much like a three it is, and this is how much like a seven it is: this one is not at all like a three and slightly not like a seven; this one is very much like a three and not much like a seven; and so forth. So we can take that model and, rather than having one activation for "is it a three", have two activations for "how much like a three" and "how much like a seven". If we take the sigmoid of those, we get two numbers between nought and one, but they don't add up to one, so that doesn't make any sense: it can't be a 0.66 chance it's a three and a 0.56 chance it's a seven, because every digit in that dataset is one or the other. So that's not going to work. But what we could do is take the difference between this value and this value and say that's how likely it is to be a three. In other words, this one here, with a high number here and a low number here, is very likely to be a three. So in the binary case, what really matters about these activations is their relative confidence of being a three versus a seven. We could calculate the difference between column index zero and column index one, and here's the difference between the two columns (there's that big difference), and we could take the sigmoid of that. That now gives us a single number between nought and one, and since we wanted two columns, we could make column index zero the sigmoid and column index one could be one minus that. And now look: these all add up to one. So here's the probability of three and the probability of seven; for the second one, probability of three, probability of seven; and so forth. So that's a way we could go from having two activations for every image to two probabilities, each of which is between nought and one, and each pair of which adds up to one. Great. How do we extend that to more than two columns? To extend it to more than two columns we use this function called softmax, where softmax is equal to exp(x) divided by the sum of exp(x). Just to show you: if I go softmax on my activations I get 0.6025, 0.3975; 0.6025, 0.3975: exactly the same thing. So softmax in the binary case is identical to the sigmoid that we just looked at.
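Here is a small sketch of that binary case with made-up activations (the seed is arbitrary, so the numbers won't match the slide): sigmoid of the difference between the two columns gives exactly the same pair of probabilities as softmax.

```python
import torch

torch.manual_seed(42)            # arbitrary seed, just to make the toy example reproducible
acts = torch.randn((6, 2)) * 2   # two activations per digit: "how 3-like" and "how 7-like"

prob_three = (acts[:, 0] - acts[:, 1]).sigmoid()           # sigmoid of the difference
binary = torch.stack([prob_three, 1 - prob_three], dim=1)  # [P(3), P(7)] per row

sm = torch.softmax(acts, dim=1)                            # softmax over the two columns
assert torch.allclose(binary, sm)                          # identical in the two-class case
```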
But in the multi-category case we end up with something like this. Let's say we were doing teddy bear, grizzly bear, brown bear; for that, remember, our neural net's final layer will have three activations, say 0.02, negative 2.49 and 1.25. To calculate softmax, I first take e to the power of each of these three things: e to the power of 0.02, e to the power of negative 2.49, and e to the power of 1.25. Then I add them up, so there's the sum of the exps, and softmax will simply be 1.02 divided by 4.60, this one will be 0.08 divided by 4.60, and this one will be 3.49 divided by 4.60. Since each of these is one number divided by the sum, the total is one; and because all of them are positive, and each is an item divided by the sum, all of them must be between nought and one. So this shows you that softmax always gives you numbers between nought and one, and they always add up to one. To do that in practice you can just call torch.softmax, and it will give you the result of this function. You should experiment with this in your own time: write it out by hand, try putting in these numbers, and see that you get back the numbers I claim you're going to get back; make sure this makes sense to you. One of the interesting points about softmax is, remember I told you that exp is e to the power of something, and e to the power of something grows very, very fast: exp of 4 is about 54.6, exp of 8 is about 2,981. It grows super fast, and what that means is that if you have one activation that's just a bit bigger than the others, its softmax will be a lot bigger than the others. So intuitively, the softmax function really wants to pick one class among the others, which is generally what you want: when you're trying to train a classifier to say which breed it is, you kind of want it to pick one and go for it. That's what softmax does. It's not always what you want, though: sometimes at inference time you want it to be a bit cautious, so you've got to remember that softmax isn't always the perfect approach. But it's the default, it's what we use most of the time, and it works well in a lot of situations. So that is softmax.
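Here is the bear example worked through in code, so you can check the arithmetic yourself:

```python
import torch

acts = torch.tensor([0.02, -2.49, 1.25])  # teddy, grizzly, brown activations

exp_acts = acts.exp()                     # roughly tensor([1.02, 0.08, 3.49])
denom = exp_acts.sum()                    # roughly 4.60
manual_softmax = exp_acts / denom         # roughly tensor([0.22, 0.02, 0.76]); sums to 1

print(manual_softmax)
print(torch.softmax(acts, dim=0))         # same result in one call
```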
Now, in the binary case, for MNIST 3 versus 7, this was how we calculated mnist_loss: we took the sigmoid and then used either one minus that or that as our loss function, which is fine; as you saw, it worked. We could do exactly the same thing here, except we can't use torch.where anymore, because the targets aren't just 0 or 1: targets could be any number from 0 to 36. We can handle that by replacing the torch.where with indexing. Here's an example for the binary case. Let's say these are our targets, 0 1 0 1 1 0, and these are our softmax activations, which we calculated before (they're just some random numbers, for a toy example). Instead of doing torch.where, have a look at this: I can grab all the numbers from 0 to 5, and if I index into here with all the numbers from 0 to 5 and then my targets 0 1 0 1 1 0, what that's going to do is, for row 0, pick the value in column 0 (0.60), for row 1 the value in column 1 (0.49), for row 2 the value in column 0 (0.13), and so forth. So this is a super nifty indexing expression which you should definitely play with; it's basically the trick of passing multiple things to the PyTorch indexer: the first thing says which rows to return, and the second thing says, for each of those rows, which column to return. So this returns all the rows and these columns, one for each row, and it's actually identical to torch.where. Isn't that tricky? The nice thing is we can now use it for more than just two values. Here's the fully worked-out thing: I've got my threes column, I've got my sevens column, here's the target, here are the indexes 0 1 2 3 4 5, and so here it picks (0, 0.60), (1, 0.49), (0, 0.13), and so forth. So this works just as well with more than two columns: for doing full MNIST, with all the digits from 0 to 9, we could have 10 columns, and we would just be indexing into the 10. This thing we're doing, where we take minus our activations matrix indexed by all the numbers from 0 to n and then our targets, is exactly the same as something that already exists in PyTorch called F.nll_loss; as you can see, exactly the same. Again we're seeing that these things inside PyTorch and fastai are just little shortcuts for stuff we can write ourselves. NLL stands for negative log likelihood; again, it sounds complex, but actually it's just this indexing expression. Rather confusingly, there's no log in it; we'll see why in a moment.
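Here is a small sketch of that indexing trick and its equivalence to F.nll_loss, again with made-up activations and targets:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)                     # arbitrary; just a toy example
acts = torch.randn((6, 2)) * 2
sm_acts = torch.softmax(acts, dim=1)      # each row sums to 1
targ = torch.tensor([0, 1, 0, 1, 1, 0])   # toy targets

idx = range(6)
picked = sm_acts[idx, targ]               # for row i, grab the column given by targ[i]
manual = -picked                          # negate, as described above

# F.nll_loss does exactly this indexing (and, despite the name, no log)
assert torch.allclose(manual, F.nll_loss(sm_acts, targ, reduction='none'))
```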
So let's talk about logs. This loss function works quite well, as we saw in notebook 04; it's exactly the same as what we did there, just a different way of expressing it. But we can actually make it better, because remember, the probabilities we're looking at are between 0 and 1: they can't be smaller than 0, they can't be greater than 1. That means that if our model is trying to decide whether to predict 0.99 or 0.999, it's going to think those numbers are very, very close together and won't really care. But if you think about the error, if there were, say, a thousand things, then 0.99 would mean about ten things are wrong and 0.999 would mean about one thing is wrong, so the second is really something like ten times better than the first. So what we'd really like to do is transform the numbers between 0 and 1 onto a scale that stretches them out towards infinity, and there's a function that does exactly that, called the logarithm. The numbers we have are between 0 and 1: as we get closer and closer to 0, the log goes down towards negative infinity, and at 1 the log is 0, so over this range it never goes above 0. This logarithm, in case you've forgotten (hopefully you vaguely remember it from high school): the definition is that if you have some number y that is b to the power of a, then the logarithm is defined such that a equals log base b of y. In other words, it tells you b to the power of what equals y. That's not that interesting of itself, but one of the really interesting things about logarithms is this very cool relationship: log of a times b equals log of a plus log of b. We use that all the time in deep learning and machine learning, because a times b can get very, very big or very, very small: if you multiply lots of small things together you get a tiny number, and if you multiply lots of big things together you get a huge number; it can get so big or so small that the precision of your computer's floating point gets really bad. Whereas adding is not going to get out of control. So we really love using logarithms, particularly in a deep neural net, where there are lots of layers and we're multiplying and adding many times; this tends to come out quite nicely. So when we take the probabilities that we saw before, the things that came out of this function, and we take their logs, negate them and take the mean, that is called negative log likelihood, and it ends up being a really nicely behaved number because of this property of the log that we described. So if you take the softmax, then take the log, and then pass that to nll_loss (because, remember, that didn't actually take the log at all, despite the name), that gives you cross-entropy loss. That leaves an obvious question: why doesn't nll_loss actually take the log? The reason is that it's more convenient computationally to take the log back at the softmax step, so PyTorch has a function called log_softmax. Since it's faster and more accurate to do the log at the softmax stage, PyTorch assumes that you use log_softmax and then pass that to nll_loss; nll_loss does not do the log, it assumes you've done the log beforehand. So log_softmax followed by nll_loss is the definition of cross-entropy loss in PyTorch. That's our loss function: you can pass it some activations and some targets and get back a number. And for pretty much every one of these kinds of functions in PyTorch, you can either use the nn version, a class (with a CamelCase name) that you instantiate and then call as if it's a function, or you can just use the F. version as a plain function directly; as you can see, they give exactly the same number. People normally use the class version, the PyTorch documentation normally uses the class version, so we'll tend to use the class version as well. You'll see that it returns a single number, and that's because it takes the mean, because a loss needs to be, as we've discussed, a mean; but if you want to see the underlying numbers before taking the mean, you can just pass in reduction='none', and that shows you the individual cross-entropy losses before the mean is taken.
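And here is the whole chain put together, with toy activations and targets: log_softmax followed by nll_loss gives the same number as cross-entropy loss, in both its class and functional forms.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)                     # toy values only
acts = torch.randn((6, 2)) * 2
targ = torch.tensor([0, 1, 0, 1, 1, 0])

loss_func = nn.CrossEntropyLoss()         # class version, as the learner uses
print(loss_func(acts, targ))              # a single number: the mean loss
print(F.cross_entropy(acts, targ))        # functional version, same number

# Definition: log_softmax then nll_loss
print(F.nll_loss(F.log_softmax(acts, dim=-1), targ))

# Per-item losses, before the mean is taken
print(nn.CrossEntropyLoss(reduction='none')(acts, targ))
```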
Okay, great, so this is a good place to stop our discussion of loss functions and such things. Rachel, were there any questions about this? "Why does the loss function need to be negative?" Well, I guess it doesn't, but we want something where the lower it is, the better, and we need it to cut off somewhere. I'll have to think about this more during the week, because I'm a bit tired; let me refresh my memory when I'm awake.
Okay, now, "next week": well, not for the video; next week actually happened last week, so the thing I'm about to say is a bit out of order. Next week we're going to be talking about data ethics, and I wanted to segue into that by talking about how my week has gone. A week or two ago, as part of a lesson, I talked about the efficacy of masks, specifically wearing masks in public, and I pointed out that the efficacy of masks seemed like it could be really high and maybe everybody should be wearing them. Somehow I found myself as the face of a global advocacy campaign: if you go to masksforall.co you'll find a website talking about masks, and I've been on TV shows in South Africa and the US and England and Australia, and on radio, and so on, talking about masks. Why is this? Well, it's because, as a data scientist, I noticed that the data around masks seemed to be getting misunderstood, and it seemed that that misunderstanding was possibly costing hundreds of thousands of lives: literally, the places that were using masks seemed to be associated with orders of magnitude fewer deaths. One of the things we're talking about next week is what your role is as a data scientist, and I strongly believe that it's to understand the data and then do something about it. Nobody was talking about this, so I ended up writing an article that appeared in the Washington Post, which basically called on people to really consider wearing masks; that's this article. I was lucky: I managed to get a pretty decent-sized team of brilliant volunteers who helped build this website, plus some PR folks and so forth. But what became clear as I was talking to politicians, senators, staffers, is that people weren't convinced by the science, which is fair enough: when the WHO and the CDC are saying you don't need to wear a mask, and some random data scientist is saying that doesn't seem to be what the data is showing, if you've got half a brain you'd pick the WHO and the CDC, not the random data scientist. So I really felt that if I was going to be an effective advocate, I needed to sort the science out, and, you know, credentialism is strong, so it wouldn't be enough for me to say it; I needed to find other people to say it. So I put together a team of 19 scientists, including a professor of sociology, a professor of aerosol dynamics, the founder of an African movement that studied preventative methods for tuberculosis, a Stanford professor who studies mask disposal and cleaning methods, a group of Chinese scientists who study epidemiology modelling, a UCLA professor who is one of the top infectious disease epidemiologists, and so forth: a kind of all-star team of people from all around the world. I had never met most of these people before; well, not quite true: I knew Austin a little bit, I knew Zeynep a little bit, I knew Lex a little bit, and Reshama we all know, she's awesome, so it was great to have a fast.ai community person there too. But on the whole I tried to pull together people from as many geographies and as many areas of expertise as possible, and the global community helped me find papers about everything: about how
different materials work, how droplets form, epidemiology, case studies of people infecting others with and without masks, and so on. And in the last week, basically, we wrote this paper; it contains 84 citations, we worked around the clock on it as a team, and it's out. Some of the earlier versions were sent to some governments three or four days ago: one of the things I did in putting this team together was to look for people who were working closely with government leaders, not just scientists, and so this went out to a number of government ministers, and in the last few days I've heard that it was a very significant part of decisions by governments to change their guidelines around masks. The fight's not over by any means (in particular, the UK is a bit of a holdout), but I'm going to be on ITV tomorrow and then the BBC the next day. It's required stepping out to be a lot more than just a data scientist: I've had to pull together politicians and staffers, I've had to hustle with the media to try to get coverage, and today I'm starting to do a lot of work with unions to try to get them to understand this. It's really a case of saying: okay, as a data scientist, in conjunction with real scientists, we've built this really strong understanding that masks are this simple but incredibly powerful tool, but that doesn't do anything unless I can effectively communicate it to decision makers. So today I was on the phone to one of the top union leaders in the country explaining what this means. It turns out that in buses in America the air conditioning is set up so that it blows from the back to the front, and there are actually case studies in the medical literature of how people seated downwind of an air conditioning unit in a restaurant all ended up getting sick with COVID-19. So we can see why bus drivers are dying: they're right in the wrong spot, and their passengers aren't wearing masks. So I try to explain this science to union leaders, so they understand that to keep workers safe it's not enough just for the driver to wear a mask; all the people on the bus need to be wearing masks as well. All of this is basically to say: as data scientists, I think we have a responsibility to study the data and then do something about it. It's not just a research exercise, it's not just a computation exercise; what's the point of doing things if it doesn't lead to anything? So next week we'll be talking about this a lot more, but I think this is a really interesting example of how digging into the data can lead to really amazing things happening. In this case, I strongly believe, and a lot of people are telling me they strongly believe, that the advocacy work that's come out of this data analysis is already saving lives. So I hope this might help inspire you to take your data analysis to places where it really makes a difference. Thank you very much, and I'll see you next week.