Hi everybody, and welcome to lesson 13, where we're going to start talking about backpropagation. Before we do, I'll just mention that there was some great success amongst the folks in the class during the week in flexing their tensor manipulation muscles. So far the fastest mean shift algorithm, with similar accuracy to the one I displayed, is one that randomly chooses a subset of the data points.
And I actually think that's a great approach — very often, random sampling and random projections are two excellent ways of speeding up algorithms. So it'd be interesting to see if anybody during the rest of the course comes up with anything faster than random sampling. I've also been seeing some good Einstein summation examples and implementations, and continuing to see lots of good DiffEdit implementations.
So congratulations to all the students, and I hope those of you following along with the videos in the MOOC will be working on the same homework as well, and sharing your results on the fast.ai forums. So now we're going to take a look at notebook number three in the course repo — the course22p1 repo.
And we're going to be looking at the forward and backward passes of a simple multi-layer perceptron neural network. The initial stuff up here is just imports and settings — copying and pasting some things from previous notebooks around paths, parameters, and so forth.
So we'll skip over this. So we'll often be kind of copying and pasting stuff from one notebook to another's kind of first cell to get things set up. And I'm also loading in our data for MNIST as tensors. Okay. So we, to start with, need to create the basic architecture of our neural network.
And I did mention at the start of the course that we will briefly review everything that we need to cover. So we should briefly review what basic neural networks are and why they are what they are. So to start with, let's consider a linear model. So let's start by considering a linear model of, well, let's take the most simple example possible, which is we're going to pick a single pixel from our MNIST pictures.
And so that will be our X. And for our Y values, then, we'll have some loss function of how good this model is — sorry, not a loss function; actually, let's make it even simpler. For our Y value, we're going to be looking at: how likely is it that this is, say, the number three, based on the value of this one pixel?
So the pixel, its value will be X and the probability of being the number three, we'll call Y. And if we just have a linear model, then it's going to look like this. And so in this case, it's saying that the brighter this pixel is, the more likely it is that it's the number three.
And so there's a few problems with this. The first one, obviously, is that as a linear model, it's very limiting because maybe we actually are trying to draw something that looks more like this. So how would you do that? Well, there's actually a neat trick we can use to do that.
What we could do is, well, let's first talk about something we can't do. Something we can't do is to add a bunch of additional lines. So consider what happens if we say, OK, well, let's add a few different lines. So let's also add this line. So what would be the sum of our two lines?
Well, the answer is, of course, that the sum of the two lines will itself be a line. So it's not going to help us at all to match the actual curve that we want. So here's the trick: instead, we could create a line like this — actually, this line here.
And now consider what happens if we add this original line with this new-- well, it's not a line, right? It's two line segments. So what we would get is this-- everything to the left of this point is going to not be changed if I add these two lines together, because this is zero all the way.
And everything to the right of it is going to be reduced. It looks like they've got similar slopes. So we might end up with instead-- so this would all disappear here. And instead, we would end up with something like this. And then we could do that again, right? We could add an additional line that looks a bit like that.
So it would go-- but this time, it could go even further out here, and it could be something like this. So what if we added that? Well, again, at the point underneath here, it's always zero, so it won't do anything at all. But after that, it's going to make it even more negatively sloped.
And if you can see, using this approach, we could add up lots of these rectified lines, these lines that truncate at zero. And we could create any shape we want with enough of them. And these lines are very easy to create, because actually, all we need to do is to create just a regular line, which we can move up, down, left, right, change its angle, whatever.
And then just say, if it's greater than zero, truncate it to zero. Or we could do the opposite for a line going the opposite direction. If it's less than zero, we could say, truncate it to zero. And that would get rid of, as we want, this whole section here, and make it flat.
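Here's a small sketch of that idea in code — the slopes and intercepts are made up purely for illustration:

```python
import torch

x = torch.linspace(-2., 2., 100)

# a line a*x + b, truncated so it can't go below zero
# (you could clamp from the other side instead, for lines going the other way)
def rectified(x, a, b): return (a * x + b).clamp_min(0.)

# adding up a few of these bends the total at each truncation point,
# so with enough of them we can match any curve we want
y = rectified(x, 1., 0.5) + rectified(x, -2., 0.3) + rectified(x, 3., -1.)
```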
OK, so these are rectified lines. And so we can sum up a bunch of these together to basically match any arbitrary curve. So let's start by doing that. Well, the other thing we should mention, of course, is that we're going to have not just one pixel, but we're going to have lots of pixels.
So to take the only slightly less simple approach, we could have something where we've got pixel number one and pixel number two — we're looking at two different pixels to see how likely they are to be the number three. And so that would allow us to draw more complex shapes, with some kind of surface between them.
OK, and then we can do exactly the same thing is to create these surfaces, we can add up lots of these rectified lines together. But now they're going to be kind of rectified planes. But it's going to be exactly the same thing. We're going to be adding together a bunch of lines, each one of which is truncated at zero.
OK, so that's the quick review. And so to do that, we'll start out by just defining a few variables. So n is the number of training examples, m is the number of pixels, c is the number of possible values of our digits, and so here they are, 50,000 samples, 784 pixels, and 10 possible outputs.
OK, so what we do is we basically decide ahead of time how many of these line segment thingies to add up. And so the number that we create in a layer is called the number of hidden nodes or activations. So we'll call that nh. So let's just arbitrarily decide on creating 50 of those.
So in order to create lots of lines, which we're going to truncate at zero, we can do a matrix multiplication. So with a matrix multiplication, we're going to have something where we've got 50,000 rows by 784 columns. And we're going to multiply that by something with 784 rows and 10 columns.
And why is that? Well, that's because if we take the very first row of this first matrix here — row one, with 784 values — they're the pixel values of the first image. OK, so this is our first image. And each of those 784 values will be multiplied by each of these 784 values in the first column, the zero-indexed column. And that's going to give us a number in our output. So our output is going to be 50,000 images by 10. We'll multiply those together and add them up, and that result is going to end up over here in this first cell. And so — in this case, this is just the example of doing a linear model — each of these cells is going to eventually represent a probability.
So this first column will be the probability of being a zero. And the second column will be the probability of one. The third column will be the probability of being a two and so forth. So that's why we're going to have these 10 columns, each one allowing us to weight the 784 inputs.
Now, of course, we're going to do something a bit more tricky than that: we're actually going to multiply our input by a 784 by 50 weight matrix to create the 50 hidden activations. Then we're going to truncate those at zero, and then multiply that by a 50 by 10 to create our 10 outputs.
So we'll do it in two steps. So the way SGD works is we start with just this is our weight matrix here. And this is our data. This is our outputs. The way it works is that this weight matrix is initially filled with random values. Also, of course, this contains our pixel values, this contains the results.
So W is going to start with random values. So here's our weight matrix: it's going to have, as we discussed, 784 by 50 random values. And it's not enough just to multiply — we also have to add; that's what completes the linear layer. We call those the biases, the things we add.
We can just start those at zeros. So we'll need one for each output — so 50 of those. And so that'll be layer one. And then, as we just mentioned, layer two will be a matrix that goes from the 50 hidden activations. And now I'm going to do something that's totally cheating, to simplify some of the calculus calculations: I'm only going to create one output. Why one output? Because I'm not going to use cross entropy just yet — I'm going to use MSE. So I'm going to create one output, which will literally just be: what number do I think it is, from 0 to 9.
And so then we're going to compare those to the actual-- so these will be our y-predictors. We normally use a little hat for that, and we're going to compare that to our actuals. And yeah, in this very hacky approach, let's say we predict over here the number 9, and the actual is the number 2.
And we'll compare those together using MSE, which is a stupid way to do it, because it implies that predicting 9 when the answer is 2 is more wrong than predicting 4 when the answer is 2 — treating the digits as quantities rather than categories — which is not what we want at all.
But this is what we're going to do just to simplify our starting point. So that's why we're going to have a single output for this weight matrix and a single output for this bias. So a linear-- let's create a function for putting x through a linear layer with these weights and these biases.
So it's a matrix multiply and an add. All right, so we can now try it. So if we multiply our x — oh, we're doing x_valid this time. So just to clarify: x_valid is 10,000 by 784. So if we put x_valid through our weights and biases with a linear layer, we end up with a 10,000 by 50 matrix of hidden activations.
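A sketch of those definitions, following the notebook:

```python
w1 = torch.randn(m, nh)    # 784 x 50 random weights
b1 = torch.zeros(nh)
w2 = torch.randn(nh, 1)    # 50 x 1, for our single (cheating) output
b2 = torch.zeros(1)

def lin(x, w, b): return x @ w + b

t = lin(x_valid, w1, b1)   # t.shape -> torch.Size([10000, 50])
```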
They're not quite ready yet, because we have to put them through ReLU. And so we're going to clamp at 0. So everything under 0 will become 0. And so here's what it looks like when we go through the linear layer and then the ReLU. And you can see here's a tensor with a bunch of things, some of which are 0 or they're positive.
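And ReLU itself is just a clamp — a one-liner:

```python
def relu(x): return x.clamp_min(0.)

t = relu(lin(x_valid, w1, b1))   # everything that was negative is now zero
```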
So that's the result of this matrix multiplication. OK, so to create our basic MLP — multi-layer perceptron — from scratch, we will take our mini-batch of x's (xb is a batch of x). We will create our first layer's output with a linear. Then we will put that through a ReLU, and then that will go through the second linear. So the first one uses w1 and b1, these ones, and the second one uses w2 and b2. And so we've now got a simple model. And as we hoped, when we pass in the validation set, we get back 10,000 digits, so 10,000 by 1. Great. So that's a good start.
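Putting it together, roughly as in the notebook:

```python
def model(xb):
    l1 = lin(xb, w1, b1)
    l2 = relu(l1)
    return lin(l2, w2, b2)

res = model(x_valid)       # res.shape -> torch.Size([10000, 1])
```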
OK, so let's use our ridiculous loss function of MSE. So our result is 10,000 by 1. And our x_valid — sorry, our y_valid — is just a vector. Now, what's going to happen if I do res minus y_valid? So before you continue in the video, have a think about that: what's going to happen if I do res minus y_valid, given the NumPy broadcasting rules we've learned? OK, let's try it. Oh, terrible. We've ended up with a 10,000 by 10,000 matrix — 100 million points. We would expect the difference to contain just 10,000 points. Why did that happen?
The reason it happened is because we have to start out at the last dimension and go right to left. And we compare the 10,000 to the 1 and say, are they compatible? And the answer is-- that's right, Alexei in the chat's got it right-- broadcasting rules. So the answer is that this 1 will be broadcast over these 10,000.
So this pair here will give us 10,000 outputs. And then we'll move to the next one. And we'll also move here to the next one. Uh-oh, there is no next one. What happens? Now, if you remember the rules, it inserts a unit axis for us. So we now have 10,000 by 1.
So that means each of the 10,000 outputs from here will end up being broadcast across the 10,000 rows here. So that means that will end up-- for each of those 10,000, we'll have another 10,000. So we'll end up with a 10,000 by 10,000 output. So that's not what we want.
So how could we fix that? Well, what we'd really want would be for this to be 10,000 comma 1 here. If that was 10,000 comma 1, then we'd compare these two right to left: they're both 1, so those match, and there's nothing to broadcast because they're the same. And then we'd go to the next one: 10,000 to 10,000. Those match, so they just operate element-wise, and we'd end up with exactly what we want — 10,000 results. Or alternatively, we could remove this dimension. And then again, same thing: going right to left, 10,000 matches 10,000.
So they'll get an element-wise operation. So in this case, I got rid of the trailing comma 1. There's a couple of ways you could do that. One is just to say: OK, grab every row and the zeroth column of res — that's going to turn it from a 10,000 by 1 into a 10,000. Or alternatively, we can say .squeeze(). Now, .squeeze() removes all trailing unit dimensions — and possibly also leading unit dimensions, I can't quite recall. I guess we should try. So let's say q = res[None, :, None], and look at q.shape. OK, so if I go q.squeeze().shape — all the unit dimensions get removed, leading and trailing. OK, so now that we've got a way to remove that axis we didn't want, we can use it. And if we do that subtraction, now we get 10,000, just like we wanted. So now let's get our training and validation y's.
We'll turn them into floats, because we're using MSE. So let's calculate our predictions for the training set, which is 50,000 by 1. And if we create an MSE function that does what we just said we wanted — the subtraction, then the square, then the mean — that's MSE.
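A sketch matching that description:

```python
def mse(output, targ): return (output[:, 0] - targ).pow(2).mean()

y_train, y_valid = y_train.float(), y_valid.float()
preds = model(x_train)     # preds.shape -> torch.Size([50000, 1])
mse(preds, y_train)
```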
So there we go, we now have a loss function being applied to our training set. OK, now we need gradients. So as we briefly discussed last time, gradients are slopes. And in fact, maybe it would even be easier to look at last time. So this was last time's notebook.
And so we saw how the gradient at this point is the slope here. And so it's the, as we discussed, rise over run. Now, so that means as we increase, in this case, time by one, the distance increases by how much? That's what the slope is. So why is this interesting?
The reason it's interesting is because of what it means for our neural network. Our neural network is some function that takes two groups of things: a matrix of our inputs, and our weight matrix. And let's assume we're also putting the result through a loss function.
So let's say, well, I mean, I guess we can be explicit about that. So we could say we then take the result of that and we put it through some loss function. So these are the predictions and we compare it to our actual dependent variable. So that's our neural net.
And that's our loss function. Okay. So if we can get the derivative of the loss with respect to, let's say, one particular weight. So let's say weight number zero. What is that doing? Well, it's saying as I increase the weight by a little bit, what happens to the loss?
And if it says, oh, well, that would make the loss go down, then obviously I want to increase the weight by a little bit. And if it says, oh, it makes the loss go up, then obviously I want to do the opposite. So the derivative of the loss with respect to the weights, each one of those tells us how to change the weights.
And so to remind you: we then multiply each derivative by a little bit — the learning rate — and subtract that from the original weights. And we do that a bunch of times, and that's called SGD. Now there's something interesting going on here, which is that in this case, there's a single input and a single output.
And so the derivative is a single number at any point — in this case, it's the speed the vehicle's going. But consider a more complex function like, say, this one. Now in this case, there's one output, but there are two inputs. And so if we want to take the derivative of this function, then we actually need to say: well, what happens if we increase X by a little bit?
And also what happens if we increase Y by a little bit? And in each case, what happens to Z? And so in that case, the derivative is actually going to contain two numbers, right? It's going to contain the derivative of Z with respect to Y. And it's going to contain the derivative of Z with respect to X.
What happens if we change each of these two numbers? So for example, these could be, as we discussed, two different weights in our neural network and Z could be our loss, for example. Now we've got actually 784 inputs, right? So we would actually have 784 of these. So we don't normally write them all like that.
We would just use this little squiggly ∂ symbol to say: the derivative of the loss with respect to all of the weights. OK, and that's just saying that there's a whole bunch of them — it's a shorthand way of writing this. OK, so it gets more complicated still, though, because think about what happens if, for example, you're in the first layer, where we've got a weight matrix that's going to end up giving us 50 outputs, right?
So for every image, we're going to have 784 inputs to our function, and we're going to have 50 outputs to our function. And so in that case, I can't even draw it, right? Because like for every-- even if I had two inputs and two outputs, then as I increase my first input, I'd actually need to say, how does that change both of the two outputs?
And as I change my second input, how does that change both of my two outputs? So for the full thing, you actually are going to end up with a matrix of derivatives. It basically says, for every input that you change by a little bit, how much does it change every output of that function?
So you're going to end up with a matrix. So that's what we're going to be doing, is we're going to be calculating these derivatives, but rather than being single numbers, they're going to actually contain matrices with a row for every input and a column for every output. And a single cell in that matrix will tell us, as I change this input by a little bit, how does it change this output?
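Written out, for a function with n inputs and m outputs, that matrix of derivatives looks like this (here laid out as just described — a row for every input, a column for every output):

$$\begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_1}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}$$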
Now eventually, we will end up with a single number for every input. And that's because our loss in the end is going to be a single number. And this is like a requirement that you'll find when you try to use SGD, is that your loss has to be a single number.
And so we generally get it by doing a sum or a mean or something like that. But as you'll see, on the way there we're going to have to deal with these matrices of derivatives. So I just want to mention — as I might have said before, I can't even remember — there is this paper that Terence Parr and I wrote a while ago, which goes through all this. It basically assumes that you only know high school calculus (and if you don't, check out Khan Academy), and then it describes matrix calculus in those terms. It explains everything exactly, and works through lots and lots of examples.
So for example, as it mentions here, when you have this matrix of derivatives, we call that a Jacobian matrix. There are all these words — it doesn't matter too much whether you know them or not — but it's convenient: saying "the Jacobian" is a bit easier than saying "the matrix of all of the derivatives, where the rows are all the inputs and the columns are all the outputs". So yeah, if you want to get to a point where papers are easier to read in particular, it's quite useful to know this notation and the definitions of these words.
You can certainly get away without it, it's just something to consider. Okay, so we need to be able to calculate derivatives, at least of a single variable. And I am not going to worry too much about that, a, because that is something you do in high school math, and b, because your computer can do it for you.
And so you can do it symbolically, using something called SymPy, which is really great. So if you create two symbols called x and y, you can say: please differentiate x squared with respect to x, and SymPy will tell you the answer is 2x. If you say: differentiate 3x squared plus 9 with respect to x, SymPy will tell you that is 6x.
And a lot of you will probably have used Wolfram Alpha, which does something very similar. I quite like this because I can quickly do it inside my notebook and include it in my prose. So I think SymPy is pretty cool. So basically, yeah, you can quickly calculate derivatives on a computer.
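Here's that exact example — a sketch matching what the transcript describes:

```python
from sympy import symbols, diff

x, y = symbols('x y')
diff(x**2, x)        # -> 2*x
diff(3*x**2 + 9, x)  # -> 6*x
```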
Having said that, I do want to talk about why the derivative of 3x squared plus 9 equals 6x, because that is going to be very important. So, 3x squared plus 9. We're going to start with the rule that the derivative of a to the power of b with respect to a equals b times a to the power of b minus 1. So for example, the derivative of x squared with respect to x equals 2x to the power of 1 — and the 1 is not needed, so just 2x. That's something I'm hoping you'll remember from high school, or you can refresh your memory using Khan Academy or similar. So what we could now do is rewrite this as y equals 3u plus 9, and then we'll write u equals x squared. OK, now this is getting easier. The derivative of two things being added together is simply the sum of their derivatives — so we get the derivative of 3u plus the derivative of 9. Now, the derivative of any constant with respect to a variable is 0, because if I change an input, it doesn't change the constant: it's always 9. So that's going to end up as 0. And so we're going to end up with dy/du equals something plus 0. And the derivative of 3u with respect to u is just 3, because it's just a line, and that's its slope. OK, but that's not dy/dx. We wanted dy/dx.
Well, the cool thing is that dy/dx is actually just equal to dy/du times du/dx. So I'll explain why in a moment. But for now, let's recognize we've got du/dx — we know that one: 2x. So we can now multiply these two bits together, and we end up with 3 times 2x, which is 6x, which is what SymPy told us.
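Written out compactly, what we just did:

$$y = 3u + 9, \qquad u = x^2$$

$$\frac{dy}{du} = 3, \qquad \frac{du}{dx} = 2x$$

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = 3 \cdot 2x = 6x$$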
So fantastic. OK, this is something we need to know really well, and it's called the chain rule. And it's best to understand it intuitively. So to understand it intuitively, we're going to take a look at an interactive animation. I found this nice interactive animation on this page here: webspace.ship.edu/msrenault.
It's the GeoGebra Calculus applets page. OK, and the idea here is that we've got a wheel spinning around, and each time it spins around, this is x going up. So at the moment, there's some change in x, dx, over a period of time. All right, now, this wheel is eight times bigger than this wheel. So each time this one goes around once, if we connect the two together, this wheel would be going around four times faster, because the ratio between eight and two is four. Maybe I'll bring this up to here. So now this wheel has twice as big a circumference as the u wheel: each time this goes around once, this goes around twice. So each time x goes around once, the change in u will be two. That's what du/dx is saying: the change in u for each change in x is two. Now we could make this more interesting by connecting this wheel to this wheel. This wheel is half the size of this one. So now we can see that, again, each time this spins around once, this spins around twice, because this has twice the circumference of that. So therefore, dy/du equals two. Now that means every time this goes around once, this goes around twice; and every time this one goes around once, this one goes around twice.
So therefore, every time this one goes around once, this one goes around four times. So dy dx equals four. So you can see here how the two-- well, how the du dx has to be multiplied with the dy du to get the total. So this is what's going on in the chain rule.
And this is what you want to be thinking about is this idea that you've got one function that is kind of this intermediary. And so you have to multiply the two impacts to get the impact of the x wheel on the y wheel. So I hope you find that useful.
Personally, I find this intuition quite useful. So why do we care about this? Well, the reason is that we want to calculate the gradient of our MSE applied to our model. And so our inputs are going through a linear layer, then a ReLU, then another linear layer, and then an MSE.
So there are four different steps going on, and we're going to have to combine them all together — and we can do that with the chain rule. So our steps are: we've got the loss function, which is some function of the predictions and the actuals. Then the output of the second layer — it's slightly weird notation, but hopefully not too bad — is a function of the ReLU activations. And the ReLU activations are a function of the first layer. And the first layer is a function of the inputs — oh, and of course, the linear layers also have weights and biases.
So we're basically going to have to calculate the derivative of that. But then remember that this is itself a function. So then we'll need to multiply that derivative by the derivative of that. But that's also a function, so we have to multiply that derivative by this. But that's also a function, so we have to multiply that derivative by this.
So that's going to be our approach: we're going to start at the end and take its derivative, and then we're going to gradually keep multiplying as we go back through each step. And this is called backpropagation. So backpropagation sounds pretty fancy, but it's actually just using the chain rule.
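In symbols — writing $L_1$ for the first linear layer's output, $R$ for the ReLU activations, and $L_2$ for the second linear layer's output — the derivative we want is:

$$\frac{\partial\, \text{loss}}{\partial w_1} = \frac{\partial\, \text{loss}}{\partial L_2} \cdot \frac{\partial L_2}{\partial R} \cdot \frac{\partial R}{\partial L_1} \cdot \frac{\partial L_1}{\partial w_1}$$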
And as you'll see, it's also just taking advantage of a computational trick of memorizing some things on the way. And in our chat, Siva made a very good point about understanding nonlinear functions in this case, which is just to consider that the wheels could be growing and shrinking all the time as they're moving.
But you're still going to have this same compound effect, which I really like. Thank you, Siva. There's also a question in the chat about why this [:,0] is being placed inside the function, given that we can do it outside the function. Well, the point is we want an MSE function that will apply to any output.
We're not using it once. We want it to work any time. So we haven't actually modified preds or anything like that, or Y_train. So we want this to be able to apply to anything without us having to pre-process it. It's basically the idea here. OK, so let's take a look at the basic idea.
So here we're going to do a forward pass and a backward pass. The forward pass is where we calculate the loss. So the loss is — oh, I've got an error here; that should be diff. There we go. So the loss is going to be the output of our neural net minus our target, squared, then take the mean.
OK. And then our output is going to be the output of the second linear layer. The second linear layer's input will be the ReLU. The ReLU's input will be the first layer. So we're going to take our input, put it through a linear layer, put that through a ReLU, put that through a linear layer, and calculate the MSE.
OK, that bit hopefully is pretty straightforward. So what about the backward pass? So the backward pass, what I'm going to do-- and you'll see why in a moment-- is I'm going to store the gradients of each layer. So for example, the gradients of the loss with respect to its inputs in the layer itself.
So I'm going to create a new attribute. I could call it anything I like, and I'm just going to call it .g. So I'm going to create a new attribute called out.g, which is going to contain the gradients. You don't have to do it this way, but as you'll see, it turns out pretty convenient.
So that's just going to be 2 times the difference, because we've got the difference squared — that's just the derivative. And since we took the mean here, we have to do the same thing here and divide by the input shape. And so that's those gradients. That's good. And now what we need to do is multiply by the gradients of the previous layer.
So here's our previous layer. So what are the gradients of a linear layer? I've created a function for that here. So the gradient of a linear layer, we're going to need to know the weights of the layer. We're going to need to know the biases of the layer. And then we're also going to know the input to the linear layer, because that's the thing that's actually being manipulated here.
And then we're also going to need the output, because we have to multiply by the gradients because we've got the chain rule. So again, we're going to store the gradients of our input. So this would be the gradients of our output with respect to the input. And that's simply the weights.
Because a matrix multiplication is just a whole bunch of linear functions, so each one's slope is just its weight. But you have to multiply by the gradient of the outputs because of the chain rule. And then the gradient with respect to the weights is going to be the inputs times the output gradients, summed up.
I'll talk more about that in a moment. The derivative with respect to the bias is very straightforward: it's the gradients of the output added together, because the bias is just a constant value added on, with a slope of 1 — so for the chain rule, we simply use the output gradient times 1, which is the output gradient. And for this one here, again, we have to do the same thing we've been doing before: multiply by the output gradients, because of the chain rule.
And then we've got the inputs and the weights: every single one of those has to be multiplied by the output gradients. And so that's why we have to do an unsqueeze(-1). So what I'm going to do now is show you how I would experiment with this code in order to understand it.
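Before we do, here's the whole thing written out — roughly what the notebook's version looks like:

```python
def lin_grad(inp, out, w, b):
    # gradient of the matmul with respect to its input: the weights, times out.g (chain rule)
    inp.g = out.g @ w.t()
    # gradient with respect to the weights: inputs times output gradients, summed over the batch
    w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)
    # gradient with respect to the bias: the output gradients, summed
    b.g = out.g.sum(0)

def forward_and_backward(inp, targ):
    # forward pass:
    l1 = lin(inp, w1, b1)
    l2 = relu(l1)
    out = lin(l2, w2, b2)
    diff = out[:, 0] - targ
    loss = diff.pow(2).mean()

    # backward pass:
    out.g = 2. * diff[:, None] / inp.shape[0]   # derivative of the mean of diff squared
    lin_grad(l2, out, w2, b2)
    l1.g = (l1 > 0).float() * l2.g              # derivative of relu, times the chain rule
    lin_grad(inp, l1, w1, b1)
```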
And I would encourage you to do the same thing. It's a little harder to do this one cell by cell because we kind of want to put it all into this function like this. So we need a way to explore the calculations interactively. And the way we do that is by using the Python debugger.
Here is how you — well, there are a few ways to do this; here's one way to use the Python debugger. The Python debugger is called pdb. So if you put pdb.set_trace() in your code, that tells the debugger to stop execution when it reaches that line. So it sets a breakpoint.
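For example, a sketch of dropping a breakpoint into lin_grad:

```python
import pdb

def lin_grad(inp, out, w, b):
    inp.g = out.g @ w.t()
    pdb.set_trace()   # execution stops here and drops into the interactive debugger
    w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)
    b.g = out.g.sum(0)
```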
So if I call forward_and_backward, you can see here it's stopped, and the interactive Python debugger, ipdb, has popped up, with an arrow pointing at the line of code it's about to run. And at this point, there's a whole range of things we can do to find out what's going on.
We hit H for help. Understanding how to use the Python debugger is one of the most powerful things I think you can do to improve your coding. So one of the most useful things you can do is to print something. You see all these single-letter things? They're just shortcuts.
But in a debugger, you want to be able to do things quickly. So instead of typing print, I'll just type p. So for example, let's take a look at the shape of the input. So I type p for print, input.shape. So I've got a 50,000 by 50 input to the last layer.
That makes sense. These are the hidden activations coming into the last layer for every one of our images. What about the output gradients? And there's that as well. And actually, a little trick. You can ignore that. You don't have to use the p at all if your variable name is not the same as any of these commands.
So I could have just typed out.g.shape — I get the same thing. OK. So you can also put in expressions. So let's have a look at the shape of this. The output of this is — let's see if it makes sense. We've got the input, 50,000 by 50. We put a new axis on the end: unsqueeze(-1) is the same as indexing with [..., None]. So that would have become 50,000 by 50 by 1. And then out.g.unsqueeze(1) inserts a new axis at position one. So we're going to have 50,000 by 50 by 1 times 50,000 by 1 by 1. And so the broadcasting happens over those last two dimensions, which is why we end up with 50,000 by 50 by 1. And then we're summing up — this makes sense, right? We want to sum over all of the inputs. Each image individually contributes to the derivative, and we want to add them all up to find their total impact — because remember, the derivative of a sum of functions is the sum of the derivatives of the functions. So we can just sum them up. Now, this is one of those situations where, if you see a multiply and a sum and an unsqueeze, it's not a bad idea to think about Einstein summation notation — maybe there's a way to simplify this. So first of all, let's just see how we can do some more stuff in the debugger.
I'm going to continue. So just continue running. So press C for continue, and it keeps running until it comes back again to the same spot. And the reason we've come to the same spot twice is because lin grad is called two times. So we would expect that the second time, we're going to get a different bunch of inputs and outputs.
And so I can print out a tuple of the inputs and output gradient. So now, yeah, so this is the first layer going into the second layer. So that's exactly what we would expect. To find out what called this function, you just type w. w is where am I?
And so you can see here where am I? Oh, forward and backward was called-- see the arrow? That called lin grad the second time, and now we're here in w.g equals. If we want to find out what w.g ends up being equal to, I can press N to say, go to the next line.
And so now we've moved from line five to line six — the instruction pointer is now looking at line six. So I could now print out, for example, w.g.shape, and there's the shape of our weights. One person in the chat has pointed out that you can use breakpoint() instead of this import pdb business.
Unfortunately, the breakpoint built-in doesn't currently work in Jupyter or in IPython, so we actually can't, sadly. That's why I'm doing it the old-fashioned way. Maybe they'll fix the bug at some point, but for now, you have to type all this. OK, so those are a few things to know about.
But I would definitely suggest looking up a Python pdb tutorial to become very familiar with this incredibly powerful tool because it really is so very handy. So if I just press continue again, it keeps running all the way to the end and it's now finished running forward and backward.
So when it's finished, we would find that there will now be, for example, a w1.g, because these are the gradients that it just calculated. And there would also be an x_train.g, and so forth. OK, so let's see if we can simplify this a little bit. So I would be inclined to take these out and give them their own variable names, just to make life a bit easier.
It would have been better if I'd actually done this before the debugging, so it'd be a bit easier to type. So let's set i and o equal to the input and the output gradient, unsqueezed. OK, so we'll get rid of our breakpoint and double-check that we've got our gradients OK.
And I guess before we run it, we should probably set those gradients to zero. What I would do here to try things out is put my breakpoint there and then try things. So let's go next. And I realize here that what we're actually doing is basically exactly the same thing an einsum would do. So I could test that out by trying an einsum: this dimension is being replicated and then I'm summing over it — that's the multiplication I'm doing. I'm basically multiplying over the first dimension of each and then summing over that dimension. So I could try running that, and it works. So that's interesting. And I've got zeros because I zeroed x_train itself — that was silly; it should be the gradients that get zeroed. OK, so let's try doing an einsum. And there we go. That seems to be working. That's pretty cool. So we've multiplied over this repeated index — we were just multiplying the first dimensions together and then summing over them.
So there's no i here in the output. Now, that's not quite the same thing as a matrix multiplication, but we could turn it into the same thing as a matrix multiplication just by swapping i and j so that they're the other way around — that way we'd have ji comma ik. And we can swap two dimensions very easily.
That's what's called the transpose. So that would become a matrix multiplication if we just use the transpose. And in numpy, the transpose is the capital T attribute. So here is exactly the same thing using a matrix multiply and a transpose. And let's check. Yeah, that's the same thing as well.
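So, three equivalent ways of computing the weight gradient, as just checked in the debugger:

```python
w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)   # broadcast multiply, then sum
w.g = torch.einsum('ij,ik->jk', inp, out.g)             # the same thing as an einsum
w.g = inp.t() @ out.g                                   # the same thing as a transpose and a matmul
```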
OK, cool. So that tells us that now we've checked in our debugger that we can actually replace all this with a matrix multiply. We don't need that anymore. Let's see if it works. It does. All right. X train dot g. Cool. OK, so hopefully that's convinced you that the debugger is a really handy thing for playing around with numeric programming ideas or coding in general.
And so I think now is a good time to take a break. So let's take an eight minute break — actually, a seven minute break. I'll see you back here in seven minutes. Thank you. OK, welcome back. So we've calculated our derivatives, and we want to test them. Luckily, PyTorch already has derivatives implemented. So I'm going to totally cheat and use PyTorch to calculate the same derivatives. Don't worry about how this works yet, because we're going to be doing all of this from scratch anyway. For now, I'm just going to run it all through PyTorch and check that its derivatives are the same as ours. And they are. So we're on the right track. OK, so this is all pretty clunky, I think we can all agree — and obviously, it's clunkier than what we do in PyTorch. So how do we simplify things? There's some really cool refactoring we can do. What we're going to do is create a whole class for each of our functions: one for the ReLU function and one for the linear function.
So the way that we're going to do this is we're going to create a dunder call — __call__. What does __call__ do? Let me show you. So if I create a class, and we're just going to set it to print hello. So if I create an instance of that class, and then I call it as if it was a function — oops, missing the dunder bit here — call it as if it's a function: it says hi. So in other words, everything can be changed in Python. You can change how a class behaves. You can make it look like a function.
And to do that, you simply define __call__. You could pass it an argument, like so. OK, so that's what __call__ does: it's just a little bit of syntactic sugar saying I want to be able to treat the object as if it's a function, without naming any method at all.
You can still do it the method way — you could have done this; I don't know why you'd want to, but you can. But because it's got this special magic name, __call__, you don't have to write the .__call__ at all. So here, if we create an instance of the Relu class, we can treat it as a function.
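As a minimal sketch of the mechanics (a made-up class, just for illustration):

```python
class A:
    def __call__(self, x): print('hi', x)

a = A()
a(1)             # prints "hi 1" -- the instance is called like a function
a.__call__(1)    # the explicit, equivalent method call
```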
And what it's going to do is take its input and do the ReLU on it. But if you look back at the forward and backward, there's something very interesting about the backward pass: it has to know about the intermediate calculations — for example, this intermediate calculation gets passed over here. This intermediate calculation gets passed over here because of the chain rule — and not just the chain rule, but because of how the derivatives are actually calculated, we're going to need some of the intermediate calculations. So we need to store each of the layers' intermediate calculations. And that's why Relu doesn't just calculate and return the output: it also stores its output, and it also stores its input.
So that way, when we call backward, we know how to calculate the gradient. We set the input's gradient — remember, we stored the input, so we can do that — and it's just going to be (inp > 0).float(), which is the derivative of a ReLU, times the output gradient for the chain rule.
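So the Relu class looks roughly like this in the notebook:

```python
class Relu():
    def __call__(self, inp):
        self.inp = inp                   # store the input for the backward pass
        self.out = inp.clamp_min(0.)     # store the output too
        return self.out

    def backward(self):
        # derivative of relu is 1 where the input was positive, else 0; then chain rule
        self.inp.g = (self.inp > 0).float() * self.out.g
```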
So that's how we can calculate the forward pass and the backward pass for ReLU, without having to store all this intermediate stuff separately — it happens automatically. So we can do the same thing for a linear layer. Now, a linear layer needs some additional state — weights and biases — which ReLU doesn't, so ReLU has no __init__.
So when we create a linear layer, we have to say, what are its weights? What are its biases? We store them away. And then when we call it on the forward pass, just like before, we store the input. So that's exactly the same line here. And just like before, we calculate the output and store it and then return it.
And this time, of course, we just call lin. And then for the backward pass, it's the same thing. So the input gradients we calculate just like before — oh, and .t() with a little t as a method is exactly the same as the capital T property. So that's the same thing: it's just the transpose.
Calculate the gradients of the weights, again with the chain rule, and the bias, just like we did before — and they're all being stored in the appropriate places. And then for MSE, we can do the same thing: we don't just calculate the MSE, we also store it. And the MSE needs two things — an input and a target.
So we'll store those as well. So then in the backward pass, we can calculate the gradient of its input as being two times the difference. And there it all is. So our model now is much easier to define: we can just create a bunch of layers — Lin(w1, b1), Relu(), Lin(w2, b2).
And then we can store an instance of Mse. This is not calling mse — it's creating an instance of the Mse class. And this is an instance of the Lin class, and this is an instance of the Relu class — they're just being stored. So then when we call the model, we pass it our inputs and our target.
We go through each layer, set x equal to the result of calling that layer, and then pass that to the loss. So there's something kind of interesting here that you might have noticed: we don't have two separate functions, with the loss function being applied to a separate neural net.
But we've actually integrated the loss function directly into the model — see how the loss is being calculated inside the model? Now, that's neither better nor worse than having it separately; it's just different. And generally, a lot of Hugging Face stuff does it this way: they actually put the loss inside the forward. fast.ai and a lot of other libraries do it separately: the loss is a whole separate function, and the model only returns the result of putting the input through the layers. So for this model, we're going to do the loss function inside the model.
So for backward, we just do each thing in turn: self.loss.backward() — self.loss is the Mse object, so that's going to call its backward. And remember, when it was called here, it stored the inputs, the targets, and the outputs, so it can calculate the backward pass. And then we go through each layer in reverse — this is backpropagation: going backwards, calling backward on each one.
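Here's the whole refactored version, roughly as it appears in the notebook:

```python
class Lin():
    def __init__(self, w, b): self.w, self.b = w, b

    def __call__(self, inp):
        self.inp = inp
        self.out = lin(inp, self.w, self.b)
        return self.out

    def backward(self):
        self.inp.g = self.out.g @ self.w.t()
        self.w.g = self.inp.t() @ self.out.g
        self.b.g = self.out.g.sum(0)

class Mse():
    def __call__(self, inp, targ):
        self.inp, self.targ = inp, targ
        self.out = (inp[:, 0] - targ).pow(2).mean()
        return self.out

    def backward(self):
        self.inp.g = 2. * (self.inp[:, 0] - self.targ).unsqueeze(-1) / self.targ.shape[0]

class Model():
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1, b1), Relu(), Lin(w2, b2)]
        self.loss = Mse()

    def __call__(self, x, targ):
        for l in self.layers: x = l(x)
        return self.loss(x, targ)

    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers): l.backward()
```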
This is back propagation, backwards reversed, calling backward on each one. So that's pretty interesting, I think. So now we can calculate the model, we can calculate the loss, we can call backward, and then we can check that each of the gradients that we stored earlier are equal to each of our new gradients.
OK, so William's asked a very good question, which is: if you do put the loss inside here, how on earth do you actually get predictions? So generally, what happens in practice is that Hugging Face models do something like this: I'll say self.preds = x, and then self.final_loss = that, and then return self.final_loss.
And that way — I guess you don't even need that last bit, but anyway, that is what they do, so I'll leave it there — you can check model.preds, for example. So it'll be something like that. Or alternatively, you can return not just the loss but both, as a dictionary — stuff like that.
So there's a few different ways you could do it. Actually, now I think about it, I think that's what they do, is they actually return both as a dictionary, so it would be like return dictionary loss equals that, comma, preds equals that, something like that, I guess, is what they would do.
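Something like this hypothetical sketch — the names here are made up for illustration, not Hugging Face's actual API:

```python
class ModelWithPreds(Model):
    def __call__(self, x, targ):
        for l in self.layers: x = l(x)
        self.preds = x                        # keep the predictions around
        self.final_loss = self.loss(x, targ)
        return {'loss': self.final_loss, 'preds': self.preds}
```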
Anyway, there's a few different ways to do it. OK, so hopefully you can see that this is really making it nice and easy for us to do our forward pass and our backward pass without all of this manual fiddling around. Every class now can be totally, separately considered and can be combined however we want.
So you could try creating a bigger neural net if you want to. But we can refactor it more. Basically, as a rule of thumb, when you see repeated code — self.inp = inp in every class, self.out = ... followed by return self.out in every class — that's a sign you can refactor things.
And so what we can do, as a simple refactoring, is to create a new class called Module. And Module is going to do those things we just said: it's going to store the inputs, and it's going to call something called self.forward in order to create our self.out — because remember, that was one of the things we had again and again: self.out, self.out.
And then return it. And so now there's going to be a thing called forward, which in this base class doesn't do anything, because the whole purpose of this Module is to be inherited. When we call backward, it's going to call the subclass's bwd, passing in self.out — because notice, all of our backwards always wanted to get hold of self.out; we need it for the chain rule.
So let's pass that in, and pass in those arguments that we stored earlier. The star means: take all of the arguments, whether there's 0, 1, 2, or more, and put them into a list — that's what happens when the star is inside the function signature. And when you call a function using star, it takes that list and expands it into separate arguments.
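So the base class looks roughly like this:

```python
class Module():
    def __call__(self, *args):
        self.args = args                    # store whatever inputs were passed
        self.out = self.forward(*args)      # subclasses implement forward
        return self.out

    def forward(self): raise Exception('not implemented')

    def backward(self): self.bwd(self.out, *self.args)   # hand back self.out plus the stored args
```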
So now, for relu, look how much simpler it is. Let's copy the old relu to the new relu. So the old relu had to do all this storing stuff manually. And it had all the self.stuff as well, but now we can get rid of all of that and just implement forward, because that's the thing that's being called, and that's the thing that we need to implement.
And so now the forward of Relu just does the one thing we want, which makes the code much cleaner and more understandable. And the backward also just does the one thing we want. So that's nice. We do still have to do the chain rule multiplication manually, though.
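Here's roughly what the refactored Relu comes down to:

```python
class Relu(Module):
    def forward(self, inp): return inp.clamp_min(0.)
    def bwd(self, out, inp): inp.g = (inp > 0).float() * out.g   # chain rule still done by hand
```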
And it's the same thing for linear, and the same thing for MSE — these all look a lot nicer. One thing to point out here is that there are often opportunities to manually speed things up when you create custom autograd functions. And here's an example: look, this calculation is being done twice, which seems like a waste, doesn't it?
So at the cost of some memory, we could instead store that calculation as diff. Right, and I guess we'd have to store it for use later, so it'll need to be self.diff. And at the cost of that memory, we could now remove this redundant calculation because we've done it once before already and stored it and just use it directly.
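A sketch of that memory-for-speed trade, following the notebook:

```python
class Mse(Module):
    def forward(self, inp, targ):
        self.diff = inp[:, 0] - targ          # stored, at the cost of some memory...
        return self.diff.pow(2).mean()

    def bwd(self, out, inp, targ):
        inp.g = 2. * self.diff.unsqueeze(-1) / targ.shape[0]   # ...so we don't recompute it here
```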
And this is something that you can often do in neural nets. So there's this compromise between storing things, the memory use of that, and then the computational speed up of not having to recalculate it. This is something we come across a lot. And so now we can call it in the same way, create our model, passing in all of those layers.
So you can see, with our model — the model hasn't changed at this point; the definition was up here — we just pass in, sorry, not the layers, the weights for the layers. Create the loss, call backward, and look, it's the same. Hooray. OK, so thankfully PyTorch has written all this for us.
And remember, according to the rules of our game, once we've reimplemented something, we're allowed to use PyTorch's version. So PyTorch calls their version nn.Module. And it's exactly the same: we inherit from nn.Module. So if we want to create a linear layer, just like this one, rather than inheriting from our Module, we inherit from theirs.
But everything's exactly the same. We can create our random numbers — in this case, rather than passing in the already randomized weights, we're actually going to generate the random weights ourselves, and the zeroed biases. And then here's our linear layer — we could also have used lin for that, of course — to define our forward.
And why don't we need to define backward? Because PyTorch already knows the derivatives of all of the functions in PyTorch, and it knows how to use the chain rule. So we don't have to do the backward at all. It'll actually do that entirely for us, which is very cool.
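A sketch of that nn.Module version:

```python
from torch import nn

class Linear(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        # generate the random weights and zeroed biases ourselves this time
        self.w = torch.randn(n_in, n_out).requires_grad_()
        self.b = torch.zeros(n_out).requires_grad_()

    def forward(self, inp): return inp @ self.w + self.b   # no backward needed: autograd does it
```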
So we only need forward; we don't need backward. So let's create a model that is an nn.Module — otherwise it's exactly the same as before. And now we're going to use PyTorch's MSE loss, because we've already implemented it ourselves. It's very common to import torch.nn.functional as capital F; this is where lots of these handy functions live, including mse_loss.
And so now you know why we need the [:,None] — because you saw the problem if we don't have it. So we create the model and call backward. And remember, we stored our gradients in something called .g; PyTorch stores them in something called .grad, but it's doing exactly the same thing.
So there are the exact same values. So let's take stock of where we're up to. We've created a matrix multiplication from scratch. We've created linear layers. We've created a complete backprop system of modules. We can now calculate both the forward pass and the backward pass for linear layers and ReLUs, so we can create a multi-layer perceptron.
So we're now up to a point where we can train a model. So let's do that. Mini batch training, notebook number four. So same first cell as before. We won't go through it. This cell's also the same as before, so we won't go through it. Here's the same model that we had before, so we won't go through it.
So just running all that to start. OK, so the first thing we should do, I think, is to improve our loss function so it's not total rubbish anymore. So if you watched part one, you might recall that there are some Excel spreadsheets; one of those is the cross-entropy example.
OK, so this is what we looked at. So just to remind you what we're doing now: rather than outputting a single number for each image, we're instead going to output 10 numbers for each image. Well, actually, the outputs won't be 1, 0, 0 — they'll basically be probabilities, won't they? So they'll be like 0.99, 0.01, et cetera. And the targets will be one hot encoded. So if it's the digit 0, for example, the target might be 1, 0, 0, 0, 0, and so on, for all the 10 possibilities.
And so to see how good it is — in this case, it's really good: it had a 0.99 probability prediction that it's 0, and indeed it is, because this is the one hot encoded version. And so the way we implement that is we don't even need to actually do the one hot encoding, thanks to some tricks.
We can actually just directly store the integer, but treat it as if it were one hot encoded. So we can just store the actual target, 0, as an integer. So the way we do that is we say, for example, for a single item: the classes could be, let's say, cat, dog, plane, fish, building.
The neural net spits out a bunch of outputs. What we do for softmax is we go e to the power of each of those outputs, and we sum up all of those e-to-the-power-ofs. So here's the e to the power of each of those outputs, and this is the sum of them.
And then we divide each one by the sum. So divide each one by the sum. That gives us our Softmaxes. And then for the loss function, we then compare those Softmaxes to the one hot encoded version. So let's say it was a dog. Then it's going to have a 1 for dog and 0 everywhere else.
And then cross entropy — this is from this nice blog post here — is this calculation: the sum of the ones and zeros multiplied by the log of the probabilities. So here is the log probability times the actuals. And since the actuals are either 0 or 1, and only one of them is going to be a 1, we're only going to end up with one value here.
And so if we add them up, it's all 0 except for one of them. So that's cross entropy. So in this special case, where the target is one hot encoded, doing the one hot encoding multiplied by the log softmax is actually identical to simply saying: oh, dog is in this row.
Let's just look it up directly and take its log Softmax. We can just index directly into it. So it's exactly the same thing. So that's just review. So if you haven't seen that before, then yeah, go and watch the part one video where we went into that in a lot more detail.
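As a small numeric sketch of that review (the class names and output values are made up):

```python
# hypothetical 5-class example: cat, dog, plane, fish, building
out = torch.tensor([0.5, 2.0, -1.0, 0.3, 1.1])   # raw outputs from the net
sm = out.exp() / out.exp().sum()                  # softmax: all positive, sums to 1
targ = 1                                          # the target class is "dog"
loss = -sm[targ].log()   # same as -(one_hot * sm.log()).sum(): only the target term survives
```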
OK. So here's our softmax calculation. It's e to the power of each output divided by the sum of them — or we can use sigma notation to say exactly the same thing. And as you can see, Jupyter Notebook lets us use LaTeX. If you haven't used LaTeX before, it's actually surprisingly easy to learn.
You just put dollar signs around your equations, like this. Backslash commands are kind of like functions, and curly braces are used for their arguments. So you can see here: here is e to the power of something; the underscore is used for subscripts, so this is x subscript i; and the caret is used for superscripts. And here are the dots — you can see, here they are. So yeah, learning LaTeX is easier than you might expect, and it can be quite convenient for writing these functions when you want to. So anyway, that's what softmax is.
As we'll see in a moment — well, actually, as you've already seen — for cross entropy we don't really want softmax, we want the log of softmax. So log softmax is: we've got x.exp(), so e to the x, divided by x.exp().sum(). And we're going to sum over the last dimension. And then we actually want to keep that dimension, so that when we do the divided-by, we have a trailing unit axis — for exactly the same reason we saw when we did our MSE loss function. So if you sum with keepdim=True, it leaves a unit axis in that last position.
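In code, roughly as the notebook writes it:

```python
def log_softmax(x):
    # keepdim=True keeps the trailing unit axis, so the division broadcasts row by row
    return (x.exp() / x.exp().sum(-1, keepdim=True)).log()
```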
So we don't have to put it back, and we avoid that horrible outer product issue. So this is the equivalent of this, and then .log(). So that's log softmax. So there is the log softmax of the predictions. Now, in terms of high school math that you may have forgotten but are definitely going to want to know, a key piece is the log and exponent rules.
So check out Khan Academy or similar if you've forgotten them. But as a quick reminder, the one we use here is that log of a over b equals log of a minus log of b, and equivalently, log of a times b equals log of a plus log of b.
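Written out as equations, those two rules are:

```latex
$$\log\frac{a}{b} = \log a - \log b \qquad\qquad \log(ab) = \log a + \log b$$
```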
And these are very handy, because, for example, division can take a long time, and multiplication can create really big numbers that accumulate lots of floating point error. Being able to replace these things with pluses and minuses is very handy indeed. In fact, 20 years ago, at a company where I did a lot of work with SQL and math, I used to give people an interview question about this.
SQL actually only has a sum aggregate for group by clauses, no product. And I used to ask people how you would calculate a compound interest column, where the answer is basically that, because compound interest is a product, you take the sum of the log of the column and then e to the power of all of that.
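Here's a small sketch of that same idea in PyTorch rather than SQL, with made-up interest factors:

```python
import torch

# Hypothetical per-period interest factors, purely for illustration.
rates = torch.tensor([1.05, 1.02, 1.07, 1.01])

direct   = rates.prod()             # compound interest is a product
via_logs = rates.log().sum().exp()  # sum of the logs, then e to the power of it
assert torch.allclose(direct, via_logs)
```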
So there are all kinds of little places where these things come in handy, but they come up in neural nets all the time. And we're going to take advantage of that here, because we've got a division that's being logged: the log of the quotient becomes the log of x.exp() minus the log of the sum. And rather handily, exp and log are opposites.
So log of x.exp() is going to end up just being x. So log Softmax is just x minus all of this logged, and here it is: all of this, logged. So that's nice. So here's our simplified version.
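As a sketch, the simplified version is something like:

```python
def log_softmax(x):
    # log(a/b) = log(a) - log(b), and log(exp(x)) = x,
    # so the numerator's log simply collapses back to x
    return x - x.exp().sum(-1, keepdim=True).log()
```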
Okay. Now there's another very cool trick, which is one of those things I figured out myself and then discovered other people had known for years. So, not my trick, but it's always nice to rediscover things. The trick is what's written here; let me explain what's going on. This piece here, the log of this sum, right, this sum here: we've got x.exp().sum(). Now x could contain some pretty big numbers, and e to the power of those is going to be really big numbers.
And e to the power of things creating really big numbers is a problem: there's much less precision in your computer's floating point handling the further you get away from zero, basically. So we don't want really big numbers, particularly because we're going to be taking derivatives. And if you're in a region that's not very precise as far as floating point math is concerned, then the derivatives are going to be a disaster.
They might even be zero because you've got two numbers that the computer can't even recognize as different. So this is bad, but there's a nice trick we can do to make it a lot better. What we can do is we can calculate the max of x, right, and we'll call that a.
And so then, rather than taking the log of the sum of e to the x i directly, we're going to define a as the maximum of all of our x values: our biggest number. Now, if we subtract it from every number, none of the exponents are going to be positive, so by definition none of the e-to-the-power-ofs can be big.
Now the problem is that's given us a different result, right? But if you think about it, let's expand this sum. It's e to the power of x1, if we don't include our minus a, plus e to the power of x2, plus e to the power of x3 and so forth.
Okay, now we've just subtracted a from our exponents, which means we're now wrong. But I've got good news. I've got good news and bad news. The bad news is that you've got more high school math to remember, which is the exponent rules: x to the a plus b equals x to the a times x to the b.
And similarly, x to the a minus b equals x to the a divided by x to the b. And to convince yourself that's true, consider, for example, 2 to the power of 2 plus 3. What is that? Well, 2 to the power of 2 is just 2 times 2, and 2 to the power of 3 is 2 times 2 times 2, so multiplying them together gives you five 2s: 2 to the power of 5.
So you've got 2 of them here and another 3 of them here, and we're just adding up those counts to get the total exponent. So we can take advantage of this here and say: oh well, this is equal to e to the x1 over e to the a, plus e to the x2 over e to the a, plus e to the x3 over e to the a.
And this is a common denominator. So we can put all that together. And why did we do all that? Because if we now multiply that all by e to the a, these would cancel out and we get the thing we originally wanted. So that means we simply have to multiply this by that, and this gives us exactly the same thing as we had before.
But critically, with this, we're no longer ever going to get a giant number. So this might seem a bit weird, since we're doing extra calculations: it's not a simplification, it's a complexification, but it's one that makes things much easier for our floating point unit. So that's our trick: rather than taking the log of this sum directly, what we actually do is take the log of e to the a times the sum of e to the x minus a.
And since we've got the log of a product, that's just the sum of the logs, and the log of e to the a is just a. So it's a plus that. So this here is called the log sum exp trick. Oops, people are pointing out that I've made a mistake, thank you.
That, of course, should have been inside the log, you can't just go sticking it on the outside like a crazy person, that's what I meant to say. OK, so here is the log sum exp trick, I'll call it m instead of a, which is a bit silly, should have called it a.
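Written out correctly in one line, the identity is:

```latex
$$\log \sum_i e^{x_i}
  = \log \Big( e^{a} \sum_i e^{x_i - a} \Big)
  = a + \log \sum_i e^{x_i - a},
  \qquad a = \max_i x_i$$
```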
But anyway: we find the maximum over the last dimension, and then here is the m plus that exact thing. OK, so that's just another way of doing that; that's the log sum exp. So now we can rewrite log Softmax as x minus log sum exp.
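As a sketch, assuming x is a batch of rows of predictions, that's roughly:

```python
def logsumexp(x):
    m = x.max(-1)[0]  # the biggest value in each row
    # every exponent is now <= 0, so no exp can blow up
    return m + (x - m[:, None]).exp().sum(-1).log()

def log_softmax(x):
    return x - logsumexp(x)[:, None]
```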
We're not going to use our own version, though, because PyTorch already has one, so we'll just use PyTorch's. And if we check: here we go, here are our results. And so then, as we've discussed, the cross entropy loss is the negative sum of the targets times the log probabilities. And as we discussed, our targets are one hot encoded, or better still, actually just the integers.
So what we can do (I should make that more clear: they're actually just the integer indices there) is simply rewrite that as the negative of the log probability at the target's index. That's what we have in our Excel. And so how do we do that in PyTorch?
So this is quite interesting. There's a lot of cool things you can do with array indexing in pytorch and numpy. So basically they use the same approaches. Let's take a look. Here is the first three actual values in YTrain, they're 5, 0, and 4. Now what we want to do is we want to find in our softmax predictions, we want to get 5, the fifth prediction in the zeroth row, the zeroth prediction in the first row, and the fourth prediction in the index 2 row.
So these are the numbers that we want; this is what we'd add up for the first three rows of our loss function. So how do we do that all in one go? Well, here's a cool trick. See here, I've got 0, 1, 2. If we index using two lists, we can put 0, 1, 2 here.
And for the second list, we can put the first three values of YTrain: 5, 0, 4. And this is actually going to return the values at row 0, column 5; row 1, column 0; and row 2, column 4. Which is, as you see, exactly the same thing. So this is actually giving us what we need for the cross entropy loss.
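Here's a small stand-alone demonstration of that indexing trick, with random stand-in predictions:

```python
import torch

sm_pred = torch.randn(3, 10).log_softmax(-1)  # stand-in log Softmax predictions
targets = torch.tensor([5, 0, 4])             # the first three labels

# Indexing with two lists pairs them up element-wise, returning
# sm_pred[0,5], sm_pred[1,0], and sm_pred[2,4] in one go.
print(sm_pred[[0, 1, 2], targets])
```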
So if we take range of our targets' zeroth dimension (which is all this is) and the targets, index with them, and then take the negative of that, dot mean, that gives us our cross entropy loss, which is pretty neat, in my opinion. All right. PyTorch calls this the negative log likelihood loss, but that's all it is.
And so if we take the log Softmax and pass it to the negative log likelihood, we get the loss. And this particular combination in PyTorch is called F.cross_entropy. So let's just check. Yep, F.cross_entropy gives us exactly the same thing. So that's cool.
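Putting those pieces together, a sketch of the whole thing, checked against F.cross_entropy on stand-in logits and labels, might look like:

```python
import torch
import torch.nn.functional as F

def nll(log_probs, target):
    # for each row, pick out the log probability of that row's target class
    return -log_probs[range(target.shape[0]), target].mean()

logits = torch.randn(64, 10)           # stand-in predictions
targets = torch.randint(0, 10, (64,))  # stand-in integer labels

loss = nll(logits.log_softmax(-1), targets)
assert torch.allclose(loss, F.cross_entropy(logits, targets))
```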
So we have now re-implemented the cross entropy loss. And there's a lot of confusing things going on there, a lot. And so this is one of those places where you should pause the video and go back and look at each step and think not just like, what is it doing, but why is it doing it?
And also try typing in lots of different values yourself to see if you can see what's going on, and then put this aside and test yourself by re-implementing log soft max, and cross entropy yourself, and compare them to PyTorch's values. And so that's a piece of homework for you for this week.
So now that we've got that, we can actually create a training loop. So let's set our loss function to be cross entropy. Let's create a batch size of 64. And so here's our first mini batch. OK, so xB is the x mini batch. It's going to be from 0 up to 64 from our training set.
So we can now calculate our predictions. So that's 64 by 10. So for each of the 64 images in the mini batch, we have 10 probabilities, one for each digit. And our y is just-- in fact, let's print those out. So there's our first 64 target values. So these are the actual digits.
And so we calculate our loss function. We're going to start with a bad loss, because it's entirely random at this point. OK, so for each of the predictions we made (those are our predictions), remember, those predictions are 64 by 10. What did we predict? For each one of these 64 rows, we have to go in and see where the highest number is.
So if we go through here, we can go through each one. Here's a 0.1... OK, it looks like this is the highest number; counting across, 0, 1, 2, 3, it's at index 3. So you've got to find the index of the highest number, and the function that finds the index of the highest number is called argmax.
And yep, here it is, 3. And I guess we could have also written this as preds.argmax(); normally you can do it either way, and I actually prefer to do it this way. Yep, there's the same thing. OK, and the reason we want this is because we want to be able to calculate accuracy.
We don't need it for the actual neural net; we just like to be able to see how we're going, because it's a metric, something that we use for understanding. So we take the argmax and we compare it to the actuals. That's going to give us a bunch of bools.
If you turn those into floats, they'll be ones and zeros, and the mean of those floats is the accuracy. So our current accuracy, not surprisingly, is around 10%; it's 9% because it's random, and that's what you would expect.
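As a sketch, that accuracy metric is roughly:

```python
def accuracy(preds, yb):
    # argmax finds the index of the highest prediction in each row;
    # comparing to the targets gives bools whose float mean is the accuracy
    return (preds.argmax(dim=-1) == yb).float().mean()
```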
So let's train our first neural net. We'll set a learning rate and do a few epochs. We're going to go through each epoch, and within each epoch we're going to go from 0 up to n, the 50,000 training rows, skipping by 64, the batch size, each time. And so we're going to create a slice that starts at i, so starting at 0, and goes up 64, unless we've gone past the end, in which case we'll just go up to n.
And so then we will slice into our training set for the x and for the y to get our x and y batches. We will then calculate our predictions, our loss function, and do our backward. So the way I did this originally was I had all of these in separate cells.
And I just typed in i = 0 and then went through one cell at a time, calculating each one until they all worked, and then I could put them in a loop. OK, so once we've done backward, we can then, with torch.no_grad(), go through each layer, and if it's a layer that has weights, we update the weights to the existing weights minus the gradients times the learning rate.
And then we zero out the gradients of the weights and the biases. The trailing underscore means do it in place, so that sets the gradients to 0.
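Putting the whole loop together, here's a minimal sketch, assuming a model whose .layers expose .weight and .bias as in this notebook, along with x_train, y_train, loss_func, and illustrative values for n, bs, lr, and epochs:

```python
for epoch in range(epochs):
    for i in range(0, n, bs):
        s = slice(i, min(n, i + bs))       # don't run off the end
        xb, yb = x_train[s], y_train[s]
        loss = loss_func(model(xb), yb)
        loss.backward()
        with torch.no_grad():              # the update itself isn't tracked
            for l in model.layers:
                if hasattr(l, 'weight'):
                    l.weight -= l.weight.grad * lr
                    l.bias   -= l.bias.grad * lr
                    l.weight.grad.zero_()  # trailing underscore: in place
                    l.bias.grad.zero_()
```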
So if I run that... oops, got to run all of them; I guess I skipped a cell. There we go, it's finished. So you can see that our accuracy on the training set (which is a bit unfair, but it's only three epochs) is nearly 97%. So we now have a digit recognizer that trains pretty quickly and is not terrible at all. That's a pretty good starting point. All right, so what we're going to do next time is refactor this training loop to make it dramatically, dramatically simpler, step by step, until eventually we get it down to something much, much shorter.
And then we're going to add a validation set to it, and a multiprocessing data loader. And then, yeah, we'll be in a pretty good position, I think, to start training some more interesting models. All right, hopefully you found that useful and learned some interesting things. And what I'd really like you to do at this point, now that you've got all these key basic pieces in place, is to really try to recreate them without peeking as much as possible.
So recreate your matrix multiply, recreate those forward and backward passes, recreate something that steps through layers, and even see if you can recreate the idea of the dot forward and the dot backward. Make sure it's all in your head really clearly so that you fully understand what's going on.
At the very least, if you don't have time for that (because that's a big job), you could pick out a smaller part of it, the piece that you're most interested in. Or you could just go through and look really closely at these notebooks. So if you go to Kernel, Restart and Clear Output, it will delete all the outputs, and then try to think: what are the shapes of things?
Can you guess what they are? Can you check them? And so forth. OK, thanks, everybody. Hope you have a great week, and I will see you next time. Bye.