Back to Index

Machine Learning 1: Lesson 8


Chapters

0:00 Intro
2:00 Random forests
5:20 Recognition
7:30 Data
12:55 Tensors
16:40 Negative log likelihood
33:49 Slicing
38:20 Neural Network
44:15 Getting Started
47:15 PyTorch
47:43 GPU
53:24 AWS
56:38 PyTorch

Transcript

So, I don't want to embarrass Rachel, but I'm very excited that Rachel's here. So this is Rachel, for those of you that don't know. She's not quite back on her feet after her illness, but well enough to at least come to part of this lesson, so don't worry if she can't stay for the whole thing.

I'm really glad she's here because Rachel actually wrote the vast majority of the lesson we're going to see. I think it's really, really cool work, so I'm glad she's going to at least see it being taught, even if unfortunately she's not teaching it herself. A good Thanksgiving present; the best Thanksgiving present.

So as we discussed at the end of last lesson, we're moving from decision tree ensembles to neural nets, broadly defined. And as we discussed, random forests and decision trees are limited by the fact that, in the end, they're basically doing nearest neighbors. All they can do is return the average of a bunch of other points.

And so they can't extrapolate: if you're asking what happens if I increase my prices by 20% and you've never priced at that level before, or what's going to happen to sales next year, when obviously we've never seen next year before, it's very hard to extrapolate. It's also hard because a tree can only make around log base 2 of n decisions, so if there's a time series it needs to fit and it takes like 4 steps to get to the right time area, then suddenly there are not many decisions left for it to make. So there's a limited amount of computation it can do, and therefore a limited complexity of relationship it can model.

Can I ask about one more drawback of random forests that I've wondered about? If we have categorical variables which are not in any sequential order, for random forests we encode them and treat them as numbers. Say we have a variable with cardinality 20, encoded 1 to 20; the splits a random forest gives are things like "less than 5" or "less than 6". But if the categories are not sequential, not in any order, what does that mean?

So if you've got, let's go back to bulldozers: EROPS with AC, OROPS, and so on, and we arbitrarily label them 1, 2, 3. And say actually all that really mattered was whether it had air conditioning. So what's going to happen? Well, it's basically going to say, okay, if I group these together and those together, that's an interesting break, just because it so happens that the air conditioning ones all end up on the right-hand side.

And then having done that, within the group with the 2 and the 3, it's going to notice that it furthermore has to split into two more groups. So eventually it's going to get there, it's going to pull out that category; it's just going to take more splits than we would ideally like.

So it's kind of similar to the fact that to model a line, it can only do it with lots of splits, and only approximately. So a random forest is fine with categories that are not sequential also? Yeah, it can do it; it's just sub-optimal in the sense that we need more break points than we would have liked, but it gets there, it does a pretty good job.

And so even though random forests do have some deficiencies, they're incredibly powerful, particularly because they have so few assumptions: it's really hard to screw them up. It's kind of hard to actually win a Kaggle competition with a random forest, but it's very easy to get top 10%. So in real life, where often that third decimal place doesn't matter, random forests are often what you end up using.

But for some things like this Ecuadorian groceries competition, it's very, very hard to get a good result with a random forest because there's a huge time series component and nearly everything is these two massively high cardinality categorical variables, which is the store and the item, and so there's very little there to even throw at a random forest, and the difference between every pair of stores is kind of different in different ways.

So there are some things where it's just hard to get even relatively good results with a random forest. Another example is recognizing numbers. You can get okay results with a random forest, but in the end the relationship between the pixels, the spatial structure, turns out to be important. And you want to be able to do computations like finding edges that carry forward through the computation.

So just doing a clever nearest neighbors like a random forest turns out not to be ideal. For stuff like this, neural networks turn out to be ideal. Neural networks work particularly well both for things like the Ecuadorian groceries competition, forecasting sales over time by store and by item, and for things like recognizing digits and turning speech into text.

And so it's kind of nice between these two things, neural nets and random forests, we kind of cover the territory. I haven't needed to use anything other than these two things for a very long time. And we'll actually learn, I don't know what course exactly, but at some point we'll learn also how to combine the two, because you can combine the two in really cool ways.

So here's a picture from Adam Geitgey of an image. An image is just a bunch of numbers, each between 0 and 255; the dark ones are close to 255, the light ones are close to 0. So here is an example of a digit from the MNIST dataset.

MNIST is really old; it's like the hello world of neural networks. And so here's an example. There are 28 by 28 pixels. If it were color, there would be three of these, one for red, one for green, one for blue. So our job is to look at the array of numbers and figure out that this is the number 8, which is tricky.

How do we do that? So we're going to use a small number of fast.ai pieces, and we're gradually going to remove more and more and more until by the end we'll have implemented our own neural network from scratch, our own training loop from scratch, and our own matrix multiplication from scratch.

So we're gradually going to dig in further and further. The data for MNIST, which is the name of this very famous dataset, is available from here, and we have a thing in fastai.io called get_data, which will grab it from a URL and store it on your computer, unless it's already there, in which case it will just go ahead and use it.

And then we've got a little function here called load_mnist, which simply loads it up. You'll see that it's zipped, so we can just use Python's gzip to open it up. And then it's also pickled. So if you have any kind of Python object at all, you can use the built-in Python library called pickle to dump it out onto your disk, share it around, load it up later, and get back the same Python object you started with.

So you've already seen something like this with pandas' feather format. Pickle is not just for pandas, and it's not just for data frames; it works for nearly every Python object. Which might lead to the question, why didn't we use pickle for a pandas data frame? And the answer is that pickle works for nearly every Python object, but it's probably not optimal for nearly any Python object.

So because we were looking at pandas data frames with over 100 million rows, we really wanted to save them quickly, and feather is a format that's specifically designed for that purpose, so it's going to do that really fast. If we had tried to pickle it, it would have taken a lot longer.

Also note that pickle files are only for Python, so you can't give them to somebody else, whereas a feather file you can hand around. But it's worth knowing that pickle exists, because if you've got some dictionary or some kind of object floating around that you want to save for later or send to somebody else, you can always just pickle it.
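For example, here's a minimal sketch of pickling and unpickling an arbitrary object (the object and file name here are made up):

```python
import pickle

obj = {'name': 'my_model', 'depths': [3, 5, 7]}

# dump any Python object to disk...
with open('tmp.pkl', 'wb') as f:
    pickle.dump(obj, f)

# ...and load it back later: you get the same object you started with
with open('tmp.pkl', 'rb') as f:
    obj2 = pickle.load(f)
```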

So in this particular case, the folks at deeplearning.net were kind enough to provide a pickled version. Pickle has changed slightly over time, and for old pickle files like this one, which is a Python 2 one, you actually have to tell it that it was encoded using this particular Python 2 character set.
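So the loading function ends up being roughly this; a sketch, not necessarily the exact fast.ai helper:

```python
import gzip
import pickle

def load_mnist(filename):
    with gzip.open(filename, 'rb') as f:
        # the file was pickled under Python 2, so tell pickle
        # which character set to decode the old strings with
        return pickle.load(f, encoding='latin-1')
```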

But other than that, Python 2 and 3 can normally open each other's pickle files. So we load it in like so. And this thing which we're doing here is called destructuring. Destructuring means that load_mnist is giving us back a tuple of tuples, and if we have a tuple of tuples on the left-hand side of the equals sign, we can fill all these things in.

So we're given back a tuple of training data, a tuple of validation data, and a tuple of test data. In this case, I don't care about the test data, so I just put it into a variable called underscore, which Python people tend to think of as a special variable into which we put things we're going to throw away.

It's actually not special, but it's really common. If you see something assigned to underscore, it probably means you're just throwing it away. By the way, in a Jupyter notebook, it does have a special meaning which is the last cell that you calculate is always available in underscore. So then the first thing in that tuple is itself a tuple, and so we're going to stick that into x and y for our training data, and then the second one goes into x and y for our validation data.
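In code, the destructuring line looks something like this (assuming the load_mnist sketch above):

```python
# load_mnist returns ((train_x, train_y), (valid_x, valid_y), (test_x, test_y))
((x, y), (x_valid, y_valid), _) = load_mnist(filename)
```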

So that's called destructuring, and it's pretty common in lots of languages. Some languages don't support it, but those that do, life becomes a lot easier. So as soon as I look at some new dataset, I just check out what have I got. So what's its type? Okay, it's a NumPy array.

What's its shape? It's 50,000 by 784. And then what about the dependent variable? That's an array, and its shape is 50,000. But the image we looked at is not of length 784, it's of size 28 by 28. So what happened here? Well, we could guess, and we can check on the website, and it turns out we would be right: all they did was take the second row and concatenate it onto the first row, and the third row onto that, and the fourth row onto that.

So in other words, they took this whole 28 by 28 and flattened it out into a single 1D array. Does that make sense? So it's going to be of size 28 squared, 784. This is not normal by any means, so don't think everything you see is going to be like this.

Most of the time when people share images, they share them as JPGs or PNGs; you load them up and you get back a nice 2D array. But in this particular case, for whatever reason, the thing that they pickled was flattened out to be 784. And this word "flatten" is very common when working with tensors.

So when you flatten a tensor, it just means that you're turning it into a lower rank tensor than you started with. In this case, we started with a rank 2 tensor (a matrix) for each image, and we turned each one into a rank 1 tensor, i.e. a vector. So overall the whole thing is a rank 2 tensor rather than a rank 3 tensor.
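To see flattening concretely, here's a tiny made-up example:

```python
import numpy as np

img = np.arange(28 * 28).reshape(28, 28)  # a rank 2 tensor: 28 by 28
flat = img.reshape(28 * 28)               # a rank 1 tensor of length 784
flat2 = img.flatten()                     # same thing, as a method
```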

So just to remind us of the jargon here: this, in math, we would call a vector. In computer science we would call it a 1D array, but because deep learning people have to come across as smarter than everybody else, we have to call this a rank 1 tensor. They all mean the same thing, more or less, unless you're a physicist, in which case this means something else, and you get very angry at the deep learning people because you say it's not a tensor.

So there you go. Don't blame me, this is just what people say. So this is either a matrix or a 2D array or a rank 2 tensor. And once we start to get into 3 dimensions, we start to run out of mathematical names, which is why it's nice to just say rank 3 tensor.

And so there's actually nothing special about vectors and matrices that makes them in any way more important than rank 3 tensors or rank 4 tensors or whatever. So I try not to use the terms vector and matrix where possible, because I don't really think they're any more special than any other rank of tensor.

So it's good to get used to thinking of this as a rank 2 tensor. And then the rows and columns: if we're computer science people, we would call these dimension 0 and dimension 1, but if we're deep learning people we would call them axis 0 and axis 1. And then just to be really confusing, if you're an image person, this is the first axis and this is the second axis.

So if you think about TVs, 1920 by 1080, that's columns by rows; everybody else, including deep learning people and mathematicians, goes rows by columns. So this is pretty confusing if you use the Python Imaging Library: you get back columns by rows, while pretty much everything else is rows by columns. So be careful. Because they hate us, because they're bad people, I guess.

Particularly in deep learning, a whole lot of different areas have come together, like information theory, computer vision, statistics, and signal processing, and you've ended up with this hodgepodge of nomenclature in deep learning, where every version of a name gets used. So today we're going to hear about something that's called either negative log likelihood or binomial or categorical cross entropy, depending on where you come from.

We've already seen something that's called either one-hot encoding or dummy variables depending on where you come from. And really it's just like the same concept gets somewhat independently invented in different fields and eventually they find their way to machine learning and then we don't know what to call them, so we call them all of the above, something like that.

So I think that's what's happened with computer vision rows and columns. So there's this idea of normalizing data, which is subtracting out the mean and dividing by the standard deviation. So a question for you, often it's important to normalize the data so that we can more easily train a model.

Do you think it would be important to normalize the independent variables for a random forest if we're training a random forest? I'm going to be honest, I don't know why, we don't need to normalize, I just know that we don't. We don't. Okay. Does anybody want to think about why?

Cara. It wouldn't matter because each scaling and transformation we can have will be applied to each row and we will be computing means as we were doing, like local averages. And at the end we will of course want to de-normalize it back to give, so it wouldn't change the result.

I'm talking about the independent variables, not the dependent variable. I thought you asked about dependent variables. Okay, who wants to have a go, Matthew? It might be because we just care about the relationship between the independent variables and the dependent variable, so scale doesn't really matter. Okay, go on.

Cat, why? Why? Because at each split point we can just divide to see, regardless of what scale you're on, what minimizes variance. Right, so really the key is that when we're deciding where to split, all that matters is the order, all that matters is how they're sorted. So if we subtract the mean and divide by the standard deviation, they're still sorted in the same order.

Remember when we implemented the random forest, we said sort them, and then we completely ignored the values, we just said now add on one thing from the dependent at a time. So random forests only care about the sort order of the independent variables, they don't care at all about their size.

And so that's why they're wonderfully immune to outliers: they totally ignore the fact that something is an outlier, they only care about which thing is higher than which other thing. So this is an important concept, and it doesn't just appear in random forests, it occurs in some metrics as well. For example, the area under the ROC curve, which you'll come across a lot, completely ignores scale and only cares about sort order.

We saw something else when we did the dendrogram, Spearman's correlation is a rank correlation, only cares about order, not about scale. So random forests, one of the many wonderful things about them are that we can completely ignore a lot of these statistical distribution issues. But we can't for deep learning, because for deep learning we're trying to train a parameterized model.

So we do need to normalize our data. If we don't, then it's going to be much harder to create a network that trains effectively. So we grab the mean and the standard deviation of our training data and subtract out the mean, divide by the standard deviation, and that gives us a mean of 0 and a standard deviation of 1.

Now for our validation data, we need to use the standard deviation and mean from the training data; we have to normalize it the same way. Just like with categorical variables, where we had to make sure the same indexes mapped to the same levels for a random forest, or with missing values, where we had to make sure the same median was used when replacing them.

You need to make sure anything you do in the training set, you do exactly the same way in the test and validation sets. So here I'm subtracting out the training set mean and dividing by the training set standard deviation. So for the validation set, this is not exactly 0 and this is not exactly 1, but it's pretty close.
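As a sketch, with x as the training array and x_valid as the validation array:

```python
mean = x.mean()
std = x.std()

x = (x - mean) / std              # training set: now mean 0, std 1
x_valid = (x_valid - mean) / std  # note: the *training* mean and std
```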

And so in general, if you find you try something on a validation set or a test set and it's like much, much, much worse than your training set, it's probably because you normalized it in an inconsistent way or encoded categories in an inconsistent way or something like that. So let's take a look at some of this data.

So we've got 10,000 images in the validation set, and each one is a rank 1 tensor of length 784. In order to display it, I want to turn it into a rank 2 tensor of 28x28. So NumPy has a reshape function that takes a tensor in and reshapes it to whatever size tensor you request.

Now if you think about it, if a tensor has d axes, you only need to tell it about d-1 of them, because the last one it can figure out for itself. In total, there are 10,000 times 784 numbers here altogether.

So if you say, well, I want my last two axes to be 28 by 28, then it can figure out that the first one must be 10,000, otherwise it's not going to fit. So if you put -1, it says make it as big or as small as you have to make it fit. So you can see here it figured out it has to be 10,000.
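So the reshape looks something like this:

```python
x_imgs = np.reshape(x_valid, (-1, 28, 28))
x_imgs.shape  # (10000, 28, 28): the -1 is inferred from the total size
```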

So you'll see this used in neural net software, pre-processing, and stuff like that all the time. I could have written 10,000 here, but I try to get into the habit of using -1 any time I'm referring to how many items are in my input, because it means that if later on I used a sub-sample, this code wouldn't break.

I could do some kind of stratified sampling if it was unbalanced, and this code wouldn't break. So using -1 here for the size just makes it more resilient to changes later; it's a good habit to get into. And this idea of being able to take tensors and reshape them and change axes around and stuff like that is something you need to be able to do totally without thinking, because it's going to happen all the time.

So for example, here's one: I read in some images, they were flattened, and I need to unflatten them into a bunch of matrices. Reshape, bang. I read some images in with OpenCV, and it turns out OpenCV orders the channels blue, green, red, while everything else expects them to be red, green, blue, so I need to reverse the last axis. How do you do that?

I read in some images with Python imaging library, it reads them as rows by columns by channels, PyTorch expects channels by rows by columns, how do I transform that? So these are all things you need to be able to do without thinking, like straight away, because it happens all the time and you never want to be sitting there thinking about it for ages.
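Those last couple are one-liners once slicing and transposing are second nature; a sketch with made-up variable names:

```python
rgb = bgr[:, :, ::-1]         # reverse the last axis: BGR becomes RGB
chw = hwc.transpose(2, 0, 1)  # (rows, cols, channels) to (channels, rows, cols)
```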

So make sure you spend a lot of time over the week just practicing with things like all the stuff we're going to see today, reshaping, slicing, reordering dimensions, stuff like that. And so the best way is to create some small tensors yourself and start thinking like, okay, what shall I experiment with?

So here, can we pass that over there? Do you mind if I backtrack a little bit? Of course, I love it. So back in normalize, you might have gone over this, but I'm still wrestling with it a little bit: it says many machine learning algorithms behave better when the data is normalized, but you also just said that scale doesn't really matter.

I said it doesn't matter for random forests. Random forests are just going to split things based on order. And so we love them; we love random forests for the way they're so immune to worrying about distributional assumptions. But we're not doing random forests, we're doing deep learning, and deep learning does care.

So if we have a parametric model we should scale, and if we have a non-parametric model we shouldn't need to scale; can we generalize that way? No, not quite, because k nearest neighbors is non-parametric and scale matters a hell of a lot. So I would say things involving trees are generally just going to split at a point, and so you probably don't care about scale.

But you probably just need to think, is this an algorithm that uses order or does it use specific numbers? Can you please give us an intuition of why it needs scale, just because that will clarify some of the issues? Not until we get to doing SGD. So we're going to get to that.

So for now, we're just going to take my word for it. Can you pass it to Daniel? So this is probably a dumb question, but can you explain a little bit more what you mean by scale? Because I guess when I think of scale, I'm like, oh, all the numbers should be generally the same size.

That's exactly what we mean. But is that like the case with the cats and dogs that we went over with the deep learning? You could have a small cat and a larger cat, but it would still know that those were both cats. Oh, I guess this is one of these problems where language gets overloaded.

So in computer vision, when we scale an image, we're actually increasing the size of the cat. In this case, we're scaling the actual pixel values. So in both cases, scaling means to make something bigger and smaller. In this case, we're taking the numbers from 0 to 255 and making them so that they have an average of 0 and a standard deviation of 1.

Jeremy, could you please explain, is it by column, by row? By pixel. By pixel. In general, when you're scaling, not thinking just about pictures but about inputs in general, how do you do it? Yeah, sure. So it's a little bit subtle. But in this case, I've just got a single mean and a single standard deviation.

So it's basically, on average, how much black is there. And so on average, we have a mean and a standard deviation across all the pixels. In computer vision, we would normally do it by channel. So we would normally have one number for red, one number for green, one number for blue.

In general, you need a different set of normalization coefficients for each thing you would expect to behave differently. So if we were doing a structured data set where we've got income, distance in kilometers, and number of children, you'd need three separate normalization coefficients for those. They're very different kinds of things.

So it's kind of domain-specific here. In this case, all of the pixels are levels of gray, so we've just got a single scaling number. Whereas you could imagine, if they were red versus green versus blue, you might need to scale those channels in different ways.
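For instance, for a batch of color images shaped (n, rows, cols, 3), a per-channel version might look like this:

```python
mean = imgs.mean(axis=(0, 1, 2))  # one mean per channel: shape (3,)
std = imgs.std(axis=(0, 1, 2))
imgs_norm = (imgs - mean) / std   # broadcasts across the channel axis
```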

Can you pass that back, please? So I'm having a bit of trouble imagining what would happen if we don't normalize in this case. So we'll get there. So this is kind of what Yannette was saying, why do we normalize? And for now we're normalizing because I say we have to.

When we get to looking at stochastic gradient descent, we'll basically discover that if you -- basically to skip ahead a little bit, we're going to be doing a matrix multiply by a bunch of weights. We're going to pick those weights in such a way that when we do the matrix multiply, we're going to try to keep the numbers at the same scale that they started out as.

And that's going to basically require knowing what scale the initial numbers are at. So basically it's much easier to create a single kind of neural network architecture that works for lots of different kinds of inputs if we know that they're consistently going to be mean zero, standard deviation one.

That would be the short answer. But we'll learn a lot more about it. And if in a couple of lessons you're still not quite sure why, let's come back to it because it's a really interesting thing to talk about. Yes, I'm just trying to visualize the axes we're working with here.

So under plots, when you write x_valid.shape, we get 10,000 by 784. Does that mean that we brought in 10,000 pictures of that dimension? Exactly. Okay. And then in the next line, when you choose to reshape it, is there a reason why you put the 28, 28 as the second and third coordinates, or is there a reason why they're in that order?

Yeah, there is. Pretty much all neural network libraries assume that the first axis is kind of the equivalent of a row. It's like a separate thing. It's a sentence, or an image, or an example of sales, or whatever. So I want each image to be a separate item of the first axis, so that leaves two more axes for the rows and columns of the images.

That's pretty standard. That's totally standard. Yeah, I don't think I've ever seen a library that doesn't work that way. Can you pass it to our bureau? So while normalizing the validation data, I saw you have used mean of x and standard deviation of x data, training data only. Yes.

So shouldn't we use the mean and standard deviation of the validation data? You mean like join them together, or? Separately calculated means. No, because then you would be normalizing the validation set using different numbers. So now a pixel with a value of 3 in the validation set would have a different meaning from a 3 in the training set.

It would be like if we had days of the week encoded such that Monday was a 1 in the training set and was a 0 in the validation set. We've got now two different sets where the same number has a different meaning. So let me give you an example.

Let's say we were doing full color images and our training set contained green frogs, green snakes, and gray elephants, and we're trying to figure out which was which. We normalize using each channel's mean, and then we have a validation set and a test set which are just green frogs and green snakes.

So if we were to normalize by the validation set's statistics, we would end up saying things on average are green, and so we would remove all the greenness out. And so we would now fail to recognize the green frogs and the green snakes effectively. So we actually want to use the same normalization coefficients that we were training on.

And for those of you doing the deep learning class, we actually go further than that. When we use a pre-trained network, we have to use the same normalization coefficients that the original authors trained on. So the idea is that a number needs to have this consistent meaning across every data set where you use it.

Can you pass it to Mehta? So that means when you are looking at the test set, you normalize the test set based on this training set mean and standard deviation. That's right. Okay. So the validation y values are just a rank 1 tensor of length 10,000. Remember there's this kind of weird Python thing where a tuple with just one thing in it needs a trailing comma.

So this is a rank1 tensor of length 10,000. And so here's an example of something from that, it's just the number 3. So that's our labels. So here's another thing you need to be able to do in your sleep, slicing into a tensor. So in this case, we're slicing into the first axis with zero.

That means we're grabbing the first slice. So because this is a single number, this is going to reduce the rank of the tensor by 1. It's going to turn it from a 3-dimensional tensor into a 2-dimensional tensor. So you can see here, this is now just a matrix, and then we're going to grab 10 through 14 inclusive rows, 10 through 14 inclusive columns, and here it is.
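In code, that slice is just:

```python
x_imgs[0].shape          # (28, 28): a single index drops an axis
x_imgs[0, 10:15, 10:15]  # rows 10-14, columns 10-14 of the first image
```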

So this is the kind of thing you need to be super comfortable grabbing pieces out, looking at the numbers, and looking at the picture. So here's an example of a little piece of that first image. And so you kind of want to get used to this idea that if you're working with something like pictures or audio, this is something your brain is really good at interpreting.

So keep showing pictures of what you're doing whenever you can. But also remember behind the scenes, they're numbers. So if something's going weird, print out a few of the actual numbers. You might find somehow some of them have become infinity, or they're all zero, or whatever. So use this interactive environment to explore the data as you go.

Just a quick, I guess, semantic question. Why, when it's a tensor of rank 3, is it stored as x, y, z? To me it would make more sense to store it as a list of 2D tensors. You can think of it as either, right? Let's look at this as 3D.

So here's a 3D tensor; a 3D tensor is formatted as a list of 2D tensors, basically. But when you're extracting it, if you're extracting the first one, why isn't it x_imgs[0] followed by a second set of square brackets? Oh, because that has a different meaning, right?

So it's kind of the difference between tensors and jagged arrays, right? So basically if you do something like that, that says take the second list item and from it grab the third list item. And so we tend to use that when we have something called a jagged array, which is where each subarray may be of a different length, right?

Whereas here we have a single object of three dimensions, and we're trying to say which little piece of it we want. And so the idea is that that is a single slice object that goes in and grabs that piece out. Okay, so here's an example of a few of those images along with their labels.

And this kind of stuff you want to be able to do pretty quickly with matplotlib; it's going to help you a lot in life and in your exam. So you can have a look at what Rachel wrote here when she wrote plots. We can use add_subplot to basically create those little separate plots, and you need to know that imshow is how we take a numpy array and draw it as a picture.
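A minimal sketch of such a helper, not necessarily Rachel's exact code:

```python
import matplotlib.pyplot as plt

def plots(imgs, titles=None):
    fig = plt.figure(figsize=(12, 6))
    for i, img in enumerate(imgs):
        ax = fig.add_subplot(1, len(imgs), i + 1)  # one little plot per image
        ax.imshow(img, cmap='gray')                # draw a numpy array as a picture
        ax.axis('off')
        if titles is not None:
            ax.set_title(titles[i])                # the label on top
```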

And then we've also added the title on top. So there it is. So let's now take that data and try to build a neural network with it. And so a neural network -- and sorry, this is going to be a lot of review for those of you already doing deep learning -- a neural network is just a particular mathematical function or a class of mathematical functions.

But it's a really important class because it has the property, via what's called the universal approximation theorem, that a neural network can approximate any other function arbitrarily closely. So in other words, in theory it can do anything, as long as we make it big enough. So this is very different to a function like 3x+5, which can only do one thing.

It's a very specific function. For the class of functions ax+b, which can only represent lines of different slopes moving it up and down different amounts, or even the function ax^2+bx+c+sin(d), again, only can represent a very specific subset of relationships. The neural network, however, is a function that can represent any other function to arbitrarily close accuracy.

So what we're going to do is learn how to take a function, let's say ax+b, and learn how to find its parameters, in this case a and b, which allow it to fit as closely as possible to a set of data. And this here is showing an example from a notebook that we'll be looking at in the deep learning course, which basically shows what happens when we use something called stochastic gradient descent to try to set a and b.

Basically what happens is we're going to pick a random a to start with, a random b to start with, and then we're going to basically figure out do I need to increase or decrease a to make the line closer to the dots, do I need to increase or decrease b to make the line closer to the dots, and then just keep increasing and decreasing a and b lots and lots of times.

So that's what we're going to do. And to answer the question do I need to increase or decrease a and b, we're going to take the derivative. So the derivative of the function with respect to a and b tells us how will that function change as we change a and b.

So that's basically what we're going to do. But we're not going to start with just a line, the idea is we're going to build up to actually having a neural net, and so it's going to be exactly the same idea, but because it's an infinitely flexible function, we're going to be able to use this exact same technique to fit arbitrarily complex relationships.

That's basically the idea. So then what you need to know is that a neural net is actually a very simple thing. A neural net is something which takes as input, let's say, a vector, and does a matrix product with that vector. So, to draw this properly: if the vector is of size r and the matrix is r by c, the matrix product will spit out something of size c.

And then we do something called a nonlinearity, which is basically we're going to throw away all the negative values, so it's basically max(0,x). And then we're going to put that through another matrix multiply, and then we're going to put that through another max(0,x), and we're going to put that through another matrix multiply, and so on, until eventually we end up with the single vector that we want.

So in other words, at each stage of our neural network, the key thing going on is a matrix multiply, in other words a linear function. So basically in deep learning, most of the calculation is lots and lots of linear functions, but between each one we replace the negative numbers with zeros.

The short answer is if you apply a linear function to a linear function to a linear function, it's still just a linear function, so it's totally useless. But if you throw away the negatives, that's actually a nonlinear transformation. So it turns out that if you apply a linear function to the thing where you threw away the negatives, apply that to a linear function, that creates a neural network, and it turns out that's the thing that can approximate any other function arbitrarily closely.
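As a minimal numerical sketch of that structure (random weights, no training yet, and sizes made up apart from the 784 input):

```python
import numpy as np

def relu(t):
    return np.maximum(0., t)    # replace the negative numbers with zeros

x = np.random.randn(784)        # an input vector of size r = 784
w1 = np.random.randn(784, 100)  # an r by c weight matrix
w2 = np.random.randn(100, 10)

out = relu(x @ w1) @ w2         # linear, nonlinearity, linear
```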

So this tiny little difference actually makes all the difference. And if you're interested in it, check out the deep learning video where we cover this, because I actually show a nice visual, intuitive proof, not something that I created, but something that Michael Nielsen created. Or if you want to skip straight to his website, you can search for Michael Nielsen, universal approximation (I think I spelled his name wrong, there we go), Neural Networks and Deep Learning, chapter 4, and he's got a really nice walkthrough, basically with lots of animations, where you can see why this works.

I feel like the hardest thing with getting started with technical writing on the internet is just posting your first thing. If you do a search for Rachel Thomas Medium blog, you'll find this, we'll put it on the Lesson Wiki, where she actually says the top advice she would give to her younger self would be to start blogging sooner.

And she has both reasons why you should do it, some examples of places she's blogged and it's turned out to be great for her and her career, but then some tips about how to get started. I remember when I first suggested to Rachel she might think about blogging because she had so much interesting to say, and at first she was kind of surprised at the idea that she could blog.

And now people come up to us at conferences and they're like, "You're Rachel Thomas, I love your writing!" So I've kind of seen that transition from wow, could I blog, to being known as a strong technical author. So check out this article if you still need convincing or if you're wondering how to get started.

And since the first one is the hardest, maybe your first one should be something really easy for you to write. So it could be like, here's a summary of the first 15 minutes of lesson 3 of our machine learning course, here's why it's interesting, here's what we learned. Or it could be like, here's a summary of how I used a random forest to solve a particular problem in my practicum.

I often get questions like, oh my practicum, my organization, we've got sensitive commercial data. That's fine, just find another dataset and do it on that instead to show the example, or anonymize all of the values and change the names of the variables or whatever. You can talk to your employer or your practicum partner to make sure that they're comfortable with whatever it is you're writing.

In general though, people love it when their interns and staff blog about what they're working on because it makes them look super cool. It's like, hey, I'm an intern working at this company and I wrote this post about this cool analysis I did and then other people would be like, wow, that looks like a great company to work for.

So generally speaking, you should find people are pretty supportive. Besides which, there's lots and lots of datasets out there available, so even if you can't base it on the work you're doing, you can find something similar for sure. So we're going to start building our neural network, we're going to build it using something called PyTorch.

PyTorch is a library that basically looks a lot like NumPy, but when you create some code with PyTorch, you can run it on the GPU rather than the CPU. So the GPU is something which is basically going to be probably at least an order of magnitude, possibly hundreds of times faster than the code that you might write for the CPU for particularly stuff involving lots of linear algebra.

So with deep learning, neural nets, if you don't have a GPU, you can do it on the CPU, but it's going to be frustratingly slow. Your Mac does not have a GPU that we can use for this; we need an Nvidia GPU. I would actually much prefer that we could use your Macs, because competition is great.

But Nvidia were really the first ones to create a GPU which did a good job of supporting general purpose GPU programming, GPGPU, in other words using a GPU for things other than playing computer games. They created a framework called CUDA; it's a very good framework, and it's pretty much universally used in deep learning.

If you don't have an Nvidia GPU, you can't use it, no current Macs have an Nvidia GPU. Most laptops of any kind don't have an Nvidia GPU. If you're interested in doing deep learning on your laptop, the good news is that you need to buy one which is really good for playing computer games on.

There's a place called XOTIC PC, gaming laptops, where you can go and buy yourself a great laptop for doing deep learning. You can tell your parents that you need the money to do deep learning. You'll generally find a whole bunch of laptops with names like Predator and Viper, with pictures of robots and stuff: StealthPro, Radar, Leopard.

Having said that, I don't know that many people who do much deep learning on their laptop. Most people will log into a cloud environment. By far the easiest I know of to use is called Crestle. With Crestle, you can basically sign up and straight away get thrown into a Jupyter notebook, backed by a GPU, costing 60 cents an hour, with all of the fast.ai libraries and data already available.

So that makes life really easy. It's less flexible and in some ways less fast than using AWS, which is the Amazon Web Services option. That costs a little bit more, 90 cents an hour rather than 60 cents an hour, but it's very likely that your employer is already using it.

It's good to get to know anyway. They've got more different choices around GPUs, and it's a good choice. If you Google for GitHub Student Pack, if you're a student, you can get $150 of credits straight away pretty much, and that's a really good way to get started.

Daniel, did you have a question? I just wanted to know your opinion. I know that Intel recently published an open source way of boosting regular packages that they claim is equivalent to a bottom-tier GPU: if you use their boost packages on a CPU, you can get the same performance.

Do you know anything about that? Yeah, I do. It's a good question. And actually, Intel makes some great numerical programming libraries, particularly this one called MKL, the Math Kernel Library. They definitely make things faster than not using those libraries, but if you look at a graph of performance over time, GPUs have consistently, throughout the last 10 years and including now, delivered about 10 times more floating point operations per second than the equivalent CPU, and they're generally about a fifth of the price for that performance.

And because of that, everybody doing anything with deep learning basically does it on Nvidia GPUs, and therefore using anything other than Nvidia GPUs is currently very annoying. So: slower, more expensive, more annoying. I really hope there will be more activity around AMD GPUs in particular in this area, but AMD has got literally years of catching up to do, so it might take a while.

So I just wanted to point out that you can also buy things such as a GPU extender for a laptop, which is maybe a first-step solution before you really want to commit to something bigger. Yeah, I think for like 300 bucks or so, you can buy something that plugs into your Thunderbolt port if you have a Mac, and then for another 500 or 600 bucks you can buy a GPU to plug into that.

Having said that, for about a thousand bucks you can actually create a pretty good GPU-based desktop, and if you're considering that, the fast.ai forums have lots of threads where people help each other spec out something at a particular price point. So to start with, let's say use Crestle, and then when you're ready to invest a few extra minutes getting going, use AWS.

To use AWS, you're basically going to be following along with the folks online as well. When you get there, go to EC2. There's lots of stuff on AWS; EC2 is the bit where we get to rent computers by the hour. Now, we're going to need a GPU-based instance. Unfortunately, when you first sign up for AWS, they don't give you access to them, so you have to request that access.

So go to limits, up in the top left, and the main GPU instance we'll be using is called the P2. So scroll down to P2, and here p2.xlarge, you need to make sure that that number is not zero. If you've just got a new account, it probably is zero, which means you won't be allowed to create one, so you have to go request limit increase.

And the trick there is when it asks you why you want the limit increase, type fast.ai because AWS knows to look out, and they know that fast.ai people are good people, so they'll do it quite quickly. That takes a day or two, generally speaking, to go through. So once you get the email saying you've been approved for P2 instances, you can then go back here and say Launch Instance, and so we've basically set up one that has everything you need.

So if you click on Community AMI, and AMI is an Amazon machine image, it's basically a completely set up computer. So if you type fast.ai, or one word, you'll find here fast.ai DL Part 1 version 2 for the P2. So that's all set up, ready to go. So if you click on Select, and it'll say, "Okay, what kind of computer do you want?" And so we have to say, "I want a GPU compute type, and specifically I want a P2 extra large." And then you can say Review and Launch.

I'm assuming you already know how to deal with SSH keys and all that kind of stuff. If you don't, check out the introductory tutorials and workshop videos that we have online, or Google around for SSH keys. Very important skill to know anyway. So hopefully you get through all that.

You have something running on a GPU with the fast.ai repo. If you use Crestle, just cd fastai2; the repo is already there, git pull. On AWS, cd fastai; the repo is already there, git pull. If it's your own computer, you'll just have to git clone, and away you go. In all of those, PyTorch comes pre-installed.

So PyTorch basically means we can write code that looks a lot like NumPy, but it's going to run really quickly on the GPU. Secondly, since we need to know which direction and how much to move our parameters to improve our loss, we need to know the derivative of functions.

PyTorch has this amazing thing where any code you write using the PyTorch library, it can automatically take the derivative of that for you. So we're not going to look at any calculus in this course, and I don't look at any calculus in any of my courses or in any of my work basically ever in terms of actually calculating derivatives myself, because I've never had to.

It's done for me by the library. So as long as you write the Python code, the derivative is done. So the only calculus you really need to know to be an effective practitioner is what does it mean to be a derivative? And you also need to know the chain rule, which we'll come to.

So we're going to start out kind of top-down, create a neural net, and we're going to assume a whole bunch of stuff. And gradually we're going to dig into each piece. So to create neural nets, we need to import the PyTorch neural net library. PyTorch, funnily enough, is not called PyTorch, it's called Torch.

Torch.nn is the PyTorch subsection that's responsible for neural nets. So we'll call that nn. And then we're going to import a few bits out of fast.ai just to make life a bit easier for us. So here's how you create a neural network in PyTorch. The simplest possible neural network.

You say sequential, and sequential means I am now going to give you a list of the layers that I want in my neural network. So in this case, my list has two things in it. The first thing says I want a linear layer. So a linear layer is something that's basically going to do y=ax+b.

But a matrix multiply, not univariate, obviously. So it's going to do a matrix product, basically. So the input to the matrix product is going to be a vector of length 28 times 28, because that's how many pixels we have, and the output needs to be of size 10. We'll talk about why in a moment.

But for now, this is how we define a linear layer. And then again, we're going to dig into this in detail, but every linear layer just about in neural nets has to have a nonlinearity after it. And we're going to learn about this particular nonlinearity in a moment. It's called the softmax.

And if you've done the DL course, you've already seen this. So that's how we define a neural net. This is a two-layer neural net. There's also kind of an implicit additional first layer, which is the input. But with PyTorch, you don't have to explicitly mention the input. But normally we think conceptually like the input image is kind of also a layer.

Because we're doing things pretty manually with PyTorch, we're not taking advantage of any of the conveniences in fast.ai for building this stuff. We have to then write .cuda, which tells PyTorch to copy this neural network across to the GPU. So from now on, that network is going to be actually running on the GPU. If we didn't say that, it would run on the CPU. So that gives us back a neural net, a very simple neural net.
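So the definition looks roughly like this; treat it as a sketch (the notebook's final nonlinearity is a log softmax):

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(28 * 28, 10),  # the matrix multiply: 784 in, 10 out
    nn.LogSoftmax()          # the nonlinearity we'll discuss in a moment
).cuda()                     # copy the network across to the GPU
```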

If we didn't say that, it would run on the CPU. So that gives us back a neural net, a very simple neural net. So we're then going to try and fit the neural net to some data. So we need some data. So fast.ai has this concept of a model data object, which is basically something that wraps up training data, validation data, and optionally test data.

And so to create a model data object, you can just say I want to create some image classifier data, I'm going to grab it from some arrays, and you just say this is the path that I'm going to save any temporary files, this is my training data arrays, and this is my validation data arrays.

And so that just returns an object that's going to wrap that all up, and so we're going to be able to fit to that data. So now that we have a neural net, and we have some data, we're going to come back to this in a moment, but we basically say what loss function do we want to use, what optimizer do we want to use, and then we say fit.
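Roughly, those calls look like this; a sketch of the fast.ai API of the time (argument names may differ slightly), with ImageClassifierData, fit, accuracy, and optim coming from the fast.ai and PyTorch imports:

```python
md = ImageClassifierData.from_arrays(path, (x, y), (x_valid, y_valid))
loss = nn.NLLLoss()                 # negative log likelihood
metrics = [accuracy]
opt = optim.Adam(net.parameters())  # the optimizer

fit(net, md, n_epochs=1, crit=loss, opt=opt, metrics=metrics)
```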

We say fit this network to this data going over every image once using this loss function, this optimizer, and print out these metrics. And this says here, this is 91.8% accurate. So that's the simplest possible neural net. So what that's doing is it's creating a matrix multiplication followed by a nonlinearity, and then it's trying to find the values for this matrix which basically fit the data as well as possible, that end up predicting this is a 1, this is a 9, this is a 3.

And so we need some definition for as well as possible. And so the general term for that thing is called the loss function. So the loss function is the function that's going to be lower if this is better. Just like with random forests, we had this concept of information gain, and we got to pick what function do you want to use to define information gain, and we were mainly looking at root mean squared error.

Most machine learning algorithms have something very similar to a loss. So the loss is how we score how good we are. And in the end, we're going to calculate the derivative of the loss with respect to the weight matrix that we're multiplying by, to figure out how to update it.

So we're going to use something called negative log likelihood loss. Negative log likelihood loss is also known as cross entropy; they're literally the same thing. There are two versions, one called binary cross entropy, or binary negative log likelihood, and another called categorical cross entropy. They're the same thing: one is for when you've only got a 0 or 1 dependent, the other is for when you've got like cat, dog, airplane or horse, or 0 through 9, and so forth.

So what we've got here is the binary version of cross entropy. And so here is the definition. I think maybe the easiest way to understand this definition is to look at an example. So let's say we're trying to predict cat vs dog. 1 is cat, 0 is dog. So here we've got cat, dog, dog, cat.

And here are our predictions. We said 90% sure it's a cat, 90% sure it's a dog, 80% sure it's a dog, 80% sure it's a cat. So we can then calculate the binary cross entropy by calling our function. So it's going to say, okay, for the first one we've got y = 1, so it's going to be 1 times log of 0.9, plus 1 - y, 1 - 1, is 0, so that's going to be skipped.

And then the second one is going to be a 0, so the first term is 0 times something, which is skipped, and the second part will be 1 - 0, which is 1, times log of 1 - p, and 1 - 0.1 is 0.9. So in other words, the first example and the second example are going to give exactly the same number.

Which makes sense because the first one we said we were 90% confident it was a cat, and it was. And the second we said we were 90% confident it was a dog, and it was. So in each case the loss is coming from the fact that we could have been more confident.

So if we said we were 100% confident the loss would have been 0. So let's look at that in Excel. So here's our 0.9, 0.1, 0.2, 0.8, and here's our predictions, 1, 0, 0, 1. So here's 1 - the prediction, here is log of our prediction, here is log of 1 - our prediction, and so then here is our sum.
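That spreadsheet calculation in NumPy, as a small sketch:

```python
import numpy as np

def binary_nll(y, p):
    # -(y*log(p) + (1-y)*log(1-p)), averaged over the examples
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 0, 1])          # cat, dog, dog, cat
p = np.array([0.9, 0.1, 0.2, 0.8])  # predicted probability of cat
binary_nll(y, p)                    # loss comes only from under-confidence
```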

So if you think about it, and I want you to think about this during the week, you could replace this with an if statement on y. Because y is always 1 or 0, it's only ever going to use either this term or that term. So you could replace the whole thing with an if statement.

So I'd like you during the week to try to rewrite this with an if statement. And then see if you can then scale it out to be a categorical cross-entropy. So categorical cross-entropy works this way. Let's say we were trying to predict 3 and then 6 and then 7 and then 2.

So say we were trying to predict 3 and we actually predicted 5, or we were trying to predict 3 and we accidentally predicted 9. Being 5 instead of 3 is no better than being 9 instead of 3.

So we're not actually going to say how far away is the actual number, we're going to express it differently. Or to put it another way, what if we're trying to predict cats, dogs, horses and airplanes? How far away is cat from horse? So we're going to express these a little bit differently.

Rather than thinking of it as a 3, let's think of it as a vector with a 1 in the third location. And rather than thinking of it as a 6, let's think of it as a vector of zeros with a 1 in the sixth location. So in other words, one hot encoding.
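One way to one hot encode in NumPy, as a sketch:

```python
import numpy as np

y = np.array([3, 6, 7, 2])
onehot = np.eye(10)[y]  # each row is zeros with a 1 in the y-th position
```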

So let's one hot encode a dependent variable. And so that way now, rather than trying to predict a single number, let's predict 10 numbers. Let's predict what's the probability that it's a 0, what's the probability it's a 1, what's the probability that it's a 2, and so forth. And so let's say we're trying to predict a 2, then here is our binary cross entropy, sorry, categorical cross entropy.

So it's just saying, okay, did this one predict correctly or not, how far off was it, and so forth for each one. And so add them all up. So categorical cross entropy is identical to binary cross entropy, we just have to add it up across all of the categories.

So try and turn the binary cross entropy function in Python into a categorical cross entropy version in Python, and maybe create both the version with the if statement and the version with the sum and the product. So that's why in our PyTorch model we had 10 as the output dimensionality for this matrix: because when we multiply by a matrix with 10 columns, we're going to end up with something of length 10, which is what we want.

We want to have 10 predictions. So that's the loss function that we're using. So then we can fit the model, and what it does is it goes through every image this many times, in this case it's just looking at every image once, and going to slightly update the values in that weight matrix based on those gradients.

And so once we've trained it, we can then say predict using this model on the validation set, and that spits out something of 10,000 by 10. Can somebody tell me why these predictions are of shape 10,000 by 10? Go for it, Chris, it's right next to you.

Well, it's because we have 10,000 images we're training on. 10,000 images we're validating on in this case, but same thing. So 10,000 we're validating on, so that's the first axis, and the second axis is because we actually make 10 predictions per image. Good, exactly. So each one of these rows is the probabilities that it's a 0, that it's a 1, that it's a 2, that it's a 3, and so forth.

So in math, there's a really common operation we do called argmax. When I say it's common, it's funny, at high school I never saw argmax, first year undergrad I never saw argmax, but somehow after university everything's about argmax. So it's one of these things that's for some reason not really taught at school, but it actually turns out to be super critical.

And so argmax is both something that you'll see in math, and it's just written out in full, argmax. It's in numpy, it's in pytorch, it's super important. And what it does is it says, let's take this array of preds, and let's figure out on this axis, remember axis 1 is columns, so across as Chris said, the 10 predictions for each row, let's find which prediction has the highest value, and return, not that, if it just said max it would return the value, argmax returns the index of the value.

So by saying argmax axis equals 1, it's going to return the index, which is actually the number itself. So let's grab the first 5, so for the first one it thinks it's a 3, then it thinks the next one's an 8, the next one's a 6, the next one's a 9, the next one's a 6 again.

So that's how we can convert our probabilities back into predictions. So if we save that away and call it preds, we can then say, okay, when does preds equal the ground truth? That's going to return an array of bools, which we can treat as 1s and 0s, and the mean of a bunch of 1s and 0s is just the average, so that gives us the accuracy. So there's our 91.8%.
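In code, with probs as that 10,000 by 10 array of predictions:

```python
preds = np.argmax(probs, axis=1)  # index of the max along each row
(preds == y_valid).mean()         # bools as 1s and 0s: the accuracy
```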

And so you want to be able to replicate the numbers you see, and here it is, there's our 91.8%. So when we train this, the last thing it tells us is whatever metric we asked for, and we asked for accuracy, okay. So the last thing it tells us is our metric, which is accuracy, and then before that we get the training set loss, and the loss is again whatever we asked for, negative log likelihood, and the second thing is the validation set loss.

PyTorch doesn't use the word loss, they use the word criterion, so you'll see here, crit. So that's criterion equals loss. This is what loss function we want to use, they call that the criterion. Same thing. So here's how we can recreate that accuracy. So now we can go ahead and plot 8 of the images along with their predictions.

And we've got 3, 8, 6, 9, wrong, 5, wrong, okay. And you can see why they're wrong. This is pretty close to a 9, it's just missing a little cross at the top. This is pretty close to a 5, it's got a little bit of the extra here, right.

So we've made a start, and all we've done so far is we haven't actually created a deep neural net, we've actually got only one layer. So what we've actually done is we've created a logistic regression. So a logistic regression is literally what we just built, and you could try and replicate this with sklearn's logistic regression package.
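A hedged sketch of that comparison, with random stand-in data in place of the real flattened MNIST arrays:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# stand-in data shaped like flattened MNIST; swap in the real arrays
x_train = np.random.rand(1000, 28 * 28)
y_train = np.random.randint(0, 10, size=1000)
x_valid = np.random.rand(200, 28 * 28)
y_valid = np.random.randint(0, 10, size=200)

clf = LogisticRegression(max_iter=1000)  # plain multiclass logistic regression
clf.fit(x_train, y_train)
print(clf.score(x_valid, y_valid))       # mean accuracy on the validation set
```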

When I did it, I got similar accuracy, but this version ran much faster, because this is running on the GPU whereas sklearn runs on the CPU. So even for something like logistic regression, we can implement it very quickly with PyTorch. Can you pass that to Ian? So when we're creating our net, we have to do .cuda, what would be the consequence of not doing that?

Would it just not run? It wouldn't run quickly. It will run on the CPU. Can you pass it to Jake? So maybe with the neural network, why is it that we have to do a linear layer followed by a nonlinear one? So the short answer is because that's what the universal approximation theorem says is a structure which can give you arbitrarily accurate functions for any functional form.

So the long answer is the details of why the universal approximation theorem works. Another version of the short answer is that's the definition of a neural network. So the definition of a neural network is a linear layer followed by an activation function, followed by a linear layer, followed by an activation function, etc.
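In PyTorch terms, that definition might look something like this minimal sketch (the layer sizes here are illustrative, not from the lecture):

```python
import torch.nn as nn

# linear layer -> activation -> linear layer -> activation
net = nn.Sequential(
    nn.Linear(28 * 28, 100),  # linear layer
    nn.ReLU(),                # nonlinearity (activation function)
    nn.Linear(100, 10),       # linear layer
    nn.LogSoftmax(dim=1),     # final activation for classification
)
```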

We go into a lot more detail of this in the deep learning course. But for this purpose, it's enough to know that it works. So far, of course, we haven't actually built a deep neural net at all. We've just built a logistic regression. And so at this point, if you think about it, all we're doing is we're taking every input pixel and multiplying it by a weight for each possible outcome.

So we're basically saying on average, the number 1 has these pixels turned on, the number 2 has these pixels turned on, and that's why it's not terribly accurate. That's not how digit recognition works in real life, but that's all we've built so far. Michael Nielsen has this great website called Neural Networks and Deep Learning, and his chapter 4 is actually kind of famous now.

In it, he does this walkthrough of basically showing that a neural network can approximate any other function to arbitrarily close accuracy as long as it's big enough. And we walk through this in a lot of detail in the deep learning course. But the basic trick is that he shows that with a few different numbers, you can basically cause these things to create little boxes.

You can move the boxes up and down, you can move them around, you can join them together to eventually create collections of towers, which you can use to approximate any kind of surface. So that's basically the trick, and so all we need to do, given that, is to find the parameters for each of the linear functions in that neural network, so to find the weights in each of the matrices.

So far, we've got just one matrix, and so we've just built a simple logistic regression. Just a small note, I just want to confirm: when you showed examples of the images which were misclassified, they looked rectangular, so is it just that, while rendering, the pixels are being scaled differently?

So are they still 28 by 28 squares? They are 28 by 28. I think they're square, I think they just look rectangular because they've got titles on the top. I'm not sure. Good question. I don't know. Anyway, they are square. Matplotlib does often fiddle around with what it considers black versus white and having different size axes and stuff, so you do have to be very careful there sometimes.

Hopefully this will now make more sense, because what we're going to do is dig in a layer deeper and define logistic regression without using nn.Sequential, without using nn.Linear, without using nn.LogSoftmax. So we're going to do nearly all of the layer definition from scratch. So to do that, we're going to have to define a PyTorch module.

A PyTorch module is basically either a neural net or a layer in a neural net, which is actually kind of a powerful concept of itself. Basically anything that can kind of behave like a neural net can itself be part of another neural net. And so this is like how we can construct particularly powerful architectures combining lots of other pieces.

So to create a PyTorch module, just create a Python class, but it has to inherit from nn.Module. So we haven't done inheritance before. Other than that, this is all the same concepts we've seen in OO already. Basically, if you put something in parentheses here, what it means is that our class gets all of the functionality of this class for free.

It's called subclassing it. So we're going to get all of the capabilities of a neural network module that the PyTorch authors have provided, and then we're going to add additional functionality to it. When you create a subclass, there is one key thing you need to remember to do, which is when you initialize your class, you have to first of all initialize the superclass.

So the superclass is nn.Module. So nn.Module has to be built before you can start adding your pieces to it. And so this is something you can copy and paste into every one of your modules: you just say super().__init__(), which just means construct the superclass first.
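As a minimal sketch, that skeleton might look like this (LogReg is my name for the class, not necessarily the notebook's):

```python
import torch.nn as nn

class LogReg(nn.Module):    # inherit from nn.Module to get its machinery
    def __init__(self):
        super().__init__()  # construct the superclass first
        # ...then define your own weights and biases here
```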

Having done that, we can now go ahead and define our weights and our bias. So our weights is the weight matrix. It's the actual matrix that we're going to multiply our data by. And as we discussed, it's going to have 28x28 rows and 10 columns. And that's because if we take an image which we flattened out into a 28x28 length vector, then we can multiply it by this weight matrix to get back out a length 10 vector, which we can then use to consider it as a set of predictions.

So that's our weight matrix. Now the problem is that we don't just want y = ax, we want y = ax + b. So the +b in neural nets is called bias, and so as well as defining weights, we're also going to define bias. And so since this thing is going to spit out, for every image, something of length 10, that means that we need to create a vector of length 10 to be our biases.

In other words, for each of 0, 1, 2, 3, up to 9, we're going to have a different +b that we'll be adding. So we've got our data matrix here, which is of size 10,000 by 28x28. And then we've got our weight matrix, which is 28x28 rows by 10. So if we multiply those together, we get something of size 10,000 by 10.

And then we want to add on our bias, like so. And when we add on a vector like this, and we're going to learn a lot more about this later, it basically gets added to every row. So the bias is going to get added to every row.
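You can check those shapes directly in PyTorch; here's a small sketch with random stand-in data:

```python
import torch

x = torch.randn(10000, 28 * 28)  # data: 10,000 flattened images
w = torch.randn(28 * 28, 10)     # weight matrix: 784 rows, 10 columns
b = torch.randn(10)              # bias: one per output class

out = x @ w + b                  # b is broadcast, i.e. added to every row
print(out.shape)                 # torch.Size([10000, 10])
```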

So we first of all define those. And so to define them, we've created a tiny little function called get_weights, which is over here, which basically just creates some normally distributed random numbers. So torch.randn returns a tensor filled with random numbers from a normal distribution. We have to be a bit careful though.

When we do deep learning, like when we add more linear layers later, imagine if we have a matrix which on average tends to increase the size of the inputs we give to it. If we then multiply by lots of matrices of that size, it's going to make the numbers bigger and bigger and bigger, like exponentially bigger.

Or what if it made them a bit smaller? It's going to make them smaller and smaller and smaller exponentially smaller. So because a deep network applies lots of linear layers, if on average they result in things a bit bigger than they started with, or a bit smaller than they started with, it's going to exponentially multiply that difference.

So we need to make sure that the weight matrix is of an appropriate scale, so that the mean of the inputs basically is not going to change. It turns out that if you use normally distributed random numbers and divide them by the number of rows in the weight matrix, that particular random initialization keeps your numbers at about the right scale.
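A sketch of what such a helper might look like (the name get_weights matches the lecture, but the exact notebook code may differ; nn.Parameter is the wrapper the lecture comes back to in a moment):

```python
import torch
import torch.nn as nn

def get_weights(*dims):
    # normally distributed random numbers divided by the number of rows,
    # as described above, so the scale of the inputs is roughly preserved
    return nn.Parameter(torch.randn(*dims) / dims[0])

w = get_weights(28 * 28, 10)  # the weight matrix
b = get_weights(10)           # the bias vector
```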

So this idea, if you've done linear algebra: basically if the first eigenvalue is bigger than 1 or smaller than 1, it's going to cause the gradients to get bigger and bigger, or smaller and smaller; that's called gradient explosion. So we'll talk more about this in the deep learning course, but if you're interested, you can look up Kaiming He initialization and read all about this concept.

But for now, it's probably just enough to know that if you use this type of random number generation, you're going to get random numbers that are nicely behaved. You're going to start out with an input which is mean 0, standard deviation 1. Once you put it through this set of random numbers, you'll still have something that's about mean 0, standard deviation 1.

That's basically the goal. One nice thing about PyTorch is that you can play with this stuff. So torch.randn, try it out. Every time you see a function being used, run it and take a look. And so you'll see it looks a lot like NumPy, but it doesn't return a NumPy array, it returns a tensor.
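For example, something along these lines (the .cuda() call needs an NVIDIA GPU, hence the guard):

```python
import torch

t = torch.randn(3, 4)   # looks like NumPy, but returns a torch tensor
print(t)

if torch.cuda.is_available():
    t = t.cuda()        # move the tensor onto the GPU
    print(t * 3)        # this multiplication runs on the GPU
```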

And in fact, now I'm GPU programming. I just multiplied that matrix by 3 very quickly on the GPU. So that's how we do GPU programming with PyTorch. So this is our weight matrix. As I said, we create one of 28x28 by 10, and one that's just rank 1, of length 10, for the biases.

We have to make them a parameter. This is basically telling PyTorch which things to update when it does SGD. That's a very minor technical detail. So having created the weight matrices, we then define a special method with the name forward. This is a special method; the name forward has a special meaning in PyTorch.

A method called forward in PyTorch is the name of the method that will get called when your layer is calculated. So if you create a neural net or a layer, you have to define forward. And it's going to get passed the data from the previous layer. So our definition is to do a matrix multiplication of our input data times our weights and add on the biases.

So that's it. That's what happened earlier on when we said nn.linear. It created this thing for us. Now unfortunately though, we're not getting a 28x28 long vector. We're getting a 28 row by 28 column matrix, so we have to flatten it. Unfortunately in PyTorch, they tend to rename things.

Where NumPy spells it reshape, they spell it view. So view means reshape. So you can see here, we end up with something where the number of images we're going to leave the same, and then we're going to replace row by column with a single axis, with -1 again meaning as long as required.
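Concretely, flattening with view might look like this small sketch:

```python
import torch

x = torch.randn(64, 28, 28)   # a mini-batch of 64 images
x = x.view(x.size(0), -1)     # keep the first axis, flatten the rest
print(x.shape)                # torch.Size([64, 784])
```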

So this is how we flatten something using PyTorch. So we flatten it, do a matrix multiply, and then finally we do a softmax. So softmax is the activation function we use. If you look in the deep learning repo, you'll find something called entropy example, where you'll see an example of softmax.

But a softmax simply takes the outputs from our final layer, so we get our outputs from our linear layer, and what we do is we go e to the power of each output, and then we take that number and divide it by the sum of all of those e to the powers. That's called softmax. Why do we do that?

Well, because we're dividing each one by the sum, that means that those outputs must themselves add up to 1, and that's what we want: we want the probabilities of all the possible outcomes to add up to 1. Furthermore, because e to the power of anything is positive, every one of these divided by the sum is between 0 and 1, and probabilities, we know, should be between 0 and 1.

And then finally, because we're using e to the power of, it tends to mean that slightly bigger values in the input turn into much bigger values in the output. So you'll see, generally speaking, in a softmax there's going to be one big number and lots of small numbers. And that's what we want, because we know that the output is one hot encoded.
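Written out, a from-scratch softmax might look like this (dim=1 so each row of predictions sums to 1):

```python
import torch

def softmax(x):
    # e to the power of each output, divided by the sum of those exponentials
    ex = torch.exp(x)
    return ex / ex.sum(dim=1, keepdim=True)

out = torch.randn(5, 10)   # pretend outputs of the final linear layer
probs = softmax(out)
print(probs.sum(dim=1))    # each row sums to 1
print(probs[0])            # usually one big value and lots of small ones
```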

So in other words, a softmax activation function, the softmax nonlinearity, is something that returns things that behave like probabilities, where one of those probabilities is more likely to be high and the other ones are more likely to be low. And we know that's what we want to map to our one hot encoding, so a softmax is a great activation function to use: it makes it easier for the neural net to map to the output that you want.

And this is what we generally want: when we're designing neural networks, we try to come up with little architectural tweaks that make it as easy as possible for the network to match the output that we know we want. So that's basically it, right? Rather than doing sequential and using nn.Linear and nn.LogSoftmax, we've defined it all from scratch.
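Pulling those pieces together, the whole from-scratch version might look roughly like this (a sketch, not the exact notebook code; the attribute names are mine, and I take the log of the softmax so it pairs with a negative log likelihood loss, matching the earlier nn.LogSoftmax):

```python
import torch
import torch.nn as nn

def get_weights(*dims):
    # scaled random initialization, as described above
    return nn.Parameter(torch.randn(*dims) / dims[0])

def softmax(x):
    ex = torch.exp(x)
    return ex / ex.sum(dim=1, keepdim=True)

class LogReg(nn.Module):
    def __init__(self):
        super().__init__()
        self.l1_w = get_weights(28 * 28, 10)  # weight matrix
        self.l1_b = get_weights(10)           # bias vector

    def forward(self, x):
        x = x.view(x.size(0), -1)             # flatten each image
        x = x @ self.l1_w + self.l1_b         # matrix multiply plus bias
        return torch.log(softmax(x))          # log softmax for the NLL loss

net = LogReg()  # add .cuda() to put it on the GPU, then fit as before
```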

We can now say, just like before, our net is equal to that class, .cuda, and we can say .fit, and we get, to within a slight random deviation, exactly the same output. So what I'd like you to do during the week is to play around with torch.randn to generate some random tensors, torch.matmul to start multiplying them together, adding them up; try to make sure that you can rewrite softmax yourself from scratch, and try to fiddle around a bit with reshaping and view, all that kind of stuff.

So by the time you come back next week, you feel pretty comfortable with PyTorch. And if you Google for PyTorch tutorial, you'll see there's a lot of great material actually on the PyTorch website to help you along, basically showing you how to create tensors and modify them and do operations on them.

All right, great. Yes, you had a question. Can you pass it over? So I see that the forward is the layer that gets applied after each of the linear layers. Well, not quite. The forward is just the definition of the module. So this is like how we're implementing linear.

Does that mean after each linear layer, we have to apply the same function? Like, can we do a log softmax after layer one and then apply some other function after layer two, if we have a multi-layer neural network? So normally we define neural networks like so: we just say here is a list of the layers we want.

You don't have to write your own forward. All we did just now was to say instead of doing this, let's not use any of this at all, but write it all by hand ourselves. So you can write as many layers as you like in any order you like here.

The point was that here we're not using any of that. We've written our own matmul plus bias, our own softmax. This is just Python code; you can write whatever Python code inside forward that you like to define your own neural net. You won't normally do this yourself. Normally you'll just use the layers that PyTorch provides and use nn.Sequential to put them together, or even more likely you'll download a predefined architecture and use that.

We're just doing this to learn how it works behind the scenes. Alright, great. Thanks everybody.