Machine Learning 1: Lesson 8
Chapters
0:0 Intro
2:0 Random forests
5:20 Recognition
7:30 Data
12:55 Tensors
16:40 Negative log likelihood
33:49 Slicing
38:20 Neural Network
44:15 Getting Started
47:15 PyTorch
47:43 GPU
53:24 AWS
56:38 PyTorch
00:00:00.000 |
So, I don't want to embarrass Rachel, but I'm very excited that Rachel's here. 00:00:07.440 |
So this is Rachel, for those of you that don't know. 00:00:11.920 |
She's not quite back on her feet after her illness, but well enough to at least come 00:00:15.600 |
to at least part of this lesson, so don't worry if she can't stay for the whole thing. 00:00:19.720 |
I'm really glad she's here because Rachel actually wrote the vast majority of this lesson. 00:00:25.240 |
I think it's really, really cool work, so I'm glad she's going to at least see it being 00:00:29.760 |
taught, even if unfortunately she's not teaching herself. 00:00:37.640 |
Good thanksgiving present, best thanksgiving present. 00:00:41.700 |
So as we discussed at the end of last lesson, we're kind of moving from the decision tree 00:00:47.400 |
ensembles to neural nets, broadly defined, and as we discussed, random forests and decision 00:00:55.840 |
trees are limited by the fact in the end that they're basically doing nearest neighbors. 00:01:07.680 |
All they can do is to return the average of a bunch of other points. 00:01:11.560 |
And so they can't extrapolate out to, if you're thinking what happens if I increase my prices 00:01:16.000 |
by 20% and you've never priced at that level before, or what's going to happen to sales 00:01:21.920 |
next year, and obviously we've never seen next year before, it's very hard to extrapolate, 00:01:27.440 |
it's also hard because a tree can only make around log base 2 of N decisions, and so if there's a time series 00:01:38.080 |
it needs to fit and it takes like 4 steps to get to the right time period, then suddenly 00:01:43.760 |
there's not many decisions left for it to make, so there's this limited amount of computation 00:01:48.800 |
that it can do, so there's a limited complexity of relationship that it can model. 00:01:56.000 |
Can I ask about one more drawback of random forests that I've come across? 00:02:05.620 |
So if we have data as categorical variables which are not in sequential order, so for 00:02:12.360 |
random forests we encode them and treat them as numbers. Let's say we have a cardinality-20 variable 00:02:18.080 |
encoded 1 to 20, so the splits that the random forest gives are something like less than 5, less 00:02:26.520 |
than 6, but if the categories are not sequential, not in any order, what does that mean? 00:02:37.660 |
So if you've got like, let's go back to bulldozers, EROPS with AC, OROPS, and a whatever. 00:03:09.160 |
And so actually we know that all that really mattered was if it had air conditioning. 00:03:15.640 |
Well it's basically going to say like, okay, if I group it into those together and those 00:03:22.320 |
together, that's an interesting break, just because it so happens that the air conditioning 00:03:28.280 |
ones all are going to end up in the right-hand side. 00:03:31.680 |
And then having done that, it's then going to say okay, well within the group with the 00:03:37.280 |
2 and 3, it's going to notice that it furthermore has to split it into two more groups. 00:03:42.120 |
So eventually it's going to get there, it's going to pull out that category, it's going 00:03:47.960 |
to take more splits than we would ideally like. 00:03:50.600 |
So it's kind of similar to the fact that for it to model a line, it can only do it with lots and lots of split points. 00:03:58.080 |
So random forest is fine with categories that are not sequential also? 00:04:02.160 |
Yeah, so it can do it, it's just like in some way it's sub-optimal because we just need 00:04:06.360 |
to do more break points than we would have liked, but it gets there, it does a pretty good job. 00:04:11.840 |
And so even though random forests do have some deficiencies, they're incredibly powerful, 00:04:19.760 |
particularly because they have so few assumptions; it's really hard to screw them up, and it's kind 00:04:25.320 |
of hard to actually win a Kaggle competition with a random forest, but it's very easy to get a pretty good result. 00:04:32.160 |
So in real life, where often that third decimal place doesn't matter, random forests are often all you need. 00:04:41.280 |
But for some things like this Ecuadorian groceries competition, it's very, very hard to get a 00:04:46.200 |
good result with a random forest because there's a huge time series component and nearly everything 00:04:52.400 |
is these two massively high cardinality categorical variables, which is the store and the item, 00:04:58.640 |
and so there's very little there to even throw at a random forest, and the difference between 00:05:05.000 |
every pair of stores is kind of different in different ways. 00:05:09.360 |
So there are some things that are just hard to get even relatively good results with a random forest, like recognizing images. 00:05:22.640 |
You can get like okay results with a random forest, but in the end the relationship between 00:05:29.760 |
the pixels, the spatial structure, turns out to be important. 00:05:33.960 |
And you kind of want to be able to do computations like finding edges, or that kind of thing. 00:05:42.620 |
So just doing a clever nearest neighbors like a random forest turns out not to be ideal. 00:05:53.300 |
So for stuff like this, neural networks turn out to be ideal. 00:05:57.360 |
Neural networks turn out to be something that works particularly well for both things like 00:06:02.360 |
the Ecuadorian groceries competition, so forecasting sales over time by store and by item, and 00:06:09.140 |
for things like recognizing digits and for things like turning speech into text. 00:06:15.640 |
And so it's kind of nice: between these two things, neural nets and random forests, we have pretty much everything covered. 00:06:22.520 |
I haven't needed to use anything other than these two things for a very long time. 00:06:28.520 |
And we'll actually learn, I don't know what course exactly, but at some point we'll learn 00:06:32.880 |
also how to combine the two, because you can combine the two in really cool ways. 00:06:38.080 |
So here's a picture from Adam Geitgey of an image. 00:06:43.800 |
So an image is just a bunch of numbers, and each of those numbers is 0 to 255, and the 00:06:51.400 |
dark ones are close to 255, the light ones are close to 0. 00:06:55.880 |
So here is an example of a digit from this MNIST dataset. 00:07:01.360 |
MNIST is a really old dataset; it's like the hello world of neural networks. 00:07:12.080 |
If it was color, there would be three of these, one for red, one for green, one for blue. 00:07:19.620 |
So our job is to look at the array of numbers and figure out that this is the number 8, 00:07:31.720 |
So we're going to use a small number of fast.ai pieces, and we're gradually going to remove 00:07:39.340 |
more and more and more until by the end we'll have implemented our own neural network from 00:07:44.960 |
scratch, our own training loop from scratch, and our own matrix multiplication from scratch. 00:07:50.460 |
So we're gradually going to dig in further and further. 00:07:55.640 |
So the data for MNIST, which is the name of this very famous dataset, is available from 00:08:02.720 |
here, and we have a function in fastai.io called get_data, which will grab it from a URL and 00:08:09.880 |
store it on your computer, unless it's already there, in which case it will just use the copy it already has. 00:08:15.760 |
And then we've got a little function here called load_mnist, which simply loads it up. 00:08:22.160 |
You'll see that it's zipped, so we can just use Python's gzip to open it up. 00:08:30.380 |
So if you have any kind of Python object at all, you can use this built-in Python library 00:08:37.320 |
called pickle to dump it out onto your disk, share it around, load it up later, and you 00:08:45.120 |
get back the same Python object you started with. 00:08:48.440 |
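A minimal sketch of what that looks like (the object and file name here are made up purely for illustration):

```python
import pickle

obj = {'name': 'mnist', 'size': (28, 28)}   # any ordinary Python object

with open('tmp.pkl', 'wb') as f:
    pickle.dump(obj, f)                     # write it out to disk

with open('tmp.pkl', 'rb') as f:
    restored = pickle.load(f)               # read it back in later

assert restored == obj                      # same Python object we started with
```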
So you've already seen something like this with pandas' feather format. 00:08:55.560 |
Pickle is not just for pandas; it's not just for any one thing. 00:08:59.040 |
It works for nearly every Python object, which might lead to the question: why didn't we use 00:09:08.760 |
pickle for those data frames earlier? And the answer is pickle works for nearly every Python object, but it's probably not optimal for any particular one of them. 00:09:19.320 |
So because we were looking at pandas data frames with over 100 million rows, we really 00:09:24.640 |
want to save that quickly, and so feather is a format that's specifically designed for 00:09:29.480 |
that purpose, and so it's going to do that really fast. 00:09:32.080 |
If we tried to pickle it, it would have taken a lot longer. 00:09:38.240 |
Also note that pickle files are only for Python, so you can't share them with people using other languages, whereas feather files can be read elsewhere too. 00:09:47.960 |
So it's worth knowing that pickle exists because if you've got some dictionary or some kind 00:09:54.440 |
of object floating around that you want to save for later or send to somebody else, you 00:10:02.640 |
So in this particular case, the folks at deeplearning.net were kind enough to provide a pickled version. 00:10:11.280 |
Pickle has changed slightly over time, and so for old pickle files like this one, you actually 00:10:19.000 |
have to - this is a Python 2 one - tell it that it was encoded using Latin-1. 00:10:26.000 |
But other than that, Python 2 and 3 can normally open each other's pickle files. 00:10:32.400 |
So once we've loaded that in, we load it in like so. 00:10:37.160 |
And so this thing which we're doing here, this is called destructuring. 00:10:41.280 |
Destructuring means that load_mnist is giving us back a tuple of tuples, and so if we have 00:10:48.320 |
on the left-hand side of the equal sign a tuple of tuples, we can fill all these things in at once. 00:10:53.280 |
So it's giving back a tuple of training data, a tuple of validation data, and a tuple of test data. 00:10:59.760 |
In this case, I don't care about the test data, so I just put it into a variable called 00:11:04.080 |
underscore, which - Python people tend to think of underscore as being a special 00:11:12.900 |
variable which we put things we're going to throw away into. 00:11:16.440 |
It's actually not special, but it's really common. 00:11:19.260 |
If you see something assigned to underscore, it probably means you're just throwing it 00:11:23.760 |
By the way, in a Jupyter notebook, it does have a special meaning which is the last cell 00:11:29.780 |
that you calculate is always available in underscore. 00:11:36.960 |
So then the first thing in that tuple is itself a tuple, and so we're going to stick that 00:11:42.060 |
into x and y for our training data, and then the second one goes into x and y for our validation 00:11:49.520 |
So that's called destructuring, and it's pretty common in lots of languages. 00:11:55.560 |
Some languages don't support it, but those that do, life becomes a lot easier. 00:11:59.680 |
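As a hedged sketch, the loading and destructuring described above might look roughly like this. The file path is an assumption, and the lesson actually uses fast.ai's load_mnist helper rather than this hand-rolled one:

```python
import gzip, pickle

def load_mnist(filename):
    # The deeplearning.net file is gzipped and was pickled under Python 2,
    # so Python 3 needs to be told the encoding.
    with gzip.open(filename, 'rb') as f:
        return pickle.load(f, encoding='latin-1')

# Destructuring: a tuple of tuples on the left fills everything in at once.
((x, y), (x_valid, y_valid), _) = load_mnist('data/mnist/mnist.pkl.gz')
```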
So as soon as I look at some new dataset, I just check out what have I got. 00:12:18.480 |
Now, an image isn't naturally of length 784; an image is 28 by 28. So what happened? 00:12:30.240 |
Well, we could guess, and we can check on the website, it turns out we would be right, 00:12:34.360 |
that all they did was they took the second row and concatenated it to the first row, and 00:12:38.640 |
the third row and concatenated it to that, and the fourth row and concatenated it to 00:12:42.600 |
So in other words, they took this whole 28 by 28 and flattened it out into a single 1D array of length 784. 00:12:54.640 |
This is not normal by any means, so don't think everything you see is going to be like 00:13:01.320 |
Most of the time when people share images, they share them as JPGs or PNGs, you load 00:13:09.120 |
But in this particular case, for whatever reason, the thing that they pickled was flattened 00:13:18.520 |
And this word "flatten" is very common when working with tensors. 00:13:24.800 |
So when you flatten a tensor, it just means that you're turning it into a lower rank tensor 00:13:31.480 |
In this case, we started with a rank 2 tensor (a matrix) for each image, and we turned each 00:13:41.680 |
one into a rank 1 tensor (a vector), so overall the whole thing is a rank 2 tensor rather than a rank 3 tensor. 00:13:50.360 |
So just to remind us of the jargon here, this in math we would call a vector. 00:14:05.000 |
In computer science we would call it a 1D array, but because deep learning people have 00:14:12.400 |
to come across as smarter than everybody else, we have to call this a rank1 tensor. 00:14:21.040 |
They all mean the same thing, more or less, unless you're a physicist, in which case this 00:14:25.640 |
means something else, and you get very angry at the deep learning people because you say 00:14:32.560 |
Don't blame me, this is just what people say. 00:14:35.720 |
So this is either a matrix or a 2D array or a rank2 tensor. 00:14:48.640 |
And so once we start to get into 3 dimensions, we start to run out of mathematical names, 00:14:55.040 |
which is why it's nice to just say rank 3 tensor. 00:14:58.600 |
And so there's actually nothing special about vectors and matrices that make them in any 00:15:02.880 |
way more important than rank3 tensors or rank4 tensors or whatever. 00:15:08.560 |
So I try not to use the terms vector and matrix where possible because I don't really think 00:15:15.120 |
they're any more special than any other rank of tensor. 00:15:19.720 |
So it's good to get used to thinking of this as a rank2 tensor. 00:15:26.360 |
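To make the jargon concrete, here is a tiny NumPy illustration of ranks (nothing here is from the lesson's notebook; it's just for orientation):

```python
import numpy as np

v = np.array([1, 2, 3])          # rank 1 tensor (a vector): v.ndim == 1, v.shape == (3,)
m = np.zeros((28, 28))           # rank 2 tensor (a matrix): m.ndim == 2
t = np.zeros((10000, 28, 28))    # rank 3 tensor:            t.ndim == 3
```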
And then the rows and columns, if we're computer science people, we would call this dimension 00:15:43.080 |
0 and dimension 1, but if we're deep learning people we would call this axis 0 or axis 1. 00:15:57.480 |
And then just to be really confusing, if you're an image person, this is the first axis and 00:16:04.800 |
this is the second: so if you think about TVs, 1920x1080, that's columns by rows. Everybody else, including deep learning, says rows by columns. 00:16:17.240 |
So this is pretty confusing if you use the Python imaging library. 00:16:21.920 |
You get back columns by rows, pretty much everything else, rows by columns. 00:16:29.680 |
Because they hate us, because they're bad people, I guess. 00:16:44.000 |
Particularly in deep learning, a whole lot of different areas have come together, like 00:16:47.160 |
information theory, computer vision, statistics, signal processing, and you've ended up with 00:16:52.520 |
this hodgepodge of nomenclature in deep learning. 00:16:59.520 |
So today we're going to hear about something that's called either negative log likelihood 00:17:04.360 |
or binomial or categorical cross entropy, depending on where you come from. 00:17:09.520 |
We've already seen something that's called either one-hot encoding or dummy variables 00:17:14.520 |
And really it's just like the same concept gets somewhat independently invented in different 00:17:19.480 |
fields and eventually they find their way to machine learning and then we don't know 00:17:24.080 |
what to call them, so we call them all of the above, something like that. 00:17:29.720 |
So I think that's what's happened with computer vision rows and columns. 00:17:34.600 |
So there's this idea of normalizing data, which is subtracting out the mean and dividing by the standard deviation. 00:17:44.720 |
So a question for you: often it's important to normalize the data so that we can more easily train a model. 00:17:56.480 |
Do you think it would be important to normalize the independent variables for a random forest? 00:18:04.360 |
I'm going to be honest, I don't know why we don't need to normalize, I just know that we don't. 00:18:18.520 |
It wouldn't matter because each scaling and transformation we can have will be applied 00:18:26.760 |
to each row and we will be computing means as we were doing, like local averages. 00:18:33.520 |
And at the end we will of course want to de-normalize it back, so it wouldn't change the result. 00:18:40.480 |
I'm talking about the independent variables, not the dependent variable. 00:18:43.480 |
I thought you asked about dependent variables. 00:18:52.120 |
It might be because we just care about the relationship between the independent variables 00:18:55.920 |
and the dependent variable, so scale doesn't really matter. 00:19:02.800 |
Because at each split point we can just divide to see, regardless of what scale you're on, 00:19:16.160 |
Right, so really the key is that when we're deciding where to split, all that matters 00:19:22.640 |
is the order, all that matters is how they're sorted. 00:19:26.180 |
So if we subtract the mean and divide by the standard deviation, they're still sorted in 00:19:32.120 |
Remember when we implemented the random forest, we said sort them, and then we completely 00:19:37.720 |
ignored the values, we just said now add on one thing from the dependent at a time. 00:19:44.320 |
So random forests only care about the sort order of the independent variables; they don't care about the values at all. 00:19:52.720 |
And so that's why they're wonderfully immune to outliers, because they totally ignore the 00:19:58.480 |
They only care about which one's higher than what other thing. 00:20:02.320 |
So this is an important concept, it doesn't just appear in random forests, it occurs in 00:20:07.780 |
For example, area under the ROC curve, you come across a lot, that area under the ROC 00:20:13.760 |
curve completely ignores scale and only cares about sort. 00:20:18.980 |
We saw something else when we did the dendrogram: Spearman's correlation is a rank correlation, it only cares about order. 00:20:29.180 |
So random forests, one of the many wonderful things about them are that we can completely 00:20:34.080 |
ignore a lot of these statistical distribution issues. 00:20:38.960 |
But we can't for deep learning, because for deep learning we're trying to train a parameterized model, so we do need to normalize. 00:20:48.160 |
If we don't, then it's going to be much harder to create a network that trains effectively. 00:20:54.480 |
So we grab the mean and the standard deviation of our training data and subtract out the 00:20:58.560 |
mean, divide by the standard deviation, and that gives us a mean of 0 and a standard deviation 00:21:05.760 |
Now for our validation data, we need to use the standard deviation and mean from the training 00:21:15.280 |
Just like categorical variables, we had to make sure they had the same indexes mapped 00:21:20.160 |
to the same levels for a random forest, or missing values, we had to make sure we had 00:21:26.280 |
the same median used when we were replacing the missing values. 00:21:30.760 |
You need to make sure anything you do in the training set, you do exactly the same thing 00:21:36.400 |
So here I'm subtracting out the training set mean and dividing by the training set standard deviation. 00:21:39.920 |
So this is not exactly 0, this is not exactly 1, but it's pretty close. 00:21:45.440 |
And so in general, if you find you try something on a validation set or a test set and it's 00:21:50.440 |
like much, much, much worse than your training set, it's probably because you normalized it 00:21:57.680 |
in an inconsistent way or encoded categories in an inconsistent way or something like that. 00:22:07.720 |
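Put together, the normalization step described above looks roughly like this (the names follow the x / x_valid arrays loaded earlier; a sketch, not the exact notebook code):

```python
mean = x.mean()
std = x.std()

x = (x - mean) / std                # training set: now roughly mean 0, std 1
x_valid = (x_valid - mean) / std    # validation set: normalized with the *training* statistics

x_valid.mean(), x_valid.std()       # close to 0 and 1, but not exactly
```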
So we've got 10,000 images in the validation set, and each one is a rank 1 tensor of length 784. 00:22:16.720 |
In order to display it, I want to turn it into a rank 2 tensor of 28x28. 00:22:23.000 |
So NumPy has a reshape function that takes a tensor in and reshapes it to whatever size you ask for. 00:22:35.860 |
Now if you think about it, if there are d axes, you only 00:22:42.840 |
need to tell it about d-1 of the axes you want, because the last one it can figure out for itself. 00:22:48.920 |
So in total, there are 10,000 x 784 numbers here altogether. 00:22:56.240 |
So if you say, well, I want my last axes to be 28x28, then it can figure out that this 00:23:02.360 |
must be 10,000, otherwise it's not going to fit. 00:23:08.360 |
So if you put -1, it says make it as big or as small as you have to make it fit. 00:23:14.120 |
So you can see here it figured out it has to be 10,000. 00:23:18.100 |
So you'll see this used in neural net software, pre-processing, and stuff like that all the time. 00:23:26.000 |
I could have written 10,000 here, but I try to get into a habit of any time I'm referring 00:23:30.580 |
to how many items are in my input, I tend to use -1 because it just means later on I 00:23:37.240 |
could use a sub-sample, this code wouldn't break. 00:23:41.280 |
I could do some kind of stratified sampling, if it was unbalanced, this code wouldn't break. 00:23:46.160 |
So by using this kind of approach of saying -1 here for the size, it just makes it more 00:23:51.840 |
resilient to change this later, it's a good habit to get into. 00:23:56.700 |
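For example, a sketch of the reshape being described (x_valid as loaded above):

```python
x_imgs = x_valid.reshape(-1, 28, 28)    # -1 means "work this axis out for me" -> 10000
x_imgs.shape                            # (10000, 28, 28)
```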
So this kind of idea of being able to take tensors and reshape them and change axes around 00:24:04.360 |
and stuff like that is something you need to be able to do without thinking, because it comes up constantly. 00:24:13.360 |
So for example, here's one, I tried to read in some images, they were flattened, I need 00:24:18.200 |
to unflatten them into a bunch of matrices, reshape, bang. 00:24:23.720 |
I read some images in with OpenCV, and it turns out OpenCV orders the channels blue, 00:24:30.840 |
green, red; everything else expects them to be red, green, blue; I need to reverse the order of the channels. 00:24:38.880 |
I read in some images with Python imaging library, it reads them as rows by columns 00:24:45.720 |
by channels, PyTorch expects channels by rows by columns, how do I transform that? 00:24:52.600 |
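As a rough cheat sheet for the three situations just mentioned (the arrays here are dummy placeholders, not data from the lesson):

```python
import numpy as np

flat = np.zeros((10, 784))              # ten flattened images
imgs = flat.reshape(-1, 28, 28)         # unflatten into matrices: (10, 28, 28)

bgr = np.zeros((28, 28, 3))             # channels in OpenCV's blue, green, red order
rgb = bgr[..., ::-1]                    # reverse the last axis: BGR -> RGB

hwc = np.zeros((28, 28, 3))             # PIL-style rows x columns x channels
chw = hwc.transpose(2, 0, 1)            # PyTorch-style channels x rows x columns
```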
So these are all things you need to be able to do without thinking, like straight away, 00:24:58.400 |
because it happens all the time and you never want to be sitting there thinking about it 00:25:03.960 |
So make sure you spend a lot of time over the week just practicing with things like 00:25:08.560 |
all the stuff we're going to see today: reshaping, slicing, reordering dimensions, stuff like that. 00:25:16.120 |
And so the best way is to create some small tensors yourself and start experimenting: what do I get if I do this? 00:25:30.240 |
So back in normalize, you say, you might have gone over this, but I'm still like wrestling 00:25:37.040 |
with it a little bit, saying many machine learning algorithms behave better when the 00:25:39.920 |
data is normalized, but you also just said that scales don't really matter. 00:25:48.920 |
So random forests are just going to spit things based on order. 00:25:53.920 |
We love random forests for the way they're so immune to worrying about distributional assumptions. 00:25:59.800 |
But we're not doing random forests, we're doing deep learning, and deep learning does care. 00:26:03.520 |
So if we have a parametric model then we should scale, and if we have a non-parametric model then we shouldn't? 00:26:14.200 |
No, not quite, because k nearest neighbors is non-parametric and scale matters a hell of a lot there. 00:26:20.720 |
So I would say things involving trees generally are just going to split at a point, and so scale doesn't matter. 00:26:28.760 |
But you probably just need to think, is this an algorithm that uses order or does it use specific values? 00:26:39.320 |
Can you please give us an intuition of why it needs scale, just because that will clarify things. 00:26:50.680 |
So for now, we're just going to take my word for it. 00:26:55.440 |
So this is probably a dumb question, but can you explain a little bit more what you mean 00:27:00.760 |
Because I guess when I think of scale, I'm like, oh, all the numbers should be generally 00:27:07.960 |
But is that like the case with the cats and dogs that we went over with the deep learning? 00:27:12.880 |
You could have a small cat and a larger cat, but it would still know that those were both cats. 00:27:18.100 |
Oh, I guess this is one of these problems where language gets overloaded. 00:27:22.120 |
So in computer vision, when we scale an image, we're actually increasing the size of the image. 00:27:27.880 |
In this case, we're scaling the actual pixel values. 00:27:32.160 |
So in both cases, scaling means to make something bigger and smaller. 00:27:35.440 |
In this case, we're taking the numbers from 0 to 255 and making them so that they have 00:27:39.560 |
an average of 0 and a standard deviation of 1. 00:27:43.160 |
Jeremy, could you please explain, is it by column, by row? 00:27:51.440 |
In general, when you're scaling, I'm not just thinking about pictures, but about 00:27:59.480 |
any kind of input to whatever you're learning. 00:28:04.280 |
But in this case, I've just got a single mean and a single standard deviation. 00:28:08.240 |
So it's basically, on average, how much black is there. 00:28:14.520 |
And so on average, we have a mean and a standard deviation across all the pixels. 00:28:23.080 |
In computer vision, we would normally do it by channel. 00:28:26.120 |
So we would normally have one number for red, one number for green, one number for blue. 00:28:33.280 |
In general, you need a different set of normalization coefficients for each thing that has a different kind of meaning. 00:28:43.000 |
So if we were doing a structured data set where we've got income, distance in kilometers, 00:28:50.120 |
and number of children, you'd need three separate normalization coefficients for those. 00:28:57.520 |
So it's kind of like a bit domain-specific here. 00:29:02.040 |
In this case, all of the pixels are levels of gray, so we've just got a single scaling 00:29:09.560 |
Whereas you could imagine, if they were red versus green versus blue, you could need three separate ones. 00:29:19.720 |
So I'm having a bit of trouble imagining what would happen if we don't normalize in this case. 00:29:28.120 |
So this is kind of what Yannette was saying, why do we normalize? 00:29:31.400 |
And for now we're normalizing because I say we have to. 00:29:35.320 |
When we get to looking at stochastic gradient descent, we'll basically discover that if 00:29:40.400 |
you -- basically to skip ahead a little bit, we're going to be doing a matrix multiply 00:29:47.960 |
We're going to pick those weights in such a way that when we do the matrix multiply, 00:29:52.200 |
we're going to try to keep the numbers at the same scale that they started out as. 00:29:56.360 |
And that's basically going to require knowing about the initial numbers. 00:29:59.800 |
We're going to have to know what their scale is. 00:30:02.440 |
So basically it's much easier to create a single kind of neural network architecture 00:30:07.440 |
that works for lots of different kinds of inputs if we know that they're consistently 00:30:11.240 |
going to be mean zero, standard deviation one. 00:30:18.560 |
And if in a couple of lessons you're still not quite sure why, let's come back to it 00:30:24.120 |
because it's a really interesting thing to talk about. 00:30:26.600 |
Yes, I'm just trying to visualize the axes we're working with here. 00:30:31.720 |
So under plots, when you write x_valid.shape, we get 10,000 by 784. 00:30:37.480 |
Does that mean that we brought in 10,000 pictures of that dimension? 00:30:43.800 |
And then in the next line, when you choose to reshape it, is there a reason why you put 00:30:47.800 |
28, 28 as the second and third coordinates, or is there a reason why they're in that order? 00:30:56.560 |
Pretty much all neural network libraries assume that the first axis is kind of the equivalent of a row: each item along it is a separate example. 00:31:06.080 |
It's a sentence, or an image, or an example of sales, or whatever. 00:31:12.760 |
So I want each image to be a separate item of the first axis, and that leaves two more axes for the rows and columns of the image. 00:31:25.640 |
Yeah, I don't think I've ever seen a library that doesn't work that way. 00:31:39.120 |
So while normalizing the validation data, I saw you have used the mean and standard deviation of x, the training data. 00:31:49.160 |
So shouldn't we use mean and standard deviation of validation data? 00:31:58.240 |
No, because then you would be normalizing the validation set using different numbers. 00:32:03.720 |
So now a pixel with a value of 3 in the validation set would have a different 00:32:10.540 |
meaning to a 3 in the training set. 00:32:13.880 |
It would be like if we had days of the week encoded such that Monday was a 1 in the training set but a 0 in the test set. 00:32:25.360 |
We've got now two different sets where the same number has a different meaning. 00:32:33.880 |
Let's say we were doing full color images and our training set contained green frogs, 00:32:42.760 |
green snakes, and gray elephants, and we're trying to figure out which was which. 00:32:47.000 |
We normalize using each channel's mean, and then we have a validation set and a test set with a different mix of colors. 00:32:59.800 |
So if we were to normalize by the validation set's statistics, we would end up saying things 00:33:06.000 |
on average are green, and so we would remove all the greenness out. 00:33:11.000 |
And so we would now fail to recognize the green frogs and the green snakes effectively. 00:33:17.280 |
So we actually want to use the same normalization coefficients that we were training on. 00:33:22.480 |
And for those of you doing the deep learning class, we actually go further than that. 00:33:25.840 |
When we use a pre-trained network, we have to use the same normalization coefficients 00:33:31.840 |
So the idea is that a number needs to have this consistent meaning across every data set we use. 00:33:49.380 |
That means when you are looking at the test set, you normalize the test set based on the training set's statistics? 00:33:57.720 |
So the validation y values are just a rank 1 tensor of length 10,000. 00:34:12.400 |
Remember there's this kind of weird Python thing where a tuple with just one thing in it is written with a trailing comma. 00:34:22.840 |
And so here's an example of something from that, it's just the number 3. 00:34:28.940 |
So here's another thing you need to be able to do in your sleep, slicing into a tensor. 00:34:35.960 |
So in this case, we're slicing into the first axis with zero. 00:34:44.200 |
So because this is a single number, this is going to reduce the rank of the tensor by one. 00:34:48.920 |
It's going to turn it from a 3-dimensional tensor into a 2-dimensional tensor. 00:34:52.760 |
So you can see here, this is now just a matrix, and then we're going to grab 10 through 14 00:34:59.240 |
inclusive rows, 10 through 14 inclusive columns, and here it is. 00:35:03.760 |
So this is the kind of thing you need to be super comfortable with: grabbing pieces out and looking at them. 00:35:13.320 |
So here's an example of a little piece of that first image. 00:35:18.560 |
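A small sketch of that slicing, assuming x_imgs is the reshaped (10000, 28, 28) array from above:

```python
x_imgs = x_valid.reshape(-1, 28, 28)    # rank 3: (10000, 28, 28)

x_imgs[0].shape                         # (28, 28) -- indexing with a single number drops one rank
x_imgs[0, 10:15, 10:15]                 # a 5x5 patch of the first image
```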
And so you kind of want to get used to this idea that if you're working with something 00:35:23.000 |
like pictures or audio, this is something your brain is really good at interpreting. 00:35:28.080 |
So keep showing pictures of what you're doing whenever you can. 00:35:33.080 |
But also remember behind the scenes, they're numbers. 00:35:35.760 |
So if something's going weird, print out a few of the actual numbers. 00:35:39.640 |
You might find somehow some of them have become infinity, or they're all zero, or whatever. 00:35:45.480 |
So use this interactive environment to explore the data as you go. 00:36:02.520 |
Why when it's a tensor of rank 3 is it stored as x, y, z instead of, to me it would make 00:36:10.280 |
more sense to store it as like a list of 2D tensors. 00:36:25.160 |
So a 3D tensor is formatted as showing a list of 2D tensors basically. 00:36:31.320 |
But when you're extracting it, if you're extracting the first one, why isn't it x images square 00:36:39.320 |
brackets zero, closed square brackets, and then a second set of square brackets? 00:36:43.520 |
Oh, because that has a different meaning, right? 00:36:46.260 |
So it's kind of the difference between tensors and jagged arrays, right? 00:36:53.280 |
So basically if you do something like that, that says take the second list item and then, from that, take another piece. 00:37:03.840 |
And so we tend to use that when we have something called a jagged array, which is where each 00:37:07.440 |
subarray may be of a different length, right? 00:37:10.480 |
Whereas here we have a single object of three dimensions. 00:37:17.240 |
And so we're trying to say like which little piece of it do we want. 00:37:21.740 |
And so the idea is that that is a single slice object to go in and grab that piece out. 00:37:31.680 |
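A hedged illustration of the difference being described, using a small made-up array:

```python
import numpy as np

t = np.arange(27).reshape(3, 3, 3)      # a single rank 3 tensor

t[0, 1:3, 1:3].shape                    # (2, 2): one slice object grabs a square patch
t[0][1:3][1:3].shape                    # (1, 3): chained indexing slices rows twice, not rows then columns

jagged = [[1, 2, 3], [4, 5]]            # a jagged array: sub-lists of different lengths
jagged[1][0]                            # here chained indexing is what you want -> 4
```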
Okay, so here's an example of a few of those images along with their labels. 00:37:42.480 |
And this kind of stuff you want to be able to do pretty quickly with matplotlib. 00:37:47.480 |
It's going to help you a lot in life, and in your exam. 00:37:51.440 |
So you can have a look at what Rachel wrote here when she wrote plots. 00:37:56.280 |
We can use add_subplot to basically create those little separate plots, and you need 00:38:04.280 |
to know that imshow is how we basically take a numpy array and draw it as a picture. 00:38:19.960 |
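A minimal sketch of such a plotting helper, assuming the x_imgs and y_valid arrays from above (this is an approximation of Rachel's function, not a copy of it):

```python
import matplotlib.pyplot as plt

def plots(imgs, titles=None, figsize=(12, 6), rows=1):
    # Draw a grid of images, each in its own subplot, with optional titles.
    fig = plt.figure(figsize=figsize)
    cols = len(imgs) // rows
    for i, img in enumerate(imgs):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.axis('off')
        if titles is not None:
            ax.set_title(titles[i])
        ax.imshow(img, cmap='gray')    # imshow draws a numpy array as a picture

plots(x_imgs[:8], titles=y_valid[:8], rows=2)
```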
So let's now take that data and try to build a neural network with it. 00:38:31.680 |
And so a neural network -- and sorry, this is going to be a lot of review for those of 00:38:36.200 |
you already doing deep learning -- a neural network is just a particular mathematical 00:38:41.160 |
function or a class of mathematical functions. 00:38:43.200 |
But it's a really important class because it has the property, it supports what's called 00:38:47.480 |
the universal approximation theorem, which means that a neural network can approximate any other function arbitrarily closely. 00:38:58.160 |
So in other words, in theory it can do anything as long as we make it big enough. 00:39:05.720 |
So this is very different to a function like 3x+5, which can only do one thing. 00:39:15.820 |
For the class of functions ax+b, which can only represent lines of different slopes moving 00:39:23.080 |
it up and down different amounts, or even the function ax^2+bx+c+sin(d), again, only 00:39:32.080 |
can represent a very specific subset of relationships. 00:39:36.480 |
The neural network, however, is a function that can represent any other function to arbitrarily 00:39:45.000 |
So what we're going to do is we're going to learn how to take a function, let's say 00:39:48.760 |
ax+b, and we're going to learn how to find its parameters, in this case a and b, 00:39:54.840 |
which allow it to fit as closely as possible to a set of data. 00:39:58.840 |
And so this here is showing an example from a notebook that we'll be looking at in the 00:40:04.200 |
deep learning course, which basically shows what happens when we use something called 00:40:07.520 |
stochastic gradient descent to try and set a and b. 00:40:11.520 |
Basically what happens is we're going to pick a random a to start with, a random b to start 00:40:17.320 |
with, and then we're going to basically figure out do I need to increase or decrease a to 00:40:23.400 |
make the line closer to the dots, do I need to increase or decrease b to make the line 00:40:28.840 |
closer to the dots, and then just keep increasing and decreasing a and b lots and lots of times. 00:40:36.120 |
And to answer the question do I need to increase or decrease a and b, we're going to take the 00:40:41.600 |
So the derivative of the function with respect to a and b tells us how that function will change as we change a and b. 00:40:52.000 |
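As a rough illustration of that idea, here is a tiny hand-rolled gradient descent for ax+b on made-up data (the numbers and learning rate are arbitrary; this is a sketch, not the notebook's SGD demo):

```python
import numpy as np

# Noisy made-up data around a "true" line y = 3x + 8.
xs = np.random.rand(100)
ys = 3 * xs + 8 + 0.1 * np.random.randn(100)

a, b = np.random.randn(), np.random.randn()   # start with random parameters
lr = 0.5                                       # learning rate (arbitrary)

for step in range(200):
    y_hat = a * xs + b
    # Derivatives of the mean squared error with respect to a and b:
    grad_a = (2 * (y_hat - ys) * xs).mean()
    grad_b = (2 * (y_hat - ys)).mean()
    a -= lr * grad_a                           # nudge a downhill
    b -= lr * grad_b                           # nudge b downhill

print(a, b)                                    # should end up close to 3 and 8
```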
But we're not going to start with just a line, the idea is we're going to build up to actually 00:40:57.280 |
having a neural net, and so it's going to be exactly the same idea, but because it's 00:41:01.680 |
an infinitely flexible function, we're going to be able to use this exact same technique to fit arbitrarily complex relationships. 00:41:13.440 |
So then what you need to know is that a neural net is actually a very simple thing. 00:41:21.280 |
A neural net actually is something which takes as input, let's say, a vector, and does 00:41:34.920 |
a matrix product with that vector. So, to draw this properly, if the vector 00:41:45.320 |
is of size r and the matrix is r by c, the matrix product will spit out something of size c. 00:41:56.960 |
And then we do something called a nonlinearity, which is basically we're going to throw away 00:42:00.640 |
all the negative values, so it's basically max(0,x). 00:42:06.400 |
And then we're going to put that through another matrix multiply, and then we're going to put 00:42:11.200 |
that through another max(0,x), and we're going to put that through another matrix multiply, 00:42:17.720 |
and so on, until eventually we end up with the single vector that we want. 00:42:24.080 |
So in other words, each stage of our neural network is the key thing going on is a matrix 00:42:31.720 |
multiply, so in other words, a linear function. 00:42:35.600 |
So basically deep learning, most of the calculation is lots and lots of linear functions, but 00:42:41.880 |
between each one we're going to replace the negative numbers with zeros. 00:42:48.360 |
So why do we need the max(0,x) at all? The short answer is if you apply a linear function to a linear function to a linear function, 00:43:04.680 |
it's still just a linear function, so it's totally useless. 00:43:09.200 |
But if you throw away the negatives, that's actually a nonlinear transformation. 00:43:13.720 |
So it turns out that if you apply a linear function to the thing where you threw away 00:43:18.720 |
the negatives, apply that to a linear function, that creates a neural network, and it turns 00:43:23.480 |
out that's the thing that can approximate any other function arbitrarily closely. 00:43:28.700 |
So this tiny little difference actually makes all the difference. 00:43:32.480 |
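A minimal NumPy sketch of that stack of matrix multiplies and max(0, x) nonlinearities (the layer sizes are made up for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)             # throw away the negative values

w1 = np.random.randn(784, 100) * 0.01   # made-up layer sizes: 784 -> 100 -> 10
w2 = np.random.randn(100, 10) * 0.01

def forward(x):
    return relu(x @ w1) @ w2            # linear, nonlinearity, linear

forward(np.random.randn(784)).shape     # (10,)
```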
And if you're interested in it, check out the deep learning video where we cover this 00:43:36.880 |
because I actually show a nice visual, intuitive proof, not something that I created, but something 00:43:45.020 |
that Michael Nielsen created, or if you want to skip straight to his website, you can go 00:43:50.520 |
to Michael Nielsen's site and find his book, Neural Networks 00:44:03.840 |
and Deep Learning, chapter 4, and he's got a really nice walkthrough basically with lots 00:44:09.840 |
of animations where you can see why this works. 00:44:18.040 |
I feel like the hardest thing with getting started with technical writing on the internet is writing that first post. 00:44:33.980 |
If you do a search for Rachel Thomas Medium blog, you'll find this, we'll put it on the 00:44:40.080 |
Lesson Wiki, where she actually says the top advice she would give to her younger self is to start blogging sooner. 00:44:50.080 |
And she has both reasons why you should do it, some examples of places she's blogged 00:44:58.880 |
and it's turned out to be great for her and her career, but then some tips about how to get started. 00:45:04.840 |
I remember when I first suggested to Rachel she might think about blogging because she 00:45:08.640 |
had so much interesting to say, and at first she was kind of surprised at the idea that she could. 00:45:15.760 |
And now people come up to us at conferences and they're like, "You're Rachel Thomas, I love your blog!" 00:45:21.960 |
So I've kind of seen that transition from wow, could I blog, to being known as a strong technical author. 00:45:32.120 |
So check out this article if you still need convincing or if you're wondering how to get started. 00:45:39.760 |
And since the first one is the hardest, maybe your first one should be something really simple. 00:45:47.080 |
So it could be like, here's a summary of the first 15 minutes of lesson 3 of our machine 00:45:55.440 |
learning course, here's why it's interesting, here's what we learned. 00:45:59.120 |
Or it could be like, here's a summary of how I used a random forest to solve a particular problem. 00:46:09.120 |
I often get questions like, oh, in my practicum, my organization, we've got sensitive commercial data, so I can't share it. 00:46:15.520 |
That's fine, just find another dataset and do it on that instead to show the example, 00:46:23.400 |
or anonymize all of the values and change the names of the variables or whatever. 00:46:30.200 |
You can talk to your employer or your practicum partner to make sure that they're comfortable with whatever you share. 00:46:39.160 |
In general though, people love it when their interns and staff blog about what they're 00:46:46.240 |
working on because it makes them look super cool. 00:46:48.720 |
It's like, hey, I'm an intern working at this company and I wrote this post about this cool 00:46:55.240 |
analysis I did, and then other people would be like, wow, that looks like a great company to work for. 00:46:59.600 |
So generally speaking, you should find people are pretty supportive. 00:47:04.400 |
Besides which, there's lots and lots of datasets out there available, so even if you can't 00:47:10.880 |
base it on the work you're doing, you can find something similar for sure. 00:47:16.700 |
So we're going to start building our neural network, and we're going to build it using something called PyTorch. 00:47:22.000 |
PyTorch is a library that basically looks a lot like NumPy, but when you create some 00:47:33.280 |
code with PyTorch, you can run it on the GPU rather than the CPU. 00:47:39.920 |
So the GPU is something which is basically going to be probably at least an order of 00:47:49.200 |
magnitude, possibly hundreds of times faster than the code that you might write for the 00:47:53.480 |
CPU for particularly stuff involving lots of linear algebra. 00:47:58.720 |
So with deep learning, neural nets, if you don't have a GPU, you can do it on the CPU. 00:48:13.280 |
Your Mac does not have a GPU that we can use for this. And it's not that I'm trying to advertise Nvidia here; 00:48:25.400 |
I would actually much prefer that we could use your Macs because competition is great. 00:48:29.760 |
But Nvidia were really the first ones to create a GPU which did a good job of supporting general purpose computation. 00:48:39.040 |
So in other words, that means using a GPU for things other than playing computer games. 00:48:44.640 |
They created a framework called CUDA; it's a very good framework, and it's pretty much universally used in deep learning. 00:48:54.360 |
If you don't have an Nvidia GPU, you can't use it, no current Macs have an Nvidia GPU. 00:49:01.840 |
Most laptops of any kind don't have an Nvidia GPU. 00:49:04.720 |
If you're interested in doing deep learning on your laptop, the good news is that you 00:49:09.600 |
need to buy one which is really good for playing computer games on. 00:49:14.720 |
There's a place called Exotic PC, gaming laptops, where you can go and buy yourself a great deep learning laptop. 00:49:24.240 |
You can tell your parents that you need the money to do deep learning. 00:49:34.560 |
So you'll generally find a whole bunch of laptops with names like Predator and Viper 00:49:41.840 |
with pictures of robots and stuff, StealthPro, Radar, Leopard. 00:49:54.080 |
Having said that, I don't know that many people that do much deep learning on their laptop. 00:49:57.960 |
Most people will log into a cloud environment. 00:50:01.080 |
By far the easiest I know of to use is called Crestle. 00:50:05.480 |
With Crestle, you can basically sign up and straight away the first thing you get is thrown 00:50:12.640 |
straight into a Jupyter notebook, backed by a GPU, costs 60 cents an hour with all of the 00:50:19.040 |
fast AI libraries and data already available. 00:50:27.760 |
It's less flexible and in some ways less fast than using AWS, Amazon Web Services. 00:50:38.200 |
It costs a little bit more, 90 cents an hour rather than 60 cents an hour, but it's very 00:50:45.160 |
likely that your employer is already using that. 00:50:52.520 |
They've got more different choices around GPUs and it's a good choice. 00:50:56.500 |
If you Google for GitHub Student Pack, if you're a student, you can get $150 of credits 00:51:05.840 |
straight away pretty much, and that's a really good way to get started. 00:51:14.200 |
I just wanted to know your opinion on, I know that Intel recently published an open source 00:51:21.440 |
way of boosting regular packages that they claim is equivalent to, like, a bottom- 00:51:27.480 |
tier GPU: if you use their boost packages on your CPU, you can get the same performance. 00:51:38.160 |
And actually, Intel makes some great numerical programming libraries, particularly this one 00:51:42.780 |
called MKL, the Math Kernel Library. They definitely make things faster than not using 00:51:51.560 |
those libraries, but if you look at a graph of performance over time, GPUs have consistently, 00:51:59.120 |
throughout the last 10 years and including now, done about 10 times more floating point operations 00:52:06.040 |
per second than the equivalent CPU, and they're generally about a fifth of the price for that amount of compute. 00:52:19.840 |
And then because of that, everybody doing anything with deep learning basically does 00:52:24.140 |
it on Nvidia GPUs, and therefore using anything other than Nvidia GPUs is currently very annoying. 00:52:32.020 |
So slower, more expensive, more annoying. I really hope there will be more activity around 00:52:36.620 |
AMD GPUs in particular in this area, but AMD has got literally years of catching up to do. 00:52:46.800 |
So I just wanted to point out that you can also buy things such as a GPU extender to 00:52:51.100 |
a laptop; that's also kind of like maybe a first-step solution before you really want to invest in a full setup. 00:52:56.720 |
Yeah, I think for like 300 bucks or so, you can buy something that plugs into your Thunderbolt 00:53:01.920 |
port if you have a Mac, and then for another 500 or 600 bucks you can buy a GPU to plug 00:53:06.080 |
into that. Having said that, for about a thousand bucks you can actually create a pretty good 00:53:11.600 |
GPU-based desktop, and so if you're considering that, the fast.ai forums have lots of threads 00:53:19.560 |
where people help each other spec out something at a particular price point. 00:53:23.920 |
So to start with, let's say use Crestle, and then when you're ready to invest a few extra 00:53:32.560 |
minutes getting going, use AWS. To use AWS, you basically sign up on their website. 00:53:55.500 |
So AWS, when you get there, go to EC2. EC2, there's lots of stuff on AWS. EC2 is the bit 00:54:04.520 |
where we get to rent computers by the hour. Now we're going to need a GPU-based instance. 00:54:14.200 |
Unfortunately when you first sign up for AWS, they don't give you access to them, so you 00:54:19.240 |
have to request that access. So go to limits, up in the top left, and the main GPU instance 00:54:27.240 |
we'll be using is called the P2. So scroll down to P2, and here p2.xlarge, you need to 00:54:34.240 |
make sure that that number is not zero. If you've just got a new account, it probably 00:54:38.040 |
is zero, which means you won't be allowed to create one, so you have to go request limit 00:54:41.800 |
increase. And the trick there is when it asks you why you want the limit increase, type 00:54:47.920 |
fast.ai because AWS knows to look out, and they know that fast.ai people are good people, 00:54:54.040 |
so they'll do it quite quickly. That takes a day or two, generally speaking, to go through. 00:55:00.560 |
So once you get the email saying you've been approved for P2 instances, you can then go 00:55:05.760 |
back here and say Launch Instance, and so we've basically set up one that has everything you 00:55:13.560 |
need. So if you click on Community AMI, and AMI is an Amazon machine image, it's basically 00:55:19.640 |
a completely set up computer. So if you type fastai, as one word, you'll find here fast.ai 00:55:28.920 |
DL Part 1 version 2 for the P2. So that's all set up, ready to go. So if you click on Select, 00:55:38.360 |
and it'll say, "Okay, what kind of computer do you want?" And so we have to say, "I want 00:55:43.400 |
a GPU compute type, and specifically I want a P2 extra large." And then you can say Review 00:55:53.320 |
and Launch. I'm assuming you already know how to deal with SSH keys and all that kind of 00:55:58.240 |
stuff. If you don't, check out the introductory tutorials and workshop videos that we have 00:56:04.500 |
online, or Google around for SSH keys. Very important skill to know anyway. 00:56:12.880 |
So hopefully you get through all that. You have something running on a GPU with the fast.ai 00:56:20.040 |
repo. If you use Crestle, just cd fastai2, the repo is already there, git pull. AWS, 00:56:29.280 |
cd fastai, the repo is already there, git pull. If it's your own computer, you'll just have to git clone it yourself. 00:56:39.480 |
So part of all of those is PyTorch is pre-installed. So PyTorch basically means we can write code 00:56:46.040 |
that looks a lot like NumPy, but it's going to run really quickly on the GPU. Secondly, 00:56:53.760 |
since we need to know which direction and how much to move our parameters to improve 00:56:59.400 |
our loss, we need to know the derivative of functions. PyTorch has this amazing thing 00:57:05.440 |
where any code you write using the PyTorch library, it can automatically take the derivative 00:57:11.120 |
of that for you. So we're not going to look at any calculus in this course, and I don't 00:57:16.240 |
look at any calculus in any of my courses or in any of my work basically ever in terms 00:57:21.180 |
of actually calculating derivatives myself, because I've never had to. It's done for me 00:57:27.800 |
by the library. So as long as you write the Python code, the derivative is done. So the 00:57:32.600 |
only calculus you really need to know to be an effective practitioner is what does it 00:57:37.480 |
mean to be a derivative? And you also need to know the chain rule, which we'll come to. 00:57:47.160 |
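For example, a tiny hedged illustration of PyTorch taking a derivative for you (this uses the current requires_grad API; at the time of the lesson the same idea went through Variable):

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()       # y = x0^2 + x1^2

y.backward()             # PyTorch works out dy/dx for us
print(x.grad)            # tensor([4., 6.]), i.e. 2*x
```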
So we're going to start out kind of top-down, create a neural net, and we're going to assume 00:57:51.860 |
a whole bunch of stuff. And gradually we're going to dig into each piece. So to create 00:57:57.800 |
neural nets, we need to import the PyTorch neural net library. PyTorch, funnily enough, 00:58:04.200 |
is not called PyTorch, it's called Torch. Torch.nn is the PyTorch subsection that's responsible 00:58:12.860 |
for neural nets. So we'll call that nn. And then we're going to import a few bits out 00:58:17.240 |
of fast.ai just to make life a bit easier for us. 00:58:21.680 |
So here's how you create a neural network in PyTorch. The simplest possible neural network. 00:58:28.920 |
You say sequential, and sequential means I am now going to give you a list of the layers 00:58:34.000 |
that I want in my neural network. So in this case, my list has two things in it. The first 00:58:41.760 |
thing says I want a linear layer. So a linear layer is something that's basically going 00:58:46.760 |
to do y = ax + b, but as a matrix-matrix multiply, not univariate, obviously. So it's going to 00:58:57.280 |
do a matrix product, basically. So the input to the matrix product is going to be a vector 00:59:03.480 |
of length 28 times 28, because that's how many pixels we have. And the output needs 00:59:11.200 |
to be of size 10. We'll talk about y in a moment. But for now, this is how we define 00:59:16.360 |
a linear layer. And then again, we're going to dig into this in detail, but every linear 00:59:21.040 |
layer just about in neural nets has to have a nonlinearity after it. And we're going to 00:59:26.200 |
learn about this particular nonlinearity in a moment. It's called the softmax. And if you've 00:59:30.360 |
done the DL course, you've already seen this. So that's how we define a neural net. This 00:59:35.280 |
is a two-layer neural net. There's also kind of an implicit additional first layer, which 00:59:40.920 |
is the input. But with PyTorch, you don't have to explicitly mention the input. But normally 00:59:46.680 |
we think conceptually like the input image is kind of also a layer. Because we're kind 00:59:54.360 |
of doing things pretty manually with PyTorch, we're not taking advantage of any of the conveniences 01:00:00.040 |
in fast.ai for building this stuff. We have to then write .cuda, which tells PyTorch 01:00:05.600 |
to copy this neural network across to the GPU. So from now on, that network is going 01:00:12.240 |
to be actually running on the GPU. If we didn't say that, it would run on the CPU. So that 01:00:19.400 |
gives us back a neural net, a very simple neural net. So we're then going to try and 01:00:25.480 |
fit the neural net to some data. So we need some data. So fast.ai has this concept of 01:00:32.000 |
a model data object, which is basically something that wraps up training data, validation data, 01:00:38.680 |
and optionally test data. And so to create a model data object, you can just say I want 01:00:44.520 |
to create some image classifier data, I'm going to grab it from some arrays, and you 01:00:49.720 |
just say this is the path that I'm going to save any temporary files, this is my training 01:00:54.920 |
data arrays, and this is my validation data arrays. And so that just returns an object 01:01:02.520 |
that's going to wrap that all up, and so we're going to be able to fit to that data. 01:01:07.480 |
So now that we have a neural net, and we have some data, we're going to come back to this 01:01:12.080 |
in a moment, but we basically say what loss function do we want to use, what optimizer 01:01:16.520 |
do we want to use, and then we say fit. We say fit this network to this data going over 01:01:26.480 |
every image once using this loss function, this optimizer, and print out these metrics. 01:01:33.760 |
And this says here, this is 91.8% accurate. So that's the simplest possible neural net. 01:01:43.840 |
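For reference, a hedged sketch of the same idea in plain PyTorch, without the fast.ai ImageClassifierData/fit wrappers the lesson uses (the batch size, optimizer, and variable names are assumptions):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(28 * 28, 10),    # one linear layer: 784 pixels in, 10 scores out
    nn.LogSoftmax(dim=-1)      # the nonlinearity on the end
).cuda()                       # copy the weights across to the GPU

loss_fn = nn.NLLLoss()                              # negative log likelihood
opt = torch.optim.Adam(net.parameters())

x_t = torch.tensor(x, dtype=torch.float32).cuda()   # normalized training images, shape (n, 784)
y_t = torch.tensor(y, dtype=torch.int64).cuda()     # training labels, shape (n,)

for i in range(0, len(x_t), 64):                    # one pass over the data in mini-batches of 64
    xb, yb = x_t[i:i + 64], y_t[i:i + 64]
    opt.zero_grad()
    batch_loss = loss_fn(net(xb), yb)
    batch_loss.backward()                           # derivatives via autograd
    opt.step()                                      # nudge the weight matrix
```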
So what that's doing is it's creating a matrix multiplication followed by a nonlinearity, 01:01:54.280 |
and then it's trying to find the values for this matrix which basically fit the data as 01:02:03.040 |
well as possible, that end up predicting this is a 1, this is a 9, this is a 3. And so we 01:02:09.320 |
need some definition for as well as possible. And so the general term for that thing is 01:02:14.600 |
called the loss function. So the loss function is the function that's going to be lower if 01:02:20.560 |
this is better. Just like with random forests, we had this concept of information gain, and 01:02:26.160 |
we got to pick what function do you want to use to define information gain, and we were 01:02:31.000 |
mainly looking at root mean squared error. Most machine learning algorithms have something 01:02:37.560 |
very similar that we call the loss. So the loss is how we score how good we are. And so in the end 01:02:43.680 |
we're going to calculate the derivative of the loss with respect to the weight matrix 01:02:50.080 |
that we're multiplying by to figure out how to update it. So we're going to use something 01:02:56.080 |
called negative log likelihood loss. So negative log likelihood loss is also known as cross 01:03:04.040 |
entropy. They're literally the same thing. There's two versions, one called binary cross 01:03:10.680 |
entropy, or binary negative log likelihood, and another called categorical cross entropy. 01:03:16.520 |
The same thing, one is for when you've only got a 0 or 1 dependent, the other is if you've 01:03:21.960 |
got like cat, dog, airplane or horse, or 0, 1, through 9, and so forth. So what we've 01:03:28.960 |
got here is the binary version of cross entropy. And so here is the definition. I think maybe 01:03:38.200 |
the easiest way to understand this definition is to look at an example. So let's say we're 01:03:43.440 |
trying to predict cat vs dog. 1 is cat, 0 is dog. So here we've got cat, dog, dog, cat. 01:03:53.960 |
And here are our predictions. We said 90% sure it's a cat, 90% sure it's a dog, 80% sure 01:04:03.280 |
it's a dog, 80% sure it's a cat. So we can then calculate the binary cross entropy by 01:04:11.600 |
calling our function. So it's going to say, okay, for the first one we've got y = 1, so 01:04:17.160 |
it's going to be 1 times log of 0.9, plus 1 - y, 1 - 1, is 0, so that's going to be skipped. 01:04:31.360 |
And then the second one is going to be a 0, so it's going to be 0 times something, so 01:04:35.800 |
that's going to be skipped. And the second part will be 1 - 0. So this is 1 times log 01:04:43.240 |
of 1 - p, 1 - 0.1 is 0.9. So in other words, the first piece and the second piece of this 01:04:52.640 |
are going to give exactly the same number. Which makes sense because the first one we 01:04:57.400 |
said we were 90% confident it was a cat, and it was. And the second we said we were 90% 01:05:03.600 |
confident it was a dog, and it was. So in each case the loss is coming from the fact 01:05:09.600 |
that we could have been more confident. So if we said we were 100% confident the loss 01:05:15.120 |
would have been 0. So let's look at that in Excel. 01:05:21.240 |
So here are our predictions, 0.9, 0.1, 0.2, 0.8, and here are our actuals, 1, 0, 0, 1. So here's 1 - the 01:05:34.080 |
prediction, here is log of our prediction, here is log of 1 - our prediction, and so 01:05:45.160 |
then here is our sum. So if you think about it, and I want you to think about this during 01:05:54.640 |
the week, you could replace this with an if statement rather than y. Because y is always 01:06:03.880 |
1 or 0, then it's only ever going to use either this or this. So you could replace this with 01:06:10.040 |
an if statement. So I'd like you during the week to try to rewrite this with an if statement. 01:06:18.360 |
And then see if you can then scale it out to be a categorical cross-entropy. So categorical 01:06:25.360 |
cross-entropy works this way. Let's say we were trying to predict 3 and then 6 and then 01:06:30.920 |
7 and then 2. So if we were trying to predict 3, and the actual thing that was predicted 01:06:38.160 |
was like 4.7, we're trying to predict 3 and we actually predicted 5. Or we're trying to 01:06:47.160 |
predict 3 and we accidentally predicted 9. Being 5 instead of 3 is no better than being 01:06:54.120 |
9 instead of 3. So we're not actually going to say how far away is the actual number, 01:06:59.760 |
we're going to express it differently. Or to put it another way, what if we're trying 01:07:03.840 |
to predict cats, dogs, horses and airplanes? How far away is cat from horse? So we're going 01:07:11.040 |
to express these a little bit differently. Rather than thinking of it as a 3, let's think 01:07:15.240 |
of it as a vector with a 1 in the third location. And rather than thinking of it as a 6, let's 01:07:24.240 |
think of it as a vector of zeros with a 1 in the sixth location. So in other words, one 01:07:29.480 |
hot encoding. So let's one hot encode a dependent variable. 01:07:34.800 |
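(One quick way to do that one hot encoding in NumPy, just as an illustration:)

    import numpy as np

    y = np.array([3, 6, 7, 2])   # the original labels
    y_onehot = np.eye(10)[y]     # each row is all zeros with a single 1 in position y
    print(y_onehot[0])           # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]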
And so that way now, rather than trying to predict a single number, let's predict 10 01:07:42.000 |
numbers. Let's predict what's the probability that it's a 0, what's the probability it's 01:07:46.920 |
a 1, what's the probability that it's a 2, and so forth. And so let's say we're trying 01:07:52.400 |
to predict a 2, then here is our binary cross entropy, sorry, categorical cross entropy. 01:08:00.760 |
So it's just saying, okay, did this one predict correctly or not, how far off was it, and 01:08:06.880 |
so forth for each one. And so add them all up. So categorical cross entropy is identical 01:08:13.240 |
to binary cross entropy, we just have to add it up across all of the categories. 01:08:20.140 |
So try and turn the binary cross entropy function in Python into a categorical cross entropy 01:08:25.640 |
in Python, and maybe create both the version with the if statement and the version that multiplies by y. 01:08:36.040 |
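(If you want something to check your answer against afterwards, one possible shape for the non-if-statement categorical version is the sketch below; it assumes targ is already one hot encoded and pred is a matrix of probabilities.)

    import numpy as np

    def categorical_loss(targ, pred):
        # targ: one hot encoded targets, pred: predicted probabilities, both (n, n_classes)
        # only the true class contributes -log(p) in each row; sum across classes, average across rows
        return np.mean(-(targ * np.log(pred)).sum(axis=1))

    targ = np.eye(10)[[3, 6, 7, 2]]
    pred = np.full((4, 10), 0.02)
    pred[range(4), [3, 6, 7, 2]] = 0.82   # 82% on the right answer, 2% on everything else
    print(categorical_loss(targ, pred))   # -log(0.82), about 0.198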
So that's why in our PyTorch we had 10 as the output dimensionality for this matrix 01:08:47.200 |
because when we multiply by a matrix with 10 columns, we're going to end up with something 01:08:53.180 |
of length 10, which is what we want. We want to have 10 predictions. 01:09:03.520 |
So that's the loss function that we're using. So then we can fit the model, and what it 01:09:12.560 |
does is it goes through every image this many times, in this case it's just looking at every 01:09:19.320 |
image once, and it's going to slightly update the values in that weight matrix based on those gradients. 01:09:29.040 |
And so once we've trained it, we can then say predict using this model on the validation 01:09:37.920 |
set. And that spits out something of 10,000 by 10. Can somebody tell me why these 01:09:47.120 |
predictions are of this shape? Why are they of shape 10,000 by 10? 01:09:56.120 |
Well it's because we have 10,000 images we're training on. 01:10:02.240 |
10,000 images we're training on, so we're validating on in this case, but same thing. 01:10:06.440 |
So 10,000 we're validating on, so that's the first axis, and the second axis is because we predict 10 probabilities for each image. 01:10:13.560 |
Good, exactly. So each one of these rows is the probabilities that it's a 0, that it's 01:10:18.640 |
a 1, that it's a 2, that it's a 3, and so forth. 01:10:24.160 |
So in math, there's a really common operation we do called argmax. When I say it's common, 01:10:30.720 |
it's funny, at high school I never saw argmax, first year undergrad I never saw argmax, but 01:10:39.560 |
somehow after university everything's about argmax. So it's one of these things that's 01:10:44.160 |
for some reason not really taught at school, but it actually turns out to be super critical. 01:10:48.340 |
And so argmax is both something that you'll see in math, and it's just written out in 01:10:51.980 |
full, argmax. It's in numpy, it's in pytorch, it's super important. 01:10:58.640 |
And what it does is it says, let's take this array of preds, and let's figure out on this 01:11:05.520 |
axis, remember axis 1 is columns, so across as Chris said, the 10 predictions for each 01:11:12.600 |
row, let's find which prediction has the highest value, and return, not that, if it just said 01:11:19.240 |
max it would return the value, argmax returns the index of the value. So by saying argmax 01:11:27.140 |
axis equals 1, it's going to return the index, which is actually the number itself. So let's 01:11:34.160 |
grab the first 5, so for the first one it thinks it's a 3, then it thinks the next one's 01:11:39.320 |
an 8, the next one's a 6, the next one's a 9, the next one's a 6 again. So that's how 01:11:44.400 |
we can convert our probabilities back into predictions. 01:11:51.280 |
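(In code, that conversion, plus the accuracy check described next, looks roughly like this; probs and y_valid are just placeholder names standing in for the 10,000 by 10 prediction array and the validation labels.)

    import numpy as np

    # stand-ins for the real arrays: probs would be the (10000, 10) predictions,
    # y_valid the true validation labels
    probs = np.random.rand(10000, 10)
    y_valid = np.random.randint(0, 10, 10000)

    preds = probs.argmax(axis=1)           # index of the biggest value in each row
    print(preds[:5])                       # in the lesson this was [3 8 6 9 6]
    accuracy = (preds == y_valid).mean()   # bools treated as 1s and 0s
    print(accuracy)                        # in the lesson this was about 0.918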
So if we save that away, call it preds, we can then say, okay, when does preds equal 01:11:58.420 |
the ground truth? So that's going to return an array of bools, which we can treat as 1s 01:12:04.760 |
and 0s, and the mean of a bunch of 1s and 0s is just the average, so that gives us the 01:12:11.360 |
accuracy, so there's our 91.8%. And so you want to be able to replicate the numbers you 01:12:18.320 |
see, and here it is, there's our 91.8%. So when we train this, the last thing it tells 01:12:24.640 |
us is whatever metric we asked for, and we asked for accuracy, okay. So the last thing 01:12:33.200 |
it tells us is our metric, which is accuracy, and then before that we get the training set 01:12:37.760 |
loss, and the loss is again whatever we asked for, negative log likelihood, and the second 01:12:43.460 |
thing is the validation set loss. PyTorch doesn't use the word loss, they use the word 01:12:49.240 |
criterion, so you'll see here, crit. So that's criterion equals loss. This is what loss function 01:12:55.920 |
we want to use, they call that the criterion. Same thing. 01:13:03.480 |
So here's how we can recreate that accuracy. So now we can go ahead and plot 8 of the images 01:13:13.300 |
along with their predictions. And we've got 3, 8, 6, 9, wrong, 5, wrong, okay. And you 01:13:21.920 |
can see why they're wrong. This is pretty close to a 9, it's just missing a little cross 01:13:26.680 |
at the top. This is pretty close to a 5, it's got a little bit of the extra here, right. 01:13:32.080 |
So we've made a start, and all we've done so far is we haven't actually created a deep 01:13:38.800 |
neural net, we've actually got only one layer. So what we've actually done is we've created 01:13:44.280 |
a logistic regression. So a logistic regression is literally what we just built, and you could 01:13:51.080 |
try and replicate this with sklearn's logistic regression package. When I did it, I got similar 01:13:58.440 |
accuracy, but this version ran much faster because this is running on the GPU, whereas 01:14:04.480 |
sklearn runs on the CPU. So even for something like logistic regression, we can implement 01:14:11.520 |
it very quickly with PyTorch. How can you pass that to Ian? 01:14:16.200 |
So when we're creating our net, we have to do .cuda, what would be the consequence of not doing that? 01:14:26.040 |
It wouldn't run as quickly. It would run on the CPU. Can you pass it to Jake? 01:14:35.280 |
So with the neural network, why is it that we have to do a linear layer followed by a nonlinear one? 01:14:46.040 |
So the short answer is because that's what the universal approximation theorem says is 01:14:51.080 |
a structure which can give you arbitrarily accurate functions for any functional form. 01:14:57.680 |
So the long answer is the details of why the universal approximation theorem works. Another 01:15:04.520 |
version of the short answer is that's the definition of a neural network. So the definition 01:15:08.760 |
of a neural network is a linear layer followed by an activation function, followed by a linear 01:15:14.800 |
layer, followed by an activation function, etc. 01:15:19.040 |
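(So a slightly deeper version of the network from the notebook would look something like the sketch below; the 100 hidden units are just a made-up size, and .cuda() assumes you have a GPU as in the lesson.)

    import torch.nn as nn

    net = nn.Sequential(
        nn.Linear(28*28, 100),   # linear layer
        nn.ReLU(),               # nonlinearity (activation function)
        nn.Linear(100, 10),      # another linear layer
        nn.LogSoftmax(dim=1)     # final activation, giving log probabilities for the 10 digits
    ).cuda()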
We go into a lot more detail of this in the deep learning course. But for this purpose, 01:15:25.600 |
it's enough to know that it works. So far, of course, we haven't actually built a deep 01:15:31.600 |
neural net at all. We've just built a logistic regression. And so at this point, if you think 01:15:37.440 |
about it, all we're doing is we're taking every input pixel and multiplying it by a weight 01:15:42.240 |
for each possible outcome. So we're basically saying on average, the number 1 has these 01:15:50.800 |
pixels turned on, the number 2 has these pixels turned on, and that's why it's not terribly 01:15:54.920 |
accurate. That's not how digit recognition works in real life, but that's all we've built so far. 01:16:04.240 |
Michael Nielsen has this great website called Neural Networks and Deep Learning, and his 01:16:28.360 |
chapter 4 is actually kind of famous now. In it, he does this walkthrough of basically 01:16:35.240 |
showing that a neural network can approximate any other function to arbitrarily close accuracy 01:16:50.800 |
And we walk through this in a lot of detail in the deep learning course. But the basic 01:16:56.080 |
trick is that he shows that with a few different numbers, you can basically cause these things 01:17:04.580 |
to create little boxes. You can move the boxes up and down, you can move them around, you 01:17:09.240 |
can join them together to eventually create collections of towers, which you can use to approximate any functional form. 01:17:19.480 |
So that's basically the trick, and so all we need to do given that is to kind of find 01:17:29.680 |
the parameters for each of the linear functions in that neural network, so to find the weights 01:17:36.760 |
in each of the matrices. So far, we've got just one matrix, and so we've just built a logistic regression. 01:17:51.160 |
Just a small note, I just want to confirm that when you showed examples of the images 01:17:55.840 |
which were misclassified, they look rectangular, so it's just that while rendering the pixels 01:18:00.640 |
are being scaled differently. So are they still 28 by 28 squares? 01:18:04.400 |
They are 28 by 28. I think they're square, I think they just look rectangular because 01:18:08.120 |
they've got titles on the top. I'm not sure. Good question. I don't know. Anyway, they 01:18:11.880 |
are square. Matplotlib does often fiddle around with what it considers black versus white 01:18:20.440 |
and having different size axes and stuff, so you do have to be very careful there sometimes. 01:18:32.540 |
Hopefully this will now make more sense because what we're going to do is dig in a layer deeper 01:18:36.920 |
and define logistic regression without using nn.Sequential, without using nn.Linear, without 01:18:43.640 |
using nn.LogSoftmax. So we're going to do nearly all of the layer definition from scratch. 01:18:52.480 |
So to do that, we're going to have to define a PyTorch module. A PyTorch module is basically 01:18:59.000 |
either a neural net or a layer in a neural net, which is actually kind of a powerful 01:19:04.160 |
concept in itself. Basically anything that can kind of behave like a neural net can itself 01:19:09.280 |
be part of another neural net. And so this is like how we can construct particularly 01:19:14.380 |
powerful architectures combining lots of other pieces. 01:19:19.520 |
So to create a PyTorch module, just create a Python class, but it has to inherit from 01:19:25.840 |
nn.module. So we haven't done inheritance before. Other than that, this is all the same 01:19:36.880 |
Basically if you put something in parentheses here, what it means is that our class gets 01:19:41.400 |
all of the functionality of this class for free. It's called subclassing it. So we're 01:19:47.120 |
going to get all of the capabilities of a neural network module that the PyTorch authors 01:19:51.920 |
have provided, and then we're going to add additional functionality to it. 01:19:57.880 |
When you create a subclass, there is one key thing you need to remember to do, which is 01:20:02.720 |
when you initialize your class, you have to first of all initialize the superclass. So 01:20:09.280 |
the superclass is the nn.module. So the nn.module has to be built before you can start adding 01:20:16.560 |
your pieces to it. And so this is just like something you can copy and paste into every 01:20:21.300 |
one of your modules. You just say super().__init__(), which just means construct the superclass first. 01:20:31.920 |
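(As a minimal skeleton of that pattern, nothing more:)

    import torch.nn as nn

    class MyModule(nn.Module):
        def __init__(self):
            super().__init__()   # construct the nn.Module superclass first
            # ...then define your own weights, biases or layers here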
Having done that, we can now go ahead and define our weights and our bias. So our weights is 01:20:39.400 |
the weight matrix. It's the actual matrix that we're going to multiply our data by. 01:20:44.440 |
And as we discussed, it's going to have 28x28 rows and 10 columns. And that's because if 01:20:51.760 |
we take an image which we flattened out into a 28x28 length vector, then we can multiply 01:21:01.160 |
it by this weight matrix to get back out a length 10 vector, which we can then use as our 10 predictions. 01:21:15.640 |
So that's our weight matrix. Now the problem is that we don't just want y=ax, we want y=ax+b. 01:21:26.240 |
So the +b in neural nets is called bias, and so as well as defining weights, we're also 01:21:32.160 |
going to define bias. And so since this thing is going to spit out for every image something 01:21:38.640 |
of length 10, that means that we need to create a vector of length 10 to be our biases. In 01:21:46.920 |
other words, for every digit 0, 1, 2, 3, up to 9, we're going to have a different +b that we add on. 01:21:56.440 |
So we've got our data matrix here, which is of length 10,000 by 28x28. And then we've got 01:22:15.720 |
our weight matrix, which is 28x28 rows by 10. So if we multiply those together, we get something 01:22:31.720 |
of size 10,000 by 10. And then we want to add on our bias, like so. And so when we add 01:22:57.760 |
on, and we're going to learn a lot more about this later, but when we add on a vector like 01:23:03.120 |
this, it basically is going to get added to every row. So the bias is going to get added to every one of those rows. 01:23:13.880 |
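(You can check those shapes for yourself with a quick sketch like this; the data here is just random numbers of the right sizes.)

    import torch

    x = torch.randn(10000, 28*28)   # 10,000 flattened images
    w = torch.randn(28*28, 10)      # the weight matrix
    b = torch.randn(10)             # one bias per output class
    out = x @ w + b                 # b is broadcast, i.e. added to every row
    print(out.shape)                # torch.Size([10000, 10])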
So we first of all define those. And so to define them, we've created a tiny little function 01:23:19.100 |
called get_weights, which is over here, which basically just creates some normally distributed 01:23:25.160 |
random numbers. So torch.randn returns a tensor filled with random numbers from a normal distribution. 01:24:33.760 |
We have to be a bit careful though. When we do deep learning, like when we add more linear 01:23:39.200 |
layers later, imagine if we have a matrix which on average tends to increase the size 01:23:47.800 |
of the inputs we give to it. If we then multiply by lots of matrices of that size, it's going 01:23:54.200 |
to make the numbers bigger and bigger and bigger, like exponentially bigger. Or what 01:23:59.440 |
if it made them a bit smaller? It's going to make them smaller and smaller and smaller 01:24:04.960 |
So because a deep network applies lots of linear layers, if on average they result in 01:24:11.400 |
things a bit bigger than they started with, or a bit smaller than they started with, it's 01:24:16.320 |
going to exponentially multiply that difference. So we need to make sure that the weight matrix 01:24:23.800 |
is of an appropriate size that the mean of the inputs basically is not going to change. 01:24:32.960 |
So it turns out that if you use normally distributed random numbers and divide it by the number 01:24:40.640 |
of rows in the weight matrix, it turns out that particular random initialization keeps 01:24:48.280 |
your numbers at about the right scale. So this idea that if you've done linear algebra, 01:24:55.080 |
basically if the first eigenvalue is bigger than 1 or smaller than 1, it's going to cause 01:25:01.640 |
the gradients to get bigger and bigger, or smaller and smaller, that's called gradient explosion or vanishing gradients. 01:25:08.040 |
So we'll talk more about this in the deep learning course, but if you're interested, 01:25:12.280 |
you can look up Kaiming He initialization, and read all about this concept. But for now, 01:25:21.240 |
it's probably just enough to know that if you use this type of random number generation, 01:25:27.680 |
you're going to get random numbers that are nicely behaved. You're going to start out 01:25:32.000 |
with an input, which is mean 0, standard deviation 1. Once you put it through this set of random 01:25:38.120 |
numbers, you'll still have something that's about mean 0, standard deviation 1. That's the property we want. 01:25:45.240 |
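(A sketch of that kind of helper, following the divide-by-the-number-of-rows idea described here; the real notebook function may differ in the details.)

    import torch
    import torch.nn as nn

    def get_weights(*dims):
        # normally distributed random numbers, scaled down by the number of rows
        return nn.Parameter(torch.randn(dims) / dims[0])

    w = get_weights(28*28, 10)   # the weight matrix
    b = get_weights(10)          # the bias vector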
One nice thing about PyTorch is that you can play with this stuff. So torch.randn, try 01:25:52.320 |
it out. Every time you see a function being used, run it and take a look. And so you'll 01:25:57.520 |
see it looks a lot like NumPy, but it doesn't return a NumPy array, it returns a tensor. 01:26:06.060 |
And in fact, now I'm doing GPU programming. I just multiplied that matrix by 3 very quickly on 01:26:20.920 |
the GPU. So that's how we do GPU programming with PyTorch. 01:26:29.080 |
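(For example, something like this; the .cuda() call assumes you have a GPU available.)

    import torch

    a = torch.randn(3, 4)   # looks a lot like a NumPy array, but it's a torch tensor
    print(a)
    a = a.cuda()            # move it onto the GPU
    print(a * 3)            # this multiplication now runs on the GPU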
So this is our weight matrix. As I said, we create one that's 28x28 by 10, and one that's just rank 1 of length 01:26:36.640 |
10 for the biases. We have to make them a Parameter. This is basically telling PyTorch 01:26:42.920 |
which things to update when it does SGD. That's a very minor technical detail. 01:26:49.600 |
So having created the weight matrices, we then define a special method with the name 01:26:54.560 |
forward. This is a special method. The name forward has a special meaning in PyTorch. 01:27:01.800 |
A method called forward in PyTorch is the name of the method that will get called when 01:27:07.360 |
your layer is calculated. So if you create a neural net or a layer, you have to define 01:27:14.080 |
forward. And it's going to get passed the data from the previous layer. 01:27:20.520 |
So our definition is to do a matrix multiplication of our input data times our weights and add 01:27:28.480 |
on the biases. So that's it. That's what happened earlier on when we said nn.Linear: it created one of these for us. 01:27:41.840 |
Now unfortunately though, we're not getting a 28x28 long vector. We're getting a 28 row 01:27:48.560 |
by 28 column matrix, so we have to flatten it. Unfortunately in PyTorch, they tend to 01:27:55.800 |
rename things. Rather than reshape, they spell it view. So view means reshape. So you can 01:28:05.160 |
see here we end up with something where the number of images we're going to leave the 01:28:09.840 |
same, and then we're going to replace row by column with a single axis, with -1 meaning as many as needed. 01:28:20.160 |
So this is how we flatten something using PyTorch. So we flatten it, do a matrix multiply, 01:28:26.920 |
and then finally we do a softmax. So softmax is the activation function we use. If you 01:28:35.200 |
look in the deep learning repo, you'll find something called entropy example, where you'll 01:28:41.440 |
see an example of softmax. But a softmax simply takes the outputs from our final layer, so 01:28:49.040 |
we get our outputs from our linear layer, and what we do is we take e to the power of each output, 01:28:58.660 |
and then we take that number and we divide it by the sum of all those e to the powers. That's called softmax. 01:29:06.640 |
Why do we do that? Well, because we're dividing each of these by the sum, that means that the sum of 01:29:13.680 |
those outputs must itself add to 1, and that's what we want. We want the probabilities of all the classes to add to 1. 01:29:23.200 |
Furthermore, because we're using e to the power, that means we know that every one of these is between 01:29:28.640 |
0 and 1, and probabilities we know should be between 0 and 1. 01:29:34.680 |
And then finally, because we're using e to the power, it tends to mean that slightly bigger values 01:29:43.680 |
in the input turn into much bigger values in the output. So you'll see, generally speaking, 01:29:47.960 |
in a softmax there's going to be one big number and lots of small numbers. And that's 01:29:53.160 |
what we want, because we know that the output is one hot encoded. 01:29:58.120 |
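(Written out as code, a softmax is roughly the sketch below; dim=1 because we want it across the 10 columns of each row.)

    import torch

    def softmax(x):
        # e to the power of each output, divided by the row-wise sum of those
        return torch.exp(x) / torch.exp(x).sum(dim=1, keepdim=True)

    out = torch.tensor([[1.0, 2.0, 5.0]])
    print(softmax(out))   # one big number, the rest small; each row sums to 1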
So in other words, a softmax activation function, the softmax nonlinearity, is something that 01:30:05.080 |
returns things that behave like probabilities, and where one of those probabilities is more 01:30:10.800 |
likely to be kind of high and the other ones are more likely to be low. And we know that's 01:30:15.520 |
what we want to map to our one hot encoding, so a softmax is a great activation function 01:30:22.400 |
to use to help the neural net: it makes it easier for the neural net to map to the output that we want. 01:30:31.040 |
And this is what we generally want. When we're designing neural networks, we try to come 01:30:35.140 |
up with little architectural tweaks that make it as easy as possible for it to match the output we want. 01:30:45.560 |
So that's basically it, right? Rather than doing Sequential and using nn.Linear and nn.LogSoftmax, 01:30:52.040 |
we defined it from scratch. We can now say, just like before, our net is equal to 01:30:57.600 |
that class .cuda, and we can say .fit, and we get, to within a slight random deviation, exactly the same answer. 01:31:07.200 |
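(Putting those pieces together, the from-scratch version looks roughly like the sketch below; it follows the idea described above rather than being line for line what the notebook has, and .cuda() again assumes a GPU.)

    import torch
    import torch.nn as nn

    def get_weights(*dims):
        return nn.Parameter(torch.randn(dims) / dims[0])

    class LogReg(nn.Module):
        def __init__(self):
            super().__init__()                   # construct the superclass first
            self.l1_w = get_weights(28*28, 10)   # weight matrix
            self.l1_b = get_weights(10)          # biases

        def forward(self, x):
            x = x.view(x.size(0), -1)            # flatten each 28x28 image into a length 784 vector
            x = x @ self.l1_w + self.l1_b        # matrix multiply plus bias
            # log softmax, playing the same role as nn.LogSoftmax did before
            return torch.log(torch.exp(x) / torch.exp(x).sum(dim=1, keepdim=True))

    net2 = LogReg().cuda()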
So what I'd like you to do during the week is to play around with torch.randn to generate 01:31:12.380 |
some random tensors, torch.matmul to start multiplying them together, adding them up, 01:31:18.480 |
try to make sure that you can rewrite softmax yourself from scratch, try to fiddle around 01:31:24.720 |
a bit with reshaping with view, all that kind of stuff. So by the time you come back next week, you'll feel pretty comfortable with these basic PyTorch operations. 01:31:33.120 |
And if you Google for PyTorch tutorial, you'll see there's a lot of great material actually 01:31:38.960 |
on the PyTorch website to help you along, basically showing you how to create tensors and manipulate them. 01:31:50.040 |
All right, great. Yes, you had a question. Can you pass it over? 01:31:58.000 |
So I see that the forward is the layer that gets applied after each of the linear layers. 01:32:02.960 |
Well, not quite. The forward is just the definition of the module. So this is like how we're implementing 01:32:12.120 |
Does that mean after each linear layer, we have to apply the same function? Or could 01:32:17.000 |
we do a log softmax after layer one and then apply some other function after layer 01:32:23.480 |
two, if we have like a multi-layer neural network? 01:32:28.120 |
So normally we define neural networks like so. We just say here is a list of the layers 01:32:41.240 |
we want. You don't have to write your own forward. All we did just now was to say instead of 01:32:50.160 |
doing this, let's not use any of this at all, but write it all by hand ourselves. 01:32:56.680 |
So you can write as many layers as you like in any order you like here. The point was 01:33:04.060 |
that here we're not using any of that. We've written our own matmul plus bias, our own 01:33:13.520 |
softmax. This is just Python code. You can write whatever Python code you like inside forward. 01:33:26.200 |
You won't normally do this yourself. Normally you'll just use the layers that PyTorch provides 01:33:30.700 |
and you'll use nn.Sequential to put them together, or even more likely you'll download a predefined 01:33:35.720 |
architecture and use that. We're just doing this to learn how it works behind the scenes.