Lesson 4 - Deep Learning for Coders (2020)
Chapters
0:00 Review of Lesson 3 + SGD
3:09 MNIST Loss Function
5:56 What is a Dataset in PyTorch?
7:47 Initializing our parameters
10:51 Predicting images with matrix multiplication
14:27 Why you shouldn't use accuracy loss function to update parameters
18:40 Creating a good loss function
27:12 Updating parameters with mini-batches and DataLoader
33:18 Putting it all together
43:30 Refactoring and Creating an optimizer
48:50 The DataLoaders class
49:32 The Learner class
51:09 Adding a non-linearity to create a neural network
61:53 Looking at what the NN is learning by looking at the parameters
65:29 Comparing the results with the fastai toolkit
66:19 Jargon review
68:30 Is there a rule of thumb for which non-linearity to choose?
69:43 Pet breeds image classification
77:10 Presizing
82:51 Checking and debugging a DataBlock
84:40 Presizing (question)
85:45 Training model to clean your data
86:46 How fastai chooses a loss function
111:49 Data Ethics and Efficacy of Masks for COVID-19
00:00:00.000 |
Welcome back. And here is lesson 4, which is where we get deep into the weeds of exactly 00:00:12.320 |
what is going on when we are training a neural network. And we started looking at this in 00:00:18.280 |
the previous lesson, we were looking at stochastic gradient descent. And so to remind you, we 00:00:25.200 |
were looking at what Arthur Samuel said. Suppose we arrange for some automatic means of testing 00:00:31.600 |
the effectiveness of any current weight assignment, or we would call it parameter assignment in 00:00:37.760 |
terms of actual performance and provide a mechanism for altering the weight assignment 00:00:42.880 |
so as to maximize that performance. So we could make that entirely automatic and a machine 00:00:48.600 |
so programmed would learn from its experience. And that was our goal. So our initial attempt 00:00:56.360 |
on the MNIST data set was not really based on that. We didn't really have any parameters. 00:01:03.940 |
So then last week we tried to figure out how we could parameterize it, how we could create 00:01:10.000 |
a function that had parameters. And what we thought we could do would be to have something 00:01:15.720 |
where say the probability of being some particular number was expressed in terms of the pixels 00:01:21.640 |
of that number and some weights, and then we would just multiply them together and add 00:01:28.520 |
them up. So we looked at how stochastic gradient descent worked last week. And the basic idea 00:01:40.740 |
is that we start out by initializing the parameters randomly. We use them to make a prediction 00:01:49.240 |
using a function such as this one. We then see how good that prediction is by measuring 00:01:58.040 |
using a loss function. We then calculate the gradient which is how much would the loss 00:02:03.320 |
change if I changed one parameter by a little bit. We then use that to make a small step 00:02:11.320 |
to change each of the parameters by a little bit by multiplying the learning rate by the 00:02:16.560 |
gradient to get a new set of predictions. And so we went round and round and round a 00:02:20.520 |
few times until eventually we decided to stop. And so these are the basic seven steps that 00:02:30.440 |
we went through. And so we did that for simple quadratic equation. And we had something which 00:02:40.080 |
looked like this. And so by the end, we had this nice sample of a curve getting closer 00:02:49.080 |
and closer and closer. So I have a little summary at the start of this section, summarizing 00:02:59.920 |
gradient descent that Silva and I have in the notebooks in the book of what we just 00:03:04.960 |
did. So you can review that and make sure it makes sense to you. So now let's use this 00:03:12.140 |
to create our MNIST threes versus sevens model. And so to create a model, we're going to need 00:03:20.720 |
to create something that we can pass into a function like, let's see where it was, passing 00:03:29.520 |
to a function like this one. So we need just some pixels that are all lined up and some 00:03:34.960 |
parameters that are all lined up. And then we're going to sum them up. So our X's are 00:03:44.000 |
going to be pixels. And so in this case, because we're just going to multiply each pixel by 00:03:48.800 |
a parameter and add them up, the fact that they're laid out in a grid is not important. 00:03:54.500 |
So let's reshape those grids and turn them into vectors. The way we reshape things in 00:04:02.960 |
PyTorch is by using the View method. And so the View method, you can pass to it how large 00:04:10.500 |
you want each dimension to be. And so in this case, we want the number of columns to be 00:04:18.520 |
equal to the total number of pixels in each picture, which is 28 times 28, because they're 00:04:24.760 |
28 by 28 images. And then the number of rows will be however many rows there are in the 00:04:29.840 |
data. And so if you just use minus one, when you call View, that means, you know, as many 00:04:36.480 |
as there are in the data. So this will create something with the same total 00:04:40.560 |
number of elements that we had before. So we can grab all our threes. We can concatenate 00:04:47.000 |
them, torch.cat, with all of our 7s, and then reshape that into a matrix where each row 00:04:55.840 |
is one image with all of the rows and columns of the image all lined up in a single vector. 00:05:02.480 |
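A rough sketch of what the notebook code looks like at this point (stacked_threes and stacked_sevens are the stacked image tensors built earlier in the notebook, names assumed; the star import is the usual fastai one):

```python
from fastai.vision.all import *   # brings in torch, tensor, DataLoader, etc.

# concatenate the 3s and the 7s, then flatten each 28x28 image into one 784-long row
train_x = torch.cat([stacked_threes, stacked_sevens]).view(-1, 28*28)
train_x.shape   # e.g. torch.Size([12396, 784])
```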
So then we're going to need labels. So that's our x. So we're going to need labels. Our 00:05:07.480 |
labels will be a 1 for each of the threes and a 0 for each of the 7s. So basically we're 00:05:15.280 |
going to create an is3 model. So that's going to create a vector. We actually need it to 00:05:24.960 |
be a matrix in PyTorch. So unsqueeze will add an additional unit dimension to wherever 00:05:36.880 |
I've asked for. So here in position 1. So in other words, this is going to turn it from 00:05:41.140 |
something which is a vector of 12,396 long into a matrix with 12,396 rows and one column. 00:05:52.960 |
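Roughly, the labels and the tuple-returning dataset that's about to be described (threes and sevens are the lists of image files from the notebook, names assumed):

```python
train_y = tensor([1]*len(threes) + [0]*len(sevens)).unsqueeze(1)  # a column of 1s and 0s
train_x.shape, train_y.shape    # (torch.Size([12396, 784]), torch.Size([12396, 1]))

dset = list(zip(train_x, train_y))   # a dataset: indexing returns an (image, label) tuple
x, y = dset[0]                       # destructuring one tuple
```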
That's just what PyTorch expects to see. So now we're going to turn our x and y into a 00:06:00.280 |
data set. And a data set is a very specific concept in PyTorch. It's something which we 00:06:07.480 |
can index into using square brackets. And when we do so, it's expected to return a tuple. 00:06:17.980 |
So here if we look at, we're going to create this data set. And when we index into it, 00:06:28.160 |
it's going to return a tuple containing our independent variable and our dependent variable 00:06:34.160 |
for each particular row. And so to do that, we can use the Python zip function, which 00:06:41.860 |
takes one element of the first thing and combines it with, concatenates it with one element 00:06:48.680 |
of the second thing. And then it does that again and again and again. And so then if 00:06:52.880 |
we create a list of those, it gives us a data set. It gives us a list, which when we index 00:07:00.280 |
into it, it's going to contain one image and one label. And so here you can see why there's 00:07:07.440 |
my label and my image. I won't print out the whole thing, but it's a 784 long vector. So 00:07:15.240 |
that's a really important concept. A data set is something that you can index into and 00:07:20.100 |
get back a tuple. And here I am, this is called destructuring the tuple, which means I'm taking 00:07:26.960 |
the two parts of the tuple and putting the first part in one variable and the second 00:07:30.980 |
part in the other variable, which is something we do a lot in Python. It's pretty handy. A 00:07:35.160 |
lot of other languages support that as well. Repeat the same three steps for a validation 00:07:40.600 |
set. So we've now got a training data set and a validation data set. Right. So now we 00:07:49.220 |
need to initialize our parameters. And so to do that, as we've discussed, we just do 00:07:56.080 |
it randomly. So here's a function that, given some size, some shape if you like, will randomly 00:08:05.200 |
initialize using a normal random number distribution in PyTorch. That's what randn does. And we 00:08:13.160 |
can hit Shift + Tab to see how that works. Okay. And it says here that it's going to 00:08:25.120 |
have a variance of 1. So I probably shouldn't have called this standard deviation. I probably 00:08:29.640 |
should call this variance actually. So multiply it by the variance to change its variance 00:08:35.400 |
to whatever is requested, which will default to 1. And then as we talked about when it 00:08:41.400 |
comes to calculating our gradients, we have to tell PyTorch which things we want gradients 00:08:48.000 |
for. And the way we do that is requires grad underscore. Remember this underscore at the 00:08:53.240 |
end is a special magic symbol, which tells PyTorch that we want this function to actually 00:08:59.320 |
change the thing that it's referring to. So this will change this tensor such that it 00:09:08.400 |
requires gradients. So here's some weights. So our weights are going to need to be 28 00:09:15.400 |
by 28 by 1 shape, 28 by 28 because every pixel is going to need a weight. And then 1 because 00:09:24.400 |
we're going to need again, we're going to need to have that unit axis to make it into 00:09:29.680 |
a column. So that's what PyTorch expects. So there's our weights. Now just weights by 00:09:40.380 |
pixels actually isn't going to be enough because weights by pixels will always equal 0 when 00:09:45.960 |
the pixels are equal to 0. It has a 0 intercept. So we really want something which like wx 00:09:50.880 |
plus b, a line. So the b is we call the bias. And so that's just going to be a single number. 00:09:58.640 |
So let's grab a single number for our bias. So remember I told you there's a difference 00:10:04.880 |
between the parameters and the weights, strictly speaking. So here the weights are the w in 00:10:11.560 |
this equation, the bias is b in this equation, and the weights and bias together are the parameters 00:10:20.440 |
of the function. They're all the things that we're going to change. They're all the things 00:10:23.320 |
that have gradients that we're going to update. So there's an important bit of jargon for 00:10:28.360 |
you. The weights and biases of the model are the parameters. So we can, yes question. What's 00:10:39.600 |
the difference between gradient descent and stochastic gradient descent? So far we've 00:10:46.360 |
only done gradient descent. We'll be doing stochastic gradient descent in a few minutes. 00:10:51.160 |
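The parameter initialization just described, roughly as it appears in the notebook:

```python
def init_params(size, std=1.0):
    # normally distributed random numbers of the requested shape, scaled, tracked for gradients
    return (torch.randn(size)*std).requires_grad_()

weights = init_params((28*28, 1))   # one weight per pixel, laid out as a single column
bias = init_params(1)               # a single bias number
```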
So we can now create a calculated prediction for one image. So we can take an image such 00:10:56.500 |
as the first one and multiply by the weights. We need to transpose them to make them line 00:11:01.980 |
up in terms of the rows and columns and add it up and add the bias and there is a prediction. 00:11:11.480 |
We want to do that for every image. We could do that with a for loop and that would be 00:11:17.040 |
really, really slow. It wouldn't run on the GPU and it wouldn't run in optimized C code. 00:11:24.120 |
So whenever we want to do something like looping over pixels or looping over images, 00:11:30.640 |
you always need to try to make sure you're doing that without a Python for loop. In this 00:11:35.480 |
case doing this calculation for lots of rows and columns is a mathematical operation called 00:11:43.200 |
matrix multiplication. So if you've forgotten your matrix multiplication or maybe never 00:11:49.960 |
quite got around to it at high school. It would be a good idea to have a look at Khan 00:11:55.160 |
Academy or something to learn about what it is, but it's actually, I'll give you the quick 00:12:00.880 |
answer. This is from Wikipedia. If these are two matrices A and B, then this element here 00:12:08.960 |
1, 2 in the output is going to be equal to the first bit here times the first bit here 00:12:16.840 |
plus the second bit here times the second bit here. So it's going to be B12 times A11 00:12:23.320 |
plus B22 times A12. That's, you can see the orange matches the orange. Ditto for over 00:12:31.520 |
here. This would be equal to B13 times A31 plus B23 times A32 and so forth for every 00:12:38.760 |
part. Here's a great picture of that in action. If you look at matrix multiplication.xyz, another 00:12:54.480 |
way to think of it is we can kind of flip the second bit over on top and then multiply 00:13:00.640 |
each bit together and add them up, multiply each bit together and add them up. And you 00:13:05.560 |
can see always the second one here and ends up in the second spot and the first one ends 00:13:09.320 |
up in the first spot. And that's what matrix multiplication is. So we can do our multiply 00:13:22.920 |
and add up by using matrix multiplication. And in Python and therefore PyTorch matrix 00:13:30.240 |
multiplication is the @ sign operator. So when you see @ that means matrix multiply. 00:13:38.100 |
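A sketch of the prediction code being described: first for a single image, then for every image at once with a matrix multiply (this builds on the weights and bias initialized above):

```python
(train_x[0]*weights.T).sum() + bias          # prediction for the first image only

def linear1(xb): return xb@weights + bias    # matrix multiply: every image at once
preds = linear1(train_x)
```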
So here is our 20.2336. If I do a matrix multiply of our training set by our weights and then 00:13:50.800 |
we add the bias and here is our 20.336 for the first one. And you can see though it's 00:13:56.040 |
doing every single one. So that's really important is that matrix multiplication gives us an 00:14:02.960 |
optimized way to do these simple linear functions for as many kind of rows and columns as we 00:14:08.800 |
want. So this is one of the two fundamental equations of any neural network. Some rows 00:14:18.520 |
of data, rows and columns of data, matrix multiply, some weights, add some bias. And 00:14:23.960 |
the second one which we'll see in a moment is an activation function. So that is some 00:14:31.700 |
predictions from our randomly initialized model. So we can check how good our model 00:14:36.960 |
is. And so to do that we can decide that anything greater than zero we will call a 3 and anything 00:14:46.280 |
less than zero we will call a 7. So preds greater than zero tells us whether or not something 00:14:54.280 |
is predicted to be a 3 or not. Then turn that into a float. So rather than true and false 00:14:59.760 |
make it 1 and 0 because that's what our training set contains. And then check whether our thresholded 00:15:07.080 |
predictions are equal to our training set. And this will return true every time a row 00:15:15.200 |
is correctly predicted and false otherwise. So if we take all those trues and falses and 00:15:20.800 |
turn them into floats so that'll be ones and zeros and then take their mean it's 0.49. 00:15:26.720 |
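Roughly the accuracy check just described:

```python
corrects = (preds > 0.0).float() == train_y   # threshold at zero, compare to the labels
corrects.float().mean().item()                # about 0.49 with random parameters
```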
So not surprisingly our randomly initialized model is right about half the time at predicting 00:15:31.480 |
threes from sevens. I added one more method here which is .item. Without .item this would 00:15:39.520 |
return a tensor. It's a rank 0 tensor it has no rows it has no columns it just it's just 00:15:46.160 |
a number on its own. But I actually wanted to unwrap it to create a normal Python scalar 00:15:52.360 |
mainly just because I wanted to see the easily see the full set of decimal places. And the 00:15:57.720 |
reason for that is I want to show you how we're going to calculate the derivative on 00:16:01.960 |
the accuracy by changing a parameter by a tiny bit. So let's take one parameter which will 00:16:09.680 |
be weight 0 and multiply it by 1.0001. And so that's going to make it a little bit bigger. 00:16:19.160 |
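A sketch of the experiment being described; the torch.no_grad() wrapper is my addition so the in-place tweak of a parameter that requires gradients doesn't raise an error:

```python
with torch.no_grad():
    weights[0] *= 1.0001          # nudge one parameter a tiny bit
preds = linear1(train_x)
((preds > 0.0).float() == train_y).float().mean().item()   # accuracy: unchanged
```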
And then if I calculate how the accuracy changes based on the change in that weight that will 00:16:28.080 |
be the gradient of the accuracy with respect to that parameter. So I can do that by calculating 00:16:36.520 |
my new set of predictions and then I can threshold them and then I can check whether they're equal 00:16:40.400 |
to the training set and then take the mean and I get back exactly the same number. So 00:16:49.080 |
remember that gradient is equal to rise over run if you remember back to your calculus 00:16:58.760 |
or if you'd forgotten your calculus hopefully you've reviewed it on Khan Academy. So the 00:17:05.000 |
change in the y so y new minus y old which is 0.4912 etc minus 0.4912 etc which is 0 00:17:18.400 |
divided by this change will give us 0. So at this point we have a problem our derivative 00:17:26.480 |
is 0 so we have 0 gradients which means our step will be 0 which means our prediction 00:17:34.340 |
will be unchanged. Okay so we have a problem and our problem is that our gradient is 0 00:17:45.720 |
and with a gradient of 0 we can't take a step and we can't get better predictions. And so 00:17:53.400 |
intuitively speaking the reason that our gradient is 0 is because when we change a single pixel 00:17:59.800 |
by a tiny bit we might not ever in any way change an actual prediction to change from 00:18:07.000 |
a 3 predicting a 3 to a 7 or vice versa because we have this threshold. Okay and so in other 00:18:17.800 |
words our accuracy loss function here is very bumpy it's like flat step flat step flat step 00:18:27.720 |
so it's got this 0 gradient all over the place. So what we need to do is use something other 00:18:35.280 |
than accuracy as our loss function. So let's try and create a new function and what this 00:18:46.040 |
new function is going to do is it's going to give us a better value kind of in much 00:18:54.240 |
the same way that accuracy gives a better value. So this is the loss remember a small 00:18:58.960 |
loss is better so it'll give us a lower loss when the accuracy is better but it won't have 00:19:05.240 |
a 0 gradient. So it means that a slightly better prediction needs to have a slightly 00:19:14.160 |
better loss. So let's have a look at an example. Let's say our targets, our labels of "is this a 3" 00:19:23.600 |
(there's just three rows, three images here), are 1, 0, 1, okay, and we've made some predictions 00:19:32.020 |
from a neural net and those predictions gave us 0.9 0.4 0.2. So now consider this loss function 00:19:42.660 |
a loss function we're going to use torch.where which is basically the same as this list comprehension 00:19:49.080 |
it's basically an if statement. So it's going to say for where target equals 1 we're going 00:19:56.840 |
to return 1 minus predictions so here target is 1 so it'll be 1 minus 0.9 and where target 00:20:03.400 |
is not 1 it'll just be predictions. So for these examples here the first one target equals 00:20:12.080 |
1 will be 1 minus 0.9 which is 0.1 the next one is target equals 0 so it will be the prediction 00:20:23.600 |
just 0.4 and then for the third one it's a 1 for target so it'll be 1 minus prediction 00:20:30.360 |
which is 0.8 and so you can see here when the prediction is correct correct in other 00:20:37.720 |
words it's a number you know it's a high number when the target is 1 and a low number when 00:20:43.320 |
the target is 0 these numbers are going to be smaller. So the worst one is when we predicted 00:20:50.040 |
0.2 so we're pretty we really thought that was actually a 0 but it's actually a 1 so 00:20:56.480 |
we ended up with a 0.8 here because this is 1 minus prediction 1 minus 0.2 is 0.8. So 00:21:05.680 |
we can then take the mean of all of these to calculate a loss. So if you think about 00:21:12.400 |
it this loss will be the smallest if the predictions are exactly right. So if we did predictions 00:21:27.280 |
is actually identical to the targets then this will be 0 0 0 okay or else if they were 00:21:39.280 |
exactly wrong let's say they were 1 minus then it's 1 1 1. So it's going to be the loss 00:21:50.000 |
will be better i.e. smaller when the predictions are closer to the targets and so here we can 00:21:59.240 |
now take the mean and when we do we get here 0.433. So let's say we change this last bad 00:22:10.920 |
one this inaccurate prediction from 0.2 to 0.8 and the loss gets better from 0.43 to 00:22:21.200 |
0.23. This loss function is just a torch.where followed by a mean. So this is actually pretty good, this is actually 00:22:28.760 |
a loss function which pretty closely tracks accuracy whereas the accuracy is better the 00:22:33.920 |
loss will be smaller but also it doesn't have these zero gradients because every time we 00:22:39.280 |
change the prediction the loss changes because the prediction is literally part of the loss 00:22:45.280 |
that's pretty neat isn't it. One problem is this is only going to work well as long as 00:22:51.080 |
the predictions are between 0 and 1 otherwise this 1 minus prediction thing is going to 00:22:55.440 |
look a bit funny. So we should try and find a way to ensure that the predictions are always 00:23:01.440 |
between 0 and 1 and that's also going to just make a lot more intuitive sense because you 00:23:07.080 |
know we like to be able to kind of think of these as if they're like probabilities or 00:23:10.540 |
at least nicely scaled numbers. So we need some function that can take our numbers have 00:23:20.120 |
a look. It's something which can take these big numbers and turn them all into numbers 00:23:27.680 |
between 0 and 1 and it so happens that we have exactly the right function it's called 00:23:36.200 |
the sigmoid function. So the sigmoid function looks like this. If you pass in a really small 00:23:42.080 |
number you get a number very close to 0 if you pass in a big number you get a number 00:23:47.000 |
very close to 1 it never gets past 1 and it never goes smaller than 0 and then it's kind 00:23:54.400 |
of like the smooth curve between and in the middle it looks a lot like the y = x line. 00:24:01.000 |
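A sketch of where the last few paragraphs end up: the torch.where loss described above, the sigmoid being described here, and then the loss with the sigmoid folded in, which is where the discussion below arrives (torch.sigmoid / .sigmoid() is the built-in equivalent):

```python
def mnist_loss(predictions, targets):
    # 1 - prediction where the target is 1, the prediction where it is 0, then average
    return torch.where(targets==1, 1-predictions, predictions).mean()

def sigmoid(x): return 1/(1+torch.exp(-x))   # squashes any number into (0, 1)

def mnist_loss(predictions, targets):        # the updated version described shortly
    predictions = predictions.sigmoid()      # make sure the inputs to where() are in (0, 1)
    return torch.where(targets==1, 1-predictions, predictions).mean()
```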
This is the definition of the sigmoid function. It's 1 over 1 plus e to the minus x. What 00:24:11.120 |
is exp? exp is just e to the power of something. So if we look at e, it's just a number like 00:24:24.960 |
pi; it's just a number that has a particular value. So if we go e squared, 00:24:34.120 |
and we look at it, it's going to be a tensor (using PyTorch, make it a float), there we go. 00:24:50.080 |
You can see that these are the same number so that's what torch.exp means. Okay so you 00:25:00.640 |
know for me when I see these kinds of interesting functions I don't worry too much about the 00:25:06.560 |
definition. What I care about is the shape. So you can have a play around with graphing 00:25:12.240 |
calculators or whatever to kind of see why it is that you end up with this shape from 00:25:16.920 |
this particular equation but for me I just never think about that. It never really matters 00:25:24.760 |
to me. What's important is this sigmoid shape which is what we want. It's something that 00:25:29.520 |
squashes every number to be between 0 and 1. So we can change mnist_loss to be exactly 00:25:39.040 |
the same as it was before, but we can put everything through sigmoid first and then 00:25:45.920 |
use torch.where. So that is a loss function that has all the properties we want. It's 00:25:52.840 |
something which is not going to have any of those nasty 0 gradients and we've ensured 00:25:59.480 |
that the input to the where is between 0 and 1. So the reason we did this is because our 00:26:11.960 |
accuracy was kind of what we really care about is a good accuracy. We can't use it to get 00:26:20.400 |
our gradients just to create our step to improve our parameters. So we can change our accuracy 00:26:33.400 |
to another function that is similar in terms of it it's better when the accuracy is better 00:26:39.640 |
but it also does not have these 0 gradients. And so you can see now where why we have a 00:26:44.940 |
metric and a loss. The metric is the thing we actually care about. The loss is the thing 00:26:50.280 |
that's similar to what we care about that has a nicely behaved gradient. Sometimes the thing 00:26:59.320 |
you care about your metric does have a nicely defined gradient and you can use it directly 00:27:03.520 |
as a loss. For example, we often use mean squared error but for classification unfortunately 00:27:10.560 |
not. So we need to now use this to update the parameters. And so there's a couple of 00:27:22.160 |
ways we could do this. One would be to loop through every image, calculate a prediction 00:27:28.240 |
for that image and then calculate a loss and then do a step and then step through the parameters 00:27:38.160 |
and then do that again for the next image and the next image and the next image. That's 00:27:42.440 |
going to be really slow because we're doing a single step for a single image. So that 00:27:49.600 |
would mean an epoch would take quite a while. We could go much faster by doing every single 00:27:55.680 |
image in the dataset. So a big matrix multiplication, it can all be parallelized on the GPU and then 00:28:03.480 |
so then we can we could then do a step based on the gradients looking at the entire dataset. 00:28:12.960 |
But now that's going to be like a lot of work to just update the weights once. And remember 00:28:19.960 |
sometimes our datasets have millions or tens of millions of items. So that's probably a 00:28:25.320 |
bad idea too. So why not compromise? Let's grab a few data items at a time to calculate 00:28:33.680 |
our loss and our step. If we grab a few data items at a time, those few data items are 00:28:39.280 |
called a mini-batch. And a mini-batch just means a few pieces of data. And so the size 00:28:47.520 |
of your mini-batch is called, not surprisingly, the batch size. So the bigger the batch size, 00:28:53.000 |
the closer you get to the full size of your dataset, the longer it's going to do take 00:28:56.960 |
to calculate a single set of losses, a single step. But the more accurate it's going to 00:29:04.360 |
be, it's going to be like the gradients are going to be much closer to the true dataset 00:29:09.440 |
gradients. And then the smaller the batch size, the faster each step will be able to 00:29:14.600 |
do, but those steps will represent a smaller number of items. And so they won't be such 00:29:19.920 |
an accurate approximation of the real gradient of the whole dataset. 00:29:28.000 |
Is there a reason the mean of the loss is calculated over, say, doing a median, since 00:29:32.680 |
the median is less prone to getting influenced by outliers? In the example you gave, if the 00:29:39.720 |
third point, which was wrongly predicted as an outlier, then the derivative would push 00:29:44.840 |
the function away while doing SGD, and a median could be better in that case. 00:29:50.680 |
Honestly, I've never tried using a median. The problem with a median is it ends up really 00:29:58.640 |
only caring about one number, which is the number in the middle. So it could end up really 00:30:05.680 |
pretty much ignoring all of the things at each end. In fact, all it really cares about 00:30:10.520 |
is the order of things. So my guess is that you would end up with something that is only 00:30:15.720 |
good at predicting one thing in the middle, but I haven't tried it. It would be interesting 00:30:22.280 |
to see. Well, I guess the other thing that would happen with a median is you would have 00:30:27.520 |
a lot of zero gradients, I think, because it's picking the thing in the middle and you 00:30:32.200 |
could, you know, change your values and the thing in the middle. Well, it wouldn't be 00:30:37.760 |
zero gradients, but bumpy gradients. I think in the middle would suddenly jump to being 00:30:41.040 |
a different item. So it might not behave very well. That's my guess. You should try it. 00:30:50.200 |
Okay. So how do we ask for a few items at a time? It turns out that PyTorch and FastAI 00:30:59.840 |
provide something to do that for you. You can pass in any data set to this class called 00:31:07.320 |
data loader and it will grab a few items from that data set at a time. You can ask for how 00:31:13.200 |
many by asking for a batch size. And then you can, as you can see, it will grab a few 00:31:20.720 |
items at a time until it's grabbed all of them. So here I'm saying let's create a collection 00:31:25.560 |
that just contains all the numbers from 0 to 14. Let's pass that into a data loader 00:31:31.200 |
with a batch size of 5. And then that's going to be something, it's called an iterator in 00:31:36.640 |
Python. It's something that you can ask for one more thing from an iterator. If you pass 00:31:40.520 |
an iterator to list in Python, it returns all of the things from the iterator. So here 00:31:45.920 |
are my three mini batches and you'll see here all the numbers from 0 to 14 appear. They 00:31:51.320 |
appear in a random order and they appear five at a time. They appear in random order because 00:31:55.880 |
shuffle equals true. So normally in the training set we ask for things to be shuffled. So it 00:32:01.640 |
gives us a little bit more randomization. More randomization is good because it makes it 00:32:07.040 |
harder for it to kind of learn what the data set looks like. So that's what a data loader, 00:32:14.160 |
that's how a data loader is created. Now remember though that our data sets actually return tuples. 00:32:25.480 |
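Roughly the little example just described:

```python
coll = range(15)
dl = DataLoader(coll, batch_size=5, shuffle=True)
list(dl)
# e.g. [tensor([ 3, 12,  8, 10,  2]), tensor([ 9,  4,  7, 14,  5]), tensor([ 1, 13,  0,  6, 11])]
```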
And here I've just got single ints. So let's actually create a tuple. So if we enumerate 00:32:30.640 |
all the letters of English, then that means it returns (0, 'a'), (1, 'b'), (2, 'c'), etc. Let's make that 00:32:38.360 |
our data set. So if we pass that to a data loader with a batch size of 6, and as you 00:32:45.360 |
can see it returns tuples containing 6 of the first things and the associated 6 of the 00:32:55.920 |
second things. So this is like our independent variable and this is like our dependent variable. 00:33:03.760 |
And so and then at the end, you know, the batch size won't necessarily exactly divide 00:33:10.520 |
nicely into the full size of the data set. You might end up with a smaller batch. 00:33:19.640 |
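Roughly the tuple version just described, using fastai's L list class:

```python
import string

ds = L(enumerate(string.ascii_lowercase))   # (0,'a'), (1,'b'), (2,'c'), ...
dl = DataLoader(ds, batch_size=6, shuffle=True)
list(dl)
# each batch is a tuple: (a tensor of 6 indices, a tuple of the 6 matching letters)
```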
So basically then we already have a data set, remember. And so we could pass it to a data 00:33:26.440 |
loader and then we can basically say this, an iterator in Python is something that you 00:33:31.200 |
can actually loop through. So when we say for in data loader, it's going to return a 00:33:37.160 |
tuple. We can de-structure it into the first bit and the second bit. And so that's going 00:33:44.160 |
to be our x and y. We can calculate our predictions, we can calculate our loss from the predictions 00:33:50.000 |
and the targets, we can ask it to calculate our gradients and then we can update our parameters 00:33:57.800 |
just like we did in our toy SGD example for the quadratic equation. So let's reinitialize 00:34:05.240 |
our weights and bias with the same two lines of code before. Let's create the data loader 00:34:10.120 |
this time from our actual MNIST data set and create a nice big batch size. So we did plenty 00:34:15.560 |
of work each time. And just to take a look, let's just grab the first thing from the data 00:34:20.960 |
loader. First is a fast AI function, which just grabs the first thing from an iterator. 00:34:26.720 |
Just it's useful to look at, you know, kind of an arbitrary mini batch. So here is the 00:34:33.000 |
shape. We're going to have the first mini batch is 256 rows of 784 long, that's 28 by 28. 00:34:40.520 |
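A sketch of the data loaders being described; valid_dset is the validation dataset built the same way as dset, name assumed:

```python
weights = init_params((28*28, 1)); bias = init_params(1)   # reinitialise, as described

dl = DataLoader(dset, batch_size=256, shuffle=True)
xb, yb = first(dl)          # first() is a fastai helper that grabs one batch
xb.shape, yb.shape          # (torch.Size([256, 784]), torch.Size([256, 1]))

valid_dl = DataLoader(valid_dset, batch_size=256)   # same idea for the validation set
```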
So 256 flattened out images and 256 labels that are one long because that's just the 00:34:48.160 |
number zero or the number one, depending on whether it's a three or a seven. Do the same 00:34:53.400 |
for the validation set. So here's our validation data loader. And so let's grab a batch here, 00:35:05.960 |
testing, pass it into, well, why do we do that? We should, yeah, I guess, yeah, actually 00:35:18.560 |
for our testing, I'm going to just manually grab the first four things just so that we 00:35:23.640 |
can make sure everything lines up. So let's grab just the first four things. We'll call 00:35:27.760 |
that a batch. Pass it into that linear function we created earlier. Remember linear was just 00:35:43.080 |
batch at weights matrix, multiply plus bias. And so that's going to give us four results. 00:35:54.840 |
That's a prediction for each of those four images. And so then we can calculate the loss 00:36:00.760 |
using that loss function we just used. And let's just grab the first four items of the 00:36:05.040 |
training set and there's the loss. Okay. And so now we can calculate the gradients. And 00:36:12.960 |
so the gradients are 784 by one. So in other words, it's a column where every weight as 00:36:21.160 |
a gradient, it's what's the change in loss for a small change in that parameter. And 00:36:28.760 |
then the bias has a gradient that's a single number because the bias is just a single number. 00:36:34.880 |
So, we can take those three steps and put it in a function. So if you pass, if you, 00:36:42.320 |
this is calculate gradient, you pass it an X batch or Y batch in some model, then it's 00:36:47.200 |
going to calculate the predictions, calculate the loss and do the backward step. And here 00:36:52.920 |
we see calculate gradient. And so we can get the, just to take a look, the mean of the 00:36:57.960 |
weights gradient and the bias gradient. And there it is. If I call it a second time and 00:37:05.320 |
look, notice I have not done any step here. This is exactly the same parameters. I get 00:37:11.520 |
a different value. That's a concern. You would expect to get the same gradient every time 00:37:17.840 |
you called it with the same data. Why have the gradients changed? That's because loss 00:37:24.640 |
dot backward does not just calculate the gradients. It calculates the gradients and adds them to 00:37:32.000 |
the existing gradients, the things in the dot grad attribute. The reasons for that will 00:37:39.280 |
come to you later, but for now the thing to know is just it does that. So actually what 00:37:44.120 |
we need to do is to call grad dot zero underscore. So dot zero returns a tensor containing zeros. 00:37:53.440 |
And remember underscore does it in place. So that updates the weights dot grad attribute, 00:37:58.840 |
which is a tensor to contain zeros. So now if I do that and call it again, I will get 00:38:05.680 |
exactly the same number. So here is how you train one epoch with SGD. Loop through the 00:38:14.960 |
data loader, grabbing the X batch and the Y batch, calculate the gradient, prediction 00:38:21.440 |
loss backward. Go through each of the parameters and we're going to be passing those in. So 00:38:28.760 |
there's going to be the 784 weights and the one bias. And then for each of those, update 00:38:35.760 |
the parameter to go minus equals gradient times learning rate. That's our gradient descent 00:38:43.280 |
step and then zero it out for the next time around the loop. I'm not just saying p minus 00:38:51.680 |
equals. I'm saying p dot data minus equals. And the reason for that is that remember PyTorch 00:38:58.920 |
keeps track of all of the calculations we do so that it can calculate the gradient. 00:39:05.200 |
Well I don't want to calculate the gradient of my gradient descent step. That's like not 00:39:10.760 |
part of the model, right? So dot data is a special attribute in PyTorch where if you 00:39:16.720 |
write to it, it tells PyTorch not to update the gradients using that calculation. So this 00:39:25.120 |
is your most basic standard SGD stochastic gradient descent loop. So now we can answer 00:39:32.480 |
that earlier question. The difference between stochastic gradient descent and gradient descent 00:39:37.680 |
is that gradient descent does not have this here that loops through each mini-batch. For 00:39:46.080 |
gradient descent, it does it on the whole data set each time around. So train epoch 00:39:51.720 |
for gradient descent would simply not have the for loop at all, but instead it would 00:39:57.760 |
calculate the gradient for the whole data set and update the parameters based on the 00:40:01.960 |
whole data set, which we never really do in practice. We always use mini-batches of various 00:40:07.920 |
sizes. Okay, so we can take the function we had before where we compare the predictions 00:40:21.800 |
to whether that, well we used to be comparing the predictions to whether they were greater 00:40:26.160 |
or less than zero, right? But now that we're doing the sigmoid, remember the sigmoid will 00:40:31.400 |
squish everything between 0 and 1. So now we should compare the predictions to whether 00:40:35.880 |
they're greater than 0.5 or not. If they're greater than 0.5, just look back at our sigmoid 00:40:41.040 |
function. So 0, what used to be 0 is now on the sigmoid is 0.5. Okay, so we need just 00:40:52.840 |
to make that slight change to our measure of accuracy. So to calculate the accuracy 00:41:03.360 |
for some X batch and some Y batch, this is actually assumed this is actually the predictions. 00:41:09.960 |
Then we take the sigmoid of the predictions, we compare them to 0.5 to tell us whether 00:41:15.240 |
it's a 3 or not, we check what the actual target was to see which ones are correct, 00:41:20.640 |
and then we take the mean of those after converting the Booleans to floats. So we can check that. 00:41:27.720 |
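The helper functions described over the last few paragraphs, roughly as they appear in the notebook:

```python
def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()                      # note: this *adds* to any existing .grad

def train_epoch(model, lr, params):
    for xb, yb in dl:
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad*lr          # .data so this step isn't itself tracked for gradients
            p.grad.zero_()               # zero the gradients ready for the next mini-batch

def batch_accuracy(xb, yb):
    preds = xb.sigmoid()                 # xb here is actually the model's output
    correct = (preds > 0.5) == yb        # 0.5 is where sigmoid(0) lands
    return correct.float().mean()
```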
Accuracy, let's take our batch, put it through our simple linear model, compare it to the 00:41:33.880 |
four items of the training set, and there's the accuracy. So if we do that for every batch 00:41:41.120 |
in the validation set, then we can loop through with a list comprehension every batch in the 00:41:46.800 |
validation set, get the accuracy based on some model, stack those all up together so 00:41:56.080 |
that this is a list, right? So if we want to turn that list into a tensor where the 00:42:01.000 |
items of the list of the tensor are the items of the list, that's what stack does. So we 00:42:06.600 |
can stack up all those, take the mean, convert it to a standard Python scalar by calling 00:42:14.160 |
dot item, round it to four decimal places just for display. And so here is our validation 00:42:21.880 |
set accuracy as you would expect. It's about 50% because it's random. So we can now train 00:42:28.480 |
for one epoch. So we can say, remember train epoch needed the parameters. So our parameters 00:42:37.400 |
in this case are the weights tensor and the bias tensor. So train one epoch using the 00:42:44.360 |
linear one model with the learning rate of one with these two parameters and then validate 00:42:52.460 |
and look at that. Our accuracy is now 68.8%. So we've trained an epoch. So let's just repeat 00:43:02.880 |
that 20 times, train and validate. And you can see the accuracy goes up and up and up 00:43:09.840 |
and up and up to about 97%. So that's cool. We've built an SGD optimizer of a simple linear 00:43:21.120 |
function that is getting about 97% on our simplified MNIST where there's just the threes 00:43:28.360 |
and the sevens. So a lot of steps there. Let's simplify this through some refactoring. 00:43:37.280 |
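Before the refactoring, here is roughly what the validation helper and the 20-epoch loop just described look like:

```python
def validate_epoch(model):
    accs = [batch_accuracy(model(xb), yb) for xb, yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)

lr = 1.
params = weights, bias
for i in range(20):
    train_epoch(linear1, lr, params)
    print(validate_epoch(linear1), end=' ')   # accuracy climbs towards roughly 0.97
```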
So the kind of simple refactoring we're going to do, we're going to do a couple, but the 00:43:40.800 |
basic idea is we're going to create something called an optimizer class. The first thing 00:43:45.520 |
we'll do is we'll get rid of the linear one function. Remember the linear one function 00:43:53.480 |
does x at w plus b. There's actually a class in PyTorch that does that equation for us. 00:44:03.840 |
So we might as well use it. It's called nn.linear. And nn.linear does two things. It does that 00:44:11.440 |
function for us and it also initializes the parameters for us. So we don't have to do 00:44:19.360 |
weights and bias equals init_params anymore. We just create an nn.Linear object and that's going 00:44:27.280 |
to create a matrix of size 28 by 28 comma 1 and a bias of size 1. It will set requires 00:44:35.680 |
grad equals true for us. It's all going to be encapsulated in this class and then when 00:44:40.240 |
I call that as a function, it's going to do my x at w plus b. So to see the parameters 00:44:51.320 |
in it, we would expect it to contain 784 weights and one bias. We can just call dot parameters 00:44:58.800 |
and we can de-structure it to w comma b and see, yep, it is 784 and 1 for the weights 00:45:06.720 |
and bias. So that's cool. So this is just, you know, it could be an interesting exercise 00:45:12.760 |
for you to create this class yourself from scratch. You should be able to at this point 00:45:19.320 |
so that you can confirm that you can recreate something that behaves exactly like an nn.linear. 00:45:26.440 |
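Roughly the nn.Linear replacement just described:

```python
linear_model = nn.Linear(28*28, 1)      # does x@w + b and initialises w and b for us
w, b = linear_model.parameters()
w.shape, b.shape                        # (torch.Size([1, 784]), torch.Size([1]))
```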
So now that we've got this object which contains our parameters in a parameters method, we 00:45:33.760 |
can now create an optimizer. So for our optimizer, we're going to pass it the parameters to optimize 00:45:39.380 |
and a learning rate. We'll store them away and we'll have something called step which 00:45:45.600 |
goes through each parameter and does that thing we just saw. p dot data minus equals 00:45:50.600 |
p dot grad times learning rate and it's also going to have something called zero grad which 00:45:55.760 |
goes through each parameter and zeros it out or we could even just set it to none. So that's 00:46:01.920 |
the thing we're going to call basic optimizer. So those are exactly the same lines of code 00:46:06.320 |
we've already seen wrapped up into a class. So we can now create an optimizer passing 00:46:11.940 |
in the parameters of the linear model and our learning rate. And so now our 00:46:18.400 |
training loop is loop through each mini batch in the data loader, calculate the gradient, 00:46:24.920 |
opt dot step, opt dot zero grad, that's it. Validation function doesn't have to change 00:46:35.640 |
and so let's put our training loop into a function that's going to loop through a bunch 00:46:38.640 |
of epochs, call an epoch, print validate epoch and then run it and it's the same. We're getting 00:46:49.880 |
a slightly different result here but much the same idea. Okay so that's cool right we've 00:46:59.880 |
now refactored, you know, to create our own optimizer and to use PyTorch's built- 00:47:08.120 |
in nn.Linear class. And you know, by the way, we don't actually need to use our own basic 00:47:14.560 |
optimizer. Not surprisingly pytorch comes with something which does exactly this and 00:47:20.120 |
not surprisingly it's called SGD. So and actually this SGD is provided by fastai, fastai and 00:47:26.920 |
pytorch provide some overlapping functionality they work much the same way. So you can pass 00:47:33.520 |
to SGD your parameters and your learning rate just like basic optimizer. Okay and train 00:47:40.800 |
it and get the same result. So as you can see these classes that are in fastai and pytorch 00:47:49.120 |
are not mysterious, they're just, you know, pretty thin wrappers around functionality that 00:47:57.960 |
we've now written ourself. So there's quite a few steps there and if you haven't done 00:48:03.880 |
gradient descent before then there's a lot of unpacking. So this lesson is kind of the 00:48:12.400 |
key lesson it's the one where you know like we should you know really take a stop and 00:48:17.720 |
a deep breath at this point and make sure you're comfortable. What's a data set? What's 00:48:23.520 |
a data loader? What's nn.linear? What's SGD? And if you you know if any or all of those 00:48:31.440 |
don't make sense go back to where we defined it from scratch using Python code. Well the 00:48:38.080 |
data loader we didn't define from scratch but it you know the functionality is not particularly 00:48:42.680 |
interesting. You could certainly create your own from scratch if you wanted to that would 00:48:47.080 |
be another pretty good exercise. Let's refactor some more. Fastai has a data loaders class 00:48:58.720 |
which is as we've mentioned before is a tiny class that just you pass it a bunch of data 00:49:05.120 |
loaders and it just stores them away as a dot train and a dot valid. Even though it's 00:49:10.200 |
a tiny class it's it's super handy because with that we now have a single object that 00:49:16.800 |
knows all the data we have and so it can make sure that your training data loader is shuffled 00:49:22.800 |
and your validation loader isn't shuffled you know make sure everything works properly. 00:49:28.020 |
So that's what the data loaders class is you can pass in the training and valid data loader 00:49:33.840 |
and then the next thing we have in fastai is the learner class and the learner class 00:49:38.520 |
is something where we're going to pass in our data loaders. We're going to pass in our 00:49:44.560 |
model we're going to pass in our optimization function we're going to pass in our loss function 00:49:51.880 |
we're going to pass in our metrics. So all the stuff we've just done manually that's 00:49:58.120 |
all learner does is it's just going to do that for us so it's just going to call this 00:50:04.320 |
train model and this train epoch it's just you know it's inside learner. So now if we 00:50:11.120 |
go learn.fit you can see again it's doing the same thing getting the same result and 00:50:20.000 |
it's got some nice functionality it's printing it out into a pretty table for us and it's 00:50:23.460 |
showing us the losses and the accuracy and how long it takes but there's nothing magic 00:50:28.480 |
right you've been able to do exactly the same thing by hand using Python and PyTorch. So 00:50:37.080 |
these abstractions are here to like let you write less code and to save some time and 00:50:41.880 |
to save some cognitive overhead but they're not doing anything you can't do yourself. 00:50:49.120 |
And that's important right because if the if they're doing things you can't do yourself 00:50:54.920 |
then you can't customize them you can't debug them you know you can't profile them. So we 00:51:02.400 |
want to make sure that the stuff we're using is stuff that we understand what it's doing. 00:51:09.380 |
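Pulling the refactoring together, here is roughly what the pieces described above look like: the little optimizer class, then the fastai DataLoaders and Learner doing the same job:

```python
class BasicOptim:
    def __init__(self, params, lr): self.params, self.lr = list(params), lr
    def step(self, *args, **kwargs):
        for p in self.params: p.data -= p.grad.data * self.lr
    def zero_grad(self, *args, **kwargs):
        for p in self.params: p.grad = None

# opt = BasicOptim(linear_model.parameters(), lr) would behave the same way as fastai's SGD here

dls = DataLoaders(dl, valid_dl)         # just stores them as .train and .valid
learn = Learner(dls, nn.Linear(28*28, 1), opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(10, lr=1.)
```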
So this is just a linear function is not great we want a neural network. So how do we turn 00:51:18.520 |
this into a neural network or remember this is a linear function x at w plus b to turn 00:51:27.320 |
it into a neural network we have two linear functions exactly the same but with different 00:51:34.160 |
weights and different biases and in between this magic line of code which takes the result 00:51:40.920 |
of our first linear function and then does a max between that and 0. So a max of res 00:51:48.760 |
and 0 is going to take any negative numbers and turn them into zeros. So we're going to 00:51:55.000 |
do a linear function we're going to replace the negatives with 0 and then we're going 00:51:59.960 |
to take that and put it through another linear function that believe it or not is a neural 00:52:05.360 |
net. So w1 and w2 are weight tensors b1 and b2 are bias tensors just like before so we 00:52:13.200 |
can initialize them just like before and we could now call exactly the same training code 00:52:19.720 |
that we did before to train these. So this res.max with 0 is called a rectified linear unit, which 00:52:32.880 |
you will always see referred to as ReLU. And PyTorch already has this 00:52:42.140 |
function, it's called F.relu, and so if we plot it you can see it's, as you'd expect, 0 00:52:49.480 |
for all negative numbers and then it's y equals x for positive numbers. So you know here's 00:52:59.440 |
some jargon rectified linear unit sounds scary sounds complicated but it's actually this incredibly 00:53:06.720 |
tiny line of code this incredibly simple function and this happens a lot in deep learning things 00:53:14.080 |
that sound complicated and sophisticated and impressive turn out to be normally super simple 00:53:21.080 |
frankly, at least once you know what it is. So why do we do linear layer, ReLU, linear layer? 00:53:31.000 |
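Before the answer, here is roughly what that little network looks like in code; 30 is the hidden size used in the notebook, and init_params is the helper from earlier:

```python
w1 = init_params((28*28, 30)); b1 = init_params(30)
w2 = init_params((30, 1));     b2 = init_params(1)

def simple_net(xb):
    res = xb@w1 + b1            # first linear layer
    res = res.max(tensor(0.0))  # the ReLU: replace negatives with zero (same as F.relu)
    res = res@w2 + b2           # second linear layer
    return res
```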
well if we got rid of the middle if we got rid of the middle ReLU and just went linear 00:53:42.960 |
layer linear layer then you could rewrite that as a single linear layer when you multiply 00:53:49.560 |
things and add and then multiply things and add and you can just change the coefficients 00:53:54.440 |
and make it into a single multiply and then add. So no matter how many linear layers we 00:53:58.700 |
stack on top of each other we can never make anything more kind of effective than a simple 00:54:05.760 |
linear model but if you put a non-linearity between the linear layers then actually you 00:54:13.120 |
have the opposite this is now where something called the universal approximation theorem 00:54:18.480 |
holds which is that if the size of the weight and bias matrices are big enough this can 00:54:24.720 |
actually approximate any arbitrary function including the function of how do I recognize 00:54:31.480 |
threes from sevens or or whatever. So that's kind of amazing right this tiny thing is actually 00:54:40.520 |
a universal function approximator as long as you have w1 b1 w2 and b2 have the right 00:54:48.360 |
numbers and we know how to make them the right numbers you use SGD could take a very long 00:54:53.840 |
time could take a lot of memory but the basic idea is that there is some solution to any 00:55:02.000 |
computable problem and this is one of the biggest challenges a lot of beginners have 00:55:09.840 |
to deep learning is that there's nothing else to it like there's often this like okay how 00:55:17.120 |
do I make a neural net oh that is a neural net or how do I do deep learning training 00:55:24.040 |
well, there's SGD, there's things to, like, make it train a bit faster, there's, you know, things 00:55:30.920 |
to mean you need a few less parameters but everything from here is just performance tweaks 00:55:41.640 |
honestly right so this is you know this is the key understanding of training a neural 00:55:50.400 |
network okay we can simplify things a bit more we already know that we can use nn.linear 00:55:59.960 |
to replace the weight and bias so let's do that for both of the linear layers and then 00:56:08.760 |
since we're simply taking the result of one function and passing it into the next and 00:56:19.320 |
take the result of that function pass it to the next and so forth and then return the 00:56:22.840 |
end this is called function composition function composition is when you just take the result 00:56:28.440 |
of one function pass it to a new one take a result of one function pass it to a new one 00:56:33.560 |
and so every pretty much neural network is just doing function composition of linear 00:56:39.820 |
layers and these are called activation functions or non-linearities so PyTorch provides something 00:56:47.360 |
to do function composition for us and it's called nn.sequential so it's going to do a 00:56:53.360 |
linear layer, pass the result to a ReLU, pass the result to a linear layer. You'll see here 00:56:59.440 |
I'm not using f.relu I'm using nn.relu this is identical returns exactly the same thing 00:57:04.920 |
but this is a class rather than a function yes Rachel by using the non-linearity won't 00:57:15.440 |
using a function that makes all negative output zero make many of the gradients in the network 00:57:19.760 |
zero and stop the learning process due to many zero gradients well that's a fantastic 00:57:27.100 |
question and the answer is yes it does but there won't be zero for every image and remember 00:57:35.920 |
the mini batches are shuffled so even if it's zero for every image in one mini batch it 00:57:41.760 |
won't be for the next mini batch and it won't be the next time around we go for another 00:57:45.120 |
epoch so yes it can create zeros and if if the neural net ends up with a set of parameters 00:57:55.380 |
such that lots and lots of inputs end up as zeros you can end up with whole mini batches 00:58:02.080 |
that is zero and you can end up in a situation where some of the neurons remain inactive inactive 00:58:14.280 |
means they're zero and they're basically dead units and this is a huge problem it basically 00:58:21.960 |
means you're wasting computation so there's a few tricks to avoid that which we'll be 00:58:27.120 |
learning about a lot one simple trick is to not make this thing flat here but just make 00:58:34.680 |
it a less steep line. That's called a leaky ReLU, a leaky rectified linear unit, and they 00:58:43.560 |
help a bit as we'll learn though even better is to make sure that we just kind of initialize 00:58:49.400 |
to sensible initial values that are not too big and not too small and step by sensible 00:58:55.440 |
amounts that are particularly not too big and generally if we do that we can keep things 00:59:01.360 |
in the zone where they're positive most of the time but we are going to learn about how 00:59:06.420 |
to actually analyze inside a network and find out how many dead units we have how many of 00:59:10.640 |
these zeros we have because as this as you point out they are they are bad news they 00:59:15.960 |
don't do any work and they'll continue to not do any work if if enough of the inputs 00:59:22.880 |
end up being zero okay so now that we've got a neural net we can use exactly the same learner 00:59:35.240 |
we had before but this time we'll pass in the simple net instead of the linear one everything 00:59:41.260 |
else is the same and we can call fit just like before and generally as your models get 00:59:48.040 |
deeper though here we've gone from one layer to and I'm only counting the parameterized 00:59:53.920 |
layers as layers you could say it's three I'm just going to call it two there's two 00:59:58.180 |
trainable layers, so I've gone from one layer to two. I've dropped my learning rate from 01:00:03.080 |
one to zero point one because the deeper models you know tend to be kind of bumpier less nicely 01:00:10.120 |
behaved, so often you need to use lower learning rates, and so we train it for a while, okay. 01:00:16.360 |
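Roughly the nn.Sequential version and the training call just described:

```python
simple_net = nn.Sequential(
    nn.Linear(28*28, 30),
    nn.ReLU(),                  # the class form of F.relu
    nn.Linear(30, 1)
)
learn = Learner(dls, simple_net, opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(40, 0.1)              # lower learning rate for the deeper model
```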
and we can actually find out what that training looks like by looking inside our learner and 01:00:24.000 |
there's an attribute we create for you called recorder and that's going to record well everything 01:00:29.720 |
that appears in this table basically well these three things the training loss the validation 01:00:34.360 |
loss and the accuracy or any metrics so recorder dot values contains that kind of table of 01:00:42.240 |
results, and so item number two of each row will be the accuracy, and so the capital 01:00:53.380 |
L class which I'm using here has a nice little method called itemgot, which will get 01:01:03.880 |
the second item from every row, and then I can plot that to see how the training went. 01:01:12.360 |
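Roughly the plotting line being described:

```python
plt.plot(L(learn.recorder.values).itemgot(2));   # column 2 of recorder.values is the accuracy
```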
and I can get the final accuracy like so by grabbing the last row of the table and grabbing 01:01:19.380 |
the second index two zero one two and my final accuracy not bad ninety eight point three 01:01:27.680 |
percent so this is pretty amazing we now have a function that can solve any problem to any 01:01:36.760 |
level of accuracy if we can find the right parameters and we have a way to find hopefully 01:01:43.000 |
the best or at least a very good set of parameters for any function so this is kind of the magic 01:01:51.000 |
yes Rachel how could we use what we're learning here to get an idea of what the network is 01:01:57.960 |
learning along the way, like Zeiler and Fergus did? More or less. We will look at that later, 01:02:07.480 |
not in the full detail of their paper but basically you can look in the dot parameters 01:02:14.840 |
to see the values of those parameters and at this point well I mean why don't you try 01:02:21.320 |
it yourself right you've actually got now the parameters so if you want to grab the 01:02:28.780 |
model you can actually see learn dot model so we can we can look inside learn dot model 01:02:38.000 |
to see the actual model that we just trained and you can see it's got the three things 01:02:47.880 |
in it the linear the value of the linear and what I kind of like to do is to put that into 01:02:53.320 |
a variable make it a bit easy to work with and you can grab one layer by indexing in 01:03:02.960 |
you can look at the parameters and that just gives me a something called a generator it's 01:03:10.760 |
something that will give me a list of the parameters when I ask for them so I can just 01:03:14.600 |
go weight comma bias equals to de-structure them and so the weight is 30 by 784 because 01:03:30.120 |
that's what I asked for so one of the things to note here is that to create a neural net 01:03:40.400 |
so something with more than one layer I actually have 30 outputs not just one right so I'm 01:03:47.240 |
kind of generating lots of you can think of generating lots of features so it's kind of 01:03:50.520 |
like 30 different linear linear models here and then I combine those 30 back into one 01:03:58.680 |
so you could look at one of those by having a look at yeah so there's there's the numbers 01:04:08.900 |
in the first row we could reshape that into the original shape of the images and we could 01:04:24.800 |
even have a look, and there it is, right? So you can see, this is something, so this is cool. 01:04:35.120 |
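A sketch of the peek at the first layer's weights just described; show_image is fastai's helper for displaying a tensor as an image, and the .detach() is my addition so matplotlib doesn't complain about a tensor that requires gradients:

```python
m = learn.model
w, b = m[0].parameters()                   # first nn.Linear: weights are 30 x 784
show_image(w[0].detach().view(28, 28));    # reshape one row of weights back into a 28x28 image
```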
right we can actually see here we've got something which is which is kind of learning to find 01:04:46.280 |
things at the top and the bottom and the middle and so we could look at the second one okay 01:04:54.120 |
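(Roughly what that poking around looks like in code; show_image is fastai's little plotting helper, and the [0] index grabs the first linear layer:)

```python
m = learn.model                 # the nn.Sequential we just trained
w, b = m[0].parameters()        # first layer: weight is 30x784, bias is 30
show_image(w[0].view(28, 28))   # first row of weights, reshaped back into an image
show_image(w[1].view(28, 28))   # the second one
```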
no idea what that's showing and so some of them are kind of you know I've probably got 01:04:59.680 |
far more than I need which is why they're not that obvious but you can see yeah here's 01:05:04.960 |
another thing it's looking pretty similar here's something that's kind of looking for 01:05:09.920 |
this little bit in the middle so yeah this is the basic idea to understand the features 01:05:17.920 |
that are not the first layer but later layers you have to be a bit more sophisticated but 01:05:23.960 |
yeah to see the first layer ones you can you can just plot them okay so then you know just 01:05:33.280 |
to compare we could use the full fast AI toolkit so grab our data loaders by using data loaders 01:05:41.160 |
from folder as we've done before and create a CNN learner and a ResNet and fit it for 01:05:47.520 |
a single epoch and whoa 99.7 right so we did 40 epochs and got 98.3 as I said using all 01:05:59.560 |
the tricks you can really speed things up and make things a lot better and so by the 01:06:05.640 |
end of this course or at least both parts of this course you'll be able to from scratch 01:06:13.900 |
get this 99.7 in a single epoch. All right, so, jargon, just to remind us: ReLU is a function 01:06:30.200 |
that returns zero for negatives; a mini-batch is a few inputs and labels, which optionally are 01:06:38.280 |
randomly selected; the forward pass is the bit where we calculate the predictions; the 01:06:43.560 |
loss is the function that we're going to take the derivative of and then the gradient is 01:06:48.480 |
the derivative of the loss with respect to each parameter the backward pass is when we 01:06:54.680 |
calculate those gradients gradient descent is that full thing of taking a step in the 01:07:00.080 |
direction opposite to the gradients after calculating the loss, and then the learning 01:07:04.760 |
rate is the size of the step that we take other things to know perhaps the two most 01:07:17.960 |
important pieces of jargon are all of the numbers that are in a neural network the numbers 01:07:23.520 |
that we're learning are called parameters and then the numbers that we're calculating 01:07:29.240 |
so every value that's calculated every matrix multiplication element that's calculated they're 01:07:35.160 |
called activations so activations and parameters are all of the numbers in the neural net and 01:07:42.420 |
so be very careful when I say from here on in in these lessons activations or parameters 01:07:48.760 |
you got to make sure you know what those mean because that's that's the entire basically 01:07:53.360 |
almost the entire set of numbers that exist inside a neural net so activations are calculated 01:07:59.600 |
parameters are learned we're doing this stuff with tensors and tensors are just regularly 01:08:07.600 |
shaped arrays rank 0 tensors we call scalars rank 1 tensors we call vectors rank 2 tensors 01:08:14.540 |
we call matrices and we continue on to rank 3 tensors rank 4 tensors and so forth and 01:08:21.520 |
rank 5 tensors are very common in deep learning so don't be scared of going up to higher numbers 01:08:27.400 |
of dimensions okay so let's have a break oh we've got a question okay is there a rule 01:08:35.700 |
of thumb for what non-linearity to choose given that there are many yeah there are many 01:08:41.840 |
non-linearities to choose from and it doesn't generally matter very much which you choose 01:08:46.440 |
so let's choose ReLU or leaky ReLU or, yeah, whatever; any one should work fine. Later 01:08:56.640 |
on we'll look at the minor differences between them, but it's not so much something 01:09:02.960 |
that you pick on a per-problem basis; it's more like some take a little bit longer and are a little 01:09:08.040 |
bit more accurate, and some are a bit faster and a little bit less accurate. That's a good question. 01:09:14.320 |
okay so before you move on it's really important that you finish the questionnaire for this 01:09:18.560 |
chapter because there's a whole lot of concepts that we've just done so you know try to go 01:09:24.680 |
through the questionnaire go back and relook at the notebook and please run the code through 01:09:30.920 |
the experiments and make sure it makes sense all right let's have a seven minute break 01:09:37.000 |
see you back here in seven minutes time okay welcome back so now that we know how to create 01:09:51.560 |
and train a neural net let's cycle back and look deeper at some applications and so we're 01:09:59.200 |
going to try to kind of interpolate in from one end we've done the kind of from scratch 01:10:05.800 |
version at the other end we've done the kind of four lines of code version and we're going 01:10:10.320 |
to gradually nibble at each end until we find ourselves in the middle and we've we've we've 01:10:15.680 |
touched on all of it so let's go back up to the kind of the four lines of code version 01:10:20.960 |
and and delve a little deeper so let's go back to pets and let's think though about 01:10:33.000 |
like how do you actually you know start with a new data set and figure out how to use it 01:10:44.920 |
so, you know, the data sets we provide, it's easy enough to untar them; you just say untar_data 01:10:50.320 |
and that downloads it and untars it. If it's a data set that you're getting yourself, you can just 01:10:57.320 |
use the terminal or Python or whatever. So let's assume we have a path that's pointing 01:11:05.000 |
at something so initially you don't you don't know what that something is so we can start 01:11:12.480 |
by doing ls to have a look and see what's inside there so the pets data set that we 01:11:17.720 |
saw in lesson one contains three things annotations images and models and you'll see we have this 01:11:25.520 |
little trick here where we say Path.BASE_PATH equals the path to our data, 01:11:31.600 |
and that just does a little simple thing where, when we print a path out, it doesn't 01:11:35.640 |
show the whole thing, it just shows it relative to this path, which is a bit convenient. So if you go and 01:11:44.960 |
have a look at the read me for the original pets data set it tells you what these images 01:11:51.200 |
and annotations folders are and not surprisingly the images path if we go path images that's 01:11:58.640 |
how we use path lib to grab a subdirectory and then ls we can see here are the names 01:12:05.920 |
that is, the paths to the images. As it mentions here, most functions and methods in fastai 01:12:12.920 |
which return a collection don't return a Python list; they return a capital L, and a capital 01:12:20.440 |
L as we briefly mentioned is basically an enhanced list one of the enhancements is the 01:12:26.200 |
way it prints the representation of it starts by showing you how many items there are in 01:12:31.080 |
the list in the collection so there's seven thousand three hundred and ninety four images 01:12:36.800 |
and it if there's more than ten things it truncates it and just says dot dot dot to 01:12:43.480 |
avoid filling up your screen, so there's a couple of little conveniences there. 01:12:51.160 |
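(A minimal sketch of that setup for the pets data set; the calls are standard fastai, but treat the exact output as illustrative:)

```python
from fastai.vision.all import *

path = untar_data(URLs.PETS)   # download and untar the Oxford-IIIT Pet dataset
Path.BASE_PATH = path          # print paths relative to `path` from now on
path.ls()                      # the top-level folders (annotations, images, ...)
(path/"images").ls()           # an L of the image file paths
```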
And so we can see from this output that the file name, as we mentioned in lesson one: if the 01:13:00.080 |
first letter is a capital it means it's a cat and if the first letter is lowercase it 01:13:06.640 |
means it's a dog but this time we've got to do something a bit more complex well a lot 01:13:10.760 |
more complex, which is figure out what breed it is, and so you can see the breed is kind 01:13:16.320 |
of everything in the file name up to the last underscore, 01:13:22.280 |
before this number; that is the breed. So we want to label everything with its breed, so 01:13:30.200 |
we're going to take advantage of this structure so the way I would do this is to use a regular 01:13:41.940 |
expression a regular expression is something that looks at a string and basically lets 01:13:46.880 |
you kind of pull it apart into its pieces in very flexible way it's this kind of simple 01:13:51.520 |
little language for doing that if you haven't used regular expressions before please google 01:13:58.520 |
regular expression tutorial now and look it's going to be like one of the most useful tools 01:14:03.120 |
you'll come across in your life; I use them almost every day. I won't go into the details of 01:14:10.000 |
how to use them, since there's so many great tutorials, and there's also a lot of great 01:14:13.360 |
like exercises you know there's regex regex is short for regular expression there's regex 01:14:18.680 |
crosswords there's regex Q&A there's all kinds of cool regex things a lot of people like 01:14:24.760 |
me love this tool. There's also a regex lesson in the fast AI NLP course, maybe 01:14:33.400 |
even two regex lessons oh yeah I'm sorry for forgetting about the fast AI NLP course what 01:14:40.400 |
an excellent resource that is so regular expressions are a bit hard to get right the first time 01:14:50.680 |
so the best thing to do is to get a sample string, so a good way to do that would 01:14:54.880 |
be to just grab one of the file names, pop it in fname, and then you can experiment 01:15:02.160 |
with regular expressions so re is the regular expression module in Python and find all will 01:15:11.760 |
just grab all the parts of a regular expression that have parentheses around them so this 01:15:17.400 |
regular expression, and the r prefix is a special kind of string in Python which basically says don't 01:15:22.960 |
treat backslash as special because normally in Python like backslash N means a new line 01:15:29.520 |
so here's a string which I'm going to capture any letter one or more times followed by an 01:15:39.680 |
underscore followed by a digit one or more times followed by anything I probably should 01:15:47.600 |
have used backslash dot that's fine followed by the letters jpg followed by the end of 01:15:52.560 |
the string. And so if I call that regular expression against my file name's name attribute, oh, looks good, right? 01:16:02.880 |
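(The experiment looks roughly like this; the example file name is an assumption:)

```python
import re

fname = (path/"images").ls()[0]            # grab one file to experiment with,
                                           # e.g. something like 'great_pyrenees_173.jpg'
re.findall(r'(.+)_\d+.jpg$', fname.name)   # -> ['great_pyrenees'] for that example
```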
so we kind of check it out so now that seems to work we can create a data block where the 01:16:09.400 |
independent variables are images the dependent variables are categories just like before 01:16:15.040 |
get items is going to be get image files we're going to split it randomly as per usual and 01:16:23.580 |
then we're going to get the label by calling RegexLabeller, which is just a handy little 01:16:33.000 |
fastai class which labels things with a regular expression. We can't call this particular 01:16:39.920 |
expression directly on the pathlib Path object; we actually want to call 01:16:45.040 |
it on the name attribute, and fastai has a nice little function called using_attr, using 01:16:52.040 |
attribute, which takes this function and changes it to a function which will be passed this 01:16:58.080 |
attribute that's going to be using regex labeler on the name attribute and then from that data 01:17:08.520 |
block we can create the data loaders as usual. There's two interesting lines here: Resize 01:17:16.160 |
and aug_transforms. aug_transforms we have seen before in notebook 2, in the section called 01:17:28.000 |
data augmentation, and so aug_transforms was the thing which can zoom in and zoom out and 01:17:36.040 |
warp and rotate and change contrast and change brightness and so forth and flip to kind of 01:17:43.000 |
give us almost it's like giving us more data being generated synthetically from the data 01:17:47.960 |
we already have and we also learned about random resize crop which is a kind of a really 01:17:59.680 |
cool way of ensuring you get square images at the same time that you're augmenting 01:18:08.600 |
the data here we have a resize to a really large image but you know by deep learning 01:18:17.320 |
standards 460 by 460 is a really large image, and then we're using aug_transforms with a 01:18:24.280 |
size so that's actually going to use random resize crop to a smaller size why are we doing 01:18:30.240 |
that this particular combination of two steps does something which I think is unique to 01:18:40.060 |
fast AI which we call pre-sizing and the best way is I will show you this beautiful example 01:18:48.800 |
of some PowerPoint wizardry that I'm so excited about to show how pre-sizing works what pre-sizing 01:18:56.820 |
does is that first step where we say resize to 460 by 460 is it grabs a square and it grabs 01:19:05.480 |
it randomly if it's a kind of landscape orientation photo it'll grab it randomly so it'll take 01:19:11.360 |
the whole height and randomly grab somewhere from along the side if it's a portrait orientation 01:19:18.400 |
then it'll grab it you know take the full width and grab a random bit from top to bottom 01:19:25.480 |
so then we take this area here and here it is right and so that's what the first resize 01:19:30.780 |
does, and then the second aug_transforms bit will grab a random warped crop, possibly rotated, 01:19:41.360 |
from in here and we'll turn that into a square and so it does so there's two steps it's first 01:19:50.420 |
of all resize to a square that's big and then the second step is do a kind of rotation and 01:19:56.400 |
warping and zooming stage to something smaller in this case 224 by 224 because this first 01:20:05.880 |
step creates something that's square and always is the same size the second step can happen 01:20:12.360 |
on the GPU, and because normally things like rotating and image warping are actually pretty 01:20:17.400 |
slow. Also, normally doing a zoom and a rotate and a warp actually is really destructive 01:20:27.560 |
to the image because each one of those things requires an interpolation step but it's not 01:20:32.640 |
just slow it actually makes the image really quite low quality so we do it in a very special 01:20:40.120 |
way in fast AI I think it's unique where we do all of these kind of coordinate transforms 01:20:47.560 |
like rotations and warps and zooms and so forth not on the actual pixels but instead 01:20:54.960 |
we kind of keep track of the changing coordinate values in a non-lossy way so the full floating 01:21:01.600 |
point value and then once at the very end we then do the interpolation. 01:21:08.840 |
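(Putting that together, the pets data block looks roughly like this; it's close to the notebook's code, but the seed and min_scale values here are assumptions:)

```python
pets = DataBlock(
    blocks=(ImageBlock, CategoryBlock),       # independent and dependent variable types
    get_items=get_image_files,
    splitter=RandomSplitter(seed=42),
    get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
    item_tfms=Resize(460),                    # presize: one big crop per item, on the CPU
    batch_tfms=aug_transforms(size=224, min_scale=0.75),  # warp/rotate/zoom to 224 on the GPU
)
dls = pets.dataloaders(path/"images")
```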
The results are quite striking here is what the difference looks like hopefully you can 01:21:16.080 |
see this on on the video on the left is our pre-sizing approach and on the right is the 01:21:24.160 |
standard approach that other libraries use and you can see that the one on the right 01:21:28.600 |
is a lot less nicely focused and it also has like weird things like this should be grass 01:21:36.040 |
here but it's actually got its kind of bum-sticking way out this has a little bit of weird distortions 01:21:41.600 |
this has got loads of weird distortions so you can see the pre-sized version really ends 01:21:46.660 |
up way way better and I think we have a question Rachel are the blocks in the data block and 01:21:55.520 |
ordered list do they specify the input and output structures respectively are there always 01:22:01.080 |
two blocks or can there be more than two for example if you wanted a segmentation model 01:22:05.960 |
would the second block be something about segmentation so so yeah this is an ordered 01:22:14.400 |
list so the first item says I want to create an image and then the second item says I want 01:22:20.880 |
to create a category so that's my independent and dependent variable you can have one thing 01:22:26.200 |
here you can have three things here you can have any amount of things here you want obviously 01:22:31.040 |
the vast majority of the time it'll be two normally there's an independent variable and 01:22:34.600 |
a dependent variable we'll be seeing this in more detail later although if you go back 01:22:39.600 |
to the earlier lesson when we introduced data blocks I do have a picture kind of showing 01:22:43.660 |
how these pieces fit together. So after you've put together your data block created your 01:22:56.720 |
data loaders you want to make sure it's working correctly so the obvious thing to do for computer 01:23:02.240 |
vision data block is show batch and show batch will show you the items and you can kind of 01:23:11.960 |
just make sure they look sensible that looks like the labels are reasonable if you add 01:23:16.480 |
a unique equals true then it's going to show you the same image with all the different 01:23:21.240 |
augmentations this is a good way to make sure your augmentations work if you make a mistake 01:23:26.000 |
in your data block in this example there's no resize so the different images are going 01:23:32.160 |
to be different sizes so it'll be impossible to collate them into a batch so if you call 01:23:39.600 |
dot summary this is a really neat thing which will go through and tell you everything that's 01:23:46.400 |
happening so I collecting the items how many did I find what happened when I split them 01:23:52.240 |
what are the different variables independent dependent variables I'm creating let's try 01:23:57.720 |
and create one of these here's a step create my image create categorize here's what the 01:24:06.800 |
first thing gave me an American Bulldog is the final sample is this image this size this 01:24:13.400 |
category and then eventually it says oh it's not possible to collate your items I tried 01:24:20.400 |
to collate the zero index members of your tuples so in other words that's the independent 01:24:24.800 |
variable and I got this was size 500 by 375 this was 375 by 500 oh I can't collate these 01:24:32.560 |
into a tensor because they're different sizes. So this is a super great debugging tool for debugging your data blocks. 01:24:37.680 |
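(A quick sketch of those checks; the second DataBlock deliberately leaves out the resize so that summary can show the collation error:)

```python
dls.show_batch(nrows=1, ncols=3)               # eyeball a few items and their labels
dls.show_batch(nrows=1, ncols=3, unique=True)  # the same image under different augmentations

# a deliberately broken block with no item_tfms, to see summary report the failure
pets_no_resize = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(seed=42),
    get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),
)
pets_no_resize.summary(path/"images")          # fails at collation: images have different sizes
```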
You have a question: how does the item transforms presize work 01:24:46.320 |
if the resize is smaller than the image is a whole width or height still taken or is 01:24:51.600 |
it just a random crop with the resize value so if you remember back to lesson two we looked 01:25:04.000 |
at the different ways of creating these things you can use squish you can use pad or you 01:25:16.080 |
can use crop. So if your image is smaller than the presize value, then squish will really 01:25:23.940 |
be a zoom, so it will just stretch it, and then pad and crop will do much 01:25:31.980 |
the same thing, and so you'll just end up with, you know, the same kind of thing as these, 01:25:37.280 |
but it'll be kind of lower, more pixelated, lower resolution, because it's having to zoom 01:25:41.560 |
in a little bit okay so a lot of people say that you should do a hell of a lot of data 01:25:52.120 |
cleaning before you model we don't we say model as soon as you can because remember 01:25:59.360 |
what we found in in notebook two your your model can teach you about the problems in 01:26:06.840 |
your data so as soon as I've got to a point where I have a data block that's working and 01:26:12.680 |
I have data loaders I'm going to build a model and so here I'm you know it also tells me 01:26:17.460 |
how I'm going so I'm getting seven percent error well that's actually really good for 01:26:23.080 |
a pets model and so at this point now that I have a model I can do that stuff we learned 01:26:27.200 |
about earlier in O2 the notebook O2 where we train our model and use it to clean the 01:26:32.640 |
data, so we can look at the classification interpretation: the confusion matrix, top losses, the image cleaner widget, and so forth. 01:26:40.480 |
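(Roughly what that first model and the interpretation tools from notebook 02 look like; the epoch count here is just an assumption:)

```python
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(2)

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(12, 12), dpi=60)   # which breeds get confused with which
interp.plot_top_losses(5, nrows=1)                       # the items the model is most wrong about
# cleaner = ImageClassifierCleaner(learn)                # the cleaning widget, in a notebook
```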
Okay, now, one thing interesting here is that in notebook four we included 01:26:55.600 |
a loss function when we created a learner and here we don't pass in a loss function why 01:27:01.360 |
is that? That's because fastai will try to automatically pick a somewhat sensible loss 01:27:08.040 |
function for you and so for a image classification task it knows what loss function is the normal 01:27:16.400 |
one to pick and it's done it for you but let's have a look and see what it actually did pick 01:27:24.360 |
so we could have a look at learn.loss_func, and we will see it is cross entropy loss. 01:27:37.880 |
what on earth is cross entropy loss I'm glad you asked let's find out cross entropy loss 01:27:46.320 |
is really much the same as the MNIST loss we created, with that sigmoid and 01:27:54.200 |
the one minus predictions and predictions but it's it's a kind of extended version of 01:28:01.440 |
that and the extended version of that is that that torch dot where that we looked at in 01:28:09.080 |
notebook four only works when you have a binary outcome in that case it was is it a three 01:28:15.440 |
or not but in this case it we've got which of the 37 pet breeds is it so we want to kind 01:28:24.280 |
of create something just like that sigmoid and torch dot where that which also works 01:28:31.280 |
nicely for more than two categories so let's see how we can do that so first of all let's 01:28:41.940 |
grab a batch yes question why do we want to build a model before cleaning the data I would 01:28:52.480 |
think a clean data set would help in training yeah absolutely a clean data set helps in 01:28:59.600 |
training but remember as we saw in notebook 02 an initial model helps you clean the data 01:29:08.040 |
set so remember how plot top losses helped us identify mislabeled images and the confusion 01:29:15.800 |
matrix helped us recognize which things we were getting confused and might need you know 01:29:20.560 |
fixing and the image classifier cleaner actually let us find things like an image that contained 01:29:27.400 |
two bears rather than one bear and clean it up so a model is just a fantastic way to help 01:29:34.080 |
you zoom in on the data that matters which things seem to have the problems which things 01:29:39.600 |
are most important stuff like that so you would go through and you clean it with the 01:29:44.800 |
model helping you and then you go back and train it again with the clean data thanks 01:29:50.720 |
for that great question okay so in order to understand cross-entropy loss let's grab a 01:29:59.960 |
batch of data which we can use dls.one batch and that's going to grab a batch from the 01:30:09.280 |
training set; we could also go first(dls.train), and that's going to do exactly the same thing, 01:30:21.560 |
and so then we can de-structure that into the independent and dependent variable and 01:30:25.160 |
so the dependent variable shows us we've got a batch size of 64, and shows us the 64 categories, 01:30:41.480 |
and remember those numbers simply refer to the index into the vocab, so for example 01:30:47.320 |
16 is a boxer and so that all happens for you automatically when we say show batch it 01:30:55.500 |
shows us those strings so here's a first mini-batch and so now we can view the predictions that 01:31:04.880 |
is the activations of the final layer of the network by calling get preds and you can pass 01:31:11.320 |
in a data loader and a data loader can really be anything that's going to return a sequence 01:31:20.640 |
of mini-batches, so we can just pass in a list containing our mini-batch as a data loader, 01:31:26.920 |
and so that's going to get the predictions for one mini-batch but here's some predictions 01:31:32.080 |
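(A minimal sketch, assuming the dls and learn from above:)

```python
x, y = dls.one_batch()     # one mini-batch from the training set (or: x, y = first(dls.train))
y                          # 64 category indices into dls.vocab
dls.vocab[16]              # -> 'boxer', as mentioned above
preds, _ = learn.get_preds(dl=[(x, y)])   # a list with one mini-batch works as a data loader
preds[0], preds[0].sum(), len(preds[0])   # 37 probabilities that sum to 1
```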
okay so the actual predictions if we go preds 0.sum to grab the predictions for the first 01:31:42.240 |
image and add them all up they add up to 1 and there are 37 of them so that makes sense 01:31:51.080 |
right, it's like: the very first thing is, what is the probability that that is the first entry in the vocab? 01:32:00.440 |
so the first thing is what's the probability it's an Abyssinian cat it's 10 to the negative 01:32:05.400 |
6 you see and so forth so it's basically like it's not this it's not this it's not this 01:32:11.480 |
and you can look through and oh here this one here you know obviously what I think it 01:32:15.680 |
is so how did it you know so we obviously want the probabilities to sum to one because 01:32:24.920 |
it would be pretty weird if if they didn't it would say you know that the probability 01:32:30.320 |
of being one of these things is more than one or less than one which would be extremely 01:32:34.760 |
odd so how do we go about creating these predictions where each one is between 0 and 1 and they 01:32:45.320 |
all add up to 1 to do that we use something called softmax softmax is basically an extension 01:32:54.000 |
of sigmoid to handle more than two categories. So remember, the sigmoid function 01:33:01.560 |
look like this and we use that for our threes versus sevens model so what if we want 37 01:33:11.960 |
categories rather than two categories we need one activation for every category so actually 01:33:19.560 |
the threes and sevens model rather than thinking of that as an is three model we could actually 01:33:27.000 |
say oh, that has two categories, so let's actually create two activations, one representing how 01:33:32.200 |
three like something is and one representing how seven like something is so let's say you 01:33:39.960 |
know, let's just say that we have six MNIST digits, and 01:33:55.160 |
this first column were the activations of my model for one activation, and the second 01:34:05.240 |
column was for a second activation so my final layer actually has two activations now so 01:34:10.120 |
this is like how much like a three is it and this is how much like a seven is it but this 01:34:14.440 |
one is not at all like a three and it's slightly not like a seven this is very much like a 01:34:21.040 |
three and not much like a seven and so forth so we can take that model and rather having 01:34:25.840 |
rather than having one activation for like is three we can have two activations for how 01:34:30.920 |
much like a three how much like a seven so if we take the sigmoid of that we get two 01:34:38.720 |
numbers between naught and one but they don't add up to one so that doesn't make any sense 01:34:46.960 |
it can't be point six six chance it's a three and point five six chance it's a seven because 01:34:51.880 |
every digit in that data set is only one or the other so that's not going to work but 01:34:58.800 |
what we could do is we could take the difference between this value and this value and say 01:35:05.240 |
that's how likely it is to be a three so in other words this one here with a high number 01:35:10.360 |
here and a low number here is very likely to be a three so we could basically say in 01:35:17.400 |
the binary case these activations that what really matters is their relative confidence 01:35:24.220 |
of being a three versus a seven so we could calculate the difference between column one 01:35:29.900 |
and column two or column index zero and column index one right and here's the difference 01:35:35.240 |
between the two columns there's that big difference and we could take the sigmoid of that right 01:35:43.600 |
and so this is now giving us a single number between naught and one and so then since we 01:35:50.680 |
wanted two columns we could make column index zero the sigmoid and column index one could 01:35:57.840 |
be one minus that and now look these all add up to one so here's probability of three probability 01:36:06.900 |
of seven for the second one probably three probably seven and so forth so like that's 01:36:14.840 |
a way that we could go from having two activations for every image to creating two probabilities 01:36:27.840 |
each of which is between naught and one and each pair of which adds to one great how do 01:36:35.640 |
we extend that to more than two columns to extend it to more than two columns we use 01:36:42.040 |
this function which is called softmax. So softmax is equal to e to the x divided by 01:36:53.200 |
the sum of e to the x. Just to show you, if I go softmax on my activations I get 0.6025, 01:37:05.200 |
0.3975, 0.6025, 0.3975... 01:37:10.200 |
I get exactly the same thing right so softmax in the binary case is identical to the sigmoid 01:37:20.040 |
that we just looked at but in the multi-category case we basically end up with something like 01:37:28.000 |
this let's say we were doing the teddy bear grizzly bear brown bear and for that remember 01:37:33.800 |
our neural net is going to have the final layer will have three activations so let's 01:37:38.360 |
say it was 0.02, negative 2.49, and 1.25. So to calculate 01:37:43.480 |
softmax I first go e to the power of each of these three things, so here's e to the power 01:37:49.160 |
of 0.02, which is about 1.02, e to the power of negative 2.49, which is about 0.08, 01:37:54.080 |
and e to the power of 1.25, which is about 3.49. Okay, then I add them up, so there's the 01:37:59.160 |
sum of the exps, about 4.6, and then softmax will simply be 1.02 divided by 4.6, 01:38:05.160 |
and then this one will be 0.08 divided by 4.6, and this one will 01:38:09.640 |
be 3.49 divided by 4.6. So since each one of these represents 01:38:15.240 |
each number divided by the sum that means that the total is one okay and because all 01:38:23.160 |
of these are positive and each one is an item divided by the sum it means all of these must 01:38:28.600 |
be between naught and one so this shows you that softmax always gives you numbers between 01:38:35.300 |
naught and one and they always add up to one so to do that in practice you can just call 01:38:42.320 |
torch.softmax and it will give you the result of this function. So you should 01:38:51.120 |
experiment with this in your own time: you know, write this out by hand and try putting 01:38:57.600 |
in these numbers, and see how you get back the numbers I claim you're going to get back; make sure this makes sense to you. 01:39:03.920 |
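(Here's one way to write that experiment out, using the bear-example activations from above:)

```python
import torch

def softmax(x):
    # e to the x, divided by the sum of e to the x
    return torch.exp(x) / torch.exp(x).sum(dim=-1, keepdim=True)

acts = torch.tensor([0.02, -2.49, 1.25])
softmax(acts)                    # roughly tensor([0.2221, 0.0181, 0.7598])
torch.softmax(acts, dim=-1)      # PyTorch's built-in gives the same numbers
```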
So one of the interesting points about 01:39:08.480 |
softmax is remember I told you that exp is e to the power of something and now what that 01:39:16.840 |
means is that e to the power of something grows very very fast right so like exp of 01:39:29.420 |
four is about 54, exp of eight is about 2,980, 01:39:41.040 |
right it grows super fast and what that means is that if you have one activation that's 01:39:47.480 |
just a bit bigger than the others its softmax will be a lot bigger than the others so intuitively 01:39:54.320 |
the softmax function really wants to pick one class among the others which is generally 01:40:02.160 |
what you want right when you're trying to train a classifier to say which breed is it 01:40:07.840 |
you kind of want it to pick one and kind of go for it right and so that's what softmax 01:40:13.600 |
does that's not what you always want so sometimes at inference time you want it to be a bit 01:40:21.200 |
cautious and so you kind of got to remember that softmax isn't always the perfect approach 01:40:26.840 |
but it's the default it's what we use most of the time and it works well on a lot of 01:40:31.320 |
situations so that is softmax now in the binary case for the MNIST 3 versus 7 this was how 01:40:43.320 |
we calculated MNIST loss we took the sigmoid and then we did either one minus that or that 01:40:49.200 |
as our loss function which is fine as you saw it it worked right and so we could do 01:40:59.940 |
this exactly the same thing we can't use torch.where anymore because targets aren't just 0 or 1 01:41:07.000 |
targets could be any number from 0 to 36 so we could do that by replacing the torch.where 01:41:14.600 |
with indexing so here's an example for the binary case let's say these are our targets 01:41:21.160 |
0 1 0 1 1 0 and these are our softmax activations which we calculated before they're just some 01:41:28.880 |
random numbers just for a toy example so one way to do instead of doing torch.where we could 01:41:37.240 |
instead have a look at this I could grab all the numbers from 0 to 5 and if I index into 01:41:46.160 |
here with all the numbers from 0 to 5 and then my targets 0 1 0 1 1 0, then what that's 01:41:58.480 |
going to do is: for row 0 it'll pick 0.6, and then for row 1 it'll pick 01:42:06.960 |
0.49, for row 2 it'll pick 0.13, and so forth. So this is a super 01:42:22.040 |
nifty indexing expression which you should definitely play with right and it's basically 01:42:30.400 |
this trick of passing multiple things to the PyTorch indexer the first thing says which 01:42:36.840 |
rows should you return and the second thing says for each of those rows which column should 01:42:42.120 |
you return so this is returning all the rows and these columns for each one and so this 01:42:50.080 |
is actually identical to torch.where so isn't that tricky and so the nice thing is we can 01:42:59.280 |
now use that for more than just two values and so here's here's the fully worked out 01:43:08.320 |
thing: so I've got my threes column, I've got my sevens column, here's the target, here are the 01:43:13.240 |
indexes from 0 1 2 3 4 5, and so here row 0 gives 0.6, row 1 gives 0.49, row 2 gives 0.13, and so forth. So yeah, this works 01:43:29.080 |
just as well with more than two columns so we can add you know for doing a full MNIST 01:43:35.840 |
you know so all the digits from 0 to 9 we could have 10 columns and we would just be 01:43:40.320 |
indexing into the 10. So this thing we're doing, where we're going minus our activations matrix indexed by 01:43:52.240 |
all of the numbers from 0 to n and then our targets, is exactly the same as something that 01:43:58.640 |
already exists in PyTorch called F.nll_loss; as you can see, exactly the same. So, 01:44:05.400 |
again we're kind of seeing that these things inside PyTorch and fast.ai are just little 01:44:11.080 |
shortcuts for stuff we can write ourselves. And NLL loss stands for negative log likelihood; 01:44:18.720 |
again, sounds complex, but actually it's just this indexing expression. Rather confusingly, there's no log in it; we'll see why in a moment. 01:44:27.360 |
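(A sketch of that indexing trick and its PyTorch shortcut; the softmax values here are made up for illustration, roughly matching the toy example above:)

```python
import torch
import torch.nn.functional as F

targ = torch.tensor([0, 1, 0, 1, 1, 0])          # the toy targets from above
sm_acts = torch.tensor([[0.600, 0.400],          # illustrative softmax outputs;
                        [0.510, 0.490],          # each row sums to 1
                        [0.130, 0.870],
                        [0.003, 0.997],
                        [0.400, 0.600],
                        [0.370, 0.630]])
idx = torch.arange(6)
sm_acts[idx, targ]                               # picks column targ[i] from each row i
F.nll_loss(sm_acts, targ, reduction='none')      # the same numbers, negated
```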
So let's talk about logs. This 01:44:38.840 |
loss function works quite well as we saw in the notebook 04 it's basically this it is 01:44:45.320 |
exactly the same as we do in notebook 04 just a different way of expressing it but we can 01:44:51.120 |
actually make it better because remember the probabilities we're looking at are between 01:44:57.120 |
0 and 1 so they can't be smaller than 0 they can't be greater than 1 which means that if 01:45:02.400 |
our model is trying to decide whether to predict 0.99 or 0.999 it's going to think that those 01:45:08.620 |
numbers are very very close together but won't really care but actually if you think about 01:45:14.320 |
the error you know if there's like a thousand things then this would like be 10 things are 01:45:22.600 |
wrong and this would be like one thing is wrong but this is really like 10 times better 01:45:28.240 |
than this so really what we'd like to do is to transform the numbers between 0 and 1 to 01:45:36.040 |
instead be between negative infinity and infinity and there's a function that does exactly that 01:45:42.160 |
which is called logarithm. Okay, so the numbers we could have can be between 0 01:45:52.560 |
and 1, and as we get closer and closer to 0, the log goes down towards negative infinity, and then at 1 it's 01:46:04.880 |
going to be 0, and it can't go above 0 because our inputs can't go above 1. So 01:46:17.240 |
this logarithm in case you forgot hopefully you vaguely remember what logarithm is from 01:46:22.200 |
high school but the basically the definition is this if you have some number that is y 01:46:28.600 |
that is b to the power of a then logarithm is defined such that a equals the logarithm 01:46:36.640 |
of y to the base b; in other words, it tells you b to the power of what equals y, which is not that interesting 01:46:52.040 |
of itself but one of the really interesting things about logarithms is this very cool 01:46:57.560 |
relationship which is that log of a times b equals log of a plus log of b and we use 01:47:05.040 |
that all the time in deep learning and machine learning because this number here a times 01:47:13.040 |
b can get very very big or very very small if you multiply things a lot of small things 01:47:18.220 |
together you'll get a tiny number if you multiply a lot of big things together you'll get a 01:47:22.200 |
huge number it can get so big or so small that the kind of the precision in your computer's 01:47:28.480 |
floating point gets really bad where else this thing here adding is not going to get 01:47:35.360 |
out of control so we really love using logarithms like particularly in a deep neural net where 01:47:42.560 |
there's lots of layers we're kind of multiplying and adding many times so this kind of tends 01:47:47.680 |
to come out quite nicely so when we take the probabilities that we saw before the things 01:48:05.160 |
that came out of this function, and we take their logs, negate them, and take the mean, that is called 01:48:16.080 |
negative log likelihood and so this ends up being kind of a really nicely behaved number 01:48:23.640 |
because of this property of the log that we described so if you take the softmax and then 01:48:31.240 |
take the log and then pass that to nll_loss, because remember that didn't actually take 01:48:37.680 |
the log at all despite the name that gives you cross entropy loss so that leaves an obvious 01:48:47.120 |
question of why doesn't nll_loss actually take the log, and the reason for that is that 01:48:55.040 |
it's more convenient computationally to actually take the log back at the softmax step so PyTorch 01:49:01.800 |
has a function called log softmax and so since it's actually easier to do the log at the 01:49:11.880 |
softmax stage, it's just faster and more accurate, PyTorch assumes that you use log_softmax and 01:49:18.280 |
then pass that to nll_loss; so nll_loss does not do the log, it assumes that you've 01:49:25.000 |
done the log beforehand. So log_softmax followed by nll_loss is the definition of cross entropy 01:49:31.460 |
loss in PyTorch so that's our loss function and so you can pass that some activations 01:49:38.200 |
and some targets and get back a number and pretty much everything in PyTorch every one 01:49:44.700 |
of these kinds of functions you can either use the nn version as a class like this and 01:49:51.000 |
then call that object as if it's a function, or you can just use the F-dot version, 01:49:57.560 |
F.cross_entropy, as a function directly, and as you can see they're exactly the same number. People 01:50:04.960 |
normally use the class version in the documentation in PyTorch you'll see it normally uses the 01:50:11.560 |
class version so we'll tend to use the class version as well you'll see that it's returning 01:50:18.200 |
a single number and that's because it takes the mean because a loss needs to be as we've 01:50:22.720 |
discussed the mean but if you want to see the underlying numbers before taking the mean 01:50:28.560 |
you can just pass in reduction='none', and that shows you the individual cross-entropy losses before taking the mean. 01:50:34.080 |
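(A small sketch pulling those pieces together, with made-up activations and targets:)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

acts = torch.randn(6, 2)                  # raw (pre-softmax) activations, toy example
targ = torch.tensor([0, 1, 0, 1, 1, 0])

loss_func = nn.CrossEntropyLoss()                   # the class version
loss_func(acts, targ)                               # a single number: the mean loss
F.cross_entropy(acts, targ)                         # the functional version, same number
nn.CrossEntropyLoss(reduction='none')(acts, targ)   # per-item losses, before the mean

# and by definition, cross entropy is log_softmax followed by nll_loss
F.nll_loss(F.log_softmax(acts, dim=-1), targ)
```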
Okay, great, so this is a good place to stop with our discussion 01:50:57.160 |
of loss functions and such things Rachel were there any questions about this? Why does the 01:51:12.560 |
loss function need to be negative? Well I mean I guess it doesn't but it's we want something 01:51:23.780 |
that the lower it is the better and we kind of need it to cut off somewhere I have to 01:51:35.720 |
think about this more during the week, because I'm a bit tired; yeah, let me refresh my memory 01:51:44.760 |
when I'm awake. Okay, now, next week (well, note, for those watching the video, next week actually happened 01:51:59.320 |
last week, so the thing I'm about to say has actually already happened). So next week we're going to be talking 01:52:07.180 |
about data ethics and I wanted to kind of segue into that by talking about how my week's 01:52:14.200 |
gone, because a week or two ago, as part of a lesson, I actually talked about 01:52:25.680 |
the efficacy of masks and specifically wearing masks in public and I pointed out that the 01:52:33.840 |
efficacy of masks seemed like it could be really high and maybe everybody should be 01:52:38.520 |
wearing them and somehow I found myself as the face of a global advocacy campaign and 01:52:52.200 |
so if you go to masksforall.co you'll find a website talking about masks and I've been 01:53:09.800 |
on you know TV shows in South Africa and the US and England and Australia and on radio 01:53:17.480 |
and blah blah blah talking about masks why is this well it's because as a data scientist 01:53:28.840 |
you know I noticed that the data around masks seem to be getting misunderstood and it seemed 01:53:36.560 |
that that misunderstanding was costing possibly hundreds of thousands of lives you know literally 01:53:43.680 |
in the places that were using masks it seemed to be associated with you know orders of magnitude 01:53:49.720 |
fewer deaths and one of the things we're talking about next week is like you know what's your 01:53:56.480 |
role as a data scientist and you know I strongly believe that it's to understand the data and 01:54:03.920 |
then do something about it and so nobody was talking about this so I ended up writing an 01:54:12.320 |
article that appeared in the Washington Post that basically called on people to really 01:54:18.760 |
consider wearing masks which is this article and you know I was lucky I managed to kind 01:54:32.800 |
of get a huge team of brilliant not huge a pretty decent sized team of brilliant volunteers 01:54:39.120 |
who helped you know kind of build this website and kind of some PR folks and stuff like that 01:54:45.080 |
but what became clear, and I was talking to politicians, you know, senators, staffers, 01:54:54.120 |
what was becoming clear is that people weren't convinced by the science which is fair enough 01:55:01.200 |
because it's it's hard to you know when the WHO and the CDC is saying you don't need to 01:55:07.960 |
wear a mask and some random data scientist is saying but doesn't seem to be what the 01:55:12.640 |
data is showing you know you've got half a brain you would pick the WHO and the CDC not 01:55:17.680 |
the random data scientist. So I really felt like, if I was going to be an effective advocate, 01:55:17.680 |
I needed to sort the science out, and, you know, credentialism is strong, and so it wouldn't 01:55:23.240 |
be enough for me to say it; I needed to find other people to say it. So I put together 01:55:31.040 |
a team of 19 scientists including you know a professor of sociology a professor of aerosol 01:55:46.320 |
dynamics the founder of an African movement that's that kind of studied preventative methods 01:55:54.000 |
for tuberculosis a Stanford professor who studies mask disposal and cleaning methods 01:56:05.960 |
a bunch of Chinese scientists who study epidemiology modeling a UCLA professor who is one of the 01:56:16.680 |
top infectious disease epidemiologists experts and so forth so like this kind of all-star 01:56:24.320 |
team of people from all around the world and I had never met any of these people before 01:56:29.640 |
so, well, no, not quite true, I knew Austin a little bit and I knew Zeynep a little bit, 01:56:34.440 |
I knew Lex a little bit but on the whole you know and well Reshma we all know she's awesome 01:56:43.280 |
so it was great to actually have a fast AI community person there too and so but yeah 01:56:50.160 |
I kind of tried to pull together people from you know as many geographies as possible and 01:56:56.720 |
as many areas of expertise as possible and you know the kind of the global community 01:57:03.520 |
helped me find papers about about everything about you know how different materials work 01:57:12.840 |
about how droplets form about epidemiology about case studies of people infecting with 01:57:23.280 |
and without masks blah blah blah and we ended up in the last week basically we wrote this 01:57:29.600 |
paper it contains 84 citations and you know we basically worked around the clock on it 01:57:40.600 |
as a team, and it's out, and it's been sent to a number of governments; some of the earlier versions, 01:57:49.640 |
three or four days ago, we sent to some governments. So one of the things is, in this team I tried 01:57:55.560 |
to look for people who were you know working closely with government leaders not just that 01:58:00.600 |
they're scientists and so this this went out to a number of government ministers and in 01:58:07.360 |
the last few days I've heard that it was a very significant part of decisions by governments 01:58:15.600 |
to change their guidelines around masks, and you know the fight's not over 01:58:25.280 |
by any means in particular the UK is a bit of a holdout but I'm going to be on ITV tomorrow 01:58:33.600 |
and then BBC the next day you know it's it's kind of required stepping out to be a lot 01:58:39.760 |
more than just a data scientist I've had to pull together you know politicians and staffers 01:58:46.080 |
I've had to you know you know hustle with the media to try and get you know coverage 01:58:53.560 |
and you know today I'm now starting to do a lot of work with unions to try to get unions 01:58:58.460 |
to understand this you know it's really a case of like saying okay as a data scientist 01:59:03.920 |
and, in conjunction with real scientists, we've built this really strong understanding 01:59:11.000 |
that masks are, you know, this simple but incredibly powerful tool; but that doesn't do anything unless 01:59:18.400 |
I can effectively communicate this to decision makers so today I was you know on the phone 01:59:24.480 |
to you know one of the top union leaders in the country explaining what this means basically 01:59:33.040 |
it turns out that in buses in America the kind of the air conditioning is set up so 01:59:38.000 |
that it blows from the back to the front and there's actually case studies in the medical 01:59:42.440 |
literature of how people that are seated kind of downwind of an air conditioning unit in 01:59:49.600 |
a restaurant ended up all getting sick with COVID-19 and so we can see why like bus drivers 01:59:56.200 |
are dying because they're like they're right in the wrong spot here and their passengers 02:00:02.800 |
aren't wearing masks so I kind of try to explain this science to union leaders so that they 02:00:11.360 |
understand that to keep the workers safe it's not enough just for the driver to wear a mask 02:00:17.160 |
but all the people on the bus need to be wearing masks as well so you know all of this is basically 02:00:23.040 |
to say you know as data scientists I think we have a responsibility to study the data 02:00:34.520 |
and then do something about it it's not just a research you know exercise it's not just 02:00:40.680 |
a computation exercise you know what's the point of doing things if it doesn't lead to 02:00:46.360 |
anything so yeah so next week we'll be talking about this a lot more but I think you know 02:00:58.560 |
this is a really to me kind of interesting example of how digging into the data can lead 02:01:07.320 |
to really amazing things happening and and in this case I strongly believe and a lot 02:01:13.600 |
of people are telling me they strongly believe that this kind of advocacy work that's come 02:01:18.600 |
out of this data analysis is already saving lives and so I hope this might help inspire 02:01:24.920 |
you to to take your data analysis and to take it to places that it really makes a difference 02:01:31.640 |
so thank you very much and I'll see you next week.