
Lesson 4 - Deep Learning for Coders (2020)


Chapters

0:00 Review of Lesson 3 + SGD
3:09 MNIST Loss Function
5:56 What is a Dataset in PyTorch?
7:47 Initializing our parameters
10:51 Predicting images with matrix multiplication
14:27 Why you shouldn't use accuracy loss function to update parameters
18:40 Creating a good loss function
27:12 Updating parameters with mini-batches and DataLoader
33:18 Putting it all together
43:30 Refactoring and Creating an optimizer
48:50 The DataLoaders class
49:32 The Learner class
51:09 Adding a non-linearity to create a neural network
61:53 Looking at what the NN is learning by looking at the parameters
65:29 Comparing the results with the fastai toolkit
66:19 Jargon review
68:30 Is there a rule of thumb for which non-linearity to choose?
69:43 Pet breeds image classification
77:10 Presizing
82:51 Checking and debugging a DataBlock
84:40 Presizing (question)
85:45 Training model to clean your data
86:46 How fastai chooses a loss function
111:49 Data Ethics and Efficacy of Masks for COVID-19

Whisper Transcript

00:00:00.000 | Welcome back. And here is lesson 4, which is where we get deep into the weeds of exactly
00:00:12.320 | what is going on when we are training a neural network. And we started looking at this in
00:00:18.280 | the previous lesson, we were looking at stochastic gradient descent. And so to remind you, we
00:00:25.200 | were looking at what Arthur Samuel said. Suppose we arrange for some automatic means of testing
00:00:31.600 | the effectiveness of any current weight assignment, or we would call it parameter assignment in
00:00:37.760 | terms of actual performance and provide a mechanism for altering the weight assignment
00:00:42.880 | so as to maximize that performance. So we could make that entirely automatic and a machine
00:00:48.600 | so programmed would learn from its experience. And that was our goal. So our initial attempt
00:00:56.360 | on the MNIST data set was not really based on that. We didn't really have any parameters.
00:01:03.940 | So then last week we tried to figure out how we could parameterize it, how we could create
00:01:10.000 | a function that had parameters. And what we thought we could do would be to have something
00:01:15.720 | where say the probability of being some particular number was expressed in terms of the pixels
00:01:21.640 | of that number and some weights, and then we would just multiply them together and add
00:01:28.520 | them up. So we looked at how stochastic gradient descent worked last week. And the basic idea
00:01:40.740 | is that we start out by initializing the parameters randomly. We use them to make a prediction
00:01:49.240 | using a function such as this one. We then see how good that prediction is by measuring
00:01:58.040 | using a loss function. We then calculate the gradient which is how much would the loss
00:02:03.320 | change if I changed one parameter by a little bit. We then use that to make a small step
00:02:11.320 | to change each of the parameters by a little bit by multiplying the learning rate by the
00:02:16.560 | gradient to get a new set of predictions. And so we went round and round and round a
00:02:20.520 | few times until eventually we decided to stop. And so these are the basic seven steps that
00:02:30.440 | we went through. And so we did that for a simple quadratic equation. And we had something which
00:02:40.080 | looked like this. And so by the end, we had this nice sample of a curve getting closer
00:02:49.080 | and closer and closer. So I have a little summary at the start of this section, summarizing
00:02:59.920 | gradient descent that Sylvain and I have in the notebooks in the book of what we just
00:03:04.960 | did. So you can review that and make sure it makes sense to you. So now let's use this
00:03:12.140 | to create our MNIST threes versus sevens model. And so to create a model, we're going to need
00:03:20.720 | to create something that we can pass into a function like, let's see where it was, passing
00:03:29.520 | to a function like this one. So we need just some pixels that are all lined up and some
00:03:34.960 | parameters that are all lined up. And then we're going to sum them up. So our X's are
00:03:44.000 | going to be pixels. And so in this case, because we're just going to multiply each pixel by
00:03:48.800 | a parameter and add them up, the fact that they're laid out in a grid is not important.
00:03:54.500 | So let's reshape those grids and turn them into vectors. The way we reshape things in
00:04:02.960 | PyTorch is by using the View method. And so the View method, you can pass to it how large
00:04:10.500 | you want each dimension to be. And so in this case, we want the number of columns to be
00:04:18.520 | equal to the total number of pixels in each picture, which is 28 times 28, because they're
00:04:24.760 | 28 by 28 images. And then the number of rows will be however many rows there are in the
00:04:29.840 | data. And so if you just use minus one, when you call View, that means, you know, as many
00:04:36.480 | as there are in the data. So this will create something of the same with the same total
00:04:40.560 | number of elements that we had before. So we can grab all our threes. We can concatenate
00:04:47.000 | them, torch.cat, with all of our 7s, and then reshape that into a matrix where each row
00:04:55.840 | is one image with all of the rows and columns of the image all lined up in a single vector.
00:05:02.480 | So then we're going to need labels. So that's our x. So we're going to need labels. Our
00:05:07.480 | labels will be a 1 for each of the threes and a 0 for each of the 7s. So basically we're
00:05:15.280 | going to create an is3 model. So that's going to create a vector. We actually need it to
00:05:24.960 | be a matrix in PyTorch. So unsqueeze will add an additional unit dimension to wherever
00:05:36.880 | I've asked for. So here in position 1. So in other words, this is going to turn it from
00:05:41.140 | something which is a vector of 12,396 long into a matrix with 12,396 rows and one column.
00:05:52.960 | That's just what PyTorch expects to see. So now we're going to turn our x and y into a
00:06:00.280 | data set. And a data set is a very specific concept in PyTorch. It's something which we
00:06:07.480 | can index into using square brackets. And when we do so, it's expected to return a tuple.
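For our case, such a dataset can be as simple as a list of tuples, roughly:

```python
# Pair each flattened image with its label; indexing then returns a tuple.
dset = list(zip(train_x, train_y))
x, y = dset[0]       # destructuring the tuple
x.shape, y           # (torch.Size([784]), tensor([1]))
```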
00:06:17.980 | So here if we look at, we're going to create this data set. And when we index into it,
00:06:28.160 | it's going to return a tuple containing our independent variable and our dependent variable
00:06:34.160 | for each particular row. And so to do that, we can use the Python zip function, which
00:06:41.860 | takes one element of the first thing and combines it with, concatenates it with one element
00:06:48.680 | of the second thing. And then it does that again and again and again. And so then if
00:06:52.880 | we create a list of those, it gives us a data set. It gives us a list, which when we index
00:07:00.280 | into it, it's going to contain one image and one label. And so here you can see why there's
00:07:07.440 | my label and my image. I won't print out the whole thing, but it's a 784 long vector. So
00:07:15.240 | that's a really important concept. A data set is something that you can index into and
00:07:20.100 | get back a tuple. And here I am, this is called destructuring the tuple, which means I'm taking
00:07:26.960 | the two parts of the tuple and putting the first part in one variable and the second
00:07:30.980 | part in the other variable, which is something we do a lot in Python. It's pretty handy. A
00:07:35.160 | lot of other languages support that as well. Repeat the same three steps for a validation
00:07:40.600 | set. So we've now got a training data set and a validation data set. Right. So now we
00:07:49.220 | need to initialize our parameters. And so to do that, as we've discussed, we just do
00:07:56.080 | it randomly. So here's a function that, given some size, some shape if you like, will randomly
00:08:05.200 | initialize using a normal random number distribution in PyTorch. That's what randn does. And we
00:08:13.160 | can hit Shift + Tab to see how that works. Okay. And it says here that it's going to
00:08:25.120 | have a variance of 1. So I probably shouldn't have called this standard deviation. I probably
00:08:29.640 | should call this variance actually. So multiply it by the variance to change its variance
00:08:35.400 | to whatever is requested, which will default to 1. And then as we talked about when it
00:08:41.400 | comes to calculating our gradients, we have to tell PyTorch which things we want gradients
00:08:48.000 | for. And the way we do that is requires grad underscore. Remember this underscore at the
00:08:53.240 | end is a special magic symbol, which tells PyTorch that we want this function to actually
00:08:59.320 | change the thing that it's referring to. So this will change this tensor such that it
00:09:08.400 | requires gradients. So here's some weights. So our weights are going to need to be 28
00:09:15.400 | by 28 by 1 shape, 28 by 28 because every pixel is going to need a weight. And then 1 because
00:09:24.400 | we're going to need again, we're going to need to have that unit access to make it into
00:09:29.680 | a column. So that's what PyTorch expects. So there's our weights. Now just weights by
00:09:40.380 | pixels actually isn't going to be enough because weights by pixels will always equal 0 when
00:09:45.960 | the pixels are equal to 0. It has a 0 intercept. So we really want something which like wx
00:09:50.880 | plus b, a line. So the b is we call the bias. And so that's just going to be a single number.
00:09:58.640 | So let's grab a single number for our bias. So remember I told you there's a difference
00:10:04.880 | between the parameters and weights, so actually speaking. So here the weights are the w in
00:10:11.560 | this equation, the bias is b in this equation, and the weights and bias together is the parameters
00:10:20.440 | of the function. They're all the things that we're going to change. They're all the things
00:10:23.320 | that have gradients that we're going to update. So there's an important bit of jargon for
00:10:28.360 | you. The weights and biases of the model are the parameters. So we can, yes question. What's
00:10:39.600 | the difference between gradient descent and stochastic gradient descent? So far we've
00:10:46.360 | only done gradient descent. We'll be doing stochastic gradient descent in a few minutes.
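Back to the parameters: roughly, the initialization just described looks like this (init_params follows the notebook's helper of that name):

```python
import torch

def init_params(size, std=1.0):
    # Normally distributed random numbers, scaled by std, with gradient tracking turned on.
    return (torch.randn(size) * std).requires_grad_()

weights = init_params((28*28, 1))   # one weight per pixel, laid out as a single column
bias = init_params(1)               # a single bias number
```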
00:10:51.160 | So we can now create a calculated prediction for one image. So we can take an image such
00:10:56.500 | as the first one and multiply by the weights. We need to transpose them to make them line
00:11:01.980 | up in terms of the rows and columns and add it up and add the bias and there is a prediction.
00:11:11.480 | We want to do that for every image. We could do that with a for loop and that would be
00:11:17.040 | really, really slow. It wouldn't run on the GPU and it wouldn't run in optimized C code.
00:11:24.120 | So whenever we're doing something like looping over pixels or looping over images,
00:11:30.640 | you always need to try to make sure you're doing that without a Python for loop. In this
00:11:35.480 | case doing this calculation for lots of rows and columns is a mathematical operation called
00:11:43.200 | matrix multiplication. So if you've forgotten your matrix multiplication or maybe never
00:11:49.960 | quite got around to it at high school. It would be a good idea to have a look at Khan
00:11:55.160 | Academy or something to learn about what it is, but it's actually, I'll give you the quick
00:12:00.880 | answer. This is from Wikipedia. If these are two matrices A and B, then this element here
00:12:08.960 | 1, 2 in the output is going to be equal to the first bit here times the first bit here
00:12:16.840 | plus the second bit here times the second bit here. So it's going to be B12 times A11
00:12:23.320 | plus B22 times A12. That's, you can see the orange matches the orange. Ditto for over
00:12:31.520 | here. This would be equal to B13 times A31 plus B23 times A32 and so forth for every
00:12:38.760 | part. Here's a great picture of that in action. If you look at matrix multiplication.xyz, another
00:12:54.480 | way to think of it is we can kind of flip the second bit over on top and then multiply
00:13:00.640 | each bit together and add them up, multiply each bit together and add them up. And you
00:13:05.560 | can see always the second one here and ends up in the second spot and the first one ends
00:13:09.320 | up in the first spot. And that's what matrix multiplication is. So we can do our multiply
00:13:22.920 | and add up by using matrix multiplication. And in Python and therefore PyTorch matrix
00:13:30.240 | multiplication is the @ sign operator. So when you see @ that means matrix multiply.
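So the model so far is roughly this one line, using the weights and bias from above:

```python
def linear1(xb):
    # Matrix multiply each row of xb (one image per row) by the weight column, then add the bias.
    return xb @ weights + bias

preds = linear1(train_x)   # one prediction per image, shape (12396, 1)
```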
00:13:38.100 | So here is our 20.2336. If I do a matrix multiply of our training set by our weights and then
00:13:50.800 | we add the bias and here is our 20.2336 for the first one. And you can see though it's
00:13:56.040 | doing every single one. So that's really important is that matrix multiplication gives us an
00:14:02.960 | optimized way to do these simple linear functions for as many kind of rows and columns as we
00:14:08.800 | want. So this is one of the two fundamental equations of any neural network. Some rows
00:14:18.520 | of data, rows and columns of data, matrix multiply, some weights, add some bias. And
00:14:23.960 | the second one which we'll see in a moment is an activation function. So that is some
00:14:31.700 | predictions from our randomly initialized model. So we can check how good our model
00:14:36.960 | is. And so to do that we can decide that anything greater than zero we will call a 3 and anything
00:14:46.280 | less than zero we will call a 7. So preds greater than zero tells us whether or not something
00:14:54.280 | is predicted to be a 3 or not. Then turn that into a float. So rather than true and false
00:14:59.760 | make it 1 and 0 because that's what our training set contains. And then check whether our thresholded
00:15:07.080 | predictions are equal to our training set. And this will return true every time a row
00:15:15.200 | is correctly predicted and false otherwise. So if we take all those trues and falses and
00:15:20.800 | turn them into floats so that'll be ones and zeros and then take their mean it's 0.49.
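A sketch of that accuracy check, assuming preds and train_y from above:

```python
# Threshold at zero, compare with the labels, then average the resulting 1s and 0s.
corrects = (preds > 0.0).float() == train_y
accuracy = corrects.float().mean().item()   # roughly 0.5 for randomly initialized weights
```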
00:15:26.720 | So not surprisingly our randomly initialized model is right about half the time at predicting
00:15:31.480 | threes from sevens. I added one more method here which is .item. Without .item this would
00:15:39.520 | return a tensor. It's a rank 0 tensor it has no rows it has no columns it just it's just
00:15:46.160 | a number on its own. But I actually wanted to unwrap it to create a normal Python scalar
00:15:52.360 | mainly just because I wanted to see the easily see the full set of decimal places. And the
00:15:57.720 | reason for that is I want to show you how we're going to calculate the derivative on
00:16:01.960 | the accuracy by changing a parameter by a tiny bit. So let's take one parameter which will
00:16:09.680 | be weight 0 and multiply it by 1.0001. And so that's going to make it a little bit bigger.
00:16:19.160 | And then if I calculate how the accuracy changes based on the change in that weight that will
00:16:28.080 | be the gradient of the accuracy with respect to that parameter. So I can do that by calculating
00:16:36.520 | my new set of predictions and then I can threshold them and then I can check whether they're equal
00:16:40.400 | to the training set and then take the mean and I get back exactly the same number. So
00:16:49.080 | remember that gradient is equal to rise over run if you remember back to your calculus
00:16:58.760 | or if you'd forgotten your calculus hopefully you've reviewed it on Khan Academy. So the
00:17:05.000 | change in the y so y new minus y old which is 0.4912 etc minus 0.4912 etc which is 0
00:17:18.400 | divided by this change will give us 0. So at this point we have a problem our derivative
00:17:26.480 | is 0 so we have 0 gradients which means our step will be 0 which means our prediction
00:17:34.340 | will be unchanged. Okay so we have a problem and our problem is that our gradient is 0
00:17:45.720 | and with a gradient of 0 we can't take a step and we can't get better predictions. And so
00:17:53.400 | intuitively speaking the reason that our gradient is 0 is because when we change a single pixel
00:17:59.800 | by a tiny bit we might not ever in any way change an actual prediction to change from
00:18:07.000 | a 3 predicting a 3 to a 7 or vice versa because we have this threshold. Okay and so in other
00:18:17.800 | words our accuracy loss function here is very bumpy it's like flat step flat step flat step
00:18:27.720 | so it's got this 0 gradient all over the place. So what we need to do is use something other
00:18:35.280 | than accuracy as our loss function. So let's try and create a new function and what this
00:18:46.040 | new function is going to do is it's going to give us a better value kind of in much
00:18:54.240 | the same way that accuracy gives a better value. So this is the loss remember a small
00:18:58.960 | loss is better so it'll give us a lower loss when the accuracy is better but it won't have
00:19:05.240 | a 0 gradient. So it means that a slightly better prediction needs to have a slightly
00:19:14.160 | better loss. So let's have a look at an example. Let's say our targets, our labels of
00:19:23.600 | is-3 (there's just three rows, three images here), are 1, 0, 1. Okay, and we've made some predictions
00:19:32.020 | from a neural net and those predictions gave us 0.9 0.4 0.2. So now consider this loss function
00:19:42.660 | a loss function we're going to use torch.where which is basically the same as this list comprehension
00:19:49.080 | it's basically an if statement. So it's going to say for where target equals 1 we're going
00:19:56.840 | to return 1 minus predictions so here target is 1 so it'll be 1 minus 0.9 and where target
00:20:03.400 | is not 1 it'll just be predictions. So for these examples here the first one target equals
00:20:12.080 | 1 will be 1 minus 0.9 which is 0.1 the next one is target equals 0 so it will be the prediction
00:20:23.600 | just 0.4 and then for the third one it's a 1 for target so it'll be 1 minus prediction
00:20:30.360 | which is 0.8 and so you can see here when the prediction is correct correct in other
00:20:37.720 | words it's a number you know it's a high number when the target is 1 and a low number when
00:20:43.320 | the target is 0 these numbers are going to be smaller. So the worst one is when we predicted
00:20:50.040 | 0.2 so we're pretty we really thought that was actually a 0 but it's actually a 1 so
00:20:56.480 | we ended up with a 0.8 here because this is 1 minus prediction 1 minus 0.2 is 0.8. So
00:21:05.680 | we can then take the mean of all of these to calculate a loss. So if you think about
00:21:12.400 | it this loss will be the smallest if the predictions are exactly right. So if we did predictions
00:21:27.280 | is actually identical to the targets then this will be 0 0 0 okay or else if they were
00:21:39.280 | exactly wrong let's say they were 1 minus then it's 1 1 1. So it's going to be the loss
00:21:50.000 | will be better i.e. smaller when the predictions are closer to the targets and so here we can
00:21:59.240 | now take the mean and when we do we get here 0.433. So let's say we change this last bad
00:22:10.920 | one this inaccurate prediction from 0.2 to 0.8 and the loss gets better from 0.43 to
00:22:21.200 | 0.23. This function is just a torch.where followed by a mean. So this is actually pretty good: this is actually
00:22:28.760 | a loss function which pretty closely tracks accuracy whereas the accuracy is better the
00:22:33.920 | loss will be smaller but also it doesn't have these zero gradients because every time we
00:22:39.280 | change the prediction the loss changes because the prediction is literally part of the loss
00:22:45.280 | that's pretty neat isn't it. One problem is this is only going to work well as long as
00:22:51.080 | the predictions are between 0 and 1 otherwise this 1 minus prediction thing is going to
00:22:55.440 | look a bit funny. So we should try and find a way to ensure that the predictions are always
00:23:01.440 | between 0 and 1 and that's also going to just make a lot more intuitive sense because you
00:23:07.080 | know we like to be able to kind of think of these as if they're like probabilities or
00:23:10.540 | at least nicely scaled numbers. So we need some function that can take our numbers have
00:23:20.120 | a look. It's something which can take these big numbers and turn them all into numbers
00:23:27.680 | between 0 and 1 and it so happens that we have exactly the right function it's called
00:23:36.200 | the sigmoid function. So the sigmoid function looks like this. If you pass in a really small
00:23:42.080 | number you get a number very close to 0 if you pass in a big number you get a number
00:23:47.000 | very close to 1 it never gets past 1 and it never goes smaller than 0 and then it's kind
00:23:54.400 | of like the smooth curve between and in the middle it looks a lot like the y = x line.
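Written out, it is roughly:

```python
import torch

def sigmoid(x):
    # Squashes any number into the range (0, 1); torch.sigmoid is the built-in version.
    return 1 / (1 + torch.exp(-x))
```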
00:24:01.000 | This is the definition of the sigmoid function. It's 1 over 1 plus e to the minus x. What
00:24:11.120 | is exp? exp is just e to the power of something. So if we look at e it's just a number like
00:24:24.960 | pi this is a simple it's just a number that has a particular value. So if we go e squared
00:24:34.120 | and we look at it's going to be a tensor, use pytorch, make it a float, there we go.
00:24:50.080 | You can see that these are the same number so that's what torch.exp means. Okay so you
00:25:00.640 | know for me when I see these kinds of interesting functions I don't worry too much about the
00:25:06.560 | definition. What I care about is the shape. So you can have a play around with graphing
00:25:12.240 | calculators or whatever to kind of see why it is that you end up with this shape from
00:25:16.920 | this particular equation but for me I just never think about that. It never really matters
00:25:24.760 | to me. What's important is this sigmoid shape which is what we want. It's something that
00:25:29.520 | squashes every number to be between 0 and 1. So we can change mnist_loss to be exactly
00:25:39.040 | the same as it was before, except we put everything through sigmoid first and then
00:25:45.920 | use torch.where. So that is a loss function that has all the properties we want. It's
00:25:52.840 | something which is not going to have any of those nasty 0 gradients and we've ensured
00:25:59.480 | that the input to the where is between 0 and 1. So the reason we did this is because our
00:26:11.960 | accuracy is kind of what we really care about, a good accuracy, but we can't use it to get
00:26:20.400 | our gradients to create our step to improve our parameters. So we can change our accuracy
00:26:33.400 | to another function that is similar in terms of it it's better when the accuracy is better
00:26:39.640 | but it also does not have these 0 gradients. And so you can see now where why we have a
00:26:44.940 | metric and a loss. The metric is the thing we actually care about. The loss is the thing
00:26:50.280 | that's similar to what we care about that has a nicely behaved gradient. Sometimes the thing
00:26:59.320 | you care about your metric does have a nicely defined gradient and you can use it directly
00:27:03.520 | as a loss. For example, we often use mean squared error but for classification unfortunately
00:27:10.560 | not. So we need to now use this to update the parameters. And so there's a couple of
00:27:22.160 | ways we could do this. One would be to loop through every image, calculate a prediction
00:27:28.240 | for that image and then calculate a loss and then do a step and then step through the parameters
00:27:38.160 | and then do that again for the next image and the next image and the next image. That's
00:27:42.440 | going to be really slow because we're doing a single step for a single image. So that
00:27:49.600 | would mean an epoch would take quite a while. We could go much faster by doing every single
00:27:55.680 | image in the dataset. So a big matrix multiplication, it can all be parallelized on the GPU and then
00:28:03.480 | so then we can we could then do a step based on the gradients looking at the entire dataset.
00:28:12.960 | But now that's going to be like a lot of work to just update the weights once. And remember
00:28:19.960 | sometimes our datasets have millions or tens of millions of items. So that's probably a
00:28:25.320 | bad idea too. So why not compromise? Let's grab a few data items at a time to calculate
00:28:33.680 | our loss and our step. If we grab a few data items at a time, those few data items are
00:28:39.280 | called a mini-batch. And a mini-batch just means a few pieces of data. And so the size
00:28:47.520 | of your mini-batch is called, not surprisingly, the batch size. So the bigger the batch size,
00:28:53.000 | the closer you get to the full size of your dataset, the longer it's going to take
00:28:56.960 | to calculate a single set of losses, a single step. But the more accurate it's going to
00:29:04.360 | be, it's going to be like the gradients are going to be much closer to the true dataset
00:29:09.440 | gradients. And then the smaller the batch size, the faster each step will be able to
00:29:14.600 | do, but those steps will represent a smaller number of items. And so they won't be such
00:29:19.920 | an accurate approximation of the real gradient of the whole dataset.
00:29:28.000 | Is there a reason the mean of the loss is calculated over, say, doing a median, since
00:29:32.680 | the median is less prone to getting influenced by outliers? In the example you gave, if the
00:29:39.720 | third point, which was wrongly predicted as an outlier, then the derivative would push
00:29:44.840 | the function away while doing SGD, and a median could be better in that case.
00:29:50.680 | Honestly, I've never tried using a median. The problem with a median is it ends up really
00:29:58.640 | only caring about one number, which is the number in the middle. So it could end up really
00:30:05.680 | pretty much ignoring all of the things at each end. In fact, all it really cares about
00:30:10.520 | is the order of things. So my guess is that you would end up with something that is only
00:30:15.720 | good at predicting one thing in the middle, but I haven't tried it. It would be interesting
00:30:22.280 | to see. Well, I guess the other thing that would happen with a median is you would have
00:30:27.520 | a lot of zero gradients, I think, because it's picking the thing in the middle and you
00:30:32.200 | could, you know, change your values and the thing in the middle. Well, it wouldn't be
00:30:37.760 | zero gradients, but bumpy gradients. I think in the middle would suddenly jump to being
00:30:41.040 | a different item. So it might not behave very well. That's my guess. You should try it.
00:30:50.200 | Okay. So how do we ask for a few items at a time? It turns out that PyTorch and FastAI
00:30:59.840 | provide something to do that for you. You can pass in any data set to this class called
00:31:07.320 | data loader and it will grab a few items from that data set at a time. You can ask for how
00:31:13.200 | many by asking for a batch size. And then you can, as you can see, it will grab a few
00:31:20.720 | items at a time until it's grabbed all of them. So here I'm saying let's create a collection
00:31:25.560 | that just contains all the numbers from 0 to 14. Let's pass that into a data loader
00:31:31.200 | with a batch size of 5. And then that's going to be something, it's called an iterator in
00:31:36.640 | Python. It's something that you can ask for one more thing from an iterator. If you pass
00:31:40.520 | an iterator to list in Python, it returns all of the things from the iterator. So here
00:31:45.920 | are my three mini batches and you'll see here all the numbers from 0 to 14 appear. They
00:31:51.320 | appear in a random order and they appear five at a time. They appear in random order because
00:31:55.880 | shuffle equals true. So normally in the training set we ask for things to be shuffled. So it
00:32:01.640 | gives us a little bit more randomization. More randomization is good because it makes it
00:32:07.040 | harder for it to kind of learn what the data set looks like. So that's what a data loader,
00:32:14.160 | that's how a data loader is created. Now remember though that our data sets actually return tuples.
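That example looks roughly like this (DataLoader here comes from the notebook's usual fastai import; PyTorch's own DataLoader behaves the same way for this toy case):

```python
from fastai.vision.all import *   # the notebook's usual star import; provides DataLoader

coll = range(15)
dl = DataLoader(coll, batch_size=5, shuffle=True)
list(dl)   # three mini-batches of five elements each, in shuffled order
```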
00:32:25.480 | And here I've just got single ints. So let's actually create a tuple. So if we enumerate
00:32:30.640 | all the letters of English, then that means that returns (0, 'a'), (1, 'b'), (2, 'c'), etc. Let's make that
00:32:38.360 | our data set. So if we pass that to a data loader with a batch size of 6, and as you
00:32:45.360 | can see it returns tuples containing 6 of the first things and the associated 6 of the
00:32:55.920 | second things. So this is like our independent variable and this is like our dependent variable.
00:33:03.760 | And so and then at the end, you know, the batch size won't necessarily exactly divide
00:33:10.520 | nicely into the full size of the data set. You might end up with a smaller batch.
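And the tuple version, roughly:

```python
import string

# enumerate pairs each letter with its index: (0, 'a'), (1, 'b'), (2, 'c'), ...
ds = list(enumerate(string.ascii_lowercase))
dl = DataLoader(ds, batch_size=6, shuffle=True)
xb, yb = next(iter(dl))   # one mini-batch: a tensor of six indices and the six matching letters
```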
00:33:19.640 | So basically then we already have a data set, remember. And so we could pass it to a data
00:33:26.440 | loader and then we can basically say this, an iterator in Python is something that you
00:33:31.200 | can actually loop through. So when we say for in data loader, it's going to return a
00:33:37.160 | tuple. We can de-structure it into the first bit and the second bit. And so that's going
00:33:44.160 | to be our x and y. We can calculate our predictions, we can calculate our loss from the predictions
00:33:50.000 | and the targets, we can ask it to calculate our gradients and then we can update our parameters
00:33:57.800 | just like we did in our toy SGD example for the quadratic equation. So let's reinitialize
00:34:05.240 | our weights and bias with the same two lines of code before. Let's create the data loader
00:34:10.120 | this time from our actual MNIST data set and create a nice big batch size. So we did plenty
00:34:15.560 | of work each time. And just to take a look, let's just grab the first thing from the data
00:34:20.960 | loader. First is a fast AI function, which just grabs the first thing from an iterator.
00:34:26.720 | Just it's useful to look at, you know, kind of an arbitrary mini batch. So here is the
00:34:33.000 | shape. We're going to have the first mini batch is 256 rows of 784 long, that's 28 by 28.
00:34:40.520 | So 256 flattened out images and 256 labels that are one long because that's just the
00:34:48.160 | number zero or the number one, depending on whether it's a three or a seven. Do the same
00:34:53.400 | for the validation set. So here's our validation data loader. And so let's grab a batch here,
00:35:05.960 | testing, pass it into, well, why do we do that? We should, yeah, I guess, yeah, actually
00:35:18.560 | for our testing, I'm going to just manually grab the first four things just so that we
00:35:23.640 | can make sure everything lines up. So let's grab just the first four things. We'll call
00:35:27.760 | that a batch. Pass it into that linear function we created earlier. Remember linear was just
00:35:43.080 | batch @ weights (a matrix multiply) plus bias. And so that's going to give us four results.
00:35:54.840 | That's a prediction for each of those four images. And so then we can calculate the loss
00:36:00.760 | using that loss function we just used. And let's just grab the first four items of the
00:36:05.040 | training set and there's the loss. Okay. And so now we can calculate the gradients. And
00:36:12.960 | so the gradients are 784 by one. So in other words, it's a column where every weight has
00:36:21.160 | a gradient, it's what's the change in loss for a small change in that parameter. And
00:36:28.760 | then the bias has a gradient that's a single number because the bias is just a single number.
00:36:34.880 | So, we can take those three steps and put it in a function. So if you pass, if you,
00:36:42.320 | this is calculate gradient, you pass it an X batch, a Y batch and some model, then it's
00:36:47.200 | going to calculate the predictions, calculate the loss and do the backward step. And here
00:36:52.920 | we see calculate gradient. And so we can get the, just to take a look, the mean of the
00:36:57.960 | weights gradient and the bias gradient. And there it is. If I call it a second time and
00:37:05.320 | look, notice I have not done any step here. This is exactly the same parameters. I get
00:37:11.520 | a different value. That's a concern. You would expect to get the same gradient every time
00:37:17.840 | you called it with the same data. Why have the gradients changed? That's because loss
00:37:24.640 | dot backward does not just calculate the gradients. It calculates the gradients and adds them to
00:37:32.000 | the existing gradients, the things in the dot grad attribute. The reasons for that will
00:37:39.280 | come to you later, but for now the thing to know is just it does that. So actually what
00:37:44.120 | we need to do is to call grad dot zero underscore. So dot zero returns a tensor containing zeros.
00:37:53.440 | And remember underscore does it in place. So that updates the weights dot grad attribute,
00:37:58.840 | which is a tensor to contain zeros. So now if I do that and call it again, I will get
00:38:05.680 | exactly the same number. So here is how you train one epoch with SGD. Loop through the
00:38:14.960 | data loader, grabbing the X batch and the Y batch, calculate the gradient, prediction
00:38:21.440 | loss backward. Go through each of the parameters and we're going to be passing those in. So
00:38:28.760 | there's going to be the 768 weights and the one bias. And then for each of those, update
00:38:35.760 | the parameter to go minus equals gradient times learning rate. That's our gradient descent
00:38:43.280 | step and then zero it out for the next time around the loop. I'm not just saying p minus
00:38:51.680 | equals. I'm saying p dot data minus equals. And the reason for that is that remember PyTorch
00:38:58.920 | keeps track of all of the calculations we do so that it can calculate the gradient.
00:39:05.200 | Well I don't want to calculate in the gradient of my gradient descent step. That's like not
00:39:10.760 | part of the model, right? So dot data is a special attribute in PyTorch where if you
00:39:16.720 | write to it, it tells PyTorch not to update the gradients using that calculation. So this
00:39:25.120 | is your most basic standard SGD stochastic gradient descent loop. So now we can answer
00:39:32.480 | that earlier question. The difference between stochastic gradient descent and gradient descent
00:39:37.680 | is that gradient descent does not have this here that loops through each mini-batch. For
00:39:46.080 | gradient descent, it does it on the whole data set each time around. So train epoch
00:39:51.720 | for gradient descent would simply not have the for loop at all, but instead it would
00:39:57.760 | calculate the gradient for the whole data set and update the parameters based on the
00:40:01.960 | whole data set, which we never really do in practice. We always use mini-batches of various
00:40:07.920 | sizes. Okay, so we can take the function we had before where we compare the predictions
00:40:21.800 | to whether that, well we used to be comparing the predictions to whether they were greater
00:40:26.160 | or less than zero, right? But now that we're doing the sigmoid, remember the sigmoid will
00:40:31.400 | squish everything between 0 and 1. So now we should compare the predictions to whether
00:40:35.880 | they're greater than 0.5 or not. If they're greater than 0.5, just look back at our sigmoid
00:40:41.040 | function. So what used to be 0 is now, on the sigmoid, 0.5. Okay, so we need just
00:40:52.840 | to make that slight change to our measure of accuracy. So to calculate the accuracy
00:41:03.360 | for some X batch and some Y batch, the X batch here is actually assumed to be the predictions.
00:41:09.960 | Then we take the sigmoid of the predictions, we compare them to 0.5 to tell us whether
00:41:15.240 | it's a 3 or not, we check what the actual target was to see which ones are correct,
00:41:20.640 | and then we take the mean of those after converting the Booleans to floats. So we can check that.
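Roughly, that accuracy function, plus the validation helper the next few sentences describe, look like this (valid_dl is the validation data loader built above):

```python
def batch_accuracy(xb, yb):
    # xb here is actually the model's predictions for a batch
    preds = xb.sigmoid()
    correct = (preds > 0.5) == yb
    return correct.float().mean()

def validate_epoch(model):
    # accuracy over every validation mini-batch, stacked and averaged
    accs = [batch_accuracy(model(xb), yb) for xb, yb in valid_dl]
    return round(torch.stack(accs).mean().item(), 4)
```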
00:41:27.720 | Accuracy, let's take our batch, put it through our simple linear model, compare it to the
00:41:33.880 | four items of the training set, and there's the accuracy. So if we do that for every batch
00:41:41.120 | in the validation set, then we can loop through with a list comprehension every batch in the
00:41:46.800 | validation set, get the accuracy based on some model, stack those all up together so
00:41:56.080 | that this is a list, right? So if we want to turn that list into a tensor where the
00:42:01.000 | items of the list of the tensor are the items of the list, that's what stack does. So we
00:42:06.600 | can stack up all those, take the mean, convert it to a standard Python scalar by calling
00:42:14.160 | dot item, round it to four decimal places just for display. And so here is our validation
00:42:21.880 | set accuracy as you would expect. It's about 50% because it's random. So we can now train
00:42:28.480 | for one epoch. So we can say, remember train epoch needed the parameters. So our parameters
00:42:37.400 | in this case are the weights tensor and the bias tensor. So train one epoch using the
00:42:44.360 | linear one model with the learning rate of one with these two parameters and then validate
00:42:52.460 | and look at that. Our accuracy is now 68.8%. So we've trained an epoch. So let's just repeat
00:43:02.880 | that 20 times, train and validate. And you can see the accuracy goes up and up and up
00:43:09.840 | and up and up to about 97%. So that's cool. We've built an SGD optimizer of a simple linear
00:43:21.120 | function that is getting about 97% on our simplified MNIST where there's just the threes
00:43:28.360 | and the sevens. So a lot of steps there. Let's simplify this through some refactoring.
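For reference, here is roughly the whole from-scratch version gathered in one place, using the names from the notebook (dl and valid_dl are the data loaders built above):

```python
def mnist_loss(predictions, targets):
    predictions = predictions.sigmoid()
    # 1 - prediction where the label is 1 (a three), the prediction itself where it is 0 (a seven)
    return torch.where(targets == 1, 1 - predictions, predictions).mean()

def calc_grad(xb, yb, model):
    preds = model(xb)
    loss = mnist_loss(preds, yb)
    loss.backward()

def train_epoch(model, lr, params):
    for xb, yb in dl:                 # dl is the training DataLoader from above
        calc_grad(xb, yb, model)
        for p in params:
            p.data -= p.grad * lr     # the gradient descent step
            p.grad.zero_()            # reset the gradients for the next mini-batch

lr = 1.
params = weights, bias
for i in range(20):
    train_epoch(linear1, lr, params)
    print(validate_epoch(linear1), end=' ')
```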
00:43:37.280 | So the kind of simple refactoring we're going to do, we're going to do a couple, but the
00:43:40.800 | basic idea is we're going to create something called an optimizer class. The first thing
00:43:45.520 | we'll do is we'll get rid of the linear one function. Remember the linear one function
00:43:53.480 | does x at w plus b. There's actually a class in PyTorch that does that equation for us.
00:44:03.840 | So we might as well use it. It's called nn.linear. And nn.linear does two things. It does that
00:44:11.440 | function for us and it also initializes the parameters for us. So we don't have to do
00:44:19.360 | weights and bias with init_params anymore. We just create an nn.Linear object and that's going
00:44:27.280 | to create a matrix of size 28 by 28 comma 1 and a bias of size 1. It will set requires
00:44:35.680 | grad equals true for us. It's all going to be encapsulated in this class and then when
00:44:40.240 | I call that as a function, it's going to do my x at w plus b. So to see the parameters
00:44:51.320 | in it, we would expect it to contain 784 weights and one bias. We can just call dot parameters
00:44:58.800 | and we can de-structure it to w comma b and see, yep, it is 784 and 1 for the weights
00:45:06.720 | and bias. So that's cool. So this is just, you know, it could be an interesting exercise
00:45:12.760 | for you to create this class yourself from scratch. You should be able to at this point
00:45:19.320 | so that you can confirm that you can recreate something that behaves exactly like an nn.linear.
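For example, a minimal sketch:

```python
from torch import nn

linear_model = nn.Linear(28*28, 1)   # does xb @ weights.T + bias, and initializes both for us
w, b = linear_model.parameters()
w.shape, b.shape                     # torch.Size([1, 784]), torch.Size([1])
```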
00:45:26.440 | So now that we've got this object which contains our parameters in a parameters method, we
00:45:33.760 | can now create an optimizer. So for our optimizer, we're going to pass it the parameters to optimize
00:45:39.380 | and a learning rate. We'll store them away and we'll have something called step which
00:45:45.600 | goes through each parameter and does that thing we just saw. p dot data minus equals
00:45:50.600 | p dot grad times learning rate and it's also going to have something called zero grad which
00:45:55.760 | goes through each parameter and zeros it out or we could even just set it to none. So that's
00:46:01.920 | the thing we're going to call basic optimizer. So those are exactly the same lines of code
00:46:06.320 | we've already seen wrapped up into a class. So we can now create an optimizer passing
00:46:11.940 | in the parameters of the linear model for these and our learning rate. And so now our
00:46:18.400 | training loop is loop through each mini batch in the data loader, calculate the gradient,
00:46:24.920 | opt dot step, opt dot zero grad, that's it. Validation function doesn't have to change
00:46:35.640 | and so let's put our training loop into a function that's going to loop through a bunch
00:46:38.640 | of epochs, call train epoch, print validate epoch and then run it and it's the same. We're getting
00:46:49.880 | a slightly different result here but much the same idea. Okay so that's cool right we've
00:46:59.880 | now refactoring using you know create our own optimizer and using faster pytorch is built
00:47:08.120 | in nn.linear class and you know by the way we don't actually need to use our own basic
00:47:14.560 | optimizer. Not surprisingly pytorch comes with something which does exactly this and
00:47:20.120 | not surprisingly it's called SGD. So and actually this SGD is provided by fastai, fastai and
00:47:26.920 | pytorch provide some overlapping functionality they work much the same way. So you can pass
00:47:33.520 | to SGD your parameters and your learning rate just like basic optimizer. Okay and train
00:47:40.800 | it and get the same result. So as you can see these classes that are in fastai and pytorch
00:47:49.120 | are not mysterious they're just pretty you know in wrappers around functionality that
00:47:57.960 | we've now written ourself. So there's quite a few steps there and if you haven't done
00:48:03.880 | gradient descent before then there's a lot of unpacking. So this lesson is kind of the
00:48:12.400 | key lesson it's the one where you know like we should you know really take a stop and
00:48:17.720 | a deep breath at this point and make sure you're comfortable. What's a data set? What's
00:48:23.520 | a data loader? What's nn.linear? What's SGD? And if you you know if any or all of those
00:48:31.440 | don't make sense go back to where we defined it from scratch using Python code. Well the
00:48:38.080 | data loader we didn't define from scratch but it you know the functionality is not particularly
00:48:42.680 | interesting. You could certainly create your own from scratch if you wanted to that would
00:48:47.080 | be another pretty good exercise. Let's refactor some more. Fastai has a data loaders class
00:48:58.720 | which is as we've mentioned before is a tiny class that just you pass it a bunch of data
00:49:05.120 | loaders and it just stores them away as a dot train and a dot valid. Even though it's
00:49:10.200 | a tiny class it's it's super handy because with that we now have a single object that
00:49:16.800 | knows all the data we have and so it can make sure that your training data loader is shuffled
00:49:22.800 | and your validation loader isn't shuffled you know make sure everything works properly.
00:49:28.020 | So that's what the data loaders class is you can pass in the training and valid data loader
00:49:33.840 | and then the next thing we have in fastai is the learner class and the learner class
00:49:38.520 | is something where we're going to pass in our data loaders. We're going to pass in our
00:49:44.560 | model we're going to pass in our optimization function we're going to pass in our loss function
00:49:51.880 | we're going to pass in our metrics. So all the stuff we've just done manually that's
00:49:58.120 | all learner does is it's just going to do that for us so it's just going to call this
00:50:04.320 | train model and this train epoch it's just you know it's inside learner. So now if we
00:50:11.120 | go learn.fit you can see again it's doing the same thing getting the same result and
00:50:20.000 | it's got some nice functionality it's printing it out into a pretty table for us and it's
00:50:23.460 | showing us the losses and the accuracy and how long it takes but there's nothing magic
00:50:28.480 | right you've been able to do exactly the same thing by hand using Python and PyTorch. So
00:50:37.080 | these abstractions are here to like let you write less code and to save some time and
00:50:41.880 | to save some cognitive overhead but they're not doing anything you can't do yourself.
00:50:49.120 | And that's important right because if the if they're doing things you can't do yourself
00:50:54.920 | then you can't customize them you can't debug them you know you can't profile them. So we
00:51:02.400 | want to make sure that the stuff we're using is stuff that we understand what it's doing.
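Putting those refactorings together looks roughly like this (dl, valid_dl, mnist_loss and batch_accuracy as defined above; DataLoaders, Learner and SGD come from fastai):

```python
dls = DataLoaders(dl, valid_dl)
learn = Learner(dls, nn.Linear(28*28, 1), opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)
learn.fit(10, lr=1.)
```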
00:51:09.380 | So this, just a linear function, is not great; we want a neural network. So how do we turn
00:51:18.520 | this into a neural network or remember this is a linear function x at w plus b to turn
00:51:27.320 | it into a neural network we have two linear functions exactly the same but with different
00:51:34.160 | weights and different biases and in between this magic line of code which takes the result
00:51:40.920 | of our first linear function and then does a max between that and 0. So a max of res
00:51:48.760 | and 0 is going to take any negative numbers and turn them into zeros. So we're going to
00:51:55.000 | do a linear function we're going to replace the negatives with 0 and then we're going
00:51:59.960 | to take that and put it through another linear function that believe it or not is a neural
00:52:05.360 | net. So w1 and w2 are weight tensors b1 and b2 are bias tensors just like before so we
00:52:13.200 | can initialize them just like before and we could now call exactly the same training code
00:52:19.720 | that we did before to train these. So res.max 0 is called a rectified linear unit which
00:52:32.880 | you will always see referred to as ReLU, and PyTorch already has this
00:52:42.140 | function it's called f.relu and so if we plot it you can see it's as you'd expect it's 0
00:52:49.480 | for all negative numbers and then it's y equals x for positive numbers. So you know here's
00:52:59.440 | some jargon rectified linear unit sounds scary sounds complicated but it's actually this incredibly
00:53:06.720 | tiny line of code this incredibly simple function and this happens a lot in deep learning things
00:53:14.080 | that sound complicated and sophisticated and impressive turn out to be normally super simple
00:53:21.080 | frankly at least once you know what it is. So why do we do linear layer ReLU linear layer
00:53:31.000 | well if we got rid of the middle if we got rid of the middle ReLU and just went linear
00:53:42.960 | layer linear layer then you could rewrite that as a single linear layer when you multiply
00:53:49.560 | things and add and then multiply things and add and you can just change the coefficients
00:53:54.440 | and make it into a single multiply and then add. So no matter how many linear layers we
00:53:58.700 | stack on top of each other we can never make anything more kind of effective than a simple
00:54:05.760 | linear model but if you put a non-linearity between the linear layers then actually you
00:54:13.120 | have the opposite this is now where something called the universal approximation theorem
00:54:18.480 | holds which is that if the size of the weight and bias matrices are big enough this can
00:54:24.720 | actually approximate any arbitrary function including the function of how do I recognize
00:54:31.480 | threes from sevens or whatever. So that's kind of amazing, right, this tiny thing is actually
00:54:40.520 | a universal function approximator as long as you have w1 b1 w2 and b2 have the right
00:54:48.360 | numbers and we know how to make them the right numbers you use SGD could take a very long
00:54:53.840 | time could take a lot of memory but the basic idea is that there is some solution to any
00:55:02.000 | computable problem and this is one of the biggest challenges a lot of beginners have
00:55:09.840 | to deep learning is that there's nothing else to it like there's often this like okay how
00:55:17.120 | do I make a neural net oh that is a neural net or how do I do deep learning training
00:55:24.040 | where there's SGD, there's things to like make it train a bit faster, there's, you know, things
00:55:30.920 | that mean you need a few less parameters, but everything from here is just performance tweaks
00:55:41.640 | honestly right so this is you know this is the key understanding of training a neural
00:55:50.400 | network okay we can simplify things a bit more we already know that we can use nn.linear
00:55:59.960 | to replace the weight and bias so let's do that for both of the linear layers and then
00:56:08.760 | since we're simply taking the result of one function and passing it into the next and
00:56:19.320 | take the result of that function pass it to the next and so forth and then return the
00:56:22.840 | end this is called function composition function composition is when you just take the result
00:56:28.440 | of one function pass it to a new one take a result of one function pass it to a new one
00:56:33.560 | and so every pretty much neural network is just doing function composition of linear
00:56:39.820 | layers and these are called activation functions or non-linearities so PyTorch provides something
00:56:47.360 | to do function composition for us and it's called nn.sequential so it's going to do a
00:56:53.360 | linear layer, pass the result to a ReLU, pass the result to a linear layer. You'll see here
00:56:59.440 | I'm not using f.relu I'm using nn.relu this is identical returns exactly the same thing
00:57:04.920 | but this is a class rather than a function yes Rachel by using the non-linearity won't
00:57:15.440 | using a function that makes all negative output zero make many of the gradients in the network
00:57:19.760 | zero and stop the learning process due to many zero gradients well that's a fantastic
00:57:27.100 | question and the answer is yes it does but there won't be zero for every image and remember
00:57:35.920 | the mini batches are shuffled so even if it's zero for every image in one mini batch it
00:57:41.760 | won't be for the next mini batch and it won't be the next time around we go for another
00:57:45.120 | epoch so yes it can create zeros and if if the neural net ends up with a set of parameters
00:57:55.380 | such that lots and lots of inputs end up as zeros you can end up with whole mini batches
00:58:02.080 | that are zero, and you can end up in a situation where some of the neurons remain inactive. Inactive
00:58:14.280 | means they're zero and they're basically dead units and this is a huge problem it basically
00:58:21.960 | means you're wasting computation so there's a few tricks to avoid that which we'll be
00:58:27.120 | learning about a lot one simple trick is to not make this thing flat here but just make
00:58:34.680 | it a less steep line that's called a leaky ReLU, a leaky rectified linear unit, and they
00:58:43.560 | help a bit as we'll learn though even better is to make sure that we just kind of initialize
00:58:49.400 | to sensible initial values that are not too big and not too small and step by sensible
00:58:55.440 | amounts that are particularly not too big and generally if we do that we can keep things
00:59:01.360 | in the zone where they're positive most of the time but we are going to learn about how
00:59:06.420 | to actually analyze inside a network and find out how many dead units we have how many of
00:59:10.640 | these zeros we have, because as you point out they are bad news: they
00:59:15.960 | don't do any work, and they'll continue to not do any work if enough of the inputs
00:59:22.880 | end up being zero.
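To recap, the two-layer network described a few minutes ago looks roughly like this with nn.Sequential (30 being the arbitrary number of hidden activations used in the notebook):

```python
simple_net = nn.Sequential(
    nn.Linear(28*28, 30),   # first linear layer: 784 pixel inputs, 30 outputs
    nn.ReLU(),              # replace negative activations with zero
    nn.Linear(30, 1)        # second linear layer: 30 inputs, one final prediction
)
```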
00:59:35.240 | Okay, so now that we've got a neural net, we can use exactly the same Learner we had before,
00:59:41.260 | but this time we'll pass in the simple net instead of the linear one. Everything else is the
00:59:48.040 | same and we can call fit just like before. Here we've gone from one layer to two, and I'm only
00:59:53.920 | counting the parameterized layers as layers (you could say it's three, I'm just going to call
00:59:58.180 | it two, there's two trainable layers), and I've dropped my learning rate from
01:00:03.080 | one to zero point one because the deeper models you know tend to be kind of bumpier less nicely
01:00:10.120 | behaved so often you need to use lower learning rates and so we train it for a while okay
01:00:16.360 | and we can actually find out what that training looks like by looking inside our learner and
01:00:24.000 | there's an attribute we create for you called recorder and that's going to record well everything
01:00:29.720 | that appears in this table basically well these three things the training loss the validation
01:00:34.360 | loss and the accuracy or any metrics so recorder dot values contains that kind of table of
01:00:42.240 | results and so item number two of each row will be the accuracy and so the capital
01:00:53.380 | L class which I'm using here has a nice little method called itemgot that will get
01:01:03.880 | the second item from every row and then I can plot that to see how the training went
01:01:12.360 | and I can get the final accuracy like so by grabbing the last row of the table and grabbing
01:01:19.380 | index number two (counting zero, one, two), and my final accuracy, not bad, ninety eight point three
01:01:27.680 | percent so this is pretty amazing we now have a function that can solve any problem to any
01:01:36.760 | level of accuracy if we can find the right parameters and we have a way to find hopefully
01:01:43.000 | the best or at least a very good set of parameters for any function so this is kind of the magic
01:01:51.000 | yes Rachel how could we use what we're learning here to get an idea of what the network is
01:01:57.960 | learning along the way like Zeiler and Fergus did more or less we will look at that later
01:02:07.480 | not in the full detail of their paper but basically you can look in the dot parameters
01:02:14.840 | to see the values of those parameters and at this point well I mean why don't you try
01:02:21.320 | it yourself right you've actually got now the parameters so if you want to grab the
01:02:28.780 | model you can actually see learn dot model so we can we can look inside learn dot model
01:02:38.000 | to see the actual model that we just trained and you can see it's got the three things
01:02:47.880 | in it the linear the value of the linear and what I kind of like to do is to put that into
01:02:53.320 | a variable make it a bit easy to work with and you can grab one layer by indexing in
01:03:02.960 | you can look at the parameters and that just gives me a something called a generator it's
01:03:10.760 | something that will give me a list of the parameters when I ask for them so I can just
01:03:14.600 | go weight comma bias equals to de-structure them and so the weight is 30 by 784 because
01:03:30.120 | that's what I asked for so one of the things to note here is that to create a neural net
01:03:40.400 | so something with more than one layer I actually have 30 outputs not just one right so I'm
01:03:47.240 | kind of generating lots of you can think of generating lots of features so it's kind of
01:03:50.520 | like 30 different linear linear models here and then I combine those 30 back into one
01:03:58.680 | like 30 different linear models here and then I combine those 30 back into one
01:04:08.900 | in the first row we could reshape that into the original shape of the images and we could
01:04:24.800 | even have a look and there it is right so you can see this is something so this is cool
01:04:35.120 | right we can actually see here we've got something which is which is kind of learning to find
01:04:46.280 | things at the top and the bottom and the middle and so we could look at the second one okay
01:04:54.120 | no idea what that's showing and so some of them are kind of you know I've probably got
01:04:59.680 | far more than I need which is why they're not that obvious but you can see yeah here's
01:05:04.960 | another thing it's looking pretty similar here's something that's kind of looking for
01:05:09.920 | this little bit in the middle. So yeah, this is the basic idea. To understand the features
01:05:17.920 | of not the first layer but later layers you have to be a bit more sophisticated, but
01:05:23.960 | to see the first-layer ones you can just plot them.
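Roughly, that poking around looks like this (a sketch assuming `learn` is the neural-net Learner trained above and that fastai's `show_image` helper is imported):

```python
m = learn.model                  # Sequential: Linear(784, 30), ReLU, Linear(30, 1)
w, b = m[0].parameters()         # the first layer's weights (30 x 784) and biases (30)
print(w.shape, b.shape)
show_image(w[0].view(28, 28))    # reshape one row of weights back to 28x28 and plot it
```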
01:05:33.280 | Okay, so then, just to compare, we could use the full fastai toolkit: grab our data loaders by using
01:05:41.160 | ImageDataLoaders.from_folder as we've done before, create a cnn_learner with a ResNet, and fit it for
01:05:47.520 | a single epoch and whoa 99.7 right so we did 40 epochs and got 98.3 as I said using all
01:05:59.560 | the tricks you can really speed things up and make things a lot better and so by the
01:06:05.640 | end of this course or at least both parts of this course you'll be able to from scratch
01:06:13.900 | get this 99.7 in a single epoch. All right, so jargon — just to remind us: ReLU, a function
01:06:30.200 | that returns zero for negatives; mini-batch, a few inputs and labels which optionally are
01:06:38.280 | randomly selected the forward pass is the bit where we calculate the predictions the
01:06:43.560 | loss is the function that we're going to take the derivative of and then the gradient is
01:06:48.480 | the derivative of the loss with respect to each parameter the backward pass is when we
01:06:54.680 | calculate those gradients gradient descent is that full thing of taking a step in the
01:07:00.080 | direction opposite to the gradients, after calculating the loss, and then the learning
01:07:04.760 | rate is the size of the step that we take other things to know perhaps the two most
01:07:17.960 | important pieces of jargon are all of the numbers that are in a neural network the numbers
01:07:23.520 | that we're learning are called parameters and then the numbers that we're calculating
01:07:29.240 | so every value that's calculated every matrix multiplication element that's calculated they're
01:07:35.160 | called activations so activations and parameters are all of the numbers in the neural net and
01:07:42.420 | so be very careful when I say from here on in in these lessons activations or parameters
01:07:48.760 | you got to make sure you know what those mean because that's that's the entire basically
01:07:53.360 | almost the entire set of numbers that exist inside a neural net so activations are calculated
01:07:59.600 | parameters are learned we're doing this stuff with tensors and tensors are just regularly
01:08:07.600 | shaped arrays rank 0 tensors we call scalars rank 1 tensors we call vectors rank 2 tensors
01:08:14.540 | we call matrices and we continue on to rank 3 tensors rank 4 tensors and so forth and
01:08:21.520 | rank 5 tensors are very common in deep learning so don't be scared of going up to higher numbers
01:08:27.400 | of dimensions okay so let's have a break oh we've got a question okay is there a rule
01:08:35.700 | of thumb for what non-linearity to choose given that there are many yeah there are many
01:08:41.840 | non-linearities to choose from and it doesn't generally matter very much which you choose
01:08:46.440 | so let's choose ReLU or leaky ReLU or, yeah, whatever — any one should work fine. Later
01:08:56.640 | on we'll look at the minor differences between them, but it's not so much something
01:09:02.960 | that you pick on a per-problem basis; it's more like some take a little bit longer and are a little
01:09:08.040 | bit more accurate, and some are a bit faster and a little bit less accurate. That's a good question.
01:09:14.320 | okay so before you move on it's really important that you finish the questionnaire for this
01:09:18.560 | chapter because there's a whole lot of concepts that we've just done so you know try to go
01:09:24.680 | through the questionnaire go back and relook at the notebook and please run the code through
01:09:30.920 | the experiments and make sure it makes sense all right let's have a seven minute break
01:09:37.000 | see you back here in seven minutes time okay welcome back so now that we know how to create
01:09:51.560 | and train a neural net let's cycle back and look deeper at some applications and so we're
01:09:59.200 | going to try to kind of interpolate in from one end we've done the kind of from scratch
01:10:05.800 | version at the other end we've done the kind of four lines of code version and we're going
01:10:10.320 | to gradually nibble at each end until we find ourselves in the middle and we've
01:10:15.680 | touched on all of it so let's go back up to the kind of the four lines of code version
01:10:20.960 | and and delve a little deeper so let's go back to pets and let's think though about
01:10:33.000 | like how do you actually you know start with a new data set and figure out how to use it
01:10:44.920 | so, you know, for the data sets we provide, it's easy enough to untar them — you just say untar_data
01:10:50.320 | and it will download it and untar it. If it's a data set that you're getting yourself, you can just
01:10:57.320 | use the terminal or Python or whatever. So let's assume we have a path that's pointing
01:11:05.000 | at something so initially you don't you don't know what that something is so we can start
01:11:12.480 | by doing ls to have a look and see what's inside there so the pets data set that we
01:11:17.720 | saw in lesson one contains three things annotations images and models and you'll see we have this
01:11:25.520 | little trick here where we say Path.BASE_PATH equals the path to our data,
01:11:31.600 | and that just does a simple little thing where, when we print paths out, it
01:11:35.640 | shows them relative to this path, which is a bit convenient. So if you go and
01:11:44.960 | have a look at the read me for the original pets data set it tells you what these images
01:11:51.200 | and annotations folders are and not surprisingly the images path if we go path images that's
01:11:58.640 | how we use pathlib to grab a subdirectory, and then ls — we can see here are
01:12:05.920 | the paths to the images. As it mentions here, most functions and methods in fastai
01:12:12.920 | which return a collection don't return a Python list but they return a capital L, and a capital
01:12:20.440 | L as we briefly mentioned is basically an enhanced list one of the enhancements is the
01:12:26.200 | way it prints the representation of it starts by showing you how many items there are in
01:12:31.080 | the list in the collection so there's seven thousand three hundred and ninety four images
01:12:36.800 | and if there's more than ten things it truncates it and just says "..." to
01:12:43.480 | avoid filling up your screen. So there are a couple of little conveniences there.
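As a rough sketch, that initial poking around at a new dataset looks something like this:

```python
from fastai.vision.all import *

path = untar_data(URLs.PETS)       # download and untar if needed; returns a Path
Path.BASE_PATH = path              # print paths relative to `path`, the little trick mentioned above
print(path.ls())                   # the top-level folders, returned as a fastai L, not a plain list
print((path/'images').ls()[:3])    # a few of the image file paths
```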
01:12:51.160 | we can see from this output that the file name as we mentioned in lesson one if the
01:13:00.080 | first letter is a capital it means it's a cat and if the first letter is lowercase it
01:13:06.640 | means it's a dog but this time we've got to do something a bit more complex well a lot
01:13:10.760 | more complex, which is figure out what breed it is. And you can see the breed is kind
01:13:16.320 | of — in the file name, it's everything up to the last underscore,
01:13:22.280 | before this number — that's the breed. So we want to label everything with its breed, so
01:13:30.200 | we're going to take advantage of this structure so the way I would do this is to use a regular
01:13:41.940 | expression a regular expression is something that looks at a string and basically lets
01:13:46.880 | you kind of pull it apart into its pieces in very flexible way it's this kind of simple
01:13:51.520 | little language for doing that if you haven't used regular expressions before please google
01:13:58.520 | regular expression tutorial now and look it's going to be like one of the most useful tools
01:14:03.120 | you'll come across in your life — I use them almost every day. I won't go into details about
01:14:10.000 | how to use them since there are so many great tutorials, and there's also a lot of great
01:14:13.360 | like exercises you know there's regex regex is short for regular expression there's regex
01:14:18.680 | crosswords, there's regex Q&A, there's all kinds of cool regex things — a lot of people like
01:14:24.760 | me love this tool. There's also a regex lesson in the fast AI NLP course, maybe
01:14:33.400 | even two regex lessons oh yeah I'm sorry for forgetting about the fast AI NLP course what
01:14:40.400 | an excellent resource that is so regular expressions are a bit hard to get right the first time
01:14:50.680 | so the best thing to do is to get a sample string, and a good way to do that would
01:14:54.880 | be to just grab one of the file names and pop it in fname, and then you can experiment
01:15:02.160 | with regular expressions. So re is the regular expression module in Python, and findall will
01:15:11.760 | just grab all the parts of a regular expression that have parentheses around them. So this
01:15:17.400 | regular expression — and r'' is a special kind of string in Python which basically says don't
01:15:22.960 | treat backslash as special, because normally in Python backslash n means a new line —
01:15:29.520 | so here's a string which I'm going to capture any letter one or more times followed by an
01:15:39.680 | underscore followed by a digit one or more times followed by anything I probably should
01:15:47.600 | have used backslash dot that's fine followed by the letters jpg followed by the end of
01:15:52.560 | the string. And so if I call that regular expression against my file name's name attribute — oh, looks good, right?
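A minimal sketch of that experiment (the sample file name shown in the comment is just an illustration):

```python
import re

fname = (path/'images').ls()[0]                    # e.g. something like great_pyrenees_173.jpg
print(re.findall(r'(.+)_\d+.jpg$', fname.name))    # capture everything before the final _<digits>.jpg
```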
01:16:02.880 | so we kind of check it out so now that seems to work we can create a data block where the
01:16:09.400 | independent variables are images the dependent variables are categories just like before
01:16:15.040 | get_items is going to be get_image_files, we're going to split it randomly as per usual, and
01:16:23.580 | then we're going to get the label by calling RegexLabeller, which is just a handy little
01:16:33.000 | fastai class which labels things with a regular expression. We can't call the regular expression —
01:16:39.920 | this particular expression — directly on the pathlib Path object; we actually want to call
01:16:45.040 | it on the name attribute, and fastai has a nice little function called using_attr — using
01:16:52.040 | attribute — which takes this function and changes it to a function which will be passed this
01:16:58.080 | attribute. So that's going to be using RegexLabeller on the name attribute. And then from that data
01:17:08.520 | block we can create the data loaders as usual there's two interesting lines here resize
01:17:16.160 | and aug_transforms. aug_transforms we have seen before, in notebook 2, in the section called
01:17:28.000 | data augmentation, and so aug_transforms was the thing which can zoom in and zoom out and
01:17:36.040 | warp and rotate and change contrast and change brightness and so forth, and flip — to kind of
01:17:43.000 | give us — it's like giving us more data being generated synthetically from the data
01:17:47.960 | we already have. And we also learned about RandomResizedCrop, which is a kind of a really
01:17:59.680 | cool way of ensuring you get square images at the same time that you're augmenting
01:18:08.600 | the data. Here we have a Resize to a really large image — well, you know, by deep learning
01:18:17.320 | standards 460 by 460 is a really large image — and then we're using aug_transforms with a
01:18:24.280 | size, so that's actually going to use RandomResizedCrop to a smaller size.
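Put together, the DataBlock described here looks roughly like this (the seed and min_scale values follow the notebook's choices and are incidental rather than required):

```python
pets = DataBlock(
    blocks=(ImageBlock, CategoryBlock),           # independent variable: image; dependent: category
    get_items=get_image_files,
    splitter=RandomSplitter(seed=42),             # random train/validation split
    get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'),   # label via regex on the name attribute
    item_tfms=Resize(460),                        # step 1: resize every item to a big square
    batch_tfms=aug_transforms(size=224, min_scale=0.75))   # step 2: augment and crop to 224 on the GPU
dls = pets.dataloaders(path/"images")
```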
01:18:30.240 | Why are we doing that? This particular combination of two steps does something which I think is unique to
01:18:40.060 | fastai, which we call presizing, and the best way to explain it is — I will show you this beautiful example
01:18:48.800 | of some PowerPoint wizardry that I'm so excited about to show how pre-sizing works what pre-sizing
01:18:56.820 | does is that first step where we say resize to 460 by 460 is it grabs a square and it grabs
01:19:05.480 | it randomly if it's a kind of landscape orientation photo it'll grab it randomly so it'll take
01:19:11.360 | the whole height and randomly grab somewhere from along the side if it's a portrait orientation
01:19:18.400 | then it'll grab it you know take the full width and grab a random bit from top to bottom
01:19:25.480 | so then we take this area here and here it is right and so that's what the first resize
01:19:30.780 | does, and then the second, aug_transforms bit will grab a random warped crop, possibly rotated,
01:19:41.360 | from in here and we'll turn that into a square and so it does so there's two steps it's first
01:19:50.420 | of all resize to a square that's big and then the second step is do a kind of rotation and
01:19:56.400 | warping and zooming stage to something smaller in this case 224 by 224 because this first
01:20:05.880 | step creates something that's square and always is the same size the second step can happen
01:20:12.360 | on the GPU. And because normally things like rotating and image warping are actually pretty
01:20:17.400 | slow — also, normally doing a zoom and a rotate and a warp actually is really destructive
01:20:27.560 | to the image because each one of those things requires an interpolation step but it's not
01:20:32.640 | just slow it actually makes the image really quite low quality so we do it in a very special
01:20:40.120 | way in fast AI I think it's unique where we do all of these kind of coordinate transforms
01:20:47.560 | like rotations and warps and zooms and so forth not on the actual pixels but instead
01:20:54.960 | we kind of keep track of the changing coordinate values in a non-lossy way so the full floating
01:21:01.600 | point value and then once at the very end we then do the interpolation.
01:21:08.840 | The results are quite striking here is what the difference looks like hopefully you can
01:21:16.080 | see this on on the video on the left is our pre-sizing approach and on the right is the
01:21:24.160 | standard approach that other libraries use and you can see that the one on the right
01:21:28.600 | is a lot less nicely focused, and it also has weird things — like, this should be grass
01:21:36.040 | here but it's actually got its bum sticking way out; this has a little bit of weird distortion;
01:21:41.600 | this has got loads of weird distortions so you can see the pre-sized version really ends
01:21:46.660 | up way way better and I think we have a question Rachel are the blocks in the data block and
01:21:55.520 | ordered list do they specify the input and output structures respectively are there always
01:22:01.080 | two blocks or can there be more than two for example if you wanted a segmentation model
01:22:05.960 | would the second block be something about segmentation so so yeah this is an ordered
01:22:14.400 | list so the first item says I want to create an image and then the second item says I want
01:22:20.880 | to create a category so that's my independent and dependent variable you can have one thing
01:22:26.200 | here you can have three things here you can have any amount of things here you want obviously
01:22:31.040 | the vast majority of the time it'll be two normally there's an independent variable and
01:22:34.600 | a dependent variable we'll be seeing this in more detail later although if you go back
01:22:39.600 | to the earlier lesson when we introduced data blocks I do have a picture kind of showing
01:22:43.660 | how these pieces fit together. So after you've put together your data block created your
01:22:56.720 | data loaders, you want to make sure it's working correctly. So the obvious thing to do for a computer
01:23:02.240 | vision data block is show_batch, and show_batch will show you the items, and you can kind of
01:23:11.960 | just make sure they look sensible — that looks like the labels are reasonable. If you add
01:23:16.480 | unique=True then it's going to show you the same image with all the different
01:23:21.240 | augmentations; this is a good way to make sure your augmentations work. If you make a mistake
01:23:26.000 | in your data block in this example there's no resize so the different images are going
01:23:32.160 | to be different sizes so it'll be impossible to collate them into a batch so if you call
01:23:39.600 | .summary — this is a really neat thing which will go through and tell you everything that's
01:23:46.400 | happening: so, collecting the items, how many did I find, what happened when I split them,
01:23:52.240 | what are the different variables — independent, dependent — I'm creating; let's try
01:23:57.720 | and create one of these; here's a step, create my image, create categorize; here's what the
01:24:06.800 | first thing gave me — an American Bulldog; the final sample is this image, this size, this
01:24:13.400 | category and then eventually it says oh it's not possible to collate your items I tried
01:24:20.400 | to collate the zero index members of your tuples so in other words that's the independent
01:24:24.800 | variable and I got this was size 500 by 375 this was 375 by 500 oh I can't collate these
01:24:32.560 | into a tensor because they're different sizes. So this is a super great debugging tool for
01:24:37.680 | debugging your data blocks.
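A sketch of those two checks — show_batch for eyeballing, and summary on a deliberately broken DataBlock (one with no resize):

```python
dls.show_batch(nrows=1, ncols=3)               # eyeball a few images with their labels
dls.show_batch(nrows=1, ncols=3, unique=True)  # the same image shown with different augmentations

pets1 = DataBlock(blocks=(ImageBlock, CategoryBlock),
                  get_items=get_image_files,
                  splitter=RandomSplitter(seed=42),
                  get_y=using_attr(RegexLabeller(r'(.+)_\d+.jpg$'), 'name'))
pets1.summary(path/"images")   # walks through each step, then fails when collating different-sized images
```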
01:24:46.320 | We have a question: how does the item transform presize work if the resize is smaller than the image?
01:24:51.600 | Is the whole width or height still taken, or is it just a random crop with the resize value? So if you remember back to lesson two we looked
01:25:04.000 | at the different ways of creating these things you can use squish you can use pad or you
01:25:16.080 | can use crop. So if your image is smaller than the presize value, then squish will really
01:25:23.940 | be a zoom — it will just stretch it — and then pad and crop will do much
01:25:31.980 | the same thing, and so you'll just end up with, you know, the same — it'll look like these,
01:25:37.280 | but it'll be kind of lower, more pixelated — lower resolution — because it's having to zoom
01:25:41.560 | in a little bit okay so a lot of people say that you should do a hell of a lot of data
01:25:52.120 | cleaning before you model we don't we say model as soon as you can because remember
01:25:59.360 | what we found in in notebook two your your model can teach you about the problems in
01:26:06.840 | your data so as soon as I've got to a point where I have a data block that's working and
01:26:12.680 | I have data loaders I'm going to build a model and so here I'm you know it also tells me
01:26:17.460 | how I'm going so I'm getting seven percent error well that's actually really good for
01:26:23.080 | a pets model and so at this point now that I have a model I can do that stuff we learned
01:26:27.200 | about earlier in O2 the notebook O2 where we train our model and use it to clean the
01:26:32.640 | data. So we can look at the classification interpretation — the confusion matrix, top losses, the image cleaner
01:26:40.480 | widget, and so forth. Okay, now, one thing interesting here is: in notebook four we included
01:26:55.600 | a loss function when we created a learner and here we don't pass in a loss function why
01:27:01.360 | is that? That's because fastai will try to automatically pick a somewhat sensible loss
01:27:08.040 | function for you and so for a image classification task it knows what loss function is the normal
01:27:16.400 | one to pick and it's done it for you but let's have a look and see what it actually did pick
01:27:24.360 | so we could have a look at learn.loss_func and we will see it is cross-entropy loss
01:27:37.880 | what on earth is cross entropy loss I'm glad you asked let's find out cross entropy loss
01:27:46.320 | is really much the same as the MNIST loss we created with that sigmoid and
01:27:54.200 | the one-minus-predictions and predictions, but it's a kind of extended version of
01:28:01.440 | that. And the extended version of that is that the torch.where that we looked at in
01:28:09.080 | notebook four only works when you have a binary outcome — in that case it was "is it a three
01:28:15.440 | or not" — but in this case we've got "which of the 37 pet breeds is it", so we want to kind
01:28:24.280 | of create something just like that sigmoid and
01:28:31.280 | torch.where which also works
01:28:41.940 | grab a batch yes question why do we want to build a model before cleaning the data I would
01:28:52.480 | think a clean data set would help in training yeah absolutely a clean data set helps in
01:28:59.600 | training but remember as we saw in notebook 02 an initial model helps you clean the data
01:29:08.040 | set so remember how plot top losses helped us identify mislabeled images and the confusion
01:29:15.800 | matrix helped us recognize which things we were getting confused and might need you know
01:29:20.560 | fixing and the image classifier cleaner actually let us find things like an image that contained
01:29:27.400 | two bears rather than one bear and clean it up so a model is just a fantastic way to help
01:29:34.080 | you zoom in on the data that matters which things seem to have the problems which things
01:29:39.600 | are most important stuff like that so you would go through and you clean it with the
01:29:44.800 | model helping you and then you go back and train it again with the clean data thanks
01:29:50.720 | for that great question okay so in order to understand cross-entropy loss let's grab a
01:29:59.960 | batch of data, which we can do with dls.one_batch, and that's going to grab a batch from the
01:30:09.280 | training set. We could also go first(dls.train), and that's going to do exactly the same thing.
01:30:21.560 | And so then we can de-structure that into the independent and dependent variables, and
01:30:25.160 | so the dependent variable shows us we've got a batch size of 64 — it shows us the 64 categories —
01:30:41.480 | and remember, those numbers simply refer to indexes into the vocab, so for example
01:30:47.320 | 16 is a boxer, and that all happens for you automatically; when we say show_batch it
01:30:55.500 | shows us those strings so here's a first mini-batch and so now we can view the predictions that
01:31:04.880 | is the activations of the final layer of the network, by calling get_preds, and you can pass
01:31:11.320 | in a data loader — and a data loader can really be anything that's going to return a sequence
01:31:20.640 | of mini-batches — so we can just pass in a list containing our mini-batch as a data loader
01:31:26.920 | and so that's going to get the predictions for one mini-batch but here's some predictions
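A minimal sketch of grabbing a batch and its predictions (assuming `dls` and `learn` are the pets DataLoaders and Learner from above):

```python
x, y = dls.one_batch()                    # one mini-batch: 64 images and their 64 category indexes
preds, _ = learn.get_preds(dl=[(x, y)])   # final-layer activations for just this mini-batch
print(preds[0].sum(), len(preds[0]))      # roughly 1.0, and 37: one probability per breed, summing to 1
```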
01:31:32.080 | Okay, so for the actual predictions: if we go preds[0].sum() to grab the predictions for the first
01:31:42.240 | image and add them all up they add up to 1 and there are 37 of them so that makes sense
01:31:51.080 | right? It's like, the very first thing is: what is the probability that that is — let's look at the vocab —
01:32:00.440 | so the first thing is, what's the probability it's an Abyssinian cat? It's 10 to the negative
01:32:05.400 | 6 you see and so forth so it's basically like it's not this it's not this it's not this
01:32:11.480 | and you can look through and — oh, this one here is obviously what it thinks it
01:32:15.680 | is. So, we obviously want the probabilities to sum to one, because
01:32:24.920 | it would be pretty weird if if they didn't it would say you know that the probability
01:32:30.320 | of being one of these things is more than one or less than one which would be extremely
01:32:34.760 | odd so how do we go about creating these predictions where each one is between 0 and 1 and they
01:32:45.320 | all add up to 1? To do that we use something called softmax. Softmax is basically an extension
01:32:54.000 | of sigmoid to handle more than two categories. So remember, the sigmoid function
01:33:01.560 | looked like this, and we used that for our threes-versus-sevens model. So what if we want 37
01:33:11.960 | categories rather than two categories? We need one activation for every category. So actually,
01:33:19.560 | for the threes and sevens model, rather than thinking of that as an "is it a three" model, we could actually
01:33:27.000 | say, oh, that has two categories — so let's actually create two activations, one representing how
01:33:32.200 | three like something is and one representing how seven like something is so let's say you
01:33:39.960 | know, let's just say that we have six MNIST digits — can I do this — and
01:33:55.160 | this first column contains the activations of my model for one activation and the second
01:34:05.240 | column is for a second activation, so my final layer actually has two activations now. So
01:34:10.120 | this is like how much like a three is it and this is how much like a seven is it but this
01:34:14.440 | one is not at all like a three and it's slightly not like a seven this is very much like a
01:34:21.040 | three and not much like a seven and so forth so we can take that model and rather having
01:34:25.840 | rather than having one activation for like is three we can have two activations for how
01:34:30.920 | much like a three how much like a seven so if we take the sigmoid of that we get two
01:34:38.720 | numbers between naught and one but they don't add up to one so that doesn't make any sense
01:34:46.960 | it can't be point six six chance it's a three and point five six chance it's a seven because
01:34:51.880 | every digit in that data set is only one or the other so that's not going to work but
01:34:58.800 | what we could do is we could take the difference between this value and this value and say
01:35:05.240 | that's how likely it is to be a three so in other words this one here with a high number
01:35:10.360 | here and a low number here is very likely to be a three so we could basically say in
01:35:17.400 | the binary case these activations that what really matters is their relative confidence
01:35:24.220 | of being a three versus a seven so we could calculate the difference between column one
01:35:29.900 | and column two or column index zero and column index one right and here's the difference
01:35:35.240 | between the two columns there's that big difference and we could take the sigmoid of that right
01:35:43.600 | and so this is now giving us a single number between naught and one and so then since we
01:35:50.680 | wanted two columns we could make column index zero the sigmoid and column index one could
01:35:57.840 | be one minus that, and now look — these all add up to one. So here's probability of three, probability
01:36:06.900 | of seven; for the second one, probability of three, probability of seven; and so forth. So like that's
01:36:14.840 | a way that we could go from having two activations for every image to creating two probabilities,
01:36:27.840 | each of which is between 0 and 1, and each pair of which adds up to 1. Great.
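Here is a toy sketch of that idea, with made-up activations (column 0 is "how 3-like", column 1 is "how 7-like"):

```python
import torch

torch.manual_seed(42)
acts = torch.randn(6, 2) * 2                       # hypothetical final-layer activations for 6 images
prob_three = (acts[:, 0] - acts[:, 1]).sigmoid()   # sigmoid of the difference: confidence it's a 3
probs = torch.stack([prob_three, 1 - prob_three], dim=1)
print(probs)
print(probs.sum(dim=1))                            # every pair adds up to 1
```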
01:36:35.640 | How do we extend that to more than two columns? To extend it to more than two columns we use
01:36:42.040 | this function, which is called softmax. Softmax is equal to e to the x divided by the
01:36:53.200 | sum of e to the x. Just to show you: if I go softmax on my activations I get 0.6025,
01:37:05.200 | 0.3975; 0.6025, 0.3975 —
01:37:10.200 | I get exactly the same thing right so softmax in the binary case is identical to the sigmoid
01:37:20.040 | that we just looked at but in the multi-category case we basically end up with something like
01:37:28.000 | this. Let's say we were doing the teddy bear, grizzly bear, brown bear one, and for that, remember,
01:37:33.800 | our neural net's final layer will have three activations. So let's
01:37:38.360 | say it was 0.02, negative 2.49, 1.25. So to calculate
01:37:43.480 | softmax I first go e to the power of each of these three things: so here's e to the power
01:37:49.160 | of 0.02, e to the power of negative 2.49, e to the power
01:37:54.080 | of 1.25. Okay, then I add them up, so there's the
01:37:59.160 | sum of the exp's, and then softmax will simply be 1.02 divided by 4.6,
01:38:05.160 | and then this one will be 0.08 divided by 4.6, and this one will
01:38:09.640 | be 3.49 divided by 4.6. So since each one of these represents
01:38:15.240 | each number divided by the sum that means that the total is one okay and because all
01:38:23.160 | of these are positive and each one is an item divided by the sum it means all of these must
01:38:28.600 | be between naught and one so this shows you that softmax always gives you numbers between
01:38:35.300 | 0 and 1, and they always add up to 1. So to do that in practice you can just call
01:38:42.320 | torch.softmax and it will give you the result of this function.
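Checking the formula by hand on the bear example's three activations, and comparing with torch.softmax:

```python
import torch

acts = torch.tensor([0.02, -2.49, 1.25])
exp_acts = acts.exp()                  # roughly 1.02, 0.08, 3.49
manual = exp_acts / exp_acts.sum()     # each divided by the sum (about 4.6), so the results total 1
print(manual)
print(torch.softmax(acts, dim=0))      # the same numbers
```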
01:38:51.120 | So you should experiment with this in your own time: you know, write this out by hand and try putting
01:38:57.600 | in these numbers, and see that you get back the numbers I claim you're going
01:39:03.920 | to get back — make sure this makes sense to you. So one of the interesting points about
01:39:08.480 | softmax is remember I told you that exp is e to the power of something and now what that
01:39:16.840 | means is that e to the power of something grows very very fast right so like exp of
01:39:29.420 | four is about 54, exp of eight is about 2,980 —
01:39:41.040 | right it grows super fast and what that means is that if you have one activation that's
01:39:47.480 | just a bit bigger than the others its softmax will be a lot bigger than the others so intuitively
01:39:54.320 | the softmax function really wants to pick one class among the others which is generally
01:40:02.160 | what you want right when you're trying to train a classifier to say which breed is it
01:40:07.840 | you kind of want it to pick one and kind of go for it right and so that's what softmax
01:40:13.600 | does that's not what you always want so sometimes at inference time you want it to be a bit
01:40:21.200 | cautious and so you kind of got to remember that softmax isn't always the perfect approach
01:40:26.840 | but it's the default it's what we use most of the time and it works well on a lot of
01:40:31.320 | situations so that is softmax now in the binary case for the MNIST 3 versus 7 this was how
01:40:43.320 | we calculated MNIST loss we took the sigmoid and then we did either one minus that or that
01:40:49.200 | as our loss function — which is fine; as you saw, it worked, right? And so we could do
01:40:59.940 | exactly the same thing, but we can't use torch.where anymore because targets aren't just 0 or 1 —
01:41:07.000 | targets could be any number from 0 to 36 so we could do that by replacing the torch.where
01:41:14.600 | with indexing so here's an example for the binary case let's say these are our targets
01:41:21.160 | 0 1 0 1 1 0 and these are our softmax activations which we calculated before they're just some
01:41:28.880 | random numbers, just for a toy example. So, instead of doing torch.where, we could
01:41:37.240 | instead have a look at this: I could grab all the numbers from 0 to 5, and if I index into
01:41:46.160 | here with all the numbers from 0 to 5 and then my targets 0 1 0 1 1 0, then what that's
01:41:58.480 | going to do is: for row 0 it'll pick 0.6, then for row 1 it'll pick
01:42:06.960 | 0.49, for row 2 it'll pick 0.13, and so forth for the remaining rows. So this is a super
01:42:22.040 | nifty indexing expression which you should definitely play with right and it's basically
01:42:30.400 | this trick of passing multiple things to the PyTorch indexer the first thing says which
01:42:36.840 | rows should you return and the second thing says for each of those rows which column should
01:42:42.120 | you return so this is returning all the rows and these columns for each one and so this
01:42:50.080 | is actually identical to torch.where so isn't that tricky and so the nice thing is we can
01:42:59.280 | now use that for more than just two values. And so here's the fully worked out
01:43:08.320 | thing: I've got my threes column, I've got my sevens column, here's the target, here are the
01:43:13.240 | indexes from 0 1 2 3 4 5, and so here target 0 picks 0.6, target 1 picks 0.49, target 0 picks 0.13, and so forth. So yeah, this works
01:43:29.080 | just as well with more than two columns. So, for doing a full MNIST —
01:43:35.840 | you know, all the digits from 0 to 9 — we could have 10 columns and we would just be
01:43:40.320 | indexing into the 10. So this thing we're doing — where we take minus our activations matrix,
01:43:52.240 | indexed with all of the numbers from 0 to n and then our targets — is exactly the same as something that
01:43:58.640 | already exists in PyTorch called F.nll_loss — as you can see, exactly the same. That's
01:44:05.400 | again us kind of seeing that these things inside PyTorch and fastai are just little
01:44:11.080 | shortcuts for stuff we can write ourselves. NLL stands for negative log likelihood —
01:44:18.720 | again, it sounds complex, but actually it's just this indexing expression. Rather confusingly,
01:44:27.360 | there's no log in it — we'll see why in a moment.
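A toy sketch of that indexing trick and its F.nll_loss shortcut, using made-up softmax outputs like the ones above:

```python
import torch
import torch.nn.functional as F

sm_acts = torch.tensor([[0.60, 0.40],
                        [0.51, 0.49],
                        [0.13, 0.87]])
targ = torch.tensor([0, 1, 0])
idx = torch.arange(len(targ))
print(sm_acts[idx, targ])                            # picks 0.60, 0.49, 0.13: one value per row
print(-sm_acts[idx, targ])                           # negated, as the per-item loss
print(F.nll_loss(sm_acts, targ, reduction='none'))   # identical — and no log, despite the name
```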
01:44:38.840 | So, let's talk about logs. This loss function works quite well; as we saw in notebook 04, it's basically this — it is
01:44:45.320 | exactly the same as we do in notebook 04 just a different way of expressing it but we can
01:44:51.120 | actually make it better because remember the probabilities we're looking at are between
01:44:57.120 | 0 and 1 so they can't be smaller than 0 they can't be greater than 1 which means that if
01:45:02.400 | our model is trying to decide whether to predict 0.99 or 0.999 it's going to think that those
01:45:08.620 | numbers are very very close together but won't really care but actually if you think about
01:45:14.320 | the error you know if there's like a thousand things then this would like be 10 things are
01:45:22.600 | wrong and this would be like one thing is wrong but this is really like 10 times better
01:45:28.240 | than this so really what we'd like to do is to transform the numbers between 0 and 1 to
01:45:36.040 | instead be between negative infinity and infinity and there's a function that does exactly that
01:45:42.160 | which is called logarithm. Okay, so the numbers we could have can be between 0
01:45:52.560 | and 1, and as we get closer and closer to 0 the log goes down to negative infinity, and then at 1 it's
01:46:04.880 | going to be 0 — and we can't go above 0, because our loss function we want to be negative. So
01:46:17.240 | this logarithm in case you forgot hopefully you vaguely remember what logarithm is from
01:46:22.200 | high school, but basically the definition is this: if you have some number y
01:46:28.600 | that is b to the power of a, then the logarithm is defined such that a equals the logarithm
01:46:36.640 | of y base b — in other words, it tells you b to the power of what equals y. Which is not that interesting
01:46:52.040 | of itself but one of the really interesting things about logarithms is this very cool
01:46:57.560 | relationship which is that log of a times b equals log of a plus log of b and we use
01:47:05.040 | that all the time in deep learning and machine learning because this number here a times
01:47:13.040 | b can get very very big or very very small if you multiply things a lot of small things
01:47:18.220 | together you'll get a tiny number if you multiply a lot of big things together you'll get a
01:47:22.200 | huge number it can get so big or so small that the kind of the precision in your computer's
01:47:28.480 | floating point gets really bad, whereas this thing here — adding — is not going to get
01:47:35.360 | out of control. So we really love using logarithms, particularly in a deep neural net where
01:47:42.560 | there's lots of layers and we're kind of multiplying and adding many times, so this kind of tends
01:47:47.680 | to come out quite nicely.
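A tiny numerical check of that log rule, and of why adding logs is better behaved than multiplying tiny numbers:

```python
import torch

a = torch.tensor(1e-30)
b = torch.tensor(1e-30)
print((a * b).log())       # a*b underflows to 0 in float32, so the log is -inf
print(a.log() + b.log())   # adding the logs stays well-behaved: roughly -138.2
```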
01:48:05.160 | So when we take the probabilities that we saw before — the things that came out of this function — and we take their logs and we take the mean, that is called
01:48:16.080 | negative log likelihood and so this ends up being kind of a really nicely behaved number
01:48:23.640 | because of this property of the log that we described so if you take the softmax and then
01:48:31.240 | take the log and then pass that to nll_loss — because remember, that didn't actually take
01:48:37.680 | the log at all, despite the name — that gives you cross-entropy loss. So that leaves an obvious
01:48:47.120 | question of why doesn't nll_loss actually take the log, and the reason for that is that
01:48:55.040 | it's more convenient computationally to actually take the log back at the softmax step. So PyTorch
01:49:01.800 | has a function called log_softmax, and since it's actually easier to do the log at the
01:49:11.880 | softmax stage — it's just faster and more accurate — PyTorch assumes that you use log_softmax and
01:49:18.280 | then pass that to nll_loss. So nll_loss does not do the log; it assumes that you've
01:49:25.000 | done the log beforehand. So log_softmax followed by nll_loss is the definition of cross-entropy
01:49:31.460 | loss in PyTorch. So that's our loss function, and you can pass it some activations
01:49:38.200 | and some targets and get back a number.
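A minimal sketch showing that log_softmax plus nll_loss gives the same number as cross entropy, with made-up activations for 3 images and 5 classes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

acts = torch.randn(3, 5)                  # hypothetical final-layer activations
targets = torch.tensor([0, 3, 4])

loss_a = F.nll_loss(F.log_softmax(acts, dim=1), targets)   # log_softmax, then nll_loss
loss_b = F.cross_entropy(acts, targets)                    # the functional version, same thing
loss_c = nn.CrossEntropyLoss()(acts, targets)              # the class version, same number again
print(loss_a, loss_b, loss_c)
```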
01:49:44.700 | And pretty much everything in PyTorch — every one of these kinds of functions — you can either use the nn version as a class like this and
01:49:51.000 | then call that object as if it's a function, or you can just use the F-dot version with the snake_case
01:49:57.560 | name as a function directly, and as you can see they're exactly the same number. People
01:50:04.960 | normally use the class version in the documentation in PyTorch you'll see it normally uses the
01:50:11.560 | class version so we'll tend to use the class version as well you'll see that it's returning
01:50:18.200 | a single number and that's because it takes the mean because a loss needs to be as we've
01:50:22.720 | discussed, the mean. But if you want to see the underlying numbers before taking the mean,
01:50:28.560 | you can just pass in reduction='none', and that shows you the individual cross-entropy
01:50:34.080 | losses before taking the mean. Okay great so this is a good place to stop with our discussion
01:50:57.160 | of loss functions and such things Rachel were there any questions about this? Why does the
01:51:12.560 | loss function need to be negative? Well I mean I guess it doesn't but it's we want something
01:51:23.780 | that the lower it is the better and we kind of need it to cut off somewhere I have to
01:51:35.720 | think about this more during the week because I'm a bit tired — yeah, let me refresh my memory
01:51:44.760 | when I'm awake. Okay, now, next week — well, note: for the video, "next week" actually happened
01:51:59.320 | last week, so the thing I'm about to say has already happened. So next week we're going to be talking
01:52:07.180 | about data ethics and I wanted to kind of segue into that by talking about how my week's
01:52:14.200 | gone, because a week or two ago, as part of a lesson, I actually talked about
01:52:25.680 | the efficacy of masks and specifically wearing masks in public and I pointed out that the
01:52:33.840 | efficacy of masks seemed like it could be really high and maybe everybody should be
01:52:38.520 | wearing them and somehow I found myself as the face of a global advocacy campaign and
01:52:52.200 | so if you go to masksforall.co you'll find a website talking about masks and I've been
01:53:09.800 | on you know TV shows in South Africa and the US and England and Australia and on radio
01:53:17.480 | and blah blah blah talking about masks why is this well it's because as a data scientist
01:53:28.840 | you know I noticed that the data around masks seem to be getting misunderstood and it seemed
01:53:36.560 | that that misunderstanding was costing possibly hundreds of thousands of lives you know literally
01:53:43.680 | in the places that were using masks it seemed to be associated with you know orders of magnitude
01:53:49.720 | fewer deaths and one of the things we're talking about next week is like you know what's your
01:53:56.480 | role as a data scientist and you know I strongly believe that it's to understand the data and
01:54:03.920 | then do something about it and so nobody was talking about this so I ended up writing an
01:54:12.320 | article that appeared in the Washington Post that basically called on people to really
01:54:18.760 | consider wearing masks which is this article and you know I was lucky I managed to kind
01:54:32.800 | of get a huge team of brilliant not huge a pretty decent sized team of brilliant volunteers
01:54:39.120 | who helped you know kind of build this website and kind of some PR folks and stuff like that
01:54:45.080 | but what became clear — and I was talking to politicians, you know, senators, staffers —
01:54:54.120 | what was becoming clear is that people weren't convinced by the science, which is fair enough,
01:55:01.200 | because it's it's hard to you know when the WHO and the CDC is saying you don't need to
01:55:07.960 | wear a mask and some random data scientist is saying but doesn't seem to be what the
01:55:12.640 | data is showing you know you've got half a brain you would pick the WHO and the CDC not
01:55:17.680 | the random data scientist. So I really felt like, if I was going to be an effective advocate,
01:55:23.240 | I needed to sort the science out; and, you know, credentialism is strong, and so it wouldn't
01:55:31.040 | be enough for me to say it — I needed to find other people to say it. So I put together
01:55:34.600 | a team of 19 scientists including you know a professor of sociology a professor of aerosol
01:55:46.320 | dynamics the founder of an African movement that's that kind of studied preventative methods
01:55:54.000 | for tuberculosis a Stanford professor who studies mask disposal and cleaning methods
01:56:05.960 | a bunch of Chinese scientists who study epidemiology modeling a UCLA professor who is one of the
01:56:16.680 | top infectious disease epidemiologists experts and so forth so like this kind of all-star
01:56:24.320 | team of people from all around the world and I had never met any of these people before
01:56:29.640 | so well no not quite true I knew Austin a little bit and I knew Zainip a little bit
01:56:34.440 | I knew Lex a little bit but on the whole you know and well Reshma we all know she's awesome
01:56:43.280 | so it was great to actually have a fast AI community person there too and so but yeah
01:56:50.160 | I kind of tried to pull together people from you know as many geographies as possible and
01:56:56.720 | as many areas of expertise as possible and you know the kind of the global community
01:57:03.520 | helped me find papers about about everything about you know how different materials work
01:57:12.840 | about how droplets form about epidemiology about case studies of people infecting with
01:57:23.280 | and without masks blah blah blah and we ended up in the last week basically we wrote this
01:57:29.600 | paper it contains 84 citations and you know we basically worked around the clock on it
01:57:40.600 | as a team, and it's out. It's been sent to a number of — some of the earlier versions,
01:57:49.640 | three or four days ago, we sent to some governments. So one of the things is, in this team I tried
01:57:55.560 | to look for people who were you know working closely with government leaders not just that
01:58:00.600 | they're scientists and so this this went out to a number of government ministers and in
01:58:07.360 | the last few days I've heard that it was a very significant part of decisions by governments
01:58:15.600 | to change their guidelines around masks, and you know the fight's not over
01:58:25.280 | by any means in particular the UK is a bit of a holdout but I'm going to be on ITV tomorrow
01:58:33.600 | and then BBC the next day you know it's it's kind of required stepping out to be a lot
01:58:39.760 | more than just a data scientist I've had to pull together you know politicians and staffers
01:58:46.080 | I've had to you know you know hustle with the media to try and get you know coverage
01:58:53.560 | and you know today I'm now starting to do a lot of work with unions to try to get unions
01:58:58.460 | to understand this you know it's really a case of like saying okay as a data scientist
01:59:03.920 | and, in conjunction with real scientists, we've built this really strong understanding
01:59:11.000 | of masks — you know, this simple but incredibly powerful tool — but that doesn't do anything unless
01:59:18.400 | I can effectively communicate this to decision makers so today I was you know on the phone
01:59:24.480 | to you know one of the top union leaders in the country explaining what this means basically
01:59:33.040 | it turns out that in buses in America the kind of the air conditioning is set up so
01:59:38.000 | that it blows from the back to the front and there's actually case studies in the medical
01:59:42.440 | literature of how people that are seated kind of downwind of an air conditioning unit in
01:59:49.600 | a restaurant ended up all getting sick with COVID-19 and so we can see why like bus drivers
01:59:56.200 | are dying because they're like they're right in the wrong spot here and their passengers
02:00:02.800 | aren't wearing masks so I kind of try to explain this science to union leaders so that they
02:00:11.360 | understand that to keep the workers safe it's not enough just for the driver to wear a mask
02:00:17.160 | but all the people on the bus need to be wearing masks as well so you know all of this is basically
02:00:23.040 | to say you know as data scientists I think we have a responsibility to study the data
02:00:34.520 | and then do something about it it's not just a research you know exercise it's not just
02:00:40.680 | a computation exercise you know what's the point of doing things if it doesn't lead to
02:00:46.360 | anything so yeah so next week we'll be talking about this a lot more but I think you know
02:00:58.560 | this is a really to me kind of interesting example of how digging into the data can lead
02:01:07.320 | to really amazing things happening and and in this case I strongly believe and a lot
02:01:13.600 | of people are telling me they strongly believe that this kind of advocacy work that's come
02:01:18.600 | out of this data analysis is already saving lives and so I hope this might help inspire
02:01:24.920 | you to to take your data analysis and to take it to places that it really makes a difference
02:01:31.640 | so thank you very much and I'll see you next week.