Machine Learning 1: Lesson 9
Chapters
0:0 Introduction
0:35 Synthetic Data
4:10 Parfit
11:40 Basic Steps
15:30 nn.Module
16:15 Constructor
19:30 Defining Forward
21:10 Softmax
27:30 Parameters
28:50 Results
29:10 Functions
30:15 Generators
32:0 Fastai
33:10 Variable
36:5 Function
45:10 Making Predictions
47:0 Broadcasting
48:40 Performance
52:30 Broadcast
00:00:02.600 |
I'm really excited to be able to share some amazing stuff that 00:00:08.480 |
University of San Francisco students have built during the week or written about during the week 00:00:15.280 |
Quite a few things. I'm going to show you have already 00:00:23.200 |
Tweets and posts and all kinds of stuff happening 00:00:28.040 |
One of the first to be widely shared was this one by Tyler, who did something really interesting. 00:00:34.880 |
He started out by saying like what if I like create the synthetic data set where the independent variables is like the x and the y 00:00:43.940 |
And the dependent variable is like color right and interestingly 00:00:48.080 |
He showed me an earlier version of this where he wasn't using color 00:00:51.080 |
he was just like putting the actual numbers in here and 00:00:54.840 |
this thing kind of wasn't really working at all and as soon as he started using color it started working really well and 00:01:00.640 |
So I wanted to mention that one of the things that unfortunately we don't teach you, 00:01:10.840 |
Because actually when it comes to visualization it's kind of the most important thing to know, is what is the human eye, 00:01:17.080 |
Or what is the human brain, good at perceiving? There's a whole area of academic study on this 00:01:24.400 |
And one of the things that we're best at perceiving is differences in color 00:01:28.040 |
Right so that's why as soon as we look at this picture of the synthetic data. He created you can immediately see oh there's kind of four 00:01:40.840 |
What if we like tried to create a machine learning model of this synthetic data set? 00:01:46.720 |
And so specifically he created a tree and the cool thing is that you can actually draw 00:01:55.440 |
He did this all in matplotlib. Matplotlib is very flexible, right; he actually drew the tree boundaries 00:02:01.840 |
So that's already a pretty neat trick is to be actually able to draw the tree 00:02:07.800 |
But then he did something even cleverer which is he said okay? 00:02:10.800 |
So what predictions does the tree make well it's the average of each of these areas and so to do that 00:02:28.120 |
Here's where it gets really interesting. As you know, the trees are randomly 00:02:39.600 |
Generated through resampling; they're all pretty similar, but a little bit different 00:02:43.880 |
And so now we can actually visualize bagging and to visualize bagging we literally take the average of the four pictures 00:02:54.000 |
There it is right and so here is like the the fuzzy decision boundaries of a random forest 00:03:01.440 |
And I think this is kind of amazing, right, because I wish I had this actually when I started teaching you 00:03:08.880 |
All about random forests, because I could have skipped a couple of classes. It's just like, okay, that's what we do 00:03:13.920 |
You know we create the decision boundaries we average each area 00:03:18.360 |
And then we we do it a few times and average all of them 00:03:21.960 |
Okay, so that's what a random forest does and I think like this is just such a great example of 00:03:26.320 |
Making the complex easy through through pictures 00:03:37.360 |
That he has actually reinvented something that somebody else has already done: a researcher who went on to be 00:03:44.000 |
One of the world's foremost machine learning researchers actually included almost exactly this technique in a book 00:03:51.000 |
He wrote about decision forests. So it's actually kind of cool that Tyler ended up 00:03:54.880 |
Reinventing something that one of the world's foremost authorities on decision forests actually created 00:04:03.200 |
That's nice because when we posted this on Twitter 00:04:05.960 |
It got a lot of attention, and finally somebody was able to say, like, oh, 00:04:09.800 |
You know what, this actually already exists; so Tyler's gone away and, you know, started reading that book 00:04:17.160 |
Something else which is super cool is Jason Carpenter 00:04:20.520 |
Created a whole new library called parfit and parfit is a 00:04:26.960 |
parallelized fitting of multiple models for the purpose of 00:04:31.200 |
Selecting hyper parameters, and there's a lot. I really like about this 00:04:36.560 |
He's shown a clear example of how to use it right and like the API looks very similar to other grid search based approaches 00:04:50.780 |
Rachel wrote about and that we learned about a couple of weeks ago of using a good validation set 00:04:57.820 |
You know, what he's done here in his blog post that introduces it is he's explained 00:05:06.820 |
What hyperparameters are and why we have to tune them 00:05:09.140 |
And he's kind of explained every step and then the the module itself is like it's it's very polished 00:05:15.820 |
You know he's added documentation to it. He's added a nice read me to it 00:05:19.620 |
And it's kind of interesting when you actually look at the code you realize 00:05:22.940 |
You know it's very simple. You know which is it's definitely not a bad thing. That's a good thing is to make things simple 00:05:33.100 |
Writing this little bit of code and then packaging it up so nicely 00:05:35.700 |
He's made it really easy for other people to use this technique 00:05:42.460 |
one of the things I've been really thrilled to see is then 00:05:44.660 |
Vinay went along and combined two things from our class one was to take 00:05:50.180 |
Parfit and then the other was to take the kind of accelerated SGD approach to classification 00:05:56.020 |
We learned about in the last lesson, and combine the two to say, like, okay, well, let's now use 00:06:20.780 |
Summarized pretty much all the stuff we learned in the random forest interpretation class 00:06:27.980 |
And he went even further than that as he described each of the different approaches to random forest interpretation 00:06:37.020 |
He described how it's done so here for example is feature importance through variable permutation a little picture of each one and 00:06:44.860 |
Then super cool here is the code to implement it from scratch 00:06:52.580 |
Nice post you know describing something that not many people understand and showing you know exactly how it works both with pictures 00:07:00.740 |
And with code that implements it from scratch 00:07:04.340 |
So I think that's really really great. One of the things I really like here is that for the 00:07:09.100 |
Tree interpreter, he actually showed how you can take the tree interpreter 00:07:14.320 |
output and feed it into the new waterfall chart package that 00:07:19.300 |
Chris our USF student built to show how you can actually visualize 00:07:23.260 |
The contributions of the tree interpreter in a waterfall chart so again kind of a nice combination of 00:07:30.740 |
multiple pieces of technology we've both learned about and and built as a group I 00:07:39.860 |
There have been a few interesting kernels shared, and I'll share some more next week. Devesh wrote this really nice kernel 00:07:45.460 |
Showing, there's this quite challenging Kaggle competition on detecting icebergs versus 00:07:53.420 |
Ships, and it's kind of weird two-channel satellite data which is very hard to visualize, and he actually 00:08:01.940 |
Went through and basically described kind of the formulas for how these like radar scattering things actually work 00:08:10.420 |
And then actually managed to come up with a code that allowed him to recreate 00:08:24.820 |
I have not seen that done before or like I you know it's it's quite challenging to know how to visualize this data 00:08:31.020 |
And then he went on to show how to build a neural net to try to interpret this so that was pretty fantastic as well 00:08:38.800 |
So yeah congratulations for all of you. I know for a lot of you. You know you're 00:08:44.140 |
Posting stuff out there to the rest of the world for the first time you know and it's kind of intimidating 00:08:51.500 |
you're used to writing stuff that you kind of hand into a teacher, and they're the only ones who see it and 00:08:56.380 |
You know it's kind of scary the first time you do it 00:09:00.100 |
But then the first time somebody you know up votes your Kaggle kernel or adds a clap to your medium post 00:09:05.540 |
You suddenly realize, oh, I've actually written something that people like; that's pretty great 00:09:11.460 |
So if you haven't tried yourself yet, I again invite you to 00:09:18.060 |
Try writing something and if you're not sure you could write a summary of a lesson 00:09:22.540 |
You could write a summary of like if there's something you found hard like maybe you found it hard to 00:09:27.660 |
Fire up a GPU based AWS instance you eventually figured it out you could write down 00:09:32.820 |
Just describe how you solve that problem or if one of your classmates 00:09:36.740 |
Didn't understand something and you explained it to them 00:09:39.700 |
Then you could like write down something saying like oh, there's this concept that some people have trouble understanding here 00:09:45.220 |
And here's a good way, I think, of explaining it. There's all kinds of stuff you could do 00:10:07.880 |
Rachel put together basically taking us through 00:10:13.660 |
Kind of SGD from scratch for the purpose of digit recognition 00:10:18.380 |
and actually quite a lot of the stuff we look at today is 00:10:26.100 |
Part of the computational linear algebra course 00:10:28.740 |
Which you can find both as a MOOC on fast.ai or at USF; it'll be an elective next year, right? 00:10:38.580 |
This stuff interesting and I hope you do then please consider signing up for the elective or checking out the video online 00:10:57.580 |
We're starting with an assumption that we've downloaded the MNIST data 00:11:01.500 |
We've normalized it by subtracting the mean and dividing by the standard deviation. Okay, so the data is 00:11:08.700 |
It's slightly unusual in that although they represent images 00:11:12.760 |
They were downloaded such that each image was a seven hundred and eighty four long vector 00:11:21.660 |
Okay, and so for the purpose of drawing pictures of it we had to reshape it to 28 by 28 00:11:30.700 |
But the actual data we've got is not 28 by 28; it's 784 long 00:11:43.320 |
The basic steps we're going to take here is to start out with training 00:11:48.440 |
The world's simplest neural network basically a logistic regression, right? 00:11:54.000 |
So no hidden layers and we're going to train it using a library 00:11:58.340 |
Fast AI, and we're going to build the network using a library called PyTorch 00:12:03.840 |
Right, and then we're going to gradually get rid of all the libraries, right? 00:12:07.480 |
So first of all, we'll get rid of the nn neural net library in PyTorch and write that ourselves 00:12:13.760 |
Then we'll get rid of the fast AI fit function and write that ourselves and then we'll get rid of the pytorch 00:12:22.620 |
optimizer and write that ourselves and so by the end of 00:12:26.120 |
This notebook will have written all the pieces ourselves 00:12:30.800 |
The only thing that we'll end up relying on is the two key things that pytorch gives us 00:12:36.200 |
Which is (a) the ability to write Python code and have it run on the GPU, and 00:12:40.320 |
(b) the ability to write Python code and have it automatically differentiated for us 00:12:46.960 |
Okay, so they're the two things we're not going to attempt to write ourselves because it's boring and pointless 00:12:52.160 |
But everything else we'll try and write ourselves on top of those two things. Okay, so 00:12:58.720 |
Our starting point is like not doing anything ourselves 00:13:03.680 |
It's basically having it all done for us. And so pytorch has an nn library, which is where the neural net stuff lives 00:13:12.280 |
multi-layer neural network by using the sequential function and then passing in a list of the layers that you want and 00:13:20.840 |
Followed by a softmax layer and that defines our logistic regression. Okay the input to our linear layer 00:13:28.380 |
Is 28 by 28 as we just discussed the output is 10 because we want a probability 00:13:34.500 |
For each of the numbers nought through nine, for each of our images, okay 00:13:50.180 |
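As a rough sketch of what that definition might look like in PyTorch (the variable name net and the use of LogSoftmax follow the description above and are assumptions, not a copy of the lecture notebook):

    import torch.nn as nn

    # Zero-hidden-layer network: one linear layer (28*28 inputs -> 10 outputs)
    # followed by a (log) softmax, as described in the lecture.
    net = nn.Sequential(
        nn.Linear(28 * 28, 10),
        nn.LogSoftmax(dim=-1)
    ).cuda()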
Fits a model okay, so we start out with a random set of weights and then fit uses gradient descent to make it better 00:14:00.820 |
What criterion to use, in other words what counts as better, and we told it to use negative log likelihood 00:14:07.720 |
We'll learn about that in the next lesson what that is exactly 00:14:10.860 |
We had to tell it what optimizer to use, and we said please use optim.Adam; the details of that 00:14:18.000 |
We won't cover in this course, we're going to build something simpler called SGD 00:14:23.180 |
If you're interested in Adam, we just covered that in the deep learning course 00:14:27.060 |
And what metrics do you want to print out? We decided to print out accuracy. Okay, so 00:14:42.340 |
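A sketch of the pieces just named, the criterion, the optimizer and the metric, built with plain PyTorch (these are then handed to fastai's fit function, whose exact signature isn't shown in the transcript, so it's omitted here):

    import torch.nn as nn
    import torch.optim as optim

    loss = nn.NLLLoss()                   # criterion: negative log likelihood
    opt = optim.Adam(net.parameters())    # optimizer: Adam over the network's parameters
    # metrics = [accuracy]                # accuracy is the metric fastai's fit prints each epoch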
So after we fit it we get an accuracy of generally somewhere around 91 92 percent 00:14:47.300 |
So what we're going to do from here is we're going to gradually 00:14:50.980 |
We're going to repeat this exact same thing. So we're going to rebuild 00:14:57.820 |
You know four or five times fitting it building it and fitting it with less and less libraries. Okay, so the second thing that we did 00:15:15.580 |
All right, so instead of saying the network is a sequential bunch of these layers 00:15:21.780 |
Let's not use that library at all and try and define it ourselves from scratch 00:15:32.220 |
Because that's how we build everything in pytorch and we have to create 00:15:39.060 |
Which inherits from nn.Module. So nn.Module is a PyTorch class 00:15:45.140 |
That takes our class and turns it into a neural network module 00:15:51.500 |
Which basically means anything that you inherit from nn.Module like this, 00:15:55.940 |
You can pretty much insert into a neural network as a layer or you can treat it as a neural network 00:16:02.020 |
it's going to get all the stuff that it needs automatically to 00:16:05.060 |
To work as a part of or a full neural network and we'll talk about exactly what that means 00:16:15.940 |
so we need to construct the object, so that means we need to define the constructor, __init__, and 00:16:22.900 |
Then importantly, this is a Python thing is if you inherit from some other object 00:16:29.720 |
Then you have to create the thing you inherit from first 00:16:33.100 |
so when you say super().__init__(), that says construct the 00:16:38.500 |
nn.Module piece of that first, right; if you don't do that, then the nn.Module stuff 00:16:46.180 |
Never gets a chance to actually get constructed. Now. So this is just like a standard 00:16:53.980 |
Subclass constructor, okay, and if any of that's an unclear to you then you know 00:16:59.180 |
This is where you definitely want to just grab a python intro to OO because this is 00:17:04.420 |
That the standard approach, right? So inside our constructor 00:17:11.580 |
nn.Linear. All right, so what nn.Linear is doing is it's taking our 00:17:29.380 |
Vector, so a 784 long vector, and that's going to be the input to a matrix multiplication 00:17:49.620 |
Okay, so because the input to this is going to be a mini batch of size 64 by 784, 00:18:01.740 |
This weight matrix is 784 by 10, and the input to this is going to be a mini batch of size 64 00:18:20.100 |
Right, so we're going to do this matrix product 00:18:23.340 |
Okay, so when we say in PyTorch nn.Linear, 00:18:32.100 |
It constructs this matrix for us, right? So since we're not using that, we're doing things from scratch, we need to make it ourselves 00:18:46.140 |
With this dimensionality, which we passed in here: 784 by 10. Okay, so that gives us our 00:19:01.660 |
You know, we don't just want y equals ax we want y equals ax plus b 00:19:08.140 |
Right, so we need to add on what we call in neural nets a bias vector 00:19:13.500 |
So we create here a bias vector of length 10. Okay again randomly initialized 00:19:20.740 |
And so now here are our two randomly initialized parameters 00:19:30.980 |
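A minimal sketch of the constructor being described, assuming names like LogReg, l1_w and l1_b (the lecture's own helper for creating the weights isn't shown, so plain randn is used here):

    import torch
    import torch.nn as nn

    class LogReg(nn.Module):
        def __init__(self):
            super().__init__()   # construct the nn.Module piece first
            # randomly initialised weight matrix (784 by 10) and bias (length 10),
            # wrapped in nn.Parameter so the optimizer can find them later
            self.l1_w = nn.Parameter(torch.randn(28 * 28, 10) / (28 * 28) ** 0.5)
            self.l1_b = nn.Parameter(torch.zeros(10))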
Now we need to define forward. Why do we need to define forward? This is a pytorch specific thing 00:19:36.900 |
What's going to happen is this is when you create a module in 00:19:42.620 |
Pytorch the object that you get back behaves as if it's a function 00:19:47.760 |
You can call it with parentheses which we'll do it that in a moment. And so you need to somehow define 00:19:52.860 |
What happens when you call it as if it's a function, and the answer is PyTorch calls a method called 00:20:00.440 |
forward. Okay, that's just the PyTorch kind of approach that they picked, right? 00:20:07.740 |
So when it calls forward, we need to do our actual 00:20:12.260 |
Calculation of the output of this module or layer. Okay, so here is the thing that actually gets calculated in our logistic regression 00:20:26.020 |
Which gets passed to forward that's basically how forward works it gets passed the mini batch 00:20:35.620 |
The layer one weights which we defined up here and then we add on 00:20:42.740 |
The layer one bias which we defined up here. Okay, and actually nowadays we can define this a little bit more elegantly using the 00:20:54.700 |
Matrix multiplication operator, which is the at sign 00:20:57.660 |
And when you when you use that I think you kind of end up with 00:21:01.080 |
Something that looks closer to what the mathematical notation looked like and so I find that nicer. Okay 00:21:13.580 |
In our logistic regression in our zero hidden layer neural net. So then the next thing we do to that is 00:21:31.420 |
Okay, who wants to tell me what the dimensionality of my output of this matrix multiply is 00:21:44.060 |
And I should mention for those of you that weren't at deep learning class yesterday 00:21:50.580 |
We actually looked at a really cool post from Karen who described how to 00:21:54.980 |
Do structured data analysis with neural nets which has been like super popular? 00:22:00.380 |
And a whole bunch of people have kind of said that they've read it and found it super interesting. So 00:22:19.740 |
We put it through a softmax because in the end we want probably you know for every image 00:22:24.660 |
We want a probability that this is 0 or a 1 or a 2 or a 3 or 4, right? 00:22:28.780 |
So we want a bunch of probabilities that add up to 1 and where each of those probabilities is between 0 and 1 00:22:40.860 |
So for example, if we weren't picking out, you know, numbers from 0 to 9, 00:22:45.900 |
But instead picking out cat, dog, plane, fish or building, the output of that matrix multiply 00:22:50.500 |
For one particular image might look like that. These are just some random numbers 00:22:54.620 |
And to turn that into a softmax, I first go e to the power of each of those numbers. 00:23:09.060 |
Then I take each of those e to the power ofs and divide it by the sum and that's softmax 00:23:14.180 |
That's the definition of softmax. So because it was e to the power of, it means it's always positive 00:23:19.260 |
Because it was divided by the sum, it means that it's always between 0 and 1, and it also means, because it's divided by the sum, that they all add up to 1 00:23:34.500 |
Activation function so anytime we have a layer of outputs, which we call activations 00:23:40.140 |
And then we apply some function some nonlinear function to that that maps one 00:23:45.980 |
One scalar to one scalar like softmax does we call that an activation function, okay? 00:23:52.500 |
So the softmax activation function takes our outputs and turns it into something which behaves like a probability, right? 00:24:00.260 |
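A quick sketch of that definition with made-up numbers (standing in for the random numbers in the lecture's example):

    import torch

    outputs = torch.tensor([0.8, -2.1, 1.5, 0.3])   # raw outputs for one image (made up)
    exps = outputs.exp()                            # e to the power of each: always positive
    softmax = exps / exps.sum()                     # divide by the sum: between 0 and 1, adds to 1
    print(softmax, softmax.sum())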
We don't strictly speaking need it. We could still try and train something which where the output directly is the probabilities 00:24:07.980 |
All right, but by creating using this function 00:24:11.320 |
That automatically makes them always behave like probabilities. It means there's less 00:24:16.420 |
For the network to learn so it's going to learn better. All right, so generally speaking whenever we design 00:24:24.660 |
We try to design it in a way where it's as easy as possible for it to create something of the form that we want 00:24:37.580 |
Right so that's the basic steps right we have our input which is a bunch of images 00:24:44.180 |
Right which is here gets multiplied by a weight matrix. We actually also add on a bias 00:24:56.460 |
We put it through a nonlinear activation function in this case softmax and that gives us our probabilities 00:25:14.820 |
We tend to take the log of softmax, for reasons that don't particularly bother us now 00:25:19.940 |
It's basically a numerical stability convenience. Okay, so to make this the same as our 00:25:26.020 |
Version up here, where you saw log softmax, I'm going to use log here as well. Okay, so 00:25:34.420 |
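Continuing the constructor sketch above, the forward method being described might look like this (attribute names assumed as before):

        def forward(self, x):
            # x is a mini batch of images, e.g. 64 rows of 784 pixels
            x = x @ self.l1_w + self.l1_b               # matrix multiply plus bias, using the @ operator
            return torch.log(torch.softmax(x, dim=-1))  # log of softmax, matching the version above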
We can now instantiate this class that is create an object of this class 00:25:41.060 |
So I have a question back for the probabilities where we were before 00:25:50.860 |
If we were to have a photo with a cat and a dog together 00:25:54.820 |
Would that change the way that that works or does it work in the same basic? Yeah, so that's a great question 00:26:00.580 |
so if you had a photo with a cat and a dog together and 00:26:07.100 |
This would be a very poor choice. So softmax is specifically the activation function we use for 00:26:14.540 |
Categorical predictions where we only ever want to predict one of those things, right? 00:26:19.460 |
And so part of the reason why is that, as you can see, because we're using e to the power of, the slightly bigger numbers 00:26:27.120 |
Create much bigger numbers, as a result of which we generally have just one or two things large and everything else is pretty small 00:26:35.820 |
Recalculate these random numbers a few times you'll see like it tends to be a bunch of zeros and one or two high numbers 00:26:44.420 |
It tries to kind of make it easy to predict this one thing, like, there's the thing I want. If you're doing multi- 00:26:53.700 |
Label prediction, so I want to find all the things in this image, rather than using softmax we would use sigmoid 00:27:01.260 |
So sigmoid, recall, would cause each of these to be between zero and one, but they would no longer add to one 00:27:11.480 |
Details about like best practices are things that we cover in the deep learning course 00:27:18.140 |
And we won't cover heaps of them here in the machine learning course. We're more interested in the mechanics, I guess 00:27:28.300 |
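A quick sketch of that contrast with made-up numbers; softmax gives one set of probabilities that add to 1, while sigmoid gives an independent 0-to-1 value per class:

    import torch

    out = torch.tensor([2.0, 1.5, -1.0, 0.3])   # raw outputs for cat, dog, plane, fish (made up)
    print(torch.softmax(out, dim=0))            # adds to 1: pick exactly one category
    print(torch.sigmoid(out))                   # each between 0 and 1 independently: multi-label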
All right, so now that we've got that we can instantiate an object of that class and of course 00:27:35.420 |
We want to copy it over to the GPU so we can do computations over there 00:27:38.940 |
Again, we need an optimizer where we're talking about what this is shortly, but you'll see here 00:27:44.580 |
We've called a function on our class called parameters 00:27:47.760 |
But we never defined a method called parameters 00:27:51.340 |
And the reason that is going to work is because it actually was defined for us inside nn.module 00:27:56.420 |
and so nn.module actually automatically goes through the attributes we've created and finds 00:28:04.060 |
Anything that basically we we said this is a parameter 00:28:07.860 |
So the way you say something is a parameter is you wrap it in nn.Parameter 00:28:11.260 |
So this is just the way that you tell PyTorch 00:28:16.180 |
Okay, so when we created the weight matrix we just wrapped it with nn.Parameter 00:28:23.780 |
PyTorch variable which we'll learn about shortly 00:28:26.620 |
It's just a little flag to say hey, you should optimize this, and so when you call net2.parameters() 00:28:33.940 |
On our net2 object we created, it goes through everything that we created in the constructor, 00:28:38.900 |
Checks to see if any of them are of type parameter 00:28:41.880 |
And if so it sets all of those as being things that we want to train with the optimizer 00:28:46.620 |
And we'll be implementing the optimizer from scratch later 00:28:53.040 |
We can fit and we should get basically the same answer as before 91 ish 00:29:11.500 |
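A sketch of what that instantiation looks like, using the LogReg class sketched above (net2 is an assumed name):

    import torch.optim as optim

    net2 = LogReg().cuda()                # create the object and copy it to the GPU
    opt = optim.Adam(net2.parameters())   # parameters() finds everything wrapped in nn.Parameter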
Well what we've actually built as I said is something that can behave like a regular function 00:29:17.340 |
All right, so I want to show you how we can actually call this as a function 00:29:23.660 |
We need to be able to pass data to it to be able to pass data to it 00:29:28.140 |
I'm going to need to grab a mini batch of MNIST images 00:29:37.220 |
ImageClassifierData.from_arrays method from fastai 00:29:40.340 |
And what that does is it creates a pytorch data loader for us a pytorch data loader is 00:29:47.060 |
Something that grabs a few images and sticks them into a mini batch and makes them available 00:29:52.340 |
And you can basically say give me another mini batch give me another mini batch give me another mini batch and so 00:30:05.060 |
Generators are things where you can basically say I want another I want another I want another right 00:30:10.020 |
There's this kind of very close connection between 00:30:15.900 |
Iterators and generators; we're not going to worry about the difference between them right now, but you'll see basically, 00:30:23.140 |
To actually get hold of something which we can say "please give me another" of, in 00:30:32.020 |
Order to grab something that we can use to generate mini batches, 00:30:36.540 |
We have to take our data loader and so you can ask for the training data loader from our model data object 00:30:43.180 |
You'll see there's a bunch of different data loaders. You can ask for you can ask for the test data loader the train data loader 00:30:51.940 |
Augmented images data loader and so forth so we're going to grab the training data loader 00:30:57.220 |
That was created for us. This is a standard PyTorch data loader, well, slightly optimized by us, but same idea 00:31:03.300 |
And you can then say this is a standard Python 00:31:07.020 |
Thing we can say turn that into an iterator turn that into something where we can grab another one at a time from and so 00:31:16.540 |
We've now got something that we can iterate through you can use the standard Python 00:31:21.580 |
Next function to grab one more thing from that generator, okay? 00:31:26.820 |
So that's returning the X's from a mini batch, and the Y's 00:31:33.100 |
From our mini batch. The other way that you can use 00:31:36.440 |
Generators and iterators in Python is with a for loop. I could also have said like for you know X mini batch comma Y mini batch in 00:31:47.420 |
And then like do something right so when you do that. It's actually behind the scenes 00:31:51.940 |
It's basically syntactic sugar for calling next lots of times. Okay, so this is all standard 00:32:03.100 |
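A sketch of that pattern (md is assumed to be the fastai model data object created from ImageClassifierData, and trn_dl its training data loader):

    dl = iter(md.trn_dl)    # turn the training data loader into an iterator
    xmb, ymb = next(dl)     # grab one mini batch of images and labels

    # the for loop version is just syntactic sugar for calling next() lots of times
    for xmb, ymb in md.trn_dl:
        break               # do something with each mini batch; stop after the first here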
Tensor of size 64 by 784 as we would expect right the 00:32:14.980 |
Fastai library we used defaults to a mini batch size of 64. That's why it's that long 00:32:20.340 |
These are all of the background zero pixels, but they're not actually zero in this case. Why aren't they zero? 00:32:27.180 |
Yeah, they're normalized, exactly right, so we subtracted the mean and divided by the standard deviation, right 00:32:33.420 |
So there there it is so now what we want to do is we want to 00:32:42.380 |
Pass that into our our logistic regression. So what we might do is we'll go 00:32:48.860 |
A variable: vxmb = Variable(...). Okay, I can take my X mini batch, I 00:32:55.580 |
can move it on to the GPU because remember my 00:32:59.160 |
net2 object is on the GPU, so our data for it also has to be on the GPU 00:33:04.980 |
And then the second thing I do is I have to wrap it in variable. So what does variable do? 00:33:11.140 |
This is how we get for free automatic differentiation 00:33:19.040 |
You know pretty much anything right any tensor? 00:33:25.380 |
But it's not going to always keep track, because to do automatic differentiation 00:33:30.820 |
It has to keep track of exactly how something was calculated. We added these things together 00:33:35.340 |
We multiplied it by that we then took the sign blah blah blah, right? 00:33:39.420 |
you have to know all of the steps because then to do the automatic differentiation it has to 00:33:45.060 |
Take the derivative of each step using the chain rule multiply them all together 00:33:49.380 |
All right, so that's slow and memory intensive 00:33:52.140 |
So we have to opt in to saying like okay this particular thing we're going to be taking the derivative of later 00:33:57.560 |
So please keep track of all of those operations for us 00:34:00.300 |
And so the way we opt in is by wrapping a tensor in a variable, right? So 00:34:10.100 |
You'll see that it looks almost exactly like a tensor, but it now says variable containing 00:34:16.460 |
This tensor right so in Pytorch a variable has exactly 00:34:21.860 |
Identical API to a tensor or actually more specifically a superset of the API of a tensor 00:34:27.860 |
Anything we can do to a tensor we can do to a variable 00:34:30.740 |
But it's going to keep track of exactly what we did so we can later on take the derivative 00:34:40.260 |
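A sketch of that opt-in step with the mini batch grabbed above; Variable comes from torch.autograd in the PyTorch version used here (in modern PyTorch you would set requires_grad on the tensor instead):

    from torch.autograd import Variable

    vxmb = Variable(xmb.cuda())   # move the mini batch to the GPU and opt in to gradient tracking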
Into our net2 object; remember, I said you can treat this as if it's a function 00:34:51.980 |
Right so notice we're not calling dot forward 00:34:59.380 |
Then remember we took the log, so to undo that I'm taking the exp, and that will give me my probabilities 00:35:07.460 |
Okay, so there's my probabilities, and it's got 00:35:14.020 |
Return something of size 64 by 10 so for each image in the mini batch 00:35:23.020 |
We've got 10 probabilities, and you'll see most probabilities are pretty close to 0 00:35:29.580 |
Right and a few of them are quite a bit bigger 00:35:33.420 |
Which is exactly what we do we hope right is that it's like okay? It's not a zero. It's not a one 00:35:39.300 |
It's not a two. It is a three. It's not a four. It's not a five and so forth 00:35:42.740 |
So maybe this would be a bit easier to read if we just grab like the first three of them 00:35:47.140 |
Okay, so it's like ten to the negative three, ten to the negative eight, and so on, okay? 00:35:55.100 |
And then suddenly here's one which is ten to the negative one, right? 00:35:57.620 |
So you can kind of see what it's trying to what it's trying to do here 00:36:02.980 |
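Putting the last two sketches together (names assumed as before):

    preds = net2(vxmb).exp()   # call the module like a function; exp undoes the log in log softmax
    preds[:3]                  # probabilities for the first three images, mostly close to zero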
I mean, we could call net2.forward and it'll do exactly the same thing 00:36:16.620 |
It's actually they actually call it as if it's a function right and so this is actually a really important idea 00:36:24.940 |
When we define our own architectures or whatever anywhere that you would put in a function 00:36:30.580 |
You could put in a layer anyway you put in a layer you can put in a neural net anyway 00:36:34.900 |
You put in a neural net you can put in a function, because as far as PyTorch is concerned 00:36:39.020 |
They're all just things that it's going to call just like as if they're functions 00:36:43.060 |
So they're all like interchangeable, and this is really important because that's how we create 00:36:48.020 |
Really good neural nets is by mixing and matching lots of pieces and putting them all together 00:37:11.380 |
Into a neural network with one hidden layer all right, and the way I'm going to do that is I'm going to create 00:37:19.860 |
I'm going to change this so it spits out a hundred rather than ten 00:37:24.420 |
Which means this one input is going to be a hundred rather than ten 00:37:30.020 |
Now this as it is can't possibly make things any better at all yet 00:37:35.340 |
Why is this definitely not going to be better than what I had before? 00:37:42.540 |
But you've got a combination of two linear layers, which is just the same as one 00:37:47.620 |
Exactly right so we've got two linear layers, which is just a linear layer right so to make things interesting 00:37:55.700 |
I'm going to replace all of the negatives from the first layer with zeros 00:38:00.880 |
Because that's a nonlinear transformation, and so that nonlinear transformation is called a rectified linear unit 00:38:07.820 |
Okay, so nn.Sequential simply is going to call each of these layers in turn for each mini batch, right: so do a linear layer, 00:38:18.340 |
Replace all of the negatives with zero, do another linear layer, and do a softmax. This is now a neural network 00:38:37.180 |
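A sketch of the one-hidden-layer version just described (100 hidden units is the number picked in the lecture; net3 is an assumed name):

    net3 = nn.Sequential(
        nn.Linear(28 * 28, 100),   # first linear layer now spits out 100 activations
        nn.ReLU(),                 # replace all the negatives with zero
        nn.Linear(100, 10),        # second linear layer down to 10
        nn.LogSoftmax(dim=-1)
    ).cuda()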
Okay, so the this is the idea is that the basic techniques. We're learning in this lesson 00:38:43.420 |
Like become powerful at the point where you start stacking them together, okay? 00:38:49.540 |
Can somebody pass the green box there and then there yes, Daniel? 00:38:54.660 |
Why did you pick a hundred? No reason it was like easier to type an extra zero? 00:39:04.220 |
How many activations you should have in a neural network layer is kind of part of the skill of a deep learning practitioner 00:39:09.780 |
We cover it in the deep learning course not in this course 00:39:20.660 |
An additional layer, an additional layer; this one here is called a nonlinear layer, or an activation function 00:39:30.060 |
Does it matter that like if you would have done for example like two softmaxes? 00:39:37.780 |
Or is that something you cannot do like yeah? 00:39:42.140 |
But it's probably not going to give you what you want and the reason why is that a softmax? 00:39:48.220 |
Tends to push most of its activations to zero and an activation just be clear like I've had a lot of questions in deep 00:39:55.460 |
Learning course about like what's an activation an activation is the value that is calculated in a layer, right? 00:40:04.740 |
Right it's not a weight a weight is not an activation 00:40:08.700 |
It's the value that you calculate from a layer 00:40:11.340 |
So softmax will tend to make most of its activations pretty close to zero 00:40:15.700 |
and that's the opposite of what you want you genuinely want your activations to be kind of as 00:40:20.860 |
Rich and diverse and and used as possible so nothing to stop you doing it, but it probably won't work very well 00:40:30.980 |
pretty much all of your layers will be followed by 00:40:34.300 |
Nonlinear activation functions, and that will nearly always be ReLU 00:40:44.700 |
Could you, when doing multiple layers, so let's say you're going two or three layers deep, 00:40:51.740 |
Do you want to switch up these activation layers? No, that's a great question. So if I wanted to go deeper, I'd just add another linear layer and ReLU like 00:41:01.940 |
That; okay, that's now a two hidden layer network 00:41:05.860 |
So I think I'd heard you said that there are a couple of different 00:41:13.780 |
Activation functions like that rectified linear unit. What are some examples and 00:41:33.980 |
Input comes in and you put it through a linear layer and then a nonlinear layer linear layer nonlinear layer 00:41:41.180 |
linear linear layer and then the final nonlinear layer 00:41:50.900 |
The final nonlinear layer as we've discussed, you know, if it's a 00:41:58.860 |
Classification where you only ever pick one of them, you would use softmax; for multi- 00:42:08.060 |
Label classification where you're predicting multiple things, you would use sigmoid 00:42:18.660 |
Right, although we learned in last night's DL course that sometimes you can use sigmoid there as well 00:42:23.300 |
So they're basically the options main options for the final layer 00:42:50.380 |
Another one you can pick which is kind of interesting which is called 00:43:07.100 |
Basically if it's above zero, it's y equals x and if it's below zero, it's like y equals 0.1 x 00:43:16.660 |
Rather than being equal to 0 below zero, it's something close to that 00:43:16.660 |
There are various others, but they're kind of like things that just look very close to that 00:43:38.060 |
So for example, there's something called ELU, which is quite popular 00:43:41.440 |
But like you know the details don't matter too much honestly like that there like ELU is something that looks like this 00:43:49.700 |
And it's kind of like it's not generally something that you so much pick based on the data set it's more like 00:43:59.380 |
Over time we just find better activation functions so two or three years ago 00:44:04.300 |
Everybody used ReLU, you know a year ago pretty much everybody used Leaky ReLU today 00:44:09.380 |
I guess probably most people starting to move towards ELU 00:44:11.940 |
But honestly the choice of activation function doesn't matter 00:44:18.460 |
And you know people have actually showed that you can use like our pretty arbitrary nonlinear activation functions like even a sine wave 00:44:30.820 |
So although what we're going to do today is showing how to create 00:44:51.620 |
Which is 96% ish accurate, it will be trivial, right, and in fact is something you should 00:44:57.900 |
Probably try and do during the week right is to create that version 00:45:10.580 |
So now that we've got something where we can take our network pass in our variable and get back some 00:45:22.580 |
That's basically all that happened when we called fit. So we're going to see how how that that approach can be used to create this stochastic gradient 00:45:35.860 |
Predicted probabilities into a predicted like which digit is it? We would need to use argmax 00:45:49.220 |
Instead, PyTorch just calls it max, and max 00:45:56.260 |
Returns the actual max across this axis, so this is across the columns, right, and the second thing it returns is the index 00:46:05.020 |
Of that maximum, right. So the equivalent of argmax is to call max and then get the first- 00:46:12.900 |
Indexed thing, okay. So there are our predictions, right; if this was in NumPy, we would instead use np.argmax 00:46:25.500 |
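A sketch of that call (preds as above):

    values, pred_digits = preds.max(1)   # max over the columns: the values and their indices
    # pred_digits is the argmax; the NumPy spelling would be np.argmax(preds_array, axis=1)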
So here are the predictions from our hand created logistic regression and in this case 00:46:37.300 |
So the next thing we're going to try and get rid of, in terms of using libraries, is we're going to try to avoid using the 00:46:43.300 |
Matrix multiplication operator and instead we're going to try and write that by hand 00:46:47.260 |
So this next part we're going to learn about something which kind of seems 00:47:03.860 |
It kind of it's going to seem like a minor little kind of programming idea, but actually it's going to turn out 00:47:14.620 |
That at least in my opinion. It's the most important 00:47:18.500 |
Programming concept that we'll teach in this course, and it's possibly the most important programming 00:47:26.620 |
All the things you need to build machine learning algorithms, and it's the idea of 00:47:37.300 |
If we create an array of 10, 6, -4 and an array of 2, 8, 7 and then add the two together 00:47:45.100 |
It adds each of the components of those two arrays in turn we call that element wise 00:47:54.060 |
So in other words we didn't have to write a loop right back in the old days 00:47:58.740 |
We would have to have looped through each one and added them and then concatenated them together 00:48:02.780 |
We don't have to do that today. It happens for us automatically so in numpy 00:48:20.420 |
So in fastai we just add a little capital T to turn something into a Pytorch tensor right and if we add those together 00:48:31.380 |
Exactly the same thing right so element wise operations are pretty standard in these kinds of libraries 00:48:37.700 |
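A sketch of those element-wise additions (the capital-T helper mentioned is fastai's; torch.tensor is used here instead):

    import numpy as np
    import torch

    a = np.array([10, 6, -4])
    b = np.array([2, 8, 7])
    a + b                               # array([12, 14,  3]): element-wise, no loop needed

    ta, tb = torch.tensor(a), torch.tensor(b)
    ta + tb                             # tensor([12, 14,  3]): exactly the same idea in PyTorch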
It's interesting not just because we don't have to write the for loop 00:48:44.100 |
Right, but it's actually much more interesting because of the performance things that are happening here 00:49:03.180 |
Right, even when you use PyTorch, it still does the for loop in Python; it has no way of 00:49:10.140 |
Optimizing the for loop, and so a for loop in Python is something like 1,000 or 10,000 times slower than the same loop in C 00:49:18.740 |
So that's your first problem; I can't remember exactly, it's like 1,000 or 10,000 times. The second problem then is that C alone isn't enough: 00:49:31.500 |
You want C to take advantage of the thing that all of your CPUs do, something called SIMD 00:49:37.700 |
Single instruction multiple data, which is: your CPU is capable of taking eight things at a time, 00:49:46.260 |
Right, in a vector, and adding them up to another 00:49:49.860 |
Vector with eight things in it, in a single CPU instruction 00:49:55.060 |
All right, so if you can take advantage of SIMD you're immediately eight times faster 00:49:59.260 |
It depends on how big the data type is it might be four might be eight 00:50:02.300 |
The other thing that you've got in your computer is you've got multiple processors 00:50:11.300 |
So you've probably got, like, if this is happening on one core, you've probably got about four of those 00:50:19.300 |
Okay, so if you're using SIMD you're eight times faster if you can use multiple cores, then you're 32 times faster 00:50:28.180 |
You might be something like 32 times a thousand times faster, right. And so the nice thing is that when we do that 00:50:52.060 |
Then your GPU can do about 10,000 things at a time 00:50:57.380 |
Right so that'll be another hundred times faster than C 00:51:04.500 |
To getting good performance is you have to learn how to write 00:51:15.900 |
Operations, and it's a lot more than just plus. I 00:51:19.040 |
Could also use less than, right, and that's going to return 0, 1, 1; or if we go back to NumPy 00:51:35.660 |
And so you can kind of use this to do all kinds of things without looping so for example 00:51:42.080 |
I could now multiply that by a and here are all of the values of a 00:51:47.460 |
As long as they're less than B or we could take the mean 00:51:53.440 |
This is the percentage of values in a that are less than B 00:51:59.460 |
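A sketch of those comparisons with the same a and b:

    a < b               # array([False,  True,  True])
    (a < b) * a         # array([ 0,  6, -4]): values of a wherever a is less than b, zero elsewhere
    (a < b).mean()      # 0.666...: the fraction of values in a that are less than b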
All right, so like there's a lot of stuff you can do with this simple idea 00:52:06.260 |
Right to take it further than just this element wise operation 00:52:10.020 |
We're going to have to go the next step to something called broadcasting 00:52:13.220 |
So let's take a five minute break come back at 217 and we'll talk about broadcasting 00:52:29.980 |
This is the definition from the numpy documentation of 00:52:38.020 |
Broadcasting and I'm going to come back to it in a moment rather than reading it now 00:52:41.780 |
But let's start by looking at an example of broadcasting 00:52:53.820 |
With one dimension also known as a rank one tensor 00:53:15.100 |
Right a rank zero tensor is also called a scalar 00:53:27.860 |
All right now you've probably done it a thousand times without even noticing. That's kind of weird right that you've got these things of different 00:53:36.060 |
Ranks and different sizes, so what is it actually doing right? 00:53:39.820 |
But what it's actually doing is it's taking that scalar and copying it here here here 00:53:46.140 |
Right, and then it's actually going element-wise: 10 is greater than 0, 00:53:53.780 |
6 is greater than 0, minus 4 is greater than 0, and giving us back the three answers 00:54:01.260 |
Right, and that's called broadcasting. Broadcasting means copying this value across, as if it were a tensor, 00:54:11.060 |
To allow it to be the same shape as the other tensor 00:54:20.580 |
What it actually does is it stores this kind of internal indicator that says pretend that this is a 00:54:32.500 |
But it actually just like what rather than kind of going to the next row or going to the next scalar it goes back 00:54:38.540 |
To where it came from if you're interested in learning about this specifically 00:54:42.620 |
It's they set the stride on that axis to be zero. That's a minor advanced concept for those who are curious 00:54:59.200 |
So if we say a plus 1, it broadcasts the 1 to be 1, 1, 1 and then does element-wise addition 00:55:03.000 |
We could do the same with a matrix, right: here's our matrix; 2 times the matrix is going to broadcast the 2 00:55:10.180 |
To be 2, 2, 2, 2, 2, 2, 2, 2, 2 and then do element-wise multiplication 00:55:18.500 |
All right, so that's our kind of most simple version of 00:55:26.100 |
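A sketch of those scalar examples (a as above, m a 3 by 3 matrix):

    a > 0        # array([ True,  True, False]): the 0 is broadcast to [0, 0, 0]
    a + 1        # array([11,  7, -3]): the 1 is broadcast to [1, 1, 1]

    m = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
    m * 2        # the 2 is broadcast across every element of the matrix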
So here's a slightly more complex version of broadcasting 00:55:30.460 |
Here's an array called C. All right, so this is a rank 1 tensor and 00:56:06.940 |
You can see that what it's done is to add that to each row 00:56:15.140 |
14 25 36 and so we can kind of figure it seems to have done the same kind of idea as broadcasting a scalar 00:56:32.060 |
If it's a rank 2 matrix and now we can do element wise addition 00:56:42.340 |
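A sketch of that example (m as above, c a rank 1 tensor):

    c = np.array([10, 20, 30])
    m + c    # c is broadcast across each row:
             # array([[11, 22, 33],
             #        [14, 25, 36],
             #        [17, 28, 39]])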
That makes sense now. Yes, can you pass that to Devon over there? Thank you 00:56:48.140 |
So by looking at this example, it looks like it broadcasts across the rows, 00:56:58.420 |
So how would we want to do it if we wanted to get new columns? I'm so glad you asked 00:57:31.380 |
So to get NumPy to do that, we need to not pass in a rank 1 tensor, but instead a 00:57:40.700 |
Matrix with one column, a rank 2 tensor, right? Remember, NumPy treats a 00:57:54.380 |
Rank 1 tensor, for these purposes, as if it was a rank 2 tensor which represents a row 00:58:02.140 |
Right, so in other words, as if it is 1 by 3, right? So we want to create a tensor which is 3 by 1 00:58:17.180 |
And if you then pass in this argument, it says please insert a length 1 axis 00:58:24.260 |
Here, please. So in our case, we want to turn it into a 3 by 1 00:58:33.020 |
Okay, so if we say np.expand_dims(c, 1), it changes the shape to 3 comma 1. So if we look at what that looks like 00:58:46.620 |
That looks like a column. Okay, so if we now go 00:58:55.820 |
You can see it's doing exactly what we hoped it would do 00:58:58.980 |
Right, which is to add 10 20 30 to the column 00:59:03.620 |
10 20 30 to the column 10 20 30 to the column 00:59:12.220 |
Location of a unit axis turns out to be so important 00:59:20.580 |
It's really helpful to kind of experiment with creating these extra unit axes and know how to do it easily and 00:59:30.060 |
That isn't, in my opinion, the easiest way to do it, though 00:59:33.420 |
The easiest way is to index into the tensor with a special 00:59:40.340 |
Index none and what none does is it creates a new axis in that location of 00:59:53.660 |
Going to add a new axis at the start of length 1 00:59:58.460 |
This is going to add a new axis at the end of length 1 or 01:00:18.340 |
Things in it could be of any rank you like right you can just add 01:00:22.860 |
Unit axes all over the place and so that way we can kind of 01:00:35.380 |
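A sketch of both ways of adding a unit axis (c as above):

    np.expand_dims(c, 1).shape   # (3, 1): insert a length 1 axis at position 1
    c[:, None].shape             # (3, 1): the same thing with the special None index
    c[None].shape                # (1, 3): new unit axis at the start
    c[None, :, None].shape       # (1, 3, 1): unit axes wherever you like

    m + c[:, None]               # now c is broadcast down the columns instead of across the rows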
Thing in NumPy called broadcast_to, and what that does is it takes our vector and 01:00:45.100 |
broadcasts it to that shape and shows us what that would look like 01:00:49.020 |
Right so if you're ever like unsure of what's going on in some broadcasting operation 01:00:55.060 |
You can say broadcast_to, and so for example here we could say, rather than 3 comma 3, we could say m.shape 01:01:01.980 |
Right and see exactly what's happened going to happen, and so that's what's going to happen before we add it to n 01:01:21.460 |
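A sketch of that check:

    np.broadcast_to(c, (3, 3))            # shows c repeated as three identical rows
    np.broadcast_to(c[:, None], m.shape)  # the column version, repeated across m's shape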
Make sense, so that's kind of like the intuitive 01:01:29.340 |
Broadcasting and so now hopefully we can go back to that 01:01:38.140 |
Broadcasting describes how numpy is going to treat arrays of different shapes when we do some operation 01:01:42.740 |
Right, the smaller array is broadcast across the larger array; by smaller array, they mean lower rank, 01:01:52.860 |
Broadcast across the higher rank tensor, so that they have compatible shapes. It vectorizes array operations, 01:01:59.540 |
So vectorizing generally means like using SIMD and stuff like that so that multiple things happen at the same time 01:02:08.820 |
But it doesn't actually make needless copies of data it kind of just acts as if it had 01:02:18.060 |
now in deep learning you very often deal with tensors of rank four or more and 01:02:24.620 |
you very often combine them with tensors of rank one or two and 01:02:29.060 |
Trying to just rely on intuition to do that correctly is nearly impossible 01:02:42.300 |
Okay, here's m.shape, here's c.shape. So the rules are that we're going to compare 01:02:50.180 |
The shapes of our two tensors element wise we're going to look at one at a time 01:02:54.740 |
And we're going to start at the end right so look at the trailing dimensions and 01:03:01.460 |
Towards the front okay, and so two dimensions are going to be compatible 01:03:06.220 |
When one of these two things is true: either they're equal, or one of them is 1. So let's check, right, are our m and c compatible? m is 3 by 01:03:18.500 |
3, right, so we're going to start at the end, trailing dimensions first, and check are they compatible. They're compatible if the dimensions are equal 01:03:26.620 |
Okay, so these ones are equal so they're compatible 01:03:31.180 |
Let's go to the next one. Oh, oh, we're missing 01:03:34.140 |
Right C is missing something. So what happens if something is missing as we insert a one? 01:03:41.100 |
Okay, that's the rule right and so let's now check are these compatible one of them is one. Yes, they're compatible 01:03:49.140 |
Okay, so now you can see why it is that NumPy treats a rank 1 tensor as 01:04:02.060 |
Something which is representing a row: it's because we're basically inserting a one at the front 01:04:12.620 |
This is something that you very commonly have to do which is you start with like an 01:04:20.780 |
image they're like 256 pixels by 256 pixels by three channels and 01:04:31.740 |
All right, so you've got 256 by 256 by 3 and you want to subtract something of length 3, right? 01:04:40.020 |
Absolutely because 3 and 3 are compatible because they're the same 01:04:43.980 |
All right 256 and empty is compatible. It's going to insert a 1 01:04:48.340 |
256 and empty is compatible. It's going to insert a 1 01:04:55.740 |
this is going to be broadcast over all of this axis and then that whole thing will be broadcast over this axis and 01:05:17.300 |
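A sketch of that image example with made-up per-channel means:

    img = np.random.rand(256, 256, 3)           # height by width by channels
    channel_means = np.array([0.4, 0.5, 0.3])   # one mean per channel (made-up numbers)
    normalized = img - channel_means            # the length 3 vector is broadcast over both spatial axes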
very few people in the data science or machine learning communities 01:05:22.300 |
Understand broadcasting and the vast majority of the time for example when I see people doing pre-processing for computer vision 01:05:28.820 |
Like subtracting the mean they always write loops 01:05:32.760 |
over the channels right and I kind of think like 01:05:36.780 |
It's it's like so handy to not have to do that and it's often so much faster to not have to do that 01:05:46.220 |
You'll have this like super useful skill that very very few people have 01:05:52.060 |
And and like it's it's it's an ancient skill. You know it goes it goes all the way back to 01:06:00.980 |
so APL was from the late 50s, stands for A Programming Language, and 01:06:21.100 |
He proposed that if we use this new math notation 01:06:24.700 |
It gives us new tools for thought and allows us to think things we couldn't before and one of his ideas was 01:06:35.660 |
computer programming tool, but as a piece of math notation and 01:06:43.180 |
this notation as a tool for thought as a programming language called APL and 01:06:57.180 |
Which is basically what you get when you put 60 years of very smart people working on this idea 01:07:03.980 |
And with this programming language you can express 01:07:07.820 |
Very complex mathematical ideas often just with a line of code or two 01:07:16.940 |
But it's even greater that these ideas have found their ways into the languages 01:07:21.020 |
We all use like in Python the NumPy and PyTorch libraries, right? These are not just little 01:07:26.740 |
Kind of niche ideas. It's like fundamental ways to think about math and to do programming 01:07:33.020 |
Like let me give an example of like this kind of notation as a tool for thought 01:07:48.380 |
None, right. Notice this now has two square brackets, right? So this is kind of like a one-row 01:08:19.780 |
Okay, what's that going to do? Have a think about it 01:08:34.580 |
Anybody want to have a go you can even talk through your thinking. Okay. Can we pass the check this over there? Thank you 01:08:40.580 |
Kind of outer product. Yes, absolutely. So take us through your thinking. How's that gonna work? 01:08:47.780 |
So the diagonal elements can be directly visualized from the squares 01:09:00.780 |
And if you multiply the first row with this column, you can get the first row of the matrix 01:09:07.900 |
So finally you'll get a 3 cross 3 matrix. Yeah, and 01:09:12.500 |
So to think of this in terms of like those broadcasting rules, we're basically taking 01:09:28.780 |
This column, which is of dimension, sorry, I mean 3 by 1, and this row, which is of dimension 1 by 3 01:09:34.340 |
Right and so to make these compatible with our broadcasting rules 01:09:45.140 |
Okay, and now this one's going to have to be duplicated three times to match this 01:10:05.100 |
Matrices to do an element wise product of and so as you say 01:10:12.820 |
There is our outer product right now. The interesting thing here is 01:10:17.900 |
That suddenly now that this is not a special mathematical case 01:10:23.220 |
But just a specific version of the general idea of broadcasting we can do like an outer plus 01:10:35.060 |
Right or or whatever right so it's suddenly we've kind of got this this this concept 01:10:44.940 |
New ideas and then we can start to experiment with those new ideas. And so, you know interestingly 01:11:02.100 |
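A sketch of that outer product, plus the "outer plus" variant (c as above):

    c[:, None] * c[None, :]   # outer product: a 3 by 1 column broadcast against a 1 by 3 row
    c[:, None] + c[None, :]   # the same broadcasting trick with plus instead of times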
This is how NumPy does it right actually this is kind of the sorry, let me show you this way 01:11:11.660 |
If you want to create a grid, this is how NumPy does it it actually returns 01:11:26.060 |
So we could say like okay, that's x grid comma y grid 01:11:42.580 |
Like that right and so suddenly we've expanded that out 01:11:59.220 |
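The exact call isn't legible in the transcript, but NumPy's ogrid is one example of this pattern: it hands back a column and a row with unit axes that broadcast into a full grid:

    xg, yg = np.ogrid[0:5, 0:5]   # xg has shape (5, 1), yg has shape (1, 5)
    grid = xg + yg                # broadcasting expands them into a full 5 by 5 grid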
Yeah, it's kind of interesting how like some of these like simple little concepts 01:12:05.580 |
Kind of get built on and built on and built on so if you lose something like APL or J. It's this whole 01:12:11.660 |
Environment of layers and layers and layers of this we don't have such a deep environment in NumPy 01:12:18.260 |
But you know you can certainly see these ideas of like broadcasting coming through 01:12:22.900 |
In simple things like how do we create a grid in in NumPy? 01:12:27.220 |
So yeah, so that's that's broadcasting and so what we can do with this now is 01:12:34.860 |
Use this to implement matrix multiplication ourselves 01:12:43.980 |
Now why would we want to do that well obviously we don't right matrix multiplication has already been handled 01:12:57.620 |
All kinds of areas in in machine learning and particularly in deep learning that there'll be 01:13:08.460 |
Function that you want to do that aren't quite 01:13:13.300 |
Done for you all right so for example. There's like whole areas 01:13:26.980 |
Which are really being developed a lot at the moment and they're kind of talking about like how do we take like 01:13:38.380 |
Higher rank tensors and kind of turn them into combinations of rows 01:13:43.260 |
Columns and faces and it turns out that when you can kind of do this you can basically like 01:13:50.260 |
Deal with really high dimensional data structures with not much memory and not with not much computation time for example. There's a really terrific library 01:14:00.460 |
Which does a whole lot of this kind of stuff? 01:14:05.660 |
So it's a really really important area it covers like all of deep learning lots of modern machine learning in general 01:14:12.460 |
And so even though you're not going to need to define matrix multiplication, you're very likely to want to define some other 01:14:22.820 |
So it's really useful to kind of understand how to do that 01:14:34.260 |
2d array and 1d array rank 2 tensor rank 1 tensor and 01:14:40.860 |
Using the at sign or the old way NP dot matmul. Okay? 01:14:46.500 |
And so what that's actually doing when we do that is we're basically saying 01:15:07.700 |
We can go through and do the same thing for the next one and for the next one to get our result, right? 01:15:34.020 |
Okay, but that is not matrix multiplication. What's that? 01:15:45.180 |
Okay, element wise specifically we've got a matrix and a vector so 01:15:53.900 |
Broadcasting okay good. So we've got this is element wise with broadcasting but notice 01:16:01.180 |
The numbers it's created, 10, 40, 90, are the exact three numbers that I need to calculate the first 01:16:10.420 |
Piece of my matrix multiplication. So in other words if we sum this 01:16:31.700 |
This stuff without special help from our library 01:16:38.580 |
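A sketch of that, with m and c as before:

    m @ c                 # array([140, 320, 500]): the library's matrix-vector product
    m * c                 # element-wise with broadcasting; the first row is 10, 40, 90
    (m * c).sum(axis=1)   # array([140, 320, 500]): summing across each row reproduces m @ c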
Let's expand this out to a matrix-matrix product, which 01:16:45.700 |
looks like this. This is this great site called matrixmultiplication.xyz, 01:16:52.420 |
and it shows us what happens when we multiply two matrices, 01:17:06.400 |
operationally speaking. So in other words, what we just did there, we took the first column with the first row, 01:17:20.680 |
then we took the second column with the first row 01:17:26.120 |
to get that one. All right, so we're basically doing 01:17:29.040 |
the thing we just did, the matrix-vector product; we're just doing it twice, once 01:17:36.480 |
with this column and once with this column, and then we can concatenate the two together. 01:17:57.640 |
m times the first column, then sum; m times the second column, then sum; and so there are the two columns of our matrix multiplication. 01:18:09.240 |
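A sketch of that column-by-column idea; the second matrix here is an illustrative stand-in.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
n = np.array([[10, 40],
              [20, 50],
              [30, 60]])                  # two columns, so two matrix-vector products

# Do the broadcast-and-sum trick once per column of n, then stack the results.
col0 = (m * n[:, 0]).sum(axis=1)          # m times the first column, then sum
col1 = (m * n[:, 1]).sum(axis=1)          # m times the second column, then sum
ours = np.stack([col0, col1], axis=1)

print(ours)
print(m @ n)                              # the built-in gives the identical answer
```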
So I didn't want to make our code too messy, 01:18:12.960 |
so I'm not going to actually use that, but we have it there now if we want to. We don't need to use 01:18:20.280 |
Torch or NumPy matrix multiplication anymore; we've got our own that we can use, using nothing but broadcasting and element-wise operations. 01:18:39.960 |
So here's our logistic regression from scratch class again; I've just copied it here. 01:18:45.960 |
Here is where we instantiate the object, copy it to the GPU, and create an optimizer, 01:18:50.160 |
which we'll learn about in a moment, and we call fit. Okay, so the goal is to now repeat this without needing to call fit. 01:19:09.320 |
Fit is the thing which grabs a mini-batch of data at a time, and with each mini-batch of data 01:19:15.600 |
we pass it to the optimizer and say: please try to come up with a slightly better set of predictions. 01:19:24.240 |
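Roughly, the setup being described looks something like the sketch below. The module definition, the input and output sizes, and the choice of Adam are my assumptions, not the notebook's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A rough sketch of the setup: a logistic-regression-style module, moved to the
# GPU when one is available, plus an optimizer whose step() we'll call by hand.
class LogReg(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.lin = nn.Linear(n_in, n_out)          # a single linear layer
    def forward(self, x):
        return F.log_softmax(self.lin(x), dim=-1)  # log-probabilities per class

device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = LogReg(28 * 28, 10).to(device)               # instantiate and copy to the GPU
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)
# In the lesson a fit(...) call trains this; the point below is to do that loop ourselves.
```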
So as we learned in order to grab a mini batch of the training set at a time 01:19:28.560 |
We have to ask the model data object for the training data loader 01:19:31.840 |
We have to wrap it in iter() to create an iterator (a generator), 01:19:36.920 |
and so that gives us our data loader. Okay, so PyTorch calls this a data loader; 01:19:44.040 |
we actually wrote our own fast.ai data loader, but it's basically the same idea. 01:19:50.280 |
So the next thing we do is we grab the X and the Y tensor 01:19:59.520 |
Wrap it in a variable to say I need to be able to take the derivative of 01:20:05.080 |
The calculations using this because if I can't take the derivative 01:20:08.640 |
Then I can't get the gradients and I can't update the weights 01:20:12.400 |
all right, and I need to put it on the GPU because my module is sitting on the GPU. 01:20:18.760 |
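Here is a runnable stand-in for those mechanics. The lesson pulls batches from a fastai ModelData object and wraps tensors in Variable (the older PyTorch API); this sketch uses a plain DataLoader over random data instead.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-in data in place of the notebook's MNIST ModelData object.
X = torch.randn(640, 28 * 28)
Y = torch.randint(0, 10, (640,))
trn_dl = DataLoader(TensorDataset(X, Y), batch_size=64, shuffle=True)

dl = iter(trn_dl)          # wrap the data loader in an iterator
xt, yt = next(dl)          # grab one mini-batch of inputs and targets
print(xt.shape, yt.shape)  # torch.Size([64, 784]) torch.Size([64])
# In the lesson these are then wrapped in Variable(...) and moved with .cuda();
# with current PyTorch, gradient tracking on the model's parameters is enough.
```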
So we can now take that variable and pass it to 01:20:23.340 |
The object that we instantiated our logistic regression 01:20:28.440 |
Remember our module we can use it as if it's a function because that's how pytorch works 01:20:32.840 |
And that gives us a set of predictions, as we've seen before. 01:20:41.760 |
So now we can check the loss. The loss we defined as being a negative log likelihood loss 01:20:49.440 |
object, and we're going to learn about how that's calculated in the next lesson; for now, think of it 01:20:55.200 |
just like root mean squared error, but for classification problems. 01:20:58.320 |
So we can call that also just like a function, so you can kind of see this 01:21:03.840 |
very general idea in PyTorch: you kind of treat everything, ideally, like it's a function. 01:21:09.480 |
So in this case we have a loss a negative log likelihood loss object. We treat it like a function we pass in our predictions and 01:21:16.560 |
We pass in our actuals, right, and again the actuals need to be turned into a variable and put on the GPU, 01:21:23.360 |
Because the loss is specifically the thing that we actually want to take the derivative of right so that gives us our loss 01:21:36.200 |
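Continuing the sketch, the forward pass and the loss call look roughly like this; the linear module and random batch are stand-ins for the notebook's network and MNIST data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Linear(28 * 28, 10)             # stand-in for the logistic regression module
x = torch.randn(64, 28 * 28)             # stand-in mini-batch
y = torch.randint(0, 10, (64,))          # stand-in actuals

pred = F.log_softmax(net(x), dim=-1)     # call the module like a function -> predictions
loss_fn = nn.NLLLoss()                   # negative log likelihood loss object
l = loss_fn(pred, y)                     # also called like a function: predictions, actuals
print(l.item())
```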
So it's a variable and because it's a variable it knows how it was calculated 01:21:41.320 |
All right, it knows it was calculated with this loss function. It knows that the predictions were calculated with this 01:21:47.980 |
Network it knows that this network consisted of these operations and so we can get the gradient 01:22:01.800 |
We call L dot backward remember L is the thing that contains our loss 01:22:06.560 |
All right, so L dot backward is something which is added to anything that's a variable: 01:22:13.120 |
You can call dot backward and that says please calculate the gradients 01:22:16.440 |
Okay, and so that calculates the gradients and stores them inside each of the 01:22:28.120 |
weights; each of the parameters that was used to calculate the loss now has stored on it a 01:22:33.960 |
dot grad attribute (we'll see it later). It's basically stored the gradient there. 01:22:40.320 |
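A tiny self-contained illustration of that, using my own toy example rather than the notebook's network:

```python
import torch

# Any scalar built from tensors that require gradients can have backward() called
# on it; the gradients land in .grad on each parameter that fed into it.
w = torch.tensor([2.0, 3.0], requires_grad=True)
l = (w ** 2).sum()        # a stand-in "loss"
l.backward()              # fills w.grad with dl/dw = 2*w
print(w.grad)             # tensor([4., 6.])
```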
So we can then call optimizer.step(), and we're going to do this step manually shortly, 01:22:44.520 |
and that's the bit that says: please make the weights a little bit better. And so what optimizer.step() is doing 01:22:53.440 |
Is it saying like okay if you had like a really simple function? 01:23:04.560 |
Right then what the optimizer does is it says okay. Let's pick a random starting point 01:23:11.580 |
Right and let's calculate the value of the loss right so here's our parameter 01:23:17.400 |
Here's our loss right let's take the derivative 01:23:21.920 |
All right the derivative tells us which way is down, so it tells us we need to go that direction 01:23:31.920 |
Then we take the derivative again and take a small step, take the derivative again, 01:23:37.400 |
take a small step, do it again, take a small step, 01:23:40.440 |
till eventually we're taking such small steps that we stop. Okay, so how big a step do we take? 01:23:52.440 |
Well, we basically take the derivative here; so let's say the derivative there is, like, eight, 01:23:57.300 |
and we multiply it by a small number, like say 0.01, and that tells us what step size to take; 01:24:06.020 |
this small number here is called the learning rate and 01:24:10.040 |
It's the most important hyper-parameter to set. If you pick too small a learning rate, 01:24:17.800 |
then your steps down are going to be tiny, and it's going to take you forever. 01:24:23.180 |
Pick too big a learning rate and you'll jump too far, 01:24:27.960 |
then you'll jump too far the other way, and you'll diverge rather than converge, okay? 01:24:35.680 |
We're not going to talk about how to pick a learning rate in this class 01:24:39.640 |
But in the deep learning class we actually show you a specific technique that very reliably picks a very good learning rate 01:24:48.200 |
So that's basically what's happening right so we calculate the derivatives 01:24:53.040 |
and we call the optimizer, which does a step; in other words, it updates the weights based on the gradients and the learning rate. 01:25:01.280 |
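For plain gradient descent, that step amounts to something like this toy sketch; the loss function and the numbers are illustrative.

```python
import torch

# What a single SGD-style step is doing: move the parameter a small amount in the
# downhill direction, scaled by the learning rate.
lr = 0.01
w = torch.tensor(5.0, requires_grad=True)
loss = (w - 3) ** 2                  # a toy loss with its minimum at w = 3
loss.backward()                      # w.grad is now 2 * (w - 3) = 4
with torch.no_grad():
    w -= lr * w.grad                 # step = derivative times learning rate
print(w)                             # tensor(4.9600, requires_grad=True)
```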
We should hopefully find that after doing that we have a better loss than we did before 01:25:07.800 |
So I just reran this and got a loss here of 4.16, and 01:25:14.600 |
it's now 4.03. Okay, so it worked the way 01:25:17.760 |
We hoped it would based on this mini batch it updated all of the weights in our 01:25:22.640 |
Network to be a little better than they were as a result of which our loss went down, okay? 01:25:31.480 |
All right, we're going to go through a hundred steps 01:25:35.200 |
Grab one more mini batch of data from the data loader 01:25:39.560 |
Calculate our predictions from our network calculate our loss from the predictions and the actuals 01:25:45.360 |
Every 10 goes we'll print out the accuracy: just take the mean of whether the predictions equal the actuals or not. 01:25:51.840 |
One PyTorch-specific thing: you have to zero the gradients. Basically, you can have networks where you've got lots of different loss 01:26:01.300 |
functions whose gradients you might want to add together, 01:26:03.980 |
so you have to tell PyTorch when to set the gradients back to zero. 01:26:09.400 |
So this just says: set all the gradients to zero, 01:26:12.120 |
calculate the gradients (that's the backward call), and then take one step of the optimizer, 01:26:18.180 |
so update the weights using the gradients and the learning rate. And so once we run it, you can see the loss goes down. 01:26:34.160 |
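Put together, the loop being described looks roughly like this self-contained sketch, with random stand-in data rather than the notebook's MNIST batches, and SGD as an assumed optimizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

X, Y = torch.randn(640, 28 * 28), torch.randint(0, 10, (640,))
trn_dl = DataLoader(TensorDataset(X, Y), batch_size=64, shuffle=True)

net = nn.Linear(28 * 28, 10)
loss_fn = nn.NLLLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

dl = iter(trn_dl)
for t in range(100):
    try:
        xt, yt = next(dl)                     # grab one more mini-batch
    except StopIteration:
        dl = iter(trn_dl)                     # restart the loader when it runs out
        xt, yt = next(dl)

    pred = F.log_softmax(net(xt), dim=-1)     # predictions from the network
    l = loss_fn(pred, yt)                     # loss from predictions and actuals

    if t % 10 == 0:                           # every 10 goes, print the accuracy
        acc = (pred.argmax(dim=1) == yt).float().mean().item()
        print(f'step {t}: loss {l.item():.3f}  acc {acc:.3f}')

    optimizer.zero_grad()                     # set all the gradients to zero
    l.backward()                              # calculate the gradients
    optimizer.step()                          # update weights with gradients and lr
```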
So that's the basic approach, and next lesson we'll take it further. 01:26:44.360 |
We're not going to look inside here; as I say, we're basically going to take the calculation of the derivatives as a given. 01:26:56.240 |
And any kind of deep network you have kind of like a function 01:27:03.480 |
And then you pass the output of that into another function that might be like a ReLU 01:27:08.920 |
And you pass the output of that into another function that might be another linear layer, 01:27:14.320 |
And you pass that into another function that might be another ReLU and so forth right so these deep networks are just 01:27:22.320 |
Functions of functions of functions, so you could write them mathematically like that right and so 01:27:30.200 |
All backprop does is it says: let's just simplify this down to the two-function version, g(f(x)); 01:27:40.880 |
and so the derivative of g(f(x)) we can calculate with the chain rule as g'(f(x)) * f'(x). 01:27:56.160 |
Right and so you can see we can do the same thing for the functions of the functions of the functions, and so when you apply a 01:28:02.880 |
Function to a function of a function you can take the derivative just by taking the product of the derivatives of each of those 01:28:09.880 |
layers, okay? And in neural networks we call this back propagation. 01:28:15.040 |
Okay, so when you hear back propagation it just means use the chain rule to calculate the derivatives 01:28:31.560 |
Like if it's defined sequentially literally all this means is 01:28:40.840 |
Apply this function to that apply this function to that apply this function to that right so this is just defining a 01:28:49.840 |
composition of a function to a function to a function to a function to a function 01:28:56.000 |
Yeah, so although we're not going to bother with calculating the gradients ourselves 01:28:59.740 |
You can now see why it can do it, right? As long as it has, internally, 01:29:03.480 |
you know, it knows what's the derivative of to-the-power-of, what's the derivative of sine, 01:29:10.440 |
what's the derivative of plus, and so forth, then our Python code 01:29:14.000 |
in here is just combining those things together, 01:29:18.920 |
So it just needs to know how to compose them together with the chain rule and away it goes, okay? 01:29:26.140 |
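As a quick check of that claim, here is a toy composition where autograd's answer matches the chain rule written out by hand; the particular functions are mine, chosen for illustration.

```python
import torch

# For g(f(x)) with f(x) = x**2 and g(u) = sin(u),
# the chain rule gives d/dx g(f(x)) = cos(x**2) * 2*x.
x = torch.tensor(1.5, requires_grad=True)
y = torch.sin(x ** 2)                    # a composition of two functions
y.backward()

with torch.no_grad():
    by_hand = torch.cos(x ** 2) * 2 * x  # the chain rule, written out manually
print(x.grad, by_hand)                   # the two values agree
```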
Okay, so I think we can leave it there for now, and in the next class we'll 01:29:40.240 |
write our own optimizer, and then we'll have solved MNIST from scratch ourselves. See you then.