Machine Learning 1: Lesson 9
Chapters
0:0 Introduction
0:35 Synthetic Data
4:10 Parfit
11:40 Basic Steps
15:30 nn.Module
16:15 Constructor
19:30 Defining Forward
21:10 Softmax
27:30 Parameters
28:50 Results
29:10 Functions
30:15 Generators
32:0 Fastai
33:10 Variable
36:5 Function
45:10 Making Predictions
47:0 Broadcasting
48:40 Performance
52:30 Broadcast
00:00:02.600 |
I'm really excited to be able to share some amazing stuff that 00:00:08.480 |
University of San Francisco students have built during the week or written about during the week 00:00:15.280 |
Quite a few things. I'm going to show you have already 00:00:23.200 |
Tweets and posts and all kinds of stuff happening 00:00:28.040 |
One of the first to be widely shared was this one by Tyler, who did something really interesting. 00:00:34.880 |
He started out by saying like what if I like create the synthetic data set where the independent variables is like the x and the y 00:00:43.940 |
And the dependent variable is like color right and interestingly 00:00:48.080 |
He showed me an earlier version of this where he wasn't using color 00:00:51.080 |
he was just like putting the actual numbers in here and 00:00:54.840 |
this thing kind of wasn't really working at all and as soon as he started using color it started working really well and 00:01:00.640 |
So I wanted to mention that one of the things that unfortunately we don't teach you, 00:01:10.840 |
Because actually when it comes to visualization it's kind of the most important thing to know, is what is the human eye, 00:01:17.080 |
Or what is the human brain, good at perceiving? There's a whole area of academic study on this 00:01:24.400 |
And one of the things that we're best at perceiving is differences in color 00:01:28.040 |
Right so that's why as soon as we look at this picture of the synthetic data. He created you can immediately see oh there's kind of four 00:01:40.840 |
What if we like tried to create a machine learning model of this synthetic data set? 00:01:46.720 |
And so specifically he created a tree and the cool thing is that you can actually draw 00:01:55.440 |
He did this all in matplotlib. Matplotlib is very flexible, right; he actually drew the tree boundaries 00:02:01.840 |
So that's already a pretty neat trick is to be actually able to draw the tree 00:02:07.800 |
But then he did something even cleverer which is he said okay? 00:02:10.800 |
So what predictions does the tree make well it's the average of each of these areas and so to do that 00:02:28.120 |
Here's where it gets really interesting. As you know, the trees are randomly 00:02:39.600 |
Generated through resampling; they're all pretty similar, but a little bit different 00:02:43.880 |
And so now we can actually visualize bagging and to visualize bagging we literally take the average of the four pictures 00:02:54.000 |
There it is right and so here is like the the fuzzy decision boundaries of a random forest 00:03:01.440 |
And I think this is kind of amazing, right, because I wish I had this actually when I started teaching you 00:03:08.880 |
All about random forests, because I could have skipped a couple of classes. It's just like, okay, that's what we do 00:03:13.920 |
You know we create the decision boundaries we average each area 00:03:18.360 |
And then we we do it a few times and average all of them 00:03:21.960 |
Okay, so that's what a random forest does and I think like this is just such a great example of 00:03:26.320 |
Making the complex easy through through pictures 00:03:37.360 |
That he has actually reinvented something that somebody else has already done: a researcher who went on to be 00:03:44.000 |
One of the world's foremost machine learning researchers actually included almost exactly this technique in a book 00:03:51.000 |
He wrote about decision forests. So it's actually kind of cool that Tyler ended up 00:03:54.880 |
Reinventing something that one of the world's foremost authorities on decision forests actually created 00:04:03.200 |
That's nice because when we posted this on Twitter 00:04:05.960 |
It got a lot of attention, and finally somebody was able to say, like, oh, 00:04:09.800 |
You know what, this actually already exists; so Tyler's gone away and, you know, started reading that book 00:04:17.160 |
Something else which is super cool is Jason Carpenter 00:04:20.520 |
Created a whole new library called parfit and parfit is a 00:04:26.960 |
parallelized fitting of multiple models for the purpose of 00:04:31.200 |
Selecting hyper parameters, and there's a lot. I really like about this 00:04:36.560 |
He's shown a clear example of how to use it right and like the API looks very similar to other grid search based approaches 00:04:50.780 |
Rachel wrote about and that we learned about a couple of weeks ago of using a good validation set 00:04:57.820 |
You know, what he's done here in his blog post that introduces it is he's explained 00:05:06.820 |
What hyperparameters are and why we have to tune them 00:05:09.140 |
And he's kind of explained every step and then the the module itself is like it's it's very polished 00:05:15.820 |
You know he's added documentation to it. He's added a nice read me to it 00:05:19.620 |
And it's kind of interesting when you actually look at the code you realize 00:05:22.940 |
You know it's very simple. You know which is it's definitely not a bad thing. That's a good thing is to make things simple 00:05:33.100 |
Writing this little bit of code and then packaging it up so nicely 00:05:35.700 |
He's made it really easy for other people to use this technique 00:05:42.460 |
one of the things I've been really thrilled to see is then 00:05:44.660 |
Vinay went along and combined two things from our class one was to take 00:05:50.180 |
Parfit and then the other was to take the kind of accelerated SGD approach to classification 00:05:56.020 |
We learned about in the last lesson, and combine the two to say, like, okay, well, let's now use 00:06:20.780 |
Summarized pretty much all the stuff we learned in the random forest interpretation class 00:06:27.980 |
And he went even further than that as he described each of the different approaches to random forest interpretation 00:06:37.020 |
He described how it's done so here for example is feature importance through variable permutation a little picture of each one and 00:06:44.860 |
Then super cool here is the code to implement it from scratch 00:06:52.580 |
Nice post you know describing something that not many people understand and showing you know exactly how it works both with pictures 00:07:00.740 |
And with code that implements it from scratch 00:07:04.340 |
So I think that's really really great. One of the things I really like here is that for the 00:07:09.100 |
Tree interpreter, he actually showed how you can take the tree interpreter 00:07:14.320 |
output and feed it into the new waterfall chart package that 00:07:19.300 |
Chris our USF student built to show how you can actually visualize 00:07:23.260 |
The contributions of the tree interpreter in a waterfall chart so again kind of a nice combination of 00:07:30.740 |
multiple pieces of technology we've both learned about and and built as a group I 00:07:39.860 |
There have been a few interesting kernels shared, and I'll share some more next week. Devesh wrote this really nice kernel 00:07:45.460 |
Showing, there's this quite challenging Kaggle competition on detecting icebergs versus 00:07:53.420 |
Ships, and it's kind of weird two-channel satellite data which is very hard to visualize, and he actually 00:08:01.940 |
Went through and basically described kind of the formulas for how these like radar scattering things actually work 00:08:10.420 |
And then actually managed to come up with a code that allowed him to recreate 00:08:24.820 |
I have not seen that done before or like I you know it's it's quite challenging to know how to visualize this data 00:08:31.020 |
And then he went on to show how to build a neural net to try to interpret this so that was pretty fantastic as well 00:08:38.800 |
So yeah congratulations for all of you. I know for a lot of you. You know you're 00:08:44.140 |
Posting stuff out there to the rest of the world for the first time you know and it's kind of intimidating 00:08:51.500 |
you're used to writing stuff that you kind of hand into a teacher, and they're the only ones who see it and 00:08:56.380 |
You know it's kind of scary the first time you do it 00:09:00.100 |
But then the first time somebody you know up votes your Kaggle kernel or adds a clap to your medium post 00:09:05.540 |
You suddenly realize, oh, I've actually written something that people like; that's pretty great 00:09:11.460 |
So if you haven't tried yourself yet, I again invite you to 00:09:18.060 |
Try writing something and if you're not sure you could write a summary of a lesson 00:09:22.540 |
You could write a summary of like if there's something you found hard like maybe you found it hard to 00:09:27.660 |
Fire up a GPU based AWS instance you eventually figured it out you could write down 00:09:32.820 |
Just describe how you solve that problem or if one of your classmates 00:09:36.740 |
Didn't understand something and you explained it to them 00:09:39.700 |
Then you could like write down something saying like oh, there's this concept that some people have trouble understanding here 00:09:45.220 |
And here's a good way, I think, of explaining it. There's all kinds of stuff you could do 00:10:07.880 |
Rachel put together basically taking us through 00:10:13.660 |
Kind of SGD from scratch for the purpose of digit recognition 00:10:18.380 |
and actually quite a lot of the stuff we look at today is 00:10:26.100 |
Part of the computational linear algebra course 00:10:28.740 |
Which you can find both as a MOOC on fast.ai or at USF; it'll be an elective next year, right? 00:10:38.580 |
This stuff interesting and I hope you do then please consider signing up for the elective or checking out the video online 00:10:57.580 |
We're starting with an assumption that we've downloaded the MNIST data 00:11:01.500 |
We've normalized it by subtracting the mean and dividing by the standard deviation. Okay, so the data is 00:11:08.700 |
It's slightly unusual in that although they represent images 00:11:12.760 |
They were downloaded such that each image was a seven hundred and eighty four long vector 00:11:21.660 |
Okay, and so for the purpose of drawing pictures of it we had to reshape it to 28 by 28 00:11:30.700 |
But the actual data we've got is not 28 by 28; it's 784 long 00:11:43.320 |
The basic steps we're going to take here is to start out with training 00:11:48.440 |
The world's simplest neural network basically a logistic regression, right? 00:11:54.000 |
So no hidden layers and we're going to train it using a library 00:11:58.340 |
Fast AI, and we're going to build the network using a library called PyTorch 00:12:03.840 |
Right, and then we're going to gradually get rid of all the libraries, right? 00:12:07.480 |
So first of all, we'll get rid of the nn neural net library in PyTorch and write that ourselves 00:12:13.760 |
Then we'll get rid of the fast AI fit function and write that ourselves and then we'll get rid of the pytorch 00:12:22.620 |
optimizer and write that ourselves and so by the end of 00:12:26.120 |
This notebook will have written all the pieces ourselves 00:12:30.800 |
The only thing that we'll end up relying on is the two key things that pytorch gives us 00:12:36.200 |
Which is (a) the ability to write Python code and have it run on the GPU, and 00:12:40.320 |
(b) the ability to write Python code and have it automatically differentiated for us 00:12:46.960 |
Okay, so they're the two things we're not going to attempt to write ourselves because it's boring and pointless 00:12:52.160 |
But everything else we'll try and write ourselves on top of those two things. Okay, so 00:12:58.720 |
Our starting point is like not doing anything ourselves 00:13:03.680 |
It's basically having it all done for us. And so pytorch has an nn library, which is where the neural net stuff lives 00:13:12.280 |
multi-layer neural network by using the sequential function and then passing in a list of the layers that you want and 00:13:20.840 |
Followed by a softmax layer and that defines our logistic regression. Okay the input to our linear layer 00:13:28.380 |
Is 28 by 28 as we just discussed the output is 10 because we want a probability 00:13:34.500 |
For each of the numbers nought through nine, for each of our images, okay 00:13:50.180 |
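As a rough sketch of what that definition might look like in PyTorch (the variable name net and the use of LogSoftmax follow the description above and are assumptions, not a copy of the lecture notebook):

    import torch.nn as nn

    # Zero-hidden-layer network: one linear layer (28*28 inputs -> 10 outputs)
    # followed by a (log) softmax, as described in the lecture.
    net = nn.Sequential(
        nn.Linear(28 * 28, 10),
        nn.LogSoftmax(dim=-1)
    ).cuda()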
Fits a model okay, so we start out with a random set of weights and then fit uses gradient descent to make it better 00:14:00.820 |
What criterion to use, in other words what counts as better, and we told it to use negative log likelihood 00:14:07.720 |
We'll learn about that in the next lesson what that is exactly 00:14:10.860 |
We had to tell it what optimizer to use, and we said please use optim.Adam; the details of that 00:14:18.000 |
We won't cover in this course, we're going to build something simpler called SGD 00:14:23.180 |
If you're interested in Adam, we just covered that in the deep learning course 00:14:27.060 |
And what metrics do you want to print out? We decided to print out accuracy. Okay, so 00:14:42.340 |
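A sketch of the pieces just named, the criterion, the optimizer and the metric, built with plain PyTorch (these are then handed to fastai's fit function, whose exact signature isn't shown in the transcript, so it's omitted here):

    import torch.nn as nn
    import torch.optim as optim

    loss = nn.NLLLoss()                   # criterion: negative log likelihood
    opt = optim.Adam(net.parameters())    # optimizer: Adam over the network's parameters
    # metrics = [accuracy]                # accuracy is the metric fastai's fit prints each epoch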
So after we fit it we get an accuracy of generally somewhere around 91 92 percent 00:14:47.300 |
So what we're going to do from here is we're going to gradually 00:14:50.980 |
We're going to repeat this exact same thing. So we're going to rebuild 00:14:57.820 |
You know four or five times fitting it building it and fitting it with less and less libraries. Okay, so the second thing that we did 00:15:15.580 |
All right, so instead of saying the network is a sequential bunch of these layers 00:15:21.780 |
Let's not use that library at all and try and define it ourselves from scratch 00:15:32.220 |
Because that's how we build everything in pytorch and we have to create 00:15:39.060 |
Which inherits from nn.Module. So nn.Module is a PyTorch class 00:15:45.140 |
That takes our class and turns it into a neural network module 00:15:51.500 |
Which basically means anything that you inherit from nn.Module like this, 00:15:55.940 |
You can pretty much insert into a neural network as a layer or you can treat it as a neural network 00:16:02.020 |
it's going to get all the stuff that it needs automatically to 00:16:05.060 |
To work as a part of or a full neural network and we'll talk about exactly what that means 00:16:15.940 |
so we need to construct the object, so that means we need to define the constructor, __init__, and 00:16:22.900 |
Then importantly, this is a Python thing is if you inherit from some other object 00:16:29.720 |
Then you have to create the thing you inherit from first 00:16:33.100 |
so when you say super().__init__(), that says construct the 00:16:38.500 |
nn.Module piece of that first, right; if you don't do that, then the nn.Module stuff 00:16:46.180 |
Never gets a chance to actually get constructed. Now. So this is just like a standard 00:16:53.980 |
Subclass constructor, okay, and if any of that's an unclear to you then you know 00:16:59.180 |
This is where you definitely want to just grab a python intro to OO because this is 00:17:04.420 |
That the standard approach, right? So inside our constructor 00:17:11.580 |
nn.Linear. All right, so what nn.Linear is doing is it's taking our 00:17:29.380 |
Vector, so a 784 long vector, and that's going to be the input to a matrix multiplication 00:17:49.620 |
Okay, so because the input to this is going to be a mini batch of size 64 by 784, 00:18:01.740 |
This weight matrix is 784 by 10, and the input to this is going to be a mini batch of size 64 00:18:20.100 |
Right, so we're going to do this matrix product 00:18:23.340 |
Okay, so when we say in PyTorch nn.Linear, 00:18:32.100 |
It constructs this matrix for us, right? So since we're not using that, we're doing things from scratch, we need to make it ourselves 00:18:46.140 |
With this dimensionality, which we passed in here: 784 by 10. Okay, so that gives us our 00:19:01.660 |
You know, we don't just want y equals ax we want y equals ax plus b 00:19:08.140 |
Right, so we need to add on what we call in neural nets a bias vector 00:19:13.500 |
So we create here a bias vector of length 10. Okay again randomly initialized 00:19:20.740 |
And so now here are our two randomly initialized parameters 00:19:30.980 |
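A minimal sketch of the constructor being described, assuming names like LogReg, l1_w and l1_b (the lecture's own helper for creating the weights isn't shown, so plain randn is used here):

    import torch
    import torch.nn as nn

    class LogReg(nn.Module):
        def __init__(self):
            super().__init__()   # construct the nn.Module piece first
            # randomly initialised weight matrix (784 by 10) and bias (length 10),
            # wrapped in nn.Parameter so the optimizer can find them later
            self.l1_w = nn.Parameter(torch.randn(28 * 28, 10) / (28 * 28) ** 0.5)
            self.l1_b = nn.Parameter(torch.zeros(10))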
Now we need to define forward. Why do we need to define forward? This is a pytorch specific thing 00:19:36.900 |
What's going to happen is this is when you create a module in 00:19:42.620 |
Pytorch the object that you get back behaves as if it's a function 00:19:47.760 |
You can call it with parentheses which we'll do it that in a moment. And so you need to somehow define 00:19:52.860 |
What happens when you call it as if it's a function, and the answer is PyTorch calls a method called 00:20:00.440 |
forward. Okay, that's just the PyTorch kind of approach that they picked, right? 00:20:07.740 |
So when it calls forward, we need to do our actual 00:20:12.260 |
Calculation of the output of this module or layer. Okay, so here is the thing that actually gets calculated in our logistic regression 00:20:26.020 |
Which gets passed to forward that's basically how forward works it gets passed the mini batch 00:20:35.620 |
The layer one weights which we defined up here and then we add on 00:20:42.740 |
The layer one bias which we defined up here. Okay, and actually nowadays we can define this a little bit more elegantly using the 00:20:54.700 |
Matrix multiplication operator, which is the at sign 00:20:57.660 |
And when you when you use that I think you kind of end up with 00:21:01.080 |
Something that looks closer to what the mathematical notation looked like and so I find that nicer. Okay 00:21:13.580 |
In our logistic regression in our zero hidden layer neural net. So then the next thing we do to that is 00:21:31.420 |
Okay, who wants to tell me what the dimensionality of my output of this matrix multiply is 00:21:44.060 |
And I should mention for those of you that weren't at deep learning class yesterday 00:21:50.580 |
We actually looked at a really cool post from Karen who described how to 00:21:54.980 |
Do structured data analysis with neural nets which has been like super popular? 00:22:00.380 |
And a whole bunch of people have kind of said that they've read it and found it super interesting. So 00:22:19.740 |
We put it through a softmax because in the end we want probably you know for every image 00:22:24.660 |
We want a probability that this is 0 or a 1 or a 2 or a 3 or 4, right? 00:22:28.780 |
So we want a bunch of probabilities that add up to 1 and where each of those probabilities is between 0 and 1 00:22:40.860 |
So for example, if we weren't picking out, you know, numbers from 0 to 9, 00:22:45.900 |
But instead picking out cat, dog, plane, fish or building, the output of that matrix multiply 00:22:50.500 |
For one particular image might look like that. These are just some random numbers 00:22:54.620 |
And to turn that into a softmax, I first go e to the power of each of those numbers. 00:23:09.060 |
Then I take each of those e to the power ofs and divide it by the sum and that's softmax 00:23:14.180 |
That's the definition of softmax. So because it was e to the power of, it means it's always positive 00:23:19.260 |
Because it was divided by the sum, it means that it's always between 0 and 1, and it also means, because it's divided by the sum, that they all add up to 1 00:23:34.500 |
Activation function so anytime we have a layer of outputs, which we call activations 00:23:40.140 |
And then we apply some function some nonlinear function to that that maps one 00:23:45.980 |
One scalar to one scalar like softmax does we call that an activation function, okay? 00:23:52.500 |
So the softmax activation function takes our outputs and turns it into something which behaves like a probability, right? 00:24:00.260 |
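A quick sketch of that definition with made-up numbers (standing in for the random numbers in the lecture's example):

    import torch

    outputs = torch.tensor([0.8, -2.1, 1.5, 0.3])   # raw outputs for one image (made up)
    exps = outputs.exp()                            # e to the power of each: always positive
    softmax = exps / exps.sum()                     # divide by the sum: between 0 and 1, adds to 1
    print(softmax, softmax.sum())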
We don't strictly speaking need it. We could still try and train something which where the output directly is the probabilities 00:24:07.980 |
All right, but by creating using this function 00:24:11.320 |
That automatically makes them always behave like probabilities. It means there's less 00:24:16.420 |
For the network to learn so it's going to learn better. All right, so generally speaking whenever we design 00:24:24.660 |
We try to design it in a way where it's as easy as possible for it to create something of the form that we want 00:24:37.580 |
Right so that's the basic steps right we have our input which is a bunch of images 00:24:44.180 |
Right which is here gets multiplied by a weight matrix. We actually also add on a bias 00:24:56.460 |
We put it through a nonlinear activation function in this case softmax and that gives us our probabilities 00:25:14.820 |
We tend to take the log of softmax, for reasons that don't particularly bother us now 00:25:19.940 |
It's basically a numerical stability convenience. Okay, so to make this the same as our 00:25:26.020 |
Version up here, where you saw log softmax, I'm going to use log here as well. Okay, so 00:25:34.420 |
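Continuing the constructor sketch above, the forward method being described might look like this (attribute names assumed as before):

        def forward(self, x):
            # x is a mini batch of images, e.g. 64 rows of 784 pixels
            x = x @ self.l1_w + self.l1_b               # matrix multiply plus bias, using the @ operator
            return torch.log(torch.softmax(x, dim=-1))  # log of softmax, matching the version above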
We can now instantiate this class that is create an object of this class 00:25:41.060 |
So I have a question back for the probabilities where we were before 00:25:50.860 |
If we were to have a photo with a cat and a dog together 00:25:54.820 |
Would that change the way that that works or does it work in the same basic? Yeah, so that's a great question 00:26:00.580 |
so if you had a photo with a cat and a dog together and 00:26:07.100 |
This would be a very poor choice. So softmax is specifically the activation function we use for 00:26:14.540 |
Categorical predictions where we only ever want to predict one of those things, right? 00:26:19.460 |
And so part of the reason why is that, as you can see, because we're using e to the power of, the slightly bigger numbers 00:26:27.120 |
Create much bigger numbers, as a result of which we generally have just one or two things large and everything else is pretty small 00:26:35.820 |
Recalculate these random numbers a few times you'll see like it tends to be a bunch of zeros and one or two high numbers 00:26:44.420 |
It tries to kind of make it easy to predict this one thing, like, there's the thing I want. If you're doing multi- 00:26:53.700 |
Label prediction, so I want to find all the things in this image, rather than using softmax we would use sigmoid 00:27:01.260 |
So sigmoid, recall, would cause each of these to be between zero and one, but they would no longer add to one 00:27:11.480 |
Details about like best practices are things that we cover in the deep learning course 00:27:18.140 |
And we won't cover heaps of them here in the machine learning course. We're more interested in the mechanics, I guess 00:27:28.300 |
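A quick sketch of that contrast with made-up numbers; softmax gives one set of probabilities that add to 1, while sigmoid gives an independent 0-to-1 value per class:

    import torch

    out = torch.tensor([2.0, 1.5, -1.0, 0.3])   # raw outputs for cat, dog, plane, fish (made up)
    print(torch.softmax(out, dim=0))            # adds to 1: pick exactly one category
    print(torch.sigmoid(out))                   # each between 0 and 1 independently: multi-label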
All right, so now that we've got that we can instantiate an object of that class and of course 00:27:35.420 |
We want to copy it over to the GPU so we can do computations over there 00:27:38.940 |
Again, we need an optimizer where we're talking about what this is shortly, but you'll see here 00:27:44.580 |
We've called a function on our class called parameters 00:27:47.760 |
But we never defined a method called parameters 00:27:51.340 |
And the reason that is going to work is because it actually was defined for us inside nn.module 00:27:56.420 |
and so nn.module actually automatically goes through the attributes we've created and finds 00:28:04.060 |
Anything that basically we we said this is a parameter 00:28:07.860 |
So the way you say something is a parameter is you wrap it in nn.Parameter 00:28:11.260 |
So this is just the way that you tell PyTorch 00:28:16.180 |
Okay, so when we created the weight matrix we just wrapped it with nn.Parameter 00:28:23.780 |
PyTorch variable which we'll learn about shortly 00:28:26.620 |
It's just a little flag to say hey, you should optimize this, and so when you call net2.parameters() 00:28:33.940 |
On our net2 object we created, it goes through everything that we created in the constructor, 00:28:38.900 |
Checks to see if any of them are of type parameter 00:28:41.880 |
And if so it sets all of those as being things that we want to train with the optimizer 00:28:46.620 |
And we'll be implementing the optimizer from scratch later 00:28:53.040 |
We can fit and we should get basically the same answer as before 91 ish 00:29:11.500 |
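A sketch of what that instantiation looks like, using the LogReg class sketched above (net2 is an assumed name):

    import torch.optim as optim

    net2 = LogReg().cuda()                # create the object and copy it to the GPU
    opt = optim.Adam(net2.parameters())   # parameters() finds everything wrapped in nn.Parameter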
Well what we've actually built as I said is something that can behave like a regular function 00:29:17.340 |
All right, so I want to show you how we can actually call this as a function 00:29:23.660 |
We need to be able to pass data to it to be able to pass data to it 00:29:28.140 |
I'm going to need to grab a mini batch of MNIST images 00:29:37.220 |
ImageClassifierData.from_arrays method from fastai 00:29:40.340 |
And what that does is it creates a pytorch data loader for us a pytorch data loader is 00:29:47.060 |
Something that grabs a few images and sticks them into a mini batch and makes them available 00:29:52.340 |
And you can basically say give me another mini batch give me another mini batch give me another mini batch and so 00:30:05.060 |
Generators are things where you can basically say I want another I want another I want another right 00:30:10.020 |
There's this kind of very close connection between 00:30:15.900 |
Iterators and generators; we're not going to worry about the difference between them right now, but you'll see basically, 00:30:23.140 |
To actually get hold of something which we can say "please give me another" of, in 00:30:32.020 |
Order to grab something that we can use to generate mini batches, 00:30:36.540 |
We have to take our data loader and so you can ask for the training data loader from our model data object 00:30:43.180 |
You'll see there's a bunch of different data loaders. You can ask for you can ask for the test data loader the train data loader 00:30:51.940 |
Augmented images data loader and so forth so we're going to grab the training data loader 00:30:57.220 |
That was created for us. This is a standard PyTorch data loader, well, slightly optimized by us, but same idea 00:31:03.300 |
And you can then say this is a standard Python 00:31:07.020 |
Thing we can say turn that into an iterator turn that into something where we can grab another one at a time from and so 00:31:16.540 |
We've now got something that we can iterate through you can use the standard Python 00:31:21.580 |
Next function to grab one more thing from that generator, okay? 00:31:26.820 |
So that's returning the X's from a mini batch, and the Y's 00:31:33.100 |
From our mini batch. The other way that you can use 00:31:36.440 |
Generators and iterators in Python is with a for loop. I could also have said like for you know X mini batch comma Y mini batch in 00:31:47.420 |
And then like do something right so when you do that. It's actually behind the scenes 00:31:51.940 |
It's basically syntactic sugar for calling next lots of times. Okay, so this is all standard 00:32:03.100 |
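A sketch of that pattern (md is assumed to be the fastai model data object created from ImageClassifierData, and trn_dl its training data loader):

    dl = iter(md.trn_dl)    # turn the training data loader into an iterator
    xmb, ymb = next(dl)     # grab one mini batch of images and labels

    # the for loop version is just syntactic sugar for calling next() lots of times
    for xmb, ymb in md.trn_dl:
        break               # do something with each mini batch; stop after the first here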
Tensor of size 64 by 784 as we would expect right the 00:32:14.980 |
Fastai library we used defaults to a mini batch size of 64. That's why it's that long 00:32:20.340 |
These are all of the background zero pixels, but they're not actually zero in this case. Why aren't they zero? 00:32:27.180 |
Yeah, they're normalized, exactly right, so we subtracted the mean and divided by the standard deviation, right 00:32:33.420 |
So there there it is so now what we want to do is we want to 00:32:42.380 |
Pass that into our our logistic regression. So what we might do is we'll go 00:32:48.860 |
A variable: vxmb = Variable(...). Okay, I can take my X mini batch, I 00:32:55.580 |
can move it on to the GPU because remember my 00:32:59.160 |
net2 object is on the GPU, so our data for it also has to be on the GPU 00:33:04.980 |
And then the second thing I do is I have to wrap it in variable. So what does variable do? 00:33:11.140 |
This is how we get for free automatic differentiation 00:33:19.040 |
You know pretty much anything right any tensor? 00:33:25.380 |
But it's not going to always keep track, because to do automatic differentiation 00:33:30.820 |
It has to keep track of exactly how something was calculated. We added these things together 00:33:35.340 |
We multiplied it by that we then took the sign blah blah blah, right? 00:33:39.420 |
you have to know all of the steps because then to do the automatic differentiation it has to 00:33:45.060 |
Take the derivative of each step using the chain rule multiply them all together 00:33:49.380 |
All right, so that's slow and memory intensive 00:33:52.140 |
So we have to opt in to saying like okay this particular thing we're going to be taking the derivative of later 00:33:57.560 |
So please keep track of all of those operations for us 00:34:00.300 |
And so the way we opt in is by wrapping a tensor in a variable, right? So 00:34:10.100 |
You'll see that it looks almost exactly like a tensor, but it now says variable containing 00:34:16.460 |
This tensor right so in Pytorch a variable has exactly 00:34:21.860 |
Identical API to a tensor or actually more specifically a superset of the API of a tensor 00:34:27.860 |
Anything we can do to a tensor we can do to a variable 00:34:30.740 |
But it's going to keep track of exactly what we did so we can later on take the derivative 00:34:40.260 |
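A sketch of that opt-in step with the mini batch grabbed above; Variable comes from torch.autograd in the PyTorch version used here (in modern PyTorch you would set requires_grad on the tensor instead):

    from torch.autograd import Variable

    vxmb = Variable(xmb.cuda())   # move the mini batch to the GPU and opt in to gradient tracking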
Into our net2 object; remember, I said you can treat this as if it's a function 00:34:51.980 |
Right so notice we're not calling dot forward 00:34:59.380 |
Then remember we took the log, so to undo that I'm taking the exp, and that will give me my probabilities 00:35:07.460 |
Okay, so there's my probabilities, and it's got 00:35:14.020 |
Return something of size 64 by 10 so for each image in the mini batch 00:35:23.020 |
We've got 10 probabilities, and you'll see most probabilities are pretty close to 0 00:35:29.580 |
Right and a few of them are quite a bit bigger 00:35:33.420 |
Which is exactly what we do we hope right is that it's like okay? It's not a zero. It's not a one 00:35:39.300 |
It's not a two. It is a three. It's not a four. It's not a five and so forth 00:35:42.740 |
So maybe this would be a bit easier to read if we just grab like the first three of them 00:35:47.140 |
Okay, so it's like ten to the negative three, ten to the negative eight, and so on, okay? 00:35:55.100 |
And then suddenly here's one which is ten to the negative one, right? 00:35:57.620 |
So you can kind of see what it's trying to what it's trying to do here 00:36:02.980 |
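Putting the last two sketches together (names assumed as before):

    preds = net2(vxmb).exp()   # call the module like a function; exp undoes the log in log softmax
    preds[:3]                  # probabilities for the first three images, mostly close to zero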
I mean, we could call net2.forward and it'll do exactly the same thing 00:36:16.620 |
It's actually they actually call it as if it's a function right and so this is actually a really important idea 00:36:24.940 |
When we define our own architectures or whatever anywhere that you would put in a function 00:36:30.580 |
You could put in a layer anyway you put in a layer you can put in a neural net anyway 00:36:34.900 |
You put in a neural net you can put in a function, because as far as PyTorch is concerned 00:36:39.020 |
They're all just things that it's going to call just like as if they're functions 00:36:43.060 |
So they're all like interchangeable, and this is really important because that's how we create 00:36:48.020 |
Really good neural nets is by mixing and matching lots of pieces and putting them all together 00:37:11.380 |
Into a neural network with one hidden layer all right, and the way I'm going to do that is I'm going to create 00:37:19.860 |
I'm going to change this so it spits out a hundred rather than ten 00:37:24.420 |
Which means this one input is going to be a hundred rather than ten 00:37:30.020 |
Now this as it is can't possibly make things any better at all yet 00:37:35.340 |
Why is this definitely not going to be better than what I had before? 00:37:42.540 |
But you've got a combination of two linear layers, which is just the same as one 00:37:47.620 |
Exactly right so we've got two linear layers, which is just a linear layer right so to make things interesting 00:37:55.700 |
I'm going to replace all of the negatives from the first layer with zeros 00:38:00.880 |
Because that's a nonlinear transformation, and so that nonlinear transformation is called a rectified linear unit 00:38:07.820 |
Okay, so nn.Sequential simply is going to call each of these layers in turn for each mini batch, right: so do a linear layer, 00:38:18.340 |
Replace all of the negatives with zero, do another linear layer, and do a softmax. This is now a neural network 00:38:37.180 |
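A sketch of the one-hidden-layer version just described (100 hidden units is the number picked in the lecture; net3 is an assumed name):

    net3 = nn.Sequential(
        nn.Linear(28 * 28, 100),   # first linear layer now spits out 100 activations
        nn.ReLU(),                 # replace all the negatives with zero
        nn.Linear(100, 10),        # second linear layer down to 10
        nn.LogSoftmax(dim=-1)
    ).cuda()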
Okay, so the this is the idea is that the basic techniques. We're learning in this lesson 00:38:43.420 |
Like become powerful at the point where you start stacking them together, okay? 00:38:49.540 |
Can somebody pass the green box there and then there yes, Daniel? 00:38:54.660 |
Why did you pick a hundred? No reason it was like easier to type an extra zero? 00:39:04.220 |
How many activations you should have in a neural network layer is kind of part of the skill of a deep learning practitioner 00:39:09.780 |
We cover it in the deep learning course not in this course 00:39:20.660 |
An additional layer, an additional layer; this one here is called a nonlinear layer, or an activation function 00:39:30.060 |
Does it matter that like if you would have done for example like two softmaxes? 00:39:37.780 |
Or is that something you cannot do like yeah? 00:39:42.140 |
But it's probably not going to give you what you want and the reason why is that a softmax? 00:39:48.220 |
Tends to push most of its activations to zero and an activation just be clear like I've had a lot of questions in deep 00:39:55.460 |
Learning course about like what's an activation an activation is the value that is calculated in a layer, right? 00:40:04.740 |
Right it's not a weight a weight is not an activation 00:40:08.700 |
It's the value that you calculate from a layer 00:40:11.340 |
So softmax will tend to make most of its activations pretty close to zero 00:40:15.700 |
and that's the opposite of what you want you genuinely want your activations to be kind of as 00:40:20.860 |
Rich and diverse and and used as possible so nothing to stop you doing it, but it probably won't work very well 00:40:30.980 |
pretty much all of your layers will be followed by 00:40:34.300 |
Nonlinear activation functions, and that will nearly always be ReLU 00:40:44.700 |
Could you, when doing multiple layers, so let's say you're going two or three layers deep, 00:40:51.740 |
Do you want to switch up these activation layers? No, that's a great question. So if I wanted to go deeper, I'd just add another linear layer and ReLU like 00:41:01.940 |
That; okay, that's now a two hidden layer network 00:41:05.860 |
So I think I'd heard you said that there are a couple of different 00:41:13.780 |
Activation functions like that rectified linear unit. What are some examples and 00:41:33.980 |
Input comes in and you put it through a linear layer and then a nonlinear layer linear layer nonlinear layer 00:41:41.180 |
linear linear layer and then the final nonlinear layer 00:41:50.900 |
The final nonlinear layer as we've discussed, you know, if it's a 00:41:58.860 |
Classification where you only ever pick one of them, you would use softmax; for multi- 00:42:08.060 |
Label classification where you're predicting multiple things, you would use sigmoid 00:42:18.660 |
Right, although we learned in last night's DL course that sometimes you can use sigmoid there as well 00:42:23.300 |
So they're basically the options main options for the final layer 00:42:50.380 |
Another one you can pick which is kind of interesting which is called 00:43:07.100 |
Basically if it's above zero, it's y equals x and if it's below zero, it's like y equals 0.1 x 00:43:16.660 |
Rather than being equal to 0 below zero, it's something close to that 00:43:16.660 |
There are various others, but they're kind of like things that just look very close to that 00:43:38.060 |
So for example, there's something called ELU, which is quite popular 00:43:41.440 |
But like you know the details don't matter too much honestly like that there like ELU is something that looks like this 00:43:49.700 |
And it's kind of like it's not generally something that you so much pick based on the data set it's more like 00:43:59.380 |
Over time we just find better activation functions so two or three years ago 00:44:04.300 |
Everybody used ReLU, you know a year ago pretty much everybody used Leaky ReLU today 00:44:09.380 |
I guess probably most people starting to move towards ELU 00:44:11.940 |
But honestly the choice of activation function doesn't matter 00:44:18.460 |
And you know people have actually showed that you can use like our pretty arbitrary nonlinear activation functions like even a sine wave 00:44:30.820 |
So although what we're going to do today is showing how to create 00:44:51.620 |
Which is 96% ish accurate, it will be trivial, right, and in fact is something you should 00:44:57.900 |
Probably try and do during the week right is to create that version 00:45:10.580 |
So now that we've got something where we can take our network pass in our variable and get back some 00:45:22.580 |
That's basically all that happened when we called fit. So we're going to see how how that that approach can be used to create this stochastic gradient 00:45:35.860 |
Predicted probabilities into a predicted like which digit is it? We would need to use argmax 00:45:49.220 |
Instead, PyTorch just calls it max, and max 00:45:56.260 |
Returns the actual max across this axis, so this is across the columns, right, and the second thing it returns is the index 00:46:05.020 |
Of that maximum, right. So the equivalent of argmax is to call max and then get the first- 00:46:12.900 |
Indexed thing, okay. So there are our predictions, right; if this was in NumPy, we would instead use np.argmax 00:46:25.500 |
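A sketch of that call (preds as above):

    values, pred_digits = preds.max(1)   # max over the columns: the values and their indices
    # pred_digits is the argmax; the NumPy spelling would be np.argmax(preds_array, axis=1)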
So here are the predictions from our hand created logistic regression and in this case 00:46:37.300 |
So the next thing we're going to try and get rid of, in terms of using libraries, is we're going to try to avoid using the 00:46:43.300 |
Matrix multiplication operator and instead we're going to try and write that by hand 00:46:47.260 |
So this next part we're going to learn about something which kind of seems 00:47:03.860 |
It kind of it's going to seem like a minor little kind of programming idea, but actually it's going to turn out 00:47:14.620 |
That at least in my opinion. It's the most important 00:47:18.500 |
Programming concept that we'll teach in this course, and it's possibly the most important programming 00:47:26.620 |
All the things you need to build machine learning algorithms, and it's the idea of 00:47:37.300 |
If we create an array of 10, 6, -4 and an array of 2, 8, 7 and then add the two together 00:47:45.100 |
It adds each of the components of those two arrays in turn we call that element wise 00:47:54.060 |
So in other words we didn't have to write a loop right back in the old days 00:47:58.740 |
We would have to have looped through each one and added them and then concatenated them together 00:48:02.780 |
We don't have to do that today. It happens for us automatically so in numpy 00:48:20.420 |
So in fastai we just add a little capital T to turn something into a Pytorch tensor right and if we add those together 00:48:31.380 |
Exactly the same thing right so element wise operations are pretty standard in these kinds of libraries 00:48:37.700 |
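A sketch of those element-wise additions (the capital-T helper mentioned is fastai's; torch.tensor is used here instead):

    import numpy as np
    import torch

    a = np.array([10, 6, -4])
    b = np.array([2, 8, 7])
    a + b                               # array([12, 14,  3]): element-wise, no loop needed

    ta, tb = torch.tensor(a), torch.tensor(b)
    ta + tb                             # tensor([12, 14,  3]): exactly the same idea in PyTorch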
It's interesting not just because we don't have to write the for loop 00:48:44.100 |
Right, but it's actually much more interesting because of the performance things that are happening here 00:49:03.180 |
Right, even when you use PyTorch, it still does the for loop in Python; it has no way of 00:49:10.140 |
Optimizing the for loop, and so a for loop in Python is something like 1,000 or 10,000 times slower than the same loop in C 00:49:18.740 |
So that's your first problem; I can't remember exactly, it's like 1,000 or 10,000 times. The second problem then is that C alone isn't enough: 00:49:31.500 |
You want C to take advantage of the thing that all of your CPUs do, something called SIMD 00:49:37.700 |
Single instruction multiple data, which is: your CPU is capable of taking eight things at a time, 00:49:46.260 |
Right, in a vector, and adding them up to another 00:49:49.860 |
Vector with eight things in it, in a single CPU instruction 00:49:55.060 |
All right, so if you can take advantage of SIMD you're immediately eight times faster 00:49:59.260 |
It depends on how big the data type is it might be four might be eight 00:50:02.300 |
The other thing that you've got in your computer is you've got multiple processors 00:50:11.300 |
So you've probably got, like, if this is happening on one core, you've probably got about four of those 00:50:19.300 |
Okay, so if you're using SIMD you're eight times faster if you can use multiple cores, then you're 32 times faster 00:50:28.180 |
You might be something like 32 times a thousand times faster, right. And so the nice thing is that when we do that 00:50:52.060 |
Then your GPU can do about 10,000 things at a time 00:50:57.380 |
Right so that'll be another hundred times faster than C 00:51:04.500 |
To getting good performance is you have to learn how to write 00:51:15.900 |
Operations, and it's a lot more than just plus. I 00:51:19.040 |
Could also use less than, right, and that's going to return 0, 1, 1; or if we go back to NumPy 00:51:35.660 |
And so you can kind of use this to do all kinds of things without looping so for example 00:51:42.080 |
I could now multiply that by a and here are all of the values of a 00:51:47.460 |
As long as they're less than B or we could take the mean 00:51:53.440 |
This is the percentage of values in a that are less than B 00:51:59.460 |
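A sketch of those comparisons with the same a and b:

    a < b               # array([False,  True,  True])
    (a < b) * a         # array([ 0,  6, -4]): values of a wherever a is less than b, zero elsewhere
    (a < b).mean()      # 0.666...: the fraction of values in a that are less than b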
All right, so like there's a lot of stuff you can do with this simple idea 00:52:06.260 |
Right to take it further than just this element wise operation 00:52:10.020 |
We're going to have to go the next step to something called broadcasting 00:52:13.220 |
So let's take a five minute break come back at 217 and we'll talk about broadcasting 00:52:29.980 |
This is the definition from the numpy documentation of 00:52:38.020 |
Broadcasting and I'm going to come back to it in a moment rather than reading it now 00:52:41.780 |
But let's start by looking at an example of broadcasting 00:52:53.820 |
With one dimension also known as a rank one tensor 00:53:15.100 |
Right a rank zero tensor is also called a scalar 00:53:27.860 |
All right now you've probably done it a thousand times without even noticing. That's kind of weird right that you've got these things of different 00:53:36.060 |
Ranks and different sizes, so what is it actually doing right? 00:53:39.820 |
But what it's actually doing is it's taking that scalar and copying it here here here 00:53:46.140 |
Right, and then it's actually going element-wise: 10 is greater than 0, 00:53:53.780 |
6 is greater than 0, minus 4 is greater than 0, and giving us back the three answers 00:54:01.260 |
Right, and that's called broadcasting. Broadcasting means copying this value across, as if it were a tensor, 00:54:11.060 |
To allow it to be the same shape as the other tensor 00:54:20.580 |
What it actually does is it stores this kind of internal indicator that says pretend that this is a 00:54:32.500 |
But it actually just like what rather than kind of going to the next row or going to the next scalar it goes back 00:54:38.540 |
To where it came from if you're interested in learning about this specifically 00:54:42.620 |
It's they set the stride on that axis to be zero. That's a minor advanced concept for those who are curious 00:54:59.200 |
So if we say a plus 1, it broadcasts the 1 to be 1, 1, 1 and then does element-wise addition 00:55:03.000 |
We could do the same with a matrix, right: here's our matrix; 2 times the matrix is going to broadcast the 2 00:55:10.180 |
To be 2, 2, 2, 2, 2, 2, 2, 2, 2 and then do element-wise multiplication 00:55:18.500 |
All right, so that's our kind of most simple version of 00:55:26.100 |
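A sketch of those scalar examples (a as above, m a 3 by 3 matrix):

    a > 0        # array([ True,  True, False]): the 0 is broadcast to [0, 0, 0]
    a + 1        # array([11,  7, -3]): the 1 is broadcast to [1, 1, 1]

    m = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
    m * 2        # the 2 is broadcast across every element of the matrix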
So here's a slightly more complex version of broadcasting 00:55:30.460 |
Here's an array called C. All right, so this is a rank 1 tensor and 00:56:06.940 |
You can see that what it's done is to add that to each row 00:56:15.140 |
14 25 36 and so we can kind of figure it seems to have done the same kind of idea as broadcasting a scalar 00:56:32.060 |
If it's a rank 2 matrix and now we can do element wise addition 00:56:42.340 |
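A sketch of that example (m as above, c a rank 1 tensor):

    c = np.array([10, 20, 30])
    m + c    # c is broadcast across each row:
             # array([[11, 22, 33],
             #        [14, 25, 36],
             #        [17, 28, 39]])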
That makes sense now. Yes, can you pass that to Devon over there? Thank you 00:56:48.140 |
So by looking at this example, it looks like it broadcasts across the rows, 00:56:58.420 |
So how would we want to do it if we wanted to get new columns? I'm so glad you asked 00:57:31.380 |
So to get NumPy to do that, we need to not pass in a rank 1 tensor, but instead a 00:57:40.700 |
Matrix with one column, a rank 2 tensor, right? Remember, NumPy treats a 00:57:54.380 |
Rank 1 tensor, for these purposes, as if it was a rank 2 tensor which represents a row 00:58:02.140 |
Right, so in other words, as if it is 1 by 3, right? So we want to create a tensor which is 3 by 1 00:58:17.180 |
And if you then pass in this argument, it says please insert a length 1 axis 00:58:24.260 |
Here, please. So in our case, we want to turn it into a 3 by 1 00:58:33.020 |
Okay, so if we say np.expand_dims(c, 1), it changes the shape to 3 comma 1. So if we look at what that looks like 00:58:46.620 |
That looks like a column. Okay, so if we now go 00:58:55.820 |
You can see it's doing exactly what we hoped it would do 00:58:58.980 |
Right, which is to add 10 20 30 to the column 00:59:03.620 |
10 20 30 to the column 10 20 30 to the column 00:59:12.220 |
Location of a unit axis turns out to be so important 00:59:20.580 |
It's really helpful to kind of experiment with creating these extra unit axes and know how to do it easily and 00:59:30.060 |
That isn't, in my opinion, the easiest way to do it, though 00:59:33.420 |
The easiest way is to index into the tensor with a special 00:59:40.340 |
Index none and what none does is it creates a new axis in that location of 00:59:53.660 |
Going to add a new axis at the start of length 1 00:59:58.460 |
This is going to add a new axis at the end of length 1 or 01:00:18.340 |
Things in it could be of any rank you like right you can just add 01:00:22.860 |
Unit axes all over the place and so that way we can kind of 01:00:35.380 |
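A sketch of both ways of adding a unit axis (c as above):

    np.expand_dims(c, 1).shape   # (3, 1): insert a length 1 axis at position 1
    c[:, None].shape             # (3, 1): the same thing with the special None index
    c[None].shape                # (1, 3): new unit axis at the start
    c[None, :, None].shape       # (1, 3, 1): unit axes wherever you like

    m + c[:, None]               # now c is broadcast down the columns instead of across the rows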
Thing in NumPy called broadcast_to, and what that does is it takes our vector and 01:00:45.100 |
broadcasts it to that shape and shows us what that would look like 01:00:49.020 |
Right so if you're ever like unsure of what's going on in some broadcasting operation 01:00:55.060 |
You can say broadcast_to, and so for example here we could say, rather than 3 comma 3, we could say m.shape 01:01:01.980 |
Right and see exactly what's happened going to happen, and so that's what's going to happen before we add it to n 01:01:21.460 |
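A sketch of that check:

    np.broadcast_to(c, (3, 3))            # shows c repeated as three identical rows
    np.broadcast_to(c[:, None], m.shape)  # the column version, repeated across m's shape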
Make sense, so that's kind of like the intuitive 01:01:29.340 |
Broadcasting and so now hopefully we can go back to that 01:01:38.140 |
Broadcasting describes how numpy is going to treat arrays of different shapes when we do some operation 01:01:42.740 |
Right, the smaller array is broadcast across the larger array; by smaller array, they mean lower rank, 01:01:52.860 |
Broadcast across the higher rank tensor, so that they have compatible shapes. It vectorizes array operations, 01:01:59.540 |
So vectorizing generally means like using SIMD and stuff like that so that multiple things happen at the same time 01:02:08.820 |
But it doesn't actually make needless copies of data it kind of just acts as if it had 01:02:18.060 |
now in deep learning you very often deal with tensors of rank four or more and 01:02:24.620 |
you very often combine them with tensors of rank one or two and 01:02:29.060 |
Trying to just rely on intuition to do that correctly is nearly impossible 01:02:42.300 |
Okay, here's m.shape, here's c.shape. So the rules are that we're going to compare 01:02:50.180 |
The shapes of our two tensors element wise we're going to look at one at a time 01:02:54.740 |
And we're going to start at the end right so look at the trailing dimensions and 01:03:01.460 |
Towards the front okay, and so two dimensions are going to be compatible 01:03:06.220 |
When one of these two things is true: either they're equal, or one of them is 1. So let's check, right, are our m and c compatible? m is 3 by 01:03:18.500 |
3, right, so we're going to start at the end, trailing dimensions first, and check are they compatible. They're compatible if the dimensions are equal 01:03:26.620 |
Okay, so these ones are equal so they're compatible 01:03:31.180 |
Let's go to the next one. Oh, oh, we're missing 01:03:34.140 |
Right C is missing something. So what happens if something is missing as we insert a one? 01:03:41.100 |
Okay, that's the rule right and so let's now check are these compatible one of them is one. Yes, they're compatible 01:03:49.140 |
Okay, so now you can see why it is that NumPy treats a rank 1 tensor as 01:04:02.060 |
Something which is representing a row: it's because we're basically inserting a one at the front 01:04:12.620 |
This is something that you very commonly have to do which is you start with like an 01:04:20.780 |
image they're like 256 pixels by 256 pixels by three channels and 01:04:31.740 |
All right, so you've got 256 by 256 by 3 and you want to subtract something of length 3, right? 01:04:40.020 |
Absolutely because 3 and 3 are compatible because they're the same 01:04:43.980 |
All right 256 and empty is compatible. It's going to insert a 1 01:04:48.340 |
256 and empty is compatible. It's going to insert a 1 01:04:55.740 |
this is going to be broadcast over all of this axis and then that whole thing will be broadcast over this axis and 01:05:17.300 |
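A sketch of that image example with made-up per-channel means:

    img = np.random.rand(256, 256, 3)           # height by width by channels
    channel_means = np.array([0.4, 0.5, 0.3])   # one mean per channel (made-up numbers)
    normalized = img - channel_means            # the length 3 vector is broadcast over both spatial axes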
very few people in the data science or machine learning communities 01:05:22.300 |
Understand broadcasting and the vast majority of the time for example when I see people doing pre-processing for computer vision 01:05:28.820 |
Like subtracting the mean they always write loops 01:05:32.760 |
over the channels right and I kind of think like 01:05:36.780 |
It's it's like so handy to not have to do that and it's often so much faster to not have to do that 01:05:46.220 |
You'll have this like super useful skill that very very few people have 01:05:52.060 |
And and like it's it's it's an ancient skill. You know it goes it goes all the way back to 01:06:00.980 |
so APL was from the late 50s, stands for A Programming Language, and 01:06:21.100 |
He proposed that if we use this new math notation 01:06:24.700 |
It gives us new tools for thought and allows us to think things we couldn't before and one of his ideas was 01:06:35.660 |
computer programming tool, but as a piece of math notation and 01:06:43.180 |
this notation as a tool for thought as a programming language called APL and 01:06:57.180 |
Which is basically what you get when you put 60 years of very smart people working on this idea 01:07:03.980 |
And with this programming language you can express 01:07:07.820 |
Very complex mathematical ideas often just with a line of code or two 01:07:16.940 |
But it's even greater that these ideas have found their ways into the languages 01:07:21.020 |
We all use like in Python the NumPy and PyTorch libraries, right? These are not just little 01:07:26.740 |
Kind of niche ideas. It's like fundamental ways to think about math and to do programming 01:07:33.020 |
Like let me give an example of like this kind of notation as a tool for thought 01:07:48.380 |
None, right. Notice this now has two square brackets, right? So this is kind of like a one-row 01:08:19.780 |
Okay, what's that going to do? Have a think about it 01:08:34.580 |
Anybody want to have a go you can even talk through your thinking. Okay. Can we pass the check this over there? Thank you 01:08:40.580 |
Kind of outer product. Yes, absolutely. So take us through your thinking. How's that gonna work? 01:08:47.780 |
So the diagonal elements can be directly visualized from the squares 01:09:00.780 |
And if you multiply the first row with this column, you can get the first row of the matrix 01:09:07.900 |
So finally you'll get a 3 cross 3 matrix. Yeah, and 01:09:12.500 |
So to think of this in terms of like those broadcasting rules, we're basically taking 01:09:28.780 |
This column, which is of dimension, sorry, I mean 3 by 1, and this row, which is of dimension 1 by 3 01:09:34.340 |
Right and so to make these compatible with our broadcasting rules 01:09:45.140 |
Okay, and now this one's going to have to be duplicated three times to match this 01:10:05.100 |
Matrices to do an element wise product of and so as you say 01:10:12.820 |
There is our outer product right now. The interesting thing here is 01:10:17.900 |
That suddenly now that this is not a special mathematical case 01:10:23.220 |
But just a specific version of the general idea of broadcasting we can do like an outer plus 01:10:35.060 |
Right or or whatever right so it's suddenly we've kind of got this this this concept 01:10:44.940 |
New ideas and then we can start to experiment with those new ideas. And so, you know interestingly 01:11:02.100 |
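A sketch of that outer product, plus the "outer plus" variant (c as above):

    c[:, None] * c[None, :]   # outer product: a 3 by 1 column broadcast against a 1 by 3 row
    c[:, None] + c[None, :]   # the same broadcasting trick with plus instead of times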
This is how NumPy does it right actually this is kind of the sorry, let me show you this way 01:11:11.660 |
If you want to create a grid, this is how NumPy does it it actually returns 01:11:26.060 |
So we could say like okay, that's x grid comma y grid 01:11:42.580 |
Like that right and so suddenly we've expanded that out 01:11:59.220 |
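The exact call isn't legible in the transcript, but NumPy's ogrid is one example of this pattern: it hands back a column and a row with unit axes that broadcast into a full grid:

    xg, yg = np.ogrid[0:5, 0:5]   # xg has shape (5, 1), yg has shape (1, 5)
    grid = xg + yg                # broadcasting expands them into a full 5 by 5 grid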
Yeah, it's kind of interesting how like some of these like simple little concepts 01:12:05.580 |
Kind of get built on and built on and built on so if you lose something like APL or J. It's this whole 01:12:11.660 |
Environment of layers and layers and layers of this we don't have such a deep environment in NumPy 01:12:18.260 |
But you know you can certainly see these ideas of like broadcasting coming through 01:12:22.900 |
In simple things like how do we create a grid in in NumPy? 01:12:27.220 |
So yeah, so that's that's broadcasting and so what we can do with this now is 01:12:34.860 |
Use this to implement matrix multiplication ourselves 01:12:43.980 |
Now why would we want to do that well obviously we don't right matrix multiplication has already been handled 01:12:57.620 |
All kinds of areas in in machine learning and particularly in deep learning that there'll be 01:13:08.460 |
Function that you want to do that aren't quite 01:13:13.300 |
Done for you all right so for example. There's like whole areas 01:13:26.980 |
Which are really being developed a lot at the moment and they're kind of talking about like how do we take like 01:13:38.380 |
Higher rank tensors and kind of turn them into combinations of rows 01:13:43.260 |
Columns and faces and it turns out that when you can kind of do this you can basically like 01:13:50.260 |
Deal with really high dimensional data structures with not much memory and not with not much computation time for example. There's a really terrific library 01:14:00.460 |
Which does a whole lot of this kind of stuff? 01:14:05.660 |
So it's a really really important area it covers like all of deep learning lots of modern machine learning in general 01:14:12.460 |
And so even though you're not going to need to define matrix multiplication, you're very likely to want to define some other 01:14:22.820 |
So it's really useful to kind of understand how to do that 01:14:34.260 |
2d array and 1d array rank 2 tensor rank 1 tensor and 01:14:40.860 |
Using the at sign or the old way NP dot matmul. Okay? 01:14:46.500 |
And so what that's actually doing when we do that is we're basically saying 01:15:07.700 |
We can go through and do the same thing for the next one and for the next one to get our result, right? 01:15:34.020 |
Okay, but that is not matrix multiplication. What's that? 01:15:45.180 |
Okay, element wise specifically we've got a matrix and a vector so 01:15:53.900 |
Broadcasting okay good. So we've got this is element wise with broadcasting but notice 01:16:01.180 |
The numbers it's created, 10, 40, 90, are the exact three numbers that I need to calculate the first 01:16:10.420 |
Piece of my matrix multiplication. So in other words if we sum this 01:16:31.700 |
This stuff without special help from our library 01:16:38.580 |
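A sketch of that, with m and c as before:

    m @ c                 # array([140, 320, 500]): the library's matrix-vector product
    m * c                 # element-wise with broadcasting; the first row is 10, 40, 90
    (m * c).sum(axis=1)   # array([140, 320, 500]): summing across each row reproduces m @ c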
Let's expand this out to a matrix-matrix product, which 01:16:45.700 |
looks like this. This is this great site called matrixmultiplication.xyz, 01:16:52.420 |
and it shows us what happens when we multiply two matrices, 01:17:06.400 |
operationally speaking. So in other words, what we just did there, we took the first column with the first row, 01:17:20.680 |
then we took the second column with the first row 01:17:26.120 |
to get that one. All right, so we're basically doing 01:17:29.040 |
the thing we just did, the matrix-vector product; we're just doing it twice, once 01:17:36.480 |
with this column and once with this column, and then we can concatenate the two together. 01:17:57.640 |
m times the first column, then sum; m times the second column, then sum; and so there are the two columns of our matrix multiplication. 01:18:09.240 |
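A sketch of that column-by-column idea; the second matrix here is an illustrative stand-in.

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
n = np.array([[10, 40],
              [20, 50],
              [30, 60]])                  # two columns, so two matrix-vector products

# Do the broadcast-and-sum trick once per column of n, then stack the results.
col0 = (m * n[:, 0]).sum(axis=1)          # m times the first column, then sum
col1 = (m * n[:, 1]).sum(axis=1)          # m times the second column, then sum
ours = np.stack([col0, col1], axis=1)

print(ours)
print(m @ n)                              # the built-in gives the identical answer
```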
So I didn't want to make our code too messy, 01:18:12.960 |
so I'm not going to actually use that, but we have it there now if we want to. We don't need to use 01:18:20.280 |
Torch or NumPy matrix multiplication anymore; we've got our own that we can use, using nothing but broadcasting and element-wise operations. 01:18:39.960 |
So here's our logistic regression from scratch class again; I've just copied it here. 01:18:45.960 |
Here is where we instantiate the object, copy it to the GPU, and create an optimizer, 01:18:50.160 |
which we'll learn about in a moment, and we call fit. Okay, so the goal is to now repeat this without needing to call fit. 01:19:09.320 |
Fit is the thing which grabs a mini-batch of data at a time, and with each mini-batch of data 01:19:15.600 |
we pass it to the optimizer and say: please try to come up with a slightly better set of predictions. 01:19:24.240 |
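Roughly, the setup being described looks something like the sketch below. The module definition, the input and output sizes, and the choice of Adam are my assumptions, not the notebook's exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A rough sketch of the setup: a logistic-regression-style module, moved to the
# GPU when one is available, plus an optimizer whose step() we'll call by hand.
class LogReg(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()
        self.lin = nn.Linear(n_in, n_out)          # a single linear layer
    def forward(self, x):
        return F.log_softmax(self.lin(x), dim=-1)  # log-probabilities per class

device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = LogReg(28 * 28, 10).to(device)               # instantiate and copy to the GPU
optimizer = torch.optim.Adam(net.parameters(), lr=1e-2)
# In the lesson a fit(...) call trains this; the point below is to do that loop ourselves.
```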
So as we learned in order to grab a mini batch of the training set at a time 01:19:28.560 |
We have to ask the model data object for the training data loader 01:19:31.840 |
We have to wrap it in iter() to create an iterator (a generator), 01:19:36.920 |
and so that gives us our data loader. Okay, so PyTorch calls this a data loader; 01:19:44.040 |
we actually wrote our own fast.ai data loader, but it's basically the same idea. 01:19:50.280 |
So the next thing we do is we grab the X and the Y tensor 01:19:59.520 |
Wrap it in a variable to say I need to be able to take the derivative of 01:20:05.080 |
The calculations using this because if I can't take the derivative 01:20:08.640 |
Then I can't get the gradients and I can't update the weights 01:20:12.400 |
all right, and I need to put it on the GPU because my module is sitting on the GPU. 01:20:18.760 |
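Here is a runnable stand-in for those mechanics. The lesson pulls batches from a fastai ModelData object and wraps tensors in Variable (the older PyTorch API); this sketch uses a plain DataLoader over random data instead.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Stand-in data in place of the notebook's MNIST ModelData object.
X = torch.randn(640, 28 * 28)
Y = torch.randint(0, 10, (640,))
trn_dl = DataLoader(TensorDataset(X, Y), batch_size=64, shuffle=True)

dl = iter(trn_dl)          # wrap the data loader in an iterator
xt, yt = next(dl)          # grab one mini-batch of inputs and targets
print(xt.shape, yt.shape)  # torch.Size([64, 784]) torch.Size([64])
# In the lesson these are then wrapped in Variable(...) and moved with .cuda();
# with current PyTorch, gradient tracking on the model's parameters is enough.
```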
So we can now take that variable and pass it to 01:20:23.340 |
The object that we instantiated our logistic regression 01:20:28.440 |
Remember our module we can use it as if it's a function because that's how pytorch works 01:20:32.840 |
And that gives us a set of predictions, as we've seen before. 01:20:41.760 |
So now we can check the loss. The loss we defined as being a negative log likelihood loss 01:20:49.440 |
object, and we're going to learn about how that's calculated in the next lesson; for now, think of it 01:20:55.200 |
just like root mean squared error, but for classification problems. 01:20:58.320 |
So we can call that also just like a function, so you can kind of see this 01:21:03.840 |
very general idea in PyTorch: you kind of treat everything, ideally, like it's a function. 01:21:09.480 |
So in this case we have a loss a negative log likelihood loss object. We treat it like a function we pass in our predictions and 01:21:16.560 |
We pass in our actuals, right, and again the actuals need to be turned into a variable and put on the GPU, 01:21:23.360 |
Because the loss is specifically the thing that we actually want to take the derivative of right so that gives us our loss 01:21:36.200 |
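Continuing the sketch, the forward pass and the loss call look roughly like this; the linear module and random batch are stand-ins for the notebook's network and MNIST data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Linear(28 * 28, 10)             # stand-in for the logistic regression module
x = torch.randn(64, 28 * 28)             # stand-in mini-batch
y = torch.randint(0, 10, (64,))          # stand-in actuals

pred = F.log_softmax(net(x), dim=-1)     # call the module like a function -> predictions
loss_fn = nn.NLLLoss()                   # negative log likelihood loss object
l = loss_fn(pred, y)                     # also called like a function: predictions, actuals
print(l.item())
```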
So it's a variable and because it's a variable it knows how it was calculated 01:21:41.320 |
All right, it knows it was calculated with this loss function. It knows that the predictions were calculated with this 01:21:47.980 |
Network it knows that this network consisted of these operations and so we can get the gradient 01:22:01.800 |
We call L dot backward remember L is the thing that contains our loss 01:22:06.560 |
All right, so L dot backward is something which is added to anything that's a variable: 01:22:13.120 |
You can call dot backward and that says please calculate the gradients 01:22:16.440 |
Okay, and so that calculates the gradients and stores them inside each of the 01:22:28.120 |
weights; each of the parameters that was used to calculate the loss now has stored on it a 01:22:33.960 |
dot grad attribute (we'll see it later). It's basically stored the gradient there. 01:22:40.320 |
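A tiny self-contained illustration of that, using my own toy example rather than the notebook's network:

```python
import torch

# Any scalar built from tensors that require gradients can have backward() called
# on it; the gradients land in .grad on each parameter that fed into it.
w = torch.tensor([2.0, 3.0], requires_grad=True)
l = (w ** 2).sum()        # a stand-in "loss"
l.backward()              # fills w.grad with dl/dw = 2*w
print(w.grad)             # tensor([4., 6.])
```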
So we can then call optimizer.step(), and we're going to do this step manually shortly, 01:22:44.520 |
and that's the bit that says: please make the weights a little bit better. And so what optimizer.step() is doing 01:22:53.440 |
Is it saying like okay if you had like a really simple function? 01:23:04.560 |
Right then what the optimizer does is it says okay. Let's pick a random starting point 01:23:11.580 |
Right and let's calculate the value of the loss right so here's our parameter 01:23:17.400 |
Here's our loss right let's take the derivative 01:23:21.920 |
All right the derivative tells us which way is down, so it tells us we need to go that direction 01:23:31.920 |
Then we take the derivative again and take a small step, take the derivative again, 01:23:37.400 |
take a small step, do it again, take a small step, 01:23:40.440 |
till eventually we're taking such small steps that we stop. Okay, so how big a step do we take? 01:23:52.440 |
Well, we basically take the derivative here; so let's say the derivative there is, like, eight, 01:23:57.300 |
and we multiply it by a small number, like say 0.01, and that tells us what step size to take; 01:24:06.020 |
this small number here is called the learning rate and 01:24:10.040 |
It's the most important hyper-parameter to set. If you pick too small a learning rate, 01:24:17.800 |
then your steps down are going to be tiny, and it's going to take you forever. 01:24:23.180 |
Pick too big a learning rate and you'll jump too far, 01:24:27.960 |
then you'll jump too far the other way, and you'll diverge rather than converge, okay? 01:24:35.680 |
We're not going to talk about how to pick a learning rate in this class 01:24:39.640 |
But in the deep learning class we actually show you a specific technique that very reliably picks a very good learning rate 01:24:48.200 |
So that's basically what's happening right so we calculate the derivatives 01:24:53.040 |
and we call the optimizer, which does a step; in other words, it updates the weights based on the gradients and the learning rate. 01:25:01.280 |
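For plain gradient descent, that step amounts to something like this toy sketch; the loss function and the numbers are illustrative.

```python
import torch

# What a single SGD-style step is doing: move the parameter a small amount in the
# downhill direction, scaled by the learning rate.
lr = 0.01
w = torch.tensor(5.0, requires_grad=True)
loss = (w - 3) ** 2                  # a toy loss with its minimum at w = 3
loss.backward()                      # w.grad is now 2 * (w - 3) = 4
with torch.no_grad():
    w -= lr * w.grad                 # step = derivative times learning rate
print(w)                             # tensor(4.9600, requires_grad=True)
```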
We should hopefully find that after doing that we have a better loss than we did before 01:25:07.800 |
So I just reran this and got a loss here of 4.16, and 01:25:14.600 |
it's now 4.03. Okay, so it worked the way 01:25:17.760 |
We hoped it would based on this mini batch it updated all of the weights in our 01:25:22.640 |
Network to be a little better than they were as a result of which our loss went down, okay? 01:25:31.480 |
All right, we're going to go through a hundred steps 01:25:35.200 |
Grab one more mini batch of data from the data loader 01:25:39.560 |
Calculate our predictions from our network calculate our loss from the predictions and the actuals 01:25:45.360 |
Every 10 goes we'll print out the accuracy: just take the mean of whether the predictions equal the actuals or not. 01:25:51.840 |
One PyTorch-specific thing: you have to zero the gradients. Basically, you can have networks where you've got lots of different loss 01:26:01.300 |
functions whose gradients you might want to add together, 01:26:03.980 |
so you have to tell PyTorch when to set the gradients back to zero. 01:26:09.400 |
So this just says: set all the gradients to zero, 01:26:12.120 |
calculate the gradients (that's the backward call), and then take one step of the optimizer, 01:26:18.180 |
so update the weights using the gradients and the learning rate. And so once we run it, you can see the loss goes down. 01:26:34.160 |
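Put together, the loop being described looks roughly like this self-contained sketch, with random stand-in data rather than the notebook's MNIST batches, and SGD as an assumed optimizer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

X, Y = torch.randn(640, 28 * 28), torch.randint(0, 10, (640,))
trn_dl = DataLoader(TensorDataset(X, Y), batch_size=64, shuffle=True)

net = nn.Linear(28 * 28, 10)
loss_fn = nn.NLLLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

dl = iter(trn_dl)
for t in range(100):
    try:
        xt, yt = next(dl)                     # grab one more mini-batch
    except StopIteration:
        dl = iter(trn_dl)                     # restart the loader when it runs out
        xt, yt = next(dl)

    pred = F.log_softmax(net(xt), dim=-1)     # predictions from the network
    l = loss_fn(pred, yt)                     # loss from predictions and actuals

    if t % 10 == 0:                           # every 10 goes, print the accuracy
        acc = (pred.argmax(dim=1) == yt).float().mean().item()
        print(f'step {t}: loss {l.item():.3f}  acc {acc:.3f}')

    optimizer.zero_grad()                     # set all the gradients to zero
    l.backward()                              # calculate the gradients
    optimizer.step()                          # update weights with gradients and lr
```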
So that's the basic approach, and next lesson we'll take it further. 01:26:44.360 |
We're not going to look inside here; as I say, we're basically going to take the calculation of the derivatives as a given. 01:26:56.240 |
And any kind of deep network you have kind of like a function 01:27:03.480 |
And then you pass the output of that into another function that might be like a ReLU 01:27:08.920 |
And you pass the output of that into another function that might be another linear layer, 01:27:14.320 |
And you pass that into another function that might be another ReLU and so forth right so these deep networks are just 01:27:22.320 |
Functions of functions of functions, so you could write them mathematically like that right and so 01:27:30.200 |
All backprop does is it says: let's just simplify this down to the two-function version, g(f(x)); 01:27:40.880 |
and so the derivative of g(f(x)) we can calculate with the chain rule as g'(f(x)) * f'(x). 01:27:56.160 |
Right and so you can see we can do the same thing for the functions of the functions of the functions, and so when you apply a 01:28:02.880 |
Function to a function of a function you can take the derivative just by taking the product of the derivatives of each of those 01:28:09.880 |
layers, okay? And in neural networks we call this back propagation. 01:28:15.040 |
Okay, so when you hear back propagation it just means use the chain rule to calculate the derivatives 01:28:31.560 |
Like if it's defined sequentially literally all this means is 01:28:40.840 |
Apply this function to that apply this function to that apply this function to that right so this is just defining a 01:28:49.840 |
composition of a function to a function to a function to a function to a function 01:28:56.000 |
Yeah, so although we're not going to bother with calculating the gradients ourselves 01:28:59.740 |
You can now see why it can do it, right? As long as it has, internally, 01:29:03.480 |
you know, it knows what's the derivative of to-the-power-of, what's the derivative of sine, 01:29:10.440 |
what's the derivative of plus, and so forth, then our Python code 01:29:14.000 |
in here is just combining those things together, 01:29:18.920 |
So it just needs to know how to compose them together with the chain rule and away it goes, okay? 01:29:26.140 |
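As a quick check of that claim, here is a toy composition where autograd's answer matches the chain rule written out by hand; the particular functions are mine, chosen for illustration.

```python
import torch

# For g(f(x)) with f(x) = x**2 and g(u) = sin(u),
# the chain rule gives d/dx g(f(x)) = cos(x**2) * 2*x.
x = torch.tensor(1.5, requires_grad=True)
y = torch.sin(x ** 2)                    # a composition of two functions
y.backward()

with torch.no_grad():
    by_hand = torch.cos(x ** 2) * 2 * x  # the chain rule, written out manually
print(x.grad, by_hand)                   # the two values agree
```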
Okay, so I think we can leave it there for now, and in the next class we'll 01:29:40.240 |
write our own optimizer, and then we'll have solved MNIST from scratch ourselves. See you then.