Back to Index

Machine Learning 1: Lesson 9


Chapters

0:0 Introduction
0:35 Synthetic Data
4:10 Parfitt
11:40 Basic Steps
15:30 NNN Module
16:15 Constructor
19:30 FindForWord
21:10 Softmax
27:30 Parameters
28:50 Results
29:10 Functions
30:15 Generators
32:0 FASTA
33:10 Variable
36:5 Function
45:10 Making Predictions
47:0 Broadcasting
48:40 Performance
52:30 Broadcast

Transcript

All right welcome back to machine learning I I'm really excited to be able to share some amazing stuff that University of San Francisco students have built during the week or written about during the week and Quite a few things. I'm going to show you have already spread around the internet quite a bit lots of Tweets and posts and all kinds of stuff happening One of the the first to be widely shared was this one by Tyler who did something really interesting?

He started out by saying like what if I like create the synthetic data set where the independent variables is like the x and the y And the dependent variable is like color right and interestingly He showed me an earlier version of this where he wasn't using color he was just like putting the actual numbers in here and this thing kind of wasn't really working at all and as soon as he started using color it started working really well and So I wanted to mention that one of the things that unfortunately we we don't teach you at USF is Theory of human perception perhaps we should Because actually when it comes to visualization it's kind of the most important thing to know is what is the human eye?

Or what is what is what is the human brain good at perceiving? There's a whole area of academic study on this And one of the things that we're best at perceiving is differences in color Right so that's why as soon as we look at this picture of the synthetic data.

He created you can immediately see oh there's kind of four areas of you know lighter red color So what he did was he said okay? What if we like tried to create a machine learning model of this synthetic data set? And so specifically he created a tree and the cool thing is that you can actually draw The tree right so after he created the tree He did this all in that plot live that plot lead is very flexible right he actually drew the tree boundaries So that's already a pretty neat trick is to be actually able to draw the tree But then he did something even cleverer which is he said okay?

So what predictions does the tree make well it's the average of each of these areas and so to do that We can actually draw the average color Right it's actually kind of pretty Here is the predictions that the tree makes now Here's where it gets really interesting. It's like you can as you know randomly generate trees through resampling and So here are four trees Generated through resampling they're all like pretty similar, but a little bit different And so now we can actually visualize bagging and to visualize bagging we literally take the average of the four pictures All right.

That's what bagging is and There it is right and so here is like the the fuzzy decision boundaries of a random forest And I think this is kind of amazing right because it's like a I wish I had this actually when I started teaching you All random forests because I could have skipped a couple of classes.

It's just like okay. That's what we do You know we create the decision boundaries we average each area And then we we do it a few times and average all of them Okay, so that's what a random forest does and I think like this is just such a great example of Making the complex easy through through pictures So congrats to Tyler for that It actually turns out That he has actually reinvented something that somebody else has already done a guy called Christian any who went on to be a One of the world's foremost machine learning researchers actually included almost exactly this technique in a book He wrote about decision forests, so it's actually kind of cool that Tyler ended up Reinventing something that one of the world's foremost and for authorities on the fifth decision forests actually it has created So I thought that was neat That's nice because when we pop when we posted this on Twitter You know got a lot of attention and finally somebody with that was able to say like oh You know what this this actually already exists, so Tyler's gone away, and you know started reading that book Something else which is super cool is Jason Carpenter Created a whole new library called parfit and parfit is a parallelized fitting of multiple models for the purpose of Selecting hyper parameters, and there's a lot.

I really like about this He's shown a clear example of how to use it right and like the API looks very similar to other grid search based approaches But it uses the validation techniques that Rachel wrote about and that we learned about a couple of weeks ago of using a good validation set and You know what he's done here is in his blog post that introduces it.

You know he's he's Gone right back and said like well What are hyper parameters why do we have to train them? And he's kind of explained every step and then the the module itself is like it's it's very polished You know he's added documentation to it. He's added a nice read me to it And it's kind of interesting when you actually look at the code you realize You know it's very simple.

You know which is it's definitely not a bad thing. That's a good thing is to make things simple but by kind of Writing this little bit of code and then packaging it up so nicely He's made it really easy for other people to use this technique which is great and so one of the things I've been really thrilled to see is then Vinay went along and combined two things from our class one was to take Parfit and then the other was to take the kind of accelerated SGD approach to classification We don't learned about in the last lesson and combine the two to say like okay.

Well. Let's now use Parfit to help us find the parameters of a SGD logistic aggression So I think that's really a really great idea something else which I thought was terrific is Prince actually basically went through and Summarized pretty much all the stuff we learned in the random and for a random forest interpretation class And he went even further than that as he described each of the different approaches to random forest interpretation He described how it's done so here for example is feature importance through variable permutation a little picture of each one and Then super cool here is the code to implement it from scratch So I think this is like really Nice post you know describing something that not many people understand and showing you know exactly how it works both with pictures And with code that implements it from scratch So I think that's really really great one of the things.

I really like here is that for like the Tree interpreter, but he actually showed how you can take the tree interpreter output and feed it into the new waterfall chart package that Chris our USF student built to show how you can actually visualize The contributions of the tree interpreter in a waterfall chart so again kind of a nice combination of multiple pieces of technology we've both learned about and and built as a group I Also really thought this Kernel there's been a few interesting kernels shared and I'll share some more next week and devesh wrote this really nice kernel Showing there's this quite challenging Kaggle competition on detecting icebergs versus Ships and it's a kind of a weird two channel satellite data.

Which is very hard to visualize and he actually Went through and basically described kind of the formulas for how these like radar scattering things actually work And then actually managed to come up with a code that allowed him to recreate You know the actual 3d? Icebergs or ships and I have not seen that done before or like I you know it's it's quite challenging to know how to visualize this data And then he went on to show how to build a neural net to try to interpret this so that was pretty fantastic as well So yeah congratulations for all of you.

I know for a lot of you. You know you're Posting stuff out there to the rest of the world for the first time you know and it's kind of intimidating you're used to writing stuff that you kind of hand into a teacher, and they're the only ones who see it and You know it's kind of scary the first time you do it But then the first time somebody you know up votes your Kaggle kernel or adds a clap to your medium post He suddenly realized oh, I'm actually I've written something that people like that's that's pretty great So if you haven't tried yourself yet, I again invite you to Try writing something and if you're not sure you could write a summary of a lesson You could write a summary of like if there's something you found hard like maybe you found it hard to Fire up a GPU based AWS instance you eventually figured it out you could write down Just describe how you solve that problem or if one of your classmates Didn't understand something and you explained it to them Then you could like write down something saying like oh, there's this concept that some people have trouble understanding here So good way.

I think of explaining it. There's all kinds of stuff you could you could do Okay, so let's go back to SGD and We're going back through this notebook which Rachel put together basically taking us through Kind of SGD from scratch for the purpose of digit recognition and actually quite a lot of the stuff we look at today is going to be closely following Part of the computational linear algebra course Which you can both find the MOOCs on fast AI or at USF.

It'll be an elective next year, right? So if you find some of this This stuff interesting and I hope you do then please consider signing up for the elective or checking out the video online So we're building neural networks And We're starting with an assumption that we've downloaded the MNIST data We've normalized it by subtracting the main and divided by the standard deviation.

Okay, so the data is It's slightly unusual in that although they represent images They were they were downloaded as each image was a seven hundred and eighty four long Rank one tensor, so it's been flattened out Okay, and so for the purpose of drawing pictures of it we had to resize it to 28 by 28 But the actual data we've got is not 28 by 28.

It's as it's it's 784 long flattened out Okay The basic steps we're going to take here is to start out with training The world's simplest neural network basically a logistic regression, right? So no hidden layers and we're going to train it using a library Fast AI and we're going to build the network using a library type torch Right, and then we're going to gradually get rid of all the libraries, right?

So first of all, we'll get rid of the nn neural net library and pytorch and write that ourselves Then we'll get rid of the fast AI fit function and write that ourselves and then we'll get rid of the pytorch optimizer and write that ourselves and so by the end of This notebook will have written all the pieces ourselves The only thing that we'll end up relying on is the two key things that pytorch gives us Which is a the ability to write Python code and have it run on the GPU and?

B the ability to write Python code and have it automatically differentiated for us Okay, so they're the two things we're not going to attempt to write ourselves because it's boring and pointless But everything else we'll try and write ourselves on top of those two things. Okay, so Our starting point is like not doing anything ourselves It's basically having it all done for us.

And so pytorch has an nn library, which is where the neural net stuff lives you can create a multi-layer neural network by using the sequential function and then passing in a list of the layers that you want and We asked for a linear layer Followed by a softmax layer and that defines our logistic regression.

Okay the input to our linear layer Is 28 by 28 as we just discussed the output is 10 because we want a probability For each of the numbers not through nine for each of our images, okay Cuda sticks it on the GPU and then Fit Fits a model okay, so we start out with a random set of weights and then fit uses gradient descent to make it better We Had to tell the fit function What criterion to use in other words what counts is better and we told it to use negative log likelihood We'll learn about that in the next lesson what that is exactly We had to tell it what optimizer to use and we said please use optm dot Adam the details of that We won't cover in this course.

We're going to use something build something simpler called SGD If you're interested in Adam, we just covered that in the deep learning course And what metrics do you want to print out? We decided to print out accuracy. Okay, so That was that and so if we do that Okay So after we fit it we get an accuracy of generally somewhere around 91 92 percent So what we're going to do from here is we're going to gradually We're going to repeat this exact same thing.

So we're going to rebuild This model You know four or five times fitting it building it and fitting it with less and less libraries. Okay, so the second thing that we did last time Was to try to start to define the The module ourselves All right, so instead of saying the network is a sequential bunch of these layers Let's not use that library at all and try and define it ourselves from scratch So to do that we have to use OO Because that's how we build everything in pytorch and we have to create a class Which inherits from an end up module so an end up module is a pytorch class That takes our class and turns it into a neural network module Which basically means will anything that you inherit from an end up module like this?

You can pretty much insert into a neural network as a layer or you can treat it as a neural network it's going to get all the stuff that it needs automatically to To work as a part of or a full neural network and we'll talk about exactly what that means Today and the next lesson, right?

so we need to construct the object so that means we need to define the constructor under in it and Then importantly, this is a Python thing is if you inherit from some other object Then you have to create the thing you inherit from first so when you say super dot under in it that says construct the Nn dot module piece of that first right if you don't do that then the the NN dot module stuff Never gets a chance to actually get constructed.

Now. So this is just like a standard Python OO Subclass constructor, okay, and if any of that's an unclear to you then you know This is where you definitely want to just grab a python intro to OO because this is That the standard approach, right? So inside our constructor We want to do the equivalent of Nn dot linear.

All right. So what NN dot linear is doing is it's taking our It's taking our 28 by 28 Vector so 768 long vector and we're going to be that's going to be the input to a matrix multiplication so we now need to create a Something with 768 rows and That's 768 and 10 columns Okay, so because the input to this is going to be a mini batch of size Actually, let's move this into a new window 768 by 10 and the input to this is going to be a mini batch of size 64 by 768 Right, so we're going to do this matrix product Okay, so when we say in pytorch NN dot linear It's going to construct This matrix for us, right?

So since we're not using that we're doing things from scratch. We need to make it ourselves So to make it ourselves we can say generate normal random numbers with This dimensionality which we passed in here 768 by 10. Okay, so that gives us our randomly initialized matrix, okay Then we want to add on to this You know, we don't just want y equals ax we want y equals ax plus b Right, so we need to add on what we call in neural nets a bias vector So we create here a bias vector of length 10.

Okay again randomly initialized And so now here are our two randomly initialized weight tenses So that's our constructor Okay Now we need to define forward. Why do we need to define forward? This is a pytorch specific thing What's going to happen is this is when you create a module in Pytorch the object that you get back behaves as if it's a function You can call it with parentheses which we'll do it that in a moment.

And so you need to somehow define What happens when you call it as if it's a function and the answer is pytorch calls a method called? Forward, okay, that's just that's the Python the pytorch kind of approach that they picked, right? So when it calls forward, we need to do our actual Calculation of the output of this module or later.

Okay. So here is the thing that actually gets calculated in our logistic regression So basically we take our Input X Which gets passed to forward that's basically how forward works it gets passed the mini batch and we matrix multiply it by The layer one weights which we defined up here and then we add on The layer one bias which we defined up here.

Okay, and actually nowadays we can define this a little bit more elegantly Using the Python 3 Matrix multiplication operator, which is the at sign And when you when you use that I think you kind of end up with Something that looks closer to what the mathematical notation looked like and so I find that nicer.

Okay All right, so that's That's our linear layer In our logistic regression in our zero hidden layer neural net. So then the next thing we do to that is softmax Okay, so we get the output of this Matrix multiply Okay, who wants to tell me what the dimensionality of my output of this matrix multiply is Sorry 64 by 10.

Thank you Karen And I should mention for those of you that weren't at deep learning class yesterday We actually looked at a really cool post from Karen who described how to Do structured data analysis with neural nets which has been like super popular? And a whole bunch of people have kind of said that they've read it and found it super interesting.

So That was really exciting So we get this matrix of Outputs and we put this through a softmax And why do we put it through a softmax We put it through a softmax because in the end we want probably you know for every image We want a probability that this is 0 or a 1 or a 2 or a 3 or 4, right?

So we want a bunch of probabilities that add up to 1 and where each of those probabilities is between 0 and 1 so a softmax Does exactly that for us? So for example if we weren't picking out, you know numbers from 0 to 10 But instead of picking out cat dog play and fish or building the output of that matrix multiply For one particular image might look like that.

These are just some random numbers And to turn that into a softmax. I first go e to the power of each of those numbers. I Sum up those e to the power of and Then I take each of those e to the power ofs and divide it by the sum and that's softmax That's the definition of softmax.

So because it was a to the power of it means it's always positive Because it was divided by the sum it means that it's always between 0 and 1 and it also means because it's divided By the sum that they always add up to 1 So by applying this softmax Activation function so anytime we have a layer of outputs, which we call activations And then we apply some function some nonlinear function to that that maps one One scalar to one scalar like softmax does we call that an activation function, okay?

So the softmax activation function takes our outputs and turns it into something which behaves like a probability, right? We don't strictly speaking need it. We could still try and train something which where the output directly is the probabilities All right, but by creating using this function That automatically makes them always behave like probabilities.

It means there's less For the network to learn so it's going to learn better. All right, so generally speaking whenever we design an architecture We try to design it in a way where it's as easy as possible for it to create something of the form that we want So that's why we use softmax Right so that's the basic steps right we have our input which is a bunch of images Right which is here gets multiplied by a weight matrix.

We actually also add on a bias Right to get a output of the linear function We put it through a nonlinear activation function in this case softmax and that gives us our probabilities So there there that all is Pi torch also tends to use the log Of softmax for reasons that don't particularly bother us now It's basically a numerical stability convenience.

Okay, so to make this the same as our Version up here that you saw log softmax. I'm going to use log here as well. Okay, so We can now instantiate this class that is create an object of this class So I have a question back for the probabilities where we were before so If we were to have a photo with a cat and a dog together Would that change the way that that works or does it work in the same basic?

Yeah, so that's a great question so if you had a photo with a cat and a dog together and You wanted it to spit out both cat and dog This would be a very poor choice. So softmax is specifically the activation function we use for Categorical predictions where we only ever want to predict one of those things, right?

And so part of the reason why is that as you can see because we're using either the right either the slightly bigger numbers Creates much bigger numbers as a result of which we generally have just one or two things large and everything else is pretty small All right so if I like Recalculate these random numbers a few times you'll see like it tends to be a bunch of zeros and one or two high numbers right, so it's really designed to Try to kind of make it easy to predict like this one thing.

There's the thing I want if you're doing multi Label prediction so I want to find all the things in this image rather than using softmax We would instead use sigmoid, right? So sigmoid recall each would cause each of these between to be between zero and one, but they would no longer add to one Good question and like a lot of these Details about like best practices are things that we cover in the deep learning course And we won't cover heaps of them here in the machine learning course.

We're more interested in the mechanics, I guess But we'll try and do them if they're quick All right, so now that we've got that we can instantiate an object of that class and of course We want to copy it over to the GPU so we can do computations over there Again, we need an optimizer where we're talking about what this is shortly, but you'll see here We've called a function on our class called parameters But we never defined a method called parameters And the reason that is going to work is because it actually was defined for us inside nn.module and so nn.module actually automatically goes through the attributes we've created and finds Anything that basically we we said this is a parameter So the way you say something is a parameter is you wrap it in an end up parameter So this is just the way that you tell PyTorch This is something that I want to optimize Okay, so when we created the weight matrix we just wrapped it with an end up parameter It's exactly the same as a regular PyTorch variable which we'll learn about shortly It's just a little flag to say hey you should you should optimize this and so when you call net to dot parameter On our net to object we created it goes through everything that we created in the constructor Checks to see if any of them are of type parameter And if so it sets all of those as being things that we want to train with the optimizer And we'll be implementing the optimizer from scratch later Okay, so having done that We can fit and we should get basically the same answer as before 91 ish So that looks good All right So What if we actually built here?

Well what we've actually built as I said is something that can behave like a regular function All right, so I want to show you how we can actually call this as a function So to be able to call it as a function We need to be able to pass data to it to be able to pass data to it I'm going to need to grab a mini batch of MNIST images Okay, so we used for convenience the Image classifier data from a raised method from fastai And what that does is it creates a pytorch data loader for us a pytorch data loader is Something that grabs a few images and sticks them into a mini batch and makes them available And you can basically say give me another mini batch give me another mini batch give me another mini batch and so in Python we call these things generators Generators are things where you can basically say I want another I want another I want another right There's this kind of very close connection between Iterators and generators are not going to worry about the difference between them right now, but you'll see basically to turn To actually get hold of something which we can say please give me another of in Order to grab something that we can we can use to generate mini batches We have to take our data loader and so you can ask for the training data loader from our model data object You'll see there's a bunch of different data loaders.

You can ask for you can ask for the test data loader the train data loader validation loader Augmented images data loader and so forth so we're going to grab the training data loader That was created for us. This is a PI standard PI torch data loader. Well slightly optimized by us, but same idea And you can then say this is a standard Python Thing we can say turn that into an iterator turn that into something where we can grab another one at a time from and so Once you've done that We've now got something that we can iterate through you can use the standard Python Next function to grab one more thing from that generator, okay?

So that's returning and the X's from a mini batch in the wise Found our mini batch the other way that you can use Generators and iterators in Python is with a for loop. I could also have said like for you know X mini batch comma Y mini batch in data loader And then like do something right so when you do that.

It's actually behind the scenes It's basically syntactic sugar for calling next lots of times. Okay, so this is all standard Python stuff So that returns a Tensor of size 64 by 784 as we would expect right the Fastai library we used defaults to a mini batch size of 64.

That's why it's that long These are all of the background zero pixels, but they're not actually zero in this case. Why aren't they zero? Yeah, they're normalized exactly right so we subtract at the mean divided by standard deviation right So there there it is so now what we want to do is we want to Pass that into our our logistic regression.

So what we might do is we'll go Variable XMB equals variable. Okay, I can take my X mini batch I can move it on to the GPU because remember my Net to object is on the GPU so our data for it also has to be on the GPU And then the second thing I do is I have to wrap it in variable.

So what does variable do? This is how we get for free automatic differentiation Pytorch can automatically differentiate You know pretty much anything right any tensor? But to do so takes memory and time So it's not going to always keep track like to do to do what about differentiation It has to keep track of exactly how something was calculated.

We added these things together We multiplied it by that we then took the sign blah blah blah, right? you have to know all of the steps because then to do the automatic differentiation it has to Take the derivative of each step using the chain rule multiply them all together All right, so that's slow and memory intensive So we have to opt in to saying like okay this particular thing we're going to be taking the derivative of later So please keep track of all of those operations for us And so the way we opt in is by wrapping a tensor in a variable, right?

So That's how we do it and You'll see that it looks almost exactly like a tensor, but it now says variable containing This tensor right so in Pytorch a variable has exactly Identical API to a tensor or actually more specifically a superset of the API of a tensor Anything we can do to a tensor we can do to a variable But it's going to keep track of exactly what we did so we can later on take the derivative Okay, so we can now pass that Into our net to object remember I said you can treat this as if it's a function Right so notice we're not calling dot forward We're just treating it as a function and Then remember we took the log so to undo that I'm taking the x and that will give me my probabilities Okay, so there's my probabilities, and it's got Return something of size 64 by 10 so for each image in the mini batch We've got 10 probabilities, and you'll see most probabilities are pretty close to 0 Right and a few of them are quite a bit bigger Which is exactly what we do we hope right is that it's like okay?

It's not a zero. It's not a one It's not a two. It is a three. It's not a four. It's not a five and so forth So maybe this would be a bit easier to read if we just grab like the first three of them Okay, so it's like ten to the next three ten to the next eight two five five four okay?

And then suddenly here's one which is ten to make one right? So you can kind of see what it's trying to what it's trying to do here I mean we could call like net to dot forward and it'll do exactly the same thing Right, but that's not how All of the pie torch mechanics actually work It's actually they actually call it as if it's a function right and so this is actually a really important idea like because it means that When we define our own architectures or whatever anywhere that you would put in a function You could put in a layer anyway you put in a layer you can put in a neural net anyway You put in a neural net you can put in a function because as far as pie torch is concerned They're all just things that it's going to call just like as if they're functions So they're all like interchangeable, and this is really important because that's how we create Really good neural nets is by mixing and matching lots of pieces and putting them all together Let me give an example Here is my Logistic aggression which got 91 and a bit percent accuracy I'm now going to turn it Into a neural network with one hidden layer all right, and the way I'm going to do that is I'm going to create one more layer I'm going to change this so it spits out a hundred rather than ten Which means this one input is going to be a hundred rather than ten Now this as it is can't possibly make things any better at all yet Why is this definitely not going to be better than what I had before?

Yeah, can somebody pass the yeah? But you've got a combination of two linear layers, which is just the same as one Exactly right so we've got two linear layers, which is just a linear layer right so to make things interesting I'm going to replace all of the negatives from the first layer with zeros Because that's a nonlinear transformation, and so that nonlinear transformation is called a rectified linear unit Okay, so nn dot sequential simply is going to call each of these layers in turn for each mini batch right so do a linear layer Replace all of the negatives with zero do another linear layer and do a softmax.

This is now a neural network with one hidden layer and So let's try trading that instead Okay accuracy is now going up to 96% Okay, so the this is the idea is that the basic techniques. We're learning in this lesson Like become powerful at the point where you start stacking them together, okay?

Can somebody pass the green box there and then there yes, Daniel? Why did you pick a hundred? No reason it was like easier to type an extra zero? Like this question of like how many Activations should I have it a neural network layer is kind of part of the the scale of a deep learning practitioner We cover it in the deep learning course not in this course When adding that additional I guess transformation Additional layer additional layer this one here is called a nonlinear layer or an activation function Activation function or activation function Does it matter that like if you would have done for example like two softmaxes?

Or is that something you cannot do like yeah? You can absolutely use a softmax there But it's probably not going to give you what you want and the reason why is that a softmax? Tends to push most of its activations to zero and an activation just be clear like I've had a lot of questions in deep Learning course about like what's an activation an activation is the value that is calculated in a layer, right?

So this is an activation Right it's not a weight a weight is not an activation It's the value that you calculate from a layer So softmax will tend to make most of its activations pretty close to zero and that's the opposite of what you want you genuinely want your activations to be kind of as Rich and diverse and and used as possible so nothing to stop you doing it, but it probably won't work very well Basically pretty much all of your layers will be followed by Non by nonlinear activation functions that will nearly always be value except for the last layer Could you when doing multiple layers, so let's say like could you live three could you think it's going two or three layers deep?

Do you want to switch up these activation layers? No, that's a great question. So if I wanted to go deeper I would just do That okay, that's a now to hidden layer network So I think I'd heard you said that there are a couple of different Activation functions like that rectified linear unit.

What are some examples and Why would you use? Each yeah great question So basically like as you add like more linear layers you kind of got your Input comes in and you put it through a linear layer and then a nonlinear layer linear layer nonlinear layer linear linear layer and then the final nonlinear layer The final nonlinear layer as we've discussed, you know, if it's a multi-category Classification, but you only ever pick one of them you would use softmax If it's a binary classification or a multi Label classification where you're predicting multiple things you would use sigmoid If it's a regression You would often have nothing at all Right, although we learned in last night's deal course where sometimes you can use sigmoid there as well So they're basically the options main options for the final layer for the Hidden layers you pretty much always use ReLU Okay, but there is a another Another one you can pick which is kind of interesting which is called Leaky ReLU and it looks like this and Basically if it's above zero, it's y equals x and if it's below zero, it's like y equals 0.1 x that's very similar to ReLU, but it's Rather than being equal to 0 under x.

It's it's like something close to that So they're the main two ReLU and Leaky ReLU There are various others, but they're kind of like things that just look very close to that So for example, there's something called ELU, which is quite popular But like you know the details don't matter too much honestly like that there like ELU is something that looks like this But it's slightly more curvy in the middle And it's kind of like it's not generally something that you so much pick based on the data set it's more like Over time we just find better activation functions so two or three years ago Everybody used ReLU, you know a year ago pretty much everybody used Leaky ReLU today I guess probably most people starting to move towards ELU But honestly the choice of activation function doesn't matter terribly much actually And you know people have actually showed that you can use like our pretty arbitrary nonlinear activation functions like even a sine wave It still works Okay So although what we're going to do today is showing how to create This network with no hidden layers To turn it into that network Which is 96% ish accurate is it will be trivial right and in fact is something you should Probably try and do during the week right is to create that version Okay So now that we've got something where we can take our network pass in our variable and get back some predictions That's basically all that happened when we called fit.

So we're going to see how how that that approach can be used to create this stochastic gradient descent one thing to note is that the to turn the Predicted probabilities into a predicted like which digit is it? We would need to use argmax Unfortunately pytorch doesn't call it argmax Instead pytorch just calls it max and max returns two things Returns the actual max across this axis so this is across the columns right and the second thing it returns is the index Of that maximum right so so the equivalent of argmax is to call max and then get the first Indexed thing okay, so there's our predictions right if this was in numpy.

We would instead use NP argmax Okay All right So here are the predictions from our hand created logistic regression and in this case Looks like we got all but one correct So the next thing we're going to try and get rid of in terms of using libraries is for try to avoid using the Matrix multiplication operator and instead we're going to try and write that by hand So this next part we're going to learn about something which kind of seems It kind of it's going to seem like a minor little kind of programming idea, but actually it's going to turn out That at least in my opinion.

It's the most important Programming concept that we'll teach in this course, and it's possibly the most important programming kind of concept in all of All the things you need to build machine learning algorithms, and it's the idea of broadcasting And the idea I will show by example If we create an array of 10 6 neg 4 and an array of 2 8 7 and then add the two together It adds each of the components of those two arrays in turn we call that element wise So in other words we didn't have to write a loop right back in the old days We would have to have looped through each one and added them and then concatenated them together We don't have to do that today.

It happens for us automatically so in numpy We automatically get element wise operations We can do the same thing with Pytorch So in fastai we just add a little capital T to turn something into a Pytorch tensor right and if we add those together Exactly the same thing right so element wise operations are pretty standard in these kinds of libraries It's interesting not just because we don't have to write the for loop Right, but it's actually much more interesting because of the performance things that are happening here The first is if we were doing a for loop right If we were doing a for loop That would happen in Python Right even when you use Pytorch it still does the for loop in Python it has no way of like Optimizing the for loop and so a for loop in Python is something like 10,000 times slower than in C So that's your first problem.

I can't remember. It's like 1,000 or 10,000 the second problem then is that You don't just want it to be optimized in C But you want C to take advantage of the thing that you're all of your CPUs do to something called SIMD Single instruction multiple data, which is it yours your CPU is capable of taking eight things at a time Right in a vector and adding them up to another Vector with eight things in in a single CPU instruction All right, so if you can take advantage of SIMD you're immediately eight times faster It depends on how big the data type is it might be four might be eight The other thing that you've got in your computer is you've got multiple processors Multiple cores So you've probably got like if this is inside happening on one side one core.

You've probably got about four of those Okay, so if you're using SIMD you're eight times faster if you can use multiple cores, then you're 32 times faster And then if you're doing that in C You might be something like 32 times about thousand times faster right and so the nice thing is that when we do that It's taking advantage of all of these things Okay, better still if you do it in pytorch and your data was created with .Cuda to stick it on the GPU Then your GPU can do about 10,000 things at a time Right so that'll be another hundred times faster than C All right, so this is critical To getting good performance is you have to learn how to write loopless code By taking advantage of these element wise Operations and like it's not it's a lot more than just plus I Could also use less than right and that's going to return 0 1 1 or if we go back to numpy False true true And so you can kind of use this to do all kinds of things without looping so for example I could now multiply that by a and here are all of the values of a As long as they're less than B or we could take the mean This is the percentage of values in a that are less than B All right, so like there's a lot of stuff you can do with this simple idea But to take it further Right to take it further than just this element wise operation We're going to have to go the next step to something called broadcasting So let's take a five minute break come back at 217 and we'll talk about broadcasting So Broadcasting This is the definition from the numpy documentation of Broadcasting and I'm going to come back to it in a moment rather than reading it now But let's start by looking an example of broadcasting so a is a Array With one dimension also known as a rank one tensor also known as a vector We can say a greater than zero so here we have a rank one tensor Right and a rank zero tensor Right a rank zero tensor is also called a scalar rank one tensor is also called a vector and We've got an operation between the two All right now you've probably done it a thousand times without even noticing.

That's kind of weird right that you've got these things of different Ranks and different sizes, so what is it actually doing right? But what it's actually doing is it's taking that scalar and copying it here here here Right and then it's actually going element wise 10 is greater than 0 6 is greater than 0 minus 4 is greater than 0 you haven't giving us back the three answers Right and that's called broadcasting broadcasting means Copying one or more axes of my tensor To allow it to be the same shape as the other tensor It doesn't really copy it though What it actually does is it stores this kind of internal indicator that says pretend that this is a vector of three zeros But it actually just like what rather than kind of going to the next row or going to the next scalar it goes back To where it came from if you're interested in learning about this specifically It's they set the stride on that axis to be zero.

That's a minor advanced concept for those who are curious So we could do a +1 right is going to broadcast the scalar 1 To be 1 1 1 and then do element wise addition We could do the same with a matrix right here's our matrix 2 times the matrix is going to broadcast 2 to be 2 2 2 2 2 2 2 2 2 2 and then do element wise multiplication All right, so that's our kind of most simple version of broadcasting So here's a slightly more complex version of broadcasting Here's an array called C.

All right, so this is a rank 1 tensor and Here's our matrix M from before Our rank 2 tensor we can add M plus C All right, so what's going on here? 1 2 3 4 5 6 7 8 9 That's M All right, and then C 10 20 30 You can see that what it's done is to add that to each row right eleven twenty two thirty three 14 25 36 and so we can kind of figure it seems to have done the same kind of idea as broadcasting a scalar It's like made copies of it And then it treats those as If it's a rank 2 matrix and now we can do element wise addition That makes sense now that's yes, can can you pass that Devon over there?

Thank you So as it's like by looking at this example it like Copies it down making new rows So how would we want to do it if we wanted to get new columns? I'm so glad you asked So Instead We would do this 10 20 30 All right, and then copy that 10 20 30 10 20 30 and Now treat that as our matrix So to get numpy to do that we need to not pass in a vector but to pass in a Matrix with one column a rank 2 tensor, right?

so basically it turns out that numpy is going to think of a Rank 1 tensor for these purposes as if it was a rank 2 tensor which represents a row Right. So in other words that it is 1 by 3, right? So we want to create a tensor, which is 3 by 1 There's a couple of ways to do that One is to use NP expand dims And if you then pass in this argument, it says please insert a length 1 axis here, please so in our case we want to turn it into a 3 by 1 so if we said expand in C comma 1 Okay, so if we say expand in C comma 1 it changes the shape to 3 comma 1 so if we look at what that looks like That looks like a column.

Okay, so if we now go that plus M You can see it's doing exactly what we hoped it would do Right, which is to add 10 20 30 to the column 10 20 30 to the column 10 20 30 to the column Okay now because the Location of a unit axis turns out to be so important It's really helpful to kind of experiment with creating these extra unit axes and know how to do it easily and NP dot expand dims Isn't in my opinion the easiest way to do this the easiest way?

The easiest way is to index into the tensor with a special Index none and what none does is it creates a new axis in that location of Length 1 right so this is Going to add a new axis at the start of length 1 This is going to add a new axis at the end of length 1 or Why not do both?

Right so if you think about it like a tensor Which has like three? Things in it could be of any rank you like right you can just add Unit axes all over the place and so that way we can kind of Decide how we want our broadcasting to work So there's a pretty convenient Thing in numpy called broadcast 2 and what that does is it takes our vector and broadcasts it to that shape and shows us what that would look like Right so if you're ever like unsure of what's going on in some broadcasting operation You can say broadcast 2 and so for example here.

We could say like rather than 3 comma 3 we could say m dot shape Right and see exactly what's happened going to happen, and so that's what's going to happen before we add it to n right so if we said Turn it into a column That's what that looks like Make sense, so that's kind of like the intuitive definition of Broadcasting and so now hopefully we can go back to that numpy documentation and understand What it means right?

Broadcasting describes how numpy is going to treat arrays of different shapes when we do some operation Right the smaller array is broadcast across the larger array by smaller array. They mean lower rank tensor basically Broadcast across the light the higher rank tensor so they have compatible shapes it vectorizes array operations So vectorizing generally means like using SIMD and stuff like that so that multiple things happen at the same time All the looping occurs in C But it doesn't actually make needless copies of data it kind of just acts as if it had So there's our definition now in deep learning you very often deal with tensors of rank four or more and you very often combine them with tensors of rank one or two and Trying to just rely on intuition to do that correctly is nearly impossible So you really need to know the rules?

So here are the rules Okay, here's m dot shape here C dot shape so the rule are that we're going to compare The shapes of our two tensors element wise we're going to look at one at a time And we're going to start at the end right so look at the trailing dimensions and then go Towards the front okay, and so two dimensions are going to be compatible When one of these two things is true, right?

So let's check right we've got our our M and C compatible M is 3 3 C is 3 right so we're going to start at the end trailing dimensions first and check are they compatible they're compatible if the dimensions are equal Okay, so these ones are equal so they're compatible right Let's go to the next one.

Oh, oh, we're missing Right C is missing something. So what happens if something is missing as we insert a one? Okay, that's the rule right and so let's now check are these compatible one of them is one. Yes, they're compatible Okay, so now you can see why it is that numpy treats the one dimensional array as If it is a rank 2 tensor Which is representing a row it's because we're basically inserting a one at the front Okay, so that's the rule so for example This is something that you very commonly have to do which is you start with like an image they're like 256 pixels by 256 pixels by three channels and You want to subtract?

the mean of each channel All right, so you've got 256 by 256 by 3 and you want to subtract something of length 3, right? So yeah, you can do that Absolutely because 3 and 3 are compatible because they're the same All right 256 and empty is compatible. It's going to insert a 1 256 and empty is compatible.

It's going to insert a 1 Okay, so you're going to end up with this is going to be broadcast over all of this axis and then that whole thing will be broadcast over this axis and so we'll end up with a 256 by 256 by 3 Effective Tensor here, right?

so interestingly like very few people in the data science or machine learning communities Understand broadcasting and the vast majority of the time for example when I see people doing pre-processing for computer vision Like subtracting the mean they always write loops over the channels right and I kind of think like It's it's like so handy to not have to do that and it's often so much faster to not have to do that So if you get good at broadcasting You'll have this like super useful skill that very very few people have And and like it's it's it's an ancient skill.

You know it goes it goes all the way back to the days of APL so APL was from the late 50s stands for our programming language and Kenneth Iverson Wrote this paper called notation as a tool for thought in which he proposed a new math notation and He proposed that if we use this new math notation It gives us new tools for thought and allows us to think things we couldn't before and one of his ideas was broadcasting not as a computer programming tool, but as a piece of math notation and so he ended up implementing this notation as a tool for thought as a programming language called APL and His son has gone on to further develop that Into a piece of software called J Which is basically what you get when you put 60 years of very smart people working on this idea And with this programming language you can express Very complex mathematical ideas often just with a line of code or two And so I mean it's great that we have J But it's even greater that these ideas have found their ways into the languages We all use like in Python the NumPy and PyTorch libraries, right?

These are not just little Kind of niche ideas. It's like fundamental ways to think about math and to do programming Like let me give an example of like this kind of notation as a tool for thought let's Let's look here. We've got C, right? Here we've got C None right.

Notice. This is now a two square brackets, right? So this is kind of like a one row rank 2 tensor Here it is a little column So what is Oh Just round ones Okay, what's that going to do? Have a think about it Anybody want to have a go you can even talk through your thinking.

Okay. Can we pass the check this over there? Thank you Kind of outer product. Yes, absolutely. So take us through your thinking. How's that gonna work? So the diagonal elements can be directly visualized from the squares And cross 10 20 cross 20 and 30 cross 30 And if you multiply the first row with this column, you can get the first row of the matrix So finally you'll get a 3 cross 3 matrix.

Yeah, and So to think of this in terms of like those broadcasting rules, we're basically taking This column right which is of rank 3 comma 1 right and this kind of row Sorry, I mentioned 3 comma 1 and this row which is of dimension 1 comma 3 Right and so to make these compatible with our broadcasting rules Right this one here has to be duplicated Three times because it needs to match this Okay, and now this one's going to have to be duplicated three times to match this Okay, and so now I've got two Matrices to do an element wise product of and so as you say There is our outer product right now.

The interesting thing here is That suddenly now that this is not a special mathematical case But just a specific version of the general idea of broadcasting we can do like an outer plus Or we can do an outer greater than Right or or whatever right so it's suddenly we've kind of got this this this concept That we can use to build New ideas and then we can start to experiment with those new ideas.

And so, you know interestingly NumPy actually Uses this sometimes For example if you want to create a grid This is how NumPy does it right actually this is kind of the sorry, let me show you this way If you want to create a grid, this is how NumPy does it it actually returns 0 1 2 3 4 and 0 1 2 3 4 1 is a column 1 is a row So we could say like okay, that's x grid comma y grid And now you could do something like Well, I mean we could obviously go Like that right and so suddenly we've expanded that out Into a grid right and so Yeah, it's kind of interesting how like some of these like simple little concepts Kind of get built on and built on and built on so if you lose something like APL or J.

It's this whole Environment of layers and layers and layers of this we don't have such a deep environment in NumPy But you know you can certainly see these ideas of like broadcasting coming through In simple things like how do we create a grid in in NumPy? So yeah, so that's that's broadcasting and so what we can do with this now is Use this to implement matrix multiplication ourselves Okay Now why would we want to do that well obviously we don't right matrix multiplication has already been handled Perfectly nicely for us by our libraries but very often you'll find in All kinds of areas in in machine learning and particularly in deep learning that there'll be particular types of linear Function that you want to do that aren't quite Done for you all right so for example.

There's like whole areas called like tensor regression and Tensor decomposition Which are really being developed a lot at the moment and they're kind of talking about like how do we take like Higher rank tensors and kind of turn them into combinations of rows Columns and faces and it turns out that when you can kind of do this you can basically like Deal with really high dimensional data structures with not much memory and not with not much computation time for example.

There's a really terrific library called tensorly Which does a whole lot of this kind of stuff? for you So it's a really really important area it covers like all of deep learning lots of modern machine learning in general And so even though you're not going to like to find matrix modification.

You're very likely to want to define some other Slightly different tensor product you know So it's really useful to kind of understand how to do that So let's go back and look at our matrix and our 2d array and 1d array rank 2 tensor rank 1 tensor and Remember we can do a matrix multiplication Using the at sign or the old way NP dot matmul.

Okay? And so what that's actually doing when we do that is we're basically saying Okay, 1 times 10 plus 2 times 20 plus 3 times 30 is 140 right and so we do that for each row and We can go through and do the same thing for the next one and for the next one to get our result, right?

You could do that in torch as well We could make this a little shorter Okay, same thing Okay, but that is not matrix multiplication. What's that? Okay, element wise specifically we've got a matrix and a vector so Broadcasting okay good. So we've got this is element wise with broadcasting but notice The numbers it's created 10 40 90 are the exact three numbers that I needed to Calculate when I did that first Piece of my matrix multiplication.

So in other words if we sum this Over the columns, which is axis equals 1 We get our matrix vector product Okay, so we can kind of do This stuff without special help from our library So now Let's expand this out to a matrix matrix product So a matrix matrix product Looks like this.

This is this great site called matrix multiplication dot XYZ And it shows us this is what happens when we multiply two matrices Okay, that's what matrix multiplication is operationally speaking so in other words what we just did there Was we first of all took the first column with the first row to get this one and Then we took the second column with the first row To get that one.

All right, so we're basically doing The thing we just did the matrix vector product. We're just doing it twice right once With this column and once with this column, and then we can catenate the two together Okay, so we can now go ahead and do that Like so M times the first column dot sum M times the second top column, but some and so there are the two columns of our matrix multiplication Okay So I didn't want to like make our code too messy So I'm not going to actually like use that but like we have it there now if we want to we don't need to use Torch or NumPy matrix multiplication anymore.

We've got we've got our own that we can use using nothing but element wise operations broadcasting and some Okay So this is our Logistic regression from scratch class again. I just copied it here Here is where we instantiate the object copy it to the GPU we create an optimizer Which we'll learn about in a moment and we call fit.

Okay, so the goal is to now repeat this without needing to call fit So to do that We're going to need a loop Which grabs a mini batch of data at a time and with each mini batch of data? We need to pass it to the optimizer and say please try to come up with a slightly better set of predictions for this mini batch So as we learned in order to grab a mini batch of the training set at a time We have to ask the model data object for the training data loader We have to wrap it in it or it er to create an iterator generator And so that gives us our data loader.

Okay, so pytorch calls this a data loader We actually wrote our own fast AI data loader, but it's it's all it's basically the same idea and So the next thing we do is we grab the X and the Y tensor The next one from our data loader, okay? Wrap it in a variable to say I need to be able to take the derivative of The calculations using this because if I can't take the derivative Then I can't get the gradients and I can't update the weights all right, and I need to put it on the GPU because my module is on the GPU and So we can now take that variable and pass it to The object that we instantiated our logistic regression Remember our module we can use it as if it's a function because that's how pytorch works And that gives us a set of predictions as we saw seen before Okay So now we can check the loss and the loss we defined as being a negative log likelihood loss Object and we're going to learn about how that's calculated in the next lesson and for now think of it Just like root mean squared error, but for classification problems So we can call that also just like a function so you can kind of see this It's very general idea in pytorch that you know kind of treat everything ideally like it's a function So in this case we have a loss a negative log likelihood loss object.

We treat it like a function we pass in our predictions and We pass in our axials right and again the axials need to be turned into a variable and put on the GPU Because the loss is specifically the thing that we actually want to take the derivative of right so that gives us our loss And there it is.

That's our loss 2.43 So it's a variable and because it's a variable it knows how it was calculated All right, it knows it was calculated with this loss function. It knows that the predictions were calculated with this Network it knows that this network consisted of these operations and so we can get the gradient automatically, all right So to get the gradient We call L dot backward remember L is the thing that contains our loss All right, so L dot backward is is something which is added to anything.

That's a variable You can call dot backward and that says please calculate the gradients Okay, and so that calculates the gradients and stores them inside that that the basically for each of the Weights that was used it used each of the parameters that was used to calculate that it's now stored a Dot grad we'll see it later.

It's basically stored the gradient right so we can then call Optimizer dot step and we're going to do this step manually shortly And that's the bit that says please make the weights a little bit better right and so what optimizer dot step is doing Is it saying like okay if you had like a really simple function?

Like this Right then what the optimizer does is it says okay. Let's pick a random starting point Right and let's calculate the value of the loss right so here's our parameter Here's our loss right let's take the derivative All right the derivative tells us which way is down, so it tells us we need to go that direction Okay, and we take a small step and Then we take the derivative again, and we take a small step derivative again Take a small step do it again.

Take a small step and Till eventually we're taking such small steps that we stop okay, so that's what? gradient descent does okay How big a step is a small step? Well, we basically take the derivative here, so let's say derivative. There is like eight All right, and we multiply it by a small number like say 0.01 and that tells us what step size to take this small number here is called the learning rate and It's the most important hyper parameter to set right if you pick two smaller learning rate Then your steps down are going to be like tiny, and it's going to take you forever All right to bigger learning rate, and you'll jump too far Right and then you'll jump too far and your diverge rather than converge, okay We're not going to talk about how to pick a learning rate in this class But in the deep learning class we actually show you a specific technique that very reliably picks a very good learning rate um So that's basically what's happening right so we calculate the derivatives And we call the optimizer that does a step in other words update the weights based on the Gradients and the learning rate We should hopefully find that after doing that we have a better loss than we did before So I just reran this and got a loss here of four point one six and after one step It's now four point.

Oh three okay, so it worked the way We hoped it would based on this mini batch it updated all of the weights in our Network to be a little better than they were as a result of which our loss went down, okay? So let's turn that into a training loop All right, we're going to go through a hundred steps Grab one more mini batch of data from the data loader Calculate our predictions from our network calculate our loss from the predictions and the actuals Every 10 goes we'll print out the accuracy just take the mean of the whether they're equal or not One Pytorch specific thing you have to zero the gradients basically you can have networks where like you've got lots of different loss Functions that you might want to add all of the gradients together Right so you have to tell Pytorch like when to set the gradients back to zero Right so this just says set all the gradients to zero Calculate the gradients that's put backward and then take one step of the optimizer So update the weights using the gradients and the learning rate and so once we run it.

You can see the loss goes down and The accuracy goes up Okay so That's the basic approach and so next lesson. We'll see what That does all right We're looking in detail We're not going to look inside here as I say we're going to basically take the calculation of the derivatives as As a given right but basically What's happening there?

And any kind of deep network you have kind of like a function That's like you know a linear function And then you pass the output of that into another function that might be like a ReLU And you pass the output of that into another function that might be another linear net linear layer And you pass that into another function that might be another ReLU and so forth right so these deep networks are just Functions of functions of functions, so you could write them mathematically like that right and so All backprop does is it says let's just simplify this down to the two version Is we can say okay u equals f of x Right and so therefore the derivative of g of f of x is we can calculate with the chain rule as being g - u f - x Right and so you can see we can do the same thing for the functions of the functions of the functions, and so when you apply a Function to a function of a function you can take the derivative just by taking the product of the derivatives of each of those Layers okay, and in neural networks.

We call this back propagation Okay, so when you hear back propagation it just means use the chain rule to calculate the derivatives And so when you see a neural network defined Like here right Like if it's defined sequentially literally all this means is apply this function to the input Apply this function to that apply this function to that apply this function to that right so this is just defining a composition of a function to a function to a function to a function to a function okay, and so Yeah, so although we're not going to bother with calculating the gradients ourselves You can now see why it can do it right as long as it has internally You know a it knows like what's the what's the derivative of to the power of what's the derivative of sign?

What's the derivative of plus and so forth then our Python code? In here, it's just combining those things together So it just needs to know how to compose them together with the chain rule and away it goes, okay? Okay, so I think we can leave it there for now and yeah and in the next class We'll go and we'll see how to Write our own optimizer, and then we'll have solved MNIST from scratch ourselves.

See you then