
Machine Learning 1: Lesson 9


Chapters

0:0 Introduction
0:35 Synthetic Data
4:10 Parfit
11:40 Basic Steps
15:30 NN Module
16:15 Constructor
19:30 Define Forward
21:10 Softmax
27:30 Parameters
28:50 Results
29:10 Functions
30:15 Generators
32:0 Fastai
33:10 Variable
36:5 Function
45:10 Making Predictions
47:0 Broadcasting
48:40 Performance
52:30 Broadcasting

Whisper Transcript

00:00:00.000 | All right welcome back to machine learning I
00:00:02.600 | I'm really excited to be able to share some amazing stuff that
00:00:08.480 | University of San Francisco students have built during the week or written about during the week
00:00:15.280 | Quite a few things. I'm going to show you have already
00:00:17.400 | spread around the internet quite a bit
00:00:20.640 | lots of
00:00:23.200 | Tweets and posts and all kinds of stuff happening
00:00:28.040 | One of the first to be widely shared was this one by Tyler, who did something really interesting
00:00:34.880 | He started out by saying like what if I like create the synthetic data set where the independent variables is like the x and the y
00:00:43.940 | And the dependent variable is like color right and interestingly
00:00:48.080 | He showed me an earlier version of this where he wasn't using color
00:00:51.080 | he was just like putting the actual numbers in here and
00:00:54.840 | this thing kind of wasn't really working at all and as soon as he started using color it started working really well and
00:01:00.640 | So I wanted to mention that one of the things that unfortunately we we don't teach you
00:01:05.600 | at USF is
00:01:08.120 | Theory of human perception perhaps we should
00:01:10.840 | Because actually when it comes to visualization it's kind of the most important thing to know is what is the human eye?
00:01:17.080 | Or what is what is what is the human brain good at perceiving? There's a whole area of academic study on this
00:01:24.400 | And one of the things that we're best at perceiving is differences in color
00:01:28.040 | Right so that's why as soon as we look at this picture of the synthetic data. He created you can immediately see oh there's kind of four
00:01:34.340 | areas of you know lighter red
00:01:37.120 | color
00:01:38.680 | So what he did was he said okay?
00:01:40.840 | What if we like tried to create a machine learning model of this synthetic data set?
00:01:46.720 | And so specifically he created a tree and the cool thing is that you can actually draw
00:01:52.840 | The tree right so after he created the tree
00:01:55.440 | He did this all in matplotlib - matplotlib is very flexible right - he actually drew the tree boundaries
00:02:01.840 | So that's already a pretty neat trick is to be actually able to draw the tree
00:02:07.800 | But then he did something even cleverer which is he said okay?
00:02:10.800 | So what predictions does the tree make well it's the average of each of these areas and so to do that
00:02:16.960 | We can actually draw the average color
00:02:18.960 | Right it's actually kind of pretty
00:02:22.720 | Here is the predictions that the tree makes
00:02:28.120 | Here's where it gets really interesting. It's like you can as you know randomly
00:02:32.560 | generate trees through resampling and
00:02:36.680 | So here are four trees
00:02:39.600 | Generated through resampling they're all like pretty similar, but a little bit different
00:02:43.880 | And so now we can actually visualize bagging and to visualize bagging we literally take the average of the four pictures
00:02:52.000 | All right. That's what bagging is and
00:02:54.000 | There it is right and so here is like the the fuzzy decision boundaries of a random forest
00:03:01.440 | And I think this is kind of amazing right because it's like a I wish I had this actually when I started teaching you
00:03:08.880 | All random forests because I could have skipped a couple of classes. It's just like okay. That's what we do
00:03:13.920 | You know we create the decision boundaries we average each area
00:03:18.360 | And then we we do it a few times and average all of them
00:03:21.960 | Okay, so that's what a random forest does and I think like this is just such a great example of
00:03:26.320 | Making the complex easy through through pictures
00:03:32.160 | So congrats to Tyler for that
00:03:34.840 | It actually turns out
00:03:37.360 | That he has actually reinvented something that somebody else has already done - a researcher called Criminisi, who went on to be
00:03:44.000 | One of the world's foremost machine learning researchers, and actually included almost exactly this technique in a book
00:03:51.000 | He wrote about decision forests, so it's actually kind of cool that Tyler ended up
00:03:54.880 | Reinventing something that one of the world's foremost authorities on decision forests had actually created
00:04:01.720 | So I thought that was neat
00:04:03.200 | That's nice because when we posted this on Twitter
00:04:05.960 | You know it got a lot of attention and finally somebody was able to say like oh
00:04:09.800 | You know what, this actually already exists, so Tyler's gone away, and you know, started reading that book
00:04:17.160 | Something else which is super cool is Jason Carpenter
00:04:20.520 | Created a whole new library called parfit and parfit is a
00:04:26.960 | parallelized fitting of multiple models for the purpose of
00:04:31.200 | Selecting hyper parameters, and there's a lot. I really like about this
00:04:36.560 | He's shown a clear example of how to use it right and like the API looks very similar to other grid search based approaches
00:04:46.940 | But it uses the validation
00:04:48.940 | techniques that
00:04:50.780 | Rachel wrote about and that we learned about a couple of weeks ago of using a good validation set
00:04:57.820 | You know what he's done here is in his blog post that introduces it. You know he's he's
00:05:04.180 | Gone right back and said like well
00:05:06.820 | What are hyper parameters why do we have to train them?
00:05:09.140 | And he's kind of explained every step and then the the module itself is like it's it's very polished
00:05:15.820 | You know he's added documentation to it. He's added a nice read me to it
00:05:19.620 | And it's kind of interesting when you actually look at the code you realize
00:05:22.940 | You know it's very simple. You know which is it's definitely not a bad thing. That's a good thing is to make things simple
00:05:29.700 | but by kind of
00:05:33.100 | Writing this little bit of code and then packaging it up so nicely
00:05:35.700 | He's made it really easy for other people to use this technique
00:05:39.220 | which is great and so
00:05:42.460 | one of the things I've been really thrilled to see is then
00:05:44.660 | Vinay went along and combined two things from our class one was to take
00:05:50.180 | Parfit and then the other was to take the kind of accelerated SGD approach to classification
00:05:56.020 | That we learned about in the last lesson and combine the two to say like okay. Well. Let's now use
00:06:02.100 | Parfit to help us find the parameters of a
00:06:05.740 | SGD logistic regression
00:06:08.580 | So I think that's really a really great idea
00:06:12.180 | something
00:06:14.100 | else which I thought was terrific is
00:06:16.100 | Prince actually
00:06:18.780 | basically went through and
00:06:20.780 | Summarized pretty much all the stuff we learned in the random forest interpretation class
00:06:27.980 | And he went even further than that as he described each of the different approaches to random forest interpretation
00:06:37.020 | He described how it's done so here for example is feature importance through variable permutation a little picture of each one and
00:06:44.860 | Then super cool here is the code to implement it from scratch
00:06:49.400 | So I think this is like really
00:06:52.580 | Nice post you know describing something that not many people understand and showing you know exactly how it works both with pictures
00:07:00.740 | And with code that implements it from scratch
00:07:04.340 | So I think that's really really great one of the things. I really like here is that for like the
00:07:09.100 | Tree interpreter, but he actually showed how you can take the tree interpreter
00:07:14.320 | output and feed it into the new waterfall chart package that
00:07:19.300 | Chris our USF student built to show how you can actually visualize
00:07:23.260 | The contributions of the tree interpreter in a waterfall chart so again kind of a nice combination of
00:07:30.740 | multiple pieces of technology we've both learned about and and built as a group I
00:07:36.100 | Also really thought this
00:07:39.860 | Kernel there's been a few interesting kernels shared and I'll share some more next week and devesh wrote this really nice kernel
00:07:45.460 | Showing there's this quite challenging Kaggle competition on detecting icebergs
00:07:51.400 | versus
00:07:53.420 | Ships and it's a kind of a weird two channel satellite data. Which is very hard to visualize and he actually
00:08:01.940 | Went through and basically described kind of the formulas for how these like radar scattering things actually work
00:08:10.420 | And then actually managed to come up with a code that allowed him to recreate
00:08:17.140 | You know the actual 3d?
00:08:19.780 | Icebergs
00:08:23.260 | or ships and
00:08:24.820 | I have not seen that done before or like I you know it's it's quite challenging to know how to visualize this data
00:08:31.020 | And then he went on to show how to build a neural net to try to interpret this so that was pretty fantastic as well
00:08:38.800 | So yeah congratulations for all of you. I know for a lot of you. You know you're
00:08:44.140 | Posting stuff out there to the rest of the world for the first time you know and it's kind of intimidating
00:08:51.500 | you're used to writing stuff that you kind of hand into a teacher, and they're the only ones who see it and
00:08:56.380 | You know it's kind of scary the first time you do it
00:09:00.100 | But then the first time somebody you know up votes your Kaggle kernel or adds a clap to your medium post
00:09:05.540 | You suddenly realize oh, I've actually written something that people like - that's pretty great
00:09:11.460 | So if you haven't tried yourself yet, I again invite you to
00:09:18.060 | Try writing something and if you're not sure you could write a summary of a lesson
00:09:22.540 | You could write a summary of like if there's something you found hard like maybe you found it hard to
00:09:27.660 | Fire up a GPU based AWS instance you eventually figured it out you could write down
00:09:32.820 | Just describe how you solve that problem or if one of your classmates
00:09:36.740 | Didn't understand something and you explained it to them
00:09:39.700 | Then you could like write down something saying like oh, there's this concept that some people have trouble understanding here
00:09:45.220 | So good way. I think of explaining it. There's all kinds of stuff you could you could do
00:09:49.860 | Okay, so let's go back to SGD
00:10:01.500 | We're going back through this notebook which
00:10:07.880 | Rachel put together basically taking us through
00:10:13.660 | Kind of SGD from scratch for the purpose of digit recognition
00:10:18.380 | and actually quite a lot of the stuff we look at today is
00:10:21.620 | going to be
00:10:24.300 | closely following
00:10:26.100 | Part of the computational linear algebra course
00:10:28.740 | Which you can both find the MOOCs on fast AI or at USF. It'll be an elective next year, right?
00:10:35.980 | So if you find some of this
00:10:38.580 | This stuff interesting and I hope you do then please consider signing up for the elective or checking out the video online
00:10:45.960 | So we're building
00:10:51.940 | neural networks
00:10:57.580 | We're starting with an assumption that we've downloaded the MNIST data
00:11:01.500 | We've normalized it by subtracting the mean and dividing by the standard deviation. Okay, so the data is
00:11:08.700 | It's slightly unusual in that although they represent images
00:11:12.760 | They were downloaded such that each image was a seven hundred and eighty four long
00:11:17.460 | Rank one tensor, so it's been flattened out
00:11:21.660 | Okay, and so for the purpose of drawing pictures of it we had to
00:11:26.540 | resize it
00:11:28.700 | to 28 by 28
00:11:30.700 | But the actual data we've got is not 28 by 28 - it's 784 long
00:11:37.520 | flattened out
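A rough sketch of that preprocessing step, assuming `x` and `x_valid` hold the flattened MNIST arrays already loaded as numpy arrays (the names here are illustrative, not necessarily the notebook's):

```python
import numpy as np

# hypothetical names: x and x_valid hold the flattened MNIST images, shape (n, 784)
mean, std = x.mean(), x.std()
x = (x - mean) / std
x_valid = (x_valid - mean) / std    # normalize the validation set with the training statistics

first_image = x[0].reshape(28, 28)  # reshape a flattened 784-long row to 28 x 28 only for drawing
```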
00:11:43.320 | The basic steps we're going to take here is to start out with training
00:11:48.440 | The world's simplest neural network basically a logistic regression, right?
00:11:54.000 | So no hidden layers and we're going to train it using a library
00:11:58.340 | Fast AI and we're going to build the network using a library, PyTorch
00:12:03.840 | Right, and then we're going to gradually get rid of all the libraries, right?
00:12:07.480 | So first of all, we'll get rid of the nn neural net library and pytorch and write that ourselves
00:12:13.760 | Then we'll get rid of the fast AI fit function and write that ourselves and then we'll get rid of the pytorch
00:12:22.620 | optimizer and write that ourselves and so by the end of
00:12:26.120 | This notebook will have written all the pieces ourselves
00:12:30.800 | The only thing that we'll end up relying on is the two key things that pytorch gives us
00:12:36.200 | Which is a the ability to write Python code and have it run on the GPU and?
00:12:40.320 | B the ability to write Python code and have it automatically differentiated for us
00:12:46.960 | Okay, so they're the two things we're not going to attempt to write ourselves because it's boring and pointless
00:12:52.160 | But everything else we'll try and write ourselves on top of those two things. Okay, so
00:12:58.720 | Our starting point is like not doing anything ourselves
00:13:03.680 | It's basically having it all done for us. And so pytorch has an nn library, which is where the neural net stuff lives
00:13:10.160 | you can create a
00:13:12.280 | multi-layer neural network by using the sequential function and then passing in a list of the layers that you want and
00:13:18.640 | We asked for a linear layer
00:13:20.840 | Followed by a softmax layer and that defines our logistic regression. Okay the input to our linear layer
00:13:28.380 | Is 28 by 28 as we just discussed the output is 10 because we want a probability
00:13:34.500 | For each of the numbers nought through nine for each of our images, okay
00:13:39.180 | Cuda sticks it on the GPU and then
00:13:50.180 | Fits a model okay, so we start out with a random set of weights and then fit uses gradient descent to make it better
00:13:58.820 | Had to tell the fit function
00:14:00.820 | What criterion to use, in other words what counts as better, and we told it to use negative log likelihood
00:14:07.720 | We'll learn about that in the next lesson what that is exactly
00:14:10.860 | We had to tell it what optimizer to use and we said please use optim.Adam - the details of that
00:14:18.000 | We won't cover in this course. We're going to build something simpler called SGD
00:14:23.180 | If you're interested in Adam, we just covered that in the deep learning course
00:14:27.060 | And what metrics do you want to print out? We decided to print out accuracy. Okay, so
00:14:32.740 | That was that and so if we do that
00:14:42.340 | So after we fit it we get an accuracy of generally somewhere around 91 92 percent
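A minimal sketch of that first version - the model and the pieces that get handed to fastai's fit, whose exact signature isn't reproduced here:

```python
import torch.nn as nn
import torch.optim as optim

# logistic regression: a single linear layer (784 -> 10) followed by (log) softmax
net = nn.Sequential(
    nn.Linear(28 * 28, 10),
    nn.LogSoftmax(dim=-1)
).cuda()                              # .cuda() sticks it on the GPU

crit = nn.NLLLoss()                   # negative log likelihood criterion
opt = optim.Adam(net.parameters())    # the Adam optimizer mentioned above
# net, the data, crit, opt and an accuracy metric are then passed to fastai's fit() to train
```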
00:14:47.300 | So what we're going to do from here is we're going to gradually
00:14:50.980 | We're going to repeat this exact same thing. So we're going to rebuild
00:14:56.300 | This model
00:14:57.820 | You know four or five times fitting it building it and fitting it with less and less libraries. Okay, so the second thing that we did
00:15:06.320 | last time
00:15:09.020 | Was to try to start to define the
00:15:11.780 | The module ourselves
00:15:15.580 | All right, so instead of saying the network is a sequential bunch of these layers
00:15:21.780 | Let's not use that library at all and try and define it ourselves from scratch
00:15:26.760 | So to do that we have to use
00:15:32.220 | Because that's how we build everything in pytorch and we have to create
00:15:36.720 | a class
00:15:39.060 | Which inherits from nn.Module - so nn.Module is a pytorch class
00:15:45.140 | That takes our class and turns it into a neural network module
00:15:51.500 | Which basically means anything that you inherit from nn.Module like this
00:15:55.940 | You can pretty much insert into a neural network as a layer or you can treat it as a neural network
00:16:02.020 | it's going to get all the stuff that it needs automatically to
00:16:05.060 | To work as a part of or a full neural network and we'll talk about exactly what that means
00:16:11.260 | Today and the next lesson, right?
00:16:15.940 | so we need to construct the object so that means we need to define the constructor, __init__, and
00:16:22.900 | Then importantly, this is a Python thing is if you inherit from some other object
00:16:29.720 | Then you have to create the thing you inherit from first
00:16:33.100 | so when you say super().__init__() that says construct the
00:16:38.500 | nn.Module piece of that first right if you don't do that then the nn.Module stuff
00:16:46.180 | Never gets a chance to actually get constructed. Now. So this is just like a standard
00:16:50.820 | Python
00:16:53.980 | Subclass constructor, okay, and if any of that's an unclear to you then you know
00:16:59.180 | This is where you definitely want to just grab a python intro to OO because this is
00:17:04.420 | That the standard approach, right? So inside our constructor
00:17:08.740 | We want to do the equivalent of
00:17:11.580 | nn.Linear. All right. So what nn.Linear is doing is it's taking our
00:17:19.060 | It's taking our 28 by 28
00:17:29.380 | Vector, so a 784 long vector, and that's going to be the input to a matrix multiplication
00:17:36.180 | so we now need to create a
00:17:38.820 | Something with
00:17:42.900 | 784 rows and
00:17:45.840 | 10 columns - that's 784 by 10
00:17:49.620 | Okay, so because the input to this is going to be a mini batch of size
00:17:58.260 | Actually, let's move this into a new window
00:18:01.740 | 784 by 10 and the input to this is going to be a mini batch of size 64
00:18:20.100 | Right, so we're going to do this matrix product
00:18:23.340 | Okay, so when we say in PyTorch nn.Linear
00:18:28.220 | It's going to construct
00:18:32.100 | This matrix for us, right? So since we're not using that we're doing things from scratch. We need to make it ourselves
00:18:38.900 | So to make it ourselves we can say
00:18:41.300 | generate normal random numbers
00:18:46.140 | This dimensionality which we passed in here, 784 by 10. Okay, so that gives us our
00:18:53.060 | randomly initialized
00:18:55.060 | matrix, okay
00:18:57.300 | Then we want to add on to this
00:19:01.660 | You know, we don't just want y equals ax we want y equals ax plus b
00:19:08.140 | Right, so we need to add on what we call in neural nets a bias vector
00:19:13.500 | So we create here a bias vector of length 10. Okay again randomly initialized
00:19:20.740 | And so now here are our two randomly initialized
00:19:24.620 | weight tensors
00:19:27.420 | So that's our constructor
00:19:30.980 | Now we need to define forward. Why do we need to define forward? This is a pytorch specific thing
00:19:36.900 | What's going to happen is this is when you create a module in
00:19:42.620 | Pytorch the object that you get back behaves as if it's a function
00:19:47.760 | You can call it with parentheses which we'll do it that in a moment. And so you need to somehow define
00:19:52.860 | What happens when you call it as if it's a function and the answer is pytorch calls a method called?
00:20:00.440 | Forward, okay - that's just the PyTorch kind of approach that they picked, right?
00:20:07.740 | So when it calls forward, we need to do our actual
00:20:12.260 | Calculation of the output of this module or layer. Okay. So here is the thing that actually gets calculated in our logistic regression
00:20:19.600 | So basically we take our
00:20:22.420 | Input X
00:20:26.020 | Which gets passed to forward that's basically how forward works it gets passed the mini batch
00:20:32.340 | and we matrix multiply it by
00:20:35.620 | The layer one weights which we defined up here and then we add on
00:20:42.740 | The layer one bias which we defined up here. Okay, and actually nowadays we can define this a little bit more elegantly
00:20:50.100 | Using the Python 3
00:20:54.700 | Matrix multiplication operator, which is the at sign
00:20:57.660 | And when you when you use that I think you kind of end up with
00:21:01.080 | Something that looks closer to what the mathematical notation looked like and so I find that nicer. Okay
00:21:07.860 | All right, so that's
00:21:11.580 | That's our linear layer
00:21:13.580 | In our logistic regression in our zero hidden layer neural net. So then the next thing we do to that is
00:21:19.740 | softmax
00:21:23.260 | Okay, so we get the output of this
00:21:26.840 | Matrix multiply
00:21:31.420 | Okay, who wants to tell me what the dimensionality of my output of this matrix multiply is
00:21:40.300 | Sorry
00:21:42.060 | 64 by 10. Thank you Karen
00:21:44.060 | And I should mention for those of you that weren't at deep learning class yesterday
00:21:50.580 | We actually looked at a really cool post from Karen who described how to
00:21:54.980 | Do structured data analysis with neural nets which has been like super popular?
00:22:00.380 | And a whole bunch of people have kind of said that they've read it and found it super interesting. So
00:22:05.620 | That was really exciting
00:22:10.020 | So we get this matrix of
00:22:12.020 | Outputs and we put this through a softmax
00:22:15.780 | And why do we put it through a softmax
00:22:19.740 | We put it through a softmax because in the end we want probabilities - you know, for every image
00:22:24.660 | We want a probability that this is 0 or a 1 or a 2 or a 3 or 4, right?
00:22:28.780 | So we want a bunch of probabilities that add up to 1 and where each of those probabilities is between 0 and 1
00:22:35.420 | so a softmax
00:22:38.860 | Does exactly that for us?
00:22:40.860 | So for example if we weren't picking out, you know, numbers from nought to nine
00:22:45.900 | But instead were picking out cat, dog, plane, fish or building, the output of that matrix multiply
00:22:50.500 | For one particular image might look like that. These are just some random numbers
00:22:54.620 | And to turn that into a softmax. I first go e to the power of each of those numbers. I
00:23:02.420 | Sum up those e to the power of
00:23:09.060 | Then I take each of those e to the power ofs and divide it by the sum and that's softmax
00:23:14.180 | That's the definition of softmax. So because it was e to the power of, it means it's always positive
00:23:19.260 | Because it was divided by the sum it means that it's always between 0 and 1 and it also means because it's divided
00:23:27.180 | By the sum that they always add up to 1
00:23:29.820 | So by applying this softmax
00:23:34.500 | Activation function so anytime we have a layer of outputs, which we call activations
00:23:40.140 | And then we apply some function some nonlinear function to that that maps one
00:23:45.980 | One scalar to one scalar like softmax does we call that an activation function, okay?
00:23:52.500 | So the softmax activation function takes our outputs and turns it into something which behaves like a probability, right?
00:24:00.260 | We don't strictly speaking need it. We could still try and train something which where the output directly is the probabilities
00:24:07.980 | All right, but by creating using this function
00:24:11.320 | That automatically makes them always behave like probabilities. It means there's less
00:24:16.420 | For the network to learn so it's going to learn better. All right, so generally speaking whenever we design
00:24:21.960 | an architecture
00:24:24.660 | We try to design it in a way where it's as easy as possible for it to create something of the form that we want
00:24:32.500 | So that's why we use
00:24:35.420 | softmax
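A small numeric sketch of that softmax calculation - the logits here are just made-up numbers standing in for the cat/dog/plane/fish/building example:

```python
import numpy as np

logits = np.array([1.0, 3.2, -2.1, 0.5, 0.9])   # made-up outputs of the linear layer
exps = np.exp(logits)                           # e to the power of each: always positive
probs = exps / exps.sum()                       # divide by the sum: each in (0, 1), they sum to 1

print(probs, probs.sum())                       # the biggest logit dominates; the total is 1.0
```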
00:24:37.580 | Right so that's the basic steps right we have our input which is a bunch of images
00:24:44.180 | Right which is here gets multiplied by a weight matrix. We actually also add on a bias
00:24:52.740 | Right to get a output of the linear function
00:24:56.460 | We put it through a nonlinear activation function in this case softmax and that gives us our probabilities
00:25:04.100 | So there there that all is
00:25:09.020 | PyTorch also tends to use the log
00:25:14.820 | Of softmax for reasons that don't particularly bother us now
00:25:19.940 | It's basically a numerical stability convenience. Okay, so to make this the same as our
00:25:26.020 | Version up here that you saw log softmax. I'm going to use log here as well. Okay, so
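Putting the constructor and forward together, a sketch of the hand-built module roughly as described above (the class and attribute names are illustrative, not necessarily the notebook's exact ones):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def get_weights(*dims):
    # random initialisation, wrapped in nn.Parameter so nn.Module can find and optimize it
    return nn.Parameter(torch.randn(*dims) / dims[0])

class LogReg(nn.Module):
    def __init__(self):
        super().__init__()                    # construct the nn.Module piece first
        self.l1_w = get_weights(28 * 28, 10)  # the 784 x 10 weight matrix
        self.l1_b = get_weights(10)           # the length-10 bias vector

    def forward(self, x):
        x = x @ self.l1_w + self.l1_b         # (64, 784) @ (784, 10) + (10,) -> (64, 10)
        return F.log_softmax(x, dim=-1)       # log of softmax, as just discussed
```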
00:25:34.420 | We can now instantiate this class that is create an object of this class
00:25:41.060 | So I have a question back for the probabilities where we were before
00:25:50.860 | If we were to have a photo with a cat and a dog together
00:25:54.820 | Would that change the way that that works or does it work in the same basic? Yeah, so that's a great question
00:26:00.580 | so if you had a photo with a cat and a dog together and
00:26:03.660 | You wanted it to spit out both cat and dog
00:26:07.100 | This would be a very poor choice. So softmax is specifically the activation function we use for
00:26:14.540 | Categorical predictions where we only ever want to predict one of those things, right?
00:26:19.460 | And so part of the reason why is that, as you can see, because we're using e to the power of, right, the slightly bigger numbers
00:26:27.120 | Create much bigger numbers, as a result of which we generally have just one or two things large and everything else is pretty small
00:26:34.340 | All right
00:26:34.860 | so if I like
00:26:35.820 | Recalculate these random numbers a few times you'll see like it tends to be a bunch of zeros and one or two high numbers
00:26:41.980 | right, so it's really designed to
00:26:44.420 | Try to kind of make it easy to predict like this one thing. There's the thing I want if you're doing multi
00:26:53.700 | Label prediction so I want to find all the things in this image rather than using softmax
00:26:59.380 | We would instead use sigmoid, right?
00:27:01.260 | So sigmoid, recall, would cause each of these to be between zero and one, but they would no longer add to one
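For comparison, a tiny sketch of the sigmoid alternative for multi-label prediction, reusing the made-up logits from the softmax sketch above:

```python
sigmoid = 1 / (1 + np.exp(-logits))   # each output squashed to (0, 1) independently
# unlike softmax these don't have to add up to 1, so several labels can be "on" at once
```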
00:27:08.620 | Good question and like a lot of these
00:27:11.480 | Details about like best practices are things that we cover in the deep learning course
00:27:18.140 | And we won't cover heaps of them here in the machine learning course. We're more interested in the mechanics, I guess
00:27:24.100 | But we'll try and do them if they're quick
00:27:28.300 | All right, so now that we've got that we can instantiate an object of that class and of course
00:27:35.420 | We want to copy it over to the GPU so we can do computations over there
00:27:38.940 | Again, we need an optimizer where we're talking about what this is shortly, but you'll see here
00:27:44.580 | We've called a function on our class called parameters
00:27:47.760 | But we never defined a method called parameters
00:27:51.340 | And the reason that is going to work is because it actually was defined for us inside nn.module
00:27:56.420 | and so nn.module actually automatically goes through the attributes we've created and finds
00:28:04.060 | Anything that basically we said is a parameter
00:28:07.860 | So the way you say something is a parameter is you wrap it in nn.Parameter
00:28:11.260 | So this is just the way that you tell PyTorch
00:28:13.620 | This is something that I want to optimize
00:28:16.180 | Okay, so when we created the weight matrix we just wrapped it with nn.Parameter
00:28:21.420 | It's exactly the same as a regular
00:28:23.780 | PyTorch variable which we'll learn about shortly
00:28:26.620 | It's just a little flag to say hey, you should optimize this, and so when you call net2.parameters()
00:28:33.940 | On our net2 object we created, it goes through everything that we created in the constructor
00:28:38.900 | Checks to see if any of them are of type parameter
00:28:41.880 | And if so it sets all of those as being things that we want to train with the optimizer
00:28:46.620 | And we'll be implementing the optimizer from scratch later
00:28:50.020 | Okay, so having done that
00:28:53.040 | We can fit and we should get basically the same answer as before 91 ish
00:29:03.620 | So that looks good
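In code, instantiating the hand-built module and handing its parameters to an optimizer looks roughly like this (continuing the hypothetical `LogReg` sketch above):

```python
import torch.optim as optim

net2 = LogReg().cuda()               # instantiate the hand-built module and move it to the GPU
opt = optim.Adam(net2.parameters())  # parameters() is inherited from nn.Module: it returns
                                     # everything we wrapped in nn.Parameter in the constructor
```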
00:29:05.620 | All right
00:29:09.500 | What have we actually built here?
00:29:11.500 | Well what we've actually built as I said is something that can behave like a regular function
00:29:17.340 | All right, so I want to show you how we can actually call this as a function
00:29:21.660 | So to be able to call it as a function
00:29:23.660 | We need to be able to pass data to it to be able to pass data to it
00:29:28.140 | I'm going to need to grab a mini batch of MNIST images
00:29:32.700 | Okay, so we used
00:29:34.700 | for convenience the
00:29:37.220 | ImageClassifierData.from_arrays method from fastai
00:29:40.340 | And what that does is it creates a pytorch data loader for us a pytorch data loader is
00:29:47.060 | Something that grabs a few images and sticks them into a mini batch and makes them available
00:29:52.340 | And you can basically say give me another mini batch give me another mini batch give me another mini batch and so
00:30:02.420 | Python we call these things generators
00:30:05.060 | Generators are things where you can basically say I want another I want another I want another right
00:30:10.020 | There's this kind of very close connection between
00:30:15.900 | Iterators and generators - I'm not going to worry about the difference between them right now, but you'll see basically, in order
00:30:23.140 | To actually get hold of something which we can say please give me another of, in
00:30:32.020 | Order to grab something that we can use to generate mini batches
00:30:36.540 | We have to take our data loader and so you can ask for the training data loader from our model data object
00:30:43.180 | You'll see there's a bunch of different data loaders. You can ask for you can ask for the test data loader the train data loader
00:30:49.420 | validation loader
00:30:51.940 | Augmented images data loader and so forth so we're going to grab the training data loader
00:30:57.220 | That was created for us. This is a standard PyTorch data loader - well, slightly optimized by us, but same idea
00:31:03.300 | And you can then say this is a standard Python
00:31:07.020 | Thing we can say turn that into an iterator turn that into something where we can grab another one at a time from and so
00:31:14.860 | Once you've done that
00:31:16.540 | We've now got something that we can iterate through you can use the standard Python
00:31:21.580 | Next function to grab one more thing from that generator, okay?
00:31:26.820 | So that's returning the X's from a mini batch and the Y's
00:31:33.100 | From our mini batch. The other way that you can use
00:31:36.440 | Generators and iterators in Python is with a for loop. I could also have said like for you know X mini batch comma Y mini batch in
00:31:45.180 | data loader
00:31:47.420 | And then like do something right so when you do that. It's actually behind the scenes
00:31:51.940 | It's basically syntactic sugar for calling next lots of times. Okay, so this is all standard
00:31:57.920 | Python stuff
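A sketch of grabbing one mini-batch that way - assuming `md` is the fastai model data object built earlier, with `trn_dl` as its training data loader, as described above:

```python
dl = iter(md.trn_dl)    # turn the training data loader into an iterator
xmb, ymb = next(dl)     # one mini-batch: xmb is 64 x 784 images, ymb is the 64 labels

# the for-loop form is syntactic sugar for calling next() repeatedly:
# for xmb, ymb in md.trn_dl:
#     ...
```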
00:32:00.700 | So that returns a
00:32:03.100 | Tensor of size 64 by 784 as we would expect right the
00:32:14.980 | Fastai library we used defaults to a mini batch size of 64. That's why it's that long
00:32:20.340 | These are all of the background zero pixels, but they're not actually zero in this case. Why aren't they zero?
00:32:27.180 | Yeah, they're normalized, exactly right - so we subtracted the mean and divided by the standard deviation, right
00:32:33.420 | So there there it is so now what we want to do is we want to
00:32:42.380 | Pass that into our our logistic regression. So what we might do is we'll go
00:32:48.860 | Variable XMB equals variable. Okay, I can take my X mini batch I
00:32:55.580 | can move it on to the GPU because remember my
00:32:59.160 | net2 object is on the GPU, so our data for it also has to be on the GPU
00:33:04.980 | And then the second thing I do is I have to wrap it in variable. So what does variable do?
00:33:11.140 | This is how we get for free automatic differentiation
00:33:15.000 | Pytorch can automatically differentiate
00:33:19.040 | You know pretty much anything right any tensor?
00:33:22.480 | But to do so takes memory and time
00:33:25.380 | So it's not going to always keep track - to do automatic differentiation
00:33:30.820 | It has to keep track of exactly how something was calculated. We added these things together
00:33:35.340 | We multiplied it by that we then took the sign blah blah blah, right?
00:33:39.420 | you have to know all of the steps because then to do the automatic differentiation it has to
00:33:45.060 | Take the derivative of each step using the chain rule multiply them all together
00:33:49.380 | All right, so that's slow and memory intensive
00:33:52.140 | So we have to opt in to saying like okay this particular thing we're going to be taking the derivative of later
00:33:57.560 | So please keep track of all of those operations for us
00:34:00.300 | And so the way we opt in is by wrapping a tensor in a variable, right? So
00:34:08.100 | That's how we do it and
00:34:10.100 | You'll see that it looks almost exactly like a tensor, but it now says variable containing
00:34:16.460 | This tensor right so in Pytorch a variable has exactly
00:34:21.860 | Identical API to a tensor or actually more specifically a superset of the API of a tensor
00:34:27.860 | Anything we can do to a tensor we can do to a variable
00:34:30.740 | But it's going to keep track of exactly what we did so we can later on take the derivative
00:34:37.700 | Okay, so we can now pass that
00:34:40.260 | Into our net2 object - remember I said you can treat this as if it's a function
00:34:51.980 | Right so notice we're not calling dot forward
00:34:56.140 | We're just treating it as a function and
00:34:59.380 | Then remember we took the log, so to undo that I'm taking the exp and that will give me my probabilities
00:35:07.460 | Okay, so there's my probabilities, and it's got
00:35:14.020 | Return something of size 64 by 10 so for each image in the mini batch
00:35:23.020 | We've got 10 probabilities, and you'll see most probabilities are pretty close to 0
00:35:29.580 | Right and a few of them are quite a bit bigger
00:35:33.420 | Which is exactly what we hope, right - it's like okay, it's not a zero, it's not a one
00:35:39.300 | It's not a two. It is a three. It's not a four. It's not a five and so forth
00:35:42.740 | So maybe this would be a bit easier to read if we just grab like the first three of them
00:35:47.140 | Okay, so it's like ten to the negative three, ten to the negative eight, negative two, negative five, negative five, negative four, okay?
00:35:55.100 | And then suddenly here's one which is ten to the negative one, right?
00:35:57.620 | So you can kind of see what it's trying to do here
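A sketch of those steps, using the pre-0.4 PyTorch Variable API that the lesson is built on (variable names continue the hypothetical ones from the earlier sketches):

```python
from torch.autograd import Variable

vxmb = Variable(xmb.cuda())   # move the mini-batch to the GPU and opt in to autograd tracking
preds = net2(vxmb).exp()      # call the module as if it were a function; exp() undoes the log

preds[:3]                     # probabilities for the first three images in the mini-batch
```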
00:36:02.980 | I mean we could call like net2.forward and it'll do exactly the same thing
00:36:10.380 | Right, but that's not how
00:36:13.060 | All of the PyTorch mechanics actually work
00:36:16.620 | It's actually they actually call it as if it's a function right and so this is actually a really important idea
00:36:22.580 | like because it means that
00:36:24.940 | When we define our own architectures or whatever anywhere that you would put in a function
00:36:30.580 | You could put in a layer; anywhere you put in a layer you can put in a neural net; anywhere
00:36:34.900 | You put in a neural net you can put in a function, because as far as PyTorch is concerned
00:36:39.020 | They're all just things that it's going to call just like as if they're functions
00:36:43.060 | So they're all like interchangeable, and this is really important because that's how we create
00:36:48.020 | Really good neural nets is by mixing and matching lots of pieces and putting them all together
00:36:53.660 | Let me give an example
00:36:56.420 | Here is my
00:37:00.220 | Logistic regression which got
00:37:04.540 | 91 and a bit percent accuracy
00:37:08.980 | I'm now going to turn it
00:37:11.380 | Into a neural network with one hidden layer all right, and the way I'm going to do that is I'm going to create
00:37:17.100 | one more layer
00:37:19.860 | I'm going to change this so it spits out a hundred rather than ten
00:37:24.420 | Which means this one input is going to be a hundred rather than ten
00:37:30.020 | Now this as it is can't possibly make things any better at all yet
00:37:35.340 | Why is this definitely not going to be better than what I had before?
00:37:39.020 | Yeah, can somebody pass the yeah?
00:37:42.540 | But you've got a combination of two linear layers, which is just the same as one
00:37:47.620 | Exactly right so we've got two linear layers, which is just a linear layer right so to make things interesting
00:37:55.700 | I'm going to replace all of the negatives from the first layer with zeros
00:38:00.880 | Because that's a nonlinear transformation, and so that nonlinear transformation is called a rectified linear unit
00:38:07.820 | Okay, so nn dot sequential simply is going to call each of these layers in turn for each mini batch right so do a linear layer
00:38:18.340 | Replace all of the negatives with zero do another linear layer and do a softmax. This is now a neural network
00:38:26.020 | with one hidden layer and
00:38:28.020 | So let's try training that instead
00:38:30.460 | Okay accuracy is now going up to 96%
00:38:37.180 | Okay, so the this is the idea is that the basic techniques. We're learning in this lesson
00:38:43.420 | Like become powerful at the point where you start stacking them together, okay?
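A sketch of that one-hidden-layer version, following the same nn.Sequential pattern as before (100 hidden activations is an arbitrary choice, as discussed just below):

```python
net = nn.Sequential(
    nn.Linear(28 * 28, 100),   # the first linear layer now spits out 100 activations
    nn.ReLU(),                 # replace all the negatives with zero - the nonlinearity
    nn.Linear(100, 10),        # the second linear layer takes those 100 down to 10 classes
    nn.LogSoftmax(dim=-1)
).cuda()
```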
00:38:49.540 | Can somebody pass the green box there and then there yes, Daniel?
00:38:54.660 | Why did you pick a hundred? No reason it was like easier to type an extra zero?
00:38:59.940 | Like this question of like how many
00:39:04.220 | Activations should I have at a neural network layer is kind of part of the skill of a deep learning practitioner
00:39:09.780 | We cover it in the deep learning course not in this course
00:39:13.000 | When adding that additional I guess
00:39:18.100 | transformation
00:39:20.660 | Additional layer additional layer this one here is called a nonlinear layer or an activation function
00:39:26.180 | Activation function or activation function
00:39:30.060 | Does it matter that like if you would have done for example like two softmaxes?
00:39:37.780 | Or is that something you cannot do like yeah?
00:39:40.180 | You can absolutely use a softmax there
00:39:42.140 | But it's probably not going to give you what you want and the reason why is that a softmax?
00:39:48.220 | Tends to push most of its activations to zero and an activation just be clear like I've had a lot of questions in deep
00:39:55.460 | Learning course about like what's an activation an activation is the value that is calculated in a layer, right?
00:40:02.740 | So this is an activation
00:40:04.740 | Right it's not a weight a weight is not an activation
00:40:08.700 | It's the value that you calculate from a layer
00:40:11.340 | So softmax will tend to make most of its activations pretty close to zero
00:40:15.700 | and that's the opposite of what you want you genuinely want your activations to be kind of as
00:40:20.860 | Rich and diverse and and used as possible so nothing to stop you doing it, but it probably won't work very well
00:40:27.300 | Basically
00:40:30.980 | pretty much all of your layers will be followed by
00:40:34.300 | Nonlinear activation functions, and that will nearly always be ReLU
00:40:39.780 | except for the last layer
00:40:44.700 | Could you, when doing multiple layers - so let's say, could you go two or three layers deep?
00:40:51.740 | Do you want to switch up these activation layers? No, that's a great question. So if I wanted to go deeper I
00:40:59.100 | would just do
00:41:01.940 | That - okay, that's now a two hidden layer network
00:41:05.860 | So I think I'd heard you said that there are a couple of different
00:41:13.780 | Activation functions like that rectified linear unit. What are some examples and
00:41:18.940 | Why would you use?
00:41:22.020 | Each yeah great question
00:41:24.180 | So basically like as you add like more
00:41:31.080 | linear layers you kind of got your
00:41:33.980 | Input comes in and you put it through a linear layer and then a nonlinear layer linear layer nonlinear layer
00:41:41.180 | linear linear layer and then the final nonlinear layer
00:41:50.900 | The final nonlinear layer as we've discussed, you know, if it's a
00:41:56.200 | multi-category
00:41:58.860 | Classification, but you only ever pick one of them you would use softmax
00:42:03.580 | If it's a binary classification or a multi
00:42:08.060 | Label classification where you're predicting multiple things you would use sigmoid
00:42:12.100 | If it's a regression
00:42:15.500 | You would often have nothing at all
00:42:18.660 | Right, although we learned in last night's deal course where sometimes you can use sigmoid there as well
00:42:23.300 | So they're basically the options main options for the final layer
00:42:28.500 | for the
00:42:31.940 | Hidden layers you pretty much always use ReLU
00:42:41.580 | Okay, but there is a another
00:42:50.380 | Another one you can pick which is kind of interesting which is called
00:42:56.660 | Leaky ReLU and it looks like this
00:43:07.100 | Basically if it's above zero, it's y equals x and if it's below zero, it's like y equals 0.1 x
00:43:13.540 | that's very similar to ReLU, but it's
00:43:16.660 | Rather than being equal to 0 below zero, it's like something close to that
00:43:22.100 | So they're the main two
00:43:25.260 | ReLU and Leaky ReLU
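A minimal sketch of those two shapes, with the 0.1 slope below zero matching the description above (PyTorch's built-in versions are nn.ReLU() and nn.LeakyReLU(0.1)):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)            # y = x above zero, y = 0 below zero

def leaky_relu(x, a=0.1):
    return np.where(x > 0, x, a * x)   # y = x above zero, y = 0.1 * x below zero
```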
00:43:33.260 | There are various others, but they're kind of like things that just look very close to that
00:43:38.060 | So for example, there's something called ELU, which is quite popular
00:43:41.440 | But like, you know, the details don't matter too much honestly - ELU is something that looks like this
00:43:47.700 | But it's slightly more curvy in the middle
00:43:49.700 | And it's kind of like it's not generally something that you so much pick based on the data set it's more like
00:43:59.380 | Over time we just find better activation functions so two or three years ago
00:44:04.300 | Everybody used ReLU, you know a year ago pretty much everybody used Leaky ReLU today
00:44:09.380 | I guess probably most people starting to move towards ELU
00:44:11.940 | But honestly the choice of activation function doesn't matter
00:44:15.460 | terribly much actually
00:44:18.460 | And you know people have actually showed that you can use like our pretty arbitrary nonlinear activation functions like even a sine wave
00:44:26.180 | It still works
00:44:30.820 | So although what we're going to do today is showing how to create
00:44:40.620 | This network with no hidden layers
00:44:46.220 | To turn it into
00:44:49.860 | that network
00:44:51.620 | Which is 96% ish accurate will be trivial, right, and in fact is something you should
00:44:57.900 | Probably try and do during the week right is to create that version
00:45:10.580 | So now that we've got something where we can take our network pass in our variable and get back some
00:45:18.740 | predictions
00:45:22.580 | That's basically all that happened when we called fit. So we're going to see how how that that approach can be used to create this stochastic gradient
00:45:30.780 | descent
00:45:32.300 | One thing to note is that to turn the
00:45:35.860 | Predicted probabilities into a prediction - like, which digit is it? - we would need to use argmax
00:45:43.540 | Unfortunately pytorch doesn't call it argmax
00:45:49.220 | Instead pytorch just calls it max and max returns
00:45:53.540 | two things
00:45:56.260 | Returns the actual max across this axis so this is across the columns right and the second thing it returns is the index
00:46:05.020 | Of that maximum right so so the equivalent of argmax is to call max and then get the first
00:46:12.900 | Indexed thing okay, so there's our predictions right if this was in numpy. We would instead use NP argmax
00:46:22.060 | All right
00:46:25.500 | So here are the predictions from our hand created logistic regression and in this case
00:46:31.580 | Looks like we got all but one correct
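In code, that argmax-via-max step looks roughly like this (continuing the hypothetical `preds` from the earlier sketch):

```python
values, indices = preds.max(1)   # max over dimension 1 (the columns): the values and their indices
pred_digits = preds.max(1)[1]    # so [1] is the equivalent of argmax - the predicted digit per image

# numpy equivalent:
# np.argmax(preds_array, axis=1)
```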
00:46:37.300 | So the next thing we're going to try and get rid of in terms of using libraries is to try to avoid using the
00:46:43.300 | Matrix multiplication operator and instead we're going to try and write that by hand
00:46:47.260 | So this next part we're going to learn about something which kind of seems
00:47:03.860 | It kind of it's going to seem like a minor little kind of programming idea, but actually it's going to turn out
00:47:14.620 | That at least in my opinion. It's the most important
00:47:18.500 | Programming concept that we'll teach in this course, and it's possibly the most important programming
00:47:24.040 | kind of concept in all of
00:47:26.620 | All the things you need to build machine learning algorithms, and it's the idea of
00:47:32.980 | broadcasting
00:47:34.340 | And the idea I will show by example
00:47:37.300 | If we create an array of 10, 6, -4 and an array of 2, 8, 7 and then add the two together
00:47:45.100 | It adds each of the components of those two arrays in turn we call that element wise
00:47:54.060 | So in other words we didn't have to write a loop right back in the old days
00:47:58.740 | We would have to have looped through each one and added them and then concatenated them together
00:48:02.780 | We don't have to do that today. It happens for us automatically so in numpy
00:48:07.980 | We automatically get element wise operations
00:48:11.620 | We can do the same thing with Pytorch
00:48:20.420 | So in fastai we just add a little capital T to turn something into a Pytorch tensor right and if we add those together
00:48:31.380 | Exactly the same thing right so element wise operations are pretty standard in these kinds of libraries
00:48:37.700 | It's interesting not just because we don't have to write the for loop
00:48:44.100 | Right, but it's actually much more interesting because of the performance things that are happening here
00:48:49.380 | The first is if we were doing a for loop
00:48:52.020 | right
00:48:54.740 | If we were doing a for loop
00:49:01.180 | That would happen in Python
00:49:03.180 | Right even when you use Pytorch it still does the for loop in Python it has no way of like
00:49:10.140 | Optimizing the for loop and so a for loop in Python is something like
00:49:15.660 | 10,000 times slower than in C
00:49:18.740 | So that's your first problem. I can't remember. It's like 1,000 or 10,000 the second problem then is that
00:49:29.260 | You don't just want it to be optimized in C
00:49:31.500 | But you want C to take advantage of the thing that all of your CPUs do, something called SIMD
00:49:37.700 | Single instruction multiple data, which is: your CPU is capable of taking
00:49:43.500 | eight things at a time
00:49:46.260 | Right in a vector and adding them up to another
00:49:49.860 | Vector with eight things in in a single CPU instruction
00:49:55.060 | All right, so if you can take advantage of SIMD you're immediately eight times faster
00:49:59.260 | It depends on how big the data type is it might be four might be eight
00:50:02.300 | The other thing that you've got in your computer is you've got multiple processors
00:50:07.260 | Multiple cores
00:50:11.300 | So you've probably got - like, if this is happening inside one core, you've probably got about four of those
00:50:19.300 | Okay, so if you're using SIMD you're eight times faster if you can use multiple cores, then you're 32 times faster
00:50:25.740 | And then if you're doing that in C
00:50:28.180 | You might be something like 32 times a thousand times faster, right, and so the nice thing is that when we do that
00:50:34.860 | It's taking advantage of all of these things
00:50:38.340 | Okay, better still if you do it
00:50:42.900 | in pytorch and your data was created with
00:50:48.300 | .Cuda to stick it on the GPU
00:50:52.060 | Then your GPU can do about 10,000 things at a time
00:50:57.380 | Right so that'll be another hundred times faster than C
00:51:01.440 | All right, so this is critical
00:51:04.500 | To getting good performance is you have to learn how to write
00:51:10.060 | loopless code
00:51:12.500 | By taking advantage of these element wise
00:51:15.900 | Operations and like it's not it's a lot more than just plus I
00:51:19.040 | Could also use less than right and that's going to return 0 1 1 or if we go back to numpy
00:51:28.860 | False true true
00:51:35.660 | And so you can kind of use this to do all kinds of things without looping so for example
00:51:42.080 | I could now multiply that by a and here are all of the values of a
00:51:47.460 | As long as they're less than B or we could take the mean
00:51:53.440 | This is the percentage of values in a that are less than B
00:51:59.460 | All right, so like there's a lot of stuff you can do with this simple idea
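A quick sketch of those element-wise operations in numpy, using the same example values as above:

```python
import numpy as np

a = np.array([10, 6, -4])
b = np.array([2, 8, 7])

a + b            # array([12, 14,  3])           element-wise, no Python loop
a < b            # array([False,  True,  True])
(a < b) * a      # array([ 0,  6, -4])           the values of a where a < b, zero elsewhere
(a < b).mean()   # 0.666... - the fraction of values in a that are less than the matching b
```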
00:52:03.660 | But to take it further
00:52:06.260 | Right to take it further than just this element wise operation
00:52:10.020 | We're going to have to go the next step to something called broadcasting
00:52:13.220 | So let's take a five minute break come back at 217 and we'll talk about broadcasting
00:52:26.900 | Broadcasting
00:52:29.980 | This is the definition from the numpy documentation of
00:52:38.020 | Broadcasting and I'm going to come back to it in a moment rather than reading it now
00:52:41.780 | But let's start by looking an example of broadcasting
00:52:47.500 | so a is a
00:52:50.860 | Array
00:52:53.820 | With one dimension also known as a rank one tensor
00:52:57.180 | also known as a vector
00:52:59.940 | We can say a greater than zero
00:53:03.860 | so here we have
00:53:08.780 | rank one tensor
00:53:10.780 | Right and a rank zero tensor
00:53:15.100 | Right a rank zero tensor is also called a scalar
00:53:19.860 | rank one tensor is also called a vector and
00:53:23.900 | We've got an operation between the two
00:53:27.860 | All right now you've probably done it a thousand times without even noticing. That's kind of weird right that you've got these things of different
00:53:36.060 | Ranks and different sizes, so what is it actually doing right?
00:53:39.820 | But what it's actually doing is it's taking that scalar and copying it here here here
00:53:46.140 | Right and then it's actually going element wise
00:53:50.060 | 10 is greater than 0
00:53:53.780 | 6 is greater than 0, minus 4 is greater than 0, and giving us back the three answers
00:54:01.260 | Right and that's called broadcasting broadcasting means
00:54:05.260 | Copying one or more axes of my tensor
00:54:11.060 | To allow it to be the same shape as the other tensor
00:54:16.640 | It doesn't really copy it though
00:54:20.580 | What it actually does is it stores this kind of internal indicator that says pretend that this is a
00:54:30.500 | vector of three zeros
00:54:32.500 | But it actually just like what rather than kind of going to the next row or going to the next scalar it goes back
00:54:38.540 | To where it came from if you're interested in learning about this specifically
00:54:42.620 | It's they set the stride on that axis to be zero. That's a minor advanced concept for those who are curious
00:54:50.300 | So we could do a
00:54:55.460 | +1 right is going to broadcast the scalar 1
00:54:59.200 | To be 1 1 1 and then do element wise addition
00:55:03.000 | We could do the same with a matrix right here's our matrix 2 times the matrix is going to broadcast 2
00:55:10.180 | to be 2 2 2 2 2 2 2 2 2 2 and then do element wise
00:55:16.380 | multiplication
00:55:18.500 | All right, so that's our kind of most simple version of
00:55:24.100 | broadcasting
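In code, those simplest broadcasting cases look like this (m here is the 3 x 3 matrix of 1 to 9 used in the example that follows):

```python
a = np.array([10, 6, -4])
m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

a > 0    # array([ True,  True, False])   the scalar 0 acts as if it were [0, 0, 0]
a + 1    # array([11,  7, -3])            the scalar 1 is broadcast the same way
2 * m    # every element of the matrix doubled
```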
00:55:26.100 | So here's a slightly more complex version of broadcasting
00:55:30.460 | Here's an array called C. All right, so this is a rank 1 tensor and
00:55:36.180 | Here's our matrix M from before
00:55:39.020 | Our rank 2 tensor we can add M plus C
00:55:43.600 | All right, so what's going on here?
00:55:49.820 | 1 2 3 4 5 6 7 8 9
00:55:55.300 | That's M
00:55:58.700 | All right, and then C
00:56:06.940 | You can see that what it's done is to add that to each row
00:56:11.020 | right eleven twenty two thirty three
00:56:15.140 | 14 25 36 and so we can kind of figure it seems to have done the same kind of idea as broadcasting a scalar
00:56:22.480 | It's like made copies of it
00:56:24.700 | And then it treats those as
00:56:32.060 | If it's a rank 2 matrix and now we can do element wise addition
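The same M plus C example in code:

```python
c = np.array([10, 20, 30])

m + c
# c is broadcast across each row of m:
# array([[11, 22, 33],
#        [14, 25, 36],
#        [17, 28, 39]])
```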
00:56:42.340 | That makes sense now that's yes, can can you pass that Devon over there? Thank you
00:56:48.140 | So as it's like by looking at this example it like
00:56:54.220 | Copies it down
00:56:56.500 | making new rows
00:56:58.420 | So how would we want to do it if we wanted to get new columns? I'm so glad you asked
00:57:10.740 | Instead
00:57:12.740 | We would do this
00:57:15.420 | 10 20 30
00:57:20.380 | All right, and then copy that 10 20 30
00:57:24.900 | 10 20 30 and
00:57:28.300 | Now treat that as our matrix
00:57:31.380 | So to get numpy to do that we need to not pass in a
00:57:36.140 | vector
00:57:38.700 | but to pass in a
00:57:40.700 | Matrix with one column a rank 2 tensor, right?
00:57:47.420 | so basically it turns out that
00:57:50.860 | numpy is going to think of a
00:57:54.380 | Rank 1 tensor for these purposes as if it was a rank 2 tensor which represents a row
00:58:02.140 | Right. So in other words that it is 1 by 3, right? So we want to create a tensor, which is 3 by 1
00:58:10.140 | There's a couple of ways to do that
00:58:13.980 | One is to use NP expand dims
00:58:17.180 | And if you then pass in this argument, it says please insert a length 1 axis
00:58:24.260 | here, please so in our case we want to turn it into a
00:58:29.100 | 3 by 1 so if we said expand_dims(c, 1)
00:58:33.020 | Okay, so if we say expand_dims(c, 1) it changes the shape to 3 comma 1 so if we look at what that looks like
00:58:46.620 | That looks like a column. Okay, so if we now go
00:58:54.340 | plus M
00:58:55.820 | You can see it's doing exactly what we hoped it would do
00:58:58.980 | Right, which is to add 10 20 30 to the column
00:59:03.620 | 10 20 30 to the column 10 20 30 to the column
00:59:10.220 | now because the
00:59:12.220 | Location of a unit axis turns out to be so important
00:59:20.580 | It's really helpful to kind of experiment with creating these extra unit axes and know how to do it easily and
00:59:27.840 | NP dot expand dims
00:59:30.060 | Isn't in my opinion the easiest way to do this the easiest way?
00:59:33.420 | The easiest way is to index into the tensor with a special
00:59:40.340 | Index none and what none does is it creates a new axis in that location of
00:59:49.980 | Length 1 right so this is
00:59:53.660 | Going to add a new axis at the start of length 1
00:59:58.460 | This is going to add a new axis at the end of length 1 or
01:00:07.840 | Why not do both?
01:00:11.580 | Right so if you think about it like a tensor
01:00:15.200 | Which has like three?
01:00:18.340 | Things in it could be of any rank you like right you can just add
01:00:22.860 | Unit axes all over the place and so that way we can kind of
01:00:27.540 | Decide how we want our broadcasting to work
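A sketch of those ways of adding a unit axis, and what the column version does when added to m (continuing the arrays defined above):

```python
np.expand_dims(c, 1).shape   # (3, 1) - insert a length-1 axis at position 1
c[:, None].shape             # (3, 1) - the same thing with None indexing
c[None].shape                # (1, 3) - a unit axis at the front instead
c[None, :, None].shape       # (1, 3, 1) - or both at once

c[:, None] + m
# now 10 / 20 / 30 go down the columns:
# array([[11, 12, 13],
#        [24, 25, 26],
#        [37, 38, 39]])
```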
01:00:32.220 | So there's a pretty convenient
01:00:35.380 | Thing in numpy called broadcast 2 and what that does is it takes our vector and
01:00:45.100 | broadcasts it to that shape and shows us what that would look like
01:00:49.020 | Right so if you're ever like unsure of what's going on in some broadcasting operation
01:00:55.060 | You can say broadcast_to, and so for example here, rather than 3 comma 3 we could say m dot shape,
01:01:01.980 | right, and see exactly what's going to happen — so that's what C is going to look like before we add it to m,
01:01:09.620 | right so if we said
01:01:11.980 | Turn it into a column
01:01:16.300 | That's what that looks like
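A sketch of checking a broadcast with np.broadcast_to, using the same m and c as before:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
c = np.array([10, 20, 30])

# what c looks like when broadcast to m's shape (treated as a row)
print(np.broadcast_to(c, m.shape))
# [[10 20 30]
#  [10 20 30]
#  [10 20 30]]

# the column version
print(np.broadcast_to(c[:, None], m.shape))
# [[10 10 10]
#  [20 20 20]
#  [30 30 30]]
```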
01:01:21.460 | Make sense, so that's kind of like the intuitive
01:01:26.500 | definition of
01:01:29.340 | Broadcasting and so now hopefully we can go back to that
01:01:31.940 | numpy documentation and understand
01:01:34.900 | What it means right?
01:01:38.140 | Broadcasting describes how numpy is going to treat arrays of different shapes when we do some operation
01:01:42.740 | Right, the smaller array is broadcast across the larger array — by smaller array they mean lower rank
01:01:50.220 | tensor, basically,
01:01:52.860 | broadcast across the, like, higher rank tensor, so that they have compatible shapes. It vectorizes array operations
01:01:59.540 | So vectorizing generally means like using SIMD and stuff like that so that multiple things happen at the same time
01:02:06.820 | All the looping occurs in C
01:02:08.820 | But it doesn't actually make needless copies of data it kind of just acts as if it had
01:02:15.140 | So there's our definition
01:02:18.060 | now in deep learning you very often deal with tensors of rank four or more and
01:02:24.620 | you very often combine them with tensors of rank one or two and
01:02:29.060 | Trying to just rely on intuition to do that correctly is nearly impossible
01:02:34.140 | So you really need to know the rules?
01:02:36.420 | So here are the rules
01:02:42.300 | Okay, here's m dot shape, here's C dot shape, so the rules are that we're going to compare
01:02:50.180 | The shapes of our two tensors element wise we're going to look at one at a time
01:02:54.740 | And we're going to start at the end right so look at the trailing dimensions and
01:02:59.180 | then go
01:03:01.460 | Towards the front okay, and so two dimensions are going to be compatible
01:03:06.220 | When one of these two things is true, right? So let's check — are our M and C compatible? M is
01:03:18.500 | 3 right so we're going to start at the end trailing dimensions first and check are they compatible they're compatible if the dimensions are equal
01:03:26.620 | Okay, so these ones are equal so they're compatible
01:03:29.860 | right
01:03:31.180 | Let's go to the next one. Oh, oh, we're missing
01:03:34.140 | Right C is missing something. So what happens if something is missing as we insert a one?
01:03:41.100 | Okay, that's the rule right and so let's now check are these compatible one of them is one. Yes, they're compatible
01:03:49.140 | Okay, so now you can see why it is that numpy treats
01:03:55.260 | the one dimensional array as
01:03:59.460 | If it is a rank 2 tensor
01:04:02.060 | Which is representing a row it's because we're basically inserting a one at the front
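If you want NumPy to apply these rules for you, np.broadcast_shapes (available in reasonably recent NumPy versions) is a handy sanity check — a small sketch:

```python
import numpy as np

# (3, 3) vs (3,): trailing 3s match; the missing dimension becomes 1
print(np.broadcast_shapes((3, 3), (3,)))         # (3, 3)

# the image example that follows: (256, 256, 3) vs (3,)
print(np.broadcast_shapes((256, 256, 3), (3,)))  # (256, 256, 3)
```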
01:04:08.540 | Okay, so that's the rule so for example
01:04:12.620 | This is something that you very commonly have to do which is you start with like an
01:04:20.780 | image they're like 256 pixels by 256 pixels by three channels and
01:04:27.740 | You want to subtract?
01:04:29.740 | the mean of each channel
01:04:31.740 | All right, so you've got 256 by 256 by 3 and you want to subtract something of length 3, right?
01:04:37.980 | So yeah, you can do that
01:04:40.020 | Absolutely because 3 and 3 are compatible because they're the same
01:04:43.980 | All right 256 and empty is compatible. It's going to insert a 1
01:04:48.340 | 256 and empty is compatible. It's going to insert a 1
01:04:51.700 | Okay, so you're going to end up with
01:04:55.740 | this is going to be broadcast over all of this axis and then that whole thing will be broadcast over this axis and
01:05:03.860 | so we'll end up with a
01:05:05.860 | 256 by 256 by 3
01:05:08.460 | Effective
01:05:12.060 | Tensor here, right?
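A minimal sketch of that per-channel mean subtraction, using a made-up random image in place of a real one:

```python
import numpy as np

img = np.random.rand(256, 256, 3)       # fake image: height x width x channels
channel_means = img.mean(axis=(0, 1))   # shape (3,) -- one mean per channel

# (256, 256, 3) minus (3,): the (3,) is treated as (1, 1, 3) and broadcast
centred = img - channel_means
print(centred.shape)                    # (256, 256, 3)
print(centred.mean(axis=(0, 1)))        # roughly [0. 0. 0.]
```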
01:05:14.060 | so interestingly like
01:05:17.300 | very few people in the data science or machine learning communities
01:05:22.300 | Understand broadcasting and the vast majority of the time for example when I see people doing pre-processing for computer vision
01:05:28.820 | Like subtracting the mean they always write loops
01:05:32.760 | over the channels right and I kind of think like
01:05:36.780 | It's it's like so handy to not have to do that and it's often so much faster to not have to do that
01:05:44.220 | So if you get good at broadcasting
01:05:46.220 | You'll have this like super useful skill that very very few people have
01:05:52.060 | And, like, it's an ancient skill, you know — it goes all the way back to
01:05:57.940 | the days of APL
01:06:00.980 | so APL was from the late 50s — it stands for A Programming Language — and
01:06:07.680 | Kenneth Iverson
01:06:11.100 | Wrote this paper called
01:06:13.100 | Notation as a Tool of Thought
01:06:15.940 | in which he proposed a new math notation and
01:06:21.100 | He proposed that if we use this new math notation
01:06:24.700 | It gives us new tools for thought and allows us to think things we couldn't before and one of his ideas was
01:06:32.460 | broadcasting not as a
01:06:35.660 | computer programming tool, but as a piece of math notation and
01:06:40.340 | so he ended up implementing
01:06:43.180 | this notation as a tool for thought as a programming language called APL and
01:06:49.260 | His son has gone on to further develop that
01:06:54.100 | Into a piece of software called J
01:06:57.180 | Which is basically what you get when you put 60 years of very smart people working on this idea
01:07:03.980 | And with this programming language you can express
01:07:07.820 | Very complex mathematical ideas often just with a line of code or two
01:07:13.380 | And so I mean it's great that we have J
01:07:16.940 | But it's even greater that these ideas have found their ways into the languages
01:07:21.020 | We all use like in Python the NumPy and PyTorch libraries, right? These are not just little
01:07:26.740 | Kind of niche ideas. It's like fundamental ways to think about math and to do programming
01:07:33.020 | Like let me give an example of like this kind of notation as a tool for thought
01:07:38.220 | let's
01:07:41.220 | Let's look here. We've got C, right?
01:07:46.380 | Here we've got C
01:07:48.380 | None, right — notice this now has two square brackets, right? So this is kind of like a one-row
01:07:55.940 | rank 2 tensor
01:07:59.060 | Here it is a little column
01:08:04.140 | So what is
01:08:10.620 | Just round ones
01:08:19.780 | Okay, what's that going to do? Have a think about it
01:08:34.580 | Anybody want to have a go? You can even talk through your thinking. Okay, can we pass that over there? Thank you
01:08:40.580 | Kind of outer product. Yes, absolutely. So take us through your thinking. How's that gonna work?
01:08:47.780 | So the diagonal elements can be directly visualized as the squares —
01:08:54.620 | 10 cross 10, 20 cross 20 and 30 cross 30 —
01:09:00.780 | And if you multiply the first row with this column, you can get the first row of the matrix
01:09:07.900 | So finally you'll get a 3 cross 3 matrix. Yeah, and
01:09:12.500 | So to think of this in terms of like those broadcasting rules, we're basically taking
01:09:18.220 | This column, right, which is of shape
01:09:21.700 | 3 comma 1, right, and this kind of row —
01:09:28.780 | sorry, I meant dimension 3 comma 1 — and this row, which is of dimension 1 comma 3,
01:09:34.340 | Right and so to make these compatible with our broadcasting rules
01:09:39.220 | Right this one here has to be duplicated
01:09:42.380 | Three times because it needs to match this
01:09:45.140 | Okay, and now this one's going to have to be duplicated three times to match this
01:09:57.700 | Okay, and so now I've got two
01:10:05.100 | Matrices to do an element wise product of and so as you say
01:10:12.820 | There is our outer product right now. The interesting thing here is
01:10:17.900 | That suddenly now that this is not a special mathematical case
01:10:23.220 | But just a specific version of the general idea of broadcasting we can do like an outer plus
01:10:30.980 | Or we can do an outer greater than
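A sketch of the outer product and those "outer" variants, with the same c = [10, 20, 30]:

```python
import numpy as np

c = np.array([10, 20, 30])

print(c[:, None] * c[None, :])   # outer product
# [[100 200 300]
#  [200 400 600]
#  [300 600 900]]

print(c[:, None] + c[None, :])   # an "outer plus"
# [[20 30 40]
#  [30 40 50]
#  [40 50 60]]

print(c[:, None] > c[None, :])   # an "outer greater than"
# [[False False False]
#  [ True False False]
#  [ True  True False]]
```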
01:10:35.060 | Right, or whatever, right? So suddenly we've kind of got this concept
01:10:42.340 | That we can use to build
01:10:44.940 | New ideas and then we can start to experiment with those new ideas. And so, you know interestingly
01:10:52.100 | NumPy actually
01:10:54.100 | Uses this sometimes
01:10:56.580 | For example if you want to create a grid
01:11:02.100 | This is how NumPy does it, right — actually, sorry, let me show you this way
01:11:11.660 | If you want to create a grid, this is how NumPy does it — it actually returns
01:11:16.820 | 0 1 2 3 4 and
01:11:21.620 | 0 1 2 3 4
01:11:23.620 | 1 is a column 1 is a row
01:11:26.060 | So we could say like okay, that's x grid comma y grid
01:11:30.340 | And now you could do something like
01:11:36.220 | Well, I mean we could obviously go
01:11:42.580 | Like that right and so suddenly we've expanded that out
01:11:49.620 | Into a grid right and so
01:11:59.220 | Yeah, it's kind of interesting how like some of these like simple little concepts
01:12:05.580 | Kind of get built on and built on and built on, so if you use something like APL or J, it's this whole
01:12:11.660 | Environment of layers and layers and layers of this we don't have such a deep environment in NumPy
01:12:18.260 | But you know you can certainly see these ideas of like broadcasting coming through
01:12:22.900 | In simple things like how do we create a grid in in NumPy?
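I believe the grid-building call being shown is np.ogrid (an assumption on my part); it returns exactly that column-and-row pair, and broadcasting expands them into a grid:

```python
import numpy as np

xgrid, ygrid = np.ogrid[0:5, 0:5]
print(xgrid.shape, ygrid.shape)   # (5, 1) (1, 5) -- one column, one row

# broadcasting the pair together expands them out into a full 5x5 grid
print(xgrid + ygrid)
# [[0 1 2 3 4]
#  [1 2 3 4 5]
#  [2 3 4 5 6]
#  [3 4 5 6 7]
#  [4 5 6 7 8]]
```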
01:12:27.220 | So yeah, so that's that's broadcasting and so what we can do with this now is
01:12:34.860 | Use this to implement matrix multiplication ourselves
01:12:43.980 | Now why would we want to do that well obviously we don't right matrix multiplication has already been handled
01:12:50.620 | Perfectly nicely for us by our libraries
01:12:54.200 | but very often you'll find in
01:12:57.620 | All kinds of areas in in machine learning and particularly in deep learning that there'll be
01:13:04.820 | particular types of linear
01:13:08.460 | Function that you want to do that aren't quite
01:13:13.300 | Done for you all right so for example. There's like whole areas
01:13:17.700 | called like
01:13:20.620 | tensor regression and
01:13:22.620 | Tensor decomposition
01:13:26.980 | Which are really being developed a lot at the moment and they're kind of talking about like how do we take like
01:13:38.380 | Higher rank tensors and kind of turn them into combinations of rows
01:13:43.260 | Columns and faces and it turns out that when you can kind of do this you can basically like
01:13:50.260 | Deal with really high dimensional data structures with not much memory and not much computation time. For example, there's a really terrific library
01:13:58.100 | called tensorly
01:14:00.460 | Which does a whole lot of this kind of stuff?
01:14:02.460 | for you
01:14:05.660 | So it's a really really important area it covers like all of deep learning lots of modern machine learning in general
01:14:12.460 | And so even though you're not going to need to define matrix multiplication, you're very likely to want to define some other
01:14:19.660 | Slightly different tensor product you know
01:14:22.820 | So it's really useful to kind of understand how to do that
01:14:26.700 | So let's go back and look at our
01:14:29.660 | matrix and our
01:14:34.260 | 2d array and 1d array rank 2 tensor rank 1 tensor and
01:14:38.020 | Remember we can do a matrix multiplication
01:14:40.860 | Using the at sign or the old way NP dot matmul. Okay?
01:14:46.500 | And so what that's actually doing when we do that is we're basically saying
01:14:51.540 | Okay, 1 times 10 plus
01:14:56.420 | 2 times 20 plus 3 times 30 is
01:15:02.820 | 140 right and so we do that for each
01:15:05.380 | row and
01:15:07.700 | We can go through and do the same thing for the next one and for the next one to get our result, right?
01:15:12.660 | You could do that in torch as well
01:15:17.020 | We could make this a little shorter
01:15:32.020 | Okay, same thing
01:15:34.020 | Okay, but that is not matrix multiplication. What's that?
01:15:45.180 | Okay, element wise specifically we've got a matrix and a vector so
01:15:53.900 | Broadcasting okay good. So we've got this is element wise with broadcasting but notice
01:16:01.180 | The numbers it's created 10 40 90 are the exact three numbers that I needed to
01:16:07.500 | Calculate when I did that first
01:16:10.420 | Piece of my matrix multiplication. So in other words if we sum this
01:16:15.180 | Over the columns, which is axis equals 1
01:16:21.120 | We get our matrix vector product
01:16:25.540 | Okay, so we can kind of do
01:16:31.700 | this stuff without special help from our library
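A sketch of that matrix–vector product done with nothing but element-wise multiplication, broadcasting and a sum, checked against the built-in @:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
c = np.array([10, 20, 30])

print(m * c)                 # element-wise with broadcasting
# [[ 10  40  90]
#  [ 40 100 180]
#  [ 70 160 270]]

print((m * c).sum(axis=1))   # sum over the columns: [140 320 500]
print(m @ c)                 # the built-in matmul gives the same answer
```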
01:16:35.140 | So now
01:16:38.580 | Let's expand this out to a matrix matrix product
01:16:42.480 | So a matrix matrix product
01:16:45.700 | Looks like this. This is this great site called matrix multiplication dot XYZ
01:16:52.420 | And it shows us this is what happens when we multiply two matrices
01:16:57.320 | Okay, that's what matrix multiplication is
01:17:06.400 | operationally speaking so in other words what we just did there
01:17:11.360 | Was we first of all took the first column
01:17:16.440 | with the first row to get this one and
01:17:20.680 | Then we took the second column with the first row
01:17:26.120 | To get that one. All right, so we're basically doing
01:17:29.040 | The thing we just did the matrix vector product. We're just doing it twice
01:17:33.760 | right once
01:17:36.480 | With this column and once with this column, and then we can concatenate the two together
01:17:44.520 | Okay, so we can now go ahead and do that
01:17:52.760 | Like so: M times the first column, dot sum;
01:17:57.640 | M times the second column, dot sum. And so there are the two columns of our matrix multiplication
01:18:09.240 | So I didn't want to like make our code too messy
01:18:12.960 | So I'm not going to actually like use that but like we have it there now if we want to we don't need to use
01:18:20.280 | Torch or NumPy matrix multiplication anymore. We've got our own that we can use, using nothing but
01:18:26.360 | element wise operations, broadcasting, and summing
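And a sketch of the matrix–matrix version, built one column at a time; n here is just an illustrative second matrix:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
n = np.array([[10, 40],
              [20, 50],
              [30, 60]])   # an illustrative 3x2 matrix

col0 = (m * n[:, 0]).sum(axis=1)         # first column of the product
col1 = (m * n[:, 1]).sum(axis=1)         # second column of the product
print(np.stack([col0, col1], axis=1))    # [[140 320] [320 770] [500 1220]]
print(m @ n)                             # the built-in matmul agrees
```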
01:18:36.000 | So this is our
01:18:39.960 | Logistic regression from scratch class again. I just copied it here
01:18:45.960 | Here is where we instantiate the object copy it to the GPU we create an optimizer
01:18:50.160 | Which we'll learn about in a moment and we call fit. Okay, so the goal is to now repeat this without needing to call fit
01:18:58.600 | So to do that
01:19:03.760 | We're going to need a loop
01:19:09.320 | Which grabs a mini batch of data at a time and with each mini batch of data?
01:19:15.600 | We need to pass it to the optimizer and say please try to come up with a slightly better set of predictions
01:19:22.040 | for this mini batch
01:19:24.240 | So as we learned in order to grab a mini batch of the training set at a time
01:19:28.560 | We have to ask the model data object for the training data loader
01:19:31.840 | We have to wrap it in iter() to create an iterator, a generator
01:19:36.920 | And so that gives us our data loader. Okay, so pytorch calls this a data loader
01:19:44.040 | We actually wrote our own fastai data loader, but it's basically the same idea
01:19:50.280 | So the next thing we do is we grab the X and the Y tensor
01:19:56.480 | The next one from our data loader, okay?
01:19:59.520 | Wrap it in a variable to say I need to be able to take the derivative of
01:20:05.080 | The calculations using this because if I can't take the derivative
01:20:08.640 | Then I can't get the gradients and I can't update the weights
01:20:12.400 | all right, and I need to put it on the GPU because my
01:20:15.880 | module is on the GPU and
01:20:18.760 | So we can now take that variable and pass it to
01:20:23.340 | The object that we instantiated our logistic regression
01:20:28.440 | Remember our module we can use it as if it's a function because that's how pytorch works
01:20:32.840 | And that gives us a set of predictions, as we've seen before
01:20:41.760 | So now we can check the loss and the loss we defined as being a
01:20:46.520 | negative log likelihood loss
01:20:49.440 | Object and we're going to learn about how that's calculated in the next lesson and for now think of it
01:20:55.200 | Just like root mean squared error, but for classification problems
01:20:58.320 | So we can call that also just like a function so you can kind of see this
01:21:03.840 | It's a very general idea in PyTorch: you kind of treat everything, ideally, like it's a function
01:21:09.480 | So in this case we have a loss a negative log likelihood loss object. We treat it like a function we pass in our predictions and
01:21:16.560 | We pass in our actuals, right, and again the actuals need to be turned into a variable and put on the GPU
01:21:23.360 | Because the loss is specifically the thing that we actually want to take the derivative of right so that gives us our loss
01:21:30.920 | And there it is. That's our loss 2.43
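As a rough sketch of those steps in current PyTorch — the lecture's on-screen code uses the older fastai/Variable-era API and a GPU, so here everything is a made-up stand-in (random data instead of MNIST, illustrative names), just to show the grab-a-batch, predict, compute-loss pattern:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Made-up stand-ins: random "images" instead of MNIST, CPU instead of GPU.
x = torch.randn(1024, 28 * 28)
y = torch.randint(0, 10, (1024,))
train_loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

net = nn.Sequential(nn.Linear(28 * 28, 10), nn.LogSoftmax(dim=1))
loss_fn = nn.NLLLoss()                 # negative log likelihood loss

dl = iter(train_loader)                # wrap the data loader in an iterator
xt, yt = next(dl)                      # grab one mini-batch
preds = net(xt)                        # call the module as if it's a function
loss = loss_fn(preds, yt)              # compare predictions with the actuals
print(loss.item())
```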
01:21:36.200 | So it's a variable and because it's a variable it knows how it was calculated
01:21:41.320 | All right, it knows it was calculated with this loss function. It knows that the predictions were calculated with this
01:21:47.980 | Network it knows that this network consisted of these operations and so we can get the gradient
01:21:55.880 | automatically, all right
01:21:58.800 | So to get the gradient
01:22:01.800 | We call L dot backward remember L is the thing that contains our loss
01:22:06.560 | All right, so L dot backward is something which is added to anything that's a variable
01:22:13.120 | You can call dot backward and that says please calculate the gradients
01:22:16.440 | Okay, and so that calculates the gradients and stores them inside —
01:22:24.120 | basically, for each of the
01:22:28.120 | weights, each of the parameters that was used to calculate that loss, it's now stored a
01:22:33.960 | dot grad — we'll see it later — it's basically stored the gradient. Right, so we can then call
01:22:40.320 | Optimizer dot step and we're going to do this step manually shortly
01:22:44.520 | And that's the bit that says please make the weights a little bit better right and so what optimizer dot step is doing
01:22:53.440 | Is it saying like okay if you had like a really simple function?
01:22:57.480 | Like this
01:23:04.560 | Right then what the optimizer does is it says okay. Let's pick a random starting point
01:23:11.580 | Right and let's calculate the value of the loss right so here's our parameter
01:23:17.400 | Here's our loss right let's take the derivative
01:23:21.920 | All right the derivative tells us which way is down, so it tells us we need to go that direction
01:23:28.440 | Okay, and we take a small step and
01:23:31.920 | Then we take the derivative again, and we take a small step derivative again
01:23:37.400 | Take a small step do it again. Take a small step and
01:23:40.440 | Till eventually we're taking such small steps that we stop okay, so that's what?
01:23:45.560 | gradient descent does okay
01:23:50.080 | How big a step is a small step?
01:23:52.440 | Well, we basically take the derivative here — so let's say the derivative there is, like, eight —
01:23:57.300 | All right, and we multiply it by a small number like say 0.01 and that tells us what step size to take
01:24:06.020 | this small number here is called the learning rate and
01:24:10.040 | It's the most important hyperparameter to set, right? If you pick too small a learning rate,
01:24:17.800 | then your steps down are going to be, like, tiny, and it's going to take you forever.
01:24:23.180 | Too big a learning rate and you'll jump too far,
01:24:27.960 | right, and then you'll jump too far again, and you diverge rather than converge, okay
01:24:35.680 | We're not going to talk about how to pick a learning rate in this class
01:24:39.640 | But in the deep learning class we actually show you a specific technique that very reliably picks a very good learning rate
01:24:48.200 | So that's basically what's happening right so we calculate the derivatives
01:24:53.040 | And we call the optimizer that does a step in other words update the weights based on the
01:24:58.800 | Gradients and the learning rate
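What an SGD step does to each parameter is essentially this — a toy sketch with a single parameter and a made-up loss:

```python
import torch

lr = 0.01                                   # the learning rate
w = torch.tensor(3.0, requires_grad=True)   # one parameter

loss = (w - 1.0) ** 2                       # toy loss, minimum at w = 1
loss.backward()                             # derivative is 2 * (w - 1) = 4
with torch.no_grad():
    w -= lr * w.grad                        # step downhill: 3.0 -> 2.96
print(w.item())
```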
01:25:01.280 | We should hopefully find that after doing that we have a better loss than we did before
01:25:07.800 | So I just reran this and got a loss here of four point one six and
01:25:12.080 | after one step
01:25:14.600 | It's now four point oh three, okay, so it worked the way
01:25:17.760 | We hoped it would based on this mini batch it updated all of the weights in our
01:25:22.640 | Network to be a little better than they were as a result of which our loss went down, okay?
01:25:27.780 | So let's turn that into a training loop
01:25:31.480 | All right, we're going to go through a hundred steps
01:25:35.200 | Grab one more mini batch of data from the data loader
01:25:39.560 | Calculate our predictions from our network calculate our loss from the predictions and the actuals
01:25:45.360 | Every 10 goes we'll print out the accuracy — just take the mean of whether they're equal or not
01:25:51.840 | One PyTorch-specific thing: you have to zero the gradients. Basically, you can have networks where, like, you've got lots of different loss
01:26:01.300 | functions, where you might want to add all of the gradients together,
01:26:03.980 | right, so you have to tell PyTorch, like, when to set the gradients back to zero
01:26:09.400 | Right so this just says set all the gradients to zero
01:26:12.120 | Calculate the gradients — that's dot backward — and then take one step of the optimizer
01:26:18.180 | So update the weights using the gradients and the learning rate and so once we run it. You can see the loss goes down and
01:26:25.040 | The accuracy goes up
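Put together, the loop just described looks roughly like this, continuing the illustrative stand-ins from the earlier sketch (the zero_grad / backward / step pattern is the point, not the particular model or data):

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Made-up stand-ins again: random data and a one-layer "logistic regression".
x = torch.randn(1024, 28 * 28)
y = torch.randint(0, 10, (1024,))
train_loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)

net = nn.Sequential(nn.Linear(28 * 28, 10), nn.LogSoftmax(dim=1))
loss_fn = nn.NLLLoss()
optimizer = optim.SGD(net.parameters(), lr=0.1)

dl = iter(train_loader)
for t in range(100):
    try:
        xt, yt = next(dl)              # grab the next mini-batch
    except StopIteration:
        dl = iter(train_loader)        # start another pass over the data
        xt, yt = next(dl)
    preds = net(xt)                    # predictions from the network
    loss = loss_fn(preds, yt)          # loss from predictions and actuals
    if t % 10 == 0:                    # every 10 goes, print the accuracy
        acc = (preds.argmax(dim=1) == yt).float().mean()
        print(f"step {t}: loss {loss.item():.3f}, accuracy {acc.item():.3f}")
    optimizer.zero_grad()              # set all the gradients to zero
    loss.backward()                    # calculate the gradients
    optimizer.step()                   # update the weights
```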
01:26:34.160 | That's the basic approach, and so next lesson we'll see
01:26:40.320 | more of how that all works. All right —
01:26:42.320 | as for looking in detail,
01:26:44.360 | We're not going to look inside here as I say we're going to basically take the calculation of the derivatives as
01:26:50.680 | As a given right but basically
01:26:53.640 | What's happening there?
01:26:56.240 | In any kind of deep network you have kind of like a function
01:27:01.080 | That's like you know a linear function
01:27:03.480 | And then you pass the output of that into another function that might be like a ReLU
01:27:08.920 | And you pass the output of that into another function that might be another linear layer
01:27:14.320 | And you pass that into another function that might be another ReLU and so forth right so these deep networks are just
01:27:22.320 | Functions of functions of functions, so you could write them mathematically like that right and so
01:27:30.200 | All backprop does is it says: let's just simplify this down to the two-function version.
01:27:34.560 | We can say, okay, u equals f of x,
01:27:40.880 | right, and so therefore the derivative of g of f of x we can calculate with the chain rule as being
01:27:50.160 | g'(u)
01:27:54.080 | times f'(x)
01:27:56.160 | Right and so you can see we can do the same thing for the functions of the functions of the functions, and so when you apply a
01:28:02.880 | Function to a function of a function you can take the derivative just by taking the product of the derivatives of each of those
01:28:09.880 | Layers okay, and in neural networks. We call this back propagation
01:28:15.040 | Okay, so when you hear back propagation it just means use the chain rule to calculate the derivatives
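A tiny sketch checking the chain rule against autograd, with f(x) = sin(x) and g(u) = u squared (my own choice of functions, just for illustration):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
u = torch.sin(x)     # u = f(x)
y = u ** 2           # y = g(u) = g(f(x))
y.backward()

# chain rule by hand: dy/dx = g'(u) * f'(x) = 2*sin(x) * cos(x)
by_hand = 2 * torch.sin(x) * torch.cos(x)
print(x.grad.item(), by_hand.item())   # the two agree
```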
01:28:21.600 | And so when you see a neural network defined
01:28:25.480 | Like here right
01:28:31.560 | Like if it's defined sequentially literally all this means is
01:28:37.500 | apply this function to the input
01:28:40.840 | Apply this function to that apply this function to that apply this function to that right so this is just defining a
01:28:49.840 | composition of a function to a function to a function to a function to a function
01:28:53.040 | okay, and so
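For instance, a sequential definition in that spirit (the layer sizes here are just illustrative):

```python
from torch import nn

net = nn.Sequential(
    nn.Linear(28 * 28, 100),   # a linear function applied to the input
    nn.ReLU(),                 # then a ReLU applied to that
    nn.Linear(100, 10),        # then another linear layer applied to that
    nn.LogSoftmax(dim=1),      # then another function applied to that
)
```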
01:28:56.000 | Yeah, so although we're not going to bother with calculating the gradients ourselves
01:28:59.740 | You can now see why it can do it right as long as it has internally
01:29:03.480 | You know, it knows, like, what's the derivative of 'to the power of', what's the derivative of sine,
01:29:10.440 | What's the derivative of plus and so forth then our Python code?
01:29:14.000 | In here, it's just combining those things together
01:29:18.920 | So it just needs to know how to compose them together with the chain rule and away it goes, okay?
01:29:26.140 | Okay, so I think we can leave it there for now and yeah and in the next class
01:29:38.240 | We'll go and we'll see how to
01:29:40.240 | Write our own optimizer, and then we'll have solved MNIST from scratch ourselves. See you then