Lesson 5: Deep Learning 2019 - Back propagation; Accelerated SGD; Neural net from scratch
Chapters
0:00
67:04 interpreting embeddings
76:14 add the sum of the square of the parameters
103:04 change to a different optimizer
105:28 calculate the gradient
120:13 plotting the learning rate per batch
127:39 using the correct activation function in your last layer
132:20 avoid overfitting
00:00:00.000 |
Welcome everybody to lesson five and so we have officially peaked and everything is downhill from here as of halfway through the last lesson. 00:00:15.000 |
We started with computer vision because it's the most mature kind of out of the box ready to use deep learning application. 00:00:30.040 |
It's something which if you're not using deep learning you won't be getting good results. 00:00:34.960 |
So the difference hopefully between not doing lesson one versus doing lesson one is you've gained a new capability you didn't have before, and you kind of get to see a lot of the tradecraft of training an effective neural net. 00:00:52.200 |
And so then we moved into NLP, because text is kind of another one which you really can't do really well without deep learning, generally speaking, and it's just got to the point where it works pretty well now. 00:01:10.600 |
In fact The New York Times just featured an article about the latest advances in deep learning for text yesterday, and talked quite a lot about the work that we've done in that area along with OpenAI and Google and the Allen Institute for Artificial Intelligence. 00:01:32.360 |
We've kind of finished our application journey with tabular and collaborative filtering, partly because tabular and collaborative filtering are things that you can still do pretty well without deep learning, so it's not such a big step. 00:01:49.160 |
It's not a whole new thing that you could do that you couldn't do before, and also because 00:01:56.040 |
we're going to try to get to a point where we understand pretty much every line of code in the implementations of these things, and the implementations of those things are much less intricate than vision and NLP. So as we come down this other side of the journey, which is 00:02:15.000 |
all the stuff we've just done, how does it actually work? By starting where we just ended, which is starting with collaborative filtering and then tabular data, we're going to be able to see what all those lines of code do by the end of today's lesson. That's our goal. 00:02:33.720 |
Particularly this lesson you should not expect to come away knowing how to solve you know how to do applications you couldn't do before but instead you should have a better understanding of how we've actually been solving the applications we've seen so far. 00:02:49.480 |
Particularly we're going to understand a lot more about regularization which is how we go about managing over versus under fitting and so hopefully you can use some of the tools from this lesson to go back to your previous projects and. 00:03:03.080 |
Get a little bit more performance or handle models where previously maybe you felt like your data was not enough or maybe you're under fitting and so forth. 00:03:13.560 |
And it's also going to lay the groundwork for understanding convolutional neural networks and recurrent neural networks that will do deep dives into in the next two lessons and as we do that we're also going to look at some new applications some new vision and NLP applications. 00:03:44.200 |
So this picture we were looking at: what are the 00:03:55.160 |
various layers? And the first thing we pointed out is that there are only and exactly two types of layer: there are layers that contain parameters and there are layers that contain 00:04:14.000 |
activations. Parameters are the things that your model learns; they're the things that you use gradient descent to update: 00:04:25.400 |
parameters -= learning_rate * parameters.grad. 00:04:33.240 |
That's our basic step, that's what we do, okay. And those parameters are used by multiplying them by input activations, doing a matrix product. 00:04:46.520 |
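As a concrete illustration of that update, here is a minimal sketch in plain PyTorch; the tiny linear model, the data and the learning rate are just placeholders.

```python
import torch
from torch import nn

model = nn.Linear(10, 1)                 # any model with parameters
lr = 0.1                                  # illustrative learning rate
x, y = torch.randn(64, 10), torch.randn(64, 1)

loss = ((model(x) - y) ** 2).mean()       # mean squared error
loss.backward()                           # back propagation fills p.grad for every parameter

with torch.no_grad():                     # the update itself is not tracked by autograd
    for p in model.parameters():
        p -= lr * p.grad                  # parameters -= learning_rate * parameters.grad
        p.grad.zero_()                    # reset gradients for the next step
```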
So the yellow things are our weight matrices, or weight tensors more generally, but 00:04:54.840 |
that's close enough. So we take some input activations, or some layer activations, and we multiply them by a weight matrix to get a bunch of activations. So activations are numbers, numbers that are calculated. 00:05:10.440 |
Okay, so I find in our study group I keep getting questions about where does that number come from and I always answer it in the same way you tell me is it a parameter or is it an activation because it's one of those two things. 00:05:23.800 |
Okay, that's where numbers come from. I guess inputs are kind of a special activation, so they're not calculated, they're just 00:05:31.280 |
there; so maybe that's a special case. So a number is either an input, a parameter or an activation. 00:05:37.160 |
Activations don't only come out of matrix multiplications, they also come out of activation functions, and the most important thing to remember about an activation function is that it's an element-wise function. So it's a function that is applied to each element of 00:05:53.760 |
the input activations in turn and creates one activation for each input element. So if it starts with a 20-long vector it creates a 20-long vector, by looking at each one of those in turn, doing one thing to it and spitting out the answer: an element-wise function. 00:06:12.520 |
Relu is the main one we've looked at and honestly it doesn't too much matter which you pick so we don't spend much time talking about activation functions because if you just use relu you'll get a pretty good answer pretty much all the time. 00:06:27.920 |
And so then we learnt that this combination of matrix multiplications followed by ReLUs, stacked together, has this amazing mathematical property called the universal approximation theorem, which is: if you have big enough weight matrices and enough of them, it can approximate any arbitrarily complex mathematical function to any arbitrarily high level of accuracy. 00:06:53.200 |
Assuming that you can train the parameters both in terms of time and data availability and so forth so that's the bit which I find particularly more advanced computer scientists get really confused about is they're always asking like where's the next bit what's the trick how does it work but that's it you know you just do those things and. 00:07:21.880 |
You pass back the gradients and you update the weights with the learning rate, and that's it. So that piece where we take the loss function between the actual targets and the output of the final layer, so the final activations: we calculate the gradients with respect to all of these yellow things, and then we update those yellow things by subtracting learning rate times the gradient. 00:07:52.160 |
That process of calculating those gradients and then subtracting like that is called back propagation. 00:08:02.280 |
So when you hear the term back propagation, 00:08:12.640 |
it's one of these terms that neural network folks love to use; it sounds very impressive, but you can replace it in your head with weights -= weights.grad * learning_rate, or parameters I should say rather than weights, to be a bit more general. 00:08:30.240 |
Okay so that's what we covered last week and then I mentioned last week that we're going to cover a couple more things. 00:08:41.400 |
I'm going to come back to these ones, cross entropy and softmax, later today. Let's talk about fine-tuning now. So what happens when we take a ResNet-34 and we 00:08:53.720 |
do transfer learning, what's actually going on? So the first thing to notice is the ResNet-34 that we grab from ImageNet has a very specific weight matrix at the end: it's a weight matrix that has 1000 columns. Why is that? Because the problem they ask you to solve in the ImageNet competition is: please 00:09:16.360 |
figure out which one of these 1000 image categories this picture is. So that's why they need 1000 things here, because in ImageNet the target vector is of length 1000: you've got to pick the probability for each one of those 1000 things. 00:09:36.400 |
So there's a couple of reasons this weight matrix is no good to you when you're doing transfer learning. The first is that you probably don't have 1000 categories; I was trying to do teddy bears, black bears or brown bears, 00:09:47.560 |
so I don't want 1000 categories. And the second is, even if I did have exactly 1000 categories, they're not the same 1000 categories that are in ImageNet, so basically this whole weight matrix is a waste of time for me. 00:10:01.400 |
So what do we do? We throw it away. So when you call create_cnn in fastai, it deletes that, and what does it do instead? It puts in two new weight matrices in there for you. 00:10:24.000 |
There are some defaults as to what size this first one is. 00:10:32.480 |
The size there is as big as you need it to be. From the data bunch that you passed your learner, we know how many activations you need: if you're doing classification it's however many classes you have, if you're doing regression it's however many numbers you're trying to predict in the regression problem. And remember, if your data bunch is called data, that'll be called 00:10:57.760 |
data.c. So we'll add for you this weight matrix of size data.c by however much was in the previous layer. 00:11:05.000 |
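To make the shapes concrete, here is a rough sketch of the two new matrices; the sizes here are made up, and fastai's real head also includes pooling, batchnorm and dropout, which are omitted.

```python
from torch import nn

n_body_out = 512    # hypothetical: activations coming out of the pretrained body
n_hidden   = 256    # hypothetical: default size of the first new weight matrix
n_classes  = 3      # this is data.c, e.g. teddy / black / brown bears

new_head = nn.Sequential(
    nn.Linear(n_body_out, n_hidden),   # first new weight matrix
    nn.ReLU(),
    nn.Linear(n_hidden, n_classes),    # second new weight matrix: n_hidden x data.c
)
```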
OK so now we need to train those because initially these weight matrices are full of random numbers. 00:11:16.680 |
Because new weight matrices are always full of random numbers if they're new, and these ones are new: we've just grabbed them and thrown them in there, so we need to train them. 00:11:26.520 |
But the other layers are not new; the other layers are good at something. And what are they good at? Well, let's remember that Zeiler and Fergus paper. 00:11:41.960 |
Here are examples of visualizations of some filters, some weight matrices, in the first layer, and some examples of things that they found. So in the first layer, one part of the weight matrix was good at finding diagonal edges in this direction, and then in layer two one of the filters was good at finding corners in the top left. 00:12:06.120 |
And then in layer three one of the filters was good at finding repeating patterns, another one was good at finding round orange things, another one was good at finding kind of furry or floral textures. So as we go up they're becoming more sophisticated but also more specific. Layer four, I think, was finding eyeballs for instance. Now if you're wanting to transfer learn 00:12:36.240 |
To something for histopathology slides there's probably going to be no eyeballs in that right so the later layers are no good for you but there'll certainly be some repeating patterns and they'll certainly be some diagonal edges right so the earlier you go in the model the more likely it is that you want those weights to stay as they are. 00:13:00.200 |
Well to start with we definitely need to train these new weights because they're random so let's not bother training any of the other weights at all to start with so what we do is we basically say let's freeze. 00:13:17.840 |
Let's freeze all of those other layers. So what does that mean? All that means is that we're asking fastai and PyTorch that when we train, 00:13:27.880 |
however many epochs we do when we call fit, don't back propagate the gradients back into those layers. In other words, when you go parameters = parameters - learning_rate * gradient, 00:13:46.240 |
Only do it for the new layers don't bother doing it for the other layers that's what freezing means okay just means don't update those parameters. 00:13:56.840 |
It'll be a little bit faster as well, because there's a few less calculations to do. 00:14:01.160 |
It'll take up a little bit less memory because there's a few less gradients that we have to store but most importantly it's not going to change weights that are already. 00:14:13.160 |
better than nothing; they're better than random at the very least. So that's what happens when you call freeze: it doesn't freeze the whole thing, it freezes everything except the randomly generated, added layers that we put on for you. 00:14:23.680 |
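In plain PyTorch terms, freezing amounts to something like the following sketch, where the toy `body` stands in for the pretrained layers and `head` for the new random ones.

```python
from torch import nn

body = nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU())  # stands in for the pretrained layers
head = nn.Linear(16, 3)                                # the new, randomly initialized layers

for p in body.parameters():
    p.requires_grad = False   # frozen: no gradients stored, no updates applied
# head's parameters keep requires_grad=True, so only the new layers get trained
```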
So then what happens next okay after a while we say okay this is looking pretty good. 00:14:29.760 |
We probably should train the rest of the network now. 00:14:39.160 |
And so now we're going to train the whole thing but we still have a pretty good sense that these new layers we added to the end probably need more training and these ones right at the start that might just be like diagonal edges probably don't need much training at all. 00:15:02.560 |
So we split the model into a few sections, and we say let's give different parts of the model different learning rates. 00:15:12.560 |
So this part of the model we might give a learning rate of 1e-5, and this part of the model we might give a learning rate of 00:15:26.680 |
1e-3, say. And so what's going to happen now is that we can keep training the entire network, but because the learning rate for the early layers is smaller, 00:15:38.720 |
it's going to move them around less, because we think they're already pretty good. And also, if they're already pretty close to the optimal value, using a higher learning rate could kick them out of it; it could actually make them worse, which we really don't want to happen. 00:15:53.160 |
So this this process is called using discriminative learning rates. 00:16:01.320 |
Because I think we were kind of the first to use it for this purpose, or at least talk about it extensively; maybe other people probably used it without writing it down. So most of the stuff you'll find about this will be from fastai students; it's starting to get more well known, slowly, now. 00:16:20.920 |
But it's a really really important concept for transfer learning without using this you just can't get nearly as good results. 00:16:27.000 |
So how do we do discriminative learning rates in fast AI when you. 00:16:34.240 |
Anywhere you can put a learning rate in fastai, such as with the fit function, 00:16:41.600 |
the first thing you put in is the number of epochs and then the second thing you put in is the learning rate; same if you use fit_one_cycle. 00:16:50.240 |
For the learning rate you can put a number of things: you can put a single number like 1e-3. 00:16:55.560 |
You can write a slice: you can write slice(1e-3), for example, with a single number, or you can write slice(1e-5, 1e-3) with two numbers. 00:17:19.040 |
Just using a single number means every layer gets the same learning rate so you're not using discriminative learning rates. 00:17:29.200 |
If you pass a single number to slice, it means the final layers get a learning rate of whatever you wrote down, 00:17:40.560 |
and then all the other layers get the same learning rate, which is that divided by three. 00:17:48.440 |
So all of the other layers will be 1e-3 divided by 3; the last layers will be 1e-3. 00:17:53.480 |
In the last case, the final layers, that is the randomly added layers, will still be 1e-3 again. 00:18:01.480 |
The first layers will get 1e-5, and the other layers will get learning rates that are equally spread between those two, 00:18:15.440 |
multiplicatively equal. So if there were three layers there would be 1e-5, 1e-4, 1e-3: equal multiples each time. 00:18:26.560 |
One slight tweak to make things a little bit simpler to manage we don't actually give a different learning rate to every layer we give a different learning rate to every layer group which is just we decide to put the groups together for you. 00:18:45.640 |
And so specifically what we do is the randomly added extra layers we call those one layer group. 00:18:51.520 |
This is by default, you can modify it, and then all the rest we split in half into two layer groups. So by default, at least with a CNN, you'll get three layer groups, and so if you say slice(1e-5, 1e-3) you will get a 1e-5 learning rate for the first layer group, 1e-4 for the second, 1e-3 for the third. So now if you go back and look at the way that we're training, 00:19:15.080 |
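Put together, the three ways of passing a learning rate look like this (assuming a `learn` object already created with fastai v1):

```python
# 1. a single number: every layer group gets the same learning rate
learn.fit_one_cycle(5, 1e-3)

# 2. slice with one number: last group gets 1e-3, earlier groups get 1e-3 / 3
learn.fit_one_cycle(5, slice(1e-3))

# 3. slice with two numbers: groups get learning rates spread multiplicatively,
#    e.g. 1e-5, 1e-4, 1e-3 for the default three layer groups
learn.fit_one_cycle(5, slice(1e-5, 1e-3))
```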
Hopefully you'll see that this makes a lot of sense. This divided-by-three thing is a little weird, and we won't talk about why that is until part two of the course; it's a specific quirk around batch normalization, so we can discuss that in the advanced topics if anybody's interested. 00:19:42.000 |
That is fine tuning so hopefully that makes that a little bit less mysterious. 00:19:52.520 |
Collaborative filtering last week and in the collaborative filtering example we called fit one cycle and we passed in just a single number and that makes sense because in collaborative filtering we only have. 00:20:11.480 |
One layer there's a few different pieces in it, but there isn't you know a matrix multiply followed by an activation function followed by another matrix multiply. 00:20:23.760 |
I'm going to introduce a another piece of jargon here. 00:20:30.600 |
They're not always exactly matrix multiplications; there's something very similar. They're linear functions that we add together, but the more general term for these things that are more general than matrix multiplications is affine functions. Okay, so if you hear me say the word affine function, you can replace it in your head with matrix multiplication. 00:20:54.760 |
But as we'll see when we do convolutions convolutions are matrix multiplications where some of the weights are tied and so it would be slightly more accurate to call them affine functions and I like to introduce a little bit more jargon each lesson so that when you know read books or papers or watch other courses or read documentation. 00:21:15.440 |
There will be more of the words you'll recognize so when you say affine function it just means a linear function right and it means something very very close to matrix multiplication matrix multiplication is the most common kind of affine function at least in deep learning. 00:21:50.000 |
A bunch of numbers here, and a bunch of numbers here, and we took the dot product of them; and given that one here is a row and one is a column, that's actually the same as a matrix product. So MMULT in Excel multiplies matrices, so here is the matrix product of those 2. 00:22:09.200 |
And so I started this training last week by using solver in excel and we never actually went back to see how it went so let's go and have a look now. 00:22:18.640 |
So the average sum of squared error got down to point three nine so we're trying to predict something on a scale of point five to five so on average we're being wrong by about point four that's pretty good and you can kind of see it's pretty good. 00:22:37.680 |
If you look at like three five one is what it meant to be three point two five five point one or point nine eight it's pretty close right. 00:22:48.600 |
And then I started to talk about this idea of embedding matrices, and so in order to understand that, let's look at this. 00:23:04.440 |
So here's another worksheet and what I've done here is I have copied over those two weight matrices from the previous worksheet. 00:23:13.360 |
Here's the one for users and here's the one for movies and the movies one I've transposed it so it's now got exactly the same dimensions as the users one. 00:23:28.880 |
Initially they were random we can train them with gradient descent. 00:23:34.320 |
In the original data the user IDs and movie IDs were numbers like these OK to make life more convenient I've converted them to numbers from one to fifteen. 00:23:48.520 |
OK so in these columns I've got for every rating I've got user ID movie ID rating using these mapped numbers so that they're contiguous starting at one. 00:24:03.800 |
So user number one I'm going to replace with this vector: the vector contains a one followed by fourteen zeros. 00:24:12.440 |
And then user number two I'm going to replace with a vector of zero and then one and then thirteen zeros and so forth so movie ID fourteen all these are movie ID fourteen I've also replaced. 00:24:29.160 |
With another vector which is thirteen zeros and then a one and then a zero so these are called one hot encodings by the way. 00:24:37.720 |
So this is not part of a neural net this is just like some input pre processing where I'm literally making this my new inputs this is my new inputs for my movies this is my new inputs for my users so these are my inputs to a neural net. 00:24:58.600 |
So what I'm going to do now is I'm going to take this 00:25:01.240 |
input matrix and I'm going to do a matrix multiply by this 00:25:09.080 |
weight matrix, and that'll work because this has fifteen rows 00:25:16.160 |
and this has 15 columns, so I can multiply those two matrices together because they match. And you can do matrix multiplication in Excel using the MMULT function; just be careful if you're using Excel, because this is a function that returns multiple numbers: you can't just hit enter when you finish with it, you have to hit ctrl-shift-enter. Ctrl-shift-enter means this is an array function, something that returns multiple values. So here is the matrix product of 00:25:50.960 |
the inputs and this parameter matrix, or weight matrix. 00:25:59.480 |
So that's just a normal neural network layer. 00:26:04.280 |
It's just a regular matrix multiply and so we can do the same thing for movies and so here's the matrix multiply for movies. 00:26:16.600 |
This input is we claim is this kind of one hot encoded version of user ID number one and these activations. 00:26:28.400 |
are the activations for user ID number one. Why is that? Because if you think about it, a matrix multiplication between a one hot encoded vector and some matrix is actually going to find the 00:26:44.000 |
Nth row of that matrix, when the one is in position N. 00:26:48.680 |
So in a sense, what we've done here is we've actually 00:26:53.600 |
got a matrix multiply that is creating these output activations, but it's doing it in a very interesting way, which is effectively finding a particular row in the weight matrix. So having done that, 00:27:09.080 |
We can then multiply. Those two sets together. 00:27:14.120 |
Just a dot product and we can then find the loss. 00:27:19.720 |
squared, and then we can find the average loss, and lo and behold that number is 0.39, the same as before, because this is doing the same thing: 00:27:49.440 |
this one is just doing a matrix multiply, and therefore we know they are mathematically identical. So let's lay that out again; so here's our final version. 00:28:01.480 |
This is the same weight matrices again exactly the same I've copied them over. 00:28:08.080 |
And here's those user IDs and movie IDs again right but this time I've laid them out just in a normal kind of tabular form just like you would expect to see in the input to your model and this time I've got exactly the same set of activations here that I had. 00:28:32.200 |
But in this case I've calculated these activations using excels offset function which is an array look up right it says find the first row. 00:28:44.320 |
So this is doing it as an array look up so this version is identical to this version but obviously it's much less memory intensive and much faster because I don't actually create the one hot encoded matrix and I don't actually do a matrix multiply. 00:29:01.520 |
Because that matrix multiply is nearly all multiplying by zero which is a total waste of time. 00:29:05.800 |
So in other words multiplying by a one hot encoded matrix is identical to doing an array look up. 00:29:14.600 |
Therefore we should always do the array look up version. 00:29:20.360 |
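Here is a small PyTorch check of that equivalence; the sizes and indices are arbitrary.

```python
import torch
from torch import nn

n_users, n_factors = 15, 5
user_weights = torch.randn(n_users, n_factors)   # the embedding / weight matrix
user_ids = torch.tensor([0, 3, 7])               # a mini-batch of user indices

one_hot = torch.eye(n_users)[user_ids]           # 3 x 15 one-hot matrix
by_matmul = one_hot @ user_weights               # multiply by the one-hot matrix
by_lookup = user_weights[user_ids]               # plain array lookup

emb = nn.Embedding(n_users, n_factors)           # the same thing as a layer
emb.weight.data = user_weights
by_embedding = emb(user_ids)

assert torch.allclose(by_matmul, by_lookup)      # identical results
assert torch.allclose(by_lookup, by_embedding)
```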
And therefore we have a specific way of saying: I want to do a matrix multiplication by a one hot encoded matrix without ever actually creating it; I'm just instead going to pass in a bunch of ints and pretend they're one hot encoded. And that is called an embedding. 00:29:41.280 |
Right, so you might have heard this word embedding all over the place as if it's some magic advanced mathy thing, but embedding means look something up in an array. 00:29:54.000 |
But it's interesting to know that looking something up in an array is mathematically identical to doing a matrix product by a one hot encoded matrix and therefore an embedding. 00:30:09.920 |
Fits very nicely in our standard model of how neural networks work. 00:30:13.640 |
So now suddenly it's as if we have another whole kind of layer it's a kind of layer where we get to look things up in an array but we actually didn't do anything special right we just added this computational shortcut this thing called an embedding which is simply a fast and memory efficient way of multiplying by a one hot encoded matrix. 00:30:35.040 |
OK so this is really important because when you hear people say embedding you need to replace it in your head with an array look up which we know is mathematically identical to a matrix multiplied by a one hot encoded matrix. 00:30:54.880 |
Here's the thing though: it has kind of interesting semantics, because when you do multiply something by a one hot encoded matrix you get this nice feature where the rows of your weight matrix 00:31:10.160 |
only get used for particular inputs; the values in row number one, for example, only get used where you get user ID number one in your inputs. So in other words you kind of end up with this weight matrix where certain rows of weights correspond to certain values of your input. 00:31:30.480 |
And that's pretty interesting it's particularly interesting here because going back to a kind of most convenient way to look at this. 00:31:39.120 |
Because the only way that we can calculate an output activation is by doing a product of these two input vectors that means that. 00:31:50.800 |
They kind of have to correspond with each other right like there has to be some way of saying if this number for a user is high and this number for a movie is high then the user will like the movie. 00:32:08.480 |
So the only way that can possibly make sense is if these numbers represent features of personal taste and corresponding features of movies for example the movie has John Travolta in it. 00:32:24.560 |
And user ID likes John Travolta then you'll like this movie. 00:32:32.400 |
OK, so we're not actually deciding the rows mean anything, we're not doing anything to make the rows mean anything, but the only way that this gradient descent could possibly come up with a good answer is if it figures out what the aspects of movie taste are and the corresponding features of movies. 00:32:54.560 |
So those underlying kind of features that appear are called latent factors or latent features that these hidden things that were there all along and once we train this neural net they suddenly appear. 00:33:10.360 |
No one's going to like Battlefield Earth right it's not a good movie even though it has John Travolta in it. 00:33:19.040 |
So how are we going to deal with that? Because there's this feature called "I like John Travolta movies" and this feature called "this movie has John Travolta", and so this now says you're going to like the movie; but we need some way to say "unless it's Battlefield Earth, or you're a Scientologist", either one, right. 00:33:36.720 |
So how do we do that we need to add in bias right. 00:33:45.080 |
Here are the same weight matrices again; sorry, not the same weights, but the same setup as before. 00:33:56.280 |
But this time we got an extra row so now this is not just the matrix product of that and that but I'm also adding on this number and this number which means now each movie can have an overall this is a great movie versus this isn't a great movie and every user can have an overall this user rates movies highly or this user doesn't rate movies highly. 00:34:26.200 |
So that's called the bias so this is hopefully going to look very familiar right this is the same usual. 00:34:32.320 |
linear model concept, or linear layer concept, from a neural net: you have a matrix product and a bias. And remember from lesson 2, the lesson 2 SGD notebook, you never actually need a bias: you could always just add a column of ones to your input data and then that gives you bias for free. 00:34:52.640 |
But that's pretty inefficient, so in practice all neural network libraries explicitly have a concept of bias; we don't actually add the column of ones. So what does that do? 00:35:03.920 |
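For a single user/movie pair, the calculation with bias is just this (the numbers are made up):

```python
import torch

user_factors  = torch.randn(5)     # latent factors for one user
movie_factors = torch.randn(5)     # latent factors for one movie
user_bias  = torch.tensor(0.2)     # "this user tends to rate movies highly"
movie_bias = torch.tensor(0.7)     # "this is just a good movie"

prediction = (user_factors * movie_factors).sum() + user_bias + movie_bias
```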
Well, just before I came in today I ran Data Solver on this as well, and we can check the RMSE: the root mean squared error here is 0.32, versus the version 00:35:20.760 |
without bias, which was 0.39. OK, so you can see that this 00:35:25.560 |
slightly better model gives us a better result; and it's better because it's giving more flexibility, and it also just makes sense semantically: whether I like the movie is not just about the combination of what actors it has and whether it's dialogue driven and how much action is in it, but just: is it a good movie, 00:35:50.600 |
or am I somebody who rates movies highly? 00:35:53.760 |
OK so there's all the pieces of this collaborative filtering model. 00:36:03.360 |
How are we going? Any questions? We have three questions, OK. 00:36:18.920 |
Can we explore the activation grids to say what they might be good at recognizing. 00:36:23.160 |
Yes you can and we will learn how to should be in the next lesson. 00:36:29.320 |
Can we have an explanation of what the first argument in fit one cycle actually represents is it equivalent to an epoch yes the first argument to fit one cycle or fit is number of epochs it's. 00:36:48.440 |
In other words an epoch is looking at every input once. 00:36:53.720 |
So if you do 10 epochs you're looking at every every input 10 times and so there's a chance you might start overfitting if you've got lots and lots of parameters and a high learning rate. 00:37:05.960 |
If you only do one epoch it's pretty much impossible to overfit, so that's why it's kind of useful to remember how many epochs you're doing. 00:37:15.160 |
Can we have an explanation of... OK, we did that one. What is an affine function? An affine function is a linear function. 00:37:28.120 |
I don't know if we need much more detail than that if you're multiplying things together and adding them up it's an affine function. 00:37:40.360 |
I'm not going to bother with them exact mathematical definition partly because I'm a terrible mathematician and partly because it doesn't matter but if you just remember that you're multiplying things together and then adding them up that's the most important thing it's linear and therefore if you put an affine function on top of an affine function that's just another affine function you haven't won anything at all that's a total waste of time right so you need to sandwich it. 00:38:06.280 |
with any kind of nonlinearity. Pretty much anything works, including replacing the negatives with zeros, which we call ReLU. So if you do affine, ReLU, affine, ReLU, affine, ReLU, you have a deep neural network. 00:38:23.080 |
So let's go back to the collaborative filtering notebook and this time we're going to grab the whole movie lens 100 K data set. 00:38:35.160 |
There's also a 20 million data set by the way. 00:38:37.600 |
So it's a really great project made available by this group called GroupLens; they actually update the MovieLens data sets on a regular basis, but they helpfully provide the original one, and we're going to use the original one because that means that we can compare to baselines, because everybody basically says: hey, if you're going to compare to baselines, make sure you all use the same data set, and this is the one you should use. 00:39:04.200 |
Unfortunately it means that we're going to be restricted to movies that are before 1998 so maybe you won't have seen them all but that's the price we pay you can replace this with ML latest when you download it and use it if you want to play around with movies that are up to date. 00:39:23.760 |
OK, the original MovieLens data set: the more recent ones are in a CSV file which is super convenient to use; the original one is slightly messy. First of all they don't use commas for delimiters, they use tabs, so in pandas you can just say what the delimiter is and you load it in. 00:39:41.040 |
The second is they don't add a header row so that you know what color is what so you have to tell pandas there's no header row and then since there's no header row you have to tell pandas what are the names of the columns. 00:39:54.960 |
OK so we can then have a look at head, which shows the first few rows, and there are our user ratings: user, movie, rating. And let's make it more fun: let's see what the movies actually are. 00:40:14.240 |
There's this thing called encoding= ... let me get rid of it, and I get this UnicodeDecodeError. I just want to point this out because you'll all see this at some point in your lives: 00:40:24.960 |
"codec can't decode" blah blah blah. What this means is that this is not a Unicode file, and this will be quite common when you're using data sets that are a little bit older, 00:40:37.600 |
back before us folks in the West really realized that there are people that use languages other than English. 00:40:45.520 |
Nowadays we're much better at handling different languages; we use this standard called Unicode for it, and Python very helpfully uses Unicode by default. 00:40:58.760 |
But if you try to load an old file that's not Unicode, you actually, believe it or not, have to guess how it was encoded. But since it's really likely that it was created by some Western European or American person, 00:41:16.480 |
they almost certainly used Latin-1. So if you just pop in encoding='latin-1', whether you use open in Python or pandas' read functions or whatever, that will generally get around your problem. 00:41:30.600 |
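Roughly, reading the two ML-100K files looks like this; the column names here are my own choice, and the u.item file is pipe-delimited rather than tab-delimited.

```python
import pandas as pd

# ratings: tab-delimited, no header row, so we supply the names ourselves
ratings = pd.read_csv('ml-100k/u.data', delimiter='\t', header=None,
                      names=['userId', 'movieId', 'rating', 'timestamp'])

# movies: an older, non-Unicode file, hence encoding='latin-1'
movies = pd.read_csv('ml-100k/u.item', delimiter='|', encoding='latin-1', header=None)
movies = movies[[0, 1]].rename(columns={0: 'movieId', 1: 'title'})  # keep id and title only
```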
Again they didn't have the names, so we had to pass the list of names, which is this. This is kind of interesting: they had a separate column for every one of the genres they had, 19 genres. 00:41:44.960 |
And you'll see this looks one hot encoded, but it's actually not: it's actually n-hot encoded, in other words a movie can be in multiple genres. We're not going to look at genres today, but it's just interesting to point out that this is a 00:41:56.240 |
A way that sometimes people will represent something like genre and the more recent version they actually list the genres directly which is much more convenient. 00:42:06.320 |
Okay, so we've got 100,000 ratings. I find life is easier when you're modeling when you actually denormalize the data, so I actually want the movie title directly in my ratings; so pandas has a merge function to let us do that, and here's the ratings table with actual titles. 00:42:26.280 |
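The denormalizing merge is a one-liner (continuing the hypothetical column names from the sketch above):

```python
# joins on the shared 'movieId' column, putting the title on every rating row
rating_movie = ratings.merge(movies)
rating_movie.head()
```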
So as per usual we can create a data bunch for our application, a CollabDataBunch for the collab application, from a data frame, and set aside some validation data. 00:42:39.040 |
Really we should use the same validation sets and cross validation approach that they used if we're going to properly compare with a benchmark, so take these comparisons with a grain of salt. By default CollabDataBunch assumes that your 00:42:56.080 |
first column is users, the second column is items and the third column is rating; but now we're actually going to use the title column as the item, so we have to tell it what the item column name is. 00:43:07.680 |
And then all of our data bunches support show batch so you can just check what's in there and there it is okay. 00:43:18.560 |
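In fastai v1 that looks something like this; the first columns are taken as user and rating by default, so only the item column needs naming.

```python
from fastai.collab import CollabDataBunch

data = CollabDataBunch.from_df(rating_movie, seed=42, valid_pct=0.1, item_name='title')
data.show_batch()   # sanity-check what went in
```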
So I'm going to try and get as good a result as I can so I'm going to try and use whatever tricks I can come up with to get a good answer now one of the tricks is to use the Y range and remember the Y range was the thing that made the final activation function a sigmoid. 00:43:40.040 |
And specifically last week we said let's have a sigmoid that goes from 0 to 5 and that way it's going to ensure that it kind of is going to help the neural network predict things that are in the right range actually didn't do that in my. 00:43:56.920 |
Excel version and so you can see I've actually got some negatives and there's also some things bigger than five so if you want to beat me in Excel you could you could add the sigmoid to excel and train this and you'll get a slightly better answer. 00:44:11.680 |
Now the problem is that a sigmoid actually asymptotes at say whatever the maximums we said five which means you can never actually predict five but plenty of movies have a rating of five. 00:44:27.080 |
So that's a problem. So actually it's slightly better to make your Y range go from a little bit less than the minimum to a little bit more than the maximum; the minimum of this data is 0.5 and the maximum is 5, so this range goes just a little bit further. So that's one little trick to get a little bit more 00:44:46.520 |
accuracy. The other trick I used is to add something called weight decay, and we're going to look at that next; after this section we're going to learn about weight decay. So then, how many... 00:45:00.760 |
How many factors do you want well what are factors the number of factors is the width of the embedding matrix so why don't we say embedding size. 00:45:13.560 |
Maybe we should but in the world of collaborative filtering they don't use that word they use the word factors because of this idea of latent factors and because the standard way of doing collaborative filtering has been with something called matrix factorization and in fact what we just saw. 00:45:31.560 |
Happens to actually be a way of doing matrix factorization so we've we've actually accidentally learned how to do matrix factorization today. 00:45:41.560 |
So this is a term that's kind of specific to this domain, okay, but you can just remember it as the width of the embedding matrix. And so why 40? Well, this is one of these architectural decisions you have to play around with and see what works. So I tried 10, 20, 40 and 80 and I found 40 seemed to work pretty well 00:45:59.560 |
And trained really quickly so like you can check it in a little for loop to try a few things and see what looks best. 00:46:09.720 |
And then for learning rate so use the learning rate finder as usual. 00:46:12.440 |
So 5e-3 seemed to work pretty well. Remember this is just a rule of thumb: 5e-3 is a bit lower than both Sylvain's rule and my rule. Sylvain's rule is find the bottom and go back by 10, so his rule would be more like 2e-2, I reckon. 00:46:32.240 |
My rule is to find about the steepest section, which is about here, and often it agrees with Sylvain's, so that would be about 2e-2. I tried that, and I always like to try 10x less and 10x more just to check, and actually I found a bit less was helpful. So 00:46:48.600 |
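Pulling those choices together, the training code looks roughly like this in fastai v1 (5 epochs and the saved name are just examples):

```python
from fastai.collab import collab_learner

learn = collab_learner(data, n_factors=40, y_range=[0, 5.5], wd=1e-1)
learn.lr_find()                  # plot with learn.recorder.plot() to pick a rate
learn.fit_one_cycle(5, 5e-3)
learn.save('dotprod')            # illustrative name, so we don't have to retrain later
```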
The answer to the question like should I do blah is always try blah and see that that's how you actually become a good practitioner. 00:47:02.640 |
And as usual you can save the result, to save you another 33 seconds of having to do it again later. And so there's a library called LibRec, and they publish some benchmarks for MovieLens 100K; there's a root mean squared error section, and about 0.91 is about as good as they seem to have been able to get. 0.91 is the root mean squared error. 00:47:29.880 |
We use the mean squared error not the root so we have to go .91 squared which is .83 and we're getting .81 so that's cool with this very simple model we're doing a little bit better quite a lot better actually. 00:47:45.520 |
Although as I said take it with a grain of salt because we're not doing the same splits and the same cross validation so we're at least highly competitive with. 00:47:59.800 |
OK so we're going to look at the Python code that does this in a moment, 00:48:03.640 |
but for now just take my word for it that we're going to see something that's just doing 00:48:13.240 |
this: looking things up in an array, then multiplying them together, adding them up, and using the mean squared error loss function. 00:48:26.800 |
So given that and given that we noticed that the only way that that can do anything interesting is by trying to kind of. 00:48:33.600 |
Find these latent factors it makes sense to look and see what they found right particularly since as well as finding latent factors we also now have a specific bias number for every user and every movie right now you could just say what's the average rating for each movie. 00:48:57.680 |
But there's a few issues with that. In particular, this is something you see a lot with, like, anime: people who like anime just love anime, and so they watch lots of anime and then they just rate all the anime highly, and so very often on kind of charts of movies you'll see a lot of anime at the top. 00:49:17.080 |
Particularly if it's like you know a 100 long series of anime you'll find you know every single item of that series in the top 1000 movie list or something. 00:49:27.480 |
So how do we deal with that? Well, the nice thing is that instead we can look at the movie bias. The movie bias says: 00:49:39.480 |
once we've included the user bias, which for an anime lover might be a very high number because they're rating a lot of movies highly, and once we account for the specifics of this kind of movie, which again might be that people love anime, what's left over is something specific to that movie itself. So it's kind of interesting to look at 00:50:01.280 |
movie bias numbers as a way of saying what are the best movies, or what do people really like as movies, even if those people don't rate movies very highly, or even if that movie doesn't have the kind of features that people tend to rate highly. So it's kind of nice: 00:50:19.480 |
it's funny to say this, but by using the bias we get an unbiased kind of movie score. So how do we do that? 00:50:29.680 |
To make it interesting, particularly because this data set only goes to 1998, let's only look at movies that plenty of people watched. So we'll use pandas to grab our rating-movie table and group it by title, 00:50:49.520 |
and then count the number of ratings; we're not measuring how high they're rated, just how many ratings they have. And so the top 00:50:59.920 |
is the movies that have been rated the most, and so they're hopefully movies that we might have seen; that's the only reason I'm doing this. And so I've called this top_movies, by which I mean not good movies, just movies we're likely to have seen. So not surprisingly Star Wars is 00:51:18.920 |
the one that, at that point, the most people had put a rating to. 00:51:29.920 |
We can then take our learner that we trained and ask it for the bias. 00:51:44.020 |
So is_item=True: you would pass True to say I want the items, or False to say I want the users. This is kind of a pretty common piece of nomenclature for collaborative filtering: 00:52:02.920 |
these IDs tend to be called items even if your problem has nothing to do with users and items at all; 00:52:11.820 |
we just use these names for convenience, okay, they're just words. So in our case we want the items, this is the list of items we want, and we want the bias; this is specific to collaborative filtering. And so that's going to give us back a thousand numbers, as we asked for: this has a thousand movies in it. 00:52:33.020 |
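Sketching those steps in code (rating_movie comes from the earlier merge, and the variable names are mine):

```python
# the 1,000 most-rated titles: movies we're likely to have seen
g = rating_movie.groupby('title')['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]

# per-movie bias from the trained learner, plus each movie's mean rating for comparison
movie_bias = learn.bias(top_movies, is_item=True)
mean_ratings = rating_movie.groupby('title')['rating'].mean()
movie_ratings = [(b.item(), mean_ratings.loc[t], t) for t, b in zip(top_movies, movie_bias)]

sorted(movie_ratings)[:15]                  # lowest-bias movies
sorted(movie_ratings, reverse=True)[:15]    # highest-bias movies
```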
So we can now take, and just for comparison, let's also group the titles by the mean rating. So then we can zip through, going through together each of the movies along with the bias, grab their rating and the bias and the movie, and then we can sort. 00:53:11.020 |
Mortal Kombat: Annihilation, not a great movie; Lawnmower Man 2, not a great movie. I haven't seen Children of the Corn, but we did have a long discussion at the SF study group today, and people who have seen it agree: not a great movie. 00:53:24.820 |
And you can kind of see like some of them actually have pretty decent ratings. 00:53:31.620 |
Even as though like relative to right so this one's actually got a much higher rating than the next one. 00:53:38.620 |
But you know that's kind of saying well the kind of actors that were in this and the kind of movie that this was in the kind of people you do like it watch it you would expect it to be higher. 00:53:51.520 |
And then here's the sort in reverse, okay: Schindler's List, Titanic, The Shawshank Redemption seems reasonable. And again you can kind of look for ones where, like, 00:54:01.220 |
The rating you know isn't that high but it's still very high here so that's kind of like you know at least in 1998 people weren't that into Leonardo DiCaprio or you know people aren't that into dialogue driven movies or people aren't that into romances or whatever but still people liked it more than you would expect. 00:54:21.420 |
So it's interesting to kind of like interpret our models in this way we can go a bit further and grab not just the biases but the weights. 00:54:39.620 |
And again we're going to grab the weights for the items for our top movies and that is a thousand by 40 because we asked for 40 factors so rather than having a width of five we have a width of 40. 00:54:52.120 |
Often there isn't really conceptually 40 latent factors involved in taste, and so trying to look at all 40 can be 00:55:08.720 |
you know, not that intuitive. So what we want to do is we want to squish those 40 down to just 00:55:14.720 |
3, and there's something that we're not going to look into called PCA, which stands for principal components analysis. So movie_w here is a torch tensor, and fastai adds the pca method to torch tensors; and what PCA, principal components analysis, does is it's a simple linear transformation that takes an input matrix 00:55:39.320 |
And tries to find a smaller number of columns that kind of cover a lot of the space of that original matrix if that sounds interesting which it totally is you should check out our course computational linear algebra which Rachel teaches where we will show you how to. 00:56:00.720 |
Calculate PCA from scratch and why you want to do it and lots of stuff like that. 00:56:05.020 |
It's absolutely not a prerequisite for anything in this course, but it's definitely worth knowing that taking layers of neural nets and chucking them through PCA is very often a good idea, 00:56:19.620 |
because very often you have way more activations than you want in a layer, and there's all kinds of reasons you might want to play with it. For example, Francisco, who's sitting next to me today, has been working on something to do with 00:56:37.820 |
image similarity, and for image similarity a nice way to do that is to compare activations from a model. But often those activations will be huge, and therefore it could be really slow and unwieldy, so people often, for something like image similarity, will chuck it through a PCA 00:56:53.020 |
first, and that's kind of cool. In our case we're just going to do it so that we take our 40 components down to three components, so hopefully they'll be easier for us to interpret. 00:57:05.220 |
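A sketch of that squishing step, using the pca method fastai adds to tensors (variable names are mine):

```python
movie_w = learn.weight(top_movies, is_item=True)   # 1000 x 40 latent factors
movie_pca = movie_w.pca(3)                          # squish 40 factors down to 3
fac0, fac1, fac2 = movie_pca.t()                    # one 1000-long vector per component

# movies ranked highest / lowest on the first component
by_fac0 = sorted(zip(fac0, top_movies), key=lambda o: o[0].item(), reverse=True)
by_fac0[:10], by_fac0[-10:]
```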
So we can grab each of those three factors, we'll call them factor 0, 1 and 2; let's grab the movie components and then sort. And now the thing is, 00:57:17.220 |
We have no idea what this is going to mean but we're pretty sure it's going to be some aspect of. 00:57:25.120 |
Taste and movie feature so if we print it out the top and the bottom we can see that the highest ranked things on this feature. 00:57:40.420 |
are, you know, connoisseur movies, I guess: Chinatown, the really classic Jack Nicholson movie; everybody knows Casablanca; and even The Wrong Trousers is this kind of classic claymation movie, and so forth. So this is definitely measuring things that are very high on the kind of connoisseur level, whereas maybe Home Alone 3, not such a favorite with connoisseurs perhaps. 00:58:09.620 |
That's not to say that there aren't people who like it, but probably not the same kind of people that would appreciate Secrets and Lies. 00:58:17.320 |
So you can kind of see this idea that this is found some feature of movies and a corresponding feature of the kind of things people like so let's look at another feature so here's factor number one. 00:58:30.720 |
So this seems to have found like OK these are just big hits that you could watch with the family you know these are definitely not that you know train spotting very gritty kind of you know thing so again it's kind of found this interesting feature of taste and we could even like. 00:58:53.820 |
draw them on a graph. I've just spread them out randomly to make them easier to see, and this is just the top 50 00:59:02.420 |
most popular movies by rating, by how many times they've been rated. And so on this one factor you've got The Terminator really high up here, and The English Patient and Schindler's List at the other end, and then kind of 00:59:17.120 |
your Godfather and Monty Python over here, and Independence Day and Liar Liar over there. So you get the idea. So it's kind of fun; it'd be interesting to see if you can come up with some stuff at work or other kind of data sets where you could try to pull out some features and play with them. 00:59:47.720 |
The question is why am I sometimes getting negative loss. 01:00:03.920 |
So ask on the forum and show us what you're seeing; particularly since people are voting this up, I guess other people have seen it too, so put it on the forum. I mean, 01:00:15.920 |
they said they're doing negative log likelihood; yeah, so we're going to be learning about cross entropy and negative log likelihood after the break today. 01:00:23.820 |
They are loss functions that have very specific expectations about what your input looks like, and if your input doesn't look like that then they're going to give very weird answers. So 01:00:34.520 |
Probably you press the wrong buttons so don't do that. 01:00:58.320 |
The collab_learner function, as per usual, takes a 01:01:09.020 |
data bunch, and normally learners also take something where you ask for particular architectural details. In this case there's only one thing which does that, which is basically: do you want to use a multi-layer neural net or do you want to use the classic collaborative filtering? And we're only going to look at the classic collaborative filtering today, 01:01:26.220 |
Or maybe we'll briefly look at the other one too we'll see. 01:01:31.620 |
And so what actually happens here? Well, basically we create an EmbeddingDotBias model and then we pass back a learner which has our data and that model. So obviously all the interesting stuff is happening here, in EmbeddingDotBias, 01:01:46.920 |
so let's take a look at that. Clearly pressed the wrong button... EmbeddingDotBias, there we go, okay. So 01:02:05.620 |
it's an nn.Module. So in PyTorch, to remind you, all PyTorch layers and models are nn.Modules: they are things that, once you create them, look exactly like a function; you call them with parentheses and you pass them arguments. But they're not functions: 01:02:28.520 |
they don't even have the thing you'd normally need in Python to make something look like a function, a method called dunder call, meaning __call__ (underscore underscore call underscore underscore), which doesn't exist here. And the reason is that PyTorch 01:02:43.420 |
actually expects you to have something called forward, and that's what PyTorch will call for you when you call it like a function. So when this model is being trained, to get the predictions it's actually going to call forward for us; so this is where we 01:03:06.420 |
calculate our predictions. So this is where you can see we grab our... why is this users rather than user? That's because everything's done a mini batch at a time. So when I read the forward in a PyTorch module I tend to ignore in my head the fact that there's a mini batch, and I pretend there's just one, 01:03:32.120 |
because PyTorch automatically handles all of the stuff about doing it to everything in the mini batch for you. So let's pretend there's just one user: so grab that user, and what is this self.u_weight? self.u_weight is 01:03:50.420 |
an embedding. We create an embedding for each of: users by factors, items by factors, users by one, items by one. That makes sense, right? So users by one, 01:04:07.520 |
here, that's the user's bias, and then users by factors is here. 01:04:19.420 |
Users by factors is the first couple, so that's going to go in u_weight, and (users, 1) is the third, so that's going to go in u_bias. So remember, when PyTorch creates our nn.Module it calls dunder init (__init__), and so this is where we have to create our weight matrices. We don't normally create the actual weight matrix tensors ourselves; we normally use PyTorch's convenience functions 01:04:48.220 |
To do that for us and we're going to see some of that after the break. 01:04:51.520 |
So for now just recognize that this function is going to create an embedding matrix for us; it's going to be a PyTorch nn.Module as well, so therefore to actually pass stuff into that embedding matrix and get activations out, 01:05:12.520 |
you treat it as if it was a function: stick it in parentheses. So if you want to look in the PyTorch source code and find nn.Embedding, you will find there's something called forward in there which will do this array look up for us. 01:05:32.720 |
Here's where we grab the items and so we've now got the embeddings for each right and so at this point. 01:05:43.120 |
We're kind of like here and we found that and that. 01:05:48.320 |
So we multiply them together and sum them up, and then we add on the user bias and the item bias, and then if we've got a y_range then we do our sigmoid trick. And so the nice thing is, 01:06:09.720 |
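Here is a stripped-down sketch of that forward pass; it is not fastai's exact class, just the same idea written out with the same attribute names.

```python
import torch
from torch import nn

class EmbeddingDotBiasSketch(nn.Module):
    def __init__(self, n_factors, n_users, n_items, y_range=None):
        super().__init__()
        self.y_range = y_range
        self.u_weight = nn.Embedding(n_users, n_factors)   # user latent factors
        self.i_weight = nn.Embedding(n_items, n_factors)   # item latent factors
        self.u_bias   = nn.Embedding(n_users, 1)            # per-user bias
        self.i_bias   = nn.Embedding(n_items, 1)            # per-item bias

    def forward(self, users, items):
        dot = (self.u_weight(users) * self.i_weight(items)).sum(dim=1)
        res = dot + self.u_bias(users).squeeze(1) + self.i_bias(items).squeeze(1)
        if self.y_range is None:
            return res
        lo, hi = self.y_range
        return torch.sigmoid(res) * (hi - lo) + lo           # the y_range sigmoid trick

# illustrative sizes: 944 users and 1683 movies, as in ML-100K
model = EmbeddingDotBiasSketch(40, n_users=944, n_items=1683, y_range=(0, 5.5))
preds = model(torch.tensor([1, 2, 3]), torch.tensor([10, 20, 30]))
```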
The entirety of this model and this is not just any model this is a model that we just found is at the very least highly competitive with and perhaps slightly better than. 01:06:21.720 |
Some published table of pretty good numbers from a software group that does nothing but this so you're doing well this is nice. 01:06:37.420 |
Probably a good place to have a break and so after the break we're going to come back and we're going to talk about the one piece of this puzzle we haven't learned yet which is what the hell does this do. 01:07:01.620 |
Okay, so this idea of interpreting embeddings is really interesting and as we'll see later in this lesson. 01:07:10.720 |
The things that we create for categorical variables more generally in tabular data sets are also embedding matrices, and again that's just a normal matrix multiply by a one hot encoded input, where we skip the computational and memory burden of it by doing it in a more efficient way. 01:07:32.120 |
And it happens to end up with these interesting semantics kind of accidentally. And there was this really interesting paper by these folks who came second in a Kaggle competition for something called Rossmann. 01:07:49.220 |
We'll probably look in more detail at the Rossmann competition in part two. I think we're going to run out of time in part one, but it's basically pretty standard tabular stuff; the main interesting stuff is in the pre-processing. 01:08:04.820 |
And it was interesting 'cause even they came second despite the fact that the person who came first and pretty much everybody else was a top of the leaderboard did a massive amount of highly specific feature engineering. Whereas these folks did way less feature engineering than anybody else, but instead they used a neural net and this was at a time in 2016 when just no one did that. No one was doing neural nets for tabular data. 01:08:28.820 |
So they have the kind of stuff that we've been talking about. 01:08:32.020 |
kind of arose there, or at least was kind of popularized there. And when I say popularized, I mean only popularized a tiny bit; still, most people are unaware of this idea. 01:08:43.320 |
But it's pretty cool, 'cause in their paper they showed the mean average percent error for various techniques: K nearest neighbors, random forests and gradient boosted trees. 01:08:55.320 |
Well, first you know neural nets just worked worked a lot better, but then with entity embeddings, which is what they call this just using entity matrices in tabular data. 01:09:04.820 |
You can actually they actually added the entity embeddings to all of these different tasks after training them and they all got way better, right? So neural nets with entity embeddings are still the best, but a random forest with entity embeddings was not at all far behind and you know that's often kind of that's kind of nice right 'cause you could train these. 01:09:25.620 |
Entity matrices for products or stores or genome motifs or whatever and then use them in lots of different models possibly, you know, using faster things like random forests. 01:09:41.020 |
But getting a lot of the benefits that it was something interesting. They took a two dimensional projection of their of their embedding matrix for state example German state 'cause this was a German supermarket chain. I think using the same kind of approach we did. I don't remember if they use PCA or something else slightly different. 01:10:04.220 |
And then here's the interesting thing I've circled here. You know a few things in this embedding space and I've circled it with the same color over here. 01:10:14.920 |
And here I've circled some same color over here and it's like oh my God the embedding projection. 01:10:24.820 |
has actually discovered geography. They didn't do that, right? But it's found things that are nearby each other in grocery purchasing patterns — because this was about predicting how many sales there will be, and there is some geographic element of that. 01:10:46.720 |
In fact, here is a graph of the distance between two embedding vectors: you can just take an embedding vector and compute the sum of squared differences 01:10:56.720 |
compared to some other embedding vector — that's the Euclidean distance, the distance in embedding space — and then plot it against the distance in real life between shops, and you get this very strong positive correlation. 01:11:10.320 |
Here is an embedding space for the days of the week, and as you can see there's a very clear path through them. Here's the embedding space for the months of the year, and again there's a very clear path through them. So, like, 01:11:25.320 |
embeddings are amazing, and I don't feel like anybody is even close to exploring the kind of interpretation that you could get. So if you've got 01:11:40.720 |
genome motifs or plant species or products that your shop sells or whatever, it would be really interesting to train a few models, try and kind of fine-tune some embeddings, and then 01:11:55.320 |
Start looking at them in these ways in terms of similarity to other ones and clustering them and projecting them into 2D spaces and whatever. I think it's really interesting. 01:12:10.120 |
So we were trying to make sure we understood what every line of code did in this pretty good collab learner model we built, and the one piece missing is this wd piece — wd stands for weight decay. 01:12:28.820 |
What is regularization? Well, let's start by going back to this nice little chart that Andrew did in his terrific machine learning course. 01:12:37.720 |
where he plotted some data and then showed a few different lines through it. This one here — 01:12:43.920 |
because Andrew's at Stanford, he has to use Greek letters, OK — so we could call it a + bx, or if you want to go there, theta-nought plus theta-one x 01:12:55.720 |
is a line, right? It's a line. Even if it's written in Greek letters, it's still a line. 01:13:02.320 |
So here's a second-degree polynomial, a + bx + cx squared — a bit of a curve — and here's a 01:13:11.220 |
high-degree polynomial, which is curvy as anything. 01:13:21.220 |
And models with lots of parameters tend to look more like this. And so in traditional statistics 01:13:27.120 |
we say, hey, let's use fewer parameters, because we don't want it to look like this — because if it looks like this, then the predictions over here and over here are going to be all wrong. 01:13:37.620 |
Right, it's not going to generalize well. We're overfitting. 01:13:42.720 |
So we avoid overfitting by using fewer parameters. And so if any of you are 01:13:49.420 |
unlucky enough to have been brainwashed by a background in statistics or psychology or econometrics or any of these kinds of courses, you're going to have to unlearn the idea that you need fewer parameters, because what you instead need to realize is this: 01:14:05.620 |
you were fed this lie that you need fewer parameters because it's a convenient fiction for the real truth, which is that you don't want your function to be too complex — and having fewer parameters is one way of making it less complex. 01:14:21.820 |
But what if you had 1,000 parameters, and 999 of those parameters were 1e-9? 01:14:31.420 |
Or what if they were zero? If they're zero, then they're not really there; and if they're 1e-9, they're hardly there. So why can't I have lots of parameters if lots of them are really small? 01:14:44.820 |
OK, so this thing of counting the number of parameters as the way we limit complexity 01:14:53.020 |
is actually extremely limiting. It's a fiction that really has a lot of problems, right? And so if, in your head, complexity is scored by how many parameters you have, you're doing it all wrong. Score it properly. 01:15:08.820 |
But so why do we care? Why would I want to use more parameters? Because more parameters means more nonlinearities, more interactions, more curvy bits, right? And real life is full of curvy bits, right? Real life does not look like this. 01:15:32.220 |
But we don't want them to be more curvy than necessary. Or more interacting than necessary. So therefore let's use lots of parameters and then penalize complexity. 01:15:46.520 |
OK, so one way to penalize complexity is as I kind of suggested before is let's sum up the value of your parameters. Now that doesn't quite work because some parameters are positive and some are negative, right? So what if we sum up the square? 01:16:05.320 |
Of the parameters, right? And that's actually a really good idea, right? Let's actually create a model and in the loss function we're going to add the sum of the square of the parameters. 01:16:19.220 |
Now here's the problem with that though. Maybe that number is way too big and it's so big that the best loss is to set all of the parameters to zero. 01:16:33.520 |
Now that would be no good, right? So actually we want to make sure that doesn't happen. 01:16:38.120 |
So therefore let's not just add the sum of the squares of the parameters to the model, but let's multiply that by some number that we choose and that number that we choose in fast AI is called. 01:16:53.420 |
WD. OK, so that's what we're going to do. We're going to take our loss function and we're going to add to it the sum of the squares of the parameters multiplied by some number WD. 01:17:20.220 |
People with fancy machine learning PhDs are extremely skeptical and dismissive of any claims that a learning rate can be 3e-3 most of the time, or that a weight decay can be 0.1 most of the time. 01:17:35.920 |
But here's the thing: we've done a lot of experiments on a lot of data sets, and we've had a lot of trouble finding anywhere that a weight decay of 0.1 isn't great. So why isn't 0.1 the default? 01:17:54.820 |
Because in those rare occasions where you have too much weight decay, no matter how much you train, it just never quite fits well enough, 01:18:05.620 |
whereas if you have too little weight decay, you can still train well — you'll just start to overfit, so you just have to stop a little bit early. 01:18:15.220 |
So we've been a little bit conservative with our defaults, but my suggestion to you is this. Now that you know that every learner has a WD argument and I should mention you won't always see it in this list. 01:18:29.120 |
Right? Because of this concept of kwargs in Python, which is basically parameters that are going to get passed up the chain to the next thing that we call — 01:18:45.220 |
this constructor — and this constructor has a wd. So this is just one of those things: you can either look it up in the docs, or you now know it — any time you're constructing a learner from pretty much any kind of function in fast AI, you can pass wd. 01:19:04.120 |
OK, and passing 0.1 instead of the default 0.01 will often help. So give it a go. 01:19:26.620 |
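For example — a hedged sketch, assuming the collaborative filtering data bunch from earlier is sitting in a variable called data — passing a non-default weight decay looks like this:

```python
# fastai v1 style; wd gets passed up through **kwargs to the underlying Learner constructor
learn = collab_learner(data, n_factors=40, y_range=(0, 5.5), wd=0.1)
```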
Now I want to go back to lesson 2 SGD, because everything we're doing for the rest of today really is based on this, right? This is where we created some data, 01:19:37.320 |
and then we added a loss function, MSE, and then we created a function called update which calculated our predictions — that's our weight matrix multiply; this is just one layer, so there's no ReLU — and we calculated our loss using that mean squared error. 01:19:56.720 |
We calculated the gradients using loss.backward(). We then subtracted, in place, 01:20:02.720 |
the learning rate times the gradients, and that is gradient descent. So if you haven't reviewed lesson 2 SGD, please do, because this is our starting point. If you don't get this, then none of this is going to make sense. If you're watching the video, maybe pause now, go back, rewatch that part of lesson 2, and make sure you get it. 01:20:25.720 |
Remember, a.sub_() is basically the same as a minus-equals, because a.sub is subtract, and in PyTorch, if you add an underscore to anything it means do it in place. So this is updating our a parameters, which started out at some numbers we just arbitrarily picked, and it gradually makes them better, right? So 01:21:01.120 |
we're trying to calculate the parameters — I'm going to call them weights, because that's just more common — 01:21:15.120 |
at epoch (or time) t, and they're going to be equal to whatever the weights were at the previous epoch, 01:21:24.820 |
minus our learning rate multiplied by the derivative of the loss with respect to the weights: w_t = w_{t-1} - lr · dL/dw_{t-1}. 01:21:49.320 |
OK, and we don't have to calculate the derivative, because it's boring and because computers do it for us fast, 01:21:57.020 |
And then they store it here for us, so we're good to go. 01:22:07.020 |
Make sure you're exceptionally comfortable with either that equation or 01:22:12.820 |
that line of code, because they're the same thing. 01:22:41.920 |
Now, our loss is a function of our inputs and our weights, 01:22:48.720 |
and in our case we're using mean squared error, for example, and it's between 01:22:55.120 |
our predictions and our actuals. Right, so where do X and W come in? Well, our predictions come from running some model — we'll call it m — 01:23:06.620 |
on those inputs, and that model contains some weights. So that's what our loss function might be — something like MSE(m(X, W), y) — and it could be all kinds of other loss functions; we'll see some more today. 01:23:29.720 |
So we're going to do something else: we're going to add 01:23:35.620 |
weight decay — some number, which in our case is 0.1 — times the sum of the squares of the weights. 01:24:06.120 |
Let's see this in code — not using synthetic data, but some real data. We're going to use MNIST, the hand-drawn digits, right? But we're going to do this as a standard 01:24:18.620 |
Fully connected net not as a convolutional net 'cause we haven't learnt the details of how to really create one of those from scratch. 01:24:26.620 |
So in this case, deeplearning.net actually provides MNIST as a Python pickle file — in other words, a file that Python's pickle can just open up, and it'll give you NumPy arrays straight away, and they're flat NumPy arrays, so we don't have to do anything to them. So go grab that. 01:24:47.120 |
And it's a gzipped file, so you can actually just gzip.open it directly, 01:24:51.820 |
and then you can pickle.load it directly (with encoding='latin-1'), and that'll give us the training, the validation and the test set. I don't care about the test set, so — generally in Python, if there's something you don't care about, you tend to use this special variable called underscore. There's no reason you have to; it's just that people know 01:25:17.100 |
you mean 'I don't care about this', right? So there's a training x and y, and a valid x and y. Now this actually comes in — as you can see if I print the shape — as 50,000 rows by 784 columns, but the 784 columns are actually 28 by 28 pixel pictures. So if I reshape one of them into a 28 by 28 pixel picture and plot it, then you can see it's the number 5. 01:25:43.600 |
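As a sketch, assuming the file has been downloaded locally as 'mnist.pkl.gz', that loading step looks roughly like this:

```python
import gzip, pickle
from matplotlib import pyplot as plt

with gzip.open('mnist.pkl.gz', 'rb') as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')

print(x_train.shape)                                  # (50000, 784)
plt.imshow(x_train[0].reshape(28, 28), cmap='gray')   # the first image is a hand-drawn 5
```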
OK, so that's our data. We've seen MNIST before in its kind of pre reshaped version. Here it is in its flattened version, so I'm going to be using it in its flattened version. 01:25:53.040 |
And currently they are NumPy arrays; I need them to be tensors, so I can just map torch.tensor across all of them, and now they're tensors. OK. 01:26:09.900 |
I may as well create a variable with the number of things I have, which we normally call n, and another one which we normally call c — 01:26:18.700 |
except here it's not the number of activations; it's the number of columns, which is not a great name for it, sorry. 01:26:26.000 |
OK, so there we are. And then for the y, not surprisingly, the minimum value is zero and the maximum value is nine, because those are the digits we're trying to predict. Great. 01:26:38.700 |
So in lesson 2 SGD, we created our data by actually adding a column of ones on, so that we didn't have to worry about bias — we're not going to do that; we're going to have PyTorch do that kind of implicitly for us. We had to write our own MSE function — we're not going to do that. We had to write our own little matrix multiplication thing — we're not going to do that. We're going to have PyTorch do all this stuff for us now. OK, and what's more — and this is really important — we're going to do mini-batches. 01:27:08.820 |
Right, this is a big enough data set. We probably don't want to do it all at once. 01:27:12.920 |
So if you want to do mini-batches — and we're not going to use too much fast AI stuff here — 01:27:21.920 |
PyTorch has something called TensorDataset, which basically grabs any kind of tensor, or two tensors, and creates a data set. Remember, a data set is something where if you index into it, you get back an x value and a y value — just one of each. 01:27:40.620 |
OK, so it kind of looks a lot like a list of (x, y) tuples. 01:27:49.520 |
Once you have a data set, you can use a little bit of convenience by calling DataBunch.create, and what that's going to do is create data loaders for you. A data loader is something where you don't say 'I want the first thing' or 'the fifth thing' — you just say 'I want the next thing', and it will give you a batch, a mini-batch of whatever size you asked for. 01:28:13.220 |
And specifically, it will give you the x and the y of a mini-batch. So if I just grab the next of the iterator — this is just standard Python, if you haven't used iterators in Python before — from my training data loader that DataBunch.create creates for you, 01:28:30.120 |
you can check that, as you would expect, the x is 64 by 784 — because there are 784 pixels flattened out, and 64 in a mini-batch — and the y is just 64 numbers, the things we're trying to predict. And if you look at the source code for DataBunch.create, you'll see there's not much there, so feel free to do so. We just make sure that your training set gets randomly shuffled for you, 01:28:57.920 |
We make sure that the data is put on the GPU for you. Just a couple of little convenience things like that. 01:29:04.220 |
But don't let it be magic. If it feels magic, check out the source code to make sure you see what's going on. OK. 01:29:11.320 |
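Spelled out, and assuming the fastai v1 API being used in this lesson, that whole mini-batch setup is roughly this sketch:

```python
import torch
from torch.utils.data import TensorDataset
from fastai.basics import DataBunch    # assumed fastai v1 import path

bs = 64
train_ds = TensorDataset(x_train, y_train)           # index it and you get back one (x, y) pair
valid_ds = TensorDataset(x_valid, y_valid)
data = DataBunch.create(train_ds, valid_ds, bs=bs)   # wraps them in shuffled, device-aware DataLoaders

x, y = next(iter(data.train_dl))                     # "give me the next thing": one mini-batch
print(x.shape, y.shape)                              # torch.Size([64, 784]) torch.Size([64])
```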
So rather than doing this y_hat = x@a thing, we're going to create an nn.Module. If you want to create an nn.Module that does something different to what's already out there, you have to subclass it, right? 01:29:27.020 |
So subclassing is very, very, very normal in PyTorch. If you're not comfortable with subclassing stuff in Python, go read a couple of tutorials to make sure you are. The main thing is you have to override the constructor, dunder init, and make sure that you call the superclass's constructor, because nn.Module's constructor is going to set it all up to be a proper nn.Module for you. 01:29:54.320 |
So if you're trying to create your own PyTorch subclass and things don't work, it's almost certainly because you forgot this line of code. 01:30:02.620 |
All right, so the only thing we want to add is an attribute in our class which contains a linear layer, an nn.Linear module. What is an nn.Linear module? It's something which does 01:30:21.120 |
that — but actually it doesn't only do that; it actually does x@a + b. So in other words, we don't have to add the column of ones. OK, that's all it does. So if you want to play around, why don't you try and create your own 01:30:36.720 |
nn.Linear class? You could create something called MyLinear, and it'll take you — depending on your PyTorch background — an hour or two, and then you'll feel like, OK, none of this is magic, and you know all of the things necessary to create this now. 01:30:55.420 |
So these are the kind of things that you should be doing for your assignments this week — not so much new applications, but trying to start writing more of these things from scratch and getting them to work: learning how to debug them, checking what's going in and out, and so forth. OK. 01:31:09.820 |
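If you do try that MyLinear assignment, a minimal sketch (MyLinear is a hypothetical name, and the initialization here is just one reasonable choice) might look like:

```python
import math, torch
from torch import nn

class MyLinear(nn.Module):
    "A from-scratch stand-in for nn.Linear: computes x @ a + b."
    def __init__(self, n_in, n_out):
        super().__init__()
        # nn.Parameter registers the tensors so model.parameters() can find them
        self.a = nn.Parameter(torch.randn(n_in, n_out) / math.sqrt(n_in))
        self.b = nn.Parameter(torch.zeros(n_out))

    def forward(self, x): return x @ self.a + self.b
```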
But we could just use nn.Linear, and that's just going to do it for us: it's going to have a def forward in it that does x@a + b, right? 01:31:19.720 |
And so then in our forward, how do we calculate the result of this? Well, remember every nn.Module looks like a function, so we pass our x mini-batch — I tend to use xb to mean a batch of x — to self.lin, and that's going to give us back the result of x@a + b on this mini-batch. 01:31:42.020 |
So this is a logistic regression model. A logistic regression model is also known as a neural net with no hidden layers — a one-layer neural net with no nonlinearities. 01:31:55.420 |
Because we're doing stuff ourselves a little bit, we have to put the weight matrices — the parameters — onto the GPU manually, so just call .cuda() to do that. So here's our model, and as you can see, the nn.Module machinery has automatically given us a representation of it: it's automatically stored the .lin thing and it's telling us what's inside it. So there are a lot of little conveniences that PyTorch does for us. 01:32:22.220 |
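Putting that together, the logistic regression model being described is roughly this (a sketch of the idea; the notebook version may differ in small details):

```python
from torch import nn

class Mnist_Logistic(nn.Module):
    def __init__(self):
        super().__init__()                        # essential: sets up the nn.Module machinery
        self.lin = nn.Linear(784, 10, bias=True)  # x @ a + b, with the bias handled for us

    def forward(self, xb): return self.lin(xb)

model = Mnist_Logistic().cuda()                   # put the parameters onto the GPU
```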
So if you look now at model.lin, you can see — not surprisingly — here it is. Perhaps the most interesting thing to point out is that our model automatically gets a bunch of methods and properties, and perhaps the most interesting one is the one called parameters, which contains all of the yellow squares from our picture: it contains our parameters, our weight matrices and 01:32:52.620 |
bias vectors, in as much as they're different. So if we have a look at p.shape for p in model.parameters(), 01:33:01.820 |
we get something of size 10 by 784, and something of size 10. So what are they? The 10 by 784 one is the thing that's going to take in a 784-dimensional input and spit out a 10-dimensional output — which is handy, because our input is 784-dimensional and we need something that's going to give us the probability of each of 10 numbers. 01:33:20.820 |
After that happens, we've got 10 activations, which we then want to add the bias to — so there we go, here's a vector of length 10. So you can see why 01:33:31.620 |
this model we've created has exactly the stuff that we need to do our x@a + b. 01:33:39.220 |
So let's grab a learning rate. We're going to come back to this loss function in a moment, but we can't use MSE here. Well — 01:33:47.520 |
we can't really use MSE for this, right, because we're not trying to say how close you were: 'Did you predict 3 and actually it was 4? Gosh, you were really close.' No — 3 is just as far away from 4 as 0 is from 4 when you're trying to predict which digit somebody drew. So we're not going to use MSE; we're going to use cross entropy loss, which we'll look at in a moment. 01:34:09.120 |
And here's our update function. I copied it from lesson 2 SGD. 01:34:13.020 |
But now we're calling our model rather than going a at X. We're calling our model as if it was a function to get Y hat. 01:34:21.620 |
And we're calling our loss_func rather than calling MSE to get our loss. And then this is all the same as before, except that rather than updating a single tensor by hand, we go through each parameter and go parameter.sub_(learning rate times gradient) — 01:34:39.420 |
because, very nicely for us, PyTorch will automatically create this list of the parameters of anything that we created in our dunder init. 01:34:52.820 |
I've got this thing called w2: I go through each p in model.parameters() and I add to w2 01:35:02.520 |
the sum of its squares, so w2 now contains my sum of squared weights, and then I multiply it by some number, which I set to 1e-5. 01:35:16.420 |
OK, so when people talk about weight decay, it's not some amazing, magic, complex thing containing thousands of lines of CUDA C++ code — 01:35:29.620 |
it's those two lines of Python. That's weight decay. This is not a simplified version that's just enough for now; this is weight decay. That's it. 01:35:40.220 |
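For reference, here is a sketch of the whole update function with those two lines in it — assuming model, a loss function loss_func, and a learning rate lr as in the discussion above:

```python
import torch

wd = 1e-5                                        # the weight decay constant

def update(x, y, lr):
    y_hat = model(x)
    # weight decay: sum of squared parameters, scaled by wd, added to the loss
    w2 = 0.
    for p in model.parameters(): w2 += (p**2).sum()
    loss = loss_func(y_hat, y) + wd * w2
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p.sub_(lr * p.grad)                  # gradient descent step, in place
            p.grad.zero_()                       # reset the gradient for the next batch
    return loss.item()                           # plain Python number, so we don't hang onto GPU memory
```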
There's a really interesting kind of dual way of thinking about weight decay. 01:35:47.220 |
One is that we're adding the sum of the squared weights to the loss, and that seems like a very sound thing to do — and it is. So let's go ahead and run this. 01:35:56.920 |
Here I've just got a list comprehension that's going through my data loader — the data loader gives you back one mini-batch at a time, giving you x,y each time — and I call update for each one; each call returns the loss. 01:36:22.520 |
Since I did it all on the GPU, that loss is sitting on the GPU, and it's got all this stuff attached to it to calculate gradients, so it's going to use up a lot of memory. If you call .item() on a scalar tensor, it turns it into an actual normal Python number — so this just means I'm returning normal Python numbers, and then I can plot them. And yeah, there you go: my loss function is going down, and 01:36:49.920 |
it's really nice to try this stuff and see that it behaves as we expect. We thought this is what would happen: as we get closer and closer to the answer, it bounces around more and more, because we're close to where we should be — it's probably getting flatter in weight space, so we kind of jump further — and so you can see why we would probably want to be reducing our learning rate as we go: learning rate annealing. 01:37:19.120 |
Now, the loss is only interesting for training a neural net because 01:37:28.720 |
we take the gradient of it — that's the thing that actually updates the weights. So really the only interesting thing 01:37:39.520 |
about wd times the sum of w squared is its gradient. We don't do a lot of math here, but I think we can handle this. 01:37:50.220 |
The gradient of this whole thing, if you remember back to your high school math, is equal to the gradient of each part taken separately, added together. 01:38:01.520 |
So let's just take the gradient of that part, because we already know the gradient of this part — it's just whatever we had before. So what's the gradient of wd times the sum of w squared? 01:38:11.220 |
Let's remove the sum and pretend there's just one parameter — it doesn't change the generality of it. So the gradient of wd times w squared is 2 · wd · w. 01:38:34.620 |
Right, and remember wd is our constant, which in that little loop was 1e-5, 01:38:43.120 |
and w is our weights — and we could replace wd with 2·wd without loss of generality, so let's throw away the two: the gradient is just wd · w. 01:38:55.520 |
So in other words all weight decay does is it subtracts some constant times the weights every time we do a batch. 01:39:13.020 |
When it's in the form where we add the sum of squares to the loss function, that's called L2 regularization. 01:39:24.220 |
When it's in this form where we subtract WD times weights from the gradients that's called weight decay. 01:39:33.520 |
And they are kind of mathematically identical for everything we've seen so far in fact they are mathematically identical and we'll see in a moment a place where they're not where things get interesting. 01:39:48.020 |
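Here is a tiny sketch of those two formulations side by side, just to see that for plain SGD they produce the same step (p stands in for one parameter tensor and g for the gradient of the un-penalized loss):

```python
import torch

def l2_reg_step(p, g, lr, wd):
    # L2 regularization: wd * p**2 lives in the loss, so its gradient wd * p
    # (folding the 2 into wd) arrives added onto the ordinary gradient.
    return p - lr * (g + wd * p)

def weight_decay_step(p, g, lr, wd):
    # Weight decay: leave the loss alone and shrink the weights directly in the step.
    return p - lr * g - lr * wd * p

p, g = torch.randn(5), torch.randn(5)
assert torch.allclose(l2_reg_step(p, g, 0.01, 0.1), weight_decay_step(p, g, 0.01, 0.1))
```

Once momentum or Adam is in the picture the two stop being identical, which is the place where things get interesting, as mentioned above.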
Okay, so this is just a really important tool you now have in your toolbox you can make giant neural networks. 01:39:55.920 |
And still avoid overfitting by adding more weight decay okay or you could use really small data sets with moderately large sized models and avoid overfitting with weight decay. 01:40:10.020 |
It's not magic right like you might still find you don't have enough data in which case like you get to the point where you're not overfitting. 01:40:17.820 |
By adding lots of weight decay and it's just not training very well that can happen. 01:40:22.120 |
But at least this is something that you can now play around with. 01:40:31.820 |
Now that we've got this update function we could replace this MNIST logistic with MNIST neural network and build a neural network from scratch. 01:40:43.020 |
Now we just need two linear layers. In the first one we could use a weight matrix with an output size of 50, and then we need to make sure that the second linear layer has an input of size 50 so it matches; the final layer has to have an output of size 10, because that's the number of classes we're predicting. And so now our forward just goes: do the first linear layer, 01:41:03.620 |
calculate a ReLU, do the second linear layer — and now we've actually created a neural net from scratch. I mean, we didn't write nn.Linear, but you could write it yourself, or you could do the matrices directly; you know how to. 01:41:19.320 |
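A sketch of that two-layer network (50 hidden activations is just the size used in this example):

```python
import torch.nn.functional as F
from torch import nn

class Mnist_NN(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin1 = nn.Linear(784, 50)   # 784 pixels in, 50 hidden activations out
        self.lin2 = nn.Linear(50, 10)    # 50 in, 10 out: one per digit class

    def forward(self, xb):
        x = self.lin1(xb)
        x = F.relu(x)                    # the nonlinearity that makes it a real neural net
        return self.lin2(x)

model = Mnist_NN().cuda()
```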
So again, if we go model.cuda(), then we can calculate losses with the exact same update function — there it goes. So this is why this idea of neural nets is so easy: once you have something that can do gradient descent, then you can try different models. 01:41:40.120 |
And then you can start to add more PyTorch stuff. So, rather than doing all this stuff yourself, why not just go 01:41:48.120 |
opt = optim.something? The something we've done so far is SGD, and so now you're saying to PyTorch: I want you to take these parameters and optimize them using SGD. And so now, rather than saying for p in parameters: 01:42:09.920 |
p -= lr * p.grad, you just say opt.step() — it's the same thing. 01:42:16.620 |
It's just less code, and it does the same thing. But the reason it's particularly interesting is that now you can replace 01:42:25.020 |
SGD with Adam, for example, and you can even add things like weight decay, 01:42:36.820 |
because there's more stuff that's built into these things for you. So that's why we tend to use 01:42:43.820 |
optim.whatever — and behind the scenes, this is actually what we do in fast AI. 01:43:00.220 |
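A sketch of the optim version of the update — swapping optim.SGD for optim.Adam is then a one-line change, and both accept a weight_decay argument (assuming model and loss_func from above):

```python
from torch import optim

opt = optim.SGD(model.parameters(), lr=2e-2, weight_decay=1e-5)
# or: opt = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

def update(x, y):
    loss = loss_func(model(x), y)
    loss.backward()
    opt.step()        # replaces the manual "p -= lr * p.grad" loop
    opt.zero_grad()   # replaces the manual p.grad.zero_()
    return loss.item()
```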
OK, so there's that — that's just that same picture. But if we change to a different optimizer... 01:43:06.920 |
Look what happened: it diverged. We've seen a great picture of that from one of our students, who showed what divergence looks like — this is what it looks like when you try to train something and it diverges. So since we're using a different optimizer, we need a different learning rate. 01:43:29.220 |
And you can't just continue training, because by the time it's diverged, the weights are really, really big and really, really small — they're not going to come back — so start again. 01:43:37.620 |
OK, here's a better learning rate. But look at this: we're down underneath 0.5 by about epoch 200, whereas before I'm not even sure we ever got to quite that level. So what's going on? What's Adam? 01:43:58.720 |
We're going to do gradient descent in Excel, because why wouldn't you? OK, so here are some randomly generated X's, and the Y's are all calculated by doing a·x + b, where a is 2 and b is 30. So this is some data that we're going to try and match. 01:44:25.520 |
And here is SGD. So we're going to do it with SGD. Now, in our lesson 2 SGD notebook we did the whole data set at once as a batch; in the notebook we just looked at, we did mini-batches; and in this spreadsheet we're going to do online gradient descent, which means every single row of data is a batch — a batch size of one. 01:44:51.620 |
So as per usual, we're going to start by picking an intercept and slope kind of arbitrarily — I'm just going to pick them as one; it doesn't really matter. 01:44:59.720 |
So here I've copied over the data — this is my x and y — and my intercept and slope, as I said, are one; I'm just literally referring back to this cell here. 01:45:11.120 |
So my prediction for this particular intercept and slope would be 14 times 1 plus 1, which is 15, and so there's my error — it's not even a sum at this point, it's just the squared error. OK. 01:45:26.720 |
So now I need to calculate the gradient so that I can update. There are two ways you can calculate the gradient. One is analytically — you can just look them up on Wolfram Alpha or whatever, so there are the gradients if you write them out by hand or look them up — or you can do something called finite differencing, because remember, a gradient is just 01:45:50.420 |
how far the outcome moves divided by how far your change was, for really small changes. So let's just make a really small change: we take 01:46:08.920 |
our intercept and add 0.01 to it, and then calculate our loss. 01:46:15.420 |
And you can see that our loss went down a little bit — and we added 0.01 here, so our derivative is that difference divided by that 0.01. 01:46:29.220 |
And that's called finite differencing. You can always do derivatives with finite differencing; it's slow, and we don't do it in practice, but it's nice for just checking stuff out. So we can do the same thing for our a term: add 0.01 to that, take the difference, and divide by 0.01. Or, as I say, we can calculate it directly using the actual analytical derivative, and you can see that that and that are, as you'd expect, very similar, and that and 01:46:56.420 |
that are very similar. So gradient descent then just says: let's take our current value of that weight and subtract the learning rate times the derivative. 01:47:12.320 |
Then we carry that intercept and that slope over to the next row and do it again, 01:47:19.520 |
And do it lots of times and at the end we've done one epoch. 01:47:25.520 |
So at the end of that epoch we could say oh great so this is. 01:47:30.320 |
Our slope so let's copy that over to where it says slope. 01:47:39.220 |
and there's our intercept, so copy it to where it says intercept. 01:47:48.020 |
OK, so that's kind of boring, all that copying and pasting, so I created a very sophisticated macro which copies and pastes for you — I just recorded it, basically — and then I created a very sophisticated for loop that goes through and does it 5 times. 01:48:10.220 |
And I attached that to the run button, so if I press run it'll go ahead and do it 5 times and just keep track of the error each time. OK, so that is SGD, and as you can see it is just 01:48:24.520 |
infuriatingly slow — particularly the intercept. 01:48:32.720 |
We're still only up to 1.57, and it's just going so slowly. 01:48:40.720 |
So let's speed it up. The first thing we can do to speed it up is to use something called momentum. So here's the exact same spreadsheet as the last worksheet — I've removed the finite differencing version of the derivatives because they're not that useful; it's just the analytical ones here — and here's the thing where I take the 01:49:02.720 |
derivative and I'm going to update by the derivative — but what I do (it's kind of more interesting to look at this one) 01:49:12.520 |
is I take the derivative and I multiply it by 0.1, 01:49:19.820 |
and then I look at the previous update and I multiply that by 0.9, and I add the two together. So in other words, the update that I do is not just based on the derivative: 01:49:33.020 |
But a tenth of it is the derivative and 90% of it is just the same direction I went last time. 01:49:41.420 |
And this is called momentum right what it means is remember how. 01:49:54.520 |
If you're trying to find the minimum of this and you were here and your learning rate was too small. 01:50:00.920 |
Right, and you just keep doing the same little steps. Well, if you keep doing the same steps, but you also add in the step you took last time, 01:50:13.420 |
And your steps are going to get bigger and bigger aren't they. 01:50:18.220 |
Eventually they go too far — but now, of course, your gradient is pointing the other direction to where your momentum is pointing, so you might just take a little step back over here, and then you'll start going: small steps, bigger steps, bigger steps, bigger steps, like that. So that's kind of what momentum does. Or, if you're 01:50:44.520 |
oscillating back and forth like this — which is also slow — then the average of your last few steps is actually somewhere in the middle, which is where you actually want to go. 01:51:01.720 |
Right, so this is a really common idea: you have something that says 01:51:10.420 |
my step at time t 01:51:14.120 |
equals some number — people often use alpha, because, like I say, you've got to love these Greek letters — times 01:51:30.220 |
the actual thing I want to do, which in this case is the gradient, plus one-minus-alpha times whatever the step was at time t minus 1: S_t = alpha · grad + (1 − alpha) · S_{t−1}. 01:51:46.520 |
This thing here is called an exponentially weighted moving average and the reason why is that if you think about it. 01:51:58.920 |
These one-minus-alphas are going to multiply: S at t minus 2 is in there with a one-minus-alpha squared, and S at t minus 3 is in there with a one-minus-alpha cubed. So in other words, this ends up being 01:52:13.420 |
The actual thing I want plus a weighted average of the last few time periods where the most recent ones are exponentially higher weighted. 01:52:25.720 |
OK and this is going to keep popping up again and again right so that's what momentum is it says I want to go based on the current gradient. 01:52:33.920 |
Plus the exponentially weighted moving average of my last few steps. 01:52:40.120 |
So that's useful — that's called SGD with momentum, and we can do it by changing this here to say SGD with 01:52:52.420 |
momentum. And a momentum of 0.9 is really common — it's so common that it's just about always 0.9 for basic stuff. That's how you do SGD with momentum. 01:53:03.420 |
And again, I didn't show you some simplified version — I showed you the version that is SGD with momentum. OK, so you can write your own; try it out. That would be a great assignment: take lesson 2 SGD and add momentum to it, 01:53:21.720 |
or, even with the new notebook we've got for MNIST, get rid of the optim. bit and write your own update function with momentum. 01:53:29.920 |
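If you take up that assignment, a minimal from-scratch sketch of the momentum update — using the 0.9 / 0.1 exponentially weighted form from the spreadsheet, and assuming model and loss_func from the MNIST notebook above — might look like:

```python
import torch

lr, mom = 2e-2, 0.9
# one running "step" buffer per parameter, holding the exponentially weighted moving average
steps = [torch.zeros_like(p) for p in model.parameters()]

def update_with_momentum(x, y):
    loss = loss_func(model(x), y)
    loss.backward()
    with torch.no_grad():
        for p, s in zip(model.parameters(), steps):
            s.mul_(mom).add_((1 - mom) * p.grad)   # s = 0.9 * s + 0.1 * grad
            p.sub_(lr * s)                         # step in the smoothed direction
            p.grad.zero_()
    return loss.item()
```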
Then there's a cool thing called RMSProp. One of the really cool things about RMSProp is that Geoffrey Hinton created it — famous neural net guy. 01:53:41.920 |
Everybody uses it; it's really popular, it's really common. And the correct citation for RMSProp is the Coursera online free MOOC — that's where he first mentioned RMSProp. So I love this thing where cool new things appear in MOOCs, not in papers. 01:54:03.220 |
So RMSProp is very similar to momentum, but this time we have an exponentially weighted moving average not of the gradient updates but of F8 squared — F8 is the cell with the gradient, so this is the gradient squared: the gradient squared times 0.1, plus the previous value times 0.9. 01:54:27.920 |
So this is an exponentially weighted moving average of the gradient squared. What's this number going to mean? Well, if my gradient is really small and consistently really small, this will be a small number; 01:54:40.520 |
if my gradient is highly volatile, it's going to be a big number, or if it's just really big all the time, it'll be a big number. And why is that interesting? Because when we do our update this time, we say: take the weight and subtract the learning rate times the 01:55:02.720 |
gradient divided by the square root of this. So in other words, if our gradient is consistently very small and not volatile, let's take bigger jumps. And that's kind of what we want, right? When we watched how the intercept moved so damn slowly, it was obvious: you need to just try going faster. 01:55:29.420 |
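In the same style, a sketch of the RMSProp idea (the 0.9 / 0.1 split and the small eps are conventional choices, nothing special):

```python
import torch

lr, alpha, eps = 1e-3, 0.9, 1e-8
sq_avg = [torch.zeros_like(p) for p in model.parameters()]   # EWMA of grad**2, one per parameter

def update_rmsprop(x, y):
    loss = loss_func(model(x), y)
    loss.backward()
    with torch.no_grad():
        for p, sq in zip(model.parameters(), sq_avg):
            sq.mul_(alpha).add_((1 - alpha) * p.grad**2)     # sq = 0.9 * sq + 0.1 * grad^2
            p.sub_(lr * p.grad / (sq.sqrt() + eps))          # small, steady gradients => bigger steps
            p.grad.zero_()
    return loss.item()
```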
After just five epochs, this is already up to 3, whereas with the basic version 01:55:39.020 |
it's still at 1.27 — and remember, we have to get to 30. So the obvious thing to do — and by 'obvious' I mean it was only a couple of years ago that anybody actually figured this out — is to do both. 01:55:54.220 |
So that's called Adam. Adam simply keeps track of the exponentially weighted moving average of the gradient squared 01:56:02.820 |
and also keeps track of the exponentially weighted moving average of my steps, and it divides by the exponentially weighted moving average of the squared terms and 01:56:20.120 |
takes 0.9 of a step in the same direction as last time — so it's momentum and RMSProp together. That's called Adam. And look at this. 01:56:37.120 |
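And a sketch of Adam as just those two ideas glued together — an EWMA of the gradient (momentum) divided by the square root of an EWMA of the squared gradient (RMSProp); the full algorithm also applies bias-correction terms, which this sketch leaves out:

```python
import torch

lr, beta1, beta2, eps = 1e-3, 0.9, 0.99, 1e-8
avg_grad = [torch.zeros_like(p) for p in model.parameters()]
avg_sq   = [torch.zeros_like(p) for p in model.parameters()]

def update_adam(x, y):
    loss = loss_func(model(x), y)
    loss.backward()
    with torch.no_grad():
        for p, g, s in zip(model.parameters(), avg_grad, avg_sq):
            g.mul_(beta1).add_((1 - beta1) * p.grad)      # momentum: EWMA of the gradient
            s.mul_(beta2).add_((1 - beta2) * p.grad**2)   # RMSProp: EWMA of the squared gradient
            p.sub_(lr * g / (s.sqrt() + eps))
            p.grad.zero_()
    return loss.item()
```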
OK, so these optimizers — people call them dynamic learning rates — a lot of people have the misunderstanding that with them you don't have to set a learning rate. 01:56:49.520 |
Of course you do. They're just trying to identify parameters that need to move faster, or that consistently go in the same direction; it doesn't mean you don't need a learning rate. We still have 01:57:04.220 |
a learning rate. And in fact, if I run this again — well, look at where my error currently is; 01:57:12.420 |
remember we're trying to get to 30 and 2 — so if I run it again, 01:57:32.920 |
now it's just moving around the same place, right? So you can see what's happened: the learning rate's too high. So we could just go in here and drop it down, 01:57:43.620 |
and we're getting pretty close now. So you can see how you still need learning rate annealing, even with Adam. 01:57:57.420 |
OK, so that spreadsheet's fun to play around with. I do have a Google Sheets version of basic SGD that actually works — the macros work and everything. 01:58:10.220 |
But Google Sheets is so awful, and I went so insane making that one work, that I gave up on making the other ones work — so I'll share a link to the Google Sheets version. 01:58:22.420 |
Oh my God they do have a macro language but it's just ridiculous so anyway if somebody feels like fighting it to actually get all the other ones to work they will work it just it's just annoying so maybe somebody can get this working on Google Sheets too. 01:58:37.020 |
Okay so that's weight decay and Adam and Adam is amazingly fast. 01:58:54.520 |
But we don't tend to use optim.whatever and create the optimizer ourselves and all that stuff, because instead we tend to use a Learner. 01:59:07.220 |
But a Learner is just doing those things for you — again, there's no magic. So if you create a learner, you say: here's my data bunch, 01:59:16.220 |
here's my PyTorch nn.Module instance, here's my loss function, and here are my metrics — remember, the metrics are just stuff to print out; that's it. 01:59:30.420 |
Then you just get a few nice things: learn.lr_find starts working, and it starts recording things for you, and you can say fit_one_cycle instead of just fit. And these things really help a lot: by using the learning rate finder I found a good learning rate, and then — look at this — my loss here is 0.13, whereas before I wasn't getting much beneath 0.5. 01:59:52.220 |
So these tweaks make huge differences, not tiny differences — and this is still just one epoch. 02:00:01.020 |
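With the fastai v1 API used in this lesson, the whole manual loop collapses to something like this sketch (the learning rate is whatever the finder suggests, not a magic number):

```python
from fastai.basics import *   # fastai v1, as in the lesson notebooks
from torch import nn

learn = Learner(data, Mnist_NN(), loss_func=nn.CrossEntropyLoss(), metrics=accuracy)
learn.lr_find()
learn.recorder.plot()          # pick a learning rate from the plot
learn.fit_one_cycle(1, 1e-2)   # one epoch of the 1cycle schedule
```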
Now, what does fit_one_cycle do? What does it really do? 02:00:06.120 |
This is what it really does. We've seen this chart on the left before; just to remind you, this is plotting the learning rate per batch — remember, Adam has a learning rate, and we use Adam by default 02:00:22.920 |
(or a minor variation of it, which we might try to talk about). 02:00:25.820 |
So the learning rate starts really low and it increases about half the time and then it decreases about half the time because at the very start. 02:00:37.420 |
we don't know where we are, right? We're in some part of function space that's just bumpy as all hell, so if you start jumping around, those bumps have big gradients and they'll throw you into crazy parts of the space. So start slow, 02:00:52.420 |
and then you'll gradually move into parts of the weight space that are kind of sensible, and as you get to the points where they're sensible, you can increase the learning rate — because the gradients are actually in the direction you want to go. And then, as we've discussed a few times, as you get close to the final answer, you need to anneal your learning rate to hone in on it. But here's the interesting thing: this other plot is the momentum plot, 02:01:21.820 |
And actually every time our learning rate is small. 02:01:24.820 |
our momentum is high. Why is that? Because if you do have a small learning rate but you keep going in the same direction, you may as well go faster. 02:01:35.420 |
But if you're jumping really far, don't also keep jumping really far in the same direction, because it's going to throw you off. And then, as you get to the end, you're fine-tuning in — but actually, if you keep going in the same direction again and again, 02:01:51.520 |
go faster. So this combination is called one cycle, and it's just amazing — it's a simple thing, but it's astonishing: it can help you get what's called super convergence, which can let you train ten times faster. And this is just last year's paper. Some of you may have seen the interview with Leslie Smith that I did last week — amazing guy, incredibly humble, 02:02:17.320 |
and also, I should say, somebody who is doing groundbreaking research well into his 60s. All of these things are inspiring. 02:02:25.220 |
I'll show you something else interesting when you plot the losses with fast AI it doesn't look like that. 02:02:31.120 |
It looks like that why is that because fast AI calculates the exponentially weighted moving average of the losses for you. 02:02:41.020 |
So this concept of exponentially weighted stuff is just really handy, and I use it all the time. It's there to make it easier to read these charts, but it does mean that these charts from fast AI might be kind of a batch or two behind where they should be. 02:02:59.820 |
That's the slight downside when you use an exponentially weighted moving average — you've got a little bit of history in there as well — but it can make it much easier to see what's going on. 02:03:10.020 |
So we're now at a point, coming to the end of this collab and tabular section, where we're going to try to understand all of the code in our tabular model. Remember, the tabular model used this data set called Adult, which is trying to predict who's going to make more money — it's a classification problem, 02:03:37.020 |
and we've got a number of categorical variables and a number of continuous variables. So the first thing we realize is that we actually don't know how to predict a categorical variable yet, because so far we did some hand-waving around the fact that our loss function was nn.CrossEntropyLoss. What is that? 02:03:57.020 |
Let's find out and of course we're going to find out. 02:04:01.020 |
by looking at Microsoft Excel. So cross entropy loss is just another loss function. You already know one loss function, which is mean squared error: (y hat minus y) squared. That's not a good loss function for us, because in our case — for MNIST, say — we have 10 possible digits and we have 10 activations, each with a probability of that digit. 02:04:26.020 |
So we need something where predicting the right thing confidently should have very little loss, and predicting the wrong thing confidently should have a lot of loss. That's what we want. 02:04:45.020 |
OK, so here's an example. Here is cat versus dog, one-hot encoded, 02:04:53.020 |
and here are my two activations for each one from some model that I built: probability of cat, probability of dog. 02:05:01.020 |
This one's not very confident of anything; this one's very confident of it being a cat and it's right; this one's very confident of it being a cat and it's wrong. 02:05:10.020 |
So the loss for this first one should be a moderate loss, because not predicting anything confidently is not really what we want — 02:05:18.020 |
so here it's 0.3. This one's predicting the correct thing very confidently, so 0.01. This one's predicting the wrong thing very confidently, so it gets a big loss. So how do we do that? 02:05:32.020 |
This is the cross entropy loss, and it is equal to: whether it's a cat, multiplied by the log of the probability of cat — well, this is actually an activation, so I should say multiplied by the log of the cat activation — 02:05:53.020 |
negated; minus: is it a dog, times the log of the dog activation. 02:06:01.020 |
And that's it. So in other words, it's the sum of all of your one-hot-encoded variables times the logs of all of your activations (with a minus sign in front). So, interestingly, 02:06:15.020 |
These ones here are exactly the same numbers as these ones here but I've written it differently. 02:06:21.020 |
I've written it with an if function, because it's exactly the same thing — the zeros don't actually add anything. 02:06:27.020 |
So actually it's exactly the same as saying: if it's a cat, then take the log of 02:06:36.020 |
cattiness, and otherwise — if it's a dog — take the log of one minus cattiness, in other words the log of dogginess. 02:06:47.020 |
So the sum of the one hot encoded times the activations is the same as an if function. 02:06:55.020 |
And if you think about it — because this is just a matrix multiply, and we now know from our embedding discussion that that's the same as an index lookup — 02:07:09.020 |
it means that to do cross entropy, you can also just look up the log of the activation for the correct answer. 02:07:18.020 |
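A tiny sketch with made-up numbers, just to see that the one-hot-times-log form and the index-lookup form give the same loss:

```python
import torch

probs  = torch.tensor([0.9, 0.1])   # softmax-ed activations for one example: [P(cat), P(dog)]
target = torch.tensor([1.0, 0.0])   # one-hot target: it's a cat

loss_onehot = -(target * probs.log()).sum()   # -(1 * log 0.9 + 0 * log 0.1)
loss_lookup = -probs[0].log()                 # just look up the log prob of the correct class
assert torch.allclose(loss_onehot, loss_lookup)
```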
Now, that's only going to work if these rows add up to one — and this is one reason that you can get screwy cross entropy numbers; it's why I said if you've pressed the wrong button — if they don't add up to one — you've got trouble. 02:07:35.020 |
How do you make sure that they add up to one you make sure they add up to one. 02:07:39.020 |
By using the correct activation function in your last layer and the correct activation function to use for this is softmax softmax is an activation function where all of the activations add up to one all of the activations are greater than zero and all of the activations are less than one so that's what we want right that's what we need. 02:08:04.020 |
How do you do that well let's say we were predicting one of five things cat dog plain fish building. 02:08:09.020 |
And these were the numbers that came out of our neural net for one set of predictions. 02:08:16.020 |
Well, what if I took e to the power of that? That's one step in the right direction, because e to the power of something is always bigger than zero — so there's a bunch of numbers that are always bigger than zero. 02:08:34.020 |
Here is e to the number divided by the sum of e to the numbers. Now this number is always less than one, 02:08:47.020 |
because all of the things were positive, so you can't possibly have one of the pieces be bigger than 100% of the sum, 02:08:53.020 |
and all of those things must add up to one, because each one of them was just that percentage of the total. 02:09:04.020 |
So that's it: this thing, softmax, is equal to e to the activation divided by the sum of e to the activations. 02:09:17.020 |
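A sketch with made-up activations for the five classes mentioned (cat, dog, plane, fish, building):

```python
import torch

acts = torch.tensor([0.02, -2.49, 1.25, 2.43, 0.59])   # made-up raw activations
sm = acts.exp() / acts.exp().sum()                      # every value positive and less than one
print(sm, sm.sum())                                     # sums to 1; matches torch.softmax(acts, dim=0)
```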
And so when we're doing single label multi class classification. 02:09:23.020 |
You generally want softmax as your activation function and you generally want cross entropy as your loss. 02:09:37.020 |
Pytorch will do them both for you right so you might have noticed that in this MNIST example. 02:09:45.020 |
I never added a softmax here — and that's because if you ask for cross entropy loss, it actually does the softmax inside the loss function. So it's not really just cross entropy loss; it's actually softmax, then cross entropy loss. 02:10:03.020 |
So — you've probably noticed this — sometimes the predictions from your models will come out looking more like this, pretty big numbers with negatives in, rather than numbers between nought and one that add up to one. 02:10:19.020 |
The reason would be that it's a PyTorch model that doesn't have a softmax in, because we're using cross entropy loss, and so you might have to do the softmax on the predictions yourself. 02:10:33.020 |
Fastai is getting increasingly good at knowing when this is happening: generally, if you're using a loss function that we recognize, then when you get the predictions we will try to add the softmax in there for you. But particularly if you're using a custom 02:10:49.020 |
loss function that might call nn.CrossEntropyLoss behind the scenes, or something like that, you might find yourself in this situation. 02:11:00.020 |
We only have 3 minutes left, but I'm going to point something out to you, which is that next week 02:11:09.020 |
we're going to finish off tabular, which we'll do in like the first 10 minutes. There's a 02:11:14.020 |
forward in the tabular model, and it basically goes through 02:11:20.020 |
a bunch of embeddings — call each one of those embeddings e, and you can use it like a function, of course — and it's going to pass each categorical variable into its embedding, 02:11:30.020 |
concatenate them together into a single matrix, and pass that through a bunch of layers, 02:11:39.020 |
which are basically a bunch of linear layers. 02:11:47.020 |
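Conceptually, that forward looks something like this sketch — a simplified stand-in for the real fastai TabularModel, leaving out the batch norm and dropout pieces that get covered next week:

```python
import torch
from torch import nn

class TinyTabularModel(nn.Module):
    "Sketch: one embedding per categorical column, concatenated with the continuous columns, then linear layers."
    def __init__(self, emb_szs, n_cont, n_hidden, n_out):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])
        n_emb = sum(nf for _, nf in emb_szs)
        self.layers = nn.Sequential(
            nn.Linear(n_emb + n_cont, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_out))

    def forward(self, x_cat, x_cont):
        x = [e(x_cat[:, i]) for i, e in enumerate(self.embeds)]  # pass each categorical to its embedding
        x = torch.cat(x + [x_cont], dim=1)                       # concatenate into a single matrix
        return self.layers(x)
```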
And then there are only 2 new things we need to learn. One is dropout and the other is batch norm, 02:11:59.020 |
and these are 2 additional regularization strategies — well, batch norm 02:12:05.020 |
does more than just regularization, but amongst other things it does regularization — and these are the basic ways you regularize your model. 02:12:18.020 |
And then you can also avoid overfitting using something called data augmentation. So batch norm and dropout we're going to touch on at the start of next week, 02:12:27.020 |
And we're also going to look at data augmentation and then we're also going to look at what convolutions are and we're going to learn some new computer vision architectures and some new computer vision applications. 02:12:42.020 |
But basically we're very nearly there. You already know how the entirety of. 02:12:47.020 |
collab.py in fastai.collab works — you know why it's there and what it does — and you're very close to knowing what the entirety of 02:12:58.020 |
the tabular model does. And this tabular model is actually the one where, if you run it on Rossmann, you'll get the same answer that I showed you in that paper — you'll get that second-place result. 02:13:13.020 |
I'll show you next week if I remember how I actually ran some additional experiments where I figured out some minor tweaks that can do even slightly better than that. 02:13:22.020 |
So yeah, we'll see you next week. Thanks very much and enjoy the smoke outside.