Back to Index

Lesson 7: Practical Deep Learning for Coders 2022


Chapters

0:00 Tweaking first and last layers
2:47 What are the benefits of using larger models
5:58 Understanding GPU memory usage
8:04 What is GradientAccumulation?
20:52 How to run all the models with specifications
22:55 Ensembling
37:51 Multi-target models
41:24 What does `F.cross_entropy` do
45:43 When do you use softmax and when not to?
46:15 Cross_entropy loss
49:53 How to calculate binary-cross-entropy
52:19 Two versions of cross-entropy in pytorch
54:24 How to create a learner for predicting two targets
62:00 Collaborative filtering deep dive
68:55 What are latent factors?
71:28 Dot product model
78:37 What is embedding
82:18 How do you choose the number of latent factors
87:13 How to build a collaborative filtering model from scratch
89:57 How to understand the `forward` function
92:47 Adding a bias term
94:29 Model interpretation
99:06 What is weight decay and how does it help?
103:47 What is regularization

Transcript

All right, welcome to lesson seven, the penultimate lesson of Practical Deep Learning for Coders part one. And today we're going to be digging into what's inside a neural net. We've already seen what's inside kind of the most basic possible neural net, which is a sandwich of fully connected layers, or linear layers, and ReLUs.

And so we built that from scratch, but there's a lot of tweaks that we can do. And most of the tweaks that we probably care about are tweaks to the very first layer or the very last layer, so that's where we'll focus. But over the next couple of weeks, we'll look at some of the tricks we can do inside as well.

So I'm going to do this through the lens of the paddy rice competition we've been talking about, and we got to a point where, let's have a look: we created a ConvNeXt model, we tried a few different types of basic preprocessing, and we added test time augmentation.

And then we scaled that up to larger images and rectangular images, and that got us into the top 25% of the competition. So that's part two of the so-called Road to the Top series, which is increasingly misnamed since, as we've been presenting these notebooks, more and more of our students have been passing me on the leaderboard.

So currently first and second place are both people from this class, Kurian and Nick. Go to hell, you're in my target, and leave my class immediately; and congratulations, good luck to you. So in part three, I'm going to show you a really interesting trick, a very simple trick, for scaling up these models further.

Which you may have discovered if you've tried to use larger models: you can replace the word "small" with the word "large" in those architecture names and try to train a larger model. A larger model has more parameters; more parameters means it can find more tricky little features, and broadly speaking, models with more parameters therefore ought to be more accurate.

The problem is that those activations, or more specifically the gradients that have to be calculated, chew up memory on your GPU. And your GPU is not as clever as your CPU, which is good at sticking stuff it doesn't need right now into virtual memory on the hard drive. When it runs out of memory, it runs out of memory.

And it also doesn't do as good a job as your CPU at shuffling things around to try and free up memory. It just allocates blocks of memory, and they stay allocated until you remove them. So if you try to scale up to bigger models, unless you have very expensive GPUs, you will run out of space.

And you'll get an error, something like CUDA out of memory error. So if that happens, first thing I'll mention is it's not a bad idea to restart your notebook because they can be a bit tricky to recover from otherwise. And then I'll show you how you can use as large a model as you like.

Well, almost: basically, you'll be able to use an extra-large model on Kaggle. So let me explain. Now, when you run something on Kaggle, like actually on Kaggle, you're generally going to be on a 16 gig GPU. And you don't have to run stuff on Kaggle; you can run stuff on your home computer or Paperspace or whatever.

But sometimes if you want to do Kaggle competition, sometimes you have to run stuff on Kaggle because a lot of competitions are what they call code competitions, which is where the only way to submit is from a notebook that you're running on Kaggle. And then a second reason to run stuff on Kaggle is that, you know, your notebooks will appear, you know, with the leaderboard score on them.

And so people can see which notebooks are actually good. And I kind of like, even in things that aren't code competitions, I love trying to be the person who's number one on the notebook score leaderboard, because that's something which, you know, you can't just work at Nvidia and use 1000 GPUs and win a competition through a combination of skill and brute force.

Everybody has the same nine hour timeout to work with, so I think it's a good way of keeping things a bit more fair. Now, my home GPU has 24 gig, so I wanted to find out: what can I get away with in 16 gig?

And the way I did that is, I think, a useful thing to discuss, because again, it's all about fast iteration. So I wanted to really quickly find out how much memory a model will use. There's a really quick, hacky way I can do that, which is to say: okay, for the training set, let's not use all the images. Here's the value counts of the labels, so the number of images of each disease.

Let's not look at all the diseases. Let's just pick one, the smallest one, right? And let's make that our training set. Our training set is the bacterial panicle blight images. And now I can train a model with just 337 images without changing anything else. Not that I care about that model, but then I can see how much memory it used.
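A minimal sketch of that trick might look like this; the dataset path and column names are assumptions based on the competition data described above, and the point is only to get a fast full training run for measuring peak GPU memory:

```python
import pandas as pd
from pathlib import Path

path = Path('paddy-disease-classification')      # assumed location of the Kaggle data
df = pd.read_csv(path/'train.csv')
print(df.label.value_counts())                   # smallest class: bacterial_panicle_blight (337 images)

# Point the training path at just that one class, purely so a full fit is quick
# enough to measure memory use; the resulting model itself doesn't matter.
trn_path = path/'train_images'/'bacterial_panicle_blight'
```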

It's important to realize that, you know, each image you pass through is the same size, each batch size is the same size, so training for longer won't use more memory. So that'll tell us how much memory we're going to need. So what I then did was I then tried training different models to see how much memory they used up.

Now, what happens when we train a model? Obviously ConvNeXt Small doesn't use too much memory; here's something that reports the amount of GPU memory used, just by basically printing out CUDA's GPU processes, and you can see ConvNeXt Small took up 4 gig. And this might also be interesting to you.

If you then call Python's garbage collection, gc.collect, and then call PyTorch's empty cache, that should basically get your GPU back to a clean state of not using any more memory than it needs to, and you can start training the next model without restarting the kernel. So what would happen if we tried to train this little model and it crashed with a CUDA out of memory error?
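As a rough sketch, that reporting-and-cleanup pattern could look like this (this is the idea described above, not necessarily the exact helper used in the notebook):

```python
import gc
import torch

def report_gpu():
    # Print per-process GPU memory usage, then release whatever PyTorch is still caching.
    print(torch.cuda.list_gpu_processes())
    gc.collect()                  # drop lingering Python references to freed tensors
    torch.cuda.empty_cache()      # hand cached blocks back so the next model starts clean
```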

What do we do? We can use a cool little trick called gradient accumulation. What's gradient accumulation? Well, I added this parameter to my train method here. So my train method creates my data loaders, creates my learner, and then, depending on whether I'm fine-tuning or not, either fits or fine-tunes it.

But there's one other thing it does. It does this gradient accumulation thing. What's that about? Well the key step is here. I set my batch size, so that's the number of images that I pass through to the GPU all at once, to 64, which is my default, divided by -- // means integer divide in Python -- divided by this number.

So if I pass 2, it's going to use a batch size of 32. If I pass 4, it'll use a batch size of 16. Now that obviously should let me cure any memory problems, use a smaller batch size. But the problem is that now the dynamics of my training are different, right?

The smaller your batch size, the more volatility there is from batch to batch, so now your learning rates are all messed up. You don't want to be messing around with trying to find a different set of optimal parameters for every batch size, for every architecture. So what we want to do is find a way to run, let's say, accum=2.

Let's say we just want to run 32 images at a time through. How do we make it behave as if it was 64 images? Well the solution to that problem is to consider our training loop. This is basically the training loop we used from a couple of lessons ago, the one we created manually.

So each x, y pair in the data loader, we calculate the loss using some coefficients based on that x, y pair. And then we call backward on that loss to calculate the gradients. And then we subtract from the coefficients the gradients times the learning rate. And then we zero out the gradients.

So I've skipped a bit of stuff, like the `with torch.no_grad()` thing; actually, no, I don't need that because I've got `.data`. No, that's it, that should all work fine. I've skipped printing the loss, and that's about it. So here is a variation of that loop where I do not always subtract the gradient times the learning rate.

Instead, I go through each x, y pair in the data loader, I calculate the loss, and I look at how many images are in this batch. So initially I start at zero, and this count is going to be 32, say, if I've divided the batch size by two. And then, if the count is greater than 64, I do my coefficient update; well, at this point it's not yet.

So I skip back to here and I do this again. And if you remember, there was this interesting subtlety in PyTorch, which is that if you call backward again without zeroing out the gradients, then it adds this set of gradients to the old gradients. So by doing these two half-size batches without zeroing out the gradients between them, it's adding them up.

So I'm going to end up with the total gradient of a 64 image batch size but passing only 32 at a time. If I used accumulate equals four it would go through this four times adding them up before it subtracted out the coefficients.grad times learning rate and zeroed it out.

If I put in accum=64, it would go through a single image at a time, and after 64 passes through, eventually count would reach 64 and we would do the update. So that's gradient accumulation. It's a very simple idea, which is that you don't have to actually update your weights every loop through, for every mini-batch.
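Here's a minimal sketch of that accumulating loop, assuming `model`, `dls`, `loss_func` and `lr` already exist; it mirrors the manual training loop from earlier lessons rather than fastai's actual implementation:

```python
count = 0
for xb, yb in dls.train:
    loss = loss_func(model(xb), yb)
    loss.backward()                     # gradients ADD to whatever is already in .grad
    count += len(xb)                    # e.g. 32 images per pass when bs = 64 // 2
    if count >= 64:                     # only step once a full "virtual" batch of 64 is seen
        for p in model.parameters():
            p.data -= p.grad * lr       # .data sidesteps autograd, as mentioned above
        model.zero_grad()               # now clear the accumulated gradients
        count = 0
```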

You can just do it from time to time. But it has quite significant implications, which I find most people seem not to realize: if you look on Twitter or Reddit or whatever, people will say "oh, I need to buy a bigger GPU to train bigger models", but they don't.

They could just use gradient accumulation. And so, given the huge price differential between, say, an RTX 3080 and an RTX 3090 Ti, the performance is not that different; the big difference is the memory. So what? Just put in a bit smaller batch size and do gradient accumulation.

So there's actually not that much reason to buy giant GPUs. John? "Are the results with gradient accumulation numerically identical?" They're numerically identical for this particular architecture. There is something called batch normalization, which we will look at in part two of the course, which keeps track of the moving average of standard deviations and averages, and does it in a mathematically slightly incorrect way. As a result, if you've got batch normalization, it basically will introduce more volatility, which is not necessarily a bad thing, but because it's not mathematically identical, you won't necessarily get the same results.

ConvNeXt doesn't use batch normalization, so it is the same. And in fact, a lot of the models people want to use really big versions of, which is the NLP ones, transformers, tend not to use batch normalization; instead they use something called layer normalization, which doesn't have the same issue.

I think that's probably fair to say; I haven't thought about it that deeply. In practice, I've found that adding gradient accumulation for ConvNeXt has not caused any issues for me; I don't have to change any parameters when I do it. Any other questions on the forum, John? "Tamori is asking: shouldn't it be count greater than or equal to 64, if bs equals 64?"

No, I don't think so. So we start at zero, then it's going to be 32, then it's going to be... yeah, yeah, probably. You can probably tell I didn't actually run this code. "Madhav is asking: does this mean that lr_find is based on the batch size set during the DataBlock?"

Yeah, so lr_find just uses your data loaders' batch size. "Edward is asking: why do we need gradient accumulation rather than just using a smaller batch size? And he follows up with: how would we pick a good batch size?" Well, if you just use a smaller batch size, here's the thing, right?

Different architectures take up different amounts of memory, and so you'll end up with different batch sizes for different architectures. Which is not necessarily a bad thing, but each of them is then going to need a different learning rate, and maybe even a different weight decay or whatever.

Like the kind of the settings that's working really well for batch size 64 won't necessarily work really well for batch size 32. And you know you want to be able to experiment as easily and quickly as possible. I think the second part of your question was how do you pick an optimal batch size?

Honestly the standard approach is to pick the largest one you can just because it's faster that way you're getting more parallel processing going on. So to be honest I quite often use batch sizes that are quite a bit smaller than I need because quite often it doesn't make that much difference.

But yeah, the rule of thumb would be: pick a batch size that fits in your GPU, and for performance reasons, I think it's generally a good idea to have it be a multiple of eight. Everybody seems to always use powers of two; I don't know, I don't think it actually matters.

"Look, there's one other, just a clarification or a check: should the learning rate be scaled according to the batch size?" Yeah, so generally speaking the rule of thumb is that if you divide the batch size by two, you divide the learning rate by two. But unfortunately it's not quite perfect.

Did you have a question, Nick? If you do, you can... okay, cool. "Yeah, that's us all caught up, thanks Jeremy." Good questions, thank you. So gradient accumulation in fastai is very straightforward. You just divide the batch size by however much you want to divide it by, and then add something called a callback. A callback is something which changes the way the model trains; this one's called GradientAccumulation, and you pass in the effective batch size you want.

And then, when you create the learner, you say these are the callbacks I want, and so it's going to pass in the GradientAccumulation callback, so it's only going to update the weights once it's got 64 images. So if we pass in accum=1, it won't do any gradient accumulation, and that uses four gig; if we use accum=2, about three gig; accum=4, about two and a half gig. And generally, the bigger the model, the closer you'll get to linear scaling, because models have a bit of overhead that they have anyway.
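A sketch of that pattern might look like this; the folder path, transforms and fine-tune settings are illustrative rather than the exact notebook code:

```python
from fastai.vision.all import *

def train(arch, item_tfms, batch_tfms, accum=1, epochs=12):
    # Shrink the real batch size, then let GradientAccumulation only step the
    # optimizer once 64 images' worth of gradients have been accumulated.
    dls = ImageDataLoaders.from_folder(trn_path, valid_pct=0.2, bs=64//accum,
                                       item_tfms=item_tfms, batch_tfms=batch_tfms)
    cbs = GradientAccumulation(64) if accum > 1 else []
    learn = vision_learner(dls, arch, metrics=error_rate, cbs=cbs).to_fp16()
    learn.fine_tune(epochs, 0.01)
    return learn
```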

So what I then did was I just went through all the different models I wanted to try: ConvNeXt Large at 320 by 240, ViT Large, SwinV2 Large, and Swin Large. On each of these I first tried running it with accum=1, and actually, every single time, for all of these, I got a CUDA out-of-memory error. Then I tried each of them independently with accum=2, and it turns out that all of these worked with accum=2, and it only took me 12 seconds each time. So that was a very quick way to find out, okay, I now know how to train all of these models on a 16 gigabyte card, and I can check here that they're all using less than 16 gig.

So then I just created a little dictionary of all the architectures I wanted, and for each architecture, all of the resize methods and final sizes I wanted. Now, these models, ViT, SwinV2 and Swin, are all transformer models, which means that, well, most transformer models, nearly all of them, have a fixed size: this one's 224, this one's 192, this one's 224. So I have to make sure that my final size is a square of the required size, otherwise I get an error.

There is a way of working around this, but I haven't experimented with it enough to know when it works well and when it doesn't, so we'll probably come back to that in part two. For now, we're going to use the sizes they ask us to use. So, with this dictionary of architectures and, for each architecture, its pre-processing details, we switch the training path back to using all of our images, and then we can loop through each architecture, and loop through each of the item transforms and sizes, and train the model. The training script, if you're fine-tuning, returns the TTA predictions.

So I append all those TTA predictions, for each model and for each pre-processing type, into a list, and after each one it's a good idea to do that garbage collection and empty-cache thing, because otherwise I find that your GPU memory kind of, I don't know, I think it gets fragmented or something, and after a while it runs out of memory even when you thought it wouldn't. This way you can really do as much as you like without running out of memory.
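Roughly, that dictionary-and-loop structure could be sketched as follows; the model names and preprocessing choices are illustrative, `train` is the helper sketched earlier (here assumed to return the fitted learner, with the TTA call done in the loop), and `tst_files` is an assumed list of test image paths:

```python
import gc
import torch
from fastai.vision.all import *

models = {
    'convnext_large_in22k':          [(Resize((640, 480)), (320, 240))],
    'vit_large_patch16_224':         [(Resize(480, method='squish'), 224)],
    'swinv2_large_window12_192_22k': [(Resize(480, method='squish'), 192)],
    'swin_large_patch4_window7_224': [(Resize(480, method='squish'), 224)],
}

tta_res = []
for arch, details in models.items():
    for item_tfms, size in details:
        learn = train(arch, item_tfms=item_tfms,
                      batch_tfms=aug_transforms(size=size, min_scale=0.75), accum=2)
        # keep only the TTA probabilities for the test set
        tta_res.append(learn.tta(dl=learn.dls.test_dl(tst_files))[0])
        gc.collect()
        torch.cuda.empty_cache()
```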

So they all train, train, train. One key thing to note here is that in my train script, my data loaders do not have the seed= parameter, so I'm using a different training set every time. That means that each of these different runs is also using a different validation set, so they're not directly comparable, but you can kind of see they're all doing pretty well: 2.1 percent, 2.3 percent, 1.7 percent and so forth.

So why am I using different training and validation sets for each of these? That's because I want to ensemble them. I'm going to use bagging, which is: I am going to take the average of their predictions. Now, when we talked about random forest bagging, we were taking the average of intentionally weak models. These are not intentionally weak models; they have to be good models. But they're all different: they're using different architectures and different pre-processing approaches. And so in general we would hope that, of these different approaches, some might work well for some images and some might work well for other images, and so when we average them out, hopefully we'll get a good blend of different ideas, which is kind of what you want in bagging.

So we can stack up that list of all the different probabilities and take their mean, and that's going to give us 3,469 predictions (that's our test set size), and each one has 10 probabilities, the probability of each disease. And then we can use argmax to find which probability index is the highest, so that's going to give us our list of indexes. This is basically the same steps as we used before to create our CSV submission file.

At the time of creating this analysis, that got me to the top of the leaderboard, and in fact these are my four submissions, and you can see each one got better. Now, you're not always going to get this nice monotonic improvement, but you want to be trying to submit something every day to try out something new, and the more you practice, the more you'll get a good intuition of what's going to help. So partly I'm showing you this to say it's not purely random as to whether things work or don't; once you've been doing this for a while, you will generally be improving things most of the time.

As you can see from the descriptions, my first submission was our ConvNeXt Small for 12 epochs with TTA, and then an ensemble of ConvNeXts, so basically this exact same thing but just retraining a few with different training subsets, and then this is the same thing again. This is the thing we just saw, basically the ensemble of large models with TTA, and the last one was something I skipped over: the ViT models were the best in my testing, so I basically weighted them as double in the ensemble. Pretty unscientific, but again it gave it another boost. And so that was it.
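A sketch of that averaging-and-submitting step, assuming `tta_res` holds one TTA probability tensor per model (as collected in the loop above), `dls` is a DataLoaders whose vocab maps indices to disease names, and `path` points at the competition data:

```python
import numpy as np
import pandas as pd
import torch

avg_probs = torch.stack(tta_res).mean(0)       # (3469, 10): mean probability per test image
idxs = avg_probs.argmax(dim=1)                 # index of the most probable disease
vocab = np.array(dls.vocab)                    # map indices back to disease names

ss = pd.read_csv(path/'sample_submission.csv') # assumed Kaggle sample submission file
ss['label'] = vocab[idxs]
ss.to_csv('subm.csv', index=False)
```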
All right, John? "Yes, thanks Jeremy. So, in no particular order: Kurian is asking, would trying out cross-validation with k-folds, with the same architecture and ensembling of models, make sense?" Okay, so, yes, a popular thing is to do k-fold cross-validation. K-fold cross-validation is something very, very similar to what I've done here. What I've done here is I've trained a bunch of models with different training sets; each one is a different random 80% of the data. Five-fold cross-validation does something similar, but what it says is: rather than picking, say, five samples out with different random subsets, instead first do all except the first 20% of the data, then all but the second 20%, then all but the third, and so forth. So you end up with five subsets, each of which has a non-overlapping validation set, and then you'll ensemble those. In theory, maybe that could be slightly better, because you're guaranteed that every row appears four times, effectively. It also has the benefit that you could average those five validation sets, because there's no overlap between them, to get a cross-validation. Personally, I generally don't bother, and the reason is that this way I can add and remove models very easily; I can just add another architecture or whatever to my ensemble without trying to find a different non-overlapping subset. So yeah, cross-validation is something that I use probably less than most people, or almost never.

"Awesome, thank you. Just to come back to gradient accumulation: are there any other kind of drawbacks or potential gotchas with gradient accumulation?" No, not really. Amazingly, it doesn't even really slow things down much. Going from a batch size of 64 to a batch size of 32: by definition you had to do it because your GPU is full, so you're obviously giving it a lot of data, so it's probably going to be using its processing speed pretty effectively. So yeah, it's just a good technique, and we should all be buying cheaper graphics cards with less memory in them and using it. I don't know the prices, but I suspect you could probably buy two 3080s for the price of one 3090 Ti, or something; that would be a very good deal.

"Yes, clearly you're not on the Nvidia payroll. So look, this is a good segue: we did have a question about GPU recommendations, and there's been a bit of chat on that as well, I bet. So, any additional commentary around GPU recommendations?" No, not really. I mean, obviously at the moment Nvidia is the only game in town. If you try to use, you know, an Apple M1 or M2, or an AMD card, you're basically in for a world of pain in terms of compatibility and unoptimized libraries and whatever. The Nvidia consumer cards, the ones that start with RTX, are much cheaper but are just as good as the expensive enterprise cards. So you might be wondering why anybody would buy the expensive enterprise cards, and the reason is that there's a licensing issue: Nvidia will not allow you to use an RTX consumer card in a data center. Which is also why cloud computing GPUs are more expensive than they kind of ought to be, because everybody selling cloud computing GPUs is selling these cards that are, I can't remember, I think three times more expensive for kind of the same features. So yeah, if you do get serious about deep learning, to the point that you're prepared to invest a few days in administering a box and, depending on prices (hopefully they will start to come down), currently a thousand or two thousand dollars on buying a GPU, then that'll probably pay you back pretty quickly.

"Great, thank you. Let's see, another one's come in, back on models, not hardware: if you have a well-functioning but large model, can it make sense to train a smaller model to produce the same final activations as the larger model?" Oh yeah, absolutely. I'm not sure we'll get into that this time around; we'll cover that in part two, I think. But basically there are teacher-student models and model distillation, which, broadly speaking, are ways to make inference faster by training small models that work the same way as large models. "Great, thank you."

All right, so that is the actual real end of Road to the Top, because beyond that we don't actually cover how to get closer to the top; you'd have to ask Kurian to share his techniques to find that out, or Nick, who got second place.

Part four is actually something that I think is very useful to know about for learning, and it's going to teach us a whole lot about how the last layer of a neural network works. Specifically, what we're going to try to do is build a model that doesn't just predict the disease, but also predicts the type of rice. So how would you do that? Here's the data loader we're going to try to build: it's going to be something that, for each image, tells us the disease and the type of rice. I say disease; sometimes it's "normal", I guess some of them are not diseased.

To build a model that can predict two things, the first thing is you're going to need data loaders that have two dependent variables, and that is shockingly easy to do in fastai thanks to the data block. We've seen the data block before. We haven't been using it for the paddy competition so far because we haven't needed it; we could just use ImageDataLoaders.from_folder. That's the highest level API, the simplest API. If we go down a level deeper into the data block, we have a lot more flexibility. If you've been following the walkthroughs, you'll know that as I built this, the first thing I actually did was simply replicate the previous notebook, but replace the ImageDataLoaders.from_folder with the data block, to do exactly the same thing first of all, and then I added the second dependent variable.

So if we look at the previous ImageDataLoaders.from_folder thingy, here it is: we're passing in some item transforms and some batch transforms, and we had something saying what percentage should be the validation set. In a DataBlock, if you remember, we have to pass in a blocks argument saying what kind of data the independent variable is and what the dependent variable is. To replicate what we had before, we would just pass in ImageBlock comma CategoryBlock, because we've got an image as our independent variable and a category as the dependent variable. The new thing I'm going to show you here is that you don't have to put in only two things; you can put in as many as you like. So if you put in three things, we're going to generate one image and two categories. Now, if you're saying you want three things, fastai doesn't know which of those is the independent variable and which is the dependent variable, so the next thing you have to tell it is how many inputs there are: n_inp, the number of inputs. Here I've said there's one input, so that means this is the input, and therefore, by definition, the two categories will be the output, because remember we're trying to predict two things: the type of rice and the disease.

Okay, this is the same as what we've seen before: to get our list of items, we'll call get_image_files. Now here's something we haven't seen before: get_y is our labelling function. Normally we pass get_y a single thing, such as the parent_label function, which looks at the name of the parent directory (which, remember, is how these images are structured), and that would tell us the label. But get_y can also take an array, and in this case we want two different labels: one is the name of the parent directory, because that's the disease; the second is the variety. So what's get_variety? get_variety is a function, so let me explain how this function works.

We can create a data frame containing our training data that came from Kaggle, so for each image it tells us the disease and the variety. And what I did is something I haven't shown before: in pandas, you can set one column to be the index, and when you do that, in this case with image_id, it makes this data frame kind of like a dictionary. I can index into it by saying "tell me the row for this image", and to do that you use the loc attribute, the location. So we want, in the data frame, the location of this image, and then you can optionally say which column you want: this column. And so here's this image, and here's this column, and as you can see, it returns that thing. So hopefully now you can see it's pretty easy for us to create a function that takes a path and returns the location, in the data frame, of the name of that file (remember, these are the names of files), looking up the variety column. So that's our second get_y.

Okay, and then we've seen this before: randomly split the data, 20 percent and 80 percent. And we could just squish them all to 192, just for this example, and then use data augmentation to get us down to 128-square images, just for this example. And so that's what we get when we say show_batch: we get what we just discussed.
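Putting that together, the multi-target DataBlock described might be sketched like this; the sizes and split follow the description above, while `path`, `trn_path` and the CSV column names are assumptions taken from the competition data:

```python
from fastai.vision.all import *
import pandas as pd

df = pd.read_csv(path/'train.csv', index_col='image_id')   # index by file name, like a dictionary

def get_variety(p):
    # look up the variety for this image file in the data frame
    return df.loc[p.name, 'variety']

dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock, CategoryBlock),  # one image in, two categories out
    n_inp=1,                                            # the first block is the (only) input
    get_items=get_image_files,
    get_y=[parent_label, get_variety],                  # disease from folder name, variety from the CSV
    splitter=RandomSplitter(0.2, seed=42),
    item_tfms=Resize(192, method='squish'),
    batch_tfms=aug_transforms(size=128, min_scale=0.75)
).dataloaders(trn_path)

dls.show_batch(max_n=6)
```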
So now we need a model that predicts two things. How do we create a model that predicts two things? Well, the key thing to realize is that we never actually had a model that predicts two things; we had a model that predicts 10 things before. The 10 things we predicted were the probabilities of each disease. So we don't actually want a model that predicts two things now; we want a model that predicts 20 things: the probability of each of the 10 diseases, and the probability of each of the 10 varieties. So how could we do that?

Let's first of all try to just create the same disease model we had before, with our new data loader. This is going to be reasonably straightforward. The key thing to know is that, since we told fastai that there's one input, and therefore by definition there are two outputs, it's going to pass three things instead of two to our metrics and to our loss functions: the predictions from the model, the disease, and the variety. So we can't just use error_rate as our metric anymore, because error_rate takes two things. Instead we have to create a function that takes three things and returns error_rate on the two things we want, which are the predictions from the model and the disease: the predictions from the model, and this is the target. That's actually all we need to do to define a metric that's going to work with our new data loader. This is not going to tell us anything about variety; first, it's just going to try to replicate something that can do disease. So when we create our learner, we'll pass in this new disease error function. Okay, so we're halfway there.

The other thing we're going to need is to change our loss function. Now, we never actually talked about what loss function to use, and that's because vision_learner guessed what loss function to use: vision_learner saw that our dependent variable is a single category, it knows the best loss function that's probably going to be the case for things with a single category, and it knows how big the category is, so it just didn't bother us at all; it just said, okay, I'll figure it out for you. The only time we've provided our own loss function is when we were doing linear models and neural nets from scratch, and we did, I think, mean squared error; we might also have done mean absolute error. Neither of those works when the dependent variable is a category: how would you use mean squared error or mean absolute error to say how close these 10 probability predictions were to this one correct answer?

So in this case we have to use a different loss function; we have to use something called cross-entropy loss, and this is actually the loss function that fastai picked for us before, without us knowing. But now that we're having to pick it manually, I'm going to explain to you exactly what cross-entropy loss does. And these details are very important indeed. Remember I said at the start of this class: the stuff that happens in the middle of the model you're not going to have to care about much in your life, if ever, but the stuff that happens in the first layer and the last layer, including the loss function that sits between the last layer and the loss, you're going to have to care about a lot. This stuff comes up all the time, so you definitely want to know about cross-entropy loss.

I'm going to explain it using a spreadsheet; the spreadsheet's in the course repo. Let's say you're predicting something like a kind of mini ImageNet thing, where you're trying to predict whether an image is a cat, a dog, a plane, a fish or a building. So you set up some model, whatever it is, a ConvNeXt model or just a big bunch of linear layers connected up, and initially you've got some random weights, and at the end it spits out five predictions. Remember, to predict something with five categories, your model will spit out five numbers. It doesn't initially spit out probabilities; there's nothing making them probabilities. It just spits out five numbers: they could be negative, they could be positive. So here's the output of the model.

What we want to do is convert these into probabilities, and we do that in two steps. The first thing we do is we go exp, that's e to the power of; we go e to the power of each of those things, like so. And here's the mathematical formula we're using; this is called the softmax. We're going to go through each of the categories, so these are our five categories, so here K is five. For each of our categories we're going to go e to the power of the output, so z_j is the output for the j-th category. So here's that, and then we're going to sum them all together; here it is, summed together. That's the denominator, and the numerator is just e to the power of the thing we care about: the numerator is e to the power of cat on this row, e to the power of dog on this row, and so forth.

Now, if you think about it, since the denominator adds up all the e-to-the-power-ofs, when we do each one divided by the sum, that means the sum of these will equal one, by definition. And so now we have things that can be treated as probabilities: they're all numbers between zero and one. Numbers that were bigger in the output will be bigger here, but there's something else interesting: because we did e to the power of, the bigger numbers get pushed up closer to one. It's like we're saying, really try to pick one thing as having most of the probability, because we are trying to predict one thing; we're trying to predict which one it is. And so this is called softmax.

Sometimes you'll see people complaining about the fact that their model, which, let's say, predicts whether something is a teddy bear or a grizzly bear or a black bear, was fed a picture of a cat, and they say: oh, the model's wrong, because it predicted grizzly bear and it's not a grizzly bear. As you can see, there's no way for this to predict anything other than the categories we're giving it; we're forcing it to do that. Now, if you don't want that, there's something else you could do, which is to have them not add up to one. You could instead have something which simply says: what's the probability it's a cat, and what's the probability it's a dog, totally separately, and they could add up to less than one, in which case you could have zero things being true, or more than one, in which case you could have more than one thing being true. But in this particular case, where we want to predict one and only one thing, we use softmax.

That's the first part of the cross-entropy formula. In fact, let's look it up: nn.CrossEntropyLoss. The first part of what cross-entropy loss in PyTorch does is to calculate the softmax. It's actually the log of the softmax, but don't worry about that too much; it's just slightly faster to take the log.

Okay, so now, for each one of our five things, we've got a probability. The next step is the actual cross-entropy calculation: we take our five probabilities, and then we've got our actuals. Now, the truth is, the five things would have indices: zero, one, two, three or four, and the actual turned out to be number one. But what we tend to do is think of it as being one-hot encoded, which is: we put a one next to the thing for which it's true, and a zero everywhere else. And so now we can compare these five numbers to these five numbers, and we would expect to have a smaller loss if the softmax was high where the actual is high. Here's how we calculate it; this is the formula for cross-entropy loss: we sum (they switched to M this time for some reason, but it's the same thing) across the five categories, M is five, and for each one we multiply the actual target value (here it is, the actual target value) by the log of the predicted probability. And of course, for four of these, that value is zero, because y_j equals zero by definition for all but one of them, since it's one-hot encoded. So for the one where it's not zero, we've got our actual times the log softmax.

And now you can actually see why PyTorch prefers to use log softmax: it skips over having to do this log at all. So this equation looks slightly frightening, but when you think about it, all it's actually doing is finding the probability for the one that is one, and taking its log. It's kind of weird doing it as a sum, but in math it can be a little bit tricky to say "look this up in an array", which is basically all it's doing. So at least in this case, for a single result with this softmax, that's all it's doing: it's finding the 0.87 for the category where the actual is one, taking its log, and then finally the negative. So that is what cross-entropy loss does for one row; we add that together for every row, and here's what it looks like if we add it together over every row, where n is the number of rows.
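Here's a small numeric sketch of those two steps in PyTorch, using made-up outputs for the five categories (cat, dog, plane, fish, building); the values are illustrative, not the ones in the spreadsheet:

```python
import torch
import torch.nn.functional as F

output = torch.tensor([[2.0, -1.0, 0.5, 1.5, -0.5]])            # raw model outputs: one row, five categories
probs = output.exp() / output.exp().sum(dim=1, keepdim=True)    # softmax: e^z_j divided by the sum of e^z_k
print(probs, probs.sum())                                       # five numbers between 0 and 1, summing to 1

target = torch.tensor([0])                                      # the actual class is index 0 ("cat")
print(-probs[0, target].log())                                  # cross entropy: -log of the correct class's probability
print(F.cross_entropy(output, target))                          # PyTorch does log-softmax plus this lookup in one step
```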
And here's a special case: this is called binary cross-entropy. What happens if we're not predicting which of five things it is, but we're just predicting: is it a cat? In that case, if you look at this approach, you end up with this formula, which is identical to the previous formula, but for just two cases: you either are a cat, or you're not a cat. And if you're not a cat, it's one minus "you are a cat", and the same with the probability: you've got the probability you are a cat, and then "not a cat" is one minus that. So here's this special case of binary cross-entropy.

Now our rows represent rows of data, so each one of these is a different image, a different prediction, and for each one I'm just predicting: are you a cat? This is the actual, and the actual "are you not a cat" is just one minus that. And then these are the predictions that came out of the model; again, we can use softmax, or its binary equivalent, and so that will give you a prediction that it's a cat, and the prediction that it's not a cat is one minus that. And so here is each of the parts, y_i times the log of p(y_i), and here... why did I subtract? That's weird. Oh, because I've got the minus on both, so doing it this way avoids parentheses: minus "are you not a cat" times the log of the prediction of "are you not a cat". Then we can add those together, and that would be the binary cross-entropy loss of this data set of five cat-or-not-cat images.

Now, if you've got an eagle eye, you may have noticed that I am currently looking at the documentation for something called nn.CrossEntropyLoss, but over here I had something called F.cross_entropy. Basically, it turns out that all of the loss functions in PyTorch have two versions. There's a version which is a class; this is a class which you can instantiate, passing in various tweaks you might want. And there's also a version which is just a function, and if you don't need any of those tweaks, you can just use the function. The functions live in, I can't remember what the submodule's called, I think it's torch.nn.functional, but everybody, including the PyTorch official docs, just calls it a capital F. So that's what this capital F refers to.
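In code, the two forms are interchangeable when you don't need the extra options (the numbers here are just the made-up example from above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

output = torch.tensor([[2.0, -1.0, 0.5, 1.5, -0.5]])
target = torch.tensor([0])

loss_cls = nn.CrossEntropyLoss()          # class version: instantiate it, optionally passing tweaks
loss_fn  = F.cross_entropy                # plain-function version, living in torch.nn.functional
assert torch.isclose(loss_cls(output, target), loss_fn(output, target))
```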
So our loss, if we just care about disease: it's going to be passed the three things, but we're just going to calculate cross-entropy on our input versus disease. All right, so that's all fine. Now, when we create a vision_learner, you can't rely on fastai to know what loss function to use, because we've got multiple targets, so you have to say: this is the loss function I want to use, and these are the metrics I want to use. The other thing you can't rely on is fastai knowing how many activations to create, because, again, there's more than one target, so you have to say that the number of outputs to create at the last layer is 10; this is just saying what the size of the last matrix is. Once we've done that, we can train it, and we get basically the same kind of result as we always get, because this model at this point is identical to our previous ConvNeXt Small model; we've just done it in a slightly more roundabout way.

So finally, before our break, I'll show you how to expand this into a multi-target model, and the trick is actually very simple; you might have almost got the idea of it when I talked about it earlier. Our vision_learner now requires 20 outputs: we now need that last matrix to produce 20 activations, not 10. Ten of those activations are going to predict the disease, and 10 of the activations are going to predict the variety. So you might then be asking: well, how does the model know what it's meant to be predicting? And the answer is: with the loss function, you're going to have to tell it. So, for example, the disease loss: remember, it's going to get the input, the disease and the variety, and the input is now going to have 20 columns. So we're just going to decide that the first 10 columns are the prediction of what the disease is, the probability of each disease, so we can now pass cross-entropy the first 10 columns and the disease target. The way you read this: the colon means every row, and then ":10" means every column up to the tenth, so these are the first 10 columns. That's a loss function that just works on predicting disease, using the first 10 columns. For variety, we'll use cross-entropy loss with the target of variety, and this time we'll use the second 10 columns, so here's column 10 onwards. Then the overall loss function is the sum of those two things: disease loss plus variety loss.

And that's actually it; that's all the model needs. If you think through the manual neural nets we've created, this loss function will be reduced when the first 10 columns are doing a good job of predicting the disease probabilities and the second 10 columns are doing a good job of predicting the variety probabilities, and therefore the gradients will point in an appropriate direction, and the coefficients will get better and better at using those columns for those purposes.

It would be nice to see the error rate as well, for each of disease and variety, so we can call error_rate passing in the first 10 columns and disease, and then, for variety, the second 10 columns and variety. We may as well also add the losses to the metrics. So now, when we create a learner, we're going to pass in the combined loss as the loss function, our list of all the metrics as the metrics, and n_out=20.
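A sketch of those loss functions and metrics, written against the 20-activation output just described; `dls` is the multi-target DataLoaders from earlier, and the architecture name and epoch count are illustrative:

```python
import torch.nn.functional as F
from fastai.vision.all import *

def disease_loss(inp, disease, variety): return F.cross_entropy(inp[:, :10], disease)
def variety_loss(inp, disease, variety): return F.cross_entropy(inp[:, 10:], variety)
def combine_loss(inp, disease, variety):
    return disease_loss(inp, disease, variety) + variety_loss(inp, disease, variety)

def disease_err(inp, disease, variety): return error_rate(inp[:, :10], disease)
def variety_err(inp, disease, variety): return error_rate(inp[:, 10:], variety)

metrics = (disease_err, variety_err, disease_loss, variety_loss)
learn = vision_learner(dls, 'convnext_small_in22k',      # architecture name is an assumption
                       loss_func=combine_loss, metrics=metrics, n_out=20).to_fp16()
learn.fine_tune(5, 0.01)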
And now look what happens when we train: as well as telling us the overall training and validation loss, it also tells us the disease and variety error, and the disease and variety loss. You can see our disease error is getting down to similar levels as it was before. It's slightly less good, but it's similar, and it's not surprising it's slightly less good, because we've only given it the same number of epochs, and we're now asking it to do more stuff: to learn to recognize what the rice variety looks like and also to recognize what the disease looks like.

Here's the counterintuitive thing, though: if we train it for longer, it may well turn out that this model, which is trying to predict two things, actually gets better at predicting disease than our disease-specific model. Why is that? That sounds weird, because we're asking it to do more stuff with the same size model. Well, the reason is that quite often it turns out that the kinds of features that help you recognize a variety of rice are also useful for recognizing the disease. Maybe there are certain textures, or maybe some diseases impact different varieties in different ways, so it would be really helpful to know what variety it was. I haven't tried training this for a long time, and I don't know the answer: in this particular case, does a multi-target model do better than a single-target model at predicting disease? But I just want to let you know that sometimes it does. For example, a few years ago there was a Kaggle competition for recognizing the kinds of fish on a boat, and I remember we ended up doing a multi-target model where we tried to predict a second thing; I can't even remember what it was, maybe it was the type of boat or something. And it definitely turned out, in that Kaggle competition, that predicting two things helped you predict the type of fish better than predicting just the type of fish.

So there are at least two reasons to learn about multi-target models. One is that sometimes you just want to be able to predict more than one thing, so this is useful. The second is that sometimes this will actually be better at predicting just one thing than a just-one-thing model. And of course the third reason is that it really forced us to dig quite deeply into these loss functions and activations in a way we haven't quite done before.

So it's absolutely okay if this is confusing. The way to make it not confusing is, well, the first thing I'd do is go back to our earlier models, where we did stuff by hand on the Titanic data set and built our own architectures, and maybe you could try to build a model that predicts two things in the Titanic data set. Maybe you could try to predict both sex and survival, or class and survival, or something like that, because that forces you to look at it on very small data sets. And then the other thing I'd say is: run this notebook and really experiment, and try to see what kind of outputs you get. Actually look at the inputs and look at the outputs, and look at the data loaders and so forth. All right, let's have a six minute break; I'll see you back here at ten past seven.

Okay, welcome back. Oh, before I continue: I very rudely forgot to mention that this very nice equation image here is from an article by Chris Said called "Things that confused me about cross-entropy". It's a very good article, so I recommend you check it out if you want to go a bit deeper; there's a link to it inside the spreadsheet.

So the next notebook we're going to be looking at is this one called "Collaborative filtering deep dive", and this is going to cover the last of our four major application areas: collaborative filtering. This is actually the first time I'm going to be presenting a chapter of the book largely without variation, because this is one where I looked back at the chapter and thought, I can't think of any way to improve this, so I'll just leave it as is. But we have put the whole chapter up on Kaggle, so that's the way I'm going to be showing it to you.

We're going to be looking at a data set called the MovieLens data set, which is a data set of movie ratings, and we're going to grab a smaller version of it, the 100,000-record version. It comes as a CSV file which we can read in; well, it's not really a CSV file, it's a TSV file. This here means a tab in Python, and these are the names of the columns. So here's what it looks like: it's got a user, a movie, a rating and a timestamp. We're not going to use the timestamp at all, so basically there are three columns we care about. This is a user ID, so maybe 196 is Jeremy and maybe 186 is Rachel and 22 is John, I don't know. And maybe this movie is Return of the Jedi, and this one's Casablanca, and this one's LA Confidential. And then this rating says how Jeremy felt about Return of the Jedi: he gave it a three out of five. That's how we can read this data set.

This kind of data is very common: any time you've got a user and a product or service. You might not even have ratings; maybe you just have the fact that they bought that product, and you could have a similar table with zeros and ones. So, for example, Radek, who's in the audience here, is now at Nvidia basically doing this: recommendation systems. Recommendation systems are a huge industry, and what we're learning today is a really key foundation of it.
So these are the first few rows. This is not a particularly great way to see it; I prefer to cross-tabulate it, like this. This is the same information: for each movie, for each user, here's the rating. User 212 never watched movie 49. Now, if you're wondering why there are so few empty cells here: I actually grabbed the most-watched movies and the most movie-watching users for this particular sample matrix, so that's why it's particularly full. So this is what a collaborative filtering data set looks like when we cross-tabulate it.

So how do we fill in this gap? Maybe user 212 is Nick, and movie 49: what's a movie you haven't seen, Nick, and you'd quite like to, or maybe you're not sure about it? The new Elvis movie; Baz Luhrmann, good choice, Australian director, filmed in Queensland. Okay, so that's the movie, number 49. So is Nick going to like the new Elvis movie? To figure this out, ideally we would like to know, for each movie, what kind of movie it is: what are its features? Is it actiony, science-fictiony, dialogue-driven, critically acclaimed? So let's say, for example, we were looking at The Last Skywalker; maybe that was the movie Nick's wondering about watching. If we had three categories, being science fiction, action, or kind of classic old movies, we'd say The Last Skywalker is very science fiction (let's say this is on a scale from negative one to one), pretty action, and definitely not an old classic, or at least not yet. And then maybe we could say: Nick's tastes in movies are that he really likes science fiction, quite likes action movies, and doesn't really like old classics. Then we could match these up to see how much we think this user might like this movie. To calculate the match, we could just multiply the corresponding values, user 1 times The Last Skywalker, and add them up: 0.9 times 0.98, plus 0.8 times 0.9, plus negative 0.6 times negative 0.9. That's going to give us a pretty high number, with a maximum of three, so that would suggest Nick probably would like The Last Skywalker. On the other hand, the movie Casablanca we would say is definitely not very science fiction, not really very action, and definitely very much an old classic. So then we'd do exactly the same calculation and get this negative result here, so you probably wouldn't like Casablanca.

This thing, where we multiply the corresponding parts of two vectors together and add them up, is called a dot product in math. So this is the dot product of the user's preferences and the type of movie. Now, the problem is that we weren't given that information. We know nothing about these users or about the movies. So what are we going to do? We want to try to create these factors without knowing ahead of time what they are. We wouldn't even know what factors to create: what are the things that really matter when people decide what movies they want to watch? What we can do is create things called latent factors. "Latent factors" is this weird idea that we can say: I don't know what things about movies matter to people, but there's probably something, and let's just try using SGD to find them. And we can do it in everybody's favourite mathematical optimization software, Microsoft Excel.

So here is that table, and here's what we can do. Let's say, for movie 27, we'll assume there are five latent factors. I don't know what they're for; they're just five latent factors, and we'll figure them out later. For now, I certainly don't know what the values of those five latent factors for movie 27 should be, so we're going to just chuck some random numbers in them. We're going to do the same thing for movie 49, pick another five random numbers, and the same thing for movie 57. And you might not be surprised to hear we're going to do the same thing for each user: for user 14 we're going to pick five random numbers, and for user 29 we'll pick five random numbers. The idea is that this number here, 0.19, is saying how strongly user 14 feels about the factor that, for movie 27, has a value of 0.71. So then, in here, we do the dot product. The details of why don't matter too much, but actually you can figure this out from what we've said so far: if you go back to our definition of matrix product, you might notice that the matrix product of a row with a column is the same thing as a dot product. And here in Excel I have a row and a column, so I say matrix-multiply that by that, and that gives us the dot product. So here's the dot product of that by that, or the matrix multiply, given that they're a row and a column. The only other slight quirk here is that if the actual rating is empty, I'm just going to leave it blank; I'm going to set it to zero, actually.

So here is everybody's predicted rating of movies. I say "predicted"; of course, these are currently random numbers, so they are terrible predictions. But when we have some way to predict things, and we start with terrible random predictions, we know how to make them better, don't we? We use stochastic gradient descent. To do that we're going to need a loss function, and that's easy enough: we can just calculate the sum of x minus y squared, divided by the count; that is the mean squared error, and if we take the square root, that is the root mean squared error. So here is the root mean squared error, in Excel, between these predictions and these actuals. And now that we have a loss function, we can optimize it: Data, Solver, set objective (this one here), by changing cells (these ones here and these ones here), solve. Initially our loss is 2.81, so we hope it's going to go down, and as it solves (not a great choice of background colour), it says 0.68, so this number is going down.

So this is using, well, actually, in Excel it's not quite using stochastic gradient descent, because Excel doesn't know how to calculate gradients. There are optimization techniques that don't need gradients; they calculate them numerically as they go, but that's a minor quirk. One thing you'll notice is that it's doing it very, very slowly; there's not much data here, and it's still going. One reason for that is that because it's not using gradients, it's much slower, and the second is that Excel is much slower than PyTorch. Anyway, it's come up with an answer, and look at that: it's got to 0.42, so it's got a pretty good prediction.
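The same idea can be sketched in a few lines of PyTorch rather than Excel: random latent factors for each user and each movie, dot-product predictions, and gradient descent on the mean squared error. The data below is randomly generated purely to show the mechanics, and the sizes and learning rate are made up:

```python
import torch

n_users, n_movies, n_factors = 15, 15, 5
user_factors  = torch.randn(n_users, n_factors, requires_grad=True)
movie_factors = torch.randn(n_movies, n_factors, requires_grad=True)

# A (n_users, n_movies) ratings matrix, with 0 meaning "no rating"
ratings = torch.randint(0, 6, (n_users, n_movies)).float()
mask = ratings > 0

for _ in range(1000):
    preds = user_factors @ movie_factors.t()          # every prediction is a dot product
    loss = ((preds - ratings)[mask] ** 2).mean()      # MSE over the cells we actually have
    loss.backward()
    with torch.no_grad():
        for p in (user_factors, movie_factors):
            p -= p.grad * 0.01
            p.grad.zero_()

print(loss.sqrt())   # root mean squared error, like the number the Excel solver was minimizing
```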
And so we can kind of get a sense of this. For example, looking at the last three movies: user 14 likes, dislikes, likes. Let's see somebody else like that... here's somebody else: this person likes, dislikes, likes. So based on our approach, we're saying: since they have the same feelings about these three movies, maybe they'll feel the same about these other three movies. And this person likes all three of those movies, and this person likes two out of three of them. This is the idea: it's as if somebody says to you, "I like this movie, this movie and this movie", and you're like, "oh, they like those movies too; what other movies do you like?", and they say, "how about this one?", and there's a good chance that you're going to like the same thing. That's the basis of collaborative filtering. Mathematically we call this matrix completion: this matrix is missing values, and we just want to complete them. So the core of collaborative filtering is that it's a matrix completion exercise.

"Can you grab a microphone? My question was with the dot products. If we think about the math of that for a minute: if we think about the cosine of the angle between the two vectors, that's going to roughly approximate the correlation; is that essentially what's going on here, in one sense, with the way that we're..." So, is the cosine of the angle between the vectors much the same thing as the dot product? The answer is yes; they're the same once you normalize them. "So it's correlation, what we're doing here at scale, as well?" Yeah, you can think of it that way.

Okay, cool. Now, this looks pretty different to how PyTorch looks. PyTorch has things in rows: we've got a user, a movie, a rating; user, movie, rating. So how do we do the same kind of thing in PyTorch? Let's do the same kind of thing in Excel, but using the table in the same format that PyTorch has it. To do that in Excel, the first thing I'm going to do is say: okay, I've got to look at user number 14, and I want to know the index, like how far down this list 14 is. "Match" means find the index, so this is user index one. Then what I'm going to say is: these five numbers are basically row one over here, and in Excel that's called "offset", so we're going to offset from here by one row, and you can see, here it is, 0.19, 0.63, etc. So here's the second user: 0.25, 0.03, etc. And we can do the same thing for movies: movie 417 is index 14, so that's going to be 0.75, 0.47, etc. Same thing, but now we're going to offset from here by 14 to get this row, which is 0.75, 0.47, etc. And so the prediction now is the dot product, which is called SUMPRODUCT in Excel: this is the SUMPRODUCT of those two things. So this is exactly the same as we had before, but when we put everything next to each other, we have to manually look up the index. And then, for each one, we can calculate the error squared: prediction minus rating, squared. We could add those all up, and if you remember, this is actually the same root mean squared error we had before we optimized, 2.81, because we've got the same numbers as before. So this is mathematically identical.

So what's this weird word up here: embedding? You've probably heard it before, and you might have come across the impression that it's some very complex, fancy mathematical thing, but actually it turns out that it is just looking something up in an array. That is what an embedding is. So we call this an embedding matrix, and these are our user embeddings and our movie embeddings. So let's take a look at that in PyTorch. And at this point, if you've heard about embeddings before, you might be thinking: that can't be it. And yeah, it's just as complex as the rectified linear unit, which turned out to be "replace negatives with zeros". "Embedding" actually means: look something up in an array.
There's a lot of jargon that we use as deep learning practitioners to try to make you as intimidated as possible, so that you don't wander into our territory and start winning our Kaggle competitions. Unfortunately, once you discover the simplicity of it, you might start to think that you can do it yourself, and then it turns out you can. That's basically what all of this jargon turns out to be.

So we're going to try to learn these latent factors, which is exactly what we just did in Excel; we just learned the latent factors.

All right, so if we're going to learn things in PyTorch, we're going to need DataLoaders. One thing I did is: there is actually a movies table as well, with the names of the movies, so I merged that together with the ratings, so that we've now got the user ID and the actual name of the movie. We don't need that for the model, obviously, but it's just going to make it a bit more fun to interpret later. This is called ratings.

We have something called CollabDataLoaders, that is, collaborative filtering data loaders, and we can get that from a DataFrame by passing in the data frame. It expects a user column and an item column. The user column is what it sounds like, the person that is rating this thing, and the item column is the product or service that they're rating. In our case the user column is called user, so we don't have to pass that in, and the item column is called title, so we do have to pass this in, because by default the user column should be called user and the item column will be called item. Give it a batch size, and as usual we can call show_batch. So here's a bit of a batch from our DataLoaders, and since we merged in the titles, we actually get to see the names, which is nice.
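A sketch of what that looks like in code, assuming `ratings` and `movies` are the MovieLens DataFrames loaded earlier in the notebook, with the movie names in a column called `title`:

```python
from fastai.collab import *

# Merge the movie titles into the ratings table so show_batch is more readable.
ratings = ratings.merge(movies)

# The user column is picked up by default; the item column here is 'title',
# so we pass that explicitly.
dls = CollabDataLoaders.from_df(ratings, item_name='title', bs=64)
dls.show_batch()
```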

All right, so now we're going to create the user factors and movie factors, i.e. this one and this one. The number of rows of the movie factors will be equal to the number of movies, the number of rows of the user factors will be equal to the number of users, and the number of columns will be whatever we want, however many factors we want to create.

John, this might be a pertinent time to jump in with a question: any comments about choosing the number of factors?

Not really. We have defaults that we use for embeddings in fastai. It's a very obscure formula, and people often ask me for the mathematical derivation of where it came from, but what actually happened is: I wrote down how many factors I thought were appropriate for different size categories in a table (well, actually in Excel), and then I fitted a function to that, and that's the function. So it's basically a mathematical function that fits my intuition about what works well, but it seems to work pretty well, and it's used in lots of other places now. Lots of papers will say "using fastai's rule of thumb for embedding sizes, here's the formula".

Cool, thank you.

It's pretty fast to train these things, so you can try a few. So we're going to create: the number of users is just the length of how many users there are, and the number of movies is the length of how many titles there are. Then create a matrix of random numbers of n_users by 5, and one of n_movies by 5. And now we need to look up the index of the movie in our movie latent factor matrix.

The thing is, when we learned about deep learning, we learned that we do matrix multiplications, not "look something up in an array". In Excel we were saying OFFSET, which is to say, find element number 14 in the table, and that's not a matrix multiply. So how does that work? Well, actually it is, for the same reason we talked about here: finding the element number one thing in this list is actually the same as multiplying by a one hot encoded matrix. Remember how, if we just take off the log for a moment (and take the negative off here), adding this up gives 0.87, which is the result of finding the index number one thing in this list? But we didn't do it that way; we did it by taking the dot product of this and this. And that's the same thing: taking the dot product of a one hot encoded vector with something is the same as looking up that index in the vector.

So that means that this exercise here, of looking up the 14th thing, is the same as doing a matrix multiply with a one hot encoded vector, and we can see that here. This is how we create a one hot encoded vector of length n_users in which the third element is set to one and everything else is zero. If we multiply that (@ means matrix multiply in Python) by our user factors, we get back this answer; and if we just ask for user factors number three, we get back the exact same answer. They're the same thing.

So you can think of an embedding as a computational shortcut for multiplying something by a one hot encoded vector. If you think back to what we did with dummy variables, this basically means embeddings are a cool math trick for speeding up matrix multiplies with dummy variables. And not just speeding up: we never even have to create the dummy variables, we never have to create the one hot encoded vectors, we can just look up in an array.
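Here is a minimal plain-PyTorch sketch of that equivalence; the sizes and the index 3 are arbitrary.

```python
import torch

n_users, n_factors = 10, 5
user_factors = torch.randn(n_users, n_factors)   # a made-up embedding matrix

one_hot_3 = torch.zeros(n_users)                 # one hot encoded vector for index 3
one_hot_3[3] = 1.

via_matmul = one_hot_3 @ user_factors            # matrix multiply with the one hot vector
via_lookup = user_factors[3]                     # just index into the array

print(torch.allclose(via_matmul, via_lookup))    # True: same result, the lookup is just faster
```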
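Coming back to the earlier question about choosing the number of factors: fastai exposes its rule of thumb as `emb_sz_rule`. The version below is reconstructed from memory, so the exact constants may differ slightly from the library's.

```python
def emb_sz_rule(n_cat):
    # fastai-style heuristic: embedding width grows slowly with the number of
    # categories, capped at 600 (constants approximate, from memory)
    return min(600, round(1.6 * n_cat**0.56))

emb_sz_rule(944)   # roughly 74 factors for ~944 users, for example
```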
All right, so we're now ready to build a collaborative filtering model, and we're going to create one from scratch. As we've discussed before, in PyTorch a model is a class. We briefly touched on this, but I've got to touch on it again: this is how we create a class in Python. You give it a name, and then you say how to initialize it, how to construct it. In Python, remember, they call these things dunder whatever; this one is dunder init (`__init__`). These are magic methods that Python will call for you at certain times. The method called `__init__` is called when you create an object of this class. So we could pass it a value, and now we set the attribute called a equal to that value. Then later on we could call a method called say, which will say hello to whatever you passed in here, and this is what it will say.

So, for example, if you construct an object of type Example passing in "Sylvain", self.a now equals "Sylvain". If you then use the .say method with "nice to meet you", x is now "nice to meet you", so it will say "Hello Sylvain, nice to meet you". That's kind of all you need to know about object-oriented programming in PyTorch to create a model. Oh, there is one more thing we need to know, sorry, which is that you can put something in parentheses after your class name, and that's called the superclass. It's basically going to give you some stuff for free, some functionality for free. If you create a model in PyTorch, you have to make Module your superclass. This is actually fastai's version of Module, but it's nearly the same as PyTorch's.

So when we create this DotProduct object, it's going to call `__init__`, and we have to say how many users are going to be in our model, how many movies, and how many factors. We can then create an embedding of users by factors for the users, and an embedding of movies by factors for the movies. And then PyTorch does something quite magic: if you create a DotProduct object like so, you can treat it like a function. You can call it, and I can calculate values on it. When you do that, it's really important to know, PyTorch is going to call a method called forward in your class. So this is where you put the calculation of your model. It has to be called forward, and it's going to be passed the object itself and the thing you're calculating on, in this case the user and movie for a batch.

So this is your batch of data: each row will be one user and movie combination, and the columns will be users and movies. We can grab the first column, so this is every row of the first column, and look it up in the user factors embedding to get our user embeddings. That is the same as doing this, if we say this is one mini batch. Then we do exactly the same thing for the second column, passing it into our movie factors to look up the movie embeddings, and then take the dot product, dim=1, because we're summing across the columns; for each row, we're calculating a prediction.

Once we've got that, we can pass it to a Learner, passing in our data loaders, our model, and our loss function, mean squared error, and we can call fit, and away it goes. And this, by the way, is running on CPU. These are very fast to run; this is doing 100,000 rows in 10 seconds, which is a whole lot faster than our few dozen rows in Excel. You can see the loss going down, and so we've trained a model. It's not going to be a great model.
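Putting that together, here is roughly what the from-scratch dot product model looks like. This follows the lesson notebook from memory, so treat it as a sketch; the learning rate and epoch count are just plausible values.

```python
from fastai.collab import *

class DotProduct(Module):
    def __init__(self, n_users, n_movies, n_factors):
        self.user_factors  = Embedding(n_users,  n_factors)
        self.movie_factors = Embedding(n_movies, n_factors)

    def forward(self, x):
        # x is a batch: column 0 is the user index, column 1 is the movie index
        users  = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        return (users * movies).sum(dim=1)   # dot product per row

n_users  = len(dls.classes['user'])
n_movies = len(dls.classes['title'])

model = DotProduct(n_users, n_movies, 5)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
```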
One of the problems is that, let's see if we can see this in Excel, look at this one here: this prediction is bigger than five, but nothing's bigger than five. That seems like a problem: we're predicting things that are bigger than the highest possible number. And in fact these are very much movie enthusiasts; nobody even gave anything a one here.

Do you remember when we learned about sigmoid, the idea of squishing things between zero and one? We could do stuff without a sigmoid, but when we added a sigmoid it trained better, because the model didn't have to work so hard to get into the right zone. Now, if you think about it, if you take something, put it through a sigmoid and then multiply it by five, you've now got something that's going to be between zero and five, where you used to have something between zero and one. So we could do that. In fact, we could do it in Excel; I'll leave that as an exercise for the reader. Let's do it over here in PyTorch.

So we take the exact same class as before, and this time we call sigmoid_range. sigmoid_range is something which will take our prediction and then squash it into our range, and by default we'll use a range of zero through to 5.5, so it can't be smaller than zero and it can't be bigger than 5.5. Why don't I use five? Because a sigmoid can never hit one, and a sigmoid times five can never hit five, but some people do give movies a five, so you want to make it a bit bigger than your highest rating. This one got a loss of 0.8628... oh, it's not better. Isn't that always the way. It didn't actually help; it doesn't always. So be it. Let's keep trying to improve it.

Let me show you something I noticed. Some of the users, like this person here, just loved movies. They give nearly everything a four or five; their worst score is a three. This person, oh, here's a one, this person's got much more range: some things are twos, some ones, some fives. And this person doesn't seem to like movies very much, considering how many they watch; nothing gets a five. They've got discerning tastes, I guess. At the moment we don't have any way, in our formulation of this model, to say "this user tends to give low scores and this user tends to give high scores". There's just nothing like that, but it would be very easy to add.

Let's add one more number to our five factors, just here, for each user. And now, rather than doing just the matrix multiply, let's add this number to it, H19... oh, it's actually the top one, let's add I19 to it... yeah, I've got it wrong, it's this one here. So this row here we're going to add to each rating. Then we do the same thing here: each movie's now got an extra number, A26, that again we're going to add. So it's our matrix multiplication plus what we call the bias, the user bias plus the movie bias. Effectively that's like making it so we don't have an intercept of zero anymore.

So if we now train this model (Data, Solver, Solve), previously we got to 0.42, and we'll let that go along for a while. Meanwhile, let's also go back and look at the PyTorch version. For PyTorch, we're now going to have a user bias, which is an embedding of n_users by one (remember, there's just one number for each user), and a movie bias, which is an embedding of n_movies, also by one. So we can now look up the user embedding and the movie embedding, do the dot product, then look up the user bias and the movie bias and add them, and chuck that through the sigmoid. Let's train that. So, did we beat 0.865? Wow, we're not training very well, are we? Still not too great: 0.894. I think Excel normally does do better, though.
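Here is a sketch of the bias version with sigmoid_range, again reconstructed from memory of the lesson notebook rather than copied from it:

```python
class DotProductBias(Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        self.user_factors  = Embedding(n_users,  n_factors)
        self.user_bias     = Embedding(n_users,  1)
        self.movie_factors = Embedding(n_movies, n_factors)
        self.movie_bias    = Embedding(n_movies, 1)
        self.y_range = y_range

    def forward(self, x):
        users  = self.user_factors(x[:, 0])
        movies = self.movie_factors(x[:, 1])
        res = (users * movies).sum(dim=1, keepdim=True)
        res += self.user_bias(x[:, 0]) + self.movie_bias(x[:, 1])  # per-user and per-movie offsets
        return sigmoid_range(res, *self.y_range)                   # squash into (0, 5.5)

model = DotProductBias(n_users, n_movies, 5)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3)
```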
Let's see. Okay, Excel... oh, Excel's done a lot better: it's gone from 0.42 to 0.35.

So what happened here? Why did it get worse? Well, look at this: the valid loss got better, and then it started getting worse again. So we think we might be overfitting, and we have got a lot of parameters in our embeddings. So how do we avoid overfitting? A classic way is to use something called weight decay, also known as L2 regularization, which sounds much more fancy. What we're going to do is add to our loss function the sum of the weights squared, before we compute the gradients. This is something you should go back and add to your Titanic model, not that it's overfitting, but just to try it.

Previously our loss function has just been about the difference between our predictions and our actuals, and so our gradients were based on the derivative of that with respect to the coefficients. But now we're saying: let's also add the sum of the squares of the weights, times some small number. So what would make that loss function go down? It would go down if we reduced our weights, for example if we reduced all of our weights to zero (I should say, if we reduced the magnitude of our weights). If we reduce them all to zero, that part of the loss function will be zero, because the sum of zero squared is zero.

Now, the problem is, if our weights are all zero, our model doesn't do anything, so we'd have crappy predictions. So it would want to increase the weights so that it's actually predicting something useful. But if it increases the weights too much, then it starts overfitting. So how is it going to get the lowest possible value of the loss function? By finding the right mix: weights not too high, but high enough to be useful for predicting. And if there's some parameter that's not useful, say we asked for five factors and we only need four, it can just set the weights for the fifth factor to zero. Problem solved: it won't be used to predict anything, but it also won't contribute to our weight decay part.

So previously we had something calculating the loss function, and now we're going to do exactly the same thing, but we're going to square the parameters, sum them up, and multiply them by some small number like 0.01 or 0.001. In fact we don't even need to do this, because remember, the whole purpose of the loss is to take its gradient (and to print it out). The gradient of parameters squared is two times the parameters. It's okay if you don't remember that from high school, but you can take my word for it: the gradient of y = x² is 2x. So all we need to do is take our gradient and add the weight decay coefficient, 0.01 or whatever, times two times the parameters. And given this is just some number we get to pick, we might as well fold the two into it and get rid of it.

So when you call fit, you can pass in a wd parameter, which adds this times the parameters to the gradient for you. That says to the model: please don't make the weights any bigger than they have to be. And yay, finally our loss actually improved; you can see it getting better and better. In fastai applications like vision, we try to set this for you appropriately, and we generally do a reasonably good job; the defaults are normally fine.
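In rough code, the gradient trick described above, and the way you would actually ask fastai to do it for you, look something like this; the wd value is just an example of the kind of number you might try.

```python
# Conceptually, weight decay adds this to every parameter's gradient:
#   parameter.grad += wd * 2 * parameter
# and since wd is just a number we choose, the factor of 2 gets folded into it.

model = DotProductBias(n_users, n_movies, 5)
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.1)   # wd is the weight decay coefficient
```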
But in things like tabular and collaborative filtering, we don't really know enough about your data to know what to use here, so you should just try a few things. Try a few multiples of 10: start at 0.1 and then divide by 10 a few times, and see which one gives you the best result.

So this is called regularization. Regularization is about making your model no more complex than it has to be, so it has a lower capacity. The higher the weights, the more they're moving the model around, so we want to keep the weights down, but not so far down that they can't make good predictions. If this value is higher, it will keep the weights down more; it will reduce overfitting, but it will also reduce the capacity of your model to make good predictions. If it's lower, it increases the capacity of the model and increases overfitting.

All right, I'm going to leave this bit for next time. Before we wrap up, John, are there any more questions?

Yeah, there are. There are some from back at the start of the collaborative filtering. We had a bit of a conversation a while back about the size of the embedding vectors, and you talked about your fastai rule of thumb. So there was a question: has anyone ever done a hyperparameter search, an exploration, for that?

People will often do a hyperparameter search for their own model, for sure, but I haven't seen any other rules other than my rule of thumb.

Right, so not productively, to your knowledge.

Oh, productively for an individual model that somebody's building, right.

And then there's a question here from Zaki which I didn't quite wrap my head around, so Zaki, if you want to, maybe clarify in the chat as well: can recommendation systems be built based on average ratings of users' experience rather than collaborative filtering?

Not really. I mean, if you've got lots of metadata you could. If you've got lots of information, demographic data about where the user's from, what loyalty scheme results they've had, and blah blah blah, and there's metadata about the products as well, then sure, averages would be fine. But if all you've got is purchasing history, then you really want the granular data, because otherwise how could you say: they liked this movie, this movie, and this movie, therefore they might also like that movie? All you've got is, oh, they kind of like movies; there's just not enough information there.

Great, that's about it, thanks.

Okay, great. All right, thanks everybody. See you next time for our last lesson.