Machine Learning 1: Lesson 11
Chapters
0:00 Review: Multi-Layer Functions with SGD
2:45 Chain Rule
15:05 Adding Regularization
16:25 Weight Decay
16:34 Cross-Entropy
16:55 Example of Binary Cross-Entropy
22:54 Logistic Regression
32:55 Regularization
57:03 N-grams
71:21 Embeddings
77:06 Multi-Layer Neural Network
78:21 Entity Embeddings
79:24 The Data
81:36 Data Cleaning
94:31 Treating Columns as Categorical Variables Where Possible
00:00:08.840 |
Multi-layer functions with SGD and so the idea is that we've got some data and 00:00:14.740 |
Then we do something to that data. For example, we multiply it by a weight matrix 00:00:26.280 |
for example, we put it through a softmax or a sigmoid and 00:00:30.200 |
Then we do something to that such as do a cross entropy loss or a root mean squared error loss 00:00:39.200 |
Okay, and that's going to like give us some scalar 00:00:46.660 |
So this is going to have no hidden layers this has got a linear layer 00:00:56.440 |
A nonlinear activation being a softmax and a loss function being a root mean squared error or a cross entropy 00:01:25.320 |
If this was a softmax and this was cross entropy, then that would be logistic regression 00:01:42.840 |
For now, think of it like root mean squared error, same thing, some loss function. Okay, now 00:01:49.880 |
We'll look at cross entropy again in a moment 00:01:54.880 |
How do we calculate the derivative of that with respect to our weights, right? 00:02:05.120 |
So really it would probably be better if we said 00:02:07.840 |
X comma W here because it's really a function of the weights as well. And so we want the derivative of this 00:02:32.700 |
I just screwed up. That's all that's why that didn't make sense. All right 00:02:38.480 |
So to do that we basically do the chain rule, so we just say that this is equal to h of u 00:03:10.460 |
Right, and then we can do the chain rule, so we can say the derivative is h dash of u times the derivative of u, the derivative of the outer function times the derivative of the inner one 00:03:27.900 |
In order to take the derivative with respect to the weights therefore 00:03:33.940 |
We just have to calculate that derivative with respect to W using that exact formula 00:03:49.780 |
Yeah, so so D of all that dW would be that yeah 00:03:58.280 |
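Written out (using f for the linear layer, g for the softmax or sigmoid, and h for the loss; these symbols are mine, not notation taken from the lecture slides), the composed function and its derivative by the chain rule are:

$$
\hat{y} = h\big(g\big(f(x, w)\big)\big),
\qquad
\frac{\partial \hat{y}}{\partial w}
  = h'\big(g(f(x, w))\big)\cdot g'\big(f(x, w)\big)\cdot \frac{\partial f(x, w)}{\partial w}
$$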
So then if if we you know went further here and 00:04:09.780 |
Another linear layer, right? Let's give us a bit more room 00:04:30.380 |
There's no difference to now calculate the derivative with respect to all of the parameters. We can still use the exact same 00:04:42.700 |
So don't think of the multi-layer network as being like things that occur at different times 00:04:51.500 |
So we just use the chain rule to calculate all the derivatives at once, you know 00:04:58.080 |
There's a they're just a set of parameters that happen to appear in different parts of the function 00:05:04.540 |
No different so to calculate this with respect to 00:05:09.660 |
W1 and W2, you know, it's it's just you just increase 00:05:14.500 |
You know W you can just now just call it W and say W1 is all of those weights 00:05:20.020 |
So the result that's a great question so what you're going to have then 00:05:38.460 |
Here's W1 and like it's it's it's probably some kind of higher rank tensor, you know, like if it's a 00:05:49.740 |
It'll, you know, be like a rank 3 tensor or whatever, but we can flatten it out, that is, we just make it a list of parameters 00:06:17.940 |
It's how much does changing that value of W affect the loss how much does changing that value of W affect the loss? 00:06:24.740 |
Right, so you can basically think of it as a function like, you know y equals 00:06:36.100 |
Plus C right and say like oh, what's the derivative of that with respect to A B and C? 00:06:42.300 |
And you would have three numbers the derivative with respect to A and B and C and that's all this is right 00:06:49.080 |
It's a derivative with respect to that weight, that weight, and that weight, and that weight, and that weight, and that weight 00:07:00.700 |
We had to calculate and I'm not going to go into detail here, but we had to calculate like 00:07:11.940 |
you've now got something where you've got like a 00:07:18.060 |
You've got an input vector. These are the activations from the previous layer, right and you've got 00:07:31.100 |
output activations, right and so now you've got to say like okay for this particular sorry for this particular 00:07:41.220 |
How does changing this particular weight change? 00:07:46.860 |
This particular output and how does changing this particular weight change this particular output and so forth so you kind of end up with these 00:07:56.740 |
Higher dimensional tensors showing like for every weight. How does it affect? 00:08:04.340 |
But then by the time you get to the loss function the loss function is going to have like a mean or a sum or something 00:08:10.180 |
so they're all going to get added up in the end, you know, and so this kind of thing like I 00:08:16.540 |
don't know it drives me a bit crazy to try and 00:08:19.660 |
Calculate it out by hand or even think of it step by step because you tend to have like 00:08:26.660 |
You just have to remember for every input in a layer for every output in the next layer 00:08:31.020 |
You know, you're going to have to take out for every weight for every output. You're going to have to have a separate 00:08:40.300 |
One good way to look at this is to learn to use PyTorch's .grad 00:08:47.620 |
attribute and .backward method manually, and look up the PyTorch tutorials 00:08:53.460 |
and so you can actually start setting up some calculations with a vector input and the vector output and then type dot backward and 00:09:03.900 |
Right and then do some really small ones with just two or three 00:09:06.540 |
items in the input and output vectors and like make the make the operation like plus two or something and like 00:09:13.140 |
See what the shapes are make sure it makes sense 00:09:22.660 |
Vector matrix calculus is not like introduces zero new concepts to anything you learned in high school 00:09:29.620 |
Like strictly speaking but getting a feel for how these shapes 00:09:34.580 |
move around, I find, takes a lot of practice, you know. The good news is you almost never have to worry about it 00:09:55.260 |
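Here is a minimal sketch of the kind of experiment being suggested, with a tiny input and output so the gradient shapes are easy to check (the specific numbers and loss are just placeholders):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])            # tiny input vector
w = torch.randn(2, 3, requires_grad=True)    # weight matrix: 2 outputs, 3 inputs
y = torch.tensor([0.5, -1.0])                # target

pred = w @ x                                 # linear layer, shape (2,)
loss = ((pred - y) ** 2).mean()              # scalar loss
loss.backward()                              # the chain rule, all at once

print(w.grad.shape)                          # torch.Size([2, 3]): one gradient per weight
```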
Talking about then using this kind of logistic regression 00:10:00.580 |
for NLP, and before we got to that point we were talking about using naive Bayes for NLP 00:10:17.020 |
Right a review like this movie is good and turn it into a bag of words 00:10:22.380 |
Representation consisting of the number of times each word appears 00:10:26.380 |
All right, and we call this the vocabulary. This is the unique list of words. Okay, and we used the 00:10:33.860 |
sklearn CountVectorizer to automatically generate both the vocabulary, which in sklearn they call the features, 00:10:43.300 |
and to create the bag-of-words representations, and the whole group of them then is called a term-document matrix 00:10:52.700 |
And we kind of realized that we could calculate 00:11:01.860 |
the probability that a positive review contains the word 'this' by just 00:11:05.260 |
averaging the number of times this appears in the positive reviews and we could do the same for the 00:11:13.260 |
Negatives right and then we could take the ratio of them to get something which if it's greater than one 00:11:21.940 |
was a word that appeared more often in the positive reviews, or less than one was a word that appeared more often in the negative reviews 00:11:31.420 |
Then we realized you know using using Bayes rule that and taking the logs 00:11:37.920 |
That we could basically end up with something where we could add up the logs of these 00:11:43.820 |
Plus the log of the ratio of the probabilities that things are in class 1 versus class 0 00:11:49.680 |
And end up with something we can compare to 0 00:11:53.580 |
It's a bit greater than 0 then we can predict a document is 00:11:58.020 |
Positive or if it's less than 0 we can predict the document is negative and that was our Bayes rule, right? 00:12:04.140 |
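As a rough sketch of that calculation (variable names like trn_term_doc, val_term_doc and y are assumptions in the spirit of the course notebook, not taken from this transcript):

```python
import numpy as np

# trn_term_doc: (n_documents, n_features) term-document matrix; y: 0/1 labels
p = trn_term_doc[y == 1].sum(0) + 1            # word counts in positive reviews (+1 smoothing)
q = trn_term_doc[y == 0].sum(0) + 1            # word counts in negative reviews (+1 smoothing)
r = np.log((p / p.sum()) / (q / q.sum()))      # log of the ratio of probabilities per word
b = np.log((y == 1).mean() / (y == 0).mean())  # log of the ratio of the class priors

preds = (val_term_doc @ r.T + b) > 0           # predict positive when the sum is greater than 0
```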
So we kind of did that from math first principles 00:12:09.020 |
And I think we agreed that the the naive in naive Bayes was a good description 00:12:14.260 |
Because it assumes independence when it's definitely not true 00:12:17.980 |
but it's an interesting starting point and I think it was interesting to observe when we actually got to the point where like 00:12:31.420 |
Took the log and now rather than multiply them together of course we have to add them up and 00:12:37.580 |
When when we actually wrote that down we realized like oh that is 00:12:52.780 |
Right and so then we kind of realized like oh, okay, so like if this is not very good 00:13:04.380 |
Why not improve it by saying hey, we know other ways to calculate a 00:13:09.180 |
You know a bunch of coefficients and a bunch of biases, which is to 00:13:14.400 |
Learn them in a logistic regression right so in other words this this is the formula we use for a logistic regression 00:13:22.520 |
And so why don't we just create a logistic regression and fit it? So rather than using the 00:13:34.240 |
coefficients and biases which are theoretically correct based on, you know, this assumption of independence and based on Bayes' rule, 00:13:42.620 |
they'll be the coefficients and biases that are actually the best in this data 00:13:48.100 |
All right, so that was kind of where we got to and so 00:14:00.740 |
Just about everything I find in machine learning ends up being either like a tree or, 00:14:07.260 |
you know, a bunch of matrix products and nonlinearities 00:14:11.380 |
Right like it's everything seems to end up kind of coming down to the same thing 00:14:20.620 |
And then it turns out that nearly all of the time, whatever the parameters are in that function, 00:14:28.820 |
nearly all the time it turns out that they're better learned from the data than 00:14:33.460 |
calculated based on theory, right? And indeed that's what happened when we actually tried learning those coefficients 00:14:46.300 |
We noticed that we could also rather than take the whole term document matrix 00:14:51.180 |
We could instead just take them the you know ones and zeros for presence or absence of a word 00:14:57.420 |
And you know sometimes it was you know, this equally is good 00:15:01.460 |
But then we actually tried something else which is we tried adding regularization 00:15:06.260 |
And with regularization the binarized approach turned out to be a little better. All right, so then 00:15:17.220 |
And again, let's start with RMSE and then we'll talk about cross entropy. The loss function was 00:15:24.960 |
our predictions minus our actuals, squared, sum that up, take the average 00:15:41.680 |
Okay, and so this specifically is the L2 penalty 00:15:56.400 |
We also noted that we don't really care about the loss function per se we only care about its derivatives 00:16:05.400 |
That's actually the thing that updates the weights 00:16:07.400 |
so because this is a sum we can take the derivative of each part separately, and so the derivative of this part was just 2aw 00:16:18.320 |
Right, and so we kind of learned that even though these are mathematically equivalent 00:16:24.560 |
This version is called weight decay, and that's the term that's used in the neural net literature 00:16:33.600 |
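Written out in standard notation (not copied from the lecture slides), the regularized loss and the derivative of its penalty term are:

$$
L = \frac{1}{n}\sum_{i=1}^{n}\big(\hat{y}_i - y_i\big)^2 \;+\; a\sum_{j} w_j^2,
\qquad
\frac{\partial}{\partial w_j}\Big(a\sum_{j} w_j^2\Big) = 2\,a\,w_j
$$

Adding the squared-weights term to the loss is the L2 penalty; subtracting $2 a w_j$ directly in the weight update is weight decay, and the two are equivalent for plain SGD.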
So cross entropy on the other hand, you know, it's just another loss function like root mean squared error 00:16:56.200 |
Binary cross entropy. So let's say this is our you know, is it a cat or a dog? So let's just say is cat 00:17:02.400 |
one or a zero so cat cat dog dog cat and these are our 00:17:07.960 |
Predictions this is the output of our final layer of our neural net or a logistic regression or whatever 00:17:27.640 |
We take the actual times the log of the prediction, then we add to that 1 minus the actual times the log of 1 minus the prediction, and then take the negative of that whole thing 00:17:39.200 |
So I suggested to you all that you tried to kind of write the if statement version of this 00:17:45.800 |
So hopefully you've done that by now. Otherwise, I'm about to spoil it for you. So this was 00:18:04.520 |
Right and negative of that. Okay, so who wants to tell me how to write this is an if statement 00:18:16.560 |
Sure, hit me. I'll give it a try. So if y equals 1, return log y hat, 00:18:29.680 |
well, else return log of 1 minus y hat. Good. Oh, and that's the thing in the brackets, and you take the negative of it 00:18:37.280 |
Good. So the key insight she's using is that y has two possibilities 1 or 0 00:18:47.760 |
The key insight, which I think is hard to see until you actually think about what values it can take 00:18:55.720 |
That's that's all it's doing it's saying either give me that or give me that 00:19:01.520 |
Right. Could you pass that to the back, please? 00:19:05.360 |
Maybe I'm missing something, but what are the two variables in that statement? Because you've got y... 00:19:13.400 |
Shouldn't it be like y hat in there? Oh, yeah. Thank you 00:19:30.220 |
The categorical version is just the same thing, but you're saying it for more than just y equals 1 or 0 00:19:37.560 |
But y equals 0 1 2 3 4 5 6 7 8 9 for instance 00:19:43.480 |
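Here is the if-statement version written out for the binary case (a minimal sketch; y is the actual label and p the predicted probability):

```python
import math

def binary_cross_entropy(y, p):
    # y is the actual label (1 or 0), p is the predicted probability that y is 1
    if y == 1:
        return -math.log(p)
    else:
        return -math.log(1 - p)
```

The categorical version just returns minus the log of the predicted probability for whichever class y actually is.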
So that loss function has, you can figure it out yourself, a particularly simple derivative 00:19:50.960 |
And another thing you could play with at home, if you like, is thinking about how 00:19:56.440 |
the derivative looks when you add a sigmoid or a softmax before it. You know, it all 00:20:01.840 |
turns out very nicely, because you've got an exp thing going into a log thing, so you end up with, you know, very well behaved derivatives 00:20:11.400 |
The reason I guess there's lots of reasons that people use 00:20:15.800 |
RMSE for regression and cross entropy for classification 00:20:20.200 |
But most of it comes back to the statistical idea of a best linear unbiased estimator 00:20:26.560 |
You know and based on the likelihood function that kind of turns out that these have some nice statistical properties 00:20:36.280 |
For root mean squared error in particular, the properties are perhaps more theoretical than actual, and actually nowadays using 00:20:43.920 |
the absolute deviation rather than the sum of squares deviation can often work better 00:20:53.000 |
So in practice like everything in machine learning I normally try both for a particular data set 00:20:58.760 |
I'll try both loss functions and see which one works better, unless of course 00:21:03.560 |
it's a Kaggle competition, in which case you're told how Kaggle is going to judge it and you should use the same loss 00:21:15.120 |
So yeah, so this is really the key insight is like hey 00:21:19.240 |
Let's let's not use theory but instead learn things from the data and you know 00:21:23.200 |
We hope that we're going to get better results particularly with regularization we do and then I think the key regularization 00:21:29.520 |
insight here is hey, let's not try to reduce the number of parameters in our model, but instead use lots of parameters and 00:21:38.920 |
then use regularization to figure out which ones are actually useful, right? And so then we took that a step further by saying hey, given we can do that with 00:21:45.800 |
Regularization let's create lots more features 00:21:51.120 |
You know, bigrams like 'by vast' and 'by vengeance', and trigrams like 'by vengeance full stop' and 'by vera miles' 00:21:58.800 |
Right and you know just to keep things a little faster 00:22:03.200 |
We limited it to 800,000 features, but you know even with the full 70 million features 00:22:07.920 |
It works just as well, and it's not a hell of a lot slower 00:22:20.840 |
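A sketch of how that featurization might look in scikit-learn (the variable names and the exact arguments are assumptions, not taken from the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer

veczr = CountVectorizer(ngram_range=(1, 3), max_features=800000)
trn_term_doc = veczr.fit_transform(trn_texts)   # trn_texts, val_texts: lists of review strings
val_term_doc = veczr.transform(val_texts)

trn_term_doc = trn_term_doc.sign()              # binarize: presence or absence rather than counts
val_term_doc = val_term_doc.sign()
```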
And so now we can go ahead and say okay our labels is the training set labels as before 00:22:34.920 |
And then let's fit a logistic regression to that 00:22:43.320 |
We get 90% accuracy, so this is looking pretty good 00:22:56.520 |
Let's go back to our naive Bayes, right? In our naive Bayes 00:23:00.760 |
We have this term document matrix, and then for every feature. We're calculating 00:23:06.960 |
the probability of that feature occurring if it's class one that probability of that feature occurring if it's class two and then the 00:23:18.360 |
Right, and in the paper that we're actually loosely basing this off, they call this p, this q, and this 00:23:29.080 |
Q maybe then we'll say probability to make it more obvious 00:23:43.720 |
And so then we kind of said hey, let's let's not use these ratios as the coefficients in that 00:23:50.520 |
in that matrix multiply, but let's instead like 00:23:55.920 |
Try and learn some coefficients. You know so maybe start out with some random numbers 00:24:00.520 |
You know and then try and use stochastic gradient descent to find slightly better ones 00:24:09.880 |
So you'll notice you know some important features here the the R 00:24:15.480 |
Vector is a vector of rank 1 and its length is equal to the number of features 00:24:25.760 |
Of course our logistic regression coefficient matrix is also 00:24:30.560 |
Of length 1 sorry rank 1 and length equal to the number of features right and we're you know 00:24:36.660 |
We're saying like they're kind of two ways of calculating 00:24:38.680 |
The same kind of thing right one based on theory one based on data 00:24:44.720 |
So here is like some of the numbers in R right remember. It's using the log so these numbers 00:24:57.120 |
More likely to be negative and these ones that here are more likely 00:25:00.920 |
Sorry this one here is more likely to be positive and so 00:25:04.700 |
Here's e to the power of that and so these are the ones we can compare to one rather than to zero 00:25:10.720 |
So I'm going to do something that hopefully is going to seem weird 00:25:20.520 |
And so first of all I'm going to talk about I'm going to say what we're going to do and 00:25:25.120 |
Then I'm going to try and describe why it's weird, and then we'll talk about 00:25:29.960 |
Why it may not be as weird as we first thought so here's what we're going to do 00:25:43.160 |
So what that means is we're going to we can do it here in Excel right so we're going to say 00:25:50.040 |
let's grab everything in our term document matrix and 00:25:52.920 |
Multiply it by the equivalent value in the vector of R. All right, so this is like a 00:25:58.880 |
broadcasted element wise multiplication not a matrix multiplication 00:26:13.320 |
Okay, so here is the value of the term document matrix 00:26:20.020 |
times R in other words everywhere that a zero appears there a zero appears here and 00:26:25.740 |
Every time a one appears here the equivalent value of R appears here 00:26:39.420 |
Changed the ones into something else into the into the R's from that feature 00:26:45.020 |
Right and so what we're now going to do is we're going to use this as 00:26:54.260 |
Okay, so here we are: x_nb, the x naive Bayes version, is x times r 00:27:04.940 |
fitting using those independent variables and 00:27:13.500 |
Do that for the validation set okay and get the predictions and 00:27:24.900 |
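In code, that step looks roughly like this (x, val_x, r, y and val_y carry over from the earlier sketches and are assumed names; the C value is just an example):

```python
from sklearn.linear_model import LogisticRegression

x_nb = x.multiply(r)                       # broadcast elementwise multiply, not a matrix multiply
m = LogisticRegression(C=0.1, dual=True, solver='liblinear')
m.fit(x_nb, y)

val_x_nb = val_x.multiply(r)               # apply the same scaling to the validation set
preds = m.predict(val_x_nb)
accuracy = (preds == val_y).mean()
```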
Let me explain why this hopefully seems surprising 00:27:41.620 |
I picked out the wrong ones. I should have said not coeth 00:27:46.180 |
Okay, that's actually ah I got the wrong number okay 00:27:53.580 |
So that's our independent variables right and then the the logistic regression has come up with some set of coefficients 00:28:02.460 |
Let's pretend for a moment that these are the coefficients that it happened to come up with right 00:28:16.780 |
Set of independent variables, but let's use the original binarized feature matrix right and then divide all of our coefficients 00:28:40.260 |
X naive Bayes version of the independent variables, and we've got some 00:28:46.220 |
Some set of weights some sort of some sort of coefficient so call it W 00:28:57.220 |
Where it's found like this is a good set of coefficients and making our predictions from right, but X and B 00:29:22.540 |
Times the weights and so like we could just change the weights to be that 00:29:39.180 |
The change that we made to the dependent variable shouldn't have made any difference 00:29:43.180 |
Because we can calculate exactly the same thing without making that change 00:29:56.900 |
I'm going to try and get you all to try and think about this in order to answer this question you need to think 00:30:00.900 |
About like okay. What are the things that aren't? 00:30:05.620 |
mathematically the same why is why is it not identical? What are the reasons like come up with some hypotheses? What are some reasons that maybe 00:30:15.220 |
Better answer and to figure that out. We need to first of all start with like well. Why is it even a different answer? 00:30:32.460 |
All right, what do you think I'm just wondering if it was two different kinds of multiplications 00:30:36.660 |
You said that one is the element wise multiplication. No they did they do end up mathematically being the same, okay? 00:30:42.340 |
Pretty much there's a minor wrinkle, but not but it's not that it's not some order operations thing 00:30:50.940 |
You are on a roll today, so let's see how you go. I feel like the features are less 00:30:59.780 |
Mean I've made a claim that these are mathematically equivalent, so 00:31:04.900 |
So what are you saying really you know why are we getting different answers? 00:31:11.340 |
It's good people coming up with hypotheses. We need lots of wrong answers before we start finding. It's the right ones 00:31:21.460 |
It's like that. You know I'm a warmer hotter colder. You know Ernest you're gonna get us hotter 00:31:25.540 |
Does it have anything to do with the regularization? Yes, and is it the fact that when you so let's start there, right? 00:31:31.900 |
So Ernest point here is like okay Jeremy. You've set their equivalent, but they're equivalent outcomes 00:31:37.820 |
Right, but you got through you went through a process to get there and that process included regularization, and they're not necessarily equivalent 00:31:45.020 |
Regularization like our loss function has a penalty so yeah help us think through and as how much that might impact things 00:31:52.820 |
Well, this is maybe kind of dumb, but I'm just noticing that the numbers are bigger in the ones 00:32:02.660 |
These are bigger and some are smaller some are bigger 00:32:06.420 |
But that there are some bigger ones like the variance between the columns is much higher now the variance is bigger 00:32:11.460 |
Yeah, I think that's a very interesting insight. Okay. That's all I got okay, so build on that 00:32:24.900 |
Hit us I'm not sure that's fine. Is it also considered like considering the dependency of different words? 00:32:31.740 |
Is that why it is performing better, rather than assuming they're all 00:32:35.300 |
independent of each other? Not really. I mean, it's, you know, again 00:32:40.240 |
You know theoretically these are creating mathematically equivalent outputs 00:32:45.540 |
So they're not they're not doing something different except 00:33:04.220 |
That was the weirdest thing I forgot to go into screenwriting mode, and it just turns out that you can actually write in Excel 00:33:15.020 |
I usually use screen writing, so I don't screw up my spreadsheet. I just never tried 00:33:21.660 |
So our loss was equal to, like, our cross entropy loss, you know, based on 00:33:31.240 |
the predictions and the actuals, right, plus our regularization penalty, a times the sum of the squared weights. And if a is really big 00:33:57.020 |
Right and it drowns out that piece right, but that's actually the piece we care about right we actually want it to be a good fit 00:34:04.740 |
So we want to have as little regularization going on as we can get away with we want so we want to have less 00:34:14.860 |
So here's the thing right our value. Yes, can you pass it over here? 00:34:20.260 |
When you say we should have less weights, do you mean smaller weights? I do, yeah 00:34:27.180 |
Yeah, and I kind of use the two words a little equivalently, which is not quite fair 00:34:32.180 |
I agree, but the idea is that weights that are pretty close to zero are kind of not there 00:34:42.460 |
You know and I'm not a Bayesian weenie, but I'm still going to use the word prior right they're kind of like a prior 00:34:52.380 |
The different levels of importance and positive or negative of these different features 00:34:58.220 |
might be something like that, right? We think that, like, 'bad' is probably a more negative word 00:35:11.140 |
than 'good', right? So our kind of implicit assumption 00:35:16.500 |
before was that we have no priors. So in other words, when we said penalize the sum of the 00:35:23.500 |
squared weights, we're saying a non-zero weight is something we don't want to have 00:35:28.780 |
right, but actually I think what I really want to say is that 00:35:33.580 |
Differing from the naive Bayes expectation is something. I don't want to do right 00:35:40.220 |
Like only vary from the naive Bayes prior unless you have good reason to believe otherwise 00:35:46.780 |
All right, and so that's actually what this ends up doing right we end up saying you know what? 00:35:58.060 |
Right and so if you're going to like make it a lot bigger or a lot smaller 00:36:05.100 |
Right that's going to create the kind of variation in weights. That's going to cause that squared term to go up right so 00:36:15.220 |
You know just leave all these values about similar to where they are now 00:36:19.280 |
Right and so that's what the penalty term is now doing right the penalty term when our inputs is already multiplied by R 00:36:26.980 |
Is saying penalize things where we're varying it from our naive Bayes 00:36:42.780 |
Constant like R squared or something like that when the variance would be much higher this time 00:36:50.580 |
Our prior comes from an actual theoretical model right so I said like I don't like to rely on theory 00:37:01.580 |
Then you know maybe we should use that as our starting point rather than starting off by assuming everything's equal 00:37:08.060 |
So our prior said hey, we've got this model called naive Bayes and the naive Bayes model said 00:37:17.140 |
Then R is the correct coefficient right in this specific 00:37:24.580 |
That that's why we pick that because our our prior is based on that that theory 00:37:29.940 |
Okay, so this is a really interesting insight, which I 00:37:39.180 |
Never really see covered which is this idea is that we can use these like, you know traditional 00:37:47.500 |
Machine learning techniques we can imbue them with this kind of Bayesian sense 00:37:57.500 |
You know incorporating our theoretical expectations 00:38:01.180 |
Into the data that we give our model right and when we do so 00:38:08.540 |
We don't have to regularize as much and that's good right because if we regularize a lot 00:38:24.500 |
Remember, the way they do it in the sklearn logistic regression is this C is the reciprocal of the amount of regularization, so you 00:38:39.660 |
add lots of regularization by making it small 00:38:54.100 |
It's trying really hard to get those weights down. The loss function is overwhelmed 00:38:59.540 |
By the need to reduce the weights and the need to make it predictive is kind of now seems totally unimportant 00:39:10.660 |
So by kind of starting out and saying, you know what, don't push the weights down so that you end up ignoring 00:39:20.020 |
the terms, but instead push them down so that you try to get rid of, you know, the differences from our prior, we get a better 00:39:42.860 |
result, which actually, this technique was originally presented, I think, about 2012, by 00:39:49.380 |
Chris Manning, who's a terrific NLP researcher at Stanford, and 00:39:53.480 |
Sida Wang, who I don't know but I assume is awesome because this paper is awesome. They basically came up with it 00:40:02.520 |
What they did was they compared it to a number of other approaches on a number of other 00:40:11.300 |
Datasets so one of the things they tried is this one is the IMDB data set right and so here's naive Bayes SVM on bigrams 00:40:18.220 |
And as you can see this approach out performed the other 00:40:23.060 |
linear based approaches that they looked at and also some 00:40:27.140 |
Restricted Boltzmann machine kind of neural net based approaches. They looked at now nowadays 00:40:36.700 |
You know there are better ways to do this and in fact in the deep learning course 00:40:39.580 |
We showed a new state-of-the-art result that we just developed at fast AI that gets 00:40:46.180 |
But still you know like particularly for a linear technique. That's easy fast and intuitive 00:40:53.180 |
And you'll notice when they when they did this they only used by grams 00:40:57.540 |
And I assume that's because they I looked at their code, and it was kind of pretty slow and ugly 00:41:02.180 |
You know I figured out a way to optimize it a lot more as you saw and so we were able to use 00:41:08.300 |
Here trigrams, and so we get quite a lot better 00:41:12.500 |
So we've got 91.8 versus their 91.2, but other than that it's identical 00:41:16.820 |
Also, I mean, they used a support vector machine, which is almost identical to a logistic regression in this case 00:41:24.900 |
So there's some minor differences right so I think that's a pretty cool result and 00:41:36.780 |
You know what you get to see here in class is the result of like 00:41:43.340 |
Weeks and often many months of research that I do and so I don't want you to think like this stuff is obvious 00:41:55.800 |
Why they use this model how it's different why they thought it works? 00:42:00.900 |
You know it took me a week or two to even realize that it's kind of like mathematically equivalent 00:42:06.660 |
To a normal logistic regression and then a few more weeks to realize that the difference is actually in the regularization 00:42:16.620 |
Machine learning as I'm sure you've noticed from the Kaggle competitions you enter you know like you come up with a thousand good ideas 00:42:24.140 |
999 of them no matter how confident you are they're going to be great 00:42:29.500 |
you know and then finally after four weeks one of them finally works and 00:42:34.540 |
Kind of gives you the enthusiasm to spend another four weeks of misery and frustration 00:42:47.580 |
The practitioners I know in machine learning all share one particular trait in common, which is they're very, very tenacious 00:42:55.500 |
You know also known as stubborn and bloody-minded right which is definitely a reputation. I seem to have 00:43:05.420 |
Along with another thing which is that they're all very good coders. You know they're very good at turning their ideas into into code 00:43:16.020 |
So you know this was like a really interesting 00:43:18.020 |
Experience for me working through this a few months ago to try and like figure out how to how to at least 00:43:24.260 |
You know how to explain why this at the at the time kind of state-of-the-art result exists 00:43:30.780 |
And so once I figured that out. I was actually able to build on top of it and make it quite a bit better 00:43:37.060 |
And I'll show you what I did, and this is where it was very, very handy to have PyTorch at my disposal 00:43:44.880 |
Because I was able to kind of create something that was 00:43:48.700 |
Customized just the way that I wanted to be and also very fast by using the GPU 00:43:56.100 |
So here's the kind of fastai version of the NBSVM. Actually my friend Stephen Merity, who's a 00:44:05.980 |
Researcher in NLP has christened this the NB SVM plus plus which I thought was lovely 00:44:11.980 |
So here is the even though there is no SVM. It's a logistic regression, but as I said nearly exactly the same thing 00:44:17.540 |
So let me first of all show you like the code 00:44:22.180 |
So this is like we try to like once I figure out like okay 00:44:25.180 |
This is like the best way I can come up with to do a linear bag-of-words model 00:44:28.860 |
I kind of embed it into fast AI so you can just write a couple of lines of code 00:44:32.180 |
So the code is basically hey, I want to create a data class for text classification 00:44:45.020 |
Here is the same thing for the validation set 00:44:51.860 |
2,000 unique words per review which is plenty 00:45:01.780 |
Construct a a learner which is kind of the fast AI generalization of a model 00:45:07.380 |
Which is based on a dot product of naive Bayes and then fit that model 00:45:18.100 |
After five epochs I was already up to ninety two point two. All right, so this is now like, you know getting 00:45:44.580 |
That's it. Right and it'll also look on the whole 00:45:48.700 |
Extremely familiar, right? There's if there's a few tweaks here 00:45:52.780 |
Pretend this thing that says embedding pretend it actually says linear. Okay, I'm going to show you embedding in a moment 00:45:59.420 |
So we've got basically a linear layer where the number of features coming with the number of features as the rows and remember 00:46:06.500 |
SK learn features means number of words basically and then for each row we're going to create 00:46:13.420 |
One weight which makes sense right for like a logistic regression every every so not for each row for each word 00:46:23.860 |
Then we're going to be multiplying it by the the R values. So for each word 00:46:31.440 |
We have one R value per class. So I actually made this so this can handle like not just 00:46:38.500 |
Positive versus negative but maybe figuring out like which author created this work. There could be five or six authors 00:46:44.740 |
Whatever right and basically we kind of use those linear layers 00:46:53.140 |
the value of the weight and the value of the R, and then we take the weight times the R 00:47:02.140 |
Then sum it up. And so that's just a dot product. Okay, so just just a simple dot product just as we would do for any 00:47:18.140 |
That we add to get the the better result is this the main one really is this here this plus something 00:47:30.020 |
It's a parameter, but I pretty much always use this this version this value 0.4 00:47:36.980 |
So what this is doing is it's again kind of changing the prior, right? So if you think about it 00:47:44.620 |
Even once we use this R times the term document matrix as our independent variables 00:47:56.460 |
You really want to start with a question? Okay, the penalty terms are still pushing W down to zero, right? 00:48:05.180 |
For W to be 0 right? So what would it mean if we had you know? 00:48:14.460 |
Right. So what that would do when we go? Okay this matrix times these coefficients 00:48:24.540 |
Right. So a weight of 0 still ends up saying I have no opinion on whether this thing is positive or negative 00:48:35.220 |
Right, then it basically says my opinion is that the naive Bayes coefficients are exactly right 00:48:54.980 |
The right prior right we shouldn't really be saying if there's no coefficient. It means ignore the naive Bayes coefficient 00:49:05.380 |
Right because we actually think that naive Bayes is only kind of part of the answer 00:49:09.860 |
All right, and so I played around with a few different data sets where I basically said 00:49:23.900 |
Right, and so 0 would become in this case 0.4 00:49:35.060 |
Penalty is pushing the weights not towards 0 but towards this value 00:49:41.180 |
Right, and I found that across a number of data sets 0.4 00:49:46.100 |
Works pretty well that and it's pretty resilient. All right. So again, this is the basic idea is to kind of like 00:49:53.660 |
Get the best of both worlds, you know, we're we're we're learning from the data using a simple model 00:50:03.580 |
but incorporating our prior knowledge as best as we can. And so it turns out, when you say, okay, 00:50:10.120 |
let's tell it, you know, that a weight matrix of zeros 00:50:15.620 |
Actually means that you should use about you know about half of the R values 00:50:20.140 |
That ends up that ends up working better than the prior that the weights should all be zero 00:50:31.700 |
Is the weights, the W, is it that the 0.4 denotes the amount of regularization required? We have the 00:50:43.780 |
term where we reduce the amount of error, the prediction error, RMSE, plus we have the 00:50:51.140 |
regularization, and is it W plus the 0.4 to denote the amount of regularization required? So, W are the weights 00:50:57.460 |
Right, so this is calculating our activations. Okay, so we calculate our activations as being equal to the weights times the Rs, 00:51:15.680 |
a normal linear function, right? So the thing which is being penalized is the weights, W. 00:51:31.580 |
So by saying hey, you know what, don't just use W, use W plus 0.4, 00:51:41.860 |
Okay, so effectively the weight matrix gets 0.4 for free 00:51:58.500 |
Every feature is getting some form of weight some form of minimum weight or something like that 00:52:03.340 |
Um, not necessarily because it could end up choosing a coefficient of negative point four for a feature 00:52:11.300 |
which would say, you know what, even though naive Bayes says the R should be whatever for this feature, I think you should totally ignore it 00:52:29.440 |
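Here is a rough PyTorch sketch of the model being described. This is my reconstruction from the description rather than the exact fastai code, so the sizes, the initialization, the padding handling and the r_adj divisor are all assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DotProdNB(nn.Module):
    def __init__(self, nf, ny, w_adj=0.4, r_adj=10):
        super().__init__()
        self.w_adj, self.r_adj = w_adj, r_adj
        self.w = nn.Embedding(nf + 1, 1, padding_idx=0)   # one learned weight per word
        self.w.weight.data.uniform_(-0.1, 0.1)
        self.r = nn.Embedding(nf + 1, ny, padding_idx=0)  # naive Bayes log-count ratios per class;
                                                          # in practice you would copy precomputed r
                                                          # values into self.r.weight and freeze them

    def forward(self, feat_idx):
        # feat_idx: (batch, max_words) padded word indexes for each document
        w = self.w(feat_idx) + self.w_adj                 # the weights get 0.4 "for free"
        r = self.r(feat_idx)
        x = (w * r).sum(dim=1)                            # dot product per document -> (batch, ny)
        return F.softmax(x / self.r_adj, dim=-1)
```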
- okay, let's take a break for about eight minutes or so and start back about 25 to 00:52:39.820 |
Okay, so a couple of questions at the break the first was just for a 00:52:55.220 |
Reminder or a bit of a summary as to what's going on 00:53:22.580 |
So normally what we were doing is saying hey logistic regression is basically 00:53:42.940 |
Right, and then we were kind of saying let's do that bit first 00:53:51.220 |
Although in this particular case actually now I look at it. I'm doing it in this code. It doesn't matter obviously in this code 00:54:11.100 |
So this thing here actually I could I called it W which is probably pretty bad. It's actually W times X 00:54:24.740 |
Right, so so instead of W times X times R. I've got W times X 00:54:54.020 |
Regularization wants the weights to be zero right because we're trying to it's trying to reduce 00:55:05.980 |
so what we're saying is like okay, we want to push the weights towards zero because we're saying like that's our like 00:55:13.780 |
default starting point expectation is the weights are zero and 00:55:19.300 |
So we want to be in a situation where if the weights is zero, then we have a model that like 00:55:25.380 |
Makes theoretical or intuitive sense to us, right? 00:55:30.660 |
This model if the weights are zero doesn't make intuitive sense to us 00:55:36.820 |
Right because it's saying hey multiply everything by zero gets rid of all of that and gets rid of that as well 00:55:42.860 |
And we were actually saying no, we actually think our R is useful, we actually want to keep that 00:55:51.060 |
So instead we say you know what let's take that piece here and add 00:56:01.980 |
Right so now if the regularizer is pushing the weights towards zero 00:56:12.660 |
Right, and so therefore it's pushing our whole model to 0.4 times R 00:56:21.100 |
Now the kind of default starting point, if you've regularized all the weights out altogether, is to say yeah, 00:56:27.380 |
you know, let's use a bit of R, that's probably a good idea 00:56:34.100 |
So that's the idea right that's the idea is basically you know what happens when 00:56:40.020 |
When that's zero right and you and you want that to like be something sensible because otherwise 00:56:48.460 |
Regularizing the weights to move in that direction wouldn't be such a good idea 00:57:05.500 |
So the N in N-gram can be uni, bi, tri, whatever, one, two, three, whatever grams. So for 'this movie is good' 00:57:15.340 |
All right, it has four unigrams: this, movie, is, good 00:57:23.660 |
It has three bigrams: this movie, movie is, is good 00:57:42.060 |
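Spelled out in code as a toy illustration:

```python
tokens = "this movie is good".split()

unigrams = tokens                                                       # ['this', 'movie', 'is', 'good']
bigrams = [' '.join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]   # ['this movie', 'movie is', 'is good']
trigrams = [' '.join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]  # ['this movie is', 'movie is good']
```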
Can you pass that? So yeah, do you mind going back to the w adjusted, the 0.4 stuff? 00:57:50.180 |
yeah, so I was wondering if this adjustment will harm the predictability of the model because 00:57:56.420 |
Think of an extreme case: if it's not 0.4, if it's 4,000, then our 00:58:03.660 |
coefficients will be, like... Right, essentially. So, exactly. So our prior 00:58:09.540 |
Needs to make sense and so our prior here and you know 00:58:13.060 |
This is why it's called dot prod MB is our prior is that this is something where we think naive Bayes is a good prior 00:58:20.660 |
Right, and so naive Bayes says that R equals 00:58:28.180 |
That's not how you write P P over Q. I have not had much sleep 00:58:33.620 |
P over Q is a good prior and not only do we think it's a good prior 00:58:46.500 |
That's that's the naive Bayes model. So in other words, we expect that 00:58:51.620 |
You know a coefficient of one is a good coefficient not not four thousand 00:58:58.600 |
Yeah, so we think specifically we don't think we think zero is probably not a good coefficient 00:59:08.100 |
The naive Bayes version is a little overconfident. So maybe one's a little high 00:59:13.260 |
So we're pretty sure that the right number, assuming that our naive Bayes model is appropriate, is somewhere between zero and one 00:59:23.140 |
No, but what I was thinking is as long as it's not zero you are pushing those 00:59:31.380 |
Coefficients that are supposed to be zero to something not zero and makes the 00:59:35.720 |
Like high coefficients less distinctive from the mode coefficients 00:59:41.100 |
Well, but you see, they're not supposed to be zero. They're supposed to be R. 00:59:45.460 |
Like, that's what they're supposed to be. They're supposed to be R, right? And so 00:59:56.780 |
So this is part of what we're taking the gradient of right? So it's basically 01:00:01.660 |
saying okay, we're still gonna, you know, you can still set the weights to whatever you like, but the regularizer 01:00:12.260 |
wants them to be zero, and so all we're saying is okay, if you want it to be zero, then I'll try to make zero be something sensible 01:00:24.260 |
That's the basic idea and like yeah, nothing says point four is perfect for every data set 01:00:29.460 |
I've tried a few different data sets and found various numbers between point three and point six that are optimal 01:00:38.180 |
I've never found one where it was less good than zero, which is not surprising, and I've also never found one where one is better, right? 01:00:45.100 |
So the idea is like this is a reasonable default, but it's another parameter you can play with which I kind of like right? 01:00:53.180 |
You can grid search or whatever to figure out, for your data set, what's best. And, you know, really the key here being 01:00:59.020 |
Every model before this one as far as I know has implicitly assumed 01:01:05.420 |
It should be zero because they just they don't have this parameter right and you know by the way 01:01:10.100 |
I've actually got a second parameter here as well 01:01:12.100 |
which is, the thing I do to R is actually to divide R by a constant, 01:01:18.180 |
Which I'm not going to worry too much about it now 01:01:20.380 |
But again, it's this is another parameter you can use to kind of adjust what the nature of the regularization is 01:01:29.460 |
I'm an empiricist, not a theoretician, you know. I thought this seemed like a good idea; 01:01:33.420 |
nearly all of my things that seemed like a good idea turn out to be stupid, but this particular one 01:01:38.740 |
gave good results, you know, on this data set and a few other ones as well 01:01:47.540 |
Yeah, I'm still a little bit confused about the W plus W adjusted. Uh-huh 01:01:52.240 |
So you mentioned that we do W plus W adjusted so that the coefficients don't get set to zero 01:02:00.100 |
that we place some importance on the priors, but you also said that the 01:02:05.420 |
effect of learning can be that W gets set to a negative value, which in effect really does make W plus W adjusted 01:02:12.020 |
zero, right? So we are allowing the learning process to indeed set the priors to zero. 01:02:20.020 |
So why is that in any way different from just having W? Because, yeah, great question, because of regularization, because we're penalizing it by that 01:02:33.980 |
We're saying you know what if you if the best thing to do is to ignore the value of R 01:02:40.620 |
That'll cost you you're going to have to set W to a negative number 01:02:44.220 |
Right, so only do that if that's clearly a good idea; unless it's clearly a good idea, you should leave it where it is 01:02:56.340 |
That's that's the only reason like all of this stuff. We've done today is basically entirely about 01:03:02.840 |
You know maximizing the advantage we get from regularization and saying regularization 01:03:09.020 |
pushes us towards some default assumption and nearly all of the machine learning literature assumes that default assumption is 01:03:17.100 |
Everything zero and I'm saying like it turns out 01:03:21.120 |
You know it makes sense theoretically and turns out empirically that actually you should decide what your default assumption is 01:03:27.900 |
And that'll give you better results. So would it be right to say that? 01:03:32.940 |
In a way, you're putting an additional hurdle in the along the way towards getting all coefficients to zero 01:03:39.500 |
So it will be able to do that if it is really worth it 01:03:42.940 |
Yeah, exactly. So I'd say, like, the default hurdle without this is that 01:03:46.940 |
making a coefficient non-zero is the hurdle, and now I'm saying no, the hurdle is making a coefficient different from the naive Bayes prior 01:04:06.840 |
Some of it is some lambda or C penalty constant 01:04:12.400 |
Yeah, yeah, times something. Yeah, so the weight decay should also depend on the value of C; if it is very small... 01:04:22.020 |
Hey, yeah. Yeah. So if a is 0.1, then the weights might not go 01:04:29.220 |
towards zero, then we might not need weight decay. So, well, whatever this value is, 01:04:35.140 |
I mean, if the value of this is zero, then there is no regularization, right? 01:04:39.180 |
But if this value is higher than zero then there is some penalty 01:04:43.660 |
right and and presumably we've set it to non zero because we're overfitting so we want some penalty and so if there is some penalty then 01:04:54.340 |
Then my assertion is that we should penalize things that are different to our prior 01:04:59.980 |
Not that we should penalize things that are different to zero 01:05:03.020 |
And our prior is that things should be, you know, round about equal to R 01:05:11.700 |
Okay, let's move on thanks for the great questions, I want to talk about 01:05:25.660 |
I said pretend it's linear, and indeed we can pretend it's linear 01:05:29.140 |
Let me show you how much we can pretend it's linear, as in nn.Linear, create a linear layer 01:05:39.500 |
All right, here are our coefficients. If we're doing the R version here, our coefficients are 01:05:55.980 |
So right then we could do a matrix multiply of that 01:06:01.620 |
By that right and so we're going to end up with 01:06:23.300 |
One times one plus one times one one times one one times three 01:06:41.320 |
All right, and then the next one zero times one one times one so forth okay, so like that the matrix multiply 01:06:57.220 |
By this coefficient matrix is going to give us an answer okay, so that's that is just a matrix multiply 01:07:03.860 |
So the question is like, okay, well, why didn't Jeremy write nn.Linear? Why did Jeremy write nn.Embedding? 01:07:12.100 |
And the reason is because if you recall we don't actually store it like this 01:07:25.620 |
25,000 right so rather than storing it like this 01:07:32.300 |
Zero one two three right one two three four zero one two 01:07:52.620 |
That's actually how we store it that is this bag of words contains which word indexes 01:08:11.220 |
Of storing it right is just list out the indexes in each sentence 01:08:20.420 |
Want to now do that matrix multiply that I just showed you to create that same 01:08:29.020 |
Right, but I want to do it from this representation 01:08:40.420 |
It's saying a one hot you know this is basically one hot encoded right? 01:08:46.020 |
It's kind of like a dummy dummy matrix version does it have the word this does it have the word movie? 01:08:53.100 |
So if we took the simple version of like does it have the word this one? 01:09:08.380 |
Right then that's just going to return the first item 01:09:26.740 |
Identical to to looking up that matrix to find the nth row in it 01:09:34.100 |
Right so this is identical to saying find the zero first second and fifth 01:09:42.460 |
Right so they're they're the same they're exactly the same thing and like it doesn't like in this case. I only have one 01:09:49.660 |
Coefficient per feature right but actually the way I did this was to have 01:10:01.780 |
Right so in this case is both positive and negative 01:10:04.260 |
So I actually had kind of like an R positive and an R negative 01:10:09.460 |
So negative would be just the opposite right equals that 01:10:13.340 |
Divided by that right now the binary case obviously it's redundant to have both, but what if it was like? 01:10:26.980 |
Jeremy or Savannah or Terrence right now. We've got three categories. We want three 01:10:32.580 |
Values of R right so the nice thing is then this sparse version 01:10:38.380 |
You know you can just look up. You know the zeroth and the first and the second and the fifth 01:10:56.260 |
When you have sparse inputs, it's obviously much much more efficient 01:11:07.900 |
Which is mathematically identical to not conceptually analogous to mathematically identical to 01:11:14.040 |
Multiplying by a one hot encoded matrix is called an embedding 01:11:18.260 |
Right, so I'm sure you've all heard, or most of you probably heard, about embeddings, like word embeddings, word2vec or GloVe 01:11:28.060 |
People love to make them sound like there's some 01:11:30.700 |
amazing, complex neural net thing, right? They're not. An embedding is just a way to 01:11:39.420 |
make a multiplication by a one-hot encoded matrix faster by replacing it with a simple array lookup 01:11:49.260 |
You can think of this as if it said self.w equals nn.Linear 01:12:00.260 |
The same thing right it actually is a matrix with those dimensions. This actually is a matrix with those dimensions 01:12:10.300 |
But it's expecting that the input we're going to give it is not actually a one hot encoded matrix 01:12:24.580 |
Word or for each item so you can see that the forward function in fast AI 01:12:37.300 |
The sparse matrix automatically numpy makes it very easy to just grab those those indexes 01:12:44.340 |
Okay, so in other words there. We've got here. We've got a list of 01:12:51.180 |
each word index of a of the 800,000 that are in this document and 01:12:55.980 |
So then this here says look up each of those in our embedding matrix 01:13:18.820 |
That makes sense, so that's all an embedding is and so what that means is 01:13:31.860 |
Building any kind of model like a you know whatever kind of neural network 01:13:37.140 |
Where we have potentially very high cardinality categorical variables as our inputs 01:13:45.300 |
We can then just turn them into a numeric code between zero and the number of levels and 01:13:57.900 |
Linear layer from that as if we had one hot encoded it 01:14:02.000 |
Without ever actually constructing the one hot encoded version 01:14:06.180 |
And without ever actually doing that matrix multiply okay instead. We will just store 01:14:12.460 |
The index version and simply do the array lookup 01:14:16.100 |
Okay, and so the gradients that are flowing back 01:14:19.580 |
You know basically in the one hot encoded version everything that was a zero has no gradient 01:14:24.500 |
So the gradients flowing back just go to update the particular row of the embedding matrix that we used 01:14:37.700 |
Just like here like you know I wanted to create 01:14:45.700 |
this ridiculously simple little equation right and 01:14:50.300 |
To do it without this trick would have meant I was feeding in a 25,000 by 800,000 matrix 01:15:01.620 |
Which would have been kind of crazy right and so this this trick allowed me to write you know 01:15:07.340 |
You know I just replaced the word linear with embedding 01:15:12.740 |
and replaced feeding the one-hot encodings in with something that just feeds the indexes in, and that was it, it kept working 01:15:32.380 |
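Here is a tiny self-contained check that the embedding lookup and the one-hot matrix multiply really do give the same answer (toy sizes, made up for illustration):

```python
import torch
import torch.nn as nn

nf = 6                                    # toy vocabulary size
emb = nn.Embedding(nf, 1)                 # "pretend it says linear": one coefficient per word

idxs = torch.tensor([0, 1, 2, 5])         # a document stored as the word indexes it contains

lookup = emb(idxs).sum()                  # array lookup of rows 0, 1, 2 and 5, then sum

one_hot = torch.zeros(nf)                 # the equivalent one-hot encoded row
one_hot[idxs] = 1.0
matmul = one_hot @ emb.weight.squeeze(1)  # the matrix multiply version

print(torch.allclose(lookup, matmul))     # True: mathematically identical
```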
What we can now do is we can now take this idea and apply it not just to language 01:15:50.340 |
Just a quick question so we are not actually looking up anything right 01:15:56.540 |
We are just saying that now that array with the indices that is the representation 01:16:01.060 |
So the represent so we are doing a lookup right the representation 01:16:06.020 |
That's being stored it for the but for the bag of words is now not 01:16:18.660 |
Matrix product right but rather than doing the matrix product we look up 01:16:23.160 |
The zeroth thing and the first thing and the second thing and the fifth thing 01:16:29.220 |
So that means we are still retaining the one hot encoded matrix no 01:16:35.740 |
We didn't there's no one-hot encoded matrix used here. This is the one-hot encoded matrix, which is not currently highlighted 01:16:41.860 |
We've currently highlighted the list of indexes and the list of coefficients from the weight matrix 01:16:57.460 |
So what we're going to do now is we're kind of going to go to go a step further and saying like 01:17:05.180 |
Let's use a multi-layer neural network, right and let's have the input to that potentially be 01:17:11.900 |
include some categorical variables, right? And those categorical variables we will just have as numeric indexes 01:17:22.420 |
And so the first layer for those won't be a normal linear layer. There'll be an embedding layer 01:17:27.420 |
Which we know behaves exactly like a linear layer 01:17:33.220 |
And so then the hope will be that we can now use this to create a neural network for any kind of data. There was a competition on 01:17:45.980 |
Kaggle a few years ago called Rossmann, which is a German grocery chain, 01:17:52.140 |
Where they asked to predict the sales of items in? 01:17:58.220 |
their stores right and that included the mixture of categorical and continuous variables and 01:18:03.740 |
In this paper by Guo and Berkhahn they described their third place winning entry 01:18:08.500 |
Which was much simpler than the first place winning entry 01:18:16.980 |
But much much simpler because they took advantage of this idea of what they call entity embeddings 01:18:24.580 |
In the paper they thought, I think, that they had invented this; it had actually been written about earlier by 01:18:31.100 |
Yoshua Bengio and his co-authors in another Kaggle competition, which was predicting taxi destinations 01:18:39.900 |
but Guo went a lot further in describing how this can be used 01:18:56.980 |
So this one is actually in the deep learning 1 repo, okay, DL1, 01:19:05.540 |
because we talk about some of the deep learning specific aspects in the deep learning course, whereas in this course 01:19:10.300 |
We're going to be talking mainly about the feature engineering 01:19:13.180 |
And we're also going to be talking about you know kind of this this embedding idea 01:19:21.860 |
Let's start with the data right so the data was you know store number one on the 31st of July 2015 01:19:37.980 |
It was a school holiday. It was not a state holiday, and they sold five thousand two hundred and sixty three items 01:19:49.820 |
Data they provided and so the goal is obviously to predict sales in a test set that has the same information without sales 01:20:08.740 |
Its nearest competitor is some distance away 01:20:17.980 |
And there's some more information about promos. I don't know the details of what that means 01:20:22.660 |
Like in many Kaggle competitions they let you 01:20:30.860 |
External data sets if you wish as long as you share them with other competitors 01:20:34.860 |
So people oh they also told you what state each store is in so people downloaded a list of the names of the different states 01:20:43.820 |
They downloaded a file for each state in Germany for each week 01:20:48.580 |
Some kind of Google trend data. I don't know what specific Google trend they got but there was that 01:20:55.060 |
For each date they downloaded a whole bunch of temperature information 01:21:08.780 |
Is that there was probably a mistake in some ways for Rossman to design this competition as being one where you could use external data? 01:21:15.980 |
Because in reality you don't actually get to find out next week's weather or next week's Google trends 01:21:23.940 |
But you know when you're competing in Kaggle you don't care about that you just want to win 01:21:35.900 |
So let's talk first of all about data cleaning you know that there wasn't really much feature engineering done in this third place 01:21:42.660 |
winning entry, particularly by Kaggle standards where normally every last thing counts 01:21:50.620 |
This is a great example of how far you can get with with a neural net and it certainly reminds me of the 01:21:57.460 |
claims prediction competition we talked about yesterday where the winner did no feature engineering and entirely relied on deep learning 01:22:07.540 |
Laughter in the room I guess is from people who did a little bit more than no feature engineering in that competition 01:22:15.840 |
So I should mention, by the way, that I 01:22:20.620 |
find that the bit where you work hard at a competition and then it closes, and 01:22:26.620 |
you didn't win, and the winner comes out and says this is how I won, is the bit where you learn the most, right? 01:22:34.020 |
Sometimes that's happened to me, and it's been like: oh, 01:22:39.860 |
I thought I tried that; and then I go back and realize I had a bug there 01:22:45.540 |
that I didn't test properly, and I learned: okay, I really need to learn to test this thing in this different way. 01:22:52.180 |
Sometimes it's like: oh, I thought of that, but I assumed it wouldn't work, so 01:22:56.980 |
I've really got to remember to check everything before I make any assumptions. 01:23:00.980 |
And sometimes it's just like: oh, I did not think of that technique. 01:23:09.460 |
That hits much harder than if somebody simply says: hey, here's a really good technique, 01:23:14.300 |
and you're like, okay, great; when you've spent months trying to do something and somebody else did it better by using that technique, 01:23:24.380 |
it really sinks in. And so it's kind of hard: I'm standing up in front of you saying, 01:23:28.460 |
here's a bunch of techniques that I've used, and I've won some Kaggle competitions 01:23:33.420 |
and got some state-of-the-art results, but that's kind of second-hand information by the time it hits you. So it's really great to 01:23:40.660 |
try things out yourself. And also, it's been kind of nice to see, 01:23:46.020 |
particularly in the deep learning course, quite a few of my students where 01:23:50.860 |
I've said this technique works really well, 01:23:52.780 |
and they've tried it and they've got into the top ten of a Kaggle competition the next day, and they're like: 01:23:57.980 |
okay, that counts as working really well. So yeah, Kaggle competitions are 01:24:06.340 |
a great way to learn, but one of the best parts is what happens after they finish. So definitely, 01:24:10.460 |
for the ones that are now finishing up, make sure you watch the forums, 01:24:15.060 |
see what people are sharing in terms of their solutions, 01:24:18.520 |
and if you want to learn more about them, feel free to ask 01:24:24.260 |
the winners: hey, could you tell me more about this? People are normally pretty good about explaining. 01:24:29.500 |
And then ideally try and replicate it yourself; that can turn into a great blog post 01:24:36.580 |
or a great kernel, to be able to say: okay, such-and-such said that they used this technique; 01:24:42.320 |
here's a really short explanation of what that technique is, and here's a little bit of code showing how it's implemented, 01:24:48.140 |
and here's the result showing you can get the same result. That can be a really interesting write-up as well. 01:24:55.900 |
It's always nice to have your data 01:25:03.280 |
be as easy to understand as possible. 01:25:10.380 |
So in this case, the data that came from Kaggle used various integers for the holidays; 01:25:15.780 |
we can just use a boolean of: was it a holiday or not? 01:25:23.260 |
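For example, a minimal sketch of that kind of cleanup in pandas (the column names and holiday codes here are assumptions used for illustration, not the notebook's exact code):

```python
import pandas as pd

# Tiny made-up sample: the raw data encodes holidays with several codes;
# collapse them into a plain boolean "was it a holiday or not".
train = pd.DataFrame({'StateHoliday': ['0', 'a', '0', 'b'],
                      'SchoolHoliday': [1, 0, 1, 0]})
train['StateHoliday'] = train['StateHoliday'] != '0'          # True for any kind of holiday
train['SchoolHoliday'] = train['SchoolHoliday'].astype(bool)  # 0/1 column to boolean
```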
We've got quite a few different tables, and we need to join them all together. 01:25:26.500 |
I have a standard way of joining things together with pandas: 01:25:31.700 |
I just use the pandas merge function, and specifically I always do a left join. 01:25:44.740 |
So you retain all the rows in the left table, and you have a key column; 01:25:50.780 |
you match that with the key column in the right-hand table, and you just merge in the rows that are also present in the right-hand side. 01:25:56.620 |
Yeah, that's a great explanation, good job; I don't have much to add to that. The key reason that I always do a left join is 01:26:03.780 |
that after I do the join, I always then check if there were things in the right-hand side 01:26:10.420 |
that are now null, because if so it means that I missed some things. 01:26:15.520 |
I haven't shown it here, but I also check that the number of rows 01:26:20.220 |
hasn't varied before and after; if it has, that means that the right-hand side table wasn't unique on the join key. 01:26:28.300 |
Even when I'm sure something's true, I always also assume that I've screwed it up, so I always check. 01:26:36.500 |
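As a rough sketch of that pattern (this mirrors the spirit of the notebook's join helper, but the tiny frames here are made up for illustration):

```python
import pandas as pd

def join_df(left, right, left_on, right_on=None, suffix='_y'):
    """Left join that keeps every row of `left`; overlapping column names on the
    right-hand side pick up `suffix` so the left-hand names stay untouched."""
    if right_on is None:
        right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on,
                      suffixes=('', suffix))

# Tiny made-up frames standing in for the real tables.
store = pd.DataFrame({'Store': [1, 2, 3]})
store_states = pd.DataFrame({'Store': [1, 2, 3], 'State': ['BY', 'HE', 'BY']})

n_before = len(store)
joined = join_df(store, store_states, 'Store')
assert len(joined) == n_before                 # more rows would mean duplicate keys on the right
assert joined['State'].isnull().sum() == 0     # nulls would mean stores with no matching state
```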
So I can go ahead and merge the state names into the weather. I can also 01:26:52.100 |
deal with the Google trend data: it's got this week range, which I need to turn into a date in order to join it. 01:26:57.840 |
The nice thing about doing this in pandas is that pandas gives us access to all of Python. 01:27:05.360 |
For example, inside the series object is a .str 01:27:11.940 |
attribute that gives you access to all the string processing functions, 01:27:15.900 |
just like .cat gives you access to the categorical functions 01:27:19.140 |
and .dt gives you access to the datetime functions. So I can now split 01:27:23.140 |
everything in that column. And it's really important to try and use these pandas functions, 01:27:27.460 |
because they're going to be vectorized and accelerated, often through SIMD, or at least through C code. 01:27:42.220 |
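For instance, the week-range column mentioned above might be handled like this (the frame name and the column name 'week' are assumptions about the external file):

```python
import pandas as pd

# Tiny made-up sample of the week-range column described above.
googletrend = pd.DataFrame({'week': ['2015-08-02 - 2015-08-08', '2015-08-09 - 2015-08-15']})

# Vectorized .str split: keep the start of the range and turn it into a real datetime.
googletrend['Date'] = pd.to_datetime(googletrend['week'].str.split(' - ', expand=True)[0])
```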
Then, as per usual, let's add date metadata to our dates. 01:27:47.460 |
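In the notebook this is done with a fastai helper; a minimal pandas-only sketch of the same idea might look like this:

```python
import pandas as pd

def add_date_metadata(df, col):
    """Sketch of expanding a datetime column into useful parts (the fastai library
    has a fuller helper, add_datepart, that this is loosely modelled on)."""
    d = df[col].dt
    df[col + 'Year'] = d.year
    df[col + 'Month'] = d.month
    df[col + 'Day'] = d.day
    df[col + 'Dayofweek'] = d.dayofweek
    df[col + 'Dayofyear'] = d.dayofyear
    df[col + 'Elapsed'] = df[col].astype('int64') // 10**9   # seconds since the epoch
    return df

df = pd.DataFrame({'Date': pd.to_datetime(['2015-07-31', '2015-08-01'])})
add_date_metadata(df, 'Date')
```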
In the end we are basically denormalizing all these tables; we're going to put them all into one table. So in the Google trend data, 01:27:59.900 |
the trends were mainly by state, but there were also trends for the whole of Germany, 01:28:06.260 |
so we put the whole-of-Germany ones into a separate data frame so that we can join that too. 01:28:12.780 |
So we're going to have the Google trend for this date and the Google trend for the whole of Germany. 01:28:21.740 |
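Continuing the earlier sketch, and assuming the country-wide rows are marked with a 'DE' state code (the real file may label them differently), the split and the two joins might look like this:

```python
# Keep the whole-of-Germany rows in their own frame so the country-level trend can be
# joined on date alone, alongside the per-state trend; join_df is the helper sketched above.
trend_de = googletrend[googletrend['State'] == 'DE']
joined = join_df(joined, googletrend, ['State', 'Date'])    # per-state trend
joined = join_df(joined, trend_de, ['Date'], suffix='_DE')  # whole-of-Germany trend
```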
We do that both for the training set and for the test set, and then for both we check that we don't have nulls. 01:28:33.660 |
I set the suffix: if there are two columns that are the same, I set the suffix on the left to be nothing at all, 01:28:39.840 |
so it doesn't screw around with the name, and the suffix on the right-hand side to be _y. And in this case 01:28:45.140 |
I didn't want any of the duplicate ones, so I just went through and dropped them. 01:28:52.820 |
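Dropping those duplicated right-hand columns afterwards might look something like this (a sketch; the notebook lists the specific columns explicitly):

```python
# Remove the columns that picked up the '_y' suffix during the merges.
dup_cols = [c for c in joined.columns if c.endswith('_y')]
joined = joined.drop(columns=dup_cols)
```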
And then in a moment we're going to try to 01:28:59.220 |
create a competition-open-since date: the main competitor for this store has been open since some date, 01:29:05.960 |
and so you can just use pandas to_datetime, passing in the year, the month and the day. 01:29:13.780 |
That's going to give us an error unless they all have years and months, 01:29:19.060 |
so we're going to fill in the missing ones with 1900 and 1. 01:29:23.700 |
And then what we really want to know is: how long has this been open for 01:29:29.620 |
at the time of this particular record? So we can just do a date subtract. 01:29:35.740 |
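A sketch of those steps, using the Rossmann-style column names (choosing day 1 for the assembled date is my own simplification):

```python
# Fill missing year/month with 1900 and 1 so the date assembly never sees a NaN,
# then measure how long the competitor has been open at the time of each row.
joined['CompetitionOpenSinceYear'] = joined['CompetitionOpenSinceYear'].fillna(1900).astype(int)
joined['CompetitionOpenSinceMonth'] = joined['CompetitionOpenSinceMonth'].fillna(1).astype(int)
joined['CompetitionOpenSince'] = pd.to_datetime(pd.DataFrame({
    'year': joined['CompetitionOpenSinceYear'],
    'month': joined['CompetitionOpenSinceMonth'],
    'day': 1,
}))
joined['CompetitionDaysOpen'] = (joined['Date'] - joined['CompetitionOpenSince']).dt.days
```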
Now, if you think about it, sometimes the competition 01:29:41.900 |
opened later than this particular row, so sometimes it's going to be negative, and it probably doesn't make sense 01:29:50.940 |
to have a negative value saying it's going to open in x days' time. Now, having said that, I would never 01:30:01.740 |
just remove that without first of all running a model with it in and without it in, because our assumptions 01:30:09.820 |
about the data very often turn out not to be true. Now, in this case, I didn't invent any of these 01:30:18.180 |
pre-processing steps; I wrote all the code, but it's all based on the third place winners' GitHub repo, so, 01:30:26.580 |
knowing what it takes to get third place in a Kaggle competition, 01:30:31.060 |
I'm pretty sure they would have checked every one of these pre-processing steps and made sure it actually improved their score. 01:30:47.180 |
So what we're going to be doing is creating a neural network where some of the inputs to it are 01:30:53.020 |
continuous and some of them are categorical. 01:30:57.700 |
So what that means in the neural net is that we 01:31:03.220 |
are basically going to have this kind of initial weight matrix, and we're going to have this 01:31:16.940 |
input feature vector, and some of the inputs are just going to be plain 01:31:22.660 |
continuous numbers, like what's the maximum temperature, or what's the distance in kilometers to the nearest store, 01:31:38.220 |
while some of them are, effectively, one hot encoded, but we're not actually going to store them as one hot encoded. 01:31:48.180 |
And so the neural net model is going to need to know which of these columns 01:31:53.900 |
it should basically create an embedding for: which ones it should treat 01:31:58.460 |
as if they were kind of one hot encoded, and which ones it should just feed directly into the linear layer. 01:32:08.340 |
We're going to tell the model when we get there 01:32:10.540 |
which is which, but we actually need to think ahead of time about which ones 01:32:15.020 |
we want to treat as categorical and which ones as continuous. In particular, 01:32:20.260 |
for the things that we're going to treat as categorical, we don't want to have 01:32:25.140 |
more categories than we need. Let me show you what I mean. 01:32:31.740 |
The third place getters in this competition 01:32:34.140 |
decided that the number of months that the competition was open was something that they were going to use as a categorical variable. 01:32:40.300 |
And so, in order to avoid having more categories than they needed, 01:32:44.240 |
they truncated it at 24 months: they said anything more than 24 months, I'll truncate to 24. 01:32:51.180 |
So here are the unique values of competition months open, and it's all the numbers from nought to 24. 01:32:59.620 |
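A sketch of that truncation; flooring at zero (for competitors that are not open yet) is my own addition here, just to keep the levels at 0 through 24:

```python
# Convert days open to whole months and cap at 24, giving 25 categorical levels (0..24).
joined['CompetitionMonthsOpen'] = (joined['CompetitionDaysOpen'] // 30).clip(lower=0, upper=24)
joined['CompetitionMonthsOpen'].unique()   # the values 0, 1, 2, ..., 24
```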
So what that means is that there's going to be an embedding matrix 01:33:04.500 |
that's going to have an embedding vector for things that aren't open yet, for things that are open a month, 01:33:10.820 |
for things that are open two months, and so forth. Now, 01:33:17.020 |
they could have used this as a continuous variable: they could have just had a number here 01:33:21.380 |
which is just a single number of how many months it has been open, 01:33:24.260 |
and they could have treated it as continuous and fed it straight into the initial weight matrix. 01:33:30.860 |
What I've found, though, and obviously what these competitors found, is that 01:33:35.900 |
where possible it's best to treat things as categorical variables. 01:33:44.380 |
And the reason for that is that when you feed something through an embedding matrix, 01:33:50.140 |
it means every level can be treated totally differently. 01:33:55.940 |
So, for example, in this case, whether something's been open for zero months or one month is 01:34:02.620 |
really different. So if you fed that in as a continuous variable, 01:34:07.940 |
it would be kind of difficult for the neural net to try and find a functional form that captures that big difference. 01:34:14.820 |
It's possible, because neural nets can do anything, but you're not making it easy for it. 01:34:18.740 |
Whereas if you use an embedding and treat it as categorical, it'll have a totally different vector for zero versus one. 01:34:25.380 |
So it seems that, particularly as long as you've got enough data, 01:34:31.820 |
treating columns as categorical variables where possible is a better idea. 01:34:36.620 |
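To make that concrete, here is a tiny PyTorch sketch (not the course library): an embedding holds one learned vector per level, so level 0 and level 1 are free to behave completely differently, which a single continuous number fed into a linear layer cannot do as easily.

```python
import torch
import torch.nn as nn

# 25 levels of "months open" (0..24), each mapped to its own learned 4-dimensional vector.
emb = nn.Embedding(num_embeddings=25, embedding_dim=4)
months_open = torch.tensor([0, 1, 24])    # three example rows
print(emb(months_open).shape)             # torch.Size([3, 4]) -- one vector per row
```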
And when I say where possible, that basically means where the cardinality is not too high. 01:34:51.620 |
So a sales ID number that was uniquely different on every row, 01:34:55.440 |
you can't treat that as a categorical variable, 01:34:58.260 |
because it would be a huge embedding matrix and everything only appears once; ditto for something like 01:35:07.780 |
a value measured to two decimal places, which you wouldn't make a categorical variable. 01:35:12.300 |
So that's kind of the rule of thumb 01:35:16.900 |
that they both used in this competition. In fact, if we scroll down, 01:35:23.260 |
here is how they did it: their continuous variables were things that were genuinely 01:35:30.380 |
continuous, like the number of kilometers away to the competitor and the temperature figures, 01:35:35.340 |
and the specific number in the Google trend, 01:35:41.020 |
whereas everything else, basically, they treated as categorical. 01:35:49.380 |
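In spirit, that split is just two lists handed to the model later on; the notebook's actual lists are longer, so the names below are purely illustrative:

```python
# Which columns get embeddings (categorical) and which are fed in as raw numbers (continuous).
cat_vars = ['Store', 'DayOfWeek', 'Year', 'Month', 'StateHoliday', 'SchoolHoliday',
            'CompetitionMonthsOpen', 'Promo', 'State']
contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'Mean_TemperatureC',
               'Min_TemperatureC', 'trend', 'trend_DE']
```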
Okay, so that's it for today. So next time 01:35:55.100 |
we'll finish this off; we'll see how to turn this into a neural network.