Intro to Machine Learning: Lesson 7
Chapters
0:19 The Position of Random Forests
12:10 Mean of the Binomial Distribution of N Samples
13:11 Standard Error
15:48 Less Common Classes
18:15 Class Weights
21:13 Create a Decision Tree
22:19 Flesh Out Our Decision Tree
23:55 Decision Tree
31:26 Predefined Decorators
36:15 Information Gain
42:50 Computational Complexity
43:59 Quickly Calculate Computational Complexity
45:47 Standard Deviation
62:29 Create a Set of Predictions for a Tree
66:13 The Ternary Operator
81:47 Create a Gist
83:39 Neural Networks
00:00:00.000 |
Welcome back. We're going to be talking today about 00:00:03.200 |
Random forests, we're going to finish building our own random forest from scratch 00:00:12.400 |
Tackle a few things that have come up during the week a few questions that I've had 00:00:16.840 |
And I want to start with kind of the position of random forests in general, so we spent 00:00:24.240 |
About half of this course doing random forests, and then after today the second half of this course will be 00:00:40.800 |
Represent like the two key classes of techniques which cover 00:00:46.260 |
Nearly everything that you're likely to need to do 00:00:49.200 |
Random forests belong to the class of techniques of decision tree ensembles 00:00:53.880 |
Along with gradient boosting machines being the other key type and some variants like extremely randomized trees 00:01:03.600 |
Have the benefit that they're highly interpretable 00:01:11.400 |
They have the downside that they don't extrapolate at all to like 00:01:16.720 |
Data that's outside the range that you've seen as we looked at at the end of last week's session 00:01:26.160 |
You know they're a great starting point and so 00:01:34.680 |
Machine learning tools out there and so and like a lot of courses and books 00:01:39.280 |
Don't attempt to kind of curate that down and say like for these kinds of problems use this for these kinds of problems use that 00:01:49.600 |
here's a description of a hundred different algorithms and 00:01:55.960 |
You know like I don't see why you would ever use a support vector machine today for instance 00:02:04.120 |
I can't think of a reason to. People loved studying them in the 90s because they are like very theoretically elegant and like you can really 00:02:11.360 |
Write a lot of math about support vector machines and people did but you know in practice. I don't see them as having any place 00:02:22.600 |
So there's like a lot of techniques that you could include in an exhaustive list of every way that people have looked at machine learning problems 00:02:28.880 |
But I would rather tell you like how to actually solve machine learning problems in practice 00:02:34.480 |
I think, you know, we're about to finish today the first class, which is you know one type of decision tree ensembles 00:02:40.720 |
In part two Yannet will tell you about the other key type 00:02:44.320 |
That being gradient boosting, and we're about to launch next lesson into 00:02:56.480 |
Logistic regression etc are all variants of neural nets 00:03:06.480 |
Leo Breiman who created random forests did so very late in his life and 00:03:13.960 |
Unfortunately passed away not many years later 00:03:19.240 |
So partly because of that very little has been written about them in the academic literature 00:03:24.760 |
Partly because SVMs had just taken over at that point, you know, other people didn't look at them 00:03:30.480 |
And also like just because they're like quite hard to grasp at a theoretical level like analyze them theoretically 00:03:40.560 |
It's quite hard to write conference papers about them or academic papers about them 00:03:44.520 |
So there hasn't been that much written about them 00:03:47.320 |
But there's been a real resurgence or not resurgence a new wave in recent years of empirical machine learning like what actually works 00:03:55.320 |
Kaggle's been part of that, but also just part of it. It's just been like 00:04:01.020 |
companies using machine learning to make shit loads of money like Amazon and Google and 00:04:06.240 |
So nowadays a lot of people are writing about 00:04:10.520 |
decision tree ensembles and creating better software for decision tree ensembles like GBM and XGBoost and 00:04:22.160 |
But a lot of this is being done in industry rather than academia 00:04:34.080 |
More work being done in deep learning than in decision tree ensembles 00:04:39.400 |
Particularly in in academia, but but there's a lot of progress being made in both 00:04:45.040 |
You know if you look at like of the packages being used today for decision tree ensembles 00:04:51.400 |
Like all the best ones the top five or six, and I don't know that any of them really existed 00:04:56.800 |
Five years ago, you know maybe other than like scikit-learn, or even three years ago, so that's been good 00:05:08.040 |
Still to be done now. We talked about for example 00:05:10.400 |
Figuring out what interactions are the most important last week and 00:05:15.880 |
Some of you pointed out in the forums that actually there is such a project already for gradient boosting machines 00:05:21.400 |
Which is great, but it doesn't seem that there's anything like that yet for random forests 00:05:25.640 |
And you know random forests do have a nice benefit over GBMs that they're kind of 00:05:30.740 |
Harder to screw up you know and easier to scale 00:05:37.000 |
So hopefully that's something that you know this community might help fix 00:05:40.520 |
Another question I had during the week was about the size of your validation set 00:05:51.840 |
So like to answer this question about how big does your validation set need to be you first need to answer the question 00:06:03.800 |
How accurately, how precisely do I need to know the accuracy of this algorithm, right, so like 00:06:11.400 |
If the validation set that you have is saying like this is 70% accurate and 00:06:18.520 |
If somebody said well is it 75% or 65% or 70% and the answer was I don't know anything in that range is close enough 00:06:27.800 |
Like that would be one answer, whereas if it's like is it 70% or 70.01% or 69.99% 00:06:35.120 |
Like then that's something else again, right? So you need to kind of start out by saying like how? 00:06:43.760 |
So like for example in the deep learning course we've been looking at dogs versus cats 00:06:51.680 |
The models that we're looking at had about a ninety nine point 00:06:56.480 |
Four ninety nine point five percent accuracy on the validation set okay, and our validation set size 00:07:06.680 |
2000 okay, in fact let's do this in Excel that'll be a bit easier 00:07:34.040 |
One minus accuracy times n so we were getting about 12 wrong 00:07:58.040 |
Okay, so then like we we run a new model and we find instead that the accuracy 00:08:11.200 |
Right and then it's like okay. Is this less good at finding cats. That's like well. It got two more cats wrong 00:08:24.360 |
But then it's like well does this matter there's ninety nine point four versus ninety nine point two 00:08:29.640 |
Matter and if this was like it wasn't about cats and dogs, but it was about finding fraud 00:08:35.320 |
Right then the difference between a point six percent error rate and a point eight percent error rate 00:08:41.120 |
It's like twenty five percent of your cost of fraud 00:08:46.240 |
Like it was really interesting like when image net came out earlier this year the new competition results came out and 00:08:53.440 |
The error had gone down from three percent 00:08:56.320 |
So the error went down from three percent to two percent 00:08:59.840 |
And I saw a lot of people on the internet like famous machine learning researchers being like man 00:09:04.920 |
Some Chinese guys got it better from like ninety seven percent to ninety eight percent 00:09:10.080 |
It's like statistically not even significant who cares kind of a thing, but actually I thought like 00:09:15.840 |
Holy crap this Chinese team just blew away the state-of-the-art in image recognition, like the old one was 00:09:22.880 |
Fifty percent less accurate than the new one like that's that's actually the right way to think about it 00:09:27.960 |
Isn't it because it's like you know we were trying to recognize you know like 00:09:32.020 |
Which tomatoes were ripe and which ones weren't and like our new approach. You know the old approach like 00:09:42.840 |
Was like letting in the unripe tomatoes or you know 50% more of the time we were like 00:09:49.920 |
Accepting fraudulent customers like that's a really big difference 00:09:53.000 |
So just because like with this particular validation set we can't really see six versus eight doesn't mean the point two percent difference 00:10:01.280 |
Isn't important it could be so my kind of rule of thumb is that this like this number of like how many? 00:10:08.920 |
Observations you're actually looking at, I want that generally to be somewhere higher than 22 00:10:15.520 |
Why 22 because 22 is the magic number where the t distribution roughly turns into the normal distribution? 00:10:22.120 |
All right, so as you may have learned the t distribution is is the normal distribution for small 00:10:28.720 |
Datasets right and so in other words once we have 22 of something or more it kind of starts to behave 00:10:39.040 |
Normally in both sense of the words like it's kind of more stable, and you can kind of understand it better, so 00:10:45.240 |
That's my magic number when somebody says do I have enough of something I kind of start out by saying like do you have? 00:10:56.200 |
Lung cancer you know and you had a data set that had like a thousand people without lung cancer and 20 people with lung cancer 00:11:05.240 |
I'd be like I very much doubt we're going to make much progress. You know because we haven't even got 20 of the thing you want 00:11:11.960 |
So ditto with a validation set if you don't have 20 of the thing you want that is very unlikely to be useful 00:11:17.020 |
Or if like the at the level of accuracy we need it's not plus or minus 20 00:11:21.680 |
It's just it's that that's the point where I'm thinking like 00:11:26.560 |
so just to be clear you want 22 to be the number of 00:11:31.560 |
Samples in each set, like in the validation, the test and the train? So what I'm saying is like if there's less 00:11:40.560 |
than 22 of a class in any of the sets then it's 00:11:46.800 |
It's getting pretty unstable at that point, right? 00:11:50.640 |
And so like that's just like the first rule of thumb 00:11:54.320 |
but then what I would actually do is like start practicing what we learned about the 00:12:00.620 |
binomial distribution, so 00:12:10.280 |
What is the mean of the binomial distribution of n samples and probability P? 00:12:17.680 |
n times P. Okay, thank you, n times P is our mean 00:12:23.660 |
Right, so if you've got a 50% chance of getting ahead and you toss it a hundred times on average you get 50 heads 00:12:31.640 |
Okay, and then what's the standard deviation? 00:12:45.240 |
Two numbers, well the first number you don't really have to remember, it's intuitively obvious; the second one is one to try to remember 00:12:54.160 |
Because not only does it come up all the time the people that you work with will all have forgotten it 00:12:59.000 |
So you'll be like the one person in the conversation who could immediately go we don't have to run this a hundred times 00:13:03.840 |
I can tell you straight away, it's binomial, it's going to be npq, n times p times (1 minus p) 00:13:13.120 |
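As a quick numerical sanity check (a sketch, not from the lecture notebook), the mean n*p and standard deviation sqrt(n*p*(1-p)) can be confirmed by simulating draws:

```python
import numpy as np

n, p = 100, 0.5                            # 100 coin tosses with a fair coin
draws = np.random.binomial(n, p, size=100_000)
print(draws.mean())                        # ~ n*p = 50
print(draws.std())                         # ~ sqrt(n*p*(1-p)) = 5
```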
The standard error is if you run a bunch of trials each time getting a mean 00:13:20.560 |
What is the standard deviation of the mean? I? 00:13:24.280 |
Don't think you guys have covered this yet. Is that right? 00:13:28.920 |
No, so this is really important because this means like if you train a hundred models 00:13:35.620 |
right each time the validation set accuracy is like the mean of a distribution and 00:13:41.660 |
So therefore the standard deviation of that validation set accuracy it can be calculated with the standard error 00:14:00.040 |
So like one approach to figuring out like is my validation set big enough is train your model five times 00:14:06.280 |
with exactly the same hyper parameters each time and look at the validation set accuracy each time and you know there's like a 00:14:14.600 |
Mean and a standard deviation of five numbers you could use or a maximum and minimum you can choose 00:14:19.600 |
But save yourself some time you can figure out straight away that like 00:14:30.280 |
Accuracy as to you know whether I get the cat correct or not correct 00:14:36.000 |
So therefore the standard deviation is equal to 0.99 times 0.01 00:14:49.720 |
Right so so basically the size of the validation set you need 00:14:57.800 |
Such that your insights about its accuracy are good enough for your particular 00:15:03.840 |
Business problem and so like I say like the simple way to do it is to pick a validation set of like a size a thousand 00:15:13.480 |
See how much the validation set accuracy varies, and if it's like they're all close enough for what you need 00:15:20.160 |
Then you're fine. If it's not maybe you should make it bigger 00:15:32.680 |
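Here is a hedged sketch of both approaches; the accuracies listed and the sample size are made-up placeholders, not numbers from the course:

```python
import numpy as np

# Empirical: train the same model ~5 times and look at the spread of validation accuracies
accs = [0.994, 0.992, 0.995, 0.993, 0.994]        # hypothetical results of five runs
print(np.mean(accs), np.std(accs))

# Analytical: accuracy over n validation rows behaves like a binomial proportion,
# so its standard error is roughly sqrt(p*(1-p)/n)
p, n = 0.994, 2000
print(np.sqrt(p * (1 - p) / n))                   # ~0.0017, i.e. about +/- 0.17%
```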
So like as you can see it really depends on what it is you're trying to do 00:15:37.320 |
How common your less common class is and how accurate your model is could you pass that back to Melissa, please? 00:15:44.640 |
Thank you, I have a question about the less common classes if you have less than 22 00:15:55.320 |
Let's say it's a face and I only have one representation from that particular country 00:16:01.780 |
Do I toss that into the training set and it adds variety do I pull it out completely out of the data set or? 00:16:09.560 |
Do I put it in a test set instead of the validation set? 00:16:16.000 |
So you certainly couldn't put it in the test or the validation set because you're asking 00:16:20.240 |
Can I mean in general because you're asking can I recognize something? I've never seen before 00:16:24.280 |
But actually this this question of like can I recognize something I've not seen before there's actually a whole class of models specifically 00:16:31.940 |
For that purpose, it's called one-shot learning 00:16:35.080 |
Which is you get to see something once and then you have to recognize it again or zero shot learning 00:16:39.700 |
Which is where you have to recognize something you've never seen before we're not going to cover them in this course 00:16:47.040 |
But that can be useful for things like face recognition 00:16:50.840 |
You know like is this the same person I've seen before and so generally speaking 00:16:55.720 |
Obviously for something like that to work. It's not that you've never seen our face before 00:17:00.800 |
It's that you've never seen Melissa's face before you know and so you see Melissa's face once and you have to recognize it again 00:17:07.400 |
Yeah, so in general you know your validation set and test set need to have the same 00:17:18.160 |
Observations that you're going to see in production in the real world 00:17:25.600 |
Should have an equal number in each class and if you don't 00:17:34.960 |
Until it is equal so this is I think we've mentioned this paper before very recent paper that came out 00:17:41.280 |
They tried lots of different approaches to training with unbalanced data sets and found consistently that 00:17:47.000 |
Oversampling the less common class until it is the same size as the more common class is always the right thing to do 00:17:54.480 |
So you could literally copy you know so like I've only got a thousand you know ten examples of people with cancer and a hundred 00:18:09.080 |
That's kind of a little memory inefficient so a lot of things including 00:18:14.120 |
I think scikit-learn's random forests have a class_weight parameter that says each time you're bootstrapping or resampling 00:18:20.560 |
I want you to sample the less common class with a higher probability 00:18:24.880 |
Or ditto if you're doing deep learning you know make sure in your mini batch 00:18:29.320 |
It's not randomly sampled, but it's a stratified sample so the less common class is picked more often 00:18:41.800 |
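For instance, with scikit-learn's standard API this is just a constructor argument (a sketch; X_train and y_train are placeholder names):

```python
from sklearn.ensemble import RandomForestClassifier

# 'balanced' reweights classes inversely to their frequency;
# 'balanced_subsample' recomputes those weights for each bootstrap sample
m = RandomForestClassifier(n_estimators=100, class_weight='balanced_subsample')
# m.fit(X_train, y_train)
```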
Our random forests and so what we're going to do today is we're going to finish off writing our random forest 00:18:48.280 |
and then after today your homework will be to take this class and 00:18:53.120 |
To add to it all of the random forest interpretation algorithms that we've learned, okay, so 00:19:00.640 |
Obviously to be able to do that you're going to need to totally understand how this class works 00:19:05.880 |
So please you know ask lots of questions as necessary as we go along 00:19:13.520 |
We're doing the bulldozers Kaggle competition data set again 00:19:20.680 |
We split it as before into 12,000 validation the last 12,000 00:19:25.240 |
Records and then just to make it easier for us to keep track of what we're doing 00:19:31.360 |
We're going to just pick two columns out to start with year made and machine hours on the meter 00:19:36.120 |
okay, and so what we did last time was we started out by creating a tree ensemble and 00:19:42.440 |
the tree ensemble had a bunch of trees which was literally a list of 00:19:48.120 |
n_trees trees, where each time we just called create_tree 00:20:07.080 |
So remember bootstrapping means sampling with replacement 00:20:12.600 |
So normally with scikit learn if you've got n rows 00:20:16.240 |
We grab n rows with replacement, which means many of them will appear more than once 00:20:23.760 |
So each time we get a different sample, but it's always the same size as the original data set and then we have our 00:20:33.480 |
Function that we can use which does with replacement sampling of less than n rows 00:20:40.440 |
This is doing something again, which is it's sampling without replacement 00:20:45.360 |
Sample size rows okay because we're permuting 00:20:49.440 |
The numbers from naught to self dot y minus 1 and then grabbing the first self dot sample size of them 00:20:55.640 |
Actually, there's a faster way to do this. You can just use np.random.choice, which is a slightly more direct way 00:21:04.400 |
All right. So this is our random sample for this one of our 00:21:12.240 |
And so then we're going to create a decision tree and our decision tree 00:21:17.000 |
We don't pass it all of x, we pass it these specific indexes, and remember x is a pandas DataFrame 00:21:24.220 |
So if we want to index into it with a bunch of integers, we have to use iloc, integer locations 00:21:31.280 |
And that makes it behave indexing wise just like numpy 00:21:35.540 |
The y vector is numpy so we can just index into it directly, and then we're going to keep track of our minimum leaf size 00:21:48.260 |
So then the only other thing we really need in an ensemble is some way to make a prediction 00:21:51.820 |
And so we were just going to do the mean of the tree prediction 00:21:59.300 |
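Put together, the ensemble looks roughly like this (a sketch following the lecture's description; it assumes the DecisionTree class that gets fleshed out below, and the exact names may differ slightly from the actual notebook):

```python
import numpy as np

class TreeEnsemble():
    def __init__(self, x, y, n_trees, sample_sz, min_leaf=5):
        np.random.seed(42)
        self.x, self.y, self.sample_sz, self.min_leaf = x, y, sample_sz, min_leaf
        self.trees = [self.create_tree() for i in range(n_trees)]

    def create_tree(self):
        # sampling without replacement: permute the row indices, keep the first sample_sz
        idxs = np.random.permutation(len(self.y))[:self.sample_sz]
        return DecisionTree(self.x.iloc[idxs], self.y[idxs], min_leaf=self.min_leaf)

    def predict(self, x):
        # the ensemble's prediction is the mean of the individual trees' predictions
        return np.mean([t.predict(x) for t in self.trees], axis=0)
```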
All right, so that was that and so then in order to be able to run that 00:22:03.520 |
We need a decision tree class because it's being called here 00:22:15.740 |
The next thing we need to do is to flesh out our decision tree 00:22:22.140 |
So the important thing to remember is all of our randomness 00:22:28.740 |
The decision tree class we're going to create 00:22:36.580 |
Okay, so right now we are building a random forest regressor, right, so that's why we're taking the mean of the tree 00:22:43.540 |
The output if we were to work with classification 00:22:47.800 |
Do we take the max like the classifier will give you either zeros or ones? 00:22:52.820 |
No, I would still take the mean so the so each tree is going to tell you what percentage of that leaf node 00:23:01.380 |
Contains cats and what percentage contains dogs 00:23:05.820 |
so then I would average all those percentages and say across the trees on average there is 19% cats and 00:23:18.540 |
Random forest classifiers are almost identical or can be almost identical to random forest regressors 00:23:27.240 |
The technique we're going to use to build this today will basically exactly work for a classification 00:23:34.340 |
It's certainly for binary classification. You can do with exactly the same code for multi-class classification. You just need to change your data structure 00:23:41.900 |
so that like you have like a one hot encoded matrix or a 00:23:47.140 |
List of integers that you treat as a one hot encoded matrix 00:23:57.160 |
So remember our idea here is that we're going to like try to avoid thinking 00:24:03.260 |
So we're going to basically write it as if everything we need already exists, okay, so we know 00:24:10.100 |
From when we created the decision tree, we're going to pass in the X the Y and the minimum leaf size 00:24:16.480 |
So here we need to make sure we've got the X and the Y and the minimum leaf size. Okay, so then there's one other thing 00:24:25.980 |
split our tree into sub trees, we're going to need to keep track of 00:24:31.980 |
Which of the row indexes went into the left-hand side of the tree which went into the right-hand side of the tree 00:24:37.740 |
Okay, so we're going to have this thing called 00:24:42.020 |
Well, right so at first we just didn't bother passing in indexes at all 00:24:46.660 |
So if indexes is not passed in if it's none, then we're just going to set it to 00:24:50.980 |
Everything the entire length of y, right, so np.arange is the same as just range in Python 00:24:59.040 |
But it returns a numpy array right so that the root of a decision tree 00:25:04.160 |
Contains all the rows. That's the definition really of the root of a decision tree 00:25:09.960 |
So all the rows is row 0 row 1 row 2 etc up to row y minus 1 00:25:15.460 |
Okay, and then we're just going to store away all that information that we were given 00:25:23.980 |
Rows are there and how many columns are there? Okay? 00:25:31.860 |
leaf and every node in a tree has a value it has a prediction that prediction is just equal to the average of 00:25:50.780 |
the values of the dependent variable that are in this branch of the tree and so here is the mean 00:26:01.780 |
Nodes in a tree also have a score which is like how effective was the split? 00:26:07.920 |
Here right, but that's only going to be true if it's not a leaf node right a leaf node has no further splits 00:26:15.500 |
And at this point when we create a tree we haven't done any splits yet, so its score starts out as being infinity 00:26:24.060 |
so having built that the root of the tree our next job is to find out which variable should we split on and 00:26:32.060 |
What level of that variable should we split on so let's pretend that there's something that does that 00:26:40.700 |
So then we're done, okay, so how do we find a 00:26:49.420 |
variable to split on so well we could just go through each 00:26:54.460 |
Potential variable so C contains the number of columns 00:26:59.380 |
We have so go through each one and see if we can find a better split than we have so far on that column 00:27:13.700 |
The full random forest definition this is assuming that max features is set to all right remember 00:27:20.500 |
We could set max features to like zero point five in which case we wouldn't check all the numbers from nought to c 00:27:25.700 |
We would check half the numbers at random from nought to c, so if you want to turn this into like max features 00:27:34.220 |
Support you could easily like add one line of code to do that, but we're not going to do it in our implementation today 00:27:44.060 |
Then we just need to find better split and since we're not interested in thinking at the moment for now 00:27:50.140 |
We're just going to leave that empty all right 00:27:57.700 |
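So at this stage the class looks roughly like this (a sketch of what has been described so far; find_better_split is deliberately just a stub, and names are my best guess at the notebook's):

```python
class DecisionTree():
    def __init__(self, x, y, idxs=None, min_leaf=5):
        if idxs is None: idxs = np.arange(len(y))   # the root node contains every row
        self.x, self.y, self.idxs, self.min_leaf = x, y, idxs, min_leaf
        self.n, self.c = len(idxs), x.shape[1]      # rows in this node, columns available
        self.val = np.mean(y[idxs])                 # this node's prediction: mean of its y values
        self.score = float('inf')                   # no split yet, so "infinitely bad"
        self.find_varsplit()

    def find_varsplit(self):
        # check every column (i.e. max_features is effectively "all")
        for i in range(self.c): self.find_better_split(i)

    def find_better_split(self, var_idx):
        pass                                        # left empty for now
```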
With my kind of where I start writing a class is I like to have some way to print out 00:28:02.860 |
What's in that class, right, and so if you type print followed by an object, or in a Jupyter notebook 00:28:14.900 |
you just get __main__.DecisionTree at blah blah blah, which is not very helpful 00:28:20.540 |
Right so if we want to replace this with something helpful 00:28:30.140 |
To get a representation of this object, so when we type, when we simply just 00:28:37.500 |
Write the name like this behind the scenes that calls that function and the default 00:28:42.020 |
Implementation of that method is just to print out this 00:28:46.100 |
Unhelpful stuff so we can replace it by instead saying let's create a format string 00:28:52.060 |
Where we're going to print out n and then show n and then print Val and then show Val okay, so how many? 00:28:59.300 |
How many rows are in this node and what's the average of the dependent variable okay? 00:29:08.580 |
If it's not a leaf node, so if it has a split then we should also be able to print out the score 00:29:14.060 |
The value we split out and the variable that we split on 00:29:22.380 |
self.is_leaf: is_leaf is defined as a method, but I don't have any parentheses after it 00:29:28.240 |
This is a special kind of method called a property and so a property is something that kind of looks like 00:29:37.940 |
But it's actually calculated on the fly, so when I call is_leaf it actually calls this function 00:29:48.060 |
Property okay, and what this says is basically you don't have to include the parentheses when you call it 00:29:54.800 |
Okay, and so it's going to say all right is this a leaf or not so a leaf is something that we don't split on 00:30:03.940 |
If we haven't split on it then its score is still set to infinity, but that's my logic 00:30:12.220 |
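Those two pieces, written as methods of the same class, look roughly like this (a sketch; the exact formatting of the printout is a guess at what the notebook produces):

```python
    def __repr__(self):
        # called whenever we print the object, or just type its name in a notebook cell
        s = f'n: {self.n}; val: {self.val}'
        if not self.is_leaf:
            s += f'; score: {self.score}; split: {self.split}; var: {self.var_idx}'
        return s

    @property
    def is_leaf(self):
        # a node that never found a split still has its score set to infinity
        return self.score == float('inf')
```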
So this this at notation is called a decorator 00:30:18.340 |
It's basically a way of telling Python more information about your method 00:30:23.280 |
So anybody here remember where you have seen decorators before? 00:30:30.460 |
Can you pass it over here, yeah, where have you seen decorators before? Tell us more about Flask and 00:30:41.860 |
So flask so anybody who's done any web programming before with something like flask or a similar framework 00:30:53.940 |
Would have had to have said like this method is going to respond to this bit of the URL 00:31:00.200 |
And either to POST or to GET, and you put it in a special decorator 00:31:09.640 |
Python to treat this method in a special way, so here's another decorator, okay? 00:31:15.000 |
And so you know if you get more advanced with Python you can actually learn how to write your own decorators 00:31:20.880 |
Which as was mentioned you know basically insert some additional code 00:31:24.320 |
but for now just know there's a bunch of predefined decorators we can use to 00:31:30.320 |
Change how our methods behave and one of them is at property which basically means you don't have to put parentheses anymore 00:31:36.600 |
Which of course means you can't add any more parameters beyond self 00:31:43.040 |
Why, if it's not a leaf, why is this infinity, because doesn't infinity mean you're at the root 00:31:50.600 |
No, infinity means that you're not at the root. It means that you're at a leaf, so the root will have a split 00:31:58.520 |
Assuming we find one right everything will have a split till we get all the way to the bottom 00:32:02.560 |
The leaf and so the leaves will have a score of infinity because they won't split 00:32:14.520 |
Decision tree it doesn't do very much, but at least we can like create an ensemble right ten trees sample size a thousand 00:32:22.200 |
Right and we can like print out so now when I go m trees zero 00:32:26.200 |
It doesn't say blah blah blah blah blah. It says 00:32:29.320 |
What we asked it to say: n colon a thousand 00:32:32.520 |
val colon 10.08, oh wait, okay, and this is a leaf because we haven't split on it yet 00:32:42.320 |
Okay, so then the indexes are all the numbers from 0 to 1,000 00:32:48.780 |
Okay, because the base of the tree has everything this is like everything in 00:32:55.080 |
The random sample that was passed to it because remember by the time we get to the point where it's a decision tree 00:32:59.840 |
Where we don't have to worry about any of the randomness in the random forest anymore, okay? 00:33:12.080 |
The thing which finds a split okay, so we need to implement 00:33:18.520 |
Find better split okay, and so it's going to take the index of a variable variable number one variable number three 00:33:31.440 |
Is that better than any split we have so far and? 00:33:34.040 |
For the first variable the answer will always be yes because the best one so far is none at all, infinitely bad, okay? 00:33:40.840 |
So let's start by making sure we've got something to compare to so the thing we're going to compare to 00:33:50.960 |
So we need to make sure that scikit-learn's random forest gets exactly the same data that we have, so we start out by creating an ensemble 00:33:58.140 |
Grab a tree out of it and then find out which particular random sample of x and y did this tree use? 00:34:06.180 |
Okay, and we're going to store them away so that we can pass them to scikit learn so we have exactly the same information 00:34:14.440 |
So let's go ahead and now create a random forest using scikit learn so one tree 00:34:21.600 |
No bootstrapping, so the whole data set, so this should 00:34:26.240 |
Be exactly the same as the thing that we're going to create this tree 00:34:44.060 |
Okay, so let's define our x independent variables and say okay 00:34:49.700 |
Well, it's everything inside our tree, but only 00:34:53.260 |
Those indexes that are in this node right which at the top of the tree is everything right and just this one variable 00:35:02.760 |
Okay, and then for our y's it's just whatever our dependent variable is at the indexes in this node 00:35:13.940 |
Let's now go through every single value in our independent variable and 00:35:20.580 |
So I'll show you what's going to happen. So let's say our independent variable is year made 00:35:41.460 |
Right and so we're going to go to the very first row and we're going to say okay year made here is three 00:35:48.020 |
Right and so what I'm going to do is I'm going to try and calculate 00:35:51.780 |
The score if we decided to branch on the number three 00:35:56.980 |
Right so I need to know which rows are greater than three 00:36:02.180 |
Which rows are less than and equal to three and they're going to become my left-hand side my right-hand side 00:36:11.420 |
There's lots of scores we could use so in random forests. We call this the information gain, right? 00:36:17.380 |
The information gain is like how much better does our score get because we split it into these two groups of data 00:36:22.260 |
There's lots of ways we could calculate it Gini cross entropy root mean squared error, whatever 00:36:28.420 |
If you think about it, there is an alternative formulation of root mean squared error 00:36:34.540 |
Which is mathematically the same to within a constant scale, but it's a little bit easier to deal with which is 00:36:43.380 |
Which the causes the two groups to each have as lower standard deviation as possible 00:36:49.820 |
All right, so like I want to find a split that puts all the cats over here and all the dogs over here 00:36:54.980 |
Right, so if these are all cats and these are all dogs 00:36:58.060 |
Then this has a standard deviation of zero and this has a standard deviation of zero 00:37:02.660 |
Or else this is like a totally random mix of cats and dogs. This is a totally random mix of cats and dogs 00:37:07.500 |
They're going to have a much higher standard deviation 00:37:09.820 |
That makes sense. And so it turns out if you find a split that minimizes those group standard deviations or specifically the 00:37:18.020 |
Weighted average of the two standard deviations. It's mathematically the same as minimizing the root mean squared error 00:37:24.260 |
That's something you can prove to yourself after class if you want to 00:37:31.940 |
First of all split this into two groups. So where's all the stuff that is greater than three? 00:37:36.700 |
So greater than three is this one this one and this one. So we need the standard deviation of that 00:37:41.700 |
so let's go ahead and say standard deviation of 00:37:49.380 |
That one and that one. Okay, and then the next will be the standard deviation of 00:37:59.940 |
Less than or equal to three. So that would be that one 00:38:02.780 |
That one that one and then we just take the weighted average of those two and that's our score 00:38:15.140 |
That makes sense. And so then the next step would be try to split on four 00:38:27.900 |
Redundantly try splitting on one again and find out which one works best 00:38:30.580 |
So that's our code here is we're going to go through every row. And so let's say okay left-hand side is 00:38:37.700 |
Any values in X that are less than or equal to this particular value 00:38:44.420 |
Our right hand side is every value in X that are greater than this particular value 00:38:50.860 |
Okay, so what's the data type that's going to be in LHS and RHS? What are they actually going to contain? 00:39:11.540 |
Arrays of booleans. Yeah, which we can treat as zero and one. Okay, so LHS will be an array of 00:39:20.740 |
False every time it's not less than or equal to and true 00:39:23.980 |
Otherwise and RHS will be a boolean array of the opposite. Okay, and now we can't take a standard deviation of an empty set 00:39:32.180 |
Right. So if there's nothing that's greater than 00:39:35.140 |
This number then these will all be false which means the sum will be zero 00:39:43.740 |
Let's not go any further with this step because there's nothing to take the standard deviation of and it's obviously not a useful 00:39:51.100 |
so assuming we've got this far we can now calculate the standard deviation of the left hand side and 00:39:58.500 |
Take the weighted average, or the sum, it's the same thing to us up to a scalar 00:40:04.220 |
Right and so there's our score and so we can then check is this better than our best score so far and our best score 00:40:11.260 |
So far we initially initialized it to infinity, right? So initially this is this is better 00:40:20.560 |
If it is, we store away all of the information we need: which variable found this better split, what was the score we found, and where we split 00:40:40.100 |
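In code, the slow version being described is roughly this (a sketch; var_idx, score and split are names I'm assuming here, and the notebook may use slightly different ones):

```python
def find_better_split(self, var_idx):
    x = self.x.values[self.idxs, var_idx]      # this column, restricted to the rows in this node
    y = self.y[self.idxs]

    for i in range(self.n):
        lhs = x <= x[i]                        # boolean array: which rows go to the left
        rhs = x > x[i]                         # and which go to the right
        if rhs.sum() == 0: continue            # nothing on the right, so not a useful split
        lhs_std, rhs_std = y[lhs].std(), y[rhs].std()
        # weighted sum of the two standard deviations is our score (lower is better)
        curr_score = lhs_std * lhs.sum() + rhs_std * rhs.sum()
        if curr_score < self.score:
            self.var_idx, self.score, self.split = var_idx, curr_score, x[i]
```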
And I'm using %timeit, so what %timeit does is it sees how long this command takes to run and it tries to give you 00:40:47.580 |
A kind of statistically valid measure of that, so you can see here it's run it ten times 00:40:52.900 |
To get an average and then it's done that seven times to get a mean and standard deviation across runs 00:40:59.660 |
And so it's taking me 75 milliseconds plus or minus ten 00:41:07.420 |
Find better split tree zero. So zero is year made one is machine hours current meter 00:41:19.140 |
we got back machine hours current meter 37 4 4 with this score and then we ran it again with 00:41:25.680 |
0 that's year made and we've got a better score 658 and split 1974 and so in 1974 00:41:36.020 |
Yeah, that was what this tree did as well. Okay, so we've got we've confirmed that this 00:41:40.800 |
Method is giving the same result that scikit-learn's random forest did 00:41:49.540 |
Ten point oh eight and again matching here the value ten point oh eight 00:41:55.060 |
Okay, so we've got something that can find one split. Could you pass that to Yannet please? 00:42:01.180 |
So Jeremy, why don't we put a unique on the X there? 00:42:07.340 |
Because I'm not trying to optimize the performance yet, but you see that no like he's doing more 00:42:15.540 |
Yeah, so it's like, and you can see in the Excel, I like checked this one twice, I checked this four twice, unnecessarily 00:42:24.140 |
Okay, so and so again, it's already thinking about performance, which is good 00:42:31.180 |
So tell me what is the computational complexity of 00:42:39.860 |
section of the code and I like have a think about it, but also like feel free to talk us through it if you want to 00:42:49.380 |
What's the computational complexity of this piece of code? 00:43:00.480 |
Alright Jade take us through your thought process 00:43:04.740 |
I think you have to take each different values through the column to calculate it 00:43:15.200 |
All the cup like all the possible combinations between these different values so that can be expensive 00:43:21.980 |
Like cuz you're uh-huh. Can you or does somebody else gonna tell us the actual computational complexity? 00:43:35.740 |
Think it's n squared. Okay, so tell me why is it n squared, because for the for loop it is n, yes 00:43:42.060 |
And I think I guess the standard deviation will take n, so it's n squared. Okay, or 00:43:46.820 |
This one maybe is even easier to know like this is like which ones are less than xi 00:43:52.540 |
I'm gonna have to check every value that if it's less than xi. Okay, and so 00:43:59.540 |
How do I quickly calculate computational complexity? 00:44:02.220 |
I can guarantee most of the interviews you do are going to ask you to calculate 00:44:05.560 |
Computational complexity on the fly and it's also like when you're coding you want it to be second nature 00:44:10.980 |
So the technique is basically is there a loop? 00:44:14.180 |
Okay, we're then we're obviously doing this n times 00:44:17.680 |
Okay, so there's an n involved. Is there a loop inside the loop if there is then you need to multiply those two together 00:44:23.740 |
In this case there's not is there anything inside the loop? 00:44:30.060 |
so you might see a sort in there and you just need to know that sort is n log n like 00:44:34.460 |
That should be second nature if you see a matrix multiply you need to know what that is in this case 00:44:40.180 |
There are some things that are doing element wise array operations, right? 00:44:44.980 |
So keep an eye out for anything where numpy is doing something to every value of an array in this case 00:44:50.100 |
It's checking every value of x against a constant. So it's going to have to do that n times 00:44:55.540 |
So to flesh this out into a computational complexity 00:44:58.300 |
You just take the number of things in the loop and you multiply it by the highest computational complexity inside the loop 00:45:12.220 |
Presort the list and then do like one n log n computation, and there's lots of things we could do to speed this up 00:45:17.580 |
So at this stage, it's just like what is the computational complexity we have? 00:45:21.340 |
But absolutely it's certainly not as good as it can be. Okay, so and that's where we're going to go next 00:45:26.820 |
It's like alright n squared is not it's not great. So let's try and make it better 00:45:41.340 |
Okay, who wants to first of all tell me what's the equation for a standard deviation? 00:45:53.940 |
So for the standard deviation, it's the difference between the value and its mean 00:46:01.180 |
It's we take a square root of that. So we take the 00:46:06.280 |
The power of two, yeah, then we sum up all of these observations and we take the square root out of all this sum 00:46:15.820 |
Yeah, you have to divide divide by n. Yep. Yep. Great. Good. Okay now 00:46:20.580 |
In practice, we don't normally use that formulation because it kind of requires us calculating 00:46:27.820 |
You know X minus the mean lots of times. Does anybody know the formulation that just requires? 00:46:33.900 |
X and X squared, anybody happen to know that one? Yes, at the back, pass that box back there 00:46:46.340 |
Square of mean. Yeah, great mean of squares minus the square of the means 00:46:50.580 |
All right, so that's a really good one. That's a really good one to know because like you can now calculate 00:46:57.540 |
Variances or standard deviations of anything. You just have to first of all grab the column as it is 00:47:04.740 |
The column squared right and as long as you've got those stored away somewhere you can 00:47:10.940 |
Immediately calculate the standard deviation. So the reason this is handy for us is that if we first of all 00:47:24.940 |
Then if you think about it as we kind of start going down one step at a time 00:47:30.020 |
Right, then each group is exactly the same as the previous group on the left hand side with one more thing in it 00:47:37.820 |
And on the right hand side with one less thing in it 00:47:40.220 |
So given that we just have to keep track of some of X and some of X squared 00:47:44.500 |
We can just add one more thing to X one more thing to X squared on the left and remove one thing on the right 00:47:51.140 |
Okay, so we don't have to go through the whole lot each time and so we can turn this into a order n 00:47:58.300 |
Algorithm, so that's all I do here is I sort the data, right, and then we're going to keep track of the count of things on 00:48:06.500 |
The sum of things on the right and the sum of squares on the right 00:48:10.020 |
And initially everything's in the right hand side 00:48:16.540 |
Y sum is the sum on the right and Y squared sum is the sum of squares on the right and 00:48:23.680 |
Then nothing is initially on the left. So it's zeros. Okay, and then we just have to loop through 00:48:36.140 |
Add one to the left hand count, subtract one from the right hand count, add the value to the left hand sum 00:48:41.980 |
Subtract it from the right hand sum, add the value squared to the left hand, subtract it from the right hand 00:48:50.020 |
Now we do need to be careful though because if we're saying less than or equal to one say we're not stopping here 00:48:57.620 |
We're stopping here like we have to have everything in that group 00:49:01.140 |
So the other thing I'm going to do is I'm just going to make sure that the next value is not the same as this 00:49:06.820 |
Value if it is I'm going to skip over it, right? So I'm just going to double-check 00:49:10.620 |
That this value and the next one aren't the same 00:49:18.380 |
I can keep going ahead and calculate my standard deviation now 00:49:21.700 |
Passing in the count the sum and the sum squared right and there's that formula 00:49:29.100 |
Okay, the sum of squares divided by the count minus the square of the sum divided by the count 00:49:36.420 |
Do that for the right hand side and so now we can calculate the weighted average score 00:49:43.020 |
Just like before and all of these lines are now the same. Okay, so we've turned our order and 00:49:48.300 |
Squared algorithm into an order n algorithm and in general 00:49:54.620 |
stuff like this is going to get you a lot more value than like pushing something onto a spark cluster or 00:50:00.700 |
Ordering faster RAM or using more cores in your CPU or whatever, right? 00:50:05.980 |
This is the way you want to be, you know improving your code and specifically 00:50:14.940 |
Right without thinking too much about performance run it. Is it fast enough for what you need then you're done if not 00:50:25.860 |
Jupyter, instead of saying %timeit you say %prun and 00:50:35.820 |
where the time was spent in your algorithm, and then you can go to the bit that's actually taking the time and think about like 00:50:45.340 |
Is this algorithmically as efficient as it can be? 00:50:55.500 |
76 milliseconds to less than 2 milliseconds and 00:50:59.860 |
Now some people that are new to programming think like oh great. I've saved 00:51:05.220 |
60 something milliseconds, but the point is this is going to get run 00:51:13.060 |
Okay, so the 76 millisecond version is so slow that it's going to be 00:51:19.660 |
Impractical for any random forest you use in practice, right, whereas the 1 millisecond version I found is actually quite acceptable 00:51:27.940 |
And then check the numbers should be exactly the same as before and they are okay 00:51:37.420 |
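The O(n) version being described looks roughly like this (a sketch; std_agg and the running-sum bookkeeping follow the lecture's description, with a small max(0, ...) guard that I've added against floating-point error):

```python
import math
import numpy as np

def std_agg(cnt, s1, s2):
    # standard deviation from a count, a sum and a sum of squares:
    # mean of the squares minus the square of the mean
    return math.sqrt(max(0., (s2 / cnt) - (s1 / cnt) ** 2))

def find_better_split(self, var_idx):
    x, y = self.x.values[self.idxs, var_idx], self.y[self.idxs]
    sort_idx = np.argsort(x)
    sort_x, sort_y = x[sort_idx], y[sort_idx]

    # everything starts on the right-hand side
    rhs_cnt, rhs_sum, rhs_sum2 = self.n, sort_y.sum(), (sort_y ** 2).sum()
    lhs_cnt, lhs_sum, lhs_sum2 = 0, 0., 0.

    for i in range(self.n - self.min_leaf):
        xi, yi = sort_x[i], sort_y[i]
        lhs_cnt += 1;        rhs_cnt -= 1
        lhs_sum += yi;       rhs_sum -= yi
        lhs_sum2 += yi ** 2; rhs_sum2 -= yi ** 2
        # don't split between identical values: rows equal to xi must stay together
        if i < self.min_leaf - 1 or xi == sort_x[i + 1]:
            continue
        lhs_std = std_agg(lhs_cnt, lhs_sum, lhs_sum2)
        rhs_std = std_agg(rhs_cnt, rhs_sum, rhs_sum2)
        curr_score = lhs_std * lhs_cnt + rhs_std * rhs_cnt
        if curr_score < self.score:
            self.var_idx, self.score, self.split = var_idx, curr_score, xi
```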
function find better split that does what we want I 00:51:40.940 |
Want to insert it into my decision tree class and this is a really cool Python trick 00:51:47.540 |
Python does everything dynamically right so we can actually say 00:51:59.300 |
That function I just created, and that sticks it inside that class 00:52:07.180 |
I'll tell you what's slightly confusing about this is that this thing this word here and 00:52:12.100 |
This word here. They actually have no relationship to each other 00:52:15.980 |
They just happen to have the same letters in the same order right so like I could call this find better split 00:52:32.740 |
Right so now my function is actually called find better split underscore foo, but my method 00:52:42.580 |
I'm expecting to call something called decision tree dot find better split 00:52:47.740 |
Right so here. I could say decision tree dot find better split equals find better split underscore foo 00:52:57.940 |
It's important to understand how namespaces work like in in every language that you use 00:53:06.060 |
One of the most important things is kind of understanding how how it figures out what a name refers to 00:53:13.820 |
Means find better split as defined inside this class right and note nowhere else right well 00:53:20.940 |
I mean a parent class, but never mind about that this one here means find better split foo in 00:53:27.340 |
The global namespace a lot of languages don't have a global namespace, but Python does okay, and so 00:53:34.900 |
The two are like even if they happen to have the same letters in the same order 00:53:40.300 |
They're not referring in any way to the same thing 00:53:44.980 |
This family over here may have somebody called Jeremy and my family has somebody called Jeremy and our names happen to be the same 00:53:55.320 |
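So the patching trick amounts to something like this (a sketch; find_better_split_foo is the hypothetical renamed function from the example):

```python
def find_better_split_foo(self, var_idx):
    ...   # same body as the fast version above

# The attribute name on the left is what the class will look up later;
# the function's own name on the right is irrelevant to that lookup.
DecisionTree.find_better_split = find_better_split_foo
```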
Great so now that we've stuck the decision tree 00:54:00.820 |
Sorry the find better split method inside the decision tree with this new definition when I now call the tree ensemble constructor 00:54:09.660 |
All right, the decision tree ensemble constructor called create_tree 00:54:18.700 |
decision tree called find_varsplit, which went through every column to see if it could find a better split and 00:54:28.860 |
Therefore tree ensemble when we create it has gone ahead and done the split 00:54:37.580 |
Does that make sense? Does anybody have any questions or uncertainties about that? Like we're only creating one single split 00:54:47.700 |
All right, so this is pretty pretty neat right we kind of just do a little bit at a time testing everything as we go 00:54:55.340 |
And so it's as as as you all implement the random forest interpretation techniques 00:55:02.260 |
You may want to try programming this way to like every step check that you know 00:55:07.300 |
What you're doing matches up with what scikit-learn does or with a test that you've built or whatever? 00:55:16.700 |
Very inception right so let's go now max depth is 2 00:55:21.940 |
And so here is what scikit-learn did after breaking at year made 74 00:55:38.940 |
Right we just went through every column and try to see if there's a better split there 00:55:48.580 |
Not only do we have to go through every column and see if there's a better split in this node 00:55:54.500 |
But then we also have to see whether there's a better split in the left and the right sides that we just created 00:56:01.700 |
Right or in other words the left right side and the right hand side should become decision trees themselves 00:56:07.940 |
Right so there's no difference at all between what we do here to create this tree 00:56:13.220 |
And what we do here to create this tree other than this one contains 00:56:20.220 |
So this row of codes exactly the same as we had before 00:56:24.820 |
Right and then we check actually we could do this a little bit easier. We could say if self dot 00:56:35.180 |
Okay, but I'll just leave it here for now. So it's self dot score. So if the score is 00:56:46.740 |
So let's go back up and just remind ourselves 00:56:55.660 |
So since there we might as well use it so if it's a leaf node 00:57:00.820 |
Then we have nothing further to do right so that means we're right at the bottom. There's no split 00:57:06.940 |
That's been made okay, so we don't have to do anything further on the other hand if it's not a leaf node 00:57:14.180 |
Then we need to split it into the left hand side and the right hand side now earlier on we created a left hand side 00:57:26.100 |
Better here would be to have an array of indexes 00:57:29.360 |
And that's because we don't want to have a full array of all the Booleans in every single 00:57:34.080 |
Node right because remember although it doesn't look like there are many nodes when you see a tree of this size 00:57:42.160 |
the bottom level if there's a minimum leaf size of one contains the same number of nodes as 00:57:48.000 |
The entire data set and so if every one of those contained a full Boolean array of size of the whole data set 00:57:55.080 |
You've got squared memory requirements, which would be bad 00:57:58.520 |
Right on the other hand if we just store the indexes with the things in this node, and that's going to get smaller and smaller 00:58:07.480 |
So np.nonzero is exactly the same as just this thing which gets the Boolean array 00:58:13.560 |
But it turns it into the indexes of the trues 00:58:17.200 |
Okay, so this is now a list of indexes for the left hand side and 00:58:23.560 |
Okay, so now that we have the indexes the left hand side and the right hand side 00:58:28.760 |
We can now just go ahead and create a decision tree. Okay, so there's a decision tree for the left 00:58:36.240 |
Okay, and we don't have to do anything else. We've already written this 00:58:41.880 |
Function, a constructor that can create a decision tree 00:58:44.920 |
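As a sketch, the full find_varsplit being described is roughly this (same assumed names as before):

```python
def find_varsplit(self):
    for i in range(self.c): self.find_better_split(i)
    if self.is_leaf: return                      # no split was found, nothing more to do
    x = self.x.values[self.idxs, self.var_idx]
    lhs = np.nonzero(x <= self.split)[0]         # positions within this node going left
    rhs = np.nonzero(x > self.split)[0]          # positions going right
    # each side becomes a decision tree of its own, which is where the recursion happens
    self.lhs = DecisionTree(self.x, self.y, self.idxs[lhs], self.min_leaf)
    self.rhs = DecisionTree(self.x, self.y, self.idxs[rhs], self.min_leaf)

DecisionTree.find_varsplit = find_varsplit       # patch it into the class as before
```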
so like when you really think about what this is doing it kind of hurts your head right because 00:58:52.720 |
the reason, the whole reason that find_varsplit got called 00:58:59.540 |
find_varsplit is called by the decision tree constructor 00:59:06.780 |
But then find_varsplit itself then calls the decision tree constructor, so we actually have 00:59:17.980 |
I'm not nearly smart enough to be able to think through recursion 00:59:23.420 |
So I just choose not to right like I just write what I mean and 00:59:30.740 |
Right like what do I want to find a variable split? I've got to go through every column see if there's something better 00:59:37.420 |
If it managed to do a split figure out left hand side of the right hand side and make them into decision trees 00:59:44.260 |
Okay, but now try to think through how these two methods call each other would just drive me crazy 00:59:51.220 |
But I don't need to right I know I have a decision tree constructor that works 00:59:55.020 |
Right, I know I have a find_varsplit that works, so that's it, right. That's how I 01:00:07.460 |
That's my advice a lot of you are probably smart enough to be able to think through it better than I can so that's fine 01:00:12.780 |
If you can all right so now that I've written that again, I can patch it into the decision tree class 01:00:23.260 |
The tree ensemble constructor will now use that right because Python's dynamic right that's just happens automatically 01:00:43.540 |
There it is hundred fifty nine samples nine point six six 01:00:49.620 |
Eight forty one ten point one five the left hand side of the left hand side 01:00:58.100 |
Hundred and fifty samples nine point six two okay, so you can see it like 01:01:05.020 |
Because I'm not nearly clever enough to write machine learning algorithms like not only can I not write them correctly the first time 01:01:11.220 |
Often like every single line. I write will be wrong right so I always start from the assumption 01:01:17.460 |
That the the line of code. I just typed is almost certainly wrong, and I just have to see why and how 01:01:23.380 |
Right and so like I just make sure and so eventually I get to the point where like much to my surprise 01:01:30.580 |
You know so here I can feel like okay this it would be surprising if all of these things 01:01:35.340 |
Accidentally happen to be exactly the same as scikit-learn, so this is looking pretty good 01:01:42.300 |
So now that we have something that can build a whole tree 01:01:47.260 |
We want to have something that can calculate predictions 01:01:49.840 |
Right and so to remind you we already have something that calculates predictions for a tree ensemble 01:01:58.580 |
But there is nothing called tree dot predict, so we're going to have to write that 01:02:03.820 |
To make this more interesting let's start bringing up the number of columns that we use 01:02:14.060 |
Let's create our tree ensemble again, and this time. Let's go to a maximum depth of three 01:02:18.820 |
Okay, so now our tree is getting more interesting 01:02:29.820 |
Create a set of predictions for a tree and so a set of predictions for a tree is simply the prediction for a row 01:02:42.180 |
That's it all right. That's our predictions so the predictions for a tree 01:02:46.180 |
every rows predictions in an array, okay, so again, we're like 01:02:52.400 |
Skipping thinking thinking is hard. You know so let's just like keep pushing it back 01:02:58.900 |
This is kind of handy right notice that you can do for 01:03:07.700 |
Blah in array with a numpy array regardless of the rank of the array 01:03:16.820 |
The array and what it does is it will loop through the leading axis 01:03:21.940 |
Right these these concepts are going to be very very important as we get into more and more neural networks because we're going to be 01:03:28.900 |
All doing tensor computations all the time so the leading axis of a vector is the vector itself 01:03:35.860 |
The leading axis of a matrix are the rows, the leading axis of a three-dimensional tensor 01:03:43.140 |
The matrices that represent the slices and so forth right so in this case because x is a matrix 01:03:50.180 |
This is going to loop through the rows and if you write your 01:03:53.220 |
Kind of tensor code this way, then it'll kind of tend to 01:03:58.620 |
Generalize nicely to higher dimensions, like it doesn't really matter how many dimensions are in x 01:04:04.620 |
This is going to loop through each of the leading axis, right? 01:04:08.220 |
Okay, so we can now call that decision tree dot predict 01:04:13.020 |
Right so all I need to do is write predict row 01:04:20.780 |
Right and I've delayed thinking so much which is great that the actual point where I actually have to do the work 01:04:37.780 |
Which we calculated right back in the original tree constructor. It's just the average of the y's, right? 01:04:45.900 |
Then we have to figure out whether to go down the left-hand path or the right-hand path to get the prediction, right? So 01:04:55.580 |
This variable in this row is less than or equal to the thing we decided the amount we decided to split on 01:05:07.100 |
Okay, and then having figured out what path we want which tree we want then we can just call predict row on that 01:05:16.620 |
We've accidentally created something recursive 01:05:24.820 |
Works control flow wise or whatever, but I don't need to because like I just it just does like I just told it 01:05:32.420 |
What I wanted so I'll trust it to work, right? 01:05:34.260 |
If it's a leaf return the value otherwise return the prediction for the left-hand side or the right-hand side as appropriate 01:05:40.820 |
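Both methods together look roughly like this (a sketch; note the plain control-flow if on the first line of predict_row versus the ternary if/else expression just below it):

```python
def predict(self, x):
    # looping over a 2-D numpy array iterates over its leading axis, i.e. its rows
    return np.array([self.predict_row(xi) for xi in x])

def predict_row(self, xi):
    if self.is_leaf: return self.val             # a leaf just returns its stored mean
    # ternary expression: pick the left or right subtree, then recurse into it
    t = self.lhs if xi[self.var_idx] <= self.split else self.rhs
    return t.predict_row(xi)

DecisionTree.predict = predict
DecisionTree.predict_row = predict_row
```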
Notice this here this if has nothing to do with this if 01:05:51.340 |
All right, this if is a control flow statement 01:05:54.280 |
That tells Python to go down that path or that path to do some calculation 01:06:07.860 |
So those of you that have done C or C++ will recognize it as being identical to that 01:06:13.540 |
It's called the ternary operator, right? If you haven't that's fine. Basically what we're doing is we're going to get a value 01:06:20.780 |
where we're going to say it's this value if this thing is true and 01:06:35.300 |
But that would require writing four lines of code to do one thing and also require you to have code that if you read it 01:06:42.380 |
To yourself or to somebody else is not at all naturally the way you would express it, right? 01:06:50.420 |
The left-hand side if the variables less than the split or the right-hand side, otherwise 01:06:56.000 |
Right, so I want to write my code the way I would think about or the way I would say my code 01:07:01.300 |
Okay, so this kind of ternary operator can be quite helpful for that 01:07:10.140 |
So now that I've got a prediction for row I can dump that into my class 01:07:17.800 |
And I can now plot my actuals against my predictions 01:07:28.100 |
You'll often have a lot of dots sitting on top of each other 01:07:31.380 |
So a good trick is to use alpha. Alpha means how transparent things are, not just in matplotlib 01:07:37.580 |
But like in every graphics package in the world pretty much and so if you set alpha to less than 1 01:07:42.660 |
Then this is saying you would need 20 dots on top of each other for it to be fully blue 01:07:49.880 |
which kind of shows you how much things are sitting on top of each other. So it's a good trick for scatter plots. 01:08:15.780 |
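A minimal sketch of that trick, assuming preds and y_valid are the prediction and target arrays from the notebook:

```python
import matplotlib.pyplot as plt

# alpha < 1 makes each dot translucent; with alpha=0.05 you'd need roughly
# 20 dots stacked on top of each other before a point looks fully opaque
plt.scatter(preds, y_valid, alpha=0.05)
plt.show()
```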
Since the scikit-learn tree ensemble had no maximum amount of splitting, we can compare our R squared 01:08:20.060 |
to their R squared, and they're not the same, 01:08:24.300 |
but actually ours is a little better. So I don't know what we did differently, but we'll take it. 01:08:30.940 |
Okay, so we have now something which, for a forest with a single tree in it, is giving us a reasonable R squared 01:08:41.820 |
on a validation set, using an actual real-world data set, you know, bulldozers. 01:08:53.260 |
So what I would want to do now is to create a package that has this code in and I created it by like creating 01:08:59.420 |
A method here a method here a method here and patching them together 01:09:02.460 |
So what I did next is I went back through my notebook and collected up all the cells 01:09:07.580 |
That implemented methods and pasted them all together right, and I've just pasted them down here 01:09:12.140 |
So here's this is my original tree ensemble and here is all the cells from the decision tree 01:09:17.620 |
I just dumped them all into one place without any change 01:09:20.620 |
So that was it that was the code we wrote together so now I can go ahead and I can create a 01:09:31.980 |
Tree ensemble I can calculate my predictions. I can do my scatter plot. I can get my R squared 01:09:45.020 |
Right, and here we are: we have a model of Blue Book for Bulldozers with a 71% R squared. 01:10:00.420 |
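For reference, a hypothetical end-to-end usage looks something like the following; the TreeEnsemble constructor arguments and the X_train/X_valid split names are assumptions based on the notebook and may not match yours exactly:

```python
from sklearn import metrics

# Build a small ensemble from the hand-written class and score it on the validation set
ens = TreeEnsemble(X_train.values, y_train, n_trees=10, sample_sz=1000, min_leaf=3)
preds = ens.predict(X_valid.values)
print(metrics.r2_score(y_valid, preds))   # around 0.71 in the lecture's run
```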
Any questions about that? I know there's quite a lot of code to get through, so, like, during the week, ask on the forum 01:10:08.340 |
about any bits of code you come across. Can somebody pass the box over? Oh, there it is. 01:10:14.240 |
Can we get back, probably to the top, maybe 01:10:23.980 |
at the decision tree, when we set the score equal to infinity, right? Yes. Do we calculate this score 01:10:31.380 |
further? I mean, I lost track of that, and specifically I wonder: 01:10:42.660 |
in find_varsplit we check whether self.score is equal to infinity or not, 01:10:49.260 |
and it seems to me, it seems unclear whether we 01:10:59.140 |
ever update it in the methods if our initial value is infinity. 01:11:10.460 |
So every node starts out with a score of infinity. So in other words, at this point when we've created the node, there is no split. 01:11:20.100 |
Okay, that's why the score is infinity. And then we try to find a variable and a split, so for each column we 01:11:33.460 |
say, hey column, do you have a split which is better than the best one we have so far? 01:11:47.860 |
Let's look at the slow way, since it's a bit simpler. 01:11:50.700 |
In find_better_split we do that by looping through each row and 01:11:55.680 |
finding out: this is the score if we split here. 01:12:00.420 |
Is it better than the current score? The current score is infinitely bad, so of course 01:12:04.820 |
it is, and so now we set the score equal to what we just calculated, and we keep track of which variable and which split we chose. 01:12:17.460 |
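Here's a sketch of that logic, roughly following the slow version from the notebook; the attribute and variable names are assumptions and your own cells may differ a little. The point is simply that self.score starts at infinity, so the first candidate split always wins, and after that only genuinely better splits overwrite it:

```python
import numpy as np

def find_better_split(self, var_idx):
    x, y = self.x.values[self.idxs, var_idx], self.y[self.idxs]
    for i in range(self.n):
        lhs = x <= x[i]
        rhs = x > x[i]
        if rhs.sum() == 0: continue               # can't split if one side is empty
        # weighted sum of the standard deviations on each side of the split
        curr_score = lhs.sum() * y[lhs].std() + rhs.sum() * y[rhs].std()
        if curr_score < self.score:               # anything beats infinity the first time
            self.var_idx, self.score, self.split = var_idx, curr_score, x[i]
```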
Okay, great, let's take a five-minute break and I'll see you back here at 22 01:12:25.500 |
So when I tried comparing the performance of this against scikit-learn's version, ours was quite a lot slower. 01:12:41.820 |
And the reason why is that, although a lot of the work is being done by numpy, 01:12:49.180 |
which is nicely optimized C code, think about the very bottom level of a tree. If we've got a million data points, 01:13:00.500 |
then the bottom level of the tree has something like 01:13:05.980 |
half a million decision points, with a million leaves underneath, right? And so that's like 01:13:11.700 |
five hundred thousand split methods being called, each one of which contains multiple calls to numpy, which only have like 01:13:21.300 |
one item that's actually being calculated on. And so that's very inefficient, and it's the kind of thing that Python is 01:13:28.460 |
Particularly not good at performance-wise right like calling lots of functions lots of times 01:13:34.380 |
I mean, we can see it's not bad, right? You know, for a dataset of a size that 01:13:41.900 |
15 years ago would have been considered pretty big, this would be considered pretty good performance. But nowadays this is 01:13:48.300 |
some hundreds of times, at least, slower than it should be. So 01:13:53.060 |
What the scikit-learn folks did to avoid this problem was that they wrote 01:14:01.460 |
their implementation in something called Cython, and 01:14:07.020 |
Cython is a superset of Python, so any Python you've written is valid Cython. 01:14:19.580 |
But then what happens is Cython runs it in a very different way: rather than passing it to the 01:14:28.120 |
Python interpreter, it instead converts it to C, 01:14:34.100 |
compiles that, and then runs that C code, right? Which means the first time you run it, 01:14:40.260 |
it takes a little longer, because it has to go through the translation and compilation, but then after that it can be much faster. 01:14:50.260 |
and so I wanted just to quickly show you what that looks like because 01:14:54.100 |
You are absolutely going to be in a position where Cython is going to help you with your work, and 01:15:02.180 |
most of the people you're working with will have never used it and may not even know it exists. 01:15:06.820 |
And so this is like a great superpower to have 01:15:08.820 |
So to use Cython in a notebook, you say %load_ext Cython, right? 01:15:21.940 |
Here is the same thing as a Cython function: it's exactly the same code with %%cython at the top. 01:15:34.200 |
This actually runs about twice as fast as this right just because it does the compilation 01:15:41.120 |
Here is the same version again, where I've used a special Cython 01:15:48.220 |
extension called cdef which defines the C data type of the return value and of each variable 01:16:00.200 |
Basically, that's the trick that you can use to start making things 01:16:03.360 |
Run quickly right and at that point now. It knows it's not just some 01:16:08.460 |
Python object called t in fact I probably should put one here as well 01:16:14.880 |
Let's try that so we've got fib2 we'll call that fib3 01:16:28.800 |
Yeah, so it's exactly the same as before, but we say what the data type of the thing we pass to it is, and then 01:16:34.560 |
Define the data types of each of the variables and so then if we call that 01:16:39.400 |
Okay, we've now got something that's 10 times faster right so 01:16:52.360 |
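A minimal sketch of the pattern being demonstrated; the exact fib definitions in the lecture notebook may differ slightly, but the shape of the trick is the same:

```python
# In one notebook cell: load the Cython extension
%load_ext Cython
```

```python
%%cython
# Same algorithm as the pure-Python version, but the %%cython cell magic
# compiles it to C, and the cdef declarations give every variable a C type,
# which is where most of the speedup comes from.
def fib_cy(int n):
    cdef int a = 0
    cdef int b = 1
    cdef int t
    while b < n:
        t = a
        a = b
        b = t + b
```

Calling fib_cy(1000000) then behaves like any other Python function, just a lot faster than the untyped version.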
And it's just Python with a few little bits of markup, 01:16:56.600 |
so it's good to know that it exists, because 01:17:00.280 |
If there's something custom you're trying to do 01:17:03.440 |
I actually find it kind of painful having to go out to C, compile it, and link it back in 01:17:09.680 |
and all that, whereas doing it here is pretty easy. Can you pass that just to your right, please? 01:17:13.560 |
So when you're doing the Cython version of it, in the case of an array or an np array, 01:17:23.520 |
is there a specific C type for that? Yeah, so there's like a lot of 01:17:28.520 |
specific stuff for integrating Cython with numpy. 01:17:42.080 |
But you can read about that, and you can basically see the basic ideas. There's this cimport, which basically imports 01:17:49.800 |
certain types of Python library into the kind of C bit of the code, and you can then use it in your Cython. 01:18:03.320 |
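As a small hedged sketch of what fast numpy access in a %%cython cell can look like (this uses a typed memoryview rather than the cimport-numpy route mentioned above, but it's one common way to do it):

```python
%%cython
# double[:] is a typed memoryview: Cython can index the underlying numpy
# buffer directly in C, without creating a Python object for every element.
def sum1d(double[:] x):
    cdef double total = 0.0
    cdef int i
    for i in range(x.shape[0]):
        total += x[i]
    return total
```

Calling it with something like sum1d(np.random.rand(10000)) should work, as long as the array's dtype is float64.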
Good question, thank you. All right, so your mission now is to implement these interpretation techniques on top of our hand-written random forest: 01:18:18.520 |
confidence based on tree variance, feature importance, partial dependence, and tree interpreter. 01:18:27.080 |
Removing redundant features doesn't use a random forest at all, 01:18:33.760 |
Extrapolation is not an interpretation technique, so you don't have to worry about that, so it's just the other ones 01:18:38.160 |
So, confidence based on tree variance: we've already written that code, 01:18:41.960 |
so I suspect that the exact same code we have in the notebook should continue to work, so you can try it and make sure it does. 01:18:50.160 |
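For reference, a sketch of that technique; it assumes an ensemble object m with a .trees list and the X_valid DataFrame from the notebook, so adjust the names to whatever yours are called:

```python
import numpy as np

# Confidence from tree variance: stack the per-tree predictions, then look at
# how much the trees disagree for each row.
preds = np.stack([t.predict(X_valid.values) for t in m.trees])
pred_mean = preds.mean(axis=0)   # the usual ensemble prediction
pred_std  = preds.std(axis=0)    # higher std = the trees disagree = less confidence
```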
Feature importance is done with the variable shuffling technique, and once you have that working, 01:18:55.480 |
partial dependence will just be a couple of lines of code away, because rather than shuffling a column, 01:19:02.920 |
you're just replacing it with a constant value, so it's nearly the same code. And 01:19:09.720 |
tree interpreter is going to require you writing some code and thinking about it. Well, once you've written tree interpreter, 01:19:14.480 |
you're very close, if you want to be, to creating the second approach to 01:19:19.040 |
feature importance, the one where you add up the importance across all of the rows, 01:19:26.300 |
Which means you would then be very close to doing interaction importance 01:19:30.480 |
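Here's a rough sketch of the shuffling-based feature importance and the nearly identical partial dependence loop described above, to make the "couple of lines of code" claim concrete. It assumes a fitted model m whose predict takes a numpy array, a pandas DataFrame X_valid, and a y_valid array; none of these names come from the lecture code, so treat it as illustrative only:

```python
import numpy as np
from sklearn import metrics

def shuffle_importance(m, X_valid, y_valid, col):
    # Baseline score, then score again with one column's values randomly permuted.
    base = metrics.r2_score(y_valid, m.predict(X_valid.values))
    X_shuf = X_valid.copy()
    X_shuf[col] = np.random.permutation(X_shuf[col].values)
    shuffled = metrics.r2_score(y_valid, m.predict(X_shuf.values))
    return base - shuffled        # how much we lose when this column is scrambled

def partial_dependence(m, X_valid, col, values):
    # Nearly the same code: replace the column with each constant value in turn
    # and watch how the average prediction moves.
    out = []
    for v in values:
        X_const = X_valid.copy()
        X_const[col] = v
        out.append(m.predict(X_const.values).mean())
    return out
```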
So it turns out that there's actually a very good library for interaction importance for XGBoost, 01:19:39.080 |
But there doesn't seem to be one for random forest 01:19:41.880 |
so you could, like, start by getting it working on our version 01:19:45.760 |
if you want to do interaction importance, and then you could get it working on the original 01:19:50.040 |
sklearn version, and that would be a cool contribution. 01:19:55.280 |
Like sometimes writing it against your own implementation is kind of nicer because you can see exactly what's going on 01:20:01.440 |
All right, so that's your job. You don't have to rewrite the random forest; feel free to if you want to, you know, practice. 01:20:13.320 |
If you get stuck at any point, you know, ask on the forum, right? There's a 01:20:21.520 |
whole page here on wiki.fast.ai about how to ask for help. 01:20:30.760 |
When you ask your co-workers on Slack for help, or when you ask people in a technical community on GitHub or a forum, 01:20:39.080 |
asking for help the right way will go a long way towards, 01:20:43.400 |
You know having people want to help you and be able to help you right so 01:20:48.360 |
So, like, search for your answer: search for the error you're getting, see if somebody's already asked about it. 01:20:55.120 |
You know how have you tried to fix it already? What do you think is going wrong? 01:21:04.240 |
What kind of computer are you on? How is it set up? What are the software versions? 01:21:08.160 |
Exactly what did you type, and exactly what happened? Now, you could show that by 01:21:16.520 |
taking a screenshot, so, you know, make sure you've got some screenshot software that's really easy to use. 01:21:22.880 |
So if I were to take a screenshot, I just hit a button 01:21:24.880 |
Select the area copy the clipboard go to my forum 01:21:31.200 |
Paste it in and there we go right that looks a little bit too big, so let's make it a little smaller 01:21:37.360 |
Right and so now I've got a screenshot people can see exactly what happened 01:21:44.280 |
If there's a few lines of code and error messages to look at, then create a gist. A gist is a 01:21:53.000 |
GitHub thing which basically lets you share code. So if I wanted to 01:22:01.520 |
share this notebook, I actually have an extension here, that little Gist-it button, so if I click on here, 01:22:16.280 |
okay, that takes my Jupyter notebook, shares it publicly, and I can then grab that URL and paste it into my post. 01:22:40.860 |
Okay, so that's a really good way to share your code. Now, that particular button is an extension, so on Jupyter 01:22:55.000 |
you need to enable the extension called Gist-it. While you're there, you should also click on Collapsible Headings. 01:22:59.440 |
That's this really handy thing. I use that lets me collapse things and open them up 01:23:03.420 |
If you go to your Jupyter and don't see this Nbextensions tab, 01:23:11.480 |
then search for jupyter nbextensions; it'll show you how to pip install it and get it set up, right? But those two extensions are super-duper handy. 01:23:28.640 |
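For what it's worth, on the classic Jupyter Notebook one common way to get that Nbextensions tab is the jupyter_contrib_nbextensions package; this is a hedged sketch of the usual install commands, not something from the lecture itself:

```python
# Run in a notebook cell (or drop the leading ! and run in a terminal):
!pip install jupyter_contrib_nbextensions
!jupyter contrib nbextension install --user
# Then restart Jupyter and enable "Gist-it" and "Collapsible Headings"
# from the Nbextensions tab.
```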
So that's it for random forests, and until the next course, when you look at GBMs, we're done with decision tree ensembles. 01:23:34.820 |
And so we're going to move on to neural networks, broadly defined, and so 01:23:49.720 |
that's going to let us go beyond just, you know, the kind of nearest-neighbors approach of random forests. 01:23:55.360 |
You know all the random forest can do is to average data that it's already seen 01:23:59.500 |
It can't extrapolate, it can't calculate, right? 01:24:07.240 |
Something like a linear model can calculate and can extrapolate, but only in very limited ways. 01:24:24.640 |
We're going to start by applying them to unstructured data. Right, so unstructured data means like pixels, or the amplitudes of sound waves, or words, you know, data where 01:24:38.520 |
everything in it is all the same kind of thing, as opposed to, like, a database table where you've got a revenue and a cost and a zip code and a state. 01:24:48.280 |
We're going to use it for structured data as well, but we're going to do that a little bit later 01:24:55.120 |
And it's also the area which more people have been applying deep learning to for longer 01:25:08.200 |
If you're doing the deep learning course as well 01:25:11.120 |
You know you'll see that we're going to be approaching kind of the same conclusion from two different directions 01:25:17.920 |
So the deep learning course is starting out with 01:25:21.320 |
Big complicated convolutional neural networks being solved with you know 01:25:27.600 |
Sophisticated optimization schemes and we're going to kind of gradually drill down into like exactly how they work 01:25:33.200 |
Whereas with the machine learning course, we're going to be starting out more with, like, 01:25:39.280 |
How does stochastic gradient descent actually work? 01:25:43.440 |
What can we do with, like, one single layer, which would allow us to create things like logistic regression? When we add 01:25:50.200 |
regularization to that, how does that give us things like 01:25:54.000 |
ridge regression, elastic net, lasso? And then as we add additional layers to that, how does that let us 01:26:01.040 |
handle more complex problems? And so we're only going to be looking at 01:26:06.840 |
fully connected layers in this machine learning course and 01:26:12.920 |
Then I think next semester with Yannet you're probably going to be looking at some more 01:26:17.800 |
Sophisticated approaches and so yes on this machine learning 01:26:22.120 |
We're going to be looking much more at like what's actually happening with the matrices and how they actually calculate it and the deep learning 01:26:27.800 |
It's much more like what are the best practices for actually? 01:26:30.840 |
Solving you know at a world-class level real world deep learning problems, right so 01:26:43.440 |
we're going to be looking at the classic MNIST problem, which is, like, how do we recognize digits. Now, 01:26:50.840 |
If you're interested you can like skip ahead and like try and do this with a random forest and you'll find it's not bad 01:26:58.240 |
Right given that a random forest is basically a type of nearest neighbors, right? 01:27:03.020 |
It's finding like what are the nearest neighbors in in tree space? 01:27:07.360 |
then a random forest can absolutely recognize that for this nine, those pixels 01:27:11.800 |
You know are similar to pixels. We've seen in these other ones and on average they were nines as well 01:27:18.000 |
All right, and so, like, we can absolutely solve this kind of problem with a random forest to some extent, 01:27:26.800 |
but we end up being rather data-limited, because every time we put in another decision point, you know, 01:27:33.280 |
we're halving our data, roughly, and so there's just this limitation on the amount of calculation that we can do. 01:27:43.000 |
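If you do want to try the "skip ahead" suggestion, here's a hedged sketch using scikit-learn's built-in random forest and the standard MNIST download (it assumes a reasonably recent scikit-learn, and the hyperparameters are just placeholders):

```python
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Download the standard 70,000-digit MNIST dataset from OpenML
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=10000)

m = RandomForestClassifier(n_estimators=40, min_samples_leaf=3, n_jobs=-1)
m.fit(X_train, y_train)
print(m.score(X_valid, y_valid))   # decent, but short of what a good neural net can do
```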
With neural nets, on the other hand, we're going to be able to use lots and lots and lots of parameters, 01:27:47.800 |
using these tricks we'll learn about with regularization, 01:27:51.760 |
And so we're going to be able to do lots of computation, and there's going to be very little limitation on really 01:27:57.920 |
what we can actually end up calculating as a result. Great. Good luck with your random forest interpretation, and I will see you next time.