
Intro to Machine Learning: Lesson 7


Chapters

0:19 The Position of Random Forests
12:10 Mean of the Binomial Distribution of N Samples
13:11 Standard Error
15:48 Less Common Classes
18:15 Class Weights
21:13 Create a Decision Tree
22:19 Flesh Out Our Decision Tree
23:55 Decision Tree
31:26 Predefined Decorators
36:15 Information Gain
42:50 Computational Complexity
43:59 Quickly Calculate Computational Complexity
45:47 Standard Deviation
62:29 Create a Set of Predictions for a Tree
66:13 The Ternary Operator
81:47 Create a Gist
83:39 Neural Networks

Whisper Transcript

00:00:00.000 | Welcome back. We're going to be talking today about
00:00:03.200 | Random forests, we're going to finish building our own random forest from scratch
00:00:08.240 | But before we do I wanted to
00:00:12.400 | Tackle a few things that have come up during the week a few questions that I've had
00:00:16.840 | And I want to start with kind of the position of random forests in general, so we spent
00:00:24.240 | About half of this course doing random forests, and then after today the second half of this course will be
00:00:30.040 | neural networks broadly defined
00:00:33.200 | This is because these two
00:00:40.800 | Represent the two key classes of techniques which cover
00:00:46.260 | Nearly everything that you're likely to need to do
00:00:49.200 | Random forests belong to the class of techniques of decision tree ensembles
00:00:53.880 | Along with gradient boosting machines being the other key type and some variants like extremely randomized trees
00:01:03.600 | They have the benefit that they're highly interpretable
00:01:06.560 | scalable
00:01:08.960 | Flexible work well for most kinds of data
00:01:11.400 | They have the downside that they don't extrapolate at all to like
00:01:16.720 | Data that's outside the range that you've seen as we looked at at the end of last week's session
00:01:26.160 | You know, they're a great starting point, and so
00:01:29.160 | I think you know there's a huge catalog of
00:01:34.680 | Machine learning tools out there, and like a lot of courses and books
00:01:39.280 | Don't attempt to kind of curate that down and say like, for these kinds of problems use this, for these kinds of problems use that,
00:01:46.800 | And be finished, you know, but rather they're like
00:01:49.600 | here's a description of a hundred different algorithms and
00:01:53.960 | You just don't need them
00:01:55.960 | You know, like, I don't see why you would ever use a support vector machine today, for instance
00:02:01.240 | Like, I can think of no reason at all
00:02:04.120 | To do that. People loved studying them in the 90s because they are, like, very theoretically elegant, and, like, you can really
00:02:11.360 | Write a lot of math about support vector machines, and people did, but, you know, in practice I don't see them as having any place
00:02:22.600 | So there's like a lot of techniques that you could include in an exhaustive list of every way that people have looked at machine learning problems
00:02:28.880 | But I would rather tell you like how to actually solve machine learning problems in practice
00:02:34.480 | I think, you know, we're about to finish today the first class, which is, you know, one type of decision tree ensemble
00:02:40.720 | And in part two Yannet will tell you about the other key type
00:02:44.320 | That being gradient boosting, and we're about to launch next lesson into
00:02:49.600 | neural nets, which includes all kinds of GLMs -
00:02:53.360 | Ridge regression, elastic net, lasso,
00:02:56.480 | Logistic regression, etc. are all variants of neural nets
00:03:01.600 | You know interestingly
00:03:06.480 | Leo Breiman, who created random forests, did so very late in his life and
00:03:13.960 | Unfortunately passed away not many years later
00:03:19.240 | So partly because of that very little has been written about them in the academic literature
00:03:24.760 | Partly because SVMs were just taken over at that point. You know other people didn't look at them
00:03:30.480 | And also, like, just because they're quite hard to grasp at a theoretical level - to analyze them theoretically -
00:03:40.560 | It's quite hard to write conference papers about them or academic papers about them
00:03:44.520 | So there hasn't been that much written about them
00:03:47.320 | But there's been a real resurgence or not resurgence a new wave in recent years of empirical machine learning like what actually works
00:03:55.320 | Kaggle's been part of that, but also just part of it. It's just been like
00:04:01.020 | companies using machine learning to make shit loads of money like Amazon and Google and
00:04:06.240 | So nowadays a lot of people are writing about
00:04:10.520 | decision tree ensembles and creating better software for decision tree ensembles, like GBM and XGBoost and
00:04:17.080 | Ranger for R and
00:04:20.080 | scikit-learn and so forth
00:04:22.160 | But a lot of this is being done in industry rather than academia
00:04:26.320 | But you know it's it's encouraging to see
00:04:30.800 | There's certainly
00:04:34.080 | More work being done in deep learning than in decision tree ensembles
00:04:39.400 | Particularly in in academia, but but there's a lot of progress being made in both
00:04:45.040 | You know if you look at like of the packages being used today for decision tree ensembles
00:04:51.400 | Like all the best ones the top five or six, and I don't know that any of them really existed
00:04:56.800 | Five years ago, you know, maybe other than, like, scikit-learn, or even three years ago, so that's been good
00:05:04.280 | But I think there's a lot of work
00:05:08.040 | Still to be done now. We talked about for example
00:05:10.400 | Figuring out what interactions are the most important last week and
00:05:15.880 | Some of you pointed out in the forums that actually there is such a project already for gradient boosting machines
00:05:21.400 | Which is great, but it doesn't seem that there's anything like that yet for random forests
00:05:25.640 | And you know random forests do have a nice benefit over GBMs that they're kind of
00:05:30.740 | Harder to screw up you know and easier to scale
00:05:37.000 | So hopefully that's something that you know this community might help fix
00:05:40.520 | Another question I had during the week was about the size of your validation set
00:05:47.240 | How big should it be
00:05:51.840 | So like to answer this question about how big does your validation set need to be you first need to answer the question
00:06:03.800 | How accurate do I need it - how precisely do I need to know the accuracy of this algorithm? Right, so like
00:06:11.400 | If the validation set that you have is saying like this is 70% accurate and
00:06:18.520 | If somebody said well is it 75% or 65% or 70% and the answer was I don't know anything in that range is close enough
00:06:27.800 | Like that would be one answer where else if it's like is it 70% or 70.01% or 69.99%
00:06:35.120 | Like then that's something else again, right? So you need to kind of start out by saying like how?
00:06:41.160 | How accurate do I need this?
00:06:43.760 | So like for example in the deep learning course we've been looking at dogs versus cats
00:06:49.080 | images and
00:06:51.680 | The models that we're looking at had about a ninety-nine point
00:06:56.480 | Four, ninety-nine point five percent accuracy on the validation set, okay, and our validation set size
00:07:06.680 | 2000 okay, in fact let's do this in Excel that'll be a bit easier
00:07:11.560 | So our validation set
00:07:16.720 | Size was
00:07:19.600 | 2000 and
00:07:21.600 | accuracy was
00:07:23.560 | ninety nine point four percent
00:07:26.840 | Right so the number of incorrect is
00:07:30.040 | Something around
00:07:34.040 | One minus accuracy times n so we were getting about 12 wrong
00:07:40.840 | right and
00:07:44.040 | The number of cats we had is
00:07:47.320 | Half and
00:07:51.240 | So the number of wrong cats is about six
00:07:58.040 | Okay, so then like we we run a new model and we find instead that the accuracy
00:08:06.520 | has gone to
00:08:09.200 | ninety nine point two percent
00:08:11.200 | Right and then it's like okay. Is this less good at finding cats. That's like well. It got two more cats wrong
00:08:18.600 | So it's like probably not
00:08:21.840 | right so
00:08:24.360 | But then it's like, well, does this matter - does ninety-nine point four versus ninety-nine point two
00:08:29.640 | Matter? And if this was, like - it wasn't about cats and dogs, but it was about finding fraud
00:08:35.320 | Right then the difference between a point six percent error rate and a point eight percent error rate
00:08:41.120 | It's like twenty five percent of your cost of fraud
00:08:43.560 | So like that can be huge
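As a quick sanity check of that arithmetic, here is a minimal sketch using the same numbers as the spreadsheet example above (only the numbers come from the lecture; the code itself is just an illustration):

```python
n = 2000                       # validation set size
acc_a, acc_b = 0.994, 0.992    # the two models being compared

wrong_a, wrong_b = n * (1 - acc_a), n * (1 - acc_b)   # ~12 vs ~16 total mistakes
cats_a, cats_b = wrong_a / 2, wrong_b / 2             # ~6 vs ~8 wrong cats (half the set is cats)

# A 0.6% vs 0.8% error rate is a ~33% relative increase in errors
# (equivalently, the 0.2% gap is a quarter of the 0.8% rate).
print(wrong_a, wrong_b, (1 - acc_b) / (1 - acc_a) - 1)
```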
00:08:46.240 | Like it was really interesting like when image net came out earlier this year the new competition results came out and
00:08:53.440 | The accuracy had gone down from three percent
00:08:56.320 | So the error went down from three percent to two percent
00:08:59.840 | And I saw a lot of people on the internet like famous machine learning researchers being like man
00:09:04.920 | Some Chinese guys got it better, from like ninety-seven percent to ninety-eight percent
00:09:10.080 | It's like statistically not even significant who cares kind of a thing, but actually I thought like
00:09:15.840 | Holy crap, this Chinese team just blew away the state-of-the-art in image recognition - like the old one was
00:09:22.880 | Fifty percent less accurate than the new one like that's that's actually the right way to think about it
00:09:27.960 | Isn't it because it's like you know we were trying to recognize you know like
00:09:32.020 | Which tomatoes were ripe and which ones weren't, and like, compared to our new approach, you know, the old approach, like,
00:09:40.120 | 50% of the time more,
00:09:42.840 | Was letting in the unripe tomatoes, or, you know, 50% more of the time we were, like,
00:09:49.920 | Accepting fraudulent customers like that's a really big difference
00:09:53.000 | So just because, like, on this particular validation set we can't really see six versus eight, doesn't mean the point two percent difference
00:10:01.280 | Isn't important - it could be. So my kind of rule of thumb is that this number of, like, how many
00:10:08.920 | Observations you're actually looking at - I want that generally to be somewhere higher than 22
00:10:15.520 | Why 22 because 22 is the magic number where the t distribution roughly turns into the normal distribution?
00:10:22.120 | All right, so as you may have learned the t distribution is is the normal distribution for small
00:10:28.720 | Datasets right and so in other words once we have 22 of something or more it kind of starts to behave
00:10:36.420 | kind of
00:10:39.040 | Normally in both sense of the words like it's kind of more stable, and you can kind of understand it better, so
00:10:45.240 | That's my magic number when somebody says do I have enough of something I kind of start out by saying like do you have?
00:10:51.000 | 22 observations of the thing of interest
00:10:53.280 | so if you were looking at like
00:10:56.200 | Lung cancer, you know, and you had a data set that had like a thousand people without lung cancer and 20 people with lung cancer
00:11:05.240 | I'd be like I very much doubt we're going to make much progress. You know because we haven't even got 20 of the thing you want
00:11:11.960 | So ditto with a validation set if you don't have 20 of the thing you want that is very unlikely to be useful
00:11:17.020 | Or if like the at the level of accuracy we need it's not plus or minus 20
00:11:21.680 | It's just it's that that's the point where I'm thinking like
00:11:24.360 | Be a bit careful
00:11:26.560 | so just to be clear you want 22 to be the number of
00:11:31.560 | Samples in each set like in the validation the test and the train or so what I'm saying is like if there's if there's less
00:11:40.560 | than 22 of a class in any of the sets then it's
00:11:46.800 | It's going to get it's getting pretty unstable at that point, right?
00:11:50.640 | And so like that's just like the first rule of thumb
00:11:54.320 | but then what I would actually do is, like, start practicing what we learned about the
00:12:00.620 | binomial distribution, so
00:12:05.840 | What's the?
00:12:10.280 | What is the mean of the binomial distribution of n samples and probability P?
00:12:17.680 | n times p - okay, thank you, n times p is our mean
00:12:23.660 | Right, so if you've got a 50% chance of getting ahead and you toss it a hundred times on average you get 50 heads
00:12:31.640 | Okay, and then what's the standard deviation?
00:12:37.560 | n p (1 minus p)
00:12:39.560 | Okay, so these are like
00:12:45.240 | Two numbers well the first number you don't really have to remember. It's intuitively obvious the second one is one that try to remember
00:12:52.280 | forevermore
00:12:54.160 | Because not only does it come up all the time the people that you work with will all have forgotten it
00:12:59.000 | So you'll be like the one person in the conversation who could immediately go we don't have to run this a hundred times
00:13:03.840 | I can tell you straight away - it's binomial, it's going to be n p q, n p (1 minus p)
00:13:09.080 | Then there's the standard error
00:13:13.120 | The standard error is if you run a bunch of trials each time getting a mean
00:13:20.560 | What is the standard deviation of the mean? I
00:13:24.280 | Don't think you guys have covered this yet. Is that right?
00:13:28.920 | No, so this is really important because this means like if you train a hundred models
00:13:35.620 | right, each time the validation set accuracy is like the mean of a distribution, and
00:13:41.660 | So therefore the standard deviation of that validation set accuracy it can be calculated with the standard error
00:13:48.920 | And this is equal to the standard deviation
00:13:52.000 | divided by
00:13:54.360 | square root n
00:13:57.640 | Right so this tells you
00:14:00.040 | So like one approach to figuring out like is my validation set big enough is train your model five times
00:14:06.280 | with exactly the same hyper parameters each time and look at the validation set accuracy each time and you know there's like a
00:14:14.600 | Mean and a standard deviation of five numbers you could use or a maximum and minimum you can choose
00:14:19.600 | But save yourself some time you can figure out straight away that like
00:14:24.600 | okay, well
00:14:27.440 | I I have a point nine nine
00:14:30.280 | Accuracy as to you know whether I get the cat correct or not correct
00:14:36.000 | So therefore the standard deviation is equal to 0.99 times 0.01
00:14:42.240 | Okay, and then I can get the
00:14:45.640 | Standard error of that
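To make those quantities concrete, here is a minimal sketch using the dogs-and-cats numbers from above. (Note that the p × (1 − p) shorthand used in the spreadsheet is really the variance of a single observation; the standard deviation is its square root, which is what the code below computes.)

```python
import numpy as np

n, p = 2000, 0.994                        # validation set size and accuracy

binom_mean = n * p                        # mean of the binomial: n * p              (~1988 correct)
binom_sd   = np.sqrt(n * p * (1 - p))     # std dev of the binomial: sqrt(n*p*(1-p)) (~3.5)

# Standard error of the validation accuracy itself: the std dev of a single
# Bernoulli observation, sqrt(p*(1-p)), divided by sqrt(n).
std_err = np.sqrt(p * (1 - p)) / np.sqrt(n)   # ~0.0017, i.e. the accuracy is roughly +/- 0.17%

print(binom_mean, binom_sd, std_err)
```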
00:14:49.720 | Right so so basically the size of the validation set you need
00:14:54.880 | It's like however big it has to be
00:14:57.800 | Such that your insights about its accuracy are good enough for your particular
00:15:03.840 | Business problem and so like I say like the simple way to do it is to pick a validation set of like a size a thousand
00:15:11.160 | train five models and
00:15:13.480 | See how much the validation set accuracy varies and if it's like if they're if it's they're all close enough for what you need
00:15:20.160 | Then you're fine. If it's not maybe you should make it bigger
00:15:24.480 | Or maybe you should consider
00:15:26.480 | using cross validation
00:15:29.280 | instead
00:15:32.680 | So like as you can see it really depends on what it is you're trying to do
00:15:37.320 | How common your less common class is and how accurate your model is could you pass that back to Melissa, please?
00:15:44.640 | Thank you, I have a question about the less common classes if you have less than 22
00:15:53.120 | Let's say you have one sample of something
00:15:55.320 | Let's say it's a face and I only have one representation from that particular country
00:16:01.780 | Do I toss that into the training set and it adds variety do I pull it out completely out of the data set or?
00:16:09.560 | Do I put it in a test set instead of the validation set?
00:16:16.000 | So you certainly couldn't put it in the test or the validation set because you're asking
00:16:20.240 | Can I - I mean, in general, because you're asking, can I recognize something I've never seen before?
00:16:24.280 | But actually this this question of like can I recognize something I've not seen before there's actually a whole class of models specifically
00:16:31.940 | For that purpose it's called other one-shot learning
00:16:35.080 | Which is you get to see something once and then you have to recognize it again or zero shot learning
00:16:39.700 | Which is where you have to recognize something you've never seen before we're not going to cover them in this course
00:16:47.040 | But that can be useful for things like face recognition
00:16:50.840 | You know like is this the same person I've seen before and so generally speaking
00:16:55.720 | Obviously for something like that to work. It's not that you've never seen our face before
00:17:00.800 | It's that you've never seen Melissa's face before you know and so you see Melissa's face once and you have to recognize it again
00:17:07.400 | Yeah, so in general you know your validation set and test set need to have the same
00:17:16.000 | mix or frequency of
00:17:18.160 | Observations that you're going to see in production, in the real world
00:17:22.740 | and then your training set
00:17:25.600 | Should have an equal number in each class and if you don't
00:17:31.960 | Just replicate the less common one
00:17:34.960 | Until it is equal so this is I think we've mentioned this paper before very recent paper that came out
00:17:41.280 | They tried lots of different approaches to training with unbalanced data sets and found consistently that
00:17:47.000 | Oversampling the less common class until it is the same size as the more common class is always the right thing to do
00:17:54.480 | So you could literally copy them - you know, so like, I've only got, you know, ten examples of people with cancer and a hundred
00:18:03.280 | Without, so I could just copy those ten
00:18:05.280 | Another, you know, 90 times
00:18:09.080 | That's kind of a little memory inefficient so a lot of things including
00:18:14.120 | I think scikit-learn's random forests have a class_weight parameter that says each time you're bootstrapping or resampling
00:18:20.560 | I want you to sample the less common class with a higher probability
00:18:24.880 | Or ditto if you're doing deep learning you know make sure in your mini batch
00:18:29.320 | It's not randomly sampled, but it's a stratified sample so the less common class is picked more often
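As a rough sketch of the two options just described - literally oversampling the rare class, or using scikit-learn's class_weight parameter (which weights the rare class more heavily in the split criterion rather than literally resampling it; 'balanced_subsample' computes the weights per bootstrap sample) - with made-up data purely for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical unbalanced data: 100 negatives, 10 positives.
rng = np.random.RandomState(0)
X = rng.randn(110, 5)
y = np.r_[np.zeros(100, dtype=int), np.ones(10, dtype=int)]

# Option 1: oversample the rare class until the classes are the same size.
pos = np.where(y == 1)[0]
extra = rng.choice(pos, size=(y == 0).sum() - len(pos), replace=True)
X_bal, y_bal = np.concatenate([X, X[extra]]), np.concatenate([y, y[extra]])

# Option 2: have the forest weight the rare class more heavily instead.
m = RandomForestClassifier(n_estimators=40, class_weight='balanced')
m.fit(X, y)
```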
00:18:38.640 | Okay, so let's get back to finishing off
00:18:41.800 | Our random forests and so what we're going to do today is we're going to finish off writing our random forest
00:18:48.280 | and then after today your homework will be to take this class and
00:18:53.120 | To add to it all of the random forest interpretation algorithms that we've learned, okay, so
00:19:00.640 | Obviously to be able to do that you're going to need to totally understand how this class works
00:19:05.880 | So please you know ask lots of questions as necessary as we go along
00:19:09.800 | So just to remind you
00:19:13.520 | We we're doing the the bulldozers Kaggle competition data set again
00:19:20.680 | We split it as before into 12,000 validation the last 12,000
00:19:25.240 | Records and then just to make it easier for us to keep track of what we're doing
00:19:31.360 | We're going to just pick two columns out to start with year made and machine hours on the meter
00:19:36.120 | okay, and so what we did last time was we started out by creating a tree ensemble and
00:19:42.440 | the tree ensemble had a bunch of trees which was literally a list of
00:19:48.120 | Entries trees where each time we just called create tree
00:19:53.800 | Okay, and create tree contained a
00:19:59.560 | sample size number of random indexes
00:20:03.000 | Okay, this one was drawn without replacement
00:20:07.080 | So remember bootstrapping means sampling with replacement
00:20:12.600 | So normally with scikit learn if you've got n rows
00:20:16.240 | We grab n rows with replacement, which means many of them will appear more than once
00:20:23.760 | So each time we get a different sample, but it's always the same size as the original data set and then we have our
00:20:31.120 | set RF samples
00:20:33.480 | Function that we can use which does with replacement sampling of less than n rows
00:20:40.440 | This is doing something again, which is it's sampling without replacement
00:20:45.360 | Sample size rows okay because we're permuting
00:20:49.440 | The numbers from nought to len(y) minus 1, and then grabbing the first self.sample_sz of them
00:20:55.640 | Actually, there's a faster way to do this - you can just use np.random.choice, which is a slightly more direct way
00:21:01.960 | But this way works as well
00:21:04.400 | All right. So this is our random sample for this, one of our
00:21:09.200 | n_trees trees
00:21:12.240 | And so then we're going to create a decision tree and our decision tree
00:21:17.000 | We don't pass it all of x, we pass it these specific indexes - and remember, x is a pandas DataFrame
00:21:24.220 | So if we want to index into it with a bunch of integers, we have to use iloc, integer locations,
00:21:31.280 | And that makes it behave, indexing-wise, just like numpy
00:21:35.540 | The y vector is numpy, so we can just index into it directly, and then we're going to keep track of our minimum leaf size
00:21:48.260 | So then the only other thing we really need in an ensemble is some way to make a prediction
00:21:51.820 | And so we were just going to do the mean of the tree prediction
00:21:56.200 | For each tree
00:21:59.300 | All right, so that was that and so then in order to be able to run that
00:22:03.520 | We need a decision tree class because it's being called here
00:22:08.240 | And so there we go
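For reference, here is a minimal sketch of the ensemble class being described (it follows the lecture's naming - TreeEnsemble, sample_sz, min_leaf - but treat it as an approximation rather than a verbatim copy of the notebook; it relies on the DecisionTree class fleshed out below, and the tree's own predict method only gets written later in the lesson):

```python
import numpy as np

class TreeEnsemble():
    def __init__(self, x, y, n_trees, sample_sz, min_leaf=5):
        self.x, self.y, self.sample_sz, self.min_leaf = x, y, sample_sz, min_leaf
        self.trees = [self.create_tree() for i in range(n_trees)]

    def create_tree(self):
        # Sample without replacement: permute all the row numbers, keep the first sample_sz.
        # (np.random.choice(len(self.y), self.sample_sz, replace=False) is the more direct way.)
        idxs = np.random.permutation(len(self.y))[:self.sample_sz]
        return DecisionTree(self.x.iloc[idxs], self.y[idxs], min_leaf=self.min_leaf)

    def predict(self, x):
        # The ensemble's prediction is just the mean of each tree's prediction.
        return np.mean([t.predict(x) for t in self.trees], axis=0)
```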
00:22:12.020 | Okay, so that's the starting point. So
00:22:15.740 | The next thing we need to do is to flesh out our decision tree
00:22:22.140 | So the important thing to remember is all of our randomness
00:22:25.960 | Happened back here in the tree ensemble
00:22:28.740 | The decision tree class we're going to create
00:22:31.900 | Doesn't have randomness in it
00:22:36.580 | Okay, so right now we are building a random forest regressor, right, so that's why we're taking the mean of the tree
00:22:43.540 | Outputs - if we were to work with classification
00:22:47.800 | Do we take the max like the classifier will give you either zeros or ones?
00:22:52.820 | No, I would still take the mean. So each tree is going to tell you what percentage of that leaf node
00:23:01.380 | Contains cats and what percentage contains dogs
00:23:05.820 | so then I would average all those percentages and say across the trees on average there is 19% cats and
00:23:12.740 | 81% dogs
00:23:16.260 | Good question, so you know
00:23:18.540 | Random forest classifiers are almost identical, or can be almost identical, to random forest regressors
00:23:27.240 | The technique we're going to use to build this today will basically exactly work for a classification
00:23:34.340 | Certainly for binary classification you can do it with exactly the same code; for multi-class classification you just need to change your data structure
00:23:41.900 | so that like you have like a one hot encoded matrix or a
00:23:47.140 | List of integers that you treat as a one hot encoded matrix
00:23:51.620 | Okay, so our decision tree
00:23:57.160 | So remember our idea here is that we're going to like try to avoid thinking
00:24:03.260 | So we're going to basically write it as if everything we need already exists, okay, so we know
00:24:10.100 | From when we created the decision tree, we're going to pass in the X the Y and the minimum leaf size
00:24:16.480 | So here we need to make sure we've got the X and the Y and the minimum leaf size. Okay, so then there's one other thing
00:24:23.980 | which is as we
00:24:25.980 | split our tree into sub trees, we're going to need to keep track of
00:24:31.980 | Which of the row indexes went into the left-hand side of the tree which went into the right-hand side of the tree
00:24:37.740 | Okay, so we're going to have this thing called
00:24:39.860 | indexes as
00:24:42.020 | Well, right so at first we just didn't bother passing in indexes at all
00:24:46.660 | So if indexes is not passed in if it's none, then we're just going to set it to
00:24:50.980 | Everything the entire length of y right so NP dot a range is the same as just range in Python
00:24:59.040 | But it returns a numpy array right so that the root of a decision tree
00:25:04.160 | Contains all the rows. That's the definition really of the root of a decision tree
00:25:09.960 | So all the rows is row 0 row 1 row 2 etc up to row y minus 1
00:25:15.460 | Okay, and then we're just going to store away all that information that we were given
00:25:21.340 | We're going to keep track of how many?
00:25:23.980 | Rows are there and how many columns are there? Okay?
00:25:28.620 | so then the
00:25:30.740 | every
00:25:31.860 | leaf and every node in a tree has a value it has a prediction that prediction is just equal to the average of
00:25:40.140 | the dependent variable
00:25:42.980 | Okay, so
00:25:45.260 | every node in the tree
00:25:47.260 | y indexed with the indexes is
00:25:50.780 | The values of the dependent variable that are in this branch of the tree, and so here is the mean
00:26:01.780 | Nodes in a tree also have a score which is like how effective was the split?
00:26:07.920 | Here right, but that's only going to be true if it's not a leaf node right a leaf node has no further splits
00:26:15.500 | And at this point when we create a tree we haven't done any splits yet, so its score starts out as being infinity
00:26:24.060 | so having built that the root of the tree our next job is to find out which variable should we split on and
00:26:32.060 | What level of that variable should we split on so let's pretend that there's something that does that
00:26:38.060 | find_varsplit
00:26:40.700 | So then we're done, okay, so how do we find a
00:26:49.420 | variable to split on so well we could just go through each
00:26:54.460 | Potential variable so C contains the number of columns
00:26:59.380 | We have so go through each one and see if we can find a better split than we have so far on that column
00:27:09.500 | Now notice this is like
00:27:13.700 | The full random forest definition this is assuming that max features is set to all right remember
00:27:20.500 | We could set max features to like zero point five in which case we wouldn't check all the numbers should not to see
00:27:25.700 | We would check half the numbers at random from not to see so if you want to turn this into like a
00:27:31.420 | random forest that has the max features
00:27:34.220 | Support you could easily like add one line of code to do that, but we're not going to do it in our implementation today
00:27:44.060 | Then we just need to find better split and since we're not interested in thinking at the moment for now
00:27:50.140 | We're just going to leave that empty all right
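Roughly, then, the constructor and the two stubs being described look something like this (again a sketch, assuming numpy is imported as np; the names mirror the lecture but may not match the notebook exactly):

```python
class DecisionTree():
    def __init__(self, x, y, idxs=None, min_leaf=5):
        if idxs is None: idxs = np.arange(len(y))   # the root of a tree contains every row
        self.x, self.y, self.idxs, self.min_leaf = x, y, idxs, min_leaf
        self.n, self.c = len(idxs), x.shape[1]      # rows and columns in this node
        self.val = np.mean(y[idxs])                 # this node's prediction: the mean of the dependent variable
        self.score = float('inf')                   # how good was the split - no split yet, so infinity
        self.find_varsplit()

    def find_varsplit(self):
        # Check every column for a better split (i.e. max_features is effectively "all").
        for i in range(self.c): self.find_better_split(i)

    def find_better_split(self, var_idx): pass      # deliberately left empty for now
```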
00:27:55.500 | The one other thing I like to do
00:27:57.700 | With my kind of where I start writing a class is I like to have some way to print out
00:28:02.860 | What's in that class right and so if you type print followed by an object or if at Jupiter notebook?
00:28:09.500 | You just type the name of the object
00:28:12.460 | At the moment. It's just printing out
00:28:14.900 | Underscore underscore main underscore underscore dot decision tree at blah blah blah which is not very helpful
00:28:20.540 | Right so if we want to replace this with something helpful
00:28:23.660 | We have to define the special
00:28:26.300 | Python method name dunder repr, __repr__,
00:28:30.140 | To get a representation of this object. So when we simply just
00:28:37.500 | Write the name like this, behind the scenes that calls that function, and the default
00:28:42.020 | Implementation of that method is just to print out this
00:28:46.100 | Unhelpful stuff so we can replace it by instead saying let's create a format string
00:28:52.060 | Where we're going to print out n and then show n and then print Val and then show Val okay, so how many?
00:28:59.300 | How many rows are in this node and what's the average of the dependent variable okay?
00:29:08.580 | If it's not a leaf node, so if it has a split then we should also be able to print out the score
00:29:14.060 | The value we split out and the variable that we split on
00:29:18.780 | Now you'll notice here
00:29:22.380 | self.is_leaf - is_leaf is defined as a method, but I don't have any parentheses after it
00:29:28.240 | This is a special kind of method called a property and so a property is something that kind of looks like
00:29:35.940 | a regular variable
00:29:37.940 | But it's actually calculated on the fly so when I call its leaf it actually calls this function
00:29:45.620 | Right, but I've got this special decorator
00:29:48.060 | Property okay, and what this says is basically you don't have to include the parentheses when you call it
00:29:54.800 | Okay, and so it's going to say all right is this a leaf or not so a leaf is something that we don't split on
00:30:03.940 | If we haven't split on it then its score is still set to infinity, but that's my logic
00:30:09.060 | That makes sense
00:30:12.220 | So this this at notation is called a decorator
00:30:18.340 | It's basically a way of telling Python more information about your method
00:30:23.280 | So anybody here remember where you have seen decorators before?
00:30:30.460 | You pass it over here, yeah, where have you seen where have you seen decorators before tell us more about flask and
00:30:36.980 | Yeah, what does that do?
00:30:41.860 | So flask so anybody who's done any web programming before with something like flask or a similar framework
00:30:53.940 | Would have had to have said like this method is going to respond to this bit of the URL
00:31:00.200 | And either to POST or to GET, and you put it in a special decorator
00:31:07.640 | Behind the scenes that's telling
00:31:09.640 | Python to treat this method in a special way, so here's another decorator, okay?
00:31:15.000 | And so you know if you get more advanced with Python you can actually learn how to write your own decorators
00:31:20.880 | Which as was mentioned you know basically insert some additional code
00:31:24.320 | but for now just know there's a bunch of predefined decorators we can use to
00:31:30.320 | Change how our methods behave and one of them is at property which basically means you don't have to put parentheses anymore
00:31:36.600 | Which of course means you can't add any more parameters beyond self
00:31:43.040 | Why, if it's not a leaf, why is this infinity - because doesn't infinity mean you're at the root?
00:31:50.600 | No, infinity means that you're not at the root. It means that you're at a leaf, so the root will have a split
00:31:58.520 | Assuming we find one right everything will have a split till we get all the way to the bottom
00:32:02.560 | The leaf and so the leaves will have a score of infinity because they won't split
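And a sketch of the __repr__ and the is_leaf property just described (these live inside the DecisionTree class above; split_name is a small helper, assumed here, that looks up the column name):

```python
    @property
    def split_name(self): return self.x.columns[self.var_idx]   # name of the column we split on

    @property
    def is_leaf(self): return self.score == float('inf')        # never split on => it's a leaf

    def __repr__(self):
        s = f'n: {self.n}; val: {self.val}'
        if not self.is_leaf:                                     # leaves have no split to report
            s += f'; score: {self.score}; split: {self.split}; var: {self.split_name}'
        return s
```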
00:32:07.880 | Great all right
00:32:11.640 | So that's our
00:32:14.520 | Decision tree it doesn't do very much, but at least we can like create an ensemble right ten trees sample size a thousand
00:32:22.200 | Right and we can like print out so now when I go m trees zero
00:32:26.200 | It doesn't say blah blah blah blah blah. It says
00:32:29.320 | What we asked it to say: n: 1000,
00:32:32.520 | val: 10.08, okay, and this is a leaf because we haven't split on it yet
00:32:40.200 | So we've got nothing more to say
00:32:42.320 | Okay, so then the indexes are all the numbers from 0 to 1,000
00:32:48.780 | Okay, because the base of the tree has everything this is like everything in
00:32:55.080 | The random sample that was passed to it because remember by the time we get to the point where it's a decision tree
00:32:59.840 | Where we don't have to worry about any of the randomness in the random forest anymore, okay?
00:33:05.280 | All right, so
00:33:09.200 | Let's try to write
00:33:12.080 | The thing which finds a split okay, so we need to implement
00:33:18.520 | Find better split okay, and so it's going to take the index of a variable variable number one variable number three
00:33:26.260 | Whatever and it's going to figure out
00:33:28.260 | What's the best split point?
00:33:31.440 | Is that better than any split we have so far and?
00:33:34.040 | For the first variable the answer will always be yes because the best one so far is none at all infinity bad, okay?
00:33:40.840 | So let's start by making sure we've got something to compare to so the thing we're going to compare to
00:33:47.760 | will be
00:33:48.960 | scikit-learn's random forest, and
00:33:50.960 | So we need to make sure that scikit-learn's random forest gets exactly the same data that we have, so we start out by creating an ensemble,
00:33:58.140 | Grab a tree out of it and then find out which particular random sample of x and y did this tree use?
00:34:06.180 | Okay, and we're going to store them away so that we can pass them to scikit learn so we have exactly the same information
00:34:14.440 | So let's go ahead and now create a random forest using scikit learn so one tree
00:34:19.040 | One decision tree,
00:34:21.600 | No bootstrapping, so the whole data set - so this should
00:34:26.240 | Be exactly the same as the thing that we're going to create this tree
00:34:30.640 | So let's try
00:34:34.680 | So we need to define find better split
00:34:38.960 | Okay, so find better split takes a variable
00:34:44.060 | Okay, so let's define our x independent variables and say okay
00:34:49.700 | Well, it's everything inside our tree, but only
00:34:53.260 | Those indexes that are in this node right which at the top of the tree is everything right and just this one variable
00:35:02.760 | Okay, and then for our wise it's just whatever our dependent variable is at the indexes in this node
00:35:10.140 | Okay, so there's our x and y
00:35:13.940 | Let's now go through every single value in our independent variable and
00:35:20.580 | So I'll show you what's going to happen. So let's say our independent variable is year made
00:35:26.420 | And it's not going to be in order
00:35:41.460 | Right and so we're going to go to the very first row and we're going to say okay year made here is three
00:35:48.020 | Right and so what I'm going to do is I'm going to try and calculate
00:35:51.780 | The score if we decided to branch on the number three
00:35:56.980 | Right so I need to know which rows are greater than three
00:36:02.180 | Which rows are less than and equal to three and they're going to become my left-hand side my right-hand side
00:36:07.340 | Right and then we need a score
00:36:09.700 | right, so
00:36:11.420 | There's lots of scores we could use so in random forests. We call this the information gain, right?
00:36:17.380 | The information gain is like how much better does our score get because we split it into these two groups of data
00:36:22.260 | There's lots of ways we could calculate it Gini cross entropy root mean squared error, whatever
00:36:28.420 | If you think about it, there is an alternative formulation of root mean squared error
00:36:34.540 | Which is mathematically the same to within a constant scale, but it's a little bit easier to deal with which is
00:36:41.060 | We're going to try and find a split
00:36:43.380 | Which causes the two groups to each have as low a standard deviation as possible
00:36:49.820 | All right, so like I want to find a split that puts all the cats over here and all the dogs over here
00:36:54.980 | Right, so if these are all cats and these are all dogs
00:36:58.060 | Then this has a standard deviation of zero and this has a standard deviation of zero
00:37:02.660 | Or else this is like a totally random mix of cats and dogs. This is a totally random mix of cats and dogs
00:37:07.500 | They're going to have a much higher standard deviation
00:37:09.820 | That makes sense. And so it turns out if you find a split that minimizes those group standard deviations or specifically the
00:37:18.020 | Weighted average of the two standard deviations. It's mathematically the same as minimizing the root mean squared error
00:37:24.260 | That's something you can prove to yourself after class if you want to
00:37:27.820 | All right, so we're going to need to find
00:37:31.940 | First of all split this into two groups. So where's all the stuff that is greater than three?
00:37:36.700 | So greater than three is this one this one and this one. So we need the standard deviation of that
00:37:41.700 | so let's go ahead and say standard deviation of
00:37:45.820 | Greater than three that one
00:37:49.380 | That one and that one. Okay, and then the next will be the standard deviation of
00:37:59.940 | Less than or equal to three. So that would be that one
00:38:02.780 | That one that one and then we just take the weighted average of those two and that's our score
00:38:10.860 | That would be our score if we split on three
00:38:15.140 | That makes sense. And so then the next step would be try to split on four
00:38:19.740 | Try splitting on one try splitting on six
00:38:22.740 | Redundantly try splitting on four again
00:38:27.900 | Redundantly try splitting on one again and find out which one works best
00:38:30.580 | So that's our code here is we're going to go through every row. And so let's say okay left-hand side is
00:38:37.700 | Any values in X that are less than or equal to this particular value
00:38:44.420 | Our right hand side is every value in X that are greater than this particular value
00:38:50.860 | Okay, so what's the data type that's going to be in LHS and RHS? What are they actually going to contain?
00:39:05.180 | They're going to be arrays arrays of what?
00:39:11.540 | Arrays of booleans. Yeah, which we can treat as zero and one. Okay, so LHS will be an array of
00:39:20.740 | False every time it's not less than or equal to and true
00:39:23.980 | Otherwise and RHS will be a boolean array of the opposite. Okay, and now we can't take a standard deviation of an empty set
00:39:32.180 | Right. So if there's nothing that's greater than
00:39:35.140 | This number then these will all be false which means the sum will be zero
00:39:41.700 | Okay, and in that case
00:39:43.740 | Let's not go any further with this step because there's nothing to take the standard deviation of and it's obviously not a useful
00:39:49.460 | Split, okay
00:39:51.100 | so assuming we've got this far we can now calculate the standard deviation of the left hand side and
00:39:55.980 | of the right hand side and
00:39:58.500 | Take the weighted average - or the sum, which is the same thing to us, up to a scalar
00:40:04.220 | Right and so there's our score and so we can then check is this better than our best score so far and our best score
00:40:11.260 | So far we initially initialized it to infinity, right? So initially this is this is better
00:40:17.780 | So if it's better, let's store away
00:40:20.560 | All of the information we need which variable has found this better split what was the score we found and
00:40:29.260 | What was the value that we split on?
00:40:32.860 | Okay, so there it is
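Putting that description into code, here is a sketch of the brute-force find_better_split - written as a standalone function taking self as its first argument, as in the lecture, so that it can be patched into the class afterwards:

```python
def find_better_split(self, var_idx):
    x, y = self.x.values[self.idxs, var_idx], self.y[self.idxs]
    for i in range(self.n):
        lhs = x <= x[i]                 # boolean array: everything <= this value
        rhs = x > x[i]                  # boolean array: everything > this value
        if rhs.sum() == 0: continue     # nothing on the right - not a useful split
        lhs_std, rhs_std = y[lhs].std(), y[rhs].std()
        curr_score = lhs_std * lhs.sum() + rhs_std * rhs.sum()   # weighted sum of the two stds
        if curr_score < self.score:     # better than the best found so far? store it away
            self.var_idx, self.score, self.split = var_idx, curr_score, x[i]
```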
00:40:36.140 | So if we run that
00:40:40.100 | And I'm using time it so what time it does is it sees how long this command takes to run and it tries to give you
00:40:47.580 | A kind of statistically valid measure of that, so you can see here - it's run it ten times
00:40:52.900 | To get an average and then it's done that seven times to get a mean and standard deviation across runs
00:40:59.660 | And so it's taking me 75 milliseconds plus or minus ten
00:41:05.420 | So let's check that this works
00:41:07.420 | Find better split tree zero. So zero is year made one is machine hours current meter
00:41:17.140 | So with one
00:41:19.140 | we got back machine hours current meter 37 4 4 with this score and then we ran it again with
00:41:25.680 | 0 that's year made and we've got a better score 658 and split 1974 and so in 1974
00:41:33.600 | Let's compare
00:41:36.020 | Yeah, that was what this tree did as well. Okay, so we've got we've confirmed that this
00:41:40.800 | Method is giving the same result that scikit-learn's random forest did
00:41:47.140 | And you can also see here the value
00:41:49.540 | Ten point oh eight and again matching here the value ten point oh eight
00:41:55.060 | Okay, so we've got something that can find one split. Could you pass that to Yannet, please?
00:42:01.180 | So Jeremy, why don't we put a unique on the X there?
00:42:07.340 | Because I'm not trying to optimize the performance yet - but you're saying that, like, it's doing more work than it needs to?
00:42:15.540 | Yeah, so it's like - and you can see in the Excel, I, like, checked this one twice, I checked this four twice, unnecessarily
00:42:24.140 | Okay, so - and so again, she's already thinking about performance, which is good
00:42:31.180 | So tell me what is the computational complexity of
00:42:39.860 | section of the code and I like have a think about it, but also like feel free to talk us through it if you want to
00:42:46.180 | kind of
00:42:47.380 | Think and talk at the same time
00:42:49.380 | What's the computational complexity of this piece of code?
00:42:53.140 | Can I pass it over there yes
00:43:00.480 | Alright Jade take us through your thought process
00:43:04.740 | I think you have to take each different values through the column to calculate it
00:43:10.940 | Once to see the splits so and then compare
00:43:15.200 | All the cup like all the possible combinations between these different values so that can be expensive
00:43:21.980 | Like, because you're - uh-huh. Can you, or is somebody else going to tell us the actual computational complexity?
00:43:27.580 | So like yeah quite high Jade's thinking
00:43:30.100 | How high?
00:43:35.740 | Think it's n squared. Okay, so tell me why is it n squared - because for the for loop it is n, yes
00:43:42.060 | And I think, I guess the standard deviation will take n, so it's n squared. Okay, or
00:43:46.820 | This one maybe is even easier to know like this is like which ones are less than xi
00:43:52.540 | I'm gonna have to check every value that if it's less than xi. Okay, and so
00:43:56.580 | So it's useful to know like
00:43:59.540 | How do I quickly calculate computational complexity?
00:44:02.220 | I can guarantee most of the interviews you do are going to ask you to calculate
00:44:05.560 | Computational complexity on the fly and it's also like when you're coding you want it to be second nature
00:44:10.980 | So the technique is basically is there a loop?
00:44:14.180 | Okay, we're then we're obviously doing this n times
00:44:17.680 | Okay, so there's an n involved. Is there a loop inside the loop if there is then you need to multiply those two together
00:44:23.740 | In this case there's not is there anything inside the loop?
00:44:27.460 | That's not a constant time thing
00:44:30.060 | so you might see a sort in there and you just need to know that sort is n log n like
00:44:34.460 | That should be second nature if you see a matrix multiply you need to know what that is in this case
00:44:40.180 | There are some things that are doing element wise array operations, right?
00:44:44.980 | So keep an eye out for anything where numpy is doing something to every value of an array in this case
00:44:50.100 | It's checking every value of x against a constant. So it's going to have to do that n times
00:44:55.540 | So to flesh this out into a computational complexity
00:44:58.300 | You just take the number of things in the loop and you multiply it by the highest computational complexity inside the loop
00:45:05.820 | n times n is n squared you pass that
00:45:08.300 | In this case couldn't we just
00:45:12.220 | Presort the list and then do like one and log n computation and there's lots of things we could do to speed this up
00:45:17.580 | So at this stage, it's just like what is the computational complexity we have?
00:45:21.340 | But absolutely it's certainly not as good as it can be. Okay, so and that's where we're going to go next
00:45:26.820 | It's like alright n squared is not it's not great. So let's try and make it better
00:45:31.620 | So here's my attempt at making it better
00:45:37.380 | And the idea is this
00:45:41.340 | Okay, who wants to first of all tell me what's the equation for a standard deviation?
00:45:48.740 | Marsha, can you grab the box?
00:45:53.940 | So for the standard deviation, it's the difference between the value and its mean,
00:46:01.180 | And we take the square of that. So we take
00:46:06.280 | The power of two, yeah, then we sum up all of these observations, and we take the square root of all this sum
00:46:15.820 | Yeah, you have to divide by n. Yep. Yep. Great. Good. Okay, now
00:46:20.580 | In practice, we don't normally use that formulation because it kind of requires us calculating
00:46:27.820 | You know X minus the mean lots of times. Does anybody know the formulation that just requires?
00:46:33.900 | X and X squared? Anybody happen to know that one? Yes, at the back - pass the box back there
00:46:39.560 | Square root of mean of squares minus
00:46:46.340 | Square of mean. Yeah, great - mean of squares minus the square of the mean
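(In symbols, that second formulation is std(x) = sqrt(mean(x^2) - mean(x)^2) - the square root of the mean of the squares minus the square of the mean.)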
00:46:50.580 | All right, so that's a really good one to know, because, like, you can now calculate
00:46:57.540 | Variances or standard deviations of anything. You just have to first of all grab the column as it is
00:47:04.740 | The column squared right and as long as you've got those stored away somewhere you can
00:47:10.940 | Immediately calculate the standard deviation. So the reason this is handy for us is that if we first of all
00:47:17.980 | Sort our data, right?
00:47:21.560 | Let's go ahead and sort our data
00:47:24.940 | Then if you think about it as we kind of start going down one step at a time
00:47:30.020 | Right, then each group is exactly the same as the previous group on the left hand side with one more thing in it
00:47:37.820 | And on the right hand side with one less thing in it
00:47:40.220 | So given that, we just have to keep track of the sum of x and the sum of x squared
00:47:44.500 | We can just add one more thing to the sum of x, one more thing to the sum of x squared on the left, and remove one thing on the right
00:47:51.140 | Okay, so we don't have to go through the whole lot each time and so we can turn this into a order n
00:47:58.300 | Algorithm. So that's all I do here: I sort the data, right, and we're going to keep track of the count of things on
00:48:05.560 | the right
00:48:06.500 | The sum of things on the right and the sum of squares on the right
00:48:10.020 | And initially everything's in the right hand side
00:48:13.540 | Okay, so initially n is the count
00:48:16.540 | Y sum is the sum on the right and Y squared sum is the sum of squares on the right and
00:48:23.680 | Then nothing is initially on the left. So it's zeros. Okay, and then we just have to loop through
00:48:30.020 | each observation right and
00:48:36.140 | Add one to the left hand count, subtract one from the right hand count; add the value to the left hand sum,
00:48:41.980 | Subtract it from the right hand sum; add the value squared to the left hand, subtract it from the right hand
00:48:50.020 | Now we do need to be careful though because if we're saying less than or equal to one say we're not stopping here
00:48:57.620 | We're stopping here like we have to have everything in that group
00:49:01.140 | So the other thing I'm going to do is I'm just going to make sure that the next value is not the same as this
00:49:06.820 | Value if it is I'm going to skip over it, right? So I'm just going to double-check
00:49:10.620 | That this value and the next one aren't the same
00:49:14.060 | Okay, so as long as they're not the same
00:49:18.380 | I can keep going ahead and calculate my standard deviation now
00:49:21.700 | Passing in the count the sum and the sum squared right and there's that formula
00:49:29.100 | Okay - the sum of squares divided by the count, minus the square of the mean
00:49:36.420 | Do that for the right hand side and so now we can calculate the weighted average score
00:49:43.020 | Just like before, and all of these lines are now the same. Okay, so we've turned our order n
00:49:48.300 | Squared algorithm into an order n algorithm, and in general
00:49:54.620 | stuff like this is going to get you a lot more value than, like, pushing something onto a Spark cluster or
00:50:00.700 | Ordering faster RAM or using more cores in your CPU or whatever, right?
00:50:05.980 | This is the way you want to be, you know improving your code and specifically
00:50:11.500 | Write your code
00:50:14.940 | Right without thinking too much about performance run it. Is it fast enough for what you need then you're done if not
00:50:23.860 | profile it, right. So in
00:50:25.860 | Jupyter, instead of saying %timeit you say %prun, and
00:50:32.340 | It will tell you exactly
00:50:35.820 | where the time was spent in your algorithm, and then you can go to the bit that's actually taking the time and think about like
00:50:43.540 | Okay, is this?
00:50:45.340 | Is this algorithmically as efficient as it can be?
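For example, something like this - a hypothetical invocation reusing the standalone find_better_split and the tree from the cells above:

```python
%prun find_better_split(tree, 0)   # function-by-function breakdown of where the time goes
```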
00:50:49.180 | Okay, so in this case we run it
00:50:53.500 | and we've gone down from
00:50:55.500 | 76 milliseconds to less than 2 milliseconds and
00:50:59.860 | Now some people that are new to programming think like oh great. I've saved
00:51:05.220 | 60 something milliseconds, but the point is this is going to get run
00:51:09.740 | Like tens of millions of times
00:51:13.060 | Okay, so the 76 millisecond version is so slow that it's going to be
00:51:19.660 | Impractical for any random forest you use in practice, right, whereas the 1 millisecond version I found is actually quite acceptable
00:51:27.940 | And then check the numbers should be exactly the same as before and they are okay
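Here is a sketch of that order-n version, with the running count / sum / sum-of-squares bookkeeping (assuming numpy as np; the real notebook version also respects min_leaf inside the loop, which is left out here to stay close to the description above):

```python
import math

def std_agg(cnt, s1, s2):
    # standard deviation from a running count, sum and sum of squares:
    # sqrt(mean of the squares minus the square of the mean)
    return math.sqrt((s2 / cnt) - (s1 / cnt) ** 2)

def find_better_split(self, var_idx):
    x, y = self.x.values[self.idxs, var_idx], self.y[self.idxs]
    sort_idx = np.argsort(x)
    sort_x, sort_y = x[sort_idx], y[sort_idx]

    # Everything starts on the right hand side; nothing on the left.
    rhs_cnt, rhs_sum, rhs_sum2 = self.n, sort_y.sum(), (sort_y ** 2).sum()
    lhs_cnt, lhs_sum, lhs_sum2 = 0, 0., 0.

    for i in range(self.n - 1):                    # the last row can't be a split point
        xi, yi = sort_x[i], sort_y[i]
        lhs_cnt += 1;        rhs_cnt -= 1          # move one row from right to left
        lhs_sum += yi;       rhs_sum -= yi
        lhs_sum2 += yi ** 2; rhs_sum2 -= yi ** 2
        if xi == sort_x[i + 1]: continue           # don't split between identical values

        lhs_std = std_agg(lhs_cnt, lhs_sum, lhs_sum2)
        rhs_std = std_agg(rhs_cnt, rhs_sum, rhs_sum2)
        curr_score = lhs_std * lhs_cnt + rhs_std * rhs_cnt
        if curr_score < self.score:
            self.var_idx, self.score, self.split = var_idx, curr_score, xi
```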
00:51:34.420 | so now that we have a
00:51:37.420 | function find better split that does what we want I
00:51:40.940 | Want to insert it into my decision tree class and this is a really cool Python trick
00:51:47.540 | Python does everything dynamically right so we can actually say
00:51:51.900 | The method called find better split in
00:51:56.100 | decision tree is
00:51:59.300 | That function I just created, and that basically sticks it inside that class
00:52:07.180 | I'll tell you what's slightly confusing about this is that this thing this word here and
00:52:12.100 | This word here. They actually have no relationship to each other
00:52:15.980 | They just happen to have the same letters in the same order right so like I could call this find better split
00:52:21.700 | underscore foo
00:52:24.300 | Right and then I could like call that
00:52:27.420 | Right and call that
00:52:32.740 | Right so now my function is actually called find better split underscore foo, but my method
00:52:42.580 | I'm expecting to call something called decision tree dot find better split
00:52:47.740 | Right so here. I could say decision tree dot find better split equals find better split underscore foo
00:52:54.420 | Okay, you see that's the same thing so like
00:52:57.940 | It's important to understand how namespaces work like in in every language that you use
00:53:06.060 | One of the most important things is kind of understanding how how it figures out what a name refers to
00:53:12.020 | So this here
00:53:13.820 | Means find better split as defined inside this class right and note nowhere else right well
00:53:20.940 | I mean a parent class, but never mind about that this one here means find better split foo in
00:53:27.340 | The global namespace a lot of languages don't have a global namespace, but Python does okay, and so
00:53:34.900 | The two are like even if they happen to have the same letters in the same order
00:53:40.300 | They're not referring in any way to the same thing
00:53:42.300 | That makes sense. It's like
00:53:44.980 | This family over here may have somebody called Jeremy and my family has somebody called Jeremy and our names happen to be the same
00:53:52.460 | But we're not the same person, okay
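A tiny standalone illustration of that point about names and dynamic patching (the names here are made up purely for demonstration):

```python
class Greeter():
    def greet(self): return 'hello from the method defined inside the class'

def greet_foo(self): return 'hello from the global function'

Greeter.greet = greet_foo     # patch it in: the two names need not match at all
print(Greeter().greet())      # -> 'hello from the global function'
```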
00:53:55.320 | Great so now that we've stuck the decision tree
00:54:00.820 | Sorry the find better split method inside the decision tree with this new definition when I now call the tree ensemble constructor
00:54:09.660 | All right, the tree ensemble constructor called create_tree,
00:54:14.320 | create tree
00:54:16.700 | instantiated decision tree
00:54:18.700 | The decision tree called find_varsplit, which went through every column to see if it could find a better split, and
00:54:25.460 | we've now defined find better split and
00:54:28.860 | Therefore tree ensemble when we create it has gone ahead and done the split
00:54:37.580 | That makes sense that you have any anybody have any questions or uncertainties about that like we're only creating one single split
00:54:44.820 | so far
00:54:47.700 | All right, so this is pretty pretty neat right we kind of just do a little bit at a time testing everything as we go
00:54:55.340 | And so it's as as as you all implement the random forest interpretation techniques
00:55:02.260 | You may want to try programming this way to like every step check that you know
00:55:07.300 | What you're doing matches up with what scikit-learn does or with a test that you've built or whatever?
00:55:12.580 | So at this point we should try to go deeper
00:55:16.700 | Very inception right so let's go now max depth is 2
00:55:21.940 | And so here is what scikit-learn did after breaking at year made 74
00:55:27.700 | It then broke at machine hours meter
00:55:33.540 | So we had this thing called
00:55:36.940 | find_varsplit
00:55:38.940 | Right we just went through every column and try to see if there's a better split there
00:55:44.340 | right, but actually
00:55:46.580 | We need to go a bit further than that
00:55:48.580 | Not only do we have to go through every column and see if there's a better split in this node
00:55:54.500 | But then we also have to see whether there's a better split in the left and the right sides that we just created
00:56:01.700 | Right - or in other words, the left hand side and the right hand side should become decision trees themselves
00:56:07.940 | Right so there's no difference at all between what we do here to create this tree
00:56:13.220 | And what we do here to create this tree other than this one contains
00:56:16.540 | 159 samples this one contains a thousand
00:56:20.220 | So this row of codes exactly the same as we had before
00:56:24.820 | Right, and then we check - actually we could do this a little bit easier, we could say if self dot
00:56:32.220 | is_leaf, right - it would be the same thing,
00:56:35.180 | Okay, but I'll just leave it here for now. So it's self.score - so if the score is
00:56:40.600 | Still infinite... but let's write it properly
00:56:46.740 | So let's go back up and just remind ourselves
00:56:49.020 | Is leaf is
00:56:53.060 | self.score equals infinity, okay?
00:56:55.660 | So since there we might as well use it so if it's a leaf node
00:57:00.820 | Then we have nothing further to do right so that means we're right at the bottom. There's no split
00:57:06.940 | That's been made okay, so we don't have to do anything further on the other hand if it's not a leaf node
00:57:12.180 | So it's somewhere back earlier on
00:57:14.180 | Then we need to split it into the left hand side and the right hand side now earlier on we created a left hand side
00:57:21.580 | And a right hand side array of Booleans
00:57:23.620 | right now
00:57:26.100 | Better would be to have here would be have an array of indexes
00:57:29.360 | And that's because we don't want to have a full array of all the Booleans in every single
00:57:34.080 | Node right because remember although it doesn't look like there are many nodes when you see a tree of this size
00:57:39.600 | when it's fully expanded
00:57:42.160 | the bottom level if there's a minimum leaf size of one contains the same number of nodes as
00:57:48.000 | The entire data set and so if every one of those contained a full Boolean array of size of the whole data set
00:57:55.080 | You've got squared memory requirements, which would be bad
00:57:58.520 | Right; on the other hand, if we just store the indexes of the things in this node, that's going to get smaller and smaller.
00:58:07.480 | So np.nonzero is exactly the same as just this thing which gets the Boolean array,
00:58:13.560 | but it turns it into the indexes of the Trues.
00:58:17.200 | Okay, so this is now a list of indexes for the left hand side and
00:58:21.040 | Indexes the right hand side
00:58:23.560 | Okay, so now that we have the indexes the left hand side and the right hand side
00:58:28.760 | We can now just go ahead and create a decision tree. Okay, so there's a decision tree for the left
00:58:33.280 | And there's our decision tree for the right
00:58:36.240 | Okay, and we don't have to do anything else. We've already written these
00:58:38.920 | we already have a
00:58:41.880 | constructor that can create a decision tree.
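As a concrete reference, here is a minimal sketch of what that splitting step might look like, assuming the attributes built up in this lesson (c for the number of columns, split_col for the chosen column's values within this node, idxs for the row indexes, min_leaf, and the is_leaf property); the names and details are illustrative rather than the exact lesson code:

```python
import numpy as np

def find_varsplit(self):
    # Try every column to see if it offers a better split than the current score.
    for i in range(self.c):
        self.find_better_split(i)
    if self.is_leaf:
        return  # score is still infinity: no split was found, so stop here
    x = self.split_col  # values of the chosen split column within this node
    # Store indexes rather than Booleans, so memory shrinks as nodes get smaller.
    lhs = np.nonzero(x <= self.split)[0]
    rhs = np.nonzero(x > self.split)[0]
    # The left and right sides become decision trees themselves (the recursion).
    self.lhs = DecisionTree(self.x, self.y, self.idxs[lhs], self.min_leaf)
    self.rhs = DecisionTree(self.x, self.y, self.idxs[rhs], self.min_leaf)
```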
00:58:44.920 | So when you really think about what this is doing, it kind of hurts your head, right? Because
00:58:52.720 | the whole reason that find_varsplit got called
00:58:57.540 | is because
00:58:59.540 | find_varsplit is called by the DecisionTree constructor,
00:59:06.780 | but then find_varsplit itself then calls the DecisionTree constructor. So we actually have
00:59:15.060 | circular recursion, and
00:59:17.980 | I'm not nearly smart enough to be able to think through recursion
00:59:23.420 | So I just choose not to right like I just write what I mean and
00:59:28.020 | Then I don't think about it anymore
00:59:30.740 | Right, like, what do I want? To find a variable and a split, I've got to go through every column and see if there's something better;
00:59:37.420 | If it managed to do a split figure out left hand side of the right hand side and make them into decision trees
00:59:44.260 | Okay, but now try to think through how these two methods call each other would just drive me crazy
00:59:51.220 | But I don't need to right I know I have a decision tree constructor that works
00:59:55.020 | right, I know I have a find_varsplit that works. So that's it, right? That's how I
01:00:00.560 | do recursive programming is
01:00:03.300 | By pretending I don't I just just ignore it
01:00:07.460 | That's my advice a lot of you are probably smart enough to be able to think through it better than I can so that's fine
01:00:12.780 | If you can all right so now that I've written that again, I can patch it into the decision tree class
01:00:20.140 | As soon as I do
01:00:23.260 | the tree ensemble constructor will now use that, right? Because Python's dynamic, that just happens automatically.
01:00:30.900 | So now I can check
01:00:33.740 | my left hand side
01:00:35.980 | Should have a hundred and fifty nine samples
01:00:39.540 | Right and a value of nine point six six
01:00:43.540 | There it is hundred fifty nine samples nine point six six
01:00:47.620 | right hand side
01:00:49.620 | Eight forty one ten point one five the left hand side of the left hand side
01:00:54.680 | Hundred and fifty samples nine point six two
01:00:58.100 | Hundred and fifty samples nine point six two okay, so you can see it like
01:01:05.020 | Because I'm not nearly clever enough to write machine learning algorithms like not only can I not write them correctly the first time
01:01:11.220 | Often like every single line. I write will be wrong right so I always start from the assumption
01:01:17.460 | That the the line of code. I just typed is almost certainly wrong, and I just have to see why and how
01:01:23.380 | Right and so like I just make sure and so eventually I get to the point where like much to my surprise
01:01:28.660 | It's not broken anymore
01:01:30.580 | You know so here I can feel like okay this it would be surprising if all of these things
01:01:35.340 | accidentally happen to be exactly the same as scikit-learn, so this is looking pretty good.
01:01:42.300 | So now that we have something that can build a whole tree
01:01:47.260 | We want to have something that can calculate predictions
01:01:49.840 | Right and so to remind you we already have something that calculates predictions for a tree ensemble
01:01:56.100 | by calling tree.predict.
01:01:58.580 | But there is nothing called tree.predict, so we're going to have to write that.
01:02:03.820 | To make this more interesting let's start bringing up the number of columns that we use
01:02:14.060 | Let's create our tree ensemble again, and this time. Let's go to a maximum depth of three
01:02:18.820 | Okay, so now our tree is getting more interesting
01:02:23.140 | And let's now define how do we
01:02:29.820 | Create a set of predictions for a tree and so a set of predictions for a tree is simply the prediction for a row
01:02:38.520 | for every row
01:02:42.180 | That's it, all right? That's our predictions: so the predictions for a tree are
01:02:46.180 | every row's predictions in an array. Okay, so again, we're like
01:02:52.400 | skipping thinking; thinking is hard, you know, so let's just keep pushing it back.
01:02:58.900 | This is kind of handy right notice that you can do for
01:03:07.700 | Blah in array with a numpy array regardless of the rank of the array
01:03:13.500 | regardless of the number of axes in
01:03:16.820 | The array and what it does is it will loop through the leading axis
01:03:21.940 | Right, these concepts are going to be very, very important as we get into more and more neural networks, because we're going to be
01:03:28.900 | doing tensor computations all the time. So the leading axis of a vector is the vector itself;
01:03:35.860 | the leading axis of a matrix is the rows; the leading axis of a three-dimensional tensor is
01:03:43.140 | the matrices that represent the slices; and so forth. Right, so in this case, because x is a matrix,
01:03:50.180 | this is going to loop through the rows, and if you write your
01:03:53.220 | kind of tensor code this way, then it'll kind of tend to
01:03:58.620 | generalize nicely to higher dimensions; it doesn't really matter how many dimensions are in x,
01:04:04.620 | This is going to loop through each of the leading axis, right?
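For reference, a minimal sketch of that predict method, plus a tiny illustration of iterating over the leading axis (variable names are illustrative):

```python
import numpy as np

def predict(self, x):
    # Iterating over a 2-D array yields its rows (the leading axis),
    # so this collects one prediction per row into an array.
    return np.array([self.predict_row(xi) for xi in x])

# Leading-axis demo: for a matrix, the loop visits each row in turn.
m = np.arange(6).reshape(3, 2)
for row in m:
    print(row)   # [0 1], then [2 3], then [4 5]
```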
01:04:08.220 | Okay, so we can now call that DecisionTree.predict.
01:04:13.020 | Right, so all I need to do is write predict_row,
01:04:20.780 | Right and I've delayed thinking so much which is great that the actual point where I actually have to do the work
01:04:26.620 | It's now basically trivial
01:04:28.620 | So if we're at a leaf
01:04:31.820 | node, then the prediction is just equal to
01:04:35.140 | whatever that value was,
01:04:37.780 | which we calculated right back in the original tree constructor. It's just the average of the y's, right?
01:04:43.180 | If it's not a leaf node
01:04:45.900 | Then we have to figure out whether to go down the left-hand path or the right-hand path to get the prediction, right? So
01:04:55.580 | if this variable in this row is less than or equal to the amount we decided to split on,
01:05:01.860 | Then we go down the left path
01:05:04.500 | Otherwise we go down the right path
01:05:07.100 | Okay, and then having figured out what path we want which tree we want then we can just call predict row on that
01:05:14.260 | right and again
01:05:16.620 | We've accidentally created something recursive
01:05:19.260 | Again, I don't want to think about
01:05:22.260 | how that
01:05:24.820 | works control-flow-wise or whatever, but I don't need to, because it just does; I just told it
01:05:32.420 | what I wanted, so I'll trust it to work, right?
01:05:34.260 | If it's a leaf return the value otherwise return the prediction for the left-hand side or the right-hand side as appropriate
01:05:40.820 | Notice this here this if has nothing to do with this if
01:05:51.340 | All right, this if is a control flow statement
01:05:54.280 | That tells Python to go down that path or that path to do some calculation
01:05:59.860 | this if is an
01:06:03.140 | operator that
01:06:05.860 | returns a value
01:06:07.860 | So those of you that have done C or C++ will recognize it as being identical to that
01:06:13.540 | It's called the ternary operator, right? If you haven't that's fine. Basically what we're doing is we're going to get a value
01:06:20.780 | where we're going to say it's this value if this thing is true and
01:06:25.800 | that value otherwise
01:06:28.660 | And so you could write it
01:06:32.000 | This way
01:06:34.740 | All right
01:06:35.300 | But that would require writing four lines of code to do one thing and also require you to have code that if you read it
01:06:42.380 | To yourself or to somebody else is not at all naturally the way you would express it, right?
01:06:47.100 | I want to say the tree I got to go down is
01:06:50.420 | The left-hand side if the variables less than the split or the right-hand side, otherwise
01:06:56.000 | Right, so I want to write my code the way I would think about or the way I would say my code
01:07:01.300 | Okay, so this kind of ternary operator can be quite helpful for that
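Here is a rough sketch of predict_row as just described, using the ternary expression to choose the subtree; the attribute names (is_leaf, val, var_idx, split, lhs, rhs) assume the class built up in this lesson:

```python
def predict_row(self, xi):
    # At a leaf, the prediction is just the mean of the y's stored at construction time.
    if self.is_leaf:
        return self.val
    # Ternary expression: pick the left subtree if this row's value is at or
    # below the split point, otherwise the right subtree, then recurse into it.
    t = self.lhs if xi[self.var_idx] <= self.split else self.rhs
    return t.predict_row(xi)
```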
01:07:07.580 | All right
01:07:10.140 | So now that I've got predict_row, I can dump that into my class,
01:07:14.900 | and now I can calculate predictions,
01:07:17.800 | And I can now plot my actuals against my predictions
01:07:24.220 | When you do a scatterplot
01:07:28.100 | You'll often have a lot of dots sitting on top of each other
01:07:31.380 | so a good trick is to use alpha. Alpha means how transparent things are, not just in matplotlib
01:07:37.580 | but, like, in pretty much every graphics package in the world. And so if you set alpha to less than 1,
01:07:42.660 | Then this is saying you would need 20 dots on top of each other for it to be fully blue
01:07:47.340 | And so this is a good way to kind of see
01:07:49.880 | How much things are sitting on top of each other? So it's a good trick good trick for scatter plots
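Something along these lines, assuming preds and y_valid hold the predictions and actuals; with alpha=0.05, roughly 20 overlapping points are needed before a spot looks fully saturated:

```python
import matplotlib.pyplot as plt

# Semi-transparent points reveal where many dots sit on top of each other.
plt.scatter(preds, y_valid, alpha=0.05)
plt.xlabel('prediction')
plt.ylabel('actual')
plt.show()
```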
01:07:54.640 | There's my R squared not bad
01:07:59.300 | And so let's now go ahead and
01:08:03.540 | Do a random forest
01:08:06.820 | With no max amount of splitting
01:08:15.780 | Our tree ensemble also had no max amount of splitting, so we can compare our R squared
01:08:20.060 | to their R squared. And so they're not the same,
01:08:24.300 | But actually ours is a little better, so I don't know what we did differently, but we'll take it
01:08:30.940 | Okay, so we now have something which, for a forest with a single tree in it, is giving as
01:08:39.820 | good accuracy
01:08:41.820 | on a validation set, using an actual real-world dataset, you know, Blue Book for Bulldozers,
01:08:46.940 | compared to scikit-learn.
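A hedged sketch of that comparison; the variable names (x_trn, y_trn, x_valid, y_valid) and the TreeEnsemble signature are assumptions based on this lesson, with one unrestricted, un-bootstrapped tree on each side:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# scikit-learn: a single tree, no depth limit, no bootstrapping
m = RandomForestRegressor(n_estimators=1, bootstrap=False, n_jobs=-1)
m.fit(x_trn, y_trn)
print(r2_score(y_valid, m.predict(x_valid)))

# our implementation: one tree over the full sample
ens = TreeEnsemble(x_trn, y_trn, n_trees=1, sample_sz=len(y_trn))
print(r2_score(y_valid, ens.predict(x_valid.values)))
```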
01:08:49.740 | So let's go ahead and round this out
01:08:53.260 | So what I would want to do now is to create a package that has this code in and I created it by like creating
01:08:59.420 | A method here a method here a method here and patching them together
01:09:02.460 | So what I did now is: I went back through my notebook and collected up all the cells
01:09:07.580 | That implemented methods and pasted them all together right, and I've just pasted them down here
01:09:12.140 | So here's this is my original tree ensemble and here is all the cells from the decision tree
01:09:17.620 | I just dumped them all into one place without any change
01:09:20.620 | So that was it that was the code we wrote together so now I can go ahead and I can create a
01:09:31.980 | Tree ensemble I can calculate my predictions. I can do my scatter plot. I can get my R squared
01:09:39.440 | Right and this is now with
01:09:42.020 | five trees
01:09:45.020 | Right, and here we are: we have a model of Blue Book for Bulldozers with a 71% R squared,
01:09:51.700 | With a random forest we wrote
01:09:54.300 | entirely from scratch
01:09:56.900 | That's pretty cool
01:10:00.420 | Any questions about that? And I know there's quite a lot to get through, so during the week,
01:10:05.900 | feel free to ask on the forum
01:10:08.340 | about any bits of code you come across. Can somebody pass the box over? Oh, there it is.
01:10:14.240 | Can we get back, probably to the top, maybe,
01:10:23.980 | at the decision tree, when we set the score equal to infinity? Right, yes. Do we calculate this score
01:10:31.380 | further? I mean, I lost track of that, and specifically I wonder:
01:10:37.080 | what do we implement?
01:10:40.740 | When we implement
01:10:42.660 | find_varsplit, we check for self.score, whether it's equal to infinity or not;
01:10:49.260 | it seems to me, it seems unclear whether we
01:10:54.100 | fall out of this, I
01:10:56.100 | mean, if we ever implement
01:10:59.140 | the methods, if our initial value is infinity.
01:11:03.980 | So okay, let's talk through the logic. So
01:11:07.660 | so the decision tree
01:11:10.460 | Starts out with a score of infinity. So in other words at this point when we've created the node there is no split
01:11:17.020 | So it's infinitely bad
01:11:20.100 | Okay, that's why the score is infinity and then we try to find a variable and a split
01:11:27.220 | that is better and
01:11:29.740 | to do that we loop through each column and
01:11:33.460 | Say hey column. Do you have a split which is better than the best one we have so far?
01:11:40.460 | And so then we implement that
01:11:47.860 | Let's do the slow way since it's a bit simpler
01:11:50.700 | find_better_split. We do that by looping through each row and
01:11:55.680 | finding out: this is the current score; if we split here,
01:12:00.420 | is it better than the current score? The current score is infinitely bad,
01:12:04.020 | So yes
01:12:04.820 | It is and so now we set the new score equal to what we just calculated and we keep track of which variable we chose
01:12:11.140 | And the split we split on
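To make that logic concrete, here is a rough sketch of the slow version of find_better_split; because self.score starts at infinity, the first valid candidate split always replaces it, and better ones keep replacing that. The attributes (x as a DataFrame, idxs, n for the number of rows in this node) are assumptions based on this lesson:

```python
def find_better_split(self, var_idx):
    x, y = self.x.values[self.idxs, var_idx], self.y[self.idxs]
    for i in range(self.n):
        lhs = x <= x[i]          # everything at or below this row's value
        rhs = x > x[i]           # everything above it
        if rhs.sum() == 0:
            continue             # can't split with an empty right-hand side
        # Score the split: standard deviations weighted by group size (lower is better).
        curr_score = y[lhs].std() * lhs.sum() + y[rhs].std() * rhs.sum()
        # On the first pass self.score is infinity, so any valid split wins.
        if curr_score < self.score:
            self.var_idx, self.score, self.split = var_idx, curr_score, x[i]
```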
01:12:13.140 | Okay, no worries
01:12:17.460 | Okay, great, let's take a five-minute break and I'll see you back here at 22
01:12:25.500 | So when I tried comparing the performance of this
01:12:34.060 | Against scikit-learn
01:12:38.500 | This is quite a lot slower
01:12:41.820 | And the reason why is that, although a lot of the work is being done by numpy,
01:12:49.180 | Which is nicely optimized C code think about like the very bottom level of a tree if we've got a
01:12:57.620 | million data points
01:13:00.500 | And the bottom level of the tree has something like
01:13:03.140 | 500,000
01:13:05.980 | decision points with a million leaves underneath right and so that's like
01:13:11.700 | five hundred thousand split methods being called, each one of which contains multiple calls to numpy which only have like one
01:13:21.300 | item that's actually being calculated on. And so that's very inefficient, and it's the kind of thing that Python is
01:13:28.460 | Particularly not good at performance-wise right like calling lots of functions lots of times
01:13:34.380 | I mean, we can see it's not bad, right? You know, for the kind of
01:13:39.900 | random forest which
01:13:41.900 | 15 years ago would have been considered pretty big, this would be considered pretty good performance, right? But nowadays this is
01:13:48.300 | some hundreds of times, at least, slower than it should be. So
01:13:53.060 | what the scikit-learn folks did to avoid this problem was that they wrote
01:14:01.460 | their implementation in something called Cython, and
01:14:07.020 | Cython is a superset of Python, so any Python you've written,
01:14:13.100 | pretty much,
01:14:16.460 | you can use as Cython.
01:14:19.580 | But then what happens is Cython runs it in a very different way: rather than passing it to the
01:14:28.120 | Python interpreter it instead converts it to C
01:14:34.100 | Compiles that and then runs that C code right which means the first time you run it
01:14:40.260 | It takes a little longer because it has to go through the kind of translation and compilation, but then after that it can be
01:14:47.620 | quite a bit faster
01:14:50.260 | and so I wanted just to quickly show you what that looks like because
01:14:54.100 | You are absolutely going to be in a position where Cython is going to help you with your work, and
01:15:02.180 | Most of the people you're working with will have never used it may not even know it exists
01:15:06.820 | And so this is like a great superpower to have
01:15:08.820 | So to use Cython in a notebook, you say %load_ext Cython, right?
01:15:15.580 | And so here's a Python function.
01:15:21.940 | Here is the same thing as a Cython function: it's exactly the same thing with %%cython at the top.
01:15:34.200 | This actually runs about twice as fast as this, right? Just because it does the compilation.
01:15:41.120 | Here is the same version again where I've used a special Cython
01:15:48.220 | keyword called cdef, which defines the C data type of the return value and of each variable.
01:15:57.160 | right and so
01:16:00.200 | Basically, that's the trick that you can use to start making things
01:16:03.360 | run quickly, right? And at that point now, it knows it's not just some
01:16:08.460 | Python object called t. In fact, I probably should put one here as well;
01:16:14.880 | let's try that. So we've got fib2; we'll call that fib3.
01:16:20.940 | So for fib3,
01:16:28.800 | yeah, so it's exactly the same as before, but we say what the data type of the thing we passed to it is, and then
01:16:34.560 | define the data types of each of the variables, and so then if we call that,
01:16:39.400 | okay, we've now got something that's 10 times faster, right? So
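As a rough sketch of that progression (not the exact lesson code): a plain Python Fibonacci, and then the Cython versions, which in a notebook go in their own cell after running %load_ext Cython. The rough speedups noted in the comments are the ones mentioned above.

```python
# Plain Python version
def fib1(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```

```python
%%cython
# Same body compiled by Cython: roughly twice as fast just from compilation
def fib2(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# cdef-typed version: declaring C types buys roughly another order of magnitude
def fib3(int n):
    cdef int a = 0, b = 1, t, i
    for i in range(n):
        t = a + b
        a = b
        b = t
    return a
```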
01:16:49.160 | Yeah, it doesn't really take that much extra
01:16:52.360 | And it's just it's just Python with a few little bits of markup
01:16:56.600 | so that's like it's it's good to know that that exists because
01:17:00.280 | If there's something custom you're trying to do
01:17:03.440 | It's actually I find it kind of painful having to go out and you know going to see and compile it and link it back
01:17:09.680 | and all that, whereas doing it here is pretty easy. Can you pass that just to your right, please?
01:17:13.560 | So when you're doing the Cython version of it, in the case of an array or an np.array,
01:17:23.520 | is there a specific C type for that? Yeah, so there's a lot of
01:17:28.520 | specific stuff for integrating Cython with numpy,
01:17:34.040 | And there's a whole page about it
01:17:38.720 | Yeah, so we won't worry about going over it
01:17:42.080 | But you can read that and you can basically see the basic idea. There's this cimport, which basically
01:17:47.760 | imports a
01:17:49.800 | certain type of Python library into the kind of C bit of the code, and you can then use it in your Cython.
01:17:57.880 | Yeah, it's
01:18:01.080 | It's pretty straightforward
01:18:03.320 | Good question, thank you. All right, so your mission now
01:18:10.520 | is to implement
01:18:18.520 | confidence based on tree variance, feature importance, partial dependence, and tree interpreter
01:18:24.160 | for that random forest.
01:18:27.080 | Removing redundant features doesn't use a random forest at all,
01:18:31.120 | So you don't have to worry about that
01:18:33.760 | Extrapolation is not an interpretation technique, so you don't have to worry about that, so it's just the other ones
01:18:38.160 | So, confidence based on tree variance: we've already written that code,
01:18:41.960 | so I suspect that the exact same code we have in the notebook should continue to work, so you can try that and make sure you
01:18:48.520 | get it working.
01:18:50.160 | Feature importance is with the variable shuffling technique and once you have that working
01:18:55.480 | Partial dependence will just be a couple of lines of code away because it rather than you know rather than shuffling a column
01:19:02.920 | You're just replacing it with a constant value that it's nearly the same code and
01:19:06.640 | then tree interpreter
01:19:09.720 | That's going to require you writing some code and thinking about it. Well, once you've written tree interpreter,
01:19:14.480 | you're very close, if you want to, to creating the second approach to
01:19:19.040 | Feature importance the one where you add up the importance across all of the rows
01:19:26.300 | Which means you would then be very close to doing interaction importance
01:19:30.480 | So it turns out that there's actually a very good library for interaction importance for XGBoost,
01:19:39.080 | But there doesn't seem to be one for random forest
01:19:41.880 | so you could like start by getting it working on our version and
01:19:45.760 | If you want to do interaction importance, and then you could like get it working on the original
01:19:50.040 | Sklearn version and that would be a cool contribution
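For the shuffling-based feature importance piece, a rough sketch against our own TreeEnsemble might look something like this; the RMSE metric, the x_valid/y_valid names, and ens.predict taking a 2-D array are all assumptions based on this lesson:

```python
import numpy as np

def rmse(preds, y):
    return np.sqrt(((preds - y) ** 2).mean())

def shuffled_importance(ens, x_valid, y_valid):
    # Baseline error with the columns left alone.
    base = rmse(ens.predict(x_valid.values), y_valid)
    imps = {}
    for col in x_valid.columns:
        x_shuf = x_valid.copy()
        # Shuffling one column breaks its relationship with the target;
        # the resulting increase in error is that column's importance.
        x_shuf[col] = np.random.permutation(x_shuf[col].values)
        imps[col] = rmse(ens.predict(x_shuf.values), y_valid) - base
    return imps
```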
01:19:55.280 | Like sometimes writing it against your own implementation is kind of nicer because you can see exactly what's going on
01:20:01.440 | All right, so that's that's your job. You don't have to rewrite the random forest feel free to if you want to you know practice
01:20:10.920 | If you
01:20:13.320 | Get stuck at any point you know ask on the forum, right?
01:20:18.200 | there is a
01:20:21.520 | Whole page here on wiki.fast.ai about how to ask for help
01:20:26.320 | so when you
01:20:29.080 | ask your
01:20:30.760 | Co-workers on slack for help when you ask people in a technical community on github or
01:20:37.080 | discourse to help or whatever
01:20:39.080 | Asking for help the right way will go a long way towards
01:20:43.400 | You know having people want to help you and be able to help you right so
01:20:48.360 | So, like, search for your answer, search for the error you're getting, see if somebody's already asked about it.
01:20:55.120 | You know how have you tried to fix it already? What do you think is going wrong?
01:21:04.240 | What kind of computer are you on? How is it set up? What are the software versions?
01:21:08.160 | Exactly. What did you type and exactly what happened right now? You could
01:21:13.200 | do that by
01:21:16.520 | Taking a screenshot, so you know make sure you've got some screenshot software. That's really easy to use
01:21:22.880 | So if I were to take a screenshot, I just hit a button
01:21:24.880 | select the area, copy to the clipboard, go to my forum,
01:21:31.200 | Paste it in and there we go right that looks a little bit too big, so let's make it a little smaller
01:21:37.360 | Right and so now I've got a screenshot people can see exactly what happened
01:21:41.880 | better still
01:21:44.280 | if there are a few lines of code and error messages to look at, create a gist.
01:21:50.120 | A gist is a handy little
01:21:53.000 | GitHub thing which basically lets you share code. So if I wanted to
01:21:59.520 | create a gist of this I
01:22:01.520 | actually have an extension, that little extension there, so if I click on here,
01:22:09.680 | Give it a name
01:22:14.120 | Say make public
01:22:16.280 | Okay, and that takes my Jupyter notebook shares it publicly. I can then grab that URL
01:22:22.500 | copy link location
01:22:25.200 | Right and paste it into my forum post
01:22:28.960 | right and then when people
01:22:30.960 | click on it
01:22:33.080 | Then they'll immediately see
01:22:36.600 | My notebook when it renders
01:22:40.860 | Okay, so that's a really good way. Now, that particular button is an extension, so in Jupyter
01:22:50.760 | you need to click Nbextensions and click on
01:22:55.000 | Gist-it. While you're there, you should also click on Collapsible Headings;
01:22:59.440 | that's this really handy thing I use that lets me collapse things and open them up.
01:23:03.420 | If you go to your Jupyter and don't see this Nbextensions button,
01:23:08.760 | then just google for Jupyter nbextensions;
01:23:11.480 | it'll show you how to pip install it and get it set up, right? But those two extensions are super-duper handy.
01:23:18.480 | All right, so
01:23:24.000 | Other than that assignment
01:23:26.000 | We're we're done
01:23:28.640 | with random forests, and until the next course, when you look at GBMs, we're done with decision tree ensembles.
01:23:34.820 | And so we're going to move on to neural networks broadly defined and so
01:23:44.560 | Neural networks are going to allow us to
01:23:49.720 | To go beyond just you know the kind of nearest neighbors approach of random forests
01:23:55.360 | You know, all a random forest can do is average data that it's already seen;
01:23:59.500 | it can't extrapolate; it can't calculate, right?
01:24:04.360 | linear regression
01:24:07.240 | Can calculate and can extrapolate but only in very limited ways
01:24:12.200 | Neural nets give us the best of both worlds
01:24:19.200 | We're going to start by applying them to
01:24:21.980 | unstructured data
01:24:24.640 | Right so unstructured data means like pixels or the amplitudes of sound waves or words you know data where?
01:24:32.600 | everything in
01:24:34.720 | all the columns
01:24:36.720 | Are all of the same type?
01:24:38.520 | You know as opposed to like a database table where you've got like a revenue and a cost and a zip code and a state
01:24:45.000 | which would be structured data.
01:24:48.280 | We're going to use it for structured data as well, but we're going to do that a little bit later
01:24:52.140 | So unstructured data is a little easier
01:24:55.120 | And it's also the area which more people have been applying deep learning to for longer
01:25:08.200 | If you're doing the deep learning course as well
01:25:11.120 | You know you'll see that we're going to be approaching kind of the same conclusion from two different directions
01:25:17.920 | So the deep learning course is starting out with
01:25:21.320 | Big complicated convolutional neural networks being solved with you know
01:25:27.600 | Sophisticated optimization schemes and we're going to kind of gradually drill down into like exactly how they work
01:25:33.200 | whereas with the machine learning course, we're going to be starting out more with, like,
01:25:39.280 | How does stochastic gradient descent actually work?
01:25:43.440 | What can we do with, like, one single layer, which would allow us to create things like logistic regression? When we add
01:25:50.200 | regularization to that, how does that give us things like
01:25:54.000 | ridge regression, elastic net, lasso? And then, as we add additional layers to that, how does that let us
01:26:01.040 | handle more complex problems? And so we're only going to be looking at
01:26:06.840 | fully connected layers in this machine learning course and
01:26:12.920 | Then I think next semester with Yannet you're probably going to be looking at some more
01:26:17.800 | Sophisticated approaches and so yes on this machine learning
01:26:22.120 | course, we're going to be looking much more at what's actually happening with the matrices and how they're actually calculated; and the deep learning course
01:26:27.800 | is much more like: what are the best practices for actually
01:26:30.840 | solving, you know, at a world-class level, real-world deep learning problems? Right, so
01:26:38.160 | Next week we're going to
01:26:43.440 | be looking at the classic MNIST problem, which is, like, how do we recognize digits. Now,
01:26:50.840 | If you're interested you can like skip ahead and like try and do this with a random forest and you'll find it's not bad
01:26:58.240 | Right given that a random forest is basically a type of nearest neighbors, right?
01:27:03.020 | It's finding, like, what are the nearest neighbors in tree space.
01:27:07.360 | Then a random forest can absolutely recognize that this nine, those pixels,
01:27:11.800 | you know, are similar to pixels we've seen in these other ones, and on average they were nines as well.
01:27:18.000 | All right, and so like we can absolutely solve
01:27:21.320 | These kinds of problems to an extent
01:27:24.040 | using random forests
01:27:26.800 | But we end up being rather data limited because every time we put in another decision point you know
01:27:33.280 | we're roughly halving our data, and so there's just this limitation on the amount of calculation that we can do.
01:27:40.120 | Whereas with neural nets,
01:27:43.000 | We're going to be able to use lots and lots and lots of parameters
01:27:47.800 | using these tricks we'll learn about with regularization,
01:27:51.760 | And so we're going to be able to do lots of computation, and there's going to be very little limitation on really
01:27:57.920 | What we can actually end up calculating as a result great good luck with your random forest interpretation, and I will see you next time