Intro to Machine Learning: Lesson 7

00:00:00.000 | Welcome back. We're going to be talking today about

00:00:03.200 | Random forests, we're going to finish building our own random forest from scratch

00:00:08.240 | But before we do I wanted to

00:00:12.400 | Tackle a few things that have come up during the week a few questions that I've had

00:00:16.840 | And I want to start with kind of the position of random forests in general, so we spent

00:00:24.240 | About half of this course doing random forests, and then after today the second half of this course will be

00:00:30.040 | neural networks broadly defined

00:00:33.200 | This is because these these two

00:00:40.800 | Represent like the tea the two key classes of techniques which cover

00:00:46.260 | Nearly everything that you're likely to need to do

00:00:49.200 | Random forests belong to the class of techniques of decision tree ensembles

00:00:53.880 | Along with gradient boosting machines being the other key type and some variants like extremely randomized trees

00:01:00.680 | They

00:01:03.600 | Have the benefit that they're highly interpretable

00:01:06.560 | scalable

00:01:08.960 | Flexible work well for most kinds of data

00:01:11.400 | They have the downside that they don't extrapolate at all to like

00:01:16.720 | Data that's outside the range that you've seen as we looked at at the end of last week's session

00:01:23.840 | but

00:01:26.160 | You know they're they're they're a great starting point and so

00:01:29.160 | I think you know there's a huge catalog of

00:01:34.680 | Machine learning tools out there and so and like a lot of courses and books

00:01:39.280 | Don't attempt to kind of curate that down and say like for these kinds of problems use this for these kinds of problems use that

00:01:46.800 | Finished you know, but they're rather like

00:01:49.600 | here's a description of a hundred different algorithms and

00:01:53.960 | You just don't need them

00:01:55.960 | You know like I don't see why you would ever use in support vector machine today for instance

00:02:01.240 | Like I know no reason at all

00:02:04.120 | I could think of doing that people loved studying them in the 90s because they are like very theoretically elegant and like you can really

00:02:11.360 | Write a lot of math about support vector machines and people did but you know in practice. I don't see them as having any place

00:02:22.600 | So there's like a lot of techniques that you could include in an exhaustive list of every way that people have looked at machine learning problems

00:02:28.880 | But I would rather tell you like how to actually solve machine learning problems in practice

00:02:34.480 | I think they you know we've we've about to finish today the first class which is you know one type of decision tree ensembles

00:02:40.720 | In in part two your net will tell you about the other key type

00:02:44.320 | They're being gradient boosting and we're about to launch next lesson into

00:02:49.600 | neural nets which includes all kinds of GLM

00:02:53.360 | Ridge regression elastic net Lasso

00:02:56.480 | Logistic aggression etc are all variants of neural nets

00:03:01.600 | You know interestingly

00:03:06.480 | Leo Bremen who created random forests did so very late in his life and

00:03:13.960 | Unfortunately passed away not many years later

00:03:19.240 | So partly because of that very little has been written about them in the academic literature

00:03:24.760 | Partly because SVMs were just taken over at that point. You know other people didn't look at them

00:03:30.480 | And also like just because they're like quite hard to grasp at a theoretical level like analyze them theoretically

00:03:40.560 | It's quite of hard to write conference papers about them or academic papers about them

00:03:44.520 | So there hasn't been that much written about them

00:03:47.320 | But there's been a real resurgence or not resurgence a new wave in recent years of empirical machine learning like what actually works

00:03:55.320 | Kaggle's been part of that, but also just part of it. It's just been like

00:04:01.020 | companies using machine learning to make shit loads of money like Amazon and Google and

00:04:06.240 | So nowadays a lot of people are writing about

00:04:10.520 | decision tree ensembles and creating better software for decision tree ensembles like like GBM and XG boost and

00:04:17.080 | Ranger for our and

00:04:20.080 | Psych hit learn and so forth

00:04:22.160 | But a lot of this is being done in industry rather than academia

00:04:26.320 | But you know it's it's encouraging to see

00:04:30.800 | There's certainly

00:04:34.080 | More work being done in deep learning than in decision tree ensembles

00:04:39.400 | Particularly in in academia, but but there's a lot of progress being made in both

00:04:45.040 | You know if you look at like of the packages being used today for decision tree ensembles

00:04:51.400 | Like all the best ones the top five or six, and I don't know that any of them really existed

00:04:56.800 | Five years ago, you know maybe other than like SK learn or even three years ago, so it's that's that's been good

00:05:04.280 | But I think there's a lot of work

00:05:08.040 | Still to be done now. We talked about for example

00:05:10.400 | Figuring out what interactions are the most important last week and

00:05:15.880 | Some of you pointed out in the forums that actually there is such a project already for gradient boosting machines

00:05:21.400 | Which is great, but it doesn't seem that there's anything like that yet for random forests

00:05:25.640 | And you know random forests do have a nice benefit over GBMs that they're kind of

00:05:30.740 | Harder to screw up you know and easier to scale

00:05:37.000 | So hopefully that's something that you know this community might help fix

00:05:40.520 | Another question I had during the week was about the size of your validation set

00:05:47.240 | How big should it be

00:05:51.840 | So like to answer this question about how big does your validation set need to be you first need to answer the question

00:06:00.000 | How

00:06:03.800 | How accurate do I need help how precisely do I need to know the accuracy of this algorithm right so like

00:06:11.400 | If the validation set that you have is saying like this is 70% accurate and

00:06:18.520 | If somebody said well is it 75% or 65% or 70% and the answer was I don't know anything in that range is close enough

00:06:27.800 | Like that would be one answer where else if it's like is it 70% or 70.01% or 69.99%

00:06:35.120 | Like then that's something else again, right? So you need to kind of start out by saying like how?

00:06:41.160 | How accurate do I need this?

00:06:43.760 | So like for example in the deep learning course we've been looking at dogs versus cats

00:06:49.080 | images and

00:06:51.680 | The models that we're looking at had about a ninety nine point

00:06:56.480 | Four ninety nine point five percent accuracy on the validation set okay, and our validation set size

00:07:03.200 | was

00:07:06.680 | 2000 okay, in fact let's do this in Excel that'll be a bit easier

00:07:11.560 | So our validation set

00:07:16.720 | Size was

00:07:19.600 | 2000 and

00:07:21.600 | accuracy was

00:07:23.560 | ninety nine point four percent

00:07:26.840 | Right so the number of incorrect is

00:07:30.040 | Something around

00:07:34.040 | One minus accuracy times n so we were getting about 12 wrong

00:07:40.840 | right and

00:07:44.040 | The number of cats we had is

00:07:47.320 | Half and

00:07:51.240 | So the number of wrong cats is about

00:07:56.040 | six

00:07:58.040 | Okay, so then like we we run a new model and we find instead that the accuracy

00:08:06.520 | has gone to

00:08:09.200 | ninety nine point two percent

00:08:11.200 | Right and then it's like okay. Is this less good at finding cats. That's like well. It got two more cats wrong

00:08:18.600 | So it's like probably not

00:08:21.840 | right so

00:08:24.360 | But then it's like well does this matter there's ninety nine point four versus ninety nine point two

00:08:29.640 | Matter and if this was like it wasn't about cats and dogs, but it was about finding fraud

00:08:35.320 | Right then the difference between a point six percent error rate and a point eight percent error rate

00:08:41.120 | It's like twenty five percent of your cost of fraud

00:08:43.560 | So like that can be huge

00:08:46.240 | Like it was really interesting like when image net came out earlier this year the new competition results came out and

00:08:53.440 | The accuracy had gone down from three percent

00:08:56.320 | So the error went down from three percent to two percent

00:08:59.840 | And I saw a lot of people on the internet like famous machine learning researchers being like man

00:09:04.920 | Some Chinese guys got it better from like ninety seven percent to one ninety eight percent

00:09:10.080 | It's like statistically not even significant who cares kind of a thing, but actually I thought like

00:09:15.840 | Holy crap this Chinese team just blew away the state-of-the-art and image recognition like the old one was

00:09:22.880 | Fifty percent less accurate than the new one like that's that's actually the right way to think about it

00:09:27.960 | Isn't it because it's like you know we were trying to recognize you know like

00:09:32.020 | Which tomatoes were ripe and which ones weren't and like our new approach. You know the old approach like

00:09:40.120 | 50% of the time more

00:09:42.840 | Was like letting in the unripe tomatoes or you know 50% more of the time we were like

00:09:49.920 | Accepting fraudulent customers like that's a really big difference

00:09:53.000 | So just because like this particular validation set we can't really see six versus eight doesn't mean the point two percent different

00:10:01.280 | Isn't important it could be so my kind of rule of thumb is that this like this number of like how many?

00:10:08.920 | Observations you actually looking at I want that generally to be somewhere higher than 22

00:10:15.520 | Why 22 because 22 is the magic number where the t distribution roughly turns into the normal distribution?

00:10:22.120 | All right, so as you may have learned the t distribution is is the normal distribution for small

00:10:28.720 | Datasets right and so in other words once we have 22 of something or more it kind of starts to behave

00:10:36.420 | kind of

00:10:39.040 | Normally in both sense of the words like it's kind of more stable, and you can kind of understand it better, so

00:10:45.240 | That's my magic number when somebody says do I have enough of something I kind of start out by saying like do you have?

00:10:51.000 | 22 observations of the thing of interest

00:10:53.280 | so if you were looking at like

00:10:56.200 | Line cancer you know and you had a data set that had like a thousand people without lung cancer and 20 people with lung cancer

00:11:05.240 | I'd be like I very much doubt we're going to make much progress. You know because we haven't even got 20 of the thing you want

00:11:11.960 | So ditto with a validation set if you don't have 20 of the thing you want that is very unlikely to be useful

00:11:17.020 | Or if like the at the level of accuracy we need it's not plus or minus 20

00:11:21.680 | It's just it's that that's the point where I'm thinking like

00:11:24.360 | Be a bit careful

00:11:26.560 | so just to be clear you want 22 to be the number of

00:11:31.560 | Samples in each set like in the validation the test and the train or so what I'm saying is like if there's if there's less

00:11:40.560 | than 22 of a class in any of the sets then it's

00:11:46.800 | It's going to get it's getting pretty unstable at that point, right?

00:11:50.640 | And so like that's just like the first rule of thumb

00:11:54.320 | but then what I would actually do is like start practicing what we learned about the

00:12:00.620 | binomial distribution or actually then we distribution so

00:12:05.840 | What's the?

00:12:10.280 | What is the mean of the binomial distribution of n samples and probability P?

00:12:17.680 | And times P. Okay. Thank you n times P is our mean

00:12:23.660 | Right, so if you've got a 50% chance of getting ahead and you toss it a hundred times on average you get 50 heads

00:12:31.640 | Okay, and then what's the standard deviation?

00:12:37.560 | and P1 minus P

00:12:39.560 | Okay, so these are like

00:12:45.240 | Two numbers well the first number you don't really have to remember. It's intuitively obvious the second one is one that try to remember

00:12:52.280 | forevermore

00:12:54.160 | Because not only does it come up all the time the people that you work with will all have forgotten it

00:12:59.000 | So you'll be like the one person in the conversation who could immediately go we don't have to run this a hundred times

00:13:03.840 | I can tell you straight away. It's binomial. It's going to be n PQ and P1 minus P

00:13:09.080 | Then there's the standard error

00:13:13.120 | The standard error is if you run a bunch of trials each time getting a mean

00:13:20.560 | What is the standard deviation of the mean? I?

00:13:24.280 | Don't think you guys have covered this yet. Is that right?

00:13:28.920 | No, so this is really important because this means like if you train a hundred models

00:13:35.620 | right each time the validation set accuracy is like the meaning of a distribution and

00:13:41.660 | So therefore the standard deviation of that validation set accuracy it can be calculated with the standard error

00:13:48.920 | And this is equal to the standard deviation

00:13:52.000 | divided by

00:13:54.360 | square root n

00:13:57.640 | Right so this tells you

00:14:00.040 | So like one approach to figuring out like is my validation set big enough is train your model five times

00:14:06.280 | with exactly the same hyper parameters each time and look at the validation set accuracy each time and you know there's like a

00:14:14.600 | Mean and a standard deviation of five numbers you could use or a maximum and minimum you can choose

00:14:19.600 | But save yourself some time you can figure out straight away that like

00:14:24.600 | okay, well

00:14:27.440 | I I have a point nine nine

00:14:30.280 | Accuracy as to you know whether I get the cat correct or not correct

00:14:36.000 | So therefore the standard deviation is equal to 0.99 times 0.01

00:14:42.240 | Okay, and then I can get the

00:14:45.640 | Standard error of that

00:14:49.720 | Right so so basically the size of the validation set you need

00:14:54.880 | It's like however big it has to be

00:14:57.800 | Such that your insights about its accuracy are good enough for your particular

00:15:03.840 | Business problem and so like I say like the simple way to do it is to pick a validation set of like a size a thousand

00:15:11.160 | train five models and

00:15:13.480 | See how much the validation set accuracy varies and if it's like if they're if it's they're all close enough for what you need

00:15:20.160 | Then you're fine. If it's not maybe you should make it bigger

00:15:24.480 | Or maybe you should consider

00:15:26.480 | using cross validation

00:15:29.280 | instead

00:15:31.040 | Okay

00:15:32.680 | So like as you can see it really depends on what it is you're trying to do

00:15:37.320 | How common your less common class is and how accurate your model is could you pass that back to Melissa, please?

00:15:44.640 | Thank you, I have a question about the less common classes if you have less than 22

00:15:53.120 | Let's say you have one sample of something

00:15:55.320 | Let's say it's a face and I only have one representation from that particular country

00:16:01.780 | Do I toss that into the training set and it adds variety do I pull it out completely out of the data set or?

00:16:09.560 | Do I put it in a test set instead of the validation set?

00:16:16.000 | So you certainly couldn't put it in the test or the validation set because you're asking

00:16:20.240 | Can I mean in general because you're asking can I recognize something? I've never seen before

00:16:24.280 | But actually this this question of like can I recognize something I've not seen before there's actually a whole class of models specifically

00:16:31.940 | For that purpose it's called other one-shot learning

00:16:35.080 | Which is you get to see something once and then you have to recognize it again or zero shot learning

00:16:39.700 | Which is where you have to recognize something you've never seen before we're not going to cover them in this course

00:16:47.040 | But that can be useful for things like face recognition

00:16:50.840 | You know like is this the same person I've seen before and so generally speaking

00:16:55.720 | Obviously for something like that to work. It's not that you've never seen our face before

00:17:00.800 | It's that you've never seen Melissa's face before you know and so you see Melissa's face once and you have to recognize it again

00:17:07.400 | Yeah, so in general you know your validation set and test set need to have the same

00:17:16.000 | mix or frequency

00:17:18.160 | Observations that you're going to see in production in the real world

00:17:22.740 | and then your training set

00:17:25.600 | Should have an equal number in each class and if you don't

00:17:31.960 | Just replicate the less common one

00:17:34.960 | Until it is equal so this is I think we've mentioned this paper before very recent paper that came out

00:17:41.280 | They tried lots of different approaches to training with unbalanced data sets and found consistently that

00:17:47.000 | Oversampling the less common class until it is the same size as the more common class is always the right thing to do

00:17:54.480 | So you could literally copy you know so like I've only got a thousand you know ten examples of people with cancer and a hundred

00:18:03.280 | Without so I could just copy those ten

00:18:05.280 | And other you know 90 times

00:18:09.080 | That's kind of a little memory inefficient so a lot of things including

00:18:14.120 | I think SK learns random forests have a class weights parameter that says each time you're bootstrapping or resampling

00:18:20.560 | I want you to sample the less common class with a higher probability

00:18:24.880 | Or ditto if you're doing deep learning you know make sure in your mini batch

00:18:29.320 | It's not randomly sampled, but it's a stratified sample so the less common class is picked more often

00:18:35.640 | Okay

00:18:38.640 | Okay, so let's get back to finishing off

00:18:41.800 | Our random forests and so what we're going to do today is we're going to finish off writing our random forest

00:18:48.280 | and then after day your your after today your homework will be to take this class and

00:18:53.120 | To add to it all of the random forest interpretation algorithms that we've learned, okay, so

00:19:00.640 | Obviously to be able to do that you're going to need to totally understand how this class works

00:19:05.880 | So please you know ask lots of questions as necessary as we go along

00:19:09.800 | So just to remind you

00:19:13.520 | We we're doing the the bulldozers Kaggle competition data set again

00:19:20.680 | We split it as before into 12,000 validation the last 12,000

00:19:25.240 | Records and then just to make it easier for us to keep track of what we're doing

00:19:31.360 | We're going to just pick two columns out to start with year made and machine hours on the meter

00:19:36.120 | okay, and so what we did last time was we started out by creating a tree ensemble and

00:19:42.440 | the tree ensemble had a bunch of trees which was literally a list of

00:19:48.120 | Entries trees where each time we just called create tree

00:19:53.800 | Okay, and create tree contained a

00:19:59.560 | sample size number of random indexes

00:20:03.000 | Okay, this one was drawn without replacement

00:20:07.080 | So remember bootstrapping means sampling with replacement

00:20:12.600 | So normally with scikit learn if you've got n rows

00:20:16.240 | We grab n rows with replacement, which means many of them will appear more than once

00:20:23.760 | So each time we get a different sample, but it's always the same size as the original data set and then we have our

00:20:31.120 | set RF samples

00:20:33.480 | Function that we can use which does with replacement sampling of less than n rows

00:20:40.440 | This is doing something again, which is it's sampling without replacement

00:20:45.360 | Sample size rows okay because we're permuting

00:20:49.440 | The numbers from naught to self dot y minus 1 and then grabbing the first self dot sample size of them

00:20:55.640 | Actually, there's a faster way to do this. You can just use NP dot random dot choice, which is a slightly more direct way

00:21:01.960 | But this way works as well

00:21:04.400 | All right. So this is our random sample for this one of our

00:21:09.200 | entries trees

00:21:12.240 | And so then we're going to create a decision tree and our decision tree

00:21:17.000 | We don't pass it all of x we pass it these specific indexes and remember x is a panda's data frame

00:21:24.220 | So if we want to index into it with a bunch of integers, we have to use I log integer locations

00:21:31.280 | And that makes it behave indexing wise just like numpy

00:21:35.540 | Why vector is numpy so we can just index into it directly and then we're going to keep track of our minimum precise

00:21:45.260 | and

00:21:48.260 | So then the only other thing we really need an ensemble is some way to make a prediction

00:21:51.820 | And so we were just going to do the mean of the tree prediction

00:21:56.200 | For each tree

00:21:59.300 | All right, so that was that and so then in order to be able to run that

00:22:03.520 | We need a decision tree class because it's being called here

00:22:08.240 | And so there we go

00:22:12.020 | Okay, so that's the starting point. So

00:22:15.740 | The next thing we need to do is to flesh out our decision tree

00:22:22.140 | So the important thing to remember is all of our randomness

00:22:25.960 | Happened back here in the tree ensemble

00:22:28.740 | The decision tree class we're going to create

00:22:31.900 | Doesn't have randomness in it

00:22:36.580 | Okay, so right now we are building a randomly regressor, right so that's why we're taking the mean of the tree

00:22:43.540 | The output if we were to work with classification

00:22:47.800 | Do we take the max like the classifier will give you either zeros or ones?

00:22:52.820 | No, I would still take the mean so the so each tree is going to tell you what percentage of that leaf node

00:23:01.380 | Contains cats and what percentage take contains dogs

00:23:05.820 | so then I would average all those percentages and say across the trees on average there is 19% cats and

00:23:12.740 | 81% dogs

00:23:16.260 | Good question, so you know

00:23:18.540 | Random tree classifiers are almost identical or can be almost identical to random tree regressors

00:23:27.240 | The technique we're going to use to build this today will basically exactly work for a classification

00:23:34.340 | It's certainly for binary classification. You can do with exactly the same code for multi-class classification. You just need to change your data structure

00:23:41.900 | so that like you have like a one hot encoded matrix or a

00:23:47.140 | List of integers that you treat as a one hot encoded matrix

00:23:51.620 | Okay, so our decision tree

00:23:57.160 | So remember our idea here is that we're going to like try to avoid thinking

00:24:03.260 | So we're going to basically write it as if everything we need already exists, okay, so we know

00:24:10.100 | From when we created the decision tree, we're going to pass in the X the Y and the minimum leaf size

00:24:16.480 | So here we need to make sure we've got the X and the Y and the minimum leaf size. Okay, so then there's one other thing

00:24:23.980 | which is as we

00:24:25.980 | split our tree into sub trees, we're going to need to keep track of

00:24:31.980 | Which of the row indexes went into the left-hand side of the tree which went into the right-hand side of the tree

00:24:37.740 | Okay, so we're going to have this thing called

00:24:39.860 | indexes as

00:24:42.020 | Well, right so at first we just didn't bother passing in indexes at all

00:24:46.660 | So if indexes is not passed in if it's none, then we're just going to set it to

00:24:50.980 | Everything the entire length of y right so NP dot a range is the same as just range in Python

00:24:59.040 | But it returns a numpy array right so that the root of a decision tree

00:25:04.160 | Contains all the rows. That's the definition really of the root of a decision tree

00:25:09.960 | So all the rows is row 0 row 1 row 2 etc up to row y minus 1

00:25:15.460 | Okay, and then we're just going to store away all that information that we were given

00:25:21.340 | We're going to keep track of how many?

00:25:23.980 | Rows are there and how many columns are there? Okay?

00:25:28.620 | so then the

00:25:30.740 | every

00:25:31.860 | leaf and every node in a tree has a value it has a prediction that prediction is just equal to the average of

00:25:40.140 | the dependent variable

00:25:42.980 | Okay, so

00:25:45.260 | every node in the tree

00:25:47.260 | Why indexed with the indexes is?

00:25:50.780 | the values of the dependent variable that are in this branch of the tree and so here is the mean

00:25:58.340 | okay

00:26:00.340 | some

00:26:01.780 | Nodes in a tree also have a score which is like how effective was the split?

00:26:07.920 | Here right, but that's only going to be true if it's not a leaf node right a leaf node has no further splits

00:26:15.500 | And at this point when we create a tree we haven't done any splits yet, so its score starts out as being infinity

00:26:24.060 | so having built that the root of the tree our next job is to find out which variable should we split on and

00:26:32.060 | What level of that variable should we split on so let's pretend that there's something that does that

00:26:38.060 | Find bar split

00:26:40.700 | So then we're done, okay, so how do we find a

00:26:49.420 | variable to split on so well we could just go through each

00:26:54.460 | Potential variable so C contains the number of columns

00:26:59.380 | We have so go through each one and see if we can find a better split than we have so far on that column

00:27:06.380 | Okay

00:27:09.500 | Now notice this is like

00:27:11.620 | not

00:27:13.700 | The full random forest definition this is assuming that max features is set to all right remember

00:27:20.500 | We could set max features to like zero point five in which case we wouldn't check all the numbers should not to see

00:27:25.700 | We would check half the numbers at random from not to see so if you want to turn this into like a

00:27:31.420 | random forest that has the max features

00:27:34.220 | Support you could easily like add one line of code to do that, but we're not going to do it in our implementation today

00:27:42.060 | so

00:27:44.060 | Then we just need to find better split and since we're not interested in thinking at the moment for now

00:27:50.140 | We're just going to leave that empty all right

00:27:52.140 | so

00:27:55.500 | The one other thing I like to do

00:27:57.700 | With my kind of where I start writing a class is I like to have some way to print out

00:28:02.860 | What's in that class right and so if you type print followed by an object or if at Jupiter notebook?

00:28:09.500 | You just type the name of the object

00:28:12.460 | At the moment. It's just printing out

00:28:14.900 | Underscore underscore main underscore underscore dot decision tree at blah blah blah which is not very helpful

00:28:20.540 | Right so if we want to replace this with something helpful

00:28:23.660 | We have to define the special

00:28:26.300 | Python method name dunder Repra

00:28:30.140 | To get a representation of this object so when we type when we sickly just

00:28:37.500 | Write the name like this behind the scenes that calls that function and the default

00:28:42.020 | Implementation of that method is just to print out this

00:28:46.100 | Unhelpful stuff so we can replace it by instead saying let's create a format string

00:28:52.060 | Where we're going to print out n and then show n and then print Val and then show Val okay, so how many?

00:28:59.300 | How many rows are in this node and what's the average of the dependent variable okay?

00:29:07.180 | then

00:29:08.580 | If it's not a leaf node, so if it has a split then we should also be able to print out the score

00:29:14.060 | The value we split out and the variable that we split on

00:29:18.780 | Now you'll notice here

00:29:22.380 | Self dot is leaf is leaf is defined as a method, but I don't have any parentheses after it

00:29:28.240 | This is a special kind of method called a property and so a property is something that kind of looks like

00:29:35.940 | a regular variable

00:29:37.940 | But it's actually calculated on the fly so when I call its leaf it actually calls this function

00:29:45.620 | Right, but I've got this special decorator

00:29:48.060 | Property okay, and what this says is basically you don't have to include the parentheses when you call it

00:29:54.800 | Okay, and so it's going to say all right is this a leaf or not so a leaf is something that we don't split on

00:30:03.940 | If we haven't split on it then its score is still set to infinity, but that's my logic

00:30:09.060 | That makes sense

00:30:12.220 | So this this at notation is called a decorator

00:30:18.340 | It's basically a way of telling Python more information about your method

00:30:23.280 | So anybody here remember where you have seen decorators before?

00:30:30.460 | You pass it over here, yeah, where have you seen where have you seen decorators before tell us more about flask and

00:30:36.980 | Yeah, what does that do?

00:30:41.860 | So flask so anybody who's done any web programming before with something like flask or a similar framework

00:30:53.940 | Would have had to have said like this method is going to respond to this bit of the URL

00:31:00.200 | And either the post or to get and he put it in a special decorator

00:31:05.000 | so

00:31:07.640 | Behind the scenes that's telling

00:31:09.640 | Python to treat this method in a special way, so here's another decorator, okay?

00:31:15.000 | And so you know if you get more advanced with Python you can actually learn how to write your own decorators

00:31:20.880 | Which as was mentioned you know basically insert some additional code

00:31:24.320 | but for now just know there's a bunch of predefined decorators we can use to

00:31:30.320 | Change how our methods behave and one of them is at property which basically means you don't have to put parentheses anymore

00:31:36.600 | Which of course means you can't add any more parameters beyond self

00:31:40.160 | Yeah

00:31:43.040 | Why if it's not a leaf why is this for infinity because doesn't infinity mean you're at the root

00:31:50.600 | Why no infinity means that you're not at the root. It means that you're at a leaf so the root will have a split

00:31:58.520 | Assuming we find one right everything will have a split till we get all the way to the bottom

00:32:02.560 | The leaf and so the leaves will have a score of infinity because they won't split

00:32:07.880 | Great all right

00:32:11.640 | So that's our

00:32:14.520 | Decision tree it doesn't do very much, but at least we can like create an ensemble right ten trees sample size a thousand

00:32:22.200 | Right and we can like print out so now when I go m trees zero

00:32:26.200 | It doesn't say blah blah blah blah blah. It says

00:32:29.320 | What we asked it to say n called the thousand

00:32:32.520 | Val Colin 10.8. Oh wait, okay, and this is the leaf because we haven't spit on it yet

00:32:40.200 | So we've got nothing more to say

00:32:42.320 | Okay, so then the indexes are all the numbers from 0 to 1,000

00:32:48.780 | Okay, because the base of the tree has everything this is like everything in

00:32:55.080 | The random sample that was passed to it because remember by the time we get to the point where it's a decision tree

00:32:59.840 | Where we don't have to worry about any of the randomness in the random forest anymore, okay?

00:33:05.280 | All right, so

00:33:09.200 | Let's try to write

00:33:12.080 | The thing which finds a split okay, so we need to implement

00:33:18.520 | Find better split okay, and so it's going to take the index of a variable variable number one variable number three

00:33:26.260 | Whatever and it's going to figure out

00:33:28.260 | What's the best split point?

00:33:31.440 | Is that better than any split we have so far and?

00:33:34.040 | For the first variable the answer will always be yes because the best one so far is none at all infinity bad, okay?

00:33:40.840 | So let's start by making sure we've got something to compare to so the thing we're going to compare to

00:33:47.760 | will be

00:33:48.960 | scikit learns random forest and

00:33:50.960 | So we need to make sure that scikit learns random forest gets exactly the same data that we have so we start out by creating ensemble

00:33:58.140 | Grab a tree out of it and then find out which particular random sample of x and y did this tree use?

00:34:06.180 | Okay, and we're going to store them away so that we can pass them to scikit learn so we have exactly the same information

00:34:14.440 | So let's go ahead and now create a random forest using scikit learn so one tree

00:34:19.040 | one decision

00:34:21.600 | No, bootstrapping so the whole the whole data set that so this should

00:34:26.240 | Be exactly the same as the thing that we're going to create this tree

00:34:30.640 | So let's try

00:34:34.680 | So we need to define find better split

00:34:38.960 | Okay, so find better split takes a variable

00:34:44.060 | Okay, so let's define our x independent variables and say okay

00:34:49.700 | Well, it's everything inside our tree, but only

00:34:53.260 | Those indexes that are in this node right which at the top of the tree is everything right and just this one variable

00:35:02.760 | Okay, and then for our wise it's just whatever our dependent variable is at the indexes in this node

00:35:10.140 | Okay, so there's our x and y

00:35:12.660 | so

00:35:13.940 | Let's now go through every single value in our independent variable and

00:35:20.580 | So I'll show you what's going to happen. So let's say our independent variable is year made

00:35:26.420 | And not going to be an order

00:35:41.460 | Right and so we're going to go to the very first row and we're going to say okay year made here is three

00:35:48.020 | Right and so what I'm going to do is I'm going to try and calculate

00:35:51.780 | The score if we decided to branch on the number three

00:35:56.980 | Right so I need to know which rows are greater than three

00:36:02.180 | Which rows are less than and equal to three and they're going to become my left-hand side my right-hand side

00:36:07.340 | Right and then we need a score

00:36:09.700 | right, so

00:36:11.420 | There's lots of scores we could use so in random forests. We call this the information gain, right?

00:36:17.380 | The information gain is like how much better does our score get because we split it into these two groups of data

00:36:22.260 | There's lots of ways we could calculate it Gini cross entropy root mean squared error, whatever

00:36:28.420 | If you think about it, there is an alternative formulation of root mean squared error

00:36:34.540 | Which is mathematically the same to within a constant scale, but it's a little bit easier to deal with which is

00:36:41.060 | We're going to try and find a split

00:36:43.380 | Which the causes the two groups to each have as lower standard deviation as possible

00:36:49.820 | All right, so like I want to find a split that puts all the cats over here and all the dogs over here

00:36:54.980 | Right, so if these are all cats and these are all dogs

00:36:58.060 | Then this has a standard deviation of zero and this has a standard deviation of zero

00:37:02.660 | Or else this is like a totally random mix of cats and dogs. This is a totally random mix of cats and dogs

00:37:07.500 | They're going to have a much higher standard deviation

00:37:09.820 | That makes sense. And so it turns out if you find a split that minimizes those group standard deviations or specifically the

00:37:18.020 | Weighted average of the two standard deviations. It's mathematically the same as minimizing the root mean squared error

00:37:24.260 | That's something you can prove to yourself after class if you want to

00:37:27.820 | All right, so we're going to need to find

00:37:31.940 | First of all split this into two groups. So where's all the stuff that is greater than three?

00:37:36.700 | So greater than three is this one this one and this one. So we need the standard deviation of that

00:37:41.700 | so let's go ahead and say standard deviation of

00:37:45.820 | Greater than three that one

00:37:49.380 | That one and that one. Okay, and then the next will be the standard deviation of

00:37:59.940 | Less than or equal to three. So that would be that one

00:38:02.780 | That one that one and then we just take the weighted average of those two and that's our score

00:38:10.860 | That would be our score if we split on three

00:38:15.140 | That makes sense. And so then the next step would be try to split on four

00:38:19.740 | Try splitting on one try splitting on six

00:38:22.740 | Redundantly try splitting on four again

00:38:27.900 | Redundantly try splitting on one again and find out which one works best

00:38:30.580 | So that's our code here is we're going to go through every row. And so let's say okay left-hand side is

00:38:37.700 | Any values in X that are less than or equal to this particular value

00:38:44.420 | Our right hand side is every value in X that are greater than this particular value

00:38:50.860 | Okay, so what's the data type that's going to be in LHS and RHS? What are they actually going to contain?

00:39:05.180 | They're going to be arrays arrays of what?

00:39:11.540 | Raise of a raise of billions. Yeah, which we can treat a zero and one. Okay, so LHS will be an array of

00:39:20.740 | False every time it's not less than or equal to and true

00:39:23.980 | Otherwise and RHS will be a boolean array of the opposite. Okay, and now we can't take a standard deviation of an empty set

00:39:32.180 | Right. So if there's nothing that's greater than

00:39:35.140 | This number then these will all be false which means the sum will be zero

00:39:41.700 | Okay, and in that case

00:39:43.740 | Let's not go any further with this step because there's nothing to take the standard deviation of and it's obviously not a useful

00:39:49.460 | Split, okay

00:39:51.100 | so assuming we've got this far we can now calculate the standard deviation of the left hand side and

00:39:55.980 | of the right hand side and

00:39:58.500 | Take the weighted average or the sums the same thing to us to a scalar

00:40:04.220 | Right and so there's our score and so we can then check is this better than our best score so far and our best score

00:40:11.260 | So far we initially initialized it to infinity, right? So initially this is this is better

00:40:17.780 | So if it's better, let's store away

00:40:20.560 | All of the information we need which variable has found this better split what was the score we found and

00:40:29.260 | What was the value that we split on?

00:40:32.860 | Okay, so there it is

00:40:36.140 | So if we run that

00:40:40.100 | And I'm using time it so what time it does is it sees how long this command takes to run and it tries to give you

00:40:47.580 | A kind of statistically valid measure of that so you can see here. It's run run at ten times

00:40:52.900 | To get an average and then it's done that seven times to get a mean and standard deviation across runs

00:40:59.660 | And so it's taking me 75 milliseconds plus or minus ten

00:41:02.900 | Okay

00:41:05.420 | So let's check that this works

00:41:07.420 | Find better split tree zero. So zero is year made one is machine hours current meter

00:41:17.140 | So with one

00:41:19.140 | we got back machine hours current meter 37 4 4 with this score and then we ran it again with

00:41:25.680 | 0 that's year made and we've got a better score 658 and split 1974 and so in 1974

00:41:33.600 | Let's compare

00:41:36.020 | Yeah, that was what this tree did as well. Okay, so we've got we've confirmed that this

00:41:40.800 | Method is doing is giving the same result that SK learns random forest did

00:41:47.140 | And you can also see here the value

00:41:49.540 | Ten point oh eight and again matching here the value ten point oh eight

00:41:55.060 | Okay, so we've got something that can find one split. Could you pass that to your net please?

00:42:01.180 | So Jeremy, why don't we put a unique on the X there?

00:42:07.340 | Because I'm not trying to optimize the performance yet, but you see that no like he's doing more

00:42:15.540 | Yeah, so it's like and you can see in the Excel. I like checked this one twice. I check this for twice unnecessarily

00:42:21.300 | Yeah

00:42:24.140 | Okay, so and so again, it's already thinking about performance, which is good

00:42:31.180 | So tell me what is the computational complexity of

00:42:36.900 | this

00:42:39.860 | section of the code and I like have a think about it, but also like feel free to talk us through it if you want to

00:42:46.180 | kind of

00:42:47.380 | Think and talk at the same time

00:42:49.380 | What's the computational complexity of this piece of code?

00:42:53.140 | Can I pass it over there yes

00:43:00.480 | Alright Jade take us through your thought process

00:43:04.740 | I think you have to take each different values through the column to calculate it

00:43:10.940 | Once to see the splits so and then compare

00:43:15.200 | All the cup like all the possible combinations between these different values so that can be expensive

00:43:21.980 | Like cuz you're uh-huh. Can you or does somebody else gonna tell us the actual computational complexity?

00:43:27.580 | So like yeah quite high Jade's thinking

00:43:30.100 | How high?

00:43:33.740 | I

00:43:35.740 | Think it's n squared. Okay, so tell me why is it n squared because for the for loop it is in yes

00:43:42.060 | And I think I guess the standard deviation will take in so it's in square. Okay, or

00:43:46.820 | This one maybe is even easier to know like this is like which ones are less than xi

00:43:52.540 | I'm gonna have to check every value that if it's less than xi. Okay, and so

00:43:56.580 | So it's useful to know like

00:43:59.540 | How do I quickly calculate computational complexity?

00:44:02.220 | I can guarantee most of the interviews you do are going to ask you to calculate

00:44:05.560 | Computational complexity on the fly and it's also like when you're coding you want it to be second nature

00:44:10.980 | So the technique is basically is there a loop?

00:44:14.180 | Okay, we're then we're obviously doing this n times

00:44:17.680 | Okay, so there's an n involved. Is there a loop inside the loop if there is then you need to multiply those two together

00:44:23.740 | In this case there's not is there anything inside the loop?

00:44:27.460 | That's not a constant time thing

00:44:30.060 | so you might see a sort in there and you just need to know that sort is n log n like

00:44:34.460 | That should be second nature if you see a matrix multiply you need to know what that is in this case

00:44:40.180 | There are some things that are doing element wise array operations, right?

00:44:44.980 | So keep an eye out for anything where numpy is doing something to every value of an array in this case

00:44:50.100 | It's checking every value of x against a constant. So it's going to have to do that n times

00:44:55.540 | So to flesh this out into a computational complexity

00:44:58.300 | You just take the number of things in the loop and you multiply it by the highest computational complexity inside the loop

00:45:05.820 | n times n is n squared you pass that

00:45:08.300 | In this case couldn't we just

00:45:12.220 | Presort the list and then do like one and log n computation and there's lots of things we could do to speed this up

00:45:17.580 | So at this stage, it's just like what is the computational complexity we have?

00:45:21.340 | But absolutely it's certainly not as good as it can be. Okay, so and that's where we're going to go next

00:45:26.820 | It's like alright n squared is not it's not great. So let's try and make it better

00:45:31.620 | So here's my attempt at making it better

00:45:37.380 | And the idea is this

00:45:41.340 | Okay, who wants to first of all tell me what's the equation for a standard deviation?

00:45:48.740 | Marsha, can you grab the box?

00:45:53.940 | So for the standard deviation, it's the difference between the value and it's mean

00:46:01.180 | It's we take a square root of that. So we take the

00:46:06.280 | The power of two, yeah, then we sum up all of these observations and we take the square root out of all this sum

00:46:15.820 | Yeah, you have to divide divide by n. Yep. Yep. Great. Good. Okay now

00:46:20.580 | In practice, we don't normally use that formulation because it kind of requires us calculating

00:46:27.820 | You know X minus the mean lots of times. Does anybody know the formulation that just requires?

00:46:33.900 | X and X squared anybody happened to know that one. Yes at the back turn up house that back there

00:46:39.560 | Square root of mean of squares minus

00:46:44.340 | a

00:46:46.340 | Square of mean. Yeah, great mean of squares minus the square of the means

00:46:50.580 | All right, so that's a really good one at divided by it. That's a really good one to know because like you can now calculate

00:46:57.540 | Variances or standard deviations of anything. You just have to first of all grab the column as it is

00:47:04.740 | The column squared right and as long as you've got those stored away somewhere you can

00:47:10.940 | Immediately calculate the standard deviation. So the reason this is handy for us is that if we first of all

00:47:17.980 | Sort our data, right?

00:47:21.560 | Let's go ahead and sort our data

00:47:24.940 | Then if you think about it as we kind of start going down one step at a time

00:47:30.020 | Right, then each group is exactly the same as the previous group on the left hand side with one more thing in it

00:47:37.820 | And on the right hand side with one less thing in it

00:47:40.220 | So given that we just have to keep track of some of X and some of X squared

00:47:44.500 | We can just add one more thing to X one more thing to X squared on the left and remove one thing on the right

00:47:51.140 | Okay, so we don't have to go through the whole lot each time and so we can turn this into a order n

00:47:58.300 | Algorithm so that's all I do here is I sort the data right and they're going to keep track of the count of things on

00:48:05.560 | the right

00:48:06.500 | The sum of things on the right and the sum of squares on the right

00:48:10.020 | And initially everything's in the right hand side

00:48:13.540 | Okay, so initially n is the count

00:48:16.540 | Y sum is the sum on the right and Y squared sum is the sum of squares on the right and

00:48:23.680 | Then nothing is initially on the left. So it's zeros. Okay, and then we just have to loop through

00:48:30.020 | each observation right and

00:48:36.140 | Add one to the left hand count subtract one from the right hand count add the value to the left hand count

00:48:41.980 | Subtract it from the right hand count add the value squared to the left hand subtract it from the right hand

00:48:47.860 | Okay

00:48:50.020 | Now we do need to be careful though because if we're saying less than or equal to one say we're not stopping here

00:48:57.620 | We're stopping here like we have to have everything in that group

00:49:01.140 | So the other thing I'm going to do is I'm just going to make sure that the next value is not the same as this

00:49:06.820 | Value if it is I'm going to skip over it, right? So I'm just going to double-check

00:49:10.620 | That this value and the next one aren't the same

00:49:14.060 | Okay, so as long as they're not the same

00:49:18.380 | I can keep going ahead and calculate my standard deviation now

00:49:21.700 | Passing in the count the sum and the sum squared right and there's that formula

00:49:29.100 | Okay, the sum of squared divided by the square of the sum sorry minus the square of the sum

00:49:36.420 | Do that for the right hand side and so now we can calculate the weighted average score

00:49:43.020 | Just like before and all of these lines are now the same. Okay, so we've turned our order and

00:49:48.300 | Squared algorithm into an order n algorithm and in general

00:49:54.620 | stuff like this is going to get you a lot more value than like pushing something onto a spark cluster or

00:50:00.700 | Ordering faster RAM or using normal cause in your CPU or whatever, right?

00:50:05.980 | This is the way you want to be, you know improving your code and specifically

00:50:11.500 | Write your code

00:50:14.940 | Right without thinking too much about performance run it. Is it fast enough for what you need then you're done if not

00:50:23.860 | profile it right so in

00:50:25.860 | Jupiter instead of saying percent time it you say percent p run and

00:50:32.340 | It will tell you exactly

00:50:35.820 | where the time was spent in your algorithm, and then you can go to the bit that's actually taking the time and think about like

00:50:43.540 | Okay, is this?

00:50:45.340 | Is this algorithmically as efficient as it can be?

00:50:49.180 | Okay, so in this case we run it

00:50:53.500 | and we've gone down from

00:50:55.500 | 76 milliseconds to less than 2 milliseconds and

00:50:59.860 | Now some people that are new to programming think like oh great. I've saved

00:51:05.220 | 60 something milliseconds, but the point is this is going to get run

00:51:09.740 | Like tens of millions of times

00:51:13.060 | Okay, so the 76 millisecond version is so slow that it's going to be

00:51:19.660 | Practical for any random forest you use in in practice right whereas the 1 millisecond version. I found is actually quite quite acceptable

00:51:27.940 | And then check the numbers should be exactly the same as before and they are okay

00:51:34.420 | so now that we have a

00:51:37.420 | function find better split that does what we want I

00:51:40.940 | Want to insert it into my decision tree class and this is a really cool Python trick

00:51:47.540 | Python does everything dynamically right so we can actually say

00:51:51.900 | The method called find better split in

00:51:56.100 | decision tree is

00:51:59.300 | That function I just created and that might sticks it inside that class

00:52:05.020 | now

00:52:07.180 | I'll tell you what's slightly confusing about this is that this thing this word here and

00:52:12.100 | This word here. They actually have no relationship to each other

00:52:15.980 | They just happen to have the same letters in the same order right so like I could call this find better split

00:52:21.700 | underscore foo

00:52:24.300 | Right and then I could like call that

00:52:27.420 | Right and call that

00:52:32.740 | Right so now my function is actually called find better split underscore foo, but my method

00:52:42.580 | I'm expecting to call something called decision tree dot find better split

00:52:47.740 | Right so here. I could say decision tree dot find better split equals find better split underscore foo

00:52:54.420 | Okay, you see that's the same thing so like

00:52:57.940 | It's important to understand how namespaces work like in in every language that you use

00:53:06.060 | One of the most important things is kind of understanding how how it figures out what a name refers to

00:53:12.020 | So this here

00:53:13.820 | Means find better split as defined inside this class right and note nowhere else right well

00:53:20.940 | I mean a parent class, but never mind about that this one here means find better split foo in

00:53:27.340 | The global namespace a lot of languages don't have a global namespace, but Python does okay, and so

00:53:34.900 | The two are like even if they happen to have the same letters in the same order

00:53:40.300 | They're not referring in any way to the same thing

00:53:42.300 | That makes sense. It's like

00:53:44.980 | This family over here may have somebody called Jeremy and my family has somebody called Jeremy and our names happen to be the same

00:53:52.460 | But we're not the same person, okay

00:53:55.320 | Great so now that we've stuck the decision tree

00:54:00.820 | Sorry the find better split method inside the decision tree with this new definition when I now call the tree ensemble constructor

00:54:09.660 | All right the decision tree ensemble instructor called create tree

00:54:14.320 | create tree

00:54:16.700 | instantiated decision tree

00:54:18.700 | decision tree called find vast split which went through every column to see if it could find a better split and

00:54:25.460 | we've now defined find better split and

00:54:28.860 | Therefore tree ensemble when we create it has gone ahead and done the split

00:54:37.580 | That makes sense that you have any anybody have any questions or uncertainties about that like we're only creating one single split

00:54:44.820 | so far

00:54:47.700 | All right, so this is pretty pretty neat right we kind of just do a little bit at a time testing everything as we go

00:54:55.340 | And so it's as as as you all implement the random forest interpretation techniques

00:55:02.260 | You may want to try programming this way to like every step check that you know

00:55:07.300 | What you're doing matches up with what scikit-learn does or with a test that you've built or whatever?

00:55:12.580 | So at this point we should try to go deeper

00:55:16.700 | Very inception right so let's go now max depth is 2

00:55:21.940 | And so here is what scikit-learn did after breaking at year made 74

00:55:27.700 | It then broke at machine hours meter

00:55:30.460 | 2956

00:55:33.540 | So we had this thing called

00:55:36.940 | Find vast split

00:55:38.940 | Right we just went through every column and try to see if there's a better split there

00:55:44.340 | right, but actually

00:55:46.580 | We need to go a bit further than that

00:55:48.580 | Not only do we have to go through every column and see if there's a better split in this node

00:55:54.500 | But then we also have to see whether there's a better split in the left and the right sides that we just created

00:56:01.700 | Right or in other words the left right side and the right hand side should become decision trees themselves

00:56:07.940 | Right so there's no difference at all between what we do here to create this tree

00:56:13.220 | And what we do here to create this tree other than this one contains

00:56:16.540 | 159 samples this one contains a thousand

00:56:20.220 | So this row of codes exactly the same as we had before

00:56:24.820 | Right and then we check actually we could do this a little bit easier. We could say if self dot

00:56:32.220 | Is leaf right would be the same thing?

00:56:35.180 | Okay, but I'll just leave it here for now. So it's self dot score. So if the score is

00:56:40.600 | Infinite still, but let's write it properly

00:56:43.900 | Yes

00:56:46.740 | So let's go back up and just remind ourselves

00:56:49.020 | Is leaf is

00:56:53.060 | Self that score equals in okay?

00:56:55.660 | So since there we might as well use it so if it's a leaf node

00:57:00.820 | Then we have nothing further to do right so that means we're right at the bottom. There's no split

00:57:06.940 | That's been made okay, so we don't have to do anything further on the other hand if it's not a leaf node

00:57:12.180 | So it's somewhere back earlier on

00:57:14.180 | Then we need to split it into the left hand side and the right hand side now earlier on we created a left hand side

00:57:21.580 | And a right hand side array of Booleans

00:57:23.620 | right now

00:57:26.100 | Better would be to have here would be have an array of indexes

00:57:29.360 | And that's because we don't want to have a full array of all the Booleans in every single

00:57:34.080 | Node right because remember although it doesn't look like there are many nodes when you see a tree of this size

00:57:39.600 | when it's fully expanded

00:57:42.160 | the bottom level if there's a minimum leaf size of one contains the same number of nodes as

00:57:48.000 | The entire data set and so if every one of those contained a full Boolean array of size of the whole data set

00:57:55.080 | You've got squared memory requirements, which would be bad

00:57:58.520 | Right on the other hand if we just store the indexes with the things in this node, and that's going to get smaller and smaller

00:58:04.720 | Okay

00:58:07.480 | So NP dot non zero is exactly the same as just this thing which gets the Boolean array

00:58:13.560 | But it turns it into the indexes of the trues

00:58:17.200 | Okay, so this is now a list of indexes for the left hand side and

00:58:21.040 | Indexes the right hand side

00:58:23.560 | Okay, so now that we have the indexes the left hand side and the right hand side

00:58:28.760 | We can now just go ahead and create a decision tree. Okay, so there's a decision tree for the left

00:58:33.280 | And there's our decision tree for the right

00:58:36.240 | Okay, and we don't have to do anything else. We've already written these

00:58:38.920 | we already have a

00:58:41.880 | Function of a constructor that can create a decision tree

00:58:44.920 | so like when you really think about what this is doing it kind of hurts your head right because

00:58:52.720 | the reason the whole reason that find vast split got called

00:58:57.540 | is because

00:58:59.540 | Find vast split is called by the decision tree constructor

00:59:06.780 | But then the decision tree that but then find vast split itself then causes the decision tree constructor, so we actually have

00:59:15.060 | circular recursion and

00:59:17.980 | I'm not nearly smart enough to be able to think through recursion

00:59:23.420 | So I just choose not to right like I just write what I mean and

00:59:28.020 | Then I don't think about it anymore

00:59:30.740 | Right like what do I want to find a variable split? I've got to go through every column see if there's something better

00:59:37.420 | If it managed to do a split figure out left hand side of the right hand side and make them into decision trees

00:59:44.260 | Okay, but now try to think through how these two methods call each other would just drive me crazy

00:59:51.220 | But I don't need to right I know I have a decision tree constructor that works

00:59:55.020 | Right I know I have a violent find vast bit that works, so that's it right. That's how I

01:00:00.560 | do recursive programming is

01:00:03.300 | By pretending I don't I just just ignore it

01:00:07.460 | That's my advice a lot of you are probably smart enough to be able to think through it better than I can so that's fine

01:00:12.780 | If you can all right so now that I've written that again, I can patch it into the decision tree class

01:00:18.140 | and

01:00:20.140 | As soon as I do

01:00:23.260 | The tree ensemble constructor will now use that right because Python's dynamic right that's just happens automatically

01:00:30.900 | So now I can check

01:00:33.740 | my left hand side

01:00:35.980 | Should have a hundred and fifty nine samples

01:00:39.540 | Right and a value of nine point six six

01:00:43.540 | There it is hundred fifty nine samples nine point six six

01:00:47.620 | right hand side

01:00:49.620 | Eight forty one ten point one five the left hand side of the left hand side

01:00:54.680 | Hundred and fifty samples nine point six two

01:00:58.100 | Hundred and fifty samples nine point six two okay, so you can see it like

01:01:03.260 | I'm

01:01:05.020 | Because I'm not nearly clever enough to write machine learning algorithms like not only can I not write them correctly the first time

01:01:11.220 | Often like every single line. I write will be wrong right so I always start from the assumption

01:01:17.460 | That the the line of code. I just typed is almost certainly wrong, and I just have to see why and how

01:01:23.380 | Right and so like I just make sure and so eventually I get to the point where like much to my surprise

01:01:28.660 | It's not broken anymore

01:01:30.580 | You know so here I can feel like okay this it would be surprising if all of these things

01:01:35.340 | Accidentally happen to be exactly the same as I kit learn so this is looking pretty good

01:01:39.140 | Okay

01:01:42.300 | So now that we have something that can build a whole tree

01:01:47.260 | We want to have something that can calculate predictions

01:01:49.840 | Right and so to remind you we already have something that calculates predictions for a tree ensemble

01:01:56.100 | by calling tree dot predict

01:01:58.580 | But there is nothing called tree dot predict, so we're going to have to write that

01:02:03.820 | To make this more interesting let's start bringing up the number of columns that we use

01:02:14.060 | Let's create our tree ensemble again, and this time. Let's go to a maximum depth of three

01:02:18.820 | Okay, so now our tree is getting more interesting

01:02:23.140 | And let's now define how do we

01:02:29.820 | Create a set of predictions for a tree and so a set of predictions for a tree is simply the prediction for a row

01:02:38.520 | for every row

01:02:42.180 | That's it all right. That's our predictions so the predictions for a tree

01:02:46.180 | every rows predictions in an array, okay, so again, we're like

01:02:52.400 | Skipping thinking thinking is hard. You know so let's just like keep pushing it back

01:02:58.900 | This is kind of handy right notice that you can do for

01:03:07.700 | Blah in array with a numpy array regardless of the rank of the array

01:03:13.500 | regardless of the number of axes in

01:03:16.820 | The array and what it does is it will loop through the leading axis

01:03:21.940 | Right these these concepts are going to be very very important as we get into more and more neural networks because we're going to be

01:03:28.900 | All doing tensor computations all the time so the leading axis of a vector is the vector itself

01:03:35.860 | The leading axis of a matrix are the rows the leading axis axis of a three-dimensional tensor

01:03:43.140 | The matrices that represent the slices and so forth right so in this case because x is a matrix

01:03:50.180 | This is going to loop through the rows and if you write your

01:03:53.220 | Kind of tensor code this way, then it'll kind of tend to

01:03:58.620 | Generalize nicely to higher dimensions like it doesn't really mention matter how many dimensions are in x

01:04:04.620 | This is going to loop through each of the leading axis, right?

01:04:08.220 | Okay, so we can now call that decision tree dot predict

01:04:13.020 | Right so all I need to do is write predict row

01:04:20.780 | Right and I've delayed thinking so much which is great that the actual point where I actually have to do the work

01:04:26.620 | It's now basically trivial

01:04:28.620 | So if we're at a leaf

01:04:31.820 | No, then the prediction is just equal to

01:04:35.140 | Whatever that value was

01:04:37.780 | Which we calculated right back in the original tree constructor. It's just the average of the wise, right?

01:04:43.180 | If it's not a leaf node

01:04:45.900 | Then we have to figure out whether to go down the left-hand path or the right-hand path to get the prediction, right? So

01:04:51.900 | if

01:04:55.580 | This variable in this row is less than or equal to the thing we decided the amount we decided to split on

01:05:01.860 | Then we go down the left path

01:05:04.500 | Otherwise we go down the right path

01:05:07.100 | Okay, and then having figured out what path we want which tree we want then we can just call predict row on that

01:05:14.260 | right and again

01:05:16.620 | We've accidentally created something recursive

01:05:19.260 | Again, I don't want to think about

01:05:22.260 | how that

01:05:24.820 | Works control flow wise or whatever, but I don't need to because like I just it just does like I just told it

01:05:32.420 | What I wanted so I'll trust it to work, right?

01:05:34.260 | If it's a leaf return the value otherwise return the prediction for the left-hand side or the right-hand side as appropriate

01:05:40.820 | Notice this here this if has nothing to do with this if

01:05:51.340 | All right, this if is a control flow statement

01:05:54.280 | That tells Python to go down that path or that path to do some calculation

01:05:59.860 | this if is an

01:06:03.140 | operator that

01:06:05.860 | returns a value

01:06:07.860 | So those of you that have done C or C++ will recognize it as being identical to that

01:06:13.540 | It's called the ternary operator, right? If you haven't that's fine. Basically what we're doing is we're going to get a value

01:06:20.780 | where we're going to say it's this value if this thing is true and

01:06:25.800 | that value otherwise

01:06:28.660 | And so you could write it

01:06:32.000 | This way

01:06:34.740 | All right

01:06:35.300 | But that would require writing four lines of code to do one thing and also require you to have code that if you read it

01:06:42.380 | To yourself or to somebody else is not at all naturally the way you would express it, right?

01:06:47.100 | I want to say the tree I got to go down is

01:06:50.420 | The left-hand side if the variables less than the split or the right-hand side, otherwise

01:06:56.000 | Right, so I want to write my code the way I would think about or the way I would say my code

01:07:01.300 | Okay, so this kind of ternary operator can be quite helpful for that

01:07:07.580 | All right

01:07:10.140 | So now that I've got a prediction for row I can dump that into my class

01:07:14.900 | And now I can create calculate predictions

01:07:17.800 | And I can now plot my actuals against my predictions

01:07:24.220 | When you do a scatterplot

01:07:28.100 | You'll often have a lot of dots sitting on top of each other

01:07:31.380 | So a good trick is to use alpha alpha means how transparent the things not just a matplotlib

01:07:37.580 | But like in every graphics package in the world pretty much and so if you set alpha to less than 1

01:07:42.660 | Then this is saying you would need 20 dots on top of each other for it to be fully blue

01:07:47.340 | And so this is a good way to kind of see

01:07:49.880 | How much things are sitting on top of each other? So it's a good trick good trick for scatter plots

01:07:54.640 | There's my R squared not bad

01:07:59.300 | And so let's now go ahead and

01:08:03.540 | Do a random forest

01:08:06.820 | With no max amount of splitting

01:08:11.820 | and

01:08:13.820 | Our

01:08:15.780 | Tree ensemble had no max amount of splitting we can compare our R squared

01:08:20.060 | To their R squared and so they're not the same

01:08:24.300 | But actually ours is a little better, so I don't know what we did differently, but we'll take it

01:08:30.940 | Okay, so we have now something which for a forest with a single tree in is giving as

01:08:39.820 | good accuracy

01:08:41.820 | On a validation set using an actual real-world data set you know for Bluetooth is

01:08:46.940 | Compared to psychic learn

01:08:49.740 | So let's go ahead and round this out

01:08:53.260 | So what I would want to do now is to create a package that has this code in and I created it by like creating

01:08:59.420 | A method here a method here a method here and patching them together

01:09:02.460 | So what I did with now is I went back through my notebook and collected up all the cells

01:09:07.580 | That implemented methods and pasted them all together right, and I've just pasted them down here

01:09:12.140 | So here's this is my original tree ensemble and here is all the cells from the decision tree

01:09:17.620 | I just dumped them all into one place without any change

01:09:20.620 | So that was it that was the code we wrote together so now I can go ahead and I can create a

01:09:31.980 | Tree ensemble I can calculate my predictions. I can do my scatter plot. I can get my R squared

01:09:39.440 | Right and this is now with

01:09:42.020 | five trees

01:09:45.020 | Right and here we are we have a model of blue dook for bulldozers with a 71% R squared

01:09:51.700 | With a random forest we wrote

01:09:54.300 | entirely from scratch

01:09:56.900 | That's pretty cool

01:10:00.420 | Any questions about that and I know there's like quite a got to get through so like during the week

01:10:05.900 | Feel free to ask on the forum

01:10:08.340 | About any bits of code you come across can somebody pass the box to measure. Oh, there it is

01:10:14.240 | Can we get back to the probably to the top of maybe

01:10:23.980 | At the decision tree when we set the score equal to infinity, right? Yes, do we calculate this car this score?

01:10:31.380 | Further, I mean like I lost track of that and specifically I wonder

01:10:37.080 | What do we implement?

01:10:40.740 | when we implement

01:10:42.660 | Find var split we check for self score equal to whether it's equal to infinity or not

01:10:49.260 | it seems to me it seems like unclear whether we

01:10:54.100 | fall out of this I

01:10:56.100 | Mean like if we ever implement

01:10:59.140 | The methods if if our initial value is infinity

01:11:03.980 | So okay, let's talk through the logic. So

01:11:07.660 | so the decision tree

01:11:10.460 | Starts out with a score of infinity. So in other words at this point when we've created the node there is no split

01:11:17.020 | So it's infinitely bad

01:11:20.100 | Okay, that's why the score is infinity and then we try to find a variable and a split

01:11:27.220 | that is better and

01:11:29.740 | to do that we loop through each column and

01:11:33.460 | Say hey column. Do you have a split which is better than the best one we have so far?

01:11:40.460 | And so then we implement that

01:11:47.860 | Let's do the slow way since it's a bit simpler

01:11:50.700 | find better split we do that by looping through each row and

01:11:55.680 | Finding out this is the current score if we split here

01:12:00.420 | Is it better than the current score the current score is infinitely bad?

01:12:04.020 | So yes

01:12:04.820 | It is and so now we set the new score equal to what we just calculated and we keep track of which variable we chose

01:12:11.140 | And the split we split on

01:12:13.140 | Okay, no worries

01:12:15.460 | you

01:12:17.460 | Okay, great, let's take a five-minute break and I'll see you back here at 22

01:12:25.500 | So when I tried comparing the performance of this

01:12:34.060 | Against scikit-learn

01:12:38.500 | This is quite a lot slower

01:12:41.820 | And the reason why is that although like a lot of the works being done by numpy

01:12:49.180 | Which is nicely optimized C code think about like the very bottom level of a tree if we've got a

01:12:57.620 | million data points

01:13:00.500 | And the bottom level of the tree has something like

01:13:03.140 | 500,000

01:13:05.980 | decision points with a million leaves underneath right and so that's like

01:13:11.700 | Five hundred thousand split methods being called each one of contain which contains multiple calls to numpy which only have like

01:13:19.300 | one

01:13:21.300 | Item that's actually being calculated on and so it's like that's like very inefficient and it's the kind of thing that python is

01:13:28.460 | Particularly not good at performance-wise right like calling lots of functions lots of times

01:13:34.380 | I mean we can see it's it's not bad right you know for a kind of a

01:13:39.900 | random forest which

01:13:41.900 | 15 years ago would have been considered pretty big this would be considered pretty good performance right, but nowadays this is

01:13:48.300 | Some hundreds of times at least slower than than it should be so

01:13:53.060 | What the scikit-learn folks did to avoid this problem was that they wrote

01:14:01.460 | their implementation in something called scythe on and

01:14:07.020 | scythe on is a super set of python so any python you've written

01:14:13.100 | pretty much

01:14:16.460 | You can use as scythe on

01:14:19.580 | But then what happens is scythe on runs it in a very different way rather than passing it to the kind of the

01:14:28.120 | Python interpreter it instead converts it to C

01:14:34.100 | Compiles that and then runs that C code right which means the first time you run it

01:14:40.260 | It takes a little longer because it has to go through the the kind of translation and compilation, but then after that it can be

01:14:47.620 | quite a bit faster

01:14:50.260 | and so I wanted just to quickly show you what that looks like because

01:14:54.100 | You are absolutely going to be in a position where scythe on is going to help you with your work and

01:15:02.180 | Most of the people you're working with will have never used it may not even know it exists

01:15:06.820 | And so this is like a great superpower to have

01:15:08.820 | So to use scytheon in a notebook you say load, ext, load extension, scytheon, right?

01:15:15.580 | And so here's a Python function

01:15:18.720 | fib1

01:15:21.940 | Here is the same as a scytheon function is exactly the same thing with percent percent scytheon at the top

01:15:32.200 | you

01:15:34.200 | This actually runs about twice as fast as this right just because it does the compilation

01:15:41.120 | Here is the same version again where I've used a special scytheon

01:15:48.220 | extension called cdef which defines the C data type of the return value and of each variable

01:15:57.160 | right and so

01:16:00.200 | Basically, that's the trick that you can use to start making things

01:16:03.360 | Run quickly right and at that point now. It knows it's not just some

01:16:08.460 | Python object called t in fact I probably should put one here as well

01:16:14.880 | Let's try that so we've got fib2 we'll call that fib3

01:16:20.940 | So for fib3

01:16:28.800 | Yeah, so it's exactly the same as before but we say what the data type of the thing we passed to it was is and then

01:16:34.560 | Define the data types of each of the variables and so then if we call that

01:16:39.400 | Okay, we've now got something that's 10 times faster right so

01:16:49.160 | Yeah, it doesn't really take that much extra

01:16:52.360 | And it's just it's just Python with a few little bits of markup

01:16:56.600 | so that's like it's it's good to know that that exists because

01:17:00.280 | If there's something custom you're trying to do

01:17:03.440 | It's actually I find it kind of painful having to go out and you know going to see and compile it and link it back

01:17:09.680 | And all that where it's doing it here is pretty easy. Can you pass that just to your right please musher?

01:17:13.560 | So when you're doing like for the scytheon version of it so in the case of an array or an MP array

01:17:23.520 | This is a specific C type of yeah, so there's like a lot of

01:17:28.520 | Specific stuff for integrating scytheon with numpy

01:17:34.040 | And there's a whole page about it

01:17:38.720 | Yeah, so we won't worry about going over it

01:17:42.080 | But you can read that and you can basically see the basic ideas. There's this C import which basically

01:17:47.760 | imports a

01:17:49.800 | Certain types of Python library into the kind of the C bit of the code and you can then use it in your site them

01:17:57.880 | Yeah, it's

01:18:01.080 | It's pretty straightforward

01:18:03.320 | Good question. Thank you all right so your your mission now

01:18:10.520 | Is to implement

01:18:18.520 | Confidence based on tree variants feature importance partial dependence and tree interpreter

01:18:24.160 | for that random first

01:18:27.080 | Removing redundant features doesn't use a random first at all

01:18:31.120 | So you don't have to worry about that

01:18:33.760 | Extrapolation is not an interpretation technique, so you don't have to worry about that, so it's just the other ones

01:18:38.160 | So confidence based on tree variants. We've already written that code

01:18:41.960 | So I suspect that the exact same code we have in the notebook should continue to work so you can try and make sure it

01:18:48.520 | Get that working

01:18:50.160 | Feature importance is with the variable shuffling technique and once you have that working

01:18:55.480 | Partial dependence will just be a couple of lines of code away because it rather than you know rather than shuffling a column

01:19:02.920 | You're just replacing it with a constant value that it's nearly the same code and

01:19:06.640 | then tree interpreter

01:19:09.720 | It's going to require you writing some code and thinking about that well once you've written tree interpreter

01:19:14.480 | You're very close if you want to to creating the second approach to

01:19:19.040 | Feature importance the one where you add up the importance across all of the rows

01:19:26.300 | Which means you would then be very close to doing interaction importance

01:19:30.480 | So it turns out that there are actually there's actually a very good library for interaction importance for xg boost

01:19:39.080 | But there doesn't seem to be one for random forest

01:19:41.880 | so you could like start by getting it working on our version and

01:19:45.760 | If you want to do interaction importance, and then you could like get it working on the original

01:19:50.040 | Sklearn version and that would be a cool contribution

01:19:55.280 | Like sometimes writing it against your own implementation is kind of nicer because you can see exactly what's going on

01:20:01.440 | All right, so that's that's your job. You don't have to rewrite the random forest feel free to if you want to you know practice

01:20:08.920 | so

01:20:10.920 | If you

01:20:13.320 | Get stuck at any point you know ask on the forum, right?

01:20:18.200 | there is a

01:20:21.520 | Whole page here on wiki.fast.ai about how to ask for help

01:20:26.320 | so when you

01:20:29.080 | ask your

01:20:30.760 | Co-workers on slack for help when you ask people in a technical community on github or

01:20:37.080 | discourse to help or whatever

01:20:39.080 | Asking for help the right way will go a long way towards

01:20:43.400 | You know having people want to help you and be able to help you right so

01:20:48.360 | So like search for your answer like search for the area you're getting see if somebody's already asked about it

01:20:55.120 | You know how have you tried to fix it already? What do you think is going wrong?

01:21:04.240 | What kind of computer are you on? How is it set up? What are the software versions?

01:21:08.160 | Exactly. What did you type and exactly what happened right now? You could

01:21:13.200 | do that by

01:21:16.520 | Taking a screenshot, so you know make sure you've got some screenshot software. That's really easy to use

01:21:22.880 | So if I were to take a screenshot, I just hit a button

01:21:24.880 | Select the area copy the clipboard go to my forum

01:21:31.200 | Paste it in and there we go right that looks a little bit too big, so let's make it a little smaller

01:21:37.360 | Right and so now I've got a screenshot people can see exactly what happened

01:21:41.880 | better still

01:21:44.280 | If there's a few lines of code and error messages to look at and create a gist

01:21:50.120 | Just is a handy little

01:21:53.000 | GitHub thing which basically lets you share code, so if I wanted to

01:21:59.520 | create a gist of this I

01:22:01.520 | Actually have a extension area that little extension, so if I click on here

01:22:09.680 | Give it a name

01:22:14.120 | Say make public

01:22:16.280 | Okay, and that takes my Jupyter notebook shares it publicly. I can then grab that URL

01:22:22.500 | copy link location

01:22:25.200 | Right and paste it into my forum post

01:22:28.960 | right and then when people

01:22:30.960 | click on it

01:22:33.080 | Then they'll immediately see

01:22:36.600 | My notebook when it renders

01:22:40.860 | Okay, so that's a really good way now that particular button is an extension so on Jupyter

01:22:50.760 | You need to click nv extensions and click on

01:22:55.000 | Gist it right while you're there. You should also click on collapsible headings

01:22:59.440 | That's this really handy thing. I use that lets me collapse things and open them up

01:23:03.420 | If you go to your Jupyter and don't see this nv extensions button

01:23:08.760 | Then just google for Jupyter nv extensions

01:23:11.480 | It'll show you how to pip install it and and get it set up right, but those two extensions are super duper handy

01:23:18.480 | All right, so

01:23:24.000 | Other than that assignment

01:23:26.000 | We're we're done

01:23:28.640 | With random forests and until the next course when you look at GBMs. We're done with decision tree ensembles

01:23:34.820 | And so we're going to move on to neural networks broadly defined and so

01:23:44.560 | Neural networks are going to allow us to

01:23:49.720 | To go beyond just you know the kind of nearest neighbors approach of random forests

01:23:55.360 | You know all the random forest can do is to average data that it's already seen

01:23:59.500 | It can't extrapolate. It can't they can't calculate right?

01:24:04.360 | linear regression

01:24:07.240 | Can calculate and can extrapolate but only in very limited ways

01:24:12.200 | Neural nets give us the best of both worlds

01:24:19.200 | We're going to start by applying them to

01:24:21.980 | unstructured data

01:24:24.640 | Right so unstructured data means like pixels or the amplitudes of sound waves or words you know data where?

01:24:32.600 | everything in

01:24:34.720 | all the columns

01:24:36.720 | Are all of the same type?

01:24:38.520 | You know as opposed to like a database table where you've got like a revenue and a cost and a zip code and a state

01:24:45.000 | It should be structured data

01:24:48.280 | We're going to use it for structured data as well, but we're going to do that a little bit later

01:24:52.140 | So unstructured data is a little easier

01:24:55.120 | And it's also the area which more people have been applying deep learning to for longer

01:25:00.720 | The

01:25:08.200 | If you're doing the deep learning course as well

01:25:11.120 | You know you'll see that we're going to be approaching kind of the same conclusion from two different directions

01:25:17.920 | So the deep learning course is starting out with

01:25:21.320 | Big complicated convolutional neural networks being solved with you know

01:25:27.600 | Sophisticated optimization schemes and we're going to kind of gradually drill down into like exactly how they work

01:25:33.200 | Where else with the machine learning course we're going to be starting out more with like

01:25:39.280 | How does stochastic gradient descent actually work?

01:25:43.440 | What do we do? What can we do with like one single layer? Which would allow us to create things like logistic regression when we add?

01:25:50.200 | Regularization to that how does that give us things like?

01:25:54.000 | Ridge regression elastic net less. Oh and then as we add additional layers to that how does that let us?

01:26:01.040 | Handle more complex problems, and so we're not going to we're only going to be looking at

01:26:06.840 | fully connected layers in this machine learning course and

01:26:12.920 | Then I think next semester with your net you're probably going to be looking at some more

01:26:17.800 | Sophisticated approaches and so yes on this machine learning

01:26:22.120 | We're going to be looking much more at like what's actually happening with the matrices and how they actually calculate it and the deep learning

01:26:27.800 | It's much more like what are the best practices for actually?

01:26:30.840 | Solving you know at a world-class level real world deep learning problems, right so

01:26:38.160 | Next week we're going to

01:26:41.440 | be

01:26:43.440 | Looking at like the classic Mness problem, which is like how do we recognize digits now?

01:26:50.840 | If you're interested you can like skip ahead and like try and do this with a random forest and you'll find it's not bad

01:26:58.240 | Right given that a random forest is basically a type of nearest neighbors, right?

01:27:03.020 | It's finding like what are the nearest neighbors in in tree space?

01:27:07.360 | Then a random forest can absolutely recognize that this nine those pixels

01:27:11.800 | You know are similar to pixels. We've seen in these other ones and on average they were nines as well

01:27:18.000 | All right, and so like we can absolutely solve

01:27:21.320 | These kinds of problems to an extent

01:27:24.040 | using random forests

01:27:26.800 | But we end up being rather data limited because every time we put in another decision point you know

01:27:33.280 | We're halving our data roughly and so there's this just this limitation and the amount of calculation that we can do

01:27:40.120 | Where else with neural nets?

01:27:43.000 | We're going to be able to use lots and lots and lots of parameters

01:27:47.800 | Using these tricks we don't learn about with regularization

01:27:51.760 | And so we're going to be able to do lots of computation, and there's going to be very little limitation on really

01:27:57.920 | What we can actually end up calculating as a result great good luck with your random forest interpretation, and I will see you next time

Intro to Machine Learning: Lesson 7

Chapters