Intro to Machine Learning: Lesson 7

Welcome back. We're going to be talking today about Random forests, we're going to finish building our own random forest from scratch But before we do I wanted to Tackle a few things that have come up during the week a few questions that I've had And I want to start with kind of the position of random forests in general, so we spent About half of this course doing random forests, and then after today the second half of this course will be neural networks broadly defined This is because these these two Represent like the tea the two key classes of techniques which cover Nearly everything that you're likely to need to do Random forests belong to the class of techniques of decision tree ensembles Along with gradient boosting machines being the other key type and some variants like extremely randomized trees They Have the benefit that they're highly interpretable scalable Flexible work well for most kinds of data They have the downside that they don't extrapolate at all to like Data that's outside the range that you've seen as we looked at at the end of last week's session but You know they're they're they're a great starting point and so I think you know there's a huge catalog of Machine learning tools out there and so and like a lot of courses and books Don't attempt to kind of curate that down and say like for these kinds of problems use this for these kinds of problems use that Finished you know, but they're rather like here's a description of a hundred different algorithms and You just don't need them You know like I don't see why you would ever use in support vector machine today for instance Like I know no reason at all I could think of doing that people loved studying them in the 90s because they are like very theoretically elegant and like you can really Write a lot of math about support vector machines and people did but you know in practice.

I don't see them as having any place So there's like a lot of techniques that you could include in an exhaustive list of every way that people have looked at machine learning problems But I would rather tell you like how to actually solve machine learning problems in practice I think they you know we've we've about to finish today the first class which is you know one type of decision tree ensembles In in part two your net will tell you about the other key type They're being gradient boosting and we're about to launch next lesson into neural nets which includes all kinds of GLM Ridge regression elastic net Lasso Logistic aggression etc are all variants of neural nets You know interestingly Leo Bremen who created random forests did so very late in his life and Unfortunately passed away not many years later So partly because of that very little has been written about them in the academic literature Partly because SVMs were just taken over at that point.

You know other people didn't look at them And also like just because they're like quite hard to grasp at a theoretical level like analyze them theoretically It's quite of hard to write conference papers about them or academic papers about them So there hasn't been that much written about them But there's been a real resurgence or not resurgence a new wave in recent years of empirical machine learning like what actually works Kaggle's been part of that, but also just part of it.

It's just been like companies using machine learning to make shit loads of money like Amazon and Google and So nowadays a lot of people are writing about decision tree ensembles and creating better software for decision tree ensembles like like GBM and XG boost and Ranger for our and Psych hit learn and so forth But a lot of this is being done in industry rather than academia But you know it's it's encouraging to see There's certainly More work being done in deep learning than in decision tree ensembles Particularly in in academia, but but there's a lot of progress being made in both You know if you look at like of the packages being used today for decision tree ensembles Like all the best ones the top five or six, and I don't know that any of them really existed Five years ago, you know maybe other than like SK learn or even three years ago, so it's that's that's been good But I think there's a lot of work Still to be done now.

We talked about for example Figuring out what interactions are the most important last week and Some of you pointed out in the forums that actually there is such a project already for gradient boosting machines Which is great, but it doesn't seem that there's anything like that yet for random forests And you know random forests do have a nice benefit over GBMs that they're kind of Harder to screw up you know and easier to scale So hopefully that's something that you know this community might help fix Another question I had during the week was about the size of your validation set How big should it be So like to answer this question about how big does your validation set need to be you first need to answer the question How How accurate do I need help how precisely do I need to know the accuracy of this algorithm right so like If the validation set that you have is saying like this is 70% accurate and If somebody said well is it 75% or 65% or 70% and the answer was I don't know anything in that range is close enough Like that would be one answer where else if it's like is it 70% or 70.01% or 69.99% Like then that's something else again, right?

So you need to kind of start out by saying like how? How accurate do I need this? So like for example in the deep learning course we've been looking at dogs versus cats images and The models that we're looking at had about a ninety nine point Four ninety nine point five percent accuracy on the validation set okay, and our validation set size was 2000 okay, in fact let's do this in Excel that'll be a bit easier So our validation set Size was 2000 and accuracy was ninety nine point four percent Right so the number of incorrect is Something around One minus accuracy times n so we were getting about 12 wrong right and The number of cats we had is Half and So the number of wrong cats is about six Okay, so then like we we run a new model and we find instead that the accuracy has gone to ninety nine point two percent Right and then it's like okay.

Is this less good at finding cats. That's like well. It got two more cats wrong So it's like probably not right so But then it's like well does this matter there's ninety nine point four versus ninety nine point two Matter and if this was like it wasn't about cats and dogs, but it was about finding fraud Right then the difference between a point six percent error rate and a point eight percent error rate It's like twenty five percent of your cost of fraud So like that can be huge Like it was really interesting like when image net came out earlier this year the new competition results came out and The accuracy had gone down from three percent So the error went down from three percent to two percent And I saw a lot of people on the internet like famous machine learning researchers being like man Some Chinese guys got it better from like ninety seven percent to one ninety eight percent It's like statistically not even significant who cares kind of a thing, but actually I thought like Holy crap this Chinese team just blew away the state-of-the-art and image recognition like the old one was Fifty percent less accurate than the new one like that's that's actually the right way to think about it Isn't it because it's like you know we were trying to recognize you know like Which tomatoes were ripe and which ones weren't and like our new approach.

You know the old approach like 50% of the time more Was like letting in the unripe tomatoes or you know 50% more of the time we were like Accepting fraudulent customers like that's a really big difference So just because like this particular validation set we can't really see six versus eight doesn't mean the point two percent different Isn't important it could be so my kind of rule of thumb is that this like this number of like how many?

Observations you actually looking at I want that generally to be somewhere higher than 22 Why 22 because 22 is the magic number where the t distribution roughly turns into the normal distribution? All right, so as you may have learned the t distribution is is the normal distribution for small Datasets right and so in other words once we have 22 of something or more it kind of starts to behave kind of Normally in both sense of the words like it's kind of more stable, and you can kind of understand it better, so That's my magic number when somebody says do I have enough of something I kind of start out by saying like do you have?

22 observations of the thing of interest so if you were looking at like Line cancer you know and you had a data set that had like a thousand people without lung cancer and 20 people with lung cancer I'd be like I very much doubt we're going to make much progress.

You know because we haven't even got 20 of the thing you want So ditto with a validation set if you don't have 20 of the thing you want that is very unlikely to be useful Or if like the at the level of accuracy we need it's not plus or minus 20 It's just it's that that's the point where I'm thinking like Be a bit careful so just to be clear you want 22 to be the number of Samples in each set like in the validation the test and the train or so what I'm saying is like if there's if there's less than 22 of a class in any of the sets then it's It's going to get it's getting pretty unstable at that point, right?

And so like that's just like the first rule of thumb but then what I would actually do is like start practicing what we learned about the binomial distribution or actually then we distribution so What's the? What is the mean of the binomial distribution of n samples and probability P?

And times P. Okay. Thank you n times P is our mean Right, so if you've got a 50% chance of getting ahead and you toss it a hundred times on average you get 50 heads Okay, and then what's the standard deviation? and P1 minus P Okay, so these are like Two numbers well the first number you don't really have to remember.

It's intuitively obvious the second one is one that try to remember forevermore Because not only does it come up all the time the people that you work with will all have forgotten it So you'll be like the one person in the conversation who could immediately go we don't have to run this a hundred times I can tell you straight away.

It's binomial. It's going to be n PQ and P1 minus P Then there's the standard error The standard error is if you run a bunch of trials each time getting a mean What is the standard deviation of the mean? I? Don't think you guys have covered this yet. Is that right?

No, so this is really important because this means like if you train a hundred models right each time the validation set accuracy is like the meaning of a distribution and So therefore the standard deviation of that validation set accuracy it can be calculated with the standard error And this is equal to the standard deviation divided by square root n Right so this tells you So like one approach to figuring out like is my validation set big enough is train your model five times with exactly the same hyper parameters each time and look at the validation set accuracy each time and you know there's like a Mean and a standard deviation of five numbers you could use or a maximum and minimum you can choose But save yourself some time you can figure out straight away that like okay, well I I have a point nine nine Accuracy as to you know whether I get the cat correct or not correct So therefore the standard deviation is equal to 0.99 times 0.01 Okay, and then I can get the Standard error of that Right so so basically the size of the validation set you need It's like however big it has to be Such that your insights about its accuracy are good enough for your particular Business problem and so like I say like the simple way to do it is to pick a validation set of like a size a thousand train five models and See how much the validation set accuracy varies and if it's like if they're if it's they're all close enough for what you need Then you're fine.

If it's not maybe you should make it bigger Or maybe you should consider using cross validation instead Okay So like as you can see it really depends on what it is you're trying to do How common your less common class is and how accurate your model is could you pass that back to Melissa, please?

Thank you, I have a question about the less common classes if you have less than 22 Let's say you have one sample of something Let's say it's a face and I only have one representation from that particular country Do I toss that into the training set and it adds variety do I pull it out completely out of the data set or?

Do I put it in a test set instead of the validation set? So you certainly couldn't put it in the test or the validation set because you're asking Can I mean in general because you're asking can I recognize something? I've never seen before But actually this this question of like can I recognize something I've not seen before there's actually a whole class of models specifically For that purpose it's called other one-shot learning Which is you get to see something once and then you have to recognize it again or zero shot learning Which is where you have to recognize something you've never seen before we're not going to cover them in this course But that can be useful for things like face recognition You know like is this the same person I've seen before and so generally speaking Obviously for something like that to work.

It's not that you've never seen our face before It's that you've never seen Melissa's face before you know and so you see Melissa's face once and you have to recognize it again Yeah, so in general you know your validation set and test set need to have the same mix or frequency Observations that you're going to see in production in the real world and then your training set Should have an equal number in each class and if you don't Just replicate the less common one Until it is equal so this is I think we've mentioned this paper before very recent paper that came out They tried lots of different approaches to training with unbalanced data sets and found consistently that Oversampling the less common class until it is the same size as the more common class is always the right thing to do So you could literally copy you know so like I've only got a thousand you know ten examples of people with cancer and a hundred Without so I could just copy those ten And other you know 90 times That's kind of a little memory inefficient so a lot of things including I think SK learns random forests have a class weights parameter that says each time you're bootstrapping or resampling I want you to sample the less common class with a higher probability Or ditto if you're doing deep learning you know make sure in your mini batch It's not randomly sampled, but it's a stratified sample so the less common class is picked more often Okay Okay, so let's get back to finishing off Our random forests and so what we're going to do today is we're going to finish off writing our random forest and then after day your your after today your homework will be to take this class and To add to it all of the random forest interpretation algorithms that we've learned, okay, so Obviously to be able to do that you're going to need to totally understand how this class works So please you know ask lots of questions as necessary as we go along So just to remind you We we're doing the the bulldozers Kaggle competition data set again We split it as before into 12,000 validation the last 12,000 Records and then just to make it easier for us to keep track of what we're doing We're going to just pick two columns out to start with year made and machine hours on the meter okay, and so what we did last time was we started out by creating a tree ensemble and the tree ensemble had a bunch of trees which was literally a list of Entries trees where each time we just called create tree Okay, and create tree contained a sample size number of random indexes Okay, this one was drawn without replacement So remember bootstrapping means sampling with replacement So normally with scikit learn if you've got n rows We grab n rows with replacement, which means many of them will appear more than once So each time we get a different sample, but it's always the same size as the original data set and then we have our set RF samples Function that we can use which does with replacement sampling of less than n rows This is doing something again, which is it's sampling without replacement Sample size rows okay because we're permuting The numbers from naught to self dot y minus 1 and then grabbing the first self dot sample size of them Actually, there's a faster way to do this.

You can just use NP dot random dot choice, which is a slightly more direct way But this way works as well All right. So this is our random sample for this one of our entries trees And so then we're going to create a decision tree and our decision tree We don't pass it all of x we pass it these specific indexes and remember x is a panda's data frame So if we want to index into it with a bunch of integers, we have to use I log integer locations And that makes it behave indexing wise just like numpy Why vector is numpy so we can just index into it directly and then we're going to keep track of our minimum precise and So then the only other thing we really need an ensemble is some way to make a prediction And so we were just going to do the mean of the tree prediction For each tree All right, so that was that and so then in order to be able to run that We need a decision tree class because it's being called here And so there we go Okay, so that's the starting point.

So The next thing we need to do is to flesh out our decision tree So the important thing to remember is all of our randomness Happened back here in the tree ensemble The decision tree class we're going to create Doesn't have randomness in it Okay, so right now we are building a randomly regressor, right so that's why we're taking the mean of the tree The output if we were to work with classification Do we take the max like the classifier will give you either zeros or ones?

No, I would still take the mean so the so each tree is going to tell you what percentage of that leaf node Contains cats and what percentage take contains dogs so then I would average all those percentages and say across the trees on average there is 19% cats and 81% dogs Good question, so you know Random tree classifiers are almost identical or can be almost identical to random tree regressors The technique we're going to use to build this today will basically exactly work for a classification It's certainly for binary classification.

You can do with exactly the same code for multi-class classification. You just need to change your data structure so that like you have like a one hot encoded matrix or a List of integers that you treat as a one hot encoded matrix Okay, so our decision tree So remember our idea here is that we're going to like try to avoid thinking So we're going to basically write it as if everything we need already exists, okay, so we know From when we created the decision tree, we're going to pass in the X the Y and the minimum leaf size So here we need to make sure we've got the X and the Y and the minimum leaf size.

Okay, so then there's one other thing which is as we split our tree into sub trees, we're going to need to keep track of Which of the row indexes went into the left-hand side of the tree which went into the right-hand side of the tree Okay, so we're going to have this thing called indexes as Well, right so at first we just didn't bother passing in indexes at all So if indexes is not passed in if it's none, then we're just going to set it to Everything the entire length of y right so NP dot a range is the same as just range in Python But it returns a numpy array right so that the root of a decision tree Contains all the rows.

That's the definition really of the root of a decision tree So all the rows is row 0 row 1 row 2 etc up to row y minus 1 Okay, and then we're just going to store away all that information that we were given We're going to keep track of how many?

Rows are there and how many columns are there? Okay? so then the every leaf and every node in a tree has a value it has a prediction that prediction is just equal to the average of the dependent variable Okay, so every node in the tree Why indexed with the indexes is?

the values of the dependent variable that are in this branch of the tree and so here is the mean okay some Nodes in a tree also have a score which is like how effective was the split? Here right, but that's only going to be true if it's not a leaf node right a leaf node has no further splits And at this point when we create a tree we haven't done any splits yet, so its score starts out as being infinity so having built that the root of the tree our next job is to find out which variable should we split on and What level of that variable should we split on so let's pretend that there's something that does that Find bar split So then we're done, okay, so how do we find a variable to split on so well we could just go through each Potential variable so C contains the number of columns We have so go through each one and see if we can find a better split than we have so far on that column Okay Now notice this is like not The full random forest definition this is assuming that max features is set to all right remember We could set max features to like zero point five in which case we wouldn't check all the numbers should not to see We would check half the numbers at random from not to see so if you want to turn this into like a random forest that has the max features Support you could easily like add one line of code to do that, but we're not going to do it in our implementation today so Then we just need to find better split and since we're not interested in thinking at the moment for now We're just going to leave that empty all right so The one other thing I like to do With my kind of where I start writing a class is I like to have some way to print out What's in that class right and so if you type print followed by an object or if at Jupiter notebook?

You just type the name of the object At the moment. It's just printing out Underscore underscore main underscore underscore dot decision tree at blah blah blah which is not very helpful Right so if we want to replace this with something helpful We have to define the special Python method name dunder Repra To get a representation of this object so when we type when we sickly just Write the name like this behind the scenes that calls that function and the default Implementation of that method is just to print out this Unhelpful stuff so we can replace it by instead saying let's create a format string Where we're going to print out n and then show n and then print Val and then show Val okay, so how many?

How many rows are in this node and what's the average of the dependent variable okay? then If it's not a leaf node, so if it has a split then we should also be able to print out the score The value we split out and the variable that we split on Now you'll notice here Self dot is leaf is leaf is defined as a method, but I don't have any parentheses after it This is a special kind of method called a property and so a property is something that kind of looks like a regular variable But it's actually calculated on the fly so when I call its leaf it actually calls this function Right, but I've got this special decorator Property okay, and what this says is basically you don't have to include the parentheses when you call it Okay, and so it's going to say all right is this a leaf or not so a leaf is something that we don't split on If we haven't split on it then its score is still set to infinity, but that's my logic That makes sense So this this at notation is called a decorator It's basically a way of telling Python more information about your method So anybody here remember where you have seen decorators before?

You pass it over here, yeah, where have you seen where have you seen decorators before tell us more about flask and Yeah, what does that do? So flask so anybody who's done any web programming before with something like flask or a similar framework Would have had to have said like this method is going to respond to this bit of the URL And either the post or to get and he put it in a special decorator so Behind the scenes that's telling Python to treat this method in a special way, so here's another decorator, okay?

And so you know if you get more advanced with Python you can actually learn how to write your own decorators Which as was mentioned you know basically insert some additional code but for now just know there's a bunch of predefined decorators we can use to Change how our methods behave and one of them is at property which basically means you don't have to put parentheses anymore Which of course means you can't add any more parameters beyond self Yeah Why if it's not a leaf why is this for infinity because doesn't infinity mean you're at the root Why no infinity means that you're not at the root.

It means that you're at a leaf so the root will have a split Assuming we find one right everything will have a split till we get all the way to the bottom The leaf and so the leaves will have a score of infinity because they won't split Great all right So that's our Decision tree it doesn't do very much, but at least we can like create an ensemble right ten trees sample size a thousand Right and we can like print out so now when I go m trees zero It doesn't say blah blah blah blah blah.

It says What we asked it to say n called the thousand Val Colin 10.8. Oh wait, okay, and this is the leaf because we haven't spit on it yet So we've got nothing more to say Okay, so then the indexes are all the numbers from 0 to 1,000 Okay, because the base of the tree has everything this is like everything in The random sample that was passed to it because remember by the time we get to the point where it's a decision tree Where we don't have to worry about any of the randomness in the random forest anymore, okay?

All right, so Let's try to write The thing which finds a split okay, so we need to implement Find better split okay, and so it's going to take the index of a variable variable number one variable number three Whatever and it's going to figure out What's the best split point?

Is that better than any split we have so far and? For the first variable the answer will always be yes because the best one so far is none at all infinity bad, okay? So let's start by making sure we've got something to compare to so the thing we're going to compare to will be scikit learns random forest and So we need to make sure that scikit learns random forest gets exactly the same data that we have so we start out by creating ensemble Grab a tree out of it and then find out which particular random sample of x and y did this tree use?

Okay, and we're going to store them away so that we can pass them to scikit learn so we have exactly the same information So let's go ahead and now create a random forest using scikit learn so one tree one decision No, bootstrapping so the whole the whole data set that so this should Be exactly the same as the thing that we're going to create this tree So let's try So we need to define find better split Okay, so find better split takes a variable Okay, so let's define our x independent variables and say okay Well, it's everything inside our tree, but only Those indexes that are in this node right which at the top of the tree is everything right and just this one variable Okay, and then for our wise it's just whatever our dependent variable is at the indexes in this node Okay, so there's our x and y so Let's now go through every single value in our independent variable and So I'll show you what's going to happen.

So let's say our independent variable is year made And not going to be an order Right and so we're going to go to the very first row and we're going to say okay year made here is three Right and so what I'm going to do is I'm going to try and calculate The score if we decided to branch on the number three Right so I need to know which rows are greater than three Which rows are less than and equal to three and they're going to become my left-hand side my right-hand side Right and then we need a score right, so There's lots of scores we could use so in random forests.

We call this the information gain, right? The information gain is like how much better does our score get because we split it into these two groups of data There's lots of ways we could calculate it Gini cross entropy root mean squared error, whatever If you think about it, there is an alternative formulation of root mean squared error Which is mathematically the same to within a constant scale, but it's a little bit easier to deal with which is We're going to try and find a split Which the causes the two groups to each have as lower standard deviation as possible All right, so like I want to find a split that puts all the cats over here and all the dogs over here Right, so if these are all cats and these are all dogs Then this has a standard deviation of zero and this has a standard deviation of zero Or else this is like a totally random mix of cats and dogs.

This is a totally random mix of cats and dogs They're going to have a much higher standard deviation That makes sense. And so it turns out if you find a split that minimizes those group standard deviations or specifically the Weighted average of the two standard deviations. It's mathematically the same as minimizing the root mean squared error That's something you can prove to yourself after class if you want to All right, so we're going to need to find First of all split this into two groups.

So where's all the stuff that is greater than three? So greater than three is this one this one and this one. So we need the standard deviation of that so let's go ahead and say standard deviation of Greater than three that one That one and that one. Okay, and then the next will be the standard deviation of Less than or equal to three.

So that would be that one That one that one and then we just take the weighted average of those two and that's our score That would be our score if we split on three That makes sense. And so then the next step would be try to split on four Try splitting on one try splitting on six Redundantly try splitting on four again Redundantly try splitting on one again and find out which one works best So that's our code here is we're going to go through every row.

And so let's say okay left-hand side is Any values in X that are less than or equal to this particular value Our right hand side is every value in X that are greater than this particular value Okay, so what's the data type that's going to be in LHS and RHS?

What are they actually going to contain? They're going to be arrays arrays of what? Raise of a raise of billions. Yeah, which we can treat a zero and one. Okay, so LHS will be an array of False every time it's not less than or equal to and true Otherwise and RHS will be a boolean array of the opposite.

Okay, and now we can't take a standard deviation of an empty set Right. So if there's nothing that's greater than This number then these will all be false which means the sum will be zero Okay, and in that case Let's not go any further with this step because there's nothing to take the standard deviation of and it's obviously not a useful Split, okay so assuming we've got this far we can now calculate the standard deviation of the left hand side and of the right hand side and Take the weighted average or the sums the same thing to us to a scalar Right and so there's our score and so we can then check is this better than our best score so far and our best score So far we initially initialized it to infinity, right?

So initially this is this is better So if it's better, let's store away All of the information we need which variable has found this better split what was the score we found and What was the value that we split on? Okay, so there it is So if we run that And I'm using time it so what time it does is it sees how long this command takes to run and it tries to give you A kind of statistically valid measure of that so you can see here.

It's run run at ten times To get an average and then it's done that seven times to get a mean and standard deviation across runs And so it's taking me 75 milliseconds plus or minus ten Okay So let's check that this works Find better split tree zero. So zero is year made one is machine hours current meter So with one we got back machine hours current meter 37 4 4 with this score and then we ran it again with 0 that's year made and we've got a better score 658 and split 1974 and so in 1974 Let's compare Yeah, that was what this tree did as well.

Okay, so we've got we've confirmed that this Method is doing is giving the same result that SK learns random forest did And you can also see here the value Ten point oh eight and again matching here the value ten point oh eight Okay, so we've got something that can find one split.

Could you pass that to your net please? So Jeremy, why don't we put a unique on the X there? Because I'm not trying to optimize the performance yet, but you see that no like he's doing more Yeah, so it's like and you can see in the Excel. I like checked this one twice.

I check this for twice unnecessarily Yeah Okay, so and so again, it's already thinking about performance, which is good So tell me what is the computational complexity of this section of the code and I like have a think about it, but also like feel free to talk us through it if you want to kind of Think and talk at the same time What's the computational complexity of this piece of code?

Can I pass it over there yes Alright Jade take us through your thought process I think you have to take each different values through the column to calculate it Once to see the splits so and then compare All the cup like all the possible combinations between these different values so that can be expensive Like cuz you're uh-huh.

Can you or does somebody else gonna tell us the actual computational complexity? So like yeah quite high Jade's thinking How high? I Think it's n squared. Okay, so tell me why is it n squared because for the for loop it is in yes And I think I guess the standard deviation will take in so it's in square.

Okay, or This one maybe is even easier to know like this is like which ones are less than xi I'm gonna have to check every value that if it's less than xi. Okay, and so So it's useful to know like How do I quickly calculate computational complexity? I can guarantee most of the interviews you do are going to ask you to calculate Computational complexity on the fly and it's also like when you're coding you want it to be second nature So the technique is basically is there a loop?

Okay, we're then we're obviously doing this n times Okay, so there's an n involved. Is there a loop inside the loop if there is then you need to multiply those two together In this case there's not is there anything inside the loop? That's not a constant time thing so you might see a sort in there and you just need to know that sort is n log n like That should be second nature if you see a matrix multiply you need to know what that is in this case There are some things that are doing element wise array operations, right?

So keep an eye out for anything where numpy is doing something to every value of an array in this case It's checking every value of x against a constant. So it's going to have to do that n times So to flesh this out into a computational complexity You just take the number of things in the loop and you multiply it by the highest computational complexity inside the loop n times n is n squared you pass that In this case couldn't we just Presort the list and then do like one and log n computation and there's lots of things we could do to speed this up So at this stage, it's just like what is the computational complexity we have?

But absolutely it's certainly not as good as it can be. Okay, so and that's where we're going to go next It's like alright n squared is not it's not great. So let's try and make it better So here's my attempt at making it better And the idea is this Okay, who wants to first of all tell me what's the equation for a standard deviation?

Marsha, can you grab the box? So for the standard deviation, it's the difference between the value and it's mean It's we take a square root of that. So we take the The power of two, yeah, then we sum up all of these observations and we take the square root out of all this sum Yeah, you have to divide divide by n.

Yep. Yep. Great. Good. Okay now In practice, we don't normally use that formulation because it kind of requires us calculating You know X minus the mean lots of times. Does anybody know the formulation that just requires? X and X squared anybody happened to know that one. Yes at the back turn up house that back there Square root of mean of squares minus a Square of mean.

Yeah, great mean of squares minus the square of the means All right, so that's a really good one at divided by it. That's a really good one to know because like you can now calculate Variances or standard deviations of anything. You just have to first of all grab the column as it is The column squared right and as long as you've got those stored away somewhere you can Immediately calculate the standard deviation.

So the reason this is handy for us is that if we first of all Sort our data, right? Let's go ahead and sort our data Then if you think about it as we kind of start going down one step at a time Right, then each group is exactly the same as the previous group on the left hand side with one more thing in it And on the right hand side with one less thing in it So given that we just have to keep track of some of X and some of X squared We can just add one more thing to X one more thing to X squared on the left and remove one thing on the right Okay, so we don't have to go through the whole lot each time and so we can turn this into a order n Algorithm so that's all I do here is I sort the data right and they're going to keep track of the count of things on the right The sum of things on the right and the sum of squares on the right And initially everything's in the right hand side Okay, so initially n is the count Y sum is the sum on the right and Y squared sum is the sum of squares on the right and Then nothing is initially on the left.

So it's zeros. Okay, and then we just have to loop through each observation right and Add one to the left hand count subtract one from the right hand count add the value to the left hand count Subtract it from the right hand count add the value squared to the left hand subtract it from the right hand Okay Now we do need to be careful though because if we're saying less than or equal to one say we're not stopping here We're stopping here like we have to have everything in that group So the other thing I'm going to do is I'm just going to make sure that the next value is not the same as this Value if it is I'm going to skip over it, right?

So I'm just going to double-check That this value and the next one aren't the same Okay, so as long as they're not the same I can keep going ahead and calculate my standard deviation now Passing in the count the sum and the sum squared right and there's that formula Okay, the sum of squared divided by the square of the sum sorry minus the square of the sum Do that for the right hand side and so now we can calculate the weighted average score Just like before and all of these lines are now the same.

Okay, so we've turned our order and Squared algorithm into an order n algorithm and in general stuff like this is going to get you a lot more value than like pushing something onto a spark cluster or Ordering faster RAM or using normal cause in your CPU or whatever, right?

This is the way you want to be, you know improving your code and specifically Write your code Right without thinking too much about performance run it. Is it fast enough for what you need then you're done if not profile it right so in Jupiter instead of saying percent time it you say percent p run and It will tell you exactly where the time was spent in your algorithm, and then you can go to the bit that's actually taking the time and think about like Okay, is this?

Is this algorithmically as efficient as it can be? Okay, so in this case we run it and we've gone down from 76 milliseconds to less than 2 milliseconds and Now some people that are new to programming think like oh great. I've saved 60 something milliseconds, but the point is this is going to get run Like tens of millions of times Okay, so the 76 millisecond version is so slow that it's going to be Practical for any random forest you use in in practice right whereas the 1 millisecond version.

I found is actually quite quite acceptable And then check the numbers should be exactly the same as before and they are okay so now that we have a function find better split that does what we want I Want to insert it into my decision tree class and this is a really cool Python trick Python does everything dynamically right so we can actually say The method called find better split in decision tree is That function I just created and that might sticks it inside that class now I'll tell you what's slightly confusing about this is that this thing this word here and This word here.

They actually have no relationship to each other They just happen to have the same letters in the same order right so like I could call this find better split underscore foo Right and then I could like call that Right and call that Right so now my function is actually called find better split underscore foo, but my method I'm expecting to call something called decision tree dot find better split Right so here.

I could say decision tree dot find better split equals find better split underscore foo Okay, you see that's the same thing so like It's important to understand how namespaces work like in in every language that you use One of the most important things is kind of understanding how how it figures out what a name refers to So this here Means find better split as defined inside this class right and note nowhere else right well I mean a parent class, but never mind about that this one here means find better split foo in The global namespace a lot of languages don't have a global namespace, but Python does okay, and so The two are like even if they happen to have the same letters in the same order They're not referring in any way to the same thing That makes sense.

It's like This family over here may have somebody called Jeremy and my family has somebody called Jeremy and our names happen to be the same But we're not the same person, okay Great so now that we've stuck the decision tree Sorry the find better split method inside the decision tree with this new definition when I now call the tree ensemble constructor All right the decision tree ensemble instructor called create tree create tree instantiated decision tree decision tree called find vast split which went through every column to see if it could find a better split and we've now defined find better split and Therefore tree ensemble when we create it has gone ahead and done the split That makes sense that you have any anybody have any questions or uncertainties about that like we're only creating one single split so far All right, so this is pretty pretty neat right we kind of just do a little bit at a time testing everything as we go And so it's as as as you all implement the random forest interpretation techniques You may want to try programming this way to like every step check that you know What you're doing matches up with what scikit-learn does or with a test that you've built or whatever?

So at this point we should try to go deeper Very inception right so let's go now max depth is 2 And so here is what scikit-learn did after breaking at year made 74 It then broke at machine hours meter 2956 So we had this thing called Find vast split Right we just went through every column and try to see if there's a better split there right, but actually We need to go a bit further than that Not only do we have to go through every column and see if there's a better split in this node But then we also have to see whether there's a better split in the left and the right sides that we just created Right or in other words the left right side and the right hand side should become decision trees themselves Right so there's no difference at all between what we do here to create this tree And what we do here to create this tree other than this one contains 159 samples this one contains a thousand So this row of codes exactly the same as we had before Right and then we check actually we could do this a little bit easier.

We could say if self dot Is leaf right would be the same thing? Okay, but I'll just leave it here for now. So it's self dot score. So if the score is Infinite still, but let's write it properly Yes So let's go back up and just remind ourselves Is leaf is Self that score equals in okay?

So since there we might as well use it so if it's a leaf node Then we have nothing further to do right so that means we're right at the bottom. There's no split That's been made okay, so we don't have to do anything further on the other hand if it's not a leaf node So it's somewhere back earlier on Then we need to split it into the left hand side and the right hand side now earlier on we created a left hand side And a right hand side array of Booleans right now Better would be to have here would be have an array of indexes And that's because we don't want to have a full array of all the Booleans in every single Node right because remember although it doesn't look like there are many nodes when you see a tree of this size when it's fully expanded the bottom level if there's a minimum leaf size of one contains the same number of nodes as The entire data set and so if every one of those contained a full Boolean array of size of the whole data set You've got squared memory requirements, which would be bad Right on the other hand if we just store the indexes with the things in this node, and that's going to get smaller and smaller Okay So NP dot non zero is exactly the same as just this thing which gets the Boolean array But it turns it into the indexes of the trues Okay, so this is now a list of indexes for the left hand side and Indexes the right hand side Okay, so now that we have the indexes the left hand side and the right hand side We can now just go ahead and create a decision tree.

Okay, so there's a decision tree for the left And there's our decision tree for the right Okay, and we don't have to do anything else. We've already written these we already have a Function of a constructor that can create a decision tree so like when you really think about what this is doing it kind of hurts your head right because the reason the whole reason that find vast split got called is because Find vast split is called by the decision tree constructor But then the decision tree that but then find vast split itself then causes the decision tree constructor, so we actually have circular recursion and I'm not nearly smart enough to be able to think through recursion So I just choose not to right like I just write what I mean and Then I don't think about it anymore Right like what do I want to find a variable split?

I've got to go through every column see if there's something better If it managed to do a split figure out left hand side of the right hand side and make them into decision trees Okay, but now try to think through how these two methods call each other would just drive me crazy But I don't need to right I know I have a decision tree constructor that works Right I know I have a violent find vast bit that works, so that's it right.

That's how I do recursive programming is By pretending I don't I just just ignore it That's my advice a lot of you are probably smart enough to be able to think through it better than I can so that's fine If you can all right so now that I've written that again, I can patch it into the decision tree class and As soon as I do The tree ensemble constructor will now use that right because Python's dynamic right that's just happens automatically So now I can check my left hand side Should have a hundred and fifty nine samples Right and a value of nine point six six There it is hundred fifty nine samples nine point six six right hand side Eight forty one ten point one five the left hand side of the left hand side Hundred and fifty samples nine point six two Hundred and fifty samples nine point six two okay, so you can see it like I'm Because I'm not nearly clever enough to write machine learning algorithms like not only can I not write them correctly the first time Often like every single line.

I write will be wrong right so I always start from the assumption That the the line of code. I just typed is almost certainly wrong, and I just have to see why and how Right and so like I just make sure and so eventually I get to the point where like much to my surprise It's not broken anymore You know so here I can feel like okay this it would be surprising if all of these things Accidentally happen to be exactly the same as I kit learn so this is looking pretty good Okay So now that we have something that can build a whole tree We want to have something that can calculate predictions Right and so to remind you we already have something that calculates predictions for a tree ensemble by calling tree dot predict But there is nothing called tree dot predict, so we're going to have to write that To make this more interesting let's start bringing up the number of columns that we use Let's create our tree ensemble again, and this time.

Let's go to a maximum depth of three Okay, so now our tree is getting more interesting And let's now define how do we Create a set of predictions for a tree and so a set of predictions for a tree is simply the prediction for a row for every row That's it all right.

That's our predictions so the predictions for a tree every rows predictions in an array, okay, so again, we're like Skipping thinking thinking is hard. You know so let's just like keep pushing it back This is kind of handy right notice that you can do for Blah in array with a numpy array regardless of the rank of the array regardless of the number of axes in The array and what it does is it will loop through the leading axis Right these these concepts are going to be very very important as we get into more and more neural networks because we're going to be All doing tensor computations all the time so the leading axis of a vector is the vector itself The leading axis of a matrix are the rows the leading axis axis of a three-dimensional tensor The matrices that represent the slices and so forth right so in this case because x is a matrix This is going to loop through the rows and if you write your Kind of tensor code this way, then it'll kind of tend to Generalize nicely to higher dimensions like it doesn't really mention matter how many dimensions are in x This is going to loop through each of the leading axis, right?

Okay, so we can now call that decision tree dot predict Right so all I need to do is write predict row Right and I've delayed thinking so much which is great that the actual point where I actually have to do the work It's now basically trivial So if we're at a leaf No, then the prediction is just equal to Whatever that value was Which we calculated right back in the original tree constructor.

It's just the average of the wise, right? If it's not a leaf node Then we have to figure out whether to go down the left-hand path or the right-hand path to get the prediction, right? So if This variable in this row is less than or equal to the thing we decided the amount we decided to split on Then we go down the left path Otherwise we go down the right path Okay, and then having figured out what path we want which tree we want then we can just call predict row on that right and again We've accidentally created something recursive Again, I don't want to think about how that Works control flow wise or whatever, but I don't need to because like I just it just does like I just told it What I wanted so I'll trust it to work, right?

If it's a leaf return the value otherwise return the prediction for the left-hand side or the right-hand side as appropriate Notice this here this if has nothing to do with this if All right, this if is a control flow statement That tells Python to go down that path or that path to do some calculation this if is an operator that returns a value So those of you that have done C or C++ will recognize it as being identical to that It's called the ternary operator, right?

If you haven't that's fine. Basically what we're doing is we're going to get a value where we're going to say it's this value if this thing is true and that value otherwise And so you could write it This way All right But that would require writing four lines of code to do one thing and also require you to have code that if you read it To yourself or to somebody else is not at all naturally the way you would express it, right?

I want to say the tree I got to go down is The left-hand side if the variables less than the split or the right-hand side, otherwise Right, so I want to write my code the way I would think about or the way I would say my code Okay, so this kind of ternary operator can be quite helpful for that All right So now that I've got a prediction for row I can dump that into my class And now I can create calculate predictions And I can now plot my actuals against my predictions When you do a scatterplot You'll often have a lot of dots sitting on top of each other So a good trick is to use alpha alpha means how transparent the things not just a matplotlib But like in every graphics package in the world pretty much and so if you set alpha to less than 1 Then this is saying you would need 20 dots on top of each other for it to be fully blue And so this is a good way to kind of see How much things are sitting on top of each other?

So it's a good trick good trick for scatter plots There's my R squared not bad And so let's now go ahead and Do a random forest With no max amount of splitting and Our Tree ensemble had no max amount of splitting we can compare our R squared To their R squared and so they're not the same But actually ours is a little better, so I don't know what we did differently, but we'll take it Okay, so we have now something which for a forest with a single tree in is giving as good accuracy On a validation set using an actual real-world data set you know for Bluetooth is Compared to psychic learn So let's go ahead and round this out So what I would want to do now is to create a package that has this code in and I created it by like creating A method here a method here a method here and patching them together So what I did with now is I went back through my notebook and collected up all the cells That implemented methods and pasted them all together right, and I've just pasted them down here So here's this is my original tree ensemble and here is all the cells from the decision tree I just dumped them all into one place without any change So that was it that was the code we wrote together so now I can go ahead and I can create a Tree ensemble I can calculate my predictions.

I can do my scatter plot. I can get my R squared Right and this is now with five trees Right and here we are we have a model of blue dook for bulldozers with a 71% R squared With a random forest we wrote entirely from scratch That's pretty cool Any questions about that and I know there's like quite a got to get through so like during the week Feel free to ask on the forum About any bits of code you come across can somebody pass the box to measure.

Oh, there it is Can we get back to the probably to the top of maybe At the decision tree when we set the score equal to infinity, right? Yes, do we calculate this car this score? Further, I mean like I lost track of that and specifically I wonder What do we implement?

when we implement Find var split we check for self score equal to whether it's equal to infinity or not it seems to me it seems like unclear whether we fall out of this I Mean like if we ever implement The methods if if our initial value is infinity So okay, let's talk through the logic.

So so the decision tree Starts out with a score of infinity. So in other words at this point when we've created the node there is no split So it's infinitely bad Okay, that's why the score is infinity and then we try to find a variable and a split that is better and to do that we loop through each column and Say hey column.

Do you have a split which is better than the best one we have so far? And so then we implement that Let's do the slow way since it's a bit simpler find better split we do that by looping through each row and Finding out this is the current score if we split here Is it better than the current score the current score is infinitely bad?

So yes It is and so now we set the new score equal to what we just calculated and we keep track of which variable we chose And the split we split on Okay, no worries you Okay, great, let's take a five-minute break and I'll see you back here at 22 So when I tried comparing the performance of this Against scikit-learn This is quite a lot slower And the reason why is that although like a lot of the works being done by numpy Which is nicely optimized C code think about like the very bottom level of a tree if we've got a million data points And the bottom level of the tree has something like 500,000 decision points with a million leaves underneath right and so that's like Five hundred thousand split methods being called each one of contain which contains multiple calls to numpy which only have like one Item that's actually being calculated on and so it's like that's like very inefficient and it's the kind of thing that python is Particularly not good at performance-wise right like calling lots of functions lots of times I mean we can see it's it's not bad right you know for a kind of a random forest which 15 years ago would have been considered pretty big this would be considered pretty good performance right, but nowadays this is Some hundreds of times at least slower than than it should be so What the scikit-learn folks did to avoid this problem was that they wrote their implementation in something called scythe on and scythe on is a super set of python so any python you've written pretty much You can use as scythe on But then what happens is scythe on runs it in a very different way rather than passing it to the kind of the Python interpreter it instead converts it to C Compiles that and then runs that C code right which means the first time you run it It takes a little longer because it has to go through the the kind of translation and compilation, but then after that it can be quite a bit faster and so I wanted just to quickly show you what that looks like because You are absolutely going to be in a position where scythe on is going to help you with your work and Most of the people you're working with will have never used it may not even know it exists And so this is like a great superpower to have So to use scytheon in a notebook you say load, ext, load extension, scytheon, right?

And so here's a Python function fib1 Here is the same as a scytheon function is exactly the same thing with percent percent scytheon at the top you This actually runs about twice as fast as this right just because it does the compilation Here is the same version again where I've used a special scytheon extension called cdef which defines the C data type of the return value and of each variable right and so Basically, that's the trick that you can use to start making things Run quickly right and at that point now.

It knows it's not just some Python object called t in fact I probably should put one here as well Let's try that so we've got fib2 we'll call that fib3 So for fib3 Yeah, so it's exactly the same as before but we say what the data type of the thing we passed to it was is and then Define the data types of each of the variables and so then if we call that Okay, we've now got something that's 10 times faster right so Yeah, it doesn't really take that much extra And it's just it's just Python with a few little bits of markup so that's like it's it's good to know that that exists because If there's something custom you're trying to do It's actually I find it kind of painful having to go out and you know going to see and compile it and link it back And all that where it's doing it here is pretty easy.

Can you pass that just to your right please musher? So when you're doing like for the scytheon version of it so in the case of an array or an MP array This is a specific C type of yeah, so there's like a lot of Specific stuff for integrating scytheon with numpy And there's a whole page about it Yeah, so we won't worry about going over it But you can read that and you can basically see the basic ideas.

There's this C import which basically imports a Certain types of Python library into the kind of the C bit of the code and you can then use it in your site them Yeah, it's It's pretty straightforward Good question. Thank you all right so your your mission now Is to implement Confidence based on tree variants feature importance partial dependence and tree interpreter for that random first Removing redundant features doesn't use a random first at all So you don't have to worry about that Extrapolation is not an interpretation technique, so you don't have to worry about that, so it's just the other ones So confidence based on tree variants.

We've already written that code So I suspect that the exact same code we have in the notebook should continue to work so you can try and make sure it Get that working Feature importance is with the variable shuffling technique and once you have that working Partial dependence will just be a couple of lines of code away because it rather than you know rather than shuffling a column You're just replacing it with a constant value that it's nearly the same code and then tree interpreter It's going to require you writing some code and thinking about that well once you've written tree interpreter You're very close if you want to to creating the second approach to Feature importance the one where you add up the importance across all of the rows Which means you would then be very close to doing interaction importance So it turns out that there are actually there's actually a very good library for interaction importance for xg boost But there doesn't seem to be one for random forest so you could like start by getting it working on our version and If you want to do interaction importance, and then you could like get it working on the original Sklearn version and that would be a cool contribution Like sometimes writing it against your own implementation is kind of nicer because you can see exactly what's going on All right, so that's that's your job.

You don't have to rewrite the random forest feel free to if you want to you know practice so If you Get stuck at any point you know ask on the forum, right? there is a Whole page here on wiki.fast.ai about how to ask for help so when you ask your Co-workers on slack for help when you ask people in a technical community on github or discourse to help or whatever Asking for help the right way will go a long way towards You know having people want to help you and be able to help you right so So like search for your answer like search for the area you're getting see if somebody's already asked about it You know how have you tried to fix it already?

What do you think is going wrong? What kind of computer are you on? How is it set up? What are the software versions? Exactly. What did you type and exactly what happened right now? You could do that by Taking a screenshot, so you know make sure you've got some screenshot software.

That's really easy to use So if I were to take a screenshot, I just hit a button Select the area copy the clipboard go to my forum Paste it in and there we go right that looks a little bit too big, so let's make it a little smaller Right and so now I've got a screenshot people can see exactly what happened better still If there's a few lines of code and error messages to look at and create a gist Just is a handy little GitHub thing which basically lets you share code, so if I wanted to create a gist of this I Actually have a extension area that little extension, so if I click on here Give it a name Say make public Okay, and that takes my Jupyter notebook shares it publicly.

I can then grab that URL copy link location Right and paste it into my forum post right and then when people click on it Then they'll immediately see My notebook when it renders Okay, so that's a really good way now that particular button is an extension so on Jupyter You need to click nv extensions and click on Gist it right while you're there.

You should also click on collapsible headings That's this really handy thing. I use that lets me collapse things and open them up If you go to your Jupyter and don't see this nv extensions button Then just google for Jupyter nv extensions It'll show you how to pip install it and and get it set up right, but those two extensions are super duper handy All right, so Other than that assignment We're we're done With random forests and until the next course when you look at GBMs.

We're done with decision tree ensembles And so we're going to move on to neural networks broadly defined and so Neural networks are going to allow us to To go beyond just you know the kind of nearest neighbors approach of random forests You know all the random forest can do is to average data that it's already seen It can't extrapolate.

It can't they can't calculate right? linear regression Can calculate and can extrapolate but only in very limited ways Neural nets give us the best of both worlds We're going to start by applying them to unstructured data Right so unstructured data means like pixels or the amplitudes of sound waves or words you know data where?

everything in all the columns Are all of the same type? You know as opposed to like a database table where you've got like a revenue and a cost and a zip code and a state It should be structured data We're going to use it for structured data as well, but we're going to do that a little bit later So unstructured data is a little easier And it's also the area which more people have been applying deep learning to for longer The If you're doing the deep learning course as well You know you'll see that we're going to be approaching kind of the same conclusion from two different directions So the deep learning course is starting out with Big complicated convolutional neural networks being solved with you know Sophisticated optimization schemes and we're going to kind of gradually drill down into like exactly how they work Where else with the machine learning course we're going to be starting out more with like How does stochastic gradient descent actually work?

What do we do? What can we do with like one single layer? Which would allow us to create things like logistic regression when we add? Regularization to that how does that give us things like? Ridge regression elastic net less. Oh and then as we add additional layers to that how does that let us?

Handle more complex problems, and so we're not going to we're only going to be looking at fully connected layers in this machine learning course and Then I think next semester with your net you're probably going to be looking at some more Sophisticated approaches and so yes on this machine learning We're going to be looking much more at like what's actually happening with the matrices and how they actually calculate it and the deep learning It's much more like what are the best practices for actually?

Solving you know at a world-class level real world deep learning problems, right so Next week we're going to be Looking at like the classic Mness problem, which is like how do we recognize digits now? If you're interested you can like skip ahead and like try and do this with a random forest and you'll find it's not bad Right given that a random forest is basically a type of nearest neighbors, right?

It's finding like what are the nearest neighbors in in tree space? Then a random forest can absolutely recognize that this nine those pixels You know are similar to pixels. We've seen in these other ones and on average they were nines as well All right, and so like we can absolutely solve These kinds of problems to an extent using random forests But we end up being rather data limited because every time we put in another decision point you know We're halving our data roughly and so there's this just this limitation and the amount of calculation that we can do Where else with neural nets?

We're going to be able to use lots and lots and lots of parameters Using these tricks we don't learn about with regularization And so we're going to be able to do lots of computation, and there's going to be very little limitation on really What we can actually end up calculating as a result great good luck with your random forest interpretation, and I will see you next time

Intro to Machine Learning: Lesson 7

Chapters

Transcript