
Intro to Machine Learning: Lesson 4


Chapters

0:08 How Do I Deal with Version Control and Notebooks
12:25 Max Features
21:15 Feature Importance
38:27 Random Forest Feature Importance
41:09 Categorical Variables
44:37 One-Hot Encoding
55:26 Dendrogram
55:31 Cluster Analysis
59:03 Rank Correlation
67:42 Partial Dependence
70:08 ggplot
70:34 ggplot
72:40 Locally Weighted Regression
75:05 Partial Dependence Plot
75:16 Partial Dependence Plots
81:45 Purpose of Interpretation
81:50 Why Do You Want to Learn about a Data Set
83:10 Goal of a Model
87:40 PDP Interaction Plot
92:35 Tree Interpreter
94:04 The Bias
99:05 Bias

Transcript

All right, welcome back. Something to mention: somebody asked a really good question on the forums — how do I deal with version control and notebooks? The question was something like: every time I change the notebook, Jeremy changes it on git too, and then I do a git pull and end up with a conflict. That happens a lot with notebooks, because behind the scenes notebooks are JSON files, and every time you run a cell — even without changing it — it updates the little number recording which execution that was, so suddenly there's a change, and trying to merge notebook changes is a nightmare. My suggestion is a simple one: when you're looking at some notebook, like lesson2-rf_interpretation, and you want to start playing around with it, the first thing to do is go File → Make a Copy, and then in the copy go File → Rename and give it a name that starts with tmp. That will hide it from git, so now you've got your own version of the notebook that you can play with, and if you do a git pull and see that the original changed, it won't conflict with yours — you just have two different versions. There are other ways of dealing with this Jupyter-notebook-plus-git problem, which everybody has. One is that there are git hooks you can use to remove all of the cell outputs before you commit, but in this case I actually want the outputs to be in the repo so you can read it on GitHub. So it's a minor issue, but it's something that catches everybody. Yes? "Before we move on to interpretation of the random forest model, I wonder if we could summarize the relationship between the hyperparameters of the random forest and their effect on overfitting, dealing with collinearity, and so on."

Yeah, that sounds like a question born from experience — absolutely. Let's go back to lesson1-rf; if you're ever unsure where I am, you can always see it at the top here: courses/ml1, lesson 1 RF. In terms of the hyperparameters that are interesting — and I'm ignoring pre-processing, just the actual hyperparameters — the first one of interest is the set_rf_samples call, which determines how many rows are in each sample, in other words how many rows each tree is created from. Before we start a new tree, we either bootstrap a sample (sampling with replacement from the whole data set) or we pull out a subsample of a smaller number of rows, and then we build a tree from there. So step one is: we've got our whole big data set, we grab a few rows at random from it, we turn them into a smaller data set, and from that we build a tree. The size of that sample is what set_rf_samples sets. Say the original data had a million rows and we call set_rf_samples(20000), and then we grow a tree from those rows. Assuming the tree stays roughly balanced as we grow it, and assuming we grow it until every leaf is of size one, can somebody tell me how many layers deep the tree would be?

Yes — log base 2 of 20,000. Right. So the depth of the tree doesn't actually vary that much with the number of samples, because it's related to the log of the sample size. Can somebody tell me: once we go all the way down to the bottom, how many

leaf nodes would there be? Speak up? "20,000." Right — because every single leaf node has a single thing in it, there's obviously a linear relationship between the number of leaf nodes and the size of the sample. So when you decrease the sample size, there are fewer final decisions that can be made, and the tree is going to be less rich in terms of what it can predict: it's making fewer individual decisions, and it's also making fewer binary choices to get to those decisions. So setting set_rf_samples lower means you'll overfit less, but it also means you'll have a less accurate individual tree model. Remember the way Breiman, the inventor of random forests, describes this: when you build a model with bagging, you're trying to do two things. One is that each individual tree — or, as sklearn would say, each individual estimator — is as accurate as possible on the training set, so each model is a strong predictor on its own. The other is that, across the estimators, the correlation between them is as low as possible, so that when you average them together you end up with something that generalizes. By decreasing the set_rf_samples number we're decreasing the power of each estimator and decreasing the correlation between estimators — so is that going to give you a better or a worse validation set result?

It depends, right? This is the kind of compromise you have to figure out when you do machine learning models. Can you pass that back there? "If I set the OOB value — it's basically dividing the data, so it ensures that some of the data won't be in each tree, right?"

"The OOB — if I put oob_score equal to True, doesn't that make sure that, out of my entire data, a different portion of the data won't be in every tree?" So all oob_score=True does is this: whatever your subsample is — it might be a bootstrap sample or it might be a subsample — take all of the other rows, and for each tree put them into a different data set and calculate the error on those. It doesn't actually impact training at all; it just gives you an additional metric, which is the OOB error.
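As a minimal sketch (the frame and variable names here are illustrative, not the lesson's exact notebook), turning that on with scikit-learn looks like this:

```python
from sklearn.ensemble import RandomForestRegressor

# oob_score=True asks sklearn to score each row using only the trees
# that did NOT see that row during training
m = RandomForestRegressor(n_estimators=40, oob_score=True, n_jobs=-1)
m.fit(X_train, y_train)

print(m.oob_score_)  # R^2 computed on the out-of-bag predictions
```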

So if you don't have a validation set, this gives you a kind of quasi validation set for free. If you want to turn the subsampling off, you call reset_rf_samples, which restores the default: it bootstraps — it'll sample a new data set as big as the original one, but with replacement. The second benefit of set_rf_samples, obviously, is that you can run more quickly, and particularly if you're running on a really large data set — like a hundred million rows — it won't be possible to run on the full data set, so you'd either have to pick a subsample yourself before you start, or use set_rf_samples. The second key parameter we learned about was min_samples_leaf. Before, we assumed min_samples_leaf was equal to 1; if I set it equal to 2, what would be the new depth of the tree?

"Log base 2 of 20,000, minus 1." OK — so each time we double min_samples_leaf, we're removing one layer from the tree. Fine, I'll come back to you again since you're doing so well: how many leaf nodes would there be in that case?

"10,000." OK — so again we're dividing the number of leaf nodes by that same factor. The result of increasing min_samples_leaf is that each of our leaf nodes now has more than one thing in it, so the average we calculate in each leaf is more stable; we've got a bit less depth, fewer decisions to make, and a smaller number of leaf nodes. So again, we'd expect each estimator to be less predictive but the estimators to be less correlated, which might help us avoid overfitting. (There's a small arithmetic sketch of those depth and leaf counts below.) Could you pass the microphone over here, please?
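A quick back-of-the-envelope check of those numbers (purely illustrative):

```python
import math

n = 20_000                       # rows per tree after set_rf_samples(20000)
print(math.log2(n))              # ~14.3 levels when every leaf holds one row
print(n)                         # ~20,000 leaves with min_samples_leaf=1
print(math.log2(n) - 1, n // 2)  # roughly one fewer level and ~10,000 leaves with min_samples_leaf=2
```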

"Hi Jeremy — I'm not sure: in that case, will every node have exactly two?" No, it won't necessarily have exactly two — and thank you for mentioning that. It might try to do a split and decide not to. So, what might be a reason that you

wouldn't split, even if you had a hundred items — sorry, a hundred items in a leaf node? "They're all the same." All the same in terms of the independent variables or the dependent? "The dependent." Right — I guess either, but much more likely the dependent. If you get to a leaf node where every single row has the same auction price — or, in classification, every single one is a dog — then there is no split you can do that's going to improve your information. And remember, information is the term

we use, in a fairly general sense in random forests, to describe the amount of additional information we create from a split — how much are we improving the model? So you'll often see the phrase "information gain", which means how much better the model got by adding an additional split point.

It could be based on RMSE, or on cross-entropy, or on how different the standard deviations are, or whatever — it's just a general term. OK, so that's the second thing we can do, and again it's going to speed up training, because it's one less set of decisions to make. Remember, even though there's one less set of decisions, those decisions have as much data again as the previous set — each layer of the tree can take, say, twice as long as the previous layer — so it can definitely speed up training, and it can definitely help the model generalize better. The third parameter we had was max_features. Who wants to tell me what max_features

does? I don't know — pass that back over there. OK, Vinay. "Features — the dimensions: how many features you're going to use in each tree. In this case it's a fraction, so you're going to use half of the features for each tree." Nearly right, or kind of right. Can you, or somebody else, be more specific?

"It's not exactly for each tree — essentially, for each tree it randomly samples half of the..." Not quite — it's not for each tree. set_rf_samples picks a subset of rows for each tree, but max_features doesn't quite do that; it's something different.

Sorry that max features doesn't quite do that. It's not something different Yeah, right So it kind of sounds like a small difference, but it's actually quite a different way of thinking about it Which is we do our set RF samples So we pull out our sub sample or a bootstrap sample and that's kept for the whole tree And we have all of the columns in there, right and then with Max features equals 0.5 at each point we then at each split we pick a different half of The features and then here we'll take a pick a different half of the features and here we'll pick a different half of the features And so the reason we do that is because we want the trees to be as as rich as possible Right, so particularly like if you if you were only doing a small number of trees like you had only ten trees And you pick the same column set all the way through the tree You're not really getting much variety and what kind of things are confined.

So this way, at least in theory, seems to give us a better set of trees: picking a different random subset of features at every decision point. The overall effect of max_features is the same kind of trade-off: each individual tree is probably going to be less accurate, but the trees are going to be more varied. In particular, this can be critical — imagine you've got one feature

that's just super predictive, so predictive that every random subsample you look at always starts out by splitting on that same feature. Then the trees are going to be very similar, in the sense that they all have the same initial split — but there may be other interesting initial splits, because they'd create different interactions of variables.

So now, half the time that feature won't even be available at the top of the tree, which means at least half the trees are going to have a different initial split. It definitely can give us more variation, and therefore it can help us create more generalized trees that have less correlation with each other, even though the individual trees probably won't be as predictive. In practice we looked at a little picture of this: as you add more trees, if you have max_features=None it's going to use all the features every time, and with very few trees that can still give you a pretty good error — but as you create more trees it doesn't help as much, because they're all pretty similar; they're all trying every single variable. Whereas if you say max_features='sqrt' or max_features='log2', then as we add more estimators we keep seeing improvements. So there's an interesting interaction between those two, and that cool little chart is from the scikit-learn docs. Then there are things which don't impact our training at all: n_jobs simply says how many cores we run on.

So it'll make it faster, up to a point — generally speaking, making this more than about eight or so may have diminishing returns. Minus one says use all of your cores. I don't know why the default is to use only one core; that seems weird to me, since you'll definitely get more performance by using more cores, and all of your computers have more than one core nowadays. And then oob_score=True simply allows us to see the OOB score; if you don't say that, it doesn't calculate it, and particularly if you've set set_rf_samples pretty small compared to a big data set, the OOB calculation is going to take forever. Hopefully at some point we'll be able to fix the library so that doesn't happen — there's no reason it needs to be that way, but right now that's how the library works. So those are the key basic parameters you can change. There are more you can see in the docs, or Shift+Tab to have a look at them, but the ones you've seen are the ones I've found useful to play with, so feel free to play with others as well. Generally speaking, max_features values of None, 0.5, 'sqrt' or 'log2' seem to work pretty well, and for min_samples_leaf I generally try something like 1, 3, 5, 10, 25, 100 — and as you go, if you notice that by the time you get to 10 it's already getting worse, there's no point going further; if you get to 100 and it's still getting better, then you can keep trying (something like the sketch below).
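A rough sketch of that kind of manual sweep — the variable names and the exact values are illustrative, not the lesson's code:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# X_train/X_valid/y_train/y_valid are assumed to come from your usual split
for min_leaf in [1, 3, 5, 10, 25, 100]:
    for max_feat in [None, 0.5, 'sqrt', 'log2']:
        m = RandomForestRegressor(n_estimators=40, min_samples_leaf=min_leaf,
                                  max_features=max_feat, n_jobs=-1, oob_score=True)
        m.fit(X_train, y_train)
        print(min_leaf, max_feat, m.oob_score_,
              r2_score(y_valid, m.predict(X_valid)))
```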

But those are the general ranges most problems tend to sit in. All right — random forest interpretation is something you could use to create some really cool Kaggle kernels right now. One issue is that the fastai library is not available in Kaggle kernels, but if you look inside fastai.structured — and remember, you can use a double question mark to look at the source code for something, or go into the editor to have a look — you'll see that most of the methods we're using are a small number of lines of code with no dependencies on anything else in the library. So if you need to use one of those functions, you can just copy it into your kernel. If you do, just say it's from the fastai library and link to it on GitHub, since it's available there as open source — but you don't need to import the whole thing.

So this is a cool trick: because you're the first people to learn how to use these tools, you can start to show things that other people haven't seen. For example, this confidence based on tree variance is something which doesn't exist anywhere else. Feature importance definitely does exist, and it's already in quite a lot of Kaggle kernels, but if you're looking at a competition or a data set where nobody's done feature importance, being the first person to do it is always going to win lots of votes — because the most important question is: which features are important?

So, from last time, let's just make sure we've got our tree data — we need to change this to add one extra thing — all right, that's going to load in the data. As I mentioned, when I do model interpretation I tend to set set_rf_samples to some subset, something small enough that I can run a model in under ten seconds or so, because there's just no point running a super accurate model for this. 50,000 is more than enough: you'll basically see that each time you run an interpretation you get the same results back, and as long as that's true, you're already using enough data.
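Concretely, the interpretation set-up is roughly this (a sketch, assuming the usual training/validation split from the earlier lessons):

```python
from sklearn.ensemble import RandomForestRegressor

set_rf_samples(50_000)  # fastai helper: each tree sees a random 50k-row sample

# deliberately quick rather than maximally accurate, so it can be rerun
# interactively while interpreting
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)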

So, feature importance. We learned that it works by randomly shuffling a column — each column, one at a time — and then seeing how accurate the pre-trained model (the model we've already built) is when you pass it all the data as before, but with that one column shuffled. Some of the questions I got after class reminded me that it's very easy to under-appreciate how powerful and kind of magic this approach is, so I'll mention a couple of the questions I heard. One was: why don't we just take one column at a time and build a tree on

just that one column? We've got our data set, it's got a bunch of columns — so why don't we just grab one column, build a tree from it, and then see which column's tree is the most predictive? Can anybody tell me

why that might give misleading results about feature importance? "If we just shuffle them, it adds randomness, and we're able to capture both the interactions and the importance of the feature." Great, yeah. And this issue of interactions is not a minor detail —

it's massively important. Think about this bulldozers data set, where there's one field called YearMade and one field called saledate. If we think about it, it's pretty obvious that what matters is the combination of these two — in other words, how old the piece of equipment was when it got sold. So if we only included one of them,

we're going to massively underestimate how important that feature is. Now, here's a really important point: it's pretty much always possible to create a simple logistic regression which is as good as pretty much any random forest — if you know ahead of time exactly what variables you need, exactly how they interact, exactly how they need to be transformed, and so forth.

In this case, for example, we could have created a new field equal to sale date (or sale year) minus YearMade, fed that to the model, and it would have had that interaction handed to it. But the point is, we never know that. You might have a guess — I think these things interact in this way, I think this one needs a log — but the truth is that, the way the world works, the causal structures have many, many things interacting in many, many subtle ways.
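For instance, a hypothetical engineered feature for that particular interaction might look like this (assuming saledate is still a datetime column at this point; this isn't necessarily what the notebook does):

```python
# age of the machine at sale time: the interaction of saledate and YearMade
df_raw['age_at_sale'] = df_raw.saledate.dt.year - df_raw.YearMade
```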

And that's why using trees — whether gradient boosting machines or random forests — works so well. Can you pass that to Terrence, please? "One thing that bit me years ago was trying one variable at a time, thinking I'd figure out which one is most correlated with the dependent variable. What that doesn't pull apart is: what if all the variables are basically copies of the same variable? Then they're all going to seem equally important, when in fact it's really just one factor." Yeah, and that's also true here.

If we had a column that appeared twice, then shuffling that column isn't going to make the model much worse. Think about how the forest was built: particularly if we had max_features=0.5, some of the time we're going to get version A of the column and some of the time version B. So about half the time, shuffling version A of the column makes a tree a bit worse, and half the time shuffling version B makes it a bit worse.

And so it'll show that both of those features are somewhat important, and it will kind of share the importance between the two of them. This is why we talk about collinearity here — although collinearity literally means they're linearly related, so the word isn't quite right — but it's why having two variables that are closely related to each other, or more variables that are closely related to each other, means you will often underestimate their importance using this random forest technique. Yes, Terrence? "So once we've shuffled and we get a new model, what exactly are the units of these importances?

Is it a change in the R squared?" It depends on the library we're using, so the units are something I honestly never think about. I just know that in this particular library, 0.005 is often a reasonable cut-off to use. All I actually care about is this picture: the feature importance ordered for each variable, then zoomed in and turned into a bar plot, where I can see, OK, they're all pretty flat from here, that's about 0.005, so I'll remove them at that point and check that the validation score didn't get worse. If it did get worse, I'll just increase —

sorry, decrease — that cut-off a little until it doesn't get worse. So yeah, the units of measure don't matter too much, and we'll learn later about a second way of doing variable importance. By the way, can you pass that over there? "Is one of the goals here to remove variables where your score will not get worse if you remove them, so you might as well get rid of them?" Yeah — that's what we're going to do next.
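To make the shuffling idea itself concrete, here's a minimal from-scratch sketch of permutation importance — an illustration only, not the fastai or scikit-learn implementation, and the names are made up:

```python
import numpy as np
from sklearn.metrics import r2_score

def permutation_importance(m, X_valid, y_valid):
    """Drop in validation R^2 when each column is shuffled, one at a time."""
    baseline = r2_score(y_valid, m.predict(X_valid))
    imp = {}
    for col in X_valid.columns:
        X_shuf = X_valid.copy()
        X_shuf[col] = np.random.permutation(X_shuf[col].values)
        imp[col] = baseline - r2_score(y_valid, m.predict(X_shuf))
    return sorted(imp.items(), key=lambda kv: kv[1], reverse=True)
```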

So, having looked at our feature importance plot, we said: OK, it looks like the ones below 0.005 are kind of a long tail of boringness, so let's try removing them.

So let's just grab the columns where the importance is greater than 0.005, create a new data frame called df_keep which is df_trn with just those kept columns, create a new training and validation set with just those columns, retrain the random forest, and look at how the validation set score and the validation set RMSE changed. I found they got a tiny bit better. If they're about the same, or a tiny bit better, then my thinking is: this is just as good a model, but it's now simpler.
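That step looks roughly like this — a sketch, where rf_feat_importance and split_vals are the helpers used earlier in the course and the threshold follows the discussion above:

```python
fi = rf_feat_importance(m, df_trn)         # DataFrame of columns and importances
to_keep = fi[fi.imp > 0.005].cols          # drop the long tail of unimportant features

df_keep = df_trn[to_keep].copy()
X_train, X_valid = split_vals(df_keep, n_trn)   # same split as before

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
```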

And now, when I redo the feature importance, there's less collinearity. In this case I saw that YearMade went from being quite a bit better than the next best thing (which was Coupler_System) to way better than the next best thing, and Coupler_System went from being quite a bit more important than the next two to about equally important — so it did seem to change these feature importances, and hopefully give me some more insight. "How does that help our model in general? What does it mean that YearMade is now way ahead of the others?" We're going to dig into that now, but basically it tells us where to focus: for example, how we're dealing with missing values, whether there's noise in the data, or whether it's a high-cardinality categorical variable — those all suggest different steps we would take. For example, if it was a high-cardinality categorical variable that was originally a string — like fiProductClassDesc, one of the

I remember one of the Ones we looked at the other day had like first of all was the type of vehicle and then a hyphen and then like the Size of the vehicle we might look at that and be like, okay. Well, that was an important column Let's try like splitting it into two on hyphen and then take that bit which is like the size of it and trying You know pass it and convert convert it into an integer, you know We can try and do some feature engineering and basically until you know, which ones are important You don't know where to focus that feature engineering time You can talk to your client, you know and say, you know or you know and if you're doing this inside your workplace you go and talk to the folks that like We're responsible for creating this data so in this if you were actually working at a Bulldozer auction company, you might now go to the actual auctioneers and say I'm really surprised that coupler system seems to be driving people's Pricing decisions so much.

why do you think that might be? And they might say to you: oh, it's actually because only these classes of vehicles have coupler systems, or only this manufacturer has them — so frankly, this isn't telling you about coupler systems at all, but about something else. And oh, hey, that reminds me: that's something else

we actually measure, and it's in this different CSV file — I'll go get it for you. So it helps you focus your attention. "I had a fun little problem this weekend. As you know, I introduced a couple of crazy computations into my random forest, and all of a sudden they were the most important variables ever, squashing all of the others — but then I got a terrible score. And —

now that I think I have my scores computed correctly, what I noticed is that the importance went through the roof, but the validation score was still bad, or got worse. Is that because that computation somehow allowed the training to use it almost like an identifier, mapping exactly what the answer was going to be on the training set, but of course that doesn't generalize to the validation set?

Is that what I observed?" OK, so there are two reasons why your validation score might not be very good. Let's go up here. We get these five numbers: the RMSE of the training and validation sets, the R² of the training and validation sets, and the R² of the OOB sample. There are two possible reasons, and really, in the end, what we care about — for this Kaggle competition at least — is the RMSE of the validation

So in Terrence's case. He's saying this number is this thing I care about Got worse when I did some feature engineering. Why is that? Okay There's two possible reasons Reason one is that you're overfitting if you're overfitting Then your OOB Will also get worse If you're doing a huge data set with a small set RF sample, so you can't use an OOB then instead Create a second validation set which is a random sample okay, and and do that right so in other words if you're OOB or your random sample validation set is Has got much worse than you must be overfitting I think in your case Terrence, it's unlikely.

That's the problem because random forests Don't overfit that badly like it's very hard to get them to overfit that badly Unless you use some really weird parameters like only one estimator for example like once you've got ten trees in there There should be enough variation that you're you know You can definitely overfit but not so much that you're going to destroy your validation score by adding a variable So I think you'll find that's probably not the case But it's easy to check and if it's not the case Then you'll see that your OOB score or your random sample validation score hasn't got worse.

Okay So the second reason your validation score can get worse if your OOB score hasn't got worse You're not overfitting, but your validation score has got worse That means you're you're doing something that is true in the training set but not true in the validation set So this can only happen when your validation set is not a random sample So for example in this bulldozers competition or in the grocery shopping competition We've intentionally made a validation set that's for a different date range.

it's the most recent two weeks. So if something different happened in the last two weeks compared with the previous weeks, you could totally break your validation set. For example, if there was some kind of unique identifier that's different between the two date periods, you could learn to identify things using that identifier in the training set, but then the last two weeks may have a totally different set of IDs, and a different set of behaviour, and your score could get a lot worse. What you're describing is not common, though, so I'm a bit skeptical.

It might be a bug, but hopefully there are enough things you can now use to figure out whether it is, and I'll be interested to hear what you learn. OK, so that's feature importance, and I'd like to compare it to how feature importance is normally done in industry and in academic communities outside machine learning — in psychology, economics, and so forth. Generally speaking, people in those environments tend to use some kind of linear regression, logistic regression, or general linear model. They start with their data set and say: I'm going to assume that I know the parametric relationship between my independent variables and my dependent variable — I'm going to assume it's a linear relationship, say, or a linear relationship through a link function like the sigmoid of logistic regression. And assuming I already know that, I can write it as an equation. So if we've got x1, x2 and so forth,

I can say my y values are equal to a*x1 + b*x2, and therefore I can find the feature importance easily enough just by looking at the coefficients and seeing which one is the highest — particularly if you've normalized the data first.
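In other words, the approach being described (and critiqued below) looks roughly like this sketch, with made-up feature handling:

```python
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# normalize so the coefficients are on a comparable scale, then read them
# off as "importances" -- the practice Jeremy is about to argue against
X_std = StandardScaler().fit_transform(X_train)
lin = LinearRegression().fit(X_std, y_train)

for name, coef in zip(X_train.columns, lin.coef_):
    print(name, coef)
```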

There's this very common trope out there that this is somehow a more accurate, more pure, or in some way better way of doing feature importance — but that couldn't be further from the truth. Think about it: if you were missing an interaction, or missing a transformation you needed, or had been anything less than a hundred percent perfect in all of your pre-processing, so that your model isn't the absolutely correct description of the situation —

unless you've got all of that correct, then your coefficients are wrong. Your coefficients are telling you: in your somewhat wrong model, this is how important those things are — which is basically meaningless. Whereas the random forest feature importance is telling you: in this highly parameterized, highly flexible functional form, with few if any statistical assumptions, this is your feature importance.

This is your feature importance right So I would be very cautious, you know and and again I can't stress this enough when you when you leave m San when you leave this program You are much more often going to see people talk about logistic regression coefficients Then you're going to see them talk about random forest variable importance And every time you see that happen You should be very very very skeptical of what you're seeing any time you read a paper in economics or in psychology Or the marketing department tells you they did this regression or whatever every single time those coefficients are going to be massively biased by any issues in the model Furthermore If they've done so much pre-processing that actually the model is pretty accurate Then now you're looking at coefficients that are going to be of like a coefficient of some principal component from a PCA or a coefficient of some Distance from some cluster or something at which point they're very very hard to interpret anyway They're not actual variables, right?

Those are the two options I've seen when people try to use classic statistical techniques to do a variable-importance equivalent. I think things are starting to change, slowly; there are some fields that are starting to realize this is totally the wrong way to do things. But it's been nearly 20 years since random forests appeared, so it takes a long time. People say the only way knowledge really advances is when the previous generation dies, and that's kind of true —

particularly with academics: they make a career of being good at a particular sub-thing, and often it's not until the next generation comes along that people notice that, oh, that's actually no longer a good way to do things. I think that's what's happened here. OK — so we've now got a model which isn't really any better predictive-accuracy-wise, but we're getting a good sense that there seem to be four main important things: when it was made, the coupler system, its size, and its product classification. So that's cool.

Okay, so that's cool There is something else that we can do however, which is we can do something called one hot encoding So this is kind of where we're talking about categorical variables. So remember a categorical variable. Let's say we had like A string hi And remember the order we got was kind of back weird.

It was high low medium. So it was in alphabetical order by default Right. Was there our original category for like usage band or something? And so we mapped it to 0 1 2 right and so by the time it gets into our data frame, it's now a number So the random forest doesn't know that it was originally a category.

It's just a number Right. So when the random forest is built it basically says, oh is it? Greater than one or not or is it greater than not or not, you know, basically the two possible decisions it could have made For For something with like five or six bands, you know It could be that just one of the levels of a category is actually interesting Right.

Say the levels were very high, very low, unknown, and so on — about six levels — and maybe the only thing that mattered was whether it was unknown; maybe not knowing its size somehow impacts the price. If we wanted the model to be able to recognize that — and particularly if it just so happened that the way the numbers were coded put unknown in the middle —

then what it's going to do is say: OK, there's a difference between these two groups, less than or equal to 2 versus greater than 2; and then, once it gets into that leaf, it's going to say: oh, there's a difference between less than 4 and greater than or equal to 4. So it's going to take two splits to get to the point where we can see that it's actually "unknown" that matters. That's a little inefficient, and we're kind of wasting tree computation — and wasting tree computation matters, because every time we do a split we're at least halving the amount of data we have for further analysis, so the tree is going to be less rich and less effective if we're not giving it the data in a way that's convenient for the work it needs to do. What we could do instead is create six columns:

It needs to do so what we could do instead is create six columns We could create a column called is very high is very low is high is Unknown is low is medium and each one would be ones and zeros, right? It's either one or zero So we had six columns this one moment So having added six additional columns to our data set the random forest Now has the ability to pick one of these and say like oh, let's have a look at is unknown There's one possible split I can do which is one versus zero.

Let's see if that's any good right, so it actually now has the ability in a single step to pull out a single category level and so This this kind of coding is called one hot encoding and for many many types of machine learning model, this is like Necessary something like this is necessary like if you're doing logistic regression You can't possibly put in a categorical variable that goes north through five Because there's obviously no linear relationship between that and anything right so one hot encoding a Lot of people incorrectly assume that all machine learning requires one hot encoding But in this case, I'm going to show you how we could use it optionally and see whether it might improve things sometimes.

Yeah Hi, Jeremy. So if we have six categories like in this case, would there be any problems with adding a column for each of the Categories, oh because in linear regression we so we had to do it like if there's six categories We should only do it for five of them.

Yeah — you certainly can say, let's not bother adding is_medium, because we can infer it from the other five. I would say include it anyway, because otherwise the random forest would have to say: is it very high? No. Is it very low?

No. Is it high? No. Is it unknown? No. Is it low? No — OK, finally, I'm there. That's five decisions to get to that point. The reason that, in linear models, you need to leave one out is that linear models hate collinearity, but we don't care about that here. We can do one-hot encoding easily enough, and the way we do it is to pass one extra parameter to proc_df, which is: what's the maximum

Which is what's the max? Number of Categories right so if we say it's seven then anything with Less than seven levels is going to be turned into one hot encoded bunch of columns Right so in this case this has got six levels So this would be one hot encoded where else like zip code has more than six levels And so that would be left as a number And so generally speaking you obviously probably wouldn't want a one hot encode Zip code right because that's just going to create masses of data memory problems computation problems and so forth, right?

So this is another parameter you can play around with. If I do that, try it out, and run the random forest as per usual, you can see what happens to the R² and RMSE of the validation set — and in this case I found it got a little bit worse. That isn't always the case; it depends on your data set: do you have a data set where single categories tend to be quite important, or not?

In this particular case it didn't make the model more predictive. However, what it did do is give us different features: proc_df puts the name of the variable, then an underscore, then the level name. And interestingly, it turns out that whereas before it said Enclosure was somewhat important, when we do it one-hot encoded it actually says Enclosure_EROPS w AC is the most important thing. So, at least for the purpose of interpreting your model, you should always try

You should always try One hot encoding you know Quite a few of your variables, and so I often find somewhere around six or seven is pretty good You can try like making that number as high as you can so that it doesn't take forever to compute and the feature importance doesn't include like Really tiny levels that aren't interesting, so that's kind of up to you to play it play around with But in this case like this is actually I found this very interesting it clearly tells me I need to find out What enclosure E rops with a C is why is it important because like it means nothing to me Right and but it's in the most important thing, so I should go figure that out so that I had a question plus that so Can you explain how?

changing the max number of categories works? Because to me it just seems like there are five or six categories." Oh yeah, sorry — all it's doing is saying: here's a column called zip code, here's a column called UsageBand, and here's a column called sex,

I don't know whatever right and so like zip code has whatever 5000 levels the number of levels in a category we call its cardinality Okay So it has a cardinality of 5000 usage band maybe has a cardinality of six sex has maybe a cardinality of two So when Procte F goes through and says, okay, this is a categorical variable should I one hot encode it?

it checks the cardinality against max_n_cat and says: 5,000 is bigger than seven, so I don't one-hot encode it. Then it goes to UsageBand: six is less than seven, so I do one-hot encode it. Then it goes to sex: two is less than seven, so I do one-hot encode it. It just decides, for each variable, whether to one-hot encode it or not. "In proc_df,

are we keeping both the label encoding and the one-hot encoding?" No — once we decide to one-hot encode, it does not keep the original variable. "Maybe the best..." Well, you don't need the label encoding, because if the best split is an interval, the model can approximate it with multiple one-hot-encoded levels. The truth is that each column could get a case-by-case decision about whether it should be label encoded or not; I find in practice it's just not that sensitive, so using a single number for the whole data set gives me what I need. But if you were building a model that really had to be as good as possible and you had lots and lots of time, you could go through manually — don't use proc_df; decide by hand which things to turn into dummies. You'll see it in the code, if you look at the code for proc_df — and I never want you to feel like the code that happens to be in the fastai library is the code you're limited to.

So where is that done? You can see that max_n_cat gets passed to numericalize, and numericalize simply checks: is the column not a numeric type, and was max_n_cat either not passed at all, or do we have more categories than max_n_cat? If so, it replaces the column with its categorical codes. For any column where it skipped over that — so it remained a category — then at the very end

So it's remained as a category then at the very end We just go pandas dot get dummies we pass in the whole data frame and so pandas dot get dummies you pass in a whole Data frame it checks for anything that's still a categorical variable and it turns it into a dummy variable Which is another way of saying a one hot encoding.

So, you know with that kind of approach you can easily override it and do your own dummy verification variable ization Did you have a question? So some data has Quite obvious order like if you have like a rating system like good bad Or whatever things like that There's an order to that and showing that order by doing the dummy variable thing probably will work in your benefit So is there a way to just force it to leave alone one variable just like convert it beforehand yourself?

Not in the library. And, to remind you, unless we explicitly do something about it we don't get that order: when we import the data — this is in lesson1-rf — we showed how, by default, the categories are ordered alphabetically, and we have the ability to reorder them properly. So yes: if you've actually made the effort to turn your ordinal variables into proper ordinals, using proc_df with max_n_cat can destroy that. The simple way to avoid it is: if we know we always want to use the codes for UsageBand, and never one-hot encode it, we can just go ahead and replace it — df.UsageBand = df.UsageBand.cat.codes — and now it's an integer, so it will never get touched. All right. We've already seen how variables which are basically measuring the same thing can confuse our variable importance, and they can also make our random forest slightly less good, because it requires more computation to do the same thing and there are more columns to check. So I'm going to do some more work to try to remove redundant features, and the way I do that is with something called a dendrogram. It's a kind of hierarchical clustering. Cluster analysis is where you try to look at objects — they can be either rows or columns of a data set — and find which ones are similar to each other. Often when you see people talking about cluster analysis they're referring to rows of data, and they'll say: let's plot it — oh look, there's a cluster, and there's another cluster.

The common type of cluster analysis — time permitting, we may get around to talking about it in some detail — is called k-means, which is basically where you assume you don't have any labels at all, take a couple of data points at random, gradually find the ones that are near them and move the centroids closer and closer, and repeat again and again. It's an iterative approach where you tell it how many clusters you want, and it tells you where it thinks the clusters are. A really — and I don't know why — underused technique (20 or 30 years ago it was much more popular than it is today) is hierarchical clustering,

It was much more popular than it is today is hierarchical clustering hierarchical Also known as agglomerative clustering and in hierarchical or agglomerative clustering We basically look at every pair of option up every pair of objects and say okay, which two objects are the closest Right so in this case we might go okay Those two objects are the closest and so we've kind of like delete them and replace it with the midpoint of the two And then okay here the next two closest we delete them and replace them with the midpoint of the two And you keep doing that again and again right since we've got of removing points and replacing them with their averages You're gradually reducing a number of points By pairwise combining and the cool thing is you can plot that like so right so if rather than looking at points You look at variables.

we can ask: which two variables are the most similar? It says saleYear and saleElapsed are very similar. The horizontal axis here is how similar the two things being compared are — the further to the right they join, the more similar they are — so saleYear and saleElapsed have been combined, and they were very similar. As for the measure of similarity, you get to tell it what to use; it could be a correlation coefficient or something like that. In this particular case, I actually used Spearman's R.

I actually used spearmen's are so You guys familiar with correlation coefficients already, right so correlation is as almost exactly the same as the R squared, right? But it's between two variables rather than a variable and its prediction the problem with a normal correlation is that if the Get a new workbook here If you have data that looks like this then you can Do a correlation and you'll get a good result right, but if you've got data which looks like This right and you try and do a correlation and assumes linearity that's not very good, right?

So there's a thing called a rank correlation — a really simple idea: replace every point by its rank. We basically say: this is the smallest, so call it one; there's the next one, two; then three, four, five — you just replace every number by its rank. Then you do the same for the y-axis: one, two, three and so forth. Then you do a new plot where you don't plot the data but the rank of the data, and if you think about it, the rank version of this data set is going to look like

It was also greater on the y-axis So if we do a correlation on the rank that's called a rank correlation Okay, and so Because I want to find the Columns that are similar in a way that the random forest would find them similar Random forests don't care about linearity.

They just care about ordering so a rank correlation is the the right way to think about that so Spearmons are is is the name of the most common rank correlation But you can literally replace the data with its rank and chuck it at the regular correlation And you'll get basically the same answer the only difference is in how ties are handled.

It's a pretty minor issue um Like if you have like a full parabola in that rank correlation, you will not write why right? It has to be has to be monotonic. Okay. Yeah, yeah Okay, so Once I've got a correlation matrix there's basically a couple of standard steps you do to turn that into a Dendogram which I have to look up on Stack Overflow each time I do it You basically turn it into a distance matrix And then you create something that tells you you know, which things are connected to which other things hierarchically.

So this kind of These two and this step here are like just three standard steps that you always have to do to create a dendrogram and So then you can plot it and so Alright, so sale year and sale elapsed and be measuring basically the same thing at least in terms of rank Which is not surprising because sale elapsed is the number of days since the first day in my data set So obviously these two are nearly entirely correlated with some ties Grouse attracts and hydraulics flow and coupler system all seem to be measuring the same thing and this is interesting because remember couple system It said was super important, right?

So this rather supports our hypothesis that it's nothing to do with whether there's a coupler system, but with whatever kind of vehicle it is that has these kinds of features. ProductGroup and ProductGroupDesc seem to be measuring the same thing, and fiBaseModel and fiModelDesc seem to be measuring the same thing. Once we get past those,

And so once we get past that Everything else like suddenly the things are further away. So I'm probably going to not worry about those So we're going to look into these one two three four groups that are very similar. She passed that over there Is it in that graph that the similarity between stick length and enclosure is higher than with stick lens and anything that's higher Yeah, pretty much I mean it it's a little hard to interpret but given that stick length and enclosure Don't join up until way over here It would strongly suggest that then that they're a long way away from each other Otherwise you would expect them to have joined up earlier I mean it's it's possible to construct like a synthetic data set where you kind of end up joining things that were close to each other through different paths So you've got to be a bit careful, but I think it's fair to probably assume that stick length or enclosure are probably very different so they are very different, but would they be more similar than for example stick length and sale day of year No, there's nothing to suggest that here because like the point is to notice where they sit in this tree Right and they both that they sit in totally different halves of the tree.

"Thank you." But really, to know that for sure, the best way would be to look at the Spearman's R correlation matrix itself: if you just want to know how similar this thing is to that thing, the correlation matrix tells you directly. Can you pass that over there?

"So here we're passing in the data frame, right?" Say again? "This is just a data frame?" Yes — we're passing in df_keep, the data frame containing the 30 or so features that our random forest thought were interesting. There's no random forest being used here; the distance measure is being computed entirely on rank correlation. What I then do is take these groups, and I create a little function that I call get_oob, which builds a random forest for some data frame: I make sure I've taken that data frame and split it into a training and validation set, then call fit and return the OOB score. Basically, I'm going to try removing each one of these eight or nine variables, one at a time, and see which ones I can remove without the OOB score getting worse. Each time I run this I get slightly different results — actually, it looks like last time I had seven things, not eight. You can see I just loop through each of the things I'm thinking I could maybe get rid of because it's redundant, and I print out the column name and the OOB score of a model trained after dropping that one column. The OOB score on my whole data frame is 0.89, and after dropping each one of these things, basically none of them get much worse; saleElapsed gets quite a bit worse than saleYear does, but pretty much everything else I can drop with only a third-decimal-place difference. Obviously, though, you've got to remember the dendrogram: take fiModelDesc and fiBaseModel — they're very similar to each other, right?
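A sketch of that get_oob idea and the drop-one-column loop — names follow the lesson loosely, and split_vals/n_trn/y_train are the same helpers and variables assumed above:

```python
def get_oob(df):
    """Fit a quick forest on df and return its out-of-bag score."""
    m = RandomForestRegressor(n_estimators=30, min_samples_leaf=5,
                              max_features=0.6, n_jobs=-1, oob_score=True)
    x, _ = split_vals(df, n_trn)
    m.fit(x, y_train)
    return m.oob_score_

print(get_oob(df_keep))  # baseline

# candidate columns suggested by the dendrogram groups
for c in ['saleYear', 'saleElapsed', 'fiModelDesc', 'fiBaseModel',
          'Grouser_Tracks', 'Coupler_System']:
    print(c, get_oob(df_keep.drop(c, axis=1)))
```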

So what this says isn't that I can get rid of both of them — I can get rid of one of them, because they're basically measuring the same thing. So then I try it: let's try getting rid of one from each group — saleYear, fiBaseModel and Grouser_Tracks — and have a look.

OK, I've gone from 0.890 to 0.888 — again, so close as to be meaningless. That sounds good: simpler is better. So I'm now going to drop those columns from my data frame, and then I can try running the full model again. reset_rf_samples means I'm using my whole bootstrap sample; I use 40 estimators and I've got 0.907. So I've now got a

Okay, so I've now got a Model which is smaller and simpler and I'm getting a good score for So at this point I've now got rid of as many columns as I feel I comfortably can ones that either didn't have a good feature importance or were Highly related to other variables and the model didn't get worse significantly with that when I removed them So now I'm at the point where I want to try and really understand my data better by taking advantage of the model And we're going to use something called partial dependence And again This is something that you could like using the cable kernel and lots of people are going to appreciate this because almost nobody knows About partial dependence and it's a very very powerful technique What we're going to do is we're going to find out for the features that are important How do they relate to the dependent variable?

So let's have a look. Again, since we're doing interpretation, we'll set set_rf_samples to 50,000 to run things quickly, we'll take our data frame, and we'll get our feature importance — and notice that we're using max_n_cat, because for interpretation I'm actually pretty interested in seeing the individual levels. Here's the top ten, so let's try to learn more about those top ten. YearMade is the second most important, so one obvious thing we could do would be to plot YearMade against saleElapsed, because, as we've talked about already, it just seems to make sense: they're both important,

They're both important but it seems very likely that they kind of combine together to find like how old was the Product when it was sold so we could try plotting year made against sale elapsed to see how they relate to each other and when we do We get this very ugly graph and it shows us that year made Actually has a whole bunch that are a thousand Right.

So clearly, you know, this is where I would tend to go back to the client or whatever and say Okay I'm guessing that these bulldozers weren't actually made in the year 1000 and they would presumably say to me Oh, yes, they're ones where we don't know when it was made, you know, maybe before 1986 We didn't track that or maybe the things that are sold in Illinois.

You don't have that data Provided or or whatever. They'll tell us some reason. So in order to Understand this plot better. I'm just going to remove them from this interpretation section of the analysis So I'm just going to say okay. Let's just grab things where year made is greater than 1930.

So let's now look at the relationship between YearMade and sale price. There's a really great package called ggplot. ggplot was originally an R package; "gg" stands for the grammar of graphics, which is a very powerful way of thinking about how to produce charts in a flexible way.

I'm not going to talk about it much in this class; there's lots of information available online, but I definitely recommend it as a great package to use. You can pip install ggplot, and it's part of the fastai environment already. ggplot in Python has basically the same parameters and API as the R version, and the R version is much better documented, so you should read its documentation to learn how to use it.

Basically you say: I want to create a plot of this data frame. Now, when you create plots, most of the data sets you're using are going to be too big to plot directly; if you do a scatter plot, it'll create so many dots that it's just a big mess, and it'll take forever. Remember, when you're plotting things, you're just looking at the result with your eyes.

So there's no point plotting something with a hundred million samples when, if you only used a hundred thousand, it would be pixel-identical. That's why I call get_sample first: get_sample just grabs a random sample. I'm going to grab 500 points from my data frame for now.

I'm going to plot YearMade against SalePrice. aes stands for aesthetic, and it's the basic way you set up your columns in ggplot: it says to plot these columns from this data frame. Then there's this slightly odd thing in ggplot where plus basically means "add chart elements".

So I'm going to add a smoother. Very often you'll find that with a scatter plot it's hard to see what's going on because there's too much randomness, whereas a smoother basically fits a little linear regression to every little subset of the graph, joins them up, and lets you see a nice smooth curve.

This is the main way I tend to look at univariate relationships. By adding se=True it also shows me the confidence interval of the smoother. "loess" stands for locally weighted regression, which is exactly this idea of doing lots of little linear regressions. And we can see here that the relationship between YearMade and sale price is kind of all over the place.
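Putting those pieces together, a minimal sketch of the plot (using the mostly unmaintained ggplot package as in the lecture; plotnine offers a near-identical API, and get_sample is assumed to be the small random-sampling helper from the course library):

```python
from ggplot import ggplot, aes, stat_smooth

# 500 random rows are plenty; more points would look pixel-identical anyway.
x_plot = get_sample(x_all, 500)

# aes() names the columns to plot; stat_smooth fits a loess (locally weighted
# regression) curve through the scatter, and se=True draws its confidence band.
ggplot(x_plot, aes('YearMade', 'SalePrice')) + stat_smooth(se=True, method='loess')
```

The resulting smoothed curve is the one being discussed here.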

That's not really what I would expect. I would have expected that stuff made more recently would be more expensive, because of inflation and because they're more current models, and so forth. The problem is that when you look at a univariate relationship like this, there's a whole lot of collinearity going on, a whole lot of interactions being lost.

For example, why did the price drop? Is it actually because things made between 1991 and 1997 are less valuable, or is it because most of them were also sold during that time and maybe there was a recession then? Or maybe during that period a lot more people were buying types of vehicle that were less expensive. There are all kinds of possible reasons. As data scientists, one of the things we'll keep seeing is that at the companies you join, people will come to you with these kinds of univariate charts and say: oh my god, our sales in Chicago have collapsed, or people aren't clicking on this ad anymore, and they'll show you a chart that looks like this and ask what happened. Most of the time you'll find that the answer to the question

"what happened?" is that there's something else going on. For instance: in Chicago last week we were actually running a new promotion, and that's why revenue went down; it's not that people aren't buying stuff in Chicago anymore, it's that prices were lower. So what we really want to be able to ask is: what's the relationship between sale price and YearMade, all other things being equal?

"All other things being equal" basically means: if we sold something in 1990 versus 1980, and it was exactly the same thing, sold to exactly the same person, in exactly the same auction, and so on, what would have been the difference in price? To answer that we use something called a partial dependence plot, and this is one. There's a really nice library that almost nobody has heard of, called PDP, which does these partial dependence plots. What happens is this: we've got our sample of 500 data points, and we're going to do something really interesting.

We're going to take each one of those 500 randomly chosen auctions and make a little data set out of it. So here's our data set of 500 auctions, and here are our columns, one of which is the thing we're interested in: YearMade. What we're going to do is try to create a chart that says: all other things being equal, in 1960, how much did bulldozers cost? How much did things go for at auction?

The way we do that is to replace the YearMade column with 1960: we copy in the value 1960 again and again, all the way down. Now in every row YearMade is 1960, all the other data is exactly the same, and we pass all of this through our random forest to predict the sale price. That will tell us, for everything that was auctioned, how much we think it would have sold for if it had been made in 1960, and that's what we plot.

That thing was made in 1960 and that's what we're going to plot here All right, that's the price we're going to plot here, and then we're going to do the same thing for 1961 All right, we're going to replace all these and do 1961 Yeah Yeah So to be clear We've already fit the random forest yes, and then we're just passing a new year and seeing what it determines The price should be yeah So this is a lot like the way we did feature importance, but rather than randomly shuffling the column We're going to replace the column with a constant value All right, so randomly shuffling the column tells us How accurate it is when you don't use that column it anymore?

replacing the whole column with a constant estimates for us how much we would have sold that product for, in that auction, on that day, in that place, if that product had been made in 1961. We then take the average of all of the sale prices the random forest predicts, do that for 1961, and get this value.
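Before looking at the library, the mechanics just described can be sketched by hand: copy the sample, overwrite YearMade everywhere with one candidate value, predict, and average; repeating over a grid of years traces out the partial dependence curve (a sketch, assuming m is the fitted forest and X_train the processed training features from earlier):

```python
import numpy as np

# Work on processed features only (the forest can't consume the raw frame).
x = X_train[X_train.YearMade > 1930].sample(500, random_state=0)

def partial_dependence(m, x, feat, values):
    means = []
    for v in values:
        x_copy = x.copy()
        x_copy[feat] = v                     # pretend every machine was made in year v
        means.append(m.predict(x_copy).mean())
    return np.array(means)

pdp_curve = partial_dependence(m, x, 'YearMade', np.arange(1960, 2011))
```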

What the partial dependence plot shows us with these light blue lines (there are all 500 of them) is this: for row number one in our data set, if it had been made in 1960, we index that to zero, so call that zero; if it had been made in 1970, that particular auction would have been here; in 1980, here; in 1990, here. So we actually plot all 500 predictions of how much each of those 500 auctions would have gone for if we replaced YearMade with each of these different values, and this dark line here is the average. So this tells us how much we would have sold,

on average, all of those auctions for, if all of those products had actually been made in 1985, 1990, 1993, 1994, and so forth. And you can see that, at least in the period where we have a reasonable amount of data, which is since 1990, this is basically a straight line. That's what you would expect: if it was sold on the same date, and it was the same kind of tractor, sold to the same person in the same auction house, then you would expect more recent vehicles to be more expensive, because of inflation and because they're newer and less second-hand, and you would expect that relationship to be roughly linear. That's exactly what we're finding. So by removing all of these externalities, it often allows us to see the truth much more clearly. There's a question at the back.

This partial dependence plot concept is something that uses a random forest to give us a clearer interpretation of what's going on in our data. The steps were: first, look at the feature importance to tell us which things we think we care about; then use the partial dependence plot to tell us what's going on, on average.

There's another cool thing we can do with PDP: we can use clusters. What clusters does is use cluster analysis to look at each of the 500 rows and ask: do some of those 500 rows move in the same way? We can kind of see that there's a whole lot of rows that go down and then up, and a bunch that go up and then flatten out, so it does seem like there are different types of behaviour hidden in there. Here is the result of that cluster analysis: we still get the same average, but it also shows the five most common shapes that we see. This is where you could then go in and say: all right,

it looks like for some kinds of vehicle, prices are pretty flat after 1990 and were pretty linear before that, while some kinds of vehicle are exactly the opposite. Different kinds of vehicle have these different shapes, and that's something you could dig into.
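The library referred to here is pdpbox (pip install pdpbox). The lesson notebook wraps it in a small plot_pdp helper along these lines; argument names have shifted between pdpbox versions, so treat this as a sketch of the older API used around the time of the course:

```python
from pdpbox import pdp

def plot_pdp(feat, clusters=None, feat_name=None):
    feat_name = feat_name or feat
    # pdp_isolate does the replace-with-a-constant trick over a grid of values.
    p = pdp.pdp_isolate(m, x, feat)
    return pdp.pdp_plot(p, feat_name, plot_lines=True,
                        cluster=clusters is not None,
                        n_cluster_centers=clusters)

plot_pdp('YearMade')               # light lines per row, dark line = the average
plot_pdp('YearMade', clusters=5)   # cluster the 500 curves into 5 typical shapes
```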

So what are we going to do with this information? Well, the purpose of interpretation is to learn about a data set, and the reason you want to learn about a data set is because you want to do something with it. In this case it's not so much about winning a Kaggle competition; it can help a little, because some of these insights might make you realise you could transform a variable or create an interaction, and obviously feature importance is super important for Kaggle competitions, but this one is much more for real life. This is when you're talking to somebody and you say: those plots you've been showing me, which seemed to say there was a dip in prices for things made between 1990 and 1997, weren't really telling the whole story; prices were actually increasing, and there was something else going on at that time. It's basically the thing that allows you to say: for whatever outcome

I'm trying to drive in my business, this is how something is driving it. For example, if I'm looking at advertising technology and what's driving clicks, I can dig in and say: this is actually how clicks are being driven, this is the variable that's driving them, this is how it's related, and therefore we should change our behaviour in this way. That's really the goal of any model.

I guess there are two possible goals. One goal of a model is just to get the predictions: if you're doing hedge fund trading, you probably just want to know what the price of that equity is going to be; if you're doing insurance, you probably just want to know how many claims that person is going to have. But most of the time you're actually trying to change something about how you do business, how you do marketing, how you do logistics, so the thing you actually care about is how things are related to each other. Sorry, can you explain again: when you scrolled up and were looking at sale price against year made for the entire model, you saw that dip, and you said the dip didn't signify what we thought it did; can you explain why? Yes. This is a classic, boring univariate plot: it's basically taking all of the dots, all of the auctions, plotting year made against sale price, and fitting a rough average through them. It's true that products made between 1992 and 1997, on average, sold for less in our data set. Very often in business you'll hear somebody look at something like this and say: we should stop auctioning equipment made in those years, because we're getting less money for it. But perhaps the truth is that during those years people were just making more small industrial equipment, which you would expect to sell for less, while our profit on it is just as high; or,

during those years there was a recession going on, so it's not that things made in those years would be cheaper now, it's that when we were selling things in those years, they sold for less. If you're trying to actually take some action based on this, you probably don't just care about the fact that things made in those years were cheaper on average; you care about how that impacts

things today. So this approach, where we say: let's try to remove all of these externalities, so that if something is sold on the same day, to the same person, and it's the same kind of vehicle, how does year made actually impact price, basically tells me, for example when I'm deciding what to buy at an auction, that getting a more recent vehicle really does, on average, give you more money, which is not what the naive univariate plot said.

For example, bulldozers made in 2010 are probably not close to the type of bulldozers that were made in 1960, and if you take something so different, like a 2010 bulldozer, and just overwrite it to say it was made in 1960, that may give a poor prediction for that point, because it's so far outside anything the model has seen.

Absolutely, and I think that's a good point. It's a limitation of a random forest: if you've got a data point that sits in a part of the space it hasn't seen before, say people didn't really put air conditioning in bulldozers in 1960, and you're asking how much this bulldozer with air conditioning would have gone for in 1960,

you don't really have any information to know that. This is still the best technique I know of, but it's not perfect, and you kind of hope the trees will still find some useful truth even though they haven't seen that combination of features before; it's something to be aware of. You can also do the same thing with a PDP interaction plot, which is really what I'm trying to get to here: how do saleElapsed and YearMade together impact price? If I do a PDP interaction plot, it shows me saleElapsed versus price, YearMade versus price, and the combination of the two versus price. Remember, this is always the log of price; that's why the numbers look odd. You can see that the combination of saleElapsed and YearMade is, as you would expect:

Oh, sorry. It's The other way around isn't it so the highest prices? Those where there's the least elapsed and the most recent year made So you can see here. There's the univariate relationship between sale elapsed and price And here is the univariate relationship between year made and price And then here is the combination of the two It's enough to see like clearly that these two things are driving price together You can also see these are not like simple diagonal lines, so it's kind of some interesting interaction going on And so based on looking at these plots It's enough to make me think oh, we should maybe put in some kind of interaction term and see what happens So let's come back to that in a moment, but let's just look at a couple more Remember in this case.

Remember that in this case I did one-hot encoding: way back at the top I said max_n_cat=7, so I've got columns like Enclosure_EROPS w AC. If you've got one-hot encoded variables, you can pass an array of them to the PDP plotting function, and it will treat them as a single category.

So in this case I'm going to create a PDP plot of these three categories and call it Enclosure, and I can see that EROPS w AC is, on average, more expensive than EROPS and OROPS; in fact EROPS and OROPS look pretty similar, while EROPS w AC is clearly higher.
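Using the plot_pdp helper sketched earlier, the categorical version just takes the list of one-hot columns (the exact column names below are assumed to be how they come out of proc_df with max_n_cat=7 on this data set):

```python
# Treat the three one-hot Enclosure levels as a single categorical feature.
plot_pdp(['Enclosure_EROPS w AC', 'Enclosure_EROPS', 'Enclosure_OROPS'],
         feat_name='Enclosure')
```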

At this point I'd probably be inclined to hop into Google, type in EROPS and OROPS, and find out what on earth these things are. And here we go: it turns out that EROPS is an enclosed rollover protective structure and OROPS is an open one. If your bulldozer is fully enclosed, then optionally you can also get air conditioning, so this feature is really telling us whether the machine has air conditioning; with an open structure you obviously don't have air conditioning at all. That's what these three levels are, and we've now learnt that, all other things being equal, the same bulldozer, built at the same time and sold at the same time to the same person, is going to be quite a bit more expensive if it has air conditioning than if it doesn't.

So again we're getting this nice ability to interpret the model. Now that I've spent some time with this data set, I've certainly noticed that there are a lot more air-conditioned bulldozers nowadays than there used to be, so there's definitely an interaction between date and enclosure. So, based on that earlier interaction analysis,

I've tried, first of all, setting every YearMade before 1950 to 1950, because it seems to be some kind of missing-value placeholder, and then setting age equal to saleYear minus YearMade. When I run a random forest on that, age is indeed now the single biggest thing, saleElapsed is way back down the list, and YearMade is back down too, so we've used this analysis to find an interaction. But remember, of course, that a random forest can create an interaction itself through having multiple split points, so we shouldn't assume this is actually going to give a better result; in practice, when I looked at my score and my RMSE, adding age was actually a little worse. We'll see more about that later, probably in the next lesson.
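The feature engineering described here is two lines of pandas plus a refit (a sketch; df_raw, df_keep, split_vals, n_trn, y_train, and the rf_feat_importance helper are as used earlier, and the hyperparameters are illustrative):

```python
from sklearn.ensemble import RandomForestRegressor

# Treat pre-1950 values (including the 1000 placeholder) as "1950 or earlier",
# then add an explicit age-at-sale feature.
df_raw.loc[df_raw.YearMade < 1950, 'YearMade'] = 1950
df_keep['age'] = df_raw['age'] = df_raw.saleYear - df_raw.YearMade

X_train, X_valid = split_vals(df_keep, n_trn)
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          max_features=0.6, n_jobs=-1)
m.fit(X_train, y_train)
rf_feat_importance(m, df_keep)   # age now comes out on top
```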

Okay, one last thing: the tree interpreter. This is also in the category of things most people don't know exist, but it's super important: almost pointless for Kaggle competitions, but super important for real life. Here's the idea: let's say you're an insurance company, somebody rings up, you give them a quote, and they say, that's $500 more than last year, why? In general, you've made a prediction from some model and somebody asks why. That's where we use this method called the tree interpreter. What the tree interpreter does is let us take a particular row; in this case we'll pick row number zero. Here is row zero; presumably this is something like YearMade (I don't know what all the codes stand for), but these are all of the columns in row zero. What I can do with the tree interpreter is call ti.predict, passing in my random forest and my row, which would be this particular customer's insurance information, or in this case this particular auction. It gives me back three things: the first is the prediction from the random forest; the second is the bias, which is basically the average sale price across the whole original data set. Remember how a random forest works:

we started with single trees. We haven't got the drawing any more, but remember we started with a single tree in our random forest, and we split it once, then split that once, then split that once, and we asked: what's the average value for the whole data set?

Then, what's the average value for those where the first split was true? And what's the average value where the next split was also true, and so on, until eventually you get down to the leaf nodes, where you've got the average value you predict. So for a single tree you can think of it this way: maybe our final leaf node predicts about 9.1, and maybe the average log sale price for the whole lot is about 10.2, the average across all the auctions, and you can work your way down between those. Let's actually go and run this so we can see it. Let's go back and redraw the single tree; you'll find that in Jupyter notebooks a lot of the things we create, like videos, progress bars, and tree visualisations, don't know how to save themselves into the file, so you'll just see a little string here, and you have to rerun the cell to recreate it. So this was the single tree we created: the whole data set had an average log sale price of 10.2; the subset where the Coupler_System split was true had an average of 10.3; with Coupler_System true and Enclosure less than 2 it was 9.9; and eventually, with ModelID less than 4573 as well, it's 10.2. So you could ask: why did this particular

row, say one that ended up in this leaf node, get a prediction of 10.2? Well, it's because we started at 10.19; then, because of the Coupler_System split, we went from about 10.2 to 10.3, so we added a little bit; then, because Enclosure was less than 2, we went from 10.3 to 9.9, so we subtracted about 0.4; and then, because ModelID was less than 4573, we added a bit more to get back to about 10.2. So with a single tree you can break down why you predicted 10.2: at each one of these decision points we're adding or subtracting a little bit from the value.

10.2 right and it's like at each one of these decision points. We're adding or subtracting a little bit from the value So what we could then do is we could do that for all the trees And then we could take the average so every time we see in closure Did we increase or decrease the value and how much by every time we see model ID did we increase or decrease the value?

and by how much? We take the average of all of those across the trees, and that's what ends up in this thing called contributions. So here are all of our predictors, and here is the contribution of each, sorted: this is telling us that the fact this thing was made in 1999 was what most negatively impacted our prediction, and the fact that the age of the vehicle was 11 years was what most positively impacted it. A student points out: I think you actually need to sort after you zip them together; the values seem to be sorted on their own.

They seem to be sorted negative point five No, the values are sorted, but then they're just reassigned to the columns in the original order Which is why in 11 your own tractor is what's most thank you. Thank you That makes perfect sense Yes, we need to do an index sort Okay Thank you.

We'll make sure we fix that by next week: we need to sort the columns by the same index as the contributions. Then there's this thing called the bias, and the bias is just the average before we do any splits. So you basically start with the average log of the value, and then as we go down each tree, each time we see YearMade it has some impact, Coupler_System some impact, ProductSize some impact, and so forth.
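For reference, the tree interpreter call and the index sort being discussed look roughly like this (treeinterpreter is a small pip-installable package; m is assumed to be the forest fitted on df_keep's columns and X_valid the matching validation set):

```python
from treeinterpreter import treeinterpreter as ti
import numpy as np

row = X_valid.values[None, 0]                        # one auction, as a 1-row array
prediction, bias, contributions = ti.predict(m, row)

# For each row: prediction = bias (the pre-split average) + sum of the contributions,
# i.e. np.allclose(prediction[0], bias[0] + contributions[0].sum()) holds.

# Sort values and column names by the *same* index so each contribution is
# paired with the right feature -- the fix discussed above.
idxs = np.argsort(contributions[0])
list(zip(df_keep.columns[idxs], contributions[0][idxs]))
```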

Okay, since we're about out of time, I think we'll come back to the tree interpreter next time. This was the last of our key interpretation points, and the basic idea is that we want some ability not only to tell us about the model as a whole and how it works on average, but also to look at how the model makes predictions for an individual row, and that's what we're doing here.

Okay. Great. Thanks everybody. See you on Thursday