Okay, so welcome back. We're going to start with some review: test sets, training sets, validation sets, and OOB. Something we haven't covered yet, but will cover in more detail later, is cross-validation, and I'm going to talk about that as well. So: we have a data set with a bunch of rows in it, and we've got some dependent variable. What's the difference between machine learning and pretty much any other kind of analytical work? The difference is that in machine learning the thing we care about is the generalization accuracy, or the generalization error, whereas in pretty much everything else all we care about is how well we can map to the observations, full stop. This idea of generalization is the key unique piece of machine learning. So if we want to know whether we're doing a good job of machine learning, we need to know whether we're doing a good job of generalizing. If we don't know that, we know nothing.
By generalizing, do you mean scaling, being able to scale larger? No, I don't mean scaling at all. Scaling is an important thing in many, many areas: okay, we've got something that works on my computer with 10,000 items, now I need to make it work on 10,000 items per second. Scaling matters not just in machine learning but for just about everything we put in production. Generalization is where I say: here is a model that can tell cats from dogs.

I've looked at five pictures of cats and five pictures of dogs, and I've built a model that is perfect, and then I look at a different set of five cats and dogs and it gets them all wrong. In that case, what it learned was not the difference between a cat and a dog, but what those five exact cats and those five exact dogs look like. Or: I've got a model for predicting grocery sales for a particular product, say toilet rolls in New Jersey last month, and then I put it into production and it scales great (in other words, it has great latency and I don't have a high CPU load), but it fails to predict anything well other than toilet rolls in New Jersey; it also turns out it only did well for last month, not the next month. These are all generalization failures. The most common way people check for the ability to generalize is to take a random sample: grab a few rows at random, pull them out into a test set, and build all of the models on the rest of the rows, which we call the training set. Then, at the very end of the modeling process, having got, say, 99% accuracy predicting cats from dogs on the training set, they check that against the test set to make sure the model really does generalize.
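For reference, a minimal sketch of that random holdout, assuming a pandas DataFrame `df` (the function and fraction are just for illustration):

```python
import numpy as np

def split_random(df, test_frac=0.2, seed=42):
    # Shuffle the row positions, then hold out the first test_frac of them as the test set.
    rng = np.random.RandomState(seed)
    idxs = rng.permutation(len(df))
    n_test = int(len(df) * test_frac)
    test_idx, train_idx = idxs[:n_test], idxs[n_test:]
    return df.iloc[train_idx].copy(), df.iloc[test_idx].copy()

train_df, test_df = split_random(df)  # build every model on train_df; touch test_df only at the very end
```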
Now, the problem is: what if it doesn't? Okay, well, I could go back and change some hyperparameters, do some data augmentation, whatever else, to try to create a more generalizable model, and then after doing all that I'll go back and check again, and it's still no good. I'll keep doing this again and again until eventually, after 50 attempts, it does generalize.

But does it really generalize? Or have I just accidentally found the one set of changes that happens to work on that particular test set, because I've tried 50 different things? If something only looks good by coincidence, say 5% of the time, then after 50 attempts it's quite likely I'll accidentally get a good result. So what we generally do is put aside a second data set: we grab a few more rows and put them aside into a validation set, and everything that's not in the validation set or the test set is now the training set. And what we do is train a model, check it against the validation set to see whether it generalizes, do that a few times, until we finally have something we're happy with.
When we think it generalizes successfully based on the validation set, then at the end of the project we check it against the test set. So basically, by making this two-layer validation set plus test set, if one comes out right and the other comes out wrong, you're double-checking your errors? Yes: it's checking whether we have overfit to the validation set. If we use the validation set again and again, then we could end up not with a generalizable set of hyperparameters, but with a set of hyperparameters that just happens to work on the training set and the validation set. So if we try 50 different models against the validation set, and then at the end of all that we check against the test set and it still generalizes well, then we can say, okay,

that's good: we've actually come up with a generalizable model. If it doesn't, that says we've overfit to the validation set, at which point you're in trouble, because you don't have anything left. So the idea is to use effective techniques during the modeling so that doesn't happen; but if it is going to happen, you want to find out about it. You need that test set to be there, because otherwise, when you put it in production and it turns out it doesn't generalize, that would be a really bad outcome: you end up with fewer people clicking on your ads, or selling fewer of your products, or providing car insurance to very risky vehicles, or whatever. Do you ever need to check that the validation set and the test set are consistent with each other, or do you just keep them as they are? If you've done what I've just done here, which is to randomly sample, there's no particular reason to check, as long as they're big enough; but we're going to come back to your question in a different context in just a moment. Now, another trick we've learned for random forests is a way of not needing a validation set, which is to use the OOB error, or OOB score, instead. The idea was: every time we train a tree in a random forest, there's a bunch of observations held out anyway, because that's how we get some of the randomness. So let's calculate our score for each tree based on those held-out samples, and get a score for the forest by averaging, for each row, over the trees that that row was not part of training. The OOB score gives us something that's pretty similar to the validation score, but on average it's a little less good. Can anybody either remember or figure out why, on average, it's a little less good?
It's quite a subtle one. I'm not sure, but is it because of the pre-processing you're doing, so that the OOB score is really reflecting performance on the test set? No: the OOB score is not using the test set at all. The OOB score is using the held-out rows in the training set for each tree. So you're basically testing each tree on some data from the training set? Yes. So do you have the potential of overfitting to that data?

It shouldn't cause overfitting, because each tree is scored on a held-out sample; it's not an overfitting issue. It's quite a subtle one. Ernest, do you want to have a try? Aren't the OOB samples based on bootstrap samples? They are. So then, on average, each tree only grabs about 63% of the rows, so on average the OOB sample is one minus 63%. Exactly.
Yeah, so what's the issue? So then, if that's the case, why would the score be lower than the validation score? Does that imply you're leaving a sort of black hole in the data, data points you're never going to sample and that are never represented by the model? No, that's not true, because each tree is looking at a different set. For the OOB we've got, I don't know, dozens of trees, and in each one

there's a different set of rows that happened to be held out. So when we calculate the OOB score for, say, row three, we say: okay, row three is held out in this tree and this tree, and that's it; so we calculate the prediction from those trees and average those predictions. And with enough trees, each row has roughly a 30, sorry, 40 or so percent chance of being held out of any given tree.

So if you have 50 trees, it's almost certain that every row is going to show up as out-of-bag somewhere. Did you have an idea? With a validation set we can use the whole forest to make the predictions, but here we cannot use the whole forest. Exactly: every row is going to use only a subset of the trees to make its prediction, and with fewer trees we know we get a less accurate prediction.
So that's a subtle one, and if you didn't get it, have a think during the week until you understand why, because it's a really interesting test of your understanding of random forests: why is the OOB score, on average, a little less good than your validation score?

They're both using randomly held-out subsets anyway, so isn't it close enough? Why have a validation set at all when you're using a random forest? If it's a randomly chosen validation set, it's not strictly speaking necessary, but this way you've got several levels of things to test: you can test on the OOB, and when that's working well you can test on the validation set, so that hopefully by the time you check against the test set there are going to be no surprises. So that'll be one good reason.
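For reference, getting the OOB score out of scikit-learn is just a flag; a sketch assuming you already have X_train and y_train:

```python
from sklearn.ensemble import RandomForestRegressor

# oob_score=True makes each row get scored only by the trees that never saw it,
# and the forest reports the R^2 of those out-of-bag predictions.
m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print(m.oob_score_)   # usually a little lower than a true validation score, for the reason just discussed
```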
So that'll be one good reason Then what Kaggle do the way they do this is kind of clever what Kaggle do is they split the test set into two pieces a public and a private and They don't tell you which is rich. So you submit your predictions to Kaggle and then a Random 30% of those are used to tell you the leaderboard score But then at the end of the competition that gets thrown away and they use the other 70% to calculate your real score so What that's doing is that you're making sure that you're not like continually using that feedback from the leaderboard To figure out some set of hyper parameters that happens to do well on the public but actually doesn't generalize Okay, so it's a great test like this is one of the reasons why it's good practice to use Kaggle Because at the end of a competition at some point this will happen to you and you'll drop a hundred places on the leaderboard The last day of the competition when they use the private test set and I say oh Okay, that's what it feels like to overfit and it's much better to Practice and get that sense there than it is to do it in a company where there's hundreds of millions of dollars on the line Okay, so this is like the easiest possible situation where you're able to use a random sample for your validation set Why might I not be able to use a random sample for my validation set In the case of something where we're forecasting we can't randomly sample because we need to maintain the temporal ordering Go on.
What is that? Because it doesn't it doesn't make sense. So in the case of like an ARMA model I I can't use like I can't pull out random rows because there's I'm thinking that there's like a certain dependency or I'm I'm trying to model a certain dependency that relies on like a specific Lag term if I randomly sample those things then that lag term isn't there for me to okay, so it could be like a Technical modeling issue that like I'm using a model that relies on like Yesterday the day before and the day before that and if I've randomly removed some things I don't have yesterday and my model might just fail.
Okay, that's true, but there's a more fundamental issue Do you want to pass it to Tyler? It's a really good point Although you know in general we're going to try to build models that are not that are more resilient than that particularly with Yet temporal order we expect things that are close by in time to be related to things close to them so we so if we destroy the order like if if we destroy the order we Really aren't going to be able to use that this time is close to this other time Um, I don't think that's true because can pull out a random sample for a validation set and still keep everything nicely ordered Well, we would like to predict things in the future Which we would require as much data close to the end of art Okay, that's true.
I mean we could be like limiting the amount of data that we have by taking some of it out But my claim is stronger. My claim is that by using a random validation set We could get totally the wrong idea about our model carob. Do you want to have a try?
So you if our data is imbalanced for example We can if you're randomly sampling it we can only have one class in our validation set so our fitted model may be That's true as well So maybe you're trying to predict in a medical situation Who's going to die of lung cancer and that's only one out of a hundred people and we pick out a validation set that We accidentally have nobody that died of lung cancer That's also true.
These are all Good niche examples, but none of them quite say like why could the validation set just be plain Wrong like give you a totally Inaccurate idea of whether this is going to generalize And so let's talk about and the closest is is is what Tyler was saying about time closeness in time The important thing to remember is when you build a model You're always you always have a systematic error Which is that you're going to use the model at a later time than the time that you built it, right?
You're going to put it into production, by which time the world is different from the world you're in now; and even while you're building the model, you're using data that is older than today anyway. So there's some lag between the data you're building it on and the data it's actually going to be used on, and a lot of the time, if not most of the time, that matters.

So if we're predicting who's going to buy toilet paper in New Jersey, and it takes us two weeks to put it into production, and we built it using data from the last couple of years, then by that time things may look very different. And in particular, if we randomly sampled our validation set from a four-year period, then the vast majority of that data is going to be over a year old, and it may be that the toilet-paper-buying habits of folks in New Jersey have dramatically shifted.

Maybe there's a terrible recession there now and they can't afford high-quality toilet paper anymore, or maybe their paper-making industry has gone through the roof and suddenly they're buying lots more toilet paper because it's so cheap, or whatever. The world changes, and therefore if you use a random sample for your validation set, what you're actually checking is how well you can predict things that are totally obsolete now,
how good you are at predicting things that happened four years ago, and that's not interesting. So what we want to do in practice, any time there's some temporal piece, is instead to say: assuming we've ordered the data by time, so this end is old and this end is new, we take the most recent chunk as our validation set. Or, to do it properly,

the second-most-recent chunk is our validation set and the most recent chunk is our test set. Make sense? So everything before that is our training set, and we use it to try to build a model that still works on data that's later in time than anything the model was built on. So we're not just testing generalization in some abstract sense, but in a very specific temporal sense: does it generalize to the future?
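A sketch of that time-ordered split, assuming the DataFrame has a date column (the column name and chunk sizes here are placeholders):

```python
def split_by_time(df, date_col='saledate', n_valid=12000, n_test=12000):
    # Sort oldest-to-newest, then peel the most recent rows off the end:
    # everything earlier is training, then validation, then test (the newest of all).
    df = df.sort_values(date_col)
    train = df.iloc[:-(n_valid + n_test)]
    valid = df.iloc[-(n_valid + n_test):-n_test]
    test  = df.iloc[-n_test:]
    return train, valid, test
```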
Could you pass it to Siraj, please? So when there is, as you said, some temporal ordering in the data, is it wise to take the entire data for training, or only the most recent data? For the validation and test sets, or for training? I'm talking about training.

Yeah, that's a whole other question. So how do you get the validation set to be good? Say I build a random forest on all the training data; it looks good on the training data, and it looks good on the OOB. This is actually a really good reason to have the OOB: if it looks good on the OOB, it means you're not overfitting in a statistical sense,

it's working well on a random sample. But then it looks bad on the validation set. So what happened? What happened was that you somehow failed to predict the future; you only predicted the past. And Siraj's idea about how we could fix that is: okay, maybe we shouldn't use the whole training set, maybe we should try a recent period only. On the downside, we're now using less data, so we can create less rich models; on the upside, it's more up-to-date data.
It's it's more up-to-date data And this is something you have to play around with most Machine learning functions have the ability to provide a weight that is given to each row So for example with a random forest rather than bootstrapping at random You could have a weight on every row and randomly pick that row with some probability right and we could like say Here's our like probability We could like pick a Curve that looks like that So that the most recent rows have a higher probability of being selected that can work really well Yeah, it's it's something that you have to try and and if you don't have a validation set that represents the future Compared to what you're training on you have no way to know which of your techniques are working How do you make the compromise between an amount of data versus recency of data?
How do you make the compromise between the amount of data and the recency of data? What I tend to do, when I have this kind of temporal issue, which is probably most of the time, is that once I have something working well on the validation set, I don't just use that model on the test set, because the test set is much further in the future compared to the training set. Instead I replicate building that model again, but this time combining the training and validation sets together, and I retrain. At that point you've got no validation set left to test against, so you have to make sure you have a reproducible script or notebook that does exactly the same steps in exactly the same ways, because if you get something wrong, you're going to find out on the test set that you've got a problem. So what I do in practice is this: I need to know whether my validation set is truly representative of the test set. So I build five models on the training set,

trying to have them vary in how good I think they are. Then I score my five models on the validation set, and I also score them on the test set.

I'm not cheating: I'm not using any feedback from the test set to change my hyperparameters; I'm only using it for this one thing, which is to check my validation set. So I get my five scores from the validation set and from the test set, and I check that they fall on a line. If they don't, then you're not going to get good enough feedback from the validation set.

So keep doing that process until you're getting a line, and that can be quite tricky. Trying to create a test set that's as similar to the real-world outcome as possible is difficult, and in the real world the same is true: the test set has to be as close to production as possible. What's the actual mix of customers that are going to be using this?
How much time is there actually going to be between when you build the model and when you put it into production? How often are you going to be able to refresh the model? These are all things to think about when you build that test set. So you're saying: first make five models on the training data, and then, until you get a straight-line relationship, change your validation set and test set? You can't really change the test set, generally; this is assuming the test set is a given, so you change the validation set. So you might start with a random-sample validation set, find the points are all over the place, and realize you should have picked the last two months; then you pick the last two months, it's still all over the place, and you realize you should have picked, say, the first of the month to the fifteenth of the month; and you keep changing your validation set until you've found one that is indicative of your test set results. And the five models, would you start with maybe just random data, then an average, and just make it better and better?

Yeah, exactly, maybe. I'd pick five not-terrible ones, but you want some variety, and you particularly want variety in how well they might generalize through time: one trained on the whole training set, one trained on the last two weeks, one trained on the last six weeks, one that used lots and lots of columns and might overfit a bit more. You want to get a sense of: if my validation set fails to generalize temporally, I want to see that; if it fails to generalize statistically, I want to see that. Sorry, can you explain in a bit more detail what you mean by changing your validation set so it's indicative of the test set? What does that look like?
Sure. Let's take the groceries competition, where we're trying to predict the next two weeks of grocery sales. The possible validation sets Terence and I played with were: a random sample, the last month of data, the last two weeks of data, and the same date range one month earlier. The test set in this competition was, I think, the 15th to the 30th of August. So we tried a random sample across the four years of data;

we tried the 15th of July to the 15th of August; we tried the 1st of August to the 15th of August; and we tried the 15th of July to the 30th of July. So there were four different validation sets. With the random one, our results were all over the place; with the last month, they were not bad but not great; with the last two weeks, there were a couple that didn't look good, but on the whole they were good; and with the same date range one month earlier, we got a basically perfect line.

That's the part I'm asking about: what exactly are you comparing it to from the test set? I'm confused about how you're creating that graph. So, I've built five models, right? There might be: just predict the average; do some simple group mean of the whole data set; do a group mean of the last month of the data set; build a random forest on the whole thing; build a random forest on the last two weeks. For each of those, I calculate the validation score, and then I retrain the model on the whole training set and calculate the same score on the test set. So each of these points tells me how well one model went on the validation set and how well it went on the test set, and so, if the validation set is any good,
we would expect that every time the validation set score improves, the test set score also improves. So you just said retrain, meaning retrain the model on the training and validation sets together? Yeah, that was the step I was describing a moment ago: once I've got the validation score based on just the training set, I retrain on the training plus validation data and check that against the test set.
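A sketch of that calibration check, assuming you already have a list of fitted models and some `score` helper (both names are stand-ins for whatever you're actually using):

```python
import matplotlib.pyplot as plt

val_scores, test_scores = [], []
for m in models:                                     # five models of deliberately varying quality
    val_scores.append(score(m, X_valid, y_valid))
    test_scores.append(score(m, X_test, y_test))     # used ONLY to judge the validation set, never to tune

plt.scatter(val_scores, test_scores)
plt.xlabel('validation score'); plt.ylabel('test score')
plt.show()
# If the points don't fall roughly on a line, the validation set isn't telling you
# anything reliable about the test set: go construct a different validation set.
```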
Somebody else? So just to clarify: by the test set, you mean submitting it to Kaggle and checking the score? If it's Kaggle, then your test set is Kaggle's leaderboard. In the real world, the test set is this third data set that you put aside, and making that third data set reflect real-world production conditions is the most important step in a machine learning project. Why is it the most important step? Because if you screw up everything else but you don't screw that up, you'll know you've screwed up. If you've got a good test set, then you'll know when you've screwed something else up, because you tested it and it didn't work out, and okay, you're not going to destroy the company. If you screwed up creating the test set, that would be awful: you build a model, you test it on the test set, it looks good, but the test set was not indicative of the real-world environment, so you don't actually know whether you're going to destroy the company. Hopefully you've got ways to put things into production gradually, so you won't actually destroy the company, but you'll at least destroy your reputation at work.

It's like: oh, Jeremy tried to put this thing into production, and in the first week the cohort we tried it on had their sales halve, and we're never going to give Jeremy a machine learning job again. Whereas if Jeremy had used a proper test set, he would have known: oh, this is half as good as my validation set said it would be, I'll keep trying, and now I'm not going to get into any trouble.

Instead it's: oh, Jeremy's awesome, he identifies ahead of time when there's going to be a generalization problem. Okay. So this is something that everybody talks about a little bit in machine learning classes, but it often stops at the point where you learn there's a thing in scikit-learn called train_test_split, and it returns these things, and off you go; or: here's the cross-validation function. The fact that these functions always give you random samples tells you that much, if not most, of the time you shouldn't be using them. The fact that a random forest gives you an OOB score for free is useful, but it only tells you that the model generalizes in a statistical sense, not in a practical sense.
So then, finally, there's cross-validation, which you've been talking about a lot outside of class, which makes me feel somebody has been overemphasizing the value of this technique. So I'll explain what cross-validation is, and then I'll explain why you probably shouldn't be using it most of the time. Cross-validation says: let's not just pull out one validation set, let's pull out five, say. We're going to randomly shuffle the data first;

this is critical: we first randomly shuffle the data, and then split it into five groups. Then, for model number one, we call the first group the validation set and the rest the training set; we train, check against the validation set, and get some RMSE, R², whatever. Then we throw that away, call the second group the validation set and the rest the training set, and get another score. We do that five times and take the average. So that's a cross-validation average accuracy. Who can tell me a benefit of using cross-validation over the kind of standard validation set I talked about before? Could you pass the microphone over?

If you have a small data set, then cross-validation will make use of all the data you have. Yeah, you can use all of the data; you don't have to put anything aside. You also get a small extra benefit in that you now have five models you could ensemble together, each of which used 80% of the data, and sometimes that ensembling can be helpful. Fun, could you tell me some reasons that you wouldn't use cross-validation?
If we have enough data, we might not want the validation data to be included in the model training process, so as not to pollute the model? Okay, I'm not sure that cross-validation necessarily pollutes the model. What would be a key downside of cross-validation? Well, for deep learning, if the network has seen the pictures, then it will know the pictures and be more likely to predict them.

Sure, but each time in the cross-validation we have put aside some of the data. Can you pass it to Siraj, just behind you? I'm not so worried about that; I don't think any one of these validation sets is more statistically accurate than another. Yes, Siraj, I think that's what Fun was worried about, and I don't see why it would happen:

each time we fit a model, we are absolutely holding out 20% of the sample. So yes, the five models between them have seen all of the data, but it's a lot like a random forest: each model has only been trained on a subset of the data. Yes, Nishan? If it's a large data set, it will take a lot of time. Yes, exactly: we have to fit five models rather than one, so here's key downside number one:

time. If we're doing deep learning and it takes a day to run, suddenly it now takes five days, or we need five GPUs. Okay, what about my earlier issues about validation sets? Can you pass it over there? What's your name? Jose. So if you had temporal data, wouldn't you be breaking that relation by shuffling? Well, we can unshuffle it afterwards: we could shuffle, get the training set out, and then sort it by time, since presumably there's a date column there. So I don't think the shuffling by itself is going to stop us from building a model. Did you have an idea?
Did you have? With cross-validation you're building five even validation sets And if there's some sort of structure that you're trying to capture in your validation set to mirror your test set You're you're essentially just throwing that a chance to construct that yourself Right, I think you're going to say that I think you said the same thing as I'm going to say which is which is that our earlier concerns about why?
Random validation sets are a problem are entirely relevant here all these validation sets are random So if a random validation set is not appropriate for your problem Most likely because for example of temporal issues then none of these four validation set five validation sets are any good they're all random right and so if you have Temporal data like we did here.
There's no way to do cross-validation really or like probably no good way to do cross-validation. I mean You want to have? Your validation set be as close to the test set as possible And so you can't do that by randomly sampling different things so So as fun said You may well not need to do cross validation because most of the time in the real world We don't really have that little data Right unless your data is based on some very very expensive labeling process or some experiments that take a cost a lot to run or whatever, but nowadays that's Data scientists are not very often doing that kind of work summer in which case this is an issue, but most of us aren't So we probably don't need to as Nishan said if we do do it.
It's going to take a whole lot of time And then as earnest said even if we did do it and we talk up all that time It might give us totally the wrong answer because random validation sets are inappropriate for a problem Okay, so I'm not going to be spending much time on cross-validation because I just I think it's an interesting tool to have It's easy to use.
Scikit-learn has a cross-validation function you can go ahead and use, but in my opinion it's not that often going to be an important part of your toolbox; it'll come up sometimes. Okay, so that is validation sets. The other thing we started talking about last week, and got a little bit stuck on because I screwed it up, was tree interpretation. So I'm going to cover that again, without the error, and dig into it in a bit more detail. Can anybody tell me

what the tree interpreter does and how it does it? Does everybody remember? It's a difficult one to explain; I don't think I did a good job of explaining it, so don't worry if you don't do a great job, but does anybody want to have a go? Okay, that's fine. Let's start with the output of the tree interpreter. If we look at a single model, a single tree in other words: here is a single tree. To remind us, the top of the tree is before there has been any split at all, so 10.189 is the average log price of all of the auctions in our training set.

So I'll draw that here: 10.189 is the average of everything. Then if I take coupler_system ≤ 0.5, I get 10.345: for this subset of about 16,800 rows where coupler_system ≤ 0.5, the average is 10.345. Then, of the rows with coupler_system ≤ 0.5, we take the subset where enclosure ≤ 2, and the average log sale price there is 9.955.

Then the final step in our tree, just for this group with no coupler system and enclosure ≤ 2, is ModelID ≤ 4573, and that gives us 10.226. So we can say: starting from 10.189, the average for everybody in this particular tree's subsample of 20,000, adding in the coupler_system decision increased our prediction by 0.156. If we had predicted with a naive model of just the mean, that would have been 10.189; adding in just the coupler decision changed it to 10.345, so this variable is responsible for a 0.156 increase in our prediction. From there, the enclosure decision was responsible for a 0.395 decrease, and the ModelID decision was responsible for a 0.276 increase, until eventually we reach our final prediction for this particular auction's sale price. We can draw that as what's called a waterfall plot, and waterfall plots are one of the most useful plots I know about. Weirdly enough, there's nothing in Python to do them; this is one of those disconnects between the world of management consulting and business, where everybody uses waterfall plots all the time, and academia, where people have no idea what these things are. Every time you're looking at, say,
here are last year's sales for Apple, and then there was a change: iPhones increased by this amount, Macs decreased by that amount, iPads increased by that amount; any time you have a starting point, a number of changes, and a finishing point, a waterfall chart is pretty much always the best way to show it.

So here, our prediction for price starts at the overall average of 10.189; there was an increase of 0.156 for coupler (blue means increase), a decrease of 0.395 for enclosure, and an increase of 0.276 for ModelID; so increase, decrease, increase, to get to our final prediction of 10.226. That's how a waterfall chart works. In Excel 2016 it's built in: you just click Insert > Waterfall Chart and there it is. If you want to be a hero, create a waterfall chart package for matplotlib, put it on pip, and everybody will love you for it. There are some pretty crappy gists and notebooks around for this, but these are actually super easy to build: you basically do a stacked column plot where the bottom section of each column is white. If you can wrap that up, put the data points in the right spots, and color them nicely, that would be totally awesome; I think you've all got the skills to do it, and it would be a terrific thing for your portfolio. So there's an idea.
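If you want to try it, here's the stacked-bar trick as a minimal matplotlib sketch, using the numbers from the example above (not a polished package, just the idea):

```python
import numpy as np
import matplotlib.pyplot as plt

start = 10.189
steps = [('coupler_system', 0.156), ('enclosure', -0.395), ('ModelID', 0.276)]

labels  = ['all'] + [name for name, _ in steps] + ['prediction']
values  = [start] + [delta for _, delta in steps]
totals  = np.cumsum(values)                       # running total after each step
bottoms = [0] + list(totals[:-1]) + [0]           # where each bar starts (the "white" part)
heights = values + [totals[-1]]                   # final bar shows the prediction from zero

colors = ['grey'] + ['b' if delta >= 0 else 'r' for _, delta in steps] + ['grey']
plt.bar(labels, heights, bottom=bottoms, color=colors)
plt.ylabel('log(SalePrice)')
plt.show()
```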
I think you've all got the skills to do it and would make you know be a terrific thing for your portfolio So there's an idea Could make an interesting cattle kernel even like here's how to build a waterfall plot from scratch and by the way I've put this up on pip you can all use it So in general therefore obviously going from the all and then going through each change Then the sum of all of those is going to be equal to the final prediction So that's how we could say if we were just doing a decision tree Then you know you're coming along and saying like how come this particular option was this particular price?
And it's like well your prediction for it and like oh it's because of these three things had these three impacts, right? so for a random forest We could do that across all of the trees that so every time we see coupler We add up that change every time we see enclosure We add up that change every time we see model we add up that change.
Okay, and so then we combine them all together We get what? Tree interpreter does but so you could go into the source code for tree interpreter, right? It's not at all complex logic or you could build it yourself Right and you can see How it does exactly this so when you go tree interpreter predict with a random first model for some specific?
Auction, so I've got a specific row here. This is my zero index row It tells you okay. This is the prediction the same as the random forest prediction Bias this is going to be always the same. It's the average sale price for for everybody for each of the random samples in the tree and then contributions is The average of sorry the total of all the contributions for each time we see that Specific column appear in a tree Right.
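For reference, the treeinterpreter call looks roughly like this, assuming `m` is the fitted forest and `X_valid` is the feature DataFrame (names are illustrative):

```python
import numpy as np
from treeinterpreter import treeinterpreter as ti

row = X_valid.values[None, 0]                          # one specific auction, kept 2-D
prediction, bias, contributions = ti.predict(m, row)

# bias is the training-sample mean the trees start from; adding this row's
# contributions to it reproduces the forest's prediction exactly.
print(prediction[0], bias[0] + contributions[0].sum())

# np.argsort doesn't sort; it returns the positions that WOULD sort the array,
# which lets us print column, value and contribution side by side, in order.
idxs = np.argsort(contributions[0])
for col, val, contrib in zip(X_valid.columns[idxs],
                             X_valid.iloc[0].values[idxs],
                             contributions[0][idxs]):
    print(col, val, contrib)
```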
Last time I made the mistake of not sorting this correctly, so this time: np.argsort is a super handy function. It doesn't actually sort contributions[0]; it just tells you where each item would move to if it were sorted. So by passing its result as an index into the columns, the levels, and the contributions, I can print them all out in the right order. So I can see here: here's my column, here's the level, and here's the contribution. The fact that it's a small version of this piece of industrial equipment meant it was less expensive,

but the fact that it was made pretty recently meant it was more expensive, and the fact that it's pretty old made it less expensive. So this is not going to help you much in a Kaggle-style situation where you just need predictions, but it's going to help you a lot in a production environment, or even pre-production. Something that any good manager should do, when you say "here's a machine learning model

I think we should use", is to go away, grab a few examples of actual customers or actual auctions or whatever, and check whether your model looks intuitive. And if the model's prediction is that lots and lots of people are going to really enjoy this particular crappy movie,

and they know it was a really crappy movie, then they're going to come back to you and say: explain why your model is telling me I'm going to like this movie, because I hate that movie. And then you can go back and say: well, it's because you liked this other movie, and because you're in this age range and you're this gender, and on average people like you actually did like that movie.

Okay, yes? What's the second element in each tuple? This is saying, for this particular row, that its size class was "Mini", it was 11 years old, and it was a hydraulic excavator, track, 3 to 4 metric tons. It's feeding back what the actual values were: I just went back to the original data to pull out the descriptive version of each one. So if we sum up all the contributions and add them to the bias, that's the same as adding up those three impacts on top of the starting average, and as we know from our waterfall chart, that gives us our final prediction. This is an almost totally unknown technique, and this particular library is almost totally unknown as well, so it's a great opportunity to show something that, in my opinion, is totally critical but is rarely known. So that's the end of the random forest interpretation piece, and hopefully you've now seen enough that when somebody says we can't use modern machine learning techniques because they're black boxes that aren't interpretable, you have enough information to say: you're full of shit.
They're extremely interpretable, and as for the stuff we've just done: try doing that with a linear model, good luck to you. Even where you can do something similar with a linear model, doing it in a way that doesn't give you totally the wrong answer, without you having any idea it's the wrong answer, is going to be a real challenge.

So the last step before we try to build our own random forest is to deal with this tricky issue of extrapolation. In this case, if we look at the accuracy of our most recent model, we still have a big difference between our validation score and our training score.

Actually, in this case it's not too bad: the difference between the OOB and the validation scores is pretty close. If there were a big difference between validation and OOB, I'd be very worried about whether we've dealt with the temporal side of things correctly. Let's have a look at, I think, our most recent model here.
I'd probably stop here But quite often you'll see there's a big difference between your validation score and your OOB score And I want to show you how you would deal with that Particularly because actually we know that the OOB should be a little worse Because it's using less trees so it gives me a sense that we should be able to do a little bit better And so the reason with the way we should be able to do a little bit better is by handling the time component a little bit better so Here's the problem with random forests when it comes to extrapolation when you When you've got a data set That's like you know for got four years of sales data in it and you create your tree Right, and it says like oh if these if it's in some particular store, and it's some particular item And it is on special You know here's the average price right it actually tells us the average price You know over the whole training set which could be pretty old right and so when you then Want to step forward to like what's going to be the price next month?
It's never seen next month and and where else with a kind of a linear model it can find a relationship Between time and price where even though we only had this much data When you then go and predict something in the future it can extrapolate that but a random forest can't do that There's no way if you think about it for a tree to be able to say well next month.
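A toy illustration of that limitation, using nothing but numpy and scikit-learn: on a simple upward trend, the linear model keeps climbing beyond the training range, while the forest can only return an average of leaf values it has already seen.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

t = np.arange(100).reshape(-1, 1)          # "time"
y = 2 * t.ravel() + np.random.randn(100)   # a simple upward trend with a little noise

rf  = RandomForestRegressor(n_estimators=40).fit(t, y)
lin = LinearRegression().fit(t, y)

future = np.array([[150]])                 # a time point beyond anything in training
print(lin.predict(future))    # roughly 300: the linear model extrapolates the trend
print(rf.predict(future))     # roughly 198: the forest just predicts the last values it saw
```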
So there are a few ways to deal with this, and we'll talk about them over the next couple of lessons, but one simple way is just to try to avoid using time variables as predictors, if there's something else we could use that gives us a stronger relationship that's actually going to hold in the future.

So in this case, what I wanted to do was first figure out the difference between our validation set and our training set. If I understand the difference between our validation set and our training set, then that tells me which predictors have a strong temporal component and may therefore be irrelevant by the time I get to the future time period. So I do something really interesting: I create a random forest

where my dependent variable is "is this row in the validation set?". I've gone back to my whole data frame, with the training and validation sets together, and created a new column called is_valid, set to 1, and then for all of the rows in the training set I set it to 0. So it's a new column that is simply: is this in the validation set or not? Then I use that as my dependent variable and build a random forest. This is a random forest not to predict price, but to predict "is this in the validation set or not?". If your variables were not time-dependent, it shouldn't be possible to figure out whether something is in the validation set. This is a great trick on Kaggle, because they often won't tell you whether the test set is a random sample or not; so you can put the test set and the training set together, create a new column called is_test, and see if you can predict it. If you can, you don't have a random sample, which means you have to figure out how to create a validation set from it. In this case I can see that I don't have a random sample, because my validation set can be predicted with a 0.9999 R². And if I then look at the feature importance, the top thing is SalesID, which is really interesting: it tells us very clearly that SalesID is not a random identifier, but is probably set consecutively as time goes on; we just keep increasing the SalesID. saleElapsed is the number of days since the first date in our data set, so not surprisingly it's also a good predictor. Interestingly, MachineID: clearly each machine is being labeled with some consecutive identifier as well. Then there's a big drop-off; don't just look at the order, look at the values: 0.7, 0.1, 0.07, then 0.02 and it stops. These top three are hundreds of times more important than the rest, so let's grab those top three and look at their values in both the training set and the validation set. We can see, for example, that SalesID (I've divided by a thousand in the table) is on average 1.8 million in the training set and 5.8 million in the validation set; you can see and confirm that they're very different.
So let's drop them. After I drop them, let's see whether I can still predict whether something is in the validation set. I still can, with a 0.98 R². Once you remove some things, other things come to the front, and now the top one is, not surprisingly, age: old things are more likely to be in the validation set, because earlier on, in the training set, machines can't be old yet; and YearMade, for the same reason. So we can try removing those as well.

So, where were we? Yes: what we can then do is take SalesID, saleElapsed and MachineID from the first run, and age, YearMade and saleDayofyear from the second, and say: okay, these are all time-dependent features. I still want them in my random forest if they're important; but if they're not important, and there are other non-time-dependent variables that work just as well, then taking them out would be better, because now I'll have a model that generalizes over time better.

So here I go through each one of those features, drop it, retrain a new random forest, and print out the score (the loop is sketched a little further on). Before we do any of that, our score was 0.88 for our validation versus 0.89 OOB. You can see that when I remove SalesID, my validation score goes up, and this is what we're hoping for: we've removed a time-dependent variable, there were other variables that could find similar relationships without the time dependency, so removing it caused our validation score to go up. The OOB didn't go up, because SalesID genuinely is, statistically, a useful predictor; but it's a time-dependent one, and we have a time-dependent validation set. This is really subtle, but it can be really important: it's finding the things that give you a prediction that generalizes across time, and here's how you can see it. So we should remove SalesID for sure. saleElapsed didn't get better when we removed it,
It's trying to find the things that give you a Generalizable time across time prediction, and here's how you can see it so by so it's like okay We should remove sales ID for sure right, but sale elapsed Didn't get better Okay, so we don't want that machine ID did get better from 888 to 893.
It's actually quite a bit better Age Got a bit better Year made got worse sale day of year got a bit better Okay, so now we can say all right. Let's get rid of the three Where we know that getting rid of it actually made it better Okay, and as a result look at this.
We're now up to 9 1 5 Okay, so we've got rid of three time-dependent things and now as expected Validation is better than our OOB Okay, so that was a super successful approach there right and so now we can check the feature importance And let's go ahead and say all right that was pretty damn good.
Let's now Leave it for a while, so give it 160 trees. Let it show and see how that goes Okay, and so as you can see like we did all of our interpretation all of our fine-tuning Basically with smaller models subsets and at the end we run the whole thing it actually still only took 16 seconds And so we've now got an RMSE of 0.21.
Okay, so now we can check that against Kaggle again, we can't we unfortunately this Older competition we're not allowed to enter anymore to see how we would have gone so the best we can do is check Whether it looks like we could have done well based on our validation set So it should be in the right area and yeah based on that we would have come first Okay, so You know I think this is an interesting series of steps right so you can go through the same series of steps in your Kaggle projects and more importantly your real-world projects So one of the challenges is once you leave this learning environment Suddenly you're surrounded by people who they never have enough time.
They've always want you to be in a hurry They're always telling you you know do this and then do that you need to find the time to step away Right and go back because this is a genuine real-world modeling process you can use And it gives when I say it gives world-class results I mean it right like this guy who won this Listergoss sadly he's passed away, but he is the top Kaggle Competitor of all time like he won.
I believe like dozens of competition So if we can get a score even within kooee of him, then we are doing really really well Okay, so let's take a five-minute break, and we're going to come back and build our own random forest I just wanted to clarify something quickly very good point during the break was Going back to the Change in R squared between here and Here it's not just due to the fact that we removed these three predictors We also went reset RF samples right so to actually see the impact of just removing we need to compare it to The final step earlier, so it's actually compared to 907 so removing those three things took us from 107 to 915 okay, so I mean and you know in the end of course what matters is our final model, but yeah, just to clarify Okay So Some of you have asked me about writing your own random forests from scratch I don't know if any of you have given it a try yet my original plan here was to Do it in real time and then as I started to do it I realized that that would have kind of been boring because for you because I screw things up all the time so instead We might do more of like a walk through the code together Just as an aside This reminds me talking about the exam actually somebody asked on the forum about like what what can you expect from the exam?
The basic plan is to make the exam very similar to these notebooks. So it'll probably be a notebook where you have to, you know, get a data set, create a model, train it, compute feature importance, whatever; and the plan is that it'll be open book, open internet, you can use whatever resources you like. So basically, if you've been entering competitions, the exam should be very straightforward. I also expect there will be some pieces like:

here's a partially completed random forest, finish writing this step; or, here's a random forest, implement feature importance, or implement one of the other things we've talked about. So the exam will be much like what we do in class and what you're expected to be doing during the week. There won't be any

"define this" or "tell me the difference between this word and that word"; there's not going to be any rote learning. It will be entirely: are you an effective machine learning practitioner? That is, can you use the algorithms, can you create an effective validation set, and can you create parts of the algorithms,

implementing them from scratch? So it'll be all about writing code, basically. If you're not comfortable writing code to practice machine learning, then you should be practicing that all the time; if you are comfortable, you should be practicing it all the time too. Whatever you're doing, write code to do machine learning. Okay. So I have a particular way of writing code, and I'm not going to claim it's the only way, but it might be a little different from what you're used to, and hopefully you'll find it at least interesting. Implementing a random forest algorithm is actually quite tricky, not because the code is tricky: generally speaking, most random forest algorithms are pretty conceptually easy (academic papers and books have a knack for making them look difficult, but they're not difficult conceptually). What's difficult is getting all the details right and knowing when you're right. So, in other words, we need a good way of doing testing. If we're going to re-implement something that already exists,
So like say we wanted to create a random forest in some different Framework different language different operating system, you know, I would always start with something that does exist, right? So in this case, we're just going to do as a learning exercise writing a random forest in Python So for testing I'm going to compare it to an existing random forest implementation Okay, so that's like critical any time you're doing anything involving non-trivial amounts of code in machine learning Knowing whether you've got it right or wrong is kind of the hardest bit I always assume that I've screwed everything up at every step and so I'm thinking like okay assuming that I screwed it up How do I figure out that I screwed it up?
Right and then much to my surprise from time to time I actually get something right and then I can move on But most of the time I get it wrong so Unfortunately with machine learning, there's a lot of ways you can get things wrong that don't give you an error They just make your result like slightly less good And so that's that's what you want to pick up So given that I want to kind of compare it to an existing implementation I'm going to use our existing data set our existing validation set and then to simplify things and just going to use two columns to start with So let's go ahead and start writing a random forest.
So my way of writing Nearly all code is top-down just like my teaching and so by top-down I start by assuming That everything I want already exists Right. So in other words, the first thing I want to do I'm going to call this a tree ensemble All right, so to create a random forest the first question I have is What do I need to pass in?
Right. What do I need to initialize my random first? So I'm going to need some independent variables some dependent variable Pick how many trees I want I'm going to use the sample size parameter from the start here So how big you want each sample to be and then maybe some optional parameter of what's the smallest leaf size?
Okay For testing it's nice to use a constant random seed. So we'll get the same result each time So this is just how you set a random seed, okay? Maybe it's worth mentioning this for those of you unfamiliar with it Random number generators on computers aren't random at all.
They're actually called pseudo random number generators and what they do is given some initial starting point in this case 42 a Pseudo random number generator is a mathematical function that generates a deterministic always the same sequence of numbers Such that those numbers are designed to be as uncorrelated with the previous number as possible And as unpredictable as possible and As uncorrelated as possible with something with a different random seed So the second number in in the sequence starting with 42 should be very different the second number starting with 41 And generally they involve kind of like taking you know You know using big prime numbers and taking mods and stuff like that.
It's kind of an interesting area of math. If you want genuinely random numbers, the only real way to get them is to buy hardware — a hardware random number generator — which might contain a little bit of some radioactive substance and something that detects how many particles it's spitting out, or some other physical source.
"Is getting the current system time a valid random-number-generation process?" That would be a way of choosing the random seed — this question of what we start the function with. One of the really interesting areas is: on your computer, if you don't set the random seed, what does it get set to? Quite often people use the current time. For security this matters, because we use a lot of randomness for security — if you're generating an SSH key, it needs to be random. It turns out people can figure out roughly when you created a key: they can look at the timestamp on id_rsa, try all the different nanosecond starting points for a random number generator around that time, and figure out your key. So in practice, a lot of applications that require high-quality randomness have a step that says "please move your mouse and type random stuff at the keyboard for a while" — it gets you to be a source of entropy. Other approaches look at things like the hash of some of your log files, and so on.
It's a really fun area. In our case, though, the purpose is actually to remove randomness: we're saying, generate a series of pseudo-random numbers starting from 42, so the results are always the same. Now, if you haven't done much Python: when you pass in, say, four or five things that you want to keep inside the object, you basically have to say self.x = x, self.y = y, self.sample_sz = sample_sz, and so on — but we can assign a tuple to a tuple and do it all on one line. This is my way of coding, at least; most people don't write it this way, and some think it's horrible, but I prefer to be able to see everything at once. In my code, any time I see a line that looks like this, I know it's all of the stuff passed to the method being saved; if I did it a different way, half the code would fall off the bottom of the page and you couldn't see it.
So that was the first thing I thought about: to create a random forest, what information do you need? Then I need to store that information inside my object, and then I need to create some trees — a random forest is something that has some trees. So I figured: a list comprehension to create a list of trees. How many trees? We've got n_trees, that's what we asked for, so range(n_trees) gives me the numbers from 0 up to n_trees minus 1, and if I create a list comprehension that loops through that range calling create_tree each time, I now have n_trees trees. To write that I didn't have to think at all — it's all obvious.
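Putting that together, here's roughly what the constructor described so far looks like — a sketch, with sample_sz and min_leaf as my spellings of the parameter names, and create_tree still to be written below:

```python
import numpy as np

class TreeEnsemble():
    def __init__(self, x, y, n_trees, sample_sz, min_leaf=5):
        np.random.seed(42)   # constant seed purely so our test runs are repeatable
        self.x, self.y, self.sample_sz, self.min_leaf = x, y, sample_sz, min_leaf
        self.trees = [self.create_tree() for i in range(n_trees)]
```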
So I've delayed the thinking to the point where it's like: wait, we don't have anything that creates a tree yet. Okay, no worries — let's pretend we did. If we did, we've now created a random forest. We still need to do a few things on top of that; for example, once we have it, we need a predict function. So let's write a predict function. How do you predict in a random forest? Can somebody tell me, either from their own understanding or based on this line of code — what would be your one-or-two-sentence answer for how you make a prediction in a random forest?
Spencer: you'd want to go over every tree, for the row you're trying to predict on, and average the values that each tree produces for it. Right — and that's a summary of what this line says. For a particular row (or maybe a number of rows), go through each tree and calculate its prediction. Here is a list comprehension that calculates every tree's prediction for x — I don't know whether x is one row or multiple rows, and it doesn't matter, as long as tree.predict works on it. Then, once you've got a list of things, a cool trick to know is that you can pass np.mean a regular non-NumPy list and it'll take the mean; you just need to tell it axis=0, which means average across the lists. So this returns the average of .predict over the trees. I find list comprehensions let me write code the way my brain works: you could take the words Spencer said and translate them into this code, or take this code and translate it into words like the ones Spencer said. When I write code I want it to be as much like that as possible — I want it to be readable — and hopefully you'll find, when you look at the fast.ai code and try to understand how I did something, that I've written it in a way you can read and kind of turn into English in your head.
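So the predict method being described is essentially a one-liner inside TreeEnsemble — a sketch, assuming each tree will eventually have its own predict:

```python
    # inside TreeEnsemble
    def predict(self, x):
        # one prediction per tree, averaged across the ensemble (axis=0 averages across trees)
        return np.mean([t.predict(x) for t in self.trees], axis=0)
```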
Question: if I'm seeing it correctly, is that predict method recursive? No — it's calling tree.predict, and we haven't written the tree yet. self.trees is going to contain tree objects, so this is TreeEnsemble.predict, and what's inside self.trees is a tree, not a tree ensemble; so it's calling DecisionTree.predict, not TreeEnsemble.predict. Good question. Okay, so we've nearly finished writing a random forest, haven't we? All we need to do now is write create_tree.
So, based on this code here, or on your own understanding of how we create trees in a random forest — can somebody tell me? Take a few seconds, have a read, have a think, and then try to come up with a way of saying: how do you create a tree in a random forest? Okay, who wants to tell me? Tyler: you're essentially taking a random sample of the original data and then just constructing a tree, however that happens. Right — construct a decision tree, a non-random tree, from a random sample of the data. So again we've delayed any actual thought process here.
We've basically said: okay, we could pick some random IDs. This is a good trick to know: if you call np.random.permutation passing in an int, it gives you back a randomly shuffled sequence of the integers from 0 up to (but not including) that int, and if you then grab the first [:n] items of that, you've got a random subsample. So this is not doing bootstrapping — we're not sampling with replacement here — which I think is fine; for my random forest I'm deciding it's going to be something where we do subsampling, not bootstrapping. This is a good line of code to know how to write, because it comes up all the time: in machine learning, most of the algorithms I use are somewhat random, so I often need some kind of random sample.
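Here's a sketch of create_tree along those lines, assuming x is a DataFrame (hence .iloc) and y is a NumPy array; the DecisionTree it hands off to is defined further down:

```python
    # inside TreeEnsemble
    def create_tree(self):
        # shuffle 0..len(y)-1 and keep the first sample_sz ids: a subsample, not a bootstrap
        # (a bootstrap would draw with replacement, e.g.
        #  np.random.choice(len(self.y), self.sample_sz, replace=True))
        rnd_idxs = np.random.permutation(len(self.y))[:self.sample_sz]
        return DecisionTree(self.x.iloc[rnd_idxs], self.y[rnd_idxs],
                            idxs=np.arange(self.sample_sz), min_leaf=self.min_leaf)
```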
"Won't that give you one extra, because you said it goes from zero to the length?" No — if len(self.y) is n, this gives you a sequence of length n, i.e. 0 to n minus 1, and from that I'm picking out [:self.sample_sz], the first sample_sz IDs. "I have a comment on bootstrapping: I think this method is better, because with bootstrapping we have a chance of giving more weight to some observations — or am I thinking about it wrong?" I think what you mean is that when we bootstrap with replacement, we can end up with a single observation and duplicates of it in the same tree, so some observations get weighted more than we intended. Yes, it does feel weird, but I'm not sure the theory or the empirical results actually back up the intuition that it's worse.
It would be interesting to look back at that, actually. Personally I prefer this approach because I feel like most of the time we have more data than we want to put into a tree at once. Back when Breiman created random forests, in 1999, it was a very different world — we pretty much always wanted to use all the data we had. Nowadays that's generally not what we want: we normally have too much data. What people tend to do is fire up a Spark cluster and run things on hundreds of machines, when it makes no sense — if they had just used a subsample each time, they could have done it on one machine. Spark has a huge amount of I/O overhead; I know you're doing distributed computing now, and if you've looked at some of the benchmarks, doing something on a single machine can often be hundreds of times faster, because you don't have all that I/O overhead. It also tends to be easier to write the algorithms — you can use scikit-learn — easier to visualize, cheaper, and so forth. So I almost always avoid distributed computing, and I have my whole life: even 25 years ago, when I was starting in machine learning, I still didn't use clusters, because I always feel that whatever I could do with a cluster now, I could do with a single machine in five years' time. So why not focus on always being as good as possible with a single machine? That's more interactive, more iterative, and it works for me.
Okay, so again we've delayed thinking to the point where we have to write the decision tree. Hopefully you're getting a sense of this top-down approach: the goal is to keep delaying thinking for so long that we delay it forever — eventually we've somehow written the whole thing without actually having to think. That's kind of what I need, because I'm kind of slow, and that's why I write code this way. Notice you never have to design anything: you just say, hey, what if somebody already gave me the exact API I needed?
How would I use it? Then, to implement that next stage, what would be the exact API I would need? You keep going down until eventually you say: oh, that already exists. Okay, so this assumes we've got a class for the decision tree.
So we're going to have to create that. For the DecisionTree, we already know what we're going to have to pass it, because we just passed it: a random sample of x's and a random sample of y's. The indexes are me planning ahead a tiny bit, because we'll need them down the track: a decision tree is going to contain decision trees, which themselves contain decision trees, and as we go down the tree there's some subset of the original data that each node has got — so I'm going to pass in the indexes of the data we're actually going to use at that point. Initially it's the entire random sample: I've got the whole range, which I turn into an array, so the indexes run from 0 to the size of the sample. Then we just pass down the min_leaf size. So everything we got for constructing the random forest we pass down to the decision tree, except of course n_trees, which is irrelevant to a single tree. Now that we know that's the information we need, we can go ahead and store it inside this object. I'm pretty likely to need to know how many rows we have in this tree, which I generally call n, and how many columns, which I generally call c. The number of rows is just the number of indexes we were given, and the number of columns is however many columns there are in our independent variables. Then we're going to need this value here: we need to know, for this tree, what its prediction is.
The prediction for this tree is the mean of the dependent variable over the indexes that are inside this part of the tree. At the very top of the tree, that's all of the indexes. Remember that by the time we've got to this point we've already done the random sampling — so when we're talking about indexes, we're not talking about the random sampling used to create the tree; we're assuming this tree already has its random sample inside it. That's one of the nice things: inside DecisionTree, the whole random-sampling business is gone — that was done by the random forest. At this point we're building something that's just a plain old decision tree; it's not in any way a random-sampling anything. So the indexes are literally: which subset of the data have we got to so far in this tree? At the top of the decision tree it's all of the data, so it's all of the indexes, which means y[idxs] is all of the dependent variable in this part of the tree, and the node's value is the mean of that. Does that make sense?
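So the DecisionTree constructor being described just stores what it's given and computes its own prediction — a sketch of that, with no splitting logic yet:

```python
class DecisionTree():
    def __init__(self, x, y, idxs, min_leaf=5):
        self.x, self.y, self.idxs, self.min_leaf = x, y, idxs, min_leaf
        self.n, self.c = len(idxs), x.shape[1]   # rows in this node, columns in the independents
        self.val = np.mean(y[idxs])              # this node's prediction: the mean of its y values
```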
Anybody got any questions about that? "Just to let you know, a large portion of us don't have OOP experience — a quick OOP primer would be helpful." Okay, sure. Who has done object-oriented programming in some programming language? Well, you've all actually used lots of object-oriented programming, in the sense of using existing classes. Every time we've created a random forest, we've called the random forest's constructor, it's returned an object, and then we've called methods and read attributes on that object. fit is a method — you can tell because it has parentheses after it — whereas oob_score_ is a property or attribute, with no parentheses after it. So inside an object there are two kinds of things: the functions you can call, which look like object.function(arguments), and the properties or attributes you can grab, which are object. followed by just the attribute name, with no parentheses. The other thing we do with objects is create them: we pass in the name of a class, along with all the parameters necessary to construct it, and it returns us the object.
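In scikit-learn terms, that looks like this — X_train and y_train stand in for whatever data you have loaded, and oob_score_ only exists because we asked for it in the constructor:

```python
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_estimators=40, oob_score=True)   # constructor: returns an object
m.fit(X_train, y_train)    # method: parentheses, it *does* something
m.oob_score_               # attribute: no parentheses, just a value stored on the object
```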
So let's copy this code and see how we're going to build it. The first step: instead of m = RandomForestRegressor(...), we're going to write m = TreeEnsemble(...) — we're creating a class called TreeEnsemble and passing in various bits of information. Maybe we'll have ten trees, a sample size of a thousand, and a min_leaf of three. You can always choose whether or not to name your arguments; when you've got quite a few it's nice to name them, just so we can see what each one means, but it's optional. So we're going to try to create a class we can use like this. I don't think we'll bother with .fit, because we've already passed in the x and the y — in scikit-learn they use an approach where you first construct something without telling it what data to use, and then you pass in the data with fit. We're doing those two steps at once: we're passing in the data right in the constructor. Then after that we'll go preds = m.predict(...), passing in maybe some validation set. So that's the API we're creating here.
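Concretely, the interface we're aiming at looks something like this — X_train, y_train, and X_valid are placeholders for data already loaded in the notebook:

```python
m = TreeEnsemble(X_train, y_train, n_trees=10, sample_sz=1000, min_leaf=3)
preds = m.predict(X_valid)
```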
The thing that creates an object is called a constructor, and in Python — one of the language's uglier corners — constructors use a special magic method name. __init__, underscore-underscore-init-underscore-underscore, which people pronounce "dunder init", is the special magic method that gets called when you construct an instance of a class: when I write TreeEnsemble(...), it actually calls TreeEnsemble.__init__. That's why we've got this method called dunder init. Another hideously ugly thing about Python's OO is that inside a class — and to create a class you just write class and put your methods inside it — every method automatically gets sent one extra argument, which is the first argument. You can call it anything you like, but if you call it anything other than self, everybody will hate you and you're a bad person. So call it anything you like, as long as it's self. That's why you always see this — and in fact, in the first version of predict I typed, I had a bug: anybody see it? I'd forgotten self. I do that all the time. Any time you call a method on your own class and get an error saying you passed in two arguments when only one was expected, you forgot self. This is a really clunky way to add OOP to a programming language, but older languages like Python often did it this way because they kind of had to: they started out not being object-oriented, and then added OO in a way that was pretty ugly.
Oh in a way that was hideously ugly So Pell which predates Python by a little bit kind of I think really came up with this approach and unfortunately Other languages of that era stuck with it So you have to add in this magic self. So the magic self now When you're inside this class You can now pretend as if any property name you like exists So I can now pretend there's something called self dot X.
I can read from it I can write to it right, but if I read from it, and I haven't yet written to it. I'll get an error So the stuff that's passed to the constructor Gets thrown away by default like there's nothing that like says you need to this class needs to remember what these things are But anything that we stick inside self it's remembered for all time You know as long as this object exists.
In fact, let's do this: let's create the TreeEnsemble class and instantiate it. Of course we haven't got x — we need to pass X_train and y_train. Then we get "DecisionTree is not defined", so let's create a really minimal DecisionTree. There we go — that's enough to actually instantiate our TreeEnsemble. We had to define the __init__ for it, and we had to define the __init__ for DecisionTree, because inside our ensemble's __init__ we call self.create_tree(), and self.create_tree() calls the DecisionTree constructor, which basically does nothing at all other than save some information. So at this point we can type m. — and if I press Tab here, can anybody tell me what I'd expect to see?
Taylor: we'd see a drop-down of all the available methods for that class — so if m is a TreeEnsemble, we'd have create_tree and predict. Anything else? As Ernest whispered: the variables as well. "Variable" could mean a lot of things, so let's say the attributes — the things we put inside self. And if I hit Tab, there they are: as Taylor said, there's create_tree, there's predict, and then there's everything else we put inside self. So if I look at m.min_leaf and hit Shift-Enter, what will I see?
The number I just put there: I passed min_leaf=3, so that went into min_leaf up here. This here is a default argument — if I don't pass anything it'll be 5 — but I did pass something, so self.min_leaf is going to be equal to the min_leaf that was passed in, which is 3. Because of this rather annoying way of doing OO, it's very easy to accidentally forget that assignment: if I don't assign it to self.min_leaf, then I get an error telling me TreeEnsemble has no attribute min_leaf. So how do I create that attribute? I just assign something to it. If you don't know what its value should be yet, but you need to be able to refer to it, you can always write self.min_leaf = None — at least then it's something you can read, check for None-ness, and not get an error.
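Interactively, that poking around looks roughly like this (m is the TreeEnsemble we just instantiated):

```python
m.min_leaf     # -> 3, the value we passed in (the default of 5 only kicks in if we pass nothing)

# If __init__ had skipped the assignment `self.min_leaf = min_leaf`, reading it would raise:
#   AttributeError: 'TreeEnsemble' object has no attribute 'min_leaf'
# A placeholder assignment inside __init__ avoids that when the real value isn't known yet:
#   self.min_leaf = None
```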
Now, interestingly, I was able to instantiate TreeEnsemble even though predict refers to a method of DecisionTree that doesn't exist yet. That's a very nice consequence of the dynamic nature of Python: because it's not compiled, it doesn't check anything until you actually use it. So we can go ahead and define DecisionTree.predict later, and our already-instantiated object will magically start working — a method's details aren't looked up until you call it, which really helps with top-down programming. Also: when you're inside a class definition — in other words, at that indentation level, indented one level in — any function you create (unless you do some special things we're not going to talk about yet) is automatically a method of that class, and every method of that class magically gets self passed to it. So, since we've got a TreeEnsemble, we can call m.create_tree(), and we don't put anything inside those parentheses, because the magic self will be passed, and the magic self will be whatever m is.
Just like we asked it to right so M dot create tree dot IDXS Will give us the self dot IDXS inside the decision tree Okay, which is set to NP dot arrange range self dot sample size Why is data scientists do we care about object oriented programming? Because a lot of the stuff you use is going to require you to implement stuff with OOP, for example every single PyTorch model of any kind is Created with OOP.
Why, as data scientists, do we care about object-oriented programming? Because a lot of the stuff you'll use requires you to implement things with OOP — for example, every single PyTorch model of any kind is created with OOP; it's the only way to create PyTorch models. The good news is that what you see here is the entirety of what you need to know: create something called __init__, assign the things passed to it onto something called self, and put self as the first parameter of each of your methods. The nice thing — this is what it means to start thinking like an OOP programmer — is that you don't have to pass x, y, sample_sz, and min_leaf around to every function that uses them; by assigning them to attributes of self, they're available everywhere like magic. That's why OOP is super handy. When I first tried to create a decision tree without using OOP, keeping track of what that decision tree was meant to know about was very difficult; whereas with OOP you can just say, inside the decision tree, self.idxs equals this, and everything just works.
Okay, that's great — we're out of time, and that's good timing. That was an introduction to OOP, but by next class I'm going to assume you can use it. So this week you should create some classes, instantiate them, look at their methods and attributes, have them call each other, and so forth, until you feel comfortable with them. And for those of you who haven't done OOP before, if you find some other useful resources, please pop them onto the wiki thread so other people know what you found useful. Thanks, everybody.