
Intro to Machine Learning: Lesson 5


Chapters

0:00 Review: test, training, and validation sets; OOB
38:05 Random Forest model interpretation
44:26 Tree interpreter
52:19 Extrapolation


00:00:00.000 | Okay, so welcome back, so we're going to start by doing some review, and we're going to talk about
00:00:05.600 | test sets
00:00:08.520 | training sets
00:00:10.360 | validation sets and OOB
00:00:12.360 | Something we haven't covered yet, but we will cover in more detail later is also cross validation
00:00:19.640 | But I'm going to talk about that as well, right so
00:00:22.240 | We have a data set
00:00:24.960 | With a bunch of rows in it and
00:00:29.160 | We've got some dependent variable
00:00:33.640 | so what's the difference between
00:00:35.640 | like machine learning and
00:00:38.680 | Kind of pretty much any other kind of work? The difference is that in machine learning the thing we care about is
00:00:47.600 | the generalization accuracy or the generalization error, whereas in pretty much everything else all we care about is
00:00:56.720 | how well we could have mapped to the observations, full stop. And
00:01:01.040 | so this this thing about
00:01:03.600 | generalization is the key unique piece of
00:01:06.720 | machine learning
00:01:08.960 | And so if we want to know whether we're doing a good job of machine learning
00:01:13.600 | We need to know whether we're doing a good job of generalizing if we don't know that
00:01:18.400 | We know nothing, right?
00:01:21.200 | By generalizing do you mean like scaling being able to scale larger?
00:01:29.360 | No, I don't mean scaling at all so scaling is an important thing in many many areas
00:01:36.400 | It's like okay. We've got something that works
00:01:38.400 | on my computer with 10,000
00:01:42.000 | items; now I need to make it work on 10,000 items per second or something, so scaling is important
00:01:49.000 | But not just a machine learning for just about everything we put in production
00:01:52.520 | Generalization is where I say okay here is a model that can predict
00:01:59.720 | Cats from dogs. I've looked at five pictures of cats five pictures of dogs, and I've built a model that is perfect and
00:02:07.160 | Then I look at a different set of five cats and dogs and it gets them all wrong
00:02:11.840 | So in that case what it learned was not the difference between a cat and a dog
00:02:15.600 | but what those five exact cats look like and what those five exact dogs look like. Or I've got a model of
00:02:22.280 | predicting
00:02:24.960 | grocery sales
00:02:26.640 | for a particular
00:02:28.640 | Product so for toilet rolls
00:02:30.960 | in New Jersey last month
00:02:34.000 | And then I go and put it into production and it scales great in other words it has a great latency
00:02:41.320 | I don't have a high CPU load, but it fails to predict anything well other than
00:02:46.960 | Toilet rolls in New Jersey it also turns out it only did it well for last month not the next month, so these are all generalization
00:02:54.140 | failures
00:02:58.520 | The most common way that people check for the ability to generalize is
00:03:04.520 | To create a random sample, so they'll grab a few rows at random
00:03:11.640 | pull it out
00:03:13.640 | into a test set and
00:03:16.640 | then they'll build all of their models on the rest of the rows and
00:03:21.800 | Then when they're finished they'll check the accuracy they got there against the test set.
00:03:28.160 | So the rest of the rows are called the training set — everything else,
00:03:31.240 | everything
00:03:36.360 | that's not in the test set, we could call the training set
00:03:41.720 | So at the end of their modeling process on the training set they got an accuracy of
00:03:46.200 | 99% of predicting cats from dogs at the very end they check it against a test set to make sure that the model really
00:03:52.880 | does generalize
00:03:54.160 | now the problem is
00:03:56.160 | What if it doesn't?
00:03:58.160 | Right, so okay, well I could go back and change some hyper parameters do some data augmentation
00:04:03.560 | Whatever else try to create a more generalizable model, and then I'll go back again
00:04:08.760 | After doing all that and check and it's still no good
00:04:11.920 | And I'll keep doing this again and again until eventually, after 50 attempts, it does generalize
00:04:19.440 | But does it really generalize because maybe all I've done is
00:04:23.320 | Accidentally found this one which happens to work just for that test set because I've tried 50 different things
00:04:28.400 | Right, and so if I've got something which is like right coincidentally
00:04:35.040 | 0.05 — 5% of the time — then after 50 tries it's not unlikely that I'll accidentally get a good result
00:04:40.520 | So what we generally do is we put aside a second data set
00:04:46.160 | We grab a couple more of these rows and put these aside into a validation set,
00:04:54.480 | a validation set, right? And then everything that's not in the validation or test set is now training, and so what we do is we train a model,
00:05:03.760 | Check it against the validation to see if it generalizes
00:05:06.440 | Do that a few times and then when we finally got something where we're like okay?
00:05:11.200 | We think this generalizes successfully based on the validation set and then at the end of the project we check it against the test set
00:05:19.600 | So basically by making this two-layer test set / validation set, if it gets one right and the other one
00:05:24.520 | wrong, you're kind of double-checking your errors, kind of like it?
00:05:27.840 | It's checking that we have an over fit to the validation set so if we're using the validation set again and again
00:05:34.040 | Then we could end up not coming up with a generalizable set of hyper parameters
00:05:38.640 | But a set of hyper parameters that just so happen to work on the training set and the validation set
00:05:45.480 | So if we try 50 different models
00:05:47.760 | Against the validation set and then at the end of all that we then check that against the test set and it's still
00:05:57.480 | generalizing well, then we're kind of going to say, okay, that's good
00:06:00.240 | We've actually come up with generalizable model if it doesn't then that's going to say okay
00:06:04.400 | We've actually now over fit to the validation set at which point you're kind of in trouble, right because
00:06:09.960 | You don't you know you don't have anything left
00:06:13.120 | Behind right so the idea is to use effective
00:06:17.880 | Techniques during the modeling so that so that doesn't happen right, but but if it's going to happen you want to find out about
00:06:24.040 | it like you need that test set to be there because otherwise when you put it in production and
00:06:28.800 | Then it turns out that it doesn't generalize that would be a really bad outcome right you end up with
00:06:34.420 | Less people clicking on your ads or selling less of your products or providing car insurance to very risky vehicles or whatever
00:06:42.340 | Just to make sure — do you ever need to check if the validation set and the test set are coherent, or do you just keep
00:06:52.640 | So if you've done what I've just done here, which is to randomly sample
00:06:56.420 | There's no particular reason to check, as long as they're big enough
00:06:59.920 | Right, but we're going to come back to your question in a different context in just a moment
00:07:09.520 | Another trick we've learned for random forests is a way of
00:07:13.960 | not needing a validation set, and the way that we learned was to instead use the OOB
00:07:22.640 | Error or the OOB score and so this idea was to say well every time we train a tree in a random
00:07:30.740 | Forest there's a bunch of observations that are held out anyway because that's how we get some of the randomness and so let's
00:07:37.420 | Calculate our score for each tree based on those held out samples and therefore the forest by averaging the trees that that each
00:07:46.560 | Row was not part of training
00:07:52.240 | And so the OOB score gives us something which is pretty similar to the
00:07:58.120 | validation score
00:08:00.880 | But on average, it's a little less good
00:08:04.960 | Can anybody either remember or figure out why on average it's a little less good?
00:08:09.600 | Quite a subtle one
00:08:13.840 | I'm not sure, but is it because you are doing every
00:08:21.760 | kind of
00:08:23.680 | pre-processing on your test, and so the OOB score is reflecting the performance on the testing set?
00:08:31.000 | No, so the OOB score is not using the test set at all
00:08:34.360 | The OOB score is using the held out rows in the training set for each tree
00:08:39.360 | So I mean the you are basically testing each tree on some data from the training set
00:08:45.200 | Yes, so you have the potential of overfitting the data?
00:08:51.040 | It shouldn't cause overfitting, because each one is looking at a held out
00:08:55.820 | sample, so it's not an overfitting issue. It's quite a subtle issue. Ernest, do you want to have a try?
00:09:01.220 | Aren't the samples from OOB
00:09:06.440 | bootstrap samples? They are. So then, on average, they only grab 63% of the rows, right?
00:09:12.920 | So on average the OOB is one minus 63%. Exactly. Yeah, what's the issue?
00:09:18.600 | So then why would the score be lower than the validation score? That implies that you're leaving
00:09:24.380 | Sort of like a black hole in the data that there's like data points
00:09:26.960 | You're never going to sample and they're not gonna be represented by the model
00:09:29.200 | No, that's not true though because each tree is looking at a different set right so the OOB so like we've got like I
00:09:35.320 | Don't know dozens of models right and in each one. There's a different set of
00:09:43.120 | Which which happened to be held out?
00:09:45.120 | right
00:09:48.360 | And so when we calculate the OOB score for like let's say row three
00:09:53.180 | We say okay, row three was held out of this tree and this tree, and that's it, and so we calculate
00:09:58.880 | the prediction on that tree and for that tree, and we average those two predictions, and so with enough trees
00:10:07.040 | You know each one has a 30 or so percent chance. Sorry 40 or so percent chance that the row is in that tree
00:10:13.880 | So if you have 50 trees
00:10:15.560 | It's almost certain that every row is going to be mentioned somewhere
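A quick back-of-the-envelope check of that claim (the row and tree counts below are just illustrative, not from the lecture):

```python
# With bootstrap sampling, a given row is left out of ("OOB for") any one tree
# with probability (1 - 1/n)**n, roughly 0.37 for large n.
n_rows, n_trees = 20000, 50            # illustrative numbers only
p_oob = (1 - 1 / n_rows) ** n_rows     # ~0.368: about 37% of rows held out per tree
p_never_oob = (1 - p_oob) ** n_trees   # chance a row is in-bag for *every* tree
print(p_oob, p_never_oob)              # ~0.368 and ~1e-10 -- effectively never
```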
00:10:19.280 | Did you have an idea term?
00:10:22.000 | With validation set we can use the whole forest to make the predictions
00:10:27.640 | But here we cannot use the whole forest — exactly, so every row is
00:10:33.760 | Going to be using a subset of the trees to make its prediction and with less trees
00:10:39.640 | We know we get a less accurate prediction. So that's like that's a subtle one
00:10:44.480 | Right, and if you didn't get it have a think during the week
00:10:47.720 | until you understand
00:10:50.680 | why this is, because it's a really interesting test of your understanding of random forests: like, why is the OOB score
00:10:58.600 | on average less good than your validation score? They're both using randomly held-out subsets
00:11:05.560 | Anyway, it's generally close enough, right?
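For reference, here is a minimal sketch of turning the OOB score on in scikit-learn (X_train and y_train are assumed names for the training features and target):

```python
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
# R^2 for each row, computed using only the trees that row was out-of-bag for
print(m.oob_score_)
```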
00:11:09.680 | Why have a validation set at all?
00:11:14.760 | When you're using random forest
00:11:16.760 | If it's a randomly chosen validation set it's not strictly speaking necessary
00:11:24.400 | But you know you've got like four levels of things to test right so you could like test on the OOB
00:11:29.560 | When that's working well, you can test on the validation set, you know when hopefully by the time you check against the test set
00:11:35.720 | There's going to be no surprises. So that'll be one good reason
00:11:40.080 | Then what Kaggle do the way they do this is kind of clever
00:11:44.200 | what Kaggle do is they split the test set into two pieces a public and
00:11:48.440 | a private and
00:11:51.920 | They don't tell you which is which. So you submit your predictions to Kaggle, and then a
00:11:58.200 | Random 30% of those are used to tell you the leaderboard score
00:12:04.600 | But then at the end of the competition that gets thrown away and they use the other 70% to calculate your real score
00:12:14.000 | What that's doing is that you're making sure that you're not like continually using that feedback from the leaderboard
00:12:19.120 | To figure out some set of hyper parameters that happens to do well on the public but actually doesn't generalize
00:12:25.440 | Okay, so it's a great test like this is one of the reasons why it's good practice to use Kaggle
00:12:32.040 | Because at the end of a competition at some point this will happen to you and you'll drop a hundred places on the leaderboard
00:12:37.560 | on the last day of the competition, when they use the private test set, and you'll say, oh,
00:12:41.960 | Okay, that's what it feels like to overfit and it's much better to
00:12:46.000 | Practice and get that sense there than it is to do it in a company where there's hundreds of millions of dollars on the line
00:12:52.360 | Okay, so this is like the easiest possible situation where you're able to use a
00:13:01.840 | random sample for your validation set
00:13:05.280 | Why might I not be able to use a random sample for my validation set
00:13:11.600 | In the case of something where we're forecasting we can't randomly sample because we need to maintain the temporal ordering
00:13:22.960 | Go on. What is that?
00:13:25.360 | Because it doesn't it doesn't make sense. So in the case of like an ARMA model
00:13:29.280 | I I can't use like I can't pull out random rows because there's
00:13:34.080 | I'm thinking that there's like a certain dependency or I'm I'm trying to model a certain dependency that relies on like a specific
00:13:41.560 | Lag term if I randomly sample those things then that lag term isn't there for me to okay, so it could be like a
00:13:48.920 | Technical modeling issue that like I'm using a model that relies on like
00:13:55.640 | Yesterday the day before and the day before that and if I've randomly removed some things
00:13:59.640 | I don't have yesterday and my model might just fail. Okay, that's true, but there's a more fundamental issue
00:14:06.360 | Do you want to pass it to Tyler?
00:14:08.360 | It's a really good point
00:14:10.960 | Although, you know, in general we're going to try to build models that are more resilient than that
00:14:15.800 | particularly with
00:14:18.680 | With temporal order, we expect things that are close by in time to be related to things close to them
00:14:26.660 | so we so
00:14:28.760 | if we destroy the order — like, if we destroy the order — we
00:14:35.280 | really aren't going to be able to use the fact that this time is close to this other time
00:14:40.200 | Um, I don't think that's true, because you can pull out a random sample for a validation set and still keep everything nicely ordered
00:14:48.480 | Well, we would like to predict things in the future
00:14:51.680 | which would require as much data close to the end of our data as possible
00:14:58.320 | Okay, that's true. I mean we could be like limiting the amount of data that we have by taking some of it out
00:15:05.560 | But my claim is stronger. My claim is that by using a random validation set
00:15:12.280 | We could get totally the wrong idea about our model. Carob, do you want to have a try?
00:15:17.300 | So if our data is imbalanced, for example,
00:15:24.240 | and we're randomly sampling it, we could end up with only one class in our validation set, so our fitted model may be
00:15:31.720 | That's true as well
00:15:33.400 | So maybe you're trying to predict in a medical situation
00:15:35.480 | Who's going to die of lung cancer and that's only one out of a hundred people and we pick out a validation set that
00:15:42.120 | We accidentally have nobody that died of lung cancer
00:15:44.120 | That's also true. These are all
00:15:47.720 | Good niche examples, but none of them quite say like why could the validation set just be plain
00:15:55.520 | Wrong like give you a totally
00:15:59.000 | Inaccurate idea of whether this is going to generalize
00:16:01.800 | And so let's talk about and the closest is is is what Tyler was saying about time
00:16:08.320 | closeness in time
00:16:11.280 | The important thing to remember is when you build a model
00:16:13.820 | you always have a systematic error,
00:16:17.960 | Which is that you're going to use the model at a later time than the time that you built it, right?
00:16:24.560 | Like you're going to put it into production
00:16:26.560 | By which time the world is different to the world that you're in now and even when you're building the model
00:16:33.600 | You're using data which is older than today anyway, right?
00:16:36.920 | so there's some lag between the data that you're building it on and the data that it's going to actually be used on in real life, and
00:16:43.820 | a lot of the time, if not most of the time, that matters, right?
00:16:48.520 | So if we're doing stuff in like predicting who's going to buy toilet paper in, New Jersey
00:16:54.460 | and it takes us two weeks to put it in production and
00:16:58.160 | We did it using data from the last couple of years then by that time, you know things may look
00:17:05.520 | Very different right and particularly our validation said if we randomly sampled it
00:17:11.620 | Right and it was like from a four-year period then the vast majority of that data is going to be over a year old
00:17:17.920 | right, and it may be that the
00:17:20.440 | toilet buying habits of folks in New Jersey may have
00:17:24.640 | Dramatically shifted. Maybe they've got a terrible recession there now and they can't afford a high-quality toilet paper anymore
00:17:33.440 | Or maybe they know their paper making industry has gone through the roof and suddenly, you know
00:17:39.040 | They could they're buying lots more toilet paper because it's so cheap or whatever, right? So
00:17:43.520 | The world changes and therefore if you use a random sample for your validation set
00:17:50.200 | then you're actually checking how good you are at predicting things that are totally obsolete now —
00:17:55.560 | how good you are at predicting things that happened four years ago. That's not interesting
00:18:02.000 | Okay, so what we want to do in practice
00:18:04.720 | Any time there's some temporal piece?
00:18:08.320 | Is to instead say assuming that we've ordered it by time
00:18:15.040 | Right, so this is old and this is new
00:18:22.040 | That's our validation set
00:18:29.520 | Okay, or if we you know, I suppose actually do it properly. That's our validation set. That's our test set
00:18:37.340 | Make sense, right? So here's our training set and we use that and we try and build a model that still works on stuff
00:18:48.160 | That's later in time than anything the model was built on and so we're not just testing
00:18:54.040 | Generalization in some kind of abstract sense, but in a very
00:18:58.520 | Specific time sense, which is it generalizes to the future? Could you pass it to Siraj, please?
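Here is a minimal sketch of that kind of time-based split; the column name, sizes, and variable names are assumptions for illustration, not taken from the lecture notebook:

```python
# Sort oldest-to-newest, then peel the most recent rows off the end.
n_valid, n_test = 12000, 12000                   # arbitrary sizes for illustration
df = df.sort_values('saledate')                  # hypothetical date column
df_train = df.iloc[:-(n_valid + n_test)]         # the oldest data
df_valid = df.iloc[-(n_valid + n_test):-n_test]  # more recent data
df_test  = df.iloc[-n_test:]                     # the most recent data of all
```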
00:19:03.200 | So when we are as you said
00:19:10.000 | As you said, there is some temporal ordering in the data
00:19:13.440 | So in that case is it wise to take the entire data for training, or only the more recent data?
00:19:20.080 | Which set — validation, test, or training? I'm talking about training, training
00:19:25.640 | Yeah, that's a whole nother question, right? So how do you how do you get the validation set to be good?
00:19:32.200 | So I build them a random forest on all the training data. It looks good on the training data
00:19:37.720 | It looks good on the OOB
00:19:40.200 | But — and this is actually a really good reason to have OOB — if it looks good on the OOB,
00:19:44.920 | that means you're not overfitting in a statistical sense, right? Like, it's working well on a random sample
00:19:51.800 | But then it looks bad on the validation set
00:19:55.560 | So what happened? Well, what happened was that you you somehow failed to predict the future
00:20:02.200 | You only predicted the past and so Siraj had an idea about how we could fix that would be okay
00:20:07.240 | Well, maybe we should just train so like maybe we shouldn't use the whole training set
00:20:11.160 | We should try a recent period only and now you know on the downside
00:20:15.120 | we're using less data, so we can create less rich models; on the upside, it's more up-to-date data
00:20:22.600 | And this is something you have to play around with
00:20:27.760 | Machine learning functions have the ability to provide a weight that is given to each row
00:20:33.260 | So for example with a random forest rather than bootstrapping at random
00:20:37.980 | You could have a weight on every row and randomly pick that row with some probability right and we could like say
00:20:44.880 | Here's our like probability
00:20:47.600 | We could like pick a
00:20:50.920 | Curve that looks like that
00:20:52.920 | So that the most recent rows have a higher probability of being selected that can work really well
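A minimal sketch of that idea, assuming X and y are a DataFrame/Series in chronological order; scikit-learn's forests don't expose weighted bootstrapping directly, so this just draws one weighted bootstrap sample by hand (passing sample_weight to fit is a related, built-in alternative):

```python
import numpy as np

n = len(X)
weights = np.linspace(0.2, 1.0, n)           # any increasing curve: newer rows weigh more
probs = weights / weights.sum()
idx = np.random.choice(n, size=n, replace=True, p=probs)
X_boot, y_boot = X.iloc[idx], y.iloc[idx]    # train one tree / one model on this sample
```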
00:20:59.480 | Yeah, it's it's something that you have to try and and if you don't have a validation set that represents the future
00:21:07.780 | Compared to what you're training on you have no way to know which of your techniques are working
00:21:12.720 | How do you make the compromise between an amount of data versus recency of data?
00:21:19.600 | So what I tend to do is is when I have this kind of temporal issue, which is probably most of the time
00:21:26.560 | Once I have something that's working well on the validation set
00:21:30.880 | I wouldn't then go and just use that model on the test set because the thing that I've trained on is now like
00:21:36.800 | Much, you know the test set is much more in the future compared to the training set so I would then replicate
00:21:43.280 | Building that model again, but this time I would combine
00:21:47.880 | the training and validation sets together
00:21:50.120 | Okay, and retrain the model and at that point you've got no way to test
00:21:56.200 | Against a validation set so you have to make sure you have a reproducible
00:22:01.280 | Script or notebook that does exactly the same steps in exactly the same ways
00:22:05.960 | Because if you get something wrong, then you're going to find on the test set that you've you've got a problem
00:22:15.280 | So what what I do in practice is I need to know is my validation set
00:22:22.640 | Truly representative of the test set. So what I do is I build five models on the training set I
00:22:30.660 | Build five models on the training set and I try to have them kind of vary in how good I think they are
00:22:41.560 | Right, and then I score my five models on the validation set
00:22:47.840 | Right, and then I also score them on the test set, right? So I'm not cheating
00:22:54.760 | So I'm not using any feedback from the test set to change my hyper parameters
00:22:58.540 | I'm only using it for this one thing which is to check my validation set. So I get my five scores
00:23:03.720 | from the test set and
00:23:06.960 | Then I check
00:23:10.680 | That they fall in a line
00:23:12.480 | Okay, and if they don't then you're not going to get good enough feedback from the validation set. So keep doing that process
00:23:19.720 | Until you're getting a line and that can be quite tricky, right? Sometimes
00:23:24.800 | the the test set
00:23:27.840 | You know trying to create something that's as similar to the real world outcome as possible
00:23:33.840 | It's difficult right and when you're kind of in the real world
00:23:38.000 | The same is true of creating the test set like the test set has to be a close to production as possible
00:23:43.320 | So like what's the actual mix of customers that are going to be using this?
00:23:47.800 | How much time is there actually going to be between when you build the model and when you put it in production?
00:23:52.280 | How often are you going to be able to refresh the model? These are all the things to think about when you build that test set
00:24:01.520 | So you're saying first make five models on the training data, and then, till you get a straight-line relationship,
00:24:09.160 | change your validation and test set? You can't really change the test set, generally.
00:24:13.460 | So this is assuming that the test set's given — then change the validation set
00:24:17.400 | So if you start with a random sample validation set and then it's all over the place and you realize oh
00:24:23.280 | I should have picked the last two months
00:24:25.280 | And then you pick the last two months and it's still all over the place, and you realize, oh,
00:24:28.440 | I should have picked it so that it's also from the first of the month to the fifteenth of the month, and
00:24:32.920 | you'll keep going, changing your validation set, until you've found a validation set which is
00:24:38.640 | Indicative of your test set results
00:24:41.240 | So the five models like you would start maybe like just the random data and then average and like just make it better
00:24:51.280 | Yeah, yeah. Yeah, yeah, maybe
00:24:54.360 | exactly — maybe I'd pick five, like, not terrible ones, but you want some variety, and you also particularly want some variety in like
00:25:01.300 | How well they might generalize through time so one that was trained on the whole training set one that was trained on the last two weeks
00:25:08.800 | One that was trained on the last six weeks
00:25:11.540 | One which used you know lots and lots of columns and might over fit a bit more
00:25:16.400 | Yeah, so you kind of want to get a sense of like oh if my validation set fails to
00:25:22.520 | generalize temporally, I'd want to see that; if it fails to generalize statistically, I want to see that
00:25:26.800 | Sorry, can you explain in a bit more detail what you mean by change your validation set so it indicates the test set? Like, what does that look like?
00:25:35.480 | So possible. So let's take the groceries competition where we're trying to predict the next two weeks of grocery sales
00:25:42.300 | So possible validation sets that Terrence and I played with was a random sample
00:25:51.480 | The last month of data
00:25:53.480 | The last two weeks of data
00:25:57.120 | And the other one we tried was same day range
00:26:05.000 | one month earlier. So the test set in this competition was the 1st to the 15th of
00:26:14.680 | August —
00:26:16.840 | sorry, the
00:26:18.360 | 15th — maybe the 15th to the 30th — of August
00:26:21.360 | So we tried like a random sample across the four years. We tried
00:26:25.260 | the 15th of July to the 15th of August we tried the first of August to the 15th of August and we tried the
00:26:34.120 | 15th of July to the 30th of July and so there were four different validation sets we tried and so with random
00:26:41.560 | You know our kind of results were all over the place with last month
00:26:46.440 | You know, they were like not bad, but not great the last two weeks
00:26:50.560 | There were a couple that didn't look good,
00:26:51.880 | but on the whole they were good, and the same day range one month earlier gave a basically perfect line
00:26:56.520 | That's the part I'm talking about right there. What exactly are you comparing it to from the test set?
00:27:00.980 | I'm confused about how you're creating that graph
00:27:03.080 | So for each of those so for each of my so I've built five models, right? So there might be like
00:27:11.840 | Just predict the average do some kind of simple group mean of the whole data set do some group mean of the last month
00:27:17.800 | Of the data set build a random forest of the whole thing build a random forest in the last two weeks
00:27:22.040 | on each of those I calculate the validation score and
00:27:26.480 | Then I retrain the model on the whole training set and calculate the same thing on the test set
00:27:32.480 | And so each of these points now tells me how well to go in the validation set
00:27:37.520 | How well did it go in the test set and so if the validation set is useful?
00:27:42.160 | we would say every time the validation set improves, the test set score should also improve
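A rough sketch of that check, assuming you already have a handful of fitted models of varying quality and some scoring function; every name here is hypothetical:

```python
import matplotlib.pyplot as plt

val_scores, test_scores = [], []
for model in models:                         # e.g. five models of varying quality
    val_scores.append(score(model, X_valid, y_valid))
    # the test set is used ONLY for this calibration check, never for tuning
    test_scores.append(score(model, X_test, y_test))

plt.scatter(val_scores, test_scores)
plt.xlabel('validation score'); plt.ylabel('test score')
plt.show()   # if the validation set is good, the points fall roughly on a line
```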
00:27:48.220 | Yeah, so you just said retrain — meaning, retrain the model on training and validation?
00:27:55.740 | Yeah, that was a step I was talking about here
00:27:57.480 | So once I've got the validation score based on just the training set and then retrain it on the train and validation
00:28:02.980 | And check against it, right?
00:28:05.920 | somebody else
00:28:07.920 | So just to clarify
00:28:12.240 | By this set you mean
00:28:15.800 | Submitting it to Kaggle and then checking the score
00:28:19.680 | If it's Kaggle then your test set is Kaggle's leaderboard
00:28:24.320 | in the real world the test set is this third data set that you put aside and it's that third data set that
00:28:33.080 | Having it reflect real-world production differences is the most important step in a machine learning project
00:28:40.100 | Why is it the most important step? Because if you screw up everything else, but you don't screw up that,
00:28:48.240 | You'll know you've screwed up
00:28:50.520 | Right like if you've got a good test set
00:28:52.920 | Then you'll know you screwed up because you screwed up something else and you tested it and it didn't work out
00:28:57.960 | And it's like okay, you're not going to destroy the company right if you screwed up creating the test set
00:29:03.080 | That would be awful right because then you don't know if you've made a mistake
00:29:07.760 | Right you try to build a model you test it on the test set it looks good
00:29:12.640 | But the test set was not indicative of real-world
00:29:15.320 | Environment
00:29:18.760 | So you don't actually know if you're going to destroy the company right now
00:29:22.400 | Hopefully you've got ways to put things into production gradually
00:29:24.960 | So you won't actually destroy the company, but you'll at least destroy your reputation at work, right?
00:29:29.720 | it's like Oh Jeremy tried to put this thing into production and
00:29:32.800 | In the first week the cohort we tried it on their sales halved and we're never going to give Jeremy a machine learning job again
00:29:39.960 | All right, but if Jeremy had used a proper test set then like he would have known oh
00:29:45.120 | This is like half as good as my validation set said it would be
00:29:49.480 | I'll keep trying, and now I'm not going to get in any trouble. It's actually like, oh, Jeremy's awesome —
00:29:54.520 | he identifies ahead of time when there's going to be a generalization problem
00:29:59.200 | Okay, so this is like
00:30:09.000 | This is something that kind of everybody talks about a little bit in machine learning classes
00:30:16.160 | But often it kind of stops at the point where you learn that there's a thing in scikit-learn
00:30:20.820 | called train_test_split, and it returns these things and off you go, right? But the fact that, like,
00:30:27.460 | Or here's the cross-validation function right so
00:30:31.200 | The fact that these things always give you random samples tells you that like
00:30:39.100 | Much if not most of the time you shouldn't be using them
00:30:44.720 | The fact that random forest gives you an OOB for free
00:30:47.880 | It's useful, but it only tells you that this generalizes in a statistical sense not in a practical sense, right?
00:30:54.880 | so then finally there's cross-validation right which
00:30:59.160 | Outside of class you guys have been talking about a lot which makes me feel somebody's been
00:31:05.740 | overemphasizing the value of this technique
00:31:09.360 | So I'll explain what cross-validation is and then I'll explain why you probably shouldn't be using it most of the time
00:31:15.720 | So cross-validation says let's not just pull out one validation set, but let's pull out five say
00:31:23.180 | So let's assume that we're going to randomly shuffle the data first, right? This is critical
00:31:29.320 | right, we first randomly shuffle the data and then we're going to split it into
00:31:34.240 | Five groups
00:31:39.040 | And then for model number one, we'll call this the validation set and
00:31:44.120 | We'll call this the training set
00:31:48.520 | Okay, and we'll train and we'll check against the validation and we'll get some RMSE R squared whatever and
00:31:57.560 | then we'll throw that away and
00:32:00.280 | We'll call this the validation set and we'll call this
00:32:07.520 | the training set and we'll get another score we'll do that five times and
00:32:17.180 | Then we'll take the average
00:32:22.280 | Okay, so that's a cross-validation
00:32:27.780 | average accuracy, so who can tell me like a benefit of using cross-validation over a
00:32:37.160 | The kind of standard validation set I talked about before
00:32:40.080 | Could you pass it a phone?
00:32:43.120 | If you have a small data set, then
00:32:50.480 | Cross-validation will make use of the data you have. Yeah, you can use all of the data
00:32:56.040 | You don't have to put anything aside and you kind of get a little benefit as well in that like
00:33:00.880 | You've now got five models that you could ensemble together, each one of which used 80% of the data
00:33:06.400 | So, you know, sometimes that ensembling can be helpful
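For completeness, a minimal sketch of k-fold cross-validation with scikit-learn; variable names are assumptions:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5, shuffle=True, random_state=42)  # the "shuffle first" step
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
scores = cross_val_score(m, X, y, cv=cv)               # one score per held-out fold
print(scores, scores.mean())
```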
00:33:08.880 | Fun, could you tell me, like, what could be some reasons that you wouldn't use cross-validation?
00:33:16.440 | We have enough data, so we don't want the validation set to be included in the model training
00:33:25.320 | process
00:33:27.400 | To like to pollute like the model
00:33:31.600 | Okay, yeah
00:33:34.360 | I'm not sure that cross-validation is necessarily polluting the model. What would be a key like downside of cross-validation?
00:33:41.280 | but like for deep learning if you have learned the pictures and
00:33:46.240 | Then your network will know the pictures and it's more likely to predict it. That's right
00:33:52.420 | So sure, but if we if we've put aside some data each time in the cross-validation, can you pass it to Siraj?
00:33:59.160 | I'm I'm I'm not so worried about
00:34:02.040 | like I don't think there's like one of these validation sets is
00:34:06.600 | More statistically accurate. Yes Siraj
00:34:11.400 | I think that's what fun was worried about I don't see why that would happen like each time we're fitting a model
00:34:21.760 | Just behind you
00:34:23.380 | Each time we're fitting a model. We are absolutely holding out 20% of the sample
00:34:28.660 | Right so yes the five models between them have seen all of the data
00:34:32.540 | But but it's kind of like a random forest in fact it's a lot like a random forest each model
00:34:37.300 | Has only been trained on a subset of the data
00:34:39.620 | Yes, Nishan? Say if it is like a large data set, it will take a lot of time
00:34:44.860 | Oh, yes, exactly right so we have to fit five models rather than one so here's a key downside number one
00:34:52.360 | is time, and so if we're
00:34:54.980 | Doing deep learning and it takes a day to run suddenly it now takes five days or we need five GPUs
00:35:01.220 | Okay, what about my earlier issues about validation sets? Do you pass it over there?
00:35:06.680 | What's your name Jose?
00:35:09.720 | So if you had like temporal data wouldn't you be like by shuffling wouldn't you be breaking that relation
00:35:19.900 | Well, we can unshuffle it afterwards
00:35:21.960 | We could reorder it like we could shuffle get the training set out and then sort it by time
00:35:27.200 | Like I'd like this presumably there's a date column there, so I
00:35:32.300 | Don't think I don't think it's going to stop us from building a model. Did you have?
00:35:36.760 | With cross-validation you're building five even validation sets
00:35:47.380 | And if there's some sort of structure that you're trying to capture in your validation set to mirror your test set
00:35:51.620 | you're essentially just throwing away the chance to construct that
00:35:55.140 | yourself
00:35:57.400 | Right, I think you're going to say that
00:35:59.420 | I think you said the same thing as I'm going to say which is which is that our earlier concerns about why?
00:36:04.340 | Random validation sets are a problem are entirely relevant here all these validation sets are random
00:36:10.220 | So if a random validation set is not appropriate for your problem
00:36:15.860 | most likely because, for example, of temporal issues — then none of these five validation sets are any good;
00:36:23.700 | they're all random right and so if you have
00:36:27.460 | Temporal data like we did here. There's no way to do cross-validation really or like probably no good way to do cross-validation. I mean
00:36:37.060 | You want to have?
00:36:39.700 | Your validation set be as close to the test set as possible
00:36:42.620 | And so you can't do that by randomly sampling different things
00:36:49.380 | So as fun said
00:36:51.340 | You may well not need to do cross validation because most of the time in the real world
00:36:56.260 | We don't really have that little data
00:36:58.260 | Right, unless your data is based on some very very expensive labeling process or some experiments that cost a lot to run
00:37:05.820 | or whatever, but nowadays that's
00:37:07.860 | Data scientists are not very often doing that kind of work — some are, in which case this is an issue, but most of us aren't
00:37:14.060 | So we probably don't need to as
00:37:16.620 | Nishan said if we do do it. It's going to take a whole lot of time
00:37:20.500 | And then as Ernest said, even if we did do it and we took up all that time,
00:37:26.260 | It might give us totally the wrong answer because random validation sets are inappropriate for a problem
00:37:30.620 | Okay, so I'm not going to be spending much time on cross-validation because I just I think it's an interesting tool to have
00:37:37.900 | It's easy to use — scikit-learn has a cross-validation thing you can go ahead and use.
00:37:44.580 | It's not that often that it's going to be an important part of your toolbox, in my opinion. It'll come up sometimes
00:37:51.780 | Okay, so that is
00:38:00.340 | Validation sets so then the other thing we
00:38:04.500 | started talking about last week
00:38:06.780 | And got a little bit stuck on because I screwed it up was tree interpretation
00:38:15.780 | So I'm actually going to cover that again
00:38:18.540 | without the error
00:38:21.580 | And dig into it in a bit more detail
00:38:23.580 | So can anybody tell me?
00:38:27.340 | What tree interpreter does and how it does it?
00:38:33.660 | Everybody remember? It's a difficult one to explain. I don't think I did a good job of explaining it
00:38:41.340 | So don't worry if you don't do a great job, but does anybody want to have a go at explaining it?
00:38:45.340 | Okay, that's fine, so
00:38:49.540 | Let's start with the output of tree interpreter, so
00:38:54.740 | If we look at a single model a single tree in other words
00:39:02.340 | Here is a single tree
00:39:04.340 | Okay, and
00:39:07.540 | So to remind us the top of a tree is
00:39:11.060 | Before there's been any split at all
00:39:14.220 | so ten point one eight nine
00:39:17.180 | is the average log price of all of the auctions in our training set
00:39:23.740 | So I'm going to go ahead and draw
00:39:29.020 | right here: 10.189 is the average of all,
00:39:33.900 | okay, and
00:39:36.260 | Then if I go coupler system less than or equal to 0.5
00:39:39.060 | Then I get ten point three four five. Okay, so for this subset of
00:39:45.020 | sixteen thousand eight hundred
00:39:47.940 | where coupler system is less than or equal to 0.5, the average is 10.345, and
00:39:52.380 | then of the ones with coupler system less than or equal to 0.5
00:39:58.340 | We then take the subset where enclosure is less than or equal to two and the average there of log sale price is nine point
00:40:05.620 | nine five five
00:40:06.940 | Here's nine point nine five five. Okay, and then final step in our tree
00:40:14.180 | Model ID just for this group with no coupler system with enclosure less than or equal to two
00:40:19.260 | then let's just take model ID less than or equal to forty five seventy three and
00:40:23.940 | That gives us ten point two two six
00:40:29.220 | okay, so then we can say, starting with
00:40:32.820 | 10.189, the average for everybody in our training set for this particular tree's subsample of twenty thousand,
00:40:40.300 | adding in the coupler decision — coupler system less than or equal to 0.5 —
00:40:46.900 | increased our prediction by point one five six
00:40:50.620 | So if we predicted with a naive model of just the mean, that would have been 10.189
00:40:56.060 | Adding in just the coupler decision would have changed it to ten point three four five
00:41:00.980 | So this variable is responsible for a point one five six increase in our prediction
00:41:06.220 | From that the enclosure decision was responsible for a minus point three nine five decrease
00:41:12.340 | The model ID was responsible for a point two seven six increase until eventually that was our final decision
00:41:20.260 | That was our prediction for this auction of this particular sale price
00:41:26.300 | So we can draw that as what's called a waterfall plot right and waterfall plots are one of the most useful plots
00:41:33.460 | I know about and weirdly enough
00:41:35.660 | There's nothing in Python to do them and this is one of these things where there's this disconnect between like the world of like
00:41:42.020 | management consulting and business where everybody uses waterfall plots all the time and
00:41:46.140 | like academia
00:41:48.420 | Who have no idea what these things are but like every time like you're looking at say?
00:41:54.700 | here is
00:41:56.220 | Last year's sales for Apple and then there was a change in the iPhones increased by this amount
00:42:02.060 | Max decreased by that amount and iPads increased by that amount every time you have a starting point in a number of changes and a finishing
00:42:10.260 | Point waterfall charts are pretty much always the best way to show it. So here our prediction for price based on everything
00:42:16.380 | 10.189; there was an increase — blue means increase — of 0.156 for coupler,
00:42:22.700 | a decrease of 0.395 for enclosure, an increase for model ID of 0.276, so
00:42:30.220 | increase, decrease, increase to get to our final
00:42:34.060 | 10.226. So you see how a waterfall chart works
00:42:37.960 | So with Excel 2016 it's built in — you just click insert waterfall chart and there it is
00:42:44.220 | If you want to be a hero
00:42:46.540 | create a waterfall chart
00:42:49.540 | Package for matplotlib put it on pip and everybody will love you for it
00:42:53.780 | There are some like really crappy
00:42:56.420 | gists and manual
00:42:59.100 | notebooks and stuff around; these are actually super easy to build.
00:43:03.300 | Like, you basically do a stacked column plot where the bottom of this is like all white
00:43:08.780 | Right like you can kind of do it
00:43:10.980 | But if you can wrap that all up and put the points in the right spots and color them nicely,
00:43:17.020 | that would be totally awesome. I think you've all got the skills to do it, and it would, you know, be a
00:43:21.360 | terrific thing for your portfolio
00:43:23.860 | So there's an idea
00:43:26.660 | Could make an interesting Kaggle kernel even — like, here's how to build a waterfall plot from scratch, and by the way
00:43:33.180 | I've put this up on pip you can all use it
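Here is one way a from-scratch version might look — a minimal sketch using the illustrative numbers from the tree walk above, built from an ordinary bar chart with invisible offsets:

```python
import matplotlib.pyplot as plt

# Overall mean, then each split's contribution, ending at the leaf value.
steps = [('all', 10.189), ('coupler', 0.156), ('enclosure', -0.395), ('model_id', 0.276)]
final = sum(v for _, v in steps)                    # 10.226

labels, heights, bottoms, colors = [], [], [], []
running = 0.0
for name, v in steps:
    labels.append(name)
    heights.append(abs(v))
    # each delta "floats" at the running total; the space below it is left blank
    bottoms.append(min(running, running + v) if running else 0.0)
    colors.append('tab:blue' if v >= 0 else 'tab:red')
    running += v
colors[0] = 'grey'                                  # the base bar is not a delta
labels.append('prediction'); heights.append(final); bottoms.append(0.0); colors.append('grey')

plt.bar(labels, heights, bottom=bottoms, color=colors)
plt.ylabel('log(SalePrice)')
plt.title('Waterfall of per-split contributions (sketch)')
plt.show()
```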
00:43:35.340 | So in general therefore obviously going from the all and then going through each change
00:43:43.580 | Then the sum of all of those is going to be equal to the final prediction
00:43:48.940 | So that's how we could say if we were just doing a decision tree
00:43:53.460 | Then you know you're coming along and saying like how come this particular option was this particular price?
00:43:59.580 | And it's like well your prediction for it and like oh it's because of these three things had these three impacts, right?
00:44:06.580 | so for a random forest
00:44:09.460 | We could do that across all of the trees that so every time we see coupler
00:44:14.100 | We add up that change every time we see enclosure
00:44:17.100 | We add up that change every time we see model we add up that change. Okay, and so then we combine them all together
00:44:23.300 | We get what
00:44:26.380 | tree interpreter does. So you could go into the source code for tree interpreter, right?
00:44:31.080 | It's not at all complex logic or you could build it yourself
00:44:33.740 | Right and you can see
00:44:36.820 | how it does exactly this. So when you go tree interpreter predict with a random forest model for some specific
00:44:43.820 | Auction, so I've got a specific row here. This is my zero index row
00:44:48.820 | It tells you okay. This is the prediction the same as the random forest prediction
00:44:54.780 | Bias this is going to be always the same. It's the average
00:44:58.860 | sale price for everybody — for each of the random samples in the tree — and
00:45:05.900 | then contributions is
00:45:07.900 | The average of sorry the total of all the contributions for each time we see that
00:45:15.100 | Specific column appear in a tree
00:45:18.860 | Right. So last time I made the mistake of not sorting this correctly. So this time
00:45:23.520 | np.argsort is a super handy
00:45:26.140 | function. It doesn't actually sort
00:45:30.740 | contributions[0]; it just tells you where each item would move to if it were sorted
00:45:36.580 | so now by passing those sorted indexes to each one of
00:45:40.080 | the column,
00:45:43.820 | the level,
00:45:46.220 | and the contribution, I can then print out all those in the right order, so I can see here: here's my column,
00:45:53.780 | here's the
00:45:55.980 | level and the contribution so the fact that it's a small
00:46:01.100 | Version of this piece of industrial equipment meant that it was less expensive
00:46:04.940 | Right, but the fact it was made pretty recently meant. It was more expensive
00:46:10.020 | The fact that it's pretty old however made that it was less expensive
00:46:13.980 | right, so this is not going to
00:46:16.460 | Really help you much at all with like a Kaggle style situation where you just need predictions
00:46:22.820 | that's going to help you a lot in a production environment or even pre-production right so like something which
00:46:32.940 | any good manager should do if you say "here's a machine learning model
00:46:39.060 | I think we should use" is they should go away and grab a few examples of actual customers or
00:46:46.240 | actual auctions or whatever, and check whether your model looks intuitive, right? And if it says, like,
00:46:46.240 | my prediction is that
00:46:48.500 | You know
00:46:51.820 | Lots and lots of people are going to really enjoy
00:46:54.620 | This crappy movie. You know and it's like well
00:46:58.140 | that was a really crappy movie — then they're going to come back to you and say, like, explain why your model's telling me
00:47:03.780 | That I'm going to like this movie because I hate that movie and then you can go back and you say well
00:47:10.020 | It's because you like this movie and because you're this age range, and you're this gender on average actually people like you
00:47:16.800 | Did like that movie?
00:47:18.900 | Okay, yeah
00:47:22.900 | What's the second element of each table?
00:47:28.180 | This is saying for this particular row
00:47:31.380 | It was a mini, and it was 11 years old and it was a hydraulic excavator track three to four metric tons
00:47:39.340 | So it's just feeding back and telling you it's it because this is actually what it was
00:47:46.140 | It was these numbers, so I just went back to
00:47:51.380 | the original
00:47:53.140 | data to actually pull out the
00:47:55.140 | Descriptive versions of each one
00:47:58.260 | Okay, so if we sum up all the contributions together and
00:48:06.420 | Then add them to the bias
00:48:11.220 | Then that would be the same as adding up those three things
00:48:16.540 | Adding it to this and as we know from our waterfall chart that gives us our final
00:48:21.820 | prediction
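A minimal sketch of that whole workflow, assuming a fitted random forest m and a feature DataFrame df for the rows being explained (variable names are assumptions):

```python
import numpy as np
from treeinterpreter import treeinterpreter as ti

row = df.values[None, 0]                       # one row, kept 2-D: shape (1, n_features)
prediction, bias, contributions = ti.predict(m, row)

# Sort this row's contributions so the biggest drivers are easy to read off.
idxs = np.argsort(contributions[0])
for col, val, contrib in zip(df.columns[idxs], row[0][idxs], contributions[0][idxs]):
    print(f'{col:25} {val!s:>12}  {contrib: .3f}')

# bias (the mean of each tree's training sample) plus the summed contributions
# recomposes the forest's prediction for this row.
assert np.allclose(prediction[0], bias[0] + contributions[0].sum())
```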
00:48:24.740 | this is a
00:48:26.740 | Almost totally unknown technique and this particular
00:48:31.260 | Library is almost totally unknown as well
00:48:35.500 | so like it's a great opportunity to
00:48:38.340 | You know show something that a lot of people like it's totally critical in my opinion
00:48:44.660 | but rarely known. So that's
00:48:47.540 | that's kind of the end of the random forest interpretation piece, and hopefully you've now seen enough that when somebody says
00:48:57.620 | We can't use modern machine learning techniques because they're black boxes that aren't interpretable
00:49:02.500 | You have enough information to say you're full of shit, right?
00:49:06.100 | Like they're extremely interpretable and the stuff that we've just done
00:49:10.420 | You know try to do that with a linear model. Good luck to you
00:49:13.740 | You know, even where you can do something similar with a linear model, trying to do it
00:49:17.340 | so that it's not getting you totally the wrong answer — without you having any idea it's a wrong answer — is going to be a real challenge
00:49:22.380 | So the last step we're going to do before we try and build our own random forest is deal with this tricky issue of
00:49:31.620 | Extrapolation so in this case
00:49:34.780 | If we look at our tree
00:49:39.140 | Let's look at the accuracy of our most recent trees
00:49:43.780 | We still have
00:49:51.580 | You know a big difference between our validation
00:49:54.100 | score and our
00:49:57.420 | training score
00:50:03.100 | Actually in this case. It's not too bad that
00:50:06.700 | The difference between the OOB and the validation is actually pretty close
00:50:11.660 | So if there was a big difference between validation and OOB like I'd be very worried about that. We've dealt with the temporal side of things
00:50:19.460 | correctly
00:50:22.500 | Let's just have a look at I think our most recent model here it was
00:50:27.440 | Yeah, so there's a tiny difference right and so
00:50:32.060 | On Kaggle, at least, you kind of need that last decimal place; in the real world I'd probably stop here
00:50:38.420 | But quite often you'll see there's a big difference between your validation score and your OOB score
00:50:43.220 | And I want to show you how you would deal with that
00:50:45.220 | Particularly because actually we know that the OOB should be a little worse
00:50:51.220 | Because it's using less trees so it gives me a sense that we should be able to do a little bit better
00:50:55.580 | And so the way we should be able to do a little bit better is by handling the time
00:51:00.780 | component a little bit better
00:51:05.780 | Here's the problem with random forests when it comes to extrapolation
00:51:09.980 | when you
00:51:13.260 | When you've got a data set
00:51:16.700 | that's, like, you know, got four years of sales data in it, and you create your tree
00:51:22.060 | Right, and it says like oh if these if it's in some particular store, and it's some particular item
00:51:30.380 | And it is on special
00:51:32.380 | You know here's the average price right it actually tells us the average price
00:51:38.300 | You know over the whole training set which could be pretty old right and so when you then
00:51:44.800 | Want to step forward to like what's going to be the price next month?
00:51:49.580 | It's never seen next month. And whereas a kind of linear model can find a relationship
00:51:56.740 | Between time and price where even though we only had this much data
00:52:00.980 | When you then go and predict something in the future it can extrapolate that but a random forest can't do that
00:52:07.420 | There's no way, if you think about it, for a tree to be able to say, well, next month it would be higher still
00:52:12.820 | so there's a few ways to deal with this and we'll talk about it over the next couple of lessons, but one simple way is
00:52:20.180 | just to try to
00:52:22.180 | Avoid using
00:52:26.540 | Variables as predictors if there's something else we could use that's going to give us a better
00:52:31.300 | You know something of a kind of a stronger relationship. That's actually going to work in the future so in this case
00:52:38.140 | What I wanted to do was to first of all figure out
00:52:42.700 | What's the difference between our validation set and our training set like if I understand the difference between our validation set
00:52:54.920 | And our training set then that tells me
00:52:57.140 | What are the predictors which which have a strong temporal component and therefore they may be?
00:53:04.100 | Irrelevant by the time I get to the future time period so I do something really interesting which is I create a random forest
00:53:13.020 | Where my dependent variable is is it in the validation set?
00:53:20.540 | right, so I've gone back and I've got my whole data frame with the training and validation all together and
00:53:25.540 | I've created a new column called is valid which I've set to 1 and
00:53:31.060 | Then for all of the stuff in the training set I set it to 0
00:53:35.380 | that's what a new column which is just is this in the validation set or not and
00:53:39.460 | Then I'm going to use that as my dependent variable and build a random forest
00:53:44.920 | So this is a random forest not to predict price
00:53:48.780 | but to predict "is this in the validation set or not?", and so if your
00:53:53.180 | variables were not
00:53:55.940 | Time dependent then it shouldn't be possible to figure out if something's in the validation set or not
00:54:00.380 | This is a great trick in Kaggle right because in Kaggle
00:54:03.300 | They often won't tell you whether the test set is a random sample or not
00:54:09.340 | So you could put the test set and the training set together
00:54:13.100 | Create a new column called is test and see if you can predict it if you can
00:54:18.340 | you don't have a random sample, which means you have to go and figure out how to create a validation set
00:54:23.720 | From it right and so in this case I can see I don't have a random sample because my validation set can be predicted
00:54:30.660 | with a 0.9999
00:54:32.660 | R squared.
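A minimal sketch of that trick, assuming df_train and df_valid are the already-numeric feature frames (names and hyperparameters are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df_both = pd.concat([df_train, df_valid])
is_valid = np.concatenate([np.zeros(len(df_train)), np.ones(len(df_valid))])

m = RandomForestClassifier(n_estimators=40, min_samples_leaf=3, n_jobs=-1, oob_score=True)
m.fit(df_both, is_valid)
print(m.oob_score_)   # near 1.0 means the split is NOT random: something leaks time

fi = pd.Series(m.feature_importances_, index=df_both.columns).sort_values(ascending=False)
print(fi.head(3))     # in the lecture these came out as SalesID, saleElapsed, MachineID
```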
00:54:35.340 | So then if I look at feature importance the top thing is sales ID and so this is really interesting
00:54:41.860 | It tells us very clearly sales ID is not a random identifier
00:54:45.620 | But probably it's something that's just set
00:54:48.540 | Consecutively as time goes on we just increase the sales ID
00:54:52.780 | Sale elapsed — that was the number of days since the first date in our data set — so not surprisingly that also is a good predictor
00:55:01.560 | interestingly machine ID
00:55:04.460 | Clearly each machine is being labeled with some consecutive identifier as well
00:55:09.720 | And then there's a big gap — don't just look at the order, look at the values: 0.7, 0.1,
00:55:15.660 | 0.07, 0.02,
00:55:17.500 | and then it stops, right? These top three are hundreds of times more important than the rest, right? So let's next grab those top three
00:55:25.380 | Right and we can then have a look at their values
00:55:31.140 | both in the training set and
00:55:34.080 | In the validation set and so we can see for example sales ID on average is
00:55:40.100 | I've divided by a thousand on average is 1.8 million in the training set and
00:55:45.340 | 5.8 million in the validation set right so you like you can see
00:55:49.840 | Just confirm like okay. They're very different
00:55:52.620 | So let's drop them
00:55:55.300 | Okay, so after I drop them let's now see if I can predict whether something's in the validation set I still can with point nine eight
00:56:03.180 | R squared
00:56:06.420 | So once you remove some things then other things can like come to the front and it now turns out okay
00:56:11.840 | That's, not surprisingly, age.
00:56:13.840 | You know, things that are old
00:56:16.520 | are, you know, more likely I guess to be in the validation set, because earlier on, in the training set,
00:56:24.620 | they can't be old yet.
00:56:26.540 | year made same reason
00:56:28.540 | So then we can
00:56:35.820 | Try removing those as well
00:56:40.700 | So once we let's see where do we go here?
00:56:43.860 | Yeah, so what we can try doing is we can then say all right?
00:56:47.260 | Let's take the sales ID, sale elapsed and machine ID from the first one,
00:56:50.860 | the age, year made, sale day of year from the second one, and say okay, these are all
00:56:56.600 | time dependent features
00:56:59.820 | So I still want them in my random forest if they're important
00:57:06.260 | Right, but if they're not important then taking them out
00:57:10.180 | There are some other non time dependent variables that that work just as well. That would be better
00:57:14.940 | Right because now I'm going to have a model that generalizes over time better
00:57:18.580 | So here I'm just going to go ahead and go through each one of those features and drop each one one at a time
00:57:24.060 | Okay retrain a new random forest and print out the score
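Roughly, the loop being described might be written like this (the feature names, hyperparameters, and the x_train/x_valid/y_train/y_valid names are assumptions for illustration):

    from sklearn.ensemble import RandomForestRegressor

    candidates = ['SalesID', 'saleElapsed', 'MachineID', 'age', 'YearMade', 'saleDayofyear']

    for f in candidates:
        # drop one suspected time-dependent feature, refit, then compare validation vs OOB
        m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                                  max_features=0.5, n_jobs=-1, oob_score=True)
        m.fit(x_train.drop(f, axis=1), y_train)
        print(f,
              m.score(x_valid.drop(f, axis=1), y_valid),   # validation R^2
              m.oob_score_)                                 # OOB R^2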
00:57:28.900 | Okay, so before we do any of that our score was
00:57:33.380 | 0.88 for our validation versus 0.89 OOB and
00:57:42.540 | you can see here
00:57:45.380 | when I remove sales ID my score goes up and
00:57:49.100 | This this is like what we're hoping for we've removed a time-dependent variable
00:57:54.240 | There were other variables that could find similar relationships without the time dependency so removing it caused our validation to go up
00:58:02.260 | Now OOB didn't go up
00:58:04.260 | Right because this is genuinely statistically a useful predictor
00:58:08.160 | Right, but it's a time-dependent one and we have a time-dependent validation set so this is like really subtle
00:58:13.980 | But it can be really important right. It's trying to find the things that give you a
00:58:18.820 | generalizable-across-time prediction, and here's how you can see it. So it's like okay,
00:58:24.480 | We should remove sales ID for sure right, but sale elapsed
00:58:29.180 | Didn't get better
00:58:31.500 | Okay, so we don't want that. Machine ID did get better, from 888 to 893, so it's actually quite a bit better.
00:58:42.780 | Age got a bit better,
00:58:44.420 | year made got worse, sale day of year got a bit better.
00:58:48.380 | Okay, so now we can say all right. Let's get rid of
00:58:53.020 | the three
00:58:56.020 | Where we know that getting rid of it actually made it better
00:58:59.460 | Okay, and as a result, look at this: we're now up to 915.
00:59:03.340 | Okay, so we've got rid of three time-dependent things and now as expected
00:59:09.500 | Validation is better than our OOB
00:59:13.020 | Okay, so that was a super successful approach there right and so now we can check the feature importance
00:59:19.540 | And let's go ahead and say all right that was pretty damn good. Let's now
00:59:27.660 | leave it for a while, so give it 160 trees, let it run, and see how that goes.
00:59:33.100 | Okay, and so as you can see like we did all of our interpretation all of our fine-tuning
00:59:39.440 | basically with smaller models and subsets, and at the end we run the whole thing. It actually still only took 16 seconds,
00:59:46.260 | And so we've now got an RMSE of 0.21. Okay, so now we can check that against Kaggle
00:59:56.740 | Again, we can't...
00:59:58.740 | unfortunately this is an
01:00:01.820 | older competition we're not allowed to enter anymore to see how we would have gone, so the best we can do is check
01:00:06.540 | Whether it looks like we could have done well based on our validation set
01:00:10.580 | So it should be in the right area and yeah based on that we would have come first
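Pulling the final step together as a sketch, under the same naming assumptions as above (the exact columns dropped here are illustrative, chosen to mirror what the transcript reports):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # drop the time-dependent features whose removal improved the validation score
    to_drop = ['SalesID', 'MachineID', 'saleDayofyear']
    x_trn, x_val = x_train.drop(to_drop, axis=1), x_valid.drop(to_drop, axis=1)

    # final model: full-size samples (no subsampling) and plenty of trees
    m = RandomForestRegressor(n_estimators=160, max_features=0.5, n_jobs=-1, oob_score=True)
    m.fit(x_trn, y_train)

    preds = m.predict(x_val)
    rmse = np.sqrt(((preds - y_valid) ** 2).mean())    # ~0.21 on log prices in the lecture's run
    print(rmse, m.score(x_val, y_valid), m.oob_score_)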
01:00:15.180 | Okay, so
01:00:18.900 | You know I think this is an interesting
01:00:22.500 | series of steps right so you can go through the same series of steps in your
01:00:26.780 | Kaggle projects and more importantly your real-world projects
01:00:30.940 | So one of the challenges is once you leave this learning environment
01:00:35.140 | Suddenly you're surrounded by people who never have enough time. They always want you to be in a hurry.
01:00:40.740 | They're always telling you you know do this and then do that you need to find the time to step away
01:00:45.660 | Right and go back because this is a genuine real-world modeling process you can use
01:00:51.660 | And it gives when I say it gives world-class results
01:00:54.980 | I mean it right like this guy who won this
01:00:57.860 | Leustagos, sadly he's passed away, but he is the
01:01:02.140 | top Kaggle
01:01:04.820 | competitor of all time. Like, he won, I believe, like dozens of competitions.
01:01:11.580 | So if we can get a score even within cooee of him, then we are doing really really well.
01:01:19.460 | Okay, so let's take a five-minute break, and we're going to come back and build our own random forest
01:01:24.220 | I just wanted to clarify something quickly very good point during the break was
01:01:38.260 | Going back to the
01:01:45.540 | Change in R squared between here and
01:01:49.460 | Here it's not just due to the fact that we removed
01:01:54.320 | these three predictors
01:01:57.700 | We also ran reset_rf_samples, right, so to actually see the impact of just removing, we need to compare it to
01:02:04.640 | the final step earlier. So it's actually compared to 907, so removing those three things took us from
01:02:13.900 | 907 to
01:02:15.900 | 915. Okay, so I mean, and you know, in the end of course what matters is our final model, but yeah, just to clarify.
01:02:33.460 | Some of you have asked me about writing your own random forests from scratch
01:02:37.900 | I don't know if any of you have given it a try yet my original plan here was to
01:02:44.140 | Do it in real time and then as I started to do it
01:02:47.100 | I realized that that would have kind of been boring because for you because I screw things up all the time so instead
01:02:52.660 | We might do more of like a walk through the code together
01:02:55.860 | Just as an aside
01:03:01.460 | This reminds me talking about the exam actually somebody asked on the forum about like what what can you expect from the exam?
01:03:07.940 | the basic plan is to make it a
01:03:11.740 | The exam be very similar to these notebooks. So it'll probably be a notebook that you have to you know
01:03:17.540 | Get a data set create a model trainer feature importance whatever right and the plan is that it'll be
01:03:25.200 | Open book open internet you can use whatever resources you like so basically if you're entering competitions the exam should be very straightforward. I
01:03:33.540 | also expect that there will be some pieces about like
01:03:39.100 | here's a partially completed random forest or something, you know,
01:03:42.580 | finish writing this step here; or here's a random forest,
01:03:46.060 | Implement feature importance or you know implement one of the things we've talked about so it'll be you know
01:03:54.060 | The exam will be much like what we do in class and what you're expected to be doing during the week. There won't be any
01:04:00.900 | Define this or tell me the difference between this word and that word or whatever. There's not going to be any rote learning
01:04:07.580 | It'll be entirely like are you an effective machine learning practitioner ie can you use the algorithms?
01:04:12.540 | Do you know can you create an effective validation set and can you can you create parts of the algorithm?
01:04:19.720 | Implement them from scratch, so it'll be all about writing code
01:04:23.320 | basically, so
01:04:25.980 | if you're not comfortable writing code to practice machine learning then
01:04:30.460 | You should be practicing that all the time if you are comfortable. You should be practicing that all the time also
01:04:36.460 | Whatever you're doing, write code to do machine learning.
01:04:46.500 | So I kind of have a particular way of
01:04:50.700 | Writing code
01:04:53.660 | And I'm not going to claim it's the only way of writing code
01:04:56.100 | But it might be a little bit different to what you're used to and hopefully you'll find it at least interesting
01:05:01.020 | creating
01:05:03.500 | implementing random forest algorithms
01:05:06.180 | is actually quite tricky, not because the code's tricky. Like, generally speaking,
01:05:10.580 | most random forest algorithms are pretty conceptually easy, you know. Generally speaking,
01:05:18.220 | Academic papers and books have a knack of making them look difficult, but they're not difficult conceptually
01:05:26.740 | what's difficult is getting all the details right and knowing and knowing when you're right and
01:05:32.420 | So in other words, we need a good way of doing testing
01:05:36.680 | So if we're going to re-implement something that already exists. So like say we wanted to create a random forest in some
01:05:43.320 | different
01:05:45.240 | Framework different language different operating system, you know, I would always start with something that does exist, right?
01:05:51.120 | So in this case, we're just going to do as a learning exercise writing a random forest in Python
01:05:55.200 | So for testing I'm going to compare it to an existing random forest implementation
01:06:00.680 | Okay, so that's like critical any time you're doing anything involving
01:06:05.800 | non-trivial amounts of code in machine learning
01:06:08.960 | Knowing whether you've got it right or wrong is kind of the hardest bit
01:06:12.960 | I always assume that I've screwed everything up at every step and so I'm thinking like okay assuming that I screwed it up
01:06:19.680 | How do I figure out that I screwed it up?
01:06:22.040 | Right and then much to my surprise from time to time I actually get something right and then I can move on
01:06:27.680 | But most of the time I get it wrong
01:06:32.080 | Unfortunately with machine learning, there's a lot of ways you can get things wrong that don't give you an error
01:06:36.840 | They just make your result like slightly less good
01:06:40.240 | And so that's that's what you want to pick up
01:06:43.520 | So given that I want to kind of compare it to an existing implementation
01:06:48.760 | I'm going to use our existing data set, our existing validation set, and then to simplify things I'm just going to use two columns
01:06:55.600 | to start with
01:06:59.080 | So let's go ahead and start writing a random forest. So my way of writing
01:07:03.920 | Nearly all code is top-down just like my teaching and so by top-down I start by assuming
01:07:11.720 | That everything I want already exists
01:07:15.600 | Right. So in other words, the first thing I want to do I'm going to call this a tree ensemble
01:07:21.240 | All right, so to create a random forest the first question I have is
01:07:27.160 | What do I need to pass in?
01:07:29.160 | Right. What do I need to initialize my random forest? So I'm going to need some independent variables,
01:07:35.480 | some dependent variable
01:07:38.280 | Pick how many trees I want
01:07:40.560 | I'm going to use the sample size parameter from the start here
01:07:43.840 | So how big you want each sample to be and then maybe some optional parameter of what's the smallest leaf size?
01:07:53.800 | For testing it's nice to use a constant random seed. So we'll get the same result each time
01:07:59.320 | So this is just how you set a random seed, okay?
01:08:02.360 | Maybe it's worth mentioning this for those of you unfamiliar with it
01:08:06.040 | Random number generators on computers aren't random at all. They're actually called pseudo random number generators
01:08:12.760 | and what they do is given some initial starting point in this case 42 a
01:08:19.160 | Pseudo random number generator is a mathematical function that generates a deterministic always the same sequence of numbers
01:08:27.040 | Such that those numbers are designed to be as uncorrelated with the previous number as possible
01:08:32.760 | And as unpredictable as possible and
01:08:37.520 | As uncorrelated as possible with something with a different random seed
01:08:42.520 | So the second number in the sequence starting with 42 should be very different to the second number starting with 41.
01:08:49.200 | And generally they involve kind of like taking you know
01:08:52.280 | You know using big prime numbers and taking mods and stuff like that. It's kind of an interesting area of math
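For instance, you can see the determinism being described like this (a generic numpy example, not from the notebook):

    import numpy as np

    np.random.seed(42)
    print(np.random.rand(3))    # the same three numbers every run with seed 42

    np.random.seed(42)
    print(np.random.rand(3))    # identical to the line above

    np.random.seed(41)
    print(np.random.rand(3))    # a different, but equally repeatable, sequence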
01:09:01.640 | If you want real random numbers the only way to do that is again you can actually buy
01:09:07.720 | Hardware called a hardware random number generator that will have inside them like a little bit of some radioactive
01:09:14.280 | substance and, like, something that detects how many things it's spitting out,
01:09:18.960 | Or you know there'll be some hardware thing
01:09:21.240 | Getting current
01:09:27.360 | system time, is it a valid
01:09:30.160 | random number generation process? So that would be maybe for a random seed, right, so this thing of, like,
01:09:37.760 | What do we start the function with so one of the really interesting areas is like in your computer if you don't set the random?
01:09:44.360 | seed what is it set to and
01:09:48.200 | Yeah, quite often people use the current time for security like obviously we use a lot of random number stuff for security stuff
01:09:56.480 | Like if you're generating an SSH key you need some it needs to be random
01:10:00.580 | It turns out like you know people can figure out roughly when you created a key like they could look at like oh
01:10:08.280 | id_rsa has a timestamp, and they could try, you know, all the different nanosecond
01:10:13.160 | starting points for a random number generator around that timestamp and figure out your key.
01:10:17.600 | So in practice a lot of like really random
01:10:21.480 | High randomness requiring applications actually have a step that say please move your mouse and type random stuff at the keyboard for a while
01:10:31.040 | And so it, like, gets you to be a source of entropy.
01:10:35.300 | Other approaches is they'll look at like you know the hash of some of your log files or you know
01:10:44.200 | Stuff like that. It's a really really fun area
01:10:47.100 | So in our case our purpose actually is to remove randomness
01:10:51.360 | So we're saying okay generate a series of pseudo random numbers starting with 42, so it always should be the same
01:10:57.180 | So if you haven't done much stuff in Python
01:11:02.480 | Oh, this is a basically standard idiom at least I mean I write it this way most people don't but if you pass in like
01:11:09.300 | One two three four five things that you're going to want to keep inside this object
01:11:14.000 | Then you basically have to say self dot x equals x self dot y equals y self dot sample equals sample
01:11:19.800 | Right and so we can assign to a tuple
01:11:23.540 | from a tuple so
01:11:26.160 | You know again
01:11:27.280 | This is like my way of coding most people think this is horrible
01:11:29.740 | But I prefer to be able to see everything at once and so I know in my code anytime
01:11:34.320 | I see something looks like this
01:11:35.440 | it's always all of the stuff in the method being set. If I did it a different way, then half the code's now come off
01:11:41.960 | The bottom of the page and you can't see it. So
01:11:44.760 | alright, so
01:11:47.560 | So that was the first thing I thought about was like okay to create a random forest
01:11:51.760 | What information do you need then I'm going to need to store that information inside my object and so then I?
01:11:57.600 | Need to create some trees right a random forest is something that creates something that has some trees, so I basically figured okay
01:12:05.640 | List comprehension to create a list of trees how many trees do we have we've got n trees trees
01:12:11.420 | That's what we asked for, so range of n trees gives me the numbers from zero up to n trees minus one.
01:12:19.400 | Okay, so if I create a list comprehension that loops through that range
01:12:23.720 | calling create tree each time I now have n trees trees
01:12:30.640 | And now so I had to write that I didn't have to think at all like that's all like
01:12:36.120 | Obvious and so I've kind of delayed the thinking to the point where it's like well wait. We don't have something to create a tree
01:12:44.560 | Okay, no worries, but let's pretend we did. If we did, we've now created a random forest.
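Put together, the constructor being described is roughly the following sketch (it mirrors what the transcript says is on screen; details may differ from the actual notebook):

    import numpy as np

    class TreeEnsemble():
        def __init__(self, x, y, n_trees, sample_sz, min_leaf=5):
            np.random.seed(42)                   # fixed seed so every test run is repeatable
            # tuple assignment: store everything we were given on self, in one visible line
            self.x, self.y, self.sample_sz, self.min_leaf = x, y, sample_sz, min_leaf
            # delay the thinking: just ask for n_trees trees and assume create_tree exists
            self.trees = [self.create_tree() for i in range(n_trees)]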
01:12:51.000 | Okay, we still need to like do a few things on top of that for example once we have it
01:12:56.280 | We would need a predict function, so okay. Well. Let's write a predict function. How do you predict in a random forest?
01:13:03.320 | Can somebody tell me
01:13:07.040 | Either based on their own understanding or based on this line of code. What would be like your one or two sentence answer
01:13:13.160 | How do you make a prediction in a random forest?
01:13:15.960 | Spencer
01:13:18.840 | You would want to over every tree for your like the row that you're trying to predict on
01:13:26.520 | Average the values that your that each tree would produce for that
01:13:30.400 | And so you know that's a summary of what this says right so for a particular row
01:13:36.200 | That or maybe this is a number of rows
01:13:38.800 | Go through each tree
01:13:41.920 | calculate its prediction. So here is a list comprehension that is calculating the prediction for every tree for x,
01:13:50.800 | I don't know if X is one row or multiple rows doesn't matter right
01:13:55.640 | As long as as long as tree dot predict works on it
01:13:59.280 | And then once you've got a list of things a cool trick to know is you can pass numpy dot mean a
01:14:06.080 | regular non numpy list
01:14:09.080 | Okay, and it'll take the mean you just need to tell it
01:14:12.720 | axis equals 0 means average it across the lists, okay, so this is going to return the average of
01:14:21.960 | Dot predict for each tree and so I find list comprehensions
01:14:27.560 | Allow me to write the code in the way that brain works like you could take the word
01:14:34.040 | Spencer said and like
01:14:35.920 | Translate them into this code or you could take this code and translate them into words like the one Spencer said right and so when
01:14:41.920 | I write code I want it to be as much like that as possible
01:14:45.480 | I want it to be readable and so hopefully you'll find like when you look at the fast AI code
01:14:50.880 | You're trying to understand. Well, how did Jeremy do X?
01:14:52.880 | I try to write things in a way that you can read it and like it kind of turn it into English in your head
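Continuing the TreeEnsemble sketch above, the predict method being read out is roughly:

        # (a method of the TreeEnsemble class sketched earlier)
        def predict(self, x):
            # average the per-tree predictions; np.mean happily takes a plain Python list
            return np.mean([t.predict(x) for t in self.trees], axis=0)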
01:14:58.000 | So if I see correctly that predict method is recursive it's
01:15:06.800 | No, it's calling tree dot predict and we haven't written a tree yet
01:15:11.200 | So self dot trees is going to contain a tree object
01:15:16.480 | So this is tree ensemble dot predict and inside the trees is a tree not a tree ensemble
01:15:22.980 | So this is calling tree dot predict not tree ensemble dot predict
01:15:26.080 | Good question
01:15:29.560 | Okay, so we've nearly finished writing a random forest haven't we all we need to do now is write create tree, right?
01:15:39.040 | based on
01:15:40.520 | this code here or
01:15:43.040 | On your own understanding of how we create trees in a random forest. Can somebody tell me?
01:15:49.120 | Let's take a few seconds have a read have a think and then I'm going to try and come up for the way of saying
01:15:55.040 | How do you create a tree in a random forest?
01:15:59.400 | Okay, who wants to tell me yes, okay, that's Tyler's got close to
01:16:07.480 | You take your
01:16:12.520 | Essentially taking a random sample or of the original data and then you're just
01:16:19.280 | Just constructing a tree. However that happens
01:16:23.120 | So construct a decision tree like a non random tree from a random sample of the data
01:16:29.520 | Okay, so again like we've delayed any actual thought process here. We've basically said, okay, we could pick some random IDs
01:16:38.880 | This is a good trick to know
01:16:40.600 | If you call NP random permutation
01:16:43.800 | passing in an int it'll give you back a
01:16:47.680 | randomly shuffled sequence from zero to that int, right, and so then if you grab the first
01:16:54.120 | colon n
01:16:57.080 | items of that, that's now a random
01:16:59.320 | subsample. So this is not doing bootstrapping; we're not doing sampling with replacement here,
01:17:06.880 | Which I think is fine, you know for my random forest
01:17:10.400 | I'm deciding that it's going to be something where we do the sub sampling not bootstrapping. Okay, so here's a good line of code
01:17:16.600 | to know how to write
01:17:18.880 | Because it comes up all the time like I find in machine learning
01:17:22.760 | most algorithms I use are
01:17:25.440 | Somewhat random and so often I need some kind of random sample. Can you pass that tighter or changey?
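Continuing the TreeEnsemble sketch, that line in context might be (assuming x is a pandas DataFrame and y a numpy array, as in the lecture's setup):

        # (another method of the TreeEnsemble sketch above)
        def create_tree(self):
            # a random subsample *without* replacement: shuffle all the row indices,
            # then keep just the first sample_sz of them
            idxs = np.random.permutation(len(self.y))[:self.sample_sz]
            # hand the subsample to a plain, non-random decision tree, defined further on
            return DecisionTree(self.x.iloc[idxs], self.y[idxs],
                                idxs=np.array(range(self.sample_sz)),
                                min_leaf=self.min_leaf)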
01:17:35.840 | Won't that give you one one extra because the you said it'll go from zero to length
01:17:41.080 | No, so this will give you... if len of self dot y is
01:17:47.120 | size n, this will give you a sequence of length n, so 0 to n minus 1.
01:17:54.160 | Okay, and then from that I'm picking out
01:17:57.360 | colon self dot sample size so the first sample size IDs
01:18:04.360 | I have a comment on bootstrapping. I think this method is better, because we have a chance of giving more weight to each
01:18:14.120 | observation, or am I thinking wrong? I mean, I think for bootstrapping we could also give weights. I mean,
01:18:21.440 | weighting
01:18:23.320 | single observations more than they are, like,
01:18:25.680 | without wanting that weight, because when bootstrapping with replacement we can
01:18:31.760 | have a single observation and duplicates of it. Yeah, in the same tree. Yeah, it does feel weird, but I think
01:18:39.640 | I'm not sure that the actual
01:18:44.200 | Theory or empirical results backs up higher intuition that it's worse. It would be interesting to look look back at that actually
01:18:52.200 | Personally I prefer this because I feel like most of the time we have more data than we
01:18:59.960 | want to put in a tree at once. I feel like back when Breiman created random forests, it was 1999,
01:19:05.180 | It was kind of a very different world. You know where we pretty much always wanted to use all the data we had
01:19:09.820 | but nowadays I would say that's
01:19:12.200 | Generally not what we want
01:19:14.480 | We normally have too much data and so what people tend to do is they're like fire up a spark cluster
01:19:20.060 | and they'll run it on hundreds of machines when
01:19:22.520 | It makes no sense because if they had just used a subsample each time
01:19:26.860 | They could have done it on one machine and like the the overhead of like
01:19:30.440 | Spark is a huge amount of IO overhead like I know you guys are doing distributed computing now if you've looked at some of the benchmarks
01:19:38.400 | Yeah, yeah, exactly. So if you do something on a single machine, it can often be hundreds of times faster
01:19:45.980 | Because you don't have all this this IO overhead. It also tends to be easier to write the algorithms like you can use like SK learn
01:19:53.320 | easier to visualize
01:19:56.360 | cheaper so forth so like I
01:19:58.420 | Almost always avoid distributed computing and I have my whole life like even 25 years ago when I was starting in machine learning
01:20:06.200 | I you know still didn't use
01:20:08.500 | clusters because I so I always feel like
01:20:11.080 | Whatever I could do with a cluster now I could do with a single machine in five years time
01:20:15.940 | So why don't we focus on always being as good as possible with a single machine, you know,
01:20:20.640 | and that's going to be more interactive and more iterative and
01:20:23.140 | work for me, so
01:20:26.680 | Okay, so so again, we've like delayed thinking
01:20:30.340 | To the point where we have to write decision tree
01:20:33.880 | And so hopefully you get an idea that this top-down approach the goal is going to be that we're going to keep delaying thinking
01:20:39.720 | So long that that we delay it forever
01:20:42.280 | Like like eventually we've somehow written the whole thing without actually having to think right and that's that's kind of what I need
01:20:48.840 | Cuz I'm kind of slow right so this is why I write code this way and notice like you never have to design anything
01:20:55.940 | You know, you just say hey, what if somebody already gave me the exact API I needed. How would I use it?
01:21:00.660 | Okay, and then and then okay to implement that next stage
01:21:04.680 | What would be the exact API I would need to implement that that you keep going down until eventually you're like, oh that already exists
01:21:11.820 | Okay, so
01:21:14.080 | This assumes we've got a class for decision tree. So we're going to have to create that
01:21:18.380 | So a decision tree
01:21:23.580 | Is something so we already know what we're going to have to pass it because we just passed it, right?
01:21:28.460 | so we're passing in a
01:21:30.460 | random sample of X's a
01:21:32.900 | random sample of y's.
01:21:35.260 | Indexes is actually... so we know that down the track, so I've got to plan a tiny bit.
01:21:46.740 | We know that a decision tree is going to contain decision trees which themselves contain decision trees
01:21:52.620 | And so as we go down the decision tree
01:21:54.540 | There's going to be some subset of the original data that we've kind of got and so I'm going to pass in the indexes
01:22:00.780 | Of the data that we're actually going to use here. Okay, so initially it's the entire
01:22:06.260 | Random sample, right? So I've got the whole
01:22:09.820 | I've got the whole range
01:22:14.060 | And I'll turn that into an array. So that's zero the indexes from zero to the size of the sample and
01:22:21.020 | Then we'll just pass down the min leaf size. So everything that we got for constructing the random forest
01:22:26.740 | We're going to pass down the decision tree except of course num trees, which is irrelevant for the decision tree
01:22:31.780 | So again now that we know that's the information we need we can go ahead and store it inside this object
01:22:38.540 | So I'm pretty likely to need to know
01:22:42.580 | How many rows we have in this tree which I generally call n
01:22:48.580 | How many columns do I have which I generally call C
01:22:51.580 | So the number of rows is just equal to the number of indexes
01:22:55.140 | We were given and the number of columns is just like however many columns there are in our independent variables
01:23:01.380 | So then we're going to need
01:23:06.060 | This value here
01:23:09.820 | We need to know for this tree
01:23:13.140 | What's its prediction, right? So
01:23:19.700 | Prediction for this tree is the mean of
01:23:22.820 | Dependent variable for
01:23:27.460 | Those indexes which are inside this part of the tree, right? So at the very top of the tree it contains all the indexes
01:23:36.500 | All right, I'm assuming that by the time we've got to this point. Remember we've already done the
01:23:44.980 | random sampling
01:23:46.620 | Right. So when we're talking about indexes, we're not talking about the random sampling to create the tree
01:23:51.780 | We're assuming this tree now has some random sample inside decision tree
01:23:56.940 | This is this is the one of the nice things right inside decision tree whole random sampling things gone
01:24:02.100 | Right that was done by the random first, right? So at this point we're building something. That's just a plain old decision tree
01:24:08.380 | It's not in any way a random sampling anything. It's just a plain old decision tree, right?
01:24:12.660 | So the indexes is literally like
01:24:15.420 | Which subset of the data have we got to so far in this tree?
01:24:20.580 | And so at the top of the decision tree, it's all the data, right? So it's all of the indexes
01:24:28.060 | So all of the indexes
01:24:30.060 | So this is therefore all of the dependent variable that are in this part of the tree
01:24:35.900 | And so this is the value mean of that
01:24:40.220 | That makes sense. Anybody got any questions about about that?
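So the DecisionTree constructor described so far is roughly this sketch (same assumptions about x and y as above):

    class DecisionTree():
        def __init__(self, x, y, idxs, min_leaf=5):
            self.x, self.y, self.idxs, self.min_leaf = x, y, idxs, min_leaf
            self.n, self.c = len(idxs), x.shape[1]    # rows in this node, columns in the data
            self.val = np.mean(y[idxs])               # this node's prediction: mean of its y values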
01:24:43.900 | So yes, he passed the change sheet
01:24:47.740 | Actually, just to let you know, a large portion of us don't have OOP, I
01:24:55.980 | mean OOP, experience. Okay. Sure. So
01:25:00.460 | so a quick, so a quick OOP primer would be helpful.
01:25:04.100 | Great. Yeah, okay
01:25:09.460 | Who has done object-oriented programming in some programming language, okay?
01:25:14.340 | So you've all used actually lots of object-oriented programming in terms of using existing classes
01:25:25.140 | All right, so every time we've created a random forest
01:25:28.860 | We've called the random forests constructor and it's returned an object and then we've called
01:25:39.700 | methods and
01:25:41.380 | Attributes on that object so fit is a method you can tell because it's got parentheses after it. All right, where else?
01:25:48.940 | Yeah, oob score is a
01:25:54.340 | Property or an attribute doesn't have parentheses after it. Okay, so inside an object there are kind of two kinds of things
01:26:02.260 | They're the functions that you can call
01:26:04.820 | So you have object dot
01:26:07.500 | function, parentheses, arguments; or there are the properties or attributes you can grab, which is
01:26:13.820 | Object dot and then just the attribute name with no parentheses
01:26:18.300 | So when and then the other thing that we do with objects is we create them
01:26:24.500 | Okay, we pass in the name of a class and it returns us the object and you have to tell it all of the parameters
01:26:32.580 | Necessary to get constructed. So let's just copy this code
01:26:45.340 | See how we're going to go ahead and build this
01:26:47.460 | So the first step is we're not going to go m equals RandomForestRegressor; we're going to go m equals TreeEnsemble.
01:26:55.620 | We're creating a class for tree ensemble and we're going to pass in
01:26:59.540 | Various bits of information, okay?
01:27:06.780 | So maybe we'll have ten trees
01:27:10.580 | Sample size of a thousand maybe a min leaf of three
01:27:15.480 | All right, and you can always like choose to name your arguments or not
01:27:18.980 | So when you've got quite a few it's kind of nice to name them so that just so we can see what each one means
01:27:24.900 | It's always optional
01:27:31.060 | so we're going to try and create a class that we can use like this and
01:27:38.260 | I'm not sure we're going to bother with dot fit because we've passed in the X and the Y
01:27:43.260 | Right, like in scikit-learn they use an approach where first of all you construct something without telling it what data to use,
01:27:49.620 | and then you pass in the data. We're doing these two steps at once; we're actually passing in the data.
01:27:55.020 | Right and so then after that we're going to be going m
01:27:59.020 | Dot so we're going to go preds equals m dot predict
01:28:03.940 | Passing in maybe some validation set
01:28:06.980 | Okay, so that's the API we're kind of creating here.
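In other words, the usage being aimed for is roughly (X_train, y_train and X_valid are placeholders for whatever data is already loaded):

    # construct with the data straight away, rather than scikit-learn's separate .fit() step
    m = TreeEnsemble(X_train, y_train, n_trees=10, sample_sz=1000, min_leaf=3)

    # then predict on some held-out rows
    preds = m.predict(X_valid)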
01:28:12.220 | So this thing here is called a constructor something that creates an object is called a constructor
01:28:18.220 | and Python
01:28:22.460 | There's a lot of ugly hideous things about Python one of which is they it uses these special magic
01:28:28.700 | method names
01:28:31.220 | Underscore underscore init underscore underscore is a special magic method that's called when you try to construct a class.
01:28:39.720 | So when I call TreeEnsemble parenthesis, it actually calls TreeEnsemble dot init.
01:28:46.020 | People say "dunder init". I kind of hate it, but anyway: dunder init, double underscore init double underscore.
01:28:53.700 | So that's why we've got this method called dunder init. Okay, so when I call TreeEnsemble, it's going to call this method.
01:29:01.620 | another
01:29:04.900 | hideously ugly thing about
01:29:06.900 | Python's OO is that there's this special thing where if you have a class and to create a class you just write class in
01:29:14.540 | the class all of its methods
01:29:16.540 | Automatically get sent one extra
01:29:20.060 | parameter one extra argument
01:29:22.860 | Which is the first argument and you can call it anything you like if you call it anything other than self
01:29:28.900 | Everybody will hate you and you're a bad person
01:29:31.300 | Okay, so call it anything you like, as long as it's self.
01:29:38.780 | So that's why you always see this and in fact I can immediately see here I have a bug
01:29:44.540 | Anybody see the bug in my predict function? I should have self, right? I
01:29:48.860 | Like always do it, right?
01:29:52.380 | So anytime you try and call a method on your own class and you get something saying you passed in two parameters
01:29:58.380 | And it was only expecting one you forgot self
01:30:00.820 | Okay, so like this is a really dumb way to add OOP to a programming language
01:30:06.420 | But the older languages like Python often did this because they kind of needed to they started out not being
01:30:12.260 | Oh, and then they kind of added. Oh in a way that was hideously ugly
01:30:16.380 | So Perl, which predates Python by a little bit, kind of, I think, really came up with this approach, and unfortunately
01:30:23.780 | Other languages of that era stuck with it
01:30:26.540 | So you have to add in this magic self. So the magic self now
01:30:32.500 | When you're inside this class
01:30:35.820 | You can now pretend as if any property name you like exists
01:30:41.620 | So I can now pretend there's something called self dot X. I can read from it
01:30:45.980 | I can write to it right, but if I read from it, and I haven't yet written to it. I'll get an error
01:30:51.180 | So the stuff that's passed
01:30:54.540 | to the constructor
01:30:56.700 | Gets thrown away by default like there's nothing that like says you need to this class needs to remember what these things are
01:31:03.300 | But anything that we stick inside self it's remembered for all time
01:31:08.660 | You know as long as this object exists. You can access it. Maybe it's remembered so now that I've gone
01:31:14.980 | In fact, let's do this right so let's let's create the tree ensemble class and
01:31:19.860 | Let's now instantiate it okay, of course we haven't got X we need to call
01:31:28.500 | X train Y train
01:31:37.860 | Okay decision tree is not defined so let's
01:31:40.940 | Create a really minimal decision tree
01:31:45.660 | There we go, okay, so here is enough to actually instantiate our tree ensemble
01:31:56.900 | Okay, so we have to define the init for it.
01:31:59.820 | We have to define the init for DecisionTree.
01:32:01.980 | We need DecisionTree's init to be defined, because inside our ensemble's init we called self dot create tree, and
01:32:08.580 | then self dot create tree called the DecisionTree constructor, and then the DecisionTree constructor
01:32:14.980 | Basically does nothing at all other than save some information right so at this point we can now go M dot
01:32:22.540 | Okay, and if I press tab at this point
01:32:29.060 | Can anybody tell me what I would expect to see
01:32:32.380 | pass it to Taylor
01:32:34.980 | Can she could you pass it to Taylor?
01:32:36.980 | We would see like a we would see a drop-down of all available methods for that class okay, which would be
01:32:44.860 | In this case so if M is a tree ensemble, we would have create tree and predict okay anything else
01:32:50.700 | Wait what oh, yeah as well as Ernest whispered the variables as well. Yeah, so the
01:32:59.220 | So variable could mean a lot of things we'll say the attributes so the things that we put inside self so if I hit tab
01:33:04.860 | Right there. They are right as Taylor said there's create tree there's predict, and then there's everything else to be put inside self
01:33:11.740 | all right, so if I
01:33:13.980 | look at
01:33:16.180 | M dot
01:33:18.180 | Min leaf if I hit shift enter what will I see?
01:33:24.180 | Yeah, the number that I just put there. I put min leaf is three, so that went up here to min leaf.
01:33:29.500 | This here is a default argument. That's as if I don't pass anything. It'll be five, but I did pass something right so three
01:33:36.140 | self dot min leaf
01:33:38.380 | here is
01:33:39.980 | Gonna be equal to min leaf here
01:33:41.980 | so something which
01:33:44.660 | Like because of this rather annoying way of doing OO
01:33:48.340 | It does mean that it's very easy to accidentally forget
01:33:52.500 | To do that right so if I don't assign it to self dot min leaf
01:33:58.060 | Right then I get an error and
01:34:01.540 | So here: tree ensemble doesn't have a min leaf.
01:34:04.700 | So how do I create that attribute? I just put something in it
01:34:09.740 | Okay, so if you want to like if you don't know what a value of it should be yet
01:34:15.980 | But you kind of need to be able to refer to it you can always go like self dot min leaf
01:34:21.860 | equals none
01:34:23.300 | Right so at least it's something you can read check for noneness and not have an error
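A tiny illustration of both points, with made-up classes that aren't from the notebook:

    class Broken():
        def __init__(self, min_leaf=5):
            pass                                  # forgot: self.min_leaf = min_leaf

    # Broken().min_leaf                           # AttributeError: no attribute 'min_leaf'

    class Placeholder():
        def __init__(self, min_leaf=5):
            self.min_leaf = None                  # value unknown for now, but the attribute exists

    print(Placeholder().min_leaf is None)         # True - readable, checkable, no error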
01:34:27.060 | Great
01:34:34.780 | Interestingly I was able to instantiate tree ensemble even though predict refers to a method of decision tree
01:34:42.540 | That doesn't exist and this is actually something very nice about the dynamic nature of Python
01:34:49.360 | is that
01:34:51.700 | Because it's not like compiling it. It's not checking anything unless you're using it
01:34:56.860 | right, so we can go ahead and create DecisionTree dot predict later, and
01:35:02.440 | Then our our instantiated object will magically start working
01:35:07.380 | All right, it doesn't actually look up that functions that methods details until you use it and so it really helps with top-down
01:35:14.500 | programming
01:35:18.780 | Okay, so when you're inside a class definition, in other words, you're at that indentation level
01:35:26.000 | You know indented one in so these are all class definitions
01:35:29.480 | Any function that you create unless you do some special things that we're not going to talk about yet
01:35:35.500 | Is automatically a method of that class and so every method of that class
01:35:41.020 | magically gets a
01:35:43.300 | self pass to it
01:35:45.300 | So we could call
01:35:48.740 | since we've got a tree ensemble we could call M dot create tree and
01:35:52.460 | We don't put anything inside those parentheses because the magic self will be passed and the magic self will be whatever M is
01:36:00.160 | Okay, so M dot create tree returns a decision tree. Just like we asked it to right so M dot create tree
01:36:10.940 | dot IDXS
01:36:13.860 | Will give us the self dot IDXS inside the decision tree
01:36:17.540 | Okay, which is set to np dot array of range of self dot sample size.
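As a sketch, calling that method on the m built earlier looks like:

    tree = m.create_tree()     # Python passes m itself in as `self`, so the parentheses stay empty
    print(tree.idxs)           # the indices 0 .. sample_sz - 1 set up in DecisionTree's constructor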
01:36:24.700 | Why, as data scientists, do we care about object oriented programming?
01:36:32.180 | Because a lot of the stuff you use is going to require you to implement stuff with OOP, for example
01:36:41.580 | every single PyTorch model of any kind is
01:36:46.260 | Created with OOP. It's the only way to create PyTorch models
01:36:49.860 | good news is
01:36:53.020 | What you see here is the entirety of what you need to know
01:36:57.320 | So you this is all you need to know you need to know to create something called in it
01:37:01.020 | to assign the things that are passed in it to something called self and
01:37:05.660 | Then just stick the word self after each of your methods
01:37:09.860 | Okay, and so the nice thing is like now to think as an OOP programmer is to realize you don't now have to pass around
01:37:17.620 | XY sample size and mint leaf to every function that uses them by assigning them to
01:37:23.740 | Attributes of self they're now available like magic
01:37:28.180 | Right. So this is why OOP is super handy
01:37:31.700 | Particularly... like, I started trying to create a decision tree initially without using OOP, and trying to, like, keep track of
01:37:39.380 | what that decision tree was meant to know about was very difficult, you know.
01:37:44.220 | Whereas with OOP, you can just say it inside the decision tree, you know, self dot indexes equals this, and
01:37:50.380 | Everything just works. Okay. Okay, that's great. So we're out of time. I think that's
01:37:55.860 | that's great timing because
01:37:58.900 | There's an introduction to OOP, but this week
01:38:02.480 | You know next class I'm going to assume that you can use it, right?
01:38:07.340 | So you should create some classes instantiate some classes look at their methods and properties
01:38:12.780 | Have them call each other and so forth until you feel
01:38:16.540 | Comfortable with them and maybe for those of you that haven't done OOP before you and find some other useful
01:38:23.420 | Resources you could pop them onto the wiki thread so that other people know what you find
01:38:27.220 | useful great. Thanks everybody