Intro to Machine Learning: Lesson 5

Chapters
38:5 Random Forest model interpretation
44:26 Tree interpreter
52:19 Extrapolation
Okay, so welcome back. We're going to start by doing some review, and we're also going to talk about something we haven't covered yet, but will cover in more detail later: cross-validation.

The difference between machine learning and pretty much any other kind of work is that in machine learning the thing we care about is the generalization accuracy, or the generalization error, whereas in pretty much everything else all we care about is how well we could have mapped to the observations, full stop. So if we want to know whether we're doing a good job of machine learning, we need to know whether we're doing a good job of generalizing; if we don't know that, we don't know whether our model is any good.
Question: by generalizing, do you mean scaling — being able to scale larger?

No, I don't mean scaling at all. Scaling is an important thing in many, many areas — okay, we've got something that works, now make it work on 10,000 items per second or something. So scaling is important, not just in machine learning but for just about everything we put in production.

Generalization is where I say: okay, here is a model that can predict cats from dogs. I've looked at five pictures of cats and five pictures of dogs, and I've built a model that is perfect. Then I look at a different set of five cats and dogs, and it gets them all wrong. In that case, what it learned was not the difference between a cat and a dog, but what those five exact cats and those five exact dogs look like. Or I've got a model of, say, toilet-paper buying behavior, I go and put it into production, and it scales great — in other words it has great latency, I don't have a high CPU load — but it fails to predict anything well other than toilet rolls in New Jersey. It also turns out it only did it well for last month, not for the next month. These are all generalization failures.
The most common way that people check for the ability to generalize is to create a random sample: they'll grab a few rows at random and pull them out, then they'll build all of their models on the rest of the rows, and when they're finished they'll check the accuracy they got on the rows they pulled out. Those held-out rows are called the test set, and the rest of the rows are called the training set.

So say at the end of their modeling process, on the training set, they got an accuracy of 99% at predicting cats from dogs. At the very end they check it against the test set to make sure the model really generalizes — and it doesn't. Right, so, okay, I could go back and change some hyperparameters, do some data augmentation, whatever else, to try to create a more generalizable model, and then I'll go back again after doing all that and check, and it's still no good. I'll keep doing this again and again until eventually, after 50 attempts, it does generalize. But does it really generalize? Maybe all I've done is accidentally found this one model which happens to work just for that test set, because I've tried 50 different things. Something which is right coincidentally 5% of the time isn't very likely to give a good result on any single attempt, but after 50 attempts it becomes quite likely that one of them looks good just by chance.
So what we generally do is put aside a second data set: we grab a couple more of these rows and put them aside into a validation set. Everything that's not in the validation set or the test set is now the training set. So what we do is train a model, check it against the validation set to see if it generalizes, do that a few times, and then when we've finally got something where we think, okay, this generalizes successfully based on the validation set, at the end of the project we check it against the test set.

Question: so basically, by making this two-layer validation set and test set, if it gets one right and the other one wrong, you're kind of double-checking your errors?

Exactly — it's checking whether we have overfit to the validation set. If we're using the validation set again and again, then we could end up coming up not with a generalizable set of hyperparameters, but with a set of hyperparameters that just so happens to work on the training set and the validation set. So we keep checking against the validation set, and then at the end of all that we check against the test set. If it still generalizes well, then we're going to say: okay, that's good, we've actually come up with a generalizable model. If it doesn't, that's going to say: we've actually now overfit to the validation set, at which point you're kind of in trouble, because you don't have anything left.
Hopefully you'll use effective techniques during the modeling so that this doesn't happen, but if it's going to happen you want to find out about it. You need that test set to be there, because otherwise, when you put it in production and it turns out it doesn't generalize, that would be a really bad outcome: you end up with fewer people clicking on your ads, or selling less of your products, or providing car insurance to very risky vehicles, or whatever.

Question: do you ever need to check that the validation set and the test set are consistent with each other?

If you've done what I've just described here, which is to randomly sample them, there's no particular reason to check, as long as they're big enough.
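Here is a minimal sketch of that two-layer random split using scikit-learn — the synthetic data and variable names are purely illustrative, not from the lesson notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Sketch with made-up data; in practice X, y are your features and target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = rng.normal(size=1000)

# Carve off a test set first, then a validation set from what remains.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

# Tune against (X_valid, y_valid) as often as you like; touch
# (X_test, y_test) only once, at the very end of the project.
```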
But we're going to come back to your question in a different context in just a moment.

Another trick we've learned for random forests is a way of not needing a validation set at all, and that was to instead use the OOB (out-of-bag) error, or OOB score. The idea was: every time we train a tree in a random forest, there's a bunch of observations that are held out anyway, because that's how we get some of the randomness. So let's calculate our score for each tree based on those held-out samples, and get a score for the forest by averaging, for each row, over the trees that didn't use that row. The OOB score gives us something which is pretty similar to the validation score, but on average it's a little less good. Can anybody either remember or figure out why, on average, it's a little less good?

Student: I'm not sure, but is it because you're doing all your pre-processing on the test set, so the OOB score is reflecting performance on the testing set?

No — the OOB score is not using the test set at all. The OOB score is using the held-out rows in the training set for each tree.

Student: So you're basically testing each tree on some data from the training set — so you have the potential of overfitting?

It shouldn't cause overfitting, because each tree is only scored on rows that were held out from it. So it's not an overfitting issue; it's quite a subtle issue. Ernest, do you want to have a try?

Student: Bootstrap samples on average only grab about 63% of the rows, so on average the OOB is the remaining one minus 63%.

Exactly. So what's the issue?

Student: Then why would the OOB score be lower than the validation score? That implies you're leaving a sort of black hole in the data — data points you're never going to sample, which aren't going to be represented by the model.

No, that's not true, because each tree is looking at a different subset. We've got, I don't know, dozens of trees, and in each one there's a different set of held-out rows. So when we calculate the OOB score for, say, row three, we say: okay, row three was held out of this tree and this tree, and that's it. We calculate the prediction for that row on those trees and average those predictions, and with enough trees — each tree has a 30-or-so, sorry, 40-or-so percent chance of having any given row held out — it's almost certain that every row is going to be covered somewhere.

Student: With a validation set we can use the whole forest to make the predictions, but here we cannot use the whole forest.

Exactly. So every row is going to be using only a subset of the trees to make its prediction, and with fewer trees we know we get a less accurate prediction. So that's a subtle one, and if you didn't get it, have a think during the week, because it's a really interesting test of your understanding of random forests: why is the OOB score on average less good than your validation score, when they're both using randomly held-out subsets? Anyway, in practice it's generally close enough.
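In scikit-learn the OOB score comes for free when you ask for it; a small sketch, reusing the synthetic split from above:

```python
from sklearn.ensemble import RandomForestRegressor

# Sketch: OOB R^2 is computed only from trees that did NOT see each row,
# so no separate hold-out is needed. X_train etc. are from the sketch above.
m = RandomForestRegressor(n_estimators=40, oob_score=True, n_jobs=-1,
                          random_state=42)
m.fit(X_train, y_train)

print(m.oob_score_)               # OOB R^2: each row scored by a subset of trees
print(m.score(X_valid, y_valid))  # validation R^2: every row scored by every tree
```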
So why keep a validation set at all? If it's a randomly chosen validation set, it's not strictly speaking necessary. But you've got several levels of things to test against: you can test on the OOB, and when that's working well you can test on the validation set, so that hopefully by the time you check against the test set there are going to be no surprises. So that's one good reason.

Then there's what Kaggle does, and the way they do it is kind of clever: they split the test set into two pieces, a public piece and a private piece, and they don't tell you which is which. You submit your predictions to Kaggle, and a random 30% of them is used to tell you your leaderboard score. At the end of the competition, that gets thrown away and they use the other 70% to calculate your real score. What that's doing is making sure you're not continually using feedback from the leaderboard to figure out some set of hyperparameters that happens to do well on the public piece but doesn't actually generalize.

It's a great test — this is one of the reasons it's good practice to use Kaggle. At some point, at the end of a competition, this will happen to you: you'll drop a hundred places on the leaderboard on the last day of the competition, when they switch to the private test set, and you'll say, oh, okay, that's what it feels like to overfit. It's much better to practice and get that sense there than to do it in a company where there are hundreds of millions of dollars on the line.
Okay, so this is the easiest possible situation, where you're able to use a random sample as your validation set. Why might I not be able to use a random sample for my validation set?

Student: In the case of something where we're forecasting, we can't randomly sample, because we need to maintain the temporal ordering — otherwise it doesn't make sense. In the case of, say, an ARMA model, I can't pull out random rows, because there's a certain dependency I'm trying to model that relies on a specific lag term; if I randomly sample, then that lag term isn't there for me to use.

Okay, so it could be a technical modeling issue: I'm using a model that relies on yesterday, the day before, and the day before that, and if I've randomly removed some rows I don't have yesterday, and my model might just fail. That's true, but there's a more fundamental issue — although in general we're going to try to build models that are more resilient than that.

Student: With temporal order, we expect things that are close by in time to be related to things close to them, and if we destroy the order we really aren't going to be able to use the fact that this time is close to that other time.

I don't think that's quite it, because we can pull out a random sample for a validation set and still keep everything else nicely ordered.

Student: We would like to predict things in the future, which would require as much data as possible close to the end of our training period.

Okay, that's true — we could be limiting the amount of recent data we have by taking some of it out. But my claim is stronger: my claim is that by using a random validation set we could get totally the wrong idea about our model. Do you want to have a try?

Student: If you're randomly sampling, we could end up with only one class in our validation set, so our fitted model may be misleading.

Right — so maybe you're trying to predict, in a medical situation, who's going to die of lung cancer, and that's only one out of a hundred people, and we pick a validation set where we accidentally have nobody that died of lung cancer. These are all good niche examples, but none of them quite says why the validation set could give you a plainly inaccurate idea of whether this is going to generalize.
So let's talk about it — the closest is what Tyler was saying about time. The important thing to remember is that when you build a model, you always have a systematic error, which is that you're going to use the model at a later time than the time you built it, by which point the world is different from the world you're in now. And even when you're building the model, you're using data which is older than today anyway. So there's some lag between the data you're building it on and the data it's actually going to be used on in real life, and a lot of the time — if not most of the time — that matters.

If we're predicting who's going to buy toilet paper in New Jersey, and it takes us two weeks to put the model in production, and we built it using data from the last couple of years, then by that time things may look very different. In particular, if we randomly sampled our validation set from a four-year period, then the vast majority of that data is going to be over a year old, and the toilet-paper buying habits of folks in New Jersey may have dramatically shifted. Maybe they've got a terrible recession there now and they can't afford high-quality toilet paper anymore, or maybe their paper-making industry has gone through the roof and suddenly they're buying lots more toilet paper because it's so cheap, or whatever. The world changes, and therefore, if you use a random sample for your validation set, you're actually checking how good you are at predicting things that are totally obsolete now — how good you are at predicting things that happened four years ago. That's not interesting.

So what we do instead — assuming we've ordered the data by time — is take the most recent chunk as our validation set and, if we're doing it properly, the chunk after that as our test set. Everything earlier is our training set, and we use that to try to build a model that still works on stuff that's later in time than anything the model was built on. So we're not just testing generalization in some kind of abstract sense, but in a very specific time sense: does it generalize to the future? Could you pass it to Siraj, please?
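A minimal sketch of that kind of time-ordered split in pandas — the column names and synthetic data are illustrative, not from the lesson notebook:

```python
import numpy as np
import pandas as pd

# Sketch: sort by date and hold out the most recent rows instead of random ones.
dates = pd.date_range('2008-01-01', periods=10_000, freq='h')
df = pd.DataFrame({'saledate': dates,
                   'price': np.random.default_rng(0).normal(size=len(dates))})

df = df.sort_values('saledate')
n_valid, n_test = 1_000, 1_000

train = df.iloc[:-(n_valid + n_test)]
valid = df.iloc[-(n_valid + n_test):-n_test]   # the most recent data before the test period
test  = df.iloc[-n_test:]                      # the very latest data, touched only once
```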
Siraj: As you said, there is some temporal ordering in the data. In that case, is it wise to take the entire data set for training, or only a few recent periods?

For the validation and test sets, or for training? — Training.

Yeah, that's a whole other question. So how do you get the validation set to be good? Say I build a random forest on all the training data, and it looks good on the training data. This is actually a really good reason to have the OOB: if it looks good on the OOB, it means you're not overfitting in a statistical sense — it's working well on a random sample. But then it looks bad on the validation set. So what happened? What happened was that you somehow failed to predict the future; you only predicted the past.

So Siraj had an idea about how we could fix that: maybe we shouldn't use the whole training set — maybe we should try a recent period only. On the downside, we're now using less data, so we can create less rich models; on the upside, it's more up-to-date data. And this is something you have to play around with. Most machine learning functions have the ability to provide a weight that is given to each row. For example, with a random forest, rather than bootstrapping uniformly at random, you could put a weight on every row and randomly pick each row with probability proportional to that weight — say, so that the most recent rows have a higher probability of being selected. That can work really well. It's something you have to try, and if you don't have a validation set that represents the future compared to what you're training on, you have no way of knowing which of your techniques are working.
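One way to experiment with this idea in scikit-learn is the sample_weight argument to fit. To be clear, this is only an approximation of what's described above: sample_weight scales each row's contribution to the training loss rather than changing which rows the bootstrap draws. A sketch with synthetic data and illustrative names:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Sketch only: up-weight recent rows so they count for more during training.
rng = np.random.default_rng(0)
dates = pd.date_range('2008-01-01', periods=5_000, freq='D')
X = pd.DataFrame(rng.normal(size=(5_000, 5)))
y = rng.normal(size=5_000)

age_days = np.asarray((dates.max() - dates).days)
weights = np.exp(-age_days / 365.0)    # newest rows get weights near 1, oldest near 0

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, random_state=42)
m.fit(X, y, sample_weight=weights)
```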
Question: how do you make the compromise between the amount of data and the recency of data?

What I tend to do when I have this kind of temporal issue — which is probably most of the time — is this. Once I have something that's working well on the validation set, I wouldn't then just use that model on the test set, because the test set is much further in the future relative to the training set. Instead, I would replicate building that model again, but this time I would combine the training and validation sets and retrain the model. At that point you've got no way to test against a validation set, so you have to make sure you have a reproducible script or notebook that does exactly the same steps in exactly the same ways, because if you get something wrong, you're going to find out on the test set that you've got a problem.

So what I do in practice is this: I need to know whether my validation set is truly representative of the test set. So I build five models on the training set, and I try to have them vary in how good I think they are. Then I score my five models on the validation set, and I also score them on the test set. I'm not cheating — I'm not using any feedback from the test set to change my hyperparameters; I'm only using it for this one thing, which is to check my validation set. So I get my five validation scores and five test scores, and I check whether they line up. If they don't, then you're not going to get good enough feedback from the validation set, so keep changing the validation set until they do line up — and that can be quite tricky. You're trying to create something that's as similar to the real-world outcome as possible, and that's difficult. When you're in the real world, the same is true of creating the test set: the test set has to be as close to production as possible. What's the actual mix of customers that are going to be using this? How much time is there actually going to be between when you build the model and when you put it into production? How often are you going to be able to refresh the model? These are all things to think about when you build that test set.
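A hedged sketch of that calibration check — the scores below are made up; the point is just to see whether validation and test scores move together:

```python
import matplotlib.pyplot as plt

# Suppose five models of varying quality were scored on both the validation
# set and the (otherwise untouched) test set. These numbers are invented;
# in practice they come from model.score(...) or RMSE calculations.
val_scores  = [0.78, 0.82, 0.85, 0.88, 0.90]
test_scores = [0.74, 0.79, 0.83, 0.86, 0.89]

plt.scatter(val_scores, test_scores)
plt.xlabel('validation score')
plt.ylabel('test score')
plt.title('If the points fall roughly on a line, the validation set is useful')
plt.show()
```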
Question: so, to be clear — you first make five models on the training data, and then, until you get a straight-line relationship, you change your validation and test sets?

You can't really change the test set, generally — this is assuming the test set is given — so you change the validation set. You might start with a random-sample validation set, find your results are all over the place, and realize, oh, I should have picked the last two months. You pick the last two months, it's still all over the place, and you realize, oh, I should have picked it so that it also runs from the first of the month to the fifteenth of the month. You keep changing your validation set until you've found one that is indicative of your test-set results.

Question: and for the five models, would you start with, say, just a simple average model and then make each one a bit better?

Exactly — maybe five models which aren't terrible, but you want some variety, and you particularly want some variety in how well they might generalize through time: one that was trained on the whole training set, one that was trained on the last two weeks, one which used lots and lots of columns and might overfit a bit more. You want to get a sense of: if my validation set fails to generalize temporally, I want to see that; if it fails to generalize statistically, I want to see that.

Question: can you explain in a bit more detail what you mean by changing your validation set so that it indicates the test set — what does that look like?

Sure. Let's take the groceries competition, where we're trying to predict the next two weeks of grocery sales. Possible validation sets that Terence and I played with were: a random sample; the last month; the last two weeks; and the same day range one month earlier. The test set in this competition was the couple of weeks following the 15th of August, so we tried a random sample over the four years, we tried the 15th of July to the 15th of August, we tried the 1st of August to the 15th of August, and we tried the 15th of July to the 30th of July. So there were four different validation sets. With the random one, our results were all over the place; with the last month they were not bad, but not great; the last two weeks were good on the whole; and the same day range a month earlier gave a basically perfect line.
Question: that's the part I'm confused about — when you create that graph, what exactly are you comparing against from the test set?

So, for each of my five models — there might be: just predict the average; do some kind of simple group mean over the whole data set; do a group mean over just the last month of the data; build a random forest on the whole thing; build a random forest on just the last two weeks — on each of those I calculate the validation score, and then I retrain the model on the whole training set and calculate the same thing on the test set. So each of these points tells me how well one model went on the validation set and how well it went on the test set. If the validation set is useful, we would say that every time the validation score improves, the test score should also improve.

Question: you just said retrain — do you mean retrain the model on training plus validation?

Yeah, that was the step I was talking about before: once I've got the validation score based on just the training set, I then retrain on the training plus validation data and check against the test set — which here means submitting it to Kaggle and checking the score. If it's Kaggle, then your test set is Kaggle's leaderboard; in the real world, the test set is this third data set that you put aside.

Having that test set reflect real-world production differences is the most important step in a machine learning project. Why is it the most important step? Because if you screw up everything else but you don't screw up that, you'll know you screwed up — you tested it, it didn't work out, and okay, you're not going to destroy the company. But if you screwed up creating the test set, that would be awful, because then you don't know if you've made a mistake: you try to build a model, you test it on the test set, it looks good, but the test set was not indicative of the real-world environment, so you don't actually know whether you're going to destroy the company. Hopefully you've got ways to put things into production gradually, so you won't actually destroy the company, but you'll at least destroy your reputation at work. It's like: oh, Jeremy tried to put this thing into production, and in the first week the cohort we tried it on had their sales halve, and we're never going to give Jeremy a machine learning job again. Whereas if Jeremy had used a proper test set, he would have known: oh, this is only half as good as my validation set said it would be, I'll keep trying — and now I'm not going to get into any trouble; it's actually, oh, Jeremy's awesome, he identifies ahead of time when there's going to be a generalization problem.
This is something that everybody talks about a little bit in machine learning classes, but often it stops at the point where you learn that there's a thing in scikit-learn called train_test_split, and it returns these things, and off you go — or, here's the cross-validation function. The fact that these functions always give you random samples tells you that much, if not most, of the time you shouldn't be using them. The fact that a random forest gives you an OOB score for free is useful, but it only tells you that your model generalizes in a statistical sense, not in a practical sense.

Then, finally, there's cross-validation, which you guys have been talking about a lot outside of class, which makes me feel somebody's been talking it up. So I'll explain what cross-validation is, and then I'll explain why you probably shouldn't be using it most of the time.

Cross-validation says: let's not just pull out one validation set — let's pull out five, say. We assume we randomly shuffle the data first (this is critical), and then split it into five groups. For model number one, we call the first group the validation set and everything else the training set; we train, check against the validation set, and get some RMSE, R², whatever. Then we call the second group the validation set and the rest the training set, and get another score. We do that five times and average the scores. So, who can tell me a benefit of using cross-validation over the kind of standard validation set I talked about before?

Student: Cross-validation will make use of all the data you have.

Yeah — you can use all of the data, you don't have to put anything aside, and you get a little benefit as well in that you've now got five models that you could ensemble together, each of which used 80% of the data. Sometimes that ensembling can be helpful.
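For completeness, this is roughly what it looks like in scikit-learn — a sketch with synthetic data; note the shuffling, which is exactly what makes it unsuitable for temporal problems:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Sketch: 5-fold cross-validation with an initial shuffle, on made-up data.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = rng.normal(size=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestRegressor(n_estimators=20, n_jobs=-1),
                         X, y, cv=cv)

print(scores)          # one score per held-out fold
print(scores.mean())   # the averaged cross-validation score
```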
Now, can you tell me what could be some reasons you wouldn't use cross-validation?

Student: If we have enough data, we might not want the validation data to be included in the model's training at all.

I'm not sure that cross-validation is necessarily polluting the model in that way. What would be a key downside of cross-validation?

Student: For deep learning, if the network has already learned the pictures, then it will know the pictures and be more likely to predict them.

Sure, but each time in the cross-validation we have put aside some data — can you pass it to Siraj?

Student: Isn't the issue that none of these validation sets is truly held out from all of the models?

I think that's what the earlier concern was too, but I don't see why that would be a problem: each time we fit a model, we are absolutely holding out 20% of the sample. So yes, the five models between them have seen all of the data, but it's kind of like a random forest — in fact it's a lot like a random forest: each model has only been trained on a subset of the data.

Yes, Nishan? — If it's a large data set, it will take a lot of time.

Exactly right: we have to fit five models rather than one, so there's key downside number one. If you're doing deep learning and it takes a day to run, suddenly it now takes five days, or you need five GPUs. Okay — and what about my earlier issues with validation sets? Can you pass it over there?

Student: If you had temporal data, then wouldn't shuffling break that relation?

We could reorder it — we could shuffle, get the training set out, and then sort it by time; presumably there's a date column there — so I don't think it's going to stop us from building a model. Did you have something?

Student: With cross-validation you're building five validation sets, and if there's some sort of structure that you're trying to capture in your validation set to mirror your test set, you're essentially throwing away the chance to construct that.

I think you've said the same thing I was going to say, which is that our earlier concerns about why random validation sets are a problem are entirely relevant here: all of these validation sets are random. So if a random validation set is not appropriate for your problem — most likely because, for example, of temporal issues — then none of these five validation sets is any good. If you want validation sets that work temporally, like we built here, there's no way — or at least probably no good way — to do cross-validation. You want your validation set to be as close to the test set as possible, and you can't do that by randomly sampling different things.

And you may well not need cross-validation, because most of the time in the real world you have a reasonable amount of data — unless your data comes from some very, very expensive labeling process, or from experiments that cost a lot to run. Data scientists are not very often doing that kind of work; some are, in which case this is an issue, but most of us aren't. So: you probably don't need it; as Nishan said, if we do do it, it's going to take a whole lot of time; and, as Ernest said, even if we did do it and took up all that time, it might give us totally the wrong answer, because random validation sets are inappropriate for our problem.

Okay, so I'm not going to spend much time on cross-validation, because I think it's just an interesting tool to have. It's easy to use — scikit-learn has a cross-validation function you can go ahead and use — but it's not that often that it's going to be an important part of your toolbox, in my opinion. It'll come up sometimes.
Okay. Something we looked at last time, and got a little bit stuck on because I screwed it up, was tree interpretation. Does everybody remember what the tree interpreter does and how it does it? It's a difficult one to explain — I don't think I did a good job of explaining it — so don't worry if you can't do a great job, but does anybody want to have a go at explaining it?

Let's start with the output of the tree interpreter. If we look at a single model — a single tree, in other words — the root of the tree is the average log price of all of the auctions in this tree's subsample of our training set. Here, 10.189 is the average of all of them. Then, if I take Coupler_System less than or equal to 0.5, I get 10.345: for the subset where Coupler_System ≤ 0.5, the average is 10.345. Then, of the rows with Coupler_System ≤ 0.5, we take the subset where Enclosure ≤ 2, and the average log sale price there is 9.955. And then the final step in our tree: for this group with no coupler system and Enclosure ≤ 2, we take ModelID ≤ 4573, and that gives us our final leaf.

So then we can say: we started with 10.189, the average for everybody in this particular tree's subsample of 20,000. Adding in the coupler decision — Coupler_System ≤ 0.5 — increased our prediction by 0.156. In other words, if we had predicted with a naive model of just the mean, we would have said 10.189; adding in just the coupler decision changes that to 10.345, so this variable is responsible for a 0.156 increase in our prediction. From there, the enclosure decision was responsible for a 0.395 decrease, and the ModelID decision was responsible for a 0.276 increase, until eventually that was our final prediction for this particular auction's sale price.
We can draw that as what's called a waterfall plot. Waterfall plots are one of the most useful plots there are, and yet there's nothing in Python to do them. This is one of those disconnects between the world of management consulting and business, where everybody uses waterfall plots all the time, and everybody else, who has no idea what these things are. Every time you're looking at, say, last year's sales for Apple and then a set of changes — iPhones increased by this amount, Macs decreased by that amount, iPads increased by that amount — any time you have a starting point, a number of changes, and a finishing point, waterfall charts are pretty much always the best way to show it. So here, our prediction for price starts at 10.189; there was an increase (blue means increase) of 0.156 for coupler, a decrease of 0.395 for enclosure, and an increase of 0.276 for ModelID — so increase, decrease, increase — to get to our final prediction.

With Excel 2016 it's built in: you just click Insert, Waterfall Chart, and there it is. If someone wants to write a waterfall chart package for matplotlib and put it on pip, everybody will love you for it. They're actually super easy to build: you basically do a stacked column plot where the bottom segment of each column is all white. If you can wrap that all up, put the data points in the right spots, and color them nicely, that would be totally awesome. I think you've all got the skills to do it, and it could even make an interesting Kaggle kernel — here's how to build a waterfall plot from scratch.
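Following that stacked-bar idea, here is a rough sketch of such a waterfall plot in matplotlib — the values are the contributions from the example tree above, and the "invisible" bottoms do the offsetting:

```python
import matplotlib.pyplot as plt

# Rough sketch of a waterfall chart: each change floats at the running total.
labels  = ['mean', 'Coupler_System', 'Enclosure', 'ModelID', 'prediction']
changes = [10.189, 0.156, -0.395, 0.276]

bottoms, heights = [], []
running = 0.0
for c in changes:
    bottoms.append(running if c >= 0 else running + c)  # where the bar starts
    heights.append(abs(c))                               # how tall the bar is
    running += c
bottoms.append(0.0)           # the final bar shows the cumulative prediction
heights.append(running)

colors = ['grey'] + ['b' if c >= 0 else 'r' for c in changes[1:]] + ['grey']
plt.bar(labels, heights, bottom=bottoms, color=colors)
plt.ylabel('log(sale price)')
plt.show()
```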
So in general, going from the overall average and then through each change, the starting value plus the sum of all of those changes is equal to the final prediction. That's how we could explain things if we were just using a single decision tree: somebody comes along and asks how this particular auction ended up with this particular predicted price, and we say, well, it's because these three things had these three impacts.

For a random forest we can do that across all of the trees: every time we see coupler, we add up that change; every time we see enclosure, we add up that change; every time we see ModelID, we add up that change — and then we combine them all together. That's what the tree interpreter does. You could go into the source code for treeinterpreter — it's not at all complex logic — or you could build it yourself. So when you call the tree interpreter's predict with a random forest model and some specific auction — I've got a specific row here, my zero-indexed row — it tells you: the prediction, which is the same as the random forest's prediction; the bias, which is always going to be the same — it's the average sale price over each tree's random sample; and the contributions, which are the totals, for each column, of the changes attributed to it each time it was used.

Last time I made the mistake of not sorting this correctly, so this time I sort it: argsort on contributions[0] just tells you where each item would move to if it were sorted, and indexing by that I can print everything out in the right order. So I can see: here's my column, its level, and its contribution — the fact that it's a small version of this piece of industrial equipment meant it was less expensive, the fact that it was made pretty recently meant it was more expensive, but the fact that it's pretty old made it less expensive.
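The call itself looks roughly like this — a sketch with synthetic data and illustrative column names, using the treeinterpreter package (pip install treeinterpreter):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from treeinterpreter import treeinterpreter as ti

# Sketch: explain one row's prediction as bias + per-feature contributions.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 3)),
                  columns=['YearMade', 'Coupler_System', 'Enclosure'])
y = df['YearMade'] * 2 + rng.normal(size=500)

m = RandomForestRegressor(n_estimators=40, n_jobs=-1).fit(df, y)

row = df.values[0:1]                          # keep the 2-D shape (1, n_features)
prediction, bias, contributions = ti.predict(m, row)

# prediction[0] == bias[0] + contributions[0].sum()  (up to floating-point error)
idxs = np.argsort(contributions[0])           # order columns by contribution
for col, contrib in zip(df.columns[idxs], contributions[0][idxs]):
    print(f'{col:>15}: {contrib:+.3f}')
```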
Now, this isn't going to help you much at all in a Kaggle-style situation where you just need predictions, but it's going to help you a lot in a production environment, or even pre-production. Something any good manager should do, if you say "here's a machine learning model I think we should use", is go away, grab a few examples of actual customers or actual auctions or whatever, and check whether your model looks intuitive. If it says lots and lots of people are going to really enjoy this crappy movie — and it really was a crappy movie — then they're going to come back to you and say: explain why your model is telling me I'm going to like this movie, because I hate that movie. Then you can go back and say: well, it's because you liked this other movie, and because you're in this age range and of this gender, and on average people like you actually did like that movie.

Question: so here it's saying it was a mini, it was 11 years old, and it was a hydraulic excavator, track, 3 to 4 metric tons?

It's just feeding back to you: it's because this is actually what that row's values were. Okay — and if we sum up all the contributions together and add them to the bias, then that's the same as the final prediction, as we know from our waterfall chart.

So this is an almost totally unknown technique, and this particular library is almost totally unknown, even though it shows something that a lot of people need; it's totally critical in my opinion. That's the end of the random forest interpretation piece, and hopefully you've now seen enough that when somebody says we can't use modern machine learning techniques because they're black boxes that aren't interpretable, you have enough information to say: you're full of shit. They're extremely interpretable, and the stuff that we've just done — try to do that with a linear model, good luck to you. Even where you can do something similar with a linear model, trying to do it so that it's not giving you totally the wrong answer — when you had no idea it was the wrong answer — is going to be a real challenge.
So the last thing we're going to do, before we try to build our own random forest, is deal with this tricky issue of extrapolation. Let's look at the accuracy of our most recent model: there's still some difference between our validation score and our training score, but the difference between the OOB and the validation is actually pretty close. If there was a big difference between validation and OOB, I'd be very worried that we hadn't dealt properly with the temporal side of things. Looking at our most recent model, it's a tiny difference, and on Kaggle, at least, you kind of need that last decimal place; in the real world I'd probably stop here. But quite often you will see a big difference between your validation score and your OOB score, and I want to show you how you would deal with that — particularly because we know the OOB should be a little worse, since it's using fewer trees per row, so it gives me a sense that we should be able to do a little bit better. And the way we should be able to do a little bit better is by handling the time component.

Here's the problem with random forests when it comes to extrapolation. You've got, say, four years of sales data, and you create your tree. It says: if it's in some particular store and it's some particular item, here's the average price — and it actually tells us the average price over the whole training set, which could be pretty old. So when you then want to step forward and ask what the price is going to be next month, it has never seen next month. Whereas a linear model can find a relationship between time and price, so that even though it only saw this range of data, when you go and predict something in the future it can extrapolate. A random forest can't do that: there's no way, if you think about it, for a tree to say, well, next month it would be higher still.
There are a few ways to deal with this, and we'll talk about them over the next couple of lessons, but one simple way is to avoid using time-dependent variables as predictors when there's something else we could use that gives us a stronger relationship that's actually going to hold in the future. In this case, what I wanted to do was first figure out: what's the difference between our validation set and our training set? Because if I understand the difference between the validation set and the training set, then I know which predictors have a strong temporal component, and are therefore likely to be irrelevant by the time I get to the future time period.

So I do something really interesting: I create a random forest where my dependent variable is "is it in the validation set?". I've gone back and got my whole data frame with the training and validation sets all together, and I've created a new column called is_valid, which I've set to 1, and then for all of the rows in the training set I set it to 0. So I've got a new column which is just: is this row in the validation set or not? Then I use that as my dependent variable and build a random forest. This is a random forest not to predict price, but to predict "is this in the validation set or not". If your variables were not time-dependent, it shouldn't be possible to figure out whether something is in the validation set or not.

This is a great trick in Kaggle, because they often won't tell you whether the test set is a random sample or not. So you can put the test set and the training set together, create a new column called is_test, and see if you can predict it. If you can, you don't have a random sample, which means you have to figure out how to create a validation set from it. In this case, I can see I don't have a random sample, because my validation set can be predicted.
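A minimal sketch of that trick — the frames and column lists are illustrative names for your own prepared training and validation data, not from the lesson notebook:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Sketch: can a model tell training rows from validation rows? If so, your
# validation (or test) period is systematically different from the training
# period. `train`, `valid` and `feature_cols` are assumed to be yours.
df = pd.concat([train.assign(is_valid=0), valid.assign(is_valid=1)])

m = RandomForestClassifier(n_estimators=40, oob_score=True, n_jobs=-1,
                           random_state=42)
m.fit(df[feature_cols], df['is_valid'])

print(m.oob_score_)   # near 1.0 means the two sets are easily told apart

# The most important features are the most strongly time-dependent ones.
fi = pd.Series(m.feature_importances_, index=feature_cols).sort_values(ascending=False)
print(fi.head())
```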
So then, if I look at the feature importance, the top thing is SalesID — and this is really interesting, because it tells us very clearly that SalesID is not a random identifier; it probably just increases consecutively as time goes on. SaleElapsed was the number of days since the first date in our data set, so not surprisingly that's a good predictor too. MachineID — clearly each machine is being labeled with some consecutive identifier as well. And don't just look at the order, look at the values: 0.7, 0.1, and then it drops right off. These top three are hundreds of times more important than the rest. So let's grab those top three and look at their values in the training set versus the validation set. We can see, for example, that SalesID (I've divided by a thousand) is on average 1.8 million in the training set and 5.8 million in the validation set, so you can just confirm: okay, they're very different.

So after I drop those, let's now see if I can still predict whether something's in the validation set — and I still can, with about 0.98 accuracy. Once you remove some things, other things come to the front, and it turns out there are other, more subtly time-related columns that can still tell the two sets apart.
So what we can try doing is this: take SalesID, SaleElapsed and MachineID from the first run, and age, YearMade and saleDayofyear from the second one, and say, okay, these are all time-dependent features. I still want them in my random forest if they're important — but if they're not important, and there are other non-time-dependent variables that work just as well, then taking them out would be better, because then I'll have a model that generalizes over time better. So here I go ahead and go through each one of those features, drop it, retrain a new random forest, and print out the score.
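As a sketch, that loop looks something like this — X_train, X_valid, y_train and y_valid are assumed to be your prepared frames; it's the pattern that matters:

```python
from sklearn.ensemble import RandomForestRegressor

# Sketch of the drop-one-feature experiment: retrain with each suspect
# time-dependent column removed and compare validation scores to a baseline.
suspects = ['SalesID', 'saleElapsed', 'MachineID', 'age', 'YearMade', 'saleDayofyear']

def valid_score(cols):
    m = RandomForestRegressor(n_estimators=40, n_jobs=-1, random_state=42)
    m.fit(X_train[cols], y_train)
    return m.score(X_valid[cols], y_valid)

baseline_cols = list(X_train.columns)
print('baseline:', valid_score(baseline_cols))

for f in suspects:
    cols = [c for c in baseline_cols if c != f]
    print('without', f, ':', valid_score(cols))

# Finally, drop every feature whose removal improved the validation score,
# and retrain once more on the reduced column set.
```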
So note what our score was before we did any of that, and then look at what happens as we remove each feature. For some of them, removing the feature made the validation score better — this is what we're hoping for: we've removed a time-dependent variable, and there were other variables that could find similar relationships without the time dependency, so removing it caused our validation score to go up. The variable is genuinely, statistically, a useful predictor, but it's a time-dependent one, and we have a time-dependent validation set. This is really subtle, but it can be really important: we're trying to find the things that give us a prediction that generalizes across time, and here's how you can see it. So: we should remove SalesID for sure. SaleElapsed didn't get better, so we don't want to remove that. MachineID did get better — from 0.888 to 0.893, actually quite a bit better. YearMade got worse; saleDayofyear got a bit better.

So now we can say, all right, let's get rid of the ones where we know that removing them actually made things better. And as a result, look at this: we're now up to 0.915. We've got rid of three time-dependent things, and now, as expected, our validation score has improved. That was a super successful approach. Now we can check the feature importance again, and then say: all right, that was pretty damn good, let's now make it a bigger model and leave it to run for a while — give it 160 trees and see how it goes. As you can see, we did all of our interpretation and all of our fine-tuning basically with smaller models and subsets, and at the end we run the whole thing, which still only took 16 seconds, and we've now got an RMSE of 0.21. So now we can check that against Kaggle.
It's an older competition and we're not allowed to enter anymore, so the best we can do is check whether it looks like we could have done well based on our validation set. It should be in the right area, and yeah — based on that, we would have come first.

So it's a series of steps, and you can go through that same series of steps in your Kaggle projects and, more importantly, in your real-world projects. One of the challenges is that once you leave this learning environment, suddenly you're surrounded by people who never have enough time, who always want you to be in a hurry, who are always telling you to do this and then do that. You need to find the time to step away and go back through these steps, because this is a genuine real-world modeling process you can use, and it gives world-class results. When I say world-class: Leustagos — sadly he's passed away, but he was the best Kaggle competitor of all time; he won, I believe, dozens of competitions — so if we can get a score even within cooee of him, then we are doing really, really well.

Okay, let's take a five-minute break, and we'll come back and build our own random forest.

I just want to quickly clarify something — a very good point was made during the break. The improvement we saw here is not just due to the fact that we removed those variables; we also called reset_rf_samples, so to see the impact of just removing them we need to compare against the corresponding earlier step, which was 0.907. So removing those three things took us from 0.907 to 0.915. In the end, of course, what matters is the final model, but just to clarify.
Some of you have asked me about writing your own random forest from scratch; I don't know if any of you have given it a try yet. My original plan here was to do it in real time, but as I started to do it I realized that would have been kind of boring for you, because I screw things up all the time. So instead we might do more of a walk through the code together.

This reminds me — talking about the exam — somebody asked on the forum what you can expect from it. The exam will be very similar to these notebooks. It will probably be a notebook where you have to, you know, get a data set, create a model, train it, plot feature importance, whatever; and the plan is that it will be open book, open internet — you can use whatever resources you like — so basically, if you've been entering competitions, the exam should be very straightforward. I also expect there will be some pieces like: here's a partially completed random forest, finish writing this step; or: here's a random forest, implement feature importance, or implement one of the other things we've talked about. So the exam will be much like what we do in class and what you're expected to be doing during the week. There won't be any "define this" or "tell me the difference between this word and that word"; there's not going to be any rote learning. It will be entirely about whether you are an effective machine learning practitioner: can you use the algorithms, can you create an effective validation set, and can you implement parts of the algorithms from scratch? It will be all about writing code. If you're not comfortable writing code to practice machine learning, then you should be practicing that all the time; if you are comfortable, you should be practicing it all the time too. Whatever you're doing, write code to do machine learning.
And I'm not going to claim it's the only way of writing code 01:04:56.100 |
But it might be a little bit different to what you're used to and hopefully you'll find it at least interesting 01:05:06.180 |
Is actually quite tricky not because the clothes tricky like generally speaking 01:05:10.580 |
Most random first algorithms are pretty conceptually easy, you know that generally speaking 01:05:18.220 |
Academic papers and books have a knack of making them look difficult, but they're not difficult conceptually 01:05:26.740 |
what's difficult is getting all the details right and knowing and knowing when you're right and 01:05:32.420 |
So in other words, we need a good way of doing testing 01:05:36.680 |
So if we're going to re-implement something that already exists — so like, say we wanted to create a random forest in some 01:05:45.240 |
different framework, different language, different operating system — you know, I would always start with something that does exist, right? 01:05:51.120 |
So in this case, we're just going to, as a learning exercise, write a random forest in Python, 01:05:55.200 |
and for testing I'm going to compare it to an existing random forest implementation. 01:06:00.680 |
Okay, so that's like critical any time you're doing anything involving 01:06:05.800 |
non-trivial amounts of code in machine learning 01:06:08.960 |
Knowing whether you've got it right or wrong is kind of the hardest bit 01:06:12.960 |
I always assume that I've screwed everything up at every step, and so I'm thinking, okay, assuming that I screwed it up, how would I know? 01:06:22.040 |
Right — and then, much to my surprise, from time to time I actually get something right, and then I can move on. 01:06:32.080 |
Unfortunately with machine learning, there are a lot of ways you can get things wrong that don't give you an error; 01:06:36.840 |
they just make your result slightly less good, 01:06:40.240 |
and so that's what you want to pick up. 01:06:43.520 |
So given that, I want to compare it to an existing implementation, 01:06:48.760 |
so I'm going to use our existing data set, our existing validation set, and then to simplify things I'm just going to use two columns. 01:06:59.080 |
So let's go ahead and start writing a random forest. My way of writing 01:07:03.920 |
nearly all code is top-down, just like my teaching, and by top-down I mean I start by assuming that everything I want already exists. 01:07:15.600 |
Right. So in other words, the first thing I want to do — I'm going to call this a TreeEnsemble. 01:07:21.240 |
All right, so to create a random forest, the first question I have is: 01:07:29.160 |
what do I need to initialize my random forest? So I'm going to need some independent variables, a dependent variable, and how many trees. 01:07:40.560 |
I'm also going to use the sample size parameter from the start here — 01:07:43.840 |
how big you want each sample to be — and then maybe some optional parameter for what the smallest leaf size is. 01:07:53.800 |
For testing, it's nice to use a constant random seed, so we'll get the same result each time. 01:07:59.320 |
So this is just how you set a random seed, okay? 01:08:02.360 |
Maybe it's worth mentioning this for those of you unfamiliar with it 01:08:06.040 |
Random number generators on computers aren't random at all. They're actually called pseudo random number generators 01:08:12.760 |
and what they do is: given some initial starting point, in this case 42, a 01:08:19.160 |
pseudo-random number generator is a mathematical function that generates a deterministic — always the same — sequence of numbers, 01:08:27.040 |
such that those numbers are designed to be as uncorrelated with the previous number as possible, 01:08:37.520 |
and as uncorrelated as possible with a sequence started from a different random seed. 01:08:42.520 |
So the second number in the sequence starting with 42 should be very different from the second number starting with 41. 01:08:49.200 |
And generally they involve, you know, 01:08:52.280 |
using big prime numbers and taking mods and stuff like that. It's kind of an interesting area of math. 01:09:01.640 |
If you want real random numbers, the only way to do that is with hardware — you can actually buy 01:09:07.720 |
hardware called a hardware random number generator that has inside it a little bit of some radioactive 01:09:14.280 |
substance, and something that detects how many particles it's spitting out, or some other genuinely 01:09:30.160 |
random physical process. So that would be, maybe, for a random seed — this question of 01:09:37.760 |
what we start the function with. So one of the really interesting areas is: in your computer, if you don't set the random seed, where does it come from? 01:09:48.200 |
Yeah, quite often people use the current time. For security — obviously we use a lot of random numbers for security stuff — 01:09:56.480 |
like if you're generating an SSH key, it needs to be random. 01:10:00.580 |
It turns out, you know, people can figure out roughly when you created a key — they could look at, like, oh, 01:10:08.280 |
id_rsa has a timestamp — and they could try, you know, all the different nanosecond 01:10:13.160 |
starting points for a random number generator around that timestamp and figure out your key. 01:10:21.480 |
So high-randomness-requiring applications actually have a step that says, please move your mouse and type random stuff at the keyboard for a while, 01:10:31.040 |
and so it gets you to be a source of entropy. 01:10:35.300 |
Other approaches are, they'll look at, like, you know, the hash of some of your log files, or, you know, 01:10:44.200 |
stuff like that. It's a really, really fun area. 01:10:47.100 |
So in our case our purpose actually is to remove randomness 01:10:51.360 |
So we're saying okay generate a series of pseudo random numbers starting with 42, so it always should be the same 01:11:02.480 |
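As a quick illustration (a minimal sketch added here, not the lesson notebook itself), this is what a constant seed buys you with numpy's generator:

```python
import numpy as np

np.random.seed(42)                 # fix the starting point of the pseudo-random sequence
first = np.random.rand(3)

np.random.seed(42)                 # reset to the same starting point
second = np.random.rand(3)

assert (first == second).all()     # the two "random" sequences are identical
```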
Oh, this is a basically standard idiom — at least, I write it this way; most people don't — but if you pass in, like, 01:11:09.300 |
one, two, three, four, five things that you're going to want to keep inside this object, 01:11:14.000 |
then you basically have to say self.x = x, self.y = y, self.sample_sz = sample_sz, and so on. 01:11:27.280 |
This is, like, my way of coding — most people think this is horrible — 01:11:29.740 |
but I prefer to be able to see everything at once, and so I know in my code, any time 01:11:35.440 |
it's written like this, it's always all of the stuff in the method being set. If I did it a different way, then half the code now comes off 01:11:41.960 |
the bottom of the page and you can't see it. So — 01:11:47.560 |
So that was the first thing I thought about: okay, to create a random forest, 01:11:51.760 |
what information do you need? Then I'm going to need to store that information inside my object, and so then I 01:11:57.600 |
need to create some trees, right — a random forest is something that has some trees — so I basically figured, okay, 01:12:05.640 |
a list comprehension to create a list of trees. How many trees do we have? We've got n_trees trees — 01:12:11.420 |
that's what we asked for — so range(n_trees) gives me the numbers from zero up to n_trees minus one. 01:12:19.400 |
Okay, so if I create a list comprehension that loops through that range, 01:12:23.720 |
calling create_tree each time, I now have n_trees trees. 01:12:30.640 |
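Put together, the constructor being described looks roughly like this — a reconstruction of what is on screen, so names like sample_sz are assumptions rather than the exact notebook code:

```python
import numpy as np

class TreeEnsemble():
    def __init__(self, x, y, n_trees, sample_sz, min_leaf=5):
        np.random.seed(42)                        # constant seed so testing is repeatable
        self.x, self.y = x, y                     # independent and dependent variables
        self.sample_sz, self.min_leaf = sample_sz, min_leaf
        # delay the thinking: assume create_tree exists and just call it n_trees times
        self.trees = [self.create_tree() for i in range(n_trees)]
```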
And now — so I had to write that, but I didn't have to think at all; that's all, like, 01:12:36.120 |
obvious, and so I've kind of delayed the thinking to the point where it's like, well, wait — we don't have something to create a tree. 01:12:44.560 |
Okay, no worries, but let's pretend we did. If we did, we've now created a random forest. 01:12:51.000 |
Okay, we still need to do a few things on top of that; for example, once we have it, 01:12:56.280 |
we would need a predict function. So okay, well, let's write a predict function. How do you predict in a random forest? Can somebody tell me — 01:13:07.040 |
either based on your own understanding or based on this line of code — what would be your one-or-two-sentence answer: 01:13:13.160 |
how do you make a prediction in a random forest? 01:13:18.840 |
You would want to go over every tree, for the row that you're trying to predict on, 01:13:26.520 |
and average the values that each tree would produce for that row. 01:13:30.400 |
And so, you know, that's a summary of what this says, right: so for a particular row, for each tree, 01:13:41.920 |
calculate its prediction — so here is a list comprehension that is calculating the prediction of every tree for x. 01:13:50.800 |
I don't know if x is one row or multiple rows — it doesn't matter, right, 01:13:55.640 |
as long as tree.predict works on it. 01:13:59.280 |
And then, once you've got a list of things, a cool trick to know is that you can pass np.mean a plain list, 01:14:09.080 |
okay, and it'll take the mean; you just need to tell it 01:14:12.720 |
axis=0, which means average it across the list. Okay, so this is going to return the average of 01:14:21.960 |
tree.predict for each tree. And so I find list comprehensions 01:14:27.560 |
allow me to write the code the way my brain works: like, you could take the words 01:14:35.920 |
Spencer said and translate them into this code, or you could take this code and translate it into words, like the ones Spencer said. And so when 01:14:41.920 |
I write code, I want it to be as much like that as possible. 01:14:45.480 |
I want it to be readable, and so hopefully you'll find — when you look at the fastai code 01:14:50.880 |
you're trying to understand, well, how did Jeremy do X — 01:14:52.880 |
that I try to write things in a way that you can read and kind of turn into English in your head. 01:14:58.000 |
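And the predict method just described, as a sketch (continuing the reconstructed TreeEnsemble above):

```python
    def predict(self, x):
        # one prediction per tree for x, then the mean across the trees (axis 0)
        return np.mean([t.predict(x) for t in self.trees], axis=0)
```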
So, if I see correctly, that predict method is recursive? 01:15:06.800 |
No, it's calling tree dot predict and we haven't written a tree yet 01:15:11.200 |
So self dot trees is going to contain a tree object 01:15:16.480 |
So this is tree ensemble dot predict and inside the trees is a tree not a tree ensemble 01:15:22.980 |
So this is calling tree dot predict not tree ensemble dot predict 01:15:29.560 |
Okay, so we've nearly finished writing a random forest, haven't we? All we need to do now is write create_tree, right? 01:15:43.040 |
Based on your own understanding of how we create trees in a random forest — can somebody tell me? 01:15:49.120 |
Let's take a few seconds: have a read, have a think, and then I'm going to try and come up with a way of saying it. 01:15:59.400 |
Okay, who wants to tell me? Yes — okay, Tyler's got close to it: 01:16:12.520 |
essentially you're taking a random sample of the original data, and then you're just 01:16:19.280 |
constructing a tree, however that happens. 01:16:23.120 |
So: construct a decision tree — like, a non-random tree — from a random sample of the data. 01:16:29.520 |
Okay, so again, we've delayed any actual thought process here. We've basically said, okay, we could pick some random IDs — 01:16:47.680 |
a randomly shuffled sequence from zero to the length of the data — and so then if you grab the first sample_sz of them, you've got a random 01:16:59.320 |
subsample. So this is not doing bootstrapping — we're not doing sampling with replacement here — 01:17:06.880 |
which I think is fine, you know; for my random forest 01:17:10.400 |
I'm deciding that it's going to be something where we do the subsampling, not bootstrapping. Okay, so here's a good line of code to know, 01:17:18.880 |
because it comes up all the time — like, I find in machine learning 01:17:25.440 |
things are somewhat random, and so often I need some kind of random sample. Can you pass that along? 01:17:35.840 |
Won't that give you one extra, because you said it'll go from zero to length? 01:17:41.080 |
No — so this will give you, if len(self.y) is 01:17:47.120 |
size n, a sequence of length n, so 0 to n minus 1; so then 01:17:57.360 |
[:self.sample_sz] takes the first sample_sz IDs. 01:18:04.360 |
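So create_tree, as described, probably looks something like this — again a reconstruction, assuming x is a pandas DataFrame and y a numpy array:

```python
    def create_tree(self):
        # random permutation of all the row ids, keep the first sample_sz of them:
        # subsampling without replacement, rather than bootstrapping
        idxs = np.random.permutation(len(self.y))[:self.sample_sz]
        return DecisionTree(self.x.iloc[idxs], self.y[idxs],
                            idxs=np.array(range(self.sample_sz)),
                            min_leaf=self.min_leaf)
```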
I have a comment on bootstrapping: I think this method is better, because we have a chance of giving more weight to each 01:18:14.120 |
observation — or am I thinking wrong? — I mean, I think with bootstrapping we could also give weights. I mean — 01:18:25.680 |
without wanting that weight, though, because when bootstrapping with replacement we can 01:18:31.760 |
have a single observation and duplicates of it in the same tree. — Yeah, it does feel weird, but I don't think 01:18:44.200 |
theory or empirical results back up either intuition about which is worse. It would be interesting to look back at that, actually. 01:18:52.200 |
Personally, I prefer this, because I feel like most of the time we have more data than we 01:18:59.960 |
want to put in a tree at once. I feel like back when Breiman created random forests, it was 1999 — 01:19:05.180 |
it was kind of a very different world, you know, where we pretty much always wanted to use all the data we had. 01:19:14.480 |
Nowadays we normally have too much data, and so what people tend to do is, they're like, fire up a Spark cluster 01:19:20.060 |
and they'll run it on hundreds of machines, when 01:19:22.520 |
it makes no sense — because if they had just used a subsample each time, 01:19:26.860 |
they could have done it on one machine, and, like, the overhead of 01:19:30.440 |
Spark is a huge amount of IO overhead — like, I know you guys are doing distributed computing now; if you've looked at some of the benchmarks — 01:19:38.400 |
yeah, yeah, exactly. So if you do something on a single machine, it can often be hundreds of times faster, 01:19:45.980 |
because you don't have all this IO overhead. It also tends to be easier to write the algorithms — like, you can use, like, scikit-learn. 01:19:58.420 |
So I almost always avoid distributed computing, and I have my whole life — like, even 25 years ago when I was starting in machine learning, 01:20:11.080 |
whatever I could do with a cluster, I could do with a single machine five years later. 01:20:15.940 |
So why don't we focus on always being as good as possible with a single machine, you know — 01:20:20.640 |
and that's going to be more interactive and more iterative. 01:20:26.680 |
Okay, so again, we've delayed thinking 01:20:30.340 |
to the point where we have to write DecisionTree, 01:20:33.880 |
and so hopefully you get an idea that with this top-down approach, the goal is that we're going to keep delaying thinking — 01:20:42.280 |
like, eventually we've somehow written the whole thing without actually having to think, right? And that's kind of what I need, 01:20:48.840 |
because I'm kind of slow, right? So this is why I write code this way. And notice, like, you never have to design anything — 01:20:55.940 |
you know, you just say, hey, what if somebody already gave me the exact API I needed — how would I use it? 01:21:00.660 |
Okay, and then, okay, to implement that next stage, 01:21:04.680 |
what would be the exact API I would need to implement that? And you keep going down until eventually you're like, oh, that already exists. 01:21:14.080 |
This assumes we've got a class for DecisionTree, so we're going to have to create that. Its constructor 01:21:23.580 |
is something — well, we already know what we're going to have to pass it, because we just passed it, right? 01:21:35.260 |
The indexes argument is actually — so, we know that down the track, so I've got to plan a tiny bit: 01:21:46.740 |
we know that a decision tree is going to contain decision trees, which themselves contain decision trees, 01:21:54.540 |
and there's going to be some subset of the original data that we've kind of got to, and so I'm going to pass in the indexes 01:22:00.780 |
of the data that we're actually going to use here. Okay, so initially it's the entire sample, 01:22:14.060 |
and I'll turn that into an array — so that's the indexes from zero to the size of the sample — and 01:22:21.020 |
then we'll just pass down min_leaf. So everything that we got for constructing the random forest, 01:22:26.740 |
we're going to pass down to the decision tree — except, of course, n_trees, which is irrelevant for the decision tree. 01:22:31.780 |
So again, now that we know that's the information we need, we can go ahead and store it inside this object, along with 01:22:42.580 |
how many rows we have in this tree, which I generally call n, 01:22:48.580 |
and how many columns I have, which I generally call c. 01:22:51.580 |
So the number of rows is just equal to the number of indexes 01:22:55.140 |
we were given, and the number of columns is just, like, however many columns there are in our independent variables. 01:23:27.460 |
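So the DecisionTree constructor being described is roughly this (a sketch, reconstructed from the description rather than copied from the notebook):

```python
class DecisionTree():
    def __init__(self, x, y, idxs, min_leaf=5):
        self.x, self.y, self.idxs, self.min_leaf = x, y, idxs, min_leaf
        self.n = len(idxs)        # number of rows in this part of the tree
        self.c = x.shape[1]       # number of columns in the independent variables
```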
The indexes are those indexes which are inside this part of the tree, right? So at the very top of the tree, it contains all the indexes. 01:23:36.500 |
All right — I'm assuming that by the time we've got to this point, remember, we've already done the random sampling. 01:23:46.620 |
Right, so when we're talking about indexes, we're not talking about the random sampling to create the tree; 01:23:51.780 |
we're assuming this tree already has its random sample. Inside DecisionTree — 01:23:56.940 |
this is one of the nice things — inside DecisionTree, the whole random sampling thing is gone. 01:24:02.100 |
Right, that was done by the random forest. So at this point we're building something that's just a plain old decision tree — 01:24:08.380 |
it's not in any way random sampling anything; it's just a plain old decision tree, right? 01:24:15.420 |
The indexes tell us which subset of the data we've got to so far in this tree, 01:24:20.580 |
and so at the top of the decision tree it's all the data, right — so it's all of the indexes — 01:24:30.060 |
so this is therefore all of the dependent variable values that are in this part of the tree. 01:24:40.220 |
Does that make sense? Anybody got any questions about that? 01:24:47.740 |
Actually, just to let you know, a large portion of us don't have an OOP background, 01:25:00.460 |
so a quick OOP primer would be helpful. 01:25:09.460 |
Who has done object-oriented programming in some programming language? Okay. 01:25:14.340 |
So you've all actually used lots of object-oriented programming, in terms of using existing classes. 01:25:25.140 |
All right — so every time we've created a random forest, 01:25:28.860 |
we've called the random forest constructor, and it's returned an object, and then we've called methods and accessed 01:25:41.380 |
attributes on that object. So fit is a method — you can tell, because it's got parentheses after it — whereas 01:25:54.340 |
a property or an attribute doesn't have parentheses after it. Okay, so inside an object there are kind of two kinds of things: 01:26:07.500 |
there are the methods, which you call as object, dot, function name, parentheses, arguments; and there are the properties or attributes you can grab, which is 01:26:13.820 |
object, dot, and then just the attribute name with no parentheses. 01:26:18.300 |
And then the other thing that we do with objects is we create them: 01:26:24.500 |
okay, we pass in the name of a class and it returns us the object, and you have to tell it all of the parameters 01:26:32.580 |
necessary to get it constructed. So let's just copy this code and 01:26:45.340 |
see how we're going to go ahead and build this. 01:26:47.460 |
So the first step is: we're not going to go m = RandomForestRegressor(...); we're going to go m = TreeEnsemble(...). 01:26:55.620 |
We're creating a class for TreeEnsemble, and we're going to pass in the x and the y, the number of trees, 01:27:10.580 |
a sample size of a thousand, and maybe a min_leaf of three. 01:27:15.480 |
All right — and you can always, like, choose to name your arguments or not, 01:27:18.980 |
so when you've got quite a few, it's kind of nice to name them, just so we can see what each one means. 01:27:31.060 |
So we're going to try and create a class that we can use like this, and 01:27:38.260 |
I'm not sure we're going to bother with .fit, because we've passed in the x and the y — 01:27:43.260 |
right, like, in scikit-learn they use an approach where first of all you construct something without telling it what data to use, 01:27:49.620 |
and then you pass in the data. We're doing these two steps at once — we're actually passing in the data. 01:27:55.020 |
Right, and so then after that we're going to be going m 01:27:59.020 |
dot — so we're going to go preds = m.predict(...). 01:28:06.980 |
Okay, so that's the API we're kind of creating here. 01:28:12.220 |
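So the usage being aimed for is something like this (X_train, y_train and X_valid are placeholders for whatever data has been prepared, and n_trees=10 is just an illustrative value):

```python
m = TreeEnsemble(X_train, y_train, n_trees=10, sample_sz=1000, min_leaf=3)
preds = m.predict(X_valid)
```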
So this thing here is called a constructor — something that creates an object is called a constructor — 01:28:22.460 |
and there are a lot of ugly, hideous things about Python, one of which is that it uses these special magic methods: 01:28:31.220 |
__init__ (underscore underscore init underscore underscore) is a special magic method that's called when you try to construct a class. 01:28:39.720 |
So when I call TreeEnsemble, parentheses, it actually calls TreeEnsemble dot __init__. 01:28:46.020 |
People say "dunder init" — I kind of hate it, but anyway: dunder init, double underscore init double underscore. 01:28:53.700 |
So that's why we've got this method called dunder init. Okay, so when I call TreeEnsemble, it's going to call this method. 01:29:06.900 |
Another quirk of Python's OO is that there's this special thing where, if you have a class — and to create a class you just write class — every method gets a special argument, 01:29:22.860 |
which is the first argument, and you can call it anything you like — but if you call it anything other than self, 01:29:28.900 |
everybody will hate you and you're a bad person. 01:29:31.300 |
Okay, so call it anything you like, as long as it's self. 01:29:38.780 |
So that's why you always see this — and in fact, I can immediately see here I have a bug. 01:29:44.540 |
Anybody see the bug in my predict function? I should have self, right? 01:29:52.380 |
So any time you try and call a method on your own class and you get something saying you passed in two parameters 01:29:58.380 |
and it was only expecting one — you forgot self. 01:30:00.820 |
Okay, so, like, this is a really dumb way to add OOP to a programming language, 01:30:06.420 |
but the older languages like Python often did this because they kind of needed to: they started out not being 01:30:12.260 |
OO, and then they kind of added OO in a way that was hideously ugly. 01:30:16.380 |
Perl, which predates Python by a little bit, kind of, I think, really came up with this approach, and unfortunately Python followed it. 01:30:26.540 |
So you have to add in this magic self. So with the magic self, now 01:30:35.820 |
you can pretend as if any property name you like exists. 01:30:41.620 |
So I can now pretend there's something called self.x — I can read from it, 01:30:45.980 |
I can write to it, right — but if I read from it and I haven't yet written to it, I'll get an error. 01:30:56.700 |
Anything else gets thrown away by default — like, there's nothing that says this class needs to remember what these things are — 01:31:03.300 |
but anything that we stick inside self is remembered for all time — 01:31:08.660 |
you know, as long as this object exists, you can access it. 01:31:14.980 |
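For those who asked for the OOP primer, here is the whole pattern in a toy class that has nothing to do with random forests (a hypothetical example added for illustration):

```python
class Counter():
    def __init__(self, start=0):
        self.count = start        # anything stored on self lives as long as the object does

    def increment(self, by=1):    # self is filled in automatically when you call c.increment()
        self.count += by
        return self.count

c = Counter(10)    # calling the class runs Counter.__init__
c.increment()      # method call: parentheses; returns 11
c.count            # attribute access: no parentheses; gives 11
```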
In fact, let's do this, right — so let's create the TreeEnsemble class, and 01:31:19.860 |
let's now instantiate it. Okay — of course we haven't got x yet; we need to create that first. 01:31:45.660 |
There we go. Okay, so here is enough to actually instantiate our tree ensemble. 01:32:01.980 |
We need DecisionTree's __init__ to be defined, because inside our ensemble's __init__ we call self.create_tree, and 01:32:08.580 |
then self.create_tree calls the DecisionTree constructor, and the DecisionTree constructor 01:32:14.980 |
basically does nothing at all other than save some information. Right — so at this point we can now go m dot... 01:32:29.060 |
Can anybody tell me what I would expect to see if I hit tab here? 01:32:36.980 |
We would see, like, a drop-down of all available methods for that class. — Okay, which would be, 01:32:44.860 |
in this case — so if m is a TreeEnsemble, we would have create_tree and predict. Okay, anything else? 01:32:50.700 |
Wait, what? Oh yeah — as well, as Ernest whispered, the variables as well. Yeah — so 01:32:59.220 |
"variable" could mean a lot of things; we'll say the attributes, so the things that we put inside self. So if I hit tab — 01:33:04.860 |
right, there they are, right: as Taylor said, there's create_tree, there's predict, and then there's everything else we put inside self. 01:33:18.180 |
min_leaf — if I hit shift-enter, what will I see? 01:33:24.180 |
Yeah, the number that I just put there: I passed in min_leaf of three, so that went up here to min_leaf. 01:33:29.500 |
This here is a default argument — that is, if I don't pass anything, it'll be five; but I did pass something, so it's three. 01:33:44.660 |
Now, because of this rather annoying way of doing OO, 01:33:48.340 |
it does mean that it's very easy to accidentally forget 01:33:52.500 |
to assign things to self. Right — so if I don't assign it to self.min_leaf, 01:34:01.540 |
then here, TreeEnsemble doesn't have min_leaf. 01:34:04.700 |
So how do I create that attribute? I just put something in it. 01:34:09.740 |
Okay — so if you, like, don't know what the value of it should be yet, 01:34:15.980 |
but you kind of need to be able to refer to it, you can always go, like, self.min_leaf = None, 01:34:23.300 |
right, so at least it's something you can read, check for None-ness, and not have an error. 01:34:34.780 |
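A tiny illustration of that point (a hypothetical class added for illustration): an attribute you never assigned raises an error, whereas one you set to None can be checked safely:

```python
class Example():
    def __init__(self):
        self.min_leaf = None      # define it up front, even though we don't know the value yet

e = Example()
e.min_leaf is None                # True - safe to check
# e.n_trees                       # AttributeError: it was never assigned in __init__
```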
Interestingly, I was able to instantiate TreeEnsemble even though predict refers to a method of DecisionTree 01:34:42.540 |
that doesn't exist, and this is actually something very nice about the dynamic nature of Python: 01:34:51.700 |
because it's not, like, compiling it, it's not checking anything unless you're using it. 01:34:56.860 |
Right, so we can go ahead and create DecisionTree.predict later, and 01:35:02.440 |
then our instantiated object will magically start working. 01:35:07.380 |
All right — it doesn't actually look up that method's details until you use it, and so it really helps with top-down coding. 01:35:18.780 |
Okay, so when you're inside a class definition — in other words, you're at that indentation level, 01:35:26.000 |
you know, indented one in, so these are all class definitions — 01:35:29.480 |
any function that you create, unless you do some special things that we're not going to talk about yet, 01:35:35.500 |
is automatically a method of that class, and so every method of that class can be called on the object. So 01:35:48.740 |
since we've got a tree ensemble, we could call m.create_tree, and 01:35:52.460 |
we don't put anything inside those parentheses, because the magic self will be passed, and the magic self will be whatever m is. 01:36:00.160 |
Okay, so m.create_tree returns a decision tree, just like we asked it to — right, so m.create_tree().idxs 01:36:13.860 |
will give us the self.idxs inside the decision tree, 01:36:17.540 |
okay, which is set to np.array(range(self.sample_sz)). 01:36:24.700 |
Why, as data scientists, do we care about object-oriented programming? 01:36:32.180 |
Because a lot of the stuff you use is going to require you to implement stuff with OOP — for example, PyTorch models are 01:36:46.260 |
created with OOP; it's the only way to create PyTorch models. 01:36:53.020 |
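For instance, a PyTorch model is defined with exactly this pattern — a minimal sketch, not code from this lesson:

```python
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()                   # let nn.Module do its own setup
        self.lin = nn.Linear(n_in, n_out)    # a layer stored on self, just like our attributes

    def forward(self, x):                    # a method with the magic self, like predict above
        return self.lin(x)
```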
What you see here is the entirety of what you need to know. 01:36:57.320 |
So this is all you need to know: you need to know to create something called __init__, 01:37:01.020 |
to assign the things that are passed in to it to something called self, and 01:37:05.660 |
then just stick the word self as the first parameter of each of your methods. 01:37:09.860 |
Okay — and so the nice thing is, like, now, to think as an OOP programmer is to realize you don't have to pass around 01:37:17.620 |
x, y, sample_sz, and min_leaf to every function that uses them: by assigning them to 01:37:23.740 |
attributes of self, they're now available like magic. 01:37:31.700 |
In particular — like, I started trying to create a decision tree initially without using OOP, and trying to keep track of 01:37:39.380 |
what that decision tree was meant to know about was very difficult, you know, 01:37:44.220 |
whereas with OOP you can just say, inside the decision tree, you know, self.idxs equals this, and 01:37:50.380 |
everything just works. Okay. Okay, that's great. So we're out of time. I think that's — 01:37:58.900 |
so there's an introduction to OOP. But this week — 01:38:02.480 |
you know, next class I'm going to assume that you can use it, right? 01:38:07.340 |
So you should create some classes, instantiate some classes, look at their methods and properties, 01:38:12.780 |
have them call each other, and so forth, until you feel 01:38:16.540 |
comfortable with them. And maybe for those of you that haven't done OOP before — if you find some other useful 01:38:23.420 |
resources, you could pop them onto the wiki thread so that other people know what you found.