Intro to Machine Learning: Lesson 5

Chapters
38:5 Random Forest model interpretation
44:26 Tree interpreter
52:19 Extrapolation
Okay, so welcome back. We're going to start by doing some review, and we're also going to talk about something we haven't covered yet, but will cover in more detail later: cross-validation.

The difference between machine learning and pretty much any other kind of work is that in machine learning the thing we care about is the generalization accuracy, or the generalization error, whereas in pretty much everything else all we care about is how well we could have mapped to the observations, full stop. So if we want to know whether we're doing a good job of machine learning, we need to know whether we're doing a good job of generalizing; if we don't know that, we don't know whether our model is any good.
Question: by generalizing, do you mean scaling — being able to scale larger?

No, I don't mean scaling at all. Scaling is an important thing in many, many areas — okay, we've got something that works, now make it work on 10,000 items per second or something. So scaling is important, not just in machine learning but for just about everything we put in production.

Generalization is where I say: okay, here is a model that can predict cats from dogs. I've looked at five pictures of cats and five pictures of dogs, and I've built a model that is perfect. Then I look at a different set of five cats and dogs, and it gets them all wrong. In that case, what it learned was not the difference between a cat and a dog, but what those five exact cats and those five exact dogs look like. Or I've got a model of, say, toilet-paper buying behavior, I go and put it into production, and it scales great — in other words it has great latency, I don't have a high CPU load — but it fails to predict anything well other than toilet rolls in New Jersey. It also turns out it only did it well for last month, not for the next month. These are all generalization failures.
The most common way that people check for the ability to generalize is to create a random sample: they'll grab a few rows at random and pull them out, then they'll build all of their models on the rest of the rows, and when they're finished they'll check the accuracy they got on the rows they pulled out. Those held-out rows are called the test set, and the rest of the rows are called the training set.

So say at the end of their modeling process, on the training set, they got an accuracy of 99% at predicting cats from dogs. At the very end they check it against the test set to make sure the model really generalizes — and it doesn't. Right, so, okay, I could go back and change some hyperparameters, do some data augmentation, whatever else, to try to create a more generalizable model, and then I'll go back again after doing all that and check, and it's still no good. I'll keep doing this again and again until eventually, after 50 attempts, it does generalize. But does it really generalize? Maybe all I've done is accidentally found this one model which happens to work just for that test set, because I've tried 50 different things. Something which is right coincidentally 5% of the time isn't very likely to give a good result on any single attempt, but after 50 attempts it becomes quite likely that one of them looks good just by chance.
So what we generally do is put aside a second data set: we grab a couple more of these rows and put them aside into a validation set. Everything that's not in the validation set or the test set is now the training set. So what we do is train a model, check it against the validation set to see if it generalizes, do that a few times, and then when we've finally got something where we think, okay, this generalizes successfully based on the validation set, at the end of the project we check it against the test set.

Question: so basically, by making this two-layer validation set and test set, if it gets one right and the other one wrong, you're kind of double-checking your errors?

Exactly — it's checking whether we have overfit to the validation set. If we're using the validation set again and again, then we could end up coming up not with a generalizable set of hyperparameters, but with a set of hyperparameters that just so happens to work on the training set and the validation set. So we keep checking against the validation set, and then at the end of all that we check against the test set. If it still generalizes well, then we're going to say: okay, that's good, we've actually come up with a generalizable model. If it doesn't, that's going to say: we've actually now overfit to the validation set, at which point you're kind of in trouble, because you don't have anything left.
Hopefully you'll use effective techniques during the modeling so that this doesn't happen, but if it's going to happen you want to find out about it. You need that test set to be there, because otherwise, when you put it in production and it turns out it doesn't generalize, that would be a really bad outcome: you end up with fewer people clicking on your ads, or selling less of your products, or providing car insurance to very risky vehicles, or whatever.

Question: do you ever need to check that the validation set and the test set are consistent with each other?

If you've done what I've just described here, which is to randomly sample them, there's no particular reason to check, as long as they're big enough.
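Here is a minimal sketch of that two-layer random split using scikit-learn — the synthetic data and variable names are purely illustrative, not from the lesson notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Sketch with made-up data; in practice X, y are your features and target.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = rng.normal(size=1000)

# Carve off a test set first, then a validation set from what remains.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

# Tune against (X_valid, y_valid) as often as you like; touch
# (X_test, y_test) only once, at the very end of the project.
```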
But we're going to come back to your question in a different context in just a moment.

Another trick we've learned for random forests is a way of not needing a validation set at all, and that was to instead use the OOB (out-of-bag) error, or OOB score. The idea was: every time we train a tree in a random forest, there's a bunch of observations that are held out anyway, because that's how we get some of the randomness. So let's calculate our score for each tree based on those held-out samples, and get a score for the forest by averaging, for each row, over the trees that didn't use that row. The OOB score gives us something which is pretty similar to the validation score, but on average it's a little less good. Can anybody either remember or figure out why, on average, it's a little less good?

Student: I'm not sure, but is it because you're doing all your pre-processing on the test set, so the OOB score is reflecting performance on the testing set?

No — the OOB score is not using the test set at all. The OOB score is using the held-out rows in the training set for each tree.

Student: So you're basically testing each tree on some data from the training set — so you have the potential of overfitting?

It shouldn't cause overfitting, because each tree is only scored on rows that were held out from it. So it's not an overfitting issue; it's quite a subtle issue. Ernest, do you want to have a try?

Student: Bootstrap samples on average only grab about 63% of the rows, so on average the OOB is the remaining one minus 63%.

Exactly. So what's the issue?

Student: Then why would the OOB score be lower than the validation score? That implies you're leaving a sort of black hole in the data — data points you're never going to sample, which aren't going to be represented by the model.

No, that's not true, because each tree is looking at a different subset. We've got, I don't know, dozens of trees, and in each one there's a different set of held-out rows. So when we calculate the OOB score for, say, row three, we say: okay, row three was held out of this tree and this tree, and that's it. We calculate the prediction for that row on those trees and average those predictions, and with enough trees — each tree has a 30-or-so, sorry, 40-or-so percent chance of having any given row held out — it's almost certain that every row is going to be covered somewhere.

Student: With a validation set we can use the whole forest to make the predictions, but here we cannot use the whole forest.

Exactly. So every row is going to be using only a subset of the trees to make its prediction, and with fewer trees we know we get a less accurate prediction. So that's a subtle one, and if you didn't get it, have a think during the week, because it's a really interesting test of your understanding of random forests: why is the OOB score on average less good than your validation score, when they're both using randomly held-out subsets? Anyway, in practice it's generally close enough.
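In scikit-learn the OOB score comes for free when you ask for it; a small sketch, reusing the synthetic split from above:

```python
from sklearn.ensemble import RandomForestRegressor

# Sketch: OOB R^2 is computed only from trees that did NOT see each row,
# so no separate hold-out is needed. X_train etc. are from the sketch above.
m = RandomForestRegressor(n_estimators=40, oob_score=True, n_jobs=-1,
                          random_state=42)
m.fit(X_train, y_train)

print(m.oob_score_)               # OOB R^2: each row scored by a subset of trees
print(m.score(X_valid, y_valid))  # validation R^2: every row scored by every tree
```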
So why keep a validation set at all? If it's a randomly chosen validation set, it's not strictly speaking necessary. But you've got several levels of things to test against: you can test on the OOB, and when that's working well you can test on the validation set, so that hopefully by the time you check against the test set there are going to be no surprises. So that's one good reason.

Then there's what Kaggle does, and the way they do it is kind of clever: they split the test set into two pieces, a public piece and a private piece, and they don't tell you which is which. You submit your predictions to Kaggle, and a random 30% of them is used to tell you your leaderboard score. At the end of the competition, that gets thrown away and they use the other 70% to calculate your real score. What that's doing is making sure you're not continually using feedback from the leaderboard to figure out some set of hyperparameters that happens to do well on the public piece but doesn't actually generalize.

It's a great test — this is one of the reasons it's good practice to use Kaggle. At some point, at the end of a competition, this will happen to you: you'll drop a hundred places on the leaderboard on the last day of the competition, when they switch to the private test set, and you'll say, oh, okay, that's what it feels like to overfit. It's much better to practice and get that sense there than to do it in a company where there are hundreds of millions of dollars on the line.
Okay, so this is the easiest possible situation, where you're able to use a random sample as your validation set. Why might I not be able to use a random sample for my validation set?

Student: In the case of something where we're forecasting, we can't randomly sample, because we need to maintain the temporal ordering — otherwise it doesn't make sense. In the case of, say, an ARMA model, I can't pull out random rows, because there's a certain dependency I'm trying to model that relies on a specific lag term; if I randomly sample, then that lag term isn't there for me to use.

Okay, so it could be a technical modeling issue: I'm using a model that relies on yesterday, the day before, and the day before that, and if I've randomly removed some rows I don't have yesterday, and my model might just fail. That's true, but there's a more fundamental issue — although in general we're going to try to build models that are more resilient than that.

Student: With temporal order, we expect things that are close by in time to be related to things close to them, and if we destroy the order we really aren't going to be able to use the fact that this time is close to that other time.

I don't think that's quite it, because we can pull out a random sample for a validation set and still keep everything else nicely ordered.

Student: We would like to predict things in the future, which would require as much data as possible close to the end of our training period.

Okay, that's true — we could be limiting the amount of recent data we have by taking some of it out. But my claim is stronger: my claim is that by using a random validation set we could get totally the wrong idea about our model. Do you want to have a try?

Student: If you're randomly sampling, we could end up with only one class in our validation set, so our fitted model may be misleading.

Right — so maybe you're trying to predict, in a medical situation, who's going to die of lung cancer, and that's only one out of a hundred people, and we pick a validation set where we accidentally have nobody that died of lung cancer. These are all good niche examples, but none of them quite says why the validation set could give you a plainly inaccurate idea of whether this is going to generalize.
So let's talk about it — the closest is what Tyler was saying about time. The important thing to remember is that when you build a model, you always have a systematic error, which is that you're going to use the model at a later time than the time you built it, by which point the world is different from the world you're in now. And even when you're building the model, you're using data which is older than today anyway. So there's some lag between the data you're building it on and the data it's actually going to be used on in real life, and a lot of the time — if not most of the time — that matters.

If we're predicting who's going to buy toilet paper in New Jersey, and it takes us two weeks to put the model in production, and we built it using data from the last couple of years, then by that time things may look very different. In particular, if we randomly sampled our validation set from a four-year period, then the vast majority of that data is going to be over a year old, and the toilet-paper buying habits of folks in New Jersey may have dramatically shifted. Maybe they've got a terrible recession there now and they can't afford high-quality toilet paper anymore, or maybe their paper-making industry has gone through the roof and suddenly they're buying lots more toilet paper because it's so cheap, or whatever. The world changes, and therefore, if you use a random sample for your validation set, you're actually checking how good you are at predicting things that are totally obsolete now — how good you are at predicting things that happened four years ago. That's not interesting.

So what we do instead — assuming we've ordered the data by time — is take the most recent chunk as our validation set and, if we're doing it properly, the chunk after that as our test set. Everything earlier is our training set, and we use that to try to build a model that still works on stuff that's later in time than anything the model was built on. So we're not just testing generalization in some kind of abstract sense, but in a very specific time sense: does it generalize to the future? Could you pass it to Siraj, please?
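A minimal sketch of that kind of time-ordered split in pandas — the column names and synthetic data are illustrative, not from the lesson notebook:

```python
import numpy as np
import pandas as pd

# Sketch: sort by date and hold out the most recent rows instead of random ones.
dates = pd.date_range('2008-01-01', periods=10_000, freq='h')
df = pd.DataFrame({'saledate': dates,
                   'price': np.random.default_rng(0).normal(size=len(dates))})

df = df.sort_values('saledate')
n_valid, n_test = 1_000, 1_000

train = df.iloc[:-(n_valid + n_test)]
valid = df.iloc[-(n_valid + n_test):-n_test]   # the most recent data before the test period
test  = df.iloc[-n_test:]                      # the very latest data, touched only once
```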
Siraj: As you said, there is some temporal ordering in the data. In that case, is it wise to take the entire data set for training, or only a few recent periods?

For the validation and test sets, or for training? — Training.

Yeah, that's a whole other question. So how do you get the validation set to be good? Say I build a random forest on all the training data, and it looks good on the training data. This is actually a really good reason to have the OOB: if it looks good on the OOB, it means you're not overfitting in a statistical sense — it's working well on a random sample. But then it looks bad on the validation set. So what happened? What happened was that you somehow failed to predict the future; you only predicted the past.

So Siraj had an idea about how we could fix that: maybe we shouldn't use the whole training set — maybe we should try a recent period only. On the downside, we're now using less data, so we can create less rich models; on the upside, it's more up-to-date data. And this is something you have to play around with. Most machine learning functions have the ability to provide a weight that is given to each row. For example, with a random forest, rather than bootstrapping uniformly at random, you could put a weight on every row and randomly pick each row with probability proportional to that weight — say, so that the most recent rows have a higher probability of being selected. That can work really well. It's something you have to try, and if you don't have a validation set that represents the future compared to what you're training on, you have no way of knowing which of your techniques are working.
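One way to experiment with this idea in scikit-learn is the sample_weight argument to fit. To be clear, this is only an approximation of what's described above: sample_weight scales each row's contribution to the training loss rather than changing which rows the bootstrap draws. A sketch with synthetic data and illustrative names:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Sketch only: up-weight recent rows so they count for more during training.
rng = np.random.default_rng(0)
dates = pd.date_range('2008-01-01', periods=5_000, freq='D')
X = pd.DataFrame(rng.normal(size=(5_000, 5)))
y = rng.normal(size=5_000)

age_days = np.asarray((dates.max() - dates).days)
weights = np.exp(-age_days / 365.0)    # newest rows get weights near 1, oldest near 0

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, random_state=42)
m.fit(X, y, sample_weight=weights)
```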
Question: how do you make the compromise between the amount of data and the recency of data?

What I tend to do when I have this kind of temporal issue — which is probably most of the time — is this. Once I have something that's working well on the validation set, I wouldn't then just use that model on the test set, because the test set is much further in the future relative to the training set. Instead, I would replicate building that model again, but this time I would combine the training and validation sets and retrain the model. At that point you've got no way to test against a validation set, so you have to make sure you have a reproducible script or notebook that does exactly the same steps in exactly the same ways, because if you get something wrong, you're going to find out on the test set that you've got a problem.

So what I do in practice is this: I need to know whether my validation set is truly representative of the test set. So I build five models on the training set, and I try to have them vary in how good I think they are. Then I score my five models on the validation set, and I also score them on the test set. I'm not cheating — I'm not using any feedback from the test set to change my hyperparameters; I'm only using it for this one thing, which is to check my validation set. So I get my five validation scores and five test scores, and I check whether they line up. If they don't, then you're not going to get good enough feedback from the validation set, so keep changing the validation set until they do line up — and that can be quite tricky. You're trying to create something that's as similar to the real-world outcome as possible, and that's difficult. When you're in the real world, the same is true of creating the test set: the test set has to be as close to production as possible. What's the actual mix of customers that are going to be using this? How much time is there actually going to be between when you build the model and when you put it into production? How often are you going to be able to refresh the model? These are all things to think about when you build that test set.
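A hedged sketch of that calibration check — the scores below are made up; the point is just to see whether validation and test scores move together:

```python
import matplotlib.pyplot as plt

# Suppose five models of varying quality were scored on both the validation
# set and the (otherwise untouched) test set. These numbers are invented;
# in practice they come from model.score(...) or RMSE calculations.
val_scores  = [0.78, 0.82, 0.85, 0.88, 0.90]
test_scores = [0.74, 0.79, 0.83, 0.86, 0.89]

plt.scatter(val_scores, test_scores)
plt.xlabel('validation score')
plt.ylabel('test score')
plt.title('If the points fall roughly on a line, the validation set is useful')
plt.show()
```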
Question: so, to be clear — you first make five models on the training data, and then, until you get a straight-line relationship, you change your validation and test sets?

You can't really change the test set, generally — this is assuming the test set is given — so you change the validation set. You might start with a random-sample validation set, find your results are all over the place, and realize, oh, I should have picked the last two months. You pick the last two months, it's still all over the place, and you realize, oh, I should have picked it so that it also runs from the first of the month to the fifteenth of the month. You keep changing your validation set until you've found one that is indicative of your test-set results.

Question: and for the five models, would you start with, say, just a simple average model and then make each one a bit better?

Exactly — maybe five models which aren't terrible, but you want some variety, and you particularly want some variety in how well they might generalize through time: one that was trained on the whole training set, one that was trained on the last two weeks, one which used lots and lots of columns and might overfit a bit more. You want to get a sense of: if my validation set fails to generalize temporally, I want to see that; if it fails to generalize statistically, I want to see that.

Question: can you explain in a bit more detail what you mean by changing your validation set so that it indicates the test set — what does that look like?

Sure. Let's take the groceries competition, where we're trying to predict the next two weeks of grocery sales. Possible validation sets that Terence and I played with were: a random sample; the last month; the last two weeks; and the same day range one month earlier. The test set in this competition was the couple of weeks following the 15th of August, so we tried a random sample over the four years, we tried the 15th of July to the 15th of August, we tried the 1st of August to the 15th of August, and we tried the 15th of July to the 30th of July. So there were four different validation sets. With the random one, our results were all over the place; with the last month they were not bad, but not great; the last two weeks were good on the whole; and the same day range a month earlier gave a basically perfect line.
Question: that's the part I'm confused about — when you create that graph, what exactly are you comparing against from the test set?

So, for each of my five models — there might be: just predict the average; do some kind of simple group mean over the whole data set; do a group mean over just the last month of the data; build a random forest on the whole thing; build a random forest on just the last two weeks — on each of those I calculate the validation score, and then I retrain the model on the whole training set and calculate the same thing on the test set. So each of these points tells me how well one model went on the validation set and how well it went on the test set. If the validation set is useful, we would say that every time the validation score improves, the test score should also improve.

Question: you just said retrain — do you mean retrain the model on training plus validation?

Yeah, that was the step I was talking about before: once I've got the validation score based on just the training set, I then retrain on the training plus validation data and check against the test set — which here means submitting it to Kaggle and checking the score. If it's Kaggle, then your test set is Kaggle's leaderboard; in the real world, the test set is this third data set that you put aside.

Having that test set reflect real-world production differences is the most important step in a machine learning project. Why is it the most important step? Because if you screw up everything else but you don't screw up that, you'll know you screwed up — you tested it, it didn't work out, and okay, you're not going to destroy the company. But if you screwed up creating the test set, that would be awful, because then you don't know if you've made a mistake: you try to build a model, you test it on the test set, it looks good, but the test set was not indicative of the real-world environment, so you don't actually know whether you're going to destroy the company. Hopefully you've got ways to put things into production gradually, so you won't actually destroy the company, but you'll at least destroy your reputation at work. It's like: oh, Jeremy tried to put this thing into production, and in the first week the cohort we tried it on had their sales halve, and we're never going to give Jeremy a machine learning job again. Whereas if Jeremy had used a proper test set, he would have known: oh, this is only half as good as my validation set said it would be, I'll keep trying — and now I'm not going to get into any trouble; it's actually, oh, Jeremy's awesome, he identifies ahead of time when there's going to be a generalization problem.
This is something that everybody talks about a little bit in machine learning classes, but often it stops at the point where you learn that there's a thing in scikit-learn called train_test_split, and it returns these things, and off you go — or, here's the cross-validation function. The fact that these functions always give you random samples tells you that much, if not most, of the time you shouldn't be using them. The fact that a random forest gives you an OOB score for free is useful, but it only tells you that your model generalizes in a statistical sense, not in a practical sense.

Then, finally, there's cross-validation, which you guys have been talking about a lot outside of class, which makes me feel somebody's been talking it up. So I'll explain what cross-validation is, and then I'll explain why you probably shouldn't be using it most of the time.

Cross-validation says: let's not just pull out one validation set — let's pull out five, say. We assume we randomly shuffle the data first (this is critical), and then split it into five groups. For model number one, we call the first group the validation set and everything else the training set; we train, check against the validation set, and get some RMSE, R², whatever. Then we call the second group the validation set and the rest the training set, and get another score. We do that five times and average the scores. So, who can tell me a benefit of using cross-validation over the kind of standard validation set I talked about before?

Student: Cross-validation will make use of all the data you have.

Yeah — you can use all of the data, you don't have to put anything aside, and you get a little benefit as well in that you've now got five models that you could ensemble together, each of which used 80% of the data. Sometimes that ensembling can be helpful.
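For completeness, this is roughly what it looks like in scikit-learn — a sketch with synthetic data; note the shuffling, which is exactly what makes it unsuitable for temporal problems:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Sketch: 5-fold cross-validation with an initial shuffle, on made-up data.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = rng.normal(size=1000)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestRegressor(n_estimators=20, n_jobs=-1),
                         X, y, cv=cv)

print(scores)          # one score per held-out fold
print(scores.mean())   # the averaged cross-validation score
```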
Now, can you tell me what could be some reasons you wouldn't use cross-validation?

Student: If we have enough data, we might not want the validation data to be included in the model's training at all.

I'm not sure that cross-validation is necessarily polluting the model in that way. What would be a key downside of cross-validation?

Student: For deep learning, if the network has already learned the pictures, then it will know the pictures and be more likely to predict them.

Sure, but each time in the cross-validation we have put aside some data — can you pass it to Siraj?

Student: Isn't the issue that none of these validation sets is truly held out from all of the models?

I think that's what the earlier concern was too, but I don't see why that would be a problem: each time we fit a model, we are absolutely holding out 20% of the sample. So yes, the five models between them have seen all of the data, but it's kind of like a random forest — in fact it's a lot like a random forest: each model has only been trained on a subset of the data.

Yes, Nishan? — If it's a large data set, it will take a lot of time.

Exactly right: we have to fit five models rather than one, so there's key downside number one. If you're doing deep learning and it takes a day to run, suddenly it now takes five days, or you need five GPUs. Okay — and what about my earlier issues with validation sets? Can you pass it over there?

Student: If you had temporal data, then wouldn't shuffling break that relation?

We could reorder it — we could shuffle, get the training set out, and then sort it by time; presumably there's a date column there — so I don't think it's going to stop us from building a model. Did you have something?

Student: With cross-validation you're building five validation sets, and if there's some sort of structure that you're trying to capture in your validation set to mirror your test set, you're essentially throwing away the chance to construct that.

I think you've said the same thing I was going to say, which is that our earlier concerns about why random validation sets are a problem are entirely relevant here: all of these validation sets are random. So if a random validation set is not appropriate for your problem — most likely because, for example, of temporal issues — then none of these five validation sets is any good. If you want validation sets that work temporally, like we built here, there's no way — or at least probably no good way — to do cross-validation. You want your validation set to be as close to the test set as possible, and you can't do that by randomly sampling different things.

And you may well not need cross-validation, because most of the time in the real world you have a reasonable amount of data — unless your data comes from some very, very expensive labeling process, or from experiments that cost a lot to run. Data scientists are not very often doing that kind of work; some are, in which case this is an issue, but most of us aren't. So: you probably don't need it; as Nishan said, if we do do it, it's going to take a whole lot of time; and, as Ernest said, even if we did do it and took up all that time, it might give us totally the wrong answer, because random validation sets are inappropriate for our problem.

Okay, so I'm not going to spend much time on cross-validation, because I think it's just an interesting tool to have. It's easy to use — scikit-learn has a cross-validation function you can go ahead and use — but it's not that often that it's going to be an important part of your toolbox, in my opinion. It'll come up sometimes.
Okay. Something we looked at last time, and got a little bit stuck on because I screwed it up, was tree interpretation. Does everybody remember what the tree interpreter does and how it does it? It's a difficult one to explain — I don't think I did a good job of explaining it — so don't worry if you can't do a great job, but does anybody want to have a go at explaining it?

Let's start with the output of the tree interpreter. If we look at a single model — a single tree, in other words — the root of the tree is the average log price of all of the auctions in this tree's subsample of our training set. Here, 10.189 is the average of all of them. Then, if I take Coupler_System less than or equal to 0.5, I get 10.345: for the subset where Coupler_System ≤ 0.5, the average is 10.345. Then, of the rows with Coupler_System ≤ 0.5, we take the subset where Enclosure ≤ 2, and the average log sale price there is 9.955. And then the final step in our tree: for this group with no coupler system and Enclosure ≤ 2, we take ModelID ≤ 4573, and that gives us our final leaf.

So then we can say: we started with 10.189, the average for everybody in this particular tree's subsample of 20,000. Adding in the coupler decision — Coupler_System ≤ 0.5 — increased our prediction by 0.156. In other words, if we had predicted with a naive model of just the mean, we would have said 10.189; adding in just the coupler decision changes that to 10.345, so this variable is responsible for a 0.156 increase in our prediction. From there, the enclosure decision was responsible for a 0.395 decrease, and the ModelID decision was responsible for a 0.276 increase, until eventually that was our final prediction for this particular auction's sale price.
We can draw that as what's called a waterfall plot. Waterfall plots are one of the most useful plots there are, and yet there's nothing in Python to do them. This is one of those disconnects between the world of management consulting and business, where everybody uses waterfall plots all the time, and everybody else, who has no idea what these things are. Every time you're looking at, say, last year's sales for Apple and then a set of changes — iPhones increased by this amount, Macs decreased by that amount, iPads increased by that amount — any time you have a starting point, a number of changes, and a finishing point, waterfall charts are pretty much always the best way to show it. So here, our prediction for price starts at 10.189; there was an increase (blue means increase) of 0.156 for coupler, a decrease of 0.395 for enclosure, and an increase of 0.276 for ModelID — so increase, decrease, increase — to get to our final prediction.

With Excel 2016 it's built in: you just click Insert, Waterfall Chart, and there it is. If someone wants to write a waterfall chart package for matplotlib and put it on pip, everybody will love you for it. They're actually super easy to build: you basically do a stacked column plot where the bottom segment of each column is all white. If you can wrap that all up, put the data points in the right spots, and color them nicely, that would be totally awesome. I think you've all got the skills to do it, and it could even make an interesting Kaggle kernel — here's how to build a waterfall plot from scratch.
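Following that stacked-bar idea, here is a rough sketch of such a waterfall plot in matplotlib — the values are the contributions from the example tree above, and the "invisible" bottoms do the offsetting:

```python
import matplotlib.pyplot as plt

# Rough sketch of a waterfall chart: each change floats at the running total.
labels  = ['mean', 'Coupler_System', 'Enclosure', 'ModelID', 'prediction']
changes = [10.189, 0.156, -0.395, 0.276]

bottoms, heights = [], []
running = 0.0
for c in changes:
    bottoms.append(running if c >= 0 else running + c)  # where the bar starts
    heights.append(abs(c))                               # how tall the bar is
    running += c
bottoms.append(0.0)           # the final bar shows the cumulative prediction
heights.append(running)

colors = ['grey'] + ['b' if c >= 0 else 'r' for c in changes[1:]] + ['grey']
plt.bar(labels, heights, bottom=bottoms, color=colors)
plt.ylabel('log(sale price)')
plt.show()
```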
So in general, going from the overall average and then through each change, the starting value plus the sum of all of those changes is equal to the final prediction. That's how we could explain things if we were just using a single decision tree: somebody comes along and asks how this particular auction ended up with this particular predicted price, and we say, well, it's because these three things had these three impacts.

For a random forest we can do that across all of the trees: every time we see coupler, we add up that change; every time we see enclosure, we add up that change; every time we see ModelID, we add up that change — and then we combine them all together. That's what the tree interpreter does. You could go into the source code for treeinterpreter — it's not at all complex logic — or you could build it yourself. So when you call the tree interpreter's predict with a random forest model and some specific auction — I've got a specific row here, my zero-indexed row — it tells you: the prediction, which is the same as the random forest's prediction; the bias, which is always going to be the same — it's the average sale price over each tree's random sample; and the contributions, which are the totals, for each column, of the changes attributed to it each time it was used.

Last time I made the mistake of not sorting this correctly, so this time I sort it: argsort on contributions[0] just tells you where each item would move to if it were sorted, and indexing by that I can print everything out in the right order. So I can see: here's my column, its level, and its contribution — the fact that it's a small version of this piece of industrial equipment meant it was less expensive, the fact that it was made pretty recently meant it was more expensive, but the fact that it's pretty old made it less expensive.
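The call itself looks roughly like this — a sketch with synthetic data and illustrative column names, using the treeinterpreter package (pip install treeinterpreter):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from treeinterpreter import treeinterpreter as ti

# Sketch: explain one row's prediction as bias + per-feature contributions.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 3)),
                  columns=['YearMade', 'Coupler_System', 'Enclosure'])
y = df['YearMade'] * 2 + rng.normal(size=500)

m = RandomForestRegressor(n_estimators=40, n_jobs=-1).fit(df, y)

row = df.values[0:1]                          # keep the 2-D shape (1, n_features)
prediction, bias, contributions = ti.predict(m, row)

# prediction[0] == bias[0] + contributions[0].sum()  (up to floating-point error)
idxs = np.argsort(contributions[0])           # order columns by contribution
for col, contrib in zip(df.columns[idxs], contributions[0][idxs]):
    print(f'{col:>15}: {contrib:+.3f}')
```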
Now, this isn't going to help you much at all in a Kaggle-style situation where you just need predictions, but it's going to help you a lot in a production environment, or even pre-production. Something any good manager should do, if you say "here's a machine learning model I think we should use", is go away, grab a few examples of actual customers or actual auctions or whatever, and check whether your model looks intuitive. If it says lots and lots of people are going to really enjoy this crappy movie — and it really was a crappy movie — then they're going to come back to you and say: explain why your model is telling me I'm going to like this movie, because I hate that movie. Then you can go back and say: well, it's because you liked this other movie, and because you're in this age range and of this gender, and on average people like you actually did like that movie.

Question: so here it's saying it was a mini, it was 11 years old, and it was a hydraulic excavator, track, 3 to 4 metric tons?

It's just feeding back to you: it's because this is actually what that row's values were. Okay — and if we sum up all the contributions together and add them to the bias, then that's the same as the final prediction, as we know from our waterfall chart.

So this is an almost totally unknown technique, and this particular library is almost totally unknown, even though it shows something that a lot of people need; it's totally critical in my opinion. That's the end of the random forest interpretation piece, and hopefully you've now seen enough that when somebody says we can't use modern machine learning techniques because they're black boxes that aren't interpretable, you have enough information to say: you're full of shit. They're extremely interpretable, and the stuff that we've just done — try to do that with a linear model, good luck to you. Even where you can do something similar with a linear model, trying to do it so that it's not giving you totally the wrong answer — when you had no idea it was the wrong answer — is going to be a real challenge.
So the last thing we're going to do, before we try to build our own random forest, is deal with this tricky issue of extrapolation. Let's look at the accuracy of our most recent model: there's still some difference between our validation score and our training score, but the difference between the OOB and the validation is actually pretty close. If there was a big difference between validation and OOB, I'd be very worried that we hadn't dealt properly with the temporal side of things. Looking at our most recent model, it's a tiny difference, and on Kaggle, at least, you kind of need that last decimal place; in the real world I'd probably stop here. But quite often you will see a big difference between your validation score and your OOB score, and I want to show you how you would deal with that — particularly because we know the OOB should be a little worse, since it's using fewer trees per row, so it gives me a sense that we should be able to do a little bit better. And the way we should be able to do a little bit better is by handling the time component.

Here's the problem with random forests when it comes to extrapolation. You've got, say, four years of sales data, and you create your tree. It says: if it's in some particular store and it's some particular item, here's the average price — and it actually tells us the average price over the whole training set, which could be pretty old. So when you then want to step forward and ask what the price is going to be next month, it has never seen next month. Whereas a linear model can find a relationship between time and price, so that even though it only saw this range of data, when you go and predict something in the future it can extrapolate. A random forest can't do that: there's no way, if you think about it, for a tree to say, well, next month it would be higher still.
There are a few ways to deal with this, and we'll talk about them over the next couple of lessons, but one simple way is to avoid using time-dependent variables as predictors when there's something else we could use that gives us a stronger relationship that's actually going to hold in the future. In this case, what I wanted to do was first figure out: what's the difference between our validation set and our training set? Because if I understand the difference between the validation set and the training set, then I know which predictors have a strong temporal component, and are therefore likely to be irrelevant by the time I get to the future time period.

So I do something really interesting: I create a random forest where my dependent variable is "is it in the validation set?". I've gone back and got my whole data frame with the training and validation sets all together, and I've created a new column called is_valid, which I've set to 1, and then for all of the rows in the training set I set it to 0. So I've got a new column which is just: is this row in the validation set or not? Then I use that as my dependent variable and build a random forest. This is a random forest not to predict price, but to predict "is this in the validation set or not". If your variables were not time-dependent, it shouldn't be possible to figure out whether something is in the validation set or not.

This is a great trick in Kaggle, because they often won't tell you whether the test set is a random sample or not. So you can put the test set and the training set together, create a new column called is_test, and see if you can predict it. If you can, you don't have a random sample, which means you have to figure out how to create a validation set from it. In this case, I can see I don't have a random sample, because my validation set can be predicted.
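A minimal sketch of that trick — the frames and column lists are illustrative names for your own prepared training and validation data, not from the lesson notebook:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Sketch: can a model tell training rows from validation rows? If so, your
# validation (or test) period is systematically different from the training
# period. `train`, `valid` and `feature_cols` are assumed to be yours.
df = pd.concat([train.assign(is_valid=0), valid.assign(is_valid=1)])

m = RandomForestClassifier(n_estimators=40, oob_score=True, n_jobs=-1,
                           random_state=42)
m.fit(df[feature_cols], df['is_valid'])

print(m.oob_score_)   # near 1.0 means the two sets are easily told apart

# The most important features are the most strongly time-dependent ones.
fi = pd.Series(m.feature_importances_, index=feature_cols).sort_values(ascending=False)
print(fi.head())
```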
So then, if I look at the feature importance, the top thing is SalesID — and this is really interesting, because it tells us very clearly that SalesID is not a random identifier; it probably just increases consecutively as time goes on. SaleElapsed was the number of days since the first date in our data set, so not surprisingly that's a good predictor too. MachineID — clearly each machine is being labeled with some consecutive identifier as well. And don't just look at the order, look at the values: 0.7, 0.1, and then it drops right off. These top three are hundreds of times more important than the rest. So let's grab those top three and look at their values in the training set versus the validation set. We can see, for example, that SalesID (I've divided by a thousand) is on average 1.8 million in the training set and 5.8 million in the validation set, so you can just confirm: okay, they're very different.

So after I drop those, let's now see if I can still predict whether something's in the validation set — and I still can, with about 0.98 accuracy. Once you remove some things, other things come to the front, and it turns out there are other, more subtly time-related columns that can still tell the two sets apart.
So what we can try doing is this: take SalesID, SaleElapsed and MachineID from the first run, and age, YearMade and saleDayofyear from the second one, and say, okay, these are all time-dependent features. I still want them in my random forest if they're important — but if they're not important, and there are other non-time-dependent variables that work just as well, then taking them out would be better, because then I'll have a model that generalizes over time better. So here I go ahead and go through each one of those features, drop it, retrain a new random forest, and print out the score.
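As a sketch, that loop looks something like this — X_train, X_valid, y_train and y_valid are assumed to be your prepared frames; it's the pattern that matters:

```python
from sklearn.ensemble import RandomForestRegressor

# Sketch of the drop-one-feature experiment: retrain with each suspect
# time-dependent column removed and compare validation scores to a baseline.
suspects = ['SalesID', 'saleElapsed', 'MachineID', 'age', 'YearMade', 'saleDayofyear']

def valid_score(cols):
    m = RandomForestRegressor(n_estimators=40, n_jobs=-1, random_state=42)
    m.fit(X_train[cols], y_train)
    return m.score(X_valid[cols], y_valid)

baseline_cols = list(X_train.columns)
print('baseline:', valid_score(baseline_cols))

for f in suspects:
    cols = [c for c in baseline_cols if c != f]
    print('without', f, ':', valid_score(cols))

# Finally, drop every feature whose removal improved the validation score,
# and retrain once more on the reduced column set.
```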
So note what our score was before we did any of that, and then look at what happens as we remove each feature. For some of them, removing the feature made the validation score better — this is what we're hoping for: we've removed a time-dependent variable, and there were other variables that could find similar relationships without the time dependency, so removing it caused our validation score to go up. The variable is genuinely, statistically, a useful predictor, but it's a time-dependent one, and we have a time-dependent validation set. This is really subtle, but it can be really important: we're trying to find the things that give us a prediction that generalizes across time, and here's how you can see it. So: we should remove SalesID for sure. SaleElapsed didn't get better, so we don't want to remove that. MachineID did get better — from 0.888 to 0.893, actually quite a bit better. YearMade got worse; saleDayofyear got a bit better.

So now we can say, all right, let's get rid of the ones where we know that removing them actually made things better. And as a result, look at this: we're now up to 0.915. We've got rid of three time-dependent things, and now, as expected, our validation score has improved. That was a super successful approach. Now we can check the feature importance again, and then say: all right, that was pretty damn good, let's now make it a bigger model and leave it to run for a while — give it 160 trees and see how it goes. As you can see, we did all of our interpretation and all of our fine-tuning basically with smaller models and subsets, and at the end we run the whole thing, which still only took 16 seconds, and we've now got an RMSE of 0.21. So now we can check that against Kaggle.
It's an older competition and we're not allowed to enter anymore, so the best we can do is check whether it looks like we could have done well based on our validation set. It should be in the right area, and yeah — based on that, we would have come first.

So it's a series of steps, and you can go through that same series of steps in your Kaggle projects and, more importantly, in your real-world projects. One of the challenges is that once you leave this learning environment, suddenly you're surrounded by people who never have enough time, who always want you to be in a hurry, who are always telling you to do this and then do that. You need to find the time to step away and go back through these steps, because this is a genuine real-world modeling process you can use, and it gives world-class results. When I say world-class: Leustagos — sadly he's passed away, but he was the best Kaggle competitor of all time; he won, I believe, dozens of competitions — so if we can get a score even within cooee of him, then we are doing really, really well.

Okay, let's take a five-minute break, and we'll come back and build our own random forest.

I just want to quickly clarify something — a very good point was made during the break. The improvement we saw here is not just due to the fact that we removed those variables; we also called reset_rf_samples, so to see the impact of just removing them we need to compare against the corresponding earlier step, which was 0.907. So removing those three things took us from 0.907 to 0.915. In the end, of course, what matters is the final model, but just to clarify.
Some of you have asked me about writing your own random forest from scratch; I don't know if any of you have given it a try yet. My original plan here was to do it in real time, but as I started to do it I realized that would have been kind of boring for you, because I screw things up all the time. So instead we might do more of a walk through the code together.

This reminds me — talking about the exam — somebody asked on the forum what you can expect from it. The exam will be very similar to these notebooks. It will probably be a notebook where you have to, you know, get a data set, create a model, train it, plot feature importance, whatever; and the plan is that it will be open book, open internet — you can use whatever resources you like — so basically, if you've been entering competitions, the exam should be very straightforward. I also expect there will be some pieces like: here's a partially completed random forest, finish writing this step; or: here's a random forest, implement feature importance, or implement one of the other things we've talked about. So the exam will be much like what we do in class and what you're expected to be doing during the week. There won't be any "define this" or "tell me the difference between this word and that word"; there's not going to be any rote learning. It will be entirely about whether you are an effective machine learning practitioner: can you use the algorithms, can you create an effective validation set, and can you implement parts of the algorithms from scratch? It will be all about writing code. If you're not comfortable writing code to practice machine learning, then you should be practicing that all the time; if you are comfortable, you should be practicing it all the time too. Whatever you're doing, write code to do machine learning.
And I'm not going to claim it's the only way of writing code 01:04:56.100 |
But it might be a little bit different to what you're used to and hopefully you'll find it at least interesting 01:05:06.180 |
Is actually quite tricky not because the clothes tricky like generally speaking 01:05:10.580 |
Most random first algorithms are pretty conceptually easy, you know that generally speaking 01:05:18.220 |
Academic papers and books have a knack of making them look difficult, but they're not difficult conceptually 01:05:26.740 |
what's difficult is getting all the details right and knowing and knowing when you're right and 01:05:32.420 |
So in other words, we need a good way of doing testing 01:05:36.680 |
So if we're going to re-implement something that already exists — so like, say we wanted to create a random forest in some 01:05:45.240 |
different framework, different language, different operating system — you know, I would always start with something that does exist, right? 01:05:51.120 |
So in this case, we're just going to, as a learning exercise, write a random forest in Python, 01:05:55.200 |
and for testing I'm going to compare it to an existing random forest implementation. 01:06:00.680 |
Okay, so that's like critical any time you're doing anything involving 01:06:05.800 |
non-trivial amounts of code in machine learning 01:06:08.960 |
Knowing whether you've got it right or wrong is kind of the hardest bit 01:06:12.960 |
I always assume that I've screwed everything up at every step, and so I'm thinking, okay, assuming that I screwed it up, how would I know? 01:06:22.040 |
Right — and then, much to my surprise, from time to time I actually get something right, and then I can move on. 01:06:32.080 |
Unfortunately with machine learning, there are a lot of ways you can get things wrong that don't give you an error; 01:06:36.840 |
they just make your result slightly less good, 01:06:40.240 |
and so that's what you want to pick up. 01:06:43.520 |
So given that, I want to compare it to an existing implementation, 01:06:48.760 |
so I'm going to use our existing data set, our existing validation set, and then to simplify things I'm just going to use two columns. 01:06:59.080 |
So let's go ahead and start writing a random forest. My way of writing 01:07:03.920 |
nearly all code is top-down, just like my teaching, and by top-down I mean I start by assuming that everything I want already exists. 01:07:15.600 |
Right. So in other words, the first thing I want to do — I'm going to call this a TreeEnsemble. 01:07:21.240 |
All right, so to create a random forest, the first question I have is: 01:07:29.160 |
what do I need to initialize my random forest? So I'm going to need some independent variables, a dependent variable, and how many trees. 01:07:40.560 |
I'm also going to use the sample size parameter from the start here — 01:07:43.840 |
how big you want each sample to be — and then maybe some optional parameter for what the smallest leaf size is. 01:07:53.800 |
For testing, it's nice to use a constant random seed, so we'll get the same result each time. 01:07:59.320 |
So this is just how you set a random seed, okay? 01:08:02.360 |
Maybe it's worth mentioning this for those of you unfamiliar with it 01:08:06.040 |
Random number generators on computers aren't random at all. They're actually called pseudo random number generators 01:08:12.760 |
and what they do is: given some initial starting point, in this case 42, a 01:08:19.160 |
pseudo-random number generator is a mathematical function that generates a deterministic — always the same — sequence of numbers, 01:08:27.040 |
such that those numbers are designed to be as uncorrelated with the previous number as possible, 01:08:37.520 |
and as uncorrelated as possible with a sequence started from a different random seed. 01:08:42.520 |
So the second number in the sequence starting with 42 should be very different from the second number starting with 41. 01:08:49.200 |
And generally they involve, you know, 01:08:52.280 |
using big prime numbers and taking mods and stuff like that. It's kind of an interesting area of math. 01:09:01.640 |
If you want real random numbers, the only way to do that is with hardware — you can actually buy 01:09:07.720 |
hardware called a hardware random number generator that has inside it a little bit of some radioactive 01:09:14.280 |
substance, and something that detects how many particles it's spitting out, or some other genuinely 01:09:30.160 |
random physical process. So that would be, maybe, for a random seed — this question of 01:09:37.760 |
what we start the function with. So one of the really interesting areas is: in your computer, if you don't set the random seed, where does it come from? 01:09:48.200 |
Yeah, quite often people use the current time. For security — obviously we use a lot of random numbers for security stuff — 01:09:56.480 |
like if you're generating an SSH key, it needs to be random. 01:10:00.580 |
It turns out, you know, people can figure out roughly when you created a key — they could look at, like, oh, 01:10:08.280 |
id_rsa has a timestamp — and they could try, you know, all the different nanosecond 01:10:13.160 |
starting points for a random number generator around that timestamp and figure out your key. 01:10:21.480 |
So high-randomness-requiring applications actually have a step that says, please move your mouse and type random stuff at the keyboard for a while, 01:10:31.040 |
and so it gets you to be a source of entropy. 01:10:35.300 |
Other approaches are, they'll look at, like, you know, the hash of some of your log files, or, you know, 01:10:44.200 |
stuff like that. It's a really, really fun area. 01:10:47.100 |
So in our case our purpose actually is to remove randomness 01:10:51.360 |
So we're saying okay generate a series of pseudo random numbers starting with 42, so it always should be the same 01:11:02.480 |
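As a quick illustration (a minimal sketch added here, not the lesson notebook itself), this is what a constant seed buys you with numpy's generator:

```python
import numpy as np

np.random.seed(42)                 # fix the starting point of the pseudo-random sequence
first = np.random.rand(3)

np.random.seed(42)                 # reset to the same starting point
second = np.random.rand(3)

assert (first == second).all()     # the two "random" sequences are identical
```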
Oh, this is a basically standard idiom — at least, I write it this way; most people don't — but if you pass in, like, 01:11:09.300 |
one, two, three, four, five things that you're going to want to keep inside this object, 01:11:14.000 |
then you basically have to say self.x = x, self.y = y, self.sample_sz = sample_sz, and so on. 01:11:27.280 |
This is, like, my way of coding — most people think this is horrible — 01:11:29.740 |
but I prefer to be able to see everything at once, and so I know in my code, any time 01:11:35.440 |
it's written like this, it's always all of the stuff in the method being set. If I did it a different way, then half the code now comes off 01:11:41.960 |
the bottom of the page and you can't see it. So — 01:11:47.560 |
So that was the first thing I thought about: okay, to create a random forest, 01:11:51.760 |
what information do you need? Then I'm going to need to store that information inside my object, and so then I 01:11:57.600 |
need to create some trees, right — a random forest is something that has some trees — so I basically figured, okay, 01:12:05.640 |
a list comprehension to create a list of trees. How many trees do we have? We've got n_trees trees — 01:12:11.420 |
that's what we asked for — so range(n_trees) gives me the numbers from zero up to n_trees minus one. 01:12:19.400 |
Okay, so if I create a list comprehension that loops through that range, 01:12:23.720 |
calling create_tree each time, I now have n_trees trees. 01:12:30.640 |
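Put together, the constructor being described looks roughly like this — a reconstruction of what is on screen, so names like sample_sz are assumptions rather than the exact notebook code:

```python
import numpy as np

class TreeEnsemble():
    def __init__(self, x, y, n_trees, sample_sz, min_leaf=5):
        np.random.seed(42)                        # constant seed so testing is repeatable
        self.x, self.y = x, y                     # independent and dependent variables
        self.sample_sz, self.min_leaf = sample_sz, min_leaf
        # delay the thinking: assume create_tree exists and just call it n_trees times
        self.trees = [self.create_tree() for i in range(n_trees)]
```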
And now — so I had to write that, but I didn't have to think at all; that's all, like, 01:12:36.120 |
obvious, and so I've kind of delayed the thinking to the point where it's like, well, wait — we don't have something to create a tree. 01:12:44.560 |
Okay, no worries, but let's pretend we did. If we did, we've now created a random forest. 01:12:51.000 |
Okay, we still need to do a few things on top of that; for example, once we have it, 01:12:56.280 |
we would need a predict function. So okay, well, let's write a predict function. How do you predict in a random forest? Can somebody tell me — 01:13:07.040 |
either based on your own understanding or based on this line of code — what would be your one-or-two-sentence answer: 01:13:13.160 |
how do you make a prediction in a random forest? 01:13:18.840 |
You would want to go over every tree, for the row that you're trying to predict on, 01:13:26.520 |
and average the values that each tree would produce for that row. 01:13:30.400 |
And so, you know, that's a summary of what this says, right: so for a particular row, for each tree, 01:13:41.920 |
calculate its prediction — so here is a list comprehension that is calculating the prediction of every tree for x. 01:13:50.800 |
I don't know if x is one row or multiple rows — it doesn't matter, right, 01:13:55.640 |
as long as tree.predict works on it. 01:13:59.280 |
And then, once you've got a list of things, a cool trick to know is that you can pass np.mean a plain list, 01:14:09.080 |
okay, and it'll take the mean; you just need to tell it 01:14:12.720 |
axis=0, which means average it across the list. Okay, so this is going to return the average of 01:14:21.960 |
tree.predict for each tree. And so I find list comprehensions 01:14:27.560 |
allow me to write the code the way my brain works: like, you could take the words 01:14:35.920 |
Spencer said and translate them into this code, or you could take this code and translate it into words, like the ones Spencer said. And so when 01:14:41.920 |
I write code, I want it to be as much like that as possible. 01:14:45.480 |
I want it to be readable, and so hopefully you'll find — when you look at the fastai code 01:14:50.880 |
you're trying to understand, well, how did Jeremy do X — 01:14:52.880 |
that I try to write things in a way that you can read and kind of turn into English in your head. 01:14:58.000 |
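And the predict method just described, as a sketch (continuing the reconstructed TreeEnsemble above):

```python
    def predict(self, x):
        # one prediction per tree for x, then the mean across the trees (axis 0)
        return np.mean([t.predict(x) for t in self.trees], axis=0)
```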
So, if I see correctly, that predict method is recursive? 01:15:06.800 |
No, it's calling tree dot predict and we haven't written a tree yet 01:15:11.200 |
So self dot trees is going to contain a tree object 01:15:16.480 |
So this is tree ensemble dot predict and inside the trees is a tree not a tree ensemble 01:15:22.980 |
So this is calling tree dot predict not tree ensemble dot predict 01:15:29.560 |
Okay, so we've nearly finished writing a random forest, haven't we? All we need to do now is write create_tree, right? 01:15:43.040 |
Based on your own understanding of how we create trees in a random forest — can somebody tell me? 01:15:49.120 |
Let's take a few seconds: have a read, have a think, and then I'm going to try and come up with a way of saying it. 01:15:59.400 |
Okay, who wants to tell me? Yes — okay, Tyler's got close to it: 01:16:12.520 |
essentially you're taking a random sample of the original data, and then you're just 01:16:19.280 |
constructing a tree, however that happens. 01:16:23.120 |
So: construct a decision tree — like, a non-random tree — from a random sample of the data. 01:16:29.520 |
Okay, so again, we've delayed any actual thought process here. We've basically said, okay, we could pick some random IDs — 01:16:47.680 |
a randomly shuffled sequence from zero to the length of the data — and so then if you grab the first sample_sz of them, you've got a random 01:16:59.320 |
subsample. So this is not doing bootstrapping — we're not doing sampling with replacement here — 01:17:06.880 |
which I think is fine, you know; for my random forest 01:17:10.400 |
I'm deciding that it's going to be something where we do the subsampling, not bootstrapping. Okay, so here's a good line of code to know, 01:17:18.880 |
because it comes up all the time — like, I find in machine learning 01:17:25.440 |
things are somewhat random, and so often I need some kind of random sample. Can you pass that along? 01:17:35.840 |
Won't that give you one extra, because you said it'll go from zero to length? 01:17:41.080 |
No — so this will give you, if len(self.y) is 01:17:47.120 |
size n, a sequence of length n, so 0 to n minus 1; so then 01:17:57.360 |
[:self.sample_sz] takes the first sample_sz IDs. 01:18:04.360 |
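So create_tree, as described, probably looks something like this — again a reconstruction, assuming x is a pandas DataFrame and y a numpy array:

```python
    def create_tree(self):
        # random permutation of all the row ids, keep the first sample_sz of them:
        # subsampling without replacement, rather than bootstrapping
        idxs = np.random.permutation(len(self.y))[:self.sample_sz]
        return DecisionTree(self.x.iloc[idxs], self.y[idxs],
                            idxs=np.array(range(self.sample_sz)),
                            min_leaf=self.min_leaf)
```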
I have a comment on bootstrapping: I think this method is better, because we have a chance of giving more weight to each 01:18:14.120 |
observation — or am I thinking wrong? — I mean, I think with bootstrapping we could also give weights. I mean — 01:18:25.680 |
without wanting that weight, though, because when bootstrapping with replacement we can 01:18:31.760 |
have a single observation and duplicates of it in the same tree. — Yeah, it does feel weird, but I don't think 01:18:44.200 |
theory or empirical results back up either intuition about which is worse. It would be interesting to look back at that, actually. 01:18:52.200 |
Personally, I prefer this, because I feel like most of the time we have more data than we 01:18:59.960 |
want to put in a tree at once. I feel like back when Breiman created random forests, it was 1999 — 01:19:05.180 |
it was kind of a very different world, you know, where we pretty much always wanted to use all the data we had. 01:19:14.480 |
Nowadays we normally have too much data, and so what people tend to do is, they're like, fire up a Spark cluster 01:19:20.060 |
and they'll run it on hundreds of machines, when 01:19:22.520 |
it makes no sense — because if they had just used a subsample each time, 01:19:26.860 |
they could have done it on one machine, and, like, the overhead of 01:19:30.440 |
Spark is a huge amount of IO overhead — like, I know you guys are doing distributed computing now; if you've looked at some of the benchmarks — 01:19:38.400 |
yeah, yeah, exactly. So if you do something on a single machine, it can often be hundreds of times faster, 01:19:45.980 |
because you don't have all this IO overhead. It also tends to be easier to write the algorithms — like, you can use, like, scikit-learn. 01:19:58.420 |
So I almost always avoid distributed computing, and I have my whole life — like, even 25 years ago when I was starting in machine learning, 01:20:11.080 |
whatever I could do with a cluster, I could do with a single machine five years later. 01:20:15.940 |
So why don't we focus on always being as good as possible with a single machine, you know — 01:20:20.640 |
and that's going to be more interactive and more iterative. 01:20:26.680 |
Okay, so again, we've delayed thinking 01:20:30.340 |
to the point where we have to write DecisionTree, 01:20:33.880 |
and so hopefully you get an idea that with this top-down approach, the goal is that we're going to keep delaying thinking — 01:20:42.280 |
like, eventually we've somehow written the whole thing without actually having to think, right? And that's kind of what I need, 01:20:48.840 |
because I'm kind of slow, right? So this is why I write code this way. And notice, like, you never have to design anything — 01:20:55.940 |
you know, you just say, hey, what if somebody already gave me the exact API I needed — how would I use it? 01:21:00.660 |
Okay, and then, okay, to implement that next stage, 01:21:04.680 |
what would be the exact API I would need to implement that? And you keep going down until eventually you're like, oh, that already exists. 01:21:14.080 |
This assumes we've got a class for DecisionTree, so we're going to have to create that. Its constructor 01:21:23.580 |
is something — well, we already know what we're going to have to pass it, because we just passed it, right? 01:21:35.260 |
The indexes argument is actually — so, we know that down the track, so I've got to plan a tiny bit: 01:21:46.740 |
we know that a decision tree is going to contain decision trees, which themselves contain decision trees, 01:21:54.540 |
and there's going to be some subset of the original data that we've kind of got to, and so I'm going to pass in the indexes 01:22:00.780 |
of the data that we're actually going to use here. Okay, so initially it's the entire sample, 01:22:14.060 |
and I'll turn that into an array — so that's the indexes from zero to the size of the sample — and 01:22:21.020 |
then we'll just pass down min_leaf. So everything that we got for constructing the random forest, 01:22:26.740 |
we're going to pass down to the decision tree — except, of course, n_trees, which is irrelevant for the decision tree. 01:22:31.780 |
So again, now that we know that's the information we need, we can go ahead and store it inside this object, along with 01:22:42.580 |
how many rows we have in this tree, which I generally call n, 01:22:48.580 |
and how many columns I have, which I generally call c. 01:22:51.580 |
So the number of rows is just equal to the number of indexes 01:22:55.140 |
we were given, and the number of columns is just, like, however many columns there are in our independent variables. 01:23:27.460 |
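So the DecisionTree constructor being described is roughly this (a sketch, reconstructed from the description rather than copied from the notebook):

```python
class DecisionTree():
    def __init__(self, x, y, idxs, min_leaf=5):
        self.x, self.y, self.idxs, self.min_leaf = x, y, idxs, min_leaf
        self.n = len(idxs)        # number of rows in this part of the tree
        self.c = x.shape[1]       # number of columns in the independent variables
```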
The indexes are those indexes which are inside this part of the tree, right? So at the very top of the tree, it contains all the indexes. 01:23:36.500 |
All right — I'm assuming that by the time we've got to this point, remember, we've already done the random sampling. 01:23:46.620 |
Right, so when we're talking about indexes, we're not talking about the random sampling to create the tree; 01:23:51.780 |
we're assuming this tree already has its random sample. Inside DecisionTree — 01:23:56.940 |
this is one of the nice things — inside DecisionTree, the whole random sampling thing is gone. 01:24:02.100 |
Right, that was done by the random forest. So at this point we're building something that's just a plain old decision tree — 01:24:08.380 |
it's not in any way random sampling anything; it's just a plain old decision tree, right? 01:24:15.420 |
The indexes tell us which subset of the data we've got to so far in this tree, 01:24:20.580 |
and so at the top of the decision tree it's all the data, right — so it's all of the indexes — 01:24:30.060 |
so this is therefore all of the dependent variable values that are in this part of the tree. 01:24:40.220 |
Does that make sense? Anybody got any questions about that? 01:24:47.740 |
Actually, just to let you know, a large portion of us don't have an OOP background, 01:25:00.460 |
so a quick OOP primer would be helpful. 01:25:09.460 |
Who has done object-oriented programming in some programming language? Okay. 01:25:14.340 |
So you've all actually used lots of object-oriented programming, in terms of using existing classes. 01:25:25.140 |
All right — so every time we've created a random forest, 01:25:28.860 |
we've called the random forest constructor, and it's returned an object, and then we've called methods and accessed 01:25:41.380 |
attributes on that object. So fit is a method — you can tell, because it's got parentheses after it — whereas 01:25:54.340 |
a property or an attribute doesn't have parentheses after it. Okay, so inside an object there are kind of two kinds of things: 01:26:07.500 |
there are the methods, which you call as object, dot, function name, parentheses, arguments; and there are the properties or attributes you can grab, which is 01:26:13.820 |
object, dot, and then just the attribute name with no parentheses. 01:26:18.300 |
And then the other thing that we do with objects is we create them: 01:26:24.500 |
okay, we pass in the name of a class and it returns us the object, and you have to tell it all of the parameters 01:26:32.580 |
necessary to get it constructed. So let's just copy this code and 01:26:45.340 |
see how we're going to go ahead and build this. 01:26:47.460 |
So the first step is: we're not going to go m = RandomForestRegressor(...); we're going to go m = TreeEnsemble(...). 01:26:55.620 |
We're creating a class for TreeEnsemble, and we're going to pass in the x and the y, the number of trees, 01:27:10.580 |
a sample size of a thousand, and maybe a min_leaf of three. 01:27:15.480 |
All right — and you can always, like, choose to name your arguments or not, 01:27:18.980 |
so when you've got quite a few, it's kind of nice to name them, just so we can see what each one means. 01:27:31.060 |
So we're going to try and create a class that we can use like this, and 01:27:38.260 |
I'm not sure we're going to bother with .fit, because we've passed in the x and the y — 01:27:43.260 |
right, like, in scikit-learn they use an approach where first of all you construct something without telling it what data to use, 01:27:49.620 |
and then you pass in the data. We're doing these two steps at once — we're actually passing in the data. 01:27:55.020 |
Right, and so then after that we're going to be going m 01:27:59.020 |
dot — so we're going to go preds = m.predict(...). 01:28:06.980 |
Okay, so that's the API we're kind of creating here. 01:28:12.220 |
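So the usage being aimed for is something like this (X_train, y_train and X_valid are placeholders for whatever data has been prepared, and n_trees=10 is just an illustrative value):

```python
m = TreeEnsemble(X_train, y_train, n_trees=10, sample_sz=1000, min_leaf=3)
preds = m.predict(X_valid)
```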
So this thing here is called a constructor — something that creates an object is called a constructor — 01:28:22.460 |
and there are a lot of ugly, hideous things about Python, one of which is that it uses these special magic methods: 01:28:31.220 |
__init__ (underscore underscore init underscore underscore) is a special magic method that's called when you try to construct a class. 01:28:39.720 |
So when I call TreeEnsemble, parentheses, it actually calls TreeEnsemble dot __init__. 01:28:46.020 |
People say "dunder init" — I kind of hate it, but anyway: dunder init, double underscore init double underscore. 01:28:53.700 |
So that's why we've got this method called dunder init. Okay, so when I call TreeEnsemble, it's going to call this method. 01:29:06.900 |
Another quirk of Python's OO is that there's this special thing where, if you have a class — and to create a class you just write class — every method gets a special argument, 01:29:22.860 |
which is the first argument, and you can call it anything you like — but if you call it anything other than self, 01:29:28.900 |
everybody will hate you and you're a bad person. 01:29:31.300 |
Okay, so call it anything you like, as long as it's self. 01:29:38.780 |
So that's why you always see this — and in fact, I can immediately see here I have a bug. 01:29:44.540 |
Anybody see the bug in my predict function? I should have self, right? 01:29:52.380 |
So any time you try and call a method on your own class and you get something saying you passed in two parameters 01:29:58.380 |
and it was only expecting one — you forgot self. 01:30:00.820 |
Okay, so, like, this is a really dumb way to add OOP to a programming language, 01:30:06.420 |
but the older languages like Python often did this because they kind of needed to: they started out not being 01:30:12.260 |
OO, and then they kind of added OO in a way that was hideously ugly. 01:30:16.380 |
Perl, which predates Python by a little bit, kind of, I think, really came up with this approach, and unfortunately Python followed it. 01:30:26.540 |
So you have to add in this magic self. So with the magic self, now 01:30:35.820 |
you can pretend as if any property name you like exists. 01:30:41.620 |
So I can now pretend there's something called self.x — I can read from it, 01:30:45.980 |
I can write to it, right — but if I read from it and I haven't yet written to it, I'll get an error. 01:30:56.700 |
Anything else gets thrown away by default — like, there's nothing that says this class needs to remember what these things are — 01:31:03.300 |
but anything that we stick inside self is remembered for all time — 01:31:08.660 |
you know, as long as this object exists, you can access it. 01:31:14.980 |
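For those who asked for the OOP primer, here is the whole pattern in a toy class that has nothing to do with random forests (a hypothetical example added for illustration):

```python
class Counter():
    def __init__(self, start=0):
        self.count = start        # anything stored on self lives as long as the object does

    def increment(self, by=1):    # self is filled in automatically when you call c.increment()
        self.count += by
        return self.count

c = Counter(10)    # calling the class runs Counter.__init__
c.increment()      # method call: parentheses; returns 11
c.count            # attribute access: no parentheses; gives 11
```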
In fact, let's do this, right — so let's create the TreeEnsemble class, and 01:31:19.860 |
let's now instantiate it. Okay — of course we haven't got x yet; we need to create that first. 01:31:45.660 |
There we go. Okay, so here is enough to actually instantiate our tree ensemble. 01:32:01.980 |
We need DecisionTree's __init__ to be defined, because inside our ensemble's __init__ we call self.create_tree, and 01:32:08.580 |
then self.create_tree calls the DecisionTree constructor, and the DecisionTree constructor 01:32:14.980 |
basically does nothing at all other than save some information. Right — so at this point we can now go m dot... 01:32:29.060 |
Can anybody tell me what I would expect to see if I hit tab here? 01:32:36.980 |
We would see, like, a drop-down of all available methods for that class. — Okay, which would be, 01:32:44.860 |
in this case — so if m is a TreeEnsemble, we would have create_tree and predict. Okay, anything else? 01:32:50.700 |
Wait, what? Oh yeah — as well, as Ernest whispered, the variables as well. Yeah — so 01:32:59.220 |
"variable" could mean a lot of things; we'll say the attributes, so the things that we put inside self. So if I hit tab — 01:33:04.860 |
right, there they are, right: as Taylor said, there's create_tree, there's predict, and then there's everything else we put inside self. 01:33:18.180 |
min_leaf — if I hit shift-enter, what will I see? 01:33:24.180 |
Yeah, the number that I just put there: I passed in min_leaf of three, so that went up here to min_leaf. 01:33:29.500 |
This here is a default argument — that is, if I don't pass anything, it'll be five; but I did pass something, so it's three. 01:33:44.660 |
Now, because of this rather annoying way of doing OO, 01:33:48.340 |
it does mean that it's very easy to accidentally forget 01:33:52.500 |
to assign things to self. Right — so if I don't assign it to self.min_leaf, 01:34:01.540 |
then here, TreeEnsemble doesn't have min_leaf. 01:34:04.700 |
So how do I create that attribute? I just put something in it. 01:34:09.740 |
Okay — so if you, like, don't know what the value of it should be yet, 01:34:15.980 |
but you kind of need to be able to refer to it, you can always go, like, self.min_leaf = None, 01:34:23.300 |
right, so at least it's something you can read, check for None-ness, and not have an error. 01:34:34.780 |
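A tiny illustration of that point (a hypothetical class added for illustration): an attribute you never assigned raises an error, whereas one you set to None can be checked safely:

```python
class Example():
    def __init__(self):
        self.min_leaf = None      # define it up front, even though we don't know the value yet

e = Example()
e.min_leaf is None                # True - safe to check
# e.n_trees                       # AttributeError: it was never assigned in __init__
```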
Interestingly, I was able to instantiate TreeEnsemble even though predict refers to a method of DecisionTree 01:34:42.540 |
that doesn't exist, and this is actually something very nice about the dynamic nature of Python: 01:34:51.700 |
because it's not, like, compiling it, it's not checking anything unless you're using it. 01:34:56.860 |
Right, so we can go ahead and create DecisionTree.predict later, and 01:35:02.440 |
then our instantiated object will magically start working. 01:35:07.380 |
All right — it doesn't actually look up that method's details until you use it, and so it really helps with top-down coding. 01:35:18.780 |
Okay, so when you're inside a class definition — in other words, you're at that indentation level, 01:35:26.000 |
you know, indented one in, so these are all class definitions — 01:35:29.480 |
any function that you create, unless you do some special things that we're not going to talk about yet, 01:35:35.500 |
is automatically a method of that class, and so every method of that class can be called on the object. So 01:35:48.740 |
since we've got a tree ensemble, we could call m.create_tree, and 01:35:52.460 |
we don't put anything inside those parentheses, because the magic self will be passed, and the magic self will be whatever m is. 01:36:00.160 |
Okay, so m.create_tree returns a decision tree, just like we asked it to — right, so m.create_tree().idxs 01:36:13.860 |
will give us the self.idxs inside the decision tree, 01:36:17.540 |
okay, which is set to np.array(range(self.sample_sz)). 01:36:24.700 |
Why, as data scientists, do we care about object-oriented programming? 01:36:32.180 |
Because a lot of the stuff you use is going to require you to implement stuff with OOP — for example, PyTorch models are 01:36:46.260 |
created with OOP; it's the only way to create PyTorch models. 01:36:53.020 |
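For instance, a PyTorch model is defined with exactly this pattern — a minimal sketch, not code from this lesson:

```python
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self, n_in, n_out):
        super().__init__()                   # let nn.Module do its own setup
        self.lin = nn.Linear(n_in, n_out)    # a layer stored on self, just like our attributes

    def forward(self, x):                    # a method with the magic self, like predict above
        return self.lin(x)
```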
What you see here is the entirety of what you need to know. 01:36:57.320 |
So this is all you need to know: you need to know to create something called __init__, 01:37:01.020 |
to assign the things that are passed in to it to something called self, and 01:37:05.660 |
then just stick the word self as the first parameter of each of your methods. 01:37:09.860 |
Okay — and so the nice thing is, like, now, to think as an OOP programmer is to realize you don't have to pass around 01:37:17.620 |
x, y, sample_sz, and min_leaf to every function that uses them: by assigning them to 01:37:23.740 |
attributes of self, they're now available like magic. 01:37:31.700 |
In particular — like, I started trying to create a decision tree initially without using OOP, and trying to keep track of 01:37:39.380 |
what that decision tree was meant to know about was very difficult, you know, 01:37:44.220 |
whereas with OOP you can just say, inside the decision tree, you know, self.idxs equals this, and 01:37:50.380 |
everything just works. Okay. Okay, that's great. So we're out of time. I think that's — 01:37:58.900 |
so there's an introduction to OOP. But this week — 01:38:02.480 |
you know, next class I'm going to assume that you can use it, right? 01:38:07.340 |
So you should create some classes, instantiate some classes, look at their methods and properties, 01:38:12.780 |
have them call each other, and so forth, until you feel 01:38:16.540 |
comfortable with them. And maybe for those of you that haven't done OOP before — if you find some other useful 01:38:23.420 |
resources, you could pop them onto the wiki thread so that other people know what you found.