Intro to Machine Learning: Lesson 3
Chapters
0:00 Introduction
1:10 Data Interpretation
15:15 Limit Memory
18:13 Performance
22:35 Dates
26:00 Data
27:40 Testing
28:47 set_rf_samples
30:10 Adding floats
33:03 Adding min samples
33:42 Results
34:33 Insights
35:35 Limitations
37:16 Your Job
38:54 Coding
40:45 Scatter Plot
42:35 Tweaking Data
44:49 Validation Set
50:21 Break
54:40 Standard Deviation
56:56 Random Forest Interpretation
00:00:00.000 |
Last lesson, we looked at what random forests are, and we looked at some of the tweaks that 00:00:17.520 |
So in order to actually practice this, we needed to have a Jupyter notebook environment 00:00:23.420 |
running, so we can either install Anaconda on our own computers, we can use AWS, or we 00:00:31.920 |
can use crestle.com that has everything up and running straight away, or else paperspace.com 00:00:40.320 |
So assuming that you've got all that going, hopefully you've had a chance to practice 00:00:47.080 |
I think one of the things to point out though is that before we did any tweaks of any hyperparameters 00:00:53.840 |
or any tuning at all, the raw defaults already gave us a very good answer for an actual dataset 00:01:02.440 |
So the tweaks aren't always the main piece, they're just tweaks. 00:01:08.900 |
Sometimes they're totally necessary, but quite often you can go a long way without doing 00:01:18.280 |
So today we're going to look at something I think maybe even more important than building 00:01:25.320 |
a predictive model that's good at predicting, which is to learn how to interpret that model 00:01:30.640 |
to find out what it says about your data, to actually understand your data better by 00:01:47.200 |
Things like random forests are black boxes that hide meaning from us. 00:01:52.080 |
You'll see today that the truth is quite the opposite. 00:01:55.040 |
The truth is that random forests allow us to understand our data deeper and more quickly 00:02:04.840 |
The other thing we're going to learn today is how to look at larger datasets than those 00:02:15.840 |
And specifically we're going to look at a dataset with over 100 million rows, which 00:02:19.760 |
is the current Kaggle grocery sales forecasting competition (Corporación Favorita). 00:02:25.080 |
Did anybody have any questions outside of those two areas since we're covering that 00:02:29.400 |
today or comments that they want to talk about? 00:02:42.440 |
Can you just talk a little bit about in general, I understand the details more now of random 00:02:47.320 |
forests, but when do you know this is an applicable model to use? 00:02:51.680 |
In general, I should try random forests here because that's the part that I'm still like, 00:02:58.120 |
So the short answer is, I can't really think of anything offhand that it's definitely not 00:03:07.440 |
going to be at least somewhat useful for, so it's always worth trying. 00:03:13.480 |
I think really the question is, in what situations should I try other things as well? 00:03:20.800 |
And the short answer to that question is for unstructured data, what I call unstructured 00:03:24.800 |
data. So where all the different data points represent the same kind of thing, like a waveform 00:03:30.640 |
in a sound or speech, or the words in a piece of text, or the pixels in an image, you're 00:03:36.360 |
almost certainly going to want to try deep learning. 00:03:42.720 |
And then outside of those two, there's a particular type of model we're going to look at today 00:03:50.080 |
called a collaborative filtering model, which so happens that the groceries competition 00:03:56.280 |
is of that kind, where neither of those approaches are quite what you want without some tweaks 00:04:05.240 |
If anybody thinks of other places where maybe neither of those techniques is the right thing 00:04:20.960 |
to use, mention it on the forums, even if you're not sure, so we can talk about it because 00:04:26.600 |
I think this is one of the more interesting questions. 00:04:32.040 |
And to some extent it is a case of practice and experience, but I do think there are two 00:04:47.120 |
Last week, at the point where we had done some of the key steps, like the CSV reading 00:04:59.720 |
in particular, which took a minute or two, at the end of that we saved it to a feather 00:05:05.560 |
And just to remind you, that's because this is basically almost the same format that it 00:05:10.800 |
lives in in RAM, so it's ridiculously fast to read and write stuff from feather format. 00:05:16.840 |
So what we're going to do today is we're going to look at lesson 2, RF interpretation, and 00:05:23.440 |
the first thing we're going to do is read that feather format file. 00:05:29.200 |
Now one thing to mention is a couple of you pointed out during the week, a really interesting 00:05:36.840 |
little bug or little issue, which is in the proc df function. 00:05:45.280 |
The proc df function, remember, finds the numeric columns which have missing values 00:05:52.160 |
and creates an additional boolean column, as well as replacing the missing with medians, 00:06:00.080 |
and also turns the categorical objects into the integer codes, the main things it does. 00:06:09.840 |
And a couple of you pointed out some key points about the missing value handling. 00:06:14.960 |
The first one is that your test set may have missing values in some columns that weren't 00:06:26.800 |
And if that happens, you're going to get an error when you try to do the random forest, 00:06:30.160 |
because it's going to say if that is missing field appeared in your training set but not 00:06:35.960 |
in your test set that ended up in the model, it's going to say you can't use that data 00:06:42.400 |
set with this model because you're missing one of the columns it requires. 00:06:49.140 |
Problem number 2 is that the median of the numeric values in the test set may be different 00:06:57.760 |
for the training set, so it may actually process it into something which has different semantics. 00:07:06.720 |
So I thought that was a really interesting point. 00:07:09.160 |
So what I did was I changed prop df, so it returns a third thing, nas. 00:07:17.640 |
And the nas thing it returns, it doesn't matter in detail what it is, but I'll tell you just 00:07:23.160 |
so you know, that's a dictionary where the keys are the names of the columns that have 00:07:29.400 |
missing values, and the values of the dictionary are the medians. 00:07:34.600 |
And so then optionally you can pass nas as an additional argument to prop df, and it'll 00:07:43.640 |
make sure that it adds those specific columns and it uses those specific medians. 00:07:50.040 |
So it's giving you the ability to say process this test set in exactly the same way as we 00:08:07.160 |
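Roughly, the two calls look like this. This is a minimal sketch assuming the old fastai ML-course library, where the dictionary is passed back in via a keyword (na_dict in the version I've seen), and df_raw / df_raw_test are hypothetical DataFrames:

```python
from fastai.structured import proc_df  # old fastai ML-course library

# Training set: nas maps each column that had missing values to the median used to fill it
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')

# Test set: reuse the same _na indicator columns and the same medians
df_test, _, nas = proc_df(df_raw_test, na_dict=nas)
```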
So I just did that like yesterday or the day before. 00:08:12.320 |
Before you start doing work any day, I would start doing a git pull, and if something's 00:08:19.520 |
not working today that was working yesterday, check the forum where there will be an explanation 00:08:27.840 |
This library in particular is moving fast, but pretty much all the libraries that we 00:08:31.640 |
use, including PyTorch in particular, move fast. 00:08:34.780 |
And so one of the things to do if you're watching this through the MOOC is to make sure that 00:08:40.320 |
you go to course.fast.ai and check the links there because there will be links saying 00:08:45.560 |
oh these are the differences from the course, and so they're kept up to date so that you're 00:08:50.520 |
never going to -- because I can't edit what I'm saying, I can only edit that. 00:09:03.280 |
So I haven't actually updated all of the notebooks to add the extra return value. 00:09:10.360 |
I will over the next couple of days, but if you're using them you'll just need to put 00:09:13.520 |
an extra comma, otherwise you'll get an error that it's returned 3 things and you only have 00:09:26.200 |
What I want to do before I talk about interpretation is to show you what the exact same process 00:09:33.640 |
looks like when you're working with a really large dataset. 00:09:38.920 |
And you'll see it's kind of almost the same thing, but there's going to be a few cases 00:09:46.440 |
where we can't use the defaults, because the defaults just run a little bit too slowly. 00:09:53.780 |
So specifically I'm going to look at the Kaggle groceries competition, specifically -- what's 00:10:08.560 |
it called? The Corporación Favorita Grocery Sales Forecasting competition. 00:10:12.280 |
So this competition -- who is entering this competition? 00:10:22.740 |
Who would like to have a go at explaining what this competition involves, what the data 00:10:34.760 |
>> Okay, trying to predict the items on the shelf depending on lots of factors, like oil 00:10:41.960 |
So when you're predicting the items on the shelf, what are you actually predicting? 00:10:46.760 |
>> How much do you need to have in stock to maximize their -- 00:10:50.520 |
>> It's not quite what we're predicting, but we'll try and fix that in a moment. 00:10:54.620 |
>> And then there's a bunch of different datasets that you can use to do that. 00:10:57.360 |
There's oil prices, there's stores, there's locations, and each of those can be used to 00:11:04.080 |
Does anybody want to have a go at expanding on that? 00:11:14.680 |
So we have a bunch of information on different products. 00:11:24.080 |
So for every store, for every item, for every day, we have a lot of related information 00:11:30.760 |
available, like the location where the store was located, the class of the product, and 00:11:41.760 |
And then based on this, we are supposed to forecast in a much shorter time frame compared 00:11:48.200 |
For every item number, how much we think it's going to sell, so only the units and nothing 00:12:02.920 |
So your ability to explain the problem you're working on is really, really important. 00:12:10.580 |
So if you don't currently feel confident of your ability to do that, practice with someone 00:12:21.400 |
So in this case, or in any case really, the key things to understand a machine learning 00:12:28.160 |
problem would be to say what are the independent variables and what is the dependent variable. 00:12:32.360 |
So the dependent variable is the thing that you're trying to predict. 00:12:35.800 |
The thing you're trying to predict is how many units of each kind of product were sold 00:12:43.040 |
in each store on each day during a two-week period. 00:12:48.080 |
So that's the thing that you're trying to predict. 00:12:50.520 |
The information you have to predict is how many units of each product at each store on 00:12:57.920 |
each day were sold in the last few years, and for each store some metadata about it, 00:13:06.400 |
like where is it located and what class of store is it. 00:13:09.840 |
For each type of product, you have some metadata about it, such as what category of product 00:13:17.600 |
For each date, we have some metadata about it, such as what was the oil price on that 00:13:24.520 |
So this is what we would call a relational dataset. 00:13:27.000 |
A relational dataset is one where we have a number of different pieces of information 00:13:35.800 |
Specifically this kind of relational dataset is what we would refer to as a star schema. 00:13:41.920 |
A star schema is a kind of data warehousing schema where we say there's some central transactions 00:13:49.960 |
In this case, the central transactions table is train.csv, and it contains the number of 00:14:01.680 |
units that were sold by date, by store ID, by item ID. 00:14:09.720 |
So that's the central transactions table, very small, very simple, and then from that 00:14:16.080 |
It's called a star schema because you can imagine the transactions table in the middle 00:14:21.720 |
and then all these different metadata tables join onto it, giving you more information 00:14:27.640 |
about the date, the item ID and the store ID. 00:14:34.360 |
Sometimes you'll also see a snowflake schema, which means there might then be additional 00:14:38.920 |
information joined onto maybe the items table that tells you about different item categories 00:14:46.560 |
and joined to the store table, telling you about the state that the store is in and so 00:14:55.840 |
So that's the basic information about this problem, the independent variables, the dependent 00:15:05.640 |
variable, and you probably also want to know about things like the timeframe. 00:15:13.440 |
Now we start in exactly the same way as we did before, loading in exactly the same stuff, 00:15:20.280 |
But when we call read_csv, if you say low_memory=False, then you're basically 00:15:29.400 |
saying use as much memory as you like to figure out what kinds of data is here. 00:15:34.160 |
It's going to run out of memory pretty much regardless of how much memory you have. 00:15:39.840 |
So what we do in order to limit the amount of space that it takes up when we read it 00:15:45.400 |
in is we create a dictionary for each column name to the data type of that column. 00:15:52.440 |
And so for you to create this, it's basically up to you to run less or head or whatever 00:15:58.520 |
on the data set to see what the types are and to figure that out and pass them in. 00:16:04.600 |
So then you can just pass in dtype= with that dictionary. 00:16:10.080 |
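As a sketch, with column names and types taken to be roughly those of the Favorita train.csv (treat them as assumptions), the call looks something like this:

```python
import pandas as pd

# Use the smallest type that can hold each column's values
types = {'id': 'int64', 'item_nbr': 'int32', 'store_nbr': 'int8',
         'unit_sales': 'float32', 'onpromotion': 'object'}

df_all = pd.read_csv('train.csv', parse_dates=['date'], dtype=types)
```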
And so check this out, we can read in the whole CSV file in 1 minute and 48 seconds, 00:16:30.280 |
So when people say Python's slow, no Python's not slow. 00:16:37.240 |
Python can be slow if you don't use it right, but we can actually parse 125 million CSV records 00:16:49.400 |
I'm going to put my language hat on for just a moment. 00:16:54.200 |
Actually if it's fast, it's almost certainly going to C. 00:16:58.760 |
So Python is a wrapper around a bunch of C code usually. 00:17:04.200 |
So Python itself isn't actually very fast. 00:17:12.240 |
So that was Terrence Parr who writes things for writing programming languages for a living. 00:17:20.920 |
Python itself is not fast, but almost everything we want to do in Python and data science has 00:17:26.720 |
been written for us in C, or actually more often in Cython, which is a Python-like language 00:17:32.360 |
which compiles to C. So most of the stuff we run in Python is actually running not just 00:17:38.600 |
C code, but actually in Pandas a lot of it's written in assembly language, it's heavily 00:17:42.880 |
optimized, behind the scenes a lot of that is going back to actually calling Fortran-based 00:17:51.440 |
So there's layers upon layer of speed that actually allow us to spend less than 2 minutes 00:18:00.440 |
If we wrote our own CSV reader in pure Python, it would take thousands of times, at least 00:18:08.400 |
thousands of times longer than the optimized versions. 00:18:13.640 |
So for us, what we care about is the speed we can get in practice. 00:18:20.160 |
As well as telling it what the different data types were, we also have to tell it as before 00:18:33.280 |
I've noticed that in this dictionary, you specify int64, int32, and int8. 00:18:39.280 |
I was wondering in practice, is it faster if you specify them all to be smaller, or any 00:18:49.280 |
So the key performance consideration here was to use the smallest number of bits that 00:18:57.120 |
So if I had used int8 for item number, there are more than 255 item numbers. 00:19:02.360 |
More specifically, the maximum item number is bigger than 255. 00:19:06.120 |
So on the other hand, if I had used int64 for store number, it's using more bits than necessary. 00:19:13.640 |
Given that the whole purpose here was to avoid running out of RAM, we don't want to be using 00:19:24.320 |
In fact when you're working with large data sets, very often you'll find the slow piece 00:19:29.760 |
is actually the reading and writing to RAM, not the actual CPU operations. 00:19:35.520 |
So very often that's the key performance consideration. 00:19:39.540 |
Also however, as a rule of thumb, smaller data types often will run faster, particularly 00:19:47.720 |
if you can use SIMD, so that's single instruction multiple data vectorized code. 00:19:52.720 |
It can pack more numbers into a single vector to run at once. 00:20:04.840 |
That was all heavily simplified and not exactly right, but close enough for this purpose. 00:20:11.960 |
Once you do this, the shuffle thing beforehand is not needed anymore, you may just send a 00:20:23.120 |
Although here I've read in the whole thing, when I start, I never start by reading in 00:20:32.760 |
So if you search the forum for 'shuf', you'll find some tips about how to use this UNIX 00:20:42.860 |
command to get a random sample of data at the command prompt. 00:20:49.480 |
The nice thing is that that's a good way to find out what data types to use, to read in 00:20:56.040 |
a random sample and let pandas figure it out for you. 00:21:06.600 |
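A minimal sketch of that workflow, with hypothetical file names:

```python
# At the shell: keep the header, then append a random sample of rows.
#   head -1 train.csv > sample.csv
#   tail -n +2 train.csv | shuf -n 100000 >> sample.csv
import pandas as pd

sample = pd.read_csv('sample.csv', parse_dates=['date'])
print(sample.dtypes)  # use these to build the dtype dictionary for the full read
```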
In general, I do as much work as possible on a sample until I feel confident that I understand 00:21:17.440 |
Having said that, what we're about to learn is some techniques for running models on this 00:21:20.980 |
full dataset that are actually going to work on arbitrarily large datasets, and that's also why I 00:21:25.120 |
specifically wanted to talk about how to read in large datasets. 00:21:29.600 |
One thing to mention: reading onpromotion in as an object is like saying create a general purpose Python 00:21:39.560 |
The reason for that is that this is a Boolean which also has missing values, and so we need 00:21:44.800 |
to deal with this before we can turn it into a Boolean. 00:21:47.720 |
So you can see after that, I then go ahead and let's say fill in the missing values with 00:21:52.600 |
Now you wouldn't just do this without doing some checking ahead of time, but some exploratory 00:21:57.480 |
data analysis shows that this is probably an appropriate thing to do, it seems that 00:22:06.680 |
Objects generally read in as strings, so replace the strings true and false with actual Booleans, 00:22:11.880 |
and then finally convert it to an actual Boolean type. 00:22:15.200 |
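A minimal sketch of that cleanup (not necessarily the notebook's exact code):

```python
# Strings -> booleans, missing -> False, then cast the whole column to bool
df_all['onpromotion'] = (df_all['onpromotion']
                         .map({'True': True, 'False': False})
                         .fillna(False)
                         .astype(bool))
```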
So at this point, when I save this, this file now of 123 million records takes up something 00:22:26.160 |
So you can look at pretty large datasets even on pretty small computers, which is interesting. 00:22:33.680 |
So at that point, now that it's in a nice fast format, look how fast it is. 00:22:37.400 |
I can save it to feather format in under 5 seconds. 00:22:43.880 |
And then because pandas is generally pretty fast, you can do stuff like summarize every 00:22:50.240 |
column of all 125 million records in 20 seconds. 00:22:57.760 |
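For example, with a hypothetical tmp/ path, saving and summarizing look like this:

```python
import os

os.makedirs('tmp', exist_ok=True)
df_all.to_feather('tmp/raw_groceries')  # fast on-disk format, needs a default RangeIndex
df_all.describe(include='all')          # summary statistics for every column
```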
The first thing I looked at here is the dates. 00:23:01.200 |
Generally speaking, dates are just going to be really important on a lot of the stuff 00:23:04.040 |
you do, particularly because any model that you put in in practice, you're going to be 00:23:10.280 |
putting it in at some date that is later than the date that you trained it by definition. 00:23:16.120 |
And so if anything in the world changes, you need to know how your predictive accuracy 00:23:22.120 |
And so what you'll see on Kaggle and what you should always do in your on projects is 00:23:27.760 |
So in this case, the dates that we have in the training set go from 2013 to mid-August 00:23:40.360 |
And then in our test set, they go from 1 day later, August 16th until the end of the month. 00:23:48.720 |
So this is a key thing that you can't really do any useful machine learning until you understand 00:23:55.160 |
this basic piece here, which is you've got 4 years of data and you're trying to predict 00:24:06.480 |
So that's just a fundamental thing that you're going to need to understand before you can 00:24:11.920 |
And so as soon as I see that, what does that say to you? 00:24:16.480 |
If you wanted to now use a smaller data set, should you use a random sample, or is there 00:24:34.720 |
So it's like, okay, I'm going to go to a shop next week and I've got a $5 bet with my brother 00:24:44.640 |
as to whether I can guess how many cans of Coke are going to be on the shelf. 00:24:48.960 |
Alright, well probably the best way to do that would be to go to the shop same day of 00:24:55.640 |
the previous week and see how many cans of Coke are on the shelf and guess it's going 00:25:00.280 |
You wouldn't go and look at how many were there 4 years ago. 00:25:07.100 |
But couldn't 4 years ago that same time frame of the year be important? 00:25:11.800 |
For example, how much Coke they have on the shelf at Christmas time is going to be way 00:25:15.840 |
Exactly, so it's not that there's no useful information from 4 years ago, so we don't want to entirely 00:25:24.960 |
But as a first step, what's the simplest possible thing? 00:25:31.640 |
I wouldn't submit the mean of 2012 sales, I would probably submit the mean of last month's 00:25:42.560 |
So yeah, we're just trying to think about how we might want to create some initial easy 00:25:48.240 |
models and later on we might want to weight it. 00:25:51.760 |
So for example, we might want to weight more recent dates more highly, they're probably 00:25:56.400 |
But we should do a whole bunch of exploratory data analysis to check that. 00:26:01.400 |
So here's what the bottom of that data set looks like. 00:26:06.040 |
And you can see literally it's got a date, a store number, an item number, a unit sales figure, 00:26:12.640 |
and tells you whether or not that particular item was on sale at that particular store 00:26:18.200 |
on that particular date, and then there's some arbitrary ID. 00:26:26.540 |
So now that we have read that in, we can do stuff like, this is interesting, again we 00:26:37.040 |
And it's the same reason as we looked at last week, because we're trying to predict something 00:26:43.820 |
They told us in this competition that the root mean squared log error is the thing they 00:26:51.680 |
They mentioned also if you check the competition details, which you always should read carefully 00:26:56.920 |
the definition of any project you do, they say that there are some negative sales that 00:27:01.720 |
represent returns, and they tell us that we should consider them to be 0 for the purpose 00:27:08.680 |
So I clip the sales so that they fall between 0 and no particular maximum, so clip just 00:27:17.960 |
means cut it off at that point, truncate it, and then take the log of that +1. 00:27:25.840 |
Because again, if you check the details of the Kaggle competition, that's what they 00:27:28.840 |
tell you they're going to use is they're not actually just taking the root mean squared 00:27:32.080 |
log error, but the root mean squared log +1 error, because log of 0 doesn't make sense. 00:27:41.520 |
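In code, the clipping and the log(x+1) transform described above might look like this (a sketch, assuming the unit_sales column name from the competition):

```python
import numpy as np

# Treat returns (negative sales) as zero, then take log(sales + 1),
# since the metric is root mean squared error of log(sales + 1).
df_all['unit_sales'] = np.log1p(np.clip(df_all['unit_sales'], 0, None))
```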
We can add the date part as usual, and again it's taking a couple of minutes. 00:27:47.160 |
So I would run through all this on a sample first, so everything takes 10 seconds to make 00:27:51.640 |
sure it works, just to check everything looks reasonable before I go back because I don't 00:27:55.320 |
want to wait 2 minutes or something, I don't know if it's going to work. 00:27:59.720 |
But as you can see, all these lines of code are identical to what we saw for the bulldozers 00:28:07.080 |
In this case, all I'm reading in is a training set. 00:28:09.400 |
I didn't need to run train_cats because all of my data types are already numeric. 00:28:14.840 |
If they weren't, I would need to call train_cats and then I would need to call apply_cats 00:28:21.640 |
to apply the same categorical codes that I now have in the training set to the validation set. 00:28:29.560 |
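If there had been string columns, the calls would look roughly like this (df_trn_raw and df_val_raw are hypothetical DataFrames; the helpers are from the course's fastai library):

```python
from fastai.structured import train_cats, apply_cats

train_cats(df_trn_raw)              # build categorical codes on the training data
apply_cats(df_val_raw, df_trn_raw)  # force the *same* codes onto the validation data
```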
I call proc_df as before to check for missing values and so forth. 00:28:40.560 |
These lines of code again are identical because root mean squared error is what we care about. 00:28:50.040 |
The first is set_rf_samples, which we learned about last week. 00:28:59.600 |
We probably don't want to create a tree from 120 million something records. 00:29:04.000 |
I don't even know how long that's going to take, I haven't had the time and patience 00:29:10.880 |
So you could start with 10,000 or 100,000, maybe it runs in a few seconds, make sure 00:29:17.440 |
it works and you can figure out how much you can run. 00:29:20.480 |
And so I found that setting it to a million, it runs in under a minute. 00:29:26.600 |
And so the point here is there's no relationship between the size of the dataset and how long 00:29:33.840 |
The relationship is between the number of estimators multiplied by the sample size. 00:29:39.720 |
So the number of jobs is the number of cores that it's going to use. 00:29:53.040 |
And I was running this on a computer that has about 60 cores, and I just found if you 00:29:58.200 |
try to use all of them, it spends so much time spinning up jobs that it's a bit slower. 00:30:01.840 |
So if you've got lots and lots of cores on your computer, sometimes you want less than 00:30:10.680 |
There's one more change I made which is that I converted the data frame into an array of 00:30:24.160 |
Because internally inside the random forest code, they do that anyway. 00:30:29.560 |
And so given that I wanted to run a few different random forests with a few different hyperparameters, 00:30:34.640 |
by doing it once myself, I saved that minute 37 seconds. 00:30:41.040 |
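Putting those pieces together, a hedged sketch (the hyperparameter values are just the ones discussed, X_train and y_train are assumed to exist, and set_rf_samples comes from the course's fastai library):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from fastai.structured import set_rf_samples

set_rf_samples(1_000_000)  # each tree trains on a sample of ~1M of the 120M+ rows

# Convert to a float array once so repeated .fit() calls don't redo it internally
x = np.array(X_train, dtype=np.float32)

m = RandomForestRegressor(n_estimators=20, min_samples_leaf=100,
                          n_jobs=8)  # fewer jobs than cores can be faster on a big box
m.fit(x, y_train)
```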
So if you run a line of code and it takes quite a long time, so the first time I ran 00:30:49.660 |
this random forest regressor, it took 2 or 3 minutes, and I thought I don't really want 00:30:53.600 |
to wait 2 or 3 minutes, you can always add %prun in front of the line of code. 00:31:02.760 |
So what %prun does is it runs something called a profiler. 00:31:07.240 |
And what a profiler does is it will tell you which lines of code behind the scenes took 00:31:13.200 |
And in this case I noticed that there was a line of code inside scikit-learn that was 00:31:18.780 |
this line of code, and it was taking all the time, nearly all the time. 00:31:22.560 |
And so I thought I'll do that first and then I'll pass in the result and it won't have 00:31:27.600 |
So this thing of looking to see which things is taking up the time is called profiling. 00:31:33.560 |
And in software engineering, it's one of the most important tools you have. 00:31:37.520 |
Data scientists really underappreciate this tool, but you'll find amongst conversations 00:31:44.160 |
on GitHub issues or on Twitter or whatever amongst the top data scientists, they're sharing 00:31:54.760 |
So for fun, try running prun from time to time on stuff that's taking 10-20 seconds and 00:32:02.840 |
see if you can learn to interpret and use profiler outputs. 00:32:07.320 |
Even though in this case I didn't write the scikit-learn code, I was still able to use 00:32:14.040 |
the profiler to figure out how to make it run over twice as fast by avoiding recalculating 00:32:22.980 |
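In a notebook, profiling the slow call is just a matter of prefixing it with the magic:

```python
# Jupyter/IPython magic: profiles the call and lists where the time was spent
%prun m.fit(x, y_train)
```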
So in this case, I built my regressor, I decided to use 20 estimators. 00:32:27.800 |
Something else that I noticed in the profiler is that I can't use the OOB score when I use set_rf_samples. 00:32:34.320 |
Because if I do, it's going to use the other 124 million rows to calculate the OOB score, 00:32:45.440 |
So I may as well have a proper validation set anyway, besides which I want a validation 00:32:49.640 |
set that's the most recent dates rather than it's random. 00:32:53.720 |
So if you use set_rf_samples on a large data set, don't put the oob_score parameter in. 00:33:04.320 |
So that got me a 0.76 validation root mean squared log error, and then I tried fiddling 00:33:14.320 |
So if I decrease the min-samples from 100 to 10, it took a little bit more time to run 00:33:22.360 |
And the error went down from 0.76 to 0.71, so that looked pretty good. 00:33:28.680 |
So I kept decreasing it down to 3, and that brought this error down to 0.70. 00:33:33.560 |
When I decreased it down to 1, it didn't really help. 00:33:41.440 |
When I say reasonable, though, it's not reasonable in the sense that it does not give a good 00:33:53.640 |
And so this is a very interesting question about why is that. 00:33:57.600 |
And the reason is really coming back to Savannah's question earlier, where might random forests 00:34:08.440 |
Here's the entire dataset, here's all the columns we used. 00:34:12.840 |
So the columns that we have to predict with are the date, the store number, the item number, 00:34:23.880 |
And then of course we used add_datepart, so there's also going to be day of week, day 00:34:28.080 |
of month, day of year, is_quarter_start, etcetera, etcetera. 00:34:33.440 |
So if you think about it, most of the insight around how much of something do you expect 00:34:43.000 |
to sell tomorrow is likely to be very wrapped up in the details about where is that store, 00:34:50.040 |
what kind of things do they tend to sell at that store, for that item, what category of 00:34:54.560 |
item is it, if it's like fresh bread, they might not sell much of it on Sundays because 00:35:02.600 |
on Sundays, fresh bread doesn't get made, whereas if it's gasoline, maybe they're going 00:35:08.400 |
to sell a lot of gasoline because on Sundays people go and fill up their car for the week 00:35:14.280 |
Now a random forest has no ability to do anything other than create a bunch of binary splits 00:35:20.380 |
on things like day of week, store number, item number. 00:35:23.360 |
It doesn't know which one represents gasoline. 00:35:27.120 |
It doesn't know which stores are in the center of the city versus which ones are out in the 00:35:37.060 |
Its ability to really understand what's going on is somewhat limited. 00:35:42.200 |
So we're probably going to need to use the entire four years of data to even get some 00:35:48.480 |
But then as soon as we start using the whole four years of data, a lot of the data we're 00:35:55.600 |
So interestingly, there's a Kaggle kernel that points out that what you could do is 00:36:02.560 |
just take the last two weeks and take the average sales by date, by store number, by 00:36:15.440 |
If you just submit that, you come about 30th. 00:36:21.800 |
So for those of you in the groceries, Terrence has a comment or a question. 00:36:31.160 |
I think this may have tripped me up actually. 00:36:34.840 |
I think it's actually store, item, on sale, and then you take the mean across date. 00:36:50.320 |
So each row represents basically a cross tabulation of all of the sales in that store for that 00:36:57.920 |
So if you put date in there as well, there's only going to be one or two items being averaged 00:37:03.560 |
in each of those cells, which is too much variation, basically, it's too sparse. 00:37:08.360 |
It doesn't give you a terrible result, but it's not 30th. 00:37:17.560 |
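A hedged sketch of that kernel-style baseline (column names assumed from the competition data): average the log sales per store, item, and on-promotion flag over the last couple of weeks.

```python
import pandas as pd

recent = df_all[df_all.date > df_all.date.max() - pd.Timedelta(days=14)]
baseline = (recent.groupby(['store_nbr', 'item_nbr', 'onpromotion'])
                  .unit_sales.mean()
                  .reset_index())
```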
So your job if you're looking at this competition, and we'll talk about this in the next class, 00:37:23.400 |
is how do you start with that model and make it a little bit better. 00:37:31.960 |
Because if you can, then by the time we meet up next, hopefully you'll be above the top 00:37:39.640 |
Because Kaggle being Kaggle, lots of people have now taken this kernel and submitted it, 00:37:44.200 |
and they all have about the same score, and the scores are ordered not just by score but 00:37:50.280 |
So if you now submit this kernel, you're not going to be 30th because you're way down the 00:37:56.920 |
But if you can do a tiny bit better, you're going to be better than all of those people. 00:38:10.960 |
Would you try to capture seasonality and trend effects by creating new columns? 00:38:15.320 |
These are the average sales in the month of August, these are the average sales for this 00:38:20.800 |
So the thing for you to think about is how to do that. 00:38:29.440 |
Because there are details to get right, which I know Terrence has been working on this for 00:38:38.440 |
The details are difficult, they're not intellectually difficult, they're kind of difficult in the 00:38:46.920 |
way that makes you want to head back to your desk at 2am. 00:38:55.280 |
The coding you do for machine learning is incredibly frustrating and incredibly difficult. 00:39:05.320 |
If you get a detail wrong, much of the time it's not going to give you an exception, it 00:39:14.840 |
will just silently be slightly less good than it otherwise would have been. 00:39:19.140 |
And if you're on Kaggle, at least you know, okay well I'm not doing as well as other people 00:39:24.400 |
But if you're not on Kaggle, you just don't know. 00:39:27.520 |
You don't know if your company's model is like half as good as it could be because you 00:39:33.680 |
So that's one of the reasons why practicing on Kaggle now is great, because you're going 00:39:38.960 |
to get practice in finding all of the ways in which you can infuriatingly screw things 00:39:45.800 |
And you'll be amazed, like for me there's an extraordinary array of them. 00:39:50.000 |
But as you get to know what they are, you'll start to know how to check for them as you 00:39:55.880 |
And so the only way, you should assume every button you press, you're going to press the 00:40:02.560 |
And that's fine as long as you have a way to find out. 00:40:07.000 |
We'll talk about that more during the course, but unfortunately there isn't a set of specific 00:40:18.640 |
You just always have to think like, okay, what do I know about the results of this thing 00:40:29.400 |
If you've actually created that basic entry where you take the mean by date, by store 00:40:34.800 |
number, by on promotion, and you've submitted it and you've got a reasonable score, and 00:40:40.040 |
then you think you've got something that's a little bit better, and you do predictions 00:40:44.360 |
for that, how about you now create a scatterplot showing the predictions of your average model 00:40:52.160 |
on one axis versus the predictions of your new model on the other axis? 00:40:56.160 |
You should see that they just about form a line. 00:41:00.960 |
And if they don't, then that's a very strong suggestion that you screwed something up. 00:41:08.280 |
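A minimal sketch of that check, where preds_baseline and preds_new are hypothetical arrays of predictions for the same rows:

```python
import matplotlib.pyplot as plt

plt.scatter(preds_baseline, preds_new, alpha=0.1)
plt.xlabel('baseline (group-mean) predictions')
plt.ylabel('new model predictions')
plt.show()  # points should fall roughly on a line
```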
Can you pass that one to the end of that row, possible two steps? 00:41:17.440 |
So for a problem like this, unlike the car insurance problem on Kaggle where columns 00:41:24.960 |
are unnamed, we know what the columns represent and what they are. 00:41:31.760 |
How often do you pull in data from other sources to supplement that? 00:41:39.280 |
Maybe like weather data, for example, or how often is that used? 00:41:45.840 |
And so the whole point of this star schema is that you've got your central table and 00:41:51.920 |
you've got these other tables coming off that provide metadata about it. 00:41:55.920 |
So for example, weather is metadata about a date. 00:42:01.520 |
On Kaggle specifically, most competitions have the rule that you can use external data 00:42:08.720 |
as long as you post on the forum that you're using it and that it's publicly available. 00:42:16.400 |
But you have to check on a competition by competition basis, they will tell you. 00:42:20.720 |
So with Kaggle, you should always be looking for what external data could I possibly leverage 00:42:28.280 |
So are we still talking about how to tweak this data set? 00:42:40.920 |
Well, I'm not familiar with the countries here, so maybe. 00:42:47.400 |
So maybe I would start looking for Ecuador's holidays and shopping holidays, maybe when 00:43:00.480 |
Actually that information is provided in this case. 00:43:04.360 |
And so in general, one way of tackling this kind of problem is to create lots and lots 00:43:12.360 |
of new columns containing things like average number of sales on holidays, average percent 00:43:19.160 |
change in sale between January and February, and so on and so forth. 00:43:23.560 |
And so if you have a look at, there's been a previous competition on Kaggle called Rossmann 00:43:34.280 |
It was in Germany in this case for a major grocery chain. 00:43:38.880 |
How many items are sold by day, by item type, by store. 00:43:43.400 |
In this case, the person who won, quite unusually actually, was something of a domain expert 00:43:51.520 |
They're actually a specialist in doing logistics predictions. 00:43:56.320 |
And this is basically what they did, he's a professional sales forecast consultant. 00:44:04.360 |
He created just lots and lots and lots of columns based on his experience of what kinds 00:44:09.280 |
of things tend to be useful for making predictions. 00:44:17.160 |
The third place team did almost no feature engineering, however, and also they had one 00:44:23.160 |
big oversight, which I think they would have won if they hadn't had it. 00:44:26.520 |
So you don't necessarily have to use this approach. 00:44:30.560 |
So anyway, we'll be learning a lot more about how to win this competition, and ones like 00:44:40.120 |
They did interview the third place team, so if you google for Kaggle Rossmann, you'll 00:44:50.120 |
So one of the things, and these are a couple of charts, Terrence is actually my teammate 00:44:54.360 |
on this competition, so Terrence drew a couple of these charts for us, and I want to talk 00:45:01.400 |
If you don't have a good validation set, it's hard if not impossible to create a good model. 00:45:09.120 |
So in other words, if you're trying to predict next month's sales and you try to build a 00:45:18.040 |
model and you have no way of really knowing whether the models you built are good at predicting 00:45:23.640 |
sales a month ahead of time, then you have no way of knowing when you put your model 00:45:27.840 |
in production whether it's actually going to be any good. 00:45:33.000 |
So you need a validation set that you know is reliable at telling you whether or not 00:45:39.360 |
your model is likely to work well when you put it into production or use it on the test 00:45:47.380 |
So in this case, what Terrence has plotted here is, so normally you should not use your 00:45:54.080 |
test set for anything other than using it right at the end of the competition to find 00:46:01.840 |
But there's one thing I'm going to let you use the test set for in addition, and that 00:46:09.320 |
So what Terrence did here was he built four different models, some which he thought would 00:46:14.620 |
be better than others, and he submitted each of the four models to Kaggle to find out its 00:46:22.040 |
So the x-axis is the score that Kaggle told us on the leaderboard. 00:46:28.520 |
And then on the y-axis, he plotted the score on a particular validation set he was trying 00:46:34.200 |
out to see whether this validation set looked like it was going to be any good. 00:46:40.400 |
So if your validation set is good, then the relationship between the leaderboard score 00:46:46.200 |
and the test set score and your validation set score should lie in a straight line. 00:46:52.600 |
Ideally it will actually lie on the y=x line, but honestly that doesn't matter too much. 00:46:58.880 |
As long as, relatively speaking, it tells you which models are better than which other 00:47:02.880 |
models, then you know which model is the best. 00:47:07.280 |
And you know how it's going to perform on the test set because you know the linear relationship 00:47:13.440 |
So in this case, Terrence has managed to come up with a validation set which is looking 00:47:18.400 |
like it's going to predict our Kaggle leaderboard score pretty well. 00:47:22.120 |
And that's really cool because now he can go away and try 100 different types of models, 00:47:26.680 |
feature engineering, weighting, tweaks, hyperparameters, whatever else, see how they go on the validation 00:47:34.160 |
So we're going to get a lot more iterations, a lot more feedback. 00:47:37.920 |
This is not just true of Kaggle, but every machine learning project you do. 00:47:44.560 |
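A sketch of that calibration plot; you'd pass one leaderboard score and one validation score per model you actually submitted:

```python
import matplotlib.pyplot as plt

def plot_val_vs_lb(lb_scores, val_scores):
    """One point per submitted model. A roughly straight line means the
    validation set ranks models the same way the leaderboard does."""
    plt.scatter(lb_scores, val_scores)
    plt.xlabel('Kaggle leaderboard score')
    plt.ylabel('validation set score')
    plt.show()
```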
So here's a different one he tried, where it wasn't as good, it's like these ones that 00:47:50.000 |
were quite close to each other, it's showing us the opposite direction, that's a really 00:47:54.800 |
It's like this validation set idea didn't seem like a good idea, this validation set 00:48:02.080 |
So in general if your validation set is not showing a nice straight line, you need to 00:48:12.000 |
There's some way you're constructing it which is different, you're going to have to draw 00:48:22.720 |
So one question is, and I'm going to try to guess how you did it. 00:48:27.720 |
So how do you actually try to construct this validation set as close to the... 00:48:32.280 |
So what I would try to do is to try to sample points from the training set that are as 00:48:37.600 |
close as possible to some of the points in the test set. 00:48:43.600 |
I don't know, I would have to find the features. 00:48:52.320 |
So basically all the different things Terrence was trying were different variations of close 00:49:02.960 |
What I noticed was, so first I looked at the date range of the test set and then I looked 00:49:10.800 |
at the kernel that described how he or she... 00:49:15.240 |
So here is the date range of the test set, so the last two weeks of August 2017. 00:49:21.920 |
And then the person who submitted the kernel that said how to get the 0.58 leaderboard 00:49:35.200 |
Well, it was actually 14 days and the test set is 16 days, but the interesting thing 00:49:40.240 |
is the test set begins on the day after payday and ends on the payday. 00:49:48.920 |
And so these are things I also paid attention to. 00:49:52.400 |
And I think that's one of the bits of metadata that they told us. 00:49:56.920 |
These are the kinds of things you've just got to try, like I said, to plot lots of pictures. 00:50:03.880 |
And even if you didn't know it was payday, you would want to draw the time series chart 00:50:08.640 |
of sales and you would hopefully see that every two weeks there would be a spike or 00:50:14.040 |
And you'd be like, "Oh, I want to make sure that I have the same number of spikes in my 00:50:18.440 |
validation set that I've had in my test set," for example. 00:50:22.760 |
Let's take a 5-minute break and let's come back at 2:32. 00:50:39.440 |
This is my favorite bit -- interpreting machine learning models. 00:50:43.720 |
By the way, if you're looking for my notebook about the groceries competition, you won't 00:50:50.760 |
find it on GitHub because I'm not allowed to share code for currently running competitions with 00:50:56.160 |
you unless you're on the same team as me, that's the rule. 00:51:00.240 |
After the competition is finished, it will be on GitHub, however, so if you're doing 00:51:03.640 |
this through the video you should be able to find it. 00:51:07.800 |
So let's start by reading in our feather file. 00:51:15.680 |
So our feather file is exactly the same as our CSV file. 00:51:19.680 |
This is for our blue book for bulldozers competition, so we're trying to predict the sale price 00:51:28.000 |
And so reading the feather format file means that we've already read in the CSV and processed 00:51:36.080 |
And so the next thing we do is to run PROC DF in order to turn the categories into integers, 00:51:41.480 |
deal with the missing values, and pull out the independent variable. 00:51:46.800 |
This is exactly the same thing as we used last time to create a validation set where 00:51:51.160 |
the validation set represents the last couple of weeks, the last 12,000 records by date. 00:51:59.860 |
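As a sketch (the feather path is hypothetical, and split_vals is the tiny helper from the earlier lesson):

```python
import pandas as pd
from fastai.structured import proc_df

df_raw = pd.read_feather('tmp/bulldozers-raw')
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')

def split_vals(a, n): return a[:n].copy(), a[n:].copy()

n_valid = 12000                  # validation set = most recent 12,000 rows
n_trn = len(df_trn) - n_valid
X_train, X_valid = split_vals(df_trn, n_trn)
y_train, y_valid = split_vals(y_trn, n_trn)
```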
And I discovered, thanks to one of your excellent questions on the forum last week, I had a 00:52:06.120 |
bug here which is that proc_df was shuffling the order, sorry, not proc_df, and last week 00:52:21.400 |
we saw a particular version of proc_df where we passed in a subset, and when I passed in 00:52:32.760 |
And so then when I said split_vals, it wasn't getting the last rows by date, but it was 00:52:41.920 |
So if you rerun the lesson 1 RF code, you'll see slightly different results, specifically 00:52:49.440 |
you'll see in that section that my validation set results look less good, but that's only 00:52:55.380 |
for this tiny little bit where I had the subset option set. 00:53:04.480 |
I'm a little bit confused about the notation here, so nas is both an input variable and 00:53:10.160 |
it's also the output variable of this function, and why is that? 00:53:17.640 |
proc_df returns a dictionary telling you which columns were missing and for each of 00:53:29.920 |
So when you call it on the larger dataset, the non-subset, you want to take that return 00:53:38.620 |
value and you don't pass in an object to that point, you just want to get back the result. 00:53:44.640 |
Later on when you pass it into a subset, you want to have the same missing columns and 00:53:52.760 |
And if this different subset, like if it was a whole different dataset, turned out it had 00:53:58.440 |
some different missing columns, it would update that dictionary with additional key values 00:54:10.600 |
If you don't pass it in, it just gives you the information about what was missing and 00:54:15.840 |
If you do pass it in, it uses that information for any missing columns that are there, and 00:54:23.320 |
if there are some new missing columns, it will update that dictionary with that additional 00:54:27.960 |
So it's like keeping all the datasets, all the column information. 00:54:32.000 |
Yeah, it's going to keep track of any missing columns that you came across in anything you 00:54:42.080 |
So we split it into the training and test set just like we did last week, and so to 00:54:47.760 |
remind you, once we've done proc_df, this is what it looks like. 00:54:55.720 |
So the first thing to think about is we already know how to get the predictions, which is 00:55:02.240 |
we take the average value in each leaf node, in each tree after running a particular row 00:55:16.520 |
But normally we don't just want a prediction, we also want to know how confident we are 00:55:23.000 |
And so we would be less confident of a prediction if we haven't seen many examples of rows like 00:55:31.080 |
this one, and if we haven't seen many examples of rows like this one, then we wouldn't expect 00:55:38.040 |
any of the trees to have a path through which is really designed to help us predict that 00:55:47.320 |
And so conceptually, you would expect then that as you pass this unusual row through 00:55:52.480 |
different trees, it's going to end up in very different places. 00:55:58.320 |
So in other words, rather than just taking the mean of the predictions of the trees and 00:56:03.120 |
saying that's our prediction, what if we took the standard deviation of the predictions 00:56:10.520 |
So the standard deviation of the predictions of the trees, if that's high, that means each 00:56:16.680 |
tree is giving us a very different estimate of this row's prediction. 00:56:24.680 |
So if this was a really common kind of row, then the trees will have learnt to make good 00:56:33.200 |
predictions for it because it's seen lots of opportunities to split based on those kinds 00:56:39.760 |
So the standard deviation of the predictions across the trees gives us some kind of relative 00:56:48.440 |
understanding of how confident we are of this prediction. 00:56:55.980 |
So that is not something which exists in scikit-learn or in any library I know of, so we have to 00:57:08.040 |
But we already have almost the exact code we need because remember last lesson we actually 00:57:13.440 |
manually calculated the averages across different sets of trees, so we can do exactly the same 00:57:21.760 |
When I'm doing random forest interpretation, I pretty much never use the full data set. 00:57:28.480 |
I always call set_rf_samples because we don't need a massively accurate random forest, we 00:57:36.160 |
just need one which indicates the nature of the relationships involved. 00:57:42.000 |
And so I just make sure this number is high enough that if I call the same interpretation 00:57:48.160 |
commands multiple times, I don't get different results back each time. 00:57:52.760 |
That's like the rule of thumb about how big does it need to be. 00:57:56.040 |
But in practice, 50,000 is a high number and most of the time it would be surprising if 00:58:08.000 |
So with my 50,000 samples per tree set, I create 40 estimators. 00:58:13.480 |
I know from last time that min_samples_leaf=3, max_features=0.5 isn't bad, and again we're 00:58:19.840 |
not trying to create the world's most predictive tree anyway, so that all sounds fine. 00:58:29.240 |
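So the setup looks roughly like this (values as discussed; set_rf_samples is from the course's fastai library):

```python
from sklearn.ensemble import RandomForestRegressor
from fastai.structured import set_rf_samples

set_rf_samples(50_000)  # big enough that repeated interpretation runs give stable answers

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          max_features=0.5, n_jobs=-1)
m.fit(X_train, y_train)
```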
Again we don't particularly care, but as long as it's good enough, which it certainly is. 00:58:35.400 |
And so here's where we can do that exact same list comprehension as last time. 00:58:39.520 |
Remember, go through each estimator, that's each tree, call .predict on it with our validation 00:58:45.840 |
set, make that a list comprehension, and pass that to np.stack, which concatenates everything 00:58:56.000 |
So now our rows are the results of each tree and our columns are the result of each row 00:59:03.760 |
And then we remember we can calculate the mean. 00:59:07.200 |
So here's the prediction for our dataset row number 1. 00:59:15.560 |
So here's how to do it for just one observation at the end here. 00:59:21.040 |
We've calculated for all of them, just printing it for one. 00:59:30.280 |
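That list comprehension, written out (m and X_valid are the model and validation set from above):

```python
import numpy as np

preds = np.stack([t.predict(X_valid) for t in m.estimators_])
print(preds.shape)                                 # (n_trees, n_valid_rows)
print(np.mean(preds[:, 0]), np.std(preds[:, 0]))   # prediction and spread for row 0
```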
This can take quite a while, and specifically it's not taking advantage of the fact that 00:59:41.220 |
List comprehension itself is Python code, and Python code, unless you're doing special 00:59:51.080 |
stuff, runs in serial, which means it runs on a single CPU. 00:59:55.120 |
It doesn't take advantage of your multi-CPU hardware. 00:59:58.680 |
And so if I wanted to run this on more trees and more data, this one second is going to 01:00:06.000 |
And you see here the wall time, the amount of actual time it took, is roughly equal to 01:00:09.920 |
the CPU time, whereas if it was running on lots of cores, the CPU time would be higher 01:00:16.840 |
So it turns out that scikit-learn, actually not scikit-learn, fast.ai provides a handy 01:00:26.560 |
function called parallel_trees, which calls some stuff inside scikit-learn. 01:00:34.320 |
It takes a random forest model that I trained, here it is, m, and some function to call. 01:00:42.820 |
And it calls that function on every tree in parallel. 01:00:47.560 |
So in other words, rather than calling t.predict(X_valid), let's create a function that calls t.predict(X_valid). 01:00:55.320 |
Let's use parallel_trees to call it on our model for every tree. 01:00:59.960 |
And it will return a list of the result of applying that function to every tree. 01:01:09.320 |
So hopefully you can see that that code and that code are basically the same thing. 01:01:18.920 |
And so you can see here, now our wall time has gone down to 500ms, and it's now giving 01:01:29.040 |
us exactly the same answer, so a little bit faster. 01:01:33.400 |
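A sketch of the parallel version (the import location is an assumption about the old fastai library; check your version):

```python
import numpy as np
from fastai.structured import parallel_trees

def get_preds(t):
    return t.predict(X_valid)

preds = np.stack(parallel_trees(m, get_preds))  # one tree per worker, same result
```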
Time permitting, we'll talk about more general ways of writing code that runs in parallel 01:01:38.480 |
because it turns out to be super useful for data science. 01:01:41.900 |
But here's one that we can use that's very specific to random forests. 01:01:48.520 |
So what we can now do is we can always call this to get our predictions for each tree, 01:01:56.240 |
and then we can call standard deviation to then get them for every row. 01:02:04.000 |
So what I could do is let's create a copy of our data and let's add an additional column 01:02:09.680 |
to it, which is the standard deviation of the predictions across the first axis. 01:02:18.100 |
And let's also add in the mean, so they're the predictions themselves. 01:02:25.040 |
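Roughly (raw_valid is a hypothetical un-processed copy of the validation rows, and preds is the trees-by-rows array from above):

```python
import numpy as np

x = raw_valid.copy()
x['pred_std'] = np.std(preds, axis=0)   # spread of the trees' predictions, per row
x['pred'] = np.mean(preds, axis=0)      # the usual random forest prediction, per row
```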
So you might remember from last lesson that one of the predictors we have is called enclosure, 01:02:34.660 |
and we'll see later on that this is an important predictor. 01:02:37.920 |
And so let's start by just doing a histogram. 01:02:40.080 |
So one of the nice things in pandas is it's got built-in plotting capabilities. 01:02:44.080 |
It's well worth Googling for pandas plotting to see how to do it. 01:02:55.120 |
So we don't know what it means, and it doesn't matter. 01:03:01.800 |
I guess the whole purpose of this process is that we're going to learn about what things 01:03:08.560 |
are, or at least what things are important, and later on figure out what they are and 01:03:13.080 |
So we're going to start out knowing nothing about this data set. 01:03:17.720 |
So I'm just going to look at something called enclosure that has something called EROPS 01:03:21.320 |
and something called OROPS, and I don't even know what this is yet. 01:03:24.000 |
All I know is that the only three that really appear in any great quantity are OROPS, EROPS, 01:03:32.580 |
And this is really common as a data scientist, you often find yourself looking at data that 01:03:36.840 |
you're not that familiar with, and you've got to figure out at least which bits to study 01:03:41.180 |
more carefully and which bits to matter and so forth. 01:03:44.160 |
So in this case, I at least know that these three groups I really don't care about because 01:03:51.680 |
So given that, we're going to ignore those three. 01:03:55.240 |
So we're going to focus on this one here, this one here, and this one here. 01:04:00.080 |
And so here you can see what I've done is I've taken my data frame and I've grouped 01:04:08.880 |
by enclosure, and I am taking the average of these three fields. 01:04:17.480 |
So here you can see the average sale price, the average prediction, and the standard deviation 01:04:25.120 |
So I can already start to learn a bit here, as you would expect, the prediction and the 01:04:31.800 |
sale price are close to each other on average, so that's a good sign. 01:04:39.480 |
And then the standard deviation varies a little bit, it's a little hard to see in a table, 01:04:44.320 |
so what we could do is we could try to start printing these things out. 01:04:50.880 |
So here we've got the sale price for each level of enclosure, and here we've got the 01:05:02.080 |
And for the error bars, I'm using the standard deviation of prediction. 01:05:06.060 |
So here you can see the actual, and here's the prediction, and here's my confidence interval. 01:05:17.040 |
Or at least it's the average of the standard deviation of the random forest. 01:05:22.080 |
So this will tell us if there's some groups or some rows that we're not very confident 01:05:30.140 |
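A hedged sketch of the grouped summary and the bar chart with error bars ('Enclosure' and 'SalePrice' are the bulldozers column names being assumed):

```python
flds = ['Enclosure', 'SalePrice', 'pred', 'pred_std']
summ = x[flds].groupby('Enclosure', as_index=False).mean()
summ.plot('Enclosure', 'pred', kind='barh', xerr='pred_std', alpha=0.6)
```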
So we could do something similar for product size. 01:05:36.240 |
We could do exactly the same thing of looking at our predictions of standard deviations. 01:05:43.000 |
We could sort by, and what we could say is, what's the ratio of the standard deviation 01:05:49.240 |
of the predictions to the predictions themselves? 01:05:51.280 |
So you'd kind of expect on average that when you're predicting something that's a bigger 01:05:55.760 |
number that your standard deviation would be higher, so you can sort by that ratio. 01:06:02.080 |
And what that tells us is that the product size large and product size compact, our predictions 01:06:09.680 |
are less accurate, relatively speaking, as a ratio of the total price. 01:06:14.960 |
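A sketch of that ratio, grouped by product size ('ProductSize' is the assumed column name):

```python
flds = ['ProductSize', 'SalePrice', 'pred', 'pred_std']
summ = x[flds].groupby('ProductSize').mean()
print((summ.pred_std / summ.pred).sort_values(ascending=False))
```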
And so if we go back and have a look, there you go, that's why. 01:06:20.520 |
From the histogram, those are the smallest groups. 01:06:24.120 |
So as you would expect, in small groups we're doing a less good job. 01:06:29.880 |
So this confidence interval you can really use for two main purposes. 01:06:34.200 |
One is that you can group it up like this and look at the average confidence interval 01:06:39.080 |
by group to find out if there are some groups that you just don't seem to have confidence 01:06:46.600 |
But perhaps more importantly, you can look at them for specific rows. 01:06:50.800 |
So when you put it in production, you might always want to see the confidence interval. 01:06:56.280 |
So if you're doing credit scoring, so deciding whether to give somebody a loan, you probably 01:07:01.600 |
want to see not only what's their level of risk, but how confident are we. 01:07:05.760 |
And if they want to borrow lots of money, and we're not at all confident about our ability 01:07:10.480 |
to predict whether they'll pay it back, we might want to give them a smaller loan. 01:07:16.520 |
So those are the two ways in which you would use this. 01:07:21.400 |
Let me go to the next one, which is the most important. 01:07:27.840 |
The only reason I didn't do this first is because I think the intuitive understanding 01:07:32.600 |
of how to calculate confidence interval is the easiest one to understand intuitively. 01:07:37.120 |
In fact, it's almost identical to something we've already calculated. 01:07:41.080 |
But in terms of which one do I look at first in practice, I always look at this in practice. 01:07:47.440 |
So when I'm working on a cattle competition or a real-world project, I build a random 01:07:53.720 |
forest as fast as I can, try and get it to the point that it's significantly better than 01:08:01.640 |
random, but doesn't have to be much better than that, and the next thing I do is to plot 01:08:07.880 |
The feature importance tells us in this random forest which columns matter. 01:08:16.520 |
So we had dozens and dozens of columns originally in this dataset, and here I'm just picking out the ones that matter. 01:08:24.320 |
So you can just call rf_feat_importance, again this is part of the fast.ai library, 01:08:29.540 |
it's leveraging stuff that's in scikit-learn. 01:08:32.080 |
Pass in the model, pass in the data frame, because we need to know the names of columns, 01:08:37.960 |
and it'll give you back a pandas data frame showing you the columns in order of importance. 01:08:47.080 |
And here I'm just going to pick out the top 10. 01:08:53.240 |
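The helper itself is tiny; here is a sketch of roughly what rf_feat_importance does under the hood, assuming `m` is the fitted forest and `df_trn` the training DataFrame:

```python
import pandas as pd

def rf_feat_importance(m, df):
    # scikit-learn exposes the importances directly; we just attach the column names
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                        ).sort_values('imp', ascending=False)

fi = rf_feat_importance(m, df_trn)
fi[:10]   # the ten most important columns
```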
So fi, because it's a data frame, we can use data frame plotting commands. 01:09:01.000 |
So here I've plotted all of the feature importances. 01:09:05.760 |
And so you can see here, I haven't been able to write all of the names of the columns 01:09:10.140 |
at the bottom, but that's not the important thing. 01:09:12.440 |
The important thing is to see that some columns are really, really important, and most columns 01:09:20.640 |
In nearly every dataset you use in real life, this is what your feature importance is going 01:09:27.760 |
It's going to say there's a handful of columns you care about. 01:09:30.600 |
And this is why I always start here, because at this point, in terms of looking into learning 01:09:37.960 |
about this domain of heavy industrial equipment auctions, I'm only going to care about learning about the columns that matter. 01:09:47.200 |
So are we going to bother learning about enclosure? Well, it's up there in that handful of important columns, 01:09:57.800 |
so yes, we are going to have to learn about enclosure. 01:10:00.800 |
So then we could also plot this as a bar plot. 01:10:05.480 |
So here I've just created a tiny little function that's going to plot my bars, and I've called it on the top columns. 01:10:15.000 |
And so you can see the same basic shape here, and I can see there's my enclosure. 01:10:22.180 |
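The plotting is just pandas; a sketch of the two plots being described, with arbitrary figure sizes and an arbitrary cutoff of 30 columns for the bar plot:

```python
# Line plot of every importance: a handful of big values, then a long flat tail.
fi.plot('cols', 'imp', figsize=(10, 6), legend=False)

# A tiny helper like the one mentioned above: horizontal bars so the names are readable.
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12, 7), legend=False)

plot_fi(fi[:30])
```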
So we're going to learn about how this is calculated in just a moment. 01:10:27.200 |
But before we worry about how it's calculated, much more important is to know what to do 01:10:31.400 |
So the most important thing to do with it is to now sit down with your client, or your 01:10:37.680 |
data dictionary, or whatever your source of information is, and say to them, "Tell me everything you can about YearMade." 01:10:47.360 |
Then plot lots of things like histograms of YearMade, scatter plots of YearMade against price, and 01:10:51.000 |
learn everything you can, because YearMade and coupler system, they're the things that matter. 01:10:56.760 |
And what will often happen in real-world projects is that you'll sit with the client and you'll 01:11:01.200 |
say, "Oh, it turns out the coupler system is the second most important thing," and then 01:11:08.960 |
Now that doesn't mean that there's a problem with your model. 01:11:12.000 |
It means there's a problem with their understanding of the data that they gave you. 01:11:18.720 |
I entered a Kaggle competition where the goal was to predict which applications for grants would be accepted. 01:11:28.100 |
And I used this exact approach and I discovered a number of columns which were doing almost all of the predicting. 01:11:36.480 |
And specifically when I then looked to see in what way they were predictive, it turned 01:11:39.680 |
out whether they were missing or not was basically the only thing that mattered in this dataset. 01:11:47.160 |
And so later on, I ended up winning that competition and I think a lot of it was thanks to this insight. 01:11:56.400 |
But it turns out that at that university, there's an administrative burden to filling in that database. 01:12:03.480 |
And so for a lot of the grant applications, they don't fill in the database for the folks whose grants weren't accepted. 01:12:11.880 |
So in other words, these missing values in the dataset were saying, 'Okay, this grant 01:12:17.520 |
wasn't accepted, because if it was accepted, then the admin folks are going to go in and fill that information in." So this is really an example of data leakage. 01:12:27.800 |
Data leakage means there's information in the dataset that I was modelling with which 01:12:33.200 |
the university wouldn't have had in real life at the point in time they were making a decision. 01:12:39.040 |
So when they're actually deciding which grant applications should I prioritise, they don't 01:12:47.720 |
actually know which ones the admin staff are going to add information to, because it turns out that only happens after the decision has been made. 01:12:55.740 |
So one of the key things you'll find here is data leakage problems, and that's a serious issue you need to deal with. 01:13:06.440 |
The other thing that will happen is you'll often find signs of collinearity. 01:13:11.920 |
I think that's what's happened here with Coupler system. 01:13:13.800 |
I think Coupler system tells you whether or not a particular kind of heavy industrial 01:13:18.880 |
equipment has a particular feature on it, but if it's not that kind of industrial equipment 01:13:25.400 |
at all, it will be empty, it will be missing. 01:13:28.120 |
And so Coupler system is really telling you whether or not it's a certain class of heavy 01:13:34.240 |
Now this is not leakage, this is actual information you actually have at the right time, it's 01:13:38.760 |
just that interpreting it, you have to be careful. 01:13:45.680 |
So I would go through at least the top 10, or look for where the natural breakpoints are. 01:13:53.320 |
To make life easier for myself, what I tend to do is try to throw some data away and see if it matters. 01:13:59.480 |
So in this case, I had a random forest which, let's go and see how accurate it was, 0.889. 01:14:14.200 |
What I did was I said here, let's go through our feature importance data frame and filter 01:14:19.160 |
out those where the importance is greater than 0.005, so 0.005 is about here, it's kind of where the curve flattens out. 01:14:37.840 |
And so that gives us a list of 25 column names. 01:14:41.760 |
And so then I say let's now create a new data frame view which just contains those 25 columns, 01:14:51.600 |
call split_vals on it again, split into training and validation sets, and create a new random forest. 01:15:00.440 |
And you can see here the R^2 basically didn't change, 0.891 versus 0.889. 01:15:13.920 |
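A sketch of that trimming step, assuming the notebook's split_vals helper, the n_trn split point, and the y_train / y_valid targets already exist; the forest hyperparameters here are placeholders rather than the exact ones used:

```python
from sklearn.ensemble import RandomForestRegressor

to_keep = fi[fi.imp > 0.005].cols          # column names above the importance cutoff
df_keep = df_trn[to_keep].copy()

X_train, X_valid = split_vals(df_keep, n_trn)   # same train/validation split as before

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print(m.score(X_valid, y_valid))   # R^2 should come out essentially unchanged
```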
I mean generally speaking, removing redundant columns, obviously it shouldn't make it worse, 01:15:21.400 |
if it made it worse, then they weren't really redundant after all. 01:15:24.120 |
It might make it a little better because if you think about how we built these trees when 01:15:28.520 |
it's deciding what to split on, it's got fewer things to have to worry about trying, it's 01:15:33.520 |
less often going to accidentally find a crappy column, so it's got a slightly better opportunity 01:15:39.360 |
to create a slightly better tree with slightly less data, but it's not going to change it much. 01:15:44.840 |
But it's going to make it a bit faster and it's going to let us focus on what matters. 01:15:49.360 |
So if I rerun feature importance now, I've now got 25. 01:15:55.920 |
Now the key thing that's happened when you remove redundant columns is that 01:16:02.120 |
you're also removing sources of collinearity. 01:16:05.080 |
In other words, two columns that might be related to each other. 01:16:09.680 |
Now collinearity doesn't make your random forest less predictive, but if you have two 01:16:16.320 |
columns that are related to each other, this column is a little bit related to this column 01:16:22.080 |
and this column is a strong driver of the dependent variable, then what's going to happen 01:16:26.080 |
is that the importance is going to end up split between the two collinear columns. 01:16:32.600 |
It's going to say both of those columns matter, so it's going to split it between the two. 01:16:37.080 |
So by removing some of those columns with very little impact, it makes your feature importance clearer. 01:16:44.120 |
So you can see here actually, yearMade was pretty close to coupler system before, but 01:16:52.360 |
there must have been a bunch of things that were collinear with yearMade, which makes sense. 01:16:56.840 |
Like old industrial equipment wouldn't have had a bunch of technical features that new equipment would. 01:17:04.760 |
So it's actually saying yearmade really, really matters. 01:17:13.200 |
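You can see that importance-splitting effect directly with a tiny synthetic example, nothing to do with the bulldozers data: duplicate one strong column and watch its importance get shared between the two near-identical copies.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
df = pd.DataFrame({'a': x,
                   'a_copy': x + rng.normal(scale=0.01, size=2000),  # nearly collinear with 'a'
                   'b': rng.normal(size=2000)})                      # unrelated noise column
y = 3 * x + rng.normal(size=2000)

m = RandomForestRegressor(n_estimators=40, n_jobs=-1).fit(df, y)
print(dict(zip(df.columns, m.feature_importances_.round(2))))
# 'a' and 'a_copy' roughly share the importance 'a' alone would have had;
# drop 'a_copy', refit, and 'a' collects nearly all of it back.
```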
The predictive accuracy of the model is a tiny bit better, but this feature importance is now much cleaner and easier to interpret. 01:17:27.040 |
And not only is it really simple, it's a technique you can use not just for random forests, but 01:17:33.760 |
for basically any kind of machine learning model. 01:17:43.960 |
Many people will tell you that for this or that particular kind of model, there's no way of interpreting it, 01:17:50.440 |
and the most important interpretation of a model is knowing which things are important. 01:17:54.360 |
That claim is almost certainly not going to be true, because this technique I'm going to 01:17:57.240 |
teach you actually works for any kind of model. 01:18:00.280 |
We're going to take our data set, the bulldozers, and we've got this column which we're trying to predict, the price. 01:18:10.200 |
And then we've got all of our independent variables. 01:18:14.360 |
So here's an independent variable here, yearMade, plus a whole bunch of other variables. 01:18:22.760 |
And remember, after we did a bit of trimming, we had 25 independent variables. 01:18:32.120 |
How do we figure out how important yearMade is? 01:18:37.560 |
Well we've got our whole random forest, and we can find out our predictive accuracy. 01:18:43.120 |
So we're going to put all of these rows through our random forest, and we're going to spit 01:18:50.280 |
out some predictions, and we're going to compare them to the actual price, which gives us, in this case, 01:18:56.840 |
for example, our root mean squared error and our R^2. 01:19:01.360 |
And we're going to call that our starting point. 01:19:04.400 |
So now let's do exactly the same thing, but let's take the yearMade column and randomly 01:19:12.200 |
shuffle it, randomly permute just that column. 01:19:17.060 |
So now yearMade has exactly the same distribution as before, same mean and standard deviation, but it's 01:19:23.440 |
going to have no relationship to the dependent variable at all, because we totally randomly shuffled it. 01:19:29.020 |
So before we might have found our R^2 was 0.89, and then after we shuffle yearMade, we find it's only 0.80. 01:19:41.480 |
Oh, that score got much worse when we destroyed that variable. 01:19:47.960 |
It's like, let's try again, let's put yearMade back to how it was, and this time let's take 01:19:59.600 |
And we find this time with enclosure it's 0.84, and we can say the amount of decrease 01:20:08.760 |
in our score for yearMade was 0.09, and the amount of decrease in our score for enclosure 01:20:17.920 |
was 0.05, and this is going to give us our feature importances for each one of our columns. 01:20:33.420 |
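A bare-bones version of that shuffling procedure, assuming the `m`, `X_valid`, and `y_valid` from before (scikit-learn also ships sklearn.inspection.permutation_importance, which does the same thing with repeated shuffles):

```python
import numpy as np
import pandas as pd

def shuffled_column_importance(m, X, y):
    base = m.score(X, y)                  # starting-point R^2 on the validation set
    drops = {}
    for col in X.columns:
        X_shuf = X.copy()
        X_shuf[col] = np.random.permutation(X_shuf[col].values)  # destroy just this column
        drops[col] = base - m.score(X_shuf, y)                   # drop in R^2 = importance
    return pd.Series(drops).sort_values(ascending=False)

shuffled_column_importance(m, X_valid, y_valid)
```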
Wouldn't just excluding each column and running a random forest and checking the decay in the accuracy work just as well? 01:20:43.640 |
You could remove the column and train a whole new random forest, but that's going to be really slow. 01:20:50.440 |
This way we can keep our random forest and just test its predictive accuracy again. 01:20:58.400 |
In this case, we just have to rerun every row forward through the forest for each shuffled column. 01:21:14.360 |
So if you want to do multi-collinearity, would you do 2 of them and then 3 of them randomly shuffled at the same time? 01:21:19.540 |
Yeah, so I don't think you mean multi-collinearity, I think you mean looking for interaction effects. 01:21:24.620 |
So if you want to say which pairs of variables are most important, you could do exactly the same thing, shuffling each pair of columns together. 01:21:32.920 |
In practice, there are better ways to do that because that's obviously computationally pretty 01:21:39.800 |
expensive and so we'll try and find time to do that if we can. 01:21:48.440 |
We now have a model which is a little bit more accurate and we've learned a lot more about our data. 01:21:58.280 |
So we're out of time, and so what I would suggest you try doing now before next class 01:22:05.360 |
for this bulldozers dataset is go through the top 5 or 10 predictors and try and learn 01:22:15.480 |
what you can about them, learn how to draw plots in pandas, and try to come back with some insights about things like: 01:22:22.200 |
what's the relationship between year made and the dependent variable, what does its histogram look like? 01:22:29.280 |
Now that you know year made is really important, is there some noise in that column which we could clean up? 01:22:37.320 |
Are there some weird encodings in that column that we could fix? 01:22:41.480 |
This idea I had that maybe coupler system is there entirely because it's collinear with 01:22:45.600 |
something else, do you want to try and figure out if that's true, and if so, how would you do that? 01:22:51.560 |
fiProductClassDesc, that rings alarm bells for me, it sounds like it might be a high cardinality 01:22:59.120 |
categorical variable, it might be something with lots and lots of levels, because it sounds like some kind of model name. 01:23:04.320 |
So like go and have a look at that model name, does it have some ordering to it, could you 01:23:08.280 |
make it an ordinal variable to make it better, does it have some kind of hierarchical structure 01:23:12.440 |
in the string that we could split on a hyphen to create more subcolumns, have a think about these things. 01:23:21.960 |
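If you want to play with that last idea, splitting a hierarchical string column into subcolumns is a one-liner in pandas; a hypothetical sketch where the separator and the new column names are guesses you would check against the real values first, and df_raw just stands for whatever DataFrame holds the raw string column:

```python
# e.g. split something like "Wheel Loader - 110.0 to 120.0 Horsepower" into two parts
parts = df_raw['fiProductClassDesc'].str.split(' - ', n=1, expand=True)
df_raw['ProductClassBase'] = parts[0]   # the base equipment type
df_raw['ProductClassSpec'] = parts[1]   # the size / spec portion, if present
```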
By Tuesday when you come back, ideally you've got a better accuracy than what I just showed you 01:23:29.280 |
because you've found some new insights, or at least you can tell the class about some 01:23:34.440 |
things you've learned about how heavy industrial equipment auctions work in practice.