Back to Index

Intro to Machine Learning: Lesson 3


Chapters

0:00 Introduction
1:10 Data Interpretation
15:15 Limit Memory
18:13 Performance
22:35 Dates
26:00 Data
27:40 Testing
28:47 Set RF Samples
30:10 Adding floats
33:03 Adding min samples
33:42 Results
34:33 Insights
35:35 Limitations
37:16 Your Job
38:54 Coding
40:45 Scatter Plot
42:35 Tweaking Data
44:49 Validation Set
50:21 Break
54:40 Standard Deviation
56:56 Random Forest Interpretation

Transcript

Last lesson, we looked at what random forests are, and we looked at some of the tweaks that we could use to make them work better. So in order to actually practice this, we need to have a Jupyter notebook environment running: we can either install Anaconda on our own computers, we can use AWS, we can use crestle.com, which has everything up and running straight away, or else paperspace.com also works really well.

So assuming that you've got all that going, hopefully you've had a chance to practice running some random forests this week. I think one of the things to point out though is that before we did any tweaks of any hyperparameters or any tuning at all, the raw defaults already gave us a very good answer for an actual dataset that we've got on Kaggle.

So the tweaks aren't always the main piece, they're just tweaks. Sometimes they're totally necessary, but quite often you can go a long way without doing any tweaks at all. So today we're going to look at something I think is maybe even more important than building a predictive model that's good at predicting, which is to learn how to interpret that model to find out what it says about your data, to actually understand your data better by using machine learning.

People often claim that things like random forests are black boxes that hide meaning from us. You'll see today that the truth is quite the opposite. The truth is that random forests allow us to understand our data more deeply and more quickly than traditional approaches. The other thing we're going to learn today is how to look at larger datasets than those which you can import with just the defaults.

And specifically we're going to look at a dataset with over 100 million rows, which is the current Kaggle competition, the Favorita Grocery Sales Forecasting competition. Did anybody have any questions outside of those two areas we're covering today, or comments that they want to talk about? >> Can you just talk a little bit about, in general -- I understand the details of random forests more now, but when do you know this is an applicable model to use?

Like, in general, when should I say "I should try random forests here"? That's the part that I'm still unsure about -- if I'm told to, I can. >> So the short answer is, I can't really think of anything offhand that it's definitely not going to be at least somewhat useful for, so it's always worth trying.

I think really the question is, in what situations should I try other things as well? And the short answer to that question is for unstructured data, what I call unstructured data. So where all the different data points represent the same kind of thing, like a waveform in a sound or speech, or the words in a piece of text, or the pixels in an image, you're almost certainly going to want to try deep learning.

And then outside of those two, there's a particular type of model we're going to look at today called a collaborative filtering model, which, as it happens, is the kind of problem the groceries competition is, where neither of those approaches is quite what you want without some tweaks to them. So that would be the other main one.

If anybody thinks of other places where maybe neither of those techniques is the right thing to use, mention it on the forums, even if you're not sure, so we can talk about it because I think this is one of the more interesting questions. And to some extent it is a case of practice and experience, but I do think there are two main classes to know about.

Last week, at the point where we had done some of the key steps, like the CSV reading in particular, which took a minute or two, at the end of that we saved it to a feather format file. And just to remind you, that's because this is basically almost the same format that it lives in in RAM, so it's ridiculously fast to read and write stuff from feather format.

So what we're going to do today is we're going to look at lesson 2, RF interpretation, and the first thing we're going to do is read that feather format file. Now one thing to mention is a couple of you pointed out during the week a really interesting little bug, or little issue, which is in the proc_df function.

The proc_df function, remember, finds the numeric columns which have missing values and creates an additional boolean column, as well as replacing the missing values with the median; it also turns the categorical objects into their integer codes. Those are the main things it does. And a couple of you pointed out some key points about the missing value handling.

The first one is that your test set may have missing values in some columns that weren't in your training set, or vice-versa. And if that happens, you're going to get an error when you try to do the random forest, because if that "is missing" column appeared in your training set but not in your test set, it ended up in the model, so it's going to say you can't use that dataset with this model because you're missing one of the columns it requires.

That's problem number 1. Problem number 2 is that the median of the numeric values in the test set may be different from the training set, so it may actually process it into something which has different semantics. So I thought that was a really interesting point. So what I did was I changed proc_df so it returns a third thing, nas.

And the nas thing it returns, it doesn't matter in detail what it is, but I'll tell you just so you know: it's a dictionary where the keys are the names of the columns that have missing values, and the values of the dictionary are the medians. And so then optionally you can pass nas as an additional argument to proc_df, and it'll make sure that it adds those specific columns and it uses those specific medians.
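A rough sketch of what that looks like; the keyword name (na_dict here) and the test-set frame are my assumptions, so check the library source if it complains:

```python
from fastai.structured import proc_df  # fastai 0.7-style "structured" module

# Training set: get back the processed frame, the target, and the nas dictionary
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')

# Test set (df_raw_test is hypothetical here): pass the dictionary back in so the
# same _na columns get created and the same training-set medians get used
df_test, _, nas = proc_df(df_raw_test, na_dict=nas)
```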

So it's giving you the ability to say, process this test set in exactly the same way as we processed this training set. I just did that like yesterday or the day before. In fact, that's a good point: before you start doing work any day, I would do a git pull, and if something's not working today that was working yesterday, check the forum, where there will be an explanation of why.

This library in particular is moving fast, but pretty much all the libraries that we use, including PyTorch in particular, move fast. And so one of the things to do if you're watching this through the MOOC is to make sure that you go to course.fast.ai and check the links there, because there will be links saying, these are the differences from the course, and they're kept up to date -- because I can't edit what I'm saying, I can only edit that.

But do a git pull before you start each day. So I haven't actually updated all of the notebooks to add the extra return value. I will over the next couple of days, but if you're using them you'll just need to put an extra comma, otherwise you'll get an error that it's returned 3 things and you only have room for 2 things.

What I want to do before I talk about interpretation is to show you what the exact same process looks like when you're working with a really large dataset. And you'll see it's kind of almost the same thing, but there's going to be a few cases where we can't use the defaults, because the defaults just run a little bit too slowly.

So specifically I'm going to look at the Kaggle groceries competition, specifically -- what's it called? Here it is: Corporación Favorita Grocery Sales Forecasting. So this competition -- who is entering this competition? Okay, a lot of people. Who would like to have a go at explaining what this competition involves, what the data is and what you're trying to predict?

>> Okay, trying to predict the items on the shelf depending on lots of factors, like oil prices. >> So when you're predicting the items on the shelf, what are you actually predicting? >> How much do you need to have in stock to maximize their -- >> It's not quite what we're predicting, but we'll try and refine that in a moment.

>> And then there's a bunch of different datasets that you can use to do that. There's oil prices, there's stores, there's locations, and each of those can be used to try to predict it. Does anybody want to have a go at expanding on that? >> All right. So we have a bunch of information on different products.

So we have -- >> Let's just fill up a little bit higher. >> All right. So for every store, for every item, for every day, we have a lot of related information available, like the location where the store was located, the class of the product, and the units sold.

And then based on this, we are supposed to forecast in a much shorter time frame compared to the training data. For every item number, how much we think it's going to sell, so only the units and nothing else. >> Okay, good. Somebody can help get that back here. So your ability to explain the problem you're working on is really, really important.

So if you don't currently feel confident of your ability to do that, practice with someone who is not in this competition. Tell them all about it. So in this case, or in any case really, the key things to understand about a machine learning problem are: what are the independent variables, and what is the dependent variable?

So the dependent variable is the thing that you're trying to predict. The thing you're trying to predict is how many units of each kind of product were sold in each store on each day during a two-week period. So that's the thing that you're trying to predict. The information you have to predict is how many units of each product at each store on each day were sold in the last few years, and for each store some metadata about it, like where is it located and what class of store is it.

For each type of product, you have some metadata about it, such as what category of product is it and so forth. For each date, we have some metadata about it, such as what was the oil price on that date. So this is what we would call a relational dataset.

A relational dataset is one where we have a number of different pieces of information that we can join together. Specifically this kind of relational dataset is what we would refer to as a star schema. A star schema is a kind of data warehousing schema where we say there's some central transactions table.

In this case, the central transactions table is train.csv, and it contains the number of units that were sold by date, by store ID, by item ID. So that's the central transactions table, very small, very simple, and then from that we can join various bits of metadata. It's called a star schema because you can imagine the transactions table in the middle and then all these different metadata tables join onto it, giving you more information about the date, the item ID and the store ID.

Sometimes you'll also see a snowflake schema, which means there might then be additional information joined onto maybe the items table that tells you about different item categories and joined to the store table, telling you about the state that the store is in and so forth so you can have a whole snowflake.
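Just as a sketch of what joining up that star schema might look like in pandas (filenames and column names are as I recall them from the competition files, so treat them as assumptions):

```python
import pandas as pd

# Metadata tables that hang off the central transactions table (the "points" of the star)
items  = pd.read_csv('items.csv')                      # item_nbr, family, class, perishable
stores = pd.read_csv('stores.csv')                     # store_nbr, city, state, type, cluster
oil    = pd.read_csv('oil.csv', parse_dates=['date'])  # date, oil price

# df is the central transactions table (see below for how to read it in efficiently);
# each join adds metadata about one of its keys
df = (df
      .merge(items,  how='left', on='item_nbr')
      .merge(stores, how='left', on='store_nbr')
      .merge(oil,    how='left', on='date'))
```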

So that's the basic information about this problem: the independent variables, the dependent variable, and you probably also want to talk about things like the timeframe. Now we start in exactly the same way as we did before, loading in exactly the same stuff, setting the path. But when we go to read_csv, if you say low_memory=False, then you're basically saying use as much memory as you like to figure out what kinds of data are here.

It's going to run out of memory pretty much regardless of how much memory you have. So what we do in order to limit the amount of space that it takes up when we read it in is we create a dictionary for each column name to the data type of that column.

And so for you to create this, it's basically up to you to run less or head or whatever on the dataset to see what the types are, figure that out, and pass them in. And then you can just pass in dtype= with that dictionary.
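For this competition, the dictionary looks something like this (the specific dtypes and the PATH are my choices, not gospel; check the data yourself):

```python
import pandas as pd

PATH = 'data/grocery/'   # wherever the competition files live

types = {
    'id':          'int64',    # row ids get large
    'item_nbr':    'int32',    # item numbers go well above 255, so int8 won't do
    'store_nbr':   'int8',     # only a few dozen stores, so int8 is plenty
    'unit_sales':  'float32',
    'onpromotion': 'object',   # has missing values, so leave it general for now
}

df_all = pd.read_csv(f'{PATH}train.csv', parse_dates=['date'], dtype=types)
```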

And so check this out, we can read in the whole CSV file in 1 minute and 48 seconds, and there are 125.5 million rows. So when people say Python's slow, no, Python's not slow. Python can be slow if you don't use it right, but we can actually parse 125 million CSV records in less than 2 minutes.

I'm going to put my language hat on for just a moment: actually, if it's fast, it's almost certainly going to be C. So Python is usually a wrapper around a bunch of C code, because Python itself isn't actually very fast. That was Terrence Parr, who builds tools for writing programming languages for a living.

Python itself is not fast, but almost everything we want to do in Python for data science has been written for us in C, or actually more often in Cython, which is a Python-like language that compiles to C. So most of the stuff we run in Python is actually running C code; in fact, in Pandas a lot of it is heavily optimized, even written in assembly language, and behind the scenes a lot of that goes back to calling Fortran-based libraries for the linear algebra.

So there are layers upon layers of speed optimisation that allow us to spend less than 2 minutes reading in that much data. If we wrote our own CSV reader in pure Python, it would take at least thousands of times longer than the optimized version. So for us, what we care about is the speed we can get in practice.

So this is pretty cool. As well as telling it what the different data types were, we also have to tell it, as before, which things we want to parse as dates. >> I've noticed that in this dictionary you specify int64, int32, and int8. I was wondering, in practice, is it faster if you specify them all to be smaller, or is there some other performance consideration?

So the key performance consideration here was to use the smallest number of bits that I could to fully represent the column. If I had used int8 for item number, it wouldn't work, because there are more than 255 item numbers -- more specifically, the maximum item number is bigger than 255. On the other hand, if I had used int64 for store number, it would be using more bits than necessary.

Given that the whole purpose here was to avoid running out of RAM, we don't want to be using up 8 times more memory than necessary. So the key thing was really about memory. In fact, when you're working with large datasets, very often you'll find the slow piece is actually the reading and writing to RAM, not the CPU operations.

So very often that's the key performance consideration. Also, however, as a rule of thumb, smaller data types often will run faster, particularly if you can use SIMD, that is, single instruction multiple data vectorized code: it can pack more numbers into a single vector to run at once. That was all heavily simplified and not exactly right, but roughly right for this purpose.

>> Once you do this, is the shuf thing beforehand not needed anymore -- can you just read in a random subsample? >> Although here I've read in the whole thing, when I start, I never start by reading in the whole thing. So if you search the forum for 'shuf', you'll find some tips about how to use this UNIX command to get a random sample of data at the command prompt.

And then you can just read that. The nice thing is that that's a good way to find out what data types to use, to read in a random sample and let pandas figure it out for you. In general, I do as much work as possible on a sample until I feel confident that I understand the sample before I move on.
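For example, from a Jupyter cell (paths are placeholders):

```python
# Jupyter shell commands: keep the header, then append a random million-line sample
# (shuf may occasionally pick up the header again as a data row -- fine for a quick look)
!head -n 1 data/grocery/train.csv > data/grocery/train_sample.csv
!shuf -n 1000000 data/grocery/train.csv >> data/grocery/train_sample.csv

import pandas as pd
df_sample = pd.read_csv('data/grocery/train_sample.csv')
df_sample.dtypes   # let pandas infer the types on the sample, then reuse them for the full read
```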

Having said that, what we're about to learn is some techniques for running models on this full dataset that are actually going to work on arbitrarily large datasets; but I also specifically wanted to talk about how to read in large datasets. One thing to mention: leaving onpromotion as 'object' is like saying create a general purpose Python data type, which is slow and memory heavy.

The reason for that is that this is a Boolean which also has missing values, and so we need to deal with this before we can turn it into a Boolean. So you can see after that, I then go ahead and let's say fill in the missing values with false.

Now you wouldn't just do this without doing some checking ahead of time, but some exploratory data analysis shows that this is probably an appropriate thing to do; it seems that missing does mean false. Objects generally read in as strings, so replace the strings 'True' and 'False' with actual Booleans, and then finally convert the whole column to an actual Boolean type.
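A minimal sketch of that clean-up, assuming the column came in as the strings 'True'/'False' plus missing values:

```python
df_all['onpromotion'] = (
    df_all['onpromotion']
      .map({'True': True, 'False': False})  # strings -> booleans; anything else becomes NaN
      .fillna(False)                         # EDA suggests missing really does mean "not on promotion"
      .astype(bool)                          # compact one-byte-per-row boolean column
)
```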

So at this point, when I save this, this file now of 123 million records takes up something under 2.5 GB of memory. So you can look at pretty large datasets even on pretty small computers, which is interesting. So at that point, now that it's in a nice fast format, look how fast it is.

I can save it to feather format in under 5 seconds. So that's nice. And then, because pandas is generally pretty fast, you can do stuff like summarize every column of all 125 million records in 20 seconds. The first thing I looked at here is the dates. Generally speaking, dates are just going to be really important in a lot of the stuff you do, particularly because any model that you put into production, you're going to be using it at some date that is, by definition, later than the date you trained it on.

And so if anything in the world changes, you need to know how your predictive accuracy changes as well. And so what you'll see on Kaggle, and what you should always do in your own projects, is make sure that your dates don't overlap. So in this case, the dates that we have in the training set go from 2013 to mid-August 2017; there's our first and last.

And then in our test set, they go from 1 day later, August 16th until the end of the month. So this is a key thing that you can't really do any useful machine learning until you understand this basic piece here, which is you've got 4 years of data and you're trying to predict the next 2 weeks.

So that's just a fundamental thing that you're going to need to understand before you can really do a good job of this. And so as soon as I see that, what does that say to you? If you wanted to now use a smaller data set, should you use a random sample, or is there something better you could do?

Probably from the bottom, more recent? So it's like, okay, I'm going to go to a shop next week and I've got a $5 bet with my brother as to whether I can guess how many cans of Coke are going to be on the shelf. Alright, well probably the best way to do that would be to go to the shop same day of the previous week and see how many cans of Coke are on the shelf and guess it's going to be the same.

You wouldn't go and look at how many were there 4 years ago. >> But couldn't that same time frame of the year, 4 years ago, be important? For example, how much Coke they have on the shelf at Christmas time is going to be way more than usual. >> Exactly, so there is useful information from 4 years ago, so we don't want to entirely throw it away.

But as a first step, what's the simplest possible thing? It's kind of like submitting the means. I wouldn't submit the mean of 2012 sales, I would probably submit the mean of last month's sales, for example. So yeah, we're just trying to think about how we might want to create some initial easy models, and later on we might want to weight it.

So for example, we might want to weight more recent dates more highly; they're probably more relevant. But we should do a whole bunch of exploratory data analysis to check that. So here's what the bottom of that dataset looks like. And you can see it's literally got a date, a store number, an item number, the unit sales, and whether or not that particular item was on promotion at that particular store on that particular date, and then there's some arbitrary ID.

So that's it. So now that we have read that in, we can do stuff like, this is interesting, again we have to take the log of the sales. And it's the same reason as we looked at last week, because we're trying to predict something that varies according to ratios.

They told us in this competition that the root mean squared log error is the thing they care about, so we take the log. They mentioned also if you check the competition details, which you always should read carefully the definition of any project you do, they say that there are some negative sales that represent returns, and they tell us that we should consider them to be 0 for the purpose of this competition.

So I clip the sales so that they fall between 0 and no particular maximum -- clip just means cut it off at that point, truncate it -- and then take the log of that plus 1. Why plus 1? Because again, if you check the details of the Kaggle competition, that's what they tell you they're going to use: they're not just taking the root mean squared log error, they're taking the root mean squared error of log(sales + 1), because log of 0 doesn't make sense.
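So the target transformation is just a clip followed by log1p, something like this (column name assumed):

```python
import numpy as np

# Returns show up as negative sales; the competition says to treat them as zero,
# and the metric is RMSE on log(sales + 1), so transform the target to match
df_all['unit_sales'] = np.log1p(np.clip(df_all['unit_sales'], 0, None))
```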

We can run add_datepart as usual, and again it takes a couple of minutes. So I would run through all this on a sample first, so everything takes 10 seconds, to make sure it works and everything looks reasonable before I go back, because I don't want to wait 2 minutes or whatever without knowing whether it's going to work.

But as you can see, all these lines of code are identical to what we saw for the bulldozers competition. In this case, all I'm reading in is a training set. I didn't need to run train_cats because all of my data types are already numeric. If they weren't, I would need to call train_cats, and then I would need to call apply_cats to apply the same categorical codes that I now have in the training set to the validation set.
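So the preprocessing sketch is essentially the following (function names from the fastai.structured module; paths and column names assumed):

```python
from fastai.structured import add_datepart, proc_df

add_datepart(df_all, 'date')   # adds Year, Month, Dayofweek, Is_quarter_start, and so on
# train_cats / apply_cats aren't needed here: every column is already numeric
df_trn, y_trn, nas = proc_df(df_all, 'unit_sales')
```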

I call proc_df as before to deal with missing values and so forth. So all of those lines of code are identical. These lines of code again are identical, because root mean squared error is what we care about. And then I've got two changes. The first is set_rf_samples, which we learned about last week.

So we've got 120 something million records. We probably don't want to create a tree from 120 million something records. I don't even know how long that's going to take, I haven't had the time and patience to wait and see. So you could start with 10,000 or 100,000, maybe it runs in a few seconds, make sure it works and you can figure out how much you can run.

And so I found getting it to a million, it runs in under a minute. And so the point here is there's no relationship between the size of the dataset and how long it takes to build the random forest. The relationship is between the number of estimators multiplied by the sample size.

So the number of jobs is the number of cores that it's going to use. And I was running this on a computer that has about 60 cores, and I just found if you try to use all of them, it spends so much time spinning up jobs so it's a bit slower.
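Putting the sample size and the core count together, the setup looks roughly like this (the 1,000,000 samples and 8 jobs are just the values that happened to work on that machine; I'm assuming set_rf_samples is importable from fastai.structured):

```python
from sklearn.ensemble import RandomForestRegressor
from fastai.structured import set_rf_samples

set_rf_samples(1_000_000)   # each tree trains on a random 1M-row subsample, not all ~125M rows

m = RandomForestRegressor(n_estimators=20, min_samples_leaf=100,
                          n_jobs=8)   # fewer jobs than cores; -1 would mean "use every core"
```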

So if you've got lots and lots of cores on your computer, sometimes you want less. Negative 1 means use every single core. There's one more change I made, which is that I converted the data frame into an array of floats and then I fitted on that. Why did I do that?

Because internally, inside the random forest code, they do that anyway. And so given that I wanted to run a few different random forests with a few different hyperparameters, by doing it once myself I saved that 1 minute 37 seconds. So if you run a line of code and it takes quite a long time -- the first time I ran this random forest regressor it took 2 or 3 minutes, and I thought, I don't really want to wait 2 or 3 minutes -- you can always add %prun in front of the line of code.

So what %prun does is it runs something called a profiler. And what a profiler does is it tells you which lines of code behind the scenes took the most time. And in this case I noticed that there was one line of code inside scikit-learn that was taking nearly all the time.

And so I thought I'll do that first and then I'll pass in the result and it won't have to do it again. So this thing of looking to see which things is taking up the time is called profiling. And in software engineering, it's one of the most important tools you have.
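So the workaround is to do the conversion once, up front, and you can profile the fit to confirm that's where the time was going; a sketch:

```python
import numpy as np

# Convert the data frame to a float array once, up front, instead of once per fit
x = np.array(df_trn, dtype=np.float32)

# %prun runs the profiler over the line that follows it, so you can see where the time goes
%prun m.fit(x, y_trn)
```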

Data scientists really underappreciate this tool, but you'll find amongst conversations on GitHub issues or on Twitter or wherever, the top data scientists are sharing and talking about profiles all the time. And that's how easy it is to get a profile. So for fun, try running %prun from time to time on stuff that's taking 10-20 seconds and see if you can learn to interpret and use profiler outputs.

Even though in this case I didn't write the scikit-learn code, I was still able to use the profiler to figure out how to make it run over twice as fast by avoiding recalculating this each time. So in this case, I built my regressor and I decided to use 20 estimators.

Something else that I noticed is that I can't really use the OOB score when I use set_rf_samples. Because if I do, it's going to use the other 124 million rows to calculate the OOB score, which is still going to take forever. So I may as well have a proper validation set anyway; besides which, I want a validation set that's the most recent dates rather than a random one.

So if you use set_rf_samples on a large dataset, don't put the OOB score parameter in, because it takes forever. So that got me a 0.76 validation root mean squared log error, and then I tried fiddling around with different min_samples_leaf values. So if I decrease min_samples_leaf from 100 to 10, it takes a little bit more time to run, as we would expect.

And the error went down from 0.76 to 0.71, so that looked pretty good. So I kept decreasing it down to 3, and that brought the error down to 0.70. When I decreased it down to 1, it didn't really help. So I kind of had a reasonable random forest. When I say reasonable, though, it's not reasonable in the sense that it does not give a good result on the leaderboard.

And so this is a very interesting question about why is that. And the reason is really coming back to Savannah's question earlier, where might random forests not work as well. Let's go back and look at the data. Here's the entire dataset, here's all the columns we used. So the columns that we have to predict with are the date, the store number, the item number, and whether it was on promotion or not.

And then of course we used add_datepart, so there's also going to be day of week, day of month, day of year, is_quarter_start, etc., etc. So if you think about it, most of the insight around how much of something you expect to sell tomorrow is likely to be very wrapped up in the details: where is that store, what kind of things do they tend to sell at that store, and for that item, what category of item is it? If it's fresh bread, they might not sell much of it on Sundays, because on Sundays fresh bread doesn't get made; whereas if it's gasoline, maybe they're going to sell a lot of it, because on Sundays people go and fill up their car for the week ahead.

Now a random forest has no ability to do anything other than create a bunch of binary splits on things like day of week, store number, item number. It doesn't know which item represents gasoline. It doesn't know which stores are in the center of the city versus which ones are on the outskirts.

It doesn't know any of these things. Its ability to really understand what's going on is somewhat limited. So we're probably going to need to use the entire four years of data to even get some useful insights. But then as soon as we start using the whole four years of data, a lot of the data we're using is really old.

So interestingly, there's a Kaggle kernel that points out that what you could do is just take the last two weeks, take the average sales by date, by store number, by item number, and just submit that; and if you just submit that, you come about 30th. So for those of you in the groceries competition -- Terrence has a comment or a question.

>> I think this may have tripped me up, actually. I think you said date, store, item. I think it's actually store, item, on promotion, and then you take the mean across date. >> Oh yeah, you're right. It's store, item and on promotion. >> On promotion, yes. >> If you do it by date as well, you end up --

So each row represents basically a cross tabulation of all of the sales in that store for that item. So if you put date in there as well, there's only going to be one or two items being averaged in each of those cells, which is too much variation, basically, it's too sparse.

It doesn't give you a terrible result, but it's not 30th. So your job, if you're looking at this competition -- and we'll talk about this in the next class -- is how do you start with that model and make it a little bit better. Because if you can, then by the time we meet up next, hopefully you'll be better than 30th.

Because Kaggle being Kaggle, lots of people have now taken this kernel and submitted it, and they all have about the same score, and the scores are ordered not just by score but by date submitted. So if you now submit this kernel, you're not going to be 30th because you're way down the list of when it was submitted.

But if you can do a tiny bit better, you're going to be better than all of those people. So how can you make this a tiny bit better? >> Would you try to capture seasonality and trend effects by creating new columns, like the average sales in the month of August, or the average sales for this year?

Yeah, I think that's a great idea. So the thing for you to think about is how to do that. And so see if you can make it work. Because there are details to get right, which I know because Terrence has been working on this for the last week, and it's driven him almost crazy.

The details are difficult; they're not intellectually difficult, they're the kind of details that keep you at your desk at 2am. And this is something to mention in general: the coding you do for machine learning is incredibly frustrating and incredibly difficult. If you get a detail wrong, much of the time it's not going to give you an exception, it will just silently be slightly less good than it otherwise would have been.

And if you're on Kaggle, at least you know, okay well I'm not doing as well as other people on Kaggle. But if you're not on Kaggle, you just don't know. You don't know if your company's model is like half as good as it could be because you made a little mistake.

So that's one of the reasons why practicing on Kaggle now is great, because you're going to get practice in finding all of the ways in which you can infuriatingly screw things up. And you'll be amazed, like for me there's an extraordinary array of them. But as you get to know what they are, you'll start to know how to check for them as you go.

So you should assume every button you press, you're going to press the wrong button. And that's fine, as long as you have a way to find out. We'll talk about that more during the course, but unfortunately there isn't a set of specific things I can tell you to always do.

You just always have to think like, okay, what do I know about the results of this thing I'm about to do? I'll give you a really simple example. If you've actually created that basic entry where you take the mean by date, by store number, by on promotion, and you've submitted it and you've got a reasonable score, and then you think you've got something that's a little bit better, and you do predictions for that, how about you now create a scatterplot showing the predictions of your average model on one axis versus the predictions of your new model on the other axis?
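Concretely, something along these lines (the two prediction arrays are placeholders, generated randomly here just so the cell runs; in practice they're whatever your two models predicted for the same validation rows):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder arrays: substitute your real predictions for the same rows
baseline_preds = np.random.rand(1000)                           # simple group-mean baseline
new_preds      = baseline_preds + 0.05 * np.random.randn(1000)  # the model you hope is better

plt.scatter(baseline_preds, new_preds, alpha=0.2, s=4)
plt.xlabel('baseline (group means) prediction')
plt.ylabel('new model prediction')
plt.show()
```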

You should see that they just about form a line. And if they don't, then that's a very strong suggestion that you screwed something up. So that would be an example. Can you pass that one to the end of that row? >> So for a problem like this, unlike the car insurance problem on Kaggle where the columns are unnamed, we know what the columns represent and what they are.

How often do you pull in data from other sources to supplement that? Maybe like weather data, for example, or how often is that used? Very often. And so the whole point of this star schema is that you've got your central table and you've got these other tables coming off that provide metadata about it.

So for example, weather is metadata about a date. On Kaggle specifically, most competitions have the rule that you can use external data as long as you post on the forum that you're using it and that it's publicly available. But you have to check on a competition by competition basis, they will tell you.

So with Kaggle, you should always be looking for what external data could I possibly leverage here. So are we still talking about how to tweak this data set? If you wish. Well, I'm not familiar with the countries here, so maybe. This is Ecuador. Ecuador. So maybe I would start looking for Ecuador's holidays and shopping holidays, maybe when they have a three-day weekend or a week off.

Actually that information is provided in this case. And so in general, one way of tackling this kind of problem is to create lots and lots of new columns containing things like average number of sales on holidays, average percent change in sale between January and February, and so on and so forth.

And so if you have a look, there was a previous competition on Kaggle called Rossmann Store Sales that was almost identical; in that case it was in Germany, for a major chain of stores: how many items are sold by day, by item type, by store. In this case, the person who won, quite unusually actually, was something of a domain expert in this space.

They're actually a specialist in doing logistics predictions. And this is basically what they did, he's a professional sales forecast consultant. He created just lots and lots and lots of columns based on his experience of what kinds of things tend to be useful for making predictions. That's an approach that can work.

The third place team did almost no feature engineering, however, and also they had one big oversight, which I think they would have won if they hadn't had it. So you don't necessarily have to use this approach. So anyway, we'll be learning a lot more about how to win this competition, and ones like it as we go.

They did interview the third place team, so if you google for Kaggle Rossmann you'll see it. The short answer is they used deep learning. So one of the things -- and here are a couple of charts; Terrence is actually my teammate in this competition, so Terrence drew a couple of these charts for us -- is I want to talk about this.

If you don't have a good validation set, it's hard if not impossible to create a good model. So in other words, if you're trying to predict next month's sales and you try to build a model and you have no way of really knowing whether the models you built are good at predicting sales a month ahead of time, then you have no way of knowing when you put your model in production whether it's actually going to be any good.

So you need a validation set that you know is reliable at telling you whether or not your model is likely to work well when you put it into production or use it on the test set. So in this case, what Terrence has plotted here is -- so normally you should not use your test set for anything other than using it right at the end of the competition to find out how you went.

But there's one thing I'm going to let you use the test set for in addition, and that is to calibrate your validation set. So what Terrence did here was he built four different models, some which he thought would be better than others, and he submitted each of the four models to Kaggle to find out its score.

So the x-axis is the score that Kaggle told us on the leaderboard. And then on the y-axis, he plotted the score on a particular validation set he was trying out to see whether this validation set looked like it was going to be any good. So if your validation set is good, then the relationship between the leaderboard score and the test set score and your validation set score should lie in a straight line.

Ideally it will actually lie on the y=x line, but honestly that doesn't matter too much. As long as, relatively speaking, it tells you which models are better than which other models, then you know which model is the best. And you know how it's going to perform on the test set because you know the linear relationship between the two things.

So in this case, Terrence has managed to come up with a validation set which is looking like it's going to predict our Kaggle leaderboard score pretty well. And that's really cool because now he can go away and try 100 different types of models, feature engineering, weighting, tweaks, hyperparameters, whatever else, see how they go on the validation set and not have to submit to Kaggle.
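The chart itself is nothing fancy: one point per model you submitted, leaderboard score against validation score. The numbers below are purely illustrative placeholders, not real results:

```python
import matplotlib.pyplot as plt

# One entry per model actually submitted to Kaggle (placeholder numbers, not real results)
lb_scores  = [0.58, 0.61, 0.65, 0.70]   # public leaderboard scores
val_scores = [0.53, 0.57, 0.62, 0.68]   # the same models scored on the candidate validation set

plt.scatter(lb_scores, val_scores)
plt.xlabel('Kaggle leaderboard score')
plt.ylabel('validation set score')
plt.show()   # a roughly straight line means the validation set ranks models the way Kaggle does
```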

So we're going to get a lot more iterations, a lot more feedback. This is not just true of Kaggle, but of every machine learning project you do. So here's a different one he tried, where it wasn't as good: these ones that were quite close to each other are showing the opposite direction, and that's a really bad sign.

It's like this validation set idea didn't seem like a good idea, this validation set idea didn't look like a good idea. So in general if your validation set is not showing a nice straight line, you need to think carefully. How is the test set constructed? How is my validation set different?

There's some way you're constructing it which is different, you're going to have to draw lots of charts and so forth. So one question is, and I'm going to try to guess how you did it. So how do you actually try to construct this validation set as close to the...

>> So what I would try to do is to sample points from the training set that are as close as possible to some of the points in the test set. >> Close in what sense? >> I don't know, I would have to look at the features. >> What would you guess, in this case?

>> For this groceries competition? The last points. >> Yeah, close by date. >> By date. So basically all the different things Terrence was trying were different variations of close by date, so the most recent. >> What I noticed was -- so first I looked at the date range of the test set, and then I looked at the kernel that described how he or she...

>> So here is the date range of the test set, the last two weeks of August 2017. >> That's right. >> And then the person who submitted the kernel that said how to get the 0.58 leaderboard score or whatever it was -- >> Yeah, the average by group, yeah. >> I looked at the date range of that.

And that was... It was like 9 or 10 days. Well, it was actually 14 days and the test set is 16 days, but the interesting thing is the test set begins on the day after payday and ends on the payday. And so these are things I also paid attention to.

But... And I think that's one of the bits of metadata that they told us. These are the kinds of things you've just got to try, like I said, to plot lots of pictures. And even if you didn't know it was payday, you would want to draw the time series chart of sales and you would hopefully see that every two weeks there would be a spike or whatever.

And you'd be like, "Oh, I want to make sure that I have the same number of spikes in my validation set that I have in my test set," for example. Let's take a 5-minute break and come back at 2:32. This is my favorite bit -- interpreting machine learning models.

By the way, if you're looking for my notebook for the groceries competition, you won't find it on GitHub, because I'm not allowed to share code for a running competition with you unless you're on the same team as me; that's the rule. After the competition is finished it will be on GitHub, however, so if you're doing this through the video you should be able to find it.

So let's start by reading in our feather file. Our feather file contains exactly the same data as our CSV file. This is for our Blue Book for Bulldozers competition, so we're trying to predict the sale price of heavy industrial equipment at auction. And so reading the feather format file means that we've already read in the CSV and processed it into categories.

And so the next thing we do is to run proc_df in order to turn the categories into integers, deal with the missing values, and pull out the dependent variable. This is exactly the same thing as we used last time to create a validation set where the validation set represents the last couple of weeks, the last 12,000 records by date.

And I discovered, thanks to one of your excellent questions on the forum last week, that I had a bug here, which is that the rows were being shuffled -- sorry, not by proc_df in general, but last week we saw a particular version of proc_df where we passed in a subset, and when I passed in the subset it was randomly shuffling the rows.

And so then when I called split_vals, it wasn't getting the last rows by date, it was getting a random set of rows. So I've now fixed that. So if you rerun the lesson 1 RF code, you'll see slightly different results; specifically, you'll see in that section that my validation set results look less good, but that's only for that tiny little bit where I had the subset set.

>> I'm a little bit confused about the notation here: nas is both an input variable and also the output variable of this function -- why is that? >> proc_df returns a dictionary telling you which columns were missing and, for each of those columns, what the median was.

So when you call it on the larger dataset, the non-subset, you don't pass anything in at that point, you just want to get back the result. Later on, when you call it on a subset, you want to have the same missing columns and the same medians, and so you pass it in.

And if this different subset, like if it was a whole different dataset, turned out it had some different missing columns, it would update that dictionary with additional key values as well. You don't have to pass it in. If you don't pass it in, it just gives you the information about what was missing and the medians.

If you do pass it in, it uses that information for any missing columns that are there, and if there are some new missing columns, it will update that dictionary with that additional information. >> So it's keeping track of all the column information across datasets. >> Yeah, it's going to keep track of any missing columns that you came across in anything you passed to proc_df.

So we split it into the training and validation set just like we did last week, and so to remind you, once we've done proc_df, this is what it looks like. This is the log of sale price. So the first thing to think about is that we already know how to get the predictions: we take the average value in each leaf node, in each tree, after running a particular row through each tree.

That's how we get the prediction. But normally we don't just want a prediction, we also want to know how confident we are of that prediction. And so we would be less confident of a prediction if we haven't seen many examples of rows like this one, and if we haven't seen many examples of rows like this one, then we wouldn't expect any of the trees to have a path through which is really designed to help us predict that row.

And so conceptually, you would expect then that as you pass this unusual row through different trees, it's going to end up in very different places. So in other words, rather than just taking the mean of the predictions of the trees and saying that's our prediction, what if we took the standard deviation of the predictions of the trees?

So the standard deviation of the predictions of the trees, if that's high, that means each tree is giving us a very different estimate of this row's prediction. So if this was a really common kind of row, then the trees will have learnt to make good predictions for it because it's seen lots of opportunities to split based on those kinds of rows.

So the standard deviation of the predictions across the trees gives us some kind of relative understanding of how confident we are of this prediction. So that is not something which exists in scikit-learn or in any library I know of, so we have to create it. But we already have almost the exact code we need because remember last lesson we actually manually calculated the averages across different sets of trees, so we can do exactly the same thing to calculate the standard deviations.

When I'm doing random forest interpretation, I pretty much never use the full dataset. I always call set_rf_samples, because we don't need a massively accurate random forest, we just need one which indicates the nature of the relationships involved. And so I just make sure this number is high enough that if I call the same interpretation commands multiple times, I don't get different results back each time.

That's like the rule of thumb about how big does it need to be. But in practice, 50,000 is a high number and most of the time it would be surprising if that wasn't enough, and it runs in seconds. So I generally start with 50,000. So with my 50,000 samples per tree set, I create 40 estimators.

I know from last time that min_samples_leaf=3, max_features=0.5 isn't bad, and again we're not trying to create the world's most predictive tree anyway, so that all sounds fine. We get an R^2 on the validation set of 0.89. Again we don't particularly care, but as long as it's good enough, which it certainly is.
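So the setup for all of the interpretation work looks roughly like this (X_train, y_train, X_valid, y_valid come from split_vals as in lesson 1):

```python
set_rf_samples(50000)   # each tree sees a random 50k-row subsample: fast, and stable enough

m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,
                          max_features=0.5, n_jobs=-1)
m.fit(X_train, y_train)
m.score(X_valid, y_valid)   # R^2 on the validation set, around 0.89 here
```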

And so here's where we can do that exact same list comprehension as last time. Remember, go through each estimator, that's each tree, call .predict on it with our validation set, make that a list comprehension, and pass that to np.stack, which concatenates everything in that list across a new axis.

So now our rows are the results of each tree and our columns are the result of each row in the original dataset. And then we remember we can calculate the mean. So here's the prediction for our dataset row number 1. And here's our standard deviation. So here's how to do it for just one observation at the end here.
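In code, it's the same list comprehension as last lesson, plus a standard deviation:

```python
import numpy as np

# One row of predictions per tree: shape (n_trees, n_validation_rows)
preds = np.stack([t.predict(X_valid) for t in m.estimators_])

np.mean(preds[:, 0]), np.std(preds[:, 0])   # mean and spread of the trees' predictions for row 0
```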

We've calculated for all of them, just printing it for one. This can take quite a while, and specifically it's not taking advantage of the fact that my computer has lots of cores in it. List comprehension itself is Python code, and Python code, unless you're doing special stuff, runs in serial, which means it runs on a single CPU.

It doesn't take advantage of your multi-CPU hardware. And so if I wanted to run this on more trees and more data, this one second is going to go up. And you can see here the wall time, the amount of actual time it took, is roughly equal to the CPU time; whereas if it was running on lots of cores, the CPU time would be higher than the wall time.

So it turns out that scikit-learn -- actually not scikit-learn, fast.ai -- provides a handy function called parallel_trees, which calls some stuff inside scikit-learn. And parallel_trees takes two things: it takes a random forest model that I trained, here it is, m, and some function to call. And it calls that function on every tree in parallel.

So in other words, rather than calling t.predict(X_valid) directly, let's create a function that calls t.predict(X_valid), and use parallel_trees to call it on our model for every tree. It will return a list of the result of applying that function to every tree, and so then we can np.stack that. So hopefully you can see that this code and that code are basically doing the same thing.
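So, roughly (parallel_trees is the fast.ai helper just mentioned; X_valid as before):

```python
def get_preds(t):
    return t.predict(X_valid)          # what we want to run on each tree

# Run get_preds over every tree of the forest m in parallel, then stack as before
preds = np.stack(parallel_trees(m, get_preds))
np.mean(preds[:, 0]), np.std(preds[:, 0])
```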

But this one is doing it in parallel. And so you can see here, now our wall time has gone down to 500ms, and it's now giving us exactly the same answer, so a little bit faster. Time permitting, we'll talk about more general ways of writing code that runs in parallel because it turns out to be super useful for data science.

But here's one that we can use that's very specific to random forests. So what we can now do is we can always call this to get our predictions for each tree, and then we can call standard deviation to then get them for every row. And so let's try using that.

So what I could do is create a copy of our data and add an additional column to it, which is the standard deviation of the predictions across the first axis, and let's also add in the mean, that is, the predictions themselves. So you might remember from last lesson that one of the predictors we have is called Enclosure, and we'll see later on that this is an important predictor.
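Something like this, where raw_valid is assumed to be the un-processed validation rows, so the category levels still have readable names:

```python
x = raw_valid.copy()
x['pred_std'] = np.std(preds, axis=0)    # per-row spread of the trees' predictions
x['pred']     = np.mean(preds, axis=0)   # per-row mean, i.e. the forest's prediction
```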

And so let's start by just doing a histogram. One of the nice things about pandas is it's got built-in plotting capabilities, and it's well worth Googling for pandas plotting to see how to do it. Yes, Terrence? >> Can you remind me what Enclosure is? >> So we don't know what it means, and it doesn't matter.

I guess the whole purpose of this process is that we're going to learn about what things are, or at least what things are important, and later on figure out what they are and how they're important. So we're going to start out knowing nothing about this data set. So I'm just going to look at something called enclosure that has something called EROPS and something called OROPS, and I don't even know what this is yet.

All I know is that the only three that really appear in any great quantity are OROPS, EROPS w AC, and EROPS. And this is really common as a data scientist: you often find yourself looking at data that you're not that familiar with, and you've got to figure out at least which bits to study more carefully, which bits matter, and so forth.

So in this case, I at least know that these three groups I really don't care about because they basically don't exist. So given that, we're going to ignore those three. So we're going to focus on this one here, this one here, and this one here. And so here you can see what I've done is I've taken my data frame and I've grouped by enclosure, and I am taking the average of these three fields.

So here you can see the average sale price, the average prediction, and the standard deviation of prediction for each of my three groups. So I can already start to learn a bit here, as you would expect, the prediction and the sale price are close to each other on average, so that's a good sign.

And then the standard deviation varies a little bit, it's a little hard to see in a table, so what we could do is we could try to start printing these things out. So here we've got the sale price for each level of enclosure, and here we've got the prediction for each level of enclosure.

And for the error bars, I'm using the standard deviation of the prediction. So here you can see the actual, here's the prediction, and here's my confidence interval -- or at least it's the average standard deviation across the trees of the random forest. So this will tell us if there are some groups or some rows that we're not very confident about at all.
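As a sketch (column names from the bulldozers data; the plotting arguments are approximate):

```python
flds = ['Enclosure', 'SalePrice', 'pred', 'pred_std']
enc_summ = x[flds].groupby('Enclosure', as_index=False).mean()

# Predicted price per Enclosure level, with the average prediction spread as the error bar
enc_summ.plot('Enclosure', 'pred', kind='barh', xerr='pred_std', alpha=0.6)
```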

So we could do something similar for product size. So here's different product sizes. We could do exactly the same thing of looking at our predictions of standard deviations. We could sort by, and what we could say is, what's the ratio of the standard deviation of the predictions to the predictions themselves?

So you'd kind of expect on average that when you're predicting something that's a bigger number that your standard deviation would be higher, so you can sort by that ratio. And what that tells us is that the product size large and product size compact, our predictions are less accurate, relatively speaking, as a ratio of the total price.
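A sketch of that sort:

```python
flds = ['ProductSize', 'SalePrice', 'pred', 'pred_std']
summ = x[flds].groupby('ProductSize').mean()

# Relative uncertainty: spread of the trees' predictions as a fraction of the prediction itself
(summ.pred_std / summ.pred).sort_values(ascending=False)
```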

And so if we go back and have a look, there you go, that's why. From the histogram, those are the smallest groups. So as you would expect, in small groups we're doing a less good job. So this confidence interval you can really use for two main purposes. One is that you can group it up like this and look at the average confidence interval by group to find out if there are some groups that you just don't seem to have confidence about those groups.

But perhaps more importantly, you can look at them for specific rows. So when you put it in production, you might always want to see the confidence interval. So if you're doing credit scoring, deciding whether to give somebody a loan, you probably want to see not only what their level of risk is, but how confident we are about it.

And if they want to borrow lots of money, and we're not at all confident about our ability to predict whether they'll pay it back, we might want to give them a smaller loan. So those are the two ways in which you would use this. Let me go to the next one, which is the most important.

The most important is feature importance. The only reason I didn't do this first is because I think the confidence interval is the easiest one to understand intuitively; in fact, it's almost identical to something we've already calculated. But in terms of which one I look at first in practice, I always look at this one.

So when I'm working on a Kaggle competition or a real-world project, I build a random forest as fast as I can, try to get it to the point that it's significantly better than random -- it doesn't have to be much better than that -- and the next thing I do is to plot the feature importance.

The feature importance tells us which columns mattered in this random forest. We had dozens and dozens of columns originally in this dataset, and here I'm just picking out the top 10. You can just call rf_feat_importance; again, this is part of the fast.ai library, leveraging stuff that's in scikit-learn.

Pass in the model, pass in the data frame (because we need to know the names of the columns), and it'll give you back a pandas data frame showing you, in order of importance, how important each column was. And here I'm just going to pick out the top 10.

So we can then plot that. And because fi is a data frame, we can use data frame plotting commands. So here I've plotted all of the feature importances. And you can see here -- I haven't been able to write all of the names of the columns at the bottom, but that's not the important thing.

The important thing is to see that some columns are really, really important, and most columns don't really matter at all. In nearly every dataset you use in real life, this is what your feature importance is going to look like. It's going to say there's a handful of columns you care about.

And this is why I always start here, because at this point, in terms of learning about this domain of heavy industrial equipment auctions, I'm only going to care about learning about the columns which matter. So are we going to bother learning about Enclosure? It depends on whether Enclosure is important.

There it is. It's in the top 10. So we are going to have to learn about enclosure. So then we could also plot this as a bar plot. So here I've just created a tiny little function here that's going to just plot my bars, and I'm just going to do it for the top 30.
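As a rough sketch (I'm assuming rf_feat_importance is the exact name and that it returns a data frame with cols and imp columns; check the library if it complains):

```python
from fastai.structured import rf_feat_importance

fi = rf_feat_importance(m, df_trn)   # data frame of column names and importances, sorted
fi[:10]                              # the top 10

fi.plot('cols', 'imp', figsize=(10, 6), legend=False)   # importances for every column

def plot_fi(fi):
    return fi.plot('cols', 'imp', kind='barh', figsize=(12, 7), legend=False)

plot_fi(fi[:30])   # bar plot of the top 30
```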

And so you can see the same basic shape here, and I can see there's my enclosure. So we're going to learn about how this is calculated in just a moment. But before we worry about how it's calculated, much more important is to know what to do with it. So the most important thing to do with it is to now sit down with your client, or your data dictionary, or whatever your source of information is, and say to them, "Tell me about EMAID, what does that mean?

Where does it come from?" Plot lots of things like histograms of YearMade and scatter plots of YearMade against price, and learn everything you can, because YearMade and coupler system are the things that matter. And what will often happen in real-world projects is that you'll sit with the client and you'll say, "It turns out the coupler system is the second most important thing," and then they might say, "That makes no sense." Now that doesn't mean that there's a problem with your model.

It means there's a problem with their understanding of the data that they gave you. So let me give you an example. I entered a Kaggle competition where the goal was to predict which applications for grants at a university would be successful. And I used this exact approach and I discovered a number of columns which were almost entirely predictive of the dependent variable.

And specifically when I then looked to see in what way they were predictive, it turned out whether they were missing or not was basically the only thing that mattered in this dataset. And so later on, I ended up winning that competition and I think a lot of it was thanks to this insight.

And later on I heard what had happened. It turns out that at that university, there's an administrative burden to filling out the database, and so for a lot of the grant applications, nobody fills in the database for the folks whose applications weren't accepted. So in other words, those missing values in the dataset were saying, 'this grant wasn't accepted, because if it had been accepted, the admin folks would have gone in and typed in that information.' So this is what we call data leakage.

Data leakage means there's information in the dataset that I was modelling with which the university wouldn't have had in real life at the point in time they were making a decision. So when they're actually deciding which grant applications should I prioritise, they don't actually know which ones the admin staff are going to add information to because it turns out they got accepted.

So one of the key things you'll find here is data leakage problems, and that's a serious problem that you need to deal with. The other thing that will happen is that you'll often find signs of collinearity. I think that's what's happened here with coupler system. I think coupler system tells you whether or not a particular kind of heavy industrial equipment has a particular feature on it, but if it's not that kind of equipment at all, it will be empty, it will be missing.

And so Coupler system is really telling you whether or not it's a certain class of heavy industrial equipment. Now this is not leakage, this is actual information you actually have at the right time, it's just that interpreting it, you have to be careful. So I would go through at least the top 10 or look for where the natural breakpoints are and really study these things carefully.

To make life easier for myself, what I tend to do is try throwing some of the data away and see if it matters. So in this case, I had a random forest, and let's go and see how accurate it was: 0.889. What I did was say, let's go through our feature importance data frame and keep just those columns where the importance is greater than 0.005, which is roughly where the importances really flatten off.

So let's just keep those. And so that gives us a list of 25 column names. And so then I say let's now create a new data frame view which just contains those 25 columns, call split_vals on it again to split it into training and validation sets, and create a new random forest.
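Here's a minimal sketch of that step, assuming split_vals and n_trn are the splitting helper and training-set size used earlier in the course; the forest settings shown are just defaults, not necessarily what the notebook used, and the 0.005 cut-off is the one discussed above:

```python
from sklearn.ensemble import RandomForestRegressor

# keep only the columns whose importance is above the cut-off
to_keep = fi[fi.imp > 0.005].cols
df_keep = df_trn[to_keep].copy()

# re-split into training and validation sets with the same helper as before
X_train, X_valid = split_vals(df_keep, n_trn)

# fit a fresh forest on just the important columns
m = RandomForestRegressor(n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
```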

And let's see what happens. And you can see here that the R^2 basically didn't change, 0.891 versus 0.889, so it's actually increased a tiny bit. Generally speaking, removing redundant columns shouldn't make it worse; if it does make it worse, they weren't really redundant after all. It might make it a little better, because if you think about how we built these trees, when it's deciding what to split on it has fewer things to try, so it's less often going to accidentally find a crappy column. It's got a slightly better opportunity to create a slightly better tree with slightly less data, but it's not going to change it by much.

But it's going to make it a bit faster, and it's going to let us focus on what matters. So if I rerun feature importance now, I've now got 25 columns. Now the key thing that happens when you remove redundant columns is that you're also removing sources of collinearity.

In other words, two columns that might be related to each other. Now collinearity doesn't make your random forest less predictive, but if you have two columns that are related to each other, where one is a little bit related to the other and that other is a strong driver of the dependent variable, then what's going to happen is that the importance ends up split between the two collinear columns.

It's going to say both of those columns matter, so the importance gets split between the two. By removing some of those columns that have very little impact, it makes your feature importance clearer. So you can see here that yearMade was pretty close to coupler system before, but there must have been a bunch of things that were collinear with yearMade, which makes perfect sense.

Like old industrial equipment wouldn't have had a bunch of technical features that new ones would, for example. So it's actually saying yearmade really, really matters. So I trust this feature importance better. The predictive accuracy of the model is a tiny bit better, but this feature importance has a lot less collinearity to confuse us.

So let's talk about how this works. It's actually really simple. And not only is it really simple, it's a technique you can use not just for random forests, but for basically any kind of machine learning model. And interestingly, almost no one knows that. Many people will tell you this particular kind of model, there's no way of interpreting it.

And the most important interpretation of a model is knowing which things are important. But that claim, that you can't interpret such models, is almost certainly not going to be true, because this technique I'm going to teach you actually works for any kind of model. So here's what we're going to do. We're going to take our data set, the bulldozers, and we've got this column which we're trying to predict, which is price.

And then we've got all of our independent variables. So here's an independent variable here, yearMade, plus a whole bunch of other variables. And remember, after we did a bit of trimming, we had 25 independent variables. How do we figure out how important yearMade is? Well we've got our whole random forest, and we can find out our predictive accuracy.

So we're going to put all of these rows through our random forest, and it's going to spit out some predictions, and we're going to compare them to the actual price to get, in this case, our root mean squared error and our R^2. And we're going to call that our starting point.

So now let's do exactly the same thing, but let's take the yearMade column and randomly shuffle it, randomly permute just that column. So now yearMade has exactly the same distribution as before, the same mean and standard deviation, but it's going to have no relationship to the dependent variable at all, because we totally randomly reordered it.

So before, we might have found our R^2 was 0.89, and then after we shuffle yearMade, we check again and now it's like 0.8. Oh, that score got much worse when we destroyed that variable. So let's try again: let's put yearMade back to how it was, and this time let's take enclosure and shuffle that.

And we find this time with enclosure it's 0.84, and we can say the amount of decrease in our score for yearMade was 0.09, and the amount of decrease in our score for enclosure was 0.05, and this is going to give us our feature importances for each one of our columns.
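Here's a minimal sketch of that shuffling procedure, written so it works with any fitted scikit-learn style model; the function name is mine, and I'm using the model's built-in score (R^2 for regressors) as the metric. Newer versions of scikit-learn also ship sklearn.inspection.permutation_importance, which does essentially this with repeated shuffles.

```python
import numpy as np
import pandas as pd

def perm_importance(model, X, y):
    # baseline score on the untouched data (model.score is R^2 for sklearn regressors)
    baseline = model.score(X, y)
    imps = {}
    for col in X.columns:
        saved = X[col].copy()
        X[col] = np.random.permutation(X[col].values)  # shuffle just this one column
        imps[col] = baseline - model.score(X, y)       # drop in score = importance
        X[col] = saved                                 # put the column back
    return pd.Series(imps).sort_values(ascending=False)
```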

Wouldn't it work to just exclude each column, retrain a random forest, and check the decay in performance? You could remove the column and train a whole new random forest, but that's going to be really slow. This way we can keep our random forest and just test its accuracy again with that one column shuffled.

So this is nice and fast by comparison. In this case, we just have to run every row forward through the forest again for each shuffled column; we're basically just doing predictions. So if you want to do multi-collinearity, would you do 2 of them and then 3 of them randomly shuffled?

Yeah, so I don't think you mean multi-collinearity, I think you mean looking for interaction effects. So if you want to say which pairs of variables are most important, you could do exactly the same thing, each pair in turn. In practice, there are better ways to do that because that's obviously computationally pretty expensive and so we'll try and find time to do that if we can.

We now have a model which is a little bit more accurate, and we've learned a lot more about it. So we're out of time, and what I'd suggest you try before next class, for this bulldozers dataset, is to go through the top 5 or 10 predictors, learn what you can about how to draw plots in pandas, and come back with some insights: what's the relationship between YearMade and the dependent variable? What's the histogram of YearMade?

Now that you know YearMade is really important, is there some noise in that column which we could fix? Are there some weird encodings in that column that we could fix? This idea I had, that maybe coupler system is there entirely because it's collinear with something else: do you want to try to figure out if that's true, and if so, how would you do it?
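If you want a starting point for those plots, something like the following works with pandas plotting; I'm assuming the raw frame is called df_raw and the dependent variable column is SalePrice, as in the bulldozers data:

```python
df_raw.YearMade.hist(bins=50)                             # histogram of YearMade
df_raw.plot.scatter('YearMade', 'SalePrice', alpha=0.05)  # YearMade against price
```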

fiProductClassDesc, that rings alarm bells for me; it sounds like it might be a high-cardinality categorical variable, something with lots and lots of levels, because it sounds like a model name. So go and have a look at that column: does it have some ordering to it, could you make it an ordinal variable to make it better, does it have some kind of hierarchical structure in the string that we could split on the hyphen to create more sub-columns? Have a think about this.
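One possible way to explore that last idea is to split the string on its separator to create sub-columns; the ' - ' separator here is a guess, so check the actual values first:

```python
import pandas as pd

# split the description into its parts and add them as new columns
parts = df_raw.fiProductClassDesc.str.split(' - ', expand=True)
parts.columns = [f'fiProductClassDesc_{i}' for i in range(parts.shape[1])]
df_raw = pd.concat([df_raw, parts], axis=1)
```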

By Tuesday when you come back, ideally you've got a better accuracy than what I just showed because you've found some new insights, or at least you can tell the class about some things you've learned about how heavy industrial equipment auctions work in practice. See you on Tuesday.