Intro to Machine Learning: Lesson 6
Chapters
0:00 Intro
1:50 Horizontal Applications
3:10 Data Products
3:48 Objectives
4:52 Levers
5:40 What can levers do
7:02 What data does the organization have
8:45 A model
11:23 Counterfactual
12:16 Predictive vs Optimization
13:14 How do they all work together
14:33 Churn
18:17 Lead Prioritization
20:32 Define the Problem
23:50 Readmission Risk
25:30 Interpretation
31:01 Application
00:00:00.000 |
So we've looked at a lot of different random forest interpretation techniques 00:00:10.840 |
and a question that's come up a little bit on the forums is like what are these for really? 00:00:20.440 |
Like how do these help me get a better score on Kaggle? 00:00:24.760 |
And my answer has kind of been like they don't necessarily. 00:00:29.400 |
So I want to talk more about why do we do machine learning? 00:00:39.360 |
And to answer this question, I'm going to put this PowerPoint in the GitHub repo, and 00:00:48.320 |
I want to show you something really important which is examples of how people have used 00:00:56.960 |
machine learning mainly in business, because that's where most of you are probably going 00:01:02.880 |
to end up after this is working for some company. 00:01:05.720 |
I'm going to show you applications of machine learning which are either based on things 00:01:10.600 |
that I've been personally involved in myself or know of people who are doing them directly, 00:01:15.280 |
so none of these are going to be like hypotheticals, these are all actual things that people are 00:01:22.080 |
doing and I've got direct or secondhand knowledge of. 00:01:27.160 |
I'm going to split them into two groups, horizontal and vertical. 00:01:29.920 |
So in business, horizontal means something that you do across different kinds of business, 00:01:37.920 |
whereas vertical means something that you do within a business or within a supply chain. 00:01:45.480 |
So for example, a horizontal application is everything involving marketing. 00:01:53.880 |
So like every company pretty much has to try to sell more products to its customers. 00:02:01.960 |
And so each of these boxes is an example of something that people are using machine learning for. 00:02:18.720 |
So churn refers to a model which attempts to predict who's going to leave. 00:02:26.880 |
So I've done some churn modeling fairly recently in telecommunications, and so we're trying 00:02:32.460 |
to figure out for this big cell phone company which customers are going to leave. 00:02:38.880 |
That is not of itself that interesting; building a highly predictive model that says 00:02:47.280 |
Jeremy Howard is almost certainly going to leave next month is probably not that helpful, 00:02:53.680 |
because if I'm almost certainly going to leave next month, there's probably nothing you can 00:02:59.480 |
do about it; it's going to cost you too much to keep me. 00:03:02.920 |
So in order to understand why would we do churn modeling, I've got a little framework to show you. 00:03:11.200 |
So if you Google for Jeremy Howard data products, I think I've mentioned this thing before, 00:03:19.480 |
there's a paper you can find designing great data products that I wrote with a couple of 00:03:24.160 |
colleagues a few years ago, and in it I describe my experience of actually turning machine learning into practical results. 00:03:37.880 |
And the basic trick is this thing I call the drivetrain approach which is these four steps. 00:03:49.440 |
The starting point to actually turn a machine learning project into something that's actually 00:03:55.000 |
useful is to know what am I trying to achieve. 00:03:58.480 |
And that doesn't mean I'm trying to achieve a high area under the ROC curve or I'm trying 00:04:04.420 |
to achieve a large difference between classes. 00:04:08.720 |
No, it would be I'm trying to sell more books or I'm trying to reduce the number of customers 00:04:16.740 |
that leave next month, or I'm trying to detect lung cancer earlier. 00:04:25.480 |
So the objective is something that absolutely directly is the thing that the company or organization wants. 00:04:32.960 |
No company or organization exists in order to create a more accurate predictive model. 00:04:45.780 |
Now that's obviously the most important thing. 00:04:47.320 |
If you don't know the purpose of what you're modeling for, then you can't possibly do a good job of it. 00:04:52.880 |
And hopefully people are starting to pick that up out there in the world of data science, 00:04:58.320 |
but interestingly, what very few people are talking about is something just as important: the levers. 00:05:04.800 |
A lever is a thing that the organization can do to actually drive the objective. 00:05:14.560 |
What is a lever that an organization could use to reduce the number of customers that leave? 00:05:26.880 |
They could take a closer look at the model and do some of this random forest interpretation 00:05:32.520 |
and see some of the causes that are causing people to leave and potentially change those 00:05:40.800 |
So that's a data scientist's answer, but I want you to go to the next level. 00:05:44.740 |
What are the levers, the things they can actually do? 00:05:54.120 |
They could call someone and say, "Are you happy? Is there anything we can do?" 00:06:01.120 |
Yeah, so they could give them a free pen or something if they buy 20 bucks worth of product 00:06:20.220 |
Okay, so you guys are all about giving out carrots rather than handing out sticks. 00:06:23.960 |
Do you want to send it over to a couple of guys? 00:06:38.080 |
And so whenever you're working as a data scientist, keep coming back and thinking, "What are we 00:06:43.440 |
trying to achieve, we being the organization, and how are we trying to achieve it being 00:06:48.280 |
like what are the actual things we can do to make that objective happen?" 00:06:53.440 |
So building a model is never ever a lever, but it could help you with the lever. 00:07:02.920 |
So then the next step is, what data does the organization have that could possibly help 00:07:09.400 |
them to set that lever to achieve that objective? 00:07:14.120 |
And so this is not what data did they give you when you started the project, but think 00:07:19.980 |
about it from a first principles point of view. 00:07:22.440 |
I'm working for a telecommunications company, they gave me some certain set of data, but 00:07:27.320 |
I'm sure they must know where their customers live, how many phone calls they made last 00:07:32.520 |
month, how many times they called customer service, whatever. 00:07:36.720 |
And so have a think about, okay, if we're trying to decide who should we give a special 00:07:44.680 |
offer to proactively, then we want to figure out what information do we have that might 00:07:52.320 |
help us to identify who's going to react well or badly to that. 00:07:57.560 |
Perhaps more interestingly would be what if we were doing a fraud algorithm, and so we're 00:08:04.520 |
trying to figure out who's going to not pay for the phone that they take out of the store, 00:08:09.600 |
they're on some 12 month payment plan, we never see them again. 00:08:13.760 |
Now in that case, the data we have available, it doesn't matter what's in the database, 00:08:17.840 |
what matters is what's the data that we can get when the customer is in the shop. 00:08:22.960 |
So there's often constraints around the data that we can actually use. 00:08:28.880 |
So we need to know what am I trying to achieve, what can this organization actually do specifically 00:08:36.580 |
to change that outcome, and at the point that that decision is being made, what data do we have available. 00:08:45.720 |
And so then the way I put that all together is with a model, and this is not a model in 00:08:50.240 |
the sense of a predictive model, but it's a model in the sense of a simulation model. 00:08:55.200 |
So one of the main examples I give in this paper is one I spent many years building, 00:09:00.280 |
which is if an insurance company changes their prices, how does that impact their profitability? 00:09:09.140 |
And so generally your simulation model contains a number of predictive models. 00:09:13.720 |
So I had, for example, a predictive model called an elasticity model that said for a 00:09:18.160 |
specific customer, if we charge them a specific price for a specific product, what's the probability 00:09:24.180 |
that they would say yes, both when it's new business and then a year later, what's the probability that they renew. 00:09:32.240 |
And then there's another predictive model which is what's the probability that they're 00:09:36.160 |
going to make a claim, and how much is that claim going to be. 00:09:40.700 |
And so you can combine these models together then to say if we change our pricing by reducing 00:09:45.680 |
it by 10% for everybody between 18 and 25, and we can run it through these models that 00:09:51.140 |
combine together into a simulation, then the overall impact on our market share in 10 years 00:09:55.620 |
time is x, and our cost is y, and our profit is z, and so forth. 00:10:01.600 |
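As a rough sketch of how predictive models combine into a simulation like this, consider the following. Everything here, from the function names to the numbers, is invented for illustration; in practice each predictive piece would be a fitted model such as a random forest, not a formula.

```python
# A minimal sketch of a pricing simulation model (all values hypothetical).

def p_accept(price, base_price=500.0, elasticity=1.5):
    """Hypothetical elasticity model: probability a customer accepts a quote."""
    return max(0.0, min(1.0, 1.0 - elasticity * (price - base_price) / base_price))

def expected_claims(risk_score):
    """Hypothetical claims model: expected claim cost for this customer."""
    return 400.0 * risk_score

def expected_profit(price, risk_score):
    """Simulation step: combine the predictive models with business logic."""
    return p_accept(price) * (price - expected_claims(risk_score))

# Sweep the pricing lever and pick the setting with the best simulated outcome.
prices = [400 + 10 * i for i in range(30)]
best_price = max(prices, key=lambda p: expected_profit(p, risk_score=0.8))
```

The point is that the pricing lever is evaluated through the combined models, not through any one model's accuracy metric.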
So in practice, most of the time you really are going to care more about the results of 00:10:10.360 |
that simulation than you do about the predictive model directly. 00:10:15.820 |
But most people are not doing this effectively at the moment. 00:10:19.580 |
So for example, when I go to Amazon, I read all of Douglas Adams' books. 00:10:27.560 |
And so having read all of Douglas Adams' books, the next time I went to Amazon, they said 00:10:32.720 |
would you like to buy the collected works of Douglas Adams? 00:10:37.120 |
This is after I had bought every one of his books. 00:10:40.280 |
So from a machine learning point of view, some data scientist had said people that buy 00:10:48.200 |
one of Douglas Adams' books often go on to buy the collected works, but recommending 00:10:54.280 |
to me that I buy the collected works of Douglas Adams isn't smart. 00:10:59.480 |
And it's actually not smart at a number of levels. 00:11:02.660 |
Not only is it unlikely that I'm going to buy a box set of something of which I have 00:11:05.680 |
every one individually, but furthermore, it's not going to change my buying behavior. 00:11:10.880 |
I already know about Douglas Adams, I already know I like him, so taking up your valuable 00:11:16.160 |
web space to tell me hey, maybe you should buy more of the author who you're already 00:11:20.540 |
familiar with and have bought lots of times isn't actually going to change my behavior. 00:11:26.680 |
So what if instead of creating a predictive model, Amazon had built an optimization model 00:11:32.480 |
that could simulate and said if we show Jeremy this ad, how likely is he then to go on to buy this book? 00:11:40.760 |
And if I don't show him this ad, how likely is he to go on to buy this book? 00:11:46.680 |
And so that's the counterfactual, the counterfactual is what would have happened otherwise. 00:11:51.840 |
And then you can take the difference and say okay, what should we recommend him that is 00:11:57.040 |
going to maximally change his behavior, so maximally result in more books. 00:12:02.560 |
And so you'd probably say like oh, he's never bought any Terry Pratchett books, he probably 00:12:08.120 |
doesn't know about Terry Pratchett, but lots of people that liked Douglas Adams did turn 00:12:12.760 |
out to like Terry Pratchett, so let's introduce him to a new author. 00:12:17.360 |
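A minimal sketch of that counterfactual comparison, with made-up probabilities standing in for the two fitted models (in practice you'd estimate both, or fit a single uplift model):

```python
import numpy as np

# Hypothetical candidate recommendations and illustrative probabilities.
books = ["Douglas Adams box set", "Terry Pratchett novel", "random thriller"]
p_buy_if_shown = np.array([0.05, 0.40, 0.10])   # P(buy | ad shown)
p_buy_if_not   = np.array([0.05, 0.05, 0.08])   # counterfactual: P(buy | no ad)

# A purely predictive model might rank the box set highly; the uplift, the
# change in behavior caused by pulling the lever, is what actually matters.
uplift = p_buy_if_shown - p_buy_if_not
recommendation = books[int(np.argmax(uplift))]
```

With these illustrative numbers the box set has zero uplift (he'd buy or not buy it regardless), so the Terry Pratchett novel wins.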
So it's the difference between a predictive model on the one hand versus an optimization 00:12:23.440 |
So the two tend to go hand in hand; the optimization basically sits on top of the simulation. 00:12:34.520 |
The simulation model is saying in a world where we put Terry Pratchett's book on the 00:12:41.240 |
front page of Amazon for Jeremy Howard, this is what would have happened. 00:12:46.320 |
He would have bought it with a 94% probability. 00:12:50.640 |
And so that then tells us with this lever of what do I put on my homepage for Jeremy today, 00:13:00.040 |
we say okay, of all the different settings of that lever, the one that puts Terry Pratchett on the homepage is best. 00:13:07.680 |
And then that's the thing which maximizes our profit from Jeremy's visit to Amazon.com 00:13:15.740 |
So generally speaking, your predictive models kind of feed in to this simulation model, 00:13:22.160 |
but you've kind of got to think about how do they all work together. 00:13:28.500 |
So say it turns out that Jeremy Howard is very likely to leave his cell phone company next month. 00:13:41.080 |
And I can tell you, if my cell phone company calls me right now and says just calling to 00:13:45.720 |
say we love you, I'd be like I'm canceling right now. 00:13:53.920 |
So again, you'd want a simulation model that says what's the probability that Jeremy is 00:13:58.840 |
going to change his behavior as a result of calling him right now. 00:14:05.400 |
On the other hand, if I got a piece of mail tomorrow that said for each month you stay 00:14:11.520 |
with us, we're going to give you $100,000, okay, then that's going to definitely change my behavior. 00:14:20.160 |
But then feeding that into the simulation model, it turns out that overall that would not be profitable. 00:14:34.160 |
So when we look at something like churn, we want to be thinking what are the levers we 00:14:41.680 |
can pull, and so what are the kind of models that we could build with what kinds of data 00:14:47.120 |
to help us pull those levers better to achieve our objectives. 00:14:51.720 |
And so when you think about it that way, you realize that the vast majority of these applications 00:14:57.240 |
are not largely about a predictive model at all; they're about interpretation, they're about simulation. 00:15:07.400 |
So if we kind of take the intersection between on the one hand, here are all the levers that 00:15:15.360 |
we could pull, here are all the things we can do, and then here are all of the features 00:15:20.040 |
from our random forest feature importance that turn out to be strong drivers of the outcome. 00:15:26.120 |
And so then the intersection of those is here are the levers we could pull that actually 00:15:30.560 |
matter because if you can't change the thing, then it's not very interesting. 00:15:37.960 |
And if it's not actually a significant driver, it's not very interesting. 00:15:42.880 |
So we can actually use our random forest feature importance to tell us what we can actually change that matters. 00:15:51.800 |
And then we can use the partial dependence to actually build this kind of simulation 00:15:55.440 |
model to say like okay, well if we did change that, what would happen? 00:16:02.720 |
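Here is a hedged sketch of that intersection using scikit-learn's feature importance and partial dependence on synthetic data; the set of levers is hypothetical, and in a real project it would come from conversations with the business:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import partial_dependence

# Synthetic stand-in data: pretend the four columns are candidate churn drivers.
X, y = make_regression(n_samples=500, n_features=4, n_informative=2,
                       random_state=42)
rf = RandomForestRegressor(n_estimators=40, random_state=42).fit(X, y)

# Feature importance tells us which drivers are strong...
importance = rf.feature_importances_
strong_drivers = {int(i) for i in np.argsort(importance)[::-1][:2]}

# ...and the intersection with the levers we can actually pull (a hypothetical
# set here) is what is actionable.
levers_we_can_pull = {0, 1}
actionable = strong_drivers & levers_we_can_pull

# Partial dependence then approximates the "if we changed it, what would
# happen" simulation for the strongest driver.
top = int(np.argmax(importance))
pdp = partial_dependence(rf, X, features=[top])
```

Features outside the intersection are either unimportant or unchangeable, which is exactly the point made above.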
So there are lots and lots of these vertical examples. 00:16:08.000 |
And so what I want you to kind of think about as you think about the machine learning problems 00:16:12.760 |
you're working on is like why does somebody care about this? 00:16:18.640 |
And what would a good answer to them look like and how could you actually positively impact it? 00:16:25.440 |
So if you're creating like a Kaggle kernel, try to think about from the point of view 00:16:31.800 |
of the competition organizer, like what would they want to know? 00:16:39.000 |
So something like fraud detection, on the other hand, you probably just basically want to know who is fraudulent. 00:16:49.280 |
So you probably do just care about the predictive model, but then you do have to think carefully about what data is actually available. 00:16:55.000 |
So it's like okay, but we need to know who's fraudulent at the point that we're about to complete the sale. 00:17:02.760 |
So it's no point like looking at data that's available like a month later, for instance. 00:17:07.040 |
So you've kind of got this key issue of thinking about the actual operational constraints that you're working under. 00:17:18.080 |
Lots of interesting applications in human resources, but like employee churn, it's another 00:17:23.740 |
kind of churn model where finding out that Jeremy Howard's sick of lecturing, he's going 00:17:29.040 |
to leave tomorrow, what are you going to do about it? 00:17:32.120 |
Well knowing that wouldn't actually be helpful, it'd be too late, right? 00:17:36.000 |
You would actually want a model that said what kinds of people are leaving USF? 00:17:41.920 |
And it turns out that everybody that goes to the downstairs cafe leaves USF, I guess 00:17:47.280 |
their food is awful, or whatever, or everybody that we're paying less than half a million 00:17:53.680 |
dollars a year is leaving USF because they can't afford basic housing in San Francisco. 00:18:00.480 |
So you could use your employee churn model not so much to say which employees hate us, but to understand why employees are leaving. 00:18:12.400 |
And so again, it's really the interpretation there that matters. 00:18:20.960 |
Now lead prioritization is a really interesting one, right? 00:18:23.640 |
This is one where a lot of companies, yes, Dana, can you pass that over there? 00:18:29.920 |
Yes, so I was just wondering, so for the churn thing, you suggested one lever could be paying an 00:18:37.760 |
employee like a million a year or something, but then it sounds like there are two things 00:18:41.440 |
that you need to predict, one being churn and one being what you need to optimize for, your profit. 00:18:49.840 |
So this is what this simulation model is all about, so it's a great question. 00:18:54.240 |
So you kind of figure out this objective we're trying to maximize which is like company profitability, 00:19:02.000 |
you can kind of create a pretty simple Excel model or something that says here's the revenues 00:19:06.200 |
and here's the costs, and the cost is equal to the number of people we employ multiplied 00:19:11.000 |
by their salaries, blah, blah, blah, blah, blah. 00:19:13.840 |
And so inside that kind of Excel model, there are certain cells, there are certain inputs 00:19:19.880 |
where you're like oh, that thing's kind of stochastic, or that thing is kind of uncertain 00:19:28.000 |
And so that's kind of what I do then is I then say okay, we need a predictive model 00:19:32.600 |
for how likely somebody is to stay if we change their salary, how likely they are to leave 00:19:41.320 |
with their current salary, how likely they are to leave next year if I increase their salary now, and so forth. 00:19:50.000 |
So you kind of build a bunch of these models and then you combine them together with simple 00:19:55.480 |
business logic and then you can optimize that. 00:19:59.280 |
You can then say okay, if I pay Jeremy Howard half a million dollars, that's probably a 00:20:05.000 |
really good idea, and if I pay him less then it's probably not, or whatever. 00:20:12.960 |
So it's really shocking to me how few people do this. 00:20:18.240 |
Most people in industry measure their models using like AUC or RMSE or whatever, which is not the thing the business actually cares about. 00:20:38.280 |
I wanted to stress a point that you made before. 00:20:41.600 |
In my experience, a lot of the work is actually defining the problem, right? 00:20:47.200 |
So you are in a company, you're talking to somebody that doesn't have this modeling mentality. 00:20:51.840 |
They don't know that you have to have X and Y and so on. 00:20:57.320 |
What exactly do you want and try to go through a few iterations of understanding what they 00:21:02.080 |
want and then you know the data, you know what the data is. 00:21:04.760 |
You know what you can actually measure, which often isn't exactly what they want. 00:21:08.420 |
So you have to kind of get a proxy for what they want. 00:21:11.160 |
And so a lot of what you do is not that much about the models themselves; some people do actually 00:21:15.960 |
just work on really good models, but a lot of people also work on this question of 00:21:21.560 |
how do you frame this as a classification, regression, or some other type of modeling problem. 00:21:27.360 |
That's actually kind of the most interesting part, I think, and also kind of what you have to do: 00:21:37.840 |
understand the technical model building deeply, but also understand the kind of strategic context. 00:21:50.600 |
And as I say, I actually think there aren't many articles I wrote in 2012 that I'm still recommending, 00:22:00.100 |
but this one I think is still equally valid today. 00:22:06.920 |
So yeah, so like another great example is lead prioritization, right? 00:22:10.760 |
So like a lot of companies, like every one of these boxes I'm showing, you can generally 00:22:16.200 |
find a company or many companies whose sole job in life is to build models of that thing. 00:22:24.120 |
So there are lots of companies that sell lead prioritization systems. 00:22:28.800 |
But again, the question is, how would we use that information? 00:22:35.920 |
So if it's like, oh, our best lead is Jeremy, he's our highest probability of buying, does 00:22:43.480 |
that mean I should send a salesperson out to Jeremy or I shouldn't? 00:22:48.120 |
Like if he's highly probable to buy, why waste my time with him? 00:22:54.340 |
So again, it's like you really want some kind of simulation that says what's the likely 00:23:00.160 |
change in Jeremy's behavior if I send my best salesperson, Yannette, out to go and encourage him to buy. 00:23:11.040 |
So yeah, I think there are many, many opportunities for data scientists in the world today to 00:23:20.040 |
move beyond predictive modeling to actually bringing it all together with the kind of 00:23:26.240 |
stuff that Dina was talking about in her question. 00:23:29.400 |
So as well as these horizontal applications that basically apply to every company, there's 00:23:37.520 |
a whole bunch of applications that are specific to every part of the world. 00:23:41.840 |
So for those of you that end up in healthcare, some of you will become experts in one or more of these areas, like readmission risk. 00:23:53.440 |
So what's the probability that this patient is going to come back to the hospital? 00:23:58.400 |
And readmission is, depending on the details of the jurisdiction and so forth, it can be 00:24:05.920 |
a disaster for hospitals when somebody is readmitted. 00:24:10.540 |
So if you find out that this patient has a high probability of readmission, what do you 00:24:17.360 |
Well again, the predictive model is helpful of itself, it rather suggests that we just 00:24:22.440 |
shouldn't send them home yet because they're going to come back. 00:24:26.680 |
But wouldn't it be nice if we had the tree interpreter and it said to us the reason that 00:24:31.780 |
they're at high risk is because we don't have a recent EKG for them, and without a recent 00:24:38.880 |
EKG we can't have a high confidence about their cardiac health. 00:24:45.200 |
In which case it wouldn't be like, let's keep them in the hospital for two weeks, it would be like, let's give them an EKG. 00:24:51.520 |
So there's this interaction between interpretation and predictive accuracy. 00:24:57.240 |
What I'm understanding you saying is that the predictive models are a really great starting 00:25:06.560 |
point, but in order to actually answer these questions we really need to focus on the interpretability of these models. 00:25:13.520 |
Yeah, I think so, and more specifically I'm saying we just learned a whole raft of random 00:25:20.960 |
forest interpretation techniques and so I'm trying to justify why, and so the reason why 00:25:31.280 |
is because actually maybe most of the time the interpretation is the thing we care about, 00:25:37.800 |
and you can create a chart or a table without machine learning, and indeed that's how most of the world works. 00:25:50.320 |
Most managers build all kinds of tables and charts without any machine learning behind 00:25:56.920 |
But they often make terrible decisions because they don't know the feature importance of 00:26:01.600 |
the objective they're interested in, so the table they create is of things that actually 00:26:05.080 |
are the least important things anyway, or they just do a univariate chart rather than 00:26:11.000 |
a partial dependence plot so they don't actually realize that the relationship they thought 00:26:15.520 |
they're looking at is due entirely to something else. 00:26:20.000 |
So I'm kind of arguing for data scientists getting much more deeply involved in strategy 00:26:28.720 |
and in trying to use machine learning to really help a business with all of its objectives. 00:26:40.640 |
There are companies like Dunnhumby, a huge company that does nothing but retail applications 00:26:46.080 |
of machine learning, and so I believe there's a Dunnhumby product you can buy which will 00:26:55.440 |
help you figure out if I put my new store in this location versus that location, how is that going to affect revenue. 00:27:04.480 |
Or if I put my diapers in this part of the shop versus that part of the shop, how's that 00:27:10.520 |
going to impact purchasing behavior or whatever? 00:27:14.600 |
So I think it's also good to realize that the subset of machine learning applications 00:27:20.560 |
you tend to hear about in the tech press or whatever is this massively biased tiny subset 00:27:30.360 |
of stuff which Google and Facebook do, whereas the vast majority of stuff that actually makes 00:27:36.560 |
the world go round is these kinds of applications that actually help people make things, buy things, sell things. 00:27:52.560 |
So about tree interpretation, the way we looked at the tree was we manually checked which 00:27:59.200 |
feature was more important for particular observation, but for businesses they would 00:28:07.040 |
have huge amount of data and they want this interpretation for a lot of observations. 00:28:15.920 |
I don't think the automation is at all difficult, you can run any of these algorithms like looping 00:28:22.720 |
through the rows or running them in parallel, it's all just code. Am I misunderstanding your question? 00:28:29.560 |
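To make the "it's all just code" point concrete, here is a sketch of tree-interpreter-style per-row feature contributions computed for every row at once, using only scikit-learn internals. This is an illustrative implementation of the Saabas decomposition (each split's contribution is the change in node value along the row's path), not a library API:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

def tree_contributions(dtree, X):
    """Per-row feature contributions for one fitted regression tree."""
    t = dtree.tree_
    node_value = t.value[:, 0, 0]          # mean target at each node
    paths = dtree.decision_path(X)         # sparse CSR: rows x tree nodes
    contrib = np.zeros((X.shape[0], X.shape[1]))
    for i in range(X.shape[0]):
        nodes = paths.indices[paths.indptr[i]:paths.indptr[i + 1]]
        for parent, child in zip(nodes[:-1], nodes[1:]):
            contrib[i, t.feature[parent]] += node_value[child] - node_value[parent]
    # Prediction decomposes as root value (bias) + sum of contributions.
    return node_value[0] + contrib.sum(axis=1), contrib

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Average over trees: contributions for every row, no manual one-at-a-time work.
preds, contribs = zip(*(tree_contributions(t, X) for t in rf.estimators_))
forest_contrib = np.mean(contribs, axis=0)
```

Averaging the per-tree decompositions recovers the forest's prediction row by row, so the same interpretation shown for one observation scales to the whole dataset.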
Is it like they set a threshold, so that if some feature is above it, something different happens for different people? 00:28:38.960 |
This is a really important issue actually, the vast majority of machine learning models 00:28:50.240 |
don't automate anything, they're designed to provide information to humans. 00:28:55.400 |
So for example, if you're a point of sales customer service phone operator for an insurance 00:29:05.000 |
company and your customer asks you why is my renewal $500 more expensive than last time, 00:29:12.760 |
then hopefully the insurance company provides in your terminal a little screen that shows 00:29:18.640 |
the result of the tree interpreter or whatever so that you can jump there and tell the customer, 00:29:23.880 |
okay, well here's last year, you're in this different zip code which has lower amounts 00:29:31.040 |
of car theft and this year also you've actually changed your vehicle to a more expensive one 00:29:39.560 |
So it's not so much about thresholds and automation, but about making these model outputs available 00:29:49.000 |
to the decision makers in an organization, whether they be at the top strategic level 00:29:53.800 |
of like, are we going to shut down this whole product or not, all the way to the operational 00:30:00.160 |
level of that individual discussion with a customer. 00:30:06.760 |
So another example is aircraft scheduling and gate management, there's lots of companies 00:30:13.360 |
that do that, and basically what happens is that there are people at an airport whose 00:30:25.280 |
job it is to basically tell each aircraft what gate to go to, to figure out when to close the gates, and so forth. 00:28:33.560 |
So the idea is you're giving them software which has the information they need to make 00:30:39.640 |
So the machine learning models end up embedded in that software, if I'm going to say, okay, 00:30:45.720 |
that plane that's currently coming in from Miami, there's a 48% chance that it's going 00:30:50.960 |
to be over 5 minutes late, and if it does, then this is going to be the knock-on impact 00:30:56.360 |
through the rest of the terminal, for instance. 00:30:59.040 |
So that's kind of how these things tend to fit together. 00:31:03.160 |
So there's so many of these, there's lots and lots, and so I don't expect you to remember 00:31:10.080 |
all these applications, but what I do want you to do is spend some time thinking about 00:31:15.720 |
them, like sit down with one of your friends and talk about a few examples, like okay, how 00:31:22.160 |
would we go about doing failure analysis and manufacturing, who would be doing that, why 00:31:29.560 |
would they be doing it, what kind of models might they use, what kind of data might they 00:31:34.000 |
Start to kind of practice this and get a sense, because then as you're interviewing and then 00:31:40.280 |
when you're at the workplace and you're talking to managers, you want to be straight away 00:31:47.360 |
able to kind of recognize that the person you're talking to, what do they try to achieve, 00:31:51.760 |
what are the levers that they have to pull, what are the data they have available to pull 00:31:56.120 |
those levers to achieve that thing, and therefore, how could we build models to help them do 00:32:01.120 |
that and what kind of predictions would they have to be making. 00:32:05.360 |
And so then you can have this really thoughtful, empathetic conversation with those people 00:32:10.440 |
saying like hey, in order to reduce the number of customers that are leaving, I guess you're 00:32:17.680 |
trying to figure out who should you be providing better pricing to or whatever, and so forth. 00:32:30.760 |
So what I'm noticing from your beautiful little chart above is that a lot of this, to me at 00:32:37.120 |
least, still seems like the primary purpose, at least at a base level, is predictive power. 00:32:45.520 |
And so I guess my thing is for explanatory problems, a lot of the ones that people are 00:32:51.120 |
faced with in social sciences, is that something machine learning can be used for or is used 00:32:55.520 |
for or is that not really the realm that it is? 00:33:00.040 |
That's a great question, and I've had a lot of conversations about this with people in 00:33:05.920 |
social sciences, and currently machine learning is not well applied in economics or psychology. 00:33:15.960 |
But I'm convinced it can be, for the exact reasons we're talking about. 00:33:19.720 |
So if you're trying to figure out, if you're trying to do some kind of behavioral economics 00:33:24.340 |
and you're trying to understand why some people behave differently to other people, a random 00:33:29.400 |
forest with a feature importance plot would be a great way to start. 00:33:34.080 |
Or more interestingly, if you're trying to do some kind of sociology experiment or analysis 00:33:41.120 |
based on a large social network dataset where you have an observational study, you really 00:33:46.440 |
want to try and pull out all of the sources of exogenous variables, all the stuff that's 00:33:55.760 |
And so if you use a partial dependence plot with a random forest, that happens automatically. 00:34:00.320 |
So I actually gave a talk at MIT a couple of years ago for the first conference on digital 00:34:08.340 |
experimentation which was really talking about how do we experiment in things like social networks. 00:34:18.120 |
Economists all do things with classic statistical tests, but in this case the economists I talked 00:34:39.000 |
to were absolutely fascinated by this and they actually asked me to give an introduction 00:34:44.760 |
to machine learning session at MIT to these various faculty and graduate folks in the economics area. 00:34:53.480 |
And some of those folks have gone on to write some pretty famous books and stuff, so hopefully it's starting to have an impact. 00:35:00.040 |
So it's definitely early days, but it's a big opportunity. 00:35:05.200 |
But as Yannette says, there's plenty of skepticism still out there. 00:35:18.680 |
Well the skepticism comes from unfamiliarity basically with this totally different approach. 00:35:28.800 |
So if you've spent 20 years studying econometrics and somebody comes along and says, "Here's 00:35:37.600 |
a totally different approach to all the stuff that econometricians do," naturally your first reaction is to be skeptical. 00:35:55.520 |
Over time the next generation of people who are growing up with machine learning, some 00:36:00.400 |
of them will move into the social sciences, they'll make huge impacts that nobody's ever 00:36:04.720 |
managed to make before, and people will start going, "Wow!" 00:36:08.560 |
Just like happened in computer vision: people spent a long time 00:36:14.320 |
saying, "Hey, maybe you should use deep learning for computer vision," and everybody said no. 00:36:20.880 |
We have decades of work on amazing feature detectors for computer vision, and then finally 00:36:28.080 |
in 2012 Hinton and Krizhevsky came along and said, "Okay, our model is twice as good as 00:36:34.880 |
yours and we've only just started on this," and everybody was like, "Oh, okay, that's 00:36:43.280 |
Nowadays every computer vision researcher basically uses deep learning. 00:36:49.040 |
I think that time will come in this area too. 00:36:56.080 |
I think what we might do then is take a break and we're going to come back and talk about 00:37:03.540 |
these random forest interpretation techniques and do a bit of a review. 00:37:18.080 |
So let's have a go at talking about these different random forest interpretation methods, 00:37:33.340 |
So let's start with confidence based on tree variance. 00:37:40.640 |
So can one of you tell me one or more of the following things about confidence based on 00:37:50.700 |
Why would we be interested in that and how is it calculated? 00:37:56.700 |
This is going back a ways because it was the first one we looked at. 00:38:01.480 |
Even if you're not sure or you only know a little piece of it, give us your piece and 00:38:13.440 |
It's getting the variance of our predictions from random forests. 00:38:24.600 |
I think it's, if I'm remembering correctly, I think it's just the overall prediction. 00:38:31.320 |
The variance of the predictions of the trees, yes. 00:38:34.320 |
So normally the prediction is just the average, this is the variance of the trees. 00:38:38.840 |
So it kind of just gives you an idea of how much your prediction is going to vary. 00:38:42.960 |
So maybe you want to minimize variance, maybe that's your goal for whatever reason that 00:38:48.400 |
That's not so much the reason, so I like your calculation description. 00:38:51.840 |
Let's see if somebody else can tell us how you might use that. 00:39:08.760 |
So I remember that we talked about kind of the independence of the trees and so maybe 00:39:17.720 |
something about if the variance of the trees is higher or lower than. 00:39:25.640 |
That's not so much that, that's an interesting question but it's not what we're going to 00:39:34.960 |
So to remind you, just to fill in a detail here, what we generally do here is we take 00:39:39.920 |
just one row, like one observation often, and find out how confident we are about that, 00:39:47.760 |
like how much variance there are in the trees for that, or we can do it as we did here for 00:39:55.600 |
So according to me the idea is like for each row we calculate the standard deviation that 00:40:01.360 |
we get from the random forest model, and then maybe group according to different variables 00:40:08.240 |
or predictors, and see for which particular predictor the standard deviation is high, 00:40:13.600 |
and then go deep down as why it is happening, maybe it is because a particular category 00:40:19.800 |
of that variable has very few observations. 00:40:24.240 |
So that would be one approach is kind of what we've done here is to say like is there any 00:40:34.280 |
Something that I think is even more important would be when you're using this operationally, 00:40:42.000 |
let's say you're doing a credit decisioning algorithm. 00:40:46.340 |
So we're trying to say is Jeremy a good risk or a bad risk? 00:40:54.020 |
And the random forest says, I think he's a good risk, but I'm not at all confident. 00:41:02.880 |
In which case we might say maybe I shouldn't give him a million dollars, or else if the 00:41:07.400 |
random forest said, I think he's a good risk, I am very sure of that, then we're much more 00:41:16.920 |
And I'm a very good risk, so feel free to give me a million dollars. 00:41:21.840 |
I checked the random forest beforehand, in a different notebook that's not in the repo. 00:41:28.920 |
So it's quite hard for me to give you folks direct experience with this kind of single 00:41:39.080 |
observation interpretation stuff, because it's really like the kind of stuff that you 00:41:46.720 |
actually need to be putting out to the front line. 00:41:49.640 |
It's not something which you can really use so much in a kind of Kaggle context, but it's 00:41:54.760 |
more like if you're actually putting out some algorithm which is making big decisions that 00:42:02.400 |
could cost a lot of money, you probably don't so much care about the average prediction 00:42:08.760 |
of the random forest, but maybe you actually care about like the average minus a couple 00:42:13.880 |
of standard deviations, like what's the kind of worst-case prediction. 00:42:20.720 |
And so as Chika mentioned, it's like maybe there's a whole group that we're kind of 00:42:36.040 |
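The calculation just described can be sketched in a few lines with scikit-learn (a minimal sketch on synthetic data; the variable names are mine, not the lesson notebook's):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Confidence based on tree variance: the forest's prediction is the mean
# of the individual trees' predictions; the spread across trees for a
# given row is our confidence measure for that row.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=40, random_state=0).fit(X, y)

# One prediction per tree for every row: shape (n_trees, n_rows)
all_preds = np.stack([t.predict(X) for t in rf.estimators_])

mean_pred = all_preds.mean(axis=0)  # this is what rf.predict(X) returns
std_pred = all_preds.std(axis=0)    # per-row "how sure are we?"

# The row the forest is least confident about, e.g. for a credit decision
least_confident = int(std_pred.argmax())
```

For an operational use like the credit example, you might act on `mean_pred - 2 * std_pred` as a worst-case prediction rather than the mean alone.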
Alright, who wants to have a go at answering feature importance? 00:42:42.280 |
How do we calculate it or any subset thereof? 00:42:46.200 |
I think it's basically to find out which features are important for your model. 00:42:55.760 |
So you take each feature and you randomly sample all the values in the feature and you 00:43:01.720 |
see how the predictions are, if it's very different. 00:43:04.840 |
It means that that feature was actually important, else if it's fine to take any random values 00:43:09.460 |
for that feature, it means that maybe probably it's not very important. 00:43:19.560 |
There were some details that maybe were skimmed over a little bit. 00:43:23.040 |
I wonder if anybody else wants to jump into like a more detailed description of how it's 00:43:27.920 |
calculated because I know this morning some people were not quite sure. 00:43:33.280 |
Is there anybody who's not quite sure maybe who wants to like have a go or want to just 00:43:40.240 |
How exactly do we calculate feature importance for a particular feature? 00:43:44.400 |
I think after you're done building the random forest model, you take each column and randomly 00:43:49.400 |
shuffle it and generate a prediction and check the validation score. 00:43:54.120 |
If it gets pretty bad after shuffling one of the columns, that means that column 00:43:54.120 |
I'm not exactly sure how we quantify the feature importance. 00:44:09.960 |
Dina, do you know how we quantify the feature importance? 00:44:16.000 |
I think we took the difference in the R squared. 00:44:22.760 |
So let's say we've got our dependent variable which is price, right, and there's a bunch 00:44:26.860 |
of independent variables including year made, right. 00:44:30.640 |
And so basically we use the whole lot to build a random forest, right, and then that gives 00:44:44.400 |
And so then we can compare that to get R^2, RMSE, whatever you're interested in from the 00:44:55.720 |
Now the key thing here is I don't want to have to retrain my whole random forest. 00:45:03.960 |
So using the existing random forest, how can I figure out how important year made was, 00:45:11.440 |
And so the suggestion was, let's randomly shuffle the whole column, right. 00:45:18.080 |
It's got the same mean, same distribution, everything about it is the same, but there's 00:45:22.640 |
no connection at all between a particular row's actual year made and what's now in that column. 00:45:30.520 |
And so now I put that new version through with the same random forest, so there's no 00:45:40.120 |
retraining done, okay, to get some new Y hat, I call it Y hat, YM, right. 00:45:48.960 |
And then I can compare that to my actuals to get like an RMSE, YM, right. 00:45:57.480 |
And so now I can start to create a little table. 00:46:04.600 |
So now I can create a little table where I've basically got like the original here, RMSE, 00:46:11.600 |
and then I've got with year made scrambled, so this one had an RMSE of like 3, this one 00:46:18.240 |
had an RMSE of like 2, enclosure, scrambling that had an RMSE of like 2.5, right. 00:46:29.860 |
So I'd say year made, the importance is 1, 3 - 2, enclosure is 0.5, 3 - 2.5, and so forth. 00:46:40.280 |
So how much worse did my model get after I shuffled that variable? 00:46:51.280 |
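The whole procedure fits in a short loop. Here is an illustrative sketch on synthetic data (not the sklearn internals; column 0 is built to be the strongest predictor):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Permutation feature importance by hand: shuffle one column at a time,
# re-predict with the SAME fitted forest (no retraining), and record how
# much the score drops.
rng = np.random.RandomState(0)
X = rng.randn(500, 3)
y = 5 * X[:, 0] + 0.5 * X[:, 1] + rng.randn(500) * 0.1

rf = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)
base = r2_score(y, rf.predict(X))

importances = []
for col in range(X.shape[1]):
    X_shuf = X.copy()
    rng.shuffle(X_shuf[:, col])  # destroy this column's link to y
    importances.append(base - r2_score(y, rf.predict(X_shuf)))
```

Shuffling column 0 should produce by far the biggest score drop, mirroring the "year made" row in the little table above.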
"I assume you just chose those numbers randomly, but my question I guess is, what units are those in, given it's 00:47:01.960 |
theoretically not a perfect model to start out with? Like, will all the importances add up to something in particular?" 00:47:09.040 |
No, honestly I've never actually looked at what the units are, so I'm actually not quite sure. 00:47:15.480 |
Sorry, we can check it out during the week if somebody's interested. 00:47:19.320 |
Have a look at this SKLearn code and see exactly what those units of measure are because I've 00:47:29.160 |
Although I don't check like the units of measure specifically, what I do check is the relative 00:47:36.720 |
And so like here's an example, so rather than just saying like what are the top 10, yesterday 00:47:41.880 |
one of the practicum students asked me about a feature importance where they said like 00:47:51.440 |
And I pointed out that the top one was a thousand times more important than the second one. 00:47:59.920 |
And so in that case it's like no don't look at the top 3, look at the one that's a thousand 00:48:04.080 |
times more important and ignore all the rest. 00:48:07.240 |
And so this is where sometimes the kind of your natural tendency to want to be like precise 00:48:12.800 |
and careful, you need to override that and be very practical. 00:48:16.560 |
Like okay this thing's a thousand times more important, don't spend any time on anything 00:48:21.040 |
So then you can go and talk to the manager of your project and say like okay this thing's 00:48:27.200 |
And then they might say oh that was a mistake, it shouldn't have been in there, we don't 00:48:32.440 |
actually have that information at the decision time. 00:48:37.440 |
For whatever reason we can't actually use that variable and so then you could remove 00:48:42.600 |
Or they might say gosh I had no idea that was by far more important than everything 00:48:49.200 |
So let's forget this random forest thing and just focus on understanding how we can better 00:48:55.880 |
collect that one variable and better use that one variable. 00:48:59.220 |
So that's like something which comes up quite a lot. 00:49:03.840 |
And actually another place that came up just yesterday, again another practicum student 00:49:07.720 |
asked me hey I'm doing this medical diagnostics project and my R^2 is 0.95 for a disease which 00:49:23.400 |
Is this random forest a genius or is something going wrong? 00:49:27.080 |
And I said like remember the second thing you do after you build a random forest is 00:49:32.960 |
So do feature importance and what you'll probably find is that the top column is something that 00:49:41.360 |
He came back to me half an hour later, he said yeah I did the feature importance, you 00:49:44.800 |
were right, the top column was basically something that was another encoding of the dependent 00:49:50.440 |
variable, I've removed it, and now my R^2 is -0.1 so that's an improvement. 00:50:04.280 |
The other thing I like to look at is this chart, is to basically say where do things 00:50:10.160 |
flatten off in terms of which ones should I be really focusing on. 00:50:17.600 |
And so when I did credit scoring in telecommunications, I found there were 9 variables that basically 00:50:23.640 |
predicted very accurately who was going to end up paying for their phone and who wasn't. 00:50:29.240 |
And apart from ending up with a model that saved them $3 billion a year in fraud and 00:50:35.100 |
credit costs, it also let them basically rejig their process so that they focused on collecting those nine variables well. 00:50:53.060 |
This is an interesting one, very important, but in some ways kind of tricky to think about. 00:51:02.520 |
So from my understanding of what partial dependence is is that there's not always necessarily 00:51:07.760 |
like a relationship between strictly the dependent variable and this independent variable that 00:51:13.360 |
necessarily is showing importance, but rather an interaction between two variables that 00:51:24.440 |
You might expect this to be kind of flat, but it has this weird hockey stick shape. 00:51:29.360 |
And so for this example, what we found was that it's not necessarily year made or when 00:51:35.320 |
the sale happened, but it's actually the age of the equipment when it was sold. 00:51:38.160 |
And so that is easier to tell a company, well obviously your younger models are going to 00:51:45.080 |
sell for more, and it's less about when the year was made. 00:51:50.200 |
So let's come back to how we calculate this in a moment, but the first thing to realize 00:51:54.720 |
is that the vast majority of the time post your course here, when somebody shows you 00:52:03.240 |
They'll just grab the data from the database and they'll plot x against y, and then managers 00:52:11.240 |
So it'll be like, oh there's this drop-off here, so we should stop dealing in equipment 00:52:20.720 |
And this is a big problem because real-world data has lots of these interactions going 00:52:28.720 |
on, so maybe there was a recession going on around the time those things were being sold 00:52:35.240 |
or maybe around that time people were buying more of a different type of equipment or whatever. 00:52:41.120 |
So generally what we actually want to know is all other things being equal, what's the 00:52:46.040 |
relationship between year made and sale price. 00:52:50.440 |
Because if you think about the drivetrain approach idea of the levers, you really want 00:52:56.020 |
a model that says if I change this lever, how will it change my objective? 00:53:03.880 |
And so it's by pulling them apart using partial dependence that you can say, okay, actually 00:53:10.640 |
this is the relationship between year made and sale price, all other things being equal. 00:53:20.360 |
So for the variable year made, for example, you're going to train, you keep every other 00:53:27.960 |
variable constant and then you're going to pass every single value of the year made and 00:53:33.480 |
So for every model you're going to have the light blue for the values of it and the median 00:53:51.480 |
By leave everything else constant, what she means is leave them at whatever they are in 00:53:56.520 |
So just like when we did feature importance, we're going to leave the rest of the dataset 00:54:01.120 |
as it is, and we're going to do partial dependence plot for year made. 00:54:05.480 |
So you've got all of these other rows of data that will just leave as they are. 00:54:11.000 |
And so instead of randomly shuffling year made, instead what we're going to do is replace 00:54:18.720 |
every single value with exactly the same thing, 1960. 00:54:29.360 |
And just like before, we now pass that through our existing random forest, which we have 00:54:33.680 |
not retrained or changed in any way, to get back out a set of predictions. 00:54:44.560 |
And so then we can plot that on a chart, year made against partial dependence, 1960 here. 00:54:57.680 |
Now we can do it for 1961, 2, 3, 4, 5, and so forth. 00:55:03.520 |
And so we can do that on average for all of them, or we could do it just for one of them. 00:55:14.040 |
And so when we do it for just one of them and we change its year made and pass that 00:55:18.100 |
single thing through our model, that gives us one of these blue lines. 00:55:23.000 |
So each one of these blue lines is a single row as we change its year made from 1960 up 00:55:34.000 |
And so then we can just take the median of all of those blue lines to say, on average, 00:55:40.720 |
what's the relationship between year made and price, all other things being equal. 00:55:52.800 |
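Here is a minimal sketch of that loop on synthetic data (scikit-learn also ships this as `sklearn.inspection.partial_dependence`, but doing it by hand shows the mechanics):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Partial dependence by hand: for each candidate "year made", overwrite
# that one column for EVERY row, predict with the unchanged forest, and
# average. All other columns stay at whatever they really were.
rng = np.random.RandomState(0)
year = rng.randint(1960, 2000, 400).astype(float)
other = rng.randn(400)
price = 0.05 * (year - 1960) + other + rng.randn(400) * 0.1

X = np.column_stack([year, other])
rf = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, price)

grid = np.arange(1960, 2000, 5.0)
pdp = []
for v in grid:
    X_mod = X.copy()
    X_mod[:, 0] = v  # pretend every machine was made in year v
    pdp.append(rf.predict(X_mod).mean())
# Each individual row's curve across the grid would be one of the blue
# lines; pdp is the average line through all of them.
```

Because the synthetic relationship is increasing in year, the plotted `pdp` values rise from left to right, all other things held at their actual values.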
Why is it that this process tells us the relationship between year made and price, all other things 00:55:59.280 |
Well, maybe it's good to think about a really simplified approach. 00:56:03.640 |
A really simplified approach would say, what's the average auction? 00:56:11.280 |
What's the most common type of machine we sell? 00:56:17.600 |
We could come up with a single row that represents the average auction, and then we could say, 00:56:23.120 |
let's run that row through the random forest, replace its year made with 1960, and then do 00:56:29.480 |
it again with 1961, and then do it again with 1962, and we could plot those on our little 00:56:37.640 |
And that would give us a version of the relationship between year made and sale price, all other 00:56:48.520 |
But what if tractors looked like that and backhoe loaders looked like that? 00:57:04.200 |
Then taking the average one would hide the fact that there are these totally different 00:57:11.120 |
So instead we basically say, our data tells us what kinds of things we tend to sell, and 00:57:18.800 |
who we tend to sell them to, and when we tend to sell them, so let's use that. 00:57:22.740 |
So then we actually find out, for every blue line, here are actual examples of these relationships. 00:57:33.020 |
And so then what we can do, as well as plotting the median, is we can do a cluster analysis 00:57:43.960 |
And so we may find, in this case, they all look like pretty much different versions of the same shape. 00:57:53.040 |
So my main takeaway from this would be that the relationship between sale price and year 00:58:03.800 |
And remember, this was log of sale price, so this is actually showing us an exponential. 00:58:10.940 |
And so this is where I would then bring in the domain expertise, which is like, okay, 00:58:17.800 |
things depreciate over time by a constant ratio, so therefore I would expect older stuff 00:58:30.900 |
So this is where I kind of mentioned the very start of my machine learning project, I generally 00:58:37.880 |
try to avoid using as much domain expertise as I can and let the data do the talking. 00:58:44.200 |
So one of the questions I got this morning was, if there's like a sale ID, a model ID, 00:58:49.400 |
I should throw those away because they're just IDs. 00:58:52.720 |
No, don't assume anything about your data, leave them in, and if they turn out to be 00:58:58.980 |
super important predictors, you want to find out why is that. 00:59:04.240 |
But then, now I'm at the other end of my project, I've done my feature importance, I've pulled 00:59:09.380 |
out the stuff which is from that dendrogram, the kind of redundant features, I'm looking 00:59:15.200 |
at the partial dependence, and now I'm thinking, okay, is this shape what I expected? 00:59:22.040 |
So even better, before you plot this, first of all think, what shape would I expect this 00:59:29.040 |
It's always easy to justify to yourself after the fact, I knew it would look like this. 00:59:33.260 |
So what shape do you expect, and then is it that shape? 00:59:35.680 |
So in this case, I'd be like, yeah, this is what I would expect, whereas this is definitely 00:59:45.820 |
So the partial dependence plot has really pulled out the underlying truth. 00:59:50.840 |
So does anybody have any questions about why we use partial dependence or how we calculate 00:59:59.820 |
If there are 20 features that are important, then I will do the partial dependence for all 01:00:27.240 |
of them, where important means it's a lever I can actually pull, the magnitude of its 01:00:36.640 |
size is not much smaller than the other 19, based on all of these things, it's a feature 01:00:43.120 |
I ought to care about, then I will want to know how it's related. 01:00:48.480 |
It's pretty unusual to have that many features that are important both operationally and 01:00:55.400 |
from a modeling point of view, in my experience. 01:01:04.840 |
So important means it's a lever, so it's something I can change, and it's like, you know, kind 01:01:27.360 |
Or maybe it's not a lever directly, maybe it's like zip code, and I can't actually tell my 01:01:34.940 |
customers where to live, but I could focus my new marketing attention on a different 01:01:43.520 |
Would it make sense to do pairwise shuffling for every combination of two features and 01:01:52.160 |
hold everything else constant, like in feature importance, to see interactions and compare 01:02:00.560 |
So you wouldn't do that so much for partial dependence. 01:02:04.960 |
I think your question is really getting to the question of could we do that for feature 01:02:13.600 |
So I think interaction feature importance is a very important and interesting question. 01:02:21.360 |
But doing it by randomly shuffling every pair of columns, if you've got 100 columns, sounds 01:02:30.160 |
computationally intensive, possibly infeasible. 01:02:34.040 |
So what I'm going to do is after we talk about TreeInterpreter, I'll talk about an interesting 01:02:39.280 |
but largely unexplored approach that will probably work. 01:02:56.600 |
I was thinking this to be more like feature importance, but feature importance is for 01:03:04.120 |
complete random forest model, and this TreeInterpreter is for feature importance for particular 01:03:10.920 |
So if that, let's say it's about hospital readmission, so if a patient A1 is going to 01:03:18.640 |
be readmitted to a hospital, which feature for that particular patient is going to impact? 01:03:27.440 |
And it is calculated starting from the prediction of mean, then seeing how each feature is changing 01:03:37.600 |
I'm smiling because that was one of the best examples of technical communication I've heard 01:03:44.800 |
So it's really good to think about why was that effective. 01:03:49.280 |
So what Prince did there was he used as specific an example as possible. 01:03:57.040 |
Humans are much less good at understanding abstractions. 01:03:59.840 |
So rather than saying abstractly, "it takes some kind of feature and then there's an observation in that feature," 01:04:05.720 |
he said, no, it's a hospital readmission for a particular patient. 01:04:13.120 |
The other thing he did that was very effective was to take an analogy to something we already 01:04:18.960 |
So we already understand the idea of feature importance across all of the rows in a dataset. 01:04:26.000 |
So now we're going to do it for a single row. 01:04:29.760 |
So one of the things I was really hoping we would learn from this experience is how to 01:04:38.640 |
So that was a really great role model from Prince of using all of the tricks we have 01:04:44.760 |
at our disposal for effective technical communication. 01:04:47.980 |
So hopefully you found that a useful explanation. 01:04:50.360 |
I don't have a hell of a lot to add to that other than to show you what that looks like. 01:04:56.960 |
So with the tree interpreter, we picked out a row. 01:05:03.360 |
And so remember when we talked about the confidence intervals at the very start, the confidence 01:05:10.880 |
based on tree variance, we mainly said you'd probably use that for a row. 01:05:17.560 |
So it's like, okay, why is this patient likely to be readmitted? 01:05:23.840 |
So here is all of the information we have about that patient, or in this case this auction. 01:05:35.180 |
So then we call tree interpreter dot predict, and we get back the prediction of the price, 01:05:44.640 |
So this is just the average price for everybody, so this is always going to be the same. 01:05:49.600 |
And then the contributions, which is how important is each of these things. 01:05:59.040 |
And so the way we calculated that was to say at the very start, the average price was 10, 01:06:21.800 |
And for those with this enclosure, the average was 9.5. 01:06:29.460 |
And then we split on year made, I don't know, less than 1990. 01:06:33.960 |
And for those with that year made, the average price was 9.7. 01:06:40.440 |
And then we split on the number of hours on the meter, and for this branch we got 9.4. 01:06:51.680 |
And so we then have a particular option, which we pass it through the tree, and it just so 01:07:01.500 |
So one row can only have one path through the tree. 01:07:17.760 |
And so as we go through, we start at the top, and we start with 10. 01:07:24.280 |
And we said enclosure resulted in a change from 10 to 9.5, minus 0.5. 01:07:33.120 |
Year made changed it from 9.5 to 9.7, so plus 0.2. 01:07:39.360 |
And then meter changed it from 9.7 down to 9.4, which is minus 0.3. 01:07:51.360 |
And then if we add all that together, 10 minus 0.5 is 9.5, plus 0.2 is 9.7, minus 0.3 is 9.4. 01:08:01.080 |
Lo and behold, that's that number, which takes us to our Excel spreadsheet. 01:08:22.080 |
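That bookkeeping can be written directly against sklearn's tree internals for a single tree (a sketch; the `treeinterpreter` package used in the lesson does the same thing averaged across all the trees of a forest):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Tree-interpreter-style contributions for one row of one tree: start
# from the root mean (the bias) and credit each step's change in node
# mean to the feature that was split on. bias + contributions = prediction.
rng = np.random.RandomState(0)
X = rng.randn(300, 3)
y = 2 * X[:, 0] - X[:, 1] + rng.randn(300) * 0.1
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

row = X[:1]
path = tree.decision_path(row).indices  # nodes visited, root first
node_mean = tree.tree_.value[:, 0, 0]   # mean of y at each node
bias = node_mean[0]                     # root node = average of all of y

contributions = np.zeros(X.shape[1])
for parent, child in zip(path[:-1], path[1:]):
    feat = tree.tree_.feature[parent]   # the feature split on at `parent`
    contributions[feat] += node_mean[child] - node_mean[parent]
```

Exactly as in the worked example: the telescoping sum of per-split changes recovers the leaf's prediction.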
So last week we had to use Excel for this because there isn't a good Python library 01:08:31.400 |
And so we saw we got our starting point, this is the bias, and then we had each of our contributions 01:08:40.040 |
The world is now a better place because Chris has created a Python waterfall chart module 01:08:44.680 |
for us and put it on pip, so never again will we have to use Excel for this. 01:08:50.300 |
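The arithmetic behind a waterfall chart is just a running total. Using the numbers from the worked example above (bias 10, then enclosure -0.5, year made +0.2, meter -0.3):

```python
import numpy as np

# Waterfall chart bookkeeping: each bar starts where the running total
# left off, so the bars "walk" from the bias down to the prediction.
bias = 10.0
contributions = np.array([-0.5, 0.2, -0.3])  # enclosure, year made, meter

tops = bias + np.cumsum(contributions)         # level after each step
bottoms = np.concatenate([[bias], tops[:-1]])  # where each bar starts
final_prediction = tops[-1]                    # 9.4, the row's prediction
```

Any waterfall library (including the one mentioned here) is essentially drawing rectangles between these `bottoms` and `tops`.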
And I wanted to point out that waterfall charts have been very important in business communications 01:08:57.040 |
at least as long as I've been in business, so that's about 25 years. 01:09:04.080 |
Python is a couple of decades old, maybe a little bit less. 01:09:04.080 |
But despite that, no one in the Python world ever got to the point where they actually packaged up a waterfall chart. 01:09:10.520 |
So they didn't exist until two days ago, which is to say the world is full of stuff which 01:09:27.420 |
ought to exist and doesn't, and doesn't necessarily take a lot of time to build. 01:09:32.600 |
Chris, how long did it take you to build the first Python waterfall chart? 01:09:39.240 |
Well, there was a gist of it, so it took a fair amount of time, but nothing unreasonable. 01:09:54.600 |
And now forevermore, people when they want the Python waterfall chart will end up at 01:09:59.440 |
Chris's GitHub repo and hopefully find lots of other USF contributors who have made it 01:10:08.000 |
So in order for you to help improve Chris's Python waterfall, you need to know how to 01:10:15.720 |
And so you're going to need to submit a pull request. 01:10:20.580 |
Life becomes very easy for submitting pull requests if you use something called hub. 01:10:24.680 |
So if you go to github.com/github/hub, that will send you over here. 01:10:33.240 |
And what they suggest you do is that you alias git to hub, because it turns out that hub is a superset of git. 01:10:41.560 |
But what it lets you do is you can go git fork, git push, git pull-request, and you've sent off a pull request. 01:10:54.280 |
Without hub, this is actually a pain and requires going to the website and filling in forms. 01:11:00.080 |
So this gives you no reason not to do pull requests. 01:11:03.840 |
And I mention this because when you're interviewing for a job or whatever, I can promise you that 01:11:10.080 |
the person you're talking to will check your GitHub. 01:11:13.200 |
And if they see you have a history of submitting thoughtful pull requests that are accepted 01:11:21.000 |
It looks great because it shows you're somebody who actually contributes. 01:11:24.640 |
It also shows that if they're being accepted, that you know how to create code that fits 01:11:28.960 |
with people's coding standards, has appropriate documentation, passes their tests and coverage, 01:11:36.040 |
So when people look at you and they say, "Oh, here's to somebody with a history of successfully 01:11:41.120 |
contributing accepted pull requests to open-source libraries," that's a great part of your portfolio. 01:11:50.560 |
So either I'm the person who built Python Waterfall, here is my repo, or I'm the person 01:11:58.880 |
who contributed currency number formatting to Python Waterfall, here's my pull request. 01:12:07.020 |
Any time you see something that doesn't work right in any open-source software you use 01:12:13.880 |
It's a great opportunity because you can fix it and send in the pull request. 01:12:19.240 |
So give it a go, it actually feels great the first time you have a pull request accepted. 01:12:24.280 |
And of course, one big opportunity is the FastAI library. 01:12:30.440 |
Was the person here the person who added all the docs to FastAI structured in the other 01:12:37.640 |
So thanks to one of our students, we now have doc strings for most of the fastai.structured 01:12:42.940 |
library and that again came via a pull request. 01:12:52.160 |
Does anybody have any questions about how to calculate any of these random forest interpretation 01:12:59.680 |
methods or why we might want to use any of these random forest interpretation methods? 01:13:05.440 |
Towards the end of the week, you're going to need to be able to build all of these yourself 01:13:12.800 |
Just looking at the tree interpreter, I noticed that some of the values are NaNs. 01:13:29.360 |
I get why you keep them in the tree, but how can a NaN have a feature importance? 01:13:42.080 |
So, in other words, how is NaN handled in pandas and therefore in the tree? 01:13:54.480 |
Anybody remember how pandas (notice these are all categorical variables) handles 01:13:58.600 |
NaNs in categorical variables, and how fastai deals with them? 01:14:04.700 |
Can somebody pass it to the person who's talking? 01:14:10.600 |
Pandas sets them to -1 as the category code, and do you remember what we then do? 01:14:18.780 |
We add 1 to all of the category codes, so it ends up being 0. 01:14:22.960 |
So in other words, we have a category with, remember by the time it hits the random forest 01:14:26.960 |
it's just a number, and it's just a number 0. 01:14:30.960 |
And we map it back to the descriptions back here. 01:14:33.920 |
So the question really is, why shouldn't the random forest be able to split on 0? 01:14:42.520 |
So it could be NaN, high, medium, or low: 0, 1, 2, 3. 01:14:47.000 |
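In code, the handling being described looks like this (the +1 shift is what fastai's preprocessing, e.g. `proc_df`, applies):

```python
import pandas as pd

# Pandas gives missing values in a categorical the code -1; shifting all
# codes up by 1 (as fastai's preprocessing does) turns NaN into class 0,
# a perfectly ordinary value the random forest can split on.
s = pd.Series(["high", "low", None, "medium"], dtype="category")
codes = s.cat.codes  # NaN -> -1; categories -> 0, 1, 2 (alphabetical)
shifted = codes + 1  # NaN -> 0; everything is now non-negative
```

By the time this hits the random forest it's just the number 0, which the trees can treat as its own informative category.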
And so missing values are one of these things that are generally taught really badly, like 01:14:54.680 |
often people get taught like here are some ways to remove columns with missing values 01:14:58.740 |
or remove rows with missing values or to replace missing values. 01:15:04.000 |
That's never what we want, because missingness is very very very often interesting. 01:15:10.960 |
And so we actually learned from our feature importance that coupler system NaN is like 01:15:19.920 |
And so for some reason, I could guess, right? 01:15:25.040 |
Coupler system NaN presumably means this is the kind of industrial equipment that doesn't 01:15:31.440 |
I don't know what kind that is, but apparently it's a more expensive kind. 01:15:40.400 |
So I did this competition for university grant research success where by far the most important 01:15:51.480 |
predictors were whether or not some of the fields were null, and it turned out that this 01:15:58.080 |
was data leakage, that these fields only got filled in most of the time after a research 01:16:05.800 |
So it allowed me to win that Kaggle competition, but it didn't actually help the university 01:16:17.080 |
So let's talk about extrapolation, and I am going to do something risky and dangerous, 01:16:32.760 |
And the reason we're going to do some live coding is I want to explore extrapolation 01:16:38.320 |
together with you, and I kind of also want to help give you a feel of how you might go 01:16:46.000 |
about writing code quickly in this notebook environment. 01:16:52.160 |
And this is the kind of stuff that you're going to need to be able to do in the real 01:16:55.360 |
world, and in the exam is quickly create the kind of code that we're going to talk about. 01:17:00.200 |
So I really like creating synthetic data sets anytime I'm trying to investigate the behavior 01:17:07.440 |
of something, because if I have a synthetic data set, I know how it should behave. 01:17:12.980 |
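As a taste of why synthetic data is so useful here: with y = 2x we know exactly the right answer, and we can see directly that a random forest can't extrapolate beyond its training range (a sketch of the idea, not the live-coded notebook):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# A forest only ever predicts averages of training targets it has seen,
# so outside the training range its predictions flatline.
x = np.linspace(0, 10, 100)
y = 2 * x
rf = RandomForestRegressor(n_estimators=20, random_state=0).fit(x[:, None], y)

inside = rf.predict(np.array([[5.0]]))[0]   # within range: close to the true 10
outside = rf.predict(np.array([[20.0]]))[0] # true answer is 40, but the
                                            # forest is stuck near max(y) = 20
```

Because we built the data ourselves, the failure mode is unambiguous: `outside` sits near 20 instead of the true 40.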
Which reminds me, before we do this, I promised that we would talk about interaction importance, 01:17:25.400 |
Tree interpreter tells us the contributions for a particular row based on the difference 01:17:35.980 |
We could calculate that for every row in our data set and add them up, and that would tell 01:17:50.760 |
One way of doing feature importance is by shuffling the columns one at a time, and another 01:17:55.560 |
way is by doing tree interpreter for every row and adding them up. 01:18:02.040 |
Neither is more right than the others, they're actually both quite widely used. 01:18:06.120 |
So this is type 1 and type 2 feature importance. 01:18:11.800 |
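As a sketch of the shuffle-based version (type 1), here's a minimal example; the data set, coefficients, and model sizes are all made up for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# toy data: y depends strongly on column 0, weakly on column 1, not at all on column 2
rng = np.random.RandomState(0)
X = rng.uniform(size=(500, 3))
y = 3 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=500)

m = RandomForestRegressor(n_estimators=40, random_state=0).fit(X, y)
base = m.score(X, y)

# shuffle one column at a time and measure the drop in R^2
importances = []
for col in range(X.shape[1]):
    Xs = X.copy()
    Xs[:, col] = rng.permutation(Xs[:, col])
    importances.append(base - m.score(Xs, y))
```

Column 0 should come out with by far the largest importance. The tree interpreter version (type 2) would instead sum each column's contributions over every row.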
So we could try to expand this a little bit, to do not just single variable feature importance, but interaction feature importance as well. 01:18:29.360 |
Now here's the thing, what I'm going to describe is very easy to describe. 01:18:36.160 |
It was described by Breiman right back when random forests were first invented, and it 01:18:40.880 |
is part of the commercial software product from Salford Systems who have the trademark 01:18:47.080 |
on random forests, but it is not part of any open source library I'm aware of. 01:18:54.720 |
And I've never seen an academic paper that actually studies it closely. 01:18:58.880 |
So what I'm going to describe here is a huge opportunity, but there's also lots and lots of details to get right. 01:19:16.360 |
This particular difference here is not just because of year made, but because of a combination 01:19:32.400 |
of year made and enclosure. The fact that this is 9.7 is because enclosure was in this branch and year made was in this 01:19:39.440 |
branch too. So in other words, we could say the contribution of enclosure interacted with year made is 9.7. 01:20:03.240 |
Well that's an interaction of year made and hours on the meter. 01:20:08.620 |
So year made interacted with meter; I'm using star here not to mean times, but to mean interacted with. 01:20:15.680 |
It's kind of a common way of doing things, like R's formulas do it this way as well. 01:20:20.240 |
Year made interacted with meter has a contribution of -0.1. 01:20:34.280 |
Perhaps we could also say from here to here that this also shows an interaction between 01:20:41.840 |
meter and enclosure, with one thing in between them. 01:20:46.040 |
So maybe we could say meter by enclosure equals, and then what should it be? 01:21:02.200 |
In some ways that kind of seems unfair because we're also including the impact of year made. 01:21:18.760 |
And these are like details that I actually don't know the answer to. 01:21:23.680 |
How should we best assign a contribution to each pair of variables in this path? 01:21:36.880 |
The pairs of variables in that path all represent interactions. 01:21:42.000 |
Yes? Could you pass the microphone to Chris, please? 01:21:47.840 |
Why don't you force them to be next to each other in the tree? 01:21:55.320 |
I'm not going to say it's the wrong approach. 01:21:58.600 |
I don't think it's the right approach, though, because it feels like this path here, meter 01:22:08.800 |
to enclosure, still represents an interaction even though something sits between them. So it seems like not recognizing that contribution is throwing away information, but I'm not sure. 01:22:20.040 |
I had one of my staff at Kaggle actually do some R&D on this a few years ago, and I wasn't 01:22:27.360 |
close enough to know how they dealt with these details. 01:22:29.440 |
They got it working pretty well, but unfortunately it never saw the light of day as a software product. 01:22:34.680 |
But this is something which maybe a group of you could get together and build -- I mean, 01:22:42.760 |
do some Googling to check, but I really don't think that there are any interaction feature importance implementations in any open source library. 01:22:56.800 |
Wouldn't this exclude interactions between variables that don't matter until they interact? 01:23:03.640 |
So say your row never chooses to split down that path, but that variable interacting with 01:23:09.760 |
another one becomes your most important split. 01:23:13.920 |
I don't think that happens, because if there's an interaction that's important only because 01:23:18.520 |
it's an interaction and not on a univariate basis, it will appear sometimes assuming that 01:23:24.840 |
you set max_features to less than 1, and so therefore it will appear in some splits. 01:23:37.360 |
An interaction appears on the same path through a tree: in this case, there's 01:23:49.420 |
an interaction between enclosure and year made, because we branch on enclosure and then we branch 01:23:55.320 |
on year made. So to get to here we have to have some specific value of enclosure and some specific value of year made. 01:24:05.200 |
My brain is kind of working on this right now. 01:24:07.480 |
What if you went down the middle leaves between the two things you were trying to observe 01:24:12.840 |
and you would also take into account what the final measure is? 01:24:17.320 |
So if we extend the tree downwards you'd have many measures, both of the two things you're 01:24:22.120 |
trying to look at and also the in between steps. 01:24:27.000 |
There seems to be a way to like average information out in between them. 01:24:30.960 |
So I think what we should do is talk about this on the forum. 01:24:32.920 |
I think this is fascinating and I hope we build something great, but I need to do my 01:24:39.600 |
live coding so let's -- yeah, that was a great discussion. 01:24:50.120 |
And so to experiment with that you almost certainly want to create a synthetic data 01:24:56.960 |
It's like y = x1 + x2 + x1 * x2 or something, like something where you know there's this 01:25:05.120 |
interaction effect and there isn't that interaction effect and then you want to make sure that 01:25:09.920 |
the feature importance you get at the end is what you expected. 01:25:14.400 |
And so probably the first step would be to do single variable feature importance using the tree interpreter approach. 01:25:26.800 |
And one nice thing about this is it doesn't really matter how much data you have, all 01:25:33.680 |
you have to do to calculate feature importance is just slide through the tree. 01:25:37.600 |
So you should be able to write in a way that's actually pretty fast, and so even writing 01:25:41.800 |
it in pure Python might be fast enough depending on your tree size. 01:25:52.640 |
And so the first thing I want to do is create a synthetic data set that has a simple linear relationship. 01:25:57.600 |
We're going to pretend it's like a time series. 01:26:01.760 |
So we need to basically create some x values. 01:26:04.920 |
So the easiest way to kind of create some synthetic data of this type is to use np.linspace, 01:26:12.000 |
which creates some evenly spaced data between start and stop, with by default 50 observations. 01:26:32.480 |
And so then we're going to create a dependent variable. 01:26:35.440 |
And so let's assume there's just a linear relationship between x and y, and let's add a little randomness to it. 01:26:47.080 |
So uniform random between low and high, so we could add somewhere between -0.2 and 0.2. 01:27:02.680 |
And so the next thing we need is a shape, which is basically what dimensions do you want this array of random numbers to be. 01:27:13.920 |
And obviously we want them to be the same shape as x's shape. 01:27:24.080 |
Remember, when you see something in parentheses with a comma, that's a tuple with just one 01:27:33.560 |
So this is of shape 50, and so we've added 50 random numbers. 01:27:57.360 |
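Putting those steps together, a minimal sketch (the 0-to-1 range and the plus-or-minus 0.2 noise are just the values used here):

```python
import numpy as np

x = np.linspace(0, 1)                          # 50 evenly spaced points; num defaults to 50
y = x + np.random.uniform(-0.2, 0.2, x.shape)  # linear relationship plus uniform noise

x.shape, y.shape  # both (50,): rank-1 arrays
```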
So whether you're working as a data scientist or doing your exams in this course, you need 01:28:04.800 |
to be able to quickly whip up a data set like that and throw it up on a plot without thinking too much. 01:28:11.800 |
And as you can see, you don't have to really remember much, if anything, you just have 01:28:17.000 |
to know how to hit shift+tab to check the names of the parameters, and everything in 01:28:23.440 |
the exam will be open on the internet, so you can always Google for something to try 01:28:28.760 |
and find linspace if you've forgotten what it's called. 01:28:36.800 |
And so we're now going to build a random forest model, and what I want to do is build a random 01:28:43.800 |
forest model that kind of acts as if this is a time series. 01:28:47.720 |
So I'm going to take this as a training set, I'm going to take this as our validation or 01:28:56.640 |
test set, just like we did in groceries or bulldozers or whatever. 01:29:05.680 |
So we can use exactly the same kind of code that we used in split_vals. 01:29:09.500 |
So we can basically say x_trn, x_val = x[:40], x[40:]. 01:29:30.000 |
So that just splits it into the first 40 versus the last 10. 01:29:35.060 |
And so we can do the same thing for y, and there we go. 01:29:47.160 |
So the next thing to do is we want to create a random forest and fit it, and that's going to be the usual one-liner. 01:30:00.920 |
Now that's actually going to give an error, and the reason why is that it expects x to 01:30:05.880 |
be a matrix, not a vector, because it expects x to have a number of columns of data. 01:30:14.280 |
So it's important to know that a matrix with one column is not the same thing as a vector. 01:30:22.120 |
So if I try to run this, I get "Expected 2D array, got 1D array instead". 01:30:31.800 |
So we need to convert our 1D array into a 2D array. 01:30:46.320 |
So the problem is that x's rank is 1. 01:30:52.240 |
The rank of a variable is equal to the length of its shape. 01:31:01.600 |
So a vector we can think of as an array of rank 1, a matrix is an array of rank 2. 01:31:11.240 |
I very rarely use words like vector and matrix because they're kind of meaningless, specific 01:31:20.520 |
examples of something more general, which is that they're all n-dimensional tensors, or n-dimensional 01:31:29.320 |
arrays. So an n-dimensional array, we can say, is a tensor of rank n; they basically mean kind of the same thing. 01:31:39.280 |
Physicists get upset when you say that because to a physicist a tensor has quite a specific 01:31:42.960 |
meaning, but in machine learning we generally use it to mean the same thing as an n-dimensional array. 01:31:48.760 |
So how do we turn a 1-dimensional array into a 2-dimensional array? 01:31:58.000 |
There's a couple of ways we can do it, but basically we slice it. 01:32:02.160 |
So colon means give me everything in that axis; colon, none means give me everything 01:32:12.000 |
in the first axis, which is the only axis we have, and then none is a special indexer meaning: add a new unit axis here. 01:32:22.480 |
So let me show you, that is of shape 50, 1, so it's of rank 2, it has two axes. 01:32:34.160 |
One of them is a very boring axis, it's a length 1 axis, so let's move this over here. 01:32:42.720 |
There's 1, 50, and then to remind you the original is just 50. 01:32:51.760 |
So you can see I can put none as a special indexer to introduce a new unit axis there. 01:33:00.840 |
So this thing has 1 row and 50 columns, this thing has 50 rows and 1 column. 01:33:10.400 |
So that's what we want, we want 50 rows and 1 column. 01:33:15.640 |
This kind of playing around with ranks and dimensions is going to become increasingly 01:33:22.580 |
important in this course and in the deep learning course. 01:33:27.280 |
So I spend a lot of time slicing with none, slicing with other things, trying to create 01:33:32.760 |
3-dimensional, 4-dimensional tensors and so forth. 01:33:35.480 |
I'll show you a trick, I'll show you two tricks. 01:33:37.200 |
The first is you never ever need to write comma, colon, it's always assumed. 01:33:42.680 |
So if I delete that, this is exactly the same thing. 01:33:48.000 |
And you'll see that in code all the time, so you need to recognize it. 01:33:51.920 |
The second trick is, this is adding an axis in the second dimension, or I guess the index-1 dimension. 01:34:00.520 |
What if I always want to put it in the last dimension? 01:34:04.040 |
And often our tensors change dimensions without us looking, because you went from a 1-channel 01:34:10.520 |
image to a 3-channel image, or you went from a single image to a mini-batch of images. 01:34:15.360 |
Like suddenly you get new dimensions appearing. 01:34:18.240 |
So to make things general, I would say this, dot dot dot. 01:34:22.800 |
Dot dot dot means as many dimensions as you need to fill this up. 01:34:28.040 |
And so in this case it's exactly the same thing, but I would always try to write it 01:34:32.440 |
that way because it means it's going to continue to work as I get higher dimensional tensors. 01:34:38.640 |
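Here are those indexing tricks side by side on the same 50-point array:

```python
import numpy as np

x = np.linspace(0, 1)  # shape (50,): rank 1

a = x[:, None]    # (50, 1): 50 rows, 1 column -- the shape sklearn wants
b = x[None, :]    # (1, 50): new unit axis first
c = x[None]       # (1, 50): the trailing ", :" is implied
d = x[..., None]  # (50, 1): "..." fills in however many axes already exist
```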
So in this case, I want 50 rows in one column, so I'll call that say x_1. 01:34:50.040 |
And so this is now a 2D array, and so I can create my random forest. 01:35:02.200 |
So then I can plot that, and this is where you're going to have to turn your brains on, 01:35:09.400 |
because the folks this morning got this very quickly, which was super impressive. 01:35:13.880 |
I'm going to plot y_train against m.predict(x_train). 01:35:22.640 |
When I hit go, what is this going to look like? 01:35:37.380 |
Our predictions hopefully are the same as the actuals, so this should fall on a line. 01:35:42.240 |
But there's some randomness, so I should have used scatterplot. 01:36:24.760 |
It's like, hey, we're extrapolating to the validation set. 01:36:28.120 |
That's what I'd like it to look like, but that's not what it is going to look like. 01:36:37.920 |
Think about what trees do, and think about the fact that we have a validation set here 01:36:50.120 |
So think about a forest is just a bunch of trees, so the first tree is going to have 01:37:06.760 |
>> Yeah, that's what it does, but let's think about how it groups the dots. 01:37:13.540 |
>> I'm guessing since all the new data is actually outside of the original scope, it's 01:37:24.440 |
Forget the forest, let's create one tree, so we're probably going to split somewhere 01:37:28.160 |
around here first, and then we're going to probably split somewhere around here, and 01:37:32.840 |
then we're going to split somewhere around here, and somewhere around here. 01:37:40.720 |
So our prediction when we say, okay, let's take this one, and so it's going to put that 01:37:48.160 |
through the forest and end up predicting this average. 01:37:54.080 |
It can't predict anything higher than that, because there is nothing higher than that in the training set. 01:37:59.840 |
So this is really important to realize, a random forest is not magic, it's just returning 01:38:05.280 |
the average of nearby observations where nearby is kind of in this tree space. 01:38:23.880 |
If you don't know how random forests work, then this is going to totally screw you. 01:38:28.640 |
If you think that it's actually going to be able to extrapolate to any kind of data it 01:38:33.400 |
hasn't seen before, particularly future time periods, it's just not. 01:38:45.440 |
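You can verify that capping behavior directly. This sketch rebuilds the synthetic series and checks that no validation prediction exceeds the largest training target, which is guaranteed, since every leaf value is an average of training targets:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

x = np.linspace(0, 1)
y = x + np.random.uniform(-0.2, 0.2, x.shape)
x_trn, x_val = x[:40], x[40:]
y_trn, y_val = y[:40], y[40:]

m = RandomForestRegressor(n_estimators=40).fit(x_trn[:, None], y_trn)
preds = m.predict(x_val[:, None])

# every leaf value is an average of training y's, so no prediction
# can ever exceed the largest y the forest saw during training
print(preds.max() <= y_trn.max())  # True
```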
So we're going to be talking about how to avoid this problem. 01:38:51.560 |
We talked a little bit in the last lesson about trying to avoid it by avoiding unnecessary time-dependent variables. 01:39:05.200 |
But in the end, if you really have a time series that looks like this, we actually have to deal with it head on. 01:39:14.160 |
So one way we could deal with the problem would be to use a neural net, use something that actually 01:39:20.060 |
has a functional form that can fit something like this. 01:39:31.040 |
Another approach would be to use all the time series techniques you guys are learning about 01:39:35.520 |
in the morning class to fit some kind of time series and then detrend it. 01:39:43.000 |
And so then you'll end up with detrended dots and then use the random forest to predict those. 01:39:49.840 |
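A minimal sketch of the detrend-then-forest idea, using a simple linear fit as the trend model (a stand-in here for whatever time series method you'd actually use):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

x = np.linspace(0, 1)
y = x + np.random.uniform(-0.2, 0.2, x.shape)
x_trn, x_val = x[:40], x[40:]
y_trn, y_val = y[:40], y[40:]

# fit a linear trend on the training period only
slope, intercept = np.polyfit(x_trn, y_trn, 1)

# the random forest now only has to model the detrended residuals
resid = y_trn - (slope * x_trn + intercept)
m = RandomForestRegressor(n_estimators=40).fit(x_trn[:, None], resid)

# prediction = extrapolated trend + forest's estimate of the residual,
# so the upward trend now carries into the validation period
preds = slope * x_val + intercept + m.predict(x_val[:, None])
```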
And that's particularly cool because imagine that your random forest was actually trying 01:39:56.880 |
to predict data that, I don't know, maybe it was two different states. 01:40:03.320 |
And so the blue ones are down here and the red ones are up here. 01:40:09.480 |
Now if you try to use a random forest, it's going to do a pretty crappy job because time dominates everything else. 01:40:16.520 |
So it's basically still going to split like this and then it's going to split like this. 01:40:21.760 |
And then finally, once it kind of gets down to this piece, it'll be like oh, okay, now I can see the difference between the two groups. 01:40:28.940 |
So in other words, when you've got this big time piece going on, you're not going to see 01:40:34.680 |
the other relationships in the random forest until every tree deals with time. 01:40:41.720 |
So one way to fix this would be with a gradient boosting machine, GBM. 01:40:47.520 |
And what a GBM does is it creates a little tree and runs everything through that first 01:40:52.680 |
little tree, which could be like the time tree, and then it calculates the residuals, 01:40:59.480 |
and then the next little tree just predicts the residuals. 01:41:05.480 |
So GBMs still can't extrapolate to the future, but at least they can deal with time-dependent data more gracefully. 01:41:15.440 |
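A two-stage sketch of that residual-fitting idea, using two small decision trees (the data and tree depths are made up for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x = np.linspace(0, 1, 200)[:, None]
y = np.sin(6 * x[:, 0]) + x[:, 0] + rng.normal(scale=0.1, size=200)

# stage 1: a little tree captures the dominant structure (e.g. the time trend)
t1 = DecisionTreeRegressor(max_depth=2).fit(x, y)
resid = y - t1.predict(x)

# stage 2: the next little tree only has to predict what stage 1 missed
t2 = DecisionTreeRegressor(max_depth=2).fit(x, resid)
pred = t1.predict(x) + t2.predict(x)

err1 = np.mean((y - t1.predict(x)) ** 2)  # first tree alone
err2 = np.mean((y - pred) ** 2)           # both stages together
```

Real GBM libraries repeat this for hundreds of trees with a learning rate, but the residual-chasing structure is the same.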
So we're going to be talking about this quite a lot more over the next couple of weeks. 01:41:18.680 |
And in the end, the solution is going to be just use neural nets. 01:41:24.120 |
But for now: use some kind of time series analysis, detrend it, and then use a random forest on the residuals. 01:41:33.280 |
And if you're playing around with something like the Ecuador groceries competition, that 01:41:37.080 |
would be a really good thing to fiddle around with.