
Intro to Machine Learning: Lesson 6


Chapters

0:00 Intro
1:50 Horizontal Applications
3:10 Data Products
3:48 Objectives
4:52 Levers
5:40 What can levers do
7:02 What data does the organization have
8:45 A model
11:23 Counterfactual
12:16 Predictive vs Optimization
13:14 How do they all work together
14:33 Churn
18:17 Lead Prioritization
20:32 Define the Problem
23:50 Readmission Risk
25:30 Interpretation
31:01 Application

Whisper Transcript

00:00:00.000 | So we've looked at a lot of different random forest interpretation techniques
00:00:10.840 | and a question that's come up a little bit on the forums is like what are these for really?
00:00:20.440 | Like how do these help me get a better score on Kaggle?
00:00:24.760 | And my answer has kind of been like they don't necessarily.
00:00:29.400 | So I want to talk more about why do we do machine learning?
00:00:36.320 | What's the point?
00:00:39.360 | And to answer this question, I'm going to put this PowerPoint in the GitHub repo so
00:00:45.320 | you can have a look.
00:00:48.320 | I want to show you something really important which is examples of how people have used
00:00:56.960 | machine learning mainly in business, because that's where most of you are probably going
00:01:02.880 | to end up after this is working for some company.
00:01:05.720 | I'm going to show you applications of machine learning which are either based on things
00:01:10.600 | that I've been personally involved in myself or know of people who are doing them directly,
00:01:15.280 | so none of these are going to be like hypotheticals, these are all actual things that people are
00:01:22.080 | doing and I've got direct or secondhand knowledge of.
00:01:27.160 | I'm going to split them into two groups, horizontal and vertical.
00:01:29.920 | So in business, horizontal means something that you do across different kinds of business,
00:01:37.920 | whereas vertical means something that you do within a business or within a supply chain
00:01:43.880 | or within a process.
00:01:45.480 | So in other words, an example of horizontal applications is everything involving marketing.
00:01:53.880 | So like every company pretty much has to try to sell more products to its customers, and
00:02:00.360 | so therefore does marketing.
00:02:01.960 | And so each of these boxes are examples of some of the things that people are using machine
00:02:07.360 | learning for in marketing.
00:02:11.100 | So let's take an example, let's take churn.
00:02:18.720 | So churn refers to a model which attempts to predict who's going to leave.
00:02:26.880 | So I've done some churn modeling fairly recently in telecommunications, and so we're trying
00:02:32.460 | to figure out for this big cell phone company which customers are going to leave.
00:02:38.880 | That is not of itself that interesting, like building a highly predictive model that says
00:02:47.280 | Jeremy Howard is almost certainly going to leave next month is probably not that helpful
00:02:53.680 | because if I'm almost certainly going to leave next month, there's probably nothing you can
00:02:57.480 | do about it.
00:02:58.480 | It's too late.
00:02:59.480 | It's going to cost you too much to keep me.
00:03:02.920 | So in order to understand why would we do churn modeling, I've got a little framework
00:03:09.520 | that you might find helpful.
00:03:11.200 | So if you Google for Jeremy Howard data products, I think I've mentioned this thing before,
00:03:19.480 | there's a paper you can find, "Designing Great Data Products," that I wrote with a couple of
00:03:24.160 | colleagues a few years ago, and in it I describe my experience of actually turning machine learning
00:03:33.660 | models into stuff that makes money.
00:03:37.880 | And the basic trick is this thing I call the drivetrain approach which is these four steps.
00:03:49.440 | The starting point to actually turn a machine learning project into something that's actually
00:03:55.000 | useful is to know what am I trying to achieve.
00:03:58.480 | And that doesn't mean I'm trying to achieve a high area under the ROC curve or I'm trying
00:04:04.420 | to achieve a large difference between classes.
00:04:08.720 | No, it would be I'm trying to sell more books or I'm trying to reduce the number of customers
00:04:16.740 | that leave next month, or I'm trying to detect lung cancer earlier.
00:04:23.880 | These are things, these are objectives.
00:04:25.480 | So the objective is something that absolutely directly is the thing that the company or
00:04:30.560 | the organization actually wants.
00:04:32.960 | No company or organization lives in order to create a more accurate predictive model,
00:04:41.240 | there's some reason.
00:04:43.780 | So that's your objective.
00:04:45.780 | Now that's obviously the most important thing.
00:04:47.320 | If you don't know the purpose of what you're modeling for, then you can't possibly do a
00:04:50.720 | good job of it.
00:04:52.880 | And hopefully people are starting to pick that up out there in the world of data science,
00:04:58.320 | but interestingly what very few people are talking about is just as important as the
00:05:02.240 | next thing, which is levers.
00:05:04.800 | A lever is a thing that the organization can do to actually drive the objective.
00:05:11.120 | So let's take the example of churn modeling.
00:05:14.560 | What is a lever that an organization could use to reduce the number of customers that
00:05:22.140 | are leaving?
00:05:26.880 | They could take a closer look at the model and do some of this random forest interpretation
00:05:32.520 | and see some of the causes that are causing people to leave and potentially change those
00:05:38.760 | issues in the company.
00:05:40.800 | So that's a data scientist's answer, but I want you to go to the next level.
00:05:44.740 | What are the things, the levers are the things they can do?
00:05:47.160 | Do you want to pass it back behind you?
00:05:48.400 | What are the things that they can do?
00:05:54.120 | They could call someone and say, "Are you happy? Is there anything we can do?"
00:06:01.120 | Yeah, so they could give them a free pen or something if they buy 20 bucks worth of product
00:06:15.200 | next month.
00:06:16.760 | You were going to say that as well?
00:06:20.220 | Okay, so you guys are giving out carrots rather than handing out sticks.
00:06:23.960 | Do you want to send it over to a couple of guys?
00:06:32.920 | Yeah, you could give them a special.
00:06:35.920 | So these are levers, right?
00:06:38.080 | And so whenever you're working as a data scientist, keep coming back and thinking, "What are we
00:06:43.440 | trying to achieve, we being the organization, and how are we trying to achieve it being
00:06:48.280 | like what are the actual things we can do to make that objective happen?"
00:06:53.440 | So building a model is never ever a lever, but it could help you with the lever.
00:07:02.920 | So then the next step is, what data does the organization have that could possibly help
00:07:09.400 | them to set that lever to achieve that objective?
00:07:14.120 | And so this is not what data did they give you when you started the project, but think
00:07:19.980 | about it from a first principles point of view.
00:07:22.440 | I'm working for a telecommunications company, they gave me some certain set of data, but
00:07:27.320 | I'm sure they must know where their customers live, how many phone calls they made last
00:07:32.520 | month, how many times they called customer service, whatever.
00:07:36.720 | And so have a think about, okay, if we're trying to decide who should we give a special
00:07:44.680 | offer to proactively, then we want to figure out what information do we have that might
00:07:52.320 | help us to identify who's going to react well or badly to that.
00:07:57.560 | Perhaps more interestingly would be what if we were doing a fraud algorithm, and so we're
00:08:04.520 | trying to figure out who's going to not pay for the phone that they take out of the store,
00:08:09.600 | they're on some 12 month payment plan, we never see them again.
00:08:13.760 | Now in that case, the data we have available, it doesn't matter what's in the database,
00:08:17.840 | what matters is what's the data that we can get when the customer is in the shop.
00:08:22.960 | So there's often constraints around the data that we can actually use.
00:08:28.880 | So we need to know what am I trying to achieve, what can this organization actually do specifically
00:08:36.580 | to change that outcome, and at the point that that decision is being made, what data do
00:08:41.840 | they have or could they collect.
00:08:45.720 | And so then the way I put that all together is with a model, and this is not a model in
00:08:50.240 | the sense of a predictive model, but it's a model in the sense of a simulation model.
00:08:55.200 | So one of the main examples I give in this paper is one I spent many years building,
00:09:00.280 | which is if an insurance company changes their prices, how does that impact their profitability?
00:09:09.140 | And so generally your simulation model contains a number of predictive models.
00:09:13.720 | So I had, for example, a predictive model called an elasticity model that said for a
00:09:18.160 | specific customer, if we charge them a specific price for a specific product, what's the probability
00:09:24.180 | that they would say yes, both when it's new business and then a year later, what's the
00:09:29.760 | probability that they'll renew.
00:09:32.240 | And then there's another predictive model which is what's the probability that they're
00:09:36.160 | going to make a claim, and how much is that claim going to be.
00:09:40.700 | And so you can combine these models together then to say if we change our pricing by reducing
00:09:45.680 | it by 10% for everybody between 18 and 25, and we can run it through these models that
00:09:51.140 | combine together into a simulation, then the overall impact on our market share in 10 years
00:09:55.620 | time is x, and our cost is y, and our profit is z, and so forth.
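The kind of simulation described here can be sketched in a few lines. This is a toy illustration, not the actual insurance model: both "predictive models" are hand-written stand-ins for fitted models, and every number, name, and threshold is invented.

```python
def p_accept(price, base_price=500.0):
    """Elasticity model stand-in: probability a customer accepts a quoted price."""
    # Cheaper than the base price -> more likely to say yes; clamp to [0, 1].
    return max(0.0, min(1.0, 1.0 - (price - base_price) / base_price))

def expected_claims(age):
    """Claims model stand-in: expected claim cost for a customer of a given age."""
    return 400.0 if age < 25 else 250.0

def simulate_profit(customers, discount_young=0.0):
    """Run the pricing lever (a discount for under-25s) through both models."""
    total = 0.0
    for age, price in customers:
        if age < 25:
            price = price * (1.0 - discount_young)
        total += p_accept(price) * (price - expected_claims(age))
    return total

customers = [(22, 550.0), (40, 500.0), (19, 600.0)]
base = simulate_profit(customers)
with_discount = simulate_profit(customers, discount_young=0.10)
```

The point is that the lever (the discount) is an input to the simulation, and the predictive models are just components inside it; you compare lever settings by their simulated outcome, not by any one model's accuracy.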
00:10:01.600 | So in practice, most of the time you really are going to care more about the results of
00:10:10.360 | that simulation than you do about the predictive model directly.
00:10:15.820 | But most people are not doing this effectively at the moment.
00:10:19.580 | So for example, when I go to Amazon, I read all of Douglas Adams' books.
00:10:27.560 | And so having read all of Douglas Adams' books, the next time I went to Amazon, they said
00:10:32.720 | would you like to buy the collected works of Douglas Adams?
00:10:37.120 | This is after I had bought every one of his books.
00:10:40.280 | So from a machine learning point of view, some data scientist had said people that buy
00:10:48.200 | one of Douglas Adams' books often go on to buy the collected works, but recommending
00:10:54.280 | to me that I buy the collected works of Douglas Adams isn't smart.
00:10:59.480 | And it's actually not smart at a number of levels.
00:11:02.660 | Not only is it unlikely that I'm going to buy a box set of something of which I have
00:11:05.680 | every one individually, but furthermore, it's not going to change my buying behavior.
00:11:10.880 | I already know about Douglas Adams, I already know I like him, so taking up your valuable
00:11:16.160 | web space to tell me hey, maybe you should buy more of the author who you're already
00:11:20.540 | familiar with and have bought lots of times isn't actually going to change my behavior.
00:11:26.680 | So what if instead of creating a predictive model, Amazon had built an optimization model
00:11:32.480 | that could simulate and said if we show Jeremy this ad, how likely is he then to go on to
00:11:39.760 | buy this book?
00:11:40.760 | And if I don't show him this ad, how likely is he to go on to buy this book?
00:11:46.680 | And so that's the counterfactual, the counterfactual is what would have happened otherwise.
00:11:51.840 | And then you can take the difference and say okay, what should we recommend him that is
00:11:57.040 | going to maximally change his behavior, so maximally result in more books.
00:12:02.560 | And so you'd probably say like oh, he's never bought any Terry Pratchett books, he probably
00:12:08.120 | doesn't know about Terry Pratchett, but lots of people that liked Douglas Adams did turn
00:12:12.760 | out to like Terry Pratchett, so let's introduce him to a new author.
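One common way to estimate this kind of counterfactual is the "two model" uplift approach: fit one model on customers who were shown the ad and one on customers who were not, then recommend where the predicted difference is largest. A minimal sketch on synthetic data, where all features, the treatment effect, and the cutoff of ten customers are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))          # customer features (made up)
shown = rng.integers(0, 2, size=n)   # 1 if the ad was shown
# Synthetic outcome: feature 0 drives buying, and the ad helps on top of that.
p = 1 / (1 + np.exp(-(X[:, 0] + 0.8 * shown)))
bought = (rng.random(n) < p).astype(int)

m_shown = RandomForestClassifier(n_estimators=50, random_state=0)
m_not = RandomForestClassifier(n_estimators=50, random_state=0)
m_shown.fit(X[shown == 1], bought[shown == 1])
m_not.fit(X[shown == 0], bought[shown == 0])

# Uplift per customer: P(buy | shown) - P(buy | not shown).
uplift = m_shown.predict_proba(X)[:, 1] - m_not.predict_proba(X)[:, 1]
best_customers = np.argsort(uplift)[::-1][:10]  # show the ad to these
```

Ranking by uplift rather than by raw purchase probability is exactly the difference being described: you target the people whose behavior the ad would change, not the people who were going to buy anyway.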
00:12:17.360 | So it's the difference between a predictive model on the one hand versus an optimization
00:12:21.460 | model on the other hand.
00:12:23.440 | So the two tend to go hand in hand, the optimization model basically is saying, well first of all
00:12:33.040 | we have a simulation model.
00:12:34.520 | The simulation model is saying in a world where we put Terry Pratchett's book on the
00:12:41.240 | front page of Amazon for Jeremy Howard, this is what would have happened.
00:12:46.320 | He would have bought it with a 94% probability.
00:12:50.640 | And so that then tells us with this lever of what do I put on my homepage for Jeremy today,
00:13:00.040 | we say okay, all the different settings of that lever that put Terry Pratchett on the
00:13:03.680 | homepage has the highest simulated outcome.
00:13:07.680 | And then that's the thing which maximizes our profit from Jeremy's visit to Amazon.com
00:13:13.400 | today.
00:13:15.740 | So generally speaking, your predictive models kind of feed in to this simulation model,
00:13:22.160 | but you've kind of got to think about how do they all work together.
00:13:25.060 | So for example, let's go back to churn.
00:13:28.500 | So it turns out that Jeremy Howard is very likely to leave his cell phone company next month.
00:13:37.140 | What are we going to do about it?
00:13:39.220 | Let's call him.
00:13:41.080 | And I can tell you, if my cell phone company calls me right now and says just calling to
00:13:45.720 | say we love you, I'd be like I'm canceling right now.
00:13:52.360 | That would be a terrible idea.
00:13:53.920 | So again, you'd want a simulation model that says what's the probability that Jeremy is
00:13:58.840 | going to change his behavior as a result of calling him right now.
00:14:02.280 | So one of the levers I have is call him.
00:14:05.400 | On the other hand, if I got a piece of mail tomorrow that said for each month you stay
00:14:11.520 | with us, we're going to give you $100,000, okay, then that's going to definitely change
00:14:16.560 | my behavior.
00:14:20.160 | But then feeding that into the simulation model, it turns out that overall that would
00:14:24.600 | be an unprofitable choice to make.
00:14:27.240 | So do you see how all this fits in together?
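A toy sketch of how those levers might feed into a simulation: score each retention action by its expected value, the retention probability under that action times an assumed customer lifetime value, minus the cost of the action, relative to doing nothing. Every probability and dollar figure here is an invented stand-in for a model output.

```python
CLV = 600.0  # assumed lifetime value of keeping this customer

# (lever, P(stay | lever), cost) -- stand-ins for simulation-model outputs
levers = [
    ("do nothing",        0.10, 0.0),
    ("courtesy call",     0.08, 5.0),        # might even annoy the customer
    ("$20 credit",        0.35, 20.0),
    ("$100k/month offer", 1.00, 100_000.0),  # works, but ruinously expensive
]

baseline = levers[0][1] * CLV  # expected value of doing nothing

def expected_value(p_stay, cost):
    """Expected gain of a lever relative to doing nothing."""
    return (p_stay * CLV - cost) - baseline

ranked = sorted(levers, key=lambda l: expected_value(l[1], l[2]), reverse=True)
best = ranked[0][0]
```

Note how the guaranteed-to-work lever loses here: it changes behavior but destroys profit, which is exactly why the objective has to be profitability, not retention for its own sake.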
00:14:34.160 | So when we look at something like churn, we want to be thinking what are the levers we
00:14:41.680 | can pull, and so what are the kind of models that we could build with what kinds of data
00:14:47.120 | to help us pull those levers better to achieve our objectives.
00:14:51.720 | And so when you think about it that way, you realize that the vast majority of these applications
00:14:57.240 | are not largely about a predictive model at all, they're about interpretation, they're
00:15:02.960 | about understanding what happens if.
00:15:07.400 | So if we kind of take the intersection between on the one hand, here are all the levers that
00:15:15.360 | we could pull, here are all the things we can do, and then here are all of the features
00:15:20.040 | from our random forest feature importance that turn out to be strong drivers of the
00:15:25.120 | outcome.
00:15:26.120 | And so then the intersection of those is here are the levers we could pull that actually
00:15:30.560 | matter because if you can't change the thing, then it's not very interesting.
00:15:37.960 | And if it's not actually a significant driver, it's not very interesting.
00:15:42.880 | So we can actually use our random forest feature importance to tell us what can we actually
00:15:48.920 | do to make a difference.
00:15:51.800 | And then we can use the partial dependence to actually build this kind of simulation
00:15:55.440 | model to say like okay, well if we did change that, what would happen?
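A minimal sketch of that intersection on synthetic data: rank features by random forest importance, then keep only the ones the business can actually change. The feature names, the lever list, and the 0.05 importance cutoff are all invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
# Synthetic outcome driven mostly by features 0 and 2.
y = 3 * X[:, 0] + 2 * X[:, 2] + rng.normal(scale=0.1, size=500)

features = ["monthly_price", "tenure", "support_calls", "zip_code"]
m = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

levers = {"monthly_price", "support_calls"}   # things we can actually change
ranked = sorted(zip(features, m.feature_importances_),
                key=lambda t: t[1], reverse=True)
# Keep only features that are both important drivers and pullable levers.
actionable = [f for f, imp in ranked if f in levers and imp > 0.05]
```

From there, a partial dependence plot over each feature in `actionable` is the natural next step to simulate "if we did change that, what would happen?"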
00:16:02.720 | So there are examples, lots and lots of these vertical examples.
00:16:08.000 | And so what I want you to kind of think about as you think about the machine learning problems
00:16:12.760 | you're working on is like why does somebody care about this?
00:16:18.640 | And what would a good answer to them look like and how could you actually positively
00:16:24.320 | impact this business?
00:16:25.440 | So if you're creating like a Kaggle kernel, try to think about from the point of view
00:16:31.800 | of the competition organizer, like what would they want to know?
00:16:36.420 | And how can you give them that information?
00:16:39.000 | So something like fraud detection, on the other hand, you probably just basically want
00:16:46.080 | to know who's fraudulent.
00:16:49.280 | So you probably do just care about the predictive model, but then you do have to think carefully
00:16:52.840 | about the data availability here.
00:16:55.000 | So it's like okay, but we need to know who's fraudulent at the point that we're about to
00:17:00.400 | deliver them a product.
00:17:02.760 | So it's no point like looking at data that's available like a month later, for instance.
00:17:07.040 | So you've kind of got this key issue of thinking about the actual operational constraints that
00:17:13.040 | you're working under.
00:17:18.080 | Lots of interesting applications in human resources, but like employee churn, it's another
00:17:23.740 | kind of churn model where finding out that Jeremy Howard's sick of lecturing, he's going
00:17:29.040 | to leave tomorrow, what are you going to do about it?
00:17:32.120 | Well knowing that wouldn't actually be helpful, it'd be too late, right?
00:17:36.000 | You would actually want a model that said what kinds of people are leaving USF?
00:17:41.920 | And it turns out that everybody that goes to the downstairs cafe leaves USF, I guess
00:17:47.280 | their food is awful, or whatever, or everybody that we're paying less than half a million
00:17:53.680 | dollars a year is leaving USF because they can't afford basic housing in San Francisco.
00:18:00.480 | So you could use your employee churn model not so much to say which employees hate us,
00:18:07.560 | but why do employees leave?
00:18:12.400 | And so again, it's really the interpretation there that matters.
00:18:20.960 | Now lead prioritization is a really interesting one, right?
00:18:23.640 | This is one where a lot of companies, yes, Dana, can you pass that over there?
00:18:29.920 | Yes, so I was just wondering, for the churn thing, you suggested one way is paying an
00:18:37.760 | employee like a million a year or something, but then it sounds like there are two things
00:18:41.440 | you need to predict, one being churn and one being the profit you need to optimize.
00:18:46.840 | So how does that work?
00:18:48.840 | Yeah, exactly.
00:18:49.840 | So this is what this simulation model is all about, so it's a great question.
00:18:54.240 | So you kind of figure out this objective we're trying to maximize which is like company profitability,
00:19:02.000 | you can kind of create a pretty simple Excel model or something that says here's the revenues
00:19:06.200 | and here's the costs, and the cost is equal to the number of people we employ multiplied
00:19:11.000 | by their salaries, blah, blah, blah, blah, blah.
00:19:13.840 | And so inside that kind of Excel model, there are certain cells, there are certain inputs
00:19:19.880 | where you're like oh, that thing's kind of stochastic, or that thing is kind of uncertain
00:19:25.880 | but we could predict it with a model.
00:19:28.000 | And so that's kind of what I do then is I then say okay, we need a predictive model
00:19:32.600 | for how likely somebody is to stay if we change their salary, how likely they are to leave
00:19:41.320 | with their current salary, how likely they are to leave next year if I increase their
00:19:48.000 | salary now, blah, blah, blah.
00:19:50.000 | So you kind of build a bunch of these models and then you combine them together with simple
00:19:55.480 | business logic and then you can optimize that.
00:19:59.280 | You can then say okay, if I pay Jeremy Howard half a million dollars, that's probably a
00:20:05.000 | really good idea, and if I pay him less then it's probably not, or whatever.
00:20:09.520 | You can figure out the overall impact.
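The optimization step described here could be sketched like this: sweep one lever (a salary raise) through a stand-in stay-probability model and pick the setting that maximizes simulated profit. The retention curve and every dollar figure are invented.

```python
import math

def p_stay(raise_frac):
    """Stand-in predictive model: bigger raises make staying more likely."""
    return 1 / (1 + math.exp(-40 * (raise_frac - 0.05)))

def profit(raise_frac, value_of_employee=400_000, salary=150_000):
    """Simulated profit: value of retaining the employee minus the new salary."""
    new_salary = salary * (1 + raise_frac)
    return p_stay(raise_frac) * value_of_employee - new_salary

# Optimize the lever by brute force over candidate settings: 0% .. 20% raise.
candidates = [i / 100 for i in range(0, 21)]
best_raise = max(candidates, key=profit)
```

The interesting outcome is that the best raise sits in the middle of the range: too small and the employee leaves anyway, too large and the retention gain no longer covers the cost, which is the "combine models with simple business logic, then optimize" pattern in miniature.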
00:20:12.960 | So it's really shocking to me how few people do this.
00:20:18.240 | Most people in industry measure their models using like AUC or RMSE or whatever, which
00:20:27.880 | is never actually what you want.
00:20:31.160 | Yes, can you pass it over here?
00:20:38.280 | I wanted to stress a point that you made before.
00:20:41.600 | In my experience, a lot of the problem was to define the problem, right?
00:20:47.200 | So you are in a company, you're talking to somebody that doesn't have like this mentality
00:20:50.840 | that you have.
00:20:51.840 | They don't know that you have to have X and Y and so on.
00:20:54.880 | So you have to try to get that out of them.
00:20:57.320 | What exactly do you want and try to go through a few iterations of understanding what they
00:21:02.080 | want and then you know the data, you know what the data is.
00:21:04.760 | You know what you can actually measure, which is often not what they want.
00:21:08.420 | So you have to kind of get a proxy for what they want.
00:21:11.160 | And so a lot of what you do is not so much building really good models, though some people
00:21:15.960 | do actually just work on that, but a lot of people also work on how you frame this as a
00:21:21.560 | classification, regression, or some other type of modeling problem.
00:21:27.360 | That's actually the most interesting part, I think, and also what you have to
00:21:32.680 | do well.
00:21:33.680 | Yeah, the best people do both.
00:21:37.840 | Understand the technical model building deeply, but also understand the kind of strategic
00:21:46.440 | context deeply.
00:21:47.760 | And so this is one way to think about it.
00:21:50.600 | And as I say, I actually think there aren't many articles I wrote in 2012 I'm still recommending,
00:22:00.100 | but this one I think is still equally valid today.
00:22:06.920 | So yeah, so like another great example is lead prioritization, right?
00:22:10.760 | So like a lot of companies, like every one of these boxes I'm showing, you can generally
00:22:16.200 | find a company or many companies whose sole job in life is to build models of that thing.
00:22:24.120 | So there are lots of companies that sell lead prioritization systems.
00:22:28.800 | But again, the question is, how would we use that information?
00:22:35.920 | So if it's like, oh, our best lead is Jeremy, he's our highest probability of buying, does
00:22:43.480 | that mean I should send a salesperson out to Jeremy or I shouldn't?
00:22:48.120 | Like if he's highly probable to buy, why waste my time with him?
00:22:54.340 | So again, it's like he really wants some kind of simulation that says what's the likely
00:23:00.160 | change in Jeremy's behavior if I send my best salesperson, Yannette, out to go and encourage
00:23:08.840 | him to sign.
00:23:11.040 | So yeah, I think there are many, many opportunities for data scientists in the world today to
00:23:20.040 | move beyond predictive modeling to actually bringing it all together with the kind of
00:23:26.240 | stuff that Dina was talking about in her question.
00:23:29.400 | So as well as these horizontal applications that basically apply to every company, there's
00:23:37.520 | a whole bunch of applications that are specific to every part of the world.
00:23:41.840 | So for those of you that end up in healthcare, some of you will become experts in one or
00:23:46.880 | more of these areas, like readmission risk.
00:23:53.440 | So what's the probability that this patient is going to come back to the hospital?
00:23:58.400 | And readmission is, depending on the details of the jurisdiction and so forth, it can be
00:24:05.920 | a disaster for hospitals when somebody is readmitted.
00:24:10.540 | So if you find out that this patient has a high probability of readmission, what do you
00:24:16.360 | do about it?
00:24:17.360 | Well again, the predictive model isn't that helpful of itself; it rather suggests that we just
00:24:22.440 | shouldn't send them home yet because they're going to come back.
00:24:26.680 | But wouldn't it be nice if we had the tree interpreter and it said to us the reason that
00:24:31.780 | they're at high risk is because we don't have a recent EKG for them, and without a recent
00:24:38.880 | EKG we can't have a high confidence about their cardiac health.
00:24:45.200 | In which case it wouldn't be like, let's keep them in the hospital for two weeks, it would
00:24:48.720 | be like let's give them an EKG.
00:24:51.520 | So this is interaction between interpretation and predictive accuracy.
00:24:57.240 | What I'm understanding you saying is that the predictive models are a really great starting
00:25:06.560 | point, but in order to actually answer these questions we really need to focus on the interpretability
00:25:12.520 | of these models.
00:25:13.520 | Yeah, I think so, and more specifically I'm saying we just learned a whole raft of random
00:25:20.960 | forest interpretation techniques and so I'm trying to justify why, and so the reason why
00:25:31.280 | is because actually maybe most of the time the interpretation is the thing we care about,
00:25:37.800 | and you can create a chart or a table without machine learning, and indeed that's how most
00:25:49.160 | of the world works.
00:25:50.320 | Most managers build all kinds of tables and charts without any machine learning behind
00:25:54.760 | them.
00:25:56.920 | But they often make terrible decisions because they don't know the feature importance of
00:26:01.600 | the objective they're interested in, so the table they create is of things that actually
00:26:05.080 | are the least important things anyway, or they just do a univariate chart rather than
00:26:11.000 | a partial dependence plot so they don't actually realize that the relationship they thought
00:26:15.520 | they're looking at is due entirely to something else.
00:26:20.000 | So I'm kind of arguing for data scientists getting much more deeply involved in strategy
00:26:28.720 | and in trying to use machine learning to really help a business with all of its objectives.
00:26:40.640 | There are companies like Dunnhumby, a huge company that does nothing but retail applications
00:26:46.080 | of machine learning, and so I believe there's a Dunnhumby product you can buy which will
00:26:55.440 | help you figure out if I put my new store in this location versus that location, how
00:27:00.800 | many people are going to shop there?
00:27:04.480 | Or if I put my diapers in this part of the shop versus that part of the shop, how's that
00:27:10.520 | going to impact purchasing behavior or whatever?
00:27:14.600 | So I think it's also good to realize that the subset of machine learning applications
00:27:20.560 | you tend to hear about in the tech press or whatever is this massively biased tiny subset
00:27:30.360 | of stuff which Google and Facebook do, whereas the vast majority of stuff that actually makes
00:27:36.560 | the world go round is these kinds of applications that actually help people make things, buy
00:27:43.280 | things, sell things, build things, so forth.
00:27:52.560 | So about tree interpretation, the way we looked at the tree was we manually checked which
00:27:59.200 | feature was more important for a particular observation, but businesses would
00:28:07.040 | have a huge amount of data and want this interpretation for a lot of observations.
00:28:13.000 | So how do they automate it?
00:28:15.920 | I don't think the automation is at all difficult, you can run any of these algorithms like looping
00:28:22.720 | through the rows or throwing them in parallel, it's all just code, am I misunderstanding
00:28:28.480 | your question?
00:28:29.560 | Is it like they set a threshold that if some feature is above, like for different people
00:28:37.440 | will have different behavior?
00:28:38.960 | This is a really important issue actually, the vast majority of machine learning models
00:28:50.240 | don't automate anything, they're designed to provide information to humans.
00:28:55.400 | So for example, if you're a point of sales customer service phone operator for an insurance
00:29:05.000 | company and your customer asks you why is my renewal $500 more expensive than last time,
00:29:12.760 | then hopefully the insurance company provides in your terminal a little screen that shows
00:29:18.640 | the result of the tree interpreter or whatever so that you can jump there and tell the customer,
00:29:23.880 | okay, well here's last year, you're in this different zip code which has lower amounts
00:29:31.040 | of car theft and this year also you've actually changed your vehicle to a more expensive one
00:29:37.760 | or whatever.
00:29:39.560 | So it's not so much about thresholds and automation, but about making these model outputs available
00:29:49.000 | to the decision makers in an organization, whether they be at the top strategic level
00:29:53.800 | of like, are we going to shut down this whole product or not, all the way to the operational
00:30:00.160 | level of that individual discussion with a customer.
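For a single sklearn decision tree, the tree-interpreter idea behind that kind of screen reduces to walking the path a row takes and attributing each change in node mean to the feature split at that node (a full treeinterpreter-style implementation averages this across the whole forest). A hand-rolled sketch on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

def contributions(tree, row):
    """Split the tree's prediction for one row into per-feature contributions."""
    t = tree.tree_
    node = 0
    contrib = np.zeros(len(row))
    bias = t.value[0][0][0]                      # mean of y at the root
    while t.children_left[node] != -1:           # -1 marks a leaf
        feat = t.feature[node]
        nxt = (t.children_left[node] if row[feat] <= t.threshold[node]
               else t.children_right[node])
        # Attribute the change in node mean to the feature split on here.
        contrib[feat] += t.value[nxt][0][0] - t.value[node][0][0]
        node = nxt
    return bias, contrib

bias, contrib = contributions(tree, X[0])
pred = tree.predict(X[:1])[0]   # bias + contributions reconstructs this exactly
```

Because the per-row contributions sum exactly to the prediction, they can be surfaced verbatim in an operator's terminal as "here's why this quote changed", which is the sort of decision support being described.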
00:30:06.760 | So another example is aircraft scheduling and gate management, there's lots of companies
00:30:13.360 | that do that, and basically what happens is that there are people at an airport whose
00:30:25.280 | job it is to basically tell each aircraft what gate to go to to figure out when to close
00:30:31.880 | the doors, stuff like that.
00:30:33.560 | So the idea is you're giving them software which has the information they need to make
00:30:38.640 | good decisions.
00:30:39.640 | So the machine learning models end up embedded in that software, if I'm going to say, okay,
00:30:45.720 | that plane that's currently coming in from Miami, there's a 48% chance that it's going
00:30:50.960 | to be over 5 minutes late, and if it does, then this is going to be the knock-on impact
00:30:56.360 | through the rest of the terminal, for instance.
00:30:59.040 | So that's kind of how these things tend to fit together.
00:31:03.160 | So there's so many of these, there's lots and lots, and so I don't expect you to remember
00:31:10.080 | all these applications, but what I do want you to do is spend some time thinking about
00:31:15.720 | them, like sit down with one of your friends and talk about a few examples, like okay, how
00:31:22.160 | would we go about doing failure analysis and manufacturing, who would be doing that, why
00:31:29.560 | would they be doing it, what kind of models might they use, what kind of data might they use.
00:31:34.000 | So start to kind of practice this and get a sense, because then as you're interviewing and then
00:31:40.280 | when you're at the workplace and you're talking to managers, you want to be straight away
00:31:47.360 | able to kind of recognize that the person you're talking to, what do they try to achieve,
00:31:51.760 | what are the levers that they have to pull, what are the data they have available to pull
00:31:56.120 | those levers to achieve that thing, and therefore, how could we build models to help them do
00:32:01.120 | that and what kind of predictions would they have to be making.
00:32:05.360 | And so then you can have this really thoughtful, empathetic conversation with those people
00:32:10.440 | saying like hey, in order to reduce the number of customers that are leaving, I guess you're
00:32:17.680 | trying to figure out who should you be providing better pricing to or whatever, and so forth.
00:32:30.760 | So what I'm noticing from your beautiful little chart above is that a lot of this, to me at
00:32:37.120 | least, still seems like the primary purpose is at least base level, is predictive power.
00:32:45.520 | And so I guess my thing is for explanatory problems, a lot of the ones that people are
00:32:51.120 | faced with in social sciences, is that something machine learning can be used for or is used
00:32:55.520 | for or is that not really the realm that it is?
00:33:00.040 | That's a great question, and I've had a lot of conversations about this with people in
00:33:05.920 | social sciences, and currently machine learning is not well applied in economics or psychology
00:33:13.100 | or whatever on the whole.
00:33:15.960 | But I'm convinced it can be, for the exact reasons we're talking about.
00:33:19.720 | So if you're trying to figure out, if you're trying to do some kind of behavioral economics
00:33:24.340 | and you're trying to understand why some people behave differently to other people, a random
00:33:29.400 | forest with a feature importance plot would be a great way to start.
00:33:34.080 | Or more interestingly, if you're trying to do some kind of sociology experiment or analysis
00:33:41.120 | based on a large social network dataset where you have an observational study, you really
00:33:46.440 | want to try and pull out all of the sources of exogenous variables, all the stuff that's
00:33:53.880 | going on outside.
00:33:55.760 | And so if you use a partial dependence plot with a random forest, that happens automatically.
00:34:00.320 | So I actually gave a talk at MIT a couple of years ago for the first conference on digital
00:34:08.340 | experimentation which was really talking about how do we experiment in things like social
00:34:14.200 | networks and these digital environments.
00:34:18.120 | Economists all do things with classic statistical tests, but in this case the economists I talked
00:34:39.000 | to were absolutely fascinated by this and they actually asked me to give an introduction
00:34:44.760 | to machine learning session at MIT to these various faculty and graduate folks in the
00:34:51.560 | economics department.
00:34:53.480 | And some of those folks have gone on to write some pretty famous books and stuff, so hopefully
00:34:59.040 | it's been useful.
00:35:00.040 | So it's definitely early days, but it's a big opportunity.
00:35:05.200 | But as Yannette says, there's plenty of skepticism still out there.
00:35:18.680 | Well the skepticism comes from unfamiliarity basically with this totally different approach.
00:35:28.800 | So if you've spent 20 years studying econometrics and somebody comes along and says, "Here's
00:35:37.600 | a totally different approach to all the stuff that econometricians do," naturally your first
00:35:44.440 | reaction will be like, "Prove it!"
00:35:49.200 | So that's fair enough.
00:35:55.520 | Over time the next generation of people who are growing up with machine learning, some
00:36:00.400 | of them will move into the social sciences, they'll make huge impacts that nobody's ever
00:36:04.720 | managed to make before, and people will start going, "Wow!"
00:36:08.560 | Just like happened in computer vision, when computer vision spent a long time of people
00:36:14.320 | saying like, "Hey, maybe you should use deep learning for computer vision," and everybody
00:36:18.120 | in computer vision is like, "Prove it!"
00:36:20.880 | We have decades of work on amazing feature detectors for computer vision, and then finally
00:36:28.080 | in 2012 Hinton and Krizhevsky came along and said, "Okay, our model is twice as good as
00:36:34.880 | yours and we've only just started on this," and everybody was like, "Oh, okay, that's
00:36:42.280 | pretty convincing."
00:36:43.280 | Nowadays every computer vision researcher basically uses deep learning.
00:36:49.040 | I think that time will come in this area too.
00:36:56.080 | I think what we might do then is take a break and we're going to come back and talk about
00:37:03.540 | these random forest interpretation techniques and do a bit of a review.
00:37:09.080 | So let's come back at 2 o'clock.
00:37:18.080 | So let's have a go at talking about these different random forest interpretation methods,
00:37:24.480 | having talked about why they're important.
00:37:26.920 | So let's now remind ourselves what they are.
00:37:29.360 | So I'm going to let you folks have a go.
00:37:33.340 | So let's start with confidence based on tree variance.
00:37:40.640 | So can one of you tell me one or more of the following things about confidence based on
00:37:46.160 | tree variance?
00:37:48.640 | What does it tell us?
00:37:50.700 | Why would we be interested in that and how is it calculated?
00:37:56.700 | This is going back a ways because it was the first one we looked at.
00:38:01.480 | Even if you're not sure or you only know a little piece of it, give us your piece and
00:38:05.200 | we'll build on it together.
00:38:06.360 | I think I got a piece of it.
00:38:13.440 | It's getting the variance of our predictions from random forests.
00:38:19.880 | That's true.
00:38:20.880 | That's the how.
00:38:21.880 | Can you be more specific?
00:38:22.880 | So what is it the variance of?
00:38:24.600 | I think it's, if I'm remembering correctly, I think it's just the overall prediction.
00:38:31.320 | The variance of the predictions of the trees, yes.
00:38:34.320 | So normally the prediction is just the average, this is the variance of the trees.
00:38:38.840 | So it kind of just gives you an idea of how much your prediction is going to vary.
00:38:42.960 | So maybe you want to minimize variance, maybe that's your goal for whatever reason that
00:38:46.920 | could be.
00:38:48.400 | That's not so much the reason, so I like your calculation description.
00:38:51.840 | Let's see if somebody else can tell us how you might use that.
00:38:56.400 | It's okay if you're not sure, have a stab.
00:39:08.760 | So I remember that we talked about kind of the independence of the trees and so maybe
00:39:17.720 | something about if the variance of the trees is higher or lower than.
00:39:25.640 | That's not so much that, that's an interesting question but it's not what we're going to
00:39:31.480 | see here.
00:39:32.480 | I'm going to pass it back behind you.
00:39:34.960 | So to remind you, just to fill in a detail here, what we generally do here is we take
00:39:39.920 | just one row, like one observation often, and find out how confident we are about that,
00:39:47.760 | like how much variance there is in the trees for that, or we can do it as we did here for
00:39:52.320 | different groups.
00:39:55.600 | So according to me the idea is like for each row we calculate the standard deviation that
00:40:01.360 | we get from the random forest model, and then maybe group according to different variables
00:40:08.240 | or predictors, and see for which particular predictor the standard deviation is high,
00:40:13.600 | and then go deep down as why it is happening, maybe it is because a particular category
00:40:19.800 | of that variable has very few observations.
00:40:23.040 | Yeah, that's great.
00:40:24.240 | So that would be one approach is kind of what we've done here is to say like is there any
00:40:28.200 | groups where we're very unconfident?
00:40:34.280 | Something that I think is even more important would be when you're using this operationally,
00:40:42.000 | let's say you're doing a credit decisioning algorithm.
00:40:46.340 | So we're trying to say is Jeremy a good risk or a bad risk?
00:40:50.240 | Should we loan him a million dollars?
00:40:54.020 | And the random forest says, I think he's a good risk, but I'm not at all confident.
00:41:02.880 | In which case we might say maybe I shouldn't give him a million dollars, or else if the
00:41:07.400 | random forest said, I think he's a good risk, I am very sure of that, then we're much more
00:41:13.280 | comfortable giving him a million dollars.
00:41:16.920 | And I'm a very good risk, so feel free to give me a million dollars.
00:41:21.840 | I checked with the random forest before, in a different notebook, not in the repo.
00:41:28.920 | So it's quite hard for me to give you folks direct experience with this kind of single
00:41:39.080 | observation interpretation stuff, because it's really like the kind of stuff that you
00:41:46.720 | actually need to be putting out to the front line.
00:41:49.640 | It's not something which you can really use so much in a kind of Kaggle context, but it's
00:41:54.760 | more like if you're actually putting out some algorithm which is making big decisions that
00:42:02.400 | could cost a lot of money, you probably don't so much care about the average prediction
00:42:08.760 | of the random forest, but maybe you actually care about like the average minus a couple
00:42:13.880 | of standard deviations, like what's the kind of worst-case prediction.
00:42:20.720 | And so as Chika mentioned, it's like maybe there's a whole group that we're kind of
00:42:29.040 | unconfident about.
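A minimal sketch of this confidence measure on toy data (the dataset here is synthetic and purely illustrative, not the course's bulldozers notebook):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy regression data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
m = RandomForestRegressor(n_estimators=40, random_state=0).fit(X, y)

row = X[:1]  # one observation, e.g. one credit applicant

# Each tree makes its own prediction for this row; the spread of those
# predictions is the confidence measure.
preds = np.array([t.predict(row)[0] for t in m.estimators_])
mean, std = preds.mean(), preds.std()

# A conservative "worst-case" style estimate: mean minus a couple of
# standard deviations, as discussed above.
worst_case = mean - 2 * std
print(mean, std, worst_case)
```

The forest's usual prediction is just `preds.mean()`; the standard deviation is the extra information you get almost for free.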
00:42:32.920 | So that's confidence based on tree variance.
00:42:36.040 | Alright, who wants to have a go at answering feature importance?
00:42:40.280 | What is it?
00:42:41.280 | Why is it interesting?
00:42:42.280 | How do we calculate it or any subset thereof?
00:42:46.200 | I think it's basically to find out which features are important for your model.
00:42:55.760 | So you take each feature and you randomly sample all the values in the feature and you
00:43:01.720 | see how the predictions are, if it's very different.
00:43:04.840 | It means that that feature was actually important, else if it's fine to take any random values
00:43:09.460 | for that feature, it means that maybe probably it's not very important.
00:43:13.120 | Okay, that was terrific.
00:43:18.560 | That was all exactly right.
00:43:19.560 | There were some details that maybe were skimmed over a little bit.
00:43:23.040 | I wonder if anybody else wants to jump into like a more detailed description of how it's
00:43:27.920 | calculated because I know this morning some people were not quite sure.
00:43:33.280 | Is there anybody who's not quite sure maybe who wants to like have a go or want to just
00:43:37.660 | put it next to you there?
00:43:39.240 | Let's see.
00:43:40.240 | How exactly do we calculate feature importance for a particular feature?
00:43:44.400 | I think after you're done building the random forest model, you take each column and randomly
00:43:49.400 | shuffle it and generate a prediction and check the validation score.
00:43:54.120 | If it gets pretty bad after shuffling one of the columns, that means that column
00:43:59.480 | was important.
00:44:00.960 | So that has higher importance.
00:44:03.940 | I'm not exactly sure how we quantify the feature importance.
00:44:07.720 | Okay, great.
00:44:09.960 | Dina, do you know how we quantify the feature importance?
00:44:15.000 | That was a great description.
00:44:16.000 | I think we did the difference in the R squared.
00:44:20.440 | Or score of some sort, exactly, yeah.
00:44:22.760 | So let's say we've got our dependent variable which is price, right, and there's a bunch
00:44:26.860 | of independent variables including year made, right.
00:44:30.640 | And so basically we use the whole lot to build a random forest, right, and then that gives
00:44:39.320 | us our predictions, right.
00:44:44.400 | And so then we can compare that to get R^2, RMSE, whatever you're interested in from the
00:44:54.720 | model.
00:44:55.720 | Now the key thing here is I don't want to have to retrain my whole random forest.
00:45:01.920 | That's kind of slow and boring, right.
00:45:03.960 | So using the existing random forest, how can I figure out how important year made was,
00:45:10.440 | right.
00:45:11.440 | And so the suggestion was, let's randomly shuffle the whole column, right.
00:45:16.220 | So now that column is totally useless.
00:45:18.080 | It's got the same mean, same distribution, everything about it is the same, but there's
00:45:22.640 | no connection at all between particular people, actual year made, and what's now in that column.
00:45:27.920 | I've randomly shuffled it, okay.
00:45:30.520 | And so now I put that new version through with the same random forest, so there's no
00:45:40.120 | retraining done, okay, to get some new y-hat, I'll call it y-hat-YM, right.
00:45:48.960 | And then I can compare that to my actuals to get like an RMSE-YM, right.
00:45:57.480 | And so now I can start to create a little table.
00:46:04.600 | So now I can create a little table where I've basically got like the original here, RMSE,
00:46:11.600 | and then I've got with year made scrambled, so the scrambled one had an RMSE of like 3, the original
00:46:18.240 | had an RMSE of like 2, and enclosure, scrambling that had an RMSE of like 2.5, right.
00:46:28.320 | And so then I just take these differences.
00:46:29.860 | So I'd say year made, the importance is 1, 3 - 2, enclosure is 0.5, 2.5 - 2, and so forth.
00:46:40.280 | So how much worse did my model get after I shuffled that variable?
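That shuffle-and-rescore procedure can be sketched like this; the data and the column names (`year_made` and so on) are made up for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Toy stand-in for the bulldozers data; column names are hypothetical.
X, y = make_regression(n_samples=300, n_features=4, random_state=0)
cols = ["year_made", "enclosure", "meter", "model_id"]
m = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)

base_rmse = mean_squared_error(y, m.predict(X)) ** 0.5

rng = np.random.default_rng(0)
importance = {}
for i, name in enumerate(cols):
    X_shuf = X.copy()
    rng.shuffle(X_shuf[:, i])       # destroy this column's information only
    rmse = mean_squared_error(y, m.predict(X_shuf)) ** 0.5   # no retraining
    importance[name] = rmse - base_rmse   # how much worse did the model get?

print(sorted(importance.items(), key=lambda kv: -kv[1]))
```

Note that the model is never refit: only the predictions are recomputed on the shuffled copy, which is what makes this cheap.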
00:46:46.000 | Does anybody have any questions about that?
00:46:49.360 | Can you pass that to Danielle please?
00:46:51.280 | "I assume you just chose those numbers randomly, but my question I guess is do all of them
00:47:01.960 | theoretically not a perfect model to start out with, like will all the importance is
00:47:06.680 | sub to 1 or is that not?"
00:47:09.040 | No, honestly I've never actually looked at what the units are, so I'm actually not quite
00:47:14.480 | sure.
00:47:15.480 | Sorry, we can check it out during the week if somebody's interested.
00:47:19.320 | Have a look at this SKLearn code and see exactly what those units of measure are because I've
00:47:26.680 | never bothered to check.
00:47:29.160 | Although I don't check like the units of measure specifically, what I do check is the relative
00:47:35.720 | importance.
00:47:36.720 | And so like here's an example, so rather than just saying like what are the top 10, yesterday
00:47:41.880 | one of the practicum students asked me about a feature importance where they said like
00:47:49.160 | oh I think these 3 are important.
00:47:51.440 | And I pointed out that the top one was a thousand times more important than the second one.
00:47:57.660 | So like look at the relative numbers here.
00:47:59.920 | And so in that case it's like no don't look at the top 3, look at the one that's a thousand
00:48:04.080 | times more important and ignore all the rest.
00:48:07.240 | And so this is where sometimes the kind of your natural tendency to want to be like precise
00:48:12.800 | and careful, you need to override that and be very practical.
00:48:16.560 | Like okay this thing's a thousand times more important, don't spend any time on anything
00:48:20.040 | else.
00:48:21.040 | So then you can go and talk to the manager of your project and say like okay this thing's
00:48:25.400 | a thousand times more important.
00:48:27.200 | And then they might say oh that was a mistake, it shouldn't have been in there, we don't
00:48:32.440 | actually have that information at the decision time.
00:48:37.440 | For whatever reason we can't actually use that variable and so then you could remove
00:48:40.520 | it and have a look.
00:48:42.600 | Or they might say gosh I had no idea that was by far more important than everything
00:48:47.760 | else put together.
00:48:49.200 | So let's forget this random forest thing and just focus on understanding how we can better
00:48:55.880 | collect that one variable and better use that one variable.
00:48:59.220 | So that's like something which comes up quite a lot.
00:49:03.840 | And actually another place that came up just yesterday, again another practicum student
00:49:07.720 | asked me hey I'm doing this medical diagnostics project and my R^2 is 0.95 for a disease which
00:49:19.760 | I was told is very hard to diagnose.
00:49:23.400 | Is this random forest a genius or is something going wrong?
00:49:27.080 | And I said like remember the second thing you do after you build a random forest is
00:49:31.160 | to do feature importance.
00:49:32.960 | So do feature importance and what you'll probably find is that the top column is something that
00:49:38.720 | shouldn't be there.
00:49:40.360 | And so that's what happened.
00:49:41.360 | He came back to me half an hour later, he said yeah I did the feature importance, you
00:49:44.800 | were right, the top column was basically something that was another encoding of the dependent
00:49:50.440 | variable, I've removed it, and now my R^2 is -0.1 so that's an improvement.
00:50:04.280 | The other thing I like to look at is this chart, is to basically say where do things
00:50:10.160 | flatten off in terms of which ones should I be really focusing on.
00:50:15.520 | So that's the most important one.
00:50:17.600 | And so when I did credit scoring in telecommunications, I found there were 9 variables that basically
00:50:23.640 | predicted very accurately who was going to end up paying for their phone and who wasn't.
00:50:29.240 | And apart from ending up with a model that saved them $3 billion a year in fraud and
00:50:35.100 | credit costs, it also let them basically rejig their process so that they focused on collecting
00:50:41.240 | those 9 variables much better.
00:50:47.040 | Alright, who wants to do partial dependence?
00:50:53.060 | This is an interesting one, very important, but in some ways kind of tricky to think about.
00:50:59.520 | I'll go ahead and try.
00:51:01.520 | Yeah please do.
00:51:02.520 | So from my understanding of what partial dependence is is that there's not always necessarily
00:51:07.760 | like a relationship between strictly the dependent variable and this independent variable that
00:51:13.360 | necessarily is showing importance, but rather an interaction between two variables that
00:51:20.440 | are working together.
00:51:21.440 | You're coming like this, right?
00:51:22.440 | Yeah.
00:51:23.440 | Where we're like, oh that's weird.
00:51:24.440 | I would expect this to be kind of flat, but it's this weird hockey stick.
00:51:28.360 | Yeah.
00:51:29.360 | And so for this example, what we found was that it's not necessarily year made or when
00:51:35.320 | the sale elapsed, but it's actually the age of the model.
00:51:38.160 | And so that is easier to tell a company, well obviously your younger models are going to
00:51:45.080 | sell for more, and it's less about when the year was made.
00:51:48.240 | Yeah, exactly.
00:51:50.200 | So let's come back to how we calculate this in a moment, but the first thing to realize
00:51:54.720 | is that the vast majority of the time post your course here, when somebody shows you
00:52:00.480 | a chart, it will be like a univariate chart.
00:52:03.240 | They'll just grab the data from the database and they'll plot x against y, and then managers
00:52:08.360 | have a tendency to want to make a decision.
00:52:11.240 | So it'll be like, oh there's this drop-off here, so we should stop dealing in equipment
00:52:17.000 | made between 1990 and 1995 or whatever.
00:52:20.720 | And this is a big problem because real-world data has lots of these interactions going
00:52:28.720 | on, so maybe there was a recession going on around the time those things were being sold
00:52:35.240 | or maybe around that time people were buying more of a different type of equipment or whatever.
00:52:41.120 | So generally what we actually want to know is all other things being equal, what's the
00:52:46.040 | relationship between year made and sale price.
00:52:50.440 | Because if you think about the drivetrain approach idea of the levers, you really want
00:52:56.020 | a model that says if I change this lever, how will it change my objective?
00:53:03.880 | And so it's by pulling them apart using partial dependence that you can say, okay, actually
00:53:10.640 | this is the relationship between year made and sale price, all other things being equal.
00:53:17.520 | So how do we calculate that?
00:53:20.360 | So for the variable year made, for example, you're going to train, you keep every other
00:53:27.960 | variable constant and then you're going to pass every single value of the year made and
00:53:31.800 | then train the model after that.
00:53:33.480 | So for every model you're going to have the light blue for the values of it and the median
00:53:40.280 | is going to be the yellow line up there.
00:53:44.040 | Okay, so let's try and draw that.
00:53:51.480 | By "leave everything else constant", what she means is leave them at whatever they are in
00:53:55.520 | the dataset.
00:53:56.520 | So just like when we did feature importance, we're going to leave the rest of the dataset
00:54:01.120 | as it is, and we're going to do partial dependence plot for year made.
00:54:05.480 | So you've got all of these other rows of data that will just leave as they are.
00:54:11.000 | And so instead of randomly shuffling year made, instead what we're going to do is replace
00:54:18.720 | every single value with exactly the same thing, 1960.
00:54:29.360 | And just like before, we now pass that through our existing random forest, which we have
00:54:33.680 | not retrained or changed in any way, to get back out a set of predictions.
00:54:41.480 | Why 1960?
00:54:44.560 | And so then we can plot that on a chart, year made against partial dependence, 1960 here.
00:54:57.680 | Now we can do it for 1961, 2, 3, 4, 5, and so forth.
00:55:03.520 | And so we can do that on average for all of them, or we could do it just for one of them.
00:55:14.040 | And so when we do it for just one of them and we change its year made and pass that
00:55:18.100 | single thing through our model, that gives us one of these blue lines.
00:55:23.000 | So each one of these blue lines is a single row as we change its year made from 1960 up
00:55:29.880 | to 2008.
00:55:34.000 | And so then we can just take the median of all of those blue lines to say, on average,
00:55:40.720 | what's the relationship between year made and price, all other things being equal.
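A sketch of that replace-the-whole-column procedure, on synthetic data with a hypothetical "year" feature (the real notebook uses the bulldozers data and plots the individual blue lines too):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy data: a made-up "year" column drives the target roughly linearly,
# plus a second unrelated feature.
n = 500
year = rng.integers(1960, 2009, n).astype(float)
other = rng.normal(size=n)
y = 0.05 * (year - 1960) + 0.3 * other + rng.normal(scale=0.1, size=n)
X = np.column_stack([year, other])

m = RandomForestRegressor(n_estimators=40, random_state=0).fit(X, y)

def partial_dependence(model, X, col, values):
    """For each candidate value, overwrite the whole column with that value,
    pass the data through the unchanged model, and average the predictions."""
    curve = []
    for v in values:
        Xv = X.copy()
        Xv[:, col] = v              # every row gets e.g. year = 1960
        curve.append(model.predict(Xv).mean())
    return np.array(curve)

grid = np.arange(1960, 2009, 4, dtype=float)
pdp = partial_dependence(m, X, 0, grid)
print(list(zip(grid, pdp.round(2))))
```

Averaging the predictions gives the yellow median-style line from the lecture; keeping the per-row predictions instead of the mean would give the individual blue lines.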
00:55:46.960 | So why is it that this works?
00:55:52.800 | Why is it that this process tells us the relationship between year made and price, all other things
00:55:58.280 | being equal?
00:55:59.280 | Well, maybe it's good to think about a really simplified approach.
00:56:03.640 | A really simplified approach would say, what's the average auction?
00:56:09.120 | What's the average sale date?
00:56:11.280 | What's the most common type of machine we sell?
00:56:14.560 | Which location do we mostly sell things?
00:56:17.600 | We could come up with a single row that represents the average auction, and then we could say,
00:56:23.120 | let's run that row through the random forest, replace its year made with 1960, and then do
00:56:29.480 | it again with 1961, and then do it again with 1962, and we could plot those on our little
00:56:35.600 | chart.
00:56:37.640 | And that would give us a version of the relationship between year made and sale price, all other
00:56:46.240 | things being equal.
00:56:48.520 | But what if tractors looked like that and backhoe loaders looked like that?
00:57:04.200 | Then taking the average one would hide the fact that there are these totally different
00:57:09.680 | relationships.
00:57:11.120 | So instead we basically say, our data tells us what kinds of things we tend to sell, and
00:57:18.800 | who we tend to sell them to, and when we tend to sell them, so let's use that.
00:57:22.740 | So then we actually find out, for every blue line, here are actual examples of these relationships.
00:57:33.020 | And so then what we can do, as well as plotting the median, is we can do a cluster analysis
00:57:37.820 | to find out like a few different shapes.
00:57:43.960 | And so we may find, in this case they all look like pretty much the different versions
00:57:49.000 | of the same thing with different slopes.
00:57:53.040 | So my main takeaway from this would be that the relationship between sale price and year
00:58:00.520 | is basically a straight line.
00:58:03.800 | And remember, this was log of sale price, so this is actually showing us an exponential.
00:58:10.940 | And so this is where I would then bring in the domain expertise, which is like, okay,
00:58:17.800 | things depreciate over time by a constant ratio, so therefore I would expect older stuff
00:58:26.640 | year made to have this exponential shape.
00:58:30.900 | So this is where, as I kind of mentioned, at the very start of my machine learning project I generally
00:58:37.880 | try to avoid using as much domain expertise as I can and let the data do the talking.
00:58:44.200 | So one of the questions I got this morning was, if there's like a sale ID, a model ID,
00:58:49.400 | I should throw those away because they're just IDs.
00:58:52.720 | No, don't assume anything about your data, leave them in, and if they turn out to be
00:58:58.980 | super important predictors, you want to find out why is that.
00:59:04.240 | But then, now I'm at the other end of my project, I've done my feature importance, I've pulled
00:59:09.380 | out the stuff which is from that dendrogram, the kind of redundant features, I'm looking
00:59:15.200 | at the partial dependence, and now I'm thinking, okay, is this shape what I expected?
00:59:22.040 | So even better, before you plot this, first of all think, what shape would I expect this
00:59:28.040 | to be?
00:59:29.040 | It's always easy to justify to yourself after the fact, I knew it would look like this.
00:59:33.260 | So what shape do you expect, and then is it that shape?
00:59:35.680 | So in this case, I'd be like, yeah, this is what I would expect, whereas this is definitely
00:59:43.640 | not what I'd expect.
00:59:45.820 | So the partial dependence plot has really pulled out the underlying truth.
00:59:50.840 | So does anybody have any questions about why we use partial dependence or how we calculate it?
00:59:59.820 | If there are 20 features that are important, then I will do the partial dependence for all
01:00:27.240 | of them, where important means it's a lever I can actually pull, the magnitude of its
01:00:36.640 | size is not much smaller than the other 19, based on all of these things, it's a feature
01:00:43.120 | I ought to care about, then I will want to know how it's related.
01:00:48.480 | It's pretty unusual to have that many features that are important both operationally and
01:00:55.400 | from a modeling point of view, in my experience.
01:01:04.840 | So important means it's a lever, so it's something I can change, and it's like, you know, kind
01:01:19.600 | of at the spiky end of this tail.
01:01:27.360 | Or maybe it's not a lever directly, maybe it's like zip code, and I can't actually tell my
01:01:34.940 | customers where to live, but I could focus my new marketing attention on a different
01:01:41.520 | zip code.
01:01:43.520 | Would it make sense to do pairwise shuffling for every combination of two features and
01:01:52.160 | hold everything else constant, like in feature importance, to see interactions and compare
01:01:56.760 | scores?
01:02:00.560 | So you wouldn't do that so much for partial dependence.
01:02:04.960 | I think your question is really getting to the question of could we do that for feature
01:02:10.760 | importance.
01:02:13.600 | So I think interaction feature importance is a very important and interesting question.
01:02:21.360 | But doing it by randomly shuffling every pair of columns, if you've got 100 columns, sounds
01:02:30.160 | computationally intensive, possibly infeasible.
01:02:34.040 | So what I'm going to do is after we talk about TreeInterpreter, I'll talk about an interesting
01:02:39.280 | but largely unexplored approach that will probably work.
01:02:43.720 | Okay, who wants to do TreeInterpreter?
01:02:47.880 | Alright, over here, Prince.
01:02:51.600 | Can you pass that over here to Prince?
01:02:56.600 | I was thinking this to be more like feature importance, but feature importance is for
01:03:04.120 | complete random forest model, and this TreeInterpreter is for feature importance for a particular
01:03:09.240 | observation.
01:03:10.920 | So if that, let's say it's about hospital readmission, so if a patient A1 is going to
01:03:18.640 | be readmitted to a hospital, which feature for that particular patient is going to impact?
01:03:24.640 | And how can you change that?
01:03:27.440 | And it is calculated starting from the prediction of mean, then seeing how each feature is changing
01:03:34.040 | the behavior of that particular patient.
01:03:37.600 | I'm smiling because that was one of the best examples of technical communication I've heard
01:03:43.800 | in a long time.
01:03:44.800 | So it's really good to think about why was that effective.
01:03:49.280 | So what Prince did there was he used as specific an example as possible.
01:03:57.040 | Humans are much less good at understanding abstractions.
01:03:59.840 | So if you say it takes some kind of feature and then there's an observation in that feature
01:04:05.720 | where it's like, no, it's a hospital readmission.
01:04:09.800 | So we take a specific example.
01:04:13.120 | The other thing he did that was very effective was to take an analogy to something we already
01:04:17.960 | understand.
01:04:18.960 | So we already understand the idea of feature importance across all of the rows in a dataset.
01:04:26.000 | So now we're going to do it for a single row.
01:04:29.760 | So one of the things I was really hoping we would learn from this experience is how to
01:04:35.400 | become effective technical communicators.
01:04:38.640 | So that was a really great role model from Prince of using all of the tricks we have
01:04:44.760 | at our disposal for effective technical communication.
01:04:47.980 | So hopefully you found that a useful explanation.
01:04:50.360 | I don't have a hell of a lot to add to that other than to show you what that looks like.
01:04:56.960 | So with the tree interpreter, we picked out a row.
01:05:03.360 | And so remember when we talked about the confidence intervals at the very start, the confidence
01:05:10.880 | based on tree variance, we mainly said you'd probably use that for a row.
01:05:15.680 | So this would also be for a row.
01:05:17.560 | So it's like, okay, why is this patient likely to be readmitted?
01:05:23.840 | So here is all of the information we have about that patient, or in this case this auction.
01:05:30.540 | Why is this auction so expensive?
01:05:35.180 | So then we call tree interpreter dot predict, and we get back the prediction of the price,
01:05:41.320 | the bias, which is the root of the tree.
01:05:44.640 | So this is just the average price for everybody, so this is always going to be the same.
01:05:49.600 | And then the contributions, which is how important is each of these things.
01:05:59.040 | And so the way we calculated that was to say at the very start, the average price was 10,
01:06:17.640 | and then we split on enclosure.
01:06:21.800 | And for those with this enclosure, the average was 9.5.
01:06:29.460 | And then we split on year made, I don't know, less than 1990.
01:06:33.960 | And for those with that year made, the average price was 9.7.
01:06:40.440 | And then we split on the number of hours on the meter, and for this branch we got 9.4.
01:06:51.680 | And so we then have a particular auction, which we pass through the tree, and it just so
01:06:58.120 | happens that it takes this path.
01:07:01.500 | So one row can only have one path through the tree.
01:07:06.800 | And so we ended up at this point.
01:07:11.160 | So then we can create a little table.
01:07:17.760 | And so as we go through, we start at the top, and we start with 10.
01:07:22.760 | That's our bias.
01:07:24.280 | And we said enclosure resulted in a change from 10 to 9.5, minus 0.5.
01:07:33.120 | Year made changed it from 9.5 to 9.7, so plus 0.2.
01:07:39.360 | And then meter changed it from 9.7 down to 9.4, which is minus 0.3.
01:07:51.360 | And then if we add all that together, 10 minus 0.5 is 9.5, plus 0.2 is 9.7, minus 0.3 is 9.4.
01:08:01.080 | Lo and behold, that's that number, which takes us to our Excel spreadsheet.
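The arithmetic above can be checked in a couple of lines. The numbers below are the lecture's toy values, not real model output, and the feature names are just illustrative labels; the invariant being checked (bias plus the per-feature contributions equals the row's prediction) is exactly what the tree interpreter returns:

```python
import math

# Toy numbers from the walk-through above (not real Bulldozers output):
bias = 10.0                          # root-node average log price
contributions = {
    'Enclosure':  -0.5,              # 10.0 -> 9.5
    'YearMade':   +0.2,              #  9.5 -> 9.7
    'MeterHours': -0.3,              #  9.7 -> 9.4
}

# The invariant: bias + sum of contributions == the row's prediction
prediction = bias + sum(contributions.values())
assert math.isclose(prediction, 9.4)
```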
01:08:18.520 | Where's Chris, who did our waterfall?
01:08:20.600 | There you are.
01:08:22.080 | So last week we had to use Excel for this because there isn't a good Python library
01:08:29.680 | for doing waterfall charts.
01:08:31.400 | And so we saw we got our starting point, this is the bias, and then we had each of our contributions
01:08:37.000 | and we ended up with our total.
01:08:40.040 | The world is now a better place because Chris has created a Python waterfall chart module
01:08:44.680 | for us and put it on pip, so never again will we have to use Excel for this.
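The arithmetic behind a waterfall chart is just cumulative sums: each floating bar starts where the running total left off. A minimal numeric sketch (in a plot, `starts` would become the `bottom` argument and `contribs` the bar heights in `plt.bar`):

```python
import numpy as np

bias = 10.0
contribs = np.array([-0.5, 0.2, -0.3])   # enclosure, year made, meter

# Each floating bar starts at the running total accumulated so far
starts = bias + np.concatenate([[0.0], np.cumsum(contribs)[:-1]])
final = bias + contribs.sum()

assert np.allclose(starts, [10.0, 9.5, 9.7])
assert np.isclose(final, 9.4)
```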
01:08:50.300 | And I wanted to point out that waterfall charts have been very important in business communications
01:08:57.040 | at least as long as I've been in business, so that's about 25 years.
01:09:04.080 | Python is a couple of decades old, maybe a little bit less.
01:09:10.520 | But despite that, no one in the Python world ever got to the point where they actually
01:09:17.440 | thought I'm going to make a waterfall chart.
01:09:20.280 | So they didn't exist until two days ago, which is to say the world is full of stuff which
01:09:27.420 | ought to exist and doesn't, and doesn't necessarily take a lot of time to build.
01:09:32.600 | Chris, how long did it take you to build the first Python waterfall chart?
01:09:39.240 | Well, there was a gist of it, so a hefty amount of time, but not unreasonable.
01:09:54.600 | And now, forevermore, people who want a Python waterfall chart will end up at
01:09:59.440 | Chris's GitHub repo and hopefully find lots of other USF contributors who have made it
01:10:04.840 | even better.
01:10:08.000 | So in order for you to help improve Chris's Python waterfall, you need to know how to
01:10:14.520 | do that.
01:10:15.720 | And so you're going to need to submit a pull request.
01:10:20.580 | Life becomes very easy for submitting pull requests if you use something called hub.
01:10:24.680 | So if you go to github/hub, that will send you over here.
01:10:33.240 | And what they suggest you do is that you alias git to hub, because it turns out that hub
01:10:37.920 | actually is a strict superset of git.
01:10:41.560 | But what it lets you do is you can go git fork, git push, git pull request, and you've
01:10:51.320 | now sent Chris a pull request.
01:10:54.280 | Without hub, this is actually a pain and requires going to the website and filling in forms.
01:11:00.080 | So this gives you no reason not to do pull requests.
01:11:03.840 | And I mention this because when you're interviewing for a job or whatever, I can promise you that
01:11:10.080 | the person you're talking to will check your GitHub.
01:11:13.200 | And if they see you have a history of submitting thoughtful pull requests that are accepted
01:11:17.980 | to interesting libraries, that looks great.
01:11:21.000 | It looks great because it shows you're somebody who actually contributes.
01:11:24.640 | It also shows that if they're being accepted, that you know how to create code that fits
01:11:28.960 | with people's coding standards, has appropriate documentation, passes their tests and coverage,
01:11:34.760 | and so forth.
01:11:36.040 | So when people look at you and they say, "Oh, here's somebody with a history of successfully
01:11:41.120 | contributing accepted pull requests to open-source libraries," that's a great part of your portfolio.
01:11:48.200 | And you can specifically refer to it.
01:11:50.560 | So either I'm the person who built Python Waterfall, here is my repo, or I'm the person
01:11:58.880 | who contributed currency number formatting to Python Waterfall, here's my pull request.
01:12:07.020 | Any time you see something that doesn't work right in any open-source software you use
01:12:12.200 | is not a problem.
01:12:13.880 | It's a great opportunity because you can fix it and send in the pull request.
01:12:19.240 | So give it a go, it actually feels great the first time you have a pull request accepted.
01:12:24.280 | And of course, one big opportunity is the FastAI library.
01:12:30.440 | Was the person here the person who added all the docs to FastAI structured in the other
01:12:35.640 | class?
01:12:36.640 | Okay.
01:12:37.640 | So thanks to one of our students, we now have doc strings for most of the fastai.structured
01:12:42.940 | library and that again came via a pull request.
01:12:46.040 | So thank you.
01:12:52.160 | Does anybody have any questions about how to calculate any of these random forest interpretation
01:12:59.680 | methods or why we might want to use any of these random forest interpretation methods?
01:13:05.440 | Towards the end of the week, you're going to need to be able to build all of these yourself
01:13:11.800 | from scratch.
01:13:12.800 | Just looking at the tree interpreter, I noticed that some of the values are NaNs.
01:13:29.360 | I get why you keep them in the tree, but how can a NaN have a feature importance?
01:13:37.680 | Okay let me pass it back to you.
01:13:41.080 | Why not?
01:13:42.080 | So, in other words, how is NaN handled in pandas and therefore in the tree?
01:13:49.240 | Is that to some default value?
01:13:54.480 | Anybody remember how pandas, notice these are all categorical variables, how does
01:13:58.600 | pandas handle NaNs in categorical variables and how does FastAI deal with them?
01:14:04.700 | Can somebody pass it to the person who's talking?
01:14:10.600 | Pandas sets their category code to -1, and do you remember what we then do?
01:14:18.780 | We add 1 to all of the category codes, so it ends up being 0.
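A minimal demonstration of that behavior, assuming pandas is available (note the inferred categories come out in sorted order here):

```python
import pandas as pd

s = pd.Series(['High', 'Low', None, 'Medium'], dtype='category')

codes = s.cat.codes        # pandas gives missing values the code -1
shifted = codes + 1        # after adding 1, missing becomes 0

assert list(codes)   == [0, 1, -1, 2]
assert list(shifted) == [1, 2, 0, 3]
```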
01:14:22.960 | So in other words, we have a category with, remember by the time it hits the random forest
01:14:26.960 | it's just a number, and it's just a number 0.
01:14:30.960 | And we map it back to the descriptions back here.
01:14:33.920 | So the question really is, why shouldn't the random forest be able to split on 0?
01:14:40.560 | It's just another number.
01:14:42.520 | So it could be NaN, high, medium, or low: codes 0, 1, 2, 3.
01:14:47.000 | And so missing values are one of these things that are generally taught really badly, like
01:14:54.680 | often people get taught like here are some ways to remove columns with missing values
01:14:58.740 | or remove rows with missing values or to replace missing values.
01:15:04.000 | That's never what we want, because missingness is very very very often interesting.
01:15:10.960 | And so we actually learned from our feature importance that coupler system NaN is like
01:15:17.520 | one of the most important features.
01:15:19.920 | And so for some reason, I could guess, right?
01:15:25.040 | Coupler system NaN presumably means this is the kind of industrial equipment that doesn't
01:15:29.440 | have a coupler system.
01:15:31.440 | I don't know what kind that is, but apparently it's a more expensive kind.
01:15:35.880 | Does that make sense?
01:15:40.400 | So I did this competition for university grant research success where by far the most important
01:15:51.480 | predictors were whether or not some of the fields were null, and it turned out that this
01:15:58.080 | was data leakage, that these fields only got filled in most of the time after a research
01:16:03.720 | grant was accepted.
01:16:05.800 | So it allowed me to win that Kaggle competition, but it didn't actually help the university
01:16:11.000 | very much.
01:16:13.440 | Okay, great.
01:16:17.080 | So let's talk about extrapolation, and I am going to do something risky and dangerous,
01:16:27.440 | which is we're going to do some live coding.
01:16:32.760 | And the reason we're going to do some live coding is I want to explore extrapolation
01:16:38.320 | together with you, and I kind of also want to help give you a feel of how you might go
01:16:46.000 | about writing code quickly in this notebook environment.
01:16:52.160 | And this is the kind of stuff that you're going to need to be able to do in the real
01:16:55.360 | world, and in the exam: quickly create the kind of code that we're going to talk about.
01:17:00.200 | So I really like creating synthetic data sets anytime I'm trying to investigate the behavior
01:17:07.440 | of something, because if I have a synthetic data set, I know how it should behave.
01:17:12.980 | Which reminds me, before we do this, I promised that we would talk about interaction importance,
01:17:21.200 | and I just about forgot.
01:17:25.400 | Tree interpreter tells us the contributions for a particular row based on the difference
01:17:33.340 | in the tree.
01:17:35.980 | We could calculate that for every row in our data set and add them up, and that would tell
01:17:45.680 | us feature importance in a different way.
01:17:50.760 | One way of doing feature importance is by shuffling the columns one at a time, and another
01:17:55.560 | way is by doing tree interpreter for every row and adding them up.
01:18:02.040 | Neither is more right than the others, they're actually both quite widely used.
01:18:06.120 | So this is type 1 and type 2 feature importance.
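Here is a rough sketch of that second, tree-interpreter-style importance for a single sklearn decision tree, on a synthetic data set so we know the answer in advance. The walk over the `tree_` internals is my own reimplementation for illustration, not a library call:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))
# Feature 1 matters twice as much as feature 0; feature 2 is pure noise
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = DecisionTreeRegressor(max_depth=4).fit(X, y)
t = model.tree_

def row_contributions(t, row):
    """Walk one row down the tree, crediting each split's change in
    node mean to the feature that was split on."""
    node, contribs = 0, np.zeros(row.shape[0])
    bias = t.value[0][0][0]                      # root-node mean
    while t.children_left[node] != -1:           # -1 marks a leaf
        feat = t.feature[node]
        nxt = (t.children_left[node] if row[feat] <= t.threshold[node]
               else t.children_right[node])
        contribs[feat] += t.value[nxt][0][0] - t.value[node][0][0]
        node = nxt
    return bias, contribs

# Sanity check: bias + contributions telescopes to the tree's prediction
bias, contribs = row_contributions(t, X[0])
assert np.isclose(bias + contribs.sum(), model.predict(X[:1])[0])

# "Type 2" feature importance: sum |contributions| over every row
importance = sum(np.abs(row_contributions(t, r)[1]) for r in X)
```

On this data, `importance[1]` should dominate and the noise feature should get close to nothing, matching what shuffle-based importance would tell you.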
01:18:11.800 | So we could try to expand this a little bit, to do not just single variable feature importance,
01:18:26.120 | but interaction feature importance.
01:18:29.360 | Now here's the thing, what I'm going to describe is very easy to describe.
01:18:36.160 | It was described by Breiman right back when random forests were first invented, and it
01:18:40.880 | is part of the commercial software product from Salford Systems who have the trademark
01:18:47.080 | on random forests, but it is not part of any open source library I'm aware of.
01:18:54.720 | And I've never seen an academic paper that actually studies it closely.
01:18:58.880 | So what I'm going to describe here is a huge opportunity, but there's also lots and lots
01:19:05.720 | of details that need to be fleshed out.
01:19:08.160 | But here's the basic idea.
01:19:16.360 | This particular difference here is not just because of year made, but because of a combination
01:19:27.800 | of year made and enclosure.
01:19:32.400 | The fact that this is 9.7 is because enclosure was in this branch and year made was in this
01:19:38.120 | branch.
01:19:39.440 | So in other words, we could say the contribution of enclosure interacted with year made is
01:19:50.840 | -0.3.
01:19:56.260 | And so what about that difference?
01:20:03.240 | Well that's an interaction of year made and hours on the meter.
01:20:08.620 | So year made interacted with, I'm using star here not to mean times, but to mean interacted
01:20:14.680 | with.
01:20:15.680 | It's kind of a common way of doing things, like R's formulas do it this way as well.
01:20:20.240 | Year made interacted with meter has a contribution of -0.1.
01:20:34.280 | Perhaps we could also say from here to here that this also shows an interaction between
01:20:41.840 | meter and enclosure, with one thing in between them.
01:20:46.040 | So maybe we could say meter by enclosure equals, and then what should it be?
01:20:55.400 | Should it be -0.6?
01:21:02.200 | In some ways that kind of seems unfair because we're also including the impact of year made.
01:21:09.640 | So maybe it should be -0.4.
01:21:14.760 | Maybe we should add back this +0.2.
01:21:18.760 | And these are like details that I actually don't know the answer to.
01:21:23.680 | How should we best assign a contribution to each pair of variables in this path?
01:21:33.720 | But clearly, conceptually, we can.
01:21:36.880 | The pairs of variables in that path all represent interactions.
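To make the idea concrete, here is one possible convention sketched in code: credit each later split's change in node mean to every (earlier feature, later feature) pair on the row's path. As discussed above, this is only one of several defensible attribution choices, and the helper is my own illustration, not an existing library function:

```python
from collections import defaultdict
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 2))
y = X[:, 0] * X[:, 1]            # a pure interaction effect

model = DecisionTreeRegressor(max_depth=4).fit(X, y)
t = model.tree_

def path_splits(t, row):
    """Return (feature, change in node mean) for each split on the row's path."""
    node, out = 0, []
    while t.children_left[node] != -1:           # -1 marks a leaf
        feat = t.feature[node]
        nxt = (t.children_left[node] if row[feat] <= t.threshold[node]
               else t.children_right[node])
        out.append((feat, t.value[nxt][0][0] - t.value[node][0][0]))
        node = nxt
    return out

# One (of several defensible) conventions: credit each later split's
# change to the pair (earlier feature on the path, later feature).
inter = defaultdict(float)
for row in X:
    splits = path_splits(t, row)
    for i in range(len(splits)):
        for j in range(i + 1, len(splits)):
            pair = tuple(sorted((splits[i][0], splits[j][0])))
            inter[pair] += abs(splits[j][1])
```

On this target the (0, 1) pair should come out with substantial importance, which is the behavior you would want to verify first on any synthetic data set.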
01:21:42.000 | Yes, Chris, could you pass that to Chris, please?
01:21:47.840 | Why don't you force them to be next to each other in the tree?
01:21:55.320 | I'm not going to say it's the wrong approach.
01:21:58.600 | I don't think it's the right approach, though, because it feels like this path here, meter
01:22:05.080 | and enclosure, are interacting.
01:22:08.800 | So it seems like not recognizing that contribution is throwing away information, but I'm not
01:22:17.320 | sure.
01:22:20.040 | I had one of my staff at Kaggle actually do some R&D on this a few years ago, and I wasn't
01:22:27.360 | close enough to know how they dealt with these details.
01:22:29.440 | But they got it working pretty well, but unfortunately it never saw the light of day as a software
01:22:33.680 | product.
01:22:34.680 | But this is something which maybe a group of you could get together and build -- I mean,
01:22:42.760 | do some Googling to check, but I really don't think that there are any interaction feature
01:22:48.240 | importance parts of any open source library.
01:22:52.200 | Could you pass that back?
01:22:56.800 | Wouldn't this exclude interactions between variables that don't matter until they interact?
01:23:03.640 | So say your row never chooses to split down that path, but that variable interacting with
01:23:09.760 | another one becomes your most important split.
01:23:13.920 | I don't think that happens, because if there's an interaction that's important only because
01:23:18.520 | it's an interaction and not on a univariate basis, it will appear sometimes assuming that
01:23:24.840 | you set max features to less than 1, and so therefore it will appear in some parts.
01:23:32.600 | What is meant by interaction here?
01:23:33.800 | Is it multiplication, ratio, addition?
01:23:37.360 | An interaction appears on the same path through a tree, like in this case in the tree there's
01:23:49.420 | an interaction between enclosure and year made because we branch on enclosure and then we branch
01:23:54.320 | on year made.
01:23:55.320 | So to get to here we have to have some specific value of enclosure and some specific value
01:23:59.760 | of year made.
01:24:05.200 | My brain is kind of working on this right now.
01:24:07.480 | What if you went down the middle leafs between the two things you were trying to observe
01:24:12.840 | and you would also take into account what the final measure is?
01:24:17.320 | So if we extend the tree downwards you'd have many measures, both of the two things you're
01:24:22.120 | trying to look at and also the in between steps.
01:24:27.000 | There seems to be a way to like average information out in between them.
01:24:29.960 | There could be.
01:24:30.960 | So I think what we should do is talk about this on the forum.
01:24:32.920 | I think this is fascinating and I hope we build something great, but I need to do my
01:24:39.600 | live coding so let's -- yeah, that was a great discussion.
01:24:47.080 | Keep thinking about it.
01:24:48.080 | Yeah, do some experiments.
01:24:50.120 | And so to experiment with that you almost certainly want to create a synthetic data
01:24:55.960 | set first.
01:24:56.960 | It's like y = x1 + x2 + x1 * x2 or something, like something where you know there's this
01:25:05.120 | interaction effect and there isn't that interaction effect and then you want to make sure that
01:25:09.920 | the feature importance you get at the end is what you expected.
01:25:14.400 | And so probably the first step would be to do single variable feature importance using
01:25:20.880 | the tree interpreter style approach.
01:25:26.800 | And one nice thing about this is it doesn't really matter how much data you have, all
01:25:33.680 | you have to do to calculate feature importance is just slide through the tree.
01:25:37.600 | So you should be able to write in a way that's actually pretty fast, and so even writing
01:25:41.800 | it in pure Python might be fast enough depending on your tree size.
01:25:49.360 | So we're going to talk about extrapolation.
01:25:52.640 | And so the first thing I want to do is create a synthetic data set that has a simple linear
01:25:56.600 | relationship.
01:25:57.600 | We're going to pretend it's like a time series.
01:26:01.760 | So we need to basically create some x values.
01:26:04.920 | So the easiest way to create some synthetic data of this type is to use linspace,
01:26:12.000 | which creates evenly spaced data between start and stop, with 50 observations by default.
01:26:21.520 | So if we just do that, there it is.
01:26:32.480 | And so then we're going to create a dependent variable.
01:26:35.440 | And so let's assume there's just a linear relationship between x and y, and let's add
01:26:40.800 | a little bit of randomness to it.
01:26:47.080 | So uniform random between low and high, so we could add somewhere between -0.2 and 0.2.
01:27:02.680 | And so the next thing we need is a shape, which is basically what dimensions do you
01:27:10.200 | want these random numbers to be.
01:27:13.920 | And obviously we want them to be the same shape as x's shape.
01:27:17.340 | So we can just say x.shape.
01:27:24.080 | Remember, when you see something in parentheses with a comma, that's a tuple with just one
01:27:31.400 | thing in it.
01:27:33.560 | So this is of shape 50, and so we've added 50 random numbers.
01:27:39.440 | And so now we can plot those.
01:27:46.760 | So shift+tab, x, y.
01:27:48.280 | All right, so there's our data.
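As a sketch, the whole synthetic data set from the notebook is just:

```python
import numpy as np

x = np.linspace(0, 1)                            # 50 evenly spaced points by default
y = x + np.random.uniform(-0.2, 0.2, x.shape)    # linear trend plus a little noise

# In the notebook you'd then do: plt.scatter(x, y)
```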
01:27:57.360 | So whether you're working as a data scientist or doing your exams in this course, you need
01:28:04.800 | to be able to quickly whip up a data set like that, throw it up on a plot without thinking
01:28:10.800 | too much.
01:28:11.800 | And as you can see, you don't have to really remember much, if anything, you just have
01:28:17.000 | to know how to hit shift+tab to check the names of the parameters, and everything in
01:28:23.440 | the exam will be open on the internet, so you can always Google for something to try
01:28:28.760 | and find linspace if you've forgotten what it's called.
01:28:33.960 | So let's assume that's our data.
01:28:36.800 | And so we're now going to build a random forest model, and what I want to do is build a random
01:28:43.800 | forest model that kind of acts as if this is a time series.
01:28:47.720 | So I'm going to take this as a training set, I'm going to take this as our validation or
01:28:56.640 | test set, just like we did in groceries or bulldozers or whatever.
01:29:05.680 | So we can use exactly the same kind of code that we used in split_vals.
01:29:09.500 | So we can basically say x_trn, x_val = x up to 40, x from 40.
01:29:30.000 | So that just splits it into the first 40 versus the last 10.
01:29:35.060 | And so we can do the same thing for y, and there we go.
01:29:47.160 | So the next thing to do is we want to create a random forest and fit it, and that's going
01:29:57.600 | to require x's and y's.
01:30:00.920 | Now that's actually going to give an error, and the reason why is that it expects x to
01:30:05.880 | be a matrix, not a vector, because it expects x to have a number of columns of data.
01:30:14.280 | So it's important to know that a matrix with one column is not the same thing as a vector.
01:30:22.120 | So if I try to run this, it says: expected a 2D array, got a 1D array instead.
01:30:31.800 | So we need to convert our 1D array into a 2D array.
01:30:35.400 | So remember I said x.shape is 50, right?
01:30:41.340 | So x has one axis.
01:30:46.320 | So we'd say that x's rank is 1.
01:30:52.240 | The rank of a variable is equal to the length of its shape.
01:30:59.640 | How many axes does it have?
01:31:01.600 | So a vector we can think of as an array of rank 1, a matrix is an array of rank 2.
01:31:11.240 | I very rarely use words like vector and matrix because they're kind of meaningless, specific
01:31:20.520 | examples of something more general, which is they're all n-dimensional tensors, or n-dimensional
01:31:26.840 | arrays.
01:31:29.320 | So an n-dimensional array, we can say it's a tensor of rank n, they basically mean kind
01:31:38.280 | of the same thing.
01:31:39.280 | Physicists get crazy when you say that because to a physicist a tensor has quite a specific
01:31:42.960 | meaning, but in machine learning we generally use it in the same way.
01:31:48.760 | So how do we turn a 1-dimensional array into a 2-dimensional array?
01:31:58.000 | There's a couple of ways we can do it, but basically we slice it.
01:32:02.160 | So colon means give me everything in that axis, colon, none means give me everything
01:32:12.000 | in the first axis, which is the only axis we have, and then none is a special indexer,
01:32:18.320 | which means add a unit axis here.
01:32:22.480 | So let me show you, that is of shape 50, 1, so it's of rank 2, it has two axes.
01:32:34.160 | One of them is a very boring axis, it's a length 1 axis, so let's move this over here.
01:32:42.720 | There's 1, 50, and then to remind you the original is just 50.
01:32:51.760 | So you can see I can put none as a special indexer to introduce a new unit axis there.
01:33:00.840 | So this thing has 1 row and 50 columns, this thing has 50 rows and 1 column.
01:33:10.400 | So that's what we want, we want 50 rows and 1 column.
01:33:15.640 | This kind of playing around with ranks and dimensions is going to become increasingly
01:33:22.580 | important in this course and in the deep learning course.
01:33:27.280 | So I spend a lot of time slicing with none, slicing with other things, trying to create
01:33:32.760 | 3-dimensional, 4-dimensional tensors and so forth.
01:33:35.480 | I'll show you a trick, I'll show you two tricks.
01:33:37.200 | The first is you never ever need to write comma, colon, it's always assumed.
01:33:42.680 | So if I delete that, this is exactly the same thing.
01:33:48.000 | And you'll see that in code all the time, so you need to recognize it.
01:33:51.920 | The second trick is, this is adding an axis in the second dimension, or I guess the index
01:33:59.160 | 1 dimension.
01:34:00.520 | What if I always want to put it in the last dimension?
01:34:04.040 | And often our tensors change dimensions without us looking, because you went from a 1-channel
01:34:10.520 | image to a 3-channel image, or you went from a single image to a mini-batch of images.
01:34:15.360 | Like suddenly you get new dimensions appearing.
01:34:18.240 | So to make things general, I would say this, dot dot dot.
01:34:22.800 | Dot dot dot means as many dimensions as you need to fill this up.
01:34:28.040 | And so in this case it's exactly the same thing, but I would always try to write it
01:34:32.440 | that way because it means it's going to continue to work as I get higher dimensional tensors.
01:34:38.640 | So in this case, I want 50 rows in one column, so I'll call that say x_1.
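A quick sketch of the three indexing variants and the shapes they produce:

```python
import numpy as np

x = np.linspace(0, 1)                  # shape (50,): rank 1

assert x[None].shape      == (1, 50)   # unit axis added first
assert x[:, None].shape   == (50, 1)   # unit axis added second
assert x[..., None].shape == (50, 1)   # unit axis added last, whatever the rank

x1 = x[..., None]                      # the 50-row, 1-column array the forest wants
```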
01:34:46.960 | So let's now use that here.
01:34:50.040 | And so this is now a 2D array, and so I can create my random forest.
01:35:02.200 | So then I can plot that, and this is where you're going to have to turn your brains on,
01:35:09.400 | because the folks this morning got this very quickly, which was super impressive.
01:35:13.880 | I'm going to plot y_train against m.predict x_train.
01:35:22.640 | Where I hit go, what is this going to look like?
01:35:34.440 | It should basically be the same.
01:35:37.380 | Our predictions hopefully are the same as the actuals, so this should fall on a line.
01:35:42.240 | But there's some randomness, so I should have used scatterplot.
01:35:52.120 | So that's cool.
01:35:54.720 | That was the easy one.
01:35:58.640 | Let's now do the hard one, the fun one.
01:36:04.200 | What's that going to look like?
01:36:21.240 | So I'm going to say no, but nice try.
01:36:24.760 | It's like, hey, we're extrapolating to the validation.
01:36:28.120 | That's what I'd like it to look like, but that's not what it is going to look like.
01:36:37.920 | Think about what trees do, and think about the fact that we have a validation set here
01:36:46.100 | and a training set here.
01:36:50.120 | So think about a forest is just a bunch of trees, so the first tree is going to have
01:36:57.640 | a go.
01:36:58.640 | Can you pass that to Melissa?
01:37:01.680 | >> Will it start grouping the dots?
01:37:06.760 | >> Yeah, that's what it does, but let's think about how it groups the dots.
01:37:13.540 | >> I'm guessing since all the new data is actually outside of the original scope, it's
01:37:18.720 | all going to be basically one huge group.
01:37:24.440 | Forget the forest, let's create one tree, so we're probably going to split somewhere
01:37:28.160 | around here first, and then we're going to probably split somewhere around here, and
01:37:32.840 | then we're going to split somewhere around here, and somewhere around here.
01:37:36.160 | And so our final split is here.
01:37:40.720 | So our prediction when we say, okay, let's take this one, and so it's going to put that
01:37:48.160 | through the forest and end up predicting this average.
01:37:54.080 | It can't predict anything higher than that, because there is nothing higher than that
01:37:58.840 | to average.
01:37:59.840 | So this is really important to realize, a random forest is not magic, it's just returning
01:38:05.280 | the average of nearby observations where nearby is kind of in this tree space.
01:38:12.200 | So let's run it.
01:38:13.880 | Let's see if Tim's right.
01:38:18.560 | Holy shit, that's awful.
01:38:23.880 | If you don't know how random forests work, then this is going to totally screw you.
01:38:28.640 | If you think that it's actually going to be able to extrapolate to any kind of data it
01:38:33.400 | hasn't seen before, particularly future time periods, it's just not.
01:38:39.120 | It just can't.
01:38:40.440 | It's just averaging stuff it's already seen.
01:38:42.720 | That's all it can do.
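That ceiling is easy to verify in code with the same synthetic setup: since every leaf value is an average of training targets, no prediction can ever exceed the largest target the forest trained on.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

x = np.linspace(0, 1)
y = x + np.random.uniform(-0.2, 0.2, x.shape)
x_trn, x_val, y_trn = x[:40], x[40:], y[:40]

m = RandomForestRegressor(n_estimators=40).fit(x_trn[:, None], y_trn)
preds = m.predict(x_val[:, None])

# Every prediction is an average of training-set y values, so it is
# capped at the training maximum: a flat line, not a continuing trend.
assert preds.max() <= y_trn.max() + 1e-9
```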
01:38:45.440 | So we're going to be talking about how to avoid this problem.
01:38:51.560 | We talked a little bit in the last lesson about trying to avoid it by avoiding unnecessary
01:39:00.120 | time-dependent variables where we can.
01:39:05.200 | But in the end, if you really have a time series that looks like this, we actually have
01:39:11.520 | to deal with a problem.
01:39:14.160 | So one way we could deal with a problem would be use a neural net, use something that actually
01:39:20.060 | has a function or shape that can actually fit something like this.
01:39:28.520 | So then it'll extrapolate nicely.
01:39:31.040 | Another approach would be to use all the time series techniques you guys are learning about
01:39:35.520 | in the morning class to fit some kind of time series and then detrend it.
01:39:43.000 | And so then you'll end up with detrended dots and then use the random forest to predict
01:39:47.640 | those.
01:39:49.840 | And that's particularly cool because imagine that your random forest was actually trying
01:39:56.880 | to predict data that, I don't know, maybe it was two different states.
01:40:03.320 | And so the blue ones are down here and the red ones are up here.
01:40:09.480 | Now if you try to use a random forest, it's going to do a pretty crappy job because time
01:40:15.320 | is going to seem much more important.
01:40:16.520 | So it's basically still going to split like this and then it's going to split like this.
01:40:21.760 | And then finally, once it kind of gets down to this piece, it'll be like oh, okay, now
01:40:25.760 | I can see the difference between the states.
01:40:28.940 | So in other words, when you've got this big time piece going on, you're not going to see
01:40:34.680 | the other relationships in the random forest until every tree deals with time.
01:40:41.720 | So one way to fix this would be with a gradient boosting machine, GBM.
01:40:47.520 | And what a GBM does is it creates a little tree and runs everything through that first
01:40:52.680 | little tree, which could be like the time tree, and then it calculates the residuals,
01:40:59.480 | and then the next little tree just predicts the residuals.
01:41:03.400 | So it would be kind of like detrending it.
01:41:05.480 | So GBM still can't extrapolate to the future, but at least they can deal with time-dependent
01:41:11.900 | data more conveniently.
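A two-stage toy version of that residual-fitting idea, a sketch of the mechanism rather than a full GBM (no learning rate, only two stages):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 200)[:, None]
y = 3 * x[:, 0] + rng.normal(scale=0.1, size=200)

t1 = DecisionTreeRegressor(max_depth=2).fit(x, y)        # first little tree
resid = y - t1.predict(x)                                # what it missed
t2 = DecisionTreeRegressor(max_depth=2).fit(x, resid)    # next tree fits residuals

mse1 = np.mean((y - t1.predict(x)) ** 2)
mse2 = np.mean((y - (t1.predict(x) + t2.predict(x))) ** 2)
assert mse2 <= mse1    # each residual stage can only reduce training error
```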
01:41:15.440 | So we're going to be talking about this quite a lot more over the next couple of weeks.
01:41:18.680 | And in the end, the solution is going to be just use neural nets.
01:41:24.120 | But for now, using some kind of time series analysis, detrend it, and then use a random
01:41:30.040 | forest on that isn't a bad technique at all.
01:41:33.280 | And if you're playing around with something like the Ecuador groceries competition, that
01:41:37.080 | would be a really good thing to fiddle around with.
01:41:40.960 | Alright, see you next time.