Lesson 7 - Deep Learning for Coders (2020)
Chapters
0:00 Weight decay (L2 regularization)
7:25 Creating our own Embedding module
12:45 Interpreting embeddings and bias
18:00 Embedding distance
20:00 Deep learning for collaborative filtering
24:09 Notebook 9 - Tabular modelling
25:30 Entity embeddings for categorical variables
30:11 Beyond deep learning for tabular data (ensembles of decision trees)
40:10 Decision trees
64:00 Random forests
72:10 Out-of-bag error
74:00 Model interpretation
94:00 Extrapolation
103:00 Using a NN
109:20 Ensembling
117:40 Conclusion
00:00:00.000 |
Hi everybody and welcome to lesson 7. We're going to start by having a look at a kind 00:00:07.520 |
of regularization called weight decay. And the issue that we came to at the end of the 00:00:13.400 |
last lesson is that we were training our simple dot product model with bias, and our loss 00:00:22.400 |
started going down and then it started going up again. And so we have a problem that we 00:00:29.280 |
are overfitting. And remember in this case we're using mean squared error. So try to 00:00:35.560 |
recall why it is that we don't need a metric here, because mean squared error is pretty 00:00:42.760 |
much the thing we care about really, or we could use mean absolute error if we like, 00:00:47.800 |
but either of those works fine as a loss function. They don't have the problem of big flat areas 00:00:52.720 |
like accuracy does for classification. So what we want to do is to make it less likely 00:01:00.400 |
that we're going to overfit by doing something we call reducing the capacity of the model. 00:01:06.120 |
The capacity of the model is basically how much space does it have to find answers. And 00:01:11.800 |
if it can kind of find any answer anywhere, those answers can include basically memorizing 00:01:18.320 |
the data set. So one way to handle this would be to decrease the number of latent factors. 00:01:27.320 |
But generally speaking, reducing the number of parameters in a model, particularly as 00:01:32.780 |
we look at more deep learning style models, ends up biasing the models towards very simple 00:01:40.720 |
kind of shapes. So there's a better way to do it rather than reducing the number of parameters. 00:01:47.840 |
And we try to force the parameters to be smaller, unless they're really required to be big. 00:01:55.480 |
And the way we do that is with weight decay. Weight decay is also known as L2 regularization. 00:02:00.960 |
They're very slightly different, but we can think of them as the same thing. And what 00:02:05.320 |
we do is we change our loss function, and specifically we change the loss function by 00:02:10.120 |
adding to it the sum of all the weights squared - in fact, the sum of all the parameters squared, 00:02:17.900 |
I should say. Why do we do that? Well, because if that's part of the loss function, then 00:02:24.460 |
one way to decrease the loss would be to decrease the weights, one particular weight or all 00:02:30.360 |
of the weights or something like that. And so when we decrease the weights, if you think 00:02:38.980 |
about what that would do, then think about, for example, the different possible values 00:02:49.100 |
of a in y equals ax squared. The larger a is - for example, a is 50 - the narrower and sharper 00:02:57.080 |
the peak you get. In general, big coefficients are going to cause big swings: big changes 00:03:05.800 |
in the loss from small changes in the parameters. And when you have these kinds of sharp peaks 00:03:13.660 |
or valleys, it means that a small change to the input can make 00:03:22.520 |
a big change to the loss. And so if you're in that 00:03:27.760 |
situation, then you can basically fit all the data points close to exactly with a really 00:03:33.480 |
complex jagged function with sharp changes, which exactly tries to sit on each data point 00:03:41.020 |
rather than finding a nice smooth surface which connects them all together or goes through 00:03:46.660 |
them all. So if we limit our weights by adding in the loss function, the sum of the weights 00:03:54.580 |
squared, then what it's going to do is it's going to fit less well on the training set 00:04:00.760 |
because we're giving it less room to try anything that it wants to, but we're going to hope 00:04:05.320 |
that it would result in a better loss on the validation set or the test set so that it 00:04:10.740 |
will generalize better. One way to think about this is that the loss with weight decay is 00:04:17.020 |
just the loss plus the sum of the parameters squared times some number we pick, a hyperparameter. 00:04:27.500 |
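As a rough sketch in code - with a toy parameter tensor and a stand-in loss, and wd as the hyperparameter just mentioned - it looks like this:

    import torch

    # Minimal sketch of weight decay: a toy parameter tensor and a made-up loss,
    # with wd as the hyperparameter just mentioned (something like 0.1, 0.01 or 0.001).
    wd = 0.1
    parameters = torch.randn(10, requires_grad=True)
    loss = (parameters * 2).pow(2).mean()              # stand-in for the real model's loss

    loss_with_wd = loss + wd * (parameters ** 2).sum() # loss plus the sum of squared parameters
    loss_with_wd.backward()

    # Equivalently, since only the gradient matters for SGD, you can leave the loss alone
    # and just add the term to the gradients (explained just below):
    # parameters.grad += wd * 2 * parameters.detach()
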
This is like 0.1 or 0.01 or 0.001 kind of region. So this is basically what loss with 00:04:35.680 |
weight decay looks like in this equation. But remember when it actually comes to what's, 00:04:40.500 |
how is the loss used in stochastic gradient descent? It's used by taking its gradient. 00:04:45.940 |
So what's the gradient of this? Well, if you remember back to when you first learned calculus, 00:04:52.880 |
it's okay if you don't. The gradient of something squared is just two times that something. We've 00:05:00.200 |
switched from 'parameters' to 'weight' here, which is a bit confusing, so we'll just use 'weight' to 00:05:05.800 |
keep it consistent - maybe 'parameters' would be better. So the derivative of weight squared is just 00:05:12.880 |
two times weight. So in other words, to add in this term to the gradient, we can just 00:05:20.480 |
add to the gradients weight decay times two times weight. And since weight decay is just 00:05:28.420 |
a hyperparameter, we can just replace it with weight decay times two. So that would just 00:05:32.500 |
give us weight decay times weight. So weight decay refers to adding to the gradients 00:05:42.960 |
the weights times some hyperparameter. And so that is going to try to create these kind 00:05:48.780 |
of more shallow, less bumpy surfaces. So to do that, we can simply, when we call fit or 00:05:59.260 |
fit one cycle or whatever, we can pass in a WD parameter and that's just this number 00:06:07.100 |
here. So if we pass in 0.1, then the training loss goes from 0.29 to 00:06:15.660 |
0.49. That's much worse, right, because we can't overfit anymore. The validation 00:06:20.980 |
loss goes from 0.89 to 0.82 - much better. So this is an important 00:06:27.740 |
thing to remember for those of you that have done a lot of more traditional statistical 00:06:31.820 |
models is in kind of more traditional statistical models, we try to avoid overfitting and we 00:06:37.800 |
try to increase generalization by decreasing the number of parameters. But in a lot of 00:06:44.040 |
modern machine learning and certainly deep learning, we tend to instead use regularization 00:06:51.660 |
such as weight decay because it gives us more flexibility. It lets us use more nonlinear 00:06:56.820 |
functions and still, you know, still reduces the capacity of the model. Great. So we're 00:07:03.940 |
down to 0.823. This is a good model. This is really actually a very good 00:07:08.780 |
model. And so let's dig into actually what's going on here because in our, in our architecture, 00:07:18.300 |
remember we basically just had four embedding layers. So what's an embedding layer? We've 00:07:24.580 |
described it conceptually, but let's write our own. And remember we said that an embedding 00:07:29.980 |
layer is just a computational shortcut for doing a matrix multiplication by a one hot 00:07:35.380 |
encoded matrix and that that is actually the same as just indexing into an array. So an 00:07:43.860 |
embedding is just a indexing into an array. And so it's nice to be able to create our 00:07:50.380 |
own versions of things that exist in PyTorch and fast.ai. So let's do that for embedding. 00:07:56.780 |
So if we're going to create our own kind of layer, which is pretty cool, we need to be 00:08:02.500 |
aware of something, which is normally a layer is basically created by inheriting as we've 00:08:09.660 |
discussed from module or nn.module. So for example, this is an example here of a module 00:08:15.060 |
where we've created a class called t that inherits from module. And when it's constructed, 00:08:20.340 |
remember that's what dunder init does. This is just a dummy little 00:08:25.020 |
module here: we're going to set self.a to the number one repeated three times, as a tensor. 00:08:31.820 |
Now if you remember back to notebook four, we talked about how the optimizers in PyTorch 00:08:37.820 |
and fast.ai rely on being able to grab the parameters attribute to find a list of all 00:08:42.900 |
the parameters. Now if you want to be able to optimize self.a, it would need to appear 00:08:48.860 |
in parameters, but actually there's nothing there. Why is that? That's because PyTorch 00:08:56.340 |
does not assume that everything that's in a module is something that needs to be learned. 00:09:01.460 |
To tell it that it's something that needs to be learned, you have to wrap it with nn.parameter. 00:09:05.900 |
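A quick sketch of the difference (plain nn.Module is used here, so super().__init__() is called explicitly; fastai's Module, used in the lesson, does that for you):

    import torch
    from torch import nn

    class TPlain(nn.Module):
        def __init__(self):
            super().__init__()
            self.a = torch.ones(3)                 # plain tensor: NOT registered as a parameter

    class TParam(nn.Module):
        def __init__(self):
            super().__init__()
            self.a = nn.Parameter(torch.ones(3))   # wrapped: shows up in .parameters(), requires_grad=True

    print(list(TPlain().parameters()))   # []
    print(list(TParam().parameters()))   # [Parameter containing: tensor([1., 1., 1.], requires_grad=True)]
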
So here's exactly the same class, but torch.ones, which is just a list of three, three ones 00:09:12.500 |
in this case is wrapped in nn.parameter. And now if I go parameters, I see I have a parameter 00:09:20.340 |
with three ones in it. And that's going to automatically call requires grad underscore 00:09:25.300 |
for us as well. We haven't had to do that for things like nn.linear in the past because 00:09:32.460 |
PyTorch automatically uses nn.parameter internally. So if we have a look at the parameters for 00:09:37.820 |
something that uses nn.linear with no bias layer, you'll see again we have here a parameter 00:09:44.900 |
with three things in it. So we want to in general be able to create a parameter. So 00:09:55.220 |
something with a tensor with a bunch of things in and generally we want to randomly initialize 00:09:59.260 |
them. So to randomly initialize, we can pass in the size we want. We can initialize a tensor 00:10:04.980 |
of zeros of that size and then randomly generate some normal, normally distributed random numbers 00:10:10.820 |
with a mean of zero and a deviation of 0.01. No particular reason I'm picking those numbers 00:10:16.220 |
just to show how this works. So here's something that will give us back a set of parameters 00:10:22.380 |
of any size we want. And so now we're going to replace everywhere that used to say embedding. 00:10:28.140 |
I'm going to replace it with create_params. Everything else here is the same in the 00:10:34.840 |
dunder init. And then the forward is very, very similar to before. As you can see, I'm 00:10:40.380 |
grabbing the zero index column from x, that's my users, and I just look it up as you see 00:10:49.260 |
in that user factors array. And the cool thing is I don't have to do anything with gradients 00:10:54.320 |
myself for this manual embedding layer because PyTorch can figure out the gradients automatically 00:10:59.500 |
as we've discussed. But then I just got the dot product as before, add on the bias as 00:11:03.700 |
before, do the sigmoid range as before. And so here's a dot product bias without any special 00:11:10.820 |
PyTorch layers and we fit and we get the same result. So I think that is pretty amazingly 00:11:18.580 |
cool. We've really shown that the embedding layer is nothing fancy, is nothing magic, right? 00:11:25.900 |
It's just indexing into an array. So hopefully that removes a bit of the mystery for you. 00:11:32.420 |
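Put together, the manual version described here looks roughly like this (names follow the lesson's notebook; details are approximate, and plain nn.Module stands in for fastai's Module):

    import torch
    from torch import nn

    def create_params(size):
        # a randomly initialised tensor, wrapped so the optimizer will train it
        return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

    class DotProductBias(nn.Module):
        def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
            super().__init__()
            self.user_factors  = create_params([n_users, n_factors])
            self.user_bias     = create_params([n_users])
            self.movie_factors = create_params([n_movies, n_factors])
            self.movie_bias    = create_params([n_movies])
            self.y_range = y_range

        def forward(self, x):
            # x[:,0] is the user index, x[:,1] the movie index; "embedding" is just indexing
            users  = self.user_factors[x[:, 0]]
            movies = self.movie_factors[x[:, 1]]
            res = (users * movies).sum(dim=1)
            res += self.user_bias[x[:, 0]] + self.movie_bias[x[:, 1]]
            lo, hi = self.y_range
            return torch.sigmoid(res) * (hi - lo) + lo   # same idea as fastai's sigmoid_range
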
So let's have a look at this model that we've created and we've trained and find out what 00:11:38.100 |
it's learned. That's already useful. We've got something we can make pretty accurate 00:11:41.620 |
predictions with. But let's find out what the model looks like. Oh, 00:11:49.180 |
we have a question - okay, let's take the question before we look at this. What's 00:11:55.700 |
the advantage of creating our own embedding layer over the stock PyTorch one? Oh, nothing 00:12:02.220 |
at all. We're just showing that we can. It's great to be able to dig under the surface 00:12:06.100 |
because at some point you'll want to try doing new things. So a good way to learn to do new 00:12:10.520 |
things is to be able to replicate things that already exist, so you can check that you 00:12:15.380 |
understand how they work. It's also a great way to understand the foundations of what's 00:12:19.180 |
going on: to actually code your own implementation. But I wouldn't expect 00:12:24.980 |
you to use this implementation in practice. But basically it removes all the mystery. So 00:12:32.940 |
if you remember we've created a learner called learn and to get to the model that's inside 00:12:37.780 |
it, you can always call learn.model and then inside that there's going to be automatically 00:12:45.180 |
created for it. Well, sorry, not automatically. We've created all these attributes movie factors 00:12:49.420 |
movie bias and so forth. So we can grab learn.model.movie_bias. And now what I'm going to do is I'm going 00:12:59.220 |
to sort that vector and I'm going to print out the first five titles. And so what this 00:13:08.180 |
is going to do is it's going to print out the movies with the smallest bias and here 00:13:15.220 |
they are. What does this mean? Well, it kind of means these are the five movies that people 00:13:22.620 |
really didn't like. But it's more than that. It's not only do people not like them, but 00:13:29.180 |
if we take account of the genre they're in, the actors they have, you know, whatever the 00:13:34.620 |
latent factors are, people liked them a lot less than they expected. So maybe for example, 00:13:41.380 |
this is kind of - well, I haven't seen any of these movies, luckily perhaps. This one is a sci-fi movie, 00:13:49.900 |
so people who generally like sci-fi movies found it so bad they still didn't like 00:13:54.100 |
it. So we can do the exact opposite, which is to sort descending. And here are the top five 00:14:02.460 |
movies and specifically they're the top five by bias, right? So these are the movies that 00:14:07.700 |
even after you take account of the fact that LA Confidential, I have seen all of these 00:14:11.900 |
ones. So LA Confidential is a kind of a murder mystery cop movie, I guess. And people who 00:14:18.860 |
don't necessarily like that genre or I think Guy Pearce was in it. So maybe they don't like 00:14:22.700 |
Guy Pearce very much, whatever. People still like this movie more than they expect. So 00:14:29.340 |
this is a kind of a nice thing that we can look inside our model and see what it's learned. 00:14:35.060 |
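That step is roughly the following (assuming the title vocab lives at dls.classes['title'], as in the lesson's notebook):

    # Lowest-bias movies: the ones people liked much less than their factors would predict
    movie_bias = learn.model.movie_bias.squeeze()
    idxs = movie_bias.argsort()[:5]
    print([dls.classes['title'][int(i)] for i in idxs])

    # Highest-bias movies: sort descending instead
    idxs = movie_bias.argsort(descending=True)[:5]
    print([dls.classes['title'][int(i)] for i in idxs])
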
We can look at not only at the bias vector, but we can also look at the factors. Now there 00:14:43.940 |
are 50 factors, which is too many to visualize. So we can use a technique called PCA, Principal 00:14:50.020 |
Component Analysis. The details don't matter, but basically it's going to squish those 00:14:53.780 |
50 factors down to three. And then we'll plot the top two as you can see here. And what we 00:15:04.060 |
see when we plot the top two is we can kind of see that the movies have been kind of spread 00:15:11.380 |
out across a space of some kind of latent factors. And so if you look at the far right, 00:15:18.700 |
there's a whole bunch of kind of big-budget things. And on the far left, there's 00:15:25.540 |
more like cult kind of things, Fargo, Schindler's List, Monty Python. By the same token at the 00:15:33.660 |
bottom, we've got The English Patient, When Harry Met Sally - so kind of romance/drama kind of 00:15:42.860 |
stuff. And at the top, we've got action and sci-fi kind of stuff. So you can see even 00:15:50.660 |
as though we haven't asked in any information about these movies, all we've seen is who 00:15:57.900 |
likes what. These latent factors have automatically kind of figured out a space or a way of thinking 00:16:05.380 |
about these movies based on what kinds of movies people like and what other kinds of 00:16:09.660 |
movies they like along with those. But that's really interesting to kind of try and visualize 00:16:15.300 |
what's going on inside your model. Now we don't have to do all this manually. We can 00:16:25.540 |
actually just say give me a collab learner using this set of data loaders with this number 00:16:32.120 |
of factors and this y range, and it does everything we've just seen, getting again about the same number. 00:16:37.980 |
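That call looks roughly like this (the learning rate is illustrative; dls is the DataLoaders built earlier):

    from fastai.collab import collab_learner

    learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
    learn.fit_one_cycle(5, 5e-3, wd=0.1)   # weight decay as discussed earlier
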
Okay, so now you can see this is nice, right? We've actually been able to see right underneath 00:16:43.100 |
inside the collab learner part of the fast AI application, the collaborative filtering 00:16:48.060 |
application and we can build it all ourselves from scratch. We know how to create the SGD, 00:16:54.300 |
know how to create the embedding layer, we know how to create the model, the architecture. 00:17:01.500 |
So now you can see, you know, we really can build up from scratch our own version 00:17:06.580 |
of this. So if we just type learn.model, you can see here the names are a bit more generic. 00:17:13.220 |
This is a user weight, item weight, user bias, item bias, but it's basically the same stuff 00:17:17.700 |
we've seen before. And we can replicate the exact analysis we saw before by using this 00:17:24.240 |
same idea. Okay, slightly different order this time because it is a bit random but pretty 00:17:34.660 |
similar as well. Another interesting thing we can do is we can think about the distance 00:17:41.880 |
between two movies. So let's grab all the movie factors and just pop them into a variable 00:17:51.220 |
and then let's pick a movie and then let's find the distance from that movie to every 00:18:05.160 |
other movie. And so one way of thinking about distance is you might recall the Pythagorean 00:18:10.340 |
formula or the distance on the hypotenuse of a triangle, which is also the distance 00:18:17.820 |
to a point in a Cartesian plane on a chart, which is root x squared plus y squared. You 00:18:25.020 |
might know, it doesn't matter if you don't, but you can do exactly the same thing for 00:18:28.760 |
50 dimensions. It doesn't just work for two dimensions. That tells you how 00:18:36.020 |
far away a point is from another point, if x and y are actually the differences between 00:18:43.620 |
two movie vectors. So then what gets interesting is you can then divide that 00:18:58.020 |
by the lengths, to make all the lengths the same, and find the angle 00:19:03.620 |
between any two movies and that actually turns out to be a really good way to compare the 00:19:07.620 |
similarity of two things. That's called cosine similarity. And so the details don't matter. 00:19:12.120 |
You can look them up if you're interested. But the basic idea here is to see that we 00:19:16.340 |
can actually pick a movie and find the movie that is the most similar to it based on these embedding distances. 00:19:28.740 |
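A sketch of that lookup (attribute names follow fastai's collab model, where the item embedding is i_weight; the title string is just an example from MovieLens):

    from torch import nn

    movie_factors = learn.model.i_weight.weight
    idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
    distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
    nearest = distances.argsort(descending=True)[1]   # index 0 is the movie itself
    print(dls.classes['title'][int(nearest)])
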
What motivated learning a 50-dimensional embedding and then using PCA to reduce it to three? 00:19:36.060 |
Oh, because the purpose of this was actually to create a good model. The visualization 00:19:42.100 |
part is normally kind of the exploration of what's going on in your model. And so 00:19:47.780 |
with 50 latent factors, you're going to get a more accurate model. So that's one 00:19:54.660 |
approach: this dot product version. There's another version we could use, which is we 00:20:02.660 |
could create a set of user factors and a set of item factors and just like before we could 00:20:12.620 |
look them up. But what we could then do instead of doing a dot product, we could concatenate 00:20:18.140 |
them together into a tensor that contains both the user and the movie factors next to 00:20:26.340 |
each other. And then we could pass them through a simple little neural network: linear, ReLU, linear. 00:20:39.280 |
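A sketch of that model, along the lines of the lesson's notebook (nn.Embedding stands in for fastai's Embedding, which behaves the same for this purpose):

    import torch
    from torch import nn

    class CollabNN(nn.Module):
        def __init__(self, user_sz, item_sz, y_range=(0, 5.5), n_act=100):
            super().__init__()
            self.user_factors = nn.Embedding(*user_sz)   # user_sz = (n_users, n_user_factors)
            self.item_factors = nn.Embedding(*item_sz)   # item_sz = (n_items, n_item_factors)
            self.layers = nn.Sequential(
                nn.Linear(user_sz[1] + item_sz[1], n_act),
                nn.ReLU(),
                nn.Linear(n_act, 1))
            self.y_range = y_range

        def forward(self, x):
            embs = self.user_factors(x[:, 0]), self.item_factors(x[:, 1])
            out = self.layers(torch.cat(embs, dim=1))    # concatenate, then linear -> ReLU -> linear
            lo, hi = self.y_range
            return torch.sigmoid(out) * (hi - lo) + lo

In the notebook the two embedding sizes come from fastai's get_emb_sz helper, which is described just below.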
So importantly here, the first linear layer, the number of inputs is equal to the number 00:20:45.020 |
of user factors plus the number of item factors. And the number of outputs is however many 00:20:51.360 |
activations we have. And then we just default to 100 here. And then the final layer will 00:21:00.160 |
go from 100 to 1 because we're just making one prediction. And so we could create, we'll 00:21:06.060 |
call that collab nn. We can instantiate that to create a model. We can create a learner 00:21:11.060 |
and we can fit. It's not going quite as well as before. It's not terrible, but it's not 00:21:16.860 |
quite as good as our dot product version. But the interesting thing here is it does give 00:21:21.900 |
us some more flexibility, which is that since we're not doing a dot product, we can actually 00:21:27.060 |
have a different embedding size for each of users versus items. And actually fast.ai has 00:21:35.220 |
a simple heuristic. If you call get embedding size and pass in your data loaders, it will 00:21:40.540 |
suggest appropriate size embedding matrices for each of your categorical variables, each 00:21:49.060 |
of your user and item columns in this case. So if we pass in *embs, that's going 00:22:02.020 |
to pass in the user tuple and the item tuple, which we can then pass to Embedding. This is 00:22:11.340 |
the * prefix we learned about in the last class in case you forgot. So this is kind 00:22:17.660 |
of interesting. We can, you know, we can see here that there's two different architectures 00:22:23.260 |
we could pick from. It wouldn't be necessarily obvious ahead of time which one's going to 00:22:26.660 |
work better. In this particular case, the simplest one, the dot product one, actually turned out 00:22:32.940 |
to work a bit better, which is interesting. This particular version here, if you call 00:22:37.580 |
collab_learner and pass use_nn = true, then what that's going to do is it's going to use 00:22:44.940 |
this version, the version with concatenation and the linear layers. So collab_learner, use_nn 00:22:56.980 |
= true, again we get about the same result, as you'd expect, because it's just a shortcut 00:23:01.020 |
for this version. And it's interesting actually, we have a look at collab_learner, it actually 00:23:09.100 |
returns an object of type EmbeddingNN, and it's kind of cool: if you look inside the fastai 00:23:14.420 |
source code, or use the double question mark trick to see the source code for EmbeddingNN, 00:23:18.020 |
you'll see it's three lines of code. How does that happen? Because we're using this 00:23:24.340 |
thing called TabularModel, which we will learn about in a moment, but basically this 00:23:32.620 |
neural net version of collaborative filtering is literally just a TabularModel in which 00:23:37.740 |
we pass no continuous variables and some embedding sizes. So we'll see that in a moment. 00:23:50.080 |
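As a sketch, the application-level call is roughly this, and the comment paraphrases what EmbeddingNN boils down to (a paraphrase, not the exact fastai source):

    from fastai.collab import collab_learner

    # The neural-net version via the application API; layers= is optional and illustrative
    learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100, 50])
    learn.fit_one_cycle(5, 5e-3, wd=0.1)

    # EmbeddingNN itself is roughly just a TabularModel with no continuous variables
    # and a single output:
    # class EmbeddingNN(TabularModel):
    #     def __init__(self, emb_szs, layers, **kwargs):
    #         super().__init__(emb_szs, n_cont=0, out_sz=1, layers=layers, **kwargs)
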
Okay so that is collaborative filtering, and again take a look at the further research 00:23:55.060 |
section in particular after you finish the questionnaire, because there's some really 00:23:59.740 |
important next steps you can take to push your knowledge and your skills. 00:24:06.620 |
So let's now move to notebook 9, Tabular. And we're going to look at tabular modeling 00:24:14.100 |
and do a deep dive. And let's start by talking about this idea that we were starting to see 00:24:19.180 |
here, which is embeddings. And specifically let's move beyond just having embeddings for 00:24:28.100 |
users and items, but embeddings for any kind of categorical variable. So really because 00:24:34.860 |
we know an embedding is just a lookup into an array, it can handle any kind of discrete 00:24:43.700 |
categorical data. So things like age are not discrete, they're continuous numerical data, 00:24:49.140 |
but something like sex or postcode are categorical variables. They have a certain number of discrete 00:24:56.420 |
levels. The number of discrete levels they have is called their cardinality. So to have 00:25:02.740 |
a look at an example of a dataset that contains both categorical and continuous variables, 00:25:10.060 |
we're going to look at the Rossmann sales competition that ran on Kaggle a few years ago. And so 00:25:16.540 |
basically what's going to happen is we're going to see a table that contains information 00:25:21.580 |
about various stores in Germany, and the goal will be to try and predict how many sales 00:25:26.940 |
there's going to be for each day in a couple of week period for each store. 00:25:34.700 |
One of the interesting things about this competition is that one of the gold medalists used deep 00:25:39.980 |
learning, and it was one of the earliest known examples of a state-of-the-art deep learning 00:25:45.980 |
tabular model. I mean this is not long ago, 2015 or something, but really this idea of 00:25:52.660 |
creating state-of-the-art tabular models with deep learning has not been very common and 00:25:58.860 |
for not very long. You know interestingly compared to the other gold medalists in this 00:26:04.020 |
competition, the folks that use deep learning used a lot less feature engineering and a 00:26:08.500 |
lot less domain expertise. And so they wrote a paper called Entity Embeddings of Categorical 00:26:13.540 |
Variables, in which they basically described the exact thing that you saw in notebook 8, 00:26:21.940 |
the way you can think of one-hot encodings as just being embeddings, you can concatenate 00:26:27.420 |
them together, and you can put them through a couple of layers, they call them dense layers, 00:26:33.140 |
we've called them linear layers, and create a neural network out of that. So this is really 00:26:38.940 |
a neat, you know, kind of simple and obvious hindsight trick. And they actually did exactly 00:26:45.940 |
what we did in the paper, which is to look at the results of the trained embeddings. 00:26:52.900 |
And so for example they had an embedding matrix for regions in Germany, because there wasn't 00:27:02.300 |
really metadata about this, these were just learned embeddings, just like we learned embeddings 00:27:06.120 |
about movies. And so then they just created, just like we did before, a chart where they 00:27:12.300 |
plotted each region according to, I think, probably a PCA of their embeddings. And then if you 00:27:18.820 |
circle the ones that are close to each other in blue, you'll see that they're actually 00:27:24.140 |
close to each other in Germany, and ditto for red, and ditto for green, and then here's 00:27:30.580 |
the brown. So this is like pretty amazing, is the way that we can see that it's kind 00:27:38.660 |
of learned something about what Germany looks like, based entirely on the purchasing behavior 00:27:44.180 |
of people in those states. Something else they did was to look at every store, and they 00:27:50.420 |
looked at the distance between stores in practice, like how many kilometers away they are. And 00:27:58.100 |
then they looked at the distance between stores in terms of their embedding distance, just 00:28:03.700 |
like we saw in the previous notebook. And there was this very strong correlation that 00:28:09.260 |
stores that were close to each other physically ended up having close embeddings as well, 00:28:18.180 |
even though the actual location of these stores in physical space was not part of the model. 00:28:26.180 |
Ditto with days of the week, so the days of the week or another embedding, and the days 00:28:32.100 |
of the week that were next to each other, ended up next to each other in embedding space, 00:28:37.740 |
and ditto for months of the year. So pretty fascinating the way kind of information about 00:28:44.900 |
the world ends up captured just by looking at training embeddings, which as we know are 00:28:50.700 |
just index lookups into an array. So the way we then combine these categorical variables 00:29:00.220 |
with these embeddings with continuous variables, what was done in both the entity embedding 00:29:06.620 |
paper that we just looked at, and then also described in more detail by Google when they 00:29:13.060 |
described how their recommendation system in Google Play works. This is from Google's 00:29:18.180 |
paper, is they have the categorical features that go through the embeddings, and then there 00:29:23.260 |
are continuous features, and then all the embedding results and the continuous features 00:29:27.940 |
are just concatenated together into this big concatenated table that then goes through 00:29:32.700 |
this case three layers of a neural net, and interestingly they also take the kind of collaborative 00:29:40.620 |
filtering bit and do the product as well and combine the two. So they use both of the tricks 00:29:46.340 |
were used in the previous notebook and combine them together. So that's the basic idea we're 00:29:54.340 |
going to be seeing for moving beyond just collaborative filtering, which is just two 00:30:01.180 |
categorical variables to as many categorical and as many continuous variables as we like. 00:30:07.340 |
But before we do that, let's take a step back and think about other approaches, because 00:30:12.900 |
as I mentioned, the idea of deep learning as a kind of a best practice for tabular data 00:30:19.940 |
is still pretty new and it's still kind of controversial. It's certainly not always the 00:30:25.500 |
case that it's the best approach. So when we're not using deep learning, what would 00:30:30.980 |
we be using? Well, what we'd probably be using is something called an ensemble of decision 00:30:36.620 |
trees and the two most popular are random forests and gradient boosting machines or 00:30:43.140 |
something similar. So basically between multi-layered neural networks, like with SGD and ensemble 00:30:49.900 |
of decision trees, that kind of covers the vast majority of approaches that you're likely 00:30:55.700 |
to see for tabular data. And so we're going to make sure we cover them both of course today, 00:31:01.820 |
in fact. So although deep learning is nearly always clearly superior for stuff like images 00:31:09.580 |
and audio and natural language text, these two approaches tend to give somewhat similar 00:31:15.820 |
results a lot of the time for tabular data. So let's take a look. You know, you really 00:31:21.820 |
should generally try both and see which works best for you for each problem you look at. 00:31:28.660 |
Why does the range go from 0 to 5.5 if the maximum is 5? 00:31:38.140 |
That's a great question. The reason is if you think about it for sigmoid, it's actually 00:31:43.740 |
impossible for a sigmoid to get all the way to the top or all the way to the bottom. Those 00:31:49.180 |
are asymptotes. So no matter how far, how big your x is, it can never quite get to the 00:31:54.780 |
top or no matter how small it is, it can never quite get to the bottom. So if you want to 00:31:58.580 |
be able to actually predict a rating of 5, then you need to use something a bit higher than 5 as the top of your range. 00:32:07.300 |
Are embeddings used only for highly cardinal categorical variables, or is this approach 00:32:12.380 |
used in general? For low cardinality, can one use a one-hot encoding? 00:32:18.500 |
I'll remind you cardinality is the number of discrete levels in a variable. And remember 00:32:29.180 |
that an embedding is just a computational shortcut for a one-hot encoding. So there's 00:32:36.180 |
really no reason to use a one-hot encoding because it's, as long as you have more than 00:32:42.260 |
two levels, it's always going to be more memory and slower, and give you exactly, mathematically, 00:32:48.060 |
the same thing. And if there's just two levels, then it is basically identical. So there isn't really any reason to use one-hot encoding. 00:32:58.180 |
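As a tiny demo of that equivalence (arbitrary sizes, just to illustrate the point made back in the embedding discussion):

    import torch

    emb = torch.randn(7, 4)          # an "embedding matrix": 7 levels, 4 factors
    level = 3                         # the category we want to look up

    one_hot = torch.zeros(7)
    one_hot[level] = 1.0

    # Multiplying by a one-hot vector and simply indexing give exactly the same row,
    # but the matrix multiply does far more memory access and arithmetic.
    print(torch.allclose(one_hot @ emb, emb[level]))   # True
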
Thank you for those great questions. Okay, so one of the most important things about 00:33:08.980 |
decision tree ensembles is that at the current state of the technology, they do provide faster 00:33:15.100 |
and easier ways of interpreting the model. I think that's rapidly improving for deep 00:33:19.840 |
learning models on tabular data, but that's where we are right now. They also require 00:33:24.420 |
less hyperparameter tuning, so they're easier to kind of get right the first time. So my 00:33:30.260 |
first approach for analyzing a new tabular data set is always an ensemble of decision 00:33:35.220 |
trees. And specifically, I pretty much always start with a random forest, because it's just so hard to get it wrong. 00:33:44.260 |
In your experience, for highly imbalanced data, such as fraud or medical data, what usually 00:33:50.100 |
works best out of random forest, XGBoost, or neural networks? 00:33:55.460 |
I'm not sure that whether the data is balanced or unbalanced is a key reason for choosing 00:34:03.260 |
one of those above the others. I would try all of them and see which works best. So the 00:34:09.660 |
exception to the guideline about start with decision tree ensembles is your first thing 00:34:13.820 |
to try would be if there's some very high cardinality categorical variables, then they 00:34:18.600 |
can be a bit difficult to get to work really well in decision tree ensembles. Or if there's 00:34:25.440 |
something like, most importantly, if it's like plain text data or image data or audio 00:34:29.380 |
data or something like that, then you're definitely going to need to use a neural net in there, 00:34:34.980 |
but you could actually ensemble it with a random forest, as we'll see. 00:34:40.420 |
Okay, so clearly we're going to need to understand how decision tree ensembles work. So PyTorch 00:34:50.500 |
isn't a great choice for decision tree ensembles. They're really designed for gradient-based 00:34:55.420 |
methods and random forests and decision tree growing are not really gradient-based methods 00:35:01.780 |
in the same way. So instead, we're going to use a library called scikit-learn, referred 00:35:08.640 |
to as sklearn as a module. Scikit-learn does a lot of things. We're only going to touch 00:35:16.420 |
on a tiny piece of them, stuff we need to do to train decision trees and random forests. 00:35:24.540 |
We've already mentioned before Wes McKinney's book, also a great book for understanding 00:35:28.600 |
more about scikit-learn. So the dataset for learning about decision tree ensembles is 00:35:35.500 |
going to be another dataset. It's going to, it's called the blue book for bulldozers dataset 00:35:42.660 |
and it's a Kaggle competition. So Kaggle competitions are fantastic. They are machine learning competitions 00:35:52.060 |
where you get interesting datasets, you get feedback on whether your approach is any good 00:35:56.820 |
or not. You can see on a leaderboard what approaches are working best and then you can 00:36:01.140 |
read blog posts from the winning contestants sharing tips and tricks. It's certainly not 00:36:07.940 |
a substitute for actual practice doing end-to-end data science projects, but for becoming good 00:36:19.060 |
at creating predictive models that are predictive, it's a really fantastic resource, highly recommended. 00:36:25.980 |
And you can also submit to most old competitions to see how you would have gone, without having 00:36:31.860 |
to worry about, you know, the kind of stress of like whether people will be looking at 00:36:35.980 |
your results because they're not publicized or published if you do that. 00:36:41.540 |
There's a question. Can you comment on real-time applications of random forests? In my experience, 00:36:49.900 |
they tend to be too slow for real-time use cases like a recommender system, neural network 00:36:55.680 |
is much faster when run on the right hardware. 00:36:58.860 |
Let's get to that once we've seen what they are, shall we? Now, you can't just download 00:37:08.620 |
and untar Kaggle datasets using the untar_data function that we have in fastai. So you actually 00:37:13.540 |
have to sign up to Kaggle and then follow these instructions for how to download data 00:37:20.380 |
from Kaggle. Make sure you replace creds here with what it describes. You need to get a 00:37:24.980 |
special API code and then run this one time to put that up on your server. And now you 00:37:32.020 |
can use Kaggle to download data using the API. So after we do that, we're going to end 00:37:41.720 |
up with a bunch of, as you see, CSV files. So let's take a look at this data. 00:37:49.340 |
So the main data, the main table is train.csv. Remember that's comma separated values and 00:37:55.980 |
the training set contains information such as unique identifier of a sale, the unique 00:38:00.980 |
identifier of a machine, the sale price, sale date. So what's going on here is one row of 00:38:07.100 |
the data represents a sale of a single piece of heavy machinery like a bulldozer at an 00:38:14.620 |
auction. So it happens at a date, as a price, it's of some particular piece of equipment 00:38:20.860 |
and so forth. So if we use pandas again to read in the CSV file, let's combine training 00:38:28.100 |
and valid together. We can then look at the columns, and see there are a lot of columns there 00:38:34.200 |
and many things which I don't know what the hell they mean like blade extension and pad 00:38:37.860 |
type and ride control. But the good news is we're going to show you a way that you don't 00:38:43.340 |
have to look at every single column and understand what they mean and random forests are going 00:38:48.120 |
to help us with that as well. So once again, we're going to be seeing this idea that models 00:38:53.700 |
can actually help us with data understanding and data cleanup. One thing we can look at 00:38:59.460 |
is ordinal columns, a good place to look at that now. If there's things there that you 00:39:03.920 |
know are discrete values but have some order, like product size: it has levels like large, 00:39:11.380 |
large/medium, medium, small, mini and compact. These should not be in alphabetical order or some random order, 00:39:19.340 |
they should be in this specific order, right? They have a specific ordering. So we can use 00:39:28.820 |
astype to turn it into a categorical variable, and then we can say set_categories with ordered equals 00:39:34.420 |
True to basically say this is an ordinal column. So it's got discrete values, but we actually 00:39:40.300 |
want to define what the order of the classes is. We need to choose which is the dependent 00:39:48.260 |
variable, and we do that by looking on Kaggle, and Kaggle will tell us that the thing we're 00:39:52.180 |
meant to be predicting is sale price - and actually, specifically, they'll tell us the thing we're 00:39:56.960 |
meant to be predicting is the log of sale price, because root mean squared log error 00:40:02.060 |
is what we're actually going to be judged on in the competition, so we take the log. 00:40:09.020 |
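A sketch of those two steps (ProductSize and SalePrice are the actual column names in the Kaggle data; df is the dataframe read in above):

    import numpy as np

    # Tell pandas this is an *ordered* categorical, so the levels have a meaningful order
    sizes = 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact'
    df['ProductSize'] = df['ProductSize'].astype('category')
    df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)

    # The competition metric is RMSLE, so take the log of the dependent variable up front
    dep_var = 'SalePrice'
    df[dep_var] = np.log(df[dep_var])
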
So we're now going to replace sale price with its log and that's what we'll be using from 00:40:12.940 |
now on. So a decision tree ensemble requires decision trees. So let's start by looking 00:40:20.340 |
at decision trees. So a decision tree in this case is a something that asks a series of 00:40:28.060 |
binary that is yes or no questions about data. So such as is somebody less than or greater 00:40:34.540 |
than 30? Yes they are. Are they eating healthily? Yes they are and so okay then we're going 00:40:39.700 |
to say they're fit or unfit. So like there's an example of some arbitrary decision tree 00:40:46.540 |
that somebody might have come up with. It's a series of binary yes and no choices and 00:40:51.620 |
at the bottom are leaf nodes that make some prediction. Now of course for our bulldozers 00:41:02.380 |
competition we don't know what binary questions to ask about these things and in what order 00:41:10.180 |
in order to make a prediction about sale price. So we're doing machine learning so we're going 00:41:15.180 |
to try and come up with some automated way to create the questions. And there's actually 00:41:20.700 |
a really simple procedure for doing that. You have to think about it. So if you want 00:41:24.620 |
to kind of stretch yourself here have a think about what's an automatic procedure that you 00:41:30.620 |
can come up with that would automatically build a decision tree where the final answer 00:41:36.300 |
would do a you know significantly better than random job of estimating the sale price of 00:41:44.220 |
one of these auctions. Alright so here's the approach that we could use. Loop through each 00:41:53.500 |
column of the data set. We're going to go through each of them - well, obviously not sale price, 00:41:59.300 |
that's the dependent variable - sale ID, machine ID, auctioneer, year made, etc. And so one of 00:42:05.140 |
those will be for example product size. And so then what we're going to do is we're going 00:42:11.660 |
to loop through each possible value of product size: large, large/medium, medium, etc. And 00:42:21.380 |
then we're going to do a split basically like where this comma is and we're going to say 00:42:25.260 |
okay let's get all of the auctions of large equipment and put that into one group and 00:42:32.820 |
everything that's smaller than that and put that into another group. And so that's here 00:42:38.900 |
split the data into two groups based on whether they're greater than or less than that value. 00:42:45.740 |
If it's a categorical, non-ordinal variable, it'll just be whether it's equal 00:42:49.740 |
or not equal to that level. And then we're going to find the average sale price for each 00:42:55.620 |
of the two groups. So for the large group what was the average sale price? For the smaller 00:43:00.900 |
than large group what was the average sale price? And that will be our model. Our prediction 00:43:06.940 |
will simply be the average sale price for that group. And so then you can say well how 00:43:12.460 |
good is that model? If our model was just to ask a single question with a yes/no answer 00:43:17.380 |
put things into two groups and take the average of the group as being our prediction and we 00:43:22.260 |
can say how good would that model be? What would be the root mean squared error from 00:43:26.140 |
that model? And so we can then say all right how good would it be if we use large as a 00:43:32.580 |
split? And then let's try again what if we did large/medium as a split? What if we did 00:43:38.260 |
medium as a split? And so in each case we can find the root mean squared error of that 00:43:42.180 |
incredibly simple model. And then once we've done that for all of the product size levels 00:43:47.020 |
we can go to the next column and look at level of usage band and do every level of usage 00:43:55.380 |
band and then state, every level of state and so forth. And so there'll be some variable 00:44:02.860 |
and some split level which gives the best root mean squared error of this really really 00:44:09.540 |
simple model. And so then we'll say okay that would be our first binary decision. It gives 00:44:16.220 |
us two groups and then we're going to take each one of those groups separately and find 00:44:22.580 |
another single binary decision for each of those two groups using exactly the same procedure. 00:44:28.820 |
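Before trying it yourself, here's a rough, unoptimized sketch of the inner "find the best single binary split" step (a hypothetical helper, not sklearn's implementation):

    import numpy as np

    def best_binary_split(df, dep_var, candidate_cols):
        """Greedy search as described above. Assumes numeric or ordered columns;
        for unordered categoricals you'd test equality instead of <=."""
        best = None
        for col in candidate_cols:                        # loop through each column...
            for val in df[col].dropna().unique():         # ...and each value it takes
                lhs = df[col] <= val                      # split the rows into two groups
                if lhs.all() or (~lhs).all(): continue    # ignore splits that leave a group empty
                pred = np.where(lhs, df.loc[lhs, dep_var].mean(), df.loc[~lhs, dep_var].mean())
                rmse = np.sqrt(((pred - df[dep_var]) ** 2).mean())
                if best is None or rmse < best[2]:
                    best = (col, val, rmse)               # best single binary decision so far
        return best
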
So then we'll have four groups and then we'll do exactly the same thing again separately 00:44:33.460 |
for each of those four groups and so forth. So let's see what that looks like and in fact 00:44:44.180 |
once we've gone through this you might even want to see if you can implement this algorithm 00:44:47.500 |
yourself. It's not trivial but it doesn't require any special coding skills so hopefully 00:44:55.020 |
you can find you'll be able to do it. There's a few things we have to do before we can actually 00:45:00.820 |
create a decision tree in terms of just some basic data munching. One is if we're going 00:45:06.460 |
to take advantage of dates we actually want to call fastai's add_datepart function, and 00:45:13.660 |
what that does, as you see after we call it, is it creates a whole bunch of 00:45:18.980 |
different bits of metadata from that date: sale year, sale month, sale week, sale 00:45:24.380 |
day and so forth. A date by itself doesn't have a whole lot of information directly, 00:45:35.020 |
but we can pull lots of different information out of it. And so this is an example of something 00:45:39.340 |
called feature engineering, which is where we take some piece of data and 00:45:44.220 |
we try to create lots of other pieces of data from it. So: is this particular date 00:45:50.180 |
the end of a month or not? The end of a year or not? And so forth. So that handles 00:45:56.940 |
dates. There's a bit more cleaning we want to do, and fastai provides some things to make 00:46:03.700 |
cleaning easier. We can use the tabular pandas class to create a tabular data set in pandas. 00:46:13.460 |
And specifically we're going to use two tabular processes or tabular procs. A tabular processor 00:46:19.940 |
is basically just a transform and we've seen transforms before so go back and remind yourself 00:46:24.580 |
what a transform is. Except it's just slightly different it's like three lines of code if 00:46:30.620 |
you look at the code for it. It's actually going to modify the object in place rather 00:46:36.260 |
than creating a new object and giving it back to you. And that's because often these tables 00:46:40.420 |
of data are kind of really big and we don't want to waste lots of RAM. And it's just going 00:46:46.300 |
to run the transform once and save the result rather than doing it lazily when you access 00:46:51.060 |
it for the same reason. We're just going to make this a lot faster. So you can just think 00:46:57.160 |
of them as transforms really. One of them is called categorify and categorify is going 00:47:02.020 |
to replace a column with numeric categories using the same basic idea of like a vocab 00:47:09.340 |
like we've seen before. FillMissing is going to find any columns with missing data; it's 00:47:16.240 |
going to fill in the missing data with the median of the data and create a new column, 00:47:21.100 |
a boolean column which is set to true for anything that was missing. So these two things 00:47:25.760 |
is basically enough to get you to a point where most of the time you'll be able to train 00:47:29.260 |
a model. Now the next thing we need to do is think about our validation set. As we discussed 00:47:37.340 |
in lesson one, a random validation set is not always appropriate and certainly for something 00:47:44.020 |
like predicting auction results it almost certainly is not appropriate because we're 00:47:49.260 |
going to be wanting to use a model in the future not at some random date in the past. 00:47:54.660 |
So the way this Kaggle competition was set up was that the test set the thing that you 00:48:00.680 |
had to fill in and submit for the competition was two weeks of data that was after any of 00:48:08.860 |
the training set. So we should do the same thing for a validation set. We should create 00:48:14.580 |
something which is where the validation set is the last couple of weeks of data and so 00:48:22.820 |
then the training set will only be data before that. So we basically can do that by grabbing 00:48:28.340 |
everything before October 2011, create a training and validation set based on that condition 00:48:35.260 |
and grabbing those bits. So that's going to split our training set and validation set 00:48:43.520 |
by date, not randomly. When you create a TabularPandas 00:48:50.460 |
object, you're going to be passing in a data frame, passing in your tabular 00:48:54.980 |
procs, and you also have to say what your categorical and continuous variables are. We can 00:49:00.100 |
use fastai's cont_cat_split to automatically split a data frame into continuous and categorical 00:49:07.820 |
variables for you. So we can just pass those in. Tell it what is the dependent variable, 00:49:14.940 |
you can have more than one, and what are the indexes to split into training and valid. 00:49:20.460 |
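Putting the pieces just described together, a sketch along the lines of the notebook (df is the dataframe from earlier, and the date condition follows the October 2011 cut-off just described):

    import numpy as np
    from fastai.tabular.all import (add_datepart, TabularPandas, Categorify,
                                    FillMissing, cont_cat_split)

    df = add_datepart(df, 'saledate')            # adds saleYear, saleMonth, saleDay, saleElapsed, ...

    procs = [Categorify, FillMissing]            # the two tabular procs described above
    dep_var = 'SalePrice'

    # Validation set = the last part of the data by date, not a random subset
    cond = (df.saleYear < 2011) | (df.saleMonth < 10)
    train_idx = np.where(cond)[0]
    valid_idx = np.where(~cond)[0]
    splits = (list(train_idx), list(valid_idx))

    cont, cat = cont_cat_split(df, 1, dep_var=dep_var)
    to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
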
And this is a tabular object. So it's got all the information you need about the training 00:49:24.720 |
set, the validation set, categorical and continuous variables and the dependent variable and any 00:49:30.060 |
processes to run. It looks a lot like a datasets object, but it has a .train, it has a .valid 00:49:41.060 |
and so if we have a look at .show we can see the data. But .show is going to show us the 00:49:50.740 |
kind of the string data, but if we look at .items you can see internally it's actually 00:49:56.800 |
stored these very compact numbers which we can use directly in a model. So fast.ai has 00:50:06.140 |
basically got us to a point here where we have our data into a format ready for modeling 00:50:11.500 |
and our validation sets being created. To see how these numbers relate to these strings 00:50:19.580 |
we can again just like we saw last week use the classes attribute which is a dictionary 00:50:25.220 |
which basically tells us the vocab. So this is how we look up - for example, 6 is: 0, 1, 00:50:30.820 |
2, 3, 4, 5, 6 - that's 'Compact'. That processing takes a little while to run, 00:50:39.260 |
so you can go ahead and save the tabular object and so then you can load it back later without 00:50:46.540 |
having to rerun all the processing. So that's a nice kind of fast way to quickly get back 00:50:52.820 |
up and running without having to reprocess your data. So we've done the basic data munging 00:50:59.100 |
we need. So we can now create a decision tree and in scikit-learn a decision tree where 00:51:04.180 |
the dependent variable is continuous is a decision tree regressor. And let's start by 00:51:10.440 |
telling it we just want a total of four leaf nodes. We'll see what that means in a moment 00:51:16.980 |
and in scikit-learn you generally call fit so it looks quite a lot like fast.ai and you 00:51:23.060 |
pass in your independent variables and your dependent variable and we can grab those straight 00:51:28.340 |
from our tabular object: the training set's .xs and .y, and we can do the same thing for validation 00:51:28.340 |
just to save us in typing. Okay, question. Do you have any thoughts on what data augmentation 00:51:41.820 |
for tabular data might look like? I don't have a great sense of data augmentation for tabular 00:51:53.660 |
data. We'll be seeing later, either in this course or in the next part, dropout and mixup 00:52:03.200 |
and stuff like that, which you might be able to apply in later layers of the tabular 00:52:11.260 |
model. Otherwise I think you'd need to think about kind of the semantics of the data and 00:52:16.220 |
think about what are things you could do to change the data without changing the meaning. 00:52:21.060 |
That's like a pretty tricky route. Another question: does fast.ai distinguish between ordered categories 00:52:29.340 |
such as low, medium, high and unordered categorical variables? Yes, that was that ordinal thing 00:52:36.180 |
I told you about before and all it really does is it ensures that your classes list 00:52:42.300 |
has a specific order so then these numbers actually have a specific order. And as you'll 00:52:47.860 |
see that's actually going to turn out to be pretty important for how we train our random 00:52:51.820 |
forest. Okay, so we can create a decision tree regressor. We can fit it and then we 00:53:00.300 |
can draw it with a fastai function. And here is the decision tree we just trained, and behind 00:53:10.380 |
the scenes this actually used basically the exact process that we described back here, 00:53:19.700 |
right? So this is where you can like try and create your own decision tree implementation 00:53:25.380 |
if you're interested in stretching yourself. So we're going to use one that's already exists 00:53:31.880 |
and the best way to understand what it's done is to look at this diagram from top to bottom. 00:53:37.060 |
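A sketch of the fit and the drawing (using sklearn's own plot_tree here rather than the notebook's drawing helper; xs and y come from the TabularPandas object):

    from sklearn.tree import DecisionTreeRegressor, plot_tree
    import matplotlib.pyplot as plt

    xs, y = to.train.xs, to.train.y              # independent and dependent variables
    valid_xs, valid_y = to.valid.xs, to.valid.y

    m = DecisionTreeRegressor(max_leaf_nodes=4)  # stop after four leaf nodes, as above
    m.fit(xs, y)

    plt.figure(figsize=(12, 6))
    plot_tree(m, feature_names=list(xs.columns), filled=True, precision=2)
    plt.show()
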
So the first step is it says like okay the initial model it created is a model with no 00:53:44.660 |
binary splits at all. Specifically it's always going to predict the value 10.1 for every 00:53:50.600 |
single row. Why is that? Well because this is the simplest possible model is to take 00:53:57.020 |
the average of the dependent variable and always predict that. And so this is always 00:54:02.100 |
should be your kind of pretty much your basic baseline for regression. There are 404,710 00:54:08.720 |
rows - auctions - that we're averaging, and the mean 00:54:14.660 |
squared error of this incredibly simple model, in which there are no rules at all, no groups 00:54:20.860 |
at all, just a single average, is 0.48. So then the next most complex model is 00:54:29.300 |
to take a single column, coupler_system, and a single binary decision: is coupler_system 00:54:35.980 |
less than or equal to 0.5? There are 360,847 00:54:41.780 |
auctions where it's true, and 43,863 00:54:47.740 |
where it's false. And now interestingly, in the false case, you can see that there are 00:54:54.100 |
no further binary decisions. So this is called a leaf node. It's a node where this is as 00:54:59.620 |
far as you can get and so if your coupler system is not less than or equal to 0.5 then 00:55:07.340 |
the prediction this model makes for your sale price is 9.21 versus if it's true it's 10.21. 00:55:15.100 |
So you can see it's actually found a very big difference here and that's why it picked 00:55:19.220 |
this as the first binary split. And so the mean squared error for this section here is 00:55:23.940 |
0.12 which is far better than we started out at, 0.48. This group still has 360,000 in 00:55:32.380 |
it and so it does another binary split. This time is the year that this piece of equipment 00:55:38.340 |
made was at less than or equal to 1991.5. If it was, if it's true then we get a leaf node 00:55:47.340 |
and the prediction is 9.97, mean squared error 0.37. If the value is false we don't have 00:55:53.420 |
a leaf node and we have another binary split. And you can see eventually we get down to 00:55:57.740 |
here coupler system true, year made, false, product size, false, mean squared error 0.17. 00:56:05.020 |
So all of these leaf nodes have MSEs that are smaller than that original baseline model 00:56:13.780 |
of just taking the mean. So this is how you can grow a decision tree. And we only stopped 00:56:19.660 |
here because we said max leaf nodes is 4, 1, 2, 3, 4, right? And so if we want to keep 00:56:27.140 |
training it further we can just use a higher number. There's actually a very nice library 00:56:36.220 |
by Terence Parr called dtreeviz, which can show us exactly the same information, like 00:56:42.220 |
so. And so here are the same leaf nodes 1, 2, 3, 4. And you can see the kind of the chart 00:56:49.980 |
of how many are there. This is the split, coupler system 0.5. Here are the two groups. 00:56:55.460 |
You can see the sale price in each of the two groups. And then here's the leaf node. 00:57:00.660 |
And so then the second split was on year made. And you can see here something weird is going 00:57:05.300 |
on with year made. There's a whole bunch of year mades that are a thousand which is obviously 00:57:09.700 |
not a sensible year for a bulldozer to be made. So presumably that's some kind of missing 00:57:15.140 |
value. So when we look at the kind of the picture like this it can give us some insights 00:57:21.400 |
about what's going on in our data. And so maybe we should replace those thousands with 00:57:28.700 |
1950 because that's you know obviously a very, very early year for a bulldozer. So we can 00:57:34.940 |
kind of pick it arbitrarily. It's actually not really going to make any difference to 00:57:39.700 |
the model that's created, because all we care about is the order, since we're just doing 00:57:44.740 |
these binary splits - but it'll make it easier to look at, as you can see. Here's our 1950s 00:57:50.420 |
now. And so now it's much easier to see what's going on in that binary split. So let's now 00:57:58.420 |
get rid of max leaf nodes and build a bigger decision tree. And then let's just for the 00:58:05.060 |
rest of this notebook create a couple of little functions. One to create the root mean squared 00:58:10.220 |
error, which is just here. And another one to take a model and some independent 00:58:16.900 |
variables, predict from the model on those independent variables, and then take the root mean squared 00:58:23.180 |
error with the dependent variable. So that's going to be our model's root mean squared error. 00:58:29.700 |
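Roughly, those two little functions are (as in the notebook):

    import math

    def r_mse(pred, y):
        # root mean squared error, rounded for easy reading
        return round(math.sqrt(((pred - y) ** 2).mean()), 6)

    def m_rmse(m, xs, y):
        # RMSE of a fitted model's predictions on the given independent variables
        return r_mse(m.predict(xs), y)
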
So for this decision tree in which we didn't have a stopping criteria, so as many leaf 00:58:33.900 |
nodes as you like, the model's root mean squared error is zero. So we've just built the perfect 00:58:41.580 |
model. So this is great news, right? We've built the perfect auction trading system. 00:58:49.660 |
Well remember, we actually need to check the validation set. Let's check m_rmse 00:58:54.620 |
with a validation set and oh, it's worse than zero. So our training set is zero, our validation 00:59:02.540 |
set is much worse than zero. Why has that happened? Well one of the things that a decision 00:59:08.660 |
tree in sklearn can do is it can tell you the number of leaf nodes, the number of leaves: 00:59:14.540 |
there are 341,000, number of data points 400,000. So in other words, we have nearly as many 00:59:22.460 |
leaf nodes as data points. Most of our leaf nodes only have a single thing in, but they're 00:59:26.780 |
taking an average of a single thing. Clearly this makes no sense at all. So what we should 00:59:32.060 |
actually do is pick some different stopping criteria and say, okay, don't do a split 00:59:38.180 |
if it would create a leaf node with fewer 00:59:45.840 |
than 25 things in it. And now if we fit and we look at the root mean squared error for 00:59:51.540 |
the validation set, it's going to go down from 0.33 to 0.32. So the training set got 00:59:59.460 |
worse, from zero to 0.248, the validation set got better, and now we only have 12,000 leaf nodes. 01:00:10.100 |
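That stopping criterion might look something like this (a sketch; the xs/y names and the m_rmse helper are assumed from above):

```python
from sklearn.tree import DecisionTreeRegressor

# don't create leaf nodes containing fewer than 25 rows
m = DecisionTreeRegressor(min_samples_leaf=25)
m.fit(xs, y)
print(m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y), m.get_n_leaves())
```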
Alright, so let's take a five minute break and then we're going to come back and see 01:00:15.260 |
how we get the best of both worlds, how we're going to get something which has the kind 01:00:19.660 |
of flexibility to get these, you know, really deep trees that can get the training error 01:00:26.540 |
down to zero, but also without overfitting. And the trick will be to use 01:00:32.860 |
something called bagging. We'll come back and talk about that in five minutes. 01:00:39.460 |
Okay, welcome back. So we're going to look at how we can get the best of both worlds 01:00:49.500 |
as we discussed and let's start by having a look at what we're doing with categorical 01:00:56.420 |
variables first of all. And so you might notice that previously with categorical variables, 01:01:03.500 |
for example, in collaborative filtering, we had to, you know, kind of think about like 01:01:10.500 |
how many embedding levels we have, for example, if you've used other modeling tools, you might 01:01:15.780 |
be used to doing things like creating dummy variables, stuff like that. For random forests on the 01:01:21.780 |
whole, you don't have to. The reason is, as we've seen, all of our categorical variables 01:01:32.460 |
have been turned into numbers. And so we can perfectly well have decision tree binary decisions 01:01:41.420 |
which use those particular numbers. Now, the numbers might not be ordered in any interesting 01:01:49.260 |
way, but if there's a particular level which kind of stands out as being important, it 01:01:56.380 |
only takes two binary splits to split out that level into a single, you know, into a 01:02:04.420 |
single piece. So generally speaking, I don't normally worry too much about kind of encoding 01:02:14.140 |
categorical variables in a special way. As I mentioned, I do try to encode ordinal variables 01:02:19.980 |
by saying what the order of the levels is, because often, as you would expect, sizes, 01:02:26.180 |
for example, you know, medium and small are going to mean kind of next to each other and 01:02:30.860 |
large and extra large would be next to each other. That's good to have those as similar 01:02:34.580 |
numbers. Having said that, you can kind of one hot encode a categorical variable if you 01:02:43.700 |
want to using get dummies in pandas. But there's not a lot of evidence that that actually helps. 01:02:51.380 |
That's actually been studied in a paper. And so I would say in general for categorical 01:02:57.540 |
variables don't worry about it too much. Just use what we've shown you. You have a question. 01:03:04.500 |
For ordinal categorical variables, how do you deal with when they have like nA or missing 01:03:12.460 |
values, where do you put that in the order? So in fast.ai, nA missing values always appear 01:03:22.300 |
as the first item. They'll always be the zero index item. And also if you get something 01:03:27.480 |
in the validation or test set, which is a level we haven't seen in training, that will 01:03:32.140 |
be considered to be that missing or nA value as well. All right, so what we're going to 01:03:41.020 |
do to try and improve our random forest is we're going to use something called bagging. 01:03:46.420 |
This was developed by a retired Berkeley professor named Leo Breiman in 1994. And he did a lot 01:03:54.180 |
of great work and perhaps you could argue that most of it happened after he retired. 01:03:59.700 |
His technical report was called bagging predictors. And he described how you could create multiple 01:04:05.260 |
versions of a predictor, so multiple different models. And you could then aggregate them 01:04:11.740 |
by averaging over the predictions. And specifically, the way he suggested doing this was to create 01:04:20.540 |
what he called bootstrap replicates. In other words, randomly select different subsets of 01:04:25.860 |
your data. Train a model on that subset, kind of store it away as one of your predictors, 01:04:31.820 |
and then do it again a bunch of times. And so each of these models is trained on a different 01:04:36.460 |
random subset of your data. And then you, to predict, you predict on all of those different 01:04:43.380 |
versions of your model and average them. And it turns out that bagging works really well. 01:04:52.300 |
So this, the sequence of steps is basically randomly choose some subset of rows, train 01:04:58.540 |
a model using that subset, save that model, and then return to step one. Do that a few 01:05:04.180 |
times to train a few models. And then to make a prediction, predict with all the models 01:05:10.300 |
and take the average. That is bagging. And it's very simple, but it's astonishingly powerful. 01:05:18.300 |
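Here is a minimal sketch of that recipe using sklearn decision trees (purely illustrative; a real random forest implementation does all of this for you):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_trees(xs, y, n_models=10):
    # xs is assumed to be a DataFrame and y a Series
    models = []
    for _ in range(n_models):
        # bootstrap replicate: randomly sample rows with replacement
        idx = np.random.choice(len(xs), len(xs), replace=True)
        models.append(DecisionTreeRegressor(min_samples_leaf=25)
                      .fit(xs.iloc[idx], y.iloc[idx]))
    return models

def bag_predict(models, xs):
    # predict with every model and take the average
    return np.stack([m.predict(xs) for m in models]).mean(0)
```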
And the reason why is that each of these models we've trained, although they are not using 01:05:25.480 |
all of the data, so they're kind of less accurate than a model that uses all of the data. Each 01:05:31.980 |
of them is, the errors are not correlated, you know, the errors because of using that 01:05:39.880 |
smaller subset are not correlated with the errors of the other models because they're 01:05:44.140 |
random subsets. And so when you take the average of a bunch of kind of errors which are not 01:05:54.100 |
correlated with each other, the average of those errors tends towards zero. So therefore, the average 01:06:01.080 |
of the models should give us an accurate prediction of the thing we're actually trying to predict. 01:06:08.380 |
So as I say here, it's an amazing result. We can improve the accuracy of nearly any 01:06:12.540 |
kind of algorithm by training it multiple times on different random subsets of data 01:06:18.380 |
and then averaging the predictions. So then Breiman in 2001 showed a way to do this specifically 01:06:27.140 |
for decision trees where not only did he randomly choose a subset of rows for each model, but 01:06:33.700 |
then for each binary split, he also randomly selected a subset of columns. And this is 01:06:40.200 |
called the random forest. And it's perhaps the most widely used, most practically important 01:06:45.860 |
machine learning method and astonishingly simple. To create a random forest regressor, 01:06:54.100 |
you use sklearn's RandomForestRegressor. If you pass n_jobs=-1, it will use all of the 01:07:00.980 |
CPU cores that you have to run as fast as possible. n_estimators says how many trees, 01:07:07.420 |
how many models, to train. max_samples says how many randomly chosen rows 01:07:15.100 |
to use in each one. max_features is how many randomly chosen columns to use for each binary 01:07:21.860 |
split point. min_samples_leaf is the stopping criterion, which we'll come back to. So here's 01:07:29.960 |
a little function that will create a random forest regressor and fit it to some set 01:07:35.580 |
of independent variables and a dependent variable. So we can give it a few default values and 01:07:43.460 |
create a random forest and train and our validation set RMSE is 0.23. If we compare that to what 01:07:55.500 |
we had before, we had 0.32. So dramatically better by using a random forest. 01:08:13.140 |
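That little function might look roughly like this (a sketch; the particular default values here are just reasonable starting points, not anything official):

```python
from sklearn.ensemble import RandomForestRegressor

def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True, **kwargs).fit(xs, y)

m = rf(xs, y)
m_rmse(m, valid_xs, valid_y)
```

(oob_score=True is included here so the out-of-bag trick discussed later is available.)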
So what's happened when we called random forest regressor is it's just using that decision 01:08:22.020 |
tree builder that we've already seen, but it's building multiple versions with these 01:08:26.480 |
different random subsets and for each binary split it does, it's also randomly selecting 01:08:32.260 |
a subset of columns. And then when we create a prediction, it is averaging the predictions 01:08:38.880 |
of each of the trees. And as you can see it's giving a really great result. And one of the 01:08:45.260 |
amazing things we'll find is that it's going to be hard for us to improve this very much, 01:08:50.540 |
you know, the kind of the default starting point tends to turn out to be pretty great. 01:08:59.500 |
The sklearn docs have lots of good information in them. One of them has this nice picture 01:09:03.980 |
that shows, as you increase the number of estimators, how the error rate 01:09:11.620 |
improves for different max features levels. And in general, the more trees you add, the 01:09:21.100 |
more accurate your model. It's not going to overfit, right, because it's averaging more 01:09:26.060 |
of these, these weak models, more of these models that are trained on subsets of the 01:09:34.020 |
data. So train as many, use as many estimators as you like, really just a case of how much 01:09:40.420 |
time do you have and whether you kind of reach a point where it's not really improving anymore. 01:09:45.980 |
You can actually get at the underlying decision trees in a model, in a random forest model 01:09:50.620 |
using estimators_. So with a list comprehension, we can call predict on each individual tree. 01:09:57.900 |
And so here's an array, a numpy array containing the predictions from each individual tree 01:10:03.760 |
for each row in our data. So if we take the mean across the zero axis, we'll get exactly 01:10:15.100 |
the same number. Because remember, that's what a random forest does, is it takes the 01:10:21.380 |
mean of the trees, predictions. So one cool thing we could do is we could look at the 01:10:31.340 |
40 estimators we have and grab the predictions for the first i of those trees and take their 01:10:42.020 |
mean and then we can find the root mean squared error. And so in other words, here is the accuracy 01:10:50.220 |
when you've just got one tree, two trees, three trees, four trees, five trees, etc. 01:10:56.100 |
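A sketch of that per-tree inspection, assuming m, valid_xs, valid_y and r_mse from earlier:

```python
import numpy as np
import matplotlib.pyplot as plt

# one row of predictions per tree in the forest
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])

r_mse(preds.mean(0), valid_y)   # averaging all trees gives the forest's own prediction

# RMSE using only the first 1, 2, 3, ... trees
plt.plot([r_mse(preds[:i+1].mean(0), valid_y)
          for i in range(len(m.estimators_))])
```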
And you can see, so it's kind of nice, right? You can, you can actually create your own 01:11:01.220 |
kind of build your own tools to look inside these things and see what's going on. And 01:11:06.540 |
so we can see here that as you add more and more trees, the accuracy did indeed keep improving 01:11:11.980 |
or the root mean squared error kept improving, although the improvements slowed down after 01:11:18.060 |
a while. The validation set is worse than the training set and there's a couple of reasons 01:11:28.640 |
that could have happened. The first reason could be because we're still overfitting, 01:11:34.660 |
which is not necessarily a problem, it's just something we could identify. Or maybe it's 01:11:39.020 |
because the, the fact that we're trying to predict the last two weeks is actually a problem 01:11:44.700 |
and that the last two weeks are kind of different to the other auctions in our dataset, maybe 01:11:50.300 |
something changed over time. So how do we tell which of those two reasons there are? 01:11:56.740 |
What is the reason that our validation set is worse? We can actually find out using a 01:12:01.620 |
very clever trick called out of bag error, OOB error. And we use OOB error for lots of 01:12:06.900 |
things. You can grab the OOB error, or you can grab the OOB predictions from the model 01:12:16.340 |
with its oob_prediction_ attribute, take the RMSE of those, and you find that the OOB RMSE is 01:12:23.860 |
0.21, which is quite a bit better than 0.23. 01:12:35.420 |
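In code that's roughly just the following, assuming the forest was created with oob_score=True:

```python
r_mse(m.oob_prediction_, y)   # OOB predictions compared against the training targets
```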
So let me explain what OOB error is: we look at each 01:12:45.180 |
row of the training set, not the validation set, and we say, for row number one, which trees included 01:12:53.220 |
row number one in the training? And we'll say, okay, let's not use those for calculating 01:12:58.700 |
the error because it was part of those trees training. So we'll just calculate the error 01:13:04.100 |
for that row using the trees where that row was not included in training that tree. Because 01:13:10.860 |
remember every tree is using only a subset of the data. So we do that for every row. 01:13:15.860 |
We find the prediction using only the trees that were not used, that that row was not 01:13:24.100 |
used. And those are the OOB predictions. In other words, this is like giving us a validation 01:13:31.580 |
set result without actually needing a validation set. But the thing is, it doesn't have that time 01:13:39.660 |
offset, it's not looking at the last two weeks, it's looking at the whole training set. But 01:13:43.580 |
this basically tells us how much of the error is due to overfitting versus due to being 01:13:50.620 |
the last couple of weeks. So that's a cool trick. OOB error is something that very quickly 01:13:55.700 |
kind of gives us a sense of how much we're overfitting. And we don't even need 01:14:00.100 |
a validation set to do it. So there's that OOB error. So that's telling us a bit about 01:14:06.500 |
what's going on in our model. But then there's a lot of things we'd like to find out from 01:14:12.320 |
our model. And I've got five things in particular here which I generally find pretty interesting. 01:14:18.580 |
Which is, how confident are we about our predictions for some particular prediction we're making? 01:14:26.460 |
Like we can say this is what we think the prediction is, but how confident are we? Is 01:14:31.740 |
that exactly that or is it just about that or we really have no idea? And then for predict, 01:14:37.900 |
for predicting a particular item, which factors were the most important in that prediction 01:14:44.860 |
and how did they influence it? Overall, which columns are making the biggest difference 01:14:50.500 |
in our predictions? Which ones could we maybe throw away and it wouldn't matter? Which columns are 01:14:56.420 |
basically redundant with each other? So we don't really need both of them. And as we 01:15:03.580 |
vary some column, how does it change the prediction? So those are the five things that we're, that 01:15:09.500 |
I'm interested in figuring out, and we can do all of those things with a random forest. 01:15:15.340 |
Let's start with the first one. So the first one, we've already seen that we can grab all 01:15:23.060 |
of the predictions for all of the trees and take their mean to get the actual predictions 01:15:31.340 |
of the model and then to get the RMSE. But what if instead of saying mean, we did exactly 01:15:36.060 |
the same thing like so, but instead said standard deviation. This is going to tell us for every 01:15:46.740 |
row in our dataset, how much did the trees vary? And so if our model really had never 01:15:56.380 |
seen kind of data like this before, it was something where, you know, different trees 01:16:02.020 |
were giving very different predictions. It might give us a sense that maybe this is something 01:16:07.900 |
that we're not at all confident about. And as you can see, when we look at the standard 01:16:12.060 |
deviation of the trees for each prediction, let's just look at the first five. They vary 01:16:17.620 |
a lot, right: 0.2, 0.1, 0.09, 0.3, okay? So this is really interesting. It's not something 01:16:30.820 |
that a lot of people talk about, but I think it's a really interesting approach to kind 01:16:33.940 |
of figuring out whether we might want to be cautious about a particular prediction, because 01:16:40.260 |
maybe we're not very confident about it. And it's something we can easily do with a random forest. 01:16:46.540 |
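A sketch of that confidence measure, reusing the stacked per-tree predictions from before:

```python
# standard deviation across trees, one value per row: a high value means the trees disagree
preds_std = preds.std(0)
preds_std[:5]
```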
The next thing, and this is I think the most important thing for me in 01:16:50.900 |
terms of interpretation, is feature importance. Here's what feature importance looks like. 01:16:57.420 |
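A minimal sketch of a helper along those lines (the name rf_feat_importance is an assumption):

```python
import pandas as pd

def rf_feat_importance(m, df):
    # one row per column, sorted so the most important features come first
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                       ).sort_values('imp', ascending=False)

fi = rf_feat_importance(m, xs)
fi[:10]
```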
We can call feature importance on a model with some independent variables. Let's say 01:17:01.860 |
grab the first 10. This says these are the 10 most important features in this random 01:17:09.500 |
forest. These are the things that are the most strongly driving sale price or we could 01:17:15.020 |
plot them. And so you can see here, there's just a few things that are by far the most 01:17:22.940 |
important. What year the equipment was made, the bulldozer or whatever. How big is it? Coupler 01:17:31.260 |
system, whatever that means, and the product class, whatever that means. And so you can 01:17:40.660 |
get this by simply looking inside your trained model and grabbing the feature_importances_ 01:17:46.260 |
attribute. And so here, to make it print out more nicely, I'm just sticking that into 01:17:50.660 |
a data frame and sorting descending by importance. So how is this actually being done? It's actually 01:18:00.700 |
really neat. What Scikit-learn does, and what Breiman, the inventor of random forests, described, is 01:18:07.740 |
that you can go through each tree and then start at the top of the tree and look at each 01:18:12.340 |
branch and at each branch see what feature was used for the split, that is, which column the 01:18:19.100 |
binary split was based on. And then how much better was the model after that 01:18:24.700 |
split compared to beforehand. And we basically then say, okay, that column was responsible 01:18:31.060 |
for that amount of improvement. And so you add that up across all of the splits, across 01:18:36.900 |
all of the trees for each column, and then you normalize it so they all add to one. And 01:18:43.700 |
that's what gives you these numbers, which we show the first few of them in this table 01:18:49.180 |
and the first 30 of them here in this chart. So this is something that's fast and it's 01:18:55.900 |
easy and it kind of gives us a good sense of like, well, maybe the stuff that are less 01:19:01.020 |
than 0.005 we could remove. So if we did that, that would leave us with only 21 columns. 01:19:12.940 |
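Keeping only the columns above that threshold might look like this (a sketch, reusing the fi data frame and rf helper sketched above):

```python
to_keep = fi[fi.imp > 0.005].cols
xs_imp = xs[to_keep]
valid_xs_imp = valid_xs[to_keep]

m = rf(xs_imp, y)
m_rmse(m, xs_imp, y), m_rmse(m, valid_xs_imp, valid_y)
```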
So let's try that. Let's just say, okay, our important x's are the x's which 01:19:19.340 |
are in this list of ones to keep, do the same for the validation set, retrain our random forest, and 01:19:27.340 |
have a look at the result. And basically our accuracy is about the same, but we've gone 01:19:34.620 |
down from 78 columns to 21 columns. So I think this is really important. It's not just about 01:19:42.260 |
creating the most accurate model you can, but you want to kind of be able to fit it 01:19:45.460 |
in your head as best as possible. And so 21 columns is going to be much easier for us 01:19:50.020 |
to check for any data issues and understand what's going on. And the accuracy is about 01:19:55.300 |
the same, or the RMSE. So I would say, okay, let's do that. Let's just stick with x's important 01:20:03.980 |
from now on. And so here's this entire set of the 21 features. And you can see it looks 01:20:11.920 |
now like year made and product size are the two really important things. And then there's 01:20:17.500 |
a cluster of kind of mainly product related things that are kind of at the next level 01:20:21.860 |
of importance. One of the tricky things here is that we've got a product class description, 01:20:33.500 |
model ID, secondary description, model description, base model, model descriptor. So they all look 01:20:38.740 |
like there might be similar ways of saying the same thing. So one thing that can help 01:20:43.360 |
us to interpret the feature importance better and understand better what's happening in 01:20:47.500 |
the model is to remove redundant features. So one way to do that is to call fast.ai's 01:20:59.020 |
cluster columns, which is basically a thin wrapper for stuff that scikit-learn already 01:21:02.980 |
provides. And what that's going to do is it's going to find pairs of columns, which are 01:21:09.420 |
very similar. So you can see here sale year and sale elapsed. See how this line is way 01:21:14.540 |
out to the right, whereas machine ID and model ID is not at all; it's way out to the left. 01:21:19.700 |
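Under the hood, that kind of column clustering is roughly the following (a sketch of the idea, not fast.ai's exact code):

```python
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def cluster_columns_sketch(df):
    corr = spearmanr(df).correlation            # rank correlation between every pair of columns
    dist = squareform(1 - corr, checks=False)   # turn similarity into a condensed distance matrix
    z = hierarchy.linkage(dist, method='average')
    hierarchy.dendrogram(z, labels=df.columns, orientation='left')
    plt.show()

cluster_columns_sketch(xs_imp)
```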
So that means that sale year and sale elapsed are very, very similar. When one is low, the 01:21:26.140 |
other tends to be low and vice versa. Here's a group of three, which all seem to be much 01:21:31.540 |
the same, and then product group desc and product group, and then fiBaseModel 01:21:36.620 |
and fiModelDesc. But these all seem like things where maybe we could remove one of 01:21:42.860 |
each of these pairs because they're basically seem to be much the same, you know, they're 01:21:48.900 |
when one is high, the other is high and vice versa. So let's try removing one of each of 01:22:01.980 |
these. Now it takes a little while to train a random forest. And so for the, just to see 01:22:09.580 |
whether removing something makes it much worse, we could just do a very fast version. So we 01:22:16.460 |
could just train something where we only have 50,000 rows per tree, train for each tree, 01:22:24.980 |
and we'll just use 40 trees. And let's then just get the OOB score, and so for that fast 01:22:37.420 |
simple version, our basic OOB with our important x's is 0.877. And here for OOB, a higher number 01:22:48.500 |
is better. So then let's try going through each of the things we thought we might not 01:22:53.060 |
need and try dropping them and then getting the OOB error for our x's with that one column 01:23:01.580 |
removed. And so compared to 877, most of them don't seem to hurt very much. Sale elapsed hurt 01:23:11.220 |
it quite a bit, right? So for each of those groups, let's go and see which one of the 01:23:18.420 |
ones seems like we could remove it. So here's the five I found. Let's remove the whole lot 01:23:25.980 |
and see what happens. And so the OOB went from 877 to 874, though hardly any difference 01:23:33.820 |
at all, despite the fact we managed to get rid of five of our variables. So let's create 01:23:42.180 |
something called x's final, which is the x's important and then dropping those five, save 01:23:50.300 |
them for later. We can always load them back again. And then let's check our random forest 01:23:56.700 |
using those and again 0.233 or 0.234. So we've got about the same thing, but we've got even 01:24:05.460 |
less columns now. So we're getting a kind of a simpler and simpler model without hurting 01:24:10.780 |
our accuracy. It's great. So the next thing we said we were interested in learning about 01:24:17.900 |
is for the columns that are, particularly the columns that are most important, how does, 01:24:24.260 |
what's the relationship between that column and the dependent variable? So for example, 01:24:28.700 |
what's the relationship between product size and sale price? So the first thing I would 01:24:33.900 |
do would be just to look at a histogram. So one way to do that is with value counts in 01:24:41.420 |
pandas. And we can see here our different levels of product size. And one thing to note here 01:24:52.780 |
is that missing is actually the most common. And then next most is compact and small. And 01:25:00.180 |
then mini is pretty tiny. So we can do the same thing for year made. Now for year made 01:25:07.420 |
we can't just use this kind of basic value counts bar chart. 01:25:16.140 |
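Those two quick looks might be produced like this (a sketch; the data frame and column names are assumptions, and the histogram call is discussed next):

```python
df['ProductSize'].value_counts(dropna=False).plot.barh()  # bar chart of the category levels
df['YearMade'].hist()                                     # histogram for the continuous column
```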
For year made we actually need a histogram, which pandas has stuff like this built in 01:25:21.460 |
so we can just call histogram. And that 1950, you remember we created it, that's kind of 01:25:27.020 |
this missing value thing that used to be a thousand. But most of them seem to have been 01:25:32.500 |
well into the 1990s and 2000s. So let's now look at something called a partial dependence 01:25:38.780 |
plot. I'll show it to you first. Here is a partial dependence plot of year made against 01:25:52.460 |
partial dependence. What does this mean? Well we should focus on the part where we actually 01:25:59.100 |
have a reasonable amount of data. So at least well into the 80's, go around here. And so 01:26:05.900 |
let's look at this bit here. Basically what this says is that as year made increases, 01:26:14.220 |
the predicted sale price, log sale price of course also increases. You can see. And the 01:26:22.660 |
log sale price is increasing roughly linearly, and since it's the log, this is actually 01:26:28.780 |
an exponential relationship between year made and sale price. Why do we call it a partial 01:26:36.900 |
dependence? Are we just plotting the kind of the year against the average sale price? 01:26:41.700 |
Well no we're not. We can't do that because a lot of other things change from year to 01:26:47.540 |
year. Example, maybe more recently people tend to buy bigger bulldozers or more bulldozers 01:26:57.100 |
with air conditioning or more expensive models of bulldozers. And we really want to be able 01:27:03.700 |
to say like no just what's the impact of year and nothing else. And if you think about it 01:27:08.820 |
from a kind of an inflation point of view, you would expect that older bulldozers would 01:27:18.100 |
be kind of, that bulldozers would get kind of a constant ratio cheaper the further you 01:27:27.220 |
go back, which is what we see. So what we really want to say is all other things being equal, 01:27:33.980 |
what happens if only the year changes? And there's a really cool way we can answer that 01:27:39.820 |
question with a random forest. So how does year made impact sale price? All other things 01:27:46.020 |
being equal. So what we can do is we can go into our actual data set and replace every 01:27:52.460 |
single value in the year made column with 1950 and then calculate the predicted sale 01:27:58.620 |
price for every single auction and then take the average over all the auctions. And that's 01:28:03.820 |
what gives us this value here. And then we can do the same from 1951, 1952 and so forth 01:28:10.900 |
until eventually we get to our final year of 2011. So this isolates the effect of only 01:28:20.020 |
year made. So it's a kind of a bit of a curious thing to do, but it's actually, it's a pretty 01:28:28.580 |
neat trick for trying to kind of pull apart and create this partial dependence to say 01:28:34.920 |
what might be the impact of just changing year made. And we can do the same thing for 01:28:42.060 |
product size. And one of the interesting things if we do it for product size is we see that 01:28:46.540 |
the lowest value of predicted sale price log sale price is NA, which is a bit of a worry 01:28:58.700 |
because we kind of want to know well that means it's really important the question of 01:29:02.260 |
whether or not the product size is labeled is really important. And that is something 01:29:08.180 |
that I would want to dig into before I actually use this model to find out well why is it 01:29:12.700 |
that sometimes things aren't labeled and what it means, you know, why it is that that's 01:29:16.620 |
actually such an important predictor. So that is the partial dependence plot and 01:29:23.580 |
it's a really clever trick. So we have looked at four of the five questions we said we wanted 01:29:34.060 |
to answer at the start of this section. So the last one that we want to answer is one 01:29:41.780 |
here. We're predicting with a particular row of data what were the most important factors 01:29:46.980 |
and how did they influence that prediction. This is quite related to the very first thing 01:29:51.460 |
we saw. So it's like imagine you were using this auction price model in real life. You 01:29:57.300 |
had something on your tablet and you went into some auction and you looked up what the 01:30:02.320 |
predicted auction price would be for this lot that's coming up to find out whether it 01:30:09.940 |
seems like it's being under or overvalued and then you can decide what to do about that. 01:30:15.720 |
So one thing we said we'd be interested to know is like well are we actually confident 01:30:20.020 |
in our prediction and then we might be curious to find out like oh I'm really surprised it 01:30:25.180 |
was predicting such a high value. Why was it predicting such a high value? So to find 01:30:32.060 |
the answer to that question, we can use a module called TreeInterpreter. And TreeInterpreter, 01:30:41.260 |
the way it works is that you pass in a single row. So it's like here's the auction that's 01:30:47.620 |
coming up, here's the model, here's the auctioneer ID, etcetera, etcetera. Please predict the 01:30:55.220 |
value from the random forest, what's the expected sale price and then what we can do is we can 01:31:02.700 |
take that one row of data and put it through the first decision tree and we can see what's 01:31:07.700 |
the first split that's selected and then based on that split does it end up increasing or 01:31:13.340 |
decreasing the predicted price compared to that kind of raw baseline model of just take 01:31:19.340 |
the average and then you can do that again at the next split and again at the next split 01:31:23.020 |
and again at the next split. So for each split, we see what the increase or decrease in the 01:31:28.940 |
prediction 01:31:37.420 |
is, compared to the parent node. And so then you can do that for every 01:31:48.700 |
tree and then add up the total change in importance by split variable and that allows you to draw 01:31:56.660 |
something like this. So here's something that's looking at one particular row of data and 01:32:03.860 |
overall we start at zero and so zero is the initial 10.1. Remember this number 10.1 is 01:32:14.860 |
the average log sale price of the whole data set. They call it the bias. And so we call 01:32:22.300 |
that zero then for this particular row we're looking at year made as a negative 4.2 impact 01:32:31.180 |
on the prediction and then product size has a positive 0.2, coupler system has a positive 01:32:38.300 |
0.046, model ID has a positive 0.127 and so forth, right. And so the red ones are negative 01:32:47.480 |
and the green ones are positive and you can see how they all join up until eventually 01:32:51.580 |
overall the prediction is that it's going to be negative 0.122 compared to 10.1 which 01:33:01.140 |
is equal to 9.98. So this kind of plot is called a waterfall plot and so basically when 01:33:12.240 |
we say tree interpreter dot predict it gives us back the prediction which is the actual 01:33:20.780 |
number we get back from the random forest, the bias which is just always this 10.1 for 01:33:25.900 |
this data set and then the contributions which is all of these different values. It's how 01:33:33.460 |
important was each factor and here I've used a threshold which means anything that was 01:33:42.140 |
less than 0.08 all gets thrown into this other category. I think this is a really useful 01:33:48.940 |
kind of thing to have in production because it can help you answer questions whether it 01:33:54.620 |
will be for the customer or for you know whoever's using your model if they're surprised about 01:33:59.500 |
some prediction, why that was the prediction. So I'm going to show you something really interesting 01:34:10.540 |
using some synthetic data and I want you to really have a think about why this is happening 01:34:16.640 |
before I tell you, and to pause the video, if you're watching the video, when I get to that 01:34:21.660 |
point. Let's start by creating some synthetic data like so. So we're going to grab 40 values 01:34:29.460 |
evenly spaced between 0 and 20 and then we're just going to create the y=x line and add 01:34:37.740 |
some normally distributed random data on that. Here's this kind of plot. So here's some data 01:34:45.940 |
we want to try and predict and we're going to use a random forest in a kind of bit of 01:34:50.140 |
an overkill here. Now in this case we only have one independent variable. Scikit-learn 01:35:00.180 |
expects us to have more than one. So we can use unsqueeze in PyTorch to go from 01:35:10.620 |
a shape of 40, in other words a vector with 40 elements, to a shape of 40 comma 1, in other 01:35:16.060 |
words a matrix of 40 rows with one column. So this unsqueeze 1 means add a unit axis 01:35:23.500 |
here. I don't use unsqueeze very often because I actually generally prefer the index with 01:35:30.260 |
a special value none. This works in PyTorch and numpy and the way it works is to say okay 01:35:37.220 |
x_lin, which remember is a vector of length 40, take every row, and then None means insert a 01:35:46.180 |
unit axis here for the column. So these are two ways of doing the same thing but this 01:35:51.500 |
one is a little bit more flexible so that's what I use more often. But now that we've 01:35:55.540 |
got the shape that is expected which is a rank 2 tensor and an array with two dimensions 01:36:02.820 |
or axes we can create a random forest we can fit it and let's just use the first 30 data 01:36:08.860 |
points right so kind of stop here. And then let's do a prediction right so let's plot 01:36:16.580 |
the original data points and then also plot a prediction and look what happens on the 01:36:21.100 |
prediction: it's kind of nice and accurate, and then suddenly, what happens? So this is 01:36:27.820 |
the bit where, if you're watching the video, I want you to pause and have a think about why it 01:36:30.980 |
is flat. So what's going on here? Well remember a random forest is just taking the average 01:36:39.380 |
of predictions of a bunch of trees and a tree the prediction of a tree is just the average 01:36:46.220 |
of the values in a leaf node and remember we fitted using a training set containing 01:36:51.980 |
only the first 30. So none of these appeared in the training set so the highest we could 01:36:59.060 |
get would be the average of values that are inside the training set. In other words there's 01:37:04.700 |
this maximum you can get to. So random forests cannot extrapolate outside of the bounds of 01:37:12.980 |
the data that they're training set. This is going to be a huge problem for things like 01:37:16.880 |
time series prediction where there's like an underlying trend for instance. But really 01:37:24.300 |
it's a more general issue than just time variables. It's going to be hard, or often impossible, 01:37:29.660 |
for random forests to extrapolate outside the types of data that they've seen, in 01:37:34.620 |
a general sense. So we need to make sure that our validation set does not contain out of 01:37:41.340 |
domain data. So how do we find out of domain data? So we might not even know our test set 01:37:50.900 |
is distributed in the same way as our training data. So if they're from two different time 01:37:54.760 |
periods how do you kind of tell how they vary, right? Or if it's a Kaggle competition how 01:38:00.980 |
do you tell if the test set and the training set which Kaggle gives you have some underlying 01:38:07.180 |
differences? There's actually a cool trick you can do which is you can create a column 01:38:13.020 |
called is_valid which contains 0 for everything in the training set and 1 for everything in 01:38:21.260 |
the validation set. And it's concatenating all of the independent variables together. 01:38:27.580 |
So it's concatenating the independent variables for both the training and validation set together. 01:38:32.700 |
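A sketch of that trick, assuming the xs_final/valid_xs_final names and the rf and rf_feat_importance helpers from earlier:

```python
import numpy as np
import pandas as pd

df_dom = pd.concat([xs_final, valid_xs_final])
is_valid = np.array([0]*len(xs_final) + [1]*len(valid_xs_final))

m = rf(df_dom, is_valid)           # predict "is this row from the validation set?"
rf_feat_importance(m, df_dom)[:6]  # which columns give the game away
```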
So this is our independent variable and this becomes our dependent variable. And we're 01:38:38.740 |
going to create a random forest not for predicting price but a random forest that predicts is 01:38:45.740 |
this row from the validation set or the training set. So if the validation set and the training 01:38:51.980 |
set are from kind of the same distribution if they're not different then this random 01:38:57.060 |
forest should basically have zero predictive power. If it has any predictive power then 01:39:04.460 |
it means that our training and validation set are different. And to find out the source 01:39:09.100 |
of that difference we can use feature importance. And so you can see here that the difference 01:39:17.840 |
between the validation set and the training set is not surprisingly sale elapsed. So that's 01:39:26.940 |
the number of days since I think like 1970 or something. So it's basically the date. 01:39:32.180 |
So yes of course you can predict whether something is in the validation set or the training set 01:39:37.300 |
by looking at the date because that's actually how you find them. That makes sense. This is 01:39:41.900 |
interesting sales ID. So it looks like the sales ID is not some random identifier but 01:39:46.940 |
it increases over time. And ditto for machine ID. And then there's some other smaller ones 01:39:54.580 |
here that kind of makes sense. So I guess for something like model desk I guess there 01:39:59.500 |
are certain models that were only made in later years for instance. But you can see these 01:40:06.860 |
top three columns are a bit of an issue. So then we could say like okay what happens if 01:40:14.320 |
we look at each one of those columns those first three and remove them and then see how 01:40:22.220 |
it changes our RMSE on our sales price model on the validation set. So we start from point 01:40:35.460 |
232 and removing sales ID actually makes it a bit better. Sale elapsed makes it a bit 01:40:43.180 |
worse, machine ID about the same. So we can probably remove sales ID and machine ID without 01:40:49.180 |
losing any accuracy and yep it's actually slightly improved. But most importantly it's 01:40:54.600 |
going to be more resilient over time right because we're trying to remove the time related 01:41:00.380 |
features. Another thing to note is that, since it seems that this kind of sale elapsed 01:41:09.280 |
issue is maybe making a big difference, it's worth looking at the sale year distribution; 01:41:16.420 |
this is the histogram. Most of the sales are in the last few years anyway. So what happens 01:41:21.380 |
if we only include the most recent few years. So let's just include everything after 2004. 01:41:29.900 |
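That subset might be created like so (a sketch; the saleYear column name and the xs_final/y names are assumptions):

```python
filt = xs_final['saleYear'] > 2004
xs_filt, y_filt = xs_final[filt], y[filt]

m = rf(xs_filt, y_filt)
m_rmse(m, valid_xs_final, valid_y)
```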
So that is xs_filt. And if I train on that subset then my accuracy improves 01:41:38.060 |
a bit more, from 331 to 330. So that's interesting, right? We're actually using less data, fewer 01:41:46.260 |
rows and getting a slightly better result because the more recent data is more representative. 01:41:53.980 |
So that's about as far as we can get with our random forest. But what I will say is 01:42:00.180 |
this. This issue of extrapolation would not happen with a neural net would it because 01:42:08.780 |
the underlying layers of a neural net are linear layers. And so linear layers 01:42:13.860 |
can absolutely extrapolate. So the obvious thing to think then at this point is well 01:42:19.340 |
maybe a neural net would do a better job of this. That's going to be the next thing we try, 01:42:25.540 |
but let's take this question first. How does feature importance relate to correlation? 01:42:37.020 |
Feature importance doesn't particularly relate to correlation. Correlation is a concept for 01:42:42.700 |
linear models and this is not a linear model. So remember feature importance is calculated 01:42:47.740 |
by looking at the improvement in accuracy as you go down each tree and you go down each 01:42:56.660 |
binary split. If you're used to linear regression then I guess correlation sometimes can be 01:43:05.620 |
used as a measure of feature importance. But this is a much more kind of direct version 01:43:13.660 |
that's taking account of these non-linearities and interactions of stuff as well. So it's 01:43:19.380 |
a much more flexible and reliable measure generally feature importance. Any more questions? 01:43:30.260 |
So I'll do the same thing with a neural network. I'm going to just copy and paste the same 01:43:34.620 |
lines of code that I had from before, but this time I'll call it df_nn, and these are the 01:43:40.660 |
same lines of code. And I'll grab the same list of columns we had before, plus the dependent 01:43:45.140 |
variable to get the same data frame. Now as we've discussed for categorical columns we 01:43:52.460 |
probably want to use embeddings. So to create embeddings we need to know which columns should 01:43:57.740 |
be treated as categorical variables. And as we've discussed we can use cont_cat_split 01:44:01.940 |
for that. One of the useful things we can pass it is the maximum cardinality. So max_card 01:44:09.380 |
equals 9000 means if there's a column with more than 9000 levels you should treat it 01:44:14.660 |
as continuous, and if it's got less than 9000 levels it's categorical. So that's 01:44:20.660 |
you know it's a simple little function that just checks the cardinality and splits them 01:44:25.420 |
based on how many discrete levels they have. And of course the data type if it's not actually 01:44:31.460 |
numeric data type it has to be categorical. So there's our split. And then 01:44:42.020 |
from there what we can do is we can say, oh, we've got to be a bit careful of saleElapsed, 01:44:49.420 |
because actually saleElapsed I think has less than 9000 categories, but we definitely 01:44:53.860 |
don't want to use that as a categorical variable. The whole point was to make it something 01:44:57.940 |
that we can extrapolate on. Certainly anything that's kind of time dependent, 01:45:03.020 |
or where we think we might see things outside the range of inputs in the training data, we 01:45:09.380 |
should make a continuous variable. So let's take saleElapsed, put it in the continuous list for the 01:45:14.520 |
neural net and remove it from categorical. So here's the number of unique levels this 01:45:22.820 |
is from pandas for everything in our neural net data set for the categorical variables. 01:45:28.460 |
And I get a bit nervous when I see these really high numbers so I don't want to have too many 01:45:32.740 |
things with like lots and lots of categories. The reason I don't want lots of things with 01:45:40.220 |
lots and lots of categories is just they're going to take up a lot of parameters because 01:45:44.020 |
in an embedding matrix: you know, every one of these is a row in an embedding matrix. 01:45:48.580 |
In this case I notice model ID and model desk might be describing something very similar. 01:45:54.380 |
So I'd quite like to find out if I could get rid of one and an easy way to do that would 01:45:58.680 |
be to use a random forest. So let's try removing the model desc column and let's create a random forest 01:46:10.540 |
and let's see what happens and oh it's actually a tiny bit better and certainly not worse. 01:46:16.460 |
So that suggests that we can actually get rid of one of these levels or one of these 01:46:20.740 |
variables. So let's get rid of that one and so now we can create a tabular pandas object 01:46:26.900 |
just like before. But this time we're going to add one more processor which is normalize. 01:46:34.540 |
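Putting the whole neural net setup together looks roughly like this; each piece (Normalize, the batch size, the y range, the layer sizes) is explained in the next few paragraphs, and names like df_nn_final, cat_nn, cont_nn and splits are assumed from earlier steps:

```python
import torch.nn.functional as F
from fastai.tabular.all import *

procs_nn = [Categorify, FillMissing, Normalize]
to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn,
                      splits=splits, y_names='SalePrice')

dls = to_nn.dataloaders(1024)          # a big batch size is fine for tabular data
y = to_nn.train.y
learn = tabular_learner(dls, y_range=(y.min(), y.max()), layers=[500, 250],
                        n_out=1, loss_func=F.mse_loss)
learn.lr_find()
learn.fit_one_cycle(5, 1e-2)
```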
And the reason we need normalize, so normalize is subtract the mean divide by the standard 01:46:39.180 |
deviation. We didn't need that for a random forest because for a random forest we're just 01:46:44.660 |
looking at less than or greater than through our binary splits. So all that matters is 01:46:49.980 |
the order of things, how they're sorted, it doesn't matter whether they're super big or 01:46:53.700 |
super small. But it definitely matters for neural nets because we have these linear layers. 01:47:01.460 |
So we don't want to have you know things with kind of crazy distributions with some super 01:47:06.220 |
big numbers and super small numbers because it's not going to work. So it's always a good 01:47:10.680 |
idea to normalize things in neural nets so we can do that in a tabular neural net by 01:47:17.580 |
using the normalize tabular proc. So we can do the same thing that we did before with 01:47:23.460 |
creating our tabular pandas tabular object for the neural net. And then we can create 01:47:29.900 |
data loaders from that with a batch size. And this is a large batch size because tabular 01:47:35.540 |
models don't generally require nearly as much GPU RAM as a convolutional neural net or something 01:47:44.000 |
or an RNN or something. Since it's a regression model we're going to want a y range. So let's 01:47:52.140 |
find the minimum and maximum of our dependent variable. And we can now go ahead and create 01:47:59.140 |
a tabular learner. Our tabular learner is going to take our data loaders, our y range, 01:48:06.140 |
how many activations do you want in each of the linear layers. And so you can have as 01:48:12.140 |
many linear layers as you like here. How many outputs are there? So this is a regression 01:48:18.380 |
with a single output. And what loss function do you want? We can use lr_find and then we 01:48:27.420 |
can go ahead and use fit_one_cycle. There's no pre-trained model obviously, because this is 01:48:32.500 |
not something where people have got pre-trained models for industrial equipment auctions. So 01:48:39.140 |
we just use fit_one_cycle and train for a minute. And then we can check. And our RMSE is 0.226 01:48:52.500 |
which here was 0.230. So that's amazing. We actually have, you know, straight away a better 01:48:58.620 |
result than the random forest. It's a little more fussy, it takes a little bit longer. But 01:49:05.580 |
as you can see, you know, for interesting datasets like this, we can get some great 01:49:10.940 |
results with neural nets. So here's something else we could do though. The random forest 01:49:23.380 |
and the neural net, they each have their own pros and cons. There's some things they're 01:49:28.020 |
good at and there's some they're less good at. So maybe we can get the best of both worlds. 01:49:34.620 |
And a really easy way to do that is to use Ensemble. We've already seen that a random 01:49:39.420 |
forest is a decision tree ensemble. But now we can put that into another ensemble. We 01:49:43.740 |
can have an ensemble of the random forest and a neural net. There's lots of super fancy 01:49:49.180 |
ways you can do that. But a really simple way is to take the average. So sum up the 01:49:55.300 |
predictions from the two models, divide by two, and use that as prediction. So that's 01:50:01.620 |
our ensemble prediction is just literally the average of the random forest prediction 01:50:05.540 |
and the neural net prediction. And that gives us 0.223 versus 0.226. So how good is that? 01:50:18.900 |
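Before answering that, here's roughly what that averaging looks like in code (a sketch; learn, m and the valid_* names are assumed from earlier, and it assumes the two validation sets contain the same rows in the same order):

```python
from fastai.tabular.all import to_np

rf_preds = m.predict(valid_xs)        # random forest predictions
nn_preds, _ = learn.get_preds()       # neural net predictions on its validation set
ens_preds = (to_np(nn_preds.squeeze()) + rf_preds) / 2

r_mse(ens_preds, valid_y)
```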
Well it's a little hard to say because unfortunately this competition is old enough that we can't 01:50:25.540 |
even submit to it and find out how we would have gone on Kaggle. So we don't really know 01:50:30.980 |
and so we're relying on our own validation set. But it's quite a bit better than even 01:50:36.260 |
the first place score on the test set. So if the validation set is you know doing good 01:50:45.380 |
job then this is a good sign that this is a really really good model. Which wouldn't 01:50:51.060 |
necessarily be that surprising because you know in the last few years I guess we've learned 01:50:58.620 |
a lot about building these kinds of models. And we're kind of taking advantage of a lot 01:51:03.940 |
of the tricks that have appeared in recent years. And yeah maybe this goes to show that 01:51:11.660 |
well I think it certainly goes to show that both random forests and neural nets have a 01:51:17.300 |
lot to offer. So try both, and maybe even combine both. We've talked about an approach 01:51:29.540 |
to ensembling called bagging which is where we train lots of models on different subsets 01:51:35.680 |
of the data and take the average of them. Another approach to ensembling, particularly ensembling of trees, 01:51:42.500 |
is called boosting. And boosting involves training a small model which underfits your 01:51:50.580 |
data set. So maybe like just have a very small number of leaf nodes. And then you calculate 01:51:57.200 |
the predictions using the small model. And then you subtract the predictions from the 01:52:02.500 |
targets. So these are kind of like the errors of your small underfit model. We call them 01:52:07.580 |
residuals. And then go back to step one, but now instead of using the original targets 01:52:15.440 |
use the residuals. Then train a small model which underfits your data set, attempting to 01:52:21.020 |
predict the residuals. Then do that again and again until you reach some stopping criterion 01:52:28.900 |
such as the maximum number of trees. Now that will leave you with a bunch of models 01:52:35.620 |
which you don't average but which you sum. Because each one is creating a model that's 01:52:42.500 |
based on the residual of the previous one. But we've subtracted the predictions of each 01:52:47.660 |
new tree from the residuals of the previous tree. So the residuals get smaller and smaller. 01:52:53.260 |
And then to make predictions we just have to do the opposite which is to add them all 01:52:56.980 |
together. So there's lots of variants of this. But you'll see things like GBMs for gradient 01:53:06.780 |
boosted machines or GBDTs for gradient boosted decision trees. And there's lots of minor 01:53:14.460 |
details around you know and significant details. But the basic idea is what I've shown. 01:53:21.580 |
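Here's a minimal sketch of that loop using shallow sklearn trees (purely illustrative; real gradient boosting libraries add learning rates, regularization and much more):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(xs, y, n_trees=20, max_leaf_nodes=4):
    trees, residual = [], np.asarray(y, dtype=float)
    for _ in range(n_trees):
        # a small tree that underfits, trained on what's left to explain
        t = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(xs, residual)
        residual = residual - t.predict(xs)
        trees.append(t)
    return trees

def boost_predict(trees, xs):
    # sum the trees' predictions, don't average them
    return np.sum([t.predict(xs) for t in trees], axis=0)
```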
All right let's take the questions. Dropping features in a model is a way to reduce the 01:53:28.020 |
complexity of the model and thus reduce overfitting. Is this better than adding some regularization 01:53:33.820 |
like weight decay? I didn't claim that we removed columns to avoid overfitting. We removed 01:53:49.180 |
the columns to simplify the model: fewer things to analyze. It should also mean we don't need as many 01:54:00.460 |
trees but there's no particular reason to believe that this will regularize. And the 01:54:06.380 |
idea of regularization doesn't necessarily make a lot of sense for random forests; you can 01:54:06.380 |
always add more trees. Is there a good heuristic for picking the number of linear layers in 01:54:10.620 |
the tabular model? Not really. Well if there is I don't know what it is. I guess two, three 01:54:32.900 |
hidden layers works pretty well. So you know what I showed those numbers I showed are pretty 01:54:40.300 |
good for a large-ish model. By default it uses 200 and 100, so maybe start with the default 01:54:48.520 |
and then go up to 500 and 250 to see if that's an improvement, and just keep doubling 01:54:53.500 |
them until it stops improving or you run out of memory or time. The main thing to note 01:55:00.900 |
about boosted models is that there's nothing to stop us from overfitting. If you add more 01:55:05.660 |
and more trees to a bagging model, such as a random forest, it should 01:55:11.660 |
generalize better and better because each time you're using a new model which is based 01:55:16.980 |
on a subset of the data. But with boosting, each model will fit the training set better and 01:55:24.740 |
better, and gradually overfit more and more. So boosting methods do generally require more 01:55:32.460 |
hyperparameter tuning and fiddling around with. You know, you certainly can have regularization in 01:55:37.940 |
boosting. They're pretty sensitive to their hyperparameters, which is why they're not normally 01:55:46.640 |
my first go-to, but they more often win Kaggle competitions than random forests do; they tend 01:55:57.140 |
to be good at getting that last little bit of performance. So the last thing I'm going 01:56:04.860 |
to mention is something super neat which a lot of people don't seem to know exists. There's 01:56:11.500 |
a table from the entity embeddings paper, which is super cool, 01:56:17.100 |
where what they did was they built a neural network, they got the entity embeddings, 01:56:23.900 |
EE, and then they tried a random forest using the entity embeddings as predictors rather 01:56:35.220 |
than the approach I described with just the raw categorical variables. And the error 01:56:43.060 |
for a random forest went from 0.16 to 0.11. A huge improvement and very simple method 01:56:51.100 |
KNN went from 0.29 to 0.11. Basically all of the methods when they used entity embeddings 01:56:59.020 |
suddenly improved a lot. The one thing you should try if you have a look at the further 01:57:04.360 |
research section after the questionnaire is it asks to try to do this actually take those 01:57:10.260 |
entity embeddings that we trained in the neural net and use them in the random forest and 01:57:14.840 |
then maybe try ensembling again and see if you can beat the 0.223 that we had. This is 01:57:25.260 |
a really nice idea it's like you get you know all the benefits of boosted decision trees 01:57:32.140 |
but all of the nice features of entity embeddings and so this is something that not enough people 01:57:40.100 |
seem to be playing with for some reason. So overall you know random forests are nice and 01:57:49.940 |
easy to train you know they're very resilient they don't require much pre-processing they 01:57:54.460 |
train quickly they don't overfit you know they can be a little less accurate and they 01:58:03.020 |
can be a bit slow at inference time because the inference you have to go through every 01:58:08.180 |
one of those trees. Having said that a binary tree can be pretty heavily optimized so you 01:58:18.700 |
know it is something you can basically create a totally compiled version of a tree and they 01:58:24.100 |
can certainly also be done entirely in parallel so that's something to consider. Gradient boosting 01:58:36.020 |
machines are also fast to train on the whole, but a little more fussy about hyperparameters; 01:58:41.260 |
you have to be careful about overfitting but a bit more accurate. Neural nets may be the 01:58:49.380 |
fussiest to deal with they've kind of got the least rules of thumb around or tutorials 01:58:56.660 |
around saying this is kind of how to do it it's just a bit a bit newer a little bit less 01:59:00.660 |
well understood but they can give better results in many situations than the other two approaches 01:59:06.580 |
or at least with an ensemble can improve the other two approaches. So I would always start 01:59:11.380 |
with a random forest and then see if you can beat it using the others. So yeah why don't you 01:59:19.580 |
now see if you can find a Kaggle competition with tabular data whether it's running now 01:59:23.740 |
or it's a past one and see if you can repeat this process for that and see if you can get 01:59:29.220 |
in the top 10% of the private leaderboard that would be a really great stretch goal 01:59:34.860 |
at this point. Implement the decision tree algorithm yourself I think that's an important 01:59:40.100 |
one we really understand it and then from there create your own random forest from scratch 01:59:44.700 |
you might be surprised it's not that hard and then go and have a look at the tabular 01:59:52.500 |
model source code and at this point this is pretty exciting you should find you pretty 01:59:57.980 |
much know what all the lines do with two exceptions and if you don't you know dig around and explore 02:00:04.900 |
an experiment and see if you can figure it out. And with that we are I am very excited 02:00:13.220 |
to say at a point where we've really dug all the way in to the end of these real valuable 02:00:20.980 |
effective fast AI applications and we're understanding what's going on inside them. What should we 02:00:27.420 |
expect for next week? For next week we will look at NLP and computer vision and we'll do the 02:00:36.500 |
same kind of ideas delve deep to see what's going on. Thanks everybody see you next week.