
Lesson 7 - Deep Learning for Coders (2020)


Chapters

0:00 Weight decay (L2 regularization)
7:25 Creating our own Embedding module
12:45 Interpreting embeddings and bias
18:00 Embedding distance
20:00 Deep learning for collaborative filtering
24:09 Notebook 9 - Tabular modelling
25:30 Entity embeddings for categorical variables
30:11 Beyond deep learning for tabular data (ensembles of decision trees)
40:10 Decision trees
64:00 Random forests
72:10 Out-of-bag error
74:00 Model interpretation
94:00 Extrapolation
103:00 Using a neural network
109:20 Ensembling
117:40 Conclusion


00:00:00.000 | Hi everybody and welcome to lesson 7. We're going to start by having a look at a kind
00:00:07.520 | of regularization called weight decay. And the issue that we came to at the end of the
00:00:13.400 | last lesson is that we were training our simple dot product model with bias, and our loss
00:00:22.400 | started going down and then it started going up again. And so we have a problem that we
00:00:29.280 | are overfitting. And remember in this case we're using mean squared error. So try to
00:00:35.560 | recall why it is that we don't need a metric here, because mean squared error is pretty
00:00:42.760 | much the thing we care about really, or we could use mean absolute error if we like,
00:00:47.800 | but either of those works fine as a loss function. They don't have the problem of big flat areas
00:00:52.720 | like accuracy does for classification. So what we want to do is to make it less likely
00:01:00.400 | that we're going to overfit by doing something we call reducing the capacity of the model.
00:01:06.120 | The capacity of the model is basically how much space does it have to find answers. And
00:01:11.800 | if it can kind of find any answer anywhere, those answers can include basically memorizing
00:01:18.320 | the data set. So one way to handle this would be to decrease the number of latent factors.
00:01:27.320 | But generally speaking, reducing the number of parameters in a model, particularly as
00:01:32.780 | we look at more deep learning style models, ends up biasing the models towards very simple
00:01:40.720 | kind of shapes. So there's a better way to do it rather than reducing the number of parameters.
00:01:47.840 | Instead, we try to force the parameters to be smaller, unless they're really required to be big.
00:01:55.480 | And the way we do that is with weight decay. Weight decay is also known as L2 regularization.
00:02:00.960 | They're very slightly different, but we can think of them as the same thing. And what
00:02:05.320 | we do is we change our loss function, and specifically we change the loss function by
00:02:10.120 | adding to it the sum of all the weights squared. In fact, all of the parameters squared, I
00:02:17.900 | should really say. Why do we do that? Well, because if that's part of the loss function, then
00:02:24.460 | one way to decrease the loss would be to decrease the weights, one particular weight or all
00:02:30.360 | of the weights or something like that. And so when we decrease the weights, if you think
00:02:38.980 | about what that would do, then think about, for example, the different possible values
00:02:49.100 | of a in y equals a x squared. The larger a is, for example if a is 50, you get these very
00:02:57.080 | narrow peaks. In general, big coefficients are going to cause big swings: big changes
00:03:05.800 | in the loss from small changes in the parameters. And when you have these kinds of sharp peaks
00:03:13.660 | or valleys, it means that a small change
00:03:22.520 | to the input can make a big change to the loss. And so if you're in that
00:03:27.760 | situation, then you can basically fit all the data points close to exactly with a really
00:03:33.480 | complex jagged function with sharp changes, which exactly tries to sit on each data point
00:03:41.020 | rather than finding a nice smooth surface which connects them all together or goes through
00:03:46.660 | them all. So if we limit our weights by adding in the loss function, the sum of the weights
00:03:54.580 | squared, then what it's going to do is it's going to fit less well on the training set
00:04:00.760 | because we're giving it less room to try anything that it wants to, but we're going to hope
00:04:05.320 | that it would result in a better loss on the validation set or the test set so that it
00:04:10.740 | will generalize better. One way to think about this is that the loss with weight decay is
00:04:17.020 | just the loss plus the sum of the parameters squared times some number we pick, a hyperparameter.
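In code, the idea looks roughly like this. This is a sketch rather than fastai's actual implementation; parameters stands in for any weight tensor, and wd is the weight decay hyperparameter we pick.

    # Conceptually, weight decay just adds the sum of squared parameters to the loss:
    loss_with_wd = loss + wd * (parameters**2).sum()

    # In practice we never compute that extra term; since the derivative of wd * w**2
    # is 2 * wd * w, we get the same effect by adding directly to the gradients:
    parameters.grad += wd * 2 * parameters.data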
00:04:27.500 | This is like 0.1 or 0.01 or 0.001 kind of region. So this is basically what loss with
00:04:35.680 | weight decay looks like in this equation. But remember, when it actually comes to it,
00:04:40.500 | how is the loss used in stochastic gradient descent? It's used by taking its gradient.
00:04:45.940 | So what's the gradient of this? Well, if you remember back to when you first learned calculus,
00:04:52.880 | it's okay if you don't. The gradient of something squared is just two times that something. I've
00:05:00.200 | changed 'parameters' to 'weight' here, which is a bit confusing, so I'll just use 'weight' to
00:05:05.800 | keep it consistent. Maybe 'parameters' would be better. So the derivative of weight squared is just
00:05:12.880 | two times weight. So in other words, to add in this term to the gradient, we can just
00:05:20.480 | add to the gradients weight decay times two times weight. And since weight decay is just
00:05:28.420 | a hyperparameter, we can just replace it with weight decay times two. So that would just
00:05:32.500 | give us weight decay times weight. So weight decay refers to adding onto the gradients
00:05:42.960 | the weights times some hyperparameter. And so that is going to try to create these kinds
00:05:48.780 | of more shallow, less bumpy surfaces. So to do that, we can simply, when we call fit or
00:05:59.260 | fit one cycle or whatever, we can pass in a WD parameter and that's just this number
00:06:07.100 | here. So if we pass in 0.1, then the training loss goes from 0.29 to
00:06:15.660 | 0.49. That's much worse, right, because we can't overfit anymore. The validation
00:06:20.980 | loss goes from 0.89 to 0.82, much better. So this is an important
00:06:27.740 | thing to remember for those of you that have done a lot of more traditional statistical
00:06:31.820 | models is in kind of more traditional statistical models, we try to avoid overfitting and we
00:06:37.800 | try to increase generalization by decreasing the number of parameters. But in a lot of
00:06:44.040 | modern machine learning and certainly deep learning, we tend to instead use regularization
00:06:51.660 | such as weight decay because it gives us more flexibility. It lets us use more nonlinear
00:06:56.820 | functions and still, you know, still reduce the capacity of the model. Great. So we're
00:07:03.940 | down to 0.823. This is a good model. This is really actually a very good
00:07:08.780 | model. And so let's dig into actually what's going on here because in our, in our architecture,
00:07:18.300 | remember we basically just had four embedding layers. So what's an embedding layer? We've
00:07:24.580 | described it conceptually, but let's write our own. And remember we said that an embedding
00:07:29.980 | layer is just a computational shortcut for doing a matrix multiplication by a one hot
00:07:35.380 | encoded matrix and that that is actually the same as just indexing into an array. So an
00:07:43.860 | embedding is just a indexing into an array. And so it's nice to be able to create our
00:07:50.380 | own versions of things that exist in PyTorch and fast.ai. So let's do that for embedding.
00:07:56.780 | So if we're going to create our own kind of layer, which is pretty cool, we need to be
00:08:02.500 | aware of something, which is normally a layer is basically created by inheriting as we've
00:08:09.660 | discussed from module or nn.module. So for example, this is an example here of a module
00:08:15.060 | where we've created a class called t that inherits from module. And when it's constructed,
00:08:20.340 | remember that's what dunder init does. This is just a dummy little
00:08:25.020 | module here. We're going to set self.a to the number one repeated three times as a tensor.
00:08:31.820 | Now if you remember back to notebook four, we talked about how the optimizers in PyTorch
00:08:37.820 | and fast.ai rely on being able to grab the parameters attribute to find a list of all
00:08:42.900 | the parameters. Now if you want to be able to optimize self.a, it would need to appear
00:08:48.860 | in parameters, but actually there's nothing there. Why is that? That's because PyTorch
00:08:56.340 | does not assume that everything that's in a module is something that needs to be learned.
00:09:01.460 | To tell it that it's something that needs to be learned, you have to wrap it with nn.Parameter.
00:09:05.900 | So here's exactly the same class, but torch.ones(3), which is just a tensor of three ones
00:09:12.500 | in this case, is wrapped in nn.Parameter. And now if I go parameters, I see I have a parameter
00:09:20.340 | with three ones in it. And that's going to automatically call requires_grad_
00:09:25.300 | for us as well. We haven't had to do that for things like nn.Linear in the past because
00:09:32.460 | PyTorch automatically uses nn.Parameter internally. So if we have a look at the parameters for
00:09:37.820 | something that uses nn.Linear with no bias, you'll see again we have here a parameter
00:09:44.900 | with three things in it. So we want to in general be able to create a parameter. So
00:09:55.220 | something with a tensor with a bunch of things in and generally we want to randomly initialize
00:09:59.260 | them. So to randomly initialize, we can pass in the size we want. We can initialize a tensor
00:10:04.980 | of zeros of that size and then randomly generate some normal, normally distributed random numbers
00:10:10.820 | with a mean of zero and a standard deviation of 0.01. No particular reason I'm picking those numbers
00:10:16.220 | just to show how this works. So here's something that will give us back a set of parameters
00:10:22.380 | of any size we want. And so now we're going to replace everywhere that used to say embedding.
00:10:28.140 | I'm going to replace it with create_params. Everything else here is the same in the
00:10:34.840 | dunder init. And then the forward is very, very similar to before. As you can see, I'm
00:10:40.380 | grabbing the zero index column from x, that's my users, and I just look it up as you see
00:10:49.260 | in that user factors array. And the cool thing is I don't have to do anything with gradients
00:10:54.320 | myself for this manual embedding layer because PyTorch can figure out the gradients automatically
00:10:59.500 | as we've discussed. But then I just got the dot product as before, add on the bias as
00:11:03.700 | before, do the sigmoid range as before. And so here's a dot product bias without any special
00:11:10.820 | PyTorch layers and we fit and we get the same result. So I think that is pretty amazingly
00:11:18.580 | cool. We've really shown that the embedding layer is nothing fancy, is nothing magic, right?
00:11:25.900 | It's just indexing into an array. So hopefully that removes a bit of the mystery for you.
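Putting those pieces together, the from-scratch version looks roughly like this. It's a sketch along the lines of the notebook's DotProductBias class, and assumes Module, sigmoid_range, torch and nn are available via from fastai.collab import *.

    def create_params(size):
        # randomly initialised, normally distributed parameters wrapped in nn.Parameter
        return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

    class DotProductBias(Module):
        def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
            self.user_factors  = create_params([n_users, n_factors])
            self.user_bias     = create_params([n_users])
            self.movie_factors = create_params([n_movies, n_factors])
            self.movie_bias    = create_params([n_movies])
            self.y_range = y_range

        def forward(self, x):
            # the "embedding" here is just indexing into the parameter arrays
            users  = self.user_factors[x[:, 0]]
            movies = self.movie_factors[x[:, 1]]
            res = (users * movies).sum(dim=1)
            res += self.user_bias[x[:, 0]] + self.movie_bias[x[:, 1]]
            return sigmoid_range(res, *self.y_range)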
00:11:32.420 | So let's have a look at this model that we've created and we've trained and find out what
00:11:38.100 | it's learned. That's already useful. We've got something we can make pretty accurate
00:11:41.620 | predictions with. But let's find out what those, what the model looks like. So remember
00:11:49.180 | when we have a question. Okay, let's take a question before you can look at this. What's
00:11:55.700 | the advantage of creating our own embedding layer over the stock PyTorch one? Oh, nothing
00:12:02.220 | at all. We're just showing that we can. It's great to be able to dig under the surface
00:12:06.100 | because at some point you'll want to try doing new things. So a good way to learn to do new
00:12:10.520 | things is to be able to replicate things that already exist and you can expect that you
00:12:15.380 | understand how they work. It's also a great way to understand the foundations of what's
00:12:19.180 | going on is to actually code your own implementation. But I wouldn't expect
00:12:24.980 | you to use this implementation in practice. But basically it removes all the mystery. So
00:12:32.940 | if you remember we've created a learner called learn and to get to the model that's inside
00:12:37.780 | it, you can always call learn.model and then inside that there's going to be automatically
00:12:45.180 | created for it. Well, sorry, not automatically. We've created all these attributes: movie_factors,
00:12:49.420 | movie_bias and so forth. So we can grab learn.model.movie_bias. And now what I'm going
00:12:59.220 | to sort that vector and I'm going to print out the first five titles. And so what this
00:13:08.180 | is going to do is it's going to print out the movies with the smallest bias and here
00:13:15.220 | they are. What does this mean? Well, it kind of means these are the five movies that people
00:13:22.620 | really didn't like. But it's more than that. It's not only do people not like them, but
00:13:29.180 | if we take account of the genre they're in, the actors they have, you know, whatever the
00:13:34.620 | latent factors are, people liked them a lot less than they expected. So maybe, for example
00:13:41.380 | (I haven't seen any of these movies, luckily perhaps), this one is a sci-fi movie,
00:13:49.900 | so people who generally like sci-fi movies found it so bad they still didn't like
00:13:54.100 | it. So we can do the exact opposite, which is to sort descending. And here are the top five
00:14:02.460 | movies and specifically they're the top five by bias, right? So these are the movies that
00:14:07.700 | even after you take account of the fact that LA Confidential, I have seen all of these
00:14:11.900 | ones. So LA Confidential is a kind of a murder mystery cop movie, I guess. And people who
00:14:18.860 | don't necessarily like that genre, or who (I think Guy Pearce was in it) maybe don't like
00:14:22.700 | Guy Pearce very much, whatever, still like this movie more than you'd expect. So
00:14:29.340 | this is a kind of a nice thing that we can look inside our model and see what it's learned.
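In code, that lookup is roughly the following; it assumes the usual names from the notebook, a learner called learn and its DataLoaders dls, with the movie titles in dls.classes['title'].

    movie_bias = learn.model.movie_bias.squeeze()
    idxs = movie_bias.argsort()[:5]                     # lowest bias: least liked, all else equal
    print([dls.classes['title'][i] for i in idxs])

    idxs = movie_bias.argsort(descending=True)[:5]      # highest bias: most liked, all else equal
    print([dls.classes['title'][i] for i in idxs])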
00:14:35.060 | We can look at not only at the bias vector, but we can also look at the factors. Now there
00:14:43.940 | are 50 factors, which is too many to visualize. So we can use a technique called PCA, principal
00:14:50.020 | component analysis. The details don't matter, but basically it's going to squish those
00:14:53.780 | 50 factors down to three. And then we'll plot the top two as you can see here. And what we
00:15:04.060 | see when we plot the top two is we can kind of see that the movies have been kind of spread
00:15:11.380 | out across a space of some kind of latent factors. And so if you look at the far right,
00:15:18.700 | there's a whole bunch of kind of big budget things. And on the far left, there's
00:15:25.540 | more like cult kind of things: Fargo, Schindler's List, Monty Python. By the same token at the
00:15:33.660 | bottom, we've got The English Patient, When Harry Met Sally, so kind of romance drama kind of
00:15:42.860 | stuff. And at the top, we've got action and sci-fi kind of stuff. So you can see even
00:15:50.660 | though we haven't fed in any information about these movies, all we've seen is who
00:15:57.900 | likes what. These latent factors have automatically kind of figured out a space or a way of thinking
00:16:05.380 | about these movies based on what kinds of movies people like and what other kinds of
00:16:09.660 | movies they like along with those. But that's really interesting to kind of try and visualize
00:16:15.300 | what's going on inside your model. Now we don't have to do all this manually. We can
00:16:25.540 | actually just say give me a collab learner using this set of data loaders with this number
00:16:32.120 | of factors and this y range, and it does everything we've just seen; again, we get about the same number.
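That one-liner is roughly the following, a sketch using fastai's collaborative filtering application with the same settings as the from-scratch model:

    learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
    learn.fit_one_cycle(5, 5e-3, wd=0.1)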
00:16:37.980 | Okay, so now you can see this is nice, right? We've actually been able to see right underneath
00:16:43.100 | inside the collab learner part of the fast AI application, the collaborative filtering
00:16:48.060 | application and we can build it all ourselves from scratch. We know how to create the SGD,
00:16:54.300 | know how to create the embedding layer, we know how to create the model, the architecture.
00:17:01.500 | So now you can see, you know, we really can build up from scratch our own version
00:17:06.580 | of this. So if we just type learn.model, you can see here the names are a bit more generic.
00:17:13.220 | This is a user weight, item weight, user bias, item bias, but it's basically the same stuff
00:17:17.700 | we've seen before. And we can replicate the exact analysis we saw before by using this
00:17:24.240 | same idea. Okay, slightly different order this time because it is a bit random but pretty
00:17:34.660 | similar as well. Another interesting thing we can do is we can think about the distance
00:17:41.880 | between two movies. So let's grab all the movie factors or just pop them into a variable
00:17:51.220 | and then let's pick a movie and then let's find the distance from that movie to every
00:18:05.160 | other movie. And so one way of thinking about distance is you might recall the Pythagorean
00:18:10.340 | formula or the distance on the hypotenuse of a triangle, which is also the distance
00:18:17.820 | to a point in a Cartesian plane on a chart, which is root x squared plus y squared. You
00:18:25.020 | might know, it doesn't matter if you don't, but you can do exactly the same thing for
00:18:28.760 | 50 dimensions. It doesn't just work for two dimensions. There's a, that tells you how
00:18:36.020 | far away a point is from another point if you, if x and y are actually differences between
00:18:43.620 | two movie vectors. So then what gets interesting is you can actually then divide that kind
00:18:58.020 | of by the, by the length to make all the lengths the same distance to find out how the angle
00:19:03.620 | between any two movies and that actually turns out to be a really good way to compare the
00:19:07.620 | similarity of two things. That's called cosine similarity. And so the details don't matter.
00:19:12.120 | You can look them up if you're interested. But the basic idea here is to see that we
00:19:16.340 | can actually pick a movie and find the movie that is the most similar to it based on these
00:19:23.260 | factors. Kind of interesting.
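As a rough sketch of that, assuming the collab_learner model from just above (whose item embedding lives in learn.model.i_weight) and a movie title that exists in the vocab:

    movie_factors = learn.model.i_weight.weight
    idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']   # any title in the vocab
    distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
    idx = distances.argsort(descending=True)[1]    # index 0 is the movie itself
    print(dls.classes['title'][idx])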
00:19:25.140 | I have a question.
00:19:27.060 | All right.
00:19:28.740 | What motivated learning a 50-dimensional embedding and then using PCA to reduce it to three,
00:19:33.660 | versus just learning a three-dimensional embedding?
00:19:36.060 | Oh, because the purpose of this was actually to create a good model. So the, the visualization
00:19:42.100 | part is normally kind of the exploration of what's going in, on in your model. And so
00:19:47.780 | with 50 latent factors, you're going to get a more accurate model. So that's one
00:19:54.660 | approach: this dot product version. There's another version we could use, which is we
00:20:02.660 | could create a set of user factors and a set of item factors and just like before we could
00:20:12.620 | look them up. But what we could then do instead of doing a dot product, we could concatenate
00:20:18.140 | them together into a tensor that contains both the user and the movie factors next to
00:20:26.340 | each other. And then we could pass them through a simple little neural network, linear, relu,
00:20:34.820 | linear, and then sigmoid range as before.
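Roughly, that model looks like this; a sketch along the lines of the notebook's CollabNN, again assuming the fastai imports, where user_sz and item_sz are (number of levels, embedding size) tuples:

    class CollabNN(Module):
        def __init__(self, user_sz, item_sz, y_range=(0, 5.5), n_act=100):
            self.user_factors = Embedding(*user_sz)
            self.item_factors = Embedding(*item_sz)
            self.layers = nn.Sequential(
                nn.Linear(user_sz[1] + item_sz[1], n_act),
                nn.ReLU(),
                nn.Linear(n_act, 1))
            self.y_range = y_range

        def forward(self, x):
            embs = self.user_factors(x[:, 0]), self.item_factors(x[:, 1])
            x = self.layers(torch.cat(embs, dim=1))    # concatenate rather than dot product
            return sigmoid_range(x, *self.y_range)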
00:20:39.280 | So importantly here, the first linear layer, the number of inputs is equal to the number
00:20:45.020 | of user factors plus the number of item factors. And the number of outputs is however many
00:20:51.360 | activations we have. And then we just default to 100 here. And then the final layer will
00:21:00.160 | go from 100 to 1 because we're just making one prediction. And so we could create, we'll
00:21:06.060 | call that collab nn. We can instantiate that to create a model. We can create a learner
00:21:11.060 | and we can fit. It's not going quite as well as before. It's not terrible, but it's not
00:21:16.860 | quite as good as our dot product version. But the interesting thing here is it does give
00:21:21.900 | us some more flexibility, which is that since we're not doing a dot product, we can actually
00:21:27.060 | have a different embedding size for each of users versus items. And actually fast.ai has
00:21:35.220 | a simple heuristic. If you call get embedding size and pass in your data loaders, it will
00:21:40.540 | suggest appropriate size embedding matrices for each of your categorical variables, each
00:21:49.060 | of your user and item columns. So if we pass in *embs, that's going
00:22:02.020 | to pass in the user tuple and the item tuple, which we can then pass to Embedding. This is
00:22:11.340 | the * prefix we learned about in the last class in case you forgot. So this is kind
00:22:17.660 | of interesting. We can, you know, we can see here that there's two different architectures
00:22:23.260 | we could pick from. It wouldn't be necessarily obvious ahead of time which one's going to
00:22:26.660 | work better. In this particular case, the simplest one, the dot product one, actually turned out
00:22:32.940 | to work a bit better, which is interesting. This particular version here, if you call
00:22:37.580 | collab_learner and pass use_nn=True, then what that's going to do is it's going to use
00:22:44.940 | this version, the version with concatenation and the linear layers. So collab_learner, use_nn
00:22:56.980 | =True, again we get about the same result as you'd expect, because it's just a shortcut
00:23:01.020 | for this version. And it's interesting actually, if we have a look at collab_learner, it actually
00:23:09.100 | returns an object of type EmbeddingNN, and it's kind of cool if you look inside the fastai
00:23:14.420 | source code or use the double question mark trick to see the source code for
00:23:18.020 | EmbeddingNN, you'll see it's three lines of code. How does that happen? Because we're using this
00:23:24.340 | thing called TabularModel, which we will learn about in a moment, but basically this
00:23:32.620 | neural net version of collaborative filtering is literally just a tabular model in which
00:23:37.740 | we pass no continuous variables and some embedding sizes. So we'll see that in a moment.
00:23:50.080 | Okay so that is collaborative filtering, and again take a look at the further research
00:23:55.060 | section in particular after you finish the questionnaire, because there's some really
00:23:59.740 | important next steps you can take to push your knowledge and your skills.
00:24:06.620 | So let's now move to notebook 9, Tabular. And we're going to look at tabular modeling
00:24:14.100 | and do a deep dive. And let's start by talking about this idea that we were starting to see
00:24:19.180 | here, which is embeddings. And specifically let's move beyond just having embeddings for
00:24:28.100 | users and items, but embeddings for any kind of categorical variable. So really because
00:24:34.860 | we know an embedding is just a lookup into an array, it can handle any kind of discrete
00:24:43.700 | categorical data. So things like age are not discrete, they're continuous numerical data,
00:24:49.140 | but something like sex or postcode are categorical variables. They have a certain number of discrete
00:24:56.420 | levels. The number of discrete levels they have is called their cardinality. So to have
00:25:02.740 | a look at an example of a dataset that contains both categorical and continuous variables,
00:25:10.060 | we're going to look at the Rossman sales competition that ran on Kaggle a few years ago. And so
00:25:16.540 | basically what's going to happen is we're going to see a table that contains information
00:25:21.580 | about various stores in Germany, and the goal will be to try and predict how many sales
00:25:26.940 | there's going to be for each day in a couple of week period for each store.
00:25:34.700 | One of the interesting things about this competition is that one of the gold medalists used deep
00:25:39.980 | learning, and it was one of the earliest known examples of a state-of-the-art deep learning
00:25:45.980 | tabular model. I mean this is not long ago, 2015 or something, but really this idea of
00:25:52.660 | creating state-of-the-art tabular models with deep learning has not been very common, and
00:25:58.860 | not for very long. You know, interestingly, compared to the other gold medalists in this
00:26:04.020 | competition, the folks that use deep learning used a lot less feature engineering and a
00:26:08.500 | lot less domain expertise. And so they wrote a paper called Entity Embeddings of Categorical
00:26:13.540 | Variables, in which they basically described the exact thing that you saw in notebook 8,
00:26:21.940 | the way you can think of one-hot encodings as just being embeddings, you can concatenate
00:26:27.420 | them together, and you can put them through a couple of layers, they call them dense layers,
00:26:33.140 | we've called them linear layers, and create a neural network out of that. So this is really
00:26:38.940 | a neat, you know, kind of simple and obvious-in-hindsight trick. And they actually did exactly
00:26:45.940 | what we did in the paper, which is to look at the results of the trained embeddings.
00:26:52.900 | And so for example they had an embedding matrix for regions in Germany, because there wasn't
00:27:02.300 | really metadata about this, these were just learned embeddings, just like we learned embeddings
00:27:06.120 | about movies. And so then they just created, just like we did before, a chart where they
00:27:12.300 | plotted each region according to, I think, probably a PCA of their embeddings. And then if you
00:27:18.820 | circle the ones that are close to each other in blue, you'll see that they're actually
00:27:24.140 | close to each other in Germany, and ditto for red, and ditto for green, and then here's
00:27:30.580 | the brown. So this is like pretty amazing, is the way that we can see that it's kind
00:27:38.660 | of learned something about what Germany looks like, based entirely on the purchasing behavior
00:27:44.180 | of people in those states. Something else they did was to look at every store, and they
00:27:50.420 | looked at the distance between stores in practice, like how many kilometers away they are. And
00:27:58.100 | then they looked at the distance between stores in terms of their embedding distance, just
00:28:03.700 | like we saw in the previous notebook. And there was this very strong correlation that
00:28:09.260 | stores that were close to each other physically ended up having close embeddings as well,
00:28:18.180 | even though the actual location of these stores in physical space was not part of the model.
00:28:26.180 | Ditto with days of the week, so the days of the week are another embedding, and the days
00:28:32.100 | of the week that were next to each other, ended up next to each other in embedding space,
00:28:37.740 | and ditto for months of the year. So pretty fascinating the way kind of information about
00:28:44.900 | the world ends up captured just by looking at training embeddings, which as we know are
00:28:50.700 | just index lookups into an array. So the way we then combine these categorical variables
00:29:00.220 | with these embeddings with continuous variables, what was done in both the entity embedding
00:29:06.620 | paper that we just looked at, and then also described in more detail by Google when they
00:29:13.060 | described how their recommendation system in Google Play works. This is from Google's
00:29:18.180 | paper, is they have the categorical features that go through the embeddings, and then there
00:29:23.260 | are continuous features, and then all the embedding results and the continuous features
00:29:27.940 | are just concatenated together into this big concatenated table that then goes through
00:29:32.700 | this case three layers of a neural net, and interestingly they also take the kind of collaborative
00:29:40.620 | filtering bit and do the dot product as well and combine the two. So they use both of the tricks
00:29:46.340 | we used in the previous notebook and combine them together. So that's the basic idea we're
00:29:54.340 | going to be seeing for moving beyond just collaborative filtering, which is just two
00:30:01.180 | categorical variables to as many categorical and as many continuous variables as we like.
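A hypothetical minimal version of that combination might look like the following. The class name and sizes here are made up for illustration; fastai's real TabularModel, which we'll meet shortly, adds batchnorm, dropout and more.

    import torch
    from torch import nn

    class SimpleTabularNet(nn.Module):
        def __init__(self, emb_szs, n_cont, n_out=1, n_hidden=200):
            super().__init__()
            # one embedding per categorical column: (cardinality, embedding size) pairs
            self.embeds = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in emb_szs])
            n_emb = sum(nf for _, nf in emb_szs)
            self.layers = nn.Sequential(
                nn.Linear(n_emb + n_cont, n_hidden), nn.ReLU(),
                nn.Linear(n_hidden, n_out))

        def forward(self, x_cat, x_cont):
            # look up each categorical column's embedding, then concatenate with the continuous columns
            x = torch.cat([e(x_cat[:, i]) for i, e in enumerate(self.embeds)], dim=1)
            return self.layers(torch.cat([x, x_cont], dim=1))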
00:30:07.340 | But before we do that, let's take a step back and think about other approaches, because
00:30:12.900 | as I mentioned, the idea of deep learning as a kind of a best practice for tabular data
00:30:19.940 | is still pretty new and it's still kind of controversial. It's certainly not always the
00:30:25.500 | case that it's the best approach. So when we're not using deep learning, what would
00:30:30.980 | we be using? Well, what we'd probably be using is something called an ensemble of decision
00:30:36.620 | trees and the two most popular are random forests and gradient boosting machines or
00:30:43.140 | something similar. So basically, between multi-layered neural networks trained with SGD and ensembles
00:30:49.900 | of decision trees, that kind of covers the vast majority of approaches that you're likely
00:30:55.700 | to see for tabular data. And so we're going to make sure we cover them both of course today,
00:31:01.820 | in fact. So although deep learning is nearly always clearly superior for stuff like images
00:31:09.580 | and audio and natural language text, these two approaches tend to give somewhat similar
00:31:15.820 | results a lot of the time for tabular data. So let's take a look. You know, you really
00:31:21.820 | should generally try both and see which works best for you for each problem you look at.
00:31:28.660 | Why does the range go from 0 to 5.5 if the maximum is 5?
00:31:38.140 | That's a great question. The reason is if you think about it for sigmoid, it's actually
00:31:43.740 | impossible for a sigmoid to get all the way to the top or all the way to the bottom. Those
00:31:49.180 | are asymptotes. So no matter how far, how big your x is, it can never quite get to the
00:31:54.780 | top or no matter how small it is, it can never quite get to the bottom. So if you want to
00:31:58.580 | be able to actually predict a rating of 5, then you need to use something higher than
00:32:03.100 | 5 as your maximum.
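This is essentially what fastai's sigmoid_range does: the sigmoid only ever approaches the endpoints, so the top of the range needs to sit a little above the highest rating you want to be able to predict.

    import torch

    def sigmoid_range(x, lo, hi):
        # scaled sigmoid: the output approaches, but never reaches, lo and hi
        return torch.sigmoid(x) * (hi - lo) + lo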
00:32:07.300 | Are embeddings used only for high-cardinality categorical variables, or is this approach
00:32:12.380 | used in general? For low cardinality, can one use a one-hot encoding?
00:32:18.500 | I'll remind you cardinality is the number of discrete levels in a variable. And remember
00:32:29.180 | that an embedding is just a computational shortcut for a one-hot encoding. So there's
00:32:36.180 | really no reason to use a one-hot encoding because, as long as you have more than
00:32:42.260 | two levels, it's always going to use more memory and be slower, and give you exactly, mathematically,
00:32:48.060 | the same thing. And if there's just two levels, then it is basically identical. So there isn't
00:32:53.820 | really any reason not to use it.
00:32:58.180 | Thank you for those great questions. Okay, so one of the most important things about
00:33:08.980 | decision tree ensembles is that at the current state of the technology, they do provide faster
00:33:15.100 | and easier ways of interpreting the model. I think that's rapidly improving for deep
00:33:19.840 | learning models on tabular data, but that's where we are right now. They also require
00:33:24.420 | less hyperparameter tuning, so they're easier to kind of get right the first time. So my
00:33:30.260 | first approach for analyzing a new tabular data set is always an ensemble of decision
00:33:35.220 | trees. And specifically, I pretty much always start with a random forest because it's just
00:33:39.220 | so reliable. Yes.
00:33:44.260 | In your experience, for highly imbalanced data, such as fraud or medical data, what usually
00:33:50.100 | works best out of random forest, XGBoost, or neural networks?
00:33:55.460 | I'm not sure that whether the data is balanced or unbalanced is a key reason for choosing
00:34:03.260 | one of those above the others. I would try all of them and see which works best. So the
00:34:09.660 | exception to the guideline that decision tree ensembles should be your first thing
00:34:13.820 | to try would be if there are some very high cardinality categorical variables, then they
00:34:18.600 | can be a bit difficult to get to work really well in decision tree ensembles. Or if there's
00:34:25.440 | something like, most importantly, if it's like plain text data or image data or audio
00:34:29.380 | data or something like that, then you're definitely going to need to use a neural net in there,
00:34:34.980 | but you could actually ensemble it with a random forest, as we'll see.
00:34:40.420 | Okay, so clearly we're going to need to understand how decision tree ensembles work. So PyTorch
00:34:50.500 | isn't a great choice for decision tree ensembles. They're really designed for gradient-based
00:34:55.420 | methods and random forests and decision tree growing are not really gradient-based methods
00:35:01.780 | in the same way. So instead, we're going to use a library called scikit-learn, referred
00:35:08.640 | to as sklearn as a module. Scikit-learn does a lot of things. We're only going to touch
00:35:16.420 | on a tiny piece of them, stuff we need to do to train decision trees and random forests.
00:35:24.540 | We've already mentioned before Wes McKinney's book, also a great book for understanding
00:35:28.600 | more about scikit-learn. So the dataset for learning about decision tree ensembles is
00:35:35.500 | going to be another dataset; it's called the Blue Book for Bulldozers dataset
00:35:42.660 | and it's a Kaggle competition. So Kaggle competitions are fantastic. They are machine learning competitions
00:35:52.060 | where you get interesting datasets, you get feedback on whether your approach is any good
00:35:56.820 | or not. You can see on a leaderboard what approaches are working best and then you can
00:36:01.140 | read blog posts from the winning contestants sharing tips and tricks. It's certainly not
00:36:07.940 | a substitute for actual practice doing end-to-end data science projects, but for becoming good
00:36:19.060 | at creating models that really are predictive, it's a really fantastic resource, highly recommended.
00:36:25.980 | And you can also submit to most old competitions to see how you would have gone, without having
00:36:31.860 | to worry about, you know, the kind of stress of like whether people will be looking at
00:36:35.980 | your results because they're not publicized or published if you do that.
00:36:41.540 | There's a question. Can you comment on real-time applications of random forests? In my experience,
00:36:49.900 | they tend to be too slow for real-time use cases like a recommender system, neural network
00:36:55.680 | is much faster when run on the right hardware.
00:36:58.860 | Let's get to that once we've seen what they are, shall we? Now you can't just download
00:37:08.620 | and untar Kaggle datasets using the untar_data function that we have in fastai. So you actually
00:37:13.540 | have to sign up to Kaggle and then follow these instructions for how to download data
00:37:20.380 | from Kaggle. Make sure you replace creds here with what it describes. You need to get a
00:37:24.980 | special API code and then run this one time to put that up on your server. And now you
00:37:32.020 | can use Kaggle to download data using the API. So after we do that, we're going to end
00:37:41.720 | up with a bunch of, as you see, CSV files. So let's take a look at this data.
00:37:49.340 | So the main data, the main table is train.csv. Remember that's comma separated values and
00:37:55.980 | the training set contains information such as unique identifier of a sale, the unique
00:38:00.980 | identifier of a machine, the sale price, sale date. So what's going on here is one row of
00:38:07.100 | the data represents a sale of a single piece of heavy machinery like a bulldozer at an
00:38:14.620 | auction. So it happens at a date, as a price, it's of some particular piece of equipment
00:38:20.860 | and so forth. So if we use pandas again to read in the CSV file, let's combine training
00:38:34.200 | and valid together. We can then look at the columns. There are a lot of columns there
00:38:34.200 | and many things which I don't know what the hell they mean like blade extension and pad
00:38:37.860 | type and ride control. But the good news is we're going to show you a way that you don't
00:38:43.340 | have to look at every single column and understand what they mean and random forests are going
00:38:48.120 | to help us with that as well. So once again, we're going to be seeing this idea that models
00:38:53.700 | can actually help us with data understanding and data cleanup. One thing we can look at
00:39:03.920 | is ordinal columns; now is a good place to look at that. If there are things there that you
00:39:11.380 | know are discrete values but have some order, like product size, which has levels like Mini, Small,
00:39:19.340 | Medium, Large / Medium and Large, these should not be in alphabetical order or some random order,
00:39:28.820 | they should be in this specific order, right? They have a specific ordering. So we can use
00:39:34.420 | astype to turn it into a categorical variable and then we can call set_categories with ordered
00:39:40.300 | equals True to basically say this is an ordinal column. So it's got discrete values but we actually
00:39:40.300 | want to define what the order of the classes are. We need to choose which is the dependent
00:39:48.260 | variable and we do that by looking on Kaggle and Kaggle will tell us that the thing we're
00:39:52.180 | meant to be predicting is sale price and actually specifically they'll tell us the thing we're
00:39:56.960 | meant to be predicting is the log of sale price because root mean squared log error
00:40:02.060 | is what we're actually going to be judged on in the competition where we take the log.
00:40:09.020 | So we're not going to replace sale price with its log and that's what we'll be using from
00:40:12.940 | now on. So a decision tree ensemble requires decision trees. So let's start by looking
00:40:20.340 | at decision trees. So a decision tree in this case is a something that asks a series of
00:40:28.060 | binary that is yes or no questions about data. So such as is somebody less than or greater
00:40:34.540 | than 30? Yes they are. Are they eating healthily? Yes they are and so okay then we're going
00:40:39.700 | to say they're fit or unfit. So like there's an example of some arbitrary decision tree
00:40:46.540 | that somebody might have come up with. It's a series of binary yes and no choices and
00:40:51.620 | at the bottom are leaf nodes that make some prediction. Now of course for our bulldozers
00:41:02.380 | competition we don't know what binary questions to ask about these things and in what order
00:41:10.180 | in order to make a prediction about sale price. So we're doing machine learning so we're going
00:41:15.180 | to try and come up with some automated way to create the questions. And there's actually
00:41:20.700 | a really simple procedure for doing that, but you have to think about it. So if you want
00:41:24.620 | to kind of stretch yourself here, have a think about what automatic procedure you
00:41:30.620 | could come up with that would automatically build a decision tree where the final answer
00:41:36.300 | would do a, you know, significantly better than random job of estimating the sale price of
00:41:44.220 | one of these auctions. Alright so here's the approach that we could use. Loop through each
00:41:53.500 | column of the data set. We're going to go through each of them (well, obviously not sale price,
00:41:59.300 | it's the dependent variable): sale ID, machine ID, auctioneer, year made, etc. And so one of
00:42:05.140 | those will be, for example, product size. And so then what we're going to do is we're going
00:42:11.660 | to loop through each possible value of product size: Large, Large / Medium, Medium, etc. And
00:42:21.380 | then we're going to do a split basically like where this comma is and we're going to say
00:42:25.260 | okay let's get all of the auctions of large equipment and put that into one group and
00:42:32.820 | everything that's smaller than that and put that into another group. And so that's here
00:42:38.900 | split the data into two groups based on whether they're greater than or less than that value.
00:42:45.740 | If it's a categorical, non-ordinal variable, it'll just be whether it's equal
00:42:49.740 | or not equal to that level. And then we're going to find the average sale price for each
00:42:55.620 | of the two groups. So for the large group what was the average sale price? For the smaller
00:43:00.900 | than large group what was the average sale price? And that will be our model. Our prediction
00:43:06.940 | will simply be the average sale price for that group. And so then you can say well how
00:43:12.460 | good is that model? If our model was just to ask a single question with a yes/no answer
00:43:17.380 | put things into two groups and take the average of the group as being our prediction and we
00:43:22.260 | can say how good would that model be? What would be the root mean squared error from
00:43:26.140 | that model? And so we can then say all right how good would it be if we use large as a
00:43:32.580 | split? And then let's try again what if we did large/medium as a split? What if we did
00:43:38.260 | medium as a split? And so in each case we can find the root mean squared error of that
00:43:42.180 | incredibly simple model. And then once we've done that for all of the product size levels
00:43:47.020 | we can go to the next column and look at level of usage band and do every level of usage
00:43:55.380 | band and then state, every level of state and so forth. And so there'll be some variable
00:44:02.860 | and some split level which gives the best root mean squared error of this really really
00:44:09.540 | simple model. And so then we'll say okay that would be our first binary decision. It gives
00:44:16.220 | us two groups and then we're going to take each one of those groups separately and find
00:44:22.580 | another single binary decision for each of those two groups using exactly the same procedure.
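Here's a rough sketch of that single-split search, as hypothetical illustration code: it assumes a pandas DataFrame df of numeric columns and a dependent variable column named by dep_var, and the real scikit-learn implementation is far more efficient. To grow a whole tree, you'd apply it recursively to each of the two groups it produces.

    import numpy as np

    def best_binary_split(df, dep_var):
        best = None                                  # (rmse, column, value)
        for col in df.columns.drop(dep_var):
            for val in df[col].dropna().unique():
                lhs = df[col] <= val
                if lhs.sum() == 0 or (~lhs).sum() == 0: continue
                # the "model" is just each group's mean sale price
                preds = np.where(lhs, df.loc[lhs, dep_var].mean(), df.loc[~lhs, dep_var].mean())
                rmse = np.sqrt(((preds - df[dep_var])**2).mean())
                if best is None or rmse < best[0]: best = (rmse, col, val)
        return best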
00:44:28.820 | So then we'll have four groups and then we'll do exactly the same thing again separately
00:44:33.460 | for each of those four groups and so forth. So let's see what that looks like and in fact
00:44:44.180 | once we've gone through this you might even want to see if you can implement this algorithm
00:44:47.500 | yourself. It's not trivial but it doesn't require any special coding skills so hopefully
00:44:55.020 | you'll find you're able to do it. There's a few things we have to do before we can actually
00:45:00.820 | create a decision tree, in terms of just some basic data munging. One is, if we're going
00:45:06.460 | to take advantage of dates, we actually want to call fastai's add_datepart function, and
00:45:13.660 | what that does, as you see after we call it, is it creates a whole bunch of
00:45:18.980 | different bits of metadata from that date: saleYear, saleMonth, saleWeek, saleDay
00:45:24.380 | and so forth. So a date of itself doesn't have a whole lot of information directly,
00:45:35.020 | but we can pull lots of different information out of it. And so this is an example of something
00:45:39.340 | called feature engineering, which is where we take some piece of data and
00:45:44.220 | we try to create lots of other pieces of data from it. So is this particular date
00:45:50.180 | the end of a month or not? The end of a year or not? And so forth. So that handles
00:45:56.940 | dates. There's a bit more cleaning we want to do, and fastai provides some things to make
00:46:03.700 | cleaning easier. We can use the tabular pandas class to create a tabular data set in pandas.
00:46:13.460 | And specifically we're going to use two tabular processors, or TabularProcs. A TabularProc
00:46:19.940 | is basically just a transform and we've seen transforms before so go back and remind yourself
00:46:24.580 | what a transform is. Except it's just slightly different it's like three lines of code if
00:46:30.620 | you look at the code for it. It's actually going to modify the object in place rather
00:46:36.260 | than creating a new object and giving it back to you. And that's because often these tables
00:46:40.420 | of data are kind of really big and we don't want to waste lots of RAM. And it's just going
00:46:46.300 | to run the transform once and save the result rather than doing it lazily when you access
00:46:51.060 | it for the same reason. We're just going to make this a lot faster. So you can just think
00:46:57.160 | of them as transforms really. One of them is called Categorify, and Categorify is going
00:47:02.020 | to replace a column with numeric categories, using the same basic idea of a vocab
00:47:09.340 | like we've seen before. FillMissing is going to find any columns with missing data; it's
00:47:16.240 | going to fill in the missing data with the median of the column and create a new column,
00:47:21.100 | a boolean column, which is set to true for anything that was missing. So these two things
00:47:25.760 | are basically enough to get you to a point where most of the time you'll be able to train
00:47:29.260 | a model. Now the next thing we need to do is think about our validation set. As we discussed
00:47:37.340 | in lesson one, a random validation set is not always appropriate and certainly for something
00:47:44.020 | like predicting auction results it almost certainly is not appropriate because we're
00:47:49.260 | going to be wanting to use a model in the future not at some random date in the past.
00:47:54.660 | So the way this Kaggle competition was set up was that the test set the thing that you
00:48:00.680 | had to fill in and submit for the competition was two weeks of data that was after any of
00:48:08.860 | the training set. So we should do the same thing for a validation set. We should create
00:48:14.580 | something which is where the validation set is the last couple of weeks of data and so
00:48:22.820 | then the training set will only be data before that. So we basically can do that by grabbing
00:48:28.340 | everything before October 2011, create a training and validation set based on that condition
00:48:35.260 | and grabbing those bits. So that's going to split our training set and validation set
00:48:43.520 | by date, not randomly. We're also going to need to tell it: when you create a TabularPandas
00:48:50.460 | object, you're going to be passing in a data frame, passing in your tabular
00:48:54.980 | procs, and you also have to say which are my categorical and continuous variables. We can
00:49:00.100 | use fastai's cont_cat_split to automatically split a data frame into continuous and categorical
00:49:07.820 | variables for you. So we can just pass those in. Tell it what is the dependent variable,
00:49:14.940 | you can have more than one, and what are the indexes to split into training and valid.
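Put together, the preprocessing looks roughly like this; a sketch assuming from fastai.tabular.all import * plus the df and dep_var from above, with the validation cut-off at October 2011 as described.

    df = add_datepart(df, 'saledate')                  # adds saleYear, saleMonth, saleDay, ... columns

    procs = [Categorify, FillMissing]
    cond = (df.saleYear < 2011) | (df.saleMonth < 10)  # training data is everything before October 2011
    splits = (list(np.where(cond)[0]), list(np.where(~cond)[0]))

    cont, cat = cont_cat_split(df, 1, dep_var=dep_var)
    to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
    len(to.train), len(to.valid)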
00:49:20.460 | And this is a tabular object. So it's got all the information you need about the training
00:49:24.720 | set, the validation set, categorical and continuous variables and the dependent variable and any
00:49:30.060 | processes to run. It looks a lot like a datasets object, but it has a .train, it has a .valid
00:49:41.060 | and so if we have a look at .show we can see the data. But .show is going to show us the
00:49:50.740 | kind of the string data, but if we look at .items you can see internally it's actually
00:49:56.800 | stored these very compact numbers which we can use directly in a model. So fast.ai has
00:50:06.140 | basically got us to a point here where we have our data into a format ready for modeling
00:50:11.500 | and our validation sets being created. To see how these numbers relate to these strings
00:50:19.580 | we can again just like we saw last week use the classes attribute which is a dictionary
00:50:25.220 | which basically tells us the vocab. So this is how we look up what, for example, a 6 in a
00:50:30.820 | column refers to: count 0, 1, 2, 3, 4, 5, 6 through the vocab; that's 'Compact', for example. That processing takes a little while to run,
00:50:39.260 | so you can go ahead and save the tabular object and so then you can load it back later without
00:50:46.540 | having to rerun all the processing. So that's a nice kind of fast way to quickly get back
00:50:52.820 | up and running without having to reprocess your data. So we've done the basic data munging
00:50:59.100 | we need. So we can now create a decision tree and in scikit-learn a decision tree where
00:51:04.180 | the dependent variable is continuous is a decision tree regressor. And let's start by
00:51:10.440 | telling it we just want a total of four leaf nodes. We'll see what that means in a moment
00:51:16.980 | and in scikit-learn you generally call fit so it looks quite a lot like fast.ai and you
00:51:23.060 | pass in your independent variables and your dependent variable and we can grab those straight
00:51:28.340 | from our tabular object: the training set's .xs and .y, and we can do the same thing for validation
00:51:35.860 | just to save us some typing. Okay, question. Do you have any thoughts on what data augmentation
00:51:41.820 | for tabular data might look like? I don't have a great sense of data augmentation for tabular
00:51:53.660 | data. We'll be seeing later, either in this course or in the next part, dropout and mix
00:52:03.200 | up and stuff like that, which you might be able to do in later layers of the tabular
00:52:11.260 | model. Otherwise I think you'd need to think about kind of the semantics of the data and
00:52:16.220 | think about what are things you could do to change the data without changing the meaning.
00:52:21.060 | That's like a pretty tricky route. Next question: does fast.ai distinguish between ordered categories
00:52:29.340 | such as low, medium, high and unordered categorical variables? Yes, that was that ordinal thing
00:52:36.180 | I told you about before and all it really does is it ensures that your classes list
00:52:42.300 | has a specific order so then these numbers actually have a specific order. And as you'll
00:52:47.860 | see that's actually going to turn out to be pretty important for how we train our random
00:52:51.820 | forest. Okay, so we can create a decision tree regressor. We can fit it and then we
00:53:00.300 | can draw it with a helper function. And here is the decision tree we just trained, and behind
00:53:10.380 | the scenes this actually used basically the exact process that we described back here,
00:53:19.700 | right? So this is where you can like try and create your own decision tree implementation
00:53:25.380 | if you're interested in stretching yourself. So we're going to use one that's already exists
00:53:31.880 | and the best way to understand what it's done is to look at this diagram from top to bottom.
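For reference, the scikit-learn calls behind this look roughly like the following; draw_tree here is the lesson notebook's little helper for rendering the fitted tree, not part of scikit-learn itself.

    from sklearn.tree import DecisionTreeRegressor

    xs, y = to.train.xs, to.train.y
    valid_xs, valid_y = to.valid.xs, to.valid.y

    m = DecisionTreeRegressor(max_leaf_nodes=4)
    m.fit(xs, y)
    draw_tree(m, xs, size=10)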
00:53:37.060 | So the first step is it says like okay the initial model it created is a model with no
00:53:44.660 | binary splits at all. Specifically it's always going to predict the value 10.1 for every
00:53:50.600 | single row. Why is that? Well because this is the simplest possible model is to take
00:53:57.020 | the average of the dependent variable and always predict that. And so this is always
00:54:02.100 | should be your kind of pretty much your basic baseline for regression. There are 404,710
00:54:08.720 | rows, auctions, that we're averaging, and the mean
00:54:14.660 | squared error of this incredibly simple model, in which there are no rules at all, no groups
00:54:20.860 | at all, just a single average, is 0.48. So then the next most complex model is
00:54:29.300 | to take a single column, coupler system, and a single binary decision: is coupler system
00:54:35.980 | less than or equal to 0.5? There are 360,847
00:54:41.780 | auctions where it's true and 43,863
00:54:47.740 | where it's false. And now interestingly, in the false case, you can see that there are
00:54:54.100 | no further binary decisions. So this is called a leaf node. It's a node where this is as
00:54:59.620 | far as you can get and so if your coupler system is not less than or equal to 0.5 then
00:55:07.340 | the prediction this model makes for your sale price is 9.21 versus if it's true it's 10.21.
00:55:15.100 | So you can see it's actually found a very big difference here and that's why it picked
00:55:19.220 | this as the first binary split. And so the mean squared error for this section here is
00:55:23.940 | 0.12 which is far better than we started out at, 0.48. This group still has 360,000 in
00:55:32.380 | it and so it does another binary split. This time: is the year that this piece of equipment
00:55:38.340 | was made less than or equal to 1991.5? If it's true then we get a leaf node
00:55:47.340 | and the prediction is 9.97, mean squared error 0.37. If the value is false we don't have
00:55:53.420 | a leaf node and we have another binary split. And you can see eventually we get down to
00:55:57.740 | here coupler system true, year made, false, product size, false, mean squared error 0.17.
00:56:05.020 | So all of these leaf nodes have MSEs that are smaller than that original baseline model
00:56:13.780 | of just taking the mean. So this is how you can grow a decision tree. And we only stopped
00:56:19.660 | here because we said max leaf nodes is 4, 1, 2, 3, 4, right? And so if we want to keep
00:56:27.140 | training it further we can just use a higher number. There's actually a very nice library
00:56:36.220 | by Terence Parr called dtreeviz, which can show us exactly the same information like
00:56:42.220 | so. And so here are the same leaf nodes 1, 2, 3, 4. And you can see the kind of the chart
00:56:49.980 | of how many are there. This is the split, coupler system 0.5. Here are the two groups.
00:56:55.460 | You can see the sale price in each of the two groups. And then here's the leaf node.
00:57:00.660 | And so then the second split was on year made. And you can see here something weird is going
00:57:05.300 | on with year made. There's a whole bunch of year mades that are a thousand which is obviously
00:57:09.700 | not a sensible year for a bulldozer to be made. So presumably that's some kind of missing
00:57:15.140 | value. So when we look at the kind of the picture like this it can give us some insights
00:57:21.400 | about what's going on in our data. And so maybe we should replace those thousands with
00:57:28.700 | 1950 because that's you know obviously a very, very early year for a bulldozer. So we can
00:57:34.940 | kind of pick it arbitrarily. It's actually not really going to make any difference to
00:57:39.700 | the model that's created because all we care about is the order because we're just doing
00:57:44.740 | these binary splits, but it'll make it easier to look at, as you can see. Here's our 1950s
00:57:50.420 | now. And so now it's much easier to see what's going on in that binary split. So let's now
00:57:58.420 | get rid of max leaf nodes and build a bigger decision tree. And then let's just for the
00:58:05.060 | rest of this notebook create a couple of little functions. One to create the root mean squared
error, which is just here. And another one to take a model and some independent variables, predict from the model on those independent variables, and then take the root mean squared
00:58:23.180 | error with a dependent variable. So that's going to be our models root mean squared error.
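Those two little functions are roughly this (following the notebook's naming; pred and y can be numpy arrays or pandas series):

```python
import math

def r_mse(pred, y):
    # root mean squared error, rounded for easy reading
    return round(math.sqrt(((pred - y)**2).mean()), 6)

def m_rmse(m, xs, y):
    # RMSE of a fitted model's predictions on xs against the targets y
    return r_mse(m.predict(xs), y)
```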
00:58:29.700 | So for this decision tree in which we didn't have a stopping criteria, so as many leaf
00:58:33.900 | nodes as you like, the model's root mean squared error is zero. So we've just built the perfect
00:58:41.580 | model. So this is great news, right? We've built the perfect auction trading system.
Well remember, we actually need to check the validation set. Let's check m_rmse
00:58:54.620 | with a validation set and oh, it's worse than zero. So our training set is zero, our validation
set is much worse than zero. Why has that happened? Well one of the things that a decision tree in sklearn can do is it can tell you the number of leaf nodes, number of leaves,
00:59:14.540 | there are 341,000, number of data points 400,000. So in other words, we have nearly as many
00:59:22.460 | leaf nodes as data points. Most of our leaf nodes only have a single thing in, but they're
00:59:26.780 | taking an average of a single thing. Clearly this makes no sense at all. So what we should
actually do is pick some different stopping criterion and say, okay, don't do splits that would create a leaf node with fewer than 25 things in it. And now if we fit and we look at the root mean squared error for
the validation set, it's going to go down from 0.33 to 0.32. So the training set got worse, from zero to 0.248, the validation set got better, and now we only have 12,000 leaf nodes. So that is much more reasonable.
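That stopping criterion is the min_samples_leaf argument; a minimal sketch of the refit, assuming the xs, y, valid_xs, valid_y names from the notebook:

```python
from sklearn.tree import DecisionTreeRegressor

m = DecisionTreeRegressor(min_samples_leaf=25)
m.fit(xs, y)
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)   # training RMSE gets worse, validation RMSE gets better
m.get_n_leaves()                                 # far fewer leaves than before
```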
01:00:10.100 | Alright, so let's take a five minute break and then we're going to come back and see
01:00:15.260 | how we get the best of both worlds, how we're going to get something which has the kind
01:00:19.660 | of flexibility to get these, you know, what we're going to get down to zero, but to get,
01:00:26.540 | you know, really deep trees, but also without overfitting. And the trick will be to use
01:00:32.860 | something called bagging. We'll come back and talk about that in five minutes.
01:00:39.460 | Okay, welcome back. So we're going to look at how we can get the best of both worlds
01:00:49.500 | as we discussed and let's start by having a look at what we're doing with categorical
01:00:56.420 | variables first of all. And so you might notice that previously with categorical variables,
01:01:03.500 | for example, in collaborative filtering, we had to, you know, kind of think about like
how many embedding levels we have. Or, if you've used other modeling tools, you might have done things like creating dummy variables, stuff like that. For random forests, on the
01:01:21.780 | whole, you don't have to. The reason is, as we've seen, all of our categorical variables
01:01:32.460 | have been turned into numbers. And so we can perfectly well have decision tree binary decisions
01:01:41.420 | which use those particular numbers. Now, the numbers might not be ordered in any interesting
01:01:49.260 | way, but if there's a particular level which kind of stands out as being important, it
01:01:56.380 | only takes two binary splits to split out that level into a single, you know, into a
01:02:04.420 | single piece. So generally speaking, I don't normally worry too much about kind of encoding
01:02:14.140 | categorical variables in a special way. As I mentioned, I do try to encode ordinal variables
01:02:19.980 | by saying what the order of the levels is, because often, as you would expect, sizes,
01:02:26.180 | for example, you know, medium and small are going to mean kind of next to each other and
01:02:30.860 | large and extra large would be next to each other. That's good to have those as similar
01:02:34.580 | numbers. Having said that, you can kind of one hot encode a categorical variable if you
want to, using get_dummies in pandas. But there's not a lot of evidence that that actually helps; that has actually been studied in a paper. And so I would say in general for categorical
01:02:57.540 | variables don't worry about it too much. Just use what we've shown you. You have a question.
01:03:04.500 | For ordinal categorical variables, how do you deal with when they have like nA or missing
01:03:12.460 | values, where do you put that in the order? So in fast.ai, nA missing values always appear
01:03:22.300 | as the first item. They'll always be the zero index item. And also if you get something
01:03:27.480 | in the validation or test set, which is a level we haven't seen in training, that will
01:03:32.140 | be considered to be that missing or nA value as well. All right, so what we're going to
01:03:41.020 | do to try and improve our random forest is we're going to use something called bagging.
01:03:46.420 | This was developed by a retired Berkeley professor named Leo Breiman in 1994. And he did a lot
01:03:54.180 | of great work and perhaps you could argue that most of it happened after he retired.
01:03:59.700 | His technical report was called bagging predictors. And he described how you could create multiple
01:04:05.260 | versions of a predictor, so multiple different models. And you could then aggregate them
01:04:11.740 | by averaging over the predictions. And specifically, the way he suggested doing this was to create
01:04:20.540 | what he called bootstrap replicates. In other words, randomly select different subsets of
01:04:25.860 | your data. Train a model on that subset, kind of store it away as one of your predictors,
01:04:31.820 | and then do it again a bunch of times. And so each of these models is trained on a different
01:04:36.460 | random subset of your data. And then you, to predict, you predict on all of those different
01:04:43.380 | versions of your model and average them. And it turns out that bagging works really well.
01:04:52.300 | So this, the sequence of steps is basically randomly choose some subset of rows, train
01:04:58.540 | a model using that subset, save that model, and then return to step one. Do that a few
01:05:04.180 | times to train a few models. And then to make a prediction, predict with all the models
01:05:10.300 | and take the average. That is bagging. And it's very simple, but it's astonishingly powerful.
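Here is a toy sketch of those steps using sklearn decision trees (this is not the lesson's code; xs is assumed to be a pandas DataFrame and y a Series, as in the notebook):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_trees(xs, y, n_models=20, frac=0.75):
    models, n = [], len(xs)
    for _ in range(n_models):
        idx = np.random.choice(n, int(n * frac))          # a random subset of row indices
        t = DecisionTreeRegressor(min_samples_leaf=25)
        t.fit(xs.iloc[idx], y.iloc[idx])                  # train one model on that subset
        models.append(t)                                  # save it
    return models

def bagged_predict(models, xs):
    # predict with every model and take the average
    return np.stack([m.predict(xs) for m in models]).mean(0)
```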
01:05:18.300 | And the reason why is that each of these models we've trained, although they are not using
01:05:25.480 | all of the data, so they're kind of less accurate than a model that uses all of the data. Each
01:05:31.980 | of them is, the errors are not correlated, you know, the errors because of using that
01:05:39.880 | smaller subset are not correlated with the errors of the other models because they're
01:05:44.140 | random subsets. And so when you take the average of a bunch of kind of errors which are not
01:05:54.100 | correlated with each other, the average of those errors is zero. So therefore, the average
01:06:01.080 | of the models should give us an accurate prediction of the thing we're actually trying to predict.
01:06:08.380 | So as I say here, it's an amazing result. We can improve the accuracy of nearly any
01:06:12.540 | kind of algorithm by training it multiple times on different random subsets of data
01:06:18.380 | and then averaging the predictions. So then Breiman in 2001 showed a way to do this specifically
01:06:27.140 | for decision trees where not only did he randomly choose a subset of rows for each model, but
01:06:33.700 | then for each binary split, he also randomly selected a subset of columns. And this is
called the random forest. And it's perhaps the most widely used, most practically important
01:06:45.860 | machine learning method and astonishingly simple. To create a random forest regressor,
you use sklearn's RandomForestRegressor. If you pass n_jobs=-1, it will use all of the CPU cores that you have to run as fast as possible. n_estimators says how many trees, how many models, to train. max_samples says how many randomly chosen rows to use for each one. max_features is how many randomly chosen columns to consider at each binary split point. min_samples_leaf is the stopping criterion we'll come back to. So here's a little function that will create a random forest regressor and fit it on some set
01:07:35.580 | of independent variables and a dependent variable. So we can give it a few default values and
01:07:43.460 | create a random forest and train and our validation set RMSE is 0.23. If we compare that to what
01:07:55.500 | we had before, we had 0.32. So dramatically better by using a random forest.
01:08:13.140 | So what's happened when we called random forest regressor is it's just using that decision
01:08:22.020 | tree builder that we've already seen, but it's building multiple versions with these
01:08:26.480 | different random subsets and for each binary split it does, it's also randomly selecting
01:08:32.260 | a subset of columns. And then when we create a prediction, it is averaging the predictions
01:08:38.880 | of each of the trees. And as you can see it's giving a really great result. And one of the
01:08:45.260 | amazing things we'll find is that it's going to be hard for us to improve this very much,
01:08:50.540 | you know, the kind of the default starting point tends to turn out to be pretty great.
The sklearn docs have lots of good information in them. One of them has this nice picture that shows, as you increase the number of estimators, how the accuracy improves, or the error rate
01:09:11.620 | improves for different max features levels. And in general, the more trees you add, the
01:09:21.100 | more accurate your model. It's not going to overfit, right, because it's averaging more
01:09:26.060 | of these, these weak models, more of these models that are trained on subsets of the
01:09:34.020 | data. So train as many, use as many estimators as you like, really just a case of how much
01:09:40.420 | time do you have and whether you kind of reach a point where it's not really improving anymore.
01:09:45.980 | You can actually get at the underlying decision trees in a model, in a random forest model
01:09:50.620 | using estimators_. So with a list comprehension, we can call predict on each individual tree.
01:09:57.900 | And so here's an array, a numpy array containing the predictions from each individual tree
01:10:03.760 | for each row in our data. So if we take the mean across the zero axis, we'll get exactly
01:10:15.100 | the same number. Because remember, that's what a random forest does, is it takes the
mean of the trees' predictions. So one cool thing we could do is we could look at the
01:10:31.340 | 40 estimators we have and grab the predictions for the first i of those trees and take their
01:10:42.020 | mean and then we can find the root mean squared error. And so in other words, here is the accuracy
01:10:50.220 | when you've just got one tree, two trees, three trees, four trees, five trees, etc.
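Building that yourself looks something like this (a sketch; m, valid_xs, valid_y and r_mse are the names from the sketches above):

```python
import numpy as np
import matplotlib.pyplot as plt

preds = np.stack([t.predict(valid_xs) for t in m.estimators_])   # one row of predictions per tree
r_mse(preds.mean(0), valid_y)   # averaging over trees reproduces m.predict(valid_xs)

# RMSE using only the first i+1 trees, which gives the curve described here
plt.plot([r_mse(preds[:i+1].mean(0), valid_y) for i in range(40)])

# preds.std(0), used a little later in the lesson, measures how much the trees disagree per row
```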
01:10:56.100 | And you can see, so it's kind of nice, right? You can, you can actually create your own
01:11:01.220 | kind of build your own tools to look inside these things and see what's going on. And
01:11:06.540 | so we can see here that as you add more and more trees, the accuracy did indeed keep improving
01:11:11.980 | or the root mean squared error kept improving, although the improvements slowed down after
01:11:18.060 | a while. The validation set is worse than the training set and there's a couple of reasons
01:11:28.640 | that could have happened. The first reason could be because we're still overfitting,
01:11:34.660 | which is not necessarily a problem, it's just something we could identify. Or maybe it's
01:11:39.020 | because the, the fact that we're trying to predict the last two weeks is actually a problem
01:11:44.700 | and that the last two weeks are kind of different to the other auctions in our dataset, maybe
something changed over time. So how do we tell which of those two reasons it is?
01:11:56.740 | What is the reason that our validation set is worse? We can actually find out using a
very clever trick called out-of-bag error, OOB error. And we use OOB error for lots of things. You can grab the OOB predictions from the model via its oob_prediction_ attribute, take the RMSE, and find that the OOB RMSE is 0.21, which is quite a bit better than 0.23.
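In code that's just one line, since the rf() sketch above passed oob_score=True to sklearn:

```python
r_mse(m.oob_prediction_, y)   # OOB predictions are made only by trees that never saw each row
```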
So let me explain what OOB error is. We look at each row of the training set, not the validation set, and we say: for row number one, which trees included row number one in their training sample? Let's not use those for calculating the error, because that row was part of those trees' training. So we'll just calculate the error
01:13:04.100 | for that row using the trees where that row was not included in training that tree. Because
01:13:10.860 | remember every tree is using only a subset of the data. So we do that for every row.
We find the prediction using only the trees in which that row was not used. And those are the OOB predictions. In other words, this is like giving us a validation set result without actually needing a validation set. But the thing is, it's not with that time
01:13:39.660 | offset, it's not looking at the last two weeks, it's looking at the whole training set. But
01:13:43.580 | this basically tells us how much of the error is due to overfitting versus due to being
01:13:50.620 | the last couple of weeks. So that's a cool trick. OOB error is something that very quickly
01:13:55.700 | kind of gives us a sense of how much we're, we're overfitting. And we don't even need
01:14:00.100 | a validation set to do it. So there's that OOB error. So that's telling us a bit about
01:14:06.500 | what's going on in our model. But then there's a lot of things we'd like to find out from
01:14:12.320 | our model. And I've got five things in particular here which I generally find pretty interesting.
01:14:18.580 | Which is, how confident are we about our predictions for some particular prediction we're making?
01:14:26.460 | Like we can say this is what we think the prediction is, but how confident are we? Is
01:14:31.740 | that exactly that or is it just about that or we really have no idea? And then for predict,
01:14:37.900 | for predicting a particular item, which factors were the most important in that prediction
01:14:44.860 | and how did they influence it? Overall, which columns are making the biggest difference
in our model? Which ones could we maybe throw away and it wouldn't matter? Which columns are
01:14:56.420 | basically redundant with each other? So we don't really need both of them. And as we
01:15:03.580 | vary some column, how does it change the prediction? So those are the five things that we're, that
I'm interested in figuring out, and we can do all of those things with a random forest.
01:15:15.340 | Let's start with the first one. So the first one, we've already seen that we can grab all
01:15:23.060 | of the predictions for all of the trees and take their mean to get the actual predictions
01:15:31.340 | of the model and then to get the RMSE. But what if instead of saying mean, we did exactly
01:15:36.060 | the same thing like so, but instead said standard deviation. This is going to tell us for every
01:15:46.740 | row in our dataset, how much did the trees vary? And so if our model really had never
01:15:56.380 | seen kind of data like this before, it was something where, you know, different trees
01:16:02.020 | were giving very different predictions. It might give us a sense that maybe this is something
01:16:07.900 | that we're not at all confident about. And as you can see, when we look at the standard
01:16:12.060 | deviation of the trees for each prediction, let's just look at the first five. They vary
a lot, right: 0.2, 0.1, 0.09, 0.3, okay? So this is really interesting. It's not something that a lot of people talk about, but I think it's a really interesting approach for figuring out whether we might want to be cautious about a particular prediction, because maybe we're not very confident about it. And it's something we can do easily with a
01:16:46.540 | random forest. The next thing, and this is I think the most important thing for me in
01:16:50.900 | terms of interpretation, is feature importance. Here's what feature importance looks like.
01:16:57.420 | We can call feature importance on a model with some independent variables. Let's say
01:17:01.860 | grab the first 10. This says these are the 10 most important features in this random
01:17:09.500 | forest. These are the things that are the most strongly driving sale price or we could
01:17:15.020 | plot them. And so you can see here, there's just a few things that are by far the most
important. What year the equipment, bulldozer or whatever, was made. How big is it? Coupler system, whatever that means, and the product class, whatever that means. And so you can
get this by simply looking inside your trained model and grabbing the feature_importances_ attribute. And so here, to make it nicer to print out, I'm just sticking that into a data frame and sorting descending by importance. So how is this actually being done? It's actually really neat. What scikit-learn does, and what Breiman, the inventor of random forests, described, is
01:18:07.740 | that you can go through each tree and then start at the top of the tree and look at each
branch, and at each branch see which column the binary split was based on, and then how much better the model got after that
01:18:24.700 | split compared to beforehand. And we basically then say, okay, that column was responsible
01:18:31.060 | for that amount of improvement. And so you add that up across all of the splits, across
01:18:36.900 | all of the trees for each column, and then you normalize it so they all add to one. And
01:18:43.700 | that's what gives you these numbers, which we show the first few of them in this table
01:18:49.180 | and the first 30 of them here in this chart. So this is something that's fast and it's
01:18:55.900 | easy and it kind of gives us a good sense of like, well, maybe the stuff that are less
01:19:01.020 | than 0.005 we could remove. So if we did that, that would leave us with only 21 columns.
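A sketch of the feature importance helper and the 0.005 threshold just mentioned (the function follows the notebook as I recall it):

```python
import pandas as pd

def rf_feat_importance(m, df):
    # one row per column, sorted so the most important features come first
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                       ).sort_values('imp', ascending=False)

fi = rf_feat_importance(m, xs)
fi[:10]                              # the ten most important features
to_keep = fi[fi.imp > 0.005].cols    # drop anything below the 0.005 threshold
xs_imp, valid_xs_imp = xs[to_keep], valid_xs[to_keep]
```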
So let's try that. Let's just say, okay, the xs which are important are the xs which are in this list of ones to keep; do the same for the validation set, retrain our random forest, and
01:19:27.340 | have a look at the result. And basically our accuracy is about the same, but we've gone
01:19:34.620 | down from 78 columns to 21 columns. So I think this is really important. It's not just about
01:19:42.260 | creating the most accurate model you can, but you want to kind of be able to fit it
01:19:45.460 | in your head as best as possible. And so 21 columns is going to be much easier for us
01:19:50.020 | to check for any data issues and understand what's going on. And the accuracy is about
01:19:55.300 | the same, or the RMSE. So I would say, okay, let's do that. Let's just stick with x's important
01:20:03.980 | from now on. And so here's this entire set of the 21 features. And you can see it looks
now like year made and product size are the two really important things. And then there's
01:20:17.500 | a cluster of kind of mainly product related things that are kind of at the next level
of importance. One of the tricky things here is that we've got things like a product class description, a model ID, a secondary description, a model description, a base model, a model descriptor. They all look like they might be similar ways of saying the same thing. So one thing that can help
01:20:43.360 | us to interpret the feature importance better and understand better what's happening in
01:20:47.500 | the model is to remove redundant features. So one way to do that is to call fast.ai's
01:20:59.020 | cluster columns, which is basically a thin wrapper for stuff that scikit-learn already
01:21:02.980 | provides. And what that's going to do is it's going to find pairs of columns, which are
01:21:09.420 | very similar. So you can see here sale year and sale elapsed. See how this line is way
01:21:14.540 | out to the right or else machine ID and model ID is not at all. It's way out to the left.
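The call itself is a one-liner; cluster_columns is, if I remember correctly, a small helper shipped with the course's fastbook utilities that wraps scipy's hierarchical clustering:

```python
from fastbook import cluster_columns   # assumption: this is where the helper lives

cluster_columns(xs_imp)   # draws the dendrogram grouping similar columns together
```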
01:21:19.700 | So that means that sale year and sale elapsed are very, very similar. When one is low, the
01:21:26.140 | other tends to be low and vice versa. Here's a group of three, which all seem to be much
the same, and then product group desc and product group, and then fiBaseModel and fiModelDesc. But these all seem like things where maybe we could remove one of each of these pairs, because they basically seem to be much the same: when one is high, the other is high, and vice versa. So let's try removing one of each of
01:22:01.980 | these. Now it takes a little while to train a random forest. And so for the, just to see
01:22:09.580 | whether removing something makes it much worse, we could just do a very fast version. So we
could just train something where we only use 50,000 rows for each tree, and we'll just use 40 trees. And let's then just get the OOB score. For that fast, simple version, our baseline OOB with our important xs is 0.877. And here, for the OOB score, a higher number
01:22:48.500 | is better. So then let's try going through each of the things we thought we might not
01:22:53.060 | need and try dropping them and then getting the OOB error for our x's with that one column
removed. And so compared to 0.877, most of them don't seem to hurt very much; sale elapsed hurts it quite a bit, right? So for each of those groups, let's go and see which one of the
01:23:18.420 | ones seems like we could remove it. So here's the five I found. Let's remove the whole lot
and see what happens. And so the OOB went from 0.877 to 0.874, so hardly any difference
01:23:33.820 | at all, despite the fact we managed to get rid of five of our variables. So let's create
01:23:42.180 | something called x's final, which is the x's important and then dropping those five, save
01:23:50.300 | them for later. We can always load them back again. And then let's check our random forest
01:23:56.700 | using those and again 0.233 or 0.234. So we've got about the same thing, but we've got even
01:24:05.460 | less columns now. So we're getting a kind of a simpler and simpler model without hurting
01:24:10.780 | our accuracy. It's great. So the next thing we said we were interested in learning about
01:24:17.900 | is for the columns that are, particularly the columns that are most important, how does,
01:24:24.260 | what's the relationship between that column and the dependent variable? So for example,
01:24:28.700 | what's the relationship between product size and sale price? So the first thing I would
01:24:33.900 | do would be just to look at a histogram. So one way to do that is with value counts in
01:24:41.420 | pandas. And we can see here our different levels of product size. And one thing to note here
01:24:52.780 | is actually missing is actually the most common. And then next most is compact and small. And
then mini is pretty tiny. So we can do the same thing for year made. Now for year made we can't just use value counts and a basic bar chart; we actually need a histogram, which pandas has built in, so we can just call hist. And that 1950, you remember we created it, that's kind of
01:25:27.020 | this missing value thing that used to be a thousand. But most of them seem to have been
well into the 90s and 2000s. So let's now look at something called a partial dependence
01:25:38.780 | plot. I'll show it to you first. Here is a partial dependence plot of year made against
01:25:52.460 | partial dependence. What does this mean? Well we should focus on the part where we actually
01:25:59.100 | have a reasonable amount of data. So at least well into the 80's, go around here. And so
01:26:05.900 | let's look at this bit here. Basically what this says is that as year made increases,
01:26:14.220 | the predicted sale price, log sale price of course also increases. You can see. And the
01:26:22.660 | log sale price is increasing linearly on other roughly, but roughly then this is actually
01:26:28.780 | an exponential relationship between year made and sale price. Why do we call it a partial
01:26:36.900 | dependence? Are we just plotting the kind of the year against the average sale price?
01:26:41.700 | Well no we're not. We can't do that because a lot of other things change from year to
01:26:47.540 | year. Example, maybe more recently people tend to buy bigger bulldozers or more bulldozers
01:26:57.100 | with air conditioning or more expensive models of bulldozers. And we really want to be able
01:27:03.700 | to say like no just what's the impact of year and nothing else. And if you think about it
01:27:08.820 | from a kind of an inflation point of view, you would expect that older bulldozers would
01:27:18.100 | be kind of, that bulldozers would get kind of a constant ratio cheaper the further you
01:27:27.220 | go back, which is what we see. So what we really want to say is all other things being equal,
01:27:33.980 | what happens if only the year changes? And there's a really cool way we can answer that
01:27:39.820 | question with a random forest. So how does year made impact sale price? All other things
01:27:46.020 | being equal. So what we can do is we can go into our actual data set and replace every
01:27:52.460 | single value in the year made column with 1950 and then calculate the predicted sale
01:27:58.620 | price for every single auction and then take the average over all the auctions. And that's
01:28:03.820 | what gives us this value here. And then we can do the same from 1951, 1952 and so forth
01:28:10.900 | until eventually we get to our final year of 2011. So this isolates the effect of only
01:28:20.020 | year made. So it's a kind of a bit of a curious thing to do, but it's actually, it's a pretty
01:28:28.580 | neat trick for trying to kind of pull apart and create this partial dependence to say
01:28:34.920 | what might be the impact of just changing year made. And we can do the same thing for
01:28:42.060 | product size. And one of the interesting things if we do it for product size is we see that
the lowest value of predicted log sale price is NA, which is a bit of a worry, because it means the question of whether or not the product size is labeled at all is really important. And that is something I would want to dig into before I actually use this model, to find out why it is that sometimes things aren't labeled, what that means, and why it is that it's such an important predictor. So that is the partial dependence plot, and it's a really clever trick.
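For reference, scikit-learn can draw these plots directly. A minimal sketch (the lesson-era notebook used sklearn's plot_partial_dependence, which newer versions replace with PartialDependenceDisplay; valid_xs_final and the column names are assumptions based on the notebook):

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

fig, ax = plt.subplots(figsize=(12, 4))
PartialDependenceDisplay.from_estimator(m, valid_xs_final, ['YearMade', 'ProductSize'],
                                        grid_resolution=20, ax=ax)
```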
So we have looked at four of the five questions we said we wanted to answer at the start of this section. So the last one that we want to answer is this one
01:29:41.780 | here. We're predicting with a particular row of data what were the most important factors
01:29:46.980 | and how did they influence that prediction. This is quite related to the very first thing
01:29:51.460 | we saw. So it's like imagine you were using this auction price model in real life. You
01:29:57.300 | had something on your tablet and you went into some auction and you looked up what the
01:30:02.320 | predicted auction price would be for this lot that's coming up to find out whether it
01:30:09.940 | seems like it's being under or overvalued and then you can decide what to do about that.
01:30:15.720 | So one thing we said we'd be interested to know is like well are we actually confident
01:30:20.020 | in our prediction and then we might be curious to find out like oh I'm really surprised it
01:30:25.180 | was predicting such a high value. Why was it predicting such a high value? So to find
01:30:32.060 | the answer to that question, we can use a module called TreeInterpreter. And TreeInterpreter,
01:30:41.260 | the way it works is that you pass in a single row. So it's like here's the auction that's
01:30:47.620 | coming up, here's the model, here's the auctioneer ID, etcetera, etcetera. Please predict the
01:30:55.220 | value from the random forest, what's the expected sale price and then what we can do is we can
01:31:02.700 | take that one row of data and put it through the first decision tree and we can see what's
01:31:07.700 | the first split that's selected and then based on that split does it end up increasing or
01:31:13.340 | decreasing the predicted price compared to that kind of raw baseline model of just take
01:31:19.340 | the average and then you can do that again at the next split and again at the next split
and again at the next split. So for each split, we see what the increase or decrease in the prediction is, compared to the parent node. And so then you can do that for every
01:31:48.700 | tree and then add up the total change in importance by split variable and that allows you to draw
01:31:56.660 | something like this. So here's something that's looking at one particular row of data and
01:32:03.860 | overall we start at zero and so zero is the initial 10.1. Remember this number 10.1 is
01:32:14.860 | the average log sale price of the whole data set. They call it the bias. And so we call
01:32:22.300 | that zero then for this particular row we're looking at year made as a negative 4.2 impact
on the prediction and then product size has a positive 0.2, coupler system has a positive
01:32:38.300 | 0.046, model ID has a positive 0.127 and so forth, right. And so the red ones are negative
01:32:47.480 | and the green ones are positive and you can see how they all join up until eventually
01:32:51.580 | overall the prediction is that it's going to be negative 0.122 compared to 10.1 which
is equal to 9.98. So this kind of plot is called a waterfall plot. And so basically, when we say treeinterpreter.predict, it gives us back the prediction, which is the actual number we get back from the random forest; the bias, which is just always this 10.1 for this data set; and then the contributions, which is all of these different values, how important each factor was. And here I've used a threshold, which means anything that was less than 0.08 all gets thrown into this other category.
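A sketch of what that looks like with the treeinterpreter and waterfall_chart libraries used in the notebook (m and valid_xs_final are the names assumed above):

```python
from treeinterpreter import treeinterpreter
from waterfall_chart import plot as waterfall

row = valid_xs_final.iloc[:5]                       # pretend these are the auctions coming up
prediction, bias, contributions = treeinterpreter.predict(m, row.values)
# prediction[0]: the forest's output for the first row; bias: the dataset-wide mean (~10.1);
# contributions[0]: how much each column pushed the prediction up or down
waterfall(valid_xs_final.columns, contributions[0], threshold=0.08,
          rotation_value=45, formatting='{:,.3f}')
```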
I think this is a really useful kind of thing to have in production, because it can help you answer questions, whether it be for the customer or for whoever's using your model: if they're surprised about some prediction, why is that the prediction? So I'm going to show you something really interesting
01:34:10.540 | using some synthetic data and I want you to really have a think about why this is happening
before I tell you, and pause the video, if you're watching the video, when I get to that
point. Let's start by creating some synthetic data, like so. We're going to grab 40 values evenly spaced between 0 and 20, and then we're just going to take the y = x line and add some normally distributed random noise to it. Here's the plot. So here's some data we want to try and predict, and we're going to use a random forest, which is a bit of overkill here. Now in this case we only have one independent variable, and scikit-learn expects us to have more than one. So we can use unsqueeze in PyTorch to go from a shape of 40, in other words a vector with 40 elements, to a shape of 40 by 1, in other words a matrix of 40 rows with one column. So this unsqueeze(1) means add a unit axis for the extra dimension.
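Here is roughly what that looks like (a sketch; x_lin and y_lin are the names as I recall them from the notebook):

```python
import torch
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

x_lin = torch.linspace(0, 20, steps=40)        # 40 evenly spaced values
y_lin = x_lin + torch.randn_like(x_lin)        # y = x plus normally distributed noise
xs_lin = x_lin.unsqueeze(1)                    # shape goes from [40] to [40, 1]

m_lin = RandomForestRegressor().fit(xs_lin[:30], y_lin[:30])   # fit on the first 30 points only
plt.scatter(x_lin, y_lin, 20)
plt.scatter(x_lin, m_lin.predict(xs_lin), color='red', alpha=0.5)
```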
Now, I don't use unsqueeze very often, because I actually generally prefer indexing with the special value None. This works in PyTorch and numpy, and the way it works is to say: okay, x_lin, remember, is a vector of length 40; take every row, and then None means insert a
01:35:46.180 | unit axis here for the column. So these are two ways of doing the same thing but this
01:35:51.500 | one is a little bit more flexible so that's what I use more often. But now that we've
01:35:55.540 | got the shape that is expected which is a rank 2 tensor and an array with two dimensions
01:36:02.820 | or axes we can create a random forest we can fit it and let's just use the first 30 data
01:36:08.860 | points right so kind of stop here. And then let's do a prediction right so let's plot
01:36:16.580 | the original data points and then also plot a prediction and look what happens on the
prediction: it's kind of nice and accurate, and then suddenly, what happens? It goes flat. So this is the bit where, if you're watching the video, I want you to pause and have a think about why it's flat. So what's going on here? Well, remember, a random forest is just taking the average
01:36:39.380 | of predictions of a bunch of trees and a tree the prediction of a tree is just the average
01:36:46.220 | of the values in a leaf node and remember we fitted using a training set containing
01:36:51.980 | only the first 30. So none of these appeared in the training set so the highest we could
01:36:59.060 | get would be the average of values that are inside the training set. In other words there's
01:37:04.700 | this maximum you can get to. So random forests cannot extrapolate outside of the bounds of
01:37:12.980 | the data that they're training set. This is going to be a huge problem for things like
01:37:16.880 | time series prediction where there's like an underlying trend for instance. But really
it's a more general issue than just time variables. It's going to be hard, or often impossible, for random forests to just extrapolate outside the types of data it's seen, in
01:37:34.620 | a general sense. So we need to make sure that our validation set does not contain out of
01:37:41.340 | domain data. So how do we find out of domain data? So we might not even know our test set
01:37:50.900 | is distributed in the same way as our training data. So if they're from two different time
01:37:54.760 | periods how do you kind of tell how they vary, right? Or if it's a Kaggle competition how
01:38:00.980 | do you tell if the test set and the training set which Kaggle gives you have some underlying
01:38:07.180 | differences? There's actually a cool trick you can do which is you can create a column
01:38:13.020 | called is_valid which contains 0 for everything in the training set and 1 for everything in
01:38:21.260 | the validation set. And it's concatenating all of the independent variables together.
01:38:27.580 | So it's concatenating the independent variables for both the training and validation set together.
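A sketch of that trick, reusing the rf() and rf_feat_importance helpers sketched earlier (the xs_final and valid_xs_final names are assumptions based on the notebook):

```python
import numpy as np
import pandas as pd

df_dom = pd.concat([xs_final, valid_xs_final])                    # all independent variables together
is_valid = np.array([0]*len(xs_final) + [1]*len(valid_xs_final))  # 0 = training row, 1 = validation row

m_dom = rf(df_dom, is_valid)
rf_feat_importance(m_dom, df_dom)[:6]   # the columns that give away train vs. validation
```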
01:38:32.700 | So this is our independent variable and this becomes our dependent variable. And we're
01:38:38.740 | going to create a random forest not for predicting price but a random forest that predicts is
01:38:45.740 | this row from the validation set or the training set. So if the validation set and the training
01:38:51.980 | set are from kind of the same distribution if they're not different then this random
01:38:57.060 | forest should basically have zero predictive power. If it has any predictive power then
01:39:04.460 | it means that our training and validation set are different. And to find out the source
01:39:09.100 | of that difference we can use feature importance. And so you can see here that the difference
01:39:17.840 | between the validation set and the training set is not surprisingly sale elapsed. So that's
01:39:26.940 | the number of days since I think like 1970 or something. So it's basically the date.
01:39:32.180 | So yes of course you can predict whether something is in the validation set or the training set
01:39:37.300 | by looking at the date because that's actually how you find them. That makes sense. This is
01:39:41.900 | interesting sales ID. So it looks like the sales ID is not some random identifier but
01:39:46.940 | it increases over time. And ditto for machine ID. And then there's some other smaller ones
01:39:54.580 | here that kind of makes sense. So I guess for something like model desk I guess there
01:39:59.500 | are certain models that were only made in later years for instance. But you can see these
01:40:06.860 | top three columns are a bit of an issue. So then we could say like okay what happens if
01:40:14.320 | we look at each one of those columns those first three and remove them and then see how
01:40:22.220 | it changes our RMSE on our sales price model on the validation set. So we start from point
01:40:35.460 | 232 and removing sales ID actually makes it a bit better. Sale elapsed makes it a bit
01:40:43.180 | worse, machine ID about the same. So we can probably remove sales ID and machine ID without
01:40:49.180 | losing any accuracy and yep it's actually slightly improved. But most importantly it's
01:40:54.600 | going to be more resilient over time right because we're trying to remove the time related
features. Another thing to note is that, since it seems this kind of sale elapsed issue is making a big difference, it's worth looking at the sale year distribution; this is the histogram. Most of the sales are in the last few years anyway. So what happens
01:41:21.380 | if we only include the most recent few years. So let's just include everything after 2004.
So that is xs filtered. And if I train on that subset then my accuracy improves a bit more, from 331 to 330. So that's interesting, right? We're actually using less data, fewer
01:41:46.260 | rows and getting a slightly better result because the more recent data is more representative.
01:41:53.980 | So that's about as far as we can get with our random forest. But what I will say is
01:42:00.180 | this. This issue of extrapolation would not happen with a neural net would it because
01:42:08.780 | a neural net is using the kind of the underlying layers are linear layers. And so linear layers
01:42:13.860 | can absolutely extrapolate. So the obvious thing to think then at this point is well
maybe a neural net would do a better job of this. That's the thing we'll try next, after this question. Question first: how does feature importance relate to correlation?
01:42:37.020 | Feature importance doesn't particularly relate to correlation. Correlation is a concept for
01:42:42.700 | linear models and this is not a linear model. So remember feature importance is calculated
01:42:47.740 | by looking at the improvement in accuracy as you go down each tree and you go down each
01:42:56.660 | binary split. If you're used to linear regression then I guess correlation sometimes can be
01:43:05.620 | used as a measure of feature importance. But this is a much more kind of direct version
01:43:13.660 | that's taking account of these non-linearities and interactions of stuff as well. So it's
01:43:19.380 | a much more flexible and reliable measure generally feature importance. Any more questions?
01:43:30.260 | So I'll do the same thing with a neural network. I'm going to just copy and paste the same
01:43:34.620 | lines of code that I had from before but this time I'll call it NN, DFNN and these are the
01:43:40.660 | same lines of code. And I'll grab the same list of columns we had before in the dependent
01:43:45.140 | variable to get the same data frame. Now as we've discussed for categorical columns we
01:43:52.460 | probably want to use embeddings. So to create embeddings we need to know which columns should
be treated as categorical variables. And as we've discussed, we can use cont_cat_split for that. One of the useful things we can pass it is the maximum cardinality. So max_card
01:44:09.380 | equals 9000 means if there's a column with more than 9000 levels you should treat it
as continuous. And if it's got less than 9000 levels, it's categorical. So that's
01:44:20.660 | you know it's a simple little function that just checks the cardinality and splits them
01:44:25.420 | based on how many discrete levels they have. And of course the data type if it's not actually
numeric data type, it has to be categorical. So there's our split. And then from there, what we can do is say: oh, we've got to be a bit careful of sale elapsed, because sale elapsed I think has less than 9000 categories, but we definitely don't want to use that as a categorical variable; the whole point was to make it something that we can extrapolate. Certainly for anything that's kind of time dependent,
01:45:03.020 | or we think that we might see things outside the range of inputs in the training data we
should make them continuous variables. So let's take sale elapsed, put it in the continuous list for the neural net, and remove it from the categorical list. So here, from pandas, is the number of unique levels for each of the categorical variables in our neural net data set.
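Roughly, in code (df_nn_final and dep_var are the notebook's names, so treat them as assumptions):

```python
from fastai.tabular.all import cont_cat_split

cont_nn, cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)

# saleElapsed is time-dependent and we want to be able to extrapolate on it,
# so force it to be continuous rather than categorical
cont_nn.append('saleElapsed')
cat_nn.remove('saleElapsed')

df_nn_final[cat_nn].nunique()   # number of unique levels per remaining categorical column
```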
01:45:28.460 | And I get a bit nervous when I see these really high numbers so I don't want to have too many
01:45:32.740 | things with like lots and lots of categories. The reason I don't want lots of things with
01:45:40.220 | lots and lots of categories is just they're going to take up a lot of parameters because
in an embedding matrix, you know, every one of these levels is a row. In this case I notice model ID and model desc might be describing something very similar.
01:45:54.380 | So I'd quite like to find out if I could get rid of one and an easy way to do that would
be to use a random forest. So let's try removing the model desc, create a random forest,
01:46:10.540 | and let's see what happens and oh it's actually a tiny bit better and certainly not worse.
01:46:16.460 | So that suggests that we can actually get rid of one of these levels or one of these
01:46:20.740 | variables. So let's get rid of that one and so now we can create a tabular pandas object
01:46:26.900 | just like before. But this time we're going to add one more processor which is normalize.
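A sketch of that setup, assuming the df_nn_final, cat_nn, cont_nn, splits and dep_var names used above:

```python
from fastai.tabular.all import *

procs_nn = [Categorify, FillMissing, Normalize]
to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn,
                      splits=splits, y_names=dep_var)
dls = to_nn.dataloaders(1024)   # tabular models need little GPU RAM, so a big batch size is fine
```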
01:46:34.540 | And the reason we need normalize, so normalize is subtract the mean divide by the standard
01:46:39.180 | deviation. We didn't need that for a random forest because for a random forest we're just
01:46:44.660 | looking at less than or greater than through our binary splits. So all that matters is
01:46:49.980 | the order of things, how they're sorted, it doesn't matter whether they're super big or
01:46:53.700 | super small. But it definitely matters for neural nets because we have these linear layers.
01:47:01.460 | So we don't want to have you know things with kind of crazy distributions with some super
01:47:06.220 | big numbers and super small numbers because it's not going to work. So it's always a good
01:47:10.680 | idea to normalize things in neural nets so we can do that in a tabular neural net by
01:47:17.580 | using the normalize tabular proc. So we can do the same thing that we did before with
creating our tabular pandas object for the neural net. And then we can create
01:47:29.900 | data loaders from that with a batch size. And this is a large batch size because tabular
01:47:35.540 | models don't generally require nearly as much GPU RAM as a convolutional neural net or something
or an RNN or something. Since it's a regression model we're going to want a y range. So let's find the minimum and maximum of our dependent variable. And we can now go ahead and create a tabular learner. Our tabular learner is going to take our data loaders, our y range,
01:48:06.140 | how many activations do you want in each of the linear layers. And so you can have as
01:48:12.140 | many linear layers as you like here. How many outputs are there? So this is a regression
with a single output. And what loss function do you want? We can use lr_find and then we can go ahead and use fit_one_cycle. There's no pre-trained model obviously, because this is not something where people have got pre-trained models for industrial equipment auctions. So we just use fit_one_cycle and train for a minute. And then we can check: our RMSE is 0.226
01:48:52.500 | which here was 0.230. So that's amazing. We actually have, you know, straight away a better
01:48:58.620 | result than the random forest. It's a little more fussy, it takes a little bit longer. But
01:49:05.580 | as you can see, you know, for interesting datasets like this, we can get some great
01:49:10.940 | results with neural nets. So here's something else we could do though. The random forest
01:49:23.380 | and the neural net, they each have their own pros and cons. There's some things they're
01:49:28.020 | good at and there's some they're less good at. So maybe we can get the best of both worlds.
And a really easy way to do that is to use ensembling. We've already seen that a random
01:49:39.420 | forest is a decision tree ensemble. But now we can put that into another ensemble. We
01:49:43.740 | can have an ensemble of the random forest and a neural net. There's lots of super fancy
01:49:49.180 | ways you can do that. But a really simple way is to take the average. So sum up the
01:49:55.300 | predictions from the two models, divide by two, and use that as prediction. So that's
01:50:01.620 | our ensemble prediction is just literally the average of the random forest prediction
01:50:05.540 | and the neural net prediction. And that gives us 0.223 versus 0.226. So how good is that?
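Here is a rough sketch of the learner and the averaging ensemble just described (dls is from the sketch above; m, valid_xs_final, valid_y and r_mse are the random forest pieces from earlier; the learning rate and epoch count are assumptions):

```python
from fastai.tabular.all import *

learn = tabular_learner(dls, y_range=(8, 12),      # a range just beyond the min/max of log sale price
                        layers=[500, 250], n_out=1, loss_func=F.mse_loss)
learn.lr_find()
learn.fit_one_cycle(5, 1e-2)

preds, targs = learn.get_preds()                   # neural net predictions on the validation set
rf_preds = m.predict(valid_xs_final)               # random forest predictions
ens_preds = (to_np(preds.squeeze()) + rf_preds) / 2
r_mse(ens_preds, valid_y)
```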
01:50:18.900 | Well it's a little hard to say because unfortunately this competition is old enough that we can't
01:50:25.540 | even submit to it and find out how we would have gone on Kaggle. So we don't really know
01:50:30.980 | and so we're relying on our own validation set. But it's quite a bit better than even
01:50:36.260 | the first place score on the test set. So if the validation set is you know doing good
01:50:45.380 | job then this is a good sign that this is a really really good model. Which wouldn't
01:50:51.060 | necessarily be that surprising because you know in the last few years I guess we've learned
01:50:58.620 | a lot about building these kinds of models. And we're kind of taking advantage of a lot
01:51:03.940 | of the tricks that have appeared in recent years. And yeah maybe this goes to show that
01:51:11.660 | well I think it certainly goes to show that both random forests and neural nets have a
lot to offer. So try both, and maybe even combine both. We've talked about an approach
01:51:29.540 | to ensembling called bagging which is where we train lots of models on different subsets
of the data and take the average of them. Another approach to ensembling, particularly ensembling of trees,
01:51:42.500 | is called boosting. And boosting involves training a small model which underfits your
01:51:50.580 | data set. So maybe like just have a very small number of leaf nodes. And then you calculate
01:51:57.200 | the predictions using the small model. And then you subtract the predictions from the
01:52:02.500 | targets. So these are kind of like the errors of your small underfit model. We call them
01:52:07.580 | residual. And then go back to step one but now instead of using the original targets
01:52:15.440 | use the residuals. The train a small model which underfits your data set attempting to
01:52:21.020 | predict the residuals. Then do that again and again until you reach some stopping criterion
01:52:28.900 | such as the maximum number of trees. Now you that will leave you with a bunch of models
01:52:35.620 | which you don't average but which use sum. Because each one is creating a model that's
01:52:42.500 | based on the residual of the previous one. But we've subtracted the predictions of each
01:52:47.660 | new tree from the residuals of the previous tree. So the residuals get smaller and smaller.
01:52:53.260 | And then to make predictions we just have to do the opposite which is to add them all
01:52:56.980 | together. So there's lots of variants of this. But you'll see things like GBMs for gradient
boosted machines or GBDTs for gradient boosted decision trees. And there's lots of minor details around, and significant ones too. But the basic idea is what I've shown.
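A toy sketch of that loop with shallow sklearn trees (this is not the lesson's code, and real gradient boosting libraries add a learning rate and many other details):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(xs, y, n_trees=20, max_leaf_nodes=4):
    trees, residual = [], np.array(y, dtype=float)
    for _ in range(n_trees):
        t = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)   # a small model that underfits
        t.fit(xs, residual)
        residual = residual - t.predict(xs)       # each new tree fits what the previous ones missed
        trees.append(t)
    return trees

def boost_predict(trees, xs):
    return np.sum([t.predict(xs) for t in trees], axis=0)   # sum the trees, don't average them
```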
01:53:21.580 | All right let's take the questions. Dropping features in a model is a way to reduce the
01:53:28.020 | complexity of the model and thus reduce overfitting. Is this better than adding some regularization
01:53:33.820 | like weight decay? I didn't claim that we removed columns to avoid overfitting. We removed
the columns to simplify the model, so there are fewer things to analyze. It should also mean we don't need as many trees, but there's no particular reason to believe that this will regularize. And the idea of regularization doesn't necessarily make a lot of sense for random forests, where you can always add more trees. Is there a good heuristic for picking the number of linear layers in
01:54:18.660 | the tabular model? Not really. Well if there is I don't know what it is. I guess two, three
hidden layers works pretty well. So, you know, the numbers I showed are pretty good for a large-ish model. By default it uses 200 and 100, so maybe start with the default, then go up to 500 and 250 and see if that's an improvement, and just keep doubling
01:54:53.500 | them until it stops improving or you run out of memory or time. The main thing to note
01:55:00.900 | about boosted models is that there's nothing to stop us from overfitting. If you add more
and more trees to a bagging model, such as a random forest, it should
01:55:11.660 | generalize better and better because each time you're using a new model which is based
01:55:16.980 | on a subset of the data. But boosting each model will fit the training set better and
01:55:24.740 | better gradually overfit more and more. So boosting methods do require generally more
hyperparameter tuning and fiddling around. You can certainly have regularization in boosting. They're pretty sensitive to their hyperparameters, which is why they're not normally my first go-to, but they win Kaggle competitions more often than random forests do; they tend to be good at getting that last little bit of performance. So the last thing I'm going
01:56:04.860 | to mention is something super neat which a lot of people don't seem to know exists. There's
something, and I think it's super cool, from the entity embeddings paper: the table
01:56:17.100 | from it where what they did was they built a neural network, they got the entity embeddings
01:56:23.900 | e.e. and then they tried a random forest using the entity embeddings as predictors rather
01:56:35.220 | than the approach I described with just the raw categorical variables. And the the error
01:56:43.060 | for a random forest went from 0.16 to 0.11. A huge improvement and very simple method
01:56:51.100 | KNN went from 0.29 to 0.11. Basically all of the methods when they used entity embeddings
01:56:59.020 | suddenly improved a lot. The one thing you should try if you have a look at the further
research section after the questionnaire: it asks you to try to do this, to actually take those
01:57:10.260 | entity embeddings that we trained in the neural net and use them in the random forest and
01:57:14.840 | then maybe try ensembling again and see if you can beat the 0.223 that we had. This is
01:57:25.260 | a really nice idea it's like you get you know all the benefits of boosted decision trees
01:57:32.140 | but all of the nice features of entity embeddings and so this is something that not enough people
01:57:40.100 | seem to be playing with for some reason. So overall you know random forests are nice and
01:57:49.940 | easy to train you know they're very resilient they don't require much pre-processing they
01:57:54.460 | train quickly they don't overfit you know they can be a little less accurate and they
can be a bit slow at inference time, because at inference time you have to go through every
01:58:08.180 | one of those trees. Having said that a binary tree can be pretty heavily optimized so you
01:58:18.700 | know it is something you can basically create a totally compiled version of a tree and they
01:58:24.100 | can certainly also be done entirely in parallel so that's something to consider. Gradient boosting
machines are also fast to train on the whole, but a little more fussy about hyperparameters;
01:58:41.260 | you have to be careful about overfitting but a bit more accurate. Neural nets may be the
01:58:49.380 | fussiest to deal with they've kind of got the least rules of thumb around or tutorials
01:58:56.660 | around saying this is kind of how to do it it's just a bit a bit newer a little bit less
01:59:00.660 | well understood but they can give better results in many situations than the other two approaches
01:59:06.580 | or at least with an ensemble can improve the other two approaches. So I would always start
with a random forest and then see if you can beat it using these. So yeah, why don't you
01:59:19.580 | now see if you can find a Kaggle competition with tabular data whether it's running now
01:59:23.740 | or it's a past one and see if you can repeat this process for that and see if you can get
01:59:29.220 | in the top 10% of the private leaderboard that would be a really great stretch goal
01:59:34.860 | at this point. Implement the decision tree algorithm yourself I think that's an important
01:59:40.100 | one we really understand it and then from there create your own random forest from scratch
01:59:44.700 | you might be surprised it's not that hard and then go and have a look at the tabular
01:59:52.500 | model source code and at this point this is pretty exciting you should find you pretty
01:59:57.980 | much know what all the lines do with two exceptions and if you don't you know dig around and explore
an experiment and see if you can figure it out. And with that, I am very excited to say, we are at a point where we've really dug all the way in to the end of these really valuable,
02:00:20.980 | effective fast AI applications and we're understanding what's going on inside them. What should we
expect for next week? Next week we will look at NLP and computer vision, and we'll take the same kind of approach: delve deep to see what's going on. Thanks everybody, see you next week.