Lesson 7 - Deep Learning for Coders (2020)
Chapters
0:00 Weight decay (L2 regularization)
7:25 Creating our own Embedding module
12:45 Interpreting embeddings and bias
18:00 Embedding distance
20:00 Deep learning for collaborative filtering
24:09 Notebook 9 - Tabular modelling
25:30 Entity embeddings for categorical variables
30:11 Beyond deep learning for tabular data (ensembles of decision trees)
40:10 Decision trees
64:00 Random forests
72:10 Out-of-bag error
74:00 Model interpretation
94:00 Extrapolation
103:00 Using a NN
109:20 Ensembling
117:40 Conclusion
00:00:00.000 |
Hi everybody and welcome to lesson 7. We're going to start by having a look at a kind 00:00:07.520 |
of regularization called weight decay. And the issue that we came to at the end of the 00:00:13.400 |
last lesson is that we were training our simple dot product model with bias, and our loss 00:00:22.400 |
started going down and then it started going up again. And so we have a problem that we 00:00:29.280 |
are overfitting. And remember in this case we're using mean squared error. So try to 00:00:35.560 |
recall why it is that we don't need a metric here, because mean squared error is pretty 00:00:42.760 |
much the thing we care about really, or we could use mean absolute error if we like, 00:00:47.800 |
but either of those works fine as a loss function. They don't have the problem of big flat areas 00:00:52.720 |
like accuracy does for classification. So what we want to do is to make it less likely 00:01:00.400 |
that we're going to overfit by doing something we call reducing the capacity of the model. 00:01:06.120 |
The capacity of the model is basically how much space does it have to find answers. And 00:01:11.800 |
if it can kind of find any answer anywhere, those answers can include basically memorizing 00:01:18.320 |
the data set. So one way to handle this would be to decrease the number of latent factors. 00:01:27.320 |
But generally speaking, reducing the number of parameters in a model, particularly as 00:01:32.780 |
we look at more deep learning style models, ends up biasing the models towards very simple 00:01:40.720 |
kind of shapes. So there's a better way to do it rather than reducing the number of parameters. 00:01:47.840 |
And we try to force the parameters to be smaller, unless they're really required to be big. 00:01:55.480 |
And the way we do that is with weight decay. Weight decay is also known as L2 regularization. 00:02:00.960 |
They're very slightly different, but we can think of them as the same thing. And what 00:02:05.320 |
we do is we change our loss function, and specifically we change the loss function by 00:02:10.120 |
adding to it the sum of all the weights squared - in fact, the sum of all the parameters squared, 00:02:17.900 |
I should say. Why do we do that? Well, because if that's part of the loss function, then 00:02:24.460 |
one way to decrease the loss would be to decrease the weights, one particular weight or all 00:02:30.360 |
of the weights or something like that. And so when we decrease the weights, if you think 00:02:38.980 |
about what that would do, then think about, for example, the different possible values 00:02:49.100 |
of a in y equals ax squared. The larger a is - for example, a is 50 - the narrower and sharper 00:02:57.080 |
the peak you get. In general, big coefficients are going to cause big swings: big changes 00:03:05.800 |
in the loss from small changes in the parameters. And when you have these kinds of sharp peaks 00:03:13.660 |
or valleys, it means that a small change to the input can make 00:03:22.520 |
a big change to the loss. And so if you're in that 00:03:27.760 |
situation, then you can basically fit all the data points close to exactly with a really 00:03:33.480 |
complex jagged function with sharp changes, which exactly tries to sit on each data point 00:03:41.020 |
rather than finding a nice smooth surface which connects them all together or goes through 00:03:46.660 |
them all. So if we limit our weights by adding in the loss function, the sum of the weights 00:03:54.580 |
squared, then what it's going to do is it's going to fit less well on the training set 00:04:00.760 |
because we're giving it less room to try anything that it wants to, but we're going to hope 00:04:05.320 |
that it would result in a better loss on the validation set or the test set so that it 00:04:10.740 |
will generalize better. One way to think about this is that the loss with weight decay is 00:04:17.020 |
just the loss plus the sum of the parameters squared times some number we pick, a hyperparameter. 00:04:27.500 |
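As a rough sketch in code - with a toy parameter tensor and a stand-in loss, and wd as the hyperparameter just mentioned - it looks like this:

    import torch

    # Minimal sketch of weight decay: a toy parameter tensor and a made-up loss,
    # with wd as the hyperparameter just mentioned (something like 0.1, 0.01 or 0.001).
    wd = 0.1
    parameters = torch.randn(10, requires_grad=True)
    loss = (parameters * 2).pow(2).mean()              # stand-in for the real model's loss

    loss_with_wd = loss + wd * (parameters ** 2).sum() # loss plus the sum of squared parameters
    loss_with_wd.backward()

    # Equivalently, since only the gradient matters for SGD, you can leave the loss alone
    # and just add the term to the gradients (explained just below):
    # parameters.grad += wd * 2 * parameters.detach()
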
This is like 0.1 or 0.01 or 0.001 kind of region. So this is basically what loss with 00:04:35.680 |
weight decay looks like in this equation. But remember when it actually comes to what's, 00:04:40.500 |
how is the loss used in stochastic gradient descent? It's used by taking its gradient. 00:04:45.940 |
So what's the gradient of this? Well, if you remember back to when you first learned calculus, 00:04:52.880 |
it's okay if you don't. The gradient of something squared is just two times that something. We've 00:05:00.200 |
switched from 'parameters' to 'weight' here, which is a bit confusing, so we'll just use 'weight' to 00:05:05.800 |
keep it consistent - maybe 'parameters' would be better. So the derivative of weight squared is just 00:05:12.880 |
two times weight. So in other words, to add in this term to the gradient, we can just 00:05:20.480 |
add to the gradients weight decay times two times weight. And since weight decay is just 00:05:28.420 |
a hyperparameter, we can just replace it with weight decay times two. So that would just 00:05:32.500 |
give us weight decay times weight. So weight decay refers to adding to the gradients 00:05:42.960 |
the weights times some hyperparameter. And so that is going to try to create these kind 00:05:48.780 |
of more shallow, less bumpy surfaces. So to do that, we can simply, when we call fit or 00:05:59.260 |
fit one cycle or whatever, we can pass in a WD parameter and that's just this number 00:06:07.100 |
here. So if we pass in 0.1, then the training loss goes from 0.29 to 00:06:15.660 |
0.49. That's much worse, right, because we can't overfit anymore. The validation 00:06:20.980 |
loss goes from 0.89 to 0.82 - much better. So this is an important 00:06:27.740 |
thing to remember for those of you that have done a lot of more traditional statistical 00:06:31.820 |
models is in kind of more traditional statistical models, we try to avoid overfitting and we 00:06:37.800 |
try to increase generalization by decreasing the number of parameters. But in a lot of 00:06:44.040 |
modern machine learning and certainly deep learning, we tend to instead use regularization 00:06:51.660 |
such as weight decay because it gives us more flexibility. It lets us use more nonlinear 00:06:56.820 |
functions and still, you know, still reduces the capacity of the model. Great. So we're 00:07:03.940 |
down to 0.823. This is a good model. This is really actually a very good 00:07:08.780 |
model. And so let's dig into actually what's going on here because in our, in our architecture, 00:07:18.300 |
remember we basically just had four embedding layers. So what's an embedding layer? We've 00:07:24.580 |
described it conceptually, but let's write our own. And remember we said that an embedding 00:07:29.980 |
layer is just a computational shortcut for doing a matrix multiplication by a one hot 00:07:35.380 |
encoded matrix and that that is actually the same as just indexing into an array. So an 00:07:43.860 |
embedding is just a indexing into an array. And so it's nice to be able to create our 00:07:50.380 |
own versions of things that exist in PyTorch and fast.ai. So let's do that for embedding. 00:07:56.780 |
So if we're going to create our own kind of layer, which is pretty cool, we need to be 00:08:02.500 |
aware of something, which is normally a layer is basically created by inheriting as we've 00:08:09.660 |
discussed from module or nn.module. So for example, this is an example here of a module 00:08:15.060 |
where we've created a class called t that inherits from module. And when it's constructed, 00:08:20.340 |
remember that's what dunder init does. This is just a dummy little 00:08:25.020 |
module here: we're going to set self.a to the number one repeated three times, as a tensor. 00:08:31.820 |
Now if you remember back to notebook four, we talked about how the optimizers in PyTorch 00:08:37.820 |
and fast.ai rely on being able to grab the parameters attribute to find a list of all 00:08:42.900 |
the parameters. Now if you want to be able to optimize self.a, it would need to appear 00:08:48.860 |
in parameters, but actually there's nothing there. Why is that? That's because PyTorch 00:08:56.340 |
does not assume that everything that's in a module is something that needs to be learned. 00:09:01.460 |
To tell it that it's something that needs to be learned, you have to wrap it with nn.parameter. 00:09:05.900 |
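A quick sketch of the difference (plain nn.Module is used here, so super().__init__() is called explicitly; fastai's Module, used in the lesson, does that for you):

    import torch
    from torch import nn

    class TPlain(nn.Module):
        def __init__(self):
            super().__init__()
            self.a = torch.ones(3)                 # plain tensor: NOT registered as a parameter

    class TParam(nn.Module):
        def __init__(self):
            super().__init__()
            self.a = nn.Parameter(torch.ones(3))   # wrapped: shows up in .parameters(), requires_grad=True

    print(list(TPlain().parameters()))   # []
    print(list(TParam().parameters()))   # [Parameter containing: tensor([1., 1., 1.], requires_grad=True)]
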
So here's exactly the same class, but torch.ones, which is just a list of three, three ones 00:09:12.500 |
in this case is wrapped in nn.parameter. And now if I go parameters, I see I have a parameter 00:09:20.340 |
with three ones in it. And that's going to automatically call requires grad underscore 00:09:25.300 |
for us as well. We haven't had to do that for things like nn.linear in the past because 00:09:32.460 |
PyTorch automatically uses nn.parameter internally. So if we have a look at the parameters for 00:09:37.820 |
something that uses nn.linear with no bias layer, you'll see again we have here a parameter 00:09:44.900 |
with three things in it. So we want to in general be able to create a parameter. So 00:09:55.220 |
something with a tensor with a bunch of things in and generally we want to randomly initialize 00:09:59.260 |
them. So to randomly initialize, we can pass in the size we want. We can initialize a tensor 00:10:04.980 |
of zeros of that size and then randomly generate some normal, normally distributed random numbers 00:10:10.820 |
with a mean of zero and a deviation of 0.01. No particular reason I'm picking those numbers 00:10:16.220 |
just to show how this works. So here's something that will give us back a set of parameters 00:10:22.380 |
of any size we want. And so now we're going to replace everywhere that used to say embedding. 00:10:28.140 |
I'm going to replace it with create_params. Everything else here is the same in the 00:10:34.840 |
dunder init. And then the forward is very, very similar to before. As you can see, I'm 00:10:40.380 |
grabbing the zero index column from x, that's my users, and I just look it up as you see 00:10:49.260 |
in that user factors array. And the cool thing is I don't have to do anything with gradients 00:10:54.320 |
myself for this manual embedding layer because PyTorch can figure out the gradients automatically 00:10:59.500 |
as we've discussed. But then I just got the dot product as before, add on the bias as 00:11:03.700 |
before, do the sigmoid range as before. And so here's a dot product bias without any special 00:11:10.820 |
PyTorch layers and we fit and we get the same result. So I think that is pretty amazingly 00:11:18.580 |
cool. We've really shown that the embedding layer is nothing fancy, is nothing magic, right? 00:11:25.900 |
It's just indexing into an array. So hopefully that removes a bit of the mystery for you. 00:11:32.420 |
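Put together, the manual version described here looks roughly like this (names follow the lesson's notebook; details are approximate, and plain nn.Module stands in for fastai's Module):

    import torch
    from torch import nn

    def create_params(size):
        # a randomly initialised tensor, wrapped so the optimizer will train it
        return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))

    class DotProductBias(nn.Module):
        def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
            super().__init__()
            self.user_factors  = create_params([n_users, n_factors])
            self.user_bias     = create_params([n_users])
            self.movie_factors = create_params([n_movies, n_factors])
            self.movie_bias    = create_params([n_movies])
            self.y_range = y_range

        def forward(self, x):
            # x[:,0] is the user index, x[:,1] the movie index; "embedding" is just indexing
            users  = self.user_factors[x[:, 0]]
            movies = self.movie_factors[x[:, 1]]
            res = (users * movies).sum(dim=1)
            res += self.user_bias[x[:, 0]] + self.movie_bias[x[:, 1]]
            lo, hi = self.y_range
            return torch.sigmoid(res) * (hi - lo) + lo   # same idea as fastai's sigmoid_range
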
So let's have a look at this model that we've created and we've trained and find out what 00:11:38.100 |
it's learned. That's already useful. We've got something we can make pretty accurate 00:11:41.620 |
predictions with. But let's find out what the model looks like. Oh, 00:11:49.180 |
we have a question - okay, let's take the question before we look at this. What's 00:11:55.700 |
the advantage of creating our own embedding layer over the stock PyTorch one? Oh, nothing 00:12:02.220 |
at all. We're just showing that we can. It's great to be able to dig under the surface 00:12:06.100 |
because at some point you'll want to try doing new things. So a good way to learn to do new 00:12:10.520 |
things is to be able to replicate things that already exist, so you can check that you 00:12:15.380 |
understand how they work. It's also a great way to understand the foundations of what's 00:12:19.180 |
going on: to actually code your own implementation. But I wouldn't expect 00:12:24.980 |
you to use this implementation in practice. But basically it removes all the mystery. So 00:12:32.940 |
if you remember we've created a learner called learn and to get to the model that's inside 00:12:37.780 |
it, you can always call learn.model and then inside that there's going to be automatically 00:12:45.180 |
created for it. Well, sorry, not automatically. We've created all these attributes movie factors 00:12:49.420 |
movie bias and so forth. So we can grab learn.model.movie_bias. And now what I'm going to do is I'm going 00:12:59.220 |
to sort that vector and I'm going to print out the first five titles. And so what this 00:13:08.180 |
is going to do is it's going to print out the movies with the smallest bias and here 00:13:15.220 |
they are. What does this mean? Well, it kind of means these are the five movies that people 00:13:22.620 |
really didn't like. But it's more than that. It's not only do people not like them, but 00:13:29.180 |
if we take account of the genre they're in, the actors they have, you know, whatever the 00:13:34.620 |
latent factors are, people liked them a lot less than they expected. So maybe for example, 00:13:41.380 |
this is kind of - well, I haven't seen any of these movies, luckily perhaps. This one is a sci-fi movie, 00:13:49.900 |
so people who generally like sci-fi movies found it so bad they still didn't like 00:13:54.100 |
it. So we can do the exact opposite, which is to sort descending. And here are the top five 00:14:02.460 |
movies and specifically they're the top five by bias, right? So these are the movies that 00:14:07.700 |
even after you take account of the fact that LA Confidential, I have seen all of these 00:14:11.900 |
ones. So LA Confidential is a kind of a murder mystery cop movie, I guess. And people who 00:14:18.860 |
don't necessarily like that genre or I think Guy Pearce was in it. So maybe they don't like 00:14:22.700 |
Guy Pearce very much, whatever. People still like this movie more than they expect. So 00:14:29.340 |
this is a kind of a nice thing that we can look inside our model and see what it's learned. 00:14:35.060 |
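That step is roughly the following (assuming the title vocab lives at dls.classes['title'], as in the lesson's notebook):

    # Lowest-bias movies: the ones people liked much less than their factors would predict
    movie_bias = learn.model.movie_bias.squeeze()
    idxs = movie_bias.argsort()[:5]
    print([dls.classes['title'][int(i)] for i in idxs])

    # Highest-bias movies: sort descending instead
    idxs = movie_bias.argsort(descending=True)[:5]
    print([dls.classes['title'][int(i)] for i in idxs])
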
We can look at not only at the bias vector, but we can also look at the factors. Now there 00:14:43.940 |
are 50 factors, which is too many to visualize. So we can use a technique called PCA, Principal 00:14:50.020 |
Component Analysis. The details don't matter, but basically it's going to squish those 00:14:53.780 |
50 factors down to three. And then we'll plot the top two as you can see here. And what we 00:15:04.060 |
see when we plot the top two is we can kind of see that the movies have been kind of spread 00:15:11.380 |
out across a space of some kind of latent factors. And so if you look at the far right, 00:15:18.700 |
there's a whole bunch of kind of big-budget things. And on the far left, there's 00:15:25.540 |
more like cult kind of things, Fargo, Schindler's List, Monty Python. By the same token at the 00:15:33.660 |
bottom, we've got The English Patient, When Harry Met Sally - so kind of romance/drama kind of 00:15:42.860 |
stuff. And at the top, we've got action and sci-fi kind of stuff. So you can see even 00:15:50.660 |
as though we haven't asked in any information about these movies, all we've seen is who 00:15:57.900 |
likes what. These latent factors have automatically kind of figured out a space or a way of thinking 00:16:05.380 |
about these movies based on what kinds of movies people like and what other kinds of 00:16:09.660 |
movies they like along with those. But that's really interesting to kind of try and visualize 00:16:15.300 |
what's going on inside your model. Now we don't have to do all this manually. We can 00:16:25.540 |
actually just say give me a collab learner using this set of data loaders with this number 00:16:32.120 |
of factors and this y range, and it does everything we've just seen, getting again about the same number. 00:16:37.980 |
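That call looks roughly like this (the learning rate is illustrative; dls is the DataLoaders built earlier):

    from fastai.collab import collab_learner

    learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
    learn.fit_one_cycle(5, 5e-3, wd=0.1)   # weight decay as discussed earlier
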
Okay, so now you can see this is nice, right? We've actually been able to see right underneath 00:16:43.100 |
inside the collab learner part of the fast AI application, the collaborative filtering 00:16:48.060 |
application and we can build it all ourselves from scratch. We know how to create the SGD, 00:16:54.300 |
know how to create the embedding layer, we know how to create the model, the architecture. 00:17:01.500 |
So now you can see, you know, we really can build up from scratch our own version 00:17:06.580 |
of this. So if we just type learn.model, you can see here the names are a bit more generic. 00:17:13.220 |
This is a user weight, item weight, user bias, item bias, but it's basically the same stuff 00:17:17.700 |
we've seen before. And we can replicate the exact analysis we saw before by using this 00:17:24.240 |
same idea. Okay, slightly different order this time because it is a bit random but pretty 00:17:34.660 |
similar as well. Another interesting thing we can do is we can think about the distance 00:17:41.880 |
between two movies. So let's grab all the movie factors and just pop them into a variable 00:17:51.220 |
and then let's pick a movie and then let's find the distance from that movie to every 00:18:05.160 |
other movie. And so one way of thinking about distance is you might recall the Pythagorean 00:18:10.340 |
formula or the distance on the hypotenuse of a triangle, which is also the distance 00:18:17.820 |
to a point in a Cartesian plane on a chart, which is root x squared plus y squared. You 00:18:25.020 |
might know, it doesn't matter if you don't, but you can do exactly the same thing for 00:18:28.760 |
50 dimensions. It doesn't just work for two dimensions. That tells you how 00:18:36.020 |
far away a point is from another point, if x and y are actually the differences between 00:18:43.620 |
two movie vectors. So then what gets interesting is you can then divide that 00:18:58.020 |
by the lengths, to make all the lengths the same, and find the angle 00:19:03.620 |
between any two movies and that actually turns out to be a really good way to compare the 00:19:07.620 |
similarity of two things. That's called cosine similarity. And so the details don't matter. 00:19:12.120 |
You can look them up if you're interested. But the basic idea here is to see that we 00:19:16.340 |
can actually pick a movie and find the movie that is the most similar to it based on these embedding distances. 00:19:28.740 |
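A sketch of that lookup (attribute names follow fastai's collab model, where the item embedding is i_weight; the title string is just an example from MovieLens):

    from torch import nn

    movie_factors = learn.model.i_weight.weight
    idx = dls.classes['title'].o2i['Silence of the Lambs, The (1991)']
    distances = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
    nearest = distances.argsort(descending=True)[1]   # index 0 is the movie itself
    print(dls.classes['title'][int(nearest)])
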
What motivated learning a 50-dimensional embedding and then using PCA to reduce it to three? 00:19:36.060 |
Oh, because the purpose of this was actually to create a good model. The visualization 00:19:42.100 |
part is normally kind of the exploration of what's going on in your model. And so 00:19:47.780 |
with 50 latent factors, you're going to get a more accurate model. So that's one 00:19:54.660 |
approach: this dot product version. There's another version we could use, which is we 00:20:02.660 |
could create a set of user factors and a set of item factors and just like before we could 00:20:12.620 |
look them up. But what we could then do instead of doing a dot product, we could concatenate 00:20:18.140 |
them together into a tensor that contains both the user and the movie factors next to 00:20:26.340 |
each other. And then we could pass them through a simple little neural network: linear, ReLU, linear. 00:20:39.280 |
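A sketch of that model, along the lines of the lesson's notebook (nn.Embedding stands in for fastai's Embedding, which behaves the same for this purpose):

    import torch
    from torch import nn

    class CollabNN(nn.Module):
        def __init__(self, user_sz, item_sz, y_range=(0, 5.5), n_act=100):
            super().__init__()
            self.user_factors = nn.Embedding(*user_sz)   # user_sz = (n_users, n_user_factors)
            self.item_factors = nn.Embedding(*item_sz)   # item_sz = (n_items, n_item_factors)
            self.layers = nn.Sequential(
                nn.Linear(user_sz[1] + item_sz[1], n_act),
                nn.ReLU(),
                nn.Linear(n_act, 1))
            self.y_range = y_range

        def forward(self, x):
            embs = self.user_factors(x[:, 0]), self.item_factors(x[:, 1])
            out = self.layers(torch.cat(embs, dim=1))    # concatenate, then linear -> ReLU -> linear
            lo, hi = self.y_range
            return torch.sigmoid(out) * (hi - lo) + lo

In the notebook the two embedding sizes come from fastai's get_emb_sz helper, which is described just below.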
So importantly here, the first linear layer, the number of inputs is equal to the number 00:20:45.020 |
of user factors plus the number of item factors. And the number of outputs is however many 00:20:51.360 |
activations we have. And then we just default to 100 here. And then the final layer will 00:21:00.160 |
go from 100 to 1 because we're just making one prediction. And so we could create, we'll 00:21:06.060 |
call that collab nn. We can instantiate that to create a model. We can create a learner 00:21:11.060 |
and we can fit. It's not going quite as well as before. It's not terrible, but it's not 00:21:16.860 |
quite as good as our dot product version. But the interesting thing here is it does give 00:21:21.900 |
us some more flexibility, which is that since we're not doing a dot product, we can actually 00:21:27.060 |
have a different embedding size for each of users versus items. And actually fast.ai has 00:21:35.220 |
a simple heuristic. If you call get embedding size and pass in your data loaders, it will 00:21:40.540 |
suggest appropriate size embedding matrices for each of your categorical variables, each 00:21:49.060 |
of your user and item columns in this case. So if we pass in *embs, that's going 00:22:02.020 |
to pass in the user tuple and the item tuple, which we can then pass to Embedding. This is 00:22:11.340 |
the * prefix we learned about in the last class in case you forgot. So this is kind 00:22:17.660 |
of interesting. We can, you know, we can see here that there's two different architectures 00:22:23.260 |
we could pick from. It wouldn't be necessarily obvious ahead of time which one's going to 00:22:26.660 |
work better. In this particular case, the simplest one, the dot product one, actually turned out 00:22:32.940 |
to work a bit better, which is interesting. This particular version here, if you call 00:22:37.580 |
collab_learner and pass use_nn = true, then what that's going to do is it's going to use 00:22:44.940 |
this version, the version with concatenation and the linear layers. So collab_learner, use_nn 00:22:56.980 |
= true, again we get about the same result, as you'd expect, because it's just a shortcut 00:23:01.020 |
for this version. And it's interesting actually, we have a look at collab_learner, it actually 00:23:09.100 |
returns an object of type EmbeddingNN, and it's kind of cool: if you look inside the fastai 00:23:14.420 |
source code, or use the double question mark trick to see the source code for EmbeddingNN, 00:23:18.020 |
you'll see it's three lines of code. How does that happen? Because we're using this 00:23:24.340 |
thing called TabularModel, which we will learn about in a moment, but basically this 00:23:32.620 |
neural net version of collaborative filtering is literally just a TabularModel in which 00:23:37.740 |
we pass no continuous variables and some embedding sizes. So we'll see that in a moment. 00:23:50.080 |
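As a sketch, the application-level call is roughly this, and the comment paraphrases what EmbeddingNN boils down to (a paraphrase, not the exact fastai source):

    from fastai.collab import collab_learner

    # The neural-net version via the application API; layers= is optional and illustrative
    learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100, 50])
    learn.fit_one_cycle(5, 5e-3, wd=0.1)

    # EmbeddingNN itself is roughly just a TabularModel with no continuous variables
    # and a single output:
    # class EmbeddingNN(TabularModel):
    #     def __init__(self, emb_szs, layers, **kwargs):
    #         super().__init__(emb_szs, n_cont=0, out_sz=1, layers=layers, **kwargs)
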
Okay so that is collaborative filtering, and again take a look at the further research 00:23:55.060 |
section in particular after you finish the questionnaire, because there's some really 00:23:59.740 |
important next steps you can take to push your knowledge and your skills. 00:24:06.620 |
So let's now move to notebook 9, Tabular. And we're going to look at tabular modeling 00:24:14.100 |
and do a deep dive. And let's start by talking about this idea that we were starting to see 00:24:19.180 |
here, which is embeddings. And specifically let's move beyond just having embeddings for 00:24:28.100 |
users and items, but embeddings for any kind of categorical variable. So really because 00:24:34.860 |
we know an embedding is just a lookup into an array, it can handle any kind of discrete 00:24:43.700 |
categorical data. So things like age are not discrete, they're continuous numerical data, 00:24:49.140 |
but something like sex or postcode are categorical variables. They have a certain number of discrete 00:24:56.420 |
levels. The number of discrete levels they have is called their cardinality. So to have 00:25:02.740 |
a look at an example of a dataset that contains both categorical and continuous variables, 00:25:10.060 |
we're going to look at the Rossmann sales competition that ran on Kaggle a few years ago. And so 00:25:16.540 |
basically what's going to happen is we're going to see a table that contains information 00:25:21.580 |
about various stores in Germany, and the goal will be to try and predict how many sales 00:25:26.940 |
there's going to be for each day in a couple of week period for each store. 00:25:34.700 |
One of the interesting things about this competition is that one of the gold medalists used deep 00:25:39.980 |
learning, and it was one of the earliest known examples of a state-of-the-art deep learning 00:25:45.980 |
tabular model. I mean this is not long ago, 2015 or something, but really this idea of 00:25:52.660 |
creating state-of-the-art tabular models with deep learning has not been very common and 00:25:58.860 |
for not very long. You know interestingly compared to the other gold medalists in this 00:26:04.020 |
competition, the folks that use deep learning used a lot less feature engineering and a 00:26:08.500 |
lot less domain expertise. And so they wrote a paper called Entity Embeddings of Categorical 00:26:13.540 |
Variables, in which they basically described the exact thing that you saw in notebook 8, 00:26:21.940 |
the way you can think of one-hot encodings as just being embeddings, you can concatenate 00:26:27.420 |
them together, and you can put them through a couple of layers, they call them dense layers, 00:26:33.140 |
we've called them linear layers, and create a neural network out of that. So this is really 00:26:38.940 |
a neat, you know, kind of simple and obvious hindsight trick. And they actually did exactly 00:26:45.940 |
what we did in the paper, which is to look at the results of the trained embeddings. 00:26:52.900 |
And so for example they had an embedding matrix for regions in Germany, because there wasn't 00:27:02.300 |
really metadata about this, these were just learned embeddings, just like we learned embeddings 00:27:06.120 |
about movies. And so then they just created, just like we did before, a chart where they 00:27:12.300 |
plotted each region according to, I think, probably a PCA of their embeddings. And then if you 00:27:18.820 |
circle the ones that are close to each other in blue, you'll see that they're actually 00:27:24.140 |
close to each other in Germany, and ditto for red, and ditto for green, and then here's 00:27:30.580 |
the brown. So this is like pretty amazing, is the way that we can see that it's kind 00:27:38.660 |
of learned something about what Germany looks like, based entirely on the purchasing behavior 00:27:44.180 |
of people in those states. Something else they did was to look at every store, and they 00:27:50.420 |
looked at the distance between stores in practice, like how many kilometers away they are. And 00:27:58.100 |
then they looked at the distance between stores in terms of their embedding distance, just 00:28:03.700 |
like we saw in the previous notebook. And there was this very strong correlation that 00:28:09.260 |
stores that were close to each other physically ended up having close embeddings as well, 00:28:18.180 |
even though the actual location of these stores in physical space was not part of the model. 00:28:26.180 |
Ditto with days of the week, so the days of the week or another embedding, and the days 00:28:32.100 |
of the week that were next to each other, ended up next to each other in embedding space, 00:28:37.740 |
and ditto for months of the year. So pretty fascinating the way kind of information about 00:28:44.900 |
the world ends up captured just by looking at training embeddings, which as we know are 00:28:50.700 |
just index lookups into an array. So the way we then combine these categorical variables 00:29:00.220 |
with these embeddings with continuous variables, what was done in both the entity embedding 00:29:06.620 |
paper that we just looked at, and then also described in more detail by Google when they 00:29:13.060 |
described how their recommendation system in Google Play works. This is from Google's 00:29:18.180 |
paper, is they have the categorical features that go through the embeddings, and then there 00:29:23.260 |
are continuous features, and then all the embedding results and the continuous features 00:29:27.940 |
are just concatenated together into this big concatenated table that then goes through 00:29:32.700 |
this case three layers of a neural net, and interestingly they also take the kind of collaborative 00:29:40.620 |
filtering bit and do the product as well and combine the two. So they use both of the tricks 00:29:46.340 |
were used in the previous notebook and combine them together. So that's the basic idea we're 00:29:54.340 |
going to be seeing for moving beyond just collaborative filtering, which is just two 00:30:01.180 |
categorical variables to as many categorical and as many continuous variables as we like. 00:30:07.340 |
But before we do that, let's take a step back and think about other approaches, because 00:30:12.900 |
as I mentioned, the idea of deep learning as a kind of a best practice for tabular data 00:30:19.940 |
is still pretty new and it's still kind of controversial. It's certainly not always the 00:30:25.500 |
case that it's the best approach. So when we're not using deep learning, what would 00:30:30.980 |
we be using? Well, what we'd probably be using is something called an ensemble of decision 00:30:36.620 |
trees and the two most popular are random forests and gradient boosting machines or 00:30:43.140 |
something similar. So basically between multi-layered neural networks, like with SGD and ensemble 00:30:49.900 |
of decision trees, that kind of covers the vast majority of approaches that you're likely 00:30:55.700 |
to see for tabular data. And so we're going to make sure we cover them both of course today, 00:31:01.820 |
in fact. So although deep learning is nearly always clearly superior for stuff like images 00:31:09.580 |
and audio and natural language text, these two approaches tend to give somewhat similar 00:31:15.820 |
results a lot of the time for tabular data. So let's take a look. You know, you really 00:31:21.820 |
should generally try both and see which works best for you for each problem you look at. 00:31:28.660 |
Why does the range go from 0 to 5.5 if the maximum is 5? 00:31:38.140 |
That's a great question. The reason is if you think about it for sigmoid, it's actually 00:31:43.740 |
impossible for a sigmoid to get all the way to the top or all the way to the bottom. Those 00:31:49.180 |
are asymptotes. So no matter how far, how big your x is, it can never quite get to the 00:31:54.780 |
top or no matter how small it is, it can never quite get to the bottom. So if you want to 00:31:58.580 |
be able to actually predict a rating of 5, then you need to use something a bit higher than 5 as the top of your range. 00:32:07.300 |
Are embeddings used only for highly cardinal categorical variables, or is this approach 00:32:12.380 |
used in general? For low cardinality, can one use a one-hot encoding? 00:32:18.500 |
I'll remind you cardinality is the number of discrete levels in a variable. And remember 00:32:29.180 |
that an embedding is just a computational shortcut for a one-hot encoding. So there's 00:32:36.180 |
really no reason to use a one-hot encoding because it's, as long as you have more than 00:32:42.260 |
two levels, it's always going to be more memory and slower, and give you exactly, mathematically, 00:32:48.060 |
the same thing. And if there's just two levels, then it is basically identical. So there isn't really any reason to use one-hot encoding. 00:32:58.180 |
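As a tiny demo of that equivalence (arbitrary sizes, just to illustrate the point made back in the embedding discussion):

    import torch

    emb = torch.randn(7, 4)          # an "embedding matrix": 7 levels, 4 factors
    level = 3                         # the category we want to look up

    one_hot = torch.zeros(7)
    one_hot[level] = 1.0

    # Multiplying by a one-hot vector and simply indexing give exactly the same row,
    # but the matrix multiply does far more memory access and arithmetic.
    print(torch.allclose(one_hot @ emb, emb[level]))   # True
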
Thank you for those great questions. Okay, so one of the most important things about 00:33:08.980 |
decision tree ensembles is that at the current state of the technology, they do provide faster 00:33:15.100 |
and easier ways of interpreting the model. I think that's rapidly improving for deep 00:33:19.840 |
learning models on tabular data, but that's where we are right now. They also require 00:33:24.420 |
less hyperparameter tuning, so they're easier to kind of get right the first time. So my 00:33:30.260 |
first approach for analyzing a new tabular data set is always an ensemble of decision 00:33:35.220 |
trees. And specifically, I pretty much always start with a random forest, because it's just so hard to get it wrong. 00:33:44.260 |
In your experience, for highly imbalanced data, such as fraud or medical data, what usually 00:33:50.100 |
works best out of random forest, XGBoost, or neural networks? 00:33:55.460 |
I'm not sure that whether the data is balanced or unbalanced is a key reason for choosing 00:34:03.260 |
one of those above the others. I would try all of them and see which works best. So the 00:34:09.660 |
exception to the guideline about start with decision tree ensembles is your first thing 00:34:13.820 |
to try would be if there's some very high cardinality categorical variables, then they 00:34:18.600 |
can be a bit difficult to get to work really well in decision tree ensembles. Or if there's 00:34:25.440 |
something like, most importantly, if it's like plain text data or image data or audio 00:34:29.380 |
data or something like that, then you're definitely going to need to use a neural net in there, 00:34:34.980 |
but you could actually ensemble it with a random forest, as we'll see. 00:34:40.420 |
Okay, so clearly we're going to need to understand how decision tree ensembles work. So PyTorch 00:34:50.500 |
isn't a great choice for decision tree ensembles. They're really designed for gradient-based 00:34:55.420 |
methods and random forests and decision tree growing are not really gradient-based methods 00:35:01.780 |
in the same way. So instead, we're going to use a library called scikit-learn, referred 00:35:08.640 |
to as sklearn as a module. Scikit-learn does a lot of things. We're only going to touch 00:35:16.420 |
on a tiny piece of them, stuff we need to do to train decision trees and random forests. 00:35:24.540 |
We've already mentioned before Wes McKinney's book, also a great book for understanding 00:35:28.600 |
more about scikit-learn. So the dataset for learning about decision tree ensembles is 00:35:35.500 |
going to be another dataset. It's going to, it's called the blue book for bulldozers dataset 00:35:42.660 |
and it's a Kaggle competition. So Kaggle competitions are fantastic. They are machine learning competitions 00:35:52.060 |
where you get interesting datasets, you get feedback on whether your approach is any good 00:35:56.820 |
or not. You can see on a leaderboard what approaches are working best and then you can 00:36:01.140 |
read blog posts from the winning contestants sharing tips and tricks. It's certainly not 00:36:07.940 |
a substitute for actual practice doing end-to-end data science projects, but for becoming good 00:36:19.060 |
at creating predictive models that are predictive, it's a really fantastic resource, highly recommended. 00:36:25.980 |
And you can also submit to most old competitions to see how you would have gone, without having 00:36:31.860 |
to worry about, you know, the kind of stress of like whether people will be looking at 00:36:35.980 |
your results because they're not publicized or published if you do that. 00:36:41.540 |
There's a question. Can you comment on real-time applications of random forests? In my experience, 00:36:49.900 |
they tend to be too slow for real-time use cases like a recommender system, neural network 00:36:55.680 |
is much faster when run on the right hardware. 00:36:58.860 |
Let's get to that once we've seen what they are, shall we? Now, you can't just download 00:37:08.620 |
and untar Kaggle datasets using the untar_data function that we have in fastai. So you actually 00:37:13.540 |
have to sign up to Kaggle and then follow these instructions for how to download data 00:37:20.380 |
from Kaggle. Make sure you replace creds here with what it describes. You need to get a 00:37:24.980 |
special API code and then run this one time to put that up on your server. And now you 00:37:32.020 |
can use Kaggle to download data using the API. So after we do that, we're going to end 00:37:41.720 |
up with a bunch of, as you see, CSV files. So let's take a look at this data. 00:37:49.340 |
So the main data, the main table is train.csv. Remember that's comma separated values and 00:37:55.980 |
the training set contains information such as unique identifier of a sale, the unique 00:38:00.980 |
identifier of a machine, the sale price, sale date. So what's going on here is one row of 00:38:07.100 |
the data represents a sale of a single piece of heavy machinery like a bulldozer at an 00:38:14.620 |
auction. So it happens at a date, as a price, it's of some particular piece of equipment 00:38:20.860 |
and so forth. So if we use pandas again to read in the CSV file, let's combine training 00:38:28.100 |
and valid together. We can then look at the columns, and see there are a lot of columns there 00:38:34.200 |
and many things which I don't know what the hell they mean like blade extension and pad 00:38:37.860 |
type and ride control. But the good news is we're going to show you a way that you don't 00:38:43.340 |
have to look at every single column and understand what they mean and random forests are going 00:38:48.120 |
to help us with that as well. So once again, we're going to be seeing this idea that models 00:38:53.700 |
can actually help us with data understanding and data cleanup. One thing we can look at 00:38:59.460 |
is ordinal columns, a good place to look at that now. If there's things there that you 00:39:03.920 |
know are discrete values but have some order, like product size: it has levels like large, 00:39:11.380 |
large/medium, medium, small, mini and compact. These should not be in alphabetical order or some random order, 00:39:19.340 |
they should be in this specific order, right? They have a specific ordering. So we can use 00:39:28.820 |
astype to turn it into a categorical variable, and then we can say set_categories with ordered equals 00:39:34.420 |
True to basically say this is an ordinal column. So it's got discrete values, but we actually 00:39:40.300 |
want to define what the order of the classes is. We need to choose which is the dependent 00:39:48.260 |
variable, and we do that by looking on Kaggle, and Kaggle will tell us that the thing we're 00:39:52.180 |
meant to be predicting is sale price - and actually, specifically, they'll tell us the thing we're 00:39:56.960 |
meant to be predicting is the log of sale price, because root mean squared log error 00:40:02.060 |
is what we're actually going to be judged on in the competition, so we take the log. 00:40:09.020 |
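A sketch of those two steps (ProductSize and SalePrice are the actual column names in the Kaggle data; df is the dataframe read in above):

    import numpy as np

    # Tell pandas this is an *ordered* categorical, so the levels have a meaningful order
    sizes = 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact'
    df['ProductSize'] = df['ProductSize'].astype('category')
    df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)

    # The competition metric is RMSLE, so take the log of the dependent variable up front
    dep_var = 'SalePrice'
    df[dep_var] = np.log(df[dep_var])
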
So we're now going to replace sale price with its log and that's what we'll be using from 00:40:12.940 |
now on. So a decision tree ensemble requires decision trees. So let's start by looking 00:40:20.340 |
at decision trees. So a decision tree in this case is a something that asks a series of 00:40:28.060 |
binary that is yes or no questions about data. So such as is somebody less than or greater 00:40:34.540 |
than 30? Yes they are. Are they eating healthily? Yes they are and so okay then we're going 00:40:39.700 |
to say they're fit or unfit. So like there's an example of some arbitrary decision tree 00:40:46.540 |
that somebody might have come up with. It's a series of binary yes and no choices and 00:40:51.620 |
at the bottom are leaf nodes that make some prediction. Now of course for our bulldozers 00:41:02.380 |
competition we don't know what binary questions to ask about these things and in what order 00:41:10.180 |
in order to make a prediction about sale price. So we're doing machine learning so we're going 00:41:15.180 |
to try and come up with some automated way to create the questions. And there's actually 00:41:20.700 |
a really simple procedure for doing that. You have to think about it. So if you want 00:41:24.620 |
to kind of stretch yourself here have a think about what's an automatic procedure that you 00:41:30.620 |
can come up with that would automatically build a decision tree where the final answer 00:41:36.300 |
would do a you know significantly better than random job of estimating the sale price of 00:41:44.220 |
one of these auctions. Alright so here's the approach that we could use. Loop through each 00:41:53.500 |
column of the data set. We're going to go through each of them - well, obviously not sale price, 00:41:59.300 |
that's the dependent variable - sale ID, machine ID, auctioneer, year made, etc. And so one of 00:42:05.140 |
those will be for example product size. And so then what we're going to do is we're going 00:42:11.660 |
to loop through each possible value of product size: large, large/medium, medium, etc. And 00:42:21.380 |
then we're going to do a split basically like where this comma is and we're going to say 00:42:25.260 |
okay let's get all of the auctions of large equipment and put that into one group and 00:42:32.820 |
everything that's smaller than that and put that into another group. And so that's here 00:42:38.900 |
split the data into two groups based on whether they're greater than or less than that value. 00:42:45.740 |
If it's a categorical, non-ordinal variable, it'll just be whether it's equal 00:42:49.740 |
or not equal to that level. And then we're going to find the average sale price for each 00:42:55.620 |
of the two groups. So for the large group what was the average sale price? For the smaller 00:43:00.900 |
than large group what was the average sale price? And that will be our model. Our prediction 00:43:06.940 |
will simply be the average sale price for that group. And so then you can say well how 00:43:12.460 |
good is that model? If our model was just to ask a single question with a yes/no answer 00:43:17.380 |
put things into two groups and take the average of the group as being our prediction and we 00:43:22.260 |
can say how good would that model be? What would be the root mean squared error from 00:43:26.140 |
that model? And so we can then say all right how good would it be if we use large as a 00:43:32.580 |
split? And then let's try again what if we did large/medium as a split? What if we did 00:43:38.260 |
medium as a split? And so in each case we can find the root mean squared error of that 00:43:42.180 |
incredibly simple model. And then once we've done that for all of the product size levels 00:43:47.020 |
we can go to the next column and look at level of usage band and do every level of usage 00:43:55.380 |
band and then state, every level of state and so forth. And so there'll be some variable 00:44:02.860 |
and some split level which gives the best root mean squared error of this really really 00:44:09.540 |
simple model. And so then we'll say okay that would be our first binary decision. It gives 00:44:16.220 |
us two groups and then we're going to take each one of those groups separately and find 00:44:22.580 |
another single binary decision for each of those two groups using exactly the same procedure. 00:44:28.820 |
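Before trying it yourself, here's a rough, unoptimized sketch of the inner "find the best single binary split" step (a hypothetical helper, not sklearn's implementation):

    import numpy as np

    def best_binary_split(df, dep_var, candidate_cols):
        """Greedy search as described above. Assumes numeric or ordered columns;
        for unordered categoricals you'd test equality instead of <=."""
        best = None
        for col in candidate_cols:                        # loop through each column...
            for val in df[col].dropna().unique():         # ...and each value it takes
                lhs = df[col] <= val                      # split the rows into two groups
                if lhs.all() or (~lhs).all(): continue    # ignore splits that leave a group empty
                pred = np.where(lhs, df.loc[lhs, dep_var].mean(), df.loc[~lhs, dep_var].mean())
                rmse = np.sqrt(((pred - df[dep_var]) ** 2).mean())
                if best is None or rmse < best[2]:
                    best = (col, val, rmse)               # best single binary decision so far
        return best
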
So then we'll have four groups and then we'll do exactly the same thing again separately 00:44:33.460 |
for each of those four groups and so forth. So let's see what that looks like and in fact 00:44:44.180 |
once we've gone through this you might even want to see if you can implement this algorithm 00:44:47.500 |
yourself. It's not trivial but it doesn't require any special coding skills so hopefully 00:44:55.020 |
you can find you'll be able to do it. There's a few things we have to do before we can actually 00:45:00.820 |
create a decision tree in terms of just some basic data munching. One is if we're going 00:45:06.460 |
to take advantage of dates we actually want to call fastai's add_datepart function, and 00:45:13.660 |
what that does, as you see after we call it, is it creates a whole bunch of 00:45:18.980 |
different bits of metadata from that date: sale year, sale month, sale week, sale 00:45:24.380 |
day and so forth. A date by itself doesn't have a whole lot of information directly, 00:45:35.020 |
but we can pull lots of different information out of it. And so this is an example of something 00:45:39.340 |
called feature engineering, which is where we take some piece of data and 00:45:44.220 |
we try to create lots of other pieces of data from it. So: is this particular date 00:45:50.180 |
the end of a month or not? The end of a year or not? And so forth. So that handles 00:45:56.940 |
dates. There's a bit more cleaning we want to do, and fastai provides some things to make 00:46:03.700 |
cleaning easier. We can use the tabular pandas class to create a tabular data set in pandas. 00:46:13.460 |
And specifically we're going to use two tabular processes or tabular procs. A tabular processor 00:46:19.940 |
is basically just a transform and we've seen transforms before so go back and remind yourself 00:46:24.580 |
what a transform is. Except it's just slightly different it's like three lines of code if 00:46:30.620 |
you look at the code for it. It's actually going to modify the object in place rather 00:46:36.260 |
than creating a new object and giving it back to you. And that's because often these tables 00:46:40.420 |
of data are kind of really big and we don't want to waste lots of RAM. And it's just going 00:46:46.300 |
to run the transform once and save the result rather than doing it lazily when you access 00:46:51.060 |
it for the same reason. We're just going to make this a lot faster. So you can just think 00:46:57.160 |
of them as transforms really. One of them is called categorify and categorify is going 00:47:02.020 |
to replace a column with numeric categories using the same basic idea of like a vocab 00:47:09.340 |
like we've seen before. FillMissing is going to find any columns with missing data; it's 00:47:16.240 |
going to fill in the missing data with the median of the data and create a new column, 00:47:21.100 |
a boolean column which is set to true for anything that was missing. So these two things 00:47:25.760 |
is basically enough to get you to a point where most of the time you'll be able to train 00:47:29.260 |
a model. Now the next thing we need to do is think about our validation set. As we discussed 00:47:37.340 |
in lesson one, a random validation set is not always appropriate and certainly for something 00:47:44.020 |
like predicting auction results it almost certainly is not appropriate because we're 00:47:49.260 |
going to be wanting to use a model in the future not at some random date in the past. 00:47:54.660 |
So the way this Kaggle competition was set up was that the test set the thing that you 00:48:00.680 |
had to fill in and submit for the competition was two weeks of data that was after any of 00:48:08.860 |
the training set. So we should do the same thing for a validation set. We should create 00:48:14.580 |
something which is where the validation set is the last couple of weeks of data and so 00:48:22.820 |
then the training set will only be data before that. So we basically can do that by grabbing 00:48:28.340 |
everything before October 2011, create a training and validation set based on that condition 00:48:35.260 |
and grabbing those bits. So that's going to split our training set and validation set 00:48:43.520 |
by date, not randomly. When you create a TabularPandas 00:48:50.460 |
object, you're going to be passing in a data frame, passing in your tabular 00:48:54.980 |
procs, and you also have to say what your categorical and continuous variables are. We can 00:49:00.100 |
use fastai's cont_cat_split to automatically split a data frame into continuous and categorical 00:49:07.820 |
variables for you. So we can just pass those in. Tell it what is the dependent variable, 00:49:14.940 |
you can have more than one, and what are the indexes to split into training and valid. 00:49:20.460 |
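Putting the pieces just described together, a sketch along the lines of the notebook (df is the dataframe from earlier, and the date condition follows the October 2011 cut-off just described):

    import numpy as np
    from fastai.tabular.all import (add_datepart, TabularPandas, Categorify,
                                    FillMissing, cont_cat_split)

    df = add_datepart(df, 'saledate')            # adds saleYear, saleMonth, saleDay, saleElapsed, ...

    procs = [Categorify, FillMissing]            # the two tabular procs described above
    dep_var = 'SalePrice'

    # Validation set = the last part of the data by date, not a random subset
    cond = (df.saleYear < 2011) | (df.saleMonth < 10)
    train_idx = np.where(cond)[0]
    valid_idx = np.where(~cond)[0]
    splits = (list(train_idx), list(valid_idx))

    cont, cat = cont_cat_split(df, 1, dep_var=dep_var)
    to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
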
And this is a tabular object. So it's got all the information you need about the training 00:49:24.720 |
set, the validation set, categorical and continuous variables and the dependent variable and any 00:49:30.060 |
processes to run. It looks a lot like a datasets object, but it has a .train, it has a .valid 00:49:41.060 |
and so if we have a look at .show we can see the data. But .show is going to show us the 00:49:50.740 |
kind of the string data, but if we look at .items you can see internally it's actually 00:49:56.800 |
stored these very compact numbers which we can use directly in a model. So fast.ai has 00:50:06.140 |
basically got us to a point here where we have our data into a format ready for modeling 00:50:11.500 |
and our validation sets being created. To see how these numbers relate to these strings 00:50:19.580 |
we can again just like we saw last week use the classes attribute which is a dictionary 00:50:25.220 |
which basically tells us the vocab. So this is how we look up - for example, 6 is: 0, 1, 00:50:30.820 |
2, 3, 4, 5, 6 - that's 'Compact'. That processing takes a little while to run, 00:50:39.260 |
so you can go ahead and save the tabular object and so then you can load it back later without 00:50:46.540 |
having to rerun all the processing. So that's a nice kind of fast way to quickly get back 00:50:52.820 |
up and running without having to reprocess your data. So we've done the basic data munging 00:50:59.100 |
we need. So we can now create a decision tree and in scikit-learn a decision tree where 00:51:04.180 |
the dependent variable is continuous is a decision tree regressor. And let's start by 00:51:10.440 |
telling it we just want a total of four leaf nodes. We'll see what that means in a moment 00:51:16.980 |
and in scikit-learn you generally call fit so it looks quite a lot like fast.ai and you 00:51:23.060 |
pass in your independent variables and your dependent variable and we can grab those straight 00:51:28.340 |
from our tabular object: the training set's .xs and .y, and we can do the same thing for validation 00:51:28.340 |
just to save us in typing. Okay, question. Do you have any thoughts on what data augmentation 00:51:41.820 |
for tabular data might look like? I don't have a great sense of data augmentation for tabular 00:51:53.660 |
data. We'll be seeing later, either in this course or in the next part, dropout and mixup 00:52:03.200 |
and stuff like that, which you might be able to apply in later layers of the tabular 00:52:11.260 |
model. Otherwise I think you'd need to think about kind of the semantics of the data and 00:52:16.220 |
think about what are things you could do to change the data without changing the meaning. 00:52:21.060 |
That's like a pretty tricky route. Another question: does fast.ai distinguish between ordered categories 00:52:29.340 |
such as low, medium, high and unordered categorical variables? Yes, that was that ordinal thing 00:52:36.180 |
I told you about before and all it really does is it ensures that your classes list 00:52:42.300 |
has a specific order so then these numbers actually have a specific order. And as you'll 00:52:47.860 |
see that's actually going to turn out to be pretty important for how we train our random 00:52:51.820 |
forest. Okay, so we can create a decision tree regressor. We can fit it and then we 00:53:00.300 |
can draw it with a fastai function. And here is the decision tree we just trained, and behind 00:53:10.380 |
the scenes this actually used basically the exact process that we described back here, 00:53:19.700 |
right? So this is where you can like try and create your own decision tree implementation 00:53:25.380 |
if you're interested in stretching yourself. So we're going to use one that's already exists 00:53:31.880 |
and the best way to understand what it's done is to look at this diagram from top to bottom. 00:53:37.060 |
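A sketch of the fit and the drawing (using sklearn's own plot_tree here rather than the notebook's drawing helper; xs and y come from the TabularPandas object):

    from sklearn.tree import DecisionTreeRegressor, plot_tree
    import matplotlib.pyplot as plt

    xs, y = to.train.xs, to.train.y              # independent and dependent variables
    valid_xs, valid_y = to.valid.xs, to.valid.y

    m = DecisionTreeRegressor(max_leaf_nodes=4)  # stop after four leaf nodes, as above
    m.fit(xs, y)

    plt.figure(figsize=(12, 6))
    plot_tree(m, feature_names=list(xs.columns), filled=True, precision=2)
    plt.show()
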
So the first step is it says like okay the initial model it created is a model with no 00:53:44.660 |
binary splits at all. Specifically it's always going to predict the value 10.1 for every 00:53:50.600 |
single row. Why is that? Well because this is the simplest possible model is to take 00:53:57.020 |
the average of the dependent variable and always predict that. And so this is always 00:54:02.100 |
should be your kind of pretty much your basic baseline for regression. There are 404,710 00:54:08.720 |
rows - auctions - that we're averaging, and the mean 00:54:14.660 |
squared error of this incredibly simple model, in which there are no rules at all, no groups 00:54:20.860 |
at all, just a single average, is 0.48. So then the next most complex model is 00:54:29.300 |
to take a single column, coupler_system, and a single binary decision: is coupler_system 00:54:35.980 |
less than or equal to 0.5? There are 360,847 00:54:41.780 |
auctions where it's true, and 43,863 00:54:47.740 |
where it's false. And now interestingly, in the false case, you can see that there are 00:54:54.100 |
no further binary decisions. So this is called a leaf node. It's a node where this is as 00:54:59.620 |
far as you can get and so if your coupler system is not less than or equal to 0.5 then 00:55:07.340 |
the prediction this model makes for your sale price is 9.21 versus if it's true it's 10.21. 00:55:15.100 |
So you can see it's actually found a very big difference here and that's why it picked 00:55:19.220 |
this as the first binary split. And so the mean squared error for this section here is 00:55:23.940 |
0.12 which is far better than we started out at, 0.48. This group still has 360,000 in 00:55:32.380 |
it and so it does another binary split. This time is the year that this piece of equipment 00:55:38.340 |
made was at less than or equal to 1991.5. If it was, if it's true then we get a leaf node 00:55:47.340 |
and the prediction is 9.97, mean squared error 0.37. If the value is false we don't have 00:55:53.420 |
a leaf node and we have another binary split. And you can see eventually we get down to 00:55:57.740 |
here coupler system true, year made, false, product size, false, mean squared error 0.17. 00:56:05.020 |
So all of these leaf nodes have MSEs that are smaller than that original baseline model 00:56:13.780 |
of just taking the mean. So this is how you can grow a decision tree. And we only stopped 00:56:19.660 |
here because we said max leaf nodes is 4, 1, 2, 3, 4, right? And so if we want to keep 00:56:27.140 |
training it further we can just use a higher number. There's actually a very nice library 00:56:36.220 |
by Terence Parr called dtreeviz, which can show us exactly the same information, like 00:56:42.220 |
so. And so here are the same leaf nodes 1, 2, 3, 4. And you can see the kind of the chart 00:56:49.980 |
of how many are there. This is the split, coupler system 0.5. Here are the two groups. 00:56:55.460 |
You can see the sale price in each of the two groups. And then here's the leaf node. 00:57:00.660 |
And so then the second split was on year made. And you can see here something weird is going 00:57:05.300 |
on with year made. There's a whole bunch of year mades that are a thousand which is obviously 00:57:09.700 |
not a sensible year for a bulldozer to be made. So presumably that's some kind of missing 00:57:15.140 |
value. So when we look at the kind of the picture like this it can give us some insights 00:57:21.400 |
about what's going on in our data. And so maybe we should replace those thousands with 00:57:28.700 |
1950 because that's you know obviously a very, very early year for a bulldozer. So we can 00:57:34.940 |
kind of pick it arbitrarily. It's actually not really going to make any difference to 00:57:39.700 |
the model that's created, because all we care about is the order, since we're just doing 00:57:44.740 |
these binary splits - but it'll make it easier to look at, as you can see. Here's our 1950s 00:57:50.420 |
now. And so now it's much easier to see what's going on in that binary split. So let's now 00:57:58.420 |
get rid of max leaf nodes and build a bigger decision tree. And then let's just for the 00:58:05.060 |
rest of this notebook create a couple of little functions. One to create the root mean squared 00:58:10.220 |
error, which is just here. And another one to take a model and some independent 00:58:16.900 |
variables, predict from the model on those independent variables, and then take the root mean squared 00:58:23.180 |
error with the dependent variable. So that's going to be our model's root mean squared error. 00:58:29.700 |
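Roughly, those two little functions are (as in the notebook):

    import math

    def r_mse(pred, y):
        # root mean squared error, rounded for easy reading
        return round(math.sqrt(((pred - y) ** 2).mean()), 6)

    def m_rmse(m, xs, y):
        # RMSE of a fitted model's predictions on the given independent variables
        return r_mse(m.predict(xs), y)
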
So for this decision tree in which we didn't have a stopping criteria, so as many leaf 00:58:33.900 |
nodes as you like, the model's root mean squared error is zero. So we've just built the perfect 00:58:41.580 |
model. So this is great news, right? We've built the perfect auction trading system. 00:58:49.660 |
Well remember, we actually need to check the validation set. Let's check m_rmse 00:58:54.620 |
with a validation set and oh, it's worse than zero. So our training set is zero, our validation 00:59:02.540 |
set is much worse than zero. Why has that happened? Well one of the things that a decision 00:59:08.660 |
tree in sklearn can do is it can tell you the number of leaf nodes, the number of leaves: 00:59:14.540 |
there are 341,000, number of data points 400,000. So in other words, we have nearly as many 00:59:22.460 |
leaf nodes as data points. Most of our leaf nodes only have a single thing in, but they're 00:59:26.780 |
taking an average of a single thing. Clearly this makes no sense at all. So what we should 00:59:32.060 |
actually do is pick some different stopping criteria and say, okay, don't do a split 00:59:38.180 |
if it would create a leaf node with fewer 00:59:45.840 |
than 25 things in it. And now if we fit and we look at the root mean squared error for 00:59:51.540 |
the validation set, it's going to go down from 0.33 to 0.32. So the training set got 00:59:59.460 |
worse, from zero to 0.248, the validation set got better, and now we only have 12,000 leaf nodes. 01:00:10.100 |
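That stopping criterion might look something like this (a sketch; the xs/y names and the m_rmse helper are assumed from above):

```python
from sklearn.tree import DecisionTreeRegressor

# don't create leaf nodes containing fewer than 25 rows
m = DecisionTreeRegressor(min_samples_leaf=25)
m.fit(xs, y)
print(m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y), m.get_n_leaves())
```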
Alright, so let's take a five minute break and then we're going to come back and see 01:00:15.260 |
how we get the best of both worlds, how we're going to get something which has the kind 01:00:19.660 |
of flexibility to get these, you know, really deep trees that can get the training error 01:00:26.540 |
down to zero, but also without overfitting. And the trick will be to use 01:00:32.860 |
something called bagging. We'll come back and talk about that in five minutes. 01:00:39.460 |
Okay, welcome back. So we're going to look at how we can get the best of both worlds 01:00:49.500 |
as we discussed and let's start by having a look at what we're doing with categorical 01:00:56.420 |
variables first of all. And so you might notice that previously with categorical variables, 01:01:03.500 |
for example, in collaborative filtering, we had to, you know, kind of think about like 01:01:10.500 |
how many embedding levels we have, for example, if you've used other modeling tools, you might 01:01:15.780 |
be used to doing things like creating dummy variables, stuff like that. For random forests on the 01:01:21.780 |
whole, you don't have to. The reason is, as we've seen, all of our categorical variables 01:01:32.460 |
have been turned into numbers. And so we can perfectly well have decision tree binary decisions 01:01:41.420 |
which use those particular numbers. Now, the numbers might not be ordered in any interesting 01:01:49.260 |
way, but if there's a particular level which kind of stands out as being important, it 01:01:56.380 |
only takes two binary splits to split out that level into a single, you know, into a 01:02:04.420 |
single piece. So generally speaking, I don't normally worry too much about kind of encoding 01:02:14.140 |
categorical variables in a special way. As I mentioned, I do try to encode ordinal variables 01:02:19.980 |
by saying what the order of the levels is, because often, as you would expect, sizes, 01:02:26.180 |
for example, you know, medium and small are going to mean kind of next to each other and 01:02:30.860 |
large and extra large would be next to each other. That's good to have those as similar 01:02:34.580 |
numbers. Having said that, you can kind of one hot encode a categorical variable if you 01:02:43.700 |
want to using get dummies in pandas. But there's not a lot of evidence that that actually helps. 01:02:51.380 |
That's actually been studied in a paper. And so I would say in general for categorical 01:02:57.540 |
variables don't worry about it too much. Just use what we've shown you. You have a question. 01:03:04.500 |
For ordinal categorical variables, how do you deal with when they have like nA or missing 01:03:12.460 |
values, where do you put that in the order? So in fast.ai, nA missing values always appear 01:03:22.300 |
as the first item. They'll always be the zero index item. And also if you get something 01:03:27.480 |
in the validation or test set, which is a level we haven't seen in training, that will 01:03:32.140 |
be considered to be that missing or nA value as well. All right, so what we're going to 01:03:41.020 |
do to try and improve our random forest is we're going to use something called bagging. 01:03:46.420 |
This was developed by a retired Berkeley professor named Leo Breiman in 1994. And he did a lot 01:03:54.180 |
of great work and perhaps you could argue that most of it happened after he retired. 01:03:59.700 |
His technical report was called bagging predictors. And he described how you could create multiple 01:04:05.260 |
versions of a predictor, so multiple different models. And you could then aggregate them 01:04:11.740 |
by averaging over the predictions. And specifically, the way he suggested doing this was to create 01:04:20.540 |
what he called bootstrap replicates. In other words, randomly select different subsets of 01:04:25.860 |
your data. Train a model on that subset, kind of store it away as one of your predictors, 01:04:31.820 |
and then do it again a bunch of times. And so each of these models is trained on a different 01:04:36.460 |
random subset of your data. And then you, to predict, you predict on all of those different 01:04:43.380 |
versions of your model and average them. And it turns out that bagging works really well. 01:04:52.300 |
So this, the sequence of steps is basically randomly choose some subset of rows, train 01:04:58.540 |
a model using that subset, save that model, and then return to step one. Do that a few 01:05:04.180 |
times to train a few models. And then to make a prediction, predict with all the models 01:05:10.300 |
and take the average. That is bagging. And it's very simple, but it's astonishingly powerful. 01:05:18.300 |
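Here is a minimal sketch of that recipe using sklearn decision trees (purely illustrative; a real random forest implementation does all of this for you):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_trees(xs, y, n_models=10):
    # xs is assumed to be a DataFrame and y a Series
    models = []
    for _ in range(n_models):
        # bootstrap replicate: randomly sample rows with replacement
        idx = np.random.choice(len(xs), len(xs), replace=True)
        models.append(DecisionTreeRegressor(min_samples_leaf=25)
                      .fit(xs.iloc[idx], y.iloc[idx]))
    return models

def bag_predict(models, xs):
    # predict with every model and take the average
    return np.stack([m.predict(xs) for m in models]).mean(0)
```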
And the reason why is that each of these models we've trained, although they are not using 01:05:25.480 |
all of the data, so they're kind of less accurate than a model that uses all of the data. Each 01:05:31.980 |
of them is, the errors are not correlated, you know, the errors because of using that 01:05:39.880 |
smaller subset are not correlated with the errors of the other models because they're 01:05:44.140 |
random subsets. And so when you take the average of a bunch of kind of errors which are not 01:05:54.100 |
correlated with each other, the average of those errors tends towards zero. So therefore, the average 01:06:01.080 |
of the models should give us an accurate prediction of the thing we're actually trying to predict. 01:06:08.380 |
So as I say here, it's an amazing result. We can improve the accuracy of nearly any 01:06:12.540 |
kind of algorithm by training it multiple times on different random subsets of data 01:06:18.380 |
and then averaging the predictions. So then Breiman in 2001 showed a way to do this specifically 01:06:27.140 |
for decision trees where not only did he randomly choose a subset of rows for each model, but 01:06:33.700 |
then for each binary split, he also randomly selected a subset of columns. And this is 01:06:40.200 |
called the random forest. And it's perhaps the most widely used, most practically important 01:06:45.860 |
machine learning method and astonishingly simple. To create a random forest regressor, 01:06:54.100 |
you use sklearn's RandomForestRegressor. If you pass n_jobs=-1, it will use all of the 01:07:00.980 |
CPU cores that you have to run as fast as possible. n_estimators says how many trees, 01:07:07.420 |
how many models, to train. max_samples says how many randomly chosen rows 01:07:15.100 |
to use in each one. max_features is how many randomly chosen columns to use for each binary 01:07:21.860 |
split point. min_samples_leaf is the stopping criterion, which we'll come back to. So here's 01:07:29.960 |
a little function that will create a random forest regressor and fit it to some set 01:07:35.580 |
of independent variables and a dependent variable. So we can give it a few default values and 01:07:43.460 |
create a random forest and train and our validation set RMSE is 0.23. If we compare that to what 01:07:55.500 |
we had before, we had 0.32. So dramatically better by using a random forest. 01:08:13.140 |
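That little function might look roughly like this (a sketch; the particular default values here are just reasonable starting points, not anything official):

```python
from sklearn.ensemble import RandomForestRegressor

def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True, **kwargs).fit(xs, y)

m = rf(xs, y)
m_rmse(m, valid_xs, valid_y)
```

(oob_score=True is included here so the out-of-bag trick discussed later is available.)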
So what's happened when we called random forest regressor is it's just using that decision 01:08:22.020 |
tree builder that we've already seen, but it's building multiple versions with these 01:08:26.480 |
different random subsets and for each binary split it does, it's also randomly selecting 01:08:32.260 |
a subset of columns. And then when we create a prediction, it is averaging the predictions 01:08:38.880 |
of each of the trees. And as you can see it's giving a really great result. And one of the 01:08:45.260 |
amazing things we'll find is that it's going to be hard for us to improve this very much, 01:08:50.540 |
you know, the kind of the default starting point tends to turn out to be pretty great. 01:08:59.500 |
The sklearn docs have lots of good information in them. One of them has this nice picture 01:09:03.980 |
that shows, as you increase the number of estimators, how the error rate 01:09:11.620 |
improves for different max features levels. And in general, the more trees you add, the 01:09:21.100 |
more accurate your model. It's not going to overfit, right, because it's averaging more 01:09:26.060 |
of these, these weak models, more of these models that are trained on subsets of the 01:09:34.020 |
data. So train as many, use as many estimators as you like, really just a case of how much 01:09:40.420 |
time do you have and whether you kind of reach a point where it's not really improving anymore. 01:09:45.980 |
You can actually get at the underlying decision trees in a model, in a random forest model 01:09:50.620 |
using estimators_. So with a list comprehension, we can call predict on each individual tree. 01:09:57.900 |
And so here's an array, a numpy array containing the predictions from each individual tree 01:10:03.760 |
for each row in our data. So if we take the mean across the zero axis, we'll get exactly 01:10:15.100 |
the same number. Because remember, that's what a random forest does, is it takes the 01:10:21.380 |
mean of the trees, predictions. So one cool thing we could do is we could look at the 01:10:31.340 |
40 estimators we have and grab the predictions for the first i of those trees and take their 01:10:42.020 |
mean and then we can find the root mean squared error. And so in other words, here is the accuracy 01:10:50.220 |
when you've just got one tree, two trees, three trees, four trees, five trees, etc. 01:10:56.100 |
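A sketch of that per-tree inspection, assuming m, valid_xs, valid_y and r_mse from earlier:

```python
import numpy as np
import matplotlib.pyplot as plt

# one row of predictions per tree in the forest
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])

r_mse(preds.mean(0), valid_y)   # averaging all trees gives the forest's own prediction

# RMSE using only the first 1, 2, 3, ... trees
plt.plot([r_mse(preds[:i+1].mean(0), valid_y)
          for i in range(len(m.estimators_))])
```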
And you can see, so it's kind of nice, right? You can, you can actually create your own 01:11:01.220 |
kind of build your own tools to look inside these things and see what's going on. And 01:11:06.540 |
so we can see here that as you add more and more trees, the accuracy did indeed keep improving 01:11:11.980 |
or the root mean squared error kept improving, although the improvements slowed down after 01:11:18.060 |
a while. The validation set is worse than the training set and there's a couple of reasons 01:11:28.640 |
that could have happened. The first reason could be because we're still overfitting, 01:11:34.660 |
which is not necessarily a problem, it's just something we could identify. Or maybe it's 01:11:39.020 |
because the, the fact that we're trying to predict the last two weeks is actually a problem 01:11:44.700 |
and that the last two weeks are kind of different to the other auctions in our dataset, maybe 01:11:50.300 |
something changed over time. So how do we tell which of those two reasons there are? 01:11:56.740 |
What is the reason that our validation set is worse? We can actually find out using a 01:12:01.620 |
very clever trick called out of bag error, OOB error. And we use OOB error for lots of 01:12:06.900 |
things. You can grab the OOB error, or you can grab the OOB predictions from the model 01:12:16.340 |
with its oob_prediction_ attribute, take the RMSE of those, and you find that the OOB RMSE is 01:12:23.860 |
0.21, which is quite a bit better than 0.23. 01:12:35.420 |
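In code that's roughly just the following, assuming the forest was created with oob_score=True:

```python
r_mse(m.oob_prediction_, y)   # OOB predictions compared against the training targets
```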
So let me explain what OOB error is: we look at each 01:12:45.180 |
row of the training set, not the validation set, and we say, for row number one, which trees included 01:12:53.220 |
row number one in the training? And we'll say, okay, let's not use those for calculating 01:12:58.700 |
the error because it was part of those trees training. So we'll just calculate the error 01:13:04.100 |
for that row using the trees where that row was not included in training that tree. Because 01:13:10.860 |
remember every tree is using only a subset of the data. So we do that for every row. 01:13:15.860 |
We find the prediction using only the trees that were not used, that that row was not 01:13:24.100 |
used. And those are the OOB predictions. In other words, this is like giving us a validation 01:13:31.580 |
set result without actually needing a validation set. But the thing is, it doesn't have that time 01:13:39.660 |
offset, it's not looking at the last two weeks, it's looking at the whole training set. But 01:13:43.580 |
this basically tells us how much of the error is due to overfitting versus due to being 01:13:50.620 |
the last couple of weeks. So that's a cool trick. OOB error is something that very quickly 01:13:55.700 |
kind of gives us a sense of how much we're overfitting. And we don't even need 01:14:00.100 |
a validation set to do it. So there's that OOB error. So that's telling us a bit about 01:14:06.500 |
what's going on in our model. But then there's a lot of things we'd like to find out from 01:14:12.320 |
our model. And I've got five things in particular here which I generally find pretty interesting. 01:14:18.580 |
Which is, how confident are we about our predictions for some particular prediction we're making? 01:14:26.460 |
Like we can say this is what we think the prediction is, but how confident are we? Is 01:14:31.740 |
that exactly that or is it just about that or we really have no idea? And then for predict, 01:14:37.900 |
for predicting a particular item, which factors were the most important in that prediction 01:14:44.860 |
and how did they influence it? Overall, which columns are making the biggest difference 01:14:50.500 |
in our predictions? Which ones could we maybe throw away and it wouldn't matter? Which columns are 01:14:56.420 |
basically redundant with each other? So we don't really need both of them. And as we 01:15:03.580 |
vary some column, how does it change the prediction? So those are the five things that we're, that 01:15:09.500 |
I'm interested in figuring out, and we can do all of those things with a random forest. 01:15:15.340 |
Let's start with the first one. So the first one, we've already seen that we can grab all 01:15:23.060 |
of the predictions for all of the trees and take their mean to get the actual predictions 01:15:31.340 |
of the model and then to get the RMSE. But what if instead of saying mean, we did exactly 01:15:36.060 |
the same thing like so, but instead said standard deviation. This is going to tell us for every 01:15:46.740 |
row in our dataset, how much did the trees vary? And so if our model really had never 01:15:56.380 |
seen kind of data like this before, it was something where, you know, different trees 01:16:02.020 |
were giving very different predictions. It might give us a sense that maybe this is something 01:16:07.900 |
that we're not at all confident about. And as you can see, when we look at the standard 01:16:12.060 |
deviation of the trees for each prediction, let's just look at the first five. They vary 01:16:17.620 |
a lot, right: 0.2, 0.1, 0.09, 0.3, okay? So this is really interesting. It's not something 01:16:30.820 |
that a lot of people talk about, but I think it's a really interesting approach to kind 01:16:33.940 |
of figuring out whether we might want to be cautious about a particular prediction, because 01:16:40.260 |
maybe we're not very confident about it. And it's something we can easily do with a random forest. 01:16:46.540 |
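A sketch of that confidence measure, reusing the stacked per-tree predictions from before:

```python
# standard deviation across trees, one value per row: a high value means the trees disagree
preds_std = preds.std(0)
preds_std[:5]
```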
The next thing, and this is I think the most important thing for me in 01:16:50.900 |
terms of interpretation, is feature importance. Here's what feature importance looks like. 01:16:57.420 |
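A minimal sketch of a helper along those lines (the name rf_feat_importance is an assumption):

```python
import pandas as pd

def rf_feat_importance(m, df):
    # one row per column, sorted so the most important features come first
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                       ).sort_values('imp', ascending=False)

fi = rf_feat_importance(m, xs)
fi[:10]
```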
We can call feature importance on a model with some independent variables. Let's say 01:17:01.860 |
grab the first 10. This says these are the 10 most important features in this random 01:17:09.500 |
forest. These are the things that are the most strongly driving sale price or we could 01:17:15.020 |
plot them. And so you can see here, there's just a few things that are by far the most 01:17:22.940 |
important. What year the equipment was made, the bulldozer or whatever. How big is it? Coupler 01:17:31.260 |
system, whatever that means, and the product class, whatever that means. And so you can 01:17:40.660 |
get this by simply looking inside your trained model and grabbing the feature_importances_ 01:17:46.260 |
attribute. And so here, to make it print out more nicely, I'm just sticking that into 01:17:50.660 |
a data frame and sorting descending by importance. So how is this actually being done? It's actually 01:18:00.700 |
really neat. What Scikit-learn does, and what Breiman, the inventor of random forests, described, is 01:18:07.740 |
that you can go through each tree and then start at the top of the tree and look at each 01:18:12.340 |
branch and at each branch see what feature was used for the split, that is, which column the 01:18:19.100 |
binary split was based on. And then how much better was the model after that 01:18:24.700 |
split compared to beforehand. And we basically then say, okay, that column was responsible 01:18:31.060 |
for that amount of improvement. And so you add that up across all of the splits, across 01:18:36.900 |
all of the trees for each column, and then you normalize it so they all add to one. And 01:18:43.700 |
that's what gives you these numbers, which we show the first few of them in this table 01:18:49.180 |
and the first 30 of them here in this chart. So this is something that's fast and it's 01:18:55.900 |
easy and it kind of gives us a good sense of like, well, maybe the stuff that are less 01:19:01.020 |
than 0.005 we could remove. So if we did that, that would leave us with only 21 columns. 01:19:12.940 |
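Keeping only the columns above that threshold might look like this (a sketch, reusing the fi data frame and rf helper sketched above):

```python
to_keep = fi[fi.imp > 0.005].cols
xs_imp = xs[to_keep]
valid_xs_imp = valid_xs[to_keep]

m = rf(xs_imp, y)
m_rmse(m, xs_imp, y), m_rmse(m, valid_xs_imp, valid_y)
```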
So let's try that. Let's just say, okay, our important x's are the x's which 01:19:19.340 |
are in this list of ones to keep, do the same for the validation set, retrain our random forest, and 01:19:27.340 |
have a look at the result. And basically our accuracy is about the same, but we've gone 01:19:34.620 |
down from 78 columns to 21 columns. So I think this is really important. It's not just about 01:19:42.260 |
creating the most accurate model you can, but you want to kind of be able to fit it 01:19:45.460 |
in your head as best as possible. And so 21 columns is going to be much easier for us 01:19:50.020 |
to check for any data issues and understand what's going on. And the accuracy is about 01:19:55.300 |
the same, or the RMSE. So I would say, okay, let's do that. Let's just stick with x's important 01:20:03.980 |
from now on. And so here's this entire set of the 21 features. And you can see it looks 01:20:11.920 |
now like year made and product size are the two really important things. And then there's 01:20:17.500 |
a cluster of kind of mainly product related things that are kind of at the next level 01:20:21.860 |
of importance. One of the tricky things here is that we've got a product class description, 01:20:33.500 |
model ID, secondary description, model description, base model, model descriptor. So they all look 01:20:38.740 |
like there might be similar ways of saying the same thing. So one thing that can help 01:20:43.360 |
us to interpret the feature importance better and understand better what's happening in 01:20:47.500 |
the model is to remove redundant features. So one way to do that is to call fast.ai's 01:20:59.020 |
cluster columns, which is basically a thin wrapper for stuff that scikit-learn already 01:21:02.980 |
provides. And what that's going to do is it's going to find pairs of columns, which are 01:21:09.420 |
very similar. So you can see here sale year and sale elapsed. See how this line is way 01:21:14.540 |
out to the right, whereas machine ID and model ID is not at all; it's way out to the left. 01:21:19.700 |
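Under the hood, that kind of column clustering is roughly the following (a sketch of the idea, not fast.ai's exact code):

```python
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def cluster_columns_sketch(df):
    corr = spearmanr(df).correlation            # rank correlation between every pair of columns
    dist = squareform(1 - corr, checks=False)   # turn similarity into a condensed distance matrix
    z = hierarchy.linkage(dist, method='average')
    hierarchy.dendrogram(z, labels=df.columns, orientation='left')
    plt.show()

cluster_columns_sketch(xs_imp)
```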
So that means that sale year and sale elapsed are very, very similar. When one is low, the 01:21:26.140 |
other tends to be low and vice versa. Here's a group of three, which all seem to be much 01:21:31.540 |
the same, and then product group desc and product group, and then fiBaseModel 01:21:36.620 |
and fiModelDesc. But these all seem like things where maybe we could remove one of 01:21:42.860 |
each of these pairs because they're basically seem to be much the same, you know, they're 01:21:48.900 |
when one is high, the other is high and vice versa. So let's try removing one of each of 01:22:01.980 |
these. Now it takes a little while to train a random forest. And so for the, just to see 01:22:09.580 |
whether removing something makes it much worse, we could just do a very fast version. So we 01:22:16.460 |
could just train something where we only have 50,000 rows per tree, train for each tree, 01:22:24.980 |
and we'll just use 40 trees. And let's then just get the OOB score, and so for that fast 01:22:37.420 |
simple version, our basic OOB with our important x's is 0.877. And here for OOB, a higher number 01:22:48.500 |
is better. So then let's try going through each of the things we thought we might not 01:22:53.060 |
need and try dropping them and then getting the OOB error for our x's with that one column 01:23:01.580 |
removed. And so compared to 877, most of them don't seem to hurt very much. Sale elapsed hurt 01:23:11.220 |
it quite a bit, right? So for each of those groups, let's go and see which one of the 01:23:18.420 |
ones seems like we could remove it. So here's the five I found. Let's remove the whole lot 01:23:25.980 |
and see what happens. And so the OOB went from 877 to 874, though hardly any difference 01:23:33.820 |
at all, despite the fact we managed to get rid of five of our variables. So let's create 01:23:42.180 |
something called x's final, which is the x's important and then dropping those five, save 01:23:50.300 |
them for later. We can always load them back again. And then let's check our random forest 01:23:56.700 |
using those and again 0.233 or 0.234. So we've got about the same thing, but we've got even 01:24:05.460 |
less columns now. So we're getting a kind of a simpler and simpler model without hurting 01:24:10.780 |
our accuracy. It's great. So the next thing we said we were interested in learning about 01:24:17.900 |
is for the columns that are, particularly the columns that are most important, how does, 01:24:24.260 |
what's the relationship between that column and the dependent variable? So for example, 01:24:28.700 |
what's the relationship between product size and sale price? So the first thing I would 01:24:33.900 |
do would be just to look at a histogram. So one way to do that is with value counts in 01:24:41.420 |
pandas. And we can see here our different levels of product size. And one thing to note here 01:24:52.780 |
is that missing is actually the most common. And then next most is compact and small. And 01:25:00.180 |
then mini is pretty tiny. So we can do the same thing for year made. Now for year made 01:25:07.420 |
we can't just use this kind of basic value counts bar chart. 01:25:16.140 |
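Those two quick looks might be produced like this (a sketch; the data frame and column names are assumptions, and the histogram call is discussed next):

```python
df['ProductSize'].value_counts(dropna=False).plot.barh()  # bar chart of the category levels
df['YearMade'].hist()                                     # histogram for the continuous column
```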
For year made we actually need a histogram, which pandas has stuff like this built in 01:25:21.460 |
so we can just call histogram. And that 1950, you remember we created it, that's kind of 01:25:27.020 |
this missing value thing that used to be a thousand. But most of them seem to have been 01:25:32.500 |
well into the 1990s and 2000s. So let's now look at something called a partial dependence 01:25:38.780 |
plot. I'll show it to you first. Here is a partial dependence plot of year made against 01:25:52.460 |
partial dependence. What does this mean? Well we should focus on the part where we actually 01:25:59.100 |
have a reasonable amount of data. So at least well into the 80's, go around here. And so 01:26:05.900 |
let's look at this bit here. Basically what this says is that as year made increases, 01:26:14.220 |
the predicted sale price, log sale price of course also increases. You can see. And the 01:26:22.660 |
log sale price is increasing roughly linearly, and since it's the log, this is actually 01:26:28.780 |
an exponential relationship between year made and sale price. Why do we call it a partial 01:26:36.900 |
dependence? Are we just plotting the kind of the year against the average sale price? 01:26:41.700 |
Well no we're not. We can't do that because a lot of other things change from year to 01:26:47.540 |
year. Example, maybe more recently people tend to buy bigger bulldozers or more bulldozers 01:26:57.100 |
with air conditioning or more expensive models of bulldozers. And we really want to be able 01:27:03.700 |
to say like no just what's the impact of year and nothing else. And if you think about it 01:27:08.820 |
from a kind of an inflation point of view, you would expect that older bulldozers would 01:27:18.100 |
be kind of, that bulldozers would get kind of a constant ratio cheaper the further you 01:27:27.220 |
go back, which is what we see. So what we really want to say is all other things being equal, 01:27:33.980 |
what happens if only the year changes? And there's a really cool way we can answer that 01:27:39.820 |
question with a random forest. So how does year made impact sale price? All other things 01:27:46.020 |
being equal. So what we can do is we can go into our actual data set and replace every 01:27:52.460 |
single value in the year made column with 1950 and then calculate the predicted sale 01:27:58.620 |
price for every single auction and then take the average over all the auctions. And that's 01:28:03.820 |
what gives us this value here. And then we can do the same from 1951, 1952 and so forth 01:28:10.900 |
until eventually we get to our final year of 2011. So this isolates the effect of only 01:28:20.020 |
year made. So it's a kind of a bit of a curious thing to do, but it's actually, it's a pretty 01:28:28.580 |
neat trick for trying to kind of pull apart and create this partial dependence to say 01:28:34.920 |
what might be the impact of just changing year made. And we can do the same thing for 01:28:42.060 |
product size. And one of the interesting things if we do it for product size is we see that 01:28:46.540 |
the lowest value of predicted sale price log sale price is NA, which is a bit of a worry 01:28:58.700 |
because we kind of want to know well that means it's really important the question of 01:29:02.260 |
whether or not the product size is labeled is really important. And that is something 01:29:08.180 |
that I would want to dig into before I actually use this model to find out well why is it 01:29:12.700 |
that sometimes things aren't labeled and what it means, you know, why it is that that's 01:29:16.620 |
actually such an important predictor. So that is the partial dependence plot and 01:29:23.580 |
it's a really clever trick. So we have looked at four of the five questions we said we wanted 01:29:34.060 |
to answer at the start of this section. So the last one that we want to answer is one 01:29:41.780 |
here. We're predicting with a particular row of data what were the most important factors 01:29:46.980 |
and how did they influence that prediction. This is quite related to the very first thing 01:29:51.460 |
we saw. So it's like imagine you were using this auction price model in real life. You 01:29:57.300 |
had something on your tablet and you went into some auction and you looked up what the 01:30:02.320 |
predicted auction price would be for this lot that's coming up to find out whether it 01:30:09.940 |
seems like it's being under or overvalued and then you can decide what to do about that. 01:30:15.720 |
So one thing we said we'd be interested to know is like well are we actually confident 01:30:20.020 |
in our prediction and then we might be curious to find out like oh I'm really surprised it 01:30:25.180 |
was predicting such a high value. Why was it predicting such a high value? So to find 01:30:32.060 |
the answer to that question, we can use a module called TreeInterpreter. And TreeInterpreter, 01:30:41.260 |
the way it works is that you pass in a single row. So it's like here's the auction that's 01:30:47.620 |
coming up, here's the model, here's the auctioneer ID, etcetera, etcetera. Please predict the 01:30:55.220 |
value from the random forest, what's the expected sale price and then what we can do is we can 01:31:02.700 |
take that one row of data and put it through the first decision tree and we can see what's 01:31:07.700 |
the first split that's selected and then based on that split does it end up increasing or 01:31:13.340 |
decreasing the predicted price compared to that kind of raw baseline model of just take 01:31:19.340 |
the average and then you can do that again at the next split and again at the next split 01:31:23.020 |
and again at the next split. So for each split, we see what the increase or decrease in the 01:31:28.940 |
prediction 01:31:37.420 |
is, compared to the parent node. And so then you can do that for every 01:31:48.700 |
tree and then add up the total change in importance by split variable and that allows you to draw 01:31:56.660 |
something like this. So here's something that's looking at one particular row of data and 01:32:03.860 |
overall we start at zero and so zero is the initial 10.1. Remember this number 10.1 is 01:32:14.860 |
the average log sale price of the whole data set. They call it the bias. And so we call 01:32:22.300 |
that zero then for this particular row we're looking at year made as a negative 4.2 impact 01:32:31.180 |
on the prediction and then product size has a positive 0.2, coupler system has a positive 01:32:38.300 |
0.046, model ID has a positive 0.127 and so forth, right. And so the red ones are negative 01:32:47.480 |
and the green ones are positive and you can see how they all join up until eventually 01:32:51.580 |
overall the prediction is that it's going to be negative 0.122 compared to 10.1 which 01:33:01.140 |
is equal to 9.98. So this kind of plot is called a waterfall plot and so basically when 01:33:12.240 |
we say tree interpreter dot predict it gives us back the prediction which is the actual 01:33:20.780 |
number we get back from the random forest, the bias which is just always this 10.1 for 01:33:25.900 |
this data set and then the contributions which is all of these different values. It's how 01:33:33.460 |
important was each factor and here I've used a threshold which means anything that was 01:33:42.140 |
less than 0.08 all gets thrown into this other category. I think this is a really useful 01:33:48.940 |
kind of thing to have in production because it can help you answer questions whether it 01:33:54.620 |
will be for the customer or for you know whoever's using your model if they're surprised about 01:33:59.500 |
some prediction, why that was the prediction. So I'm going to show you something really interesting 01:34:10.540 |
using some synthetic data and I want you to really have a think about why this is happening 01:34:16.640 |
before I tell you, and to pause the video, if you're watching the video, when I get to that 01:34:21.660 |
point. Let's start by creating some synthetic data like so. So we're going to grab 40 values 01:34:29.460 |
evenly spaced between 0 and 20 and then we're just going to create the y=x line and add 01:34:37.740 |
some normally distributed random data on that. Here's this kind of plot. So here's some data 01:34:45.940 |
we want to try and predict and we're going to use a random forest in a kind of bit of 01:34:50.140 |
an overkill here. Now in this case we only have one independent variable. Scikit-learn 01:35:00.180 |
expects us to have more than one. So we can use unsqueeze in PyTorch to go from 01:35:10.620 |
a shape of 40, in other words a vector with 40 elements, to a shape of 40 comma 1, in other 01:35:16.060 |
words a matrix of 40 rows with one column. So this unsqueeze 1 means add a unit axis 01:35:23.500 |
here. I don't use unsqueeze very often because I actually generally prefer the index with 01:35:30.260 |
a special value none. This works in PyTorch and numpy and the way it works is to say okay 01:35:37.220 |
x_lin, which remember is a vector of length 40, take every row, and then None means insert a 01:35:46.180 |
unit axis here for the column. So these are two ways of doing the same thing but this 01:35:51.500 |
one is a little bit more flexible so that's what I use more often. But now that we've 01:35:55.540 |
got the shape that is expected which is a rank 2 tensor and an array with two dimensions 01:36:02.820 |
or axes we can create a random forest we can fit it and let's just use the first 30 data 01:36:08.860 |
points right so kind of stop here. And then let's do a prediction right so let's plot 01:36:16.580 |
the original data points and then also plot a prediction and look what happens on the 01:36:21.100 |
prediction: it's kind of nice and accurate, and then suddenly, what happens? So this is 01:36:27.820 |
the bit where, if you're watching the video, I want you to pause and have a think about why it 01:36:30.980 |
is flat. So what's going on here? Well remember a random forest is just taking the average 01:36:39.380 |
of predictions of a bunch of trees and a tree the prediction of a tree is just the average 01:36:46.220 |
of the values in a leaf node and remember we fitted using a training set containing 01:36:51.980 |
only the first 30. So none of these appeared in the training set so the highest we could 01:36:59.060 |
get would be the average of values that are inside the training set. In other words there's 01:37:04.700 |
this maximum you can get to. So random forests cannot extrapolate outside of the bounds of 01:37:12.980 |
the data that they're training set. This is going to be a huge problem for things like 01:37:16.880 |
time series prediction where there's like an underlying trend for instance. But really 01:37:24.300 |
it's a more general issue than just time variables. It's going to be hard, or often impossible, 01:37:29.660 |
for random forests to extrapolate outside the types of data that they've seen, in 01:37:34.620 |
a general sense. So we need to make sure that our validation set does not contain out of 01:37:41.340 |
domain data. So how do we find out of domain data? So we might not even know our test set 01:37:50.900 |
is distributed in the same way as our training data. So if they're from two different time 01:37:54.760 |
periods how do you kind of tell how they vary, right? Or if it's a Kaggle competition how 01:38:00.980 |
do you tell if the test set and the training set which Kaggle gives you have some underlying 01:38:07.180 |
differences? There's actually a cool trick you can do which is you can create a column 01:38:13.020 |
called is_valid which contains 0 for everything in the training set and 1 for everything in 01:38:21.260 |
the validation set. And it's concatenating all of the independent variables together. 01:38:27.580 |
So it's concatenating the independent variables for both the training and validation set together. 01:38:32.700 |
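A sketch of that trick, assuming the xs_final/valid_xs_final names and the rf and rf_feat_importance helpers from earlier:

```python
import numpy as np
import pandas as pd

df_dom = pd.concat([xs_final, valid_xs_final])
is_valid = np.array([0]*len(xs_final) + [1]*len(valid_xs_final))

m = rf(df_dom, is_valid)           # predict "is this row from the validation set?"
rf_feat_importance(m, df_dom)[:6]  # which columns give the game away
```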
So this is our independent variable and this becomes our dependent variable. And we're 01:38:38.740 |
going to create a random forest not for predicting price but a random forest that predicts is 01:38:45.740 |
this row from the validation set or the training set. So if the validation set and the training 01:38:51.980 |
set are from kind of the same distribution if they're not different then this random 01:38:57.060 |
forest should basically have zero predictive power. If it has any predictive power then 01:39:04.460 |
it means that our training and validation set are different. And to find out the source 01:39:09.100 |
of that difference we can use feature importance. And so you can see here that the difference 01:39:17.840 |
between the validation set and the training set is not surprisingly sale elapsed. So that's 01:39:26.940 |
the number of days since I think like 1970 or something. So it's basically the date. 01:39:32.180 |
So yes of course you can predict whether something is in the validation set or the training set 01:39:37.300 |
by looking at the date because that's actually how you find them. That makes sense. This is 01:39:41.900 |
interesting sales ID. So it looks like the sales ID is not some random identifier but 01:39:46.940 |
it increases over time. And ditto for machine ID. And then there's some other smaller ones 01:39:54.580 |
here that kind of makes sense. So I guess for something like model desk I guess there 01:39:59.500 |
are certain models that were only made in later years for instance. But you can see these 01:40:06.860 |
top three columns are a bit of an issue. So then we could say like okay what happens if 01:40:14.320 |
we look at each one of those columns those first three and remove them and then see how 01:40:22.220 |
it changes our RMSE on our sales price model on the validation set. So we start from point 01:40:35.460 |
232 and removing sales ID actually makes it a bit better. Sale elapsed makes it a bit 01:40:43.180 |
worse, machine ID about the same. So we can probably remove sales ID and machine ID without 01:40:49.180 |
losing any accuracy and yep it's actually slightly improved. But most importantly it's 01:40:54.600 |
going to be more resilient over time right because we're trying to remove the time related 01:41:00.380 |
features. Another thing to note is that, since it seems that this kind of sale elapsed 01:41:09.280 |
issue is maybe making a big difference, it's worth looking at the sale year distribution; 01:41:16.420 |
this is the histogram. Most of the sales are in the last few years anyway. So what happens 01:41:21.380 |
if we only include the most recent few years. So let's just include everything after 2004. 01:41:29.900 |
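That subset might be created like so (a sketch; the saleYear column name and the xs_final/y names are assumptions):

```python
filt = xs_final['saleYear'] > 2004
xs_filt, y_filt = xs_final[filt], y[filt]

m = rf(xs_filt, y_filt)
m_rmse(m, valid_xs_final, valid_y)
```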
So that is xs_filt. And if I train on that subset then my accuracy improves 01:41:38.060 |
a bit more, from 331 to 330. So that's interesting, right? We're actually using less data, fewer 01:41:46.260 |
rows and getting a slightly better result because the more recent data is more representative. 01:41:53.980 |
So that's about as far as we can get with our random forest. But what I will say is 01:42:00.180 |
this. This issue of extrapolation would not happen with a neural net would it because 01:42:08.780 |
the underlying layers of a neural net are linear layers. And so linear layers 01:42:13.860 |
can absolutely extrapolate. So the obvious thing to think then at this point is well 01:42:19.340 |
maybe a neural net would do a better job of this. That's going to be the next thing we try, 01:42:25.540 |
but let's take this question first. How does feature importance relate to correlation? 01:42:37.020 |
Feature importance doesn't particularly relate to correlation. Correlation is a concept for 01:42:42.700 |
linear models and this is not a linear model. So remember feature importance is calculated 01:42:47.740 |
by looking at the improvement in accuracy as you go down each tree and you go down each 01:42:56.660 |
binary split. If you're used to linear regression then I guess correlation sometimes can be 01:43:05.620 |
used as a measure of feature importance. But this is a much more kind of direct version 01:43:13.660 |
that's taking account of these non-linearities and interactions of stuff as well. So it's 01:43:19.380 |
a much more flexible and reliable measure generally feature importance. Any more questions? 01:43:30.260 |
So I'll do the same thing with a neural network. I'm going to just copy and paste the same 01:43:34.620 |
lines of code that I had from before, but this time I'll call it df_nn, and these are the 01:43:40.660 |
same lines of code. And I'll grab the same list of columns we had before, plus the dependent 01:43:45.140 |
variable to get the same data frame. Now as we've discussed for categorical columns we 01:43:52.460 |
probably want to use embeddings. So to create embeddings we need to know which columns should 01:43:57.740 |
be treated as categorical variables. And as we've discussed we can use cont_cat_split 01:44:01.940 |
for that. One of the useful things we can pass it is the maximum cardinality. So max_card 01:44:09.380 |
equals 9000 means if there's a column with more than 9000 levels you should treat it 01:44:14.660 |
as continuous, and if it's got less than 9000 levels it's categorical. So that's 01:44:20.660 |
you know it's a simple little function that just checks the cardinality and splits them 01:44:25.420 |
based on how many discrete levels they have. And of course the data type if it's not actually 01:44:31.460 |
numeric data type it has to be categorical. So there's our split. And then 01:44:42.020 |
from there what we can do is we can say, oh, we've got to be a bit careful of saleElapsed, 01:44:49.420 |
because actually saleElapsed I think has less than 9000 categories, but we definitely 01:44:53.860 |
don't want to use that as a categorical variable. The whole point was to make it something 01:44:57.940 |
that we can extrapolate on. Certainly anything that's kind of time dependent, 01:45:03.020 |
or where we think we might see things outside the range of inputs in the training data, we 01:45:09.380 |
should make a continuous variable. So let's take saleElapsed, put it in the continuous list for the 01:45:14.520 |
neural net and remove it from categorical. So here's the number of unique levels this 01:45:22.820 |
is from pandas for everything in our neural net data set for the categorical variables. 01:45:28.460 |
And I get a bit nervous when I see these really high numbers so I don't want to have too many 01:45:32.740 |
things with like lots and lots of categories. The reason I don't want lots of things with 01:45:40.220 |
lots and lots of categories is just they're going to take up a lot of parameters because 01:45:44.020 |
in an embedding matrix: you know, every one of these is a row in an embedding matrix. 01:45:48.580 |
In this case I notice model ID and model desk might be describing something very similar. 01:45:54.380 |
So I'd quite like to find out if I could get rid of one and an easy way to do that would 01:45:58.680 |
be to use a random forest. So let's try removing the model desc column and let's create a random forest 01:46:10.540 |
and let's see what happens and oh it's actually a tiny bit better and certainly not worse. 01:46:16.460 |
So that suggests that we can actually get rid of one of these levels or one of these 01:46:20.740 |
variables. So let's get rid of that one and so now we can create a tabular pandas object 01:46:26.900 |
just like before. But this time we're going to add one more processor which is normalize. 01:46:34.540 |
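Putting the whole neural net setup together looks roughly like this; each piece (Normalize, the batch size, the y range, the layer sizes) is explained in the next few paragraphs, and names like df_nn_final, cat_nn, cont_nn and splits are assumed from earlier steps:

```python
import torch.nn.functional as F
from fastai.tabular.all import *

procs_nn = [Categorify, FillMissing, Normalize]
to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn,
                      splits=splits, y_names='SalePrice')

dls = to_nn.dataloaders(1024)          # a big batch size is fine for tabular data
y = to_nn.train.y
learn = tabular_learner(dls, y_range=(y.min(), y.max()), layers=[500, 250],
                        n_out=1, loss_func=F.mse_loss)
learn.lr_find()
learn.fit_one_cycle(5, 1e-2)
```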
And the reason we need normalize, so normalize is subtract the mean divide by the standard 01:46:39.180 |
deviation. We didn't need that for a random forest because for a random forest we're just 01:46:44.660 |
looking at less than or greater than through our binary splits. So all that matters is 01:46:49.980 |
the order of things, how they're sorted, it doesn't matter whether they're super big or 01:46:53.700 |
super small. But it definitely matters for neural nets because we have these linear layers. 01:47:01.460 |
So we don't want to have you know things with kind of crazy distributions with some super 01:47:06.220 |
big numbers and super small numbers because it's not going to work. So it's always a good 01:47:10.680 |
idea to normalize things in neural nets so we can do that in a tabular neural net by 01:47:17.580 |
using the normalize tabular proc. So we can do the same thing that we did before with 01:47:23.460 |
creating our tabular pandas tabular object for the neural net. And then we can create 01:47:29.900 |
data loaders from that with a batch size. And this is a large batch size because tabular 01:47:35.540 |
models don't generally require nearly as much GPU RAM as a convolutional neural net or something 01:47:44.000 |
or an RNN or something. Since it's a regression model we're going to want a y range. So let's 01:47:52.140 |
find the minimum and maximum of our dependent variable. And we can now go ahead and create 01:47:59.140 |
a tabular learner. Our tabular learner is going to take our data loaders, our y range, 01:48:06.140 |
how many activations do you want in each of the linear layers. And so you can have as 01:48:12.140 |
many linear layers as you like here. How many outputs are there? So this is a regression 01:48:18.380 |
with a single output. And what loss function do you want? We can use lr_find and then we 01:48:27.420 |
can go ahead and use fit_one_cycle. There's no pre-trained model obviously, because this is 01:48:32.500 |
not something where people have got pre-trained models for industrial equipment auctions. So 01:48:39.140 |
we just use fit_one_cycle and train for a minute. And then we can check. And our RMSE is 0.226 01:48:52.500 |
which here was 0.230. So that's amazing. We actually have, you know, straight away a better 01:48:58.620 |
result than the random forest. It's a little more fussy, it takes a little bit longer. But 01:49:05.580 |
as you can see, you know, for interesting datasets like this, we can get some great 01:49:10.940 |
results with neural nets. So here's something else we could do though. The random forest 01:49:23.380 |
and the neural net, they each have their own pros and cons. There's some things they're 01:49:28.020 |
good at and there's some they're less good at. So maybe we can get the best of both worlds. 01:49:34.620 |
And a really easy way to do that is to use Ensemble. We've already seen that a random 01:49:39.420 |
forest is a decision tree ensemble. But now we can put that into another ensemble. We 01:49:43.740 |
can have an ensemble of the random forest and a neural net. There's lots of super fancy 01:49:49.180 |
ways you can do that. But a really simple way is to take the average. So sum up the 01:49:55.300 |
predictions from the two models, divide by two, and use that as prediction. So that's 01:50:01.620 |
our ensemble prediction is just literally the average of the random forest prediction 01:50:05.540 |
and the neural net prediction. And that gives us 0.223 versus 0.226. So how good is that? 01:50:18.900 |
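Before answering that, here's roughly what that averaging looks like in code (a sketch; learn, m and the valid_* names are assumed from earlier, and it assumes the two validation sets contain the same rows in the same order):

```python
from fastai.tabular.all import to_np

rf_preds = m.predict(valid_xs)        # random forest predictions
nn_preds, _ = learn.get_preds()       # neural net predictions on its validation set
ens_preds = (to_np(nn_preds.squeeze()) + rf_preds) / 2

r_mse(ens_preds, valid_y)
```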
Well it's a little hard to say because unfortunately this competition is old enough that we can't 01:50:25.540 |
even submit to it and find out how we would have gone on Kaggle. So we don't really know 01:50:30.980 |
and so we're relying on our own validation set. But it's quite a bit better than even 01:50:36.260 |
the first place score on the test set. So if the validation set is you know doing good 01:50:45.380 |
job then this is a good sign that this is a really really good model. Which wouldn't 01:50:51.060 |
necessarily be that surprising because you know in the last few years I guess we've learned 01:50:58.620 |
a lot about building these kinds of models. And we're kind of taking advantage of a lot 01:51:03.940 |
of the tricks that have appeared in recent years. And yeah maybe this goes to show that 01:51:11.660 |
well I think it certainly goes to show that both random forests and neural nets have a 01:51:17.300 |
lot to offer. So try both, and maybe even combine both. We've talked about an approach 01:51:29.540 |
to ensembling called bagging which is where we train lots of models on different subsets 01:51:35.680 |
of the data and take the average of them. Another approach to ensembling, particularly ensembling of trees, 01:51:42.500 |
is called boosting. And boosting involves training a small model which underfits your 01:51:50.580 |
data set. So maybe like just have a very small number of leaf nodes. And then you calculate 01:51:57.200 |
the predictions using the small model. And then you subtract the predictions from the 01:52:02.500 |
targets. So these are kind of like the errors of your small underfit model. We call them 01:52:07.580 |
residuals. And then go back to step one, but now instead of using the original targets 01:52:15.440 |
use the residuals. Then train a small model which underfits your data set, attempting to 01:52:21.020 |
predict the residuals. Then do that again and again until you reach some stopping criterion 01:52:28.900 |
such as the maximum number of trees. Now that will leave you with a bunch of models 01:52:35.620 |
which you don't average but which you sum. Because each one is creating a model that's 01:52:42.500 |
based on the residual of the previous one. But we've subtracted the predictions of each 01:52:47.660 |
new tree from the residuals of the previous tree. So the residuals get smaller and smaller. 01:52:53.260 |
And then to make predictions we just have to do the opposite which is to add them all 01:52:56.980 |
together. So there's lots of variants of this. But you'll see things like GBMs for gradient 01:53:06.780 |
boosted machines or GBDTs for gradient boosted decision trees. And there's lots of minor 01:53:14.460 |
details around you know and significant details. But the basic idea is what I've shown. 01:53:21.580 |
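Here's a minimal sketch of that loop using shallow sklearn trees (purely illustrative; real gradient boosting libraries add learning rates, regularization and much more):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(xs, y, n_trees=20, max_leaf_nodes=4):
    trees, residual = [], np.asarray(y, dtype=float)
    for _ in range(n_trees):
        # a small tree that underfits, trained on what's left to explain
        t = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(xs, residual)
        residual = residual - t.predict(xs)
        trees.append(t)
    return trees

def boost_predict(trees, xs):
    # sum the trees' predictions, don't average them
    return np.sum([t.predict(xs) for t in trees], axis=0)
```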
All right let's take the questions. Dropping features in a model is a way to reduce the 01:53:28.020 |
complexity of the model and thus reduce overfitting. Is this better than adding some regularization 01:53:33.820 |
like weight decay? I didn't claim that we removed columns to avoid overfitting. We removed 01:53:49.180 |
the columns to simplify the model: fewer things to analyze. It should also mean we don't need as many 01:54:00.460 |
trees but there's no particular reason to believe that this will regularize. And the 01:54:06.380 |
idea of regularization doesn't necessarily make a lot of sense for random forests; you can 01:54:06.380 |
always add more trees. Is there a good heuristic for picking the number of linear layers in 01:54:10.620 |
the tabular model? Not really. Well if there is I don't know what it is. I guess two, three 01:54:32.900 |
hidden layers works pretty well. So you know what I showed those numbers I showed are pretty 01:54:40.300 |
good for a large-ish model. By default it uses 200 and 100, so maybe start with the default 01:54:48.520 |
and then go up to 500 and 250 to see if that's an improvement, and just keep doubling 01:54:53.500 |
them until it stops improving or you run out of memory or time. The main thing to note 01:55:00.900 |
about boosted models is that there's nothing to stop us from overfitting. If you add more 01:55:05.660 |
and more trees to a bagging model, such as a random forest, it should 01:55:11.660 |
generalize better and better because each time you're using a new model which is based 01:55:16.980 |
on a subset of the data. But with boosting, each model will fit the training set better and 01:55:24.740 |
better, and gradually overfit more and more. So boosting methods do generally require more 01:55:32.460 |
hyperparameter tuning and fiddling around with. You know, you certainly can have regularization in 01:55:37.940 |
boosting. They're pretty sensitive to their hyperparameters, which is why they're not normally 01:55:46.640 |
my first go-to, but they more often win Kaggle competitions than random forests do; they tend 01:55:57.140 |
to be good at getting that last little bit of performance. So the last thing I'm going 01:56:04.860 |
to mention is something super neat which a lot of people don't seem to know exists. There's 01:56:11.500 |
a table from the entity embeddings paper, which is super cool, 01:56:17.100 |
where what they did was they built a neural network, they got the entity embeddings, 01:56:23.900 |
EE, and then they tried a random forest using the entity embeddings as predictors rather 01:56:35.220 |
than the approach I described with just the raw categorical variables. And the error 01:56:43.060 |
for a random forest went from 0.16 to 0.11. A huge improvement and very simple method 01:56:51.100 |
KNN went from 0.29 to 0.11. Basically all of the methods when they used entity embeddings 01:56:59.020 |
suddenly improved a lot. The one thing you should try if you have a look at the further 01:57:04.360 |
research section after the questionnaire is it asks to try to do this actually take those 01:57:10.260 |
entity embeddings that we trained in the neural net and use them in the random forest and 01:57:14.840 |
then maybe try ensembling again and see if you can beat the 0.223 that we had. This is 01:57:25.260 |
a really nice idea it's like you get you know all the benefits of boosted decision trees 01:57:32.140 |
but all of the nice features of entity embeddings and so this is something that not enough people 01:57:40.100 |
seem to be playing with for some reason. So overall you know random forests are nice and 01:57:49.940 |
easy to train you know they're very resilient they don't require much pre-processing they 01:57:54.460 |
train quickly they don't overfit you know they can be a little less accurate and they 01:58:03.020 |
can be a bit slow at inference time because the inference you have to go through every 01:58:08.180 |
one of those trees. Having said that a binary tree can be pretty heavily optimized so you 01:58:18.700 |
know it is something you can basically create a totally compiled version of a tree and they 01:58:24.100 |
can certainly also be done entirely in parallel so that's something to consider. Gradient boosting 01:58:36.020 |
machines are also fast to train on the whole, but a little more fussy about hyperparameters; 01:58:41.260 |
you have to be careful about overfitting but a bit more accurate. Neural nets may be the 01:58:49.380 |
fussiest to deal with they've kind of got the least rules of thumb around or tutorials 01:58:56.660 |
around saying this is kind of how to do it it's just a bit a bit newer a little bit less 01:59:00.660 |
well understood but they can give better results in many situations than the other two approaches 01:59:06.580 |
or at least with an ensemble can improve the other two approaches. So I would always start 01:59:11.380 |
with a random forest and then see if you can beat it using the others. So yeah why don't you 01:59:19.580 |
now see if you can find a Kaggle competition with tabular data whether it's running now 01:59:23.740 |
or it's a past one and see if you can repeat this process for that and see if you can get 01:59:29.220 |
in the top 10% of the private leaderboard that would be a really great stretch goal 01:59:34.860 |
at this point. Implement the decision tree algorithm yourself I think that's an important 01:59:40.100 |
one we really understand it and then from there create your own random forest from scratch 01:59:44.700 |
you might be surprised it's not that hard and then go and have a look at the tabular 01:59:52.500 |
model source code and at this point this is pretty exciting you should find you pretty 01:59:57.980 |
much know what all the lines do with two exceptions and if you don't you know dig around and explore 02:00:04.900 |
an experiment and see if you can figure it out. And with that we are I am very excited 02:00:13.220 |
to say at a point where we've really dug all the way in to the end of these real valuable 02:00:20.980 |
effective fast AI applications and we're understanding what's going on inside them. What should we 02:00:27.420 |
expect for next week? For next week we will look at NLP and computer vision and we'll do the 02:00:36.500 |
same kind of ideas delve deep to see what's going on. Thanks everybody see you next week.