
Lesson 7 - Deep Learning for Coders (2020)


Chapters

0:00 Weight decay (L2 regularization)
7:25 Creating our own Embedding module
12:45 Interpreting embeddings and bias
18:00 Embedding distance
20:00 Deep learning for collaborative filtering
24:09 Notebook 9 - Tabular modelling
25:30 Entity embeddings for categorical variables
30:11 Beyond deep learning for tabular data (ensembles of decision trees)
40:10 Decision trees
64:00 Random forests
72:10 Out-of-bag error
74:00 Model interpretation
94:00 Extrapolation
103:00 Using a NN
109:20 Ensembling
117:40 Conclusion

Transcript

Hi everybody and welcome to lesson 7. We're going to start by having a look at a kind of regularization called weight decay. And the issue that we came to at the end of the last lesson is that we were training our simple dot product model with bias, and our loss started going down and then it started going up again.

And so we have a problem: we are overfitting. And remember, in this case we're using mean squared error, so we don't really need a separate metric here, because mean squared error is pretty much the thing we care about — or we could use mean absolute error if we like — and either of those works fine as a loss function.

They don't have the problem of big flat areas like accuracy does for classification. So what we want to do is to make it less likely that we're going to overfit by doing something we call reducing the capacity of the model. The capacity of the model is basically how much space does it have to find answers.

And if it can kind of find any answer anywhere, those answers can include basically memorizing the data set. So one way to handle this would be to decrease the number of latent factors. But generally speaking, reducing the number of parameters in a model, particularly as we look at more deep learning style models, ends up biasing the models towards very simple kind of shapes.

So there's a better way to do it than reducing the number of parameters: instead, we try to force the parameters to be smaller, unless they're really required to be big. And the way we do that is with weight decay. Weight decay is also known as L2 regularization. They're very slightly different, but we can think of them as the same thing.

And what we do is we change our loss function, and specifically we change the loss function by adding to it the sum of all the weights squared — in fact, the sum of all the parameters squared, I should say. Why do we do that? Well, because if that's part of the loss function, then one way to decrease the loss would be to decrease the weights: one particular weight, or all of the weights, or something like that.

And so when we decrease the weights, if you think about what that would do, then think about, for example, the different possible values of a in y equals a x squared. The larger a is — for example, a is 50 — the narrower the peaks you get. In general, big coefficients are going to cause big swings: big changes in the loss for small changes in the parameters.

And when you have these kinds of sharp peaks or valleys, it means that a small change to the input can make a big change to the loss. And if you're in that situation, then you can basically fit all the data points almost exactly with a really complex, jagged function with sharp changes, which tries to sit exactly on each data point rather than finding a nice smooth surface which connects them all together or goes through them all.

So if we limit our weights by adding in the loss function, the sum of the weights squared, then what it's going to do is it's going to fit less well on the training set because we're giving it less room to try anything that it wants to, but we're going to hope that it would result in a better loss on the validation set or the test set so that it will generalize better.

One way to think about this is that the loss with weight decay is just the loss plus the sum of the parameters squared times some number we pick, a hyperparameter. This is like 0.1 or 0.01 or 0.001 kind of region. So this is basically what loss with weight decay looks like in this equation.

But remember, when it actually comes to stochastic gradient descent, how is the loss used? It's used by taking its gradient. So what's the gradient of this? Well, if you remember back to when you first learned calculus — and it's okay if you don't — the gradient of something squared is just two times that something.

We've changed the name from parameters to weight here, which is a bit confusing — we'll just use weight to keep it consistent, though maybe parameters would be better. So the derivative of weight squared is just two times weight. So in other words, to add this term to the gradient, we can just add weight decay times two times weight to the gradients.

And since weight decay is just a hyperparameter we pick, we can fold that factor of two into it, so that just gives us weight decay times weight. So weight decay refers to adding the weights times some hyperparameter to the gradients. And that is going to encourage these kinds of shallower, less bumpy surfaces.
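As a rough illustration — not the fastai implementation, and all the names here are made up — the two equivalent views of weight decay look like this in a manual SGD step:

```python
import torch

# A minimal sketch of weight decay in a hand-written SGD step (hypothetical names).
wd, lr = 0.1, 0.01                       # hyperparameters we pick
w = torch.randn(3, requires_grad=True)   # some parameters
x, y = torch.randn(10, 3), torch.randn(10)

loss = ((x @ w - y) ** 2).mean()         # plain MSE, no penalty term written into the loss
loss.backward()

with torch.no_grad():
    # Equivalent to having added wd * (w**2).sum() to the loss
    # (the factor of 2 has been folded into wd):
    w.grad += wd * w
    w -= lr * w.grad                     # standard SGD step
    w.grad.zero_()
```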

So to do that, when we call fit or fit_one_cycle or whatever, we can simply pass in a wd parameter, and that's just this number here. So if we pass in 0.1, then the training loss goes from 0.29 to 0.49. That's much worse, right, because we can't overfit anymore.
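For reference, that call looks something like this (a sketch, assuming the learn object from the collaborative filtering notebook; the epoch count and learning rate are just illustrative):

```python
# wd is the weight decay hyperparameter discussed above
learn.fit_one_cycle(5, 5e-3, wd=0.1)
```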

The valid loss goes from 0.89 to 0.82, much better. So this is an important thing to remember for those of you that have done a lot of more traditional statistical modeling: in those kinds of models, we try to avoid overfitting and increase generalization by decreasing the number of parameters.

But in a lot of modern machine learning, and certainly deep learning, we tend to instead use regularization such as weight decay, because it gives us more flexibility. It lets us use more nonlinear functions and still reduce the effective capacity of the model. Great. So we're down to 0.823.

This is a good model. This is really actually a very good model. And so let's dig into actually what's going on here because in our, in our architecture, remember we basically just had four embedding layers. So what's an embedding layer? We've described it conceptually, but let's write our own.

And remember we said that an embedding layer is just a computational shortcut for doing a matrix multiplication by a one hot encoded matrix, and that that is actually the same as just indexing into an array. So an embedding is just indexing into an array. And so it's nice to be able to create our own versions of things that exist in PyTorch and fast.ai.

So let's do that for embedding. If we're going to create our own kind of layer, which is pretty cool, we need to be aware of something: normally a layer is created by inheriting, as we've discussed, from Module or nn.Module. So for example, here is a module where we've created a class called T that inherits from Module.

And when it's constructed — remember, that's what dunder init does — this is just a dummy little module: we're going to set self.a to the number one repeated three times, as a tensor. Now if you remember back to notebook four, we talked about how the optimizers in PyTorch and fast.ai rely on being able to grab parameters to find a list of all the parameters.

Now if you want to be able to optimize self.a, it would need to appear in parameters, but actually there's nothing there. Why is that? That's because PyTorch does not assume that everything that's in a module is something that needs to be learned. To tell it that something needs to be learned, you have to wrap it in nn.Parameter.

So here's exactly the same class, but torch.ones(3), which is just a tensor of three ones in this case, is wrapped in nn.Parameter. And now if I go parameters, I see I have a parameter with three ones in it. And that's going to automatically call requires_grad_ for us as well.
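A sketch of those two dummy modules (the class names are approximate):

```python
import torch
from torch import nn

class T(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.ones(3)                 # a plain tensor: NOT registered as a parameter

class T2(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Parameter(torch.ones(3))   # wrapped: now it appears in parameters()

print(list(T().parameters()))    # []  -- nothing for an optimizer to update
print(list(T2().parameters()))   # [Parameter containing: tensor([1., 1., 1.], requires_grad=True)]
```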

We haven't had to do that for things like nn.Linear in the past because PyTorch automatically uses nn.Parameter internally. So if we have a look at the parameters for something that uses an nn.Linear layer with no bias, you'll see again we have here a parameter with three things in it. So in general we want to be able to create a parameter.

So, something with a tensor with a bunch of things in it — and generally we want to randomly initialize them. So to randomly initialize, we can pass in the size we want, create a tensor of zeros of that size, and then fill it with normally distributed random numbers with a mean of zero and a standard deviation of 0.01.

No particular reason I'm picking those numbers, just to show how this works. So here's something that will give us back a set of parameters of any size we want. And so now, everywhere that used to say Embedding, I'm going to replace it with create_params.
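That helper is roughly this (close to the notebook's create_params; treat it as a sketch):

```python
import torch
from torch import nn

def create_params(size):
    # a tensor of the requested size, randomly initialised with mean 0 and
    # standard deviation 0.01, wrapped in nn.Parameter so optimizers can find it
    return nn.Parameter(torch.zeros(*size).normal_(0, 0.01))
```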

Everything else here is the same in the dunder init. And then the forward is very, very similar to before. As you can see, I'm grabbing the zero-index column from x — that's my users — and I just look it up, as you see, in that user factors array. And the cool thing is I don't have to do anything with gradients myself for this manual embedding layer, because PyTorch can figure out the gradients automatically, as we've discussed.

But then I just got the dot product as before, add on the bias as before, do the sigmoid range as before. And so here's a dot product bias without any special PyTorch layers and we fit and we get the same result. So I think that is pretty amazingly cool.
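Put together, the from-scratch model looks roughly like this (a sketch close to the notebook's version; it assumes the create_params helper above, and defines a plain sigmoid_range inline rather than using fastai's):

```python
import torch
from torch import nn

def sigmoid_range(x, lo, hi):
    # squash predictions into (lo, hi); fastai provides an equivalent helper
    return torch.sigmoid(x) * (hi - lo) + lo

class DotProductBias(nn.Module):
    def __init__(self, n_users, n_movies, n_factors, y_range=(0, 5.5)):
        super().__init__()
        self.user_factors  = create_params([n_users, n_factors])
        self.user_bias     = create_params([n_users])
        self.movie_factors = create_params([n_movies, n_factors])
        self.movie_bias    = create_params([n_movies])
        self.y_range = y_range

    def forward(self, x):
        users  = self.user_factors[x[:, 0]]    # "embedding" = indexing into an array
        movies = self.movie_factors[x[:, 1]]
        res = (users * movies).sum(dim=1)      # dot product
        res += self.user_bias[x[:, 0]] + self.movie_bias[x[:, 1]]
        return sigmoid_range(res, *self.y_range)
```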

We've really shown that the embedding layer is nothing fancy, is nothing magic, right? It's just indexing into an array. So hopefully that removes a bit of the mystery for you. So let's have a look at this model that we've created and we've trained and find out what it's learned.

That's already useful. We've got something we can make pretty accurate predictions with. But let's find out what the model looks like. So remember when — oh, we have a question. Okay, let's take the question before we look at this. What's the advantage of creating our own embedding layer over the stock PyTorch one?

Oh, nothing at all. We're just showing that we can. It's great to be able to dig under the surface, because at some point you'll want to try doing new things. So a good way to learn to do new things is to be able to replicate things that already exist, and check that you understand how they work.

It's also a great way to understand the foundations of what's going on: actually code your own implementation. But I wouldn't expect you to use this implementation in practice — basically, it removes all the mystery. So if you remember, we've created a learner called learn, and to get to the model that's inside it you can always call learn.model, and then inside that there are going to be the attributes created for it.

Well, sorry, not automatically — we created all these attributes ourselves: movie_factors, movie_bias and so forth. So we can grab learn.model.movie_bias. And now what I'm going to do is sort that vector and print out the first five titles. And so what this is going to do is print out the movies with the smallest bias, and here they are.
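The inspection itself is only a couple of lines (a sketch, assuming the learn and dls objects from the notebook):

```python
movie_bias = learn.model.movie_bias.squeeze()
idxs = movie_bias.argsort()[:5]                   # the five lowest-bias movies
print([dls.classes['title'][i] for i in idxs])    # look their titles up in the vocab
```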

What does this mean? Well, it kind of means these are the five movies that people really didn't like. But it's more than that. It's not only do people not like them, but if we take account of the genre they're in, the actors they have, you know, whatever the latent factors are, people liked them a lot less than they expected.

So maybe, for example — I haven't seen any of these movies, luckily perhaps — this one is a sci-fi movie, so people who generally like sci-fi movies found it so bad that they still didn't like it. So we can do the exact opposite, which is to sort descending.

And here are the top five movies — specifically, they're the top five by bias, right? So these are the movies that people like even after you take account of their factors. I have seen all of these ones. So LA Confidential is kind of a murder mystery cop movie, I guess.

And people who don't necessarily like that genre or I think Guy Pearce was in it. So maybe they don't like Guy Pearce very much, whatever. People still like this movie more than they expect. So this is a kind of a nice thing that we can look inside our model and see what it's learned.

We can look not only at the bias vector, but also at the factors. Now there are 50 factors, which is too many to visualize. So we can use a technique called PCA, principal component analysis. The details don't matter, but basically it's going to squish those 50 factors down to three.
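One way to do that squishing (a sketch; the notebook uses a fastai PCA helper, but torch.pca_lowrank works too — this assumes the learn object from above):

```python
import torch

movie_w = learn.model.movie_factors.detach().cpu()
U, S, V = torch.pca_lowrank(movie_w, q=3)   # principal directions of the 50 factors
coords = movie_w @ V[:, :2]                 # each movie projected onto the top two components
# coords[:, 0] and coords[:, 1] are the x and y positions used for the plot
```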

And then we'll plot the top two, as you can see here. And what we see when we plot the top two is that the movies have been kind of spread out across a space of some kind of latent factors. And so if you look at the far right, there's a whole bunch of kind of big-budget things.

And on the far left, there's more like cult kind of things: Fargo, Schindler's List, Monty Python. By the same token, at the bottom we've got The English Patient and When Harry Met Sally, so kind of romance drama stuff. And at the top, we've got action and sci-fi kind of stuff.

So you can see, even though we haven't fed in any information about these movies — all we've seen is who likes what — these latent factors have automatically figured out a space, or a way of thinking about these movies, based on what kinds of movies people like and what other kinds of movies they like along with those.

So that's really interesting, to try and visualize what's going on inside your model. Now, we don't have to do all this manually. We can actually just say: give me a collab learner using this set of data loaders with this number of factors and this y range, and it does everything we've just seen, getting again about the same number.
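For reference, that one-liner looks something like this (a sketch, assuming the dls from this notebook):

```python
from fastai.collab import collab_learner

learn = collab_learner(dls, n_factors=50, y_range=(0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd=0.1)
```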

Okay, so now you can see this is nice, right? We've actually been able to see right underneath inside the collab learner part of the fast AI application, the collaborative filtering application and we can build it all ourselves from scratch. We know how to create the SGD, know how to create the embedding layer, we know how to create the model, the architecture.

So now you can see, you know, we've really can build up from scratch our own version of this. So if we just type learn.model, you can see here the names are a bit more generic. This is a user weight, item weight, user bias, item bias, but it's basically the same stuff we've seen before.

And we can replicate the exact analysis we saw before by using this same idea. Okay, slightly different order this time because it is a bit random, but pretty similar as well. Another interesting thing we can do is think about the distance between two movies. So let's grab all the movie factors and just pop them into a variable, and then let's pick a movie and find the distance from that movie to every other movie.

And so one way of thinking about distance is you might recall the Pythagorean formula for the length of the hypotenuse of a triangle, which is also the distance to a point in a Cartesian plane on a chart: the square root of x squared plus y squared. You might know it — it doesn't matter if you don't — but you can do exactly the same thing for 50 dimensions.

It doesn't just work for two dimensions. That tells you how far away one point is from another point, if x and y are actually the differences between two movie vectors. So then what gets interesting is you can divide that by the lengths, to make all the lengths the same, to find the angle between any two movies — and that actually turns out to be a really good way to compare the similarity of two things.

That's called cosine similarity. And so the details don't matter. You can look them up if you're interested. But the basic idea here is to see that we can actually pick a movie and find the movie that is the most similar to it based on these factors. Kind of interesting.
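A sketch of that distance calculation (assuming the learn and dls from the collab_learner version above; the movie title here is just a hypothetical placeholder):

```python
from torch import nn

movie_factors = learn.model.i_weight.weight                  # the item embedding matrix
idx = dls.classes['title'].o2i['Some Movie (1995)']          # placeholder title, not from the lesson
sims = nn.CosineSimilarity(dim=1)(movie_factors, movie_factors[idx][None])
closest = sims.argsort(descending=True)[1]                   # [0] is the movie itself
print(dls.classes['title'][closest])
```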

I have a question. All right: what motivated learning a 50-dimensional embedding and then using PCA to reduce it to three, versus just learning a three-dimensional embedding? Oh, because the purpose of this was actually to create a good model. The visualization part is normally kind of the exploration of what's going on in your model.

And so with 50 latent factors, you're going to get a more accurate model. So that's one approach, this dot product version. There's another version we could use, which is we could create a set of user factors and a set of item factors, and just like before we could look them up.

But what we could then do instead of doing a dot product, we could concatenate them together into a tensor that contains both the user and the movie factors next to each other. And then we could pass them through a simple little neural network, linear, relu, linear, and then sigmoid range as before.

So importantly here, the first linear layer, the number of inputs is equal to the number of user factors plus the number of item factors. And the number of outputs is however many activations we have. And then we just default to 100 here. And then the final layer will go from 100 to 1 because we're just making one prediction.
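That neural-net version looks roughly like this (a sketch close to the notebook's CollabNN; it reuses the sigmoid_range helper from the earlier sketch):

```python
import torch
from torch import nn

class CollabNN(nn.Module):
    def __init__(self, user_sz, item_sz, y_range=(0, 5.5), n_act=100):
        super().__init__()
        self.user_factors = nn.Embedding(*user_sz)   # user_sz = (n_users, n_user_factors)
        self.item_factors = nn.Embedding(*item_sz)   # item_sz = (n_items, n_item_factors)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1] + item_sz[1], n_act),   # concatenated embeddings go in
            nn.ReLU(),
            nn.Linear(n_act, 1))                         # one prediction comes out
        self.y_range = y_range

    def forward(self, x):
        embs = self.user_factors(x[:, 0]), self.item_factors(x[:, 1])
        res = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(res, *self.y_range)
```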

We'll call that CollabNN. We can instantiate that to create a model, create a learner, and fit. It's not going quite as well as before. It's not terrible, but it's not quite as good as our dot product version. But the interesting thing here is it does give us some more flexibility, which is that since we're not doing a dot product, we can actually have a different embedding size for each of users versus items.

And actually fast.ai has a simple heuristic: if you call get_emb_sz and pass in your data loaders, it will suggest appropriately sized embedding matrices for each of your categorical variables — here, the users and the items. So if we pass in *embs, that's going to pass in the user tuple and the item tuple, which we can then pass to Embedding.
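In code that's something like this (a sketch; it uses the CollabNN class sketched above and assumes dls from the notebook):

```python
from fastai.collab import *
from fastai.tabular.all import *

embs = get_emb_sz(dls)        # suggested (cardinality, embedding size) per categorical variable
model = CollabNN(*embs)       # the * prefix unpacks the user tuple and the item tuple
learn = Learner(dls, model, loss_func=MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd=0.01)
```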

This is the * prefix we learned about in the last class in case you forgot. So this is kind of interesting. We can, you know, we can see here that there's two different architectures we could pick from. It wouldn't be necessarily obvious ahead of time which one's going to work better.

In this particular case, the simplest one, the dot product one, actually turned out to work a bit better, which is interesting. This particular version here, if you call collab_learner and pass use_nn = true, then what that's going to do is it's going to use this version, the version with concatenation and the linear layers.
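Concretely, that shortcut looks something like this (a sketch, assuming the same dls; the layers argument is optional):

```python
learn = collab_learner(dls, use_nn=True, y_range=(0, 5.5), layers=[100, 50])
learn.fit_one_cycle(5, 5e-3, wd=0.1)
```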

So collab_learner with use_nn=True: again we get about the same result, as you'd expect, because it's just a shortcut for this version. And it's interesting actually — if we have a look at collab_learner, it actually returns an object of type EmbeddingNN, and it's kind of cool, if you look inside the fast.ai source code or use the double question mark trick to see the source code for EmbeddingNN, you'll see it's three lines of code.

How does that happen? Because we're using this thing called TabularModel, which we will learn about in a moment; basically, this neural net version of collaborative filtering is literally just a tabular model in which we pass no continuous variables and some embedding sizes. So we'll see that in a moment.

Okay, so that is collaborative filtering, and again take a look at the further research section in particular after you finish the questionnaire, because there are some really important next steps you can take to push your knowledge and your skills. So let's now move to notebook 9, tabular. We're going to look at tabular modeling and do a deep dive.

And let's start by talking about this idea that we were starting to see here, which is embeddings. And specifically let's move beyond just having embeddings for users and items, but embeddings for any kind of categorical variable. So really because we know an embedding is just a lookup into an array, it can handle any kind of discrete categorical data.

So things like age are not discrete, they're continuous numerical data, but something like sex or postcode are categorical variables. They have a certain number of discrete levels. The number of discrete levels they have is called their cardinality. So to have a look at an example of a dataset that contains both categorical and continuous variables, we're going to look at the Rossman sales competition that ran on Kaggle a few years ago.

And so basically what's going to happen is we're going to see a table that contains information about various stores in Germany, and the goal will be to try and predict how many sales there's going to be for each day in a couple of week period for each store. One of the interesting things about this competition is that one of the gold medalists used deep learning, and it was one of the earliest known examples of a state-of-the-art deep learning tabular model.

I mean this is not long ago, 2015 or something, but really this idea of creating state-of-the-art tabular models with deep learning has not been very common and for not very long. You know interestingly compared to the other gold medalists in this competition, the folks that use deep learning used a lot less feature engineering and a lot less domain expertise.

And so they wrote a paper called Entity Embeddings of Categorical Variables, in which they basically described the exact thing that you saw in notebook 8: the way you can think of one-hot encodings as just being embeddings, you can concatenate them together, and you can put them through a couple of layers — they call them dense layers, we've called them linear layers — and create a neural network out of that.

So this is really a neat, you know, kind of simple and obvious-in-hindsight trick. And in the paper they actually did exactly what we did, which is to look at the results of the trained embeddings. And so for example they had an embedding matrix for regions in Germany — because there wasn't really metadata about this, these were just learned embeddings, just like we learned embeddings for movies.

And so then they just created, just like we did before, a chart where they plotted each region according to, I think, probably a PCA of its embedding. And then if you circle the ones that are close to each other in blue, you'll see that they're actually close to each other in Germany, and ditto for red, and ditto for green, and then here's the brown.

So this is like pretty amazing, is the way that we can see that it's kind of learned something about what Germany looks like, based entirely on the purchasing behavior of people in those states. Something else they did was to look at every store, and they looked at the distance between stores in practice, like how many kilometers away they are.

And then they looked at the distance between stores in terms of their embedding distance, just like we saw in the previous notebook. And there was this very strong correlation that stores that were close to each other physically ended up having close embeddings as well, even though the actual location of these stores in physical space was not part of the model.

Ditto with days of the week — the days of the week are another embedding, and days of the week that are next to each other ended up next to each other in embedding space, and ditto for months of the year. So it's pretty fascinating the way information about the world ends up captured just by training embeddings, which as we know are just index lookups into an array.

So how do we then combine these categorical variables and their embeddings with continuous variables? What was done both in the entity embeddings paper we just looked at, and also described in more detail by Google when they described how their recommendation system in Google Play works — this is from Google's paper — is that the categorical features go through the embeddings, and then there are continuous features, and all the embedding results and the continuous features are just concatenated together into one big concatenated table that then goes through, in this case, three layers of a neural net. And interestingly, they also take the kind of collaborative filtering bit and do the dot product as well, and combine the two.

So they use both of the tricks that were used in the previous notebook and combine them together. So that's the basic idea we're going to be seeing for moving beyond collaborative filtering, which is just two categorical variables, to as many categorical and as many continuous variables as we like.

But before we do that, let's take a step back and think about other approaches, because as I mentioned, the idea of deep learning as a kind of a best practice for tabular data is still pretty new and it's still kind of controversial. It's certainly not always the case that it's the best approach.

So when we're not using deep learning, what would we be using? Well, what we'd probably be using is something called an ensemble of decision trees, and the two most popular are random forests and gradient boosting machines, or something similar. So basically, between multi-layered neural networks trained with SGD and ensembles of decision trees, that kind of covers the vast majority of approaches that you're likely to see for tabular data.

And so we're going to make sure we cover them both of course today, in fact. So although deep learning is nearly always clearly superior for stuff like images and audio and natural language text, these two approaches tend to give somewhat similar results a lot of the time for tabular data.

So let's take a look. You know, you really should generally try both and see which works best for you for each problem you look at. Why does the range go from 0 to 5.5 if the maximum is 5? That's a great question. The reason is if you think about it for sigmoid, it's actually impossible for a sigmoid to get all the way to the top or all the way to the bottom.

Those are asymptotes. So no matter how big your x is, it can never quite get to the top, and no matter how small it is, it can never quite get to the bottom. So if you want to be able to actually predict a rating of 5, then you need to use something higher than 5 as your maximum.

Are embeddings used only for high-cardinality categorical variables, or is this approach used in general? For low cardinality, can one use a one-hot encoding? I'll remind you, cardinality is the number of discrete levels in a variable. And remember that an embedding is just a computational shortcut for a one-hot encoding.

So there's really no reason to use a one-hot encoding, because as long as you have more than two levels, it's always going to use more memory and be slower, and give you exactly the same thing mathematically. And if there are just two levels, then it is basically identical. So there isn't really any reason not to use an embedding.

Thank you for those great questions. Okay, so one of the most important things about decision tree ensembles is that at the current state of the technology, they do provide faster and easier ways of interpreting the model. I think that's rapidly improving for deep learning models on tabular data, but that's where we are right now.

They also require less hyperparameter tuning, so they're easier to kind of get right the first time. So my first approach for analyzing a new tabular data set is always an ensemble of decision trees. And specifically, I pretty much always start with a random forest because it's just so reliable.

Yes. In your experience, for highly imbalanced data, such as fraud or medical data, what usually works best out of random forests, XGBoost, or neural networks? I'm not sure that whether the data is balanced or unbalanced is a key reason for choosing one of those above the others. I would try all of them and see which works best.

So the exception to the guideline that a decision tree ensemble should be your first thing to try would be if there are some very high-cardinality categorical variables — they can be a bit difficult to get to work really well in decision tree ensembles — or, most importantly, if there's something like plain text data or image data or audio data or something like that. Then you're definitely going to need to use a neural net in there, but you could actually ensemble it with a random forest, as we'll see.

Okay, so clearly we're going to need to understand how decision tree ensembles work. PyTorch isn't a great choice for decision tree ensembles: it's really designed for gradient-based methods, and random forests and decision tree growing are not really gradient-based methods in the same way. So instead, we're going to use a library called scikit-learn, referred to as sklearn as a module.

Scikit-learn does a lot of things. We're only going to touch on a tiny piece of them — the stuff we need to train decision trees and random forests. We've already mentioned Wes McKinney's book before, also a great book for understanding more about scikit-learn. So the dataset for learning about decision tree ensembles is going to be another dataset: it's called the Blue Book for Bulldozers dataset, and it's from a Kaggle competition.

It's going to, it's called the blue book for bulldozers dataset and it's a Kaggle competition. So Kaggle competitions are fantastic. They are machine learning competitions where you get interesting datasets, you get feedback on whether your approach is any good or not. You can see on a leaderboard what approaches are working best and then you can read blog posts from the winning contestants sharing tips and tricks.

It's certainly not a substitute for actual practice doing end-to-end data science projects, but for becoming good at creating models that are actually predictive, it's a really fantastic resource, highly recommended. And you can also submit to most old competitions to see how you would have gone, without having to worry about the stress of people looking at your results, because they're not publicized or published if you do that.

There's a question. Can you comment on real-time applications of random forests? In my experience, they tend to be too slow for real-time use cases like a recommender system, neural network is much faster when run on the right hardware. Let's get to that once we've seen what they are, shall we?

Now, you can't just download and untar Kaggle datasets using the untar_data function that we have in fast.ai. So you actually have to sign up to Kaggle and then follow these instructions for how to download data from Kaggle. Make sure you replace creds here with what it describes: you need to get a special API key and then run this one time to put it up on your server.

And now you can use Kaggle to download data using the API. So after we do that, we're going to end up with a bunch of, as you see, CSV files. So let's take a look at this data. So the main data, the main table is train.csv. Remember that's comma separated values and the training set contains information such as unique identifier of a sale, the unique identifier of a machine, the sale price, sale date.

So what's going on here is one row of the data represents a sale of a single piece of heavy machinery, like a bulldozer, at an auction. So it happens at a date, has a price, is of some particular piece of equipment, and so forth. So let's use pandas again to read in the CSV file, combining training and valid together.

We can then look at the columns to see. There's a lot of columns there and many things which I don't know what the hell they mean like blade extension and pad type and ride control. But the good news is we're going to show you a way that you don't have to look at every single column and understand what they mean and random forests are going to help us with that as well.

So once again, we're going to be seeing this idea that models can actually help us with data understanding and data cleanup. One thing we can look at now is ordinal columns: things that you know are discrete values but have some order, like product size — it has levels like large, large/medium, medium, small and mini.

These should not be in alphabetical order or some random order; they should be in this specific order, right? They have a specific ordering. So we can use astype to turn it into a categorical variable, and then we can call set_categories with ordered=True to basically say this is an ordinal column.
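In pandas that looks something like this (a sketch; the level names follow the ProductSize column in the notebook):

```python
sizes = ['Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact']
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)
```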

So it's got discrete values but we actually want to define what the order of the classes are. We need to choose which is the dependent variable and we do that by looking on Kaggle and Kaggle will tell us that the thing we're meant to be predicting is sale price and actually specifically they'll tell us the thing we're meant to be predicting is the log of sale price because root mean squared log error is what we're actually going to be judged on in the competition where we take the log.

So we're now going to replace sale price with its log, and that's what we'll be using from now on. A decision tree ensemble requires decision trees, so let's start by looking at decision trees. A decision tree in this case is something that asks a series of binary — that is, yes or no — questions about the data.

So such as is somebody less than or greater than 30? Yes they are. Are they eating healthily? Yes they are and so okay then we're going to say they're fit or unfit. So like there's an example of some arbitrary decision tree that somebody might have come up with. It's a series of binary yes and no choices and at the bottom are leaf nodes that make some prediction.

Now of course for our bulldozers competition we don't know what binary questions to ask about these things and in what order in order to make a prediction about sale price. So we're doing machine learning so we're going to try and come up with some automated way to create the questions.

And there's actually a really simple procedure for doing that — have a think about it. So if you want to kind of stretch yourself here, have a think about what automatic procedure you could come up with that would build a decision tree where the final answer would do a significantly better than random job of estimating the sale price of one of these auctions.

Alright, so here's the approach that we could use. Loop through each column of the dataset — well, obviously not sale price, that's the dependent variable — so sale ID, machine ID, auctioneer, year made, etc. And one of those will be, for example, product size.

And so then what we're going to do is loop through each possible value of product size: large, large/medium, medium, etc. And then we're going to do a split, basically like where this comma is, and we're going to say okay, let's get all of the auctions of large equipment and put them into one group, and everything that's smaller than that into another group.

And so that's this step here: split the data into two groups based on whether they're greater than or less than that value. If it's a non-ordinal categorical variable, it'll just be whether it's equal or not equal to that level. And then we're going to find the average sale price for each of the two groups.

So for the large group what was the average sale price? For the smaller than large group what was the average sale price? And that will be our model. Our prediction will simply be the average sale price for that group. And so then you can say well how good is that model?

If our model was just to ask a single question with a yes/no answer put things into two groups and take the average of the group as being our prediction and we can say how good would that model be? What would be the root mean squared error from that model?

And so we can then say all right how good would it be if we use large as a split? And then let's try again what if we did large/medium as a split? What if we did medium as a split? And so in each case we can find the root mean squared error of that incredibly simple model.

And then once we've done that for all of the product size levels we can go to the next column and look at level of usage band and do every level of usage band and then state, every level of state and so forth. And so there'll be some variable and some split level which gives the best root mean squared error of this really really simple model.
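A rough sketch of that search for the best single split (not sklearn's implementation; the function and column names here are hypothetical):

```python
import numpy as np

def split_rmse(df, col, val, dep_var='SalePrice'):
    # score the one-question model: predict each group's mean sale price
    mask = df[col] <= val
    sse = 0.0
    for grp in (df.loc[mask, dep_var], df.loc[~mask, dep_var]):
        if len(grp): sse += ((grp - grp.mean()) ** 2).sum()
    return np.sqrt(sse / len(df))

def best_split(df, cols, dep_var='SalePrice'):
    scores = {(c, v): split_rmse(df, c, v, dep_var)
              for c in cols for v in sorted(df[c].dropna().unique())}
    return min(scores, key=scores.get)   # (column, value) with the lowest RMSE
```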

And so then we'll say okay that would be our first binary decision. It gives us two groups and then we're going to take each one of those groups separately and find another single binary decision for each of those two groups using exactly the same procedure. So then we'll have four groups and then we'll do exactly the same thing again separately for each of those four groups and so forth.

So let's see what that looks like — and in fact, once we've gone through this, you might even want to see if you can implement this algorithm yourself. It's not trivial, but it doesn't require any special coding skills, so hopefully you'll find you're able to do it. There are a few things we have to do before we can actually create a decision tree, in terms of just some basic data munging.

One is that if we're going to take advantage of dates, we actually want to call fastai's add_datepart function, and what that does, as you see after we call it, is create a whole bunch of different bits of metadata from that date: sale year, sale month, sale week, sale day and so forth.
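That call is just this (a sketch, assuming df holds the bulldozers table read in above):

```python
from fastai.tabular.all import add_datepart

# adds saleYear, saleMonth, saleWeek, saleDay, saleIs_month_end, ... and drops saledate
df = add_datepart(df, 'saledate')
```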

So a sale date by itself doesn't have a whole lot of information directly, but we can pull lots of different information out of it. And so this is an example of something called feature engineering, which is where we take some piece of data and try to create lots of other pieces of data from it.

So is this particular date the end of a month or not? The end of a year or not? And so forth. So that handles dates. There's a bit more cleaning we want to do, and fastai provides some things to make cleaning easier. We can use the TabularPandas class to create a tabular dataset in pandas.

And specifically we're going to use two tabular processors, or TabularProcs. A TabularProc is basically just a transform — we've seen transforms before, so go back and remind yourself what a transform is — except it's slightly different; it's like three lines of code if you look at the code for it.

It's actually going to modify the object in place rather than creating a new object and giving it back to you. And that's because often these tables of data are kind of really big and we don't want to waste lots of RAM. And it's just going to run the transform once and save the result rather than doing it lazily when you access it for the same reason.

That just makes things a lot faster. So you can just think of them as transforms, really. One of them is called Categorify, and Categorify is going to replace a column with numeric categories, using the same basic idea as the vocab we've seen before. FillMissing is going to find any columns with missing data, fill in the missing data with the median of that column, and create a new boolean column which is set to true for anything that was missing.

So these two things are basically enough to get you to a point where, most of the time, you'll be able to train a model. Now the next thing we need to do is think about our validation set. As we discussed in lesson one, a random validation set is not always appropriate, and certainly for something like predicting auction results it almost certainly is not appropriate, because we're going to be wanting to use the model in the future, not at some random date in the past.

So the way this Kaggle competition was set up was that the test set the thing that you had to fill in and submit for the competition was two weeks of data that was after any of the training set. So we should do the same thing for a validation set.

We should create something which is where the validation set is the last couple of weeks of data and so then the training set will only be data before that. So we basically can do that by grabbing everything before October 2011, create a training and validation set based on that condition and grabbing those bits.

So that's going to split our training set and validation set by date not randomly. We're also going to need to tell when you create a tabular pandas object you're going to be passing in a data frame, going to be passing in your tabular procs and you also have to say what are my categorical and continuous variables.

We can use fast.ai's cont_cat_split to automatically split a data frame into continuous and categorical variables for you. So we can just pass those in, tell it what the dependent variable is (you can have more than one), and what the indexes are to split into training and valid. And this gives us a tabular object.
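Putting that together looks roughly like this (a sketch close to the notebook's code; it assumes the df from above, with the date parts already added):

```python
import numpy as np
from fastai.tabular.all import *

dep_var = 'SalePrice'
df[dep_var] = np.log(df[dep_var])                   # we're judged on RMSLE, so predict log(price)

procs = [Categorify, FillMissing]
cond = (df.saleYear < 2011) | (df.saleMonth < 10)   # training rows: everything before October 2011
splits = (list(np.where(cond)[0]), list(np.where(~cond)[0]))

cont, cat = cont_cat_split(df, 1, dep_var=dep_var)  # which columns are continuous vs categorical
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
```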

So it's got all the information you need about the training set, the validation set, categorical and continuous variables and the dependent variable and any processes to run. It looks a lot like a datasets object, but it has a .train, it has a .valid and so if we have a look at .show we can see the data.

But .show is going to show us the kind of the string data, but if we look at .items you can see internally it's actually stored these very compact numbers which we can use directly in a model. So fast.ai has basically got us to a point here where we have our data into a format ready for modeling and our validation sets being created.

To see how these numbers relate to these strings, we can again, just like we saw last week, use the classes attribute, which is a dictionary which basically tells us the vocab. So this is how we look up, for example, what the number 6 corresponds to: count along the vocab — 0, 1, 2, 3, 4, 5, 6. This is a compact example.

That processing takes a little while to run, so you can go ahead and save the tabular object, and then you can load it back later without having to rerun all the processing. So that's a nice fast way to quickly get back up and running without having to reprocess your data.
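For example (a sketch; save_pickle and load_pickle are fastai/fastcore helpers, and path here is just wherever you keep the dataset):

```python
save_pickle(path/'to.pkl', to)    # after the slow processing...
to = load_pickle(path/'to.pkl')   # ...reload it instantly next time
```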

So we've done the basic data munging we need, and we can now create a decision tree. In scikit-learn, a decision tree where the dependent variable is continuous is a DecisionTreeRegressor. Let's start by telling it we just want a total of four leaf nodes — we'll see what that means in a moment. In scikit-learn you generally call fit, so it looks quite a lot like fast.ai, and you pass in your independent variables and your dependent variable. We can grab those straight from our tabular object: the training set's .xs and .y. And we can do the same thing for validation, just to save us some typing.
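In code (close to the notebook's; a sketch):

```python
from sklearn.tree import DecisionTreeRegressor

xs, y = to.train.xs, to.train.y              # independent and dependent variables
valid_xs, valid_y = to.valid.xs, to.valid.y

m = DecisionTreeRegressor(max_leaf_nodes=4)  # stop growing after four leaf nodes
m.fit(xs, y)
```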

Okay, question: do you have any thoughts on what data augmentation for tabular data might look like? I don't have a great sense of data augmentation for tabular data. We'll be seeing later, either in this course or in the next part, dropout and mixup and stuff like that, which you might be able to apply in the later layers of a tabular model.

Otherwise I think you'd need to think about the semantics of the data, and think about what things you could do to change the data without changing the meaning. That's a pretty tricky route. There's a question: does fast.ai distinguish between ordered categories such as low, medium, high and unordered categorical variables?

Yes, that was that ordinal thing I told you about before and all it really does is it ensures that your classes list has a specific order so then these numbers actually have a specific order. And as you'll see that's actually going to turn out to be pretty important for how we train our random forest.

Okay, so we can create a decision tree regressor, we can fit it, and then we can draw it with a little helper function. And here is the decision tree we just trained, and behind the scenes this actually used basically the exact process that we described back here, right? So this is where you can try and create your own decision tree implementation if you're interested in stretching yourself.

So we're going to use one that already exists, and the best way to understand what it's done is to look at this diagram from top to bottom. So the first step is: the initial model it created is a model with no binary splits at all.

Specifically, it's always going to predict the value 10.1 for every single row. Why is that? Well, because the simplest possible model is to take the average of the dependent variable and always predict that. And so this should pretty much always be your basic baseline for regression.

There are 404,710 rows — auctions — that we're averaging, and the mean squared error of this incredibly simple model, in which there are no rules at all, no groups at all, just a single average, is 0.48. So then the next most complex model is to take a single column, coupler system, and a single binary decision: is coupler system less than or equal to 0.5?

There are 360,847 auctions where it's true and 43,863 where it's false. And now, interestingly, in the false case you can see that there are no further binary decisions. So this is called a leaf node.

It's a node where this is as far as you can get and so if your coupler system is not less than or equal to 0.5 then the prediction this model makes for your sale price is 9.21 versus if it's true it's 10.21. So you can see it's actually found a very big difference here and that's why it picked this as the first binary split.

And so the mean squared error for this section here is 0.12, which is far better than what we started out at, 0.48. This group still has 360,000 in it, and so it does another binary split. This time it's: was the year this piece of equipment was made less than or equal to 1991.5?

If it was, if it's true then we get a leaf node and the prediction is 9.97, mean squared error 0.37. If the value is false we don't have a leaf node and we have another binary split. And you can see eventually we get down to here coupler system true, year made, false, product size, false, mean squared error 0.17.

So all of these leaf nodes have MSEs that are smaller than that original baseline model of just taking the mean. So this is how you can grow a decision tree. And we only stopped here because we said max_leaf_nodes is 4 — 1, 2, 3, 4, right? And so if we want to keep training it further, we can just use a higher number.

There's actually a very nice library by Terence Parr called dtreeviz which can show us exactly the same information, like so. And so here are the same leaf nodes, 1, 2, 3, 4. And you can see the kind of chart of how many are in each. This is the split, coupler system 0.5.

Here are the two groups. You can see the sale price in each of the two groups. And then here's the leaf node. And so then the second split was on year made. And you can see here something weird is going on with year made. There's a whole bunch of year mades that are a thousand which is obviously not a sensible year for a bulldozer to be made.

So presumably that's some kind of missing value. So when we look at the kind of the picture like this it can give us some insights about what's going on in our data. And so maybe we should replace those thousands with 1950 because that's you know obviously a very, very early year for a bulldozer.

So we can kind of pick it arbitrarily. It's actually not really going to make any difference to the model that's created, because all we care about is the order, since we're just doing these binary splits — but it'll make the chart easier to look at, as you can see. Here are our 1950s now.

And so now it's much easier to see what's going on in that binary split. So let's now get rid of max leaf nodes and build a bigger decision tree. And then let's just for the rest of this notebook create a couple of little functions. One to create the root mean squared error which is just here.

And another one to take a model and some independent variables, predict from the model on those independent variables, and then take the root mean squared error against the dependent variable. So that's going to be our model's root mean squared error. So for this decision tree, in which we didn't have a stopping criterion — so as many leaf nodes as you like — the model's root mean squared error is zero.
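Those two little functions are roughly this (close to the notebook's):

```python
import math

def r_mse(pred, y):
    return round(math.sqrt(((pred - y) ** 2).mean()), 6)

def m_rmse(m, xs, y):
    return r_mse(m.predict(xs), y)

m_rmse(m, xs, y)   # reports 0.0 for the unrestricted tree on its own training set
```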

So we've just built the perfect model. This is great news, right? We've built the perfect auction trading system. Well, remember, we actually need to check the validation set. Let's check m_rmse on the validation set and... oh, it's worse than zero. So our training set is zero, our validation set is much worse than zero.

Why has that happened? Well, one of the things sklearn can tell us is the number of leaf nodes: the number of leaves is 341,000, and the number of data points is 400,000. So in other words, we have nearly as many leaf nodes as data points.

Most of our leaf nodes only have a single thing in, but they're taking an average of a single thing. Clearly this makes no sense at all. So what we should actually do is pick some different stopping criteria and let's say, okay, if you get a leaf node with 25 things or less in it, don't split things to create a leaf node with less than 25 things in it.
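That stopping criterion is just this (a sketch, reusing the names from above):

```python
m = DecisionTreeRegressor(min_samples_leaf=25)   # don't create leaves with fewer than 25 rows
m.fit(xs, y)
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)   # training RMSE vs validation RMSE
```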

And now if we fit and look at the root mean squared error for the validation set, it goes down from 0.33 to 0.32. The training set's got worse, from zero to 0.248, the validation set's got better, and now we only have 12,000 leaf nodes. So that is much more reasonable.

Alright, so let's take a five minute break, and then we're going to come back and see how we get the best of both worlds — how we're going to get something which has the kind of flexibility of these really deep trees, but without overfitting.

And the trick will be to use something called bagging. We'll come back and talk about that in five minutes. Okay, welcome back. So we're going to look at how we can get the best of both worlds as we discussed and let's start by having a look at what we're doing with categorical variables first of all.

And so you might notice that previously, with categorical variables — for example in collaborative filtering — we had to think about things like how many embedding levels we have; or, if you've used other modeling tools, you might have done things like creating dummy variables, stuff like that.

For random forests on the whole, you don't have to. The reason is, as we've seen, all of our categorical variables have been turned into numbers. And so we can perfectly well have decision tree binary decisions which use those particular numbers. Now, the numbers might not be ordered in any interesting way, but if there's a particular level which kind of stands out as being important, it only takes two binary splits to split out that level into a single, you know, into a single piece.

So generally speaking, I don't normally worry too much about kind of encoding categorical variables in a special way. As I mentioned, I do try to encode ordinal variables by saying what the order of the levels is, because often, as you would expect, sizes, for example, you know, medium and small are going to mean kind of next to each other and large and extra large would be next to each other.

It's good to have those as similar numbers. Having said that, you can one-hot encode a categorical variable if you want to, using get_dummies in pandas. But there's not a lot of evidence that that actually helps — that's actually been studied in a paper. And so I would say, in general, for categorical variables don't worry about it too much. Just use what we've shown you. You have a question.

Just use what we've shown you. You have a question. For ordinal categorical variables, how do you deal with when they have like nA or missing values, where do you put that in the order? So in fast.ai, nA missing values always appear as the first item. They'll always be the zero index item.

And also if you get something in the validation or test set, which is a level we haven't seen in training, that will be considered to be that missing or nA value as well. All right, so what we're going to do to try and improve our random forest is we're going to use something called bagging.

This was developed by a retired Berkeley professor named Leo Breiman in 1994. And he did a lot of great work and perhaps you could argue that most of it happened after he retired. His technical report was called bagging predictors. And he described how you could create multiple versions of a predictor, so multiple different models.

And you could then aggregate them by averaging over the predictions. And specifically, the way he suggested doing this was to create what he called bootstrap replicates. In other words, randomly select different subsets of your data. Train a model on that subset, kind of store it away as one of your predictors, and then do it again a bunch of times.

And so each of these models is trained on a different random subset of your data. And then you, to predict, you predict on all of those different versions of your model and average them. And it turns out that bagging works really well. So this, the sequence of steps is basically randomly choose some subset of rows, train a model using that subset, save that model, and then return to step one.

Do that a few times to train a few models. And then to make a prediction, predict with all the models and take the average. That is bagging. And it's very simple, but it's astonishingly powerful. And the reason why is that each of these models we've trained, although they are not using all of the data, so they're kind of less accurate than a model that uses all of the data.

But each model's errors — the errors that come from using that smaller subset — are not correlated with the errors of the other models, because they're random subsets. And so when you take the average of a bunch of errors which are not correlated with each other, the average of those errors tends towards zero.

So therefore, the average of the models should give us an accurate prediction of the thing we're actually trying to predict. So as I say here, it's an amazing result. We can improve the accuracy of nearly any kind of algorithm by training it multiple times on different random subsets of data and then averaging the predictions.

So then Breiman in 2001 showed a way to do this specifically for decision trees, where not only did he randomly choose a subset of rows for each model, but for each binary split he also randomly selected a subset of columns. And this is called the random forest. It's perhaps the most widely used, most practically important machine learning method, and it's astonishingly simple.

To create a random forest regressor, you use sklearn's RandomForestRegressor. If you pass n_jobs=-1, it will use all of the CPU cores that you have to run as fast as possible. n_estimators says how many trees — how many models — to train. max_samples says how many randomly chosen rows to use in each one.

max_features is how many randomly chosen columns to consider for each binary split point. min_samples_leaf is the stopping criterion, which we'll come back to. So here's a little function that will create a random forest regressor and fit it to some set of independent variables and a dependent variable. So we can give it a few default values, create a random forest and train, and our validation set RMSE is 0.23.
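The little function is roughly this (close to the notebook's, with the lesson's default hyperparameter values):

```python
from sklearn.ensemble import RandomForestRegressor

def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True).fit(xs, y)

m = rf(xs, y)
m_rmse(m, valid_xs, valid_y)
```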

If we compare that to what we had before, we had 0.32. So dramatically better by using a random forest. So what's happened when we called random forest regressor is it's just using that decision tree builder that we've already seen, but it's building multiple versions with these different random subsets and for each binary split it does, it's also randomly selecting a subset of columns.

And then when we create a prediction, it is averaging the predictions of each of the trees. And as you can see it's giving a really great result. And one of the amazing things we'll find is that it's going to be hard for us to improve this very much, you know, the kind of the default starting point tends to turn out to be pretty great.

The sklearn docs have lots of good information in them. One of the things they have is this nice picture that shows, as you increase the number of estimators, how the error rate improves for different max_features levels. And in general, the more trees you add, the more accurate your model.

It's not going to overfit, right, because it's averaging more of these models that are trained on subsets of the data. So use as many estimators as you like; it's really just a case of how much time you have and whether you reach a point where it's not really improving anymore.

You can actually get at the underlying decision trees in a model, in a random forest model using estimators_. So with a list comprehension, we can call predict on each individual tree. And so here's an array, a numpy array containing the predictions from each individual tree for each row in our data.

So if we take the mean across the zero axis, we'll get exactly the same number. Because remember, that's what a random forest does, is it takes the mean of the trees, predictions. So one cool thing we could do is we could look at the 40 estimators we have and grab the predictions for the first i of those trees and take their mean and then we can find the root mean squared error.

And so in other words, here is the accuracy when you've just got one tree, two trees, three trees, four trees, five trees, etc. And you can see, so it's kind of nice, right? You can, you can actually create your own kind of build your own tools to look inside these things and see what's going on.
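Roughly in code (a hedged sketch, reusing `m`, `valid_xs`, `valid_y` and the `r_mse` helper from the earlier sketch):

```python
import numpy as np
import matplotlib.pyplot as plt

# One row of predictions per tree: shape (n_trees, n_validation_rows)
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])

r_mse(preds.mean(0), valid_y)   # same as r_mse(m.predict(valid_xs), valid_y)

# RMSE as we average over the first 1, 2, 3, ... trees
plt.plot([r_mse(preds[:i+1].mean(0), valid_y) for i in range(len(preds))])
```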

And so we can see here that as you add more and more trees, the accuracy did indeed keep improving or the root mean squared error kept improving, although the improvements slowed down after a while. The validation set is worse than the training set and there's a couple of reasons that could have happened.

The first reason could be because we're still overfitting, which is not necessarily a problem, it's just something we could identify. Or maybe it's because the fact that we're trying to predict the last two weeks is actually a problem: the last two weeks are kind of different to the other auctions in our dataset, maybe something changed over time.

So how do we tell which of those two reasons it is? What is the reason that our validation set is worse? We can actually find out using a very clever trick called out-of-bag error, OOB error. And we use OOB error for lots of things. You can grab the OOB predictions from the model with oob_prediction_ and compute the RMSE, and you can find that the OOB RMSE is 0.21, which is quite a bit better than 0.23.

So let me explain what OOB error is. What OOB error is, is we look at each row of the training set, not the validation set, each row of the training set and we say, so we say for row number one, which trees included row number one in the training?

And we'll say, okay, let's not use those for calculating the error because it was part of those trees training. So we'll just calculate the error for that row using the trees where that row was not included in training that tree. Because remember every tree is using only a subset of the data.

So we do that for every row. We find the prediction using only the trees where that row was not used. And those are the OOB predictions. In other words, this is like giving us a validation set result without actually needing a validation set. But the thing is, it's not with that time offset; it's not looking at the last two weeks, it's looking at the whole training set.

But this basically tells us how much of the error is due to overfitting versus due to being the last couple of weeks. So that's a cool trick. OOB error is something that very quickly kind of gives us a sense of how much we're, we're overfitting. And we don't even need a validation set to do it.
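Since the `rf` sketch earlier passed `oob_score=True`, sklearn has already stored an out-of-bag prediction for every training row, so getting the OOB RMSE is one line:

```python
r_mse(m.oob_prediction_, y)   # each training row predicted only by trees that never saw it
```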

So there's that OOB error. So that's telling us a bit about what's going on in our model. But then there's a lot of things we'd like to find out from our model. And I've got five things in particular here which I generally find pretty interesting. Which is, how confident are we about our predictions for some particular prediction we're making?

Like we can say this is what we think the prediction is, but how confident are we? Is it exactly that, or just about that, or do we really have no idea? And then for predicting a particular item, which factors were the most important in that prediction and how did they influence it?

Overall, which columns are making the biggest difference to our predictions? Which ones could we maybe throw away and it wouldn't matter? Which columns are basically redundant with each other, so we don't really need both of them? And as we vary some column, how does it change the prediction? So those are the five things that I'm interested in figuring out, and we can do all of those things with a random forest.

Let's start with the first one. So the first one, we've already seen that we can grab all of the predictions for all of the trees and take their mean to get the actual predictions of the model and then to get the RMSE. But what if instead of saying mean, we did exactly the same thing like so, but instead said standard deviation.

This is going to tell us for every row in our dataset, how much did the trees vary? And so if our model really had never seen kind of data like this before, it was something where, you know, different trees were giving very different predictions. It might give us a sense that maybe this is something that we're not at all confident about.

And as you can see, when we look at the standard deviation of the trees for each prediction, let's just look at the first five. They vary a lot, right, 0.2, 0.1, 0.09, 0.3, okay? So this is a really interesting, it's not something that a lot of people talk about, but I think it's a really interesting approach to kind of figuring out whether we might want to be cautious about a particular prediction because maybe we're not very confident about it.
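Using the same stacked per-tree predictions from the earlier sketch, that per-row confidence measure is just the standard deviation across the tree axis:

```python
preds_std = preds.std(0)   # one value per auction: how much the trees disagree
preds_std[:5]
```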

So that's one thing we can easily do with a random forest. The next thing, and this is I think the most important thing for me in terms of interpretation, is feature importance. Here's what feature importance looks like. We can get the feature importances from a model trained on some independent variables.

Let's say grab the first 10. This says these are the 10 most important features in this random forest. These are the things that are the most strongly driving sale price or we could plot them. And so you can see here, there's just a few things that are by far the most important.

What year the equipment was made, bulldozer or whatever. How big is it? Coupler system, whatever that means, and the product class, whatever that means. And so you can get this by simply looking inside your trained model and grabbing the feature_importances_ attribute. And here, to make it nicer to print out, I'm just sticking that into a data frame and sorting descending by importance.
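This is the sort of helper being described; the notebook has one very much like it, as far as I recall. `m` and `xs` are the model and training frame from the earlier sketches.

```python
import pandas as pd

def rf_feat_importance(m, df):
    "Feature importances as a sorted data frame."
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                       ).sort_values('imp', ascending=False)

fi = rf_feat_importance(m, xs)
fi[:10]                                                             # ten most important columns
fi[:30].plot('cols', 'imp', 'barh', figsize=(12, 7), legend=False)  # the chart of the first 30
```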

So how is this actually being done? It's actually really neat. What scikit-learn does, and what Breiman, the inventor of random forests, described, is that you can go through each tree, start at the top of the tree, and look at each branch, and at each branch see which column the binary split was based on.

And then how much better was the model after that split compared to beforehand. And we basically then say, okay, that column was responsible for that amount of improvement. And so you add that up across all of the splits, across all of the trees for each column, and then you normalize it so they all add to one.

And that's what gives you these numbers, which we show the first few of them in this table and the first 30 of them here in this chart. So this is something that's fast and it's easy and it kind of gives us a good sense of like, well, maybe the stuff that are less than 0.005 we could remove.

So if we did that, that would leave us with only 21 columns. So let's try that. Let's take the x's which are important, that is, just the columns in that list of ones to keep, do the same for the validation set, retrain our random forest, and have a look at the result.

And basically our accuracy is about the same, but we've gone down from 78 columns to 21 columns. So I think this is really important. It's not just about creating the most accurate model you can, but you want to kind of be able to fit it in your head as best as possible.

And so 21 columns is going to be much easier for us to check for any data issues and understand what's going on. And the accuracy is about the same, or the RMSE. So I would say, okay, let's do that. Let's just stick with x's important from now on. And so here's this entire set of the 21 features.
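A sketch of that filtering step, reusing the `fi` importance table and the `rf`/`m_rmse` helpers from the earlier sketches; the 0.005 threshold is the one mentioned above.

```python
to_keep = fi[fi.imp > 0.005].cols   # roughly the 21 columns that survive
xs_imp       = xs[to_keep]
valid_xs_imp = valid_xs[to_keep]

m = rf(xs_imp, y)
m_rmse(m, xs_imp, y), m_rmse(m, valid_xs_imp, valid_y)   # accuracy stays about the same
```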

And you can see it looks now like year made and product size are the two really important things. And then there's a cluster of mainly product related things that are at the next level of importance. One of the tricky things here is that we've got things like a product class description, model ID, secondary description, model description, base model,

and the model descriptor. So they all look like they might be similar ways of saying the same thing. So one thing that can help us to interpret the feature importance better and understand better what's happening in the model is to remove redundant features. One way to do that is to call fast.ai's cluster_columns, which is basically a thin wrapper for stuff that scipy already provides.

And what that's going to do is find pairs of columns which are very similar. So you can see here sale year and sale elapsed: see how this line is way out to the right, whereas machine ID and model ID are not similar at all, that one's way out to the left.

So that means that sale year and sale elapsed are very, very similar. When one is low, the other tends to be low and vice versa. Here's a group of three which all seem to be much the same, and then product group desc and product group, and then fiBaseModel and fiModelDesc.

But these all seem like things where maybe we could remove one of each of these pairs, because they basically seem to be much the same: when one is high, the other is high and vice versa. So let's try removing one of each of these. Now, it takes a little while to train a random forest.
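If you want to see roughly what a cluster_columns-style helper is doing under the hood, here's a hedged sketch using scipy directly: rank correlations between columns, turned into a distance, then hierarchical clustering drawn as a dendrogram.

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as hc
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

corr = np.round(spearmanr(xs_imp).correlation, 4)   # rank correlation between every pair of columns
dist = squareform(1 - corr)                         # similarity -> condensed distance matrix
z = hc.linkage(dist, method='average')
hc.dendrogram(z, labels=list(xs_imp.columns), orientation='left')
plt.show()
```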

And so, just to see whether removing something makes it much worse, we can do a very fast version. We can train something where we only use 50,000 rows per tree and just 40 trees, and then just get the OOB score. And for that fast, simple version, our baseline OOB with our important x's is 0.877.

And here for OOB, a higher number is better. So then let's try going through each of the things we thought we might not need and try dropping them and then getting the OOB error for our x's with that one column removed. And so compared to 877, most of them don't seem to hurt very much.

Sale elapsed hurts it quite a bit, right? So for each of those groups, let's go and see which of them looks like we could remove it. So here's the five I found. Let's remove the whole lot and see what happens. And so the OOB went from 877 to 874, so hardly any difference at all, despite the fact we managed to get rid of five of our variables.
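Here's a sketch of that quick check: a small, fast forest whose OOB score (an R² on the out-of-bag rows, so higher is better) we compare with and without each candidate column. The column names in the loop are my best guess at the similar-looking pairs and may not match the notebook exactly.

```python
from sklearn.ensemble import RandomForestRegressor

def get_oob(xs, y):
    "A quick, rough forest just for comparing OOB scores."
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15,
                              max_samples=50_000, max_features=0.5,
                              n_jobs=-1, oob_score=True)
    return m.fit(xs, y).oob_score_

get_oob(xs_imp, y)   # baseline, about 0.877 in the lesson

{c: get_oob(xs_imp.drop(c, axis=1), y)
 for c in ('saleYear', 'saleElapsed', 'ProductGroupDesc', 'ProductGroup',
           'fiModelDesc', 'fiBaseModel')}
```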

So let's create something called xs_final, which is the important x's with those five columns dropped, and save it for later; we can always load it back again. And then let's check our random forest using those, and again it's 0.233 or 0.234.

So we're getting a simpler and simpler model without hurting our accuracy, which is great. So the next thing we said we were interested in learning about is, particularly for the columns that are most important, what's the relationship between that column and the dependent variable?

So for example, what's the relationship between product size and sale price? So the first thing I would do would be just to look at a histogram. One way to do that is with value counts in pandas. And we can see here our different levels of product size. And one thing to note here is that missing is actually the most common.

And then next most common are compact and small, and then mini is pretty tiny. So we can do the same thing for year made. Now, for year made we can't just use a basic bar chart of value counts; we actually need a histogram, and pandas has this kind of thing built in, so we can just call hist.

And that 1950, you remember we created it, that's kind of this missing value thing that used to be a thousand. But most of them seem to have been well into the 90's and 2000's.
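For reference, those two quick looks are one-liners in pandas; 'ProductSize' and 'YearMade' are the column names as I recall them, and `valid_xs_final` is the assumed name of the validation frame after the columns dropped above.

```python
valid_xs_final['ProductSize'].value_counts(sort=False).plot.barh()   # levels of product size
valid_xs_final['YearMade'].hist()                                    # pandas histogram
```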

Here is a partial dependence plot of sale price against year made. What does this mean? Well, we should focus on the part where we actually have a reasonable amount of data, so at least well into the 80's, around here. And so let's look at this bit here. Basically what this says is that as year made increases, the predicted sale price, log sale price of course, also increases.

You can see the log sale price is increasing roughly linearly, and since it's a log, this is actually an exponential relationship between year made and sale price. Why do we call it a partial dependence? Are we just plotting the year against the average sale price?

Well no, we're not. We can't do that because a lot of other things change from year to year. For example, maybe more recently people tend to buy bigger bulldozers, or more bulldozers with air conditioning, or more expensive models of bulldozers. And we really want to be able to say: no, just what's the impact of year and nothing else?

And if you think about it from a kind of an inflation point of view, you would expect that older bulldozers would be kind of, that bulldozers would get kind of a constant ratio cheaper the further you go back, which is what we see. So what we really want to say is all other things being equal, what happens if only the year changes?

And there's a really cool way we can answer that question with a random forest. So how does year made impact sale price? All other things being equal. So what we can do is we can go into our actual data set and replace every single value in the year made column with 1950 and then calculate the predicted sale price for every single auction and then take the average over all the auctions.

And that's what gives us this value here. And then we can do the same from 1951, 1952 and so forth until eventually we get to our final year of 2011. So this isolates the effect of only year made. So it's a kind of a bit of a curious thing to do, but it's actually, it's a pretty neat trick for trying to kind of pull apart and create this partial dependence to say what might be the impact of just changing year made.
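Here's a hedged sketch of exactly that procedure (not necessarily how the notebook draws its plot, which I believe uses sklearn's partial dependence tooling): force every row's YearMade to each candidate year, predict, and average.

```python
import numpy as np
import matplotlib.pyplot as plt

def partial_dependence(m, xs, col, values):
    "Average prediction when `col` is forced to each value in `values`."
    xs = xs.copy()
    avgs = []
    for v in values:
        xs[col] = v                        # overwrite the whole column with a single value
        avgs.append(m.predict(xs).mean())  # average predicted log sale price
    return np.array(avgs)

years = np.arange(1950, 2012)
plt.plot(years, partial_dependence(m, valid_xs_final, 'YearMade', years))
```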

And we can do the same thing for product size. And one of the interesting things if we do it for product size is we see that the lowest value of predicted log sale price is for NA, which is a bit of a worry, because it means the question of whether or not the product size is labelled is really important.

And that is something that I would want to dig into before I actually use this model, to find out why it is that sometimes things aren't labelled, what that means, and why it's actually such an important predictor. So that is the partial dependence plot, and it's a really clever trick.

So we have looked at four of the five questions we said we wanted to answer at the start of this section. So the last one that we want to answer is one here. We're predicting with a particular row of data what were the most important factors and how did they influence that prediction.

This is quite related to the very first thing we saw. So it's like imagine you were using this auction price model in real life. You had something on your tablet and you went into some auction and you looked up what the predicted auction price would be for this lot that's coming up to find out whether it seems like it's being under or overvalued and then you can decide what to do about that.

So one thing we said we'd be interested to know is like well are we actually confident in our prediction and then we might be curious to find out like oh I'm really surprised it was predicting such a high value. Why was it predicting such a high value? So to find the answer to that question, we can use a module called TreeInterpreter.

And TreeInterpreter, the way it works is that you pass in a single row. So it's like: here's the auction that's coming up, here's the model, here's the auctioneer ID, etcetera, etcetera; please predict the value from the random forest, what's the expected sale price. And then what we can do is take that one row of data and put it through the first decision tree, and see what's the first split that's selected, and then, based on that split, does it end up increasing or decreasing the predicted price compared to that raw baseline of just taking the average. And then you can do that again at the next split, and again at the next split, and again at the next split.

So for each split, we see what the increase or decrease in the prediction is compared to the parent node. And so then you can do that for every tree, and then add up the total change by split variable, and that allows you to draw something like this.

So here's something that's looking at one particular row of data. Overall we start at zero, and zero represents the initial 10.1. Remember, this number 10.1 is the average log sale price of the whole data set; they call it the bias. So we call that zero, and then for this particular row we're looking at, year made has a negative 4.2 impact on the prediction, and then product size has a positive 0.2, coupler system has a positive 0.046, model ID has a positive 0.127, and so forth, right.

And so the red ones are negative and the green ones are positive, and you can see how they all join up until eventually, overall, the prediction is that it's going to be negative 0.122 compared to 10.1, which is equal to 9.98. So this kind of plot is called a waterfall plot. And so basically, when we call tree interpreter dot predict, it gives us back the prediction, which is the actual number we get back from the random forest; the bias, which is just always this 10.1 for this data set; and then the contributions, which is all of these different values.

It's how important was each factor, and here I've used a threshold, which means anything that was less than 0.08 all gets thrown into this "other" category. I think this is a really useful kind of thing to have in production, because it can help you answer questions, whether for the customer or for whoever's using your model, if they're surprised about some prediction: why did it make that prediction?
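A sketch of the calls being described, using the treeinterpreter and waterfallcharts packages (which is what I believe the notebook uses); the choice of row is just illustrative.

```python
from treeinterpreter import treeinterpreter
from waterfall_chart import plot as waterfall

row = valid_xs_final.iloc[:1]   # one auction to explain
prediction, bias, contributions = treeinterpreter.predict(m, row.values)
# prediction = bias + contributions.sum(); bias is the dataset-wide average (~10.1 here)
waterfall(valid_xs_final.columns, contributions[0], threshold=0.08,
          rotation_value=45, formatting='{:,.3f}')
```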

So I'm going to show you something really interesting using some synthetic data and I want you to really have a think about why this is happening before I tell you and I pause the video if you're watching the video when I get to that point. Let's start by creating some synthetic data like so.

So we're going to grab 40 values evenly spaced between 0 and 20, and then we're just going to create the y = x line and add some normally distributed random noise to it. Here's the plot. So here's some data we want to try and predict, and we're going to use a random forest, which is a bit of overkill here.

Now, in this case we only have one independent variable, and scikit-learn expects more than one dimension. So we can use unsqueeze in PyTorch to add that, to go from a shape of 40, in other words a vector with 40 elements, to a shape of 40 by 1, in other words a matrix of 40 rows with one column.

So this unsqueeze(1) means add a unit axis here. I don't use unsqueeze very often because I actually generally prefer to index with the special value None. This works in PyTorch and numpy, and the way it works is to say: okay, x_lin, remember, is a vector of length 40; take every row, and then None means insert a unit axis here for the column.

So these are two ways of doing the same thing, but this one is a little bit more flexible, so that's what I use more often. But now that we've got the shape that is expected, which is a rank-2 tensor, an array with two dimensions or axes, we can create a random forest and fit it, and let's just use the first 30 data points, so kind of stop here.

And then let's do a prediction. So let's plot the original data points and then also plot the prediction, and look what happens: the prediction is kind of nice and accurate, and then suddenly... what happens? So this is the bit where, if you're watching the video, I want you to pause and have a think about why it goes flat.
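Here's a sketch of that experiment so you can reproduce the flat tail yourself; the exact noise level is an assumption.

```python
import torch
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor

x_lin = torch.linspace(0, 20, steps=40)
y_lin = x_lin + torch.randn_like(x_lin)   # y = x plus normally distributed noise
xs_lin = x_lin[:, None]                   # same as x_lin.unsqueeze(1): shape (40, 1)

m_lin = RandomForestRegressor().fit(xs_lin[:30], y_lin[:30])   # train on the first 30 points only

plt.scatter(x_lin, y_lin, 20)
plt.scatter(x_lin, m_lin.predict(xs_lin), color='red', alpha=0.5)  # flattens past the training range
```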

So what's going on here well remember a random forest is just taking the average of predictions of a bunch of trees and a tree the prediction of a tree is just the average of the values in a leaf node and remember we fitted using a training set containing only the first 30.

So none of these appeared in the training set, so the highest we could get would be the average of values that are inside the training set. In other words, there's this maximum you can get to. So random forests cannot extrapolate outside of the bounds of the data in their training set.

This is going to be a huge problem for things like time series prediction where there's an underlying trend, for instance. But really it's a more general issue than just time variables. It's going to be hard, or often impossible, for random forests to extrapolate outside the types of data they have seen, in a general sense.

So we need to make sure that our validation set does not contain out of domain data. So how do we find out of domain data? So we might not even know our test set is distributed in the same way as our training data. So if they're from two different time periods how do you kind of tell how they vary, right?

Or if it's a Kaggle competition how do you tell if the test set and the training set which Kaggle gives you have some underlying differences? There's actually a cool trick you can do which is you can create a column called is_valid which contains 0 for everything in the training set and 1 for everything in the validation set.

And it's concatenating all of the independent variables together. So it's concatenating the independent variables for both the training and validation set together. So this is our independent variable and this becomes our dependent variable. And we're going to create a random forest not for predicting price but a random forest that predicts is this row from the validation set or the training set.

So if the validation set and the training set are from kind of the same distribution if they're not different then this random forest should basically have zero predictive power. If it has any predictive power then it means that our training and validation set are different. And to find out the source of that difference we can use feature importance.
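A sketch of that trick, reusing the rf_feat_importance helper from earlier; `xs_final` and `valid_xs_final` are the assumed names of the final training and validation frames.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df_dom   = pd.concat([xs_final, valid_xs_final])                  # all the independent variables
is_valid = np.array([0] * len(xs_final) + [1] * len(valid_xs_final))

m_dom = RandomForestRegressor(n_estimators=40, max_samples=50_000, n_jobs=-1)
m_dom.fit(df_dom, is_valid)

rf_feat_importance(m_dom, df_dom)[:6]   # columns that best separate training from validation rows
```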

And so you can see here that the difference between the validation set and the training set is not surprisingly sale elapsed. So that's the number of days since I think like 1970 or something. So it's basically the date. So yes of course you can predict whether something is in the validation set or the training set by looking at the date because that's actually how you find them.

That makes sense. This is interesting: sales ID. So it looks like the sales ID is not some random identifier, but it increases over time. And ditto for machine ID. And then there's some other smaller ones here that kind of make sense. So I guess for something like model desc, there are certain models that were only made in later years, for instance.

But you can see these top three columns are a bit of an issue. So then we could say like okay what happens if we look at each one of those columns those first three and remove them and then see how it changes our RMSE on our sales price model on the validation set.

So we start from point 232 and removing sales ID actually makes it a bit better. Sale elapsed makes it a bit worse, machine ID about the same. So we can probably remove sales ID and machine ID without losing any accuracy and yep it's actually slightly improved. But most importantly it's going to be more resilient over time right because we're trying to remove the time related features.
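And a sketch of that drop-one-at-a-time comparison, again reusing the `rf` and `m_rmse` helpers; the three column names are as I recall them and may differ in your data frame.

```python
m = rf(xs_final, y)
print('orig', m_rmse(m, valid_xs_final, valid_y))

for c in ('SalesID', 'saleElapsed', 'MachineID'):
    m_c = rf(xs_final.drop(c, axis=1), y)
    print(c, m_rmse(m_c, valid_xs_final.drop(c, axis=1), valid_y))
```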

Another thing to note is that, since it seems this sale elapsed issue might be making a big difference, maybe we should look at the sale year distribution; this is the histogram. Most of the sales are in the last few years anyway. So what happens if we only include the most recent few years?

So let's just include everything after 2004. So that is the filtered x's. And if I train on that subset, then my accuracy improves a bit more, from 331 to 330. So that's interesting, right? We're actually using less data, fewer rows, and getting a slightly better result, because the more recent data is more representative.

So that's about as far as we can get with our random forest. But what I will say is this: this issue of extrapolation would not happen with a neural net, would it, because the underlying layers of a neural net are linear layers. And linear layers can absolutely extrapolate.

So the obvious thing to think at this point is: well, maybe a neural net would do a better job of this. That's what we'll try next. But first, a question: how does feature importance relate to correlation? Feature importance doesn't particularly relate to correlation.

Correlation is a concept for linear models and this is not a linear model. So remember feature importance is calculated by looking at the improvement in accuracy as you go down each tree and you go down each binary split. If you're used to linear regression then I guess correlation sometimes can be used as a measure of feature importance.

But this is a much more direct version that's taking account of these non-linearities and interactions and so on as well. So feature importance is generally a much more flexible and reliable measure. Any more questions? So now I'll do the same thing with a neural network. I'm going to just copy and paste the same lines of code that I had from before, but this time I'll call the data frame df_nn; these are the same lines of code.

And I'll grab the same list of columns we had before, plus the dependent variable, to get the same data frame. Now, as we've discussed, for categorical columns we probably want to use embeddings. So to create embeddings we need to know which columns should be treated as categorical variables. And as we've discussed, we can use cont_cat_split for that.

One of the useful things we can pass it is the maximum cardinality. So max_card=9000 means if there's a column with more than 9000 levels, you should treat it as continuous, and if it's got less than 9000 levels, it's categorical. So it's a simple little function that just checks the cardinality and splits them based on how many discrete levels they have.

And of course the data type: if it's not actually a numeric data type, it has to be categorical. So there's our split. And then from there, what we can do is say: oh, we've got to be a bit careful of saleElapsed, because saleElapsed I think has less than 9000 distinct values, but we definitely don't want to use it as a categorical variable.

The whole point was to make it something that we can extrapolate. Certainly anything that's time dependent, or where we think we might see values outside the range of inputs in the training data, we should make a continuous variable. So let's take saleElapsed, put it in the continuous list for the neural net, and remove it from the categorical list.
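A sketch of that step; `df_nn_final`, the dependent variable name, and 'saleElapsed' are assumed to match the earlier notebook code.

```python
from fastai.tabular.all import cont_cat_split

cont_nn, cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var='SalePrice')
cont_nn.append('saleElapsed')   # treat the date-like column as continuous...
cat_nn.remove('saleElapsed')    # ...so the network has a chance to extrapolate over it

df_nn_final[cat_nn].nunique()   # cardinality of each remaining categorical column
```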

So here's the number of unique levels (this is from pandas) for each of the categorical variables in our neural net data set. And I get a bit nervous when I see these really high numbers, so I don't want to have too many things with lots and lots of categories.

The reason I don't want lots of things with lots and lots of categories is just that they're going to take up a lot of parameters, because every one of these levels is a row in an embedding matrix. In this case I notice model ID and model desc might be describing something very similar.

So I'd quite like to find out if I could get rid of one and an easy way to do that would be to use a random forest. So let's try removing the model desk and let's create a random forest and let's see what happens and oh it's actually a tiny bit better and certainly not worse.

So that suggests that we can actually get rid of one of these levels or one of these variables. So let's get rid of that one and so now we can create a tabular pandas object just like before. But this time we're going to add one more processor which is normalize.

And the reason we need normalize, so normalize is subtract the mean divide by the standard deviation. We didn't need that for a random forest because for a random forest we're just looking at less than or greater than through our binary splits. So all that matters is the order of things, how they're sorted, it doesn't matter whether they're super big or super small.

But it definitely matters for neural nets because we have these linear layers. So we don't want to have you know things with kind of crazy distributions with some super big numbers and super small numbers because it's not going to work. So it's always a good idea to normalize things in neural nets so we can do that in a tabular neural net by using the normalize tabular proc.

So we can do the same thing that we did before with creating our tabular pandas tabular object for the neural net. And then we can create data loaders from that with a batch size. And this is a large batch size because tabular models don't generally require nearly as much GPU RAM as a convolutional neural net or something or an RNN or something.

Since it's a regression model, we're going to want a y range. So let's find the minimum and maximum of our dependent variable. And we can now go ahead and create a tabular learner. Our tabular learner is going to take our data loaders, our y range, and how many activations you want in each of the linear layers.

And so you can have as many linear layers as you like here. How many outputs are there? This is a regression with a single output. And what loss function do you want? We can use lr_find, and then we can go ahead and use fit_one_cycle. There's no pre-trained model, obviously, because this is not something where people have got pre-trained models for industrial equipment auctions.

So we just use fit_one_cycle and train for a minute. And then we can check, and our RMSE is 0.226, versus the 0.230 we had before. So that's amazing. We actually have, straight away, a better result than the random forest. It's a little more fussy and it takes a little bit longer.
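Here's a hedged sketch of that neural-net pipeline; the names follow the notebook where I remember them (procs, TabularPandas, tabular_learner), otherwise they're assumptions, and `splits` is the date-based train/validation split from earlier. `r_mse` is the helper from the earlier sketches.

```python
from fastai.tabular.all import (TabularPandas, Categorify, FillMissing, Normalize,
                                tabular_learner)
import torch.nn.functional as F

procs_nn = [Categorify, FillMissing, Normalize]   # Normalize matters once we have linear layers
to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn,
                      splits=splits, y_names='SalePrice')
dls = to_nn.dataloaders(1024)                     # a big batch size is fine for tabular data

y = to_nn.train.y
learn = tabular_learner(dls, y_range=(y.min(), y.max()),   # clamp the regression output
                        layers=[500, 250], n_out=1, loss_func=F.mse_loss)
learn.lr_find()
learn.fit_one_cycle(5, 1e-2)

preds, targs = learn.get_preds()
r_mse(preds, targs)
```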

But as you can see, you know, for interesting datasets like this, we can get some great results with neural nets. So here's something else we could do though. The random forest and the neural net, they each have their own pros and cons. There's some things they're good at and there's some they're less good at.

So maybe we can get the best of both worlds. And a really easy way to do that is to use ensembling. We've already seen that a random forest is itself an ensemble of decision trees, but now we can put that into another ensemble: we can have an ensemble of the random forest and a neural net.

There's lots of super fancy ways you can do that. But a really simple way is to take the average. So sum up the predictions from the two models, divide by two, and use that as prediction. So that's our ensemble prediction is just literally the average of the random forest prediction and the neural net prediction.
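A sketch of that averaging; `m` is the final random forest, `learn` the tabular learner, and the helpers come from the earlier sketches. One detail: the neural net's predictions come back as a column tensor, so we squeeze and convert them before averaging, and we assume both validation sets hold the same rows in the same order.

```python
rf_preds = m.predict(valid_xs_final)                # numpy array from the random forest
nn_preds = learn.get_preds()[0].squeeze().numpy()   # tensor from the neural net -> numpy

ens_preds = (rf_preds + nn_preds) / 2
r_mse(ens_preds, valid_y)
```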

And that gives us 0.223 versus 0.226. So how good is that? Well it's a little hard to say because unfortunately this competition is old enough that we can't even submit to it and find out how we would have gone on Kaggle. So we don't really know and so we're relying on our own validation set.

But it's quite a bit better than even the first place score on the test set. So if the validation set is you know doing good job then this is a good sign that this is a really really good model. Which wouldn't necessarily be that surprising because you know in the last few years I guess we've learned a lot about building these kinds of models.

And we're kind of taking advantage of a lot of the tricks that have appeared in recent years. And yeah, maybe this goes to show, well, I think it certainly goes to show, that both random forests and neural nets have a lot to offer, so try both, and maybe even combine both.

We've talked about an approach to ensembling called bagging, which is where we train lots of models on different subsets of the data and take the average of their predictions. Another approach to ensembling, particularly ensembling of trees, is called boosting. And boosting involves training a small model which underfits your data set, maybe by just having a very small number of leaf nodes.

And then you calculate the predictions using that small model, and you subtract the predictions from the targets. So these are kind of like the errors of your small underfit model; we call them residuals. And then go back to step one, but now, instead of using the original targets, use the residuals.

Then train a small model which underfits your data set, attempting to predict the residuals. Then do that again and again until you reach some stopping criterion, such as the maximum number of trees. That will leave you with a bunch of models, which you don't average but which you sum.

Because each one is a model trained on the residuals of the previous one, and we keep subtracting each new tree's predictions from those residuals, the residuals get smaller and smaller. And then to make predictions we just have to do the opposite, which is to add them all together.
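To make the loop concrete, here's a minimal boosting sketch with shallow sklearn trees; real gradient boosting libraries add a learning rate, regularization and much more, so treat this purely as an illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(xs, y, n_trees=20, max_leaf_nodes=4):
    "Train a sequence of small trees, each on the residuals left by the previous ones."
    trees, residual = [], y.copy()
    for _ in range(n_trees):
        t = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes).fit(xs, residual)
        residual = residual - t.predict(xs)   # what's still unexplained after this tree
        trees.append(t)
    return trees

def boost_predict(trees, xs):
    return np.sum([t.predict(xs) for t in trees], axis=0)   # sum the trees, don't average
```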

So there's lots of variants of this, but you'll see things like GBMs for gradient boosted machines or GBDTs for gradient boosted decision trees. And there's lots of details, some minor and some significant, but the basic idea is what I've shown. All right, let's take the questions. Dropping features in a model is a way to reduce the complexity of the model and thus reduce overfitting.

Is this better than adding some regularization like weight decay? I didn't claim that we removed columns to avoid overfitting. We removed the columns to simplify things, so there are fewer things to analyze. It should also mean we don't need as many trees, but there's no particular reason to believe that this will regularize.

And the idea of regularization doesn't necessarily make a lot of sense for random forests; you can always add more trees. Is there a good heuristic for picking the number of linear layers in the tabular model? Not really. Well, if there is, I don't know what it is. I guess two or three hidden layers works pretty well.

So you know, those numbers I showed are pretty good for a large-ish model. By default it uses 200 and 100, so maybe start with the default, then go up to 500 and 250 and see if that's an improvement, and just keep doubling them until it stops improving or you run out of memory or time.

The main thing to note about boosted models is that there's nothing to stop us from overfitting. If you add more and more trees to a bagging model such as a random forest, it should generalize better and better, because each time you're adding a new model which is based on a subset of the data.

But with boosting, each model fits the training set better and better, so you gradually overfit more and more. So boosting methods generally require more hyperparameter tuning and fiddling around; you certainly can use regularization with boosting. They're pretty sensitive to their hyperparameters, which is why they're not normally my first go-to, but they win Kaggle competitions more often than random forests do; they tend to be good at getting that last little bit of performance.

So the last thing I'm going to mention is something super neat which a lot of people don't seem to know exists. It's something from the entity embeddings paper, a table from it, where what they did was they built a neural network, they got the entity embeddings (EE),

and then they tried a random forest using the entity embeddings as predictors, rather than the approach I described with just the raw categorical variables. And the error for a random forest went from 0.16 to 0.11, a huge improvement from a very simple change. And even a very simple method, KNN, went from 0.29 to 0.11.

Basically all of the methods, when they used entity embeddings, suddenly improved a lot. The one thing you should try, if you have a look at the further research section after the questionnaire, is it asks you to do exactly this: take those entity embeddings that we trained in the neural net and use them in the random forest, and then maybe try ensembling again and see if you can beat the 0.223 that we had.
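If you want to attempt that further-research exercise, here's one hedged way you might wire it up. It leans on fastai internals (learn.model.embeds and the integer codes stored by Categorify) as I understand them, and every name here comes from the earlier sketches, so treat it as a starting point rather than a recipe.

```python
import pandas as pd
import torch
from sklearn.ensemble import RandomForestRegressor

def embed_features(learn, to, xs):
    "Replace each categorical column's integer codes with its learned embedding vector."
    xs = xs.copy()
    for i, col in enumerate(to.cat_names):
        emb = learn.model.embeds[i]   # nn.Embedding for this column (model assumed on CPU)
        vecs = emb(torch.tensor(xs[col].values, dtype=torch.long)).detach().numpy()
        cols = [f'{col}_emb{j}' for j in range(vecs.shape[1])]
        xs = xs.drop(columns=col).join(pd.DataFrame(vecs, index=xs.index, columns=cols))
    return xs

xs_emb       = embed_features(learn, to_nn, to_nn.train.xs)
valid_xs_emb = embed_features(learn, to_nn, to_nn.valid.xs)

m_emb = RandomForestRegressor(n_jobs=-1).fit(xs_emb, to_nn.train.y)
r_mse(m_emb.predict(valid_xs_emb), to_nn.valid.y)
```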

This is a really nice idea: you get all the benefits of boosted decision trees plus all of the nice features of entity embeddings, and this is something that not enough people seem to be playing with for some reason. So overall, random forests are nice and easy to train. They're very resilient, they don't require much pre-processing, they train quickly and they don't overfit. They can be a little less accurate, and they can be a bit slow at inference time, because at inference you have to go through every one of those trees.

Having said that, a binary tree can be pretty heavily optimized, so it is something you can basically create a totally compiled version of, and the trees can certainly also be run entirely in parallel, so that's something to consider. Gradient boosting machines are also fast to train on the whole, but a little more fussy about hyperparameters; you have to be careful about overfitting, but they're a bit more accurate.

Neural nets may be the fussiest to deal with. They've got the fewest rules of thumb or tutorials around saying this is how to do it; they're just a bit newer and a little bit less well understood. But they can give better results in many situations than the other two approaches, or at least, in an ensemble, can improve the other two approaches.

So I would always start with a random forest and then see if you can beat it using these. So yeah, why don't you now see if you can find a Kaggle competition with tabular data, whether it's running now or it's a past one, and see if you can repeat this process for that and get in the top 10% of the private leaderboard. That would be a really great stretch goal at this point.

Implement the decision tree algorithm yourself; I think that's an important one to really understand it. Then from there, create your own random forest from scratch; you might be surprised, it's not that hard. And then go and have a look at the tabular model source code. At this point, and this is pretty exciting, you should find you pretty much know what all the lines do, with two exceptions, and if you don't, dig around, explore and experiment, and see if you can figure it out.

And with that, I am very excited to say, we are at a point where we've really dug all the way into these really valuable, effective fastai applications, and we're understanding what's going on inside them. What should we expect for next week? Next week we will look at NLP and computer vision, and we'll do the same kind of thing: delve deep to see what's going on.

Thanks everybody see you next week.