Lesson 6: Practical Deep Learning for Coders 2022
Chapters
0:00 Review
2:09 TwoR model
4:43 How to create a decision tree
7:02 Gini
10:54 Making a submission
15:52 Bagging
19:06 Random forest introduction
20:09 Creating a random forest
22:38 Feature importance
26:37 Adding trees
29:32 What is OOB
32:08 Model interpretation
35:47 Removing the redundant features
35:59 What does Partial dependence do
39:22 Can you explain why a particular prediction is made
46:07 Can you overfit a random forest
49:03 What is gradient boosting
51:56 Introducing walkthrus
54:28 What does fastkaggle do
62:52 fastcore.parallel
64:12 item_tfms=Resize(480, method='squish')
66:20 Fine-tuning project
67:22 Criteria for evaluating models
70:22 Should we submit as soon as we can
75:15 How to automate the process of sharing kaggle notebooks
80:17 AutoML
84:16 Why the first model run so slow on Kaggle GPUs
87:53 How much better can a new novel architecture improve the accuracy
88:33 Convnext
91:10 How to iterate the model with padding
92:01 What does our data augmentation do to images
94:12 How to iterate the model with larger images
96:08 pandas indexing
98:16 What data-augmentation does tta use?
00:00:02.000 |
Welcome back to lesson six — or welcome to lesson six if it's your first time. Welcome back to Practical Deep Learning for Coders 00:00:21.560 |
for those of you who've forgotten what we did was we 00:00:31.720 |
And we were looking at creating binary splits 00:00:39.240 |
Categorical variables or binary variables like sex 00:00:46.600 |
Continuous variables like the log of the fare that they paid and 00:00:55.200 |
Using those, you know, we also kind of came up with a score 00:00:59.120 |
Which was basically how good a job did that split do of grouping the 00:01:06.880 |
Survival characteristics into two groups — nearly all of one of whom survived, and nearly all of the other of whom didn't survive 00:01:14.640 |
So they had like small standard deviation in each group 00:01:20.080 |
So then we created the world's simplest little UI to allow us to fiddle around and try to find a good split, 00:01:33.640 |
Which was on sex, and actually we created this little 00:01:39.200 |
Automated version. And so this is, I think — well, not quite the first time. No, this is 00:01:45.800 |
This is yet another time, I should say, that we have successfully created a 00:01:52.760 |
Machine learning algorithm from scratch. This one is about the world's simplest one: it's 1R, 00:01:56.920 |
Creating the single rule which does a good job of splitting your data set into two parts 00:02:04.240 |
Which differ as much as possible on the dependent variable. 00:02:08.080 |
1R is probably not going to cut it for a lot of things though — 00:02:13.520 |
It's surprisingly effective, but maybe we could go a step further 00:02:18.760 |
And the step further we could go is we could create like a 2R. What if we took each of those 00:02:24.000 |
groups males and females in the Titanic data set and 00:02:29.040 |
Split each of those into two other groups. So split the males into two groups and split the females into two groups 00:02:39.040 |
To do that we can repeat the exact same piece of code we just did but let's remove 00:02:50.240 |
Then split the data set into males and females 00:02:53.760 |
And run the same piece of code that we just did before but just for the males 00:02:58.480 |
And so this is going to be like a 1R rule for how do we predict which males survived the Titanic 00:03:06.760 |
And let's have a look — three eight, three seven, three eight, three eight, three eight. Okay, so it 00:03:17.760 |
Turns out to be, for the males, the biggest predictor of whether they were going to survive 00:03:21.560 |
That shipwreck. And we can do the same thing for females. So for females, 00:03:27.200 |
There we go, no great surprise: P class, so whether they were in first, second, or third class, 00:03:40.000 |
Was the biggest predictor for females of whether they would survive the shipwreck. 00:03:52.200 |
A decision tree — it is a series of binary splits 00:04:00.960 |
That split up our data more and more, such that in the end, 00:04:04.720 |
In the leaf nodes, as we call them, we will hopefully get, you know, as 00:04:12.680 |
Strong a prediction as possible about survival 00:04:15.160 |
So we could just repeat this step for each of the four groups we've now created males 00:04:27.760 |
Everybody else and we could do it again and then we'd have eight groups 00:04:32.280 |
We could do that manually with another couple of lines of code or we can just use 00:04:39.120 |
DecisionTreeClassifier, which is a class which does exactly that for us 00:04:42.840 |
So there's no magic in here. It's just doing what we've just described 00:04:47.240 |
And DecisionTreeClassifier comes from a library called scikit-learn 00:04:54.200 |
Scikit-learn is a fantastic library that focuses on kind of classical, 00:05:01.720 |
non-deep-learning-ish machine learning methods 00:05:08.800 |
So we can so to create the exact same decision tree 00:05:12.640 |
We can say, please create a decision tree classifier with at most four leaf nodes 00:05:32.640 |
You can see here it's gonna first of all split on sex now 00:05:36.360 |
It looks a bit weird to say sex is less than or equal to point five 00:05:39.100 |
But remember, our binary characteristics are coded as zero and one 00:05:44.640 |
So that's just, you know, an easy way to say males versus females 00:05:55.040 |
What class are they in and for the males what age are they and here's our four leaf nodes 00:06:07.000 |
116 of them survived and four of them didn't so very good idea to be a well-to-do 00:06:28.000 |
68 survived 350 died so a very bad idea to be a male adult on the Titanic 00:06:35.580 |
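As a hedged sketch of the scikit-learn call being described (the variable names X_train and y_train are placeholders, not the notebook's exact code):

```python
# Hypothetical sketch of fitting a small decision tree with scikit-learn;
# X_train/y_train stand in for whatever preprocessed Titanic features you have.
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_leaf_nodes=4)   # at most four leaf nodes, as in the lecture
tree.fit(X_train, y_train)

# Text rendering of the splits (sex first, then class/age), one line per node
print(export_text(tree, feature_names=list(X_train.columns)))
```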
So you can see you can kind of get a quick summary 00:06:39.200 |
Of what's going on and one of the reasons people tend to like decision trees particularly for exploratory data analysis 00:06:46.200 |
Is it allows us to get a quick picture of what are the key 00:06:51.160 |
driving variables in this data set and how much do they kind of 00:07:03.920 |
It's got one additional piece of information we haven't seen before: this thing called Gini 00:07:07.840 |
Gini is just another way of measuring how good a split is and 00:07:21.760 |
How likely is it that, if you go into that sample and grab one item, 00:07:29.000 |
Then go in again and grab another item — how likely is it that you're going to grab the same kind of item each time? 00:07:42.620 |
If the group was just people who survived, or just people who didn't survive, the probability would be one — you'd get the same kind every time 00:07:48.540 |
If it was an exactly equal mix the probability would be point five 00:07:55.200 |
Yeah, that's where this formula comes from in the binary case 00:07:58.880 |
And in fact, you can see it here, right? This group here is pretty much 50/50. So Gini's point five 00:08:05.240 |
Whereas this group here is nearly a hundred percent one class, so Gini is nearly 00:08:12.760 |
And I think I've written it backwards here as well, so I better fix that 00:08:23.480 |
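For reference, here's the standard binary form of the measure described above, as a minimal sketch (this is the usual "impurity" orientation, which may be what the backwards comment refers to):

```python
# A minimal sketch of the binary Gini measure described above (not the notebook's exact code).
# p is the proportion of survivors in a group; p**2 + (1-p)**2 is the chance of
# drawing the same class twice, and the standard "impurity" form subtracts that from 1.
def gini_impurity(p):
    return 1 - (p**2 + (1 - p)**2)

print(gini_impurity(0.5))   # 0.5   -> a 50/50 group, maximally impure
print(gini_impurity(0.97))  # ~0.06 -> a nearly pure group
```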
This decision tree, you know, we would expect it to be more accurate, so we can calculate 00:08:33.000 |
Its mean absolute error. And for the 1R — so just doing males versus females — 00:08:47.680 |
Actually, we have do we have an accuracy score somewhere here we are point three three six 00:08:59.640 |
Point two one five. Okay, so point two one five. So that was for the 1R version; for the decision tree with four leaf nodes, 00:09:08.120 |
Point two two four. So it's actually a little worse, right? 00:09:12.300 |
And I think this just reflects the fact that this is such a small data set 00:09:20.480 |
Version was so good. We haven't really improved it that much 00:09:27.320 |
Amongst the randomness of such a small validation set 00:09:34.640 |
So let's set the minimum samples per leaf to 50 — a minimum of 50 samples per leaf node. So that means that in each of these 00:09:41.520 |
Leaf nodes — you can see it says samples, which in this case is passengers on the Titanic — there's at least, there's 67 people in that one. 00:09:55.680 |
That's how you define that. So this decision tree keeps splitting until it gets to a point where there's going to be less 00:10:02.240 |
Than 50, at which point it stops splitting that leaf. So you can see they've all got at least 50 samples 00:10:09.280 |
And so here's the decision tree that builds as you can see, it doesn't have to be like constant depth, right? 00:10:35.000 |
Super cheap fares and so forth, right? So it keeps going down until we get to that group. So 00:10:40.600 |
Let's try that decision tree. That decision tree has a mean absolute error of point one eight three 00:10:46.560 |
So not surprisingly, you know, once we get there, it's starting to look like it's a little bit better 00:10:56.680 |
This is a kaggle competition. So therefore we should submit it to the leaderboard and 00:11:09.760 |
A mistake that not just beginners but every level of practitioner makes on Kaggle is to not submit to the leaderboard, 00:11:15.480 |
But to spend months making some perfect thing, right? 00:11:19.560 |
But you're actually going to see how you're going and you should try and submit something to the leaderboard every day 00:11:24.200 |
So, you know regardless of how rubbish it is because 00:11:33.040 |
And so you want to keep iterating so to submit something to the leaderboard you generally have to provide a 00:11:48.920 |
We're going to apply the category codes to get the category for each one in our test set 00:11:55.440 |
We're going to set the survived column to our predictions 00:11:58.800 |
And then we're going to send that off to a CSV 00:12:05.080 |
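A hedged sketch of those submission steps with pandas — tst_df and preds are assumed names for the processed test set and the model's predictions:

```python
# Hedged sketch of the submission steps just described; `tst_df` and `preds`
# are assumed to be your processed test DataFrame and the model's 0/1 predictions.
import pandas as pd

sub = pd.DataFrame({
    'PassengerId': tst_df['PassengerId'],  # Titanic submissions key on PassengerId
    'Survived': preds.astype(int),         # set the survived column to our predictions
})
sub.to_csv('sub.csv', index=False)         # send that off to a CSV for upload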
So yeah, so I submitted that and I got a score a little bit worse than most of our linear models and neural nets 00:12:13.120 |
But not terrible, you know, it was it's it's just doing an okay job 00:12:17.800 |
Now one interesting thing for the decision tree is there was a lot less pre-processing to do 00:12:27.320 |
Did you notice that we didn't have to create any dummy variables for our categories? 00:12:35.080 |
Like you certainly can create dummy variables, but you often don't have to so for example 00:12:40.720 |
You know, for class, you know, it's one, two or three — you can just split on one, two or three, you know 00:12:52.040 |
What was that thing — like the embarkation 00:12:54.560 |
City code? Like, we just convert them kind of arbitrarily to numbers one, two and three and you can split on those numbers 00:13:02.720 |
So with random forests — or, sorry, not random forests, the decision trees — 00:13:06.320 |
Yeah, you can generally get away with not doing stuff like 00:13:16.760 |
We only did that to make our graph look better. But if you think about it 00:13:26.840 |
It's exactly the same as splitting on fare is less than e to the 2.7 — you know, whatever log base we used, I can't remember 00:13:36.840 |
All that a decision tree cares about is the ordering of the data and this is another reason that decision tree-based approaches are fantastic 00:13:44.520 |
Because they don't care at all about outliers, you know long tail distributions 00:13:52.320 |
Categorical variables, whatever you can throw it all in and it'll do a perfectly fine job 00:14:01.280 |
For tabular data, I would always start by using a decision tree-based approach 00:14:08.040 |
And kind of create some baselines and so forth because it's it's really hard to mess it up 00:14:21.120 |
So, yeah, so here for example is embarked, right — it was coded originally as 00:14:28.080 |
the first letter of the city they embarked in 00:14:35.320 |
And so pandas creates for us this vocab — this list of all of the possible values — 00:14:43.520 |
Attribute. You can see it's C, Q, S — that's 0, 1, 2 — so S has become 2, C 00:14:50.800 |
Has become 0, and so forth. All right, so that's how we're converting the categories — the strings 00:15:01.480 |
So, yeah, so if we wanted to split C into one group and Q and S into the other, we can just do 00:15:12.780 |
Now, of course, if we wanted to split C and S into one group and Q into the other, 00:15:21.000 |
It would take a couple of splits — first C on one side and Q and S on the other, and then Q and S into Q versus S 00:15:26.680 |
And then the C and S leaf nodes could get similar 00:15:29.960 |
Predictions so like you do have that sometimes it can take a little bit more 00:15:36.920 |
Most of the time I find categorical variables work fine as numeric in decision tree-based approaches 00:15:44.600 |
And as I say here, I tend to use dummy variables only if there's like less than four levels 00:15:48.840 |
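To make the ordinal coding concrete, here's a hedged sketch using pandas categoricals (df and the Embarked column are assumed to be the Titanic DataFrame being discussed):

```python
# A hedged sketch of the ordinal coding described above; 'Embarked' and df are
# placeholders for the Titanic DataFrame used in the lecture.
import pandas as pd

df['Embarked'] = pd.Categorical(df['Embarked'])
print(df['Embarked'].cat.categories)            # the "vocab": C, Q, S
df['Embarked_code'] = df['Embarked'].cat.codes  # C->0, Q->1, S->2 (missing -> -1)

# A split like "Embarked_code <= 0.5" then separates C from Q and S.
```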
Now what if we wanted to make this more accurate could we grow the tree further I 00:16:03.800 |
You know, there's only 50 samples in these leaves, right — it's not really 00:16:14.680 |
You know, if I keep splitting it, the leaf nodes are going to have so little data that they're not really going to make very useful predictions 00:16:24.720 |
Now there are limitations to how accurate a decision tree can be 00:16:35.720 |
We can do something that's actually very I mean, I find it amazing and fascinating 00:16:45.560 |
And Leo Breiman came up with this idea 00:16:50.480 |
Called bagging and here's the basic idea of bagging 00:17:00.560 |
Because let's say it's a decision tree, it's really small we've hardly used any data for it, right? 00:17:07.280 |
It's not very good. So it's got error. It's got errors on predictions 00:17:11.560 |
It's not a systematically biased error — it's not always predicting too high or always predicting too low 00:17:16.880 |
I mean decision trees, you know on average will predict the average, right? 00:17:23.920 |
So what I could do is I could build another decision tree in 00:17:28.300 |
Some slightly different way that would have different splits and it would also be not a great model but 00:17:38.160 |
Predicts the correct thing on average. It's not completely hopeless 00:17:41.080 |
And again, you know, some of the errors are a bit too high and some are a bit too low 00:17:45.320 |
And I could keep doing this. So I could keep building lots and lots of slightly different decision trees 00:17:50.960 |
I'm gonna end up with say a hundred different models all of which are unbiased 00:17:57.680 |
All of which are better than nothing and all of which have some errors bit high some bit low whatever 00:18:04.040 |
So what would happen if I average their predictions? 00:18:08.440 |
Assuming that the models are not correlated with each other 00:18:13.000 |
Then you're going to end up with errors on either side of the correct prediction 00:18:20.560 |
Some are a bit high some are a bit low and there'll be this kind of distribution of errors, right? And 00:18:32.800 |
So that means the average of the predictions of these multiple 00:18:36.280 |
uncorrelated models each of which is unbiased will be 00:18:40.600 |
The correct prediction, because the errors will average out to zero — and this is a mind-blowing insight 00:18:47.100 |
It says that if we can generate a whole bunch of 00:18:57.440 |
We can average them and get something better than any of the individual models because the average of the error 00:19:12.600 |
Well, we already have a great way to build models, which is to create a decision tree 00:19:19.160 |
How do we create lots of unbiased but different models? 00:19:25.880 |
Let's just grab a different subset of the data each time. Let's just grab at random half the rows and 00:19:32.720 |
Build a decision tree and then grab another half the rows and build a decision tree 00:19:37.920 |
And grab another half the rows and build a decision tree each of those decision trees is going to be not great 00:19:44.960 |
But it will be unbiased. It will be predicting the average on average 00:19:48.780 |
It will certainly be better than nothing because it's using, you know, some real data to try and create a real decision tree 00:19:55.680 |
The they won't be correlated with each other because they're each random subsets. So that makes all of our criteria 00:20:05.160 |
When you do this, you create something called a random forest 00:20:17.960 |
Here is a function to create a decision tree. So this first parameter is just the proportion of data 00:20:23.760 |
So let's say we put 75% of the data in each time or we could change it to 50% whatever 00:20:29.680 |
So this is the number of samples in this subset n and so let's at random choose 00:20:38.600 |
n times the proportion we requested from the sample and build a decision tree from that and 00:20:52.520 |
Get a tree and stick them all in a list using a list comprehension 00:20:56.800 |
And now let's grab the predictions for each one of those trees and 00:21:03.760 |
Then let's stack all those predictions up together and take their mean 00:21:12.480 |
And what do we get one two three four five six seven eight that's seven lines of code. So 00:21:25.400 |
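As a hedged sketch — not the lecture's exact cells — those few lines look roughly like this, with trn_xs, trn_y and val_xs as assumed variable names:

```python
# A rough sketch of the bagging idea described above (assumed variable names).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def get_tree(xs, y, prop=0.75):
    n = len(y)
    idxs = np.random.choice(n, int(n * prop), replace=False)  # random subset of rows
    return DecisionTreeClassifier(min_samples_leaf=5).fit(xs.iloc[idxs], y.iloc[idxs])

trees = [get_tree(trn_xs, trn_y) for _ in range(100)]   # lots of slightly different trees
all_preds = [t.predict(val_xs) for t in trees]          # predictions from each tree
avg_pred = np.stack(all_preds).mean(0)                  # average them -> the "bagged" prediction
```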
This is a slight simplification. There's one other difference that random forests do 00:21:30.000 |
Which is when they build the decision tree. They also randomly select a subset of columns and 00:21:36.480 |
they select a different random subset of columns each time they do a split and 00:21:41.800 |
So the idea is you kind of want it to be as random as possible, but also somewhat useful 00:21:49.440 |
We can do that by creating a RandomForestClassifier 00:22:06.720 |
Samples per leaf, and then fit does what we just did, and here's our mean absolute error, which, 00:22:16.680 |
Again, it's like not as good as our decision tree, but it's still pretty good. And again, it's such a small data set 00:22:23.640 |
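The scikit-learn equivalent is roughly this (a sketch; the argument values are illustrative):

```python
# Hedged sketch of the scikit-learn random forest (argument values are illustrative).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_absolute_error

rf = RandomForestClassifier(100, min_samples_leaf=5)  # 100 trees, like the manual version
rf.fit(trn_xs, trn_y)
print(mean_absolute_error(val_y, rf.predict(val_xs)))
```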
So we can submit that to Kaggle so earlier on I created a little function to submit to Kaggle 00:22:28.160 |
So now I just create some predictions and I submit to Kaggle and yeah looks like it gave nearly identical results to a single tree 00:22:42.160 |
About random forests and I should say in most real-world data sets of reasonable size random forests 00:22:48.480 |
Basically always give you much better results than decision trees. This is just a small data set to show you what to do 00:22:58.640 |
With random forests we can do something quite cool. What we can do is we can look at the 00:23:04.080 |
Underlying decision trees they create so we've now got a hundred decision trees 00:23:12.440 |
Did it find a split on — and so it says here, okay, well, the first thing it split on was sex, and 00:23:25.080 |
Two now just take the weighted average of point three eight and point three one weighted by the samples 00:23:30.400 |
So that's probably going to be about point three three. So it's, okay, it's like a point one four improvement in Gini thanks to sex 00:23:41.480 |
We can do that again. Okay, well, then P class, you know — how much did that improve Gini? 00:23:46.000 |
Again, we keep weighting it by the number of samples as well. Log fare — how much does that improve Gini? And we can keep track 00:23:55.560 |
How much in total did they improve the Gini in this decision tree and then do that for every decision tree and 00:24:05.320 |
then add them up per column and that gives you something called a feature importance plot and 00:24:14.960 |
And a feature importance plot tells you how important is each feature 00:24:20.280 |
how often did the trees pick it and how much did it improve the Gini when it did and 00:24:26.240 |
so we can see from the feature importance plot that sex was the most important and 00:24:34.120 |
Class was the second most important and everything else was a long way back 00:24:38.400 |
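scikit-learn exposes that accumulated Gini improvement directly as feature_importances_; a hedged sketch of the plot (variable names assumed):

```python
# Hedged sketch: plotting the Gini-based feature importances of the fitted forest.
import pandas as pd

fi = pd.DataFrame({'feature': trn_xs.columns, 'importance': rf.feature_importances_})
fi.sort_values('importance', ascending=False).plot('feature', 'importance', 'barh')
```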
Now this is another reason by the way why our random forest isn't really particularly helpful 00:24:43.400 |
Because it's just such an easy split to do right? I basically all that matters is 00:24:49.080 |
You know what class you're in and whether you're male and female 00:24:57.720 |
Feature importance plots remember because they're built on random forests 00:25:09.280 |
The distribution of your data and they can handle categorical variables and stuff like that 00:25:13.600 |
That means that with basically any tabular data set you have, you can just plot this 00:25:21.000 |
Random forests, you know, for most data sets only take a few seconds to train — you know, really at most a minute or two 00:25:29.440 |
And so if you've got a big data set and you know hundreds of columns 00:25:41.760 |
It's such a helpful thing to do. So I've done that for example. I did some work in credit scoring 00:25:51.080 |
Things that would predict who's going to default on a loan, and I was given 00:26:01.200 |
And I put it straight into a random forest and found I think there was about 30 columns that seemed 00:26:09.360 |
like two hours after I started the job and I went to the 00:26:14.040 |
Head of marketing and the head of risk and I told them here's the columns. I think that we should focus on and 00:26:21.160 |
They were like, oh my god. We just finished a two-year 00:26:25.800 |
consulting project with one of the big consultants 00:26:27.960 |
Paid them millions of dollars, and they came up with a subset of these 00:26:41.400 |
With random forests along this path. I'll touch on them briefly 00:26:56.040 |
Which goes into this in a lot more detail and particularly interestingly chapter 8 of the book uses a 00:27:03.160 |
Much bigger and more interesting data set which is auction prices of heavy industrial equipment 00:27:10.600 |
I mean, it's less interesting historically, but more interesting numerically 00:27:14.600 |
And so some of the things I did there on this data set 00:27:25.800 |
Say this isn't from the data set — this is from the scikit-learn documentation 00:27:28.960 |
They looked at how as you increase the number of estimators. So the number of trees 00:27:37.400 |
The accuracy improves. So I then did the same thing on our data set. So I actually just 00:27:42.000 |
Added up to 40 more and more and more trees and 00:27:47.960 |
you can see that basically as as predicted by that kind of an initial bit of 00:27:53.840 |
Hand-wavy theory I gave you that you would expect the more trees 00:27:58.360 |
The lower the error because the more things you're averaging and that's exactly what we find the accuracy improves as we have more trees 00:28:11.600 |
You might have just answered his question actually as he talked it but he's he's asking on the same theme the number of trees in a 00:28:18.520 |
Random forest does increasing the number of trees always? 00:28:21.480 |
Translate to a better error. Yes. It does always I mean tiny bumps, right? But yeah, once you smooth it out 00:28:40.320 |
If you end up productionizing a random forest, then of course every one of these trees you have to 00:28:49.520 |
So it's not that there's no cost. I mean, having said that, 00:28:53.600 |
Zipping through a binary tree is the kind of thing you can 00:29:01.120 |
Do fast in fact, it's it's quite easy to let literally 00:29:09.320 |
With a bunch of if statements and compile it and get extremely fast performance 00:29:15.400 |
I don't often use more than a hundred trees. This is a rule of thumb 00:29:31.000 |
So then there's another interesting feature of random forests 00:29:35.920 |
Which is — remember how in our example we trained each tree with 75% of the data? 00:29:44.560 |
So that means for each tree there was 25% of the data we didn't train on 00:29:47.840 |
Now this actually means if you don't have much data in some situations you can get away with not having a validation set and 00:30:06.360 |
See how accurate that tree was on those rows and we can average for each row 00:30:13.520 |
their accuracy on all of the trees in which they were not part of the training and 00:30:22.120 |
That's called the out-of-bag, or OOB, error, and this is built in also to sklearn — you can ask for an OOB score 00:30:43.640 |
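A hedged sketch of asking scikit-learn for that out-of-bag estimate:

```python
# Hedged sketch of the built-in out-of-bag estimate mentioned here.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(100, min_samples_leaf=5, oob_score=True)
rf.fit(trn_xs, trn_y)
print(rf.oob_score_)  # accuracy estimated only from rows each tree never saw
```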
So we know that bagging is powerful as an ensemble approach to machine learning 00:30:48.280 |
Would it be advisable to try out bagging then first when approaching a particular? 00:30:56.080 |
Before deep learning, so that's the first part of the question 00:31:01.280 |
And the second part is could we create a bagging model which includes fast AI deep learning models? 00:31:11.160 |
Absolutely. So to be clear, you know bagging is kind of like a 00:31:17.000 |
prediction it's not a method of modeling itself. It's just a method of 00:31:24.680 |
So random forests in particular as a particular approach to bagging 00:31:31.040 |
Are — you know, I would personally probably always start a tabular 00:31:35.800 |
Project with a random forest because they're nearly impossible to mess up and they give good insight and they give a good base case 00:31:42.400 |
But yeah your question then about can you bag? 00:31:47.440 |
other models is a very interesting one and the answer is you absolutely can and 00:32:04.040 |
So I you know you might be getting the impression I'm a bit of a fan of random forests and 00:32:12.960 |
Before — you know, before people thought of me as the deep learning guy, people thought of me as the random forests guy 00:32:20.480 |
I used to go on about random forests all the time and one of the reasons I'm so enthused about them isn't just that they're 00:32:27.400 |
Very accurate, or that they're very hard to mess up and require very little pre-processing, 00:32:32.280 |
But they give you a lot of quick and easy insight 00:32:40.480 |
Which I think we're interested in, and all of which are things that random forests are good at. They will tell us how confident 00:32:47.400 |
Are we in our predictions on some particular row? So when somebody you know, when we're giving a loan to somebody 00:32:59.240 |
But I'd also like to know how confident are we that we know because if we're if we like well 00:33:06.360 |
We think they'll repay but we're not confident of that. We would probably want to give them less of a loan and 00:33:12.960 |
Another thing that's very important is when we're then making a prediction — so again, for example, for credit — 00:33:29.080 |
What is the reason that we made a prediction? And you'll see why all these things matter. 00:33:34.480 |
Which columns are the strongest predictors? You've already seen that one, right? That's the feature importance plot 00:33:39.960 |
Which columns are effectively redundant with each other ie they're basically highly correlated with each other 00:33:49.240 |
And then one of the most important ones as you vary a column, how does it vary the predictions? So for example in your 00:34:06.960 |
Well something that probably the regulator would want to know might be some, you know, some protected 00:34:14.720 |
Race or some socio demographic characteristics that you're not allowed to use in your model. So they might check things like that 00:34:20.000 |
For the first thing how confident are we in our predictions using a particular row of data? 00:34:27.960 |
There's a really simple thing we can do which is remember how when we 00:34:32.960 |
Calculated our predictions manually we stacked up the predictions together and took their mean 00:34:37.720 |
Well, what if you took their standard deviation instead? 00:34:42.680 |
So if you stack up your predictions and take their standard deviation, and it's high, 00:34:49.720 |
That means all of the trees are predicting something different, and that suggests that we don't really know what we're doing 00:34:57.040 |
And so that would happen if different subsets of the data end up giving completely different trees 00:35:07.160 |
So there's like a really simple thing you can do to get a sense of your prediction confidence 00:35:13.080 |
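A hedged sketch of that confidence measure — the spread of the individual trees' predictions (rf and val_xs are assumed names):

```python
# Hedged sketch of the confidence idea above: spread of per-tree predictions for each row.
import numpy as np

# probability of "survived" from every individual tree in the forest
all_probs = np.stack([t.predict_proba(val_xs)[:, 1] for t in rf.estimators_])
pred_std = all_probs.std(0)   # high std -> the trees disagree -> low confidence for that row
```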
Okay feature importance. We've already discussed 00:35:16.160 |
After I do feature importance — you know, like I said, when I had the, what, 7,000 or so columns, I got rid of all but 30 — 00:35:26.760 |
That doesn't tend to improve the predictions of your random forest very much 00:35:36.680 |
You know kind of logistically thinking about cleaning up the data 00:35:40.160 |
You can focus on cleaning those 30 columns stuff like that. So I tend to remove the low importance variables 00:35:45.080 |
I'm going to skip over this bit about removing redundant features because it's a little bit outside what we're talking about 00:35:57.880 |
But what I do want to mention is the partial dependence. This is the thing which says, what's the relationship between a 00:36:09.480 |
Column and the dependent variable — and so this is something called a partial dependence plot. Now, 00:36:15.640 |
This one's actually not specific to random forests 00:36:17.800 |
A partial dependence plot is something you can do for basically any machine learning model 00:36:22.760 |
Let's first of all look at one and then talk about how we make it 00:36:27.040 |
So in this data set we're looking at the relationship. We're looking at 00:36:32.560 |
the sale price at auction of heavy industrial equipment like bulldozers, this is specifically the 00:36:38.920 |
Blue Book for Bulldozers Kaggle competition, and 00:36:42.240 |
a partial dependence plot between the year that the bulldozer or whatever was made and 00:36:48.880 |
The price it was sold for — this is actually the log price — and you can see 00:36:52.640 |
That it goes up: more recently made bulldozers are more expensive 00:37:00.920 |
And as you go back to older and older bulldozers, 00:37:04.000 |
They're less and less expensive to a point and maybe these ones are some old 00:37:14.760 |
You might think that you could easily create this plot by simply looking at your data at each year and taking the average sale price 00:37:25.680 |
I mean, it kind of does, but it kind of doesn't — let me give an example 00:37:29.600 |
It turns out that one of the biggest predictors of sale price for industrial equipment is whether it has air conditioning 00:37:37.080 |
and so air conditioning is you know, it's an expensive thing to add and it makes the equipment more expensive to buy and 00:37:45.200 |
Most things didn't have air conditioning back in the 60s and 70s and most of them do now 00:37:50.360 |
So if you plot the relationship between year made and price 00:37:55.480 |
You're actually going to be seeing a whole bunch of 00:37:57.880 |
When you know how popular was air conditioning? 00:38:01.640 |
Right, so you get this cross-correlation going on, when we just want to know, 00:38:06.480 |
What's just the impact of the year it was made, all else being equal? 00:38:10.960 |
So there's actually a really easy way to do that which is we take our data set 00:38:17.320 |
We leave it exactly as it is — we just use the training data set — 00:38:22.200 |
but we take every single row and for the year made column we set it to 1950 and 00:38:27.680 |
so then we predict for every row what would the sale price of that have been if it was made in 1950 and 00:38:35.120 |
then we repeat it for 1951, and then repeat it for 1952, and so forth, and then we plot the averages, and 00:38:42.160 |
That does exactly what I just said. Remember I said the special words all else being equal 00:38:47.720 |
This is setting everything else equal — everything else is the data as it actually occurred, and we're only varying year made 00:38:58.320 |
That works just as well for deep learning or gradient boosting trees or logistic regressions or whatever. It's a really 00:39:10.280 |
And you can do more than one column at a time, you know, you can do two-way 00:39:20.840 |
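Here's a hedged sketch of the manual procedure just described, assuming a fitted model m, a DataFrame xs, and a YearMade column:

```python
# Hedged sketch of the manual partial-dependence procedure just described
# (YearMade, the model `m`, and the DataFrame `xs` are assumed names).
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(1950, 2012)
pdp = []
for yr in years:
    xs_copy = xs.copy()
    xs_copy['YearMade'] = yr                    # set every row's YearMade to this year
    pdp.append(m.predict(xs_copy).mean())       # average predicted (log) price, all else as it occurred

plt.plot(years, pdp)                            # the partial dependence plot
```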
Another one. Okay, so then another one I mentioned was 00:39:27.960 |
Prediction was made. So how did you decide for this particular row? 00:39:36.960 |
This is actually pretty easy to do there's a thing called tree interpreter 00:39:41.840 |
But we could you could easily create this in about half a dozen lines of code all we do 00:39:51.920 |
This customer's come in they've asked for a loan 00:39:54.480 |
We've put all of their data through the random forest, and it's spat out a prediction 00:39:59.200 |
We can actually have a look and say, okay, well then, in tree number one, 00:40:03.680 |
What's the path that went down through the tree to get to the leaf node? 00:40:07.520 |
And we can say oh, well first of all it looked at sex and then it looked at postcode and then it looked at income 00:40:16.960 |
exactly in tree number one which variables were used and what was the 00:40:24.480 |
Then we can do the same in tree two, tree three, tree four — does this sound familiar? 00:40:29.160 |
It's basically the same as our feature importance plot, right? 00:40:32.680 |
But it's just for this one row of data and so that will tell you basically the feature 00:40:37.320 |
Importances for that one particular prediction and so then we can plot them 00:40:43.280 |
Like this. So for example, this is an example of an 00:40:49.680 |
According to this plot, you know, so he predicted that the net would be 00:40:55.280 |
This is just a change from — so I don't actually know what the price is, 00:41:03.760 |
But this is this is how much each one impacted the price. So 00:41:06.640 |
Year made — I guess this must have been an old tractor — it caused the prediction of the price to go down 00:41:13.280 |
But then it must have been a larger machine — the product size caused it to go up 00:41:17.440 |
Coupler system made it go up, model ID made it go up, and 00:41:21.280 |
So forth, right? So you can see the red says this made our prediction go down, green made our prediction go up, and 00:41:31.080 |
Which things had the biggest impact on the prediction and what was the direction? 00:41:47.440 |
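One existing implementation of this per-row breakdown is the third-party treeinterpreter package; a hedged sketch of its use (the waterfall plotting itself is omitted):

```python
# Hedged sketch using the treeinterpreter package mentioned in the lecture.
from treeinterpreter import treeinterpreter as ti

row = val_xs.iloc[[0]]                              # one customer's / one auction's data
prediction, bias, contributions = ti.predict(rf, row.values)
# `bias` is the dataset average; each entry of `contributions` says how much a
# column pushed this particular prediction up or down -- the red/green bars.
```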
Yeah, there are a couple that have that are sort of queued up this is a good spot to jump to them 00:41:59.360 |
first of all Andrew's asking jumping back to the 00:42:02.680 |
The OOB error — would you ever exclude a tree from a forest if it had a bad out-of-bag score? 00:42:11.360 |
Like, if you had — I guess if you had a particularly bad 00:42:13.960 |
Tree in your ensemble. Yeah, like might you just 00:42:17.720 |
Would you delete a tree that was not doing its thing? It's not playing its part. No you wouldn't 00:42:24.180 |
If you start deleting trees then you are no longer 00:42:29.960 |
Having an unbiased prediction of the dependent variable 00:42:34.440 |
You are biasing it by making a choice. So even the bad ones 00:42:50.400 |
Bagging and we're just going you know layers and layers here 00:42:55.440 |
You know, we could go on and create ensembles of bagged models 00:42:59.520 |
And, you know, is it reasonable to assume that they would continue to improve? That's not gonna make much difference, right? 00:43:05.480 |
You could take your hundred trees, split them into groups of ten, create ten bagged ensembles 00:43:12.640 |
And then average those but the average of an average is the same as the average 00:43:16.180 |
You could like have a wider range of other kinds of models 00:43:20.760 |
You could have like neural nets trained on different subsets as well 00:43:23.800 |
But again, it's just the average of an average will still give you the average 00:43:26.640 |
Right. So there's not a lot of value in kind of structuring the ensemble 00:43:31.840 |
You just — I mean, some ensembles you can structure, but not bagging; bagging's the simplest one 00:43:39.920 |
There are more sophisticated approaches, but this one 00:43:47.080 |
Is a bit specific and it's referencing content you haven't covered but we're here now. So 00:43:57.840 |
Feature importance from a random forest model sometimes gives different results when you compare it to other explainability techniques 00:43:57.840 |
And we haven't covered these in the course, but Amir is just curious if you've got any thoughts on which is more accurate or reliable 00:44:14.240 |
Random forest feature importance or other techniques? I 00:44:26.400 |
Would be more immediately trusting of random forest feature importances over other techniques, on the whole, 00:44:32.560 |
On the basis that it's very hard to mess up a random forest 00:44:42.680 |
Yeah, I feel like pretty confident that a random forest feature importance is going to 00:44:50.680 |
As long as this is the kind of data which a random forest is likely to be pretty good at you know 00:44:56.400 |
Doing you know, if it's like a computer vision model random forests aren't 00:45:01.920 |
And so one of the things that Breiman talked about a lot was explainability, and he's got a great essay called The Two Cultures 00:45:08.120 |
of statistics, in which he talks about, I guess, what are nowadays called kind of data scientists and machine learning folks versus classic statisticians, and 00:45:16.120 |
He was, you know, definitely a data scientist well before the 00:45:22.560 |
The label existed and he pointed out. Yeah, you know first and foremost 00:45:26.720 |
You need a model that's accurate — that is, it makes good predictions. A model that makes bad predictions 00:45:33.800 |
Will also be bad for making explanations because it doesn't actually know what's going on 00:45:38.200 |
So if you know if you if you've got a deep learning model that's far more accurate than your random forest then it's you know 00:45:45.640 |
Explainability methods from the deep learning model will probably be more useful because it's explaining a model 00:45:53.760 |
Alright, let's take a 10-minute break and we'll come back at 5 past 7 00:46:03.840 |
Welcome back one person pointed out I noticed I got the chapter wrong. It's chapter 9 not chapter 8 in the book 00:46:20.960 |
Somebody asked during the break about overfitting 00:46:28.840 |
Basically, no, not really — adding more trees will make it more accurate 00:46:35.840 |
It kind of asymptotes so you can't make it infinitely accurate by using infinite trees, but certainly, you know adding more trees won't make it worse 00:46:53.520 |
If you let the trees grow very deep, that could overfit 00:46:57.720 |
So you just have to make sure you have enough trees 00:47:00.800 |
Radek told me during the break about an experiment he did, 00:47:15.480 |
Which is something similar to something I've done — adding lots and lots of randomly generated columns 00:47:26.160 |
If you try it, it basically doesn't work — it's really hard 00:47:30.680 |
to confuse a random forest by giving it lots of 00:47:34.440 |
meaningless data it does an amazingly good job of picking out 00:47:38.720 |
The useful stuff. As I said, you know, I had 00:47:43.000 |
30 useful columns out of 7,000 and it found them 00:47:48.720 |
And often, you know when you find those 30 columns 00:47:54.400 |
I was doing consulting at the time go back to the client and say like tell me more about these columns 00:47:58.680 |
And they'd say, like, oh well, that one there — we've actually got a better version of that now, 00:48:02.000 |
There's a new system, you know, we should grab that and oh this column actually that was because of this thing that happened last year 00:48:07.960 |
But we don't do it anymore or you know, like you can really have this kind of discussion about the stuff you've zoomed into 00:48:26.440 |
There are other things that you have to think about with lots of kinds of models like particularly regression models things like interactions 00:48:32.520 |
You don't have to worry about that with random forests like because you split on one column and then split on another column 00:48:43.760 |
Normalization you don't have to worry about you know, you don't have to have normally distributed columns 00:48:49.960 |
So, yeah, definitely worth a try now something I haven't gone into 00:49:10.720 |
You'll see that my friend Terence and I have a three-part series about gradient boosting 00:49:10.720 |
But to explain: gradient boosting is a lot like random forests, 00:49:20.240 |
But rather than fitting a tree again and again and again on different random subsets of the data, 00:49:31.560 |
Instead what we do is we fit very, very, very small trees — with hardly any splits — and 00:49:44.840 |
We then say okay. What's the error? So, you know 00:49:49.120 |
So imagine the simplest tree would be a 1R tree of 00:49:55.320 |
Male versus female, say. And then you take what's called the residual — 00:50:00.600 |
That's the difference between the prediction and the actual the error and then you create another tree which attempts to predict that 00:50:08.120 |
very small tree, and then you create another very small tree which tries to predict the error from that, and 00:50:17.000 |
So forth each one is predicting the residual from all of the previous ones. And so then to calculate a prediction 00:50:25.120 |
Rather than taking the average of all the trees 00:50:27.640 |
you take the sum of all the trees, because each one has predicted the difference between the actual and 00:50:33.640 |
All of the previous trees and that's called boosting 00:50:37.920 |
versus bagging so boosting and bagging are two kind of meta-ensembling techniques and 00:50:44.160 |
When bagging is applied to trees, it's called a random forest and when boosting is applied to trees 00:50:50.880 |
It's called a gradient boosting machine or gradient boosted decision tree 00:50:55.800 |
Gradient boosting is generally speaking more accurate than random forests 00:51:11.040 |
It's not necessarily my first go-to thing. Having said that, there are ways to avoid overfitting — 00:51:19.280 |
It's — you know, because it's breakable, it's not my first choice 00:51:26.040 |
But yeah, check out our stuff here if you're interested, and, you know, there is stuff which largely automates the process — 00:51:34.920 |
There are lots of hyperparameters you have to select; people generally just, you know, try every combination of hyperparameters, 00:51:41.040 |
And in the end you generally should be able to get a more accurate gradient boosting model than a random forest 00:51:58.560 |
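To make the residual-fitting idea concrete, here's a hedged sketch of a boosting loop for a regression target (real gradient boosting libraries also shrink each tree by a learning rate, which is omitted here):

```python
# Minimal sketch of the boosting loop described above, for a regression target.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

preds = np.zeros(len(trn_y))     # running sum of all the little trees' predictions
trees = []
for _ in range(50):
    resid = trn_y - preds                                      # the error so far
    t = DecisionTreeRegressor(max_depth=1).fit(trn_xs, resid)  # a tiny "stump" tree
    trees.append(t)
    preds += t.predict(trn_xs)                                 # predictions are the *sum* of the trees

def boost_predict(xs):
    return np.sum([t.predict(xs) for t in trees], axis=0)
```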
Kaggle notebook on random forests how random forests really work 00:52:20.240 |
Walk through where me and I don't know how many 20 or 30 folks get together on a zoom call and chat about 00:52:36.480 |
You know, we've been trying to kind of practice what you know things along the way 00:52:49.640 |
What does it look like to pick a Kaggle competition and just, like, do the 00:52:57.080 |
Kind of mechanical steps that you would do for any computer vision model 00:53:06.880 |
Competition I picked was paddy disease classification 00:53:15.080 |
Recognizing rice diseases in rice paddies 00:53:18.600 |
And yeah, I spent I don't know a couple of hours or three. I can't remember a few hours 00:53:29.920 |
Found that I was number one on the leaderboard and I thought oh, that's that's interesting like 00:53:41.880 |
And then I thought well, there's all these other things. We should be doing as well and I tried 00:53:45.800 |
three more things and each time I tried another thing I got further ahead at the top of the leaderboard so 00:53:57.500 |
the process I'm gonna do it reasonably quickly because 00:54:08.960 |
For you to see the entire thing in you know, seven hours of detail or however long we probably were six to seven hours of conversations 00:54:16.560 |
But I want to kind of take you through the basic process that I went through 00:54:22.780 |
So since I've been starting to do more stuff on Kaggle, you know, I realized there's some 00:54:35.600 |
Kind of menial steps I have to do each time, particularly because I like to run stuff on my own machine 00:54:44.200 |
So to make my life easier, I created a little module called fastkaggle, 00:54:51.120 |
Which you'll see in my notebooks now, and which you can download from pip or conda 00:54:56.900 |
And as you'll see it makes some things a bit easier for example 00:55:02.920 |
Downloading the data for the paddy disease classification — if you just run setup_comp and 00:55:08.400 |
Pass in the name of the competition: if you are on Kaggle, it will return a path to the 00:55:18.440 |
Competition data that's already on Kaggle. If you are not on Kaggle and you haven't downloaded it, it will download and unzip it; 00:55:25.480 |
If you're not on Kaggle and you have already downloaded and unzipped the data, it will return a path to the one that you've already downloaded 00:55:31.520 |
also, if you are on Kaggle you can ask it to make sure that 00:55:34.620 |
Pip things are installed that might not be up to date. Otherwise 00:55:39.040 |
So this basically one line of code now gets us all set up and ready to go 00:55:46.680 |
So I ran this particular one on my own machine so it's downloaded and unzipped the data 00:55:55.680 |
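Roughly what that one line looks like — a sketch based on fastkaggle's setup_comp; the install string is an assumption:

```python
# Hedged sketch of the fastkaggle setup described above.
from fastkaggle import setup_comp

comp = 'paddy-disease-classification'
# Returns the path to the data: on Kaggle it's the mounted competition data,
# locally it downloads and unzips on first use. `install` can pull in pip packages on Kaggle.
path = setup_comp(comp, install='fastai "timm>=0.6.2.dev0"')
```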
Six walkthroughs so far. These are the videos 00:56:06.240 |
For attempts that's a few fiddling around at the start 00:56:10.880 |
So the overall approach is — well, and this is not just for a Kaggle competition, right — here's the reason. 00:56:27.840 |
In a Kaggle competition, you know when you're working on some work project or something 00:56:32.300 |
You might be able to convince yourself and everybody around you that you've done a fantastic job of 00:56:38.100 |
not overfitting and your models better than what anybody else could have made and whatever else but 00:56:44.240 |
The brutal assessment of the private leaderboard 00:56:51.560 |
Is your model actually predicting things correctly and is it overfit? 00:57:04.120 |
You know, you're never gonna know and a lot of people don't go through that process because at some level they don't want to know 00:57:09.640 |
But it's okay, you know — nobody needs to see it, you don't have to put your own name there 00:57:19.040 |
Always did right from the very first one. I wanted, you know, if I was gonna screw up royally 00:57:23.760 |
I wanted to have the pressure on myself of people seeing me in last place 00:57:27.240 |
But, you know, it's fine — you can do it anonymously, and 00:57:34.320 |
As you improve you also have so much self-confidence, you know 00:57:41.880 |
The stuff we do in a Kaggle competition is indeed a subset of the things we need to do in real life 00:57:49.120 |
It's an important subset, you know building a model that actually predicts things correctly and doesn't overfit is important and furthermore 00:57:57.080 |
structuring your code and analysis in such a way that you can keep improving over a three-month period without gradually getting into more and 00:58:04.900 |
more of a tangled mess of impossible to understand code and 00:58:07.840 |
Having no idea what untitled copy 13 was and why it was better than 00:58:22.480 |
Well away from customers or whatever, you know before you've kind of figured things out 00:58:27.280 |
So the things I talk about here about doing things well in this Kaggle competition 00:58:34.040 |
Should work, you know in other settings as well 00:58:39.060 |
And so these are the two focuses that I recommend 00:58:45.200 |
Get a really good validation set together. We've talked about that before right and in a Kaggle competition 00:58:50.460 |
That's like it's very rare to see people do well in a Kaggle competition who don't have a good validation set 00:58:55.900 |
sometimes that's easy and this competition actually it is easy because the 00:59:05.800 |
But most of the time it's not actually I would say 00:59:12.720 |
The second is: how quickly can you try things and find out what worked? So obviously you need a good validation set — otherwise it's impossible to iterate — and 00:59:19.880 |
So quickly iterating means not saying, what is the biggest, 00:59:25.640 |
You know, OpenAI-takes-four-months-on-a-hundred-TPUs kind of model that I can train; 00:59:34.100 |
it's what can I do that's going to train in a minute or so and 00:59:39.200 |
Will quickly give me a sense of like well, I could try this I could try that what things gonna work and then try 00:59:48.200 |
It also doesn't mean saying, like, oh, I heard about this amazing 00:59:52.540 |
Bayesian hyperparameter tuning approach — I'm gonna spend three months implementing that — because that's gonna, like, give you one thing 01:00:04.640 |
In these competitions or in machine learning in general, you actually have to do everything 01:00:12.560 |
And doing just one thing really well will still put you somewhere about last place 01:00:17.160 |
So I actually saw that a couple of years ago — an Aussie guy who 01:00:27.720 |
Actually put together a team, entered a Kaggle competition, and literally came in last place 01:00:34.160 |
Because they spent the entire three months trying to build this amazing new 01:00:43.040 |
They never actually iterated. If you iterate, I guarantee you won't be in last place 01:00:48.820 |
Okay, so here's how we can grab our data with fastkaggle, and it tells us what path it's in 01:01:05.120 |
And I only do this because I'm creating a notebook to share, you know when I share a notebook 01:01:11.960 |
I like to be able to say as you can see, this is point eight three blah blah blah, right and 01:01:15.760 |
Know that when you see it, it'll be point eight three as well 01:01:18.720 |
But when I'm doing stuff, otherwise, I would never set a random seed 01:01:22.400 |
I want to be able to run things multiple times and see how much it changes each time 01:01:29.960 |
Are the modifications I'm making changing it because they're improving it or making it worse, or is it just random variation? 01:01:37.960 |
Always setting a seed is a bad idea, because you won't be able to see the random variation. So this is just here for presenting a notebook 01:01:44.520 |
Okay, so the data they've given us as usual they've got a sample submission they've got some test set images 01:01:53.280 |
They've got some training set images a CSV file about the training set 01:02:00.000 |
And then these other two you can ignore because I created them 01:02:09.440 |
get_image_files — so that gets us a list of the file names of all the images here, recursively 01:02:09.440 |
This is a PIL image — a Python Imaging Library image 01:02:26.880 |
In the imaging world, they generally say columns by rows; in 01:02:30.120 |
The array slash tensor world. We always say rows by columns 01:02:40.800 |
So if you ask PyTorch what the size of this is, it'll say 640 by 480, and I guarantee at some point 01:02:47.440 |
This is going to bite you. So try to recognize it now 01:02:50.540 |
Okay, so they're kind of taller than they are wide — at least this one is taller than it is wide 01:02:58.320 |
I'd actually like to know are they all this size because it's really helpful if they all are all the same size or at least similar 01:03:03.880 |
Believe it or not the amount of time it takes to decode a JPEG is actually quite significant 01:03:12.640 |
And so figuring out what size these things are is actually going to be pretty slow 01:03:18.200 |
But my fastcore library has a parallel submodule which can basically do anything 01:03:25.140 |
That you can do in Python. It can do it in parallel. So in this case, we wanted to create a pillow image and get its size 01:03:31.060 |
So if we create a function that does that and pass it to parallel passing in the function and the list of files 01:03:37.720 |
It does it in parallel and that actually runs pretty fast 01:03:44.060 |
And here's what happened: ten thousand four hundred and three images are indeed 480 by 640, and four of them aren't 01:03:52.540 |
So basically what this says to me is that we should pre-process them or you know 01:03:56.580 |
At some point process them so that they're probably all 480 by 640, or all basically the same kind of size 01:04:04.100 |
But we can't not do some initial resizing. Otherwise, this is going to screw things up 01:04:17.540 |
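For reference, a hedged sketch of the parallel size check described a moment ago (paths and worker count are assumptions):

```python
# Hedged sketch of the parallel image-size check (fastai/fastcore utilities).
import pandas as pd
from fastai.vision.all import get_image_files, PILImage
from fastcore.parallel import parallel

files = get_image_files(path/'train_images')

def f(fname):
    return PILImage.create(fname).size   # (width, height) in PIL's columns-by-rows order

sizes = parallel(f, files, n_workers=8)
print(pd.Series(sizes).value_counts())   # e.g. (480, 640) for almost every image
```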
So like that probably the easiest way to do things the most common way to do things is to 01:04:22.460 |
Either squish or crop every image to be a square 01:04:26.860 |
So squishing is when you just in this case squish the aspect ratio down 01:04:33.260 |
As opposed to cropping a random section out. So if we call Resize with squish, it will squish it down 01:04:42.900 |
And so this is 480 by 480 — square. So this is what it's going to do to all of the images first, on the CPU 01:04:50.500 |
That allows them to be all batched together into a single mini batch 01:04:56.780 |
Everything in a mini batch has to be the same shape 01:05:02.220 |
then that mini batch is put through data augmentation and 01:05:09.620 |
Grab a random subset of the image and make it at 128 by 128 pixel 01:05:15.980 |
And here's what that looks like. Here's our data 01:05:19.780 |
So show_batch works for pretty much everything, not just in the fastai library, 01:05:26.620 |
But even for things like fastaudio, which are kind of community-based things — 01:05:30.980 |
You should be able to use show_batch on anything and see — or hear, or whatever — what your data looks like 01:05:41.780 |
But apparently these are various rice diseases and this is what they look like 01:05:46.060 |
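A hedged sketch of the DataLoaders being described — the exact DataBlock arguments are assumptions based on the description, not a copy of the notebook:

```python
# Hedged sketch: squish everything to 480 on the CPU, then random-crop/augment to 128 on the GPU.
from fastai.vision.all import (DataBlock, ImageBlock, CategoryBlock, get_image_files,
                               parent_label, RandomSplitter, Resize, aug_transforms)

dls = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    get_y=parent_label,                                   # label = the disease folder name
    splitter=RandomSplitter(0.2, seed=42),
    item_tfms=Resize(480, method='squish'),               # per-item, on the CPU
    batch_tfms=aug_transforms(size=128, min_scale=0.75),  # per-batch augmentation on the GPU
).dataloaders(path/'train_images', bs=64)

dls.show_batch(max_n=6)
```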
So, um, I jump into creating models much more quickly than most people 01:05:58.260 |
Because I find models, you know, are a great way to understand my data, as we've seen before 01:06:03.540 |
So I basically build a model as soon as I can 01:06:10.900 |
Create a model that's going to let me iterate quickly. So that means that I'm going to need a model that can train quickly 01:06:23.580 |
Did this big project, the best vision models for fine-tuning, 01:06:29.340 |
Where we looked at nearly a hundred different 01:06:46.740 |
Which ones could we fine-tune which ones had the best transfer learning results 01:06:52.320 |
And we tried two different data sets very different data sets 01:06:55.900 |
One is the pets data set that we've seen before 01:06:58.940 |
So trying to predict what breed a pet is, from 37 different breeds, 01:07:07.780 |
And a satellite imagery data set called Planet. They're very, very different data sets in terms of what they contain, and also very different sizes — 01:07:15.500 |
The Planet one's a lot smaller, the Pets one's a lot bigger 01:07:19.180 |
And so the main things we measured were how much memory did it use? 01:07:23.980 |
How accurate was it and how long did it take to fit? 01:07:27.580 |
And then I created this score which combines the fit time and error rate together 01:07:37.780 |
For picking a model. And now in this case I want to pick something 01:07:48.380 |
There's one clear winner on speed, which is resnet26d, and 01:07:53.380 |
So its error rate was 6%, versus the best, which was like 4.1% 01:07:59.540 |
So okay, it's not amazingly accurate, but it's still pretty good, and it's gonna be really fast 01:08:11.460 |
When they do deep learning, they're going to spend all of their time learning about exactly how a resnet26d is made, and 01:08:19.500 |
convolutions and resnet blocks and transformers and blah blah blah we will cover all that stuff 01:08:31.740 |
Right — it's just a function, right, and what matters is the inputs to it and the outputs to it 01:08:41.500 |
So let's create a learner with a resnet26d from our data loaders 01:08:58.380 |
starting at a very very very low learning rate and gradually increase the learning rate and track the loss and 01:09:03.860 |
Initially the loss won't improve, because the learning rate is so small 01:09:10.060 |
It doesn't really do anything and at some point the learning rates high enough that the loss will start coming down 01:09:15.020 |
Then at some other point the learning rate is so high that it's gonna start jumping past the answer, and it gets a bit worse 01:09:22.700 |
And so somewhere around here is a learning rate. We'd want to pick 01:09:29.980 |
We've got a couple of different ways of making suggestions I 01:09:34.180 |
Generally ignore them because these suggestions are specifically designed to be conservative 01:09:41.620 |
They're a bit lower than perhaps optimal, in order to make sure we don't recommend something that totally screws up 01:09:47.220 |
But I kind of like to say like well, how far right can I go and still see it like clearly really improving quickly? 01:10:01.740 |
Fine-tune our model with a learning rate of 0.01 01:10:04.140 |
Three epochs and look the whole thing took a minute. That's what we want, right? We want to be able to iterate 01:10:10.220 |
Rapidly just a minute or so. So that's enough time for me to go and you know, grab a glass of water or 01:10:16.340 |
Do some reading — like, I'm not gonna get too distracted 01:10:25.500 |
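A hedged sketch of those training steps (the learning rate and epoch count are the ones quoted; the rest, including the timm model string and fp16, are assumptions):

```python
# Hedged sketch of the quick-iteration training loop described above (requires timm installed).
from fastai.vision.all import vision_learner, error_rate

learn = vision_learner(dls, 'resnet26d', metrics=error_rate, path='.').to_fp16()
learn.lr_find()           # loss vs. learning rate; pick a value where loss is clearly falling
learn.fine_tune(3, 0.01)  # three epochs at lr=0.01 -- about a minute on a decent GPU
```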
Now, we submit as soon as we can. Okay, let's get our submission in — so we've got a model, let's get it in 01:10:32.020 |
So we read in our CSV file of the sample submission and 01:10:37.580 |
So the CSV file basically looks like we're gonna have to have a list of the image 01:10:42.180 |
file names in order and then a column of labels 01:10:46.980 |
So we can get all the image files in the test images folder 01:10:56.580 |
So now what we want is a data loader 01:11:01.300 |
Which is exactly like the data loader we use to train the model 01:11:07.380 |
Except pointing at the test set we want to use exactly the same transformations 01:11:11.700 |
So there's actually a dls.test_dl method which does that; you just pass in the test files 01:11:29.580 |
A test data loader has a key difference to a normal data loader, which is that it does not have any labels 01:11:39.620 |
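Something like:

```python
# Same transforms as training, but with no labels.
tst_dl = learn.dls.test_dl(tst_files)
```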
So we can get the predictions for our learner passing in that data loader and 01:11:48.300 |
In the case of a classification problem, you can also ask for them to be decoded. Decoded means rather than just getting returned the 01:11:58.660 |
probability of every rice disease for every row, it'll tell you what is the index of the most probable 01:12:05.420 |
rice disease. That's what decoded means. So that returns the probabilities, the 01:12:10.420 |
targets, which obviously will be empty because it's a test set, so we throw them away, and those decoded indexes, 01:12:17.260 |
which look like this: numbers from 0 to 9, because there's 10 possible rice diseases 01:12:21.700 |
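A sketch of that prediction step:

```python
# probs: per-class probabilities; targets are empty for a test set so we ignore
# them; idxs: the decoded class index (0-9) for each test image.
probs, _, idxs = learn.get_preds(dl=tst_dl, with_decoded=True)
```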
The Kaggle submission does not expect numbers from 0 to 9; it expects to see the names of the diseases 01:12:30.780 |
So what do those numbers from 0 to 9 represent? 01:12:46.380 |
I realized later this is a slightly inefficient way to do it, but it does the job 01:12:55.580 |
If I enumerate the vocab, that gives me pairs: 0 bacterial leaf blight, 1 bacterial leaf streak, etc 01:13:02.740 |
I can then create a dictionary out of that, and then I can use pandas 01:13:07.700 |
to look up each thing in a dictionary. They call that map 01:13:13.260 |
If you're a pandas user, you've probably seen map used before being passed a function 01:13:18.140 |
Which is really, really slow. But if you pass map a dict, it's actually really, really fast, so do it this way if you can 01:13:34.540 |
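A sketch of the dict-plus-map approach; the class names in the comment are illustrative, the real ones come from the DataLoaders' vocab:

```python
# Build {0: 'bacterial_leaf_blight', 1: 'bacterial_leaf_streak', ...} from the
# vocab, then map the predicted indices to names with pandas' fast dict-based map.
mapping = dict(enumerate(learn.dls.vocab))
results = pd.Series(idxs.numpy(), name='label').map(mapping)
```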
So here's our sample submission file, ss. So if we replace this label column with our predictions and save it out 01:13:50.860 |
The exclamation mark means run a bash command, a shell command, and head shows the first few rows. Let's just take a look; that looks reasonable 01:14:03.180 |
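Roughly:

```python
# Drop the predictions into the sample submission and write it out.
ss['label'] = results
ss.to_csv('subm.csv', index=False)

# In a notebook cell, peek at the result with a shell command:
# !head subm.csv
```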
Iterating rapidly means everything needs to be 01:14:09.300 |
Fast and easy things that are slow and hard don't just take up your time 01:14:14.420 |
But they take up your mental energy. So even submitting to Kaggle needs to be fast. So I put it into a cell 01:14:30.340 |
Give it a description. So just run the cell and it submits to Kaggle and as you can see it says here 01:14:44.780 |
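A sketch of that cell, assuming the official kaggle package is installed and authenticated, and that `comp` holds the competition name (for example, as set up by fastkaggle earlier); the description string is just an example:

```python
from kaggle import api

# Submit the CSV with a short description of what this model was.
api.competition_submit('subm.csv', 'initial resnet26d, 128px, 3 epochs', comp)
```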
Top 80% also known as bottom 20% which is not too surprising right? I mean, it's it's one minute of training time 01:14:53.820 |
But it's something that we can start with, and 01:15:00.740 |
however long it takes to get to this point where you've put in your submission, 01:15:04.260 |
now you've really started, right, because then tomorrow you can try to do a little bit better 01:15:11.260 |
So I'd like to share my notebooks and so even sharing the notebook I've automated 01:15:20.340 |
So part of fastkaggle is you can use this thing called push_notebook, and that sends it off to Kaggle 01:15:43.060 |
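A sketch of that call; the username, notebook id, title, and flags below are placeholders, and the exact keyword names should be checked against the fastkaggle docs:

```python
from fastkaggle import push_notebook

# Push a local notebook to Kaggle as a public notebook attached to the competition.
push_notebook('your-username', 'road-to-the-top-part-1',
              title='First Steps: Road to the Top, Part 1',
              file='road-to-the-top-part-1.ipynb',
              competition=comp, private=False, gpu=True)
```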
Why would you create public notebooks on Kaggle? Well 01:16:06.940 |
But this time rather than finding out in no uncertain terms whether you can predict things accurately 01:16:12.660 |
This time you can find out, in no uncertain terms, whether you can communicate things in a way that people find interesting and useful 01:16:22.380 |
You know, so be it right that's something to know and then you know ideally go and ask some friends like 01:16:29.980 |
What do you think I could do to improve and if they say oh nothing it's fantastic you can tell no that's not true 01:16:36.540 |
I didn't get any votes. I'll try again. This isn't good. How do I make it better? You know 01:16:47.580 |
If you can create models that predict things well, and you can communicate your results in a way that is clear and compelling 01:16:54.220 |
You're a pretty good data scientist, you know, like they're two pretty important things and so here's a great way to 01:17:01.900 |
Test yourself out on those things and improve. Yes, John 01:17:07.220 |
Yes, Jeremy. We have a sort of a I think a timely question here from Zakiya about your iterative approach 01:17:13.700 |
And they're asking do you create different Kaggle notebooks for each model that you try? 01:17:19.940 |
So one Kaggle notebook for the first one, then separate notebooks subsequently, or do you append to the bottom of it? 01:17:27.600 |
What's your strategy? That's a great question. I know Zakiya's been following 01:17:33.740 |
the daily walkthroughs but isn't quite caught up yet, so I will say keep it up, because 01:17:38.460 |
in the six hours of going through this you'll see me create all the notebooks 01:17:51.720 |
You can see them. So basically, yeah, I started with something a 01:18:01.900 |
bit messier, without the prose, but that same basic thing. I then duplicated it 01:18:09.500 |
Which is here and because I duplicated it, you know this stuff which I still need it's still there right? 01:18:21.380 |
And so at first, if I don't rename my duplicate, it will be called 01:18:26.540 |
you know, first steps on the road to the top part one - copy one 01:18:41.680 |
Or if it doesn't seem to go anywhere, I rename it into something like, you know 01:18:47.840 |
experiment blah blah blah, and I'll put some notes at the bottom, and I might put it into a failed folder or something 01:19:01.080 |
So that's the very simple process that I find works really well, which is just duplicating notebooks and editing them and naming them carefully and putting them in order 01:19:08.960 |
And you know put the file name in when you submit as well 01:19:17.440 |
Then of course also if you've got things in git 01:19:19.440 |
You know, you can have a link to the git commit so you'll know exactly what it is 01:19:25.380 |
My notebooks will only have one submission in and then I'll move on and create a new notebook 01:19:30.020 |
So I don't really worry about versioning so much 01:19:32.600 |
But you can do that as well if that helps you 01:19:41.360 |
I've worked with a lot of people who use much more sophisticated and complex processes and tools and stuff, but 01:19:48.080 |
None of them seem to be able to stay as well organized as I am 01:19:53.200 |
I think they kind of get a bit lost in their tools sometimes and 01:20:09.480 |
The specifics of you know finding the best model and all that sort of stuff 01:20:14.120 |
we've got a couple of questions that are in the same space, which is 01:20:17.280 |
You know, we've got some people here talking about AutoML frameworks 01:20:21.000 |
Which you might want to you know touch on for people who haven't heard of those 01:20:24.000 |
If you've got any particular AutoML frameworks you think are worth 01:20:30.080 |
recommending, or just more generally, how do you go about trying different models: random forest, gradient boosting, neural network? 01:20:36.840 |
So in that space, if you could comment. Sure 01:20:40.080 |
I use AutoML less than anybody. I know I would guess 01:21:02.200 |
The reason why is I like being highly intentional, you know 01:21:07.560 |
I like to think more like a scientist and have hypotheses and test them carefully 01:21:14.760 |
And come out with conclusions, which then I implement, you know, so for example 01:21:23.720 |
I didn't try a huge grid search of every possible 01:21:29.280 |
model, every possible learning rate, every possible pre-processing approach, blah blah blah. Right? Instead, step one was to find out 01:21:46.080 |
whether squish versus crop even makes a difference, you know, are some models better with squish and some models better with crop, and 01:21:59.280 |
But for one or two versions of each of the main families that took 20 minutes and the answer was no in every single case 01:22:05.680 |
The same thing was better. So we don't need to do a grid search over that anymore, you know 01:22:12.160 |
Or another classic one is like learning rates. Most people 01:22:15.680 |
Do a kind of grid search over learning rates or they'll train a thousand models, you know with different learning rates 01:22:23.620 |
This fantastic researcher named Leslie Smith invented the learning rate finder a few years ago 01:22:27.660 |
We implemented it. I think within days of it first coming out as a technical report. That's what I've used ever since 01:22:42.160 |
Yeah, I mean then like neural nets versus GBMs versus random forests, I mean that's 01:22:52.740 |
That shouldn't be too much of a question; on the whole, like, they have pretty clear strengths 01:23:05.560 |
If I'm doing computer vision, I'm obviously going to use a computer vision deep learning model 01:23:10.760 |
And which one I would use. Well if I'm transfer learning, which hopefully is always I would look up the two tables here 01:23:18.920 |
Which is which are the best at fine-tuning to very similar things to what they were pre trained on and then the same thing for planet 01:23:26.440 |
Is which ones are best for fine-tuning for two data sets that are very different to what they're trained on 01:23:32.680 |
And as it happens, in both cases they're very similar; in particular, convnext is right up towards the top in both cases 01:23:39.120 |
so I just like to have these rules of thumb and 01:23:46.080 |
A random forest's going to be the fastest, easiest way to get a pretty good result; a GBM's 01:23:50.760 |
Probably gonna give me a slightly better result if I need it and can be bothered fussing around 01:23:59.920 |
For a GBM, I would probably, yeah, actually I probably would run a hyperparameter search 01:24:07.280 |
because it is fiddly and it's fast, so you may as well 01:24:11.000 |
So yeah, so now you know, we were able to make a slightly better submission slightly better model 01:24:29.560 |
I had a couple of thoughts about this. The first thing was, it took about 01:24:35.520 |
a minute on my home computer, and then when I uploaded it to Kaggle it took about four minutes per epoch 01:24:46.120 |
Kaggle's GPUs are not amazing, but they're not that bad 01:24:54.160 |
And what was up is I realized that they only have two 01:24:58.640 |
virtual CPUs, which nowadays is tiny; like, you know, the rule of thumb is you generally want about eight 01:25:12.120 |
So it was spending all of its time just reading the damn data 01:25:14.720 |
Now the data was 640 by 480, and we were ending up with these 128 pixel size bits for speed, so I figured I'd resize the images on disk to make my 01:25:27.760 |
Kaggle iteration faster as well. And so, very simple thing to do: 01:25:34.800 |
fastai has a function called resize_images, and you say okay, take all the train images and stick them in a new folder 01:25:49.000 |
And it will recreate the same folder structure over here. And so that's why I called this the training path 01:26:10.040 |
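A sketch of that resizing step (the 256px maximum and the 'sml' destination folder are example choices):

```python
# Shrink every training image (longest side 256px here) into a new folder,
# preserving the train_images/<label>/ structure.
trn_path = Path('sml')
resize_images(path/'train_images', dest=trn_path, max_size=256, recurse=True)
```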
with no loss of accuracy. So that was kind of step one, which was to actually get my Kaggle iteration fast 01:26:20.720 |
Now still a minute it's a long time and on Kaggle you can actually see this little graph showing how much the CPU is being 01:26:28.200 |
Used how much the GPU is being used on your own home machine 01:26:30.920 |
You can; there are tools, you know, free tools to do the same thing 01:26:34.680 |
I saw that the GPU was still hardly being used. So it's still CPU was being driven pretty hard 01:26:41.000 |
I wanted to use a better model anyway to move up the leaderboard 01:26:50.640 |
Oh, by the way, this graph is very useful. So this is 01:26:55.040 |
This is speed versus error rate by family and so we're about to be looking at these 01:27:10.640 |
So we're going to be looking at this one, convnext tiny 01:27:18.920 |
Here it is, convnext tiny. So we were looking at resnet26d, which took this long on this data set 01:27:25.200 |
But this one here is nearly the best. It's third best, but it's still very fast 01:27:31.900 |
And so it's the best overall score. So let's use this 01:27:36.200 |
Particularly because you know, we're still spending all of our time waiting for the CPU anyway 01:27:40.760 |
So it turned out that when I switched my architecture to convnext 01:27:53.920 |
Let me switch to the Kaggle version because my outputs are missing for some reason 01:28:03.680 |
Yeah, so I started out by running the resnet26d on the resized images and got a 01:28:08.160 |
Similar error rate, but I ran a few more epochs 01:28:14.280 |
so then I do exactly the same thing but with convnext small and get a 01:28:18.560 |
4.5% error rate. So don't think that different architectures are just 01:28:24.400 |
tiny little differences. This is over twice as good 01:28:34.880 |
A lot of folks you talk to will never have heard of this convnext, because it's very new and a lot of people don't 01:28:44.040 |
keep up to date with new things. They kind of learn something at university and then they stop learning 01:28:51.040 |
So if somebody's still just using res nets all the time 01:28:54.320 |
You know, you can tell them we've we've actually we've moved on, you know 01:29:02.880 |
But for the mix of speed and performance, you know, not so much 01:29:10.480 |
Convnext, you know, again, you want these rules of thumb, right? If you're not sure what to do, 01:29:15.920 |
use this convnext. Okay, and then like most things there's different sizes: there's a tiny, there's a small, there's a base 01:29:24.840 |
There's a large there's an extra large and you know, it's just well, let's look at the picture 01:29:43.080 |
Tiny takes less time but higher error, right? So you pick 01:29:48.280 |
whatever's right for the speed versus accuracy trade-off for you. So for us, small is great 01:29:54.860 |
And so yeah, now we've got a 4.5 percent error rate, and that's terrific 01:30:03.680 |
Now let's iterate on Kaggle. This is taking about a minute per epoch; on my computer it was probably taking about 20 seconds per epoch 01:30:14.120 |
So, you know one thing we could try is instead of using squish as 01:30:19.880 |
Our pre-processing let's try using crop. So that will randomly crop out an area 01:30:25.840 |
And that's the default. So if I remove the method equals squish that will crop 01:30:31.480 |
So you see how I've tried to get everything into a single 01:30:34.080 |
function, right, a single function, and if you go and look at its definition, I can tell it: 01:30:39.920 |
What architecture do I want to train? How do I want to transform the items? 01:30:45.320 |
How do I want to transform the batches and how many epochs do I want to do? That's basically it, right? 01:30:50.480 |
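A minimal sketch of that kind of wrapper, assuming `trn_path` points at the resized training images; the specific architecture name and sizes below are examples:

```python
def train(arch, item, batch, epochs=5):
    "Build DataLoaders with the given transforms, then fine-tune `arch`."
    dls = ImageDataLoaders.from_folder(trn_path, seed=42, valid_pct=0.2,
                                       item_tfms=item, batch_tfms=batch)
    learn = vision_learner(dls, arch, metrics=error_rate).to_fp16()
    learn.fine_tune(epochs, 0.01)
    return learn

# e.g. squish each image to 192px, then augment down to 128px batches:
learn = train('convnext_small_in22k', item=Resize(192, method='squish'),
              batch=aug_transforms(size=128, min_scale=0.75))
```

Dropping the method='squish' argument from the Resize gives the crop behaviour tried next.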
So this time I want to use the same architecture, convnext. I want to resize by cropping instead of squishing, and then use the same data augmentation, and see how it goes 01:31:03.000 |
So, not particularly different; it's a tiny bit worse, but not enough to be interesting 01:31:08.280 |
Instead of cropping we can pad now padding is interesting. Do you see how these are all square? 01:31:20.720 |
Padding is interesting because it's the only way of pre-processing images 01:31:24.200 |
which doesn't distort them and doesn't lose anything. If you crop, you lose things; if you squish, you distort things 01:31:31.840 |
This does neither now, of course the downside is that there's pixels that are literally pointless. They contain zeros 01:31:39.720 |
So every way of getting this working has its compromises 01:31:44.560 |
but this approach of resizing where we pad with zeros is 01:31:48.480 |
Not used enough and it can actually often work quite well 01:31:52.600 |
And this case it was about as good as our best so far 01:32:09.600 |
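The padded version is the same call with a different item transform, something like:

```python
# Pad to a square with zero (black) pixels instead of cropping or squishing.
learn = train('convnext_small_in22k',
              item=Resize(192, method=ResizeMethod.Pad, pad_mode=PadMode.Zeros),
              batch=aug_transforms(size=128, min_scale=0.75))
```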
See these pictures this is all the same picture 01:32:16.840 |
But it's gone through our data augmentation. So sometimes it's a bit darker. Sometimes it's flipped horizontally 01:32:24.400 |
Sometimes it's slightly rotated. Sometimes it's slightly warped. Sometimes it's zooming into a slightly different section, but this is all the same picture 01:32:31.240 |
Maybe our model would like some of these versions better than others 01:32:37.200 |
So what we can do is we can pass all of these to our model, get predictions for all of them, and take the average 01:32:48.280 |
Right. So it's our own kind of like little mini bagging approach and this is called test time augmentation 01:32:54.440 |
fastai is very unusual in making that available in a single method: 01:32:59.800 |
you just call tta and it will pass multiple augmented versions of each image and average the predictions 01:33:13.120 |
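A sketch of using it on the validation set to compare against the plain error rate:

```python
# Average predictions over several augmented versions of each validation image,
# then recompute the error rate on those averaged predictions.
probs, targs = learn.tta(dl=learn.dls.valid)
print(error_rate(probs, targs))
```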
So this is the same model as before which had a four point five percent 01:33:26.120 |
Wait, why does this say four point eight last time I did this it was way better. Well, that's messing things up, isn't it? 01:33:39.120 |
So when I did this originally on my home computer it went from like four point five to three point nine, so possibly I 01:33:45.160 |
got very bad luck this time. So this is the first time I've actually ever seen TTA give a worse result 01:34:00.840 |
Maybe I should do something other than the crop padding; all right, I'll have to check that out and I'll try and come back to you 01:34:11.760 |
Anyway take my word for it every other time I've tried it TTA has been better 01:34:17.480 |
So then, you know, now that we've got a pretty good way of doing things: 01:34:25.320 |
we've got TTA, we've got a good training process 01:34:28.480 |
Let's just make bigger images and something that's really interesting and a lot of people don't realize is your images don't have to be square 01:34:39.240 |
Given that nearly all of our images are 640 by 480 we can just pick, you know that aspect ratio 01:34:46.080 |
So for example 256 by 192 and we'll resize everything 01:34:53.900 |
That should work even better still. So if we do that, we'll do 12 epochs 01:34:58.360 |
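A sketch of that rectangular run; the 480x360 and 256x192 sizes follow the 4:3 aspect ratio mentioned above, but treat the exact numbers as examples:

```python
# Keep the native aspect ratio: rectangular item resize, rectangular batches.
learn = train('convnext_small_in22k', epochs=12,
              item=Resize((480, 360), method='squish'),
              batch=aug_transforms(size=(256, 192), min_scale=0.75))
```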
Okay, now our error rate's down to 2.2 percent, and 01:35:08.440 |
with TTA, this time you can see it actually improving, down to under 2 percent 01:35:13.280 |
So that's pretty cool, right? At the start of this notebook our error rate was at 01:35:18.880 |
twelve percent, and by the time we've got through our little experiments we're under two 01:35:34.600 |
And nothing about this is in any way specific to 01:35:38.320 |
rice or this competition, you know; it's like, this is a very standard process that would work for 01:35:53.440 |
certainly any kind of this type of computer vision competition, and almost any computer vision data set 01:36:00.120 |
But you know, it would look very similar for a collaborative filtering model or tabular model or NLP model, whatever 01:36:06.000 |
So, of course, again, I want to submit as soon as I can, so I just copy and paste the exact same steps 01:36:13.640 |
I took last time basically for creating a submission 01:36:16.320 |
So as I said last time we did it using pandas, but there's actually an easier way 01:36:22.640 |
So the step where here I've got the numbers from 0 to 9, 01:36:26.840 |
which is like, which rice disease is it? 01:36:35.000 |
We can take the vocab and make it an array. So that's going to be a list of ten things, and 01:36:39.040 |
Then we can index into that vocab with our indices, which is kind of weird. This is a list of ten things 01:36:47.400 |
This is a list of I don't know four or five thousand things. So this will give me four or five thousand results, which is 01:36:55.160 |
each vocab item for that thing. So this is another way of doing the same mapping, and I would recommend 01:37:03.840 |
playing with this code to understand what it does, because it's the kind of thing that's 01:37:07.480 |
very fast, you know, not just in terms of writing it, but this would also run 01:37:17.920 |
very, very well. This is the kind of coding you want to get used to 01:37:25.120 |
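A sketch of that vectorised version:

```python
import numpy as np

# Index an array of class names directly with the tensor of predicted indices:
# one fast, vectorised lookup instead of a dict and .map.
vocab = np.array(learn.dls.vocab)   # the 10 class names
ss['label'] = vocab[idxs]           # one name per test image
ss.to_csv('subm.csv', index=False)
```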
Anyway, so then we can submit it just like last time and 01:37:34.200 |
That's where you want to be, right? Like, generally speaking, I find in Kaggle competitions the top 25% is a pretty competent 01:37:46.680 |
level. You know, look, that's not to say it's easy, 01:37:53.960 |
but if you get in the top 25%, I think you can really feel like, yeah, this is a 01:38:01.600 |
Reasonable attempt and so that's I think this is a very reasonable attempt 01:38:06.360 |
Okay, before we wrap up John any last questions 01:38:10.560 |
Yeah, there's two I think that would be good if we could touch on quickly before you wrap up 01:38:22.560 |
When I use TTA during my training process, do I need to do something special during inference, or is this something you use only during training? 01:38:34.160 |
TTA means test time augmentation. So specifically it means inference. I think you mean augmentation during training. So yeah, so during training 01:38:42.360 |
You basically always do augmentation, which means you're varying each image slightly 01:38:50.720 |
so the model never sees the same image exactly the same twice, and so it can't memorize it 01:38:55.400 |
In fastai, and as I say I don't think anybody else does this as far as I know, if you call tta 01:39:02.800 |
it will use the exact same augmentation approach, 01:39:10.120 |
but like multiple times on the same image, and we'll average the predictions out 01:39:15.960 |
So you don't have to do anything different. But if you didn't have any data augmentation in training, you can't use TTA 01:39:21.760 |
It uses the same by default the same data augmentation you use for training 01:39:26.120 |
Great. Thank you. And the other one is about how 01:39:30.520 |
You know, when you first started this example you squared the models, the images rather, and you talked about 01:39:36.760 |
squishing versus cropping versus, you know, clipping and 01:39:39.920 |
Scaling and so on but then you went on to say that 01:39:45.120 |
These models can actually take rectangular input, right? 01:39:48.600 |
so there's a question that's kind of probing it at that, you know, if the if the models can take rectangular inputs 01:39:56.520 |
Why would you ever even care as long as they're all the same size? So I 01:40:06.480 |
Datasets tend to have a wide variety of input sizes and aspect ratios 01:40:14.240 |
You know, if there's just as many tall skinny ones as wide short ones, 01:40:21.360 |
you know, it doesn't make sense to pick one rectangle, because some of them you're gonna really destroy 01:40:40.480 |
Something I've played with, which we don't have off-the-shelf library support for yet, and I don't know that anybody else has even published about this, is to 01:40:47.800 |
batch things that are similar aspect ratios together and use the kind of median 01:40:54.640 |
rectangle for those, and I have had some good results with that. But honestly, 01:40:59.240 |
99.99% of people, given a wide variety of aspect ratios, chuck everything into a square 01:41:07.280 |
A follow-up, just, this is my own interest. Have you ever looked at, 01:41:10.600 |
You know, so the issue with with padding as you say is that you're putting black pixels there 01:41:17.120 |
Those are not NaNs, those are black pixels. (That's right, they're zeros.) And so there's something problematic to me, you know, conceptually about that 01:41:35.040 |
Like when old footage is presented for broadcast on 16 to 9, you got the kind of the blurred stretch, that kind of stuff 01:41:39.760 |
No, we played with that a lot. Yeah, I used to be really into it actually, and fastai still by default 01:41:45.920 |
uses reflection padding, which means if this is, I don't know, let's say this is a 20 pixel wide thing, 01:41:51.400 |
it takes the 20 pixels next to it and flips it over and sticks it here and 01:41:55.080 |
It looks pretty good. You know, another one is copy, which simply takes the outside pixel and repeats it; it's a bit more like the TV thing 01:42:06.760 |
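For reference, a sketch of the three padding modes fastai exposes (what's called "copy" above corresponds to border padding, as far as I can tell), applied directly to one image so you can compare them:

```python
img = PILImage.create(tst_files[0])   # any image will do
resized = {mode: Resize(192, method=ResizeMethod.Pad, pad_mode=mode)(img, split_idx=0)
           for mode in (PadMode.Zeros, PadMode.Border, PadMode.Reflection)}
# Zeros: black bars; Border: repeat the edge pixels; Reflection: mirror them.
```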
You know, much to my chagrin, it turns out none of them really helps; you know, if anything they make it worse 01:42:16.980 |
The computer wants to know no, this is the end of the image. There's nothing else here. And if you reflect it, for example 01:42:22.880 |
Then you're kind of creating weird spikes that didn't exist and the computer's got to be like, oh, I wonder what that spike is 01:42:29.640 |
So yeah, it's a great question and I obviously spent like a couple of years 01:42:34.120 |
Assuming that we should be doing things that look more image-like 01:42:38.180 |
But actually the computer likes things to be presented to it in as straightforward a way as possible 01:42:43.820 |
Alright, thanks everybody and I hope to see some of you in the walkthroughs and otherwise see you next time